Risks
Critical safety systems have the potential for disasters
- Nuclear reactor control
- Airline navigation
Underlying causes of software failures are not understood
- Understanding the human approach to writing software may help reduce risks.
Issues to be covered
Human risks from poorly engineered software
Incidents
Airbus Accidents
Space Shuttle
DC Metro
Therac-25
The Airbus A320 Incidents
January 1992, Flight IT5148 crashes outside of Strasbourg
- Of 90 passengers and 6 crew, 87 died
- Plane was descending about three times faster than normal on the approach, at 3,300 feet per minute
- Well below recommended minimum height for approach.
More Incidents
Bangalore, 1990
- Indian Airlines flight falls short of the runway, crashing on final approach.
- Pilot inadvertently put engines in idle.
Mulhouse, 1988
- Plane brushes a patch of trees at the end of a demonstration flight and crashes.
- Pilot maintains that the equipment did not show errors before the crash.
Why?
Computers control all flight operations of the A320
- If you program it to fly into a mountain, it will!
Data input errors by pilots caused the crashes.
Pilots were confused when entering data into the computer.
Pilots often received ambiguous errors from the computer during data input.
Cockpit controls are not like those of a normal plane.
Too many computer screens holding too much data in a hard-to-understand fashion.
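One way to reduce confusing data entry is to make the entry mode explicit and reject out-of-range values with a specific message. The sketch below is a hypothetical illustration only: the mode names `VS` and `FPA`, the limits, and `parse_descent_entry` are assumptions and do not describe the actual A320 software.

```python
from dataclasses import dataclass

# Hypothetical limits chosen for illustration; not real aircraft values.
MAX_DESCENT_FPM = 1500        # feet per minute
MAX_DESCENT_ANGLE_DEG = 6.0   # degrees


@dataclass(frozen=True)
class DescentTarget:
    mode: str     # "VS" (vertical speed) or "FPA" (flight path angle)
    value: float  # feet per minute for "VS", degrees for "FPA"


def parse_descent_entry(mode: str, raw: str) -> DescentTarget:
    """Validate one descent entry; fail with a message that names the unit."""
    try:
        value = float(raw)
    except ValueError:
        raise ValueError(f"'{raw}' is not a number") from None

    if mode == "VS":
        if not 0 < value <= MAX_DESCENT_FPM:
            raise ValueError(
                f"vertical speed {value:g} ft/min is outside "
                f"0-{MAX_DESCENT_FPM} ft/min"
            )
    elif mode == "FPA":
        if not 0 < value <= MAX_DESCENT_ANGLE_DEG:
            raise ValueError(
                f"flight path angle {value:g} degrees is outside "
                f"0-{MAX_DESCENT_ANGLE_DEG:g} degrees"
            )
    else:
        raise ValueError(f"unknown mode '{mode}'; expected 'VS' or 'FPA'")

    return DescentTarget(mode, value)
```

An entry that is valid in one mode but dangerous in the other is rejected with the unit spelled out, so the operator sees which quantity the computer thinks was entered.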
Space Shuttle Columbia
While practicing the Transatlantic Abort Sequence in the Shuttle Mission Simulator,
the computer flight systems locked up as a result of a computer error.
This error had never been detected before!
Implications
Discovery of a logic error; such errors are extremely difficult to detect and prevent.
Impossible to exhaustively test the programs because there are too many possible states during
execution.
Programmers made too many assumptions about the inputs and were not careful with boundary cases and
reinitialization of variables.
Problem: a branch into a code segment that did not exist caused the operating system to loop, trying
to field and service repeated interrupts.
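The two failure modes named above, a branch to code that does not exist and state that is not reinitialized, can be guarded against in a small way. This is a minimal sketch, not the Shuttle software: the segment names, `HANDLERS`, and `run_sequence` are hypothetical.

```python
def new_state() -> dict:
    """Return a fresh state so no values leak from a previous run."""
    return {"steps": 0, "log": []}


def handle_ascent(state: dict) -> None:
    state["log"].append("ascent")


def handle_abort(state: dict) -> None:
    state["log"].append("abort")


# Every legal branch target is listed explicitly in one table.
HANDLERS = {"ascent": handle_ascent, "abort": handle_abort}


def dispatch(segment: str, state: dict) -> None:
    handler = HANDLERS.get(segment)
    if handler is None:
        # Fail loudly instead of jumping to code that does not exist.
        raise LookupError(f"no handler for segment '{segment}'")
    handler(state)


def run_sequence(segments: list) -> dict:
    state = new_state()  # reinitialize on every run
    for segment in segments:
        dispatch(segment, state)
        state["steps"] += 1
    return state
```

A guard like this does not find the logic error, but it turns a silent lock-up into an immediate, named failure.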
Conclusions
Necessary to identify particular logic errors in order to prevent future occurrences.
There are no hard and fast solutions to this problem; thorough analysis must be used to detect
these errors.
Proof that NASA, considered among the best in the business, can fall victim to the subtleties of
software design.
Washington, D.C. Metrorail
September 1999, the central computer system monitoring every train on the Washington, D.C. Metrorail
system failed.
Employees were forced to actively monitor the 96 miles of track by radio.
Implications
Graphics-generating device froze, preventing accurate tracking of the trains on the system.
Relays on the track have a life expectancy of 70 years and an expected malfunction rate of one every 50
years!
During a 15-month period in which the system was in use, it crashed 50 times.
Since April 1999, the system has been run manually in order to prevent unnecessary slowdowns.
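The crash count above can be restated as a rate. A quick check, assuming an average month of 365.25 / 12 days (the slides give only the totals):

```python
CRASHES = 50
PERIOD_MONTHS = 15
DAYS_PER_MONTH = 365.25 / 12  # assumption: average month length

crashes_per_month = CRASHES / PERIOD_MONTHS
days_between_crashes = PERIOD_MONTHS * DAYS_PER_MONTH / CRASHES

print(f"{crashes_per_month:.1f} crashes per month")              # 3.3
print(f"about one crash every {days_between_crashes:.0f} days")  # 9
```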
Conclusions
Passengers are in danger when traveling on a system where the computer frequently fails.
Shortcuts in the software engineering process lead only to problems for the user down the road.
A more efficient design for the monitor and the back-up should be used in order to prevent
inconveniences.
Therac-25
Modes of Operation
Implications
Design flaws
- Data entry speed produced errors
- Not fully tested after hardware integration
- Not enough error detection and error handling
- Confusing error messages
- Users desensitized to error messages
Patients overexposed
Problem not recognized promptly enough
Software used as a safety device, instead of hardware
Cases
Kennestone Regional Oncology Center, June 1985
- No investigations
Ontario Cancer Foundation, July 1985
- H-TILT error message
Yakima Valley Memorial Hospital, December 1985
- Doctors could not confirm conclusion
East Texas Cancer Center, March 1986
- Malfunction 54 and Fritz Hager
East Texas Cancer Center, April 1986
- Malfunction 54
Yakima Valley Memorial Hospital, January 1987
- FLATNESS error message
Conclusions
Clear documentation is very important!
Software can never be ignored as a possible cause of the problem
Should not sacrifice safety for a friendly user interface
Critical software systems must be programmed defensively
Protection against software errors can and should be built into both the system and the software
itself
Inadequate investigation and follow-up on accident reports
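As a rough illustration of defensive programming, a check run immediately before treatment can refuse to proceed unless the measured machine state agrees with the requested mode. Everything here is a hypothetical sketch: the mode names, turntable positions, dose limit, and `check_before_beam_on` are assumptions, not the Therac-25 design, and such a check supplements hardware interlocks rather than replacing them.

```python
class InterlockError(Exception):
    """Raised when the machine state is unsafe for treatment."""


# Hypothetical mapping from treatment mode to required turntable position.
REQUIRED_POSITION = {"electron": "scan_magnets", "xray": "target"}
MAX_DOSE_RADS = 200  # illustrative limit only, not a clinical value


def check_before_beam_on(mode: str, sensed_position: str, dose_rads: float) -> None:
    """Raise InterlockError unless every precondition holds.

    sensed_position should come from a position sensor, not from the
    value the software believes it last commanded.
    """
    if mode not in REQUIRED_POSITION:
        raise InterlockError(f"unknown mode '{mode}'")
    expected = REQUIRED_POSITION[mode]
    if sensed_position != expected:
        raise InterlockError(
            f"mode '{mode}' requires turntable at '{expected}', "
            f"but sensor reports '{sensed_position}'"
        )
    if not 0 < dose_rads <= MAX_DOSE_RADS:
        raise InterlockError(
            f"dose {dose_rads:g} rads is outside 0-{MAX_DOSE_RADS} rads"
        )
```

The error messages name the exact mismatch, in contrast to a bare code such as "Malfunction 54" that gives the operator nothing to act on.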
South Park vs. the Therac-25
Lessons To Be Learned
Build software to be safe. Trying to be correct is not enough.
Most accidents occur because requirements are wrong, not due to coding errors.
No general solution to prevent software errors
Even the best in the business can fall prey to the subtleties of software design
A coding error is not as important as the generally unsafe design of the software overall.
CAUSAL FACTORS
Overconfidence in Software
Lack of Defensive Design
Failure to Eliminate Root Causes
Unrealistic Risk Assessments
Inadequate Investigation or Follow-up on Accident Reports
Inadequate Software Engineering Practices
Safe vs. Friendly User Interfaces
General Solutions
Keep things simple
Trial and Error
Reduce confidence in software
Solve user interface problems by understanding human psychology and behavior
Credits
Presentation and the Therac issue, Ed G.
Space Shuttle & Slides, Andrew S.
Therac & South Park, Igor G.
General Intro, Adam G.
Airbus Incidents, Jeff S.
DC Metro System & Slides, Greg R.
Questions