Risks

     Safety-critical systems have the potential for disaster

            - Nuclear reactor control

            - Airline navigation

     Underlying causes of software failures not understood

            - Understanding how humans approach writing software may help reduce risks.

 

Issues to be covered

     Human risks from poorly engineered software

     Incidents

   Airbus Accidents

   Space Shuttle

   DC Metro

   Therac-25

 

The Airbus A320 Incidents

      January 1992, Flight IT5148 crashes outside of Strasbourg

     - Of 90 passengers and 6 crew, 87 died

     - Plane was descending three times faster than normal on the approach, at 2300 feet per minute

     - Well below recommended minimum height for approach.

 

More Incidents

      Bangalore, 1990

     - Indian Airlines flight falls short of the runway, crashing on final approach.

     - Pilot inadvertently put engines in idle.

      Mulhouse, 1988

     - Plane brushes a patch of trees at the end of a demonstration flight and crashes.

     - Pilot maintains that the equipment did not show errors before the crash.

 

Why?

      Computers control all flight operations of the A320

     - If you program it to fly into a mountain, it will!

      Data input errors by pilots caused the crashes.

      Pilots were confused when entering data into the computer.

      Pilots often received ambiguous error messages from the computer during data input.
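
One way to reduce ambiguous data entry is to require an explicit unit with every value and to reject anything out of range with a message that says what was wrong. The sketch below is illustrative only: the function name, the units ('deg', 'fpm'), and the numeric limits are assumptions made for this example and are not taken from the A320 or any real flight computer.

```python
from dataclasses import dataclass

# Hypothetical limits chosen for illustration only; not from any real aircraft.
MAX_ANGLE_DEG = 6.0
MAX_RATE_FPM = 1500.0


@dataclass(frozen=True)
class DescentTarget:
    mode: str      # "angle" (degrees) or "rate" (feet per minute)
    value: float


def parse_descent_entry(text: str) -> DescentTarget:
    """Parse entries such as '3.3 deg' or '800 fpm'; reject anything ambiguous."""
    parts = text.strip().lower().split()
    if len(parts) != 2:
        # A bare number is refused because its unit cannot be known.
        raise ValueError(
            f"Entry {text!r} must be '<number> <unit>', where unit is 'deg' or 'fpm'."
        )
    number, unit = parts
    try:
        value = float(number)
    except ValueError:
        raise ValueError(f"{number!r} is not a number.") from None
    if unit == "deg":
        mode, limit = "angle", MAX_ANGLE_DEG
    elif unit == "fpm":
        mode, limit = "rate", MAX_RATE_FPM
    else:
        raise ValueError(f"Unknown unit {unit!r}; expected 'deg' or 'fpm'.")
    if not 0 < value <= limit:
        # The message names the value, the unit, and the accepted range.
        raise ValueError(
            f"{value} {unit} is outside the accepted range (0, {limit}] for {mode} mode."
        )
    return DescentTarget(mode, value)
```

With these assumed limits, parse_descent_entry("3.3 deg") is accepted, while a bare "33" or an out-of-range "2300 fpm" is refused with a message that explains why.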

 

 

Cockpit controls are not like a normal plane. 

Too many computer screens holding too much data in a hard-to-understand fashion.

 

 

Space Shuttle Columbia

While practicing the “Transatlantic Abort Sequence” on the Shuttle Mission Simulator, the computer flight systems locked up as a result of a computer error.

 

This error had never been detected before!

 

Implications

      Discovery of a “logic error,” a kind of error that is extremely difficult to detect and prevent.

      Impossible to exhaustively test the programs because there are too many possible states during execution.

      Programmers made too many assumptions about the inputs and were not careful with boundary cases and reinitialization of variables.

      Problem: a branch into a code segment that did not exist caused the operating system to loop, trying to field and service repeated interrupts.
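
The two habits named above, checking boundary cases before branching and reinitializing variables between runs, can be sketched as follows. This is a minimal illustration, not the Shuttle software: the handler names, the numeric codes, and the AbortSession class are all hypothetical.

```python
from typing import Callable, Dict, List


def nominal_sequence() -> str:
    return "NOMINAL"


def abort_sequence() -> str:
    return "ABORT"


def safe_fallback() -> str:
    # Known-good handler used whenever a request cannot be dispatched.
    return "HOLD"


# Hypothetical dispatch table: every valid code maps to code that exists.
HANDLERS: Dict[int, Callable[[], str]] = {
    1: nominal_sequence,
    2: abort_sequence,
}


def dispatch(code: int) -> str:
    """Run the handler for `code`, falling back safely on unknown codes."""
    handler = HANDLERS.get(code)
    if handler is None:
        # Boundary case: never branch to a target that is not defined.
        return safe_fallback()
    return handler()


class AbortSession:
    """Holds per-run state and resets it explicitly before each run."""

    def __init__(self) -> None:
        self.steps: List[str] = []

    def reset(self) -> None:
        self.steps = []

    def run(self, codes: List[int]) -> List[str]:
        self.reset()  # reinitialize so a previous run cannot leak state
        for code in codes:
            self.steps.append(dispatch(code))
        return self.steps
```

An undefined code such as 99 is routed to the fallback instead of to a missing segment, and a second call to run() starts from a clean list.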

 

Conclusions

      Necessary to identify particular logic errors in order to prevent future occurrences.

      There are no hard and fast solutions to this problem; thorough analysis must be used to detect these errors.

      Proof that NASA, considered among the best in the business, can fall victim to the subtleties of software design.

 

 

Washington, D.C. Metrorail

In September 1999, the central computer system that monitors every train on the Washington, D.C. Metrorail system failed.

 

Employees were forced to actively monitor the 96 miles of track by radio.

 

 

Implications

      Graphics-generating device froze, preventing accurate tracking of the trains on the system.

      Relays on the track have a life expectancy of 70 years and an expected malfunction rate of one every 50 years!

      During the 15-month period in which the system was in use, it crashed 50 times.

      Since April 1999, the system has been run manually in order to prevent unnecessary slowdowns.
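
The figures above can be put on a common scale with a little arithmetic. The sketch below uses only the numbers on this slide; the 30-day month is an approximation, and the relay figure is a per-relay hardware rate while the crash figure is for the whole monitoring system, so the comparison is rough rather than like-for-like.

```python
MONTHS_PER_YEAR = 12
DAYS_PER_MONTH = 30  # approximation, used only to express the interval in days


def mean_days_between_failures(failures: int, months: float) -> float:
    """Average interval between failures over the observation period."""
    return months * DAYS_PER_MONTH / failures


def failures_per_year(failures: int, months: float) -> float:
    """Observed failures scaled to a 12-month year."""
    return failures * MONTHS_PER_YEAR / months


# Monitoring system: 50 crashes in 15 months.
software_days = mean_days_between_failures(50, 15)
software_per_year = failures_per_year(50, 15)

# Track relay: one expected malfunction every 50 years.
relay_per_year = 1 / 50

print(f"System crash roughly every {software_days:.0f} days")
print(f"System crashes per year: {software_per_year:.0f}")
print(f"Relay malfunctions per year: {relay_per_year:.2f}")
```

On these numbers the monitoring system crashed about once every 9 days, or 40 times a year, against an expected 0.02 malfunctions per year for a single relay.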

 

Conclusions

      Passengers are in danger when traveling on a system whose computer frequently fails.

      Shortcuts in the software engineering process lead only to problems for the user down the road.

      A more efficient design for the monitor and the back-up should be used in order to prevent inconveniences.

 

Therac-25

Modes of Operation

Implications

      Design flaws

   Data entry speed produced errors

   Not fully tested after hardware integration

   Not enough error-detection and error-handling

   Confusing error messages

   Users desensitized to error messages

      Patients overexposed

      Problem not recognized promptly enough

      Software used as a safety device, instead of hardware
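
Several of the flaws listed above (too little error detection, confusing messages, software as the only safety device) suggest a defensive check before the beam is enabled. The sketch below is an assumption-laden illustration, not the Therac-25 code: MachineState, its fields, and check_before_beam are hypothetical names, and the "sensed" mode stands in for a reading from sensors that are independent of the data-entry software. A check like this would complement a hardware interlock, not replace one.

```python
from dataclasses import dataclass
from enum import Enum


class Mode(Enum):
    ELECTRON = "electron"
    XRAY = "xray"


@dataclass
class MachineState:
    requested_mode: Mode   # what the operator entered
    sensed_mode: Mode      # what independent sensors report (hypothetical)
    setup_complete: bool   # True only after the hardware has finished moving


class InterlockError(RuntimeError):
    """Raised when the beam must not be enabled."""


def check_before_beam(state: MachineState) -> None:
    """Refuse to enable the beam unless the hardware matches the request."""
    if not state.setup_complete:
        # Guards against edits made faster than the hardware can follow.
        raise InterlockError(
            "Beam blocked: hardware is still moving to the requested setup. "
            "Wait for setup to finish, then retry."
        )
    if state.sensed_mode is not state.requested_mode:
        # Say what is wrong in plain words instead of a numeric code.
        raise InterlockError(
            f"Beam blocked: operator requested {state.requested_mode.value} mode "
            f"but sensors report {state.sensed_mode.value} mode. Do not override."
        )
```

The check returns silently only when setup is complete and the sensed mode equals the requested mode; every refusal explains the cause and the expected operator action.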

 

Cases

      Kennestone Regional Oncology Center, June 1985

    No investigations

      Ontario Cancer Foundation, July 1985

    H-TILT error message

      Yakima Valley Memorial Hospital, December 1985

    Doctors could not confirm conclusion

      East Texas Cancer Center, March 1986

    Malfunction 54 and Fritz Hager

      East Texas Cancer Center, April 1986

    Malfunction 54

      Yakima Valley Memorial Hospital, January 1987

    FLATNESS error message

 

Conclusions

      Clear documentation is very important!

      Software can never be ruled out as a possible cause of the problem

      Should not sacrifice safety for a friendly user interface

      Critical software systems must be programmed defensively

      Protection against software errors can and should be built into both the system and the software itself

      Inadequate investigation and follow-up on accident reports

 

South Park vs. the Therac-25

  

Lessons To Be Learned

      “Build software to be safe.  Trying to be correct is not enough.”

      Most accidents occur because requirements are wrong, not because of coding errors.

      No general solution to prevent software errors

      Even the best in the business can fall prey to the subtleties of software design

      A coding error is not as important as the generally unsafe design of the software overall.

 

CAUSAL FACTORS

     Overconfidence in Software

     Lack of Defensive Design

     Failure to Eliminate Root Causes

     Unrealistic Risk Assessments

     Inadequate Investigation or Follow-up on Accident Reports

     Inadequate Software Engineering Practices

     Safe vs. Friendly User Interfaces

 

General Solutions

     Keep things simple

     Trial and Error

     Reduce confidence in software

     Solve user interface problems by understanding human psychology and behavior

 

Credits

      Presentation and the Therac issue, Ed G.

      Space Shuttle & Slides, Andrew S.

      Therac & South Park, Igor G.

      General Intro, Adam G.

      Airbus Incidents, Jeff S.

      DC Metro System & Slides, Greg R.

 

Questions