System Failures

This chapter draws on descriptions of 342 product failures found during acceptance testing or operation. The manufacturers voluntarily recalled the products. None of these failures caused loss of life. The paper "Lessons from 342 Failures of Medical Devices" provides details.

For this set of data, the physical system failed, and the manufacturer provided a description relating each failure to one or more software problems. For the analysis, the fault most closely representing the root location was selected for the fault class tables. For example, a description may state that modified software was not tested, while the root cause may have been an incorrect algorithm inserted during the modification; the tables classify such a problem as "calculation." These problems appear in tables organized by classes of software faults. Within each class, there are two sets of tables. The first provides a few generic examples of problems, with related prevention and detection techniques. The second contains a generic description of the problem for each fault within that class, again with prevention and detection techniques. Prevention techniques are applied before or during the process that produced the fault; detection techniques are applied after that process, or sometimes during it.

While the collection is not large enough to generalize to all software problems, examination of the problems indicates that some generic guidance can be derived at a high level. Basic "good" practices might have prevented some of the problems, or caught them before the system was delivered to its user community. No vendor made all of the mistakes; each vendor made only one. The collective message is that traditional good practices such as those in (url for VV234 and for Sp 223) could help to prevent or eliminate some of these problems. Some problems occurred in highly complex systems, for which either more sophisticated or new methods may be needed. The following topics provide a simplified overview of practices to help companies produce better products:

NOTE about the fault classes: For purposes of understanding the problems and identifying best practices, not all the categories are "faults" in the purest sense. For example, fault tolerance is not a fault, but rather a dependability attribute that was omitted from a safety-critical system. In the purest sense, a fault tolerance problem would lie in the manner of implementing the fault tolerance, for instance, an incorrect assertion. Quality assurance is not a fault type either, but a problem within the development of the product. However, given the problem descriptions, the reasons for the failures reduce to the selected categories.
Calculation, Change Impact, Data, Logic,
Initialization, Omission, Requirements, Other,
Timing, Configuration Management, Quality Assurance, Fault Tolerance,
Interface
