3 Fault Tolerance Concepts With Examples
A weakened strut may lead to another strut developing faults, which in turn could put more load on the original strut, causing it to weaken further. This would be a cyclic fault trajectory. If the faults that developed in the second strut did not further trigger the fault in the first strut, it would be an acyclic fault trajectory.
A piece of software with a bad bit set in one of its instructions could cause a bad value to be calculated, which could cause the program to take a different logical path. This different path might cause the original piece of software to be re-executed, which could lead to still other unexpected behavior. This would be a cyclic fault trajectory. If the original fault did not ultimately result in the fault being triggered again, it would be an acyclic fault trajectory.
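The distinction between cyclic and acyclic fault trajectories can be sketched as a check for a directed cycle in a graph whose edges mean "this fault triggers that fault". The following is only an illustration under that assumption: the function name is_cyclic_trajectory and the fault labels are hypothetical and do not come from the framework itself.

```python
from typing import Dict, List


def is_cyclic_trajectory(triggers: Dict[str, List[str]]) -> bool:
    """Return True if the fault-trigger graph contains a directed cycle.

    `triggers` maps a fault to the faults it can trigger (an assumed
    representation chosen for this sketch).
    """
    WHITE, GRAY, BLACK = 0, 1, 2  # unvisited, on current path, finished
    nodes = set(triggers) | {t for targets in triggers.values() for t in targets}
    color = {node: WHITE for node in nodes}

    def visit(node: str) -> bool:
        color[node] = GRAY
        for nxt in triggers.get(node, []):
            if color[nxt] == GRAY:
                # We came back to a fault that is still on the current path.
                return True
            if color[nxt] == WHITE and visit(nxt):
                return True
        color[node] = BLACK
        return False

    return any(color[node] == WHITE and visit(node) for node in sorted(nodes))


# Strut example: the second strut's fault feeds back into the first.
cyclic_struts = {
    "strut_1_weakened": ["strut_2_fault"],
    "strut_2_fault": ["strut_1_weakened"],
}

# Same example without the feedback.
acyclic_struts = {
    "strut_1_weakened": ["strut_2_fault"],
    "strut_2_fault": [],
}

print(is_cyclic_trajectory(cyclic_struts))
print(is_cyclic_trajectory(acyclic_struts))
```

The depth-first search marks faults that are still on the current path; reaching such a fault again means the trajectory has looped back on itself.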
Similarly, as components are aggregated into a system, eventually the system is complete. Everything else (e.g., the user, the environment, etc.) is not a part of the system. This is the system boundary. Failures occur when faults reach the system boundary.
As illustrated in Figure 3-1
3.3.1.1 Concept Definition
A component of a system is said to depend on another component if the correctness of the first component's behavior requires the correct operation of the second component. Traditionally, the set of possible dependencies in a system is considered to form an acyclic graph. The term fault tree analysis seems to imply this, among other things. Indeed, many systems exhibit this behavior, in which one fault leads to another, which leads to another, until eventually a failure occurs. It is possible, however, for a dependency relationship to cycle back upon itself. A dependency relationship is said to be acyclic if it forms part of a tree. A cyclic dependency relationship is one that cannot be described as part of a tree, but rather must be described as part of a directed cyclic graph.
3.3.1.2 Bridge Example
In a bridge, the structural integrity of the roadbed depends, in part, on the structural integrity of the bridge piers. In a suspension bridge, the structural integrity of each of the suspension lines depends on each of the others.
3.3.1.3 Computer System Example
In a computer system, consider two cooperating sequential processes using semaphores to synchronize. If either process fails to release the semaphore when it should, then the other process will fail as well. Thus they are mutually dependent.
3.3.2 Failure Regions
Defining a failure region limits the consideration of faults and failures to a portion of a system and its environment. This is necessary to ensure that system specification, analysis, and design efforts are concentrated on the portions of a system that can be observed and controlled by the designer and user. It helps to simplify an otherwise overwhelming task.
3.3.2.1 Concept Definition
A system is typically made up of many component parts. These components are, in turn, made up of sub-components. This continues arbitrarily until an atomic component (a component that is not divisible or that we choose not to divide into sub-components) is reached. Although all components are theoretically capable of having faults, for any system there is a level beyond which the faults are "not interesting". This level is called the fault floor. Atomic components lie at the fault floor. We are concerned with faults emerging from atomic components, but not with faults that lie within these components.
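The decomposition described above can be sketched as a tree walk that stops at the fault floor. This is a minimal illustration, not part of the framework: the Component class, its atomic flag, and the example part names are all assumptions made for the sketch.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Component:
    name: str
    parts: List["Component"] = field(default_factory=list)
    # True when we choose not to divide this component any further,
    # even if it physically has sub-components.
    atomic: bool = False


def atomic_components(root: Component) -> List[str]:
    """Return the names of the components that lie at the fault floor."""
    if root.atomic or not root.parts:
        return [root.name]
    found: List[str] = []
    for part in root.parts:
        found.extend(atomic_components(part))
    return found


# Hypothetical system: the CPU chip has internal structure, but it is
# treated as atomic, so faults inside it are below the fault floor.
system = Component("computer", [
    Component("cpu_board", [
        Component("cpu_chip", [Component("transistor")], atomic=True),
        Component("cache_chip", atomic=True),
    ]),
    Component("power_supply", atomic=True),
])

print(atomic_components(system))
```

Marking a component as atomic is a modeling decision, which mirrors the text: the floor sits wherever the analyst chooses to stop dividing.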
3.3.2.2 Bridge Example
Bridges are designed with the assumption that the structural members used (beams, braces, fasteners) have known load-bearing, deformation, and fracture characteristics, which are predicted from knowledge of the composition of the materials, the process used to produce the materials, and statistical sampling of the materials. Thus the structural members form the fault floor for most bridges. Faults at the molecular level are generally below the level of consideration. The design process for a typical bridge begins with the specification of a certain grade of steel and employs standard structural shapes. The combination of known materials, known shapes, and standard procedures for summing loads and forces is used to predict the failure modes of the overall structure.
3.3.2.3 Computer System Example
In a computer example, a repair person may not care to localize a "problem" to the component level, but may instead be satisfied to localize it to the circuit board level. The circuit board represents a fault floor for the repair person. This fault floor is often referred to as a Field Replaceable Unit (FRU) or Line Replaceable Unit (LRU). The selection of FRUs and LRUs is an important part of the maintenance strategy for any computer system. The selection is based on considerations such as replacement cost, diagnosis facilities, and skill levels in the field and at repair depots. Notice, however, that when the board is shipped back to the repair depot, the depot staff may indeed care about localizing the "problem" down to the component level. In this case the fault floor has changed.
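The way the fault floor shifts between the field and the repair depot can be sketched by reporting the same fault at two different depths of the component hierarchy. The function localize, the depth constants, and the component names below are hypothetical and serve only to illustrate the idea.

```python
from typing import Sequence


def localize(fault_path: Sequence[str], floor_depth: int) -> str:
    """Report a fault at the observer's fault floor.

    `fault_path` lists the components from the whole system down to the
    faulty part; `floor_depth` is how far down the observer cares to look.
    """
    if not fault_path:
        raise ValueError("fault_path must name at least the system")
    if floor_depth < 0:
        raise ValueError("floor_depth must be non-negative")
    # Never look below the deepest component that is actually known.
    index = min(floor_depth, len(fault_path) - 1)
    return fault_path[index]


# Assumed hierarchy: system -> circuit board -> component.
fault_path = ["computer", "cpu_board", "cache_chip"]

FIELD_FLOOR = 1  # field repair stops at the replaceable board (FRU/LRU)
DEPOT_FLOOR = 2  # the repair depot localizes down to the component

print(localize(fault_path, FIELD_FLOOR))
print(localize(fault_path, DEPOT_FLOOR))
```

The fault itself is unchanged in both calls; only the depth at which the observer stops looking differs, which is what moving the fault floor means here.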
A Conceptual Framework for Systems Fault Tolerance - 30 MAR 95