3 Fault Tolerance Concepts With Examples

3.5 Other Fault Attributes

3.5.1 Observability


Faults originate in a system component or subsystem, in the system's environment, or in an interaction between the system and a user, operator, or another subsystem. A fault may ultimately have one of several effects:

  1. It may disappear with no perceptible effect

  2. It may remain in place with no perceptible effect

  3. It may lead to a sequence of additional faults that result in a failure in the system's delivered service (propagation to failure)

  4. It may lead to a sequence of additional faults with no perceptible effect on the system (undetected propagation)

  5. It may lead to a sequence of additional faults that have a perceptible effect on the system but do not result in a failure in the system's delivered service (detected propagation without failure)

Fault detection is usually the first step in fault tolerance. Even if other elements of a system prevent a failure by compensating for a fault, it is important to detect and remove faults to avoid exhausting the system's fault tolerance resources.

3.5.1.1 Concept Definition


A fault is observable if there is information about its existence available at the system interface. The information that indicates the existence of a fault is a symptom. A symptom may be a directly observed fault or failure, or it may be a change in system behavior such that the system still meets its specifications. A fault that a fault tolerance mechanism of a system has found is said to be detected. Otherwise it is latent, whether it is observable or not. The definition of detected is independent of whether or not the fault tolerance mechanism is able to successfully deal with the fault condition. For a fault to be detected, it is sufficient that it be known about.

3.5.1.2 Bridge Example


Fault detection in a bridge usually relies on the principle that stress in a structural member results in deformation of the member, which can usually be observed by looking for cracks in the surface or changes in the alignment of the bridge. Note that the fault is not observed directly; rather, its effects are observed. Other faults, such as metal fatigue, can only be predicted by knowing the history of the loads imposed on the member.

A flaw in a structural member of the bridge is a latent fault. If a bridge inspector x-rays the member and discovers the flaw, or observes a crack that is a logical consequence of the flaw, it is a detected fault.

3.5.1.3 Computer System Example


To provide failure-free outputs in a computer-based fault-tolerant system, the system must detect faults, a process that requires redundant information (that is, information in addition to the minimum information needed to perform a prescribed function). Redundant information may be combined with a value or it may be stored separately. Such information may include attributes of a value, such as an abstract type; encoded information, such as error-correcting code words; and independently calculated reference values.

Attribute information is used to verify that the value is being used in the correct context. Code word information is used to determine if one or a few of the bits in the value have been changed since the value was created. Independently calculated values may be static (for example, a predefined invariant or limit) or they may be dynamically calculated by a reference process. The reference process may be a redundant copy of the primary process, or it may be a diverse implementation that uses a different approach to produce the value being tested. Either time redundancy (retry) or space redundancy (a concurrently executing process) may be used.

For instance, a flipped bit in a program is a latent fault. If a checksum is taken and it does not match a previously computed value, the fault becomes detected, although in this case it may only be possible to tell that a fault exists, and not exactly where it is.
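The checksum case can be illustrated with a minimal sketch. It assumes CRC-32 (from Python's standard `zlib` module) as the checksum, which the text does not specify; the names `checksum`, `flip_bit`, `fault_detected`, and the sample bytes are hypothetical and chosen only for illustration.

```python
import zlib


def checksum(data: bytes) -> int:
    """Return a CRC-32 of the data; this is the redundant information."""
    return zlib.crc32(data)


def flip_bit(data: bytes, bit_index: int) -> bytes:
    """Return a copy of data with one bit inverted, simulating the fault."""
    corrupted = bytearray(data)
    corrupted[bit_index // 8] ^= 1 << (bit_index % 8)
    return bytes(corrupted)


def fault_detected(data: bytes, reference: int) -> bool:
    """Compare a fresh checksum with the previously computed reference."""
    return checksum(data) != reference


# Hypothetical program image; the reference is taken while it is known good.
program_image = b"\x01\x02\x03\x04 hypothetical program bytes"
reference = checksum(program_image)

# A single flipped bit is a latent fault until the checksum is retaken.
corrupted_image = flip_bit(program_image, 13)
```

Calling `fault_detected(corrupted_image, reference)` reports that a fault exists, but, as noted above, the mismatch alone does not say which bit changed.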

Timing faults may be detected by recognizing the passage of an allotted time interval or by serializing outputs to detect missing outputs. The passage of time may be monitored directly using values from hardware clocks or it may be inferred by noting the completion of one or more processes that complete within a known time interval under normal circumstances.
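Both timing checks can be sketched briefly. The sketch below is an illustration under stated assumptions, not a prescribed mechanism: `run_with_deadline` and `missing_outputs` are hypothetical names, the clock is Python's `time.monotonic` standing in for a hardware clock, and the deadline check only notices an overrun after the task returns rather than interrupting it.

```python
import time


def run_with_deadline(task, allotted_seconds, clock=time.monotonic):
    """Run task and report whether it overran its allotted time interval.

    Returns (result, overran). The clock argument is injectable so that a
    hardware clock, or a fake clock in tests, can be substituted.
    """
    start = clock()
    result = task()
    elapsed = clock() - start
    return result, elapsed > allotted_seconds


def missing_outputs(serial_numbers):
    """Return the serial numbers absent between the lowest and highest seen.

    Assumes each output was tagged with a consecutive integer when produced.
    """
    received = set(serial_numbers)
    if not received:
        return []
    return [n for n in range(min(received), max(received) + 1)
            if n not in received]
```

For example, `missing_outputs([1, 2, 4, 7])` reports that outputs 3, 5, and 6 never arrived. An output missing from the end of the series cannot be found this way; that case needs the deadline check.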

3.5.2 Propagation


3.5.2.1 Concept Definition


A fault that propagates to other faults or failures is said to be active. A non-propagating fault is said to be dormant. When a previously dormant fault becomes active it is said to be triggered. An active fault may again become dormant, awaiting a new trigger. The sequence of faults, each successive one triggered by the preceding one and possibly ending in a failure, is known as a fault trajectory. (Because of the ways faults trigger successive faults, a fault trajectory could be viewed as a chain reaction.)

Figure 3-2 shows the relationship between detected, latent, dormant, and active faults.

3.5.2.2 Bridge Example


Suppose the example bridge was designed to carry 10-ton vehicles, but the highway department erects a "load limit 40 tons" sign on the approach. The sign is a dormant fault. It becomes active when a 38-ton truck triggers it by attempting to drive over the bridge, causing the bridge to fall (a failure) or perhaps a structural member to weaken (another fault). The original fault (the sign) becomes dormant again until another overweight truck drives onto the bridge. The sequence "overweight truck drives over bridge", "structural member weakens" is the fault trajectory.
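The dormant, triggered, and active states in this scenario can be modeled with a small sketch. Everything here is hypothetical and simplified: the class `SignFault`, its fields, and the rule that a driver turns away when the truck exceeds the posted limit are assumptions made for illustration, and only the "structural member weakens" outcome is modeled.

```python
from dataclasses import dataclass, field
from enum import Enum
from typing import List


class Activity(Enum):
    DORMANT = "dormant"
    ACTIVE = "active"


@dataclass
class SignFault:
    """The incorrect load-limit sign, modeled as a fault that can be triggered."""
    design_limit_tons: float
    posted_limit_tons: float
    activity: Activity = Activity.DORMANT
    history: List[Activity] = field(default_factory=list)

    def truck_arrives(self, tons: float) -> List[str]:
        """Return the fault trajectory caused by this truck (empty if none)."""
        trajectory: List[str] = []
        # Assumption: drivers obey the sign, so only trucks at or under the
        # posted limit cross. Those over the design limit trigger the fault.
        if self.design_limit_tons < tons <= self.posted_limit_tons:
            self.activity = Activity.ACTIVE
            self.history.append(self.activity)
            trajectory = ["overweight truck drives over bridge",
                          "structural member weakens"]
            # Once the truck has crossed, the sign awaits a new trigger.
            self.activity = Activity.DORMANT
            self.history.append(self.activity)
        return trajectory


sign = SignFault(design_limit_tons=10, posted_limit_tons=40)
light_truck = sign.truck_arrives(8)    # within the design limit: no trigger
heavy_truck = sign.truck_arrives(38)   # triggers the dormant fault
```

After the 38-ton truck, `heavy_truck` holds the two-step trajectory and `sign.history` records one transition to active followed by a return to dormant.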

3.5.2.3 Computer System Example


As another example, consider a computer program loaded in memory, but with a bad bit in one of its instructions. Until that instruction is executed, the fault is dormant. Once it is executed, it becomes active and perhaps results in a crash (a failure) or a wrong value in a computation (a fault). If the value computed was the altitude of an aircraft, and the resulting faulty information led to the plane flying into a mountain, that would be the next step in the fault trajectory (a failure, in this case, rather than another fault).

A Conceptual Framework for Systems Fault Tolerance - 30 MAR 95