3 Fault Tolerance Concepts With Examples
Faults may be classified based on Locality (atomic component, composite component, system, operator, environment), on Effect (timing, data), or on Cause (design, damage). Other possible classification criteria include Duration (transient, persistent) and Effect on System State (crash, amnesia, partial amnesia, etc.).
Since the location of a fault is so important, fault location is a logical starting point for classifying faults.
Since time increases monotonically, it is possible to further classify timing faults into early, late, or "never" (omission) faults. Since it is practically impossible to determine whether "never" has occurred, omission faults are really late timing faults that exceed an arbitrary limit. Systems that never produce value faults but fail only by omission are called fail-silent systems. If all failures require system restart, the system is a fail-stop system.
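The early, late, and omission classification can be sketched in a few lines. This is only an illustration under stated assumptions: the names `Timing` and `classify_timing`, the parameters `earliest`, `deadline`, and `omission_limit`, and the use of seconds on a single monotonic clock are all hypothetical and are not taken from the framework itself. The sketch also assumes `earliest <= deadline <= omission_limit`.

```python
from enum import Enum
from typing import Optional


class Timing(Enum):
    ON_TIME = "on time"
    EARLY = "early"
    LATE = "late"
    OMISSION = "omission"


def classify_timing(delivered_at: Optional[float],
                    earliest: float,
                    deadline: float,
                    omission_limit: float,
                    now: float) -> Optional[Timing]:
    """Classify one delivery against its timing window (hypothetical helper).

    All times are seconds on the same monotonic clock. Returns None while
    nothing has been delivered and the deadline has not yet passed.
    """
    if delivered_at is None:
        # "Never" cannot be observed directly, so an omission is declared
        # only once the arbitrary limit has been exceeded.
        if now > omission_limit:
            return Timing.OMISSION
        if now > deadline:
            return Timing.LATE
        return None
    if delivered_at < earliest:
        return Timing.EARLY
    if delivered_at > omission_limit:
        # A delivery this late is treated the same as no delivery at all.
        return Timing.OMISSION
    if delivered_at > deadline:
        return Timing.LATE
    return Timing.ON_TIME
```

Note that the classifier inspects only timestamps and never the delivered data, and that the boundary between "late" and "omission" is purely a choice of `omission_limit`.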
Resource depletion faults occur when a portion of the system is unable to obtain the resources required to perform its task. Resources may include time on a processing or communications device, storage, power, logical structures such as a data structure, or a physical item such as a processor.
Logic faults occur when adequate resources are available, but the system does not behave according to specification. Logic faults may be the result of improper design or implementation, as discussed in the next section. Logic faults may occur in hardware or software.
Physical faults occur when hardware breaks or a mutation occurs in executable software. Most common fault tolerance mechanisms deal with hardware faults.
A common ultimate cause of a fault is an improper requirements specification, which leads to a specification fault. Technically this is not a fault, since a fault is defined to be the failure of a component or interacting system, and a failure is the deviation of the system from specification. However, it can be the reason a system deviates from the behavior expected by the user. An especially insidious instance of this arises when the requirements ignore aspects of the environment in which the system operates. For instance, radiation causing a bit to flip in a memory location would be a value fault which would be considered an external fault (Section 3.4.1.4). However, if the fault propagates inside the system boundary, the ultimate cause is a specification fault because the system specification did not foresee the problem.
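To make the bit-flip example concrete, a specification that did foresee the problem might require some detection mechanism at the memory boundary. The following is a minimal sketch, assuming a single even-parity bit stored alongside each non-negative data word. Parity is chosen here only for illustration; the text does not prescribe it, and the function names are hypothetical.

```python
from typing import Tuple


def even_parity(word: int) -> int:
    """Return the bit that makes the total number of 1 bits even.

    Assumes a non-negative integer word.
    """
    return bin(word).count("1") % 2


def store(word: int) -> Tuple[int, int]:
    """Model a memory cell as a (data word, parity bit) pair."""
    return word, even_parity(word)


def is_corrupted(cell: Tuple[int, int]) -> bool:
    """Return True if the stored parity no longer matches the data word."""
    word, parity = cell
    return even_parity(word) != parity


# Usage: flip one bit of the stored word, as radiation might.
cell = store(0b10110010)
flipped = (cell[0] ^ (1 << 3), cell[1])
print(is_corrupted(cell), is_corrupted(flipped))
```

A single parity bit detects any odd number of flipped bits but cannot correct them, and an even number of flips goes unnoticed.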
Flowing down the waterfall, a design fault results when the system design does not correctly match the requirements, and an implementation fault arises when the system implementation does not adequately implement the design. The validation process is specifically designed to detect these faults. Finally, a documentation fault occurs when the documented system does not match the real system.
3.4.1 Locality
3.4.1.1 Atomic Component Faults
Concept Definition
An atomic component fault is a fault at the fault floor, that is, in a component that cannot be subdivided for analysis purposes.
Bridge Example
A fault in an individual structural member in a bridge may be considered an atomic component fault. If the bridge design properly distributes the load among the various structural members (resources) of the bridge, then the load is transferred to other structural members, no failure occurs, and the fault is masked. The fault may be detected by observation of cracks or deformation, or it may remain latent.
Computer System Example
In a computer system, atomic component faults can appear in diverse forms. For instance, a fault in a memory bit is not an atomic component fault if the details of the memory are below the current span of concern. Such a fault may or may not appear as a memory fault, depending upon the memory's ability to mask bit faults.
3.4.1.2 Composite Component Faults
Concept Definition
A composite component fault is one that arises within an aggregation of atomic components rather than in an atomic component. It may be the result of one or more atomic component faults.
Bridge Example
A pier failure would be an example of a composite component failure for a bridge.
Computer System Example
A disk drive failure in a computer system is an example of a composite component failure. If the individual bits of memory are considered to be in the span of concern, a failure of one of those would be a component failure as well.
3.4.1.3 System Level Faults
Concept Definition
A system level fault is one that arises in the structure of a system rather than in the system's components. Such faults are usually interaction or integration faults, that is, they occur because of the way the system is assembled rather than because of the integrity of any individual component. Note that an inconsistency in the operating rules for a system may lead to a system level fault. System level faults also include operator faults, in which an operator does not correctly perform his or her role in system operation.
Systems that distribute objects or information are prone to a special kind of system fault: replication faults. Replication faults occur when replicated information in a system becomes inconsistent, either because replicates that are supposed to provide identical results no longer do so, or because the aggregate of the data from the various replicates is no longer consistent with system specifications. Replication faults can be caused by malicious faults, in which components such as processors "lie" by providing conflicting versions of the same information to other components in the system. Malicious faults are sometimes called Byzantine faults after an early formulation of the problem in terms of Byzantine generals trying to reach a consensus on attacking when one of the generals is a traitor [Lamport 82].
Bridge Example
A bridge failure resulting from insufficient allowance for thermal expansion in the overall structure could be considered a system failure: individual structural members behave as specified, but faulty assembly causes failures when they interact. Operator faults have been discussed in the example in Section 3.2.1.
Computer System Example
Consider the computer systems in an automobile. Suppose the airbag deployment computer and the anti-lock brake computer are both known to work properly and yet fail in operation because one computer interferes with the other when they are both present. This would be a system fault.
3.4.1.4 External Faults
External faults arise from outside the system boundary, the environment, or the user. Environmental faults include phenomena that directly affect the operation of the system, such as temperature, vibration, or nuclear or electromagnetic radiation, or that affect the inputs provided to the system. User faults are created by the user in employing the system. Note that the roles of user and operator are considered separately; the user is considered to be external to the system while the operator is considered to be a part of the system.
3.4.2 Effects
Faults may also be classified according to their effect on the user of the system or service. Since computer system components interact by exchanging data values in a specified time and/or sequence, fault effects can be cleanly separated into timing faults and value faults. Timing faults occur when a value is delivered before or after the specified time. Value faults occur when the data differs in value from the specification.
3.4.2.1 Value Faults
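As an illustration of the definition just given, a value can be checked against the allowable values its specification permits. The sketch below assumes the specification is a simple closed numeric range; the class name `ValueSpec`, its fields, and the temperature figures in the usage lines are hypothetical, and real specifications may be far richer than a range.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ValueSpec:
    """Hypothetical specification of the allowable values for one datum."""
    low: float
    high: float

    def is_value_fault(self, value: float) -> bool:
        # NaN fails every comparison, so it is reported as a fault as well.
        return not (self.low <= value <= self.high)


# Usage: the bounds could equally be computed at run time.
coolant_temp_c = ValueSpec(low=-40.0, high=150.0)
print(coolant_temp_c.is_value_fault(20.0))
print(coolant_temp_c.is_value_fault(200.0))
```

The check needs knowledge of the data but none of its timing, which is the mirror image of how a timing fault is detected.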
Computer systems communicate by providing values. A value fault occurs when a computation returns a result that does not meet the system's specification. Value faults are usually detected using knowledge of the allowable values of the data, possibly determined at run time.
3.4.2.2 Timing Faults
A timing fault occurs when a process or service is not delivered or completed within the specified time interval. Timing faults cannot occur if there is no explicit or implicit specification of a deadline. Timing faults can be detected by observing the time at which a required interaction takes place; no knowledge of the data involved is usually needed.
3.4.3 Duration
Persistent faults remain active for a significant period of time. These faults are sometimes termed hard faults. Persistent faults usually are the easiest to detect and diagnose, but may be difficult to contain and mask unless redundant hardware is available. Persistent faults can be effectively detected by test routines that are interleaved with normal processing.
Transient faults remain active for a short period of time. A transient fault that becomes active periodically is a periodic fault (sometimes referred to as an intermittent fault). Because of their short duration, transient faults are often detected through the faults that result from their propagation.
3.4.4 Immediate Cause
Faults can be classified according to the operational condition that causes them. These include resource depletion, logic faults, or physical faults.
3.4.5 Ultimate Cause
Faults can also be classified as to their ultimate cause. Ultimate causes are the things that must be fixed to eliminate a fault. These faults occur during the development process and are most effectively dealt with using fault avoidance and fault removal techniques.
A Conceptual Framework for Systems Fault Tolerance - 30 MAR 95