4 Fault Tolerance Mechanisms
4.4.2 Fault Diagnosis
Fault diagnosis with comparison depends on whether pair-wise or voting comparison is used.
4.4.2.1 Voting Issues
Voting may be centralized or decentralized. Centralized voting is easy to mechanize, in either software or hardware, but it creates a single point of failure, which violates many qualitative requirements specifications. Total voter failure can be compensated for with a master-slave approach that replaces a silent voter with a standby voter, as in the pair-and-spare approach.
Decentralized voting avoids the single point of failure, but it requires a consensus among multiple voting agents, either hardware or software, in order to avoid the replication faults mentioned in Section 3.4.1.3. To reach consensus, the distributed voters must synchronize and exchange several rounds of messages. In the worst case, where up to f faulty processors may send misleading results to the other processors participating in the consensus process, at least 3f+1 distributed voters must be provided to reach a state known as interactive consistency [Pease 80]. Interactive consistency requires that each non-faulty processor provides a value, that all non-faulty processors agree on the same set of values, and that the agreed values are correct for each of the non-faulty processors. Similar processes are required to maintain a consensus as to the number of members remaining in a group of distributed processors [Cristian 88].
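As a worked illustration of the 3f+1 bound, the following minimal Python sketch computes how many distributed voters are needed for a given number of faulty processors, and how many faults a given group can tolerate. The function names min_voters and max_faults are hypothetical and were introduced only for this example.

```python
def min_voters(f: int) -> int:
    """Smallest number of distributed voters that tolerates f faulty ones.

    Applies the 3f+1 bound for interactive consistency [Pease 80].
    """
    if f < 0:
        raise ValueError("f must be non-negative")
    return 3 * f + 1


def max_faults(n: int) -> int:
    """Largest f such that n voters still satisfy n >= 3f + 1."""
    if n < 1:
        raise ValueError("n must be at least 1")
    return (n - 1) // 3


if __name__ == "__main__":
    for f in range(4):
        print(f"f = {f}: at least {min_voters(f)} voters")
    for n in (3, 4, 6, 7):
        print(f"n = {n}: tolerates up to {max_faults(n)} faulty voters")
```

Under this bound, three voters cannot tolerate even one processor that sends misleading results, and growing the group from four to six voters adds no tolerance; a seventh voter is needed to tolerate a second fault.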
When voting is used, containment is achieved by ignoring the failed processor and reconfiguring it out of the system.
Pair-wise comparison requires the existence of multiple pairs of processors to mask faults. In this case the faulty pair of processors is halted, and values are obtained from the functional, good pairs.
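The pair-wise scheme can be sketched roughly as follows. This is a minimal illustration, not a description of any particular system: the Pair class, its primary and shadow fields, and the masked_output function are hypothetical names, and each processor is modeled as a plain Python function of one integer.

```python
from dataclasses import dataclass
from typing import Callable, List, Optional


@dataclass
class Pair:
    """Two processors that run the same computation and compare results."""
    name: str
    primary: Callable[[int], int]
    shadow: Callable[[int], int]
    halted: bool = False

    def step(self, x: int) -> Optional[int]:
        """Run both processors; halt the pair and return None on a miscompare."""
        if self.halted:
            return None
        a, b = self.primary(x), self.shadow(x)
        if a != b:
            self.halted = True  # containment: the pair stops all activity
            return None
        return a


def masked_output(pairs: List[Pair], x: int) -> Optional[int]:
    """Step every pair, then return the value of the first pair that compared."""
    results = [pair.step(x) for pair in pairs]
    for value in results:
        if value is not None:
            return value
    return None  # no functional pair is left, so the fault cannot be masked


if __name__ == "__main__":
    good = lambda x: 2 * x
    faulty = lambda x: 2 * x + 1  # assumed fault model: a wrong result
    pairs = [Pair("A", good, faulty), Pair("B", good, good)]
    print(masked_output(pairs, 3), [p.halted for p in pairs])
```

With a single pair, the same code returns None after a miscompare, which reflects why multiple pairs are required for masking.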
When voting is used, recovery from a failed processor is accomplished by utilizing the "good" values from the other processors. A processor that is outvoted may be allowed to continue execution and may be configured back into the system if it successfully matches in a specified number of subsequent votes.
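One way to picture this reintegration policy is the sketch below. It assumes a majority voter and a fixed threshold of consecutive matching votes; the class name ReintegratingVoter and the parameter required_matches are hypothetical, and the text does not prescribe a specific threshold.

```python
from collections import Counter
from typing import Dict, Hashable, Iterable


class ReintegratingVoter:
    """Majority voter that readmits an outvoted processor after it has
    matched the voted value in a given number of consecutive votes."""

    def __init__(self, processors: Iterable[str], required_matches: int = 3):
        self.active = set(processors)
        self.probation: Dict[str, int] = {}  # processor -> consecutive matches
        self.required_matches = required_matches

    def vote(self, outputs: Dict[str, Hashable]) -> Hashable:
        """Return the majority value among the active processors."""
        ballots = [outputs[p] for p in sorted(self.active)]
        value, count = Counter(ballots).most_common(1)[0]
        if 2 * count <= len(ballots):
            raise RuntimeError("no majority among active processors")
        # Processors on probation keep executing; count their matching votes.
        for p in sorted(self.probation):
            if outputs.get(p) == value:
                self.probation[p] += 1
                if self.probation[p] >= self.required_matches:
                    del self.probation[p]
                    self.active.add(p)  # configured back into the system
            else:
                self.probation[p] = 0
        # Active processors that were outvoted are configured out.
        for p in sorted(self.active):
            if outputs[p] != value:
                self.active.remove(p)
                self.probation[p] = 0
        return value


if __name__ == "__main__":
    voter = ReintegratingVoter(["p1", "p2", "p3"], required_matches=2)
    print(voter.vote({"p1": 5, "p2": 5, "p3": 9}), sorted(voter.active))
    print(voter.vote({"p1": 7, "p2": 7, "p3": 7}), sorted(voter.active))
    print(voter.vote({"p1": 8, "p2": 8, "p3": 8}), sorted(voter.active))
```

In the usage example, p3 is outvoted in the first round, matches in the next two rounds, and is then counted again as an active member.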
4.4.3 Fault Containment
When pair-wise comparison is used, containment is achieved by stopping all activity in the mismatching pair. Any other pairs in operation can continue executing the application, undisturbed. They detect the failure of the miscomparing pair through time-outs.
4.4.4 Fault Masking
In a comparison-based system, fault masking is achievable in two ways. When voting is used, the voter allows only the correct value to pass on. If hardware voters are used, this usually occurs quickly enough to meet any response deadlines. If the voting is done by software voters that must reach a consensus, adequate time may not be available.
4.4.5 Fault Compensation
The value provided by a voter may be the majority value, the median value, a plurality value, or some other predetermined satisfactory value. While this choice is application dependent, the most common choice is the median value. This guarantees that the value selected was calculated by at least one of the participating processors and that it is not an extreme value.
4.4.6 Fault Repair
In a comparison-based system with a single pair of processors, there is no recovery from a fault. With multiple pairs, recovery consists of using the values from the "good" pair. Some systems provide mechanisms to restart the miscomparing pair with data from a "good" pair. If the miscomparing pair subsequently produces results that compare for an adequate period of time, it may be configured back into the system.
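Returning to the voter value choices listed under Fault Compensation (Section 4.4.5), the sketch below shows one possible reading of the majority, plurality, and median rules. The function names are hypothetical, the input is assumed to be a non-empty list of processor outputs, and taking the lower middle element for an even number of inputs is an assumption made here so that the result is always a value some processor actually produced.

```python
from collections import Counter
from typing import Hashable, Optional, Sequence


def majority_value(values: Sequence[Hashable]) -> Optional[Hashable]:
    """Value reported by more than half of the processors, else None."""
    value, count = Counter(values).most_common(1)[0]
    return value if 2 * count > len(values) else None


def plurality_value(values: Sequence[Hashable]) -> Optional[Hashable]:
    """Most frequently reported value; None if the top two counts tie."""
    ranked = Counter(values).most_common(2)
    if len(ranked) == 2 and ranked[0][1] == ranked[1][1]:
        return None
    return ranked[0][0]


def median_value(values: Sequence[float]) -> float:
    """Middle element of the sorted outputs (lower middle if the count is even)."""
    ordered = sorted(values)
    return ordered[(len(ordered) - 1) // 2]


if __name__ == "__main__":
    readings = [10, 12, 97]  # one processor reports an extreme value
    print(median_value(readings))
    print(majority_value([10, 10, 97]))
    print(plurality_value([1, 2, 2, 3]))
```

With three readings of which one is extreme, the median rule selects a value that one of the processors computed and that is neither the smallest nor the largest, which is the property the text attributes to the median.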
A Conceptual Framework for Systems Fault Tolerance - 30 MAR 95