Fault Tolerance in Parallel Systems. Jamie Boeheim Sarah Kay May 18, 2006

1 Fault Tolerance in Parallel Systems Jamie Boeheim Sarah Kay May 18, 2006

2 Outline. What is a fault? Reasons for fault tolerance. Desirable characteristics. Fault detection. Methodology. Hardware: redundancy, routing, example schemes. Software: operating system software, routing protocols, example system. Cost. Conclusions. References.

3 What is a fault? An abnormal condition or defect at the component, equipment, or sub-system level which may lead to a failure. Hardware: a defect in a circuit or wiring caused by imperfect connections, poor insulation, grounding, or shorting. Software: an accidental condition, or a manifestation of a programming mistake, that may cause a system or component not to perform as required.

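4 Why Is Fault Tolerance Necessary? In a standard system, a fault can disrupt work on the processor and destroy the data. A single fault can propagate through the entire system. For components in series, the system works only if every component works, so the system reliability is the product of the individual component reliabilities. With independent component failure probabilities p_1, ..., p_n, the probability of a system failure is 1 - (1 - p_1)(1 - p_2)...(1 - p_n), which grows as components are added.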
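The series-system argument can be checked numerically. The sketch below assumes independent component failures; the function name and the example figures (100 components, 1% failure probability each) are illustrative and do not come from the slides.

```python
from math import prod

def series_failure_probability(component_failure_probs):
    """Failure probability of a series system with independent components.

    The system works only if every component works, so its reliability is
    the product of the component reliabilities (1 - p_i).
    """
    reliability = prod(1.0 - p for p in component_failure_probs)
    return 1.0 - reliability

# Hypothetical example: 100 components in series, each failing with probability 0.01.
p_system = series_failure_probability([0.01] * 100)
print(f"{p_system:.3f}")  # prints 0.634
```

Even with very reliable individual parts, a long chain fails more often than it works, which is the motivation for the techniques on the following slides.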

5 Desirable Characteristics (PE = Processing Element). A large number of PE-disjoint paths between any pair of PEs, for increased reliability and fault tolerance. Message routing that is simple to implement and flexible enough to route around faulty PEs in the network. Graceful degradation in performance as the number of faults increases.

6 Fault Detection. Faults can occur in a processor, in the network, or in data. Processor faults: errors in the processor itself; can be detected by processor status bits or external result comparison. Network faults: broken links; can be detected with link status information. Data faults: errors in data; can be detected with parity bits, error-checking codes, etc.
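As an illustration of data fault detection, here is a minimal sketch of a single even-parity bit. The helper names are hypothetical, and a real system would typically use a stronger error-checking code.

```python
def parity_bit(bits):
    """Even-parity bit: 1 if the data holds an odd number of 1s, else 0."""
    return sum(bits) % 2

def encode(bits):
    """Append the parity bit so the whole word has an even number of 1s."""
    return list(bits) + [parity_bit(bits)]

def has_error(word):
    """A word with an odd number of 1s must have been corrupted."""
    return sum(word) % 2 != 0

word = encode([1, 0, 1, 1])      # three 1s in the data, so the parity bit is 1
corrupted = list(word)
corrupted[2] ^= 1                # flip one bit to simulate a data fault
print(has_error(word), has_error(corrupted))  # prints False True
```

A single parity bit detects any odd number of flipped bits but misses an even number of flips, and it cannot say which bit is wrong.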

7 Fault Tolerance Methodologies. Hardware: redundancy, limited routing, FTPA. Software: checkpointing, FDIR, GENESIS cluster support, routing protocols (crosshatch, meshes/tori).

8 Hardware Fault Tolerance: Redundancy (Processing Nodes). Multiple processing elements perform the same calculations, and the results are compared to find the correct value. In a simple computation (e.g. systolic multiplication): majority rules, and a simple comparator is used to select the result. In more complex or critical systems: confidence voting, where the more complex logic required introduces more possibilities of failure.
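The "majority rules" selection can be sketched in software as follows. This is only an illustration of the voting idea, not the comparator hardware; the function name is hypothetical and confidence voting is not shown.

```python
from collections import Counter

def majority_vote(results):
    """Return the value produced by a strict majority of redundant PEs.

    Returns None when no value has a strict majority, which means the
    fault cannot be masked by voting alone.
    """
    if not results:
        return None
    value, count = Counter(results).most_common(1)[0]
    return value if 2 * count > len(results) else None

# Three redundant PEs compute the same product; one of them is faulty.
print(majority_vote([42, 42, 41]))  # prints 42
```

With three replicas a single faulty result is outvoted; masking f simultaneous faulty results this way needs at least 2f + 1 replicas.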

9 Hardware Fault Tolerance: Redundancy (Links). Multiple links between processing nodes. If a failure is detected on one link, stop sending and accepting packets on that link. Then either move communication to an unused link, or split the messages assigned to the nonfunctional link among the other links (this requires some software intervention).
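The second option, splitting a failed link's traffic among the remaining links, might look like the sketch below. The per-link queue representation and the round-robin policy are assumptions made for illustration; the slides do not specify how the split is done.

```python
def reassign_messages(queues, failed_link):
    """Spread the pending messages of a failed link over the healthy links.

    queues maps a link id to its list of pending messages (a hypothetical
    representation). The input is left untouched; a new mapping without
    the failed link is returned.
    """
    healthy = [link for link in queues if link != failed_link]
    if not healthy:
        raise RuntimeError("no functional links remain")
    new_queues = {link: list(queues[link]) for link in healthy}
    # Round-robin assignment keeps the extra load roughly even.
    for i, message in enumerate(queues.get(failed_link, [])):
        new_queues[healthy[i % len(healthy)]].append(message)
    return new_queues

queues = {"A": ["a1"], "B": [], "C": ["c1", "c2", "c3"]}
print(reassign_messages(queues, "C"))
# prints {'A': ['a1', 'c1', 'c3'], 'B': ['c2']}
```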

10 Hardware Fault Tolerance: Routing. Once a fault is detected, the offending link or processing node needs to be fixed, masked, or avoided. In a redundant system the fault is masked; elements of masking were shown for redundant systems. Where redundancy is not available, the fault must be routed around, although only a limited amount of routing can be done directly by the hardware.

11 Hardware Fault Tolerance: FTPA (Fault Tolerant Processor Array). Designed to route data around nonfunctional processors. The design had to determine where to route data while minimizing communication time. Set switching: swap out an entire block if an error occurs in one cell. Local redundancy: a single redundant cell assigned to a small cluster can replace one of the cells.

12 Hardware Fault Tolerance: FTPA (continued). Processor switching: switches remove damaged processors from the pipeline and add spare nodes to handle the necessary operations.

Summary of redundancy techniques:

Scheme      Set Switching  Local Redundancy  Processor Switching
Simplicity  Good           Good              Fair
Efficiency  Poor           Fair              Good
Area        Poor           Fair              Fair

13 Software Fault Tolerance: Checkpointing. Copy the process resources and state to stable storage. Non-deterministic events should be prevented while the checkpoint is created (e.g. by blocking the process's inter-process communication, which stops rollback propagation). If a fault occurs, the process can be restarted on the same or a different PE by simply copying the saved process state.
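A minimal single-process sketch of checkpoint and restart is shown below. It assumes the process state fits in a picklable dictionary and that a local file stands in for stable storage; the function names, the `fail_at` fault injection, and the toy computation are all hypothetical. It does not address inter-process communication or rollback propagation.

```python
import os
import pickle
import tempfile

def save_checkpoint(state, path):
    """Write the state to a temporary file, then atomically rename it,
    so a crash during the write never corrupts the previous checkpoint."""
    directory = os.path.dirname(os.path.abspath(path))
    fd, tmp_path = tempfile.mkstemp(dir=directory)
    with os.fdopen(fd, "wb") as f:
        pickle.dump(state, f)
        f.flush()
        os.fsync(f.fileno())
    os.replace(tmp_path, path)

def load_checkpoint(path):
    with open(path, "rb") as f:
        return pickle.load(f)

def run(total_steps, path, fail_at=None):
    """Sum 0..total_steps-1, checkpointing after every step.

    If a checkpoint exists, execution resumes from it. fail_at injects a
    simulated fault before the given step.
    """
    if os.path.exists(path):
        state = load_checkpoint(path)
    else:
        state = {"step": 0, "acc": 0}
    while state["step"] < total_steps:
        if fail_at is not None and state["step"] == fail_at:
            raise RuntimeError("simulated fault")
        state["acc"] += state["step"]
        state["step"] += 1
        save_checkpoint(state, path)
    return state["acc"]
```

Calling `run(10, path, fail_at=5)` raises after five completed steps; a later `run(10, path)`, possibly on a different machine that can read the same file, resumes from step 5 and returns 45, the same result as an uninterrupted run.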

14 Software Fault Tolerance: FDIR (Fault Detection, Isolation, and Recovery). Used with the NASA X-38 experimental vehicle processors. Software tracks where faults occur and, if necessary, provides recovery with some form of backup.

15 Software Fault Tolerance: GENESIS Cluster Support. Checkpointing is transparent to the programmer. Checkpointing is similar to process duplication. High performance. Low overhead.

16 Software Fault Tolerance: Alternative Software Approaches. CALYPSO. CoCheck (checkpoint based). Manetho (log based). Fault Tolerant MPI.

17 Software Fault Tolerance: Routing Protocols. After a faulty module or link has been found, a way to avoid it must be determined. Even with masking, only a limited number of faults can be tolerated. Software allows for a more flexible design.

18 Software Fault Tolerance: Crosshatch Routing. Each switch knows the fault status of the switches to which it is connected. In case of a fault, packets are transmitted around the fault without changing the switching technique. Rerouted messages may deadlock because they take up space on routes not intended to handle them. One way to avoid the deadlock is to designate certain switches to handle fault conditions.

19 Software Fault Tolerance: Meshes/Tori. Tradeoff: flexibility vs. performance. Minimize the use of additional resources (e.g. virtual channels). Adaptive routing around the failure area (a single PE or a block). Reconfigure the routing table to adapt to the new topology after a failure.
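To make routing around a failure area concrete, the sketch below finds a shortest path in a 2D mesh that avoids a set of faulty PEs, using a breadth-first search with global knowledge of the faults. This is an assumption-laden illustration, not the methodology of any cited scheme: real routers decide hop by hop, and the sketch ignores deadlock and virtual channels entirely. The function name and coordinate convention are hypothetical.

```python
from collections import deque

def route(width, height, src, dst, faulty):
    """Shortest path of (x, y) PEs from src to dst in a width x height mesh,
    avoiding the PEs in faulty. Returns None if dst is unreachable."""
    if src in faulty or dst in faulty:
        return None
    previous = {src: None}          # also serves as the visited set
    frontier = deque([src])
    while frontier:
        node = frontier.popleft()
        if node == dst:
            path = []
            while node is not None:
                path.append(node)
                node = previous[node]
            return path[::-1]
        x, y = node
        for nxt in ((x + 1, y), (x - 1, y), (x, y + 1), (x, y - 1)):
            nx, ny = nxt
            inside = 0 <= nx < width and 0 <= ny < height
            if inside and nxt not in faulty and nxt not in previous:
                previous[nxt] = node
                frontier.append(nxt)
    return None

# 3x3 mesh with the PE at (1, 0) failed: the direct route along row 0 is blocked.
path = route(3, 3, (0, 0), (2, 0), faulty={(1, 0)})
print(len(path) - 1)  # prints 4 (hops, versus 2 in the fault-free mesh)
```

The two extra hops are the performance cost of the detour; recomputing such paths after each failure corresponds to reconfiguring the routing table for the new topology.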

20 Cost of Fault Tolerance. Hardware: redundant hardware requires extra space, a major issue in massively parallel machines; performance may be lost if, instead of duplicating hardware, some of the existing hardware is dedicated to fault tolerance. Software: performance degradation from the checks; additional memory requirements.

21 Conclusions. The added cost of fault tolerance is necessary when PEs are inherently error-prone (nanotechnology), when long-term projects require extended reliability (space exploration), or when accuracy of results is essential (banking transactions). Hardware fault tolerance has less system overhead but is not flexible. Software fault tolerance has more system overhead but adapts better to individual implementations.

22 References
KleinOsowski, A. et al. The Recursive NanoBox Processor Grid: A Reliable System Architecture for Unreliable Nanotechnology Devices. IEEE.
Gómez, M.E. et al. An Efficient Fault-Tolerant Routing Methodology for Meshes and Tori.
Baratloo, A. et al. Calypso: A Novel Software System for Fault-Tolerant Parallel Processing on Distributed Platforms.
Racine, R. et al. Design of a Fault-Tolerant Parallel Processor. IEEE.
Rough, J., Goscinski, A. Exploiting Operating System Services to Efficiently Checkpoint Parallel Applications in GENESIS. Algorithms and Architectures for Parallel Processing.
Yasudo et al. Deadlock-free Fault-tolerant Routing in the Multi-dimensional Crossbar Network and Its Implementation for the Hitachi SR2201.
Chean, M., Fortes, J. A. A Taxonomy of Reconfiguration Techniques for Fault-Tolerant Processor Arrays. Survey & Tutorial Series.
Harper, R. et al. Fault Tolerant Parallel Processor Architecture Overview. IEEE. 1988.

23 Questions
