Soft Error Fault Tolerant Systems: CS456 Survey


Alok Garg

Abstract

Programming errors are currently regarded as the foremost cause of most system failures, but recent studies suggest that soft errors are increasingly responsible for system downtime. Computer systems are becoming more complex and are optimized for price and performance, not for availability, which makes soft errors an even more common case. The move towards denser, smaller, and lower-voltage transistors has the potential to increase these transient errors. Until now, most system software architectures have placed complete faith in the underlying hardware, and software makes no provision for dealing with hardware faults. In this survey paper, we investigate the influence of soft errors on the system as a whole and current research into proposed recovery mechanisms.

1 Introduction

Soft errors are unintended transitions of logic state in a circuit, typically caused by an external source of ionizing radiation. The ionization creates excess free carriers, which recombine with the stored charges, thereby corrupting the state of the transistor. Device scaling, reductions in the feature size and voltage levels of transistors, and higher transistor density have increased the risk of hardware faults due to soft errors. Research by Shivakumar et al. [12] predicts that the soft error rate (SER) per chip of logic circuits will increase nine orders of magnitude from 1992 to 2011 and at that point will be comparable to the SER per chip of unprotected memory elements.

Due to the demand for high-performance, low-cost computers, availability has received less attention. It is a common belief that software errors are, and will continue to be, the most probable cause of loss of availability. But as processors, caches, and memories become larger, faster, and denser, and are increasingly used in adverse environments, soft errors are also becoming more probable. Ziegler et al.
[15,16], through extensive field studies, predicted and verified the soft error rate (SER), expressed in FIT (1 FIT equals 1 failure in 10^9 hours), of a 16 Mbit DRAM chip. A system with 100 such chips will have a failure rate of about one per week. They also claim that a typical processor's silicon can have a soft error rate of 4000 FIT, of which 50% will affect processor logic and 50% the large on-chip cache.

Until now, techniques such as Error Correction Codes (ECC) have been used to correct errors in main memory and system interconnects. Unfortunately, such techniques only reduce the visible error rate for semiconductor elements that can be covered by such codes. For example, a 1 Gbit memory system based on 64 Mbit DRAMs still has a combined visible error rate of 3435 FIT when using Single Error Correct-Double Error Detect (SEC-DED) ECC [3]. This is equivalent to around 300 errors in such machines in one year. Analysis by Xu et al. [13] shows that soft errors may lead to serious security vulnerabilities, if not system crashes.

Due to price sensitivity and higher demand for performance, it is not cost effective for the hardware to provide full support for masking or containing these soft errors. The burden therefore falls on system software to handle these errors in order to achieve the highest availability. Current system software places complete faith in the underlying hardware and makes no provision for software-based mechanisms to recover from hardware faults. Messer et al. [6] and Charng-da Lu [2] have analyzed the effect of soft errors on system software. Software fault-tolerance techniques have also been proposed by Rebaudengo et al. [10] and Milojicic et al. [7].

2 Existing Hardware Support for Error Handling

Availability in computer systems is determined by hardware and software reliability.
Hardware reliability has traditionally existed only in proprietary servers, with specialized, redundantly configured hardware and critical software components. But a few reliability features also exist in commercial, price-sensitive processors.

2.1 Support for Memory and Communication Errors

Depending on memory size, the technology's sensitivity to soft errors, and price pressure, PC systems usually support at least parity detection on main memory and system buses. Error Correction Codes (ECC) are also supported for large caches. A parity check can detect and report a 1-bit error, while ECC can normally correct a 1-bit error and detect a 2-bit error (SEC-DED). Once an error is detected, the processor tries to correct it if possible. Otherwise, the processor may report the error to firmware. Firmware with advanced software support may handle the reported errors.

2.1.1 Itanium Processor

The IA-64 architecture extends support for soft errors in two ways [9]. First, additional hardware detection is supported in the processor implementation, such as parity or ECC protection for the system bus and the three on-chip caches. These provide good coverage against transient errors in processor cache memory and system buses. Second, recoverability is handled through the machine-check exception, which is supported by several types of well-defined error scenarios. Error logging provides information for potential software containment of the errors. Itanium processor reliability and availability features are presented in Figure 1.

2.1.2 Memory RAID

ECC still offers excellent protection for many servers. As memory capacity grows, however, the level of effectiveness that ECC provides actually decreases. HP developed Hot Plug RAID Memory [4] to extend the effectiveness of ECC. Hot Plug RAID Memory provides redundancy and hot-plug capabilities for dual inline memory modules (DIMMs) in order to improve availability, scalability, and fault tolerance.
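To make the SEC-DED behavior described in Section 2.1 concrete, the following is a minimal sketch of an extended Hamming (8,4) code: three Hamming parity bits plus one overall parity bit protecting four data bits. It is an illustration only. Real memory ECC uses much wider codewords, and the function names `encode` and `decode` are hypothetical, not taken from any of the cited systems.

```python
def encode(nibble):
    """Encode a 4-bit value as an 8-bit extended Hamming codeword (list of bits)."""
    d = [(nibble >> i) & 1 for i in range(4)]
    code = [0] * 8                          # index 0: overall parity, 1..7: Hamming positions
    code[3], code[5], code[6], code[7] = d  # data bits sit at the non-power-of-two positions
    code[1] = code[3] ^ code[5] ^ code[7]   # parity over positions 1, 3, 5, 7
    code[2] = code[3] ^ code[6] ^ code[7]   # parity over positions 2, 3, 6, 7
    code[4] = code[5] ^ code[6] ^ code[7]   # parity over positions 4, 5, 6, 7
    code[0] = sum(code[1:]) % 2             # makes the parity of all 8 bits even
    return code


def decode(code):
    """Return (status, nibble); status is 'ok', 'corrected', or 'uncorrectable'."""
    code = list(code)
    syndrome = 0
    for pos in range(1, 8):
        if code[pos]:
            syndrome ^= pos                 # XOR of the positions of all set bits
    parity_odd = sum(code) % 2 == 1
    if syndrome == 0 and not parity_odd:
        status = "ok"
    elif parity_odd:
        # Odd overall parity means a single-bit error; the syndrome names its position
        # (syndrome 0 means the overall parity bit itself was hit).
        code[syndrome] ^= 1
        status = "corrected"
    else:
        # Non-zero syndrome with even overall parity: two bits flipped, detect only.
        return "uncorrectable", None
    nibble = code[3] | (code[5] << 1) | (code[6] << 2) | (code[7] << 3)
    return status, nibble


if __name__ == "__main__":
    word = encode(0b1011)
    word[5] ^= 1                            # inject a single-bit upset
    print(decode(word))
    word[2] ^= 1                            # a second upset in the same word
    print(decode(word))
```

A plain parity check corresponds to keeping only the overall parity bit: it can report that one bit flipped but cannot locate it. Note that three or more flips in one codeword can be miscorrected by a code of this kind, which is one reason the visible error rate does not drop to zero under SEC-DED.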
2.2 Support for Logical Errors

Although considerable commercial support is available for error detection and correction in memories and system buses, protection from soft errors in logic circuits is almost entirely missing. According to the predictions, soft errors are going to be as probable in logic circuits as in system memories or caches, and would become a reliability concern. A lot of research is currently under way at the circuit and architecture level to detect and correct errors in processor logic. But this research is still very rudimentary and impractical for complex and price-sensitive processors.

Figure 1. Itanium processor reliability and availability features. [9]

3 Influence of Soft Errors on System Software

Designing software error recovery mechanisms requires a clear understanding of the influence of soft errors on system software. This is also needed to establish whether transient errors are an issue for system software at all and, if they are, how severe the issue is. Although no field studies exist, simulations using software fault injection have provided some insight into the problem. The impact of soft errors on a commodity OS is analyzed by Messer et al. [6]. Charng-da Lu and Daniel A. Reed [2] simulated single-bit memory errors, register file upsets, and MPI message payload corruption, and measured the behavioral response for a suite of MPI applications.

Knowing the potential sources of error in the processor helps in understanding simulated fault injection techniques. The components that directly affect software when corrupted by a soft error are:

Processor Register: Regular integer registers are most vulnerable to transient errors. Because these general-purpose registers contain live data at any given time, single-bit upsets in these registers are very likely to affect application behavior. Transient errors in the processor logic may also propagate corrupt results to the registers. For example, a transient error during an arithmetic addition may write a wrong result to the destination register. These kinds of errors are very difficult to detect.

Processor Cache: The processor cache (SRAM), including the memory and TLB caches, is at least protected by parity checks and is less vulnerable than system memory (DRAM).

System Memory: An error in the software code or data may change the behavior of the program if it remains undetected. System memory is usually protected by SEC-DED ECC.

System Bus: Bus logic is also prone to soft errors and normally uses parity checks for error detection.

IO Buffers: An error in an IO buffer may corrupt the information coming from the disk or network. Distributed applications, such as those using MPI, are more sensitive to soft errors in IO buffers.

To understand how errors affect system software, soft errors can be characterized as follows.

Overwritten: Errors in memory or in a register that are detected during a write may be ignored, since the content is overwritten.

User Signalable: The error is detected while reading from memory or from a processor register. An error on a memory read is considered user signalable if the location read is in user data/code space, whether the processor is executing user or kernel code. A register read is considered user signalable if the processor is executing user code. Recovery from these errors is possible by signaling the user application, either for termination or for potential application recovery.

Kernel Fatal: The error is detected during a memory read located in kernel data/code space, or during a register read while executing kernel code. These errors may corrupt the kernel state and hence are called Kernel Fatal.
Silent Data Corruption: An error that remains undetected by all of these mechanisms may still corrupt the output. This is the most dangerous of all the possible errors because there is little sign during execution that can alert the user.

3.1 Error Injections

Given the importance of soft errors to system software, fault injection techniques are used to study software responses to transient faults. Fault injection can be either hardware-based or software-based [5]. Hardware fault injection consists of subjecting chips to heavy-ion radiation to simulate the effect of alpha particles. In contrast, software-implemented fault injection does not require expensive equipment and can target specific software components, such as the operating system, software libraries, or applications.

Messer et al. [6] performed investigations on an IA-32 platform using watch points to simulate memory errors. The /proc kernel virtual file system interface is used to set up a watch point through an entry called /proc/mfi. The watch point facility does not allow more than one virtual address to be monitored simultaneously. A user program randomly selects the physical address for error injection. The kernel then searches for the virtual address (kernel or user) that maps to the physical address provided by the user program. A reverse page table entry (PTE) lookup is performed when a task is first scheduled after the error was injected. A timeout-based mechanism is used for time-bound simulations.

Charng-da Lu and Daniel A. Reed [2] used memory fault injection to target both registers and application memory regions. The fault injector employs different techniques for injecting faults into different regions of the address space. Techniques for injecting faults into the IO buffer are also used.

3.2 Soft Error Analysis

Based on the simulated experiments conducted in previous research [2,6], the following insights give a better understanding of the impact on system software.
Registers and IO buffers are particularly vulnerable to single-bit-flip faults, an average of 34.7% of all the activated faults. When IO buffer faults activate, the chance of producing wrong output can be quite high, ranging from 28% to 71%.

90% of memory errors need not be fatal to the operating system's execution and may require only minor support for partial recovery.

A large number of memory activations are overwritten. This stems from the write-before-read use of most memory locations.

Kernel fatal memory accesses account for only a small number of all memory errors.

For user applications, memory errors in the object heap have a higher activation and susceptibility rate than those in the static data area.

A large portion of heap error activation is caused by the garbage collector, and causes fewer application errors than other sources of activation.

The above analysis clearly indicates that software-based fault tolerance efforts must target processor registers along with memory, while only a few of the memory faults are actually damaging.

4 Software Based Recovery Approach

Various mechanisms have been proposed [1,7,10] through which system software tries to provide fault tolerance and higher availability guarantees. These methods depend on the level of processor support for error handling that is provided to the firmware. Recovery techniques also depend on context; for example, fault-tolerant schemes for distributed systems could be quite different from schemes for a single system. A generalized scheme for fault containment and recovery is presented in Figure 2. An error is typically detected at the hardware level; it is then interpreted and logged and, if needed, the next level (firmware) is notified. The interpret/log/notify phases are repeated at different levels until the error is either recovered or determined to be non-recoverable. This order of events is presented in Figure 3, where the firmware level is split into processor-specific and platform-specific parts, and the OS level into Machine Check Abort (MCA, a serious error exception) and OS-specific parts.

4.1 Error Detection

Hardware typically detects errors through parity checks or ECC. Hardware may not provide support for certain kinds of error scenarios, such as soft errors in processor logic. Even if hardware does not detect some errors, software can detect inconsistencies, which typically appear as invalid pointers or incorrect checksums.
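As an illustration of checksum-based software detection, the sketch below stores a CRC-32 alongside a buffer, injects a single-bit flip in the spirit of the software fault injection described in Section 3.1, and then re-verifies the checksum. The helper names `protect`, `flip_bit`, and `is_consistent` are hypothetical, and CRC-32 is chosen here only because it is available in the Python standard library; none of the surveyed papers prescribe it.

```python
import zlib


def protect(payload):
    """Return the payload together with its CRC-32 checksum."""
    return payload, zlib.crc32(payload)


def flip_bit(buf, bit_index):
    """Simulate a soft error by inverting one bit of a mutable buffer in place."""
    buf[bit_index // 8] ^= 1 << (bit_index % 8)


def is_consistent(payload, checksum):
    """Recompute the checksum and compare it with the stored one."""
    return zlib.crc32(payload) == checksum


if __name__ == "__main__":
    data, crc = protect(b"message payload held in an IO buffer")
    corrupted = bytearray(data)
    flip_bit(corrupted, 42)                       # inject a single-bit upset
    print(is_consistent(data, crc))               # intact copy passes the check
    print(is_consistent(bytes(corrupted), crc))   # corrupted copy fails the check
```

A check like this only detects the corruption; what happens next (retry, re-fetch from a redundant copy, or signal the application) is a recovery decision of the kind discussed in Section 4.2.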
We have already discussed the hardware support for error detection in Section 2. We discuss some of the well-known software-based transient error detection schemes in the following sections.

4.1.1 Assertions [11]

Assertions are logic statements inserted at different points in the program that reflect invariant relationships between the program's variables. Their use can lead to various problems, since assertions are not transparent to the programmer and their effectiveness largely depends on the nature of the application and on the programmer's ability.

4.1.2 Control Flow Checking [14]

The basic idea of control flow checking is to partition the application program into basic blocks, i.e., branch-free parts of code. For each block a deterministic signature is computed, and faults can be detected by comparing the run-time signature with a precomputed one. In most control flow checking techniques, one of the main problems is tuning the test granularity that should be used.

4.1.3 Procedure Duplication [8]

In procedure duplication, the programmer decides to duplicate the most critical procedures and to compare the results obtained. This approach requires the programmer to define a set of procedures to be duplicated and to introduce the proper checks on the results. These code modifications can only be made manually and may introduce errors.

4.1.4 Data and Code Redundancy [10]

Data and code redundancy is proposed to detect errors affecting both data and code. The redundancy is introduced according to a set of transformations to be performed on the high-level source code. Errors in data are detected by duplicating each variable and adding consistency checks after every read operation. Other transformations focus on errors affecting the code, and correspond, on one side, to duplicating the code implementing each operation and, on the other side, to adding checks that verify the consistency of the executed operations. The main advantage of the method lies in the fact that it can be automatically applied to high-level source code, thus freeing the programmer from the burden of guaranteeing its robustness against errors (e.g., by selecting what to duplicate and where to put the checks). The method is completely independent of the underlying hardware, and addresses any kind of fault affecting either the code or the data.

Figure 2. Errors are detected, then the error state is logged, interpreted and recovery attempted. If unsuccessful, the next level may be notified. [7]

Figure 3. Memory failure recovery scenario. Memory error is typically detected by HW. If the error cannot be contained, it is notified to FW. FW gathers information and attempts recovery. Recovery is performed at the processor and at the platform-level. If recovery is possible, the state is prepared for OS and it is notified. OS attempts recovery at the MCA and at the OS-level. In case of successful OS recovery, application is notified with relevant state. Application analyzes the state and attempts to recover. All but the first arrows are optional. [7]

4.1.5 Directions for Improvements in Fault Detection Techniques

All the methods we have discussed for transient fault detection assume very little hardware support and are generic techniques. Because of this generic nature, the software overhead of error detection is very high. The overhead is highest for the scheme based on data and code redundancy. These overheads may turn out to be very costly for commonly used systems. Alternative, hardware-aware software techniques could be a more feasible way to improve error detection for the system as a whole. Some of these low-level techniques may go into firmware, and others may become part of the hardware-dependent OS layer, based on techniques targeted at the specific hardware and OS kernel.
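The data-duplication idea from Section 4.1.4, and the overhead concern raised above, can be illustrated with a small sketch. It is a hand-written approximation, not the source-to-source transformation of Rebaudengo et al. [10]: every variable is kept in two copies, every operation is executed once per copy, and every read is followed by a consistency check. The names `DupVar`, `SoftErrorDetected`, and `checked_sum` are hypothetical.

```python
class SoftErrorDetected(Exception):
    """Raised when the two copies of a duplicated variable disagree."""


class DupVar:
    """A variable stored twice so that corruption of either copy is detectable."""

    def __init__(self, value):
        self.copy0 = value
        self.copy1 = value

    def update(self, operation):
        # The operation is executed once per copy (code duplication), so a
        # transient fault during one execution leaves the copies inconsistent.
        self.copy0 = operation(self.copy0)
        self.copy1 = operation(self.copy1)

    def read(self):
        # Consistency check performed on every read.
        if self.copy0 != self.copy1:
            raise SoftErrorDetected("duplicated copies differ")
        return self.copy0


def checked_sum(values):
    """Sum a sequence using a duplicated accumulator with checks after each step."""
    total = DupVar(0)
    for v in values:
        total.update(lambda t: t + v)
        total.read()
    return total.read()


if __name__ == "__main__":
    print(checked_sum([1, 2, 3]))
    counter = DupVar(10)
    counter.copy1 ^= 1 << 3          # simulate a bit flip in one copy
    try:
        counter.read()
    except SoftErrorDetected as err:
        print("detected:", err)
```

Even in this toy form the cost is visible: memory for data doubles, every arithmetic operation runs twice, and each read adds a comparison, which is consistent with the observation that this scheme has the highest overhead of the techniques surveyed.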
4.2 Fault Recovery Mechanisms

Whenever a fault is detected in hardware, the processor tries to correct it. If the fault is uncorrectable, the processor tries to contain the error by giving firmware an opportunity for error recovery. We have already discussed hardware-based error recovery and containment mechanisms in Section 2. In this section we investigate how software (firmware, OS, or application) can react to errors, given that the system is capable of detecting transient faults missed by the hardware, and return to the consistent state that existed before the failure.

If the error cannot be reported in an exact and restartable manner, the software needs to offer greater support for recovery. For the software to be able to restart transactions, sufficient state must be saved. Hence software complexity increases as hardware support is reduced for the same level of fault tolerance.

Based on the classification of faults by severity in Section 3, various recovery mechanisms can be implemented in software according to the level of availability expected from the system as a whole. A few of the mechanisms for OS recovery are discussed next:

User Signalable: In the case of user signalable errors, the state of a particular user program has become corrupt, but the processor may allow the kernel to continue operating. As a result, the kernel can signal the user task and proceed with another one, or interrupt the system call. The user program may handle recovery at the application level according to the application's availability requirements.

Kernel Fatal: Error recovery may be possible through analysis of the kernel. For example, an error in a duplicated memory region may be recovered by re-fetching the data from the correct copy. Corruption within logs or statistical counters should not bring the system down. More complicated checkpoint-based rollback recovery mechanisms may also be implemented.

The recovery mechanism differs and depends on the individual level (Hardware, Firmware, OS, and Application). For example, a distributed application may still provide availability when one node fails. Hence the outcome of recovery can be full recovery, partial recovery, or system failure, depending on the detected error. Refer to Table 1 for details. Full recovery effectively masks the errors from higher levels; the error may be logged for statistical purposes. If recovery is not possible at a particular level, the system is halted to prevent corrupt data from propagating to the network or disk.

Table 1. Failure Recovery Outcome [7]

Level        Full Recovery        Partial Recovery                            System Failure
Hardware     mask errors          halt/downgrade performance/functionality    halt/reboot
Firmware     mask error           notify OS                                   reboot (notify OS)
OS           continue to execute  notify app, kill user thread                reboot OS
Application  continue to execute  notify user                                 terminate application

5 Conclusion

Because soft errors are widely expected to dominate all kinds of errors, increased support for soft error signaling will be required in the future, not only in hardware but also in software, to increase the availability of the system as a whole. We have discussed many error detection and recovery aspects of the system, at both the hardware and the software level.

6 Road Map

The road map for improving the availability of the system through hardware and software cooperation is summarized as follows:

A hierarchical approach to recovery from soft errors at different levels may provide an elegant solution for improved system availability. But a naive implementation of fault detection and recovery techniques at each level may be costly in terms of system performance and complexity.

Implementation of fault detection techniques needs to be balanced across the levels (Hardware, Firmware, OS, and Application) and optimized for performance and complexity.
Fault recovery mechanisms at each level require a better understanding of the sensitivity of these levels to soft errors, so that recovery mechanisms can be optimized for cost and performance.

Each level may also differentiate critical data structures from a reliability point of view and indicate tolerable latencies for improved reliability.

References

[1] N. S. Bowen and D. K. Pradhan. Processor and Memory-Based Checkpoint and Rollback Recovery. IEEE Computer, 26(2):22-31, Feb.
[2] C. da Lu and D. A. Reed. Assessing Fault Sensitivity in MPI Applications. In Supercomputing, page 37, Pittsburgh, Pennsylvania, Nov.
[3] T. J. Dell. A White Paper on the Benefits of Chipkill-Correct ECC for PC Server Main Memory. IBM Microelectronics Division, Nov.
[4] HP. Tech brief: Hot Plug RAID Memory technology for fault tolerance and scalability, Sept.
[5] M.-C. Hsueh, T. K. Tsai, and R. K. Iyer. Fault Injection Techniques and Tools. IEEE Computer, 30(4):75-82, Apr.
[6] A. Messer, P. Bernadat, G. Fu, D. Chen, Z. Dimitrijevic, D. Lie, D. D. Mannaru, A. Riska, and D. Milojicic. Susceptibility of Commodity Systems and Software to Memory Soft Errors. IEEE Transactions on Computers, 53(12), Dec.
[7] D. Milojicic, A. Messer, J. Shau, G. Fu, P. Alto, and A. Munoz. Increasing relevance of memory hardware errors: a case for recoverable programming models. In ACM SIGOPS European Workshop, Kolding, Denmark, Sept.
[8] D. K. Pradhan. Fault-Tolerant Computer System Design. Prentice Hall PTR.
[9] N. Quach. High Availability and Reliability in the Itanium Processor. IEEE Micro, 20(5):61-69, Sept.-Oct.
[10] M. Rebaudengo, M. S. Reorda, M. Torchiano, and M. Violante. Soft-error Detection through Software Fault-Tolerance Techniques. In IEEE International Symposium on Defect and Fault Tolerance in VLSI Systems, Albuquerque, New Mexico, Nov.
[11] M. Z. Rela, H. Madeira, and J. G. Silva. Experimental Evaluation of the Fail Silent Behavior in Programs with Consistency Checks.
In International Symposium on Fault-Tolerant Computing, Sendai, Japan, June.

[12] P. Shivakumar, M. Kistler, S. W. Keckler, D. Burger, and L. Alvisi. Modeling the Effect of Technology Trends on the Soft Error Rate of Combinational Logic. In International Conference on Dependable Systems and Networks, Bethesda, Maryland, June.
[13] J. Xu, S. Chen, Z. Kalbarczyk, and R. K. Iyer. An Experimental Study of Security Vulnerabilities Caused by Errors. In International Conference on Dependable Systems and Networks, Goteborg, Sweden.
[14] S. Yau and F. Chen. An Approach to Concurrent Control Flow Checking. IEEE Transactions on Software Engineering, 6(2), Mar.
[15] J. F. Ziegler, H. W. Curtis, H. P. Muhlfeld, C. J. Montrose, B. Chin, M. Nicewicz, C. A. Russell, W. Y. Wang, L. B. Freeman, P. Hosier, L. E. LaFave, J. L. Walsh, J. M. Orro, G. J. Unger, J. M. Ross, T. J. O'Gorman, B. Messina, T. D. Sullivan, A. J. Sykes, H. Yourke, T. A. Enger, V. Tolat, T. S. Scott, A. H. Taber, R. J. Sussman, W. A. Klein, and C. W. Wahaus. IBM experiments in soft fails in computer electronics. IBM Journal of Research and Development, 40(1):3-18, Jan.
[16] J. F. Ziegler, H. P. Muhlfeld, C. J. Montrose, H. W. Curtis, T. J. O'Gorman, and J. M. Ross. Accelerated testing for cosmic soft-error rate. IBM Journal of Research and Development, 40(1):51-72, Jan.
