Soft Error Fault Tolerant Systems: CS456 Survey

Alok Garg

Abstract

Programming errors are currently regarded as the foremost cause of system failures, but recent studies suggest that soft errors are increasingly responsible for system downtime. Computer systems are becoming more complex and are optimized for price and performance rather than for availability, which makes soft errors more common still. The move toward denser, smaller, lower-voltage transistors has the potential to increase these transient errors. Until now, most system software architectures have assumed complete faith in the underlying hardware, and software makes no provision for dealing with hardware faults. In this survey paper, we investigate the influence of soft errors on the system as a whole and current research into proposed recovery mechanisms.

1 Introduction

Soft errors are unintended transitions of logic state in a circuit, typically caused by an external source of ionizing radiation. The ionization creates excess free carriers, which recombine with the stored charge and thereby corrupt the state of the transistor. Device scaling (reduced feature size and voltage levels) together with higher transistor density has increased the risk of hardware faults due to soft errors. Shivakumar et al. [12] predict that the soft error rate (SER) per chip of logic circuits will increase by nine orders of magnitude from 1992 to 2011, at which point it will be comparable to the SER per chip of unprotected memory elements.

Because of the demand for high-performance, low-cost computers, availability has received less attention. It is commonly believed that software errors are, and will continue to be, the most probable cause of loss of availability. But as processors, caches, and memories become larger, faster, and denser, and are increasingly used in adverse environments, soft errors are also becoming more probable. Ziegler et al. [15,16], through extensive field studies, predicted and verified the soft error rate (SER), expressed in FIT (1 FIT equals 1 failure in 10^9 hours), of a 16 Mbit DRAM chip. A system with 100 such chips will have a failure rate of about one per week. They also claim that a typical processor's silicon can have a soft error rate of 4000 FIT, of which 50% will affect processor logic and 50% the large on-chip cache.

Until now, techniques such as Error Correction Codes (ECC) have been used to correct errors in main memory and system interconnects. Unfortunately, such techniques only reduce the visible error rate for semiconductor elements that can be covered by such codes. For example, a 1 Gbit memory system based on 64 Mbit DRAMs still has a combined visible error rate of 3435 FIT when using Single Error Correct-Double Error Detect (SEC-DED) ECC [3]. That is about 0.03 errors per machine per year (3435 x 10^-9 failures per hour x 8760 hours per year), or around 300 errors per year across roughly 10,000 such machines. Analysis by Xu et al. [13] shows that soft errors may lead to serious security vulnerabilities, if not system crashes.

Because of price sensitivity and the demand for performance, it is not cost-effective for hardware to provide full support for masking or containing these soft errors. The burden therefore falls on system software to handle these errors in order to achieve the highest availability. Current system software assumes complete faith in the underlying hardware and makes no provision for software-based mechanisms to recover from hardware faults. Messer et al. [6] and Lu and Reed [2] have analyzed the effect of soft errors on system software. Software fault-tolerance techniques have also been proposed by Rebaudengo et al. [10] and Milojicic et al. [7].

2 Existing Hardware Support for Error Handling

Availability in computer systems is determined by hardware and software reliability.
Hardware reliability has traditionally existed only in proprietary servers, with specialized, redundantly configured hardware and critical software components. But a few reliability features also exist in commercial, price-sensitive processors.

2.1 Support for Memory and Communication Errors

Depending on memory size, the technology's sensitivity to soft errors, and price pressure, PC systems usually support at least parity detection on main memory and system buses. Error Correction Codes (ECC) are also supported for large caches. A parity check can detect and report a 1-bit error, while ECC is normally capable of correcting a 1-bit error and detecting a 2-bit error (SEC-DED). Once an error is detected, the processor tries to correct it if possible. Otherwise, the processor may report the error to firmware. Firmware with advanced software support may handle the reported errors.

2.1.1 Itanium Processor

The IA-64 architecture extends support for soft errors in two ways [9]. First, additional hardware detection is supported in the processor implementation, such as parity or ECC protection for the system bus and the three on-chip caches. These provide good coverage of transient errors in processor cache memory and system buses. Second, recoverability is handled through the machine-check exception, which is supported by several types of well-defined error scenarios. Error logging provides information for potential software containment of the errors. Itanium processor reliability and availability features are presented in Figure 1.

2.1.2 Memory RAID

ECC still offers excellent protection for many servers. As memory capacity grows, however, the effectiveness of ECC decreases. HP developed Hot Plug RAID Memory [4] to extend the effectiveness of ECC. Hot Plug RAID Memory provides redundancy and hot-plug capabilities for dual inline memory modules (DIMMs), which HP describes as delivering unprecedented levels of availability, scalability, and fault tolerance.
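The SEC-DED behavior described in Section 2.1 (correct any 1-bit error, detect any 2-bit error) can be illustrated with a toy code. The sketch below is not the code used by any particular memory controller; it assumes a 4-bit data word protected by a Hamming(7,4) code plus one overall parity bit, whereas real ECC memory commonly protects 64 data bits with 8 check bits. The function names are hypothetical.

```python
def parity_bit(bits):
    """Even-parity bit for a list of 0/1 values (1 if the count of 1s is odd)."""
    p = 0
    for b in bits:
        p ^= b
    return p


def secded_encode(data):
    """Encode 4 data bits as an 8-bit extended Hamming codeword.

    Index 0 holds an overall parity bit; indices 1..7 follow the usual
    Hamming(7,4) layout: p1 p2 d1 p4 d2 d3 d4.
    """
    d1, d2, d3, d4 = data
    p1 = d1 ^ d2 ^ d4  # covers positions 1, 3, 5, 7
    p2 = d1 ^ d3 ^ d4  # covers positions 2, 3, 6, 7
    p4 = d2 ^ d3 ^ d4  # covers positions 4, 5, 6, 7
    word = [p1, p2, d1, p4, d2, d3, d4]
    return [parity_bit(word)] + word


def secded_decode(code):
    """Return (status, data); status is 'ok', 'corrected' or 'uncorrectable'."""
    code = list(code)
    # The syndrome is the XOR of the positions (1..7) that hold a 1.
    # It is 0 for a valid codeword and equals the position of a single flip.
    syndrome = 0
    for pos in range(1, 8):
        if code[pos]:
            syndrome ^= pos
    odd_number_of_flips = parity_bit(code) == 1
    if odd_number_of_flips:
        # Assume a single flip; syndrome 0 means the overall parity bit itself.
        code[syndrome] ^= 1
        status = "corrected"
    elif syndrome != 0:
        # Overall parity is consistent but the syndrome is not: two flips.
        return "uncorrectable", None
    else:
        status = "ok"
    return status, [code[3], code[5], code[6], code[7]]


data = [1, 0, 1, 1]
codeword = secded_encode(data)

one_flip = list(codeword)
one_flip[5] ^= 1            # simulate a single-bit upset
two_flips = list(one_flip)
two_flips[2] ^= 1           # simulate a second upset in the same word

print(secded_decode(codeword))   # ('ok', [1, 0, 1, 1])
print(secded_decode(one_flip))   # ('corrected', [1, 0, 1, 1])
print(secded_decode(two_flips))  # ('uncorrectable', None)
```

A plain parity scheme corresponds to keeping only the overall parity bit: it would flag the single flip but could not locate it, and it would miss the double flip entirely.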
2.2 Support for Logical Errors

Although considerable commercial support is available for error detection and correction in memories and system buses, protection from soft errors in logic circuits is almost entirely missing. According to the predictions discussed in Section 1, soft errors are going to be as probable in logic circuits as in system memories or caches, and are expected to become a reliability concern. Much research is under way at the circuit and architecture levels to detect and correct errors in processor logic, but it is still rudimentary and impractical for complex, price-sensitive processors.

Figure 1. Itanium processor reliability and availability features. [9]

3 Influence of Soft Errors on System Software

Designing software error recovery mechanisms requires a clear understanding of how soft errors influence system software. This understanding is also needed to determine whether transient errors are an issue for system software at all and, if so, how severe the issue is. Although no field studies exist, simulations using software fault injection have provided some insight into the problem. The impact of soft errors on a commodity OS is analyzed by Messer et al. [6]. Lu and Reed [2] simulated single-bit memory errors, register file upsets, and MPI message payload corruption, and measured the behavioral response of a suite of MPI applications.

Knowing the potential sources of error in the processor helps in understanding simulated fault injection techniques. The components whose corruption by a soft error has a direct impact on software are:

Processor Register: Regular integer registers are the most vulnerable to transient errors. Because these general-purpose registers contain live data at any given time, single-bit upsets in them are very likely to affect application behavior. Transient errors in the processor logic may also propagate corrupt results to the registers. For example, a transient error during an arithmetic addition may write a wrong result to the destination register. These kinds of errors are very difficult to detect.

Processor Cache: The processor cache (SRAM), including the memory caches and the TLB, is at least protected by parity checks and is less vulnerable than system memory (DRAM).

System Memory: An error in software code or data may change the behavior of the program if it remains undetected. System memory is usually protected by SEC-DED ECC.

System Bus: Bus logic is also prone to soft errors and normally uses a parity check for error detection.

IO Buffers: An error in an IO buffer may corrupt information read from the disk or network. Distributed applications such as MPI programs are more sensitive to soft errors in IO buffers.

To understand how errors affect system software, soft errors can be characterized as follows:

Overwritten: Errors detected in memory or a register during a write may be ignored, since the content is overwritten.

User Signalable: The error is detected while reading from memory or from a processor register. An error on a memory read is considered user signalable if the location read is in user data/code space, whether the processor is executing user or kernel code. A register read is considered user signalable if the processor is executing user code. Recovery from these errors is possible by signaling the user application, either for termination or for potential application-level recovery.

Kernel Fatal: The error is detected during a read of memory located in kernel data/code space, or during a register read while executing kernel code. These errors may corrupt the kernel state and hence are called kernel fatal.
Silent Data Corruption: The error remains undetected by any of the mechanisms but may still corrupt the output. This is the most dangerous of all the possible errors, because there is little sign during execution that could alert the user.

3.1 Error Injections

Given the importance of soft errors to system software, fault injection techniques are used to study software responses to transient faults. Fault injection can be either hardware-based or software-based [5]. Hardware fault injection consists of subjecting chips to heavy-ion radiation to simulate the effect of alpha particles. In contrast, software-implemented fault injection does not require expensive equipment and can target specific software components, such as the operating system, software libraries, or applications.

Messer et al. [6] performed investigations on an IA-32 platform using watch points to simulate memory errors. The /proc kernel virtual file system interface is used to set up a watch point through an entry called /proc/mfi. The watch point facility does not allow more than one virtual address to be monitored simultaneously. A user program randomly selects the physical address for error injection. The kernel then searches for the virtual address (kernel or user) that maps to the physical address provided by the user program. A reverse page table entry (PTE) lookup is performed when a task is first scheduled after the error was injected. A timeout-based mechanism keeps the simulations time-bound.

Lu and Reed [2] used memory fault injection to target both registers and application memory regions. Their fault injector employs different techniques for different regions of the address space, and also injects faults into IO buffers.

3.2 Soft Error Analysis

The simulated experiments conducted in previous research [2,6] yield the following insights into the impact of soft errors on system software:
- Registers and IO buffers are particularly vulnerable to single-bit-flip faults, at an average of 34.7% of all activated faults.
- When IO buffer faults activate, the chance of producing wrong output can be quite high, ranging from 28% to 71%.
- 90% of memory errors need not be fatal to the operating system's execution and may require only minor support for partial recovery.
- A large number of memory error activations are overwrites. This stems from the write-before-read use of most memory locations.
- Kernel-fatal memory accesses account for only a small fraction of all memory errors.
- For user applications, memory errors in the object heap have a higher activation and susceptibility rate than those in the static data area.
- A large portion of heap error activations is caused by the garbage collector, and these cause fewer application errors than other sources of activation.

This analysis indicates that software-based fault tolerance efforts must target processor registers along with memory, while only a few of the memory faults are actually damaging.

4 Software Based Recovery Approach

Various mechanisms have been proposed [1,7,10] through which system software tries to provide fault tolerance and higher availability guarantees. These methods depend on the level of error-handling support the processor provides to the firmware. Recovery techniques also depend on context: for example, fault-tolerance schemes for distributed systems can differ considerably from schemes for a single system. A generalized scheme for fault containment and recovery is presented in Figure 2. An error is typically detected at the hardware level; it is then interpreted and logged, and, if needed, the next level (firmware) is notified. The interpret/log/notify phases are repeated at successive levels until the error is either recovered or determined to be non-recoverable. This order of events is presented in Figure 3, where the firmware level is split into processor-specific and platform-specific parts, and the OS level into Machine Check Abort (MCA, a serious error exception) and OS-specific parts.

4.1 Error Detection

Hardware typically detects errors through parity checks or ECC. Hardware may not support certain error scenarios, such as soft errors in processor logic. Even when hardware does not detect an error, software may be able to detect the resulting inconsistencies, which typically appear as invalid pointers or incorrect checksums.
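The checksum-based inconsistency detection mentioned above can be sketched in a few lines. This is a minimal illustration under stated assumptions, not a mechanism taken from the surveyed papers: the helper names seal, is_consistent, and flip_bit are hypothetical, and CRC-32 from Python's standard zlib module is used only as a convenient checksum.

```python
import zlib


def seal(payload):
    """Return the payload together with a CRC-32 computed when it was written."""
    return payload, zlib.crc32(payload)


def is_consistent(payload, checksum):
    """Recompute the CRC-32 on read and compare it with the stored value."""
    return zlib.crc32(payload) == checksum


def flip_bit(payload, bit_index):
    """Simulate a soft error by flipping one bit of an immutable bytes object."""
    buf = bytearray(payload)
    buf[bit_index // 8] ^= 1 << (bit_index % 8)
    return bytes(buf)


record, crc = seal(b"account=42;balance=1000")
corrupted = flip_bit(record, 13)

print(is_consistent(record, crc))     # True
print(is_consistent(corrupted, crc))  # False
```

The check only detects the corruption; recovery still requires a second copy of the data or a way to recompute it, which is the subject of Section 4.2.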
We have already discussed hardware support for error detection in Section 2. The following sections discuss some well-known software-based transient error detection schemes.

4.1.1 Assertions [11]

Assertions are logic statements inserted at different points in the program that reflect invariant relationships between program variables. Their use can lead to problems, since assertions are not transparent to the programmer and their effectiveness depends largely on the nature of the application and on the programmer's ability.

4.1.2 Control Flow Checking [14]

The basic idea of control flow checking is to partition the application program into basic blocks, i.e., branch-free parts of the code. A deterministic signature is computed for each block, and faults are detected by comparing the run-time signature with a precomputed one. In most control flow checking techniques, one of the main problems is tuning the test granularity that should be used.

4.1.3 Procedure Duplication [8]

With procedure duplication, the programmer decides to duplicate the most critical procedures and to compare the results obtained. This approach requires the programmer to define the set of procedures to be duplicated and to introduce the proper checks on the results. These code modifications can only be made manually and may themselves introduce errors.

Figure 2. Errors are detected, then the error state is logged and interpreted, and recovery is attempted. If unsuccessful, the next level may be notified. [7]

4.1.4 Data and Code Redundancy [10]

Data and code redundancy is proposed to detect errors affecting both data and code. The redundancy is introduced through a set of transformations performed on the high-level source code. Errors in data are detected by duplicating each variable and adding consistency checks after every read operation. Other transformations focus on errors affecting the code; they consist, on the one hand, of duplicating the code that implements each operation and, on the other, of adding checks to verify the consistency of the executed operations. The main advantage of the method is that it can be applied automatically to high-level source code, freeing the programmer from the burden of guaranteeing robustness against errors (e.g., by selecting what to duplicate and where to put the checks). The method is completely independent of the underlying hardware and addresses any kind of fault affecting either the code or the data.

Figure 3. Memory failure recovery scenario. A memory error is typically detected by hardware (HW). If the error cannot be contained, firmware (FW) is notified. FW gathers information and attempts recovery at the processor level and at the platform level. If recovery is possible, the state is prepared for the OS and the OS is notified. The OS attempts recovery at the MCA level and at the OS level. If OS recovery succeeds, the application is notified with the relevant state. The application analyzes the state and attempts to recover. All but the first arrows are optional. [7]

4.1.5 Directions for Improvements in Fault Detection Techniques

All the methods discussed for transient fault detection assume very little hardware support and are generic techniques. Because of this generic nature, their software overhead is very high; it is highest for the scheme based on data and code redundancy. These overheads may turn out to be too costly for commonly used systems. Hardware-aware software techniques could be a more feasible way to improve error detection in the system as a whole. Some of these low-level techniques may go into firmware, and others may become part of the hardware-dependent OS layer, depending on the specific hardware and OS kernel they target.
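To make the data and code redundancy idea from Section 4.1 concrete, and to show where its overhead comes from, the following sketch duplicates a variable and checks consistency on every read, and duplicates an operation and compares the two results. It is a hand-written illustration, not the automatic source-to-source transformation of [10]; the names DuplicatedVar, FaultDetected, and checked_add are hypothetical.

```python
class FaultDetected(Exception):
    """Raised when two redundant copies or results disagree."""


class DuplicatedVar:
    """A variable stored twice; every read compares the two copies."""

    def __init__(self, value):
        self.copy_a = value
        self.copy_b = value

    def write(self, value):
        # Every write updates both copies, doubling the store traffic.
        self.copy_a = value
        self.copy_b = value

    def read(self):
        # Consistency check after every read, as in the data-redundancy rule.
        if self.copy_a != self.copy_b:
            raise FaultDetected("copies differ: %r vs %r" % (self.copy_a, self.copy_b))
        return self.copy_a


def checked_add(x, y):
    """Perform the addition twice and compare the results (code redundancy)."""
    first = x + y
    second = x + y
    if first != second:
        raise FaultDetected("redundant additions disagree")
    return first


counter = DuplicatedVar(40)
counter.write(checked_add(counter.read(), 2))
print(counter.read())  # 42

counter.copy_a ^= 1 << 3  # simulate a single-bit upset in one copy
try:
    counter.read()
except FaultDetected as err:
    print("detected:", err)
```

Each logical read costs two loads and a comparison, and each operation is executed twice, which is consistent with the survey's observation that this scheme has the highest software overhead.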
4.2 Fault Recovery Mechanisms

Whenever a fault is detected in hardware, the processor tries to correct it. If the fault is uncorrectable, the processor tries to contain the error by giving firmware an opportunity for error recovery. We have already discussed hardware-based error recovery and containment mechanisms in Section 2. In this section we investigate how software (firmware, OS, or application) can react to errors and return to the consistent state that existed before the failure, given that the system is capable of detecting transient faults missed by the hardware. If the error cannot be reported in a precise and restartable manner, the software needs to offer greater support for recovery. For software to be able to restart transactions, sufficient state must be saved. Hence, for the same level of fault tolerance, software complexity increases as hardware support decreases.

Based on the classification of faults by severity in Section 3, various recovery mechanisms can be implemented in software according to the level of availability expected from the system as a whole. A few of the mechanisms for OS recovery are discussed next:

User Signalable: In the case of user signalable errors, the state of a particular user program has become corrupt, but the processor may allow the kernel to continue operating. As a result, the kernel can signal the user task and proceed with another one, or interrupt the system call. The user program may then handle recovery at the application level, according to the application's availability requirements.

Kernel Fatal: Error recovery may be possible through analysis of the kernel state. For example, an error in a duplicated memory region may be recovered by re-fetching the data from the correct copy. Corruption within logs or statistical counters should not bring the system down. More complicated checkpoint-based rollback recovery mechanisms may also be implemented.

The recovery mechanism differs at, and depends on, each individual level (Hardware, Firmware, OS, and Application). For example, a distributed application may still provide availability when one node fails. Hence the outcome of recovery can be full recovery, partial recovery, or system failure, depending on the detected error. Refer to Table 1 for details. Full recovery effectively masks the errors from higher levels; the error may be logged for statistical purposes. If recovery is not possible at a particular level, the system is halted to prevent corrupt data from propagating to the network or disk.

Table 1. Failure Recovery Outcome [7]

Level        Full Recovery        Partial Recovery                            System Failure
Hardware     mask errors          halt/downgrade performance/functionality    halt/reboot
Firmware     mask error           notify OS                                   reboot (notify OS)
OS           continue to execute  notify app, kill user thread                reboot OS
Application  continue to execute  notify user                                 terminate application

5 Conclusion

Because soft errors are widely expected to become a dominant class of errors, increased support for soft error signaling will be required in the future, not only in hardware but also in software, to increase the availability of the system as a whole. We have discussed many error detection and recovery aspects of the system, at both the hardware and the software level.

6 Road Map

The road map for improving system availability through hardware and software cooperation is summarized as follows:

- A hierarchical approach to recovery from soft errors at different levels may provide an elegant solution for improved system availability. But a naive implementation of fault detection and recovery techniques at each level may be costly in terms of system performance and complexity.
- The implementation of fault detection techniques needs to be balanced across the levels (Hardware, Firmware, OS, and Application) and optimized for performance and complexity.
- Fault recovery mechanisms at each level require a better understanding of the sensitivity of that level to soft errors, so that recovery mechanisms can be optimized for cost and performance.
- Each level may also differentiate critical data structures from a reliability point of view and indicate tolerable latencies for improved reliability.

References

[1] N. S. Bowen and D. K. Pradhan. Processor and Memory-Based Checkpoint and Rollback Recovery. IEEE Computer, 26(2):22-31, Feb.
[2] C. da Lu and D. A. Reed. Assessing Fault Sensitivity in MPI Applications. In Supercomputing, page 37, Pittsburgh, Pennsylvania, Nov.
[3] T. J. Dell. A White Paper on the Benefits of Chipkill-Correct ECC for PC Server Main Memory. IBM Microelectronics Division, Nov.
[4] HP. Tech brief: Hot Plug RAID Memory technology for fault tolerance and scalability, Sept.
[5] M.-C. Hsueh, T. K. Tsai, and R. K. Iyer. Fault Injection Techniques and Tools. IEEE Computer, 30(4):75-82, Apr.
[6] A. Messer, P. Bernadat, G. Fu, D. Chen, Z. Dimitrijevic, D. Lie, D. D. Mannaru, A. Riska, and D. Milojicic. Susceptibility of Commodity Systems and Software to Memory Soft Errors. IEEE Transactions on Computers, 53(12), Dec.
[7] D. Milojicic, A. Messer, J. Shau, G. Fu, P. Alto, and A. Munoz. Increasing relevance of memory hardware errors: a case for recoverable programming models. In ACM SIGOPS European Workshop, Kolding, Denmark, Sept.
[8] D. K. Pradhan. Fault-Tolerant Computer System Design. Prentice Hall PTR.
[9] N. Quach. High Availability and Reliability in the Itanium Processor. IEEE Micro, 20(5):61-69, Sept.-Oct.
[10] M. Rebaudengo, M. S. Reorda, M. Torchiano, and M. Violante. Soft-error Detection through Software Fault-Tolerance Techniques. In IEEE International Symposium on Defect and Fault Tolerance in VLSI Systems, Albuquerque, New Mexico, Nov.
[11] M. Z. Rela, H. Madeira, and J. G. Silva. Experimental Evaluation of the Fail Silent Behavior in Programs with Consistency Checks. In International Symposium on Fault-Tolerant Computing, Sendai, Japan, June.
[12] P. Shivakumar, M. Kistler, S. W. Keckler, D. Burger, and L. Alvisi. Modeling the Effect of Technology Trends on the Soft Error Rate of Combinational Logic. In International Conference on Dependable Systems and Networks, Bethesda, Maryland, June.
[13] J. Xu, S. Chen, Z. Kalbarczyk, and R. K. Iyer. An Experimental Study of Security Vulnerabilities Caused by Errors. In International Conference on Dependable Systems and Networks, Goteborg, Sweden.
[14] S. Yau and F. Chen. An Approach to Concurrent Control Flow Checking. IEEE Transactions on Software Engineering, 6(2), Mar.
[15] J. F. Ziegler, H. W. Curtis, H. P. Muhlfeld, C. J. Montrose, B. Chin, M. Nicewicz, C. A. Russell, W. Y. Wang, L. B. Freeman, P. Hosier, L. E. LaFave, J. L. Walsh, J. M. Orro, G. J. Unger, J. M. Ross, T. J. O'Gorman, B. Messina, T. D. Sullivan, A. J. Sykes, H. Yourke, T. A. Enger, V. Tolat, T. S. Scott, A. H. Taber, R. J. Sussman, W. A. Klein, and C. W. Wahaus. IBM experiments in soft fails in computer electronics. IBM Journal of Research and Development, 40(1):3-18, Jan.
[16] J. F. Ziegler, H. P. Muhlfeld, C. J. Montrose, H. W. Curtis, T. J. O'Gorman, and J. M. Ross. Accelerated testing for cosmic soft-error rate. IBM Journal of Research and Development, 40(1):51-72, Jan.


A Low-Cost Correction Algorithm for Transient Data Errors A Low-Cost Correction Algorithm for Transient Data Errors Aiguo Li, Bingrong Hong School of Computer Science and Technology Harbin Institute of Technology, Harbin 150001, China liaiguo@hit.edu.cn Introduction

More information

Duke University Department of Electrical and Computer Engineering

Duke University Department of Electrical and Computer Engineering Duke University Department of Electrical and Computer Engineering Senior Honors Thesis Spring 2008 Proving the Completeness of Error Detection Mechanisms in Simple Core Chip Multiprocessors Michael Edward

More information

Robust System Design with MPSoCs Unique Opportunities

Robust System Design with MPSoCs Unique Opportunities Robust System Design with MPSoCs Unique Opportunities Subhasish Mitra Robust Systems Group Departments of Electrical Eng. & Computer Sc. Stanford University Email: subh@stanford.edu Acknowledgment: Stanford

More information

A Robust Bloom Filter

A Robust Bloom Filter A Robust Bloom Filter Yoon-Hwa Choi Department of Computer Engineering, Hongik University, Seoul, Korea. Orcid: 0000-0003-4585-2875 Abstract A Bloom filter is a space-efficient randomized data structure

More information

Memory Systems IRAM. Principle of IRAM

Memory Systems IRAM. Principle of IRAM Memory Systems 165 other devices of the module will be in the Standby state (which is the primary state of all RDRAM devices) or another state with low-power consumption. The RDRAM devices provide several

More information

AR-SMT: A Microarchitectural Approach to Fault Tolerance in Microprocessors

AR-SMT: A Microarchitectural Approach to Fault Tolerance in Microprocessors AR-SMT: A Microarchitectural Approach to Fault Tolerance in Microprocessors Computer Sciences Department University of Wisconsin Madison http://www.cs.wisc.edu/~ericro/ericro.html ericro@cs.wisc.edu High-Performance

More information

Accurate Analysis of Single Event Upsets in a Pipelined Microprocessor

Accurate Analysis of Single Event Upsets in a Pipelined Microprocessor Accurate Analysis of Single Event Upsets in a Pipelined Microprocessor M. Rebaudengo, M. Sonza Reorda, M. Violante Politecnico di Torino Dipartimento di Automatica e Informatica Torino, Italy www.cad.polito.it

More information

DATA DOMAIN INVULNERABILITY ARCHITECTURE: ENHANCING DATA INTEGRITY AND RECOVERABILITY

DATA DOMAIN INVULNERABILITY ARCHITECTURE: ENHANCING DATA INTEGRITY AND RECOVERABILITY WHITEPAPER DATA DOMAIN INVULNERABILITY ARCHITECTURE: ENHANCING DATA INTEGRITY AND RECOVERABILITY A Detailed Review ABSTRACT No single mechanism is sufficient to ensure data integrity in a storage system.

More information

A Low-Power ECC Check Bit Generator Implementation in DRAMs

A Low-Power ECC Check Bit Generator Implementation in DRAMs 252 SANG-UHN CHA et al : A LOW-POWER ECC CHECK BIT GENERATOR IMPLEMENTATION IN DRAMS A Low-Power ECC Check Bit Generator Implementation in DRAMs Sang-Uhn Cha *, Yun-Sang Lee **, and Hongil Yoon * Abstract

More information

FPGA Implementation of Double Error Correction Orthogonal Latin Squares Codes

FPGA Implementation of Double Error Correction Orthogonal Latin Squares Codes FPGA Implementation of Double Error Correction Orthogonal Latin Squares Codes E. Jebamalar Leavline Assistant Professor, Department of ECE, Anna University, BIT Campus, Tiruchirappalli, India Email: jebilee@gmail.com

More information

ECC Protection in Software

ECC Protection in Software Center for RC eliable omputing ECC Protection in Software by Philip P Shirvani RATS June 8, 1999 Outline l Motivation l Requirements l Coding Schemes l Multiple Error Handling l Implementation in ARGOS

More information

Error Detecting and Correcting Code Using Orthogonal Latin Square Using Verilog HDL

Error Detecting and Correcting Code Using Orthogonal Latin Square Using Verilog HDL Error Detecting and Correcting Code Using Orthogonal Latin Square Using Verilog HDL Ch.Srujana M.Tech [EDT] srujanaxc@gmail.com SR Engineering College, Warangal. M.Sampath Reddy Assoc. Professor, Department

More information
