ECE 254 / CPS 225 Fault Tolerant and Testable Computing Systems
Outline
Real Systems: Hardware Solutions for Tolerating Hardware Faults
- Microprocessors
- Memory
- Disks
- Networks
- Multiprocessors
Copyright 2011 Daniel J. Sorin, Duke University

Microprocessor Errors/Failures
- Error models:
  - Transient stuck-at (bit flip) on a transistor or wire
  - Hard stuck-at on a transistor or wire
  - Chipkill: the whole chip is dead (e.g., due to a power/ground short)
- Failure models:
  - Incorrect instruction trap/exception
  - Incorrect output
  - Dead chip (no output and/or smoke output)

Microprocessor Fault Tolerance
- There ain't much! Most common microprocessors are designed to maximize performance per dollar:
  - Intel and AMD's x86-64 multicores
  - Intel Itanium II (1- and 2-core)
  - Sun UltraSPARC IV, UltraSPARC T2 (Niagara 2)
  - IBM Power6 (2-core) has the most fault tolerance in this list
- Microprocessors may have some limited error detection/correction in their L2 or L3 caches
- Note: microprocessors are designed with hardware for performing built-in self-test (BIST). We will cover this topic towards the end of the semester.
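The distinction between the two error models above can be made concrete with a toy fault-injection sketch. This is an illustration only (the function names are ours, not from any real fault-injection tool): a transient stuck-at is a one-shot bit flip that a rewrite or retry can clear, while a hard stuck-at forces the bit to the same value on every access.

```python
def inject_transient(word: int, bit: int) -> int:
    """Transient stuck-at: a one-shot bit flip (a single event upset)."""
    return word ^ (1 << bit)

def inject_hard_stuck_at(word: int, bit: int, value: int) -> int:
    """Hard stuck-at: the bit reads as `value` on every access."""
    if value:
        return word | (1 << bit)
    return word & ~(1 << bit)

w = 0b1010_1100
assert inject_transient(w, 0) == 0b1010_1101             # bit 0 flipped once
assert inject_transient(inject_transient(w, 0), 0) == w  # a rewrite/retry clears it
assert inject_hard_stuck_at(w, 3, 0) == 0b1010_0100      # bit 3 permanently forced to 0
```

The last assertion is the reason retry alone cannot tolerate hard faults: re-reading a hard stuck-at bit returns the same wrong value every time.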
Fault Tolerance in Custom Microprocessors
- Most systems are built from commodity microprocessors
  - Off-the-shelf parts are cost-efficient
  - And, even if they're not very reliable individually, we can design reliable systems out of unreliable parts (remember Teramac!)
- However, custom microprocessors may be built for those systems which require very high availability and/or reliability
  - Example: IBM mainframe microprocessors (e.g., G5 and G6)

Fault Tolerance in the DEC VAX
- DEC's VAX was a very successful family of systems
  - Follow-ons to DEC's PDP-11 computer
  - Forerunner of the DEC/Compaq/Intel Alpha processor (now dead)
  - VAX is known today for being the epitome of CISC-ness
- Could detect and sometimes tolerate many faults:
  - Illegal instruction execution
  - Trying to access restricted memory
  - Arithmetic exceptions (which may be due to faults)
  - Power failure
  - Etc.
- Tries to provide info with the trap/interrupt
  - Places fault type info into a known location
  - Maintains registers specifically for error monitoring

More About the VAX
- Early VAX-11/750 and VAX-11/780 had the following fault tolerance:
  - Built-in self-test (executed at power-on)
  - ECC on main memory
  - Multiple-bit parity on the cache, TLB, and a few other structures
  - Parity bits on the SBI (synchronous backplane interconnect = bus)
  - Field-replaceable unit (FRU) is the chip (instead of the board)
- In the later VAX 8600 and 8700, more fault tolerance was added:
  - Instruction retry
  - Better diagnostics through error logging and analysis
  - Online self-test of the floating point unit (F-box in VAX lingo)
  - Error handling via a microcode routine ("micro-routine")
  - Micro-diagnostics to self-test the system and diagnose faults to FRUs
  - System diagnostic bus (SDB) for console control/observation

IBM RAS
- RAS Strategy for IBM S/390 G5 and G6 (Mueller et al.)
DIVA
- "DIVA: A Reliable Substrate for Deep Submicron Microarchitecture Design" (Todd Austin, MICRO 1999)

Argus
- Albert Meixner, Michael E. Bauer, and Daniel J. Sorin. "Argus: Low-Cost, Comprehensive Error Detection in Simple Cores." 40th Annual IEEE/ACM International Symposium on Microarchitecture (MICRO), December 2007.

Outline
- Microprocessors
- Memory
- Disks
- Networks
- Multiprocessors

Transient Memory Errors
- Transient error models:
  - Single bit error (single event upset: SEU)
  - Burst of bit errors (errors in contiguous bits)
- We used to only worry about DRAM, but now we have to worry about soft errors in SRAM, too
  - Remember the Ziegler paper!
Permanent Memory Errors
- Error models:
  - Single bit or multi-bit stuck-at
  - Memory chip failure ("chipkill")
- Chipkill failures:
  - Chipkill is a fail-stop permanent error/failure model
  - Only applies to memory that is not on the processor chip
    - Off-chip L2 or L3 cache
    - DRAM main memory

Tolerating Transient Memory Errors
- Almost uniformly tolerated with EDC/ECC
- At what granularity would you apply EDC/ECC?
- If EDC, then we need a higher-level mechanism to recover from errors
  - So then why use EDC instead of ECC?
- What kinds of EDC/ECC are appropriate for our transient error models?
  - Parity
    - Single bit
    - Multiple bit
    - Two-dimensional
  - CRC
  - Hamming code
- Which EDC/ECC are NOT appropriate?

Tolerating Permanent Memory Bit Errors
- Caches (SRAM) and memories (DRAM) inherently have lots of redundancy
  - Lots of bits: why not just provision some spares?
  - Then, if a hard fault is detected, map out the faulty bits and replace them with spare bits
  - Disks have been doing this for a long time, but this is a relatively recent development for SRAM and DRAM
- Design issue: granularity of mapping
  - What is the field-replaceable unit?
    - Bit
    - Row
    - Column
  - What are the trade-offs in choosing a granularity?

Tolerating Chipkill Memory Errors
- Requires that we can reconstruct the data on the dead chip from redundant data on other chips
  - Should sound a bit like RAID protection for disks
  - This has been implemented as RAID-M (or "chipkill")
- I won't make you read this paper, but this is a good reference on RAID-M: "A White Paper on the Benefits of Chipkill-Correct ECC for PC Server Main Memory" (Dell)
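To make the Hamming-code option above concrete, here is a minimal Hamming(7,4) single-error-correcting code in Python: 4 data bits plus 3 parity bits, where recomputing the parity checks yields a syndrome that is the 1-based position of a flipped bit. (Real memory ECC uses wider codes, e.g., SEC-DED over 64-bit words, but the principle is the same.)

```python
def hamming74_encode(d):
    """Encode 4 data bits [d1,d2,d3,d4] into the 7-bit codeword [p1,p2,d1,p3,d2,d3,d4]."""
    d1, d2, d3, d4 = d
    p1 = d1 ^ d2 ^ d4
    p2 = d1 ^ d3 ^ d4
    p3 = d2 ^ d3 ^ d4
    return [p1, p2, d1, p3, d2, d3, d4]

def hamming74_correct(c):
    """Recompute the parity checks; a non-zero syndrome locates the faulty bit."""
    c = list(c)
    s1 = c[0] ^ c[2] ^ c[4] ^ c[6]   # checks codeword positions 1,3,5,7
    s2 = c[1] ^ c[2] ^ c[5] ^ c[6]   # checks codeword positions 2,3,6,7
    s3 = c[3] ^ c[4] ^ c[5] ^ c[6]   # checks codeword positions 4,5,6,7
    syndrome = s1 + 2 * s2 + 4 * s3
    if syndrome:
        c[syndrome - 1] ^= 1          # flip the faulty bit back
    return [c[2], c[4], c[5], c[6]]   # extract d1..d4

data = [1, 0, 1, 1]
code = hamming74_encode(data)
code[4] ^= 1                          # inject a single-bit transient error (an SEU)
assert hamming74_correct(code) == data
```

Note the contrast with plain parity: single-bit parity would only detect this error (forcing a higher-level recovery mechanism), while the Hamming code corrects it in place.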
Outline
- Microprocessors
- Memory
- Disks
- Networks
- Multiprocessors

Disk Errors
- Error models:
  - Transient single bit error
  - Transient burst of bit errors
  - Permanently bad sector (from a defect or fault)
    - In general, disks don't consider finer granularities
  - Permanently bad disk (because of the storage medium or controller)

Disk Fault Tolerance
- Disks are often considered the "stable storage" on which we save critical data
  - E.g., databases write their important data to disks
- We sometimes back up critical disk systems with tape
  - E.g., your home directory for your account on the EE or CS system
  - Periodically (e.g., nightly, weekly) log diffs to tape
- Disks are generally protected with:
  - Information redundancy (EDC/ECC)
  - Physical redundancy

Disk Physical Redundancy
- Physical redundancy at different granularities
- Sector-level redundancy
  - Disks come with more sectors than specified
  - Can map out a sector with a hard fault and transparently replace it with a spare sector
- Disk-level redundancy
  - Can use multiple disks to tolerate faults that:
    - Corrupt data on one or more disks
    - Completely disable one or more disks
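The sector-level sparing described above can be sketched as a remap table: logical sector numbers normally map to themselves, and a diagnosed hard fault redirects the logical sector to one of the extra physical sectors. This is a toy model (the class and method names are ours, and real firmware also rewrites the salvaged or ECC-reconstructed data), but it shows why the replacement is transparent to the host.

```python
class Disk:
    """Toy model of sector sparing with a logical-to-spare remap table."""
    def __init__(self, n_sectors, n_spares):
        self.remap = {}                                       # logical -> spare physical
        self.free_spares = list(range(n_sectors, n_sectors + n_spares))
        self.store = {}                                       # physical sector -> data

    def _physical(self, logical):
        return self.remap.get(logical, logical)

    def mark_bad(self, logical):
        """Called when EDC/ECC diagnoses a hard fault in this sector."""
        spare = self.free_spares.pop(0)
        data = self.store.pop(self._physical(logical), None)  # salvage if still readable
        self.remap[logical] = spare
        if data is not None:
            self.store[spare] = data

    def write(self, logical, data):
        self.store[self._physical(logical)] = data

    def read(self, logical):
        return self.store[self._physical(logical)]

d = Disk(n_sectors=100, n_spares=4)
d.write(7, b"critical")
d.mark_bad(7)                       # sector 7 develops a hard fault
assert d.read(7) == b"critical"     # transparently served from a spare sector
```

The host keeps addressing "sector 7"; only the internal mapping changed, which is exactly what makes the drive's advertised capacity stable while spares are consumed.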
RAID
- "A Case for Redundant Arrays of Inexpensive Disks (RAID)," by Patterson, Gibson, and Katz (1987)
  - Famous paper that first described RAID
- Basic idea: disks are getting cheap, so let's use a bunch of them to get:
  - Better performance (in terms of throughput)
  - Better fault tolerance
- Many flavors of RAID that trade off:
  - Performance (for reads and/or writes)
  - Fault tolerance
  - Hardware cost

RAID-1
- Instead of keeping all data on N disks, mirror it on 2N disks
  - Faster for reads
  - Slower for writes
- Can tolerate the loss of any single disk
- 100% hardware overhead

RAID-4 and RAID-5
- Stripe data, including parity, at block granularity across the disks
- RAID-4: all parity data on one disk
  - Parity disk becomes a bottleneck, particularly for writes
- RAID-5: parity data spread across the disks

But Our RAID Goes Up to 11
- There are many flavors of RAID that have been developed since the original RAID paper
- RAID-0: striping but no redundancy
  - High performance, but no fault tolerance
- RAID-10: combines RAID-1 and RAID-0 (2 flavors)
  - RAID-0+1: data is organized as stripes across multiple disks, and then the striped disk sets are mirrored
  - RAID-1+0: data is mirrored and the mirrors are striped
- RAID-50: combines RAID-5 and RAID-0 (1 flavor)
  - RAID-5+0: combines the straight block-level striping of RAID-0 with the distributed parity of RAID-5. This is a RAID-0 array striped across RAID-5 elements.
- RAID-30, RAID-100, RAID-1.7, RAID-S, etc.
- You are not expected to memorize this!
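The parity striping in RAID-4/5 rests on one property of XOR: the parity block is the byte-wise XOR of the data blocks in a stripe, so XOR-ing the surviving blocks (including parity) regenerates any single lost block. A short sketch of that reconstruction:

```python
def parity_block(blocks):
    """Byte-wise XOR of a list of equal-length blocks."""
    out = bytearray(len(blocks[0]))
    for blk in blocks:
        for i, b in enumerate(blk):
            out[i] ^= b
    return bytes(out)

# A 4-disk RAID-5 stripe: three data blocks plus one parity block.
d0, d1, d2 = b"AAAA", b"BBBB", b"CCCC"
p = parity_block([d0, d1, d2])

# The disk holding d1 dies: XOR the survivors to rebuild its contents.
rebuilt_d1 = parity_block([d0, d2, p])
assert rebuilt_d1 == d1
```

This also shows why these levels tolerate only one failed disk per stripe: with two blocks missing, the single parity equation no longer has a unique solution.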
Implementing RAID
- Can implement it either in hardware or software
  - Hardware: a special hardware controller manages access to the RAID array
  - Software: the hardware is oblivious, and the OS manages access to the RAID array (through the disk controller)
- Software RAID is generally less effective in terms of performance and fault tolerance, but it can be cheaper and more flexible

RAID in the Real World
- RAID is used very frequently for reliable disk storage
- We have several RAID arrays at Duke

Limitations of RAID
- What faults can't be tolerated with RAID?
- How might we tolerate faults that can't be tolerated with RAID?

Other Issues in Disk I/O
- Still a potential single point of failure at the I/O bus
  - Or at the I/O bridge
- One approach is to have redundant paths

  proc -- I/O bridge -- I/O bus -- disk, disk, disk
Other Issues in Disk I/O
- Still a potential single point of failure at the I/O bus
  - Or at the I/O bridge
- One approach is to have redundant paths

  proc -- I/O bridge -- I/O bus -- disk, disk, disk
                     \- I/O bus (redundant path)

Outline
- Microprocessors
- Memory
- Disks
- Networks
- Multiprocessors

Fault-Tolerant Networks
- A good reference: "Principles and Practices of Interconnection Networks" by Dally and Towles
- Endpoints (e.g., processors) communicate over the network
- The network consists of switches and links

  endpoint -- switch -- switch -- switch -- switch -- endpoint

Network Errors and Failures
- Switch errors/failures:
  - Dead (fail-stop)
  - Internal logic is mis-routing messages
  - Dropping messages
  - Corrupting messages
- Link errors/failures:
  - Dead
  - Corrupting messages with bit errors (e.g., wire stuck-at-X)
- Deadlock
  - The network gets completely stuck and can't make forward progress in routing messages (similar to gridlock on streets)
- Livelock
  - The network is doing work, but not making forward progress
Network Fault Tolerance
- The key, as always, is redundancy:
  - Information: protect communications with EDC/ECC
  - Temporal: the ability to re-send communications
  - Physical: extra switches and links
  - Hybrids: use multiple forms of redundancy
- Most networks use hybrid forms of redundancy
  - E.g., link errors detected with EDC and recovered by re-sending
  - But we'll first talk about the separate types of redundancy before putting them all together

Network Information Redundancy
- Network links often use cyclic redundancy check (CRC) codes for error detection (not correction)
  - An n-bit CRC check can detect all errors of fewer than n bits and all but 1 in 2^n multi-bit errors
- Can use a CRC check at various granularities:
  - Per flit (flit = unit of flow control)
  - Per packet (packet = unit of routing; can have multiple flits)
  - Per message (a message can have multiple packets)
  - Per transaction (may also use a checksum, e.g., ftp)
- What are the advantages/disadvantages of each granularity?
  - Hint: think end-to-end

Network Temporal Redundancy
- If at first you don't succeed, try, try again
- If EDC detects an error, recover from it with re-try
  - Depending on the granularity of EDC, may re-try a flit, packet, or message
- Requires that the sender keep a copy of the message after sending it
  - How long? Until an acknowledgment arrives from the receiver
  - What if the ack gets corrupted/dropped?
- In what scenarios is EDC with re-try preferable to using ECC to detect and correct errors?
  - Hint 1: what is the error-free performance impact of ECC?
  - Hint 2: what is the per-error performance impact of re-try?
- What error model are we assuming?
  - Big hint: how would re-try cope with hard faults?

Network Physical Redundancy
- Networks often have more than the minimum number of switches and links

  CPU1 -- Switch 1 -- {Switch 2 or Switch 3} -- Switch 4 -- CPU2

- Can get from CPU1 to CPU2 in more than one way:
  - CPU1 -- Switch 1 -- Switch 2 -- Switch 4 -- CPU2
  - CPU1 -- Switch 1 -- Switch 3 -- Switch 4 -- CPU2
- This is redundancy that can be used for handling hard faults
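Information and temporal redundancy combine naturally in an acknowledgment/retry scheme. The sketch below (a simplification: real links use per-flit or per-packet CRCs in hardware, plus sequence numbers and timeouts) uses Python's standard `zlib.crc32` as the detecting code: the sender appends a CRC-32 trailer and keeps a copy of the payload; if the receiver's recomputed CRC mismatches, no ack is generated and the sender re-sends the kept copy.

```python
import zlib

def send(payload: bytes) -> bytes:
    """Sender appends a CRC-32 trailer (and keeps a copy until the ack arrives)."""
    return payload + zlib.crc32(payload).to_bytes(4, "big")

def receive(frame: bytes):
    """Receiver recomputes the CRC; a mismatch means drop the frame and wait for a re-send."""
    payload, trailer = frame[:-4], frame[-4:]
    ok = zlib.crc32(payload).to_bytes(4, "big") == trailer
    return (payload if ok else None), ok

msg = b"flit or packet"
frame = send(msg)

corrupted = bytearray(frame)
corrupted[3] ^= 0x40                  # single-bit transient error on the link
data, ok = receive(bytes(corrupted))
assert not ok                         # error detected -> no ack -> sender re-tries

data, ok = receive(send(msg))         # temporal redundancy: re-send the kept copy
assert ok and data == msg
```

The hints above fall out of this structure: the error-free cost is just the 4-byte trailer and one checksum pass (cheaper than correcting codes), while each error costs a full round-trip re-send, so retry wins when errors are rare and transient.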
Network Physical Redundancy
- To cope with a hard fault, must be able to exploit path redundancy
- Requires either:
  - Adaptive routing: no fixed path for communication from point A to point B
  - Fault diagnosis and network reconfiguration: the ability to establish a new fixed path from point A to point B if necessary

Routing
- Static routing: always chooses the same path from A to B
  - Can be implemented with:
    - Table lookup at each switch
    - Sender attaches routing decisions to the packet
- Adaptive routing: can choose a path that enables the best performance/throughput
  - Can route packets around congested or faulty parts of the network
  - Many different algorithms exist

Example: Ethernet
- Local Area Network (LAN) technology for the data link layer (level 2 in OSI)
- Standardized by the IEEE 802.3 standard
- Uses CRC-32 for error detection
  - If an error is detected, a higher-level protocol must decide what to do about it

Example: Internet (TCP/IP)
- TCP is a transport layer protocol (level 4 in OSI) that provides reliable end-to-end communication
  - All TCP segments carry a checksum, which is used by the receiver to detect errors in either the TCP header or the data
  - TCP implements retransmission schemes for data that may be lost or damaged. Positive acknowledgments from the receiver confirm successful reception of data; the lack of a positive acknowledgment, coupled with a timeout period, triggers a retransmission.
- IP is a network layer protocol (level 3 in OSI) that has no role in reliable communication (it is unreliable)
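Routing around a diagnosed hard fault can be sketched on the four-switch topology from the slides. This toy adaptive router (our own construction, not a real routing algorithm such as those in Dally and Towles) just breadth-first-searches for any live path, skipping switches that diagnosis has marked dead:

```python
from collections import deque

# Topology from the slides: CPU1 -- S1 -- {S2, S3} -- S4 -- CPU2
links = {
    "CPU1": ["S1"],
    "S1":   ["CPU1", "S2", "S3"],
    "S2":   ["S1", "S4"],
    "S3":   ["S1", "S4"],
    "S4":   ["S2", "S3", "CPU2"],
    "CPU2": ["S4"],
}

def route(src, dst, dead=()):
    """BFS for any live path, skipping nodes diagnosed as dead."""
    frontier, seen = deque([[src]]), {src}
    while frontier:
        path = frontier.popleft()
        if path[-1] == dst:
            return path
        for nxt in links[path[-1]]:
            if nxt not in seen and nxt not in dead:
                seen.add(nxt)
                frontier.append(path + [nxt])
    return None                      # network is partitioned

assert route("CPU1", "CPU2") == ["CPU1", "S1", "S2", "S4", "CPU2"]
assert route("CPU1", "CPU2", dead={"S2"}) == ["CPU1", "S1", "S3", "S4", "CPU2"]
assert route("CPU1", "CPU2", dead={"S2", "S3"}) is None   # no redundancy left
```

The last case is the key observation: physical redundancy only buys tolerance up to the number of disjoint paths; once both middle switches are dead, no routing policy can help.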
Outline
- Microprocessors
- Memory
- Disks
- Networks
- Multiprocessors

Multiprocessors
- Multiprocessor: a computer system with multiple processors that can communicate with each other (see ECE 259 / CPS 221 for lots of details)
- Processors communicate over an interconnection network (e.g., bus, tree, 2D torus, hypercube, etc.)

Commercial Multiprocessors
- Some current commercial multiprocessors:
  - IBM mainframes and xServers
  - Sun UltraEnterprise (E10000, E12000, etc.)
  - Silicon Graphics (SGI) Origin 3000
  - HP Superdome 9000
- Multicore processors (multiprocessor on a chip)
  - Intel and AMD have multicore (2- and 4-core) chips
  - Sun Niagara1 and Niagara2 (8 cores)
  - Expectations of dozens to hundreds of cores per chip
- Clusters of uniprocessors
  - Hook up a bunch of uniprocessors (e.g., a Beowulf cluster)
  - Use a commodity network, not a tightly coupled interconnect
  - Used in many applications (e.g., Google, Amazon.com)

Multiprocessor Errors/Failures
- A superset of the errors/failures we've looked at for:
  - Microprocessors, memory, disks, interconnection networks
- Tolerating these faults in an MP is different because an MP has so many more components
  - More components means more opportunities for errors/failures
- Also, many MPs are used for reliable computing, even though their constituent parts are not reliable
  - For example, the Sun UltraEnterprise E10000 is a very reliable system with no single point of failure
  - Yet it uses mostly commodity parts, including the UltraSPARC processor
  - Key: many MPs use software to improve reliability in the presence of hardware faults
Multiprocessor Fault Tolerance
- A superset of the fault tolerance we've looked at for:
  - Microprocessors, memory, disks, networks
- May also need to recover in-flight messages
  - Recall the BER schemes for multiprocessors from a few weeks ago
- We also want the ability to recover from a permanently failed microprocessor
- What do we use to provide fault tolerance?
  - The same stuff we've been learning about so far this semester!

Multiprocessors that Use FER
- Tandem Integrity S2
  - TMR processors
- Stratus computer system
  - Pair-and-spare processors
- IBM mainframes
  - Redundancy within processors
  - Redundant processors
- Sun UltraEnterprise server
  - No single point of failure
  - Redundant buses, power supplies, etc.

MP Recovery Without Pure FER
- Would like to avoid having to replicate processors
  - TMR, pair-and-spare, etc. all use lots of hardware and power
- Would like to be able to resume a failed process on another processor
  - Must be able to get that process's data
- How do we do this? Use the lessons we learned about BER!

Multiprocessors that Use BER
- Tandem computers prior to the Integrity S2
  - Periodically checkpoint state on another processor
- Sequoia computer systems
  - Flush state to main memory at every checkpoint
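The TMR approach used by the Tandem Integrity S2 reduces, at its core, to majority voting over three redundant modules. A minimal voter sketch (ours, purely illustrative; the real system votes in hardware on memory and I/O operations, not on Python values):

```python
from collections import Counter

def tmr_vote(outputs):
    """Majority vote over three redundant module outputs.
    Masks a fault in any single module and identifies the disagreeing one."""
    winner, count = Counter(outputs).most_common(1)[0]
    if count < 2:
        raise RuntimeError("no majority: more than one module faulty")
    faulty = [i for i, out in enumerate(outputs) if out != winner]
    return winner, faulty

result, faulty = tmr_vote([42, 42, 42])   # fault-free case
assert result == 42 and faulty == []

result, faulty = tmr_vote([42, 17, 42])   # module 1 suffers a fault
assert result == 42 and faulty == [1]
```

This makes the cost argument in "MP Recovery Without Pure FER" concrete: the voter masks the error with zero recovery latency, but only by paying for three copies of the hardware (and their power) all the time, which is what checkpoint-based BER schemes avoid.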
Multiprocessor FT in the Interconnection Network
- Sun UltraEnterprise E10000
  - Connects processors with 4 buses
  - Interleaved by the address that is requested
  - Can tolerate a hard fault in any bus by mapping out that half of the interconnect (i.e., interleaving every 2 addresses instead of every 4)
- Cray T3E supercomputer
  - Connects (Compaq/Intel) Alpha microprocessors with a 3D torus
  - Can tolerate a hard fault in any node with adaptive routing

Multiprocessor Diagnostics
- Multiprocessors often have extra diagnostic hardware
- Sun UltraEnterprise servers have a system service processor
  - A central controller that is different from the other processors
  - Used to monitor the system and perform diagnostics
- The Thinking Machines CM-5 had its own diagnostic network
  - Used strictly for diagnostic purposes (i.e., not time-multiplexed with active execution data)
- And numerous other examples

Multiprocessor Fault Isolation
- Fault isolation: keep the effects of a fault from propagating into the rest of the system
- Benefits:
  - Enable a system to continue to operate at least partially
  - Recover part of the system while the rest of it is running
  - Prevent additional data from being corrupted
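The E10000-style address interleaving above can be sketched as a simple mapping from address blocks to live buses. The 64-byte interleave block size here is an assumption for illustration (the slides do not give the granularity); the point is only that degrading from 4-way to 2-way interleaving keeps every address reachable while the faulty half is mapped out.

```python
def bus_for_address(addr, live_buses, block=64):
    """Address-interleaved buses: consecutive blocks rotate across the live
    buses, so traffic still spreads out after buses are mapped out."""
    return live_buses[(addr // block) % len(live_buses)]

buses = [0, 1, 2, 3]
assert [bus_for_address(a * 64, buses) for a in range(4)] == [0, 1, 2, 3]

# A hard fault on bus 2 forces mapping out that half of the interconnect:
degraded = [0, 1]
assert [bus_for_address(a * 64, degraded) for a in range(4)] == [0, 1, 0, 1]
assert all(bus_for_address(a * 64, degraded) != 2 for a in range(64))
```

Performance degrades (each surviving bus now carries twice the traffic), but the system stays up, which is the trade the E10000 makes.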
Fault Isolation with Logical Partitioning
- Logical partitioning (LPAR)
  - Logically divide a multiprocessor into multiple partitions
  - These partitions can't affect each other, which provides fault isolation
  - Often requires some amount of software
- Examples:
  - Sun UltraEnterprise E10000
  - IBM mainframes and servers
- Why buy a big machine and partition it? Why not just buy several smaller machines?
- Is a cluster just a logically partitioned multiprocessor?

Fault Isolation with Virtual Machines
- Virtual machines
  - Use software to create multiple virtual machines that run on a single multiprocessor (or even on a single uniprocessor)
  - A crash in one virtual machine doesn't affect the others
- Examples: VMware, Xen
- What fault/error model does this address?
  - Tends to be most useful for tolerating software errors. Why?

BulletProof
- "Ultra Low-Cost Defect Protection for Microprocessor Pipelines" (Shyam et al., ASPLOS 2006)

Core Cannibalization
- "Core Cannibalization Architecture: Improving Lifetime Chip Performance for Multicore Processors in the Presence of Hard Faults" (Romanescu and Sorin, PACT 2008)
Outline
- Microprocessors
- Memory
- Disks
- Networks
- Multiprocessors
Address Accessible Memories A.R. Hurson Department of Computer Science Missouri University of Science & Technology 1 Memory System Memory Requirements for a Computer An internal storage medium to store
More information416 Distributed Systems. Errors and Failures Feb 1, 2016
416 Distributed Systems Errors and Failures Feb 1, 2016 Types of Errors Hard errors: The component is dead. Soft errors: A signal or bit is wrong, but it doesn t mean the component must be faulty Note:
More informationLecture 9: MIMD Architectures
Lecture 9: MIMD Architectures Introduction and classification Symmetric multiprocessors NUMA architecture Clusters Zebo Peng, IDA, LiTH 1 Introduction A set of general purpose processors is connected together.
More informationECE Enterprise Storage Architecture. Fall 2018
ECE590-03 Enterprise Storage Architecture Fall 2018 RAID Tyler Bletsch Duke University Slides include material from Vince Freeh (NCSU) A case for redundant arrays of inexpensive disks Circa late 80s..
More informationCS 3640: Introduction to Networks and Their Applications
CS 3640: Introduction to Networks and Their Applications Fall 2018, Lecture 5: The Link Layer I Errors and medium access Instructor: Rishab Nithyanand Teaching Assistant: Md. Kowsar Hossain 1 You should
More informationReliable Computing I
Instructor: Mehdi Tahoori Reliable Computing I Lecture 8: Redundant Disk Arrays INSTITUTE OF COMPUTER ENGINEERING (ITEC) CHAIR FOR DEPENDABLE NANO COMPUTING (CDNC) National Research Center of the Helmholtz
More informationI/O Hardwares. Some typical device, network, and data base rates
Input/Output 1 I/O Hardwares Some typical device, network, and data base rates 2 Device Controllers I/O devices have components: mechanical component electronic component The electronic component is the
More informationVirtual Memory. Reading. Sections 5.4, 5.5, 5.6, 5.8, 5.10 (2) Lecture notes from MKP and S. Yalamanchili
Virtual Memory Lecture notes from MKP and S. Yalamanchili Sections 5.4, 5.5, 5.6, 5.8, 5.10 Reading (2) 1 The Memory Hierarchy ALU registers Cache Memory Memory Memory Managed by the compiler Memory Managed
More informationCSE 380 Computer Operating Systems
CSE 380 Computer Operating Systems Instructor: Insup Lee University of Pennsylvania Fall 2003 Lecture Note on Disk I/O 1 I/O Devices Storage devices Floppy, Magnetic disk, Magnetic tape, CD-ROM, DVD User
More informationECE 250 / CS250 Introduction to Computer Architecture
ECE 250 / CS250 Introduction to Computer Architecture Main Memory Benjamin C. Lee Duke University Slides from Daniel Sorin (Duke) and are derived from work by Amir Roth (Penn) and Alvy Lebeck (Duke) 1
More informationAlternate definition: Instruction Set Architecture (ISA) What is Computer Architecture? Computer Organization. Computer structure: Von Neumann model
What is Computer Architecture? Structure: static arrangement of the parts Organization: dynamic interaction of the parts and their control Implementation: design of specific building blocks Performance:
More informationECE 574 Cluster Computing Lecture 19
ECE 574 Cluster Computing Lecture 19 Vince Weaver http://www.eece.maine.edu/~vweaver vincent.weaver@maine.edu 10 November 2015 Announcements Projects HW extended 1 MPI Review MPI is *not* shared memory
More informationCS370: Operating Systems [Spring 2017] Dept. Of Computer Science, Colorado State University
Frequently asked questions from the previous class survey CS 370: OPERATING SYSTEMS [MASS STORAGE] How does the OS caching optimize disk performance? How does file compression work? Does the disk change
More information06-Dec-17. Credits:4. Notes by Pritee Parwekar,ANITS 06-Dec-17 1
Credits:4 1 Understand the Distributed Systems and the challenges involved in Design of the Distributed Systems. Understand how communication is created and synchronized in Distributed systems Design and
More informationAn Overview of CORAID Technology and ATA-over-Ethernet (AoE)
An Overview of CORAID Technology and ATA-over-Ethernet (AoE) Dr. Michael A. Covington, Ph.D. University of Georgia 2008 1. Introduction All CORAID products revolve around one goal: making disk storage
More informationLECTURE 5: MEMORY HIERARCHY DESIGN
LECTURE 5: MEMORY HIERARCHY DESIGN Abridged version of Hennessy & Patterson (2012):Ch.2 Introduction Programmers want unlimited amounts of memory with low latency Fast memory technology is more expensive
More informationAgenda. Agenda 11/12/12. Review - 6 Great Ideas in Computer Architecture
/3/2 Review - 6 Great Ideas in Computer Architecture CS 6C: Great Ideas in Computer Architecture (Machine Structures) Dependability and RAID Instructors: Krste Asanovic, Randy H. Katz hfp://inst.eecs.berkeley.edu/~cs6c/fa2.
More informationLecture 23: I/O Redundant Arrays of Inexpensive Disks Professor Randy H. Katz Computer Science 252 Spring 1996
Lecture 23: I/O Redundant Arrays of Inexpensive Disks Professor Randy H Katz Computer Science 252 Spring 996 RHKS96 Review: Storage System Issues Historical Context of Storage I/O Storage I/O Performance
More informationGFS: The Google File System. Dr. Yingwu Zhu
GFS: The Google File System Dr. Yingwu Zhu Motivating Application: Google Crawl the whole web Store it all on one big disk Process users searches on one big CPU More storage, CPU required than one PC can
More informationLecture 9: MIMD Architectures
Lecture 9: MIMD Architectures Introduction and classification Symmetric multiprocessors NUMA architecture Clusters Zebo Peng, IDA, LiTH 1 Introduction MIMD: a set of general purpose processors is connected
More informationMass-Storage. ICS332 Operating Systems
Mass-Storage ICS332 Operating Systems Magnetic Disks Magnetic disks are (still) the most common secondary storage devices today They are messy Errors, bad blocks, missed seeks, moving parts And yet, the
More informationMultiprocessors and Thread Level Parallelism Chapter 4, Appendix H CS448. The Greed for Speed
Multiprocessors and Thread Level Parallelism Chapter 4, Appendix H CS448 1 The Greed for Speed Two general approaches to making computers faster Faster uniprocessor All the techniques we ve been looking
More informationSMP and ccnuma Multiprocessor Systems. Sharing of Resources in Parallel and Distributed Computing Systems
Reference Papers on SMP/NUMA Systems: EE 657, Lecture 5 September 14, 2007 SMP and ccnuma Multiprocessor Systems Professor Kai Hwang USC Internet and Grid Computing Laboratory Email: kaihwang@usc.edu [1]
More informationDependability and ECC
ecture 38 Computer Science 61C Spring 2017 April 24th, 2017 Dependability and ECC 1 Great Idea #6: Dependability via Redundancy Applies to everything from data centers to memory Redundant data centers
More informationCS 43: Computer Networks The Link Layer. Kevin Webb Swarthmore College November 28, 2017
CS 43: Computer Networks The Link Layer Kevin Webb Swarthmore College November 28, 2017 TCP/IP Protocol Stack host host HTTP Application Layer HTTP TCP Transport Layer TCP router router IP IP Network Layer
More informationArchitectural Level Fault- Tolerance Techniques. EECE 513: Design of Fault- tolerant Digital Systems
Architectural Level Fault- Tolerance Techniques EECE 513: Design of Fault- tolerant Digital Systems Learning ObjecDves List the techniques for improving the reliability of commodity & high end processors
More informationECC Protection in Software
Center for RC eliable omputing ECC Protection in Software by Philip P Shirvani RATS June 8, 1999 Outline l Motivation l Requirements l Coding Schemes l Multiple Error Handling l Implementation in ARGOS
More informationOutline. Parallel Database Systems. Information explosion. Parallelism in DBMSs. Relational DBMS parallelism. Relational DBMSs.
Parallel Database Systems STAVROS HARIZOPOULOS stavros@cs.cmu.edu Outline Background Hardware architectures and performance metrics Parallel database techniques Gamma Bonus: NCR / Teradata Conclusions
More informationModern RAID Technology. RAID Primer A Configuration Guide
Modern RAID Technology RAID Primer A Configuration Guide E x c e l l e n c e i n C o n t r o l l e r s Modern RAID Technology RAID Primer A Configuration Guide 6th Edition Copyright 1997-2003 ICP vortex
More informationDistributed Systems. Fault Tolerance. Paul Krzyzanowski
Distributed Systems Fault Tolerance Paul Krzyzanowski Except as otherwise noted, the content of this presentation is licensed under the Creative Commons Attribution 2.5 License. Faults Deviation from expected
More informationI/O CANNOT BE IGNORED
LECTURE 13 I/O I/O CANNOT BE IGNORED Assume a program requires 100 seconds, 90 seconds for main memory, 10 seconds for I/O. Assume main memory access improves by ~10% per year and I/O remains the same.
More informationComputer Architecture: Multithreading (III) Prof. Onur Mutlu Carnegie Mellon University
Computer Architecture: Multithreading (III) Prof. Onur Mutlu Carnegie Mellon University A Note on This Lecture These slides are partly from 18-742 Fall 2012, Parallel Computer Architecture, Lecture 13:
More informationEE 6900: FAULT-TOLERANT COMPUTING SYSTEMS
EE 6900: FAULT-TOLERANT COMPUTING SYSTEMS LECTURE 6: CODING THEORY - 2 Fall 2014 Avinash Kodi kodi@ohio.edu Acknowledgement: Daniel Sorin, Behrooz Parhami, Srinivasan Ramasubramanian Agenda Hamming Codes
More informationPANASAS TIERED PARITY ARCHITECTURE
PANASAS TIERED PARITY ARCHITECTURE Larry Jones, Matt Reid, Marc Unangst, Garth Gibson, and Brent Welch White Paper May 2010 Abstract Disk drives are approximately 250 times denser today than a decade ago.
More informationLecture 21: Reliable, High Performance Storage. CSC 469H1F Fall 2006 Angela Demke Brown
Lecture 21: Reliable, High Performance Storage CSC 469H1F Fall 2006 Angela Demke Brown 1 Review We ve looked at fault tolerance via server replication Continue operating with up to f failures Recovery
More informationGrowth. Individual departments in a university buy LANs for their own machines and eventually want to interconnect with other campus LANs.
Internetworking Multiple networks are a fact of life: Growth. Individual departments in a university buy LANs for their own machines and eventually want to interconnect with other campus LANs. Fault isolation,
More informationRouting Algorithm. How do I know where a packet should go? Topology does NOT determine routing (e.g., many paths through torus)
Routing Algorithm How do I know where a packet should go? Topology does NOT determine routing (e.g., many paths through torus) Many routing algorithms exist 1) Arithmetic 2) Source-based 3) Table lookup
More information6.033 Lecture Fault Tolerant Computing 3/31/2014
6.033 Lecture 14 -- Fault Tolerant Computing 3/31/2014 So far what have we seen: Modularity RPC Processes Client / server Networking Implements client/server Seen a few examples of dealing with faults
More informationCopyright 2012, Elsevier Inc. All rights reserved.
Computer Architecture A Quantitative Approach, Fifth Edition Chapter 2 Memory Hierarchy Design 1 Introduction Programmers want unlimited amounts of memory with low latency Fast memory technology is more
More informationInput/Output. Today. Next. Principles of I/O hardware & software I/O software layers Disks. Protection & Security
Input/Output Today Principles of I/O hardware & software I/O software layers Disks Next Protection & Security Operating Systems and I/O Two key operating system goals Control I/O devices Provide a simple,
More informationRouting Algorithms. Review
Routing Algorithms Today s topics: Deterministic, Oblivious Adaptive, & Adaptive models Problems: efficiency livelock deadlock 1 CS6810 Review Network properties are a combination topology topology dependent
More informationMIMD Overview. Intel Paragon XP/S Overview. XP/S Usage. XP/S Nodes and Interconnection. ! Distributed-memory MIMD multicomputer
MIMD Overview Intel Paragon XP/S Overview! MIMDs in the 1980s and 1990s! Distributed-memory multicomputers! Intel Paragon XP/S! Thinking Machines CM-5! IBM SP2! Distributed-memory multicomputers with hardware
More informationNetworking and Internetworking 1
Networking and Internetworking 1 Today l Networks and distributed systems l Internet architecture xkcd Networking issues for distributed systems Early networks were designed to meet relatively simple requirements
More informationFault Tolerance Dealing with an imperfect world
Fault Tolerance Dealing with an imperfect world Paul Krzyzanowski Rutgers University September 14, 2012 1 Introduction If we look at the words fault and tolerance, we can define the fault as a malfunction
More informationOperating Systems 2010/2011
Operating Systems 2010/2011 Input/Output Systems part 2 (ch13, ch12) Shudong Chen 1 Recap Discuss the principles of I/O hardware and its complexity Explore the structure of an operating system s I/O subsystem
More informationIO System. CP-226: Computer Architecture. Lecture 25 (24 April 2013) CADSL
IO System Virendra Singh Associate Professor Computer Architecture and Dependable Systems Lab Department of Electrical Engineering Indian Institute of Technology Bombay http://www.ee.iitb.ac.in/~viren/
More information