ECE 254 / CPS 225 Fault Tolerant and Testable Computing Systems
Outline
Real Systems: Hardware Solutions for Tolerating Hardware Faults
- Microprocessors
- Memory
- Disks
- Networks
- Multiprocessors
Copyright 2011 Daniel J. Sorin, Duke University

Microprocessor Errors/Failures
- Error models:
  - Transient stuck-at (bit flip) on a transistor or wire
  - Hard stuck-at on a transistor or wire
  - Chipkill: the whole chip is dead (e.g., due to a power/ground short)
- Failure models:
  - Incorrect instruction trap/exception
  - Incorrect output
  - Dead chip (no output and/or smoke output)

Microprocessor Fault Tolerance
- There ain't much! Most common microprocessors are designed to maximize performance per dollar:
  - Intel and AMD's x86-64 multicores
  - Intel Itanium II (1- and 2-core)
  - Sun UltraSPARC IV, UltraSPARC T2 (Niagara 2)
  - IBM Power6 (2-core) has the most fault tolerance in this list
- Microprocessors may have some limited error detection/correction in their L2 or L3 caches
- Note: microprocessors are designed with hardware for performing built-in self-test (BIST). We will cover this topic towards the end of the semester.
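The distinction between the two error models above can be made concrete with a toy fault-injection sketch. This is an illustration only (the function names are ours, not from any real fault-injection tool): a transient stuck-at is a one-shot bit flip that a rewrite or retry can clear, while a hard stuck-at forces the bit to the same value on every access.

```python
def inject_transient(word: int, bit: int) -> int:
    """Transient stuck-at: a one-shot bit flip (a single event upset)."""
    return word ^ (1 << bit)

def inject_hard_stuck_at(word: int, bit: int, value: int) -> int:
    """Hard stuck-at: the bit reads as `value` on every access."""
    if value:
        return word | (1 << bit)
    return word & ~(1 << bit)

w = 0b1010_1100
assert inject_transient(w, 0) == 0b1010_1101             # bit 0 flipped once
assert inject_transient(inject_transient(w, 0), 0) == w  # a rewrite/retry clears it
assert inject_hard_stuck_at(w, 3, 0) == 0b1010_0100      # bit 3 permanently forced to 0
```

The last assertion is the reason retry alone cannot tolerate hard faults: re-reading a hard stuck-at bit returns the same wrong value every time.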
Fault Tolerance in Custom Microprocessors
- Most systems are built from commodity microprocessors
  - Off-the-shelf parts are cost-efficient
  - And, even if they're not very reliable individually, we can design reliable systems out of unreliable parts (remember Teramac!)
- However, custom microprocessors may be built for those systems which require very high availability and/or reliability
  - Example: IBM mainframe microprocessors (e.g., G5 and G6)

Fault Tolerance in the DEC VAX
- DEC's VAX was a very successful family of systems
  - Follow-ons to DEC's PDP-11 computer
  - Forerunner of the DEC/Compaq/Intel Alpha processor (now dead)
  - VAX is known today for being the epitome of CISC-ness
- Could detect and sometimes tolerate many faults:
  - Illegal instruction execution
  - Trying to access restricted memory
  - Arithmetic exceptions (which may be due to faults)
  - Power failure
  - Etc.
- Tries to provide info with the trap/interrupt
  - Places fault type info into a known location
  - Maintains registers specifically for error monitoring

More About the VAX
- Early VAX-11/750 and VAX-11/780 had the following fault tolerance:
  - Built-in self-test (executed at power-on)
  - ECC on main memory
  - Multiple-bit parity on the cache, TLB, and a few other structures
  - Parity bits on the SBI (synchronous backplane interconnect = bus)
  - Field-replaceable unit (FRU) is the chip (instead of the board)
- In the later VAX 8600 and 8700, more fault tolerance was added:
  - Instruction retry
  - Better diagnostics through error logging and analysis
  - Online self-test of the floating point unit (F-box in VAX lingo)
  - Error handling via a microcode routine ("micro-routine")
  - Micro-diagnostics to self-test the system and diagnose faults to FRUs
  - System diagnostic bus (SDB) for console control/observation

IBM RAS
- RAS Strategy for IBM S/390 G5 and G6 (Mueller et al.)
DIVA
- "DIVA: A Reliable Substrate for Deep Submicron Microarchitecture Design" (Todd Austin, MICRO 1999)

Argus
- Albert Meixner, Michael E. Bauer, and Daniel J. Sorin. "Argus: Low-Cost, Comprehensive Error Detection in Simple Cores." 40th Annual IEEE/ACM International Symposium on Microarchitecture (MICRO), December 2007.

Outline
- Microprocessors
- Memory
- Disks
- Networks
- Multiprocessors

Transient Memory Errors
- Transient error models:
  - Single bit error (single event upset: SEU)
  - Burst of bit errors (errors in contiguous bits)
- We used to only worry about DRAM, but now we have to worry about soft errors in SRAM, too
  - Remember the Ziegler paper!
Permanent Memory Errors
- Error models:
  - Single bit or multi-bit stuck-at
  - Memory chip failure ("chipkill")
- Chipkill failures:
  - Chipkill is a fail-stop permanent error/failure model
  - Only applies to memory that is not on the processor chip
    - Off-chip L2 or L3 cache
    - DRAM main memory

Tolerating Transient Memory Errors
- Almost uniformly tolerated with EDC/ECC
- At what granularity would you apply EDC/ECC?
- If EDC, then we need a higher-level mechanism to recover from errors
  - So then why use EDC instead of ECC?
- What kinds of EDC/ECC are appropriate for our transient error models?
  - Parity
    - Single bit
    - Multiple bit
    - Two-dimensional
  - CRC
  - Hamming code
- Which EDC/ECC are NOT appropriate?

Tolerating Permanent Memory Bit Errors
- Caches (SRAM) and memories (DRAM) inherently have lots of redundancy
  - Lots of bits: why not just provision some spares?
  - Then, if a hard fault is detected, map out the faulty bits and replace them with spare bits
  - Disks have been doing this for a long time, but this is a relatively recent development for SRAM and DRAM
- Design issue: granularity of mapping
  - What is the field-replaceable unit?
    - Bit
    - Row
    - Column
  - What are the trade-offs in choosing a granularity?

Tolerating Chipkill Memory Errors
- Requires that we can reconstruct the data on the dead chip from redundant data on other chips
  - Should sound a bit like RAID protection for disks
  - This has been implemented as RAID-M (or "chipkill")
- I won't make you read this paper, but this is a good reference on RAID-M: "A White Paper on the Benefits of Chipkill-Correct ECC for PC Server Main Memory" (Dell)
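To make the Hamming-code option above concrete, here is a minimal Hamming(7,4) single-error-correcting code in Python: 4 data bits plus 3 parity bits, where recomputing the parity checks yields a syndrome that is the 1-based position of a flipped bit. (Real memory ECC uses wider codes, e.g., SEC-DED over 64-bit words, but the principle is the same.)

```python
def hamming74_encode(d):
    """Encode 4 data bits [d1,d2,d3,d4] into the 7-bit codeword [p1,p2,d1,p3,d2,d3,d4]."""
    d1, d2, d3, d4 = d
    p1 = d1 ^ d2 ^ d4
    p2 = d1 ^ d3 ^ d4
    p3 = d2 ^ d3 ^ d4
    return [p1, p2, d1, p3, d2, d3, d4]

def hamming74_correct(c):
    """Recompute the parity checks; a non-zero syndrome locates the faulty bit."""
    c = list(c)
    s1 = c[0] ^ c[2] ^ c[4] ^ c[6]   # checks codeword positions 1,3,5,7
    s2 = c[1] ^ c[2] ^ c[5] ^ c[6]   # checks codeword positions 2,3,6,7
    s3 = c[3] ^ c[4] ^ c[5] ^ c[6]   # checks codeword positions 4,5,6,7
    syndrome = s1 + 2 * s2 + 4 * s3
    if syndrome:
        c[syndrome - 1] ^= 1          # flip the faulty bit back
    return [c[2], c[4], c[5], c[6]]   # extract d1..d4

data = [1, 0, 1, 1]
code = hamming74_encode(data)
code[4] ^= 1                          # inject a single-bit transient error (an SEU)
assert hamming74_correct(code) == data
```

Note the contrast with plain parity: single-bit parity would only detect this error (forcing a higher-level recovery mechanism), while the Hamming code corrects it in place.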
Outline
- Microprocessors
- Memory
- Disks
- Networks
- Multiprocessors

Disk Errors
- Error models:
  - Transient single bit error
  - Transient burst of bit errors
  - Permanently bad sector (from a defect or fault)
    - In general, disks don't consider finer granularities
  - Permanently bad disk (because of the storage medium or controller)

Disk Fault Tolerance
- Disks are often considered the "stable storage" on which we save critical data
  - E.g., databases write their important data to disks
- We sometimes back up critical disk systems with tape
  - E.g., your home directory for your account on the EE or CS system
  - Periodically (e.g., nightly, weekly) log diffs to tape
- Disks are generally protected with:
  - Information redundancy (EDC/ECC)
  - Physical redundancy

Disk Physical Redundancy
- Physical redundancy at different granularities
- Sector-level redundancy
  - Disks come with more sectors than specified
  - Can map out a sector with a hard fault and transparently replace it with a spare sector
- Disk-level redundancy
  - Can use multiple disks to tolerate faults that:
    - Corrupt data on one or more disks
    - Completely disable one or more disks
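The sector-level sparing described above can be sketched as a remap table: logical sector numbers normally map to themselves, and a diagnosed hard fault redirects the logical sector to one of the extra physical sectors. This is a toy model (the class and method names are ours, and real firmware also rewrites the salvaged or ECC-reconstructed data), but it shows why the replacement is transparent to the host.

```python
class Disk:
    """Toy model of sector sparing with a logical-to-spare remap table."""
    def __init__(self, n_sectors, n_spares):
        self.remap = {}                                       # logical -> spare physical
        self.free_spares = list(range(n_sectors, n_sectors + n_spares))
        self.store = {}                                       # physical sector -> data

    def _physical(self, logical):
        return self.remap.get(logical, logical)

    def mark_bad(self, logical):
        """Called when EDC/ECC diagnoses a hard fault in this sector."""
        spare = self.free_spares.pop(0)
        data = self.store.pop(self._physical(logical), None)  # salvage if still readable
        self.remap[logical] = spare
        if data is not None:
            self.store[spare] = data

    def write(self, logical, data):
        self.store[self._physical(logical)] = data

    def read(self, logical):
        return self.store[self._physical(logical)]

d = Disk(n_sectors=100, n_spares=4)
d.write(7, b"critical")
d.mark_bad(7)                       # sector 7 develops a hard fault
assert d.read(7) == b"critical"     # transparently served from a spare sector
```

The host keeps addressing "sector 7"; only the internal mapping changed, which is exactly what makes the drive's advertised capacity stable while spares are consumed.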
RAID
- "A Case for Redundant Arrays of Inexpensive Disks (RAID)," by Patterson, Gibson, and Katz (1987)
  - Famous paper that first described RAID
- Basic idea: disks are getting cheap, so let's use a bunch of them to get:
  - Better performance (in terms of throughput)
  - Better fault tolerance
- Many flavors of RAID that trade off:
  - Performance (for reads and/or writes)
  - Fault tolerance
  - Hardware cost

RAID-1
- Instead of keeping all data on N disks, mirror it on 2N disks
  - Faster for reads
  - Slower for writes
- Can tolerate the loss of any single disk
- 100% hardware overhead

RAID-4 and RAID-5
- Stripe data, including parity, at block granularity across the disks
- RAID-4: all parity data on one disk
  - Parity disk becomes a bottleneck, particularly for writes
- RAID-5: parity data spread across the disks

But Our RAID Goes Up to 11
- There are many flavors of RAID that have been developed since the original RAID paper
- RAID-0: striping but no redundancy
  - High performance, but no fault tolerance
- RAID-10: combines RAID-1 and RAID-0 (2 flavors)
  - RAID-0+1: data is organized as stripes across multiple disks, and then the striped disk sets are mirrored
  - RAID-1+0: data is mirrored and the mirrors are striped
- RAID-50: combines RAID-5 and RAID-0 (1 flavor)
  - RAID-5+0: combines the straight block-level striping of RAID-0 with the distributed parity of RAID-5. This is a RAID-0 array striped across RAID-5 elements.
- RAID-30, RAID-100, RAID-1.7, RAID-S, etc.
- You are not expected to memorize this!
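The parity striping in RAID-4/5 rests on one property of XOR: the parity block is the byte-wise XOR of the data blocks in a stripe, so XOR-ing the surviving blocks (including parity) regenerates any single lost block. A short sketch of that reconstruction:

```python
def parity_block(blocks):
    """Byte-wise XOR of a list of equal-length blocks."""
    out = bytearray(len(blocks[0]))
    for blk in blocks:
        for i, b in enumerate(blk):
            out[i] ^= b
    return bytes(out)

# A 4-disk RAID-5 stripe: three data blocks plus one parity block.
d0, d1, d2 = b"AAAA", b"BBBB", b"CCCC"
p = parity_block([d0, d1, d2])

# The disk holding d1 dies: XOR the survivors to rebuild its contents.
rebuilt_d1 = parity_block([d0, d2, p])
assert rebuilt_d1 == d1
```

This also shows why these levels tolerate only one failed disk per stripe: with two blocks missing, the single parity equation no longer has a unique solution.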
Implementing RAID
- Can implement it either in hardware or software
  - Hardware: a special hardware controller manages access to the RAID array
  - Software: the hardware is oblivious, and the OS manages access to the RAID array (through the disk controller)
- Software RAID is generally less effective in terms of performance and fault tolerance, but it can be cheaper and more flexible

RAID in the Real World
- RAID is used very frequently for reliable disk storage
- We have several RAID arrays at Duke

Limitations of RAID
- What faults can't be tolerated with RAID?
- How might we tolerate faults that can't be tolerated with RAID?

Other Issues in Disk I/O
- Still a potential single point of failure at the I/O bus
  - Or at the I/O bridge
- One approach is to have redundant paths

  proc -- I/O bridge -- I/O bus -- disk, disk, disk
Other Issues in Disk I/O
- Still a potential single point of failure at the I/O bus
  - Or at the I/O bridge
- One approach is to have redundant paths

  proc -- I/O bridge -- I/O bus -- disk, disk, disk
                     \- I/O bus (redundant path)

Outline
- Microprocessors
- Memory
- Disks
- Networks
- Multiprocessors

Fault-Tolerant Networks
- A good reference: "Principles and Practices of Interconnection Networks" by Dally and Towles
- Endpoints (e.g., processors) communicate over the network
- The network consists of switches and links

  endpoint -- switch -- switch -- switch -- switch -- endpoint

Network Errors and Failures
- Switch errors/failures:
  - Dead (fail-stop)
  - Internal logic is mis-routing messages
  - Dropping messages
  - Corrupting messages
- Link errors/failures:
  - Dead
  - Corrupting messages with bit errors (e.g., wire stuck-at-X)
- Deadlock
  - The network gets completely stuck and can't make forward progress in routing messages (similar to gridlock on streets)
- Livelock
  - The network is doing work, but not making forward progress
Network Fault Tolerance
- The key, as always, is redundancy:
  - Information: protect communications with EDC/ECC
  - Temporal: the ability to re-send communications
  - Physical: extra switches and links
  - Hybrids: use multiple forms of redundancy
- Most networks use hybrid forms of redundancy
  - E.g., link errors detected with EDC and recovered by re-sending
  - But we'll first talk about the separate types of redundancy before putting them all together

Network Information Redundancy
- Network links often use cyclic redundancy check (CRC) codes for error detection (not correction)
  - An n-bit CRC check can detect all errors of fewer than n bits and all but 1 in 2^n multi-bit errors
- Can use a CRC check at various granularities:
  - Per flit (flit = unit of flow control)
  - Per packet (packet = unit of routing; can have multiple flits)
  - Per message (a message can have multiple packets)
  - Per transaction (may also use a checksum, e.g., ftp)
- What are the advantages/disadvantages of each granularity?
  - Hint: think end-to-end

Network Temporal Redundancy
- If at first you don't succeed, try, try again
- If EDC detects an error, recover from it with re-try
  - Depending on the granularity of EDC, may re-try a flit, packet, or message
- Requires that the sender keep a copy of the message after sending it
  - How long? Until an acknowledgment arrives from the receiver
  - What if the ack gets corrupted/dropped?
- In what scenarios is EDC with re-try preferable to using ECC to detect and correct errors?
  - Hint 1: what is the error-free performance impact of ECC?
  - Hint 2: what is the per-error performance impact of re-try?
- What error model are we assuming?
  - Big hint: how would re-try cope with hard faults?

Network Physical Redundancy
- Networks often have more than the minimum number of switches and links

  CPU1 -- Switch 1 -- {Switch 2 or Switch 3} -- Switch 4 -- CPU2

- Can get from CPU1 to CPU2 in more than one way:
  - CPU1 -- Switch 1 -- Switch 2 -- Switch 4 -- CPU2
  - CPU1 -- Switch 1 -- Switch 3 -- Switch 4 -- CPU2
- This is redundancy that can be used for handling hard faults
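Information and temporal redundancy combine naturally in an acknowledgment/retry scheme. The sketch below (a simplification: real links use per-flit or per-packet CRCs in hardware, plus sequence numbers and timeouts) uses Python's standard `zlib.crc32` as the detecting code: the sender appends a CRC-32 trailer and keeps a copy of the payload; if the receiver's recomputed CRC mismatches, no ack is generated and the sender re-sends the kept copy.

```python
import zlib

def send(payload: bytes) -> bytes:
    """Sender appends a CRC-32 trailer (and keeps a copy until the ack arrives)."""
    return payload + zlib.crc32(payload).to_bytes(4, "big")

def receive(frame: bytes):
    """Receiver recomputes the CRC; a mismatch means drop the frame and wait for a re-send."""
    payload, trailer = frame[:-4], frame[-4:]
    ok = zlib.crc32(payload).to_bytes(4, "big") == trailer
    return (payload if ok else None), ok

msg = b"flit or packet"
frame = send(msg)

corrupted = bytearray(frame)
corrupted[3] ^= 0x40                  # single-bit transient error on the link
data, ok = receive(bytes(corrupted))
assert not ok                         # error detected -> no ack -> sender re-tries

data, ok = receive(send(msg))         # temporal redundancy: re-send the kept copy
assert ok and data == msg
```

The hints above fall out of this structure: the error-free cost is just the 4-byte trailer and one checksum pass (cheaper than correcting codes), while each error costs a full round-trip re-send, so retry wins when errors are rare and transient.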
Network Physical Redundancy
- To cope with a hard fault, must be able to exploit path redundancy
- Requires either:
  - Adaptive routing: no fixed path for communication from point A to point B
  - Fault diagnosis and network reconfiguration: the ability to establish a new fixed path from point A to point B if necessary

Routing
- Static routing: always chooses the same path from A to B
  - Can be implemented with:
    - Table lookup at each switch
    - Sender attaches routing decisions to the packet
- Adaptive routing: can choose a path that enables the best performance/throughput
  - Can route packets around congested or faulty parts of the network
  - Many different algorithms exist

Example: Ethernet
- Local Area Network (LAN) technology for the data link layer (level 2 in OSI)
- Standardized by the IEEE 802.3 standard
- Uses CRC-32 for error detection
  - If an error is detected, a higher-level protocol must decide what to do about it

Example: Internet (TCP/IP)
- TCP is a transport layer protocol (level 4 in OSI) that provides reliable end-to-end communication
  - All TCP segments carry a checksum, which is used by the receiver to detect errors in either the TCP header or the data
  - TCP implements retransmission schemes for data that may be lost or damaged. Positive acknowledgments from the receiver confirm successful reception of data; the lack of a positive acknowledgment, coupled with a timeout period, triggers a retransmission.
- IP is a network layer protocol (level 3 in OSI) that has no role in reliable communication (it is unreliable)
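Routing around a diagnosed hard fault can be sketched on the four-switch topology from the slides. This toy adaptive router (our own construction, not a real routing algorithm such as those in Dally and Towles) just breadth-first-searches for any live path, skipping switches that diagnosis has marked dead:

```python
from collections import deque

# Topology from the slides: CPU1 -- S1 -- {S2, S3} -- S4 -- CPU2
links = {
    "CPU1": ["S1"],
    "S1":   ["CPU1", "S2", "S3"],
    "S2":   ["S1", "S4"],
    "S3":   ["S1", "S4"],
    "S4":   ["S2", "S3", "CPU2"],
    "CPU2": ["S4"],
}

def route(src, dst, dead=()):
    """BFS for any live path, skipping nodes diagnosed as dead."""
    frontier, seen = deque([[src]]), {src}
    while frontier:
        path = frontier.popleft()
        if path[-1] == dst:
            return path
        for nxt in links[path[-1]]:
            if nxt not in seen and nxt not in dead:
                seen.add(nxt)
                frontier.append(path + [nxt])
    return None                      # network is partitioned

assert route("CPU1", "CPU2") == ["CPU1", "S1", "S2", "S4", "CPU2"]
assert route("CPU1", "CPU2", dead={"S2"}) == ["CPU1", "S1", "S3", "S4", "CPU2"]
assert route("CPU1", "CPU2", dead={"S2", "S3"}) is None   # no redundancy left
```

The last case is the key observation: physical redundancy only buys tolerance up to the number of disjoint paths; once both middle switches are dead, no routing policy can help.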
Outline
- Microprocessors
- Memory
- Disks
- Networks
- Multiprocessors

Multiprocessors
- Multiprocessor: a computer system with multiple processors that can communicate with each other (see ECE 259 / CPS 221 for lots of details)
- Processors communicate over an interconnection network (e.g., bus, tree, 2D torus, hypercube, etc.)

Commercial Multiprocessors
- Some current commercial multiprocessors:
  - IBM mainframes and xServers
  - Sun UltraEnterprise (E10000, E12000, etc.)
  - Silicon Graphics (SGI) Origin 3000
  - HP Superdome 9000
- Multicore processors (multiprocessor on a chip)
  - Intel and AMD have multicore (2- and 4-core) chips
  - Sun Niagara1 and Niagara2 (8 cores)
  - Expectations of dozens to hundreds of cores per chip
- Clusters of uniprocessors
  - Hook up a bunch of uniprocessors (e.g., a Beowulf cluster)
  - Use a commodity network, not a tightly coupled interconnect
  - Used in many applications (e.g., Google, Amazon.com)

Multiprocessor Errors/Failures
- A superset of the errors/failures we've looked at for:
  - Microprocessors, memory, disks, interconnection networks
- Tolerating these faults in an MP is different because an MP has so many more components
  - More components means more opportunities for errors/failures
- Also, many MPs are used for reliable computing, even though their constituent parts are not reliable
  - For example, the Sun UltraEnterprise E10000 is a very reliable system with no single point of failure
  - Yet it uses mostly commodity parts, including the UltraSPARC processor
  - Key: many MPs use software to improve reliability in the presence of hardware faults
Multiprocessor Fault Tolerance
- A superset of the fault tolerance we've looked at for:
  - Microprocessors, memory, disks, networks
- May also need to recover in-flight messages
  - Recall the BER schemes for multiprocessors from a few weeks ago
- We also want the ability to recover from a permanently failed microprocessor
- What do we use to provide fault tolerance?
  - The same stuff we've been learning about so far this semester!

Multiprocessors that Use FER
- Tandem Integrity S2
  - TMR processors
- Stratus computer system
  - Pair-and-spare processors
- IBM mainframes
  - Redundancy within processors
  - Redundant processors
- Sun UltraEnterprise server
  - No single point of failure
  - Redundant buses, power supplies, etc.

MP Recovery Without Pure FER
- Would like to avoid having to replicate processors
  - TMR, pair-and-spare, etc. all use lots of hardware and power
- Would like to be able to resume a failed process on another processor
  - Must be able to get that process's data
- How do we do this? Use the lessons we learned about BER!

Multiprocessors that Use BER
- Tandem computers prior to the Integrity S2
  - Periodically checkpoint state on another processor
- Sequoia computer systems
  - Flush state to main memory at every checkpoint
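The TMR approach used by the Tandem Integrity S2 reduces, at its core, to majority voting over three redundant modules. A minimal voter sketch (ours, purely illustrative; the real system votes in hardware on memory and I/O operations, not on Python values):

```python
from collections import Counter

def tmr_vote(outputs):
    """Majority vote over three redundant module outputs.
    Masks a fault in any single module and identifies the disagreeing one."""
    winner, count = Counter(outputs).most_common(1)[0]
    if count < 2:
        raise RuntimeError("no majority: more than one module faulty")
    faulty = [i for i, out in enumerate(outputs) if out != winner]
    return winner, faulty

result, faulty = tmr_vote([42, 42, 42])   # fault-free case
assert result == 42 and faulty == []

result, faulty = tmr_vote([42, 17, 42])   # module 1 suffers a fault
assert result == 42 and faulty == [1]
```

This makes the cost argument in "MP Recovery Without Pure FER" concrete: the voter masks the error with zero recovery latency, but only by paying for three copies of the hardware (and their power) all the time, which is what checkpoint-based BER schemes avoid.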
Multiprocessor FT in the Interconnection Network
- Sun UltraEnterprise E10000
  - Connects processors with 4 buses
  - Interleaved by the address that is requested
  - Can tolerate a hard fault in any bus by mapping out that half of the interconnect (i.e., interleaving every 2 addresses instead of every 4)
- Cray T3E supercomputer
  - Connects (Compaq/Intel) Alpha microprocessors with a 3D torus
  - Can tolerate a hard fault in any node with adaptive routing

Multiprocessor Diagnostics
- Multiprocessors often have extra diagnostic hardware
- Sun UltraEnterprise servers have a system service processor
  - A central controller that is different from the other processors
  - Used to monitor the system and perform diagnostics
- The Thinking Machines CM-5 had its own diagnostic network
  - Used strictly for diagnostic purposes (i.e., not time-multiplexed with active execution data)
- And numerous other examples

Multiprocessor Fault Isolation
- Fault isolation: keep the effects of a fault from propagating into the rest of the system
- Benefits:
  - Enable a system to continue to operate at least partially
  - Recover part of the system while the rest of it is running
  - Prevent additional data from being corrupted
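The E10000-style address interleaving above can be sketched as a simple mapping from address blocks to live buses. The 64-byte interleave block size here is an assumption for illustration (the slides do not give the granularity); the point is only that degrading from 4-way to 2-way interleaving keeps every address reachable while the faulty half is mapped out.

```python
def bus_for_address(addr, live_buses, block=64):
    """Address-interleaved buses: consecutive blocks rotate across the live
    buses, so traffic still spreads out after buses are mapped out."""
    return live_buses[(addr // block) % len(live_buses)]

buses = [0, 1, 2, 3]
assert [bus_for_address(a * 64, buses) for a in range(4)] == [0, 1, 2, 3]

# A hard fault on bus 2 forces mapping out that half of the interconnect:
degraded = [0, 1]
assert [bus_for_address(a * 64, degraded) for a in range(4)] == [0, 1, 0, 1]
assert all(bus_for_address(a * 64, degraded) != 2 for a in range(64))
```

Performance degrades (each surviving bus now carries twice the traffic), but the system stays up, which is the trade the E10000 makes.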
Fault Isolation with Logical Partitioning
- Logical partitioning (LPAR)
  - Logically divide a multiprocessor into multiple partitions
  - These partitions can't affect each other, which provides fault isolation
  - Often requires some amount of software
- Examples:
  - Sun UltraEnterprise E10000
  - IBM mainframes and servers
- Why buy a big machine and partition it? Why not just buy several smaller machines?
- Is a cluster just a logically partitioned multiprocessor?

Fault Isolation with Virtual Machines
- Virtual machines
  - Use software to create multiple virtual machines that run on a single multiprocessor (or even on a single uniprocessor)
  - A crash in one virtual machine doesn't affect the others
- Examples: VMware, Xen
- What fault/error model does this address?
  - Tends to be most useful for tolerating software errors. Why?

BulletProof
- "Ultra Low-Cost Defect Protection for Microprocessor Pipelines" (Shyam et al., ASPLOS 2006)

Core Cannibalization
- "Core Cannibalization Architecture: Improving Lifetime Chip Performance for Multicore Processors in the Presence of Hard Faults" (Romanescu and Sorin, PACT 2008)
Outline
- Microprocessors
- Memory
- Disks
- Networks
- Multiprocessors
Address Accessible Memories A.R. Hurson Department of Computer Science Missouri University of Science & Technology 1 Memory System Memory Requirements for a Computer An internal storage medium to store
More information416 Distributed Systems. Errors and Failures Feb 1, 2016
416 Distributed Systems Errors and Failures Feb 1, 2016 Types of Errors Hard errors: The component is dead. Soft errors: A signal or bit is wrong, but it doesn t mean the component must be faulty Note:
More informationLecture 9: MIMD Architectures
Lecture 9: MIMD Architectures Introduction and classification Symmetric multiprocessors NUMA architecture Clusters Zebo Peng, IDA, LiTH 1 Introduction A set of general purpose processors is connected together.
More informationECE Enterprise Storage Architecture. Fall 2018
ECE590-03 Enterprise Storage Architecture Fall 2018 RAID Tyler Bletsch Duke University Slides include material from Vince Freeh (NCSU) A case for redundant arrays of inexpensive disks Circa late 80s..
More informationCS 3640: Introduction to Networks and Their Applications
CS 3640: Introduction to Networks and Their Applications Fall 2018, Lecture 5: The Link Layer I Errors and medium access Instructor: Rishab Nithyanand Teaching Assistant: Md. Kowsar Hossain 1 You should
More informationReliable Computing I
Instructor: Mehdi Tahoori Reliable Computing I Lecture 8: Redundant Disk Arrays INSTITUTE OF COMPUTER ENGINEERING (ITEC) CHAIR FOR DEPENDABLE NANO COMPUTING (CDNC) National Research Center of the Helmholtz
More informationI/O Hardwares. Some typical device, network, and data base rates
Input/Output 1 I/O Hardwares Some typical device, network, and data base rates 2 Device Controllers I/O devices have components: mechanical component electronic component The electronic component is the
More informationVirtual Memory. Reading. Sections 5.4, 5.5, 5.6, 5.8, 5.10 (2) Lecture notes from MKP and S. Yalamanchili
Virtual Memory Lecture notes from MKP and S. Yalamanchili Sections 5.4, 5.5, 5.6, 5.8, 5.10 Reading (2) 1 The Memory Hierarchy ALU registers Cache Memory Memory Memory Managed by the compiler Memory Managed
More informationCSE 380 Computer Operating Systems
CSE 380 Computer Operating Systems Instructor: Insup Lee University of Pennsylvania Fall 2003 Lecture Note on Disk I/O 1 I/O Devices Storage devices Floppy, Magnetic disk, Magnetic tape, CD-ROM, DVD User
More informationECE 250 / CS250 Introduction to Computer Architecture
ECE 250 / CS250 Introduction to Computer Architecture Main Memory Benjamin C. Lee Duke University Slides from Daniel Sorin (Duke) and are derived from work by Amir Roth (Penn) and Alvy Lebeck (Duke) 1
More informationAlternate definition: Instruction Set Architecture (ISA) What is Computer Architecture? Computer Organization. Computer structure: Von Neumann model
What is Computer Architecture? Structure: static arrangement of the parts Organization: dynamic interaction of the parts and their control Implementation: design of specific building blocks Performance:
More informationECE 574 Cluster Computing Lecture 19
ECE 574 Cluster Computing Lecture 19 Vince Weaver http://www.eece.maine.edu/~vweaver vincent.weaver@maine.edu 10 November 2015 Announcements Projects HW extended 1 MPI Review MPI is *not* shared memory
More informationCS370: Operating Systems [Spring 2017] Dept. Of Computer Science, Colorado State University
Frequently asked questions from the previous class survey CS 370: OPERATING SYSTEMS [MASS STORAGE] How does the OS caching optimize disk performance? How does file compression work? Does the disk change
More information06-Dec-17. Credits:4. Notes by Pritee Parwekar,ANITS 06-Dec-17 1
Credits:4 1 Understand the Distributed Systems and the challenges involved in Design of the Distributed Systems. Understand how communication is created and synchronized in Distributed systems Design and
More informationAn Overview of CORAID Technology and ATA-over-Ethernet (AoE)
An Overview of CORAID Technology and ATA-over-Ethernet (AoE) Dr. Michael A. Covington, Ph.D. University of Georgia 2008 1. Introduction All CORAID products revolve around one goal: making disk storage
More informationLECTURE 5: MEMORY HIERARCHY DESIGN
LECTURE 5: MEMORY HIERARCHY DESIGN Abridged version of Hennessy & Patterson (2012):Ch.2 Introduction Programmers want unlimited amounts of memory with low latency Fast memory technology is more expensive
More informationAgenda. Agenda 11/12/12. Review - 6 Great Ideas in Computer Architecture
/3/2 Review - 6 Great Ideas in Computer Architecture CS 6C: Great Ideas in Computer Architecture (Machine Structures) Dependability and RAID Instructors: Krste Asanovic, Randy H. Katz hfp://inst.eecs.berkeley.edu/~cs6c/fa2.
More informationLecture 23: I/O Redundant Arrays of Inexpensive Disks Professor Randy H. Katz Computer Science 252 Spring 1996
Lecture 23: I/O Redundant Arrays of Inexpensive Disks Professor Randy H Katz Computer Science 252 Spring 996 RHKS96 Review: Storage System Issues Historical Context of Storage I/O Storage I/O Performance
More informationGFS: The Google File System. Dr. Yingwu Zhu
GFS: The Google File System Dr. Yingwu Zhu Motivating Application: Google Crawl the whole web Store it all on one big disk Process users searches on one big CPU More storage, CPU required than one PC can
More informationLecture 9: MIMD Architectures
Lecture 9: MIMD Architectures Introduction and classification Symmetric multiprocessors NUMA architecture Clusters Zebo Peng, IDA, LiTH 1 Introduction MIMD: a set of general purpose processors is connected
More informationMass-Storage. ICS332 Operating Systems
Mass-Storage ICS332 Operating Systems Magnetic Disks Magnetic disks are (still) the most common secondary storage devices today They are messy Errors, bad blocks, missed seeks, moving parts And yet, the
More informationMultiprocessors and Thread Level Parallelism Chapter 4, Appendix H CS448. The Greed for Speed
Multiprocessors and Thread Level Parallelism Chapter 4, Appendix H CS448 1 The Greed for Speed Two general approaches to making computers faster Faster uniprocessor All the techniques we ve been looking
More informationSMP and ccnuma Multiprocessor Systems. Sharing of Resources in Parallel and Distributed Computing Systems
Reference Papers on SMP/NUMA Systems: EE 657, Lecture 5 September 14, 2007 SMP and ccnuma Multiprocessor Systems Professor Kai Hwang USC Internet and Grid Computing Laboratory Email: kaihwang@usc.edu [1]
More informationDependability and ECC
ecture 38 Computer Science 61C Spring 2017 April 24th, 2017 Dependability and ECC 1 Great Idea #6: Dependability via Redundancy Applies to everything from data centers to memory Redundant data centers
More informationCS 43: Computer Networks The Link Layer. Kevin Webb Swarthmore College November 28, 2017
CS 43: Computer Networks The Link Layer Kevin Webb Swarthmore College November 28, 2017 TCP/IP Protocol Stack host host HTTP Application Layer HTTP TCP Transport Layer TCP router router IP IP Network Layer
More informationArchitectural Level Fault- Tolerance Techniques. EECE 513: Design of Fault- tolerant Digital Systems
Architectural Level Fault- Tolerance Techniques EECE 513: Design of Fault- tolerant Digital Systems Learning ObjecDves List the techniques for improving the reliability of commodity & high end processors
More informationECC Protection in Software
Center for RC eliable omputing ECC Protection in Software by Philip P Shirvani RATS June 8, 1999 Outline l Motivation l Requirements l Coding Schemes l Multiple Error Handling l Implementation in ARGOS
More informationOutline. Parallel Database Systems. Information explosion. Parallelism in DBMSs. Relational DBMS parallelism. Relational DBMSs.
Parallel Database Systems STAVROS HARIZOPOULOS stavros@cs.cmu.edu Outline Background Hardware architectures and performance metrics Parallel database techniques Gamma Bonus: NCR / Teradata Conclusions
More informationModern RAID Technology. RAID Primer A Configuration Guide
Modern RAID Technology RAID Primer A Configuration Guide E x c e l l e n c e i n C o n t r o l l e r s Modern RAID Technology RAID Primer A Configuration Guide 6th Edition Copyright 1997-2003 ICP vortex
More informationDistributed Systems. Fault Tolerance. Paul Krzyzanowski
Distributed Systems Fault Tolerance Paul Krzyzanowski Except as otherwise noted, the content of this presentation is licensed under the Creative Commons Attribution 2.5 License. Faults Deviation from expected
More informationI/O CANNOT BE IGNORED
LECTURE 13 I/O I/O CANNOT BE IGNORED Assume a program requires 100 seconds, 90 seconds for main memory, 10 seconds for I/O. Assume main memory access improves by ~10% per year and I/O remains the same.
More informationComputer Architecture: Multithreading (III) Prof. Onur Mutlu Carnegie Mellon University
Computer Architecture: Multithreading (III) Prof. Onur Mutlu Carnegie Mellon University A Note on This Lecture These slides are partly from 18-742 Fall 2012, Parallel Computer Architecture, Lecture 13:
More informationEE 6900: FAULT-TOLERANT COMPUTING SYSTEMS
EE 6900: FAULT-TOLERANT COMPUTING SYSTEMS LECTURE 6: CODING THEORY - 2 Fall 2014 Avinash Kodi kodi@ohio.edu Acknowledgement: Daniel Sorin, Behrooz Parhami, Srinivasan Ramasubramanian Agenda Hamming Codes
More informationPANASAS TIERED PARITY ARCHITECTURE
PANASAS TIERED PARITY ARCHITECTURE Larry Jones, Matt Reid, Marc Unangst, Garth Gibson, and Brent Welch White Paper May 2010 Abstract Disk drives are approximately 250 times denser today than a decade ago.
More informationLecture 21: Reliable, High Performance Storage. CSC 469H1F Fall 2006 Angela Demke Brown
Lecture 21: Reliable, High Performance Storage CSC 469H1F Fall 2006 Angela Demke Brown 1 Review We ve looked at fault tolerance via server replication Continue operating with up to f failures Recovery
More informationGrowth. Individual departments in a university buy LANs for their own machines and eventually want to interconnect with other campus LANs.
Internetworking Multiple networks are a fact of life: Growth. Individual departments in a university buy LANs for their own machines and eventually want to interconnect with other campus LANs. Fault isolation,
More informationRouting Algorithm. How do I know where a packet should go? Topology does NOT determine routing (e.g., many paths through torus)
Routing Algorithm How do I know where a packet should go? Topology does NOT determine routing (e.g., many paths through torus) Many routing algorithms exist 1) Arithmetic 2) Source-based 3) Table lookup
More information6.033 Lecture Fault Tolerant Computing 3/31/2014
6.033 Lecture 14 -- Fault Tolerant Computing 3/31/2014 So far what have we seen: Modularity RPC Processes Client / server Networking Implements client/server Seen a few examples of dealing with faults
More informationCopyright 2012, Elsevier Inc. All rights reserved.
Computer Architecture A Quantitative Approach, Fifth Edition Chapter 2 Memory Hierarchy Design 1 Introduction Programmers want unlimited amounts of memory with low latency Fast memory technology is more
More informationInput/Output. Today. Next. Principles of I/O hardware & software I/O software layers Disks. Protection & Security
Input/Output Today Principles of I/O hardware & software I/O software layers Disks Next Protection & Security Operating Systems and I/O Two key operating system goals Control I/O devices Provide a simple,
More informationRouting Algorithms. Review
Routing Algorithms Today s topics: Deterministic, Oblivious Adaptive, & Adaptive models Problems: efficiency livelock deadlock 1 CS6810 Review Network properties are a combination topology topology dependent
More informationMIMD Overview. Intel Paragon XP/S Overview. XP/S Usage. XP/S Nodes and Interconnection. ! Distributed-memory MIMD multicomputer
MIMD Overview Intel Paragon XP/S Overview! MIMDs in the 1980s and 1990s! Distributed-memory multicomputers! Intel Paragon XP/S! Thinking Machines CM-5! IBM SP2! Distributed-memory multicomputers with hardware
More informationNetworking and Internetworking 1
Networking and Internetworking 1 Today l Networks and distributed systems l Internet architecture xkcd Networking issues for distributed systems Early networks were designed to meet relatively simple requirements
More informationFault Tolerance Dealing with an imperfect world
Fault Tolerance Dealing with an imperfect world Paul Krzyzanowski Rutgers University September 14, 2012 1 Introduction If we look at the words fault and tolerance, we can define the fault as a malfunction
More informationOperating Systems 2010/2011
Operating Systems 2010/2011 Input/Output Systems part 2 (ch13, ch12) Shudong Chen 1 Recap Discuss the principles of I/O hardware and its complexity Explore the structure of an operating system s I/O subsystem
More informationIO System. CP-226: Computer Architecture. Lecture 25 (24 April 2013) CADSL
IO System Virendra Singh Associate Professor Computer Architecture and Dependable Systems Lab Department of Electrical Engineering Indian Institute of Technology Bombay http://www.ee.iitb.ac.in/~viren/
More information