ECE 254 / CPS 225: Fault Tolerant and Testable Computing Systems

Fault Tolerant and Testable Computing Systems
Real Systems: Hardware Solutions for Tolerating Hardware Faults
Copyright 2011 Daniel J. Sorin, Duke University

Outline
- Microprocessors
- Memory
- Disks
- Networks
- Multiprocessors

Microprocessor Errors/Failures
- Error models
  - Transient stuck-at (bit flip) on transistor or wire
  - Hard stuck-at on transistor or wire
  - Chipkill: whole chip is dead (e.g., due to power/ground short)
- Failure models
  - Incorrect instruction trap/exception
  - Incorrect output
  - Dead chip (no output and/or smoke output)

Microprocessor Fault Tolerance
- There ain't much!
- Most common microprocessors are designed to maximize performance per dollar
  - Intel and AMD's x86-64 multicores
  - Intel Itanium II (1- and 2-core)
  - Sun UltraSPARC IV, UltraSPARC T2 (Niagara 2)
  - IBM Power6 (2-core) has the most fault tolerance in this list
- Microprocessors may have some limited error detection/correction in their L2 or L3 caches
- Note: microprocessors are designed with hardware for performing built-in self-test (BIST). We will cover this topic towards the end of the semester.

Fault Tolerance in Custom Microprocessors
- Most systems are built from commodity microprocessors
  - Off-the-shelf parts are cost-efficient
  - And, even if they're not very reliable individually, we can design reliable systems out of unreliable parts (remember Teramac!)
- However, custom microprocessors may be built for systems that require very high availability and/or reliability
  - Example: IBM mainframe microprocessors (e.g., G5 and G6)

Fault Tolerance in the DEC VAX
- DEC's VAX was a very successful family of systems
  - Follow-ons to DEC's PDP-11 computer
  - Forerunner of DEC/Compaq/Intel Alpha processor (now dead)
  - VAX is known today for being the epitome of CISC-ness
- Could detect and sometimes tolerate many faults
  - Illegal instruction execution
  - Trying to access restricted memory
  - Arithmetic exceptions (which may be due to faults)
  - Power failure
  - Etc.
- Tries to provide info with trap/interrupt
  - Places fault type info into known location
  - Maintains registers specifically for error monitoring

More About the VAX
- Early VAX-11/750 and VAX-11/780 had the following FT
  - Built-in self-test (executed at power-on)
  - ECC on main memory
  - Multiple-bit parity on cache, TLB, and a few other structures
  - Parity bits on the SBI (synchronous backplane interconnect = bus)
  - Field-replaceable unit (FRU) is the chip (instead of board)
- In the later VAX 8600 and 8700, more FT added
  - Instruction retry
  - Better diagnostics through error logging and analysis
  - Online self-test of floating point unit (F-box in VAX lingo)
  - Error handling via a microcode routine ("micro-routine")
  - Micro-diagnostics to self-test system and diagnose faults to FRUs
  - System diagnostic bus (SDB) for console control/observation

IBM RAS
- RAS Strategy for IBM S/390 G5 and G6 (Mueller et al.)

DIVA
- DIVA: A Reliable Substrate for Deep Submicron Microarchitecture Design (Todd Austin, MICRO 1999)

Argus
- Albert Meixner, Michael E. Bauer, and Daniel J. Sorin. Argus: Low-Cost, Comprehensive Error Detection in Simple Cores. 40th Annual IEEE/ACM International Symposium on Microarchitecture (MICRO), December.

Outline
- Microprocessors
- Memory
- Disks
- Networks
- Multiprocessors

Transient Memory Errors
- Transient error models
  - Single bit error (single event upset: SEU)
  - Burst of bit errors (errors in contiguous bits)
- We used to only worry about DRAM, but now we have to worry about soft errors in SRAM, too
  - Remember the Ziegler paper!

Permanent Memory Errors
- Error models
  - Single bit or multi-bit stuck-at
  - Memory chip failure ("chipkill")
- Chipkill failures
  - Chipkill is a fail-stop permanent error/failure model
  - Only applies to memory that is not on the processor chip
    » Off-chip L2 or L3 cache
    » DRAM main memory

Tolerating Transient Memory Errors
- Almost uniformly tolerated with EDC/ECC
  - At what granularity would you apply EDC/ECC?
- If EDC, then we need a higher-level mechanism to recover from errors
  - So then why use EDC instead of ECC?
- What kinds of EDC/ECC are appropriate for our transient error models?
  - Parity
    » Single bit
    » Multiple bit
    » Two-dimensional
  - CRC
  - Hamming code
- Which EDC/ECC are NOT appropriate for?

Tolerating Permanent Memory Bit Errors
- Caches (SRAM) and memories (DRAM) inherently have lots of redundancy
  - Lots of bits: why not just provision some spares?
  - Then, if a hard fault is detected, map out faulty bits and replace them with spare bits
  - Disks have been doing this for a long time, but this is a relatively recent development for cache and DRAM
- Design issue: granularity of mapping
  - What is the field-replaceable unit?
    » Bit
    » Row
    » Column
  - What are the trade-offs in choosing a granularity?

Tolerating Chipkill Memory Errors
- Requires that we can reconstruct the data on the dead chip from redundant data on other chips
  - Should sound a bit like RAID protection for disks
- This has been implemented as RAID-M (or "chipkill")
- I won't make you read this paper, but it is a good reference on RAID-M
  - A White Paper on the Benefits of Chipkill-Correct ECC for PC Server Main Memory (Dell)
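The Hamming code listed among the EDC/ECC options can be made concrete with a small example. The following is a minimal sketch of a Hamming(7,4) single-error-correcting code, assuming the textbook bit layout (parity bits at positions 1, 2, and 4); the function names are illustrative, and real memory ECC typically uses wider codes (and often an extra parity bit for double-error detection) rather than this toy size.

```python
def hamming74_encode(data):
    """Encode 4 data bits [d1, d2, d3, d4] into a 7-bit codeword.

    Codeword positions 1..7 hold: p1, p2, d1, p4, d2, d3, d4.
    """
    d1, d2, d3, d4 = data
    p1 = d1 ^ d2 ^ d4  # parity over positions 1, 3, 5, 7
    p2 = d1 ^ d3 ^ d4  # parity over positions 2, 3, 6, 7
    p4 = d2 ^ d3 ^ d4  # parity over positions 4, 5, 6, 7
    return [p1, p2, d1, p4, d2, d3, d4]


def hamming74_decode(codeword):
    """Return (data_bits, error_position).

    error_position is 0 when the syndrome is clean; otherwise it is the
    1-based position that was flipped back. A double-bit error produces a
    non-zero syndrome too, so it is silently mis-corrected.
    """
    c = list(codeword)
    s1 = c[0] ^ c[2] ^ c[4] ^ c[6]
    s2 = c[1] ^ c[2] ^ c[5] ^ c[6]
    s4 = c[3] ^ c[4] ^ c[5] ^ c[6]
    position = s1 + 2 * s2 + 4 * s4  # syndrome read as a binary number
    if position:
        c[position - 1] ^= 1  # correct the single flipped bit
    return [c[2], c[4], c[5], c[6]], position


if __name__ == "__main__":
    word = hamming74_encode([1, 0, 1, 1])
    word[5] ^= 1  # inject a single event upset at position 6
    print(hamming74_decode(word))
```

The sketch covers the single-bit (SEU) error model only. A burst that flips two bits of the same codeword is mis-corrected, which is one reason the granularity and choice of code matter for the burst error model.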

Outline
- Microprocessors
- Memory
- Disks
- Networks
- Multiprocessors

Disk Errors
- Error models
  - Transient single bit error
  - Transient burst of bit errors
  - Permanently bad sector (from defect or fault)
    » In general, disks don't consider finer granularities
  - Permanently bad disk (because of storage medium or controller)

Disk Fault Tolerance
- Disks are often considered the "stable storage" on which we save critical data
  - E.g., databases write their important data to disks
- We sometimes back up critical disk systems with tape
  - E.g., your home directory for your account on EE or CS system
  - Periodically (e.g., nightly, weekly) log diffs to tape
- Disks are generally protected with
  - Information redundancy (EDC/ECC)
  - Physical redundancy

Disk Physical Redundancy
- Physical redundancy at different granularities
- Sector-level redundancy
  - Disks come with more sectors than specified
  - Can map out a sector with a hard fault and transparently replace it with a spare sector
- Disk-level redundancy
  - Can use multiple disks to tolerate faults that:
    » Corrupt data on one or more disks
    » Completely disable one or more disks
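Sector-level redundancy amounts to a remapping table between the sector numbers the host sees and the physical sectors on the medium. The sketch below is a hypothetical model of that idea, not how any particular drive firmware works; the class name `SparedDisk` and its methods are made up for illustration.

```python
class SparedDisk:
    """Toy disk with hidden spare sectors (illustrative model only)."""

    def __init__(self, visible_sectors, spare_sectors):
        self.visible = visible_sectors
        # Spares live past the end of the advertised capacity.
        self.free_spares = list(
            range(visible_sectors, visible_sectors + spare_sectors)
        )
        self.remap = {}  # logical sector -> spare physical sector
        self.media = {}  # physical sector -> stored data

    def _physical(self, sector):
        if not 0 <= sector < self.visible:
            raise IndexError("sector out of range")
        return self.remap.get(sector, sector)

    def mark_bad(self, sector):
        """Map out a sector with a hard fault and substitute a spare."""
        self._physical(sector)  # range check
        if not self.free_spares:
            raise RuntimeError("no spare sectors left")
        self.remap[sector] = self.free_spares.pop(0)

    def write(self, sector, data):
        self.media[self._physical(sector)] = data

    def read(self, sector):
        return self.media.get(self._physical(sector))


if __name__ == "__main__":
    disk = SparedDisk(visible_sectors=4, spare_sectors=2)
    disk.mark_bad(1)         # sector 1 is now backed by a spare
    disk.write(1, b"hello")  # the host still addresses it as sector 1
    print(disk.read(1))
```

Remapping is transparent to the host, but it does not recover the data that was on the bad sector; that still has to come from information redundancy or from another disk.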

RAID
- A Case for Redundant Arrays of Inexpensive Disks (RAID), by Patterson, Gibson, and Katz (1987)
  - Famous paper that first described RAID
- Basic idea: disks are getting cheap, so let's use a bunch of them to get
  - Better performance (in terms of throughput)
  - Better fault tolerance
- Many flavors of RAID that trade off:
  - Performance (for reads and/or writes)
  - Fault tolerance
  - Hardware cost

RAID-1
- Instead of keeping all data on N disks, mirror it on 2N disks
  - Faster for reads
  - Slower for writes
  - Can tolerate loss of any single disk
  - 100% hardware overhead

RAID-4 and RAID-5
- Stripe data, including parity, at block granularity across disks
- RAID-4: all parity data on one disk
  - Parity disk becomes bottleneck, particularly for writes
- RAID-5: parity data spread across disks

But Our RAID goes up to 11
- There are many flavors of RAID that have been developed since the original RAID paper
- RAID-0: striping but no redundancy
  - High performance, but no fault tolerance
- RAID-10: combines RAID-1 and RAID-0 (2 flavors)
  - RAID-0+1: data is organized as stripes across multiple disks, and then the striped disk sets are mirrored
  - RAID-1+0: data is mirrored and the mirrors are striped
- RAID-50: combines RAID-5 and RAID-0 (1 flavor)
  - RAID-5+0: combines the straight block-level striping of RAID-0 with the distributed parity of RAID-5. This is a RAID-0 array striped across RAID-5 elements.
- RAID-30, RAID-100, RAID-1.7, RAID-S, etc.
- You are not expected to memorize this!
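The block-level parity used by RAID-4 and RAID-5 is a bytewise XOR across the blocks of a stripe, which is what lets any one lost block be rebuilt from the survivors. The following is a minimal sketch under that assumption; the helper names are hypothetical, and the parity rotation shown for RAID-5 is just one possible layout.

```python
from functools import reduce


def xor_blocks(blocks):
    """Bytewise XOR of equal-length blocks."""
    return bytes(reduce(lambda a, b: a ^ b, column) for column in zip(*blocks))


def make_stripe(data_blocks):
    """Return the data blocks followed by their parity block."""
    return list(data_blocks) + [xor_blocks(data_blocks)]


def rebuild(stripe, lost_index):
    """Reconstruct the block at lost_index from all surviving blocks."""
    survivors = [b for i, b in enumerate(stripe) if i != lost_index]
    return xor_blocks(survivors)


def update_parity(old_parity, old_data, new_data):
    """Small-write parity update: every data write also touches parity."""
    return xor_blocks([old_parity, old_data, new_data])


def parity_disk(stripe_number, n_disks, raid5=True):
    """Disk holding parity for a stripe (assumed rotation for RAID-5)."""
    if raid5:
        return n_disks - 1 - stripe_number % n_disks
    return n_disks - 1  # RAID-4: one dedicated parity disk


if __name__ == "__main__":
    stripe = make_stripe([b"\x01\x02", b"\x10\x20", b"\xff\x00"])
    print(rebuild(stripe, 1) == stripe[1])
    print([parity_disk(s, 4, raid5=False) for s in range(4)])
    print([parity_disk(s, 4) for s in range(4)])
```

Because `update_parity` runs on every write, a RAID-4 array sends all of that traffic to the single disk returned by `parity_disk(..., raid5=False)`, while the rotated layout spreads it across the array.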

Implementing RAID
- Can implement it either in hardware or software
  - Hardware: special hardware controller that manages the access to the RAID array
  - Software: the hardware is oblivious, and the OS manages the access to the RAID array (through the disk controller)
- Software RAID is generally less effective in terms of performance and fault tolerance, but it can be cheaper and more flexible

RAID in the Real World
- RAID is used very frequently for reliable disk storage
- We have several RAID arrays at Duke

Limitations of RAID
- What faults can't be tolerated with RAID?
- How might we tolerate faults that can't be tolerated with RAID?

Other Issues in Disk I/O
- Still a potential single point of failure at the I/O bus
  - Or at I/O bridge
- One approach is to have redundant paths
[Figure: proc, I/O bridge, I/O bus, three disks]

Other Issues in Disk I/O
[Figure: proc, I/O bridge, two I/O buses, three disks]

Outline
- Microprocessors
- Memory
- Disks
- Networks
- Multiprocessors

Fault-Tolerant Networks
- A good reference: Principles and Practices of Interconnection Networks by Dally and Towles
- Endpoints (e.g., processors) communicate over network
- Network consists of switches and links
[Figure: endpoint, four switches, endpoint]

Network Errors and Failures
- Switch errors/failures
  - Dead (fail-stop)
  - Internal logic is mis-routing messages
  - Dropping messages
  - Corrupting messages
- Link errors/failures
  - Dead
  - Corrupting messages with bit errors (e.g., wire stuck-at-x)
- Deadlock
  - Network gets completely stuck and can't make forward progress in routing messages (similar to gridlock on streets)
- Livelock
  - Network is doing work, but not making forward progress

Network Fault Tolerance
- The key, as always, is redundancy
  - Information: protect communications with EDC/ECC
  - Temporal: ability to re-send communications
  - Physical: extra switches and links
  - Hybrids: use multiple forms of redundancy
- Most networks use hybrid forms of redundancy
  - E.g., link errors detected with EDC and recovered by re-sending
- But we'll first talk about the separate types of redundancy before putting them all together

Network Information Redundancy
- Network links often use cyclic redundancy check (CRC) codes for error detection (not correction)
  - An n-bit CRC check can detect all burst errors of less than n bits and all but 1 in 2^n multi-bit errors
- Can use CRC check at various granularities
  - Per flit (flit = unit of flow control)
  - Per packet (packet = unit of routing, can have multiple flits)
  - Per message (message can have multiple packets)
  - Per transaction (may also use checksum, e.g., ftp)
- What are the advantages/disadvantages of each granularity?
  - Hint: think end-to-end

Network Temporal Redundancy
- If at first you don't succeed, try, try again
- If EDC detects an error, recover from it with re-try
  - Depending on granularity of EDC, may re-try flit, packet, or message
- Requires that sender keep copy of message after sending it
  - How long? Until an acknowledgment from the receiver
  - What if the ack gets corrupted/dropped?
- In what scenarios is EDC with re-try preferable to using ECC to detect and correct errors?
  - Hint 1: what is the error-free performance impact of ECC?
  - Hint 2: what is the per-error performance impact of re-try?
- What error model are we assuming?
  - Big hint: how would re-try cope with hard faults?

Network Physical Redundancy
- Networks often have more than the minimum number of switches and links
[Figure: CPU1, Switch 1, Switch 3, Switch 2, Switch 4, CPU2]
- Can get from CPU1 to CPU2 in more than one way
  - CPU1 -> Switch 1 -> Switch 2 -> Switch 4 -> CPU2
  - CPU1 -> Switch 1 -> Switch 3 -> Switch 4 -> CPU2
- This is redundancy that can be used for handling hard faults
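The hybrid of information and temporal redundancy (detect with EDC, recover by re-sending) can be sketched in a few lines. This is a minimal illustration, assuming a CRC-32 appended per message via Python's `zlib.crc32`, and a made-up `flaky_link` channel model that corrupts a fixed number of transmissions; acknowledgments are modeled implicitly by the return value rather than as real messages.

```python
import zlib


def frame(payload: bytes) -> bytes:
    """Append a 4-byte CRC-32 so the receiver can detect corruption."""
    return payload + zlib.crc32(payload).to_bytes(4, "big")


def check(framed: bytes):
    """Return the payload if the CRC matches, otherwise None."""
    payload, received_crc = framed[:-4], framed[-4:]
    if zlib.crc32(payload).to_bytes(4, "big") == received_crc:
        return payload
    return None


def flaky_link(bad_sends):
    """Hypothetical channel: flips one bit in the first bad_sends frames."""
    remaining = {"count": bad_sends}

    def channel(bits: bytes) -> bytes:
        if remaining["count"] > 0:
            remaining["count"] -= 1
            return bytes([bits[0] ^ 0x01]) + bits[1:]
        return bits

    return channel


def send_with_retry(payload: bytes, channel, max_tries=5):
    """Re-send until the receiver's check passes; return (payload, tries)."""
    kept_copy = frame(payload)  # sender keeps its copy until acknowledged
    for attempt in range(1, max_tries + 1):
        delivered = check(channel(kept_copy))
        if delivered is not None:  # stands in for a positive acknowledgment
            return delivered, attempt
    raise RuntimeError("retries exhausted; the fault may be permanent")


if __name__ == "__main__":
    print(send_with_retry(b"hello", flaky_link(2)))
```

The retry limit is where the error model shows: retrying masks transient errors, but a hard fault on the only path corrupts every attempt, so the sender eventually has to give up or use physical redundancy.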

Network Physical Redundancy
- To cope with hard fault, must be able to exploit path redundancy
- Requires either:
  - Adaptive routing: no fixed path for communication from point A to point B
  - Fault diagnosis and network reconfiguration: ability to establish a new fixed path from point A to point B if necessary
[Figure: CPU1, Switch 1, Switch 2, Switch 3, Switch 4, CPU2]

Routing
- Static routing always chooses same path from A to B
  - Can be implemented with:
    » Table lookup at each switch
    » Sender attaches routing decisions to packet
- Adaptive routing can choose a path that enables the best performance/throughput
  - Can route packets around congested or faulty parts of network
  - Many different algorithms exist

Example: Ethernet
- Local Area Network (LAN) technology for data link layer (level 2 in OSI)
- Standardized by the IEEE 802.3 standard
- Uses CRC-32 for error detection
  - If error detected, higher level protocol must decide what to do about it

Example: Internet (TCP/IP)
- TCP is a transport layer protocol (level 4 in OSI) that provides reliable end-to-end communication
  - All TCP segments carry a checksum, which is used by the receiver to detect errors with either the TCP header or data
  - TCP implements retransmission schemes for data that may be lost or damaged. The use of positive acknowledgments by the receiver to the sender confirms successful reception of data. The lack of positive acknowledgments, coupled with a timeout period, calls for a retransmission.
- IP is a network layer protocol (level 3 in OSI) that has no role in reliable communication (it is unreliable)
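Exploiting path redundancy after a hard fault comes down to finding a route that avoids the components diagnosed as faulty. The sketch below uses the four-switch example network from the slides and a plain breadth-first search; it is closer to "fault diagnosis and reconfiguration" than to a real adaptive routing algorithm, and the node names and `route` function are illustrative assumptions.

```python
from collections import deque

# The example topology: two paths between CPU1 and CPU2.
LINKS = {
    "CPU1": ["S1"],
    "S1": ["CPU1", "S2", "S3"],
    "S2": ["S1", "S4"],
    "S3": ["S1", "S4"],
    "S4": ["S2", "S3", "CPU2"],
    "CPU2": ["S4"],
}


def route(src, dst, faulty=frozenset()):
    """Shortest path from src to dst that avoids faulty nodes, or None."""
    if src in faulty or dst in faulty:
        return None
    parent = {src: None}
    queue = deque([src])
    while queue:
        node = queue.popleft()
        if node == dst:
            path = []
            while node is not None:  # walk parent links back to src
                path.append(node)
                node = parent[node]
            return path[::-1]
        for neighbor in LINKS[node]:
            if neighbor not in parent and neighbor not in faulty:
                parent[neighbor] = node
                queue.append(neighbor)
    return None


if __name__ == "__main__":
    print(route("CPU1", "CPU2"))
    print(route("CPU1", "CPU2", faulty={"S2"}))
    print(route("CPU1", "CPU2", faulty={"S1"}))
```

In this topology a fault in Switch 2 or Switch 3 can be routed around, while Switch 1 and Switch 4 remain single points of failure, so the last call returns `None`.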

Outline
- Microprocessors
- Memory
- Disks
- Networks
- Multiprocessors

Multiprocessors
- Multiprocessor: computer system with multiple processors that can communicate with each other (see ECE 259 / CPS 221 for lots of details)
- Processors communicate over interconnection network (e.g., bus, tree, 2D torus, hypercube, etc.)
[Figure: interconnection network]

Commercial Multiprocessors
- Some current commercial multiprocessors:
  - IBM mainframes and xservers
  - Sun UltraEnterprise (E10000, E12000, etc.)
  - Silicon Graphics (SGI) Origin 3000
  - HP Superdome 9000
- Multicore processors (multiprocessor on a chip)
  - Intel and AMD have multicore (2 and 4-core) chips
  - Sun Niagara1 and Niagara2 (8 cores)
  - Expectations of dozens to hundreds of cores per chip
- Clusters of uniprocessors
  - Hook up a bunch of uniprocessors (e.g., Beowulf cluster)
  - Use a commodity network, not a tightly coupled interconnect
  - Used in many applications (e.g., Google, Amazon.com)

Multiprocessor Errors/Failures
- A superset of the errors/failures we've looked at for:
  - Microprocessors, memory, disks, interconnection network
- Tolerating these faults in an MP is different because an MP has so many more components
  - More components means more opportunities for errors/failures
- Also, many MPs are used for reliable computing, even though their constituent parts are not reliable
  - For example, Sun UltraEnterprise E10000 is a very reliable system with no single point of failure
  - Yet it uses mostly commodity parts, including UltraSparc processor
- Key: many MPs use software to improve reliability in the presence of hardware faults

Multiprocessor Fault Tolerance
- A superset of the fault tolerance we've looked at for:
  - Microprocessors, memory, disks, network
- May also need to recover in-flight messages
  - Recall BER schemes for multiprocessors from a few weeks ago
- We also want the ability to recover from a permanently failed microprocessor
- What do we use to provide fault tolerance?
  - Same stuff we've been learning about so far this semester!

Multiprocessors that Use FER
- Tandem Integrity S2
  - TMR processors
- Stratus computer system
  - Pair-and-spare processors
- IBM mainframes
  - Redundancy within processors
  - Redundant processors
- Sun UltraEnterprise server
  - No single point of failure
  - Redundant buses, power supplies, etc.

MP Recovery Without Pure FER
- Would like to avoid having to replicate processors
  - TMR, pair-and-spare, etc.: all use lots of hardware and power
- Would like to be able to resume failed process on another processor
  - Must be able to get that process's data
- How do we do this?
  - Use the lessons we learned about BER!

Multiprocessors that Use BER
- Tandem computers prior to the Integrity S2
  - Periodically checkpoint state on other processor
- Sequoia computer systems
  - Flush state to main memory at every checkpoint
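The two forward-recovery schemes named here, TMR and pair-and-spare, both come down to comparing the outputs of replicated processors. The following is a simplified software sketch of that comparison logic, not a description of how the Tandem or Stratus hardware implements it; the function names and return conventions are assumptions for illustration.

```python
from collections import Counter


def tmr_vote(outputs):
    """Majority vote over three replica outputs.

    Returns (voted_value, indices_of_disagreeing_replicas).
    """
    value, count = Counter(outputs).most_common(1)[0]
    if count < 2:
        raise RuntimeError("no majority: more than one replica is faulty")
    return value, [i for i, out in enumerate(outputs) if out != value]


def pair_and_spare(active_pair, spare_pair):
    """Use the active pair if its two outputs agree, else fall to the spare.

    Each pair is a (output_a, output_b) tuple from two lockstepped processors.
    """
    if active_pair[0] == active_pair[1]:
        return active_pair[0], "active"
    if spare_pair[0] == spare_pair[1]:
        return spare_pair[0], "spare"
    raise RuntimeError("both pairs disagree internally")


if __name__ == "__main__":
    print(tmr_vote([7, 7, 9]))
    print(pair_and_spare((4, 5), (4, 4)))
```

TMR masks the fault and also identifies the faulty replica, while a mismatching pair only knows that one of its two members is wrong, which is why pair-and-spare needs four processors to keep running.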

Multiprocessor FT in Interconnection Network
- Sun UltraEnterprise E10000
  - Connects processors with 4 buses
  - Interleaved by address that is requested
  - Can tolerate hard fault in any bus by mapping out that half of the interconnect (i.e., interleaving every 2 addresses instead of 4)

Multiprocessor FT in Interconnection Network
- Cray T3E supercomputer
  - Connects (Compaq/Intel) Alpha microprocessors with 3D torus
  - Can tolerate hard fault in any node with adaptive routing
[Figure: four processor nodes (P)]

Multiprocessor Diagnostics
- Multiprocessors often have extra diagnostic hardware
- Sun UltraEnterprise servers have system service processor
  - Central controller that is different from other processors
  - Used to monitor system and perform diagnostics
- Thinking Machines CM-5 had its own diagnostic network
  - Used strictly for diagnostic purposes (i.e., not time multiplexed with active execution data)
- And numerous other examples

Multiprocessor Fault Isolation
- Fault isolation
  - Keep effects of a fault from propagating into the rest of system
- Benefits
  - Enable a system to continue to operate at least partially
  - Recover part of system while rest of it is running
  - Prevent additional data from being corrupted
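The idea of degrading from 4-way to 2-way address interleaving when a bus fails can be illustrated with a toy bus-selection function. This sketch is purely hypothetical: it assumes the buses are numbered 0 to 3, grouped into halves {0, 1} and {2, 3}, and selected by the low-order bits of a block address; the actual E10000 address mapping is not described in the slides.

```python
def select_bus(block_address, failed_bus=None):
    """Pick the bus that carries a request for block_address.

    With all buses healthy, requests interleave across 4 buses. After a
    hard fault, the half containing the failed bus is mapped out and
    requests interleave across the 2 buses of the surviving half.
    """
    if failed_bus is None:
        return block_address % 4
    if failed_bus not in (0, 1, 2, 3):
        raise ValueError("bus number must be 0..3")
    # Assumed pairing of buses into halves; illustrative only.
    healthy_half = (2, 3) if failed_bus in (0, 1) else (0, 1)
    return healthy_half[block_address % 2]


if __name__ == "__main__":
    print([select_bus(a) for a in range(8)])
    print([select_bus(a, failed_bus=1) for a in range(8)])
```

The system keeps running after the fault, but every request now shares two buses instead of four, so tolerance is paid for in interconnect bandwidth.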

Fault Isolation with Logical Partitioning
- Logical partitioning (LPAR)
  - Logically divide a multiprocessor into multiple partitions
  - These partitions can't affect each other: fault isolation
  - Often requires some amount of software
- Examples
  - Sun UltraEnterprise E10000
  - IBM mainframes and servers
- Why buy a big machine and partition it?
  - Why not just buy several smaller machines?
  - Is a cluster just a logically partitioned multiprocessor?

Fault Isolation with Virtual Machines
- Virtual machines
  - Use software to create multiple virtual machines that run on a single multiprocessor (or even on a single uniprocessor)
  - Crash in one virtual machine doesn't affect others
- Examples:
  - VMWare
  - Xen
- What fault/error model does this address?
  - Tends to be most useful for tolerating software errors
  - Why?

BulletProof
- Ultra Low-Cost Defect Protection for Microprocessor Pipelines (Shyam et al., ASPLOS 2006)

Core Cannibalization
- Core Cannibalization Architecture: Improving Lifetime Chip Performance for Multicore Processors in the Presence of Hard Faults (Romanescu and Sorin, PACT 2008)

Outline
- Microprocessors
- Memory
- Disks
- Networks
- Multiprocessors


More information

Lecture 23: Storage Systems. Topics: disk access, bus design, evaluation metrics, RAID (Sections )

Lecture 23: Storage Systems. Topics: disk access, bus design, evaluation metrics, RAID (Sections ) Lecture 23: Storage Systems Topics: disk access, bus design, evaluation metrics, RAID (Sections 7.1-7.9) 1 Role of I/O Activities external to the CPU are typically orders of magnitude slower Example: while

More information

Redundancy in fault tolerant computing. D. P. Siewiorek R.S. Swarz, Reliable Computer Systems, Prentice Hall, 1992

Redundancy in fault tolerant computing. D. P. Siewiorek R.S. Swarz, Reliable Computer Systems, Prentice Hall, 1992 Redundancy in fault tolerant computing D. P. Siewiorek R.S. Swarz, Reliable Computer Systems, Prentice Hall, 1992 1 Redundancy Fault tolerance computing is based on redundancy HARDWARE REDUNDANCY Physical

More information

ARCHITECTURE DESIGN FOR SOFT ERRORS

ARCHITECTURE DESIGN FOR SOFT ERRORS ARCHITECTURE DESIGN FOR SOFT ERRORS Shubu Mukherjee ^ШВпШшр"* AMSTERDAM BOSTON HEIDELBERG LONDON NEW YORK OXFORD PARIS SAN DIEGO T^"ТГПШГ SAN FRANCISCO SINGAPORE SYDNEY TOKYO ^ P f ^ ^ ELSEVIER Morgan

More information

Physical Storage Media

Physical Storage Media Physical Storage Media These slides are a modified version of the slides of the book Database System Concepts, 5th Ed., McGraw-Hill, by Silberschatz, Korth and Sudarshan. Original slides are available

More information

Chapter 2: Computer-System Structures. Hmm this looks like a Computer System?

Chapter 2: Computer-System Structures. Hmm this looks like a Computer System? Chapter 2: Computer-System Structures Lab 1 is available online Last lecture: why study operating systems? Purpose of this lecture: general knowledge of the structure of a computer system and understanding

More information

Uniprocessor Computer Architecture Example: Cray T3E

Uniprocessor Computer Architecture Example: Cray T3E Chapter 2: Computer-System Structures MP Example: Intel Pentium Pro Quad Lab 1 is available online Last lecture: why study operating systems? Purpose of this lecture: general knowledge of the structure

More information

Today: Coda, xfs! Brief overview of other file systems. Distributed File System Requirements!

Today: Coda, xfs! Brief overview of other file systems. Distributed File System Requirements! Today: Coda, xfs! Case Study: Coda File System Brief overview of other file systems xfs Log structured file systems Lecture 21, page 1 Distributed File System Requirements! Transparency Access, location,

More information

Definition of RAID Levels

Definition of RAID Levels RAID The basic idea of RAID (Redundant Array of Independent Disks) is to combine multiple inexpensive disk drives into an array of disk drives to obtain performance, capacity and reliability that exceeds

More information

RAID (Redundant Array of Inexpensive Disks)

RAID (Redundant Array of Inexpensive Disks) Magnetic Disk Characteristics I/O Connection Structure Types of Buses Cache & I/O I/O Performance Metrics I/O System Modeling Using Queuing Theory Designing an I/O System RAID (Redundant Array of Inexpensive

More information

CSE 451: Operating Systems Spring Module 18 Redundant Arrays of Inexpensive Disks (RAID)

CSE 451: Operating Systems Spring Module 18 Redundant Arrays of Inexpensive Disks (RAID) CSE 451: Operating Systems Spring 2017 Module 18 Redundant Arrays of Inexpensive Disks (RAID) John Zahorjan 2017 Gribble, Lazowska, Levy, Zahorjan, Zbikowski 1 Disks are cheap Background An individual

More information

Mass-Storage Structure

Mass-Storage Structure CS 4410 Operating Systems Mass-Storage Structure Summer 2011 Cornell University 1 Today How is data saved in the hard disk? Magnetic disk Disk speed parameters Disk Scheduling RAID Structure 2 Secondary

More information

Chapter 9 Multiprocessors

Chapter 9 Multiprocessors ECE200 Computer Organization Chapter 9 Multiprocessors David H. lbonesi and the University of Rochester Henk Corporaal, TU Eindhoven, Netherlands Jari Nurmi, Tampere University of Technology, Finland University

More information

Address Accessible Memories. A.R. Hurson Department of Computer Science Missouri University of Science & Technology

Address Accessible Memories. A.R. Hurson Department of Computer Science Missouri University of Science & Technology Address Accessible Memories A.R. Hurson Department of Computer Science Missouri University of Science & Technology 1 Memory System Memory Requirements for a Computer An internal storage medium to store

More information

416 Distributed Systems. Errors and Failures Feb 1, 2016

416 Distributed Systems. Errors and Failures Feb 1, 2016 416 Distributed Systems Errors and Failures Feb 1, 2016 Types of Errors Hard errors: The component is dead. Soft errors: A signal or bit is wrong, but it doesn t mean the component must be faulty Note:

More information

Lecture 9: MIMD Architectures

Lecture 9: MIMD Architectures Lecture 9: MIMD Architectures Introduction and classification Symmetric multiprocessors NUMA architecture Clusters Zebo Peng, IDA, LiTH 1 Introduction A set of general purpose processors is connected together.

More information

ECE Enterprise Storage Architecture. Fall 2018

ECE Enterprise Storage Architecture. Fall 2018 ECE590-03 Enterprise Storage Architecture Fall 2018 RAID Tyler Bletsch Duke University Slides include material from Vince Freeh (NCSU) A case for redundant arrays of inexpensive disks Circa late 80s..

More information

CS 3640: Introduction to Networks and Their Applications

CS 3640: Introduction to Networks and Their Applications CS 3640: Introduction to Networks and Their Applications Fall 2018, Lecture 5: The Link Layer I Errors and medium access Instructor: Rishab Nithyanand Teaching Assistant: Md. Kowsar Hossain 1 You should

More information

Reliable Computing I

Reliable Computing I Instructor: Mehdi Tahoori Reliable Computing I Lecture 8: Redundant Disk Arrays INSTITUTE OF COMPUTER ENGINEERING (ITEC) CHAIR FOR DEPENDABLE NANO COMPUTING (CDNC) National Research Center of the Helmholtz

More information

I/O Hardwares. Some typical device, network, and data base rates

I/O Hardwares. Some typical device, network, and data base rates Input/Output 1 I/O Hardwares Some typical device, network, and data base rates 2 Device Controllers I/O devices have components: mechanical component electronic component The electronic component is the

More information

Virtual Memory. Reading. Sections 5.4, 5.5, 5.6, 5.8, 5.10 (2) Lecture notes from MKP and S. Yalamanchili

Virtual Memory. Reading. Sections 5.4, 5.5, 5.6, 5.8, 5.10 (2) Lecture notes from MKP and S. Yalamanchili Virtual Memory Lecture notes from MKP and S. Yalamanchili Sections 5.4, 5.5, 5.6, 5.8, 5.10 Reading (2) 1 The Memory Hierarchy ALU registers Cache Memory Memory Memory Managed by the compiler Memory Managed

More information

CSE 380 Computer Operating Systems

CSE 380 Computer Operating Systems CSE 380 Computer Operating Systems Instructor: Insup Lee University of Pennsylvania Fall 2003 Lecture Note on Disk I/O 1 I/O Devices Storage devices Floppy, Magnetic disk, Magnetic tape, CD-ROM, DVD User

More information

ECE 250 / CS250 Introduction to Computer Architecture

ECE 250 / CS250 Introduction to Computer Architecture ECE 250 / CS250 Introduction to Computer Architecture Main Memory Benjamin C. Lee Duke University Slides from Daniel Sorin (Duke) and are derived from work by Amir Roth (Penn) and Alvy Lebeck (Duke) 1

More information

Alternate definition: Instruction Set Architecture (ISA) What is Computer Architecture? Computer Organization. Computer structure: Von Neumann model

Alternate definition: Instruction Set Architecture (ISA) What is Computer Architecture? Computer Organization. Computer structure: Von Neumann model What is Computer Architecture? Structure: static arrangement of the parts Organization: dynamic interaction of the parts and their control Implementation: design of specific building blocks Performance:

More information

ECE 574 Cluster Computing Lecture 19

ECE 574 Cluster Computing Lecture 19 ECE 574 Cluster Computing Lecture 19 Vince Weaver http://www.eece.maine.edu/~vweaver vincent.weaver@maine.edu 10 November 2015 Announcements Projects HW extended 1 MPI Review MPI is *not* shared memory

More information

CS370: Operating Systems [Spring 2017] Dept. Of Computer Science, Colorado State University

CS370: Operating Systems [Spring 2017] Dept. Of Computer Science, Colorado State University Frequently asked questions from the previous class survey CS 370: OPERATING SYSTEMS [MASS STORAGE] How does the OS caching optimize disk performance? How does file compression work? Does the disk change

More information

06-Dec-17. Credits:4. Notes by Pritee Parwekar,ANITS 06-Dec-17 1

06-Dec-17. Credits:4. Notes by Pritee Parwekar,ANITS 06-Dec-17 1 Credits:4 1 Understand the Distributed Systems and the challenges involved in Design of the Distributed Systems. Understand how communication is created and synchronized in Distributed systems Design and

More information

An Overview of CORAID Technology and ATA-over-Ethernet (AoE)

An Overview of CORAID Technology and ATA-over-Ethernet (AoE) An Overview of CORAID Technology and ATA-over-Ethernet (AoE) Dr. Michael A. Covington, Ph.D. University of Georgia 2008 1. Introduction All CORAID products revolve around one goal: making disk storage

More information

LECTURE 5: MEMORY HIERARCHY DESIGN

LECTURE 5: MEMORY HIERARCHY DESIGN LECTURE 5: MEMORY HIERARCHY DESIGN Abridged version of Hennessy & Patterson (2012):Ch.2 Introduction Programmers want unlimited amounts of memory with low latency Fast memory technology is more expensive

More information

Agenda. Agenda 11/12/12. Review - 6 Great Ideas in Computer Architecture

Agenda. Agenda 11/12/12. Review - 6 Great Ideas in Computer Architecture /3/2 Review - 6 Great Ideas in Computer Architecture CS 6C: Great Ideas in Computer Architecture (Machine Structures) Dependability and RAID Instructors: Krste Asanovic, Randy H. Katz hfp://inst.eecs.berkeley.edu/~cs6c/fa2.

More information

Lecture 23: I/O Redundant Arrays of Inexpensive Disks Professor Randy H. Katz Computer Science 252 Spring 1996

Lecture 23: I/O Redundant Arrays of Inexpensive Disks Professor Randy H. Katz Computer Science 252 Spring 1996 Lecture 23: I/O Redundant Arrays of Inexpensive Disks Professor Randy H Katz Computer Science 252 Spring 996 RHKS96 Review: Storage System Issues Historical Context of Storage I/O Storage I/O Performance

More information

GFS: The Google File System. Dr. Yingwu Zhu

GFS: The Google File System. Dr. Yingwu Zhu GFS: The Google File System Dr. Yingwu Zhu Motivating Application: Google Crawl the whole web Store it all on one big disk Process users searches on one big CPU More storage, CPU required than one PC can

More information

Lecture 9: MIMD Architectures

Lecture 9: MIMD Architectures Lecture 9: MIMD Architectures Introduction and classification Symmetric multiprocessors NUMA architecture Clusters Zebo Peng, IDA, LiTH 1 Introduction MIMD: a set of general purpose processors is connected

More information

Mass-Storage. ICS332 Operating Systems

Mass-Storage. ICS332 Operating Systems Mass-Storage ICS332 Operating Systems Magnetic Disks Magnetic disks are (still) the most common secondary storage devices today They are messy Errors, bad blocks, missed seeks, moving parts And yet, the

More information

Multiprocessors and Thread Level Parallelism Chapter 4, Appendix H CS448. The Greed for Speed

Multiprocessors and Thread Level Parallelism Chapter 4, Appendix H CS448. The Greed for Speed Multiprocessors and Thread Level Parallelism Chapter 4, Appendix H CS448 1 The Greed for Speed Two general approaches to making computers faster Faster uniprocessor All the techniques we ve been looking

More information

SMP and ccnuma Multiprocessor Systems. Sharing of Resources in Parallel and Distributed Computing Systems

SMP and ccnuma Multiprocessor Systems. Sharing of Resources in Parallel and Distributed Computing Systems Reference Papers on SMP/NUMA Systems: EE 657, Lecture 5 September 14, 2007 SMP and ccnuma Multiprocessor Systems Professor Kai Hwang USC Internet and Grid Computing Laboratory Email: kaihwang@usc.edu [1]

More information

Dependability and ECC

Dependability and ECC ecture 38 Computer Science 61C Spring 2017 April 24th, 2017 Dependability and ECC 1 Great Idea #6: Dependability via Redundancy Applies to everything from data centers to memory Redundant data centers

More information

CS 43: Computer Networks The Link Layer. Kevin Webb Swarthmore College November 28, 2017

CS 43: Computer Networks The Link Layer. Kevin Webb Swarthmore College November 28, 2017 CS 43: Computer Networks The Link Layer Kevin Webb Swarthmore College November 28, 2017 TCP/IP Protocol Stack host host HTTP Application Layer HTTP TCP Transport Layer TCP router router IP IP Network Layer

More information

Architectural Level Fault- Tolerance Techniques. EECE 513: Design of Fault- tolerant Digital Systems

Architectural Level Fault- Tolerance Techniques. EECE 513: Design of Fault- tolerant Digital Systems Architectural Level Fault- Tolerance Techniques EECE 513: Design of Fault- tolerant Digital Systems Learning ObjecDves List the techniques for improving the reliability of commodity & high end processors

More information

ECC Protection in Software

ECC Protection in Software Center for RC eliable omputing ECC Protection in Software by Philip P Shirvani RATS June 8, 1999 Outline l Motivation l Requirements l Coding Schemes l Multiple Error Handling l Implementation in ARGOS

More information

Outline. Parallel Database Systems. Information explosion. Parallelism in DBMSs. Relational DBMS parallelism. Relational DBMSs.

Outline. Parallel Database Systems. Information explosion. Parallelism in DBMSs. Relational DBMS parallelism. Relational DBMSs. Parallel Database Systems STAVROS HARIZOPOULOS stavros@cs.cmu.edu Outline Background Hardware architectures and performance metrics Parallel database techniques Gamma Bonus: NCR / Teradata Conclusions

More information

Modern RAID Technology. RAID Primer A Configuration Guide

Modern RAID Technology. RAID Primer A Configuration Guide Modern RAID Technology RAID Primer A Configuration Guide E x c e l l e n c e i n C o n t r o l l e r s Modern RAID Technology RAID Primer A Configuration Guide 6th Edition Copyright 1997-2003 ICP vortex

More information

Distributed Systems. Fault Tolerance. Paul Krzyzanowski

Distributed Systems. Fault Tolerance. Paul Krzyzanowski Distributed Systems Fault Tolerance Paul Krzyzanowski Except as otherwise noted, the content of this presentation is licensed under the Creative Commons Attribution 2.5 License. Faults Deviation from expected

More information

I/O CANNOT BE IGNORED

I/O CANNOT BE IGNORED LECTURE 13 I/O I/O CANNOT BE IGNORED Assume a program requires 100 seconds, 90 seconds for main memory, 10 seconds for I/O. Assume main memory access improves by ~10% per year and I/O remains the same.

More information

Computer Architecture: Multithreading (III) Prof. Onur Mutlu Carnegie Mellon University

Computer Architecture: Multithreading (III) Prof. Onur Mutlu Carnegie Mellon University Computer Architecture: Multithreading (III) Prof. Onur Mutlu Carnegie Mellon University A Note on This Lecture These slides are partly from 18-742 Fall 2012, Parallel Computer Architecture, Lecture 13:

More information

EE 6900: FAULT-TOLERANT COMPUTING SYSTEMS

EE 6900: FAULT-TOLERANT COMPUTING SYSTEMS EE 6900: FAULT-TOLERANT COMPUTING SYSTEMS LECTURE 6: CODING THEORY - 2 Fall 2014 Avinash Kodi kodi@ohio.edu Acknowledgement: Daniel Sorin, Behrooz Parhami, Srinivasan Ramasubramanian Agenda Hamming Codes

More information

PANASAS TIERED PARITY ARCHITECTURE

PANASAS TIERED PARITY ARCHITECTURE PANASAS TIERED PARITY ARCHITECTURE Larry Jones, Matt Reid, Marc Unangst, Garth Gibson, and Brent Welch White Paper May 2010 Abstract Disk drives are approximately 250 times denser today than a decade ago.

More information

Lecture 21: Reliable, High Performance Storage. CSC 469H1F Fall 2006 Angela Demke Brown

Lecture 21: Reliable, High Performance Storage. CSC 469H1F Fall 2006 Angela Demke Brown Lecture 21: Reliable, High Performance Storage CSC 469H1F Fall 2006 Angela Demke Brown 1 Review We ve looked at fault tolerance via server replication Continue operating with up to f failures Recovery

More information

Growth. Individual departments in a university buy LANs for their own machines and eventually want to interconnect with other campus LANs.

Growth. Individual departments in a university buy LANs for their own machines and eventually want to interconnect with other campus LANs. Internetworking Multiple networks are a fact of life: Growth. Individual departments in a university buy LANs for their own machines and eventually want to interconnect with other campus LANs. Fault isolation,

More information

Routing Algorithm. How do I know where a packet should go? Topology does NOT determine routing (e.g., many paths through torus)

Routing Algorithm. How do I know where a packet should go? Topology does NOT determine routing (e.g., many paths through torus) Routing Algorithm How do I know where a packet should go? Topology does NOT determine routing (e.g., many paths through torus) Many routing algorithms exist 1) Arithmetic 2) Source-based 3) Table lookup

More information

6.033 Lecture Fault Tolerant Computing 3/31/2014

6.033 Lecture Fault Tolerant Computing 3/31/2014 6.033 Lecture 14 -- Fault Tolerant Computing 3/31/2014 So far what have we seen: Modularity RPC Processes Client / server Networking Implements client/server Seen a few examples of dealing with faults

More information

Copyright 2012, Elsevier Inc. All rights reserved.

Copyright 2012, Elsevier Inc. All rights reserved. Computer Architecture A Quantitative Approach, Fifth Edition Chapter 2 Memory Hierarchy Design 1 Introduction Programmers want unlimited amounts of memory with low latency Fast memory technology is more

More information

Input/Output. Today. Next. Principles of I/O hardware & software I/O software layers Disks. Protection & Security

Input/Output. Today. Next. Principles of I/O hardware & software I/O software layers Disks. Protection & Security Input/Output Today Principles of I/O hardware & software I/O software layers Disks Next Protection & Security Operating Systems and I/O Two key operating system goals Control I/O devices Provide a simple,

More information

Routing Algorithms. Review

Routing Algorithms. Review Routing Algorithms Today s topics: Deterministic, Oblivious Adaptive, & Adaptive models Problems: efficiency livelock deadlock 1 CS6810 Review Network properties are a combination topology topology dependent

More information

MIMD Overview. Intel Paragon XP/S Overview. XP/S Usage. XP/S Nodes and Interconnection. ! Distributed-memory MIMD multicomputer

MIMD Overview. Intel Paragon XP/S Overview. XP/S Usage. XP/S Nodes and Interconnection. ! Distributed-memory MIMD multicomputer MIMD Overview Intel Paragon XP/S Overview! MIMDs in the 1980s and 1990s! Distributed-memory multicomputers! Intel Paragon XP/S! Thinking Machines CM-5! IBM SP2! Distributed-memory multicomputers with hardware

More information

Networking and Internetworking 1

Networking and Internetworking 1 Networking and Internetworking 1 Today l Networks and distributed systems l Internet architecture xkcd Networking issues for distributed systems Early networks were designed to meet relatively simple requirements

More information

Fault Tolerance Dealing with an imperfect world

Fault Tolerance Dealing with an imperfect world Fault Tolerance Dealing with an imperfect world Paul Krzyzanowski Rutgers University September 14, 2012 1 Introduction If we look at the words fault and tolerance, we can define the fault as a malfunction

More information

Operating Systems 2010/2011

Operating Systems 2010/2011 Operating Systems 2010/2011 Input/Output Systems part 2 (ch13, ch12) Shudong Chen 1 Recap Discuss the principles of I/O hardware and its complexity Explore the structure of an operating system s I/O subsystem

More information

IO System. CP-226: Computer Architecture. Lecture 25 (24 April 2013) CADSL

IO System. CP-226: Computer Architecture. Lecture 25 (24 April 2013) CADSL IO System Virendra Singh Associate Professor Computer Architecture and Dependable Systems Lab Department of Electrical Engineering Indian Institute of Technology Bombay http://www.ee.iitb.ac.in/~viren/

More information