Elzar Triple Modular Redundancy using Intel AVX

Size: px
Start display at page:

Download "Elzar Triple Modular Redundancy using Intel AVX"

Transcription

1 Elzar Triple Modular Redundancy using Intel AVX Dmitrii Kuvaiskii Oleksii Oleksenko Pramod Bhatotia Christof Fetzer Pascal Felber

2 Hardware Errors in the Wild Online services run in huge data centers 1

3 Hardware Errors in the Wild Online services run in huge data centers Hardware faults are the norm rather than the exception 1

4 Hardware Errors in the Wild Online services run in huge data centers Hardware faults are the norm rather than the exception These faults can lead to arbitrary state corruptions 1

5 Hardware Errors in the Wild Online services run in huge data centers Hardware faults are the norm rather than the exception These faults can lead to arbitrary state corruptions There were a handful of messages that had a single bit corrupted such that the message was still intelligible, but the system state information was incorrect 1

6 Hardware Errors in the Wild Online services run in huge data centers Hardware faults are the norm rather than the exception These faults can lead to arbitrary state corruptions There were a handful of messages that had a single bit corrupted such that the message was still intelligible, but the system state information was incorrect Corruption can occur transiently in CPU or RAM. Guarding against such corruptions is an important goal in Mesa s overall design 1

7 Protecting Against Data Corruptions Principled approaches Ad-hoc approaches 2

8 Protecting Against Data Corruptions Principled approaches Ad-hoc approaches Byzantine Fault Tolerance 2

9 Protecting Against Data Corruptions Principled approaches Ad-hoc approaches Byzantine Fault Tolerance Tolerates arbitrary faults Pessimistic fault model High resource overheads 2

10 Protecting Against Data Corruptions Principled approaches Byzantine Fault Tolerance Ad-hoc approaches Checksums / Assertions Tolerates arbitrary faults Pessimistic fault model High resource overheads 2

11 Protecting Against Data Corruptions Principled approaches Byzantine Fault Tolerance Ad-hoc approaches Checksums / Assertions Tolerates arbitrary faults Low performance overheads Pessimistic fault model Only anticipated faults High resource overheads Manual and error-prone 2

12 Protecting Against Data Corruptions Principled approaches Ad-hoc approaches Byzantine Fault Tolerance Checksums / Assertions Triple Modular Redundancy 2

13 Protecting Against Data Corruptions Principled approaches Ad-hoc approaches Byzantine Fault Tolerance better performance Checksums / Assertions Triple Modular Redundancy Practical fault model 2

14 Protecting Against Data Corruptions Principled approaches Ad-hoc approaches Byzantine Fault Tolerance better performance Triple Modular Redundancy Practical fault model Disciplined protection Checksums / Assertions better fault coverage 2

15 Triple Modular Redundancy Source code Triple Modular Redundancy (TMR) Executable CPU 3

16 Triple Modular Redundancy Source code Triple Modular Redundancy (TMR) Executable CPU Native z = add x, y 3

17 Triple Modular Redundancy Source code Triple Modular Redundancy (TMR) Executable CPU Native TMR z z = add x, y z2 = add x2, y2 z3 = add x3, y3 = add x, y 3

18 Triple Modular Redundancy Source code Triple Modular Redundancy (TMR) with SIMD Executable CPU Native TMR TMR with SIMD z z = add x, y z2 = add x2, y2 z3 = add x3, y3 zwide = add xwide, ywide = add x, y 3

19 Triple Modular Redundancy Triple Modular Redundancy (TMR) with SIMD Source code Executable CPU Native TMR TMR with SIMD z z = add x, y z2 = add x2, y2 z3 = add x3, y3 zwide = add xwide, ywide = add x, y Idea SIMD uses less instructions less perf overhead? 3

20 Research Question Can we use SIMD for efficient fault tolerance? 4

21 Research Question Can we use SIMD for efficient fault tolerance? Implementation using Intel AVX 4

22 Elzar Source code Elzar Executable CPU 5

23 Elzar Source code Elzar Executable AVX CPU Implementation Intel AVX for triple modular redundancy 5

24 Elzar Source code Elzar Executable AVX CPU Implementation Intel AVX for triple modular redundancy Outcome Mixed results (AVX not general-purpose) 5

25 Elzar Source code Elzar Executable AVX CPU Implementation Intel AVX for triple modular redundancy Outcome Mixed results (AVX not general-purpose) Investigation Two bottlenecks in current AVX 5

26 Motivation Intel AVX Design Evaluation Discussion

27 Intel AVX Background 6

28 Intel AVX Background 256 bit New registers YMM0 YMM1 YMM2 YMM15 64-bit 64-bit 64-bit 64-bit 6

29 Intel AVX Background 256 bit New registers YMM0 YMM1 YMM2 YMM15 New instructions + 64-bit 64-bit 64-bit 64-bit x1 x2 x3 x4 YMM0 y1 y2 y3 y4 YMM1 x1+y1 x2+y2 x3+y3 x4+y4 YMM2 6

30 AVX as free resource? AVX is not actively used? 7

31 AVX as free resource? AVX is not actively used? unused used 7

32 AVX as free resource? AVX is not actively used? unused used 7

33 AVX as free resource? AVX is not actively used? unused used 7

34 AVX as free resource? AVX is not actively used? Reuse for fault tolerance! unused used 7

35 Motivation Intel AVX Design Evaluation Discussion

36 Fault Model Protect against transient faults in CPU corruptions in CPU registers CPU miscomputations in CPU execution units Memory is protected by other means DRAM protected by ECC Memory CPU caches protected by ECC and parity 8

37 Elzar by Example Native x z = load a = add x, 1 9

38 Elzar by Example Native TMR x z x z z2 z3 = load a = add x, 1 = = = = load a add x, 1 add x2, 1 add x3, 1 majority(z, z2, z3) 9

39 Elzar by Example Native TMR x z x z z2 z3 = load a = add x, 1 = = = = Elzar load a add x, 1 add x2, 1 add x3, 1 majority(z, z2, z3) x z = avx-load a = avx-add x, 1 avx-check(z) 9

40 Elzar by Example Native TMR x z x z z2 z3 = load a = add x, 1 = = = = Elzar load a add x, 1 add x2, 1 add x3, 1 majority(z, z2, z3) x z = avx-load a = avx-add x, 1 avx-check(z) Common instructions introduce lower overhead 9

41 Elzar by Example Native TMR x z x z z2 z3 = load a = add x, 1 = = = = Elzar load a add x, 1 add x2, 1 add x3, 1 majority(z, z2, z3) x z = avx-load a = avx-add x, 1 avx-check(z) Bottleneck 1: Memory accesses require wrappers 9

42 Elzar by Example Native TMR x z x z z2 z3 = load a = add x, 1 = = = = Elzar load a add x, 1 add x2, 1 add x3, 1 majority(z, z2, z3) x z = avx-load a = avx-add x, 1 avx-check(z) Bottleneck 2: Checks are expensive 9

43 Bottleneck 1: Memory accesses x = avx-load a z = avx-add x, avx-check(z) 1 10

44 Bottleneck 1: Memory accesses x = avx-load a z = avx-add x, avx-check(z) No such thing as avx-load! 1 10

45 Bottleneck 1: Memory accesses x = avx-load a z = avx-add x, avx-check(z) 1 a a a a 10

46 Bottleneck 1: Memory accesses x = avx-load a z = avx-add x, avx-check(z) 1 a a a a extract a 10

47 Bottleneck 1: Memory accesses x = avx-load a z = avx-add x, avx-check(z) 1 a a a a extract RAM a load x 10

48 Bottleneck 1: Memory accesses x = avx-load a z = avx-add x, avx-check(z) 1 a a a a extract RAM load a x broadcast x x x x 10

49 Bottleneck 1: Memory accesses x = avx-load a z = avx-add x, avx-check(z) 1 a a a a extract RAM load a x broadcast x x x x Memory accesses require wrappers 10

50 Bottleneck 2: Checks x = avx-load a z = avx-add x, avx-check(z) 1 11

51 Bottleneck 2: Checks x = avx-load a z = avx-add x, avx-check(z) 1 z1 z2 z3 z4 11

52 Bottleneck 2: Checks x = avx-load a z = avx-add x, avx-check(z) 1 z1 z2 z3 z4 z2 z1 z4 z3 z2 z1 z1 z2 z4 z3 z3 z4 shuffle xor 11

53 Bottleneck 2: Checks x = avx-load a z = avx-add x, avx-check(z) 1 z1 z2 z3 z4 z2 z1 z4 z3 z2 z1 z1 z2 z4 z3 z3 z4 shuffle xor ptest FLAGS <all zeros, proceed> 11

54 Bottleneck 2: Checks x = avx-load a z = avx-add x, avx-check(z) 1 z z z' z shuffle xor ptest FLAGS 11

55 Bottleneck 2: Checks x = avx-load a z = avx-add x, avx-check(z) 1 z z z' z z z z z' z z z z z z' z' z shuffle xor ptest FLAGS 11

56 Bottleneck 2: Checks x = avx-load a z = avx-add x, avx-check(z) 1 z z z' z z z z z' z z z z z z' z' z shuffle xor ptest FLAGS jne recover 11

57 Bottleneck 2: Checks x = avx-load a z = avx-add x, avx-check(z) 1 z z z' z z z z z' z z z z z z' z' z shuffle xor ptest FLAGS jne recover Checks introduce additional 55% overhead 11

58 Motivation Intel AVX Design Evaluation Discussion

59 Performance overheads 12

60 Performance overheads lower better 12

61 Performance overheads lower better 12

62 Performance overheads lower better Average performance overhead is 4 12

63 Performance overheads lower better Amortized by very poor memory access pattern 12

64 Performance overheads lower better (1) Native benefits from AVX vectorization (2) Most of time spent in mem-intensive bzero() 12

65 Comparison with SWIFT-R (state-of-the-art TMR approach) 13

66 Comparison with SWIFT-R (state-of-the-art TMR approach) lower better 13

67 Comparison with SWIFT-R (state-of-the-art TMR approach) lower better 13

68 Comparison with SWIFT-R (state-of-the-art TMR approach) lower better Elzar performs 46% worse than SWIFT-R on average 13

69 Comparison with SWIFT-R (state-of-the-art TMR approach) lower better Dominated by memory accesses, Elzar inserts many wrappers 13

70 Comparison with SWIFT-R (state-of-the-art TMR approach) lower better Elzar produces 3x less instructions than SWIFT-R 13

71 Motivation Intel AVX Design Evaluation Discussion

72 Bottlenecks and Proposed Solution Problem AVX lacks certain instructions Need wrappers for memory accesses Need shuffle-xor-ptest for checks 14

73 Bottlenecks and Proposed Solution Problem AVX lacks certain instructions Need wrappers for memory accesses Need shuffle-xor-ptest for checks Solution Offload to FPGA accelerator Intel's Xeon-FPGA pairing 14

74 Bottlenecks and Proposed Solution Problem AVX lacks certain instructions Need wrappers for memory accesses Need shuffle-xor-ptest for checks Solution Offload to FPGA accelerator Intel's Xeon-FPGA pairing FPGA CPU addr addr addr addr addr majority voting load Memory val replication val val val val 14

75 Bottlenecks and Proposed Solution Problem AVX lacks certain instructions Need wrappers for memory accesses Need shuffle-xor-ptest for checks Solution Offload to FPGA accelerator Intel's Xeon-FPGA pairing FPGA CPU addr addr addr addr addr majority voting load Memory val replication val val val val Potentially 71% better than SWIFT-R 14

76 Conclusion 15

77 Conclusion Source code Elzar Executable AVX CPU 15

78 Conclusion Source code Elzar Executable AVX CPU Implementation Intel AVX for triple modular redundancy 15

79 Conclusion Source code Elzar Executable AVX CPU Implementation Intel AVX for triple modular redundancy Hypothesis Less instructions less perf overhead 15

80 Conclusion Source code Elzar Executable AVX CPU Implementation Intel AVX for triple modular redundancy Hypothesis Less instructions less perf overhead Outcome 46% worse than SWIFT-R 15

81 Conclusion Source code Elzar Executable AVX CPU Implementation Intel AVX for triple modular redundancy Hypothesis Less instructions less perf overhead Outcome 46% worse than SWIFT-R Discussion With FPGA, ~71% better than SWIFT-R 15

82 Thank you! GitHub repo:

83 backup slides

84 Intel Haswell CPU microarchitecture Instruction Scheduler port 0 64-bit ALU port 1 64-bit ALU 256-bit MUL 256-bit FMA port 5 64-bit ALU 256-bit ALU 256-bit Shuffle ports 2-4,7 256-bit FADD Memory access 256-bit ALU port 6 64-bit ALU

85 Intel Haswell CPU microarchitecture Instruction Scheduler port 0 64-bit ALU port 1 64-bit ALU 256-bit MUL 256-bit FMA port 5 64-bit ALU 256-bit ALU 256-bit Shuffle ports 2-4,7 256-bit FADD Memory access 256-bit ALU port 6 64-bit ALU Intel AVX units

86 Elzar: Check on Branch Native cmp x, y jne truebranch

87 Elzar: Check on Branch Native cmp x, y jne truebranch x1 x2 x3 x4 y1 y2 y3 y4 cmpeq x1=y1 x2=y2 x3=y3 x4=y4 ptest jne FLAGS ja je true branch false branch recover

88 Elzar: Check on Branch Native cmp x, y jne truebranch x1 x2 x3 x4 y1 y2 y3 y4 cmpeq x1=y1 x2=y2 x3=y3 x4=y4 ptest jne FLAGS ja je true branch false branch recover Impact Checks on branches introduce only 4% overhead

89 Main Bottlenecks Problem AVX instruction set lacks certain instructions Loads/stores require extracting AVX-replicated address Elzar creates expensive wrappers around loads/stores AVX-512 introduces promising gather/scatter instructions AVX has only one control-flow instruction ptest Elzar has to insert ptest for each control-flow decision AVX could add cmp instructions directly affecting control flow AVX misses integer division, integer truncation, etc Elzar produces very ineffective code in these cases

90 Estimation of Proposed Changes lower better

91 Estimation of Proposed Changes lower better

92 Estimation of Proposed Changes lower better

93 Estimation of Proposed Changes lower better Estimated average overhead would be 48% (71% improvement over SWIFT-R)

HAFT Hardware-Assisted Fault Tolerance

HAFT Hardware-Assisted Fault Tolerance HAFT Hardware-Assisted Fault Tolerance Dmitrii Kuvaiskii Rasha Faqeh Pramod Bhatotia Christof Fetzer Technische Universität Dresden Pascal Felber Université de Neuchâtel Hardware Errors in the Wild Online

More information

SGXBounds Memory Safety for Shielded Execution

SGXBounds Memory Safety for Shielded Execution SGXBounds Memory Safety for Shielded Execution Dmitrii Kuvaiskii, Oleksii Oleksenko, Sergei Arnautov, Bohdan Trach, Pramod Bhatotia *, Pascal Felber, Christof Fetzer TU Dresden, * The University of Edinburgh,

More information

Distributed Systems 24. Fault Tolerance

Distributed Systems 24. Fault Tolerance Distributed Systems 24. Fault Tolerance Paul Krzyzanowski pxk@cs.rutgers.edu 1 Faults Deviation from expected behavior Due to a variety of factors: Hardware failure Software bugs Operator errors Network

More information

Distributed Systems 23. Fault Tolerance

Distributed Systems 23. Fault Tolerance Distributed Systems 23. Fault Tolerance Paul Krzyzanowski pxk@cs.rutgers.edu 4/20/2011 1 Faults Deviation from expected behavior Due to a variety of factors: Hardware failure Software bugs Operator errors

More information

Redundancy in fault tolerant computing. D. P. Siewiorek R.S. Swarz, Reliable Computer Systems, Prentice Hall, 1992

Redundancy in fault tolerant computing. D. P. Siewiorek R.S. Swarz, Reliable Computer Systems, Prentice Hall, 1992 Redundancy in fault tolerant computing D. P. Siewiorek R.S. Swarz, Reliable Computer Systems, Prentice Hall, 1992 1 Redundancy Fault tolerance computing is based on redundancy HARDWARE REDUNDANCY Physical

More information

Lecture: Storage, GPUs. Topics: disks, RAID, reliability, GPUs (Appendix D, Ch 4)

Lecture: Storage, GPUs. Topics: disks, RAID, reliability, GPUs (Appendix D, Ch 4) Lecture: Storage, GPUs Topics: disks, RAID, reliability, GPUs (Appendix D, Ch 4) 1 Magnetic Disks A magnetic disk consists of 1-12 platters (metal or glass disk covered with magnetic recording material

More information

EC-Bench: Benchmarking Onload and Offload Erasure Coders on Modern Hardware Architectures

EC-Bench: Benchmarking Onload and Offload Erasure Coders on Modern Hardware Architectures EC-Bench: Benchmarking Onload and Offload Erasure Coders on Modern Hardware Architectures Haiyang Shi, Xiaoyi Lu, and Dhabaleswar K. (DK) Panda {shi.876, lu.932, panda.2}@osu.edu The Ohio State University

More information

TSW Reliability and Fault Tolerance

TSW Reliability and Fault Tolerance TSW Reliability and Fault Tolerance Alexandre David 1.2.05 Credits: some slides by Alan Burns & Andy Wellings. Aims Understand the factors which affect the reliability of a system. Introduce how software

More information

Registers. Registers

Registers. Registers All computers have some registers visible at the ISA level. They are there to control execution of the program hold temporary results visible at the microarchitecture level, such as the Top Of Stack (TOS)

More information

Ultra Low-Cost Defect Protection for Microprocessor Pipelines

Ultra Low-Cost Defect Protection for Microprocessor Pipelines Ultra Low-Cost Defect Protection for Microprocessor Pipelines Smitha Shyam Kypros Constantinides Sujay Phadke Valeria Bertacco Todd Austin Advanced Computer Architecture Lab University of Michigan Key

More information

Distributed Systems. Fault Tolerance. Paul Krzyzanowski

Distributed Systems. Fault Tolerance. Paul Krzyzanowski Distributed Systems Fault Tolerance Paul Krzyzanowski Except as otherwise noted, the content of this presentation is licensed under the Creative Commons Attribution 2.5 License. Faults Deviation from expected

More information

Error Mitigation of Point-to-Point Communication for Fault-Tolerant Computing

Error Mitigation of Point-to-Point Communication for Fault-Tolerant Computing Error Mitigation of Point-to-Point Communication for Fault-Tolerant Computing Authors: Robert L Akamine, Robert F. Hodson, Brock J. LaMeres, and Robert E. Ray www.nasa.gov Contents Introduction to the

More information

Siewiorek, Daniel P.; Swarz, Robert S.: Reliable Computer Systems. third. Wellesley, MA : A. K. Peters, Ltd., 1998., X

Siewiorek, Daniel P.; Swarz, Robert S.: Reliable Computer Systems. third. Wellesley, MA : A. K. Peters, Ltd., 1998., X Dependable Systems Hardware Dependability - Diagnosis Dr. Peter Tröger Sources: Siewiorek, Daniel P.; Swarz, Robert S.: Reliable Computer Systems. third. Wellesley, MA : A. K. Peters, Ltd., 1998., 156881092X

More information

Distributed Systems. 19. Fault Tolerance Paul Krzyzanowski. Rutgers University. Fall 2013

Distributed Systems. 19. Fault Tolerance Paul Krzyzanowski. Rutgers University. Fall 2013 Distributed Systems 19. Fault Tolerance Paul Krzyzanowski Rutgers University Fall 2013 November 27, 2013 2013 Paul Krzyzanowski 1 Faults Deviation from expected behavior Due to a variety of factors: Hardware

More information

Self-Repair for Robust System Design. Yanjing Li Intel Labs Stanford University

Self-Repair for Robust System Design. Yanjing Li Intel Labs Stanford University Self-Repair for Robust System Design Yanjing Li Intel Labs Stanford University 1 Hardware Failures: Major Concern Permanent: our focus Temporary 2 Tolerating Permanent Hardware Failures Detection Diagnosis

More information

Eliminating Single Points of Failure in Software Based Redundancy

Eliminating Single Points of Failure in Software Based Redundancy Eliminating Single Points of Failure in Software Based Redundancy Peter Ulbrich, Martin Hoffmann, Rüdiger Kapitza, Daniel Lohmann, Reiner Schmid and Wolfgang Schröder-Preikschat EDCC May 9, 2012 SYSTEM

More information

Memory Bandwidth and Low Precision Computation. CS6787 Lecture 9 Fall 2017

Memory Bandwidth and Low Precision Computation. CS6787 Lecture 9 Fall 2017 Memory Bandwidth and Low Precision Computation CS6787 Lecture 9 Fall 2017 Memory as a Bottleneck So far, we ve just been talking about compute e.g. techniques to decrease the amount of compute by decreasing

More information

OPERATING SYSTEM SUPPORT FOR REDUNDANT MULTITHREADING. Björn Döbel (TU Dresden)

OPERATING SYSTEM SUPPORT FOR REDUNDANT MULTITHREADING. Björn Döbel (TU Dresden) OPERATING SYSTEM SUPPORT FOR REDUNDANT MULTITHREADING Björn Döbel (TU Dresden) Brussels, 02.02.2013 Hardware Faults Radiation-induced soft errors Mainly an issue in avionics+space 1 DRAM errors in large

More information

CS 470 Spring Fault Tolerance. Mike Lam, Professor. Content taken from the following:

CS 470 Spring Fault Tolerance. Mike Lam, Professor. Content taken from the following: CS 47 Spring 27 Mike Lam, Professor Fault Tolerance Content taken from the following: "Distributed Systems: Principles and Paradigms" by Andrew S. Tanenbaum and Maarten Van Steen (Chapter 8) Various online

More information

Outline of Presentation Field Programmable Gate Arrays (FPGAs(

Outline of Presentation Field Programmable Gate Arrays (FPGAs( FPGA Architectures and Operation for Tolerating SEUs Chuck Stroud Electrical and Computer Engineering Auburn University Outline of Presentation Field Programmable Gate Arrays (FPGAs( FPGAs) How Programmable

More information

Basic concepts in fault tolerance Masking failure by redundancy Process resilience Reliable communication. Distributed commit.

Basic concepts in fault tolerance Masking failure by redundancy Process resilience Reliable communication. Distributed commit. Basic concepts in fault tolerance Masking failure by redundancy Process resilience Reliable communication One-one communication One-many communication Distributed commit Two phase commit Failure recovery

More information

Memory Bandwidth and Low Precision Computation. CS6787 Lecture 10 Fall 2018

Memory Bandwidth and Low Precision Computation. CS6787 Lecture 10 Fall 2018 Memory Bandwidth and Low Precision Computation CS6787 Lecture 10 Fall 2018 Memory as a Bottleneck So far, we ve just been talking about compute e.g. techniques to decrease the amount of compute by decreasing

More information

6.033 Lecture Fault Tolerant Computing 3/31/2014

6.033 Lecture Fault Tolerant Computing 3/31/2014 6.033 Lecture 14 -- Fault Tolerant Computing 3/31/2014 So far what have we seen: Modularity RPC Processes Client / server Networking Implements client/server Seen a few examples of dealing with faults

More information

Associate Professor Dr. Raed Ibraheem Hamed

Associate Professor Dr. Raed Ibraheem Hamed Associate Professor Dr. Raed Ibraheem Hamed University of Human Development, College of Science and Technology Computer Science Department 2015 2016 1 Points to Cover Storing Data in a DBMS Primary Storage

More information

Redundancy in fault tolerant computing. D. P. Siewiorek R.S. Swarz, Reliable Computer Systems, Prentice Hall, 1992

Redundancy in fault tolerant computing. D. P. Siewiorek R.S. Swarz, Reliable Computer Systems, Prentice Hall, 1992 Redundancy in fault tolerant computing D. P. Siewiorek R.S. Swarz, Reliable Computer Systems, Prentice Hall, 1992 1 Redundancy Fault tolerance computing is based on redundancy HARDWARE REDUNDANCY Physical

More information

Issues in Programming Language Design for Embedded RT Systems

Issues in Programming Language Design for Embedded RT Systems CSE 237B Fall 2009 Issues in Programming Language Design for Embedded RT Systems Reliability and Fault Tolerance Exceptions and Exception Handling Rajesh Gupta University of California, San Diego ES Characteristics

More information

Reliable Computing I

Reliable Computing I Instructor: Mehdi Tahoori Reliable Computing I Lecture 8: Redundant Disk Arrays INSTITUTE OF COMPUTER ENGINEERING (ITEC) CHAIR FOR DEPENDABLE NANO COMPUTING (CDNC) National Research Center of the Helmholtz

More information

HAFT: Hardware-Assisted Fault Tolerance

HAFT: Hardware-Assisted Fault Tolerance HAFT: Hardware-Assisted Fault Tolerance Dmitrii Kuvaiskii TU Dresden dmitrii.kuvaiskii@tu-dresden.de Pramod Bhatotia TU Dresden pramod.bhatotia@tu-dresden.de Pascal Felber University of Neuchâtel pascal.felber@unine.ch

More information

CprE 458/558: Real-Time Systems. Lecture 17 Fault-tolerant design techniques

CprE 458/558: Real-Time Systems. Lecture 17 Fault-tolerant design techniques : Real-Time Systems Lecture 17 Fault-tolerant design techniques Fault Tolerant Strategies Fault tolerance in computer system is achieved through redundancy in hardware, software, information, and/or computations.

More information

Microarchitecture-Based Introspection: A Technique for Transient-Fault Tolerance in Microprocessors. Moinuddin K. Qureshi Onur Mutlu Yale N.

Microarchitecture-Based Introspection: A Technique for Transient-Fault Tolerance in Microprocessors. Moinuddin K. Qureshi Onur Mutlu Yale N. Microarchitecture-Based Introspection: A Technique for Transient-Fault Tolerance in Microprocessors Moinuddin K. Qureshi Onur Mutlu Yale N. Patt High Performance Systems Group Department of Electrical

More information

Failure Models. Fault Tolerance. Failure Masking by Redundancy. Agreement in Faulty Systems

Failure Models. Fault Tolerance. Failure Masking by Redundancy. Agreement in Faulty Systems Fault Tolerance Fault cause of an error that might lead to failure; could be transient, intermittent, or permanent Fault tolerance a system can provide its services even in the presence of faults Requirements

More information

416 Distributed Systems. Errors and Failures, part 2 Feb 3, 2016

416 Distributed Systems. Errors and Failures, part 2 Feb 3, 2016 416 Distributed Systems Errors and Failures, part 2 Feb 3, 2016 Options in dealing with failure 1. Silently return the wrong answer. 2. Detect failure. 3. Correct / mask the failure 2 Block error detection/correction

More information

Computer Architecture: Multi-Core Processors: Why? Onur Mutlu & Seth Copen Goldstein Carnegie Mellon University 9/11/13

Computer Architecture: Multi-Core Processors: Why? Onur Mutlu & Seth Copen Goldstein Carnegie Mellon University 9/11/13 Computer Architecture: Multi-Core Processors: Why? Onur Mutlu & Seth Copen Goldstein Carnegie Mellon University 9/11/13 Moore s Law Moore, Cramming more components onto integrated circuits, Electronics,

More information

Dep. Systems Requirements

Dep. Systems Requirements Dependable Systems Dep. Systems Requirements Availability the system is ready to be used immediately. A(t) = probability system is available for use at time t MTTF/(MTTF+MTTR) If MTTR can be kept small

More information

FAULT TOLERANT SYSTEMS

FAULT TOLERANT SYSTEMS FAULT TOLERANT SYSTEMS http://www.ecs.umass.edu/ece/koren/faulttolerantsystems Part 18 Chapter 7 Case Studies Part.18.1 Introduction Illustrate practical use of methods described previously Highlight fault-tolerance

More information

COMPUTER ORGANIZATION AND DESIGN. The Hardware/Software Interface. Chapter 4. The Processor: C Multiple Issue Based on P&H

COMPUTER ORGANIZATION AND DESIGN. The Hardware/Software Interface. Chapter 4. The Processor: C Multiple Issue Based on P&H COMPUTER ORGANIZATION AND DESIGN The Hardware/Software Interface Chapter 4 The Processor: C Multiple Issue Based on P&H Instruction-Level Parallelism (ILP) Pipelining: executing multiple instructions in

More information

Duke University Department of Electrical and Computer Engineering

Duke University Department of Electrical and Computer Engineering Duke University Department of Electrical and Computer Engineering Senior Honors Thesis Spring 2008 Proving the Completeness of Error Detection Mechanisms in Simple Core Chip Multiprocessors Michael Edward

More information

Distributed Operating Systems

Distributed Operating Systems 2 Distributed Operating Systems System Models, Processor Allocation, Distributed Scheduling, and Fault Tolerance Steve Goddard goddard@cse.unl.edu http://www.cse.unl.edu/~goddard/courses/csce855 System

More information

Virtual Memory. Reading. Sections 5.4, 5.5, 5.6, 5.8, 5.10 (2) Lecture notes from MKP and S. Yalamanchili

Virtual Memory. Reading. Sections 5.4, 5.5, 5.6, 5.8, 5.10 (2) Lecture notes from MKP and S. Yalamanchili Virtual Memory Lecture notes from MKP and S. Yalamanchili Sections 5.4, 5.5, 5.6, 5.8, 5.10 Reading (2) 1 The Memory Hierarchy ALU registers Cache Memory Memory Memory Managed by the compiler Memory Managed

More information

Advanced processor designs

Advanced processor designs Advanced processor designs We ve only scratched the surface of CPU design. Today we ll briefly introduce some of the big ideas and big words behind modern processors by looking at two example CPUs. The

More information

Fault Tolerance. Distributed Systems IT332

Fault Tolerance. Distributed Systems IT332 Fault Tolerance Distributed Systems IT332 2 Outline Introduction to fault tolerance Reliable Client Server Communication Distributed commit Failure recovery 3 Failures, Due to What? A system is said to

More information

Safety and Reliability of Software-Controlled Systems Part 14: Fault mitigation

Safety and Reliability of Software-Controlled Systems Part 14: Fault mitigation Safety and Reliability of Software-Controlled Systems Part 14: Fault mitigation Prof. Dr.-Ing. Stefan Kowalewski Chair Informatik 11, Embedded Software Laboratory RWTH Aachen University Summer Semester

More information

Selective Validations for Efficient Protections on Coarse-Grained Reconfigurable Architectures

Selective Validations for Efficient Protections on Coarse-Grained Reconfigurable Architectures Selective Validations for Efficient Protections on Coarse-Grained Reconfigurable Architectures Jihoon Kang The Graduate School Yonsei University Department of Computer Science Selective Validations for

More information

ECE 571 Advanced Microprocessor-Based Design Lecture 4

ECE 571 Advanced Microprocessor-Based Design Lecture 4 ECE 571 Advanced Microprocessor-Based Design Lecture 4 Vince Weaver http://www.eece.maine.edu/~vweaver vincent.weaver@maine.edu 28 January 2016 Homework #1 was due Announcements Homework #2 will be posted

More information

Lecture 27: Pot-Pourri. Today s topics: Shared memory vs message-passing Simultaneous multi-threading (SMT) GPUs Disks and reliability

Lecture 27: Pot-Pourri. Today s topics: Shared memory vs message-passing Simultaneous multi-threading (SMT) GPUs Disks and reliability Lecture 27: Pot-Pourri Today s topics: Shared memory vs message-passing Simultaneous multi-threading (SMT) GPUs Disks and reliability 1 Shared-Memory Vs. Message-Passing Shared-memory: Well-understood

More information

Processor (IV) - advanced ILP. Hwansoo Han

Processor (IV) - advanced ILP. Hwansoo Han Processor (IV) - advanced ILP Hwansoo Han Instruction-Level Parallelism (ILP) Pipelining: executing multiple instructions in parallel To increase ILP Deeper pipeline Less work per stage shorter clock cycle

More information

CS5460: Operating Systems Lecture 20: File System Reliability

CS5460: Operating Systems Lecture 20: File System Reliability CS5460: Operating Systems Lecture 20: File System Reliability File System Optimizations Modern Historic Technique Disk buffer cache Aggregated disk I/O Prefetching Disk head scheduling Disk interleaving

More information

SIListra. Coded Processing in Medical Devices. Dr. Martin Süßkraut (TU-Dresden / SIListra Systems)

SIListra. Coded Processing in Medical Devices. Dr. Martin Süßkraut (TU-Dresden / SIListra Systems) SIListra making systems safer Coded Processing in Medical Devices Dr. Martin Süßkraut (TU-Dresden / SIListra Systems) martin.suesskraut@se.inf.tu-dresden.de Embedded goes Medical 5./6. Oct. 2011 1 SIListra

More information

FAULT TOLERANT SYSTEMS

FAULT TOLERANT SYSTEMS FAULT TOLERANT SYSTEMS http://www.ecs.umass.edu/ece/koren/faulttolerantsystems Part 5 Processor-Level Techniques & Byzantine Failures Chapter 2 Hardware Fault Tolerance Part.5.1 Processor-Level Techniques

More information

Performance Optimization of Smoothed Particle Hydrodynamics for Multi/Many-Core Architectures

Performance Optimization of Smoothed Particle Hydrodynamics for Multi/Many-Core Architectures Performance Optimization of Smoothed Particle Hydrodynamics for Multi/Many-Core Architectures Dr. Fabio Baruffa Dr. Luigi Iapichino Leibniz Supercomputing Centre fabio.baruffa@lrz.de Outline of the talk

More information

Eric Rotenberg Karthik Sundaramoorthy, Zach Purser

Eric Rotenberg Karthik Sundaramoorthy, Zach Purser Karthik Sundaramoorthy, Zach Purser Dept. of Electrical and Computer Engineering North Carolina State University http://www.tinker.ncsu.edu/ericro ericro@ece.ncsu.edu Many means to an end Program is merely

More information

How to Write Fast Numerical Code

How to Write Fast Numerical Code How to Write Fast Numerical Code Lecture: Memory hierarchy, locality, caches Instructor: Markus Püschel TA: Alen Stojanov, Georg Ofenbeck, Gagandeep Singh Organization Temporal and spatial locality Memory

More information

Optimized Scientific Computing:

Optimized Scientific Computing: Optimized Scientific Computing: Coding Efficiently for Real Computing Architectures Noah Kurinsky SASS Talk, November 11 2015 Introduction Components of a CPU Architecture Design Choices Why Is This Relevant

More information

Networks and distributed computing

Networks and distributed computing Networks and distributed computing Abstractions provided for networks network card has fixed MAC address -> deliver message to computer on LAN -> machine-to-machine communication -> unordered messages

More information

Practical Byzantine Fault

Practical Byzantine Fault Practical Byzantine Fault Tolerance Practical Byzantine Fault Tolerance Castro and Liskov, OSDI 1999 Nathan Baker, presenting on 23 September 2005 What is a Byzantine fault? Rationale for Byzantine Fault

More information

CprE Fault Tolerance. Dr. Yong Guan. Department of Electrical and Computer Engineering & Information Assurance Center Iowa State University

CprE Fault Tolerance. Dr. Yong Guan. Department of Electrical and Computer Engineering & Information Assurance Center Iowa State University Fault Tolerance Dr. Yong Guan Department of Electrical and Computer Engineering & Information Assurance Center Iowa State University Outline for Today s Talk Basic Concepts Process Resilience Reliable

More information

Transient Fault Detection and Reducing Transient Error Rate. Jose Lugo-Martinez CSE 240C: Advanced Microarchitecture Prof.

Transient Fault Detection and Reducing Transient Error Rate. Jose Lugo-Martinez CSE 240C: Advanced Microarchitecture Prof. Transient Fault Detection and Reducing Transient Error Rate Jose Lugo-Martinez CSE 240C: Advanced Microarchitecture Prof. Steven Swanson Outline Motivation What are transient faults? Hardware Fault Detection

More information

Lecture 25: Interconnection Networks, Disks. Topics: flow control, router microarchitecture, RAID

Lecture 25: Interconnection Networks, Disks. Topics: flow control, router microarchitecture, RAID Lecture 25: Interconnection Networks, Disks Topics: flow control, router microarchitecture, RAID 1 Virtual Channel Flow Control Each switch has multiple virtual channels per phys. channel Each virtual

More information

CS6303 Computer Architecture Regulation 2013 BE-Computer Science and Engineering III semester 2 MARKS

CS6303 Computer Architecture Regulation 2013 BE-Computer Science and Engineering III semester 2 MARKS CS6303 Computer Architecture Regulation 2013 BE-Computer Science and Engineering III semester 2 MARKS UNIT-I OVERVIEW & INSTRUCTIONS 1. What are the eight great ideas in computer architecture? The eight

More information

Physical Storage Media

Physical Storage Media Physical Storage Media These slides are a modified version of the slides of the book Database System Concepts, 5th Ed., McGraw-Hill, by Silberschatz, Korth and Sudarshan. Original slides are available

More information

CS 426 Parallel Computing. Parallel Computing Platforms

CS 426 Parallel Computing. Parallel Computing Platforms CS 426 Parallel Computing Parallel Computing Platforms Ozcan Ozturk http://www.cs.bilkent.edu.tr/~ozturk/cs426/ Slides are adapted from ``Introduction to Parallel Computing'' Topic Overview Implicit Parallelism:

More information

FAULT TOLERANCE. Fault Tolerant Systems. Faults Faults (cont d)

FAULT TOLERANCE. Fault Tolerant Systems. Faults Faults (cont d) Distributed Systems Fö 9/10-1 Distributed Systems Fö 9/10-2 FAULT TOLERANCE 1. Fault Tolerant Systems 2. Faults and Fault Models. Redundancy 4. Time Redundancy and Backward Recovery. Hardware Redundancy

More information

I/O CANNOT BE IGNORED

I/O CANNOT BE IGNORED LECTURE 13 I/O I/O CANNOT BE IGNORED Assume a program requires 100 seconds, 90 seconds for main memory, 10 seconds for I/O. Assume main memory access improves by ~10% per year and I/O remains the same.

More information

Computer Architecture: Multi-Core Processors: Why? Prof. Onur Mutlu Carnegie Mellon University

Computer Architecture: Multi-Core Processors: Why? Prof. Onur Mutlu Carnegie Mellon University Computer Architecture: Multi-Core Processors: Why? Prof. Onur Mutlu Carnegie Mellon University Moore s Law Moore, Cramming more components onto integrated circuits, Electronics, 1965. 2 3 Multi-Core Idea:

More information

Implementation Issues. Remote-Write Protocols

Implementation Issues. Remote-Write Protocols Implementation Issues Two techniques to implement consistency models Primary-based protocols Assume a primary replica for each data item Primary responsible for coordinating all writes Replicated write

More information

Software-defined Storage: Fast, Safe and Efficient

Software-defined Storage: Fast, Safe and Efficient Software-defined Storage: Fast, Safe and Efficient TRY NOW Thanks to Blockchain and Intel Intelligent Storage Acceleration Library Every piece of data is required to be stored somewhere. We all know about

More information

IMPLEMENTATION OF THE. Alexander J. Yee University of Illinois Urbana-Champaign

IMPLEMENTATION OF THE. Alexander J. Yee University of Illinois Urbana-Champaign SINGLE-TRANSPOSE IMPLEMENTATION OF THE OUT-OF-ORDER 3D-FFT Alexander J. Yee University of Illinois Urbana-Champaign The Problem FFTs are extremely memory-intensive. Completely bound by memory access. Memory

More information

How to write powerful parallel Applications

How to write powerful parallel Applications How to write powerful parallel Applications 08:30-09.00 09.00-09:45 09.45-10:15 10:15-10:30 10:30-11:30 11:30-12:30 12:30-13:30 13:30-14:30 14:30-15:15 15:15-15:30 15:30-16:00 16:00-16:45 16:45-17:15 Welcome

More information

The Processor: Instruction-Level Parallelism

The Processor: Instruction-Level Parallelism The Processor: Instruction-Level Parallelism Computer Organization Architectures for Embedded Computing Tuesday 21 October 14 Many slides adapted from: Computer Organization and Design, Patterson & Hennessy

More information

Distributed Systems COMP 212. Lecture 19 Othon Michail

Distributed Systems COMP 212. Lecture 19 Othon Michail Distributed Systems COMP 212 Lecture 19 Othon Michail Fault Tolerance 2/31 What is a Distributed System? 3/31 Distributed vs Single-machine Systems A key difference: partial failures One component fails

More information

Bespoke Processors for Applications with Ultra-low Area and Power Constraints

Bespoke Processors for Applications with Ultra-low Area and Power Constraints Bespoke Processors for Applications with Ultra-low Area and Power Constraints by Cherupalli et al. ISCA 17 Jielun Tan, Tim Wesley Overview Motivation Intro to Bespoke Benchmarks and Results Discussion

More information

ECE 574 Cluster Computing Lecture 19

ECE 574 Cluster Computing Lecture 19 ECE 574 Cluster Computing Lecture 19 Vince Weaver http://www.eece.maine.edu/~vweaver vincent.weaver@maine.edu 10 November 2015 Announcements Projects HW extended 1 MPI Review MPI is *not* shared memory

More information

A Fault Tolerant Superscalar Processor

A Fault Tolerant Superscalar Processor A Fault Tolerant Superscalar Processor 1 [Based on Coverage of a Microarchitecture-level Fault Check Regimen in a Superscalar Processor by V. Reddy and E. Rotenberg (2008)] P R E S E N T E D B Y NAN Z

More information

ARCHITECTURE DESIGN FOR SOFT ERRORS

ARCHITECTURE DESIGN FOR SOFT ERRORS ARCHITECTURE DESIGN FOR SOFT ERRORS Shubu Mukherjee ^ШВпШшр"* AMSTERDAM BOSTON HEIDELBERG LONDON NEW YORK OXFORD PARIS SAN DIEGO T^"ТГПШГ SAN FRANCISCO SINGAPORE SYDNEY TOKYO ^ P f ^ ^ ELSEVIER Morgan

More information

RAID SEMINAR REPORT /09/2004 Asha.P.M NO: 612 S7 ECE

RAID SEMINAR REPORT /09/2004 Asha.P.M NO: 612 S7 ECE RAID SEMINAR REPORT 2004 Submitted on: Submitted by: 24/09/2004 Asha.P.M NO: 612 S7 ECE CONTENTS 1. Introduction 1 2. The array and RAID controller concept 2 2.1. Mirroring 3 2.2. Parity 5 2.3. Error correcting

More information

Hybrid Implementation of 3D Kirchhoff Migration

Hybrid Implementation of 3D Kirchhoff Migration Hybrid Implementation of 3D Kirchhoff Migration Max Grossman, Mauricio Araya-Polo, Gladys Gonzalez GTC, San Jose March 19, 2013 Agenda 1. Motivation 2. The Problem at Hand 3. Solution Strategy 4. GPU Implementation

More information

Fault Tolerance Dealing with an imperfect world

Fault Tolerance Dealing with an imperfect world Fault Tolerance Dealing with an imperfect world Paul Krzyzanowski Rutgers University September 14, 2012 1 Introduction If we look at the words fault and tolerance, we can define the fault as a malfunction

More information

Last Class:Consistency Semantics. Today: More on Consistency

Last Class:Consistency Semantics. Today: More on Consistency Last Class:Consistency Semantics Consistency models Data-centric consistency models Client-centric consistency models Eventual Consistency and epidemic protocols Lecture 16, page 1 Today: More on Consistency

More information

Basics of Performance Engineering

Basics of Performance Engineering ERLANGEN REGIONAL COMPUTING CENTER Basics of Performance Engineering J. Treibig HiPerCH 3, 23./24.03.2015 Why hardware should not be exposed Such an approach is not portable Hardware issues frequently

More information

Using Process-Level Redundancy to Exploit Multiple Cores for Transient Fault Tolerance

Using Process-Level Redundancy to Exploit Multiple Cores for Transient Fault Tolerance Using Process-Level Redundancy to Exploit Multiple Cores for Transient Fault Tolerance Outline Introduction and Motivation Software-centric Fault Detection Process-Level Redundancy Experimental Results

More information

AR-SMT: A Microarchitectural Approach to Fault Tolerance in Microprocessors

AR-SMT: A Microarchitectural Approach to Fault Tolerance in Microprocessors AR-SMT: A Microarchitectural Approach to Fault Tolerance in Microprocessors Computer Sciences Department University of Wisconsin Madison http://www.cs.wisc.edu/~ericro/ericro.html ericro@cs.wisc.edu High-Performance

More information

Dependability tree 1

Dependability tree 1 Dependability tree 1 Means for achieving dependability A combined use of methods can be applied as means for achieving dependability. These means can be classified into: 1. Fault Prevention techniques

More information

Software Techniques for Dependable Computer-based Systems. Matteo SONZA REORDA

Software Techniques for Dependable Computer-based Systems. Matteo SONZA REORDA Software Techniques for Dependable Computer-based Systems Matteo SONZA REORDA Summary Introduction State of the art Assertions Algorithm Based Fault Tolerance (ABFT) Control flow checking Data duplication

More information

EXASCALE IN 2018 REALLY? FRANCK CAPPELLO INRIA&UIUC

EXASCALE IN 2018 REALLY? FRANCK CAPPELLO INRIA&UIUC EASCALE IN 2018 REALLY? FRANCK CAPPELLO INRIA&UIUC What are we talking about? 100M cores 12 cores/node Power Challenges Exascale Technology Roadmap Meeting San Diego California, December 2009. $1M per

More information

Darek Mihocka, Emulators.com Stanislav Shwartsman, Intel Corp. June

Darek Mihocka, Emulators.com Stanislav Shwartsman, Intel Corp. June Darek Mihocka, Emulators.com Stanislav Shwartsman, Intel Corp. June 21 2008 Agenda Introduction Gemulator Bochs Proposed ISA Extensions Conclusions and Future Work Q & A Jun-21-2008 AMAS-BT 2008 2 Introduction

More information

Overview ECE 753: FAULT-TOLERANT COMPUTING 1/21/2014. Recap. Fault Modeling. Fault Modeling (contd.) Fault Modeling (contd.)

Overview ECE 753: FAULT-TOLERANT COMPUTING 1/21/2014. Recap. Fault Modeling. Fault Modeling (contd.) Fault Modeling (contd.) ECE 753: FAULT-TOLERANT COMPUTING Kewal K.Saluja Department of Electrical and Computer Engineering Fault Modeling Lectures Set 2 Overview Fault Modeling References Fault models at different levels (HW)

More information

AN-Encoding Compiler: Building Safety-Critical Systems with Commodity Hardware

AN-Encoding Compiler: Building Safety-Critical Systems with Commodity Hardware AN-Encoding Compiler: Building Safety-Critical Systems with Commodity Hardware Ute Schiffel Christof Fetzer Martin Süßkraut Technische Universität Dresden Institute for System Architecture ute.schiffel@inf.tu-dresden.de

More information

FAULT TOLERANT SYSTEMS

FAULT TOLERANT SYSTEMS FAULT TOLERANT SYSTEMS http://www.ecs.umass.edu/ece/koren/faulttolerantsystems Part 3 - Resilient Structures Chapter 2 HW Fault Tolerance Part.3.1 M-of-N Systems An M-of-N system consists of N identical

More information

RAID 0 (non-redundant) RAID Types 4/25/2011

RAID 0 (non-redundant) RAID Types 4/25/2011 Exam 3 Review COMP375 Topics I/O controllers chapter 7 Disk performance section 6.3-6.4 RAID section 6.2 Pipelining section 12.4 Superscalar chapter 14 RISC chapter 13 Parallel Processors chapter 18 Security

More information

High Performance Computing Course Notes High Performance Storage

High Performance Computing Course Notes High Performance Storage High Performance Computing Course Notes 2008-2009 2009 High Performance Storage Storage devices Primary storage: register (1 CPU cycle, a few ns) Cache (10-200 cycles, 0.02-0.5us) Main memory Local main

More information

Advanced d Instruction Level Parallelism. Computer Systems Laboratory Sungkyunkwan University

Advanced d Instruction Level Parallelism. Computer Systems Laboratory Sungkyunkwan University Advanced d Instruction ti Level Parallelism Jin-Soo Kim (jinsookim@skku.edu) Computer Systems Laboratory Sungkyunkwan University http://csl.skku.edu ILP Instruction-Level Parallelism (ILP) Pipelining:

More information

Fault Tolerance. Distributed Systems. September 2002

Fault Tolerance. Distributed Systems. September 2002 Fault Tolerance Distributed Systems September 2002 Basics A component provides services to clients. To provide services, the component may require the services from other components a component may depend

More information

EE382N (20): Computer Architecture - Parallelism and Locality Spring 2015 Lecture 09 GPUs (II) Mattan Erez. The University of Texas at Austin

EE382N (20): Computer Architecture - Parallelism and Locality Spring 2015 Lecture 09 GPUs (II) Mattan Erez. The University of Texas at Austin EE382 (20): Computer Architecture - ism and Locality Spring 2015 Lecture 09 GPUs (II) Mattan Erez The University of Texas at Austin 1 Recap 2 Streaming model 1. Use many slimmed down cores to run in parallel

More information

Bei Wang, Dmitry Prohorov and Carlos Rosales

Bei Wang, Dmitry Prohorov and Carlos Rosales Bei Wang, Dmitry Prohorov and Carlos Rosales Aspects of Application Performance What are the Aspects of Performance Intel Hardware Features Omni-Path Architecture MCDRAM 3D XPoint Many-core Xeon Phi AVX-512

More information

Computer Architecture: Multithreading (III) Prof. Onur Mutlu Carnegie Mellon University

Computer Architecture: Multithreading (III) Prof. Onur Mutlu Carnegie Mellon University Computer Architecture: Multithreading (III) Prof. Onur Mutlu Carnegie Mellon University A Note on This Lecture These slides are partly from 18-742 Fall 2012, Parallel Computer Architecture, Lecture 13:

More information

Some material adapted from Mohamed Younis, UMBC CMSC 611 Spr 2003 course slides Some material adapted from Hennessy & Patterson / 2003 Elsevier

Some material adapted from Mohamed Younis, UMBC CMSC 611 Spr 2003 course slides Some material adapted from Hennessy & Patterson / 2003 Elsevier Some material adapted from Mohamed Younis, UMBC CMSC 6 Spr 23 course slides Some material adapted from Hennessy & Patterson / 23 Elsevier Science Characteristics IBM 39 IBM UltraStar Integral 82 Disk diameter

More information

Software-based Fault Tolerance Mission (Im)possible?

Software-based Fault Tolerance Mission (Im)possible? Software-based Fault Tolerance Mission Im)possible? Peter Ulbrich The 29th CREST Open Workshop on Software Redundancy November 18, 2013 System Software Group http://www4.cs.fau.de Embedded Systems Initiative

More information

Intel iapx 432-VLSI building blocks for a fault-tolerant computer

Intel iapx 432-VLSI building blocks for a fault-tolerant computer Intel iapx 432-VLSI building blocks for a fault-tolerant computer by DAVE JOHNSON, DAVE BUDDE, DAVE CARSON, and CRAIG PETERSON Intel Corporation Aloha, Oregon ABSTRACT Early in 1983 two new VLSI components

More information

CSE 5306 Distributed Systems

CSE 5306 Distributed Systems CSE 5306 Distributed Systems Fault Tolerance Jia Rao http://ranger.uta.edu/~jrao/ 1 Failure in Distributed Systems Partial failure Happens when one component of a distributed system fails Often leaves

More information

CDA 5140 Software Fault-tolerance. - however, reliability of the overall system is actually a product of the hardware, software, and human reliability

CDA 5140 Software Fault-tolerance. - however, reliability of the overall system is actually a product of the hardware, software, and human reliability CDA 5140 Software Fault-tolerance - so far have looked at reliability as hardware reliability - however, reliability of the overall system is actually a product of the hardware, software, and human reliability

More information