Elzar Triple Modular Redundancy using Intel AVX
|
|
- Laurel Day
- 6 years ago
- Views:
Transcription
1 Elzar Triple Modular Redundancy using Intel AVX Dmitrii Kuvaiskii Oleksii Oleksenko Pramod Bhatotia Christof Fetzer Pascal Felber
2 Hardware Errors in the Wild Online services run in huge data centers 1
3 Hardware Errors in the Wild Online services run in huge data centers Hardware faults are the norm rather than the exception 1
4 Hardware Errors in the Wild Online services run in huge data centers Hardware faults are the norm rather than the exception These faults can lead to arbitrary state corruptions 1
5 Hardware Errors in the Wild Online services run in huge data centers Hardware faults are the norm rather than the exception These faults can lead to arbitrary state corruptions There were a handful of messages that had a single bit corrupted such that the message was still intelligible, but the system state information was incorrect 1
6 Hardware Errors in the Wild Online services run in huge data centers Hardware faults are the norm rather than the exception These faults can lead to arbitrary state corruptions There were a handful of messages that had a single bit corrupted such that the message was still intelligible, but the system state information was incorrect Corruption can occur transiently in CPU or RAM. Guarding against such corruptions is an important goal in Mesa s overall design 1
7 Protecting Against Data Corruptions Principled approaches Ad-hoc approaches 2
8 Protecting Against Data Corruptions Principled approaches Ad-hoc approaches Byzantine Fault Tolerance 2
9 Protecting Against Data Corruptions Principled approaches Ad-hoc approaches Byzantine Fault Tolerance Tolerates arbitrary faults Pessimistic fault model High resource overheads 2
10 Protecting Against Data Corruptions Principled approaches Byzantine Fault Tolerance Ad-hoc approaches Checksums / Assertions Tolerates arbitrary faults Pessimistic fault model High resource overheads 2
11 Protecting Against Data Corruptions Principled approaches Byzantine Fault Tolerance Ad-hoc approaches Checksums / Assertions Tolerates arbitrary faults Low performance overheads Pessimistic fault model Only anticipated faults High resource overheads Manual and error-prone 2
12 Protecting Against Data Corruptions Principled approaches Ad-hoc approaches Byzantine Fault Tolerance Checksums / Assertions Triple Modular Redundancy 2
13 Protecting Against Data Corruptions Principled approaches Ad-hoc approaches Byzantine Fault Tolerance better performance Checksums / Assertions Triple Modular Redundancy Practical fault model 2
14 Protecting Against Data Corruptions Principled approaches Ad-hoc approaches Byzantine Fault Tolerance better performance Triple Modular Redundancy Practical fault model Disciplined protection Checksums / Assertions better fault coverage 2
15 Triple Modular Redundancy Source code Triple Modular Redundancy (TMR) Executable CPU 3
16 Triple Modular Redundancy Source code Triple Modular Redundancy (TMR) Executable CPU Native z = add x, y 3
17 Triple Modular Redundancy Source code Triple Modular Redundancy (TMR) Executable CPU Native TMR z z = add x, y z2 = add x2, y2 z3 = add x3, y3 = add x, y 3
18 Triple Modular Redundancy Source code Triple Modular Redundancy (TMR) with SIMD Executable CPU Native TMR TMR with SIMD z z = add x, y z2 = add x2, y2 z3 = add x3, y3 zwide = add xwide, ywide = add x, y 3
19 Triple Modular Redundancy Triple Modular Redundancy (TMR) with SIMD Source code Executable CPU Native TMR TMR with SIMD z z = add x, y z2 = add x2, y2 z3 = add x3, y3 zwide = add xwide, ywide = add x, y Idea SIMD uses less instructions less perf overhead? 3
20 Research Question Can we use SIMD for efficient fault tolerance? 4
21 Research Question Can we use SIMD for efficient fault tolerance? Implementation using Intel AVX 4
22 Elzar Source code Elzar Executable CPU 5
23 Elzar Source code Elzar Executable AVX CPU Implementation Intel AVX for triple modular redundancy 5
24 Elzar Source code Elzar Executable AVX CPU Implementation Intel AVX for triple modular redundancy Outcome Mixed results (AVX not general-purpose) 5
25 Elzar Source code Elzar Executable AVX CPU Implementation Intel AVX for triple modular redundancy Outcome Mixed results (AVX not general-purpose) Investigation Two bottlenecks in current AVX 5
26 Motivation Intel AVX Design Evaluation Discussion
27 Intel AVX Background 6
28 Intel AVX Background 256 bit New registers YMM0 YMM1 YMM2 YMM15 64-bit 64-bit 64-bit 64-bit 6
29 Intel AVX Background 256 bit New registers YMM0 YMM1 YMM2 YMM15 New instructions + 64-bit 64-bit 64-bit 64-bit x1 x2 x3 x4 YMM0 y1 y2 y3 y4 YMM1 x1+y1 x2+y2 x3+y3 x4+y4 YMM2 6
30 AVX as free resource? AVX is not actively used? 7
31 AVX as free resource? AVX is not actively used? unused used 7
32 AVX as free resource? AVX is not actively used? unused used 7
33 AVX as free resource? AVX is not actively used? unused used 7
34 AVX as free resource? AVX is not actively used? Reuse for fault tolerance! unused used 7
35 Motivation Intel AVX Design Evaluation Discussion
36 Fault Model Protect against transient faults in CPU corruptions in CPU registers CPU miscomputations in CPU execution units Memory is protected by other means DRAM protected by ECC Memory CPU caches protected by ECC and parity 8
37 Elzar by Example Native x z = load a = add x, 1 9
38 Elzar by Example Native TMR x z x z z2 z3 = load a = add x, 1 = = = = load a add x, 1 add x2, 1 add x3, 1 majority(z, z2, z3) 9
39 Elzar by Example Native TMR x z x z z2 z3 = load a = add x, 1 = = = = Elzar load a add x, 1 add x2, 1 add x3, 1 majority(z, z2, z3) x z = avx-load a = avx-add x, 1 avx-check(z) 9
40 Elzar by Example Native TMR x z x z z2 z3 = load a = add x, 1 = = = = Elzar load a add x, 1 add x2, 1 add x3, 1 majority(z, z2, z3) x z = avx-load a = avx-add x, 1 avx-check(z) Common instructions introduce lower overhead 9
41 Elzar by Example Native TMR x z x z z2 z3 = load a = add x, 1 = = = = Elzar load a add x, 1 add x2, 1 add x3, 1 majority(z, z2, z3) x z = avx-load a = avx-add x, 1 avx-check(z) Bottleneck 1: Memory accesses require wrappers 9
42 Elzar by Example Native TMR x z x z z2 z3 = load a = add x, 1 = = = = Elzar load a add x, 1 add x2, 1 add x3, 1 majority(z, z2, z3) x z = avx-load a = avx-add x, 1 avx-check(z) Bottleneck 2: Checks are expensive 9
43 Bottleneck 1: Memory accesses x = avx-load a z = avx-add x, avx-check(z) 1 10
44 Bottleneck 1: Memory accesses x = avx-load a z = avx-add x, avx-check(z) No such thing as avx-load! 1 10
45 Bottleneck 1: Memory accesses x = avx-load a z = avx-add x, avx-check(z) 1 a a a a 10
46 Bottleneck 1: Memory accesses x = avx-load a z = avx-add x, avx-check(z) 1 a a a a extract a 10
47 Bottleneck 1: Memory accesses x = avx-load a z = avx-add x, avx-check(z) 1 a a a a extract RAM a load x 10
48 Bottleneck 1: Memory accesses x = avx-load a z = avx-add x, avx-check(z) 1 a a a a extract RAM load a x broadcast x x x x 10
49 Bottleneck 1: Memory accesses x = avx-load a z = avx-add x, avx-check(z) 1 a a a a extract RAM load a x broadcast x x x x Memory accesses require wrappers 10
50 Bottleneck 2: Checks x = avx-load a z = avx-add x, avx-check(z) 1 11
51 Bottleneck 2: Checks x = avx-load a z = avx-add x, avx-check(z) 1 z1 z2 z3 z4 11
52 Bottleneck 2: Checks x = avx-load a z = avx-add x, avx-check(z) 1 z1 z2 z3 z4 z2 z1 z4 z3 z2 z1 z1 z2 z4 z3 z3 z4 shuffle xor 11
53 Bottleneck 2: Checks x = avx-load a z = avx-add x, avx-check(z) 1 z1 z2 z3 z4 z2 z1 z4 z3 z2 z1 z1 z2 z4 z3 z3 z4 shuffle xor ptest FLAGS <all zeros, proceed> 11
54 Bottleneck 2: Checks x = avx-load a z = avx-add x, avx-check(z) 1 z z z' z shuffle xor ptest FLAGS 11
55 Bottleneck 2: Checks x = avx-load a z = avx-add x, avx-check(z) 1 z z z' z z z z z' z z z z z z' z' z shuffle xor ptest FLAGS 11
56 Bottleneck 2: Checks x = avx-load a z = avx-add x, avx-check(z) 1 z z z' z z z z z' z z z z z z' z' z shuffle xor ptest FLAGS jne recover 11
57 Bottleneck 2: Checks x = avx-load a z = avx-add x, avx-check(z) 1 z z z' z z z z z' z z z z z z' z' z shuffle xor ptest FLAGS jne recover Checks introduce additional 55% overhead 11
58 Motivation Intel AVX Design Evaluation Discussion
59 Performance overheads 12
60 Performance overheads lower better 12
61 Performance overheads lower better 12
62 Performance overheads lower better Average performance overhead is 4 12
63 Performance overheads lower better Amortized by very poor memory access pattern 12
64 Performance overheads lower better (1) Native benefits from AVX vectorization (2) Most of time spent in mem-intensive bzero() 12
65 Comparison with SWIFT-R (state-of-the-art TMR approach) 13
66 Comparison with SWIFT-R (state-of-the-art TMR approach) lower better 13
67 Comparison with SWIFT-R (state-of-the-art TMR approach) lower better 13
68 Comparison with SWIFT-R (state-of-the-art TMR approach) lower better Elzar performs 46% worse than SWIFT-R on average 13
69 Comparison with SWIFT-R (state-of-the-art TMR approach) lower better Dominated by memory accesses, Elzar inserts many wrappers 13
70 Comparison with SWIFT-R (state-of-the-art TMR approach) lower better Elzar produces 3x less instructions than SWIFT-R 13
71 Motivation Intel AVX Design Evaluation Discussion
72 Bottlenecks and Proposed Solution Problem AVX lacks certain instructions Need wrappers for memory accesses Need shuffle-xor-ptest for checks 14
73 Bottlenecks and Proposed Solution Problem AVX lacks certain instructions Need wrappers for memory accesses Need shuffle-xor-ptest for checks Solution Offload to FPGA accelerator Intel's Xeon-FPGA pairing 14
74 Bottlenecks and Proposed Solution Problem AVX lacks certain instructions Need wrappers for memory accesses Need shuffle-xor-ptest for checks Solution Offload to FPGA accelerator Intel's Xeon-FPGA pairing FPGA CPU addr addr addr addr addr majority voting load Memory val replication val val val val 14
75 Bottlenecks and Proposed Solution Problem AVX lacks certain instructions Need wrappers for memory accesses Need shuffle-xor-ptest for checks Solution Offload to FPGA accelerator Intel's Xeon-FPGA pairing FPGA CPU addr addr addr addr addr majority voting load Memory val replication val val val val Potentially 71% better than SWIFT-R 14
76 Conclusion 15
77 Conclusion Source code Elzar Executable AVX CPU 15
78 Conclusion Source code Elzar Executable AVX CPU Implementation Intel AVX for triple modular redundancy 15
79 Conclusion Source code Elzar Executable AVX CPU Implementation Intel AVX for triple modular redundancy Hypothesis Less instructions less perf overhead 15
80 Conclusion Source code Elzar Executable AVX CPU Implementation Intel AVX for triple modular redundancy Hypothesis Less instructions less perf overhead Outcome 46% worse than SWIFT-R 15
81 Conclusion Source code Elzar Executable AVX CPU Implementation Intel AVX for triple modular redundancy Hypothesis Less instructions less perf overhead Outcome 46% worse than SWIFT-R Discussion With FPGA, ~71% better than SWIFT-R 15
82 Thank you! GitHub repo:
83 backup slides
84 Intel Haswell CPU microarchitecture Instruction Scheduler port 0 64-bit ALU port 1 64-bit ALU 256-bit MUL 256-bit FMA port 5 64-bit ALU 256-bit ALU 256-bit Shuffle ports 2-4,7 256-bit FADD Memory access 256-bit ALU port 6 64-bit ALU
85 Intel Haswell CPU microarchitecture Instruction Scheduler port 0 64-bit ALU port 1 64-bit ALU 256-bit MUL 256-bit FMA port 5 64-bit ALU 256-bit ALU 256-bit Shuffle ports 2-4,7 256-bit FADD Memory access 256-bit ALU port 6 64-bit ALU Intel AVX units
86 Elzar: Check on Branch Native cmp x, y jne truebranch
87 Elzar: Check on Branch Native cmp x, y jne truebranch x1 x2 x3 x4 y1 y2 y3 y4 cmpeq x1=y1 x2=y2 x3=y3 x4=y4 ptest jne FLAGS ja je true branch false branch recover
88 Elzar: Check on Branch Native cmp x, y jne truebranch x1 x2 x3 x4 y1 y2 y3 y4 cmpeq x1=y1 x2=y2 x3=y3 x4=y4 ptest jne FLAGS ja je true branch false branch recover Impact Checks on branches introduce only 4% overhead
89 Main Bottlenecks Problem AVX instruction set lacks certain instructions Loads/stores require extracting AVX-replicated address Elzar creates expensive wrappers around loads/stores AVX-512 introduces promising gather/scatter instructions AVX has only one control-flow instruction ptest Elzar has to insert ptest for each control-flow decision AVX could add cmp instructions directly affecting control flow AVX misses integer division, integer truncation, etc Elzar produces very ineffective code in these cases
90 Estimation of Proposed Changes lower better
91 Estimation of Proposed Changes lower better
92 Estimation of Proposed Changes lower better
93 Estimation of Proposed Changes lower better Estimated average overhead would be 48% (71% improvement over SWIFT-R)
HAFT Hardware-Assisted Fault Tolerance
HAFT Hardware-Assisted Fault Tolerance Dmitrii Kuvaiskii Rasha Faqeh Pramod Bhatotia Christof Fetzer Technische Universität Dresden Pascal Felber Université de Neuchâtel Hardware Errors in the Wild Online
More informationSGXBounds Memory Safety for Shielded Execution
SGXBounds Memory Safety for Shielded Execution Dmitrii Kuvaiskii, Oleksii Oleksenko, Sergei Arnautov, Bohdan Trach, Pramod Bhatotia *, Pascal Felber, Christof Fetzer TU Dresden, * The University of Edinburgh,
More informationDistributed Systems 24. Fault Tolerance
Distributed Systems 24. Fault Tolerance Paul Krzyzanowski pxk@cs.rutgers.edu 1 Faults Deviation from expected behavior Due to a variety of factors: Hardware failure Software bugs Operator errors Network
More informationDistributed Systems 23. Fault Tolerance
Distributed Systems 23. Fault Tolerance Paul Krzyzanowski pxk@cs.rutgers.edu 4/20/2011 1 Faults Deviation from expected behavior Due to a variety of factors: Hardware failure Software bugs Operator errors
More informationRedundancy in fault tolerant computing. D. P. Siewiorek R.S. Swarz, Reliable Computer Systems, Prentice Hall, 1992
Redundancy in fault tolerant computing D. P. Siewiorek R.S. Swarz, Reliable Computer Systems, Prentice Hall, 1992 1 Redundancy Fault tolerance computing is based on redundancy HARDWARE REDUNDANCY Physical
More informationLecture: Storage, GPUs. Topics: disks, RAID, reliability, GPUs (Appendix D, Ch 4)
Lecture: Storage, GPUs Topics: disks, RAID, reliability, GPUs (Appendix D, Ch 4) 1 Magnetic Disks A magnetic disk consists of 1-12 platters (metal or glass disk covered with magnetic recording material
More informationEC-Bench: Benchmarking Onload and Offload Erasure Coders on Modern Hardware Architectures
EC-Bench: Benchmarking Onload and Offload Erasure Coders on Modern Hardware Architectures Haiyang Shi, Xiaoyi Lu, and Dhabaleswar K. (DK) Panda {shi.876, lu.932, panda.2}@osu.edu The Ohio State University
More informationTSW Reliability and Fault Tolerance
TSW Reliability and Fault Tolerance Alexandre David 1.2.05 Credits: some slides by Alan Burns & Andy Wellings. Aims Understand the factors which affect the reliability of a system. Introduce how software
More informationRegisters. Registers
All computers have some registers visible at the ISA level. They are there to control execution of the program hold temporary results visible at the microarchitecture level, such as the Top Of Stack (TOS)
More informationUltra Low-Cost Defect Protection for Microprocessor Pipelines
Ultra Low-Cost Defect Protection for Microprocessor Pipelines Smitha Shyam Kypros Constantinides Sujay Phadke Valeria Bertacco Todd Austin Advanced Computer Architecture Lab University of Michigan Key
More informationDistributed Systems. Fault Tolerance. Paul Krzyzanowski
Distributed Systems Fault Tolerance Paul Krzyzanowski Except as otherwise noted, the content of this presentation is licensed under the Creative Commons Attribution 2.5 License. Faults Deviation from expected
More informationError Mitigation of Point-to-Point Communication for Fault-Tolerant Computing
Error Mitigation of Point-to-Point Communication for Fault-Tolerant Computing Authors: Robert L Akamine, Robert F. Hodson, Brock J. LaMeres, and Robert E. Ray www.nasa.gov Contents Introduction to the
More informationSiewiorek, Daniel P.; Swarz, Robert S.: Reliable Computer Systems. third. Wellesley, MA : A. K. Peters, Ltd., 1998., X
Dependable Systems Hardware Dependability - Diagnosis Dr. Peter Tröger Sources: Siewiorek, Daniel P.; Swarz, Robert S.: Reliable Computer Systems. third. Wellesley, MA : A. K. Peters, Ltd., 1998., 156881092X
More informationDistributed Systems. 19. Fault Tolerance Paul Krzyzanowski. Rutgers University. Fall 2013
Distributed Systems 19. Fault Tolerance Paul Krzyzanowski Rutgers University Fall 2013 November 27, 2013 2013 Paul Krzyzanowski 1 Faults Deviation from expected behavior Due to a variety of factors: Hardware
More informationSelf-Repair for Robust System Design. Yanjing Li Intel Labs Stanford University
Self-Repair for Robust System Design Yanjing Li Intel Labs Stanford University 1 Hardware Failures: Major Concern Permanent: our focus Temporary 2 Tolerating Permanent Hardware Failures Detection Diagnosis
More informationEliminating Single Points of Failure in Software Based Redundancy
Eliminating Single Points of Failure in Software Based Redundancy Peter Ulbrich, Martin Hoffmann, Rüdiger Kapitza, Daniel Lohmann, Reiner Schmid and Wolfgang Schröder-Preikschat EDCC May 9, 2012 SYSTEM
More informationMemory Bandwidth and Low Precision Computation. CS6787 Lecture 9 Fall 2017
Memory Bandwidth and Low Precision Computation CS6787 Lecture 9 Fall 2017 Memory as a Bottleneck So far, we ve just been talking about compute e.g. techniques to decrease the amount of compute by decreasing
More informationOPERATING SYSTEM SUPPORT FOR REDUNDANT MULTITHREADING. Björn Döbel (TU Dresden)
OPERATING SYSTEM SUPPORT FOR REDUNDANT MULTITHREADING Björn Döbel (TU Dresden) Brussels, 02.02.2013 Hardware Faults Radiation-induced soft errors Mainly an issue in avionics+space 1 DRAM errors in large
More informationCS 470 Spring Fault Tolerance. Mike Lam, Professor. Content taken from the following:
CS 47 Spring 27 Mike Lam, Professor Fault Tolerance Content taken from the following: "Distributed Systems: Principles and Paradigms" by Andrew S. Tanenbaum and Maarten Van Steen (Chapter 8) Various online
More informationOutline of Presentation Field Programmable Gate Arrays (FPGAs(
FPGA Architectures and Operation for Tolerating SEUs Chuck Stroud Electrical and Computer Engineering Auburn University Outline of Presentation Field Programmable Gate Arrays (FPGAs( FPGAs) How Programmable
More informationBasic concepts in fault tolerance Masking failure by redundancy Process resilience Reliable communication. Distributed commit.
Basic concepts in fault tolerance Masking failure by redundancy Process resilience Reliable communication One-one communication One-many communication Distributed commit Two phase commit Failure recovery
More informationMemory Bandwidth and Low Precision Computation. CS6787 Lecture 10 Fall 2018
Memory Bandwidth and Low Precision Computation CS6787 Lecture 10 Fall 2018 Memory as a Bottleneck So far, we ve just been talking about compute e.g. techniques to decrease the amount of compute by decreasing
More information6.033 Lecture Fault Tolerant Computing 3/31/2014
6.033 Lecture 14 -- Fault Tolerant Computing 3/31/2014 So far what have we seen: Modularity RPC Processes Client / server Networking Implements client/server Seen a few examples of dealing with faults
More informationAssociate Professor Dr. Raed Ibraheem Hamed
Associate Professor Dr. Raed Ibraheem Hamed University of Human Development, College of Science and Technology Computer Science Department 2015 2016 1 Points to Cover Storing Data in a DBMS Primary Storage
More informationRedundancy in fault tolerant computing. D. P. Siewiorek R.S. Swarz, Reliable Computer Systems, Prentice Hall, 1992
Redundancy in fault tolerant computing D. P. Siewiorek R.S. Swarz, Reliable Computer Systems, Prentice Hall, 1992 1 Redundancy Fault tolerance computing is based on redundancy HARDWARE REDUNDANCY Physical
More informationIssues in Programming Language Design for Embedded RT Systems
CSE 237B Fall 2009 Issues in Programming Language Design for Embedded RT Systems Reliability and Fault Tolerance Exceptions and Exception Handling Rajesh Gupta University of California, San Diego ES Characteristics
More informationReliable Computing I
Instructor: Mehdi Tahoori Reliable Computing I Lecture 8: Redundant Disk Arrays INSTITUTE OF COMPUTER ENGINEERING (ITEC) CHAIR FOR DEPENDABLE NANO COMPUTING (CDNC) National Research Center of the Helmholtz
More informationHAFT: Hardware-Assisted Fault Tolerance
HAFT: Hardware-Assisted Fault Tolerance Dmitrii Kuvaiskii TU Dresden dmitrii.kuvaiskii@tu-dresden.de Pramod Bhatotia TU Dresden pramod.bhatotia@tu-dresden.de Pascal Felber University of Neuchâtel pascal.felber@unine.ch
More informationCprE 458/558: Real-Time Systems. Lecture 17 Fault-tolerant design techniques
: Real-Time Systems Lecture 17 Fault-tolerant design techniques Fault Tolerant Strategies Fault tolerance in computer system is achieved through redundancy in hardware, software, information, and/or computations.
More informationMicroarchitecture-Based Introspection: A Technique for Transient-Fault Tolerance in Microprocessors. Moinuddin K. Qureshi Onur Mutlu Yale N.
Microarchitecture-Based Introspection: A Technique for Transient-Fault Tolerance in Microprocessors Moinuddin K. Qureshi Onur Mutlu Yale N. Patt High Performance Systems Group Department of Electrical
More informationFailure Models. Fault Tolerance. Failure Masking by Redundancy. Agreement in Faulty Systems
Fault Tolerance Fault cause of an error that might lead to failure; could be transient, intermittent, or permanent Fault tolerance a system can provide its services even in the presence of faults Requirements
More information416 Distributed Systems. Errors and Failures, part 2 Feb 3, 2016
416 Distributed Systems Errors and Failures, part 2 Feb 3, 2016 Options in dealing with failure 1. Silently return the wrong answer. 2. Detect failure. 3. Correct / mask the failure 2 Block error detection/correction
More informationComputer Architecture: Multi-Core Processors: Why? Onur Mutlu & Seth Copen Goldstein Carnegie Mellon University 9/11/13
Computer Architecture: Multi-Core Processors: Why? Onur Mutlu & Seth Copen Goldstein Carnegie Mellon University 9/11/13 Moore s Law Moore, Cramming more components onto integrated circuits, Electronics,
More informationDep. Systems Requirements
Dependable Systems Dep. Systems Requirements Availability the system is ready to be used immediately. A(t) = probability system is available for use at time t MTTF/(MTTF+MTTR) If MTTR can be kept small
More informationFAULT TOLERANT SYSTEMS
FAULT TOLERANT SYSTEMS http://www.ecs.umass.edu/ece/koren/faulttolerantsystems Part 18 Chapter 7 Case Studies Part.18.1 Introduction Illustrate practical use of methods described previously Highlight fault-tolerance
More informationCOMPUTER ORGANIZATION AND DESIGN. The Hardware/Software Interface. Chapter 4. The Processor: C Multiple Issue Based on P&H
COMPUTER ORGANIZATION AND DESIGN The Hardware/Software Interface Chapter 4 The Processor: C Multiple Issue Based on P&H Instruction-Level Parallelism (ILP) Pipelining: executing multiple instructions in
More informationDuke University Department of Electrical and Computer Engineering
Duke University Department of Electrical and Computer Engineering Senior Honors Thesis Spring 2008 Proving the Completeness of Error Detection Mechanisms in Simple Core Chip Multiprocessors Michael Edward
More informationDistributed Operating Systems
2 Distributed Operating Systems System Models, Processor Allocation, Distributed Scheduling, and Fault Tolerance Steve Goddard goddard@cse.unl.edu http://www.cse.unl.edu/~goddard/courses/csce855 System
More informationVirtual Memory. Reading. Sections 5.4, 5.5, 5.6, 5.8, 5.10 (2) Lecture notes from MKP and S. Yalamanchili
Virtual Memory Lecture notes from MKP and S. Yalamanchili Sections 5.4, 5.5, 5.6, 5.8, 5.10 Reading (2) 1 The Memory Hierarchy ALU registers Cache Memory Memory Memory Managed by the compiler Memory Managed
More informationAdvanced processor designs
Advanced processor designs We ve only scratched the surface of CPU design. Today we ll briefly introduce some of the big ideas and big words behind modern processors by looking at two example CPUs. The
More informationFault Tolerance. Distributed Systems IT332
Fault Tolerance Distributed Systems IT332 2 Outline Introduction to fault tolerance Reliable Client Server Communication Distributed commit Failure recovery 3 Failures, Due to What? A system is said to
More informationSafety and Reliability of Software-Controlled Systems Part 14: Fault mitigation
Safety and Reliability of Software-Controlled Systems Part 14: Fault mitigation Prof. Dr.-Ing. Stefan Kowalewski Chair Informatik 11, Embedded Software Laboratory RWTH Aachen University Summer Semester
More informationSelective Validations for Efficient Protections on Coarse-Grained Reconfigurable Architectures
Selective Validations for Efficient Protections on Coarse-Grained Reconfigurable Architectures Jihoon Kang The Graduate School Yonsei University Department of Computer Science Selective Validations for
More informationECE 571 Advanced Microprocessor-Based Design Lecture 4
ECE 571 Advanced Microprocessor-Based Design Lecture 4 Vince Weaver http://www.eece.maine.edu/~vweaver vincent.weaver@maine.edu 28 January 2016 Homework #1 was due Announcements Homework #2 will be posted
More informationLecture 27: Pot-Pourri. Today s topics: Shared memory vs message-passing Simultaneous multi-threading (SMT) GPUs Disks and reliability
Lecture 27: Pot-Pourri Today s topics: Shared memory vs message-passing Simultaneous multi-threading (SMT) GPUs Disks and reliability 1 Shared-Memory Vs. Message-Passing Shared-memory: Well-understood
More informationProcessor (IV) - advanced ILP. Hwansoo Han
Processor (IV) - advanced ILP Hwansoo Han Instruction-Level Parallelism (ILP) Pipelining: executing multiple instructions in parallel To increase ILP Deeper pipeline Less work per stage shorter clock cycle
More informationCS5460: Operating Systems Lecture 20: File System Reliability
CS5460: Operating Systems Lecture 20: File System Reliability File System Optimizations Modern Historic Technique Disk buffer cache Aggregated disk I/O Prefetching Disk head scheduling Disk interleaving
More informationSIListra. Coded Processing in Medical Devices. Dr. Martin Süßkraut (TU-Dresden / SIListra Systems)
SIListra making systems safer Coded Processing in Medical Devices Dr. Martin Süßkraut (TU-Dresden / SIListra Systems) martin.suesskraut@se.inf.tu-dresden.de Embedded goes Medical 5./6. Oct. 2011 1 SIListra
More informationFAULT TOLERANT SYSTEMS
FAULT TOLERANT SYSTEMS http://www.ecs.umass.edu/ece/koren/faulttolerantsystems Part 5 Processor-Level Techniques & Byzantine Failures Chapter 2 Hardware Fault Tolerance Part.5.1 Processor-Level Techniques
More informationPerformance Optimization of Smoothed Particle Hydrodynamics for Multi/Many-Core Architectures
Performance Optimization of Smoothed Particle Hydrodynamics for Multi/Many-Core Architectures Dr. Fabio Baruffa Dr. Luigi Iapichino Leibniz Supercomputing Centre fabio.baruffa@lrz.de Outline of the talk
More informationEric Rotenberg Karthik Sundaramoorthy, Zach Purser
Karthik Sundaramoorthy, Zach Purser Dept. of Electrical and Computer Engineering North Carolina State University http://www.tinker.ncsu.edu/ericro ericro@ece.ncsu.edu Many means to an end Program is merely
More informationHow to Write Fast Numerical Code
How to Write Fast Numerical Code Lecture: Memory hierarchy, locality, caches Instructor: Markus Püschel TA: Alen Stojanov, Georg Ofenbeck, Gagandeep Singh Organization Temporal and spatial locality Memory
More informationOptimized Scientific Computing:
Optimized Scientific Computing: Coding Efficiently for Real Computing Architectures Noah Kurinsky SASS Talk, November 11 2015 Introduction Components of a CPU Architecture Design Choices Why Is This Relevant
More informationNetworks and distributed computing
Networks and distributed computing Abstractions provided for networks network card has fixed MAC address -> deliver message to computer on LAN -> machine-to-machine communication -> unordered messages
More informationPractical Byzantine Fault
Practical Byzantine Fault Tolerance Practical Byzantine Fault Tolerance Castro and Liskov, OSDI 1999 Nathan Baker, presenting on 23 September 2005 What is a Byzantine fault? Rationale for Byzantine Fault
More informationCprE Fault Tolerance. Dr. Yong Guan. Department of Electrical and Computer Engineering & Information Assurance Center Iowa State University
Fault Tolerance Dr. Yong Guan Department of Electrical and Computer Engineering & Information Assurance Center Iowa State University Outline for Today s Talk Basic Concepts Process Resilience Reliable
More informationTransient Fault Detection and Reducing Transient Error Rate. Jose Lugo-Martinez CSE 240C: Advanced Microarchitecture Prof.
Transient Fault Detection and Reducing Transient Error Rate Jose Lugo-Martinez CSE 240C: Advanced Microarchitecture Prof. Steven Swanson Outline Motivation What are transient faults? Hardware Fault Detection
More informationLecture 25: Interconnection Networks, Disks. Topics: flow control, router microarchitecture, RAID
Lecture 25: Interconnection Networks, Disks Topics: flow control, router microarchitecture, RAID 1 Virtual Channel Flow Control Each switch has multiple virtual channels per phys. channel Each virtual
More informationCS6303 Computer Architecture Regulation 2013 BE-Computer Science and Engineering III semester 2 MARKS
CS6303 Computer Architecture Regulation 2013 BE-Computer Science and Engineering III semester 2 MARKS UNIT-I OVERVIEW & INSTRUCTIONS 1. What are the eight great ideas in computer architecture? The eight
More informationPhysical Storage Media
Physical Storage Media These slides are a modified version of the slides of the book Database System Concepts, 5th Ed., McGraw-Hill, by Silberschatz, Korth and Sudarshan. Original slides are available
More informationCS 426 Parallel Computing. Parallel Computing Platforms
CS 426 Parallel Computing Parallel Computing Platforms Ozcan Ozturk http://www.cs.bilkent.edu.tr/~ozturk/cs426/ Slides are adapted from ``Introduction to Parallel Computing'' Topic Overview Implicit Parallelism:
More informationFAULT TOLERANCE. Fault Tolerant Systems. Faults Faults (cont d)
Distributed Systems Fö 9/10-1 Distributed Systems Fö 9/10-2 FAULT TOLERANCE 1. Fault Tolerant Systems 2. Faults and Fault Models. Redundancy 4. Time Redundancy and Backward Recovery. Hardware Redundancy
More informationI/O CANNOT BE IGNORED
LECTURE 13 I/O I/O CANNOT BE IGNORED Assume a program requires 100 seconds, 90 seconds for main memory, 10 seconds for I/O. Assume main memory access improves by ~10% per year and I/O remains the same.
More informationComputer Architecture: Multi-Core Processors: Why? Prof. Onur Mutlu Carnegie Mellon University
Computer Architecture: Multi-Core Processors: Why? Prof. Onur Mutlu Carnegie Mellon University Moore s Law Moore, Cramming more components onto integrated circuits, Electronics, 1965. 2 3 Multi-Core Idea:
More informationImplementation Issues. Remote-Write Protocols
Implementation Issues Two techniques to implement consistency models Primary-based protocols Assume a primary replica for each data item Primary responsible for coordinating all writes Replicated write
More informationSoftware-defined Storage: Fast, Safe and Efficient
Software-defined Storage: Fast, Safe and Efficient TRY NOW Thanks to Blockchain and Intel Intelligent Storage Acceleration Library Every piece of data is required to be stored somewhere. We all know about
More informationIMPLEMENTATION OF THE. Alexander J. Yee University of Illinois Urbana-Champaign
SINGLE-TRANSPOSE IMPLEMENTATION OF THE OUT-OF-ORDER 3D-FFT Alexander J. Yee University of Illinois Urbana-Champaign The Problem FFTs are extremely memory-intensive. Completely bound by memory access. Memory
More informationHow to write powerful parallel Applications
How to write powerful parallel Applications 08:30-09.00 09.00-09:45 09.45-10:15 10:15-10:30 10:30-11:30 11:30-12:30 12:30-13:30 13:30-14:30 14:30-15:15 15:15-15:30 15:30-16:00 16:00-16:45 16:45-17:15 Welcome
More informationThe Processor: Instruction-Level Parallelism
The Processor: Instruction-Level Parallelism Computer Organization Architectures for Embedded Computing Tuesday 21 October 14 Many slides adapted from: Computer Organization and Design, Patterson & Hennessy
More informationDistributed Systems COMP 212. Lecture 19 Othon Michail
Distributed Systems COMP 212 Lecture 19 Othon Michail Fault Tolerance 2/31 What is a Distributed System? 3/31 Distributed vs Single-machine Systems A key difference: partial failures One component fails
More informationBespoke Processors for Applications with Ultra-low Area and Power Constraints
Bespoke Processors for Applications with Ultra-low Area and Power Constraints by Cherupalli et al. ISCA 17 Jielun Tan, Tim Wesley Overview Motivation Intro to Bespoke Benchmarks and Results Discussion
More informationECE 574 Cluster Computing Lecture 19
ECE 574 Cluster Computing Lecture 19 Vince Weaver http://www.eece.maine.edu/~vweaver vincent.weaver@maine.edu 10 November 2015 Announcements Projects HW extended 1 MPI Review MPI is *not* shared memory
More informationA Fault Tolerant Superscalar Processor
A Fault Tolerant Superscalar Processor 1 [Based on Coverage of a Microarchitecture-level Fault Check Regimen in a Superscalar Processor by V. Reddy and E. Rotenberg (2008)] P R E S E N T E D B Y NAN Z
More informationARCHITECTURE DESIGN FOR SOFT ERRORS
ARCHITECTURE DESIGN FOR SOFT ERRORS Shubu Mukherjee ^ШВпШшр"* AMSTERDAM BOSTON HEIDELBERG LONDON NEW YORK OXFORD PARIS SAN DIEGO T^"ТГПШГ SAN FRANCISCO SINGAPORE SYDNEY TOKYO ^ P f ^ ^ ELSEVIER Morgan
More informationRAID SEMINAR REPORT /09/2004 Asha.P.M NO: 612 S7 ECE
RAID SEMINAR REPORT 2004 Submitted on: Submitted by: 24/09/2004 Asha.P.M NO: 612 S7 ECE CONTENTS 1. Introduction 1 2. The array and RAID controller concept 2 2.1. Mirroring 3 2.2. Parity 5 2.3. Error correcting
More informationHybrid Implementation of 3D Kirchhoff Migration
Hybrid Implementation of 3D Kirchhoff Migration Max Grossman, Mauricio Araya-Polo, Gladys Gonzalez GTC, San Jose March 19, 2013 Agenda 1. Motivation 2. The Problem at Hand 3. Solution Strategy 4. GPU Implementation
More informationFault Tolerance Dealing with an imperfect world
Fault Tolerance Dealing with an imperfect world Paul Krzyzanowski Rutgers University September 14, 2012 1 Introduction If we look at the words fault and tolerance, we can define the fault as a malfunction
More informationLast Class:Consistency Semantics. Today: More on Consistency
Last Class:Consistency Semantics Consistency models Data-centric consistency models Client-centric consistency models Eventual Consistency and epidemic protocols Lecture 16, page 1 Today: More on Consistency
More informationBasics of Performance Engineering
ERLANGEN REGIONAL COMPUTING CENTER Basics of Performance Engineering J. Treibig HiPerCH 3, 23./24.03.2015 Why hardware should not be exposed Such an approach is not portable Hardware issues frequently
More informationUsing Process-Level Redundancy to Exploit Multiple Cores for Transient Fault Tolerance
Using Process-Level Redundancy to Exploit Multiple Cores for Transient Fault Tolerance Outline Introduction and Motivation Software-centric Fault Detection Process-Level Redundancy Experimental Results
More informationAR-SMT: A Microarchitectural Approach to Fault Tolerance in Microprocessors
AR-SMT: A Microarchitectural Approach to Fault Tolerance in Microprocessors Computer Sciences Department University of Wisconsin Madison http://www.cs.wisc.edu/~ericro/ericro.html ericro@cs.wisc.edu High-Performance
More informationDependability tree 1
Dependability tree 1 Means for achieving dependability A combined use of methods can be applied as means for achieving dependability. These means can be classified into: 1. Fault Prevention techniques
More informationSoftware Techniques for Dependable Computer-based Systems. Matteo SONZA REORDA
Software Techniques for Dependable Computer-based Systems Matteo SONZA REORDA Summary Introduction State of the art Assertions Algorithm Based Fault Tolerance (ABFT) Control flow checking Data duplication
More informationEXASCALE IN 2018 REALLY? FRANCK CAPPELLO INRIA&UIUC
EASCALE IN 2018 REALLY? FRANCK CAPPELLO INRIA&UIUC What are we talking about? 100M cores 12 cores/node Power Challenges Exascale Technology Roadmap Meeting San Diego California, December 2009. $1M per
More informationDarek Mihocka, Emulators.com Stanislav Shwartsman, Intel Corp. June
Darek Mihocka, Emulators.com Stanislav Shwartsman, Intel Corp. June 21 2008 Agenda Introduction Gemulator Bochs Proposed ISA Extensions Conclusions and Future Work Q & A Jun-21-2008 AMAS-BT 2008 2 Introduction
More informationOverview ECE 753: FAULT-TOLERANT COMPUTING 1/21/2014. Recap. Fault Modeling. Fault Modeling (contd.) Fault Modeling (contd.)
ECE 753: FAULT-TOLERANT COMPUTING Kewal K.Saluja Department of Electrical and Computer Engineering Fault Modeling Lectures Set 2 Overview Fault Modeling References Fault models at different levels (HW)
More informationAN-Encoding Compiler: Building Safety-Critical Systems with Commodity Hardware
AN-Encoding Compiler: Building Safety-Critical Systems with Commodity Hardware Ute Schiffel Christof Fetzer Martin Süßkraut Technische Universität Dresden Institute for System Architecture ute.schiffel@inf.tu-dresden.de
More informationFAULT TOLERANT SYSTEMS
FAULT TOLERANT SYSTEMS http://www.ecs.umass.edu/ece/koren/faulttolerantsystems Part 3 - Resilient Structures Chapter 2 HW Fault Tolerance Part.3.1 M-of-N Systems An M-of-N system consists of N identical
More informationRAID 0 (non-redundant) RAID Types 4/25/2011
Exam 3 Review COMP375 Topics I/O controllers chapter 7 Disk performance section 6.3-6.4 RAID section 6.2 Pipelining section 12.4 Superscalar chapter 14 RISC chapter 13 Parallel Processors chapter 18 Security
More informationHigh Performance Computing Course Notes High Performance Storage
High Performance Computing Course Notes 2008-2009 2009 High Performance Storage Storage devices Primary storage: register (1 CPU cycle, a few ns) Cache (10-200 cycles, 0.02-0.5us) Main memory Local main
More informationAdvanced d Instruction Level Parallelism. Computer Systems Laboratory Sungkyunkwan University
Advanced d Instruction ti Level Parallelism Jin-Soo Kim (jinsookim@skku.edu) Computer Systems Laboratory Sungkyunkwan University http://csl.skku.edu ILP Instruction-Level Parallelism (ILP) Pipelining:
More informationFault Tolerance. Distributed Systems. September 2002
Fault Tolerance Distributed Systems September 2002 Basics A component provides services to clients. To provide services, the component may require the services from other components a component may depend
More informationEE382N (20): Computer Architecture - Parallelism and Locality Spring 2015 Lecture 09 GPUs (II) Mattan Erez. The University of Texas at Austin
EE382 (20): Computer Architecture - ism and Locality Spring 2015 Lecture 09 GPUs (II) Mattan Erez The University of Texas at Austin 1 Recap 2 Streaming model 1. Use many slimmed down cores to run in parallel
More informationBei Wang, Dmitry Prohorov and Carlos Rosales
Bei Wang, Dmitry Prohorov and Carlos Rosales Aspects of Application Performance What are the Aspects of Performance Intel Hardware Features Omni-Path Architecture MCDRAM 3D XPoint Many-core Xeon Phi AVX-512
More informationComputer Architecture: Multithreading (III) Prof. Onur Mutlu Carnegie Mellon University
Computer Architecture: Multithreading (III) Prof. Onur Mutlu Carnegie Mellon University A Note on This Lecture These slides are partly from 18-742 Fall 2012, Parallel Computer Architecture, Lecture 13:
More informationSome material adapted from Mohamed Younis, UMBC CMSC 611 Spr 2003 course slides Some material adapted from Hennessy & Patterson / 2003 Elsevier
Some material adapted from Mohamed Younis, UMBC CMSC 6 Spr 23 course slides Some material adapted from Hennessy & Patterson / 23 Elsevier Science Characteristics IBM 39 IBM UltraStar Integral 82 Disk diameter
More informationSoftware-based Fault Tolerance Mission (Im)possible?
Software-based Fault Tolerance Mission Im)possible? Peter Ulbrich The 29th CREST Open Workshop on Software Redundancy November 18, 2013 System Software Group http://www4.cs.fau.de Embedded Systems Initiative
More informationIntel iapx 432-VLSI building blocks for a fault-tolerant computer
Intel iapx 432-VLSI building blocks for a fault-tolerant computer by DAVE JOHNSON, DAVE BUDDE, DAVE CARSON, and CRAIG PETERSON Intel Corporation Aloha, Oregon ABSTRACT Early in 1983 two new VLSI components
More informationCSE 5306 Distributed Systems
CSE 5306 Distributed Systems Fault Tolerance Jia Rao http://ranger.uta.edu/~jrao/ 1 Failure in Distributed Systems Partial failure Happens when one component of a distributed system fails Often leaves
More informationCDA 5140 Software Fault-tolerance. - however, reliability of the overall system is actually a product of the hardware, software, and human reliability
CDA 5140 Software Fault-tolerance - so far have looked at reliability as hardware reliability - however, reliability of the overall system is actually a product of the hardware, software, and human reliability
More information