When is CAN the weakest link? A bound on Failures-In-Time in CAN-Based Distributed Real-Time Systems. Arpan Gujarati, Björn B. Brandenburg


1 When is CAN the weakest link? A bound on Failures-In-Time in CAN-Based Distributed Real-Time Systems. Arpan Gujarati, Björn B. Brandenburg

4 Failures due to Transient Faults. Harsh environments: spark plugs, hard radiation, high-power machinery. Electromagnetic interference (EMI) causes bit-flips in the hosts and in the network. EMI-induced transient faults manifest as program-visible failures.

8 Failures due to Transient Faults. Transmission failures (faults on the wire): tolerated by error detection and retransmissions. Commission failures (bit-flips in the memory buffers) and crash failures (due to fault-induced exceptions): tolerated by active replication of tasks on independent hosts. How to decide the best replication strategy? Is Triple Modular Redundancy (TMR) enough, or is Quadruple Modular Redundancy (QMR) required? Would you replicate only the high-frequency tasks, or only the high-criticality tasks?

15 Retransmissions vs. Replication Tradeoff. For tolerating retransmission-induced delays (ensuring no deadline violations): the more slack, the better! Versus active replication of tasks: reduced slack in the schedule. [Sketches: correctness, timeliness, and success, each plotted against the replication factor.] How to statically determine the optimal replication factor?

16 This Work. For CAN-based distributed real-time systems. Probabilistic analysis: quantify the replication vs. retransmissions tradeoff.

21 The Larger Picture. The CAN-based system is just one component in a safety-critical system. [Diagram: a UAV (photo: Airbus D&S/Dassault/Finmeccanica) with Power and CAN components.] Options: replicate tasks and add more ECUs to the CAN subsystem, or add a heat sink to the power supply unit. What if the UAV has strict weight constraints and you can add either the heat sink or the additional ECUs? How do you decide the best choice?

24 Failures-In-Time (FIT) Rate. Expected number of failures in one billion operating hours, e.g., 1M UAVs flying for 1K hours each. FIT rates are widely used in the industry. When is the CAN-based distributed real-time system the weakest link in the system?
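The FIT definition can be made concrete with a small calculation. The sketch below is illustrative only: the function names and the failure count of 3 are hypothetical and not taken from the talk; only the one-billion-hour window and the fleet example (1M UAVs, 1K hours each) come from the slide.

```python
FIT_WINDOW_HOURS = 1e9  # one billion operating hours, per the FIT definition


def fit_rate(expected_failures: float, operating_hours: float) -> float:
    """Scale an expected failure count to one billion operating hours."""
    if operating_hours <= 0:
        raise ValueError("operating_hours must be positive")
    return expected_failures * FIT_WINDOW_HOURS / operating_hours


def fit_from_mtbf(mtbf_hours: float) -> float:
    """Convert a mean time between failures (in hours) to a FIT rate."""
    if mtbf_hours <= 0:
        raise ValueError("mtbf_hours must be positive")
    return FIT_WINDOW_HOURS / mtbf_hours


# Fleet from the slide: 1M UAVs flying 1K hours each is 1e9 operating hours.
fleet_hours = 1_000_000 * 1_000
# Hypothetical observation: 3 failures across the whole fleet.
print(fit_rate(3, fleet_hours))  # 3.0
```

Because the example fleet accumulates exactly one billion operating hours, its failure count and its FIT rate coincide.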

25 This Work. For CAN-based distributed real-time systems. Probabilistic analysis: quantify the replication vs. retransmissions tradeoff. FIT rate analysis: builds upon the proposed probabilistic analysis.

26 Overview: Model, Analysis, Evaluation

30 Fault Abstraction & Modeling. Transmission failures (faults on the wire), commission failures (bit-flips in the memory buffers), and crash failures (due to fault-induced exceptions) are modeled with a Poisson distribution, which gives the probability that each message is omitted / corrupted / retransmitted. We do not consider software defects.
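Under a Poisson fault model, the probability of seeing faults in a time window follows directly from the fault rate. A minimal sketch, assuming a constant rate `rate_per_ms` and a window of `window_ms` (both names and the example numbers are hypothetical; the slides do not show how the paper maps faults to per-message omission, corruption, and retransmission probabilities):

```python
import math


def prob_k_faults(rate_per_ms: float, window_ms: float, k: int) -> float:
    """Poisson probability of exactly k faults within the window."""
    mean = rate_per_ms * window_ms
    return math.exp(-mean) * mean**k / math.factorial(k)


def prob_at_least_one_fault(rate_per_ms: float, window_ms: float) -> float:
    """1 - exp(-rate * window), via expm1 to stay accurate for tiny rates."""
    return -math.expm1(-rate_per_ms * window_ms)


# Hypothetical numbers: a fault rate of 1e-6 per ms over a 10 ms window.
p = prob_at_least_one_fault(1e-6, 10.0)
print(f"{p:.3e}")
```

Using `math.expm1` avoids the cancellation that `1 - math.exp(-x)` suffers when `x` is very small, which matters for fault rates many orders of magnitude below one per millisecond.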

34 System Model. [Diagram: Host 1, Host 2, Host 3, ..., Host N connected by a CAN bus, with Task A and Task B running on hosts attached to the bus.] Message table: Msg. M, Sender Task A, Receiver Task B. Message M travels from Task A to Task B over the CAN bus, and Task B produces the output OUT. Is OUT generated on time? Is OUT generated from a logically correct value of M?

36 System Model with Task Replication. Message table: Msg. M, Sender Task A, Receiver Task B, Replicas 3. [Diagram: Task A and two Task A replicas each send a copy of M over the CAN bus to Task B, which produces OUT.]

39 Aggregating the replicated messages. How and when to compute OUT from multiple copies of M? Case 1: synchronous systems (common global time base), e.g., majority value at the absolute deadline. Case 2: asynchronous systems (no global time base), e.g., majority value after enough copies have been received.
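The two aggregation cases can be sketched as follows. This is an illustrative reading of the slide, not the authors' implementation: the function names, the `(arrival_time, value)` representation, and the `quorum` parameter standing in for "enough copies" are all assumptions.

```python
from collections import Counter


def majority_value(copies):
    """Return the value held by a strict majority of copies, else None."""
    if not copies:
        return None
    value, count = Counter(copies).most_common(1)[0]
    return value if count > len(copies) / 2 else None


def aggregate_synchronous(arrivals, deadline):
    """Vote at the absolute deadline over the copies that arrived by then.

    arrivals: list of (arrival_time, value) pairs on a common time base.
    """
    on_time = [value for time, value in arrivals if time <= deadline]
    return majority_value(on_time)


def aggregate_asynchronous(arrivals, quorum):
    """Vote as soon as `quorum` copies have been received, in arrival order."""
    received = []
    for _, value in sorted(arrivals, key=lambda a: a[0]):
        received.append(value)
        if len(received) >= quorum:
            return majority_value(received)
    return None  # never received enough copies


# Hypothetical run: three copies of M, the last one late and corrupted.
arrivals = [(1.0, 42), (2.0, 42), (9.0, 7)]
print(aggregate_synchronous(arrivals, deadline=5.0))  # 42
print(aggregate_asynchronous(arrivals, quorum=2))     # 42
```

In the synchronous variant a late copy is simply ignored; in the asynchronous variant the vote is triggered by the number of copies received rather than by a clock.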

40 Overview: Model, Analysis, Evaluation

43 The Larger Picture. (Photo: Airbus D&S/Dassault/Finmeccanica) Objectives: a good replication strategy for the CAN-based system; compare the reliability of the CAN-based system with other components in the safety-critical system. Solution: FIT rate analysis, using the probabilistic analysis.

53 FIT Rate Analysis of the System. Standard procedure to compute FIT rates given the failure probabilities, but tailored for real-time workloads. [Diagram: network faults and host faults (both follow a Poisson distribution) → lower bound on the probability of successful transmission of M1 → probability density function f1(t) → mean time between failures MTBF1 → FIT1; FIT1, FIT2, ..., FITn → FIT rate of the CAN subsystem.]

54 FIT Rate Analysis of the System. Host and network faults follow a Poisson distribution. From them, the analysis obtains the probabilities that each message is retransmitted due to transmission failures (message retransmission), omitted due to crash failures (message omission), or corrupted due to commission failures (message corruption).

60 FIT Rate Analysis of the System. OUT1 is computed on time: Broster et al.'s probabilistic response-time analysis*; we extend the analysis for a set of message replicas (e.g., any 1 out of 3 message replicas is transmitted on time); deadline violations due to crash failures. OUT1 does not contain a corrupted value of M1: incorrect output due to commission failures. Both conditions enter the lower bound on the probability of successful transmission of M1. *Broster, Ian, Alan Burns, and Guillermo Rodriguez-Navas. "Timing analysis of real-time communication under electromagnetic interference." Real-Time Systems (2005).

61 FIT Rate Analysis of the System. Math in the paper!
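The example "any 1 out of 3 message replicas is transmitted on time" and the final step from a failure probability to a FIT rate can be illustrated with a simplified calculation. This is a sketch under strong assumptions that the talk does not claim: replicas succeed independently with the same probability, and every activation of a periodic message fails independently with the same probability. The paper's actual bound is derived differently; all names and numbers below are hypothetical.

```python
from math import comb

MS_PER_HOUR = 3_600_000
FIT_WINDOW_HOURS = 1e9


def prob_at_least_k_of_n(k: int, n: int, p: float) -> float:
    """Probability that at least k of n independent replicas succeed,
    each with success probability p (binomial tail)."""
    return sum(comb(n, i) * p**i * (1 - p) ** (n - i) for i in range(k, n + 1))


def fit_from_failure_prob(p_fail: float, period_ms: float) -> float:
    """FIT rate of a periodic message whose every activation fails
    independently with probability p_fail (simplifying assumption).

    The number of activations until the first failure is geometric with
    mean 1 / p_fail, so the mean time to failure is period_ms / p_fail.
    """
    if p_fail == 0:
        return 0.0
    mtbf_hours = (period_ms / p_fail) / MS_PER_HOUR
    return FIT_WINDOW_HOURS / mtbf_hours


# Hypothetical: each replica is on time with probability 0.9.
p_ok = prob_at_least_k_of_n(1, 3, 0.9)
print(round(p_ok, 6))  # 0.999
```

Under these assumptions, a per-activation failure probability of 1e-9 for a message with a 10 ms period corresponds to a FIT rate of about 3.6e5, which shows why very small failure probabilities are needed for short-period messages.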

62 Overview: Model, Analysis, Evaluation

64 Mobile Robot Workload*. Only the MotorCtrl task is replicated (#replicas vary from 1 to 9). [Table columns: Task Name, Length (bytes), Period (ms), Deadline (ms). Tasks listed: MotorCtrl, Wheel, Wheel, RadioIn, Proximity, Logging.] *Broster, Ian, Alan Burns, and Guillermo Rodriguez-Navas. "Comparing real-time communication under electromagnetic interference." Real-Time Systems, ECRTS Proceedings. 16th Euromicro Conference on. IEEE, 2004.

66 Evaluation. Assess the proposed FIT rate derivation: comparison with results from CAN bus simulation. Is the FIT rate analysis too coarse-grained? Analysis for various fault rates.

71 Analysis versus Simulation for MotorCtrl. [Plot: probability of failure of MotorCtrl (log scale; lower means better MotorCtrl reliability) versus #replicas of the MotorCtrl task. Series: Synchronous Analysis, Asynchronous Analysis, Synchronous Simulation, Asynchronous Simulation.] MotorCtrl becomes more reliable with replication. Analysis tracks the simulation results.

75 FIT Rate Analysis of the CAN Subsystem. [Plot: FIT Rate vs. Replication Factor for varying EMI rates. Y-axis: FIT rate of the system (log scale; lower means better reliability of the CAN subsystem). X-axis: #replicas of the MotorCtrl task. Series: host-fault-rate=10^-20/ms, bus-fault-rate=10^-6/ms; host-fault-rate=10^-12/ms, bus-fault-rate=10^-8/ms; host-fault-rate=10^-16/ms, bus-fault-rate=10^-10/ms.] Annotations: the system becomes more reliable due to better tolerance against crash and commission failures; system reliability degrades due to less slack.

77 [Sketches: success and failure versus replication factor.]

80 FIT Rate Analysis of the CAN Subsystem. Analysis yields the optimal replication factor. FIT rate spans more than 20 orders of magnitude.
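Once an analysis produces a FIT rate for each candidate replication factor, choosing the optimal factor is a minimization. A minimal sketch: the FIT values below are made-up placeholders that only mimic an improve-then-degrade shape, and `optimal_replication_factor` is a hypothetical helper name.

```python
def optimal_replication_factor(fit_by_factor: dict[int, float]) -> int:
    """Return the replication factor with the lowest FIT rate.

    Ties are broken in favor of fewer replicas.
    """
    if not fit_by_factor:
        raise ValueError("no candidate replication factors given")
    return min(fit_by_factor, key=lambda r: (fit_by_factor[r], r))


# Placeholder FIT rates, not results from the talk.
example = {1: 1e-5, 2: 1e-12, 3: 1e-20, 4: 1e-18, 5: 1e-9}
print(optimal_replication_factor(example))  # 3
```

Breaking ties toward fewer replicas reflects the weight and cost concern raised earlier in the talk, but that tie-breaking rule is a design choice of this sketch.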

83 Summary. Find the best replication strategy for CAN-based systems. [Sketches: correctness, timeliness, and success versus replication factor.] Compare the reliability of the CAN-bus subsystem in the context of the larger safety-critical system. [Diagram: a UAV (photo: Airbus D&S/Dassault/Finmeccanica) with Power and CAN components.]

86 Future Work. More complex system models: CAN-based systems bridged together; sporadic DAG models. Study of other technologies, e.g., Real-Time Ethernet. http://


More information

A CAN-Based Architecture for Highly Reliable Communication Systems

A CAN-Based Architecture for Highly Reliable Communication Systems A CAN-Based Architecture for Highly Reliable Communication Systems H. Hilmer Prof. Dr.-Ing. H.-D. Kochs Gerhard-Mercator-Universität Duisburg, Germany E. Dittmar ABB Network Control and Protection, Ladenburg,

More information

Probabilistic Worst-Case Response-Time Analysis for the Controller Area Network

Probabilistic Worst-Case Response-Time Analysis for the Controller Area Network Probabilistic Worst-Case Response-Time Analysis for the Controller Area Network Thomas Nolte, Hans Hansson, and Christer Norström Mälardalen Real-Time Research Centre Department of Computer Engineering

More information

MapReduce. Cloud Computing COMP / ECPE 293A

MapReduce. Cloud Computing COMP / ECPE 293A Cloud Computing COMP / ECPE 293A MapReduce Jeffrey Dean and Sanjay Ghemawat, MapReduce: simplified data processing on large clusters, In Proceedings of the 6th conference on Symposium on Opera7ng Systems

More information

Comparing IPv4 and IPv6 from the perspec7ve of BGP dynamic ac7vity. Geoff Huston APNIC February 2012

Comparing IPv4 and IPv6 from the perspec7ve of BGP dynamic ac7vity. Geoff Huston APNIC February 2012 Comparing IPv4 and IPv6 from the perspec7ve of BGP dynamic ac7vity Geoff Huston APNIC February 2012 The IPv4 Table: 2004 - now The IPv6 Table: 2004 - now AS131072 BGP Updates / day V4 - ~100K updates/day

More information

Distributed Systems. Communica3on and models. Rik Sarkar Spring University of Edinburgh

Distributed Systems. Communica3on and models. Rik Sarkar Spring University of Edinburgh Distributed Systems Communica3on and models Rik Sarkar Spring 2018 University of Edinburgh Models Expecta3ons/assump3ons about things Every idea or ac3on anywhere is based on a model Determines what can

More information

TSW Reliability and Fault Tolerance

TSW Reliability and Fault Tolerance TSW Reliability and Fault Tolerance Alexandre David 1.2.05 Credits: some slides by Alan Burns & Andy Wellings. Aims Understand the factors which affect the reliability of a system. Introduce how software

More information

CSE 473: Ar+ficial Intelligence Uncertainty and Expec+max Tree Search

CSE 473: Ar+ficial Intelligence Uncertainty and Expec+max Tree Search CSE 473: Ar+ficial Intelligence Uncertainty and Expec+max Tree Search Instructors: Luke ZeDlemoyer Univeristy of Washington [These slides were adapted from Dan Klein and Pieter Abbeel for CS188 Intro to

More information

Today s Objec2ves. AWS/MR Review Final Projects Distributed File Systems. Nov 3, 2017 Sprenkle - CSCI325

Today s Objec2ves. AWS/MR Review Final Projects Distributed File Systems. Nov 3, 2017 Sprenkle - CSCI325 Today s Objec2ves AWS/MR Review Final Projects Distributed File Systems Nov 3, 2017 Sprenkle - CSCI325 1 Inverted Index final input files have been posted Another email out to AWS Google cloud Nov 3, 2017

More information

System Models. 2.1 Introduction 2.2 Architectural Models 2.3 Fundamental Models. Nicola Dragoni Embedded Systems Engineering DTU Informatics

System Models. 2.1 Introduction 2.2 Architectural Models 2.3 Fundamental Models. Nicola Dragoni Embedded Systems Engineering DTU Informatics System Models Nicola Dragoni Embedded Systems Engineering DTU Informatics 2.1 Introduction 2.2 Architectural Models 2.3 Fundamental Models Architectural vs Fundamental Models Systems that are intended

More information

Consistency Rationing in the Cloud: Pay only when it matters

Consistency Rationing in the Cloud: Pay only when it matters Consistency Rationing in the Cloud: Pay only when it matters By Sandeepkrishnan Some of the slides in this presenta4on have been taken from h7p://www.cse.iitb.ac.in./dbms/cs632/ra4oning.ppt 1 Introduc4on:

More information

Distributed Systems COMP 212. Lecture 19 Othon Michail

Distributed Systems COMP 212. Lecture 19 Othon Michail Distributed Systems COMP 212 Lecture 19 Othon Michail Fault Tolerance 2/31 What is a Distributed System? 3/31 Distributed vs Single-machine Systems A key difference: partial failures One component fails

More information

TU Wien. Shortened by Hermann Härtig The Rationale for Time-Triggered (TT) Ethernet. H Kopetz TU Wien December H. Kopetz 12.

TU Wien. Shortened by Hermann Härtig The Rationale for Time-Triggered (TT) Ethernet. H Kopetz TU Wien December H. Kopetz 12. TU Wien 1 Shortened by Hermann Härtig The Rationale for Time-Triggered (TT) Ethernet H Kopetz TU Wien December 2008 Properties of a Successful Protocol 2 A successful real-time protocol must have the following

More information

Effec%ve Replica Maintenance for Distributed Storage Systems

Effec%ve Replica Maintenance for Distributed Storage Systems Effec%ve Replica Maintenance for Distributed Storage Systems USENIX NSDI2006 Byung Gon Chun, Frank Dabek, Andreas Haeberlen, Emil Sit, Hakim Weatherspoon, M. Frans Kaashoek, John Kubiatowicz, and Robert

More information

Objectives. Architectural Design. Software architecture. Topics covered. Architectural design. Advantages of explicit architecture

Objectives. Architectural Design. Software architecture. Topics covered. Architectural design. Advantages of explicit architecture Objectives Architectural Design To introduce architectural design and to discuss its importance To explain the architectural design decisions that have to be made To introduce three complementary architectural

More information

There is a tempta7on to say it is really used, it must be good

There is a tempta7on to say it is really used, it must be good Notes from reviews Dynamo Evalua7on doesn t cover all design goals (e.g. incremental scalability, heterogeneity) Is it research? Complexity? How general? Dynamo Mo7va7on Normal database not the right fit

More information

CSE 451: Operating Systems Winter Redundant Arrays of Inexpensive Disks (RAID) and OS structure. Gary Kimura

CSE 451: Operating Systems Winter Redundant Arrays of Inexpensive Disks (RAID) and OS structure. Gary Kimura CSE 451: Operating Systems Winter 2013 Redundant Arrays of Inexpensive Disks (RAID) and OS structure Gary Kimura The challenge Disk transfer rates are improving, but much less fast than CPU performance

More information

Distributed Systems. Communica3on and models. Rik Sarkar 2015/2016. University of Edinburgh

Distributed Systems. Communica3on and models. Rik Sarkar 2015/2016. University of Edinburgh Distributed Systems Communica3on and models Rik Sarkar 2015/2016 University of Edinburgh Models Expecta3ons/assump3ons about things Every idea or ac3on anywhere is based on a model Determines what can

More information

Conges'on. Last Week: Discovery and Rou'ng. Today: Conges'on Control. Distributed Resource Sharing. Conges'on Collapse. Conges'on

Conges'on. Last Week: Discovery and Rou'ng. Today: Conges'on Control. Distributed Resource Sharing. Conges'on Collapse. Conges'on Last Week: Discovery and Rou'ng Provides end-to-end connectivity, but not necessarily good performance Conges'on logical link name Michael Freedman COS 461: Computer Networks Lectures: MW 10-10:50am in

More information

Systems. Roland Kammerer. 10. November Institute of Computer Engineering Vienna University of Technology. Communication Protocols for Embedded

Systems. Roland Kammerer. 10. November Institute of Computer Engineering Vienna University of Technology. Communication Protocols for Embedded Communication Roland Institute of Computer Engineering Vienna University of Technology 10. November 2010 Overview 1. Definition of a protocol 2. Protocol properties 3. Basic Principles 4. system communication

More information

Error Mitigation of Point-to-Point Communication for Fault-Tolerant Computing

Error Mitigation of Point-to-Point Communication for Fault-Tolerant Computing Error Mitigation of Point-to-Point Communication for Fault-Tolerant Computing Authors: Robert L Akamine, Robert F. Hodson, Brock J. LaMeres, and Robert E. Ray www.nasa.gov Contents Introduction to the

More information

No compromises: distributed transac2ons with consistency, availability, and performance

No compromises: distributed transac2ons with consistency, availability, and performance No compromises: distributed transac2ons with consistency, availability, and performance Aleksandar Dragojevic, Dushyanth Narayanan, Edmund B. Nigh2ngale, MaDhew Renzelmann, Alex Shamis, Anirudh Badam,

More information

Confinement (Running Untrusted Programs)

Confinement (Running Untrusted Programs) Confinement (Running Untrusted Programs) Chester Rebeiro Indian Institute of Technology Madras Untrusted Programs How to run untrusted programs and not harm your system? Answer: Confinement (some:mes called

More information

Dependability and real-time. TDDD07 Real-time Systems Lecture 7: Dependability & Fault tolerance. Early computer systems. Dependable systems

Dependability and real-time. TDDD07 Real-time Systems Lecture 7: Dependability & Fault tolerance. Early computer systems. Dependable systems TDDD7 Real-time Systems Lecture 7: Dependability & Fault tolerance Simin i Nadjm-Tehrani Real-time Systems Laboratory Department of Computer and Information Science Linköping university Dependability and

More information

CS4514 Real-Time Systems and Modeling

CS4514 Real-Time Systems and Modeling CS4514 Real-Time Systems and Modeling Fall 2015 José M. Garrido Department of Computer Science College of Computing and Software Engineering Kennesaw State University Real-Time Systems RTS are computer

More information

Chapter 8 Fault Tolerance

Chapter 8 Fault Tolerance DISTRIBUTED SYSTEMS Principles and Paradigms Second Edition ANDREW S. TANENBAUM MAARTEN VAN STEEN Chapter 8 Fault Tolerance 1 Fault Tolerance Basic Concepts Being fault tolerant is strongly related to

More information

CprE Fault Tolerance. Dr. Yong Guan. Department of Electrical and Computer Engineering & Information Assurance Center Iowa State University

CprE Fault Tolerance. Dr. Yong Guan. Department of Electrical and Computer Engineering & Information Assurance Center Iowa State University Fault Tolerance Dr. Yong Guan Department of Electrical and Computer Engineering & Information Assurance Center Iowa State University Outline for Today s Talk Basic Concepts Process Resilience Reliable

More information

Fault Tolerance. Chapter 7

Fault Tolerance. Chapter 7 Fault Tolerance Chapter 7 Basic Concepts Dependability Includes Availability Reliability Safety Maintainability Failure Models Type of failure Crash failure Omission failure Receive omission Send omission

More information

Towards a Real Time Communica3on Framework for Wireless Sensor Networks

Towards a Real Time Communica3on Framework for Wireless Sensor Networks Towards a Real Time Communica3on Framework for Wireless Sensor Networks Chenyang Lu Department of Computer Science and Engineering Applica3on challenges High data rate Low latency Priori;za;on Predictability

More information

Aerospace Software Engineering

Aerospace Software Engineering 16.35 Aerospace Software Engineering Reliability, Availability, and Maintainability Software Fault Tolerance Prof. Kristina Lundqvist Dept. of Aero/Astro, MIT Definitions Software reliability The probability

More information

TU Wien. Excerpt by Hermann Härtig The Rationale for Time-Triggered (TT) Ethernet. H Kopetz TU Wien December H. Kopetz 12.

TU Wien. Excerpt by Hermann Härtig The Rationale for Time-Triggered (TT) Ethernet. H Kopetz TU Wien December H. Kopetz 12. TU Wien 1 Excerpt by Hermann Härtig The Rationale for Time-Triggered (TT) Ethernet H Kopetz TU Wien December 2008 Time 2 Whenever we use the term time we mean physical time as defined by the international

More information

Introduction to Real-Time Communications. Real-Time and Embedded Systems (M) Lecture 15

Introduction to Real-Time Communications. Real-Time and Embedded Systems (M) Lecture 15 Introduction to Real-Time Communications Real-Time and Embedded Systems (M) Lecture 15 Lecture Outline Modelling real-time communications Traffic and network models Properties of networks Throughput, delay

More information

Implementation Issues. Remote-Write Protocols

Implementation Issues. Remote-Write Protocols Implementation Issues Two techniques to implement consistency models Primary-based protocols Assume a primary replica for each data item Primary responsible for coordinating all writes Replicated write

More information

Part 2: Basic concepts and terminology

Part 2: Basic concepts and terminology Part 2: Basic concepts and terminology Course: Dependable Computer Systems 2012, Stefan Poledna, All rights reserved part 2, page 1 Def.: Dependability (Verlässlichkeit) is defined as the trustworthiness

More information

1/10/16. RPC and Clocks. Tom Anderson. Last Time. Synchroniza>on RPC. Lab 1 RPC

1/10/16. RPC and Clocks. Tom Anderson. Last Time. Synchroniza>on RPC. Lab 1 RPC RPC and Clocks Tom Anderson Go Synchroniza>on RPC Lab 1 RPC Last Time 1 Topics MapReduce Fault tolerance Discussion RPC At least once At most once Exactly once Lamport Clocks Mo>va>on MapReduce Fault Tolerance

More information

Open, Scalable Real-Time Solutions

Open, Scalable Real-Time Solutions Open, Scalable Real-Time Solutions Introducing TestDrive ECU-in-the-Loop Testing Q & A Alan Soltis Applications Engineer TestDrive Simulator An Overview Compact, robust chassis Pentium 4

More information

ECE 259 / CPS 221 Advanced Computer Architecture II (Parallel Computer Architecture) Availability. Copyright 2010 Daniel J. Sorin Duke University

ECE 259 / CPS 221 Advanced Computer Architecture II (Parallel Computer Architecture) Availability. Copyright 2010 Daniel J. Sorin Duke University Advanced Computer Architecture II (Parallel Computer Architecture) Availability Copyright 2010 Daniel J. Sorin Duke University Definition and Motivation Outline General Principles of Available System Design

More information

Viewstamped Replication to Practical Byzantine Fault Tolerance. Pradipta De

Viewstamped Replication to Practical Byzantine Fault Tolerance. Pradipta De Viewstamped Replication to Practical Byzantine Fault Tolerance Pradipta De pradipta.de@sunykorea.ac.kr ViewStamped Replication: Basics What does VR solve? VR supports replicated service Abstraction is

More information

Op#mizing PGAS overhead in a mul#-locale Chapel implementa#on of CoMD

Op#mizing PGAS overhead in a mul#-locale Chapel implementa#on of CoMD Op#mizing PGAS overhead in a mul#-locale Chapel implementa#on of CoMD Riyaz Haque and David F. Richards This work was performed under the auspices of the U.S. Department of Energy by Lawrence Livermore

More information

Fifty Years of Operating Systems. SIGOPS SOSP History Day October 4, 2015

Fifty Years of Operating Systems. SIGOPS SOSP History Day October 4, 2015 Fifty Years of Operating Systems SIGOPS SOSP History Day October 4, 2015 1 The Story Opera)ng systems are a major enterprise within compu)ng. They are hosted on over a billion devices connected to the

More information

AirTight: A Resilient Wireless Communication Protocol for Mixed- Criticality Systems

AirTight: A Resilient Wireless Communication Protocol for Mixed- Criticality Systems AirTight: A Resilient Wireless Communication Protocol for Mixed- Criticality Systems Alan Burns, James Harbin, Leandro Indrusiak, Iain Bate, Robert Davis and David Griffin Real-Time Systems Research Group

More information

Introduction to Software Fault Tolerance Techniques and Implementation. Presented By : Hoda Banki

Introduction to Software Fault Tolerance Techniques and Implementation. Presented By : Hoda Banki Introduction to Software Fault Tolerance Techniques and Implementation Presented By : Hoda Banki 1 Contents : Introduction Types of faults Dependability concept classification Error recovery Types of redundancy

More information

Real-Time Component Software. slide credits: H. Kopetz, P. Puschner

Real-Time Component Software. slide credits: H. Kopetz, P. Puschner Real-Time Component Software slide credits: H. Kopetz, P. Puschner Overview OS services Task Structure Task Interaction Input/Output Error Detection 2 Operating System and Middleware Application Software

More information

Chapter 5: Distributed Systems: Fault Tolerance. Fall 2013 Jussi Kangasharju

Chapter 5: Distributed Systems: Fault Tolerance. Fall 2013 Jussi Kangasharju Chapter 5: Distributed Systems: Fault Tolerance Fall 2013 Jussi Kangasharju Chapter Outline n Fault tolerance n Process resilience n Reliable group communication n Distributed commit n Recovery 2 Basic

More information

1/12/11. ECE 1749H: Interconnec3on Networks for Parallel Computer Architectures. Introduc3on. Interconnec3on Networks Introduc3on

1/12/11. ECE 1749H: Interconnec3on Networks for Parallel Computer Architectures. Introduc3on. Interconnec3on Networks Introduc3on ECE 1749H: Interconnec3on Networks for Parallel Computer Architectures Introduc3on Prof. Natalie Enright Jerger Winter 2011 ECE 1749H: Interconnec3on Networks (Enright Jerger) 1 Interconnec3on Networks

More information

Distributed Systems. Characteristics of Distributed Systems. Lecture Notes 1 Basic Concepts. Operating Systems. Anand Tripathi

Distributed Systems. Characteristics of Distributed Systems. Lecture Notes 1 Basic Concepts. Operating Systems. Anand Tripathi 1 Lecture Notes 1 Basic Concepts Anand Tripathi CSci 8980 Operating Systems Anand Tripathi CSci 8980 1 Distributed Systems A set of computers (hosts or nodes) connected through a communication network.

More information

Distributed Systems. Characteristics of Distributed Systems. Characteristics of Distributed Systems. Goals in Distributed System Designs

Distributed Systems. Characteristics of Distributed Systems. Characteristics of Distributed Systems. Goals in Distributed System Designs 1 Anand Tripathi CSci 8980 Operating Systems Lecture Notes 1 Basic Concepts Distributed Systems A set of computers (hosts or nodes) connected through a communication network. Nodes may have different speeds

More information

Data Flow Analysis. Suman Jana. Adopted From U Penn CIS 570: Modern Programming Language Implementa=on (Autumn 2006)

Data Flow Analysis. Suman Jana. Adopted From U Penn CIS 570: Modern Programming Language Implementa=on (Autumn 2006) Data Flow Analysis Suman Jana Adopted From U Penn CIS 570: Modern Programming Language Implementa=on (Autumn 2006) Data flow analysis Derives informa=on about the dynamic behavior of a program by only

More information

Developing deterministic networking technology for railway applications using TTEthernet software-based end systems

Developing deterministic networking technology for railway applications using TTEthernet software-based end systems Developing deterministic networking technology for railway applications using TTEthernet software-based end systems Project n 100021 Astrit Ademaj, TTTech Computertechnik AG Outline GENESYS requirements

More information

Main Points. File systems. Storage hardware characteris7cs. File system usage Useful abstrac7ons on top of physical devices

Main Points. File systems. Storage hardware characteris7cs. File system usage Useful abstrac7ons on top of physical devices Storage Systems Main Points File systems Useful abstrac7ons on top of physical devices Storage hardware characteris7cs Disks and flash memory File system usage pa@erns File System Abstrac7on File system

More information

Redundancy in fault tolerant computing. D. P. Siewiorek R.S. Swarz, Reliable Computer Systems, Prentice Hall, 1992

Redundancy in fault tolerant computing. D. P. Siewiorek R.S. Swarz, Reliable Computer Systems, Prentice Hall, 1992 Redundancy in fault tolerant computing D. P. Siewiorek R.S. Swarz, Reliable Computer Systems, Prentice Hall, 1992 1 Redundancy Fault tolerance computing is based on redundancy HARDWARE REDUNDANCY Physical

More information

System Models for Distributed Systems

System Models for Distributed Systems System Models for Distributed Systems INF5040/9040 Autumn 2015 Lecturer: Amir Taherkordi (ifi/uio) August 31, 2015 Outline 1. Introduction 2. Physical Models 4. Fundamental Models 2 INF5040 1 System Models

More information

A Low-Cost Correction Algorithm for Transient Data Errors

A Low-Cost Correction Algorithm for Transient Data Errors A Low-Cost Correction Algorithm for Transient Data Errors Aiguo Li, Bingrong Hong School of Computer Science and Technology Harbin Institute of Technology, Harbin 150001, China liaiguo@hit.edu.cn Introduction

More information

Eliminating Single Points of Failure in Software Based Redundancy

Eliminating Single Points of Failure in Software Based Redundancy Eliminating Single Points of Failure in Software Based Redundancy Peter Ulbrich, Martin Hoffmann, Rüdiger Kapitza, Daniel Lohmann, Reiner Schmid and Wolfgang Schröder-Preikschat EDCC May 9, 2012 SYSTEM

More information

Lecture 6 The Data Link Layer. Antonio Cianfrani DIET Department Networking Group netlab.uniroma1.it

Lecture 6 The Data Link Layer. Antonio Cianfrani DIET Department Networking Group netlab.uniroma1.it Lecture 6 The Data Link Layer Antonio Cianfrani DIET Department Networking Group netlab.uniroma1.it Link Layer: setting the context two physically connected devices: host-router, router-router, host-host,

More information

Christian Brothers High School, Lewisham. Year 12 Information Processes & Technology. Assessment Task 3: Communications Systems.

Christian Brothers High School, Lewisham. Year 12 Information Processes & Technology. Assessment Task 3: Communications Systems. Name: Teacher: Christian Brothers High School, Lewisham Year 12 Information Processes & Technology Assessment Task 3: Communications Systems June 2001 Outcomes Assessed: H1.1, H1.2, H2.1, H3.1, H5.2. Weighting:

More information

Project Update. Last update: Principal Investigators

Project Update. Last update: Principal Investigators Broadcas(ng Video in Dense 802.11g Networks Using Applica(on FEC and Mul(cast Project Update Last update: 6-20-2011 Principal Investigators Dr James Martin School of Computing Clemson University Clemson,

More information

FAULT TOLERANCE. Fault Tolerant Systems. Faults Faults (cont d)

FAULT TOLERANCE. Fault Tolerant Systems. Faults Faults (cont d) Distributed Systems Fö 9/10-1 Distributed Systems Fö 9/10-2 FAULT TOLERANCE 1. Fault Tolerant Systems 2. Faults and Fault Models. Redundancy 4. Time Redundancy and Backward Recovery. Hardware Redundancy

More information

Virtual Memory: Concepts

Virtual Memory: Concepts Virtual Memory: Concepts 5-23 / 8-23: Introduc=on to Computer Systems 6 th Lecture, Mar. 8, 24 Instructors: Anthony Rowe, Seth Goldstein, and Gregory Kesden Today VM Movaon and Address spaces ) VM as a

More information

ECE 1749H: Interconnec1on Networks for Parallel Computer Architectures. Introduc1on. Prof. Natalie Enright Jerger

ECE 1749H: Interconnec1on Networks for Parallel Computer Architectures. Introduc1on. Prof. Natalie Enright Jerger ECE 1749H: Interconnec1on Networks for Parallel Computer Architectures Introduc1on Prof. Natalie Enright Jerger Winter 2011 ECE 1749H: Interconnec1on Networks (Enright Jerger) 1 Interconnec1on Networks

More information

CS 138: Practical Byzantine Consensus. CS 138 XX 1 Copyright 2017 Thomas W. Doeppner. All rights reserved.

CS 138: Practical Byzantine Consensus. CS 138 XX 1 Copyright 2017 Thomas W. Doeppner. All rights reserved. CS 138: Practical Byzantine Consensus CS 138 XX 1 Copyright 2017 Thomas W. Doeppner. All rights reserved. Scenario Asynchronous system Signed messages s are state machines It has to be practical CS 138

More information

Basic concepts in fault tolerance Masking failure by redundancy Process resilience Reliable communication. Distributed commit.

Basic concepts in fault tolerance Masking failure by redundancy Process resilience Reliable communication. Distributed commit. Basic concepts in fault tolerance Masking failure by redundancy Process resilience Reliable communication One-one communication One-many communication Distributed commit Two phase commit Failure recovery

More information

Ethernet and WiFi. h-p://xkcd.com/466/

Ethernet and WiFi. h-p://xkcd.com/466/ Ethernet and WiFi h-p://xkcd.com/466/ CSCI 466: Networks Keith Vertanen Fall 2011 Overview Mul?ple access networks Ethernet Long history Dominant wired technology 802.11 Dominant wireless technology 2

More information