When is CAN the weakest link? A bound on Failures-In-Time in CAN-Based Distributed Real-Time Systems. Arpan Gujara6 Björn B.

Size: px

Start display at page:

Download "When is CAN the weakest link? A bound on Failures-In-Time in CAN-Based Distributed Real-Time Systems. Arpan Gujara6 Björn B."

Chastity Fisher
5 years ago
Views:

1 When is CAN the weakest link? A bound on Failures-In-Time in CAN-Based Distributed Real-Time Systems Arpan Gujara6 Björn B. Brandenburg

2 Failures due to Transient Faults Harsh environments Spark plugs Hard radia:on High-power machinery

3 Failures due to Transient Faults Harsh environments Spark plugs Hard radia:on High-power machinery Electromagne:c Interference (EMI) Bit-flips in the hosts and in the network

4 Failures due to Transient Faults Harsh environments Spark plugs Hard radia:on High-power machinery Electromagne:c Interference (EMI) Bit-flips in the hosts and in the network EMI-induced transient faults Manifest as program-visible failures

5 Failures due to Transient Faults Transmission failures (faults on the wire) Commission failures (bit-flips in the memory buffers) Crash failures (due to fault-induced excep:ons)

6 Failures due to Transient Faults Transmission failures (faults on the wire) Tolerated by error detec6on and retransmissions Commission failures (bit-flips in the memory buffers) Crash failures (due to fault-induced excep:ons)

7 Failures due to Transient Faults Transmission failures (faults on the wire) Commission failures (bit-flips in the memory buffers) Crash failures (due to fault-induced excep:ons) Tolerated by error detec6on and retransmissions Tolerated by ac:ve replica6on of tasks on independent hosts

8 Failures due to Transient Faults Transmission failures (faults on the wire) Commission failures (bit-flips in the memory buffers) Crash failures (due to fault-induced excep:ons) Tolerated by error detec6on and retransmissions Tolerated by ac:ve replica6on of tasks on independent hosts How to decide the best replica6on strategy? Is Triple Modular Redundancy (TMR) enough? or is Quadruple Modular Redundancy (QMR) required? Would you replicate only the high-frequency tasks? or only the high-cri:cality tasks?

9 Retransmissions vs. Replica5on Tradeoff

10 Retransmissions vs. Replica5on Tradeoff For tolera:ng retransmissions-induced delays (Ensure no deadline viola:ons) The more slack, the beeer!

11 Retransmissions vs. Replica5on Tradeoff For tolera:ng retransmissions-induced delays (Ensure no deadline viola:ons) The more slack, the beeer! versus Ac:ve replica:on of tasks Reduced slack in the schedule

12 Retransmissions vs. Replica5on Tradeoff For tolera:ng retransmissions-induced delays (Ensure no deadline viola:ons) The more slack, the beeer! versus Ac:ve replica:on of tasks Reduced slack in the schedule Correctness Replica:on factor

13 Retransmissions vs. Replica5on Tradeoff For tolera:ng retransmissions-induced delays (Ensure no deadline viola:ons) The more slack, the beeer! versus Ac:ve replica:on of tasks Reduced slack in the schedule Correctness Replica:on factor Timeliness Replica:on factor

14 Retransmissions vs. Replica5on Tradeoff For tolera:ng retransmissions-induced delays (Ensure no deadline viola:ons) The more slack, the beeer! versus Ac:ve replica:on of tasks Reduced slack in the schedule Correctness Replica:on factor Timeliness Replica:on factor Success Replica:on factor

15 Retransmissions vs. Replica5on Tradeoff For tolera:ng retransmissions-induced delays (Ensure no deadline viola:ons) The more slack, the beeer! versus Ac:ve replica:on of tasks Reduced slack in the schedule? Correctness Replica:on factor Timeliness Replica:on factor Success Replica:on factor How to sta:cally determine the op6mal replica6on factor?

16 This Work For CAN-based distributed real-6me systems Probabilis6c analysis Quan:fy the replica:on vs. retransmissions tradeoff

17 The Larger Picture The CAN-based system is just one component in a safety-cri:cal system

18 The Larger Picture The CAN-based system is just one component in a safety-cri:cal system = (Photo: Airbus D&S/Dassault/Finmeccanica) UAV Power CAN

19 The Larger Picture The CAN-based system is just one component in a safety-cri:cal system = (Photo: Airbus D&S/Dassault/Finmeccanica) UAV Power CAN Replicate tasks, add more ECUs to the CAN subsystem

20 The Larger Picture The CAN-based system is just one component in a safety-cri:cal system = (Photo: Airbus D&S/Dassault/Finmeccanica) UAV Add a heat-sink to the power supply unit Power CAN Replicate tasks, add more ECUs to the CAN subsystem

21 The Larger Picture The CAN-based system is just one component in a safety-cri:cal system = (Photo: Airbus D&S/Dassault/Finmeccanica) UAV Add a heat-sink to the power supply unit Power CAN Replicate tasks, add more ECUs to the CAN subsystem What if the UAV has strict weight constraints? and you can either add the heat sink or the addi:onal ECUs How do you decide the best choice?

22 Failures-In-Time (FIT) Rate Expected #failures in one billion opera:ng hours e.g., 1M UAVs flying for 1K hours each FIT rates are widely used in the industry

23 Failures-In-Time (FIT) Rate Expected #failures in one billion opera:ng hours e.g., 1M UAVs flying for 1K hours each FIT rates are widely used in the industry

24 Failures-In-Time (FIT) Rate Expected #failures in one billion opera:ng hours e.g., 1M UAVs flying for 1K hours each FIT rates are widely used in the industry When is the CAN-based distributed real-:me system the weakest link in the system?

25 This Work For CAN-based distributed real-6me systems Probabilis6c analysis Quan:fy the replica:on vs. retransmissions tradeoff FIT rate analysis Builds upon the proposed probabilis:c analysis

26 Overview Model Analysis Evalua6on

27 Fault Abstrac5on & Modeling Transmission failures (faults on the wire) Commission failures (bit-flips in the memory buffers) Crash failures (due to fault-induced excep:ons)

28 Fault Abstrac5on & Modeling Transmission failures (faults on the wire) Commission failures (bit-flips in the memory buffers) Poisson Distribu6on Crash failures (due to fault-induced excep:ons)

29 Fault Abstrac5on & Modeling Transmission failures (faults on the wire) Commission failures (bit-flips in the memory buffers) Poisson Distribu6on Crash failures (due to fault-induced excep:ons) Probability that each message is omieed / corrupted / retransmieed

Fault Abstrac5on & Modeling Transmission failures (faults on the wire) Commission failures (bit-flips in the memory buffers) Poisson Distribu6on

30 Fault Abstrac5on & Modeling Transmission failures (faults on the wire) Commission failures (bit-flips in the memory buffers) Poisson Distribu6on Crash failures (due to fault-induced excep:ons) Probability that each message is omieed / corrupted / retransmieed We do not consider sotware defects

31 System Model Host 2 CAN bus Host N Host 1 Host 3

32 System Model Host 2 Task A CAN bus Task B Host N Host 1 Host 3

33 System Model Host 2 Task A Msg. Sender Receiver M Task A Task B Message M from Task A to Task B CAN bus M Task B OUT Host N Host 1 Host 3

34 System Model Host 2 Task A Msg. Sender Receiver M Task A Task B Message M from Task A to Task B CAN bus M Task B OUT Host N Host 1 Host 3 Is OUT generated on :me? Is OUT generated from a logically correct value of M?

35 System Model Host 2 with Task Replica6on Task A Msg. Sender Receiver Replicas M Task A Task B 3 Message M from Task A to Task B CAN bus M Task B OUT Host N Host 1 Host 3

36 System Model Host 2 with Task Replica6on Task A Msg. Sender Receiver Replicas M Task A Task B 3 Message M from Task A to Task B CAN bus M Task B OUT M M Host N Task A replica Task A replica Message M from replicas of Task A to Task B Host 1 Host 3

37 Aggrega5ng the replicated messages How & when to compute OUT from mul:ple copies of M?

38 Aggrega5ng the replicated messages How & when to compute OUT from mul:ple copies of M? Case 1: Synchronous Systems Common global :me base e.g. majority value at the absolute deadline

Aggrega5ng the replicated messages How & when to compute OUT from mul:ple copies of M? Case 1: Synchronous Systems Common global :me base e.g. majority value at the absolute deadline Case 2: Asynchronous Systems No global :me base e.

39 Aggrega5ng the replicated messages How & when to compute OUT from mul:ple copies of M? Case 1: Synchronous Systems Common global :me base e.g. majority value at the absolute deadline Case 2: Asynchronous Systems No global :me base e.g. majority value ater enough copies have been received

40 Overview Model Analysis Evalua6on

41 The Larger Picture = (Photo: Airbus D&S/Dassault/Finmeccanica)

42 The Larger Picture = (Photo: Airbus D&S/Dassault/Finmeccanica) Objec:ves: A good replica6on strategy for the CAN-based system Compare the reliability of the CAN-based system with other components in the safety-cri6cal system

43 The Larger Picture = (Photo: Airbus D&S/Dassault/Finmeccanica) Objec:ves: A good replica6on strategy for the CAN-based system Compare the reliability of the CAN-based system with other components in the safety-cri6cal system Solu:on: FIT rate analysis Using the probabilis6c analysis

44 FIT Rate Analysis of the System FIT rate of the CAN subsystem

45 FIT Rate Analysis of the System FIT rate of the CAN subsystem FIT1 FIT2 FITn

46 FIT Rate Analysis of the System FIT rate of the CAN subsystem FIT1 FIT2 FITn Lower Bound on the Probability of Successful Transmission of M1

47 FIT Rate Analysis of the System FIT rate of the CAN subsystem FIT1 FIT2 Probability Density Func:on f1(t) FITn Lower Bound on the Probability of Successful Transmission of M1

48 FIT Rate Analysis of the System FIT rate of the CAN subsystem FIT1 FIT2 Mean Time Between Failures MTBF1 Probability Density Func:on f1(t) FITn Lower Bound on the Probability of Successful Transmission of M1

49 FIT Rate Analysis of the System FIT rate of the CAN subsystem FIT1 FIT2 Mean Time Between Failures MTBF1 Probability Density Func:on f1(t) FITn Lower Bound on the Probability of Successful Transmission of M1

50 FIT Rate Analysis of the System FIT rate of the CAN subsystem FIT1 FIT2 FITn Standard procedure to compute FIT rates given the failure probabili:es, but tailored for real-6me workloads Mean Time Between Failures MTBF1 Probability Density Func:on f1(t) Lower Bound on the Probability of Successful Transmission of M1

51 FIT Rate Analysis of the System FIT rate of the CAN subsystem FIT1 FIT2 Mean Time Between Failures MTBF1 Probability Density Func:on f1(t) FITn Lower Bound on the Probability of Successful Transmission of M1

52 FIT Rate Analysis of the System FIT rate of the CAN subsystem FIT1 FIT2 Mean Time Between Failures MTBF1 Probability Density Func:on f1(t) FITn Lower Bound on the Probability of Successful Transmission of M1 Network faults Host faults

53 FIT Rate Analysis of the System Host and network faults follow a Poisson distribu6on FIT rate of the CAN subsystem FIT1 FIT2 FITn Mean Time Between Failures MTBF1 Probability Density Func:on f1(t) Lower Bound on the Probability of Successful Transmission of M1 Network faults Host faults

54 FIT Rate Analysis of the System FIT1 FIT rate of FIT2 the CAN subsystem FITn retransmiced due to transmission failures omiced due to crash failures corrupted due to commission failures Host and network faults follow a Poisson distribu6on Probabili:es that each message: Mean Time Between Failures MTBF1 Probability Density Func:on f1(t) Lower Bound on the Probability of Successful Transmission of M1 Network faults Host faults Message Retransmission Message Omission Message Corrup:on

55 FIT Rate Analysis of the System Broster et al. s probabilis6c response-6me analysis* FIT rate of the CAN subsystem FIT1 FIT2 Mean Time Between Failures MTBF1 Probability Density Func:on f1(t) FITn Lower Bound on the Probability of Successful Transmission of M1 Network faults Host faults Message Retransmission Message Omission Message Corrup:on OUT1 is computed on :me *Broster, Ian, Alan Burns, and Guillermo Rodriguez-Navas. "Timing analysis of real-6me communica6on under electromagne6c interference." Real-Time Systems (2005):

56 FIT Rate Analysis of the System Broster et al. s probabilis6c response-6me analysis* We extend the analysis for a set of message replicas E.g., any 1 out of 3 message replicas are transmiced on :me FIT rate of the CAN subsystem FIT1 FIT2 FITn Mean Time Between Failures MTBF1 Probability Density Func:on f1(t) Lower Bound on the Probability of Successful Transmission of M1 Network faults Host faults Message Retransmission Message Omission Message Corrup:on OUT1 is computed on :me *Broster, Ian, Alan Burns, and Guillermo Rodriguez-Navas. "Timing analysis of real-6me communica6on under electromagne6c interference." Real-Time Systems (2005):

57 FIT Rate Analysis of the System Broster et al. s probabilis6c response-6me analysis* We extend the analysis for a set of message replicas E.g., any 1 out of 3 message replicas are transmiced on :me Deadline viola:ons due to crash failures FIT rate of the CAN subsystem FIT1 FIT2 FITn Mean Time Between Failures MTBF1 Probability Density Func:on f1(t) Lower Bound on the Probability of Successful Transmission of M1 Network faults Host faults Message Retransmission Message Omission Message Corrup:on OUT1 is computed on :me *Broster, Ian, Alan Burns, and Guillermo Rodriguez-Navas. "Timing analysis of real-6me communica6on under electromagne6c interference." Real-Time Systems (2005):

58 FIT Rate Analysis of the System Broster et al. s probabilis6c response-6me analysis* We extend the analysis for a set of message replicas E.g., any 1 out of 3 message replicas are transmiced on :me Deadline viola:ons due to crash failures FIT rate of the CAN subsystem FIT1 Incorrect output due to commission failures FIT2 FITn Mean Time Between Failures MTBF1 Probability Density Func:on f1(t) Lower Bound on the Probability of Successful Transmission of M1 Network faults Host faults Message Retransmission Message Omission Message Corrup:on OUT1 is computed on :me OUT1 does not contain a corrupted value of M1 *Broster, Ian, Alan Burns, and Guillermo Rodriguez-Navas. "Timing analysis of real-6me communica6on under electromagne6c interference." Real-Time Systems (2005):

59 FIT Rate Analysis of the System Broster et al. s probabilis6c response-6me analysis* We extend the analysis for a set of message replicas E.g., any 1 out of 3 message replicas are transmiced on :me Deadline viola:ons due to crash failures FIT rate of the CAN subsystem FIT1 Incorrect output due to commission failures FIT2 FITn Mean Time Between Failures MTBF1 Probability Density Func:on f1(t) Lower Bound on the Probability of Successful Transmission of M1 Network faults Host faults Message Retransmission Message Omission Message Corrup:on OUT1 is computed on :me OUT1 does not contain a corrupted value of M1 *Broster, Ian, Alan Burns, and Guillermo Rodriguez-Navas. "Timing analysis of real-6me communica6on under electromagne6c interference." Real-Time Systems (2005):

60 FIT Rate Analysis of the System Broster et al. s probabilis6c response-6me analysis* We extend the analysis for a set of message replicas E.g., any 1 out of 3 message replicas are transmiced on :me Deadline viola:ons due to crash failures FIT rate of the CAN subsystem FIT1 Incorrect output due to commission failures FIT2 FITn Mean Time Between Failures MTBF1 Probability Density Func:on f1(t) Lower Bound on the Probability of Successful Transmission of M1 Network faults Host faults Message Retransmission Message Omission Message Corrup:on OUT1 is computed on :me OUT1 does not contain a corrupted value of M1 *Broster, Ian, Alan Burns, and Guillermo Rodriguez-Navas. "Timing analysis of real-6me communica6on under electromagne6c interference." Real-Time Systems (2005):

61 FIT Rate Analysis of the System FIT rate of the CAN subsystem FIT1 FIT2 FITn Math in the paper! Mean Time Between Failures MTBF1 Probability Density Func:on f1(t) Lower Bound on the Probability of Successful Transmission of M1 Network faults Host faults Message Retransmission Message Omission Message Corrup:on OUT1 is computed on :me OUT1 does not contain a corrupted value of M1

62 Overview Model Analysis Evalua6on

63 Mobile Robot Workload* Task Name Length (bytes) Period (ms) Deadline (ms) MotorCtrl Wheel Wheel RadioIn Proximity Logging Broster, Ian, Alan Burns, and Guillermo Rodriguez-Navas. "Comparing real-6me communica6on under electromagne6c interference." Real-Time Systems, ECRTS Proceedings. 16th Euromicro Conference on. IEEE, 2004.

64 Mobile Robot Workload* Only the MotorCtrl task is replicated (#replicas vary from 1 to 9) Task Name Length (bytes) Period (ms) Deadline (ms) MotorCtrl Wheel Wheel RadioIn Proximity Logging Broster, Ian, Alan Burns, and Guillermo Rodriguez-Navas. "Comparing real-6me communica6on under electromagne6c interference." Real-Time Systems, ECRTS Proceedings. 16th Euromicro Conference on. IEEE, 2004.

65 Evalua5on Assess the proposed FIT rate deriva:on Comparison with results from CAN bus simula:on

66 Evalua5on Assess the proposed FIT rate deriva:on Comparison with results from CAN bus simula:on Is the FIT rate analysis too coarse-grained? Analysis for various fault rates

67 Analysis versus Simula5on for MotorCtrl Prob. of Failure of MotorCtrl Lower means beeer MotorCtrl reliability! #Replicas of MotorCtrl Task

68 Analysis versus Simula5on for MotorCtrl 1 Prob. of Failure of MotorCtrl e-05 1e-06 1e-07 1e-08 Asynchronous Analysis Asynchronous Simula6on #Replicas of MotorCtrl Task

69 Analysis versus Simula5on for MotorCtrl 1 Prob. of Failure of MotorCtrl e-05 1e-06 1e-07 1e-08 Synchronous Analysis Asynchronous Analysis Synchronous Simula6on Asynchronous Simula6on #Replicas of MotorCtrl Task

70 Prob. of Failure of MotorCtrl Analysis versus Simula5on for MotorCtrl e-05 1e-06 1e-07 1e-08 MotorCtrl becomes more reliable with replica6on Synchronous Analysis Asynchronous Analysis Synchronous Simula6on Asynchronous Simula6on #Replicas of MotorCtrl Task

71 Analysis versus Simula5on for MotorCtrl 1 Prob. of Failure of MotorCtrl e-05 1e-06 1e-07 1e-08 Analysis tracks the simula6on results Synchronous Analysis Asynchronous Analysis Synchronous Simula6on Asynchronous Simula6on #Replicas of MotorCtrl Task

72 FIT Rate Analysis of the CAN Subsystem FIT Rate of the System Lower means beeer reliability of the CAN subsystem! #Replicas of MotorCtrl Task

73 FIT Rate Analysis of the CAN Subsystem FIT Rate FIT of Rate the System 1e e-05 1e-10 1e-15 1e-20 1e-25 1e-30 1e-35 FIT Rate vs. Replication Factor for varying EMI rates host-fault-rate=10-20 /ms, bus-fault-rate=10-6 /ms host-fault-rate=10-12 /ms, bus-fault-rate=10-8 /ms host-fault-rate=10-16 /ms, bus-fault-rate=10-10 /ms Replicaion Factor #Replicas of MotorCtrl Task

74 FIT Rate Analysis of the CAN Subsystem FIT Rate FIT of Rate the System 1e e-05 1e-10 1e-15 1e-20 1e-25 1e-30 1e-35 FIT Rate vs. Replication Factor for varying EMI rates host-fault-rate=10-20 /ms, bus-fault-rate=10-6 /ms host-fault-rate=10-12 /ms, bus-fault-rate=10-8 /ms host-fault-rate=10-16 /ms, bus-fault-rate=10-10 /ms Replicaion Factor System becomes more reliable due to beeer tolerance against crash and commission failures #Replicas of MotorCtrl Task

75 FIT Rate Analysis of the CAN Subsystem 1e+10 FIT Rate vs. Replication Factor for varying EMI rates FIT Rate FIT of Rate the System System 1 reliability degrades 1e-05 1e-10 1e-15 1e-20 1e-25 1e-30 1e-35 due to less slack host-fault-rate=10-20 /ms, bus-fault-rate=10-6 /ms host-fault-rate=10-12 /ms, bus-fault-rate=10-8 /ms host-fault-rate=10-16 /ms, bus-fault-rate=10-10 /ms Replicaion Factor #Replicas of MotorCtrl Task

76 Success Replica:on factor

77 Success Failure Replica:on factor Replica:on factor

78 FIT Rate Analysis of the CAN Subsystem FIT Rate FIT of Rate the System 1e e-05 1e-10 1e-15 1e-20 1e-25 1e-30 1e-35 FIT Rate vs. Replication Factor for varying EMI rates host-fault-rate=10-20 /ms, bus-fault-rate=10-6 /ms host-fault-rate=10-12 /ms, bus-fault-rate=10-8 /ms host-fault-rate=10-16 /ms, bus-fault-rate=10-10 /ms Replicaion Factor #Replicas of MotorCtrl Task

79 FIT Rate Analysis of the CAN Subsystem FIT Rate FIT of Rate the System 1e e-05 1e-10 1e-15 1e-20 1e-25 1e-30 1e-35 FIT Rate vs. Replication Factor for varying EMI rates Analysis yields the op6mal replica6on factor host-fault-rate=10-20 /ms, bus-fault-rate=10-6 /ms host-fault-rate=10-12 /ms, bus-fault-rate=10-8 /ms host-fault-rate=10-16 /ms, bus-fault-rate=10-10 /ms Replicaion Factor #Replicas of MotorCtrl Task

80 FIT Rate Analysis of the CAN Subsystem FIT Rate FIT of Rate the System 1e e-05 1e-10 1e-15 1e-20 1e-25 1e-30 1e-35 FIT Rate vs. Replication Factor for varying EMI rates Analysis yields the op6mal replica6on factor host-fault-rate=10-20 /ms, bus-fault-rate=10-6 /ms host-fault-rate=10-12 /ms, bus-fault-rate=10-8 /ms host-fault-rate=10-16 /ms, bus-fault-rate=10-10 /ms FIT rate spans more than 20 orders of magnitude Replicaion Factor #Replicas of MotorCtrl Task

81 Summary

82 Summary? Find the best replica:on strategy for CAN-based systems Correctness Replica:on factor Timeliness Replica:on factor Success Replica:on factor

? Summary Replica:on factor Success Timeliness Correctness Find the best replica:on strategy for CAN-based systems Replica:on factor Replica:on factor

83 ? Summary Replica:on factor Success Timeliness Correctness Find the best replica:on strategy for CAN-based systems Replica:on factor Replica:on factor Compare reliability of the CAN-bus subsystem in the context of the larger safety-cri:cal system = (Photo: Airbus D&S/Dassault/Finmeccanica) UAV Power CAN

84 Future Work More complex system models CAN-based systems bridged together Sporadic DAG models

85 Future Work More complex system models CAN-based systems bridged together Sporadic DAG models Study of other technologies, e.g., Real Time Ethernet

86 Future Work More complex system models CAN-based systems bridged together Sporadic DAG models Study of other technologies, e.g., Real Time Ethernet hep://

A Byzantine Fault-Tolerant Key-Value Store for Safety-Critical Distributed Real-Time Systems

Work in progress A Byzantine Fault-Tolerant Key-Value Store for Safety-Critical Distributed Real-Time Systems December 5, 2017 CERTS 2017 Malte Appel, Arpan Gujarati and Björn B. Brandenburg Distributed