Dependable Computer Systems

Size: px

Start display at page:

Download "Dependable Computer Systems"

Kelley Lynch
5 years ago
Views:

1 Dependable Computer Systems Part 6b: System Aspects

2 Contents Synchronous vs. Asynchronous Systems Consensus Fault-tolerance by self-stabilization Examples Time-Triggered Ethernet (FT Clock Synchronization) Self-Stabilization in the Time-Triggered Architecture Traffic Policing System Architectures 2

3 Synchronous vs. asynchronous systems 3

4 Synchronous processors: A processor is said to be synchronous if it makes at least one processing step during real-time steps (or if some other computer in the system makes s processing steps). Bounded communication: The communication delay is said to be bounded if any message sent will arrive at its destination within real-time steps (if no failure occurs). Synchronous system: A system is said to be synchronous if its processors are synchronous and the communication delay is bounded. Real-time systems are per definition synchronous. Asynchronous systems: A system is said to be asynchronous if either its processors are asynchronous or the communication delay is unbounded 4

5 Consensus 5

6 Consensus Each processor starts a protocol with its local input value, which is sent to all other processors in the group, fulfilling the following properties: Consistency: All correct processors agree on the same value and all decisions are final. Non-triviality: The agreed-upon input value must have been some processors input (or is a function of the individual input values). Termination: Each correct processor decides on a value within a finite time interval. 6

7 Consensus (cont.) The consensus problem under the assumption of byzantine failures was first defined in 1980 in the context of the SIFT project which was aimed at building a computer system with ultra-high dependability. Other names are byzantine agreement or byzantine general problem interactive consistency 7

8 Impossibility of deterministic consensus in asynch. systems asynchronous systems cannot achieve consensus by a deterministic algorithm in the presence of even one crash failure of a processor it is impossible to differentiate between a late response and a processor crash by using coin flips, probabilistic consensus protocols can achieve consensus in a constant expected number of rounds failure detectors which suspect late processors to be crashed can also be used to achieve consensus in asynchronous systems 8

9 Impossibility of deterministic consensus in asynch. systems (cont.) n 3t + 1 processors are necessary to tolerate t failures Round 1: F Maj(0, 0, 1) = 0 Maj(0, 1, 1) = 1 Round 2: F Maj(0, 0, 1) = 0 Maj(0, 1, 1) = 1 9

10 Fault-tolerance by self-stabilization 10

11 Fault-tolerance by self-stabilization Self-Stabilization: A distributed system S is self-stabilizing with respect to some global predicate P if it satisfies the following two properties: 1. Closure: P is closed under the execution of S. That is, once P is established in S, it cannot be falsified. 2. Convergence: Starting from an arbitrary global state, S is guaranteed to reach a global state satisfying P within a finite number of state transitions 11

12 Fault-tolerance by self-stabilization (cont.) self-stabilizing systems need not be initialized and they can recover from transient failures (adaptive DSD is selfstabilizing) self-stabilization is a different approach to fault-tolerance, it is not based on countering the effects of failures but concentrating on the ability to reach a consistent state problems are how to achieve a global property with local actions and local knowledge, lack of theory on how to design self-stabilizing algorithms and how to guarantee timeliness 12

13 Examples - Time-Triggered Ethernet (FT Clock Synchronization) - Self-Stabilization in the Time-Triggered Architecture - Traffic Policing - System Architectures 13

14 Time Master TTE TTE TTE IN 1 IN 1 TTE TTE Eth IN 1 IN 1 TTE Time Master TTE TTE TTE Time Master IN 1 IN 1 TTE Eth TTE 1588 Fault-tolerant synchronization services are needed for establishing a robust global time base Eth

Failure Model for High-Integrity Components: Inconsistent-Omission Faulty COM MON IN OUT listen_in listen_out intercept Core COM/MON Assumptions: - COM and MON fail independently - MON can intercept

15 Failure Model for High-Integrity Components: Inconsistent-Omission Faulty COM MON IN OUT listen_in listen_out intercept Core COM/MON Assumptions: - COM and MON fail independently - MON can intercept a faulty message produced by the COM - COM cannot produce a valid message such that this message appears as two different messages on listen_out and OUT; though it may be valid on listen_out but detectable faulty on OUT or vice versa - MON cannot itself generate a faulty message, neither by inverting listen_out to an output, nor by toggling the intercept signal 15

16 Synchronization Services Clock Synchronization Service is executed during normal operation mode to keep the local clocks synchronized to each other. Startup/Restart Service is executed to reach an initial synchronization of the local clocks in the system. Integration/Reintegration Service is used for components to join an already synchronized system. Clique Detection Services are used to detect loss of synchronization and establishment of disjoint sets of synchronized components. Clock Synchronization Service Computer Time Fast Clock Perfect Clock Message Exchange Message Exchange Slow Clock Startup/Restart Service R.int R.int Real Time 16

17 Clock Synchronization Algorithm Algorithm Specification 17

18 Step 1: SMs send messages to CMs Compression Master IN 1 Synchronization Master 1 IN 2 IN 3 Synchronization Master 2 Synchronization Master 3 IN 4 IN 5 Synchronization Master 5 Synchronization Master 4 Precision Dispatch SM1 SM2 SM5 Permanence SM1 SM2 SM5 SM4 SM3 SM4 SM3 Acceptance Window (of SM 2/5)... CM CM t_0 t_1, t_2 t_4, t_5 Reference Point 18

19 Step 2: CMs send voted clock values back Compression Master IN C Synchronization Master 1 Synchronization Master 5 Synchronization Master 2 Synchronization Master 3 Synchronization Master 4 Precision Dispatch SM1 SM2 SM5 Permanence SM1 SM2 SM5 SM4 SM3 SM4 SM3 Acceptance Window (of SM 2/5)... CM CM t_0 t_1, t_2 t_4, t_5 Reference Point 19

20 Step 2: Multiple Channels/CMs Compression Master 1 Compression Master Synchronization Master 2 Synchronization Master 3 Multiple Channels/CMs are required for fault-tolerance. Synchronization Masters (SMs) receive synchronization messages from all non-faulty Compression Masters (CMs) SMs use either the median or the arithmetic mean on the redundant messages from the CMs.

21 Step 2: Faulty CMs send to some SMs Only one Channel and one CM are shown. Typically replicated. Compression Master IN C Synchronization Master 1 Synchronization Master 5 Synchronization Master 2 Synchronization Master 3 Synchronization Master 4 Precision Dispatch SM1 SM2 SM5 Permanence SM1 SM2 SM5 SM4 SM3 SM4 SM3 Acceptance Window (of SM 2/5)... CM CM t_0 t_1, t_2 t_4, t_5 Reference Point 21

22 Clock Synchronization Two Failures Scenario Even in the byzantine failure case, the non-faulty clocks remain synchronized with known bounds. 22

23 Examples - Time-Triggered Ethernet (FT Clock Synchronization) - Self-Stabilization in the Time-Triggered Architecture - Traffic Policing - System Architectures 23

24 Given a fault-tolerant system and some fault affecting multiple components. System state is inconsistent! 24

25 Self-Stabilization (Dijkstra 1974) Two properties: Closure Convergence Closure: If a system is in a legitimate state it will remain in this state Convergence: A system will eventually reach a legitimate state, from an arbitrary state legitimate illegitimate 25

26 Self-stabilization and fault-tolerance Different failure detection mechanism and failure correction mechanisms for different failures Using a sequence of algorithms to bring system nearer to the safe state Wilfried Steiner Real-Time Systems Group 26

27 Time-Triggered Architecture (cont.) - Time is split-up into (not necessarily equal) slots depending on message length - Slots are grouped into TDMA rounds TMDA round Real-Time - Each node has assigned exactly one slot in the TDMA round - This assignment is equal for every TDMA round - Actual Transmission Phase < Sending Slot 27

28 Phases of the TTP Coldstart sync. < 4 nodes sync. >= 4 nodes A A A S A S S 1 A S 2 A S 3 S A A A A A A loss of synchronous system view A... asynchronous node S... synchronous node 3 Phases: coldstart, synchronous < 4 nodes, no guarantees for the system services, sync. messages are broadcasted periodically synchronous >= 4 nodes (normal system operation), sync. messages are broadcasted periodically Wilfried Steiner Real-Time Systems Group 28

29 Phases of the TTP/C (cont.) cold-start sync. < 4 nodes sync. >= 4 nodes sync. >=threshold A A A S A S S 1 A S 2 A S 3 S A A A A A A S S S 4 S S S loss of synchronous system view A... asynchronous node S... synchronous node Identifying a 4 th phase of protocol execution: A sufficient number of nodes is synchronous 29

30 Cliques Nodes that communicate with each other form a clique In correct operation mode only one clique Possibility of more cliques after multiple transient failures Two types of multiple cliques operation: Benign: Synchronous operation Malign: Asynchronous operation 30

31 Cliques (cont.) Benign Cliques: Act still slot-synchronously accept and reject counters are used to determine the amount of nodes in the own clique Malign Cliques: Multiple cliques act slot-asynchronously Cliques do not see each other Clique A sends in the IFGs of Clique B and vice versa 7 Clique Slot a) TP 2 IFG... 3 TP IFG TP malign! 2 IFG ok! 8 Clique b) 31

32 Self-Stabilization and the TTA (cont.) Detect system misbehavior using the Membership Vector: If >= (n/2)+1 nodes set in the membership: ok! If < (n)/2+1 nodes set in membership: restart! Bring the system in a safe state again by using the startup algorithm and reintegration of TTP Amount of nodes Amount of nodes Threshold Time Threshold Time 32

33 Examples - Time-Triggered Ethernet (FT Clock Synchronization) - Self-Stabilization in the Time-Triggered Architecture - Traffic Policing - System Architectures 33

34 Leaky-Bucket Traffic Policing Rate-Constrained Traffic (RC) Switch/Router Receiver Sender min. duration min. duration min. duration 34

35 Leaky-Bucket Traffic Policing (cont.) Token-Bucket / Leaky-Bucket algorithm are implemented to control the behavior of rate-constrained traffic. In the case when a faulty end point attempts to send frames too close back-to-back, then the token/leaky-bucket algorithm will detect this behavior and drop the frame. Token/leaky-bucket algorithms may be expensive to implement as they require to track the timing on a per VL basis. 35

36 Scheduled Traffic Policing D F A G A D E B C Transmission Sequence of a TT frame 1. Frame dispatch at t_dispatch 2. Frame start sending at t_send 3. Frame start receiving at t_receive E Link AD Physical Topology Dataflow Path Virtual Link Maximum Clock Difference H Link DE t_dispatch t_send max (d_shuffle) t_receive Pi Pi max (d_shuffle) Acceptance Window t_acc t_dispatch t_send max (d_shuffle) Time A Time D Time D Temporal Correctness is checked via Acceptance Window Test t_receive = t_send + l_link (l_link link latency) t_acc = t_dispatch + l_link Pi (Pi maximum distance of any two synchronized correct clocks in the system) t_receive 36 Time E

37 Scheduled Traffic Policing (cont.) Scheduled traffic policing checks the correctness of a received message with respect to a synchronized global time. Scheduled traffic policing enforces minimum durations as well as maximum durations in between two successive messages of the same stream. 37

38 Central Bus Guardian Nodes and switches are synchronized to each other with a precision Pi. In a cut-through setting the TDMA slot needs to account four times the precision (4*Pi) as a safety margin. Only then it is guaranteed that traffic policing in the switch does not truncate correct transmissions. Transmission Window (Slot n) Write Access Granted Guardian Transmission Window (Slot n) Maximum Clock Difference Transmission Window (Slot n) Fast Node Slow Node Actual Transmission Physical Time 38

39 Examples - Time-Triggered Ethernet (FT Clock Synchronization) - Self-Stabilization in the Time-Triggered Architecture - Traffic Policing - System Architectures 39

40 Distributed Integrated Modular Avionics (Distributed IMA) Synchronized Partitions MODULE (N) Partition 1 Task Task Task Partition OS MODULE 1 Partition 2 Task Task 1 2 Partition OS... Partition n Task Task 1 2 Partition OS Partition 1 Task Task Task Partition OS MODULE (N-1) Partition 2 Task Task 1 2 Partition OS... Partition n Task Task 1 2 Partition OS Partition 1 Partition 2 Task Task Task Task Task Partition OS Partition OS Message Channels (Sampling/Queueing Ports)... Partition n Task Task 1 2 Partition OS TTE Sync Library TTE COM Layer ARINC 653 TTEthernet API Library TTEthernet NIC Driver Best-Effort Message Channels (Sampling/Queueing Ports) Rate-Constr. TT-traffic TTEthernet NIC IMA OS Message Channel API Core Services VxWorks653 Time/Space Partitioning OS Fault-Tolerant Timebase TTE Sync Library TTE COM Layer ARINC 653 TTEthernet API Library TTEthernet NIC Driver TT-traffic Message Channels (Sampling/Queueing Ports) Rate-Constr. Best-Effort TTEthernet NIC IMA OS Message Channel API Core Services VxWorks653 Time/Space Partitioning OS Module (N+1) Module (N+2) Module (N+3) TTE COM Layer ARINC 653 TTEthernet API Library TTEthernet NIC Driver TT-traffic Rate-Constr. Best-Effort TTEthernet NIC IMA OS Message Channel API Core Services VxWorks653 Time/Space Partitioning OS Non-Critical Subsystem Module (N+4) TTEthernet (SAE AS6802) Synchronous (TDM) & Asynchronous Comm. Distributed fault-tolerant timebase Robust TDMA partitioning Task 1 Task 2 VxWorks 6.8 Ethernet NIC (ARINC664) Task 1 VxWorks 6.8 Ethernet NIC Task 1 VxWorks 6.8 Ethernet NIC Task Task 1 2 RT-Linux Ethernet NIC Module (N+5) Task Task 1 2 Windows Ethernet NIC Module (N+6) Task Task 1 2 Windows Ethernet NIC 40

41 Distributed Integrated Modular Avionics (Distributed IMA) (cont.) The network is a distributed fault-tolerant embedded computer and executes a set of partitions synchronous TDMA communication for Ethernet allows integration of low-latency, low-jitter VLs in complex networks System-level partitioning closes the gap between federated and integrated architectures The partitions can be mutually aligned and synchronized to system time 41

Automotive Integrated Safety Platform zfas,

variety of innovative functions with multiple safety

zfas uses numerous technology components from TTTech.

42 Automotive Integrated Safety Platform zfas, co-developed with TTTech, enabling Audi to integrate a variety of innovative functions with multiple safety criticality levels. zfas uses numerous technology components from TTTech. For example, the individual CPU cores are connected based on Deterministic Ethernet communication 42

43 Automotive Integrated Safety Platform (cont.) 43

44 Automotive Integrated Safety Platform (cont.) Combines high-performance computing with functional safety Highly efficient deterministic SW integration Applications can be moved between embedded cores Supports various SoCs (System-on-a-Chip) and operating systems Accelerated development due to PC-based co-simulation 44

45 Fog Computing for the Dependable Internet of Things A New Infrastructure Layer Fog Fog Fog Fog Fog Fog 45

46 Fog Computing for the Dependable Internet of Things (cont.) Fog computing is an architecture approach that provides nonfunctional knowledge to enable dependability in the IoT. Thing Fog Node Determinism 46

47 Fog Computing for the Dependable Internet of Things (cont.) + +? Multiple Components Modularized & Cross-Industry 47

48 Fog Computing for the Dependable Internet of Things (cont.) Moore s Law Alive and Well Heterogeneous Multi-Core Hardware Supported Virtualization at Chip Level Possible virtual machine virtual machine virtual machine virtual machine Safety Apps Control Apps IT / Media Apps Security Apps Safe OS RTOS Rich OS Secure OS Thin Software Hypervisor Hardware Hypervisor Support Core 0 Core 1 48

49 Fog Node on Wheels 49

An Introduction to TTEthernet

An Introduction to TTEthernet An Introduction to thernet TU Vienna, Apr/26, 2013 Guest Lecture in Deterministic Networking (DetNet) Wilfried Steiner, Corporate Scientist wilfried.steiner@tttech.com Copyright TTTech Computertechnik