Dependable Computer Systems


1 Dependable Computer Systems Part 6b: System Aspects

2 Contents
- Synchronous vs. Asynchronous Systems
- Consensus
- Fault-tolerance by self-stabilization
- Examples
  - Time-Triggered Ethernet (FT Clock Synchronization)
  - Self-Stabilization in the Time-Triggered Architecture
  - Traffic Policing
  - System Architectures

3 Synchronous vs. asynchronous systems

4 Synchronous processors: A processor is said to be synchronous if it makes at least one processing step during Δ real-time steps, for some known bound Δ (or while some other processor in the system makes s processing steps).
Bounded communication: The communication delay is said to be bounded if any message sent will arrive at its destination within Φ real-time steps, for some known bound Φ (if no failure occurs).
Synchronous system: A system is said to be synchronous if its processors are synchronous and the communication delay is bounded. Real-time systems are by definition synchronous.
Asynchronous systems: A system is said to be asynchronous if either its processors are asynchronous or the communication delay is unbounded.

5 Consensus

6 Consensus
Each processor starts a protocol with its local input value, which is sent to all other processors in the group. The protocol must fulfill the following properties:
- Consistency: All correct processors agree on the same value and all decisions are final.
- Non-triviality: The agreed-upon value must have been some processor's input (or is a function of the individual input values).
- Termination: Each correct processor decides on a value within a finite time interval.
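The three properties can be checked mechanically on the outcome of a single protocol run. The following is a minimal sketch under stated assumptions: the function name `check_consensus`, the dictionary-based representation of inputs and decisions, and the processor names are all hypothetical and not part of the lecture. It only inspects a finished run; it is not a consensus protocol.

```python
from typing import Dict, Hashable, Set


def check_consensus(inputs: Dict[str, Hashable],
                    decisions: Dict[str, Hashable],
                    correct: Set[str]) -> Dict[str, bool]:
    """Check one finished run against the three consensus properties.

    inputs    -- input value of every processor (hypothetical representation)
    decisions -- value decided by each processor that has decided so far
    correct   -- names of the processors that did not fail
    """
    decided = [decisions[p] for p in correct if p in decisions]
    return {
        # Consistency: all correct processors that decided chose the same value.
        "consistency": len(set(decided)) <= 1,
        # Non-triviality (simple variant): the value was some processor's input.
        "non_triviality": all(v in inputs.values() for v in decided),
        # Termination: every correct processor has decided.
        "termination": all(p in decisions for p in correct),
    }


result = check_consensus(
    inputs={"p1": 0, "p2": 1, "p3": 1},
    decisions={"p1": 1, "p2": 1},      # p3 has not decided yet
    correct={"p1", "p2", "p3"},
)
print(result)
```

In this run the two decisions agree and match an input, but termination is violated because p3 has not decided.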

7 Consensus (cont.)
The consensus problem under the assumption of Byzantine failures was first defined in 1980 in the context of the SIFT project, which aimed at building a computer system with ultra-high dependability.
Other names are:
- Byzantine agreement or Byzantine generals problem
- interactive consistency

8 Impossibility of deterministic consensus in asynch. systems
- Asynchronous systems cannot achieve consensus by a deterministic algorithm in the presence of even one crash failure of a processor.
- It is impossible to differentiate between a late response and a processor crash.
- By using coin flips, probabilistic consensus protocols can achieve consensus in a constant expected number of rounds.
- Failure detectors, which suspect late processors to have crashed, can also be used to achieve consensus in asynchronous systems.
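A failure detector that suspects late processors can be sketched with per-processor timeouts. This is only an illustration under stated assumptions: the class name `TimeoutFailureDetector`, its methods, and the rule of doubling the timeout after a wrong suspicion are hypothetical choices, not taken from the lecture. The sketch shows why a suspicion is not proof of a crash: a late heartbeat revokes it.

```python
class TimeoutFailureDetector:
    """Suspects a processor when no heartbeat arrived within its timeout."""

    def __init__(self, processors, timeout):
        self.timeout = {p: timeout for p in processors}
        self.last_seen = {p: 0.0 for p in processors}
        self.suspected = set()

    def heartbeat(self, p, now):
        """Record a heartbeat from processor p at local time `now`."""
        if p in self.suspected:
            # The suspicion was wrong: p was late, not crashed.
            # Assumed policy: be more patient with p from now on.
            self.suspected.discard(p)
            self.timeout[p] *= 2
        self.last_seen[p] = now

    def check(self, now):
        """Return the set of processors suspected at local time `now`."""
        for p, seen in self.last_seen.items():
            if now - seen > self.timeout[p]:
                self.suspected.add(p)
        return set(self.suspected)


fd = TimeoutFailureDetector(["a", "b"], timeout=1.0)
fd.heartbeat("a", 1.8)
print(fd.check(2.0))      # b is silent for too long and gets suspected
fd.heartbeat("b", 2.5)    # b was only late; the suspicion is revoked
print(fd.check(3.0))
```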

9 Impossibility of deterministic consensus in asynch. systems (cont.)
n >= 3t + 1 processors are necessary to tolerate t Byzantine failures.
[Figure: three processors, one of them faulty (F). In Round 1 and again in Round 2, one correct processor computes Maj(0, 0, 1) = 0 while the other computes Maj(0, 1, 1) = 1.]
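The majority values on this slide can be reproduced with a small sketch. It assumes three processors A, B and F, where A holds input 0, B holds input 1, and the faulty processor F reports 0 to A but 1 to B; these names and the concrete value assignment are illustrative assumptions. With n = 3 and t = 1 there are fewer than 3t + 1 = 4 processors, and the two correct processors end up with different majorities.

```python
from collections import Counter


def maj(values):
    """Return the most frequent value (three binary votes always have a strict majority)."""
    return Counter(values).most_common(1)[0][0]


# Views of the two correct processors: own input, the other correct
# processor's input, and whatever the faulty processor F told them.
view_a = [0, 1, 0]   # F reports 0 to A
view_b = [0, 1, 1]   # F reports 1 to B

decision_a = maj(view_a)
decision_b = maj(view_b)
print(decision_a, decision_b)   # the correct processors disagree
```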

10 Fault-tolerance by self-stabilization

11 Fault-tolerance by self-stabilization
Self-Stabilization: A distributed system S is self-stabilizing with respect to some global predicate P if it satisfies the following two properties:
1. Closure: P is closed under the execution of S. That is, once P is established in S, it cannot be falsified.
2. Convergence: Starting from an arbitrary global state, S is guaranteed to reach a global state satisfying P within a finite number of state transitions.

12 Fault-tolerance by self-stabilization (cont.)
- Self-stabilizing systems need not be initialized and they can recover from transient failures (adaptive DSD is self-stabilizing).
- Self-stabilization is a different approach to fault-tolerance: it is not based on countering the effects of failures but concentrates on the ability to reach a consistent state.
- Problems are how to achieve a global property with local actions and local knowledge, the lack of theory on how to design self-stabilizing algorithms, and how to guarantee timeliness.

13 Examples
- Time-Triggered Ethernet (FT Clock Synchronization)
- Self-Stabilization in the Time-Triggered Architecture
- Traffic Policing
- System Architectures

14 Fault-tolerant synchronization services are needed for establishing a robust global time base.
[Figure: network topology with Time Master, TTE, Eth, and 1588 nodes.]

15 Failure Model for High-Integrity Components: Inconsistent-Omission Faulty
[Figure: COM/MON pair around a core, with signals IN, OUT, listen_in, listen_out, and intercept.]
Assumptions:
- COM and MON fail independently.
- MON can intercept a faulty message produced by the COM.
- COM cannot produce a valid message such that this message appears as two different messages on listen_out and OUT; it may, however, be valid on listen_out but detectably faulty on OUT, or vice versa.
- MON cannot itself generate a faulty message, neither by inverting listen_out to an output nor by toggling the intercept signal.

16 Synchronization Services
- Clock Synchronization Service is executed during normal operation mode to keep the local clocks synchronized to each other.
- Startup/Restart Service is executed to reach an initial synchronization of the local clocks in the system.
- Integration/Reintegration Service is used for components to join an already synchronized system.
- Clique Detection Services are used to detect loss of synchronization and establishment of disjoint sets of synchronized components.
[Figure: computer time plotted against real time for a fast, a perfect, and a slow clock, showing the Startup/Restart Service, the message exchanges of the Clock Synchronization Service, and intervals labeled R.int.]

17 Clock Synchronization Algorithm: Algorithm Specification

18 Step 1: SMs send messages to CMs
[Figure: Synchronization Masters 1 to 5 send to the Compression Master; timeline with dispatch, permanence, precision, acceptance window (of SM 2/5), and reference points t_0, t_1, t_2, t_4, t_5.]

19 Step 2: CMs send voted clock values back
[Figure: the Compression Master sends to Synchronization Masters 1 to 5; timeline with dispatch, permanence, precision, acceptance window (of SM 2/5), and reference points t_0, t_1, t_2, t_4, t_5.]

20 Step 2: Multiple Channels/CMs
- Multiple channels/CMs are required for fault-tolerance.
- Synchronization Masters (SMs) receive synchronization messages from all non-faulty Compression Masters (CMs).
- SMs use either the median or the arithmetic mean on the redundant messages from the CMs.
[Figure: Compression Masters connected to Synchronization Masters.]
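How an SM might combine the redundant values can be sketched as follows. The function name `combine_cm_values`, the `use_median` switch, and the sample numbers are hypothetical; the slide does not say when the median and when the mean is used, so the choice is left to the caller here.

```python
from statistics import mean, median


def combine_cm_values(values, use_median=True):
    """Combine the clock values received from redundant CMs into one value."""
    if not values:
        raise ValueError("no synchronization message received")
    return median(values) if use_median else mean(values)


# Three CMs, the third one reporting a wildly wrong value.
received = [100, 102, 250]
print(combine_cm_values(received))                    # median
print(combine_cm_values(received, use_median=False))  # arithmetic mean
```

With three or more values the median is not pulled away by a single outlier, whereas the arithmetic mean is shifted by it.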

21 Step 2: Faulty CMs send to some SMs
Only one channel and one CM are shown; typically they are replicated.
[Figure: the Compression Master sends to only some of Synchronization Masters 1 to 5; timeline with dispatch, permanence, precision, acceptance window (of SM 2/5), and reference points t_0, t_1, t_2, t_4, t_5.]

22 Clock Synchronization: Two Failures Scenario
Even in the Byzantine failure case, the non-faulty clocks remain synchronized within known bounds.

23 Examples
- Time-Triggered Ethernet (FT Clock Synchronization)
- Self-Stabilization in the Time-Triggered Architecture
- Traffic Policing
- System Architectures

24 Given a fault-tolerant system and some fault affecting multiple components: the system state is inconsistent!

25 Self-Stabilization (Dijkstra 1974)
Two properties:
- Closure: If a system is in a legitimate state, it will remain in a legitimate state.
- Convergence: Starting from an arbitrary state, a system will eventually reach a legitimate state.
[Figure: legitimate and illegitimate states.]
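Closure and convergence can be observed on Dijkstra's K-state token ring, a classic self-stabilizing algorithm. The sketch below is an illustration chosen here, not something taken from the slides; the function names, the start state, and the random scheduler are assumptions. Machine 0 is privileged when its state equals that of the last machine, any other machine when its state differs from its left neighbour; a state is legitimate when exactly one machine is privileged.

```python
import random


def privileged(x):
    """Return the indices of all machines that currently hold a privilege."""
    return [i for i in range(len(x))
            if (i == 0 and x[0] == x[-1]) or (i > 0 and x[i] != x[i - 1])]


def move(x, i, k):
    """Let privileged machine i make its move (k states per machine)."""
    if i == 0:
        x[0] = (x[0] + 1) % k
    else:
        x[i] = x[i - 1]


def stabilize(x, k, rng, max_steps=10_000):
    """Schedule privileged machines at random until one privilege remains."""
    steps = 0
    while len(privileged(x)) != 1:
        move(x, rng.choice(privileged(x)), k)
        steps += 1
        if steps > max_steps:
            raise RuntimeError("did not converge")
    return steps


ring = [3, 1, 4, 1, 0]            # arbitrary, illegitimate start state
stabilize(ring, k=5, rng=random.Random(0))
print(ring, privileged(ring))     # exactly one privileged machine remains
```

Once the ring is legitimate, every further move passes the single privilege on to the next machine, which is the closure property.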

26 Self-stabilization and fault-tolerance
- Different failure detection mechanisms and failure correction mechanisms for different failures
- Using a sequence of algorithms to bring the system nearer to the safe state
Wilfried Steiner, Real-Time Systems Group

27 Time-Triggered Architecture (cont.)
- Time is split up into (not necessarily equal) slots depending on message length
- Slots are grouped into TDMA rounds
- Each node is assigned exactly one slot in the TDMA round
- This assignment is the same for every TDMA round
- Actual transmission phase < sending slot
[Figure: a TDMA round on the real-time axis.]
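Because the slot assignment repeats in every TDMA round, the owner of the slot at any point in time follows from the slot lengths alone. A minimal sketch, assuming integer time units and hypothetical names (`slot_owner`, `slot_lengths`, `owners`):

```python
from bisect import bisect_right
from itertools import accumulate


def slot_owner(t, slot_lengths, owners):
    """Return the node that owns the slot containing time t.

    slot_lengths -- length of each slot in one TDMA round (need not be equal)
    owners       -- node assigned to each slot, same order as slot_lengths
    """
    round_length = sum(slot_lengths)
    offset = t % round_length               # position inside the current round
    slot_ends = list(accumulate(slot_lengths))
    return owners[bisect_right(slot_ends, offset)]


lengths = [2, 3, 5]                         # one round lasts 10 time units
nodes = ["A", "B", "C"]
print([slot_owner(t, lengths, nodes) for t in (0, 2, 9, 12)])
```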

28 Phases of the TTP
3 phases:
- coldstart
- synchronous < 4 nodes: no guarantees for the system services; sync. messages are broadcast periodically
- synchronous >= 4 nodes (normal system operation): sync. messages are broadcast periodically
[Figure: state diagram over the phases coldstart, sync. < 4 nodes, and sync. >= 4 nodes, with a transition labeled "loss of synchronous system view". A: asynchronous node, S: synchronous node.]

29 Phases of the TTP/C (cont.)
Identifying a 4th phase of protocol execution: a sufficient number of nodes is synchronous.
[Figure: state diagram over the phases cold-start, sync. < 4 nodes, sync. >= 4 nodes, and sync. >= threshold, with a transition labeled "loss of synchronous system view". A: asynchronous node, S: synchronous node.]

30 Cliques
- Nodes that communicate with each other form a clique
- In correct operation mode there is only one clique
- Possibility of more cliques after multiple transient failures
- Two types of multiple-clique operation:
  - Benign: synchronous operation
  - Malign: asynchronous operation

31 Cliques (cont.)
Benign cliques:
- Still act slot-synchronously
- Accept and reject counters are used to determine the number of nodes in the own clique
Malign cliques:
- Multiple cliques act slot-asynchronously
- Cliques do not see each other
- Clique A sends in the IFGs of Clique B and vice versa
[Figure: slot timelines of two cliques, a) and b), showing transmission phases (TP) and inter-frame gaps (IFG), with one case marked "ok!" and one marked "malign!".]

32 Self-Stabilization and the TTA (cont.)
Detect system misbehavior using the membership vector:
- If >= (n/2)+1 nodes are set in the membership: ok!
- If < (n/2)+1 nodes are set in the membership: restart!
Bring the system into a safe state again by using the startup algorithm and reintegration of TTP.
[Figure: two plots of the number of nodes over time, each compared against the threshold.]
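The membership test amounts to a majority check on the membership vector. In the sketch below the function name `membership_ok` and the list-of-booleans representation are hypothetical, and (n/2)+1 is interpreted with integer division, which is an assumption for odd n.

```python
def membership_ok(membership):
    """Return True if at least (n/2)+1 of the n nodes are set in the membership.

    membership -- one boolean per node; True means the node is a member
    """
    n = len(membership)
    threshold = n // 2 + 1          # assumption: n/2 is integer division
    return sum(membership) >= threshold


for vector in ([True, True, True, False, False], [True, True, False, False]):
    action = "ok" if membership_ok(vector) else "restart"
    print(sum(vector), "of", len(vector), "nodes:", action)
```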

33 Examples
- Time-Triggered Ethernet (FT Clock Synchronization)
- Self-Stabilization in the Time-Triggered Architecture
- Traffic Policing
- System Architectures

34 Leaky-Bucket Traffic Policing
[Figure: rate-constrained (RC) traffic from a sender through a switch/router to a receiver, with a minimum duration between successive frames.]

35 Leaky-Bucket Traffic Policing (cont.)
- Token-bucket / leaky-bucket algorithms are implemented to control the behavior of rate-constrained traffic.
- When a faulty end point attempts to send frames too close back-to-back, the token/leaky-bucket algorithm will detect this behavior and drop the frame.
- Token/leaky-bucket algorithms may be expensive to implement, as they require tracking the timing on a per-VL (virtual link) basis.
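A token bucket that enforces a minimum gap per virtual link can be sketched as follows. The class and function names, the integer microsecond time base, and the credit accounting (one credit per microsecond, one frame costs `min_gap_us` credits) are assumptions made for this illustration, not details of a particular switch implementation. The per-VL dictionary shows where the implementation cost mentioned above comes from: every VL needs its own state.

```python
class TokenBucket:
    """Token bucket with integer credits: one credit is earned per microsecond."""

    def __init__(self, min_gap_us, burst=1):
        self.cost = min_gap_us                 # credits consumed per frame
        self.capacity = burst * min_gap_us     # upper bound on saved credits
        self.credit = self.capacity
        self.last_us = 0

    def allow(self, now_us):
        """Return True if a frame arriving at now_us may pass, else False (drop)."""
        self.credit = min(self.capacity, self.credit + (now_us - self.last_us))
        self.last_us = now_us
        if self.credit >= self.cost:
            self.credit -= self.cost
            return True
        return False


def police(frames, min_gap_us):
    """Apply one bucket per VL to a list of (vl_id, arrival_time_us) frames."""
    buckets = {}
    verdicts = []
    for vl, t in frames:
        if vl not in buckets:
            buckets[vl] = TokenBucket(min_gap_us)
        verdicts.append(buckets[vl].allow(t))
    return verdicts


frames = [(1, 0), (1, 200), (2, 250), (1, 1000), (1, 1100), (1, 2500)]
print(police(frames, min_gap_us=1000))
```

Frames on VL 1 that arrive less than 1000 microseconds after the previously accepted frame are dropped, while VL 2 is policed independently.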

36 Scheduled Traffic Policing
Transmission sequence of a TT frame:
1. Frame dispatch at t_dispatch
2. Frame start sending at t_send
3. Frame start receiving at t_receive
Temporal correctness is checked via an acceptance window test:
- t_receive = t_send + l_link (l_link: link latency)
- t_acc = t_dispatch + l_link - Pi (Pi: maximum distance of any two synchronized correct clocks in the system)
[Figure: physical topology with nodes A to H, a dataflow path and virtual link over links AD and DE; timelines for Time A, Time D, and Time E showing t_dispatch, t_send, max(d_shuffle), t_receive, the maximum clock difference Pi, the acceptance window, and t_acc.]

37 Scheduled Traffic Policing (cont.)
- Scheduled traffic policing checks the correctness of a received message with respect to a synchronized global time.
- Scheduled traffic policing enforces minimum durations as well as maximum durations between two successive messages of the same stream.
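Checking arrivals against a schedule in global time can be sketched as below. The function name `check_stream`, its parameters, and the simplification that the k-th arrival is matched against the k-th scheduled window are assumptions for illustration; how the window start and length are derived is outside this sketch.

```python
def check_stream(arrivals, first_window_start, period, window_length):
    """Check each arrival of a periodic stream against its acceptance window.

    The k-th frame is accepted only if it arrives inside the window
    [start_k, start_k + window_length], with start_k = first_window_start + k * period.
    All times are in the same (synchronized) time base.
    """
    verdicts = []
    for k, t in enumerate(arrivals):
        start = first_window_start + k * period
        verdicts.append(start <= t <= start + window_length)
    return verdicts


# The third frame is too early and the fourth is too late.
arrivals = [105, 1102, 2090, 3130]
print(check_stream(arrivals, first_window_start=100, period=1000, window_length=20))
```

Because both the early and the late frame are rejected, the check bounds the distance between successive frames from below and from above.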

38 Central Bus Guardian
- Nodes and switches are synchronized to each other with a precision Pi.
- In a cut-through setting, the TDMA slot needs to account for four times the precision (4*Pi) as a safety margin.
- Only then is it guaranteed that traffic policing in the switch does not truncate correct transmissions.
[Figure: transmission window (slot n) as seen by the guardian (write access granted), a fast node, and a slow node over physical time, showing the maximum clock difference and the actual transmission.]

39 Examples
- Time-Triggered Ethernet (FT Clock Synchronization)
- Self-Stabilization in the Time-Triggered Architecture
- Traffic Policing
- System Architectures

40 Distributed Integrated Modular Avionics (Distributed IMA)
[Figure: modules with synchronized partitions (tasks on a partition OS) on a VxWorks653 time/space partitioning OS with IMA OS core services and a message channel API. Each module stacks a TTE sync library, TTE COM layer, ARINC 653 TTEthernet API library, TTEthernet NIC driver, and TTEthernet NIC, carrying TT, rate-constrained, and best-effort traffic over message channels (sampling/queueing ports). A non-critical subsystem contains modules running VxWorks 6.8, RT-Linux, or Windows with Ethernet NICs (ARINC664).]
TTEthernet (SAE AS6802):
- Synchronous (TDM) & asynchronous communication
- Distributed fault-tolerant timebase
- Robust TDMA partitioning

41 Distributed Integrated Modular Avionics (Distributed IMA) (cont.)
- The network is a distributed fault-tolerant embedded computer and executes a set of partitions
- Synchronous TDMA communication for Ethernet allows integration of low-latency, low-jitter VLs in complex networks
- System-level partitioning closes the gap between federated and integrated architectures
- The partitions can be mutually aligned and synchronized to system time

42 Automotive Integrated Safety Platform
zfas, co-developed with TTTech, enables Audi to integrate a variety of innovative functions with multiple safety criticality levels. zfas uses numerous technology components from TTTech. For example, the individual CPU cores are connected based on Deterministic Ethernet communication.

43 Automotive Integrated Safety Platform (cont.)

44 Automotive Integrated Safety Platform (cont.)
- Combines high-performance computing with functional safety
- Highly efficient deterministic SW integration
- Applications can be moved between embedded cores
- Supports various SoCs (System-on-a-Chip) and operating systems
- Accelerated development due to PC-based co-simulation

45 Fog Computing for the Dependable Internet of Things
A New Infrastructure Layer
[Figure: a layer of fog nodes.]

46 Fog Computing for the Dependable Internet of Things (cont.)
Fog computing is an architecture approach that provides non-functional knowledge to enable dependability in the IoT.
[Figure: Thing, Fog Node, Determinism.]

47 Fog Computing for the Dependable Internet of Things (cont.)
[Figure: Multiple Components; Modularized & Cross-Industry.]

48 Fog Computing for the Dependable Internet of Things (cont.)
- Moore's Law alive and well
- Heterogeneous multi-core
- Hardware-supported virtualization at chip level possible
[Figure: virtual machines for safety apps (safe OS), control apps (RTOS), IT/media apps (rich OS), and security apps (secure OS) on a thin software hypervisor with hardware hypervisor support, running on Core 0 and Core 1.]

49 Fog Node on Wheels


More information

CSE 5306 Distributed Systems. Fault Tolerance

CSE 5306 Distributed Systems. Fault Tolerance CSE 5306 Distributed Systems Fault Tolerance 1 Failure in Distributed Systems Partial failure happens when one component of a distributed system fails often leaves other components unaffected A failure

More information

Consensus Problem. Pradipta De

Consensus Problem. Pradipta De Consensus Problem Slides are based on the book chapter from Distributed Computing: Principles, Paradigms and Algorithms (Chapter 14) by Kshemkalyani and Singhal Pradipta De pradipta.de@sunykorea.ac.kr

More information

Dep. Systems Requirements

Dep. Systems Requirements Dependable Systems Dep. Systems Requirements Availability the system is ready to be used immediately. A(t) = probability system is available for use at time t MTTF/(MTTF+MTTR) If MTTR can be kept small

More information

Providing Real-Time and Fault Tolerance for CORBA Applications

Providing Real-Time and Fault Tolerance for CORBA Applications Providing Real-Time and Tolerance for CORBA Applications Priya Narasimhan Assistant Professor of ECE and CS University Pittsburgh, PA 15213-3890 Sponsored in part by the CMU-NASA High Dependability Computing

More information

Dependability and real-time. TDDD07 Real-time Systems. Where to start? Two lectures. June 16, Lecture 8

Dependability and real-time. TDDD07 Real-time Systems. Where to start? Two lectures. June 16, Lecture 8 TDDD7 Real-time Systems Lecture 7 Dependability & Fault tolerance Simin Nadjm-Tehrani Real-time Systems Laboratory Department of Computer and Information Science Dependability and real-time If a system

More information

Real-Time (Paradigms) (47)

Real-Time (Paradigms) (47) Real-Time (Paradigms) (47) Memory: Memory Access Protocols Tasks competing for exclusive memory access (critical sections, semaphores) become interdependent, a common phenomenon especially in distributed

More information

Atacama: An Open Experimental Platform for Mixed-Criticality Networking on Top of Ethernet

Atacama: An Open Experimental Platform for Mixed-Criticality Networking on Top of Ethernet Atacama: An Open Experimental Platform for Mixed-Criticality Networking on Top of Ethernet Gonzalo Carvajal 1,2 and Sebastian Fischmeister 1 1 University of Waterloo, ON, Canada 2 Universidad de Concepcion,

More information

A Byzantine Fault-Tolerant Key-Value Store for Safety-Critical Distributed Real-Time Systems

A Byzantine Fault-Tolerant Key-Value Store for Safety-Critical Distributed Real-Time Systems Work in progress A Byzantine Fault-Tolerant Key-Value Store for Safety-Critical Distributed Real-Time Systems December 5, 2017 CERTS 2017 Malte Appel, Arpan Gujarati and Björn B. Brandenburg Distributed

More information

Byzantine Failures. Nikola Knezevic. knl

Byzantine Failures. Nikola Knezevic. knl Byzantine Failures Nikola Knezevic knl Different Types of Failures Crash / Fail-stop Send Omissions Receive Omissions General Omission Arbitrary failures, authenticated messages Arbitrary failures Arbitrary

More information

Specifying and Proving Broadcast Properties with TLA

Specifying and Proving Broadcast Properties with TLA Specifying and Proving Broadcast Properties with TLA William Hipschman Department of Computer Science The University of North Carolina at Chapel Hill Abstract Although group communication is vitally important

More information

SPIDER: A Fault-Tolerant Bus Architecture

SPIDER: A Fault-Tolerant Bus Architecture Formal Methods Group NASA Langley Research Center lee.s.pike@nasa.gov May 11, 2005 Motivation Safety-critical distributed x-by-wire applications are being deployed in inhospitable environments. Failure

More information

Commercial Real-time Operating Systems An Introduction. Swaminathan Sivasubramanian Dependable Computing & Networking Laboratory

Commercial Real-time Operating Systems An Introduction. Swaminathan Sivasubramanian Dependable Computing & Networking Laboratory Commercial Real-time Operating Systems An Introduction Swaminathan Sivasubramanian Dependable Computing & Networking Laboratory swamis@iastate.edu Outline Introduction RTOS Issues and functionalities LynxOS

More information

Priya Narasimhan. Assistant Professor of ECE and CS Carnegie Mellon University Pittsburgh, PA

Priya Narasimhan. Assistant Professor of ECE and CS Carnegie Mellon University Pittsburgh, PA OMG Real-Time and Distributed Object Computing Workshop, July 2002, Arlington, VA Providing Real-Time and Fault Tolerance for CORBA Applications Priya Narasimhan Assistant Professor of ECE and CS Carnegie

More information

Fault Tolerance. Distributed Systems. September 2002

Fault Tolerance. Distributed Systems. September 2002 Fault Tolerance Distributed Systems September 2002 Basics A component provides services to clients. To provide services, the component may require the services from other components a component may depend

More information

Failure Models. Fault Tolerance. Failure Masking by Redundancy. Agreement in Faulty Systems

Failure Models. Fault Tolerance. Failure Masking by Redundancy. Agreement in Faulty Systems Fault Tolerance Fault cause of an error that might lead to failure; could be transient, intermittent, or permanent Fault tolerance a system can provide its services even in the presence of faults Requirements

More information

Diagnosis in the Time-Triggered Architecture

Diagnosis in the Time-Triggered Architecture TU Wien 1 Diagnosis in the Time-Triggered Architecture H. Kopetz June 2010 Embedded Systems 2 An Embedded System is a Cyber-Physical System (CPS) that consists of two subsystems: A physical subsystem the

More information

Distributed Systems Fault Tolerance

Distributed Systems Fault Tolerance Distributed Systems Fault Tolerance [] Fault Tolerance. Basic concepts - terminology. Process resilience groups and failure masking 3. Reliable communication reliable client-server communication reliable

More information

Distributed Systems 24. Fault Tolerance

Distributed Systems 24. Fault Tolerance Distributed Systems 24. Fault Tolerance Paul Krzyzanowski pxk@cs.rutgers.edu 1 Faults Deviation from expected behavior Due to a variety of factors: Hardware failure Software bugs Operator errors Network

More information

Failure Tolerance. Distributed Systems Santa Clara University

Failure Tolerance. Distributed Systems Santa Clara University Failure Tolerance Distributed Systems Santa Clara University Distributed Checkpointing Distributed Checkpointing Capture the global state of a distributed system Chandy and Lamport: Distributed snapshot

More information

ARTIST-Relevant Research from Linköping

ARTIST-Relevant Research from Linköping ARTIST-Relevant Research from Linköping Department of Computer and Information Science (IDA) Linköping University http://www.ida.liu.se/~eslab/ 1 Outline Communication-Intensive Real-Time Systems Timing

More information

CprE Fault Tolerance. Dr. Yong Guan. Department of Electrical and Computer Engineering & Information Assurance Center Iowa State University

CprE Fault Tolerance. Dr. Yong Guan. Department of Electrical and Computer Engineering & Information Assurance Center Iowa State University Fault Tolerance Dr. Yong Guan Department of Electrical and Computer Engineering & Information Assurance Center Iowa State University Outline for Today s Talk Basic Concepts Process Resilience Reliable

More information

Process groups and message ordering

Process groups and message ordering Process groups and message ordering If processes belong to groups, certain algorithms can be used that depend on group properties membership create ( name ), kill ( name ) join ( name, process ), leave

More information

Replicated State Machine in Wide-area Networks

Replicated State Machine in Wide-area Networks Replicated State Machine in Wide-area Networks Yanhua Mao CSE223A WI09 1 Building replicated state machine with consensus General approach to replicate stateful deterministic services Provide strong consistency

More information

Avnu Alliance Introduction

Avnu Alliance Introduction Avnu Alliance Introduction Announcing a Liaison between Edge Computing Consortium and Avnu Alliance + What is Avnu Alliance? Creating a certified ecosystem to bring precise timing, reliability and compatibility

More information

SMT-Based Formal Verification of a TTEthernet Synchronization Function

SMT-Based Formal Verification of a TTEthernet Synchronization Function SMT-Based Formal Verification of a TTEthernet Synchronization Function Wilfried Steiner 1 and Bruno Dutertre 2 1 TTTech Computertechnik AG, Chip IP Design A-1040 Vienna, Austria wilfried.steiner@tttech.com

More information

Concepts. Techniques for masking faults. Failure Masking by Redundancy. CIS 505: Software Systems Lecture Note on Consensus

Concepts. Techniques for masking faults. Failure Masking by Redundancy. CIS 505: Software Systems Lecture Note on Consensus CIS 505: Software Systems Lecture Note on Consensus Insup Lee Department of Computer and Information Science University of Pennsylvania CIS 505, Spring 2007 Concepts Dependability o Availability ready

More information

Practical Byzantine Fault Tolerance. Miguel Castro and Barbara Liskov

Practical Byzantine Fault Tolerance. Miguel Castro and Barbara Liskov Practical Byzantine Fault Tolerance Miguel Castro and Barbara Liskov Outline 1. Introduction to Byzantine Fault Tolerance Problem 2. PBFT Algorithm a. Models and overview b. Three-phase protocol c. View-change

More information

Drive-by-Data & Integrated Modular Platform

Drive-by-Data & Integrated Modular Platform Drive-by-Data & Integrated Modular Platform Gernot Hans, Bombardier Transportation Mirko Jakovljevic, TTTech Computertechnik AG CONNECTA has received funding from the European Union s Horizon 2020 research

More information

Dependability and real-time. TDDD07 Real-time Systems Lecture 7: Dependability & Fault tolerance. Early computer systems. Dependable systems

Dependability and real-time. TDDD07 Real-time Systems Lecture 7: Dependability & Fault tolerance. Early computer systems. Dependable systems TDDD7 Real-time Systems Lecture 7: Dependability & Fault tolerance Simin i Nadjm-Tehrani Real-time Systems Laboratory Department of Computer and Information Science Linköping university Dependability and

More information

2017 Paul Krzyzanowski 1

2017 Paul Krzyzanowski 1 Question 1 What problem can arise with a system that exhibits fail-restart behavior? Distributed Systems 06. Exam 1 Review Stale state: the system has an outdated view of the world when it starts up. Not:

More information

Self-Stabilizing Byzantine Digital Clock Synchronization

Self-Stabilizing Byzantine Digital Clock Synchronization Self-Stabilizing Byzantine Digital Clock Synchronization Ezra N. Hoch, Danny Dolev, Ariel Daliot School of Engineering and Computer Science, The Hebrew University of Jerusalem, Israel Problem Statement

More information

Consensus. Chapter Two Friends. 2.3 Impossibility of Consensus. 2.2 Consensus 16 CHAPTER 2. CONSENSUS

Consensus. Chapter Two Friends. 2.3 Impossibility of Consensus. 2.2 Consensus 16 CHAPTER 2. CONSENSUS 16 CHAPTER 2. CONSENSUS Agreement All correct nodes decide for the same value. Termination All correct nodes terminate in finite time. Validity The decision value must be the input value of a node. Chapter

More information

Various Emerging Time- Triggered Protocols for Driveby-Wire

Various Emerging Time- Triggered Protocols for Driveby-Wire Various Emerging Time- Triggered for Driveby-Wire Applications Syed Masud Mahmud, Ph.D. Electrical and Computer Engg. Dept. Wayne State University Detroit MI 48202 smahmud@eng.wayne.edu January 11, 2007

More information

3. Quality of Service

3. Quality of Service 3. Quality of Service Usage Applications Learning & Teaching Design User Interfaces Services Content Process ing Security... Documents Synchronization Group Communi cations Systems Databases Programming

More information

Fault Tolerance. Basic Concepts

Fault Tolerance. Basic Concepts COP 6611 Advanced Operating System Fault Tolerance Chi Zhang czhang@cs.fiu.edu Dependability Includes Availability Run time / total time Basic Concepts Reliability The length of uninterrupted run time

More information

DISTRIBUTED REAL-TIME SYSTEMS

DISTRIBUTED REAL-TIME SYSTEMS Distributed Systems Fö 11/12-1 Distributed Systems Fö 11/12-2 DISTRIBUTED REAL-TIME SYSTEMS What is a Real-Time System? 1. What is a Real-Time System? 2. Distributed Real Time Systems 3. Predictability

More information

TU Wien. Fault Isolation and Error Containment in the TT-SoC. H. Kopetz. TU Wien. July 2007

TU Wien. Fault Isolation and Error Containment in the TT-SoC. H. Kopetz. TU Wien. July 2007 TU Wien 1 Fault Isolation and Error Containment in the TT-SoC H. Kopetz TU Wien July 2007 This is joint work with C. El.Salloum, B.Huber and R.Obermaisser Outline 2 Introduction The Concept of a Distributed

More information

Capacity of Byzantine Agreement: Complete Characterization of Four-Node Networks

Capacity of Byzantine Agreement: Complete Characterization of Four-Node Networks Capacity of Byzantine Agreement: Complete Characterization of Four-Node Networks Guanfeng Liang and Nitin Vaidya Department of Electrical and Computer Engineering, and Coordinated Science Laboratory University

More information

Clock Synchronization. Synchronization. Clock Synchronization Algorithms. Physical Clock Synchronization. Tanenbaum Chapter 6 plus additional papers

Clock Synchronization. Synchronization. Clock Synchronization Algorithms. Physical Clock Synchronization. Tanenbaum Chapter 6 plus additional papers Clock Synchronization Synchronization Tanenbaum Chapter 6 plus additional papers Fig 6-1. In a distributed system, each machine has its own clock. When this is the case, an event that occurred after another

More information

Fault Tolerance Part II. CS403/534 Distributed Systems Erkay Savas Sabanci University

Fault Tolerance Part II. CS403/534 Distributed Systems Erkay Savas Sabanci University Fault Tolerance Part II CS403/534 Distributed Systems Erkay Savas Sabanci University 1 Reliable Group Communication Reliable multicasting: A message that is sent to a process group should be delivered

More information

02 - Distributed Systems

02 - Distributed Systems 02 - Distributed Systems Definition Coulouris 1 (Dis)advantages Coulouris 2 Challenges Saltzer_84.pdf Models Physical Architectural Fundamental 2/58 Definition Distributed Systems Distributed System is

More information

Recall our 2PC commit problem. Recall our 2PC commit problem. Doing failover correctly isn t easy. Consensus I. FLP Impossibility, Paxos

Recall our 2PC commit problem. Recall our 2PC commit problem. Doing failover correctly isn t easy. Consensus I. FLP Impossibility, Paxos Consensus I Recall our 2PC commit problem FLP Impossibility, Paxos Client C 1 C à TC: go! COS 418: Distributed Systems Lecture 7 Michael Freedman Bank A B 2 TC à A, B: prepare! 3 A, B à P: yes or no 4

More information

02 - Distributed Systems

02 - Distributed Systems 02 - Distributed Systems Definition Coulouris 1 (Dis)advantages Coulouris 2 Challenges Saltzer_84.pdf Models Physical Architectural Fundamental 2/60 Definition Distributed Systems Distributed System is

More information

Distributed Systems. coordination Johan Montelius ID2201. Distributed Systems ID2201

Distributed Systems. coordination Johan Montelius ID2201. Distributed Systems ID2201 Distributed Systems ID2201 coordination Johan Montelius 1 Coordination Coordinating several threads in one node is a problem, coordination in a network is of course worse: failure of nodes and networks

More information

Initial Assumptions. Modern Distributed Computing. Network Topology. Initial Input

Initial Assumptions. Modern Distributed Computing. Network Topology. Initial Input Initial Assumptions Modern Distributed Computing Theory and Applications Ioannis Chatzigiannakis Sapienza University of Rome Lecture 4 Tuesday, March 6, 03 Exercises correspond to problems studied during

More information

A CAN-Based Architecture for Highly Reliable Communication Systems

A CAN-Based Architecture for Highly Reliable Communication Systems A CAN-Based Architecture for Highly Reliable Communication Systems H. Hilmer Prof. Dr.-Ing. H.-D. Kochs Gerhard-Mercator-Universität Duisburg, Germany E. Dittmar ABB Network Control and Protection, Ladenburg,

More information

Distributed Systems Principles and Paradigms. Chapter 08: Fault Tolerance

Distributed Systems Principles and Paradigms. Chapter 08: Fault Tolerance Distributed Systems Principles and Paradigms Maarten van Steen VU Amsterdam, Dept. Computer Science Room R4.20, steen@cs.vu.nl Chapter 08: Fault Tolerance Version: December 2, 2010 2 / 65 Contents Chapter

More information

MENCIUS: BUILDING EFFICIENT

MENCIUS: BUILDING EFFICIENT MENCIUS: BUILDING EFFICIENT STATE MACHINE FOR WANS By: Yanhua Mao Flavio P. Junqueira Keith Marzullo Fabian Fuxa, Chun-Yu Hsiung November 14, 2018 AGENDA 1. Motivation 2. Breakthrough 3. Rules of Mencius

More information