Discussion of Failure Mode Assumptions for IEEE 802.1Qbt
1 Discussion of Failure Mode Assumptions for IEEE 802.1Qbt. Wilfried Steiner, Corporate Scientist
2 Clock synchronization is a core building block of many real-time (RT) systems. [Figure: Ethernet network with an IEEE 1588 grandmaster clock.] The local clocks in a distributed system can be accurately synchronized to each other.
3 Basic Questions in Fault-Tolerant Clock Synchronization. [Figure: Ethernet network with a grandmaster clock.] Loss of the grandmaster clock requires a changeover:
- How long does the changeover take?
- Is the changeover fault-tolerant?
- Is malicious failure behavior of the grandmaster clock tolerated?
4 Fault-Tolerance through Redundancy. Situation: What is the color of the house? [Figure: panels labeled No Failure, Fail-Silence Failure, and Fail-Consistent Failure, showing observer answers Green, Don't Know, and Red.]
5 Failure Mode: Fail-Silence. When the current grandmaster clock fails, gPTP ensures that another clock becomes the new grandmaster, provided such a clock exists in the system (which we assume in the following). This means that there is some fail-over time after which the system runs stably again, synchronized and syntonized to the new grandmaster clock. The fail-silence failure mode is tolerated when the original grandmaster clock fails permanently.
6 Failure Mode: Fail-Silence. What happens when the original grandmaster clock fails transiently or intermittently, e.g., when it periodically reboots? Will the network oscillate between the original and a secondary grandmaster clock?
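The oscillation question can be made concrete with a toy model. The sketch below is not the gPTP best master clock algorithm; it assumes a simplified rule in which, at every time step, the alive clock with the lowest priority value is grandmaster, and it ignores fail-over time. The names `select_grandmaster`, `count_changeovers`, `GM_A`, and `GM_B` are hypothetical.

```python
def select_grandmaster(priorities, alive):
    """Return the alive clock with the lowest priority value, or None."""
    candidates = [name for name in priorities if name in alive]
    return min(candidates, key=priorities.get) if candidates else None


def count_changeovers(priorities, alive_per_step):
    """Count how often the selected grandmaster changes over the trace."""
    changes, current = 0, None
    for alive in alive_per_step:
        chosen = select_grandmaster(priorities, alive)
        if current is not None and chosen != current:
            changes += 1
        current = chosen
    return changes


# Assumed setup: GM_A is preferred over GM_B (lower value wins).
priorities = {"GM_A": 1, "GM_B": 2}

# Permanent fail-silence: GM_A disappears after the first step.
permanent = [{"GM_A", "GM_B"}] + [{"GM_B"}] * 5

# Intermittent fail-silence: GM_A reboots every other step.
intermittent = [{"GM_A", "GM_B"}, {"GM_B"}] * 3

permanent_changes = count_changeovers(priorities, permanent)
intermittent_changes = count_changeovers(priorities, intermittent)
```

Under these assumptions the permanent failure causes a single changeover, while the rebooting grandmaster forces a changeover at every step, which is the oscillation the slide asks about.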
7 Model-Based Development (i). Development of fault-tolerant clock synchronization algorithms is non-trivial:
- the synchronization proof is hard for certain failure modes
- completeness has to be proven as well, i.e., we need to prove that we have covered all possible failure scenarios
Therefore, formal methods are used in the development and verification of such algorithms. Theorem proving is the process of developing a deductive proof, typically interactively with a proof assistant. Model checking is an automated approach.
8 Model-Based Development (ii). [Figure: a state-machine model (states INIT, LISTEN, COLD START, SILENCE, STARTUP, Protected STARTUP, Tentative ROUND, ACTIVE) for, e.g., IEEE 802.1ASbt, together with a failure mode (e.g., fail-silence) and a property (e.g., "system will sync"), is given to a model checker, which answers "ok" or "no, because ...".]
9 Example: SAE AS6802. First Byzantine fault-tolerant clock synchronization algorithm verified by model checking only. The basic algorithm addresses only synchronization of the clocks. An extension for syntonization (we call it clock-rate correction) has been modeled and studied as well.
10 Fault-Tolerant Clock Synchronization. [Figure: Ethernet network with several IEEE 1588 grandmaster clocks.] Fault-tolerant synchronization services are needed for establishing a safe and highly available synchronized time.
11 SAE AS6802 Clock Synchronization Algorithm: Algorithm Specification (the case of five SMs is updated in the standard).
12 Byzantine Failure Tolerance. A Byzantine failure occurs as the combination of a fail-arbitrary synchronization master (SM, an end station) and an inconsistent-omission faulty compression master (a bridge).
13 Rate-Correction with Stable Clock Drifts. [Figure: timeline with the steps "store 1st state-correction term", "store 2nd state-correction term", and "calculate and apply rate-correction term".]
14 Rate-Correction with Unstable Clock Drifts. Coincidentally, the speed of the oscillator also changes. [Figure: timeline with the steps "store 1st state-correction term", "store 2nd state-correction term", and "calculate and apply rate-correction term".]
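The difference between the stable and the unstable case can be illustrated numerically. This is a minimal sketch and not the AS6802 rate-correction procedure: it assumes that each stored state-correction term equals the offset accumulated over one resynchronization interval, and that the rate-correction term is the average of the stored terms divided by the interval. The interval length, the drift values, and all function names are hypothetical.

```python
def rate_correction(state_corrections, interval):
    """Estimate the clock-rate error from stored state-correction terms."""
    return sum(state_corrections) / (len(state_corrections) * interval)


def offset_after_interval(drift, rate_corr, interval):
    """Offset accumulated over one interval once rate_corr is applied."""
    return (drift - rate_corr) * interval


INTERVAL = 1.0   # resynchronization interval in seconds (assumed)
drift = 50e-6    # local oscillator runs 50 ppm fast (assumed)

# Two rounds without rate correction: each state-correction term is the
# offset that the clock accumulated during that round.
stored = [offset_after_interval(drift, 0.0, INTERVAL) for _ in range(2)]
corr = rate_correction(stored, INTERVAL)

# Stable drift: the oscillator still runs 50 ppm fast when corr is applied.
stable_residual = offset_after_interval(drift, corr, INTERVAL)

# Unstable drift: the oscillator speed changes by 20 ppm after measuring.
unstable_residual = offset_after_interval(drift + 20e-6, corr, INTERVAL)
```

With a stable drift the residual offset per interval vanishes in this model. If the oscillator speed changes after the terms were stored, the correction is stale and the full change in drift shows up as residual offset.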
15 What are the failure modes of IEEE 802.1ASbt?
- Permanent fail-silence?
- Transient/intermittent fail-silence?
- Fail-consistent faulty? E.g., a grandmaster providing faulty time.
- Inconsistent faulty bridges? E.g., a bridge forwarding time information only on some ports.
- Byzantine faulty grandmaster clocks?
16 Wilfried Steiner, Corporate Scientist
17 Backup
18 Static vs. Dynamic Systems. Situation: What is the color of the house? Static situation: one truth. Situation: What is the color of the ball? Dynamic situation: more than one truth.
19 Origins: Byzantine Failures. A distributed system that measures the temperature of a vessel shall raise an alarm when the temperature exceeds a certain threshold. The system shall tolerate the arbitrary failure of one node. How many nodes are required? How many messages are required? [Figure: three nodes N1 (faulty), N2, and N3 exchange HOT/COLD temperature readings. Example vote: N1: COLD, N2: HOT, N3: COLD, result COLD.] In general, three nodes are insufficient to tolerate the arbitrary failure of a single node. The two correct nodes are not always able to agree on a value. A substantial body of scientific literature addresses this problem of dependable systems, in particular dependable communication.
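The three-node scenario can be replayed in a few lines. The sketch assumes, as one reading of the figure, that the correct nodes N2 and N3 measure HOT and COLD respectively (the temperature is close to the threshold), that the faulty node N1 tells N2 "HOT" and N3 "COLD", and that every node decides by simple majority over the three readings it holds. The function and variable names are hypothetical.

```python
from collections import Counter


def majority(values):
    """Return the most common value among the readings a node holds."""
    return Counter(values).most_common(1)[0][0]


# Own readings of the two correct nodes (assumed from the slide).
n2_own, n3_own = "HOT", "COLD"

# The faulty node N1 sends different values to different receivers.
n1_to_n2, n1_to_n3 = "HOT", "COLD"

n2_view = [n1_to_n2, n2_own, n3_own]  # what N2 votes on
n3_view = [n1_to_n3, n2_own, n3_own]  # what N3 votes on

n2_decision = majority(n2_view)
n3_decision = majority(n3_view)
```

Here N2 decides HOT while N3 decides COLD, so the two correct nodes disagree even though each voted correctly on what it received. The classic result for unauthenticated messages is that 3f+1 nodes are needed to tolerate f arbitrarily faulty nodes.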
20 Byzantine Clocks. In a distributed system in which all nodes are equipped with local clocks, all clocks shall become and remain synchronized. The system shall tolerate the arbitrary failure of one node. How many nodes are required? How many messages are required? [Figure: fast, perfect, and slow clocks plotted against real time, with resynchronization interval R.int. Three nodes N1 (faulty), N2, and N3 exchange clock values. One vote: N1: 00:01, N2: 00:01, N3: 00:04, result 00:01. Another vote: N1: 00:04, N2: 00:01, N3: 00:04, result 00:04.] In general, three nodes are insufficient to tolerate the arbitrary failure of a single node. The two correct nodes are not always able to bring their clocks into close agreement. A substantial body of scientific literature addresses this problem of fault-tolerant clock synchronization.
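The two votes in the figure can be reproduced with a median-based convergence step. This is a sketch under assumptions: the slide does not name the voting function, so the median is used here as a stand-in that yields the same results, and the clock values 00:01 and 00:04 are represented as the integers 1 and 4 in whatever unit the slide intends. Variable names are hypothetical.

```python
from statistics import median

# Clock readings of the two correct nodes (values from the slide).
n2_clock, n3_clock = 1, 4

# The faulty node N1 reports each receiver's own value back to it.
n1_to_n2, n1_to_n3 = 1, 4

# Each correct node sets its clock to the median of the three readings.
n2_new = median([n1_to_n2, n2_clock, n3_clock])
n3_new = median([n1_to_n3, n2_clock, n3_clock])

gap_before = abs(n3_clock - n2_clock)
gap_after = abs(n3_new - n2_new)
```

N2 stays at 1 and N3 stays at 4, so the gap between the correct clocks is unchanged after the vote. The faulty node has prevented the correct clocks from converging.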