Discussion of Failure Mode Assumptions for IEEE 802.1Qbt



1 Discussion of Failure Mode Assumptions for IEEE 802.1Qbt
Wilfried Steiner, Corporate Scientist


2 Clock Synchronization is a core building block of many RT Systems
The local clocks in a distributed system can be accurately synchronized to each other.
[Figure: Ethernet (Eth) network with an IEEE 1588 Grand Master]

3 Basic Questions in Fault-Tolerant Clock Synchronization
Loss of the Grand Master clock requires a changeover:
- How long does the changeover take?
- Is the changeover fault-tolerant?
- Is a malicious failure behavior of the Grand Master clock tolerated?
[Figure: Ethernet (Eth) network with a Grand Master]

4 Fault-Tolerance through Redundancy
Situation: What is the color of the house?
[Figure: three redundancy scenarios (No Failure, Fail-Silence Failure, Fail-Consistent Failure) with observer answers "Green", "Don't Know", and "Red"]
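The house-color scenarios can be sketched as a small vote over redundant observers. This is an illustration only: the `vote` function, the use of `None` for a silent observer, and the strict-majority rule are assumptions made here, not something the slide specifies.

```python
from collections import Counter
from typing import Optional, Sequence


def vote(answers: Sequence[Optional[str]]) -> Optional[str]:
    """Return the strict-majority answer among observers that replied.

    A silent observer is modeled as None (assumption for this sketch).
    Returns None when no strict majority exists.
    """
    heard = [a for a in answers if a is not None]
    if not heard:
        return None
    value, count = Counter(heard).most_common(1)[0]
    return value if count * 2 > len(heard) else None


no_failure = vote(["Green", "Green", "Green"])
fail_silence = vote(["Green", None, "Green"])        # one observer says nothing
fail_consistent = vote(["Red", "Green", "Green"])    # one observer is wrong for everyone
two_observers_one_wrong = vote(["Red", "Green"])     # no majority, cannot decide
```

In this sketch a silent observer simply drops out of the vote, whereas a consistently wrong observer has to be outvoted, which is why the last case with only two observers cannot be decided.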

5 Failure Mode: Fail-Silence
- When the current grandmaster clock fails, gPTP ensures that another clock becomes the new grandmaster, provided such a clock exists in the system, which we assume in the following.
- This means that there is some fail-over time after which the system runs stably again, synchronized and syntonized to the new grandmaster clock.
- The fail-silence failure mode is tolerated when the original grandmaster clock fails permanently.

6 Failure Mode: Fail-Silence
- What happens when the original grandmaster clock fails transiently or intermittently, e.g., when it periodically reboots?
- Will the network oscillate between the original and a secondary grandmaster clock?
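The oscillation question can be made concrete with a toy simulation. It is not the gPTP best master clock algorithm: the selection rule (always follow the best-priority clock that is currently up), the clock names `GM_A` and `GM_B`, and the up/down timing are all hypothetical and ignore announce timeouts and other protocol details.

```python
def select_grandmaster(alive, priority):
    """Pick the alive clock with the lowest priority value, or None."""
    candidates = [name for name in priority if alive[name]]
    return min(candidates, key=lambda name: priority[name]) if candidates else None


def simulate(steps, up_time, down_time):
    """GM_A repeatedly runs for up_time steps and then reboots for down_time steps."""
    priority = {"GM_A": 1, "GM_B": 2}  # hypothetical priorities, lower is better
    history = []
    for t in range(steps):
        phase = t % (up_time + down_time)
        alive = {"GM_A": phase < up_time, "GM_B": True}
        history.append(select_grandmaster(alive, priority))
    return history


def count_changeovers(history):
    """Number of steps at which the selected grandmaster differs from the previous one."""
    return sum(1 for prev, cur in zip(history, history[1:]) if prev != cur)


history = simulate(steps=10, up_time=3, down_time=2)
changeovers = count_changeovers(history)
```

Under this naive rule the selected grandmaster flips every time `GM_A` goes down or comes back. Whether a real gPTP network behaves this way depends on its timeouts and is exactly the open question on the slide.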

7 Model-Based Development i
Development of fault-tolerant clock synchronization algorithms is non-trivial:
- The synchronization proof is hard for certain failure modes.
- Completeness has to be proven as well, i.e., we need to prove that we have covered all possible failure scenarios.
Therefore, formal methods are used in the development and in the verification of such algorithms.
- Theorem proving is the process of developing a deductive proof, typically interactively with a proof assistant.
- Model checking is an automated approach.
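To give a feel for what "automated" means here, the following is a minimal sketch of explicit-state model checking: it enumerates every reachable state of a toy model and returns a counterexample path if an invariant is violated. The toy model (one grandmaster that may fail silently and one follower that then switches to a backup) and all names are assumptions for illustration; real model checkers and real protocol models are far richer.

```python
from collections import deque


def check_invariant(initial, successors, invariant):
    """Breadth-first search over reachable states.

    Returns None if the invariant holds everywhere, otherwise the shortest
    path from the initial state to a violating state.
    """
    parent = {initial: None}
    queue = deque([initial])
    while queue:
        state = queue.popleft()
        if not invariant(state):
            path = []
            while state is not None:
                path.append(state)
                state = parent[state]
            return path[::-1]
        for nxt in successors(state):
            if nxt not in parent:
                parent[nxt] = state
                queue.append(nxt)
    return None


def successors(state):
    """Toy model: state is (grandmaster_alive, source_followed)."""
    gm_alive, source = state
    if gm_alive:
        yield (False, source)        # fail-silence fault of the grandmaster
    if not gm_alive and source == "GM":
        yield (False, "BACKUP")      # follower changes over to the backup


initial = (True, "GM")

# Holds in every reachable state: the follower always has some source.
ok_result = check_invariant(initial, successors, lambda s: s[1] in ("GM", "BACKUP"))

# Violated during the changeover window: the follower briefly tracks a dead grandmaster.
counterexample = check_invariant(initial, successors,
                                 lambda s: s[0] or s[1] != "GM")
```

The first check corresponds to an "ok" verdict, the second to a "no, because ..." verdict with the offending trace attached.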

8 Model-Based Development ii
[Figure: model-checking workflow. Inputs: a protocol state machine (e.g., IEEE 802.1ASbt; states shown include INIT, LISTEN, COLD START, ACTIVE, SILENCE, STARTUP, Protected STARTUP, Tentative ROUND), a failure hypothesis (e.g., fail-silence), and a property (e.g., the system will sync). Output of the Model Checker: "ok" or "no, because ...".]

9 Example: SAE AS6802
- First Byzantine fault-tolerant clock synchronization algorithm verified by model checking only.
- The basic algorithm addresses only synchronization of the clocks.
- An extension for syntonization (we call it clock-rate correction) has been modeled and studied as well.

10 Fault-Tolerant Clock Synchronization
Fault-tolerant synchronization services are needed for establishing a safe and highly available synchronized time.
[Figure: Ethernet (Eth) network with multiple Grand Masters and IEEE 1588]

11 SAE AS6802 Clock Synchronization Algorithm (the case of five SM is updated in the standard)
Algorithm Specification

12 Byzantine Failure Tolerance
A Byzantine failure occurs as the combination of a fail-arbitrary synchronization master (end station) and an inconsistent-omission faulty compression master (bridge).

13 Rate-Correction with Stable Clock Drifts
[Figure: timeline with the steps "Store 1st state-correction term", "Store 2nd state-correction term", and "Calculate and apply rate-correction term"]

14 Rate-Correction with Unstable Clock Drifts
[Figure: timeline with the steps "Store 1st state-correction term", "Store 2nd state-correction term", and "Calculate and apply rate-correction term"; coincidentally, the speed of the oscillator also changes]
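The difference between stable and unstable drift can be illustrated numerically. The sketch below assumes, which the slides do not state, that the rate-correction term is simply the average of the two stored state-correction terms divided by the resynchronization interval. The function names, the 1 s interval, and the ppm values are hypothetical.

```python
def rate_correction_ppm(first_us, second_us, interval_s):
    """Average two stored state corrections (microseconds) into a rate term.

    Microseconds per second is the same unit as parts per million (ppm).
    """
    return (first_us + second_us) / (2.0 * interval_s)


def residual_offset_us(true_drift_ppm, correction_ppm, interval_s):
    """Offset that still accumulates over one interval after rate correction."""
    return (true_drift_ppm - correction_ppm) * interval_s


INTERVAL_S = 1.0  # hypothetical resynchronization interval

# Stable drift: the oscillator is 50 ppm fast while both terms are stored
# and stays at 50 ppm afterwards.
term = rate_correction_ppm(50.0, 50.0, INTERVAL_S)
stable_residual = residual_offset_us(50.0, term, INTERVAL_S)

# Unstable drift: the same stored terms, but the oscillator then moves to 80 ppm.
unstable_residual = residual_offset_us(80.0, term, INTERVAL_S)
```

With a stable drift the computed term cancels the drift completely in this model. If the oscillator speed changes after the terms were stored, the applied term is stale and a residual offset keeps building up in every interval.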

15 What are the failure modes of IEEE 802.1ASbt?
- Permanent fail-silence?
- Transient/intermittent fail-silence?
- Fail-consistent faulty? E.g., a grandmaster providing faulty time.
- Inconsistent faulty bridges? E.g., a bridge forwarding time information only on some ports.
- Byzantine faulty grandmaster clocks?

16 Wilfried Steiner, Corporate Scientist

17 Backup

18 Static vs. Dynamic Systems
- Static situation (one truth): What is the color of the house?
- Dynamic situation (more than one truth): What is the color of the ball?

19 Origins: Byzantine Failures
A distributed system that measures the temperature of a vessel shall raise an alarm when the temperature exceeds a certain threshold. The system shall tolerate the arbitrary failure of one node.
- How many nodes are required?
- How many messages are required?
[Figure: nodes N1 (faulty), N2, and N3 exchange HOT/COLD temperature readings. Example vote at one node: N1: COLD, N2: HOT, N3: COLD, result COLD.]
In general, three nodes are insufficient to tolerate the arbitrary failure of a single node. The two correct nodes are not always able to agree on a value.
A substantial body of scientific literature addresses this problem of dependable systems, in particular dependable communication.
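The three-node argument can be replayed in a few lines. The sketch assumes that each correct node takes a simple majority over the three values it sees (its own reading plus one message from each other node), and that the faulty node N1 tells N2 "HOT" and N3 "COLD". The exact message pattern is an assumption for illustration.

```python
from collections import Counter


def majority(values):
    """Most common value in the list (three values over two options always have one)."""
    return Counter(values).most_common(1)[0][0]


# The temperature is near the threshold, so the correct nodes read differently.
n2_own, n3_own = "HOT", "COLD"

# Faulty N1 sends inconsistent values (arbitrary failure).
n1_to_n2, n1_to_n3 = "HOT", "COLD"

# Each correct node votes over: message from N1, reading of N2, reading of N3.
n2_decision = majority([n1_to_n2, n2_own, n3_own])
n3_decision = majority([n1_to_n3, n2_own, n3_own])

agree = n2_decision == n3_decision
```

Both correct nodes follow the same rule on honest input from each other, yet they end up with different decisions because the faulty node tips each vote in a different direction.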

20 Byzantine Clocks
In a distributed system in which all nodes are equipped with local clocks, all clocks shall become and remain synchronized. The system shall tolerate the arbitrary failure of one node.
- How many nodes are required?
- How many messages are required?
[Figure: fast, perfect, and slow clocks plotted against real time, with the resynchronization interval (R.int) marked]
[Figure: nodes N1 (faulty), N2, and N3 exchange clock values. Vote at one node: N1: 00:01, N2: 00:01, N3: 00:04, result 00:01. Vote at the other node: N1: 00:04, N2: 00:01, N3: 00:04, result 00:04.]
In general, three nodes are insufficient to tolerate the arbitrary failure of a single node. The two correct nodes are not always able to bring their clocks into close agreement.
A substantial body of scientific literature addresses this problem of fault-tolerant clock synchronization.
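The clock example can be reproduced with the values from the slide, using the median as the voting rule (an assumption; the slide does not name the rule). For contrast, the sketch also shows a fault-tolerant midpoint over four nodes, a rule known from the clock synchronization literature. The fourth node N4, its clock value, and the values sent by the faulty node are hypothetical.

```python
from statistics import median


def three_node_results():
    """Clock values in seconds; faulty N1 tells N2 '1' and N3 '4'."""
    n2_view = [1, 1, 4]  # from N1, own clock, from N3
    n3_view = [4, 1, 4]  # from N1, from N2, own clock
    return median(n2_view), median(n3_view)


def fault_tolerant_midpoint(readings, k=1):
    """Discard the k largest and k smallest readings, return the midpoint of the rest."""
    ordered = sorted(readings)
    kept = ordered[k:len(ordered) - k]
    return (kept[0] + kept[-1]) / 2


three = three_node_results()

# Four nodes: correct clocks N2=1, N3=4, N4=2; faulty N1 sends 1, 4, and 100.
correct = [1, 4, 2]
four = [
    fault_tolerant_midpoint([1] + correct),    # view of N2
    fault_tolerant_midpoint([4] + correct),    # view of N3
    fault_tolerant_midpoint([100] + correct),  # view of N4
]
spread_before = max(correct) - min(correct)
spread_after = max(four) - min(four)
```

With three nodes the two correct clocks stay as far apart as they started. In the four-node run every result stays inside the range of the correct clocks and the spread shrinks, which is consistent with the slide's claim that three nodes are not enough.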
