Verteilte Systeme (Distributed Systems)
Karl M. Göschka
Karl.Goeschka@tuwien.ac.at
http://www.infosys.tuwien.ac.at/teaching/courses/VerteilteSysteme/

Lecture 6: Clocks and Agreement
  Synchronization of physical clocks
  Logical clocks and ordering
  Distributed mutual exclusion
  Election
  Global state

Clock Synchronization
When each machine has its own clock, an event that occurred after another event may nevertheless be assigned an earlier time.
Time is so basic to the way people think!

Physical Clocks (1)
Figure: Computation of the mean solar day.

Time and Clocks
Historically, time has been measured astronomically:
  Solar day (transit of the sun) and solar second, defined as 1/86400 of a solar day
  Earth's rotation is not constant (core turbulence) and is slowing down (tidal friction, atmospheric drag), hence the mean solar second (GMT)
Atomic second: 9,192,631,770 transitions of cesium-133
International Atomic Time (TAI) at the BIH
Coordinated Universal Time (UTC): UTC second = TAI second, but leap seconds keep UTC in phase with solar time

Physical Clocks (2)
Figure: TAI seconds, solar seconds, and UTC on parallel timelines, with a leap second inserted into UTC.
TAI seconds are of constant length, unlike solar seconds. Leap seconds are introduced when necessary to keep UTC in phase with the sun.

Timer
A timer is a counter that counts clock ticks
  Crystal oscillator
  Battery-backed CMOS RAM (initial setting)
  Clock offset, skew, drift (definitions differ across the literature!)
UTC is provided, e.g., by the National Institute of Standards and Technology (NIST): WWV, GEOS, GPS, ...
Real-time systems need actual clock time:
  synchronize with real-world time (external)
  synchronize with each other (internal)

Clock Synchronization Algorithms
Figure: The relation between clock time and UTC when clocks tick at different rates.
The maximum drift rate determines the required resynchronization interval.

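The link between drift rate and resynchronization interval can be made explicit with a short bound. The symbols are assumptions introduced here for illustration and do not appear on the slide: C(t) is a clock's reading at UTC time t, ρ is the maximum drift rate, and δ is the largest deviation between any two clocks that the application tolerates. The sketch assumes both clocks agreed at time t.

```latex
1 - \rho \;\le\; \frac{dC}{dt} \;\le\; 1 + \rho
\quad\Longrightarrow\quad
\lvert C_1(t + \Delta t) - C_2(t + \Delta t) \rvert \;\le\; 2\rho\,\Delta t
\quad\Longrightarrow\quad
\Delta t_{\mathrm{resync}} \;\le\; \frac{\delta}{2\rho}
```

As a hypothetical example, with ρ = 10^-5 and δ = 1 ms the clocks would have to be resynchronized at least every 50 s.
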
Network Time Protocol (1)
Figure: Getting the current time from a time server.
T2' = T2 - θ, T3' = T3 - θ

Network Time Protocol (2)
Time must never run backward
All nodes adjust their clocks locally (advance or slow down)
Estimate/measure the propagation delay
Estimate the offset and compute the accuracy
Take the best (minimum delay) of eight measurements
Use multiple sources to improve accuracy
Hierarchical precision (strata)
Accuracy: ~ms (WAN), ~µs (LAN), ~ns (with hardware support, e.g., IEEE 1588)
Security?

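The offset and delay estimation can be sketched as follows. This is a minimal illustration, not the NTP implementation: the function names are hypothetical, and the four timestamps follow the usual convention (t1 = request sent, client clock; t2 = request received, server clock; t3 = reply sent, server clock; t4 = reply received, client clock). It assumes the delay is roughly symmetric in both directions.

```python
def offset_and_delay(t1, t2, t3, t4):
    """Estimate (theta, delta) from one request/reply exchange.

    theta: offset of the server clock relative to the client clock.
    delta: estimated one-way propagation delay (symmetry assumed).
    """
    theta = ((t2 - t1) + (t3 - t4)) / 2
    delta = ((t4 - t1) - (t3 - t2)) / 2
    return theta, delta


def best_estimate(samples):
    """Pick the (theta, delta) pair with the smallest delay.

    samples: iterable of (t1, t2, t3, t4) tuples, e.g. eight exchanges.
    """
    return min((offset_and_delay(*s) for s in samples), key=lambda pair: pair[1])
```

The sample with the smallest delay is preferred because it leaves the least room for asymmetric queuing to distort the offset estimate.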
Network Time Protocol (3)
Figure: NTP precision levels (Stratum 0, Stratum 1, Stratum 2, Stratum 3).

Attacking time synchronization

The Berkeley Algorithm
a) The time daemon asks all the other machines for their clock values
b) The machines answer
c) The time daemon tells everyone how to adjust their clock

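Steps a) to c) can be sketched in a few lines. The function name and the sample values are assumptions for illustration; the slide does not say how the daemon combines the answers, and this sketch assumes a plain average over all clocks including the daemon's own.

```python
def berkeley_adjustments(daemon_time, reported_times):
    """Return the adjustment for each clock: the daemon first, then the others.

    daemon_time: the time daemon's own clock value.
    reported_times: clock values answered by the other machines.
    """
    # Offsets are taken relative to the daemon's clock; its own offset is 0.
    offsets = [0] + [t - daemon_time for t in reported_times]
    average = sum(offsets) / len(offsets)
    # Each machine moves from its own offset to the common average.
    return [average - offset for offset in offsets]
```

For example, with the daemon at 180 minutes and the other machines at 170 and 205, the daemon would advance by 5, the first machine would advance by 15, and the second would slow down until it has lost 20.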
Clock Synchronization in Wireless (1)
E.g., sensor networks:
  nodes are resource constrained
  multihop routing is expensive
  optimize algorithms for energy consumption
RBS (Reference Broadcast Synchronization):
  internal synchronization (no absolute clock)
  only receivers synchronize (based on the receipt of a reference message)
  signal propagation time is approximately constant (without multihop routing)

Clock Synchronization in Wireless (2)
Figure (a): The usual critical path in determining network delays.
Figure (b): The critical path in the case of RBS.

Time vs. Order (logical time)
Synchronous system: algorithms are easier to model, but clock synchronization is needed
Asynchronous system: today's reality, but many design problems cannot be solved with deterministic algorithms
However, often neither a global clock nor clock synchronization is needed:
  It is sufficient to agree on the order of events (logical clocks); time is relative, anyway
  Then, some events are ordered, some are concurrent (partial order)

Making clocks move forward
Figure: Message timestamps before correction ("This situation must be prevented") and after correction ("Fixed!").
"In many cases, wall clock time does not matter. All we care about is relative time." (L. Lamport)
(This is not true in some real-time systems.)

Happened-before (1)
Definition of logical clocks based on the happened-before relation, to order events sequentially in a distributed system:
  Events in one process are ordered (local clock)
  Message send happens before message receive
  Happened-before is transitive
  Events that are not ordered are concurrent (partial ordering)
Similar to physical causality, therefore also called potential causal ordering

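Happened-before (2)
Figure: Space-time diagram over physical time. p1 executes a, b; p2 executes c, d; p3 executes e, f. Message m1 goes from b to c, message m2 from d to f.
Feynman (space-time) diagrams document causality
The relation is transitive: a happened-before f
It imposes a partial order (not total): a → b → c → d → f; e ∥ (a, b, c, d), but e → f
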
Logical clock implementation
Captures the happened-before ordering numerically: Lamport timestamps
Each node keeps a counter (LC):
  1. Increment LC before each event (computation, send, receive)
  2. On message send, piggyback LC
  3. On message receive, set the local LC to max(local LC, received LC) (time can only move forward), and then apply rule 1 for the receipt (+1)
A total order is obtained by adding the process ID
a → b ⇒ L(a) < L(b), but the converse is not true!

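The three rules can be sketched as a small class. The class and method names are assumptions for illustration. The replay function walks through the events a to f of the happened-before example (p1: a, b; p2: c, d; p3: e, f; m1 from b to c; m2 from d to f).

```python
class LamportClock:
    def __init__(self, pid):
        self.pid = pid
        self.time = 0

    def tick(self):
        """Rule 1: increment before each event."""
        self.time += 1
        return self.time

    def send(self):
        """Rule 2: sending is an event; the returned value is piggybacked."""
        return self.tick()

    def receive(self, received):
        """Rule 3: jump forward to the received value, then count the receipt."""
        self.time = max(self.time, received)
        return self.tick()

    def stamp(self):
        """(time, pid) pairs compare as a total order."""
        return (self.time, self.pid)


def replay_example():
    p1, p2, p3 = LamportClock(1), LamportClock(2), LamportClock(3)
    a = p1.tick()
    b = p1.send()        # m1
    c = p2.receive(b)
    d = p2.send()        # m2
    e = p3.tick()
    f = p3.receive(d)
    return {"a": a, "b": b, "c": c, "d": d, "e": e, "f": f}
```

In this replay L(e) < L(b) holds although e and b are concurrent, which illustrates why the converse of the clock condition fails.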
Lamport clocks in middleware
Figure: The positioning of Lamport's logical clocks in distributed systems.

Example: Inconsistent replication
Problem due to message delays and the lack of global time
If (non-commutative) updates arrive in different orders at the two sites, the databases will become inconsistent.
We could require all messages to arrive at all nodes in the same order (which may be too strong; see also causal ordering).

Synchronizing multicast messages
Assume data is replicated on several servers
Updates to data are performed by clients
  An update request is multicast to all servers
  Multicast messages arrive in different orders at different servers
How to ensure consistency of data at all servers?
  Order message deliveries at servers
  Differentiate between receipt and delivery

Totally-Ordered Multicast
Clients multicast their updates with a (Lamport) timestamp (FIFO, reliable)
Upon receipt, the message is put into a local queue ordered by timestamp
The server acknowledges receipt of requests by multicast (for total ordering)
Eventually all processes will have the same copy of the local queue
A message that is at the head of the queue and has been acknowledged by all processes is delivered to the server process (the respective ACKs are deleted)
Updates may not be done in the correct (?) order, but they are done in the same order at all nodes

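The queue-and-acknowledge scheme can be sketched with two replicas that receive two updates in opposite orders. This is a simplified illustration under stated assumptions: the class, method names, and update strings are hypothetical, message identifiers are (Lamport timestamp, sender) pairs, acknowledgements carry no timestamp of their own, and the demo delivers messages by hand in an order that respects FIFO channels.

```python
import heapq


class Replica:
    def __init__(self, pid, n):
        self.pid, self.n = pid, n
        self.clock = 0
        self.queue = []      # min-heap of message ids (timestamp, sender)
        self.payload = {}    # message id -> update
        self.acks = {}       # message id -> set of processes that acknowledged
        self.delivered = []  # updates handed to the application, in order

    def multicast(self, update):
        self.clock += 1
        return ((self.clock, self.pid), update)

    def on_update(self, msg):
        msg_id, update = msg
        self.clock = max(self.clock, msg_id[0]) + 1
        heapq.heappush(self.queue, msg_id)
        self.payload[msg_id] = update
        self._deliver_ready()
        return (msg_id, self.pid)  # acknowledgement, to be multicast to all

    def on_ack(self, ack):
        msg_id, acker = ack
        self.acks.setdefault(msg_id, set()).add(acker)
        self._deliver_ready()

    def _deliver_ready(self):
        # Deliver while the head of the queue is acknowledged by everyone.
        while self.queue and len(self.acks.get(self.queue[0], ())) == self.n:
            msg_id = heapq.heappop(self.queue)
            self.delivered.append(self.payload.pop(msg_id))
            del self.acks[msg_id]


def demo():
    r0, r1 = Replica(0, 2), Replica(1, 2)
    m0 = r0.multicast("add $100")
    m1 = r1.multicast("add 1% interest")
    # The two updates are received in different orders at the two replicas.
    a00 = r0.on_update(m0)
    a01 = r0.on_update(m1)
    a11 = r1.on_update(m1)
    a10 = r1.on_update(m0)
    for ack in (a11, a10, a00, a01):
        r0.on_ack(ack)
    for ack in (a00, a01, a11, a10):
        r1.on_ack(ack)
    return r0.delivered, r1.delivered
```

Both replicas deliver the update with the smaller (timestamp, sender) identifier first, even though they received the updates and acknowledgements in different orders.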
Vector Clocks - Principle
Logical clocks order related events; nothing can be said about unrelated events
Problem with Lamport timestamps: L(a) < L(b) does not imply a → b
Rather: L(a) < L(b) ⇒ (a → b) or (a ∥ b). Too restrictive?
Figure: Concurrent message transmission using logical clocks.

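Vector Clocks - Example
Figure: Space-time diagram over physical time with vector timestamps.
  p1: a (1,0,0), b (2,0,0)
  p2: c (2,1,0), d (2,2,0)
  p3: e (0,0,1), f (2,2,2)
  Message m1 goes from b to c, message m2 from d to f.
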
Vector Clocks - Algorithm
1. Initially, V_i[j] = 0 for all j
2. Before P_i timestamps an event, V_i[i] := V_i[i] + 1
3. P_i includes V_i in every message it sends
4. When P_i receives a timestamp t in a message, it sets V_i[j] := max(V_i[j], t[j]) for all j (merge operation), and then applies rule 2 for the receipt

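The four rules can be sketched as follows. Class and function names are assumptions for illustration; processes are indexed from 0 here, so p1 in the example corresponds to index 0. The replay function reproduces the timestamps of events a to f from the vector clock example.

```python
class VectorClock:
    def __init__(self, i, n):
        self.i = i          # index of the owning process
        self.v = [0] * n    # rule 1: all entries start at 0

    def tick(self):
        """Rule 2: increment the own entry before timestamping an event."""
        self.v[self.i] += 1
        return tuple(self.v)

    def send(self):
        """Rule 3: the returned vector is included in the message."""
        return self.tick()

    def receive(self, t):
        """Rule 4: component-wise maximum (merge), then rule 2 for the receipt."""
        self.v = [max(own, other) for own, other in zip(self.v, t)]
        return self.tick()


def replay_example():
    p1, p2, p3 = (VectorClock(i, 3) for i in range(3))
    a = p1.tick()
    b = p1.send()        # m1
    c = p2.receive(b)
    d = p2.send()        # m2
    e = p3.tick()
    f = p3.receive(d)
    return {"a": a, "b": b, "c": c, "d": d, "e": e, "f": f}
```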
Vector Clocks Usage
V_i[i] is the number of events P_i has timestamped
V_i[j] (j ≠ i) is the number of events that have occurred at P_j on which P_i may causally depend
Comparison of vector clocks:
  V = V' iff V[j] = V'[j] for all j
  V ≤ V' iff V[j] ≤ V'[j] for all j
  V < V' iff V ≤ V' and V ≠ V'
Now, V(a) < V(b) ⇒ a → b (and vice versa)
Disadvantage: more storage and message payload; optimizations exist

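The comparison rules translate directly into code. The function names are assumptions for illustration, and vectors are represented as plain tuples of equal length.

```python
def leq(v, w):
    """V <= V' iff every component of V is <= the matching component of V'."""
    return all(x <= y for x, y in zip(v, w))


def happened_before(v, w):
    """V < V' iff V <= V' and V != V'."""
    return leq(v, w) and tuple(v) != tuple(w)


def concurrent(v, w):
    """Neither vector is <= the other, so the events are unordered."""
    return not leq(v, w) and not leq(w, v)
```

With the timestamps of the vector clock example, `happened_before((1, 0, 0), (2, 2, 2))` holds for a and f, while `concurrent((0, 0, 1), (2, 2, 0))` holds for e and d.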
Causal ordering using vector timestamps
Figure: Timelines of P1, P2, and P3 annotated with vector timestamps.
  P1: (0,0,0), (0,1,0), (0,1,1), (0,2,1)
  P2: (0,0,0), (0,1,0), (0,1,1), (0,2,1)
  P3: (0,0,0), (0,1,0), (0,1,1)

Mutual Exclusion
Coordinate activities, share resources: critical section (monitor, semaphore)
Locally: assisted by the OS, which is in turn assisted by hardware in order to guarantee atomic operations
Distributed mutex, based solely on message passing:
  Safety: at most one process may execute in the critical section at a time
  Liveness: requests to enter and exit the critical section eventually succeed (no deadlock, no starvation)
  Ordering: happened-before (fairness)

A Centralized Algorithm
a) Process 1 asks the coordinator for permission to enter a critical region. Permission is granted.
b) Process 2 then asks permission to enter the same critical region. The coordinator does not reply.
c) When process 1 exits the critical region, it tells the coordinator, which then replies to 2.

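The coordinator's behavior in steps a) to c) can be sketched as follows. The class and method names are assumptions for illustration; message passing is reduced to return values, and "no reply" is modeled as returning None.

```python
from collections import deque


class Coordinator:
    def __init__(self):
        self.holder = None       # process currently in the critical region
        self.waiting = deque()   # queued requests, first come first served

    def request(self, pid):
        """Return "GRANT" if the region is free, otherwise queue and stay silent."""
        if self.holder is None:
            self.holder = pid
            return "GRANT"
        self.waiting.append(pid)
        return None

    def release(self, pid):
        """Called when the holder exits; returns the next process granted, if any."""
        if pid != self.holder:
            raise ValueError("only the current holder may release")
        self.holder = self.waiting.popleft() if self.waiting else None
        return self.holder
```

Replaying the slide: `request(1)` returns "GRANT", `request(2)` returns None, and `release(1)` returns 2, which is the process the coordinator now replies to.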
A Distributed Algorithm
a) Two processes want to enter the same critical region at the same moment. They each multicast their intention along with a timestamp.
b) Process 0 has the lowest timestamp, so it wins.
c) When process 0 is done, it sends an OK also, so 2 can now enter the critical region.

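The decision each process takes when a request arrives can be sketched as follows. Names and the three states are assumptions for illustration, and the timestamps 8 and 12 in the usage note are hypothetical values. Ties are broken by process ID, since requests compare as (timestamp, pid) pairs.

```python
class Node:
    def __init__(self, pid):
        self.pid = pid
        self.state = "RELEASED"   # RELEASED, WANTED, or HELD
        self.request = None       # own pending request as (timestamp, pid)
        self.deferred = []        # processes whose OK is postponed

    def want(self, timestamp):
        """Announce the intention to enter; the result is multicast to all."""
        self.state = "WANTED"
        self.request = (timestamp, self.pid)
        return self.request

    def on_request(self, request):
        """Reply "OK" at once, or defer (None) if this node has priority."""
        mine_first = self.state == "WANTED" and self.request < request
        if self.state == "HELD" or mine_first:
            self.deferred.append(request[1])
            return None
        return "OK"

    def enter(self):
        """Called once an OK has arrived from every other process."""
        self.state = "HELD"

    def leave(self):
        """Exit the region; returns the processes that now receive their OK."""
        self.state, self.request = "RELEASED", None
        released, self.deferred = self.deferred, []
        return released
```

If process 0 requests with timestamp 8 and process 2 with timestamp 12, process 2 answers OK to process 0, process 0 defers process 2, and `leave()` on process 0 later returns `[2]`.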
A Token Ring Algorithm
a) An unordered group of processes on a network.
b) A logical ring constructed in software: the token goes around the ring; a process that owns the token can enter the critical region.

A decentralized probabilistic Algorithm
Each resource is replicated n times
Access requires a majority m > n/2
Application in DHTs
A coordinator may reset (and forget its votes): what if k = 2m - n coordinators fail (i.e., (n - m) + k = m)?
Voting correctness is violated, but the probability is extremely low (e.g., 10^-40)
Scales well (non-deterministic)
BUT: bad utilization if many nodes compete (starvation)

A Comparison of the Four Algorithms
Table: A comparison of four mutual exclusion algorithms.

Election algorithms in distributed systems
Distributed agreement algorithms attempt to establish agreement among a set of processes about the value of a piece of information (e.g., what time is it?)
Election algorithms are one group of agreement algorithms
  The problem is for a set of processes (participants) to elect a leader (e.g., who will be our coordinator?)
  Useful for many algorithms that require a (temporarily central) coordinator

The Bully Algorithm (1)
a) Process 4 holds an election
b) Processes 5 and 6 respond, telling 4 to stop
c) Now 5 and 6 each hold an election

The Bully Algorithm (2)
d) Process 6 tells 5 to stop
e) Process 6 wins and tells everyone by sending a COORDINATOR message

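Steps a) to e) can be replayed with a small sequential simulation. The function name, the message log format, and the process set in the usage note are assumptions for illustration; timeouts are not modeled, a crashed process simply never answers, and the initiator is assumed to be alive.

```python
def bully_election(initiator, all_ids, alive):
    """Return (coordinator, log) for an election started by `initiator`.

    all_ids: every process identifier in the system.
    alive: the identifiers of processes that are currently running.
    log: list of (sender, receiver, message) tuples in the order sent.
    """
    log, pending, started = [], [initiator], set()
    while pending:
        p = pending.pop(0)
        if p in started:
            continue
        started.add(p)
        higher = [q for q in sorted(all_ids) if q > p]
        responders = [q for q in higher if q in alive]
        log += [(p, q, "ELECTION") for q in higher]
        log += [(q, p, "OK") for q in responders]   # "stop, I take over"
        if not responders:
            # Nobody with a higher identifier answered: p wins.
            log += [(p, q, "COORDINATOR") for q in sorted(alive) if q != p]
            return p, log
        pending += responders                        # each responder holds its own election
    return None, log
```

With processes 0 to 7, process 7 crashed, and process 4 starting the election, the simulation ends with process 6 as coordinator, matching the slide.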
A Ring Algorithm
Figure: Election algorithm using a ring.
Organize the processes logically along a ring.

Another Ring Algorithm
Figure: Ring of processes with identifiers 3, 17, 4, 24, 9, 1, 15, 28; the election message currently carries 24.
Note: The election was started by process 17. The highest process identifier encountered so far is 24. Participant processes are shown darkened.

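A ring election of this kind can be sketched as a single message that carries the highest identifier seen so far. The function name is hypothetical, and the ring order in the usage note is an assumption read off the flattened figure; the elected process does not depend on that order. Participant marking and the final "elected" round are omitted.

```python
def ring_election(ring, starter):
    """Return (winner, hops) for an election started by `starter`.

    ring: process identifiers in the order the message travels.
    hops: number of message transmissions until the winner is known.
    """
    n = len(ring)
    i = ring.index(starter)
    candidate = starter          # highest identifier encountered so far
    hops = 0
    while True:
        i = (i + 1) % n          # forward the message to the next process
        hops += 1
        if ring[i] == candidate:
            # The message returned to the owner of the highest identifier.
            return candidate, hops
        candidate = max(candidate, ring[i])
```

For the ring `[3, 17, 4, 24, 9, 1, 15, 28]` with starter 17, the message carries 24 after passing process 24 and the election ends with process 28 as winner.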
Elections in Ad hoc networks
The best leader is elected
An overlay is constructed (hierarchy)
Resource capacities are taken into account (e.g., battery lifetime)
See example

Elections in Wireless Environments (1)
Figure: Election algorithm in a wireless network, with node a as the source. (a) Initial network. (b)-(e) The build-tree phase.

Elections in Wireless Environments (2)
Figure: Election algorithm in a wireless network, with node a as the source. (a) Initial network. (b)-(e) The build-tree phase.

Elections in Wireless Environments (3)
Figure: (e) The build-tree phase. (f) Reporting of best node to source.

Elections in large-scale P2P Systems
Requirements for superpeer selection:
  1. Normal nodes should have low-latency access to superpeers.
  2. Superpeers should be evenly distributed across the overlay network.
  3. There should be a predefined portion of superpeers relative to the total number of nodes in the overlay network.
  4. Each superpeer should not need to serve more than a fixed number of normal nodes.

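Global state predicates (1)
Figure: Detecting global properties between processes p1 and p2.
  a. Garbage collection: object references, a message in transit, a garbage object (state of the communication channel!)
  b. Deadlock: p1 and p2 wait for each other (wait-for)
  c. Termination: p1 and p2 are passive, with an activate message in transit (state of the communication channel!)
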
Global state: Consistent cut
a) A consistent cut
b) An inconsistent cut (effect without cause)

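"Effect without cause" can be checked mechanically: a cut is inconsistent if it contains the receipt of a message but not its sending. The sketch below uses a hypothetical representation in which a cut maps each process to the number of its events included, and each message is a pair of (process, event index) positions, counted from 1. The usage note reuses the events of the happened-before example.

```python
def is_consistent(cut, messages):
    """Return True if no message is received inside the cut but sent outside it.

    cut: dict mapping process name -> number of events included in the cut.
    messages: list of ((sender, send_index), (receiver, receive_index)).
    """
    def included(event):
        process, index = event
        return index <= cut[process]

    return all(included(send) for send, receive in messages if included(receive))
```

With m1 sent as event 2 of p1 and received as event 1 of p2, the cut `{"p1": 1, "p2": 1, "p3": 1}` contains the receipt of m1 without its sending and is therefore inconsistent, whereas `{"p1": 2, "p2": 2, "p3": 1}` is consistent even though m2 is still in transit.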
Global state predicates (2)
Stability: once the system enters a state S0 in which the predicate is true, it remains true in all future states reachable from S0
Safety: α is an undesirable predicate of the system's global state (e.g., being deadlocked)
  Safety(α) at S0: α = false for all states reachable from S0 (i.e., bad α will never happen)
Liveness: β is a desirable property (e.g., reaching termination)
  Liveness(β) at S0: for any linearization starting from S0, β = true for some state SL reachable from S0 (i.e., good β will eventually happen)

Snapshot algorithm (1)
Chandy and Lamport (1985): Record a set of process and channel states such that the recorded global state is consistent, even though the combination of recorded states may never have actually occurred at the same time.
a) Organization of a process and channels for a distributed snapshot

Snapshot algorithm (3)
b) Process Q receives a marker for the first time (from another channel) and records its local state
c) Q records all incoming messages
d) Q receives a marker on its incoming channel and finishes recording the state of that incoming channel

Example (1)
Figure: Processes p1 and p2 connected by channels c2 (p1 to p2) and c1 (p2 to p1). Initially p1 holds account $1000 and no widgets; p2 holds account $50 and 2000 widgets.
Process p2 has already received an order for five widgets before S0.
1. Global state S0: p1 <$1000, 0>; c2 (empty); p2 <$50, 2000>; c1 (empty)
2. Global state S1: p1 <$900, 0>; c2 (Order 10, $100), M; p2 <$50, 2000>; c1 (empty)
3. Global state S2: p1 <$900, 0>; c2 (Order 10, $100), M; p2 <$50, 1995>; c1 (five widgets)
4. Global state S3: p1 <$900, 5>; c2 (Order 10, $100); p2 <$50, 1995>; c1 (empty)
Final recorded state: P1 <$1000, 0>; P2 <$50, 1995>; c1 <five widgets>; c2 <>
(M = marker message)

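The marker rules can be replayed on this example with a short simulation. This is a sketch under stated assumptions, not a general implementation: class and function names are hypothetical, channels are FIFO deques, messages are (label, state change) pairs, and the order of the last three steps is one possible interleaving.

```python
from collections import deque

MARKER = "M"


class Process:
    def __init__(self, state, incoming, outgoing):
        self.state = dict(state)
        self.incoming = incoming      # names of incoming channels
        self.outgoing = outgoing      # outgoing channels (FIFO deques)
        self.recorded_state = None
        self.recording = {}           # channel name -> messages seen while recording
        self.channel_state = {}       # channel name -> final recorded contents

    def record(self, marker_from=None):
        """Record the local state and send a marker on every outgoing channel."""
        self.recorded_state = dict(self.state)
        for name in self.incoming:
            if name == marker_from:
                self.channel_state[name] = []   # marker arrived first: channel empty
            else:
                self.recording[name] = []       # start recording this channel
        for channel in self.outgoing:
            channel.append(MARKER)

    def receive(self, name, channel):
        msg = channel.popleft()
        if msg == MARKER:
            if self.recorded_state is None:
                self.record(marker_from=name)
            else:
                self.channel_state[name] = self.recording.pop(name)
            return
        if name in self.recording:
            self.recording[name].append(msg)
        _label, change = msg
        for key, amount in change.items():
            self.state[key] += amount


def run_example():
    c1, c2 = deque(), deque()                    # c1: p2 -> p1, c2: p1 -> p2
    p1 = Process({"account": 1000, "widgets": 0}, ["c1"], [c2])
    p2 = Process({"account": 50, "widgets": 2000}, ["c2"], [c1])
    p1.record()                                  # S0: p1 records, marker on c2
    p1.state["account"] -= 100                   # S1: p1 orders 10 widgets
    c2.append(("Order 10, $100", {"account": 100}))
    p2.state["widgets"] -= 5                     # S2: p2 ships the earlier order
    c1.append(("five widgets", {"widgets": 5}))
    p1.receive("c1", c1)                         # S3: widgets arrive, recorded for c1
    p2.receive("c2", c2)                         # marker: p2 records, c2 is empty
    p1.receive("c1", c1)                         # marker: c1 recording is closed
    return p1, p2
```

The recorded states match the slide: p1 <$1000, 0>, p2 <$50, 1995>, c1 containing the five widgets message, and c2 empty, although no actual global state S0 to S3 had exactly this combination.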
Reachability between states
Figure: The actual execution e0, e1, ... leads from S_init to S_final; recording begins and ends in between. S_snap is reachable from S_init via the pre-snap events e'0, e'1, ..., e'R-1, and S_final is reachable from S_snap via the post-snap events e'R, e'R+1, ...
If a stable predicate is true in the snapshot, then it is also true in (any) final state.

Summary
Distributed processes need to synchronize their actions to ensure cooperation or fair competition
Lack of a global clock makes synchronization difficult
Often, ordering is enough: logical clocks and vector timestamps reduce the cost of synchronization
Distributed agreement algorithms are required when processes need to coordinate their actions: mutex, election, global state, ...