Slides for Chapter 12: Coordination and Agreement

Slides for hater : oordination and greement rom oulouris, ollimore and Kindberg istributed Systems: oncets and esign dition, ailure ssumtions and ailure etectors reliable communication channels rocess failures: crashes failure detector: object/code in a rocess that detects failures of other rocesses unreliable failure detector unsusected or susected (evidence of ossible failures) each rocess sends ``alive'' message to everyone else not receiving ``alive'' message after timeout most ractical systems reliable failure detector unsusected or failure synchronous system few ractical systems istributed Mutual xclusion rovide critical region in a distributed environment message assing for examle, locking files, lockd daemon in UNX (NS is stateless, no file-locking at the NS level) lgorithms for mutual exclusion N rocesses rocesses don't fail message delivery is reliable critical region: enter(), resourceccesses(), exit() Proerties: [M] safety: only one rocess at a time [M] liveness: eventually enter or exit [M3] haened-before ordering: ordering of enter() is the same as ordering lgorithms for mutual exclusion Performance evaluation: overhead and bandwidth consumtion: # of messages sent client delay incurred by a rocess at entry and exit throughut measured by synchronization delay: delay between one's exit and next's entry central server algorithm server kees track of a ---ermission to enter critical region a rocess requests the server for the the server grants the if it has the a rocess can enter if it gets the, otherwise waits when done, a rocess sends release and exits

Server managing a mutual exclusion for a set of rocesses central server algorithm Queue of requests. Request Server 3. rant. Release 3 Proerties: safety, why? liveness, why? ordering not guaranteed, why? [V in rocesses vs. server] Performance: enter overhead: two messages (request and grant) enter delay: time between request and grant exit overhead: one message (release) exit delay: none synchronization delay: between release and grant centralized server is the bottle neck ring-based algorithm ring of rocesses transferring a mutual exclusion logical ring, could be unrelated to the hysical configuration i sends messages to (i+) mod N when a rocess holds a, it can enter, otherwise waits when a rocess releases a (exit), it sends to its neighbor n 3 Token ring-based algorithm roerties: safety, why? liveness, why? ordering not guaranteed, why? Performance: bandwidth consumtion: kees circulating enter overhead: 0 to N messages enter delay: delay for 0 to N messages exit overhead: one message exit delay: none synchronization delay: delay for to N messages n algorithm using multicast and logical clocks multicast a request message for the enter only if all the other rocesses rely totally-ordered timestams: <T, i > each rocess kees a state: RLS, L, WNT if all have state = RLS, all rely, a rocess can hold the and enter if a rocess has state = L, doesn't rely until it exits if more than one rocess has state = WNT, rocess with the lowest timestam will get all N- relies first.

Ricart and grawala s algorithm Multicast synchronization On initialization state := RLS; To enter the section state := WNT; Multicast request to all rocesses; request rocessing deferred here T := request s timestam; Wait until (number of relies received = (N )); state := L; Rely 3 On receit of a request <T i, i > at j (i j) if (state = L or (state = WNT and (T, j ) < (T i, i ))) then queue request from i without relying; else rely immediately to i ; end if To exit the critical section state := RLS; rely to any queued requests; Rely 3 3 Rely 3 n algorithm using multicast and logical locks n algorithm using multicast and logical locks Proerties safety, why? liveness, why? ordering, why? erformance: bandwidth consumtion: no kees circulating entry overhead: (N-), why? [with multicast suort: + (N -) = N] entry delay: delay between request and getting all relies exit overhead: 0 to N- messages exit delay: none synchronization delay: delay for message (one last rely from the revious holder) Maekawa s Voting lgorithm Main dea We actually don t need all N relies onsider rocesses,, X needs relies from and X needs relies from and X f X can only rely to (vote) *one rocess at a time* and cannot have a rely from X at the same time Mutex between and --hinges on X Processes in overlaing grous members in the overla control mutex Mutual exclusion for all rocesses grou for every rocess air of grous for and overlas => mutex(, ) very ossible air of grous overlas => mutex(all ossible airs of rocesses) => mutex(all rocesses) 3

rous and members Parameters and grou membershi grou for each rocess N rocesses N grous ach grou has K members numbers of rocesses to request for ermission ach rocess is in M grous (M > ) llows overlaing => mutex f the grous are disjoint, M =, no overlaing, no mutex number of rocesses to grant ermission N = number of rocesses K = voting grou size M = number of voting grous each rocess is in Otimal (smallest K, why?) K = M ~= sqrt(n) Non-trivial to construct the grous roximation K = * sqrt(n) - Put rocess id s in a sqrt(n) by sqrt(n) table Union the rows and columns where i is N grous M = * sqrt(n) - rou membershi rou for : {,,,, } rou for : {,,,, } rou for : {,,,, } very air of grous overla N grous ach grou has K = * sqrt(n) members M = * sqrt(n) [# of grous each rocess are in] rou membershi (alternative?) rous for,, : {,,,, } rous for,, : {,,,, } rous for,, : {,,,, } very air of grous overla N grous [Sqrt(N) unique grous] ach grou has K = * sqrt(n) members M = 3 [sqrt(n)] for,,, but M = 6 [*sqrt(n)] for the rest Undesirable, why? Sketch of the Voting lgorithm igure.6 Maekawa s voting algorithm grou of rocesses V i for each rocess i ach air of grous overla X in grous: [,X] and [,X] => the grous for any air of rocesses overla To enter the critical region, i Sends RQUST s to all rocesses in V i Waits for RPLY s (VOT s) from all rocesses in V i To exit the critical region, i Sends RLS s to all rocesses in V i Three tyes of messages, not two as in the multicast alg. On initialization state := RLS; voted := LS; or i to enter the critical section state := WNT; Multicast request to all rocesses in V i ; Wait until (number of relies received = K); state := L; On receit of a request from i at j if (state = L or voted = TRU) then queue request from i without relying; else send rely to i ; voted := TRU; end if or i to exit the critical section state := RLS; Multicast release to all rocesses in V i ; On receit of a release from i at j if (queue of requests is non-emty) then remove head of queue from k, say; send rely to k ; voted := TRU; else voted := LS; end if

eadlock? Processes:,, rou :, rou :, rou :, eadlock has s rely, waiting for s rely has s rely, waiting for s rely has s rely, waiting for s rely Timestam the requests in ordering holding according to the timestam Proerties Safety: No rocess can rely/vote more than once at any time. Liveness: timestam ( ordering) ordering: above Performance ntry overhead [assuming k = sqrt(n)] Sqrt(N) requests + Sqrt(N) relies * Sqrt(N) < (N - ) [N > ] xit overhead Sqrt(N) releases lections choosing a unique rocess for a articular role for examle, server in dist. mutex each rocess can call only one multile concurrent s can be called by different rocesses articiant: engages in an rocess with the largest id wins each rocess i has variable elected i =? (don't know) initially lections Proerties: [] elected i of a ``articiant'' rocess must be P max (elected rocess---largest id) or? [] liveness: all rocesses articiate and eventually set elected i!=? (or crash) Performance: overhead (bandwidth consumtion): # of messages turnaround time: # of messages to comlete an ring-based algorithm logical ring, could be unrelated to the hysical configuration i sends messages to (i+) mod N no failures elect the coordinator with the largest id initially, every rocess is a non-articiant any rocess can call an : marks itself as articiant laces its id in an message sends the message to its neighbor 5

Ring-based algorithm ring-based in rogress receiving an message: if id > myid, forward the msg, mark articiant if id < myid non-articiant: relace id with myid: forward the msg, mark articiant articiant: sto forwarding (why? Later, multile s) if id = myid, coordinator found, mark non-articiant, elected i := id, send elected message with myid receiving an elected message: id!= myid, mark non-articiant, elected i := id forward the msg if id = myid, sto forwarding 9 5 3 Note: The was started by rocess 7. The highest rocess identifier encountered so far is. Particiant rocesses are shown darkened 7 8 Ring-based algorithm Ring-based algorithm Proerties: safety: only the rocess with the largest id can send an elected message liveness: every rocess in the ring eventually articiates in the ; extra s are stoed Performance: one, best case, when? N messages N elected messages turnaround: N messages one, worst case, when? N - messages N elected messages turnaround: 3N - messages can't tolerate failures, not very ractical rocesses can crash and can be detected by other rocesses timeout T = T transmitting + T rocessing each rocess knows all the other rocesses and can communicate with them Messages:, answer, coordinator start an detects the coordinator has failed sends an message to all rocesses with higher id's and waits for answers (excet the failed coordinator/rocess) if no answers in time T, it is the coordinator sends coordinator message (with its id) to all rocesses with lower id's else waits for a coordinator message starts an if timeout 6

The of coordinator, after the failure of and then 3 Stage Stage Stage 3 ventually... Stage answer 3 timeout answer coordinator answer 3 3 3 receiving an message sends an answer message back starts an if it hasn't started one send messages to all higher-id rocesses (including the failed coordinator the coordinator might be u by now) receiving a coordinator message set elected i to the new coordinator to be a coordinator, it has to start an when a crashed rocess is relaced the new rocess starts an and can relace the current coordinator (hence ``bully'') roerties: safety: a lower-id rocess always yields to a higher-id rocess owever, during an, if a failed rocess is relaced the low-id rocesses might have two different coordinators: the newly elected coordinator and the new rocess, why? failure detection might be unreliable liveness: all rocesses articiate and know the coordinator at the end Performance best case: when? overhead: N- coordinator messages turnaround delay: no /answer messages worst case: when? overhead: + +...+ (N-) + (N-)= (N-)(N-)/ + (N-) messages, +...+ (N-) answer messages, N- coordinator messages, total: (N-)(N-) + (N-) = (N+)(N-) = O(N ) turnaround delay: delay of and answer messages 7