Midterm Examination ECE 419S 2015: Distributed Systems
Date: March 13th, 2015, 6-8 p.m.


Instructor: Cristiana Amza
Department of Electrical and Computer Engineering, University of Toronto

[Scoring table: Problem number / Maximum Score / Your Score for each problem, with bonus points and a total row.]

This exam is closed textbook and closed lecture notes. You have two hours to complete the exam. Use of computing and/or communicating devices is NOT permitted. You do not need to obtain more than 100 points for this exam. 100 points will give you the full midterm exam credit. However, additional points are provided, which may help if you run out of time. Moreover, for problems with a higher degree of difficulty some bonus points are provided as guidance for you on how to budget your time.

Write your name and student number in the space below. Do the same on the top of each sheet of this exam book.

Your Student Number
Your First Name
Your Last Name

Problem 1. Basic Distributed System Concepts, Architectures and Algorithms (12 points)

(a) (5 points) Name three differences between a multiprocessor and a distributed system (DS) that cause problems for the DS, and at least two concrete problems that these differences create for distributed system algorithms and their implementation.

a.1 Three differences:
- lack of a unique physical clock;
- network latency;
- lack of a physically shared state in a DS.

a.2 Two problems for DS algorithms that the above differences cause: they make agreement impossible and make algorithms for fault tolerance, synchronization and data consistency difficult. It is hard to distinguish between a process/network failure and a slow processor or slow network.

b) (3 points) Lamport describes an algorithm to logically order events in a distributed system. In this algorithm, events a and b, in processes i and j, have logical timestamps C_i(a) and C_j(b). If C_i(a) < C_j(b), we do not know if a → b or if a and b are unordered. (Note that → is the right-arrow Lamport happened-before operator.) How would you extend the logical time information sent with each message to include enough information that would enable you to decide for sure which of the two cases described above applies (ordered by happened-before or unordered)?

The correct answer is a description of the vector timestamp algorithm, as shown on p. 447, CDK, and in the lecture notes.

c) (4 points) Mention 5 uses of Replication by briefly naming the scenarios it is used in for each.

Scaling through load balancing, low-latency local replicas for use in the Wide Area, mobility, data availability, and fault tolerance.

Grading Scheme: 5/4 if the student gave 5 distinct uses, 4/4 for 4 distinct uses.
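To illustrate the part b) answer, a minimal Python sketch of how two vector timestamps are compared to decide between happened-before and unordered; the helper names and the three-process example values are assumptions of the sketch, not part of the exam.

# Sketch: comparing vector timestamps to decide happened-before vs. concurrent.
def happened_before(va, vb):
    # a -> b iff va <= vb component-wise and va != vb
    return all(x <= y for x, y in zip(va, vb)) and va != vb

def concurrent(va, vb):
    # unordered iff neither event happened-before the other
    return not happened_before(va, vb) and not happened_before(vb, va)

# With plain Lamport clocks, C_i(a) = 2 < C_j(b) = 3 cannot distinguish the
# two cases; the vectors can.
a = [2, 0, 0]   # event a on process i = 0
b = [2, 1, 0]   # event b on process j = 1, after receiving a message from i
c = [0, 3, 0]   # event c on process j = 1, with no message from i
print(happened_before(a, b))   # True:  a -> b
print(concurrent(a, c))        # True:  a and c are unordered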

Problem 2. Physical Clocks (14 Points)

In the lectures, we sketched the implementation of at-most-once message delivery semantics (e.g., in RPC) using physically synchronized clocks. The goal of the algorithm is that at-most-once semantics should always be guaranteed, even if some new messages may be incorrectly rejected as duplicates. The algorithm is partially described below.

Every client RPC request message carries a client or connection identifier and a physical clock timestamp. For each client connection, the server records in a table the most recent timestamp it has seen. The client always timestamps an RPC message retransmission with the same timestamp as the original message. If an incoming message for a connection carries a timestamp lower than or equal to the timestamp stored for that connection, then the server rejects the message as a duplicate.

To protect against crashes, in which the above table would be lost, the server periodically writes its current time to disk. Let p be the period between successive disk writes. When the server crashes and then reboots, it reloads the latest stored time value from disk, tlatest. The idea for guaranteeing at-most-once semantics is to reject all messages that might have been accepted before the crash (and to accept only new messages that could not have been accepted before the server crash). Some new messages may be incorrectly rejected, but at-most-once semantics should always be guaranteed.

(a) (5 points) Modify the receiver algorithm/implementation in order to minimize the probability that valid messages are rejected due to potential message reordering in the network in the case above (a message may seem old and get discarded just because a later message from the same client arrived first at the server). You can assume that there are no node failures for this part without losing points.

The receiver maintains a sliding window of 50 ms for each sender. It buffers all messages with a timestamp up to 50 ms less than the message with the largest timestamp received from the same sender. It then rejects duplicates by precise match against the message timestamps stored within this window, and it rejects older messages that fall outside of the window. When the largest timestamp is updated, the receiver moves the window according to the new timestamp, and it delivers the older messages it still stores for that sender to the application.

(b) (5 points) Assume that we are the sole engineers, in charge of all implementation aspects of the end-to-end at-most-once delivery algorithm, built from scratch with no use of libraries or other packages besides TCP/IP (a reliable, in-order network, so the problem in part a) cannot occur), at all levels of our DS. So, we assume reliable network communication, but, other than that, we can never blame anything on anybody else. The algorithm as stated makes it sound like, with the right choice of parameter p, most messages will be delivered to the application only once (or at least we are led to believe that we can control how many messages do not get delivered by a good choice of p). Also assuming that, when nodes fail, they come back up almost immediately, describe characteristics under which the fraction of messages that we cannot deliver may end up being high, in practice, regardless of any parameter settings and our best efforts.
You should provide messaging patterns, and bottlenecks or limitations of end points and the network, which, in the context of incompletely specified assumptions or specifications in our algorithm as stated, lead to a large fraction of messages getting dropped.

I expected that you would find a stress test that brings this system to its knees, or makes it impractical - the more specific to this algorithm and the less contrived the conditions, the better.

The at-most-once message delivery paper from MIT was published in the early 1990s. Since then, the Internet's scale has increased significantly, and the number of clients that any kind of server on the Internet is expected to service has increased from hundreds of thousands to millions or more. The only idea worth anything in this algorithm is for the case of server crashes - minimizing the amount of information we keep, by writing to disk only the server timestamp. So, only for covering the case of crashes is the algorithm worth anything, and, if we show that the algorithm is impractical in this case, we don't need to search further. Hence, the vulnerabilities of this algorithm are related to crashes, the way it rejects messages upon crashes, and its only building block used - synchronized physical clocks.

Specifically, if the server crashes, upon recovery it will need to clock-sync with ALL its clients as soon as possible; otherwise the algorithm is not applicable. This may take a while if there are millions of clients, which we cannot assume organize themselves into an NTP hierarchy just to accommodate the server. So, the worst pattern is periodic server crashes with the need to resync the server's clock with millions of client clocks with different drifts, and the worst client pattern related to this is: 1. bursts of messages while the server has not resynced, which will need to be buffered by the server, possibly exceeding its capacity, and 2. periodic bursts of messages within p, before the server gets to write its timestamp to disk - messages which will later have to be rejected, when the server recovers, because they are within the uncertainty window, but which will be retransmitted by the client(s), thus increasing the server's processing load during the resync period after the crash.

(c) (4 points) If the bound on the clock skew is large enough, briefly describe a scenario where a message can be accepted twice, violating the at-most-once delivery guarantee (in error).

The scenario may happen under the following assumptions: i) the clock skew, epsilon, between the server and client clocks is comparable to p (or, to make it very easy to understand, greater than p), and ii) the server accepts client messages which have client timestamps greater than its own clock at the time of the receipt of the message. In this case, say epsilon = p + 10 ms. At time T at the server (read from the server's clock), and before the server crash, a message comes from the client with client timestamp T + epsilon. The message timestamp is greater than the previous client timestamp stored in the table for this client, so the server accepts it. Let's say that the server writes its time (T) to disk right then (tlatest = T) and afterwards the server crashes and comes back up immediately at time T + 1 ms. Subsequently, it will also accept the duplicate (retransmission) of the previous client message with timestamp T + epsilon, because the rule is that it will accept all messages after tlatest + p (and we assumed that tlatest = T and epsilon > p).
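A minimal Python sketch of the receiver side of the scheme discussed in this problem - the per-connection timestamp table, the periodic disk write of the server's clock, and the post-crash rule of accepting only timestamps greater than tlatest + p. The class and method names and the in-memory structures are assumptions of the sketch, and the part (a) sliding-window buffering is omitted.

import time

class AtMostOnceReceiver:
    # Sketch only: the part (a) sliding-window buffering is not included.
    def __init__(self, period_p, clock=time.time):
        self.clock = clock
        self.period_p = period_p             # p: period between disk writes of the clock
        self.last_ts = {}                    # connection id -> highest timestamp accepted
        self.reject_before = float("-inf")   # set to tlatest + p after a reboot

    def checkpoint(self):
        # Called every p seconds: persist only the current server time.
        self._write_to_disk(self.clock())

    def recover(self, tlatest):
        # After a crash/reboot the table is gone; reject anything that *might*
        # have been accepted before the crash, i.e. timestamps <= tlatest + p.
        self.last_ts.clear()
        self.reject_before = tlatest + self.period_p

    def accept(self, conn_id, msg_ts):
        # Returns True iff the message should be delivered to the application.
        if msg_ts <= self.reject_before:
            return False                     # possibly accepted before the crash
        if msg_ts <= self.last_ts.get(conn_id, float("-inf")):
            return False                     # duplicate (or stale) for this connection
        self.last_ts[conn_id] = msg_ts
        return True

    def _write_to_disk(self, ts):
        pass                                 # placeholder for a durable write of ts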

Problem 3. Logical Clocks, Causal and Total Order Multicast (39 points + 4 bonus points)

In a bulletin board application, each post is multicast to all members of a chat room. We would like to avoid anomalies, such as the one in the figure below, where a chat room participant observes a reply to a previous post (which the reply logically depends on) before the original post. Assume that we don't know the type of application messages (post or reply) in the multicast messaging layer that does the ordering. Furthermore, for the purposes of defining causality in terms of happens-before on our bulletin board application, we define that post 1 happens-before post 2 iff post 1 is delivered to the display on the node issuing post 2 before post 2 is sent by mcast. The relation is transitive, i.e., a post can be based on seeing the full history of all the related messages, ordered by happens-before, posted by other nodes, not just the immediately preceding post seen.

For this whole problem, we will assume that there are no failures of nodes, that network channels are reliable and FIFO, and that network parallelism exists, but maybe not full network parallelism.

a) (12 points) For the types of logical timestamp used below, please provide a brief general description of how the overall solution for avoiding the temporal anomaly works for N nodes (not only 3 nodes or the special case shown). For all parts below, please make sure to think about and to describe how a participating node decides when to deliver/display a post to the application running on that node in each algorithm, and also specify the total number of messages needed in the algorithm on behalf of each BB message post. You need to account for all messages in the system as a whole, from the initial multicast of a post until delivery for display of that post on all nodes, in Big-Oh notation, as a function of a generic N (not 3 as in the Figure). Assume that a multicast generates N separate messages.

a.1) (4 points) We use the same rules for computing Lamport clocks as in the lectures and then use these standard Lamport clocks to timestamp each message mcasted - what is the rule for delivering a message to display on each node? How does it work (briefly), and what is the total number of messages system-wide on behalf of each post?

If using Lamport clocks to timestamp each message mcasted: keep all messages received in a queue on each node, mcast an ACK for each message received, and deliver a message when it is at the head of the queue and all N - 2 ACKs for it have been received. Total messages system-wide on behalf of each post: the initial multicast of N messages plus an ACK multicast of N messages from each receiver, i.e., O(N^2).

a.2) (4 points) We use VTS-1 Vector timestamps, with the VTS counting/incrementing the local processor's position at both send message events and receive message events, for any message. We do not increment on display events. We defer updating non-local positions in the local VTS to the time of processing post display events (with the same rules for the update of these other positions in the VTS as in the lecture notes and textbook). What is the rule for delivering a message to display on each node? How does it work (briefly), and what is the total number of messages system-wide on behalf of each post?

If using standard Vector clocks to timestamp each message mcasted: keep all messages received in a queue on each node, mcast an ACK for each message received, and deliver a message when it is at the head of the queue and all N - 2 ACKs for it have been received. Total messages system-wide on behalf of each post: O(N^2), as above.

a.3) (4 points) If using VTS-2 Vector timestamps, with the VTS counting/incrementing the local processor's position only on send events - when do we deliver a message to display on each node? How does it work (briefly), and what is the total number of messages system-wide on behalf of each post?

When there is no gap between the V_j[j] of this message and the V_i[j] of this node, and V_j[k] <= V_i[k] for k != j, the node can deliver the message to the application/display, i.e., for the human to see. It means all previous events of j have been delivered. Total messages on behalf of each post: N (no ACKs necessary).
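A minimal Python sketch of the a.3 delivery test (VTS-2, increments on sends only), with a hold-back queue; the function names and the queue handling are assumptions of the sketch.

def deliverable(v_msg, sender_j, v_local):
    # Deliver iff this is the next undelivered send from sender_j (no gap) and we
    # have already displayed everything the post causally depends on.
    if v_msg[sender_j] != v_local[sender_j] + 1:
        return False
    return all(v_msg[k] <= v_local[k]
               for k in range(len(v_local)) if k != sender_j)

def on_receive(v_msg, sender_j, v_local, hold_back, display):
    # Queue the post, then flush every queued post that has become deliverable.
    hold_back.append((v_msg, sender_j))
    progress = True
    while progress:
        progress = False
        for entry in list(hold_back):
            vm, j = entry
            if deliverable(vm, j, v_local):
                display(vm, j)
                v_local[j] = vm[j]          # advance our view of j's sends
                hold_back.remove(entry)
                progress = True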

b) (5 points) For the purposes of this part, assume a correctly working solution for avoiding the anomaly based on your favourite one out of the three timestamp choices above (pick one of Lamport, VTS-1 or VTS-2). For the purposes of this part b, and for part c below, we assume that the Figure is correct in terms of the timing of all message sends and receives (this remains as shown in the Figure), but that the Figure does not specify when messages are displayed on each participant's screen (for any post and for any participant). Also assume that we are down in the machine, i.e., that we can't understand and/or that we completely ignore message content (that is for humans, not for us). Alternatively, if we still want to maintain human status, assume we can replace the content of any message to be posted on this BB by anyone in this chat room with our content of interest today, such as "The TAs are now on strike" or "I hope the TA strike will be over soon."

With the above assumptions, let's reformulate the problem to answer the following question: with the timestamping solution of your choice (pick one and fill it in below), and the assumptions above, what is the earliest possible time when message m'' of Student 2 could have been delivered on the Prof's screen? This earliest time needs to be given as a logical clock on the local (Prof's) node. Please state the sequence of operations relevant to the local clock on the Prof's node up until the time of display, including the local logical clock just before the display of m'' on the Prof's node, and the local clock just after. Be as specific and precise as possible.

Easiest to compute for VTS-2: the time just before and after display of message m'' on the Professor's node is [1,1,0]/[1,1,1].

Grading Scheme: for the schemes that required ACKs, not counting the ACKs in the clock was not penalized, as long as the student explicitly specified whether ACKs were included in the counts.

c) (6 points) Given the reformulation of the problem statement in part b) above, and assuming that the ordering solutions above are working correctly, describe the order among the three post events shown in the Figure (Sm - the send of m by the Prof, Sm' - the send of m' by Student 1, and Sm'' - the send of m'' by Student 2). Which of these events are causally ordered by our definition of happens-before, and which of them are concurrent with each other? All answers are required for each of the three possible solutions. This may be the hardest question on the exam.

If we consider the definition of happens-before according to Lamport, and that Sm represents the send of the multicast of message m as a single event, then Sm → Sm' → Sm'' no matter what timestamp ordering scheme we use. This is because, for the classic definition of happens-before given by Lamport, the exchange of a message is sufficient to determine causality. It is plain from the figure that one message is sent by P1 and received by P2, ordering Sm → Sm'. Similarly, another message is sent by P2 and received by P3, ordering Sm' → Sm''. However, it is also plain to see that, if we consider the implementation of TO-Mcast based on Lamport clocks, also given by Lamport, the receipt of message m at P3, for example, does not imply that m is displayed at P3 at the time of receipt. Therefore, if we modify the happens-before relationship to include message display, instead of mere message receipt, then it is clear that the send of m and the send of m' are concurrent events. Based on this revised definition of happens-before, we have:

c1.) Lamport clocks (2 points). Sm, Sm' and Sm'' are concurrent events.

c2.) VTS-1 (2 points). With the solution given in a2), Sm, Sm' and Sm'' are concurrent events.

c3.) VTS-2 (2 points). With the solution given in a3), and the new definition of happens-before, Sm → Sm', but Sm and Sm'' are concurrent events, and Sm' and Sm'' are concurrent events.

d) (6 points) Totally ordered BB with Lamport clocks, improved with an additional RTT assumption. Assume we want to implement a totally ordered BB using Lamport clocks. In totally ordered multicast, all messages need to be ordered in the same order on all nodes. If we add the assumption that the network round-trip latency is bounded by an RTT which we know up-front, design a solution for the totally ordered BB based on Lamport clocks which reduces the total number of messages and the total bandwidth consumption as much as possible. Specify for which kind of message patterns in the BB application and for which network characteristics the intended optimizations will matter the most.

Assume that RTT is the worst-case round-trip delay between any two end-points in the network. If I assign a time-out of 2*RTT to the message at the head of the queue, and I wait for 2*RTT before delivering it, then I must have gotten all messages that were sent to me before that same message got to them; hence I am not missing anything, and the message at the head of my queue must have been received everywhere. So, I do not need any ACKs. At any node, I form the queue sorted by Lamport clocks, and then deliver the head of the queue if I received nothing with a lower Lamport clock, from anyone, for a 2*RTT duration, instead of waiting for ACKs. There is no need for ACKs. If we have low bandwidth and a contended chat room, saving lots of ACKs pays off.

e) (4 points) Mention the worst kind of disadvantage that your specialized RTT-based solution for the totally ordered BB in part d) can have compared to your best performing causally ordered BB solution so far. Describe the specific message patterns of the BB application and relevant network characteristics in terms of latency, bandwidth, parallelism, etc., under which the disadvantage is expected to be the worst possible. Be as explicit and complete as possible in your answer.

We will introduce a lot of unnecessary latency if there is high latency variability, with some messages getting long delays - hence if the RTT representing the worst case is much larger than the average latency. The scheme is also at a disadvantage under any conditions that favor delivery of many ACKs in parallel, fast, such as high network parallelism and high network bandwidth.
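A minimal Python sketch of the part d) rule - hold the head of the Lamport-clock-sorted queue and deliver it only after a 2*RTT wait during which nothing smaller has arrived (anything smaller that did arrive would have displaced it as the head). The class name, the use of a local monotonic timer, and the periodic try_deliver call are assumptions of the sketch.

import heapq
import time

class RTTOrderedBB:
    def __init__(self, rtt_bound):
        self.rtt = rtt_bound
        self.queue = []               # entries: (lamport_ts, sender_id, arrival, post)

    def on_multicast_received(self, lamport_ts, sender_id, post):
        heapq.heappush(self.queue, (lamport_ts, sender_id, time.monotonic(), post))

    def try_deliver(self, display):
        # Called periodically: deliver head entries whose 2*RTT wait has expired.
        now = time.monotonic()
        while self.queue and now - self.queue[0][2] >= 2 * self.rtt:
            ts, sender, _, post = heapq.heappop(self.queue)
            display(sender, post)     # (ts, sender) gives the same total order everywhere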

f.) (6 points + 4 bonus points) VTS-1 improvement without any additional assumption. Design an improved version of the VTS-1 solution from part (a2) which significantly reduces the total number of messages that need to be sent for a reliable, in-order network, without using any additional assumptions about bounded network delay or RTT. No change to the VTS-1 timestamps is allowed, so you will still increment the local position in the timestamp on both sends and receives. Your solution needs to remain decentralized. There are no other restrictions. A reduction of the total number of messages by (at least) a factor of 2 is acceptable for full points. A maximum of 4 bonus points is allocated for solutions that achieve Big-Oh reductions in the number of messages.

Answers here can vary. Acceptable reductions are: 1. batching, i.e., waiting to send ACKs until we can multicast two or more ACKs in the same multicast message, resulting in a reduction of the overall number of messages, and 2. using a different topology for connecting the nodes in order to send the multicasts, e.g., a token ring with piggy-backing and forwarding of multicast messages, or a tree. Both of these are focused on how to group existing messages into fewer message batches. The best answer is a justification of the fact that, for this particular problem, with the assumptions we make here, there is actually no need for ACKs even for VTS-1. But we need to change the way we process messages for delivery, and we need to understand why, in this case, even with gaps in the VTS, we are fine.
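To make the first acceptable reduction concrete, a small Python sketch of ACK batching; the batch size of 2 and the message format are assumptions, chosen only to show how the number of ACK multicasts per post can be (at least) halved. A real implementation would also flush on a short timer so the last pending ACK is not delayed indefinitely.

class AckBatcher:
    def __init__(self, mcast, batch_size=2):
        self.mcast = mcast            # function that multicasts one message to all nodes
        self.batch_size = batch_size
        self.pending = []             # posts we still owe an ACK for

    def ack(self, post_id):
        self.pending.append(post_id)
        if len(self.pending) >= self.batch_size:
            self.flush()

    def flush(self):
        if self.pending:
            self.mcast({"type": "ACK", "post_ids": list(self.pending)})
            self.pending.clear()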

Problem 4. Replication, Performance, Fault Tolerance and Availability (8 points + 2 bonus points)

Describe a solution to avoid the bulletin board anomaly from the previous question that provides the "best of both worlds": the solution needs to combine performance, in terms of competitive latency and scaling with the best algorithms designed for the BB so far, on one hand, and robustness on the other hand. Robustness is defined by high availability of data and service and fault tolerance to single node failures. The network is still assumed to be reliable and in order, but without an RTT bound. Your solution needs to be (fully) decentralized. Your own adaptation of Quorum Consensus (QC) to this problem is recommended, with 2 bonus points earned for the correct and appropriate use of QC. However, you can use an algorithm of your own choice instead of QC if you can meet the above criteria.

Hint: Quorum consensus has become widely used because it is good at maintaining small amounts of state in a replicated, consistent and fault-tolerant manner. Think about maintaining small amounts of replicated state with QC. Then think about how to incorporate the replicated state in your BB solution, in order to maintain the replicated BB in a consistent, fault-tolerant manner as well. Argue that, under reasonably common BB application patterns and network characteristics, with a non-negligible probability of single node failures, your solution provides the best of both worlds. As a stress test for your solution, please explain what happens if a post was communicated to some of the nodes but not to others, e.g., due to a node failure in the middle of a multicast.

We use Quorum Consensus in order to implement a bulletin board with totally ordered posting of all messages on all nodes and also with fault tolerance in the case of participant crashes. For this, the general idea is that the whole BB is an object we maintain using QC, which has a certain version number based on how many messages have been included (not necessarily displayed) in the BB. Whenever a node wants to post a new message, it multicasts its request and forms a Read Quorum. From the Read Quorum, it extracts the most up-to-date (highest) version number that the BB has at that moment. The node then writes a new version of the BB, with an incremented version number and its new message included, and waits for acknowledgments from a Write Quorum.

The performance penalty is incurred on read operations. A read operation on the BB - e.g., in order to figure out what messages we can send to the display now - needs to involve a Read Quorum. If the version number returned by the Read Quorum is the same as the one the local node has, then the read operation is complete. Otherwise, the local node selects the highest version number and the associated BB state, and performs a write (involving waiting for responses from a Write Quorum) to all other replicas.

The last step above is necessary if we want to be serious about fault tolerance of the BB messages. It is possible that a node fails in the middle of a write - and, in this case, a subsequent read by any node effectively finishes what that node started.

Grading scheme: lenient, based on the student's understanding of what Quorum Consensus is, or, in general, based on the student's understanding of how to extend any previous scheme with Fault Tolerance features.
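A minimal Python sketch of the quorum-based post and read described above; the Replica interface (read_version_and_state, write), the choice of which replicas to contact, and the simplified failure handling are assumptions of the sketch.

class QuorumBB:
    def __init__(self, replicas, read_q, write_q):
        assert read_q + write_q > len(replicas)   # read and write quorums must intersect
        self.replicas = replicas
        self.read_q = read_q
        self.write_q = write_q

    def _read_quorum(self):
        # Ask a read quorum for (version, bb_state); keep the most up-to-date copy.
        replies = [r.read_version_and_state() for r in self.replicas[:self.read_q]]
        return max(replies, key=lambda vs: vs[0])

    def post(self, message):
        version, bb_state = self._read_quorum()
        new_state = bb_state + [message]
        acks = 0
        for r in self.replicas:                   # keep going until a write quorum acks
            if r.write(version + 1, new_state):   # assumed to return True on an ack
                acks += 1
            if acks >= self.write_q:
                return True
        return False

    def read_for_display(self):
        # The write-back step from the answer above (pushing a newer version it finds
        # to a write quorum) is omitted in this sketch.
        version, bb_state = self._read_quorum()
        return version, bb_state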

Problem 5. Mutual Exclusion and Lamport Clocks (26 points)

For all parts of this question, assume reliable delivery, no failures of nodes or links, and limited message latency, although we may not know the bound unless otherwise specified. A node only needs the resource for a limited amount of time; the critical section is of limited size and the number of lock reacquires in a loop is limited on each node.

Part A. Mutual Exclusion Algorithms with Elements of Centralization (14 points).

Assume that, in a mutual exclusion algorithm of our own design, a centralized node, called the manager (one manager per lock), maintains the location of the current lock owner. Also assume that the lock does not change its owner, and remains in the same location, if it is released by its owner but not requested by another node. Also assume that each lock is initially placed at the manager's location (but not held). The algorithm is as follows.

For the lock acquire: i) a node which wants the lock sends an acquire message to the manager; ii) the manager returns the location of the lock, i.e., its current owner id; iii) the requesting node contacts the owner directly; iv) the owner creates a local queue of process ids with the acquire requests it has received. The queue is maintained in FIFO order.

Upon releasing a lock, the lock owner: i) reads the process id at the head of the queue, which is next in line to become owner; ii) sends that process id of the new owner to the manager; iii) dequeues the head of the queue; iv) passes the lock and the remaining queue to the process which is next in line to become owner.

a) (4 points) Is this algorithm correct? Argue that the algorithm is correct or give a counter-example.

The algorithm suffers from the following race condition: a node gets the current owner from the manager, but its request to the current owner arrives after the lock and the queue have already been passed to a new owner. The algorithm does not specify what to do in this case, so the assumption is that the late requesters will hang waiting for the lock forever. Note that the old owner cannot simply forward such delayed requests, because the lock may have changed owners yet again afterwards.

b) (4 points) Now assume that we change the algorithm in part a) to add tagging each acquire message with the Lamport clock at the time of the send on the sending node (as usual when using Lamport clock timestamps). If the owner maintains its queue in order of Lamport clocks, instead of FIFO, is the algorithm correct? Argue that the algorithm is correct or give a counter-example.

The algorithm suffers from the same race condition; ordering the queue differently did not resolve it.
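For reference, a compact Python sketch of the Part A algorithm as stated in steps i)-iv), with message passing reduced to direct method calls; the class and method names are assumptions of the sketch, and the race discussed in a) and b) is marked in a comment.

class Manager:
    # One manager per lock; it only records who currently owns the lock.
    def __init__(self, initial_owner):
        self.owner = initial_owner           # the lock starts at the manager's location

    def acquire(self, requester):
        return self.owner                    # step ii): return the current owner's id

    def set_owner(self, new_owner):
        self.owner = new_owner               # release step ii): record the new owner

class LockHolder:
    def __init__(self, node_id, manager):
        self.node_id = node_id
        self.manager = manager
        self.queue = []                      # FIFO queue of waiting requesters (step iv)
        self.holds_lock = False

    def remote_request(self, requester):
        # Acquire steps iii)/iv): a requester that learned we are the owner contacts
        # us directly. RACE: if the lock and queue were already passed on, this
        # request is queued at a node that will never serve it.
        self.queue.append(requester)

    def release(self, nodes):
        if not self.queue:
            return                           # no requests: the lock stays here
        next_owner = self.queue[0]           # release step i): head of the queue
        self.manager.set_owner(next_owner)   # release step ii)
        remaining = self.queue[1:]           # release step iii): dequeue the head
        self.holds_lock = False
        nodes[next_owner].receive_lock(remaining)   # release step iv)

    def receive_lock(self, queue):
        self.holds_lock = True
        self.queue = queue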

c) (6 points) Describe an improvement to either one of the previous two algorithms (without Lamport clocks as in part a, or with Lamport clocks as in part b) if we add the assumption of a known RTT (round-trip) bound in the system. You are not allowed to change the state maintained by the manager - it will still maintain the current lock owner. The improvement can be either in terms of correctness or performance of the original algorithm; however, if an algorithm is incorrect, it does not matter what its performance is for the purposes of grading.

Assume that the current lock owner O waits for 2*RTT just before passing the queue and the lock. Anyone who requested the lock owner's identity from the manager and obtained O must have sent their request, and the request must have been received by O, by the end of this wait. Some requests will accumulate on the future lock owner before it receives the lock and the old queue, and the algorithm needs to be modified to allow this. The algorithm also needs to be modified to merge the old queue from the old owner with the new queue on the new owner upon passing the queue and the lock (the new queue contains the latest requests).

Part B. Decentralized Mutual Exclusion Algorithms Based on Token Ring (12 points).

a) (4 points) Explain why the mutual exclusion algorithm based on the Token Ring node topology is not fair (by giving a fairness counter-example).

For example, a node I which needs/asks for the lock is located right behind the current owner on the ring. I must wait for the token to be passed all around the ring before it has a chance to get it. Other nodes may get the token while it is on its way, even if many messages were exchanged around the ring since the time that I placed its request.

b) (8 points) Describe your own variation on the standard implementation of mutual exclusion based on the Token Ring algorithm that allows for fairness in the algorithm. Your solution should still use the Token Ring to transmit messages between nodes (messages can be sent only around the Ring) and you need to try to minimize the total bandwidth consumption of your algorithm. You can use one or more of the standard assumptions for bounded and/or reliable network, clocks, etc., if it helps, without losing any points.

One possibility, assuming synchronized physical clocks: each node registers the physical timestamp of its lock request locally. The token circulates as usual. When receiving the token, a node which needs the lock adds its local request timestamp to the token, and passes on the token. A node will enter the critical section when 1. it receives the token, 2. it sees its own request carried on the token, and 3. its request is the lowest-timestamp request among those carried on the token. At release, the node will delete its own request from the token and pass it on as before.
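A minimal Python sketch of the fair token-ring variation above, assuming synchronized physical clocks; the token representation (a dict from node id to request timestamp) and the method names are assumptions of the sketch.

import time

class FairRingNode:
    def __init__(self, node_id, send_to_next, clock=time.time):
        self.node_id = node_id
        self.send_to_next = send_to_next       # sends the token to the ring successor
        self.clock = clock
        self.my_request_ts = None              # set when this node wants the lock

    def request_lock(self):
        self.my_request_ts = self.clock()      # register the request locally

    def on_token(self, token):
        # token: dict mapping node_id -> request timestamp carried on the token.
        if self.my_request_ts is not None and self.node_id not in token:
            token[self.node_id] = self.my_request_ts   # advertise our request
        mine = token.get(self.node_id)
        if mine is not None and mine == min(token.values()):
            self.critical_section()            # oldest request on the token: enter
            del token[self.node_id]            # release: remove our request
            self.my_request_ts = None
        self.send_to_next(token)               # pass the token around the ring

    def critical_section(self):
        pass                                   # placeholder for the protected work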

Problem 6. Global State (8 points)

Given the following code segments, how many and which results are not possible under sequential consistency (SC)? List all results of the three prints, e.g., print x = 0, print y = 0, print z = 0, etc., that are not possible under SC. Assume that all variables, i.e., A, x, y and z, are initialized to 0 before this code is reached. Show your thinking.

1.
   P1: A = 1; x = A; y = A; z = y
   P2: print y; print x; print z

All combinations of print x = 0/1 and print y = 0/1 are valid under SC except for print y = 1 together with print x = 0 (for either value of print z); print z can be 0 or 1. The key to thinking about this is that print z can happen either before z = y or after this statement.

2.
   P1: A = 1; x = A; y = x
   P2: print y; z = y; print x
   P3: print z

All combinations of print x = 0/1 and print y = 0/1 are valid under SC except for print y = 1 together with print x = 0 (for either value of print z); print z can be 0 or 1.
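As a check of the answer for the first code segment, a small brute-force Python sketch that enumerates every sequentially consistent interleaving of P1 and P2 and collects the possible (print y, print x, print z) outcomes; the encoding of the statements as (destination, source) pairs is an assumption of the sketch.

from itertools import combinations

# Code segment 1 from above: P1 performs the assignments, P2 performs the prints.
P1 = [("A", "1"), ("x", "A"), ("y", "A"), ("z", "y")]   # assignments dst = src
P2 = ["y", "x", "z"]                                    # variables printed, in order

def run(order):
    # order: a sequence of 'P1'/'P2' labels, i.e. one total (SC) order of all steps.
    state = {"A": 0, "x": 0, "y": 0, "z": 0}
    i1 = i2 = 0
    printed = []
    for who in order:
        if who == "P1":
            dst, src = P1[i1]
            state[dst] = 1 if src == "1" else state[src]
            i1 += 1
        else:
            printed.append(state[P2[i2]])
            i2 += 1
    return tuple(printed)                               # (print y, print x, print z)

outcomes = set()
slots = range(len(P1) + len(P2))
for p2_slots in combinations(slots, len(P2)):           # positions of P2's steps
    order = ["P2" if s in p2_slots else "P1" for s in slots]
    outcomes.add(run(order))

print(sorted(outcomes))
# Prints six outcomes: every (y, x, z) in {0,1}^3 except (1, 0, 0) and (1, 0, 1),
# i.e. exactly the "print y = 1 with print x = 0" combinations ruled out above.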



More information

Lecture 12: Time Distributed Systems

Lecture 12: Time Distributed Systems Lecture 12: Time Distributed Systems Behzad Bordbar School of Computer Science, University of Birmingham, UK Lecture 12 1 Overview Time service requirements and problems sources of time Clock synchronisation

More information

Intuitive distributed algorithms. with F#

Intuitive distributed algorithms. with F# Intuitive distributed algorithms with F# Natallia Dzenisenka Alena Hall @nata_dzen @lenadroid A tour of a variety of intuitivedistributed algorithms used in practical distributed systems. and how to prototype

More information

Distributed Systems. 05. Clock Synchronization. Paul Krzyzanowski. Rutgers University. Fall 2017

Distributed Systems. 05. Clock Synchronization. Paul Krzyzanowski. Rutgers University. Fall 2017 Distributed Systems 05. Clock Synchronization Paul Krzyzanowski Rutgers University Fall 2017 2014-2017 Paul Krzyzanowski 1 Synchronization Synchronization covers interactions among distributed processes

More information

CPS 512 midterm exam #1, 10/7/2016

CPS 512 midterm exam #1, 10/7/2016 CPS 512 midterm exam #1, 10/7/2016 Your name please: NetID: Answer all questions. Please attempt to confine your answers to the boxes provided. If you don t know the answer to a question, then just say

More information

Fault Tolerance. Basic Concepts

Fault Tolerance. Basic Concepts COP 6611 Advanced Operating System Fault Tolerance Chi Zhang czhang@cs.fiu.edu Dependability Includes Availability Run time / total time Basic Concepts Reliability The length of uninterrupted run time

More information

Distributed Systems. Fault Tolerance. Paul Krzyzanowski

Distributed Systems. Fault Tolerance. Paul Krzyzanowski Distributed Systems Fault Tolerance Paul Krzyzanowski Except as otherwise noted, the content of this presentation is licensed under the Creative Commons Attribution 2.5 License. Faults Deviation from expected

More information

Proseminar Distributed Systems Summer Semester Paxos algorithm. Stefan Resmerita

Proseminar Distributed Systems Summer Semester Paxos algorithm. Stefan Resmerita Proseminar Distributed Systems Summer Semester 2016 Paxos algorithm stefan.resmerita@cs.uni-salzburg.at The Paxos algorithm Family of protocols for reaching consensus among distributed agents Agents may

More information

CSE 5306 Distributed Systems

CSE 5306 Distributed Systems CSE 5306 Distributed Systems Fault Tolerance Jia Rao http://ranger.uta.edu/~jrao/ 1 Failure in Distributed Systems Partial failure Happens when one component of a distributed system fails Often leaves

More information

Chapter 4: Distributed Systems: Replication and Consistency. Fall 2013 Jussi Kangasharju

Chapter 4: Distributed Systems: Replication and Consistency. Fall 2013 Jussi Kangasharju Chapter 4: Distributed Systems: Replication and Consistency Fall 2013 Jussi Kangasharju Chapter Outline n Replication n Consistency models n Distribution protocols n Consistency protocols 2 Data Replication

More information

CS244 Advanced Topics in Computer Networks Midterm Exam Monday, May 2, 2016 OPEN BOOK, OPEN NOTES, INTERNET OFF

CS244 Advanced Topics in Computer Networks Midterm Exam Monday, May 2, 2016 OPEN BOOK, OPEN NOTES, INTERNET OFF CS244 Advanced Topics in Computer Networks Midterm Exam Monday, May 2, 2016 OPEN BOOK, OPEN NOTES, INTERNET OFF Your Name: Answers SUNet ID: root @stanford.edu In accordance with both the letter and the

More information

Time. COS 418: Distributed Systems Lecture 3. Wyatt Lloyd

Time. COS 418: Distributed Systems Lecture 3. Wyatt Lloyd Time COS 418: Distributed Systems Lecture 3 Wyatt Lloyd Today 1. The need for time synchronization 2. Wall clock time synchronization 3. Logical Time: Lamport Clocks 2 A distributed edit-compile workflow

More information

Database Management System Prof. D. Janakiram Department of Computer Science & Engineering Indian Institute of Technology, Madras Lecture No.

Database Management System Prof. D. Janakiram Department of Computer Science & Engineering Indian Institute of Technology, Madras Lecture No. Database Management System Prof. D. Janakiram Department of Computer Science & Engineering Indian Institute of Technology, Madras Lecture No. # 20 Concurrency Control Part -1 Foundations for concurrency

More information

CS244a: An Introduction to Computer Networks

CS244a: An Introduction to Computer Networks Do not write in this box MCQ 13: /10 14: /10 15: /0 16: /0 17: /10 18: /10 19: /0 0: /10 Total: Name: Student ID #: Campus/SITN-Local/SITN-Remote? CS44a Winter 004 Professor McKeown CS44a: An Introduction

More information

TIME ATTRIBUTION 11/4/2018. George Porter Nov 6 and 8, 2018

TIME ATTRIBUTION 11/4/2018. George Porter Nov 6 and 8, 2018 TIME George Porter Nov 6 and 8, 2018 ATTRIBUTION These slides are released under an Attribution-NonCommercial-ShareAlike 3.0 Unported (CC BY-NC-SA 3.0) Creative Commons license These slides incorporate

More information

Final Exam Solutions May 11, 2012 CS162 Operating Systems

Final Exam Solutions May 11, 2012 CS162 Operating Systems University of California, Berkeley College of Engineering Computer Science Division EECS Spring 2012 Anthony D. Joseph and Ion Stoica Final Exam May 11, 2012 CS162 Operating Systems Your Name: SID AND

More information

Quality of Service (QoS)

Quality of Service (QoS) Quality of Service (QoS) The Internet was originally designed for best-effort service without guarantee of predictable performance. Best-effort service is often sufficient for a traffic that is not sensitive

More information

Assignment 10: TCP and Congestion Control Due the week of November 14/15, 2012

Assignment 10: TCP and Congestion Control Due the week of November 14/15, 2012 Assignment 10: TCP and Congestion Control Due the week of November 14/15, 2012 I d like to complete our exploration of TCP by taking a close look at the topic of congestion control in TCP. To prepare for

More information

CSE 5306 Distributed Systems

CSE 5306 Distributed Systems CSE 5306 Distributed Systems Consistency and Replication Jia Rao http://ranger.uta.edu/~jrao/ 1 Reasons for Replication Data is replicated for the reliability of the system Servers are replicated for performance

More information

Exploiting Commutativity For Practical Fast Replication. Seo Jin Park and John Ousterhout

Exploiting Commutativity For Practical Fast Replication. Seo Jin Park and John Ousterhout Exploiting Commutativity For Practical Fast Replication Seo Jin Park and John Ousterhout Overview Problem: consistent replication adds latency and throughput overheads Why? Replication happens after ordering

More information