Linearizability
CMPT 401, Thursday, March 31, 2005

The execution of a replicated service (potentially with multiple requests interleaved over multiple servers) is said to be linearizable if:
- The interleaved sequence of operations has the same results as if it were run sequentially on a single object.
- The order of operations in the interleaving is consistent with the real times at which the operations occurred in the actual execution.

Sequential Consistency
The execution of a replicated service (potentially with multiple requests interleaved over multiple servers) is said to be sequentially consistent if:
- The interleaved sequence of operations has the same results as if it were run sequentially on a single object.
- The order of operations in the interleaving is consistent with the program order in which each individual client requested them.

Passive Replication
- One primary physical object handles all client requests.
- One or more backup physical objects stay in sync with the primary.
- When the primary fails, one backup is promoted to be the new primary.
- Ideally, this provides fault-tolerance.
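The difference between the two conditions can be made concrete with a toy history. The sketch below (my own illustrative example, not from the lecture) brute-forces all interleavings of three register operations: an execution where client A's read returns a stale value after B's later write is sequentially consistent (A's program order is respected) but not linearizable (real-time order is violated).

```python
# Toy check: sequential consistency vs linearizability for a register.
# Each op = (client, name, start, finish, action), action is
# ("w", value) or ("r", expected_value). Assumed example history.
from itertools import permutations

ops = [
    ("A", "wA", 0, 1, ("w", 1)),   # A writes 1 during real time [0,1]
    ("B", "wB", 2, 3, ("w", 2)),   # B writes 2 during [2,3]
    ("A", "rA", 4, 5, ("r", 1)),   # A later reads the stale value 1
]

def legal(seq):
    """Register semantics: a read must return the latest write."""
    val = None
    for _, _, _, _, (kind, v) in seq:
        if kind == "w":
            val = v
        elif v != val:
            return False
    return True

def program_order_ok(seq):
    """Each client's own operations appear in issue order."""
    pos = {name: i for i, (_, name, *_) in enumerate(seq)}
    per_client = {}
    for c, name, *_ in sorted(ops, key=lambda o: o[2]):
        per_client.setdefault(c, []).append(name)
    return all(pos[a] < pos[b]
               for names in per_client.values()
               for a, b in zip(names, names[1:]))

def real_time_ok(seq):
    """If x finished before y started, x must precede y."""
    pos = {name: i for i, (_, name, *_) in enumerate(seq)}
    return all(pos[x[1]] < pos[y[1]]
               for x in ops for y in ops if x[3] < y[2])

seq_consistent = any(legal(p) and program_order_ok(p)
                     for p in permutations(ops))
linearizable = any(legal(p) and real_time_ok(p)
                   for p in permutations(ops))
print(seq_consistent, linearizable)  # True False
```

The history is sequentially consistent (order wA, rA, wB satisfies every client's program order) but not linearizable, because the only real-time-respecting order wA, wB, rA would force the read to return 2.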
Is Passive Replication Linearizable?
- Since one server handles all the requests, they are handled as if they were processed at one correct object (first requirement).
- Since the server handles requests in the order it receives them, the ordering of operations matches real time (second requirement).
- But what if reads can be directed to the backups? What about during failures?

Active Replication
- Client sends a request to the front end and blocks waiting for a response.
- The front end uses reliable, totally ordered multicast to send the request to all replica servers.
- Servers execute the request (identically, since requests arrive in the same order at all servers).
- Responses are returned to the front end.
- The front end determines the single response and returns it to the client.
- Again, the goal here is to make a fault-tolerant system.

Problems with Active Replication in Reality
- Fischer et al. showed that we cannot build a system that is guaranteed to reach consensus in an asynchronous system with crash failures.
- We have shown that if we have a totally ordered, reliable multicast system, we can solve consensus.
- Thus, we cannot build such a multicast system in an asynchronous system with crash failures.
- Active replication relies on exactly this kind of multicast system to provide the guarantees in the algorithm.

Making Active Replication Workable
- We have previously seen that we can build failure detectors that allow a consensus system with very low probability of failure.
- Similarly, we have mentioned randomized algorithms for consensus that have very low probability of failure.
- Using either of these approaches, we can construct an active replication scheme with very low probability of failure.
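The active replication steps above can be sketched in a few lines. In this toy version (names and structure are my own, not the lecture's), a single sequencer object stands in for the reliable totally ordered multicast, so every replica applies requests in the same order and all responses agree:

```python
# Minimal sketch of active replication with a sequencer standing in
# for totally ordered multicast. Assumed illustrative design.

class Sequencer:
    """Assigns a global order to requests and delivers them to all replicas."""
    def __init__(self, replicas):
        self.replicas = replicas
        self.next_seq = 0

    def multicast(self, request):
        seq = self.next_seq
        self.next_seq += 1
        # Every replica sees (seq, request) in the same order.
        return [r.deliver(seq, request) for r in self.replicas]

class Replica:
    """Deterministic state machine: same inputs in same order, same state."""
    def __init__(self):
        self.state = 0

    def deliver(self, seq, request):
        op, arg = request
        if op == "add":
            self.state += arg
        return self.state

class FrontEnd:
    def __init__(self, sequencer):
        self.sequencer = sequencer

    def invoke(self, request):
        responses = self.sequencer.multicast(request)
        # All replicas computed identical answers; return the single response.
        assert len(set(responses)) == 1
        return responses[0]

replicas = [Replica() for _ in range(3)]
fe = FrontEnd(Sequencer(replicas))
print(fe.invoke(("add", 5)))   # 5
print(fe.invoke(("add", 2)))   # 7
```

The sequencer is exactly the part that FLP says cannot be implemented with certainty in an asynchronous system with crash failures; real systems approximate it with failure detectors or randomization, as the slide notes.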
The gossip System
- A system designed to automatically move data to the edges and increase availability.
- Requests are divided into two types:
  - queries: read-only requests with no writing
  - updates: write-only requests with no reading
- The system is designed to meet two constraints:
  - Consistent Service: each client sends requests to any replica it chooses, and may send some requests to one replica and some to another. Regardless, the responses are always consistent with the updates the client has seen so far.
  - Relaxed Replica Consistency: all replicas eventually receive all updates and apply them in an order sufficient to meet the consistency needs of the application.

gossip Query Requests
- Client sends the request to a front end, which in turn selects a nearby, available replica.
- The front end sends the request along with a vector timestamp (one entry for each replica) that indicates the most recent updates the client has seen.
- If the replica's local timestamp is greater than or equal to the client's, it responds with its local data and its local vector timestamp. Otherwise, it must request and wait for more updates.
- The front end merges its local timestamp with the received replica timestamp.

gossip Update Requests
- Client sends the request to a front end, which in turn selects a nearby, available replica.
- Where fault-tolerance is important, a front end may send the request to N/2 + 1 replicas to ensure resilience in the case of crashes.
- The front end sends the request along with:
  - a vector timestamp (one entry for each replica) that indicates the most recent updates the client has seen
  - a unique ID, to ensure that updates are not applied more than once
- How the update is applied depends on whether it is handled in causal, forced, or immediate mode.

gossip Causal Updates
- The replica increments its own entry in its vector timestamp.
- The replica immediately responds to the client with its new vector timestamp (the update timestamp). The client merges this timestamp with its own.
- The replica waits until its timestamp is greater than or equal to the client's timestamp.
- The replica finally applies the update and merges its current timestamp with the update timestamp.
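The query and causal-update rules above both reduce to two vector-timestamp operations: a dominance test ("has this replica seen at least what the client has seen?") and a pointwise merge. A small sketch, with names of my own choosing:

```python
# Vector-timestamp operations that gossip query handling relies on.
# Illustrative sketch; function names are assumptions, not the lecture's.

def merge(a, b):
    """Pointwise maximum: the smallest timestamp dominating both."""
    return [max(x, y) for x, y in zip(a, b)]

def geq(a, b):
    """a >= b iff a dominates b in every entry."""
    return all(x >= y for x, y in zip(a, b))

def handle_query(replica_ts, replica_data, client_ts):
    """Serve only when the replica has seen all updates the client has seen."""
    if geq(replica_ts, client_ts):
        return replica_data, replica_ts  # front end merges this timestamp
    return None  # replica must request and wait for more updates

client = [2, 0, 1]
replica = [2, 1, 1]
print(handle_query(replica, "value", client))  # ('value', [2, 1, 1])
print(merge(client, replica))                  # [2, 1, 1]
```

A replica whose timestamp is, say, `[1, 0, 0]` would return `None` for the same client and block the query until gossip brings it up to date.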
gossip Forced Updates
- These updates must be totally as well as causally ordered.
- At any given time, one replica is known to all the others as the primary replica.
- The order in which updates reach the primary replica is appended to them as a sequence number.
- Before a replica will apply a forced update, it must both have a timestamp greater than or equal to the update's timestamp and have already applied the forced update whose sequence number is exactly one less.

gossip Immediate Updates
- Immediate updates go through the primary replica as well.
- Immediate updates are flagged with information on which causal and forced updates have come before them.
- Other replicas must apply such an update exactly after the forced and causal updates determined by the primary.

Gossip Messages
- Updates are shared between replicas using gossip messages. A gossip message consists of:
  - the sender's update log
  - the sender's timestamp
- The receiver of a message must:
  - merge in any updates that it has not seen before
  - discard any pending updates that have arrived in the log
  - merge the received vector timestamp with its own

The Bayou System
- Preventing conflicts is too restrictive in a system with disconnects and partitions.
- Instead, when replicas share updates with each other, they can try to resolve any conflicts that occur.
- Conflicts are resolved using domain-specific rules; this resolution is called operational transformation.
- Each replica has a list of committed updates and a list of tentative updates.
- The order of operations, and thus the final decision to commit, is imposed by using a primary replica.
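The receiver's three duties on a gossip message can be sketched directly. This is an illustrative structure of my own (update records and field names are assumptions), showing the unique-ID deduplication and the timestamp merge:

```python
# Sketch of a replica receiving a gossip message: merge unseen updates
# from the sender's log, keep already-seen ones out (unique IDs), and
# merge the sender's vector timestamp into its own. Assumed design.

def receive_gossip(log, ts, sender_log, sender_ts):
    seen = {u["id"] for u in log}
    for update in sender_log:
        if update["id"] not in seen:   # unique ID prevents re-applying
            log.append(update)
    new_ts = [max(a, b) for a, b in zip(ts, sender_ts)]
    return log, new_ts

log = [{"id": "u1", "op": "set x"}]
sender_log = [{"id": "u1", "op": "set x"}, {"id": "u2", "op": "set y"}]
log, ts = receive_gossip(log, [1, 0], sender_log, [1, 1])
print([u["id"] for u in log], ts)  # ['u1', 'u2'] [1, 1]
```

A real implementation would additionally check, per update, the causal/forced/immediate ordering rules above before applying it to the value.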
The Coda System
- A descendant of AFS with the goal of allowing high availability despite disconnects and partitions.
- The replicas are called the Volume Storage Group (VSG). At any given time, a client can access some subset of these replicas called the Available Volume Storage Group (AVSG).
- Connected execution proceeds as in AFS, with updates being communicated by clients to the AVSG.

Disconnected Coda
- When the AVSG is empty, the client is said to be disconnected.
- In this situation, the client still has access to any files that were cached locally before the disconnect.
- When reconnection occurs, all the updates are sent back to the AVSG and any conflicts are manually resolved by the user.

Coda Replication
- Each file has a Coda Version Vector (CVV). This is a vector timestamp with one entry for each replica in the VSG.
- Each element of the CVV represents the number of updates received at a given replica for this file.
- Replicas can compare CVVs and, if v1 >= v2 or v1 <= v2, the more recent version of the file can be transmitted to update the old version. Normally, this is a two-step process:
  - the individual servers agree to the update and acknowledge it to the client
  - the client computes the new CVV for the file and notifies all the servers that performed the update
- If neither condition holds, the file is considered to be in conflict, and user intervention is required to merge the files.

Communication with Replicas
- In AFS, we know which server is going to give us a callback message on changes.
- In Coda, clients select one member of the AVSG when opening a file. This one replica is responsible for providing a callback.
- When a file is updated, the update is sent to the whole AVSG, so those replicas can provide callbacks to their clients.
- Once every few minutes, the client must probe the VSG for each cached file to check which replicas are in the AVSG. Replicas respond with a vector timestamp representing (roughly) the state of the replica.
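The CVV comparison rule amounts to testing dominance in both directions; concurrent (incomparable) vectors signal a conflict. A minimal sketch, with names of my own choosing:

```python
# Sketch of Coda Version Vector comparison. Illustrative; the function
# name and return labels are assumptions, not Coda's actual API.

def compare_cvv(v1, v2):
    """Classify v1 relative to v2: 'newer', 'older', 'equal', or 'conflict'."""
    dominates = all(a >= b for a, b in zip(v1, v2))
    dominated = all(a <= b for a, b in zip(v1, v2))
    if dominates and dominated:
        return "equal"
    if dominates:
        return "newer"      # v1's version can overwrite v2's replica
    if dominated:
        return "older"      # v2's version can overwrite v1's replica
    return "conflict"       # concurrent updates: the user must merge

print(compare_cvv([2, 1, 1], [1, 1, 1]))  # newer
print(compare_cvv([2, 0, 1], [1, 2, 1]))  # conflict
```

The "conflict" case is exactly the situation where two partitioned AVSGs each accepted updates to the same file, so neither vector dominates the other.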
If a volume is found to be inconsistent between members of the AVSG, the client drops all its callback promises and requests new versions of its files.