TAPIR: Building Consistent Transactions with Inconsistent Replication
By Irene Zhang, Naveen Sharma, Adriana Szekeres, Arvind Krishnamurthy, and Dan Ports
Presented by Todd Charlton
Outline
- Problem Space
- Inconsistent Replication
- TAPIR
- Evaluation
- Conclusion
Problem
Develop an app to send pictures of chocolate labs.
How do we save the pictures?
Problem
Distributed storage systems face a tradeoff:
- Want strong consistency? Use replication protocols like Paxos, which incur a high performance cost.
- Want efficient protocols? They can only guarantee weak consistency.
Problem
Guarantees: Fault-Tolerance, Scalability, Linearizability
Layers: Distributed Transaction Protocol running over a Replication Protocol
Problem
Existing architectures
Problem
We are enforcing serial ordering in two places:
- Between replicas
- Between partitions
Problem
Guarantees: Fault-Tolerance, Scalability, Linearizability
Layers: Distributed Transaction Protocol running over a Replication Protocol
Inconsistent Replication
Just make the replication layer inconsistent!
- Operations can execute in any order
- Still provides fault tolerance
- No costly consistency protocol
Inconsistent Replication: Guarantees
- Fault tolerance: at any time, every operation in the operation set is in the record of at least one replica in any quorum of f+1 replicas.
- Visibility: for any two operations in the operation set, at least one is visible to the other.
- Consensus: every operation has agreement from at least a majority of the replicas.
Inconsistent Replication
[Architecture diagram: applications, which perform conflict detection, sit above a set of IR replicas]
Inconsistent Replication: Inconsistent Execution
The application sends InvokeInconsistent(); the replication layer responds with ExecInconsistent(). One round trip (a sketch follows):
1. Client sends Propose(op, id) to all replicas.
2. Replicas mark [id, op] as Tentative in their record and reply to the client with Reply(id).
3. Once the client receives f+1 replies for an id, it sends Finalize(id) to all replicas.
4. Replicas transition the op from Tentative to Finalized when they receive Finalize, and respond with ExecInconsistent() to the application layer.
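A minimal sketch of the client side of an inconsistent operation, assuming a hypothetical send/recv messaging layer; F, REPLICAS, send, and recv are illustrative stand-ins, not the paper's actual API.

```python
import uuid

F = 1                          # tolerated failures; write quorum is f+1
REPLICAS = ["r1", "r2", "r3"]  # 2f+1 replicas

def invoke_inconsistent(op, send, recv):
    """One round trip: Propose to all, wait for f+1 replies, then Finalize."""
    op_id = uuid.uuid4().hex
    for r in REPLICAS:                       # replicas record [id, op] as
        send(r, ("PROPOSE", op_id, op))      # Tentative and send Reply(id)
    acks = set()
    while len(acks) < F + 1:                 # wait for f+1 Reply messages
        sender, (kind, rid) = recv()
        if kind == "REPLY" and rid == op_id:
            acks.add(sender)
    for r in REPLICAS:                       # Finalize asynchronously; replicas
        send(r, ("FINALIZE", op_id))         # mark the op Finalized and execute
```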
Inconsistent Replication
Inconsistent Replication: Consensus Execution
The application sends InvokeConsensus(); the replication layer responds with ExecConsensus().
1. Client sends Propose(op, id) to all replicas.
2. Replicas mark [id, op, result] as Tentative in their record and reply to the client with Reply(id, result).
Fast path (fast quorum): if the client receives ceil(3f/2)+1 matching results, it returns the result to the application layer and sends Finalize to all replicas. One round trip.
Inconsistent Replication: Consensus Execution
Slow path (didn't reach the ceil(3f/2)+1 fast quorum of matching results):
1. The client waits for f+1 responses, then sends Finalize(id, result), where the result is computed by the application's decide() function.
2. When a replica receives Finalize, it records the op as Finalized (updating its record if the result it had recorded differs) and sends Confirm(id) to the client.
3. Once the client receives f+1 Confirm messages, it returns the result to the application.
Two round trips: the application sends InvokeConsensus(), the replication layer calls Decide(), the application returns its decision, and the replication layer responds with ExecConsensus(). A sketch of both paths follows.
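A sketch covering both consensus paths, reusing the assumed F, REPLICAS, and send/recv from the previous sketch; decide is supplied by the application (in our case TAPIR), and collect_replies is a hypothetical helper that gathers Reply(id, result) messages until a timeout.

```python
import math
from collections import Counter

def invoke_consensus(op, op_id, decide, send, recv, collect_replies):
    for r in REPLICAS:
        send(r, ("PROPOSE", op_id, op))
    results = collect_replies(op_id, recv)   # dict: replica -> tentative result
    value, count = Counter(results.values()).most_common(1)[0]
    if count >= math.ceil(3 * F / 2) + 1:
        # Fast path: a fast quorum of matching results; one round trip.
        for r in REPLICAS:
            send(r, ("FINALIZE", op_id, value))
        return value
    # Slow path: fall back once f+1 replies are in; the application's
    # decide() picks the result from the (possibly conflicting) replies.
    assert len(results) >= F + 1
    result = decide(list(results.values()))
    for r in REPLICAS:
        send(r, ("FINALIZE", op_id, result))
    confirms = set()
    while len(confirms) < F + 1:             # second round trip: Confirm(id)
        sender, (kind, rid) = recv()
        if kind == "CONFIRM" and rid == op_id:
            confirms.add(sender)
    return result
```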
Inconsistent Replication
Inconsistent Replication: Synchronization
- IR uses view changes.
- But wait, doesn't that imply a leader?
- Leaders exist solely during a view change. Their only job is to ensure that at least f+1 replicas are up to date.
Inconsistent Replication: Synchronization
- When triggered, the leader collects f+1 replicas' logs.
- It merges all Finalized records into a master record.
- If a record is Tentative, the transaction layer's Decide() must determine what to do.
- From the transaction layer's response, the master record R is created. All replicas update their records to the master (a sketch follows).
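A sketch of the leader's merge step, assuming each record is a dict mapping op ids to entries carrying a state and result; consult_decide stands in for the transaction layer's conflict resolution and is an assumed helper.

```python
def merge_records(records, consult_decide):
    """Merge f+1 replica records into a master record."""
    master, tentative = {}, {}
    for record in records:
        for op_id, entry in record.items():
            if entry["state"] == "FINALIZED":
                master[op_id] = entry            # Finalized entries win as-is
            else:
                tentative.setdefault(op_id, []).append(entry)
    for op_id, entries in tentative.items():
        if op_id in master:
            continue                             # finalized on some replica
        # Tentative everywhere: the transaction layer decides the result.
        result = consult_decide([e["result"] for e in entries])
        master[op_id] = {"op": entries[0]["op"],
                         "state": "FINALIZED", "result": result}
    return master                                # replicas adopt this record
```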
Inconsistent Replication: Tradeoffs
Good:
- 1 round-trip fast path, 2 round trips in the worst case
- No cross-replica communication needed
Bad:
- Replicas don't appear as a single machine (need occasional synchronization)
- Requires a well-designed transaction layer on top
TAPIR
- Stands for Transactional Application Protocol for Inconsistent Replication
- Designed specifically to interface with IR
- Uses 2PC across the partitions of replicas
- This is the transaction (application) layer. Users interact with this, not IR.
TAPIR
TAPIR: Optimistic Concurrency Control (OCC)
- IR guarantees visibility: in any pair of consensus operations, at least one is visible to the other.
- Thus we can't do conflict checks that require the entire history, because each IR replica may have an incomplete history.
- Yet in OCC we only perform pairwise conflict checks, so if a conflict exists, at least one replica will see the conflicting transaction (a sketch follows).
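A simplified sketch of the pairwise check a replica might run on Prepare, assuming versioned reads; real TAPIR uses timestamp ordering and has more outcomes (e.g., retry), so treat the types and rules here as illustrative only.

```python
from dataclasses import dataclass

PREPARE_OK, ABORT = "PREPARE-OK", "ABORT"

@dataclass
class Txn:
    read_set: dict    # key -> version the transaction read
    write_set: dict   # key -> value the transaction intends to write

def occ_check(txn, committed, prepared):
    """committed: key -> latest committed version; prepared: tentatively
    prepared Txns on this replica (its possibly incomplete history)."""
    for key, version in txn.read_set.items():
        if committed.get(key, 0) != version:           # stale read
            return ABORT
        if any(key in p.write_set for p in prepared):  # read-write conflict
            return ABORT
    for key in txn.write_set:
        if any(key in p.read_set or key in p.write_set for p in prepared):
            return ABORT                               # write conflict
    return PREPARE_OK
```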
TAPIR: Optimistic Concurrency Control (OCC)
[Diagram: applications running OCC above TAPIR replicas]
TAPIR: Execution Phase
- Read() and Write() calls are collected for the transaction, building a read set and a write set. Reads and writes are not replicated. This phase ends when the user issues Commit() or Abort() (a client sketch follows).
- On commit, Prepare() is called and we perform a consensus operation at the IR level, passing in the read and write sets. This is the only consensus operation that TAPIR uses; Commit and Abort are inconsistent operations.
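A sketch of the client-side execution phase, assuming a hypothetical read_from_replica helper that fetches (value, version) from any single replica; the class and its names are illustrative, not the paper's API.

```python
class TapirClient:
    def __init__(self, read_from_replica):
        self.read_set = {}               # key -> version observed
        self.write_set = {}              # key -> buffered value
        self._read = read_from_replica   # assumed: key -> (value, version)

    def read(self, key):
        if key in self.write_set:        # read-your-own-writes
            return self.write_set[key]
        value, version = self._read(key) # unreplicated: any one replica
        self.read_set.setdefault(key, version)
        return value

    def write(self, key, value):
        self.write_set[key] = value      # buffered until Commit()
```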
TAPIR: Commit Decision
- After Prepare() is sent to the replicas, the consensus protocol runs at the IR layer.
- If all partitions reply Prepare-OK, TAPIR sends Commit() to all replicas.
- If any partition replies Abort, TAPIR sends Abort() to all replicas (a sketch follows).
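A sketch of the commit decision across partitions, where prepare stands in for the per-shard IR consensus operation and invoke_inconsistent for the cheap IR operation above; both are assumed helpers.

```python
def commit_transaction(txn_id, txn, partitions, prepare, invoke_inconsistent):
    # Prepare is TAPIR's only consensus operation: one per participant shard.
    votes = [prepare(shard, txn_id, txn) for shard in partitions]
    decision = "COMMIT" if all(v == "PREPARE-OK" for v in votes) else "ABORT"
    for shard in partitions:             # Commit/Abort are inconsistent ops
        invoke_inconsistent(shard, (decision, txn_id))
    return decision == "COMMIT"
```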
TAPIR: Decide() Function
- Must be implemented by the application side.
- Again, it is called when a conflict is detected between results at the IR layer.
- Simple solution: if a majority (f+1) of replicas voted Prepare-OK, decide Prepare-OK (a sketch follows).
- Due to IR's guarantees, no conflicting transaction could get a majority of the replicas to return Prepare-OK.
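The majority rule above, written out as a sketch of TAPIR's decide() for Prepare results; f is the tolerated failure count.

```python
def prepare_decide(results, f):
    # IR's visibility guarantee implies that two conflicting transactions
    # cannot both gather f+1 PREPARE-OK votes.
    ok_votes = sum(1 for r in results if r == "PREPARE-OK")
    return "PREPARE-OK" if ok_votes >= f + 1 else "ABORT"
```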
TAPIR: Linearizable?
- To commit two conflicting transactions through TAPIR, we must execute two Prepare() messages, i.e., two consensus operations in IR.
- IR's visibility guarantee plus OCC ensures that one of the Prepare() operations will abort: it cannot obtain f+1 replicas responding Prepare-OK, due to the conflict.
TAPIR: Fault Tolerance?
- Yes. This is a guarantee from IR.
- If TAPIR receives f+1 Prepare-OK messages from IR, an inconsistent Commit operation is issued.
- Replicas eventually commit the transaction to their records. If a replica does not, it will catch up during synchronization when it copies the master record.
Evaluation
- Implemented TAPIR as a key-value storage system.
- Compared against OCC-Store: 2-phase commit as the transaction layer, running on Multi-Paxos.
- Compared against Lock-Store: Google's Spanner storage system with a few tweaks; runs Multi-Paxos in the replication layer.
Evaluation Comparison with Strong Consistency systems
Evaluation Wide Area Latency
Evaluation Comparison with Weak Consistency systems
Conclusion
- Existing systems waste work by enforcing linearizability in the replication layer.
- TAPIR leverages Inconsistent Replication to provide linearizable transactions.
- Improves latency and throughput on commit: no leader bottleneck, and round-trip time can be halved in the common case.
Questions?