TAPIR. By Irene Zhang, Naveen Sharma, Adriana Szekeres, Arvind Krishnamurthy, and Dan Ports. Presented by Todd Charlton

Outline: Problem Space, Inconsistent Replication, TAPIR, Evaluation, Conclusion

Problem: Develop an app to send pictures of chocolate labs. How do we save the pictures?

Problem: We are building a distributed storage system. If we want strong consistency, we use replication protocols like Paxos, which incur a high performance cost. If we want efficient protocols, we can only guarantee weak consistency.

Problem: (diagram) The guarantees we want (fault tolerance, scalability, linearizability) come from two layers: a distributed transaction protocol and a replication protocol.

Problem: Existing architectures (diagram).

Problem: We are enforcing serial ordering in two places: between replicas and between partitions.

Problem: (diagram, revisited) The guarantees we want (fault tolerance, scalability, linearizability) come from two layers: a distributed transaction protocol and a replication protocol.

Inconsistent Replication: Just make the replication layer inconsistent! Operations can execute in any order. It still provides fault tolerance, with no costly consistency protocol.

Inconsistent Replication: Guarantees
- Fault tolerance: at any time, every operation in the operation set is in the record of at least one replica in any quorum of f+1 replicas.
- Visibility: for any two operations in the operation set, at least one is visible to the other.
- Consensus: the result of every consensus operation has agreement from at least a majority of the replicas.
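
A quick way to convince yourself of the fault-tolerance guarantee is quorum intersection: with 2f+1 replicas, any two quorums of size f+1 overlap in at least one replica. A minimal Python check (not from the paper, purely illustrative; f is an assumed example value):

    from itertools import combinations

    f = 2                       # tolerated failures (example value, assumed)
    n = 2 * f + 1               # IR replica group size
    replicas = range(n)

    # Any two quorums of f+1 out of 2f+1 replicas must share a replica, so an
    # operation recorded at one quorum is seen by every later quorum.
    for q1 in combinations(replicas, f + 1):
        for q2 in combinations(replicas, f + 1):
            assert set(q1) & set(q2), "quorums must intersect"
    print("every pair of", f + 1, "-replica quorums out of", n, "intersects")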

Inconsistent Replication: (diagram: application clients, replicas, and conflict detection at the application layer)

Inconsistent Replication: Inconsistent Execution (1 round trip)
- The application calls InvokeInconsistent(); the replication layer eventually makes the ExecInconsistent() upcall.
- The client sends Propose(op, id) to all replicas.
- Each replica marks [id, op] as Tentative in its record and replies to the client with Reply(id).
- Once the client receives f+1 replies for an id, it sends Finalize(id) to all replicas.
- On receiving Finalize, a replica transitions that op from Tentative to Finalized and makes the ExecInconsistent() upcall to the application layer.
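
A rough client-side sketch of this path, simulated in-process in Python; the message names (Propose, Reply, Finalize) follow the slide, while the Replica class and invoke_inconsistent helper are illustrative stand-ins, not the paper's API:

    f = 1                                     # tolerated failures (assumed)

    class Replica:
        def __init__(self):
            self.record = {}                  # op id -> (op, state)

        def propose(self, op_id, op):         # Propose(op, id)
            self.record[op_id] = (op, "TENTATIVE")
            return op_id                      # Reply(id)

        def finalize(self, op_id):            # Finalize(id)
            op, _ = self.record[op_id]
            self.record[op_id] = (op, "FINALIZED")
            # here the replica would make the ExecInconsistent() upcall

    replicas = [Replica() for _ in range(2 * f + 1)]

    def invoke_inconsistent(op_id, op):
        replies = [r.propose(op_id, op) for r in replicas]   # send to all replicas
        assert len(replies) >= f + 1                         # wait for f+1 replies
        for r in replicas:                                   # then Finalize
            r.finalize(op_id)

    invoke_inconsistent("commit-tx1", ("COMMIT", "tx1"))
    print(replicas[0].record)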

Inconsistent Replication

Inconsistent Replication: Consensus Execution (fast path, 1 round trip)
- The application calls InvokeConsensus(); the replication layer responds with the ExecConsensus() upcall.
- The client sends Propose(op, id) to all replicas.
- Each replica marks [id, op, result] as Tentative in its record and replies with Reply(id, result).
- Fast path (fast quorum): if the client receives ⌈3f/2⌉+1 matching results, it returns the result to the application layer and sends Finalize to all replicas.
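
The fast-quorum size here is ⌈3f/2⌉+1 matching results out of 2f+1 replicas. A small illustrative helper (the function names are mine, not the paper's):

    import math
    from collections import Counter

    def fast_quorum(f):
        # fast path needs ceil(3f/2) + 1 matching results out of 2f + 1 replicas
        return math.ceil(3 * f / 2) + 1

    def fast_path_result(f, results):
        """Return the result if some value reached a fast quorum, else None (slow path)."""
        value, count = Counter(results).most_common(1)[0]
        return value if count >= fast_quorum(f) else None

    print(fast_quorum(2))                                          # f=2 -> 4 of 5 replicas
    print(fast_path_result(2, ["PREPARE-OK"] * 4 + ["ABORT"]))     # -> PREPARE-OK
    print(fast_path_result(2, ["PREPARE-OK"] * 3 + ["ABORT"] * 2)) # -> None, take slow path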

Inconsistent Replication: Consensus Execution (slow path, 2 round trips)
- The slow path is taken when the client does not reach a fast quorum of ⌈3f/2⌉+1 matching results.
- The client waits for f+1 responses, computes a result from the application's decide() function, and sends Finalize(id, result).
- When a replica receives Finalize, it records the op as Finalized (updating its record if the result it had recorded was different) and sends Confirm(id) to the client.
- Once the client receives f+1 Confirm messages, it returns the result to the application.
- Sequence: the application sends InvokeConsensus(), the replication layer calls decide(), the client sends the decision, and the replication layer makes the ExecConsensus() upcall.
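
A compact sketch of the slow path under the same assumptions as above: the client has at least f+1 replies, asks the application's decide() for a result, then finalizes and waits for f+1 Confirm messages. StubReplica and the helper names are illustrative:

    f = 1                                              # tolerated failures (assumed)

    class StubReplica:
        def finalize_consensus(self, op_id, result):
            # the replica records the op as FINALIZED, overwriting a different
            # tentative result if necessary, then answers Confirm(id)
            return "CONFIRM"

    def slow_path(op_id, replies, replicas, decide):
        assert len(replies) >= f + 1                   # f+1 Reply(id, result) messages
        result = decide(replies)                       # application-level decide()
        confirms = sum(r.finalize_consensus(op_id, result) == "CONFIRM" for r in replicas)
        assert confirms >= f + 1                       # f+1 Confirm(id) messages
        return result                                  # returned to the app after 2 RTTs

    print(slow_path("prepare-tx1",
                    ["PREPARE-OK", "ABORT"],
                    [StubReplica() for _ in range(2 * f + 1)],
                    lambda rs: "PREPARE-OK" if rs.count("PREPARE-OK") >= f + 1 else "ABORT"))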

Inconsistent Replication

Inconsistent Replication: Synchronization
- IR uses view changes.
- But wait, doesn't that imply a leader?
- Leaders exist solely during a view change. Their only job is to ensure that at least f+1 replicas are up to date.

Inconsistent Replication: Synchronization
- When triggered, the leader collects f+1 replicas' logs.
- It merges all Finalized records into a master record.
- If a record is Tentative, the transaction layer must Decide() what to do with it.
- From the transaction layer's responses, the master record R is created. All replicas then update their records to match the master.
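
A sketch of the merge step under assumed record shapes (each op id maps to (op, result, state)); the transaction_decide callback stands in for the transaction layer's Decide():

    # Finalized entries are kept as-is; tentative entries are resolved by the
    # transaction layer; the resulting master record is then adopted by replicas.
    def build_master_record(replica_records, transaction_decide):
        master = {}
        for record in replica_records:                 # logs collected from f+1 replicas
            for op_id, (op, result, state) in record.items():
                if state == "FINALIZED":
                    master[op_id] = (op, result, "FINALIZED")
                elif op_id not in master:              # tentative and not yet resolved
                    master[op_id] = (op, transaction_decide(op, result), "FINALIZED")
        return master

    logs = [
        {"p1": (("PREPARE", "tx1"), "PREPARE-OK", "FINALIZED")},
        {"p2": (("PREPARE", "tx2"), "PREPARE-OK", "TENTATIVE")},
    ]
    print(build_master_record(logs, lambda op, result: result))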

Inconsistent Replication: Good and Bad
- Good: 1 round trip in the common case, 2 round trips in the worst case.
- Good: no cross-replica communication needed.
- Bad: replicas don't appear as a single machine (occasional synchronization is needed).
- Bad: requires a well-designed transaction layer on top.

TAPIR: the Transactional Application Protocol for Inconsistent Replication.
- Designed specifically to interface with IR.
- Uses 2PC across the partitions of replicas.
- This is the transaction/application layer: users interact with it, not with IR.

TAPIR

TAPIR: Optimistic Concurrency Control (OCC)
- IR guarantees us visibility: in any pair of consensus operations, at least one is visible to the other.
- We can't do conflict checks that require the entire history, because each IR replica may have an incomplete history.
- Yet in OCC we only perform pairwise conflict checks: if a conflict exists, at least one replica will see the conflicting transaction.
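
An illustrative pairwise check at one replica (simplified to set intersection; TAPIR's actual validation is timestamp-based, so treat this only as the flavor of a pairwise conflict check):

    # Validate an incoming transaction's read/write sets only against transactions
    # this replica has already seen; no global history is required.
    def occ_check(read_set, write_set, seen):
        """seen: list of (read_set, write_set) pairs of locally known transactions."""
        for other_reads, other_writes in seen:
            if (write_set & other_reads or write_set & other_writes
                    or read_set & other_writes):
                return "ABORT"            # conflicts with a transaction visible here
        return "PREPARE-OK"

    seen = [({"x"}, {"y"})]               # one previously prepared transaction
    print(occ_check({"y"}, {"z"}, seen))  # reads y, which another txn writes -> ABORT
    print(occ_check({"a"}, {"b"}, seen))  # no overlap -> PREPARE-OK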

TAPIR: Optimistic Concurrency Control (diagram: application clients and replicas)

TAPIR
- Read() and Write() calls are collected for the transaction, building a read set and a write set. This phase ends when the user issues Commit() or Abort().
- Prepare() is then called and performed as a consensus operation at the IR level, passing in the read and write sets. This is the only consensus operation that TAPIR uses.
- Commit and Abort are inconsistent operations, and reads and writes are not replicated at all.
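
A client-side sketch of this flow; the two IR hooks passed into the constructor are hypothetical stand-ins for InvokeConsensus/InvokeInconsistent, and the local store is just a dict:

    # Buffer reads and writes locally, then Prepare as the single consensus
    # operation and Commit/Abort as inconsistent operations.
    class TapirTransaction:
        def __init__(self, invoke_consensus, invoke_inconsistent, store):
            self.read_set, self.write_set = {}, {}
            self.consensus, self.inconsistent, self.store = (
                invoke_consensus, invoke_inconsistent, store)

        def read(self, key):                    # reads are served locally, not replicated
            if key in self.write_set:
                return self.write_set[key]
            value = self.store.get(key)
            self.read_set[key] = value
            return value

        def write(self, key, value):            # writes are buffered, not replicated
            self.write_set[key] = value

        def commit(self):
            vote = self.consensus("PREPARE", self.read_set, self.write_set)
            outcome = "COMMIT" if vote == "PREPARE-OK" else "ABORT"
            self.inconsistent(outcome, self.write_set)
            return outcome == "COMMIT"

    tx = TapirTransaction(lambda *args: "PREPARE-OK", lambda *args: None, {"x": 1})
    tx.write("x", tx.read("x") + 1)
    print(tx.commit())                          # -> True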

TAPIR
- After Prepare() is sent to the replicas, the consensus protocol is followed at the IR layer.
- If all partitions reply with Prepare-OK, TAPIR sends Commit() to all replicas.
- If any partition responds with an abort, TAPIR sends Abort() to all replicas.
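
The cross-partition rule itself is a one-liner; a hypothetical helper just to make it concrete:

    # 2PC decision at the TAPIR layer: commit only if every participant partition's
    # IR consensus result for Prepare was PREPARE-OK.
    def commit_decision(partition_results):
        ok = all(result == "PREPARE-OK" for result in partition_results.values())
        return "COMMIT" if ok else "ABORT"

    print(commit_decision({"shard-A": "PREPARE-OK", "shard-B": "PREPARE-OK"}))  # COMMIT
    print(commit_decision({"shard-A": "PREPARE-OK", "shard-B": "ABORT"}))       # ABORT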

TAPIR: the Decide() function
- Must be implemented by the application side.
- Again, it is called when the IR layer detects a conflict between replica results.
- Simple solution: if a majority (f+1) of replicas returned Prepare-OK, decide Prepare-OK. By our IR guarantees, no conflicting transaction could get a majority of the replicas to return Prepare-OK.
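
A sketch of that simple-majority rule (TAPIR's full decide function handles more result types; this shows only the rule stated on the slide):

    # Decide PREPARE-OK only if a majority (f+1 of 2f+1) of replica results agree;
    # a conflicting transaction cannot also gather f+1 PREPARE-OK votes.
    def decide_prepare(results, f):
        ok_votes = sum(1 for r in results if r == "PREPARE-OK")
        return "PREPARE-OK" if ok_votes >= f + 1 else "ABORT"

    print(decide_prepare(["PREPARE-OK", "PREPARE-OK", "ABORT"], f=1))  # PREPARE-OK
    print(decide_prepare(["PREPARE-OK", "ABORT", "ABORT"], f=1))       # ABORT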

TAPIR: Linearizable?
- To commit two conflicting transactions through TAPIR, we must execute two Prepare() messages, i.e., two consensus operations in IR.
- Visibility in IR plus OCC guarantees that one of the Prepare() operations will abort: because of the conflict, it cannot obtain f+1 replicas that respond with Prepare-OK.

TAPIR: Fault Tolerance?
- Yes; this is a guarantee from IR.
- If TAPIR receives f+1 Prepare-OK messages from IR, an inconsistent Commit operation is issued.
- Replicas eventually commit the transaction to their records. If a replica does not, it will eventually catch up during synchronization when it copies the master record.

Evaluation
- Implemented TAPIR as a key-value storage system.
- Compared against OCC-Store: 2-phase commit as the transaction layer, running on Multi-Paxos.
- Compared against Lock-Store: Google's Spanner storage design with a few tweaks, running Multi-Paxos in the replication layer.

Evaluation Comparison with Strong Consistency systems

Evaluation Wide Area Latency

Evaluation Comparison with Weak Consistency systems

Conclusion
- Existing systems waste work by enforcing linearizability in the replication layer.
- TAPIR leverages Inconsistent Replication to provide linearizable transactions.
- It improves latency and throughput on commit, has no leader bottleneck, and the round-trip time can be halved in the common case.

Questions?