Building Consistent Transactions with Inconsistent Replication

Building Consistent Transactions with Inconsistent Replication
Irene Zhang, Naveen Kr. Sharma, Adriana Szekeres, Arvind Krishnamurthy, Dan R. K. Ports (University of Washington)
DB Reading Group, Fall 2015. Slides by Dana Van Aken.

Motivation
- App programmers prefer distributed transactional storage with strong consistency: ease of use, strong guarantees.
- Tradeoff with fault tolerance:
  - Strongly consistent replication protocols are expensive (e.g. Paxos): Megastore, Spanner.
  - Weakly consistent protocols are less costly but provide fewer (if any) guarantees (e.g. eventual consistency): Dynamo, Cassandra.

Common architecture for distributed transactional systems
- Distributed transaction protocol: atomic commitment protocol (2PC) + concurrency control mechanism, e.g. 2PC + (2PL / OCC / MVCC).
- Replication protocol: e.g. Paxos, Viewstamped Replication.
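
To make the layering concrete, here is a minimal sketch (class and method names are hypothetical, not from the paper) of a 2PC coordinator whose every prepare/commit message is first ordered by a consensus-based replication layer within each shard:

```python
# Hypothetical sketch of the conventional layering: transactions over consensus.
class ReplicationLayer:
    """Stand-in for a strongly consistent replication protocol (Paxos/VR)."""
    def replicate(self, shard, op):
        # Orders `op` in the shard's replicated log before executing it;
        # each call costs at least one consensus round within the shard.
        raise NotImplementedError

class TwoPhaseCommitCoordinator:
    """2PC + concurrency control, layered on the replication protocol."""
    def __init__(self, replication: ReplicationLayer):
        self.replication = replication

    def commit(self, txn, shards):
        # Phase 1: prepare at every participant; each prepare is itself
        # replicated through consensus inside its shard.
        votes = [self.replication.replicate(s, ("prepare", txn)) for s in shards]
        decision = "commit" if all(v == "prepare-ok" for v in votes) else "abort"
        # Phase 2: the decision is replicated again at every shard.
        for s in shards:
            self.replication.replicate(s, (decision, txn))
        return decision
```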

Spanner-like system
- Writes are buffered at the client until commit.
- Reads must go to shard leaders to ensure order across replicas (the client gets the value & timestamp of any data read).
- Commit takes at least 2 round trips.
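
A rough client-side sketch of the behavior listed above (names are illustrative): reads contact the shard leader and record the returned timestamp, writes are buffered until commit.

```python
# Illustrative Spanner-like client: buffered writes, reads through shard leaders.
class SpannerLikeClient:
    def __init__(self, shard_leaders):
        self.shard_leaders = shard_leaders   # shard id -> leader replica stub
        self.write_buffer = {}               # key -> value, flushed only at commit
        self.read_set = {}                   # key -> timestamp of the version read

    def read(self, shard, key):
        # Reads go to the shard leader so every replica observes the same order;
        # the leader returns both the value and the timestamp of the data read.
        value, timestamp = self.shard_leaders[shard].read(key)
        self.read_set[key] = timestamp
        return value

    def write(self, key, value):
        # Writes are buffered at the client until commit (no round trip yet).
        self.write_buffer[key] = value
```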

Observation
- Existing distributed transactional storage systems that integrate both protocols waste work and performance due to this redundancy.
- Is it possible to remove this redundancy and still provide read-write transactions with the same guarantees as Spanner? YES:
  - linearizable transaction ordering
  - globally consistent reads across the database at a timestamp
- How? Replication with no consistency.

Key Contributions
- Define IR (inconsistent replication): a new replication protocol, fault tolerance without consistency.
- Design TAPIR (Transactional Application Protocol for IR): a new distributed transaction protocol, linearizable transaction ordering using IR (as in Spanner).
- Build/evaluate TAPIR-KV: high-performance transactional storage (TAPIR + IR).

Inconsistent Replication
- Fault tolerance without consistency: the ordered op log is replaced by an unordered op set.
- Used with a higher-level protocol: the application protocol decides/recovers the outcome of conflicting operations.
- Ops can be invoked in 2 modes: inconsistent and consensus.
  - Both: execute in any order.
  - Consensus only: returns a single consensus result.
- Guarantees:
  - Fault tolerance: successful ops & consensus results are persistent.
  - Visibility: for each pair of operations, at least one is visible to the other.

IR Application Protocol Interface
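
Below is a rough sketch of the IR application protocol interface as described in the paper; the method names follow the paper's interface, but the Python rendering itself is illustrative.

```python
# Rough sketch of the IR interface (Python rendering is illustrative).
class IRClient:
    def invoke_inconsistent(self, op):
        """Execute op at replicas in any order; no agreed-upon return value."""

    def invoke_consensus(self, op, decide):
        """Execute op at replicas in any order but return a single consensus
        result; `decide` is supplied by the application protocol and picks
        one result from the set of replica replies."""

class IRAppReplicaUpcalls:
    """Upcalls the application protocol (e.g. TAPIR) implements at each replica."""
    def exec_inconsistent(self, op): ...
    def exec_consensus(self, op): ...   # returns this replica's result for op
    def sync(self, record): ...         # bring the replica in line with the master record
    def merge(self, d, u): ...          # resolve consensus results during a view change
```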

IR Protocol: Operation Processing
- IR can complete inconsistent operations with a single round trip to f+1 replicas and no coordination across replicas.
- Consensus operations:
  - Fast path: if ⌈3f/2⌉+1 replicas return matching results (the common case), a single round trip.
  - Slow path: otherwise, two round trips to at least f+1 replicas.
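
A minimal sketch of the consensus fast-path/slow-path quorum check, with the counts taken from the slide (⌈3f/2⌉+1 matching replies for the fast path, otherwise fall back to the application's decide function and a second round trip); everything else here is illustrative.

```python
import math
from collections import Counter

def consensus_result(replies, f, decide):
    """Illustrative fast/slow-path choice for one IR consensus operation.

    replies: per-replica results from the first round trip
    f:       number of tolerated failures (2f+1 replicas per shard)
    decide:  application-supplied function mapping a list of results to one result
    """
    fast_quorum = math.ceil(3 * f / 2) + 1
    result, matching = Counter(replies).most_common(1)[0]
    if matching >= fast_quorum:
        # Fast path: enough matching replies, finished in a single round trip.
        return result
    # Slow path: the application decides, and a second round trip (not shown)
    # records the decided result at f+1 replicas before returning.
    return decide(replies)
```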

IR Protocol: Replica Recovery & Synchronization
- Uses a single protocol (a view change) for both recovering failed replicas and synchronizing replicas.
- The protocol is identical to Viewstamped Replication (Oki, Liskov) except that the leader must merge records from the latest view.
  - The leader relies on the application protocol to determine consensus results.
  - The result of the merge is the master record, used to synchronize the other replicas.

TAPIR: Transactional Application Protocol for IR
- Efficiently leverages IR's weak guarantees to provide high-performance linearizable transactions (as in Spanner).
- Clients: front-end app servers (possibly in the same datacenter).
- Applications interact with TAPIR (not IR); once an app calls commit, it cannot abort. This allows TAPIR to use clients as 2PC coordinators.
- Replicas keep a log of committed/aborted txns in timestamp order.
- Replicas also maintain a versioned data store.
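
A small sketch of the per-replica state described above: a transaction log kept in timestamp order plus a multi-versioned store. The structure is hypothetical; the paper does not prescribe this exact layout.

```python
# Hypothetical sketch of TAPIR replica state: txn log + versioned data store.
class TapirReplicaState:
    def __init__(self):
        self.txn_log = []   # (timestamp, txn_id, "commit" | "abort"), timestamp order
        self.store = {}     # key -> [(timestamp, value), ...], timestamp order

    def record_txn(self, timestamp, txn_id, outcome):
        self.txn_log.append((timestamp, txn_id, outcome))
        self.txn_log.sort(key=lambda e: e[0])

    def write(self, key, value, timestamp):
        versions = self.store.setdefault(key, [])
        versions.append((timestamp, value))
        versions.sort(key=lambda v: v[0])

    def read(self, key, timestamp):
        # Return the newest version no later than `timestamp`, or None.
        latest = None
        for ts, value in self.store.get(key, []):
            if ts > timestamp:
                break
            latest = (ts, value)
        return latest
```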

TAPIR: Transaction Processing
- Uses OCC: concentrates all ordering decisions into a single set of validation checks.
- Only requires one consensus operation ("prepare").
- Decide function: commit if a majority of replicas replied prepare-ok.
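
An illustrative sketch of the prepare-time OCC check and of the client-side decide function for the "prepare" consensus operation. It is simplified relative to TAPIR's full validation rules, builds on the hypothetical TapirReplicaState sketch above, and assumes `txn` carries a read_set ({key: timestamp}) and a write_set ({key: value}).

```python
def occ_check(state, txn, proposed_ts):
    """Simplified OCC validation run by a replica on 'prepare' (illustrative)."""
    for key, read_ts in txn.read_set.items():
        latest = state.read(key, float("inf"))
        if latest is not None and latest[0] > read_ts:
            return "abort"        # a newer version of a key we read has committed
    for key in txn.write_set:
        latest = state.read(key, float("inf"))
        if latest is not None and latest[0] > proposed_ts:
            return "abort"        # our write would be ordered behind a later commit
    return "prepare-ok"

def prepare_decide(results):
    """Decide function for 'prepare': commit if a majority replied prepare-ok."""
    ok = sum(1 for r in results if r == "prepare-ok")
    return "commit" if ok > len(results) // 2 else "abort"
```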

Spanner-like system vs TAPIR

Experimental Setup
- Built TAPIR-KV (transactional key-value store).
- Google Compute Engine (GCE), 3 geographical regions (US, Europe, Asia); VMs placed in different availability zones.
- Server specs: virtualized single-core 2.6 GHz Intel Xeon, 8 GB RAM, 1 Gb NIC.
- Comparison systems: OCC-STORE (standard OCC + 2PC), LOCK-STORE (Spanner).
- Workloads: Retwis, YCSB+T.

Results: RTT & clock synchronization
- RTTs: US-Europe 110 ms, US-Asia 165 ms, Europe-Asia 260 ms.
- Low clock skew (0.1-3.4 ms), BUT with a long tail: worst case ~27 ms.
- Unlike Spanner, TAPIR's performance depends on the actual clock skew, not a worst-case bound.

Avg. Retwis transaction latency vs. throughput
- Retwis, single data center (US region only)
- 10 shards, 3 replicas/shard
- 10M keys, Zipf coefficient: 0.75

Avg. wide-area latency for Retwis transactions
- 1 replica per shard in each geographical region
- leader in US (if any)
- client in US, Asia, or Europe

Abort rates at varying Zipf coefficients
- single region, replicas in 3 availability zones
- constant load of 5,000 txns/s

Comparison with weakly consistent storage systems
- YCSB+T, single shard, 3 replicas, 1M keys
- MongoDB & Redis: master-slave, set to use synchronous replication
- Cassandra: replication level set to 2

Conclusion
- It is possible to build distributed transactions with better performance and strong consistency semantics on top of a replication protocol with no consistency.
- Relative to conventional transactional storage systems: lowers commit latency by 50%, increases throughput by 3x.
- Performance is competitive with weakly consistent systems while offering much stronger guarantees.

The end!

Techniques to improve performance
- Optimize for read-only transactions: Megastore, Spanner.
- Use more restrictive transaction models: VoltDB.
- Provide weaker consistency guarantees: Dynamo, MongoDB.

Observation
- Existing distributed transactional storage systems that integrate both protocols waste work and performance because both enforce strong consistency.

IR Protocol
- Unique operation ID: IR client ID + op counter.
- Replicas maintain unordered records of executed ops and consensus results.
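
A tiny sketch (illustrative structures) of the identifiers and record described above: an operation ID pairs the IR client ID with a per-client counter, and each replica keeps an unordered record of executed ops and consensus results.

```python
from dataclasses import dataclass, field

@dataclass(frozen=True)
class OperationID:
    client_id: int    # identifies the IR client
    op_counter: int   # monotonically increasing per client

@dataclass
class IRReplicaRecord:
    # Unordered record: op_id -> (operation, "inconsistent" | "consensus", result).
    entries: dict = field(default_factory=dict)

    def add(self, op_id: OperationID, op, op_type: str, result=None):
        self.entries[op_id] = (op, op_type, result)
```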