Building Consistent Transactions with Inconsistent Replication

Building Consistent Transactions with Inconsistent Replication Irene Zhang, Naveen Kr. Sharma, Adriana Szekeres, Arvind Krishnamurthy, Dan R. K. Ports University of Washington

Distributed storage systems provide crucial guarantees for datacenter applications: durability, scalability, and fault-tolerance.

Distributed Storage Architecture. Clients issue requests to a distributed storage system made up of storage partitions A, B, and C. Partitioning the data provides scalability; replicating each partition provides fault-tolerance.

Consistency guarantees are important in a distributed system. They guide programmer reasoning about: application state (i.e., what is a valid state, what invariants can I assume), concurrency (i.e., what happens when two writes happen at the same time), and failures (i.e., what happens when the system fails in the middle of an operation).

Some systems have weaker consistency guarantees. Eventual consistency: operations are ordered eventually, and applications resolve conflicts. No atomicity or concurrency control: applications use versioning and explicit locking. Examples: Dynamo, Cassandra, Voldemort.

Some systems have strong consistency guarantees. ACID distributed transactions: help applications manage concurrency. Strong consistency / linearizable isolation: strict serial ordering of transactions. Examples: Spanner, Megastore.

Distributed transactions are expensive in a replicated system. Distributed transactions with strong consistency require replication with strong consistency, and replication with strong consistency imposes a high overhead: lots of cross-replica coordination means higher latency and lower throughput.

Programmers face a choice. Strong consistency guarantees are easier to use but have limited performance. Weak consistency guarantees are harder to use but have better performance.

Our Goal: make transactional storage cheaper to use while maintaining strong guarantees (strong consistency and a general transaction model) by improving latency and throughput for read-write transactions.

Our Approach: provide distributed transactions with strong consistency using a replication protocol with no consistency.

Rest of this talk 1. The cost of strong consistency 2. TAPIR - the Transactional Application Protocol for Inconsistent Replication 3. Evaluation 4. Summary

Why is consistency so expensive? In a transactional storage system, committing a transaction requires cross-partition coordination (two-phase commit) across storage partitions A, B, and C, and cross-replica coordination (Paxos) within each partition. Enforcing strong consistency at both layers is wasted work.

Why is consistency so expensive? Existing transactional storage systems use a transaction protocol and a replication protocol that both enforce strong consistency.
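To make the layering concrete, the following is a minimal, hedged sketch of this conventional design, not any system's actual code; the class and function names (Follower, ShardLeader, two_phase_commit) are illustrative placeholders. The point it shows: every 2PC decision must first survive a leader-based replication round inside its partition, which is exactly the cross-replica coordination TAPIR tries to remove.

# Illustrative sketch only: 2PC across partitions, with every 2PC decision
# itself replicated through a leader-based consensus round inside each
# partition. All names here are hypothetical placeholders.

class Follower:
    def __init__(self):
        self.log = []                        # ordered, replicated operation log
    def accept(self, op):
        self.log.append(op)
        return True                          # ack back to the leader

class ShardLeader:
    def __init__(self, followers):
        self.followers = followers
        self.log = []
    def replicate(self, op):
        # One cross-replica round trip: leader -> followers -> leader.
        acks = 1 + sum(f.accept(op) for f in self.followers)    # +1 for the leader itself
        if acks > (len(self.followers) + 1) // 2:                # majority of the replica group
            self.log.append(op)
            return True
        return False
    def prepare(self, txn_id):
        # Even the prepare vote is replicated before it can be returned.
        return self.replicate(("prepare", txn_id))

def two_phase_commit(shard_leaders, txn_id):
    if all(leader.prepare(txn_id) for leader in shard_leaders):  # phase 1, replicated per shard
        for leader in shard_leaders:
            leader.replicate(("commit", txn_id))                 # phase 2, replicated again
        return "COMMIT"
    return "ABORT"

shards = [ShardLeader([Follower(), Follower()]) for _ in range(3)]
print(two_phase_commit(shards, txn_id=42))                       # COMMIT, after two coordination layers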

Rest of this talk 1. The cost of strong consistency 2. TAPIR - the Transactional Application Protocol for Inconsistent Replication 3. Evaluation 4. Summary

TAPIR Transactional Application Protocol for Inconsistent Replication

TAPIR The first transaction protocol to provide distributed transactions with strong consistency using a replication protocol with no consistency.

Inconsistent Replication: a new replication protocol that provides fault-tolerance without consistency, supports an unordered record instead of an ordered log, requires no cross-replica coordination, and does not rely on synchronous disk writes.
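Below is a minimal sketch, under simplifying assumptions, of what an IR replica with these properties looks like; it is an illustration, not the protocol's actual interface.

# Hedged sketch of an inconsistent-replication replica: an unordered record
# instead of an ordered log, no cross-replica coordination, and no synchronous
# disk writes on the request path. Names and signatures are illustrative.

class IRReplica:
    def __init__(self):
        self.record = {}                       # op_id -> result; a set of ops, not a log

    def handle(self, op_id, operation, execute):
        # Execute locally and reply to the client immediately; replicas never
        # contact each other (or the disk, synchronously) to answer a request.
        if op_id not in self.record:
            self.record[op_id] = execute(operation)
        return self.record[op_id]

# Example: three replicas can receive operations in any order and still each
# end up holding the same unordered record.
replicas = [IRReplica() for _ in range(3)]
for r in replicas:
    r.handle("op1", ("put", "x", 1), execute=lambda op: "ok")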

In TAPIR, cross-transaction coordination goes directly to the replicas of each storage partition (A, B, C): a single round trip, with no leader.
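A hedged sketch of the client-driven commit implied here: the client contacts every replica of each participant partition directly, with no leader, and decides from a quorum of prepare votes in a single round trip. The vote format and the simple majority rule are simplifications; the real protocol distinguishes fast and slow paths with different quorum sizes.

# Hedged sketch: client-coordinated prepare decision in one round trip, no leader.
def tapir_prepare(per_partition_replies):
    # per_partition_replies: one list of replica votes per participant partition.
    for replies in per_partition_replies:
        quorum = len(replies) // 2 + 1
        if sum(1 for v in replies if v == "prepare-ok") < quorum:
            return "ABORT"
    return "COMMIT"

# Example: three partitions, three replicas each; one replica of the second
# partition missed the transaction, but its partition still reaches a quorum.
print(tapir_prepare([["prepare-ok"] * 3,
                     ["prepare-ok", "prepare-ok", "missing"],
                     ["prepare-ok"] * 3]))       # COMMIT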

TAPIR must then handle the effects of inconsistent replication within each storage partition: reordered transactions, and transactions missing entirely at some replicas.

Handling Inconsistency. TAPIR uses several techniques to cope with inconsistency across replicas: loosely synchronized clocks for optimistic transaction ordering at clients, optimistic concurrency control to detect conflicts with a partial history, and multi-versioned storage for applying updates out of order.

TAPIR Technique: Transaction ordering with loosely synchronized clocks. Clients pick the transaction timestamp using their local clock. Replicas validate the transaction at that timestamp, regardless of when they receive it (in the example, a transaction timestamped 142 is ordered before one timestamped 148 even if it arrives later). Clock synchronization is used for performance, not correctness.
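One way to realize this, sketched under assumptions about the timestamp representation (the slide does not prescribe an exact format): tag each transaction with a (local clock reading, client id) pair, so every replica orders and validates transactions by the same timestamps regardless of arrival order.

import time

# Hedged sketch: client-chosen transaction timestamps from loosely synchronized
# clocks. Pairing the clock reading with a client id is one illustrative way to
# make timestamps unique; clock skew can hurt performance (more aborts), not correctness.

def pick_timestamp(client_id):
    return (time.time(), client_id)          # tuples give a total order with tie-breaking

t1 = pick_timestamp(client_id=7)
t2 = pick_timestamp(client_id=3)

# Replicas validate at the chosen timestamps, so two replicas that receive the
# same transactions in different arrival orders still agree on their order.
replica_a_arrival = [t2, t1]
replica_b_arrival = [t1, t2]
assert sorted(replica_a_arrival) == sorted(replica_b_arrival)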


TAPIR Technique: Conflict detection with optimistic concurrency control. OCC checks just one transaction at a time, so a full transaction history is not necessary. Every transaction commits at a majority of replicas, and quorum intersection ensures every transaction is checked.
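A hedged sketch of what a per-transaction, timestamp-based OCC validation can look like; the data structures (latest committed write per key, prepared writes per key) are simplified stand-ins for a replica's state, not TAPIR's exact checks.

# Hedged sketch of per-transaction OCC validation by timestamp.
def occ_check(txn_ts, read_set, write_set, last_committed_write, prepared_write_ts):
    # read_set: {key: timestamp of the version the client read}
    # write_set: iterable of keys the transaction writes
    for key, read_ts in read_set.items():
        if last_committed_write.get(key, 0) > read_ts:
            return "abort"                  # the value we read has been overwritten
        if key in prepared_write_ts:
            return "abort"                  # a concurrent prepared txn writes this key
    for key in write_set:
        if last_committed_write.get(key, 0) > txn_ts:
            return "abort"                  # a later write already committed on this key
    return "prepare-ok"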


TAPIR Technique: Out-of-order updates with multi-versioned storage. The backing store is versioned using transaction timestamps. Replicas periodically synchronize to find missed transactions. The backing store converges to the same state, regardless of when the updates are applied.
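A minimal sketch of a multi-versioned store with the convergence property described here (an illustration under a simplified representation, not TAPIR's implementation): writes are stored under their transaction timestamp, reads return the latest version at or before a timestamp, and applying the same updates in any order yields the same state.

# Hedged sketch of a multi-versioned store keyed by transaction timestamp.
class MVStore:
    def __init__(self):
        self.versions = {}                         # key -> {timestamp: value}

    def apply(self, key, value, ts):
        self.versions.setdefault(key, {})[ts] = value

    def read(self, key, ts):
        older = [t for t in self.versions.get(key, {}) if t <= ts]
        return self.versions[key][max(older)] if older else None

# Applying the same updates in different orders converges to the same state.
a, b = MVStore(), MVStore()
a.apply("x", "old", 142); a.apply("x", "new", 148)      # in timestamp order
b.apply("x", "new", 148); b.apply("x", "old", 142)      # out of order
assert a.versions == b.versions                          # same final state
assert a.read("x", 145) == "old" and b.read("x", 150) == "new"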

Rest of this talk 1. The cost of strong consistency 2. TAPIR - the Transactional Application Protocol for Inconsistent Replication 3. Evaluation 4. Summary

Experimental Questions. Does TAPIR improve latency? In a single cluster? Across datacenters? Does TAPIR improve throughput? For low contention workloads? For high contention workloads?

Deployment. Cluster: servers connected via a 12-switch fat-tree topology; average clock skew ~6us; average RTT ~150us. Wide-area: Google Compute Engine VMs in Asia, Europe, and the US; average clock skew ~2ms; average RTTs of ~260ms (Europe-Asia), ~110ms (Europe-US), and ~166ms (US-Asia).

Workload. Microbenchmark: single-key read-modify-write transactions; 1 shard, 3 replicas; uniform access distribution over 1 million keys. Retwis benchmark: read-write transactions based on Retwis; 5 shards, 3 replicas; Zipf distribution (coefficient = 0.6) over 1 million keys.
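For context, a hedged sketch of how one might generate the Zipf-distributed access pattern described above (coefficient 0.6 over 1 million keys); the sampling code is an illustration, not the benchmark harness.

import numpy as np

# Hedged sketch: sample keys from a bounded Zipf distribution with exponent 0.6
# over 1,000,000 keys, matching the Retwis benchmark parameters on this slide.
NUM_KEYS, THETA = 1_000_000, 0.6
ranks = np.arange(1, NUM_KEYS + 1)
probs = ranks ** -THETA
probs /= probs.sum()

def sample_keys(n, rng=np.random.default_rng(0)):
    return rng.choice(NUM_KEYS, size=n, p=probs)     # key indices 0..NUM_KEYS-1

print(sample_keys(5))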

Systems. TAPIR: transactional storage with strong consistency over inconsistent replication. TXN: transactional storage with strong consistency. SPAN: the Spanner read-write protocol. QW: non-transactional storage with weak consistency, using a write-everywhere, read-anywhere policy.

System Comparison

System   Transaction Protocol   Replication Protocol              Concurrency Control
TAPIR    2PC                    Inconsistent Replication          OCC
TXN      2PC                    Paxos                             OCC
SPAN     2PC                    Paxos                             Strict 2-Phase Locking
QW       None                   Write everywhere, read anywhere   None

Cluster Microbenchmark Latency. Chart: transaction latency (microseconds, 0-600) for TXN, SPAN, QW, and TAPIR.

Wide-area Deployment: replicas in the US, Europe, and Asia, with round-trip times of ~100ms (US-Europe), ~166ms (US-Asia), and ~260ms (Europe-Asia). With leader-based replication, every operation must go through the leader replica, adding wide-area round trips for clients in the other regions.
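As a rough illustration of why leader placement matters in this deployment (back-of-envelope arithmetic over the RTTs on this slide, not the paper's measured results, and assuming a US leader purely for the example): a client in Europe committing through a US leader pays the Europe-US round trip plus the leader's replication round trip, while a leaderless single-round-trip protocol only waits for the client's nearest majority of replicas.

# Back-of-envelope sketch using the RTTs from this slide (ms). Assumes the
# leader must hear from one other region before replying; real protocols
# differ in detail, so treat these as illustrative lower bounds only.
RTT = {("Europe", "US"): 100, ("Asia", "US"): 166, ("Asia", "Europe"): 260}

def rtt(a, b):
    return 0 if a == b else RTT[tuple(sorted((a, b)))]

def leader_based_commit(client, leader, regions):
    replication = min(rtt(leader, r) for r in regions if r != leader)   # nearest majority peer
    return rtt(client, leader) + replication

def leaderless_commit(client, regions):
    return sorted(rtt(client, r) for r in regions)[1]    # wait for the 2nd-closest replica (majority of 3)

regions = ["US", "Europe", "Asia"]
print(leader_based_commit("Europe", "US", regions))      # ~200 ms
print(leaderless_commit("Europe", regions))              # ~100 ms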

Wide-area Retwis Latency. Chart: transaction latency (milliseconds, 0-800) for TXN, SPAN, QW, and TAPIR, broken down by client location (US, Europe, Asia).

Microbenchmark Throughput. Chart: transaction throughput (transactions/second, 0-18000) versus number of clients (0-40) for TXN, SPAN, TAPIR, and QW, with a roughly 2x throughput gap highlighted in TAPIR's favor.

Retwis Throughput. Chart: transaction throughput (transactions/second, 0-12000) versus number of clients (0-80) for TXN, SPAN, TAPIR, and QW.

Summary TAPIRs are surprisingly fast. Replication does not have to be consistent for transactions to be. Transactions do not have to be expensive.