Exploiting Commutativity For Practical Fast Replication. Seo Jin Park and John Ousterhout


Overview
- Problem: consistent replication adds latency and throughput overheads. Why? Replication happens after ordering.
- Key idea: exploit commutativity to enable fast replication before ordering.
- CURP (Consistent Unordered Replication Protocol):
  - Clients replicate in 1 round-trip time (RTT) if operations are commutative.
  - Simple augmentation on existing primary-backup systems.
- Results:
  - RAMCloud performance improvements: latency 14 µs → 7.1 µs (no replication: 6.1 µs); throughput 184 kops/s → 728 kops/s (~4x).
  - Redis cache is now fault-tolerant at small cost (12% latency, 18% throughput overhead).
Slide 2

Consistent Replication Doubles Latencies
[Diagram: in an unreplicated system, a write completes in 1 RTT (client → server → client); in a primary-backup replicated system, a write takes 2 RTTs (client → primary → backups → primary → client).]
Slide 3

Strawman: 1 RTT Replication
[Diagram: the client sends writes directly to the primary and backups in parallel; the primary applies x ← 1 then x ← 2 (state: x = 2), while a backup applies them in the opposite order (state: x = 1).]
Strong consistency is broken!
Slide 4

What Makes Consistent Replication Expensive?
Consistent replication protocols must solve two problems:
- Consistent ordering: all replicas should appear to execute operations in the same order.
- Durability: once completed, an operation must survive crashes.
Previous protocols combined the two requirements.
[Diagram: operations are first sent to the primary for serialization (1 RTT), then replicated to the backups (1 RTT), so each operation takes 2 RTTs to complete.]
Slide 5

Exploiting Commutativity to Defer Ordering
- For performance: totally ordered replication cannot be done in fewer than 2 RTTs.
- Instead, replicate just for durability and exploit commutativity to defer ordering.
- Safe to reorder if operations are commutative (e.g. updates on different keys; see the sketch below).
- Consistent Unordered Replication Protocol (CURP):
  - When concurrent operations commute, replicate without ordering.
  - When they do not, fall back to slow totally-ordered replication.
Slide 6
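To make the commutativity condition concrete, here is a minimal sketch (in Python, purely illustrative and not code from the paper) of the check a key-value store can use: single-key writes commute exactly when they touch different keys.

# Minimal sketch: two single-key writes commute iff they target different keys.
from dataclasses import dataclass

@dataclass(frozen=True)
class WriteOp:
    key: str
    value: str

def commutes(a: WriteOp, b: WriteOp) -> bool:
    """Writes to different keys can be reordered without changing the outcome."""
    return a.key != b.key

# Example: x <- 1 and y <- 5 commute; x <- 1 and x <- 2 do not.
assert commutes(WriteOp("x", "1"), WriteOp("y", "5"))
assert not commutes(WriteOp("x", "1"), WriteOp("x", "2"))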

Overview of CURP
- The primary returns execution results immediately (before syncing to backups).
- Clients directly replicate to witnesses to ensure durability.
[Diagram: the client completes an operation (e.g. z ← 7) in 1 RTT by recording it at witnesses while the primary executes it; the primary syncs to backups asynchronously.]
- Witnesses hold no ordering info.
- Witness data are temporary: they are garbage-collected once the primary replicates to backups.
- Witness data are replayed during recovery.
Slide 7

Normal Operation
- Clients send an RPC request to the primary and witnesses in parallel.
- If all witnesses accepted (saved) the request, the client can complete the operation safely without syncing to backups.
[Diagram: the client sends "execute z ← 7" to the primary and "record z ← 7" to the witnesses; the primary replies "result: ok", the witnesses reply "accepted", and the primary syncs to backups asynchronously.]
Slide 8

Normal Operation (continued)
- If any witness rejected (did not save) the request, the client must wait for the primary to sync to backups. The operation then completes mostly in 2 RTTs (worst case 3 RTTs). A client-side sketch follows below.
[Diagram: (1) the client sends "execute z ← 2" to the primary and "record z ← 2" to the witnesses; a witness rejects, so (2) the client sends a sync request to the primary and waits for "synced".]
- When a primary receives a sync request, syncing to backups is usually already completed or at least initiated.
Slide 9
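The client-side logic of slides 8 and 9 can be summarized in a rough sketch. The StubPrimary and StubWitness classes and all method names below are assumptions made for illustration; they are not RAMCloud's actual RPC interfaces.

# Sketch of CURP's client fast path with stand-in primary/witness objects.
from concurrent.futures import ThreadPoolExecutor

class StubPrimary:
    """Stand-in primary: executes writes, syncs to backups on demand."""
    def __init__(self):
        self.store = {}
    def execute(self, key, value):
        self.store[key] = value          # result returned before backup sync
        return "ok"
    def sync(self):
        return "synced"                  # replicate buffered writes to backups

class StubWitness:
    """Stand-in witness: saves a write only if it commutes with saved ones."""
    def __init__(self):
        self.saved = {}
    def record(self, key, value):
        if key in self.saved:            # non-commutative with a saved write
            return "rejected"
        self.saved[key] = value
        return "accepted"

def curp_write(primary, witnesses, key, value):
    """Send the write to the primary and all witnesses in parallel (1 RTT);
    fall back to an explicit sync (extra RTT) if any witness rejects."""
    with ThreadPoolExecutor() as pool:
        result = pool.submit(primary.execute, key, value)
        replies = list(pool.map(lambda w: w.record(key, value), witnesses))
    if all(r == "accepted" for r in replies):
        return result.result()           # durable on witnesses: done in 1 RTT
    primary.sync()                       # slow path: wait for backup sync
    return result.result()

primary, witnesses = StubPrimary(), [StubWitness(), StubWitness()]
print(curp_write(primary, witnesses, "z", 7))   # fast path: all witnesses accept
print(curp_write(primary, witnesses, "z", 2))   # a witness rejects -> sync path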

Crash Recovery
- First load data from a backup, then replay the requests recorded in a witness.
[Diagram: the crashed primary had state x = 2, y = 5, z = 7; the new primary (1) restores from a backup and (2) replays the retried requests z ← 7 and y ← 5 held by a witness, ending in the same state x = 2, y = 5, z = 7.]
Slide 10

3 Potential Problems for Strong Consistency
1. Replay from a witness may be out of order → witnesses only keep commutative requests.
2. Primaries may reveal not-yet-durable data to other clients → primaries detect and block reads of unsynced data.
3. Requests replayed from a witness may have already been recovered from a backup → detect and avoid duplicate execution using RIFL.
Slide 11

P1. Replay From Witness May Be Out Of Order
- A witness has no way to know the operation order determined by the primary.
- The witness detects non-commutative operations and rejects them; the client then needs to issue an explicit sync request to the primary. (A sketch of this check follows below.)
- Commutative requests are okay to replay in any order.
[Diagram: a witness holding x ← 3 accepts y ← 5 (commutes, so any replay order is fine) but rejects x ← 4 (conflicts with x ← 3).]
Slide 12
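A minimal sketch of the witness-side check (illustrative only; the class and method names are assumptions): a witness saves a request only if its keys are disjoint from the keys of every request it has already saved, so anything it holds can be replayed in any order.

# Sketch of a witness that only accepts mutually commutative requests.
class Witness:
    def __init__(self):
        self.saved = []                    # list of (request_id, keys) pairs
    def record(self, request_id, keys):
        for _, saved_keys in self.saved:
            if keys & saved_keys:          # overlapping keys: not commutative
                return "rejected"          # client must fall back to sync path
        self.saved.append((request_id, set(keys)))
        return "accepted"
    def gc(self, synced_ids):
        """Drop requests the primary has already made durable on backups."""
        self.saved = [(r, k) for r, k in self.saved if r not in synced_ids]

w = Witness()
print(w.record("op1", {"x"}))   # accepted
print(w.record("op2", {"y"}))   # accepted (commutes with op1)
print(w.record("op3", {"x"}))   # rejected (conflicts with op1 on key x)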

P2. Primaries May Reveal Not-yet-durable Data
- The primary doesn't know whether an operation has been made durable in the witnesses.
- Subsequent operations (e.g. reads) may externalize the new data.
[Diagram: client A writes x ← 3; client B reads x and would see x = 3 even though x ← 3 may not be recorded in any witness and has not yet been synced to the backups.]
- Solution: the primary must wait for a sync to backups if a client request depends on any unsynced updates (sketched below).
Slide 13
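One way a primary can enforce this, sketched below under the assumption of a simple key-value store (not the paper's implementation): track which keys have updates that are not yet on the backups, and force a sync before serving any read that touches one of them.

# Sketch of a primary that blocks reads of unsynced data.
class Primary:
    def __init__(self):
        self.store = {}
        self.unsynced_keys = set()
    def write(self, key, value):
        self.store[key] = value
        self.unsynced_keys.add(key)      # durable only on witnesses so far
        return "ok"
    def sync(self):
        # ...replicate buffered updates to the backups (omitted)...
        self.unsynced_keys.clear()
        return "synced"
    def read(self, key):
        if key in self.unsynced_keys:    # reply would externalize unsynced data
            self.sync()                  # wait for the backups first
        return self.store.get(key)

p = Primary()
p.write("x", 3)
print(p.read("x"))   # forces a sync before returning 3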

P3. Replay Can Cause Duplicate Executions
- A client request may exist both in backups and in witnesses.
- Replaying operations already recovered from backups endangers consistency.
[Diagram: the crashed primary had synced x ← 1, x ← 2, x ← 3 to the backup (state x = 3, z = 7), but a witness still holds z ← 7 and x ← 2; if the new primary restores from the backup and then naively replays the witness, x ← 2 executes twice and the new primary ends with x = 2, z = 7.]
- Solution: detect and ignore duplicate requests, e.g. with RIFL [SOSP '15] (sketched below).
Slide 14
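A sketch of the replay step of recovery with RIFL-style at-most-once filtering (illustrative; the paper relies on RIFL's actual completion records rather than the simple set used here): each request carries a unique id, and requests already recovered from the backup are skipped instead of re-executed.

# Sketch of witness replay that skips requests already recovered from a backup.
def replay_witness(state, completed_ids, witness_requests):
    """witness_requests: list of (rpc_id, key, value) saved by a witness."""
    for rpc_id, key, value in witness_requests:
        if rpc_id in completed_ids:      # already recovered from the backup
            continue                     # ignore the duplicate
        state[key] = value               # safe to execute
        completed_ids.add(rpc_id)
    return state

# Slide's example: the backup already holds x <- 1, x <- 2, x <- 3 (so x = 3);
# the witness still holds x <- 2 and z <- 7. Without the filter, x would end at 2.
state, done = {"x": 3}, {"id1", "id2", "id3"}
print(replay_witness(state, done, [("id2", "x", 2), ("id4", "z", 7)]))
# -> {'x': 3, 'z': 7}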

Performance Evaluation of CURP
- Implemented in both Redis and RAMCloud.
[Diagram (performance vs. consistency): CURP makes Redis (a fast KV cache) consistently replicated, and makes RAMCloud (a consistently replicated in-memory KV store) as fast as no replication.]
Slide 15

RAMCloud's Latency after CURP
- Writes are issued sequentially by a client to a master.
- Configuration: 40 B keys, 100 B values; keys selected uniformly at random from 2 M unique keys; Xeon 4 cores (8 threads) @ 3 GHz; Mellanox ConnectX-2 InfiniBand (24 Gbps); kernel-bypassing transport.
[Plot: fraction of writes (log scale) vs. latency (µs) for Unreplicated, CURP (1 backup, 1 witness), CURP (3 backups, 3 witnesses), and Original (3 backups); median write latency drops from 14 µs to 7.1 µs.]
Slide 16

RAMCloud's Throughput after CURP
- Thanks to CURP, replication requests can be batched without impacting latency, which improves throughput.
- Each client issues writes (40 B key, 100 B value) sequentially.
[Plot: write throughput (kwrites/s) vs. number of clients for Unreplicated, CURP (1 backup, 1 witness), CURP (3 backups, 3 witnesses), and Original (3 backups); CURP reaches ~4x the original's throughput, costing roughly 6% per replica relative to unreplicated.]
Slide 17

See Paper For
- Design: garbage collection; reconfiguration handling (data migration, backup crash, witness crash); read operation; how to extend CURP to quorum-based consensus protocols.
- Performance: measurements with skewed workloads (many non-commutative ops); resource consumption by witness servers; CURP's impact on Redis performance.
Slide 18

Related Work
- Rely on commutativity for fast replication:
  - Generalized Paxos (2005): 1.5 RTTs.
  - EPaxos (SOSP '13): 2 RTTs in LAN, expensive reads.
- Rely on the network's in-order delivery of broadcasts:
  - Special networking hardware: NOPaxos (OSDI '16), Speculative Paxos (NSDI '15).
  - Presume & rollback: PLATO (SRDS '06), Optimistic Active Replication (ICDCS '01).
  - Combine with a transaction layer for rollback: TAPIR (SOSP '15), Janus (OSDI '16).
- CURP is faster than other protocols using commutativity, doesn't require rollback or special networking hardware, and is easy to integrate with existing primary-backup systems.
Slide 19

Conclusion
- Total order is not necessary for consistent replication.
- CURP clients replicate without ordering, in parallel with sending requests to execution servers → 1 RTT.
- Commutativity is exploited to preserve consistency.
- Improves both latency (2 RTTs → 1 RTT) and throughput (~4x):
  - RAMCloud latency: 14 µs → 7.1 µs (no replication: 6.1 µs).
  - Throughput: 184 kops/s → 728 kops/s (~4x).
Slide 20