Enhancing Throughput of Partially Replicated State Machines via Multi-Partition Operation Scheduling

Transcription:

Enhancing Throughput of Partially Replicated State Machines via Multi-Partition Operation Scheduling. NCA 2017. Zhongmiao Li, Peter Van Roy and Paolo Romano

Background. Online services strive for 24/7 availability. Replication is crucial to ensure availability, and state-machine replication (SMR) is a key technique for implementing fault-tolerant services.

Background. State-machine replication: applications are abstracted as deterministic state machines; all replicas store the application state; replicas agree on the operation order (e.g. using Paxos), then execute; deterministic operations => equivalent final state at all replicas. [Figure: three replicas, each holding state A, B, C, agree via consensus on the order of OP1, OP2, OP3 and then apply them in that same order.]
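
As a minimal illustration (hypothetical code, not from the paper): a consensus layer fixes one total order of operations, and every replica applies the same deterministic ops in that order, so all replicas end in the same state.

# Toy sketch of classical state-machine replication.
class Replica:
    def __init__(self, initial_state):
        self.state = dict(initial_state)

    def deliver(self, op):
        # Called by the consensus layer with ops in the agreed total order.
        op(self.state)            # deterministic => same final state everywhere

op1 = lambda s: s.update(A=10)            # stands in for OP1
op2 = lambda s: s.update(B=s["A"] + 1)    # stands in for OP2
replicas = [Replica({"A": 0, "B": 0, "C": 0}) for _ in range(3)]
for op in (op1, op2):                     # same order at every replica
    for r in replicas:
        r.deliver(op)
assert all(r.state == replicas[0].state for r in replicas)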

Background. Partially-replicated state machines (i). Classical SMR does not scale: replicas store the full state & execute all update ops => throughput is limited by a single replica's capacity & speed! Recent work proposes to partially replicate state machines to enhance scalability: High Performance State-Machine Replication (DSN'11), Calvin: Fast Distributed Transactions for Partitioned Database Systems (SIGMOD'12), Scalable State-Machine Replication (DSN'14).

Background. Partially-replicated state machines (ii). [Figure: the state is split into partitions A, B and C, and each partition is replicated within its own replication group.] Each replica splits its state into multiple partitions. Ops involving a single partition (SPOs) are executed only by that partition. Ops involving multiple partitions (MPOs) are coordinated and then executed by the involved partitions. But... can we scale linearly by adding more partitions?
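
A toy sketch of the resulting routing logic (the key-to-partition mapping and names are invented for illustration, not taken from the paper):

# Classify an op as an SPO or an MPO by the partitions its keys touch.
PARTITIONS = {"A": 0, "B": 1, "C": 2}     # toy key -> partition mapping

def route(op_keys):
    parts = {PARTITIONS[k] for k in op_keys}
    if len(parts) == 1:
        return "SPO", parts               # executed only by that partition
    return "MPO", parts                   # coordinated across all involved partitions

print(route({"A"}))        # ('SPO', {0})
print(route({"A", "B"}))   # ('MPO', {0, 1})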

Problems. Coordinating MPOs (i). Partitions have to agree on the order of MPOs. [Figure: OP1 sets A=10, B=10 and OP2 sets A=5, B=5; if partition A applies OP1 before OP2 while partition B applies OP2 before OP1, the resulting state A=5, B=10 corresponds to no serial order.] Coordinating MPOs is slow: it requires replication plus multiple rounds of inter-group communication. In existing systems, the coordination of MPOs lies on the critical path of execution! Partitions sit idle while coordinating MPOs => throughput is reduced.

Problems. Coordinating MPOs (ii). Calvin requires all-to-all synchronization to order ops; it progresses in rounds. [Figure: in each round, partitions A, B and C exchange the ops they received (OP1: A=10, B=10; OP2: A=5, B=5; OP3: C=100) before any of them executes.] This ordering lies on the critical path of execution and is non-scalable. Scalable SMR leverages atomic multicast to order ops: it is more scalable than Calvin, but ordering still lies on the critical path of execution, and additional messages are exchanged between partitions to ensure linearizability*. *Omitted due to time constraints; refer to the paper if interested.

Solution. Genepi. Remove the coordination of MPOs from the critical path of operation execution by scheduling MPOs to a future round => overlap the ordering of MPOs with the processing of already-ordered ops. Genepi: an efficient execution protocol ensuring linearizability*. Scraper: an ordering building block for Genepi, i.e. a scalable consensus for partial replication. *Omitted due to time constraints; refer to the paper if interested.

Solution. Scraper abstraction (formal specifications can be found in the paper). S-Propose(SPOs, Rs, MPOs, Rm): propose the ops accumulated for each round; Rs is the current round & Rm a future round => Rm is only a lower bound on the final round. S-Decide(OPs, R): triggered when the operations for round R have been decided; R can only be decided if rounds 1, 2, ..., R-1 have all been decided.
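
To make the abstraction concrete, here is a minimal Python sketch of this interface; the class and method names are illustrative (the paper only defines S-Propose and S-Decide) and all fault tolerance is omitted.

# Illustrative sketch of the Scraper interface described above; not the paper's code.
from abc import ABC, abstractmethod
from typing import Callable, List

Op = str       # an opaque operation
Round = int    # round number

class Scraper(ABC):
    @abstractmethod
    def s_propose(self, spos: List[Op], rs: Round, mpos: List[Op], rm: Round) -> None:
        # Propose the ops accumulated during the current round rs.
        # SPOs are proposed for rs itself; MPOs are proposed for the future
        # round rm, which is only a lower bound on the round finally chosen.
        ...

    @abstractmethod
    def on_decide(self, callback: Callable[[List[Op], Round], None]) -> None:
        # Register the S-Decide upcall: callback(ops, r) fires once round r is
        # decided, and r can only be decided after rounds 1, 2, ..., r-1.
        ...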

Solution. Genepi execution (example: partitions A and B interacting with Scraper).
Round 1: A calls Propose(SPO1, 1, MPO1, 2) and B calls Propose(SPO2, 1, MPO2, 2); Scraper delivers Decide(SPO1, 1) to A and Decide(SPO2, 1) to B, so each partition can execute its single-partition ops while the MPOs are still being ordered.
Round 2: MPO1 has been assigned to round 2. A calls Propose(SPO3, 2, MPO3, 3) and B calls Propose(SPO4, 2, MPO4, 3); Scraper delivers Decide([SPO3, MPO1], 2) to A and Decide([SPO4, MPO1], 2) to B.
Round 3: MPO2, MPO3 and MPO4 have been assigned to round 3; after both partitions propose for round 3, Scraper delivers Decide([.., MPO2, MPO3, MPO4], 3) to both A and B.
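
Putting the walkthrough together, here is a rough sketch of a partition's round loop under these assumptions; this is not the paper's code (batching, replication and the linearizability mechanism are omitted, and all names are illustrative).

# Rough sketch of a Genepi-style round loop at one partition, reusing the
# Scraper interface sketched earlier.
import queue

MPO_ROUND_OFFSET = 2   # MPOs are proposed for a round a little in the future

def run_partition(scraper, state, drain_new_ops):
    decided = queue.Queue()                       # (round, ops) S-Decide upcalls
    scraper.on_decide(lambda ops, r: decided.put((r, ops)))

    round_no = 1
    while True:
        # Split the ops that arrived during this round into SPOs and MPOs.
        spos, mpos = [], []
        for op in drain_new_ops():
            (spos if op.is_single_partition else mpos).append(op)

        # SPOs are proposed for the current round; MPOs for a future round, so
        # their cross-partition ordering overlaps with executing decided rounds
        # instead of blocking them.
        scraper.s_propose(spos, round_no, mpos, round_no + MPO_ROUND_OFFSET)

        # Rounds are decided in order, so the next upcall is for round_no.
        r, ops = decided.get()
        assert r == round_no
        for op in ops:                            # execute in the agreed order
            op.apply(state)

        round_no += 1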

Solution. Scraper design (i). Avoid synchronizing all partitions, for scalability: partitions unilaterally advance rounds. How do we ensure they agree on the rounds of ops? Key idea: a two-phase-commit-like protocol for partitions to agree on the round of an operation.

Solution. Scraper design (ii). [Example: partition A has decided up to round 10, partition B up to round 13, and the coordinator proposes OP1 with min_round 12.] 1. The coordinator sends the request with a min_round. 2. Each partition proposes max(min_round, its last decided round + 1); here A would propose 12 and B proposes 14. 3. The coordinator decides max(received rounds), i.e. round 14. 4. The partitions finalize the proposal for that round.
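
As a concrete illustration of this two-phase agreement (not the paper's code; messaging, replication and failures are omitted):

# Simplified sketch of Scraper's 2PC-like round agreement for one MPO.
def assign_round(min_round, partitions):
    # Phase 1: each involved partition proposes the earliest round it can still
    # accept, i.e. max(min_round, its last decided round + 1).
    proposals = [max(min_round, p.last_decided_round + 1) for p in partitions]

    # Phase 2: the coordinator decides the maximum proposed round and the
    # partitions finalize the operation for that round.
    final_round = max(proposals)
    for p in partitions:
        p.finalize(final_round)
    return final_round

# Example from the slide: A has decided up to round 10, B up to round 13, and
# min_round is 12 -> A proposes 12, B proposes 14, and OP1 is finalized for round 14.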

Solution. Other aspects in the paper: replication to ensure fault tolerance; a lightweight mechanism to ensure linearizability (delay replying to clients); choosing round numbers for MPOs (big enough to allow ordering the MPOs in time, but not too large, to avoid unnecessary latency overhead).

Evaluation. Experimental setup: Calvin, S-SMR and Genepi are all implemented on top of Calvin's codebase (in C++). Deployment: Grid'5000, using up to 40 nodes in the same region; RTT is around 0.4 ms. Replication cost is emulated by injecting a 3 ms delay; 5 ms round duration for batching; MPOs are scheduled two rounds later (2 * 5 ms).

Evaluation. Micro-benchmark: each op reads & updates 10 keys; we increase the number of nodes & the percentage of MPOs. Genepi scales better than Calvin: 83% higher throughput with 40 nodes & 1% MPOs. The latency of MPOs is 7~14 ms higher than that of SPOs.

Evaluation. TPC-C: about 10% distributed transactions; includes heavy-weight and/or read-only txns. At 40 nodes, Genepi has a 45% throughput gain.

Summary. Genepi's idea of postponing the execution of MPOs allows removing MPO coordination from the critical path of operation execution. Questions?

Evaluation. Micro-benchmark: 10 nodes, varying the % of MPOs and the number of partitions accessed by MPOs. Genepi is only worse for workloads with lots of MPOs that access lots of partitions!