Enhancing Throughput of Partially Replicated State Machines via Multi-Partition Operation Scheduling
Transcription
1-3 Enhancing Throughput of Partially Replicated State Machines via Multi-Partition Operation Scheduling. NCA 2017. Zhongmiao Li, Peter Van Roy and Paolo Romano
4-5 Background. Online services strive for 24/7 availability. Replication is crucial to ensure availability. State-machine replication (SMR) is a key technique for implementing fault-tolerant services.
6-13 Background: State-machine replication. Applications are abstracted as deterministic state machines. All replicas store the application state. Replicas agree on the operation order (e.g., using Paxos), then execute. Deterministic operations => equivalent final state across replicas. [Slide animation: three replicas, each holding state A, B, C, receive OP1, OP2, OP3 through a consensus layer and apply them in the same order.]
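To make the order-then-execute idea concrete, here is a minimal sketch (not the paper's code) in which the consensus layer is abstracted away as an already-agreed log of operations:

```python
# Minimal sketch of classical state-machine replication.
# Consensus (e.g. Paxos) is abstracted as an agreed-upon log; every replica
# applies the same log in order, so deterministic ops yield equal final states.

class Replica:
    def __init__(self):
        self.state = {"A": 0, "B": 0, "C": 0}  # full copy of the application state

    def apply(self, op):
        key, value = op          # a deterministic update, e.g. ("A", 10)
        self.state[key] = value

agreed_log = [("A", 10), ("B", 20), ("C", 30)]  # order decided by consensus

replicas = [Replica() for _ in range(3)]
for op in agreed_log:            # every replica executes the same sequence
    for r in replicas:
        r.apply(op)

assert all(r.state == replicas[0].state for r in replicas)  # equivalent states
```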
14 Background: Partially-replicated state machines (i). Classical SMR does not scale: replicas store the full state & execute all update ops => throughput is limited by a single replica's capacity & speed! Recent work proposes partially replicating state machines to enhance scalability: High Performance State-Machine Replication, DSN'11; Calvin: Fast Distributed Transactions for Partitioned Database Systems, SIGMOD'12; Scalable State-Machine Replication, DSN'14.
15-20 Background: Partially-replicated state machines (ii). [Slide animation: the state A, B, C is split so that replication group A holds partition A, group B holds partition B, and group C holds partition C, each group replicated across several nodes.] Each replica splits its state into multiple partitions. Ops involving a single partition (SPOs) are executed only by that partition. Ops involving multiple partitions (MPOs) are coordinated and then executed by the involved partitions. But... can we scale linearly by adding more partitions?
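A toy sketch of the SPO/MPO distinction; the partition map and routing helper are illustrative names, not from the paper:

```python
# Hypothetical sketch of partial replication: the state is split into
# partitions, each owned by its own replication group.

PARTITION_OF = {"A": "group_A", "B": "group_B", "C": "group_C"}

def route_op(keys):
    """Classify an op by the set of partitions it touches."""
    groups = {PARTITION_OF[k] for k in keys}
    if len(groups) == 1:
        return "SPO", groups      # executed only by the owning partition
    return "MPO", groups          # must be coordinated across involved partitions

print(route_op(["A"]))            # ('SPO', {'group_A'})
print(route_op(["A", "B"]))       # ('MPO', {'group_A', 'group_B'})
```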
21-26 Problems: Coordinating MPOs (i). Partitions have to agree on the order of MPOs. [Slide animation: OP1 sets A=10, B=10 and OP2 sets A=5, B=5; partition A applies OP1 then OP2 while partition B applies OP2 then OP1, leaving A=5, B=10.] Coordinating MPOs is slow: replication + multiple inter-group communication steps. In existing systems, the coordination of MPOs lies on the critical path of execution! Partitions sit idle while coordinating MPOs => throughput is reduced.
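A tiny script (assumed example, mirroring the slide's values) showing the inconsistent outcome when the two partitions apply the MPOs in different orders:

```python
# Why partitions must agree on MPO order: applying OP1 and OP2 in different
# orders at A and B yields a state no single serial order could produce.

OP1 = {"A": 10, "B": 10}
OP2 = {"A": 5, "B": 5}

def apply_in_order(partition_key, ops):
    value = None
    for op in ops:
        value = op[partition_key]   # last writer wins on this partition's key
    return value

a = apply_in_order("A", [OP1, OP2])  # partition A sees OP1 then OP2: A = 5
b = apply_in_order("B", [OP2, OP1])  # partition B sees OP2 then OP1: B = 10
print(a, b)  # 5 10, matching neither "OP1 then OP2" nor "OP2 then OP1"
```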
27-35 Problems: Coordinating MPOs (ii). Calvin requires all-to-all synchronization to order ops and progresses in rounds. [Slide animation: in each round, partitions A, B and C exchange OP1 (A=10, B=10), OP2 (A=5, B=5) and OP3 (C=100) with every other partition before executing.] Ordering lies on the critical path of execution and is non-scalable.

36 Scalable SMR leverages atomic multicast to order ops. It is more scalable than Calvin, but ordering still lies on the critical path of execution, and additional messages are exchanged between partitions to ensure linearizability*. *Omitted due to time constraints; refer to the paper if interested.
37-39 Solution: Genepi. Remove the coordination of MPOs from the critical path of operation execution: schedule MPOs to a future round => overlap the ordering of MPOs with the processing of already-ordered ops. Genepi: an efficient execution protocol ensuring linearizability*. Scraper: an ordering building block for Genepi, providing scalable consensus for partial replication. *Omitted due to time constraints; refer to the paper if interested.
40 Solution: Scraper abstraction (formal specifications can be found in the paper). S-Propose(SPOs, Rs, MPOs, Rm): propose the accumulated ops for each round; Rs is the current round and Rm a future round => only a lower bound on the final round. S-Decide(OPs, R): triggered when the operations for round R have been decided; R can only be decided once rounds 1, 2, ..., R-1 have all been decided. A minimal interface sketch follows.
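The two operations below come straight from this slide; the callback wiring and the internal bookkeeping are assumptions of this sketch, not the paper's implementation:

```python
# Minimal sketch of the Scraper abstraction as described on the slide.

class Scraper:
    def __init__(self, on_decide):
        self.on_decide = on_decide   # upcall: on_decide(ops, round)
        self.pending = {}            # round -> accumulated ops
        self.decided_round = 0

    def s_propose(self, spos, r_s, mpos, r_m):
        # SPOs go to the current round r_s; MPOs to a *future* round r_m,
        # which is only a lower bound on the round finally agreed upon.
        self.pending.setdefault(r_s, []).extend(spos)
        self.pending.setdefault(r_m, []).extend(mpos)

    def s_decide(self, r):
        # Round r can only be decided once rounds 1..r-1 have been decided.
        assert r == self.decided_round + 1
        self.decided_round = r
        self.on_decide(self.pending.pop(r, []), r)
```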
41-57 Solution: Genepi execution (walkthrough with partitions A and B). Round 1: the partitions call Propose(SPO1, 1, MPO1, 2) and Propose(SPO2, 1, MPO2, 2); Scraper delivers Decide(SPO1, 1) and Decide(SPO2, 1), so each partition executes its SPOs while MPO1 and MPO2 are still being ordered. Round 2: MPO1 has been ordered for round 2; the partitions call Propose(SPO3, 2, MPO3, 3) and Propose(SPO4, 2, MPO4, 3), and Scraper delivers Decide([SPO3, MPO1], 2) and Decide([SPO4, MPO1], 2). Round 3: MPO2, MPO3 and MPO4 have been ordered for round 3, and Scraper delivers Decide([.., MPO2, MPO3, MPO4], 3) at both partitions.
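A sketch of the round loop this walkthrough implies, reusing the Scraper sketch above; the single-process driver and names are illustrative, and the walkthrough's one-round MPO delay is used (the evaluation later schedules MPOs two rounds ahead):

```python
# Genepi-style round loop: SPOs are proposed for the current round, MPOs for a
# future round, so their slow cross-partition ordering overlaps with the
# execution of already-decided rounds.

MPO_DELAY = 1  # as in this walkthrough; the evaluation uses a delay of 2

def run_rounds(scraper, arrivals, num_rounds):
    for r in range(1, num_rounds + 1):
        ops = arrivals.get(r, {"spos": [], "mpos": []})
        scraper.s_propose(ops["spos"], r, ops["mpos"], r + MPO_DELAY)
        scraper.s_decide(r)  # this round's SPOs + MPOs scheduled earlier

arrivals = {1: {"spos": ["SPO1"], "mpos": ["MPO1"]},
            2: {"spos": ["SPO3"], "mpos": ["MPO3"]}}
run_rounds(Scraper(on_decide=lambda ops, r: print(r, ops)), arrivals, 3)
# 1 ['SPO1']
# 2 ['MPO1', 'SPO3']
# 3 ['MPO3']
```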
58-63 Solution: Scraper design (i). Avoid synchronizing all partitions, for scalability: partitions unilaterally advance rounds. How to ensure they agree on the rounds of ops? Key idea: a two-phase-commit-like protocol for partitions to agree on the round of an operation.
64-68 Solution: Scraper design (ii). [Example: partition A has decided up to round 10, partition B up to round 13.] 1. The coordinator sends the request with a min_round (OP1: round 12). 2. Each partition proposes max(min_round, decided round + 1); here B proposes round 14. 3. The coordinator decides max(received rounds): round 14. 4. The partitions finalize the proposal, placing OP1 in round 14 at both. A sketch of this exchange follows.
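A compact sketch of this exchange under the slide's rules; the coordinator and partition objects are hypothetical stand-ins, and the real protocol additionally replicates each step for fault tolerance:

```python
# Two-phase, 2PC-like round agreement for an MPO.

def assign_round(min_round, partitions):
    """Coordinator side: agree on one round for an MPO across its partitions."""
    # Phase 1: each partition proposes the earliest round it can still accept.
    proposals = [p.propose(min_round) for p in partitions]
    # Phase 2: the coordinator decides the max; the partitions finalize it.
    decided = max(proposals)
    for p in partitions:
        p.finalize(decided)
    return decided

class Partition:
    def __init__(self, decided_round):
        self.decided_round = decided_round  # highest round already decided here
        self.assigned = None

    def propose(self, min_round):
        # Never place the op in a round that is already decided locally.
        return max(min_round, self.decided_round + 1)

    def finalize(self, decided):
        self.assigned = decided

# The slide's example: A has decided round 10, B round 13; the coordinator
# asks for round >= 12 and the op ends up in round 14 at both partitions.
a, b = Partition(10), Partition(13)
print(assign_round(12, [a, b]))  # 14
```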
69 Solution: Other aspects in the paper. Replication to ensure fault tolerance. A lightweight mechanism to ensure linearizability: delay replying to clients. Choosing round numbers for MPOs: big enough to allow ordering the MPOs, but not too large, to avoid unnecessary latency overhead.
70 Evaluation: Experimental setup. Calvin, S-SMR and Genepi were all implemented on top of Calvin's codebase (in C++). Deployment: deployed in Grid; used up to 40 nodes in the same region, with an RTT of around 0.4 ms. Replication cost emulated by injecting a 3 ms delay. 5 ms round duration for batching; MPOs scheduled two rounds later (2 * 5 ms).
71-73 Evaluation: Micro-benchmark. Each op reads & updates 10 keys; increase the number of nodes & the percentage of MPOs. Genepi scales better than Calvin: 83% higher throughput with 40 nodes & 1% MPOs. The latency of MPOs is 7~14 ms higher than that of SPOs.
74-75 Evaluation: TPC-C. About 10% distributed transactions; includes heavy-weight and/or read-only txns. At 40 nodes, Genepi has a 45% throughput gain.
76 Summary. Genepi's idea of postponing the execution of MPOs allows removing MPO coordination from the critical path of operation execution. Questions?
77 Evaluation: Micro-benchmark. 10 nodes, varying the % of MPOs and the number of partitions accessed by MPOs. Genepi is only worse for workloads with lots of MPOs that access lots of partitions!