TAPIR. By Irene Zhang, Naveen Sharma, Adriana Szekeres, Arvind Krishnamurthy, and Dan Ports Presented by Todd Charlton

Size: px
Start display at page:

Download "TAPIR. By Irene Zhang, Naveen Sharma, Adriana Szekeres, Arvind Krishnamurthy, and Dan Ports Presented by Todd Charlton"

Transcription

1 TAPIR By Irene Zhang, Naveen Sharma, Adriana Szekeres, Arvind Krishnamurthy, and Dan Ports Presented by Todd Charlton

2 Outline Problem Space Inconsistent Replication TAPIR Evaluation Conclusion

3 Problem Develop an app to send pictures of chocolate labs How do we save the pictures??

4 Problem Distributed storage system Want strong consistency use replication protocols like Paxos which incur a high performance cost Want efficient protocols can only guarantee weak consistency

5 Problem Guarantees Fault-Tolerance Scalability Linearizability Distributed Transaction Protocol Replication Protocol

6 Problem Existing architectures

7 Problem We are enforcing serial ordering in two places: Between replicas Between partitions

8 Problem Guarantees Fault-Tolerance Scalability Linearizability Distributed Transaction Protocol Replication Protocol

9 Inconsistent Replication Just make the replication layer inconsistent! Operations can execute in any order Still provides fault tolerance No costly consistency protocol

10 Inconsistent Replication Guarantees Fault Tolerance At any time, every operation in the operation set is in the record of at least one replica in any quorum of f+1 replicas Visibility For any two operations in the operation set, at least one is visible to the other Consensus every operation has agreement from at least a majority of the replicas

11 Inconsistent Replication Application Replica Replica Conflict Detection Application Replica

12 Inconsistent Replication Inconsistent Execution Client sends Propose(op, id) to all replicas Replicas mark [id,op] as Tentative in their record. Replies to client Reply(id) Once client receives f+1 replies for an id, sends Finalize(id) to all replicas Replicas transition from Tentative -> Finalized for that op when they receive from client and respond with ExecInconsistent() to Application layer 1 Round Trip Application sends InvokeInconsistent() Replication layer responds with ExecInconsistent()

13 Inconsistent Replication

14 Inconsistent Replication Consensus Execution Client sends Propose(op, id) to all replicas Replicas mark [id,op, result] as Tentative in their record. Replies to client Reply(id, result) Fast Path (Fast Quorum) If client receives 3/2f+1 matching results, return result to application layer and send Finalize to all replicas 1 Round Trip Application sends InvokeConsensus() Replication layer responds with ExecConsensus()

15 Inconsistent Replication Consensus Execution Slow Path (Didn t reach 3/2f+1 fast quorum of matching results) Client must wait for f+1 responses. Sends Finalize(id, result). Result is computed from the decide() function When replica receives Finalize, records op as finalized (updates the record if the result recorded was different) and sends Confirm(id) to client. Once client receives f+1 Confirm messages, returns result to application 2 Round Trip Application sends InvokeConsensus(), Replication layer responds with Decide(), Application sends decision, Replication sends ExecConsensus()

16 Inconsistent Replication

17 Inconsistent Replication Synchronization -IR uses View Changes. -But wait, that implies a leader -Leaders exist solely during view change. Only job is to ensure that at least f+1 replicas are up to date

18 Inconsistent Replication Synchronization -When triggered, leader collects f+1 replica s logs. - Merges all Finalized records into a master record - If record is Tentative, must have Transaction layer Decide() what to do - From Transaction Layer s response, Master Record R is created. All clients update their records to the master

19 Inconsistent Replication Good Bad - 1 round-trip path, 2 roundtrip worst case - No cross replication communication needed - Replica s don t appear as a single machine (need occasional sync) - Requires a well-designed transaction layer on top

20 TAPIR Designed specifically to interface with IR Uses 2PC across the partitions of replicas This is the Transaction Application Layer. Users interact with this, not IR. Stands for Transaction Application Protocol for Inconsistent Replication

21 TAPIR

22 TAPIR Optimistic Concurrency Control (OCC) IR guarantees us visibility In any pair of consensus operations, at least one will be visible to the other. Thus, we can t do conflict checks that require the entire history because each IR replica may have an incomplete history Yet, in OCC we are only performing pairwise conflict checks. If a conflict exists, at least one replica will see the conflicting transaction

23 TAPIR Optimistic Concurrency Control (OCC) Application Replica Replica Application Replica

24 TAPIR Read() and Write() are collected for the transaction. We build a read and write set. This phase ends when the user enters a Commit() or Abort() Prepare() is called and we perform a consensus operation at the IR level. We pass in the read and write sets. This is the only consensus operation that TAPIR uses. Commit and Abort are inconsistent operations and read and write are not replicated.

25 TAPIR After Prepare() is sent to the replicas, the consensus protocol is followed at the IR layer If all partitions reply with Prepare-OK, then TAPIR passes a Commit() to all replicas If any replica responds with Abort(), then TAPIR passes Abort() to all replicas.

26 TAPIR Decide() function Must be implemented by application side Again, called when there is a conflict detected between results at the IR layer Simple solution. If a majority (f+1) replicas, decide Prepare-OK. Due to our IR guarantees, no conflicting transaction could get a majority of the replicas to return Prepare-OK

27 TAPIR Linearizable?? To commit two transactions through TAPIR we must execute two Prepare() messages -> consensus operations in IR We are guaranteed through visibility in IR and OCC that one of the Prepare() operations would abort. (Will not manage to obtain f+1 replicas who respond with Prepare-OK due to conflict)

28 TAPIR Fault Tolerance? Yes. This is a guarantee from IR. If TAPIR receives f+1 Prepare-OK messages from IR, then an inconsistent Commit operation is issued Replicas eventually commit the transaction to their records. If a replica does not, it will eventually catch up on synchronization when it copies the master record

29 Evaluation Implemented TAPIR as a Key-Value storage system Compared against OCC-Store 2-Phase Commit as the transaction layer running on Multi-Paxos Compared against Lock-Store Google s Spanner storage system with a few tweaks. Runs Multi-Paxos in the replication layer

30 Evaluation Comparison with Strong Consistency systems

31 Evaluation Wide Area Latency

32 Evaluation Comparison with Weak Consistency systems

33 Conclusion Existing systems waste work by enforcing linearizability in the replication layer TAPIR leverages Inconsistent Replication to provide linearizable transactions Improves latency and throughput on commit No leader bottleneck Round-trip time can be halved in common case

34 Questions?

Building Consistent Transactions with Inconsistent Replication

Building Consistent Transactions with Inconsistent Replication DB Reading Group Fall 2015 slides by Dana Van Aken Building Consistent Transactions with Inconsistent Replication Irene Zhang, Naveen Kr. Sharma, Adriana Szekeres, Arvind Krishnamurthy, Dan R. K. Ports

More information

Building Consistent Transactions with Inconsistent Replication

Building Consistent Transactions with Inconsistent Replication Building Consistent Transactions with Inconsistent Replication Irene Zhang, Naveen Kr. Sharma, Adriana Szekeres, Arvind Krishnamurthy, Dan R. K. Ports University of Washington Distributed storage systems

More information

Building Consistent Transactions with Inconsistent Replication

Building Consistent Transactions with Inconsistent Replication Building Consistent Transactions with Inconsistent Replication Irene Zhang Naveen Kr. Sharma Adriana Szekeres Arvind Krishnamurthy Dan R. K. Ports University of Washington {iyzhang, naveenks, aaasz, arvind,

More information

Just Say NO to Paxos Overhead: Replacing Consensus with Network Ordering

Just Say NO to Paxos Overhead: Replacing Consensus with Network Ordering Just Say NO to Paxos Overhead: Replacing Consensus with Network Ordering Jialin Li, Ellis Michael, Naveen Kr. Sharma, Adriana Szekeres, Dan R. K. Ports Server failures are the common case in data centers

More information

Designing Distributed Systems using Approximate Synchrony in Data Center Networks

Designing Distributed Systems using Approximate Synchrony in Data Center Networks Designing Distributed Systems using Approximate Synchrony in Data Center Networks Dan R. K. Ports Jialin Li Naveen Kr. Sharma Vincent Liu Arvind Krishnamurthy University of Washington CSE Today s most

More information

Replication in Distributed Systems

Replication in Distributed Systems Replication in Distributed Systems Replication Basics Multiple copies of data kept in different nodes A set of replicas holding copies of a data Nodes can be physically very close or distributed all over

More information

Janus. Consolidating Concurrency Control and Consensus for Commits under Conflicts. Shuai Mu, Lamont Nelson, Wyatt Lloyd, Jinyang Li

Janus. Consolidating Concurrency Control and Consensus for Commits under Conflicts. Shuai Mu, Lamont Nelson, Wyatt Lloyd, Jinyang Li Janus Consolidating Concurrency Control and Consensus for Commits under Conflicts Shuai Mu, Lamont Nelson, Wyatt Lloyd, Jinyang Li New York University, University of Southern California State of the Art

More information

There Is More Consensus in Egalitarian Parliaments

There Is More Consensus in Egalitarian Parliaments There Is More Consensus in Egalitarian Parliaments Iulian Moraru, David Andersen, Michael Kaminsky Carnegie Mellon University Intel Labs Fault tolerance Redundancy State Machine Replication 3 State Machine

More information

CSE 544 Principles of Database Management Systems. Alvin Cheung Fall 2015 Lecture 14 Distributed Transactions

CSE 544 Principles of Database Management Systems. Alvin Cheung Fall 2015 Lecture 14 Distributed Transactions CSE 544 Principles of Database Management Systems Alvin Cheung Fall 2015 Lecture 14 Distributed Transactions Transactions Main issues: Concurrency control Recovery from failures 2 Distributed Transactions

More information

SDPaxos: Building Efficient Semi-Decentralized Geo-replicated State Machines

SDPaxos: Building Efficient Semi-Decentralized Geo-replicated State Machines SDPaxos: Building Efficient Semi-Decentralized Geo-replicated State Machines Hanyu Zhao *, Quanlu Zhang, Zhi Yang *, Ming Wu, Yafei Dai * * Peking University Microsoft Research Replication for Fault Tolerance

More information

Applications of Paxos Algorithm

Applications of Paxos Algorithm Applications of Paxos Algorithm Gurkan Solmaz COP 6938 - Cloud Computing - Fall 2012 Department of Electrical Engineering and Computer Science University of Central Florida - Orlando, FL Oct 15, 2012 1

More information

MDCC MULTI DATA CENTER CONSISTENCY. amplab. Tim Kraska, Gene Pang, Michael Franklin, Samuel Madden, Alan Fekete

MDCC MULTI DATA CENTER CONSISTENCY. amplab. Tim Kraska, Gene Pang, Michael Franklin, Samuel Madden, Alan Fekete MDCC MULTI DATA CENTER CONSISTENCY Tim Kraska, Gene Pang, Michael Franklin, Samuel Madden, Alan Fekete gpang@cs.berkeley.edu amplab MOTIVATION 2 3 June 2, 200: Rackspace power outage of approximately 0

More information

Distributed Systems. Day 13: Distributed Transaction. To Be or Not to Be Distributed.. Transactions

Distributed Systems. Day 13: Distributed Transaction. To Be or Not to Be Distributed.. Transactions Distributed Systems Day 13: Distributed Transaction To Be or Not to Be Distributed.. Transactions Summary Background on Transactions ACID Semantics Distribute Transactions Terminology: Transaction manager,,

More information

CS /15/16. Paul Krzyzanowski 1. Question 1. Distributed Systems 2016 Exam 2 Review. Question 3. Question 2. Question 5.

CS /15/16. Paul Krzyzanowski 1. Question 1. Distributed Systems 2016 Exam 2 Review. Question 3. Question 2. Question 5. Question 1 What makes a message unstable? How does an unstable message become stable? Distributed Systems 2016 Exam 2 Review Paul Krzyzanowski Rutgers University Fall 2016 In virtual sychrony, a message

More information

Distributed systems. Lecture 6: distributed transactions, elections, consensus and replication. Malte Schwarzkopf

Distributed systems. Lecture 6: distributed transactions, elections, consensus and replication. Malte Schwarzkopf Distributed systems Lecture 6: distributed transactions, elections, consensus and replication Malte Schwarzkopf Last time Saw how we can build ordered multicast Messages between processes in a group Need

More information

Low-Latency Multi-Datacenter Databases using Replicated Commit

Low-Latency Multi-Datacenter Databases using Replicated Commit Low-Latency Multi-Datacenter Databases using Replicated Commit Hatem Mahmoud, Faisal Nawab, Alexander Pucher, Divyakant Agrawal, Amr El Abbadi UCSB Presented by Ashutosh Dhekne Main Contributions Reduce

More information

Enhancing Throughput of

Enhancing Throughput of Enhancing Throughput of NCA 2017 Zhongmiao Li, Peter Van Roy and Paolo Romano Enhancing Throughput of Partially Replicated State Machines via NCA 2017 Zhongmiao Li, Peter Van Roy and Paolo Romano Enhancing

More information

EECS 498 Introduction to Distributed Systems

EECS 498 Introduction to Distributed Systems EECS 498 Introduction to Distributed Systems Fall 2017 Harsha V. Madhyastha Dynamo Recap Consistent hashing 1-hop DHT enabled by gossip Execution of reads and writes Coordinated by first available successor

More information

Megastore: Providing Scalable, Highly Available Storage for Interactive Services & Spanner: Google s Globally- Distributed Database.

Megastore: Providing Scalable, Highly Available Storage for Interactive Services & Spanner: Google s Globally- Distributed Database. Megastore: Providing Scalable, Highly Available Storage for Interactive Services & Spanner: Google s Globally- Distributed Database. Presented by Kewei Li The Problem db nosql complex legacy tuning expensive

More information

Last time. Distributed systems Lecture 6: Elections, distributed transactions, and replication. DrRobert N. M. Watson

Last time. Distributed systems Lecture 6: Elections, distributed transactions, and replication. DrRobert N. M. Watson Distributed systems Lecture 6: Elections, distributed transactions, and replication DrRobert N. M. Watson 1 Last time Saw how we can build ordered multicast Messages between processes in a group Need to

More information

Exploiting Commutativity For Practical Fast Replication. Seo Jin Park and John Ousterhout

Exploiting Commutativity For Practical Fast Replication. Seo Jin Park and John Ousterhout Exploiting Commutativity For Practical Fast Replication Seo Jin Park and John Ousterhout Overview Problem: consistent replication adds latency and throughput overheads Why? Replication happens after ordering

More information

Consistency in Distributed Systems

Consistency in Distributed Systems Consistency in Distributed Systems Recall the fundamental DS properties DS may be large in scale and widely distributed 1. concurrent execution of components 2. independent failure modes 3. transmission

More information

Exploiting Commutativity For Practical Fast Replication. Seo Jin Park and John Ousterhout

Exploiting Commutativity For Practical Fast Replication. Seo Jin Park and John Ousterhout Exploiting Commutativity For Practical Fast Replication Seo Jin Park and John Ousterhout Overview Problem: replication adds latency and throughput overheads CURP: Consistent Unordered Replication Protocol

More information

Data Modeling and Databases Ch 14: Data Replication. Gustavo Alonso, Ce Zhang Systems Group Department of Computer Science ETH Zürich

Data Modeling and Databases Ch 14: Data Replication. Gustavo Alonso, Ce Zhang Systems Group Department of Computer Science ETH Zürich Data Modeling and Databases Ch 14: Data Replication Gustavo Alonso, Ce Zhang Systems Group Department of Computer Science ETH Zürich Database Replication What is database replication The advantages of

More information

Large-Scale Key-Value Stores Eventual Consistency Marco Serafini

Large-Scale Key-Value Stores Eventual Consistency Marco Serafini Large-Scale Key-Value Stores Eventual Consistency Marco Serafini COMPSCI 590S Lecture 13 Goals of Key-Value Stores Export simple API put(key, value) get(key) Simpler and faster than a DBMS Less complexity,

More information

Distributed Systems 11. Consensus. Paul Krzyzanowski

Distributed Systems 11. Consensus. Paul Krzyzanowski Distributed Systems 11. Consensus Paul Krzyzanowski pxk@cs.rutgers.edu 1 Consensus Goal Allow a group of processes to agree on a result All processes must agree on the same value The value must be one

More information

CSE 444: Database Internals. Lecture 25 Replication

CSE 444: Database Internals. Lecture 25 Replication CSE 444: Database Internals Lecture 25 Replication CSE 444 - Winter 2018 1 Announcements Magda s office hour tomorrow: 1:30pm Lab 6: Milestone today and due next week HW6: Due on Friday Master s students:

More information

Paxos and Replication. Dan Ports, CSEP 552

Paxos and Replication. Dan Ports, CSEP 552 Paxos and Replication Dan Ports, CSEP 552 Today: achieving consensus with Paxos and how to use this to build a replicated system Last week Scaling a web service using front-end caching but what about the

More information

Distributed Systems. 10. Consensus: Paxos. Paul Krzyzanowski. Rutgers University. Fall 2017

Distributed Systems. 10. Consensus: Paxos. Paul Krzyzanowski. Rutgers University. Fall 2017 Distributed Systems 10. Consensus: Paxos Paul Krzyzanowski Rutgers University Fall 2017 1 Consensus Goal Allow a group of processes to agree on a result All processes must agree on the same value The value

More information

CS October 2017

CS October 2017 Atomic Transactions Transaction An operation composed of a number of discrete steps. Distributed Systems 11. Distributed Commit Protocols All the steps must be completed for the transaction to be committed.

More information

EECS 498 Introduction to Distributed Systems

EECS 498 Introduction to Distributed Systems EECS 498 Introduction to Distributed Systems Fall 2017 Harsha V. Madhyastha Replicated State Machines Logical clocks Primary/ Backup Paxos? 0 1 (N-1)/2 No. of tolerable failures October 11, 2017 EECS 498

More information

Replicated State Machine in Wide-area Networks

Replicated State Machine in Wide-area Networks Replicated State Machine in Wide-area Networks Yanhua Mao CSE223A WI09 1 Building replicated state machine with consensus General approach to replicate stateful deterministic services Provide strong consistency

More information

BIG DATA AND CONSISTENCY. Amy Babay

BIG DATA AND CONSISTENCY. Amy Babay BIG DATA AND CONSISTENCY Amy Babay Outline Big Data What is it? How is it used? What problems need to be solved? Replication What are the options? Can we use this to solve Big Data s problems? Putting

More information

CS 425 / ECE 428 Distributed Systems Fall 2017

CS 425 / ECE 428 Distributed Systems Fall 2017 CS 425 / ECE 428 Distributed Systems Fall 2017 Indranil Gupta (Indy) Nov 7, 2017 Lecture 21: Replication Control All slides IG Server-side Focus Concurrency Control = how to coordinate multiple concurrent

More information

ZHT: Const Eventual Consistency Support For ZHT. Group Member: Shukun Xie Ran Xin

ZHT: Const Eventual Consistency Support For ZHT. Group Member: Shukun Xie Ran Xin ZHT: Const Eventual Consistency Support For ZHT Group Member: Shukun Xie Ran Xin Outline Problem Description Project Overview Solution Maintains Replica List for Each Server Operation without Primary Server

More information

SpecPaxos. James Connolly && Harrison Davis

SpecPaxos. James Connolly && Harrison Davis SpecPaxos James Connolly && Harrison Davis Overview Background Fast Paxos Traditional Paxos Implementations Data Centers Mostly-Ordered-Multicast Network layer Speculative Paxos Protocol Application layer

More information

Replications and Consensus

Replications and Consensus CPSC 426/526 Replications and Consensus Ennan Zhai Computer Science Department Yale University Recall: Lec-8 and 9 In the lec-8 and 9, we learned: - Cloud storage and data processing - File system: Google

More information

Intuitive distributed algorithms. with F#

Intuitive distributed algorithms. with F# Intuitive distributed algorithms with F# Natallia Dzenisenka Alena Hall @nata_dzen @lenadroid A tour of a variety of intuitivedistributed algorithms used in practical distributed systems. and how to prototype

More information

Module 7 - Replication

Module 7 - Replication Module 7 - Replication Replication Why replicate? Reliability Avoid single points of failure Performance Scalability in numbers and geographic area Why not replicate? Replication transparency Consistency

More information

Concurrency Control II and Distributed Transactions

Concurrency Control II and Distributed Transactions Concurrency Control II and Distributed Transactions CS 240: Computing Systems and Concurrency Lecture 18 Marco Canini Credits: Michael Freedman and Kyle Jamieson developed much of the original material.

More information

Replication. Feb 10, 2016 CPSC 416

Replication. Feb 10, 2016 CPSC 416 Replication Feb 10, 2016 CPSC 416 How d we get here? Failures & single systems; fault tolerance techniques added redundancy (ECC memory, RAID, etc.) Conceptually, ECC & RAID both put a master in front

More information

EECS 482 Introduction to Operating Systems

EECS 482 Introduction to Operating Systems EECS 482 Introduction to Operating Systems Winter 2018 Baris Kasikci (Thanks, Harsha Madhyastha and Jason Flinn for the slides!) Distributed file systems Remote storage of data that appears local Examples:

More information

Consolidating Concurrency Control and Consensus for Commits under Conflicts

Consolidating Concurrency Control and Consensus for Commits under Conflicts Consolidating Concurrency Control and Consensus for Commits under Conflicts Shuai Mu, Lamont Nelson, Wyatt Lloyd, and Jinyang Li New York University, University of Southern California Abstract Conventional

More information

Introduction to Distributed Systems Seif Haridi

Introduction to Distributed Systems Seif Haridi Introduction to Distributed Systems Seif Haridi haridi@kth.se What is a distributed system? A set of nodes, connected by a network, which appear to its users as a single coherent system p1 p2. pn send

More information

AGREEMENT PROTOCOLS. Paxos -a family of protocols for solving consensus

AGREEMENT PROTOCOLS. Paxos -a family of protocols for solving consensus AGREEMENT PROTOCOLS Paxos -a family of protocols for solving consensus OUTLINE History of the Paxos algorithm Paxos Algorithm Family Implementation in existing systems References HISTORY OF THE PAXOS ALGORITHM

More information

Consolidating Concurrency Control and Consensus for Commits under Conflicts

Consolidating Concurrency Control and Consensus for Commits under Conflicts Consolidating Concurrency Control and Consensus for Commits under Conflicts Shuai Mu, Lamont Nelson, Wyatt Lloyd, and Jinyang Li New York University, University of Southern California Abstract Conventional

More information

Dynamo: Key-Value Cloud Storage

Dynamo: Key-Value Cloud Storage Dynamo: Key-Value Cloud Storage Brad Karp UCL Computer Science CS M038 / GZ06 22 nd February 2016 Context: P2P vs. Data Center (key, value) Storage Chord and DHash intended for wide-area peer-to-peer systems

More information

Consistency examples. COS 418: Distributed Systems Precept 5. Themis Melissaris

Consistency examples. COS 418: Distributed Systems Precept 5. Themis Melissaris Consistency examples COS 418: Distributed Systems Precept 5 Themis Melissaris Plan Midterm poll Consistency examples 2 Fill out this poll: http://tinyurl.com/zdeq4lr 3 Linearizability 4 Once read returns

More information

Integrity in Distributed Databases

Integrity in Distributed Databases Integrity in Distributed Databases Andreas Farella Free University of Bozen-Bolzano Table of Contents 1 Introduction................................................... 3 2 Different aspects of integrity.....................................

More information

Strong Consistency and Agreement

Strong Consistency and Agreement 1 Lee Lorenz, Brent Sheppard Jenkins, if I want another yes man, I ll build one! Strong Consistency and Agreement COS 461: Computer Networks Spring 2011 Mike Freedman hap://www.cs.princeton.edu/courses/archive/spring11/cos461/

More information

! Replication comes with consistency cost: ! Reasons for replication: Better performance and availability. ! Replication transforms client-server

! Replication comes with consistency cost: ! Reasons for replication: Better performance and availability. ! Replication transforms client-server in Replication Continuous and Haifeng Yu CPS 212 Fall 2002! Replication comes with consistency cost:! Reasons for replication: Better performance and availability! Replication transforms -server communication

More information

Paxos Replicated State Machines as the Basis of a High- Performance Data Store

Paxos Replicated State Machines as the Basis of a High- Performance Data Store Paxos Replicated State Machines as the Basis of a High- Performance Data Store William J. Bolosky, Dexter Bradshaw, Randolph B. Haagens, Norbert P. Kusters and Peng Li March 30, 2011 Q: How to build a

More information

Consolidating Concurrency Control and Consensus for Commits under Conflicts

Consolidating Concurrency Control and Consensus for Commits under Conflicts Consolidating Concurrency Control and Consensus for Commits under Conflicts Shuai Mu and Lamont Nelson, New York University; Wyatt Lloyd, University of Southern California; Jinyang Li, New York University

More information

Recovering from a Crash. Three-Phase Commit

Recovering from a Crash. Three-Phase Commit Recovering from a Crash If INIT : abort locally and inform coordinator If Ready, contact another process Q and examine Q s state Lecture 18, page 23 Three-Phase Commit Two phase commit: problem if coordinator

More information

Paxos Made Moderately Complex Made Moderately Simple

Paxos Made Moderately Complex Made Moderately Simple Paxos Made Moderately Complex Made Moderately Simple Tom Anderson and Doug Woos State machine replication Reminder: want to agree on order of ops Can think of operations as a log S1 S2 Lab 3 Paxos? S3

More information

SCALABLE CONSISTENCY AND TRANSACTION MODELS

SCALABLE CONSISTENCY AND TRANSACTION MODELS Data Management in the Cloud SCALABLE CONSISTENCY AND TRANSACTION MODELS 69 Brewer s Conjecture Three properties that are desirable and expected from realworld shared-data systems C: data consistency A:

More information

Strong Consistency & CAP Theorem

Strong Consistency & CAP Theorem Strong Consistency & CAP Theorem CS 240: Computing Systems and Concurrency Lecture 15 Marco Canini Credits: Michael Freedman and Kyle Jamieson developed much of the original material. Consistency models

More information

Byzantine fault tolerance. Jinyang Li With PBFT slides from Liskov

Byzantine fault tolerance. Jinyang Li With PBFT slides from Liskov Byzantine fault tolerance Jinyang Li With PBFT slides from Liskov What we ve learnt so far: tolerate fail-stop failures Traditional RSM tolerates benign failures Node crashes Network partitions A RSM w/

More information

Failure models. Byzantine Fault Tolerance. What can go wrong? Paxos is fail-stop tolerant. BFT model. BFT replication 5/25/18

Failure models. Byzantine Fault Tolerance. What can go wrong? Paxos is fail-stop tolerant. BFT model. BFT replication 5/25/18 Failure models Byzantine Fault Tolerance Fail-stop: nodes either execute the protocol correctly or just stop Byzantine failures: nodes can behave in any arbitrary way Send illegal messages, try to trick

More information

CSE 444: Database Internals. Section 9: 2-Phase Commit and Replication

CSE 444: Database Internals. Section 9: 2-Phase Commit and Replication CSE 444: Database Internals Section 9: 2-Phase Commit and Replication 1 Today 2-Phase Commit Replication 2 Two-Phase Commit Protocol (2PC) One coordinator and many subordinates Phase 1: Prepare Phase 2:

More information

ATOMIC COMMITMENT Or: How to Implement Distributed Transactions in Sharded Databases

ATOMIC COMMITMENT Or: How to Implement Distributed Transactions in Sharded Databases ATOMIC COMMITMENT Or: How to Implement Distributed Transactions in Sharded Databases We talked about transactions and how to implement them in a single-node database. We ll now start looking into how to

More information

ZooKeeper & Curator. CS 475, Spring 2018 Concurrent & Distributed Systems

ZooKeeper & Curator. CS 475, Spring 2018 Concurrent & Distributed Systems ZooKeeper & Curator CS 475, Spring 2018 Concurrent & Distributed Systems Review: Agreement In distributed systems, we have multiple nodes that need to all agree that some object has some state Examples:

More information

Distributed Systems. replication Johan Montelius ID2201. Distributed Systems ID2201

Distributed Systems. replication Johan Montelius ID2201. Distributed Systems ID2201 Distributed Systems ID2201 replication Johan Montelius 1 The problem The problem we have: servers might be unavailable The solution: keep duplicates at different servers 2 Building a fault-tolerant service

More information

CPS 512 midterm exam #1, 10/7/2016

CPS 512 midterm exam #1, 10/7/2016 CPS 512 midterm exam #1, 10/7/2016 Your name please: NetID: Answer all questions. Please attempt to confine your answers to the boxes provided. If you don t know the answer to a question, then just say

More information

CS6450: Distributed Systems Lecture 15. Ryan Stutsman

CS6450: Distributed Systems Lecture 15. Ryan Stutsman Strong Consistency CS6450: Distributed Systems Lecture 15 Ryan Stutsman Material taken/derived from Princeton COS-418 materials created by Michael Freedman and Kyle Jamieson at Princeton University. Licensed

More information

Agreement and Consensus. SWE 622, Spring 2017 Distributed Software Engineering

Agreement and Consensus. SWE 622, Spring 2017 Distributed Software Engineering Agreement and Consensus SWE 622, Spring 2017 Distributed Software Engineering Today General agreement problems Fault tolerance limitations of 2PC 3PC Paxos + ZooKeeper 2 Midterm Recap 200 GMU SWE 622 Midterm

More information

Consensus and related problems

Consensus and related problems Consensus and related problems Today l Consensus l Google s Chubby l Paxos for Chubby Consensus and failures How to make process agree on a value after one or more have proposed what the value should be?

More information

DISTRIBUTED COMPUTER SYSTEMS

DISTRIBUTED COMPUTER SYSTEMS DISTRIBUTED COMPUTER SYSTEMS CONSISTENCY AND REPLICATION CONSISTENCY MODELS Dr. Jack Lange Computer Science Department University of Pittsburgh Fall 2015 Consistency Models Background Replication Motivation

More information

Eris: Coordination-Free Consistent Transactions Using In-Network Concurrency Control

Eris: Coordination-Free Consistent Transactions Using In-Network Concurrency Control Eris: Coordination-Free Consistent Transactions Using In-Network Concurrency Control Jialin Li University of Washington lijl@cs.washington.edu ABSTRACT Distributed storage systems aim to provide strong

More information

EECS 498 Introduction to Distributed Systems

EECS 498 Introduction to Distributed Systems EECS 498 Introduction to Distributed Systems Fall 2017 Harsha V. Madhyastha Implementing RSMs Logical clock based ordering of requests Cannot serve requests if any one replica is down Primary-backup replication

More information

CS6450: Distributed Systems Lecture 11. Ryan Stutsman

CS6450: Distributed Systems Lecture 11. Ryan Stutsman Strong Consistency CS6450: Distributed Systems Lecture 11 Ryan Stutsman Material taken/derived from Princeton COS-418 materials created by Michael Freedman and Kyle Jamieson at Princeton University. Licensed

More information

Distributed Systems 8L for Part IB

Distributed Systems 8L for Part IB Distributed Systems 8L for Part IB Handout 3 Dr. Steven Hand 1 Distributed Mutual Exclusion In first part of course, saw need to coordinate concurrent processes / threads In particular considered how to

More information

Proseminar Distributed Systems Summer Semester Paxos algorithm. Stefan Resmerita

Proseminar Distributed Systems Summer Semester Paxos algorithm. Stefan Resmerita Proseminar Distributed Systems Summer Semester 2016 Paxos algorithm stefan.resmerita@cs.uni-salzburg.at The Paxos algorithm Family of protocols for reaching consensus among distributed agents Agents may

More information

Spanner: Google's Globally-Distributed Database. Presented by Maciej Swiech

Spanner: Google's Globally-Distributed Database. Presented by Maciej Swiech Spanner: Google's Globally-Distributed Database Presented by Maciej Swiech What is Spanner? "...Google's scalable, multi-version, globallydistributed, and synchronously replicated database." What is Spanner?

More information

Leader or Majority: Why have one when you can have both? Improving Read Scalability in Raft-like consensus protocols

Leader or Majority: Why have one when you can have both? Improving Read Scalability in Raft-like consensus protocols Leader or Majority: Why have one when you can have both? Improving Read Scalability in Raft-like consensus protocols Vaibhav Arora, Tanuj Mittal, Divyakant Agrawal, Amr El Abbadi * and Xun Xue, Zhiyanan,

More information

Transactions. CS 475, Spring 2018 Concurrent & Distributed Systems

Transactions. CS 475, Spring 2018 Concurrent & Distributed Systems Transactions CS 475, Spring 2018 Concurrent & Distributed Systems Review: Transactions boolean transfermoney(person from, Person to, float amount){ if(from.balance >= amount) { from.balance = from.balance

More information

NoSQL systems: sharding, replication and consistency. Riccardo Torlone Università Roma Tre

NoSQL systems: sharding, replication and consistency. Riccardo Torlone Università Roma Tre NoSQL systems: sharding, replication and consistency Riccardo Torlone Università Roma Tre Data distribution NoSQL systems: data distributed over large clusters Aggregate is a natural unit to use for data

More information

Modeling, Analyzing, and Extending Megastore using Real-Time Maude

Modeling, Analyzing, and Extending Megastore using Real-Time Maude Modeling, Analyzing, and Extending Megastore using Real-Time Maude Jon Grov 1 and Peter Ölveczky 1,2 1 University of Oslo 2 University of Illinois at Urbana-Champaign Thanks to Indranil Gupta (UIUC) and

More information

Distributed Systems COMP 212. Revision 2 Othon Michail

Distributed Systems COMP 212. Revision 2 Othon Michail Distributed Systems COMP 212 Revision 2 Othon Michail Synchronisation 2/55 How would Lamport s algorithm synchronise the clocks in the following scenario? 3/55 How would Lamport s algorithm synchronise

More information

Recall use of logical clocks

Recall use of logical clocks Causal Consistency Consistency models Linearizability Causal Eventual COS 418: Distributed Systems Lecture 16 Sequential Michael Freedman 2 Recall use of logical clocks Lamport clocks: C(a) < C(z) Conclusion:

More information

Exam 2 Review. Fall 2011

Exam 2 Review. Fall 2011 Exam 2 Review Fall 2011 Question 1 What is a drawback of the token ring election algorithm? Bad question! Token ring mutex vs. Ring election! Ring election: multiple concurrent elections message size grows

More information

Eris: Coordination-Free Consistent Transactions Using In-Network Concurrency Control

Eris: Coordination-Free Consistent Transactions Using In-Network Concurrency Control Eris: Coordination-Free Consistent Transactions Using In-Network Concurrency Control [Extended Version] Jialin Li Ellis Michael Dan R. K. Ports University of Washington {lijl, emichael, drkp}@cs.washington.edu

More information

Replication and Consistency. Fall 2010 Jussi Kangasharju

Replication and Consistency. Fall 2010 Jussi Kangasharju Replication and Consistency Fall 2010 Jussi Kangasharju Chapter Outline Replication Consistency models Distribution protocols Consistency protocols 2 Data Replication user B user C user A object object

More information

Linearizability CMPT 401. Sequential Consistency. Passive Replication

Linearizability CMPT 401. Sequential Consistency. Passive Replication Linearizability CMPT 401 Thursday, March 31, 2005 The execution of a replicated service (potentially with multiple requests interleaved over multiple servers) is said to be linearizable if: The interleaved

More information

Data-Intensive Distributed Computing

Data-Intensive Distributed Computing Data-Intensive Distributed Computing CS 451/651 (Fall 2018) Part 7: Mutable State (2/2) November 13, 2018 Jimmy Lin David R. Cheriton School of Computer Science University of Waterloo These slides are

More information

Distributed Consensus: Making Impossible Possible

Distributed Consensus: Making Impossible Possible Distributed Consensus: Making Impossible Possible Heidi Howard PhD Student @ University of Cambridge heidi.howard@cl.cam.ac.uk @heidiann360 hh360.user.srcf.net Sometimes inconsistency is not an option

More information

Lecture 6 Consistency and Replication

Lecture 6 Consistency and Replication Lecture 6 Consistency and Replication Prof. Wilson Rivera University of Puerto Rico at Mayaguez Electrical and Computer Engineering Department Outline Data-centric consistency Client-centric consistency

More information

Robust BFT Protocols

Robust BFT Protocols Robust BFT Protocols Sonia Ben Mokhtar, LIRIS, CNRS, Lyon Joint work with Pierre Louis Aublin, Grenoble university Vivien Quéma, Grenoble INP 18/10/2013 Who am I? CNRS reseacher, LIRIS lab, DRIM research

More information

Introduction to riak_ensemble. Joseph Blomstedt Basho Technologies

Introduction to riak_ensemble. Joseph Blomstedt Basho Technologies Introduction to riak_ensemble Joseph Blomstedt (@jtuple) Basho Technologies riak_ensemble Paxos framework for scalable consistent system 2 node node node node node node node node 3 What about state? 4

More information

Paxos Made Live. An Engineering Perspective. Authors: Tushar Chandra, Robert Griesemer, Joshua Redstone. Presented By: Dipendra Kumar Jha

Paxos Made Live. An Engineering Perspective. Authors: Tushar Chandra, Robert Griesemer, Joshua Redstone. Presented By: Dipendra Kumar Jha Paxos Made Live An Engineering Perspective Authors: Tushar Chandra, Robert Griesemer, Joshua Redstone Presented By: Dipendra Kumar Jha Consensus Algorithms Consensus: process of agreeing on one result

More information

Using MVCC for Clustered Databases

Using MVCC for Clustered Databases Using MVCC for Clustered Databases structure introduction, scope and terms life-cycle of a transaction in Postgres-R write scalability tests results and their analysis 2 focus: cluster high availability,

More information

Distributed Coordination with ZooKeeper - Theory and Practice. Simon Tao EMC Labs of China Oct. 24th, 2015

Distributed Coordination with ZooKeeper - Theory and Practice. Simon Tao EMC Labs of China Oct. 24th, 2015 Distributed Coordination with ZooKeeper - Theory and Practice Simon Tao EMC Labs of China {simon.tao@emc.com} Oct. 24th, 2015 Agenda 1. ZooKeeper Overview 2. Coordination in Spring XD 3. ZooKeeper Under

More information

Horizontal or vertical scalability? Horizontal scaling is challenging. Today. Scaling Out Key-Value Storage

Horizontal or vertical scalability? Horizontal scaling is challenging. Today. Scaling Out Key-Value Storage Horizontal or vertical scalability? Scaling Out Key-Value Storage COS 418: Distributed Systems Lecture 8 Kyle Jamieson Vertical Scaling Horizontal Scaling [Selected content adapted from M. Freedman, B.

More information

Distributed Systems. 19. Spanner. Paul Krzyzanowski. Rutgers University. Fall 2017

Distributed Systems. 19. Spanner. Paul Krzyzanowski. Rutgers University. Fall 2017 Distributed Systems 19. Spanner Paul Krzyzanowski Rutgers University Fall 2017 November 20, 2017 2014-2017 Paul Krzyzanowski 1 Spanner (Google s successor to Bigtable sort of) 2 Spanner Take Bigtable and

More information

Group Replication: A Journey to the Group Communication Core. Alfranio Correia Principal Software Engineer

Group Replication: A Journey to the Group Communication Core. Alfranio Correia Principal Software Engineer Group Replication: A Journey to the Group Communication Core Alfranio Correia (alfranio.correia@oracle.com) Principal Software Engineer 4th of February Copyright 7, Oracle and/or its affiliates. All rights

More information

Distributed Systems. 09. State Machine Replication & Virtual Synchrony. Paul Krzyzanowski. Rutgers University. Fall Paul Krzyzanowski

Distributed Systems. 09. State Machine Replication & Virtual Synchrony. Paul Krzyzanowski. Rutgers University. Fall Paul Krzyzanowski Distributed Systems 09. State Machine Replication & Virtual Synchrony Paul Krzyzanowski Rutgers University Fall 2016 1 State machine replication 2 State machine replication We want high scalability and

More information

CMU SCS CMU SCS Who: What: When: Where: Why: CMU SCS

CMU SCS CMU SCS Who: What: When: Where: Why: CMU SCS Carnegie Mellon Univ. Dept. of Computer Science 15-415/615 - DB s C. Faloutsos A. Pavlo Lecture#23: Distributed Database Systems (R&G ch. 22) Administrivia Final Exam Who: You What: R&G Chapters 15-22

More information

Zyzzyva. Speculative Byzantine Fault Tolerance. Ramakrishna Kotla. L. Alvisi, M. Dahlin, A. Clement, E. Wong University of Texas at Austin

Zyzzyva. Speculative Byzantine Fault Tolerance. Ramakrishna Kotla. L. Alvisi, M. Dahlin, A. Clement, E. Wong University of Texas at Austin Zyzzyva Speculative Byzantine Fault Tolerance Ramakrishna Kotla L. Alvisi, M. Dahlin, A. Clement, E. Wong University of Texas at Austin The Goal Transform high-performance service into high-performance

More information

Replication and Consistency

Replication and Consistency Replication and Consistency Today l Replication l Consistency models l Consistency protocols The value of replication For reliability and availability Avoid problems with disconnection, data corruption,

More information

Introduction to MySQL InnoDB Cluster

Introduction to MySQL InnoDB Cluster 1 / 148 2 / 148 3 / 148 Introduction to MySQL InnoDB Cluster MySQL High Availability made easy Percona Live Europe - Dublin 2017 Frédéric Descamps - MySQL Community Manager - Oracle 4 / 148 Safe Harbor

More information