Designing Distributed Systems using Approximate Synchrony in Data Center Networks

Designing Distributed Systems using Approximate Synchrony in Data Center Networks Dan R. K. Ports Jialin Li Naveen Kr. Sharma Vincent Liu Arvind Krishnamurthy University of Washington CSE

Today's most popular applications are distributed systems in the data center. Modern data center: ~50,000 commodity servers, constant server failures.

How do we program the data center? Use distributed algorithms to tolerate failures and inconsistencies. Example: Paxos state machine replication.

Distributed systems and networks are typically designed independently: the algorithms assume an asynchronous network (as on the Internet), where packets may be arbitrarily dropped, delayed, or reordered.

Data center networks are different!

Data Center Networks Are Different. Data center networks are more predictable: known topology, routes, predictable latencies. Data center networks are more reliable. Data center networks are extensible: a single administrative domain makes changes possible, and software-defined networking exposes sophisticated line-rate processing capability.

We should co-design distributed systems and data center networks!

Co-Designing Networks and Distributed Systems: design the data center network to support distributed applications, and design distributed applications around the properties of the data center network.

This Talk: a concrete instantiation, improving replication performance using Speculative Paxos (a new replication protocol) and Mostly-Ordered Multicast (a new network primitive). 3x throughput and 40% lower latency than the conventional approach.

Outline 1. Co-designing Distributed Systems and Data Center Networks 2. Background: State Machine Replication & Paxos 3. Mostly-Ordered Multicast and Speculative Paxos 4. Evaluation

State Machine Replication. Used to tolerate failures in data center applications: keeping critical management services online (e.g., Google's Chubby, ZooKeeper) and providing persistent storage in distributed databases (e.g., Spanner, H-Store). Strongly consistent (linearizable) replication, i.e., all replicas execute the same operations in the same order, even when up to half the replicas fail and even when messages are lost.
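To make the replication model concrete, here is a minimal sketch (not from the talk; all names are illustrative) of a deterministic state machine: as long as every replica applies the same operations in the same order, all replicas stay identical.

```python
# Minimal sketch of a replicated state machine: each replica applies the
# same deterministic operations in the same order, so all replicas stay
# identical. Names (KVStateMachine, apply) are illustrative, not from the talk.

class KVStateMachine:
    """A deterministic key-value state machine."""
    def __init__(self):
        self.data = {}

    def apply(self, op):
        kind, key, *rest = op
        if kind == "put":
            self.data[key] = rest[0]
            return "ok"
        elif kind == "get":
            return self.data.get(key)

# Two replicas that receive the same operations in the same order
# end up in the same state.
log = [("put", "x", 1), ("put", "y", 2), ("get", "x")]
r1, r2 = KVStateMachine(), KVStateMachine()
for op in log:
    r1.apply(op)
    r2.apply(op)
assert r1.data == r2.data
```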

Example: Paxos. The client sends a request to the leader replica; the leader sends prepare to the other replicas; they respond with prepareok; the leader executes the request and replies to the client, then sends commit to the replicas. Latency: 4 message delays. Throughput: the bottleneck (leader) replica processes 2n messages per request.
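As a rough illustration of why the leader is the bottleneck, the sketch below (a simplification with assumed message flow, not the talk's code) counts the normal-case messages each replica handles for one client request when the leader waits for a majority of prepareoks.

```python
# Rough message-count sketch for the leader-based Paxos/VR normal case.
# Assumes n replicas, leader waits for a majority of prepareok responses,
# and sends commit to all followers. Purely illustrative.

def normal_case_message_counts(n):
    majority = n // 2 + 1
    leader = 0
    follower = 0

    leader += 1              # receive the client request
    leader += n - 1          # send prepare to each follower
    follower += 1            # each follower receives prepare
    follower += 1            # each follower sends prepareok
    leader += majority - 1   # leader receives prepareoks from a majority (minus itself)
    leader += 1              # reply to the client
    leader += n - 1          # send commit to followers
    follower += 1            # each follower receives commit
    return leader, follower

leader_msgs, follower_msgs = normal_case_message_counts(3)
print("leader handles", leader_msgs, "messages; each follower handles", follower_msgs)
# The leader handles roughly 2n messages per request, so it bottlenecks throughput.
```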

Outline 1. Co-designing Distributed Systems and Data Center Networks 2. Background: State Machine Replication & Paxos 3. Mostly-Ordered Multicast and Speculative Paxos 4. Evaluation

Improving Paxos Performance. Paxos requires a leader replica to order requests. Can we use the network instead? Engineer the network to provide Mostly-Ordered Multicast (MOM) - best-effort ordering of multicasts. New replication protocol: Speculative Paxos - commits most operations in a single round trip.

Mostly-Ordered Multicast. Property: concurrent messages are ordered - if any node receives message A then B, then all other receivers process them in the same order. Best effort, not guaranteed: the property can be violated in the event of network failure. Practical to implement, but not satisfied by existing multicast protocols!
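The property can be stated pairwise: any two multicast messages delivered at two receivers appear in the same relative order at both. A small checker sketch (illustrative only, not part of MOM itself):

```python
# Sketch of the mostly-ordered multicast property: for every pair of
# multicast messages that two receivers both deliver, they deliver them
# in the same relative order. Illustrative checker, not MOM itself.
from itertools import combinations

def same_relative_order(log_a, log_b):
    """Return True if every pair of messages common to both delivery logs
    appears in the same order in each."""
    pos_a = {m: i for i, m in enumerate(log_a)}
    pos_b = {m: i for i, m in enumerate(log_b)}
    common = set(pos_a) & set(pos_b)
    for m1, m2 in combinations(common, 2):
        if (pos_a[m1] < pos_a[m2]) != (pos_b[m1] < pos_b[m2]):
            return False
    return True

# Receivers 1 and 2 agree; receiver 3 saw B before A -> ordering violation.
r1 = ["A", "B", "C"]
r2 = ["A", "B", "C"]
r3 = ["B", "A", "C"]
print(same_relative_order(r1, r2))  # True
print(same_relative_order(r1, r3))  # False
```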

Mostly-Ordered Multicast. Different path lengths and congestion cause reordering. MOM approach: route multicast messages through a root switch equidistant from the receivers.

MOM Design Options (from less network support to better ordering): 1. Topology-Aware Multicast: route packets to a randomly-chosen root switch. 2. High-Priority Multicast: use a higher QoS priority to avoid link congestion. 3. Network Serialization: route all packets through a single root switch.
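The intuition behind topology-aware multicast is that if every copy of a given message traverses a root (core) switch equidistant from all receivers, the copies take equal-length paths and arrive nearly simultaneously, so concurrent messages are rarely reordered. A toy sketch under assumed fat-tree-like naming (all switch and host names are hypothetical):

```python
# Toy sketch of topology-aware multicast: each multicast message is routed
# through one root (core) switch that is equidistant from all receivers, so
# all copies of that message traverse equal-length paths and arrive at the
# receivers at roughly the same time. Switch/host names are hypothetical.
import random

ROOT_SWITCHES = ["core-0", "core-1", "core-2", "core-3"]

def multicast(sender, receivers, payload):
    # A root is chosen per message (here: randomly, as in topology-aware
    # multicast); network serialization would instead pin ALL messages to
    # one fixed root switch for even stronger ordering.
    root = random.choice(ROOT_SWITCHES)
    return {r: {"payload": payload, "path": [sender, root, r]} for r in receivers}

copies = multicast("client-7", ["replica-1", "replica-2", "replica-3"], "op#42")
for r, c in copies.items():
    print(r, "via", " -> ".join(c["path"]))
```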

Speculative Paxos. A new state machine replication protocol. It relies on MOM to order requests in the normal case, but MOM is not required for correctness: the protocol remains safe even with reorderings, and live under usual conditions.

Speculative Paxos. The client multicasts its request to the replicas; the replicas immediately speculatively execute the request and reply with spec-reply(result, hash). The client checks for matching responses from a 3/4 superquorum. Latency: 2 message delays (vs 4). No bottleneck replica: each replica processes only 2 messages.
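A minimal sketch of the client-side check follows. Assumptions (mine, not stated on the slide beyond "3/4 superquorum"): replies carry a hash summarizing the replica's log, and the superquorum size is three quarters of the replicas rounded up; the protocol's exact formula may differ.

```python
# Sketch of the Speculative Paxos client-side check: the operation is
# considered committed only if a superquorum (here: 3/4 of replicas) of
# speculative replies carry matching log hashes. Reply format is assumed.
import math
from collections import Counter

def superquorum_size(n):
    # The slide states a 3/4 superquorum (vs. a simple majority for Paxos).
    return math.ceil(3 * n / 4)

def speculative_commit(replies, n):
    """replies: list of (result, log_hash) speculative replies received."""
    hashes = Counter(log_hash for _, log_hash in replies)
    best_hash, count = hashes.most_common(1)[0]
    if count >= superquorum_size(n):
        result = next(res for res, h in replies if h == best_hash)
        return True, result        # success in a single round trip
    return False, None             # fall back to synchronization / retry

n = 5
replies = [("ok", "h1"), ("ok", "h1"), ("ok", "h1"), ("ok", "h1"), ("ok", "h2")]
print(speculative_commit(replies, n))   # (True, 'ok'): 4 of 5 hashes match
```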

Speculative Execution. Replicas execute requests speculatively and might have to roll back operations. Clients know their requests succeeded: they check for matching hashes in the replies, which means clients don't need to speculate. Similar to Zyzzyva [SOSP '07].
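A replica therefore needs enough undo information to roll back speculatively executed operations if a different order is later chosen. A minimal sketch (illustrative only; the talk reports that rollback took under 250 lines of code in their key-value store):

```python
# Sketch of speculative execution with rollback at a replica: each applied
# operation records enough undo information to restore the previous state
# if reconciliation later chooses a different order. Illustrative only.

class SpeculativeKVReplica:
    def __init__(self):
        self.data = {}
        self.log = []          # (seqno, op, undo) for speculative operations

    def spec_exec(self, seqno, op):
        kind, key, *rest = op
        undo = (key, self.data.get(key), key in self.data)  # capture before mutating
        if kind == "put":
            self.data[key] = rest[0]
        self.log.append((seqno, op, undo))
        return hash(tuple(s for s, _, _ in self.log))  # summary hash for spec-reply

    def rollback_to(self, seqno):
        """Undo every speculative operation after `seqno`, newest first."""
        while self.log and self.log[-1][0] > seqno:
            _, _, (key, old_value, existed) = self.log.pop()
            if existed:
                self.data[key] = old_value
            else:
                self.data.pop(key, None)

r = SpeculativeKVReplica()
r.spec_exec(1, ("put", "x", 1))
r.spec_exec(2, ("put", "x", 2))
r.rollback_to(1)
print(r.data)   # {'x': 1}: the second, speculative write was undone
```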

Handling Ordering Violations. What if replicas don't execute requests in the same order? Replicas periodically run a synchronization protocol. If divergence is detected, reconciliation: replicas pause execution, select a leader, and send their logs; the leader decides an ordering for the operations and notifies the replicas; replicas roll back and re-execute the requests in the proper order. Note: the 3/4 superquorum requirement ensures a new leader can always be sure which requests succeeded, even if half the replicas fail. [cf. Fast Paxos]
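A simplified sketch of that synchronization step: replicas compare log summaries, and on divergence a leader merges the logs into one order for everyone to roll back to and re-execute. The merge rule below is a toy stand-in; the real protocol must additionally preserve any operation that already matched at a superquorum.

```python
# Simplified sketch of the periodic synchronization / reconciliation step:
# replicas compare their logs; on divergence a leader decides a single
# order, and replicas roll back and re-execute in that order. Illustrative.

def logs_diverge(logs):
    """logs: list of per-replica operation lists (operation ids, in order)."""
    return len({tuple(log) for log in logs}) > 1

def reconcile(logs):
    """Toy leader rule: keep the order of the longest log, then append any
    operations seen only by other replicas."""
    merged = list(max(logs, key=len))
    seen = set(merged)
    for log in logs:
        for op in log:
            if op not in seen:
                merged.append(op)
                seen.add(op)
    return merged

replica_logs = [["a", "b", "c"], ["a", "c", "b"], ["a", "b", "c"]]
if logs_diverge(replica_logs):
    order = reconcile(replica_logs)
    print("reconciled order:", order)   # all replicas roll back and re-execute
```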

Outline 1. Co-designing Distributed Systems and Data Center Networks 2. Background: State Machine Replication & Paxos 3. Mostly-Ordered Multicast and Speculative Paxos 4. Evaluation

Evaluation Setup. 12-switch fat-tree testbed, 1 Gb / 10 Gb Ethernet, 3 replicas (2.27 GHz Xeon L5640). MOM scalability experiments: 2,560-host simulated fat-tree data center network, with background traffic from Microsoft data center measurements.

SpecPaxos Improves Latency and Throughput (emulated data center network with MOMs). Plot: latency (µs) vs. throughput (ops/second, 0-100,000) for SpecPaxos, Paxos, Fast Paxos, and Paxos with batching. SpecPaxos achieves 3x the throughput and 40% lower latency than Paxos, with better latency than Fast Paxos and the same throughput as batching!

MOMs Provide Necessary Support. Plot: throughput (up to ~120,000 ops/second) vs. simulated packet reordering rate (0.001% to 1%) for Speculative Paxos and Paxos.

MOM Ordering Effectiveness: ordering violation rates.

                        Testbed (12 switches)    Simulation (119 switches, 2560 hosts)
Regular Multicast       1-10%                    1-2%
Topology-Aware MOM      0.001%-0.05%             0.01%-0.1%
Network Serialization   ~0%                      ~0%

Application Performance. Transactional key-value store (2PC + OCC); synthetic workload based on the Retwis Twitter clone; < 250 LOC required to implement rollback. Measured transactions/sec that meet a 10 ms SLO. Bar chart: max throughput (transactions/second, 0-6,000) for Paxos, Paxos+Batching, Fast Paxos, and SpecPaxos.

Summary. A new approach to building distributed systems, based on co-designing them with the data center network. Dramatic performance improvement for replication by combining the MOM network primitive for best-effort ordering with Speculative Paxos, an efficient replication protocol. This is only the first step for co-designing distributed systems and data center networks!