No compromises: distributed transactions with consistency, availability, and performance


No compromises: distributed transactions with consistency, availability, and performance
Aleksandar Dragojevic, Dushyanth Narayanan, Edmund B. Nightingale, Matthew Renzelmann, Alex Shamis, Anirudh Badam, Miguel Castro

FaRM
A main-memory distributed computing platform that provides distributed ACID transactions with serializability, high availability, and high performance.
Exploits two hardware trends to eliminate storage and network bottlenecks:
  Fast commodity networks with RDMA
  An inexpensive approach to non-volatile DRAM
Primary-backup replication and unreplicated coordinators reduce message counts compared with Paxos.
One-sided RDMA operations and parallel recovery.

Non-volatile DRAM
A distributed UPS makes DRAM durable: lithium-ion batteries supply the energy to save the contents of memory to SSD on power failure.
Cost:
  Energy cost: $0.55/GB
  Storage cost (reserving SSD capacity): $0.9/GB
  Roughly 15% of DRAM cost (NVDIMMs cost 3-5x more than DRAM)
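As a rough sanity check on the cost claim (my own illustrative arithmetic, not from the slide): the two components add up to $0.55/GB + $0.9/GB = $1.45/GB, which is about 15% of DRAM cost if DRAM is priced near $10/GB.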

Programming Model and Architecture
Abstraction of a global address space that spans the machines in the cluster.
The FaRM API provides transparent access to local and remote objects within transactions; a sketch of the programming model follows.
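The sketch below mocks up what such a transactional API might look like to an application. All names (txCreate, txRead, txWrite, txCommit) and the in-process map standing in for the global address space are illustrative assumptions, not FaRM's actual interface; the real system resolves addresses to possibly remote machines, reads them with one-sided RDMA, and handles placement and replication in the runtime.

// Single-process mock of a FaRM-style transactional API, for illustration only.
#include <cstdint>
#include <iostream>
#include <map>

using Addr = uint64_t;                                  // global object address
struct Tx { std::map<Addr, int64_t> writes; };          // buffered writes

static std::map<Addr, int64_t> g_store;                 // stands in for the global address space

Tx*     txCreate()                          { return new Tx(); }
int64_t txRead(Tx*, Addr a)                 { return g_store[a]; }   // local or RDMA read in FaRM
void    txWrite(Tx* tx, Addr a, int64_t v)  { tx->writes[a] = v; }
bool    txCommit(Tx* tx) {                              // real commit: lock, validate, replicate
  for (auto& [a, v] : tx->writes) g_store[a] = v;
  delete tx;
  return true;
}

int main() {
  g_store[1] = 100; g_store[2] = 0;                     // two "accounts"
  Tx* tx = txCreate();
  int64_t from = txRead(tx, 1), to = txRead(tx, 2);
  txWrite(tx, 1, from - 30);
  txWrite(tx, 2, to + 30);
  std::cout << (txCommit(tx) ? "committed" : "aborted") << "\n";
  std::cout << g_store[1] << " " << g_store[2] << "\n"; // 70 30
}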

FaRM Architecture

Architecture
Configuration: <i, S, F, CM>
  i: 64-bit unique configuration identifier
  S: set of machines
  F: mapping of machines to failure domains
  CM: configuration manager
Zookeeper ensures machines agree on the current configuration and stores it; it is not used for managing leases, detecting failures, etc.
Fault tolerance: each region has one primary and f backup replicas. The CM allocates new (GB-sized) regions on a primary and its replicas, and the allocation commits only if all replicas succeed.
Messaging: ring-buffer-based send/receive pairs. The sender appends records to the log using one-sided RDMA writes, and the receiver periodically polls the head of the log (sketched below).
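A simplified, single-process sketch of the ring-buffer log: the sender appends length-prefixed records (in FaRM this is a one-sided RDMA write into the receiver's memory) and the receiver polls the head. Layout and names are assumptions for illustration; the real log is circular, whereas this sketch simply reports the buffer as full instead of wrapping.

// Illustrative ring-buffer message log (simplified: no wrap-around).
#include <cstdint>
#include <cstring>
#include <iostream>
#include <string>
#include <vector>

struct RingLog {
  std::vector<uint8_t> buf;
  size_t head = 0, tail = 0;           // head: next record to read, tail: next write offset
  explicit RingLog(size_t n) : buf(n) {}

  // What the sender's one-sided RDMA write accomplishes: append a length-prefixed record.
  bool append(const std::string& msg) {
    uint32_t len = static_cast<uint32_t>(msg.size());
    if (tail + sizeof(len) + len > buf.size()) return false;   // full (the real log wraps)
    std::memcpy(&buf[tail], &len, sizeof(len));
    std::memcpy(&buf[tail + sizeof(len)], msg.data(), len);
    tail += sizeof(len) + len;
    return true;
  }

  // The receiver periodically polls the head of the log for new records.
  bool poll(std::string* out) {
    if (head == tail) return false;                            // nothing new
    uint32_t len;
    std::memcpy(&len, &buf[head], sizeof(len));
    out->assign(reinterpret_cast<const char*>(&buf[head + sizeof(len)]), len);
    head += sizeof(len) + len;                                 // consumed; space can be truncated
    return true;
  }
};

int main() {
  RingLog log(1 << 12);
  log.append("LOCK tid=7");
  log.append("COMMIT-BACKUP tid=7");
  std::string rec;
  while (log.poll(&rec)) std::cout << rec << "\n";
}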

Distributed Transactions and Replication
Commit protocol phases:
  Lock
  Validate
  Commit backups
  Commit primaries
  Truncate
A simplified walk-through of these phases is sketched below.
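The sketch below walks through the phases named on the slide in a single process, with in-memory structs standing in for one primary's object store. In FaRM each phase is a log record written with a one-sided RDMA write, and the backup phase (elided here) must be hardware-acknowledged before COMMIT-PRIMARY; all names and data layouts below are assumptions, not FaRM's implementation.

// Single-process illustration of LOCK / VALIDATE / COMMIT-BACKUP /
// COMMIT-PRIMARY / TRUNCATE. Replication, RDMA, and logging are elided
// or reduced to comments.
#include <cstdint>
#include <iostream>
#include <map>
#include <vector>

struct Object { uint64_t version = 0; bool locked = false; int64_t value = 0; };
struct Primary { std::map<int, Object> objs; };              // one primary's store

bool commit(Primary& p,
            const std::map<int, uint64_t>& readSet,          // oid -> version observed
            const std::map<int, int64_t>&  writeSet) {       // oid -> new value
  // LOCK: lock every written object; abort if it is already locked or its
  // version changed since it was read.
  std::vector<int> locked;
  for (const auto& [oid, v] : writeSet) {
    Object& o = p.objs[oid];
    bool stale = readSet.count(oid) && o.version != readSet.at(oid);
    if (o.locked || stale) {
      for (int id : locked) p.objs[id].locked = false;       // abort: release locks
      return false;
    }
    o.locked = true;
    locked.push_back(oid);
  }
  // VALIDATE: re-check the versions of objects that were only read.
  for (const auto& [oid, ver] : readSet) {
    if (!writeSet.count(oid) && p.objs[oid].version != ver) {
      for (int id : locked) p.objs[id].locked = false;       // abort
      return false;
    }
  }
  // COMMIT-BACKUP (elided): write the new values to every backup's log and
  // wait for the hardware acknowledgements before touching the primaries.
  // COMMIT-PRIMARY: install new values, bump versions, release locks.
  for (const auto& [oid, v] : writeSet) {
    Object& o = p.objs[oid];
    o.value = v; o.version++; o.locked = false;
  }
  // TRUNCATE (elided): participants discard the transaction's log records
  // once everyone has applied them.
  return true;
}

int main() {
  Primary p;
  p.objs[1] = {1, false, 100};
  std::map<int, uint64_t> reads{{1, 1}};
  std::map<int, int64_t>  writes{{1, 50}};
  std::cout << (commit(p, reads, writes) ? "committed" : "aborted") << "\n";
}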

Correctness and Performance
Correctness:
  Locking ensures serialization of writes, and validation ensures serialization of reads.
  Serializability across failures: the coordinator waits for hardware acks from all backups before writing COMMIT-PRIMARY.
  The coordinator reserves log space at all participants to avoid involving the backups' CPUs.

Correctness and Performance
Performance:
  Two-phase commit over replicated state machines (as in Spanner) requires 2f+1 replicas to tolerate f failures, and each state-machine operation takes 2f+1 round-trip messages, for roughly 4P(2f+1) messages with P participants.
  FaRM uses primary-backup replication instead of Paxos state-machine replication, so f+1 copies suffice.
  Coordinator state is not replicated.
  The commit phase uses Pw(f+3) one-sided RDMA writes (a worked comparison follows).
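Taking the slide's formulas at face value and plugging in small numbers (my own illustration, reading Pw as the number of participants whose objects are written): with f = 1 and a transaction that writes objects on P = Pw = 4 participants, the Paxos-based design needs about 4 * 4 * (2*1 + 1) = 48 messages, while FaRM's commit phase needs 4 * (1 + 3) = 16 one-sided RDMA writes, which also avoid involving the backups' CPUs (per the previous slide).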

Failure Recovery
Durability and high availability through replication.
Machines can fail by crashing but can recover their data using non-volatile memory.
Durability for all committed transactions even if the entire cluster fails or loses power, since the data are persisted in non-volatile DRAM.
Tolerates f non-volatile DRAM failures.

Failure Detection
Each machine holds a lease at the CM and the CM holds a lease at every other machine.
Expiration of any lease triggers failure recovery.
Short leases (5 ms) guarantee high availability.
To keep lease renewal timely:
  Dedicated queue pairs for leases
  The lease manager uses InfiniBand with the connectionless unreliable datagram transport
  A dedicated lease manager thread runs at the highest user-space priority
  Memory for the lease manager is preallocated
A minimal sketch of lease-based failure detection follows.
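A minimal sketch of the lease idea, assuming renewals arrive as messages handled by a dedicated thread (here reduced to plain function calls); the 5 ms lease period is from the slide, everything else is illustrative.

// Toy lease table: the CM records the last renewal time per machine and
// treats any lease older than 5 ms as expired, which in FaRM would trigger
// reconfiguration. Transport, threading, and priorities are elided.
#include <chrono>
#include <iostream>
#include <map>
#include <thread>
#include <vector>

using Clock = std::chrono::steady_clock;
constexpr std::chrono::milliseconds kLease{5};               // 5 ms leases (from the slide)

struct LeaseTable {
  std::map<int, Clock::time_point> lastRenewal;              // machine id -> last renewal

  void renew(int machine) { lastRenewal[machine] = Clock::now(); }

  std::vector<int> expired() const {                         // machines to suspect
    std::vector<int> out;
    const auto now = Clock::now();
    for (const auto& [id, t] : lastRenewal)
      if (now - t > kLease) out.push_back(id);
    return out;
  }
};

int main() {
  LeaseTable cm;
  cm.renew(1);
  cm.renew(2);                                               // machine 2 then goes silent
  std::this_thread::sleep_for(std::chrono::milliseconds(6));
  cm.renew(1);                                               // machine 1 keeps renewing
  for (int id : cm.expired()) std::cout << "suspect machine " << id << "\n";   // prints 2
}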

Reconfiguration
Steps:
  Suspect
  Probe
  Update configuration
  Remap regions
  Send new configuration
  Apply new configuration
  Commit new configuration
The enum sketch below annotates each step.
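Purely for orientation, the enum below lists the steps in order; the one-line comments are my paraphrase of what each step does and are not taken verbatim from the slide.

// Reconfiguration steps as an ordered enum; the driver just prints the sequence.
#include <iostream>

enum class Step {
  Suspect,        // a lease expires, so a machine is suspected of failure
  Probe,          // the machine attempting to become CM probes the others
  UpdateConfig,   // install the new configuration <i+1, S, F, CM> via Zookeeper
  RemapRegions,   // reassign regions from failed machines to restore f+1 copies
  SendConfig,     // send the new configuration to all machines
  ApplyConfig,    // machines apply it and acknowledge
  CommitConfig,   // the CM commits once all acknowledgements arrive
  Done
};

const char* name(Step s) {
  switch (s) {
    case Step::Suspect:      return "Suspect";
    case Step::Probe:        return "Probe";
    case Step::UpdateConfig: return "Update configuration";
    case Step::RemapRegions: return "Remap regions";
    case Step::SendConfig:   return "Send new configuration";
    case Step::ApplyConfig:  return "Apply new configuration";
    case Step::CommitConfig: return "Commit new configuration";
    default:                 return "Done";
  }
}

int main() {
  for (Step s = Step::Suspect; s != Step::Done; s = static_cast<Step>(static_cast<int>(s) + 1))
    std::cout << name(s) << "\n";
}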

Transaction State Recovery
Steps:
  Block access to recovering regions
  Drain logs
  Find recovering transactions
  Lock recovery
  Replicate log records
  Vote
  Decide

Evaluation
Setup:
  90 machines for the FaRM cluster and 5 machines for replicated Zookeeper
  256 GB of DRAM and two 8-core Intel E5 CPUs per machine
  56 Gbps InfiniBand NICs
Benchmarks:
  TATP (Telecommunication Application Transaction Processing)
  TPC-C, a well-known database benchmark with complex transactions

Performance

Failure Recovery

CM Failure

Conclusion
FaRM, a main-memory distributed computing platform.
Distributed transactions and replication.
Strict serializability and high performance.
Primary-backup replication; coordinators are not replicated.
High throughput, low latency, and fast recovery.