No Compromises: Distributed Transactions with Consistency, Availability, and Performance

No Compromises: Distributed Transactions with Consistency, Availability, and Performance. Aleksandar Dragojevic, Dushyanth Narayanan, Edmund B. Nightingale, Matthew Renzelmann, Alex Shamis, Anirudh Badam, Miguel Castro (Microsoft Research)

Why
- Transactions with high availability & strict serializability
- Abstraction: an endlessly running single machine that executes one transaction at a time, in a consistent order
- Issue: traditionally poor performance, so systems give something up instead: no transactions, weak consistency, or single-machine-only transactions

What
- Distributed ACID transactions with:
  - Strict serializability
  - High availability
  - High throughput
  - Low latency

What - FaRM Performance
- Peak throughput: 140 million TATP transactions/sec
- 90 machines
- 4.9 TB database
- Recovery in < 50 ms

How
- RDMA
- Cheap non-volatile DRAM

How - Design
1) Reduce message counts
2) Use RDMA reads/writes
3) Exploit parallelism
Assumptions: the network is fast; we are CPU-bound

Hardware - DRAM
- A distributed UPS makes DRAM effectively durable: on power failure, DRAM is written to SSD, so the SSD is written only rarely
- Cheaper than NVDIMMs
- Cost is energy-bound (the UPS only needs enough energy to flush DRAM)

Hardware - RDMA
- One-sided RDMA uses no remote CPU
- More than 2x the throughput of RPC (more than 4x after optimization)
- Bottleneck: NIC message rate
- After optimizations: CPU-bound

Software - Programming Model
- Global address space (2 GB regions; 1 primary, f backups each)
- The CM (configuration manager) manages the configuration with a monotonically increasing counter & chooses replicas (magic)
- A 2PC-style protocol allocates each new region
- Transactions operate on local & remote objects
- Transactions (OCC)
  - Any app thread can start a transaction; that thread becomes the coordinator
  - The thread can execute arbitrary logic, e.g. read/write/allocate/free objects
  - It calls FaRM to commit at the end, using a 4-stage commit
- Strict serializability
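
To make the model concrete, here is a single-machine toy with hypothetical names (Store, Tx, Addr): a thread begins a transaction, reads and writes objects, and commit() performs OCC-style version validation. It deliberately ignores the distributed address space, RDMA, and replication; it only sketches the transaction semantics described on this slide.

```cpp
#include <cstdint>
#include <map>
#include <mutex>
#include <cstdio>

using Addr = uint64_t;

struct Object { uint64_t version = 0; int64_t value = 0; };

class Store {                          // stands in for the global address space
    std::map<Addr, Object> objects_;
    std::mutex mu_;
    friend class Tx;
public:
    void put(Addr a, int64_t v) { std::lock_guard<std::mutex> g(mu_); objects_[a] = {1, v}; }
    int64_t get(Addr a) { std::lock_guard<std::mutex> g(mu_); return objects_[a].value; }
};

class Tx {                             // the calling thread is the coordinator
    Store& s_;
    std::map<Addr, uint64_t> readVersions_;   // read set: address -> version seen
    std::map<Addr, int64_t>  writes_;         // write set, buffered until commit
public:
    explicit Tx(Store& s) : s_(s) {}

    int64_t read(Addr a) {
        if (auto it = writes_.find(a); it != writes_.end()) return it->second;
        std::lock_guard<std::mutex> g(s_.mu_);
        const Object& o = s_.objects_[a];
        readVersions_[a] = o.version;          // remember the version we read
        return o.value;
    }
    void write(Addr a, int64_t v) { writes_[a] = v; }

    // OCC commit: validate that nothing we read has changed, then install
    // the buffered writes and bump versions. (FaRM does this with the
    // distributed lock / validate / commit-backup / commit-primary protocol.)
    bool commit() {
        std::lock_guard<std::mutex> g(s_.mu_);
        for (auto& [a, v] : readVersions_)
            if (s_.objects_[a].version != v) return false;   // conflict: abort
        for (auto& [a, v] : writes_) {
            Object& o = s_.objects_[a];
            o.value = v;
            ++o.version;
        }
        return true;
    }
};

int main() {
    Store store;
    store.put(1, 100);
    store.put(2, 0);

    Tx tx(store);                       // app thread starts a transaction
    int64_t a = tx.read(1), b = tx.read(2);
    tx.write(1, a - 10);                // arbitrary logic between begin and commit
    tx.write(2, b + 10);
    std::printf("committed: %s\n", tx.commit() ? "yes" : "no");
    std::printf("obj1=%lld obj2=%lld\n", (long long)store.get(1), (long long)store.get(2));
    return 0;
}
```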

Software - Programming Model - Transactions (OCC)
- Individual object reads are atomic
  - Only committed data is read
  - Successive reads of the same object return the same data
  - Reads of objects written earlier in the same transaction return the written value
- No atomicity guarantee across reads of different objects, but a transaction that observes such an inconsistency will not commit (commit-time consistency checks)
- Locality hints
- Lock-free reads

Software - Configuration
- ZooKeeper used to agree on & store the current configuration
- Custom RDMA-based implementation for leases, failure detection, and coordinating recovery
- Per-machine FIFO ring buffers serve as transaction logs & message queues, written via RDMA
  - The receiver lazily updates the sender about how much it has consumed
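
A toy model of the ring-buffer log and the lazy free-space update: the sender appends against a possibly stale view of how much the receiver has consumed, and the receiver only publishes its consumed offset occasionally. In FaRM the append is a one-sided RDMA write into the receiver's memory; here everything lives in one process and the names are illustrative.

```cpp
#include <cstdint>
#include <cstdio>
#include <vector>

class RingLog {
    std::vector<uint64_t> buf_;
    size_t head_ = 0;             // next slot the sender writes (sender-owned)
    size_t tailSeenBySender_ = 0; // receiver's consumed offset, as the sender knows it
    size_t tail_ = 0;             // receiver's true consumed offset
public:
    explicit RingLog(size_t slots) : buf_(slots, 0) {}

    // Sender: append a record if, by its (possibly stale) view, there is space.
    bool append(uint64_t record) {
        if (head_ - tailSeenBySender_ == buf_.size()) return false;  // looks full
        buf_[head_ % buf_.size()] = record;                          // RDMA write in FaRM
        ++head_;
        return true;
    }

    // Receiver: process all records written so far.
    void drain() {
        while (tail_ < head_) {
            std::printf("processed record %llu\n",
                        (unsigned long long)buf_[tail_ % buf_.size()]);
            ++tail_;
        }
    }

    // Receiver, lazily: tell the sender how far it has consumed, freeing space.
    void publishConsumed() { tailSeenBySender_ = tail_; }
};

int main() {
    RingLog log(4);
    for (uint64_t r = 1; r <= 4; ++r) log.append(r);
    log.drain();                                 // receiver consumes everything
    std::printf("append before lazy update: %s\n", log.append(5) ? "ok" : "rejected");
    log.publishConsumed();                       // lazy update of the sender's view
    std::printf("append after lazy update:  %s\n", log.append(5) ? "ok" : "rejected");
    return 0;
}
```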

Software - Transaction & Replication (the novel part)
- Fewer messages, thanks to RDMA
- Primary-backup replication (vertical Paxos) for data & transaction logs
- Unreplicated transaction coordinators

Software - Transaction & Replication - commit protocol
1) Lock
2) Validate
3) Commit backup
4) Commit primary
5) Truncate

Software - Transaction & Replication - 1) Lock
- Coordinator writes a LOCK record (versions & new values) to the log at each primary of a written object
- Primaries lock the objects using CAS on the object versions
- If any lock fails: the coordinator writes an abort record to all primaries and returns an error to the application
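
The CAS-based locking at a primary can be pictured as a 64-bit header word combining a version number and a lock bit. A minimal sketch, assuming that layout (the real header format may differ):

```cpp
#include <atomic>
#include <cstdint>

constexpr uint64_t kLockBit = 1ull << 63;        // high bit = locked

struct ObjectHeader {
    std::atomic<uint64_t> versionWord;           // lock bit | version
};

// Try to lock the object for a transaction that read version `expectedVersion`.
// Fails if the object is already locked or its version has changed, which is
// exactly the condition that makes the coordinator abort.
bool tryLock(ObjectHeader& h, uint64_t expectedVersion) {
    uint64_t expected = expectedVersion;          // unlocked, at the expected version
    uint64_t desired  = expectedVersion | kLockBit;
    return h.versionWord.compare_exchange_strong(expected, desired);
}

// At commit-primary time, the primary installs the new value, bumps the
// version and clears the lock bit, exposing the new value to readers.
void unlockWithNewVersion(ObjectHeader& h, uint64_t lockedVersionWord) {
    h.versionWord.store((lockedVersionWord & ~kLockBit) + 1);
}
```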

Software - Transaction & Replication - 2) Validate
- Re-read the versions of objects that were read but not written; abort if any version is not the one the transaction expected
- Uses RDMA reads, or RPC when a primary holds more than 4 of the objects

Software - Transaction & Replication - 3) Commit backup
- Coordinator writes a COMMIT-BACKUP record to each backup's log (same contents as the LOCK record)
- Waits for the ack(s) from the backups' NICs

Software - Transaction & Replication - 4) Commit primary
- Coordinator writes a COMMIT-PRIMARY record to the log at each primary
- Success is reported after at least one ack (or after the local write)
- Primaries update the objects in place, bump the versions, and unlock the objects, exposing the new values

Software - Transaction & Replication - 5) Truncate
- Primaries & backups keep log records until they are truncated
- The coordinator truncates at primaries & backups after all primaries have acked
- Transaction IDs to truncate piggyback on other log records
- Backups apply the updates at truncation
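
Putting the five phases together, here is a control-flow sketch of the coordinator side. The callback names are hypothetical stand-ins for the one-sided RDMA writes/reads; only the ordering and abort rules follow the slides, and the loops are sequential for clarity where FaRM issues the writes in parallel.

```cpp
#include <cstdint>
#include <functional>
#include <vector>

struct Participant { uint64_t machineId; };

struct CommitOps {
    // Write a LOCK record to a primary's log; returns false if locking failed.
    std::function<bool(const Participant&)> writeLockRecord;
    // Re-read versions of read-only objects; returns false if any changed.
    std::function<bool()> validateReadSet;
    // Write COMMIT-BACKUP to a backup's log and wait for the NIC ack.
    std::function<void(const Participant&)> writeCommitBackup;
    // Write COMMIT-PRIMARY to a primary's log; returns once acked.
    std::function<void(const Participant&)> writeCommitPrimary;
    // Write an ABORT record to a primary's log.
    std::function<void(const Participant&)> writeAbort;
};

// Returns true if the transaction committed.
bool coordinatorCommit(const std::vector<Participant>& primaries,
                       const std::vector<Participant>& backups,
                       const CommitOps& ops) {
    // 1) Lock: LOCK records to every primary of a written object.
    for (const auto& p : primaries) {
        if (!ops.writeLockRecord(p)) {
            for (const auto& q : primaries) ops.writeAbort(q);   // abort path
            return false;
        }
    }
    // 2) Validate: versions of objects read but not written must be unchanged.
    if (!ops.validateReadSet()) {
        for (const auto& q : primaries) ops.writeAbort(q);
        return false;
    }
    // 3) Commit backup: COMMIT-BACKUP to all backups, wait for NIC acks.
    for (const auto& b : backups) ops.writeCommitBackup(b);
    // 4) Commit primary: COMMIT-PRIMARY to all primaries; success is reported
    //    to the application after at least one ack.
    for (const auto& p : primaries) ops.writeCommitPrimary(p);
    // 5) Truncate happens later, piggybacked on other log records.
    return true;
}
```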

Software - Transaction & Replication - Correctness
- Read-write transactions serialize at the point where the write locks are acquired
- Read-only transactions serialize at the point of the last read
- Strict serializability: the serialization point lies between the start of execution and the moment completion is reported to the application

Software - 4-stage commit vs. traditional 2PC
- Coordinator reserves log space for the commit-protocol records in the participants' logs before starting
- f+1 replicas instead of 2f+1
- 4 * P * (2f+1) messages for traditional 2PC with replicated participants, versus Pw * (f+3) RDMA writes for write participants plus 1 RDMA read per read-only participant
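
Plugging illustrative numbers into the counts on this slide (the formulas come from the slide; the chosen values, P = 4 participants of which Pw = 3 are written and Pr = 1 is read-only, with f = 1, are mine):

```cpp
#include <cstdio>

int main() {
    const int f = 1;                  // tolerated failures
    const int P = 4, Pw = 3, Pr = 1;  // participants: total, writers, read-only

    int traditional = 4 * P * (2 * f + 1);      // 2PC over replicated participants
    int farmWrites  = Pw * (f + 3);             // one-sided RDMA writes
    int farmReads   = Pr * 1;                   // one RDMA read per read-only participant

    std::printf("traditional 2PC: %d messages\n", traditional);        // 48
    std::printf("FaRM-style:      %d RDMA writes + %d RDMA reads\n",
                farmWrites, farmReads);                                 // 12 + 1
    return 0;
}
```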

Software - Failure Recovery (novel)
- Assumptions: bounded clock drift; eventually bounded message delays
- DRAM (made durable by the distributed UPS) provides transaction-log durability
- 5 phases

Software - Failure Recovery
1) Failure detection
2) Reconfiguration
3) Transaction state recovery
4) Bulk data recovery
5) Allocator state recovery

Software - Failure Recovery - 1) Failure detection
- Leases, granted via a 3-way handshake
- Very short leases (magic: 5 ms lease time)
- Making short leases reliable:
  - Dedicated InfiniBand send/receive verbs & dedicated queues
  - Lease manager runs at the highest CPU scheduling priority, on dedicated hardware threads
  - Preallocated memory, paged in & pinned
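
A minimal sketch of the lease mechanic: a renewal thread keeps extending a 5 ms lease well before it expires, and the failure detector suspects the machine once the lease lapses. The 5 ms period comes from the slide; the renewal cadence and loop bounds are illustrative assumptions.

```cpp
#include <atomic>
#include <chrono>
#include <cstdio>
#include <thread>

using Clock = std::chrono::steady_clock;

struct Lease {
    std::atomic<Clock::rep> expiry{0};   // expiry time, in Clock ticks

    void renewFor(std::chrono::milliseconds period) {
        auto t = Clock::now() + period;
        expiry.store(std::chrono::duration_cast<Clock::duration>(t.time_since_epoch()).count());
    }
    bool expired() const {
        return Clock::now().time_since_epoch().count() > expiry.load();
    }
};

int main() {
    constexpr auto kLease = std::chrono::milliseconds(5);

    Lease lease;
    lease.renewFor(kLease);

    // Renewal thread: in FaRM this path gets dedicated queues, hardware
    // threads and pinned memory so renewals are never delayed; here we just
    // renew well before expiry for a short while, then stop.
    std::thread renewer([&] {
        for (int i = 0; i < 100; ++i) {
            std::this_thread::sleep_for(kLease / 5);   // renew ~5x per lease period
            lease.renewFor(kLease);                    // a 3-way handshake in reality
        }
    });

    // Failure detector: suspect the machine once the lease has expired.
    for (int i = 0; i < 40; ++i) {
        std::this_thread::sleep_for(std::chrono::milliseconds(20));
        if (lease.expired()) {
            std::puts("lease expired: suspect machine and start reconfiguration");
            break;
        }
    }
    renewer.join();
    return 0;
}
```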

Software - Failure Recovery - 2) Reconfiguration
- Per-operation lease checks are the usual solution, but not usable here: one-sided RDMA bypasses the remote CPU
- Precise membership instead: after a failure, machines agree on the new membership before mutating anything
- Steps: suspect, probe, update configuration (after a majority responds), remap regions, send new configuration, apply new configuration, commit new configuration
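
The moving parts of reconfiguration can be sketched as data: a configuration identified by the monotonically increasing counter, plus the ordered steps listed on this slide. The concrete types are illustrative, not FaRM's actual structures.

```cpp
#include <cstdint>
#include <map>
#include <set>

using MachineId = uint32_t;
using RegionId  = uint32_t;

struct Configuration {
    uint64_t id;                                      // monotonically increasing counter
    std::set<MachineId> members;                      // machines in this configuration
    std::map<RegionId, std::set<MachineId>> regions;  // region -> {primary, backups}
    MachineId cm;                                     // configuration manager
};

// Reconfiguration proceeds through these steps in order; object mutation is
// blocked until the new configuration is committed (precise membership).
enum class ReconfigStep {
    Suspect,              // a lease expired: suspect the machine
    Probe,                // probe the remaining machines, proceed on a majority
    UpdateConfiguration,  // persist the new configuration (ZooKeeper)
    RemapRegions,         // reassign regions that lost replicas
    SendNewConfiguration,
    ApplyNewConfiguration,
    CommitNewConfiguration
};
```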

Software - Failure Recovery - 3) Transaction state recovery
- Block access to recovering regions
- Drain all message logs
- Find recovering transactions & build the complete set of them
- Recovery is multithreaded
- Send recovered log records to backups
- Vote on whether to commit or abort each recovering transaction; the recovery coordinator is chosen through consistent hashing
- Decide
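
A minimal consistent-hashing sketch for the "coordinator chosen through consistent hashing" step: machines are hashed onto a ring and a recovering transaction is handled by the first machine at or after the hash of its transaction ID. A real implementation would add virtual nodes; the hash choice here is illustrative.

```cpp
#include <cstdint>
#include <cstdio>
#include <functional>
#include <map>

using MachineId = uint32_t;

class Ring {
    std::map<uint64_t, MachineId> ring_;     // position on the ring -> machine
public:
    void addMachine(MachineId m) {
        ring_[std::hash<MachineId>{}(m)] = m;
    }
    // Machine responsible for coordinating recovery of transaction `txId`.
    MachineId coordinatorFor(uint64_t txId) const {
        uint64_t h = std::hash<uint64_t>{}(txId);
        auto it = ring_.lower_bound(h);              // first machine at or after h
        if (it == ring_.end()) it = ring_.begin();   // wrap around the ring
        return it->second;
    }
};

int main() {
    Ring ring;
    for (MachineId m = 1; m <= 5; ++m) ring.addMachine(m);
    std::printf("tx 42 recovered by machine %u\n", ring.coordinatorFor(42));
    return 0;
}
```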

Software - Failure Recovery - 4) Bulk data recovery
- Delay data recovery until all regions are active again, then copy in the background, in parallel
- Data is copied in 8 KB blocks
- Each recovery read starts within a random interval of the previous read; blocks are examined before being copied
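
A sketch of the paced, block-by-block copy, with the read/examine/write operations abstracted behind callbacks. The pacing bound and the "is the local copy current" check are assumptions for illustration; in FaRM the read would be a one-sided RDMA read from the primary.

```cpp
#include <chrono>
#include <cstdint>
#include <functional>
#include <random>
#include <thread>
#include <vector>

constexpr size_t kBlockSize = 8 * 1024;   // 8 KB recovery blocks

struct BlockOps {
    // Stand-ins for the real operations (RDMA read, version check, local write).
    std::function<std::vector<uint8_t>(size_t)> readFromPrimary;
    std::function<bool(size_t, const std::vector<uint8_t>&)> localCopyIsCurrent;
    std::function<void(size_t, const std::vector<uint8_t>&)> writeLocally;
};

void recoverRegion(size_t regionBytes, const BlockOps& ops) {
    std::mt19937 rng{std::random_device{}()};
    std::uniform_int_distribution<int> jitterUs(0, 500);   // illustrative pacing bound

    const size_t nBlocks = regionBytes / kBlockSize;
    for (size_t b = 0; b < nBlocks; ++b) {
        // Pace recovery: each read starts a random interval after the previous one.
        std::this_thread::sleep_for(std::chrono::microseconds(jitterUs(rng)));

        auto block = ops.readFromPrimary(b);
        if (!ops.localCopyIsCurrent(b, block)) {   // examine before copying
            ops.writeLocally(b, block);
        }
    }
}
```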

Software - Failure Recovery - 5) Allocator state recovery
- Regions are carved into 1 MB blocks
- Block headers are replicated
- Slab free lists are recovered by scanning the region
- Runs in the background
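
A sketch of rebuilding a slab free list with a scan, assuming a per-slot allocated flag derived from the object headers (the actual header layout is not given on the slide). The caller is expected to size `allocatedBit` to the number of slots in the block.

```cpp
#include <cstddef>
#include <cstdint>
#include <vector>

constexpr size_t kBlockSize = 1 * 1024 * 1024;   // regions are carved into 1 MB blocks

struct SlabBlock {
    size_t objectSize;                 // from the (replicated) block header
    std::vector<uint8_t> allocatedBit; // 1 if the slot holds a live object
};

// Rebuild the free list for one slab block: every slot whose object is not
// allocated goes back on the free list. Runs in the background after recovery.
std::vector<size_t> rebuildFreeList(const SlabBlock& block) {
    std::vector<size_t> freeList;
    const size_t slots = kBlockSize / block.objectSize;
    for (size_t i = 0; i < slots; ++i) {
        if (!block.allocatedBit[i]) {
            freeList.push_back(i * block.objectSize);   // offset of the free slot
        }
    }
    return freeList;
}
```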

Performance Graphs (on back of paper)