Making RAMCloud Writes Even Faster

Slide 1: Making RAMCloud Writes Even Faster (Bring Asynchrony to Distributed Systems)
Seo Jin Park and John Ousterhout

Slide 2: Overview
- Goal: make writes asynchronous while preserving consistency.
- Approach: rely on the client.
  - The server returns before making writes durable.
  - If a server crashes, the client retries its previous writes.
  - Behavior is still consistent: linearizable as long as the client is alive.
- Anticipated benefits:
  - Write latency: 15 µs → 6 µs (even with geo-replication).
  - Lower tail latency.
  - Write throughput: 2-3x higher.
- Some applications don't need durability for the last 10 ms.

Slide 3: Bring Asynchrony to RAMCloud
- RAMCloud currently provides linearizability, the strongest form of consistency for concurrent systems.
- A write is blocked while replication completes (write: 15 µs vs. read: 5 µs), which also wastes cycles in the server.
  [Diagram: client sends a write to the master; "write success" returns only after the durable log write reaches the backups.]
- Proposal: make durability for writes happen asynchronously.
  [Diagram: the master returns "write success" before the durable log write reaches the backups.]
- Should we give up consistency? Does write asynchrony have to mean weak consistency?
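
To make the contrast concrete, here is a minimal C++ sketch of the two master-side write paths. All types and names here are hypothetical, not RAMCloud's actual code; the point is only where the reply happens relative to replication.

    #include <cstdint>
    #include <string>

    struct LogEntry { uint64_t tableId; std::string key, value; };

    class Master {
    public:
        // Current path: the RPC is answered only after the entry is durable
        // on all backups, so write latency includes replication latency and
        // the master wastes cycles waiting.
        void writeSync(const LogEntry& e) {
            appendToLog(e);
            replicateToBackups(e);  // blocks until backups acknowledge
            replySuccess();
        }

        // Proposed path: answer immediately; durability happens later.
        // This stays consistent because the client keeps the request and
        // retries it if the master crashes before replication finishes.
        void writeAsync(const LogEntry& e) {
            appendToLog(e);
            replySuccess();          // client sees success before durability
            queueForReplication(e);  // a background thread replicates later
        }

    private:
        void appendToLog(const LogEntry&) {}
        void replicateToBackups(const LogEntry&) {}
        void queueForReplication(const LogEntry&) {}
        void replySuccess() {}
    };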

Slide 4: Consistency in Performant Systems
- Eventual consistency is popular in distributed storage: writes are asynchronously durable for best performance. Examples: Redis Cluster, TAO, MySQL replication.
- Problem: it is difficult to reason about the state of the system.
  - Clients may read different values.
  - You don't know when updates will be applied; a write may get applied long after it was issued.
  - You cannot check that an update was durably queued.
- New model: linearizable unless the client crashes. This is similar to, but stronger than, an asynchronous file system.

Slide 5: API
- asyncWrite(tableId, key, value) → value, version
- asyncCondWrite(), asyncIncrement(), etc.
- sync() → NULL (waits until all updates are durable)
- Possible APIs [feedback requested: are they useful?]:
  - rpc.sync() → NULL (waits until that one update is durable)
  - sync(callbackFunc) → NULL
- Example:
    ramcloud.asyncWrite(1, "Bob", "2");
    ramcloud.asyncWrite(1, "Bill", "2");
    ramcloud.sync();
    printf("Updated Bob and Bill");
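
A minimal sketch of how these calls might be used together, assuming the callback variant takes a C++ lambda. Both asyncWrite and the callback form of sync() are proposals from this slide, not an existing RAMCloud interface.

    #include <cstdio>
    #include "RamCloud.h"  // client header; the async calls below are hypothetical

    void updateAccounts(RAMCloud::RamCloud& ramcloud) {
        ramcloud.asyncWrite(1, "Bob", "2");
        ramcloud.asyncWrite(1, "Bill", "2");

        // Blocking form: returns once both updates above are durable.
        ramcloud.sync();
        printf("Updated Bob and Bill\n");

        // Callback form: returns immediately; the lambda runs once the
        // update has become durable on the backups.
        ramcloud.asyncWrite(1, "Carol", "2");
        ramcloud.sync([] { printf("Carol's update is durable\n"); });
    }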

Slide 6: New Consistency Model
- Durability for a write happens asynchronously, but behavior is still consistent:
  1. All reads are consistent: reads are blocked until the data becomes durable.
  2. Writes are linearizable unless the client crashes.
     - When a server crashes, the client retries its previously returned writes.
     - A write is lost only if both the client and the server crash.
     - A client may wait for durability before externalizing a result.
- Conditional writes remain consistent and possible.
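
A sketch of rule 1, with hypothetical names: the master serves a read only after the log position holding the object's latest value has been replicated, so a reader can never observe a value that might disappear in a crash.

    #include <cstdint>
    #include <string>

    class MasterReadPath {
    public:
        std::string read(uint64_t tableId, const std::string& key) {
            // Log offset of the most recent write to this object.
            uint64_t pos = logPositionOf(tableId, key);
            // No-op if that write is already durable; otherwise the read
            // briefly blocks until the backups acknowledge that offset.
            waitUntilReplicated(pos);
            return currentValue(tableId, key);
        }

    private:
        uint64_t logPositionOf(uint64_t, const std::string&) { return 0; }
        void waitUntilReplicated(uint64_t) {}
        std::string currentValue(uint64_t, const std::string&) { return ""; }
    };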

Slide 7: Maintaining Linearizability across Server Crashes
- On a server crash, the client retries its previously returned writes. Goal: restore the same state as before the crash.
- Issue 1: a retry may re-execute the same write request.
  - If a server crashes, a write may or may not be recovered, so clients retry operations that are not yet known to be durable.
  - A retried write may get re-executed, overwriting and reverting subsequent updates by other clients.
- Issue 2: retries from different clients may arrive out of order.
  - The end state of the system would differ from the original execution.
  - A previously successful conditional write might fail, so the client would see an inconsistency.

Slide 8: Issue 1: A Retry May Re-execute
- RIFL (Reusable Infrastructure for Linearizability) [SOSP '15] lets the server ignore already-completed writes.
  [Diagram: the client writes x = 20 and receives "success"; the master crashes after replying, with x = 20 already on the backups. When the client retries write(x, 20), the recovery master must recognize it as a duplicate rather than re-execute it.]
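
A much-simplified sketch of the RIFL idea (hypothetical data structure; the real design logs completion records alongside the objects and recovers them with the data): each RPC carries a (clientId, rpcId) pair, and a retried write that already completed is answered from its saved result instead of being re-executed.

    #include <cstdint>
    #include <map>
    #include <optional>
    #include <string>
    #include <utility>

    class RiflTable {
    public:
        // If this RPC already completed, return the saved result so the
        // recovery master does not re-apply the write (which could revert
        // later updates by other clients).
        std::optional<std::string> checkDuplicate(uint64_t clientId,
                                                  uint64_t rpcId) const {
            auto it = completed.find({clientId, rpcId});
            if (it != completed.end()) return it->second;
            return std::nullopt;
        }

        // Called exactly once, when the operation first executes.
        void recordCompletion(uint64_t clientId, uint64_t rpcId,
                              std::string result) {
            completed[{clientId, rpcId}] = std::move(result);
        }

    private:
        std::map<std::pair<uint64_t, uint64_t>, std::string> completed;
    };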

Slide 9: Issue 2: Out-of-Order Retries
- Retries from clients may arrive in a different order than the original execution, putting linearizability in danger. (A version-based sketch follows below.)
- Option 1: use the object version to decide the final winner.
  - Plain writes: okay.
  - Conditional writes: can be handled specially.
  - Appends: not possible.
- Option 2: allow only one unreplicated write per object; an overwrite waits until the previous write is durable.
  - Any deterministic operation is okay.
  - Weakness: a continuously overwritten object can be bad. [Feedback requested: is this a common, real problem?]
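
A minimal sketch of Option 1, with hypothetical names: each retried write carries the version its original execution was assigned, and the recovery master applies it only if doing so would not revert a newer update.

    #include <cstdint>
    #include <string>

    struct Object { std::string value; uint64_t version; };

    // Returns true if this (possibly retried, possibly reordered) write is
    // the final winner for the object; later-versioned updates survive.
    bool applyRetriedWrite(Object& obj, const std::string& newValue,
                           uint64_t assignedVersion) {
        if (assignedVersion <= obj.version)
            return false;  // a newer write already applied; drop this retry
        obj.value = newValue;
        obj.version = assignedVersion;
        return true;
    }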

Slide 10: Issue 2: Out-of-Order Retries (Example)
[Diagram: the client issues a conditional write "if x.version == 2 then x.value = 20" and receives "success"; on the master, x goes from (val: 1, ver: 2) to (val: 20, ver: 3). The master crashes before replicating. When the client retries the conditional write against the recovery master, the master must (always) answer "success" even though x's version is no longer 2.]

Slide 11: Anticipated Benefits
- Reduces RAMCloud write latency.
  - Completely decouples write latency from replication latency.
  - Consistent geo-replication becomes practical.
  - Reduced tail latency: a write is no longer affected by its 3 backup servers.
- More efficient threading model in servers.
  - No need to spin-wait for replication; a dedicated replication thread becomes possible (sketched below).
- Improves RAMCloud's write throughput 2-3x.
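
A sketch of the dedicated-replication-thread idea (hypothetical structure; shutdown handling is omitted for brevity): worker threads enqueue log entries and return to their clients, while one thread pays all the replication latency.

    #include <condition_variable>
    #include <mutex>
    #include <queue>
    #include <string>
    #include <thread>

    class ReplicationThread {
    public:
        ReplicationThread() : worker([this] { run(); }) {}

        // Called by worker threads after replying to the client;
        // never blocks on the backups.
        void enqueue(std::string logEntry) {
            { std::lock_guard<std::mutex> g(m); q.push(std::move(logEntry)); }
            cv.notify_one();
        }

    private:
        void run() {
            for (;;) {
                std::unique_lock<std::mutex> lk(m);
                cv.wait(lk, [this] { return !q.empty(); });
                std::string e = std::move(q.front());
                q.pop();
                lk.unlock();
                sendToBackups(e);  // the only place replication latency is paid
            }
        }
        void sendToBackups(const std::string&) {}

        std::mutex m;
        std::condition_variable cv;
        std::queue<std::string> q;
        std::thread worker;  // runs forever in this sketch (no clean shutdown)
    };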

Slide 12: Possible Applications?
1. Applications that don't care about durability.
   - The durability of the last 10 ms may not be important.
   - Example: real-time document sharing; the user cannot distinguish a lost keystroke from a typo.
2. Split update/validate clients.
   - The end user can check whether a write failed and, if so, retry. No surprise resurrection! Validation by read is possible.
   - Example: purchase an item, then redirect to an order confirmation page rendered by a different web server. A human notices a failure and retries.
3. Many updates before externalization.
   - Simply sync() before externalizing the success of the writes (sketch below). Any experiences with this?
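
A sketch of pattern 3, reusing the purchase example with hypothetical table ids and helper functions: batch several asynchronous updates, then call sync() exactly once, just before the result becomes visible outside the system.

    #include <cstdint>
    #include <string>
    #include "RamCloud.h"  // as in the API sketch; asyncWrite is hypothetical

    const uint64_t ORDERS_TABLE = 1;     // hypothetical table ids
    const uint64_t INVENTORY_TABLE = 2;

    std::string renderConfirmationPage(const std::string& user,
                                       const std::string& item);

    std::string handlePurchase(RAMCloud::RamCloud& ramcloud,
                               const std::string& user,
                               const std::string& item) {
        ramcloud.asyncWrite(ORDERS_TABLE, user, item);     // returns quickly
        ramcloud.asyncWrite(INVENTORY_TABLE, item, "-1");  // not yet durable
        // One wait covers both updates: they are durable before the user
        // ever sees a confirmation (the externalization point).
        ramcloud.sync();
        return renderConfirmationPage(user, item);
    }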

Slide 13: Need Help! (Questions)
- Applications? How do current web applications use NoSQL databases?
- How useful is an ordering guarantee? Is it important to have some ordering for durability?
- Callback-based API? Is a single final response to a request the only form of externalization?
- Challenges: a client-side threading model with an accurate timer.

Slide 14: Conclusion
- Relying on client retry after a server crash → strong consistency with asynchronously durable writes.
- Decoupling durability from the critical path can improve performance (latency ↓, throughput ↑).
- RIFL (Reusable Infrastructure for Linearizability) eases the design of, and reasoning about, consistency.

Slide 15: Q&A