Dynamo: Key-Value Cloud Storage

Dynamo: Key-Value Cloud Storage
Brad Karp, UCL Computer Science, CS M038 / GZ06, 22nd February 2016

Context: P2P vs. Data Center (key, value) Storage
Chord and DHash were intended for wide-area peer-to-peer systems: individual nodes at the Internet's edge. Central challenges: low-latency key lookup with small forwarding state per node; consistent hashing to map keys to nodes; replication at successors for availability under failure. Are these techniques useful in a more traditional data-center scenario?

Amazon's Workload (in 2007)
Peak load: tens of millions of customers, served by tens of thousands of servers in globally distributed data centers. Dynamo is the (key, value) storage back end for Amazon's web-based store: put() and get(), with values usually less than 1 MB. Requirements: low latency for requests, with focus on the 99.9th-percentile SLA; high availability despite failures, measured in the success of users' operations; scalability as the workload grows to more servers.

Amazon's Workload (cont'd)
The shopping cart service must always be able to write to and read from its data store, despite disks failing, network routes flapping, and data centers destroyed by tornadoes. Services that use (key, value) stores: best-seller lists, shopping carts, customer preferences, session management, sales rank, product catalog.

Techniques (mostly not new)
Place replicated data on nodes according to consistent hashing. Maintain consistency of replicated data using version vectors ("vector clocks" in the paper). Use eventual consistency for replicated data: prioritize success and low latency of writes and reads over consistency (unlike DBs). Efficiently sync replicas using Merkle trees. Key trade-offs: response time vs. consistency vs. durability.

Partitions Force a Choice Between Availability and Consistency
Suppose 3 replicas are partitioned into 2 and 1. If one replica is fixed as master, no client in the other partition can write. In Paxos-based primary-backup, no client in the minority partition can write. Traditional distributed databases emphasize consistency over availability when there are partitions.

Alternative: Eventual Consistency
Tell the client the write is complete when only some replicas have stored it, and propagate to the other replicas in the background. This allows writes in both partitions, but risks returning stale data and write conflicts when the partition heals. (Figure: put(k, v0) succeeds in one partition while put(k, v1) succeeds in the other; the conflict surfaces when the partition heals.) Dynamo emphasizes availability over consistency when there are partitions.

Dynamo's Interface
Keys and values are opaque to Dynamo; Dynamo hashes keys into 128-bit identifiers with MD5. get(key) → (value, context): returns one value or multiple conflicting values, and the context describes the version(s) of the value(s). put(key, context, value) → OK: the context indicates which value versions this new value version derived from.
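
To make the shape of this interface concrete, here is a minimal client-side sketch in Python. The class and method names are hypothetical (this is not Dynamo's actual client library); it only captures the call signatures and the role of the opaque context.

```python
import hashlib
from dataclasses import dataclass
from typing import Any, List

@dataclass
class Context:
    """Opaque to the application: carries the version vector(s) of the value(s) read."""
    version_vectors: List[dict]

class DynamoClient:
    """Hypothetical sketch of Dynamo's get/put interface, not the real client API."""

    def key_id(self, key: bytes) -> int:
        # Dynamo hashes keys into 128-bit identifiers with MD5.
        return int.from_bytes(hashlib.md5(key).digest(), "big")

    def get(self, key: bytes) -> tuple:
        # Returns (values, context): one value or several conflicting values,
        # plus a context describing their versions.
        raise NotImplementedError("network layer omitted in this sketch")

    def put(self, key: bytes, context: Context, value: Any) -> None:
        # The context says which version(s) this new value was derived from,
        # so the store can record causality in the stored version vector.
        raise NotImplementedError("network layer omitted in this sketch")
```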

Consistent Hashing in Dynamo
Much like in Chord: a (k, v) pair is stored at k's successors (the "preference list" in the paper). Each physical node acts as multiple virtual nodes, each with its own place on the ring. When a physical node fails, many other physical nodes handle its load; when a physical node is added, it takes load from many other physical nodes. Physical nodes of heterogeneous capacity can host different numbers of virtual nodes.
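
A small consistent-hashing sketch may help picture the virtual-node arrangement. It is an illustrative toy under stated assumptions (MD5-derived ring positions, tokens named "node:vnodeI"), not Dynamo's implementation; it shows how a physical node occupies several ring positions and how a key's clockwise successors yield a preference list of distinct physical nodes.

```python
import bisect
import hashlib

def ring_position(token: str) -> int:
    # 128-bit position on the ring, mirroring Dynamo's MD5-hashed key space.
    return int.from_bytes(hashlib.md5(token.encode()).digest(), "big")

class ConsistentHashRing:
    def __init__(self, physical_nodes: dict):
        # physical_nodes maps node name -> number of virtual nodes (capacity-weighted).
        self._ring = []  # sorted list of (ring position, physical node)
        for node, vnodes in physical_nodes.items():
            for i in range(vnodes):
                self._ring.append((ring_position(f"{node}:vnode{i}"), node))
        self._ring.sort()

    def preference_list(self, key: bytes, n: int) -> list:
        # Walk clockwise from the key's position, collecting n distinct physical nodes.
        pos = int.from_bytes(hashlib.md5(key).digest(), "big")
        start = bisect.bisect(self._ring, (pos, ""))
        prefs = []
        for j in range(len(self._ring)):
            _, node = self._ring[(start + j) % len(self._ring)]
            if node not in prefs:
                prefs.append(node)
            if len(prefs) == n:
                break
        return prefs

ring = ConsistentHashRing({"A": 8, "B": 8, "C": 16})  # C has twice the capacity
print(ring.preference_list(b"shopping-cart:42", n=3))
```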

Gossip and Lookup
To add a node, an administrator explicitly informs an existing node of the new node's address. Once per second, each node contacts a random other node and they exchange their lists of known nodes (including virtual node IDs). Each node thus learns which other nodes handle all key ranges. Result: all nodes can send directly to any key's successor.
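
A minimal sketch of this style of membership gossip, assuming the view is simply a map from node name to its set of virtual-node tokens (failure detection and versioned membership metadata are omitted):

```python
import random

class Member:
    def __init__(self, name: str, tokens: set):
        self.name = name
        # View: every node this member knows about, with their virtual-node tokens.
        self.view = {name: tokens}

    def gossip_once(self, peers: list) -> None:
        # Once per second each node picks a random peer and both merge views, so
        # membership (and hence the full map of key ranges) spreads quickly.
        peer = random.choice(peers)
        merged = {**self.view, **peer.view}
        self.view = dict(merged)
        peer.view = dict(merged)
```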

Data Replication
The successor node of key k is named the coordinator. Nodes send put(k, ...) to k's coordinator, and k's coordinator replicates to the preference list. Goal: each (k, v) pair replicated at N nodes. The preference list is longer than N to allow for failed nodes.

Consistency Model: Sloppy Quorums
Goals: we want it to be likely that get()s see the most recent put()s, and we don't want to block get()s or put()s from completing during partitions. Dynamo tries to store all values put() under k on the first N live nodes of the coordinator's preference list. The coordinator replies success for a put() when only W replicas have completed the write, and replies success for a get() when only R replicas have completed the read. R < N and W < N to make get() and put() fast.
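
A schematic coordinator write path might look like the sketch below. It assumes synchronous calls to an in-memory replica list and a store() method on each replica; the real system is asynchronous and also handles timeouts, hinted handoff, and read repair.

```python
class WriteQuorumNotMet(Exception):
    pass

def coordinate_put(replicas, key, value, version, w: int):
    """Reply success once W of the N contacted replicas acknowledge the write.

    `replicas` stands for the first N live nodes of the preference list; each
    replica is assumed to expose a store(key, value, version) method.
    """
    acks = 0
    for replica in replicas:
        try:
            replica.store(key, value, version)
            acks += 1
        except ConnectionError:
            continue  # a sloppy quorum tolerates unreachable replicas
        if acks >= w:
            return "OK"  # don't wait for the remaining replicas
    raise WriteQuorumNotMet(f"only {acks} acks, needed {w}")
```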

Sloppy Quorums and put()s
Suppose the coordinator doesn't receive W replies when replicating a put(). It could return failure, but remember the goal of high availability for writes. Hinted handoff: the coordinator tries the next successors in the preference list (beyond the first N successors) if necessary, and indicates to the recipient the correct replica node. The recipient will periodically try to forward the data to the correct replica node.
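
Extending the write-path sketch above, hinted handoff could look roughly like this; the hint keyword argument and the fallback iteration are illustrative assumptions, not the paper's exact mechanism.

```python
def coordinate_put_with_handoff(preference_list, n: int, key, value, version, w: int):
    """Write to the first N nodes of the preference list; if one is unreachable,
    hand its replica to a later node, tagged with a hint naming the intended owner."""
    acks = 0
    fallbacks = iter(preference_list[n:])        # spare nodes beyond the first N
    for intended in preference_list[:n]:
        try:
            intended.store(key, value, version)
        except ConnectionError:
            spare = next(fallbacks, None)
            if spare is None:
                continue                         # no spare left; skip this replica
            # The hint tells the spare which node this replica really belongs to;
            # the spare periodically retries forwarding it to that node.
            spare.store(key, value, version, hint=intended)
            acks += 1
        else:
            acks += 1
        if acks >= w:
            return "OK"
    return "FAIL: write quorum not met"
```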

Wide-Area Replication
The last paragraph of Section 4.6 states that preference lists always contain nodes from more than one data center. Consequence: data is likely to survive the failure of an entire data center. Synchronously waiting for writes to a remote data center would incur unacceptably high latency. Compromise: W < N, eventual consistency.

Sloppy Quorums and get()s
Suppose the coordinator doesn't receive R replies when processing a get(). p. 211: "R is the minimum number of nodes that must participate in a successful read operation." (Sounds like these get()s fail.) Why not return whatever data was found, though? As we will see, consistency isn't guaranteed anyway.

Sloppy Quorums and Freshness
Common case given in the paper: N = 3, R = 2, W = 2. With these values, do sloppy quorums guarantee that a get() sees all prior put()s? If there are no failures, yes: two replicas stored each put(), two replicas responded to each get(), and the write and read quorums must overlap. If there are failures, no: two nodes in the preference list go down, so a put() is replicated outside the preference list; then the two nodes come back up and a get() occurs before they receive the prior put(). Sloppy quorums increase the probability that get()s observe recent put()s, but give no guarantee.
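
The overlap argument is just counting: with N = 3, R = 2, W = 2 we have R + W = 4 > 3, so any 2-node read set must intersect any 2-node write set, provided both stay within the same N nodes. A tiny brute-force check of that claim (illustrative only):

```python
from itertools import combinations

nodes = {"A", "B", "C"}          # N = 3
R, W = 2, 2

overlaps = all(
    set(reads) & set(writes)     # every read quorum intersects every write quorum
    for reads in combinations(nodes, R)
    for writes in combinations(nodes, W)
)
print(overlaps)  # True, because R + W > N -- as long as writes stay on the preference list
```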

Conflicts
Suppose N = 3, W = 2, R = 2, and the nodes are named A, B, C. The first put(k, ...) completes on A, B; the second put(k, ...) completes on B, C. Now a get(k) arrives and completes first at A, C. A and C return conflicting results: each has seen a different put(k, ...). Dynamo returns both results. What does the client do now?

Conflicts vs. Applications
Shopping cart: the application could take the union of the two results. But what if the second put() was the result of the user deleting an item from the cart stored in the first put()? Result: resurrection of the deleted item. Can we do better? Can Dynamo resolve cases when multiple values are found? Sometimes. If it can't, the application must do so.
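
A toy illustration of why a plain union merge resurrects deletions; representing the cart as a set of item names is an assumption made only for this example:

```python
# Two divergent replicas of the same cart key.
cart_v1 = {"book", "kettle"}   # first put(): user added a kettle
cart_v2 = {"book"}             # second put(): user deleted the kettle

# Union merge keeps every item either side has ever seen...
merged = cart_v1 | cart_v2
print(merged)  # {'book', 'kettle'} -- the deleted kettle is resurrected
```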

Version Vectors ("Vector Clocks")
A version vector is a list of (node, counter) pairs, e.g., [(A, 1), (B, 3), ...]. Dynamo stores a version vector with each stored (k, v) pair. Idea: track the ancestor/descendant relationship between different versions of data stored under the same key k.

Version Vectors (cont'd)
Rule: given two versions, each with a VV, if each counter of the first version is less than or equal to the corresponding counter of the second, then the first is an ancestor of the second and can be forgotten by Dynamo. Each time a put() occurs, Dynamo increments the counter in the VV for the coordinator node. Each time a get() occurs, Dynamo returns the VV for the value(s) returned (in the "context"). When get()ting a value, modifying it, and put()ting it again under the same key, the user must supply the context Dynamo provided in the result of the get().
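
A compact sketch of the ancestor test and the coordinator-side increment, representing a version vector as a dict from node name to counter (a common representation; the paper does not prescribe one):

```python
def descends_from(vv_new: dict, vv_old: dict) -> bool:
    """True if vv_old is an ancestor of (or equal to) vv_new:
    every counter in vv_old is <= the corresponding counter in vv_new."""
    return all(vv_new.get(node, 0) >= count for node, count in vv_old.items())

def concurrent(vv_a: dict, vv_b: dict) -> bool:
    # Neither descends from the other: a genuine conflict the application may see.
    return not descends_from(vv_a, vv_b) and not descends_from(vv_b, vv_a)

def bump(vv: dict, coordinator: str) -> dict:
    # On each put(), the coordinator increments its own entry.
    new = dict(vv)
    new[coordinator] = new.get(coordinator, 0) + 1
    return new

v1 = bump({}, "A")            # [(A,1)]
v2 = bump(v1, "C")            # [(A,1),(C,1)] -- descends from v1, so v1 can be dropped
v3 = bump(v1, "B")            # [(A,1),(B,1)]
print(descends_from(v2, v1))  # True: Dynamo resolves this automatically
print(concurrent(v2, v3))     # True: must be reconciled by Dynamo or the application
```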

Version Vectors (Auto-Resolving Example)
A put() handled by A produces version v1, stamped with VV [(A,1)]. A later put() updating the same k, handled by C, produces version v2, stamped with VV [(A,1),(C,1)]. A get(k) that retrieves from A and C returns only v2, since v1 is an ancestor of v2.

Version Vectors (Application-Resolving Example)
A write handled by A produces v1 [(A,1)]. A write handled by B produces v2 [(A,1),(B,1)], while a concurrent write handled by C produces v3 [(A,1),(C,1)]. The conflicting versions are reconciled and written back by A as v4 [(A,2),(B,1),(C,1)].

Limitations of Application-Based Reconciliation
Suppose two clients wish to increment the same counter concurrently (stored under the same key k). Each will independently read the same value, each will locally modify it, and each will write back. A subsequent reader will see two instances of the same value, with no sensible way to identify that two increments occurred!
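
Concretely, in a toy run both clients read 5, both write back 6 with concurrent version vectors, and no merge function over the siblings {6, 6} can recover the intended value 7:

```python
counter = 5                # value both clients read for key k

client_1 = counter + 1     # each increments its own local copy
client_2 = counter + 1

# Both put() back 6 with concurrent version vectors. A later get() returns two
# sibling values, both 6; whatever the application does with {6, 6}, one of the
# two increments is lost -- the history needed to see "two +1s" is gone.
print(client_1, client_2)  # 6 6
```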

Trimming Version Vectors
Many nodes may process a series of put()s to the same key, so VVs may get long. Must they grow forever? Dynamo stores the time of modification with each entry in the VV. When a VV grows longer than 10 entries, it drops the entry of the host that least recently processed that key. This avoids passing huge VVs around. It is conservative: it might force the client to do reconciliation, as it loses state about a value's ancestry.
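
A sketch of that truncation policy, assuming each VV entry carries a (counter, last-modified timestamp) pair and using the threshold of 10 from the slide:

```python
import time

MAX_ENTRIES = 10

def bump_with_timestamp(vv: dict, coordinator: str) -> dict:
    """VV entries are node -> (counter, last_modified). On overflow, drop the
    entry whose host least recently touched this key."""
    new = dict(vv)
    counter, _ = new.get(coordinator, (0, 0.0))
    new[coordinator] = (counter + 1, time.time())
    if len(new) > MAX_ENTRIES:
        stale = min(new, key=lambda node: new[node][1])  # oldest timestamp
        del new[stale]  # conservative: may later force client-side reconciliation
    return new
```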

Maintaining Replication: Node-to-Node Reconciliation
A node holding data received via hinted handoff may crash before it can pass the data to the unavailable node in the preference list, so we need another way to ensure each (k, v) pair is replicated N times. Nodes nearby on the ring periodically compare the (k, v) pairs they hold, and copy any they are missing that are held by the other.

Efficient Reconciliation: Merkle Trees
Idea: hierarchically summarize the (k, v) pairs a node holds by ranges of keys. A leaf node is the hash of one (k, v) pair; an internal node is the hash of the concatenation of its children. Compare roots; if they match, done. If they don't match, compare the children and recur.
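
A minimal Merkle-tree sketch over hashes of (k, v) pairs, assuming both replicas build their trees over the same fixed partition of the key range (so the trees have identical shape); the real system keeps one tree per key range per virtual node:

```python
import hashlib

def h(data: bytes) -> bytes:
    return hashlib.sha256(data).digest()

def build(leaves: list) -> dict:
    """Build a Merkle tree bottom-up; leaves are hashes of individual (k, v) pairs,
    ordered by key. Returns a nested dict: {'hash': ..., 'left': ..., 'right': ...}."""
    if len(leaves) == 1:
        return {"hash": leaves[0], "left": None, "right": None}
    mid = len(leaves) // 2
    left, right = build(leaves[:mid]), build(leaves[mid:])
    return {"hash": h(left["hash"] + right["hash"]), "left": left, "right": right}

def differing_leaves(a: dict, b: dict) -> list:
    """Compare two same-shaped trees top-down, pruning subtrees whose hashes match,
    and return the pairs of leaf hashes that differ."""
    if a["hash"] == b["hash"]:
        return []                        # identical subtree: prune
    if a["left"] is None and b["left"] is None:
        return [(a["hash"], b["hash"])]  # differing leaf: exchange these (k, v) pairs
    return differing_leaves(a["left"], b["left"]) + differing_leaves(a["right"], b["right"])

a = build([h(b"k1=v1"), h(b"k2=v2"), h(b"k3=v3"), h(b"k4=OLD")])
b = build([h(b"k1=v1"), h(b"k2=v2"), h(b"k3=v3"), h(b"k4=NEW")])
print(len(differing_leaves(a, b)))  # 1: only the divergent range is exchanged
```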

Merkle Tree Reconciliation: Example
B is missing the orange key; A is missing the green one. Compare nodes from the root downwards, pruning when hashes match. (Figure: A's and B's Merkle trees over the key range [0, 2^128), split into [0, 2^127) and [2^127, 2^128); only the subtrees whose hashes differ are descended into.)

How General Is Dynamo's Consistency Model?
It doesn't support, e.g., concurrent increments or many other operations. Amazon's stated uses: shopping cart (merge carts; resurrected deleted items), session information for users, product list (but probably nearly read-only, so R = 1, W = N would work well).

How Useful Is It to Vary N, R, W?

N   R   W   Behavior
3   2   2   Parameters from the paper: good durability, good R/W latency
3   3   1   Slow reads, weak durability, fast writes
3   1   3   Slow writes, strong durability, fast reads
3   3   3   More likely that reads see all prior writes?
3   1   1   Doesn't make sense. (Why not?)

Dynamo: Summary
Consistent hashing is broadly useful for replication, not only in P2P systems. Dynamo places an extreme emphasis on availability and low latency, unusually, at the cost of some inconsistency. Eventual consistency lets writes and reads return quickly, even when there are partitions and failures. Version vectors allow some conflicts to be resolved automatically; others are left to the application. Dynamo seems to meet the consistency needs of Amazon's specific applications, but it is definitely not right for all applications.