Scaling KVS. CS6450: Distributed Systems Lecture 14. Ryan Stutsman


Scaling KVS CS6450: Distributed Systems Lecture 14 Ryan Stutsman Material taken/derived from Princeton COS-418 materials created by Michael Freedman and Kyle Jamieson at Princeton University. Licensed for use under a Creative Commons Attribution-NonCommercial-ShareAlike 3.0 Unported License. Some material taken/derived from MIT 6.824 by Robert Morris, Franz Kaashoek, and Nickolai Zeldovich. 1

Horizontal or vertical scalability? [Figure: Vertical Scaling vs. Horizontal Scaling.] 2

Horizontal scaling is chaotic. Probability of any failure in a given period = 1 − (1 − p)^n, where p = probability a machine fails in the given period and n = number of machines. For 50K machines, each 99.99966% available, the data center experiences failures 16% of the time. For 100K machines, failures 30% of the time! 3
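To make the arithmetic concrete, here is a minimal Python sketch of the formula above; the per-machine availability is the figure quoted on the slide, and the function name is just illustrative.

```python
# Sketch: probability that at least one machine fails in a period,
# assuming independent failures with per-machine failure probability p.
def any_failure(p, n):
    return 1 - (1 - p) ** n

p = 1 - 0.9999966                         # 99.99966% availability per machine
print(f"{any_failure(p, 50_000):.0%}")    # ~16% for 50K machines
print(f"{any_failure(p, 100_000):.0%}")   # ~29% for 100K machines (slide rounds to 30%)
```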

Today 1. Techniques for partitioning data Metrics for success 2. Case study: Amazon Dynamo key-value store 4

Scaling out: Partition and place. Partition management: including how to recover from node failure (e.g., bringing another node into the partition group) and changes in system size, i.e., nodes joining/leaving. Data placement: on which node(s) to place a partition? Maintain a mapping from data object to responsible node(s). Centralized: cluster manager. Decentralized: deterministic hashing and algorithms. 5

Modulo hashing. Consider the problem of data partitioning: given object id X, choose one of k servers to use. Suppose we use modulo hashing: place X on server i = hash(X) mod k. What happens if a server fails or joins (k → k±1)? Or if different clients have different estimates of k? 6

Problem for modulo hashing: changing the number of servers. h(x) = x + 1 (mod 4); add one machine: h(x) = x + 1 (mod 5). [Figure: servers 0–4 vs. object ids 5, 7, 10, 11, 27, 29, 36, 38, 40, before and after.] All entries get remapped to new nodes! → Need to move objects over the network. 7
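A small sketch illustrating the remapping problem, assuming a uniform hash; roughly 80% of objects change servers when k goes from 4 to 5 (the hash choice and object ids are illustrative).

```python
# Sketch: fraction of objects that move when modulo hashing goes from
# k = 4 servers to k = 5 servers.
import hashlib

def server(x, k):
    h = int(hashlib.sha1(str(x).encode()).hexdigest(), 16)
    return h % k

objects = range(10_000)
moved = sum(1 for x in objects if server(x, 4) != server(x, 5))
print(f"{moved / len(objects):.0%} of objects remapped")   # roughly 80%
```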

Consistent hashing. Assign n tokens to random points on a mod 2^k circle; hash key size = k. Hash each object to a random circle position; put the object in the closest clockwise bucket: successor(key) → bucket. [Figure: ring with tokens and buckets.] Desired features. Balance: no bucket has too many objects. Smoothness: addition/removal of a token minimizes object movements for other buckets. 8

Consistent hashing's load balancing problem. Each node owns 1/n-th of the ID space in expectation; this says nothing of the request load per bucket. If a node fails, its successor takes over its bucket. Smoothness goal: only a localized shift, not O(n). But now the successor owns two buckets: 2/n-ths of the key space. The failure has upset the load balance. 9

Virtual nodes. Idea: each physical node now maintains v > 1 tokens, and each token corresponds to a virtual node. Each virtual node owns an expected 1/(vn)-th of the ID space. Upon a physical node's failure, v successors take over, each now storing (v+1)/(vn)-ths of the ID space (compared to v/(vn) before). Result: better load balance with larger v. 10
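A minimal sketch of a consistent-hash ring with virtual nodes, assuming SHA-1 positions on a 2^32 ring; removing one physical node moves only the keys it owned, spread across the survivors. Class and parameter names are illustrative, not from the lecture.

```python
import bisect, hashlib

def h(s):
    # Map a string to a position on a 2^32 ring.
    return int(hashlib.sha1(s.encode()).hexdigest(), 16) % (2 ** 32)

class Ring:
    def __init__(self, nodes, v=8):
        # Each physical node gets v tokens (virtual nodes) on the ring.
        self.tokens = sorted((h(f"{n}#{i}"), n) for n in nodes for i in range(v))
        self.positions = [t for t, _ in self.tokens]

    def successor(self, key):
        # First token clockwise from the key's position (wrapping around).
        i = bisect.bisect_right(self.positions, h(key))
        return self.tokens[i % len(self.tokens)][1]

ring = Ring(["A", "B", "C", "D"])
before = {k: ring.successor(k) for k in map(str, range(1000))}
ring2 = Ring(["A", "B", "D"])                  # physical node C fails
after = {k: ring2.successor(k) for k in before}
moved = sum(before[k] != after[k] for k in before)
print(f"{moved/len(before):.0%} of keys moved")  # roughly 1/4, spread over survivors
```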

Today 1. Techniques for partitioning data 2. Case study: the Amazon Dynamo key-value store 11

Dynamo: The P2P context. Chord and DHash were intended for wide-area P2P systems: individual nodes at the Internet's edge, file sharing. Central challenges: low-latency key lookup with small forwarding state per node. Techniques: consistent hashing to map keys to nodes; replication at successors for availability under failure. 12

Amazon's workload (in 2007). Tens of thousands of servers in globally distributed data centers. Peak load: tens of millions of customers. Tiered service-oriented architecture: stateless web-page rendering servers, atop stateless aggregator servers, atop stateful data stores (e.g., Dynamo). put(), get(): values usually less than 1 MB. 13

How does Amazon use Dynamo? Shopping cart. Session info. Maybe recently visited products, etc.? Product list: mostly read-only, replication for high read throughput. 14

Dynamo requirements. Highly available writes despite failures: despite disks failing, network routes flapping, data centers destroyed by tornadoes; always respond quickly, even during failures → replication. Low request-response latency: focus on the 99.9th-percentile SLA. Incrementally scalable as servers grow to the workload: adding nodes should be seamless. Comprehensible conflict resolution: high availability in the above sense implies conflicts. Non-requirement: security, authentication, authorization (used in a non-hostile environment). 15

Design questions How is data placed and replicated? How are requests routed and handled in a replicated system? How to cope with temporary and permanent node failures? 16

Dynamo's system interface. Basic interface is a key-value store: get(k) and put(k, v); keys and values are opaque to Dynamo. get(key) → value, context: returns one value or multiple conflicting values; the context describes the version(s) of the value(s). put(key, context, value) → OK: the context indicates which versions this version supersedes or merges. 17
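A toy, single-node stand-in just to show the shape of get/put with contexts; this is an assumed illustration, not Dynamo's implementation or client API.

```python
class ToyKVS:
    def __init__(self):
        self.store = {}                        # key -> (values, context)

    def get(self, key):
        values, context = self.store.get(key, ([], None))
        return values, context                 # may hold several conflicting values

    def put(self, key, context, value):
        # The context says which versions this put supersedes; a real system
        # uses it to prune ancestors (see version vectors later in the lecture).
        self.store[key] = ([value], context)
        return "OK"

kvs = ToyKVS()
values, ctx = kvs.get("cart:alice")
kvs.put("cart:alice", ctx, values + ["book"])
```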

Dynamo's techniques. Place replicated data on nodes with consistent hashing. Maintain consistency of replicated data with vector clocks. Eventual consistency for replicated data: prioritize success and low latency of writes over reads, and availability over consistency (unlike traditional DBs). Efficiently synchronize replicas using Merkle trees. Key trade-offs: response time vs. consistency vs. durability. 18

Data placement. [Figure: ring of nodes A–G; put(k, …) and get(k) requests for key K go to the coordinator node B; nodes B, C, and D store keys in range (A, B), including K.] Each data item is replicated at N virtual nodes (e.g., N = 3). 19

Data replication. Much like in Chord: a key-value pair goes to the key's N successors (its preference list). The coordinator receives a put for some key and then replicates the data onto the nodes in the key's preference list. The preference list size is > N to account for node failures. For robustness, the preference list skips tokens to ensure distinct physical nodes. 20
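A sketch of building a preference list by walking the ring clockwise and skipping tokens that belong to physical nodes already chosen. It reuses h() and the Ring class from the consistent-hashing sketch above; the parameter names are assumptions.

```python
import bisect

def preference_list(ring, key, n=3, extra=2):
    """Walk clockwise from the key's position, collecting n + extra
    distinct physical nodes (tokens of already-chosen nodes are skipped)."""
    start = bisect.bisect_right(ring.positions, h(key))
    chosen = []
    for j in range(len(ring.tokens)):
        node = ring.tokens[(start + j) % len(ring.tokens)][1]
        if node not in chosen:
            chosen.append(node)
        if len(chosen) == n + extra:
            break
    return chosen   # first n entries are the primary replicas; the rest are fallbacks
```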

Gossip and lookup. Gossip: once per second, each node contacts a randomly chosen other node, and they exchange their lists of known nodes (including virtual node IDs). Each node learns which others handle all key ranges. Result: all nodes can send directly to any key's coordinator (a "zero-hop DHT"), which reduces variability in response times. 21
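A sketch of one gossip exchange, under the assumption that each node keeps a map from token to the physical node that owns it; the class and function names are illustrative.

```python
import random

class Node:
    def __init__(self, membership):
        self.membership = dict(membership)      # token -> physical node

def gossip_round(node, all_nodes):
    # Once per second: contact one randomly chosen other node and merge
    # membership maps, so both sides learn the union of what they know.
    peer = random.choice([n for n in all_nodes if n is not node])
    merged = {**node.membership, **peer.membership}
    node.membership, peer.membership = dict(merged), dict(merged)

a, b = Node({1: "A"}), Node({2: "B"})
gossip_round(a, [a, b])
print(a.membership, b.membership)               # both now know tokens 1 and 2
```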

Partitions force a choice between availability and consistency. Suppose three replicas are partitioned into two and one. If one replica is fixed as master, no client in the other partition can write. In Paxos-based primary-backup, no client in the partition of one can write. Traditional distributed databases emphasize consistency over availability when there are partitions. 22

Alternative: Eventual consistency. Dynamo emphasizes availability over consistency when there are partitions: tell the client the write is complete when only some replicas have stored it, and propagate to the other replicas in the background. This allows writes in both partitions, but risks returning stale data and write conflicts when the partition heals: put(k, v0) vs. put(k, v1) → ?@%$!! 23

Mechanism: Sloppy quorums. If there is no failure, reap the consistency benefits of a single master; else, sacrifice consistency to allow progress. Dynamo tries to store all values put() under a key on the first N live nodes of the coordinator's preference list, BUT to speed up get() and put(): the coordinator returns success for a put when W < N replicas have completed the write, and returns success for a get when R < N replicas have completed the read. 24
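A sketch of the coordinator's write path under these rules. The RPC helper and data structures are toy assumptions; the real coordinator issues requests concurrently and keeps replicating in the background after answering the client.

```python
def send_put(node, key, value):
    # Stand-in for the replication RPC; here every listed node is "alive".
    node[key] = value
    return True

def coordinator_put(key, value, preference_list, n=3, w=2):
    # Send to the first N nodes on the preference list and count acks.
    acks = sum(send_put(node, key, value) for node in preference_list[:n])
    return "OK" if acks >= w else "FAIL"   # or fall back to hinted handoff (next slide)

replicas = [dict() for _ in range(3)]       # three toy replica stores
print(coordinator_put("k", "v", replicas))  # "OK" once W = 2 of them ack
```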

Sloppy quorums: Hinted handoff. Suppose the coordinator doesn't receive W replies when replicating a put(). It could return failure, but remember the goal of high availability for writes. Hinted handoff: the coordinator tries the next successors in the preference list (beyond the first N) if necessary, and indicates the intended replica node to the recipient. The recipient will periodically try to forward the data to the intended replica node. 25
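A sketch of the hinted-handoff bookkeeping: the fallback node stores the value tagged with the intended replica and periodically tries to hand it back. The data structures are assumptions, not Dynamo's implementation.

```python
def store_with_hint(fallback_store, key, value, intended_node):
    # Fallback node keeps the value plus a hint naming who should really hold it.
    fallback_store[key] = (value, intended_node)

def handoff_tick(fallback_store, reachable):
    # Periodically try to forward hinted values to their intended replicas.
    for key, (value, intended) in list(fallback_store.items()):
        if intended in reachable:
            reachable[intended][key] = value    # intended replica has it now
            del fallback_store[key]

# Toy run: E holds key K on behalf of the failed node C, then C comes back.
E, C = {}, {}
store_with_hint(E, "K", "v1", "C")
handoff_tick(E, {"C": C})                       # forwards K back to C
print(C, E)                                     # {'K': 'v1'} {}
```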

Hinted handoff: Example. Suppose C fails. Node E is in the preference list and needs to receive a replica of the data; with hinted handoff, the replica stored at E carries a hint pointing to node C. [Figure: ring of nodes A–G; B is the coordinator; nodes B, C, and D store keys in range (A, B), including K.] When C comes back, E forwards the replicated data back to C. 26

Wide-area replication. Last paragraph of §4.6: preference lists always contain nodes from more than one data center. Consequence: data is likely to survive the failure of an entire data center. Blocking on writes to a remote data center would incur unacceptably high latency. Compromise: W < N, eventual consistency. 27

Sloppy quorums and get()s. Suppose the coordinator doesn't receive R replies when processing a get(). Penultimate paragraph of §4.5: "R is the minimum number of nodes that must participate in a successful read operation." Sounds like these get()s fail. Why not return whatever data was found, though? As we will see, consistency is not guaranteed anyway. 28

Sloppy quorums and freshness. Common case given in the paper: N = 3, R = W = 2. With these values, do sloppy quorums guarantee a get() sees all prior put()s? If there are no failures, yes: two replicas stored each put(), and two replicas responded to each get(); the write and read quorums must overlap! 29

Sloppy quorums and freshness. Common case given in the paper: N = 3, R = W = 2. With these values, do sloppy quorums guarantee a get() sees all prior put()s? With node failures, no: two nodes in the preference list go down, so a put() is replicated outside the preference list; the two nodes come back up, and a get() occurs before they receive the prior put(). 30

Conflicts. Suppose N = 3, W = R = 2, and the nodes are named A, B, C. The 1st put(k, …) completes on A and B; the 2nd put(k, …) completes on B and C. Now get(k) arrives and completes first at A and C: conflicting results from A and C, each of which has seen a different put(k, …). Dynamo returns both results; what does the client do now? 31

Conflicts vs. applications. Shopping cart: could take the union of the two shopping carts. What if the second put() was the result of the user deleting an item from the cart stored in the first put()? Result: resurrection of the deleted item. Can we do better? Can Dynamo resolve cases when multiple values are found? Sometimes. If it can't, the application must do so. 32

Version vectors (vector clocks). Version vector: a list of (coordinator node, counter) pairs, e.g., [(A, 1), (B, 3), …]. Dynamo stores a version vector with each stored key-value pair. Idea: track the ancestor-descendant relationship between different versions of data stored under the same key k. 33

Version vectors: Dynamo's mechanism. Rule: if vector clock comparison gives v1 < v2, then the first is an ancestor of the second, and Dynamo can forget v1. Each time a put() occurs, Dynamo increments the counter in the V.V. for the coordinator node. Each time a get() occurs, Dynamo returns the V.V. for the value(s) returned (in the "context"). Then users must supply that context to put()s that modify the same key. 34
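A sketch of the version-vector bookkeeping described above: increment the coordinator's counter on each put, and compare vectors to decide whether one version descends from another or the two are concurrent. Function names are illustrative.

```python
def increment(vv, coordinator):
    vv = dict(vv)
    vv[coordinator] = vv.get(coordinator, 0) + 1
    return vv

def descends(v_new, v_old):
    """True if v_new is a descendant of (or equal to) v_old."""
    return all(v_new.get(node, 0) >= count for node, count in v_old.items())

v1 = {"A": 1}
v2 = increment(v1, "C")                      # {"A": 1, "C": 1}
print(descends(v2, v1))                      # True  -> Dynamo can drop v1
v3 = increment(v1, "B")                      # {"A": 1, "B": 1}
print(descends(v2, v3), descends(v3, v2))    # False False -> concurrent: app reconciles
```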

Version vectors (auto-resolving case). A put handled by node A produces v1 [(A,1)]; a later put handled by node C produces v2 [(A,1), (C,1)]. v2 > v1, so Dynamo nodes automatically drop v1 in favor of v2. 35

Version vectors (app-resolving case). A put handled by node A produces v1 [(A,1)]. Then a put handled by node B produces v2 [(A,1), (B,1)], while a put handled by node C produces v3 [(A,1), (C,1)]. v2 || v3 (neither descends from the other), so a client must perform semantic reconciliation. The client reads v2 and v3 with context [(A,1), (B,1), (C,1)], reconciles them, and node A handles the resulting put, producing v4 [(A,2), (B,1), (C,1)]. 36

Trimming version vectors. Many nodes may process a series of put()s to the same key, so version vectors may get long; do they grow forever? No, there is a clock truncation scheme: Dynamo stores the time of modification with each V.V. entry, and when the V.V. grows beyond 10 nodes, it drops the entry for the node that least recently processed that key. 37

Impact of deleting a V.V. entry? As before, a put handled by node A produces v1 [(A,1)] and a later put handled by node C produces v2 [(A,1), (C,1)]. If the (A,1) entry is truncated from v2, then v2 || v1, so it looks like application resolution is required even though v2 actually descends from v1. 38

Concurrent writes What if two clients concurrently write w/o failure? e.g. add different items to same cart at same time Each does get-modify-put They both see the same initial version And they both send put() to same coordinator Will coordinator create two versions with conflicting VVs? We want that outcome, otherwise one was thrown away Paper doesn't say, but coordinator could detect problem via put() context 39

Removing threats to durability. A hinted-handoff node may crash before it can replicate the data to a node in the preference list, so we need another way to ensure that each key-value pair is replicated N times. Mechanism: replica synchronization. Nodes nearby on the ring periodically gossip: they compare the (k, v) pairs they hold and copy any missing keys the other has. How to compare and copy replica state quickly and efficiently? 40

Efficient synchronization with Merkle trees. Merkle trees hierarchically summarize the key-value pairs a node holds, with one Merkle tree for each virtual node key range. Leaf node = hash of one key's value; internal node = hash of the concatenation of its children. Compare roots; if they match, the values match. If they don't match, compare children, and iterate this process down the tree. 41

Merkle tree reconciliation. A and B differ in two keys. Exchange and compare hash nodes from the root downwards, pruning when hashes match. [Figure: A's and B's Merkle trees over the key range [0, 2^128), split into [0, 2^127) and [2^127, 2^128).] This finds differing keys quickly and with minimum information exchange. 42
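A minimal sketch of Merkle-tree construction and comparison over a small key range: a perfect binary tree over 2^d leaves, walked from the root and descending only where hashes differ. The hashing and layout are simplifications, not Dynamo's actual scheme.

```python
import hashlib

def hash_bytes(b):
    return hashlib.sha256(b).digest()

def build_tree(leaves):
    """leaves: list of value bytes, length a power of two."""
    level = [hash_bytes(v) for v in leaves]
    tree = [level]                           # tree[0] = leaf hashes, tree[-1] = [root]
    while len(level) > 1:
        level = [hash_bytes(level[i] + level[i + 1]) for i in range(0, len(level), 2)]
        tree.append(level)
    return tree

def differing_leaves(a, b, depth=None, index=0):
    """Walk both trees from the root, descending only where hashes differ."""
    depth = len(a) - 1 if depth is None else depth
    if a[depth][index] == b[depth][index]:
        return []                            # matching subtrees are pruned
    if depth == 0:
        return [index]                       # a leaf (key) that must be exchanged
    return (differing_leaves(a, b, depth - 1, 2 * index) +
            differing_leaves(a, b, depth - 1, 2 * index + 1))

tree_a = build_tree([b"v0", b"v1", b"v2", b"v3"])
tree_b = build_tree([b"v0", b"x1", b"v2", b"v3"])
print(differing_leaves(tree_a, tree_b))      # [1] -> only one key needs syncing
```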

How useful is it to vary N, R, W?

  N  R  W  Behavior
  3  2  2  Parameters from the paper: good durability, good R/W latency
  3  3  1  Slow reads, weak durability, fast writes
  3  1  3  Slow writes, strong durability, fast reads
  3  3  3  More likely that reads see all prior writes?
  3  1  1  Read quorum doesn't overlap write quorum

44

Evolution of partitioning and placement Strategy 1: Chord + virtual nodes partitioning and placement New nodes steal key ranges from other nodes Scan of data store from donor node took a day Burdensome recalculation of Merkle trees on join/leave 45

Evolution of partitioning and placement. Strategy 2: Fixed-size partitions, random token placement. Q partitions: fixed and equally sized. Placement: T virtual nodes per physical node (random tokens); place each partition on the first N nodes after its end. 46

Evolution of partitioning and placement. Strategy 3: Fixed-size partitions, equal tokens per node. Q partitions: fixed and equally sized; S total nodes in the system. Placement: Q/S tokens per node. 47

Dynamo: Take-away ideas. Consistent hashing is broadly useful for replication, not only in P2P systems. Extreme emphasis on availability and low latency, unusually, at the cost of some inconsistency. Eventual consistency lets writes and reads return quickly, even during partitions and failures. Version vectors allow some conflicts to be resolved automatically; others are left to the application. 48

Tuesday class meeting: Strong Consistency/Linearizability 49