Dynamo: Amazon's Highly Available Key-value Store. ID2210-VT13. Slides by Tallat M. Shafaat


Dynamo: an infrastructure to host services. Goals: reliability and fault-tolerance at massive scale; availability, providing an always-on experience; cost-effectiveness; performance. 2

Context: Amazon's e-commerce platform. The shopping cart service has served tens of millions of requests and over 3 million checkouts in a single day; unavailability == $$$. No complex queries. A managed system: nodes are added and removed by administrators. Security: a trusted environment, no Byzantine nodes. 3

CAP Theorem: only two of consistency, availability and partition-tolerance are possible at the same time. Dynamo's target applications choose availability and partition-tolerance, settling for eventual consistency. 4

Client's view on consistency. Strong consistency: a single storage image; after an update completes, any subsequent access will return the updated value. Weak consistency: the system does not guarantee that subsequent accesses will return the updated value; there is an inconsistency window. Eventual consistency: a form of weak consistency where, if no new updates are made to the object, eventually all accesses will return the last updated value. 5

Reconciliation: eventual consistency, causal consistency, read-your-writes consistency, ... [figure: several inconsistent copies being reconciled into consistent data] 6

Requirements. Query model: simple read/write operations on small data items. ACID properties: a weaker consistency model, no isolation, only single-key updates. Efficiency: a trade-off between performance, cost efficiency, availability and durability guarantees. 7

Amazon Store Architecture 8

Design considerations. Conflict resolution: when? who? Scalability. Symmetry. Decentralization. Heterogeneity. 9

The big picture Easy usage Load-balancing Replication Eventual consistency High availability Easy management Failure detection Scalability 10

Easy usage: Interface. get(key): returns a single object, or a list of objects with conflicting versions, together with a context. put(key, context, object): stores the object and context under the key. The context encodes system metadata, e.g. the version number. 11
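To make the interface concrete, here is a minimal sketch of how the get/put signatures above could look as a client wrapper in Python; the names (DynamoClient, Context) are illustrative stand-ins, not Dynamo's actual API.

```python
# Hypothetical sketch of the get/put interface described above.
from dataclasses import dataclass
from typing import List, Tuple


@dataclass
class Context:
    """Opaque system metadata returned by get() and passed back to put().
    In Dynamo this encodes, among other things, the version information
    (vector clock) of the version(s) the client read."""
    vector_clock: dict


class DynamoClient:
    def get(self, key: str) -> List[Tuple[bytes, Context]]:
        """Return one object, or a list of conflicting versions, each with
        its context."""
        raise NotImplementedError

    def put(self, key: str, context: Context, obj: bytes) -> None:
        """Store obj under key; the context tells the store which
        version(s) this write supersedes."""
        raise NotImplementedError
```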

Data partitioning. Based on consistent hashing: hash the key and place it on the responsible node. [figure: identifier ring with positions 0-15] 12
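A minimal sketch of consistent hashing as described above, assuming an MD5-based identifier space; the Ring class and helper names are made up for illustration.

```python
# Minimal consistent-hashing sketch: hash the key onto a ring and walk
# clockwise to the first node identifier at or after the key's position.
import bisect
import hashlib


def ring_position(value: str, bits: int = 32) -> int:
    """Map a string onto the identifier ring [0, 2**bits)."""
    digest = hashlib.md5(value.encode()).hexdigest()
    return int(digest, 16) % (2 ** bits)


class Ring:
    def __init__(self, nodes):
        # Sorted list of (position, node) pairs.
        self._ring = sorted((ring_position(n), n) for n in nodes)
        self._positions = [pos for pos, _ in self._ring]

    def responsible_node(self, key: str) -> str:
        """First node clockwise from the key's position (wrapping around)."""
        idx = bisect.bisect_left(self._positions, ring_position(key))
        return self._ring[idx % len(self._ring)][1]


ring = Ring(["node-a", "node-b", "node-c", "node-d"])
print(ring.responsible_node("shopping-cart:12345"))
```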

Load balancing. Load: storage bits, popularity of the item, processing required to serve the item. Consistent hashing may lead to imbalance. 13

Load imbalance (1/5) Node identifiers may not be balanced 14

Load imbalance (2/5) Node identifiers may not be balanced 15

Load imbalance (3/5). Node identifiers may not be balanced. Data identifiers may not be balanced. [figure: ring with node and data identifiers] 16

Load imbalance (4/5). Node identifiers may not be balanced. Data identifiers may not be balanced. Hot spots. [figure: ring with node and data identifiers; one popular item, britney.mp3, among several unpopular ones] 17

Load imbalance (5/5). Node identifiers may not be balanced. Data identifiers may not be balanced. Hot spots. Heterogeneous nodes. [figure: ring with node and data identifiers] 18

Load balancing via virtual servers. Each physical node picks multiple random identifiers; each identifier represents a virtual server. Each node therefore runs multiple virtual servers and is responsible for non-contiguous regions of the ring. 19

Virtual servers. How many virtual servers? For homogeneous systems, all nodes run log N virtual servers. For heterogeneous systems, nodes run c·log N virtual servers, where c is small for weak nodes and large for powerful nodes. Virtual servers can be moved from heavily loaded physical nodes to lightly loaded ones. 20
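A sketch of the virtual-server idea under the assumptions above: each physical node places several identifiers on the ring, and a more powerful node simply registers more of them. The VirtualRing class and the deterministic token naming are illustrative, not Dynamo's implementation.

```python
# Sketch of virtual servers: each physical node places several identifiers
# on the ring, so its load is spread over non-contiguous regions.
import bisect
import hashlib


def ring_position(value: str, bits: int = 32) -> int:
    return int(hashlib.md5(value.encode()).hexdigest(), 16) % (2 ** bits)


class VirtualRing:
    def __init__(self, tokens_per_node):
        """tokens_per_node maps physical node -> number of virtual servers
        (larger for more powerful machines)."""
        entries = []
        for node, count in tokens_per_node.items():
            for t in range(count):
                # Dynamo picks random identifiers; a deterministic name is
                # hashed here so the example is reproducible.
                entries.append((ring_position(f"{node}#vnode{t}"), node))
        self._ring = sorted(entries)
        self._positions = [pos for pos, _ in self._ring]

    def responsible_node(self, key: str) -> str:
        idx = bisect.bisect_left(self._positions, ring_position(key))
        return self._ring[idx % len(self._ring)][1]


# A heterogeneous setup: the stronger node takes four times as many tokens.
ring = VirtualRing({"weak-node": 4, "strong-node": 16})
print(ring.responsible_node("shopping-cart:12345"))
```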

Replication. Successor-list replication: replicate the data of your N closest neighbours for a replication factor of N. [figure: identifier ring with nodes 0, 3, 5, 10, 11; each key range is listed with the three nodes that store it] 21
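A sketch of building such a replica set (preference list): starting from the node responsible for the key, walk the ring clockwise until N distinct physical nodes have been collected, skipping further virtual servers of the same node. Function and variable names are illustrative.

```python
# Sketch of a preference list: the responsible node plus the next nodes on
# the ring, until N distinct physical nodes have been collected.
import bisect
import hashlib


def ring_position(value: str, bits: int = 32) -> int:
    return int(hashlib.md5(value.encode()).hexdigest(), 16) % (2 ** bits)


def preference_list(ring_entries, key, n=3):
    """ring_entries: sorted list of (position, physical_node) pairs."""
    positions = [pos for pos, _ in ring_entries]
    idx = bisect.bisect_left(positions, ring_position(key))
    replicas = []
    for step in range(len(ring_entries)):
        node = ring_entries[(idx + step) % len(ring_entries)][1]
        if node not in replicas:       # skip virtual servers of known nodes
            replicas.append(node)
        if len(replicas) == n:
            break
    return replicas


entries = sorted((ring_position(n), n) for n in ["a", "b", "c", "d", "e"])
print(preference_list(entries, "some-key", n=3))
```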

The big picture Easy usage Load-balancing Replication Eventual consistency High availability Easy management Failure detection Scalability 22

Data versioning (1/3). Eventual consistency: updates are propagated asynchronously. Each modification is a new and immutable version of the data, so multiple versions of an object can coexist. New versions can subsume older versions. Reconciliation is either syntactic or semantic. 23

Data versioning (2/3). Versions branch due to failures, network partitions, etc.; target applications are aware of multiple versions. Vector clocks capture causality: if one version causally precedes another, the older version can be forgotten; if they are concurrent, a conflict exists that requires reconciliation. A put requires a context, i.e. which version to update. 24

Data versioning (3/3). Example: Client C1 writes a new object, handled by Sx: D1 ([Sx,1]). C1 updates the object, again via Sx: D2 ([Sx,2]). C1 updates the object via Sy: D3 ([Sx,2], [Sy,1]). C2 reads D2 and updates the object via Sz: D4 ([Sx,2], [Sz,1]). D3 and D4 are concurrent; they are reconciled and written via Sx: D5 ([Sx,3], [Sy,1], [Sz,1]). 25
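A sketch of the vector-clock comparison underlying this example: a version can be forgotten only if its clock is dominated by another version's clock; otherwise the versions are concurrent and need reconciliation. The dictionaries below reuse the D2/D3/D4 clocks from the slide.

```python
# Sketch of the vector-clock test used above.
def descends(a: dict, b: dict) -> bool:
    """True if clock a is equal to or newer than clock b on every node."""
    return all(a.get(node, 0) >= counter for node, counter in b.items())


def concurrent(a: dict, b: dict) -> bool:
    """Neither clock descends from the other: a conflict exists."""
    return not descends(a, b) and not descends(b, a)


D2 = {"Sx": 2}
D3 = {"Sx": 2, "Sy": 1}
D4 = {"Sx": 2, "Sz": 1}

print(descends(D3, D2))    # True: D3 subsumes D2, so D2 can be forgotten
print(concurrent(D3, D4))  # True: conflict requiring (semantic) reconciliation
```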

Execution of operations. For put and get operations, the client can send the request either directly to the node responsible for the data (saves latency, but requires Dynamo-aware code on the client) or to a generic load balancer (no special client code, but an extra hop). 26

Quorum systems. R / W: the minimum number of nodes that must participate in a successful read / write. Require R + W > N so that the read and write sets overlap. Examples: R=3, W=3, N=5; R=4, W=2, N=5. 27

put(key, value, context). The coordinator generates a new vector clock and writes the new version locally, sends it to the N nodes, and waits for responses from W-1 of them. Using W=1: high availability for writes, low durability. 28

get(key) returns (value, context). The coordinator requests existing versions from the N nodes and waits for responses from R of them. If there are multiple versions, it returns all versions that are causally unrelated; divergent versions are then reconciled and the reconciled version is written back. Using R=1: a high-performance read engine. 29
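A sketch of quorum-based coordination for put and get, with plain in-memory dictionaries standing in for the N replicas; a real coordinator would contact remote nodes asynchronously and apply the vector-clock test to the responses.

```python
# Sketch of quorum coordination. R + W > N ensures every read quorum
# overlaps the latest write quorum.
N, R, W = 3, 2, 2

replicas = [dict() for _ in range(N)]   # each maps key -> (vector_clock, value)


def put(key, clock, value):
    acks = 0
    for store in replicas:              # send the new version to all N replicas
        store[key] = (dict(clock), value)
        acks += 1
    # A real coordinator returns success to the client as soon as W of the N
    # acknowledgements arrive; the remaining writes finish in the background.
    return acks >= W


def get(key):
    responses = []
    for store in replicas:              # ask the replicas for their versions
        if key in store:
            responses.append(store[key])
        if len(responses) >= R:         # R responses are enough to answer
            break
    # A real coordinator would keep only the causally-unrelated versions
    # (see the vector-clock sketch above) and return all of them.
    return responses


put("cart:42", {"Sx": 1}, b"one book")
print(get("cart:42"))
```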

The big picture Easy usage Load-balancing Replication Eventual consistency High availability Easy management Failure detection Scalability 30

Handling transient failures. A managed system: which N nodes should be updated? Say A is unreachable: the put will use D instead. Later, when D detects that A is alive again, it sends the replica to A and removes its local copy. [figure: nodes A, B, C, D on the ring] To tolerate the failure of a data center, each object is replicated across multiple data centers. 31
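A sketch of this hinted-handoff behaviour, using an in-memory model of four nodes; the data structures and the write/handoff functions are illustrative only.

```python
# Sketch of hinted handoff: if a node in the preference list is down, the
# write goes to the next healthy node together with a hint naming the
# intended owner; once the owner recovers, the replica is handed back.
stores = {n: {} for n in ["A", "B", "C", "D"]}
hints = {n: [] for n in stores}              # hinted replicas held per node
alive = {"A": False, "B": True, "C": True, "D": True}


def write(key, value, preference_list):
    for node in preference_list:
        if alive[node]:
            stores[node][key] = value
        else:
            # The next healthy node outside the preference list accepts the
            # write and remembers which node it was intended for.
            fallback = next(n for n in stores
                            if alive[n] and n not in preference_list)
            stores[fallback][key] = value
            hints[fallback].append((node, key, value))


def handoff():
    for holder, pending in hints.items():
        for intended, key, value in list(pending):
            if alive[intended]:
                stores[intended][key] = value    # deliver the replica ...
                del stores[holder][key]          # ... and drop the local copy
                pending.remove((intended, key, value))


write("k1", b"v1", ["A", "B", "C"])   # A is down, so D holds a hinted replica
alive["A"] = True
handoff()
print("k1" in stores["A"], "k1" in stores["D"])   # True False
```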

Handling permanent failures (1/2). Anti-entropy for replica synchronization: use Merkle trees for fast inconsistency detection and minimal transfer of data. [figure: Merkle tree of hashes over data items D2, D3, D4, D5] 32
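A sketch of Merkle-tree anti-entropy: build a hash tree over a key range, then compare two trees top-down and descend only where the hashes differ. The tree layout and function names are illustrative, not Dynamo's actual format.

```python
# Sketch of Merkle-tree anti-entropy: replicas compare roots and descend
# only into subtrees whose hashes differ, so little data is exchanged when
# they are nearly in sync.
import hashlib


def h(data: bytes) -> bytes:
    return hashlib.sha256(data).digest()


def build_tree(items):
    """items: sorted (key, value) pairs -> nested (hash, children) tuples."""
    leaves = [(h(k.encode() + v), None) for k, v in items]
    level = leaves or [(h(b""), None)]
    while len(level) > 1:
        nxt = []
        for i in range(0, len(level), 2):
            pair = level[i:i + 2]
            combined = h(b"".join(node[0] for node in pair))
            nxt.append((combined, pair))
        level = nxt
    return level[0]


def diff(a, b, out):
    """Collect leaves where the two trees disagree."""
    if a[0] == b[0]:
        return                      # identical subtree: nothing to transfer
    if a[1] is None or b[1] is None:
        out.append((a[0], b[0]))    # differing leaf: this item must be synced
        return
    for child_a, child_b in zip(a[1], b[1]):
        diff(child_a, child_b, out)


replica1 = build_tree([("k1", b"v1"), ("k2", b"v2"), ("k3", b"v3"), ("k4", b"v4")])
replica2 = build_tree([("k1", b"v1"), ("k2", b"XX"), ("k3", b"v3"), ("k4", b"v4")])
mismatches = []
diff(replica1, replica2, mismatches)
print(len(mismatches))   # 1: only the leaf containing k2 differs
```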

Handling permanent failures (2/2). Nodes maintain a Merkle tree for each key range and exchange the roots of the trees to check whether the key ranges are up to date. [figure: nodes A, B, C and their key ranges] 33

Quorums under failures. Due to partitions, quorums might not exist: create transient replicas and reconcile after the partition heals. Example: R=3, W=3, N=5. 34

Membership. A managed system: an administrator explicitly adds and removes nodes, and the receiving node stores the change with a timestamp. Gossiping propagates membership changes, giving an eventually consistent view. The result is an O(1)-hop overlay, versus log(n) hops otherwise: e.g. for n=1024, 10 hops at 50 ms/hop is 500 ms. 35

Failure detection. Passive failure detection: pings are used only to detect the transition from failed back to alive. A detects B as failed if B does not respond to a message, and A then periodically checks whether B is alive again. In the absence of client requests, A does not need to know whether B is alive. Permanent node additions and removals are explicit. 36
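A sketch of this passive failure detector: a peer is marked suspected only when a normal request to it times out, and it is probed periodically until it answers again. The class and timing parameters are illustrative.

```python
# Sketch of passive, request-driven failure detection.
import time


class FailureDetector:
    def __init__(self, probe_interval=10.0):
        self.suspected = {}              # node -> time it was last marked/probed
        self.probe_interval = probe_interval

    def request_failed(self, node):
        """Called when a client-driven request to node times out."""
        self.suspected[node] = time.monotonic()

    def nodes_to_probe(self):
        """Nodes whose liveness should be re-checked with a ping."""
        now = time.monotonic()
        return [n for n, t in self.suspected.items()
                if now - t >= self.probe_interval]

    def probe_succeeded(self, node):
        self.suspected.pop(node, None)   # node is alive again


fd = FailureDetector(probe_interval=0.0)
fd.request_failed("node-b")
print(fd.nodes_to_probe())    # ['node-b']
fd.probe_succeeded("node-b")
print(fd.nodes_to_probe())    # []
```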

Adding nodes. A new node X is added to the system and is assigned key ranges according to its virtual servers; for each key range, the nodes currently holding it transfer the corresponding data items to X. [figure: node X joins a ring of nodes A-G; the affected nodes split their key ranges at X and hand the corresponding data items to X] 37

Removing nodes. Reallocation of keys is the reverse of the process for adding nodes. 38

Implementation details. Local persistence: BDB (Berkeley DB), MySQL, etc. Request coordination: a read operation creates the context, performs syntactic reconciliation and read repair; a write operation is coordinated so as to improve the chances of read-your-writes. 39

Evaluation 40

Evaluation 41

Partitioning and placement (1/2). With the original scheme, data ranges are not fixed: more time is spent locating items, more storage is needed for indexing, bootstrapping is inefficient, and archiving the whole data set is difficult. [figure: ring of nodes A-H] 42

Partitioning and placement (2/2). Instead, divide the data space into equally sized ranges and assign the ranges to nodes. [figure: ring of nodes A-H with fixed, equal-sized ranges] 43
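A sketch of this fixed-partition scheme: the hash space is cut into Q equal ranges up front, and placement only decides which node serves each range, so range boundaries never move when nodes join or leave. The round-robin assignment is illustrative; Dynamo's actual placement strategies are more involved.

```python
# Sketch of fixed, equal-sized partitions over the hash space.
import hashlib

Q = 8                       # number of fixed partitions (small for the demo)
SPACE = 2 ** 32


def partition_of(key: str) -> int:
    pos = int(hashlib.md5(key.encode()).hexdigest(), 16) % SPACE
    return pos * Q // SPACE          # equal-sized range index 0..Q-1


def assign(partitions, nodes):
    """Round-robin placement of partitions onto nodes (illustrative only)."""
    return {p: nodes[p % len(nodes)] for p in range(partitions)}


placement = assign(Q, ["node-a", "node-b", "node-c"])
p = partition_of("shopping-cart:12345")
print(p, placement[p])
```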

Versions of an item. Reasons: node failures, data-center failures, network partitions; a large number of concurrent writes to an item. Occurrence: 99.94% one version, 0.00057% two versions, 0.00047% three versions, 0.00009% four versions. The evaluation attributed the divergent versions mainly to concurrent writes. 44

Client vs. server coordination. Read requests are coordinated by any Dynamo node; write requests are coordinated by a node replicating the data item. Request coordination can be moved to the client by using a client library, which reduces latency by saving one hop; the client library updates its view of membership periodically. 45

End notes. Peer-to-peer techniques have been the key enablers for building Dynamo: "... decentralized techniques can be combined to provide a single highly-available system." 46

References. Dynamo: Amazon's Highly Available Key-value Store, Giuseppe DeCandia et al., SOSP 2007. Bigtable: A Distributed Storage System for Structured Data, Fay Chang et al., OSDI 2006. Cassandra - http://cassandra.apache.org/. Eventually Consistent - http://www.allthingsdistributed.com/2008/12/eventually_consistent.html. Key-value stores, NoSQL. 47