Dynamo: Amazon's Highly Available Key-value Store
Presented By: Devarsh Patel
CS5204 Operating Systems

Introduction
- Amazon's e-commerce platform requires performance, reliability, and efficiency; to support continuous growth, the platform needs to be highly scalable.
- Dynamo is a highly available and scalable distributed data store built for Amazon's platform. It is used to manage services that have very high reliability requirements and need tight control over the tradeoffs between availability, consistency, cost-effectiveness, and performance.
- Dynamo provides a simple primary-key-only interface to meet the requirements of applications such as best-seller lists, shopping carts, customer preferences, and session management.
- It is a completely decentralized system with minimal need for manual administration.

System Assumptions and Requirements
- Simple key-value interface
- Highly available and efficient in resource usage
- Simple scale-out scheme to address growth in data set size or request rates
- Each service that uses Dynamo runs its own Dynamo instances
- Used only by Amazon's internal services: a non-hostile environment, so no security requirements such as authentication and authorization
- Targets applications that operate with weaker consistency in favor of high availability
- Service level agreements (SLAs)
  - Measured at the 99.9th percentile of the latency distribution
  - Key factor: service latency at a given request rate
  - Example: a response time of 300 ms for 99.9% of requests at a peak client load of 500 requests per second
  - State management is the main component of a service's SLAs

Design Considerations
- Designed to be an eventually consistent, always-writeable data store
- Consistency vs. availability
  - To achieve strong consistency, replication algorithms are forced to trade off the availability of data under certain failure scenarios.
  - To improve availability, Dynamo uses a weaker form of consistency (eventual consistency).
  - This allows optimistic replication techniques, which can lead to conflicting changes that must be detected and resolved.
  - The data store or the application performs conflict resolution on reads.
- Other key principles
  - Incremental scalability: scale out one storage node at a time
  - Symmetry: every node has the same set of responsibilities
  - Decentralization: favor decentralized peer-to-peer techniques
  - Heterogeneity: work distribution must be proportional to the capabilities of individual servers

System Architecture
Core distributed systems techniques used in Dynamo: partitioning, replication, versioning, membership, failure handling, and scaling

System Interface
- Two operations: get() and put()
  - get(key): locates the object replicas associated with the key in the storage system and returns a single object, or a list of objects with conflicting versions, along with a context
  - put(key, context, object): determines where the replicas of the object should be placed based on the associated key, and writes the replicas to disk
- The context encodes system metadata about the object
- An MD5 hash of the key generates a 128-bit identifier used to determine the responsible storage nodes (see the sketch below)
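
A minimal Python sketch of this two-operation interface and the MD5-based key hashing. The names key_to_position and DynamoStore are illustrative assumptions, not the paper's code:

```python
# Illustrative sketch of Dynamo's get/put interface and key hashing.
import hashlib
from typing import Any, List, Tuple

def key_to_position(key: bytes) -> int:
    """MD5-hash the key into a 128-bit identifier on the ring."""
    return int.from_bytes(hashlib.md5(key).digest(), "big")

class DynamoStore:
    def get(self, key: bytes) -> Tuple[List[Any], dict]:
        """Return one object, or a list of conflicting versions, plus a context."""
        raise NotImplementedError  # routing and quorum logic appear on later slides

    def put(self, key: bytes, context: dict, obj: Any) -> None:
        """Write replicas of obj to the nodes chosen from key_to_position(key)."""
        raise NotImplementedError
```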

Consistent Hashing
- Partitioning algorithm: the hash function's output range is treated as a fixed circular space, or "ring"
- Advantage: departure or arrival of a node only affects its immediate neighbors
- Issue: non-uniform data and load distribution
- Dynamo uses a variant of consistent hashing based on the concept of virtual nodes (see the sketch below)
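
A minimal sketch, not Dynamo's actual implementation, of consistent hashing with virtual nodes: each physical node is hashed onto the ring at several token positions, and a key is served by the first token clockwise from its position. Names such as ConsistentHashRing and tokens_per_node are assumptions for illustration.

```python
import bisect
import hashlib

class ConsistentHashRing:
    def __init__(self, tokens_per_node: int = 32):
        self.tokens_per_node = tokens_per_node
        self.ring = []  # sorted list of (position, physical_node) pairs

    def _position(self, data: bytes) -> int:
        return int.from_bytes(hashlib.md5(data).digest(), "big")

    def add_node(self, node: str) -> None:
        # Each physical node owns many virtual nodes (tokens), which smooths
        # load and lets heterogeneous nodes take a proportional share.
        for i in range(self.tokens_per_node):
            pos = self._position(f"{node}#{i}".encode())
            bisect.insort(self.ring, (pos, node))

    def remove_node(self, node: str) -> None:
        # Departure only affects the keys that mapped to this node's tokens.
        self.ring = [(p, n) for (p, n) in self.ring if n != node]

    def successor_index(self, key: bytes) -> int:
        """Index of the first token at or clockwise after the key's position."""
        assert self.ring, "ring has no nodes"
        pos = self._position(key)
        idx = bisect.bisect_left(self.ring, (pos,))
        return idx % len(self.ring)
```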

Replication
- Data is replicated on multiple hosts to achieve high availability and durability; the replication factor is configured per Dynamo instance
- Preference list: the list of nodes responsible for storing a particular key (see the sketch below)
- Figure 1: Partitioning and replication of keys in the Dynamo ring
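
A minimal sketch of how a preference list could be derived from the ring sketched above: walk clockwise from the key's position and collect N distinct physical nodes, skipping additional virtual nodes of hosts already chosen. The function name preference_list is an illustrative assumption.

```python
def preference_list(ring: ConsistentHashRing, key: bytes, n: int = 3) -> list:
    """Collect N distinct physical nodes clockwise from the key's position."""
    nodes, seen = [], set()
    start = ring.successor_index(key)
    for step in range(len(ring.ring)):
        _, node = ring.ring[(start + step) % len(ring.ring)]
        if node not in seen:
            seen.add(node)
            nodes.append(node)
            if len(nodes) == n:
                break
    return nodes
```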

Data Versioning
- Dynamo treats the result of each modification as a new and immutable version of the data
- Allows multiple versions of an object to be present in the system at the same time
- Problem: version branching due to failures combined with concurrent updates, resulting in conflicting versions of an object
  - Updates in the presence of network partitions and node failures result in an object having distinct version sub-histories

Data Versioning
- Dynamo uses vector clocks: a list of (node, counter) pairs
- Determines whether two versions of an object are on parallel branches or have a causal ordering (see the sketch below)
- Conflicts require reconciliation
  - Conflicting versions are passed to the application as the output of a get operation
  - The application resolves the conflicts and puts a new (consistent) version
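
A minimal sketch of the vector-clock comparison: a clock is modeled as a dict mapping node id to counter, and two versions conflict when neither clock dominates the other. The helper names and the example clocks (which mirror the paper's version-evolution figure) are illustrative.

```python
def descends(a: dict, b: dict) -> bool:
    """True if the version with clock a causally descends from (or equals) b."""
    return all(a.get(node, 0) >= counter for node, counter in b.items())

def resolve(a: dict, b: dict) -> str:
    """Classify a pair of versions; siblings must be reconciled by the app."""
    if descends(a, b):
        return "a supersedes b"
    if descends(b, a):
        return "b supersedes a"
    return "conflict: parallel branches, return both versions to the application"

# Example: node Sx wrote twice, then Sy and Sz wrote concurrently.
d3 = {"Sx": 2, "Sy": 1}
d4 = {"Sx": 2, "Sz": 1}
print(resolve(d3, d4))  # conflict: parallel branches, ...
```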

Data Versioning
Figure: Version evolution of an object over time

Execution of get/put Operations
- Two strategies to select a node:
  - Route the request through a load balancer
  - Send the request directly to a coordinator node
- Coordinator: the node handling the read or write operation; it is the first among the top N nodes in the preference list
- Quorum system (see the sketch below)
  - Two key configurable values: R and W
    - R: minimum number of nodes that must participate in a successful read operation
    - W: minimum number of nodes that must participate in a successful write operation
  - A quorum-like system requires R + W > N
  - (N, R, W) can be chosen to achieve the desired tradeoff
  - R and W are usually configured to be less than N to provide better latency
  - A write is successful if W-1 other nodes respond to the put() request (the coordinator's own write counts as the first)
  - A read is successful if R nodes respond to the get() request
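
A minimal sketch of the quorum logic described above. replica_put and replica_get stand in for the network calls to individual replicas; they are assumed stubs, not a real API.

```python
N, R, W = 3, 2, 2
assert R + W > N  # quorum-like overlap between read and write sets

def coordinated_put(prefs, key, context, obj, replica_put) -> bool:
    # The coordinator writes locally, then succeeds once W-1 other replicas ack.
    acks = 1  # the coordinator's own local write
    for node in prefs[1:N]:
        if replica_put(node, key, context, obj):
            acks += 1
        if acks >= W:
            return True
    return acks >= W

def coordinated_get(prefs, key, replica_get):
    # Succeeds once R replicas respond; divergent versions are all returned.
    versions = []
    for node in prefs[:N]:
        v = replica_get(node, key)
        if v is not None:
            versions.append(v)
        if len(versions) >= R:
            break
    return versions if len(versions) >= R else None
```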

Sloppy Quorum & Hinted Handoff
- All read and write operations are performed on the first N healthy nodes in the preference list; the coordinator is the first node in this group
- If a preferred node is unreachable, its replica is sent to another node with a hint in the metadata indicating the original node that should hold the replica
- Hinted replicas are stored by the available node and forwarded when the original node recovers (see the sketch below)
- Ensures that read and write operations do not fail due to temporary node or network failures
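
A minimal sketch, under assumed names (HintedStore, handoff), of how hinted replicas might be kept separately and replayed to the original owner once it recovers:

```python
class HintedStore:
    def __init__(self, node_id: str):
        self.node_id = node_id
        self.data = {}     # regular replicas this node owns
        self.hinted = []   # (intended_owner, key, value) kept in a separate store

    def store(self, key, value, intended_owner: str = None) -> None:
        # A hint names the node that should really hold this replica.
        if intended_owner and intended_owner != self.node_id:
            self.hinted.append((intended_owner, key, value))
        else:
            self.data[key] = value

    def handoff(self, recovered_node: str, send) -> None:
        """Forward hinted replicas once the original owner is reachable again."""
        remaining = []
        for owner, key, value in self.hinted:
            if owner == recovered_node and send(owner, key, value):
                continue  # delivered, so drop the local hinted copy
            remaining.append((owner, key, value))
        self.hinted = remaining
```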

Replica Synchronization
- Merkle trees are used to detect inconsistencies between replicas faster and to minimize the amount of data transferred (see the sketch below)
- Each node maintains a separate tree for each key range it hosts
- Advantage: each branch of the tree can be checked independently, without requiring nodes to download the entire tree or the entire data set
- Disadvantage: adds overhead to maintain the Merkle trees when a node joins or leaves the system
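
A minimal sketch of computing a Merkle (hash) tree root over one key range; the choice of SHA-1 and the pairing scheme are illustrative assumptions, not the paper's specification. Equal roots mean the replicas agree; differing subtrees localize which keys need to be transferred.

```python
import hashlib

def merkle_root(items) -> bytes:
    """items: sorted list of (key, value_hash) pairs for one key range."""
    level = [hashlib.sha1(f"{k}:{h}".encode()).digest() for k, h in items]
    if not level:
        return hashlib.sha1(b"empty").digest()
    while len(level) > 1:
        if len(level) % 2:                 # duplicate the last node on odd levels
            level.append(level[-1])
        level = [hashlib.sha1(level[i] + level[i + 1]).digest()
                 for i in range(0, len(level), 2)]
    return level[0]

# Two replicas first compare roots; only subtrees whose hashes differ are
# exchanged, so an in-sync key range costs a single comparison.
```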

Membership and Failure Detection
- Ring membership
  - Explicit mechanism to add or remove a node from the ring, performed by an administrator using a command-line tool or a browser
  - A gossip-based protocol propagates membership, partitioning, and placement information via periodic exchanges (see the sketch below)
  - Nodes eventually know the key ranges of their peers and can forward requests to them
- External discovery
  - To prevent logical partitions, some nodes play the role of seeds
  - Seed nodes are discovered via an external mechanism and are known to all nodes
- Failure detection
  - Node failures are detected by lack of responsiveness; recovery is detected by periodic retries
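
A minimal sketch of one gossip round: two random nodes merge their membership views, keeping the higher-versioned entry for each peer. The data layout (node -> (status, version)) is an assumption for illustration and omits the partitioning/placement payload Dynamo also exchanges.

```python
import random

def gossip_round(views: dict) -> None:
    """views maps node -> {peer: (status, version)}; perform one exchange."""
    a = random.choice(list(views))
    b = random.choice([n for n in views if n != a])
    merged = dict(views[a])
    for peer, entry in views[b].items():
        # Keep whichever entry carries the newer version number.
        if peer not in merged or entry[1] > merged[peer][1]:
            merged[peer] = entry
    views[a] = dict(merged)
    views[b] = dict(merged)  # both sides converge toward the same view
```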

Experiences & Lessons Learned
- Main patterns in which Dynamo is used:
  - Business-logic-specific reconciliation
  - Timestamp-based reconciliation
  - High-performance read engine
- Client applications can tune the values of N, R, and W
- A common (N, R, W) configuration used by several instances of Dynamo is (3, 2, 2)

Experiences & Lessons Learned
Balancing Performance and Durability

Experiences & Lessons Learned
Ensuring Uniform Load Distribution

Partitioning & Placement Strategies
Figure: Partitioning and placement of keys in the three strategies. A, B, and C depict the three unique nodes that form the preference list for the key k1 on the consistent hashing ring (N=3). The shaded area indicates the key range for which nodes A, B, and C form the preference list. Dark arrows indicate the token locations for the various nodes.

Partitioning & Placement Strategies
- Strategy 1: T random tokens per node, partition by token value
  - A new node needs to steal its key ranges from other nodes, so bootstrapping a new node is lengthy
  - Other nodes process the scanning/transmission of key ranges for the new node as background activities
  - Disadvantages:
    - Numerous nodes have to adjust their Merkle trees when a node joins or leaves the system
    - Archiving the entire key space is highly inefficient

Partitioning & Placement Strategies
- Strategy 2: T random tokens per node, equal-sized partitions
  - The hash space is divided into Q equally sized partitions
  - Q >> N and Q >> S*T, where S is the number of nodes in the system
  - Advantages:
    - Decoupling of partitioning and partition placement
    - Allows changing the placement scheme at run time
- Strategy 3: Q/S tokens per node, equal-sized partitions (see the sketch below)
  - Also decouples partitioning and placement
  - Advantages:
    - Faster bootstrapping/recovery
    - Ease of archival
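
A minimal sketch of Strategy 3's bookkeeping: the hash space is pre-divided into Q equal partitions and placement becomes a simple partition-to-node map; the round-robin assignment policy shown here is an illustrative assumption, not the paper's exact scheme.

```python
def assign_partitions(q: int, nodes: list) -> dict:
    """Spread Q equal partitions over the current nodes (illustrative policy)."""
    return {p: nodes[p % len(nodes)] for p in range(q)}

def partition_of(key_position: int, q: int) -> int:
    """Map a 128-bit ring position to one of Q equal-sized partitions."""
    return key_position * q // (1 << 128)

# Example: Q = 8 partitions placed over 3 nodes; each node holds roughly Q/S.
placement = assign_partitions(8, ["A", "B", "C"])
```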

Partitioning & Placement Strategies
- The strategies have different tuning parameters
- A fair way to compare them is to evaluate the skew in their load distributions for a fixed amount of space used to maintain membership information
- Strategy 3 achieves the best load-balancing efficiency

Client-driven or Server-driven Coordination
- Any node can coordinate read requests; write requests are handled by a coordinator node
- The state machine for coordination can live in a load-balancing server or be incorporated into the client
- Client-driven coordination has lower latency because it avoids an extra network hop (redirection)

Thank You