Dynamo: Amazon's Highly Available Key-value Store

Dynamo: Amazon's Highly Available Key-value Store
Giuseppe DeCandia, Deniz Hastorun, Madan Jampani, Gunavardhan Kakulapati, Avinash Lakshman, Alex Pilchin, Swaminathan Sivasubramanian, Peter Vosshall, and Werner Vogels
October 2007, SOSP '07: Proceedings of the twenty-first ACM SIGOPS Symposium on Operating Systems Principles. Publisher: ACM
Apr. 1, 2009

OUTLINE: INTRODUCTION, BACKGROUND, SYSTEM ARCHITECTURE, IMPLEMENTATION, EXPERIENCES, CONCLUSIONS

INTRODUCTION Amazon uses a highly decentralized, loosely coupled, service-oriented architecture consisting of hundreds of services. There are always a small but significant number of server and network components failing at any time. Dynamo is another highly available and scalable distributed data store, like S3.

BACKGROUND
System Assumptions and Requirements: Query Model, ACID, Efficiency, Others
Service Level Agreements (SLAs)
Design Considerations: Conflict Resolution, Incremental Scalability, Symmetry, Decentralization, Heterogeneity

SYSTEM ARCHITECTURE
System Interface; Partitioning Algorithm; Replication and Data Versioning; Execution of get() and put() Operations; Handling Failures; Membership and Failure Detection; Adding/Removing Storage Nodes

System Interface Dynamo stores objects associated with a key through a simple interface: get(key) and put(key, context, object). The context encodes system metadata about the object that is opaque to the caller and includes information such as the version of the object.
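
A minimal sketch of how a caller might use this interface, assuming a hypothetical in-memory DynamoClient class (only the get/put names come from the slide; everything else here is illustrative):

class DynamoClient:
    # Hypothetical single-process stand-in for a Dynamo client.
    def __init__(self):
        self._store = {}

    def get(self, key):
        # Returns (versions, context): possibly several conflicting versions
        # plus an opaque context that must be passed back on the next put.
        return self._store.get(key, ([], None))

    def put(self, key, context, obj):
        # The caller supplies the context obtained from a preceding get; the
        # system uses it (e.g., a vector clock) to track causality.
        self._store[key] = ([obj], context)

client = DynamoClient()
versions, ctx = client.get("cart:alice")
client.put("cart:alice", ctx, {"items": ["book"]})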

Partitioning Algorithm Dynamo's partitioning scheme relies on consistent hashing to distribute the load across multiple storage hosts. The output range of a hash function is treated as a fixed circular space or ring. Each node in the system is assigned a random value within this space, which represents its position on the ring.

Partitioning Algorithm (figure): Consistent Hashing in Dynamo, showing data items, the hash function, virtual nodes, and the coordinator.
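
The slide's consistent-hashing idea can be sketched as follows (a simplified illustration, assuming MD5 as the hash and 8 virtual nodes per physical node; Dynamo's actual token assignment differs):

import hashlib
from bisect import bisect_right

def _hash(value):
    return int(hashlib.md5(value.encode()).hexdigest(), 16)

class Ring:
    def __init__(self, nodes, vnodes_per_node=8):
        # Each physical node is assigned several positions ("virtual nodes").
        self._ring = sorted(
            (_hash(f"{node}#{i}"), node)
            for node in nodes
            for i in range(vnodes_per_node)
        )
        self._positions = [pos for pos, _ in self._ring]

    def coordinator(self, key):
        # Walk clockwise from the key's hash; the first node found is the
        # coordinator for that key.
        idx = bisect_right(self._positions, _hash(key)) % len(self._ring)
        return self._ring[idx][1]

ring = Ring(["A", "B", "C", "D", "E", "F", "G"])
print(ring.coordinator("shopping-cart:42"))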

Replication Each data item is replicated at N hosts, where N is a parameter configured per instance. Each key, k, is assigned to a coordinator node, which is in charge of the replication of the data items. The list of nodes responsible for storing a particular key is called the preference list.

Replication (figure): Replication of keys in Dynamo on a ring of nodes A through G, with node A as coordinator and the data replicated at N = 3 nodes.
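
Building on the Ring sketch above, a preference list can be obtained by walking clockwise from the key's position and collecting N distinct physical nodes, skipping extra virtual nodes of nodes already chosen (an illustrative reading of the slide, not Dynamo's exact code):

def preference_list(ring, key, n=3):
    # Reuses Ring, _hash and bisect_right from the consistent-hashing sketch.
    start = bisect_right(ring._positions, _hash(key)) % len(ring._ring)
    nodes = []
    for offset in range(len(ring._ring)):
        _, node = ring._ring[(start + offset) % len(ring._ring)]
        if node not in nodes:
            nodes.append(node)      # the first entry acts as coordinator
        if len(nodes) == n:
            break
    return nodes

print(preference_list(ring, "shopping-cart:42", n=3))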

Data Versioning Dynamo provides eventual consistency, which allows updates to be propagated to all replicas asynchronously. Dynamo uses vector clocks in order to capture causality between different versions of the same object. The client needs to perform the reconciliation if the system can't.

Data Versioning (figure): Version evolution of an object over time. A client writes a new object D1 ([Sx, 1]), handled by Sx; the same client updates it to D2 ([Sx, 2]), again handled by Sx; different clients then update the object through different servers, producing D3 ([Sx, 2], [Sy, 1]) handled by Sy and D4 ([Sx, 2], [Sz, 1]) handled by Sz; finally a client performs the reconciliation, and the result D5 ([Sx, 3], [Sy, 1], [Sz, 1]) is written by Sx.
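
The causality rules behind this figure can be captured with a small vector-clock sketch (clocks are shown as dicts here; Dynamo stores them as (node, counter) lists and truncates clocks that grow too large):

def increment(clock, node):
    clock = dict(clock)
    clock[node] = clock.get(node, 0) + 1
    return clock

def descends(a, b):
    # True if the version with clock a is a descendant of (or equal to) b.
    return all(a.get(node, 0) >= counter for node, counter in b.items())

def conflicting(a, b):
    # Neither clock descends from the other: the versions are concurrent and
    # must be reconciled by the client (e.g., the shopping cart merge).
    return not descends(a, b) and not descends(b, a)

d3 = {"Sx": 2, "Sy": 1}
d4 = {"Sx": 2, "Sz": 1}
print(conflicting(d3, d4))                  # True: D3 and D4 diverged
d5 = {"Sx": 3, "Sy": 1, "Sz": 1}            # reconciled version
print(descends(d5, d3), descends(d5, d4))   # True True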

Execution of get() and put() Any storage node in Dynamo is eligible to receive client get and put operations for any key. There are two strategies that a client can use to select a node: route its request through a load balancer, or use a partition-aware client library.

Execution of get() and put() To maintain consistency among its replicas, Dynamo uses a consistency protocol with two key configurable values: R, the minimum number of nodes that must participate in a successful read operation; and W, the minimum number of nodes that must participate in a successful write operation.
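
A sketch of how a coordinator might apply R and W, with hypothetical replica stubs standing in for remote storage nodes (the store/load methods and the error handling are illustrative assumptions):

class Replica:
    # Hypothetical replica stub; a real node would persist data locally.
    def __init__(self):
        self._data = {}
    def store(self, key, value, clock):
        self._data[key] = (value, clock)
        return True                         # acknowledge the write
    def load(self, key):
        return self._data.get(key)          # None if this replica missed it

N, R, W = 3, 2, 2

def quorum_write(preference_nodes, key, value, clock):
    acks = sum(1 for n in preference_nodes[:N] if n.store(key, value, clock))
    return acks >= W                        # succeed once W replicas acknowledge

def quorum_read(preference_nodes, key):
    responses = [n.load(key) for n in preference_nodes[:N]]
    responses = [r for r in responses if r is not None]
    if len(responses) < R:
        raise RuntimeError("fewer than R replicas responded")
    return responses                        # may contain divergent versions

replicas = [Replica() for _ in range(N)]
print(quorum_write(replicas, "k1", "v1", {"Sx": 1}))
print(quorum_read(replicas, "k1"))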

Handling Failures: Hinted Handoff All read and write operations are performed on the first N healthy nodes from the preference list, which may not always be the first N nodes encountered while walking the consistent hashing ring. This ensures that read and write operations do not fail due to temporary node or network failures.

Handling Failures: Hinted Handoff (figure): Hinted handoff during a write operation on a ring of nodes A through G; with node A unreachable, the replica is written to another node with a hint in its metadata, and that node periodically checks whether A has recovered.
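
An in-memory sketch of hinted handoff as described on this slide (the data structures and function names are illustrative only):

from collections import defaultdict

hints = defaultdict(list)   # temporary holder -> [(intended owner, key, value)]

def write_with_hints(preference_list, healthy_nodes, substitutes, key, value):
    # Send the replica to each healthy node in the preference list; for each
    # unreachable node, pick a substitute and attach a hint naming the owner.
    targets = []
    backups = iter(substitutes)
    for node in preference_list:
        if node in healthy_nodes:
            targets.append(node)
        else:
            holder = next(backups)
            hints[holder].append((node, key, value))
            targets.append(holder)
    return targets

def deliver_hints(holder, recovered_node):
    # The holder periodically checks for recovery and hands the replicas back.
    delivered = [h for h in hints[holder] if h[0] == recovered_node]
    hints[holder] = [h for h in hints[holder] if h[0] != recovered_node]
    return delivered

print(write_with_hints(["A", "B", "C"], {"B", "C"}, ["D"], "k1", "v1"))  # ['D', 'B', 'C']
print(deliver_hints("D", "A"))   # [('A', 'k1', 'v1')] once A recovers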

Handling Permanent Failures: Replica Synchronization To detect inconsistencies between replicas faster and to minimize the amount of transferred data, Dynamo uses Merkle trees: hash trees whose leaves are hashes of the values of individual keys, and whose parent nodes higher in the tree are hashes of their respective children.
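
A small Merkle-tree sketch consistent with this description (SHA-256 and the node-pairing scheme are illustrative choices, not taken from the paper):

import hashlib

def h(data):
    return hashlib.sha256(data).digest()

def merkle_tree(kv_pairs):
    # kv_pairs: sorted (key, value) pairs covering one key range on the ring.
    level = [h(f"{k}={v}".encode()) for k, v in kv_pairs]
    levels = [level]
    while len(level) > 1:
        nxt = []
        for i in range(0, len(level), 2):
            right = level[i + 1] if i + 1 < len(level) else b""
            nxt.append(h(level[i] + right))   # parent = hash of its children
        level = nxt
        levels.append(level)
    return levels                             # levels[-1][0] is the root hash

a = merkle_tree([("k1", "v1"), ("k2", "v2"), ("k3", "v3"), ("k4", "v4")])
b = merkle_tree([("k1", "v1"), ("k2", "v2x"), ("k3", "v3"), ("k4", "v4")])
print(a[-1][0] == b[-1][0])   # False: roots differ, so descend to find k2

Two replicas compare root hashes first and descend only into subtrees that differ, so the amount of data exchanged is proportional to the divergence rather than to the size of the key range.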

Membership and Failure Detection Ring Membership: an explicit mechanism initiates the addition and removal of nodes from the ring. External Discovery: to prevent logical partitions, some Dynamo nodes play the role of seeds. Failure Detection: avoid attempts to communicate with unreachable peers.
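
The slide does not name the propagation mechanism; the paper propagates membership changes with a gossip-based protocol, and the sketch below assumes a simple version of that (keeping the higher change counter per member is an illustrative merge rule, not the paper's exact scheme):

import random

def gossip_round(views):
    # views: node -> {member: change_counter}; each node exchanges its view
    # with one random peer, and both keep, per member, the newest entry.
    nodes = list(views)
    for node in nodes:
        peer = random.choice([n for n in nodes if n != node])
        merged = dict(views[node])
        for member, counter in views[peer].items():
            if counter > merged.get(member, -1):
                merged[member] = counter
        views[node] = merged
        views[peer] = dict(merged)
    return views

views = {"A": {"A": 1, "B": 1}, "B": {"B": 2, "C": 1}, "C": {"C": 1}}
for _ in range(3):
    views = gossip_round(views)
print(views["C"])   # eventually includes A and B as well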

Adding/Removing Storage Nodes (figure): A bootstrapping scenario when adding a node; a new node X joins the ring of nodes A through G, and existing nodes transfer to X the keys it is now responsible for.

IMPLEMENTATION Each storage node in Dynamo has three main software components: request coordination (using Java NIO channels), membership and failure detection, and a local persistence engine that allows for different storage engines.

EXPERIENCES Three main patterns in which Dynamo is used: business-logic-specific reconciliation, timestamp-based reconciliation, and a high-performance read engine. The common (N, R, W) configuration used by several instances of Dynamo is (3, 2, 2).

EXPERIENCES Balancing Performance and Durability Average and 99.9th percentile latencies for read and write requests during Amazon's peak request season of December 2006

EXPERIENCES Balancing Performance and Durability Comparison of performance of 99.9th percentile latencies for buffered vs. non-buffered writes over a period of 24 hours

EXPERIENCES Ensuring Uniform Load Distribution Fraction of nodes that are out-of-balance and their corresponding request load

EXPERIENCES Ensuring Uniform Load Distribution
Strategy 1: T random tokens per node and partition by token value
Strategy 2: T random tokens per node and equal-sized partitions
Strategy 3: Q/S tokens per node, equal-sized partitions
Figure: Partitioning and placement of keys in the three strategies. A, B, and C depict the three unique nodes that form the preference list for the key k1 on the consistent hashing ring (N=3)

EXPERIENCES Ensuring Uniform Load Distribution Comparison of the load distribution efficiency of the different strategies for a system with 30 nodes and N=3, with an equal amount of metadata maintained at each node

EXPERIENCES Divergent Versions: When and How Many? Divergent versions occur when the system is facing failure scenarios or is handling a large number of concurrent writes to a single data item. In the experiment, the number of versions returned to the shopping cart service was: 99.94% of requests saw one version; 0.00057% saw 2 versions; 0.00047% saw 3 versions; 0.00009% saw 4 versions.

EXPERIENCES Client-driven or Server-driven Coordination
Performance of client-driven and server-driven coordination approaches:

                                        Server-driven   Client-driven
99.9th percentile read latency (ms)          68.9            30.4
99.9th percentile write latency (ms)         68.5            30.4
Average read latency (ms)                     3.9             1.55
Average write latency (ms)                    4.02            1.9

CONCLUSIONS The production use of Dynamo for the past year demonstrates that decentralized techniques can be combined to provide a single highly available system. Its success in one of the most challenging application environments shows that an eventually consistent storage system can be a building block for highly available applications.