CS 5450 Amazon Dynamo

CS 5450 Amazon Dynamo

Amazon's Architecture

Dynamo
The platform for Amazon's e-commerce services: shopping cart, best-seller lists, product catalog, promotional items, etc.
A highly available, distributed key-value storage system with put() and get() interfaces
Aims to provide an "always on" guarantee, at the cost of losing some consistency
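
To make the interface concrete, here is a minimal, illustrative Python sketch (the class and names are invented for this example, not Amazon's API): get() can return several conflicting versions together with an opaque context, and put() passes that context back so the store can track causality.

```python
from dataclasses import dataclass
from typing import Any, Dict, List


@dataclass
class VersionedValue:
    value: Any      # opaque blob (< 1 MB), e.g. a serialized shopping cart
    context: Dict   # opaque version metadata owned by the store


class ToyStore:
    """In-memory stand-in for Dynamo's put()/get() interface."""

    def __init__(self) -> None:
        self._data: Dict[str, List[VersionedValue]] = {}

    def get(self, key: str) -> List[VersionedValue]:
        # May return more than one version if replicas have diverged.
        return self._data.get(key, [])

    def put(self, key: str, context: Dict, value: Any) -> None:
        # The context says which versions this write supersedes; this toy
        # store simply overwrites and keeps a single version.
        self._data[key] = [VersionedValue(value, context)]


store = ToyStore()
store.put("cart:alice", context={}, value=["book"])
print(store.get("cart:alice"))
```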

Design Considerations
Incremental scalability: minimal management overhead
Symmetry: for ease of maintenance, no master/slave nodes
Decentralized control: centralized approaches lead to outages
Heterogeneity: exploit the capabilities of different nodes

CAP Theorem
Consistency: a single up-to-date copy of the data
Availability: of that data, for updates
Partition tolerance
Pick 2 out of 3

What Does CAP Really Mean?
No system in which partitions are possible can guarantee both C and A at all times
If your network is highly reliable (and fast), so that partitions are extremely rare, you can aim for both C and A (e.g., Google Spanner)

Requirements
Reliability: customers should be able to edit their shopping cart even when there are disk failures, network failures, tornadoes!
Efficiency: latency is measured at the 99.9th percentile of the distribution

ACID Properties
Atomicity
Consistency
Isolation
Durability

Requirements
Query model: only simple reads/writes to small data items (less than 1 MB), uniquely identified by a key
ACID properties:
- atomicity: important!
- consistency: weak is sufficient
- isolation: not at all
- durability: important!

Consistency
BASE: Basically Available, Soft state, Eventually consistent
Always writable: can always write to the shopping cart
Conflict resolution on reads
Application-driven conflict resolution: merge conflicting shopping carts so that an "add to cart" is never lost; push remaining inconsistencies to customer service

Requirements
Service Level Agreement (SLA): client and server agree on several characteristics related to the system, e.g., 300 ms response time at a peak load of 500 requests/second
Putting together a single webpage can require responses from 150 services for a typical request

Techniques Used by Dynamo
Problem | Technique | Advantage
Partitioning | Consistent hashing | Incremental scalability
High availability for writes | Vector clocks with reconciliation during reads | Version size is decoupled from update rates
Handling temporary failures | Sloppy quorum and hinted handoff | Provides high availability and durability guarantees when some of the replicas are not available
Recovering from permanent failures | Anti-entropy using Merkle trees | Synchronizes divergent replicas in the background
Membership and failure detection | Gossip-based membership protocol and failure detection | Preserves symmetry and avoids a centralized registry for storing membership and node liveness information

Consistent Hashing
(Figure: ring of nodes A through G; hash the key and place it on the ring; the next node clockwise is in charge of key k)
Issues:
- unbalanced load distribution
- nodes are heterogeneous: uneven resources per node

Virtual Nodes
(Figure: ring of nodes A through G, each appearing at multiple points; hash the key and place it on the ring; the next node clockwise is in charge of key k)
Each node has multiple points on the ring: load is more likely to even out
Different numbers of virtual nodes per node (depending on a node's resources)
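
A small Python sketch of consistent hashing with virtual nodes, under the simplifying assumptions of an in-memory ring and MD5 as the ring hash (names are illustrative, not Dynamo's actual implementation):

```python
import bisect
import hashlib


def ring_hash(s: str) -> int:
    """Map a string to a position on the ring (128-bit MD5 as a toy stand-in)."""
    return int(hashlib.md5(s.encode()).hexdigest(), 16)


class HashRing:
    def __init__(self, tokens_per_node: dict):
        # tokens_per_node: physical node -> number of virtual nodes, so more
        # capable machines can claim a larger share of the ring.
        self.ring = sorted(
            (ring_hash(f"{node}#{i}"), node)
            for node, tokens in tokens_per_node.items()
            for i in range(tokens)
        )
        self.hashes = [h for h, _ in self.ring]

    def owner(self, key: str) -> str:
        """Physical node whose virtual node is next clockwise from hash(key)."""
        idx = bisect.bisect_right(self.hashes, ring_hash(key)) % len(self.ring)
        return self.ring[idx][1]


ring = HashRing({"A": 4, "B": 4, "C": 8})   # C gets twice as many virtual nodes
print(ring.owner("cart:alice"))
```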

Node Joining or Leaving

What If

Replication
(Figure: ring of nodes A through G; the coordinator for key k replicates the data on a list of N nodes)
N is per key
Part of a longer preference list
Mindful of virtual nodes: the preference list skips ring positions so that it contains only distinct physical nodes
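
Building on the HashRing sketch above, a preference list can be sketched as a clockwise walk that collects N distinct physical nodes, skipping virtual nodes of machines already chosen (again illustrative, not Dynamo's code):

```python
import bisect


def preference_list(ring: "HashRing", key: str, n: int) -> list:
    """First n distinct physical nodes encountered walking clockwise from key."""
    start = bisect.bisect_right(ring.hashes, ring_hash(key))
    result, seen = [], set()
    for i in range(len(ring.ring)):                 # at most one full lap of the ring
        _, node = ring.ring[(start + i) % len(ring.ring)]
        if node not in seen:                        # skip virtual nodes of machines
            seen.add(node)                          # already in the list
            result.append(node)
        if len(result) == n:
            break
    return result


print(preference_list(ring, "cart:alice", n=3))     # coordinator first, then replicas
```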

Techniques Used by Dynamo
Problem | Technique | Advantage
Partitioning | Consistent hashing | Incremental scalability
High availability for writes | Vector clocks with reconciliation during reads | Version size is decoupled from update rates
Handling temporary failures | Sloppy quorum and hinted handoff | Provides high availability and durability guarantees when some of the replicas are not available
Recovering from permanent failures | Anti-entropy using Merkle trees | Synchronizes divergent replicas in the background
Membership and failure detection | Gossip-based membership protocol and failure detection | Preserves symmetry and avoids a centralized registry for storing membership and node liveness information

Quorums
put():
- coordinator writes the new data locally
- sends it to the next N-1 nodes
- when W-1 respond, the write is considered successful
get():
- coordinator requests the data from the next N-1 nodes
- when R-1 respond, the read is considered successful
R + W > N
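
The same logic as a runnable sketch, with toy in-memory replicas standing in for remote nodes; quorum_put and quorum_get are invented names, and a real coordinator would issue the requests in parallel and handle timeouts:

```python
class Replica:
    """Toy in-memory replica; a real one would be a remote storage node."""
    def __init__(self):
        self.store = {}

    def write(self, key, value):
        self.store[key] = value
        return True              # a real node could time out or fail

    def read(self, key):
        return self.store.get(key)


N, W, R = 3, 2, 2                # R + W > N: read and write quorums always overlap
assert R + W > N


def quorum_put(replicas, key, value):
    acks = 0
    for node in replicas[:N]:    # coordinator plus the next N-1 nodes
        if node.write(key, value):
            acks += 1
        if acks >= W:
            return True          # enough acknowledgements: write succeeds
    return False                 # fewer than W replicas responded


def quorum_get(replicas, key):
    versions = []
    for node in replicas[:N]:
        v = node.read(key)
        if v is not None:
            versions.append(v)
        if len(versions) >= R:
            break                # R responses are enough
    return versions              # caller reconciles divergent versions


replicas = [Replica() for _ in range(N)]
quorum_put(replicas, "cart:alice", ["book"])
print(quorum_get(replicas, "cart:alice"))
```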

Data Versioning
A put() call may return successfully to the caller before the update has been applied at all the replicas
If the latest object version is unavailable, apply the put to an earlier version and merge later
A get() call may return many versions of the same object
An object may have distinct version sub-histories that need to be reconciled

Delayed Reconciliation
The goal: an "add to cart" operation should never be forgotten or rejected
When a customer wants to add an item to (or remove it from) a shopping cart and the latest version is not available, the item is added to (or removed from) the older version
The divergent versions are reconciled later

Vector Clocks in Dynamo
(Figure: version evolution. A write handled by node A produces V1([A,1]); a second write handled by A produces V2([A,2]); concurrent writes handled by B and C produce V3([A,2],[B,1]) and V4([A,2],[C,1]); the versions are reconciled and a write handled by A produces V5([A,3],[B,1],[C,1]).)
A vector clock is a list of (node, counter) pairs
Counters are logical integer timestamps assigned by the coordinator
Each object version has one vector clock
Used to determine whether one version is subsumed by another
Stored in the context passed to get() and put()

Vector Clocks in Dynamo
The size of a vector clock can grow arbitrarily long
- bounded by N in practice, since writes are usually handled by one of the top N nodes in the preference list
- drop the oldest (node, counter) pair when the size exceeds a certain threshold
Reconciliation
- easy if the values are causally ordered
- otherwise, applications will handle it
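
A minimal sketch of the version comparison, assuming a vector clock is represented as a dict from node name to counter; descends() and merge() are illustrative names:

```python
def descends(a: dict, b: dict) -> bool:
    """True if the version with clock `a` is causally after (or equal to) `b`."""
    return all(a.get(node, 0) >= count for node, count in b.items())


def merge(a: dict, b: dict) -> dict:
    """Clock of a reconciled version: element-wise max of both clocks."""
    return {n: max(a.get(n, 0), b.get(n, 0)) for n in a.keys() | b.keys()}


# The V3/V4 example from the figure: neither clock descends the other,
# so the versions are concurrent and must be reconciled by the application.
v3 = {"A": 2, "B": 1}
v4 = {"A": 2, "C": 1}
print(descends(v3, v4), descends(v4, v3))   # False False -> concurrent
print(merge(v3, v4))                        # {'A': 2, 'B': 1, 'C': 1}
# The coordinator (A) then bumps its own counter, yielding V5([A,3],[B,1],[C,1]).
```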

Reconciliation (Summary) An "add to cart" operation will never be lost, but removed items may reappear
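
A tiny illustration of why this happens when divergent carts are merged by set union (a simplifying assumption about the merge policy):

```python
# Two replicas diverge from the same cart; the union merge never loses an add,
# but an item removed in one branch survives in the other and reappears.
cart_v1 = {"book", "phone"}

branch_a = cart_v1 | {"laptop"}     # replica A: customer adds a laptop
branch_b = cart_v1 - {"phone"}      # replica B: customer removes the phone

reconciled = branch_a | branch_b    # no "add to cart" is ever lost...
print(reconciled)                   # ...but the removed phone comes back
```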

Reconciliation in Practice
In one 24-hour period, 99.94% of requests saw exactly one version
Divergent versions are usually caused not by failures but by concurrent writers (rarely human beings, usually automated client programs)

Techniques Used by Dynamo
Problem | Technique | Advantage
Partitioning | Consistent hashing | Incremental scalability
High availability for writes | Vector clocks with reconciliation during reads | Version size is decoupled from update rates
Handling temporary failures | Sloppy quorum and hinted handoff | Provides high availability and durability guarantees when some of the replicas are not available
Recovering from permanent failures | Anti-entropy using Merkle trees | Synchronizes divergent replicas in the background
Membership and failure detection | Gossip-based membership protocol and failure detection | Preserves symmetry and avoids a centralized registry for storing membership and node liveness information

Quasi-Quorums
get() and put() are driven by two parameters
R: minimum number of replicas to read from
W: minimum number of replicas to write to
If R + W > N, we have a quorum system! But latency is dictated by the slowest replica
What if we want to execute a put() but fewer than W-1 of the next N-1 nodes are available?

Sloppy Quorum and Hinted Handoff
Use the next N healthy nodes for reads and writes
Data is tagged with the node it should go to (which is notified later if it was temporarily unavailable)
Transfer the data to that node when it becomes available again
Always writeable!
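
A sketch of the idea, with invented Node, sloppy_put, and handoff names and none of the real failure detection:

```python
class Node:
    def __init__(self, name):
        self.name, self.up = name, True
        self.store = {}
        self.hints = []                          # (intended_owner, key, value)


def sloppy_put(preferred, standby, key, value):
    """Write to the preferred replicas; divert to a standby if one is down."""
    for node in preferred:
        if node.up:
            node.store[key] = value
        else:                                    # owner down: write to the standby
            standby.hints.append((node.name, key, value))
            standby.store[key] = value           # data stays durable and readable


def handoff(standby, nodes_by_name):
    """When an intended owner recovers, hand the hinted data back to it."""
    for owner, key, value in standby.hints:
        if nodes_by_name[owner].up:
            nodes_by_name[owner].store[key] = value
    standby.hints = [h for h in standby.hints if not nodes_by_name[h[0]].up]


a, b, d = Node("A"), Node("B"), Node("D")
b.up = False
sloppy_put([a, b], standby=d, key="cart:alice", value=["book"])
b.up = True
handoff(d, {"A": a, "B": b, "D": d})
print(b.store)                                   # {'cart:alice': ['book']}
```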

Techniques Used by Dynamo
Problem | Technique | Advantage
Partitioning | Consistent hashing | Incremental scalability
High availability for writes | Vector clocks with reconciliation during reads | Version size is decoupled from update rates
Handling temporary failures | Sloppy quorum and hinted handoff | Provides high availability and durability guarantees when some of the replicas are not available
Recovering from permanent failures | Anti-entropy using Merkle trees | Synchronizes divergent replicas in the background
Membership and failure detection | Gossip-based membership protocol and failure detection | Preserves symmetry and avoids a centralized registry for storing membership and node liveness information

Handling Permanent Failures
Replica synchronization:
- a node synchronizes with another node
- a Merkle tree is used to detect inconsistencies and minimize the data that needs to be transferred
- leaves are hashes of objects
- parents are hashes of their children
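
An illustrative sketch of the comparison, assuming a simple binary Merkle tree over the keys in sorted order (not Dynamo's exact per-key-range scheme): two replicas compare root hashes first and descend only into subtrees that differ, so little data moves when they mostly agree.

```python
import hashlib


def h(data: bytes) -> bytes:
    return hashlib.sha256(data).digest()


def merkle_tree(items: dict) -> list:
    """Return tree levels, leaves first; leaves hash key/value pairs in key order."""
    level = [h(f"{k}={v}".encode()) for k, v in sorted(items.items())] or [h(b"")]
    levels = [level]
    while len(level) > 1:
        level = [h(level[i] + level[min(i + 1, len(level) - 1)])
                 for i in range(0, len(level), 2)]    # parents hash their children
        levels.append(level)
    return levels


replica1 = {"k1": "a", "k2": "b", "k3": "c", "k4": "d"}
replica2 = {"k1": "a", "k2": "b", "k3": "X", "k4": "d"}   # one divergent key

t1, t2 = merkle_tree(replica1), merkle_tree(replica2)
print(t1[-1] == t2[-1])     # roots differ -> some subtree diverged
diff = [i for i, (x, y) in enumerate(zip(t1[0], t2[0])) if x != y]
print(diff)                 # only leaf 2 (key k3) needs to be exchanged
```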

Membership and Failure Detection
A gossip-based protocol propagates membership changes and maintains an eventually consistent view
Seeds are used to prevent a partitioned ring
Timeouts are used for failure detection
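
A toy sketch of gossip-style view exchange with timeout-based suspicion; it uses last-heard timestamps instead of Dynamo's actual protocol details, and all names are illustrative:

```python
import random
import time


class Member:
    """Toy gossip participant: view maps node name -> last time it was heard from."""
    def __init__(self, name):
        self.name = name
        self.view = {name: time.time()}

    def heartbeat(self):
        self.view[self.name] = time.time()            # refresh own entry

    def gossip_with(self, peer):
        # Exchange views; the fresher timestamp wins on both sides.
        merged = {n: max(self.view.get(n, 0), peer.view.get(n, 0))
                  for n in self.view.keys() | peer.view.keys()}
        self.view, peer.view = dict(merged), dict(merged)

    def suspected_failed(self, timeout_s=10.0):
        now = time.time()
        return [n for n, seen in self.view.items() if now - seen > timeout_s]


nodes = [Member(n) for n in "ABCD"]
for _ in range(5):                                    # a few random gossip rounds
    for m in nodes:
        m.heartbeat()
        m.gossip_with(random.choice(nodes))
print(sorted(nodes[0].view))    # with high probability, A has learned of every node
```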