Dynamo. Smruti R. Sarangi. Department of Computer Science, Indian Institute of Technology, New Delhi, India. Outline: Motivation, System Architecture, Evaluation.

Dynamo. Smruti R. Sarangi. Department of Computer Science, Indian Institute of Technology, New Delhi, India.

Outline
1. Motivation
2. System Architecture
3. Evaluation

Motivation
- Reliability is one of the biggest challenges for Amazon. Amazon aims at 99.999% reliability (five 9s), i.e., less than about 5 minutes of downtime per year.
- Lack of reliability translates directly into significant financial losses.
- The infrastructure consists of thousands of servers; servers and network components fail all the time.
- Customers need an always-on experience.

Dynamo
- A highly available key-value store that serves a diverse set of Amazon applications and services: best-seller lists, shopping carts, customer preferences, sales rank, product catalog.
- Served 3 million shopping checkouts in a single day.
- Manages session state for thousands of concurrently active sessions.
- Provides a simple key-value interface over the network.

Assumptions and Requirements
- Query model: simple key-value access.
- ACID properties: sacrifices strong consistency (the C in ACID) for availability; provides no isolation guarantees and permits only single-key updates.
- Latency requirements: 99.9% of all accesses must satisfy the SLA.
- SLA (Service Level Agreement): specifies a maximum latency at a given maximum client request rate.

System Diagram (figure): client requests flow to page-rendering servers, then through request-routing layers to aggregator services, and finally to the storage tier, which consists of Dynamo instances and other data stores.

Key Principles of the Design
- Incremental scalability: should be able to scale out one node at a time.
- Symmetry: every node should have the same responsibilities.
- Decentralization: peer-to-peer system.
- Heterogeneity: needs to be able to exploit the heterogeneous capabilities of servers.
- Eventual consistency:
  1. A get operation returns a set of versions (need not contain the latest value).
  2. A write ultimately succeeds.

Basic Operations
- get(key): returns all the versions of the item, together with a context (opaque version information).
- put(key, object, context): adds the object corresponding to the key to the database; the context returned by a preceding get tells the store which versions the write is based on.
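
To make the interface concrete, the following is a minimal Java sketch of this contract. The type and method names (KeyValueStore, VersionContext, GetResult) are illustrative assumptions, not Dynamo's actual API.

```java
import java.util.List;

// A minimal sketch of the client-facing contract (hypothetical names, not Dynamo's real API).
// get() may return several conflicting versions plus an opaque context; put() hands the
// context back so the store knows which versions the new write supersedes.
interface KeyValueStore {
    // All versions currently known for the key, together with the version context.
    record GetResult(List<byte[]> versions, VersionContext context) {}

    GetResult get(String key);

    void put(String key, byte[] object, VersionContext context);
}

// Opaque to clients; internally it would wrap a vector clock (see the later slides).
interface VersionContext {}
```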

Partitioning
- Uses consistent hashing (similar to Chord) to distribute keys over a circular key space; each item is assigned to its successor node on the ring.
- Uses the notion of virtual nodes for load balancing: a physical node is responsible for multiple virtual nodes.
- For fault tolerance, each key is assigned to N successor nodes (its "preference list"); one of these N nodes acts as the coordinator for the key.
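
As a rough illustration of consistent hashing with virtual nodes and a preference list, here is a small self-contained Java sketch. The class name, the toy hash function, and the token scheme are assumptions for illustration, not Dynamo's actual partitioning code.

```java
import java.util.*;

// A minimal sketch of consistent hashing with virtual nodes.
// Each physical node owns several points ("tokens") on a circular hash space;
// a key is stored on the first N distinct physical nodes found clockwise from its hash.
public class ConsistentHashRing {
    private final TreeMap<Integer, String> ring = new TreeMap<>(); // token -> physical node

    public void addNode(String node, int virtualNodes) {
        for (int i = 0; i < virtualNodes; i++) {
            ring.put(hash(node + "#" + i), node);
        }
    }

    // Preference list: the N distinct physical nodes that succeed the key on the ring.
    public List<String> preferenceList(String key, int n) {
        List<String> result = new ArrayList<>();
        if (ring.isEmpty()) return result;
        for (Map.Entry<Integer, String> e : iterateFrom(hash(key))) {
            if (!result.contains(e.getValue())) result.add(e.getValue());
            if (result.size() == n) break;
        }
        return result;
    }

    private Iterable<Map.Entry<Integer, String>> iterateFrom(int h) {
        List<Map.Entry<Integer, String>> entries = new ArrayList<>(ring.tailMap(h).entrySet());
        entries.addAll(ring.headMap(h).entrySet()); // wrap around the circle
        return entries;
    }

    private static int hash(String s) {
        return s.hashCode() & 0x7fffffff; // toy hash; a real system would use MD5/SHA-1
    }

    public static void main(String[] args) {
        ConsistentHashRing ring = new ConsistentHashRing();
        for (String node : List.of("A", "B", "C", "D")) ring.addNode(node, 8);
        System.out.println(ring.preferenceList("shopping-cart:42", 3));
    }
}
```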

Data Versioning
- A put call may return before the update has propagated to all replicas; after a failure, some replicas might receive the update only after a very long time.
- Some operations, such as "add to shopping cart" (a write), need to always complete: writes are prioritized.
- Each update is treated as a new and immutable version of the data (recall Percolator).
- Under failures and concurrent updates, version branching may occur; reconciliation must then be performed across the diverging versions:
  - at the server side (generic logic), or
  - at the client side (semantic merging).

Vector Clocks for Versioning
- A vector clock contains one entry for each server in the preference list.
- When a server coordinates an update to an object, it increments its own entry in the object's vector clock.
- If there are concurrent modifications, a get operation returns all conflicting versions.
- The context of a put indicates which version(s) it is based on; a put that follows such a get therefore acts as a merge (reconciliation) of the branches.
- Example on the next slide.

Vector Clock Example
- v1 ([A,1]): write handled by A
- v2 ([A,2]): second write handled by A
- v3 ([A,2],[B,1]): write handled by B (branch from v2)
- v4 ([A,2],[C,1]): concurrent write handled by C (branch from v2)
- v5 ([A,3],[B,1],[C,1]): the two branches reconciled by A
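
A small Java sketch of the vector-clock mechanics behind this example (class and method names are illustrative assumptions): each coordinator increments its own entry, concurrent clocks signal a branch, and reconciliation takes the element-wise maximum before the reconciling node increments itself.

```java
import java.util.*;

// A minimal sketch of a vector clock as used for data versioning (hypothetical class).
public class VectorClock {
    private final Map<String, Long> counters = new HashMap<>();

    public VectorClock increment(String node) {
        counters.merge(node, 1L, Long::sum);
        return this;
    }

    // True if this clock is causally after (or equal to) the other.
    public boolean descendsFrom(VectorClock other) {
        for (Map.Entry<String, Long> e : other.counters.entrySet()) {
            if (counters.getOrDefault(e.getKey(), 0L) < e.getValue()) return false;
        }
        return true;
    }

    public boolean concurrentWith(VectorClock other) {
        return !descendsFrom(other) && !other.descendsFrom(this);
    }

    // Reconciliation: take the element-wise maximum of the two clocks.
    public static VectorClock merge(VectorClock a, VectorClock b) {
        VectorClock m = new VectorClock();
        m.counters.putAll(a.counters);
        b.counters.forEach((node, c) -> m.counters.merge(node, c, Math::max));
        return m;
    }

    @Override public String toString() { return counters.toString(); }

    public static void main(String[] args) {
        // Reproduces the example on this slide.
        VectorClock v2 = new VectorClock().increment("A").increment("A");          // [A,2]
        VectorClock v3 = VectorClock.merge(v2, new VectorClock()).increment("B");  // [A,2][B,1]
        VectorClock v4 = VectorClock.merge(v2, new VectorClock()).increment("C");  // [A,2][C,1]
        System.out.println(v3.concurrentWith(v4));                                 // true: a branch
        VectorClock v5 = VectorClock.merge(v3, v4).increment("A");                 // [A,3][B,1][C,1]
        System.out.println(v5.descendsFrom(v3) && v5.descendsFrom(v4));            // true: reconciled
    }
}
```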

Execution of get() and put()
- A request can be sent to any node, which forwards it to the coordinator (as in Pastry), or the client can route directly to the key's successor.
- The coordinator ideally uses the preference list (or the top N healthy nodes).
- There is a read quorum of R nodes and a write quorum of W nodes, with R + W > N.
- For a put request, the coordinator merges the versions and sends the result to the write quorum.
- For a get request, the coordinator returns all the concurrent versions to the client.
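
The read path can be pictured with the following sketch, which issues the request to the N preference-list replicas and returns once R of them have answered. The N, R, W values and all names here are illustrative assumptions, not Dynamo's coordinator code.

```java
import java.util.List;
import java.util.concurrent.*;

// A minimal sketch of quorum-based read coordination (illustrative only).
public class QuorumCoordinator {
    static final int N = 3, R = 2, W = 2;   // a typical configuration with R + W > N

    interface Replica { String readVersion(String key); } // stand-in for an RPC stub

    private final ExecutorService pool = Executors.newCachedThreadPool();

    // Issue the read to all N replicas in the preference list; return as soon as R answer.
    public List<String> quorumGet(List<Replica> preferenceList, String key)
            throws InterruptedException, ExecutionException {
        CompletionService<String> cs = new ExecutorCompletionService<>(pool);
        for (Replica r : preferenceList.subList(0, N)) {
            cs.submit(() -> r.readVersion(key));
        }
        List<String> answers = new java.util.ArrayList<>();
        while (answers.size() < R) {
            answers.add(cs.take().get());    // block until the next replica responds
        }
        return answers;                      // caller reconciles any divergent versions
    }
}
```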

Sloppy Quorum
- Uses the first N healthy nodes (typically the preference list) rather than a strict quorum.
- If an update cannot be delivered to node A, it is sent to another node D along with a hint naming A.
- Once A recovers, D transfers the hinted object back to A.
- For added reliability, the quorum spans multiple data centers.
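
A toy sketch of the hinted-handoff bookkeeping on the substitute node D (all names hypothetical): writes intended for an unreachable node are stored with a hint and handed back when that node recovers.

```java
import java.util.*;

// A minimal sketch of hinted handoff (illustrative, not Dynamo's implementation).
public class HintedHandoff {
    record HintedWrite(String intendedNode, String key, byte[] value) {}

    // hints kept on this node, grouped by the node the write was originally meant for
    private final Map<String, List<HintedWrite>> hintsByNode = new HashMap<>();

    public void storeWithHint(String intendedNode, String key, byte[] value) {
        hintsByNode.computeIfAbsent(intendedNode, n -> new ArrayList<>())
                   .add(new HintedWrite(intendedNode, key, value));
    }

    // Called when the intended node becomes reachable again: hand its writes back.
    public List<HintedWrite> drainHintsFor(String recoveredNode) {
        List<HintedWrite> hints = hintsByNode.remove(recoveredNode);
        return hints == null ? List.of() : hints;
    }
}
```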

Synchronization across Replicas
- Nodes maintain Merkle trees, in which each parent is the hash of its children.
- A separate Merkle tree covers the set of keys of each virtual node hosted on a physical node.
- Nodes regularly exchange Merkle trees through an anti-entropy algorithm.
- Drawback: the trees often need to be recalculated (e.g., when nodes join or leave and key ranges change).
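
The root-hash comparison can be sketched as follows (illustrative Java, not Dynamo's code): if the root hashes of two replicas agree, the whole key range is in sync and no data needs to be exchanged; otherwise the replicas walk down the trees to locate the differing keys.

```java
import java.security.MessageDigest;
import java.util.*;

// A minimal sketch of Merkle-tree based anti-entropy between two replicas.
// Each leaf hashes one key's value; each parent hashes its children.
public class MerkleSync {
    static byte[] sha1(byte[]... parts) throws Exception {
        MessageDigest md = MessageDigest.getInstance("SHA-1");
        for (byte[] p : parts) md.update(p);
        return md.digest();
    }

    // Build the tree bottom-up over a sorted key range and return the root hash.
    static byte[] rootHash(SortedMap<String, String> range) throws Exception {
        List<byte[]> level = new ArrayList<>();
        for (Map.Entry<String, String> e : range.entrySet()) {
            level.add(sha1(e.getKey().getBytes(), e.getValue().getBytes()));
        }
        while (level.size() > 1) {
            List<byte[]> next = new ArrayList<>();
            for (int i = 0; i < level.size(); i += 2) {
                byte[] right = (i + 1 < level.size()) ? level.get(i + 1) : new byte[0];
                next.add(sha1(level.get(i), right));
            }
            level = next;
        }
        return level.isEmpty() ? new byte[0] : level.get(0);
    }

    public static void main(String[] args) throws Exception {
        SortedMap<String, String> a = new TreeMap<>(Map.of("k1", "x", "k2", "y"));
        SortedMap<String, String> b = new TreeMap<>(Map.of("k1", "x", "k2", "stale"));
        // Unequal roots tell the replicas they must descend the trees to find the diff.
        System.out.println(Arrays.equals(rootHash(a), rootHash(b)));   // false
    }
}
```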

Maintaining Membership
- Dynamo maintains membership information through explicit join and leave requests; ring membership changes are infrequent.
- Additionally, a gossip-based protocol propagates ring-membership information between randomly chosen nodes.
- For one-hop routing, nodes maintain large routing tables; all routing, membership, and placement information propagates through anti-entropy, gossip-based protocols.
- To prevent logical partitions of the ring, some nodes act as seeds and synchronize membership information across peers.
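
A minimal sketch of one anti-entropy gossip exchange of membership information (the representation is a hypothetical one: each node id maps to the version of the latest join/leave event heard about it); two nodes that gossip converge to the element-wise newest view.

```java
import java.util.*;

// A toy sketch of gossip-based membership propagation (illustrative only).
public class MembershipGossip {
    // node id -> version of the latest join/leave event we have heard about for it
    private final Map<String, Long> view = new TreeMap<>();

    public MembershipGossip(Map<String, Long> initialView) { view.putAll(initialView); }

    public Map<String, Long> view() { return Collections.unmodifiableMap(view); }

    // One gossip round: both sides keep the higher version for every node they know about.
    public void exchangeWith(MembershipGossip peer) {
        peer.view.forEach((node, version) -> view.merge(node, version, Math::max));
        view.forEach((node, version) -> peer.view.merge(node, version, Math::max));
    }

    public static void main(String[] args) {
        MembershipGossip a = new MembershipGossip(Map.of("A", 3L, "B", 1L));
        MembershipGossip b = new MembershipGossip(Map.of("B", 2L, "C", 1L));
        a.exchangeWith(b);
        System.out.println(a.view());   // {A=3, B=2, C=1}
        System.out.println(b.view());   // identical: the two views have converged
    }
}
```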

Load Balancing and Failure Detection
- Failure detection is also done with gossip-style protocols.
- Allocation and de-allocation of key ranges (when nodes join or leave) happen in the same manner as in Chord.

Implementation: three different types of storage engines are supported: an in-memory buffer with a persistent backing store, Berkeley DB, and MySQL. Request coordination uses communication through Java NIO channels.

Read-Write Response Time (evaluation figure).

BDB vs. Buffered Writes (evaluation figure).

Discussion: Reconciliation Methods
- Business-logic-based reconciliation: used for the shopping cart.
- Timestamp-based reconciliation (last write wins): used for customer session management.
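
The two strategies can be illustrated with this small Java sketch (hypothetical names; merging carts as a set union follows the shopping-cart example, where an "add to cart" is never lost but a deleted item may occasionally resurface).

```java
import java.util.*;

// A minimal sketch of the two reconciliation styles mentioned above (illustrative only).
public class Reconciliation {
    // Business-logic reconciliation: union of the items in all divergent carts.
    static Set<String> mergeCarts(Collection<Set<String>> divergentCarts) {
        Set<String> merged = new TreeSet<>();
        for (Set<String> cart : divergentCarts) merged.addAll(cart);
        return merged;
    }

    record TimestampedValue(long timestampMillis, String value) {}

    // Last write wins: keep whichever version carries the later timestamp.
    static TimestampedValue lastWriteWins(List<TimestampedValue> versions) {
        return versions.stream()
                .max(Comparator.comparingLong(TimestampedValue::timestampMillis))
                .orElseThrow();
    }

    public static void main(String[] args) {
        System.out.println(mergeCarts(List.of(Set.of("book", "pen"), Set.of("book", "lamp"))));
        System.out.println(lastWriteWins(List.of(
                new TimestampedValue(100, "session-v1"),
                new TimestampedValue(200, "session-v2"))));
    }
}
```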

Reference: G. DeCandia et al., "Dynamo: Amazon's Highly Available Key-value Store," SOSP 2007.