CAP Theorem, BASE & DynamoDB

Indian Institute of Science, Bangalore, India (भारतीय विज्ञान संस्थान, बेंगलूरु, भारत)
DS256:Jan18 (3:1), Department of Computational and Data Sciences
CAP Theorem, BASE & DynamoDB
Yogesh Simmhan
© Yogesh Simmhan & Partha Talukdar, 2016
This work is licensed under a Creative Commons Attribution 4.0 International License. Copyright for external content used with attribution is retained by their original authors.

Dynamo: Amazon's Highly Available Key-value Store. DeCandia G, Hastorun D, Jampani M, Kakulapati G, Lakshman A, Pilchin A, Sivasubramanian S, Vosshall P, Vogels W. ACM Symposium on Operating Systems Principles (SOSP), 2007.

Distributed Hashtable
- Primary-key only access for reads and writes; the value is a blob, < 1 MB
- Used for shopping carts, best-seller lists, user preferences
- High availability even when failures are a given: disks, network, entire data centers
- Available across data centers; e.g., the shopping cart service handled ~10M requests and 3M checkouts per day (2007)
- Why not an RDBMS? Costly hardware and skilled DBAs; replication technologies are limited and typically choose consistency over availability; limited load balancing

- Stateless workflows, with caching
- Build a system where all customers have a good experience, rather than just the majority (or the average)
- SLAs are expressed and measured at the 99.9th percentile of the latency distribution, chosen based on a cost-benefit analysis

Amazon's Dynamo DB
- Highly available: even the slightest outage has significant financial consequences
- Service Level Agreements: e.g., guarantee a response within 300 ms for 99.9% of requests at a peak load of 500 requests/sec
- vs. ACID: weak consistency, no isolation (only single-key operations at a time)
- Non-hostile environment: security is not a concern
Dynamo: Amazon's Highly Available Key-value Store, Giuseppe DeCandia et al., SOSP, 2007
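To make the 99.9th-percentile SLA concrete, here is a minimal sketch (not from the slides or the paper) of measuring tail latency over a window of request times using the nearest-rank method; the 300 ms and 99.9% figures follow the example SLA above, and the simulated latency distribution is invented for illustration.

```python
import math
import random

def percentile(samples, p):
    """p-th percentile (0 < p <= 100) using the nearest-rank method."""
    ordered = sorted(samples)
    rank = math.ceil(p / 100.0 * len(ordered))
    return ordered[rank - 1]

# 100,000 fast requests plus a 0.3% slow tail (all latencies in milliseconds).
random.seed(7)
latencies_ms = [random.gauss(120, 30) for _ in range(100_000)] + \
               [random.gauss(450, 80) for _ in range(300)]

p999 = percentile(latencies_ms, 99.9)
mean = sum(latencies_ms) / len(latencies_ms)
print(f"mean = {mean:.1f} ms, p99.9 = {p999:.1f} ms, 300 ms SLA met: {p999 <= 300}")
```

The mean over these samples stays near 120 ms while the 99.9th percentile violates the 300 ms target, which is exactly why the slides insist on measuring at the tail rather than on averages.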

Design Principles
- Optimistic replication: changes propagate to replicas in the background; server and network failures, and concurrent, disconnected work are tolerated
- When to resolve update conflicts: Dynamo targets the design space of an "always writeable" data store, since rejecting customer updates could result in a poor customer experience; the complexity of conflict resolution is pushed to reads
- Who resolves conflicts: either the data store or the application; the application is aware of the data schema and can select the best conflict resolution method

Design Principles
- Incremental scalability (weak scaling)
- Symmetry of nodes' responsibilities
- Decentralization (P2P)
- Heterogeneity of nodes' resources

Design Choices
- Sacrifice strong consistency for availability: no updates are rejected, and conflict resolution is executed during reads instead of writes, i.e. the store is "always writeable"
- Incremental scalability & decentralization
- Symmetry of responsibility
- Heterogeneity in capacity
- All nodes are trusted

Techniques

Problem | Technique | Advantage
Partitioning | Consistent hashing | Incremental scalability
High availability for writes | Vector clocks with reconciliation during reads | Version size is decoupled from update rates
Handling temporary failures | Sloppy quorum and hinted handoff | Provides high availability and durability guarantee when some of the replicas are not available
Recovering from permanent failures | Anti-entropy using Merkle trees | Synchronizes divergent replicas in the background
Membership and failure detection | Gossip-based membership protocol and failure detection | Preserves symmetry and avoids having a centralized registry for storing membership and node liveness information

Partitioning
- Consistent hashing: the output range of the hash function on the key is treated as a fixed circular ring
- A virtual node is responsible for a range of hash values (tokens)
- The hash value of a key maps to a virtual node
- Each physical node is responsible for multiple virtual nodes
- Allows nodes to arrive and leave without having to remap the keys held by other virtual nodes

Partitioning and Placement of Keys
- Divide the hash space into Q equally sized partitions (each partition is a virtual node, or token)
- Each physical node is assigned Q/S tokens, where S is the number of physical nodes in the system
- Can also assign a variable number of tokens to a physical node based on machine size, to adapt to the capacity of physical nodes
- Physical nodes can be added or removed incrementally
- When a node leaves the system, its tokens (virtual nodes) are randomly and uniformly distributed to the remaining nodes to preserve load balance
- When a node joins the system, it uniformly "steals" tokens from the existing nodes to preserve load balance
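As an illustration of these two slides, here is a minimal sketch (not Dynamo's actual code) of a ring divided into Q equally sized token ranges assigned round-robin to S physical nodes; the names HashRing, token_of and lookup, and the use of MD5, are assumptions made for the example.

```python
import hashlib

RING_SIZE = 2 ** 128      # output range of the (MD5) hash function, treated as a fixed ring

def key_hash(key: str) -> int:
    return int(hashlib.md5(key.encode()).hexdigest(), 16)

class HashRing:
    """Toy ring: Q equally sized partitions (tokens) assigned round-robin to S physical nodes."""
    def __init__(self, nodes, q=16):
        self.q = q
        self.width = RING_SIZE // q
        # token i is owned by nodes[i % S], so each physical node holds roughly Q/S tokens
        self.token_owner = [nodes[i % len(nodes)] for i in range(q)]

    def token_of(self, key: str) -> int:
        """Index of the equally sized partition (virtual node) the key's hash falls into."""
        return min(key_hash(key) // self.width, self.q - 1)

    def lookup(self, key: str) -> str:
        """Physical node hosting the virtual node responsible for this key."""
        return self.token_owner[self.token_of(key)]

ring = HashRing(["A", "B", "C", "D"], q=16)
print(ring.token_of("cart:alice"), ring.lookup("cart:alice"))
```

Because key hashes map to tokens rather than directly to machines, adding or removing a physical node only changes the ownership of some tokens; the keys inside each token range do not need to be rehashed.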

Replication
- Each data item is replicated at N hosts
- Preference list: the list of nodes responsible for storing a particular key
- The coordinator node (from hashing) stores the first copy; the next copies are stored on the subsequent virtual nodes on the ring, skipping virtual nodes that lie on the same physical node
- Example: with nodes A, B, C, D on the ring, D stores the key ranges (A, B], (B, C], (C, D]
Gossip protocol
- Propagates membership changes among nodes: each node contacts a peer at random every second and the two nodes reconcile their membership change histories
- Gives an eventually consistent view of membership and of the mapping from tokens to nodes
- Seeds make propagation rapid and avoid logical partitioning: all peers know of the seeds
- Permanent node additions and removals are done centrally and notified to peers; so if a peer cannot reach another, it is treated as a transient failure
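The preference-list construction described above can be sketched as a clockwise walk over the token owners, skipping virtual nodes that map to an already-chosen physical node. This is illustrative only; the helper name preference_list and the toy token layout are assumptions.

```python
def preference_list(token_owner, start_token, n=3):
    """Walk the ring clockwise from the key's token, collecting the first N distinct
    physical nodes; virtual nodes on an already-chosen physical node are skipped."""
    q = len(token_owner)
    nodes, i = [], 0
    while len(nodes) < n and i < q:
        owner = token_owner[(start_token + i) % q]
        if owner not in nodes:
            nodes.append(owner)
        i += 1
    return nodes            # nodes[0] acts as the coordinator for the key

# 8 tokens placed round-robin on 4 physical nodes; suppose the key hashes to token 5
print(preference_list(["A", "B", "C", "D"] * 2, start_token=5, n=3))   # -> ['B', 'C', 'D']
```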

Key-Value Operations
- Adds and updates both use the put(key, value) operation; get(key) returns the value
- Any node may receive a request; it is forwarded to the coordinator node for the response
- put() and get() are sent to all N healthy replicas, but put() may return to its client before the update is applied at all replicas
- This may leave replicas in an inconsistent state; get() may return multiple versions of the same object

Sloppy Quorum
- Writes succeed if W replicas out of N can be updated (W < N): the coordinator forwards the request to all N replicas and returns once W respond; the vector clock generated at the coordinator is forwarded with the write
- Reads return up to R replica values (R < N): the coordinator sends the request to all N replicas and returns once R respond
- The coordinator returns causally unrelated copies; clients need to decide how to use these copies
- Read and write latency is dictated by the slowest of the R or W replicas that must respond
- Set R + W > N
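A minimal, single-process sketch of the quorum read/write paths described above, with N = 3, W = 2, R = 2 (so R + W > N). Replicas are plain dictionaries and requests are issued sequentially, which a real coordinator would of course do concurrently; all names here are made up for the example.

```python
# N = 3 replicas, W = 2 write quorum, R = 2 read quorum; R + W > N so quorums overlap.
N, W, R = 3, 2, 2
replicas = [dict() for _ in range(N)]     # each dict stands in for one replica's store

def put(key, versioned_value):
    """Send the write to all N replicas; report success once W have acknowledged."""
    acks = 0
    for store in replicas:                # a real coordinator issues these concurrently
        store[key] = versioned_value
        acks += 1
        if acks >= W:
            return True
    return False

def get(key):
    """Query the replicas and return once R have responded; the result may hold
    several causally unrelated versions for the client to reconcile."""
    responses = []
    for store in replicas:
        if key in store:
            responses.append(store[key])
        if len(responses) >= R:
            break
    return responses

put("cart:alice", ("milk,eggs", {"Sx": 1}))    # value tagged with its vector clock
print(get("cart:alice"))
```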

Data Versioning & Consistency
- put() is treated as appending the updated value: an immutable append to a particular version of the object
- Multiple versions can coexist, but the system will not always be able to resolve them internally
- Challenge: distinct version sub-histories need to be reconciled
- Solution: use vector clocks to capture causality between different versions of the same object

Consistency with Vector Clocks
- A vector clock is a list of (node, counter) pairs; every version of every object is associated with one vector clock
- If the counters on the first object's clock are <= the counters for the same nodes in the second object's clock, then the first is an ancestor of the second and can be forgotten, i.e. the first object happened before the second
- If get() sees multiple replica versions, it returns only the causally unrelated ones, i.e. causally ordered versions are removed and only concurrent versions are returned for reconciliation
- The client writes the reconciled version back, e.g. node Sx resolves versions D3 and D4 into D5
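The ancestor/concurrent test on vector clocks can be sketched directly from the rule on this slide: one clock descends from another only if every counter is less than or equal to the corresponding counter in the other. The D1..D5 values mirror the Sx/Sy/Sz example referenced above; the function names are made up.

```python
def descends(a: dict, b: dict) -> bool:
    """True if b dominates a, i.e. version a is an ancestor of b (a happened before b)."""
    return a != b and all(counter <= b.get(node, 0) for node, counter in a.items())

def concurrent(a: dict, b: dict) -> bool:
    """Neither clock dominates the other: the versions are causally unrelated."""
    return a != b and not descends(a, b) and not descends(b, a)

d1 = {"Sx": 1}                     # first write, handled by Sx
d2 = {"Sx": 2}                     # overwrite at Sx: d1 is an ancestor and can be forgotten
d3 = {"Sx": 2, "Sy": 1}            # a later write handled by Sy, based on d2
d4 = {"Sx": 2, "Sz": 1}            # a concurrent write handled by Sz, also based on d2

print(descends(d1, d2))            # True  -> d1 happened before d2
print(concurrent(d3, d4))          # True  -> get() returns both for the client to reconcile
d5 = {"Sx": 3, "Sy": 1, "Sz": 1}   # client merges d3 and d4; coordinator Sx bumps its counter
```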

Hinted Handoff
- Handles transient failures
- Assume N = 3. When A is temporarily down or unreachable during a write, the replica is sent to D instead
- D is hinted through metadata that the replica belongs to A (which was down); D keeps the write in a separate local database and delivers it to A once A recovers
- Again: always writeable
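A minimal sketch of the hinted-handoff flow for the N = 3 example above: a write meant for A lands on D (the first healthy node past the preference list) together with a hint, and is replayed to A on recovery. Node names follow the slide; the data structures and helper names are illustrative assumptions.

```python
from collections import defaultdict

ring_order = ["A", "B", "C", "D"]             # clockwise order of physical nodes
down = {"A"}                                   # A is temporarily unreachable
stores = {node: {} for node in ring_order}     # primary storage per node
hints = defaultdict(list)                      # holder -> [(intended_owner, key, value)]

def write_replica(owner, key, value, preference_list):
    """Store the replica at its owner, or at the next healthy node past the
    preference list with a hint recording who it really belongs to."""
    if owner not in down:
        stores[owner][key] = value
        return
    start = ring_order.index(owner)
    for i in range(1, len(ring_order)):
        fallback = ring_order[(start + i) % len(ring_order)]
        if fallback not in down and fallback not in preference_list:
            hints[fallback].append((owner, key, value))   # kept in a separate local DB
            return

def handoff(holder):
    """Replay hinted writes to owners that have come back up."""
    pending = []
    for owner, key, value in hints[holder]:
        if owner in down:
            pending.append((owner, key, value))
        else:
            stores[owner][key] = value
    hints[holder] = pending

write_replica("A", "cart:alice", "v2", preference_list=["A", "B", "C"])  # lands on D with a hint
down.discard("A")                                                        # A recovers
handoff("D")
print(stores["A"])                                                       # {'cart:alice': 'v2'}
```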

Replica Synchronization
- Handles permanent failures: resolve replica inconsistencies faster while reducing data transfer
- Merkle tree: a hash tree whose leaves are hashes of the values of individual keys; parent nodes higher in the tree are hashes of their respective children
- Advantage: each branch of the tree can be checked independently, without requiring nodes to download the entire tree; this reduces the amount of data that needs to be transferred while checking for inconsistencies among replicas

Replica Synchronization
- Anti-entropy schemes are used when hinted handoff does not work and replicas hold different subsets of writes
- Each node builds a Merkle tree for the key ranges it shares with the other replicas (N trees if there are N replicas of a range)
- To compare: check the roots, then the children of any subtrees that don't match, and so on; this minimizes data transfer
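A toy sketch of Merkle-tree comparison over a shared key range: leaves hash individual key-value pairs, parents hash their children, and only subtrees whose hashes differ are descended into, so matching data is never transferred. The key names and the choice of SHA-1 are assumptions for the example.

```python
import hashlib

def h(data: bytes) -> bytes:
    return hashlib.sha1(data).digest()

def build_tree(kv: dict, keys: list):
    """Nested (hash, left, right, keys) tuples over a fixed, sorted key range;
    leaves hash single key-value pairs, parents hash their children."""
    if len(keys) == 1:
        k = keys[0]
        return (h(f"{k}={kv.get(k)}".encode()), None, None, keys)
    mid = len(keys) // 2
    left, right = build_tree(kv, keys[:mid]), build_tree(kv, keys[mid:])
    return (h(left[0] + right[0]), left, right, keys)

def diff(a, b):
    """Keys that are out of sync; identical subtrees are never descended into."""
    if a[0] == b[0]:
        return []                 # hashes match -> whole branch is in sync, skip it
    if a[1] is None:
        return a[3]               # mismatching leaf -> this key must be transferred
    return diff(a[1], b[1]) + diff(a[2], b[2])

key_range = sorted(["k1", "k2", "k3", "k4"])
replica1 = {"k1": "a", "k2": "b", "k3": "c", "k4": "d"}
replica2 = {"k1": "a", "k2": "b", "k3": "STALE", "k4": "d"}
print(diff(build_tree(replica1, key_range), build_tree(replica2, key_range)))  # ['k3']
```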