FAQs Snapshots and locks Vector Clock

Transcription:

CS435 Introduction to Big Data, FALL 2018
FAQs, Snapshots and Locks, Vector Clock
LARGE SCALE DATA STORAGE SYSTEMS: NOSQL DATA STORAGE
Computer Science, Colorado State University
http://www.cs.colostate.edu/~cs435

Example: Designing locks for read
Requirement: While I am reading /home/cs_bookclub/killing_a_mockingbird, others can read the same book.
- The read operation obtains read locks on the directories along the path (/home and /home/cs_bookclub) and a read lock on the file itself.
- Any number of read operations can access /home/cs_bookclub/killing_a_mockingbird concurrently.
- If a user tries to append text to /home/cs_bookclub/killing_a_mockingbird:
  - Read locks on /home and /home/cs_bookclub will be required (allowed).
  - A write lock on /home/cs_bookclub/killing_a_mockingbird will also be required (not allowed while readers hold read locks on the file).
  - Therefore, the write/append operation must wait until all of the read locks are released.

Example: Designing locks for write
Requirement: While I am writing /home/cs_bookclub/the_black_cat, others cannot write a file with the same name.
- The write operation obtains read locks on /home and /home/cs_bookclub, and a write lock on /home/cs_bookclub/the_black_cat.
- No read operation on /home/cs_bookclub/the_black_cat is allowed.
- No other write operation on /home/cs_bookclub/the_black_cat is allowed.
- Other files under /home/cs_bookclub can still be read.

Example: Designing locks for a directory snapshot
Requirement: While I am snapshotting /home/cs_bookclub, there should be no change in this directory.
- Read locks on /home and /home/cs_bookclub? This will not work: new files can still be added under the directory, and existing files can still be modified.
- A write lock on /home/cs_bookclub will prevent any modification under this directory.

Example: Designing locks for a file snapshot
Requirement: While I am snapshotting /home/cs_bookclub/robin_hood, there should be no change in this file.
- Read locks on? Write lock on? (See the locking sketch below; the answers follow in the next section.)
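The locking rules above can be sketched in a few lines of code. The following is a minimal, hypothetical Python sketch (the NamespaceLockManager class and its method names are my own, not part of GFS or the course material): each operation takes read locks on every ancestor directory of its target path and then a read or write lock on the path itself, and conflicting requests are refused.

# Minimal sketch (assumed names, not GFS code): per-path read/write locks,
# where an operation takes read locks on all ancestor directories and a
# read or write lock on its target path.
from collections import defaultdict

class NamespaceLockManager:
    def __init__(self):
        self.readers = defaultdict(int)   # path -> number of read locks held
        self.writers = defaultdict(int)   # path -> number of write locks held

    @staticmethod
    def ancestors(path):
        parts = path.strip("/").split("/")
        return ["/" + "/".join(parts[:i]) for i in range(1, len(parts))]

    def can_acquire(self, path, mode):
        """Check whether read locks on the ancestors plus a `mode` lock
        ('read' or 'write') on `path` would be granted right now."""
        for d in self.ancestors(path):
            if self.writers[d]:               # a write lock on an ancestor blocks us
                return False
        if self.writers[path]:
            return False
        if mode == "write" and self.readers[path]:
            return False
        return True

    def acquire(self, path, mode):
        if not self.can_acquire(path, mode):
            return False
        for d in self.ancestors(path):
            self.readers[d] += 1
        if mode == "read":
            self.readers[path] += 1
        else:
            self.writers[path] += 1
        return True

# Two concurrent readers are fine; an appender needing a write lock must wait.
locks = NamespaceLockManager()
assert locks.acquire("/home/cs_bookclub/killing_a_mockingbird", "read")
assert locks.acquire("/home/cs_bookclub/killing_a_mockingbird", "read")
assert not locks.acquire("/home/cs_bookclub/killing_a_mockingbird", "write")

Under this sketch, a write lock on /home/cs_bookclub blocks anything that needs a read lock on that directory, which is why it freezes the directory for the snapshot case above.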

Example: Designing locks for snapshot -- Answers
Requirement: While I am snapshotting /home/cs_bookclub/robin_hood, there is no change in this file.
- Read locks on /home and /home/cs_bookclub (preventing deletion of these directories).
- Write lock on /home/cs_bookclub/robin_hood: this will prevent any modification of this file. A read lock on the file will also work for preventing possible modification.

Today's topics
- NoSQL storage: Dynamo

NoSQL Storage: Key-Value Stores (Dynamo)

Definition of the vector clocks (a short comparison sketch appears at the end of this section)
- VC(x) denotes the vector clock of event x.
- VC(x)_z denotes the component of that clock for process z.
- VC(x) < VC(y)  ⇔  ∀z: VC(x)_z ≤ VC(y)_z  and  ∃z': VC(x)_z' < VC(y)_z'
- x → y denotes that event x happens before event y.
- If x → y, then VC(x) < VC(y).

Execution of get and put operations
- Users can send the operations to any storage node in Dynamo.
- Each entry of an object's vector clock is a (node, counter) pair, where the node is the one that handled the update.
- Example evolution of an object D (the figure from the Dynamo paper):
  - D1 ([Sx,1]): write handled by Sx.
  - D2 ([Sx,2]): another write handled by Sx, which then updates the replicas.
  - D3 ([Sx,2],[Sy,1]): write handled by Sy, which fails before it can update the replicas.
  - D4 ([Sx,2],[Sz,1]): a concurrent write handled by Sz, based on D2 because some replicas might still hold only ([Sx,2]).
  - D5 ([Sx,3],[Sy,1],[Sz,1]): D3 and D4 reconciled and written by Sx.

Coordinator
- A node handling a read or write operation; one of the top N nodes in the preference list.
- A client can select a coordinator by:
  - Routing its request through a generic load balancer, or
  - Using a partition-aware client library to directly access the coordinators.
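To make the happens-before test above concrete, here is a small Python sketch (the function names are illustrative, not Dynamo's API) that compares vector clocks represented as dicts mapping a node id to its counter, and classifies D3 and D4 from the example as concurrent.

# Sketch of vector clock comparison (illustrative names, not Dynamo code).
# A clock is a dict {node_id: counter}; missing entries count as 0.

def happens_before(a, b):
    """VC(x) < VC(y): every component <=, and at least one strictly <."""
    keys = set(a) | set(b)
    return (all(a.get(z, 0) <= b.get(z, 0) for z in keys)
            and any(a.get(z, 0) < b.get(z, 0) for z in keys))

def concurrent(a, b):
    return not happens_before(a, b) and not happens_before(b, a)

D2 = {"Sx": 2}
D3 = {"Sx": 2, "Sy": 1}
D4 = {"Sx": 2, "Sz": 1}
D5 = {"Sx": 3, "Sy": 1, "Sz": 1}

print(happens_before(D2, D3))                              # True: D3 supersedes D2
print(concurrent(D3, D4))                                  # True: causally unrelated, both kept
print(happens_before(D3, D5) and happens_before(D4, D5))   # True: D5 reconciles both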

Using a quorum-like system (a coordinator sketch using these parameters follows this section)
- R: the minimum number of nodes that must participate in a successful read operation.
- W: the minimum number of nodes that must participate in a successful write operation.
- Setting R and W for a given replication factor N: R + W > N (and W > N/2).
- The latency of a get (or put) operation is dictated by the slowest of the R (or W) replicas.
- R and W are usually configured to be less than N.

put request
The coordinator node:
1. Generates the vector clock for the new version.
2. Writes the new version locally.
3. Sends the new version, along with the new vector clock, to the N highest-ranked reachable nodes; the write succeeds once the write quorum W is met.

get request
- The coordinator requests all existing versions of the data for that key from the N highest-ranked reachable nodes in the preference list.
- It waits for R responses; here, R is the read quorum.
- If multiple versions of the data are collected, it returns all the versions it deems to be causally unrelated (for example, D3 ([Sx,2],[Sy,1]) handled by Sy and D4 ([Sx,2],[Sz,1]) handled by Sz).
- The reconciled version superseding the current versions (D5 ([Sx,3],[Sy,1],[Sz,1])) is then written back.

What if W is 1?
- Applications that need the highest level of availability can set W to 1.
- Under Amazon's model, a write is accepted as long as a single node in the system has durably written the key to its local store.
- A write request is rejected only if all nodes in the system are unavailable.

NoSQL Storage: Key-Value Stores (Dynamo)
(1) Partitioning
(2) High availability for writes
(3) Handling temporary failures
(4) Recovering from permanent failures
(5) Membership and failure detection

Sloppy quorum
- All read and write operations are performed on the first N healthy nodes from the preference list, which may not always be the first N nodes on the hashing ring.

Hinted handoff (see the sketch below)
- If a node is temporarily unavailable, its data is propagated to the next node in the ring.
- The data's metadata contains information about the originally intended node.
- These hinted replicas are stored in a separate local database and scanned periodically.
- Upon detecting that the original node has recovered, a data delivery attempt is made.
- Once the transfer succeeds, the data at the temporary node is removed.
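As a rough illustration of the R/W/N rules and the put/get flows above, the following Python sketch (the Replica and Coordinator classes and their methods are invented for this example, not Dynamo's real interfaces) writes to the N highest-ranked reachable replicas, declares the put successful once W of them acknowledge, and stops a get after R responses.

# Illustrative quorum coordinator sketch (invented classes, not Dynamo code).
# Replicas are plain in-memory dicts; "unreachable" replicas raise an error.

class Replica:
    def __init__(self, name, up=True):
        self.name, self.up, self.store = name, up, {}

    def write(self, key, value, clock):
        if not self.up:
            raise ConnectionError(self.name)
        self.store[key] = (value, clock)

    def read(self, key):
        if not self.up:
            raise ConnectionError(self.name)
        return self.store.get(key)

class Coordinator:
    def __init__(self, replicas, N, R, W):
        assert R + W > N, "quorum rule from the slide: R + W > N"
        self.replicas, self.N, self.R, self.W = replicas, N, R, W

    def put(self, key, value, clock):
        acks = 0
        for r in self.replicas[: self.N]:       # N highest-ranked reachable nodes
            try:
                r.write(key, value, clock)
                acks += 1
            except ConnectionError:
                pass
        return acks >= self.W                   # success once the write quorum is met

    def get(self, key):
        responses = []
        for r in self.replicas[: self.N]:
            try:
                v = r.read(key)
                if v is not None:
                    responses.append(v)
                if len(responses) >= self.R:    # wait only for the read quorum
                    break
            except ConnectionError:
                pass
        return responses                        # caller reconciles causally unrelated versions

replicas = [Replica("Sx"), Replica("Sy", up=False), Replica("Sz")]
coord = Coordinator(replicas, N=3, R=2, W=2)
print(coord.put("k", "v1", {"Sx": 1}))   # True: 2 of 3 replicas acknowledged
print(coord.get("k"))                    # [('v1', {'Sx': 1}), ('v1', {'Sx': 1})]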

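The hinted handoff mechanism described above can also be sketched. The Node class and helper functions below are hypothetical and only meant to show the flow of a hint from the temporary node back to the intended one after recovery.

# Hypothetical hinted-handoff sketch (illustrative, not Dynamo's implementation).

class Node:
    def __init__(self, name, up=True):
        self.name, self.up = name, up
        self.store = {}          # regular data
        self.hinted = []         # (intended_node_name, key, value) kept separately

    def receive_hinted(self, intended, key, value):
        self.hinted.append((intended, key, value))

def write_with_hinted_handoff(ring, key, value, intended_index):
    """Write to the intended node, or hand the data to the next node with a hint."""
    intended = ring[intended_index]
    if intended.up:
        intended.store[key] = value
        return intended
    backup = ring[(intended_index + 1) % len(ring)]
    backup.receive_hinted(intended.name, key, value)
    return backup

def scan_and_deliver(ring):
    """Periodic scan: return hinted data to recovered nodes, then drop the local copy."""
    by_name = {n.name: n for n in ring}
    for node in ring:
        remaining = []
        for intended, key, value in node.hinted:
            target = by_name[intended]
            if target.up:
                target.store[key] = value      # delivery attempt succeeded
            else:
                remaining.append((intended, key, value))
        node.hinted = remaining

A, B, C, D = Node("A"), Node("B"), Node("C", up=False), Node("D")
ring = [A, B, C, D]
write_with_hinted_handoff(ring, "book", "robin_hood", intended_index=2)  # C is down, D holds it
print(D.hinted)             # [('C', 'book', 'robin_hood')]
C.up = True
scan_and_deliver(ring)
print(C.store, D.hinted)    # {'book': 'robin_hood'} []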
Example: hinted handoff on the ring
- Nodes A, B, C, and D sit on the consistent-hashing ring.
- If C is temporarily down, the data that would normally be stored on C is sent to node D.
- This data contains a hint in its metadata naming C, the node where it was supposed to be stored.
- After C recovers, D sends the data to C and then removes its local copy.

NoSQL Storage: Key-Value Stores (Dynamo)
(1) Partitioning
(2) High availability for writes
(3) Handling temporary failures
(4) Recovering from permanent failures
(5) Membership and failure detection

Anti-entropy protocol
- A replica synchronization protocol: hinted replicas can be lost before they can be returned to the original replica node.
- Goals: detect inconsistencies between the replicas faster and minimize the amount of transferred data.
- Dynamo uses Merkle trees.

Merkle tree
- A hash tree where the leaves are hashes of the values of individual keys.
- Parent nodes are hashes of their respective children.
- Each branch of the tree can be checked independently, without requiring nodes to download the entire tree or dataset.
- If the hash values of the roots of two trees are equal, the values of the leaf nodes are equal and no synchronization is needed.

Uses of Merkle trees
Merkle trees can be used to verify any kind of data that is stored, handled, and transferred, for example in:
- Peer-to-peer networks (e.g., the BitTorrent protocol)
- Trusted computing systems
- Sun's ZFS
- Google's Wave protocol
- Git
- Cassandra and Dynamo
- Blockchains

How a Merkle tree works (see the construction sketch after this section)
- H1+H2 denotes the concatenation of two hash values.
- The leaves H1, H2, H3, H4 are hashes of the individual data blocks.
- H5 = hash(H1+H2) and H6 = hash(H3+H4).
- Top Hash = hash(H5+H6).
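Here is a short Python sketch of the Merkle tree construction just described (using SHA-256 and helper names of my own choosing; Dynamo does not prescribe a particular hash): leaves are hashes of the individual values and each parent is the hash of its children's concatenated digests.

# Sketch of building a Merkle tree over a list of values (illustrative only).
import hashlib

def h(data: bytes) -> bytes:
    return hashlib.sha256(data).digest()

def merkle_levels(values):
    """Return the tree bottom-up: levels[0] are leaf hashes, levels[-1] is [top hash]."""
    level = [h(v) for v in values]                    # H1..H4: hashes of individual values
    levels = [level]
    while len(level) > 1:
        pairs = [level[i:i + 2] for i in range(0, len(level), 2)]
        level = [h(b"".join(p)) for p in pairs]       # parent = hash of concatenated children
        levels.append(level)
    return levels

blocks = [b"block1", b"block2", b"block3", b"block4"]
levels = merkle_levels(blocks)
print(len(levels[0]), len(levels[1]), len(levels[2]))  # 4 leaves, 2 inner hashes, 1 top hash
print(levels[-1][0].hex()[:16])                        # prefix of the top hash

# If two trees have equal top hashes, all leaf values are equal and no sync is needed.
print(merkle_levels(blocks)[-1] == levels[-1])         # True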

How Dynamo uses Merkle trees
- Each node maintains a separate Merkle tree for each key range it hosts.
- Two nodes exchange the roots of the Merkle trees corresponding to the key ranges they host in common.
- Each node then performs a tree traversal to determine whether there is any difference and performs the appropriate synchronization action.
- Disadvantage: when a node joins or leaves, the key ranges change and the trees need to be recalculated.

How the Merkle tree comparison works for Dynamo (Node A vs. Node B; a traversal sketch follows this section)
1. Compare the top hashes of the two trees.
2. If the top hashes do not match, compare the children of the roots (H5 and H6 on one tree against their counterparts on the other).
3. For every pair that still does not match, compare their children in turn, continuing down to the leaf hashes to locate the out-of-sync keys.

Example
- We are trying to build a Merkle tree with 16 blocks.
- The replication factor is 3, so each storage node maintains a Merkle tree for each of the 3 replications, based on the data block's original node identifier.
- Suppose that storage server nodes ra, rb, and rc store replications of the same data blocks.
- How many hash values does each Merkle tree contain (including the root hash)?
- What will be the maximum number of comparisons of hash values needed to detect the complete set of corrupted data blocks (including the comparison of the root hashes)?

Example -- answers
- How many hash values does each Merkle tree contain (including the root hash)? 16 leaf hashes + 8 + 4 + 2 + 1 = 31 hash values.
- If there is a mismatch between the root hashes of ra and rb, what will be the maximum number of comparisons of hash values to detect the complete set of corrupted data blocks (including the comparison of the root hashes)? In the worst case every comparison at every level fails, giving 1 + 2 + 4 + 8 + 16 = 31 comparisons.

NoSQL Storage: Key-Value Stores (Dynamo)
(1) Partitioning
(2) High availability for writes
(3) Handling temporary failures
(4) Recovering from permanent failures
(5) Membership and failure detection

Identifier ring membership
- A node outage is mostly temporary, so an outage should not result in re-balancing of the partition assignment or repair of the unreachable replicas.
- A gossip-based protocol propagates membership changes and maintains an eventually consistent view of membership (see the gossip sketch below).
- Each node contacts a randomly selected peer every second, and the two nodes reconcile their persisted membership change histories.
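The top-down comparison described above can be sketched as follows. This is illustrative code (the function names are my own): it builds the per-replica trees, recurses only into child pairs whose hashes disagree, and counts the hash comparisons it performs, which is the quantity the example asks about.

# Sketch: top-down Merkle comparison between two replicas (illustrative code).
import hashlib

def h(data: bytes) -> bytes:
    return hashlib.sha256(data).digest()

def build(values):
    """levels[0] = leaf hashes, levels[-1] = [top hash]."""
    level = [h(v) for v in values]
    levels = [level]
    while len(level) > 1:
        level = [h(b"".join(level[i:i + 2])) for i in range(0, len(level), 2)]
        levels.append(level)
    return levels

def diff_leaves(levels_a, levels_b):
    """Compare from the roots down; return differing leaf indexes and #comparisons."""
    comparisons = 0
    frontier = [(len(levels_a) - 1, 0)]          # (level index from bottom, node index)
    differing = []
    while frontier:
        lvl, idx = frontier.pop()
        comparisons += 1
        if levels_a[lvl][idx] == levels_b[lvl][idx]:
            continue                             # identical subtree: skip it entirely
        if lvl == 0:
            differing.append(idx)                # corrupted / out-of-sync block
        else:
            for child in (2 * idx, 2 * idx + 1):
                if child < len(levels_a[lvl - 1]):
                    frontier.append((lvl - 1, child))
    return sorted(differing), comparisons

blocks_a = [b"b%d" % i for i in range(16)]
blocks_b = list(blocks_a)
blocks_b[5] = b"corrupted"
bad, n = diff_leaves(build(blocks_a), build(blocks_b))
print(bad, n)   # [5] 9: one corrupted block needs 9 comparisons, well below the worst case of 31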

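The gossip-based membership reconciliation at the end of the previous section can be illustrated with a toy sketch (the Member class and its versioned view are my own simplification, not Dynamo's data structures): every round, each node contacts one random peer and both sides keep the highest-versioned entry they have seen for every node.

# Rough sketch of gossip-based membership reconciliation (illustrative names only).
import random

class Member:
    def __init__(self, name):
        self.name = name
        # persisted membership-change history: node -> (version, status)
        self.view = {name: (1, "joined")}

    def record(self, node, version, status):
        current = self.view.get(node, (0, None))
        if version > current[0]:
            self.view[node] = (version, status)

    def gossip_with(self, peer):
        """Both sides reconcile: keep the highest-versioned entry for every node."""
        for node, (version, status) in list(peer.view.items()):
            self.record(node, version, status)
        for node, (version, status) in list(self.view.items()):
            peer.record(node, version, status)

nodes = [Member(n) for n in ("A", "B", "C", "D")]
nodes[0].record("E", 1, "joined")        # the membership change is recorded at A first

# Gossip rounds: every node contacts one randomly selected peer and reconciles.
rounds = 0
while any(m.view.get("E") != (1, "joined") for m in nodes):
    for node in nodes:
        peer = random.choice([p for p in nodes if p is not node])
        node.gossip_with(peer)
    rounds += 1
print(rounds, nodes[-1].view.get("E"))   # a few rounds, then (1, 'joined') everywhere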
Logical partitioning
- Consider the nearly concurrent addition of two new nodes: node A joins the ring, and node B joins the ring.
- A and B each consider themselves members of the ring, yet neither is immediately aware of the other; A does not know of the existence of B.
- This situation is a logical partitioning of the ring.

External discovery
- Addresses logical partitioning.
- Seeds: nodes discovered via an external mechanism and known to all nodes; they are statically configured (or obtained from a configuration service).
- Seed nodes will eventually reconcile their membership with all of the nodes.

Failure detection
- Used to avoid attempts to communicate with unreachable peers during get or put operations and while transferring partitions and hinted replicas.
- Detecting communication failures: a peer is considered failed when there is no response to an initiated communication.
- Responding to communication failures: the sender tries alternate nodes that map to the failed node's partitions, and periodically retries the failed node to check for recovery.

NoSQL Storage: Column Family Stores (Google's Bigtable)

This material is built based on: Fay Chang, Jeffrey Dean, Sanjay Ghemawat, Wilson C. Hsieh, Deborah A. Wallach, Mike Burrows, Tushar Chandra, Andrew Fikes, and Robert E. Gruber, "Bigtable: A Distributed Storage System for Structured Data," OSDI 2006.

Column-family storage
- Optimized for sparse data: sparse columns and no schema.
- Aggregate-oriented storage: most data interaction is done with the same aggregate.
- Aggregate: a collection of data that we interact with as a unit.
- Stores groups of columns (column families) together.

Example: get(<row key>, 'Profile:name')
- A row, identified by its row key, groups several column families.
- The column family Profile contains the columns name (column value: martin), billingAddr, and payment.
- The column family Orders contains one column per order, keyed by the order identifiers.
- A read names the row key plus the column family and column key (here Profile:name), and the matching column value (martin) is returned.
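A column-family row like the one above can be modeled as a two-level map from row key to column family to column key to column value. The sketch below is illustrative only; the row key "1234" and the order id are made-up placeholders, since the actual keys are not shown on the slide.

# Illustrative column-family sketch: row key -> column family -> column -> value.
# The row key "1234" and the order id are made-up placeholders for this example.

class ColumnFamilyStore:
    def __init__(self):
        self.rows = {}   # row key -> {column family -> {column key -> column value}}

    def put(self, row_key, family, column, value):
        self.rows.setdefault(row_key, {}).setdefault(family, {})[column] = value

    def get(self, row_key, qualified_column):
        """qualified_column is 'family:column', e.g. 'Profile:name'."""
        family, column = qualified_column.split(":", 1)
        return self.rows.get(row_key, {}).get(family, {}).get(column)

store = ColumnFamilyStore()
store.put("1234", "Profile", "name", "martin")
store.put("1234", "Profile", "billingAddr", "...")
store.put("1234", "Profile", "payment", "...")
store.put("1234", "Orders", "ODR1001", "...")   # one column per order in the Orders family

print(store.get("1234", "Profile:name"))        # martin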