Large-Scale Data Stores and Probabilistic Protocols


Distributed Systems 600.437: Large-Scale Data Stores and Probabilistic Protocols
Lecture 11, Department of Computer Science, The Johns Hopkins University

Further reading:
* Reliable Distributed Systems by Ken Birman, Chapter 25.
* Cassandra: A Decentralized Structured Storage System. Lakshman, Malik.
* HyperDex: A Distributed, Searchable Key-Value Store. Escriva, Wong, Sirer.

Distributed Indexing

Distributed (and peer-to-peer) file systems have two parts: a lookup mechanism that tracks down the node holding the object, and a superimposed file system application that actually retrieves and stores the files. Distributed indexing refers to the lookup part. The Internet DNS is the most successful distributed indexing mechanism to date, mapping machine names to IP addresses. Peer-to-peer indexing tries to generalize the concept to (key, value) pairs; such a structure is also called a Distributed Hash Table (DHT).

Distributed Indexing (cont.)

Suppose we want to store a very large number of objects and access them by key. How would you implement a (key, value) distributed data store that provides good lookup performance and scales to hundreds or thousands of nodes? Now think about what you would do to ensure robustness in the presence of participants coming and going.

Chord

Developed at MIT (2001). The main idea is to form a massive virtual ring where every node is responsible for a portion of the periphery. Node IDs and data keys are hashed using the same function into a non-negative space. Each node is responsible for all the (key, value) pairs whose hash result is less than or equal to the node ID's hash result but greater than the next smaller hashed node ID.

Chord Indexing

[Figure: a circular ID space with nodes 10, 32, 60, 78, and 93 placed around the ring, and keys 5, 11, 30, 33, 50, 60, 72, 81, 88, and 95 assigned to the responsible nodes. The node with hash result 93 and the (key, value) pair with hash result 50 are called out.]
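To make the responsibility rule concrete, here is a minimal Python sketch (not Chord's actual code) that maps hashed keys to nodes on the ring. The 7-bit identifier space and the hash_id helper are assumptions chosen to fit the 0..99 IDs in the figure.

```python
import hashlib

M = 7                          # assumed identifier space: IDs in [0, 2**M), i.e. 0..127

def hash_id(name: str) -> int:
    """Hash a node name or data key onto the ring (illustrative, not Chord's exact scheme)."""
    return int(hashlib.sha1(name.encode()).hexdigest(), 16) % (2 ** M)

def responsible_node(key_hash: int, node_hashes: list[int]) -> int:
    """The node whose ID hash is >= the key hash but whose predecessor's is not:
    each node owns the interval (previous node, itself], wrapping around the ring."""
    ring = sorted(node_hashes)
    for n in ring:
        if n >= key_hash:
            return n
    return ring[0]             # wrap-around: keys past the largest node belong to the smallest

nodes = [10, 32, 60, 78, 93]                   # the node hashes from the figure
assert responsible_node(50, nodes) == 60       # key 50 is owned by node 60
assert responsible_node(95, nodes) == 10       # key 95 wraps around to node 10
```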

Chord Indexing (cont.)

[Figure: the same circular ID space, highlighting the node whose ID hashes to 93 and the (key, value) pair whose key hashes to 50.]

Chord Indexing (cont.)

Each node maintains a pointer to the node after it and another pointer to the node before it. A new node contacts an existing node (the startup issue) and traverses the ring until it finds the nodes before and after it; a state transfer is then performed from the next node on the ring to accommodate the newly joined node. Lookup can be performed by traversing the ring one node at a time. Can we do better?

Chord Lookup

Each node maintains a finger table that serves as a set of shortcuts to nodes at various distances within the hash key space. Question: how would you construct the finger table to allow logarithmic search steps?

Chord Lookup (cont.)

[Figure: the finger table of node 78, with fingers reaching 1/2, 1/4, 1/8, 1/16, and 1/32 of the way around the ring.]
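One possible answer, sketched in Python: build finger[i] = successor(node + 2^i), so successive fingers cover half, a quarter, an eighth, ... of the ring, and route greedily through them. The node list, the 7-bit ID space, and the use of a shared node list in place of real per-node routing state are assumptions for illustration only.

```python
M = 7                                          # assumed identifier space: IDs in [0, 2**M)

def successor(ring: list[int], ident: int) -> int:
    """First node clockwise from ident (inclusive), wrapping around the ring."""
    for n in sorted(ring):
        if n >= ident:
            return n
    return min(ring)

def dist(a: int, b: int) -> int:
    """Clockwise distance from a to b on the ring."""
    return (b - a) % (2 ** M)

def build_finger_table(ring: list[int], node: int) -> list[int]:
    """finger[i] = successor(node + 2**i): shortcuts reaching roughly 1/2, 1/4, 1/8, ...
    of the way around the ring, which is what makes O(log n) lookups possible."""
    return [successor(ring, (node + 2 ** i) % (2 ** M)) for i in range(M)]

def lookup(ring: list[int], start: int, key_hash: int) -> int:
    """Greedy lookup: jump to the finger closest to the key without overshooting it;
    each hop roughly halves the remaining clockwise distance to the key."""
    current = start
    while True:
        succ = successor(ring, (current + 1) % (2 ** M))      # immediate clockwise neighbour
        if dist(current, key_hash) == 0 or succ == current:   # this node owns the key
            return current
        if dist(current, key_hash) <= dist(current, succ):    # key lies in (current, succ]
            return succ
        fingers = build_finger_table(ring, current)           # a real node stores only its own fingers
        current = max((f for f in fingers if 0 < dist(current, f) < dist(current, key_hash)),
                      key=lambda f: dist(current, f))

ring = [5, 10, 20, 25, 32, 43, 50, 60, 65, 78, 81, 87, 93, 98]
print(lookup(ring, 78, 19))   # -> 20, the node responsible for k19; start node 78 chosen for illustration
```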

Chord Lookup (cont.)

[Figure: Lookup(k19) on a ring with nodes 5, 10, 20, 25, 32, 43, 50, 60, 65, 78, 81, 87, 93, and 98; the request is routed across the ring via finger-table hops to node 20, which is responsible for key 19.]

Chord Issues to Consider

Overall, log(n) hops for lookup in the worst case! Very good. But: what is a hop? Where are the nodes? Is log(n) really good? What about churn? Is it really log(n) worst case over time? How do we maintain robustness?

Kelips

Developed at Cornell (2003). Kelips uses more storage at each node (sqrt(N) instead of log(N)) and replicates each item at sqrt(N) nodes. It aims to achieve O(1) lookups and copes with churn by imposing a constant communication overhead, although data quality may lag if updates occur too rapidly. How would you do that?

Kelips Lookup

N is an estimate of the number of nodes. Each node ID is hashed into one of sqrt(N) affinity groups. Each key from a (key, value) pair is hashed into one of the same sqrt(N) groups, giving approximately sqrt(N) replicas within that affinity group. Pointers are maintained to a small number of members of each affinity group, so lookup is O(1). Weak consistency between the replicas is maintained using a reliable multicast protocol based on gossip.
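A rough sketch of the per-node Kelips state and the one-hop lookup, under simplifying assumptions: the names KelipsNode, contacts, and tuples are invented here, and the gossip layer that would populate and refresh this soft state is not shown.

```python
import hashlib, math, random

def h(name: str, buckets: int) -> int:
    return int(hashlib.sha1(name.encode()).hexdigest(), 16) % buckets

class KelipsNode:
    """Per-node soft state: members of the node's own affinity group, a few contacts
    in every other group, and replicas of the (key, value) tuples owned by its group."""
    def __init__(self, name: str, n_estimate: int):
        self.name = name
        self.k = max(1, round(math.sqrt(n_estimate)))   # sqrt(N) affinity groups
        self.group = h(name, self.k)                    # this node's own group
        self.contacts = {}                              # group id -> list of KelipsNode (filled by gossip, not shown)
        self.tuples = {}                                # key -> value, for keys hashed to self.group

    def lookup(self, key: str):
        """O(1): answer locally if the key hashes to our own group, otherwise
        one hop to one of our contacts in the key's group."""
        g = h(key, self.k)
        if g == self.group:
            return self.tuples.get(key)
        contact = random.choice(self.contacts[g])       # stands in for a single RPC
        return contact.tuples.get(key)
```

Because gossip keeps contacts and tuples only weakly consistent, data quality can lag under heavy churn or rapid updates, as noted above.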

Kelips Lookup (cont.)

[Figure: affinity groups 1, 2, 3, ..., with the replicas of each key spread across the members of its group.]

Probabilistic Broadcast Protocols

A class game demonstrating the probabilistic broadcast (pbcast) protocol: take at least n >= 20 logical participants. Each participant randomly picks 4 numbers between 1 and n, noting the order of their selection. Play the game with the first number only, then the first 2 numbers, then 3, then all 4, and look at the coverage achieved for a message generated by one participant.
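A small simulation of the class game (my interpretation, not part of the lecture): each participant pre-picks four random targets, one participant generates a message, and anyone who hears it forwards it to the first one, two, three, or four of their picks. Coverage is how many participants end up hearing the message.

```python
import random

def pbcast_coverage(n: int, fanout: int, seed: int = 0) -> int:
    """Simulate one run of the game: every participant has pre-picked 4 random targets;
    starting from one source, anyone who has heard the message forwards it to the
    first `fanout` of their picks. Returns how many of the n participants are covered."""
    rng = random.Random(seed)
    targets = [rng.sample(range(n), 4)[:fanout] for _ in range(n)]
    heard = {0}                                         # participant 0 generates the message
    frontier = [0]
    while frontier:
        nxt = []
        for p in frontier:
            for t in targets[p]:
                if t not in heard:
                    heard.add(t)
                    nxt.append(t)
        frontier = nxt
    return len(heard)

# Playing with 1, 2, 3, then 4 numbers: coverage climbs rapidly toward all n participants.
for fanout in (1, 2, 3, 4):
    print(fanout, pbcast_coverage(n=20, fanout=fanout))
```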

Cassandra: A Distributed Data Store

cassandra.apache.org. A key-value data store supporting get(key) and put(key, value), designed to scale to hundreds of servers. Initially developed by Facebook and made open source in 2008. Heavily influenced by Amazon's Dynamo storage system (which was itself influenced by Chord).

Cassandra: Partitioning

Keys are hashed into a bounded space, which forms a logical ring (like Chord). Servers are placed at different locations on this ring, and every server maintains information about the position of every other server. The nodes responsible for a key are found by searching clockwise from the hashed key, and the data is replicated on the first N servers found.
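A sketch of the placement rule in Python; the ring size, the server positions, and the function names are made-up values for illustration, not taken from Cassandra.

```python
import hashlib

RING_SIZE = 100                                    # assumed ring of positions 0..99, as in the figures

def position(key: str) -> int:
    """Hash a key to a position on the ring."""
    return int(hashlib.md5(key.encode()).hexdigest(), 16) % RING_SIZE

def replicas_for(key_position: int, servers: dict[str, int], n: int = 3) -> list[str]:
    """Walk clockwise from the key's position and take the first n servers found."""
    by_position = sorted(servers, key=servers.get)
    clockwise = [s for s in by_position if servers[s] >= key_position] + \
                [s for s in by_position if servers[s] < key_position]
    return clockwise[:n]

# Hypothetical positions for servers A..I around the ring.
servers = {"A": 5, "B": 16, "C": 27, "D": 38, "E": 49, "F": 60, "G": 71, "H": 82, "I": 93}
print(replicas_for(position("some-key"), servers))   # first 3 servers clockwise from the key
print(replicas_for(25, servers))                     # a key hashing to 25 -> stored on C, D and E
```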

Partitioning, N=3

[Figure: servers A through I placed around a ring of positions 0 to 99; Hash(k) = 25 falls between two servers, and the key is stored on the next three servers clockwise.]

Partitioning

[Figure: the same ring after moving a node; load can be rebalanced by moving nodes to new positions on the ring.]

Cassandra: Partitioning (cont.)

Changes in node positions (from membership changes or rebalancing) are handled by one node designated as the leader. The leader stores these updates in ZooKeeper (a separate system built on a Paxos-like protocol) for fault tolerance; if the leader fails, a new leader is elected using ZooKeeper. The changes themselves are disseminated probabilistically: every second, each node exchanges information with two random nodes.

Cassandra: Replication (Read/Write Quorums)

If data is stored on N replicas, users can configure two values, R and W. R is the minimum number of replicas that must participate in a read operation, and W is the minimum number of replicas that must participate in a write operation. As long as R + W > N, every read quorum will contain a node with the latest write.
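To see why, here is a tiny brute-force check (illustrative only, not part of Cassandra): when R + W > N, any set of R replicas must overlap any set of W replicas.

```python
from itertools import combinations

def quorums_always_intersect(n: int, r: int, w: int) -> bool:
    """Brute force: does every read quorum of size r share at least one replica
    with every write quorum of size w, out of n replicas?"""
    replicas = range(n)
    return all(set(rq) & set(wq)
               for rq in combinations(replicas, r)
               for wq in combinations(replicas, w))

print(quorums_always_intersect(3, 2, 2))   # R + W = 4 > N = 3 -> True
print(quorums_always_intersect(3, 1, 2))   # R + W = 3 = N     -> False: a read can miss the latest write
```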

Executing Put(key, value)

The client sends the put request to any node, which acts as the coordinator for that request. The coordinator forwards the update to the relevant N replicas. After W of those replicas have acknowledged the update, the coordinator can tell the client that the write was successful.

Executing Get(key)

The client sends the get request to any node, which acts as the coordinator. The coordinator requests the object from the relevant N nodes. After R of those replicas respond, the coordinator returns the most recent version held by any of the responding replicas.
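A sketch of the coordinator's role, with in-process replica objects standing in for RPCs and with all N replicas answering synchronously (the real system returns as soon as W acks or R responses arrive). The class and method names are invented for illustration.

```python
import time

class Replica:
    def __init__(self):
        self.store = {}                                       # key -> (timestamp, value)

    def put(self, key, value, ts) -> bool:
        if key not in self.store or ts > self.store[key][0]:  # newer timestamp wins (see next slides)
            self.store[key] = (ts, value)
        return True                                           # acknowledge the write

    def get(self, key):
        return self.store.get(key)                            # (timestamp, value) or None


class Coordinator:
    """Any node can act as the coordinator for a given request."""
    def __init__(self, replicas, w: int, r: int):
        self.replicas, self.w, self.r = replicas, w, r        # the N replicas relevant to the key

    def put(self, key, value) -> bool:
        ts = time.time()                                      # timestamp from the coordinator's local clock
        acks = sum(rep.put(key, value, ts) for rep in self.replicas)
        return acks >= self.w                                 # success once W replicas have acknowledged

    def get(self, key):
        answers = [rep.get(key) for rep in self.replicas[:self.r]]   # responses from R replicas
        answers = [a for a in answers if a is not None]
        newest = max(answers, key=lambda a: a[0], default=(None, None))
        return newest[1]                                      # most recent version among the responders
```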

Executing Get(key) (cont.)

How is the most recent version determined? Coordinators give each write update a timestamp based on their local clocks.

Clock-based Timestamps

Each put is given a timestamp by the coordinator responsible for that update. If a replica receives an update with a lower timestamp than its current version, it ignores the update but still acknowledges that the write was successful. If the clocks on different coordinators drift, this can cause unexpected behavior.
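A minimal sketch of the replica-side rule and of how clock drift between coordinators loses a write; the scenario mirrors the example on the next slide, and the Replica class here is invented for illustration.

```python
class Replica:
    """Last-write-wins: keep a value only if its coordinator timestamp is newer."""
    def __init__(self):
        self.store = {}                                    # key -> (timestamp, value)

    def put(self, key, value, ts) -> bool:
        current = self.store.get(key)
        if current is None or ts > current[0]:
            self.store[key] = (ts, value)
        return True                                        # acknowledged either way

# Coordinator C2's clock runs behind C1's:
r = Replica()
r.put("k", 1, ts=20)      # Put(k,1) via C1, stamped T=20
r.put("k", 2, ts=19)      # Put(k,2) via C2, stamped T=19: ignored, yet acknowledged as successful
print(r.store["k"])       # (20, 1): the later write was silently lost
```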

Example: Losing an update

Suppose coordinator C2 has a slower clock than C1. The client issues Put(k,1) through C1, which stamps it T=20, and then Put(k,2) through C2, which stamps it T=19. The replicas keep (k, 1, T=20) and ignore the later update, so a subsequent Get(k) returns (k, 1, T=20): the second write is lost.

HyperDex: A Distributed Data Store

hyperdex.org. A large-scale sharded key-value data store (2012). It supports get, put, and atomic operations such as conditional puts and atomic addition, and all operations on a single key are linearizable. It also supports efficient search on secondary attributes.

Search

In addition to partitioning based on the key, each object is stored on additional servers based on its secondary attributes. Combining the hashes of a set of secondary attributes forms a hyperspace, which can itself be partitioned. This enables efficient search by limiting the number of servers that need to be contacted, and it can be done for multiple sets of attributes.

Hyperspace Hashing

Attribute values are hashed independently, and any hash function may be used. [Figure: a three-dimensional hyperspace with axes for first name, last name, and phone number, showing H("Neil"), H("Armstrong"), and H("607-555-1024"). Source: http://hyperdex.org/slides/2013-06-28-cloudphysics.pdf]

Hyperspace Hashing (cont.)

Objects reside at the coordinate specified by the hashes. [Figure: the object for Neil Armstrong placed at the point (H("Neil"), H("Armstrong"), H("607-555-1024")) in the hyperspace.]

Different objects reside at different coordinates. [Figure: Neil Armstrong, Lance Armstrong, and Neil Diamond each placed at distinct coordinates in the same hyperspace. Source: http://hyperdex.org/slides/2013-06-28-cloudphysics.pdf]

Hyperspace Hashing (cont.)

By specifying more of the secondary attributes, we can reduce the number of servers that need to be searched. If all of the subspace attributes are specified, the search is as efficient as searching by key. What if the secondary attributes are updated?
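A sketch of hyperspace hashing under an assumed grid partitioning: each attribute axis is split into a fixed number of hash buckets and each cell of the grid maps to one server. HyperDex's real mapping is managed by its coordinator, so the attribute names, bucket count, and function names here are illustrative assumptions.

```python
import hashlib
from itertools import product

BUCKETS = 4                                   # assumed: each attribute axis split into 4 hash buckets
ATTRS = ("first_name", "last_name", "phone")  # assumed secondary attributes of the subspace

def bucket(value: str) -> int:
    return int(hashlib.sha1(value.encode()).hexdigest(), 16) % BUCKETS

def coordinate(obj: dict) -> tuple:
    """Hash each attribute independently; the tuple of bucket indices is the object's
    coordinate in the hyperspace, and each coordinate cell maps to one server."""
    return tuple(bucket(obj[a]) for a in ATTRS)

def search_cells(query: dict) -> list[tuple]:
    """Cells that must be contacted for a (possibly partial) search: a specified attribute
    pins its axis to one bucket, an unspecified attribute ranges over all buckets."""
    axes = [[bucket(query[a])] if a in query else list(range(BUCKETS)) for a in ATTRS]
    return list(product(*axes))

obj = {"first_name": "Neil", "last_name": "Armstrong", "phone": "607-555-1024"}
print(coordinate(obj))                                    # the single cell that stores this object
print(len(search_cells({"last_name": "Armstrong"})))      # 16 of 64 cells: one attribute specified
print(len(search_cells(obj)))                             # 1 cell: as efficient as a lookup by key
```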

Value-Dependent Chaining

To perform an update, all of the servers involved are organized in a chain. The server responsible for the primary key is at the head of the chain; any server holding the current version of the object is in the chain, and any server that will hold the updated version of the object is also in the chain. The update is ordered at the head and passed through the chain. Once it reaches the end of the chain, the tail server can commit the update and pass an acknowledgment back through the chain; updates are not committed until an acknowledgment is received from the next server in the chain.

[Figure: http://hyperdex.org/slides/2013-06-28-cloudphysics.pdf]

Value-Dependent Chaining (cont.)

Each put takes a forward pass down the chain and is committed during the backward pass. [Figures: the forward and backward passes of a put along the chain. Source: http://hyperdex.org/slides/2013-06-28-cloudphysics.pdf]
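A sketch of how the chain for one update might be assembled and driven. The names Server, pending, build_chain, and chained_put are invented, and plain function calls stand in for the per-hop messages of the real protocol.

```python
class Server:
    def __init__(self, name: str):
        self.name = name
        self.store = {}        # committed versions, visible to gets and searches
        self.pending = {}      # versions seen on the forward pass but not yet committed

def build_chain(key_server, old_servers, new_servers):
    """Head = server responsible for the primary key, followed by every server holding
    the current version and every server that will hold the updated version."""
    chain = []
    for s in (key_server, *old_servers, *new_servers):
        if s not in chain:
            chain.append(s)
    return chain

def chained_put(chain, key, new_value):
    """Forward pass: the head orders the update and it travels toward the tail.
    Backward pass: the tail commits first; each server commits only after hearing
    from its successor, so committed data is consistent along the whole chain."""
    for server in chain:                       # forward pass
        server.pending[key] = new_value
    for server in reversed(chain):             # backward pass (acknowledgments)
        server.store[key] = server.pending.pop(key)

head, old_region, new_region = Server("key-server"), Server("old-region"), Server("new-region")
chained_put(build_chain(head, [old_region], [new_region]), "k", {"last_name": "Armstrong"})
```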

Consistency Guarantees

Any operation that was committed before a search begins will be reflected in its results. In the presence of concurrent updates, either version may be returned, but at least one version of every object will be seen. Because an update can be reflected in a search before it is committed, search results may be inconsistent with get calls.

Fault Tolerance

Each server in the chain can be replicated. HyperDex uses chain replication, but any consistent replication protocol could be used. As long as every block of replicas remains available, the system remains available.
