Scaling Out Systems. COS 518: Advanced Computer Systems Lecture 6. Kyle Jamieson

Scaling Out Systems COS 518: Advanced Computer Systems Lecture 6 Kyle Jamieson [Credits: Selected content adapted from M. Freedman, B. Karp, R. Morris]

Today 1. Scaling up: Reliable multicast 2. Scaling out: Partitioning, DHTs and Chord 2

Recall the tradeoffs. CAP: distributed systems can have 2 of the following 3: Consistency (strong/linearizability), Availability, Partition tolerance (liveness despite arbitrary failures). Bit of an oversimplification; really: when you get P, do you choose A or C? Goal? ALPS (coined by Lloyd and Freedman, 2011): Available, Low-latency, Partition-tolerant, Scalable (to more than 1 CPU per site). 3

Reliable multicast motivation: fast content purging in a CDN [Fastly CDN]. CDN servers are geographically distributed and contain many replicas of data. Want to give customers the ability to take down content from across the CDN in a short period of time. Your sales team advertises 150 ms takedown latency. 4

Reliable multicast protocols: Reliable Multicast Transport Protocol (RMTP), Scalable Reliable Multicast (SRM). Sequenced, lossless bulk delivery of data from one sender to a group of receivers; TCP-like cumulative sequence numbers on data; sequence numbers and bitmaps in acknowledgement packets back to the sender; window-based flow control; retransmissions and failure monitoring among receivers. 5

Reliable multicast performance. Figure: average throughput on non-perturbed members (y-axis) vs. sleep time (fraction) of ONE group member (x-axis, 0 to 0.9), for group sizes 32, 64, and 96.

Bimodal multicast pbcast: Probabilistic broadcast Birman et al., ACM ToCS 17(2) 1999 Atomicity property is the bimodal delivery guarantee: High probability that each multicast will reach almost all processes Low probability that a multicast will reach just a very small set of processes Vanishingly small probability that it will reach some intermediate number of processes The traditional all or nothing guarantee thus becomes almost all or almost none. 7

Bimodal multicast (1/4) Initially use UDP/IP best effort multicast

Bimodal multicast (2/4) Periodically (e.g., every 100 ms): each node picks a random group member and sends a digest describing its state

Bimodal multicast (3/4) Recipient checks the digest against its own history; for each missing message, it solicits a copy of the message

Bimodal multicast (4/4) Nodes respond to solicitations by retransmitting requested message(s)
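
A minimal sketch of steps 2-4 above, assuming each node's digest is simply the set of message IDs it has delivered; the function and data layout are illustrative, not the pbcast wire format.

    # Digest-based anti-entropy pull between two nodes, A and B.
    # Each store maps message id -> payload.
    def anti_entropy_pull(a_store, b_store):
        missing = set(b_store) - set(a_store)   # ids in B's digest that A lacks
        for msg_id in missing:                  # solicit and retransmit copies
            a_store[msg_id] = b_store[msg_id]
        return missing

    a = {1: "m1", 3: "m3"}
    b = {1: "m1", 2: "m2", 3: "m3", 4: "m4"}
    print(anti_entropy_pull(a, b))              # {2, 4}; A now has all four messages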

Why bimodal? Two phases? Nope: it describes the dual modes of the result. Figure: pbcast bimodal delivery distribution, p{#processes = k} on a log scale (y-axis) vs. number of processes to deliver pbcast (x-axis, 0 to 50). Either the sender fails or the data gets through w.h.p.

Epidemic algorithms via gossiping. Assume a fixed population of size n. For simplicity, assume the epidemic (delivery) spreads homogeneously through the population. Simple randomized delivery: anyone can deliver to anyone with equal probability. Assume that k members are already delivered. Delivery occurs in rounds.

Probability of delivery. What is the probability P_deliver(k, n) that an undelivered member is delivered to in a round, if k members are already delivered? P_deliver(k, n) = 1 - P(nobody delivers) = 1 - (1 - 1/n)^k. E[# newly delivered] = (n - k) * P_deliver(k, n). Basically it's a binomial distribution. The number of rounds to deliver to the entire population is O(log n).
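
A small numeric sketch of these formulas under the homogeneous-mixing assumption above (the helper names are made up): it advances the expected number of delivered members round by round and compares the round count with log2 n.

    import math

    def p_deliver(k, n):
        # Probability an undelivered member is reached in one round,
        # given k members already have the message.
        return 1 - (1 - 1 / n) ** k

    def expected_rounds(n):
        # Deterministic approximation: add the expected number of newly
        # delivered members each round until (almost) everyone has it.
        k, rounds = 1.0, 0
        while n - k >= 1:
            k += (n - k) * p_deliver(k, n)
            rounds += 1
        return rounds

    for n in (100, 1000, 10000):
        print(n, expected_rounds(n), round(math.log2(n), 1))  # rounds grow like log n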

Two prevailing styles. Gossip pull ("anti-entropy"): A asks B for something it is trying to find; commonly used for managing replicated data; resolve differences between DBs by comparing digests. Gossip push ("rumor mongering"): A tells B something B doesn't know; gossip for multicasting; keep sending for a bounded period of time: O(log n); also used to compute aggregates. Push-pull gossip: combines both mechanisms; O(n log log n) messages to spread a rumor in O(log n) time.

Wednesday reading: Bayou. Problem: collaborative applications, e.g., meeting-room scheduling, bibliographic databases. Setting: mobile computing environment; seek to support disconnected workgroups; rely only on weak/opportunistic connectivity (occasional, pair-wise communication). Key technical problem: how to converge to (eventually) consistent state? Use anti-entropy for pair-wise resolution. Observation: need application-specific conflict detection and resolution at the granularity of individual updates.

Today 1. Scaling up: Reliable multicast 2. Scaling out: Partitioning, DHTs and Chord 17

Scaling out by partitioning data. Every data object belongs to a data partition. Each partition resides on one or more nodes. A replication protocol runs between the nodes hosting a partition; it could be, e.g., strongly or eventually consistent. Every node hosts one or more partitions. 18

What is a DHT? Single-node hash table abstraction: key = Hash(name); put(key, value); get(key) → value. Service: O(1) storage. How do I do this across millions of hosts on the Internet? Distributed Hash Table. 19

What Is a DHT? (and why?) Distributed hash table: key = Hash(data); lookup(key) → IP address (Chord); send-rpc(IP address, PUT, key, value); send-rpc(IP address, GET, key) → value. The first step towards truly large-scale distributed systems: a tuple in a global database engine, a data block in a global file system, rare.mp3 in a P2P file-sharing system. 20
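
A single-process sketch of this layering; lookup() and send_rpc() below are stand-ins, not the real Chord/DHash interfaces, and the three-node "network" is just a dictionary.

    import hashlib

    NODES = {"10.0.0.1": {}, "10.0.0.2": {}, "10.0.0.3": {}}   # ip -> local store

    def key_of(name):
        return int(hashlib.sha1(name.encode()).hexdigest(), 16)

    def lookup(key):
        # Stand-in for Chord: deterministically pick a responsible node.
        return sorted(NODES)[key % len(NODES)]

    def send_rpc(ip, op, key, value=None):
        store = NODES[ip]
        if op == "PUT":
            store[key] = value
        return store.get(key)

    def dht_put(name, value):
        send_rpc(lookup(key_of(name)), "PUT", key_of(name), value)

    def dht_get(name):
        return send_rpc(lookup(key_of(name)), "GET", key_of(name))

    dht_put("rare.mp3", b"...")
    print(dht_get("rare.mp3"))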

DHT Factoring. Figure: a distributed application calls put(key, data) and get(key) on a distributed hash table (DHash), which in turn calls lookup(key) on a lookup service (Chord) that returns a node's IP address. The application may be distributed over many nodes; the DHT distributes data storage over many nodes. 21

Why the put()/get() DHT interface? API supports a wide range of applications DHT imposes no structure/meaning on keys Key/value pairs are persistent and global Can store keys in other DHT values And thus build complex data structures 22

Why might DHT design be hard? Decentralized: no central authority Scalable: low network traffic overhead Efficient: find items quickly (latency) Dynamic: nodes fail, new nodes join General-purpose: flexible naming 23

The Lookup problem. Figure: a publisher at one node calls Put(key="title", value=file data); a client elsewhere on the Internet calls Get(key="title"). Which of the nodes N1..N6 should the client ask? This problem is at the heart of all DHTs. 24

Motivation: Centralized lookup (Napster). Figure: the publisher at N4 (key="title", value=file data) registers SetLoc("title", N4) with a central DB; the client sends Lookup("title") to the DB, which returns N4. Simple, but O(N) state and a single point of failure. 25

Motivation: Flooded Queries (Gnutella). Figure: the publisher at N4 holds key="title", value=file data; the client floods Lookup("title") to its neighbors, who forward it onward until it reaches N4. Robust, but worst case O(N) messages per lookup. 26

Motivation: FreeDB, Routed DHT Queries (Chord, &c.). Figure: the publisher stores key=H(audio data), value={artist, album title, track title}; the client's Lookup(H(audio data)) is routed through a few nodes directly toward the node responsible for the key. 27

DHT Applications. They're not just for stealing music anymore: global file systems [OceanStore, CFS, PAST, Pastiche, UsenetDHT], naming services [Chord-DNS, Twine, SFR], DB query processing [PIER, Wisc], Internet-scale data structures [PHT, Cone, SkipGraphs], communication services [i3, MCAN, Bayeux], event notification [Scribe, Herald], file sharing [OverNet]. 28

Basic Approaches require two features. Partition management: on which node(s) to place a partition, including how to recover from a node failure (e.g., bringing another node into the partition group) and how to handle changes in system size (e.g., nodes joining and leaving). Resolution: maintain the mapping from data name to responsible node(s); centralized: a cluster manager; decentralized: deterministic hashing and algorithms. 29

The partitioning problem. Consider the problem of data partitioning: given document X, choose one of k servers to use. Suppose we use modulo hashing: number servers 1..k, place X on server i = (X mod k). Problem? Data may not be uniformly distributed. So place X on server i = hash(X) mod k. Problem? What happens if a server fails or joins (k → k±1)? Problem? What if different clients have different estimates of k? Answer: all entries get remapped to new nodes! 30

Placing objects on servers. How to determine the server on which a certain object resides? Typical approach: hash the object's identifier. A hash function h maps object id x to a server id, e.g., h(x) = (ax + b) mod p, where p is a prime integer, a and b are constant integers chosen uniformly at random from [0, p - 1], and x is an object's serial number.
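
A tiny illustration of this hash family; the constants and the final mod-k step that maps the hash onto k servers are assumptions for the example.

    import random

    p = 2_147_483_647                      # a prime
    a = random.randrange(1, p)             # constants chosen uniformly at random
    b = random.randrange(0, p)

    def h(x, k):
        return ((a * x + b) % p) % k       # server index for object serial number x

    print([h(x, 4) for x in (5, 7, 10, 11, 27, 29, 36, 38, 40)])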

Difficulty: Changing number of servers. With h(x) = x + 1 (mod 4), adding one machine gives h(x) = x + 1 (mod 5). Figure: server assignment (servers 0-4, y-axis) vs. object serial number (x-axis: 5, 7, 10, 11, 27, 29, 36, 38, 40). Adding a machine changes nearly all objects' assignments: objects need to move over the network.
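
A quick check of this example (illustrative only): with h(x) = x + 1 (mod k), count how many objects change servers when a fifth machine is added.

    objects = range(100_000)
    moved = sum((x + 1) % 4 != (x + 1) % 5 for x in objects)
    print(moved / len(objects))            # 0.8: four out of five objects must move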

Chord Lookup Algorithm Properties. Interface: lookup(key) → IP address. Efficient: O(log N) messages per lookup, where N is the total number of servers. Scalable: O(log N) state per node. Robust: survives massive failures. Simple to analyze. 33

Chord IDs Key identifier = SHA-1(key) Node identifier = SHA-1(IP address) SHA-1 distributes both uniformly How does Chord partition data? i.e., map key IDs to node IDs 34

Consistent hashing [Karger 97]. Figure: a circular 7-bit ID space with nodes N32, N90, and N105 and keys K5, K20, and K80. A key is stored at its successor: the node with the next-higher ID. 35
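
A minimal consistent-hashing sketch for the ring on this slide; the 7-bit ID space and node IDs follow the figure, and the code is an illustration, not Chord itself.

    import bisect

    ID_SPACE = 2 ** 7
    node_ids = sorted([32, 90, 105])                # N32, N90, N105

    def successor(key_id):
        i = bisect.bisect_left(node_ids, key_id % ID_SPACE)
        return node_ids[i % len(node_ids)]          # wrap past the highest ID

    for k in (5, 20, 80):
        print(f"K{k} -> N{successor(k)}")           # K5 -> N32, K20 -> N32, K80 -> N90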

Basic Lookup. Figure: the question "Where is key 80?" is forwarded from successor to successor around a ring of nodes N10, N32, N60, N90, N105, N120 until it reaches N90, the successor of K80: "N90 has K80". 36

Simple lookup algorithm

  Lookup(my-id, key-id)
    n = my successor
    if my-id < n < key-id       // next hop
      call Lookup(key-id) on node n
    else                        // done
      return n

Correctness depends only on successors. 37

Finger Table Allows log(N)-time Lookups. Figure: node N80's fingers point 1/2, 1/4, 1/8, 1/16, 1/32, 1/64, and 1/128 of the way around the ring. 38

Finger i Points to Successor of n + 2^(i-1). Figure: on N80's ring, the finger for 80 + 32 = 112 points to N120, the successor of ID 112. 39

Lookup with Fingers

  Lookup(my-id, key-id)
    look in local finger table for
      highest node n s.t. my-id < n < key-id
    if n exists
      call Lookup(key-id) on node n    // next hop
    else
      return my successor              // done

40
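
A compact sketch of finger construction and finger-based routing on the ring from the surrounding figures (node IDs as shown). It follows the closest-preceding-finger idea rather than transcribing the pseudocode line for line, and the 7-bit ID space is only for illustration.

    import bisect

    M = 7
    RING = 2 ** M
    nodes = sorted([5, 10, 20, 32, 60, 80, 99, 110])

    def successor(idx):
        i = bisect.bisect_left(nodes, idx % RING)
        return nodes[i % len(nodes)]

    def fingers(n):
        # finger[i] is the successor of n + 2**i, i = 0..M-1
        # (the slide's "successor of n + 2^(i-1)" with 1-based i).
        return [successor(n + 2 ** i) for i in range(M)]

    def in_interval(x, a, b):
        # True if x lies in the circular interval (a, b]
        return (a < x <= b) if a < b else (x > a or x <= b)

    def lookup(n, key_id, hops=0):
        succ = successor(n + 1)
        if in_interval(key_id, n, succ):
            return succ, hops                       # done: found key's successor
        for f in reversed(fingers(n)):              # closest preceding finger
            if in_interval(f, n, key_id):
                return lookup(f, key_id, hops + 1)
        return lookup(succ, key_id, hops + 1)

    print(lookup(80, 19))                           # (20, 2): N20 holds K19, 2 hops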

Lookups Take O(log N) Hops. Figure: Lookup(K19) issued at N80 follows fingers across a ring of nodes N5, N10, N20, N32, N60, N80, N99, N110 and resolves at N20, the successor of K19. 41

Join Operation. N50 joins the ring via N15: N50 sends join(50) to N15, the request reaches N44, which returns N58, and N50 updates its successor to N58 (its predecessor is still nil). Figure: ring of nodes 4, 8, 15, 20, 32, 35, 44, 58, with N50 joining. 42

Periodic Stabilize. N50 runs periodic stabilize: it sends a stabilize message to its successor N58 and learns succ.pred = 44; N50 then sends notify(node=50) to N58, and N58 updates its predecessor from 44 to 50. 43

Periodic Stabilize. N44 runs periodic stabilize: it asks its successor N58 for its predecessor and gets N50; N44 updates its successor from N58 to N50. 44

Periodic Stabilize. N44 now has a new successor, N50, and sends notify(node=44) to it; N50 sets its predecessor to N44. 45

Periodic Stabilize Converges! This completes the joining operation: N44's successor is N50, N50's successor is N58, and the predecessor pointers are consistent. 46
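
A toy, single-process sketch of stabilize/notify for this join example; the Node class and round structure are inventions for illustration, and the node numbering follows the figures.

    class Node:
        def __init__(self, nid):
            self.id, self.succ, self.pred = nid, None, None

    def between(x, a, b):                  # x in the circular interval (a, b)
        return (a < x < b) if a < b else (x > a or x < b)

    def stabilize(n):
        x = n.succ.pred                    # ask successor for its predecessor
        if x is not None and between(x.id, n.id, n.succ.id):
            n.succ = x                     # adopt the closer successor
        notify(n.succ, n)

    def notify(n, candidate):
        if n.pred is None or between(candidate.id, n.pred.id, n.id):
            n.pred = candidate             # candidate is a better predecessor

    ids = [4, 8, 15, 20, 32, 35, 44, 58]
    ring = {i: Node(i) for i in ids}
    for a, b in zip(ids, ids[1:] + ids[:1]):
        ring[a].succ, ring[b].pred = ring[b], ring[a]

    n50 = Node(50)
    n50.succ = ring[58]                    # join(50) told N50 its successor is N58
    for _ in range(2):                     # a couple of periodic stabilize rounds
        for n in list(ring.values()) + [n50]:
            stabilize(n)

    print(ring[44].succ.id, n50.succ.id, ring[58].pred.id)   # 50 58 50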

Joining: Linked List Insert. Figure: N36 joins between N25 and N40 (which holds K30 and K38). Step 1: N36 runs Lookup(36) to find its successor, N40. 47

Join (2). Step 2: N36 sets its own successor pointer to N40. 48

Join (3). Step 3: copy keys 26..36 from N40 to N36; K30 now also resides on N36. 49

Join (4). Step 4: set N25's successor pointer to N36. The predecessor pointer allows N40 to link back to the new node; finger pointers are updated in the background. Correct successors produce correct lookups. 50

Failures Might Cause Incorrect Lookup. Figure: Lookup(90) on a ring with N10, N80, N85, N102, N113, N120; after failures, N80 doesn't know its correct successor, so the lookup is incorrect. 51

Solution: Successor Lists Each node knows r immediate successors After failure, will know first live successor Correct successors guarantee correct lookups Guarantee is with some probability 52

Choosing Successor List Length. Assume 1/2 the nodes fail. P(successor list all dead) = (1/2)^r, i.e., P(this node breaks the Chord ring); this depends on independent failures. A successor list of size r = O(log N) makes this probability 1/N: low for large N. 53
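
A quick numeric check of this argument, assuming each node fails independently with probability 1/2.

    import math

    def p_ring_break(n):
        r = math.ceil(math.log2(n))        # successor list length r = O(log N)
        return 0.5 ** r                    # P(all r successors of a node are dead)

    for n in (128, 1024, 8192):
        print(n, p_ring_break(n), 1 / n)   # probability is about 1/N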

Lookup with Fault Tolerance

  Lookup(my-id, key-id)
    look in local finger table and successor-list
      for highest node n s.t. my-id < n < key-id
    if n exists
      call Lookup(key-id) on node n    // next hop
      if call failed,
        remove n from finger table or successor-list
        return Lookup(my-id, key-id)
    else
      return my successor              // done

54

Experimental Overview Quick lookup in large systems Low variation in lookup costs Robust despite massive failure Experiments confirm theoretical results 55

Chord Lookup Cost Is O(log N). Figure: average messages per lookup (y-axis) vs. number of nodes (x-axis); the cost grows logarithmically, and the constant is 1/2. 56

Failure Experimental Setup Start 1,000 CFS/Chord servers Successor list has 20 entries Wait until they stabilize Insert 1,000 key/value pairs Five replicas of each Stop X% of the servers Immediately perform 1,000 lookups 57

DHash Replicates Blocks at r Successors. Figure: block 17 is stored at its successor N20 and replicated at the following successors (N40, ...); the ring also contains N5, N10, N50, N60, N68, N80, N99, N110. Replicas are easy to find if the successor fails; hashed node IDs ensure independent failure. 58
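
A small sketch of placing replicas at the first r successors of a block's ID, using the node IDs from the figure; the helper is illustrative, not DHash code.

    import bisect

    RING = 2 ** 7
    nodes = sorted([5, 10, 20, 40, 50, 60, 68, 80, 99, 110])

    def successors(key_id, r):
        i = bisect.bisect_left(nodes, key_id % RING)
        return [nodes[(i + j) % len(nodes)] for j in range(r)]

    print(successors(17, 3))               # [20, 40, 50]: home node plus two replicas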

Massive Failures Have Little Impact. Figure: failed lookups (percent, y-axis, 0 to 1.4%) vs. failed nodes (percent, x-axis, 5 to 50%); even with half the nodes failed, the failed-lookup rate stays near (1/2)^6, which is 1.6%. 59

DHash summary. Builds key/value storage on Chord. Replicates blocks for availability: stores k replicas at the k successor servers after the block's successor on the Chord ring. Caches blocks for load balance: the client sends a copy of the block to each of the servers it contacted along the lookup path. Authenticates block contents. 60

DHTs: A Retrospective. The original DHTs (CAN, Chord, Kademlia, Pastry, Tapestry) were proposed in 2001-02. The following 5-6 years saw a proliferation of DHT-based applications: filesystems (e.g., CFS, Ivy, Pond, PAST), naming systems (e.g., SFR, Beehive), indirection/interposition systems (e.g., i3, DOA), content distribution systems (e.g., Coral), distributed databases (e.g., PIER). 61

What DHTs Got Right. Consistent hashing: an elegant way to divide a workload across machines; very useful in clusters and actively used today in Dynamo, FAWN-KV, ROAR, and others. Replication for high availability and efficient recovery after node failure. Incremental scalability: add nodes and capacity increases. Self-management: minimal configuration. Unique trait: no single server to shut down, control, or monitor; well suited to illegal applications, be they sharing music or resisting censorship. 62

DHTs' limitations. High latency between peers. Limited bandwidth between peers (compared to within a cluster). Lack of trust in peers' correct behavior: securing DHT routing is hard, unsolved in practice. 63

Next time Wednesday 10/28 Paper Discussion: Weakening Consistency Bayou, Dynamo, Eiger 64