CS 268: Lecture 22
DHT Applications

Ion Stoica
Computer Science Division
Department of Electrical Engineering and Computer Sciences
University of California, Berkeley
Berkeley, CA 94720-1776
(Presentation based on slides from Robert Morris and Sean Rhea)

Outline
Cooperative File System (CFS)
OpenDHT

Target CFS Uses
Serving data with inexpensive hosts:
- open-source distributions
- off-site backups
- tech report archives
- efficient sharing of music

How to Mirror Open-Source Distributions?
Multiple independent distributions
- Each has high peak load, low average load
Individual servers are wasteful
Solution: aggregate
- Option 1: a single powerful server
- Option 2: a distributed service
But how do you find the data?

Design Challenges
Avoid hot spots
Spread storage burden evenly
Tolerate unreliable participants
Fetch speed comparable to whole-file TCP
Avoid O(#participants) algorithms
- Centralized mechanisms [Napster], broadcasts [Gnutella]
CFS solves these challenges

CFS Architecture
Each node is both a client and a server
Clients can support different interfaces
- File system interface
- Music keyword search

Client-Server Interface
Files have unique names
Files are read-only (single writer, many readers)
Publishers split files into blocks
Clients check files for authenticity

Server Structure
[Figure: the layers of a CFS server; DHash (block storage) runs on top of Chord (lookup)]

Chord Hashes a Block ID to its Successor
[Figure: identifier ring with nodes N10, N32, N60, N80, N100; each block ID is stored at its successor, the first node clockwise from the ID]

DHash/Chord Interface
lookup() returns a list of node IDs closer in ID space to the block ID
- Sorted, closest first

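The successor mapping on the slide can be sketched in a few lines of Python. This toy model resolves the mapping with a global view of the ring, whereas real Chord does it in O(log N) hops over finger tables; the names and the 160-bit SHA-1 choice follow Chord/CFS, everything else is illustrative.

```python
import hashlib

def node_id(name: str) -> int:
    # 160-bit SHA-1 identifier, as in Chord/CFS
    return int.from_bytes(hashlib.sha1(name.encode()).digest(), "big")

def successor(block_id: int, ring: list) -> int:
    """Return the first node ID at or after block_id on the sorted ring."""
    ring = sorted(ring)
    for n in ring:
        if n >= block_id:
            return n
    return ring[0]  # wrap around the identifier circle

nodes = sorted(node_id(f"node{i}") for i in range(8))
bid = node_id("some-block")
assert successor(bid, nodes) in nodes
assert successor(max(nodes) + 1, nodes) == min(nodes)  # wrap-around case
```

The wrap-around branch is what makes the ID space a circle rather than a line: a block whose ID exceeds every node ID belongs to the numerically smallest node.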
DHash Uses Other Nodes to Locate Blocks
[Figure: a lookup for a block hops across nodes N5, N10, N20, N40, N50, N60, N68, N80, N99, N110 toward the block's successor]

Storing Blocks
Long-term blocks are stored for a fixed time
- Publishers need to refresh them periodically
Cache uses LRU

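The LRU policy for cached blocks can be sketched with an ordered dictionary; this is a minimal stand-in for DHash's cache, not its actual implementation, and the long-term (TTL/refresh) store is a separate mechanism.

```python
from collections import OrderedDict

class BlockCache:
    """Minimal LRU cache for blocks, as the slides describe for DHash."""
    def __init__(self, capacity: int):
        self.capacity = capacity
        self.blocks = OrderedDict()  # key -> block data, oldest first

    def get(self, key):
        if key not in self.blocks:
            return None
        self.blocks.move_to_end(key)         # mark as recently used
        return self.blocks[key]

    def put(self, key, data):
        self.blocks[key] = data
        self.blocks.move_to_end(key)
        while len(self.blocks) > self.capacity:
            self.blocks.popitem(last=False)  # evict least recently used

cache = BlockCache(2)
cache.put("a", b"1"); cache.put("b", b"2")
cache.get("a")           # touch "a" so "b" becomes least recently used
cache.put("c", b"3")     # evicts "b"
assert cache.get("b") is None and cache.get("a") == b"1"
```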
Replicate Blocks at r Successors
[Figure: a block is replicated at the r nodes immediately succeeding its ID on the ring]

Lookups Find Replicas
[Figure: a lookup that reaches any of the r successors can return the block, even if the primary successor has failed]

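Placement at the r successors is easy to state concretely. A hedged sketch, again with a globally visible sorted ring (real DHash derives this set from Chord's successor list):

```python
def replica_set(block_id: int, ring: list, r: int = 3) -> list:
    """The r nodes immediately succeeding block_id on the ring."""
    ring = sorted(ring)
    # index of the block's successor (first node at or after block_id)
    i = next((j for j, n in enumerate(ring) if n >= block_id), 0)
    return [ring[(i + k) % len(ring)] for k in range(min(r, len(ring)))]

ring = [5, 10, 20, 40, 50, 60, 68, 80, 99, 110]
assert replica_set(45, ring, r=3) == [50, 60, 68]
assert replica_set(115, ring, r=2) == [5, 10]  # wraps around the ring
```

Because successor IDs come from hashing unrelated hosts, the replicas land on machines that fail independently, which is the point of this placement rule.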
First Live Successor Manages Replicas
[Figure: the block's first live successor is responsible for keeping the replicas alive on the next r nodes]

DHash Copies to Caches Along Lookup Path
[Figure: as a lookup travels toward the block's successor, the block is copied to caches at the nodes along the path]

Caching at Fingers Limits Load
[Figure: lookups for a block converge through the nodes whose fingers point near the block's ID (e.g., N32), so caching at those fingers bounds the load any one popular block can impose]

Virtual Nodes Allow Heterogeneity
Hosts may differ in disk/net capacity
Hosts may advertise multiple IDs
- Chosen as SHA-1(IP address, index)
- Each ID represents a virtual node
Host load is proportional to its number of virtual nodes
- Manually controlled

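The SHA-1(IP address, index) rule above gives each host as many ring positions as it can afford. A sketch, where the exact byte encoding of the (address, index) pair is an assumption of this example:

```python
import hashlib

def virtual_node_ids(ip: str, num_vnodes: int) -> list:
    """One ring ID per virtual node; encoding of (ip, index) is illustrative."""
    return [
        int.from_bytes(hashlib.sha1(f"{ip}/{i}".encode()).digest(), "big")
        for i in range(num_vnodes)
    ]

# A host with twice the capacity advertises twice as many virtual nodes,
# so in expectation it owns twice as large a share of the ID space.
small = virtual_node_ids("10.0.0.1", 2)
big = virtual_node_ids("10.0.0.2", 4)
assert len(set(small + big)) == 6  # IDs distinct with high probability
```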
Why Blocks Instead of Files?
Cost: one lookup per block
- Can tailor cost by choosing a good block size
Benefit: load balance is simple for large files
- Storage cost of large files is spread out
- Popular files are served in parallel

Outline
Cooperative File System (CFS)
OpenDHT

Questions
How many DHTs will there be?
Can all applications share one DHT?

Benefits of Sharing a DHT
Amortizes costs across applications
- Maintenance bandwidth, connection state, etc.
Facilitates bootstrapping of new applications
- Working infrastructure already in place
Allows for statistical multiplexing of resources
- Takes advantage of spare storage and bandwidth
Facilitates upgrading existing applications
- Share the DHT between application versions

The DHT as a Service
[Figure, built up over several slides: many applications run as clients of one shared OpenDHT deployment]

The DHT as a Service
What is this interface?

It's Not lookup()
[Figure: a client sends lookup(k) to an OpenDHT node]
Challenges:
1. Distribution
2. Security
What does this node do with it?

How Are DHTs Used?
1. Storage
- CFS, UsenetDHT, PKI, etc.
2. Rendezvous
- Simple: chat, instant messenger
- Load balanced: i3
- Multicast: RSS aggregation, whiteboard
- Anycast: Tapestry, Coral

What About put/get?
Works easily for storage applications
Easy to share
- No upcalls, so no code distribution or security complications
But does it work for rendezvous?
- Chat? Sure: put(my-name, my-ip)
- What about the others?

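The chat case above fits put/get directly. A minimal sketch using an in-memory dict as a stand-in for OpenDHT (the real service is remote, replicated, and TTL-limited; the multi-value behavior here mirrors OpenDHT's semantics of several values per key):

```python
dht = {}

def put(key, value, ttl=3600):
    """Append a value under key; ttl is accepted but ignored in this toy."""
    dht.setdefault(key, []).append(value)

def get(key):
    """Return all values currently stored under key."""
    return dht.get(key, [])

# Alice publishes her address; Bob looks her up to open a direct connection.
put("alice", "10.1.2.3:5222")
assert get("alice") == ["10.1.2.3:5222"]
assert get("carol") == []  # no rendezvous record published
```

The harder rendezvous patterns on the slide (load-balanced, multicast, anycast) need more than this: some party must choose among values, fan out to them, or pick the closest, which is why bare put/get alone is in question.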
Protecting Against Overuse
Must protect system resources against overuse
- Resources include network, CPU, and disk
- Network and CPU are straightforward
- Disk is harder: usage persists long after requests
Hard to distinguish malice from eager usage
- Don't want to hurt eager users if utilization is low
Number of active users changes over time
- Quotas are inappropriate

Fair Storage Allocation
Our solution: give each client a fair share
- Will define fairness in a few slides
Limits the strength of malicious clients
- Only as powerful as they are numerous
Protect storage on each DHT node separately
- Must protect each subrange of the key space
- Rewards clients that balance their key choices

The Problem of Starvation
Fair shares change over time
- Decrease as system load increases
[Timeline: Client 1 arrives and fills 50% of the disk; Client 2 arrives and fills 40%; Client 3 arrives and its max share is only 10%. Starvation!]

Preventing Starvation
Simple fix: add a time-to-live (TTL) to puts
- put(key, value) becomes put(key, value, ttl)
Prevents long-term starvation
- Eventually all puts will expire

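The TTL fix can be sketched as a small store that lazily expires values; this is an illustration of the put(key, value, ttl) semantics, not OpenDHT's actual storage code.

```python
import heapq

class TTLStore:
    """Put store with TTLs: no value outlives its TTL, so space is reclaimed."""
    def __init__(self):
        self.values = {}    # key -> (value, expiry_time)
        self.expiries = []  # min-heap of (expiry_time, key)

    def put(self, key, value, ttl, now):
        self.values[key] = (value, now + ttl)
        heapq.heappush(self.expiries, (now + ttl, key))

    def get(self, key, now):
        entry = self.values.get(key)
        if entry is None or entry[1] <= now:
            return None     # absent or already expired
        return entry[0]

    def expire(self, now):
        """Reclaim storage for all values whose TTL has passed."""
        while self.expiries and self.expiries[0][0] <= now:
            _, key = heapq.heappop(self.expiries)
            if key in self.values and self.values[key][1] <= now:
                del self.values[key]

store = TTLStore()
store.put("k", "v", ttl=10, now=0)
assert store.get("k", now=5) == "v"
store.expire(now=11)
assert store.get("k", now=11) is None
```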
Preventing Starvation (cont.)
TTLs prevent long-term starvation, but short-term starvation is still possible
[Timeline: Client A arrives and fills the entire disk; Client B arrives and asks for space; B starves until A's values start expiring]

Preventing Starvation
Stronger condition: be able to accept r_min bytes/sec of new data at all times
This is non-trivial to arrange!
[Figure: space vs. time. Each stored put occupies its size until its TTL expires. The stored puts plus the region reserved for future puts (a line of slope r_min) must sum to less than the maximum capacity; a candidate put is admitted only if it fits under that line.]

Preventing Starvation (cont.)
[Figure: two space-vs-time diagrams; in the second, an accepted put pushes stored space above the reserved r_min line. Violation!]

Preventing Starvation
Formalize the graphical intuition:
  f(τ) = B(t_now) - D(t_now, t_now + τ) + r_min · τ
where B(t_now) is the total bytes currently stored and D(t_now, t_now + τ) is the aggregate size of puts expiring in the interval (t_now, t_now + τ)
To accept a put of size x and TTL l:
  f(τ) + x < C for all 0 ≤ τ < l
Can track the value of f efficiently with a tree
- Leaves represent the inflection points of f
- Add put and shift time are O(log n), with n = number of puts

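The acceptance test only needs f evaluated at its inflection points: f rises with slope r_min between expirations and drops at each one, so its supremum over [0, l) is attained just before an expiry or as τ approaches l. A hedged O(n) sketch of that check (the slides' tree structure makes each operation O(log n) instead):

```python
def can_accept(puts, now, x, ttl, C, r_min):
    """puts: list of (expiry_time, size) for currently stored values.
    Returns True iff f(tau) + x < C for all 0 <= tau < ttl."""
    live = [(e, s) for e, s in puts if e > now]
    B = sum(s for _, s in live)  # B(t_now): bytes currently stored
    # candidate taus: just before each expiry in the window, and ttl itself
    taus = sorted({e - now for e, _ in live if e - now <= ttl} | {ttl})
    for tau in taus:
        # D: bytes expiring strictly before now + tau (left limit of f)
        D = sum(s for e, s in live if e - now < tau)
        if B - D + r_min * tau + x >= C:
            return False
    return True

# Empty disk, C=100, r_min=1: a (size=50, ttl=10) put leaves room for the
# reserved slope; a size-95 put does not.
assert can_accept([], now=0, x=50, ttl=10, C=100, r_min=1)
assert not can_accept([], now=0, x=95, ttl=10, C=100, r_min=1)
# With 60 bytes expiring at t=5, a 30-byte put still fits under the line...
assert can_accept([(5, 60)], now=0, x=30, ttl=10, C=100, r_min=1)
# ...but a 40-byte one violates f just before that expiry (60 + 5 + 40 >= 100).
assert not can_accept([(5, 60)], now=0, x=40, ttl=10, C=100, r_min=1)
```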
Fair Storage Allocation
[Figure: per-client put queues feed an admission stage]
- Queue full: reject the put
- Not full: enqueue the put
- Select the most under-represented client
- Wait until the put can be accepted without violating r_min
- Store the value and send an accept message to the client
The big decision: the definition of "most under-represented"

Defining Most Under-Represented
Not just sharing disk, but disk over time
- A 1-byte put for 100 s is the same as a 100-byte put for 1 s
- So the units are byte-seconds; call them commitments
Equalize total commitments granted?
- No: leads to starvation
- A fills the disk, B starts putting, A starves for up to the max TTL
[Timeline: Client A arrives and fills the entire disk; Client B arrives and asks for space; B catches up with A; now A starves!]

Defining Most Under-Represented
Instead, equalize the rate of commitments granted
- Service granted to one client depends only on others putting at the same time
[Timeline: Client A arrives and fills the entire disk; Client B arrives and asks for space; B catches up with A; then A and B share the available rate]

Defining Most Under-Represented (cont.)
Mechanism inspired by Start-time Fair Queuing
- Maintain a virtual time v(t)
- Each put p_c^i (client c's i-th put) gets a start time S(p_c^i) and finish time F(p_c^i):
  F(p_c^i) = S(p_c^i) + size(p_c^i) · ttl(p_c^i)
  S(p_c^i) = max(v(A(p_c^i)) - ε, F(p_c^(i-1))), where A(p_c^i) is the put's arrival time
  v(t) = maximum start time of all accepted puts

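The tag computation above can be sketched directly; commitments are in byte-seconds. This is a simplification: ε, the per-client queues, and the selection of the minimum-tag put are omitted, so it only illustrates how the tags prevent an early-arriving client from banking credit.

```python
class CommitmentFairness:
    """Start/finish tags from the slides' SFQ-inspired mechanism (sketch)."""
    def __init__(self, epsilon=0.0):
        self.v = 0.0           # virtual time: max start tag accepted so far
        self.last_finish = {}  # per-client finish tag of its previous put
        self.epsilon = epsilon

    def tag(self, client, size, ttl):
        """Assign start and finish tags to client's next accepted put."""
        S = max(self.v - self.epsilon, self.last_finish.get(client, 0.0))
        F = S + size * ttl         # commitment granted, in byte-seconds
        self.last_finish[client] = F
        self.v = max(self.v, S)    # v(t) = max start time of accepted puts
        return S, F

fq = CommitmentFairness()
# A puts alone for a while and accumulates commitments...
for _ in range(3):
    fq.tag("A", size=1, ttl=1)
# ...then B arrives: its first start tag jumps to the current virtual time,
# so B competes with A from now on but is not owed A's past usage.
S_b, _ = fq.tag("B", size=1, ttl=1)
assert S_b == 2.0
```

Taking the max with v(arrival) is the key design choice: a newly arriving client starts at the current virtual time rather than at zero, which is exactly what rules out the "B catches up, then A starves" behavior of equalizing totals.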
FST Performance
[Performance evaluation figures not recoverable from this extraction]