Peer-to-peer computing research a fad? Frans Kaashoek kaashoek@lcs.mit.edu NSF Project IRIS http://www.project-iris.net Berkeley, ICSI, MIT, NYU, Rice
What is a P2P system? (Diagram: several nodes connected to each other across the Internet.) A distributed system architecture: no centralized control; nodes are symmetric in function; large number of unreliable nodes. Enabled by technology improvements.
P2P: an exciting social development Internet users cooperating to share, for example, music files Napster, Gnutella, Morpheus, KaZaA, etc. Lots of attention from the popular press The ultimate form of democracy on the Internet The ultimate threat to copyright protection on the Internet
How to build robust services? Many critical services use the Internet Hospitals, government agencies, etc. These services need to be robust Node and communication failures Load fluctuations (e.g., flash crowds) Attacks (including DDoS)
Example: peer-to-peer data archiver Back up your hard disk to other users' machines Why? Backup is usually difficult Many machines have lots of spare disk space Requirements for a cooperative archiver: Divide data evenly among many computers Find data Don't lose any data High performance: backups are big More challenging than sharing music!
The promise of P2P computing Reliability: no central point of failure Many replicas Geographic distribution High capacity through parallelism: Many disks Many network connections Many CPUs Automatic configuration Useful in public and proprietary settings
Traditional distributed computing: client/server (Diagram: clients connecting to a central server across the Internet.) Successful architecture, and will continue to be so Tremendous engineering necessary to make server farms scalable and robust
Application-level overlays (Diagram: overlay nodes at several sites spanning multiple ISPs.) One overlay per application Nodes are decentralized; the NOC is centralized P2P systems are overlay networks without central control
Distributed hash table (DHT) (Diagram: layered architecture. A distributed application, such as Backup, calls put(key, data) and get(key) on the distributed hash table (DHash); the DHT calls lookup(key) on the lookup service (Chord), which returns the responsible node's IP address.) A DHT distributes data storage over perhaps millions of nodes Many applications can use the same DHT infrastructure
A DHT has a good interface put(key, value) and get(key) → value Simple interface! API supports a wide range of applications DHT imposes no structure/meaning on keys Key/value pairs are persistent and global Can store keys in other DHT values And thus build complex data structures
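The interface above can be sketched in a few lines. This is a hypothetical, single-machine stand-in (the class name and the in-memory dict are illustrative, not the actual DHash API), but it shows the two operations, content-hashed keys, and how storing keys inside other values builds linked structures:

```python
import hashlib

class ToyDHT:
    """Illustrative stand-in: a dict replaces storage spread over many nodes."""

    def __init__(self):
        self.store = {}

    def key_for(self, value: bytes) -> str:
        # Keys are opaque to the DHT; content hashes are a common choice.
        return hashlib.sha1(value).hexdigest()

    def put(self, key: str, value: bytes):
        self.store[key] = value

    def get(self, key: str) -> bytes:
        return self.store[key]

dht = ToyDHT()
block = b"file contents"
k = dht.put(k := dht.key_for(block), block) or k
assert dht.get(k) == block

# A value may itself contain keys, forming a linked data structure:
root = k.encode()              # a "directory" value holding another key
rk = dht.key_for(root)
dht.put(rk, root)
assert dht.get(dht.get(rk).decode()) == block
```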
A DHT makes a good shared infrastructure Many applications can share one DHT service Much as applications share the Internet Eases deployment of new applications Pools resources from many participants Efficient due to statistical multiplexing Fault-tolerant due to geographic distribution
Many applications for DHTs Streaming [SplitStream, …] Content distribution networks [Coral, Squirrel, …] File systems [CFS, OceanStore, PAST, Ivy, …] Archival/Backup store [HiveNet, Mojo, Pastiche] Censor-resistant stores [Eternity, FreeNet, …] DB query and indexing [PIER, …] Event notification [Scribe] Naming systems [ChordDNS, Twine, …] Communication primitives [I3, …] Common thread: data is location-independent
DHT implementation challenges 1. Scalable lookup 2. Handling failures 3. Network-awareness for performance 4. Data integrity 5. Coping with systems in flux 6. Balancing load (flash crowds) 7. Robustness with untrusted participants 8. Heterogeneity 9. Anonymity 10. Indexing Goal: simple, provably-good algorithms (the focus of this talk)
1. The lookup problem How do you find the node responsible for a key? (Diagram: a publisher calls put(key, data) and a client calls get(key); which of nodes N1 through N6 across the Internet holds the key?)
Centralized lookup (Napster) A central server (DB) knows where all keys are: the client sends lookup(key), the server replies that the key is at N4, and the client fetches the key/value pair from N4. Simple, but O(N) state at the server, and the server can be attacked (a lawsuit killed Napster)
Flooding queries (Gnutella) Lookup by asking every node: the client floods lookup(key) through the network until some node returns the key/value pair. Robust but expensive: O(N) messages per lookup
Routed lookups Each node has a numeric identifier (ID) Route lookup(key) in ID space (Diagram: Client1 calls put(key=50, data=…); Client2's get(key=50) is routed hop by hop through the ID space toward node N50.)
Routing algorithm goals Fair (balanced) key range assignments Easy to maintain routing table Dead nodes lead to timeouts Small number of hops to route a message Low stretch Stay robust despite rapid change Solutions: Small table and log(N) hops: CAN, Chord, Kademlia, Pastry, Tapestry, Koorde, etc. Big table and one/two hops: Kelips, EpiChord, etc.
Chord key assignments Each node has a 160-bit ID; the ID space is circular; data keys are also IDs A key is stored on the next higher node (e.g., on a circle with N32, N60, N90, N105, node N90 is responsible for keys K61 through K90, so K80 lives on N90) Good load balance Easy to find keys (slowly)
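The successor rule above is small enough to write out. A minimal sketch, using tiny integer IDs for readability (real Chord uses 160-bit SHA-1 IDs):

```python
def successor(node_ids, key, ring_bits=8):
    """Return the node responsible for `key`: the first node whose ID is
    at or after the key on the circle, wrapping past the highest ID."""
    space = 1 << ring_bits
    nodes = sorted(n % space for n in node_ids)
    for n in nodes:
        if n >= key % space:
            return n
    return nodes[0]  # wrap around the circle

nodes = [5, 32, 60, 90, 105]
assert successor(nodes, 20) == 32    # K20 stored on N32
assert successor(nodes, 80) == 90    # K80 stored on N90
assert successor(nodes, 110) == 5    # past the highest node: wrap to N5
```

Because node IDs are hashes, keys spread roughly evenly over the nodes, which is the load-balance property the slide claims.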
Chord's routing table The routing table lists nodes ½ way around the circle, ¼ way, 1/8 way, 1/16, 1/32, 1/64, 1/128, …, and the next node around the circle The table is small: log N entries
Chord lookups take O(log N) hops Each step goes at least halfway to the destination Lookups are fast: log N steps (Diagram: node N32 looks up key K19 on a ring of nodes N5 through N110, halving the remaining distance each hop until it reaches N20.)
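The halving argument can be made concrete with a toy finger table and greedy routing. This is a sketch, not the full Chord protocol (no joins, no stabilization), and the 8-bit ID space and node IDs are illustrative:

```python
BITS = 8
SPACE = 1 << BITS

def dist(a, b):
    # clockwise distance from a to b on the circle
    return (b - a) % SPACE

def successor(nodes, x):
    # the node responsible for point x: first node at or after x
    return min(nodes, key=lambda n: dist(x, n))

def fingers(nodes, n):
    # finger i points at the first node at least 2^i past n
    return [successor(nodes, (n + (1 << i)) % SPACE) for i in range(BITS)]

def lookup(nodes, start, key):
    cur, hops = start, 0
    while True:
        fs = fingers(nodes, cur)
        nxt = min(fs, key=lambda f: dist(f, key))
        if dist(nxt, key) >= dist(cur, key):
            # no finger gets closer without passing the key, so the
            # current node's immediate successor is responsible
            return fs[0], hops + 1
        cur, hops = nxt, hops + 1

nodes = [5, 10, 20, 32, 60, 80, 99, 110]
node, hops = lookup(nodes, start=32, key=19)
assert node == 20       # K19 is stored on N20
assert hops <= BITS     # at most log-many hops
```

Each forwarding step follows the finger that covers at least half the remaining clockwise distance, which is where the O(log N) bound comes from.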
2. Handling failures: redundancy Each node knows about the next r nodes on the circle Each key is stored by the r nodes after it on the circle (Diagram: key K19 stored at N20, N32, and N40.) To save space, each node stores only a piece of the block Collecting half the pieces is enough to reconstruct the block
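A sketch of the successor-list placement rule described above. For simplicity it places whole copies rather than erasure-coded fragments (the real system stores coded pieces, any half of which rebuild the block); node IDs are illustrative:

```python
def replicas(node_ids, key, r, ring_bits=8):
    """Return the r nodes after `key` on the circle: the key's successor
    and the r-1 nodes that follow it."""
    space = 1 << ring_bits
    nodes = sorted(n % space for n in node_ids)
    # index of the key's successor (wrap to 0 if key is past the last node)
    start = next((i for i, n in enumerate(nodes) if n >= key % space), 0)
    ring = nodes[start:] + nodes[:start]
    return ring[:r]

nodes = [5, 10, 20, 32, 40, 60, 80, 99, 110]
assert replicas(nodes, 19, 3) == [20, 32, 40]   # K19 held by N20, N32, N40

# If the primary (N20) dies, the key is still reachable at N32 and N40,
# and N32 becomes the new successor for K19.
survivors = [n for n in nodes if n != 20]
assert replicas(survivors, 19, 3) == [32, 40, 60]
```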
Redundancy handles failures Experiment: 1000 DHT nodes, 6 replicas for each key, average of 5 runs; kill a fraction of the nodes, then measure how many lookups fail (Plot: fraction of failed lookups vs. fraction of failed nodes.) All replicas must be killed for a lookup to fail
3. Reducing Chord lookup delay (Diagram: N20's routing entries reach distant nodes such as N40, N41, N80.) N20's routing table may point to distant nodes Any node with a close-enough ID could be used in the route instead Knowing about proximity could help performance Key challenge: avoid O(N²) pings
Estimate latency using a synthetic coordinate system Each node estimates its position Position = (x, y): synthetic coordinates x and y units are time (milliseconds) Distance between two nodes' coordinates predicts network latency Challenges: triangle-inequality violations, etc.
Vivaldi synthetic coordinates (Animated diagram: four nodes adjusting positions on a grid.) Each node starts with a random, incorrect position Each node pings a few other nodes to measure network latency (distance) Each node moves to make the measured distances match the coordinate distances Minimize force in a spring network
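The spring update can be sketched in a few lines. This is an illustrative simplification (a fixed step size, no adaptive error weighting, 2-D rather than the 3-D coordinates the talk uses); `vivaldi_step` and its constants are assumptions, not the published algorithm verbatim:

```python
import math
import random

def vivaldi_step(xi, xj, rtt, delta=0.25):
    """Move node i's coordinate xi along the line to peer j's coordinate xj
    by a fraction of the spring error (measured minus predicted latency)."""
    dx = [a - b for a, b in zip(xi, xj)]
    pred = math.hypot(*dx) or 1e-9          # predicted latency; avoid /0
    unit = [d / pred for d in dx]           # direction from j toward i
    err = rtt - pred                        # spring displacement
    return [a + delta * err * u for a, u in zip(xi, unit)]

# Two nodes 10 ms apart converge from random incorrect positions:
random.seed(1)
a = [random.uniform(0, 100) for _ in range(2)]
b = [random.uniform(0, 100) for _ in range(2)]
for _ in range(200):
    a = vivaldi_step(a, b, 10.0)
    b = vivaldi_step(b, a, 10.0)
pred = math.hypot(a[0] - b[0], a[1] - b[1])
assert abs(pred - 10.0) < 0.5   # coordinates now predict the 10 ms latency
```

Each update shrinks the error by a constant factor, which is why a few dozen pings per node suffice in the PlanetLab experiment on the next slide.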
Vivaldi in action Execution on 86 PlanetLab Internet hosts, using 3-D coordinates Each host pings only a few other random hosts Most hosts find useful coordinates after a few dozen pings
Vivaldi vs. network coordinates Vivaldi's coordinates match geography well, though over-sea distances shrink (faster than over-land) and the orientation of Australia and Europe is wrong Simulations confirm the results for larger networks
Vivaldi predicts latency well (Scatter plot: actual vs. predicted latency, 0–600 ms on both axes, clustered around the y = x line, with NYU and AUS points labeled.)
DHT fetches with Vivaldi (1) Choose the lowest-latency copy of the data (Bar chart: fetch time in ms, split into lookup and download, without Vivaldi vs. with replica selection.)
DHT fetches with Vivaldi (2) Choose the lowest-latency copy of the data Route the Chord lookup through nearby nodes (Bar chart: fetch time in ms, split into lookup and download, without Vivaldi, with replica selection, and with proximity routing.)
DHT implementation summary Chord for looking up keys Replication at successors for fault tolerance Fragmentation and erasure coding to reduce storage space Vivaldi network coordinate system for Server selection Proximity routing
Backup system on DHT Store file system image snapshots as hash trees Can access daily images directly Yet images share storage for common blocks Only incremental storage cost Encrypt data User-level NFS server parses file system images to present dump hierarchy Application is ignorant of DHT challenges DHT is just a reliable block store
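The snapshot-sharing idea can be sketched with a content-hashed block store. A dict stands in for the DHT, encryption is omitted, and the names are illustrative; the point is that identical blocks across daily snapshots are stored once, so each snapshot costs only its changed blocks plus a new root:

```python
import hashlib

store = {}  # stand-in for the DHT's reliable block store

def put_block(data: bytes) -> str:
    key = hashlib.sha1(data).hexdigest()
    store[key] = data        # same content -> same key: automatic sharing
    return key

def snapshot(blocks) -> str:
    # The root block holds the child keys; fetching it reaches every block.
    keys = [put_block(b) for b in blocks]
    return put_block("\n".join(keys).encode())

day1 = snapshot([b"boot", b"home", b"logs-mon"])
day2 = snapshot([b"boot", b"home", b"logs-tue"])  # shares 2 of 3 blocks

# Incremental cost: only the changed block and the new root were added.
assert len(store) == 6       # 4 unique data blocks + 2 roots
assert day1 != day2          # each daily image is directly addressable
```

A deeper tree (directories pointing at file blocks) works the same way, which is how the user-level NFS server can present each day's image while paying only incremental storage.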
Future work DHTs Improve performance Handle untrusted nodes Vivaldi Does it scale to larger and more diverse networks? Apps Need lots of interesting applications
Philosophical questions How decentralized should systems be? Gnutella versus content distribution networks Have a bit of both? (e.g., CDNs) Why does the distributed systems community have more problems with decentralized systems than the networking community? "A distributed system is a system in which a computer you don't know about renders your own computer unusable" (after Lamport) Examples: the Internet (BGP, NetNews)
Related Work Lookup algorithms: CAN, Kademlia, Koorde, Pastry, Tapestry, Viceroy, … DHTs: DOLR, PAST, OpenHash Network coordinates and springs: GNP, Hoppe's mesh relaxation Applications: Ivy, OceanStore, Pastiche, Twine, …
Conclusions Peer-to-peer promises some great properties DHTs are a good way to build peer-to-peer applications: Easy to program Single, shared infrastructure for many applications Robust in the face of failures Scalable to large numbers of servers Self-configuring Can provide high performance http://www.project-iris.net