Searching for Shared Resources: DHT in General

1 ELT-53206 Peer-to-Peer Networks Searching for Shared Resources: DHT in General Mathieu Devos Tampere University of Technology Department of Electronics and Communications Engineering Based on the original slides provided by A. Surak (TUT), K. Wehrle, S. Götz, S. Rieche (University of Tübingen), Jani Peltotalo (TTY), Olivier Lamotte (TTY) and the OSDI team 2006 ELT-53206 09.09.2015

2 Distributed Management and Retrieval of Data Challenges in P2P systems: Location of data among the distributed system (where to store the data, and how to request and recover it?) Scalability of the topology (keep the complexity, in Big O terms, low, with scalable storage capabilities) Fault tolerance and RESILIENCE (frequent changes, heterogeneous network) (Figure: a data item D in the overlay layer above physical peers such as peer-to-peer.info, berkeley.edu and planet-lab.org; "I have D, where to store D?" and "Where can I find D?")

3 Big O Notation Big O notation is widely used by computer scientists to concisely describe the behavior of algorithms. It typically describes the worst-case scenario and can characterize either the execution time required or the space used by an algorithm. Common orders: O(1) constant, O(log n) logarithmic, O(n) linear, O(n²) quadratic
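To make these orders concrete, here is a small illustrative sketch (not from the slides; the function names are ours) that compares worst-case comparison counts for a linear scan versus a binary search over n sorted items, i.e. the O(n) versus O(log n) gap:

    # Illustrative only: worst-case comparison counts for searching n sorted items.
    import math

    def linear_search_steps(n):
        return n                                   # O(n): may have to touch every item

    def binary_search_steps(n):
        return max(1, math.ceil(math.log2(n)))     # O(log n): halve the range each step

    for n in (1_000, 1_000_000):
        print(n, "items:", linear_search_steps(n), "vs", binary_search_steps(n), "comparisons")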

4 Comparison of strategies for data retrieval Strategies for storage and retrieval of data items scattered over distributed systems Central server (Napster) Flooding search (GNUTELLA, unstructured overlays) Distributed indexing (CHORD, PASTRY)

5 Central Server Simple strategy: the server stores information about locations. Node A (provider) tells server S that it stores item D. Node B (requester) asks server S for the location of D. Server S tells B that node A stores item D. Node B requests item D directly from node A. (Slides 6 to 9 repeat this text while the figure animates the four steps, with the label "A stores D" moving from node A to the server and then to node B.)
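As a minimal sketch of this strategy (class and method names are our own, not part of the slides), the central server can be modeled as an in-memory index that providers register with and requesters query in O(1):

    # Central-server lookup: the server holds the full location index.
    class IndexServer:
        def __init__(self):
            self.locations = {}                    # item -> set of provider addresses

        def register(self, item, provider):        # "A stores D"
            self.locations.setdefault(item, set()).add(provider)

        def lookup(self, item):                    # "Where is D?" answered in O(1)
            return self.locations.get(item, set())

    server = IndexServer()
    server.register("D", "nodeA:4711")             # Node A announces that it stores D
    print(server.lookup("D"))                      # Node B asks, then fetches D from node A directly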

10 Central Server Pros and Cons Pros: Search complexity O(1): just ask the server. Complex and fuzzy queries possible. Easy to implement. Cons: No scalability, since the server is stateful: O(n) node state in the server and O(n) network and server load. Easy target (single point of failure), also for lawsuits (Napster, TPB). Costs of maintenance, availability, scalability. Not suitable for systems with massive numbers of users. No self-sustainability, needs moderation. Measured against the challenges of slide 2 (location of data, scalability of the topology, fault tolerance and resilience), the central server falls short. YET: best approach for small and simple applications! (Slides 11 to 13 build up this list step by step.)

14 Flooding Search Applied in UNSTRUCTURED P2P systems: no information about the location of requested data in the overlay, content is stored only in the node providing it, fully distributed approach. Retrieval of data: no routing information for content, so it is necessary to ask as many systems as possible (or as many as necessary). Approaches: Flooding (high traffic load on the network, does not scale); highest-degree search (here degree = number of connections), which searches large areas quickly via highly connected nodes but still needs a large number of messages to identify the item uniquely. (Figure caption: "Where is D?")

15 Flooding Search No information about the location of data in the intermediate systems, so a broad search is necessary (graph theory). Node B (requester) asks neighboring nodes for item D. Nodes forward the request to further nodes (breadth-first search / flooding), with topology loop detection. Node A (provider of item D) sends D to requesting node B. In the example figure the message count grows from 3 to 8, 12 and finally 16 messages, plus too many Pongs; the figure legend shows the message types Ping, Pong, Query, QueryHit, Connect. (Slides 16 to 19 animate this search.)
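The following sketch (our own toy topology and names, not from the slides) shows how such a breadth-first flood with a TTL and loop detection behaves, and how quickly the message count grows even in a tiny overlay:

    # Toy flooding search with TTL and loop detection (Gnutella-style).
    from collections import deque

    def flood_search(graph, start, has_item, ttl=4):
        """graph: node -> list of neighbours; has_item: node -> bool."""
        seen, queue, messages, hits = {start}, deque([(start, ttl)]), 0, []
        while queue:
            node, ttl_left = queue.popleft()
            if has_item(node):
                hits.append(node)                   # a QueryHit would travel back to the requester
            if ttl_left == 0:
                continue
            for neighbour in graph[node]:
                messages += 1                       # one Query message per edge traversed
                if neighbour not in seen:           # topology loop detection
                    seen.add(neighbour)
                    queue.append((neighbour, ttl_left - 1))
        return hits, messages

    topology = {"B": ["1", "2"], "1": ["B", "3"], "2": ["B", "A"], "3": ["1"], "A": ["2"]}
    print(flood_search(topology, "B", lambda n: n == "A"))   # (['A'], 8) for this five-node graph

Even here, finding a single item costs eight messages; in a real overlay the count grows with the fan-out, which is exactly the scaling problem the next slides address.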

20 Motivations for a better solution Communication overhead vs. node state (chart): Flooding search sits at O(1) node state but O(n) communication overhead; its bottlenecks are communication overhead and false negatives (poor reliability). The central server sits at O(1) communication overhead but O(n) node state; its bottlenecks are memory, CPU, network and availability. A scalable solution should sit near O(log n) on both axes.

21 Motivations for a better solution The same chart with the Distributed Hash Table added: scalability in O(log n) for both node state and communication overhead, no false negatives, and resistance against changes (failures, attacks, short-time users such as freeriders). For example, with n = 1,000,000 nodes, O(log n) means on the order of log2(1,000,000) ≈ 20 hops and routing entries rather than a million.

22 Distributed Indexing Approach of distributed indexing schemes: data and nodes are mapped into the same address space; intermediate nodes maintain routing information to target nodes; efficient forwarding to the destination; definitive statement about the existence of content. Issues: How to join the topology? Maintenance of routing information is required under network events (join, leave, failures, attacks!). Fuzzy queries (e.g., wildcard searches) are not primarily supported.

23 Distributed Indexing Goal is scalable complexity for Communication effort: O(log n) hops Node state: O(log n) routing entries, i.e. routing among n nodes in O(log n) steps. (Figure: a query for H("my data") = 3107 routed across nodes with IDs 611, 709, 1008, 1622, 2011, 2207, 2906 and 3485, which map onto physical peers such as berkeley.edu, planet-lab.org and peer-to-peer.info.)

24 Fundamentals of Distributed Hash Tables Design challenges and desired characteristics: flexibility, reliability, scalability. Hash tables use a hash function to map identifying values, known as keys (e.g., a person's name), to their associated values (e.g., a phone number). Equal distribution of content among nodes is crucial for efficient lookup of content; a consistent hashing function has both advantages and disadvantages. Permanent adaptation to faults, joins and departures of nodes: assignment of responsibilities to new nodes, and re-assignment and re-distribution of responsibilities in case of node failure or departure.

25 Distributed Management of Data Nodes and keys of the hash table share the same address space (using IDs). Nodes are responsible for data in certain parts of the address space, and the part a node is in charge of may change as the network changes. Looking for data means finding the responsible node via intermediate nodes: the query is routed to the responsible node, which returns the key/value pair. Closeness is mathematical closeness of hashes, so a node in China might be "next to" a node in the USA (hashing!). The target node is not necessarily known in advance, nor whether the data is even available; the lookup nevertheless yields a deterministic statement about the availability of the data.

26 Addressing in Distributed Hash Tables Step 1: mapping of content and nodes into a linear space using consistent hashing. An m-bit identifier space is used for both keys and nodes, so the hash result has to fit into the space: mod 2^m, here m = 6, i.e. [0, ..., 2^m - 1], 64 possible IDs. E.g., Hash("ELT-53206-lec4.pdf") mod 2^m = 54; Hash("129.100.16.93") mod 2^m = 21. The address space is commonly viewed as a circle.
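A minimal sketch of this step, assuming SHA-1 as the hash function (the slide does not fix one, so the exact example IDs 54 and 21 will generally differ):

    # Map keys and node addresses into the same m-bit identifier space (m = 6 gives 64 IDs).
    import hashlib

    M = 6

    def dht_id(name, m=M):
        digest = hashlib.sha1(name.encode()).digest()
        return int.from_bytes(digest, "big") % (2 ** m)   # mod 2^m keeps the ID in [0, 2^m - 1]

    print(dht_id("ELT-53206-lec4.pdf"))      # a key ID somewhere in 0..63
    print(dht_id("129.100.16.93"))           # a node ID in the same space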

27 Addressing in Distributed Hash Tables Step 2: association of address ranges to the nodes, often with a small redundancy (overlapping of parts), and with continuous adaptation due to changes. The real (underlay) and logical (overlay) topologies are generally uncorrelated (a consequence of consistent hashing). Keys are assigned to their successor node in the identifier circle, i.e. the node with the next higher ID (remember the circular namespace).
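A small sketch of the successor rule, assuming a sorted list of node IDs (the IDs used here are illustrative):

    # A key is stored on its successor: the first node ID >= the key ID, wrapping around the circle.
    from bisect import bisect_left

    def successor(node_ids, key_id):
        i = bisect_left(node_ids, key_id)
        return node_ids[i % len(node_ids)]               # wrap to the smallest ID past the top

    nodes = sorted([3, 21, 35, 48, 60])                  # hypothetical 6-bit node IDs
    print(successor(nodes, 54))                          # 60: the next higher node ID
    print(successor(nodes, 62))                          # 3: wraps around the identifier circle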

28 Association of Address Space with Nodes (Figure: logical view of the Distributed Hash Table in the overlay layer, with nodes 611, 709, 1008, 1622, 2011, 2207, 2906 and 3485 on the identifier circle, and their mapping onto the real topology in the physical layer.)

29 Addressing in Distributed Hash Tables Step 3: locating the data (content-based routing) with minimum overhead. For comparison: O(1) with a centralized hash table (aka an expensive server); O(n) DHT hops without a finger table (left figure); O(log n) DHT hops to locate the object, with O(log n) keys and routing entries stored per node (right figure).
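To see the difference between the two figures, here is an illustrative hop-count comparison on a ring of n evenly spaced nodes. It is a simplification under our own assumptions (not a full Chord implementation): successor-only routing on one side, power-of-two finger jumps on the other:

    # Hop counts on a ring of n evenly spaced nodes: successor-only vs. finger-style jumps.
    def hops_successor_only(n, src, dst):
        return (dst - src) % n                    # one hop per node: O(n)

    def hops_with_fingers(n, src, dst):
        hops, pos = 0, src
        while pos != dst:
            dist = (dst - pos) % n
            jump = 1
            while jump * 2 <= dist:               # largest power-of-two jump that does not pass dst
                jump *= 2
            pos = (pos + jump) % n
            hops += 1
        return hops                               # roughly log2(n) hops

    n = 1024
    print(hops_successor_only(n, 0, 1000))        # 1000 hops
    print(hops_with_fingers(n, 0, 1000))          # 6 hops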

30 Routing to a Data Item Routing to a key/value pair: start the lookup at an arbitrary node of the DHT (or at a bootstrap node), route towards the requested data item (key), and the key/value pair is delivered to the requester. In our case the value is a pointer to the location of the file, since we are indexing resources for P2P exchanges. (Figure: key = H("TLT2626-lec4.pdf") = 54, value = (ip, port), i.e. a pointer to the location of the file: indirect storage; the lookup starts at an arbitrary or bootstrap initial node.)

31 How is content stored Direct / Indirect Direct storage: content is stored in the node responsible for H("my data"). Inflexible for large content, but fine for small amounts of data (< 1 KB); example: DNS queries. Indirect storage: the value is often the real storage address of the content, e.g. (IP, Port) = (134.2.11.140, 4711). More flexible, but one extra step is needed to reach the content. (Figure: item D with H_SHA-1(D) = 3107 stored indirectly, the responsible node pointing to the providing peer 134.2.11.68.)
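A tiny sketch of indirect storage; the dictionary stands in for the distributed table, and the names and addresses are illustrative:

    # Indirect storage: the DHT keeps only a pointer (IP, port); the content stays at the provider.
    pointers = {}

    def put_pointer(key_id, ip, port):
        pointers[key_id] = (ip, port)                    # value = location of the content

    def get_pointer(key_id):
        return pointers.get(key_id)                      # the requester then fetches from (ip, port)

    put_pointer(54, "134.2.11.140", 4711)
    print(get_pointer(54))                               # ('134.2.11.140', 4711)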

32 Node Arrival Joining of a new node: calculation of the node ID; the new node contacts the DHT via an arbitrary bootstrap node; assignment of a particular hash range; binding into the routing environment; copying of the key/value pairs of its hash range. (Figure: a new node with ID 3485 at address 134.2.11.68 joins the identifier circle.)
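A minimal sketch of the key handover when a node joins, assuming a flat dictionary per node and ignoring wrap-around at the top of the identifier circle:

    # On arrival, the new node takes over the keys in (predecessor_id, new_id] from its successor.
    def keys_to_transfer(successor_store, predecessor_id, new_id):
        moved = {k: v for k, v in successor_store.items() if predecessor_id < k <= new_id}
        for k in moved:
            del successor_store[k]
        return moved

    store = {50: "a", 54: "b", 60: "c"}                  # key/value pairs held by the successor
    print(keys_to_transfer(store, 48, 55))               # {50: 'a', 54: 'b'} move to the new node
    print(store)                                         # {60: 'c'} stays with the successor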

Node Arrival Chord Example (slides 33 to 35, figures only)

Node Failure and Departure Failure of a node: use of redundant/replicated data if a node fails; use of redundant/alternative routing paths if the routing environment fails. Departure of a node: partitioning of its hash range to neighbor nodes; copying of key/value pairs to the corresponding nodes; unbinding from the routing environment.

37 Reliability in Distributed Hash Tables Erasure codes and redundancy: an erasure code transforms a message of k symbols into a longer message (code word) of n symbols such that the original message can be recovered from a subset of the n symbols. Every time a node crashes a piece of the data is destroyed, so after some time the data may no longer be recoverable; therefore redundancy also needs replication of the data. Erasure coding is a form of forward error correction (the future of storage?). Replication: several nodes should manage the same range of keys, which also introduces new possibilities for underlay-aware routing.

Replication Example: Multiple Nodes in One Interval Each interval of the DHT may be maintained by several nodes at the same time. A fixed positive number K indicates how many nodes at least have to act within one interval, so each data item is replicated at least K times. (Figure: intervals 1 to 10 on the ring, each served by several nodes.)
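A small sketch of this replication rule, assuming each key is replicated to the K nodes that follow it on the ring (K = 3 here is just an example value):

    # Replicate every key to its K successor nodes so a single failure loses no data.
    from bisect import bisect_left

    def replica_nodes(node_ids, key_id, k=3):
        i = bisect_left(node_ids, key_id)
        return [node_ids[(i + j) % len(node_ids)] for j in range(k)]

    nodes = sorted([3, 21, 35, 48, 60])
    print(replica_nodes(nodes, 54, k=3))                 # [60, 3, 21]: three replicas on the circle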

Load Balancing in Distributed Hash Tables Initial assumption: the hash function distributes keys uniformly, so every node carries an equal load and load balancing is not needed; documents would be optimally distributed across nodes (nodes equally distributed across the address space, data equally distributed across nodes). Is this assumption justifiable? Example analysis of the distribution: 4096 Chord nodes and 500,000 documents give an optimum of ~122 documents per node, yet the measured frequency distribution of DHT nodes storing a certain number of documents shows that the distribution could benefit from load balancing.
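The imbalance can be reproduced with a rough simulation in the spirit of the slide's numbers (4096 nodes, 500,000 documents); the identifier space size and the random seed are our own assumptions:

    # Even with uniform hashing, interval sizes vary a lot, hence uneven per-node load.
    import random
    from bisect import bisect_left

    random.seed(1)
    SPACE = 2 ** 32
    nodes = sorted(random.sample(range(SPACE), 4096))
    load = {n: 0 for n in nodes}
    for _ in range(500_000):
        key = random.randrange(SPACE)
        i = bisect_left(nodes, key) % len(nodes)         # the successor node is responsible
        load[nodes[i]] += 1

    counts = sorted(load.values())
    print("min", counts[0], "median", counts[len(counts) // 2], "max", counts[-1])
    # The optimum would be ~122 documents everywhere; here the minimum is near zero
    # while the maximum is several times the optimum.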

Load Balancing Algorithms Several techniques have been proposed to ensure an equal data distribution (possible subject for a paper assignment): 1. Power of Two Choices: John Byers, Jeffrey Considine, and Michael Mitzenmacher, "Simple Load Balancing for Distributed Hash Tables". 2. Virtual Servers: Ananth Rao, Karthik Lakshminarayanan, Sonesh Surana, Richard Karp, and Ion Stoica, "Load Balancing in Structured P2P Systems". 3. Thermal-Dissipation-based Approach: Simon Rieche, Leo Petrak, and Klaus Wehrle, "A Thermal-Dissipation-based Approach for Balancing Data Load in Distributed Hash Tables". 4. A Simple Address-Space and Item Balancing: David Karger and Matthias Ruhl, "Simple, Efficient Load Balancing Algorithms for Peer-to-Peer Systems".

41 DHT Interfaces Generic interface of Distributed Hash Tables: provisioning of information with Put(key, value); requesting of information (search for content) with Get(key); the reply is the value. DHT approaches are interchangeable with respect to this interface. (Figure: a distributed application calls Put(key, value) and Get(key) on the Distributed Hash Table (CAN, Chord, Pastry, Tapestry, ...), which spans nodes 1 to N and returns the value.)
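A self-contained sketch of this generic interface; the class name, the 16-bit identifier space and SHA-1 are our assumptions, and a real DHT would of course distribute the per-node stores across machines rather than keep them in one process:

    # Generic DHT interface: the application only sees put(key, value) and get(key).
    import hashlib
    from bisect import bisect_left

    class ToyDHT:
        def __init__(self, node_ids, m=16):
            self.m = m
            self.nodes = sorted(node_ids)
            self.store = {n: {} for n in self.nodes}     # local storage per node

        def _id(self, key):
            digest = hashlib.sha1(key.encode()).digest()
            return int.from_bytes(digest, "big") % (2 ** self.m)

        def _responsible(self, key_id):
            i = bisect_left(self.nodes, key_id) % len(self.nodes)
            return self.nodes[i]                         # successor node on the circle

        def put(self, key, value):
            self.store[self._responsible(self._id(key))][key] = value

        def get(self, key):
            return self.store[self._responsible(self._id(key))].get(key)

    dht = ToyDHT([1000, 20000, 40000, 60000])
    dht.put("ELT-53206-lec4.pdf", ("134.2.11.140", 4711))
    print(dht.get("ELT-53206-lec4.pdf"))                 # the stored pointer, wherever it resides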

Comparison: DHT vs. DNS Traditional name services follow a fixed mapping: DNS maps a logical node name to an IP address. DHTs offer a flat/generic mapping of addresses that is not bound to particular applications or services; the value in (key, value) may be an address, a document, or other data.

Comparison: DHT vs. DNS Domain Name System: mapping from symbolic name to IP address; built on a hierarchical structure with root servers; names refer to administrative domains; specialized to search for computer names and services. Distributed Hash Table: mapping from key to value; can easily realize a DNS; does not need a special server; does not require a special name space; can find data that are stored independently of particular computers.

44 Comparison of Lookup Concepts

System                    Per Node State   Communication Overhead   Fuzzy Queries   No False Negatives   Robustness
Central Server            O(n)             O(1)                     yes             yes                  no
Flooding Search           O(1)             O(n²)                    yes             no                   yes
Distributed Hash Tables   O(log n)         O(log n)                 no              yes                  yes

45 Properties of DHTs Use of routing information for efficient search for content Keys are evenly (or not?) distributed across nodes of DHT No bottlenecks A continuous increase in number of stored keys is admissible Failure of nodes can be tolerated Survival of attacks possible Self-organizing system Simple and efficient realization Supporting a wide spectrum of applications Flat (hash) key without semantic meaning Value depends on application

46 Learning Outcomes Things to know Differences between lookup concepts Fundamentals of DHTs How DHT works Be ready for specific examples of DHT algorithms Try some yourself! Simulation using Python here : http://bit.ly/szazxt

47 Any questions? mathieu.devos@tut.fi