Tapestry: Decentralized Routing and Location

Tapestry: Decentralized Routing and Location
System Seminar 590S, Ben Y. Zhao, CS Division, U. C. Berkeley

Challenges in the Wide-area
Trends:
- Exponential growth in CPU, bandwidth, and storage
- Network expanding in reach and bandwidth
Can applications leverage the new resources?
- Scalability: increasing users, requests, traffic
- Resilience: more components means inversely lower MTBF
- Management: intermittent resource availability means complex management schemes
Proposal: an infrastructure that solves these issues and passes the benefits on to applications.

Cluster-based Applications
Advantages:
- Ease of fault-monitoring
- Communication on LANs: low latency, high bandwidth
- Abstracted communication, shared state
- Simple load balancing
Limitations:
- Centralization as a liability: centralized network link, centralized power source
- Geographic locality
- Scalability limits: outgoing bandwidth, power consumption, physical resources (space, cooling)
- Non-trivial deployment

Global Computation Model
A wish list for global-scale application services:
- Global self-adaptive system
- Utilize all available resources
- Decentralize all functionality: no bottlenecks, no single points of vulnerability
- Exploit locality whenever possible: localize the impact of failures
- Peer-based monitoring of failures and resources

Driving Applications
Leverage the proliferation of cheap and plentiful resources: CPUs, storage, network bandwidth.
Global applications share distributed resources:
- Shared computation: SETI, Entropia
- Shared storage: OceanStore, Napster, Scale-8
- Shared bandwidth: application-level multicast, content distribution

Key: Location and Routing
Hard problem: locating and messaging to resources and data.
Approach: a wide-area overlay infrastructure that is
- Easier to deploy than lower-level solutions
- Scalable: millions of nodes, billions of objects
- Available: detects and survives routine faults
- Dynamic: self-configuring, adaptive to the network
- Locality-aware: localizes the effects of operations and failures
- Load-balancing

Talk Outline
- Problems facing wide-area applications
- Previous work: location services & PRR97
- Tapestry: mechanisms and protocols
- Preliminary evaluation
- Sample application: Bayeux
- Related and future work

Previous Work: Location
Goal: given an ID or description, locate the nearest object.
Location services (scalability via hierarchy): DNS, Globe, Berkeley SDS
Issues:
- Consistency for dynamic data
- Scalability at the root
- Centralized approach: bottleneck and vulnerability

Decentralizing Hierarchies
Centralized hierarchies: each higher-level node is responsible for locating objects in a greater domain.
Decentralize: create a tree for each object O (really!)
- Object O has its own root and subtree
- A server on each level keeps a pointer to the nearest object in its domain
- Queries search up the hierarchy
- Root ID = O; directory servers track replicas

What is Tapestry?
A prototype of a decentralized, scalable, fault-tolerant, adaptive location and routing infrastructure.
Network layer of OceanStore (Zhao, Kubiatowicz, Joseph et al., U.C. Berkeley).
Suffix-based hypercube routing; core system inspired by Plaxton, Rajaraman, Richa (SPAA 97).
Core API:
- publishObject(ObjectID, [ServerID])
- sendmsgToObject(ObjectID)
- sendmsgToNode()

PRR (SPAA 97)
Namespace (nodes and objects) large enough to avoid collisions (~2^160?); size N, in log2(N) bits.
Insert object:
- Hash the object into the namespace to get its ObjectID
- for (i = 0; i < log(N); i += j) {   // define the hierarchy; j is the digit size in bits (j = 4 for hex digits)
    insert an entry into the nearest node that matches on the last i bits;
    when no match is found, pick the node matching on (i - j) bits with the highest ID value, and terminate
  }

PRR97 Object Lookup
Lookup traverses the same relative nodes as insert, but searches for the entry at each node:
- for (i = 0; i < log(N); i += j) { search for the entry in the nearest node matching on the last i bits }
Each object maps to a hierarchy defined by a single root: f(ObjectID) = RootID.
Publish and search both route incrementally toward the root.
The root node f(O) is responsible for knowing the object's location.
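A minimal sketch of the publish/lookup pattern the PRR97 slides above describe. This is my own illustration, not code from the talk; the overlay routes themselves are assumed to be computed by the suffix-based mesh, and object_root() is only a stand-in for the real surrogate mapping f(O). Publishing walks the path from the storage server toward the object's root and drops a location pointer at every hop; a lookup walks its own path toward the same root and stops at the first pointer it finds.

```python
import hashlib


def object_root(object_id: str, nodes: list[str]) -> str:
    """Deterministic object-to-root mapping f(O). Here: the node whose hashed
    ID is numerically closest to the object's hash (a stand-in for the
    suffix/surrogate routing covered in the backup slides)."""
    target = int(hashlib.sha1(object_id.encode()).hexdigest(), 16)
    return min(nodes, key=lambda n: abs(
        int(hashlib.sha1(n.encode()).hexdigest(), 16) - target))


def publish(object_id: str, server: str, route_to_root: list[str], pointers: dict) -> None:
    """publishObject(): leave an <ObjectID -> server> pointer at every overlay
    hop between the storage server and f(O)."""
    for node in route_to_root:
        pointers.setdefault(node, {})[object_id] = server


def lookup(object_id: str, route_to_root: list[str], pointers: dict) -> str | None:
    """sendmsgToObject(): route toward f(O) and stop at the first node that
    already holds a pointer for the object."""
    for node in route_to_root:
        server = pointers.get(node, {}).get(object_id)
        if server is not None:
            return server
    return None
```

Because a nearby client's path to f(O) tends to intersect the publish path early, lookups usually resolve to a nearby replica, which is the locality property measured later in the talk.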

Basic PRR Mesh
Incremental suffix-based routing.
[Figure: mesh of hex node IDs; each hop resolves one more trailing digit of the destination ID]

PRR97 Routing to Nodes
[Figure: example route from node 0057 toward node 6750 using octal digits; node 0057's neighbor map is indexed by routing level and next digit, and successive hops (...xxx0, ...xx50, ...x750, 6750) match one more trailing digit of the destination]

Use of Plaxton Mesh
Randomization and locality.

PRR97 Limitations
Setting up the routing tables:
- Uses global knowledge
- Supports only static networks
Finding the way up to the root:
- Sparse networks: find the node with the highest ID value
- What happens as the network changes? Need a deterministic way to find the same node over time
Result: good analytical properties, but fragile in practice and limited to small, static networks.

Talk Outline (repeated)

Tapestry Contributions
PRR97 benefits inherited by Tapestry:
- Scalable: state b*log_b(N), hops log_b(N)  (b = digit base, N = namespace size)
- Exploits locality
- Proportional route distance
PRR97 limitations:
- Global-knowledge algorithms
- Root node vulnerability
- Lack of adaptability
Tapestry, a real system!
- Distributed algorithms: dynamic root mapping, dynamic node insertion
- Redundancy in location and routing
- Fault-tolerance protocols
- Self-configuring / adaptive
- Support for mobile objects
- Application infrastructure
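A sketch of the per-node neighbor map behind the b*log_b(N) state figure quoted above. The layout and names are my assumptions, not the talk's code: log_b(N) levels of b slots, where slot (level, digit) holds some node that shares the last `level` digits with this node and has `digit` as its next trailing digit.

```python
from typing import Optional

BASE = 16        # hex digits, b = 16
ID_DIGITS = 8    # log_b(N) levels for a 32-bit namespace


class NeighborMap:
    def __init__(self, node_id: str):
        self.node_id = node_id
        # ID_DIGITS levels of BASE slots => b * log_b(N) routing entries per node
        self.table: list[list[Optional[str]]] = [[None] * BASE for _ in range(ID_DIGITS)]

    def _shared_suffix(self, other: str) -> int:
        n = 0
        for a, b in zip(reversed(self.node_id), reversed(other)):
            if a != b:
                break
            n += 1
        return n

    def add_neighbor(self, other: str) -> None:
        """File a known node under the (level, digit) slot it can serve."""
        level = self._shared_suffix(other)
        if level < ID_DIGITS:
            digit = int(other[-(level + 1)], BASE)
            # A real node keeps the network-closest candidate per slot; we keep the first.
            if self.table[level][digit] is None:
                self.table[level][digit] = other

    def next_hop(self, dest: str) -> Optional[str]:
        """One routing step: resolve the next unmatched suffix digit of dest.
        A None result means the slot is empty and surrogate routing takes over."""
        level = self._shared_suffix(dest)
        if level >= ID_DIGITS:
            return None        # dest is this node
        digit = int(dest[-(level + 1)], BASE)
        return self.table[level][digit]
```

Each hop hands the message to a node sharing at least one more trailing digit with the destination, so any route finishes in at most log_b(N) overlay hops.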

Fault-tolerant Location
Minimize soft-state vs. explicit fault-recovery.
Multiple roots:
- Objects hashed with small salts => multiple names/roots
- Queries and publishing utilize all roots in parallel
- P(finding a reference under a partition) = 1 - (1/2)^n, where n = # of roots
Soft-state periodic republish:
- 50 million files/node, daily republish, b = 16, N = 2^160: worst-case update traffic 56 kb/s; expected traffic in a realistic deployment 9 kb/s

Fault-tolerant Routing
Detection:
- Periodic probe packets between neighbors
- Selective NACKs
Handling:
- Each entry in the routing map has alternate nodes
- Second-chance algorithm for intermittent failures
- Long-term failures: alternates found via routing tables
Protocols:
- Reactive adaptive routing
- Proactive duplicate-packet routing

Dynamic Insertion
Operations necessary for a new node N to become fully integrated:
- Step 1: Build up N's routing maps
  - Send messages to each hop along the path from the gateway to the current node N' that best approximates N
  - The i-th hop along the path sends its i-th level route table to N
  - N optimizes those tables where necessary
- Step 2: Move appropriate data from N' to N
- Step 3: Use back pointers from N' to find nodes that have null entries for N's ID, and tell them to add an entry for N
- Step 4: Notify local neighbors to modify their paths to route through N where appropriate

Dynamic Insertion Example
[Figure: a new node (NEW) joins through a Gateway node; hex-labeled nodes along the route hand over their per-level route tables]

Summary
- Decentralized location and routing infrastructure
- Core design from PRR97
- Distributed algorithms for object-root mapping and node insertion
- Fault handling with redundancy, soft-state beacons, and self-repair
Analytical properties:
- Per-node routing table size: b*log_b(N)  (N = size of the namespace, n = # of physical nodes)
- Find an object in log_b(n) overlay hops
Key system properties:
- Decentralized and scalable via random naming, yet retains locality
- Adaptive approach to failures and environmental changes

Talk Outline (repeated)
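A small sketch of the multiple-roots idea from the fault-tolerant location slide. The salted-hash scheme and function names are my assumptions for illustration: each object is published under several salted names, so it acquires several independent roots, and a query succeeds as long as any one of them stays reachable.

```python
import hashlib


def salted_roots(object_id: str, nodes: list[str], n_roots: int = 4) -> list[str]:
    """Map one ObjectID to n_roots root nodes by hashing it with small salts."""
    roots = []
    for salt in range(n_roots):
        h = int(hashlib.sha1(f"{object_id}:{salt}".encode()).hexdigest(), 16)
        # Stand-in for Tapestry's deterministic surrogate mapping f(O):
        # pick the node whose hashed ID is numerically closest to the salted hash.
        root = min(nodes, key=lambda nid: abs(
            int(hashlib.sha1(nid.encode()).hexdigest(), 16) - h))
        roots.append(root)
    return roots


def p_find(n_roots: int) -> float:
    """If a partition independently cuts off each root with probability 1/2,
    P(at least one root reachable) = 1 - (1/2)^n."""
    return 1 - 0.5 ** n_roots
```

With n = 4 roots, p_find(4) = 1 - (1/2)^4 = 0.9375, matching the slide's formula.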

Evaluation Issues
- Routing distance overhead (RDP)
- Routing redundancy => fault tolerance:
  - Availability of objects and references
  - Message delivery under link/router failures
  - Overhead of fault handling
- Optimality of dynamic insertion
- Locality vs. storage overhead
- Performance stability via redundancy

Results: Location Locality
[Plot: RDP vs. object distance on a TIERS 5000-node topology, with and without locality pointers; measures the effectiveness of locality pointers]

Results: Stability via Redundancy
[Plot: retrieving objects with multiple roots; average latency (hop units) and aggregate bandwidth per object (kb/s) vs. the number of roots utilized. Parallel queries are issued to multiple roots; aggregate bandwidth includes the bandwidth for a daily soft-state republish plus requests arriving at a rate of 1/s]

Talk Outline (repeated)

Example Application: Bayeux
- Application-level multicast
- Leverages Tapestry: scalability, fault-tolerant data delivery
- Novel optimizations:
  - Self-forming member group partitions
  - Group ID clustering for better bandwidth utilization
[Figure: multicast tree rooted at the Root node, branching on group IDs of the form ***0, **00, *000]

Related Work
- Content Addressable Networks: Ratnasamy et al. (ACIRI / UCB)
- Chord: Stoica, Morris, Karger, Kaashoek, Balakrishnan (MIT / UCB)
- Pastry: Druschel and Rowstron (Rice / Microsoft Research)
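For reference, the RDP metric used in the locality results is simply overlay stretch; a two-line sketch in my own wording of the standard definition:

```python
def rdp(overlay_path_latency: float, direct_ip_latency: float) -> float:
    """Relative Delay Penalty: overlay route latency divided by the direct IP
    route latency. RDP = 1.0 means the overlay adds no stretch."""
    return overlay_path_latency / direct_ip_latency


# e.g. a 48 ms overlay path vs. a 30 ms direct route gives RDP = 1.6
assert abs(rdp(48.0, 30.0) - 1.6) < 1e-9
```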

Future Work
- Explore the effects of parameters on system performance via simulations
- Explore stability via statistics
- Show the effectiveness of the application infrastructure
- Build novel applications, scale existing apps to the wide area:
  - Silverback / OceanStore: global archival systems
  - Fault-tolerant adaptive routing
  - Network-embedded directory services
- Deployment:
  - Large-scale time-delayed event-driven simulation
  - Real wide-area network of universities / research centers

For More Information
Tapestry: http://www.cs.berkeley.edu/~ravenben/tapestry
OceanStore: http://oceanstore.cs.berkeley.edu
Related papers:
- http://oceanstore.cs.berkeley.edu/publications
- http://www.cs.berkeley.edu/~ravenben/publications
Contact: ravenben@cs.berkeley.edu

Backup Slides Follow

Dynamic Root Mapping
Problem: choosing a root node for every object that is
- Deterministic over network changes
- Globally consistent
Assumptions:
- All nodes with the same matching suffix contain the same null/non-null pattern in the next level of the routing map
- Requires consistent knowledge of nodes across the network

PRR Solution
Given a desired ID N:
- Find the set S of existing nodes matching the most suffix digits with N
- Choose S_i, the node in S with the highest-valued ID
Issues:
- The mapping must be generated statically using global knowledge
- It must be kept as hard state to operate in a changing environment
- The mapping is not well distributed; many nodes get no mappings

Tapestry Solution
A globally consistent distributed algorithm:
- Attempt to route toward the desired ID N
- Whenever a null entry is encountered, choose the next higher non-null pointer entry
- If the current node S holds the only non-null pointer in the rest of the route map, terminate the route: f(N) = S
Assumes:
- Routing maps across the network are up to date
- Null/non-null properties are identical at all nodes sharing the same suffix
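A sketch of the surrogate-routing rule from the Tapestry Solution slide above. This is my simplification; real Tapestry applies the rule per level of the neighbor map. When the slot for the wanted digit is empty, deterministically step to the next higher non-empty slot, wrapping around, so every node sharing the suffix picks the same substitute.

```python
BASE = 16


def surrogate_digit(level_slots: list, wanted_digit: int) -> int | None:
    """Return wanted_digit if its slot is filled, otherwise the next higher
    filled slot (wrapping modulo BASE). Returns None if the whole level is
    empty, in which case the current node is the root f(N)."""
    for offset in range(BASE):
        digit = (wanted_digit + offset) % BASE
        if level_slots[digit] is not None:
            return digit
    return None
```

Because the slide's assumption holds (all nodes sharing a suffix see the same null/non-null pattern at the next level), every path deterministically converges on the same surrogate root.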

Analysis
Globally consistent deterministic mapping:
- Null entry => no node in the network with that suffix
- Consistent map => identical null entries across the route maps of nodes with the same suffix
Additional hops compared to the PRR solution:
- Reduces to the coupon collector problem, assuming a random distribution
- With n*ln(n) + c*n entries, P(all coupons) = 1 - e^(-c)
- For n = b and c = b - ln(b): P = 1 - b/e^b, a miss probability of b/e^b (about 1.8 x 10^-6 for b = 16)
- Number of additional hops: log_b(b^2) = 2
A distributed algorithm with minimal additional hops.

Dynamic Mapping Border Cases
Two cases:
A. A node disappeared and some node has not yet detected it:
   - Routing proceeds on the invalid link and fails
   - With no backup router, proceed to surrogate routing
B. A node entered the network but has not yet been detected:
   - Traffic goes to the surrogate node instead of the new node
   - The new node checks with the surrogate after all such nodes have been notified
   - Route info at the surrogate is moved to the new node

Content-Addressable Networks (CAN)
- Distributed hash table addressed in a d-dimensional coordinate space
- Routing table size: O(d)
- Hops: expected O(d * N^(1/d))  (N = size of the namespace in d dimensions)
- Efficiency via redundancy: multiple dimensions, multiple realities, reverse push of breadcrumb caches
- Assumes immutable objects

Chord
- Associates each node and object with a unique ID in a one-dimensional space
- Object O is stored by its successor, the first node whose ID follows O
- Finger table: entry i points to the node 2^i away in the namespace
- Table size: log2(n)  (n = total # of nodes)
- Find object: log2(n) hops
- Optimization via heuristics
[Figure: example Chord ring with finger pointers]

Pastry
- Incremental routing like Plaxton / Tapestry
- Object replicated at the x nodes closest to the object's ID
- Routing table size: b*log_b(N) + O(b)
- Finds objects in O(log_b N) hops
- Issues: does not exploit locality; the infrastructure controls replication and placement; consistency / security

Key Properties
- Logical hops through the overlay per route
- Routing state per overlay node
- Overlay routing distance vs. the underlying network: Relative Delay Penalty (RDP)
- Messages for insertion
- Load balancing
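For the Chord slide, a compact sketch of the ring and finger-table rules it summarizes. The static node set and identifiers are just an example of mine, not from the talk.

```python
from bisect import bisect_left

M = 6  # identifier space of 2^M IDs


def successor(key: int, ring: list[int]) -> int:
    """First node clockwise from key; ring is a sorted list of node IDs."""
    i = bisect_left(ring, key % (1 << M))
    return ring[i % len(ring)]


def finger_table(node: int, ring: list[int]) -> list[int]:
    """finger[i] = successor(node + 2^i) for i = 0..M-1; table size is M."""
    return [successor(node + (1 << i), ring) for i in range(M)]


ring = sorted([1, 8, 14, 21, 32, 38, 42, 48])
assert successor(26, ring) == 32       # key 26 is stored at node 32
print(finger_table(8, ring))           # node 8's pointers around the ring
```

The sketch shows only the successor rule and table construction; the actual O(log n)-hop lookup repeatedly forwards a query to the finger that most closely precedes the key.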

Comparing Key Metrics
Property                 Tapestry        Chord       CAN               Pastry
Parameter                base b          none        dimension d       base b
Logical path length      log_b N         log N       O(d * N^(1/d))    log_b N
Neighbor state           b*log_b N       log N       O(d)              b*log_b N + O(b)
Routing overhead (RDP)   O(1)            O(1)        O(1)?             O(1)?
Messages to insert       O(log_b N)      O(log N)    O(d * N^(1/d))    O(log_b N)
Mutability               app-dependent   app-dep.    immutable         ???
Load balancing           good            good        good              good
(Chord, CAN, and Pastry were designed as P2P indices.)