Design and Implementation of a Distributed Object Storage System on Peer Nodes

Diplomarbeit von Roger Kilchenmann aus Zürich

vorgelegt am Lehrstuhl für Praktische Informatik IV, Prof. Dr. W. Effelsberg, Fakultät für Mathematik und Informatik, Universität Mannheim, Mai 2002

Betreuer: Prof. Dr. E. Biersack, Institute Eurecom, Sophia Antipolis


Contents

Abstract
List of Figures

1 Introduction
  1.1 Evolution of Internet Applications
  1.2 A Reliable Storage Network
  1.3 Outline

2 Related Work
  2.1 File Sharing Applications
    Napster
    Gnutella
    FastTrack
    Swarmcast
  2.2 Distributed Storage Applications
    Freenet
    PAST
    Oceanstore/Silverback
    Cooperative File System (CFS)

3 Problem Analysis
  3.1 Overlay Routing Networks
    Chord
  3.2 Increasing Reliability with Redundancy
    IDA Approach
    Replication Approach
    Replica Placement
    Caching and Load Balancing
  3.3 Separation of Data and Metadata
    Metadata Access
    Data Access

4 Framework Design Overview
  4.1 Object Oriented Programming
    Classes and Objects
    Inheritance
    Polymorphism
    Java
  4.2 Framework Layers
    Message Layer
    Lookup and Routing Layer
    Block Storage Layer
    Application Layer

5 Message Layer Implementation
  5.1 Basic Node
  5.2 Thread Re-use
  5.3 Node-to-Node Communication
  5.4 Recursive Message Communication

6 Lookup and Routing Layer Implementation
  6.1 Identifiers
  6.2 Successors and Predecessors
  6.3 Fingers
  6.4 Lookup
    Iterative Method
    Recursive Method
    Virtual Node Tunnelling
  6.5 Join and Leave
    Stabilization Process
    Notification
    Node Join

7 Block Storage Layer Implementation
  7.1 Basic Elements
    Metadata
    Hash
    Paraloader
    Block Location Cache
    Block Replica Cache
  7.2 Storing a Block
  7.3 Fetching a Block
  7.4 Reorganization on Overlay Network Changes
    Metadata Reorganization
    Data Reorganization

8 Application Layer Implementation
  8.1 User Interface
  8.2 File Storage and Retrieval
  8.3 Block Event Methods

9 Performance
  9.1 Iterative vs. Recursive Lookup
  9.2 Virtual Node Tunnelling
  9.3 Overall File Performance

10 Conclusion and Future Work

Bibliography

A Package p2p.layer.message
B Package p2p.layer.lookup
C Package p2p.layer.storage
D Package p2p.layer.application
E Package cache

Ehrenwörtliche Erklärung


Abstract

This work describes the design and implementation of a distributed system that uses empty disk space on Internet hosts for reliable storage of data objects. To increase the reliability of the system, the objects are replicated and distributed to peer-nodes of an overlay network that is spanned over the participating hosts. The Chord overlay network provides a robust and well-scaling binding of objects to nodes, which is used to organize the object replicas in an environment of unreliable hosts that may join or leave the system frequently. It is robust against host failures, and the binding is resolved by an efficient lookup operation that runs in time logarithmic in the number of hosts. Congestion of nodes due to non-uniform object access patterns is avoided by caching and parallel data access, which both distribute the load over many nodes and overcome some disadvantages of Chord's deterministic overlay network topology. The major part of the prototype, implemented in Java, is a versatile object oriented framework architecture. A hierarchy of framework layers provides generalized solutions to problems in peer-to-peer networking. The file storage application itself is a thin layer on top of this framework.


List of Figures

3.1 Chord Key Distribution with Virtual Nodes
Chord Finger Example
Chord Lookup Example
Chord Lookup Hot-Spot
I-Hop and P-List Caching
Layer and Class Hierarchy
Thread Pool Message Processing
Iterative Lookup Pseudo-Code and Message Traffic
Recursive Lookup Pseudo-Code and Message Traffic
Node Join Pseudo-Code and Message Traffic
Thread Interaction in Parallel Download
Data Block Storage
Data Block Retrieval
Reorganization Caused by a Leaving Node
Reorganization Caused by a Joining Node
Splitting a File into Blocks
Comparison of the Iterative and Recursive Lookup Latencies
The Effect of Virtual Node Tunnelling on the Lookup Path Length
Overall File Performance


Chapter 1 Introduction

In the last years peer-to-peer applications received a lot of public attention. On the one hand there was the legal issue about content sharing, and on the other hand the peer-to-peer paradigm seemed to be a new idea to many people. But the early Internet was already designed like that, and the Usenet, which appeared in 1979 and is still very popular, can be seen as one of the first P2P applications because there is no hierarchy or central control and the Network News Transfer Protocol (NNTP) uses peer-to-peer communication between news servers [16]. This early application already shares most of the properties that characterize P2P applications [27]:

- They take advantage of distributed, shared resources such as storage, CPU cycles, and content on peer-nodes
- Peer-nodes have identical capabilities and responsibilities
- Communication between peer-nodes is symmetrical
- Significant autonomy from central servers provides fault tolerance
- They operate in a dynamic environment where frequent join and leave is the norm

The reason why most people consider peer-to-peer applications as something radically new is that in the last decade the paradigm of Internet applications changed from decentralized applications like the Usenet to the server-centric World Wide Web.

1.1 Evolution of Internet Applications

Between 1995 and 1999 the Internet became a mass medium, driven by the "killer" application World Wide Web (WWW). This changed the application paradigm and had an influence on the further development of the Internet architecture. The World Wide Web is a typical client/server application. A Web client, now called browser, connects to a well known server, which returns the page according to the request

of the client and closes the connection. Because the client initiates the communication, only the web server needs a permanent, well known Internet Protocol (IP) address. This behavior allowed the Internet service providers (ISPs) to satisfy the fast growing number of new Internet users by assigning temporary IP addresses to dial-up connections, because the limited IP address space of 2^32 addresses (4 bytes each) was too small to assign a permanent IP address to every user. Temporary dial-up connections together with unpredictable IP addresses demand new concepts to organize and maintain a distributed network. Another property of the client/server based WWW application had impact on the technical development. Because of the asymmetry in the WWW service (page requests are much smaller than the page replies), the dial-up technologies were developed considering that asymmetry. ADSL and V.90 modems have three to eight times higher downstream bandwidth than upstream bandwidth. By removing the distinction between clients and servers, P2P applications have symmetrical bandwidth characteristics. The upstream path of asymmetric connections will limit the total throughput between peer-nodes. Therefore, new mechanisms need to be introduced to use the available bandwidth resources more efficiently. To summarize, the originally symmetric and deterministic Internet architecture became asymmetric and dynamic due to changes in the application preferences. The new generation of P2P applications has to deal with a dynamic and asymmetric environment, which contradicts their inherent symmetry and imposes problems on reliability and efficiency. With technological progress and decreasing prices in computer hardware, new applications for personal computers became possible. Due to increasing hard disk capacities and faster processors, playing and storing audio and video content became very popular.
But exchanging multimedia content over the Internet was difficult and expensive for an inexperienced Internet user. Setting up a WWW or FTP server requires a decent amount of knowledge, and a permanent Internet connection is hardly affordable for a private person. The new generation of P2P applications from the late 90s offers a comfortable and easy way for everybody to publish and share content. This made P2P systems such as Napster very popular and resulted in about 38 million registered Napster users by October 2000 [20].

So far three different types of P2P applications have developed:

- P2P File Sharing - Content driven applications sharing bandwidth and storage resources to provide efficient content distribution and storage. Some related applications are presented in the following Chapter "Related Work".
- P2P Messaging - Human presence is shared across a distributed and decentralized system like the Groove collaboration network [17].
- P2P Computing - The distributed system shares CPU cycles for solving computing problems. The sum of idle CPU cycles on many workstations can replace very expensive supercomputers. A well known example is the SETI@home project [26], which uses idle CPUs on Internet hosts to analyze radio signals from outer space in order to find signals of intelligent origin.

1.2 A Reliable Storage Network

The Java based prototype presented in this work uses empty disk space available on Internet hosts to build a reliable storage system. By April 2002, a typical workstation PC is shipped with a hard disk of about 60 GB storage capacity. After the operating system and some other applications are installed, most of the capacity on the hard disk is still unused. For example, if the software takes 10 GB, the remaining free disk space is 50 GB. For an organization with 100 such workstations, the total amount of unused disk space is 5 TB. Nowadays, most workstations in an organization are connected by a local area network (LAN) that uses the Internet Protocol (IP). This work describes the design of such a system and explains the implementation of a simple file storage application, which uses an underlying framework architecture developed as a major part of this work. The goal is to achieve a maximum of reliability and fault tolerance for the storage service built out of Internet hosts, called nodes in this context. The storage network must be reliable while the nodes themselves are not.
Unlike dedicated file servers, the nodes are workstations generally not shipped with redundant power supplies or RAID (Redundant Array of Inexpensive Disks) systems. Since the workstations are under control of their users, their system availability is not predictable. Users may shut down their workstations or a network link may temporarily fail.

Assuming a heterogeneous and dynamic environment of hosts connected by a high bandwidth and low latency IP network, this work focuses on:

- reliability of data storage
- scalability in terms of the number of hosts and content requests
- efficient usage of the available resources

Further, it is assumed that there are no restrictions concerning firewalls and Network Address Translation (NAT) issues. All hosts are willing to cooperate by relaying messages, and they store data as long as their storage quotas have not been exceeded.

1.3 Outline

This work is organized as follows: In the Chapter "Related Work", existing applications with their solutions to P2P specific problems are analyzed. The next three chapters reflect the general object oriented approach of software development: problem analysis, design and implementation. First, the problem fields need to be identified and a general solution has to be developed. This is done in the Chapter "Problem Analysis". In the second step, the design breaks the general solution into software layers with well defined functionality and interfaces, described in the Chapter "Framework Design Overview". For each layer, the algorithms and some important implementation details are presented in its own chapter. In the Section "Performance", the effect of implementation alternatives and optimizations on performance is examined and the overall file storage performance is evaluated. This work closes with the Chapter "Conclusion and Future Work", where the results of this work are summarized and an outlook on future improvements is given. Important terms are emphasized with bold font when they are introduced. Class and method names are always emphasized with italics.

Chapter 2 Related Work

The field of P2P applications targeted at sharing storage resources can be divided into two groups. The first group offers file sharing and content distribution capabilities. Because content on hosts is shared, the main task for this group of applications is content location and distribution. The different methods of how content items are found and distributed are examined. The second group offers a distributed file system service. The ability of reliable and persistent storage distinguishes it from the first group. Apart from content location and distribution, the mechanisms to achieve reliability in a dynamic and unreliable environment are examined for this group of applications.

2.1 File Sharing Applications

Napster

Although Napster [10] is often referred to as the first P2P application, it does not follow a true P2P concept. Napster can be characterized as a client/server system for centralized content metadata lookup combined with direct client-to-client connections for content delivery. At startup the Napster client software connects to a central Napster server, authenticates itself with login and password and registers its shared content's metadata in a central index database. A content query is sent to the central index server, which processes the query by an index database lookup and returns to the client a list of matching content metadata records containing the network location of the client sharing the content item, its exact filename and some bandwidth and latency information. From this list the user has to choose a client from whom to download the content file. The download reliability is low because only a single unreliable source is used and a broken download is not automatically continued from a different source. Another conceptual problem is using a central server for content location, which is neither a reliable nor a scalable solution. Napster needs to operate several of those central servers to achieve fault tolerance and load balancing, because a single server can only handle a limited number of users simultaneously. Above this threshold, the server will reject connection requests and the client has to try another server. Several servers are necessary to serve the peak load, but at other times they will be idle, which results in poor resource utilization. After connecting to one of the central servers, the client stays connected to it for the whole session. Since each server maintains its own index database, a user will only see a restricted view of the total content available. The handicap of Napster is the centralized index, which simplifies the system but results in a single point of failure and a performance bottleneck.

Gnutella

To avoid the disadvantages of Napster, the Gnutella network is decentralized. The only central component is the host cache service, which is used by the servants, a Gnutella-specific term for a combined client and server, to find a bootstrap node. The Gnutella protocol uses a time-to-live (TTL) scoped flooding for servant and content discovery. A servant is permanently connected to a small number of neighbors. When a servant receives a request from one of its neighbors, it decreases the TTL counter of the request and forwards it to all its neighbors if the TTL is greater than zero. The reply is routed back along the reverse path. There are two important request/reply pairs. A Ping request for discovering new servants is answered with a Pong response message containing the IP address, TCP port and some status information. The other pair is the Query request, which contains query keywords and is answered with a QueryHit if the Query matches some files shared by the servant.
The QueryHit is routed back along the reverse path to the servant that initiated the Query and contains the necessary information to start a direct download of the file, which is done similarly to the HTTP GET command. The main disadvantage of Gnutella is the distributed search based on scoped flooding, which does not scale in terms of the number of servants [21] because the number of messages grows exponentially and consumes much of the servants' bandwidth. To reduce the message load, the next generation of the Gnutella protocol will introduce supernodes, which act as message routing proxies for clients with limited bandwidth. These clients, called shielded nodes, have only a single connection to one supernode, which shields them from routing Gnutella messages. The supernode concept is a result of the node heterogeneity observed in the real world. Not all nodes are really equal concerning their resources, and by far not all users want to share them.
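The exponential growth of the message count under TTL-scoped flooding can be illustrated with a small calculation. The following sketch assumes an idealized, cycle-free network in which every servant has the same number of neighbors; the class and method names are illustrative and not part of any Gnutella implementation.

```java
// Sketch: message count of TTL-scoped flooding in an idealized
// Gnutella-style network (no cycles, uniform degree d).
// The originator sends to d neighbors; every forwarding servant
// sends to its remaining d-1 neighbors until the TTL expires.
public class FloodEstimate {

    /** Total messages generated by one query with the given TTL. */
    static long messages(int ttl, int degree) {
        long total = 0;
        long frontier = degree;          // hop 1: d messages
        for (int hop = 1; hop <= ttl; hop++) {
            total += frontier;
            frontier *= (degree - 1);    // each recipient forwards to d-1 peers
        }
        return total;
    }

    public static void main(String[] args) {
        // With TTL 7 and 4 neighbors, a commonly cited early-Gnutella
        // configuration, a single query already floods thousands of messages.
        for (int ttl = 1; ttl <= 7; ttl++) {
            System.out.println("TTL " + ttl + ": " + messages(ttl, 4) + " messages");
        }
    }
}
```

Real Gnutella networks contain cycles, so duplicate messages are detected and dropped; the model above is therefore an upper bound on useful traffic, but it shows why the message count grows exponentially with the TTL.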

FastTrack

The FastTrack protocol, used in the KaZaA and Morpheus applications [8], is a hybrid, two-layered architecture in which peers connect to supernodes, which themselves are connected together. A supernode acts as a local search hub that maintains the index of the media files shared by each peer connected to it and proxies search requests on behalf of its local peers. FastTrack elects a peer with sufficient bandwidth and processing power to become a supernode if its user has allowed it in the configuration. A search result in FastTrack contains a list of files that match the search criteria. FastTrack uses parallel download and client side caching for file transfers. A file is logically split into segments, and these segments are downloaded from other peers that share the same file or, in the case of client side caching, are currently downloading this file and share the segments downloaded so far until the download is completed. This can increase the download speed significantly, especially for asymmetric dial-up connections, because the limited upstream bandwidths add up. As FastTrack is a proprietary protocol, it is so far difficult to evaluate what scaling properties the supernode network has.

Swarmcast

Swarmcast [15] is a content distribution network. The content provider has to host content on his own server, and Swarmcast's job is to boost the download and to ease the burden on the content provider's server. This is done by parallel downloading and locality based client side caching. For each file, a temporary mesh of client nodes downloading this file is maintained in order to find other close nodes to exchange file parts. The provider's file is broken into parts, which are then encoded into packets with a forward error correction code (FEC) [22]. An (n, k) forward error correction code encodes k source packets into n > k encoded packets.
The encoding is such that any subset of k encoded packets suffices to reconstruct the source data. Swarmcast reduces complexity and communication overhead by randomly sending the packets to other nodes in the mesh, and it uses FEC encoding to avoid the potential overlap of duplicate packets, which would otherwise drastically decrease the utility of each packet. The encoded packets are spread randomly among the nodes of the mesh, which exchange them until the nodes have enough packets to reconstruct the file. After downloading, the nodes should keep the packets in their cache to support the other nodes in the mesh. The system scales nicely because the more requests there are for a file, the more nodes join the mesh and the more packets are cached and exchanged.
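The (n, k) property can be demonstrated with the simplest nontrivial case: a (3, 2) XOR parity code, where any two of the three encoded packets suffice to recover the two source packets. This is only an illustration of the idea; real codecs such as Reed-Solomon codes generalize it to arbitrary n and k, and the class below is not part of Swarmcast.

```java
import java.util.Arrays;

// Illustration of the (n, k) FEC idea with a (3, 2) XOR parity code.
// Both source packets are assumed to have equal length.
public class XorParity {

    /** Encode two source packets into three packets:
     *  the two sources plus their XOR parity. */
    static byte[][] encode(byte[] a, byte[] b) {
        byte[] p = new byte[a.length];
        for (int i = 0; i < a.length; i++) p[i] = (byte) (a[i] ^ b[i]);
        return new byte[][] { a, b, p };
    }

    /** Reconstruct a lost source packet from the surviving source
     *  packet and the parity packet (a ^ b ^ b = a). */
    static byte[] recover(byte[] survivor, byte[] parity) {
        byte[] r = new byte[survivor.length];
        for (int i = 0; i < r.length; i++) r[i] = (byte) (survivor[i] ^ parity[i]);
        return r;
    }

    public static void main(String[] args) {
        byte[] a = "hello ".getBytes();
        byte[] b = "world!".getBytes();
        byte[][] packets = encode(a, b);
        // Pretend packet 0 (source a) was lost: any two packets suffice.
        byte[] restored = recover(packets[1], packets[2]);
        System.out.println(Arrays.equals(restored, a)); // true
    }
}
```

With a general (n, k) code a node never needs a specific packet, only any k distinct ones, which is exactly why random packet exchange in the mesh works without coordination.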

2.2 Distributed Storage Applications

Distributed storage applications must have an active replication strategy to increase reliability. File sharing and content distribution applications, in contrast, rely more on the fact that with a large number of users sharing content, the probability of content being available can be quite high, though without any guarantee. An important difference to file sharing applications is that distributed storage applications in general have a publish process, which adds content items to the system. The location of the content items is not predefined as it is for file sharing applications.

Freenet

Freenet [3] is a distributed publishing system which provides anonymity to publishers and consumers of content. An adaptive network is used to locate content by forwarding requests to nodes that are closer to the key that identifies a content item. On each hop, information about whether the item was found on this path travels in the backward direction and is temporarily stored on the nodes. The next request for the same key takes advantage of this information and gets routed directly to the content source. When a query reaches the content source, the content is propagated along the query's reverse path and cached in the intermediate nodes. Freenet thus uses an intelligent flooding search, where routing information and cached copies are stored along the path. The more requests for a content item, the more cached copies and routing information are available. If there has been no request for a content item in a period of time, the nodes discard it because all routing information about this item on the other nodes has already timed out and the item is not referenced anymore. As a consequence, published content is only stored persistently as long as there is enough demand to keep the routing information alive.
The content objects float around in the network, and there is only temporal and local knowledge about where the content is actually located. To provide anonymity to the publishers and consumers, there is no direct peer-to-peer data transfer. Instead, the content data is routed through the network. Nodes with low bandwidth may become a bottleneck to the system, and the flooding based content lookup scales badly.

PAST

PAST [6] is a persistent peer-to-peer storage utility which replicates complete files on multiple nodes. Pastry [24] is used for message routing and content location. PAST stores a content item on the node whose node identifier nodeid is closest to the file identifier fileid. Routing a message to the closest node is done by choosing a next hop node whose nodeid shares with the fileid a prefix that is at least one

digit longer than the prefix that the fileid shares with the present node's nodeid. The fileid is generated by hashing the filename, and the nodeid is assigned randomly when a node joins the network. The routing path length scales logarithmically in the overall number of nodes in the network. For each file, an individual replication factor k can be chosen, and replicas are stored on the k nodes that are closest to the fileid. Node failures are detected by the Pastry background process of exchanging heartbeat messages with neighbors, which is used to maintain the k replicas. When a node detects a neighbor node's failure, the replica is automatically replaced on another neighbor. Free storage space is used to cache files along the routing path while approaching the closest node during the publish or retrieval process. This can only be done if the file data is routed along the reverse query path; thus there is no direct peer-to-peer file transfer. Similar to the Freenet system, nodes with low bandwidth may become a bottleneck.

Oceanstore/Silverback

Silverback [30] is the archival layer of the Oceanstore system. For routing and content location, Tapestry [32] is used, which is a distributed version of Plaxton's hashed-suffix routing and is quite similar to Pastry. Therefore, the number of hops and messages is logarithmic in the total number of nodes in the network. A file is split into blocks, which are then encoded with a forward error correction (FEC) code into n fragments. The block's binary data is hashed into a blockid, which is used to route the n block fragments to the n closest nodes in terms of the most common suffix of the nodeids and the blockid. The fragments are periodically republished by a file's Responsible Party to increase reliability, and they are cached along the path to reduce the access latency and to balance the load over several nodes.
The system features a file version management, which uses tombstones to reduce storage consumption by storing only the difference of a file block compared to the latest tombstone version of that block.

Cooperative File System (CFS)

CFS [4] is a read only file system built on top of Chord [28], which is used for content location. Chord belongs to the same family of second generation peer-to-peer resource location services as Pastry and Tapestry, which use routing for content location. On each hop, the closest routing alternative is chosen to approach the closest node, defined by a metric on the identifiers generated by a hash function. The basic idea is that nodes closer to the routing target have a more detailed view of the target's neighborhood, and this knowledge is exploited to approach the target. Since Chord was chosen as the lookup and routing service for this work, it is described in detail in the next chapter. Right now, it is enough to know that the hashed identifiers are interpreted as n-bit numbers, which are arranged in a circle by the natural integer order. A file is split into blocks identified by blockids. By definition, the r block

replicas are stored at the successor node of the blockid and its r-1 immediate successor nodes. The successor node is the closest node to an item identified by a blockid or a nodeid; by definition it is the node that immediately follows the item's ID in the circle. When a node joins the circle, a block's successor node can change, and the network has to move some blocks to the new node to maintain the property of storing the block's replicas on the r closest nodes to the blockid. The blockid's successor node is responsible for maintaining the r-1 block replicas on its r-1 successor nodes by periodically verifying their availability and replacing replicas in case of a node failure. Similar to PAST and Silverback, cache replicas are stored along the reverse lookup path when the requested block data is returned, to reduce latency and to balance the load.
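The CFS placement rule can be sketched with a sorted set standing in for the identifier circle. The class and method names below are illustrative, not taken from CFS; the sketch assumes r does not exceed the number of nodes on the ring.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.TreeSet;

// Sketch of the CFS placement rule: the r replicas of a block live on
// the successor of the blockId and its r-1 immediate successor nodes.
// Node IDs and block IDs share one circular identifier space.
public class ReplicaPlacement {

    /** Sorted node IDs on the identifier circle. */
    private final TreeSet<Long> ring = new TreeSet<>();

    void addNode(long nodeId) { ring.add(nodeId); }

    /** The node at or immediately after id on the circle (wrap at zero). */
    long successor(long id) {
        Long s = ring.ceiling(id);
        return (s != null) ? s : ring.first();
    }

    /** The r nodes holding the replicas of blockId (assumes r <= ring size). */
    List<Long> replicaNodes(long blockId, int r) {
        List<Long> nodes = new ArrayList<>();
        long n = successor(blockId);
        while (nodes.size() < r) {
            nodes.add(n);
            n = successor(n + 1);   // next node clockwise on the circle
        }
        return nodes;
    }

    public static void main(String[] args) {
        ReplicaPlacement ring = new ReplicaPlacement();
        for (long id : new long[] { 10, 30, 50, 70, 90 }) ring.addNode(id);
        // Block 42 is stored on node 50 and, with r = 3, also on 70 and 90.
        System.out.println(ring.replicaNodes(42, 3)); // [50, 70, 90]
    }
}
```

The sketch also shows why a join forces reorganization: inserting a node with ID 45 would make it the new successor of block 42, so the replica set shifts to 45, 50 and 70.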

Chapter 3 Problem Analysis

Peer-to-peer applications aim to take advantage of shared resources in a dynamic environment where fluctuating participants are the norm. Hence there is a need for a resource centric addressing scheme working under dynamic conditions [27]. In this work, the resource of interest is free disk space available on Internet hosts. Each host is identified by a unique Internet Protocol (IP) address used for packet routing to this host. To establish communication with a host, an additional port number is necessary to identify the software that handles the communication process on that host. The IP address together with the port number is the network location, which is necessary to communicate with the software on a host managing its storage resources. Moreover, a content item needs a content name that identifies it among all other content items. A resource centric addressing scheme for storage related applications provides a binding from content names to network locations, which is resolved by a lookup operation [25]. The Domain Name System (DNS) [7] is an excellent example of such an addressing scheme. It is a host centric addressing scheme because it was introduced to map human readable host names to host IP addresses. A host name consists of domain names separated by dots, which are interpreted as a hierarchy where the last domain name is the top level domain, like com, net and org. Basically, there are two possibilities for how the binding is stored: in a single flat "hosts.txt" file, or distributed over a hierarchical topology of DNS servers, which store the necessary information to resolve a DNS lookup by traversing the hierarchy. Because the number of lookup steps is limited by the number of hierarchy levels and caching is used on all levels, the DNS scaled to many times its original size [16]. But the binding information in this system is maintained manually, and changes need hours, if not days, to penetrate through the system.
Therefore, it is not well suited for P2P systems with participants that on average stay in the system for less than an hour.

The way Napster resolves the host addresses compares to the single flat file in the DNS. The central real time index on a Napster server stores all binding information to map content name fragments (keywords) to IP addresses, which can be looked up by the clients. The disadvantages of such a solution were already discussed. The distributed lookup by flooding performed by Gnutella provides minimal lookup latency, but trades it against bandwidth and scalability because the number of messages and the bandwidth consumption grow exponentially with the number of nodes. For a file system application, the situation is different from that of file sharing applications. In file sharing applications the content is already stored on hosts, and it has to be discovered there, as Napster and Gnutella do. For an application like the one proposed in this work, the system itself decides where the content items are stored during the publish process, and therefore an addressing scheme based on an overlay network can be used, which resolves bindings by routing. The addressing scheme maps content items to nodes, and this mapping is used to store and retrieve the content items.

3.1 Overlay Routing Networks

An overlay routing network is built of nodes connected together by a network of a distinct topology. This network is a logical overlay network because the nodes communicate over an underlying communication network, but the logical network topology has influence on the routing algorithm. To resolve a binding for a content name, a message is routed to the node that is "closest" to the content name according to a metric on the node identifiers and content names. The communication network address of this "closest" node is the result of the lookup operation. The routing algorithm on each node exploits local routing knowledge to route the message to the "closest" local routing alternative until there is no "closer" routing alternative.
To define closeness, there has to be a metric that applies to both the node identifier space and the content name space. This can be achieved by using a hash function that deterministically maps node identifiers and content names into a flat (and uniformly populated) hash space, on which a metric is chosen to define the closeness. Pastry, Tapestry and Chord are based on this idea and therefore share the same average lookup path length of log(#nodes) hops, but they use different metrics and overlay network topologies.
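The shared identifier space can be sketched in a few lines: node addresses and content names are hashed into the same m-bit range, so one metric can compare them. SHA-1 and m = 16 are illustrative choices here; the class name and example strings are not taken from any of the systems discussed.

```java
import java.math.BigInteger;
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;

// Sketch: mapping both node addresses and content names into one flat
// m-bit identifier space with a hash function, as Chord, Pastry and
// Tapestry do.
public class FlatIds {

    static final int M = 16;  // identifier length in bits (small for readability)

    /** Hash an arbitrary string into a nonnegative m-bit identifier. */
    static int id(String name) {
        try {
            MessageDigest sha1 = MessageDigest.getInstance("SHA-1");
            byte[] digest = sha1.digest(name.getBytes(StandardCharsets.UTF_8));
            // Keep only the low m bits of the digest.
            return new BigInteger(1, digest).mod(BigInteger.ONE.shiftLeft(M)).intValue();
        } catch (java.security.NoSuchAlgorithmException e) {
            throw new AssertionError(e);  // SHA-1 is always available
        }
    }

    public static void main(String[] args) {
        // A node address and a content name land in the same 2^16 space,
        // so a single distance metric can relate keys to nodes.
        System.out.println("node 192.168.0.1:4000 -> " + id("192.168.0.1:4000"));
        System.out.println("content 'thesis.pdf'  -> " + id("thesis.pdf"));
    }
}
```

The uniformity of the hash function is what makes the populated identifier space approximately uniform, which the load balancing arguments in the following sections depend on.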

Chord

Chord is a distributed lookup and routing overlay network using consistent hashing [9], originally introduced for distributed caching. Nodes are organized in a circular topology by using m-bit node IDs interpreted as nonnegative integer numbers wrapped around at zero. The total ordering of the integer numbers assigns each node a predecessor and a successor node. A node ID is generated by applying a hash function to the node's host IP address. Therefore, the overlay network becomes a deterministic function of the host addresses. In other words, the host IP address determines the position in the circle. This makes the overlay network topology completely unaware of the underlying network layer topology, which has some positive and negative effects. Routing a message to the successor, the neighbor in the overlay network, could result in routing to the other side of the world in the underlying IP network, causing high latency. On the other hand, IP network failures in a region do not map to a region of the logical overlay network, which is often used for placing redundant replicas, as in CFS or PAST. A content item is also identified by an m-bit content key, and the binding from keys (hashed content names) to node IDs (hashed host addresses) is defined by the successor function. A key k is located at the node n with the same or the next higher ID than the key k, written as n = successor(k). The content item associated with the key k is not stored in Chord itself. Chord just assigns a Responsible Node whose network location is used to access the content item. If a host operates more than one node, they are called Virtual Nodes, and their node ID is calculated by hashing the host IP address together with a small Virtual Node ID, which is only unique on that host.
The reason to use Virtual Nodes is that for a small number of nodes in the circle, the distance in the identifier space between nodes is not likely to be equally distributed as desired, which results in a unequal distribution of keys per real nodes.
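The derivation of node identifiers described above can be sketched as follows. This is a minimal illustration, not the prototype's actual code: the use of SHA-1 and the string encoding of host address plus Virtual Node index are assumptions made for the example.

```java
import java.math.BigInteger;
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;

public class NodeId {
    // Hash the host address together with a per-host Virtual Node index
    // into an m-bit identifier on the Chord circle.
    static BigInteger nodeId(String hostAddress, int virtualNodeId, int m) throws Exception {
        MessageDigest sha1 = MessageDigest.getInstance("SHA-1");
        byte[] digest = sha1.digest(
                (hostAddress + "/" + virtualNodeId).getBytes(StandardCharsets.UTF_8));
        // Interpret the digest as a nonnegative integer and reduce it mod 2^m,
        // so that identifiers wrap around at zero.
        return new BigInteger(1, digest).mod(BigInteger.ONE.shiftLeft(m));
    }

    public static void main(String[] args) throws Exception {
        // Two Virtual Nodes on the same host get different positions in the circle.
        System.out.println(nodeId("192.168.1.10", 0, 64));
        System.out.println(nodeId("192.168.1.10", 1, 64));
    }
}
```

Because the ID is a deterministic function of host address and Virtual Node index, every node can recompute its own position after a restart without any coordination.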

Figure 3.1: Distribution of keys per real node depending on the number of Virtual Nodes per real node, for a simulated network with 10^4 real nodes and 10^6 keys.

Figure 3.1, taken from [28], shows that increasing the total number of nodes by introducing multiple Virtual Nodes per real node balances the number of keys per real node. In CFS the number of Virtual Nodes on a host is also used to adjust to the available storage capacity, because a node is not allowed to reject a storage request. The separation of content data and content metadata, discussed later in this chapter, can eliminate this problem by introducing one more degree of freedom. But an equal distribution of keys per real node is still a desirable property, and Virtual Nodes give significant optimization potential for the lookup, as described in the implementation chapters.

Using the same identifiers for nodes and keys leads to a combined lookup and routing. A lookup is resolved by routing a message to the node that is the successor of the key. Every node knows at least two other nodes, its successor and its predecessor. The simple lookup algorithm routes a message around the circle by following the successor pointers until a node with the same ID as the key or the next higher ID is found. The metric used in Chord is the numerical difference between key and node ID. Routing is done by choosing the local routing alternative that minimizes this difference. As a node is only aware of its successor and its predecessor as available routing alternatives, a lookup message always traverses the circle in the direction of the successor pointers, because only this reduces the distance. In the worst case a message has to complete a full turn around the circle before the node that is successor to the key is found. Resolving a lookup with the simple algorithm therefore takes O(#nodes) hops.

As long as every node has a working pointer to its immediate successor in the circle, a successful successor lookup is guaranteed. In a real application with frequent joins and leaves, a single successor pointer is not sufficient to guarantee a successful lookup: a single node failure would break the circle and result in lookup failures. Therefore, redundant successor pointers are used. As long as one working successor pointer is found, the lookup routing can proceed and a successful lookup is guaranteed.

To reduce the average lookup path length to a practical number, a finger table with additional routing information is introduced. Fingers are like shortcuts, used instead of going around the circle from node to node following the successor pointers. Every node divides the circle into m finger intervals of exponentially growing size in powers of 2. For a node n and k = 1 ... m, the k-th finger interval is [n + 2^(k-1) mod 2^m, n + 2^k mod 2^m), has length 2^(k-1), and its finger points to successor(start).

Figure 3.2: An example of a finger interval with the finger pointer (finger[6] of node n = 80 in an m = 7 bit circle).

A finger points to the successor of the interval start, which can result in finger pointers lying outside their corresponding finger interval. The finger nodes are resolved by the Chord lookup function, which returns the successor node of the interval start ID. Using the finger table adds O(m) additional routing alternatives, and the one is chosen that leads closest to the successor of the key. The higher the finger index, the farther away the finger points. Therefore, the finger table is searched in reverse order, starting at finger[m]. If finger i points to a node preceding the key, this hop reduces the distance to the key by 2^(i-1). With a few hops, the distance to the key is quickly reduced, which results in an average lookup path length of O(log(#nodes)). This bound was proven theoretically and verified by controlled experiments in the Chord paper[28].
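The finger interval boundaries described above are simple modular arithmetic; a small sketch (class and method names are illustrative, not taken from the prototype):

```java
import java.math.BigInteger;

public class FingerTable {
    // Start of the k-th finger interval of node n in an m-bit circle:
    // (n + 2^(k-1)) mod 2^m, for k = 1 .. m.
    static BigInteger fingerStart(BigInteger n, int k, int m) {
        BigInteger ringSize = BigInteger.ONE.shiftLeft(m);
        return n.add(BigInteger.ONE.shiftLeft(k - 1)).mod(ringSize);
    }

    public static void main(String[] args) {
        BigInteger n = BigInteger.valueOf(80);
        int m = 7; // 2^7 = 128 identifiers, as in Figure 3.2
        for (int k = 1; k <= m; k++) {
            System.out.println("finger[" + k + "].start = " + fingerStart(n, k, m));
        }
        // finger[6].start = (80 + 32) mod 128 = 112
        // finger[7].start = (80 + 64) mod 128 = 16 (wraps around zero)
    }
}
```

Note that the interval of the highest finger covers half the circle, which is what allows each routed hop to cut the remaining distance roughly in half.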

Figure 3.3 shows a detailed Chord routing example using finger tables in an m = 7 bit circular hash space. Starting at node N32, which wants to resolve the successor of the key K19, N32 looks in its finger routing table for the node that most closely precedes K19. The finger table is searched in reverse order, starting at the finger with index 7. This finger matches the criterion, and therefore the lookup continues at N99. On N99 the finger table is searched again. The 7th finger, N60, does not precede K19, and therefore the 6th finger is tested. This one, pointing to N5, precedes K19, hence the lookup continues on N5. N5 finds N10 as its closest preceding finger. N10 then terminates the lookup, because it can determine that its successor N20 is the successor node of K19.

Figure 3.3: An example of a lookup using the finger table (nodes N5, N10, N20, N32, N80, N99, N110; lookup(K19) started at N32).

The two important properties of Chord are inherited from using ranged hash functions as proposed in consistent hashing[28]:

1. balance property: the number of keys per real node is K/N with high probability (if each real node has O(log N) Virtual Nodes), where K is the number of keys and N is the number of nodes. The responsibility for keys is equally distributed among all nodes.

2. monotony property: when the (N + 1)-th node joins, the binding for O(K/N) keys changes from existing nodes to the new node. In other words, the responsibility for keys changes only from existing nodes to new nodes, never from existing nodes to existing nodes. There is only a local reorganization on a node join.
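The successor function and the monotony property can be illustrated with a sorted map holding the whole ring. This is a deliberately centralized sketch (names hypothetical); the actual prototype distributes this state across nodes and resolves successor(k) by routing.

```java
import java.util.TreeMap;

public class Ring {
    // Node IDs on the circle, kept sorted; values stand in for network locations.
    private final TreeMap<Integer, String> nodes = new TreeMap<>();
    private final int ringSize;

    Ring(int m) { this.ringSize = 1 << m; }

    void join(int nodeId, String host) { nodes.put(nodeId, host); }

    // successor(k): the node with the same or the next higher ID than k,
    // wrapping around at zero.
    int successor(int key) {
        Integer id = nodes.ceilingKey(key % ringSize);
        return (id != null) ? id : nodes.firstKey(); // wrap around
    }

    public static void main(String[] args) {
        Ring ring = new Ring(7);
        ring.join(5, "a"); ring.join(10, "b"); ring.join(20, "c");
        ring.join(32, "d"); ring.join(99, "e");
        System.out.println(ring.successor(19));  // 20
        System.out.println(ring.successor(100)); // wraps around: 5
        // Monotony property: a joining node takes over only the keys between
        // its predecessor and itself; all other bindings are unchanged.
        ring.join(15, "f");
        System.out.println(ring.successor(12));  // now 15 (moved from 20)
        System.out.println(ring.successor(19));  // still 20
    }
}
```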

Chord offers a scalable, robust, and balanced mapping of hashed content names to host network locations (IP address and a Virtual Node ID), which allows communication with these Virtual Nodes over the underlying network layer. The Chord overlay network delegates responsibility for content items to nodes; Chord does not store the data itself. To be consistent with the terminology introduced by Chord, a content item is identified by its key k, which is of course a Chord m-bit identifier. The node that is responsible for a content item k is called the primary Responsible Node RN1^k of key k and is defined by the lookup function as RN1^k = successor(k). In a perfect world with no host failures, the straightforward solution would be to store content items on their primary Responsible Nodes.

3.2 Increasing Reliability with Redundancy

The reliability of data storage on unreliable nodes is increased by adding redundant information and dispersing this information to several nodes. The reliability, expressed as the probability of a successful data access, is determined by:

1. The amount of redundant information added

2. The number of nodes and their independent failure probabilities

3. How the redundant information is distributed over multiple nodes

The problem of storing data blocks on unreliable nodes is closely related to storing data blocks on a hard disk array, as is done in RAID storage solutions [2]. A RAID system is a redundant array of inexpensive disks. This technology was developed to organize small hard disks into arrays in order to replace much more expensive high-capacity disks and to reduce the risk of data loss due to hard disk failures. There are several approaches to distributing data over disks, or, as in this case, storage nodes. Two of them are now discussed.

3.2.1 IDA Approach

IDA stands for Information Dispersal Algorithm, proposed by M. Rabin[19]. The basic idea is to disperse the content of a data block into n fragments. The original data can be reconstructed from any subset of k fragments, where k <= n. One major aspect of this algorithm is that redundancy is added uniformly; there is no distinction between data and parity. This property allows the amount of redundant data to be controlled at fine granularity. To tolerate up to r simultaneous node failures, the data block has to be encoded into n = k + r fragments. If all nodes have the independent failure probability p,

this gives an access reliability of:

p(access) = 1 - sum_{i=r+1}^{n} C(n,i) p^i (1-p)^(n-i)

The redundancy necessary to achieve this reliability is (n-k)/k. It is obvious that the IDA approach needs a smaller amount of storage resources than the straightforward replication approach: for the same reliability, the replication approach needs r times redundancy. The currently available Forward Error Correction (FEC) codes, such as the Reed-Solomon code [22], have encoding times quadratic in the number of encoded blocks n. Tornado codes [12] achieve an encoding time linear in n, but so far there is no free implementation available. A performance comparison of the different codes can be found in [13].

3.2.2 Replication Approach

The other redundancy scheme, block replication, also called mirroring, was already mentioned and compared to the IDA approach. An analysis of past hardware development shows that hard disk storage space has doubled every 18 months, which is often referred to as Moore's Law, and this trend is expected to hold for the next decade[31]. While the capacity per disk is growing, the price per storage unit is falling. As of April 2002, the average hard disk shipped with a workstation can store between 40 and 60 GB. Therefore, disk storage capacity is not considered a limited resource.

Since the prototype will be implemented in Java, one should take into account that Java code executing numeric calculations is likely to be 10 to 30 times slower than native machine code generated from C code. Using FEC codes, which are based on polynomial arithmetic, will always produce significantly higher CPU load than replication. The storage node software is expected to run on workstations with priority given to the user processes, not on dedicated single-purpose servers. Therefore, it should run as a background process with low priority and consume as few CPU cycles as possible.
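The access-reliability formula of the IDA scheme given in Section 3.2.1 is straightforward to evaluate numerically. A small sketch (class and method names are hypothetical): an access fails only if more than r of the n fragment-holding nodes fail simultaneously, leaving fewer than k = n - r fragments for reconstruction.

```java
public class Reliability {
    // Binomial coefficient C(n, i); exact at each step for small n.
    static long binomial(int n, int i) {
        long c = 1;
        for (int j = 0; j < i; j++) c = c * (n - j) / (j + 1);
        return c;
    }

    // p(access) = 1 - sum_{i=r+1}^{n} C(n,i) p^i (1-p)^(n-i),
    // where p is the independent failure probability of a single node.
    static double accessProbability(int n, int r, double p) {
        double tooManyFailures = 0.0;
        for (int i = r + 1; i <= n; i++) {
            tooManyFailures += binomial(n, i) * Math.pow(p, i) * Math.pow(1 - p, n - i);
        }
        return 1.0 - tooManyFailures;
    }

    public static void main(String[] args) {
        // e.g. k = 8 data fragments, r = 4 redundant fragments (n = 12),
        // with a 10% node failure probability
        System.out.printf("%.6f%n", accessProbability(12, 4, 0.1));
    }
}
```

Evaluating such expressions for candidate (n, r) pairs is how the amount of redundancy would be dimensioned against a target access probability.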
For these two reasons, the replication scheme will be used to increase the reliability of the prototype.

3.2.3 Replica Placement

The locations where replicas are stored depend on the overlay network topology that is used. In general, replicas are stored in the logical neighborhood of the primary Responsible Node. When the primary Responsible Node fails, the routing and lookup mechanism of the overlay network will assign the responsibility to another

node logically close to the failed primary Responsible Node. In Chord, a node that failed is replaced by its immediate successor. Either the new node already has a replica, or the new node can ask its neighbors for one. The set of nodes storing replicas of a content item identified by its key k are called Responsible Nodes, defined as {RNi^k | i = 1 ... r}. The name expresses that together they are responsible for increasing the reliability, in terms of the access probability determined by the replication factor r. When this idea is applied to the circular overlay network topology used by Chord, the Responsible Nodes have to be either the primary Responsible Node's r-1 successors or its r-1 predecessors. This decision should take into account how the content items are located by the overlay network, and whether there are implications for caching and load balancing schemes that could be used to improve the access performance.

3.3 Caching and Load Balancing

Chord's balance property results in a uniform distribution of responsibility for content items among all nodes. However, non-uniform information access due to popular content will create hot-spots in the overlay network and congestion in the underlying network if not avoided by caching and load balancing mechanisms. The design of these mechanisms is closely related to the overlay network's topology and its routing algorithms, because both influence the routing path through the overlay network and therefore the locations of hot-spots. Most of the activity in this distributed peer-to-peer system will be caused by locating and accessing content, for which caching is used to increase the performance. Chord itself was developed in the field of distributed cache design based on consistent hashing. Hence, the peer-to-peer design should consider some general design principles for distributed caching[29]:

1. Maintain a hierarchy of metadata that tracks where copies of data are stored.

2. Separate data paths from metadata paths.

3. Use direct cache-to-cache data transfers to avoid store-and-forward delays.

3.3.1 Separation of Data and Metadata

Caching is used to improve download performance by placing or locating cache replicas closer to the user than the content itself, assuming that closer in terms of network proximity will result in higher throughput. In Chord's case, where network proximity is not reflected by the overlay network, it is difficult to find a close replica.

Therefore, a parallel access scheme, which accesses several replicas in parallel, will be used to increase download performance. Further details about parallel access can be found in the implementation chapters. According to the first design principle, a content item's metadata structure contains pointers to the nodes where replicas are located, whether they exist to increase reliability or to distribute load. This metadata structure is then used for parallel access. Accessing a content item is a two-step process:

1. Accessing the metadata information by a primary Responsible Node lookup

2. Using the metadata pointers for parallel access

Following the second design principle, the data and metadata access paths are separated, and for each an individual caching and load balancing scheme is designed that exploits some of its access characteristics. The metadata access caching scheme and the data access caching scheme are explained in detail later in this chapter.

Two ways of separating data and metadata are possible: real and logical separation. Real separation means that a content item's metadata and the replica data are located on different nodes. Logical separation means that data and metadata are on the same node, but are distinguished in the sense of their different roles in the two-step access pattern: first a node is accessed to return metadata, then it is accessed again in the parallel access process.

The idea of real separation was originally developed to overcome a negative effect of Chord's balance and monotony properties. When a node joins the Chord ring, some of the keys the new node becomes responsible for shift to the new node. For an average of K/N keys per node, K/N replicas have to be transferred to the new node. In real life, the circle will be sparsely populated with N nodes and a much higher number K of keys, which makes K/N >> 1. Depending on the number of keys, this can cause high load on the underlying network link between the new node that joins and the existing successor node if the keys are directly transferred from the old node to the new node. From one point of view, this data transfer is not necessary, because the replicas on the old node have not vanished, and therefore there is no need to shift data due to a change of responsibility. In order to drastically reduce the data transferred, instead of moving the real data, the much smaller metadata is shifted from the old node to the new node to reflect the change of responsibility. An additional degree of freedom is introduced, which allows choosing the node where a replica is stored. This has the advantage that nodes with low storage resource usage can be preferred and an explicit balancing of storage resources can be achieved. It is not necessary anymore


More information

Peer-to-Peer Architectures and Signaling. Agenda

Peer-to-Peer Architectures and Signaling. Agenda Peer-to-Peer Architectures and Signaling Juuso Lehtinen Juuso@netlab.hut.fi Slides based on presentation by Marcin Matuszewski in 2005 Introduction P2P architectures Skype Mobile P2P Summary Agenda 1 Introduction

More information

08 Distributed Hash Tables

08 Distributed Hash Tables 08 Distributed Hash Tables 2/59 Chord Lookup Algorithm Properties Interface: lookup(key) IP address Efficient: O(log N) messages per lookup N is the total number of servers Scalable: O(log N) state per

More information

Overview Computer Networking Lecture 16: Delivering Content: Peer to Peer and CDNs Peter Steenkiste

Overview Computer Networking Lecture 16: Delivering Content: Peer to Peer and CDNs Peter Steenkiste Overview 5-44 5-44 Computer Networking 5-64 Lecture 6: Delivering Content: Peer to Peer and CDNs Peter Steenkiste Web Consistent hashing Peer-to-peer Motivation Architectures Discussion CDN Video Fall

More information

Department of Computer Science Institute for System Architecture, Chair for Computer Networks. File Sharing

Department of Computer Science Institute for System Architecture, Chair for Computer Networks. File Sharing Department of Computer Science Institute for System Architecture, Chair for Computer Networks File Sharing What is file sharing? File sharing is the practice of making files available for other users to

More information

Peer-peer and Application-level Networking. CS 218 Fall 2003

Peer-peer and Application-level Networking. CS 218 Fall 2003 Peer-peer and Application-level Networking CS 218 Fall 2003 Multicast Overlays P2P applications Napster, Gnutella, Robust Overlay Networks Distributed Hash Tables (DHT) Chord CAN Much of this material

More information

CIS 700/005 Networking Meets Databases

CIS 700/005 Networking Meets Databases Announcements CIS / Networking Meets Databases Boon Thau Loo Spring Lecture Paper summaries due at noon today. Office hours: Wed - pm ( Levine) Project proposal: due Feb. Student presenter: rd Jan: A Scalable

More information

Distributed Hash Tables: Chord

Distributed Hash Tables: Chord Distributed Hash Tables: Chord Brad Karp (with many slides contributed by Robert Morris) UCL Computer Science CS M038 / GZ06 12 th February 2016 Today: DHTs, P2P Distributed Hash Tables: a building block

More information

Agent and Object Technology Lab Dipartimento di Ingegneria dell Informazione Università degli Studi di Parma. Distributed and Agent Systems

Agent and Object Technology Lab Dipartimento di Ingegneria dell Informazione Università degli Studi di Parma. Distributed and Agent Systems Agent and Object Technology Lab Dipartimento di Ingegneria dell Informazione Università degli Studi di Parma Distributed and Agent Systems Peer-to-Peer Systems & JXTA Prof. Agostino Poggi What is Peer-to-Peer

More information

Peer-to-Peer Systems and Distributed Hash Tables

Peer-to-Peer Systems and Distributed Hash Tables Peer-to-Peer Systems and Distributed Hash Tables CS 240: Computing Systems and Concurrency Lecture 8 Marco Canini Credits: Michael Freedman and Kyle Jamieson developed much of the original material. Selected

More information

A Framework for Peer-To-Peer Lookup Services based on k-ary search

A Framework for Peer-To-Peer Lookup Services based on k-ary search A Framework for Peer-To-Peer Lookup Services based on k-ary search Sameh El-Ansary Swedish Institute of Computer Science Kista, Sweden Luc Onana Alima Department of Microelectronics and Information Technology

More information

Distributed Systems Final Exam

Distributed Systems Final Exam 15-440 Distributed Systems Final Exam Name: Andrew: ID December 12, 2011 Please write your name and Andrew ID above before starting this exam. This exam has 14 pages, including this title page. Please

More information

Distributed Knowledge Organization and Peer-to-Peer Networks

Distributed Knowledge Organization and Peer-to-Peer Networks Knowledge Organization and Peer-to-Peer Networks Klaus Wehrle Group Chair of Computer Science IV RWTH Aachen University http://ds.cs.rwth-aachen.de 1 Organization of Information Essential challenge in?

More information

Debunking some myths about structured and unstructured overlays

Debunking some myths about structured and unstructured overlays Debunking some myths about structured and unstructured overlays Miguel Castro Manuel Costa Antony Rowstron Microsoft Research, 7 J J Thomson Avenue, Cambridge, UK Abstract We present a comparison of structured

More information

15-744: Computer Networking P2P/DHT

15-744: Computer Networking P2P/DHT 15-744: Computer Networking P2P/DHT Overview P2P Lookup Overview Centralized/Flooded Lookups Routed Lookups Chord Comparison of DHTs 2 Peer-to-Peer Networks Typically each member stores/provides access

More information

CS 3516: Advanced Computer Networks

CS 3516: Advanced Computer Networks Welcome to CS 3516: Advanced Computer Networks Prof. Yanhua Li Time: 9:00am 9:50am M, T, R, and F Location: Fuller 320 Fall 2017 A-term 1 Some slides are originally from the course materials of the textbook

More information

Chapter 6 PEER-TO-PEER COMPUTING

Chapter 6 PEER-TO-PEER COMPUTING Chapter 6 PEER-TO-PEER COMPUTING Distributed Computing Group Computer Networks Winter 23 / 24 Overview What is Peer-to-Peer? Dictionary Distributed Hashing Search Join & Leave Other systems Case study:

More information

DHT Overview. P2P: Advanced Topics Filesystems over DHTs and P2P research. How to build applications over DHTS. What we would like to have..

DHT Overview. P2P: Advanced Topics Filesystems over DHTs and P2P research. How to build applications over DHTS. What we would like to have.. DHT Overview P2P: Advanced Topics Filesystems over DHTs and P2P research Vyas Sekar DHTs provide a simple primitive put (key,value) get (key) Data/Nodes distributed over a key-space High-level idea: Move

More information

: Scalable Lookup

: Scalable Lookup 6.824 2006: Scalable Lookup Prior focus has been on traditional distributed systems e.g. NFS, DSM/Hypervisor, Harp Machine room: well maintained, centrally located. Relatively stable population: can be

More information

Lecture 21 P2P. Napster. Centralized Index. Napster. Gnutella. Peer-to-Peer Model March 16, Overview:

Lecture 21 P2P. Napster. Centralized Index. Napster. Gnutella. Peer-to-Peer Model March 16, Overview: PP Lecture 1 Peer-to-Peer Model March 16, 005 Overview: centralized database: Napster query flooding: Gnutella intelligent query flooding: KaZaA swarming: BitTorrent unstructured overlay routing: Freenet

More information

Chapter 10: Peer-to-Peer Systems

Chapter 10: Peer-to-Peer Systems Chapter 10: Peer-to-Peer Systems From Coulouris, Dollimore and Kindberg Distributed Systems: Concepts and Design Edition 4, Addison-Wesley 2005 Introduction To enable the sharing of data and resources

More information

Today. Why might P2P be a win? What is a Peer-to-Peer (P2P) system? Peer-to-Peer Systems and Distributed Hash Tables

Today. Why might P2P be a win? What is a Peer-to-Peer (P2P) system? Peer-to-Peer Systems and Distributed Hash Tables Peer-to-Peer Systems and Distributed Hash Tables COS 418: Distributed Systems Lecture 7 Today 1. Peer-to-Peer Systems Napster, Gnutella, BitTorrent, challenges 2. Distributed Hash Tables 3. The Chord Lookup

More information

WSN Routing Protocols

WSN Routing Protocols WSN Routing Protocols 1 Routing Challenges and Design Issues in WSNs 2 Overview The design of routing protocols in WSNs is influenced by many challenging factors. These factors must be overcome before

More information

Scalability In Peer-to-Peer Systems. Presented by Stavros Nikolaou

Scalability In Peer-to-Peer Systems. Presented by Stavros Nikolaou Scalability In Peer-to-Peer Systems Presented by Stavros Nikolaou Background on Peer-to-Peer Systems Definition: Distributed systems/applications featuring: No centralized control, no hierarchical organization

More information

L3S Research Center, University of Hannover

L3S Research Center, University of Hannover , University of Hannover Dynamics of Wolf-Tilo Balke and Wolf Siberski 21.11.2007 *Original slides provided by S. Rieche, H. Niedermayer, S. Götz, K. Wehrle (University of Tübingen) and A. Datta, K. Aberer

More information

Introduction to Peer-to-Peer Networks

Introduction to Peer-to-Peer Networks Introduction to Peer-to-Peer Networks The Story of Peer-to-Peer The Nature of Peer-to-Peer: Generals & Paradigms Unstructured Peer-to-Peer Systems Sample Applications 1 Prof. Dr. Thomas Schmidt http:/www.informatik.haw-hamburg.de/~schmidt

More information

P2P Applications. Reti di Elaboratori Corso di Laurea in Informatica Università degli Studi di Roma La Sapienza Canale A-L Prof.ssa Chiara Petrioli

P2P Applications. Reti di Elaboratori Corso di Laurea in Informatica Università degli Studi di Roma La Sapienza Canale A-L Prof.ssa Chiara Petrioli P2P Applications Reti di Elaboratori Corso di Laurea in Informatica Università degli Studi di Roma La Sapienza Canale A-L Prof.ssa Chiara Petrioli Server-based Network Peer-to-peer networks A type of network

More information

Internet Technology. 06. Exam 1 Review Paul Krzyzanowski. Rutgers University. Spring 2016

Internet Technology. 06. Exam 1 Review Paul Krzyzanowski. Rutgers University. Spring 2016 Internet Technology 06. Exam 1 Review Paul Krzyzanowski Rutgers University Spring 2016 March 2, 2016 2016 Paul Krzyzanowski 1 Question 1 Defend or contradict this statement: for maximum efficiency, at

More information

IPv6: An Introduction

IPv6: An Introduction Outline IPv6: An Introduction Dheeraj Sanghi Department of Computer Science and Engineering Indian Institute of Technology Kanpur dheeraj@iitk.ac.in http://www.cse.iitk.ac.in/users/dheeraj Problems with

More information

Chord : A Scalable Peer-to-Peer Lookup Protocol for Internet Applications

Chord : A Scalable Peer-to-Peer Lookup Protocol for Internet Applications : A Scalable Peer-to-Peer Lookup Protocol for Internet Applications Ion Stoica, Robert Morris, David Liben-Nowell, David R. Karger, M. Frans Kaashock, Frank Dabek, Hari Balakrishnan March 4, 2013 One slide

More information

Internet Technology 3/2/2016

Internet Technology 3/2/2016 Question 1 Defend or contradict this statement: for maximum efficiency, at the expense of reliability, an application should bypass TCP or UDP and use IP directly for communication. Internet Technology

More information

6. Peer-to-peer (P2P) networks I.

6. Peer-to-peer (P2P) networks I. 6. Peer-to-peer (P2P) networks I. PA159: Net-Centric Computing I. Eva Hladká Faculty of Informatics Masaryk University Autumn 2010 Eva Hladká (FI MU) 6. P2P networks I. Autumn 2010 1 / 46 Lecture Overview

More information

LECT-05, S-1 FP2P, Javed I.

LECT-05, S-1 FP2P, Javed I. A Course on Foundations of Peer-to-Peer Systems & Applications LECT-, S- FPP, javed@kent.edu Javed I. Khan@8 CS /99 Foundation of Peer-to-Peer Applications & Systems Kent State University Dept. of Computer

More information

What is Multicasting? Multicasting Fundamentals. Unicast Transmission. Agenda. L70 - Multicasting Fundamentals. L70 - Multicasting Fundamentals

What is Multicasting? Multicasting Fundamentals. Unicast Transmission. Agenda. L70 - Multicasting Fundamentals. L70 - Multicasting Fundamentals What is Multicasting? Multicasting Fundamentals Unicast transmission transmitting a packet to one receiver point-to-point transmission used by most applications today Multicast transmission transmitting

More information

Goals. EECS 122: Introduction to Computer Networks Overlay Networks and P2P Networks. Solution. Overlay Networks: Motivations.

Goals. EECS 122: Introduction to Computer Networks Overlay Networks and P2P Networks. Solution. Overlay Networks: Motivations. Goals CS : Introduction to Computer Networks Overlay Networks and PP Networks Ion Stoica Computer Science Division Department of lectrical ngineering and Computer Sciences University of California, Berkeley

More information

CS514: Intermediate Course in Computer Systems

CS514: Intermediate Course in Computer Systems Distributed Hash Tables (DHT) Overview and Issues Paul Francis CS514: Intermediate Course in Computer Systems Lecture 26: Nov 19, 2003 Distributed Hash Tables (DHT): Overview and Issues What is a Distributed

More information

Opportunistic Application Flows in Sensor-based Pervasive Environments

Opportunistic Application Flows in Sensor-based Pervasive Environments Opportunistic Application Flows in Sensor-based Pervasive Environments Nanyan Jiang, Cristina Schmidt, Vincent Matossian, and Manish Parashar ICPS 2004 1 Outline Introduction to pervasive sensor-based

More information

Distributed Hash Table

Distributed Hash Table Distributed Hash Table P2P Routing and Searching Algorithms Ruixuan Li College of Computer Science, HUST rxli@public.wh.hb.cn http://idc.hust.edu.cn/~rxli/ In Courtesy of Xiaodong Zhang, Ohio State Univ

More information

Lecture 8: Application Layer P2P Applications and DHTs

Lecture 8: Application Layer P2P Applications and DHTs Lecture 8: Application Layer P2P Applications and DHTs COMP 332, Spring 2018 Victoria Manfredi Acknowledgements: materials adapted from Computer Networking: A Top Down Approach 7 th edition: 1996-2016,

More information

Peer-to-peer systems and overlay networks

Peer-to-peer systems and overlay networks Complex Adaptive Systems C.d.L. Informatica Università di Bologna Peer-to-peer systems and overlay networks Fabio Picconi Dipartimento di Scienze dell Informazione 1 Outline Introduction to P2P systems

More information

INF5071 Performance in distributed systems: Distribution Part III

INF5071 Performance in distributed systems: Distribution Part III INF5071 Performance in distributed systems: Distribution Part III 5 November 2010 Client-Server Traditional distributed computing Successful architecture, and will continue to be so (adding proxy servers)

More information

P2P. 1 Introduction. 2 Napster. Alex S. 2.1 Client/Server. 2.2 Problems

P2P. 1 Introduction. 2 Napster. Alex S. 2.1 Client/Server. 2.2 Problems P2P Alex S. 1 Introduction The systems we will examine are known as Peer-To-Peer, or P2P systems, meaning that in the network, the primary mode of communication is between equally capable peers. Basically

More information

Stratos Idreos. A thesis submitted in fulfillment of the requirements for the degree of. Electronic and Computer Engineering

Stratos Idreos. A thesis submitted in fulfillment of the requirements for the degree of. Electronic and Computer Engineering P2P-DIET: A QUERY AND NOTIFICATION SERVICE BASED ON MOBILE AGENTS FOR RAPID IMPLEMENTATION OF P2P APPLICATIONS by Stratos Idreos A thesis submitted in fulfillment of the requirements for the degree of

More information

Middleware and Distributed Systems. Peer-to-Peer Systems. Peter Tröger

Middleware and Distributed Systems. Peer-to-Peer Systems. Peter Tröger Middleware and Distributed Systems Peer-to-Peer Systems Peter Tröger Peer-to-Peer Systems (P2P) Concept of a decentralized large-scale distributed system Large number of networked computers (peers) Each

More information

«Computer Science» Requirements for applicants by Innopolis University

«Computer Science» Requirements for applicants by Innopolis University «Computer Science» Requirements for applicants by Innopolis University Contents Architecture and Organization... 2 Digital Logic and Digital Systems... 2 Machine Level Representation of Data... 2 Assembly

More information

Building a low-latency, proximity-aware DHT-based P2P network

Building a low-latency, proximity-aware DHT-based P2P network Building a low-latency, proximity-aware DHT-based P2P network Ngoc Ben DANG, Son Tung VU, Hoai Son NGUYEN Department of Computer network College of Technology, Vietnam National University, Hanoi 144 Xuan

More information

PEER-TO-PEER (P2P) systems are now one of the most

PEER-TO-PEER (P2P) systems are now one of the most IEEE JOURNAL ON SELECTED AREAS IN COMMUNICATIONS, VOL. 25, NO. 1, JANUARY 2007 15 Enhancing Peer-to-Peer Systems Through Redundancy Paola Flocchini, Amiya Nayak, Senior Member, IEEE, and Ming Xie Abstract

More information

CS 268: Lecture 22 DHT Applications

CS 268: Lecture 22 DHT Applications CS 268: Lecture 22 DHT Applications Ion Stoica Computer Science Division Department of Electrical Engineering and Computer Sciences University of California, Berkeley Berkeley, CA 94720-1776 (Presentation

More information

Overlay and P2P Networks. Structured Networks and DHTs. Prof. Sasu Tarkoma

Overlay and P2P Networks. Structured Networks and DHTs. Prof. Sasu Tarkoma Overlay and P2P Networks Structured Networks and DHTs Prof. Sasu Tarkoma 6.2.2014 Contents Today Semantic free indexing Consistent Hashing Distributed Hash Tables (DHTs) Thursday (Dr. Samu Varjonen) DHTs

More information

Peer-to-Peer Applications Reading: 9.4

Peer-to-Peer Applications Reading: 9.4 Peer-to-Peer Applications Reading: 9.4 Acknowledgments: Lecture slides are from Computer networks course thought by Jennifer Rexford at Princeton University. When slides are obtained from other sources,

More information

EEC-684/584 Computer Networks

EEC-684/584 Computer Networks EEC-684/584 Computer Networks Lecture 14 wenbing@ieee.org (Lecture nodes are based on materials supplied by Dr. Louise Moser at UCSB and Prentice-Hall) Outline 2 Review of last lecture Internetworking

More information