Secure and Robust Overlay Content Distribution


Secure and Robust Overlay Content Distribution

A THESIS SUBMITTED TO THE FACULTY OF THE GRADUATE SCHOOL OF THE UNIVERSITY OF MINNESOTA

BY

Hun Jeong Kang

IN PARTIAL FULFILLMENT OF THE REQUIREMENTS FOR THE DEGREE OF Doctor of Philosophy

October 2010

© Hun Jeong Kang 2010 ALL RIGHTS RESERVED

Secure and Robust Overlay Content Distribution

by Hun Jeong Kang

ABSTRACT

With the success of applications spurring a tremendous increase in the volume of data transfer, efficient and reliable content distribution has become a key issue. Peer-to-peer (P2P) technology has gained popularity as a promising approach to large-scale content distribution due to its benefits, including self-organization, load balancing, and fault tolerance. Despite these strengths, P2P systems also present several challenges, such as performance guarantees, reliability, efficiency, and security. In P2P systems deployed on a large scale, these challenges are more difficult to deal with because of the large number of participants, unreliable user behavior, and unexpected situations. This thesis explores solutions to improve the efficiency, robustness, and security of large-scale P2P content distribution systems, focusing on three particular issues: lookup, practical network coding, and secure network coding.

A distributed hash table (DHT) is a structured overlay network service that provides a decentralized lookup for mapping objects to locations. This thesis focuses on improving the lookup performance of the Kademlia DHT protocol. Although many studies have proposed DHTs as a means of organizing and locating peers in distributed systems, to the best of my knowledge, Kademlia is the only DHT deployed at Internet scale in the real world. This study evaluates the lookup performance of Kad (a variant of Kademlia) deployed in one of the largest P2P file-sharing networks. The measurement study shows that lookup results are not consistent; only 18% of the nodes located by storing and searching lookups are the same. This lookup inconsistency problem leads to poor performance and the inefficient use of resources during lookups. This study identifies the underlying reasons for this inconsistency problem and the poor performance of lookups, and proposes solutions that guarantee reliable lookup results while making efficient use of resources.

This thesis also studies the practicality of network coding to facilitate cooperative content distribution. Network coding is a new data transmission technique which allows any

nodes in a network to encode and distribute data. It is a good solution offering reliability and efficiency in distributing content, but the usefulness of network coding is still in dispute because of its dubious performance gains and coding overhead in practice. With an implementation of network coding in a real-world application, this thesis measures the performance and overhead of network coding for content distribution in practice. This study also provides a lightweight yet efficient encoding scheme which allows network coding to provide improved performance and robustness with negligible overhead.

Network coding is a promising data transmission technique. However, its use also poses security vulnerabilities by allowing untrusted nodes to produce new encoded data. In particular, network coding is seriously vulnerable to pollution attacks, in which malicious nodes inject corrupted data into a network. Because of the nature of network coding, even a single unfiltered false data block may propagate widely in the network and, by being mixed with other correct blocks, disrupt correct decoding on many nodes. Since blocks are re-coded in transit, traditional hash or signature schemes do not work with network coding. This thesis therefore introduces a new homomorphic signature scheme which efficiently verifies encoded data on the fly and comes with desirable features appropriate for P2P content distribution. This scheme can protect network coding from pollution attacks without delaying downloading processes.

Contents

Abstract
List of Tables
List of Figures

1 Introduction
  1.1 Reliable and Efficient Lookup
  1.2 Practical Network Coding
  1.3 Secure Network Coding

2 Reliable and Efficient Lookup
  2.1 Background
    2.1.1 Kad
    2.1.2 Kad Lookup
  2.2 Evaluation of Kad Lookup Performance
    2.2.1 Experimental Setup
    2.2.2 Performance Results
  2.3 Analysis of Poor Lookup Performance
    2.3.1 Characterizing Routing Table Entries
    2.3.2 Analysis of Lookup Inconsistency
  2.4 Improvements
    2.4.1 Solutions
    2.4.2 Performance Comparisons
  2.5 Object Popularity and Load Balancing
  2.6 Related Work
  2.7 Summary

3 Practical Network Coding
  3.1 Preliminaries
    3.1.1 Cooperative Content Distribution
    3.1.2 Random Linear Network Coding
    3.1.3 Performance and Overhead in Network Coding
  3.2 Practical Network Coding System
    3.2.1 System Architecture
    3.2.2 i-code: Lightweight and Efficient Coding
  3.3 Evaluation
    3.3.1 Comparisons of encoding schemes
    3.3.2 Practicality Check
  3.4 Related work
  3.5 Summary

4 Secure Network Coding
  4.1 Preliminaries
    4.1.1 Threat Model
    4.1.2 Related work
    4.1.3 Requirements for content distribution
  4.2 Secure Network Coding
    4.2.1 Signature Scheme
    4.2.2 Comparisons
  4.3 Practical Consideration
    4.3.1 Parameter Setup
    4.3.2 Performance Boost
  4.4 Evaluation
  4.5 Summary

5 Conclusion

Appendix A. Security Analysis
  A.1 KYCK Signature
  A.2 Batch Verification

List of Tables

3.1 Parameters for generations and blocks
4.1 Comparison of secure network coding schemes

List of Figures

1.1 Block (block) scheduling problem
Illustration of a GET lookup
Performance of lookup: (a) search yield (immediately after PUT) (b) search yield (over 24-hour window) (c) search success ratio (over time) (d) search access ratio (by distance)
Statistics on routing tables
Illustration of how a lookup can be inconsistent
Number of nodes at each distance from a target
Number of replica roots at each distance from a target
Lookup algorithm for Fix
Lookup Improvement (Search Yield)
Lookup Overhead
Lookup performance over time
(a) Lookup with real popular objects (b) Original Kad lookup for our objects (c) New Kad lookup for our objects (d) Load for each prefix bit for real popular objects and our objects
Comparisons of downloads between BitTorrent and network coding
Comparisons with the optimal downloading time
Tradeoff between CPU overhead and block dependency
System architecture
i-code design. Our encoding scheme requires only one block to be read from disk and one linear combination, greatly reducing encoding overhead
Comparison of overhead
Comparison of downloading time

3.8 Detailed comparison of time overhead
Comparison of block dependency
Detailed comparison of time overhead
Comparisons of downloading times with 256KB block size
Comparisons of downloading time in different environments with 256KB block size
Impact of a block size on downloading times
Downloading times with flash-crowd
Downloading times according to peer arrivals and departures
Completeness according to the time of source departure
Downloading times of fast nodes in heterogeneous environments
Impact of number of neighbors
Signature verification time
Signature verification time
Signature verification time
Signature verification time

Chapter 1

Introduction

With the growth of computer networks, the recent popularity of applications in entertainment, business, and the sciences has spurred a tremendous increase in the volume of data transfer. In these applications, efficient and reliable content distribution has become a key issue. Traditionally, content distribution systems have been based on the client-server model: clients download the entire content from dedicated servers. However, this client-server model comes at a high cost for running the servers: they are costly to maintain, bandwidth is expensive, and steps must be taken to prevent a server or data center from becoming a single point of failure. A failure of the central server halts the entire service even when all of the clients are alive.

To avoid some of these problems, the advent of peer-to-peer (P2P) technologies has provided a new paradigm for content distribution. All content receivers (or nodes) also become content providers, cooperatively participating in the content distribution. Since their roles are equal, in contrast to the client-server relationship, they are called peers. They autonomously form an overlay network and contribute their bandwidth and computing resources to content distribution 1. The available resources of P2P systems grow as the number of peers in the network increases. Although individual peers have lower uptimes than dedicated servers and contribute less bandwidth, their numbers, theoretically, more than make up for their individual resource constraints. Furthermore, P2P technology helps content distribution scale better: the more popular the content is, the more peers participate in its distribution and thus the more peers

1 P2P, cooperative, and overlay content distribution are used interchangeably in this work.

contribute resources. In short, the P2P architecture is a promising candidate for making content distribution more scalable, more fault-tolerant, and faster.

Despite these strengths, P2P content distribution systems present several challenges, such as performance guarantees, reliability, efficiency, and security. For P2P systems that are deployed on a large scale, these challenges are more difficult to deal with because of the large number of participants, unreliable user behavior, and unexpected situations. In this thesis, we 2 explore solutions to improve the reliability and efficiency of large-scale P2P content distribution systems, focusing on three particular issues: lookup, practical network coding, and secure network coding. The rest of this chapter presents an overview of our research, followed by the details in the next chapters. Chapter 2 diagnoses the reliability and efficiency of lookups, Chapter 3 explores the practicality of network coding, and Chapter 4 studies the security of network coding in practice.

1.1 Reliable and Efficient Lookup

In P2P content distribution systems, content is distributed to nodes in a network instead of a central server. Therefore, a peer should be able to locate where desired content exists, and this lookup is an essential issue in a P2P system. In early P2P networks such as Napster, a central server stores the location information of all content in a distribution network. With this approach, peers simply query the server and can easily search for desired content. However, this approach is not scalable and suffers from a single point of failure. To solve these problems, decentralized and unstructured lookup is used in Gnutella [1] and Kazaa [2]. With this distributed approach, there is no centralized server, and each node keeps the location information of the content stored locally. When a node attempts to locate content, it floods requests to the network, which makes this unstructured method of searching for content expensive.
For efficient and scalable lookup, the research community proposed distributed hash tables (DHTs), also called structured overlays [3, 4, 5, 6, 7]. In DHTs, each peer has an overlay address and a routing table. When a peer performs a query for an identifier, the

2 This thesis has benefited from collaboration with my colleagues. The use of we throughout this thesis is meant to acknowledge their contribution.

query is routed to the peer with the closest overlay address. The DHT enforces rules on the way peers select neighbors to guarantee performance bounds on the number of hops needed to perform a query (typically O(log N), where N is the number of peers in the network). A DHT provides a simple put/get interface, similar to traditional hash tables: one can insert a key-value pair (k, v) and retrieve the value v with key k. Because a DHT provides a decentralized lookup service mapping objects to peers, it can also provide a means of organizing and locating peers for use in higher-level applications in a large peer-to-peer (P2P) network. This potential to be used as a fundamental building block for large-scale distributed systems has led to an enormous body of work on designing highly scalable DHTs. Despite this, only a handful of DHTs have been deployed on an Internet scale: Kad, Azureus [8], and Mainline [9], all of which are based on the Kademlia protocol [7].

This thesis evaluates and improves Kademlia lookup performance with regard to reliability and the efficient use of resources. The study focuses on Kad because it is widely deployed, with more than 1.5 million simultaneous users [10]. Furthermore, DHT lookup plays a more significant role in Kad than in Azureus and Mainline, which use their DHTs only for bootstrapping. Like other DHTs, Kad uses a data replication scheme: object information is stored at multiple nodes (called replica roots). Therefore, a peer can retrieve the information once it finds at least one replica root. However, we observe that 8% of search lookup operations cannot find any replica roots immediately after publishing, which means they are unable to retrieve the information. Even worse, 25% of searches fail to locate the information 10 hours after it is stored. This poor performance is due to inconsistency between the storing and searching lookup processes; Kad lookups for the same object map to an inconsistent set of nodes.
From our measurements, only 18% of the replica roots located by the storing and searching lookup services are the same on average. Moreover, this lookup inconsistency causes an inefficient use of resources. We also find that, for rare objects, 45% of replica roots are never located, and thus never used, by any searching peers. Furthermore, when many peers search for popular information stored by many peers, 85% of replica roots are never used, and a small number of the roots bear the burden of most requests. Therefore, Kad lookup is not reliable, and it wastes resources such as bandwidth and storage on unused replica roots.

Why are the nodes located by publishing and searching lookups inconsistent? Past studies [11, 12] on Kademlia-based networks have claimed that lookup results differ because routing tables are inconsistent due to dynamic node participation (churn) and slow routing table convergence. We question this claim and examine the routing table entries of nodes around a certain key space. Surprisingly, the routing table entries are much more similar among the nodes than expected. Therefore, these nodes return a similar list of their neighbors to be contacted when they receive requests for the key. However, the Kad lookup algorithm does not consider this high level of similarity in routing table entries. As a result, this duplicated contact list limits the number of unique replica roots located around the key.

Consistent lookup enables reliable information search even when some copies of the information are unavailable due to node churn or failure. It can also provide the same level of reliability with a smaller number of replica roots than inconsistent lookup, which means that lookups use resources such as bandwidth and storage efficiently. Furthermore, consistent lookup that locates multiple replica roots provides a means of load balancing. Therefore, we propose algorithms that take the routing table similarity in Kad into account and show how improved lookup consistency affects performance. These solutions can improve lookup consistency to up to 90% and eventually guarantee reliable lookup results while providing efficient resource use and load balancing. Our solutions are completely compatible with existing Kad clients, and thus incrementally deployable.

1.2 Practical Network Coding

Despite their advantages and popularity, existing cooperative content distribution systems suffer decreased performance through poor design, overly high peer turnover, and unforeseen emergent properties of large peer groups.
In a typical P2P content distribution system like BitTorrent [13], content is divided into pieces (or blocks) 3. Peers exchange missing blocks with each other until they collect all the pieces of the content and reconstruct the original content. As soon as a node acquires at least one block,

3 Following the BitTorrent specification, in this thesis a piece refers to a part of the content and a block describes the data exchanged between peers.

the node can offer the received block to others. Due to the direction of content flow, we refer to the uploader and downloader as the upstream and downstream nodes, respectively. This parallelizes downloads, such that peers can simultaneously download different blocks from different nodes, achieving higher throughput [13, 14].

However, this approach also poses significant challenges in the form of scheduling and availability problems. Peers must make decisions about how to upload and download content, which is called block scheduling. These decisions include which blocks they retrieve, from which peers they download blocks, and to which peers they provide blocks. Finding the optimal schedule that minimizes downloading time is difficult, especially when peers make local decisions without relying on central coordination. This is referred to as the block scheduling problem. To illustrate this problem, we take an example modified from [15]. In Figure 1.1(a), peer B is about to complete downloading block X from peer A, and peer C needs to decide which blocks to download from A and B. If C decides to download block X from A, then both B and C will have the same block X. This leads to the problem that the link between B and C cannot be used, and the downloading process of C will be delayed. This problem becomes more difficult in larger-scale P2P systems. To address the block scheduling problem, P2P systems use scheduling schemes such as random and local-rarest-first policies, but their block scheduling is still often inefficient [13, 16].

Figure 1.1: The block scheduling problem: (a) non-coding, (b) network coding

The availability of data blocks may also affect the performance of content distribution systems. P2P networks are dynamic in nature because peers may arrive, depart,

or fail frequently, which is referred to as peer dynamics or churn. When some peers are not available, certain blocks may become rare. Peers missing those blocks must wait a long time for their turn to receive them, and this availability problem makes efficient scheduling harder. Even worse, some blocks may be unavailable altogether when they are held only by peers who happen to be offline. Recall that content cannot be reconstructed when even one block is missing. Therefore, peers fail to download the whole content due to a small portion of missing data.

Applying network coding to P2P systems has been considered a solution to these problems [17, 18, 19, 20, 21, 15, 22, 23, 24]. We will refer to network coding-enabled P2P systems for content distribution as ncp2p. In contrast, P2P systems which do not use network coding will be referred to as non-coding systems or simply P2P-only. The key idea is to allow peers to encode their blocks when sending outgoing data to downstream nodes. In this thesis, we focus on linear network coding [25], which is commonly used in many studies. When a peer uploads a block to another node, it sends a linear combination of some or all of the blocks it has. This way, peers no longer have to fetch a copy of each specific block; rather, a peer simply asks another node to send a coded block, without specifying a block index. In Figure 1.1(b), peer C downloads a linear combination of blocks X and Y from A without worrying about which blocks B will have. Then B and C can exchange blocks with each other, which efficiently uses the link between B and C and minimizes the downloading time. After a downloader receives enough linearly independent blocks, it can reconstruct the original content, eliminating the requirement that each specific block be downloaded individually.
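A minimal sketch of this encode/decode cycle, using random linear coding over GF(2^8) with each coded block carrying its coefficient vector (an illustrative toy, not the implementation used in this thesis):

```python
import random

# GF(2^8) arithmetic (AES polynomial 0x11B) via exp/log tables, generator 3.
EXP, LOG = [0] * 512, [0] * 256
_x = 1
for _i in range(255):
    EXP[_i] = _x
    LOG[_x] = _i
    _y = _x << 1
    if _y & 0x100:
        _y ^= 0x11B
    _x ^= _y                     # multiply by 3: x = x ^ (2*x)
for _i in range(255, 512):
    EXP[_i] = EXP[_i - 255]

def gmul(a, b):
    return 0 if a == 0 or b == 0 else EXP[LOG[a] + LOG[b]]

def ginv(a):
    return EXP[255 - LOG[a]]     # multiplicative inverse, a != 0

def encode(blocks, rng):
    """Produce one coded block: a random linear combination of all blocks."""
    coeffs = [rng.randrange(256) for _ in blocks]
    payload = [0] * len(blocks[0])
    for c, blk in zip(coeffs, blocks):
        for j, byte in enumerate(blk):
            payload[j] ^= gmul(c, byte)
    return coeffs, payload

def decode(coded, n):
    """Recover n original blocks from n coded blocks by Gaussian elimination.
    Raises StopIteration if the coded blocks are linearly dependent."""
    M = [list(c) + list(p) for c, p in coded]
    for col in range(n):
        piv = next(r for r in range(col, n) if M[r][col] != 0)
        M[col], M[piv] = M[piv], M[col]
        inv = ginv(M[col][col])
        M[col] = [gmul(inv, v) for v in M[col]]
        for r in range(n):
            if r != col and M[r][col]:
                f = M[r][col]
                M[r] = [a ^ gmul(f, b) for a, b in zip(M[r], M[col])]
    return [bytes(M[i][n:]) for i in range(n)]

rng = random.Random(0)
original = [b"ABCD", b"EFGH", b"IJKL"]
for _ in range(20):              # retry in the rare case a draw is dependent
    coded = [encode(original, rng) for _ in range(3)]
    try:
        recovered = decode(coded, 3)
        break
    except StopIteration:
        continue
assert recovered == original
```

Note that the downloader never asks for block indices; any three independent coded blocks suffice, which is exactly what removes the block scheduling problem.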
Even when peers holding specific blocks leave the network, other nodes have no difficulty downloading coded blocks from the remaining peers and recovering the original content. Therefore, network coding can potentially provide better robustness and reliability for content distribution.

Despite the benefits of network coding, it has not been widely used in real-world P2P systems for content distribution. There has been some doubt about the performance gains from network coding in practice. In addition, network coding has been blamed for its computational complexity and excessive resource use. In this thesis, we explore the challenges and problems we face when network coding is applied to real-world P2P applications for content distribution. We eventually answer the following questions:

How much can real-world applications benefit from network coding? How much overhead does network coding introduce in real environments? How can we improve network coding in order to make it more practical?

To answer these questions, we chose BitTorrent [13], one of the most popular P2P protocols, as a concrete application. We modified a BitTorrent client [26] to use network coding and measured the performance and overhead of the system. With network coding, the first issue we faced was deciding how to encode data. In linear network coding, a node must decide how many blocks are combined to generate an outgoing block, since encoding time is not trivial. We measured an encoding time of 2 milliseconds to combine two 256KB blocks on commodity hardware. When blocks are stored on disk, encoding time may increase significantly due to disk access delay, which varies depending on numerous factors such as disk speed, disk cache size, available system memory, and the number of page faults. We observed disk access times varying from 30 microseconds to 0.2 seconds to load 256KB of data.

Network coding was originally formulated such that all blocks available to a peer were combined to produce an encoded block [25]. In current content distribution, files are often quite large and consist of hundreds or thousands of blocks (or smaller numbers of blocks but with larger block sizes). If we assume a block size of 256KB for large files, it would take several seconds to encode a single outgoing block. Therefore, it is almost impossible to use this full encoding in practice. To reduce the encoding overhead, peers can use fewer input blocks to generate a coded block, as in a series of schemes [27, 18, 22, 19]. However, this approach generates more dependent blocks, especially when too few input blocks are combined. In linear network coding, the usefulness of data is determined by linear dependency.
These dependent blocks do not contribute useful data to other nodes, since they carry duplicate information from other blocks, thus wasting bandwidth and time. A high level of block dependency delays content propagation, since peers have difficulty locating independent blocks. There is a direct tradeoff between encoding overhead and block dependency, and currently no encoding scheme achieves both low encoding overhead and low block dependency. The primary contribution of our work is the design of i-code, an encoding scheme satisfying both requirements of low encoding overhead and a low level of block dependency.
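The tradeoff can be illustrated with a toy simulation over GF(2), which exaggerates dependency relative to the larger fields used in practice (parameters here are arbitrary): we count how many random coded blocks a downloader must receive before collecting a full-rank, decodable set, comparing dense and sparse coefficient vectors.

```python
import random

def add_to_basis(basis, v):
    """Insert bitmask v into a GF(2) echelon basis; True if v was independent."""
    while v:
        h = v.bit_length() - 1          # leading bit position
        if h not in basis:
            basis[h] = v
            return True
        v ^= basis[h]                   # reduce by the vector with that leading bit
    return False

def draws_until_full_rank(n, k, rng):
    """Draw random coefficient vectors (k nonzero positions; k >= n means dense)
    until n independent ones are collected; return the total number of draws."""
    basis, draws = {}, 0
    while len(basis) < n:
        draws += 1
        if k >= n:
            v = rng.getrandbits(n)      # dense: every coefficient random
        else:
            v = 0
            for p in rng.sample(range(n), k):
                v |= 1 << p             # sparse: only k coefficients nonzero
        add_to_basis(basis, v)
    return draws

rng = random.Random(7)
n, trials = 64, 30
dense = sum(draws_until_full_rank(n, n, rng) for _ in range(trials)) / trials
sparse = sum(draws_until_full_rank(n, 3, rng) for _ in range(trials)) / trials
assert sparse > dense   # sparse coding wastes more draws on dependent blocks
```

Dense vectors need barely more than n draws on average, while very sparse vectors waste many draws on dependent blocks, which is the dependency penalty described above.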

i-code combines only two blocks for every encoding operation, dramatically reducing the encoding overhead. However, it does not suffer the dependent-block penalty faced by encoding schemes which combine few input blocks. The key idea is to emulate an encoding scheme which combines many input blocks. To that end, each peer using i-code maintains a well-mixed block which we call the accumulation block (a). Whenever a peer receives an independent block (w), it updates its accumulation block with the new data (a ← αa + βw, for randomly chosen coefficients α and β). When the peer encodes a new block, it selects a block from its local store and linearly combines it with the accumulation block. Therefore, all blocks the peer has are accumulated into a, and mixing any block with the accumulation block has an effect similar to combining many blocks. We compare the performance and overhead of each network coding scheme and show that i-code exhibits a low level of block dependency comparable to full coding, with significantly less overhead and fewer dependent blocks than sparse coding schemes.

To address the practicality of network coding, we also measure performance and overhead in real environments. Prior studies on the benefits of network coding [15, 28] have been based on simulations or theoretical analyses, and may not reflect real network conditions. Although Gkantsidis et al. provide an implementation in [22], there is no real-world performance comparison between their network coding-enabled implementation and a non-coding system. Other recent work shows potential practical benefits of ncp2p over P2P alone [23, 19, 18], but the experiments were performed with a small number of nodes or small file sizes, in network settings that were overly favorable to network coding, thus limiting the generalizability of the findings.
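The accumulation-block update described above (a ← αa + βw) can be sketched as follows. This is illustrative only: it uses arithmetic over a prime field rather than the field the actual implementation uses, and the class and method names are invented.

```python
import random

P = 2**31 - 1   # prime modulus for illustration; real coders typically use GF(2^8)

class ICodePeer:
    """Sketch of i-code's state: local blocks plus one accumulation block a."""
    def __init__(self, block_len, rng):
        self.rng = rng
        self.blocks = []                  # independent blocks held locally
        self.acc = [0] * block_len        # accumulation block a

    def receive(self, w):
        """On receiving an independent block w, fold it in: a <- alpha*a + beta*w."""
        alpha = self.rng.randrange(1, P)
        beta = self.rng.randrange(1, P)
        self.acc = [(alpha * a + beta * x) % P for a, x in zip(self.acc, w)]
        self.blocks.append(w)

    def encode(self):
        """One local block read + one linear combination per outgoing block,
        yet the result mixes in every block received so far (via a)."""
        blk = self.rng.choice(self.blocks)
        gamma = self.rng.randrange(1, P)
        delta = self.rng.randrange(1, P)
        return [(gamma * a + delta * x) % P for a, x in zip(self.acc, blk)]

rng = random.Random(42)
peer = ICodePeer(4, rng)
peer.receive([1, 2, 3, 4])
peer.receive([5, 6, 7, 8])
out = peer.encode()   # reflects both received blocks despite combining only two
```

The point of the design is visible in `encode`: only one stored block is read and only one combination is computed, yet the coefficients of the outgoing block involve every block ever folded into `acc`.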
With our BitTorrent client using i-code, we provide a thorough empirical comparison between a P2P-only and an ncp2p system, using many nodes communicating over a local-area or wide-area network 4. Experimental results show that the content distribution time of the ncp2p system decreases by 5–21% compared to the P2P-only system (BitTorrent alone), while providing much better reliability and robustness.

4 We use PlanetLab [29] for wide-area network testing.

1.3 Secure Network Coding

P2P content distribution systems run in inherently untrustworthy environments. Some nodes may be malicious and attempt to disrupt the distribution of content. They may launch a number of attacks aimed at disrupting P2P architectures by providing false information, dropping messages, or steering peers to malicious nodes. In this thesis we do not consider such attacks. Instead, we focus on those attacks that arise from the use of network coding for content distribution.

Network coding-enabled P2P systems can provide improved performance and robustness for cooperative content distribution. However, the use of network coding also poses security vulnerabilities by allowing any node to produce new encoded data. Some attackers intentionally send other nodes linearly dependent blocks, which contribute no useful data to the receivers, for the purpose of decreasing the diversity of data in the distribution network [30]. However, this type of attack can be easily detected and prevented. On the other hand, network coding is seriously vulnerable to pollution attacks, and we focus on protecting network coding from this type of attack. In a pollution attack, malicious nodes inject corrupted data into a distribution network by sending other nodes blocks which are not linear combinations of the original content. These corrupted blocks disrupt correct decoding; the original content cannot be reconstructed from the corrupted data. Before decoding, they cannot be filtered out by the traditional hashes and digital signatures commonly used in P2P applications to verify the integrity of content pieces. In BitTorrent, for example, a content source creates a hash of each piece using a hash function and records the hash value in a metadata file which is distributed to downloaders. When a downloader receives a particular block, the hash of the block is compared to the recorded hash to test whether the block has been modified.
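The piece-hash check described above can be sketched as follows (BitTorrent records a SHA-1 digest per piece in the metadata file; the helper name here is illustrative):

```python
import hashlib

def verify_piece(piece: bytes, expected_sha1_hex: str) -> bool:
    """BitTorrent-style integrity check: compare a piece's SHA-1 digest
    against the digest recorded in the metadata (.torrent) file."""
    return hashlib.sha1(piece).hexdigest() == expected_sha1_hex

piece = b"some piece of the file"
recorded = hashlib.sha1(piece).hexdigest()   # published by the source in advance
assert verify_piece(piece, recorded)         # unmodified piece passes
assert not verify_piece(b"tampered", recorded)  # modified data is rejected
```

The check works precisely because the source can compute every piece's digest ahead of time, which is the assumption network coding breaks.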
With network coding, content blocks are encoded in transit by being mixed with other blocks. Because the source cannot provide hashes or signatures of encoded blocks in advance, the traditional methods of checking data integrity do not work with network coding. In addition, the nature of network coding aggravates the problem. If a false block is not filtered, it is combined with other correct blocks during encoding and thus corrupts those outgoing blocks, which will in turn corrupt other blocks at other peers. Even a

single unfiltered false block may propagate through the network while exponentially increasing the number of corrupted blocks. Although the final decoded file can be identified as corrupted with a traditional hash, the bandwidth, storage, and computation time wasted on the invalid file cannot be recovered. Since it is not obvious which block was corrupted, downloaders must retry downloading the entire file, potentially encountering the same pollution. Therefore, P2P content distribution systems must implement a protocol to verify coded data before passing it on to other nodes.

Although several schemes have been proposed for securing network coding against pollution attacks, they require high computational overhead or are not appropriate for P2P systems. The schemes in [31, 32] have relatively higher computational overhead than other schemes because of pairing operations and the cost of signature generation and aggregation. The schemes in [33, 30, 34] do not allow hashes, checksums, or public key information to be distributed before the content is prepared; thus, they are not appropriate for P2P streaming, one type of content distribution. The schemes in [30, 35] require secure channels between peers or use symmetric keys, which cannot be easily realized in P2P environments. The scheme in [36] makes the size of each block variable and becomes inefficient when blocks traverse many nodes; this is not appropriate in P2P systems, where we do not know how many nodes blocks will traverse or how severe the data expansion will eventually become.

Our goal is to answer the question: can secure network coding still provide improved performance with affordable overhead compared to a system which does not use network coding? We first propose an efficient homomorphic signature scheme which verifies encoded data on the fly. It also comes with the desirable features for P2P content distribution mentioned above.
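The following toy illustrates the homomorphic property that such on-the-fly verification relies on. It is not the signature scheme proposed in this thesis, and the parameters are toy-sized: with h(b) = g^b mod p published per block, a verifier can check a linear combination against the published values without ever seeing the original blocks.

```python
# Homomorphic check: h(alpha*x + beta*y) == h(x)^alpha * h(y)^beta (mod p),
# so an intermediate node's coded block can be verified on the fly.
p = 2**61 - 1     # a Mersenne prime; real schemes use far larger, safer groups
g = 3

def h(block_value: int) -> int:
    return pow(g, block_value, p)

x, y = 123456, 789012        # two original "blocks", as integers for illustration
hx, hy = h(x), h(y)          # published by the source before distribution

alpha, beta = 17, 42         # coding coefficients chosen by a relaying node
coded = alpha * x + beta * y # the encoded block it forwards

# A correctly coded block passes the check...
assert h(coded) == (pow(hx, alpha, p) * pow(hy, beta, p)) % p
# ...while a polluted block fails it, so it can be dropped before re-encoding.
assert h(coded + 1) != (pow(hx, alpha, p) * pow(hy, beta, p)) % p
```

Real schemes apply this idea per field element of a block vector and must also bind the published values to the file with a signature; this sketch only shows why verification can survive re-coding in transit.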
We implement our scheme in a real-world BitTorrent client and measure its performance and overhead in real content distribution. We conclude that our secure scheme can protect network coding from attacks with affordable overhead and no delay to the downloading process.

Chapter 2

Reliable and Efficient Lookup

The lookup to find where desired content exists is an essential issue in a P2P system. Distributed hash tables (DHTs) have been proposed as a distributed, efficient, and scalable overlay lookup service for large-scale P2P systems. DHTs can also provide a means of organizing and locating peers for use in higher-level applications in a large peer-to-peer (P2P) network. This potential to be used as a fundamental building block for large-scale distributed systems has led to an enormous body of work on designing highly scalable DHTs. Nevertheless, only a handful of Kademlia-based DHTs have been deployed on an Internet scale. This chapter studies the lookup performance of locating nodes responsible for replicated information, focusing on Kad, a popular Kademlia-based DHT. Section 2.1 gives a brief background on Kad and its lookup. Section 2.2 presents the measurement results of Kad lookup performance. Section 2.3 identifies the factors underlying these observations, and Section 2.4 presents solutions to improve the lookup performance. Section 2.5 provides more insight into the popularity of objects and load balancing in the Kad network. Section 2.6 surveys related work, and Section 2.7 summarizes the chapter.

2.1 Background

2.1.1 Kad

Kad is a Kademlia-based DHT for P2P file sharing. It is widely deployed, with more than 1.5 million simultaneous users [10], and is connected to the popular eDonkey file-sharing network. The aMule and eMule clients are the two most popular clients used to connect to the Kad network. We examine the performance of Kad using aMule (version 2.1.3 at the time of writing), a popular cross-platform open-source project. The other client, eMule, has a similar design and implementation.

Kad organizes participating peers into an overlay network and forms a key space of 128-bit identifiers among peers. (We use peer and node interchangeably in this work.) It virtually places a peer at a position in the key space by assigning a node identifier (Kad ID) to the peer. The distance between two positions in the key space is defined as the value of a bitwise XOR on their corresponding keys. In this sense, the more prefix bits are matched between two keys, the smaller the distance is. Based on this definition, we say that a node is close (or near) to another node or a key if the corresponding XOR distance is small in the key space. Each node takes responsibility for objects whose keys are near its Kad ID.

As a building block for file sharing, Kad provides two fundamental operations: PUT to store a binding in the form of (key, value) and GET to retrieve the value with the key. These operations can be used for storing and retrieving objects for file information. For simplicity, we consider only keyword objects in this work, because almost the same operations are performed in the same way for other objects such as file objects. Consider a file to be shared, its keyword, and keyword objects (or bindings) where key is the hash of the keyword and value is the metadata for the file, stored at a node responsible for the key. Peers who own the file publish the object so that any user can search for the file with the keyword and retrieve the metadata.
From the information in the metadata, users interested in the file can download it. Because a peer responsible for the object might not always be available, Kad replicates data: the binding is stored at r nodes (referred to as replica roots; r = 10 in amule). To prevent binding information from being stored at arbitrary locations, Kad has a search tolerance that limits the set of potential replica roots for a target.
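The XOR metric and replica-root selection described above can be illustrated with a minimal sketch (the helper names below are ours, not from the amule code, and keys are shortened for readability):

```python
def xor_distance(key_a: int, key_b: int) -> int:
    """Kad's distance between two keys is their bitwise XOR."""
    return key_a ^ key_b

def matched_prefix_bits(key_a: int, key_b: int, key_bits: int = 128) -> int:
    """Leading bits shared by two keys; more shared bits = smaller distance."""
    d = xor_distance(key_a, key_b)
    return key_bits if d == 0 else key_bits - d.bit_length()

def replica_roots(target: int, node_ids, r: int = 10):
    """The r nodes with the smallest XOR distance to the target key."""
    return sorted(node_ids, key=lambda n: xor_distance(n, target))[:r]
```

For instance, with 8-bit keys, 0b10100000 and 0b10110000 share 3 prefix bits, and the two closest nodes to target 0 among {9, 1, 4, 2} are 1 and 2.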

2.1.2 Kad Lookup

In both PUT and GET operations, a Kad lookup for a target key T performs the process of locating the nodes responsible for T (nodes near T). In each lookup step, a query is sent to peers closer to target T. Because a lookup in Kad is based on prefix matching, a querying node selects the contacts which have the longest matched prefix length with T. The number of steps in a Kad lookup is bounded by O(log N), and a lookup is performed iteratively: each peer on the way to key T returns the next contacts to the querying node. The querying node approaches the node closest to T by repeating lookup steps until it cannot find any nodes closer to T than those it has already learned in Phase1. In Phase2, the querying node attempts to discover nodes in the surrounding key space to support data replication (Phase1 and Phase2 are named for convenience). Kad sends publish (PUBLISH REQ) and search (SEARCH REQ) requests in Phase2. This is an efficient strategy because the replica roots exist near the target, and searching nodes can locate the replica roots with high probability (explored in detail later in this chapter). This process repeats until a termination condition is reached: a specific amount of binding information is obtained or a time-out occurs.

Figure 2.1 illustrates an example of a simplified GET lookup when node Q searches for key T. Phase1 works as follows:

1. Learning from a routing table: Q picks ("learns") the α contacts (nodes) closest to target T from all the nodes in its routing table (although α = 3 in Kad, Figure 2.1 shows the lookup process with α = 1 for simple illustration). Node X is chosen.

2. Querying the learned nodes: Q queries these chosen nodes (i.e., node X) in parallel by sending KADEMLIA REQ messages for T.

3. Locating queried nodes: each queried node selects the β contacts closest to the target from its routing table and returns them in a KADEMLIA RES message (β is 2 in GET and 4 in PUT). In this example, node X returns Y and Y_other (not shown in the figure). Once a node sends a KADEMLIA RES in response to a KADEMLIA REQ, the node is referred to as a located node.

4. Learning from the queried nodes: Q learns the returned contacts (Y and Y_other) from the queried nodes (X) and picks the α closest contacts (Y) from its learned nodes.

5. Querying next contacts: Q queries the selected nodes (Y).

Figure 2.1: Illustration of a GET lookup

Q repeats these iterations (learning, querying, and locating) until it receives a KADEMLIA RES from A, which is closest to T (it cannot find any other nodes closer to T than A). In Phase2, Q sends SEARCH REQ to nodes which are close to the key, while trying to locate more nodes near T by querying already-learned nodes. Q sends SEARCH REQ to A and KADEMLIA REQ to B. After learning C, Q then sends SEARCH REQ to B and KADEMLIA REQ to C. If nodes have bindings whose key matches target T, they return the bindings. Note that many bindings can be returned, especially for popular keywords. This process is repeated until 300 unique bindings are retrieved or 25 seconds have elapsed since the start of the search.
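The iterative learn/query/locate loop of Phase1 can be sketched as a small simulation over a static map of routing tables (function and parameter names here are ours; the real protocol sends KADEMLIA REQ messages over the network):

```python
def iterative_lookup(start_contacts, target, network, alpha=3, beta=2):
    """Phase1 sketch: repeat until no node closer to the target is learned.
    network maps a node id to that node's routing-table contacts."""
    learned = set(start_contacts)
    queried = set()
    while True:
        # pick the alpha closest contacts not yet queried
        candidates = sorted(learned - queried, key=lambda n: n ^ target)[:alpha]
        if not candidates:
            break
        best = min(learned, key=lambda n: n ^ target)
        for node in candidates:
            queried.add(node)
            # a queried node returns its beta contacts closest to the target
            learned.update(sorted(network.get(node, []),
                                  key=lambda n: n ^ target)[:beta])
        if min(learned, key=lambda n: n ^ target) ^ target >= best ^ target:
            break  # no progress toward the target: Phase1 terminates
    return sorted(learned, key=lambda n: n ^ target)
```

Starting from a distant contact, the loop converges toward the node with the smallest XOR distance to the target, mirroring steps 1-5 above.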

2.2 Evaluation of Kad Lookup Performance

Due to diverse peer behaviors and dynamic network environments, Kad stores binding information at multiple replica roots for reliability and load balancing in storing and retrieving the information. However, without an efficient lookup, this replication could simply waste the bandwidth and storage of the nodes involved. In this section, we evaluate the performance of Kad through a measurement study, focusing on the consistency between lookups. We first describe the experimental setup of our measurements. We then measure the lookup's ability to locate replica roots and see how this ability affects Kad lookup performance.

2.2.1 Experimental Setup

We ran Kad nodes using amule clients on machines having static IP addresses without a firewall or a NAT. Kad IDs of the peers were randomly selected so that the IDs were uniformly distributed over the Kad key space. A publishing peer shared a file named in the format keyword_u.extension (e.g., as3d1f0goa.zx2cv7bn), where keyword_u is a 10-byte randomly generated keyword and extension is a fixed string shared by all our file names, used for identifying our published files. This allows us to publish and search keyword objects of files that do not duplicate existing ones. For each experiment, one node published a file and 32 nodes searched for that file using keyword_u. Our nodes had different Kad IDs and were bootstrapped from different nodes in the Kad network to avoid measuring the performance of a particular key space. We repeated the experiments with more than 30,000 file names. In order to empirically evaluate the lookup performance, we define the following metrics.
Search yield measures the fraction of replica roots found by a GET lookup following a PUT operation, implying how reliably a node can find a desired file:

    search yield = (number of replica roots located by a GET lookup) / (number of published replica roots).

Search success ratio is the fraction of GET operations which retrieve a value for a key

from any replica root located by a search lookup (such GETs are referred to as successful searches), implying whether a node can find a desired object or not:

    search success ratio = (number of successful searches) / (number of total searches).

Search access ratio measures, for each replica root, the fraction of GET lookups which find that particular replica root, implying how likely the replica root is to be found through lookups with the corresponding key:

    search access ratio = (number of searches which locate the replica root) / (number of total searches for the corresponding key).

For load balancing, the distribution of search access ratios among replica roots should not be skewed.

2.2.2 Performance Results

We evaluate the lookup's ability to locate replica roots by measuring the search yield. Then, we show how the search yield affects Kad lookup performance by examining the search success ratio and the search access ratio.

Figure 2.2(a) shows the distribution of the search yield immediately after PUT operations (the "found by each" line). The average search yield is about 18%, meaning that only one or two replica roots are found by a GET lookup (the replication factor is 10 in amule). In addition, about 80% of the lookups locate fewer than 3 replica roots (25% search yield). This result is quite disappointing: it means that 80% of the time one cannot find a published file once these three nodes leave the network, even though 7 more replica roots exist. In Figure 2.2(b), the search yield continuously decreases over a day from 18% to 9%, meaning nodes are less likely to find a desired file as time goes by.

This low search yield directly implies poor Kad lookup performance. A search is successful as long as the lookup finds at least one replica root (i.e., unless the search yield is 0), because binding information can be retrieved from any located replica root. Figure 2.2(c) shows the search success ratio over time.
Immediately after publishing a file, the search success ratio is 92%, implying that in 8% of the experiments we cannot find a published file. This result matches the statistics in Figure 2.2(a), where 8% of searches have a search yield of 0. This result is somewhat surprising since we expected

Figure 2.2: Performance of lookup: (a) search yield (immediately after PUT), (b) search yield (over a 24-hour window), (c) search success ratio (over time), (d) search access ratio (by distance)

that i) there exist at least 10 replica roots near the target, and ii) DHT routing should guarantee finding a published file. Even worse, the search success ratio continuously decreases over a day, from 92% to 67%, before re-publishing occurs. This degradation of the search success ratio over time is caused by churn in the network: in Kad, no other peer takes over the file binding information stored at a node when that node leaves the network. To mitigate this problem, the publishing peer performs a PUT every 24 hours for keyword objects.

Because GET lookups find only a small fraction of replica roots, there must be unused replica roots, as shown in Figure 2.2(a). In the "found by all" line, 55% of replica roots are found by all lookups combined on average, so 45% of replica roots are never located. From this fact, we can conjecture that the replica roots found by each GET lookup largely overlap. This inference is confirmed in Figure 2.2(d), which shows the search access ratio of each replica root. In this figure, nodes on the X-axis are sorted by distance to the target; most lookups locate the two closest replica roots, while the other replica roots are rarely contacted. This distribution of search access ratios indicates that the load on replica roots is highly unbalanced. Overall, the current Kad lookup process cannot efficiently locate more than two replica roots. Thus, resources such as storage and network bandwidth are wasted storing and retrieving replicated binding information.

2.3 Analysis of Poor Lookup Performance

In the previous section, we showed that the poor performance of Kad lookups (18% search yield) is due to inconsistent lookup results. In this section, we analyze the root causes of these lookup inconsistencies.
Previous studies [11, 12] of Kademlia-based networks have blamed membership churn, an inherent part of every file-sharing application, as the main contributing factor to these performance issues. These studies claim that network churn leads to routing table inconsistencies as well as slow routing table convergence, which in turn lead to non-uniform lookup results [11, 12]. We question this claim and identify the underlying reasons for the lookup inconsistency in Kad. First, we analyze the entries within routing tables, focusing on consistency and responsiveness. Next, we dissect the poor performance of Kad lookups

based upon characteristics of routing table entries.

2.3.1 Characterizing Routing Table Entries

In this subsection, we empirically characterize routing table entries in Kad. We first explain the distribution of nodes in the key space, and then examine consistency and responsiveness. By consistency we mean how similar the routing tables of nodes around a target ID are, and by responsiveness we mean how well entries in the routing tables respond when searching nodes query them.

Node Distribution. Kad is known to have 1.5 million concurrent nodes with IDs uniformly distributed [11]. Because the key space is uniformly populated and we know the general size of the network, we can derive n_L, the expected number of nodes that match exactly L prefix bits with the target key. Let N be the number of nodes in the network and n̄_L be the expected number of nodes which match at least L prefix bits with the target key. Then the expected match between any target and the closest node to that target is about log₂ N bits, and n̄_L increases exponentially as L decreases (as nodes get further from the target). Thus, n̄_L and n_L can be computed as follows:

    n̄_L = 2^(log₂ N − L)
    n_L = n̄_L − n̄_(L+1) = 2^(log₂ N − L − 1)

When N is 1.5 million, the expected number of nodes for each matched prefix length is as follows (values computed from the formulas above; the closest node is expected to match about 20.5 bits):

    L       16     17     18     19     20
    n̄_L    22.9   11.4    5.7    2.9    1.4
    n_L     11.4    5.7    2.9    1.4    0.7

Routing Table Collection. To further study Kad, we collected routing table entries of peers located around given targets. We built a crawler that, given a target T, crawls the Kad network looking for all the nodes close to T. If a node matches at least 16 bits with T, its routing table is polled. The threshold of 16 is chosen empirically since there should be about 23 nodes with a matched prefix length of at least 16 bits in Kad (more than twice the number of replica roots). Those nodes are the ones close to T.
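The expected counts used above follow directly from the node-density formulas; a quick numeric sketch (function names are ours):

```python
import math

def n_at_least(N: int, L: int) -> float:
    """Expected nodes matching at least L prefix bits: 2**(log2(N) - L)."""
    return 2 ** (math.log2(N) - L)

def n_exactly(N: int, L: int) -> float:
    """Expected nodes matching exactly L prefix bits."""
    return n_at_least(N, L) - n_at_least(N, L + 1)

N = 1_500_000
# About 23 nodes share at least 16 prefix bits with any target,
# and the closest node matches about 20.5 bits on average.
print(round(n_at_least(N, 16), 1))
print(round(math.log2(N), 2))
```

This reproduces the crawler threshold rationale: roughly 23 nodes lie within 16 matched prefix bits of a target when N is 1.5 million.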

Polling routing tables is performed by sending the same node multiple KADEMLIA REQ messages for different target IDs. Each node then returns the routing table entries that are closest to these target IDs, so a node's whole routing table can be obtained by sending many KADEMLIA REQ messages. For every node found or polled, a HELLO REQ is sent to determine whether that node is alive. For this study, we select more than 600 random target IDs and retrieve the routing tables of approximately 10,000 distinct Kad peers. We then examine the two properties mentioned above: consistency and responsiveness.

View Similarity. We measure the similarity of routing tables. Let P be the set of peers close to the target ID T; a node Z is added to P if the matched prefix length of Z with T is at least 16. We define a peer's view v of T as the set of the k closest entries in the peer's routing table, because when queried, peers select the k closest entries from their routing tables and return them. We selected 2, 4, and 10 as k because 2 is the number of contacts returned for SEARCH REQ, 4 for PUBLISH REQ, and 10 for FIND NODE. We measure the distance d (or the difference) between the views v_x and v_y of two peers x and y in P as

    d(v_x, v_y) = (|v_x − v_y| + |v_y − v_x|) / (|v_x| + |v_y|)

where |v_x| is the number of entries in v_x. d(v_x, v_y) is 1 when all entries are different and 0 when they are the same. The similarity of views of the target is defined as 1 − dissimilarity, where dissimilarity is the average distance among the views of peers in P. The level of this similarity indicates how similar the close-to-T entries in the routing tables of nodes around the target T are. For simplicity, we call this the similarity of routing table entries.

Figure 2.3(a) shows that the average similarity of routing table entries is 70%, based on comparisons of all nodes in P. This means that between any two routing tables of nodes in P, 70% of the close-to-T entries are identical.
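The view-distance metric translates directly into set differences; a minimal sketch (helper names are ours):

```python
def view_distance(vx: set, vy: set) -> float:
    """d(vx, vy) = (|vx - vy| + |vy - vx|) / (|vx| + |vy|):
    0 when the two views are identical, 1 when fully disjoint."""
    if not vx and not vy:
        return 0.0
    return (len(vx - vy) + len(vy - vx)) / (len(vx) + len(vy))

def similarity(views) -> float:
    """1 - dissimilarity, where dissimilarity is the average pairwise
    view distance over all peers' views."""
    pairs = [(a, b) for i, a in enumerate(views) for b in views[i + 1:]]
    if not pairs:
        return 1.0
    return 1 - sum(view_distance(a, b) for a, b in pairs) / len(pairs)
```

For example, two views sharing half their entries ({1, 2} and {1, 3}) have distance 0.5, so the similarity of that pair is 0.5.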
Therefore, peers return similar, duplicated entries when a searching node queries them for T. The high similarity values indicate that the closest node has a view of the target similar to that of the other close nodes in P.

Responsiveness. In Figure 2.3(b), we examine the number of responsive (live) contacts normalized by the total number of contacts close to a given target key. The result

Figure 2.3: Statistics on routing tables: (a) similarity among all nodes, (b) response ratio of nodes

shows that around 80% of the entries in the routing tables respond to our requests, up to a matched prefix length of 15. The fraction of responsive contacts decreases as the matched prefix length increases because, in the current amule/emule implementations, peers do not check the liveness of other peers close to their Kad IDs as often as nodes further away [11].

2.3.2 Analysis of Lookup Inconsistency

In the previous subsection, we observed that the routing table entries of nodes are similar and that only about half of the nodes near a specific ID are alive. From this observation, we investigate why Kad lookups are inconsistent and then present analytical results.

We explain why Kad lookups are inconsistent using the example shown in Figure 2.4. A number (say k) in a circle means that the node is the k-th closest node to the target key T in the network; only nodes located by the querying nodes are shown. We first see how the high similarity of routing tables affects the ability to locate nodes close to T. Peers close to T have similar close-to-T contacts in their routing tables. Thus, the same contacts are returned multiple times in KADEMLIA RES messages, and the number of learned nodes is small. In Figure 2.4(a), node Q learns only the two closest nodes because all queried nodes return node 1 and node 2. This failure to locate nodes close to a target causes inconsistency between lookups for PUT and GET. A publishing node finds only a small fraction of the nodes close to the

target. In Figure 2.4(b), node P locates the three closest nodes (nodes 1, 2, and 3) as well as less useful nodes farther from the target T. Node P then publishes to the r closest nodes among these located nodes, assuming that those nodes are the very closest to the target (r = 10, but only 6 nodes are shown in the figure). Note that some replica roots (e.g., node 37) are actually far from T while many closer nodes exist. Similarly, searching nodes (Q1 and Q2) find only a subset of the actual closest nodes. These querying nodes then send SEARCH REQ to the located nodes (referred to as search-tried nodes). However, only a small fraction of the search-tried nodes are replica roots (referred to as search-found nodes). From this example, we can clearly see that the querying nodes obtain binding information only from the two closest nodes (node 1 and node 2) out of 10 replica roots.

Figure 2.4: Illustration of how a lookup can be inconsistent

We next present analytical results supporting our reasoning about inconsistent Kad lookups. Figure 2.5 shows the average number of different types of nodes at each matched prefix length for PUT and GET. The "existing" line shows the number of nodes found by our crawler at each prefix length and matches the expected numbers provided in the previous subsection. The "duplicately-learned" line shows the total

Figure 2.5: Number of nodes at each distance from a target: (a) PUT, (b) GET

number of nodes learned by a searching node including duplicates, and the "uniquely-learned" line represents the number of distinct nodes found, without duplicates. When a node is included in 3 KADEMLIA RES messages, it is counted as 3 in the "duplicately-learned" line and as 1 in the "uniquely-learned" line. We can see that some nodes very close to T are returned multiple times when a querying node sends KADEMLIA REQ messages. In other words, for nodes very close to T, the number of uniquely-learned nodes is much smaller than the number of duplicately-learned nodes. For instance, there is one existing node at a matched prefix length of 20 (the "uniquely-learned" line), and it is returned to a querying node 5 times in PUT and 3.8 times in GET (the "duplicately-learned" lines). To further compound the issue, the number of located nodes is half that of uniquely-learned nodes because, on average, 50% of the entries in the routing tables are stale; in other words, half of the learned contacts no longer exist in the network. As a result, a PUT lookup locates only 8.3 nodes and a GET lookup only 4.5 nodes out of the 23 live nodes which match at least 16 prefix bits with the target. Thus, duplicate contact lists and stale (dead) routing table entries cause a Kad lookup to locate only a small number of the existing nodes close to the target.

Since the closest nodes are not located, PUT and GET operations are inadvertently performed far from the target. Figure 2.6 shows the average number of published (denoted p_L), search-tried (denoted s_L), and search-found (denoted f_L)

Figure 2.6: Number of replica roots at each distance from a target

nodes for each matched prefix length L. We clearly see that more than half of the published and search-tried nodes match fewer than 17 bits with the target key. We can formulate the expected number of replica roots E[f_L] located by a GET lookup for each L. Let N be the number of nodes in the network and n_L be the expected number of nodes which match exactly L prefix bits with the target key. Then f_L is computed as follows:

    E[f_L] = s_L · (p_L / n_L) = s_L · p_L / 2^(log₂ N − L − 1)

The computed values of E[f_L] match the f_L values from the experiments shown in Figure 2.6. From the formula, E[f_L] shrinks as L decreases because n_L increases exponentially. Thus, although a GET lookup is able to find some of the closest nodes to a target, not all of these nodes are replica roots, because a PUT operation publishes binding information to some nodes really far from the target as well as to nodes close to the target. For a GET lookup to find all the replica roots, that is, all the nodes located by PUT, the GET operation would have to contact exactly the same nodes, which is highly unlikely. This is the reason for the lookup inconsistency between PUT and GET operations.
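The expectation above is straightforward to evaluate; a sketch under the uniform-ID assumption (function name is ours):

```python
import math

def expected_search_found(s_L: float, p_L: float, N: int, L: int) -> float:
    """E[f_L] = s_L * p_L / n_L, where n_L = 2**(log2(N) - L - 1) is the
    expected number of nodes matching exactly L prefix bits."""
    n_L = 2 ** (math.log2(N) - L - 1)
    return s_L * p_L / n_L
```

Intuitively, at low L the pool n_L of candidate nodes is huge, so the chance that a search-tried node happens to be one of the published nodes is tiny; only very close to the target do the two sets overlap.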

2.4 Improvements

We saw in Section 2.2 how the lookup inconsistency problem affects lookup performance: it limits lookup reliability and wastes resources. In this section, we describe several possible solutions to increase lookup consistency, show how well the proposed solutions improve Kad lookup performance, and evaluate their overhead.

2.4.1 Solutions

Tuning Kad parameters. Tuning the parameters of Kad lookups is a first, straightforward attempt to improve performance. The number of replica roots (r = 10) can be increased. Although this change could slightly improve performance, it remains ineffective because close nodes are still not located and replica roots far from the target still exist. The timeout value (t = 3 seconds) for each request can also be decreased. We do not believe this will be useful either, since it results in more queries being sent and more duplicates being received. The number of returned contacts in each KADEMLIA RES can also be increased (β = 2 for GET and β = 4 for PUT). Suppose that 20 contacts are returned in each KADEMLIA RES; then 20 nodes close to a target can be located (if all contacts are alive) even though returned contacts are duplicated. However, this increases the size of messages by a factor of 10 for GET (5 for PUT). Finally, the number of contacts queried at each iteration (α = 3) can be increased, which raises the ability to find more replica roots. However, this approach results in more messages sent and even more duplicate contacts received.

Querying only the closest node (Fix1). This solution exploits the high similarity of routing table entries: after finding the node closest to a particular target, a peer asks it for its 20 contacts closest to the target.
From our experimental results, a lookup finds the closest node with 90% probability, and always locates at least one node which matches at least 16 prefix bits with the target. Therefore, the expected search yield is 0.9 × 1 + 0.1 × 0.7 = 0.97 (a 90% chance of finding the closest node, from Figure 2.2(d); a 10% chance of not finding it; and 70% similarity among routing table entries, from Section 2.3). We note that this simple solution comes

as a direct result of our measurements and analysis.

Avoiding duplicates by changing target IDs (Fix2). Because of the routing table similarity, duplicate contacts are returned by queried nodes, which ultimately limits the number of located nodes close to a target. To address this problem, we propose Fix2, which can locate enough of the closest nodes to a target.

Figure 2.7: Lookup algorithm for Fix2

Our new lookup algorithm is illustrated in Figure 2.7, in which peer Q attempts to locate nodes surrounding target T. Assume that the nodes (A, B, ..., F) close to target T have the same entries around T in their routing tables and that all entries exist in the network. We extend the KADEMLIA REQ notation with a target: KADEMLIA REQ(T) asks a queried node to select the β contacts closest to target T and return them in a KADEMLIA RES. In the original Kad, Q receives duplicate contacts when it sends KADEMLIA REQ(T) to multiple nodes; in a current Kad GET lookup (β = 2), only three contacts (A, B, and C) would be returned. However, Fix2 can learn more contacts by manipulating the target identifiers in KADEMLIA REQ. Once the closest node A is located (i.e., Phase2 is initiated; see Section 2.1), Q sends KADEMLIA REQ with the target ID replaced by other learned node IDs: Q sends KADEMLIA REQ(T′) instead of KADEMLIA REQ(T), where T′ ∈ {B, C, ..., F}. The queried nodes then return the contacts (neighbors) closest to themselves. In this way, Q can locate most of the nodes close to the real target T.

In order to effectively exploit Fix2, we separate the lookup procedures for PUT and GET. These operations have different requirements according to their individual purposes: GET requires low delay in order to satisfy users, while PUT requires publishing the file information where other peers can easily find it (it does not require low delay).
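Fix2's target substitution can be sketched as follows (function names and the RPC stand-in are ours, not amule's): after Phase1 finds the closest node, Q re-queries the learned close nodes using each other learned node's ID as the target, so queried peers return neighbors around different points instead of the same few close-to-T duplicates.

```python
def fix2_phase2(learned_close, send_kademlia_req):
    """learned_close: ids of nodes learned near target T.
    send_kademlia_req(node, target) -> contacts that `node` returns
    closest to `target` (a stand-in for the real KADEMLIA REQ RPC)."""
    located = set(learned_close)
    for node in learned_close:
        for t_prime in learned_close:
            if t_prime != node:  # substitute target: T' in {B, C, ..., F}
                located.update(send_kademlia_req(node, t_prime))
    return located
```

Because each substituted target sits at a slightly different point of the key space, the union of the responses covers far more of the nodes surrounding T than repeated queries for T itself.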

However, Kad has identical lookup algorithms for both PUT and GET, where a publishing peer starts PUT as soon as Phase2 is initiated, even when most of the close nodes have not been located. This causes copies of bindings to be stored far from the target. Therefore, we modify only the PUT lookup, delaying the sending of PUBLISH REQ until enough nodes close to the target are located, while GET is performed without delay. In our implementation, we wait one minute before performing a PUT operation (the average time to send the last PUBLISH REQ is 50 seconds in our experiments), expecting that most of the close nodes are located during that time.

2.4.2 Performance Comparisons

We next compare the performance improvements of the proposed algorithms. With the results obtained from the same experiments described in Section 2.2, we show that our solutions significantly improve lookup performance.

Figure 2.8: Lookup improvement (search yield)

Search yield clearly exposes the lookup consistency problem. Figure 2.8 shows the search yield for each solution. Simply tuning parameters (number of replica roots, timeout value, α, β) produces search yields of 35%-42%. Fix1 achieves 90% on average, slightly less than expected because some replica roots leave the network or do not respond to GET requests. Fix2 improves

the search yield to 80% on average, but provides more reliable and consistent results: for a search yield of 0.4, 99% of Fix2 lookups have higher search yields, compared to 95% of Fix1 lookups. Since Fix1 relies only on the closest node, its lookup results may differ when the closest node changes (due to churn). This can be observed when a new node closer to the target churns in, because it may have routing table entries different from those of the other nodes close to it.

Figure 2.9: Lookup overhead: (a) PUT, (b) GET

We next look at the overhead in the number of messages sent for both PUT and GET operations. The number of messages sent by each algorithm for PUT is shown in Figure 2.9(a). Fix1 and Fix2 use 72% and 85% fewer messages, respectively, because the current Kad lookup contacts more nodes than the proposed algorithms. After reaching the node closest to a target, the current Kad lookup locates only a small fraction of close nodes in Phase2 (the number of nodes found within the search tolerance is fewer than 10). Thus, the querying node repeats Phase1 and contacts nodes further from the target until it can find more than 10 nodes within the search tolerance.

The overhead of parameter tuning is higher than the original Kad implementation, as expected. Increasing the number of replica roots implies that 20 replica roots need to be found. Since it is already difficult (having to restart Phase1) to find 10 replica roots, it is even more difficult to find 20, so the number of messages sent in PUT is much higher than for the original. Contacting more nodes at each iteration

(increasing α from 3 to 6) increases the number of messages sent, and shortening the timeout (from 3 seconds to 1) incurs a similar overhead. However, this overhead is not as high as that of increasing the number of replica roots: when r is increased, Phase1 is restarted several times, because the Kad lookup process already has difficulty locating 10 replica roots, and trying to locate 20 means that Phase1 has to take place even more often.

The message overhead for GET operations is shown in Figure 2.9(b). Fix1 and Fix2 sent several times more messages than the current Kad lookup. In the current Kad implementation, only a few contacts out of the learned nodes are queried during Phase2, so few KADEMLIA REQ and SEARCH REQ messages are sent. Even if the original Kad lookup were altered to send more requests, this would not increase the search yield, because duplicate answers cause messages to be wasted contacting nodes far from the target. Increasing the number of replica roots to 20 uses roughly the same number of messages as the original for GET, because the number of replica roots does not affect the search lookup process. Increasing the number of contacts queried at each iteration (α), however, does increase the number of messages sent in GET, because 6 nodes are queried instead of 3 (a shorter timeout has a similar overhead). The overhead of this tweaking is even higher than Fix1 or Fix2 because our algorithms increase α only after finding the closest node.

Figure 2.10: Lookup performance over time: (a) search yield, (b) search success ratio

Fix1 and Fix2 produce much higher performance than the parameter-tuning solutions. Moreover, the overhead of these two solutions is lower than the original for PUT and only slightly higher for GET, while the overhead of the other solutions is much higher. We next compare only these two algorithms, Fix1 and Fix2, as they are the most promising ones. Figure 2.10(a) shows that the search yield of both algorithms decreases over time because of churn, but remains well above that of the original Kad lookup. Although they show very similar performance levels, the variation in Fix1's performance is slightly higher than Fix2's, due to the possibility of a closer node churning in. Thanks to the high search yield, both Fix1 and Fix2 enable a peer to find a desired object at any time with a higher probability than the original Kad lookup. In Figure 2.10(b), the search success ratios for our proposed algorithms are almost 1 after publishing, while the ratio for the original Kad is 0.92. Even after 20 hours, the ratios for our solutions are 0.96, while the ratio for the original Kad is considerably lower.

Overall, Fix1 and Fix2 significantly improve the performance of the Kad lookup process with little overhead in terms of extra messages compared to the other candidate algorithms and the original implementation. Fix1 is simple and can be used in environments with high routing table consistency. Its downside is that it is not as reliable as Fix2 in some cases: suppose a new node joins and becomes the closest node, but its routing table entries close to the target are not the replica roots recorded in the old closest node's routing table; then a GET operation might not find those replica roots. In Fix2, by contrast, a querying client can locate most of the closest nodes around a target even if the old closest node leaves the network or a joining node becomes the closest node. Therefore, Fix2 can be used for applications which require strong reliability and robustness.
2.5 Object Popularity and Load Balancing

Many peers publish or search popular objects (or keywords such as "love"), and the nodes responsible for those objects receive a large number of requests. To examine the severity of this load-balancing issue, we perform experiments on lookups for popular objects in the Kad network. The experiments comprise two steps: i) finding most of the replica roots of popular objects in the network using our crawler, and ii) examining the

number of replica roots located by Kad lookups. We select objects whose names match keywords extracted from the 100 most popular items in Pirate Bay [37] on April 5. We modify the crawler used for collecting routing table entries so that it can send SEARCH REQ messages. We consider a node to be a replica root if it returns binding information matching a particular target keyword. Then, we run 420 clients to search bindings for the objects using those keywords.

Figure 2.11: (a) Lookup with real popular objects. (b) Original Kad lookup for our objects. (c) New Kad lookup for our objects. (d) Load for each prefix bit for real popular objects and our objects.

We evaluate Kad lookup performance by investigating the number of replica roots located by Kad searches. First, we examine whether a client was able to retrieve bindings. In

the experiments, each client could find at least one replica root and retrieve binding information; the search success ratio was 1. Next, we discuss whether Kad lookups use resources efficiently. Figure 2.11(a) shows the average number of replica roots located by all clients at each matched prefix length. The existing line represents the actual replica roots observed by our crawler. The distinctly-found line counts each unique replica root once, while the duplicately-found line includes duplicates. For example, when one replica root is located by 10 clients, it is counted as 1 in the distinctly-found line but as 10 in the duplicately-found line. Overall, our results indicate that 85% of all replica roots were never located during search lookups and therefore never provided bindings to the clients. Our crawler found a total of 598 replica roots for each keyword on average, but our clients located only 93 replica roots during the searches, which is only 15% of the total. Furthermore, we observed a load-balancing problem in Kad lookups. Most of the unlocated replica roots are far from the target (low matched prefix length): at matched prefix length 11, only 10 out of 121 replica roots were located. On the other hand, nodes close to the target were always located but received requests from many clients. At matched prefix length 20 or more ("20+" in the figure), there were only 1.4 unique replica roots (in both the existing and distinctly-found lines), implying that all those replica roots were located by clients. However, there were 201 duplicately-found roots, which means that one replica root received search requests from 141 clients on average.
To better illustrate the load-balancing problem, we define the average lookup overhead of replica roots at matched prefix length L as:

Load_L = (number of duplicately-found replica roots) / (number of existing replica roots)

A high Load_L value means that the nodes at matched prefix length L received many search requests. The real line in Figure 2.11(d) shows the load for the above experiments. The load was high at high matched prefix lengths (replica roots close to the target) while it was close to 0 for nodes far from the target (low matched prefix lengths). This result indicates that i) Kad is not using replica roots efficiently, and ii) the nodes closest to the target bear the burden of most of the search requests. This problem can be explained by two factors in Kad. First, a querying node sends

SEARCH REQ messages starting from the closest node and moving to nodes farther from the target; thus, the closest node receives most of the requests. Second, due to the termination condition in Kad, the search stops once 300 results (objects) have been received (recall that a replica root can return more than one result). Even though more replica roots store binding information for a certain object, the search process stops without contacting them because 300 objects have already been returned by the few replica roots contacted. To address this load-balancing problem, we propose a new solution which satisfies the following requirements: i) balance the load of search lookups, and ii) produce a high search yield for both rare and popular objects. The solution works as follows. A querying node attempts to retrieve binding information starting far from the target ID. Suppose that querying node Q sends a KADEMLIA REQ to node A, which is within the search tolerance for target T. In addition to returning a list of peers (containing the nodes closest to T that A knows about), A sends a piggybacked bit informing Q whether it has binding information for T, that is, whether A is a replica root for T. If A sets this bit, Q sends a SEARCH REQ with a list of keywords to A, and A returns any bindings of objects matching all the keywords. When many replica roots publish popular objects, Q has a good chance of retrieving enough bindings from replica roots that are not close to T, so Q does not have to contact the replica roots close to the target. This lookup can reduce the load on the nodes closest to a target with only a 1-bit communication overhead. To exploit the new lookup solution, it is important to decide where to publish objects, that is, which nodes will be replica roots. Some nodes very close to a target ID should clearly be replica roots.
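The piggyback-bit GET described above can be sketched as follows. This is a simplified model, not the modified client: the peer list, the 1-bit flag, and the function name are all illustrative, and the peers are assumed to be pre-sorted from farthest to closest to the target.

```python
# Sketch of the proposed load-balancing GET (all names are illustrative).
# Each KADEMLIA REQ reply carries one extra bit: "I am a replica root for T".
# The querying node starts FAR from the target, so far-away replica roots
# can satisfy the request before the nodes closest to the target are hit.

def piggyback_get(peers, enough=300):
    """`peers`: (node_id, is_replica_root, bindings) tuples, pre-sorted from
    farthest to closest to the target. Returns the bindings collected before
    the Kad-style termination condition (`enough` results) is reached."""
    results = []
    for node_id, is_root, bindings in peers:
        # The KADEMLIA REQ reply includes the 1-bit replica-root flag.
        if is_root:
            # Only now is a SEARCH REQ worth sending to this peer.
            results.extend(bindings)
        if len(results) >= enough:
            break  # enough objects collected; closer nodes are never queried
    return results
```

When far-away replica roots hold enough bindings, the loop terminates before reaching the nodes closest to the target, which is exactly how the hot spots are spared.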
This guarantees a high search yield even if only a small number of nodes publish the same objects ("rare" objects) because, as we have previously shown, the closest nodes are almost always found. Moreover, it is desirable that some nodes far from the target also be replica roots so that they can provide binding information earlier in the lookup process. This lessens the load on the closest replica roots and provides a shorter GET delay to querying nodes. In the new PUT operation, a publishing peer locates most of the closest nodes using Fix2 and obtains a node index by sorting these nodes according to their distance to the target ID. The publishing node then sends the i-th closest node a PUBLISH REQ with probability p = 1/⌈i/5⌉. This heuristic guarantees

that objects are published to the five closest nodes as well as to nodes farther from the target. We implemented our proposed solution and ran experiments to determine whether it met our requirements for both PUT and GET. The same experiments from Section 2.2 were performed with the new solution. We repeated the experiments varying the number of files to be published, but we present only the experimental results similar to those of the real network when the original Kad lookups were used. We observed a search success ratio of 62% for rare objects and almost 100% for popular objects. We next examined whether our algorithm mitigated the load-balancing problem. In the experiment, 500 nodes published about 2150 different files with the same keyword, and another 500 nodes searched those files with that keyword. The experiments were repeated with 50 different keywords. To show that our experiments emulated real popular objects in Kad, we tested both the original Kad lookup algorithm and our solution for comparison. In Figure 2.11(d), the original line shows the results obtained with the original Kad algorithm. As expected, these results were similar to what we obtained from the real network. The number of replica roots located by our proposed Kad lookup solution is shown in Figure 2.11(c); for comparison, the number located by the unmodified Kad lookup is shown in Figure 2.11(b). With our solution, more replica roots were found (in both the duplicately-found and distinctly-found lines) farther from the target than with the original Kad lookup. At matched prefix length 11, 48 out of 101 replica roots were located using our solution while only 10 out of 91 were located using the original algorithm. The new line in Figure 2.11(d) shows that the load was shared more evenly across all the replica roots with our solution. At matched prefix length 20 or more, the load decreased by 22%.
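Assuming the 1/⌈i/5⌉ reading of the publish-probability heuristic (the exact formula is garbled in this transcription, so treat it as an assumption), the PUT-side selection can be sketched as:

```python
import math
import random

def publish_targets(sorted_nodes, rng=random.Random(1)):
    """Given candidate replica roots sorted by distance to the target
    (closest first, 1-indexed), return those that receive a PUBLISH REQ.
    Assumed heuristic: the i-th closest node is chosen with probability
    p = 1/ceil(i/5), so the five closest nodes always publish and the
    probability decays for nodes farther from the target."""
    chosen = []
    for i, node in enumerate(sorted_nodes, start=1):
        p = 1.0 / math.ceil(i / 5)
        if rng.random() < p:  # random() < 1.0 always holds for i <= 5
            chosen.append(node)
    return chosen
```

The deterministic head of the selection preserves the search yield for rare objects, while the probabilistic tail seeds replica roots that far-starting GETs can hit first.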
In summary, our experimental results show that the proposed solution guarantees a high search yield for both rare and popular objects and further mitigates the load-balancing problem in lookups for popular objects.

2.6 Related Work

Kad is a DHT based on the Kademlia protocol [7] that uses a different lookup strategy than other DHTs such as Chord [4] and Pastry [5]. The main difference between Chord

and Kademlia is that Chord has a root for every key (node ID). When a querying node finds that root, it can locate most of the replica roots, since every node keeps track of its next closest node (its successor). In Pastry [5], each node has an ID, and the node with the ID numerically closest to the key is in charge. Since each node also keeps track of its neighbors, once the closest node is found, the other replica roots can also be found. Thus, Chord and Pastry do not suffer from the same problems as Kad. We note that simply replacing the Kad algorithm with Chord or Pastry is not a suitable solution, as Kad contains some intrinsic properties, inherited from Kademlia, that neither Chord nor Pastry possesses; for example, Kad IDs are symmetric whereas Chord IDs are not. The Pastry algorithm can return nodes far from the target due to its switch in distance metrics. Moreover, Kad is widely used by over 1.5 million concurrent users, whereas it has never been shown that Chord or Pastry can work on networks of this scale. Since Kad is one of the largest deployed P2P networks, several studies have measured various properties and features of the Kad network. Steiner et al. [10, 38] crawled the whole Kad network, estimated the network size, and showed the distribution of node IDs over the Kad key space. More recently, in [39], the authors analyzed Kad lookup latency and proposed changing the configuration parameters (timeout, α, β) to improve it. Our work differs in that we measured lookup performance in terms of reliability and load balancing, and identified some fundamental causes of the poor performance. Stutzbach et al. [11] and Falkner et al. [12] studied networks based on the Kademlia DHT algorithm using eMule and Azureus clients, respectively. They argued that the lookup inconsistency problem is caused by churn and slow routing table convergence.
However, our detailed analysis of lookups clearly shows that the lookup inconsistency problem is caused by the lookup algorithm, which fails to account for duplicate returns from nodes whose routing tables hold consistent views. Furthermore, those authors proposed changing the number of replica roots as a solution, but our experiments indicate that simply increasing the replication factor is not an efficient solution. We instead propose two incrementally deployable algorithms which significantly improve lookup performance, together with a solution that mitigates the load-balancing problem. Prior work on the lookup inconsistency is thus incomplete and limited.

Freedman et al. [40] considered problems in DHTs (Kad included) due to non-transitivity in the Internet. However, non-transitivity impacts lookup performance only slightly since, in essence, it can be considered a form of churn in the network. We already accounted for churn in our analysis and showed that churn is only a minor factor in the poor Kad lookup performance.

2.7 Summary

Distributed hash tables (DHTs) have been proposed as a distributed and scalable lookup service that allows users to find the content they are searching for in P2P networks or large-scale distributed systems. We have measured the performance of the Kademlia DHT lookup in Kad, which is deployed in one of the largest P2P file-sharing networks. In this chapter we addressed the following problems.

Search failure: Our measurement study shows that 8-30% of Kad lookups fail to find the desired content. This poor performance is due to the inconsistency between storing and searching lookups; only 18% of replica roots are located by searching lookups on average. We found that the Kad lookup algorithm does not work well when routing tables are much more converged for a given target than expected. Taking this property into account, we proposed two solutions: one simply obtains lists of contacts from the node closest to the key, and the other avoids duplicate returns by asking peers about the nodes closest to themselves instead of to the target key. The new lookups can find the desired content with a probability of more than 95%.

Inefficient resource use and poor load balancing: DHTs usually store object information on multiple nodes called replica roots. When many peers search for popular information stored by many peers, 85% of replica roots are never used and only a small number of the roots bear the burden of most requests.
We found that the lookup process fails to locate most of the nodes close to a target (except a few duplicate contact nodes) and stores information in places where other peers are unlikely to find it. In our solution, a publisher locates most of the closest nodes around a target and arranges replica roots so that other peers can find replica roots before they reach the hot spots.

Chapter 3

Practical Network Coding

P2P applications for cooperative content distribution have recently gained popularity. Despite their success, they still suffer from inefficiency and reliability problems. To improve distribution speed and resolve the data availability problem, many studies have considered applying network coding to content distribution. Network coding is a data transmission technique which allows any node to encode data. In content distribution, peers no longer have to fetch a copy of each specific block; they simply ask another node to send a coded block, without specifying a block index. This technique can alleviate the block scheduling problem in a large-scale P2P system. Furthermore, peers do not suffer from the data availability problem because they can reconstruct the original content after receiving enough blocks, without relying on the existence of specific content blocks. In spite of its benefits, network coding has not been widely used in real-world P2P systems. The usefulness of network coding is still disputed because of its questionable performance gains and coding overhead in practice. In this chapter, we study the practicality of network coding by measuring the performance and overhead of network coding in a real-world application. We also provide a new encoding scheme which makes network coding more practical. Section 3.1 introduces the basic concepts of and real-world performance issues in cooperative content distribution systems and linear network coding. Section 3.2 describes our practical network coding system with our novel encoding scheme, and Section 3.3 provides the results of real-world tests with our implementation. Section 3.4 discusses related work and Section 3.5 summarizes the chapter.

3.1 Preliminaries

3.1.1 Cooperative Content Distribution

As a concrete real-world application of cooperative content distribution, we consider BitTorrent [13], the most popular P2P file-sharing protocol. It has been reported that BitTorrent traffic amounted to 30-80% of P2P traffic and 20-55% of all Internet traffic as of 2008 and 2009 [41]. We describe the process of file distribution in BitTorrent which facilitates fast downloading. When a user wants to distribute a file, the user's client divides the file into smaller pieces. It then creates metadata (called a torrent file) which includes information such as the name of the content being shared, its total size, the hashes of the pieces, and the address of a tracker. The tracker is a central node which keeps a list of the peers participating in the distribution of the file (called a swarm). A user who wants to download the content first fetches the torrent file and contacts the tracker. During this bootstrap process, the tracker responds with a list containing a subset of the peers in the swarm. Peers exchange BitField and HAVE messages to indicate which pieces they have. A BitField message includes a series of bits mapped to the pieces a peer has, and peers send HAVE messages whenever they finish receiving a new piece. Based on these messages, a peer knows which nodes have its missing blocks and asks them to send those blocks.¹ For fairness, BitTorrent systems enforce a tit-for-tat exchange of content between peers. A peer typically chooses a recipient of content based on the upload rates of other nodes. This policy encourages uploading and discourages free-riding, i.e., downloading content without uploading data to other nodes. Depending on downloading status, BitTorrent defines two types of peers: seeders and leechers. Seeders own a complete copy of the content and share it, while leechers have either no pieces or an incomplete set and are unable to reconstruct the content without additional pieces.
¹ In this thesis, a piece refers to a part of the content, and a block describes the data exchanged between peers.

3.1.2 Random Linear Network Coding

In this thesis, we consider the popular linear network coding design [25], which is simple to implement and has been proven to achieve maximum throughput. In this scheme, a file is divided into m pieces, each represented as n elements of a finite field F_p of size p, where p is prime. The i-th piece can then be considered a vector ũ_i = (u_{i,1}, ..., u_{i,n}) ∈ F_p^n, and the file becomes a sequence of vectors (ũ_1, ..., ũ_m). When the original content source performs encoding with respect to a vector of coefficients (α_1, ..., α_m) (referred to as the encoding vector), it computes an information vector, i.e., a linear combination of ũ_1, ..., ũ_m:

w = Σ_{i=1}^{m} α_i ũ_i = (w_1, ..., w_n).    (3.1)

(The choice of encoding vectors depends on the type of coding and can be a global parameter, but in random linear network coding each node independently chooses encoding vectors at random.) The source then sends the encoding vector and information vector together in an augmented block (or simply block) of the form (α_1, ..., α_m, w_1, ..., w_n). Any node, when asked for an augmented block by a peer, sends a linear combination Σ_j β_j w_j of its received blocks w_1, ..., w_l. A receiver can decode the original file after receiving at least m linearly independent blocks. Let W be the matrix of information vectors of the received blocks and A be the matrix whose rows are the encoding vectors of the received blocks. The receiver can recover all original pieces of the file, U, by solving the linear equation W = AU.

3.1.3 Performance and Overhead in Network Coding

Benefits of Network Coding

Despite their advantages and popularity, cooperative content distribution systems, including BitTorrent, pose significant challenges. A key challenge is block scheduling, which affects the distribution speed.
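The encoding and decoding of Section 3.1.2 can be sketched in a few lines. This is a minimal model over the small prime field F_257 for brevity (the implementation in this chapter works in GF(2^8)); function names are illustrative.

```python
import random

P = 257  # small prime field F_p; the system described here uses GF(2^8)

def encode(pieces, rng):
    """Random linear combination of the m original pieces, returned in
    augmented form (alpha_1..alpha_m, w_1..w_n) as in Eq. (3.1)."""
    m, n = len(pieces), len(pieces[0])
    alpha = [rng.randrange(P) for _ in range(m)]
    w = [sum(alpha[i] * pieces[i][j] for i in range(m)) % P for j in range(n)]
    return alpha + w

def decode(blocks, m):
    """Solve W = A U by Gauss-Jordan elimination on augmented blocks,
    once at least m linearly independent blocks have been received."""
    rows = [list(b) for b in blocks]
    for col in range(m):
        piv = next(r for r in range(col, len(rows)) if rows[r][col] % P)
        rows[col], rows[piv] = rows[piv], rows[col]      # partial pivoting
        inv = pow(rows[col][col], P - 2, P)              # Fermat inverse in F_p
        rows[col] = [x * inv % P for x in rows[col]]
        for r in range(len(rows)):
            if r != col and rows[r][col]:
                f = rows[r][col]
                rows[r] = [(x - f * y) % P for x, y in zip(rows[r], rows[col])]
    return [row[m:] for row in rows[:m]]  # recovered pieces U
```

In random linear network coding the coefficients are drawn independently at each node, so any m blocks are linearly independent with high probability over a large field.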
To illustrate this problem, consider the following experiment, in which we used BitTorrent clients (CTorrent [26]) to distribute a 32MB file with 128 pieces to 95 nodes in PlanetLab [29]. All nodes joined the distribution session at the same time and limited their upload bandwidth to 100KB/s. They left the session immediately after they finished their download. In Figure 3.1(a), a point

at (x, y) means that node y downloaded a block x seconds after the start of the experiment. We observe many gaps in the figure, meaning there were many small time periods during which nodes did not download anything, which eventually delayed their download completion times. There were peers waiting for their turn to download missing blocks from others. Some peers endured especially long waits toward the end of their download, when they had obtained most, but not all, of the blocks. An average download completed in 634 seconds.

Figure 3.1: Comparison of downloads between BitTorrent and network coding. (a) Downloads in BitTorrent. (b) Downloads in network coding.

Figure 3.1(b) shows that network coding makes efficient content propagation easier. In this experiment, we used the same parameters except that the BitTorrent clients used network coding.² The inherent benefit of network coding is that it does not use predetermined block indexes. Since coded blocks have no identity, peers no longer have to fetch a copy of each specific block; they simply request that an upstream peer send a coded block that is a combination of the blocks it already has. Adding network coding dramatically reduces periods of idle time, reducing the average download time to 407 seconds. Figure 3.2 compares the downloading times of BitTorrent and network coding with an ideal downloading time. With a fixed upload rate, the ideal downloading time depends on the arrivals and departures of peers. However, in the given experiments, we can take the ideal download to be the case where a single client stably

² More specifically, we used the optimal encoding explained in the next subsection.

Figure 3.2: Comparison with the optimal downloading time.

downloads content from a server at a download rate of 100KB/s without competition. We simply compute this time by dividing the file size by the upload rate. The Y-axis in the figure represents the ratio of each downloading time to the ideal downloading time. We observe that the downloading time of the network coding-enabled system is much closer to the ideal downloading time than that of the non-coding BitTorrent system. This means that network coding can provide near-optimal scheduling that reduces peers' downloading times.

Tradeoff in Network Coding

With the use of network coding, we face an important issue: how to encode data at each node in the network. There could be many ways for a node to select and combine received blocks w_1, ..., w_l to produce an outgoing block Σ_j β_j w_j. To deal with this issue, we mainly consider two criteria: encoding overhead and block dependency. One obvious method is to use all blocks a node has and combine them using coefficients β_j chosen uniformly at random from the field F_p [25]. This is the usual random linear network coding, but in this thesis we call it full coding to distinguish it from other schemes. It has been shown [42] that full random linear network coding used in P2P systems achieves the maximum possible throughput. However, it is impractical for use at line speed, where a node must produce an encoded block in real time when

requested by another node, due to CPU overhead and disk access. To reduce the encoding overhead, peers may generate a coded block from fewer input blocks. The encoding overhead can be ameliorated by splitting the file into a small number of generations. Full coding then only needs to be performed within each generation, reducing the number of blocks that must be encoded for every request. This encoding scheme is referred to as gen coding in this thesis. Another alternative to full random coding is sparse random linear network coding, where a node randomly selects up to k blocks it has received, w_{i_1}, ..., w_{i_k}, forms a random linear combination Σ_{j=1}^{k} β_j w_{i_j}, and sends it to another node. The encoding scheme which combines k blocks is referred to as k-coding in this thesis. While this scheme reduces the encoding overhead, a small k generates unnecessary dependent blocks. Suppose a node has already received m′ independent blocks, and it has just received a new independent block w. This node then has new information about the file (namely w), but an outgoing block from the node contains w in its linear combination only with probability k/(m′+1). This means that initially, when m′ was small, the new information w could be propagated through outgoing blocks with high probability, but as m′ grows, especially when m′ ≈ m, this probability is only about k/m′. There is a clear trade-off between the value of k (and thus CPU overhead) and the bandwidth utilization (goodput) in the network. To show the usefulness of network coding in the previous subsection, we introduced a so-called optimal coding. This coding fully combines all coefficient vectors of the blocks a peer has. Because this coding scheme does not encode the information vectors of blocks, its encoding overhead is negligible, yet it has the same low level of block dependency as full coding.
We consider this coding to represent the best performance obtainable in practice, and we will use it to compare the performance of the other schemes. Is there an encoding scheme that satisfies both requirements of low encoding overhead and a low level of block dependency? Figure 3.3 demonstrates the trade-off between these requirements by comparing sparse coding schemes with various values of k, full coding, and gen coding. White bars represent CPU usage and crosshatched bars represent the relative number of dependent blocks produced compared to optimal coding. Note that k-coding with a small k generates many dependent blocks, while k-coding with a large k and full coding demand more CPU cycles.³ The graph also includes a novel

³ In Section 3.3, we give a more detailed explanation of this graph.

Figure 3.3: Tradeoff between CPU overhead and block dependency.

network coding scheme design, i-code. This coding combines the benefits of sparse coding, which incurs small overhead, and full coding, which generates few dependent blocks. We describe this lightweight and efficient encoding scheme in the next section.

3.2 Practical Network Coding System

Network coding has been proposed to improve the performance of content distribution systems. However, almost no real-world systems use network coding for large-scale content distribution over peer-to-peer networks. One of the obstacles to the use of network coding in practice is encoding overhead, because nodes need to combine several blocks to send encoded data. This encoding overhead becomes more serious when we consider the resource constraints placed on real-world clients, which have limited memory, limited processing capacity, and slow disks. These factors affect the overall performance of network coding operations. Taking these constraints into account, we propose a practical network coding system with a lightweight and efficient coding scheme, which achieves both low encoding overhead and a low level of

block dependency.

3.2.1 System Architecture

Here we provide an overview of our network coding-enabled content distribution system. Figure 3.4 shows the modules in each peer participating in a distribution session. With the help of the neighbor manager module, peers maintain an overlay network for content distribution and obtain information on the data others have. Using the scheduler module, a peer decides which nodes it uploads to and which nodes it downloads from. Peers exchange encoded data in the form of augmented vectors and test the usefulness of received blocks using the dependency checker. The encoding module generates outgoing blocks and the decoding module reconstructs the original content. In our system, to respect resource constraints, the majority of the blocks a peer has reside on the peer's disk and must be read into memory as needed. This is because peers may not have enough memory to cache all blocks in RAM, especially in modern P2P networks where multi-gigabyte files are common; a compact memory footprint is a common requirement across systems. Below we describe the function of each module.

Figure 3.4: System architecture

The neighbor manager module provides lists of the peers participating in a distribution session. These lists can be obtained from a tracker or from other peers. A peer establishes links to some peers (called neighbors) and maintains the connections by periodically exchanging messages, so this module can detect arrivals and departures of neighbors. The neighbor manager also provides information about what each peer has. In BitTorrent-like systems, peers exchange bitfields, series of bits mapped to the pieces each peer has. Network coding, however, does not use predetermined block indexes, so peers instead exchange the ranks of their coefficient matrices, which are composed of the encoding vectors of the blocks each peer holds. In P2P systems not using network coding, clients must make block scheduling decisions. With network coding, there is no need to decide which specific blocks to download, which is an inherent benefit of network coding. Instead, the scheduler module only decides which neighbors a peer exchanges data with. To select those neighbors, the scheduler infers whether a neighbor has useful blocks. We use a simple heuristic: assume a neighbor has useful blocks unless it sends multiple useless blocks. Such a neighbor is asked not to send blocks until it receives new independent blocks from other nodes. The scheduler module also determines to whom a peer uploads data; we simply follow the tit-for-tat policy of BitTorrent-like systems. In linear network coding, the usefulness of data is determined by linear dependency. The dependency checker module maintains a coefficient matrix consisting of the encoding vectors of the blocks in a peer's local store. When peer A is about to receive a block from peer B, it first performs a linearity check using the encoding vector and the coefficient matrix. The module uses Gaussian elimination to verify whether the block is linearly independent of the blocks A already has. If it is dependent, the received block is dropped.
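A dependency checker of this kind can be sketched as follows. This is a simplified model (a small prime field F_257 stands in for GF(2^8), and the class name is invented) that keeps stored encoding vectors in reduced form, so each linearity check is a single elimination pass over the small encoding vectors, never touching the information vectors.

```python
# Sketch of the dependency checker: test an incoming block's encoding vector
# by Gaussian elimination against the vectors already stored.

P = 257  # prime field stand-in for the GF(2^8) arithmetic of the real system

class DependencyChecker:
    def __init__(self):
        self.basis = []  # encoding vectors kept in reduced (echelon-like) form

    def _reduce(self, vec):
        """Eliminate `vec` against the stored basis; a nonzero remainder
        means the corresponding block carries new information."""
        v = list(vec)
        for b in self.basis:
            lead = next(i for i, x in enumerate(b) if x)  # leading position
            if v[lead]:
                f = v[lead] * pow(b[lead], P - 2, P) % P  # modular inverse
                v = [(x - f * y) % P for x, y in zip(v, b)]
        return v

    def is_independent(self, vec):
        return any(self._reduce(vec))

    def add(self, vec):
        """Admit the block if independent; dependent blocks are dropped."""
        v = self._reduce(vec)
        if any(v):
            self.basis.append(v)
            return True
        return False
```

Storing the reduced remainder (rather than the raw vector) keeps every basis vector zero at the leading positions of earlier vectors, so a dependent newcomer always reduces exactly to zero.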
To avoid wasting resources on the transmission of linearly dependent blocks, we use the following dependency check strategy. Once peer B decides to upload a block to A, B first encodes a new block and sends A only its encoding vector. Upon receiving the encoding vector, A checks whether the vector is linearly independent of the blocks it has. If it is, A requests the full block and B transmits the remaining part (i.e., the information vector) of the block. If not, A informs B that the coded block is dependent. This strategy greatly reduces wasted bandwidth since an encoding vector is much smaller than an entire coded block.

The decoding module enables a peer to reconstruct the original content upon receiving m independent blocks, where m is the number of pieces in the content. This module solves a linear equation for decoding using Gauss-Jordan elimination. Because decoding is time-consuming, we may consider progressive decoding: a peer decodes blocks as it receives them, without waiting to receive all m independent blocks. With this strategy, the decoding time overlaps with the time spent downloading content, reducing the total time to obtain the original content. However, this strategy is not feasible in practice due to the previously mentioned resource constraints: the majority of the blocks a peer has cannot reside in memory but must be loaded from disk as needed, which requires a large number of disk operations. Instead, we consider another approach using multiple generations. Let m be the number of blocks and n be the dimension of the blocks. Because the decoding process requires O(m²n) multiplication operations, the decoding time can be reduced by a factor of x compared to a single generation by using x generations. Furthermore, once a node has downloaded one generation, it can start decoding that generation while receiving blocks of other generations; decoding time therefore overlaps with downloading time, as in progressive decoding. Finally, the encoding module generates outgoing blocks using the random linear network coding described in Section 3.1. A peer chooses a subset of its blocks and linearly combines them with randomly selected coefficients in the Galois field GF(2^8). This module may have several submodules for different encoding schemes such as full coding, sparse coding, and gen coding, as well as our novel encoding scheme, whose details are described in the next subsection.

3.2.2 i-code: Lightweight and Efficient Coding

We propose a novel encoding scheme which we call i-code, a contraction of incremental encoding.
This scheme is lightweight and efficient: every encoding operation reads only one block from disk and mixes only two blocks. Although the scheme combines a small number of blocks, it does not suffer the dependent-block penalty faced by sparse encoding. The key idea behind i-code is to approximate full encoding by maintaining a block of well-mixed encoded data, called the accumulation block, which contains a mix of the blocks in the peer's local store (see Figure 3.5).

Figure 3.5: i-code design. Our encoding scheme requires only one block to be read from disk and one linear combination, greatly reducing encoding overhead.

When peer A is about to receive a block, it first performs a linearity check, using the encoding vector, to verify that the incoming block is linearly independent of the blocks A already has. If it is not, the block is dropped. Otherwise, the newly received linearly independent block w is stored to disk, and the accumulation block a is updated as a <- alpha*a + beta*w for randomly chosen coefficients alpha, beta in F_p. When sending a block, A reads a single block w_i from disk and outputs alpha*a + beta*w_i, a random linear combination of the accumulation block and w_i; it also updates a as a similar random linear combination with w_i. In i-code, the blocks a peer holds are thus accumulated into this single representative block. Mathematically, the accumulation block is maintained as a generic random element of the subspace spanned by the blocks the peer has. When producing an outgoing coded block, the peer selects one stored block and combines it with the accumulation block. Because many blocks are already mixed into the accumulation block, combining these two blocks has an effect similar to full coding, which combines all blocks a peer has. One consequence is that a newly received block can immediately contribute to any outgoing block, because it has already been combined into the accumulation block; this reduces the probability of encoding dependent blocks. In sparse k-coding, by contrast, a newly received block may not be selected when encoding a new outgoing block, so dependent blocks are produced with higher probability.
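The receive/send bookkeeping around the accumulation block can be sketched as follows. This tracks only encoding vectors and uses the prime field GF(257) as a hedged stand-in for the GF(2^8) arithmetic of the real system; the class and method names are illustrative, not the thesis implementation:

```python
# Sketch of i-code's accumulation-block updates over GF(257), a prime
# field used here as a stand-in for GF(2^8). Only encoding vectors are
# tracked; payloads would be combined with the same coefficients.
import random

P = 257  # prime modulus (stand-in for the GF(2^8) field)

def combine(alpha, u, beta, v):
    """Return alpha*u + beta*v componentwise over GF(P)."""
    return [(alpha * x + beta * y) % P for x, y in zip(u, v)]

class ICodePeer:
    def __init__(self, m):
        self.disk = []        # stored independent blocks (on "disk")
        self.acc = [0] * m    # accumulation block a

    def receive(self, w):
        # caller has already verified that w is linearly independent
        self.disk.append(w)
        a, b = random.randrange(1, P), random.randrange(1, P)
        self.acc = combine(a, self.acc, b, w)   # a <- alpha*a + beta*w

    def send(self):
        # read ONE block from disk and mix it with the accumulation block
        w = random.choice(self.disk)
        a, b = random.randrange(1, P), random.randrange(1, P)
        out = combine(a, self.acc, b, w)        # outgoing coded block
        # update a as a similar random combination with w
        self.acc = combine(random.randrange(1, P), self.acc,
                           random.randrange(1, P), w)
        return out
```

Each send performs exactly one disk read and one linear combination, yet the output mixes (via the accumulation block) every block the peer has received so far.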


More information

Presented By: Devarsh Patel

Presented By: Devarsh Patel : Amazon s Highly Available Key-value Store Presented By: Devarsh Patel CS5204 Operating Systems 1 Introduction Amazon s e-commerce platform Requires performance, reliability and efficiency To support

More information

Lecture 8: Application Layer P2P Applications and DHTs

Lecture 8: Application Layer P2P Applications and DHTs Lecture 8: Application Layer P2P Applications and DHTs COMP 332, Spring 2018 Victoria Manfredi Acknowledgements: materials adapted from Computer Networking: A Top Down Approach 7 th edition: 1996-2016,

More information

Distributed Hash Tables

Distributed Hash Tables Distributed Hash Tables CS6450: Distributed Systems Lecture 11 Ryan Stutsman Material taken/derived from Princeton COS-418 materials created by Michael Freedman and Kyle Jamieson at Princeton University.

More information

Parallel Crawlers. 1 Introduction. Junghoo Cho, Hector Garcia-Molina Stanford University {cho,

Parallel Crawlers. 1 Introduction. Junghoo Cho, Hector Garcia-Molina Stanford University {cho, Parallel Crawlers Junghoo Cho, Hector Garcia-Molina Stanford University {cho, hector}@cs.stanford.edu Abstract In this paper we study how we can design an effective parallel crawler. As the size of the

More information

Network-Adaptive Video Coding and Transmission

Network-Adaptive Video Coding and Transmission Header for SPIE use Network-Adaptive Video Coding and Transmission Kay Sripanidkulchai and Tsuhan Chen Department of Electrical and Computer Engineering, Carnegie Mellon University, Pittsburgh, PA 15213

More information

Distributed hash table - Wikipedia, the free encyclopedia

Distributed hash table - Wikipedia, the free encyclopedia Page 1 sur 6 Distributed hash table From Wikipedia, the free encyclopedia Distributed hash tables (DHTs) are a class of decentralized distributed systems that provide a lookup service similar to a hash

More information

Peer-To-Peer Techniques

Peer-To-Peer Techniques PG DynaSearch Markus Benter 31th October, 2013 Introduction Centralized P2P-Networks Unstructured P2P-Networks Structured P2P-Networks Misc 1 What is a Peer-to-Peer System? Definition Peer-to-peer systems

More information

6. Peer-to-peer (P2P) networks I.

6. Peer-to-peer (P2P) networks I. 6. Peer-to-peer (P2P) networks I. PA159: Net-Centric Computing I. Eva Hladká Faculty of Informatics Masaryk University Autumn 2010 Eva Hladká (FI MU) 6. P2P networks I. Autumn 2010 1 / 46 Lecture Overview

More information

Page 1. How Did it Start?" Model" Main Challenge" CS162 Operating Systems and Systems Programming Lecture 24. Peer-to-Peer Networks"

Page 1. How Did it Start? Model Main Challenge CS162 Operating Systems and Systems Programming Lecture 24. Peer-to-Peer Networks How Did it Start?" CS162 Operating Systems and Systems Programming Lecture 24 Peer-to-Peer Networks" A killer application: Napster (1999) Free music over the Internet Key idea: share the storage and bandwidth

More information

EARM: An Efficient and Adaptive File Replication with Consistency Maintenance in P2P Systems.

EARM: An Efficient and Adaptive File Replication with Consistency Maintenance in P2P Systems. : An Efficient and Adaptive File Replication with Consistency Maintenance in P2P Systems. 1 K.V.K.Chaitanya, 2 Smt. S.Vasundra, M,Tech., (Ph.D), 1 M.Tech (Computer Science), 2 Associate Professor, Department

More information

Introduction to P2P Computing

Introduction to P2P Computing Introduction to P2P Computing Nicola Dragoni Embedded Systems Engineering DTU Compute 1. Introduction A. Peer-to-Peer vs. Client/Server B. Overlay Networks 2. Common Topologies 3. Data Location 4. Gnutella

More information