Towards large scale peer-to-peer web search


Gerwin van Doorn
Human Media Interaction, University of Twente, the Netherlands

ABSTRACT
Web search engines, such as Google and Yahoo, are based on the centralized database model. Search engines using the centralized database model suffer from several drawbacks: they have a single point of failure, they offer a limited representation of the web, their index is not up to date, and they scale poorly. Currently a lot of research is being done on using peer-to-peer (P2P) technology for full-text search in order to overcome the issues centralized search engines suffer from. Although P2P systems have proven to be highly scalable in file sharing applications, this is not so obvious for large scale full-text search. In this paper I will discuss and compare some of the most important P2P architectures based on their literature. Based on this, I will set out the direction future P2P research should take in order to make large scale P2P web search possible.

Keywords
Distributed Hash Table, Peer to peer, Overlay network, Hybrid, Semantic

1. INTRODUCTION
Over the last few years the World Wide Web has gone through some remarkable changes. Where previously only companies and universities could add content to the web, nowadays users contribute a great deal of its content. One of the best known examples where professional and non-professional users contribute content is Wikipedia ("Wikipedia home page,"). Weblogs are another fast growing way for people to add content to the web.

One of the most important and most widely used applications of the Web is full-text keyword search. Without it, it would be impossible to navigate the internet efficiently. Nevertheless, the current size of the Web makes this a difficult problem. Current search engines for the Web are usually based on a single database model.
With the single database model, documents from the web are copied from their location to a central database where they are stored and indexed. This model of information retrieval has been very successful because initially the web was not very big, the web was not a commercial medium and HTML pages were not very complex (Callan, 2000).

In 2000 Google indexed approximately 1 billion pages. The index size increased by 1 billion each year till 2005, when it doubled from approximately 4 billion to 8 billion ("The Internet Archive," 2006). Google has since stopped listing its index size, but after Yahoo announced an index size of 20 billion (Mayer, 2005), Google claimed that its index is 3 times bigger than that of any other search engine (Google, 2005).

There is also a lot of information that cannot be indexed, either because there are no links to these pages or because they are dynamically generated from a database. These pages are referred to as the deep web or the invisible web. This means that there is even more information on the web than one can currently search for with a general search engine that needs to crawl web servers. In 2000 the deep web was estimated to contain over 550 billion individual documents (Bergman, 2000).

Current search engines working with a central database model need to crawl the web periodically to keep their information up to date. A complete re-crawl takes a long time and is very inefficient. Web crawlers cannot know whether a document was actually updated until the document is downloaded, which forces them to download more documents than is necessary.

Search engines based on the single database model also lack robustness. When the database goes offline, searching is impossible. In order to prevent such a dramatic scenario, data needs to be replicated and thus more hardware resources are needed.

Currently, major companies like Google, Yahoo! and Microsoft control every aspect of internet search, intranet search and even desktop search. Users depend on these companies to find the information they are looking for. In order to compete with current leading web search companies like Yahoo!, Google and Microsoft, enormous investments have to be made to create the necessary infrastructure. This makes it almost impossible to compete and to create new and better information retrieval services for a big audience.

Peer-to-peer (P2P) networking is one of the most rapidly developing areas of modern computing (Klampanos & Jose, 2004). There is increasing interest in using the P2P approach for web search in order to overcome the aforementioned problems. Although P2P networking has proven to be very effective for file sharing, its use for full-text web search has yet to be proven. One of the most important issues to overcome in order to make P2P web search possible is scalability.

2. PROBLEM STATEMENT
The task of web search using a P2P network comes down to selecting the best documents in relation to a query, while these documents are scattered across a physical network. Although many different P2P systems have been proposed for keyword search, it is not clear which kind of P2P system has the greatest potential for large scale web search, and which characteristics such a system should have. This paper looks at several important P2P architectures and compares them based on their literature in order to set out future directions of research.

This paper is structured as follows: Section 3 will discuss the first generation of P2P networks that were mainly used for file-sharing applications. Section 4 discusses Distributed Hash Tables and P2P systems that make use of them. Section 5 describes how indexing schemes can be combined to form hybrid P2P systems, avoiding the limitations of previous indexing schemes. Section 6 will describe P2P systems that make use of Information Retrieval methods to locate data.

Section 7 will summarize the discussed systems and compare them based on the information given in their literature. Finally, conclusions will be drawn in the final section.

3. FIRST P2P NETWORKS
3.1 Centralized and unstructured
One of the first systems that made P2P sharing popular was Napster (Chawathe, Ratnasamy, Breslau, Lanham, & Shenker, 2003; Lu & Callan, 2003). Napster takes advantage of the already available bandwidth and storage space of the connected peers.

Figure 1 Centralized and unstructured P2P architecture

A peer is connected to a central server cluster. Each active peer sends a list of its shared filenames to the cluster. When a node is looking for a file, a query is sent to this cluster. The search engine returns a list of other peers that have the desired content, and the searching peer can connect directly to the peers that hold the content. This way the bandwidth required of the server is small, while the download bandwidth is large because it is shared among peers. Napster is not a pure peer-to-peer system, as some tasks (e.g., searching) make use of a client-server model. The major drawbacks of these centralized systems are that they scale poorly and have a single point of failure (Lv, Cao, Cohen, Li, & Shenker, 2002).

3.2 Decentralized and unstructured
In decentralized unstructured P2P systems, like Gnutella (Clip, 2000), there is no centralized directory service as there is with Napster. The network is formed by nodes joining the network using a few loose rules. In order to find a file, a node queries its neighboring peers. Flooding (Clip, 2000) or random walks (Lv et al., 2002) are used on this graph to propagate queries. Each peer indexes its own content. This is often referred to as document based partitioning or local indexing.

Figure 2 Local indexing (X: a, b; Y: b, c; Z: a, b, c)

As we can see in Figure 2, each peer indexes its own documents (X, Y and Z), each document containing one or more terms (a, b and c).
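To make the cost of flooding over locally indexed peers concrete, the following is a minimal sketch rather than Gnutella's actual protocol. The function name `flood_query`, the toy `GRAPH` and `INDEX`, and the way messages are counted (one per forward, including forwards to peers that have already seen the query) are all assumptions made for illustration. The hop limit plays the role of a time-to-live value.

```python
from collections import deque

# Hypothetical overlay: each peer lists its neighbours (assumed topology).
GRAPH = {
    "A": ["B", "C"],
    "B": ["A", "D"],
    "C": ["A", "D"],
    "D": ["B", "C", "E"],
    "E": ["D"],
}

# Local indexing: each peer only knows the terms of its own documents.
INDEX = {"A": {"b"}, "B": {"c"}, "C": {"b"}, "D": {"c"}, "E": {"a"}}


def flood_query(graph, index, start, term, ttl):
    """Flood `term` from `start`; return (peers with a hit, messages sent)."""
    hits = set()
    messages = 0
    seen = {start}
    frontier = deque([(start, ttl)])
    while frontier:
        peer, hops_left = frontier.popleft()
        if term in index.get(peer, set()):
            hits.add(peer)
        if hops_left == 0:
            continue  # hop limit reached, do not forward further
        for neighbour in graph[peer]:
            messages += 1  # a forward costs a message even if it is redundant
            if neighbour not in seen:
                seen.add(neighbour)
                frontier.append((neighbour, hops_left - 1))
    return hits, messages
```

In this toy network a query for term `a` issued at peer `A` with a hop limit of 2 costs six messages and finds nothing, while a hop limit of 3 costs nine messages and reaches peer `E`. Even at this scale, a larger hop limit buys recall at the price of more traffic.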
Figure 3 Decentralized and unstructured P2P architecture

Flooding is a very inefficient process that generates a large amount of network traffic, as a query usually gets sent to many or all nodes in the network. The search mechanisms in these networks are not very scalable due to the communication costs and the large load on the network participants. For instance, according to Tang et al. a 128K-node Gnutella-like network would transmit 192MB of data when propagating one query by flooding (Tang, Xu, & Dwarkadas, 2003). Content allocation and network structure are uncorrelated in these networks, as nodes often connect to the network using physical measures like join order, connection speed, etc. A TTL (Time To Live) is usually used to limit the number of hops between nodes. This reduces network traffic but does not guarantee a query hit.

4. DHT BASED P2P ARCHITECTURES
4.1 Distributed Hash Tables
DHT stands for Distributed Hash Table. A DHT is a hash table which is distributed among cooperating computers, which are referred to as nodes or peers. DHTs are also referred to as structured overlay networks, as their nodes are connected to each other over an existing network, like the internet. Like a hash table, a DHT contains key/value pairs, also referred to as items. The primary service provided by a DHT is the lookup operation: given a key, find the corresponding value. Both the key and the value representation can be arbitrary, such as a string or an object. In the case of a keyword based search engine, an item usually consists of a keyword that is mapped to the peer responsible for the list of documents containing the keyword. This list is also referred to as a posting list.

Table 1 DHT mapping keywords to peers
Key          Value
information  peer 1
retrieval    peer 1
car          peer 2
tomato       peer 3

A DHT can be constructed from many nodes, making it possible to efficiently store large amounts of data, as each node is responsible for a part of the items. Each node has a routing table that stores the node's neighbors. When a lookup is performed on one of the nodes and this node does not contain the item itself, the node routes the query to its neighbors until it reaches the node that is responsible for the provided key.

Another important feature of DHTs is that they are, to a certain degree, fault-tolerant. DHTs should be able to perform lookups even when some nodes fail. This is usually done by replicating the items (key/value pairs) of a node on neighboring nodes.

4.1.1 Routing hops
Hops are the number of re-routes that are needed to deliver a query to the responsible node, and it is important for the efficiency of the DHT to keep this number as low as possible. Each extra hop means extra latency for sending an extra message. More hops also mean a higher chance that one or more nodes fail during a lookup. The bigger each node's routing table, the fewer hops are needed. However, the bigger the routing table, the more messages need to be sent in order to keep the routing information up to date.

4.1.2 Routing time
Reducing the number of hops does not necessarily reduce the time it takes to reach the destination. A route that needs five hops and uses local nodes can still take less time than a route that needs two hops but goes through a node on the other side of the world.

4.2 The first Distributed Hash Tables
Different DHTs exist, differing in how they are structured and how they handle replication, node joins and node failures. The first DHTs appeared around 2001 and were CAN (Ratnasamy, Francis, Handley, Karp, & Schenker, 2001), Chord (Stoica, Morris, Karger, Kaashoek, & Balakrishnan, 2001), Pastry (Rowstron & Druschel, 2001) and Tapestry (Zhao, Kubiatowicz, & Joseph, 2001). In order to demonstrate how these systems operate, the CAN and Chord systems are briefly described in the following sections. A more extensive overview and comparison of P2P overlay networks is given by Lua, Crowcroft, Pias, Sharma, and Lim (2005).

4.2.1 Chord
The Chord protocol (Stoica et al., 2001) makes use of consistent hashing to assign keys to nodes. This tends to balance the load, as each node receives roughly the same number of keys. A node's identifier is chosen by hashing the node's IP address. The key identifier is chosen by hashing the key. The identifiers are ordered in a circular $2^{m}$ identifier space, which can be represented as in Figure 4.

Figure 4 Chord identifier circle, m = 3

A key k is assigned to the first node whose identifier is equal to or follows the key k in the identifier space. This node is called the successor node. This is demonstrated in Figure 4, where the key with location 3 is assigned to the node located at 4, as this is the successor node of identifier 3. If node 4 were to leave the network, all of its assigned keys would be reassigned to its successor, node 0. If a node were to join at identifier 3, key 3 would be reassigned to node 3.

Each node stores only a small amount of routing information. The nodes know their successor but also maintain additional routing information to make routing more efficient. This additional information is stored in a routing table of maximum size m. The i-th entry of the table of node n contains the identity of the node that succeeds n by $2^{i}$ in the identifier space.
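The successor rule and the routing table can be tried out with a small sketch. It is a simplified illustration under stated assumptions, not the Chord implementation: the node set {0, 1, 4} is inferred from the Figure 4 discussion, routing table entries are indexed from 0 so that entry i points at the successor of n + 2^i, and the function names are hypothetical. The lookup uses the usual Chord step of jumping to the closest routing table entry that precedes the key.

```python
M = 3            # number of identifier bits, as in Figure 4
RING = 2 ** M    # size of the circular identifier space


def successor(nodes, ident):
    """First node whose identifier is equal to or follows `ident` on the ring."""
    ident %= RING
    ordered = sorted(nodes)
    for node in ordered:
        if node >= ident:
            return node
    return ordered[0]  # wrap around the circle


def finger_table(nodes, n):
    """Routing table of node n: entry i is the successor of n + 2**i (0-based i)."""
    return [successor(nodes, n + 2 ** i) for i in range(M)]


def in_open(x, a, b):
    """True if x lies in the circular open interval (a, b)."""
    if a < b:
        return a < x < b
    return x > a or x < b


def in_half_open(x, a, b):
    """True if x lies in the circular interval (a, b]."""
    if a < b:
        return a < x <= b
    return x > a or x <= b


def lookup(nodes, start, key):
    """Return (node responsible for key, list of nodes visited on the way)."""
    path = [start]
    n = start
    while True:
        succ = successor(nodes, n + 1)
        if in_half_open(key, n, succ):
            path.append(succ)
            return succ, path
        # Jump to the routing table entry closest to, but preceding, the key.
        nxt = next((f for f in reversed(finger_table(nodes, n))
                    if in_open(f, n, key)), n)
        if nxt == n:          # no better entry known: fall back to the successor
            path.append(succ)
            return succ, path
        n = nxt
        path.append(n)
```

With nodes {0, 1, 4}, key 3 is assigned to node 4. Removing node 4 moves it to node 0, and adding a node 3 moves it to node 3. A lookup for identifier 1 started at node 4 visits nodes 4, 0 and 1.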
When node n does not know the successor of a key k, it looks in its routing table for a node whose identifier is closer to k than its own identifier and precedes k. This node will know more about k than node n. This process is repeated until the successor node of key k is found. Suppose node 4 from Figure 4 wants to know the successor node of identifier 1. Node 4 will look at position 2 of its routing table and find node 0 to be closest. Node 0 will ask node 1 to find the successor of identifier 1. Node 1 will return its own identity to node 4, as node 1 is the successor node for identifier 1. Each node only needs to maintain information about $O(\log N)$ other nodes in order to resolve all lookups via $O(\log N)$ messages, N being the total number of nodes in the network.

4.2.2 CAN
CAN stands for Content Addressable Network and uses a d-dimensional Cartesian coordinate space on a d-torus to implement a distributed hash table that maps keys onto values (Ratnasamy et al., 2001). The coordinate space is partitioned among all the nodes in the system, each node owning its individual distinct zone. In the virtual coordinate space, each

key/value pair is mapped onto a point in the coordinate space using a uniform hash function. The key/value pair is stored on the node that is responsible for the zone to which the key/value pair is mapped. Each CAN node holds a routing table containing the IP address and virtual coordinate zone of each neighbor. A CAN message is routed towards its destination by greedy forwarding to the neighbor closest to the destination coordinate.

Figure 5 Example of a node G joining a 2d CAN

A new node randomly chooses a point P in the CAN coordinate space and sends a join request via an existing CAN node. This request is then routed to the node responsible for the zone in which P lies. The owner of the zone splits the zone in half and assigns one half to the new node. The new node takes over some of the neighbors of the previous owner, and the neighbors of both nodes are informed of the reallocation. Figure 5 shows the joining of node G in the CAN.

The number of neighbors a node maintains is independent of the total number of nodes in the network and only depends on the number of dimensions. The average length of routing paths in CAN is $O(d\,n^{1/d})$, where d is the number of dimensions and n is the total number of nodes in the system.

4.3 DHTs and search
4.3.1 Naïve usage
When responsibility for the words from the document corpus is divided among peers, DHTs can be used to map a word to the peer that is responsible for this word. This is also referred to as partitioning by keyword or global indexing.

Figure 6 Global indexing (a: X, Z; b: X, Y, Z; c: Y, Z)

Each peer stores the posting list of each word it is responsible for. A posting list is a list of documents (represented in Figure 6 by X, Y and Z) containing this word. When searching for documents containing a word, the DHT resolves the peer that is responsible for this word. The peer then returns the posting list containing the documents the word occurs in. When a query contains multiple terms, the posting list of one or more terms needs to be sent over the network in order to make an intersection with another posting list. Suppose we search for information retrieval. When retrieval has a smaller posting list than information, this posting list can be sent to the node responsible for the term information, so this node can intersect both posting lists to create a list of documents containing both terms.

Figure 7 Naive implementation of a DHT for search

This can be an enormous amount of information that needs to be sent over the network.

4.3.2 Optimizations
Several optimizations were suggested by J. Li et al. (2003) and Reynolds and Vahdat (2003), such as:
- Caching and pre-computing posting-list intersections
- Bloom filter compression
- Incremental results

The problem with these optimizations is that either huge posting lists still need to be transferred over the network, or the search quality degrades. When dealing with enormous amounts of data, as is the case with the internet, these optimizations still do not suffice.

A Bloom filter is a hash based data structure that summarizes membership in a set. Sending a Bloom filter based on a set A, A being a posting list, instead of the posting list A itself, reduces the amount of communication required to determine the intersection $A \cap B$. Take for example the query Paris Hilton, which was popular according to Google Zeitgeist 2006. We will call the result set for the query Paris set A and the result set for Hilton set B; on Google both are very large. Reynolds and Vahdat (2003) give a formula for the ideal Bloom filter size in terms of |A|, |B| and j. In the ideal case the smallest set is sent first, in this case set A. According to this formula the size of the Bloom filter to be sent will be about 134MB. This is smaller than sending the complete list of 128 bit document identifiers, which would consume 1.4GB, but it is still an enormous amount of data for just one query.

A side effect of Bloom filters is that they allow false positives. The false positive rate falls roughly exponentially as the size of the Bloom filter increases, so a smaller filter saves bandwidth at the price of more false positives.

A large amount of extra storage space and memory is needed when caching pre-computed posting-list intersections.

The biggest issue with keyword based indexing is the fact that it is not possible to rank results with respect to the query before a posting list is transferred. Search engine users usually only look at the first n results. According to a search engine user study by iProspect, 88% of search engine users change their search terms and/or search engine when they cannot find what they seek in the first 3 result pages, and 41% of these users change their search query after the first page. Most users (82%) do stick with the same search engine when they re-launch their search, and add more search terms to their query (iProspect Search Engine User Behavior Study, 2006).

Reynolds and Vahdat suggest using Fagin's algorithm to incrementally retrieve a ranked list of documents. Fagin's algorithm, originally proposed in (Fagin, 1999), and some variants of this algorithm were tested in ODISSEA, but no testing was performed with large data sets (Jo-Wen, Zhang, Long, & Shanmugasundaram, 2003). The returned top k results are chosen based on the ranking of the documents for the individual query terms and not for the complete query. The problem with Fagin's algorithm is that when the terms in a query are uncorrelated, the algorithm might take as long as a non-optimized algorithm, resulting in slow performance.

In order to make P2P web search scalable, queries should be ranked by the nodes that are responsible for the partial or complete query.
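The Bloom filter optimization discussed in this section can be illustrated with a short sketch. It is not the implementation of Reynolds and Vahdat: the class `BloomFilter`, the SHA-256 based hash positions, the parameters `num_bits` and `num_hashes`, and the final step in which the first node removes false positives from the candidates it gets back are assumptions made for illustration.

```python
import hashlib


class BloomFilter:
    """Fixed-size bit set that answers 'possibly in set' or 'definitely not'."""

    def __init__(self, num_bits, num_hashes):
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        self.bits = 0  # a Python int used as a bit array

    def _positions(self, item):
        # Derive num_hashes bit positions by salting a SHA-256 digest.
        for seed in range(self.num_hashes):
            digest = hashlib.sha256(f"{seed}:{item}".encode()).digest()
            yield int.from_bytes(digest[:8], "big") % self.num_bits

    def add(self, item):
        for pos in self._positions(item):
            self.bits |= 1 << pos

    def __contains__(self, item):
        return all((self.bits >> pos) & 1 for pos in self._positions(item))


def bloom_intersection(postings_a, postings_b, num_bits=1024, num_hashes=4):
    """Intersect two posting lists while shipping a filter instead of list A."""
    # Node responsible for term A: summarise its posting list.
    filt = BloomFilter(num_bits, num_hashes)
    for doc_id in postings_a:
        filt.add(doc_id)
    # Node responsible for term B: keep documents that may be in A.
    # This list can contain false positives but never misses a true match.
    candidates = [doc_id for doc_id in postings_b if doc_id in filt]
    # Back at node A: drop the false positives with an exact check.
    return sorted(set(candidates) & set(postings_a))
```

Only the filter (here 1024 bits) and the candidate list cross the network in this sketch, instead of the full posting list of A. Shrinking `num_bits` saves more bandwidth but lets more false positives through to the final exact check.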
Only the top k ranked documents should be returned, and if necessary combined with the top results returned by other nodes.

4.3.3 Highly Discriminative Keys
To avoid transferring large posting lists across the network, Podnar et al. propose the use of Highly Discriminative Keys (Aberer, Klemm, Luu, Podnar, & Rajman; Podnar, Luu, Rajman, Klemm, & Aberer, 2006). They propose a novel indexing strategy that is key based instead of single-term based. This means that posting lists are mapped to keys, like information retrieval, instead of to the separate terms information and retrieval. The indexing procedure is costly, but as their focus is on DL (Digital Libraries), which are rather static, this does not have to be a problem.

The key idea of their indexing strategy is to limit the posting list size of the global P2P index to a constant predefined value, and to extend the index vocabulary to improve retrieval effectiveness. In a single-term index, posting lists can be extremely large. Joining these posting lists at query time consumes a large amount of network bandwidth. By using many HDKs (Highly Discriminative Keys, or rare keys), more specific but smaller posting lists are created that are joined at indexing time. This is in a way similar to generating queries and linking them to the n best matching documents.

The quality of a key k for a given document d with respect to indexing adequacy is determined by its discriminative power. The key must be as discriminative as possible with respect to the document and the document collection. When a key is frequent, and thus has low discriminative power, its document frequency DF(k) is bigger than a fixed document frequency DFmax. When DF(k) < DFmax, the key is considered rare. Because many term combinations can form potential rare keys, proximity and redundancy filtering is used to keep the rare key vocabulary to a manageable size.
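A toy version of key-based indexing shows how a combination of frequent terms can form a rare key. This is a minimal sketch, not the indexing procedure of Podnar et al.: it enumerates every single term and every term pair per document without any filtering, the function names are hypothetical, and treating the boundary case DF(k) = DFmax as rare is an assumption.

```python
from collections import defaultdict
from itertools import combinations


def key_index(docs, max_key_size=2):
    """Map every key (a tuple of 1..max_key_size terms) to its set of documents."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        terms = sorted(set(text.split()))
        for size in range(1, max_key_size + 1):
            for key in combinations(terms, size):
                index[key].add(doc_id)
    return index


def split_keys(index, df_max):
    """Split keys into rare (DF <= df_max, assumed boundary) and frequent ones."""
    rare = {k: d for k, d in index.items() if len(d) <= df_max}
    frequent = {k: d for k, d in index.items() if len(d) > df_max}
    return rare, frequent


# Hypothetical three-document collection.
DOCS = {
    "d1": "peer search index",
    "d2": "peer search web",
    "d3": "peer web crawl",
}
RARE, FREQUENT = split_keys(key_index(DOCS), df_max=1)
```

With `df_max=1`, the single terms `search` and `web` are both frequent (each occurs in two documents), while the pair `("search", "web")` is a rare key whose posting list contains only `d2`. The posting list of every rare key is bounded by `df_max`, which is the point of the scheme.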
The proximity filter retains keys built of words that appear close together in documents, as they are likely to appear together in a query. The redundancy filter removes supersets of rare keys.

In order to retrieve documents, queries are mapped to keys. It is possible that a user provides a query that already is an HDK. It is of course also possible that users query frequent keys or terms not covered by the HDKs. The authors propose some options to overcome this problem. A valid option might be to notify the user that the query is not discriminative enough with respect to the document collection, and to provide support to the user in refining the query. Two other options they propose are to find terms that are semantically similar to the query terms, or to index the k best documents for frequent keys (which only make up 1% of the vocabulary).

This system showed acceptable retrieval performance compared to a single-term TF-IDF baseline system, without additional techniques for dealing with queries containing frequent keys. The retrieved posting lists were still 92% smaller than those of the baseline when the overlap with the baseline system on the top 20 results was 94.30%.

5. HYBRID P2P ARCHITECTURES
As demonstrated by J. Li et al. (2003), using either keyword based partitioning or document based partitioning, including their optimizations, does not make large scale keyword search feasible. In order to overcome this, several people have proposed using a combination of local and global indexing, called hybrid indexing.

5.1 eSearch
Tang et al. propose a P2P keyword search system called eSearch, based on a hybrid indexing structure (Tang & Dwarkadas, 2004). The key idea of their method is selective metadata distribution. As with global indexing, the metadata is distributed based on terms, and each node is responsible for one or more terms. Each node also stores the complete term list of each document it is responsible for. The optimization of eSearch is that it does not distribute a document's term list to all the nodes responsible for the document's terms, but only to the nodes responsible for its top n terms. The top n terms are identified using Okapi term weighting (Robertson, Walker, Hancock-Beaulieu, Gull, & Lau, 1992).
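The selective distribution step can be sketched as follows. This is an illustration under loud assumptions, not eSearch itself: raw term frequency stands in for Okapi term weighting, a hash modulo the number of nodes stands in for the Chord lookup, and all function names are hypothetical.

```python
import hashlib
from collections import Counter, defaultdict


def node_for(term, num_nodes):
    """Stand-in for the DHT lookup: hash a term onto one of num_nodes peers."""
    digest = hashlib.sha1(term.encode()).digest()
    return int.from_bytes(digest[:4], "big") % num_nodes


def top_terms(text, n):
    """Top n terms by raw frequency (an assumed stand-in for Okapi weights)."""
    counts = Counter(text.split())
    ranked = sorted(counts.items(), key=lambda kv: (-kv[1], kv[0]))
    return [term for term, _ in ranked[:n]]


def publish(docs, n, num_nodes):
    """Store each document's full term list only under its top n terms."""
    nodes = defaultdict(lambda: defaultdict(dict))
    for doc_id, text in docs.items():
        full_terms = sorted(set(text.split()))
        for term in top_terms(text, n):
            nodes[node_for(term, num_nodes)][term][doc_id] = full_terms
    return nodes


def search(nodes, query_terms, num_nodes):
    """Each responsible node matches the whole query against stored term lists."""
    results = set()
    for term in query_terms:
        postings = nodes.get(node_for(term, num_nodes), {}).get(term, {})
        for doc_id, terms in postings.items():
            # Local check on the node: no posting list crosses the network.
            if all(q in terms for q in query_terms):
                results.add(doc_id)
    return results


DOCS = {
    "d1": "chord chord chord dht dht ring",
    "d2": "bloom bloom filter filter dht",
}
NODES = publish(DOCS, n=2, num_nodes=8)
```

Because every node that holds a document also holds its complete term list, a multi-term query such as `["dht", "bloom"]` is answered by the node responsible for `bloom` alone. A query for `ring` returns nothing in this sketch, because `ring` is not among the top two terms of `d1`.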

Figure 8 Hybrid indexing

This optimization can degrade the quality of the search results, as a query on a term that is not among the top terms of a document cannot find this document. Their argument is that when none of the query terms are among the top terms of a document, IR algorithms are unlikely to rank this document among the best matching documents for this query anyway. To reduce the chance of missing relevant documents they adopt automatic query expansion. With the use of automatic query expansion, eSearch obtains search quality as good as the centralized baselines by publishing a document under its top 20 terms.

When using automatic query expansion, the hybrid indexing structure is first used to retrieve a small number of so-called feedback documents. For each term in these feedback documents, the average weight among the feedback documents is calculated. The system then chooses the k terms that have the biggest average weight and adds these to the query, as they are assumed to be relevant to the query. This new query is then used to retrieve the final set of documents.

eSearch only transmits 3.3KB of data on average during a retrieval operation. This amount is independent of the corpus size and scales logarithmically with the number of nodes.

In addition to all this, eSearch also distributes the load for each term. The DHT based Chord protocol that eSearch uses is modified to prevent nodes responsible for popular terms from storing more data than others. To prevent hot spots, eSearch also caches popular queries for a certain amount of time at the nodes processing the query and along the paths over which it is forwarded. When a query arrives at a node where the result of this same query is cached, the result is immediately returned.
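The term selection step of automatic query expansion can be sketched in a few lines. The function name, the representation of a feedback document as a term-to-weight mapping, counting a term as weight 0 in documents that lack it, and skipping terms already in the query are assumptions made for this illustration.

```python
from collections import defaultdict


def expand_query(query_terms, feedback_docs, k):
    """Add the k terms with the highest average weight over the feedback docs.

    feedback_docs is a list of {term: weight} mappings; a term missing from a
    document contributes weight 0 to the average (an assumption of this sketch).
    """
    totals = defaultdict(float)
    for doc in feedback_docs:
        for term, weight in doc.items():
            totals[term] += weight
    averages = {term: total / len(feedback_docs) for term, total in totals.items()}
    candidates = [term for term in averages if term not in query_terms]
    candidates.sort(key=lambda term: (-averages[term], term))
    return list(query_terms) + candidates[:k]


# Hypothetical feedback documents with made-up term weights.
FEEDBACK = [
    {"p2p": 0.9, "dht": 0.6, "chord": 0.3},
    {"p2p": 0.8, "dht": 0.2, "napster": 0.4},
]
```

For the query `["p2p"]` and `k=2`, the average weights are 0.4 for `dht`, 0.2 for `napster` and 0.15 for `chord`, so the expanded query becomes `["p2p", "dht", "napster"]`.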
The storage space consumed by eSearch is 6.8 times that of global indexing systems, but the benefit is its low search cost. Their argument is that disk capacity has increased 160 times faster than network bandwidth, and thus trading modest disk space for communication and precision is a proper design choice for P2P systems. Further storage reduction techniques like pruning and index compression can be added.

5.2 Multi level partitioning
Another hybrid index partitioning scheme, named Multi Level Partitioning (MLP), is proposed by Shi et al. (2004). MLP is adaptive and can balance between keyword based and document based partitioning.

Figure 9 Multi Level Partitioning

MLP relies on a node group hierarchy. In this hierarchy, nodes are logically divided into groups, each group being divided into sub groups until level n (see Figure 9). At level n, keyword based partitioning is applied within the group. This means that multiple nodes from different groups can be responsible for the same keyword.

When a query is processed, it is first sent down the hierarchy to level 1, and then to all groups of the next level until it reaches level n. In each group, the nodes responsible for the keywords in the query perform an intersection of the inverted lists. The intersections are done in parallel. Because keyword based partitioning is used only within the group, the inverted lists that need to be transmitted to do the intersection are much smaller. The results are combined level by level and sent back to the peer that started the search.

MLP achieves an enormous cut-back on bisection bandwidth with respect to keyword based partitioning, more than was suggested in (J. Li et al., 2003). Shi et al. do not describe how efficient ranking of the results can be done using MLP. Although MLP cuts back on the bisection bandwidth, the aggregated bandwidth is still huge.

6. SEMANTIC P2P ARCHITECTURES
Many P2P search systems use simple keyword matching to find documents matching a query. These systems distribute documents randomly, without any notion of their content. By making the distance between documents proportional to their similarity in semantics, the search space and communication cost can be reduced. This is what semantic P2P systems do.

6.1 pSearch
pSearch uses a classic IR algorithm, Latent Semantic Indexing (LSI), to map documents and queries as points to a semantic space in a CAN, thereby creating a semantic overlay network (Tang et al., 2003).

Figure 10 pSearch semantic space in a 2d CAN

Documents that are semantically alike are co-located in the semantic space, making it unnecessary to search nodes containing irrelevant documents. When a query is submitted, the query is routed to the nodes that are semantically closest to the query within a radius r. These nodes do a local search using pLSI and return the best matching documents to the user.

In pSearch the dimensionality of the CAN is set to the dimensionality of the semantic space. This dimensionality can be as high as 300. When the number of dimensions is smaller than $\log_{2}(n)$, n being the number of nodes, more zones will be created than there are nodes available.

To support semantic vectors with more dimensions than the CAN, pSearch makes use of rotated semantic vectors, which are stored at different locations in the CAN. These vectors are sub-vectors of the original semantic vector but have a lower dimensionality. For p sub-vectors, p searches are started when a query is submitted. The results of these p searches are combined. The technique does not ensure that the full vectors are also similar, but the low-dimensional elements of a semantic vector are the most significant. When the low-dimensional elements of a document and a query are similar, the document can be considered relevant.

Tang et al. also observed that the 300-dimensional semantic space was not sufficient for the TREC 7&8 corpus and queries. Increasing the dimensionality of the semantic space is not an option, as pSearch can only use the lower dimensions. However, it still needs a finer structure in order to rank documents properly.

pSearch achieves an accuracy of 90% in a 32k-node system and 86% in a 128k-node system. These accuracies are with respect to a centralized LSI implementation. A query and each index are 1.5KB in size when semantic vectors have a dimensionality of 300. Using the most pessimistic testing results, pSearch's communication cost for processing a query in a 128k-node system is 632KB. The data transmitted when publishing one index is 49.5KB and the storage cost is 6KB. When nodes replicate data from their direct neighbors in the CAN, the communication cost for publishing one index is 89.5KB and the storage cost is 159KB.

A big advantage of pSearch is that its bandwidth consumption for processing queries is independent of corpus size, query length, and document length. The number of routing hops and the number of visited nodes are the parameters that influence pSearch's communication cost, and these are constant or increase slowly as the system scales.

More recent research by Tang et al. discusses the limitations of pSearch due to its reliance on LSI (Tang, Dwarkadas, & Xu, 2004). They conclude that when the corpus is heterogeneous and large, the retrieval quality of LSI is inferior to methods such as Okapi (Robertson et al., 1992). Another limitation of LSI is that its memory consumption and computational load are high, making it unscalable.

To improve the efficiency of LSI they propose a method called eLSI. Documents are clustered and only the centroids of these clusters are used as representative documents. The dimensionality of the centroid vectors is reduced by removing low-weight terms, resulting in a much smaller input matrix for the Singular Value Decomposition (SVD) that yields the basis of the semantic space.

They combine their LSI approach with Okapi (LSI+Okapi) to boost the precision of pSearch. LSI is still used to guide the query to the search region, but inside the search region Okapi is used to select documents and guide the exploration of the search region.

Their optimizations make pSearch's high-end precision, the precision when only retrieving the top n documents, approach that of state-of-the-art centralized IR systems. However, its low-end precision, the precision when retrieving many documents, is still inferior.
This does not have to be a problem, as most users only look at the first few search results.

Semantic Small World

Li et al. propose a semantic overlay network based on small world networks, named Semantic Small World (SSW) (M. Li, Lee, & Sivasubramaniam, 2004; M. Li, Lee, Sivasubramaniam, & Lee, 2004). The P2P system is built from the ground up and does not make use of existing DHTs like CAN or Chord. SSW dynamically clusters peers based on the semantics of their content. Semantics are represented by k-dimensional semantic vectors (SVs), which are not limited to a specific data format. Unlike pSearch, SSW supports all dimensions of the semantic space in search. A simple but effective dimension reduction strategy, called adaptive space linearization (ASL), is used to create a one-dimensional SSW that operates in a high-dimensional semantic space. Peers cluster their local data objects and pick the centroid of the largest cluster as their position in the semantic space.

A network is considered to be a small world network when it has a small average path length and a large cluster coefficient. Research has shown that searches can be conducted efficiently, in O(log² N), when each node in the network knows its neighbors and a few randomly chosen long distance nodes.

A node in SSW also indexes content from other peers. Which content to index at which node is determined by the mapping from the semantic subspace to peers. If some content of a peer A falls into the subspace of another peer B, peer B will index that content from peer A. These indexes are called foreign indexes.
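The two placement rules above, a peer's position and the assignment of foreign indexes, can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions: the local clusters are given as input (the clustering algorithm is not specified here), and each peer's semantic subspace is modelled as an axis-aligned box of (low, high) ranges per dimension. The names peer_position, contains, and foreign_index_owners, and the box representation itself, are hypothetical and not taken from the SSW papers.

```python
from typing import Dict, List, Sequence, Tuple

Vector = Sequence[float]
# Hypothetical subspace shape: one (low, high) range per dimension.
Box = Sequence[Tuple[float, float]]


def centroid(vectors: Sequence[Vector]) -> List[float]:
    """Component-wise mean of a non-empty list of semantic vectors."""
    dims = len(vectors[0])
    return [sum(v[d] for v in vectors) / len(vectors) for d in range(dims)]


def peer_position(clusters: Sequence[Sequence[Vector]]) -> List[float]:
    """A peer's position is the centroid of its largest local cluster."""
    largest = max(clusters, key=len)
    return centroid(largest)


def contains(box: Box, vec: Vector) -> bool:
    """True if vec lies inside the half-open box [low, high) in every dimension."""
    return all(low <= x < high for (low, high), x in zip(box, vec))


def foreign_index_owners(
    items: Dict[str, Vector], home_peer: str, subspaces: Dict[str, Box]
) -> Dict[str, str]:
    """Map each item that falls into another peer's subspace to that peer."""
    owners: Dict[str, str] = {}
    for item_id, vec in items.items():
        for peer, box in subspaces.items():
            if peer != home_peer and contains(box, vec):
                owners[item_id] = peer  # this peer would hold a foreign index entry
                break
    return owners
```

For example, with subspaces `{"A": [(0.0, 0.5), (0.0, 0.5)], "B": [(0.5, 1.0), (0.5, 1.0)]}`, an item of peer A with vector `(0.9, 0.6)` would be indexed at peer B, while an item at `(0.1, 0.1)` stays only in A's local index.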

Figure 11. SSW example (C4 has a long range contact with C12)

Figure 11 shows an example of how SSW handles dimension reduction and searches. It shows how an SSW of any dimension can be mapped to a one-dimensional space. The top figure shows how the semantic space is partitioned when new nodes are added to the network. Unlike pSearch, SSW does not necessarily split clusters in half. The location of the split depends on the distribution of nodes and data. Clusters are only split when the number of nodes in a cluster exceeds a predefined maximum. The nodes of a cluster keep track of the partitioning history. This history is used to create cluster names and to estimate a query's PCN (Pseudo Cluster Name). This PCN is used to route the query to the correct cluster.

The example in Figure 11 shows how the SV of a query is routed to its destination cluster. In cluster 4 the PCN of the two-dimensional SV = [0.9, 0.6] is estimated. The estimated PCN is cluster 8. Because cluster 8 is not among the long range contacts of cluster 4, the SV is sent to the nearest cluster, in this case cluster 12 (normally C8 would be among C4's short range contacts, but this is omitted in this example). In C12 the PCN is estimated again, this time resulting in 14. As the SV lies in the cluster range of C14, the SV is flooded to all nodes in that cluster.

An SSW that uses SVs of k dimensions and contains N nodes, with cluster size M and l long range contacts, will on average perform a search across clusters in O(log²(2N/M) / l) messages. As we can see, the number of dimensions has no influence on the routing performance.

When the number of long range contacts is large enough, SSW achieves much shorter path lengths than pSearch (about 50%). The path length also increases more slowly. SSW's precision is more than 10% higher than that of pSearch. Precision is measured with respect to a centralized LSI system.

7. DISCUSSION

It is difficult to compare the P2P systems based on their literature alone, as each system was evaluated against a different baseline. What we can infer from the discussed systems is that a scalable P2P system that operates at the size of the web should have the following features:
• Documents need a compact representation
• High-end precision should be good; low-end precision not necessarily
• Results should be returned incrementally, with the best matching documents ranked highest
• Ranking of results should preferably be done at the indexing peer(s)

The first-generation P2P systems and systems that use DHTs naively are not suitable for large scale web search, as they need to either flood queries or intersect huge posting lists. Their cost scales linearly with the size of the network or the size of the data, respectively.

Of the hybrid architectures, eSearch looks most promising. Queries are resolved logarithmically and precision is as good as that of the baseline system (Okapi), at a modest storage cost of 6.8 times that of systems based on keyword partitioning.

Semantic P2P systems can also resolve searches efficiently. Semantic systems also allow different media to be searched, as it does not matter what kind of data is being used. This makes it possible to use the same architecture for, e.g., both keyword and image search. As a side effect, such a system is less effective for keyword searches unless the semantic representation is accompanied by additional textual information. Semantic P2P systems look promising but have several limitations as well:
• A transformation matrix needs to be created that is a good representation of the document collection. This requires either the full document collection or a representative sample of it.
• Important documents can be missed during search when the documents contain multiple topics that are not near each other in the semantic space (Dumais, 1994).

An important issue to overcome, in order to make it attractive for companies to operate a P2P node, is the storage of keywords and document identifiers they do not want to be associated with. People might not feel comfortable with the idea that their server is being used to answer queries related to pornography or worse. A P2P network might have to limit distribution of the index for certain content to nodes that host similar content. Semantic Small World can be a solution to this problem, as servers will be responsible for data that is semantically similar to the data they originally host.

8. CONCLUSION

In this paper several leading P2P architectures were discussed. The key to a scalable P2P keyword search engine seems to be a compact representation of the original documents. Both eSearch and the semantic approaches make use of a compact representation of the data. These systems need extra storage space, but as disk space is cheaper than bandwidth, this looks like a reasonable tradeoff. Future research should compare these systems to a state-of-the-art centralized search system in order to compare their precision and scalability with respect to each other.

Besides scalability there will be many other issues to overcome before large scale P2P web search becomes feasible, such as:
• Security: fraudulent peers
• Ethics: storage and linkage to unwanted parties (e.g. pornography)
• Load balancing, hot spots, popular queries

Before these issues can be resolved, a scalable P2P solution that is as good as a centralized system should be proposed first. The discussed systems were tested using small document collections. To test whether they are really scalable, testing needs to be done with a data set that is a good representation of the internet.

REFERENCES

Aberer, K., Klemm, F., Luu, T., Podnar, I., & Rajman, M. Building a peer-to-peer full-text Web search engine with highly discriminative keys: Technical report, EPFL (LSIR).
Bergman, M. K. (2000). The Deep Web: Surfacing Hidden Value.
Callan, J. P. (2000). Searching for Needles in a World of Haystacks. IEEE Data Engineering Bulletin, 23(3).
Chawathe, Y., Ratnasamy, S., Breslau, L., Lanham, N., & Shenker, S. (2003). Making gnutella-like P2P systems scalable. Proceedings of the 2003 conference on Applications, technologies, architectures, and protocols for computer communications.
Clip, D. S. S. (2000). Gnutella Protocol Specification v0.4.
Dumais, S. T. (1994). Latent semantic indexing (LSI): TREC-3 report. Overview of the Third Text REtrieval Conference, 219-230.
Fagin, R. (1999). Combining fuzzy information from multiple systems. Journal of Computer and System Sciences, 58(1).
Google. (2005). Sizing Up Search Engines. Retrieved 2007.
The Internet Archive. (2006). Retrieved 2006.
iProspect Search Engine User Behavior Study. (2006). iprospect.com.
Jo-Wen, T., Zhang, W. J., Long, A., & Shanmugasundaram, K. (2003). ODISSEA: A Peer-to-Peer Architecture for Scalable Web Search and Information Retrieval. International Workshop on the Web and Databases.
Klampanos, I. A., & Jose, J. M. (2004). An architecture for information retrieval over semi-collaborating Peer-to-Peer networks. Proceedings of the 2004 ACM symposium on Applied computing.
Li, J., Loo, B. T., Hellerstein, J., Kaashoek, F., Karger, D. R., & Morris, R. (2003). On the Feasibility of Peer-to-Peer Web Indexing and Search. IPTPS '03.
Li, M., Lee, W. C., & Sivasubramaniam, A. (2004). Semantic Small World: An Overlay Network for Peer-to-Peer Search. Proceedings of the Network Protocols, 12th IEEE International Conference on (ICNP'04)-Volume 00.
Li, M., Lee, W. C., Sivasubramaniam, A., & Lee, D. L. (2004). A small world overlay network for semantic based search in P2P systems. Proceedings of Workshop on Semantics in Peer-to-Peer and Grid Computing (SemPGrid), in conjunction with the World Wide Web Conference (WWW).
Lu, J., & Callan, J. (2003). Content-based retrieval in hybrid peer-to-peer networks. Proceedings of the twelfth international conference on Information and knowledge management.
Lua, K., Crowcroft, J., Pias, M., Sharma, R., & Lim, S. (2005). A survey and comparison of peer-to-peer overlay network schemes. Communications Surveys & Tutorials, IEEE.
Lv, Q., Cao, P., Cohen, E., Li, K., & Shenker, S. (2002). Search and replication in unstructured peer-to-peer networks. Proceedings of the 16th international conference on Supercomputing.
Mayer, T. (2005). Our Blog is Growing Up And So Has Our Index. Retrieved 2007.
Podnar, I., Luu, T., Rajman, M., Klemm, F., & Aberer, K. (2006). A Peer-to-Peer Architecture for Information Retrieval Across Digital Library Collections. ECDL '06, Alicante, Spain (to appear).
Ratnasamy, S., Francis, P., Handley, M., Karp, R., & Shenker, S. (2001). A scalable content-addressable network: ACM Press New York, NY, USA.
Reynolds, P., & Vahdat, A. (2003). Efficient peer-to-peer keyword searching. Proceedings of International Middleware Conference.
Robertson, S. E., Walker, S., Hancock-Beaulieu, M., Gull, A., & Lau, M. (1992). Okapi at TREC. Text REtrieval Conference.
Rowstron, A., & Druschel, P. (2001). Pastry: Scalable, distributed object location and routing for large-scale peer-to-peer systems. IFIP/ACM International Conference on Distributed Systems Platforms (Middleware), 11.
Shi, S., Yang, G., Wang, D., Yu, J., Qu, S., & Chen, M. (2004). Making peer-to-peer keyword searching feasible using multi-level partitioning. Proceedings of the 3rd IPTPS.
Stoica, I., Morris, R., Karger, D., Kaashoek, M. F., & Balakrishnan, H. (2001). Chord: A scalable peer-to-peer lookup service for internet applications. Proceedings of the 2001 SIGCOMM conference, 31(4).
Tang, C., & Dwarkadas, S. (2004). Hybrid Global-Local Indexing for Efficient Peer-to-Peer Information Retrieval. Proceedings of the 1st NSDI.
Tang, C., Dwarkadas, S., & Xu, Z. (2004). On scaling latent semantic indexing for large peer-to-peer systems. Proceedings of the 27th annual international conference on Research and development in information retrieval.
Tang, C., Xu, Z., & Dwarkadas, S. (2003). Peer-to-peer information retrieval using self-organizing semantic overlay networks. Proceedings of the 2003 conference on Applications, technologies, architectures, and protocols for computer communications.
Wikipedia home page.
Zhao, B. Y., Kubiatowicz, J., & Joseph, A. D. (2001). Tapestry: An Infrastructure for Fault-tolerant Wide-area Location and Routing. Computer, 74.

Distributed Hash Table

Distributed Hash Table Distributed Hash Table P2P Routing and Searching Algorithms Ruixuan Li College of Computer Science, HUST rxli@public.wh.hb.cn http://idc.hust.edu.cn/~rxli/ In Courtesy of Xiaodong Zhang, Ohio State Univ

More information

Aggregation of a Term Vocabulary for P2P-IR: a DHT Stress Test

Aggregation of a Term Vocabulary for P2P-IR: a DHT Stress Test Aggregation of a Term Vocabulary for P2P-IR: a DHT Stress Test Fabius Klemm and Karl Aberer School of Computer and Communication Sciences Ecole Polytechnique Fédérale de Lausanne (EPFL), Lausanne, Switzerland

More information

Early Measurements of a Cluster-based Architecture for P2P Systems

Early Measurements of a Cluster-based Architecture for P2P Systems Early Measurements of a Cluster-based Architecture for P2P Systems Balachander Krishnamurthy, Jia Wang, Yinglian Xie I. INTRODUCTION Peer-to-peer applications such as Napster [4], Freenet [1], and Gnutella

More information

Distriubted Hash Tables and Scalable Content Adressable Network (CAN)

Distriubted Hash Tables and Scalable Content Adressable Network (CAN) Distriubted Hash Tables and Scalable Content Adressable Network (CAN) Ines Abdelghani 22.09.2008 Contents 1 Introduction 2 2 Distributed Hash Tables: DHT 2 2.1 Generalities about DHTs............................

More information

A Scalable Content- Addressable Network

A Scalable Content- Addressable Network A Scalable Content- Addressable Network In Proceedings of ACM SIGCOMM 2001 S. Ratnasamy, P. Francis, M. Handley, R. Karp, S. Shenker Presented by L.G. Alex Sung 9th March 2005 for CS856 1 Outline CAN basics

More information

Aggregation of a Term Vocabulary for Peer-to-Peer Information Retrieval: a DHT Stress Test

Aggregation of a Term Vocabulary for Peer-to-Peer Information Retrieval: a DHT Stress Test Aggregation of a Term Vocabulary for Peer-to-Peer Information Retrieval: a DHT Stress Test Fabius Klemm and Karl Aberer School of Computer and Communication Sciences Ecole Polytechnique Fédérale de Lausanne

More information

Peer-to-Peer Systems. Chapter General Characteristics

Peer-to-Peer Systems. Chapter General Characteristics Chapter 2 Peer-to-Peer Systems Abstract In this chapter, a basic overview is given of P2P systems, architectures, and search strategies in P2P systems. More specific concepts that are outlined include

More information

A Framework for Peer-To-Peer Lookup Services based on k-ary search

A Framework for Peer-To-Peer Lookup Services based on k-ary search A Framework for Peer-To-Peer Lookup Services based on k-ary search Sameh El-Ansary Swedish Institute of Computer Science Kista, Sweden Luc Onana Alima Department of Microelectronics and Information Technology

More information

Building a low-latency, proximity-aware DHT-based P2P network

Building a low-latency, proximity-aware DHT-based P2P network Building a low-latency, proximity-aware DHT-based P2P network Ngoc Ben DANG, Son Tung VU, Hoai Son NGUYEN Department of Computer network College of Technology, Vietnam National University, Hanoi 144 Xuan

More information

An Architecture for Peer-to-Peer Information Retrieval

An Architecture for Peer-to-Peer Information Retrieval An Architecture for Peer-to-Peer Information Retrieval Karl Aberer, Fabius Klemm, Martin Rajman, Jie Wu School of Computer and Communication Sciences EPFL, Lausanne, Switzerland July 2, 2004 Abstract Peer-to-Peer

More information

A Directed-multicast Routing Approach with Path Replication in Content Addressable Network

A Directed-multicast Routing Approach with Path Replication in Content Addressable Network 2010 Second International Conference on Communication Software and Networks A Directed-multicast Routing Approach with Path Replication in Content Addressable Network Wenbo Shen, Weizhe Zhang, Hongli Zhang,

More information

On the Feasibility of Peer-to-Peer Web Indexing and Search

On the Feasibility of Peer-to-Peer Web Indexing and Search On the Feasibility of Peer-to-Peer Web Indexing and Search Jinyang Li Boon Thau Loo Joseph M. Hellerstein M. Frans Kaashoek David Karger Robert Morris MIT Lab for Computer Science UC Berkeley jinyang@lcs.mit.edu,

More information

A Chord-Based Novel Mobile Peer-to-Peer File Sharing Protocol

A Chord-Based Novel Mobile Peer-to-Peer File Sharing Protocol A Chord-Based Novel Mobile Peer-to-Peer File Sharing Protocol Min Li 1, Enhong Chen 1, and Phillip C-y Sheu 2 1 Department of Computer Science and Technology, University of Science and Technology of China,

More information

A Super-Peer Based Lookup in Structured Peer-to-Peer Systems

A Super-Peer Based Lookup in Structured Peer-to-Peer Systems A Super-Peer Based Lookup in Structured Peer-to-Peer Systems Yingwu Zhu Honghao Wang Yiming Hu ECECS Department ECECS Department ECECS Department University of Cincinnati University of Cincinnati University

More information

Dynamic Load Sharing in Peer-to-Peer Systems: When some Peers are more Equal than Others

Dynamic Load Sharing in Peer-to-Peer Systems: When some Peers are more Equal than Others Dynamic Load Sharing in Peer-to-Peer Systems: When some Peers are more Equal than Others Sabina Serbu, Silvia Bianchi, Peter Kropf and Pascal Felber Computer Science Department, University of Neuchâtel

More information

A Hybrid Peer-to-Peer Architecture for Global Geospatial Web Service Discovery

A Hybrid Peer-to-Peer Architecture for Global Geospatial Web Service Discovery A Hybrid Peer-to-Peer Architecture for Global Geospatial Web Service Discovery Shawn Chen 1, Steve Liang 2 1 Geomatics, University of Calgary, hschen@ucalgary.ca 2 Geomatics, University of Calgary, steve.liang@ucalgary.ca

More information

Architectures for Distributed Systems

Architectures for Distributed Systems Distributed Systems and Middleware 2013 2: Architectures Architectures for Distributed Systems Components A distributed system consists of components Each component has well-defined interface, can be replaced

More information

DYNAMIC TREE-LIKE STRUCTURES IN P2P-NETWORKS

DYNAMIC TREE-LIKE STRUCTURES IN P2P-NETWORKS DYNAMIC TREE-LIKE STRUCTURES IN P2P-NETWORKS Herwig Unger Markus Wulff Department of Computer Science University of Rostock D-1851 Rostock, Germany {hunger,mwulff}@informatik.uni-rostock.de KEYWORDS P2P,

More information

A Small World Overlay Network for Semantic Based Search in P2P Systems

A Small World Overlay Network for Semantic Based Search in P2P Systems A Small World Overlay Network for Semantic Based Search in P2P Systems Mei Li 1, Wang-Chien Lee 1, Anand Sivasubramaniam 1, and Dik Lun Lee 2 1 Pennsylvania State University, University Park, USA {meli,

More information

March 10, Distributed Hash-based Lookup. for Peer-to-Peer Systems. Sandeep Shelke Shrirang Shirodkar MTech I CSE

March 10, Distributed Hash-based Lookup. for Peer-to-Peer Systems. Sandeep Shelke Shrirang Shirodkar MTech I CSE for for March 10, 2006 Agenda for Peer-to-Peer Sytems Initial approaches to Their Limitations CAN - Applications of CAN Design Details Benefits for Distributed and a decentralized architecture No centralized

More information

Effect of Links on DHT Routing Algorithms 1

Effect of Links on DHT Routing Algorithms 1 Effect of Links on DHT Routing Algorithms 1 Futai Zou, Liang Zhang, Yin Li, Fanyuan Ma Department of Computer Science and Engineering Shanghai Jiao Tong University, 200030 Shanghai, China zoufutai@cs.sjtu.edu.cn

More information

Using Highly Discriminative Keys for Indexing in a Peer-to-Peer Full-Text Retrieval System

Using Highly Discriminative Keys for Indexing in a Peer-to-Peer Full-Text Retrieval System Using Highly Discriminative Keys for Indexing in a Peer-to-Peer Full-Text Retrieval System Toan Luu, Fabius Klemm, Martin Rajman, Karl Aberer Ecole Polytechnique Fdrale de Lausanne (EPFL) School of Computer

More information

Scalability In Peer-to-Peer Systems. Presented by Stavros Nikolaou

Scalability In Peer-to-Peer Systems. Presented by Stavros Nikolaou Scalability In Peer-to-Peer Systems Presented by Stavros Nikolaou Background on Peer-to-Peer Systems Definition: Distributed systems/applications featuring: No centralized control, no hierarchical organization

More information

Should we build Gnutella on a structured overlay? We believe

Should we build Gnutella on a structured overlay? We believe Should we build on a structured overlay? Miguel Castro, Manuel Costa and Antony Rowstron Microsoft Research, Cambridge, CB3 FB, UK Abstract There has been much interest in both unstructured and structured

More information

Query Processing Over Peer-To-Peer Data Sharing Systems

Query Processing Over Peer-To-Peer Data Sharing Systems Query Processing Over Peer-To-Peer Data Sharing Systems O. D. Şahin A. Gupta D. Agrawal A. El Abbadi Department of Computer Science University of California at Santa Barbara odsahin, abhishek, agrawal,

More information

On Scaling Latent Semantic Indexing for Large Peer-to-Peer Systems

On Scaling Latent Semantic Indexing for Large Peer-to-Peer Systems On Scaling Latent Semantic Indexing for Large Peer-to-Peer Systems Chunqiang Tang Dept. of Computer Science University of Rochester Rochester, NY 14627-226 sarrmor@cs.rochester.edu Sandhya Dwarkadas Dept.

More information

Shaking Service Requests in Peer-to-Peer Video Systems

Shaking Service Requests in Peer-to-Peer Video Systems Service in Peer-to-Peer Video Systems Ying Cai Ashwin Natarajan Johnny Wong Department of Computer Science Iowa State University Ames, IA 500, U. S. A. E-mail: {yingcai, ashwin, wong@cs.iastate.edu Abstract

More information

Supporting Multiple-Keyword Search in A Hybrid Structured Peer-to-Peer Network

Supporting Multiple-Keyword Search in A Hybrid Structured Peer-to-Peer Network Supporting Multiple-Keyword Search in A Hybrid Structured Peer-to-Peer Network Xing Jin W.-P. Ken Yiu S.-H. Gary Chan Department of Computer Science The Hong Kong University of Science and Technology Clear

More information

Load Sharing in Peer-to-Peer Networks using Dynamic Replication

Load Sharing in Peer-to-Peer Networks using Dynamic Replication Load Sharing in Peer-to-Peer Networks using Dynamic Replication S Rajasekhar, B Rong, K Y Lai, I Khalil and Z Tari School of Computer Science and Information Technology RMIT University, Melbourne 3, Australia

More information

Making Gnutella-like P2P Systems Scalable

Making Gnutella-like P2P Systems Scalable Making Gnutella-like P2P Systems Scalable Y. Chawathe, S. Ratnasamy, L. Breslau, N. Lanham, S. Shenker Presented by: Herman Li Mar 2, 2005 Outline What are peer-to-peer (P2P) systems? Early P2P systems

More information

Flexible Information Discovery in Decentralized Distributed Systems

Flexible Information Discovery in Decentralized Distributed Systems Flexible Information Discovery in Decentralized Distributed Systems Cristina Schmidt and Manish Parashar The Applied Software Systems Laboratory Department of Electrical and Computer Engineering, Rutgers

More information

A Survey of Peer-to-Peer Content Distribution Technologies

A Survey of Peer-to-Peer Content Distribution Technologies A Survey of Peer-to-Peer Content Distribution Technologies Stephanos Androutsellis-Theotokis and Diomidis Spinellis ACM Computing Surveys, December 2004 Presenter: Seung-hwan Baek Ja-eun Choi Outline Overview

More information

A LOAD BALANCING ALGORITHM BASED ON MOVEMENT OF NODE DATA FOR DYNAMIC STRUCTURED P2P SYSTEMS

A LOAD BALANCING ALGORITHM BASED ON MOVEMENT OF NODE DATA FOR DYNAMIC STRUCTURED P2P SYSTEMS A LOAD BALANCING ALGORITHM BASED ON MOVEMENT OF NODE DATA FOR DYNAMIC STRUCTURED P2P SYSTEMS 1 Prof. Prerna Kulkarni, 2 Amey Tawade, 3 Vinit Rane, 4 Ashish Kumar Singh 1 Asst. Professor, 2,3,4 BE Student,

More information

A Top Catching Scheme Consistency Controlling in Hybrid P2P Network

A Top Catching Scheme Consistency Controlling in Hybrid P2P Network A Top Catching Scheme Consistency Controlling in Hybrid P2P Network V. Asha*1, P Ramesh Babu*2 M.Tech (CSE) Student Department of CSE, Priyadarshini Institute of Technology & Science, Chintalapudi, Guntur(Dist),

More information

Subway : Peer-To-Peer Clustering of Clients for Web Proxy

Subway : Peer-To-Peer Clustering of Clients for Web Proxy Subway : Peer-To-Peer Clustering of Clients for Web Proxy Kyungbaek Kim and Daeyeon Park Department of Electrical Engineering & Computer Science, Division of Electrical Engineering, Korea Advanced Institute

More information

A Structured Overlay for Non-uniform Node Identifier Distribution Based on Flexible Routing Tables

A Structured Overlay for Non-uniform Node Identifier Distribution Based on Flexible Routing Tables A Structured Overlay for Non-uniform Node Identifier Distribution Based on Flexible Routing Tables Takehiro Miyao, Hiroya Nagao, Kazuyuki Shudo Tokyo Institute of Technology 2-12-1 Ookayama, Meguro-ku,

More information

A Survey of Peer-to-Peer Systems

A Survey of Peer-to-Peer Systems A Survey of Peer-to-Peer Systems Kostas Stefanidis Department of Computer Science, University of Ioannina, Greece kstef@cs.uoi.gr Abstract Peer-to-Peer systems have become, in a short period of time, one

More information

A Square Root Topologys to Find Unstructured Peer-To-Peer Networks

A Square Root Topologys to Find Unstructured Peer-To-Peer Networks Global Journal of Computer Science and Technology Network, Web & Security Volume 13 Issue 2 Version 1.0 Year 2013 Type: Double Blind Peer Reviewed International Research Journal Publisher: Global Journals

More information

Athens University of Economics and Business. Dept. of Informatics

Athens University of Economics and Business. Dept. of Informatics Athens University of Economics and Business Athens University of Economics and Business Dept. of Informatics B.Sc. Thesis Project report: Implementation of the PASTRY Distributed Hash Table lookup service

More information

: Scalable Lookup

: Scalable Lookup 6.824 2006: Scalable Lookup Prior focus has been on traditional distributed systems e.g. NFS, DSM/Hypervisor, Harp Machine room: well maintained, centrally located. Relatively stable population: can be

More information

A Peer-to-Peer Architecture to Enable Versatile Lookup System Design

A Peer-to-Peer Architecture to Enable Versatile Lookup System Design A Peer-to-Peer Architecture to Enable Versatile Lookup System Design Vivek Sawant Jasleen Kaur University of North Carolina at Chapel Hill, Chapel Hill, NC, USA vivek, jasleen @cs.unc.edu Abstract The

More information

Huffman-DHT: Index Structure Refinement Scheme for P2P Information Retrieval

Huffman-DHT: Index Structure Refinement Scheme for P2P Information Retrieval International Symposium on Applications and the Internet Huffman-DHT: Index Structure Refinement Scheme for P2P Information Retrieval Hisashi Kurasawa The University of Tokyo 2-1-2 Hitotsubashi, Chiyoda-ku,

More information

Exploiting Semantic Clustering in the edonkey P2P Network

Exploiting Semantic Clustering in the edonkey P2P Network Exploiting Semantic Clustering in the edonkey P2P Network S. Handurukande, A.-M. Kermarrec, F. Le Fessant & L. Massoulié Distributed Programming Laboratory, EPFL, Switzerland INRIA, Rennes, France INRIA-Futurs

More information

Evaluating Unstructured Peer-to-Peer Lookup Overlays

Evaluating Unstructured Peer-to-Peer Lookup Overlays Evaluating Unstructured Peer-to-Peer Lookup Overlays Idit Keidar EE Department, Technion Roie Melamed CS Department, Technion ABSTRACT Unstructured peer-to-peer lookup systems incur small constant overhead

More information

Evolution of Peer-to-peer algorithms: Past, present and future.

Evolution of Peer-to-peer algorithms: Past, present and future. Evolution of Peer-to-peer algorithms: Past, present and future. Alexei Semenov Helsinki University of Technology alexei.semenov@hut.fi Abstract Today peer-to-peer applications are widely used for different

More information

CS514: Intermediate Course in Computer Systems

CS514: Intermediate Course in Computer Systems Distributed Hash Tables (DHT) Overview and Issues Paul Francis CS514: Intermediate Course in Computer Systems Lecture 26: Nov 19, 2003 Distributed Hash Tables (DHT): Overview and Issues What is a Distributed

More information

Addressed Issue. P2P What are we looking at? What is Peer-to-Peer? What can databases do for P2P? What can databases do for P2P?

Addressed Issue. P2P What are we looking at? What is Peer-to-Peer? What can databases do for P2P? What can databases do for P2P? Peer-to-Peer Data Management - Part 1- Alex Coman acoman@cs.ualberta.ca Addressed Issue [1] Placement and retrieval of data [2] Server architectures for hybrid P2P [3] Improve search in pure P2P systems

More information

Scalable overlay Networks

Scalable overlay Networks overlay Networks Dr. Samu Varjonen 1 Lectures MO 15.01. C122 Introduction. Exercises. Motivation. TH 18.01. DK117 Unstructured networks I MO 22.01. C122 Unstructured networks II TH 25.01. DK117 Bittorrent

More information

Peer Clustering and Firework Query Model

Peer Clustering and Firework Query Model Peer Clustering and Firework Query Model Cheuk Hang Ng, Ka Cheung Sia Department of Computer Science and Engineering The Chinese University of Hong Kong Shatin, N.T., Hong Kong SAR {chng,kcsia}@cse.cuhk.edu.hk

More information

CS555: Distributed Systems [Fall 2017] Dept. Of Computer Science, Colorado State University

CS555: Distributed Systems [Fall 2017] Dept. Of Computer Science, Colorado State University CS 555: DISTRIBUTED SYSTEMS [P2P SYSTEMS] Shrideep Pallickara Computer Science Colorado State University Frequently asked questions from the previous class survey Byzantine failures vs malicious nodes

More information
