Towards large scale peer-to-peer web search


Gerwin van Doorn
Human Media Interaction, University of Twente, the Netherlands

ABSTRACT
Web search engines, such as Google and Yahoo, are based on the centralized database model. Search engines using this model suffer from several drawbacks: they have a single point of failure, they offer only a limited representation of the web, their index is not up to date, and they scale poorly. A lot of research is currently being done on using peer-to-peer (P2P) technology for full-text search in order to overcome these issues. Although P2P systems have proven to be highly scalable in file-sharing applications, this is not so obvious for large-scale full-text search. In this paper I discuss and compare some of the most important P2P architectures based on their literature. Based on this, I set out the direction future P2P research should take in order to make large-scale P2P web search possible.

Keywords
Distributed Hash Table, Peer to peer, Overlay network, Hybrid, Semantic

1. INTRODUCTION
Over the last few years the World Wide Web has gone through some remarkable changes. Where previously only companies and universities could add content to the web, nowadays users contribute a great deal of its content. One of the best known examples where professional and non-professional users contribute content is Wikipedia ("Wikipedia home page,"). Weblogs are another fast-growing way for people to add content to the web.

One of the most important and most widely used applications of the Web is full-text keyword search. Without it, it would be impossible to navigate the internet efficiently. Nevertheless, the current size of the Web makes this a difficult problem. Current search engines for the Web are usually based on a single database model.
With the single database model, documents from the web are copied from their location to a central database where they are stored and indexed. This model of information retrieval has been very successful because initially the web was not very big, the web was not a commercial medium, and HTML pages were not very complex (Callan, 2000). In 2000 Google indexed approximately 1 billion pages. The index size increased by 1 billion each year until 2005, when it doubled from approximately 4 billion to 8 billion ("The Internet Archive," 2006). Google has since stopped listing its index size, but after Yahoo announced an index size of 20 billion (Mayer, 2005), Google claimed that its index is 3 times bigger than that of any other search engine (Google, 2005).

There is also a lot of information that cannot be indexed, either because there are no links to these pages or because they are dynamically generated from a database. These pages are referred to as the deep web or the invisible web. This means that there is even more information on the web than one can currently search for with a general search engine that needs to crawl web servers. In 2000 the deep web was estimated to contain over 550 billion individual documents (Bergman, 2000).

Current search engines working with a central database model need to crawl the web periodically to keep their information up to date. A complete re-crawl takes a long time and is very inefficient. Web crawlers cannot know whether a document was actually updated until the document is downloaded, which forces them to download more documents than necessary.

Search engines based on the single database model also lack robustness. When the database goes offline, searching is impossible. In order to prevent such a dramatic scenario, data needs to be replicated, and thus more hardware resources are needed.

Currently, major companies like Google, Yahoo! and Microsoft control every aspect of internet search, intranet search and even desktop search. Users depend on these companies to find the information they are looking for. In order to compete with these companies, enormous investments have to be made to create the necessary infrastructure. This makes it almost impossible to compete and to create new and better information retrieval services for a large audience.

Peer-to-peer (P2P) networking is one of the most rapidly developing areas of modern computing (Klampanos & Jose, 2004). There is an increasing interest in utilizing the P2P approach for web search in order to overcome the aforementioned problems. Although P2P networking has proven to be very effective for file sharing, its suitability for full-text web search has yet to be proven. One of the most important issues to overcome in order to make P2P web search possible is scalability.

2. PROBLEM STATEMENT
The task of web search using a P2P network comes down to selecting the best documents in relation to a query, while these documents are scattered across a physical network. Although many different P2P systems have been proposed for keyword search, it is not clear which kind of P2P system has the greatest potential for large-scale web search, and which characteristics such a system should have. This paper looks at several important P2P architectures and compares them based on their literature in order to set out future directions of research.

This paper is structured as follows: Section 3 discusses the first generation of P2P networks, which were mainly used for file-sharing applications. Section 4 discusses Distributed Hash Tables and P2P systems that make use of them. Section 5 describes how indexing schemes can be combined to form hybrid P2P systems, avoiding the limitations of previous indexing schemes. Section 6 describes P2P systems that make use of Information Retrieval methods to locate data.

Section 7 summarizes the discussed systems and compares them based on the information given in their literature. Finally, conclusions are drawn in section 8.

3. FIRST P2P NETWORKS
3.1 Centralized and unstructured
One of the first systems that made P2P sharing popular was Napster (Chawathe, Ratnasamy, Breslau, Lanham, & Shenker, 2003; Lu & Callan, 2003). Napster takes advantage of the already available bandwidth and storage space of the connected peers.

Figure 1 Centralized and unstructured P2P architecture

A peer is connected to a central server cluster. Each active peer sends a list of its shared filenames to the cluster. When a node is looking for a file, a query is sent to this cluster. The search engine returns a list of other peers that have the desired content, and the searching peer can then connect directly to those peers. This way the bandwidth required of the server is small, while the download bandwidth is large because it is shared among the peers. Napster is not a pure peer-to-peer system, as some tasks (e.g., searching) make use of a client-server model. The major drawbacks of these centralized systems are that they scale poorly and have a single point of failure (Lv, Cao, Cohen, Li, & Shenker, 2002).

3.2 Decentralized and unstructured
In decentralized unstructured P2P systems, like Gnutella (Clip, 2000), there is no centralized directory service as there is with Napster. The network is formed by nodes joining the network using a few loose rules. In order to find a file, a node queries its neighboring peers. Flooding (Clip, 2000) or random walks (Lv et al., 2002) are used on this graph to propagate queries. Each peer indexes its own content. This is often referred to as document-based partitioning or local indexing.

Figure 2 Local indexing (X: a, b; Y: b, c; Z: a, b, c)

As we can see in Figure 2, each peer indexes its own documents (X, Y and Z), each document containing one or more terms (a, b and c).
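The combination of local indexing and query flooding can be illustrated with a minimal sketch. It is not Gnutella's actual protocol: the overlay topology, the peer names p1 to p5 and the assignment of documents X, Y and Z to peers are hypothetical, and the hop limit `ttl` is an assumed parameter that bounds how far a query is forwarded.

```python
from collections import deque

# Hypothetical overlay: peer name -> neighbouring peers (illustrative topology).
NEIGHBORS = {
    "p1": ["p2", "p3"],
    "p2": ["p1", "p4"],
    "p3": ["p1", "p4"],
    "p4": ["p2", "p3", "p5"],
    "p5": ["p4"],
}

# Local indexes: each peer indexes only its own documents (term -> documents).
# The placement of X, Y and Z on peers is an assumption made for this example.
LOCAL_INDEX = {
    "p1": {},
    "p2": {"a": {"X"}, "b": {"X"}},
    "p3": {"b": {"Y"}, "c": {"Y"}},
    "p4": {},
    "p5": {"a": {"Z"}, "b": {"Z"}, "c": {"Z"}},
}


def flood(start, term, ttl):
    """Flood a one-term query from `start`; return (hits, messages_sent)."""
    seen = {start}
    queue = deque([(start, ttl)])
    hits, messages = set(), 0
    while queue:
        peer, hops_left = queue.popleft()
        # Every visited peer answers from its own local index only.
        for doc in LOCAL_INDEX[peer].get(term, ()):
            hits.add((peer, doc))
        if hops_left == 0:
            continue  # hop limit reached: do not forward any further
        for nb in NEIGHBORS[peer]:
            messages += 1  # every forward costs a message, even a duplicate
            if nb not in seen:
                seen.add(nb)
                queue.append((nb, hops_left - 1))
    return hits, messages


print(flood("p1", "c", 1))
print(flood("p1", "c", 3))
```

In this toy network a hop limit of 1 costs 2 messages and finds only document Y, while a hop limit of 3 costs 9 messages and also finds Z: a tighter limit saves traffic but can miss matching documents.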
Figure 3 Decentralized and unstructured P2P architecture

Flooding is a very inefficient process that generates a large amount of network traffic, as a query usually gets sent to many or all nodes in the network. The search mechanisms in these networks are not very scalable due to the communication costs and the large load on the network participants. For instance, according to Tang et al., a 128K-node Gnutella-like network would transmit 192MB of data when propagating a single query by flooding (Tang, Xu, & Dwarkadas, 2003). Content allocation and network structure are uncorrelated in these networks, as nodes often connect to the network based on physical measures like join order, connection speed, etc. A TTL (Time To Live) is usually used to limit the number of hops a query travels between nodes. This reduces network traffic but does not guarantee a query hit.

4. DHT BASED P2P ARCHITECTURES
4.1 Distributed Hash Tables
DHT stands for Distributed Hash Table. A DHT is a hash table that is distributed among cooperating computers, which are referred to as nodes or peers. DHTs are also referred to as structured overlay networks, as their nodes are connected to each other over an existing network, like the internet. Like a hash table, a DHT contains key/value pairs, also referred to as items. The primary service provided by a DHT is the lookup operation: given a key, find the corresponding value. Both the key and the value representation can be arbitrary, such as a string or an object. In the case of a keyword-based search engine, an item usually consists of a keyword that is mapped to the peer responsible for the list of documents containing the keyword. This list is also referred to as a posting list.

Table 1 DHT mapping keywords to peers
Key          Value
information  peer 1
retrieval    peer 1
car          peer 2
tomato       peer 3

A DHT can be constructed from many nodes, making it possible to efficiently store large amounts of data, as each node is responsible for a part of the items. Each node has a routing table that stores the node's neighbors. When a lookup is performed on one of the nodes, and this node does not contain the item itself, the node routes the query to its neighbors until it reaches the node that is responsible for the provided key.

Another important feature of DHTs is that they are, to a certain degree, fault-tolerant. DHTs should be able to perform lookups even when some nodes fail. This is usually achieved by replicating the items (key/value pairs) of each node on neighboring nodes.

Routing hops
Hops are the re-routes needed to deliver a query to the responsible node, and it is important for the efficiency of the DHT to keep their number as low as possible. Each extra hop means extra latency for sending an extra message. More hops also mean a greater chance that one or more nodes fail during a lookup. The bigger each node's routing table, the fewer hops are needed. However, the bigger the routing table, the more messages need to be broadcast in order to keep the routing information up to date.

Routing time
Reducing the number of hops does not necessarily reduce the time it takes to reach the destination. A route that needs five hops and uses local nodes can still take less time than a route that needs two hops but goes through a node on the other side of the world.

4.2 The first Distributed Hash Tables
Different DHTs exist, differing in how they are structured and how they handle replication, node joins, node failures, and node departures. The first DHTs appeared around 2001 and were CAN (Ratnasamy, Francis, Handley, Karp, & Shenker, 2001), Chord (Stoica, Morris, Karger, Kaashoek, & Balakrishnan, 2001), Pastry (Rowstron & Druschel, 2001) and Tapestry (Zhao, Kubiatowicz, & Joseph, 2001). In order to demonstrate how these systems operate, the CAN and Chord systems are briefly described in the following sections. A more extensive overview and comparison of P2P overlay networks is given by (Lua, Crowcroft, Pias, Sharma, & Lim, 2005).

Chord
The Chord protocol (Stoica et al., 2001) makes use of consistent hashing to assign keys to nodes. This tends to balance the load, as each node receives roughly the same number of keys. A node's identifier is chosen by hashing the node's IP address. The key identifier is chosen by hashing the key. The identifiers are ordered in a circular $2^m$ identifier space and can be represented as in Figure 4.

Figure 4 Chord identifier circle, m = 3 (nodes 0, 1 and 4; keys 0, 1 and 3)

A key k is assigned to the first node whose identifier is equal to or follows k in the identifier space. This node is called the successor node of k. This is demonstrated in Figure 4, where the key with identifier 3 is assigned to the node located at 4, as this is the successor node of identifier 3. If node 4 were to leave the network, all of its assigned keys would be reassigned to its successor, node 0. If a node were to join at identifier 3, key 3 would be reassigned to node 3.

Each node stores only a small amount of routing information. The nodes know their successor but also maintain additional routing information to make routing more efficient. This additional information is stored in a routing table of maximum size m. The i-th entry of the table of node n contains the identity of the first node that succeeds n by at least $2^{i-1}$.
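The key assignment and routing table rules can be checked against the Figure 4 example with a small sketch. It assumes the usual Chord convention that the i-th routing table entry of node n is the successor of identifier n + 2^(i-1) modulo 2^m; the function names `successor`, `finger_table` and `key_id` are illustrative, and a real deployment would use a far larger m.

```python
import hashlib
from bisect import bisect_left

M = 3            # identifier bits, as in the Figure 4 example
RING = 2 ** M    # size of the circular identifier space


def key_id(keyword):
    """Hash a keyword onto the identifier circle (SHA-1 is an assumed choice)."""
    digest = hashlib.sha1(keyword.encode("utf-8")).digest()
    return int.from_bytes(digest, "big") % RING


def successor(nodes, ident):
    """Return the first node whose identifier equals or follows `ident`."""
    ordered = sorted(nodes)
    i = bisect_left(ordered, ident % RING)
    return ordered[i % len(ordered)]  # wrap around past the largest identifier


def finger_table(nodes, n):
    """Routing table of node n: entry i is the successor of n + 2^(i-1)."""
    return [successor(nodes, n + 2 ** (i - 1)) for i in range(1, M + 1)]


nodes = {0, 1, 4}
print(successor(nodes, 3))            # key 3 is stored on node 4
print(successor(nodes - {4}, 3))      # node 4 leaves: key 3 moves to node 0
print(successor(nodes | {3}, 3))      # a node joins at 3: key 3 moves to node 3
print(finger_table(nodes, 4))         # [0, 0, 0]
print(finger_table(nodes, 0))         # [1, 4, 4]
```

With only three nodes on an eight-position circle, every routing table entry of node 4 points to node 0, which matches the lookup example in the text where node 4 forwards a query for identifier 1 to node 0.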
When node n does not know the successor of a key k, it looks in its routing table for a node whose identifier is closer to k than its own and precedes k. This node will know more about k than node n does. This process is repeated until the successor node of key k is found. Suppose node 4 from Figure 4 wants to know the successor node of identifier 1. Node 4 will look at position 2 of its routing table and find node 0 to be closest. Node 0 will ask node 1 to find the successor of identifier 1. Node 1 will return its own identity to node 4, as node 1 is the successor node for identifier 1. Each node only needs to maintain routing information about O(log N) other nodes in order to resolve all lookups via O(log N) messages, N being the total number of nodes in the network.

CAN
CAN stands for Content Addressable Network and uses a d-dimensional Cartesian coordinate space on a d-torus to implement a distributed hash table that maps keys onto values (Ratnasamy et al., 2001). The coordinate space is partitioned among all the nodes in the system, each node owning its own distinct zone. In the virtual coordinate space, each key/value pair is mapped onto a point using a uniform hash function. The key/value pair is stored on the node that is responsible for the zone to which the pair is mapped. Each CAN node holds a routing table containing the IP address and virtual coordinate zone of each neighbor. A CAN message is routed towards its destination by greedy forwarding to the neighbor closest to the destination coordinate.

Figure 5 Example of a node G joining a 2d CAN

A new node randomly chooses a point P in the CAN coordinate space and requests to join through an existing CAN node. This request is then routed to the node responsible for the zone in which P lies. The owner of the zone splits the zone in half and assigns one half to the new node. The new node takes over some of the neighbors of the previous owner, and the neighbors of both nodes are informed of the reallocation. Figure 5 shows the joining of node G in the CAN. The number of neighbors a node maintains is independent of the total number of nodes in the network and depends only on the number of dimensions. The average length of routing paths in CAN is $O(dn^{1/d})$, where d is the number of dimensions and n is the total number of nodes in the system.

4.3 DHTs and search
Naïve usage
When dividing responsibility for the words from the document corpus among peers, DHTs can be used to map a word to the peer that is responsible for this word. This is also referred to as partitioning by keyword or global indexing.

Figure 6 Global indexing (a: X, Z; b: X, Y, Z; c: Y, Z)

Each peer stores the posting list of each word it is responsible for. A posting list is a list of the documents (represented in Figure 6 by X, Y and Z) containing this word. When searching for documents containing a word, the DHT resolves the peer that is responsible for this word. The peer then returns the posting list containing the documents the word occurs in. When a query contains multiple terms, the posting list of one or more terms needs to be sent over the network in order to intersect it with another posting list. Suppose we search for "information retrieval". When "retrieval" has a smaller posting list than "information", this posting list can be sent to the node responsible for the term "information", so that this node can intersect both posting lists to create a list of documents containing both terms.

Figure 7 Naive implementation of a DHT for search

This can be an enormous amount of information that needs to be sent over the network.

Optimizations
Several optimizations were suggested by (J. Li et al., 2003) and (Reynolds & Vahdat, 2003), such as:
- Caching and pre-computing posting-list intersections
- Bloom filter compression
- Incremental results

The problem with these optimizations is that either huge posting lists still need to be transferred over the network, or the search quality degrades. When dealing with enormous amounts of data, as is the case with the internet, these optimizations still do not suffice.

A Bloom filter is a hash-based data structure that summarizes membership in a set. Sending a Bloom filter based on a set A, A being a posting list, instead of the posting list A itself reduces the amount of communication required to determine the intersection A ∩ B. Take for example the query "Paris Hilton", which was popular according to Google Zeitgeist 2006. A search for Paris on Google returns results; Hilton returns results. We will call the result set from the query Paris set A and the result set from Hilton set B. According to (Reynolds & Vahdat, 2003), the ideal Bloom filter size would be calculated as follows:

A log A B j

In the ideal case the smallest set is sent first; in this case this is set A with results. According to this formula the size of the Bloom filter to be sent will be:

( ) log = j bits or 134MB

This is smaller than sending the complete list of 128-bit document identifiers, which would consume 1.4GB, but it is still an enormous amount of data for just one query. A side effect of Bloom filters is that they allow false positives. The false-positive rate rises sharply as the Bloom filter is made smaller relative to the set it summarizes, so the filter cannot be shrunk arbitrarily. A large amount of extra storage space and memory is needed when caching pre-computed posting-list intersections.

The biggest issue with keyword-based indexing is the fact that it is not possible to rank results with respect to the query before a posting list is transferred. Search engine users usually only look at the first n results. According to a search engine user study by iProspect, 88% of search engine users change their search terms and/or search engine when they cannot find what they seek in the first 3 result pages; 41% of these users change their search query after the first page. Most users (82%) stick with the same search engine when they re-launch their search and add more search terms to their query (iProspect Search Engine User Behavior Study, 2006).

Reynolds and Vahdat suggest using Fagin's algorithm to incrementally retrieve a ranked list of documents. Fagin's algorithm, originally proposed in (Fagin, 1999), and some variants of this algorithm were tested in ODISSEA, but no testing was performed with large data sets (Jo-Wen, Zhang, Long, & Shanmugasundaram, 2003). The returned top k results are chosen based on the ranking of the documents for the individual query terms and not for the complete query. The problem with Fagin's algorithm is that when the terms in a query are uncorrelated, the algorithm might take as long as a non-optimized algorithm, resulting in slow performance.

In order to make P2P web search scalable, query results should be ranked by the nodes that are responsible for the partial or complete query.
Only the top k ranked documents should be returned and, if necessary, combined with the top results returned by other nodes.

Highly Discriminative Keys
To avoid transferring large posting lists across the network, Podnar et al. propose the use of Highly Discriminative Keys (Aberer, Klemm, Luu, Podnar, & Rajman, ; Podnar, Luu, Rajman, Klemm, & Aberer, 2006). They propose a novel indexing strategy that is key-based instead of single-term-based. This means that posting lists are mapped to keys, like "information retrieval", instead of to the separate terms "information" and "retrieval". The indexing procedure is costly, but as their focus is on digital libraries (DL), which are rather static, this does not have to be a problem.

The key idea of their indexing strategy is to limit the posting list size of the global P2P index to a constant predefined value, and to extend the index vocabulary to improve retrieval effectiveness. In a single-term index, posting lists can be extremely large. Joining these posting lists at query time consumes a large amount of network bandwidth. By using many HDKs (Highly Discriminative Keys, or rare keys), more specific but smaller posting lists are created, which are joined at indexing time. This is in a way similar to generating queries and linking them to the n best matching documents.

The quality of a key k for a given document d with respect to indexing adequacy is determined by its discriminative power. The key must be as discriminative as possible with respect to the document and the document collection. A key is frequent, and thus has low discriminative power, when its document frequency DF(k) is bigger than a fixed document frequency DFmax. When DF(k) < DFmax, the key is considered rare. Because many term combinations can form potential rare keys, proximity and redundancy filtering is used to keep the rare key vocabulary to a manageable size.
The proximity filter retains keys built of words that appear close together in documents, as they are likely to appear together in a query. The redundancy filter removes supersets of rare keys.

In order to retrieve documents, queries are mapped to keys. It is possible that a user provides a query that already is an HDK. It is of course also possible that users query frequent keys or terms not covered by the HDKs. Podnar et al. propose some options to overcome this problem. A valid option might be to notify the user that the query is not discriminative enough with respect to the document collection, and to support the user in refining the query. Two other options they propose are to find terms that are semantically similar to the query terms, or to index the k best documents for frequent keys (which only make up 1% of the vocabulary).

This system showed acceptable retrieval performance compared to a single-term TF-IDF baseline system, without additional techniques for dealing with queries containing frequent keys. The retrieved posting lists were still 92% smaller than those of the baseline when the overlap with the baseline system on the top 20 results was 94.30%.

5. HYBRID P2P ARCHITECTURES
As demonstrated by (J. Li et al., 2003), using either keyword-based partitioning or document-based partitioning, including their optimizations, does not make large-scale keyword search feasible. In order to overcome this, several people have proposed using a combination of local and global indexing, called hybrid indexing.

5.1 eSearch
Tang et al. propose a P2P keyword search system called eSearch, based on a hybrid indexing structure (Tang & Dwarkadas, 2004). The key idea of their method is selective metadata distribution. As in global indexing, the metadata is distributed based on terms, with each node being responsible for one or more terms. Each node also stores the complete term list of each document it is responsible for.
The optimization of eSearch is that it does not distribute a document's term list to the nodes responsible for every term in the document, but only to the nodes responsible for its top n terms. The top n terms are identified using Okapi term weighting (Robertson, Walker, Hancock-Beaulieu, Gull, & Lau, 1992).
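Selective publishing under the top n terms can be sketched as follows. This is only an illustration of the idea, not eSearch itself: raw term frequency stands in for Okapi term weighting, a plain dictionary keyed by term stands in for the nodes of the DHT, and the names `top_terms`, `publish` and `search` are hypothetical.

```python
from collections import Counter


def top_terms(text, n):
    """Return the n heaviest terms of a document and its full term list.

    Assumption: raw term frequency replaces the Okapi weights of the paper.
    """
    counts = Counter(text.lower().split())
    # Sort by descending weight, then alphabetically so ties are deterministic.
    ranked = sorted(counts.items(), key=lambda kv: (-kv[1], kv[0]))
    return [term for term, _ in ranked[:n]], counts


def publish(index, doc_id, text, n):
    """Store the complete term list of the document under its top n terms only."""
    terms, counts = top_terms(text, n)
    for term in terms:
        index.setdefault(term, {})[doc_id] = counts


def search(index, query):
    """Score documents at the entries of the query terms using full term lists."""
    q = query.lower().split()
    scores = {}
    for term in q:
        for doc_id, counts in index.get(term, {}).items():
            # The stored term list lets the entry score the whole query locally.
            scores[doc_id] = sum(counts[t] for t in q)
    return sorted(scores.items(), key=lambda kv: (-kv[1], kv[0]))


index = {}
publish(index, "X", "peer peer peer search index", n=2)
publish(index, "Y", "search search engine peer", n=2)
print(search(index, "peer search"))
print(search(index, "search"))
```

The second query returns only document Y, although X also contains the word "search": X was published under "peer" and "index" only. This is the loss in recall that comes with publishing a document under a limited number of terms.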

Figure 8 Hybrid indexing (a: X; b: X, Y; c: Z; each posting is stored with the complete term list of its document, X: a, b; Y: b, c; Z: a, b, c)

This optimization can degrade the quality of the search results, as a query on a term that is not among the top terms of a document cannot find this document. The authors' argument is that when none of the query terms are among the top terms of a document, IR algorithms are unlikely to rank this document among the best matching documents for this query anyway. To reduce the chance of missing relevant documents they adopt automatic query expansion. With the use of automatic query expansion, eSearch obtains search quality as good as the centralized baselines by publishing a document under its top 20 terms.

When using automatic query expansion, the hybrid indexing structure is first used to retrieve a small number of so-called feedback documents. For each term in these feedback documents, the average weight among the feedback documents is calculated. The system then chooses the k terms with the biggest average weight and adds these to the query, as they are assumed to be relevant to the query. This new query is then used to retrieve the final set of documents.

eSearch only transmits 3.3KB of data on average during a retrieval operation. This amount is independent of the corpus size and grows logarithmically with the number of nodes.

In addition to all this, eSearch also distributes the load for each term. The DHT-based Chord protocol that eSearch uses is modified to prevent nodes responsible for popular terms from storing more data than others. To prevent hot spots, eSearch also caches popular queries for a certain amount of time at the nodes processing the query and along the paths over which it is forwarded. When a query arrives at a node where the result of this same query is cached, the result is returned immediately.
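The automatic query expansion step can be sketched as below. The sketch makes two assumptions that the text does not settle: a term that is absent from a feedback document contributes a weight of 0 to the average, and terms already in the query are not added again. The function name `expand_query` and the example weights are hypothetical.

```python
def expand_query(query_terms, feedback_docs, k):
    """Add the k terms with the highest average weight in the feedback documents.

    feedback_docs is a list of dicts mapping term -> weight, one per document.
    """
    if not feedback_docs:
        return list(query_terms)  # nothing to learn from, keep the query as is
    totals = {}
    for weights in feedback_docs:
        for term, w in weights.items():
            totals[term] = totals.get(term, 0.0) + w
    n = len(feedback_docs)
    # Average over all feedback documents; skip terms already in the query.
    avg = {t: s / n for t, s in totals.items() if t not in query_terms}
    # Highest average first; alphabetical order breaks ties deterministically.
    best = sorted(avg, key=lambda t: (-avg[t], t))[:k]
    return list(query_terms) + best


feedback = [
    {"peer": 0.9, "overlay": 0.6, "chord": 0.2},
    {"peer": 0.8, "overlay": 0.4, "napster": 0.5},
]
print(expand_query(["peer"], feedback, k=2))
```

For these example weights the query "peer" is expanded with "overlay" and "napster", whose average weights exceed that of "chord". The expanded query would then be submitted a second time to retrieve the final result set.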
The storage space consumed by eSearch is 6.8 times that of global indexing systems, but the benefit is its low search cost. The authors' argument is that disk capacity has increased 160 times faster than network bandwidth, and thus trading modest disk space for communication and precision is a proper design choice for P2P systems. Further storage reduction techniques, like pruning and index compression, can be added.

5.2 Multi level partitioning
Another hybrid index partitioning scheme, named Multi Level Partitioning (MLP), is proposed by (Shi et al., 2004). MLP is adaptive and can balance between keyword-based and document-based partitioning.

Figure 9 Multi Level Partitioning

MLP relies on a node group hierarchy. In this hierarchy, nodes are logically divided into groups, each group being divided into subgroups until level n (see Figure 9). At level n, keyword-based partitioning is applied within the group. This means that multiple nodes from different groups can be responsible for the same keyword. When a query is processed, it is first sent down the hierarchy to level 1, and then to all groups of each next level until it reaches level n. In each group, the nodes responsible for the keywords in the query perform an intersection of the inverted lists. The intersections are done in parallel. Because keyword-based partitioning is used within the group, the inverted lists that need to be transmitted to do the intersection are much smaller. The results are combined level by level and sent back to the peer that started the search.

MLP achieves an enormous cut-back on bisection bandwidth ( times, with respect to keyword based partitioning), more than was suggested in (J. Li et al., 2003). Shi et al. do not describe how efficient ranking of the results can be done using MLP. Although MLP cuts back on the bisection bandwidth, the aggregated bandwidth is still huge.

6. SEMANTIC P2P ARCHITECTURES
Many P2P search systems use simple keyword matching to find documents matching a query. These systems distribute documents randomly, without any notion of their content. By making the distance between documents proportional to their semantic similarity, the search space and communication cost can be reduced. This is what semantic P2P systems do.

pSearch
pSearch uses a classic IR algorithm, Latent Semantic Indexing (LSI), to map documents and queries as points to a semantic space in a CAN, thereby creating a semantic overlay network (Tang et al., 2003).

Figure 10 pSearch semantic space in a 2d CAN

Documents that are semantically alike are co-located in the semantic space, making it unnecessary to search nodes containing irrelevant documents. When a query is submitted, it is routed to the nodes that are semantically closest to the query within a radius r. These nodes do a local search using pLSI and return the best matching documents to the user.

In pSearch the dimensions of the CAN are set to the dimensions of the semantic space. This dimensionality can be as high as 300. When the number of dimensions is smaller than $\log_2(n)$, n being the number of nodes, more zones will be created than there are nodes available. To support semantic vectors with larger dimensions than the CAN, pSearch makes use of rotated semantic vectors, which are stored at different locations in the CAN. These vectors are sub-vectors of the original semantic vector but have a lower dimensionality. For p sub-vectors, p searches are started when a query is submitted. The results of these p searches are combined. This technique does not ensure that the full vectors are also similar, but the low-dimensional elements of a semantic vector are the most significant. When the low-dimensional elements of a document and a query are similar, the document can be considered relevant.

Tang et al. also observed that the 300-dimensional semantic space was not sufficient for the TREC 7&8 corpus and queries. Increasing the dimensions of the semantic space is not an option, as pSearch can only use the lower dimensions. However, it still needs a finer structure in order to rank documents properly.

pSearch achieves an accuracy of 90% in a 32k-node system and 86% in a 128k-node system. These accuracies are with respect to a centralized LSI implementation. A query and each index are 1.5KB in size when semantic vectors have a dimensionality of 300.
Using the most pessimistic testing results, pSearch's communication cost for processing a query in a 128k-node system is 632KB. The data transmitted when publishing one index is 49.5KB and the storage cost is 6KB. When nodes replicate data from their direct neighbors in the CAN, the communication cost for publishing one index is 89.5KB and the storage cost is 159KB.

A big advantage of pSearch is that its bandwidth consumption for processing queries is independent of corpus size, query length, and document length. The number of routing hops and the number of visited nodes are the parameters that influence pSearch's communication cost, and they are constant or increase slowly as the system scales.

More recent research by Tang et al. discusses the limitations of pSearch due to its reliance on LSI (Tang, Dwarkadas, & Xu, 2004). They conclude that when the corpus is heterogeneous and large, the retrieval quality of LSI is inferior to methods such as Okapi (Robertson et al., 1992). Another limitation of LSI is that its memory consumption and computational load are high, making it unscalable.

To improve the efficiency of LSI they propose a method called eLSI. Documents are clustered and only the centroids of these clusters are used as representative documents. The dimensionality of the centroid vectors is reduced by removing low-weight terms, resulting in a much smaller input matrix for the Singular Value Decomposition (SVD) that produces the basis of the semantic space.

They combine their LSI approach with Okapi (LSI+Okapi) to boost the precision of pSearch. LSI is still used to guide the query to the search region, but inside the search region Okapi is used to select documents and to guide the exploration of the search region.

Their optimizations make pSearch's high-end precision, the precision when only retrieving the top n documents, approach that of state-of-the-art centralized IR systems. However, its low-end precision, the precision when retrieving many documents, is still inferior.
This does not have to be a problem, as most users only look at the first few search results.

Semantic Small World

Li et al. propose a semantic overlay network based on small world networks, named Semantic Small World (SSW) (M. Li, Lee, & Sivasubramaniam, 2004; M. Li, Lee, Sivasubramaniam, & Lee, 2004). The P2P system is built from the ground up and does not make use of existing DHTs like CAN or Chord. SSW dynamically clusters peers based on the semantics of their content. Semantics are represented by k-dimensional Semantic Vectors (SVs) that are not limited to a specific data format. Unlike pSearch, SSW supports all dimensions of the semantic space in search. A simple but effective dimension reduction strategy, called adaptive space linearization (ASL), is used to create a one-dimensional SSW that operates in a high-dimensional semantic space. Peers cluster their local data objects and pick the centroid of the largest cluster as their position in the semantic space.

A network is considered to be a small world network when it has a small average path length and a large cluster coefficient. Research has shown that searches can be conducted efficiently, in O(log^2 N), when each node in the network knows its neighbors and a few randomly chosen long distance nodes.

A node in SSW also indexes content from other peers. Which content to index at which node is determined by the mapping from the semantic subspace to peers. If some content of a peer A falls into the subspace of another peer B, peer B will index such content from peer A. These indexes are called foreign indexes.
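
The small world routing idea described above (neighbors plus a few long distance contacts) can be sketched as greedy routing on a one-dimensional ring. This is an illustration under stated assumptions, not the SSW protocol: nodes are simply numbered 0..n-1, and long range contacts are drawn with a probability inversely proportional to ring distance, a choice borrowed from Kleinberg-style small world models rather than taken from this paper. All function names are hypothetical.

```python
import random


def ring_distance(a, b, n):
    """Shortest distance between positions a and b on a ring of n nodes."""
    d = abs(a - b) % n
    return min(d, n - d)


def build_contacts(n, long_links, rng):
    """Give every node its two ring neighbors plus `long_links` long range contacts.

    Long range contacts are sampled with probability proportional to
    1 / distance (an assumption of this sketch).
    """
    contacts = {}
    for node in range(n):
        others = [v for v in range(n) if v != node]
        weights = [1.0 / ring_distance(node, v, n) for v in others]
        far = rng.choices(others, weights=weights, k=long_links)
        contacts[node] = {(node - 1) % n, (node + 1) % n, *far}
    return contacts


def greedy_route(source, target, contacts, n):
    """Forward to the known contact closest to the target until it is reached."""
    path = [source]
    while path[-1] != target:
        current = path[-1]
        # A ring neighbor is always one step closer, so this always makes progress.
        path.append(min(contacts[current], key=lambda v: ring_distance(v, target, n)))
    return path


if __name__ == "__main__":
    n = 1024
    for links in (0, 2, 8):
        table = build_contacts(n, links, random.Random(7))
        hops = len(greedy_route(0, n // 2, table, n)) - 1
        print(f"{links} long range contacts per node: {hops} hops")
```

Because a ring neighbor is always available, the route never needs more hops than the plain ring distance; the long range contacts can only shorten it.
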

Figure 11: SSW example (C4 has a long range contact with C12)

In Figure 11 an example is shown of how SSW handles dimension reduction and searches. It shows how an SSW of any dimension can be mapped to a 1-dimensional space. The top figure shows how the semantic space is partitioned when new nodes are added to the network. Unlike pSearch, SSW does not necessarily split clusters in half. The location of the split depends on the distributions of the number of nodes and data. Clusters are only split when the number of nodes in a cluster exceeds a predefined maximum. The nodes of a cluster keep track of the partitioning history. This history is used to create cluster names and to estimate the PCN (Pseudo Cluster Name) of a query. This PCN is used to route the query to the correct cluster.

The example in Figure 11 also shows how the SV of a query is routed to its destination cluster. In cluster 4 the PCN of the 2-dimensional SV=[0.9, 0.6] is estimated. The estimated PCN is cluster 8. Because cluster 8 is not among the long range contacts of cluster 4, the SV is sent to the nearest cluster among its contacts, in this case cluster 12 (normally C8 would be among C4's short range contacts, but this is omitted in this example). In C12 the PCN is estimated again, this time resulting in 14. As the SV lies in the cluster range of C14, the SV is flooded to all nodes in the cluster. An SSW that uses SVs of k dimensions, containing N nodes, with cluster size M and l long range contacts, will on average perform a search across clusters in O(log^2(2N/M) / l) messages. As we can see, the number of dimensions has no influence on the routing performance.

When the number of long range contacts is large enough, SSW achieves much smaller path lengths than pSearch (about 50%). The path length also increases more slowly. SSW's precision is more than 10% higher than that of pSearch. Their precision is measured with respect to a centralized LSI system.

7. DISCUSSION

It is difficult to compare the P2P systems based on their literature alone, as each system was evaluated against a different baseline. What we can infer from the discussed systems is that a scalable P2P system that operates at the size of the web should have the following features:
- Documents need a compact representation
- High-end precision should be good, low-end precision not necessarily
- Incremental results are needed, with the best matching documents ranked highest
- Ranking of results should preferably be done at the indexing peer(s)

The first generation P2P systems and systems that use DHTs naively are not suitable for large scale web search, as they need to either flood queries or intersect huge posting lists. They scale linearly with the size of the network or the size of the data, respectively. Of the hybrid architectures, eSearch looks most promising. Queries are resolved logarithmically and precision is as good as the baseline system (Okapi), at a modest storage cost of 6.8 times that of systems based on keyword partitioning.

Semantic P2P systems can also resolve searches efficiently. Semantic systems also allow different media to be searched, as it does not matter what kind of data is being used, making it possible to use the same architecture for e.g. both keyword and image search. As a side-effect, they are less effective for keyword searches unless the semantic representation is accompanied by additional textual information. Semantic P2P systems look promising but have several limitations as well:
- A transformation matrix needs to be created that is a good representation of the document collection. This needs the full document collection, or at least a good representation of it
- Important documents can be missed during search when the documents contain multiple topics that are not near each other in the semantic space (Dumais, 1994)

An important issue to overcome, in order to make it attractive for companies to operate as a P2P node, is the storage of keywords and document identifiers they do not want to be associated with. People might not feel comfortable with the idea that their server is being used to answer queries related to pornography or worse. A P2P network might have to limit distribution of the index for certain content to nodes that host similar content. Semantic Small World can be a solution to this problem, as servers will be responsible for data that is semantically similar to the data they originally host.
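
To make the scaling argument of this section more tangible, the sketch below compares the growth terms only: linear cost for flooding, logarithmic cost for DHT-style lookups, and the SSW bound read as O(log^2(2N/M) / l). Constant factors are dropped, base-2 logarithms are assumed, and the values of M and l are hypothetical, so the numbers indicate trends and are not measurements of any of the discussed systems.

```python
import math


def flooding_cost(n):
    """Growth term for flooding: in the worst case every node is contacted."""
    return n


def logarithmic_cost(n):
    """Growth term for logarithmic lookups (base 2 assumed)."""
    return math.log2(n)


def ssw_cost(n, m, l):
    """Growth term log^2(2N/M) / l for searching across SSW clusters."""
    return math.log2(2 * n / m) ** 2 / l


if __name__ == "__main__":
    M, L = 64, 4  # hypothetical cluster size and number of long range contacts
    for n in (10**4, 10**6, 10**8):
        print(f"N={n:>11,}  flooding~{flooding_cost(n):>13,}  "
              f"logarithmic~{logarithmic_cost(n):6.1f}  SSW~{ssw_cost(n, M, L):6.1f}")
```

Doubling the number of nodes doubles the flooding term but adds only one to the logarithmic term, which is the core of the argument against first generation systems at web scale.
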

8. CONCLUSION

In this paper several leading P2P architectures were discussed. The key to a scalable P2P keyword search engine seems to be a compact representation of the original documents. Both eSearch and the semantic approaches make use of a compact representation of the data. Extra storage space is needed in these systems, but as disk space is cheaper than bandwidth, this looks like a reasonable tradeoff. Future research should compare these systems to a state-of-the-art centralized search system in order to compare their precision and scalability with respect to each other.

Besides scalability, there will be many other issues to overcome before large scale P2P web search becomes feasible, such as:
- Security: fraudulent peers
- Ethics: storage and linkage to unwanted parties (e.g. pornography)
- Load balancing, hot spots, popular queries

Before these issues can be resolved, a scalable P2P solution that is as good as a centralized system should be proposed first. The discussed systems were tested using small document collections. To test whether they are really scalable, testing needs to be done with a data set that is a good representation of the internet.

REFERENCES

Aberer, K., Klemm, F., Luu, T., Podnar, I., & Rajman, M. Building a peer-to-peer full-text Web search engine with highly discriminative keys: Technical report, EPFL (LSIR).
Bergman, M. K. (2000). The Deep Web: Surfacing Hidden Value.
Callan, J. P. (2000). Searching for Needles in a World of Haystacks. IEEE Data Engineering Bulletin, 23(3).
Chawathe, Y., Ratnasamy, S., Breslau, L., Lanham, N., & Shenker, S. (2003). Making gnutella-like P2P systems scalable. Proceedings of the 2003 conference on Applications, technologies, architectures, and protocols for computer communications.
Clip, D. S. S. (2000). Gnutella Protocol Specification v0.4.
Dumais, S. T. (1994). Latent semantic indexing (LSI): TREC-3 report. Overview of the Third Text REtrieval Conference, 219-230.
Fagin, R. (1999). Combining fuzzy information from multiple systems. Journal of Computer and System Sciences, 58(1).
Google. (2005). Sizing Up Search Engines. Retrieved 2007.
The Internet Archive. (2006). Retrieved 2006.
iProspect Search Engine User Behavior Study. (2006). iprospect.com.
Jo-Wen, T., Zhang, W. J., Long, A., & Shanmugasundaram, K. (2003). ODISSEA: A Peer-to-Peer Architecture for Scalable Web Search and Information Retrieval. International Workshop on the Web and Databases.
Klampanos, I. A., & Jose, J. M. (2004). An architecture for information retrieval over semi-collaborating Peer-to-Peer networks. Proceedings of the 2004 ACM symposium on Applied computing.
Li, J., Loo, B. T., Hellerstein, J., Kaashoek, F., Karger, D. R., & Morris, R. (2003). On the Feasibility of Peer-to-Peer Web Indexing and Search. IPTPS '03.
Li, M., Lee, W. C., & Sivasubramaniam, A. (2004). Semantic Small World: An Overlay Network for Peer-to-Peer Search. Proceedings of the Network Protocols, 12th IEEE International Conference on (ICNP'04)-Volume 00.
Li, M., Lee, W. C., Sivasubramaniam, A., & Lee, D. L. (2004). A small world overlay network for semantic based search in P2P systems. Proceedings of Workshop on Semantics in Peer-to-Peer and Grid Computing (SemPGrid), in conjunction with the World Wide Web Conference (WWW).
Lu, J., & Callan, J. (2003). Content-based retrieval in hybrid peer-to-peer networks. Proceedings of the twelfth international conference on Information and knowledge management.
Lua, K., Crowcroft, J., Pias, M., Sharma, R., & Lim, S. (2005). A survey and comparison of peer-to-peer overlay network schemes. Communications Surveys & Tutorials, IEEE.
Lv, Q., Cao, P., Cohen, E., Li, K., & Shenker, S. (2002). Search and replication in unstructured peer-to-peer networks. Proceedings of the 16th international conference on Supercomputing.
Mayer, T. (2005). Our Blog is Growing Up And So Has Our Index. Retrieved 2007.
Podnar, I., Luu, T., Rajman, M., Klemm, F., & Aberer, K. (2006). A Peer-to-Peer Architecture for Information Retrieval Across Digital Library Collections. ECDL '06, Alicante, Spain (to appear).
Ratnasamy, S., Francis, P., Handley, M., Karp, R., & Shenker, S. (2001). A scalable content-addressable network: ACM Press New York, NY, USA.
Reynolds, P., & Vahdat, A. (2003). Efficient peer-to-peer keyword searching. Proceedings of International Middleware Conference.
Robertson, S. E., Walker, S., Hancock-Beaulieu, M., Gull, A., & Lau, M. (1992). Okapi at TREC. Text REtrieval Conference.
Rowstron, A., & Druschel, P. (2001). Pastry: Scalable, distributed object location and routing for large-scale peer-to-peer systems. IFIP/ACM International Conference on Distributed Systems Platforms (Middleware), 11.
Shi, S., Yang, G., Wang, D., Yu, J., Qu, S., & Chen, M. (2004). Making peer-to-peer keyword searching feasible using multi-level partitioning. Proceedings of the 3rd IPTPS.
Stoica, I., Morris, R., Karger, D., Kaashoek, M. F., & Balakrishnan, H. (2001). Chord: A scalable peer-to-peer lookup service for internet applications. Proceedings of the 2001 SIGCOMM conference, 31(4).
Tang, C., & Dwarkadas, S. (2004). Hybrid Global-Local Indexing for Efficient Peer-to-Peer Information Retrieval. Proceedings of the 1st NSDI.
Tang, C., Dwarkadas, S., & Xu, Z. (2004). On scaling latent semantic indexing for large peer-to-peer systems. Proceedings of the 27th annual international conference on Research and development in information retrieval.
Tang, C., Xu, Z., & Dwarkadas, S. (2003). Peer-to-peer information retrieval using self-organizing semantic overlay networks. Proceedings of the 2003 conference on Applications, technologies, architectures, and protocols for computer communications.
Wikipedia home page.
Zhao, B. Y., Kubiatowicz, J., & Joseph, A. D. (2001). Tapestry: An Infrastructure for Fault-tolerant Wide-area Location and Routing. Computer, 74.
