Using Highly Discriminative Keys for Indexing in a Peer-to-Peer Full-Text Retrieval System


Toan Luu, Fabius Klemm, Martin Rajman, Karl Aberer
Ecole Polytechnique Fédérale de Lausanne (EPFL)
School of Computer and Communication Sciences
CH-1015 Lausanne, Switzerland
July 6, 2005

The work presented in this paper is carried out in the framework of the EPFL Center for Global Computing and was supported by the Swiss National Funding Agency OFES as part of the European project ALVIS - Superpeer Semantic Search Engine No 268.

Abstract

Excessive network bandwidth consumption, caused by the transmission of long posting lists, was identified as one of the major bottlenecks for implementing distributed full-text retrieval in a Peer-to-Peer (P2P) architecture. To address this problem we introduce a novel approach to indexing using highly discriminative terms and term sets, which leads to short posting lists and therefore reduces the network traffic by almost one order of magnitude. In addition, we show that retrieval based on discriminative term sets provides a retrieval quality comparable to standard full-text retrieval using TF-IDF ranking. Our indexing scheme is an important improvement towards realistic P2P retrieval systems that opens the opportunity to virtually unlimited scalability, well beyond the capacity of today's best centralized Web search engines.

Keywords: distributed information retrieval, peer-to-peer (P2P), highly discriminative keys, distributional semantics

1 Introduction and motivation

More and more information is available in the World Wide Web. To deal with such huge amounts of data, we need scalable full-text retrieval mechanisms. Traditional centralized solutions are not suitable to handle such amounts of data. Even with huge investments in the necessary infrastructure (Google, MSN Search) [Sceats, 2003], such systems can index only a small part of the documents available in the WWW.

Peer-to-Peer (P2P) systems seem to be very interesting candidates for performing large-scale distributed full-text retrieval. For the Internet, for example, such systems can be implemented using web servers: instead of only delivering documents, the available web servers can also be used to provide a distributed

search facility for the documents. This example is a specific instance of a general principle in P2P systems that relies on resource sharing to attain scalability to a global size, each participant in the network acting as a client and server at the same time. Furthermore, P2P systems also provide sufficient fault tolerance by replicating resources in the network. Finally, P2P architectures allow a high degree of parallel processing and thus scale well in terms of computational complexity. Notice, however, that P2P systems might encounter problems with heavy sequential processing due to the relatively low communication speed compared to densely clustered systems.

In this work we propose a new indexing strategy, which is specifically designed to perform efficient full-text retrieval in a structured P2P system. To achieve this goal, we use an inverted index distributed in the P2P system. For clarity, let us first recall some standard information retrieval notions: In an inverted index, terms are associated with posting lists. A posting list consists of references to the documents that contain the term. The terms used in the inverted index form the vocabulary of the document collection.

Within this framework, one approach [Reynolds and Vahdat, 2003, Suel et al., 2003] for P2P text retrieval is to distribute the inverted index among the peers in the network such that each peer is responsible for storing a certain number of terms and their associated posting lists (see Figure 1). Given a term t, such a P2P system must guarantee that at least one of the peers responsible for t is available at any time to retrieve the associated posting list (footnote 1). Furthermore, as such a peer can be found very efficiently, i.e. typically in O(log N) routing hops (where N is the number of peers), a simple mechanism to process a query is to retrieve and intersect the posting lists associated with the query terms (see Figure 1) (footnote 2).

Footnote 1: To achieve fault tolerance, terms and posting lists are replicated at several peers.
Footnote 2: To optimize the query process, the intersection of posting lists can be done at the peer that holds the longest posting list of the query terms.

[Figure 1: the inverted index (term 1 ... term Nt, each with its posting list) is partitioned across peers 1 ... Np; peer x, holding the query q1 = {term 1, term 5}, retrieves posting lists PL 1 and PL 5 from the responsible peers.]
Fig. 1. Distributing an inverted index: Peer x retrieves the posting lists for term 1 and term 5 from the responsible peers in the network.

One major problem with this approach is that some terms have very long posting lists. As answering queries requires posting lists to be shipped over the

network, this approach becomes very inefficient (if not impossible) for long posting lists. This problem with long posting lists has been recently identified as a major obstacle for implementing text retrieval in P2P [Li et al., 2003].

In this paper we present a solution that aims at reducing the size of the posting lists to limit the network load when processing queries. To obtain short posting lists we propose to index not only single terms but also term sets, i.e. sets of terms that occur simultaneously in documents. The interesting term sets are the highly discriminative ones, i.e. the ones that are associated with short posting lists. For example, instead of using the two long posting lists term 1 → {doc 1, doc 2, doc 4, doc 6, doc 8, doc 9} and term 2 → {doc 2, doc 3, doc 6, doc 7, doc 10}, we might only index the term set {term 1, term 2}, which is associated with a far shorter posting list: {term 1, term 2} → {doc 2, doc 6}. In other words, we propose to identify highly discriminative term combinations at indexing time instead of performing long posting list intersections at query processing time (when they are quite expensive in terms of network load).

With our indexing scheme we can achieve higher parallelization by using a retrieval method that increases the number of keys used for indexing while simultaneously decreasing the length of the posting lists. With such an approach, it becomes easier to distribute the indexing structure at a finer granularity in P2P networks, which in turn reduces the communication costs.

One possible concern is that using term combinations might lead to a vocabulary of unmanageable size. To limit this effect we introduce a novel concept of i-rare keys (Section 3.2), a scheme for careful key selection. We show that, after removing terms (or term sets) that are not suitable for indexing, the vocabulary grows linearly with the document collection size and is therefore still manageable in a P2P environment (Section 4.1).

Another potential problem of our approach is to maintain high retrieval quality. Certain terms and term combinations that are not discriminative, i.e. that have long posting lists, are no longer indexed but might still be useful for retrieval (e.g. because they appear in user queries). To solve this problem we use distributional semantics to support concept-based retrieval covering queries that are not based on the indexed keys (Section 3.4). As a result we can show that using highly discriminative terms and term sets has no negative effects on the retrieval quality. We compare our scalable indexing strategy to a central full-text retrieval engine and obtain similar precision and recall values for top-k results when using TF-IDF ranking (Section 4).

The rest of our paper is structured as follows: Section 2 provides related work in P2P full-text retrieval. We describe the details of our indexing strategy in Section 3. Section 4 presents our experimental results. Section 5 finishes with discussion and conclusions.

2 Related work in P2P full-text retrieval

Why P2P? We believe that our indexing strategy is particularly suitable for P2P. P2P systems provide better scalability but also require considerably less

network bandwidth availability than densely clustered distributed IR systems. Sending long posting lists over the network for query answering will not work in P2P. However, our scheme is also applicable in distributed IR, especially as the length of the posting lists is a tunable parameter.

In this section we provide background and related work on P2P systems and how they can be used to perform distributed full-text retrieval. P2P systems can be categorized into two main classes: unstructured and structured systems.

A popular representative of unstructured P2P is Gnutella, which uses flooding to answer queries. Advanced approaches try to restrict the number of messages generated, such as [Lv et al., 2001], which uses random walks. [Lu and Callan, 2003] studies the use of content-based resource selection and document retrieval algorithms in hierarchical unstructured P2P networks. The idea is to cluster leaf nodes around directory nodes. On receiving a query, a directory node selects the most appropriate leaf and neighboring directory nodes for routing.

The second class is structured P2P, also called structured overlay networks or distributed hash tables (DHT). In such systems, the peers organize to jointly build a distributed index, which allows efficient searches for specific identifiers (hashes), usually by contacting only O(log N) peers in a network of size N. Examples of DHTs are CAN, CHORD, P-Grid, and Pastry, which were presented in [Ratnasamy et al., 2001, Stoica et al., 2001, Aberer, 2001, Rowstron and Druschel, 2001].

In [Reynolds and Vahdat, 2003], the authors used a DHT to map keywords to peers responsible for indexing. They try to resolve the problem of long posting lists by using Bloom filters, caches, and incremental results (top-k). However, Bloom filters and caches still leave the query processing time roughly proportional to the collection size. The incremental results method requires user query feedback and a good ranking function.

Another approach is presented in [Tang et al., 2003] and is based on CAN [Ratnasamy et al., 2001]. Documents and queries are represented as latent semantic indexing vectors in a Cartesian space. This space is mapped into a structured P2P network, keeping semantically related indexes co-located. Documents and queries can be routed to the responsible peer in the Cartesian space. Difficulties occur when high-dimensional vectors have to be mapped into a low-dimensional overlay network.

So far it remains unclear which of the two classes of P2P systems (structured and unstructured) is better suited for implementing highly distributed full-text retrieval. Our approach is based on structured overlay networks. We will therefore explain in more detail the services offered by these systems.

The basic functions a structured P2P layer offers are join(), leave(), and route(id, message). A peer uses join() to connect to the P2P system, where, among other things, its routing tables are initialized, and leave() to gracefully disconnect. The main functionality is route(id, message): each peer in the network is responsible for a defined subset of identifiers from an agreed-upon identifier space. With route(id, message) any peer in the network can

send a message to the peer responsible for any given point in the id space by contacting only O(log N) other peers.

Our P2P text-retrieval application uses route() to send (key, posting list) pairs to the responsible peers, as well as to retrieve the posting lists associated with keys for query answering. The id is calculated from the key using a hash function, which is also provided by the P2P layer. The P2P layer independently takes care of constructing and maintaining the routing tables in the presence of peers going on- and offline. Bamboo [Bamboo, 2004] is an example of a working implementation of a DHT that can reliably deliver messages in systems with moderate churn rates.

As the description of the various existing P2P systems is not the focus of this paper, we refer the interested reader to the relevant literature [Ratnasamy et al., 2001, Stoica et al., 2001, Rowstron and Druschel, 2001, Aberer, 2001, Lv et al., 2001].

3 Indexing strategy

In this section we explain our indexing strategy in detail. The fundamental idea of our approach is that we index documents using only highly discriminative terms or term sets to avoid long posting lists, which are extremely inefficient in a distributed setting.

3.1 Definitions

To make the presentation easier we introduce the following definitions:

Definition 1. $T$ is the set of all terms that appear in a document collection.

Definition 2. $K$ is the set of keys. A key $k$ is a set of (one or more) terms $t \in T$.

The vocabulary of the document collection is a subset of $K$. Notice that $T$ is also a subset of $K$, containing only keys that are made up of a single term. The quality of a key $k$ for a given document $d$ is determined by its discriminative power.

Discriminative power of a key. To be discriminative, a key $k$ must be as specific as possible with respect to the document $d$ it is associated with. We categorize a key on the basis of its Document Frequency ($DF$), i.e. the number of documents it appears in. We define a threshold $DF_{low}$ (footnote 3) to split the set of keys $K$ into two classes:

Footnote 3: We will see in Section 4 how $DF_{low}$ can be set.

$K_{non\text{-}rare} = \{k \in K \mid DF(k) > DF_{low}\}$: The set of non-rare keys, which appear in many documents, i.e. have bad discriminative power [Salton and Yang, 1973].

$K_{rare} = \{k \in K \mid DF(k) \le DF_{low}\}$: The set of rare keys, which appear in very few documents and are therefore highly discriminative.

As far as the discriminative power is concerned, the interesting keys are in $K_{rare}$. We therefore use only those keys to index documents. Standard Information Retrieval techniques use single terms to index documents; in our approach, however, we use discriminative sets of terms (see Figure 2).

[Figure 2: single-term indexing (term 1 ... term Nt, small vocabulary, long posting lists, very inefficient for distributed indexing) contrasted with highly discriminative key indexing (key 1 ... key Nk, large vocabulary, short posting lists, same retrieval quality, efficient for distributed indexing).]
Fig. 2. Overview of indexing using highly discriminative keys.

3.2 Key filtering

Creating rare keys using all possible term combinations leads to an explosion of the vocabulary, as it grows as $O(|D|^c)$, where $|D|$ is the size of the document collection and $c > 1$ depends on the length of the key. We therefore introduce filters to select good keys and to keep the vocabulary at a manageable size (footnote 4). We describe two filter processes in the following subsections.

Footnote 4: Our experiments in Section 4 show that, using filters, the vocabulary grows linearly with the size of the document collection without deteriorating the retrieval quality.

i-rare filter. An important quality of a key is its semantic adequacy. A key $k$ is semantically adequate for a document $d$ if there is a high probability that a user produces $k$ when searching for $d$.

Hypothesis 1. As a user is more likely to generate generic terms, a semantically adequate key contains a very specific combination of quite generic terms.

In other words, a semantically adequate key is by hypothesis a key for which no proper subset is discriminative (i.e. none is rare). We call such keys intrinsically rare:

Definition 3. $K_{i\text{-}rare} \subseteq K_{rare}$: The set of intrinsically rare keys. Each key $k \in K_{i\text{-}rare}$ becomes non-rare by removing any of its terms.

Our hypothesis is that all keys in $K_{i\text{-}rare} \subseteq K_{rare}$ are highly discriminative and semantically adequate and therefore should be used for indexing the document collection. Notice that supersets of i-rare keys are redundant, as they are already covered in the index by the i-rare keys. They would only increase the vocabulary without improving retrieval performance.

Window filter. We use a window filter as a second mechanism to reduce the size of the vocabulary: we introduce the restriction that keys can be constructed only by choosing terms that appear in the same context. A text context can be a phrase, a paragraph, or a window of a certain size in the document. [Salton et al., 1993] discovered that comparing a query to a text context leads to better results than comparing it to the complete document. Keys built from a common context have a higher probability of being used in queries, as they carry more semantic meaning than keys that are constructed by randomly combining terms. Therefore, only keys that are constructed from terms that appear within a window of $w$ consecutive terms remain in the vocabulary $K_{irw} \subseteq K_{i\text{-}rare} \subseteq K_{rare}$. Figure 3 summarizes the filtering process.

[Figure 3: all possible rare keys $K_{rare}$ → i-rare filter → $K_{i\text{-}rare}$ → window filter → $K_{irw}$.]
Fig. 3. Overview of the key filtering process.

Definition 4. $T_{irw}$ is the set of all terms occurring in elements of $K_{irw}$.
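The rare and i-rare tests can be illustrated with a minimal Python sketch. It assumes that the document frequency of every key and of every sub-key is already known; the table `DF`, the threshold `DF_LOW`, and the function names are hypothetical and chosen only for this illustration.

```python
# Hypothetical document frequencies for a toy vocabulary (not from the paper).
DF = {
    frozenset({"a"}): 5,
    frozenset({"b"}): 4,
    frozenset({"c"}): 1,
    frozenset({"a", "b"}): 2,
    frozenset({"a", "c"}): 1,
    frozenset({"a", "b", "c"}): 1,
}
DF_LOW = 2  # assumed threshold


def is_rare(key, df=DF, df_low=DF_LOW):
    """A key is rare if it appears in at most df_low documents."""
    return df[frozenset(key)] <= df_low


def is_i_rare(key, df=DF, df_low=DF_LOW):
    """A key is i-rare if it is rare and removing any one term makes it non-rare."""
    key = frozenset(key)
    if not is_rare(key, df, df_low):
        return False
    if len(key) == 1:
        # Single-term keys have no non-empty sub-keys to check.
        return True
    return all(not is_rare(key - {t}, df, df_low) for t in key)
```

In this toy table, {a, b} is i-rare because both a and b are non-rare on their own, whereas {a, c} and {a, b, c} are rare but redundant: each contains a sub-key ({c} and {a, b} respectively) that is already rare.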

3.3 Key production algorithm

Under our assumption that the keys in $K_{irw}$ are suitable for indexing in a P2P environment, we will now explain an algorithm to generate such keys. Our key production algorithm works as follows (a more formal description is given in Algorithm 1): The algorithm starts with the generation of $K^{1}_{rare}$, the vocabulary of rare single-term keys. It is created by removing all keys with a document frequency greater than $DF_{low}$ from the single-term vocabulary (footnote 5). Then, to create keys of size $s$, the documents are processed one by one. For each document, we create keys by concatenating all possible term combinations of size $s$ within a window of size $w$. These keys are stored in $Temp$. The second step is to remove those keys that are not i-rare: for each key we have to check whether all its subsets are non-rare, which is done using $K^{s-1}_{irw}$. If the key is i-rare, it is added to the vocabulary $K^{s}_{irw}$. This process is repeated until $s$ reaches a maximum key size. The final vocabulary $K_{irw}$ is the union of all $s$-term vocabularies $K^{s}_{irw}$.

Complexity analysis. This subsection provides a complexity analysis of the key production algorithm. With a collection of size $x$ terms, we get:

The complexity to construct all keys of size $s$ within a window of $w$ terms is $x \cdot \binom{w-1}{s-1} \in O(x \cdot w^{s-1})$.

The function $K^{s-1}_{irw}.isIRare(key)$ in Algorithm 1 is used to check whether a key of $s$ terms is i-rare. The function checks all $(s-1)$-term keys that are subsets of $key$. The complexity of this function is $s \cdot C$, where $C$ is the constant cost to access the vocabulary storage structure (footnote 6).

The overall complexity is therefore $\sum_{s=1}^{maxkeysize} (s \cdot w^{s-1} \cdot x \cdot C)$.

We assume that $w$ is constant and set $maxkeysize$ to three (footnote 7). Therefore, the computation time grows linearly with the document collection. However, the constant factor is high. To deal with this problem, we propose to parallelize the computation by distributing the construction of the vocabulary: each peer in the network computes the vocabulary for a small part of the document collection, which is locally stored. To decide whether a key is rare, it is necessary to aggregate its global document frequency. We believe that such an aggregation is possible with aggregation mechanisms proposed in recent P2P papers [Albrecht et al., 2004, El-Ansary et al., 2003, Yalagandula and Dahlin, 2004]. A detailed analysis including updates of document frequencies is part of future work.

Footnote 5: As $K^{1}_{rare}$ contains only single-term keys, the i-rare and window filters do not apply.
Footnote 6: Remember, a key is i-rare if removing any term makes the key non-rare. There are $s$ possible terms that can be removed.
Footnote 7: This maximum key length seems to be suitable, as longer keys are less likely to appear in user queries [Silverstein et al., 1999]. Furthermore, in the document collections we used for our experiments there are only few i-rare keys with length > 3, whereas the additional computation effort is high.
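As a worked instance of the complexity bound, the sum can be expanded for the maximum key size of three. The value w = 15 is one of the window sizes used in the experiments; C remains the unspecified constant cost of one vocabulary access.

```latex
\sum_{s=1}^{3} s \, w^{s-1} \, x \, C \;=\; x \, C \, \bigl(1 + 2w + 3w^{2}\bigr),
\qquad
w = 15 \;\Rightarrow\; 1 + 30 + 675 = 706 .
```

Under these assumptions the cost stays linear in x, but each term of the collection accounts for roughly 706 vocabulary accesses, which illustrates why the constant factor is described as high.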

Algorithm 1. Key Production Algorithm

    /* the vocabulary of rare single-term keys */
    generate K^1_irw = K^1_rare

    for s = 2 to maxkeysize do
        K^s_irw = ∅    /* vocabulary of s-term keys */

        /* process document by document to create a set of s-term keys */
        for all document ∈ collection do
            Temp = ∅    /* set of keys for one document */

            /* generate keys by concatenating terms that appear in a window of size w */
            for all tuples of s terms t_i1, ..., t_is in a window of size w do
                key = concat(t_i1, ..., t_is)
                Temp = Temp ∪ {key}
            end for

            /* insert key set for one document into K^s_irw */
            for all key ∈ Temp do
                /* check whether key is i-rare */
                if K^(s-1)_irw.isIRare(key) then
                    /* update document frequency in K^s_irw */
                    K^s_irw.insertKey(key, 1)
                end if
            end for
        end for
    end for
    K_irw = ∪_s K^s_irw
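The following Python sketch is one possible reading of Algorithm 1, not the authors' Java implementation. It makes several assumptions that are not stated in the paper: documents are given as token lists, the document frequency of a multi-term key is counted as the number of documents in which its terms co-occur within a window, and candidates whose frequency exceeds the threshold are kept as the non-rare keys used for the subset check at the next size. The names `window_candidates` and `produce_keys` are hypothetical.

```python
from collections import Counter
from itertools import combinations


def window_candidates(tokens, s, w):
    """All s-term sets whose terms co-occur within w consecutive tokens."""
    found = set()
    for start in range(len(tokens)):
        window = set(tokens[start:start + w])
        for combo in combinations(sorted(window), s):
            found.add(frozenset(combo))
    return found


def produce_keys(docs, df_low, w, max_key_size=3):
    """Return a dict mapping each selected key (frozenset of terms) to its DF."""
    # Size 1: document frequency of single terms; only the DF filter applies.
    df = Counter()
    for tokens in docs:
        df.update(frozenset([t]) for t in set(tokens))
    vocab = {k: n for k, n in df.items() if n <= df_low}
    non_rare = {k for k, n in df.items() if n > df_low}

    for s in range(2, max_key_size + 1):
        cand_df = Counter()
        for tokens in docs:
            for key in window_candidates(tokens, s, w):
                # Keep a candidate only if every (s-1)-subset is non-rare.
                if all(key - {t} in non_rare for t in key):
                    cand_df[key] += 1
        # Rare candidates become keys; the rest feed the next iteration.
        vocab.update({k: n for k, n in cand_df.items() if n <= df_low})
        non_rare = {k for k, n in cand_df.items() if n > df_low}
    return vocab


DOCS = [
    ["p2p", "index", "search"],
    ["p2p", "index", "network"],
    ["p2p", "search", "network"],
    ["index", "search", "query"],
]
KEYS = produce_keys(DOCS, df_low=1, w=3)
```

For this toy collection the selected keys are {query}, {index, network}, {network, search}, and {index, p2p, search}. Pairs containing "query" are never generated because "query" is already rare on its own.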

3.4 Distributional Semantics

We will now discuss the problem of finding, given a query $q = \{t_1, t_2, \ldots, t_m\}$, $t_i \in T$, the most relevant discriminative keys $k \in K_{irw}$ that were used for indexing. We first search for all possible query term combinations of size up to 3. If some of these combinations are in the index, the corresponding documents will be returned. However, it can happen that none of the term combinations are in the index. In this case, the query does not contain any irw keys. In real-life Web search, users often enter only two or three terms for a search query [Silverstein et al., 1999], which consequently rarely contains irw keys. Therefore, to improve the recall performance of our approach, we project queries into the irw key space.

To perform such a projection, we use Distributional Semantics [Rajman and Bonnet, 1992]. It uses a co-occurrence matrix to make a probabilistic connection between the full vocabulary $T$ (all terms of the document collection) and $T_{irw}$ (the terms occurring in $K_{irw}$). With such a co-occurrence matrix, query terms can be approximated with terms in $T_{irw}$, which increases the chance of finding a key in $K_{irw}$.

Co-occurrence matrix. To find irw keys we use probabilistic associations between $T$ and $T_{irw}$. Two terms are semantically similar if their textual contexts are similar. The projection of a query $q$ from $T$ to $T_{irw}$ corresponds to the transformation from an $N_T$-dimensional to an $N_{T_{irw}}$-dimensional space, where $N_T$ is the size of $T$ and $N_{T_{irw}}$ the size of $T_{irw}$. We represent the set of terms by an $N_T \times N_{T_{irw}}$ matrix of co-occurrences $CO$. Each line of this matrix represents the co-occurrences of a term in $T$:

$$CO = \begin{pmatrix} co_1 \\ co_2 \\ \vdots \\ co_{N_T} \end{pmatrix} = \begin{pmatrix} co_{11} & co_{12} & \cdots & co_{1 N_{T_{irw}}} \\ co_{21} & co_{22} & \cdots & co_{2 N_{T_{irw}}} \\ \vdots & \vdots & & \vdots \\ co_{N_T 1} & co_{N_T 2} & \cdots & co_{N_T N_{T_{irw}}} \end{pmatrix}$$

The co-occurrence frequencies between two terms are computed on a reference corpus that is representative of the domain for which the semantic model is defined. The profile of co-occurrences of a term $t_i \in T$ is interpreted as an estimate of the probability distribution that measures the association between $t_i$ and the terms $t \in T_{irw}$. More precisely, $co_{ij}$ is an estimate for $p(t_j \in T_{irw} \mid t_i \in T)$, the probability that the meaning represented by the term $t_j$ is also triggered by the term $t_i$:

$$p(t_j \mid t_i) \approx co_{ij} = \frac{f(t_j, t_i)}{\sum_k f(t_k, t_i)},$$

where $f(t_j, t_i)$ is the co-occurrence frequency between the two terms $t_j$ and $t_i$ in the observed contexts. In our case, we consider contexts of the form of a window of size $w_{co}$ terms in a document.
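The row-normalised estimate of the co-occurrence profile can be sketched in Python as follows. Two details are assumptions of this sketch rather than statements from the paper: two token positions are counted as co-occurring when they fit into one window of `w_co` consecutive tokens (their distance is below `w_co`), and the sparse matrix is stored as a dictionary of rows. The function name `cooccurrence_profiles` and the toy data are hypothetical.

```python
from collections import Counter, defaultdict


def cooccurrence_profiles(docs, t_irw, w_co):
    """Estimate co[t_i][t_j] = f(t_j, t_i) / sum_k f(t_k, t_i) for t_j in t_irw."""
    freq = defaultdict(Counter)  # freq[t_i][t_j] holds f(t_j, t_i)
    for tokens in docs:
        for i, t_i in enumerate(tokens):
            lo = max(0, i - w_co + 1)
            hi = min(len(tokens), i + w_co)
            for j in range(lo, hi):
                if j != i and tokens[j] in t_irw:
                    freq[t_i][tokens[j]] += 1

    # Normalise each row so that it sums to 1 (only non-zero entries are kept).
    profiles = {}
    for t_i, counts in freq.items():
        total = sum(counts.values())
        profiles[t_i] = {t_j: n / total for t_j, n in counts.items()}
    return profiles


TOY_DOCS = [
    ["cheap", "flight", "ticket"],
    ["cheap", "hotel", "ticket"],
]
PROFILES = cooccurrence_profiles(TOY_DOCS, t_irw={"ticket", "flight"}, w_co=3)
```

In this toy corpus "cheap" co-occurs twice with "ticket" and once with "flight", so its profile assigns 2/3 to "ticket" and 1/3 to "flight". Terms that never co-occur with a term of T_irw have no row, which mirrors the sparsity of the matrix.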

Query expansion. Using the co-occurrence matrix, the retrieval process proceeds as follows: if a query $q = \{t_1, t_2, \ldots, t_m\}$, $t_i \in T$, does not contain any irw keys, we consider the co-occurrence profiles of $t_1, t_2, \ldots, t_m$. From these profiles, we derive the $n$ best co-occurrent terms $t \in T_{irw}$ to construct an approximated query $q_{DS}$. To select the $n$ best co-occurrent terms $t$ for a query $q = \{t_1, t_2, \ldots, t_m\}$, we use the following score:

$$score(t_j, q) = p(t_j \mid q) = \frac{p(q \mid t_j)\, p(t_j)}{p(q)}$$

Under the hypothesis that $t_1, t_2, \ldots, t_m$ are independent when they are triggered by $t$, and since the value $p(q)$ is the same for all expansion terms $t$, we can compare the candidates $t_j$ using the following score:

$$score'(t_j, q) = p(t_j) \prod_{i=1}^{m} \frac{p(t_j \mid t_i)}{p(t_j)}$$

For the final query, we also take the original query $q$ into account: the final query is $q_{DS} \cup q$.

4 Experimental Results

We implemented our algorithms in Java. To evaluate the indexing process we used the Reuters corpus, and to evaluate the retrieval quality of our approach we used some corpora from SMART (footnote 9).

Footnote 9: ftp://ftp.cs.cornell.edu/pub/smart/

The Reuters corpus consists of over 800,000 news articles, including all English Reuters news edited between 20/8/1996 and 19/8/1997. All articles are stored in an XML file containing rich information about the content. Annotations include topic, country, authors, date, etc. We extracted only the content and title of the news to create the document collection. The documents in our test collection contain between 5 and 3 words. The average number of terms in a document is 17 and the average number of unique terms is 12. To simulate the evolution of the P2P system (the peers join the network one by one, each with their own document collection), we start with a collection size of 1 documents and then, in each step, add 1 documents. The maximum collection size is 20,000 documents.

For the retrieval performance evaluations, we need collections with query sets and relevance judgments. We therefore used two collections taken from SMART: CRAN, with 1398 abstracts and 225 queries in the domain of aerodynamics, and MED, with 1033 abstracts and 30 queries in the domain of medicine. Results are ranked using traditional TF-IDF ranking [Frakes and Baeza-Yates, 1992].

To investigate the quality of the retrieval results, we use precision and recall at the top k retrieved documents (P@k and R@k). For example, when we get the top-20 ranked documents for a query, the user evaluates that 8 documents are relevant

to the query. Assuming there are 25 relevant documents in the whole collection, P@20 = 8/20 = 0.4 and R@20 = 8/25 = 0.32. We are interested in the high-end precision and recall (top-5 to top-20), as typical measures of the TREC Web Track have shown that users are often interested only in the highest ranked search results.

4.1 Size of $K_{irw}$

The first experiment investigates the number of keys generated by our proposed irw key generation algorithm. For pre-processing we removed 25 common English stop words and applied the Porter Stemmer. With a collection of $D$ documents, $DF_{low}$ is set to $\alpha \cdot D^{1/\beta}$. The parameters $\alpha$ and $\beta$ are used to tune the posting list size to guarantee good retrieval quality and to keep the key computation cost manageable. For example, for $\alpha = \beta = 3$, $DF_{low} = 81$ for a collection of 20,000 documents and 3000 for a collection of one billion documents. In the following experiments, the window sizes are set to 1 and 15. We refer to them as (w=1) and (w=15).

Figure 4 shows that the number of irw keys grows linearly with respect to the size of the collection, and the vocabulary can therefore be stored in a P2P system. The index size (in number of postings) and the processing time for key production also grow linearly with respect to the size of the collection for both (w=1) and (w=15) (Figure 5).

[Figure 4: vocabulary growth; keys in millions plotted against collection size in millions of terms, for w = 1 and w = 15.]
Fig. 4. Number of keys with respect to the collection size.

4.2 Association matrix

To apply distributional semantics we have to create a co-occurrence matrix. As this matrix is usually very sparse, we are only interested in its non-zero elements. The values of this matrix are the co-occurrence frequencies between two terms

in a window of size $w_{co}$. In our experiments, we set $w_{co}$ to 1 and 15 to estimate the size of the matrix (we think that $w_{co}$ should be set to the same value $w$ as used in the key production algorithm). To estimate the growth of the association matrix, we start with a corpus of 1 documents. In each step we add 1 documents to the corpus and count the non-zero elements of the matrix. The results in Figure 6 show that the number of non-zero elements grows slowly with respect to the size of the corpus. It is therefore possible to store the association matrix in a P2P system.

[Figure 5: (a) inverted index size, number of postings in millions, and (b) processing time in minutes, both plotted against collection size in millions of terms, for w = 1 and w = 15.]
Fig. 5. Index size and processing time with respect to the collection size.

[Figure 6: association matrix size; non-zero elements in millions plotted against collection size in millions of terms, for w_co = 1 and w_co = 15.]
Fig. 6. Growth of association matrix.

4.3 Comparing indexing using highly discriminative keys with single-term indexing

In this section we compare indexing using highly discriminative keys (HDK indexing) with traditional single-term (ST) indexing.

We used Terrier, a Java software for the development of search engines, to implement standard TF-IDF ranking. The comparison between HDK and ST indexing was made with the CRAN and MED collections. We set $DF_{low}$ to 3, 2, and 1, and the window size $w$ to 15. The co-occurrence frequency is also computed with window size 15. The results are shown in Figure 7. The first column (in black) shows the average precision score over all queries when applying ST indexing. The next three columns (HDK3, HDK2, HDK1) show the results of our approach for $DF_{low}$ = 3, 2, and 1 without using distributional semantics. The last column (HDK1-DS) is the result when using distributional semantics to expand the query by 3 terms.

These results show that our approach is comparable to traditional ST indexing when using TF-IDF ranking. HDK and ST have the same retrieval quality for larger values of $DF_{low}$. When the posting lists are short ($DF_{low}$ = 1), using distributional semantics slightly improves the retrieval quality. This improvement seems marginal; however, in our experiments the queries were quite long (about ten terms on average) and therefore likely to contain irw keys. We expect the effect of distributional semantics to be higher with shorter queries. Such experiments are part of future work.

[Figure 7: four bar charts comparing ST, HDK3, HDK2, HDK1, and HDK1-DS: (a) Precision-MED and (c) Precision-CRAN at P@5, P@10, P@15, P@20; (b) Recall-MED and (d) Recall-CRAN at R@5, R@10, R@15, R@20.]
Fig. 7. Precision and recall evaluation for the MED and CRAN collections.

Bandwidth consumption. We now compare the bandwidth consumption in a distributed environment for retrieving the posting lists for highly discriminative key (HDK) and single-term (ST) indexing.

During the retrieval process, HDK indexing consumes considerably less bandwidth than ST indexing, as it only uses short posting lists. To measure the savings, we propose the following simple scenario: assume a query $q$ contains a list of terms $(t_1, t_2, \ldots, t_n)$. The number of postings transmitted using ST indexing is approximately $\sum_i s(t_i)$, where $s(t_i)$ is the size of the posting list of term $t_i$. In HDK indexing, the number of postings will be $\sum_j s(k_j)$, where the $k_j$ are the irw keys generated from the query $q$ and $s(k_j)$ is the size of the posting list associated with the key $k_j$, with $s(k_j) \le DF_{low}$.

Figure 8 shows (in a and c) the average number of transmitted postings, as well as (in b and d) the average size of the longest posting list transmitted per query. HDK indexing uses a substantially smaller amount of network bandwidth than ST indexing (between 70% and 90% less). For the CRAN collection, for example, we can see that ST indexing transmits about 17 postings, whereas HDK3 transmits only about 3 postings. Furthermore, the average size of the longest posting list transmitted per query is drastically decreased with our approach (by between 85% and 95%). As the transmission of the posting lists for a query can be parallelized, the longest posting list will determine the overall query response time. The performance using the MED collection is slightly worse than with CRAN, which might be attributed to a smaller query set and a smaller number of documents.

[Figure 8: four bar charts of #Postings for ST, HDK3, HDK2, HDK1, and HDK1-DS: (a) MED and (c) CRAN, average number of transmitted postings per query; (b) MED and (d) CRAN, average size of the longest posting list.]
Fig. 8. Number of postings transmitted for query answering for ST indexing compared to HDK indexing with different posting list lengths.
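The counting scenario above can be sketched in Python. Both indexes are modelled here as plain dictionaries from a key to its posting list; this layout, the function names `postings_st` and `postings_hdk`, and the toy data are assumptions made for illustration only, and the numbers are not taken from the experiments.

```python
from itertools import combinations


def postings_st(query, st_index):
    """Postings shipped with single-term indexing: sum of s(t_i) over query terms."""
    return sum(len(st_index.get(t, ())) for t in set(query))


def postings_hdk(query, hdk_index, max_key_size=3):
    """Postings shipped with HDK indexing: sum of s(k_j) over indexed term sets."""
    total = 0
    terms = sorted(set(query))
    for s in range(1, max_key_size + 1):
        for combo in combinations(terms, s):
            total += len(hdk_index.get(frozenset(combo), ()))
    return total


# Toy indexes with an assumed threshold DF_low = 4.
ST_INDEX = {
    "p2p": [1, 2, 3, 4, 5, 6],
    "index": [2, 3, 4, 6, 7, 8, 9],
    "query": [9],
}
HDK_INDEX = {
    frozenset({"query"}): [9],
    frozenset({"p2p", "index"}): [2, 3, 4, 6],
}

ST_COST = postings_st(["p2p", "index"], ST_INDEX)
HDK_COST = postings_hdk(["p2p", "index"], HDK_INDEX)
```

For the query {p2p, index} the single-term index ships 6 + 7 = 13 postings, while the HDK index ships only the 4 postings of the key {p2p, index}, which is the pre-computed intersection of the two long lists.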

Figure 9 compares the vocabulary and index size for HDK and ST for 5 documents. When generating the index, the size of the irw vocabulary ($K_{irw}$) is substantially larger than the ST vocabulary. On the other hand, in HDK indexing, the posting lists are short. We studied the behavior of the HDK vocabulary ($K_{irw}$) and inverted index with varying $DF_{low}$ (see Figure 9). The longer we allow the posting lists to be, the more similar HDK and ST indexing become. The ratio of the HDK vocabulary size to the ST vocabulary size falls very quickly, from 5 times the ST vocabulary for $DF_{low} = 3$ to near equal size for $DF_{low} = 5$. The ratio of the inverted index sizes of HDK and ST also decreases for large values of $DF_{low}$.

[Figure 9: ratio of vocabulary size (HDK / ST) and ratio of index size (HDK / ST), plotted against the length of the posting list ($DF_{low}$)]

Fig. 9. Comparison of the sizes of the vocabularies and the indexes for single-term (ST) indexing and highly discriminative keys (HDK) indexing for varying values of $DF_{low}$ (posting list length) for a collection of 5 documents

5 Discussion and Conclusions

In this paper, we proposed a strategy using highly discriminative keys for indexing. Our solution overcomes an important scalability problem of standard IR techniques for distributed search engines (a bandwidth requirement that grows with the average posting list size). By applying the distributional semantics technique, the approach achieves results comparable to standard centralized information retrieval using TF-IDF. The number of indexed keys and the size of the co-occurrence matrix remain manageable (i.e., they grow linearly with the size of the document collection).

In our approach, the HDK index is bigger than the ST index. However, storage space, unlike network bandwidth, is largely available in P2P systems. Therefore, the reductions in bandwidth and response time for query answering outweigh the increase in storage space. Moreover, there exist other mechanisms to further reduce the size of the vocabulary (and therefore of the index), for example, by taking user queries into account: only keys that appear in

user queries with a certain frequency are kept in the index [Klemm et al., 2004]. Another mechanism to limit the size of posting lists is to rank the postings and keep only the top-k. Moreover, further IR techniques can be applied, such as compression of the posting lists and the index using bit vectors. Such optimizations will be considered in future work.

One challenge this approach must face is the duplication of documents in the collection. Documents that appear more than $DF_{low}$ times in the collection will not contain any rare keys and are therefore not indexed. To overcome this problem, we will cluster these documents, extract the lexical profile of each cluster, and index it. Another possibility would be to use signature files to check for similar content between documents. Details of this solution are part of future work.

References

[Aberer, 2001] Karl Aberer. P-Grid: A self-organizing access structure for P2P information systems. Sixth International Conference on Cooperative Information Systems, 2001.

[Albrecht et al., 2004] Keno Albrecht, Ruedi Arnold, Michael Gähwiler, and Roger Wattenhofer. Aggregating information in peer-to-peer systems for improved join and leave. In Peer-to-Peer Computing, 2004.

[Bamboo, 2004] Bamboo. The Bamboo Distributed Hash Table, 2004.

[El-Ansary et al., 2003] Sameh El-Ansary, Luc Onana Alima, Per Brand, and Seif Haridi. Efficient broadcast in structured P2P networks. In IPTPS, 2003.

[Frakes and Baeza-Yates, 1992] W. Frakes and R. Baeza-Yates. Information Retrieval: Data Structures and Algorithms. Prentice Hall PTR, 1992.

[Klemm et al., 2004] F. Klemm, A. Datta, and K. Aberer. A query-adaptive partial distributed hash table for peer-to-peer systems. In International Workshop on Peer-to-Peer Computing & DataBases, 2004.

[Li et al., 2003] J. Li, B. Loo, J. Hellerstein, F. Kaashoek, D. Karger, and R. Morris. The Feasibility of Peer-to-Peer Web Indexing and Search, 2003.

[Lu and Callan, 2003] Jie Lu and Jamie Callan. Content-based retrieval in hybrid peer-to-peer networks. In Proceedings of the Twelfth International Conference on Information and Knowledge Management, 2003.

[Lv et al., 2001] Q. Lv, P. Cao, E. Cohen, K. Li, and S. Shenker. Search and replication in unstructured peer-to-peer networks. In 16th International Conference on Supercomputing, 2001.

[Rajman and Bonnet, 1992] Martin Rajman and Alain Bonnet. Corpora-Based Linguistics: New Tools for Natural Language Processing. 1st Annual Conference of Association for Global Strategic Information, 1992.

[Ratnasamy et al., 2001] Sylvia Ratnasamy, Paul Francis, Mark Handley, Richard Karp, and Scott Shenker. A scalable content-addressable network. In SIGCOMM, 2001.

[Reynolds and Vahdat, 2003] P. Reynolds and A. Vahdat. Efficient Peer-to-Peer Keyword Searching. Middleware '03, 2003.

[Rowstron and Druschel, 2001] A. Rowstron and P. Druschel. Pastry: Scalable, distributed object location and routing for large-scale peer-to-peer systems. In IFIP/ACM International Conference on Distributed Systems Platforms (Middleware), 2001.

[Salton and Yang, 1973] G. Salton and C. Yang. On the specification of term values in automatic indexing. Journal of Documentation, 29, 1973.

[Salton et al., 1993] G. Salton, J. Allan, and C. Buckley. Approaches to Passage Retrieval in Full Text Information Systems. In Proceedings of the 16th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, pages 49–58, 1993.

[Sceats, 2003] Mark Sceats. How Big is the Web, 2003.

[Silverstein et al., 1999] Craig Silverstein, Monika Rauch Henzinger, Hannes Marais, and Michael Moricz. Analysis of a very large web search engine query log. SIGIR Forum, 33(1):6–12, 1999.

[Stoica et al., 2001] I. Stoica, R. Morris, D. Karger, M. F. Kaashoek, and H. Balakrishnan. Chord: A Scalable Peer-to-peer Lookup Service for Internet Applications. In Proceedings of ACM SIGCOMM, 2001.

[Suel et al., 2003] Torsten Suel, Chandan Mathur, Jo-Wen Wu, Jiangong Zhang, Alex Delis, Mehdi Kharrazi, Xiaohui Long, and Kulesh Shanmugasundaram. ODISSEA: A Peer-to-Peer Architecture for Scalable Web Search and Information Retrieval. WebDB '03, 2003.

[Tang et al., 2003] C. Tang, C. Xu, and S. Dwarkadas. Peer-to-peer information retrieval using self-organizing semantic overlay networks. In SIGCOMM, 2003.

[Yalagandula and Dahlin, 2004] Praveen Yalagandula and Mike Dahlin. A scalable distributed information management system. In SIGCOMM '04: Proceedings of the 2004 Conference on Applications, Technologies, Architectures, and Protocols for Computer Communications, New York, NY, USA, 2004. ACM Press.

Aggregation of a Term Vocabulary for P2P-IR: a DHT Stress Test

Aggregation of a Term Vocabulary for P2P-IR: a DHT Stress Test Aggregation of a Term Vocabulary for P2P-IR: a DHT Stress Test Fabius Klemm and Karl Aberer School of Computer and Communication Sciences Ecole Polytechnique Fédérale de Lausanne (EPFL), Lausanne, Switzerland

More information

Aggregation of a Term Vocabulary for Peer-to-Peer Information Retrieval: a DHT Stress Test

Aggregation of a Term Vocabulary for Peer-to-Peer Information Retrieval: a DHT Stress Test Aggregation of a Term Vocabulary for Peer-to-Peer Information Retrieval: a DHT Stress Test Fabius Klemm and Karl Aberer School of Computer and Communication Sciences Ecole Polytechnique Fédérale de Lausanne

More information

An Architecture for Peer-to-Peer Information Retrieval

An Architecture for Peer-to-Peer Information Retrieval An Architecture for Peer-to-Peer Information Retrieval Karl Aberer, Fabius Klemm, Martin Rajman, Jie Wu School of Computer and Communication Sciences EPFL, Lausanne, Switzerland July 2, 2004 Abstract Peer-to-Peer

More information

Towards large scale peer-to-peer web search

Towards large scale peer-to-peer web search Towards large scale peer-to-peer web search Gerwin van Doorn Human Media Interaction, University of Twente, the Netherlands g.h.vandoorn@cs.utwente.nl ABSTRACT Web search engines, such as Google and Yahoo,

More information

Distributed Hash Table

Distributed Hash Table Distributed Hash Table P2P Routing and Searching Algorithms Ruixuan Li College of Computer Science, HUST rxli@public.wh.hb.cn http://idc.hust.edu.cn/~rxli/ In Courtesy of Xiaodong Zhang, Ohio State Univ

More information

Scalable Peer-to-Peer Web Retrieval with Highly Discriminative Keys

Scalable Peer-to-Peer Web Retrieval with Highly Discriminative Keys Scalable Peer-to-Peer Web Retrieval with Highly Discriminative Keys Technical report LSIR-REPORT-2006-009 Ivana Podnar, Martin Rajman, Toan Luu, Fabius Klemm, Karl Aberer [ivana.podnar, martin.rajman,

More information

A Framework for Peer-To-Peer Lookup Services based on k-ary search

A Framework for Peer-To-Peer Lookup Services based on k-ary search A Framework for Peer-To-Peer Lookup Services based on k-ary search Sameh El-Ansary Swedish Institute of Computer Science Kista, Sweden Luc Onana Alima Department of Microelectronics and Information Technology

More information

A Super-Peer Based Lookup in Structured Peer-to-Peer Systems

A Super-Peer Based Lookup in Structured Peer-to-Peer Systems A Super-Peer Based Lookup in Structured Peer-to-Peer Systems Yingwu Zhu Honghao Wang Yiming Hu ECECS Department ECECS Department ECECS Department University of Cincinnati University of Cincinnati University

More information

Early Measurements of a Cluster-based Architecture for P2P Systems

Early Measurements of a Cluster-based Architecture for P2P Systems Early Measurements of a Cluster-based Architecture for P2P Systems Balachander Krishnamurthy, Jia Wang, Yinglian Xie I. INTRODUCTION Peer-to-peer applications such as Napster [4], Freenet [1], and Gnutella

More information

Huffman-DHT: Index Structure Refinement Scheme for P2P Information Retrieval

Huffman-DHT: Index Structure Refinement Scheme for P2P Information Retrieval International Symposium on Applications and the Internet Huffman-DHT: Index Structure Refinement Scheme for P2P Information Retrieval Hisashi Kurasawa The University of Tokyo 2-1-2 Hitotsubashi, Chiyoda-ku,

More information

Update Propagation Through Replica Chain in Decentralized and Unstructured P2P Systems

Update Propagation Through Replica Chain in Decentralized and Unstructured P2P Systems Update Propagation Through Replica Chain in Decentralized and Unstructured PP Systems Zhijun Wang, Sajal K. Das, Mohan Kumar and Huaping Shen Center for Research in Wireless Mobility and Networking (CReWMaN)

More information

Building a low-latency, proximity-aware DHT-based P2P network

Building a low-latency, proximity-aware DHT-based P2P network Building a low-latency, proximity-aware DHT-based P2P network Ngoc Ben DANG, Son Tung VU, Hoai Son NGUYEN Department of Computer network College of Technology, Vietnam National University, Hanoi 144 Xuan

More information

A Chord-Based Novel Mobile Peer-to-Peer File Sharing Protocol

A Chord-Based Novel Mobile Peer-to-Peer File Sharing Protocol A Chord-Based Novel Mobile Peer-to-Peer File Sharing Protocol Min Li 1, Enhong Chen 1, and Phillip C-y Sheu 2 1 Department of Computer Science and Technology, University of Science and Technology of China,

More information

Web Text Retrieval with a P2P Query-Driven Index

Web Text Retrieval with a P2P Query-Driven Index Web Text Retrieval with a P2P Query-Driven Index Gleb Skobeltsyn, Toan Luu, Ivana Podnar Žarko, Martin Rajman, Karl Aberer Ecole Polytechnique Fédérale de Lausanne (EPFL) University of Zagreb School of

More information

Query Processing Over Peer-To-Peer Data Sharing Systems

Query Processing Over Peer-To-Peer Data Sharing Systems Query Processing Over Peer-To-Peer Data Sharing Systems O. D. Şahin A. Gupta D. Agrawal A. El Abbadi Department of Computer Science University of California at Santa Barbara odsahin, abhishek, agrawal,

More information

Query-Driven Indexing for Scalable Peer-to-Peer Text Retrieval

Query-Driven Indexing for Scalable Peer-to-Peer Text Retrieval Query-Driven Indexing for Scalable Peer-to-Peer Text Retrieval Gleb Skobeltsyn, Toan Luu, Ivana Podnar Žarko, Martin Rajman, Karl Aberer Ecole Polytechnique Fédérale de Lausanne (EPFL) Lausanne, Switzerland

More information

An Algorithm to Reduce the Communication Traffic for Multi-Word Searches in a Distributed Hash Table

An Algorithm to Reduce the Communication Traffic for Multi-Word Searches in a Distributed Hash Table An Algorithm to Reduce the Communication Traffic for Multi-Word Searches in a Distributed Hash Table Yuichi Sei 1, Kazutaka Matsuzaki 2, and Shinichi Honiden 3 1 The University of Tokyo Information Science

More information

Evaluating Unstructured Peer-to-Peer Lookup Overlays

Evaluating Unstructured Peer-to-Peer Lookup Overlays Evaluating Unstructured Peer-to-Peer Lookup Overlays Idit Keidar EE Department, Technion Roie Melamed CS Department, Technion ABSTRACT Unstructured peer-to-peer lookup systems incur small constant overhead

More information

Distriubted Hash Tables and Scalable Content Adressable Network (CAN)

Distriubted Hash Tables and Scalable Content Adressable Network (CAN) Distriubted Hash Tables and Scalable Content Adressable Network (CAN) Ines Abdelghani 22.09.2008 Contents 1 Introduction 2 2 Distributed Hash Tables: DHT 2 2.1 Generalities about DHTs............................

More information

A Hybrid Peer-to-Peer Architecture for Global Geospatial Web Service Discovery

A Hybrid Peer-to-Peer Architecture for Global Geospatial Web Service Discovery A Hybrid Peer-to-Peer Architecture for Global Geospatial Web Service Discovery Shawn Chen 1, Steve Liang 2 1 Geomatics, University of Calgary, hschen@ucalgary.ca 2 Geomatics, University of Calgary, steve.liang@ucalgary.ca

More information

Dynamic Load Sharing in Peer-to-Peer Systems: When some Peers are more Equal than Others

Dynamic Load Sharing in Peer-to-Peer Systems: When some Peers are more Equal than Others Dynamic Load Sharing in Peer-to-Peer Systems: When some Peers are more Equal than Others Sabina Serbu, Silvia Bianchi, Peter Kropf and Pascal Felber Computer Science Department, University of Neuchâtel

More information

A Directed-multicast Routing Approach with Path Replication in Content Addressable Network

A Directed-multicast Routing Approach with Path Replication in Content Addressable Network 2010 Second International Conference on Communication Software and Networks A Directed-multicast Routing Approach with Path Replication in Content Addressable Network Wenbo Shen, Weizhe Zhang, Hongli Zhang,

More information

Peer-to-Peer Systems. Chapter General Characteristics

Peer-to-Peer Systems. Chapter General Characteristics Chapter 2 Peer-to-Peer Systems Abstract In this chapter, a basic overview is given of P2P systems, architectures, and search strategies in P2P systems. More specific concepts that are outlined include

More information

Should we build Gnutella on a structured overlay? We believe

Should we build Gnutella on a structured overlay? We believe Should we build on a structured overlay? Miguel Castro, Manuel Costa and Antony Rowstron Microsoft Research, Cambridge, CB3 FB, UK Abstract There has been much interest in both unstructured and structured

More information

Congestion Control for Distributed Hash Tables

Congestion Control for Distributed Hash Tables Congestion Control for Distributed Hash Tables Fabius Klemm, Jean-Yves Le Boudec, and Karl Aberer School of Computer and Communication Sciences Ecole Polytechnique Fédérale de Lausanne (EPFL), Lausanne,

More information

On the Feasibility of Peer-to-Peer Web Indexing and Search

On the Feasibility of Peer-to-Peer Web Indexing and Search On the Feasibility of Peer-to-Peer Web Indexing and Search Jinyang Li Boon Thau Loo Joseph M. Hellerstein M. Frans Kaashoek David Karger Robert Morris MIT Lab for Computer Science UC Berkeley jinyang@lcs.mit.edu,

More information

Data Allocation Scheme Based on Term Weight for P2P Information Retrieval

Data Allocation Scheme Based on Term Weight for P2P Information Retrieval Data Allocation Scheme Based on Term Weight for P2P Information Retrieval ABSTRACT Hisashi Kurasawa The University of Tokyo 2-1-2 Hitotsubashi Chiyoda-ku, Tokyo, JAPAN kurasawa@nii.ac.jp Atsuhiro Takasu

More information

DYNAMIC TREE-LIKE STRUCTURES IN P2P-NETWORKS

DYNAMIC TREE-LIKE STRUCTURES IN P2P-NETWORKS DYNAMIC TREE-LIKE STRUCTURES IN P2P-NETWORKS Herwig Unger Markus Wulff Department of Computer Science University of Rostock D-1851 Rostock, Germany {hunger,mwulff}@informatik.uni-rostock.de KEYWORDS P2P,

More information

Exploiting Semantic Clustering in the edonkey P2P Network

Exploiting Semantic Clustering in the edonkey P2P Network Exploiting Semantic Clustering in the edonkey P2P Network S. Handurukande, A.-M. Kermarrec, F. Le Fessant & L. Massoulié Distributed Programming Laboratory, EPFL, Switzerland INRIA, Rennes, France INRIA-Futurs

More information

Scalability In Peer-to-Peer Systems. Presented by Stavros Nikolaou

Scalability In Peer-to-Peer Systems. Presented by Stavros Nikolaou Scalability In Peer-to-Peer Systems Presented by Stavros Nikolaou Background on Peer-to-Peer Systems Definition: Distributed systems/applications featuring: No centralized control, no hierarchical organization

More information

Improving Hybrid Keyword-Based Search

Improving Hybrid Keyword-Based Search Improving Hybrid Keyword-Based Search Matei A. Zaharia and Srinivasan Keshav Abstract: We present a hybrid peer-to-peer system architecture for keyword-based free-text search in environments with heterogeneous

More information

A Peer-to-Peer Architecture to Enable Versatile Lookup System Design

A Peer-to-Peer Architecture to Enable Versatile Lookup System Design A Peer-to-Peer Architecture to Enable Versatile Lookup System Design Vivek Sawant Jasleen Kaur University of North Carolina at Chapel Hill, Chapel Hill, NC, USA vivek, jasleen @cs.unc.edu Abstract The

More information

Adaptive Load Balancing for DHT Lookups

Adaptive Load Balancing for DHT Lookups Adaptive Load Balancing for DHT Lookups Silvia Bianchi, Sabina Serbu, Pascal Felber and Peter Kropf University of Neuchâtel, CH-, Neuchâtel, Switzerland {silvia.bianchi, sabina.serbu, pascal.felber, peter.kropf}@unine.ch

More information

Architectures for Distributed Systems

Architectures for Distributed Systems Distributed Systems and Middleware 2013 2: Architectures Architectures for Distributed Systems Components A distributed system consists of components Each component has well-defined interface, can be replaced

More information

Load Balancing in Structured P2P Systems

Load Balancing in Structured P2P Systems 1 Load Balancing in Structured P2P Systems Ananth Rao Karthik Lakshminarayanan Sonesh Surana Richard Karp Ion Stoica fananthar, karthik, sonesh, karp, istoicag@cs.berkeley.edu Abstract Most P2P systems

More information

SIL: Modeling and Measuring Scalable Peer-to-Peer Search Networks

SIL: Modeling and Measuring Scalable Peer-to-Peer Search Networks SIL: Modeling and Measuring Scalable Peer-to-Peer Search Networks Brian F. Cooper and Hector Garcia-Molina Department of Computer Science Stanford University Stanford, CA 94305 USA {cooperb,hector}@db.stanford.edu

More information

Self-Correcting Broadcast in Distributed Hash Tables

Self-Correcting Broadcast in Distributed Hash Tables Self-Correcting Broadcast in Distributed Hash Tables Ali Ghsi 1,Luc Onana Alima 1, Sameh El-Ansary 2, Per Brand 2 and Seif Haridi 1 1 IMIT-Royal Institute of Technology, Kista, Sweden 2 Swedish Institute

More information

Comparing Chord, CAN, and Pastry Overlay Networks for Resistance to DoS Attacks

Comparing Chord, CAN, and Pastry Overlay Networks for Resistance to DoS Attacks Comparing Chord, CAN, and Pastry Overlay Networks for Resistance to DoS Attacks Hakem Beitollahi Hakem.Beitollahi@esat.kuleuven.be Geert Deconinck Geert.Deconinck@esat.kuleuven.be Katholieke Universiteit

More information

Peer Clustering and Firework Query Model

Peer Clustering and Firework Query Model Peer Clustering and Firework Query Model Cheuk Hang Ng, Ka Cheung Sia Department of Computer Science and Engineering The Chinese University of Hong Kong Shatin, N.T., Hong Kong SAR {chng,kcsia}@cse.cuhk.edu.hk

More information

Making Search Efficient on Gnutella-like P2P Systems

Making Search Efficient on Gnutella-like P2P Systems Making Search Efficient on Gnutella-like P2P Systems Yingwu Zhu Department of ECECS University of Cincinnati zhuy@ececs.uc.edu Xiaoyu Yang Department of ECECS University of Cincinnati yangxu@ececs.uc.edu

More information

Time-related replication for p2p storage system

Time-related replication for p2p storage system Seventh International Conference on Networking Time-related replication for p2p storage system Kyungbaek Kim E-mail: University of California, Irvine Computer Science-Systems 3204 Donald Bren Hall, Irvine,

More information

Conceptual document indexing using a large scale semantic dictionary providing a concept hierarchy

Conceptual document indexing using a large scale semantic dictionary providing a concept hierarchy Conceptual document indexing using a large scale semantic dictionary providing a concept hierarchy Martin Rajman, Pierre Andrews, María del Mar Pérez Almenta, and Florian Seydoux Artificial Intelligence

More information

Evaluation Study of a Distributed Caching Based on Query Similarity in a P2P Network

Evaluation Study of a Distributed Caching Based on Query Similarity in a P2P Network Evaluation Study of a Distributed Caching Based on Query Similarity in a P2P Network Mouna Kacimi Max-Planck Institut fur Informatik 66123 Saarbrucken, Germany mkacimi@mpi-inf.mpg.de ABSTRACT Several caching

More information

A Top Catching Scheme Consistency Controlling in Hybrid P2P Network

A Top Catching Scheme Consistency Controlling in Hybrid P2P Network A Top Catching Scheme Consistency Controlling in Hybrid P2P Network V. Asha*1, P Ramesh Babu*2 M.Tech (CSE) Student Department of CSE, Priyadarshini Institute of Technology & Science, Chintalapudi, Guntur(Dist),

More information

A Hybrid Structured-Unstructured P2P Search Infrastructure

A Hybrid Structured-Unstructured P2P Search Infrastructure A Hybrid Structured-Unstructured P2P Search Infrastructure Abstract Popular P2P file-sharing systems like Gnutella and Kazaa use unstructured network designs. These networks typically adopt flooding-based

More information

Multiterm Keyword Searching For Key Value Based NoSQL System

Multiterm Keyword Searching For Key Value Based NoSQL System Multiterm Keyword Searching For Key Value Based NoSQL System Pallavi Mahajan 1, Arati Deshpande 2 Department of Computer Engineering, PICT, Pune, Maharashtra, India. Pallavinarkhede88@gmail.com 1, ardeshpande@pict.edu

More information

A Square Root Topologys to Find Unstructured Peer-To-Peer Networks

A Square Root Topologys to Find Unstructured Peer-To-Peer Networks Global Journal of Computer Science and Technology Network, Web & Security Volume 13 Issue 2 Version 1.0 Year 2013 Type: Double Blind Peer Reviewed International Research Journal Publisher: Global Journals

More information

Supporting Multiple-Keyword Search in A Hybrid Structured Peer-to-Peer Network

Supporting Multiple-Keyword Search in A Hybrid Structured Peer-to-Peer Network Supporting Multiple-Keyword Search in A Hybrid Structured Peer-to-Peer Network Xing Jin W.-P. Ken Yiu S.-H. Gary Chan Department of Computer Science The Hong Kong University of Science and Technology Clear

More information

Load Sharing in Peer-to-Peer Networks using Dynamic Replication

Load Sharing in Peer-to-Peer Networks using Dynamic Replication Load Sharing in Peer-to-Peer Networks using Dynamic Replication S Rajasekhar, B Rong, K Y Lai, I Khalil and Z Tari School of Computer Science and Information Technology RMIT University, Melbourne 3, Australia

More information

Subway : Peer-To-Peer Clustering of Clients for Web Proxy

Subway : Peer-To-Peer Clustering of Clients for Web Proxy Subway : Peer-To-Peer Clustering of Clients for Web Proxy Kyungbaek Kim and Daeyeon Park Department of Electrical Engineering & Computer Science, Division of Electrical Engineering, Korea Advanced Institute

More information

A Structured Overlay for Non-uniform Node Identifier Distribution Based on Flexible Routing Tables

A Structured Overlay for Non-uniform Node Identifier Distribution Based on Flexible Routing Tables A Structured Overlay for Non-uniform Node Identifier Distribution Based on Flexible Routing Tables Takehiro Miyao, Hiroya Nagao, Kazuyuki Shudo Tokyo Institute of Technology 2-12-1 Ookayama, Meguro-ku,

More information

Weighted-HR: An Improved Hierarchical Grid Resource Discovery

Weighted-HR: An Improved Hierarchical Grid Resource Discovery Journal of Computer & Robotics 11 (2), 2018 21-30 21 Computer & Robotics Weighted-HR: An Improved Hierarchical Grid Resource Discovery Mahdi Mollamotalebi a,*, Mohammad Mehdi Gilanian Sadeghi b a Department

More information

L3S Research Center, University of Hannover

L3S Research Center, University of Hannover , University of Hannover Dynamics of Wolf-Tilo Balke and Wolf Siberski 21.11.2007 *Original slides provided by S. Rieche, H. Niedermayer, S. Götz, K. Wehrle (University of Tübingen) and A. Datta, K. Aberer

More information

A Peer-to-peer Framework for Caching Range Queries

A Peer-to-peer Framework for Caching Range Queries A Peer-to-peer Framework for Caching Range Queries O. D. Şahin A. Gupta D. Agrawal A. El Abbadi Department of Computer Science University of California Santa Barbara, CA 9316, USA {odsahin, abhishek, agrawal,

More information

Assignment 5. Georgia Koloniari

Assignment 5. Georgia Koloniari Assignment 5 Georgia Koloniari 2. "Peer-to-Peer Computing" 1. What is the definition of a p2p system given by the authors in sec 1? Compare it with at least one of the definitions surveyed in the last

More information

A P2P Approach for Membership Management and Resource Discovery in Grids1

A P2P Approach for Membership Management and Resource Discovery in Grids1 A P2P Approach for Membership Management and Resource Discovery in Grids1 Carlo Mastroianni 1, Domenico Talia 2 and Oreste Verta 2 1 ICAR-CNR, Via P. Bucci 41 c, 87036 Rende, Italy mastroianni@icar.cnr.it

More information

A Scalable Content- Addressable Network

A Scalable Content- Addressable Network A Scalable Content- Addressable Network In Proceedings of ACM SIGCOMM 2001 S. Ratnasamy, P. Francis, M. Handley, R. Karp, S. Shenker Presented by L.G. Alex Sung 9th March 2005 for CS856 1 Outline CAN basics

More information

Boolean Model. Hongning Wang

Boolean Model. Hongning Wang Boolean Model Hongning Wang CS@UVa Abstraction of search engine architecture Indexed corpus Crawler Ranking procedure Doc Analyzer Doc Representation Query Rep Feedback (Query) Evaluation User Indexer

More information

Addressed Issue. P2P What are we looking at? What is Peer-to-Peer? What can databases do for P2P? What can databases do for P2P?

Addressed Issue. P2P What are we looking at? What is Peer-to-Peer? What can databases do for P2P? What can databases do for P2P? Peer-to-Peer Data Management - Part 1- Alex Coman acoman@cs.ualberta.ca Addressed Issue [1] Placement and retrieval of data [2] Server architectures for hybrid P2P [3] Improve search in pure P2P systems

More information

Shaking Service Requests in Peer-to-Peer Video Systems

Shaking Service Requests in Peer-to-Peer Video Systems Service in Peer-to-Peer Video Systems Ying Cai Ashwin Natarajan Johnny Wong Department of Computer Science Iowa State University Ames, IA 500, U. S. A. E-mail: {yingcai, ashwin, wong@cs.iastate.edu Abstract

More information

Multi-level Hashing for Peer-to-Peer System in Wireless Ad Hoc Environment

Multi-level Hashing for Peer-to-Peer System in Wireless Ad Hoc Environment Multi-level Hashing for Peer-to-Peer System in Wireless Ad Hoc Environment Dewan Tanvir Ahmed, Shervin Shirmohammadi Distributed & Collaborative Virtual Environments Research Laboratory School of Information

More information

DISTRIBUTED COMPUTER SYSTEMS ARCHITECTURES

DISTRIBUTED COMPUTER SYSTEMS ARCHITECTURES DISTRIBUTED COMPUTER SYSTEMS ARCHITECTURES Dr. Jack Lange Computer Science Department University of Pittsburgh Fall 2015 Outline System Architectural Design Issues Centralized Architectures Application

More information

A Survey of Peer-to-Peer Systems

A Survey of Peer-to-Peer Systems A Survey of Peer-to-Peer Systems Kostas Stefanidis Department of Computer Science, University of Ioannina, Greece kstef@cs.uoi.gr Abstract Peer-to-Peer systems have become, in a short period of time, one

More information

Parallel Routing Method in Churn Tolerated Resource Discovery

Parallel Routing Method in Churn Tolerated Resource Discovery in Churn Tolerated Resource Discovery E-mail: emiao_beyond@163.com Xiancai Zhang E-mail: zhangxiancai12@sina.com Peiyi Yu E-mail: ypy02784@163.com Jiabao Wang E-mail: jiabao_1108@163.com Qianqian Zhang

More information

Making Gnutella-like P2P Systems Scalable

Making Gnutella-like P2P Systems Scalable Making Gnutella-like P2P Systems Scalable Y. Chawathe, S. Ratnasamy, L. Breslau, N. Lanham, S. Shenker Presented by: Herman Li Mar 2, 2005 Outline What are peer-to-peer (P2P) systems? Early P2P systems

More information

Keyword Extraction by KNN considering Similarity among Features

Keyword Extraction by KNN considering Similarity among Features 64 Int'l Conf. on Advances in Big Data Analytics ABDA'15 Keyword Extraction by KNN considering Similarity among Features Taeho Jo Department of Computer and Information Engineering, Inha University, Incheon,

More information

On Scaling Latent Semantic Indexing for Large Peer-to-Peer Systems
