Using Highly Discriminative Keys for Indexing in a Peer-to-Peer Full-Text Retrieval System

Toan Luu, Fabius Klemm, Martin Rajman, Karl Aberer
Ecole Polytechnique Fédérale de Lausanne (EPFL)
School of Computer and Communication Sciences
CH-1015 Lausanne, Switzerland
July 6, 2005

The work presented in this paper is carried out in the framework of the EPFL Center for Global Computing and was supported by the Swiss National Funding Agency OFES as part of the European project ALVIS - Superpeer Semantic Search Engine, No. 002068.

Abstract

Excessive network bandwidth consumption, caused by the transmission of long posting lists, was identified as one of the major bottlenecks for implementing distributed full-text retrieval in a Peer-to-Peer (P2P) architecture. To address this problem we introduce a novel approach to indexing using highly discriminative terms and term sets, which leads to short posting lists and therefore reduces the network traffic by almost one order of magnitude. In addition, we show that retrieval based on discriminative term sets provides a retrieval quality comparable to standard full-text retrieval using TF-IDF ranking. Our indexing scheme is an important improvement towards realistic P2P retrieval systems that opens the opportunity to virtually unlimited scalability, well beyond the capacity of today's best centralized Web search engines.

Keywords: distributed information retrieval, peer-to-peer (P2P), highly discriminative keys, distributional semantics

1 Introduction and motivation

More and more information is available on the World Wide Web. To deal with such huge amounts of data, we need scalable full-text retrieval mechanisms. Traditional centralized solutions are not suited to handling such amounts of data. Even with huge investments in the necessary infrastructure (Google, MSN Search) [Sceats, 2003], such systems can index only a small part of the documents available on the WWW.

Peer-to-Peer (P2P) systems seem to be very interesting candidates for performing large-scale distributed full-text retrieval. For the Internet, for example, such systems can be implemented using web servers: instead of only delivering documents, the available web servers can also be used to provide a distributed
search facility for the documents. This example is a specific instance of a general principle in P2P systems that relies on resource sharing to attain scalability to a global size, each participant in the network acting as a client and server at the same time. Furthermore, P2P systems also provide sufficient fault tolerance by replicating resources in the network. Finally, P2P architectures allow a high degree of parallel processing and thus scale well in terms of computational complexity. Notice, however, that P2P systems might encounter problems with heavy sequential processing due to the relatively low communication speed compared to densely clustered systems.

In this work we propose a new indexing strategy, which is specifically designed to perform efficient full-text retrieval in a structured P2P system. To achieve this goal, we use an inverted index distributed in the P2P system. For clarity, let us first recall some standard information retrieval notions. In an inverted index, terms are associated with posting lists. A posting list consists of references to the documents that contain the term. The terms used in the inverted index form the vocabulary of the document collection.

Within this framework, one approach [Reynolds and Vahdat, 2003, Suel et al., 2003] for P2P text retrieval is to distribute the inverted index among the peers in the network such that each peer is responsible for storing a certain number of terms and their associated posting lists (see Figure 1). Given a term $t$, such a P2P system must guarantee that at least one of the peers responsible for $t$ is available at any time to retrieve the associated posting list (to achieve fault tolerance, terms and posting lists are replicated at several peers). Furthermore, as such a peer can be found very efficiently, i.e. typically in $O(\log N)$ routing hops (where $N$ is the number of peers), a simple mechanism to process a query is to retrieve and intersect the posting lists associated with the query terms (see Figure 1). To optimize the query process, the intersection of posting lists can be done at the peer that holds the longest posting list of the query terms.

Fig. 1. Distributing an inverted index: peer x retrieves the posting lists for term 1 and term 5 from the responsible peers in the network.

One major problem with this approach is that some terms have very long posting lists. As answering queries requires posting lists to be shipped over the
network, this approach becomes very inefficient (if not impossible) for long posting lists. This problem with long posting lists has recently been identified as a major obstacle for implementing text retrieval in P2P [Li et al., 2003].

In this paper we present a solution that aims at reducing the size of the postings to limit the network load when processing queries. To obtain short posting lists we propose to index not only single terms but also term sets, i.e. sets of terms that occur simultaneously in documents. The interesting term sets are the highly discriminative ones, i.e. the ones that are associated with short posting lists. For example, instead of using two long posting lists, term 1 → {doc 1, doc 2, doc 4, doc 6, doc 8, doc 9} and term 2 → {doc 2, doc 3, doc 6, doc 7, doc 10}, we might only index the term set {term 1, term 2}, which is associated with a far shorter posting list: {term 1, term 2} → {doc 2, doc 6}. In other words, we propose to identify highly discriminative term combinations at indexing time instead of performing long posting list intersections at query processing time (when they are quite expensive in terms of network load).

With our indexing scheme we can achieve higher parallelization by using a retrieval method that increases the number of keys used for indexing while simultaneously decreasing the length of the posting lists. With such an approach, it becomes easier to distribute the indexing structure at a finer granularity in P2P networks, which in turn reduces the communication costs.

One possible concern is that using term combinations might lead to a vocabulary of unmanageable size. To limit this effect we introduce a novel concept of i-rare keys (Section 3.2), a scheme for careful key selection. We show that, after removing terms (or term sets) that are not suitable for indexing, the vocabulary grows linearly with the document collection size and is therefore still manageable in a P2P environment (Section 4.1).

Another potential problem of our approach is to maintain high retrieval quality. Certain terms and term combinations that are not discriminative, i.e. that have long posting lists, are no longer indexed but might still be useful for retrieval (e.g. because they appear in user queries). To solve this problem we use distributional semantics to support concept-based retrieval covering queries that are not based on the indexed keys (Section 3.4). As a result we can show that using highly discriminative terms and term sets has no negative effects on the retrieval quality. We compare our scalable indexing strategy to a central full-text retrieval engine and obtain similar precision and recall values for top-k results when using TF-IDF ranking (Section 4).

The rest of our paper is structured as follows: Section 2 provides related work in P2P full-text retrieval. We describe the details of our indexing strategy in Section 3. Section 4 presents our experimental results. Section 5 finishes with discussion and conclusions.

2 Related work in P2P full-text retrieval

Why P2P? We believe that our indexing strategy is particularly suitable for P2P. P2P systems provide better scalability but also require considerably less
network bandwidth availability than densely clustered distributed IR systems. Sending long posting lists over the network for query answering will not work in P2P. However, our scheme is also applicable in distributed IR, especially as the length of the posting lists is a tunable parameter.

In this section we provide background and related work on P2P systems and how they can be used to perform distributed full-text retrieval. P2P systems can be categorized into two main classes: unstructured and structured systems.

A popular representative of unstructured P2P is Gnutella, which uses flooding to answer queries. Advanced approaches try to restrict the number of messages generated, such as [Lv et al., 2001] using random walks. [Lu and Callan, 2003] studies the use of content-based resource selection and document retrieval algorithms in hierarchical unstructured P2P networks. The idea is to cluster leaf nodes around directory nodes. On receiving a query, a directory node selects the most appropriate leaf and neighboring directory nodes for routing.

The second class is structured P2P, also called structured overlay networks or distributed hash tables (DHT). In such systems, the peers organize to jointly build a distributed index, which allows efficient searches for specific identifiers (hashes), usually by contacting only $O(\log N)$ peers in a network of size $N$. Examples of DHTs are CAN, CHORD, P-Grid, and Pastry, which were presented in [Ratnasamy et al., 2001, Stoica et al., 2001, Aberer, 2001, Rowstron and Druschel, 2001].

In [Reynolds and Vahdat, 2003], the authors used a DHT to map keywords to peers responsible for indexing. They try to resolve the problem of long posting lists by using Bloom filters, caches, and incremental results (top-k). However, Bloom filters and caches still leave the query processing time roughly proportional to the collection size. The incremental results method requires user query feedback and a good ranking function.

Another approach is presented in [Tang et al., 2003] and is based on CAN [Ratnasamy et al., 2001]. Documents and queries are represented as latent semantic indexing vectors in a Cartesian space. This space is mapped into a structured P2P network, keeping semantically related indexes co-located. Documents and queries can be routed to the responsible peer in the Cartesian space. Difficulties occur when high-dimensional vectors have to be mapped into a low-dimensional overlay network.

So far it remains unclear which of the two classes of P2P systems (structured and unstructured) is better suited for implementing highly distributed full-text retrieval. Our approach is based on structured overlay networks. We will therefore explain in more detail the services offered by these systems.

The basic functions a structured P2P layer offers are join(), leave(), and route(id, message). A peer uses join() to connect to the P2P system, where, among other things, its routing tables are initialized, and leave() to gracefully disconnect. The main functionality is route(id, message): each peer in the network is responsible for a defined subset of identifiers from an agreed-upon identifier space. With route(id, message) any peer in the network can
send a message to the peer responsible for any given point in the id space, by having to contact only $O(\log N)$ other peers. Our P2P text-retrieval application uses route() to send (key, posting list) pairs to the responsible peers, as well as to retrieve the posting lists associated with keys for query answering. The id is calculated from the key using a hash function, which is also provided by the P2P layer. The P2P layer independently takes care of constructing and maintaining the routing tables in the presence of peers going on- and offline. Bamboo [Bamboo, 2004] is an example of a working implementation of a DHT that can reliably deliver messages in systems with moderate churn rates. As the description of the various existing P2P systems is not the focus of this paper, we refer the interested reader to the relevant literature [Ratnasamy et al., 2001, Stoica et al., 2001, Rowstron and Druschel, 2001, Aberer, 2001, Lv et al., 2001].

3 Indexing strategy

In this section we explain our indexing strategy in detail. The fundamental idea of our approach is that we index documents using only highly discriminative terms or term sets to avoid long posting lists, which are extremely inefficient in a distributed setting.

3.1 Definitions

To make the presentation easier we introduce the following definitions:

Definition 1. $T$ is the set of all terms that appear in a document collection.

Definition 2. $K$ is the set of keys. A key $k$ is a set of (one or more) terms $t \in T$. The vocabulary of the document collection is a subset of $K$. Notice that $T$ is also a subset of $K$, containing only keys that are made up of a single term.

The quality of a key $k$ for a given document $d$ is determined by its discriminative power.

Discriminative power of a key. To be discriminative, a key $k$ must be as specific as possible with respect to the document $d$ it is associated with. We categorize a key on the basis of its Document Frequency ($DF$), i.e. the number of documents it appears in. We define a threshold $DF_{low}$ (Section 4 shows how $DF_{low}$ can be set) to split the set of keys $K$ into two classes:

- $K_{non\text{-}rare} = \{k \in K \mid DF(k) > DF_{low}\}$: the set of non-rare keys, which appear in many documents, i.e. have bad discriminative power [Salton and Yang, 1973].

- $K_{rare} = \{k \in K \mid DF(k) \le DF_{low}\}$: the set of rare keys, which appear in very few documents and are therefore highly discriminative.

As far as the discriminative power is concerned, the interesting keys are in $K_{rare}$. We therefore use only those keys to index documents. Standard Information Retrieval techniques use single terms to index documents; in our approach, however, we use the discriminative sets of terms (see Figure 2).

Fig. 2. Overview of indexing using highly discriminative keys. Single-term indexing: small vocabulary, long posting lists, very inefficient for distributed indexing. Highly discriminative key indexing: large vocabulary, short posting lists, same retrieval quality, efficient for distributed indexing.

3.2 Key filtering

Creating rare keys using all possible term combinations leads to an explosion of the vocabulary, as it grows as $O(|D|^c)$, where $|D|$ is the size of the document collection and $c > 1$ depends on the length of the key. We therefore introduce filters to select good keys and to keep the vocabulary at a manageable size. Our experiments in Section 4 show that, using filters, the vocabulary grows linearly with the size of the document collection without deteriorating the retrieval quality. We describe two filter processes in the following subsections.

i-rare filter. An important quality of a key is its semantic adequacy. A key $k$ is semantically adequate for a document $d$ if there is a high probability that a user produces $k$ when searching for $d$.
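As a concrete illustration of the rare / non-rare split from Section 3.1 that the filters start from, the following minimal sketch classifies a handful of keys by document frequency. The toy collection, the function names `document_frequency` and `split_keys`, and the value `df_low=2` are assumptions made for this example only; they are not taken from the paper's implementation.

```python
def document_frequency(key, docs):
    """DF(k): number of documents that contain every term of the key."""
    return sum(1 for terms in docs.values() if key <= terms)


def split_keys(keys, docs, df_low):
    """Split keys into (rare, non_rare) using the threshold DF_low."""
    rare, non_rare = set(), set()
    for key in keys:
        if document_frequency(key, docs) <= df_low:
            rare.add(key)
        else:
            non_rare.add(key)
    return rare, non_rare


# Hypothetical toy collection: each document is reduced to its set of terms.
docs = {
    "doc1": {"peer", "index"},
    "doc2": {"peer", "index", "hash"},
    "doc3": {"peer", "query"},
    "doc4": {"peer", "index", "query"},
}
keys = [frozenset(k) for k in (
    {"peer"}, {"index"}, {"hash"}, {"peer", "index"}, {"index", "query"},
)]

rare, non_rare = split_keys(keys, docs, df_low=2)
for key in sorted(rare, key=sorted):
    print(sorted(key), "is rare")
```

In this toy setting the single terms "peer" and "index" are non-rare, while the two-term key {index, query} is rare even though both of its terms are frequent, which is the kind of key the i-rare filter is designed to keep.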

Hypothesis 1. As a user is more likely to generate generic terms, a semantically adequate key contains a very specific combination of quite generic terms.

In other words, a semantically adequate key is by hypothesis a key for which all subsets are not discriminative (i.e. not rare). We call such keys intrinsically rare:

Definition 3. $K_{i\text{-}rare} \subset K_{rare}$: the set of intrinsically rare keys. Each key $k \in K_{i\text{-}rare}$ becomes non-rare by removing any of its terms.

Our hypothesis is that all keys in $K_{i\text{-}rare} \subset K_{rare}$ are highly discriminative and semantically adequate and therefore should be used for indexing the document collection. Notice that supersets of i-rare keys are redundant, as they are already covered in the index by the i-rare keys. They would only increase the vocabulary without improving retrieval performance.

Window filter. We use a window filter as a second mechanism to reduce the size of the vocabulary: we introduce the restriction that keys can be constructed only by choosing terms that appear in the same context. A text context can be a phrase, a paragraph, or a window of a certain size in the document. [Salton et al., 1993] discovered that comparing a query to a text context leads to better results than comparing it to the complete document. Such keys have a higher probability of being used in queries, as they carry more semantic meaning than keys that are constructed by randomly combining terms. Therefore, only keys that are constructed of terms that appear within a window of $w$ consecutive terms remain in the vocabulary $K_{irw} \subset K_{i\text{-}rare} \subset K_{rare}$. Figure 3 summarizes the filtering process.

Fig. 3. Overview of the key filtering process: all possible rare keys ($K_{rare}$) pass through the i-rare filter ($K_{i\text{-}rare}$) and then the window filter ($K_{irw}$).

Definition 4. $T_{irw}$ is the set of all terms occurring in elements of $K_{irw}$.
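The two filters above can be sketched as simple predicates. This is a minimal illustration under stated assumptions: document frequency is computed at document level, a window is any run of `w` consecutive tokens, and the function names (`df`, `is_i_rare`, `in_window`) and the toy documents are hypothetical.

```python
def df(key, docs):
    """Document frequency: number of documents containing all terms of the key."""
    return sum(1 for tokens in docs.values() if key <= set(tokens))


def is_i_rare(key, docs, df_low):
    """Definition 3: the key is rare, and removing any one term makes it non-rare."""
    if df(key, docs) > df_low:
        return False          # not rare at all
    if len(key) == 1:
        return True           # single-term keys have no sub-keys to check
    return all(df(key - {term}, docs) > df_low for term in key)


def in_window(key, tokens, w):
    """Window filter: all terms of the key occur within w consecutive tokens."""
    return any(key <= set(tokens[i:i + w]) for i in range(len(tokens)))


# Hypothetical toy collection of tokenized documents.
docs = {
    "d1": "peer index hash table".split(),
    "d2": "peer query index".split(),
    "d3": "peer hash query routing index".split(),
}

candidate = frozenset({"hash", "query"})
print(is_i_rare(candidate, docs, df_low=1))      # rare, both terms non-rare
print(in_window(candidate, docs["d3"], w=2))     # adjacent in d3
```

With `df_low=1`, the key {hash, table} is rejected because its sub-key {table} is already rare, and {hash, query, index} is rejected because it is a superset of the i-rare key {hash, query}, matching the redundancy remark above.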

3.3 Key production algorithm

Under our assumption that the keys in $K_{irw}$ are suitable for indexing in a P2P environment, we will now explain an algorithm to generate such keys. Our key production algorithm works as follows (a more formal description is given in Algorithm 1).

The algorithm starts with the generation of $K^1_{rare}$, the vocabulary of rare single-term keys. It is created by removing all keys from the single-term vocabulary with a document frequency greater than $DF_{low}$ (as $K^1_{rare}$ contains only single-term keys, the i-rare and window filters do not apply). Then, to create keys of size $s$, the documents are processed one by one. For each document, we create keys by concatenating all possible term combinations of size $s$ within a window of size $w$. These keys are stored in $Temp$. The second step is to remove those keys that are not i-rare: for each key we have to check whether all its subsets are non-rare, which is done using $K^{s-1}_{irw}$. If the key is i-rare, it is added to the vocabulary $K^s_{irw}$. This process is repeated until $s$ reaches a maximum key size. The final vocabulary $K_{irw}$ is the union of all $s$-term vocabularies $K^s_{irw}$.

Complexity analysis. This subsection provides a complexity analysis of the key production algorithm. With a collection of size $x$ terms, we get:

- The complexity to construct all keys of size $s$ within a window of $w$ terms is $x \cdot C_{w-1}^{s-1} \in O(x \cdot w^{s-1})$.
- The function $K^{s-1}_{irw}$.isirare(key) in Algorithm 1 is used to check whether a key of $s$ terms is i-rare. The function checks all $(s-1)$-term keys that are subsets of key. The complexity of this function is $s \cdot C$, where $C$ is the constant cost to access the vocabulary storage structure. (Remember, a key is i-rare if removing any term makes the key non-rare. There are $s$ possible terms that can be removed.)

The complexity is therefore $\sum_{s=1}^{maxkeysize} (s \cdot w^{s-1} \cdot x \cdot C)$. We assume that $w$ is constant and set $maxkeysize$ to three. This maximum key length seems to be suitable, as longer keys are less likely to appear in user queries [Silverstein et al., 1999]. Furthermore, in the document collections we used for our experiments there are only few i-rare keys with length > 3, whereas the additional computation effort is high. Therefore, the computation time grows linearly with the document collection. However, the constant factor is high.

To deal with this problem, we propose to parallelize the computation by distributing the construction of the vocabulary: each peer in the network computes the vocabulary for a small part of the document collection, which is locally stored. To decide whether a key is rare, it is necessary to aggregate its global document frequency. We believe that such an aggregation is possible with aggregation mechanisms proposed in recent P2P papers [Albrecht et al., 2004, El-Ansary et al., 2003, Yalagandula and Dahlin, 2004]. A detailed analysis including updates of document frequencies is part of future work.
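To make the complexity analysis above more tangible, the following sketch evaluates the number of candidate keys $x \cdot C_{w-1}^{s-1}$ and the overall bound $\sum_s s \cdot w^{s-1} \cdot x \cdot C$ for some illustrative values. The concrete numbers (`x = 1000`, `w = 10`, unit access cost `c = 1`) and the function names are assumptions chosen for the example, not measurements from the paper.

```python
from math import comb


def candidate_keys(x, w, s):
    """Number of s-term keys built over x term positions: x * C(w-1, s-1)."""
    return x * comb(w - 1, s - 1)


def cost_bound(x, w, max_key_size, c=1):
    """Upper bound sum_{s=1..max_key_size} s * w^(s-1) * x * c."""
    return sum(s * w ** (s - 1) * x * c for s in range(1, max_key_size + 1))


x, w = 1000, 10  # illustrative collection size (in terms) and window size
for s in (1, 2, 3):
    print(s, candidate_keys(x, w, s), x * w ** (s - 1))

# Doubling the collection size doubles the bound: linear growth in x.
print(cost_bound(x, w, 3), cost_bound(2 * x, w, 3))
```

For a fixed window size and a maximum key size of three, the bound is a constant multiple of `x`, which is the linear growth stated above; the multiplier (321 in this example) is the high constant factor that motivates distributing the computation.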

Algorithm 1 Key Production Algorithm

/* the vocabulary of rare single-term keys */
generate K^1_irw = K^1_rare

for s = 2 to maxkeysize do
    K^s_irw = ∅    /* vocabulary of s-term keys */

    /* process document by document to create a set of s-term keys */
    for all document ∈ collection do
        Temp = ∅    /* set of keys for one document */

        /* generate keys by concatenating terms that appear in a window of size w */
        for all tuple of s terms t_i1, ..., t_is in a window of size w do
            key = concat(t_i1, ..., t_is)
            Temp = Temp ∪ {key}
        end for

        /* insert key set for one document into K^s_irw */
        for all key ∈ Temp do
            /* check whether key is i-rare */
            if K^(s-1)_irw.isirare(key) then
                /* update document frequency in K^s_irw */
                K^s_irw.insertKey(key, 1)
            end if
        end for
    end for
end for
K_irw = ∪_s K^s_irw
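A runnable, centralized reading of Algorithm 1 might look as follows. This is a sketch under several assumptions that the pseudocode leaves open: keys are unordered term sets, the document frequency of a multi-term key counts documents in which its terms co-occur within a window, a candidate is kept only if all of its $(s-1)$-term subsets were found non-rare at the previous level, and keys whose own frequency exceeds $DF_{low}$ are dropped after each pass. The names `window_keys` and `produce_keys` are hypothetical.

```python
from collections import Counter
from itertools import combinations


def window_keys(tokens, s, w):
    """All s-term keys whose terms co-occur within w consecutive tokens."""
    keys = set()
    for i, first in enumerate(tokens):
        following = set(tokens[i + 1:i + w]) - {first}
        for combo in combinations(sorted(following), s - 1):
            keys.add(frozenset((first, *combo)))
    return keys


def produce_keys(docs, df_low, w, max_key_size=3):
    """Return {key: document frequency} for the irw vocabulary (sketch)."""
    # Level 1: rare single-term keys; the filters do not apply here.
    df = Counter(frozenset([t]) for tokens in docs for t in set(tokens))
    vocab = {k: n for k, n in df.items() if n <= df_low}
    non_rare = {k for k, n in df.items() if n > df_low}

    for s in range(2, max_key_size + 1):
        df = Counter()
        for tokens in docs:                       # document by document
            for key in window_keys(tokens, s, w):
                # i-rare check: every (s-1)-term subset must be non-rare
                if all(key - {t} in non_rare for t in key):
                    df[key] += 1                  # one count per document
        vocab.update({k: n for k, n in df.items() if n <= df_low})
        non_rare = {k for k, n in df.items() if n > df_low}
    return vocab


# Hypothetical toy collection.
docs = [
    "peer index hash".split(),
    "peer index query".split(),
    "peer hash query routing".split(),
]
for key, n in sorted(produce_keys(docs, df_low=1, w=3).items(), key=lambda kv: sorted(kv[0])):
    print(sorted(key), n)
```

In this toy run the only single-term key is {routing}; the two-term keys {hash, index}, {index, query}, and {hash, query} are i-rare, and no three-term key survives because each contains one of those rare pairs.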

3.4 Distributional Semantics

We will now discuss the problem of finding, given a query $q = \{t_1, t_2, \ldots, t_m\}$, $t_i \in T$, the corresponding most relevant discriminative keys $k \in K_{irw}$ that were used for indexing. We first search for all possible query term combinations of size up to 3. If some of these combinations are in the index, the corresponding documents will be returned. However, it can happen that none of the term combinations are in the index. In this case, the query does not contain any irw keys. In real-life Web search, users often enter only two or three terms for a search query [Silverstein et al., 1999], which consequently rarely contains irw keys. Therefore, to improve the recall performance of our approach, we project queries into the irw key space.

To perform such a projection, we use Distributional Semantics [Rajman and Bonnet, 1992]. It uses a co-occurrence matrix to make a probabilistic connection between the full vocabulary $T$ (all terms of the document collection) and $T_{irw}$ (the terms occurring in $K_{irw}$). With such a co-occurrence matrix, query terms can be approximated with terms in $T_{irw}$, which increases the chance of finding a key in $K_{irw}$.

Co-occurrence matrix. To find irw keys we use probabilistic associations between $T$ and $T_{irw}$. Two terms are semantically similar if their textual contexts are similar. The projection of a query $q$ from $T$ to $T_{irw}$ corresponds to the transformation from an $N_T$-dimensional to an $N_{T_{irw}}$-dimensional space, where $N_T$ is the size of $T$ and $N_{T_{irw}}$ the size of $T_{irw}$. We represent the set of terms by an $N_T \times N_{T_{irw}}$ matrix of co-occurrences $CO$. Each row of this matrix represents the co-occurrences of a term in $T$:

$$CO = \begin{pmatrix} co_1 \\ co_2 \\ \vdots \\ co_{N_T} \end{pmatrix} = \begin{pmatrix} co_{11} & co_{12} & \cdots & co_{1 N_{T_{irw}}} \\ co_{21} & co_{22} & \cdots & co_{2 N_{T_{irw}}} \\ \vdots & \vdots & & \vdots \\ co_{N_T 1} & co_{N_T 2} & \cdots & co_{N_T N_{T_{irw}}} \end{pmatrix}$$

The co-occurrence frequencies between two terms are computed on a reference corpus that is representative of the domain for which the semantic model is defined. The profile of co-occurrences of a term $t_i \in T$ is interpreted as an estimate of the probability distribution that measures the association between $t_i$ and the terms $t \in T_{irw}$. More precisely, $co_{ij}$ is an estimate for $p(t_j \in T_{irw} \mid t_i \in T)$, the probability that the meaning represented by the term $t_j$ is also triggered by the term $t_i$:

$$p(t_j \mid t_i) \approx co_{ij} = \frac{f(t_j, t_i)}{\sum_k f(t_k, t_i)},$$

where $f(t_j, t_i)$ is the co-occurrence frequency between the two terms $t_j$ and $t_i$ in the observed contexts. In our case, we consider contexts in the form of a window of $w_{co}$ terms in a document.
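The row-normalized co-occurrence profiles defined above can be sketched as follows. The sketch stores only non-zero entries in nested dictionaries and assumes that two term occurrences co-occur when they lie fewer than `w_co` positions apart; this counting convention, the function name `cooccurrence_profiles`, and the toy data are assumptions for illustration.

```python
from collections import defaultdict


def cooccurrence_profiles(docs, irw_terms, w_co):
    """Return {t_i: {t_j: co_ij}} with co_ij = f(t_j, t_i) / sum_k f(t_k, t_i)."""
    counts = defaultdict(lambda: defaultdict(int))
    for tokens in docs:
        for i, t_i in enumerate(tokens):
            start = max(0, i - w_co + 1)
            stop = min(len(tokens), i + w_co)
            for j in range(start, stop):
                t_j = tokens[j]
                # Only columns for terms of T_irw are kept; skip the term itself.
                if t_j != t_i and t_j in irw_terms:
                    counts[t_i][t_j] += 1

    profiles = {}
    for t_i, row in counts.items():
        total = sum(row.values())
        profiles[t_i] = {t_j: f / total for t_j, f in row.items()}
    return profiles


# Hypothetical reference corpus and T_irw.
docs = [["peer", "index", "hash"], ["peer", "query", "hash"]]
profiles = cooccurrence_profiles(docs, irw_terms={"index", "hash"}, w_co=3)
print(profiles["peer"])
```

Each row sums to one, so it can be read as the estimated distribution $p(t_j \mid t_i)$ over the terms of $T_{irw}$; in the toy corpus "peer" is associated with "hash" twice as strongly as with "index".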

Query expansion. Using the co-occurrence matrix, the retrieval process proceeds as follows: if a query $q = \{t_1, t_2, \ldots, t_m\}$, $t_i \in T$, does not contain any irw keys, we consider the co-occurrence profiles of $t_1, t_2, \ldots, t_m$. From these profiles, we derive the $n$ best co-occurrent terms $t \in T_{irw}$ to construct an approximated query $q_{DS}$. To select the $n$ best co-occurrent terms $t$ for a query $q = \{t_1, t_2, \ldots, t_m\}$, we use the following score:

$$score(t_j, q) = p(t_j \mid q) = \frac{p(q \mid t_j)\, p(t_j)}{p(q)}$$

Under the hypothesis that $t_1, t_2, \ldots, t_m$ are independent when they are triggered by $t$, and since the value $p(q)$ is the same for all the expansion terms $t$, we can compare the candidates $t_j$ using the following score:

$$score'(t_j, q) = p(t_j) \prod_{i=1}^{m} \frac{p(t_j \mid t_i)}{p(t_j)}$$

For the final query, we also take the original query $q$ into account: $q' = q_{DS} \cup q$.

4 Experimental Results

We implemented our algorithms in Java. To evaluate the retrieval quality of our approach we used the Reuters corpus and some corpora from SMART (ftp://ftp.cs.cornell.edu/pub/smart/) to evaluate the indexing process. The Reuters corpus consists of over 800,000 news articles, including all English Reuters news edited between 20/8/1996 and 19/8/1997. All articles are stored in an XML file containing rich information about the content. Annotations include topic, country, authors, date, etc. We extracted only the content and title of the news to create the document collection. The documents in our test collection contain between 5 and 3 words. The average number of terms in a document is 17 and the average number of unique terms is 12.

To simulate the evolution of the P2P system (the peers join the network one by one, each with their own document collection), we start with a collection size of 1 documents and then, in each step, add 1 documents. The maximum collection size is 20,000 documents.

For the retrieval performance evaluations, we need collections with query sets and relevance judgments. We therefore used two collections taken from SMART: CRAN, with 1398 abstracts and 225 queries in the domain of aerodynamics, and MED, with 1033 abstracts and 30 queries in the domain of medicine. Results are ranked using traditional TF-IDF ranking [Frakes and Baeza-Yates, 1992].

To investigate the quality of the retrieval results, we use precision and recall at the top $k$ retrieved documents (P@k and R@k). For example, when we get the top-20 ranked documents for a query, the user evaluates that 8 documents are relevant
to the query. Assuming there are 25 relevant documents in the whole collection, P@20 = 8/20 = 0.4 and R@20 = 8/25 = 0.32. We are interested in the high-end precision and recall (top-5 to top-20), as typical measures of the TREC Web Track have shown that users are often interested only in the highest ranked search results.

4.1 Size of $K_{irw}$

The first experiment investigates the number of keys generated using our proposed irw key generation algorithm. For pre-processing we removed 25 common English stop words and applied the Porter Stemmer. With a collection of $D$ documents, $DF_{low}$ is set to $\alpha \cdot D^{1/\beta}$. The parameters $\alpha$ and $\beta$ are used to tune the posting list size to guarantee good retrieval quality and to keep the key computation cost manageable. For example, for $\alpha = \beta = 3$, $DF_{low} = 81$ for a collection of 20,000 documents and 3,000 for a collection of one billion documents.

In the following experiments, the window sizes are set to 10 and 15. We refer to them as (w=10) and (w=15). Figure 4 shows that the number of irw keys grows linearly with respect to the size of the collection and therefore can be stored in a P2P system. The index size (in number of postings) and the processing time for key production also grow linearly with respect to the size of the collection for both (w=10) and (w=15) (Figure 5).

Fig. 4. Number of keys with respect to the collection size (vocabulary growth: keys, in millions, against the collection size in millions of terms, for w = 10 and w = 15).

4.2 Association matrix

To apply distributional semantics we have to create a co-occurrence matrix. As this matrix is usually very sparse, we are only interested in its non-zero elements. The values of this matrix are the co-occurrence frequencies between two terms
in a window of size $w_{co}$. In our experiments, we set $w_{co}$ to 10 and 15 to estimate the size of the matrix (we think that $w_{co}$ should be set to the same value $w$ as used in the key production algorithm). To estimate the growth of the association matrix, we start with a corpus of 1 documents. In each step we add 1 documents to the corpus and count the non-zero elements of the matrix. The results in Figure 6 show that the number of non-zero elements grows slowly with respect to the size of the corpus. It is therefore possible to store the association matrix in a P2P system.

Fig. 5. Index size and processing time with respect to the collection size: (a) inverted index size, in millions of postings; (b) processing time, in minutes. Both panels plot w = 10 and w = 15 against the collection size in millions of terms.

Fig. 6. Growth of the association matrix (non-zero elements, in millions, against the collection size in millions of terms, for $w_{co}$ = 10 and $w_{co}$ = 15).

4.3 Comparing indexing using highly discriminative keys with single-term indexing

In this section we compare indexing using highly discriminative keys (HDK indexing) with traditional single-term (ST) indexing.

We used Terrier, a Java software package for the development of search engines, to implement standard TF-IDF ranking. The comparison between HDK and ST indexing was made with the CRAN and MED collections. We set $DF_{low}$ to 3, 2, and 1, and the window size $w$ to 15. The co-occurrence frequency is also computed with window size 15.

The results are shown in Figure 7. The first column (in black) shows the average precision score over all queries when applying ST indexing. The next three columns (HDK3, HDK2, HDK1) show the results of our approach for $DF_{low}$ = 3, 2, and 1 without using distributional semantics. The last column (HDK1-DS) is the result when using distributional semantics to expand the query by 3 terms.

These results show that our approach is comparable to traditional ST indexing when using TF-IDF ranking. HDK and ST have the same retrieval quality for larger values of $DF_{low}$. When the posting lists are small ($DF_{low}$ = 1), using distributional semantics slightly improves the retrieval quality. This improvement seems marginal; however, in our experiments the queries were quite long (about ten terms on average) and therefore likely to contain irw keys. We expect the effect of distributional semantics to be higher with shorter queries. Such experiments are part of future work.

Fig. 7. Precision and recall evaluation for the MED and CRAN collections: (a) precision, MED; (b) recall, MED; (c) precision, CRAN; (d) recall, CRAN. Each panel reports P@k or R@k at k = 5, 10, 15, 20 for ST, HDK3, HDK2, HDK1, and HDK1-DS.

Bandwidth consumption. We now compare the bandwidth consumption in a distributed environment for retrieving the posting lists for highly discriminative keys (HDK) and single-term (ST) indexing.

During the retrieval process, HDK indexing consumes considerably less bandwidth than ST indexing, as it only uses short posting lists. To measure the savings, we propose the following simple scenario: assume a query $q$ contains a list of terms $(t_1, t_2, \ldots, t_n)$. The number of postings transmitted using ST indexing is approximately $\sum_i s(t_i)$, where $s(t_i)$ is the size of the posting list of term $t_i$. In HDK indexing, the number of postings will be $\sum_j s(k_j)$, where the $k_j$ are the irw keys generated from the query $q$ and $s(k_j)$ is the size of the posting list associated with the key $k_j$, with $s(k_j) \le DF_{low}$.

Figure 8 shows (in a and c) the average number of transmitted postings, as well as (in b and d) the average size of the longest posting list transmitted per query. HDK indexing uses a substantially smaller amount of network bandwidth than ST indexing (between 70% and 90% less). For the CRAN collection, for example, we can see that ST indexing transmits about 17 postings, whereas HDK3 transmits only about 3 postings. Furthermore, the average size of the longest posting list transmitted per query is drastically decreased with our approach (by between 85% and 95%). As the transmission of the posting lists for a query can be parallelized, the longest posting list determines the overall query response time. The performance using the MED collection is slightly worse than with CRAN, which might be attributed to a smaller query set and a smaller number of documents.

Fig. 8. Number of postings transmitted for query answering for ST indexing compared to HDK indexing with different posting list lengths: (a) average number of transmitted postings per query, MED; (b) average size of the longest posting list, MED; (c) average number of transmitted postings per query, CRAN; (d) average size of the longest posting list, CRAN. Each panel compares ST, HDK3, HDK2, HDK1, and HDK1-DS.
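The simple scenario above can be sketched in a few lines. The sketch assumes that ST indexing ships the posting list of every query term, and that HDK indexing ships the posting list of every indexed key that can be formed from up to three query terms; the function names and the toy indexes (which reuse the term 1 / term 2 example from the introduction) are illustrative assumptions.

```python
from itertools import combinations


def st_postings(query, st_index):
    """(total postings, longest list) when every query term's list is shipped."""
    sizes = [len(st_index.get(term, ())) for term in query]
    return sum(sizes), max(sizes, default=0)


def hdk_postings(query, hdk_index, max_key_size=3):
    """(total postings, longest list) over all indexed keys formed from the query."""
    sizes = []
    terms = sorted(set(query))
    for s in range(1, max_key_size + 1):
        for combo in combinations(terms, s):
            posting_list = hdk_index.get(frozenset(combo))
            if posting_list is not None:
                sizes.append(len(posting_list))
    return sum(sizes), max(sizes, default=0)


# Toy indexes based on the example from the introduction.
st_index = {
    "term1": [1, 2, 4, 6, 8, 9],
    "term2": [2, 3, 6, 7, 10],
}
hdk_index = {frozenset({"term1", "term2"}): [2, 6]}

query = ["term1", "term2"]
print(st_postings(query, st_index))
print(hdk_postings(query, hdk_index))
```

The second component of each result is the longest list, which matters separately because the lists can be fetched in parallel and the longest one bounds the response time.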

Figure 9 compares the vocabulary and index size for HDK and ST for 5 documents. When generating the index, the size of the irw vocabulary ($K_{irw}$) is substantially larger than the ST vocabulary. On the other hand, in HDK indexing, the posting lists are short. We studied the behavior of the HDK vocabulary ($K_{irw}$) and of the inverted index with varying $DF_{low}$ (see Figure 9). The longer we allow the posting lists to be, the more similar HDK and ST indexing become. The ratio of the HDK vocabulary to the ST vocabulary falls very quickly, from 5 times the ST vocabulary for $DF_{low} = 3$ to near-equal size for $DF_{low} = 5$. The ratio of the inverted index sizes of HDK and ST also decreases for large values of $DF_{low}$.

[Figure 9: ratio of index size (HDK / ST) and ratio of vocabulary size (HDK / ST), plotted against the length of the posting list ($DF_{low}$).]

Fig. 9. Comparison of the sizes of the vocabularies and the indexes for single-term (ST) indexing and highly discriminative keys (HDK) indexing for varying values of $DF_{low}$ (posting list length) for a collection of 5 documents.

5 Discussion and Conclusions

In this paper, we proposed a strategy using highly discriminative keys for indexing. Our solution overcomes an important scalability problem of standard IR techniques for distributed search engines (a bandwidth requirement that grows with the average posting list size). By applying the distributional semantics technique, the approach achieves results comparable to standard centralized information retrieval using TF-IDF. The number of indexed keys and the size of the co-occurrence matrix remain manageable (i.e., they grow linearly with the size of the document collection). In our approach, the HDK index is bigger than the ST index. However, storage space is largely available in P2P systems, as opposed to network bandwidth. Therefore, the reductions in bandwidth and response time for query answering outweigh the increase in storage space. Moreover, there exist other mechanisms
to further reduce the size of the vocabulary (and therefore of the index), for example, by taking user queries into account: only keys that appear in user queries with a certain frequency are kept in the index [Klemm et al., 2004]. Another mechanism to limit the size of posting lists is to rank the postings and keep only the top-k. Moreover, further IR techniques can be applied, such as compression of posting lists and of the index using bit vectors. Such optimizations will be considered in future work.

One challenge this approach must face is the duplication of documents in the collection. Documents that appear more often than $DF_{low}$ times in the collection will not contain any rare keys and are therefore not indexed. To overcome this problem, we will cluster these documents, extract the lexical profile of each cluster, and index it. Another possibility would be to use signature files to detect similar content across documents. Details of this solution are part of future work.

References

[Aberer, 2001] Karl Aberer. P-Grid: A self-organizing access structure for P2P information systems. Sixth International Conference on Cooperative Information Systems, 2001.
[Albrecht et al., 2004] Keno Albrecht, Ruedi Arnold, Michael Gähwiler, and Roger Wattenhofer. Aggregating information in peer-to-peer systems for improved join and leave. In Peer-to-Peer Computing, 2004.
[Bamboo, 2004] Bamboo. The Bamboo Distributed Hash Table, 2004.
[El-Ansary et al., 2003] Sameh El-Ansary, Luc Onana Alima, Per Brand, and Seif Haridi. Efficient broadcast in structured P2P networks. In IPTPS, 2003.
[Frakes and Baeza-Yates, 1992] W. Frakes and R. Baeza-Yates. Information Retrieval: Data Structures and Algorithms. Prentice Hall PTR, 1992.
[Klemm et al., 2004] F. Klemm, A. Datta, and K. Aberer. A query-adaptive partial distributed hash table for peer-to-peer systems. In International Workshop on Peer-to-Peer Computing & DataBases, 2004.
[Li et al., 2003] J. Li, B. Loo, J. Hellerstein, F. Kaashoek, D. Karger, and R. Morris. The Feasibility of Peer-to-Peer Web Indexing and Search, 2003.
[Lu and Callan, 2003] Jie Lu and Jamie Callan. Content-based retrieval in hybrid peer-to-peer networks. In Proceedings of the Twelfth International Conference on Information and Knowledge Management, 2003.
[Lv et al., 2001] Q. Lv, P. Cao, E. Cohen, K. Li, and S. Shenker. Search and replication in unstructured peer-to-peer networks. In 16th International Conference on Supercomputing, 2001.
[Rajman and Bonnet, 1992] Martin Rajman and Alain Bonnet. Corpora-Base Linguistics: New Tools for Natural Language Processing. 1st Annual Conference of Association for Global Strategic Information, 1992.
[Ratnasamy et al., 2001] Sylvia Ratnasamy, Paul Francis, Mark Handley, Richard Karp, and Scott Shenker. A scalable content-addressable network. In SIGCOMM, 2001.
[Reynolds and Vahdat, 2003] P. Reynolds and A. Vahdat. Efficient Peer-to-Peer Keyword Searching. Middleware '03, 2003.
[Rowstron and Druschel, 2001] A. Rowstron and P. Druschel. Pastry: Scalable, distributed object location and routing for large-scale peer-to-peer systems. In IFIP/ACM International Conference on Distributed Systems Platforms (Middleware), 2001.
[Salton and Yang, 1973] G. Salton and C. Yang. On the specification of term values in automatic indexing. Journal of Documentation, 29, 1973.
[Salton et al., 1993] G. Salton, J. Allan, and C. Buckley. Approaches to Passage Retrieval in Full Text Information Systems. In Proceedings of the 16th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, pages 49-58, 1993.
[Sceats, 2003] Mark Sceats. How Big is the Web, 2003.
[Silverstein et al., 1999] Craig Silverstein, Monika Rauch Henzinger, Hannes Marais, and Michael Moricz. Analysis of a very large web search engine query log. SIGIR Forum, 33(1):6-12, 1999.
[Stoica et al., 2001] I. Stoica, R. Morris, D. Karger, M. F. Kaashoek, and H. Balakrishnan. Chord: A Scalable Peer-to-peer Lookup Service for Internet Applications. In Proceedings of ACM SIGCOMM, 2001.
[Suel et al., 2003] Torsten Suel, Chandan Mathur, Jo-Wen Wu, Jiangong Zhang, Alex Delis, Mehdi Kharrazi, Xiaohui Long, and Kulesh Shanmugasundaram. ODISSEA: A Peer-to-Peer Architecture for Scalable Web Search and Information Retrieval. WebDB '03, 2003.
[Tang et al., 2003] C. Tang, C. Xu, and S. Dwarkadas. Peer-to-peer information retrieval using self-organizing semantic overlay networks. In SIGCOMM, 2003.
[Yalagandula and Dahlin, 2004] Praveen Yalagandula and Mike Dahlin. A scalable distributed information management system. In SIGCOMM '04: Proceedings of the 2004 Conference on Applications, Technologies, Architectures, and Protocols for Computer Communications, New York, NY, USA, 2004. ACM Press.
