Result Merging in a Peer-to-Peer Web Search Engine. Sergey Chernov


Result Merging in a Peer-to-Peer Web Search Engine
Sergey Chernov
UNIVERSITÄT DES SAARLANDES
January, 2005

Result Merging in a Peer-to-Peer Web Search Engine
A thesis submitted in partial fulfillment of the requirements for the degree of Master of Science in Computer Science
Submitted by Sergey Chernov under the guidance of Prof. Dr.-Ing. Gerhard Weikum and Christian Zimmer
UNIVERSITÄT DES SAARLANDES
January, 2005

Abstract

The tremendous amount of information on the Internet requires powerful search engines. Currently, only commercial centralized search engines like Google can process terabytes of Web documents. Such approaches fail to index the Hidden Web located in intranets and local databases, and with the exponential growth of the information volume the situation becomes even worse. Peer-to-Peer (P2P) systems can be used to extend the current search capabilities. The Minerva project is a Web search engine based on a P2P architecture. In this thesis, we investigate the effectiveness of different result merging methods for the Minerva system. Each peer provides an efficient search engine for its own focused Web crawl. Each peer can pose a query against a number of selected peers; the selection is based on a database ranking algorithm. The top-k results from several highly ranked peers are collected by the query initiator and merged into a single list. We address the problem of result merging. We select several merging methods that are feasible in a heterogeneous, dynamic, distributed environment. An experimental framework for these methods was implemented, and the effectiveness of the merging techniques was studied with the TREC Web data. The ranking method based on language modeling produced the most robust and accurate results under the different conditions. We also propose a new merging method, which incorporates a preference-based language model. The novelty of the method is that the preference-based language model is obtained from pseudo-relevance feedback on the best peer in the database ranking. In every tested setup, the new method was at least as effective as the baseline or slightly better.

I hereby declare that this thesis is entirely my own work and that I have not used any other media than the ones mentioned in the thesis.

Saarbrücken, 12th January 2005
Sergey Chernov

Acknowledgements

I would like to thank my academic advisor, Professor Gerhard Weikum, for his guidance and encouragement throughout my master thesis project. I wish to express my sincere gratitude to my supervisors Christian Zimmer, Sebastian Michel, and Matthias Bender for their invaluable assistance and feedback. I would like to thank Kerstin Meyer-Ross for her continuous support in everything. I am very grateful to the members of the Databases and Information Systems group AG5, fellow students from the IMPRS program, and all my friends from the Max-Planck Institute, who provided me with a friendly and stimulating environment. I would like to extend special thanks to Pavel Serdyukov and Natalie Kozlova for the numerous discussions and helpful ideas. It is difficult to explain how grateful I am to my mother, Galina Nikolaevna, and my father, Alevtin Petrovich; their wisdom and care made it possible for me to study. Finally, I want to thank the one person who was most supportive and patient during this process, my dear wife Olga. I could never have accomplished this work without her love.

Contents

1 Introduction
    Motivation
    Our contribution
    Description of the remaining chapters

2 Web search and Peer-to-Peer Systems
    Information retrieval basics
    Web search engines
    Peer-to-Peer architecture
    P2P Web search engines
    Minerva project
    Summary

3 Result merging in distributed information retrieval
    Distributed information retrieval in general
    Result merging problem
    Prior work on collection fusion
    Collection fusion properties
    Cooperative environment
    Uncooperative environment
    Learning methods
    Probabilistic methods
    Prior work on the data fusion
    Data fusion properties
    Basic methods
    Mixture methods
    3.4.4 Metasearch approved methods
    Summary

4 Selected result merging strategies
    Target properties for result merging methods
    Score normalization with global IDF
    Score normalization with ICF
    Score normalization with CORI
    Score normalization with language modeling
    Score normalization with raw TF scores
    Summary

5 Our approach
    Result merging with the preference-based language model
    Discussion
    Summary

6 Implementation
    Global statistics classes
    Testing components

7 Experiments
    Experimental setup
    Collections and queries
    Database selection algorithm
    Evaluation metrics
    Experiments with selected result merging methods
    Result merging methods
    Merging results
    Effect of limited statistics on the result merging
    Experiments with our approach
    Optimal size of the top-n
    Optimal smoothing parameter β
    Summary

8 Conclusions and future work
    Conclusions
    Future work

Bibliography

A Test queries

List of Figures

2.1 The Minerva system architecture
Simple metasearch architecture
A query processing scheme in the distributed search system
Collection fusion vs. data fusion
An overlapping in the collection fusion problem
Statistics propagation for the collection fusion
Data fusion on a single search engine
Main classes involved in merging
A general view on the experiments implementation
The macro-average precision with the database ranking RANDOM
The macro-average recall with the database ranking RANDOM
The macro-average precision with the database ranking CORI
The macro-average recall with the database ranking CORI
The macro-average precision with the database ranking IDEAL
The macro-average recall with the database ranking IDEAL
The macro-average precision of the LM04 result merging method with the different database rankings
The macro-average precision with the database ranking CORI with the global statistics collected over the 10 selected databases
The macro-average precision with the database ranking IDEAL with the global statistics collected over the 10 selected databases
The macro-average precision with the database ranking CORI with the different size of top-n for the preference-based model estimation
7.11 The macro-average precision with the database ranking IDEAL with the different size of top-n for the preference-based model estimation
The macro-average precision with the database ranking CORI with the top-10 documents for the preference-based model estimation with β = 0.6 and LM04 result merging method
The macro-average precision with the database ranking IDEAL with the top-10 documents for the preference-based model estimation with β = 0.6 and LM04 result merging method
A.1 Relevant documents distribution for the TF method with the IDEAL ranking
A.2 Relevant documents distribution for the LM04PB06 method with the IDEAL ranking
A.3 Residual between the number of relevant documents of the CE06LM04 and TF methods with the IDEAL database ranking

List of Tables

4.1 The target properties of the result merging methods
Topic-oriented experimental collections
The difference in percent of the average precision between the result merging strategies and corresponding baselines with the RANDOM ranking. The LM04 technique is compared with the SingleLM method; all others are compared with the SingleTFIDF approach
The difference in percent of the average precision between the result merging strategies and corresponding baselines with the CORI ranking. The LM04 technique is compared with the SingleLM method; all others are compared with the SingleTFIDF approach
The difference in percent of the average precision between the result merging strategies and corresponding baselines with the IDEAL ranking. The LM04 technique is compared with the SingleLM method; all others are compared with the SingleTFIDF approach
A.1 The topic-oriented set of the 25 experimental queries (topics are coded as HM for the Health and Medicine, and NE for the Nature and Ecology)
A.2 The number of relevant documents for the TF and LM04PB06 methods with the IDEAL database ranking (the LM04PB06 name is shortened to LMPB for convenience)

Chapter 1

Introduction

1.1 Motivation

Millions of new documents are created every day across the Internet, and even more are changed. The amount of information grows exponentially, and a search engine is the only hope of finding the documents that are relevant to a particular user's need. Routine access to information is now based on full-text information retrieval instead of controlled vocabulary indices. Currently, a huge number of people use the Web for text and image search on a regular basis. In this thesis, we consider the problems of search on text data.

The need for effective search tools becomes more important every day, but currently only a few centralized search engines like Google can cope with this task, and they are only partially effective. The so-called Hidden Web consists of all intranets and local databases behind portal pages. According to the estimate in [SP01], it is 2 to 50 times larger than the Visible Web, which can be crawled by search robots. Taking into account that even the largest Google crawl of more than 8 billion pages encompasses only a part of the Visible Web, we can imagine how many potentially relevant pages a centralized search engine does not consider during a search. This problem comes from the technical limitations of a single search engine.

The desire to overcome the limitations of a single search engine established a new scientific direction: distributed information retrieval, or metasearch. We will use both terms as synonyms. The main technique developed in this field is an intermediate broker called a metasearch engine. It has access to the query interfaces of the individual search engines and text databases. Briefly, when a metasearch engine receives a user's query, it passes the query to a set of appropriate individual search engines and databases. Then it collects the partial results and combines them to improve the overall result. Numerous examples of metasearch engines are available on the Web.

The metasearch approach contains several significant sub-problems, which arise in the query execution process. The database selection problem arises when a query is routed from a metasearch engine to the individual search engines. A naive routing approach is to propagate a query to all available engines. The scalability of such a strategy is unsatisfactory, since it is inefficient to ask more than several dozen servers. The database selection process helps to discover a small number of the most useful databases for the current query and to ask only this limited subset without significant loss in recall. Many database selection methods were developed to tackle this issue [Voo95, CLC95, YL97, GGMT99, Cra01, SJCO02].

The result merging problem is another important sub-problem of the metasearch technique. In information retrieval, the output is a ranked list of documents sorted by their similarity scores. In distributed information retrieval, this list is obtained from several result lists, which are merged into one. The result merging problem is not trivial, and numerous merging techniques have been studied in the literature [CLC95, TVGJL95, Bau99, Cra01, SJCO02].

The issue of automatic database discovery has not been fully addressed, so adding new data sources to a metasearch engine mainly remains a manual task. The major drawback of metasearch is that large search engines are not interested in cooperation. A search result is a commercial product which they want to sell on their own. For example, the STARTS proposal [GCGMP97] is a quite effective communication protocol designed especially for metasearch, but it is not widely used for this reason.

The new Peer-to-Peer (P2P) technologies can help us to remove the limitations caused by the uncooperativeness of search engine vendors. The computational power of processors increases every year, and so does the network bandwidth.

Millions of personal computers have enough storage and computational resources to index their own documents and perform small crawls of the interesting fragments of the Web. They can provide a search on their local index, but do not have to uncover the data itself unless they want to. This is the way to incorporate the Hidden Web pages into a global search mechanism. Collaborative crawling can span a larger portion of the Web, since every peer can contribute its own focused crawl to the system. This method is cheap and provides us with topic-oriented search opportunities; we can also use intellectual input from other users to improve our own search. Such considerations launched the Minerva project, a new P2P Web search engine.

The metasearch field has many properties in common with search in a P2P system, but some important distinctions should be taken into account. A P2P environment is much more dynamic than traditional metasearch:

- Queries are processed on millions of small indices instead of dozens of large indices;
- Global query execution might require resource sharing and collaboration among different peers and cannot be fully performed on one peer;
- Limited statistics is a necessary requirement for a scalable P2P system, while in distributed information retrieval rich statistics can be provided by a centralized source;
- Cooperativeness of peers in a P2P system, in contrast to a metasearch setting, helps to reduce heterogeneity in such parameters as representative statistics or index update time.

Distributed information retrieval accommodates features from two research areas: information retrieval and distributed systems. The goal of effectiveness is inherent to the former: it aims at high relevance of the returned documents, and the collaboration of users in a P2P setting gives us additional opportunities to refine the search results. The main goals of the Minerva project include the traditional metasearch goals and new issues:

1. Increased search coverage of the Web;
2. Retrieval effectiveness comparable with a centralized search engine;
3. Scalable architecture to combine millions of small search engines.

For this purpose, we want to exploit existing solutions from distributed information retrieval and adapt them to our new setup, with the aforementioned distinctive properties in mind. We also want to find novel methods that are suitable for a P2P architecture and can improve our system. The practical goal is to create a prototype of a highly scalable, effective, and efficient distributed P2P Web search engine.

1.2 Our contribution

The main purpose of this thesis is to develop an effective result merging method for the Minerva system. We analyze major sub-problems of result merging and review several existing techniques. The selected methods have been implemented and evaluated in the Minerva prototype. In addition, a new preference-based language model method for result merging is proposed. Our approach combines the preference-based and the result merging rankings. The novelty of the method is that the preference-based language model is obtained from pseudo-relevance feedback on the best peer in the database ranking.

We address the issue of effectiveness, which is determined by the underlying result merging scheme. As in distributed information retrieval, in a P2P system the similarity scores for each document are computed on the basis of the local database statistics. This makes the scores incomparable, because the statistics differ from database to database. A score computation based on global statistics is the most accurate solution in our case. For cooperative data sources, as we have in the Minerva system, we can collect the local database-dependent statistics and replace them with globally estimated ones, which are fair for all databases. We elaborated on this issue by testing several global score computation techniques and identifying the most effective scoring function.
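The incomparability of locally computed scores can be illustrated with a minimal sketch. It assumes a plain logarithmic IDF and two hypothetical peers with made-up statistics; the peer names, the numbers, and the assumption that the peers' collections do not overlap (so that document frequencies can simply be summed) are illustrative and not taken from the thesis.

```python
import math

# Hypothetical per-peer statistics: collection size and document frequencies.
peer_stats = {
    "peer_a": {"num_docs": 1000, "df": {"merging": 10, "search": 500}},
    "peer_b": {"num_docs": 200, "df": {"merging": 100, "search": 120}},
}


def idf(num_docs, df):
    """Plain logarithmic IDF; one common variant, assumed here for illustration."""
    return math.log(num_docs / df) if df else 0.0


def local_idf(peer, term):
    """IDF as a single peer would compute it from its own statistics."""
    stats = peer_stats[peer]
    return idf(stats["num_docs"], stats["df"].get(term, 0))


def global_idf(term):
    """IDF from statistics summed over all peers (assumes disjoint collections)."""
    total_docs = sum(s["num_docs"] for s in peer_stats.values())
    total_df = sum(s["df"].get(term, 0) for s in peer_stats.values())
    return idf(total_docs, total_df)
```

With these numbers, the term "merging" is rare on peer_a and common on peer_b, so the two peers would weight it very differently, while the summed statistics give one weight that applies to documents from both peers.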
We also exploited additional information about the user's preferences in order to improve the quality of the final ranking. Our method combines two rankings. The first ranking is the language modeling result merging scheme. The second one is based on the language model from pseudo-relevance feedback. The user preferences are inferred using pseudo-relevance feedback: the top-k results from the best ranked database are assumed to be relevant. The novelty of our method is that the pseudo-relevance feedback is obtained on the top-ranked peer before the global query execution.

1.3 Description of the remaining chapters

Background information about information retrieval and P2P systems is presented in Chapter 2. An overview of distributed information retrieval and recent work on result merging is given in Chapter 3. Chapter 4 presents details of the merging techniques that we selected for our experimental studies. Chapter 5 contains the new approach that uses the preference-based language model. In Chapter 6, we present implementation details. The experimental setup, evaluation methodology, and our results are presented in Chapter 7. Chapter 8 concludes the thesis with conclusions and suggestions for future work.

Chapter 2

Web search and Peer-to-Peer Systems

In this chapter, we give a short description of Web search, P2P systems, and their potential as a platform for distributed information retrieval. Section 2.1 contains introductory information about information retrieval. In Section 2.2, we review Web search engines. In Section 2.3, some general properties of P2P systems are discussed. Section 2.4 presents recent approaches for combining search mechanisms with a P2P architecture. Section 2.5 describes our approach, the Minerva project.

2.1 Information retrieval basics

Information retrieval deals with search engine architectures, algorithms, and methods that are concerned with information search on the Internet, in digital libraries, and in text databases. The main goal is to find the relevant documents for a query in a collection of documents. The documents are preprocessed and placed into an index, which provides the basis for retrieval.

A typical search engine is based on the single database model of text retrieval. In this model, the documents from the Web and local databases are collected into a centralized repository and indexed. The model is effective if the index is large enough to satisfy most of the users' information needs and the search engine uses an appropriate retrieval system. A retrieval system is a set of retrieval algorithms for different purposes: ranking, stemming, index processing, relevance feedback and so on.

The widely used bag-of-words model assumes that every document may be represented by the words that are contained in it. The most frequent words, like "the", "and" or "is", do not have rich semantics. They are called stopwords, and we remove them from the document representation. The full set of stopwords is stored in a stopword list. Word variations with the same stem, like "run", "runner" and "running", are mapped into one term corresponding to that stem; a stemming algorithm performs this process. In this example the term is "run".

An important characteristic of a retrieval system is its underlying model of the retrieval process. This model specifies how to estimate the probability that a document will be judged relevant. The final document ranking is based on this estimate. The ranking is presented to the user after query execution. The simple retrieval process models include a probabilistic model and a vector space model; the latter is the most widely used in search engines.

In the vector space model, the document D is represented by the vector $d = (w_1, w_2, \ldots, w_m)$, where $w_i$ is the weight indicating the importance of term $t_i$ in representing the semantics of the document and m is the number of distinct terms. For all terms that do not occur in the document, the corresponding entries are equal to zero, so the full document vector is very sparse. When a term occurs in the document, two factors are important in the weight assignment. The first factor is the term frequency TF, which is the number of the term's occurrences in the document. The weight of the term in the document's vector is proportional to TF: the more often a term occurs, the more important it is in representing the document's semantics. The second factor that affects the weight is the document frequency DF, which is the number of documents containing a particular term.
The term weight is multiplied by the inverse document frequency IDF. The more documents a term appears in, the less useful it is for discriminating the documents that contain the term from those that do not. The widely accepted standard for term weighting is the $TF \cdot IDF$ product and its modifications.

A simple query Q is a set of keywords. It is also transformed into an m-dimensional vector $q = (w_1, w_2, \ldots, w_m)$ using all preprocessing steps like stopword elimination, stemming and term weighting. After q and d have been created, the similarity between the document vector and the query vector is estimated. This estimation is based on a similarity function, which can be a distance or an angle measure. The most popular similarity function is the cosine measure, which is computed as the scalar product of q and d normalized by the lengths of the two vectors.

Another popular approach, which tries to overcome the heuristic nature of term weight estimation, comes from the probabilistic model. The language modeling approach [PC98] to information retrieval attempts to predict the probability of generating a query given a document. Although details may differ, the main idea is the following: every document is viewed as a sample generated from a special language. A language model for each document can be estimated during indexing. The relevance of a document for a particular query is formulated as how likely it is that the query was generated from the language model of that document. The likelihood for the query Q to be generated from the language model of the document D is computed as follows [SJCO02]:

\[ P(Q \mid D) = \prod_{i=1}^{|Q|} \bigl( \lambda \, P(t_i \mid D) + (1 - \lambda) \, P(t_i \mid G) \bigr) \tag{2.1} \]

where:

- $t_i$ is a query term in the query Q;
- $P(t_i \mid D)$ is the probability for $t_i$ to appear in the document D;
- $P(t_i \mid G)$ is the probability for the term $t_i$ to be used in the common language model, e.g. in English;
- $\lambda$ is the smoothing parameter between zero and one.

The role of $P(t_i \mid G)$ is to smooth the probability of the document D generating the query term $t_i$, particularly when $P(t_i \mid D)$ is equal to zero.

The usual measures for evaluating retrieval effectiveness are recall and precision, which are defined as follows [MYL02]:

\[ \text{recall} = \frac{\text{NumberOfRetrievedRelevantDocuments}}{\text{NumberOfRelevantDocuments}} \tag{2.2} \]

\[ \text{precision} = \frac{\text{NumberOfRetrievedRelevantDocuments}}{\text{NumberOfRetrievedDocuments}} \tag{2.3} \]

The effectiveness of a text retrieval system is evaluated using a set of test queries. The relevant document set is identified beforehand.
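Equations (2.1) to (2.3) can be illustrated with a minimal sketch. It assumes that both $P(t_i \mid D)$ and $P(t_i \mid G)$ are estimated as simple relative frequencies over token lists, which is only one possible estimator; the function names and the default value of the smoothing parameter are illustrative choices, not part of the thesis.

```python
from collections import Counter


def query_likelihood(query_terms, doc_terms, background_terms, lam=0.5):
    """Smoothed query likelihood P(Q|D) as in equation (2.1).

    Assumption: term probabilities are relative frequencies in the document
    and in a background sample standing in for the common language model G.
    """
    doc_counts = Counter(doc_terms)
    bg_counts = Counter(background_terms)
    doc_len = len(doc_terms)
    bg_len = len(background_terms)
    likelihood = 1.0
    for term in query_terms:
        p_doc = doc_counts[term] / doc_len if doc_len else 0.0
        p_bg = bg_counts[term] / bg_len if bg_len else 0.0
        likelihood *= lam * p_doc + (1.0 - lam) * p_bg
    return likelihood


def recall_precision(retrieved, relevant):
    """Recall and precision as in equations (2.2) and (2.3)."""
    retrieved, relevant = set(retrieved), set(relevant)
    hits = len(retrieved & relevant)
    recall = hits / len(relevant) if relevant else 0.0
    precision = hits / len(retrieved) if retrieved else 0.0
    return recall, precision
```

A query term that is missing from the document still contributes a non-zero factor through the background probability, which is the smoothing effect described above.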
For every test query, a precision value is computed at different levels of recall; these values are averaged over the whole query set, and an average recall-precision curve is produced. In the ideal case, when a system retrieves exactly the full set of relevant results every time, the recall and precision values are both equal to 1. In practice, we cannot achieve such effectiveness due to query ambiguity, each user's specific understanding of relevance, and other factors. Incorporating explicit user feedback and user preferences inferred implicitly from previous search sessions can improve the retrieval quality.

2.2 Web search engines

An information retrieval system for Web pages is called a Web search engine. The capabilities of these systems are very broad; modern techniques allow queries on text, image, and sound files. In our work, we consider the problem of text data retrieval. Web search engines are also differentiated by their application area. General-purpose search engines can search across the whole Web, while special-purpose engines focus on specific information sources or specific subjects. We are interested in general-purpose Web search engines.

Web search engines inherited many properties from traditional information retrieval. Every Web search engine has a text database or, equivalently, a document collection that consists of all documents searchable by this engine. An index for these documents is created before query time; every term in it represents a single keyword or phrase. For each term, one inverted index list is constructed. This list contains the document identifiers of all documents containing the term, along with the corresponding similarity values. During query execution, a search engine takes a union over the inverted index lists corresponding to the query terms. Then the search engine sorts all documents found in descending order of their similarity scores and presents the resulting ranking to the user.
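The inverted index processing described above can be sketched as follows. The sketch assumes that the value stored with each posting is a TF·IDF weight with a plain logarithmic IDF, and that a document's score is the sum of the weights of the query terms it contains; both are simplifying assumptions, and the function names are hypothetical.

```python
import math
from collections import Counter, defaultdict


def build_index(docs):
    """Map each term to a list of (doc_id, weight) postings.

    Assumption: the stored weight is TF * log(N / DF).
    """
    df = Counter()
    for terms in docs.values():
        df.update(set(terms))
    n = len(docs)
    index = defaultdict(list)
    for doc_id, terms in docs.items():
        for term, tf in Counter(terms).items():
            index[term].append((doc_id, tf * math.log(n / df[term])))
    return index


def search(index, query_terms, k=10):
    """Union the postings of all query terms and rank by accumulated score."""
    scores = defaultdict(float)
    for term in query_terms:
        for doc_id, weight in index.get(term, []):
            scores[doc_id] += weight
    # Sort by descending score; ties are broken by document identifier.
    return sorted(scores.items(), key=lambda item: (-item[1], item[0]))[:k]
```

For example, calling `search(build_index(docs), ["peer", "search"])` on a small dictionary of tokenized documents returns the documents that contain at least one of the two terms, best score first.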
There are also distinctive features of Web retrieval that were not used in traditional information retrieval. The most prominent examples are the additional hyperlink relationships between documents and the intensive document tagging. These differences can serve as sources of additional information for search refinement, and they are exploited in different retrieval algorithms. Web developers created a significant portion of the hyperlinks on the Web manually, and this is an implicit intellectual input.

The linkage structure can be used as expert evidence that two pages connected by a hyperlink are also semantically related. It can also be an indication that the Web designer who placed the hyperlinks to some pages assesses their content as valuable. Several algorithms are based on these considerations. The PageRank algorithm computes the global importance of a Web page in a large Web graph, which is inferred from the set of crawled pages [BP98]. The advantage of this algorithm is that the global importance of a page can be precomputed before query execution time. The HITS algorithm [Kle99] uses only a small subset of the Web graph; this subset is constructed at query time. Such on-line computation is inconvenient for search engines with a high query workload, but it allows a topic-oriented computation of page authority.

The HTML tagging of documents can also be of use in Web search engines. Rich information about the importance of terms is inferred from their position in a document. Terms in the title section are more important than terms in the body of the document. Emphasis through font size and style also indicates additional importance of a term. Sophisticated term weighting schemes based on these observations improve the retrieval quality.

There are several important limitations of the existing Web search engines. The first restriction concerns the size of the searchable index. According to the Google statistics, this search engine has the largest crawled index on the Web; its current size is about 8 billion pages.
At the same time, the Hidden Web or Deep Web, which embraces the pages that were excluded from crawling for commercial or technical reasons, is about 2 to 50 times larger than the Visible Web [SP01]. Even now it is unrealistic for a single search engine to maintain an index of this size, and the information volume increases even faster than the computational power of the centralized Web search engines. The second problem is that the crawled information becomes outdated. News pages change daily, and it is impossible to update the whole index at this rate. Some updating strategies help to track changes on the most popular sites on the Internet, but many index entries are completely outdated. The novel opportunities provided by peer-to-peer systems help to solve these problems.

2.3 Peer-to-Peer architecture

A distributed system is a collection of autonomous computers that cooperate in order to achieve a common goal [Cra01]. In the ideal case, a user of such a system does not explicitly notice the other computers, their location, storage replication, load balancing, reliability or functionality. A P2P system is an instance of a distributed system; it is a decentralized, self-organized, highly dynamic loose coupling of many autonomous computers.

P2P systems became famous several years ago with the Napster and Gnutella file-sharing systems. In file-sharing P2P communities, every computer can join as a peer using the client program. Other peers can access all resources shared by the peers in this environment. The main feature of such systems is that a peer looking for a file can directly contact the peer that is sharing this file. The only information that has to be propagated is the peer's address and a short description of the shared data. The first systems, like Napster, used a centralized server with all peers' addresses and the names of the shared files. Other approaches avoided a single point of failure and used the Gnutella-style flooding protocol. It successively broadcasts a request for a particular file through a small number of closest neighbors until the message expires. Modern P2P applications like e-donkey are extremely popular now; they have numerous improvements over their predecessors. In this way, we can harness the power of thousands of autonomous personal computers all over the world to create a temporary community for collaborative work.
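The Gnutella-style flooding described above can be sketched as a breadth-first traversal with a time-to-live counter. This is a simplified in-memory simulation, not the actual Gnutella protocol; the data structures (`neighbors`, `shared_files`) and the function name are hypothetical.

```python
from collections import deque


def flood_search(neighbors, shared_files, start, filename, ttl=3):
    """Forward a file request hop by hop until the time-to-live is used up.

    neighbors:    dict mapping a peer to the peers it is connected to
    shared_files: dict mapping a peer to the set of file names it shares
    Returns the peers that hold the file and were reached within `ttl` hops.
    """
    seen = {start}
    queue = deque([(start, ttl)])
    holders = []
    while queue:
        peer, remaining = queue.popleft()
        if filename in shared_files.get(peer, ()):
            holders.append(peer)
        if remaining == 0:
            continue  # the message has expired and is not forwarded further
        for nxt in neighbors.get(peer, ()):
            if nxt not in seen:  # each peer handles a request only once
                seen.add(nxt)
                queue.append((nxt, remaining - 1))
    return holders
```

The sketch makes the trade-off visible: a small time-to-live keeps the number of messages low, but peers that share the file beyond that radius are never reached.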
P2P technology tries to make systems scalable, self-organized, fault-tolerant, publicly available, and load-balanced. This list of desirable P2P properties is not exhaustive, and there are also issues like anonymity, security, etc., but the selected properties are fundamental for our task. For example, modern P2P systems are often based on a mixed topology in which some super-peers establish different levels of hierarchy, but we are interested in a pure, flat P2P structure. It gives equal rights to all peers and makes a system more scalable.

The limited search capability is a considerable drawback of most P2P systems. Sometimes you have to know the exact filename of the data of interest, or you will miss the relevant results. The combination of search engine mechanisms for effective retrieval with the powerful paradigm of a P2P community is a promising research direction.

2.4 P2P Web search engines

The idea of a Peer-to-Peer Web search engine is being extensively investigated. Interesting combinations of search services with P2P platforms are described in the following approaches.

ODISSEA [SMW+03] is different from many other P2P search approaches. It assumes a two-layered search engine architecture and a global index structure distributed over the nodes of the system. Under a global index organization, in contrast to a local one, a single node holds the entire inverted index for a particular term. A distributed version of Fagin's threshold algorithm is used for result aggregation over the inverted lists. It is efficient only for very short queries of about 2-3 words. For the distributed hash table (DHT) implementation, this system uses the Pastry protocol.

PlanetP [CAPMN03] is another content search infrastructure. Each node maintains an index of its content and summarizes the set of terms in its index using a Bloom filter. The global index is the set of all summaries. Summaries are propagated and kept synchronized using a gossiping algorithm. This approach is effective for several thousand peers, but it is not scalable. Its retrieval quality is rather low for top-k queries with a small k.

GALANX [WGDW03] is a P2P system that is implemented on top of BerkeleyDB.
Similar to the Minerva system, it maintains a local peer index on every node and distributes information about term presence on a peer with a DHT. Different query routing strategies are evaluated in simulation. Most of them are based on the Chord protocol, and the proposed strategies improve the basic effectiveness by enlarging the index size. The presented query routing approaches are not highly scalable, since the index volume increases continuously with the number of peers in the system.

2.5 Minerva project

The Minerva project [BMWZ04] is another Web search engine based on a P2P architecture; see Figure 2.1. In this system, every peer $P_i$ provides an efficient search engine for its own focused Web crawl $C_i$. The documents $D_{ij}$ are indexed locally, and the result is posted into a global directory as a set of index statistics $S_i$. The posting process and all other communication between peers are based on the Chord protocol [SMK+01]. Every peer holds a set of peerlists $L_i$ for a disjoint subset of terms $T_i$, where $\bigcup_{i} T_i = T$. A peerlist $l$ is a mapping $t \mapsto P$, where $t$ is a particular term and $P$ is the subset of peers that contain at least one document with this term. The terms are hashed, and their corresponding peerlists are distributed fairly across the peers by the Chord protocol. During query execution, all necessary peerlists, one for each query keyword, are obtained and merged into one.

Figure 2.1: The Minerva system architecture

Every peer can pose a query against a number of selected peers that are most likely to contain relevant documents. The selection is based on a query routing strategy; this issue is known in the literature as the database selection problem. The search engine on every selected peer processes its inverted index until it obtains the top-k highest-ranked documents for the current query. The best top-k results from these peers are then collected by the query initiator and merged into one top-k list; this task is known as the result merging problem. The quality of the final top-k list depends heavily on the term weighting scheme used on the peers and on the merging algorithm, whereas speed depends mostly on the local index processing scheme.

2.6 Summary

In this chapter, we introduced several basic concepts from information retrieval and Web search. We described some key ideas of P2P systems and reviewed several combinations of Web search engines with a P2P platform. A short description of the new P2P Web search engine Minerva was also provided. Scalability is recognized as an extremely important issue. A P2P architecture appears valuable for effective and efficient retrieval.
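To make the directory lookup of Section 2.5 more concrete, the following minimal sketch shows how peerlists could be posted and then fetched at query time. It is an illustration under stated assumptions, not the Minerva implementation: the function names (`responsible_peer`, `build_directory`, `candidate_peers`) are hypothetical, a plain hash modulo the number of peers stands in for the Chord protocol, and the per-keyword peerlists are merged by set union, which the text does not specify.

```python
import hashlib


def responsible_peer(term: str, peer_ids: list[str]) -> str:
    """Pick the peer that stores the peerlist for `term`.

    Assumption: hash modulo the number of peers, a stand-in for a Chord ring.
    """
    digest = hashlib.sha1(term.encode("utf-8")).hexdigest()
    return peer_ids[int(digest, 16) % len(peer_ids)]


def build_directory(local_terms: dict[str, set[str]], peer_ids: list[str]):
    """Post every peer's terms; returns {directory_peer: {term: peers with term}}."""
    directory: dict[str, dict[str, set[str]]] = {p: {} for p in peer_ids}
    for peer, terms in local_terms.items():
        for term in terms:
            owner = responsible_peer(term, peer_ids)
            directory[owner].setdefault(term, set()).add(peer)
    return directory


def candidate_peers(query: list[str], directory, peer_ids: list[str]) -> set[str]:
    """Fetch one peerlist per query keyword and merge them (assumed: set union)."""
    merged: set[str] = set()
    for term in query:
        owner = responsible_peer(term, peer_ids)
        merged |= directory[owner].get(term, set())
    return merged


peer_ids = ["p1", "p2", "p3"]
local_terms = {
    "p1": {"chord", "search"},
    "p2": {"search", "merge"},
    "p3": {"bloom"},
}
directory = build_directory(local_terms, peer_ids)
print(candidate_peers(["search", "merge"], directory, peer_ids))
```

In this sketch each term's peerlist lives on exactly one directory peer, so a query with n keywords needs n lookups before the database selection step can rank the candidate peers.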

Chapter 3

Result merging in distributed information retrieval

In this chapter, we review recent work on distributed information retrieval. In Section 3.1, we give a short overview of general metasearch issues. Section 3.2 contains a comprehensive description of the result merging task. In Section 3.3, we elaborate on the collection fusion task. In Section 3.4, we address the problems of the data fusion task.

3.1 Distributed information retrieval in general

During the past ten years, a new research direction has emerged: distributed information retrieval, or metasearch. Metasearch is the task of collecting and combining the search results from a set of different sources. A typical scenario includes several search engines that execute a query and one metasearch engine that merges the results and creates a single ranked document list. Several interesting surveys of distributed information retrieval problems and solutions are presented in [Cal00, Cra01, Cro00, MYL02].

The distributed information retrieval task appears when the documents of interest are spread across many sources. In such a situation, one can either collect all documents on one server or establish multiple search engines, one for each collection of documents. The search process is then performed across the network with communication between many servers. This is a distinctive feature of distributed information retrieval.

Search in a distributed environment has several attractive properties, which make it preferable to single-engine search. Several of these important features are listed in [MYL02]:
- Increased coverage of the Web: the indices from many sources are used in one search;
- A solution to the problem of search scalability: a combination is cheaper than a centralized solution;
- Automation of result preprocessing and combining: a user does not have to compare and combine the results from different sources manually;
- Improved retrieval effectiveness: the combination of different search engines can produce a better ranking than any single ranking algorithm.

Metasearch is based on the multi-database model, where several text databases are modeled explicitly. The multi-database model for information retrieval has many tasks in common with the single-database model but also has some additional problems [Cro00]:
- The resource description task;
- The database selection task;
- The result merging task.

These issues are essentially the core of distributed information retrieval research; we briefly describe them below.

The main unit in metasearch is an intermediate broker called a metasearch engine. It obtains and stores a limited summary of every database participating in the search process and decides which databases are most appropriate for a query. A metasearch engine also propagates a query to the selected search engines, and collects and reorganizes the results. A simple metasearch architecture is presented in Figure 3.1. A user poses a query Q against a metasearch engine, which in turn propagates it to several search engines. The result rankings $R_i$ are then returned to the broker, merged, and presented to the user as a single document ranking $R_m$.

Figure 3.1: Simple metasearch architecture

The summary statistics provided by a search engine are called a resource description or database representative. A full-text database provides information about its contents as a set of statistics. It may include data about the number of occurrences of specific terms in particular documents or in the whole collection, the number of indexed documents, etc. The information for building a resource description is obtained during the index creation step. The richness of the database representatives depends on the level of cooperation in the system. For example, the STARTS standard [GCGMP97] is a good choice for a cooperative environment, where all search engines present their results in a unified, informative format. On the other hand, when they are unwilling to cooperate, we can infer their statistics by query-based sampling [SC03].

The collected resource descriptions are used for the database selection, or query routing, task. In practice, we are not interested in databases that are unlikely to contain relevant documents. Therefore, we select from all data sources only those that are probably relevant to our query according to their resource descriptions. For each database, we calculate a usefulness measure, which is usually based on the vector space model. Creating an effective and robust usefulness measure for database ranking is the most prominent task of database selection. Several attempts to address this problem are described in [Voo95, CLC95, YL97, GGMT99, Cra01, SJCO02].

The result merging problem arises when a query is executed on several selected databases and we want to create one single ranking out of these results. This problem is not trivial, since the computation of the similarity score between a document and the query uses local collection statistics. Therefore, the scores are not directly comparable. The most accurate solution can be obtained by global score normalization, which requires cooperation from the sources. We are especially interested in this last problem. A carefully designed result merging algorithm can provide high-quality results and give us an opportunity to speed up local index processing. More information about result merging methods can be found in [CLC95, TVGJL95, Bau99, Cra01, SJCO02, SC03].

Figure 3.2: A query processing scheme in the distributed search system

The query processing scheme is presented more precisely in Figure 3.2 [Cra01]. A query Q is posed on the set of search engines that are represented by their resource descriptions $S_i$. A metasearch engine selects a subset of servers S that are most likely to contain relevant documents. The size of this subset usually does not exceed 10 databases. The broker routes Q to the selected search engines $S_i$ and obtains a set of document rankings R from the selected servers. In the real world, a user is interested only in the top-k best results, where k can vary from 5 to 30. All rankings $R_i$ are merged into one ranking $R_m$, and the top-k results from it are presented to the user.

Text retrieval aims at high relevance of the results with minimum response time. These two components translate into the general issues of effectiveness (quality) and efficiency (speed) of query processing. This thesis is concerned with the effectiveness of result merging.

3.2 Result merging problem

A common issue in metasearch is how to combine several ranked lists of relevant documents from different search engines into one ranked list. This is the so-called result merging problem. The following section reviews some modern merging methods.

Result merging is divided into two main sub-problems. The first one is collection fusion, where the results are merged from disjoint or nearly disjoint document sets. The second sub-problem is data fusion, which arises when we merge different rankings obtained on identical document sets. The main difference between collection fusion and data fusion is that in the first case we want to approximate the result of a single search system whose document set is the union of all document subsets involved in the merging. Therefore, the optimal solution is to obtain the very same retrieval effectiveness as a search engine over the united database. In the data fusion problem, however, the task is to merge the different rankings in such a way that the final ranking is better than every participating ranking. The maximum quality of the result is undefined here, but it should be no less than the quality of the best single ranking. A simple intuition for these two problems is presented in Figure 3.3. A comprehensive description of the differences between collection fusion and data fusion can be found in [VC99b, Mon02].

In metasearch, we often do not know beforehand what kind of merging problem we have, because it depends on the level of overlap between the documents of the combined databases.
If the overlap is very high, the situation is closer to data fusion; otherwise, it is a collection fusion task. Metasearch on the Web has been addressed mainly as a collection fusion problem. In fact, the overlap of search results across different search engines is surprisingly low. However, some approaches also take data fusion methods into account, and sometimes both types are evaluated in mixed setups.

Figure 3.3: Collection fusion vs. data fusion

Another important property is the level of search engine cooperation. We divide all merging methods by environment type into two categories:
- Cooperative (integrated) environment;
- Uncooperative (isolated) environment.

The uncooperative, or isolated, merging methods have no access to the individual databases other than the ranked list of documents returned in response to a query [Voo95]. The cooperative, or integrated, merging techniques assume access to database statistics such as TF, DF, etc. In general, both types of merging methods can produce a more effective result than a single collection with the full set of documents if a data fusion strategy is used [TVGJL95]. In practice, the merged results produced by uncooperative strategies have been less effective than the single-collection run.

Our primary goal is to find a subset of effective merging methods that we can apply and evaluate in the P2P Web search engine Minerva. We assume here that all peers in the Minerva system are cooperative and provide all necessary statistics.
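The effect of local versus global statistics on a merged ranking can be illustrated with a small sketch. It assumes cooperative peers that expose TF, DF, and their collection size, disjoint collections (so global DF and collection size are plain sums), and an illustrative TF*IDF weighting with IDF = log(1 + N/DF). The weighting, the `Peer` class, and the function names are assumptions made for this example and are not taken from Minerva.

```python
import math
from dataclasses import dataclass


@dataclass
class Peer:
    name: str
    num_docs: int                     # documents indexed on this peer
    df: dict[str, int]                # local document frequency per term
    docs: dict[str, dict[str, int]]   # document id -> {term: TF}


def score(tf, df, num_docs, query):
    """Illustrative TF*IDF score with IDF = log(1 + N / DF)."""
    return sum(
        tf.get(t, 0) * math.log(1 + num_docs / df[t])
        for t in query
        if df.get(t, 0) > 0
    )


def merge_top_k(peers, query, k, use_global_stats):
    # Global statistics as sums over peers; valid only for disjoint collections.
    total_docs = sum(p.num_docs for p in peers)
    global_df = {t: sum(p.df.get(t, 0) for p in peers) for t in query}

    merged = []
    for p in peers:
        df, n = (global_df, total_docs) if use_global_stats else (p.df, p.num_docs)
        local = [(score(tf, df, n, query), doc_id) for doc_id, tf in p.docs.items()]
        local.sort(key=lambda pair: (-pair[0], pair[1]))
        merged.extend(local[:k])      # each peer contributes its own top-k
    merged.sort(key=lambda pair: (-pair[0], pair[1]))
    return [doc_id for _, doc_id in merged[:k]]


# Peer A is topically focused: "jaguar" occurs in most of its documents.
peer_a = Peer("A", 100, {"jaguar": 80}, {"a1": {"jaguar": 5}})
peer_b = Peer("B", 100, {"jaguar": 2}, {"b1": {"jaguar": 2}})

raw_ranking = merge_top_k([peer_a, peer_b], ["jaguar"], 2, use_global_stats=False)
normalized_ranking = merge_top_k([peer_a, peer_b], ["jaguar"], 2, use_global_stats=True)
print(raw_ranking, normalized_ranking)
```

With these toy numbers, raw-score merging puts `b1` first because the term is rare on peer B, while merging with global statistics puts `a1` first because it has the higher term frequency. This is the kind of distortion that skewed, topically organized collections introduce into raw local scores.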

3.3 Prior work on collection fusion

A formal definition of the collection fusion problem was stated in [TVGJL95]. There it is mixed with the data fusion definition; therefore, we modified it. Assume a set of document collections C associated with the search engines. With respect to the query Q, each collection $C_i$ contains a number of relevant documents. After the query Q is posed against the collection $C_i$, the search engine returns the ranked list $R_i$ of documents $D_{ij}$ in decreasing order of their similarity $S_{ij}$ to the query. The top-k result is the merged ranked list of length k containing the documents $D_{ij}$ with the highest similarity values $S_{ij}$ in decreasing order. Consider the document collection $C_g = \bigcup_i C_i$ and its top-k results $R_g$, which contain the documents $D_{gj}$ with similarity values $S_{gj}$. The collection fusion task is: given Q, C, and k, find from the local rankings $R_i$ the top-k results $R_c$ of documents $D_{cj}$ such that $S_{cj} = S_{gj}$.

Collection fusion properties

An ideal collection fusion method combines the documents from the local search results into one ranked list in descending order of their global similarity scores. The global similarity scores are those produced by a single global search system over the united database containing all local documents. In a cooperative environment, where all search engines provide the necessary statistics, we can achieve the same consistent merging as produced by a non-distributed system; this is also known as perfect merging or merging with normalized scores [Cra01]. In practice, no efficient collection fusion technique can guarantee exactly the same ranking as a centralized database with all documents from all databases involved. Three main factors affect collection fusion:

1. Only the documents returned by the selected servers can participate in the merging. Some relevant documents will be missed after the database selection step.
2. Different statistics and retrieval algorithms cause a separate problem of incomparable scores. A document can also be missed when the top-k results are merged and a necessary document is locally ranked (k+1)th or lower. This problem can be solved by global statistics normalization methods in a cooperative environment.
3. Overlap between the databases; see Figure 3.4. The pure collection fusion approaches [VF95, Kir97, CLC95] do not consider overlap. It is quite difficult to estimate accurately the actual level of document overlap between datasets. Our assumption is that the degradation of result quality due to overlap is small, whereas the effort required to correct the statistics is significant.

Figure 3.4: An overlapping in the collection fusion problem

Cooperative environment

It was claimed in [SP99, SR00] that simple raw-score merging can show good retrieval performance. It seems that the raw-score approach might be a valid first attempt at merging result lists that are produced by the same retrieval model. In [CLC95] it was suggested that collection fusion based on raw TF values is a valuable approach when the involved databases are more or less homogeneous; the retrieval quality then degrades by only 10%. However, we assume topically organized collections, which have highly skewed statistics.

The most effective collection fusion methods are score normalization techniques, which are based on consistent global collection statistics. All search engines must produce a document's relevance score using the same retrieval algorithms, including the document ranking algorithm, stemming method, and stopword list. A metasearch engine collects all required local statistics from the selected databases before or during query time. Notice that
