Result Merging in a Peer-to-Peer Web Search Engine. Sergey Chernov


Result Merging in a Peer-to-Peer Web Search Engine
Sergey Chernov
UNIVERSITÄT DES SAARLANDES
January, 2005

Result Merging in a Peer-to-Peer Web Search Engine
A thesis submitted in partial fulfillment of the requirement for the degree of Master of Science in Computer Science
Submitted by Sergey Chernov
under the guidance of Prof. Dr-Ing. Gerhard Weikum, Christian Zimmer
UNIVERSITÄT DES SAARLANDES
January, 2005

Abstract

The tremendous amount of information on the Internet requires powerful search engines. Currently, only commercial centralized search engines like Google can process terabytes of Web documents. Such approaches fail to index the Hidden Web located in intranets and local databases, and with the exponential growth of the information volume the situation becomes even worse. Peer-to-Peer (P2P) systems can be used to extend the current search capabilities. The Minerva project is a Web search engine based on a P2P architecture. In this thesis, we investigate the effectiveness of different result merging methods for the Minerva system. Each peer provides an efficient search engine for its own focused Web crawl. Each peer can pose a query against a number of selected peers; the selection is based on a database ranking algorithm. The top-k results from several highly ranked peers are collected by the query initiator and merged into a single list. We address the problem of result merging. We select several merging methods that are feasible for use in a heterogeneous, dynamic, distributed environment. An experimental framework for these methods was implemented, and the effectiveness of the merging techniques was studied with the TREC Web data. The language modeling based ranking method produced the most robust and accurate results under the different conditions. We also propose a new merging method, which incorporates a preference-based language model. The novelty of the method is that the preference-based language model is obtained from pseudo-relevance feedback on the best peer in the database ranking. In every tested setup, the new method was at least as effective as the baseline or slightly better.

I hereby declare that this thesis is entirely my own work and that I have not used any other media than the ones mentioned in the thesis.

Saarbrücken, 12th January, 2005
Sergey Chernov

Acknowledgements

I would like to thank my academic advisor, Professor Gerhard Weikum, for his guidance and encouragement throughout my master's thesis project. I wish to express my sincere gratitude to my supervisors Christian Zimmer, Sebastian Michel, and Matthias Bender for their invaluable assistance and feedback. I would like to thank Kerstin Meyer-Ross for her continuous support in everything. I am very grateful to the members of the Databases and Information Systems group AG5, fellow students from the IMPRS program, and all my friends from the Max-Planck Institute, who provided me with a friendly and stimulating environment. I would like to extend special thanks to Pavel Serdyukov and Natalie Kozlova for the numerous discussions and helpful ideas. It is difficult to explain how grateful I am to my mother, Galina Nikolaevna, and my father, Alevtin Petrovich; their wisdom and care made it possible for me to study. Finally, I want to thank the one person who was most supportive and patient during this process, my dear wife Olga. I would never have accomplished this work without her love.

Contents

1 Introduction: Motivation; Our contribution; Description of the remaining chapters
2 Web search and Peer-to-Peer Systems: Information retrieval basics; Web search engines; Peer-to-Peer architecture; P2P Web search engines; Minerva project; Summary
3 Result merging in distributed information retrieval: Distributed information retrieval in general; Result merging problem; Prior work on collection fusion (Collection fusion properties; Cooperative environment; Uncooperative environment; Learning methods; Probabilistic methods); Prior work on the data fusion (Data fusion properties; Basic methods; Mixture methods; Metasearch approved methods); Summary
4 Selected result merging strategies: Target properties for result merging methods; Score normalization with global IDF; Score normalization with ICF; Score normalization with CORI; Score normalization with language modeling; Score normalization with raw TF scores; Summary
5 Our approach: Result merging with the preference-based language model; Discussion; Summary
6 Implementation: Global statistics classes; Testing components
7 Experiments: Experimental setup; Collections and queries; Database selection algorithm; Evaluation metrics; Experiments with selected result merging methods; Result merging methods; Merging results; Effect of limited statistics on the result merging; Experiments with our approach; Optimal size of the top-n; Optimal smoothing parameter β; Summary
8 Conclusions and future work: Conclusions; Future work
Bibliography
A Test queries

List of Figures

2.1 The Minerva system architecture
Simple metasearch architecture
A query processing scheme in the distributed search system
Collection fusion vs. data fusion
An overlapping in the collection fusion problem
Statistics propagation for the collection fusion
Data fusion on a single search engine
Main classes involved in merging
A general view on the experiments implementation
The macro-average precision with the database ranking RANDOM
The macro-average recall with the database ranking RANDOM
The macro-average precision with the database ranking CORI
The macro-average recall with the database ranking CORI
The macro-average precision with the database ranking IDEAL
The macro-average recall with the database ranking IDEAL
The macro-average precision of the LM04 result merging method with the different database rankings
The macro-average precision with the database ranking CORI with the global statistics collected over the 10 selected databases
The macro-average precision with the database ranking IDEAL with the global statistics collected over the 10 selected databases
The macro-average precision with the database ranking CORI with the different size of top-n for the preference-based model estimation
7.11 The macro-average precision with the database ranking IDEAL with the different size of top-n for the preference-based model estimation
The macro-average precision with the database ranking CORI with the top-10 documents for the preference-based model estimation with β = 0.6 and LM04 result merging method
The macro-average precision with the database ranking IDEAL with the top-10 documents for the preference-based model estimation with β = 0.6 and LM04 result merging method
A.1 Relevant documents distribution for the TF method with the IDEAL ranking
A.2 Relevant documents distribution for the LM04PB06 method with the IDEAL ranking
A.3 Residual between the number of relevant documents of the CE06LM04 and TF methods with the IDEAL database ranking

List of Tables

4.1 The target properties of the result merging methods
Topic-oriented experimental collections
The difference in percent of the average precision between the result merging strategies and corresponding baselines with the RANDOM ranking. The LM04 technique is compared with the SingleLM method; all others are compared with the SingleTFIDF approach
The difference in percent of the average precision between the result merging strategies and corresponding baselines with the CORI ranking. The LM04 technique is compared with the SingleLM method; all others are compared with the SingleTFIDF approach
The difference in percent of the average precision between the result merging strategies and corresponding baselines with the IDEAL ranking. The LM04 technique is compared with the SingleLM method; all others are compared with the SingleTFIDF approach
A.1 The topic-oriented set of the 25 experimental queries (topics are coded as HM for Health and Medicine, and NE for Nature and Ecology)
A.2 The number of relevant documents for the TF and LM04PB06 methods with the IDEAL database ranking (the LM04PB06 name is shortened to LMPB for convenience)

Chapter 1

Introduction

1.1 Motivation

Millions of new documents are created every day across the Internet, and even more are changed. The amount of information grows exponentially, and a search engine is the only hope of finding the documents that are relevant to a particular user's need. Routine access to information is now based on full-text information retrieval instead of controlled-vocabulary indices. A huge number of people use the Web for text and image search on a regular basis. In this thesis, we consider the problems of search on text data.

The need for effective search tools becomes more important every day, but currently only a few centralized search engines like Google can cope with this task, and they are only partially effective. The so-called Hidden Web consists of all intranets and local databases behind portal pages. According to the estimate in [SP01], it is 2 to 50 times larger than the Visible Web, which can be crawled by search robots. Taking into account that even the largest Google crawl of more than 8 billion pages encompasses only a part of the Visible Web, we can imagine how many potentially relevant pages a centralized search engine does not consider during a search. This problem comes from the technical limitations of a single search engine.

The desire to overcome the limitations of a single search engine established a new research direction, distributed information retrieval, or metasearch; we use both terms as synonyms. The main technique developed in this field is an intermediate broker called a metasearch engine. It has access to the query interfaces of the individual search engines and text databases. Briefly, when a metasearch engine receives a user's query, it passes the query to a set of appropriate individual search engines and databases. Then it collects the partial results and combines them to improve the overall result. Numerous examples of metasearch engines are available on the Web.

The metasearch approach contains several significant sub-problems, which arise in the query execution process. The database selection problem arises when a query is routed from a metasearch engine to the individual search engines. A naive routing approach is to propagate the query to all available engines. The scalability of such a strategy is unsatisfactory, since it is inefficient to ask more than several dozen servers. The database selection process helps to discover a small number of the most useful databases for the current query and to ask only this limited subset without significant loss in recall. Many database selection methods have been developed to tackle this issue [Voo95, CLC95, YL97, GGMT99, Cra01, SJCO02].

The result merging problem is another important sub-problem of the metasearch technique. In information retrieval, the output is a ranked list of documents sorted by their similarity scores. In distributed information retrieval, this list is obtained by merging several result lists into one. The result merging problem is not trivial, and numerous merging techniques have been studied in the literature [CLC95, TVGJL95, Bau99, Cra01, SJCO02].

The issue of automatic database discovery has not been fully addressed, so adding new data sources to a metasearch engine mainly remains a manual task. The major drawback of metasearch is that large search engines are not interested in cooperation. A search result is a commercial product, which they want to sell on their own. For example, the STARTS proposal [GCGMP97] is a quite effective communication protocol designed especially for metasearch, but it is not widely used for this reason.

The new Peer-to-Peer (P2P) technologies can help us remove the limitations caused by the uncooperativeness of search engine vendors. The computational power of processors increases every year, and so does the network bandwidth.

Millions of personal computers have enough storage and computational resources to index their own documents and perform small crawls of the interesting fragments of the Web. They can provide a search on their local index, but do not have to reveal the data itself unless they want to. This is a way to incorporate Hidden Web pages into a global search mechanism. Collaborative crawling can span a larger portion of the Web, since every peer can contribute its own focused crawl to the system. This method is cheap and provides topic-oriented search opportunities; we can also use intellectual input from other users to improve our own search. Such considerations launched the Minerva project, a new P2P Web search engine.

The metasearch field has many properties in common with search in a P2P system, but some important distinctions should be taken into account. A P2P environment is much more dynamic than traditional metasearch:

- Queries are processed on millions of small indices instead of dozens of large indices;
- Global query execution might require resource sharing and collaboration among different peers and cannot be fully performed on one peer;
- Limited statistics are a necessary requirement for a scalable P2P system, while in distributed information retrieval rich statistics can be provided by a centralized source;
- Cooperativeness of peers in a P2P system, in contrast to a metasearch setting, helps to reduce heterogeneity in such parameters as representative statistics or index update time.

Distributed information retrieval accommodates features from two research areas: information retrieval and distributed systems. The goal of effectiveness is inherent in the former: it aims at high relevance of the returned documents, and the collaboration of users in a P2P setting gives us additional opportunities to refine the search results. The main goals of the Minerva project include the traditional metasearch goals and new issues:

1. Increased search coverage of the Web;
2. Retrieval effectiveness comparable with a centralized search engine;
3. Scalable architecture to combine millions of small search engines.

For this purpose, we want to exploit existing solutions from distributed information retrieval and adapt them to our new setup, with the aforementioned distinctive properties in mind. We also want to find novel methods that are suitable for a P2P architecture and can improve our system. The practical goal is to create a prototype of a highly scalable, effective, and efficient distributed P2P Web search engine.

1.2 Our contribution

The main purpose of this thesis is to develop an effective result merging method for the Minerva system. We analyze the major sub-problems of result merging and review several existing techniques. The selected methods have been implemented and evaluated in the Minerva prototype. In addition, a new preference-based language model method for result merging is proposed. Our approach combines the preference-based ranking and the result merging ranking. The novelty of the method is that the preference-based language model is obtained from pseudo-relevance feedback on the best peer in the database ranking.

We address the issue of effectiveness, which is determined by the underlying result merging scheme. As in distributed information retrieval, in a P2P system the similarity scores for each document are computed on the basis of the local database statistics. This makes the scores incomparable, because the statistics differ between databases. A score computation based on global statistics is the most accurate solution in our case. For cooperative data sources, as we have in the Minerva system, we can collect the local database-dependent statistics and replace them with globally estimated ones, which are fair for all databases. We elaborated on this issue by testing several global score computation techniques and identifying the most effective scoring function.
We also exploited additional information about the user's preferences in order to improve the quality of the final ranking. Our method combines two rankings. The first ranking is produced by the language modeling result merging scheme. The second one is based on a language model obtained from pseudo-relevance feedback. The user preferences are inferred using pseudo-relevance feedback: the top-k results from the best-ranked database are assumed to be relevant. The novelty of our method is that the pseudo-relevance feedback is obtained on the top-ranked peer before the global query execution.

1.3 Description of the remaining chapters

Background information about information retrieval and P2P systems is presented in Chapter 2. An overview of distributed information retrieval and recent work on result merging is given in Chapter 3. Chapter 4 presents details of the merging techniques that we select for our experimental studies. Chapter 5 contains the new approach, which uses the preference-based language model. In Chapter 6, we present implementation details. The experimental setup, evaluation methodology, and our results are presented in Chapter 7. Chapter 8 concludes the thesis with conclusions and suggestions for future work.

Chapter 2

Web search and Peer-to-Peer Systems

In this chapter, we give a short description of Web search, P2P systems, and their potential as a platform for distributed information retrieval. Section 2.1 contains introductory information about information retrieval. In Section 2.2, we review Web search engines. In Section 2.3, some general properties of P2P systems are discussed. Section 2.4 presents recent approaches for combining search mechanisms with a P2P architecture. Section 2.5 describes our approach, the Minerva project.

2.1 Information retrieval basics

Information retrieval deals with search engine architectures, algorithms, and methods for searching information on the Internet, in digital libraries, and in text databases. The main goal is to find the documents in a collection that are relevant to a query. The documents are preprocessed and placed into an index, which provides the basis for retrieval. A typical search engine is based on the single-database model of text retrieval. In this model, the documents from the Web and local databases are collected into a centralized repository and indexed. The model is effective if the index is large enough to satisfy most of the users' information needs and the search engine uses an appropriate retrieval system. A retrieval system is a set of retrieval algorithms for different purposes: ranking, stemming, index processing, relevance feedback, and so on.

The widely used bag-of-words model assumes that every document may be represented by the words contained in it. The most frequent words, like "the", "and", or "is", do not have rich semantics. They are called stopwords, and we remove them from the document representation. The full set of stopwords is stored in a stopword list. Word variations with the same stem, like "run", "runner", and "running", are mapped into one term that corresponds to the stem; a stemming algorithm performs this mapping. In this example, the term is "run".

An important characteristic of a retrieval system is its underlying model of the retrieval process. This model specifies how to estimate the probability that a document will be judged relevant. The final document ranking is based on this estimate and is presented to the user after query execution. The simple retrieval models include the probabilistic model and the vector space model; the latter is the most widely used in search engines.

In the vector space model, a document D is represented by the vector $d = (w_1, w_2, \ldots, w_m)$, where $w_i$ is the weight indicating the importance of term $t_i$ in representing the semantics of the document, and $m$ is the number of distinct terms. For all terms that do not occur in the document, the corresponding entries are zero, so the full document vector is very sparse. When a term occurs in the document, two factors are important for the weight assignment. The first factor is the term frequency TF, the number of occurrences of the term in the document. The weight of the term in the document's vector is proportional to TF: the more often a term occurs, the more important it is in representing the document's semantics. The second factor that affects the weight is the document frequency DF, the number of documents that contain the term.
The term weight is multiplied by the inverse document frequency IDF. The more documents a term appears in, the less useful it is for discriminating between the documents that contain the term and those that do not. The standard for term weighting is the TF·IDF product and its modifications.

A simple query Q is a set of keywords. It is also transformed into an m-dimensional vector $q = (w_1, w_2, \ldots, w_m)$ using the same preprocessing steps: stopword elimination, stemming, and term weighting. After q and d are created, the similarity between the document vector and the query vector is estimated. This estimate is based on a similarity function, which can be a distance or an angle measure. The most popular similarity function is the cosine measure, which is computed as the scalar product of q and d divided by the product of their lengths.

Another popular approach, which tries to overcome the heuristic nature of term weight estimation, comes from the probabilistic model. The language modeling approach [PC98] to information retrieval attempts to predict the probability of generating a query given a document. Although the details may differ, the main idea is the following: every document is viewed as a sample generated from a special language. A language model for each document can be estimated during indexing. The relevance of a document to a particular query is formulated as the likelihood that the query was generated from the language model of that document. The likelihood of the query Q being generated from the language model of the document D is computed as follows [SJCO02]:

$$P(Q \mid D) = \prod_{i=1}^{|Q|} \left( \lambda \cdot P(t_i \mid D) + (1 - \lambda) \cdot P(t_i \mid G) \right) \qquad (2.1)$$

where:
$t_i$ is a query term in the query Q;
$P(t_i \mid D)$ is the probability for $t_i$ to appear in the document D;
$P(t_i \mid G)$ is the probability for the term $t_i$ to be used in the common language model, e.g. in English;
$\lambda$ is the smoothing parameter between zero and one.

The role of $P(t_i \mid G)$ is to smooth the probability that the document D generates the query term $t_i$, particularly when $P(t_i \mid D)$ is equal to zero.

The usual measures for evaluating retrieval effectiveness are recall and precision. They are defined as follows [MYL02]:

$$\text{recall} = \frac{\text{number of retrieved relevant documents}}{\text{number of relevant documents}} \qquad (2.2)$$

$$\text{precision} = \frac{\text{number of retrieved relevant documents}}{\text{number of retrieved documents}} \qquad (2.3)$$

The effectiveness of a text retrieval system is evaluated using a set of test queries. The relevant document set is identified beforehand. For every test query, a precision value is computed at different levels of recall; these values are averaged over the whole query set, and an average recall-precision curve is produced. In the ideal case, when a system retrieves exactly the full set of relevant results every time, the recall and precision values are both equal to 1. In practice, we cannot achieve such effectiveness because of query ambiguity, each user's specific understanding of relevance, and other factors. Incorporating explicit user feedback and user preferences inferred implicitly from previous search sessions can improve the retrieval quality.

2.2 Web search engines

An information retrieval system for Web pages is called a Web search engine. The capabilities of these systems are very broad; modern techniques allow queries on text, image, and sound files. In our work, we consider the problem of text data retrieval. Web search engines are also differentiated by their application area. General-purpose search engines can search across the whole Web, while special-purpose engines concentrate on specific information sources or specific subjects. We are interested in general-purpose Web search engines.

Web search engines inherited many properties from traditional information retrieval. Every Web search engine has a text database or, equivalently, a document collection that consists of all documents searchable by this engine. An index for these documents is created before query time; every term in it represents a single keyword or phrase. For each term, one inverted index list is constructed. This list contains the identifiers of all documents that contain the term, along with the corresponding similarity values. During query execution, a search engine takes a union over the inverted index lists corresponding to the query terms. Then the search engine sorts all documents found in descending order of their similarity scores and presents the resulting ranking to the user.
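The query execution steps described above (one inverted list per term, a union over the lists of the query terms, and a sort by descending score) can be illustrated with a minimal sketch. It is not taken from the thesis: the function names `build_index` and `search`, the whitespace tokenizer, the toy documents, and the use of raw term frequency as the stored similarity value are all assumptions made for illustration.

```python
from collections import defaultdict

def build_index(docs):
    """Map each term to an inverted list {doc_id: score}.
    Assumption: the stored score is the raw term frequency."""
    index = defaultdict(dict)
    for doc_id, text in docs.items():
        for term in text.lower().split():
            index[term][doc_id] = index[term].get(doc_id, 0) + 1
    return index

def search(index, query, k=10):
    """Union the inverted lists of the query terms, sum the scores per
    document, and return the k best documents in descending score order."""
    scores = defaultdict(float)
    for term in query.lower().split():
        for doc_id, score in index.get(term, {}).items():
            scores[doc_id] += score
    # Ties are broken by document identifier to keep the output stable.
    return sorted(scores.items(), key=lambda item: (-item[1], item[0]))[:k]

# Hypothetical toy collection.
docs = {
    "d1": "peer to peer web search",
    "d2": "web search engine",
    "d3": "peer network",
}
index = build_index(docs)
results = search(index, "peer search")
```

A real engine would store TF·IDF or language model scores in the lists and would avoid scanning every list completely, but the union-then-sort structure stays the same.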
There are also distinctive features of Web retrieval that were not used in traditional information retrieval. The most prominent examples are the hyperlink relationships between documents and intensive document tagging. These differences can serve as sources of additional information for search refinement, and they are exploited in different retrieval algorithms.

Web developers created a significant portion of the hyperlinks on the Web manually, and this is an implicit intellectual input. The link structure can be used as expert evidence that two pages connected by a hyperlink are also semantically related. It can also be an indication that the Web designer who placed the hyperlinks on some pages assesses the content of their targets as valuable. Several algorithms are based on these considerations. The PageRank algorithm computes the global importance of a Web page in a large Web graph, which is inferred from the set of crawled pages [BP98]. The advantage of this algorithm is that the global importance of a page can be precomputed before query execution time. The HITS algorithm [Kle99] uses only a small subset of the Web graph; this subset is constructed at query time. Such on-line computation is inconvenient for search engines with a high query workload, but it allows a topic-oriented computation of page authority.

The HTML tagging of documents can also be of use in Web search engines. Rich information about the importance of terms is inferred from their position in a document. Terms in the title section are more important than terms in the body of the document. Emphasis through font size and style also indicates additional importance of a term. Sophisticated term weighting schemes based on these observations improve the retrieval quality.

There are several important limitations of the existing Web search engines. The first restriction concerns the size of the searchable index. According to the Google statistics, this search engine has the largest crawled index on the Web; its current size is about 8 billion pages.
At the same time, the Hidden Web or Deep Web, which embraces the pages that were excluded from crawling for commercial or technical reasons, is about 2 to 50 times larger than the Visible Web [SP01]. Even now it is unrealistic for a single search engine to maintain an index of this size, and the information volume increases even faster than the computational power of the centralized Web search engines. The second problem is that the crawled information becomes outdated. News pages change daily, and it is impossible to update the whole index at this rate. Some updating strategies help to track changes on the most popular sites on the Internet, but many index entries are completely outdated. The novel opportunities provided by peer-to-peer systems help to solve these problems.

2.3 Peer-to-Peer architecture

A distributed system is a collection of autonomous computers that cooperate in order to achieve a common goal [Cra01]. In the ideal case, a user of such a system does not explicitly notice the other computers, their location, storage replication, load balancing, reliability, or functionality. A P2P system is an instance of a distributed system; it is a decentralized, self-organized, highly dynamic, loose coupling of many autonomous computers.

P2P systems became famous several years ago with the Napster and Gnutella file-sharing systems. In file-sharing P2P communities, every computer can join as a peer using the client program. Other peers can access all resources shared by the peers in this environment. The main feature of such systems is that a peer who is looking for a file can directly contact the peer that is sharing this file. The only information that has to be propagated is the peer's address and a short description of the shared data. The first systems, like Napster, used a centralized server with all peers' addresses and the names of the shared files. Other approaches avoided a single point of failure and used the Gnutella-style flooding protocol. It successively broadcasts a request for a particular file through a small number of closest neighbors until the message expires. Modern P2P applications like e-donkey are extremely popular now; they have numerous improvements over their predecessors. In this way, we can harness the power of thousands of autonomous personal computers all over the world to create a temporary community for collaborative work.
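The Gnutella-style flooding mentioned above can be sketched as a breadth-first broadcast with a time-to-live (TTL) counter. This is only an illustration of the general idea, not the actual Gnutella protocol: the function `flood_search`, the adjacency-list representation of the overlay, and the toy network are assumptions.

```python
from collections import deque

def flood_search(neighbors, files, start, wanted, ttl):
    """Forward a request from peer to peer until the TTL is used up.
    Returns the set of peers that share the wanted file and were reached."""
    hits = set()
    seen = {start}                 # peers that already received the request
    queue = deque([(start, ttl)])
    while queue:
        peer, remaining = queue.popleft()
        if wanted in files.get(peer, ()):
            hits.add(peer)
        if remaining == 0:
            continue               # the message has expired at this peer
        for nxt in neighbors.get(peer, ()):
            if nxt not in seen:
                seen.add(nxt)
                queue.append((nxt, remaining - 1))
    return hits

# Hypothetical overlay: A - B - D - E, and A - C.
neighbors = {"A": ["B", "C"], "B": ["A", "D"], "C": ["A"],
             "D": ["B", "E"], "E": ["D"]}
files = {"C": {"song.mp3"}, "E": {"song.mp3"}}

near = flood_search(neighbors, files, "A", "song.mp3", ttl=2)
far = flood_search(neighbors, files, "A", "song.mp3", ttl=3)
```

With a TTL of 2 the request expires before it reaches peer E, so only C is found; raising the TTL to 3 reaches both copies at the cost of more messages. This trade-off between coverage and traffic is what limits the scalability of flooding.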
P2P technology aims to make systems scalable, self-organized, fault-tolerant, publicly available, and load-balanced. This list of desirable P2P properties is not exhaustive, and there are also issues like anonymity, security, etc., but the selected properties are fundamental for our task. For example, modern P2P systems are often based on a mixed topology in which some super-peers establish different levels of hierarchy, but we are interested in a pure, flat P2P structure. It gives equal rights to all peers and makes a system more scalable.

The limited search capability is a considerable drawback of most P2P systems. Sometimes you have to know the exact filename of the data of interest, or you will miss the relevant results. Combining search engine mechanisms for effective retrieval with the powerful paradigm of a P2P community is a promising research direction.

2.4 P2P Web search engines

The idea of a Peer-to-Peer Web search engine is being investigated extensively. Interesting combinations of search services with P2P platforms are described in the following approaches.

ODISSEA [SMW+03] is different from many other P2P search approaches. It assumes a two-layered search engine architecture and a global index structure distributed over the nodes of the system. Under a global index organization, in contrast to a local one, a single node holds the entire inverted index for a particular term. The distributed version of Fagin's threshold algorithm is used for result aggregation over the inverted lists. It is efficient only for very short queries of about 2-3 words. For the distributed hash table (DHT) implementation, this system uses the Pastry protocol.

PlanetP [CAPMN03] is another content search infrastructure. Each node maintains an index of its content and summarizes the set of terms in its index using a Bloom filter. The global index is the set of all summaries. Summaries are propagated and kept synchronized using a gossiping algorithm. This approach is effective for several thousand peers, but it is not scalable. Its retrieval quality is rather low for top-k queries with a small k.

GALANX [WGDW03] is a P2P system that is implemented on top of BerkeleyDB.
Similar to the Minerva system, it maintains a local peer index on every node and uses a DHT to distribute information about which terms are present on which peer. Different query routing strategies are evaluated in simulation. Most of them are based on the Chord protocol, and the proposed strategies improve on the basic effectiveness by enlarging the index. The presented query routing approaches are not highly scalable, since the index volume increases continuously with the number of peers in the system.

2.5 Minerva project

The Minerva project [BMWZ04] is another Web search engine based on a P2P architecture; see Figure 2.1. In this system, every peer $P_i$ provides an efficient search engine for its own focused Web crawl $C_i$. The documents $D_{ij}$ are indexed locally, and the result is posted into a global directory as a set of index statistics $S_i$. The posting process and all other communication between the peers are based on the Chord protocol [SMK+01]. Every peer holds a set of peerlists $L_i$ for a disjoint subset of terms $T_i$, where $\bigcup_i T_i = T$. A peerlist $l$ is a mapping $t \mapsto P$, where $t$ is a particular term and $P$ is the subset of peers that contain at least one document with this term. The terms are hashed, and their corresponding peerlists are distributed fairly across the peers by the Chord protocol. During query execution, all necessary peerlists, one for each query keyword, are obtained and merged into one.

Figure 2.1: The Minerva system architecture

Every peer can pose a query against a number of selected peers that are most likely to contain relevant documents. The selection is based on a query routing strategy; this issue is known in the literature as the database selection problem. The search engine on every selected peer processes its inverted index until it obtains the top-k highest-ranked documents for the current query. The best top-k results from these peers are then collected by the query initiator and merged into one top-k list; this task is known as the result merging problem. The quality of the final top-k list depends heavily on the term weighting scheme on the peers and on the merging algorithm, whereas speed depends mostly on the local index processing scheme.

2.6 Summary

In this chapter, we introduced several basic concepts from information retrieval and Web search. We described some key ideas of P2P systems and reviewed several combinations of Web search engines with a P2P platform. A short description of the new P2P Web search engine Minerva was also provided. Scalability is recognized as an extremely important issue. A P2P architecture seems valuable for effective and efficient retrieval.
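The peerlist directory described in Section 2.5 can be pictured with a small sketch. It is only an illustration under stated assumptions: the class `Directory`, the function `responsible_peer`, and the modulo placement are hypothetical stand-ins (Chord itself uses consistent hashing on an identifier ring), and merging the per-term peerlists by set union is one possible reading of "merged into one".

```python
import hashlib
from collections import defaultdict


def responsible_peer(term, num_peers):
    """Pick the peer that stores the peerlist of `term`.

    Assumption: a plain hash modulo the number of peers stands in for a
    Chord lookup; it only shows that terms are spread across peers.
    """
    digest = hashlib.sha1(term.encode("utf-8")).hexdigest()
    return int(digest, 16) % num_peers


class Directory:
    """Toy global directory: each peer stores peerlists for its share of terms."""

    def __init__(self, num_peers):
        self.num_peers = num_peers
        # store[p][term] is the peerlist for `term`, held by directory peer p
        self.store = [defaultdict(set) for _ in range(num_peers)]

    def post(self, peer_id, terms):
        """A peer announces the terms that occur in its local index."""
        for term in terms:
            holder = responsible_peer(term, self.num_peers)
            self.store[holder][term].add(peer_id)

    def peerlist(self, term):
        """Return the set of peers with at least one document containing `term`."""
        holder = responsible_peer(term, self.num_peers)
        return self.store[holder].get(term, set())

    def candidates(self, query_terms):
        """Fetch one peerlist per query keyword and merge them (here: union)."""
        merged = set()
        for term in query_terms:
            merged |= self.peerlist(term)
        return merged
```

A query initiator would call `candidates` first and then rank the returned peers with a database selection method before forwarding the query.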

Chapter 3

Result merging in distributed information retrieval

In this chapter, we review recent work on distributed information retrieval. In Section 3.1, we give a short overview of general metasearch issues. Section 3.2 contains a comprehensive description of the result merging task. In Section 3.3, we elaborate on the collection fusion task. In Section 3.4, we address the problems of the data fusion task.

3.1 Distributed information retrieval in general

During the past ten years, a new research direction has emerged: distributed information retrieval, or metasearch. Metasearch is the task of collecting and combining search results from a set of different sources. A typical scenario includes several search engines that execute a query and one metasearch engine that merges the results and creates a single ranked document list. Several interesting surveys of distributed information retrieval problems and solutions are presented in [Cal00, Cra01, Cro00, MYL02].

The distributed information retrieval task arises when the documents of interest are spread across many sources. In such a situation, it might be possible to collect all documents on one server or to establish multiple search engines, one for each collection of documents. In the latter case, the search process is performed across the network, with communication between many servers. This is a distinctive feature of distributed information retrieval.

Search in a distributed environment has several attractive properties that make it preferable to single-engine search. Several of these important features are listed in [MYL02]:

- Increased coverage of the Web: the indices from many sources are used in one search;
- A solution to the problem of search scalability: a combination is cheaper than a centralized solution;
- Automation of result preprocessing and combining: a user does not have to compare and combine the results from different sources manually;
- Improved retrieval effectiveness: the combination of different search engines can produce a better ranking than any single ranking algorithm.

Metasearch is based on the multi-database model, in which several text databases are modeled explicitly. The multi-database model for information retrieval has many tasks in common with the single-database model but also has some additional problems [Cro00]:

- The resource description task;
- The database selection task;
- The result merging task.

These issues are essentially the core of distributed information retrieval research; we briefly describe them below.

The main unit in metasearch is an intermediate broker called a metasearch engine. It obtains and stores a limited summary of every database participating in the search process and decides which databases are most appropriate for a query. The metasearch engine also propagates the query to the selected search engines and collects and reorganizes the results. A simple metasearch architecture is presented in Figure 3.1. A user poses a query $Q$ against a metasearch engine, which in turn propagates it to several search engines. The result rankings $R_i$ are then returned to the broker, merged, and presented to the user as a single document ranking $R_m$.

Figure 3.1: Simple metasearch architecture

The summary statistics from a search engine are called a resource description or database representative. A full-text database provides information about its contents as a set of statistics. These may include the number of occurrences of specific terms in particular documents or in the whole collection, the number of indexed documents, etc. The information for building a resource description is obtained during the index creation step. The richness of the database representatives depends on the level of cooperation in the system. For example, the STARTS standard [GCGMP97] is a good choice for a cooperative environment, where all search engines present their results in a unified, informative format. On the other hand, when the search engines are unwilling to cooperate, we can infer their statistics from query-based sampling [SC03].

The collected resource descriptions are used for the database selection, or query routing, task. In practice, we are not interested in databases that are unlikely to contain relevant documents. Therefore, we select from all data sources only those that are probably relevant to our query according to their resource descriptions. For each database, we calculate a usefulness measure, which is usually based on the vector space model. Creating an effective and robust usefulness measure for database ranking is the most prominent task of database selection. Several attempts to address this problem are described in [Voo95, CLC95, YL97, GGMT99, Cra01, SJCO02].

The result merging problem arises when a query is executed on several selected databases and we want to create a single ranking out of their results. This problem is not trivial, since the similarity score between a document and a query is computed from local collection statistics. Therefore, the scores are not directly comparable. The most accurate solution could be obtained by a global score normalization, which requires cooperation from the sources. We are especially interested in this latter problem. A carefully designed result merging algorithm can provide high-quality results and give us an opportunity to speed up local index processing. More information about result merging methods can be found in [CLC95, TVGJL95, Bau99, Cra01, SJCO02, SC03].

Figure 3.2: A query processing scheme in the distributed search system

The query processing scheme is presented more precisely in Figure 3.2 [Cra01]. A query $Q$ is posed on the set of search engines, which are represented by their resource descriptions $S_i$. The metasearch engine selects a subset of servers $S$ that are most likely to contain relevant documents. The size of this subset usually does not exceed 10 databases. The broker routes $Q$ to the selected search engines $S_i$ and obtains a set of document rankings $R$ from the selected servers. In practice, a user is interested only in the top-k best results, where k can vary from 5 to 30. All rankings $R_i$ are merged into one ranking $R_m$, and the top-k results from it are presented to the user.

Text retrieval aims at high relevance of the results at minimum response time. These two components translate into the general issues of effectiveness, or quality, and efficiency, or speed, of query processing. This thesis is concerned with the effectiveness of result merging.

3.2 Result merging problem

A common issue in metasearch is how to combine several ranked lists of relevant documents from different search engines into one ranked list. This is the so-called result merging problem. The following section reviews some modern merging methods.

Result merging is divided into two main sub-problems. The first is collection fusion, where the results are merged from disjoint or nearly disjoint document sets. The second is data fusion, which arises when we merge different rankings obtained on identical document sets. The main difference between collection fusion and data fusion is that, in the first case, we want to approximate the result of a single search system whose document set consists of all the document subsets involved in the merging. Therefore, the optimal solution is to obtain the very same retrieval effectiveness as a search engine with the united database. In the data fusion problem, however, the task is to merge the different rankings in such a way that the final ranking is better than every participating ranking. The maximum quality of the result is undefined here, but it should be no less than the quality of the best single ranking. A simple intuition for these two problems is presented in Figure 3.3. A comprehensive description of the differences between collection fusion and data fusion can be found in [VC99b, Mon02].

In metasearch, we often do not know beforehand what kind of merging problem we have, because it depends on the level of overlap between the documents of the combined databases.
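As a rough illustration of what a "level of overlap" could look like in practice, the sketch below measures the overlap of two result lists rather than of whole databases. The Jaccard measure, the function names, and the threshold of 0.5 are all assumptions made for this example; the text does not prescribe any of them. Documents are assumed to be identified by a shared id such as a URL.

```python
def result_overlap(ranking_a, ranking_b):
    """Jaccard overlap of two result lists given as sequences of document ids."""
    docs_a, docs_b = set(ranking_a), set(ranking_b)
    union = docs_a | docs_b
    if not union:
        return 0.0  # two empty lists share nothing
    return len(docs_a & docs_b) / len(union)


def closer_to(ranking_a, ranking_b, threshold=0.5):
    """Label the merging situation; the threshold is an arbitrary assumption."""
    if result_overlap(ranking_a, ranking_b) >= threshold:
        return "data fusion"
    return "collection fusion"
```

For instance, the lists `["d1", "d2", "d3"]` and `["d3", "d4", "d5"]` share one of five distinct documents, which this sketch would label as a collection fusion situation.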
If the overlap is very high, the situation is closer to data fusion; otherwise, it is a collection fusion task. Metasearch on the Web has been addressed mainly as a collection fusion problem. In fact, the overlap of search results across different search engines is surprisingly low. However, some approaches also take data fusion methods into account, and sometimes both types are evaluated in mixed setups.

Figure 3.3: Collection fusion vs. data fusion

Another important property is the level of search engine cooperation. We divide all merging methods by environment type into two categories:

- Cooperative (integrated) environment;
- Uncooperative (isolated) environment.

Uncooperative, or isolated, merging methods have no access to the individual databases other than the ranked list of documents returned in response to a query [Voo95]. Cooperative, or integrated, merging techniques assume access to database statistics such as TF, DF, etc. In general, both types of merging methods can produce a more effective result than a single collection with the full set of documents, if a data fusion strategy is used [TVGJL95]. In practice, the merged results produced by uncooperative strategies have been less effective than the single collection run.

Our primary goal is to find a subset of effective merging methods that we can apply and evaluate in the P2P Web search engine Minerva. We assume here that all peers in the Minerva system are cooperative and provide all necessary statistics.
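A small sketch may help to show why access to statistics such as TF and DF matters for merging. It is an illustration only: the TF·IDF formula, the dictionary layout of a peer (`n_docs`, `df`, `top_docs`), and the function names are assumptions for this example and not the scoring used in Minerva. Summing the local document counts and DF values gives exact global statistics only if the collections are disjoint.

```python
import math


def idf(n_docs, df):
    """Illustrative IDF variant; any monotone decreasing function of df would do."""
    return math.log(1.0 + n_docs / df)


def score(term_freqs, query, n_docs, dfs):
    """TF*IDF similarity of one document, using the supplied collection statistics."""
    return sum(term_freqs.get(t, 0) * idf(n_docs, dfs[t])
               for t in query if dfs.get(t, 0) > 0)


def global_statistics(peers):
    """Aggregate cooperative peer statistics (assumes disjoint collections)."""
    n_docs = sum(p["n_docs"] for p in peers)
    dfs = {}
    for p in peers:
        for term, df in p["df"].items():
            dfs[term] = dfs.get(term, 0) + df
    return n_docs, dfs


def merge(peers, query, k):
    """Rescore every returned document with global statistics and keep the top k."""
    n_docs, dfs = global_statistics(peers)
    scored = [(score(tfs, query, n_docs, dfs), doc_id)
              for p in peers
              for doc_id, tfs in p["top_docs"].items()]
    # Highest score first; ties are broken by document id for determinism.
    scored.sort(key=lambda pair: (-pair[0], pair[1]))
    return scored[:k]
```

If a term occurs in 50 of 100 documents on one peer and in 2 of 100 on another, two documents with the same term frequency receive very different local scores, while `merge` assigns them the same globally normalized score.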

3.3 Prior work on collection fusion

A formal definition of the collection fusion problem was stated in [TVGJL95]. It is mixed with the data fusion definition; therefore, we modified it. Assume a set of document collections $C$ associated with the search engines. With respect to the query $Q$, each collection $C_i$ contains a number of relevant documents. After the query $Q$ is posed against the collection $C_i$, the search engine returns the ranked list $R_i$ of documents $D_{ij}$ in decreasing order of their similarity $S_{ij}$ to the query. The top-k results are the merged ranked list of length k containing the documents $D_{ij}$ with the highest similarity values $S_{ij}$ in decreasing order. Consider a document collection $C_g = \bigcup_i C_i$ and its top-k results $R_g$, which contain the documents $D_{gj}$ with similarity values $S_{gj}$. The collection fusion task is: given $Q$, $C$, and k, find from the rankings $R_i$ the top-k results $R_c$ of documents $D_{cj}$ such that $S_{cj} = S_{gj}$.

Collection fusion properties

An ideal collection fusion method combines the documents from the local search results into one ranked list in descending order of their global similarity scores. The global similarity scores are those produced by a single global search system over the united database containing all local documents. In a cooperative environment, where all search engines provide the necessary statistics, we can achieve the same consistent merging as a non-distributed system would produce; this is also known as perfect merging or merging with normalized scores [Cra01]. In practice, no efficient collection fusion technique can guarantee exactly the same ranking as a centralized database containing all documents from all databases involved. Three main factors affect collection fusion:

1. Only the documents returned by the selected servers can participate in the merging. Some relevant documents will be missed after the database selection step.
2. Different statistics and retrieval algorithms cause a separate problem of incomparable scores. Documents may be missed when the top-k results are merged and a necessary document is locally ranked (k+1)th or lower. This problem can be solved by global statistics normalization methods in a cooperative environment.
3. Overlap between the databases; see Figure 3.4. The pure collection fusion approaches [VF95, Kir97, CLC95] do not consider overlap. It is quite difficult to estimate accurately the actual level of document overlap between datasets. Our assumption is that the degradation of result quality due to overlap is small, while the effort required for statistics correction is significant.

Figure 3.4: An overlapping in the collection fusion problem

Cooperative environment

In [SP99, SR00] it was claimed that simple raw-score merging can show good retrieval performance. The raw-score approach seems to be a valid first attempt at merging result lists that are produced by the same retrieval model. In [CLC95] it was suggested that collection fusion based on raw TF values is a valuable approach when the involved databases are more or less homogeneous, and that the retrieval quality degrades by only 10%. However, we assume topically organized collections, which have highly skewed statistics.

The most effective collection fusion methods are score normalization techniques based on consistent global collection statistics. All search engines must produce document relevance scores using the same retrieval algorithms, including the document ranking algorithm, the stemming method, and the stopword list. The metasearch engine collects all required local statistics from the selected databases before or during query time. Notice that

CHAPTER THREE INFORMATION RETRIEVAL SYSTEM

CHAPTER THREE INFORMATION RETRIEVAL SYSTEM CHAPTER THREE INFORMATION RETRIEVAL SYSTEM 3.1 INTRODUCTION Search engine is one of the most effective and prominent method to find information online. It has become an essential part of life for almost

More information

Peer-to-Peer Systems. Chapter General Characteristics

Peer-to-Peer Systems. Chapter General Characteristics Chapter 2 Peer-to-Peer Systems Abstract In this chapter, a basic overview is given of P2P systems, architectures, and search strategies in P2P systems. More specific concepts that are outlined include

More information

Chapter 6: Information Retrieval and Web Search. An introduction

Chapter 6: Information Retrieval and Web Search. An introduction Chapter 6: Information Retrieval and Web Search An introduction Introduction n Text mining refers to data mining using text documents as data. n Most text mining tasks use Information Retrieval (IR) methods

More information

Structured Peer-to-Peer Search to build a Bibliographic Paper Recommendation System

Structured Peer-to-Peer Search to build a Bibliographic Paper Recommendation System Structured Peer-to-Peer Search to build a Bibliographic Paper Recommendation System By Pleng Chirawatkul Supervised by Prof. Dr.-Ing. Gerhard Weikum Sebastian Michel Matthias Bender A thesis submitted

More information

Improving Collection Selection with Overlap Awareness in P2P Search Engines

Improving Collection Selection with Overlap Awareness in P2P Search Engines Improving Collection Selection with Overlap Awareness in P2P Search Engines Matthias Bender Peter Triantafillou Gerhard Weikum Christian Zimmer and Improving Collection Selection with Overlap Awareness

More information

Chapter 27 Introduction to Information Retrieval and Web Search

Chapter 27 Introduction to Information Retrieval and Web Search Chapter 27 Introduction to Information Retrieval and Web Search Copyright 2011 Pearson Education, Inc. Publishing as Pearson Addison-Wesley Chapter 27 Outline Information Retrieval (IR) Concepts Retrieval

More information

Information Retrieval. (M&S Ch 15)

Information Retrieval. (M&S Ch 15) Information Retrieval (M&S Ch 15) 1 Retrieval Models A retrieval model specifies the details of: Document representation Query representation Retrieval function Determines a notion of relevance. Notion

More information

Information Retrieval

Information Retrieval Information Retrieval CSC 375, Fall 2016 An information retrieval system will tend not to be used whenever it is more painful and troublesome for a customer to have information than for him not to have

More information

Mining Web Data. Lijun Zhang

Mining Web Data. Lijun Zhang Mining Web Data Lijun Zhang zlj@nju.edu.cn http://cs.nju.edu.cn/zlj Outline Introduction Web Crawling and Resource Discovery Search Engine Indexing and Query Processing Ranking Algorithms Recommender Systems

More information

Home Page. Title Page. Page 1 of 14. Go Back. Full Screen. Close. Quit

Home Page. Title Page. Page 1 of 14. Go Back. Full Screen. Close. Quit Page 1 of 14 Retrieving Information from the Web Database and Information Retrieval (IR) Systems both manage data! The data of an IR system is a collection of documents (or pages) User tasks: Browsing

More information

WEB SEARCH, FILTERING, AND TEXT MINING: TECHNOLOGY FOR A NEW ERA OF INFORMATION ACCESS

WEB SEARCH, FILTERING, AND TEXT MINING: TECHNOLOGY FOR A NEW ERA OF INFORMATION ACCESS 1 WEB SEARCH, FILTERING, AND TEXT MINING: TECHNOLOGY FOR A NEW ERA OF INFORMATION ACCESS BRUCE CROFT NSF Center for Intelligent Information Retrieval, Computer Science Department, University of Massachusetts,

More information

Assignment 5. Georgia Koloniari

Assignment 5. Georgia Koloniari Assignment 5 Georgia Koloniari 2. "Peer-to-Peer Computing" 1. What is the definition of a p2p system given by the authors in sec 1? Compare it with at least one of the definitions surveyed in the last

More information

In the recent past, the World Wide Web has been witnessing an. explosive growth. All the leading web search engines, namely, Google,

In the recent past, the World Wide Web has been witnessing an. explosive growth. All the leading web search engines, namely, Google, 1 1.1 Introduction In the recent past, the World Wide Web has been witnessing an explosive growth. All the leading web search engines, namely, Google, Yahoo, Askjeeves, etc. are vying with each other to

More information

Mining Web Data. Lijun Zhang

Mining Web Data. Lijun Zhang Mining Web Data Lijun Zhang zlj@nju.edu.cn http://cs.nju.edu.cn/zlj Outline Introduction Web Crawling and Resource Discovery Search Engine Indexing and Query Processing Ranking Algorithms Recommender Systems

More information

Information Retrieval

Information Retrieval Introduction Information Retrieval Information retrieval is a field concerned with the structure, analysis, organization, storage, searching and retrieval of information Gerard Salton, 1968 J. Pei: Information

More information

A Frequent Max Substring Technique for. Thai Text Indexing. School of Information Technology. Todsanai Chumwatana

A Frequent Max Substring Technique for. Thai Text Indexing. School of Information Technology. Todsanai Chumwatana School of Information Technology A Frequent Max Substring Technique for Thai Text Indexing Todsanai Chumwatana This thesis is presented for the Degree of Doctor of Philosophy of Murdoch University May

More information

CS555: Distributed Systems [Fall 2017] Dept. Of Computer Science, Colorado State University

CS555: Distributed Systems [Fall 2017] Dept. Of Computer Science, Colorado State University CS 555: DISTRIBUTED SYSTEMS [P2P SYSTEMS] Shrideep Pallickara Computer Science Colorado State University Frequently asked questions from the previous class survey Byzantine failures vs malicious nodes

More information

Implementation of a High-Performance Distributed Web Crawler and Big Data Applications with Husky

Implementation of a High-Performance Distributed Web Crawler and Big Data Applications with Husky Implementation of a High-Performance Distributed Web Crawler and Big Data Applications with Husky The Chinese University of Hong Kong Abstract Husky is a distributed computing system, achieving outstanding

More information

Addressed Issue. P2P What are we looking at? What is Peer-to-Peer? What can databases do for P2P? What can databases do for P2P?

Addressed Issue. P2P What are we looking at? What is Peer-to-Peer? What can databases do for P2P? What can databases do for P2P? Peer-to-Peer Data Management - Part 1- Alex Coman acoman@cs.ualberta.ca Addressed Issue [1] Placement and retrieval of data [2] Server architectures for hybrid P2P [3] Improve search in pure P2P systems

More information

Predictive Indexing for Fast Search

Predictive Indexing for Fast Search Predictive Indexing for Fast Search Sharad Goel, John Langford and Alex Strehl Yahoo! Research, New York Modern Massive Data Sets (MMDS) June 25, 2008 Goel, Langford & Strehl (Yahoo! Research) Predictive

More information

Semantic Overlay Networks

Semantic Overlay Networks Semantic Overlay Networks Arturo Crespo and Hector Garcia-Molina Write-up by Pavel Serdyukov Saarland University, Department of Computer Science Saarbrücken, December 2003 Content 1 Motivation... 3 2 Introduction

More information

THE WEB SEARCH ENGINE

THE WEB SEARCH ENGINE International Journal of Computer Science Engineering and Information Technology Research (IJCSEITR) Vol.1, Issue 2 Dec 2011 54-60 TJPRC Pvt. Ltd., THE WEB SEARCH ENGINE Mr.G. HANUMANTHA RAO hanu.abc@gmail.com

More information

A Framework for Securing Databases from Intrusion Threats

A Framework for Securing Databases from Intrusion Threats A Framework for Securing Databases from Intrusion Threats R. Prince Jeyaseelan James Department of Computer Applications, Valliammai Engineering College Affiliated to Anna University, Chennai, India Email:

More information

SEMANTIC WEB POWERED PORTAL INFRASTRUCTURE

SEMANTIC WEB POWERED PORTAL INFRASTRUCTURE SEMANTIC WEB POWERED PORTAL INFRASTRUCTURE YING DING 1 Digital Enterprise Research Institute Leopold-Franzens Universität Innsbruck Austria DIETER FENSEL Digital Enterprise Research Institute National

More information

Part I: Data Mining Foundations

Part I: Data Mining Foundations Table of Contents 1. Introduction 1 1.1. What is the World Wide Web? 1 1.2. A Brief History of the Web and the Internet 2 1.3. Web Data Mining 4 1.3.1. What is Data Mining? 6 1.3.2. What is Web Mining?

More information

Scalable overlay Networks

Scalable overlay Networks overlay Networks Dr. Samu Varjonen 1 Lectures MO 15.01. C122 Introduction. Exercises. Motivation. TH 18.01. DK117 Unstructured networks I MO 22.01. C122 Unstructured networks II TH 25.01. DK117 Bittorrent

More information

Enhanced Web Log Based Recommendation by Personalized Retrieval

Enhanced Web Log Based Recommendation by Personalized Retrieval Enhanced Web Log Based Recommendation by Personalized Retrieval Xueping Peng FACULTY OF ENGINEERING AND INFORMATION TECHNOLOGY UNIVERSITY OF TECHNOLOGY, SYDNEY A thesis submitted for the degree of Doctor

More information

CS473: Course Review CS-473. Luo Si Department of Computer Science Purdue University

CS473: Course Review CS-473. Luo Si Department of Computer Science Purdue University CS473: CS-473 Course Review Luo Si Department of Computer Science Purdue University Basic Concepts of IR: Outline Basic Concepts of Information Retrieval: Task definition of Ad-hoc IR Terminologies and

More information

EFFICIENT ATTACKS ON HOMOPHONIC SUBSTITUTION CIPHERS

EFFICIENT ATTACKS ON HOMOPHONIC SUBSTITUTION CIPHERS EFFICIENT ATTACKS ON HOMOPHONIC SUBSTITUTION CIPHERS A Project Report Presented to The faculty of the Department of Computer Science San Jose State University In Partial Fulfillment of the Requirements

More information

DISTRIBUTED COMPUTER SYSTEMS ARCHITECTURES

DISTRIBUTED COMPUTER SYSTEMS ARCHITECTURES DISTRIBUTED COMPUTER SYSTEMS ARCHITECTURES Dr. Jack Lange Computer Science Department University of Pittsburgh Fall 2015 Outline System Architectural Design Issues Centralized Architectures Application

More information

Developing Focused Crawlers for Genre Specific Search Engines

Developing Focused Crawlers for Genre Specific Search Engines Developing Focused Crawlers for Genre Specific Search Engines Nikhil Priyatam Thesis Advisor: Prof. Vasudeva Varma IIIT Hyderabad July 7, 2014 Examples of Genre Specific Search Engines MedlinePlus Naukri.com

More information

5 Choosing keywords Initially choosing keywords Frequent and rare keywords Evaluating the competition rates of search

5 Choosing keywords Initially choosing keywords Frequent and rare keywords Evaluating the competition rates of search Seo tutorial Seo tutorial Introduction to seo... 4 1. General seo information... 5 1.1 History of search engines... 5 1.2 Common search engine principles... 6 2. Internal ranking factors... 8 2.1 Web page

More information

ADAPTIVE AND DYNAMIC LOAD BALANCING METHODOLOGIES FOR DISTRIBUTED ENVIRONMENT

ADAPTIVE AND DYNAMIC LOAD BALANCING METHODOLOGIES FOR DISTRIBUTED ENVIRONMENT ADAPTIVE AND DYNAMIC LOAD BALANCING METHODOLOGIES FOR DISTRIBUTED ENVIRONMENT PhD Summary DOCTORATE OF PHILOSOPHY IN COMPUTER SCIENCE & ENGINEERING By Sandip Kumar Goyal (09-PhD-052) Under the Supervision

More information

A Deep Relevance Matching Model for Ad-hoc Retrieval

A Deep Relevance Matching Model for Ad-hoc Retrieval A Deep Relevance Matching Model for Ad-hoc Retrieval Jiafeng Guo 1, Yixing Fan 1, Qingyao Ai 2, W. Bruce Croft 2 1 CAS Key Lab of Web Data Science and Technology, Institute of Computing Technology, Chinese

More information

10/10/13. Traditional database system. Information Retrieval. Information Retrieval. Information retrieval system? Information Retrieval Issues

10/10/13. Traditional database system. Information Retrieval. Information Retrieval. Information retrieval system? Information Retrieval Issues COS 597A: Principles of Database and Information Systems Information Retrieval Traditional database system Large integrated collection of data Uniform access/modifcation mechanisms Model of data organization

More information

Distributed Web Crawling over DHTs. Boon Thau Loo, Owen Cooper, Sailesh Krishnamurthy CS294-4

Distributed Web Crawling over DHTs. Boon Thau Loo, Owen Cooper, Sailesh Krishnamurthy CS294-4 Distributed Web Crawling over DHTs Boon Thau Loo, Owen Cooper, Sailesh Krishnamurthy CS294-4 Search Today Search Index Crawl What s Wrong? Users have a limited search interface Today s web is dynamic and

More information

In = number of words appearing exactly n times N = number of words in the collection of words A = a constant. For example, if N=100 and the most

In = number of words appearing exactly n times N = number of words in the collection of words A = a constant. For example, if N=100 and the most In = number of words appearing exactly n times N = number of words in the collection of words A = a constant. For example, if N=100 and the most common word appears 10 times then A = rn*n/n = 1*10/100

More information

6.2 DATA DISTRIBUTION AND EXPERIMENT DETAILS

6.2 DATA DISTRIBUTION AND EXPERIMENT DETAILS Chapter 6 Indexing Results 6. INTRODUCTION The generation of inverted indexes for text databases is a computationally intensive process that requires the exclusive use of processing resources for long

More information

Challenges on Combining Open Web and Dataset Evaluation Results: The Case of the Contextual Suggestion Track

Challenges on Combining Open Web and Dataset Evaluation Results: The Case of the Contextual Suggestion Track Challenges on Combining Open Web and Dataset Evaluation Results: The Case of the Contextual Suggestion Track Alejandro Bellogín 1,2, Thaer Samar 1, Arjen P. de Vries 1, and Alan Said 1 1 Centrum Wiskunde

More information

Characterizing Gnutella Network Properties for Peer-to-Peer Network Simulation
