Distributed Information Retrieval

1 Distributed Information Retrieval Fabio Crestani and Ilya Markov University of Lugano, Switzerland Fabio Crestani and Ilya Markov Distributed Information Retrieval 1

2 Outline
1 Motivations: Deep Web, Federated Search, Metasearch, Aggregated Search

3 Why do we need DIR?
- There are limits to what a search engine can find on the web.
- Not everything that is on the web is or can be harvested.
- The one-size-fits-all approach of web search engines has many limitations.
- Often there is more than one type of answer to the same query.
Thus: Deep Web, Federated Search, Metasearch, Aggregated Search.

4 Deep Web
- There is a lot of information on the web that cannot be accessed by search engines (the deep or hidden web).
- There are many different reasons why this information is not accessible to crawlers.
- This is often very valuable information!
- Web search engines can only be used to identify the resource (if possible); the user then has to deal with it directly.
- Even if this information could be crawled, there are good reasons not to...

5 Federated Search
Federated Search is another name for DIR. Federated search systems do not crawl a resource, but pass a user query to the search facilities of the resource itself.
Why would this be better?
- It preserves the property rights of the resource owner.
- Search facilities are optimised for the specific resource.
- The index is always up-to-date.
- The resource is curated and of high quality.
Examples of federated search systems: PubMed, FedStats, WestLaw, and Cheshire.

6 Metasearch
- Even the largest search engine cannot effectively crawl the entire web.
- Different search engines crawl different, disjoint portions of the web.
- Different search engines use different ranking functions.
- Metasearch engines do not crawl the web, but pass a user query to a number of search engines and then present the fused result set.
Examples of metasearch systems: Dogpile, MetaCrawler, AllInOneNews, and SavvySearch.

7 Aggregated Search
- Often there is more than one type of information relevant to a query (e.g. web pages, images, maps, reviews, etc.).
- These types of information are indexed and ranked by separate sub-systems.
- Presenting this information in an aggregated way is more useful to the user.

8 Outline
1 Motivations
2 Peer-to-Peer Network, Crawling, Metadata Harvesting, Hybrid

9 A Taxonomy of DIR Systems
A taxonomy of DIR architectures can be built by considering where the indexes are kept. This suggests 4 different types of architectures: broker-based, peer-to-peer, crawling, and metadata harvesting.
Taxonomy diagram: a global collection is served either by distributed indexes (P2P, broker) or by a centralised index (crawling, metadata harvesting); hybrid architectures combine the two.

10 Peer-to-Peer Networks
- Indexes are located with the resources.
- Part of each index is distributed to other resources.
- Queries are distributed across the resources and results are merged by the peer that originated the query.

11 Broker-Based Architecture
- Indexes are located with the resources.
- Queries are forwarded to resources and results are merged by the broker.

12 Crawling
- Resources are crawled and documents are harvested.
- Indexes are centralised.
- Queries are carried out in a centralised way and documents are fetched from the resources or from a storage.

13 Metadata Harvesting
- Indexes are located with the resources, but metadata are harvested according to some protocol (off-line phase), for example OAI-PMH.
- Queries are carried out at the broker level (on-line phase) to identify relevant documents by their metadata; these documents are then requested from the resources.

14 Index Harvesting
- It is possible to harvest the indexes instead of the metadata, again according to some protocol (off-line phase), for example OAI-PMH.
- Queries are carried out at the broker level (on-line phase) to identify relevant documents by the documents' full content; these documents are then requested from the resources.

15 Outline
1 Motivations

16 Cooperative, Uncooperative

17 Stanford Protocol Proposal for Internet Retrieval and Search (STARTS)
- Query language: filter expressions, ranking expressions.
- Retrieved documents: unnormalized score, source, statistics (term frequency, term weight, document frequency, document size, document count).
- Source metadata and content summary: vocabulary, statistics (term frequency, document frequency, number of documents).
Luis Gravano, Chen-Chuan K. Chang, Héctor García-Molina, and Andreas Paepcke. STARTS: Stanford proposal for internet meta-searching. SIGMOD Rec., 26(2).

18 Query-Based Sampling
A query is sent to the resource, the resource returns a few documents (2-10), and the process is repeated with new queries.
Questions:
- How to select queries?
- What are the stopping criteria?
Jamie Callan and Margaret Connell. Query-based sampling of text databases. ACM Trans. Inf. Syst., 19(2):97-130.
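The sampling loop on this slide can be sketched as follows. This is a minimal illustration, not the authors' implementation: the `search(term, k)` callable, the whitespace tokenisation, the uniform random choice of the next query term from the documents sampled so far, and the fixed `max_docs` stopping rule are all assumptions made for the example.

```python
import random


def query_based_sampling(search, seed_term, max_docs=300, docs_per_query=4, rng=None):
    """Sample documents from an uncooperative resource via its search interface.

    `search(term, k)` is a hypothetical callable that returns up to k
    (doc_id, text) pairs for a one-term query.
    """
    rng = rng or random.Random(0)
    sample = {}                  # doc_id -> text
    vocabulary = [seed_term]     # candidate query terms learned so far
    known_terms = {seed_term}
    tried = set()
    while len(sample) < max_docs:
        candidates = [t for t in vocabulary if t not in tried]
        if not candidates:
            break                # no unused query terms left
        term = rng.choice(candidates)   # random selection from the learned vocabulary
        tried.add(term)
        for doc_id, text in search(term, docs_per_query):
            if doc_id in sample:
                continue
            sample[doc_id] = text
            for word in text.lower().split():
                if word not in known_terms:
                    known_terms.add(word)
                    vocabulary.append(word)
            if len(sample) >= max_docs:
                break
    return sample
```

Replacing `rng.choice` with a frequency-based choice gives the other selection strategies discussed in the deck, and `max_docs` can be swapped for any other stopping criterion.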

19 Sampling Queries
- Other (ord)
- Learned (lrd)
- Random
- Document Frequency (df)
- Collection Frequency (ctf)
- Average Term Frequency (ctf/df)
- Query logs
Jamie Callan and Margaret Connell. Query-based sampling of text databases. ACM Trans. Inf. Syst., 19(2):97-130.
Milad Shokouhi, Justin Zobel, Seyed M. M. Tahaghoghi, and Falk Scholer. Using query logs to establish vocabularies in distributed information retrieval. Inf. Process. Manage., 43(1).

20 Stopping Criteria
- A fixed number of documents
- n documents out of N resources
- Collection size
- Vocabulary size
- Newly downloaded terms are not likely to appear in future queries
Let $Q$ be a set of training queries and $\theta_k$ the language model of the sample at the $k$-th iteration. With the outer product running over the queries $q_i$ in $Q$:
$$p(Q \mid \theta_k) = \prod_{i=1}^{N} \prod_{j=1}^{|q_i|} p(t = q_{ij} \mid \theta_k), \qquad l(\theta_k, Q) = \log p(Q \mid \theta_k)$$
$$\phi_k = l(\theta_k, Q) - l(\theta_{k-1}, Q) = \log \frac{p(Q \mid \theta_k)}{p(Q \mid \theta_{k-1})}$$
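The last criterion can be made concrete with a small sketch that computes the change in query-set log-likelihood between two sampling iterations. The additive smoothing (`mu`, `vocab_size`) and the whitespace tokenisation are assumptions introduced here so that unseen terms do not get zero probability; the slide does not say how the language model is estimated.

```python
import math
from collections import Counter


def language_model(sample_docs, vocab_size, mu=1.0):
    """Unigram language model of a sample, with additive smoothing (an assumption)."""
    counts = Counter(w for doc in sample_docs for w in doc.split())
    total = sum(counts.values())
    return lambda term: (counts[term] + mu) / (total + mu * vocab_size)


def log_likelihood(model, queries):
    """l(theta, Q): sum of log p(t | theta) over all terms of all training queries."""
    return sum(math.log(model(t)) for q in queries for t in q.split())


def phi(prev_docs, curr_docs, queries, vocab_size):
    """Change in log-likelihood between iteration k-1 and iteration k."""
    return (log_likelihood(language_model(curr_docs, vocab_size), queries)
            - log_likelihood(language_model(prev_docs, vocab_size), queries))
```

Sampling could then stop once `phi` stays below a chosen threshold for several iterations; the threshold itself would have to be tuned.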

21 Stopping Criteria Bibliography
- Leif Azzopardi, Mark Baillie, and Fabio Crestani. Adaptive query-based sampling for distributed IR. In Proceedings of the ACM SIGIR. ACM.
- Mark Baillie, Leif Azzopardi, and Fabio Crestani. An adaptive stopping criteria for query-based sampling of distributed collections. In String Processing and Information Retrieval (SPIRE).
- James Caverlee, Ling Liu, and Joonsoo Bae. Distributed query sampling: a quality-conscious approach. In Proceedings of the ACM SIGIR. ACM.

22 Summary
- Cooperative (STARTS)
- Uncooperative (QBS): sampling queries, stopping criteria

23 References
- Luis Gravano, Chen-Chuan K. Chang, Héctor García-Molina, and Andreas Paepcke. STARTS: Stanford proposal for internet meta-searching. SIGMOD Rec., 26(2).
- Jamie Callan and Margaret Connell. Query-based sampling of text databases. ACM Trans. Inf. Syst., 19(2):97-130.
- Leif Azzopardi, Mark Baillie, and Fabio Crestani. Adaptive query-based sampling for distributed IR. In Proceedings of the ACM SIGIR. ACM.
- Mark Baillie, Leif Azzopardi, and Fabio Crestani. An adaptive stopping criteria for query-based sampling of distributed collections. In String Processing and Information Retrieval (SPIRE).
- James Caverlee, Ling Liu, and Joonsoo Bae. Distributed query sampling: a quality-conscious approach. In Proceedings of the ACM SIGIR. ACM.
- Milad Shokouhi, Justin Zobel, Seyed M. M. Tahaghoghi, and Falk Scholer. Using query logs to establish vocabularies in distributed information retrieval. Inf. Process. Manage., 43(1).

24 Lexicon-Based, Document-Surrogate

25 Lexicon-Based Approaches

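26 Collection Retrieval Inference Network (CORI)
- Collection = super-document
- Bayesian inference network on super-documents
- Adapted Okapi:
$$T = \frac{df_{t,i}}{df_{t,i} + 50 + 150 \cdot cw_i / avg\_cw}, \qquad I = \frac{\log\left(\frac{|C| + 0.5}{cf_t}\right)}{\log(|C| + 1.0)}$$
$$p(t \mid C_i) = b + (1 - b) \cdot T \cdot I$$
Here $df_{t,i}$ is the number of documents in collection $C_i$ containing term $t$, $cw_i$ is the number of words in $C_i$, $avg\_cw$ is the average $cw$ over all collections, $|C|$ is the number of collections, and $cf_t$ is the number of collections containing $t$.
Collections are ranked according to $p(Q \mid C_i)$.
James P. Callan, Zhihong Lu, and W. Bruce Croft. Searching distributed collections with inference networks. In Proceedings of the ACM SIGIR. ACM.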
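As a rough illustration of the adapted Okapi belief, the function below computes p(t | C_i) for one term and one collection. The constants 50 and 150 and the default b = 0.4 are the values commonly quoted for CORI and are stated here as assumptions, since the numbers are not legible in the transcription; the parameter names are invented for the example.

```python
import math


def cori_belief(df, cw, avg_cw, num_collections, cf, b=0.4):
    """Belief p(t | C_i) that collection C_i satisfies query term t.

    df: documents in C_i containing t; cw: words in C_i;
    avg_cw: average cw over all collections;
    num_collections: |C|; cf: collections containing t (must be > 0).
    """
    t = df / (df + 50 + 150 * cw / avg_cw)
    i = math.log((num_collections + 0.5) / cf) / math.log(num_collections + 1.0)
    return b + (1 - b) * t * i
```

A collection score for a whole query can then be obtained by combining the term beliefs, for instance by averaging them over the query terms.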

27 Glossary-of-Servers Server (GlOSS)
$$Goodness(q, l, C) = \sum_{d \in Rank(q, l, C)} sim(q, d)$$
$$Rank(q, l, C) = \{ d \in C \mid sim(q, d) > l \}$$
- Cooperative
- Requires document and term statistics
Luis Gravano, Héctor García-Molina, and Anthony Tomasic. GlOSS: text-source discovery over the internet. ACM Trans. Database Syst., 24(2).

28 Document-Surrogate Approaches

29 Relevant Document Distribution Estimation (ReDDE)
Idea: if 1 sampled document is relevant to a query, then $|C| / |S_C|$ similar documents in collection $C$ are relevant to the query, where $S_C$ is the sample of $C$.
$$R(C, Q) \approx \sum_{d \in S_C} p(r \mid d) \cdot \frac{|C|}{|S_C|}$$
Luo Si and Jamie Callan. Relevant document distribution estimation method for resource selection. In Proceedings of the ACM SIGIR. ACM.

30 Relevant Document Distribution Estimation (ReDDE)
The ranking of the sampled documents is used to approximate the ranking of documents in a centralized retrieval system.
Idea: if a document $d_j$ appears before a document $d_i$ in the sample ranking, then $|C_j| / |S_{C_j}|$ documents appear before $d_i$ in a centralized retrieval system.
$$Rank_{centralized}(d_i) = \sum_{d_j : Rank_{sample}(d_j) < Rank_{sample}(d_i)} \frac{|C_j|}{|S_{C_j}|}$$
Luo Si and Jamie Callan. Relevant document distribution estimation method for resource selection. In Proceedings of the ACM SIGIR. ACM.

31 Relevant Document Distribution Estimation (ReDDE)
$$R(C, Q) \approx \sum_{d \in S_C} p(r \mid d) \cdot \frac{|C|}{|S_C|}$$
$$Rank_{centralized}(d_i) = \sum_{d_j : Rank_{sample}(d_j) < Rank_{sample}(d_i)} \frac{|C_j|}{|S_{C_j}|}$$
$$p(r \mid d) = \begin{cases} \alpha & \text{if } Rank_{centralized}(d) < \beta \sum_i |C_i| \\ 0 & \text{otherwise} \end{cases}$$
Luo Si and Jamie Callan. Relevant document distribution estimation method for resource selection. In Proceedings of the ACM SIGIR. ACM.
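The three ReDDE formulas fit together as in the sketch below. It assumes the broker has already ranked the sampled documents for the query in a centralised sample index and knows (or has estimated) each collection's size; the input format and the default values of `alpha` and `beta` are illustrative choices, not taken from the slides.

```python
def redde_scores(ranked_sample, sizes, sample_sizes, alpha=1.0, beta=0.003):
    """Estimate R(C, Q) for every collection.

    ranked_sample: collection ids, one per sampled document, in the order in
        which the documents are ranked for the query (best first).
    sizes: {collection: |C|}; sample_sizes: {collection: |S_C|}.
    """
    threshold = beta * sum(sizes.values())
    scores = {c: 0.0 for c in sizes}
    central_rank = 0.0   # estimated rank in a centralised index of all documents
    for c in ranked_sample:
        scale = sizes[c] / sample_sizes[c]
        if central_rank < threshold:
            scores[c] += alpha * scale      # p(r|d) * |C| / |S_C|
        central_rank += scale               # documents estimated to rank above the next one
    return scores
```

Collections are then selected in decreasing order of their score.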

32 Centralized-Rank Collection Selection (CRCS)
$$R(C, Q) \approx \sum_{d \in S_C} p(r \mid d) \cdot \frac{|C|}{|S_C|}, \qquad p(r \mid d) = \frac{R(d)}{|C_{max}|}$$
Linear:
$$R(d) = \begin{cases} \gamma - Rank_{sample}(d) & \text{if } Rank_{sample}(d) < \gamma \\ 0 & \text{otherwise} \end{cases}$$
Exponential:
$$R(d) = \alpha \exp(-\beta \cdot Rank_{sample}(d))$$
Milad Shokouhi. Central-rank-based collection selection in uncooperative distributed information retrieval. In ECIR.
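A compact sketch of both CRCS variants follows. It assumes ranks start at 0 for the top sampled document and uses illustrative defaults for `gamma`, `alpha` and `beta`; the slides do not give these values, so treat them as placeholders.

```python
import math


def crcs_scores(ranked_sample, sizes, sample_sizes, mode="linear",
                gamma=50, alpha=1.2, beta=0.28):
    """Score collections from the rank positions of their sampled documents.

    ranked_sample: collection ids, one per sampled document, best first.
    sizes: {collection: |C|}; sample_sizes: {collection: |S_C|}.
    """
    c_max = max(sizes.values())
    scores = {c: 0.0 for c in sizes}
    for rank, c in enumerate(ranked_sample):      # rank 0 is the top document (assumption)
        if mode == "linear":
            r = gamma - rank if rank < gamma else 0.0
        else:                                     # exponential variant
            r = alpha * math.exp(-beta * rank)
        scores[c] += (r / c_max) * sizes[c] / sample_sizes[c]
    return scores
```

Unlike ReDDE, every sampled document contributes according to its rank rather than through a hard cutoff.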

33 Comparison
[Figure] Fig. 3. R values for the cori, redde and crcs algorithms on the trec123-FR-DOE-81col (left) and 100-col-gov2 (right) testbeds. Horizontal axis: cutoff.
[Table] Table 2. Performance of different methods for the Trec4 (trec4-kmeans) testbed. Trec topics (long) were used as queries. Columns: P@5, P@10, P@15, P@20 at Cutoff=1 and Cutoff=5. Rows: CORI, ReDDE, CRCS(l), CRCS(e).
[Table] Table 3. Performance of collection selection methods for the uniform (trec colbysource) testbed. Trec topics (short) were used as queries.
[Table] Table 4. Performance of collection selection methods for the representative (trec123-2ldb-60col) testbed. Trec topics (short) were used as queries.
Milad Shokouhi. Central-rank-based collection selection in uncooperative distributed information retrieval. In ECIR.

34 Comparison
On the relevant testbed (Table 5), all precision values for cori are significantly inferior to those of redde for both cutoff values. Redde in general produces higher precision values than the crcs methods. However, none of the gaps is detected as statistically significant by the t-test at the 99% confidence interval. The results for crcs(l), redde and cori are comparable on the non-relevant testbed (Table 6). Crcs(e) significantly outperforms the other methods in most cases. On the gov2 testbed (Table 7), cori produces the best results when cutoff=1, while in the other scenarios there is no significant difference between the methods.
[Table] Table 5. Performance of collection selection methods for the relevant (trec123-AP-WSJ-60col) testbed. Trec topics (short) were used as queries.
[Table] Table 6. Performance of collection selection methods for the non-relevant (trec123-FR-DOE-81col) testbed. Trec topics (short) were used as queries.
[Table] Table 7. Performance of collection selection methods for the gov2 (100-col-gov2) testbed. Trec topics (short) were used as queries.
Each table has rows CORI, ReDDE, CRCS(l), CRCS(e) and column groups Cutoff=1 and Cutoff=5.
Milad Shokouhi. Central-rank-based collection selection in uncooperative distributed information retrieval. In ECIR.

35 Summary
- Lexicon-Based: CORI, GlOSS
- Document-Surrogate: ReDDE, CRCS
- Others

36 References
- James P. Callan, Zhihong Lu, and W. Bruce Croft. Searching distributed collections with inference networks. In Proceedings of the ACM SIGIR. ACM.
- Luis Gravano, Héctor García-Molina, and Anthony Tomasic. GlOSS: text-source discovery over the internet. ACM Trans. Database Syst., 24(2).
- Luo Si and Jamie Callan. Relevant document distribution estimation method for resource selection. In Proceedings of the ACM SIGIR. ACM.
- Milad Shokouhi. Central-rank-based collection selection in uncooperative distributed information retrieval. In ECIR.

37 Estimating Collection Size

38 Capture-Recapture
$X$ is the event that a randomly sampled document is already in a sample $S$; $X_n$ is the same for $n$ randomly sampled documents.
$$E[X] = \frac{|S|}{|C|}, \qquad E[X_n] = n \cdot E[X] = n \cdot \frac{|S|}{|C|}$$
For two samples $S_1$ and $S_2$:
$$|S_1 \cap S_2| = \frac{|S_1| \, |S_2|}{|C|} \quad \Rightarrow \quad |\hat{C}| = \frac{|S_1| \, |S_2|}{|S_1 \cap S_2|}$$
- Take two samples.
- Count the number of common documents.
King-Lup Liu, Clement Yu, and Weiyi Meng. Discovering the representative of a search engine. In Proceedings of the ACM CIKM. ACM.
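The estimator reduces to a few lines once the two samples are available as collections of document identifiers. The function name and the choice to return `None` when the samples share no documents are assumptions of this sketch.

```python
def capture_recapture(sample1, sample2):
    """Estimate collection size from two independent samples of document ids."""
    s1, s2 = set(sample1), set(sample2)
    common = len(s1 & s2)
    if common == 0:
        return None   # no overlap: the estimate is undefined
    return len(s1) * len(s2) / common
```

For example, two samples of 100 documents that share 50 documents give an estimate of 100 * 100 / 50 = 200 documents.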

39 Sample-Resample
Randomly pick a term $t$ from a sample $S$.
$A$ is the event that some sampled document contains $t$; $B$ is the event that some document from the resource contains $t$.
$$P(A) = \frac{df_{t,S}}{|S|}, \qquad P(B) = \frac{df_{t,C}}{|C|}$$
$$P(A) \approx P(B) \quad \Rightarrow \quad |\hat{C}| = \frac{df_{t,C} \cdot |S|}{df_{t,S}}$$
Send the query $t$ to the resource to estimate $df_{t,C}$.
Luo Si and Jamie Callan. Relevant document distribution estimation method for resource selection. In Proceedings of the ACM SIGIR. ACM.
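A small sketch of the sample-resample estimate is given below. It assumes the resource reports the number of matching documents for a one-term query, which serves as df in the resource; averaging the estimates over several probe terms is a common refinement and is included here as an assumption rather than something stated on the slide.

```python
def sample_resample(df_sample, sample_size, df_resource):
    """Estimate |C| from one probe term.

    df_sample: sampled documents containing the term;
    sample_size: |S|; df_resource: matches reported by the resource.
    """
    if df_sample <= 0:
        raise ValueError("the probe term must occur in the sample")
    return df_resource * sample_size / df_sample


def estimate_size(probes, sample_size):
    """Average the estimate over several (df_sample, df_resource) probes."""
    estimates = [sample_resample(dfs, sample_size, dfc) for dfs, dfc in probes]
    return sum(estimates) / len(estimates)
```

With a sample of 300 documents, a term found in 30 sampled documents and reported in 5000 documents by the resource suggests a collection of about 50000 documents.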

41 The Results Merging Process
1 Selected resources return their top-ranked documents to the broker.
2 The broker merges the documents and returns a fused list to the user.
Not to be confused with data fusion, where the results come from a single resource and are then ranked by multiple retrieval models.

42 Issues
The results merging process involves a number of issues:
1 Duplicate detection and removal.
2 Normalising and merging relevance scores.
Different solutions have been proposed for these issues, depending on the DIR environment.

43 Results Merging in Cooperative Environments
Results merging in cooperative environments is much simpler and has different solutions:
1 Fetch documents from each resource, reindex them and rank them according to the broker's IR model.
2 Get information about the way the document score is calculated and normalise the score.
At the highest level of collaboration it is possible to ask the resources to adopt the same retrieval model!

44 Collection Retrieval Inference Network (CORI)
The idea: a linear combination of the score of the database and the score of the document.
Normalised collection selection score:
$$C'_i = \frac{R_i - R_{min}}{R_{max} - R_{min}}$$
Normalised document score:
$$D'_j = \frac{D_j - D_{min}}{D_{max} - D_{min}}$$
Heuristic linear combination:
$$D''_j = \frac{D'_j + 0.4 \cdot D'_j \cdot C'_i}{1.4}$$
J.P. Callan, Z. Lu, and W.B. Croft. Searching distributed collections with inference networks. In Proceedings of the ACM SIGIR. ACM.
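The CORI merging heuristic can be sketched as below. The input layout (a dictionary of per-collection result lists plus a dictionary of collection selection scores) is an assumption of this example, and document scores are min-max normalised within each collection's own result list.

```python
def minmax(x, lo, hi):
    """Min-max normalisation; returns 0.0 when all values are equal."""
    return (x - lo) / (hi - lo) if hi > lo else 0.0


def cori_merge(results, collection_scores):
    """Merge per-collection result lists into one ranked list.

    results: {collection: [(doc_id, score), ...]}
    collection_scores: {collection: selection score R_i}
    """
    r_min = min(collection_scores.values())
    r_max = max(collection_scores.values())
    merged = []
    for c, docs in results.items():
        if not docs:
            continue
        c_norm = minmax(collection_scores[c], r_min, r_max)
        d_min = min(s for _, s in docs)
        d_max = max(s for _, s in docs)
        for doc_id, s in docs:
            d_norm = minmax(s, d_min, d_max)
            merged.append((doc_id, (d_norm + 0.4 * d_norm * c_norm) / 1.4))
    return sorted(merged, key=lambda pair: pair[1], reverse=True)
```

Documents from highly ranked collections are boosted, but a top document from a weaker collection can still outrank a poor document from a strong one.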

45 Results Merging in Uncooperative Environments
In uncooperative environments resources might provide scores:
- But the broker does not have any information on how these scores are computed.
- Score normalisation requires some way of comparing scores.
Alternatively the resources might provide only rank positions:
- But the broker does not have any information on the relevance of each document in the ranked lists.
- Merging the ranks requires some way of comparing rank positions.

46 Semi-Supervised Learning (SSL)
The idea: train a regression model for each collection that maps resource document scores to normalised scores. This requires that some returned documents are found in the Collection Selection Index (CSI).
Two cases:
1 Resources use identical retrieval models.
2 Resources use different retrieval models.
L. Si and J. Callan. Using sampled data and regression to merge search engine results. In Proceedings of the ACM SIGIR. ACM.

47 SSL with Identical Retrieval Models
The idea: SSL uses documents found in the CSI to train a single regression model that estimates the normalised score ($D'_{i,j}$) from the resource document score ($D_{i,j}$), using as target the score of the same document computed from the CSI ($E_{i,j}$).
Having:
$$\begin{bmatrix} D_{1,1} & C_1 D_{1,1} \\ D_{1,2} & C_1 D_{1,2} \\ \vdots & \vdots \\ D_{n,m} & C_n D_{n,m} \end{bmatrix} \begin{bmatrix} a \\ b \end{bmatrix} = \begin{bmatrix} E_{1,1} \\ E_{1,2} \\ \vdots \\ E_{n,m} \end{bmatrix}$$
Train $[a \; b]$ and compute the normalised scores:
$$D'_{i,j} = a \cdot D_{i,j} + b \cdot D_{i,j} \cdot C_i$$

48 SSL with Different Retrieval Models
The idea: SSL uses documents found in the CSI to train a different regression model for each resource.
Having:
$$\begin{bmatrix} D_{1,1} & 1 \\ D_{1,2} & 1 \\ \vdots & \vdots \\ D_{n,m} & 1 \end{bmatrix} \begin{bmatrix} a_i \\ b_i \end{bmatrix} = \begin{bmatrix} E_{1,1} \\ E_{1,2} \\ \vdots \\ E_{n,m} \end{bmatrix}$$
Train $[a_i \; b_i]$ and compute the normalised scores:
$$D'_{i,j} = a_i \cdot D_{i,j} + b_i$$
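For the per-resource case, the regression is an ordinary least-squares line fitted on the overlap documents, that is, the returned documents that are also found in the CSI. The sketch below uses plain Python; the function names and the input format are assumptions of the example.

```python
def fit_line(xs, ys):
    """Least-squares fit of y = a * x + b."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    if sxx == 0:
        raise ValueError("need at least two overlap documents with distinct scores")
    a = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys)) / sxx
    return a, mean_y - a * mean_x


def ssl_normalise(resource_scores, overlap):
    """Map one resource's scores onto the CSI score scale.

    resource_scores: {doc_id: score returned by the resource}
    overlap: [(resource_score, csi_score), ...] for documents found in the CSI
    """
    a, b = fit_line([d for d, _ in overlap], [e for _, e in overlap])
    return {doc: a * score + b for doc, score in resource_scores.items()}
```

After every resource has been mapped this way, the normalised scores are directly comparable and the lists can be merged by sorting.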

49 Sample-Agglomerate Fitting Estimate (SAFE)
The idea: for a given query, the results from the CSI are a subranking of the original collection, so curve fitting on the subranking can be used to estimate the original scores.
It does not require the presence of overlap documents in the CSI.
M. Shokouhi and J. Zobel. Robust result merging using sample-based score estimates. ACM Transactions on Information Systems, 27(3): 129.

50 Sample-Agglomerate Fitting Estimate (SAFE)
Normalised scores:
1 The broker ranks the documents available in the CSI for the query.
2 For each resource, the sample documents (with non-zero score) are used to estimate the merging score, where each sample document is assumed to be representative of a fraction $|S_c| / |c|$ of the resource.
3 Regression is used to fit a curve on the adjusted scores in order to predict the scores of the documents returned by the resource.

51 More Results Merging in Uncooperative Environments
There are many other approaches to results merging:
- STARTS uses the returned term frequency, document frequency, and document weight information to calculate the merging score based on similarities between documents.
- CVV calculates the merging score according to the collection score and the position of a document in the returned collection rank list.
- Another approach downloads small parts of the top returned documents and uses a reference index of term statistics for reranking and merging the downloaded documents.
L. Gravano, C. Chang, H. Garcia-Molina, and A. Paepcke. STARTS: Stanford proposal for internet meta-searching. In Proceedings of the ACM SIGMOD.
B. Yuwono and D. Lee. Server ranking for distributed text retrieval systems on the internet. In Proceedings of the Conference on Database Systems for Advanced, pages 41-50.
N. Craswell, D. Hawking, and P. Thistlewaite. Merging results from isolated search engines. In Proceedings of the Australasian Database Conference.

52 Data Fusion in Metasearch
- In data fusion methods, documents in a single collection are ranked by different search engines.
- The goal is to generate a single accurate ranked list from the ranked lists of different retrieval models.
- There are no collection samples and no CSI.
The idea: use the voting principle. A document returned by many search systems should be ranked higher than the other documents. If available, also take the rank of documents into account.

53 Metasearch Data Fusion Methods
Many methods have been proposed:
- Round Robin.
- CombMNZ, CombSum, CombMax, CombMin.
- Logistic regression (convert ranks to estimated probabilities of relevance).
A comparison between score-based and rank-based methods suggests that rank-based methods are generally less effective.
E. Fox and J. Shaw. Combination of multiple searches. In Proceedings of TREC, 1994.
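The Comb* family listed on this slide differs only in how the scores a document receives from the individual systems are combined. The sketch below assumes each run is a dictionary of already normalised scores; the function name and the `method` strings are chosen for the example.

```python
from collections import defaultdict


def comb_fusion(runs, method="combsum"):
    """Fuse several runs over the same collection into one ranking.

    runs: list of {doc_id: normalised score}, one per retrieval system.
    """
    scores = defaultdict(list)
    for run in runs:
        for doc_id, score in run.items():
            scores[doc_id].append(score)
    fused = {}
    for doc_id, values in scores.items():
        if method == "combsum":
            fused[doc_id] = sum(values)
        elif method == "combmnz":          # sum times number of systems returning the document
            fused[doc_id] = sum(values) * len(values)
        elif method == "combmax":
            fused[doc_id] = max(values)
        elif method == "combmin":
            fused[doc_id] = min(values)
        else:
            raise ValueError(f"unknown method: {method}")
    return sorted(fused.items(), key=lambda pair: pair[1], reverse=True)
```

CombSum and CombMNZ implement the voting principle directly, since a document returned by several systems accumulates score from each of them.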

54 Results Merging in Metasearch
- We cannot use data fusion methods when collections overlap but are not the same.
- We cannot use data fusion methods when the retrieval models are different.
- Web metasearch is the most typical example.
The idea: normalise the document scores returned by multiple search engines using a regression function that compares the scores of overlapped documents between the returned ranked lists.
In the absence of overlap between the results, most metasearch merging techniques become ineffective.
S. Wu and F. Crestani. Shadow document methods of results merging. In Proceedings of the ACM SAC, 2004.

55 References (1)
- J.P. Callan, Z. Lu, and W.B. Croft. Searching distributed collections with inference networks. In Proceedings of the ACM SIGIR. ACM.
- L. Si and J. Callan. Using sampled data and regression to merge search engine results. In Proceedings of the ACM SIGIR. ACM.
- M. Shokouhi and J. Zobel. Robust result merging using sample-based score estimates. ACM Transactions on Information Systems, 27(3): 129.
- L. Gravano, C. Chang, H. Garcia-Molina, and A. Paepcke. STARTS: Stanford proposal for internet meta-searching. In Proceedings of the ACM SIGMOD.

56 References (2)
- B. Yuwono and D. Lee. Server ranking for distributed text retrieval systems on the internet. In Proceedings of the Conference on Database Systems for Advanced, pages 41-50.
- N. Craswell, D. Hawking, and P. Thistlewaite. Merging results from isolated search engines. In Proceedings of the Australasian Database Conference.
- E. Fox and J. Shaw. Combination of multiple searches. In Proceedings of TREC, 1994.
- S. Wu and F. Crestani. Shadow document methods of results merging. In Proceedings of the ACM SAC, 2004.

57 Outline
1 Motivations

58 Vertical

59 Vertical
- Vertical: a specialized subcollection focused on a specific domain (e.g., news, travel, and local search) or a specific media type (e.g., images and video).
- Vertical Selection: the task of selecting the relevant verticals, if any, in response to a user's query.
- DIR solution: 0-1 verticals.

60 Vertical Selection References
- Fernando Diaz. Integration of news content into web results. In Proceedings of the ACM WSDM. ACM.
- Jaime Arguello, Fernando Diaz, Jamie Callan, and Jean-Francois Crespo. Sources of evidence for vertical selection. In Proceedings of the ACM SIGIR. ACM.
- Fernando Diaz and Jaime Arguello. Adaptation of offline vertical selection predictions in the presence of user feedback. In Proceedings of the ACM SIGIR. ACM.

61 Blog Distillation
The task of identifying blogs with a recurring central interest.
Correspondence with DIR: a blog feed and its posts play the role of a federated collection and its documents.
Jonathan L. Elsas, Jaime Arguello, Jamie Callan, and Jaime G. Carbonell. Retrieval and feedback models for blog feed search. In Proceedings of the ACM SIGIR. ACM.
Jangwon Seo and W. Bruce Croft. Blog site search using resource selection. In Proceedings of the ACM CIKM. ACM.

62 Personalized Metasearch

63 Personalized Metasearch
The broker provides a single search interface over all of a user's online resources.
Different collections:
- individual folders
- address books
- calendars
- the Web
Paul Thomas and David Hawking. Experiences evaluating personal metasearch. In IIiX.
Paul Thomas and David Hawking. Server selection methods in personal metasearch: a comparative empirical study. Inf. Retr., 12(5).

64 Others
- Expert Search: the task of identifying experts with a given expertise. Experts: related documents.
- Desktop Search: different file and document types; results fusion.

65 Summary
- Vertical Selection
- Blog Distillation
- Personalized Metasearch
- Expert Search
- Desktop Search
- Your application

66 Outline
1 Motivations

67 Evolving Collections

68 Updating Query-Based Sampling
Given N collections, n documents can be sampled at each time step.
Distribution methods:
- Uniform
- Popularity-based
- Size-based

69 Updating Methods Comparison
[Figure: comparison of the updating methods, with series QL, CU, SS and Crawl and panels for CO=3 and CO=5. Source: SIGIR 2007 Proceedings, Session 21: Collection Representation in Distributed IR.]

70 Evolving Collections References
- Panagiotis G. Ipeirotis, Alexandros Ntoulas, Junghoo Cho, and Luis Gravano. Modeling and managing content changes in text databases. In ICDE.
- Milad Shokouhi, Mark Baillie, and Leif Azzopardi. Updating collection representations for federated search. In Proceedings of the ACM SIGIR. ACM.
- Panagiotis G. Ipeirotis, Alexandros Ntoulas, Junghoo Cho, and Luis Gravano. Modeling and managing changes in text databases. ACM Trans. Database Syst., 32(3):14.

71 Query

72 Query Expansion
- Global: expansion based on all retrieved documents. The same expanded query is sent to each resource.
- Local: expansion for each resource based only on its own documents.
- Local but general: get expansion terms from each resource and select the best terms. The same expanded query is sent to each resource.
- Cluster: cluster the resources independently of a query. Use the Global or the Local but general approach for each cluster.

73 Query Expansion
[Table] Table 4: The average performance of query expansion methods across different testbeds for two sets of TREC topics. Columns: Method, P@5, P@10, MRR. Rows for each topic set: BSNE, Local, Fuse, Cluster, Global.
Paul Ogilvie and Jamie Callan. The effectiveness of query expansion for distributed information retrieval. In Proceedings of the ACM CIKM. ACM.
Milad Shokouhi, Leif Azzopardi, and Paul Thomas. Effective query expansion for federated search. In Proceedings of the ACM SIGIR. ACM.

74 Overlap
Results Fusion
Remove duplicates from the result list
Give a higher score to a document that appears more than once

75 Overlap Estimate
Given:
Collections C_1 and C_2, with K overlap documents between them
Samples S_1 and S_2, with D duplicate documents within them
Estimated number of overlap documents:
$$\hat{K} = \frac{|C_1|\,|C_2|\,D}{|S_1|\,|S_2|}$$
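A small worked example of the estimate above, as a sketch. It assumes the two samples are drawn independently and uniformly at random, so that a shared document lands in both samples with probability (|S_1|/|C_1|)(|S_2|/|C_2|); the expected number of duplicates is then K times that probability, which gives the formula when solved for K. The function and parameter names are hypothetical.

```python
def estimate_overlap(size_c1, size_c2, size_s1, size_s2, duplicates):
    """Estimate the number of documents shared by two collections.

    size_c1, size_c2: (estimated) collection sizes |C1| and |C2|
    size_s1, size_s2: sample sizes |S1| and |S2|
    duplicates: number D of documents found in both samples
    """
    if size_s1 <= 0 or size_s2 <= 0:
        raise ValueError("sample sizes must be positive")
    return size_c1 * size_c2 * duplicates / (size_s1 * size_s2)

# 5 duplicates between a 100-document sample of a 1000-document collection
# and a 200-document sample of a 2000-document collection.
k_hat = estimate_overlap(1000, 2000, 100, 200, 5)
```

Here the estimate is 1000 * 2000 * 5 / (100 * 200) = 500 shared documents. In an uncooperative setting the collection sizes themselves usually have to be estimated, so errors in those estimates carry over into the overlap estimate.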

76 Relax
Source: SIGIR 2007 Proceedings, Session 21: Collection Representation in Distributed IR.
Figure 1: The Relax selection on a sample graph. Each vertex (Cn) in this graph represents a federated collection. (A) The graph initialization, where R represents the estimated number of relevant documents in each collection. (B) The graph after initialization, where C4 is selected as the most relevant collection according to its R value. The weight w_e(u, v) of the edge between u and v is computed according to the estimated number of ...

77 Results Fusion
Remove duplicates from the result list
Give a higher score to a document that appears more than once
Fusion Methods
Document d appears in m collections (out of n) with scores {s_i}
Shadow Document: assumes that d also appears in the other n - m collections with a score $\frac{1}{m}\sum_{i=1}^{m} s_i$
$$\mathrm{score}(d) = \sum_{i=1}^{m} s_i + k\,(n - m)\,\frac{\sum_{i=1}^{m} s_i}{m}$$
Multi-Evidence:
$$\mathrm{score}(d) = f(m)\,\frac{\sum_{i=1}^{m} s_i}{m}$$
where f(x) is a nondecreasing function
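The two scoring rules above can be sketched as follows. The sketch assumes that scores are already normalised to a common range, that the same document carries the same identifier in every collection (so duplicates can be detected by id), and that k = 0.5 and f(m) = m are acceptable illustrative choices; none of these values come from the slides, and all names are hypothetical.

```python
def shadow_document_score(scores, n, k=0.5):
    """Sum of observed scores plus k times the average score
    for each of the n - m collections that did not return the document."""
    m = len(scores)
    if m == 0 or m > n:
        raise ValueError("need 1 <= m <= n observed scores")
    total = sum(scores)
    return total + k * (n - m) * total / m

def multi_evidence_score(scores, f=lambda m: m):
    """Average observed score weighted by a nondecreasing function f of m."""
    m = len(scores)
    if m == 0:
        raise ValueError("need at least one observed score")
    return f(m) * sum(scores) / m

def fuse(result_lists, score_fn):
    """Merge per-collection result lists, collapsing duplicates by document id."""
    scores_by_doc = {}
    for results in result_lists:
        for doc_id, score in results:
            scores_by_doc.setdefault(doc_id, []).append(score)
    fused = {doc: score_fn(scores) for doc, scores in scores_by_doc.items()}
    # Highest fused score first; ties broken by document id.
    return sorted(fused.items(), key=lambda kv: (-kv[1], kv[0]))

# d1 is returned by both collections, d2 by only one.
lists = [[("d1", 0.5), ("d2", 0.25)], [("d1", 0.25)]]
ranking = fuse(lists, lambda s: shadow_document_score(s, n=2, k=0.5))
```

With f(m) = m the multi-evidence score reduces to the plain sum of the observed scores; a faster-growing f rewards documents returned by many collections more strongly.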

78 Results Fusion
Source: Inf Retrieval (2007) 10.
Figure: average precision at 8 document levels (5, 10, 15, 20, 25, 30, 50, 100) plotted against the range of overlap rate (ranges 1 to 5, the first being 0-0.2) for six methods: MEM, SDM, Round-robin, Bayesian, Borda, CombMNZ.
Fig. 1 Performances of six methods with different overlap rates ([0,1] normalization)

79 Overlapping Collections References
Milad Shokouhi and Justin Zobel. Federated text retrieval from uncooperative overlapped collections. In Proceedings of the ACM SIGIR. ACM.
Yaniv Bernstein, Milad Shokouhi, and Justin Zobel. Compact features for detection of near-duplicates in distributed retrieval. In SPIRE.
Shengli Wu and Sally McClean. Result merging methods in distributed information retrieval with overlapping databases. Inf. Retr., 10(3).
Shengli Wu and Fabio Crestani. Shadow document methods of results merging. In Proceedings of the ACM Symposium on Applied Computing. ACM.

80 More DIR
Summary
Evolving Collections
Query Expansion
Overlapping Collections
Multilingual Search
Distributed Multimedia Information Retrieval
Luo Si, Jamie Callan, Suleyman Cetintas, and Hao Yuan. An effective and efficient results merging strategy for multilingual information retrieval in federated search environments. Inf. Retr., 11(1):1-24.
Jamie Callan, Fabio Crestani, and Mark Sanderson. Distributed Multimedia Information Retrieval: SIGIR 2003 Workshop on Distributed Information Retrieval, Toronto, Canada, August 2003: Revised, Selected, and Invited Papers (Lecture Notes in Computer Science, 2924). Springer-Verlag.

81 Q & A

Federated Search. Jaime Arguello INLS 509: Information Retrieval November 21, Thursday, November 17, 16

Federated Search. Jaime Arguello INLS 509: Information Retrieval November 21, Thursday, November 17, 16 Federated Search Jaime Arguello INLS 509: Information Retrieval jarguell@email.unc.edu November 21, 2016 Up to this point... Classic information retrieval search from a single centralized index all ueries

More information

Federated Text Retrieval From Uncooperative Overlapped Collections

Federated Text Retrieval From Uncooperative Overlapped Collections Session 2: Collection Representation in Distributed IR Federated Text Retrieval From Uncooperative Overlapped Collections ABSTRACT Milad Shokouhi School of Computer Science and Information Technology,

More information

Federated Text Search

Federated Text Search CS54701 Federated Text Search Luo Si Department of Computer Science Purdue University Abstract Outline Introduction to federated search Main research problems Resource Representation Resource Selection

More information

CS54701: Information Retrieval

CS54701: Information Retrieval CS54701: Information Retrieval Federated Search 10 March 2016 Prof. Chris Clifton Outline Federated Search Introduction to federated search Main research problems Resource Representation Resource Selection

More information

CS47300: Web Information Search and Management

CS47300: Web Information Search and Management CS47300: Web Information Search and Management Federated Search Prof. Chris Clifton 13 November 2017 Federated Search Outline Introduction to federated search Main research problems Resource Representation

More information

From federated to aggregated search

From federated to aggregated search From federated to aggregated search Fernando Diaz, Mounia Lalmas and Milad Shokouhi diazf@yahoo-inc.com mounia@acm.org milads@microsoft.com Outline Introduction and Terminology Architecture Resource Representation

More information

Federated Text Retrieval from Independent Collections

Federated Text Retrieval from Independent Collections Federated Text Retrieval from Independent Collections A thesis submitted for the degree of Doctor of Philosophy Milad Shokouhi B.E. (Hons.), School of Computer Science and Information Technology, Science,

More information

Federated Search. Contents

Federated Search. Contents Foundations and Trends R in Information Retrieval Vol. 5, No. 1 (2011) 1 102 c 2011 M. Shokouhi and L. Si DOI: 10.1561/1500000010 Federated Search By Milad Shokouhi and Luo Si Contents 1 Introduction 3

More information

A Topic-based Measure of Resource Description Quality for Distributed Information Retrieval

A Topic-based Measure of Resource Description Quality for Distributed Information Retrieval A Topic-based Measure of Resource Description Quality for Distributed Information Retrieval Mark Baillie 1, Mark J. Carman 2, and Fabio Crestani 2 1 CIS Dept., University of Strathclyde, Glasgow, UK mb@cis.strath.ac.uk

More information

ABSTRACT. Categories & Subject Descriptors: H.3.3 [Information Search and Retrieval]: General Terms: Algorithms Keywords: Resource Selection

ABSTRACT. Categories & Subject Descriptors: H.3.3 [Information Search and Retrieval]: General Terms: Algorithms Keywords: Resource Selection Relevant Document Distribution Estimation Method for Resource Selection Luo Si and Jamie Callan School of Computer Science Carnegie Mellon University Pittsburgh, PA 15213 lsi@cs.cmu.edu, callan@cs.cmu.edu

More information

Capturing Collection Size for Distributed Non-Cooperative Retrieval

Capturing Collection Size for Distributed Non-Cooperative Retrieval Capturing Collection Size for Distributed Non-Cooperative Retrieval Milad Shokouhi Justin Zobel Falk Scholer S.M.M. Tahaghoghi School of Computer Science and Information Technology, RMIT University, Melbourne,

More information

A Methodology for Collection Selection in Heterogeneous Contexts

A Methodology for Collection Selection in Heterogeneous Contexts A Methodology for Collection Selection in Heterogeneous Contexts Faïza Abbaci Ecole des Mines de Saint-Etienne 158 Cours Fauriel, 42023 Saint-Etienne, France abbaci@emse.fr Jacques Savoy Université de

More information

Examining the Authority and Ranking Effects as the result list depth used in data fusion is varied

Examining the Authority and Ranking Effects as the result list depth used in data fusion is varied Information Processing and Management 43 (2007) 1044 1058 www.elsevier.com/locate/infoproman Examining the Authority and Ranking Effects as the result list depth used in data fusion is varied Anselm Spoerri

More information

Obtaining Language Models of Web Collections Using Query-Based Sampling Techniques

Obtaining Language Models of Web Collections Using Query-Based Sampling Techniques -7695-1435-9/2 $17. (c) 22 IEEE 1 Obtaining Language Models of Web Collections Using Query-Based Sampling Techniques Gary A. Monroe James C. French Allison L. Powell Department of Computer Science University

More information

RMIT University at TREC 2006: Terabyte Track

RMIT University at TREC 2006: Terabyte Track RMIT University at TREC 2006: Terabyte Track Steven Garcia Falk Scholer Nicholas Lester Milad Shokouhi School of Computer Science and IT RMIT University, GPO Box 2476V Melbourne 3001, Australia 1 Introduction

More information

Balancing Precision and Recall with Selective Search

Balancing Precision and Recall with Selective Search Balancing Precision and Recall with Selective Search Mon Shih Chuang Department of Computer Science San Francisco State University 1600 Holloway Ave, San Francisco CA, USA, 94132 mchuang@mail.sfsu.edu

More information

Federated Search in the Wild

Federated Search in the Wild Federated Search in the Wild The Combined Power of over a Hundred Search Engines Dong Nguyen 1, Thomas Demeester 2, Dolf Trieschnigg 1, Djoerd Hiemstra 1 1 University of Twente, The Netherlands 2 Ghent

More information

Faculty of Science and Technology MASTER S THESIS

Faculty of Science and Technology MASTER S THESIS Faculty of Science and Technology MASTER S THESIS Study program/ Specialization: Master of Science in Computer Science Spring semester, 2016 Open Writer: Shuo Zhang Faculty supervisor: (Writer s signature)

More information

UMass at TREC 2017 Common Core Track

UMass at TREC 2017 Common Core Track UMass at TREC 2017 Common Core Track Qingyao Ai, Hamed Zamani, Stephen Harding, Shahrzad Naseri, James Allan and W. Bruce Croft Center for Intelligent Information Retrieval College of Information and Computer

More information

An Overview of Aggregating Vertical Results into Web Search Results

An Overview of Aggregating Vertical Results into Web Search Results An Overview of Aggregating Vertical Results into Web Search Results Suhel Mustajab Department of Computer Science, A.M.U., Aligarh, U.P., India. Mohd. Kashif Adhami Department of Computer Science, A.M.U.,

More information

Effect of log-based Query Term Expansion on Retrieval Effectiveness in Patent Searching

Effect of log-based Query Term Expansion on Retrieval Effectiveness in Patent Searching Effect of log-based Query Term Expansion on Retrieval Effectiveness in Patent Searching Wolfgang Tannebaum, Parvaz Madabi and Andreas Rauber Institute of Software Technology and Interactive Systems, Vienna

More information

Retrieval and Feedback Models for Blog Distillation

Retrieval and Feedback Models for Blog Distillation Retrieval and Feedback Models for Blog Distillation Jonathan Elsas, Jaime Arguello, Jamie Callan, Jaime Carbonell Language Technologies Institute, School of Computer Science, Carnegie Mellon University

More information

Robust Relevance-Based Language Models

Robust Relevance-Based Language Models Robust Relevance-Based Language Models Xiaoyan Li Department of Computer Science, Mount Holyoke College 50 College Street, South Hadley, MA 01075, USA Email: xli@mtholyoke.edu ABSTRACT We propose a new

More information

A Formal Approach to Score Normalization for Meta-search

A Formal Approach to Score Normalization for Meta-search A Formal Approach to Score Normalization for Meta-search R. Manmatha and H. Sever Center for Intelligent Information Retrieval Computer Science Department University of Massachusetts Amherst, MA 01003

More information

Combining CORI and the decision-theoretic approach for advanced resource selection

Combining CORI and the decision-theoretic approach for advanced resource selection Combining CORI and the decision-theoretic approach for advanced resource selection Henrik Nottelmann and Norbert Fuhr Institute of Informatics and Interactive Systems, University of Duisburg-Essen, 47048

More information

A Meta-search Method with Clustering and Term Correlation

A Meta-search Method with Clustering and Term Correlation A Meta-search Method with Clustering and Term Correlation Dyce Jing Zhao, Dik Lun Lee, and Qiong Luo Department of Computer Science Hong Kong University of Science & Technology {zhaojing,dlee,luo}@cs.ust.hk

More information

Opinions in Federated Search: University of Lugano at TREC 2014 Federated Web Search Track

Opinions in Federated Search: University of Lugano at TREC 2014 Federated Web Search Track Opinions in Federated Search: University of Lugano at TREC 2014 Federated Web Search Track Anastasia Giachanou 1,IlyaMarkov 2 and Fabio Crestani 1 1 Faculty of Informatics, University of Lugano, Switzerland

More information

Distributed Search over the Hidden Web: Hierarchical Database Sampling and Selection

Distributed Search over the Hidden Web: Hierarchical Database Sampling and Selection Distributed Search over the Hidden Web: Hierarchical Database Sampling and Selection P.G. Ipeirotis & L. Gravano Computer Science Department, Columbia University Amr El-Helw CS856 University of Waterloo

More information

Collection Selection and Results Merging with Topically Organized U.S. Patents and TREC Data

Collection Selection and Results Merging with Topically Organized U.S. Patents and TREC Data Collection Selection and Results Merging with Topically Organized U.S. Patents and TREC Data Leah S. Larkey, Margaret E. Connell Department of Computer Science University of Massachusetts Amherst, MA 13

More information

Report on the CLEF-IP 2012 Experiments: Search of Topically Organized Patents

Report on the CLEF-IP 2012 Experiments: Search of Topically Organized Patents Report on the CLEF-IP 2012 Experiments: Search of Topically Organized Patents Michail Salampasis Vienna University of Technology Institute of Software Technology and Interactive Systems Vienna, Austria

More information

QoS Based Ranking for Composite Web Services

QoS Based Ranking for Composite Web Services QoS Based Ranking for Composite Web Services F.Ezhil Mary Arasi 1, Aditya Anand 2, Subodh Kumar 3 1 SRM University, De[partment of Computer Applications, Kattankulathur, Chennai, India 2 SRM University,

More information

A Machine Learning Approach for Information Retrieval Applications. Luo Si. Department of Computer Science Purdue University

A Machine Learning Approach for Information Retrieval Applications. Luo Si. Department of Computer Science Purdue University A Machine Learning Approach for Information Retrieval Applications Luo Si Department of Computer Science Purdue University Why Information Retrieval: Information Overload: Since the introduction of digital

More information

Collection Selection with Highly Discriminative Keys

Collection Selection with Highly Discriminative Keys Collection Selection with Highly Discriminative Keys Sander Bockting Avanade Netherlands B.V. Versterkerstraat 6 1322 AP, Almere, Netherlands sander.bockting@avanade.com Djoerd Hiemstra University of Twente

More information

Automatic Classification of Text Databases through Query Probing

Automatic Classification of Text Databases through Query Probing Automatic Classification of Text Databases through Query Probing Panagiotis G. Ipeirotis Computer Science Dept. Columbia University pirot@cs.columbia.edu Luis Gravano Computer Science Dept. Columbia University

More information

Query Likelihood with Negative Query Generation

Query Likelihood with Negative Query Generation Query Likelihood with Negative Query Generation Yuanhua Lv Department of Computer Science University of Illinois at Urbana-Champaign Urbana, IL 61801 ylv2@uiuc.edu ChengXiang Zhai Department of Computer

More information

CS473: Course Review CS-473. Luo Si Department of Computer Science Purdue University

CS473: Course Review CS-473. Luo Si Department of Computer Science Purdue University CS473: CS-473 Course Review Luo Si Department of Computer Science Purdue University Basic Concepts of IR: Outline Basic Concepts of Information Retrieval: Task definition of Ad-hoc IR Terminologies and

More information

Query- vs. Crawling-based Classification of Searchable Web Databases

Query- vs. Crawling-based Classification of Searchable Web Databases Query- vs. Crawling-based Classification of Searchable Web Databases Luis Gravano Panagiotis G. Ipeirotis Mehran Sahami gravano@cs.columbia.edu pirot@cs.columbia.edu sahami@epiphany.com Columbia University

More information

Cost-Effective Combination of Multiple Rankers: Learning When Not To Query

Cost-Effective Combination of Multiple Rankers: Learning When Not To Query Cost-Effective Combination of Multiple Rankers: Learning When Not To Query ABSTRACT Combining multiple rankers has potential for improving the performance over using any of the single rankers. However,

More information

Evaluating Sampling Methods for Uncooperative Collections

Evaluating Sampling Methods for Uncooperative Collections Evaluating Sampling Methods for Uncooperative Collections Paul Thomas Department of Computer Science Australian National University Canberra, Australia paul.thomas@anu.edu.au David Hawking CSIRO ICT Centre

More information

Content-based search in peer-to-peer networks

Content-based search in peer-to-peer networks Content-based search in peer-to-peer networks Yun Zhou W. Bruce Croft Brian Neil Levine yzhou@cs.umass.edu croft@cs.umass.edu brian@cs.umass.edu Dept. of Computer Science, University of Massachusetts,

More information

Full text available at: Federated Search

Full text available at:  Federated Search Federated Search Federated Search Milad Shokouhi Microsoft Research Cambridge, CB30FB UK milads@microsoft.com Luo Si Purdue University West Lafayette, IN 47907-2066 USA lsi@cs.purdue.edu Boston Delft Foundations

More information

[23] T. W. Yan, and H. Garcia-Molina. SIFT { A Tool for Wide-Area Information Dissemination. USENIX

[23] T. W. Yan, and H. Garcia-Molina. SIFT { A Tool for Wide-Area Information Dissemination. USENIX [23] T. W. Yan, and H. Garcia-Molina. SIFT { A Tool for Wide-Area Information Dissemination. USENIX 1995 Technical Conference, 1995. [24] C. Yu, K. Liu, W. Wu, W. Meng, and N. Rishe. Finding the Most Similar

More information

Classification-Aware Hidden-Web Text Database Selection

Classification-Aware Hidden-Web Text Database Selection 6 Classification-Aware Hidden-Web Text Database Selection PANAGIOTIS G. IPEIROTIS New York University and LUIS GRAVANO Columbia University Many valuable text databases on the web have noncrawlable contents

More information

Merging algorithms for enterprise search

Merging algorithms for enterprise search Merging algorithms for enterprise search PengFei (Vincent) Li Australian National University u4959060@anu.edu.au Paul Thomas CSIRO and Australian National University paul.thomas@csiro.au David Hawking

More information

Document Allocation Policies for Selective Searching of Distributed Indexes

Document Allocation Policies for Selective Searching of Distributed Indexes Document Allocation Policies for Selective Searching of Distributed Indexes Anagha Kulkarni and Jamie Callan Language Technologies Institute School of Computer Science Carnegie Mellon University 5 Forbes

More information

University of Glasgow at TREC2004: Experiments in Web, Robust and Terabyte tracks with Terrier

University of Glasgow at TREC2004: Experiments in Web, Robust and Terabyte tracks with Terrier University of Glasgow at TREC2004: Experiments in Web, Robust and Terabyte tracks with Terrier Vassilis Plachouras, Ben He, and Iadh Ounis University of Glasgow, G12 8QQ Glasgow, UK Abstract With our participation

More information

Efficient distributed selective search

Efficient distributed selective search DOI 10.1007/s10791-016-9290-6 INFORMATION RETRIEVAL EFFICIENCY Efficient distributed selective search Yubin Kim 1 Jamie Callan 1 J. Shane Culpepper 2 Alistair Moffat 3 Received: 27 May 2016 / Accepted:

More information

Search Engines. Provide a ranked list of documents. May provide relevance scores. May have performance information.

Search Engines. Provide a ranked list of documents. May provide relevance scores. May have performance information. Search Engines Provide a ranked list of documents. May provide relevance scores. May have performance information. 3 External Metasearch Metasearch Engine Search Engine A Search Engine B Search Engine

More information

Retrieval and Feedback Models for Blog Distillation

Retrieval and Feedback Models for Blog Distillation Retrieval and Feedback Models for Blog Distillation CMU at the TREC 2007 Blog Track Jonathan Elsas, Jaime Arguello, Jamie Callan, Jaime Carbonell CMU s Blog Distillation Focus Two Research Questions: What

More information

Frontiers in Web Data Management

Frontiers in Web Data Management Frontiers in Web Data Management Junghoo John Cho UCLA Computer Science Department Los Angeles, CA 90095 cho@cs.ucla.edu Abstract In the last decade, the Web has become a primary source of information

More information

Automatic Structured Query Transformation Over Distributed Digital Libraries

Automatic Structured Query Transformation Over Distributed Digital Libraries Automatic Structured Query Transformation Over Distributed Digital Libraries M. Elena Renda I.I.T. C.N.R. and Scuola Superiore Sant Anna I-56100 Pisa, Italy elena.renda@iit.cnr.it Umberto Straccia I.S.T.I.

More information

Navigating the User Query Space

Navigating the User Query Space Navigating the User Query Space Ronan Cummins 1, Mounia Lalmas 2, Colm O Riordan 3 and Joemon M. Jose 1 1 School of Computing Science, University of Glasgow, UK 2 Yahoo! Research, Barcelona, Spain 3 Dept.

More information

External Query Reformulation for Text-based Image Retrieval

External Query Reformulation for Text-based Image Retrieval External Query Reformulation for Text-based Image Retrieval Jinming Min and Gareth J. F. Jones Centre for Next Generation Localisation School of Computing, Dublin City University Dublin 9, Ireland {jmin,gjones}@computing.dcu.ie

More information

Query-Based Sampling using Only Snippets

Query-Based Sampling using Only Snippets Query-Based Sampling using Only Snippets Almer S. Tigelaar and Djoerd Hiemstra {tigelaaras, hiemstra}@cs.utwente.nl Abstract Query-based sampling is a popular approach to model the content of an uncooperative

More information

Distributed similarity search algorithm in distributed heterogeneous multimedia databases

Distributed similarity search algorithm in distributed heterogeneous multimedia databases Information Processing Letters 75 (2000) 35 42 Distributed similarity search algorithm in distributed heterogeneous multimedia databases Ju-Hong Lee a,1, Deok-Hwan Kim a,2, Seok-Lyong Lee a,3, Chin-Wan

More information

An Investigation of Basic Retrieval Models for the Dynamic Domain Task

An Investigation of Basic Retrieval Models for the Dynamic Domain Task An Investigation of Basic Retrieval Models for the Dynamic Domain Task Razieh Rahimi and Grace Hui Yang Department of Computer Science, Georgetown University rr1042@georgetown.edu, huiyang@cs.georgetown.edu

More information

Anatomy of a search engine. Design criteria of a search engine Architecture Data structures

Anatomy of a search engine. Design criteria of a search engine Architecture Data structures Anatomy of a search engine Design criteria of a search engine Architecture Data structures Step-1: Crawling the web Google has a fast distributed crawling system Each crawler keeps roughly 300 connection

More information

Modeling and Managing Changes in Text Databases

Modeling and Managing Changes in Text Databases Modeling and Managing Changes in Text Databases PANAGIOTIS G. IPEIROTIS New York University and ALEXANDROS NTOULAS Microsoft Search Labs and JUNGHOO CHO University of California, Los Angeles and LUIS GRAVANO

More information

for Searching Social Media Posts

for Searching Social Media Posts Mining the Temporal Statistics of Query Terms for Searching Social Media Posts ICTIR 17 Amsterdam Oct. 1 st 2017 Jinfeng Rao Ferhan Ture Xing Niu Jimmy Lin Task: Ad-hoc Search on Social Media domain Stream

More information

number of documents in global result list

number of documents in global result list Comparison of different Collection Fusion Models in Distributed Information Retrieval Alexander Steidinger Department of Computer Science Free University of Berlin Abstract Distributed information retrieval

More information

Modeling and Managing Changes in Text Databases

Modeling and Managing Changes in Text Databases 14 Modeling and Managing Changes in Text Databases PANAGIOTIS G. IPEIROTIS New York University ALEXANDROS NTOULAS Microsoft Search Labs JUNGHOO CHO University of California, Los Angeles and LUIS GRAVANO

More information

Relevance Score Normalization for Metasearch

Relevance Score Normalization for Metasearch Relevance Score Normalization for Metasearch Mark Montague Department of Computer Science Dartmouth College 6211 Sudikoff Laboratory Hanover, NH 03755 montague@cs.dartmouth.edu Javed A. Aslam Department

More information

Identifying Redundant Search Engines in a Very Large Scale Metasearch Engine Context

Identifying Redundant Search Engines in a Very Large Scale Metasearch Engine Context Identifying Redundant Search Engines in a Very Large Scale Metasearch Engine Context Ronak Desai 1, Qi Yang 2, Zonghuan Wu 3, Weiyi Meng 1, Clement Yu 4 1 Dept. of CS, SUNY Binghamton, Binghamton, NY 13902,

More information

A Constrained Spreading Activation Approach to Collaborative Filtering

A Constrained Spreading Activation Approach to Collaborative Filtering A Constrained Spreading Activation Approach to Collaborative Filtering Josephine Griffith 1, Colm O Riordan 1, and Humphrey Sorensen 2 1 Dept. of Information Technology, National University of Ireland,

More information

Document and Query Expansion Models for Blog Distillation

Document and Query Expansion Models for Blog Distillation Document and Query Expansion Models for Blog Distillation Jaime Arguello, Jonathan L. Elsas, Changkuk Yoo, Jamie Callan, Jaime G. Carbonell Language Technologies Institute, School of Computer Science,

More information

Modeling and Managing Content Changes in Text Databases

Modeling and Managing Content Changes in Text Databases Modeling and Managing Content Changes in Text Databases Panagiotis G. Ipeirotis Alexandros Ntoulas Junghoo Cho Luis Gravano Abstract Large amounts of (often valuable) information are stored in web-accessible

More information

A BELIEF NETWORK MODEL FOR EXPERT SEARCH

A BELIEF NETWORK MODEL FOR EXPERT SEARCH A BELIEF NETWORK MODEL FOR EXPERT SEARCH Craig Macdonald, Iadh Ounis Department of Computing Science, University of Glasgow, Glasgow, G12 8QQ, UK craigm@dcs.gla.ac.uk, ounis@dcs.gla.ac.uk Keywords: Expert

More information

Shard Ranking and Cutoff Estimation for Topically Partitioned Collections

Shard Ranking and Cutoff Estimation for Topically Partitioned Collections Shard Ranking and Cutoff Estimation for Topically Partitioned Collections Anagha Kulkarni Almer S. Tigelaar Djoerd Hiemstra Jamie Callan Language Technologies Institute, School of Computer Science, Carnegie

More information

Full text available at: Aggregated Search

Full text available at:   Aggregated Search Aggregated Search Jaime Arguello School of Information and Library Science University of North Carolina at Chapel Hill, United States jarguello@unc.edu Boston Delft Foundations and Trends R in Information

More information

Efficient Execution of Dependency Models

Efficient Execution of Dependency Models Efficient Execution of Dependency Models Samuel Huston Center for Intelligent Information Retrieval University of Massachusetts Amherst Amherst, MA, 01002, USA sjh@cs.umass.edu W. Bruce Croft Center for

More information

Recommendation System for Location-based Social Network CS224W Project Report

Recommendation System for Location-based Social Network CS224W Project Report Recommendation System for Location-based Social Network CS224W Project Report Group 42, Yiying Cheng, Yangru Fang, Yongqing Yuan 1 Introduction With the rapid development of mobile devices and wireless

More information

TREC 2016 Dynamic Domain Track: Exploiting Passage Representation for Retrieval and Relevance Feedback

TREC 2016 Dynamic Domain Track: Exploiting Passage Representation for Retrieval and Relevance Feedback RMIT @ TREC 2016 Dynamic Domain Track: Exploiting Passage Representation for Retrieval and Relevance Feedback Ameer Albahem ameer.albahem@rmit.edu.au Lawrence Cavedon lawrence.cavedon@rmit.edu.au Damiano

More information

Does Selective Search Benefit from WAND Optimization?

Does Selective Search Benefit from WAND Optimization? Does Selective Search Benefit from WAND Optimization? Yubin Kim 1(B), Jamie Callan 1, J. Shane Culpepper 2, and Alistair Moffat 3 1 Carnegie Mellon University, Pittsburgh, USA yubink@cmu.edu 2 RMIT University,

More information

Chapter 6: Information Retrieval and Web Search. An introduction

Chapter 6: Information Retrieval and Web Search. An introduction Chapter 6: Information Retrieval and Web Search An introduction Introduction n Text mining refers to data mining using text documents as data. n Most text mining tasks use Information Retrieval (IR) methods

More information

Ranking Users for Intelligent Message Addressing

Ranking Users for Intelligent Message Addressing Ranking Users for Intelligent Message Addressing Vitor R. Carvalho 1 and William W. Cohen 1,2 Language Technologies Institute 1 and Machine Learning Department 2 Carnegie Mellon University, Pittsburgh,

More information

Using Coherence-based Measures to Predict Query Difficulty

Using Coherence-based Measures to Predict Query Difficulty Using Coherence-based Measures to Predict Query Difficulty Jiyin He, Martha Larson, and Maarten de Rijke ISLA, University of Amsterdam {jiyinhe,larson,mdr}@science.uva.nl Abstract. We investigate the potential

More information

A Constrained Spreading Activation Approach to Collaborative Filtering

A Constrained Spreading Activation Approach to Collaborative Filtering A Constrained Spreading Activation Approach to Collaborative Filtering Josephine Griffith 1, Colm O Riordan 1, and Humphrey Sorensen 2 1 Dept. of Information Technology, National University of Ireland,

More information

Exploiting Global Impact Ordering for Higher Throughput in Selective Search

Exploiting Global Impact Ordering for Higher Throughput in Selective Search Exploiting Global Impact Ordering for Higher Throughput in Selective Search Michał Siedlaczek [0000-0002-9168-0851], Juan Rodriguez [0000-0001-6483-6956], and Torsten Suel [0000-0002-8324-980X] Computer

More information

Query Expansion with the Minimum User Feedback by Transductive Learning

Query Expansion with the Minimum User Feedback by Transductive Learning Query Expansion with the Minimum User Feedback by Transductive Learning Masayuki OKABE Information and Media Center Toyohashi University of Technology Aichi, 441-8580, Japan okabe@imc.tut.ac.jp Kyoji UMEMURA

More information

Real-time Query Expansion in Relevance Models

Real-time Query Expansion in Relevance Models Real-time Query Expansion in Relevance Models Victor Lavrenko and James Allan Center for Intellignemt Information Retrieval Department of Computer Science 140 Governor s Drive University of Massachusetts

More information
