Distributed Information Retrieval


1 Distributed Information Retrieval
Fabio Crestani and Ilya Markov
University of Lugano, Switzerland

2 Outline
1 Motivations: Deep Web, Federated Search, Metasearch, Aggregated Search

3 Why do we need DIR?
There are limits to what a search engine can find on the web.
Not everything that is on the web is or can be harvested.
The one-size-fits-all approach of web search engines has many limitations.
Often there is more than one type of answer to the same query.
Thus: Deep Web, Federated Search, Metasearch, Aggregated Search.

4 Deep Web
There is a lot of information on the web that cannot be accessed by search engines (deep or hidden web).
There are many different reasons why this information is not accessible to crawlers.
This is often very valuable information!
Web search engines can only be used to identify the resource (if possible); the user then has to deal with it directly.
Even if this information could be crawled, there are good reasons not to...

5 Federated Search
Federated Search is another name for DIR.
Federated search systems do not crawl a resource, but pass a user query to the search facilities of the resource itself.
Why would this be better?
Preserves the property rights of the resource owner.
Search facilities are optimised for the specific resource.
Index is always up-to-date.
The resource is curated and of high quality.
Examples of federated search systems: PubMed, FedStats, WestLaw, and Cheshire.

6 Metasearch
Even the largest search engine cannot effectively crawl the entire web.
Different search engines crawl different disjoint portions of the web.
Different search engines use different ranking functions.
Metasearch engines do not crawl the web, but pass a user query to a number of search engines and then present the fused result set.
Examples of metasearch systems: Dogpile, MetaCrawler, AllInOneNews, and SavvySearch.


7 Aggregated Search
Often there is more than one type of information relevant to a query (e.g. web pages, images, maps, reviews, etc.).
These types of information are indexed and ranked by separate sub-systems.
Presenting this information in an aggregated way is more useful to the user.

8 Outline
1 Motivations
2 Peer-to-Peer Network, Crawling, Metadata Harvesting, Hybrid


9 A Taxonomy of DIR Systems
A taxonomy of DIR architectures can be built by considering where the indexes are kept.
This suggests 4 different types of architectures: broker-based, peer-to-peer, crawling, and meta-data harvesting.
[Figure: taxonomy of DIR architectures, with nodes Global collection, Distributed indexes, Centralised index, P2P, Broker, Crawling, Meta-data harvesting, Hybrid.]

10 Peer-to-Peer Networks
Indexes are located with the resources.
Some parts of the indexes are distributed to other resources.
Queries are distributed across the resources and results are merged by the peer that originated the query.

11 Broker-Based Architecture
Indexes are located with the resources.
Queries are forwarded to resources and results are merged by the broker.


12 Crawling
Resources are crawled and documents are harvested.
Indexes are centralised.
Queries are carried out in a centralised way and documents are fetched from the resources or from storage.


13 Metadata Harvesting
Indexes are located with the resources, but metadata are harvested according to some protocol (off-line phase), for example OAI-PMH.
Queries are carried out at the broker level (on-line phase) to identify relevant documents by their metadata; these documents are then requested from the resources.


14 Indexing Harvesting
It is possible to crawl the indexes instead of the metadata, according to some protocol (off-line phase), for example OAI-PMH.
Queries are carried out at the broker level (on-line phase) to identify relevant documents by the documents' full content; these documents are then requested from the resources.

15 Outline
1 Motivations

16 Cooperative, Uncooperative

17 Stanford Protocol Proposal for Internet Retrieval and Search (STARTS)
Query language: filter expressions, ranking expressions.
Retrieved documents: unnormalized score, source, statistics (term frequency, term weight, document frequency, document size, document count).
Source metadata and content summary: vocabulary, statistics (term frequency, document frequency, number of documents).
Luis Gravano, Chen-Chuan K. Chang, Héctor García-Molina, and Andreas Paepcke. STARTS: Stanford proposal for internet meta-searching. SIGMOD Rec., 26(2).

18 Query-Based Sampling
[Figure: sampling loop. A query is sent to the resource, which returns documents (2-10 per query).]
Questions:
How to select queries?
What are the stopping criteria?
Jamie Callan and Margaret Connell. Query-based sampling of text databases. ACM Trans. Inf. Syst., 19(2):97-130.
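The sampling loop can be sketched as follows. This is a minimal illustration, not the procedure from the cited paper: `search` is a hypothetical callable standing in for the resource's search interface, query terms are drawn at random from the documents sampled so far, and the stopping rule is a simple document budget (`max_docs`), all of which are assumptions.

```python
import random

def query_based_sampling(search, seed_terms, max_docs=300, docs_per_query=4, seed=0):
    """Sample documents from a resource using only its search interface.

    search(term, k) is a hypothetical callable returning up to k
    (doc_id, text) pairs; it is not part of any real library.
    """
    rng = random.Random(seed)
    sample = {}                                   # doc_id -> text
    vocabulary = list(dict.fromkeys(seed_terms))  # ordered, de-duplicated terms
    known = set(vocabulary)
    used = set()                                  # terms already sent as queries
    while len(sample) < max_docs:
        candidates = [t for t in vocabulary if t not in used]
        if not candidates:
            break                                 # no unused query terms left
        term = rng.choice(candidates)             # assumption: random term selection
        used.add(term)
        for doc_id, text in search(term, docs_per_query):
            if doc_id in sample:
                continue
            sample[doc_id] = text
            for word in text.lower().split():     # grow the query vocabulary
                if word not in known:
                    known.add(word)
                    vocabulary.append(word)
            if len(sample) >= max_docs:
                break
    return sample

# Toy resource used only to exercise the loop.
DOCS = {1: "apple banana", 2: "banana cherry", 3: "cherry date", 4: "date apple"}

def toy_search(term, k):
    hits = [(i, t) for i, t in sorted(DOCS.items()) if term in t.split()]
    return hits[:k]

sample = query_based_sampling(toy_search, ["apple"], max_docs=3, docs_per_query=2)
```

Replacing the random choice with a frequency-based or query-log-based choice changes only the line that picks `term`.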

19 Sampling Queries
Other resource description (ord).
Learned resource description (lrd): random, document frequency (df), collection frequency (ctf), average term frequency (ctf/df).
Query logs.
Jamie Callan and Margaret Connell. Query-based sampling of text databases. ACM Trans. Inf. Syst., 19(2):97-130.
Milad Shokouhi, Justin Zobel, Seyed M. M. Tahaghoghi, and Falk Scholer. Using query logs to establish vocabularies in distributed information retrieval. Inf. Process. Manage., 43(1).

20 Stopping Criteria
A fixed number of documents.
n documents out of N resources.
Collection size.
Vocabulary size.
Newly downloaded terms are not likely to appear in future queries.
$Q$: set of training queries; $\theta_k$: language model of a sample at the k-th iteration.
$$p(Q \mid \theta_k) = \prod_{i=1}^{N_q} \prod_{j=1}^{|q_i|} p(t = q_{ij} \mid \theta_k), \qquad l(\theta_k, Q) = \log p(Q \mid \theta_k)$$
$$\phi_k = l(\theta_k, Q) - l(\theta_{k-1}, Q) = \log \frac{p(Q \mid \theta_k)}{p(Q \mid \theta_{k-1})}$$
Here $N_q$ is the number of training queries and $q_{ij}$ is the j-th term of the i-th query.
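The likelihood-based criterion can be sketched as below. The additive smoothing (`mu`), the fixed `vocab_size`, and the threshold `epsilon` are assumptions made for the sketch; the slide only defines the log-likelihood difference between consecutive iterations.

```python
import math
from collections import Counter

def log_likelihood(queries, sample_docs, vocab_size, mu=1.0):
    """Log-likelihood of the training queries under a smoothed unigram
    language model estimated from the sampled documents."""
    counts = Counter(w for doc in sample_docs for w in doc.split())
    total = sum(counts.values())
    ll = 0.0
    for query in queries:
        for term in query.split():
            # Additive smoothing (an assumption) keeps unseen terms finite.
            ll += math.log((counts[term] + mu) / (total + mu * vocab_size))
    return ll

def should_stop(queries, prev_sample, curr_sample, vocab_size, epsilon=0.01):
    """Stop sampling when the change in log-likelihood between two
    consecutive iterations falls below epsilon (hypothetical threshold)."""
    phi = (log_likelihood(queries, curr_sample, vocab_size)
           - log_likelihood(queries, prev_sample, vocab_size))
    return abs(phi) < epsilon, phi

stop, phi = should_stop(["apple pie"], ["apple"], ["apple", "apple pie"], vocab_size=10)
same_stop, same_phi = should_stop(["apple pie"], ["apple"], ["apple"], vocab_size=10)
```

In the first call the new sample explains the training queries better, so the difference is positive and sampling continues; in the second call nothing changed, so the criterion fires.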

21 Stopping Criteria Bibliography
Leif Azzopardi, Mark Baillie, and Fabio Crestani. Adaptive query-based sampling for distributed IR. In Proceedings of the ACM SIGIR. ACM.
Mark Baillie, Leif Azzopardi, and Fabio Crestani. An adaptive stopping criteria for query-based sampling of distributed collections. In String Processing and Information Retrieval (SPIRE).
James Caverlee, Ling Liu, and Joonsoo Bae. Distributed query sampling: a quality-conscious approach. In Proceedings of the ACM SIGIR. ACM.

22 Summary
Cooperative (STARTS).
Uncooperative (QBS): sampling queries, stopping criteria.


24 Lexicon-Based, Document-Surrogate

25 Lexicon-Based Approaches

26 Collection Retrieval Inference Network (CORI)
Collection = super-document.
Bayesian inference network on super-documents.
Adapted Okapi:
$$T = \frac{df_{t,i}}{df_{t,i} + 50 + 150 \cdot cw_i / avg\_cw}, \qquad I = \frac{\log\left(\frac{|C| + 0.5}{cf_t}\right)}{\log(|C| + 1.0)}$$
$$p(t \mid C_i) = b + (1 - b) \cdot T \cdot I$$
Collections are ranked according to $p(q \mid C_i)$.
James P. Callan, Zhihong Lu, and W. Bruce Croft. Searching distributed collections with inference networks. In Proceedings of the ACM SIGIR. ACM.
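A sketch of CORI collection ranking under stated assumptions: the constants 50 and 150 and the default belief b = 0.4 are the commonly cited CORI defaults, and the query belief is taken as the mean of the term beliefs. The dictionary layout of `collections` is hypothetical.

```python
import math

def cori_belief(df, cw, avg_cw, cf, num_collections, b=0.4):
    """Belief p(t | C_i) for one term and one collection."""
    t = df / (df + 50 + 150 * cw / avg_cw)
    i = math.log((num_collections + 0.5) / cf) / math.log(num_collections + 1.0)
    return b + (1 - b) * t * i

def cori_rank(query_terms, collections, b=0.4):
    """Rank collections for a query.

    collections maps a name to {"cw": word count, "df": {term: doc frequency}}
    (a hypothetical layout chosen for this sketch).
    """
    n = len(collections)
    avg_cw = sum(c["cw"] for c in collections.values()) / n
    scores = {}
    for name, coll in collections.items():
        beliefs = []
        for term in query_terms:
            # cf: number of collections that contain the term
            cf = sum(1 for other in collections.values() if other["df"].get(term, 0) > 0)
            df = coll["df"].get(term, 0)
            # A term found nowhere contributes only the default belief b.
            beliefs.append(b if cf == 0 else cori_belief(df, coll["cw"], avg_cw, cf, n, b))
        scores[name] = sum(beliefs) / len(beliefs)  # assumption: mean of term beliefs
    return sorted(scores, key=scores.get, reverse=True)

collections = {
    "A": {"cw": 1000, "df": {"retrieval": 50}},
    "B": {"cw": 1000, "df": {"retrieval": 5}},
}
ranking = cori_rank(["retrieval"], collections)
```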

27 Glossary-of-Servers Server (GlOSS)
$$Goodness(q, l, C) = \sum_{d \in Rank(q, l, C)} sim(q, d)$$
$$Rank(q, l, C) = \{ d \in C \mid sim(q, d) > l \}$$
Cooperative.
Document and term statistics.
Luis Gravano, Héctor García-Molina, and Anthony Tomasic. GlOSS: text-source discovery over the internet. ACM Trans. Database Syst., 24(2).

28 Document-Surrogate Approaches

29 Relevant Document Distribution Estimation (ReDDE)
Idea: if 1 sampled document is relevant to a query, then $|C| / |S_C|$ similar documents in the collection $C$ are relevant to the query.
$$R(C, Q) \approx \sum_{d \in S_C} p(r \mid d) \cdot \frac{|C|}{|S_C|}$$
Luo Si and Jamie Callan. Relevant document distribution estimation method for resource selection. In Proceedings of the ACM SIGIR. ACM.

30 Relevant Document Distribution Estimation (ReDDE)
Ranked sampled documents approximate ranked documents in a centralized retrieval system.
Idea: if a document $d_j$ appears before a document $d_i$ in a sample, then $|C_j| / |S_{C_j}|$ documents appear before $d_i$ in a centralized retrieval system.
$$Rank_{centralized}(d_i) = \sum_{d_j : Rank_{sample}(d_j) < Rank_{sample}(d_i)} \frac{|C_j|}{|S_{C_j}|}$$
Luo Si and Jamie Callan. Relevant document distribution estimation method for resource selection. In Proceedings of the ACM SIGIR. ACM.

31 Relevant Document Distribution Estimation (ReDDE)
$$R(C, Q) \approx \sum_{d \in S_C} p(r \mid d) \cdot \frac{|C|}{|S_C|}$$
$$Rank_{centralized}(d_i) = \sum_{d_j : Rank_{sample}(d_j) < Rank_{sample}(d_i)} \frac{|C_j|}{|S_{C_j}|}$$
$$p(r \mid d) = \begin{cases} \alpha & \text{if } Rank_{centralized}(d) < \beta \cdot \sum_i |C_i| \\ 0 & \text{otherwise} \end{cases}$$
Luo Si and Jamie Callan. Relevant document distribution estimation method for resource selection. In Proceedings of the ACM SIGIR. ACM.
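The three ReDDE formulas combine into a single pass over the ranked sample. The sketch below assumes the broker has already ranked the sampled documents for the query and knows (or has estimated) each collection size; the values of `alpha` and `beta` are illustrative, not taken from the slides.

```python
def redde_scores(ranked_sample, sizes, sample_sizes, alpha=1.0, beta=0.003):
    """Estimate the number of relevant documents per collection.

    ranked_sample: collection name of each sampled document, best first.
    sizes / sample_sizes: collection name -> |C| and |S_C|.
    """
    threshold = beta * sum(sizes.values())
    scores = {c: 0.0 for c in sizes}
    rank_centralized = 0.0  # estimated number of documents ranked before this one
    for coll in ranked_sample:
        weight = sizes[coll] / sample_sizes[coll]  # |C| / |S_C|
        if rank_centralized < threshold:
            scores[coll] += alpha * weight         # p(r|d) = alpha near the top
        rank_centralized += weight                 # this document now precedes the rest
    return scores

scores = redde_scores(
    ["A", "A", "B", "A"],
    sizes={"A": 1000, "B": 1000},
    sample_sizes={"A": 10, "B": 10},
    beta=0.15,
)
```

With these toy numbers each sampled document stands for 100 documents, the threshold is 300, and only the first three sampled documents count as relevant.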

32 Centralized-Rank Collection Selection (CRCS)
$$R(C, Q) \approx \sum_{d \in S_C} p(r \mid d) \cdot \frac{|C|}{|S_C|}$$
Linear:
$$R(d) = \begin{cases} \gamma - Rank_{sample}(d) & \text{if } Rank_{sample}(d) < \gamma \\ 0 & \text{otherwise} \end{cases}$$
Exponential:
$$R(d) = \alpha \exp(-\beta \cdot Rank_{sample}(d))$$
$$p(r \mid d) = \frac{R(d)}{|C_{max}|}$$
Milad Shokouhi. Central-rank-based collection selection in uncooperative distributed information retrieval. In ECIR.
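A sketch of both CRCS variants. Ranks are counted from 0 here, and the default values of `gamma`, `alpha` and `beta` are illustrative; both choices are assumptions of this sketch.

```python
import math

def crcs_scores(ranked_sample, sizes, sample_sizes, mode="linear",
                gamma=50, alpha=1.2, beta=0.28):
    """Score collections from the rank positions of their sampled documents.

    ranked_sample: collection name of each sampled document, best first.
    """
    c_max = max(sizes.values())                 # size of the largest collection
    scores = {c: 0.0 for c in sizes}
    for rank, coll in enumerate(ranked_sample):  # assumption: top rank is 0
        if mode == "linear":
            impact = gamma - rank if rank < gamma else 0.0
        else:
            impact = alpha * math.exp(-beta * rank)
        # p(r|d) = R(d) / |C_max|, scaled by |C| / |S_C|
        scores[coll] += (impact / c_max) * sizes[coll] / sample_sizes[coll]
    return scores

linear = crcs_scores(["A", "B"], {"A": 100, "B": 200}, {"A": 10, "B": 10}, gamma=2)
expo = crcs_scores(["A", "B"], {"A": 100, "B": 100}, {"A": 10, "B": 10}, mode="exponential")
```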

33 Comparison
[Fig. 3: R values for the CORI, ReDDE and CRCS algorithms on the trec123-FR-DOE-81col (left) and 100-col-gov2 (right) testbeds; x-axis: cutoff, y-axis: R.]
[Table 2: Performance of different methods for the Trec4 (trec4-kmeans) testbed. TREC topics (long) were used as queries.]
[Table 3: Performance of collection selection methods for the uniform (trec colbysource) testbed. TREC topics (short) were used as queries.]
[Table 4: Performance of collection selection methods for the representative (trec123-2ldb-60col) testbed. TREC topics (short) were used as queries.]
Each table reports P@5, P@10, P@15 and P@20 for CORI, ReDDE, CRCS(l) and CRCS(e) at cutoff=1 and cutoff=5.
Cutoff=1 shows the system outputs when only the best collection is selected; for cutoff=5 the best five collections are searched for the query.
ReDDE is selected as the baseline, as it does not require training queries and its effectiveness is found to be higher than CORI and older alternatives [Si and Callan, 2003a].
Statistical significance is calculated with the t-test.
Milad Shokouhi. Central-rank-based collection selection in uncooperative distributed information retrieval. In ECIR.

34 Comparison
[Table 5: Performance of collection selection methods for the relevant (trec123-AP-WSJ-60col) testbed. TREC topics (short) were used as queries.]
[Table 6: Performance of collection selection methods for the non-relevant (trec123-FR-DOE-81col) testbed. TREC topics (short) were used as queries.]
[Table 7: Performance of collection selection methods for the gov2 (100-col-gov2) testbed. TREC topics (short) were used as queries.]
Each table reports CORI, ReDDE, CRCS(l) and CRCS(e) at cutoff=1 and cutoff=5.
On the relevant testbed (Table 5), all precision values for CORI are significantly inferior to those of ReDDE for both cutoff values. ReDDE in general produces higher precision values than the CRCS methods. However, none of the gaps is detected as statistically significant by the t-test at the 99% confidence interval.
The results for CRCS(l), ReDDE and CORI are comparable on the non-relevant testbed (Table 6). CRCS(e) significantly outperforms the other methods in most cases.
On the gov2 testbed (Table 7), CORI produces the best results when cutoff=1, while in the other scenarios there is no significant difference between the methods.
Overall, we can conclude that CRCS(e) selects better collections [...]
Milad Shokouhi. Central-rank-based collection selection in uncooperative distributed information retrieval. In ECIR.

35 Summary
Lexicon-based: CORI, GlOSS.
Document-surrogate: ReDDE, CRCS.
Others.


37 Estimating Collection Size

38 Capture-Recapture
$X$: event that a randomly sampled document is already in a sample.
$X_n$: the same for n randomly sampled documents.
$$E[X] = \frac{|S|}{|C|}, \qquad E[X_n] = n \cdot E[X] = n \cdot \frac{|S|}{|C|}$$
Two samples $S_1$ and $S_2$:
$$|S_1 \cap S_2| \approx \frac{|S_1| \cdot |S_2|}{|C|} \implies |\hat{C}| = \frac{|S_1| \cdot |S_2|}{|S_1 \cap S_2|}$$
Take two samples.
Count the number of common documents.
King-Lup Liu, Clement Yu, and Weiyi Meng. Discovering the representative of a search engine. In Proceedings of the ACM CIKM. ACM.
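The estimator is a one-liner once the two samples are available. The sketch below assumes documents carry identifiers that can be compared across samples.

```python
def capture_recapture(sample1, sample2):
    """Estimate collection size from two independent samples of document ids."""
    s1, s2 = set(sample1), set(sample2)
    common = len(s1 & s2)
    if common == 0:
        return None  # the estimate is undefined without common documents
    return len(s1) * len(s2) / common

# Two samples of 100 documents sharing 50 documents.
estimate = capture_recapture(range(0, 100), range(50, 150))
```

The fewer documents the two samples share, the larger the estimated collection.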

39 Sample-Resample
Randomly pick a term $t$ from a sample.
$A$: event that some sampled document contains $t$.
$B$: event that some document from the resource contains $t$.
$$P(A) = \frac{df_{t,S}}{|S|}, \qquad P(B) = \frac{df_{t,C}}{|C|}$$
$$P(A) \approx P(B) \implies |\hat{C}| = \frac{df_{t,C} \cdot |S|}{df_{t,S}}$$
Send the query $t$ to the resource to estimate $df_{t,C}$.
Luo Si and Jamie Callan. Relevant document distribution estimation method for resource selection. In Proceedings of the ACM SIGIR. ACM.
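A sketch of the sample-resample estimate. Averaging over several probe terms is an assumption added here to reduce variance; the per-term formula is the one on the slide, and the resource-side document frequency is assumed to come from the hit count the resource reports for the one-term query.

```python
def sample_resample(df_sample, sample_size, df_resource):
    """Size estimate from one probe term: df_{t,C} * |S| / df_{t,S}."""
    if df_sample == 0:
        raise ValueError("the probe term must occur in the sample")
    return df_resource * sample_size / df_sample

def estimate_size(probes, sample_size):
    """Average the estimate over several (df_sample, df_resource) probes."""
    estimates = [sample_resample(s, sample_size, r) for s, r in probes]
    return sum(estimates) / len(estimates)

single = sample_resample(df_sample=6, sample_size=300, df_resource=2000)
averaged = estimate_size([(6, 2000), (3, 900)], sample_size=300)
```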


41 Results Merging
The results merging process:
1 Selected resources return their top-ranked documents to the broker.
2 The broker merges the documents and returns a fused list to the user.
Not to be confused with data fusion, where the results come from a single resource and are then ranked by multiple retrieval models.

42 Issues
The results merging process involves a number of issues:
1 Duplicate detection and removal.
2 Normalising and merging relevance scores.
Different solutions have been proposed for these issues, depending on the DIR environment.

43 Results Merging in Cooperative Environments
Results merging in cooperative environments is much simpler and has different solutions:
1 Fetch documents from each resource, reindex and rank them according to the broker IR model.
2 Get information about the way the document score is calculated and normalise the score.
At the highest level of collaboration it is possible to ask the resources to adopt the same retrieval model!

44 Collection Retrieval Inference Network (CORI)
The idea: linear combination of the score of the database and the score of the document.
Normalised scores
Normalised collection selection score:
$$C'_i = \frac{R_i - R_{min}}{R_{max} - R_{min}}$$
Normalised document score:
$$D'_j = \frac{D_j - D_{min}}{D_{max} - D_{min}}$$
Heuristic linear combination:
$$D''_j = \frac{D'_j + 0.4 \cdot D'_j \cdot C'_i}{1.4}$$
J.P. Callan, Z. Lu, and W.B. Croft. Searching distributed collections with inference networks. In Proceedings of the ACM SIGIR. ACM.
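A sketch of the CORI merging heuristic, assuming min-max normalisation of both the collection selection scores and the per-resource document scores. The input layout (`results`, `collection_scores`) is hypothetical.

```python
def min_max(value, low, high):
    """Min-max normalisation; returns 0.0 when all values are equal."""
    return 0.0 if high == low else (value - low) / (high - low)

def cori_merge(results, collection_scores):
    """Merge per-resource result lists into one ranked list.

    results: resource -> list of (doc_id, score) returned by that resource.
    collection_scores: resource -> collection selection score.
    """
    r_min = min(collection_scores.values())
    r_max = max(collection_scores.values())
    merged = []
    for resource, docs in results.items():
        if not docs:
            continue
        c_norm = min_max(collection_scores[resource], r_min, r_max)
        d_min = min(score for _, score in docs)
        d_max = max(score for _, score in docs)
        for doc_id, score in docs:
            d_norm = min_max(score, d_min, d_max)
            # Heuristic combination of document and collection scores.
            merged.append((doc_id, (d_norm + 0.4 * d_norm * c_norm) / 1.4))
    return sorted(merged, key=lambda pair: pair[1], reverse=True)

merged = cori_merge(
    {"A": [("a1", 10.0), ("a2", 5.0)], "B": [("b1", 0.9), ("b2", 0.1)]},
    {"A": 2.0, "B": 1.0},
)
```

The top document of the best collection keeps a score of 1, while the top document of the worst collection is capped at 1/1.4.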

45 Results Merging in Uncooperative Environments
In uncooperative environments resources might provide scores:
But the broker does not have any information on how these scores are computed.
Score normalisation requires some way of comparing scores.
Alternatively the resources might provide only rank positions:
But the broker does not have any information on the relevance of each document in the rank lists.
Merging the ranks requires some way of comparing rank positions.

46 Semi-Supervised Learning (SSL)
The idea: train a regression model for each collection that maps resource document scores to normalised scores.
Requires that some returned documents are found in the Collection Selection Index (CSI).
Two cases:
1 Resources use identical retrieval models.
2 Resources use different retrieval models.
L. Si and J. Callan. Using sampled data and regression to merge search engine results. In Proceedings of the ACM SIGIR. ACM.

47 SSL with Identical Retrieval Models
The idea: SSL uses documents found in the CSI to train a single regression model to estimate the normalised score ($D'_{i,j}$) from resource document scores ($D_{i,j}$) and the score of the same document computed from the CSI ($E_{i,j}$).
Normalised scores
Having:
$$D'_{i,j} = a \cdot E_{i,j} + b \cdot E_{i,j} \cdot C_i$$
Train:
$$\begin{bmatrix} D_{1,1} & C_1 D_{1,1} \\ D_{1,2} & C_1 D_{1,2} \\ \vdots & \vdots \\ D_{n,m} & C_n D_{n,m} \end{bmatrix} \begin{bmatrix} a \\ b \end{bmatrix} = \begin{bmatrix} E_{1,1} \\ E_{1,2} \\ \vdots \\ E_{n,m} \end{bmatrix}$$

48 SSL with Different Retrieval Models
The idea: SSL uses documents found in the CSI to train a different regression model for each resource.
Normalised scores
Having:
$$D'_{i,j} = a_i \cdot E_{i,j} + b_i$$
Train:
$$\begin{bmatrix} D_{1,1} & 1 \\ D_{1,2} & 1 \\ \vdots & \vdots \\ D_{n,m} & 1 \end{bmatrix} \begin{bmatrix} a_i \\ b_i \end{bmatrix} = \begin{bmatrix} E_{1,1} \\ E_{1,2} \\ \vdots \\ E_{n,m} \end{bmatrix}$$
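A sketch of the per-resource case. It follows the training system as read here, with resource scores as the regression input and CSI scores as the target, which is an interpretation rather than a confirmed reading of the original slide. Ordinary least squares with one slope and one intercept per resource is fitted on the overlap documents and then applied to every returned document. The input layout is hypothetical.

```python
def fit_linear(xs, ys):
    """Ordinary least squares fit of y = a * x + b."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    if sxx == 0:
        raise ValueError("need at least two distinct resource scores")
    a = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys)) / sxx
    return a, mean_y - a * mean_x

def ssl_merge(results, csi_scores):
    """Map resource-specific scores onto the CSI score scale and merge.

    results: resource -> list of (doc_id, resource_score).
    csi_scores: doc_id -> score from the CSI, for overlap documents only.
    """
    merged = []
    for resource, docs in results.items():
        overlap = [(score, csi_scores[doc]) for doc, score in docs if doc in csi_scores]
        a, b = fit_linear([x for x, _ in overlap], [y for _, y in overlap])
        merged.extend((doc, a * score + b) for doc, score in docs)
    return sorted(merged, key=lambda pair: pair[1], reverse=True)

merged = ssl_merge(
    {"A": [("a1", 10.0), ("a2", 6.0), ("a3", 2.0)],
     "B": [("b1", 0.8), ("b2", 0.4), ("b3", 0.2)]},
    {"a1": 1.0, "a3": 0.2, "b1": 0.6, "b3": 0.0},
)
normalised = dict(merged)
```

Documents a2 and b2 are not in the CSI, yet they receive comparable scores through the fitted lines of their own resource.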

49 Sample-Agglomerate Fitting Estimate (SAFE)
The idea: for a given query, the results from the CSI are a subranking of the original collection, so curve fitting to the subranking can be used to estimate the original scores.
It does not require the presence of overlap documents in the CSI.
M. Shokouhi and J. Zobel. Robust result merging using sample-based score estimates. In ACM Transactions on Information Systems, 27(3): 129.

50 Sample-Agglomerate Fitting Estimate (SAFE)
Normalised scores
1 The broker ranks the documents available in the CSI for the query.
2 For each resource, the sample documents (with non-zero score) are used to estimate the merging score, where each sample document is assumed to be representative of a fraction $|S_c| / |c|$ of the resource.
3 Use regression to fit a curve on the adjusted scores to predict the score of the documents returned by the resource.

51 More Results Merging in Uncooperative Environments
There are many other approaches to results merging:
STARTS uses the returned term frequency, document frequency, and document weight information to calculate the merging score based on similarities between documents.
CVV calculates the merging score according to the collection score and the position of a document in the returned collection rank list.
Another approach downloads small parts of the top returned documents and uses a reference index of term statistics for reranking and merging the downloaded documents.
L. Gravano, C. Chang, H. Garcia-Molina, and A. Paepcke. STARTS: Stanford proposal for internet meta-searching. In Proceedings of the ACM SIGMOD.
B. Yuwono and D. Lee. Server ranking for distributed text retrieval systems on the internet. In Proceedings of the Conference on Database Systems for Advanced, pages 41-50.
N. Craswell, D. Hawking, and P. Thistlewaite. Merging results from isolated search engines. In Proceedings of the Australasian Database Conference.

52 Data Fusion in Metasearch
In data fusion methods, documents in a single collection are ranked with different search engines.
The goal is to generate a single accurate ranking list from the ranking lists of different retrieval models.
There are no collection samples and no CSI.
The idea: use the voting principle: a document returned by many search systems should be ranked higher than the other documents.
If available, also take the rank of documents into account.

53 Metasearch Data Fusion Methods
Many data fusion methods have been proposed:
Round Robin.
CombMNZ, CombSum, CombMax, CombMin.
Logistic regression (convert ranks to estimated probabilities of relevance).
A comparison between score-based and rank-based methods suggests that rank-based methods are generally less effective.
E. Fox and J. Shaw. Combination of multiple searches. In Proceedings of TREC, 1994.
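The Comb* family can be sketched in a few lines. The sketch assumes the scores of each run are already normalised to a common range and that a document missing from a run simply contributes nothing; CombMNZ multiplies the summed score by the number of runs that returned the document.

```python
from collections import defaultdict

def comb_fusion(runs, method="CombSUM"):
    """Fuse several runs, each a dict doc_id -> normalised score."""
    scores = defaultdict(list)
    for run in runs:
        for doc_id, score in run.items():
            scores[doc_id].append(score)
    combine = {
        "CombSUM": sum,
        "CombMAX": max,
        "CombMIN": min,
        "CombMNZ": lambda s: sum(s) * len(s),  # sum times number of runs with a hit
    }[method]
    fused = {doc_id: combine(s) for doc_id, s in scores.items()}
    return sorted(fused.items(), key=lambda pair: pair[1], reverse=True)

runs = [{"d1": 0.9, "d2": 0.5}, {"d2": 0.6, "d3": 0.8}]
by_sum = [doc for doc, _ in comb_fusion(runs, "CombSUM")]
by_mnz = [doc for doc, _ in comb_fusion(runs, "CombMNZ")]
by_max = [doc for doc, _ in comb_fusion(runs, "CombMAX")]
```

Document d2 is returned by both runs, so the voting-style methods (CombSUM, CombMNZ) move it to the top, while CombMAX does not reward the agreement.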

54 Results Merging in Metasearch
We cannot use data fusion methods when collections are overlapping, but are not the same.
We cannot use data fusion methods when the retrieval models are different.
Web metasearch is the most typical example.
The idea: normalise the document scores returned by multiple search engines using a regression function that compares the scores of overlapped documents between the returned ranked lists.
In the absence of overlap between the results, most metasearch merging techniques become ineffective.
S. Wu and F. Crestani. Shadow document methods of results merging. In Proceedings of the ACM SAC, 2004.


57 Outline
1 Motivations

58 Vertical

59 Vertical
Vertical: specialized subcollection focused on a specific domain (e.g., news, travel, and local search) or a specific media type (e.g., images and video).
Vertical selection: the task of selecting the relevant verticals, if any, in response to a user's query.
DIR solution: 0-1 verticals.

60 Vertical Selection References
Fernando Diaz. Integration of news content into web results. In Proceedings of the ACM WSDM. ACM.
Jaime Arguello, Fernando Diaz, Jamie Callan, and Jean-Francois Crespo. Sources of evidence for vertical selection. In Proceedings of the ACM SIGIR. ACM.
Fernando Diaz and Jaime Arguello. Adaptation of offline vertical selection predictions in the presence of user feedback. In Proceedings of the ACM SIGIR. ACM.

61 Blog Distillation
The task of identifying blogs with a recurring central interest.
Analogy with federated search: a blog feed corresponds to a federated collection, and its posts to documents.
Jonathan L. Elsas, Jaime Arguello, Jamie Callan, and Jaime G. Carbonell. Retrieval and feedback models for blog feed search. In Proceedings of the ACM SIGIR. ACM.
Jangwon Seo and W. Bruce Croft. Blog site search using resource selection. In Proceedings of the ACM CIKM. ACM.

62 Personalized Metasearch

63 Personalized Metasearch
The broker provides a single search interface over all of a user's online resources.
Different collections: individual folders, address books, calendars, the Web.
Paul Thomas and David Hawking. Experiences evaluating personal metasearch. In IIiX.
Paul Thomas and David Hawking. Server selection methods in personal metasearch: a comparative empirical study. Inf. Retr., 12(5).

64 Others
Expert search: the task of identifying experts with a given expertise. Experts are represented by their related documents.
Desktop search: different file and document types; results fusion.

65 Summary
Vertical Selection.
Blog Distillation.
Personalized Metasearch.
Expert Search.
Desktop Search.
Your application.

66 Outline
1 Motivations

67 Evolving Collections

68 Updating Query-Based Sampling
Given N collections, n documents can be sampled at each time step.
Distribution methods: uniform, popularity-based, size-based.

69 Updating Methods Comparison
[Figure: comparison of the updating methods QL, CU and SS against Crawl, for 100-document samples at cutoffs CO=3 and CO=5. Source: SIGIR 2007 Proceedings, Session 21: Collection Representation in Distributed IR.]

70 Evolving Collections References
Panagiotis G. Ipeirotis, Alexandros Ntoulas, Junghoo Cho, and Luis Gravano. Modeling and managing content changes in text databases. In ICDE.
Milad Shokouhi, Mark Baillie, and Leif Azzopardi. Updating collection representations for federated search. In Proceedings of the ACM SIGIR. ACM.
Panagiotis G. Ipeirotis, Alexandros Ntoulas, Junghoo Cho, and Luis Gravano. Modeling and managing changes in text databases. ACM Trans. Database Syst., 32(3):14.

71 Query

72 Query Expansion
Global: expansion based on all retrieved documents. The same expanded query is sent to each resource.
Local: expansion for each resource based only on its documents.
Local but general: get expansion terms from each resource and select the best terms. The same expanded query is sent to each resource.
Cluster: cluster resources independently of a query. Use the Global or the Local but general approach for each cluster.

73 Query Expansion
[Table 4: The average performance of query expansion methods across different testbeds for two sets of TREC topics. Rows: BSNE, Local, Fuse, Cluster, Global; columns: P@5, P@10, MRR.]
Paul Ogilvie and Jamie Callan. The effectiveness of query expansion for distributed information retrieval. In Proceedings of the ACM CIKM. ACM.
Milad Shokouhi, Leif Azzopardi, and Paul Thomas. Effective query expansion for federated search. In Proceedings of the ACM SIGIR. ACM.

74 Overlap
Results Fusion
Remove duplicates from the result list.
Give a higher score to a document that appears more than once.

75 (Overlap Estimate)
Given:
Collections $C_1$ and $C_2$, with $K$ overlap documents between them
Samples $S_1$ and $S_2$, with $D$ duplicate documents within them
Estimated number of overlap documents:
$$\hat{K} = \frac{|C_1|\,|C_2|\,D}{|S_1|\,|S_2|}$$
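The estimate scales the number of duplicates seen between the two samples by the inverse of both sampling fractions. A minimal sketch of that arithmetic follows, assuming the collection sizes are known or have already been estimated and that the samples are roughly uniform. The function and parameter names are hypothetical.

```python
def estimate_overlap(
    collection_size_1: int,
    collection_size_2: int,
    sample_size_1: int,
    sample_size_2: int,
    sample_duplicates: int,
) -> float:
    """Estimate K, the number of documents shared by two collections.

    Implements K_hat = |C1| * |C2| * D / (|S1| * |S2|), where D is the number
    of duplicate documents found between the two samples.
    """
    if sample_size_1 <= 0 or sample_size_2 <= 0:
        raise ValueError("sample sizes must be positive")
    return (
        collection_size_1 * collection_size_2 * sample_duplicates
        / (sample_size_1 * sample_size_2)
    )
```

For example, with collections of 10,000 and 20,000 documents, samples of 500 documents each, and 5 duplicates between the samples, the estimate is 10,000 × 20,000 × 5 / (500 × 500) = 4,000 overlap documents.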

76 (Relax)
[Figure 1 (SIGIR 2007 Proceedings, Session 21: Collection Representation in Distributed IR): The Relax selection on a sample graph. Each vertex (Cn) in this graph represents a federated collection. (A) The graph initialization, where R represents the estimated number of relevant documents in each collection. (B) The graph after initialization, where C4 is selected as the most relevant collection according to its R value. The weight w_e(u, v) of the edge between u and v, computed according to the estimated number of ...]

77 Results Fusion
Remove duplicates from the result list.
Give a higher score to a document that appears more than once.
Fusion Methods
Document $d$ appears in $m$ collections with scores $\{s_i\}$.
Shadow Document: assumes that $d$ also appears in $n - m$ collections with a score $\frac{1}{m}\sum_{i=1}^{m} s_i$:
$$\mathrm{score}(d) = \sum_{i=1}^{m} s_i + k\,(n - m)\,\frac{\sum_{i=1}^{m} s_i}{m}$$
Multi-Evidence:
$$\mathrm{score}(d) = f(m)\,\frac{\sum_{i=1}^{m} s_i}{m}$$
where $f(x)$ is a nondecreasing function.
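Both fusion scores can be computed from the list of scores a document received. The sketch below assumes the scores are already normalized to a common range, reads n as the total number of collections searched, and treats k as a weight on the shadow documents. The default `f` is a hypothetical choice made only so the example runs; the slide requires only that `f` be nondecreasing.

```python
import math
from typing import Callable, Sequence


def shadow_document_score(scores: Sequence[float], n: int, k: float) -> float:
    """Shadow Document score: sum(s_i) + k * (n - m) * mean(s_i).

    scores: the m scores that document d received.
    n: assumed to be the total number of collections searched (m <= n).
    k: weight applied to the n - m shadow documents.
    """
    m = len(scores)
    if m == 0 or m > n:
        raise ValueError("need 1 <= len(scores) <= n")
    total = sum(scores)
    return total + k * (n - m) * total / m


def multi_evidence_score(
    scores: Sequence[float],
    f: Callable[[int], float] = lambda m: 1.0 + math.log(m),  # hypothetical default
) -> float:
    """Multi-Evidence score: f(m) * mean(s_i), with f nondecreasing."""
    m = len(scores)
    if m == 0:
        raise ValueError("scores must not be empty")
    return f(m) * sum(scores) / m
```

With `f(m) = m` the Multi-Evidence score reduces to the plain sum of the scores, and with `f(m) = 1` it is their average, so the choice of `f` controls how strongly repeated appearances are rewarded.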

78 Results Fusion
[Figure, from Inf Retrieval (2007) 10: Fig. 1 Performances of six methods (MEM, SDM, Round-robin, Bayesian, Borda, CombMNZ) with different overlap rates ([0,1] normalization). Y-axis: average precision at 8 document levels (5, 10, 15, 20, 25, 30, 50, 100). X-axis: range of overlap rate in five bins, the first being 0-0.2.]

79 Overlapping Collections
References
Milad Shokouhi and Justin Zobel. Federated text retrieval from uncooperative overlapped collections. In Proceedings of the ACM SIGIR. ACM.
Yaniv Bernstein, Milad Shokouhi, and Justin Zobel. Compact features for detection of near-duplicates in distributed retrieval. In SPIRE.
Shengli Wu and Sally McClean. Result merging methods in distributed information retrieval with overlapping databases. Inf. Retr., 10(3).
Shengli Wu and Fabio Crestani. Shadow document methods of results merging. In Proceedings of the ACM Symposium on Applied Computing. ACM.

80 More DIR
Summary
Evolving Collections
Query Expansion
Overlapping Collections
Multilingual Search
Distributed Multimedia Information Retrieval
Luo Si, Jamie Callan, Suleyman Cetintas, and Hao Yuan. An effective and efficient results merging strategy for multilingual information retrieval in federated search environments. Inf. Retr., 11(1):1–24.
Jamie Callan, Fabio Crestani, and Mark Sanderson. Distributed Multimedia Information Retrieval: SIGIR 2003 Workshop on Distributed Information Retrieval, Toronto, Canada, August 2003: Revised, Selected, and Invited Papers (Lecture Notes in Computer Science, 2924). Springer-Verlag.

81 Q & A
