Distributed Information Retrieval
1 Distributed Information Retrieval. Fabio Crestani and Ilya Markov, University of Lugano, Switzerland
2 Outline: 1 Motivations (Deep Web, Federated Search, Metasearch, Aggregated Search)
3 Motivations: Why do we need DIR? There are limits to what a search engine can find on the web. Not everything that is on the web is or can be harvested. The one-size-fits-all approach of web search engines has many limitations. Often there is more than one type of answer to the same query. Thus: Deep Web, Federated Search, Metasearch, Aggregated Search.
4 Deep Web: There is a lot of information on the web that cannot be accessed by search engines (the deep or hidden web). There are many different reasons why this information is not accessible to crawlers. This is often very valuable information! Web search engines can only be used to identify the resource (if possible); the user then has to deal with it directly. Even if this information could be crawled, there are good reasons not to...
5 Federated Search: Federated Search is another name for DIR. Federated search systems do not crawl a resource, but pass a user query to the search facilities of the resource itself. Why would this be better? It preserves the property rights of the resource owner. Search facilities are optimised for the specific resource. The index is always up to date. The resource is curated and of high quality. Examples of federated search systems: PubMed, FedStats, WestLaw, and Cheshire.
6 Metasearch: Even the largest search engine cannot effectively crawl the entire web. Different search engines crawl different, disjoint portions of the web. Different search engines use different ranking functions. Metasearch engines do not crawl the web, but pass a user query to a number of search engines and then present the fused result set. Examples of metasearch systems: Dogpile, MetaCrawler, AllInOneNews, and SavvySearch.
7 Aggregated Search: Often there is more than one type of information relevant to a query (e.g. web pages, images, maps, reviews). These types of information are indexed and ranked by separate sub-systems. Presenting this information in an aggregated way is more useful to the user.
8 Outline: 1 Motivations; 2 Peer-to-Peer Network, Crawling, Metadata Harvesting, Hybrid
9 A Taxonomy of DIR Systems: A taxonomy of DIR architectures can be built by considering where the indexes are kept. This suggests four different types of architectures: broker-based, peer-to-peer, crawling, and metadata harvesting. [Figure: taxonomy tree with the nodes Global collection, Distributed indexes, Centralised index, P2P, Broker, Crawling, Meta-data harvesting, Hybrid.]
10 Peer-to-Peer Networks: Indexes are located with the resources. Some parts of the indexes are distributed to other resources. Queries are distributed across the resources and results are merged by the peer that originated the query.
11 Broker-Based Architecture: Indexes are located with the resources. Queries are forwarded to the resources and results are merged by the broker.
12 Crawling: Resources are crawled and documents are harvested. Indexes are centralised. Queries are carried out in a centralised way and documents are fetched from the resources or from storage.
13 Metadata Harvesting: Indexes are located with the resources, but metadata are harvested according to some protocol (off-line phase), for example OAI-PMH. Queries are carried out at the broker level (on-line phase) to identify relevant documents by their metadata; these documents are then requested from the resources.
14 Index Harvesting: It is possible to harvest the indexes instead of the metadata, according to some protocol (off-line phase), for example OAI-PMH. Queries are carried out at the broker level (on-line phase) to identify relevant documents by their full content; these documents are then requested from the resources.
15 Outline: 1 Motivations
16 Cooperative, Uncooperative
17 Stanford Protocol Proposal for Internet Retrieval and Search (STARTS). Query language: filter expressions, ranking expressions. Retrieved documents: unnormalized score, source, statistics (term frequency, term weight, document frequency, document size, document count). Source metadata and content summary: vocabulary, statistics (term frequency, document frequency, number of documents). Reference: Luis Gravano, Chen-Chuan K. Chang, Héctor García-Molina, and Andreas Paepcke. STARTS: Stanford proposal for internet meta-searching. SIGMOD Rec., 26(2).
18 Query-Based Sampling: A query is sent to the resource, and the top documents it returns (2-10) are added to the sample. Questions: How to select queries? What are the stopping criteria? Reference: Jamie Callan and Margaret Connell. Query-based sampling of text databases. ACM Trans. Inf. Syst., 19(2):97-130.
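The sampling loop described on this slide can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: `search(term, k)` is a hypothetical callable standing in for the resource's search interface, `docs_per_query` follows the 2-10 range on the slide, and `max_docs` is an illustrative stopping budget.

```python
import random

def query_based_sampling(search, seed_term, docs_per_query=4, max_docs=300, seed=0):
    """Sample an uncooperative resource using only its search interface.

    `search(term, k)` is a hypothetical callable returning up to k
    (doc_id, text) pairs for a one-term query.
    """
    rng = random.Random(seed)
    sample = {}                  # doc_id -> text of the sampled documents
    vocabulary = {seed_term}     # terms learned so far (candidate queries)
    tried = set()                # terms already used as queries
    while len(sample) < max_docs:
        candidates = sorted(vocabulary - tried)
        if not candidates:
            break                # no unused query terms left
        term = rng.choice(candidates)   # "random" term selection strategy
        tried.add(term)
        for doc_id, text in search(term, docs_per_query):
            if len(sample) >= max_docs:
                break
            if doc_id not in sample:
                sample[doc_id] = text
                vocabulary.update(text.lower().split())
    return sample
```

Choosing the next query term from the learned vocabulary is only one of the strategies discussed on the next slide; swapping `rng.choice` for a frequency-based choice changes the strategy without touching the loop.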
19 Sampling Queries. Source of query terms: other resource description (ord) or learned resource description (lrd). Term selection: random, document frequency (df), collection frequency (ctf), average term frequency (ctf/df). Alternative: query logs. References: Jamie Callan and Margaret Connell. Query-based sampling of text databases. ACM Trans. Inf. Syst., 19(2):97-130. Milad Shokouhi, Justin Zobel, Seyed M. M. Tahaghoghi, and Falk Scholer. Using query logs to establish vocabularies in distributed information retrieval. Inf. Process. Manage., 43(1).
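20 Stopping Criteria. Heuristics: a fixed number of documents; n documents out of N resources; collection size; vocabulary size. Adaptive criterion: stop when newly downloaded terms are not likely to appear in future queries. Let $Q$ be a set of training queries and $\theta_k$ the language model of the sample at the $k$-th iteration. Then $p(Q \mid \theta_k) = \prod_{i=1}^{N} \prod_{j=1}^{|q_i|} p(t = q_{ij} \mid \theta_k)$, where $N$ is the number of queries in $Q$ and $q_{ij}$ is the $j$-th term of query $q_i$; $l(\theta_k, Q) = \log p(Q \mid \theta_k)$; and $\phi_k = l(\theta_k, Q) - l(\theta_{k-1}, Q) = \log \frac{p(Q \mid \theta_k)}{p(Q \mid \theta_{k-1})}$.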
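A minimal sketch of the adaptive criterion, assuming a unigram language model with add-mu smoothing (the smoothing, `vocab_size`, and the threshold `epsilon` are illustrative assumptions that do not appear on the slide):

```python
import math
from collections import Counter

def language_model(sample_docs, vocab_size, mu=1.0):
    """Unigram model theta of the sample; add-mu smoothing is an assumption."""
    counts = Counter(t for doc in sample_docs for t in doc.split())
    total = sum(counts.values())
    return lambda t: (counts[t] + mu) / (total + mu * vocab_size)

def log_likelihood(model, queries):
    """l(theta, Q): sum of log p(t | theta) over all terms of all queries."""
    return sum(math.log(model(t)) for q in queries for t in q.split())

def should_stop(prev_docs, curr_docs, queries, vocab_size, epsilon=0.01):
    """Return (stop, phi_k), where phi_k = l(theta_k, Q) - l(theta_{k-1}, Q).

    Sampling stops once the change in log-likelihood is smaller than the
    (hypothetical) threshold epsilon in absolute value.
    """
    phi = (log_likelihood(language_model(curr_docs, vocab_size), queries)
           - log_likelihood(language_model(prev_docs, vocab_size), queries))
    return abs(phi) < epsilon, phi
```

The intuition is that once additional sampled documents no longer change how well the sample predicts the training queries, further sampling is unlikely to add terms that matter.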
21 Stopping Criteria Bibliography: Leif Azzopardi, Mark Baillie, and Fabio Crestani. Adaptive query-based sampling for distributed IR. In Proceedings of the ACM SIGIR. ACM. Mark Baillie, Leif Azzopardi, and Fabio Crestani. An adaptive stopping criteria for query-based sampling of distributed collections. In String Processing and Information Retrieval (SPIRE). James Caverlee, Ling Liu, and Joonsoo Bae. Distributed query sampling: a quality-conscious approach. In Proceedings of the ACM SIGIR. ACM.
22 Summary: Cooperative (STARTS); Uncooperative (QBS): Sampling Queries, Stopping Criteria
24 Lexicon-Based, Document-Surrogate
25 Lexicon-Based Approaches
26 Collection Retrieval Inference Network (CORI). Collection = super-document. Bayesian inference network on super-documents. Adapted Okapi: $T = \frac{df_{t,i}}{df_{t,i} + 50 + 150 \cdot cw_i / avg\_cw}$, $I = \frac{\log\left(\frac{|C| + 0.5}{cf_t}\right)}{\log(|C| + 1.0)}$, $p(t \mid C_i) = b + (1 - b) \cdot T \cdot I$. Collections are ranked according to $p(Q \mid C_i)$. Reference: James P. Callan, Zhihong Lu, and W. Bruce Croft. Searching distributed collections with inference networks. In Proceedings of the ACM SIGIR. ACM.
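A small sketch of the CORI belief computation. It assumes the commonly cited defaults (constants 50 and 150, default belief b = 0.4) and, as a simplification, scores a collection by the average belief over the query terms; the data layout (`collections` as a dict of per-collection statistics) is hypothetical.

```python
import math

def cori_belief(df, cw, avg_cw, cf, num_collections, b=0.4):
    """p(t | C_i) = b + (1 - b) * T * I for one term and one collection.

    df: documents in C_i containing t; cw: words in C_i; avg_cw: mean cw;
    cf: collections containing t; num_collections: |C|.
    """
    if df == 0 or cf == 0:
        return b                      # only the default belief remains
    T = df / (df + 50 + 150 * cw / avg_cw)
    I = math.log((num_collections + 0.5) / cf) / math.log(num_collections + 1.0)
    return b + (1 - b) * T * I

def rank_collections(query_terms, collections):
    """collections: name -> {"df": {term: df}, "cw": word count} (assumed layout)."""
    n = len(collections)
    avg_cw = sum(c["cw"] for c in collections.values()) / n
    scores = {}
    for name, c in collections.items():
        beliefs = []
        for t in query_terms:
            # cf_t: number of collections whose df for t is non-zero
            cf = sum(1 for o in collections.values() if o["df"].get(t, 0) > 0)
            beliefs.append(cori_belief(c["df"].get(t, 0), c["cw"], avg_cw, cf, n))
        scores[name] = sum(beliefs) / len(beliefs)   # averaging is an assumption
    return sorted(scores, key=scores.get, reverse=True)
```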
27 Glossary-of-Servers Server (GlOSS). $Goodness(q, l, C) = \sum_{d \in Rank(q, l, C)} sim(q, d)$, where $Rank(q, l, C) = \{d \in C \mid sim(q, d) > l\}$. Cooperative: requires document and term statistics. Reference: Luis Gravano, Héctor García-Molina, and Anthony Tomasic. GlOSS: text-source discovery over the internet. ACM Trans. Database Syst., 24(2).
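The goodness definition can be illustrated directly when the similarities are known. This is only a sketch of the ideal quantity: in practice the broker does not see the documents and has to estimate it from the document and term statistics that the sources export. The function names are hypothetical.

```python
def goodness(similarities, threshold):
    """Goodness(q, l, C): sum of sim(q, d) over documents with sim(q, d) > l."""
    return sum(s for s in similarities if s > threshold)

def rank_sources(sims_by_source, threshold):
    """Order sources by decreasing goodness for one query."""
    scores = {src: goodness(sims, threshold) for src, sims in sims_by_source.items()}
    return sorted(scores, key=scores.get, reverse=True)
```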
28 Document-Surrogate Approaches
29 Relevant Document Distribution Estimation (ReDDE). Idea: if one sampled document is relevant to a query, then $\frac{|C|}{|S_C|}$ similar documents in the collection $C$ are relevant to the query. $R(C, Q) \approx \sum_{d \in S_C} p(r \mid d) \cdot \frac{|C|}{|S_C|}$. Reference: Luo Si and Jamie Callan. Relevant document distribution estimation method for resource selection. In Proceedings of the ACM SIGIR. ACM.
30 Relevant Document Distribution Estimation (ReDDE). Ranked sampled documents = ranked documents in a centralized retrieval system. Idea: if a document $d_j$ appears before a document $d_i$ in the sample ranking, then $\frac{|C_j|}{|S_{C_j}|}$ documents appear before $d_i$ in a centralized retrieval system. $Rank_{centralized}(d_i) = \sum_{d_j : Rank_{sample}(d_j) < Rank_{sample}(d_i)} \frac{|C_j|}{|S_{C_j}|}$. Reference: Luo Si and Jamie Callan. Relevant document distribution estimation method for resource selection. In Proceedings of the ACM SIGIR. ACM.
31 Relevant Document Distribution Estimation (ReDDE). $R(C, Q) \approx \sum_{d \in S_C} p(r \mid d) \cdot \frac{|C|}{|S_C|}$, with $Rank_{centralized}(d_i) = \sum_{d_j : Rank_{sample}(d_j) < Rank_{sample}(d_i)} \frac{|C_j|}{|S_{C_j}|}$ and $p(r \mid d) = \alpha$ if $Rank_{centralized}(d) < \beta \cdot \sum_i |C_i|$, and $0$ otherwise. Reference: Luo Si and Jamie Callan. Relevant document distribution estimation method for resource selection. In Proceedings of the ACM SIGIR. ACM.
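The three ReDDE formulas combine into a single pass over the ranked sample. The sketch below assumes that the sample ranking for the query is already available as a list of collection identifiers (best first), sets the constant $\alpha$ to 1 since it does not affect the ordering, and uses an illustrative default for $\beta$ (`ratio`).

```python
def redde_scores(ranked_sample, coll_size, sample_size, ratio=0.003):
    """Estimate R(C, Q) for every collection.

    ranked_sample: collection id of each sampled document, ordered by its
    rank in the centralized sample index for the query (best first).
    coll_size / sample_size: dicts with (estimated) |C| and |S_C|.
    ratio: beta, the fraction of all documents treated as relevant.
    """
    threshold = ratio * sum(coll_size.values())   # beta * sum_i |C_i|
    scores = {c: 0.0 for c in coll_size}
    central_rank = 0.0    # estimated rank in a centralized index of all docs
    for c in ranked_sample:
        if central_rank >= threshold:
            break         # p(r | d) = 0 for every remaining document
        weight = coll_size[c] / sample_size[c]    # |C| / |S_C|
        scores[c] += weight                       # alpha = 1
        central_rank += weight                    # docs represented so far
    return scores
```

Collections are then ranked by decreasing score; a collection with no sampled document above the cutoff receives a score of zero.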
32 Centralized-Rank Collection Selection (CRCS). $R(C, Q) \approx \sum_{d \in S_C} p(r \mid d) \cdot \frac{|C|}{|S_C|}$, with $p(r \mid d) = \frac{R(d)}{|C_{max}|}$. Linear: $R(d) = \gamma - Rank_{sample}(d)$ if $Rank_{sample}(d) < \gamma$, and $0$ otherwise. Exponential: $R(d) = \alpha \exp(-\beta \cdot Rank_{sample}(d))$. Reference: Milad Shokouhi. Central-rank-based collection selection in uncooperative distributed information retrieval. In ECIR.
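A sketch of both CRCS variants under the same input layout as a ranked list of collection identifiers. The parameter defaults (`gamma`, `alpha`, `beta`) are illustrative values, and counting ranks from 0 is an assumption.

```python
import math

def crcs_scores(ranked_sample, coll_size, sample_size,
                mode="linear", gamma=50, alpha=1.2, beta=0.28):
    """Estimate R(C, Q) with the linear or exponential CRCS variant.

    ranked_sample: collection id of each sampled document, best first.
    """
    c_max = max(coll_size.values())               # |C_max|
    scores = {c: 0.0 for c in coll_size}
    for rank, c in enumerate(ranked_sample):      # rank 0 is the top document
        if mode == "linear":
            r = gamma - rank if rank < gamma else 0.0
        else:
            r = alpha * math.exp(-beta * rank)
        # p(r | d) * |C| / |S_C| with p(r | d) = R(d) / |C_max|
        scores[c] += (r / c_max) * coll_size[c] / sample_size[c]
    return scores
```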
33 Comparison. [Figure 3: R values for the CORI, ReDDE and CRCS algorithms on the trec123-FR-DOE-81col (left) and 100-col-gov2 (right) testbeds; x-axis: cutoff, y-axis: R. Tables 2-4: P@5, P@10, P@15 and P@20 at cutoff=1 and cutoff=5 for CORI, ReDDE, CRCS(l) and CRCS(e) on the Trec4 (trec4-kmeans; long TREC topics), uniform (trec colbysource; short TREC topics) and representative (trec123-2ldb-60col; short TREC topics) testbeds.] Excerpt: We only report the results for cutoff=1 and cutoff=5. The former shows the system outputs when only the best collection is selected, while for the latter the best five collections are chosen to be searched for the query. We do not report the results for larger cutoff values because the chosen cutoff has been shown to be a reasonable threshold for DIR experiments on real web collections [Avrahami et al.]. The values show the calculated precision on the top x results. We select ReDDE as the baseline, as it does not require training queries and its effectiveness is found to be higher than CORI and older alternatives [Si and Callan, 2003a]. The following tables compare the performance of the discussed methods on different testbeds. We used the t-test to calculate statistical significance. Source: Milad Shokouhi. Central-rank-based collection selection in uncooperative distributed information retrieval. In ECIR.
34 Comparison. [Tables 5-7: performance of collection selection methods (CORI, ReDDE, CRCS(l), CRCS(e)) at cutoff=1 and cutoff=5 on the relevant (trec123-AP-WSJ-60col), non-relevant (trec123-FR-DOE-81col) and gov2 (100-col-gov2) testbeds; short TREC topics were used as queries.] Excerpt: On the relevant testbed (Table 5), all precision values for CORI are significantly inferior to those of ReDDE for both cutoff values. ReDDE in general produces higher precision values than the CRCS methods. However, none of the gaps is detected as statistically significant by the t-test at the 99% confidence level. The results for CRCS(l), ReDDE and CORI are comparable on the non-relevant testbed (Table 6). CRCS(e) significantly outperforms the other methods in most cases. On the gov2 testbed (Table 7), CORI produces the best results when cutoff=1, while in the other scenarios there is no significant difference between the methods. Overall, we can conclude that CRCS(e) selects better collections and its high ... Source: Milad Shokouhi. Central-rank-based collection selection in uncooperative distributed information retrieval. In ECIR.
35 Summary: Lexicon-Based (CORI, GlOSS); Document-Surrogate (ReDDE, CRCS); Others
37 Estimating Collection Size
38 Capture-Recapture. Let $X$ be the event that a randomly sampled document is already in a sample $S$, and $X_n$ the same for $n$ randomly sampled documents: $E[X] = \frac{|S|}{|C|}$, $E[X_n] = n \cdot E[X] = n \cdot \frac{|S|}{|C|}$. For two samples $S_1$ and $S_2$: $|S_1 \cap S_2| \approx \frac{|S_1| \cdot |S_2|}{|C|}$, hence $\widehat{|C|} = \frac{|S_1| \cdot |S_2|}{|S_1 \cap S_2|}$. Procedure: take two samples and count the number of common documents. Reference: King-Lup Liu, Clement Yu, and Weiyi Meng. Discovering the representative of a search engine. In Proceedings of the ACM CIKM. ACM.
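As a worked illustration of the estimator, assuming the two samples are given as collections of document identifiers (the function name is hypothetical):

```python
def capture_recapture(sample1, sample2):
    """Estimate |C| as |S1| * |S2| / |S1 n S2|; None if the samples are disjoint."""
    s1, s2 = set(sample1), set(sample2)
    common = len(s1 & s2)
    if common == 0:
        return None       # no recaptured documents, the estimate is undefined
    return len(s1) * len(s2) / common
```

For example, two samples of 50 documents each that share 10 documents give an estimate of 50 * 50 / 10 = 250 documents.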
39 Sample-Resample. Randomly pick a term $t$ from the sample. Let $A$ be the event that some sampled document contains $t$, and $B$ the event that some document from the resource contains $t$: $P(A) = \frac{df_{t,S}}{|S|}$, $P(B) = \frac{df_{t,C}}{|C|}$. Assuming $P(A) \approx P(B)$ gives $\widehat{|C|} = \frac{df_{t,C} \cdot |S|}{df_{t,S}}$. Send the query $t$ to the resource to estimate $df_{t,C}$. Reference: Luo Si and Jamie Callan. Relevant document distribution estimation method for resource selection. In Proceedings of the ACM SIGIR. ACM.
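A minimal sketch of the estimate for one probe term, plus an average over several probe terms (averaging is an assumption added here to reduce the variance of a single-term estimate; the function names are hypothetical):

```python
def sample_resample(df_sample, sample_size, df_resource):
    """Estimate |C| = df_{t,C} * |S| / df_{t,S} for one probe term t."""
    if df_sample == 0:
        raise ValueError("the probe term must occur in the sample")
    return df_resource * sample_size / df_sample

def estimate_size(probes, sample_size):
    """probes: list of (df_{t,S}, df_{t,C}) pairs, one per probe term."""
    estimates = [sample_resample(ds, sample_size, dr) for ds, dr in probes]
    return sum(estimates) / len(estimates)
```

Here `df_resource` is the number of matches the resource reports for the one-term query, which requires the resource to expose a hit count.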
40 Results Merging
41 Results Merging. The results merging process: (1) Selected resources return their top-ranked documents to the broker. (2) The broker merges the documents and returns a fused list to the user. Not to be confused with data fusion, where the results come from a single resource and are ranked by multiple retrieval models.
42 Results Merging Issues. The results merging process involves a number of issues: (1) Duplicate detection and removal. (2) Normalising and merging relevance scores. Different solutions have been proposed for these issues, depending on the DIR environment.
43 Results Merging in Cooperative Environments. Results merging in cooperative environments is much simpler and has different solutions: (1) Fetch documents from each resource, reindex them, and rank them according to the broker's IR model. (2) Get information about the way the document score is calculated and normalise the score. At the highest level of collaboration, it is possible to ask the resources to adopt the same retrieval model!
44 Collection Retrieval Inference Network (CORI). The idea: a linear combination of the score of the database and the score of the document. Normalised collection selection score: $C'_i = \frac{R_i - R_{min}}{R_{max} - R_{min}}$. Normalised document score: $D'_j = \frac{D_j - D_{min}}{D_{max} - D_{min}}$. Heuristic linear combination: $D''_j = \frac{D'_j + 0.4 \cdot D'_j \cdot C'_i}{1.4}$. Reference: J.P. Callan, Z. Lu, and W.B. Croft. Searching distributed collections with inference networks. In Proceedings of the ACM SIGIR. ACM.
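The merging heuristic can be sketched as follows. As an assumption, the minimum and maximum scores are taken from the observed values (collection scores across the selected collections, document scores within each result list); the input layout is hypothetical.

```python
def _minmax(x, lo, hi):
    """Min-max normalisation; 0.0 when all scores are equal."""
    return (x - lo) / (hi - lo) if hi > lo else 0.0

def cori_merge(results, collection_scores):
    """Merge result lists with D'' = (D' + 0.4 * D' * C') / 1.4.

    results: collection -> list of (doc_id, score) returned by that collection.
    collection_scores: collection -> resource selection score R_i.
    """
    r_min = min(collection_scores.values())
    r_max = max(collection_scores.values())
    merged = []
    for coll, docs in results.items():
        c_norm = _minmax(collection_scores[coll], r_min, r_max)   # C'_i
        d_scores = [s for _, s in docs]
        d_min, d_max = min(d_scores), max(d_scores)
        for doc_id, s in docs:
            d_norm = _minmax(s, d_min, d_max)                     # D'_j
            merged.append((doc_id, (d_norm + 0.4 * d_norm * c_norm) / 1.4))
    return sorted(merged, key=lambda pair: pair[1], reverse=True)
```

The factor 0.4 gives documents from highly ranked collections a boost of up to 40%, and the division by 1.4 keeps the merged scores in [0, 1].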
45 Results Merging in Uncooperative Environments. In uncooperative environments, resources might provide scores, but the broker does not have any information on how these scores are computed; score normalisation requires some way of comparing scores. Alternatively, resources might provide only rank positions, but the broker does not have any information on the relevance of each document in the ranked lists; merging the ranks requires some way of comparing rank positions.
46 Semi-Supervised Learning (SSL). The idea: train a regression model for each collection that maps resource document scores to normalised scores. This requires that some returned documents are found in the Collection Selection Index (CSI). Two cases: (1) resources use identical retrieval models; (2) resources use different retrieval models. Reference: L. Si and J. Callan. Using sampled data and regression to merge search engine results. In Proceedings of the ACM SIGIR. ACM.
47 SSL with Identical Retrieval Models. The idea: SSL uses documents found in the CSI to train a single regression model that estimates the normalised score ($D'_{i,j}$) from the resource document score ($D_{i,j}$), using the score of the same document computed from the CSI ($E_{i,j}$) as the training target. Having: $\begin{bmatrix} D_{1,1} & C_1 D_{1,1} \\ D_{1,2} & C_1 D_{1,2} \\ \vdots & \vdots \\ D_{n,m} & C_n D_{n,m} \end{bmatrix} \begin{bmatrix} a \\ b \end{bmatrix} = \begin{bmatrix} E_{1,1} \\ E_{1,2} \\ \vdots \\ E_{n,m} \end{bmatrix}$. Train $[a\ b]$ by regression, then compute $D'_{i,j} = a \cdot D_{i,j} + b \cdot D_{i,j} \cdot C_i$.
48 SSL with Different Retrieval Models. The idea: SSL uses documents found in the CSI to train a different regression model for each resource. Having: $\begin{bmatrix} D_{1,1} & 1 \\ D_{1,2} & 1 \\ \vdots & \vdots \\ D_{n,m} & 1 \end{bmatrix} \begin{bmatrix} a_i \\ b_i \end{bmatrix} = \begin{bmatrix} E_{1,1} \\ E_{1,2} \\ \vdots \\ E_{n,m} \end{bmatrix}$. Train $[a_i\ b_i]$ by regression, then compute $D'_{i,j} = a_i \cdot D_{i,j} + b_i$.
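For the case of different retrieval models, the per-resource regression reduces to fitting a line from resource scores to CSI scores. The sketch below uses ordinary least squares in closed form and assumes at least two overlap documents with distinct resource scores; the function names and input layout are hypothetical.

```python
def fit_line(xs, ys):
    """Least-squares fit of y = a * x + b."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    if sxx == 0:
        raise ValueError("need at least two distinct resource scores")
    a = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys)) / sxx
    return a, mean_y - a * mean_x

def ssl_normalise(overlap, returned):
    """Map one resource's scores onto the CSI score scale.

    overlap: list of (D, E) pairs for documents found both in the result
    list (score D) and in the CSI (score E).
    returned: doc_id -> resource score D for every returned document.
    """
    a, b = fit_line([d for d, _ in overlap], [e for _, e in overlap])
    return {doc: a * d + b for doc, d in returned.items()}
```

After every resource's scores have been mapped this way, the documents can be merged by sorting on the normalised scores.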
49 Sample-Agglomerate Fitting Estimate (SAFE). The idea: for a given query, the results from the CSI are a subranking of the original collection's ranking, so curve fitting on the subranking can be used to estimate the original scores. It does not require overlap documents in the CSI. Reference: M. Shokouhi and J. Zobel. Robust result merging using sample-based score estimates. ACM Transactions on Information Systems, 27(3): 129.
50 Sample-Agglomerate Fitting Estimate (SAFE). Normalised scores: (1) The broker ranks the documents available in the CSI for the query. (2) For each resource, the sample documents (with non-zero score) are used to estimate the merging score, where each sample document is assumed to be representative of a fraction $|S_c| / |c|$ of the resource. (3) Regression is used to fit a curve on the adjusted scores and predict the scores of the documents returned by the resource.
51 More Results Merging in Uncooperative Environments. There are many other approaches to results merging. STARTS uses the returned term frequency, document frequency, and document weight information to calculate the merging score based on similarities between documents. CVV calculates the merging score according to the collection score and the position of a document in the returned collection rank list. Another approach downloads small parts of the top returned documents and uses a reference index of term statistics for reranking and merging the downloaded documents. References: L. Gravano, C. Chang, H. Garcia-Molina, and A. Paepcke. STARTS: Stanford proposal for internet meta-searching. In Proceedings of the ACM SIGMOD. B. Yuwono and D. Lee. Server ranking for distributed text retrieval systems on the internet. In Proceedings of the Conference on Database Systems for Advanced, pages 41-50. N. Craswell, D. Hawking, and P. Thistlewaite. Merging results from isolated search engines. In Proceedings of the Australasian Database Conference.
52 Data Fusion in Metasearch. In data fusion, documents in a single collection are ranked by different search engines. The goal is to generate a single accurate ranked list from the ranked lists of the different retrieval models. There are no collection samples and no CSI. The idea: use the voting principle: a document returned by many search systems should be ranked higher than other documents. If available, also take the ranks of documents into account.
53 Data Fusion Methods. Many methods have been proposed: round robin; CombMNZ, CombSum, CombMax, CombMin; logistic regression (convert ranks to estimated probabilities of relevance). A comparison between score-based and rank-based methods suggests that rank-based methods are generally less effective. Reference: E. Fox and J. Shaw. Combination of multiple searches. In Proceedings of TREC, 1994.
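The Comb* family can be sketched in a few lines. The example assumes that each run is a dict of already normalised scores and that a document missing from a run simply contributes nothing; the function name is hypothetical.

```python
from collections import defaultdict

def comb_fusion(runs, method="CombSum"):
    """Fuse several runs (doc_id -> normalised score) into one ranked list."""
    scores = defaultdict(list)
    for run in runs:
        for doc, s in run.items():
            scores[doc].append(s)
    combine = {
        "CombSum": sum,                              # sum of the scores
        "CombMNZ": lambda v: sum(v) * len(v),        # sum times number of runs returning the doc
        "CombMax": max,
        "CombMin": min,
    }[method]
    fused = {doc: combine(v) for doc, v in scores.items()}
    return sorted(fused.items(), key=lambda kv: kv[1], reverse=True)
```

CombMNZ implements the voting principle most directly: multiplying by the number of runs that returned the document rewards agreement between search systems.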
54 Results Merging in Metasearch. We cannot use data fusion methods when collections overlap but are not the same, or when the retrieval models are different. Web metasearch is the most typical example. The idea: normalise the document scores returned by multiple search engines using a regression function that compares the scores of overlapping documents between the returned ranked lists. In the absence of overlap between the results, most metasearch merging techniques become ineffective. Reference: S. Wu and F. Crestani. Shadow document methods of results merging. In Proceedings of the ACM SAC, 2004.
58 Vertical
59 Vertical: a specialized subcollection focused on a specific domain (e.g., news, travel, and local search) or a specific media type (e.g., images and video). Vertical selection: the task of selecting the relevant verticals, if any, in response to a user's query. DIR solution: 0-1 verticals.
60 Vertical Selection References: Fernando Diaz. Integration of news content into web results. In Proceedings of the ACM WSDM. ACM. Jaime Arguello, Fernando Diaz, Jamie Callan, and Jean-Francois Crespo. Sources of evidence for vertical selection. In Proceedings of the ACM SIGIR. ACM. Fernando Diaz and Jaime Arguello. Adaptation of offline vertical selection predictions in the presence of user feedback. In Proceedings of the ACM SIGIR. ACM.
61 Blog Distillation: the task of identifying blogs with a recurring central interest. Analogy with federated search: a blog feed corresponds to a federated collection, and its posts to documents. References: Jonathan L. Elsas, Jaime Arguello, Jamie Callan, and Jaime G. Carbonell. Retrieval and feedback models for blog feed search. In Proceedings of the ACM SIGIR. ACM. Jangwon Seo and W. Bruce Croft. Blog site search using resource selection. In Proceedings of the ACM CIKM. ACM.
62 Personalized Metasearch
63 Personalized Metasearch. The broker provides a single search interface over all of a user's online resources. Different collections: individual folders, address books, calendars, the Web. References: Paul Thomas and David Hawking. Experiences evaluating personal metasearch. In IIiX. Paul Thomas and David Hawking. Server selection methods in personal metasearch: a comparative empirical study. Inf. Retr., 12(5).
64 Others. Expert search: the task of identifying experts with a given expertise (experts and related documents). Desktop search: different file and document types; results fusion.
65 Summary: Vertical Selection, Blog Distillation, Personalized Metasearch, Expert Search, Desktop Search, your application.
67 Evolving Collections
68 Updating Query-Based Sampling. Given N collections, n documents can be sampled at each time step. Distribution methods: uniform, popularity-based, size-based.
69 Updating Methods Comparison. [Figure from SIGIR 2007 Proceedings, Session 21: Collection Representation in Distributed IR; panels for CO=3 and CO=5 comparing QL, CU, SS and Crawl.]
70 Evolving Collections References: Panagiotis G. Ipeirotis, Alexandros Ntoulas, Junghoo Cho, and Luis Gravano. Modeling and managing content changes in text databases. In ICDE. Milad Shokouhi, Mark Baillie, and Leif Azzopardi. Updating collection representations for federated search. In Proceedings of the ACM SIGIR. ACM. Panagiotis G. Ipeirotis, Alexandros Ntoulas, Junghoo Cho, and Luis Gravano. Modeling and managing changes in text databases. ACM Trans. Database Syst., 32(3):14.
71 Query
72 Query Expansion. Global: expansion based on all retrieved documents; the same expanded query is sent to each resource. Local: expansion for each resource based only on its documents. Local but general: get expansion terms from each resource and select the best terms; the same expanded query is sent to each resource. Cluster: cluster resources independently of the query; use the Global or the Local-but-general approach for each cluster.
73 Query Expansion. [Table 4: average performance (P@5, P@10, MRR) of the query expansion methods BSNE, Local, Fuse, Cluster and Global across different testbeds, for two sets of TREC topics.] References: Paul Ogilvie and Jamie Callan. The effectiveness of query expansion for distributed information retrieval. In Proceedings of the ACM CIKM. ACM. Milad Shokouhi, Leif Azzopardi, and Paul Thomas. Effective query expansion for federated search. In Proceedings of the ACM SIGIR. ACM.
74 Overlap
Results Fusion: remove duplicates from the result list; give a higher score to a document that appears more than once.
75 (Overlap Estimate)
Given: collections $C_1$ and $C_2$ with $K$ overlap documents between them; samples $S_1$ and $S_2$ with $D$ duplicate documents within them.
Estimated number of overlap documents: $\hat{K} = \frac{|C_1|\,|C_2|\,D}{|S_1|\,|S_2|}$
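As a worked example of the overlap estimate, here is a small Python sketch. The function and parameter names are hypothetical, and the comments assume that the samples are drawn uniformly and independently from each collection, which is the usual reading of this estimator.

```python
def estimate_overlap(c1_size, c2_size, s1_size, s2_size, duplicates):
    """Estimate K, the number of documents shared by two collections.

    c1_size, c2_size: collection sizes |C1| and |C2|
    s1_size, s2_size: sample sizes |S1| and |S2|
    duplicates: D, the number of documents found in both samples
    """
    if s1_size <= 0 or s2_size <= 0:
        raise ValueError("sample sizes must be positive")
    # Under uniform sampling (an assumption), a shared document lands in both
    # samples with probability (|S1|/|C1|) * (|S2|/|C2|), so D is roughly
    # K times that probability. Solving for K gives the estimate.
    return c1_size * c2_size * duplicates / (s1_size * s2_size)

# 2 duplicates between a 200-document sample of a 10,000-document collection
# and a 400-document sample of a 20,000-document collection.
print(estimate_overlap(10_000, 20_000, 200, 400, 2))
```

With these numbers each shared document has a 0.02 × 0.02 = 0.0004 chance of appearing in both samples, so two observed duplicates correspond to an estimate of 5,000 shared documents.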
76 (Relax)
Figure 1: The Relax selection on a sample graph. Each vertex (Cn) in this graph represents a federated collection. (A) The graph initialization where R represents the estimated number of relevant documents in each collection. (B) The graph after initialization where C4 is selected as the most relevant collection according to its R value. The weight $w_e(u, v)$ of the edge between u and v is computed according to the estimated number of ...
Source: SIGIR 2007 Proceedings, Session 21: Collection Representation in Distributed IR.
77 Results Fusion
Remove duplicates from the result list; give a higher score to a document that appears more than once.
Fusion Methods: document d appears in m collections with scores $\{s_i\}$.
Shadow Document: assumes that d also appears in the other $n - m$ collections with a score $\frac{\sum_{i=1}^{m} s_i}{m}$: $score(d) = \sum_{i=1}^{m} s_i + k(n - m)\frac{\sum_{i=1}^{m} s_i}{m}$
Multi-Evidence: $score(d) = f(m)\,\frac{\sum_{i=1}^{m} s_i}{m}$, where $f(x)$ is a nondecreasing function.
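The two fusion formulas can be sketched in Python as follows. The function names, the default weight `k=0.5`, and the default choice `f(m) = 1 + ln(m)` are illustrative assumptions; the slide only requires that `f` be nondecreasing.

```python
import math

def shadow_document_score(scores, n, k=0.5):
    """Shadow Document: d was returned by m of n collections with `scores`.

    Each of the n - m collections that did not return d contributes a
    "shadow" score: the average observed score, weighted by k.
    """
    m = len(scores)
    if m == 0 or m > n:
        raise ValueError("need between 1 and n scores")
    total = sum(scores)
    return total + k * (n - m) * total / m

def multi_evidence_score(scores, f=lambda m: 1.0 + math.log(m)):
    """Multi-Evidence: average score scaled by a nondecreasing f(m)."""
    m = len(scores)
    if m == 0:
        raise ValueError("need at least one score")
    return f(m) * sum(scores) / m

# A document returned by 2 of 4 collections with normalised scores 0.8 and 0.6.
print(shadow_document_score([0.8, 0.6], n=4))
print(multi_evidence_score([0.8, 0.6]))
```

Both functions reward a document for appearing in more result lists: the shadow score adds less padding as m approaches n but gains real scores, while the multi-evidence score grows with m through `f`.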
78 Results Fusion
Fig. 1: Performances of six methods (MEM, SDM, Round-robin, Bayesian, Borda, CombMNZ) with different overlap rates ([0,1] normalization). Axes: average precision at 8 document levels (5, 10, 15, 20, 25, 30, 50, 100) against range of overlap rate (five ranges, the first being 0-0.2).
Source: Inf Retrieval (2007) 10.
79 Overlapping Collections: References
Milad Shokouhi and Justin Zobel. Federated text retrieval from uncooperative overlapped collections. In Proceedings of the ACM SIGIR. ACM.
Yaniv Bernstein, Milad Shokouhi, and Justin Zobel. Compact features for detection of near-duplicates in distributed retrieval. In SPIRE.
Shengli Wu and Sally McClean. Result merging methods in distributed information retrieval with overlapping databases. Inf. Retr., 10(3).
Shengli Wu and Fabio Crestani. Shadow document methods of results merging. In Proceedings of the ACM Symposium on Applied Computing. ACM.
80 More DIR: Summary
Evolving Collections; Query Expansion; Overlapping Collections; Multilingual Search; Distributed Multimedia Information Retrieval.
Luo Si, Jamie Callan, Suleyman Cetintas, and Hao Yuan. An effective and efficient results merging strategy for multilingual information retrieval in federated search environments. Inf. Retr., 11(1):1-24.
Jamie Callan, Fabio Crestani, and Mark Sanderson. Distributed Multimedia Information Retrieval: SIGIR 2003 Workshop on Distributed Information Retrieval, Toronto, Canada, August 2003: Revised, Selected, and Invited Papers (Lecture Notes in Computer Science, 2924). Springer-Verlag.
81 Q & A