From federated to aggregated search

1 From federated to aggregated search
Fernando Diaz, Mounia Lalmas and Milad Shokouhi

Outline
- Introduction and Terminology
- Architecture
- Resource Representation
- Resource Selection
- Result Presentation
- Evaluation
- Open Problems
- Bibliography

2 Outline Introduction and Terminology Architecture Resource Representation Resource Selection Result Presentation Evaluation Open Problems Bibliography Introduction What is federated search? What is aggregated search? Motivations Challenges Relationships 2

3 A classical example of federated search One query Collections to be searched A classical example of federated search Merged list of results 3

4 Motivation for federated search
- Search a number of independent collections, with a focus on hidden web collections
- Collections are not easily crawlable (and often should not be crawled)
- Access to up-to-date information and data
- Parallel search over several collections
- Effective tool for enterprise and digital library environments

Challenges for federated search
- How to represent collections, so that we know what documents each contains?
- How to select the collection(s) to be searched for relevant documents?
- How to merge results retrieved from several collections, to return one list of results to the users?
- Cooperative environment
- Uncooperative environment

5 From federated search to aggregated search
Federated search on the web:
- Peer-to-peer network: connects distributed peers (usually for file sharing), where each peer can be both server and client
- Metasearch engine: combines the results of different search engines into a single result list
- Vertical search, also known as aggregated search: adds the top-ranked results from relevant verticals (e.g. images, videos, maps) to typical web search results

A classical example of aggregated search
[Figure: result page combining structured data, news, homepage, Wikipedia, real-time results, video and Twitter]

6 Motivation for aggregated search
- Increasingly different types of information are available, sought and relevant, e.g. news, image, wiki, video, audio, blog, map, tweet
- Search engines allow access to these through so-called verticals
- Two ways to search: users can directly search the verticals, or rely on so-called aggregated search
- Google universal search 2007: [ ] search across all its content sources, compare and rank all the information in real time, and deliver a single, integrated set of search results [ ] will incorporate information from a variety of previously separate sources including videos, images, news, maps, books, and websites into a single set of results.

Motivation for aggregated search
- 25K editorially classified queries (Arguello et al, 09)

7 Motivation for aggregated search Motivation for aggregated search 7

8 Challenges in aggregated search
- Extremely heterogeneous collections
- What is/are the vertical intent(s)?
- Handling ambiguous (query, vertical) intent
- Handling non-stationary intent (e.g. news, local)
- How many results to return from each vertical, and where to position them in the result page?
- Slotting results
- Users looking at the 1st result page
- Page optimization and its evaluation

Ambiguous non-stationary intent
Query: Travel, Molusk, Paul
Vertical: Wikipedia, News, Image

9 Recap Introduction

                          federated search   aggregated search
heterogeneity             low                high
scale (documents, users)  small              large
user feedback             little             a lot

Terminology
1. federated search, distributed information retrieval, data fusion, aggregated search, universal search, peer-to-peer network
2. resource, vertical, database, collection, source, server, domain, genre
3. merging, blending, fusion, aggregation, slotted, tiled

10 Problem definition Present the querier with a summary of search results from one or more resources. General architecture Raw Query User Search Interface/ Portal/ Broker Query Query Query Query Query Source/ Server/ Vertical Source/ Server/ Vertical Source/ Server/ Vertical Source/ Server/ Vertical Source/ Server/ Vertical 10

11 Peer-to-peer network
[Figure: peers and a directory server]

Peer-to-peer (P2P) networks
- Broker-based: single centralized broker with document lists shared from peers (e.g. Napster, original version)
- Decentralized: each peer acts as both client and server (e.g. Gnutella v0.4)
- Structure-based: use distributed hash tables (DHT) (e.g. Chord (Stoica et al, 03))
- Hierarchical: use local directory services for routing and merging (e.g. Swapper.NET)

12 Federated search
[Figure: query, broker holding summaries Sum A to Sum E, Collections A to E, merged results]

Federated search
- Also known as a distributed information retrieval (DIR) system
- Provides one portal for searching information from multiple sources: corporate intranets, fee-based databases, library catalogues, internet resources, user-specific digital storage
- Funnelback, Westlaw, FedStats, Cheshire, etc (see also

13 Metasearch Raw Query User Metasearch engine Query Query Query Query WWW 13

14 Metasearch
- A search engine that queries several different search engines and either combines their results (blended) or displays them separately (non-blended)
- Does not crawl the web but relies on data gathered by other search engines
- Dogpile, Metacrawler, Search.com, etc (see

Aggregated search
[Figure: user, example query "Angelina Jolie", results, WWW index (text)]

15 Aggregated search
- Specific to a web search engine
- Increasingly, more than one type of information is relevant to an information need: mostly web pages plus images, maps, blogs, etc
- These types of information are indexed and ranked using dedicated approaches (verticals)
- Presenting the results from verticals in an aggregated way is believed to be more useful
- All major search engines do some level of aggregated search

Data fusion
[Figure: one query over one document collection (GOV2); different document representations (anchor only, title only) and different retrieval models (BM25, KL, Inquery) are merged into one ranked list of results] (e.g. Voorhees et al, 95)

16 Data fusion Search one collection Document can be indexed in different ways Title index, abstract index, etc (poly-representation) Weighting scheme Different retrieval models Rankings generated by different retrieval models (or different document representations) merged to produce the final rank Has often been shown to improve retrieval performance (TREC) Terminology - Resource Source Server Database Collection (federated search) Server Vertical (aggregated search) Domain Genre 16

17 Terminology - Aggregation Merging Blending Fusion Slotted Tiled Aggregated search (tiled) 17

18 Aggregated search (tiled) Naver.com Aggregated search (slotted) 18

19 Others Clustering Faceted search Multi-document summarization Document generation Entity search (see special issue in press on Current research in focused retrieval and result aggregation, Journal of Information Retrieval (Trotman etal, 10)) Yippy Clustering search engine from Vivisimo clusty.com 19

20 Faceted search Multi-document summarization 20

21 Fictitious document generation (Paris et al, 10) Entity search 21

22 Recap Shown the relations between federated, aggregated search, and others Exposed the various terminologies used In the rest of the tutorial, we concentrate on federated search and aggregated search Focus is on effective search Outline Introduction and Terminology Architecture Resource Representation Resource Selection Result Presentation Evaluation Open Problems Bibliography 22

23 Architecture: what are the general components of federated and aggregated search systems. Federated search architecture 23

24 Aggregated search architecture Pre-retrieval aggregation: decide verticals before seeing results Post-retrieval aggregation: decide verticals after seeing results Pre-web aggregation: decide verticals before seeing web results Post-web aggregation: decide verticals after seeing web results Post-retrieval, pre-web 24

25 Pre and post-retrieval, pre-web Outline Introduction and Terminology Architecture Resource Representation Resource Selection Result Presentation Evaluation Open Problems Bibliography 25

26 Resource representation: how to represent resources, so that we know what documents each contains.

Resource representation in federated search
(Also known as resource summary/description)

27 Resource representation Cooperative environments Comprehensive term statistics Collection size information Uncooperative environments Query-based sampling Collection size estimation Resource representation (cooperative environments) STARTS Protocol (Gravano et al, 97) Source metadata Rich query language 27

28 Resource representation (cooperative environments) Different types of term statistics (Callan et al, 95; Gravano et al, 94a,b,99; Meng et al, 01; Yuwono and Lee, 97; Xu and Callan, 98; Zobel, 97) Anchor-text HARP (Hawking and Thomas, 05) Resource representation (uncooperative environments) Query-based sampling (Callan and Connell, 01) Select a query, probe collection Download the top n documents Select the next query, repeat Query selector Query Sampled documents 28
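The query-based sampling loop listed above (select a query, probe the collection, download the top n documents, select the next query from what was downloaded, repeat) can be sketched as follows. This is a minimal illustration, not the procedure of Callan and Connell as published: the `search` callable, the toy collection `DOCS`, and the parameters `n_top` and `max_docs` are all assumptions made for the example.

```python
import random

def query_based_sample(search, seed_term, n_top=4, max_docs=10, rng=None):
    """Sample documents from an uncooperative collection.

    search(term) is a hypothetical interface returning a ranked list of
    (doc_id, text) pairs; nothing else about the collection is assumed.
    """
    rng = rng or random.Random(0)
    sampled = {}               # doc_id -> text
    vocabulary = [seed_term]   # candidate query terms seen so far
    used = set()               # terms already issued as queries
    while len(sampled) < max_docs:
        candidates = [t for t in vocabulary if t not in used]
        if not candidates:
            break              # no unused terms left to probe with
        query = rng.choice(candidates)
        used.add(query)
        for doc_id, text in search(query)[:n_top]:
            if doc_id in sampled:
                continue
            sampled[doc_id] = text
            # Terms from sampled documents become future query candidates.
            for term in text.lower().split():
                if term not in vocabulary:
                    vocabulary.append(term)
            if len(sampled) >= max_docs:
                break
    return sampled

# Toy collection standing in for a remote search engine.
DOCS = {1: "apple banana", 2: "banana cherry", 3: "cherry date", 4: "date elderberry"}

def toy_search(term):
    return [(i, t) for i, t in sorted(DOCS.items()) if term in t.split()]

sample = query_based_sample(toy_search, "apple", n_top=2, max_docs=3)
```

The sampled documents then serve as the resource description; how the next query term is picked (random, df, ctf, average tf, query logs) is the "query selector" choice discussed on the following slides.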

29 Resource representation (uncooperative environments) Query selector (Callan and Connell, 01) Other resource description (ord) Learned resource description (lrd) Average tf, random, df, ctf Query logs (Craswell, 00; Shokouhi et al, 07d) Focused probing (Ipeirotis and Gravano, 02) Resource representation (uncooperative environments) Adaptive sampling (Shokouhi et al, 06a) Rate of visiting new vocabulary (Baillie et al, 06a) Rate of sample quality improvement (reference query log) (Caverlee et al, 06) Proportional document ratio (PD) Proportional vocabulary ratio (PV) Vocabulary growth (VG) 29

30 Resource representation (uncooperative environments) Improving incomplete samples Shrinkage (Ipeirotis, 04; Ipeirotis and Gravano, 04): topically related collections should share similar terms Q-pilot (Sugiura and Etzioni, 00): sampled documents + backlinks + front page Resource representation (Collection size estimation) Capture-recapture (Liu et al, 01) Sample A (Capture) Sample B (recapture) 30
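As a rough illustration of capture-recapture size estimation, the sketch below uses the classic Lincoln-Petersen estimator: if sample A (capture) and sample B (recapture) are drawn independently, the collection size is approximately |A| x |B| / |A ∩ B|. The exact estimator used by Liu et al. may differ; the function name and the toy samples are assumptions.

```python
def capture_recapture_size(sample_a, sample_b):
    """Lincoln-Petersen estimate of collection size from two document samples."""
    a, b = set(sample_a), set(sample_b)
    overlap = len(a & b)
    if overlap == 0:
        raise ValueError("samples do not overlap; estimate is undefined")
    return len(a) * len(b) / overlap

# Two samples of 100 document ids each, sharing 10 documents.
estimate = capture_recapture_size(range(0, 100), range(90, 190))
```

The estimate is only as good as the randomness of the samples; documents returned by a search engine are biased toward highly ranked items, which is what the later samplers try to correct.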

31 Resource representation (Collection size estimation) Resource representation (Collection size estimation) Multiple queries sampler (Thomas and Hawking, 07) Random-walk sampler, and pool-based sampler (Bar-Yossef and Gurevich, 06) Collection overlap estimation (Shokouhi and Zobel, 07) 31

32 Resource representation (Updating summaries) (Ipeirotis et al, 05) (Shokouhi et al, 07a) Resource representation in aggregated search Vertical content samples or access to vertical API represents content supply Vertical query logs samples or access to historic vertical searches represents content demand 32

33 Vertical content includes text NEWS Vertical content includes structure SPORTS 33

34 Vertical content includes images
[Figure: IMAGES vertical]

Issues with vertical content
- Dynamics: some verticals become stale quickly
- Heterogeneous content
- Heterogeneous ranking algorithms
- Non-free-text APIs: affect query-based sampling

35 Addressing content dynamics sample most recently indexed documents (Diaz 09) assumes users more likely to be interested in recent content (Konig et al, 09) in practice, only need a fraction of the corpus to perform well Addressing heterogeneous content performance of two different methods of dealing with heterogeneous content 1. use text available with documents (e.g. captions) 2. manually map to surrogates (e.g. wikipedia pages) (Arguello et al, 09) 35

36 Vertical query logs Queries issued directly to a vertical represent explicit vertical intent Is similar to having a large body of labeled queries Issues with vertical query logs Dynamics some verticals require temporally-sensitive sampling for example, we do not want to sample news query logs for a whole year Non-free text APIs affects query modeling 36

37 Hybrid approaches
- Should only sample documents likely to be useful for vertical selection/merging, e.g. a document which is never requested is not useful for representing a vertical
- Suggests log-biased sampling (Shokouhi et al, 06; Arguello et al, 09)

Recap Resource representation

                             federated search              aggregated search
Representation completeness  low                           low-high
Representation generation    sampling/shared dictionaries  sampling, API
Freshness                    important                     critical

38 Outline Introduction and Terminology Architecture Resource Representation Resource Selection Result Presentation Evaluation Open Problems Bibliography Resource selection: how to select the resource(s) to be searched for relevant documents. 38

39 Resource selection for federated search Query Broker Sum A Sum B Sum C Sum D Sum E Query Query Query Collection A Collection B Collection C Collection D Collection E Resource selection (Lexicon-based methods) Big-document bag of word summaries CORI (Callan et al, 95) GlOSS (Gravano et al, 94b) CVV (Yuwono and Lee, 97) Collection C Sampling Collection A Collection B Sampling Sampling Broker 39

40 Resource selection (Lexicon-based methods) CORI GlOSS Resource selection (Document-surrogate methods) Sample documents with retained boundaries ReDDE (Si and Callan, 03a) CRCS (Shokouhi, 07a) SUSHI (Thomas and Shokouhi, 09) Collection C Sampling Collection A Collection B Sampling Sampling Broker 40

41 Resource selection (Document-surrogate methods): ReDDE
[Figure: query, broker ranking of sampled documents colored by collection]
- ReDDE assumes that the top-ranked sampled documents are relevant.
- ReDDE estimates the size of collections by sample-resample.
- Assuming that all collections have the same size, we have: yellow > blue > red
- CRCS is inspired by ReDDE but assigns different probabilities of relevance based on document position: red > yellow, blue

Resource selection (Document-surrogate methods): SUSHI
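A simplified sketch of the ReDDE idea described above: sampled documents from all collections are ranked together at the broker, the top-ranked ones are treated as relevant, and each one stands for (estimated collection size / sample size) documents of its collection. The fixed `top_k` cut-off and all names here are assumptions for illustration; the published method chooses the cut-off differently.

```python
from collections import defaultdict

def redde_scores(ranked_sample, est_size, sample_size, top_k=3):
    """Score collections from a broker ranking of sampled documents.

    ranked_sample: collection ids of the sampled documents, best first.
    est_size / sample_size: per-collection estimated size and sample size.
    """
    scores = defaultdict(float)
    for coll in ranked_sample[:top_k]:
        # Each top-ranked sampled document represents size/sample documents.
        scores[coll] += est_size[coll] / sample_size[coll]
    return dict(scores)

ranking = ["A", "B", "B", "C"]            # hypothetical broker ranking
sizes = {"A": 1000, "B": 1000, "C": 1000}
samples = {"A": 100, "B": 100, "C": 100}
scores = redde_scores(ranking, sizes, samples, top_k=3)
```

With equal sizes the score simply counts top-ranked sampled documents per collection, which matches the equal-size case on the slide; a CRCS-style variant would additionally weight each document by its rank.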

42 Resource selection (Document-surrogate methods) SUSHI Resource selection (Document-surrogate methods) SUSHI Different regression functions for each collection and query Scores are comparable (estimated over the same index) 42

43 Resource selection (Supervised methods) Utility maximization techniques Model the search effectiveness DTF (Nottelmann and Fuhr, 03), UUM (Si and Callan, 04a), RUM (Si and Callan, 05b) Classification-based methods Classify collections/queries for better selection Classification-aware server selection (Ipeirotis and Gravano, 08), classification-based resource selection (Arguello et al, 09a), learning from past queries (Cetintas et al, 09) Resource selection in aggregated Search Content-based predictors derived from (sampled) vertical content Query string-based predictors derived from query text, independent of any resource associated with a vertical Query log-based predictors derived from previous requests issued by users to the vertical portal 43

44 Content-based predictors
- Distributed information retrieval (DIR) predictors
- Simple result set predictors: numresults, score distributions, etc (Diaz 09; Konig et al, 09)
- Complex result set predictors: Clarity (Cronen-Townsend et al, 02), Autocorrelation (Diaz, 07), many, many more (Hauff, 10)

Issues with content-based predictors
- DIR (usually) assumes homogeneous content types
- performance predictors (usually) assume text corpora
- assumes ranking function consistency: between verticals; between vertical selector machine and vertical ranker machine
- verticals have different dynamics (e.g. news vs. image)

45 String-based predictors
- Dictionary lookups: terms correlated with a vertical (e.g., movie titles)
- Regular expressions: patterns correlated with explicit vertical requests (e.g., obama news)
- Named entities: automatically detected entity types (e.g., geographic entities)

String-based predictors: issues
- curating lists and expressions (manual or automatic)
- terms included in a dictionary are manually vetted for relevance
- high precision/low recall
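The dictionary-lookup and regular-expression predictors above can be sketched in a few lines. The movie title list and the news pattern are made-up placeholders, not curated resources from the tutorial.

```python
import re

# Hypothetical hand-curated resources for two verticals.
MOVIE_TITLES = {"inception", "casablanca"}
NEWS_PATTERN = re.compile(r"\bnews\b", re.IGNORECASE)

def string_predictors(query):
    """Return binary query-string features, computed without any vertical content."""
    q = query.lower()
    movie_hit = any(re.search(r"\b" + re.escape(t) + r"\b", q) for t in MOVIE_TITLES)
    return {
        "movie_dictionary": int(movie_hit),               # dictionary lookup
        "news_request": int(bool(NEWS_PATTERN.search(query))),  # explicit request
    }

features = string_predictors("obama news")
```

Such features fire rarely but reliably, which is the high precision/low recall trade-off noted on the slide.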

46 Log-based predictors Classification approaches (Beitzel etal 07; Li etal, 08) Language model approaches (Arguello etal, 09) Issues verticals with structured queries (e.g. local) query logs with dynamics (e.g. news) (Diaz, 09) Comparing predictor performance (Arguello et al, 09) 46

47 Predictor cost
- Pre-retrieval predictors: computed without sending the query to the vertical; no network cost
- Post-retrieval predictors: computed on the results from the vertical; require vertical support of web-scale query traffic; incur network latency; can be mitigated with vertical content caches

Combining predictors
- Use predictors as features for a machine-learned model
- Training data: 1. editorial data 2. behavioral data (e.g. clicks) 3. other vertical data
(Diaz, 09; Arguello et al, 09; Konig et al, 09)

48 Editorial data Data: <query,vertical,{+,-}> Features: predictors based on f(query,vertical) Models: log-linear (Arguello etal, 09) boosted decision trees (Arguello etal, 10) Combining predictors (Arguello etal, 09) 48
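To make the editorial-data setup concrete, here is a small sketch of a log-linear (logistic regression) vertical classifier trained on <query, vertical, {+,-}> labels, with predictors as features. The two binary features, the toy labels, and the training hyperparameters are assumptions; the cited work uses far richer feature sets.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def predict(w, b, x):
    """Probability that the vertical is relevant given feature vector x."""
    return sigmoid(b + sum(wi * xi for wi, xi in zip(w, x)))

def train_log_linear(examples, epochs=200, lr=0.5):
    """Stochastic gradient descent on the logistic loss."""
    w = [0.0] * len(examples[0][0])
    b = 0.0
    for _ in range(epochs):
        for x, y in examples:
            g = predict(w, b, x) - y        # gradient of the loss w.r.t. the logit
            w = [wi - lr * g * xi for wi, xi in zip(w, x)]
            b -= lr * g
    return w, b

# Features: [dictionary hit, regex hit]; label 1 = editor marked the vertical relevant.
EXAMPLES = [([1, 0], 1), ([0, 1], 1), ([1, 1], 1), ([0, 0], 0)]
w, b = train_log_linear(EXAMPLES)
```

One such model is typically trained per vertical; its output probability can then be thresholded to decide whether to show the vertical.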

49 Click data Data: <query,vertical,{click,skip}>, <query,vertical,click through rate> Features: predictors based on f(query,vertical) Models: log-linear (Diaz, 09) boosted decision trees (Konig etal, 09) Gathering click data Exploration bucket: show suboptimal presentations in order to gather positive (and negative) click/skip data Cold start problem: without a basic model, the best exploration is random Random exploration results in poor user experience 49

50 Gathering click data Solutions reduce impact to small fraction of traffic/users train a basic high-precision non-click model (perhaps with editorial data) Other issues Presentation bias: different verticals have different click-through rates a priori Position bias: different presentation positions have different click-through rates a priori Click precision and recall ability to predict queries using thresholded click-through-rate to infer relevance (Konig etal, 09) 50

51 Non-target data have training data no data Non-target data Data: <query,source vertical,{+,-}> Features: predictors based on f (query,target vertical) Models: generic model+adaptation (Arguello etal, 10) 51

52 Non-target data (Arguello etal, 10) Generic model Objective train a single model that performs well for all source verticals Assumption if it performs well across all source verticals, it will perform well on the target vertical (Arguello etal, 10) 52

53 Non-target data adapted model (Arguello etal, 10) Adapted model Objective learn non-generic relationship between features and the target vertical Assumption can bootstrap from labels generated by the generic model (Arguello etal, 10) 53

54 Non-target query classification average precision on target query classification; red (blue) indicates statistically significant improvements (degradations) compared to the single predictor (Arguello etal, 10) Training set characteristics What is the cost of generating training data how much money? how much time? how many negative impressions as a result of exploration? Are targets normalized? can we compare classifier output? 54

55 Training set cost summary Online adaptation Production vertical selection systems receive a variety of feedback signals clicks, skips reformulations A machine-learned system can adjust predictions based on real time user feedback very important for dynamic verticals (Diaz, 09; Diaz and Arguello, 09) 55

56 Online adaptation Passive feedback: adjust prediction/ parameters in response to feedback allows recovery from false positives difficult to recover from false negatives Active feedback/explore-exploit: opportunistically present suboptimal verticals for feedback allows recovery from both errors incurs exploration cost (Diaz, 09; Diaz and Arguello, 09) Online adaptation Issues setting learning rate for dynamic intent verticals normalizing feedback signal across verticals resolving feedback and training signal (click relevance) (Diaz, 09; Diaz and Arguello, 09) 56

57 Recap Resource selection Features and content type Collection size federated search often textual unavailable (uncooperative) aggregated search diverse Training data none some-much Outline Introduction and Terminology Architecture Resource Representation Resource Selection Result Presentation Evaluation Open Problems Bibliography 57

58 Result presentation: how to return results retrieved from several resources to users.

Result merging (Metasearch engines)
- Same source (web), different overlapped indexes
- Document scores may not be available
- Title, snippet, position and timestamps
- D-WISE (Yuwono and Lee, 96)
- Inquirus (Glover et al., 99)
- SavvySearch (Dreilinger and Howe, 1997)

59 Result merging (Data fusion) Same corpus Different retrieval models Document scores/positions available Unsupervised techniques CombSUM, CombMNZ (Fox and Shaw, 93, 94) Borda fuse (Aslam and Montague, 01) Supervised techniques Bayes-fuse, weighted Borda fuse (Aslam and Montague, 01) Segment-based fusion (Lillis et al 06, 08; Shokouhi 07b) Result merging in federated search User Merged results Broker Sum A Sum B Sum C Sum D Sum E Query Query Query Collection A Collection B Collection C Collection D Collection E 59
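The unsupervised fusion rules named above are simple to state: CombSUM adds up a document's scores across the input rankings, and CombMNZ multiplies that sum by the number of rankings that returned the document. The sketch below assumes scores have already been normalized to a common range; the run contents are invented.

```python
def comb_sum(runs):
    """CombSUM: sum of a document's scores over all runs (dicts doc -> score)."""
    fused = {}
    for run in runs:
        for doc, score in run.items():
            fused[doc] = fused.get(doc, 0.0) + score
    return fused

def comb_mnz(runs):
    """CombMNZ: CombSUM multiplied by the number of runs that retrieved the document."""
    sums = comb_sum(runs)
    return {doc: s * sum(doc in run for run in runs) for doc, s in sums.items()}

run_bm25 = {"d1": 1.0, "d2": 0.5}   # hypothetical normalized scores
run_kl = {"d2": 1.0, "d3": 0.5}
fused_sum = comb_sum([run_bm25, run_kl])
fused_mnz = comb_mnz([run_bm25, run_kl])
ranking = sorted(fused_mnz, key=fused_mnz.get, reverse=True)
```

CombMNZ rewards agreement between retrieval models more strongly than CombSUM, which is why `d2` pulls further ahead in the second fusion.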

60 Result merging CORI (Callan et al, 95) Normalized collection score + Normalized document score. Result merging SSL (Si and Callan, 2003b) A B L R Selected resources C D D F Broker E Q Ranking F G H Query 60
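The CORI merging rule mentioned above combines a normalized collection score with a normalized document score. A commonly quoted form of the heuristic is D'' = (D' + 0.4 x D' x C') / 1.4, where D' and C' are min-max normalized document and collection scores. The sketch below uses that form; the constants and the normalization are stated from memory of the literature rather than from the slide, so treat them as assumptions.

```python
def minmax(x, lo, hi):
    """Min-max normalize x into [0, 1]; returns 0.0 for a degenerate range."""
    return (x - lo) / (hi - lo) if hi > lo else 0.0

def cori_merge_score(doc_score, doc_min, doc_max, coll_score, coll_min, coll_max):
    """Globally comparable score from a source-specific document score."""
    d = minmax(doc_score, doc_min, doc_max)     # normalized document score D'
    c = minmax(coll_score, coll_min, coll_max)  # normalized collection score C'
    return (d + 0.4 * d * c) / 1.4

best = cori_merge_score(1.0, 0.0, 1.0, 1.0, 0.0, 1.0)
weak_collection = cori_merge_score(1.0, 0.0, 1.0, 0.0, 0.0, 1.0)
```

A top document from the best collection scores 1.0, while the same document score from the worst-ranked collection is discounted to 1/1.4.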

61 Result merging Broker score Source-specific score Result merging - Miscellaneous scenarios Multi-lingual result merging SSL with logistic regression (Si and Callan, 05a; Si et al, 08) Personalized metasearch (Thomas, 08) Merging overlapped collections COSCO (Hernandez and Kambhampati 05): exact duplicates GHV (Bernstein et al, 06; Shokouhi et al, 07b): exact/near duplicates 61
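The broker score versus source-specific score relationship used by SSL can be sketched as a per-collection linear regression: documents that appear both in a collection's returned results and in the broker's sample index have two scores, and a line fitted through those pairs maps every returned document onto the broker's scale. This is a bare-bones sketch with invented scores; function names are assumptions.

```python
def fit_line(xs, ys):
    """Ordinary least-squares fit y = slope * x + intercept."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxx = sum((x - mx) ** 2 for x in xs)
    if sxx == 0:
        raise ValueError("need at least two distinct source scores")
    slope = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / sxx
    return slope, my - slope * mx

def ssl_map(source_scores, broker_scores):
    """Map one collection's scores to broker-comparable scores."""
    overlap = [d for d in source_scores if d in broker_scores]
    slope, intercept = fit_line([source_scores[d] for d in overlap],
                                [broker_scores[d] for d in overlap])
    return {d: slope * s + intercept for d, s in source_scores.items()}

source = {"a": 10.0, "b": 20.0, "c": 30.0}   # scores returned by the collection
broker = {"a": 0.2, "c": 0.6}                # scores of overlap docs in the sample index
mapped = ssl_map(source, broker)
```

Once every selected collection has its own fitted line, the mapped scores from all collections can be sorted into a single merged list.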

62 Slotted vs tiled result presentation
[Figure: images on top, in the middle, at the bottom, at top-right, at the bottom-right, on the left]
3 verticals, 3 positions, 3 degrees of vertical intent (Sushmita et al, 10)

Slotted vs tiled
- Designers of aggregated search interfaces should account for the aggregation styles
- For both, vertical intent is key for deciding on the position and type of vertical results
- Slotted: accurate estimation of the best position of vertical results
- Tiled: accurate selection of the type of vertical results

63 Recap Result presentation

                 federated search              aggregated search
Content type     homogeneous (text documents)  heterogeneous
Document scores  depends on environment        heterogeneous
Oracle           centralized index             none

Outline
- Introduction and Terminology
- Architecture
- Resource Representation
- Resource Selection
- Result Presentation
- Evaluation
- Open Problems
- Bibliography

64 Evaluation Evaluation: how to measure the effectiveness of federated and aggregated search systems. Resource representation (summaries) evaluation Federated search CTF ratio (Callan and Connell, 01) Spearman rank correlation coefficient (SRCC), (Callan and Connell, 01) Kullback-Leibler divergence (KL) (Baillie et al,06b; Ipeirotis et al, 2005), topical KL (Baillie et al, 09) Predictive likelihood (Baillie et al, 06a) 64
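Of the summary-quality measures listed above, the CTF ratio is the simplest: it is the share of all term occurrences in the actual collection that is covered by the vocabulary of the learned (sampled) description. The sketch assumes access to the true collection term frequencies, which is only possible in an experimental setting; the counts are invented.

```python
def ctf_ratio(actual_ctf, learned_vocab):
    """Fraction of the collection's term occurrences covered by the learned vocabulary.

    actual_ctf: term -> number of occurrences in the full collection.
    learned_vocab: set of terms present in the sampled resource description.
    """
    total = sum(actual_ctf.values())
    covered = sum(count for term, count in actual_ctf.items() if term in learned_vocab)
    return covered / total

actual = {"the": 50, "search": 30, "vertical": 15, "tiled": 5}
ratio = ctf_ratio(actual, {"the", "search"})
```

Because frequent terms dominate the ratio, a small sample can score highly here while still missing rare but discriminative terms, which is one reason rank correlation and KL divergence are reported alongside it.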

65 Resource selection evaluation Federated search Result merging evaluation Federated search Oracle Correct merging (centralized index ranking) (Hawking and Thistlewaite, 99) Perfect merging (ordered by relevance labels) (Hawking and Thistlewaite, 99) Metrics Precision Correct matches (Chakravarthy and Haase, 95) 65

66 Vertical Selection Evaluation Aggregated search Majority of publications focus on single vertical selection vertical accuracy, precision, recall Evaluation data editorial data behavioral data single vertical selection Editorial data Guidelines judge relevance based on vertical results (implicit judging of retrieval/content quality) judge relevance based on vertical description (assumes idealized retrieval/content quality) Evaluation metric derived from binary or graded relevance judgments (Arguello etal, 09; Arguello et al, 10) 66
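For single vertical selection with binary editorial labels, the accuracy, precision and recall mentioned above reduce to counting agreement between the system's decision and the label for each query. A minimal sketch, with invented decisions:

```python
def selection_metrics(pairs):
    """pairs: list of (predicted, relevant) booleans, one per query."""
    tp = sum(p and r for p, r in pairs)
    fp = sum(p and not r for p, r in pairs)
    fn = sum(r and not p for p, r in pairs)
    tn = sum(not p and not r for p, r in pairs)
    return {
        "accuracy": (tp + tn) / len(pairs),
        "precision": tp / (tp + fp) if tp + fp else 0.0,
        "recall": tp / (tp + fn) if tp + fn else 0.0,
    }

# (system showed the vertical, editors judged the vertical relevant)
decisions = [(True, True), (True, False), (False, True), (False, False)]
metrics = selection_metrics(decisions)
```

Accuracy alone can be misleading when most queries have no vertical intent, so precision and recall are usually reported per vertical.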

67 Behavioral data
- Infer relevance from behavioral data (e.g. click data)
- Evaluation metric: regression error on predicted CTR; infer binary or graded relevance
(Diaz, 09; Konig et al, 09)

Test collections (a la TREC)
quantity/media: text, image, video, total
size (G)
number of documents: 86,186, ,439 1,253* 86,858,007
Statistics on topics
number of topics: 150
average rel docs per topic
average rel verticals per topic: 1.75
ratio of General Web topics: 29.3%
ratio of topics with two vertical intents: 66.7%
ratio of topics with more than two vertical intents: 4.0%
* There are on average more than 100 events/shots contained in each video clip (document)
(Zhou & Lalmas, 10)

68 Test collections (a la TREC)
Existing test collections: ImageCLEF photo retrieval track, TREC web track, INEX ad-hoc track, TREC blog track
(Simulated) verticals: Image Vertical, Blog Vertical, Reference (Encyclopedia) Vertical, Shopping Vertical, General Web Vertical
[Figure: per-topic document judgments (R/N) from the existing test collections mapped to per-topic, per-vertical document judgments for verticals V1 to Vk]

Recap Evaluation

                 federated search              aggregated search
Editorial data   document relevance judgments  query labels
Behavioral data  none                          critical

69 Outline Introduction and Terminology Architecture Resource Representation Resource Selection Result Presentation Evaluation Open Problems Bibliography Open problems in federated search Beyond big document Classification-based server selection (Arguello et al, 09a) Topic modeling Query expansion Previous techniques had little success (Ogilvie and Callan, 01; Shokouhi et al, 09) Evaluating federated search Confounding factors Federated search in other context Blog Search (Elsas et al, 08; Seo and Croft, 08) Effective merging Supervised techniques 69

70 Open problems in aggregated search Evaluation metrics slotted presentation tiled presentation metrics based on behavioral signals Models for multiple verticals Minimizing the cost for new verticals, markets Outline Introduction and Terminology Architecture Resource Representation Resource Selection Result Presentation Evaluation Open Problems Bibliography 70

71 Bibliography
J. Arguello, F. Diaz, J. Callan, and J.-F. Crespo. Sources of evidence for vertical selection. In SIGIR 2009 (2009).
J. Arguello, J. Callan, and F. Diaz. Classification-based resource selection. In Proceedings of the ACM CIKM, Hong Kong, China, 2009a.
J. Arguello, F. Diaz, and J.-F. Paiement. Vertical selection in the presence of unlabeled verticals. In SIGIR 2010 (2010).
J. Aslam and M. Montague. Models for metasearch. In Proceedings of ACM SIGIR, New Orleans, LA.
M. Baillie, L. Azzopardi, and F. Crestani. Adaptive query-based sampling of distributed collections. In Proceedings of SPIRE, Glasgow, UK, 2006a.
M. Baillie, L. Azzopardi, and F. Crestani. Towards better measures: evaluation of estimated resource description quality for distributed IR. In X. Jia, editor, Proceedings of the First International Conference on Scalable Information Systems, page 41, Hong Kong, 2006b.
M. Baillie, M. Carman, and F. Crestani. A topic-based measure of resource description quality for distributed information retrieval. In Proceedings of ECIR, Toulouse, France.
Z. Bar-Yossef and M. Gurevich. Random sampling from a search engine's index. In Proceedings of WWW, Edinburgh, UK.
S. M. Beitzel, E. C. Jensen, D. D. Lewis, A. Chowdhury, and O. Frieder. Automatic classification of web queries using very large unlabeled query logs. ACM Trans. Inf. Syst. 25, 2 (2007), 9.
Y. Bernstein, M. Shokouhi, and J. Zobel. Compact features for detection of near-duplicates in distributed retrieval. In Proceedings of SPIRE, Glasgow, UK.
J. Callan and M. Connell. Query-based sampling of text databases. ACM Transactions on Information Systems, 19(2).
J. Callan, Z. Lu, and B. Croft. Searching distributed collections with inference networks. In Proceedings of ACM SIGIR, Seattle, WA, 1995.
J. Caverlee, L. Liu, and J. Bae. Distributed query sampling: a quality-conscious approach. In Proceedings of ACM SIGIR, Seattle, WA.
S. Cetintas, L. Si, and H. Yuan. Learning from past queries for resource selection. In Proceedings of ACM CIKM, Hong Kong, China.

72 Bibliography
B.T. Bartell, G.W. Cottrell, and R.K. Belew. Automatic combination of multiple ranked retrieval systems. ACM SIGIR.
C. Baumgarten. A probabilistic solution to the selection and fusion problem in distributed information retrieval. ACM SIGIR.
N. Craswell. Methods for Distributed Information Retrieval. PhD thesis, Australian National University.
S. Cronen-Townsend, Y. Zhou, and W. B. Croft. Predicting query performance. ACM SIGIR.
A. Chakravarthy and K. Haase. NetSerf: using semantic knowledge to find internet information archives. ACM SIGIR, pp 4-11, Seattle, WA.
F. Diaz. Performance prediction using spatial autocorrelation. ACM SIGIR.
F. Diaz. Integration of news content into web results. ACM International Conference on Web Search and Data Mining.
F. Diaz and J. Arguello. Adaptation of offline vertical selection predictions in the presence of user feedback. ACM SIGIR.
D. Dreilinger and A. Howe. Experiences with selecting search engines using metasearch. ACM Transactions on Information Systems, 15(3).
J. Elsas, J. Arguello, J. Callan, and J. Carbonell. Retrieval and feedback models for blog feed search. ACM SIGIR, Singapore.
E. Glover, S. Lawrence, W. Birmingham, and C. Giles. Architecture of a metasearch engine that supports user information needs. ACM CIKM, 1999.
L. Gravano, H. García-Molina, and A. Tomasic. Precision and recall of GlOSS estimators for database discovery. Third International Conference on Parallel and Distributed Information Systems, Austin, TX, 1994a.
L. Gravano, H. García-Molina, and A. Tomasic. The effectiveness of GlOSS for the text database discovery problem. ACM SIGMOD, Minneapolis, MN, 1994b.
L. Gravano, C. Chang, H. García-Molina, and A. Paepcke. STARTS: Stanford proposal for internet metasearching. ACM SIGMOD, Tucson, AZ.
L. Gravano, H. García-Molina, and A. Tomasic. GlOSS: text-source discovery over the internet. ACM Transactions on Database Systems, 24(2).
E. Fox and J. Shaw. Combination of multiple searches. Second Text REtrieval Conference, Gaithersburg, MD.
E. Fox and J. Shaw. Combination of multiple searches. Third Text REtrieval Conference, Gaithersburg, MD.
J. French and A. Powell. Metrics for evaluating database selection techniques. World Wide Web, 3(3).
C. Hauff. Predicting the Effectiveness of Queries and Retrieval Systems. PhD thesis, University of Twente.

73 Bibliography
D. Hawking and P. Thomas. Server selection methods in hybrid portal search. ACM SIGIR, pp 75-82, Salvador, Brazil.
D. Hawking and P. Thistlewaite. Methods for information server selection. ACM Transactions on Information Systems, 17(1):40-76.
T. Hernandez and S. Kambhampati. Improving text collection selection with coverage and overlap statistics. WWW, Chiba, Japan.
P. Ipeirotis and L. Gravano. When one sample is not enough: improving text database selection using shrinkage. ACM SIGMOD, Paris, France.
P. Ipeirotis and L. Gravano. Distributed search over the hidden web: hierarchical database sampling and selection. VLDB, Hong Kong, China.
P. Ipeirotis and L. Gravano. Classification-aware hidden-web text database selection. ACM Transactions on Information Systems, 26(2):1-66.
P. Ipeirotis, A. Ntoulas, J. Cho, and L. Gravano. Modeling and managing content changes in text databases. 21st International Conference on Data Engineering, Tokyo, Japan.
A. C. König, M. Gamon, and Q. Wu. Click-through prediction for news queries. ACM SIGIR.
X. Li, Y.-Y. Wang, and A. Acero. Learning query intent from regularized click graphs. ACM SIGIR.
D. Lillis, F. Toolan, R. Collier, and J. Dunnion. ProbFuse: a probabilistic approach to data fusion. ACM SIGIR, Seattle, WA.
K. Liu, C. Yu, and W. Meng. Discovering the representative of a search engine. ACM CIKM, McLean, VA.
N. Liu, J. Yan, W. Fan, Q. Yang, and Z. Chen. Identifying vertical search intention of query through social tagging propagation. WWW, Madrid.
W. Meng, Z. Wu, C. Yu, and Z. Li. A highly scalable and effective method for metasearch. ACM Transactions on Information Systems, 19(3).
W. Meng, C. Yu, and K. Liu. Building efficient and effective metasearch engines. ACM Computing Surveys, 34(1):48-89.
V. Murdock and M. Lalmas. Workshop on aggregated search. SIGIR Forum, 42(2):80-83.
H. Nottelmann and N. Fuhr. Combining CORI and the decision-theoretic approach for advanced resource selection. ECIR, Sunderland, UK.
P. Ogilvie and J. Callan. The effectiveness of query expansion for distributed information retrieval. ACM CIKM, Atlanta, GA.
C. Paris, S. Wan, and P. Thomas. Focused and aggregated search: a perspective from natural language generation. Journal of Information Retrieval, Special Issue.

S. Park. Analysis of characteristics and trends of Web queries submitted to NAVER, a major Korean search engine, Library & Information Science Research 31(2): ,
F. Schumacher and R. Eschmeyer. The estimation of fish populations in lakes and ponds, Journal of the Tennessee Academy of Science, 18: ,
M. Shokouhi. Central-rank-based collection selection in uncooperative distributed information retrieval, ECIR, pp , Rome, Italy, 2007a.
J. Seo and B. Croft. Blog site search using resource selection, ACM CIKM, pp , Napa Valley, CA,
M. Shokouhi. Segmentation of search engine results for effective data-fusion, ECIR, pp , Rome, Italy, 2007b.
M. Shokouhi and J. Zobel. Robust result merging using sample-based score estimates, ACM Transactions on Information Systems, 27(3):1-29,
M. Shokouhi and J. Zobel. Federated text retrieval from uncooperative overlapped collections, ACM SIGIR, pp Amsterdam, Netherlands,
M. Shokouhi, F. Scholer, and J. Zobel. Sample sizes for query probing in uncooperative distributed information retrieval, Eighth Asia Pacific Web Conference, pp , Harbin, China, 2006a.
M. Shokouhi, J. Zobel, F. Scholer, and S. Tahaghoghi. Capturing collection size for distributed non-cooperative retrieval, ACM SIGIR, pp , Seattle, WA, 2006b.
M. Shokouhi, J. Zobel, S. Tahaghoghi, and F. Scholer. Using query logs to establish vocabularies in distributed information retrieval, Information Processing and Management, 43(1): , 2007d.
M. Shokouhi, P. Thomas, and L. Azzopardi. Effective query expansion for federated search, ACM SIGIR, pp , Singapore,
L. Si and J. Callan. Unified utility maximization framework for resource selection, ACM CIKM, pages 32-41, Washington, DC, 2004a.
L. Si and J. Callan. CLEF2005: multilingual retrieval by combining multiple multilingual ranked lists. Sixth Workshop of the Cross-Language Evaluation Forum, Vienna, Austria, 2005a.
L. Si, J. Callan, S. Cetintas, and H. Yuan. An effective and efficient results merging strategy for multilingual information retrieval in federated search environments, Information Retrieval, 11(1):1-24,
L. Si and J. Callan. Relevant document distribution estimation method for resource selection, ACM SIGIR, pp , Toronto, Canada, 2003a.
L. Si and J. Callan. Modeling search engine effectiveness for federated search, ACM SIGIR, pp 83-90, Salvador, Brazil, 2005b.
L. Si and J. Callan. A semisupervised learning method to merge search engine results, ACM Transactions on Information Systems, 21(4): , 2003b.

A. Sugiura and O. Etzioni. Query routing for web search engines: architectures and experiments, WWW, Pages , Amsterdam, Netherlands,
S. Sushmita, H. Joho and M. Lalmas. A Task-Based Evaluation of an Aggregated Search Interface, SPIRE, Saariselkä, Finland,
S. Sushmita, H. Joho, M. Lalmas, and R. Villa. Factors Affecting Click-Through Behavior in Aggregated Search Interfaces, ACM CIKM, Toronto, Canada,
S. Sushmita, B. Piwowarski, and M. Lalmas. Dynamics of Genre and Domain Intents, Technical Report, University of Glasgow
S. Sushmita, H. Joho, M. Lalmas and J.M. Jose. Understanding domain "relevance" in web search, WWW 2009 Workshop on Web Search Result Summarization and Presentation, Madrid, Spain,
P. Thomas and D. Hawking. Evaluating sampling methods for uncooperative collections, ACM SIGIR, pp , Amsterdam, Netherlands,
P. Thomas. Server characterisation and selection for personal metasearch, PhD thesis, Australian National University,
P. Thomas and M. Shokouhi. SUSHI: scoring scaled samples for server selection, ACM SIGIR, pp , Singapore, Singapore,
A. Trotman, S. Geva, J. Kamps, M. Lalmas and V. Murdock (eds). Current research in focused retrieval and result aggregation, Special Issue in the Journal of Information Retrieval, Springer,
T. Tsikrika and M. Lalmas. Merging Techniques for Performing Data Fusion on the Web, ACM CIKM, pp , Atlanta, Georgia,
Ellen M. Voorhees, Narendra Kumar Gupta, Ben Johnson-Laird. Learning Collection Fusion Strategies, ACM SIGIR, pp ,
B. Yuwono and D. Lee. WISE: A world wide web resource database system. IEEE Transactions on Knowledge and Data Engineering, 8(4): ,
B. Yuwono and D. Lee. Server ranking for distributed text retrieval systems on the internet. Fifth International Conference on Database Systems for Advanced Applications, 6, pp 41-50, Melbourne, Australia,
J. Xu and J. Callan. Effective retrieval with distributed collections, ACM SIGIR, pp , Melbourne, Australia,
A. Zhou and M. Lalmas. Building a Test Collection for Aggregated Search, Technical Report, University of Glasgow
J. Zobel. Collection selection via lexicon inspection, Australian Document Computing Symposium, pp , Melbourne, Australia,


More information

Robust Relevance-Based Language Models

Robust Relevance-Based Language Models Robust Relevance-Based Language Models Xiaoyan Li Department of Computer Science, Mount Holyoke College 50 College Street, South Hadley, MA 01075, USA Email: xli@mtholyoke.edu ABSTRACT We propose a new

More information

A Task-Based Evaluation of an Aggregated Search Interface

A Task-Based Evaluation of an Aggregated Search Interface A Task-Based Evaluation of an Aggregated Search Interface No Author Given No Institute Given Abstract. This paper presents a user study that evaluated the effectiveness of an aggregated search interface

More information

Document Allocation Policies for Selective Searching of Distributed Indexes

Document Allocation Policies for Selective Searching of Distributed Indexes Document Allocation Policies for Selective Searching of Distributed Indexes Anagha Kulkarni and Jamie Callan Language Technologies Institute School of Computer Science Carnegie Mellon University 5 Forbes

More information

Better Contextual Suggestions in ClueWeb12 Using Domain Knowledge Inferred from The Open Web

Better Contextual Suggestions in ClueWeb12 Using Domain Knowledge Inferred from The Open Web Better Contextual Suggestions in ClueWeb12 Using Domain Knowledge Inferred from The Open Web Thaer Samar 1, Alejandro Bellogín 2, and Arjen P. de Vries 1 1 Centrum Wiskunde & Informatica, {samar,arjen}@cwi.nl

More information

Automatic Classification of Text Databases through Query Probing

Automatic Classification of Text Databases through Query Probing Automatic Classification of Text Databases through Query Probing Panagiotis G. Ipeirotis Computer Science Dept. Columbia University pirot@cs.columbia.edu Luis Gravano Computer Science Dept. Columbia University

More information

Tilburg University. Authoritative re-ranking of search results Bogers, A.M.; van den Bosch, A. Published in: Advances in Information Retrieval

Tilburg University. Authoritative re-ranking of search results Bogers, A.M.; van den Bosch, A. Published in: Advances in Information Retrieval Tilburg University Authoritative re-ranking of search results Bogers, A.M.; van den Bosch, A. Published in: Advances in Information Retrieval Publication date: 2006 Link to publication Citation for published

More information

IMPROVING INFORMATION RETRIEVAL BASED ON QUERY CLASSIFICATION ALGORITHM

IMPROVING INFORMATION RETRIEVAL BASED ON QUERY CLASSIFICATION ALGORITHM IMPROVING INFORMATION RETRIEVAL BASED ON QUERY CLASSIFICATION ALGORITHM Myomyo Thannaing 1, Ayenandar Hlaing 2 1,2 University of Technology (Yadanarpon Cyber City), near Pyin Oo Lwin, Myanmar ABSTRACT

More information

Empowering People with Knowledge the Next Frontier for Web Search. Wei-Ying Ma Assistant Managing Director Microsoft Research Asia

Empowering People with Knowledge the Next Frontier for Web Search. Wei-Ying Ma Assistant Managing Director Microsoft Research Asia Empowering People with Knowledge the Next Frontier for Web Search Wei-Ying Ma Assistant Managing Director Microsoft Research Asia Important Trends for Web Search Organizing all information Addressing user

More information

Collection Selection with Highly Discriminative Keys

Collection Selection with Highly Discriminative Keys Collection Selection with Highly Discriminative Keys Sander Bockting Avanade Netherlands B.V. Versterkerstraat 6 1322 AP, Almere, Netherlands sander.bockting@avanade.com Djoerd Hiemstra University of Twente

More information

A Metric for Inferring User Search Goals in Search Engines

A Metric for Inferring User Search Goals in Search Engines International Journal of Engineering and Technical Research (IJETR) A Metric for Inferring User Search Goals in Search Engines M. Monika, N. Rajesh, K.Rameshbabu Abstract For a broad topic, different users

More information

MIRACLE at ImageCLEFmed 2008: Evaluating Strategies for Automatic Topic Expansion

MIRACLE at ImageCLEFmed 2008: Evaluating Strategies for Automatic Topic Expansion MIRACLE at ImageCLEFmed 2008: Evaluating Strategies for Automatic Topic Expansion Sara Lana-Serrano 1,3, Julio Villena-Román 2,3, José C. González-Cristóbal 1,3 1 Universidad Politécnica de Madrid 2 Universidad

More information

Classification-Aware Hidden-Web Text Database Selection

Classification-Aware Hidden-Web Text Database Selection 6 Classification-Aware Hidden-Web Text Database Selection PANAGIOTIS G. IPEIROTIS New York University and LUIS GRAVANO Columbia University Many valuable text databases on the web have noncrawlable contents

More information

Relevance Score Normalization for Metasearch

Relevance Score Normalization for Metasearch Relevance Score Normalization for Metasearch Mark Montague Department of Computer Science Dartmouth College 6211 Sudikoff Laboratory Hanover, NH 03755 montague@cs.dartmouth.edu Javed A. Aslam Department

More information

Classification and retrieval of biomedical literatures: SNUMedinfo at CLEF QA track BioASQ 2014

Classification and retrieval of biomedical literatures: SNUMedinfo at CLEF QA track BioASQ 2014 Classification and retrieval of biomedical literatures: SNUMedinfo at CLEF QA track BioASQ 2014 Sungbin Choi, Jinwook Choi Medical Informatics Laboratory, Seoul National University, Seoul, Republic of

More information

Jan Pedersen 22 July 2010

Jan Pedersen 22 July 2010 Jan Pedersen 22 July 2010 Outline Problem Statement Best effort retrieval vs automated reformulation Query Evaluation Architecture Query Understanding Models Data Sources Standard IR Assumptions Queries

More information

Ranking models in Information Retrieval: A Survey

Ranking models in Information Retrieval: A Survey Ranking models in Information Retrieval: A Survey R.Suganya Devi Research Scholar Department of Computer Science and Engineering College of Engineering, Guindy, Chennai, Tamilnadu, India Dr D Manjula Professor

More information

Information Discovery, Extraction and Integration for the Hidden Web

Information Discovery, Extraction and Integration for the Hidden Web Information Discovery, Extraction and Integration for the Hidden Web Jiying Wang Department of Computer Science University of Science and Technology Clear Water Bay, Kowloon Hong Kong cswangjy@cs.ust.hk

More information