Statistical Language Models for Information Retrieval


1 Statistical Language Models for Information Retrieval Tutorial at ACM SIGIR 2006 Aug. 6, 2006 ChengXiang Zhai Department of Computer Science University of Illinois, Urbana-Champaign

2 Goal of the Tutorial Introduce the emerging area of applying statistical language models (SLMs) to information retrieval (IR). Targeted audience: IR practitioners who are interested in acquiring advanced modeling techniques IR researchers who are looking for new research problems in IR models Accessible to anyone with basic knowledge of probability and statistics ChengXiang Zhai,

3 Scope of the Tutorial. What will be covered: brief background on IR and SLMs; review of recent applications of unigram SLMs in IR; details of some specific methods that are either empirically effective or theoretically important; a framework for systematically exploring SLMs in IR; outstanding research issues in applying SLMs to IR. What will not be covered: traditional IR methods (see any IR textbook, e.g., [Baeza-Yates & Ribeiro-Neto 99, Grossman & Frieder 04]); implementation of IR systems (see [Witten et al. 99]); discussion of high-order or other complex SLMs (see [Manning & Schutze 99] and [Jelinek 98]); application of SLMs in supervised learning, e.g., TDT and text categorization (see publications in Machine Learning, Speech Recognition, and Natural Language Processing).

4 Tutorial Outline 1. Introduction 2. The Basic Language Modeling Approach 3. More Advanced Language Models 4. Language Models for Special Retrieval Tasks 5. A General Framework for Applying SLMs to IR 6. Summary ChengXiang Zhai,

5 1. Introduction Part 1: Introduction - Information Retrieval (IR) - Statistical Language Models (SLMs) - Applications of SLMs to IR We are here 2. The Basic Language Modeling Approach 3. More Advanced Language Models 4. Language Models for Special Retrieval Tasks 5. A General Framework for Applying SLMs to IR 6. Summary ChengXiang Zhai,

6 What is Information Retrieval (IR)? Narrow sense (= ad hoc text retrieval): given a collection of text documents (information items) and a text query from a user (information need), retrieve relevant documents from the collection. A broader sense of IR may include retrieving non-textual information (e.g., images) and other tasks (e.g., filtering, categorization, or summarization). In this tutorial, IR = ad hoc text retrieval. Ad hoc text retrieval is fundamental to IR and has many applications (e.g., search engines, digital libraries).

7 Formalization of IR Tasks. Vocabulary V = {w_1, w_2, ..., w_N} of a language. Query q = q_1...q_m, where q_i ∈ V. Document d_i = d_i1...d_im_i, where d_ij ∈ V. Collection C = {d_1, ..., d_k}. Set of relevant documents R(q) ⊆ C: generally unknown and user-dependent; the query is only a hint on which docs are in R(q). Task = compute R'(q), an approximation of R(q).

8 Computing R'(q): Doc Selection vs. Ranking. Doc selection: R'(q) = {d ∈ C | f(d,q) = 1}, where f(d,q) ∈ {0,1} is an indicator function (classifier) approximating the true R(q). Doc ranking: R'(q) = {d ∈ C | f(d,q) > θ}, where f(d,q) is a ranking function and θ is a cutoff implicitly set by the user as he/she scans down the ranked list (e.g., θ = 0.77 in the illustration).

9 Problems with Doc Selection. The classifier is unlikely to be accurate: an over-constrained query (terms are too specific) finds no relevant documents, while an under-constrained query (terms are too general) over-delivers, and it is extremely hard to find the right position between these two extremes. Even if it is accurate, all relevant documents are not equally relevant. Relevance is a matter of degree!

10 Ranking is often preferred. A user can stop browsing anywhere, so the boundary/cutoff is controlled by the user: high-recall users would view more items; high-precision users would view only a few. Theoretical justification: Probability Ranking Principle [Robertson 77], Risk Minimization [Zhai 02, Zhai & Lafferty 06]. The retrieval problem is now reduced to defining a ranking function f such that, for all q, d1, d2: f(q,d1) > f(q,d2) iff p(relevant|q,d1) > p(relevant|q,d2). Function f is an operational definition of relevance; most IR research is centered on finding a good f.

11 Two Well-Known Traditional Retrieval Formulas [Singhal 01]. Key retrieval heuristics: TF (term frequency), IDF (inverse document frequency), and length normalization [Sparck Jones 72, Salton & Buckley 88, Singhal et al. 96, Robertson & Walker 94, Fang et al. 04]; e.g., the BM25 TF component is (k1 + 1)·tf / (k1·(1 - b + b·dl/avdl) + tf). Other heuristics: stemming, stop word removal, phrases. Similar quantities will occur in the LMs.

12 Feedback in IR. The retrieval engine takes a query, searches the document collection, and returns results (a ranked list d1, d2, ..., dk); feedback then learns from examples to produce an updated query. Relevance feedback: the user judges documents (e.g., d1 +, d2 -, d3 +, ..., dk -). Pseudo feedback: simply assume the top 10 docs are relevant (judgments d1 +, d2 +, d3 +, ..., dk -), with no user involvement.

13 Feedback in IR (cont.) An essential component in any IR method. Relevance feedback is always desirable, but a user may not be willing to provide explicit judgments. Pseudo/automatic feedback is always possible, and often improves performance on average by exploiting word co-occurrences and enriching a query with additional related words, indirectly addressing issues such as ambiguous words and synonyms. Implicit feedback is a good compromise.

14 Evaluation of Retrieval Performance. Example: total # relevant docs = 8, and a ranked list d1, d2, ..., d10 is retrieved with relevant docs at ranks 1, 2, 4, and 10. As a SET of results: precision = #(relevant & retrieved)/#retrieved = 4/10 = 0.4; recall = #(relevant & retrieved)/#relevant = 4/8 = 0.5. As a ranked list: rankings are compared with precision-recall (PR) curves; the curves may show A > C and B > C, but is A > B? So we summarize a ranking with a single number: AvgPrec = (1/k) Σ_{i=1..k} p_i, where k is the total # of relevant docs, p_i is the precision at the rank where the i-th relevant doc is retrieved, and p_i = 0 if the i-th relevant doc is not retrieved. Here AvgPrec = (1/1 + 2/2 + 3/4 + 4/10 + 0 + ... + 0)/8 = 0.394. Avg. Prec. is sensitive to the position of each relevant doc!
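
To make these measures concrete, here is a minimal sketch in Python; the toy ranked list and relevance judgments are assumptions chosen to reproduce the slide's numbers, not data from the tutorial.

```python
def precision_recall(ranking, relevant):
    rel_ret = sum(1 for d in ranking if d in relevant)
    return rel_ret / len(ranking), rel_ret / len(relevant)

def average_precision(ranking, relevant):
    # AvgPrec = (1/k) * sum of precision at each relevant doc's rank;
    # an unretrieved relevant doc contributes p_i = 0.
    hits, total = 0, 0.0
    for rank, doc in enumerate(ranking, start=1):
        if doc in relevant:
            hits += 1
            total += hits / rank
    return total / len(relevant)

# Toy data mirroring the slide: 8 relevant docs overall, 4 retrieved in the
# top 10, at ranks 1, 2, 4, and 10.
ranking = ["d1", "d2", "x1", "d3", "x2", "x3", "x4", "x5", "x6", "d4"]
relevant = {"d1", "d2", "d3", "d4", "r5", "r6", "r7", "r8"}
print(precision_recall(ranking, relevant))    # (0.4, 0.5)
print(average_precision(ranking, relevant))   # (1/1 + 2/2 + 3/4 + 4/10)/8 ≈ 0.394
```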

15 1. Introduction Part 1: Introduction (cont.) - Information Retrieval (IR) - Statistical Language Models (SLMs) - Application of SLMs to IR We are here 2. The Basic Language Modeling Approach 3. More Advanced Language Models 4. Language Models for Special Retrieval Tasks 5. A General Framework for Applying SLMs to IR 6. Summary ChengXiang Zhai,

16 What is a Statistical LM? A probability distribution over word sequences: e.g., p("Today is Wednesday") should be far larger than p("Today Wednesday is"), and p("The eigenvalue is positive") depends on the domain; the probabilities are context/topic dependent! Can also be regarded as a probabilistic mechanism for generating text, thus also called a generative model.

17 Why is a LM Useful? Provides a principled way to quantify the uncertainties associated with natural language. Allows us to answer questions like: Given that we see "John" and "feels", how likely will we see "happy" as opposed to "habit" as the next word? (speech recognition) Given that we observe "baseball" three times and "game" once in a news article, how likely is it about sports? (text categorization, information retrieval) Given that a user is interested in sports news, how likely would the user use "baseball" in a query? (information retrieval)

18 Source-Channel Framework (Model of Communication System [Shannon 48]). A source emits X with p(X), a transmitter (encoder) sends it through a noisy channel p(Y|X), and a receiver (decoder) recovers X from Y for the destination: X̂ = argmax_X p(X|Y) = argmax_X p(Y|X)·p(X) (Bayes rule). When X is text, p(X) is a language model. Many examples: speech recognition (X = word sequence, Y = speech signal); machine translation (X = English sentence, Y = Chinese sentence); OCR error correction (X = correct word, Y = erroneous word); information retrieval (X = document, Y = query); summarization (X = summary, Y = document).

19 The Simplest Language Model (Unigram Model). Generate a piece of text by generating each word independently; thus p(w_1 w_2 ... w_n) = p(w_1)·p(w_2)·...·p(w_n). Parameters: {p(w_i)}, with p(w_1) + ... + p(w_N) = 1 (N is the vocabulary size); essentially a multinomial distribution over words. A piece of text can be regarded as a sample drawn according to this word distribution.
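
A minimal sketch of such a unigram model in Python follows; the toy word distribution is an assumption, loosely echoing the topic models on the next slide.

```python
import random

# A unigram LM is just a multinomial over words (toy probabilities, assumed).
model = {"text": 0.2, "mining": 0.1, "association": 0.01, "clustering": 0.02,
         "the": 0.3, "is": 0.2, "paper": 0.17}

def generate(model, n):
    # Each word is drawn independently: p(w1...wn) = p(w1)*...*p(wn).
    words, probs = zip(*model.items())
    return random.choices(words, weights=probs, k=n)

def likelihood(model, text):
    p = 1.0
    for w in text:
        p *= model.get(w, 0.0)  # unseen words get probability 0 (hence smoothing later)
    return p

doc = generate(model, 5)
print(doc, likelihood(model, doc))
```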

20 Text Generation with Unigram LM. A (unigram) language model θ gives p(w|θ); sampling from it produces a document d, and given θ, p(d|θ) varies according to d. Topic 1 (text mining): text 0.2, mining 0.1, association 0.01, clustering 0.02, ..., food (near zero) generates a text mining paper. Topic 2 (health): food 0.25, nutrition 0.1, healthy 0.05, diet 0.02, ... generates a food nutrition paper.

21 Estimation of Unigram LM. Now reverse the direction: given a document, estimate p(w|θ). For a document with word counts text 10, mining 5, association 3, database 3, algorithm 2, query 1, efficient 1, ..., and total #words = 100, the estimates are the relative frequencies: text 10/100, mining 5/100, association 3/100, database 3/100, query 1/100, ... How good is the estimated model? It gives our document sample the highest probability, but it doesn't generalize well. More about this later.

22 More Sophisticated LMs. N-gram language models: in general, p(w_1 w_2 ... w_n) = p(w_1)·p(w_2|w_1)·...·p(w_n|w_1...w_{n-1}); an n-gram model conditions only on the past n-1 words, e.g., a bigram: p(w_1 ... w_n) = p(w_1)·p(w_2|w_1)·p(w_3|w_2)·...·p(w_n|w_{n-1}). Remote-dependence language models (e.g., maximum entropy model). Structured language models (e.g., probabilistic context-free grammar). These will barely be covered in this tutorial; if interested, read [Jelinek 98, Manning & Schutze 99, Rosenfeld 00].

23 Why Just Unigram Models? Difficulty in moving toward more complex models They involve more parameters, so need more data to estimate (A doc is an extremely small sample) They increase the computational complexity significantly, both in time and space Capturing word order or structure may not add so much value for topical inference But, using more sophisticated models can still be expected to improve performance... ChengXiang Zhai,

24 Evaluation of SLMs Direct evaluation criterion: How well does the model fit the data to be modeled? Example measures: Data likelihood, perplexity, cross entropy, Kullback-Leibler divergence (mostly equivalent) Indirect evaluation criterion: Does the model help improve the performance of the task? Specific measure is task dependent For retrieval, we look at whether a model helps improve retrieval accuracy We hope more reasonable LMs would achieve better retrieval performance ChengXiang Zhai,

25 1. Introduction Part 1: Introduction (cont.) - Information Retrieval (IR) - Statistical Language Models (SLMs) - Application of SLMs to IR We are here 2. The Basic Language Modeling Approach 3. More Advanced Language Models 4. Language Models for Special Retrieval Tasks 5. A General Framework for Applying SLMs to IR 6. Summary ChengXiang Zhai,

26 Representative LMs for IR (a map of the work, roughly grouped). Basic LM (query likelihood): Ponte & Croft 98; Hiemstra & Kraaij 99; Miller et al. 99. Improved basic LM: beyond unigram (Song & Croft 99), translation model (Berger & Lafferty 99), parameter sensitivity (Ng 00), smoothing examined (Zhai & Lafferty 01a), Bayesian query likelihood (Zaragoza et al. 03), theoretical justification (Lafferty & Zhai 01a, 01b), URL prior (Kraaij et al. 02), time prior (Li & Croft 03), two-stage LMs (Zhai & Lafferty 02), term-specific smoothing (Hiemstra 02), title LM (Jin et al. 02), concept likelihood (Srikanth & Srihari 03), cluster LM (Kurland & Lee 04), parsimonious LM (Hiemstra et al. 04), cluster smoothing (Liu & Croft 04; Tao et al. 06), dependency LM (Gao et al. 04), thesauri (Cao et al. 05). Query/relevance model & feedback: Xu & Croft 99; relevance LM (Lavrenko & Croft 01), model-based FB (Zhai & Lafferty 01b), Markov-chain query model (Lafferty & Zhai 01b), relevant query FB (Nallapati et al. 03), pseudo query (Kurland et al. 05), query expansion (Bai et al. 05), robust estimation (Tao & Zhai 06). Special IR tasks: Xu & Croft 99; Xu et al. 01; Lavrenko et al. 02; Zhang et al. 02; Cronen-Townsend et al. 02; Si et al. 02; Ogilvie & Callan 03; Zhai et al. 03; Kurland & Lee 05; Shen et al. 05; Tan et al. 06. Dissertations: Ponte 98; Berger 01; Hiemstra 01; Zhai 02; Lavrenko 04; Kraaij 04; Srikanth 04.

27 Ponte & Croft's Pioneering Work [Ponte & Croft 98]. Contribution 1: a new query likelihood scoring method, p(q|D). [Maron and Kuhns 60] had the idea of query likelihood, but didn't work out how to estimate p(q|D). Contribution 2: connecting LMs with text representation and weighting in IR. [Wong & Yao 89] had the idea of representing text with a multinomial distribution (relative frequency), but didn't study the estimation problem. Good performance is reported using the simple query likelihood method.

28 Early Work (1998-1999). At about the same time as SIGIR 98, in TREC-7, two groups explored similar ideas independently: BBN [Miller et al. 99] and Univ. of Twente [Hiemstra & Kraaij 99]. In TREC-8, Ng from MIT motivated the same query likelihood method in a different way [Ng 99]. All follow the simple query likelihood method; the methods differ in the way the model is estimated and in the event model for the query. All show promising empirical results. Main problems: feedback is explored heuristically, and there is a lack of understanding of why the method works.

29 Later Work (1999-). Attempts to understand why LMs work [Zhai & Lafferty 01a, Lafferty & Zhai 01a, Ponte 01, Greiff & Morgan 03, Sparck Jones et al. 03, Lavrenko 04]. Further extending/improving the basic LMs [Song & Croft 99, Berger & Lafferty 99, Jin et al. 02, Nallapati & Allan 02, Hiemstra 02, Zaragoza et al. 03, Srikanth & Srihari 03, Nallapati et al. 03, Li & Croft 03, Gao et al. 04, Liu & Croft 04, Kurland & Lee 04, Hiemstra et al. 04, Cao et al. 05, Tao et al. 06]. Exploring alternative ways of using LMs for retrieval (mostly query/relevance model estimation) [Xu & Croft 99, Lavrenko & Croft 01, Lafferty & Zhai 01a, Zhai & Lafferty 01b, Lavrenko 04, Kurland et al. 05, Bai et al. 05, Tao & Zhai 06]. Exploring the use of SLMs for special retrieval tasks [Xu & Croft 99, Xu et al. 01, Lavrenko et al. 02, Cronen-Townsend et al. 02, Zhang et al. 02, Ogilvie & Callan 03, Zhai et al. 03, Kurland & Lee 05, Shen et al. 05].

30 Part 2: The Basic LM Approach 1. Introduction 2. The Basic Language Modeling Approach - Query Likelihood Document Ranking - Smoothing of Language Models - Why does it work? - Variants of the basic LM 3. More Advanced Language Models 4. Language Models for Special Retrieval Tasks 5. A General Framework for Applying SLMs to IR 6. Summary We are here ChengXiang Zhai,

31 The Basic LM Approach [Ponte & Croft 98]. Estimate a language model from each document: e.g., a text mining paper yields p(text)=?, p(mining)=?, p(association)=?, p(clustering)=?, ...; a food nutrition paper yields p(food)=?, p(nutrition)=?, p(healthy)=?, p(diet)=?, ... Given the query "data mining algorithms": which model would most likely have generated this query?

32 Ranking Docs by Query Likelihood. For each doc d_i, estimate a doc LM θ_{d_i}; then compute the query likelihood p(q|θ_{d_i}) and rank d_1, d_2, ..., d_N by it.

33 Modeling Queries: Different Assumptions. Multi-Bernoulli: modeling word presence/absence; q = (x_1, ..., x_|V|), with x_i = 1 for presence of word w_i and x_i = 0 for absence: p(q = (x_1,...,x_|V|) | d) = Π_{i=1..|V|} p(w_i = x_i | d) = Π_{i: x_i=1} p(w_i=1|d) · Π_{i: x_i=0} p(w_i=0|d). Parameters: {p(w_i=1|d), p(w_i=0|d)}, with p(w_i=1|d) + p(w_i=0|d) = 1. Multinomial (unigram LM): modeling word frequency; q = q_1...q_m, where q_j is a query word: p(q = q_1...q_m | d) = Π_{j=1..m} p(q_j|d) = Π_{i=1..|V|} p(w_i|d)^{c(w_i,q)}, where c(w_i,q) is the count of word w_i in query q. Parameters: {p(w_i|d)}, with p(w_1|d) + ... + p(w_|V||d) = 1. [Ponte & Croft 98] uses multi-Bernoulli; most other work uses multinomial, which seems to work better [Song & Croft 99, McCallum & Nigam 98, Lavrenko 04].

34 Retrieval as LM Estimation. Document ranking based on query likelihood: log p(q|d) = Σ_{j=1..m} log p(q_j|d) = Σ_{i=1..|V|} c(w_i,q) log p(w_i|d), where q = q_1 q_2 ... q_m. The retrieval problem is thus reduced to estimating the document language model p(w_i|d); smoothing is an important issue, and distinguishes different approaches.
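
A minimal sketch of this scoring rule in Python follows; the toy documents are assumptions, and the unsmoothed ML document models deliberately exhibit the zero-probability problem discussed on the next slide.

```python
import math
from collections import Counter

def mle_model(doc_words):
    # p(w|d) as the relative frequency of w in d (no smoothing yet).
    counts, total = Counter(doc_words), len(doc_words)
    return lambda w: counts[w] / total

def log_query_likelihood(query_words, p_w_given_d):
    # log p(q|d) = sum over query words of c(w,q) * log p(w|d)
    score = 0.0
    for w, cnt in Counter(query_words).items():
        pw = p_w_given_d(w)
        if pw == 0.0:
            return float("-inf")  # one missing query word zeroes the doc out
        score += cnt * math.log(pw)
    return score

docs = {"d1": mle_model("text mining paper text".split()),
        "d2": mle_model("food nutrition paper".split())}
query = "text mining".split()
print(sorted(docs, key=lambda d: log_query_likelihood(query, docs[d]),
             reverse=True))   # d1 ranks first; d2 gets -inf without smoothing
```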

35 How to Estimate p(w|d)? Simplest solution: the maximum likelihood estimator, p(w|d) = relative frequency of word w in d. What if a word doesn't appear in the text? Then p(w|d) = 0. In general, what probability should we give a word that has not been observed? If we want to assign non-zero probabilities to such words, we'll have to discount the probabilities of observed words. This is what smoothing is about.

36 Part 2: The Basic LM Approach (cont.) 1. Introduction 2. The Basic Language Modeling Approach - Query Likelihood Document Ranking - Smoothing of Language Models - Why does it work? - Variants of the basic LM We are here 3. More Advanced Language Models 4. Language Models for Special Retrieval Tasks 5. A General Framework for Applying SLMs to IR 6. Summary ChengXiang Zhai,

37 Language Model Smoothing (Illustration). The maximum likelihood estimate p_ML(w) = (count of w) / (count of all words) assigns zero probability to unseen words; the smoothed LM lowers the probability of seen words and spreads the remaining mass over unseen words (illustrated as two curves of p(w) over words w).

38 How to Smooth? All smoothing methods try to discount the probability of words seen in a document and re-allocate the extra counts so that unseen words will have a non-zero count. Method 1, additive smoothing [Chen & Goodman 98]: add a constant δ to the counts of each word, e.g., add one ("Laplace"): p(w|d) = (c(w,d) + 1) / (|d| + |V|), where c(w,d) is the count of w in d, |d| is the length of d (total counts), and |V| is the vocabulary size.

39 Improve Additive Smoothing. Should all unseen words get equal probabilities? We can use a reference model to discriminate unseen words: p(w|d) = p_DML(w|d) if w is seen in d, and α_d·p(w|REF) otherwise, where p_DML is the discounted ML estimate, p(w|REF) is the reference language model, and the normalizer α_d = (1 - Σ_{w seen} p_DML(w|d)) / (Σ_{w unseen} p(w|REF)) allocates the probability mass for unseen words.

40 Other Smoothing Methods. Method 2, absolute discounting [Ney et al. 94]: subtract a constant δ from the counts of each word: p(w|d) = (max(c(w,d) - δ, 0) + δ·|d|_u·p(w|REF)) / |d|, where |d|_u is the number of unique words in d. Method 3, linear interpolation [Jelinek-Mercer 80]: shrink uniformly toward p(w|REF): p(w|d) = (1 - λ)·c(w,d)/|d| + λ·p(w|REF), where c(w,d)/|d| is the ML estimate and λ is a parameter.

41 Other Smoothing Methods (cont.) Method 4, Dirichlet prior/Bayesian [MacKay & Peto 95, Zhai & Lafferty 01a, Zhai & Lafferty 02]: assume pseudo counts μ·p(w|REF), where μ is a parameter: p(w|d) = (c(w,d) + μ·p(w|REF)) / (|d| + μ) = (|d|/(|d|+μ))·(c(w,d)/|d|) + (μ/(|d|+μ))·p(w|REF). Method 5, Good-Turing [Good 53]: assume the total # of unseen events is n_1 (the # of singletons), and adjust the counts of seen events in the same way: p(w|d) = c*(w,d)/|d|, with c*(w,d) = (c(w,d)+1)·n_{c(w,d)+1}/n_{c(w,d)}, where n_r is the number of words with count r (so 0* = n_1/n_0, 1* = 2·n_2/n_1, ...). What if n_{c(w,d)+1} = 0? And what about p(w|REF)? Heuristics are needed.
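
The interpolation-style smoothers above take only a few lines each. A minimal sketch follows; the parameter defaults and the stand-in reference model are assumptions, not values from the tutorial.

```python
from collections import Counter

# Three smoothers, all of the "discounted ML + alpha_d * p(w|REF)" form.
def jelinek_mercer(counts, doc_len, p_ref, lam=0.1):
    return lambda w: (1 - lam) * counts[w] / doc_len + lam * p_ref(w)

def dirichlet(counts, doc_len, p_ref, mu=2000):
    return lambda w: (counts[w] + mu * p_ref(w)) / (doc_len + mu)

def absolute_discount(counts, doc_len, p_ref, delta=0.7):
    unique = len(counts)   # number of unique words in the document
    return lambda w: (max(counts[w] - delta, 0.0)
                      + delta * unique * p_ref(w)) / doc_len

doc = "text mining text paper".split()
counts, n = Counter(doc), len(doc)
p_ref = lambda w: 1e-4          # stand-in for a collection model
p = dirichlet(counts, n, p_ref)
print(p("text"), p("mining"), p("unseen-word"))  # unseen words now get mass
```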

42 So, which method is the best? It depends on the data and the task! Cross validation is generally used to choose the best method and/or set the smoothing parameters. For retrieval, the Dirichlet prior performs well. Backoff smoothing [Katz 87] doesn't work well due to a lack of second-stage smoothing. Note that many other smoothing methods exist; see [Chen & Goodman 98] and other publications in speech recognition.

43 Comparison of Three Methods [Zhai & Lafferty 01a]. Precision of Jelinek-Mercer, Dirichlet, and absolute discounting is compared for title (short keyword) and long queries (table and bar chart of the relative performance of JM, Dir, and AD omitted); the comparison is performed on a variety of test collections.

44 Part 2: The Basic LM Approach (cont.) 1. Introduction 2. The Basic Language Modeling Approach - Query Likelihood Document Ranking - Smoothing of Language Models - Why does it work? - Variants of the basic LM We are here 3. More Advanced Language Models 4. Language Models for Different Retrieval Tasks 5. A General Framework for Applying SLMs to IR 6. Summary ChengXiang Zhai,

45 Understanding Smoothing. With the general smoothing scheme p(w|d) = p_DML(w|d) if w is seen in d, and α_d·p(w|REF) otherwise (p(w|REF) is the reference language model), the retrieval formula can be rewritten: log p(q|d) = Σ_{w∈V} c(w,q) log p(w|d) = Σ_{w: c(w,d)>0} c(w,q) log p_DML(w|d) + Σ_{w: c(w,d)=0} c(w,q) log α_d·p(w|REF) = Σ_{w: c(w,d)>0} c(w,q) log [p_DML(w|d) / (α_d·p(w|REF))] + |q|·log α_d + Σ_{w: c(w,q)>0} c(w,q) log p(w|REF). This is the key rewriting step; similar rewritings are very common when using LMs for IR.

46 Smoothing & TF-IDF Weighting [Zhai & Lafferty 01a]. Plugging the general smoothing scheme into the query likelihood retrieval formula, we obtain log p(q|d) = Σ_{w: c(w,d)>0} c(w,q) log [p_DML(w|d) / (α_d·p(w|REF))] + |q|·log α_d + Σ_{w: c(w,q)>0} c(w,q) log p(w|REF). The first sum runs only over words in both the query and the doc and acts as TF weighting, with the division by p(w|REF) giving IDF-like weighting; |q|·log α_d provides doc-length normalization (a long doc is expected to have a smaller α_d); and the last term is document-independent, so it can be ignored for ranking. Hence smoothing with p(w|C) implements the traditional retrieval heuristics (TF-IDF + length normalization), and LMs with simple smoothing can be computed as efficiently as traditional retrieval models.
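
A minimal sketch of this rewriting for Dirichlet smoothing follows: only words shared by the query and the document are scored, plus one length term, which is what makes smoothed LMs as cheap to compute from an inverted index as TF-IDF. The per-document count dictionary is an assumed stand-in for real index postings.

```python
import math
from collections import Counter

def dirichlet_score(query, doc_counts, doc_len, p_coll, mu=2000):
    # For Dirichlet smoothing: p_seen = (c(w,d) + mu*p(w|C)) / (|d| + mu),
    # alpha_d = mu / (|d| + mu). Equivalent to log p(q|d) up to a
    # document-independent term sum_w c(w,q)*log p(w|C), which is dropped.
    alpha_d = mu / (doc_len + mu)
    score = len(query) * math.log(alpha_d)   # doc-length normalization term
    for w, qcnt in Counter(query).items():
        dcnt = doc_counts.get(w, 0)
        if dcnt > 0:                         # TF part: only words in q AND d
            p_seen = (dcnt + mu * p_coll(w)) / (doc_len + mu)
            score += qcnt * math.log(p_seen / (alpha_d * p_coll(w)))
    return score

doc = Counter("text mining text paper".split())
print(dirichlet_score("text mining".split(), doc, 4, lambda w: 1e-4))
```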

47 The Dual-Role of Smoothing [Zhai & Lafferty 02]. (Plots of precision vs. amount of smoothing omitted.) For verbose (long) queries, retrieval performance is highly sensitive to the smoothing parameter on both long and short documents; for keyword (short) queries it is much less so. Why does query type affect smoothing sensitivity?

48 Another Reason for Smoothing. Query = "the algorithms for data mining". With the unsmoothed (discounted ML) models, the content-word probabilities satisfy p("algorithms"|d1) = p("algorithms"|d2), p("data"|d1) < p("data"|d2), and p("mining"|d1) < p("mining"|d2), so intuitively d2 should have a higher score; yet p(q|d1) > p(q|d2), because d1 gives higher probability to the common words "the" and "for". So we should make p("the") and p("for") less different for all docs, and smoothing helps achieve this goal: after smoothing with p(w|d) = 0.1·p_DML(w|d) + 0.9·p(w|REF), p(q|d1) < p(q|d2)! (Bar charts of p(w|REF) and the smoothed p(w|d1), p(w|d2) over the query words are omitted.)

49 Two-stage Smoothing [Zhai & Lafferty 02]. Stage 1 explains unseen words with a Dirichlet prior (Bayesian), parameter μ; stage 2 explains noise in the query with a two-component mixture, parameter λ: p(w|d) = (1 - λ)·(c(w,d) + μ·p(w|C)) / (|d| + μ) + λ·p(w|U), where p(w|C) is the collection LM and p(w|U) is the user background model (which can be approximated by p(w|C)).

50 Estimating μ using leave-one-out [Zhai & Lafferty 02]. Leave each word occurrence out in turn (w_1 from d - w_1, w_2 from d - w_2, ...) and compute the log-likelihood of predicting it with the smoothed model estimated from the rest: ℓ_{-1}(μ|C) = Σ_{i=1..N} Σ_{w∈V} c(w,d_i) log[(c(w,d_i) - 1 + μ·p(w|C)) / (|d_i| - 1 + μ)]. The maximum likelihood estimator μ̂ = argmax_μ ℓ_{-1}(μ|C) can be computed with Newton's method.
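
A minimal sketch of this estimator follows; where the slide uses Newton's method, a coarse grid search over candidate μ values is substituted for simplicity, and the toy corpus is an assumption.

```python
import math
from collections import Counter

def leave_one_out_ll(mu, docs, p_coll):
    # l_{-1}(mu|C) = sum_i sum_w c(w,d_i) * log((c(w,d_i)-1+mu*p(w|C)) / (|d_i|-1+mu))
    ll = 0.0
    for doc in docs:
        counts, n = Counter(doc), len(doc)
        for w, c in counts.items():
            ll += c * math.log((c - 1 + mu * p_coll(w)) / (n - 1 + mu))
    return ll

def estimate_mu(docs, candidates=(10, 100, 500, 1000, 2000, 5000)):
    total = sum(len(d) for d in docs)
    coll = Counter(w for d in docs for w in d)
    p_coll = lambda w: coll[w] / total      # collection LM
    return max(candidates, key=lambda mu: leave_one_out_ll(mu, docs, p_coll))

docs = ["text mining text paper".split(), "food nutrition paper diet".split()]
print(estimate_mu(docs))
```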

51 Why would leave-one-out work? Compare 20 words by author1 ("abc abc ab c d d abc cd d d abd ab ab ab ab cd d e cd e") with 20 words by author2 ("abc abc ab c d d abe cb e f acf fb ef aff abef cdc db ge f s"). Suppose we keep sampling and get 10 more words: which author is likely to write more new words? Author2, whose vocabulary is larger. Now suppose we leave "e" out. For author1, the leave-one-out probability p_smooth("e"|author1) = (c("e") - 1 + μ·p("e"|REF)) / (20 - 1 + μ) stays reasonable without much help from the reference model: μ doesn't have to be big. For author2, most words are singletons, so the held-out word can only be explained through μ·p(w|REF): μ must be big (more smoothing). The amount of smoothing is closely related to the underlying vocabulary size.

52 Estimating λ using Mixture Model [Zhai & Lafferty 02]. Stage 1: smooth each doc LM with the μ̂ estimated in stage 1: p(q_j|d_i) = (c(q_j,d_i) + μ̂·p(q_j|C)) / (|d_i| + μ̂). Stage 2: model the query Q = q_1...q_m with a two-component mixture over the N docs: p(Q|λ,U) = Π_{i=1..N} π_i Π_{j=1..m} [(1 - λ)·p(q_j|d_i) + λ·p(q_j|U)], and take the maximum likelihood estimate λ̂ = argmax_λ p(Q|λ,U), computed with the Expectation-Maximization (EM) algorithm.

53 Automatic 2-stage results ≈ optimal 1-stage results [Zhai & Lafferty 02]. Average precision (3 DBs + 4 query types, 150 topics); * indicates a significant difference:
Collection  Query  Optimal-JM  Optimal-Dir  Auto-2stage
AP88-89     SK     20.3%       23.0%        22.2%*
            LK     36.8%       37.6%        37.4%
            SV     18.8%       20.9%        20.4%
            LV     28.8%       29.8%        29.2%
WSJ87-92    SK     19.4%       22.3%        21.8%*
            LK     34.8%       35.3%        35.8%
            SV     17.2%       19.6%        19.9%
            LV     27.7%       28.2%        28.8%*
ZIFF1-2     SK     17.9%       21.5%        20.0%
            LK     32.6%       32.6%        32.2%
            SV     15.6%       18.5%        18.1%
            LV     26.7%       27.9%        27.9%*
Completely automatic tuning of parameters IS POSSIBLE!

54 The Notion of Relevance. Relevance can be formalized in several ways. (1) Similarity, D(Rep(q), Rep(d)), with different representations & similarity measures: the vector space model (Salton et al., 75), regression model (Fox 83), prob. distr. model (Wong & Yao, 89). (2) Probability of relevance, P(r=1|q,d), r ∈ {0,1}, with generative models: via doc generation, the classical prob. model (Robertson & Sparck Jones, 76), along whose lines LMs are later used too; via query generation, the basic LM approach (Ponte & Croft, 98), which is how LMs are initially applied to IR. (3) Probabilistic inference, P(d→q) or P(q→d), with different inference systems: the prob. concept space model (Wong & Yao, 95), inference network model (Turtle & Croft, 91).

55 Justification of Query Likelihood [Lafferty & Zhai 01a]. The general probabilistic retrieval model: define P(Q,D|R), compute P(R|Q,D) using Bayes' rule, and rank documents by the odds O(R=1|Q,D) = [P(Q,D|R=1) / P(Q,D|R=0)] · [P(R=1) / P(R=0)], where the second factor is ignored for ranking D. Special cases: document generation, P(Q,D|R) = P(D|Q,R)·P(Q|R), which leads to the classic Robertson-Sparck Jones model; and query generation, P(Q,D|R) = P(Q|D,R)·P(D|R), which leads to the query likelihood LM approach.

56 Query Generation [Lafferty & Zhai 01a]. O(R=1|Q,D) ∝ P(Q,D|R=1)/P(Q,D|R=0) = [P(Q|D,R=1)/P(Q|D,R=0)] · [P(D|R=1)/P(D|R=0)] ∝ P(Q|D,R=1) · [P(D|R=1)/P(D|R=0)] (assuming P(Q|D,R=0) ≈ P(Q|R=0)): the query likelihood p(q|θ_d) times a document prior. Assuming a uniform prior, we have O(R=1|Q,D) ∝ P(Q|D,R=1). Computing P(Q|D,R=1) generally involves two steps: (1) estimate a language model based on D; (2) compute the query likelihood according to the estimated model. P(Q|D) = P(Q|D,R=1) is the probability that a user who likes D would pose query Q: a relevance-based interpretation of the so-called "document language model".

57 Part 2: The Basic LM Approach (cont.) 1. Introduction 2. The Basic Language Modeling Approach - Query Likelihood Document Ranking - Smoothing of Language Models - Why does it work? - Variants of the basic LM We are here 3. More Advanced Language Models 4. Language Models for Special Retrieval Tasks 5. A General Framework for Applying SLMs to IR 6. Summary ChengXiang Zhai,

58 Variants of the Basic LM Approach Different smoothing strategies Hidden Markov Models (essentially linear interpolation) [Miller et al. 99] Smoothing with an IDF-like reference model [Hiemstra & Kraaij 99] Performance tends to be similar to the basic LM approach Many other possibilities for smoothing [Chen & Goodman 98] Different priors Link information as prior leads to significant improvement of Web entry page retrieval performance [Kraaij et al. 02] Time as prior [Li & Croft 03] PageRank as prior [Kurland & Lee 05] Passage retrieval [Liu & Croft 02] ChengXiang Zhai,

59 Part 3: More Advanced LMs 1. Introduction 2. The Basic Language Modeling Approach 3. More Advanced Language Models - Improving the basic LM approach We are here - Feedback and alternative ways of using LMs 4. Language Models for Special Retrieval Tasks 5. A General Framework for Applying SLMs to IR 6. Summary ChengXiang Zhai,

60 Improving the Basic LM Approach. Capturing limited dependencies: bigrams/trigrams [Song & Croft 99], grammatical dependency [Nallapati & Allan 02, Srikanth & Srihari 03, Gao et al. 04]; generally insignificant improvement compared with other extensions such as feedback. Full Bayesian query likelihood [Zaragoza et al. 03]: performance similar to the basic LM approach. Translation model for p(q|D,R) [Berger & Lafferty 99, Jin et al. 02, Cao et al. 05]: addresses polysemy and synonyms; improves over the basic LM methods, but computationally expensive. Cluster-based smoothing/scoring [Liu & Croft 04, Kurland & Lee 04, Tao et al. 06]: improves over the basic LM, but computationally expensive. Parsimonious LMs [Hiemstra et al. 04]: using a mixture model to factor out non-discriminative words.

61 Translation Models. Directly model the translation relationship between words in the query and words in a doc: p(q|D,R) = Π_{i=1..m} Σ_{w_j∈V} p_t(q_i|w_j)·p(w_j|D), where p_t is the translation model and p(w_j|D) is the regular doc LM. When relevance judgments are available, (q,d) pairs serve as data to train the translation model; without relevance judgments, we can use synthetic data [Berger & Lafferty 99], <title, body> pairs [Jin et al. 02], or thesauri [Cao et al. 05].
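
A minimal sketch of this scoring rule follows; the tiny translation table p_t(q|w) is an invented toy, standing in for a table trained as described above, and it illustrates how a document can match a query word it never contains.

```python
import math

# Toy translation table p_t(query_word | doc_word); assumed, not trained.
p_trans = {("car", "automobile"): 0.3, ("car", "car"): 0.7,
           ("automobile", "automobile"): 0.7, ("automobile", "car"): 0.3}

def translation_log_likelihood(query, p_w_given_d, doc_vocab):
    # log p(q|D,R) = sum_i log sum_w p_t(q_i|w) * p(w|D)
    score = 0.0
    for q in query:
        p_q = sum(p_trans.get((q, w), 0.0) * p_w_given_d(w) for w in doc_vocab)
        if p_q == 0.0:
            return float("-inf")
        score += math.log(p_q)
    return score

# A document that only says "automobile" can still match the query "car":
doc_model = lambda w: 1.0 if w == "automobile" else 0.0
print(translation_log_likelihood(["car"], doc_model, ["automobile"]))
```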

62 Cluster-based Smoothing/Scoring. Cluster-based smoothing: smooth a document LM with a cluster of similar documents [Liu & Croft 04]; improves over the basic LM, but insignificantly. Document expansion smoothing: smooth a document LM with the neighboring documents (essentially one cluster per document) [Tao et al. 06]; improves over the basic LM more significantly. Cluster-based query likelihood: similar to the translation model, but translate the whole document to the query through a set of clusters [Kurland & Lee 04]: p(q|D,R) = Σ_{C∈Clusters} p(q|C)·p(C|D), i.e., the likelihood of q given cluster C, weighted by how likely doc D belongs to C; only effective when interpolated with the basic LM scores.

63 Part 3: More Advanced LMs (cont.) 1. Introduction 2. The Basic Language Modeling Approach 3. More Advanced Language Models We are - Improving the basic LM approach here - Feedback and Alternative ways of using LMs 4. Language Models for Special Retrieval Tasks 5. A General Framework for Applying SLMs to IR 6. Summary ChengXiang Zhai,

64 Feedback and Doc/Query Generation. The classic prob. model ranks by O(R=1|Q,D) ∝ P(D|Q,R=1)/P(D|Q,R=0), using a relevant-doc model and a nonrelevant-doc model; query likelihood ("language model") ranks by O(R=1|Q,D) ∝ P(Q|D,R=1), a relevant-query model. Parameters are estimated from judged tuples such as (q1,d1,1), (q1,d2,1), (q1,d3,1), (q1,d4,0), (q1,d5,0), (q3,d1,1), (q4,d1,1), (q5,d1,1), (q6,d2,1), (q6,d3,0), yielding P(D|Q,R=1), P(D|Q,R=0), and P(Q|D,R=1). Initial retrieval: "query as rel doc" vs. "doc as rel query"; P(Q|D,R=1) is more accurate. Feedback: P(D|Q,R=1) can be improved for the current query and future docs (query-based feedback); P(Q|D,R=1) can also be improved, but for the current doc and future queries (doc-based feedback).

65 Difficulty in Feedback with Query Likelihood. Traditional query expansion [Ponte 98, Miller et al. 99, Ng 99]: improvement is reported, but there is a conceptual inconsistency; what is an expanded query, a piece of text or a set of terms? Approaches that avoid expansion achieve only limited feedback: query term reweighting [Hiemstra 01, Hiemstra 02], translation models [Berger & Lafferty 99, Jin et al. 02], and doing relevant query expansion instead [Nallapati et al. 03]. The difficulty is due to the lack of a query/relevance model, and it can be overcome with alternative ways of using LMs for retrieval (e.g., the relevance model [Lavrenko & Croft 01] and query model estimation [Lafferty & Zhai 01b; Zhai & Lafferty 01b]).

66 Two Alternative Ways of Using LMs. (1) Classic probabilistic model (doc generation as opposed to query generation): O(R=1|Q,D) ∝ P(D|Q,R=1)/P(D|Q,R=0) ≈ P(D|Q,R=1)/P(D). Natural for relevance feedback; the challenge is to estimate p(D|Q,R=1) without relevance feedback, and the relevance model [Lavrenko & Croft 01] provides a good solution. (2) Probabilistic distance model (similar to the vector-space model, but with LMs as opposed to TF-IDF weight vectors): a popular distance function is the Kullback-Leibler (KL) divergence, covering query likelihood as a special case: score(Q,D) = -D(θ_Q || θ_D), essentially Σ_{w∈V} p(w|θ_Q) log p(w|θ_D) for ranking purposes. Retrieval is now to estimate query & doc models, and feedback is treated as query LM updating [Lafferty & Zhai 01b; Zhai & Lafferty 01b]. Both methods outperform the basic LM significantly.
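
A minimal sketch of KL-divergence scoring follows: dropping the document-independent query-entropy term, ranking by -D(θ_Q || θ_D) reduces to the cross-entropy sum below. The toy models are assumptions.

```python
import math

def kl_score(query_model, doc_model):
    # Rank-equivalent to -D(theta_Q || theta_D): sum_w p(w|theta_Q)*log p(w|theta_D).
    # query_model: {w: p(w|theta_Q)}; doc_model must be smoothed (nonzero).
    return sum(pq * math.log(doc_model(w)) for w, pq in query_model.items())

query_model = {"airport": 0.6, "security": 0.4}   # e.g., an enriched query LM
doc_model = lambda w: {"airport": 0.05, "security": 0.02}.get(w, 1e-4)
print(kl_score(query_model, doc_model))
# With the ML query model (relative query-word frequencies), this recovers
# query likelihood as a special case.
```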

67 Relevance Model Estimation [Lavrenko & Croft 01]. Question: how to estimate P(D|Q,R) (or p(w|Q,R)) without relevant documents? Key idea: treat the query as observations about p(w|Q,R) and approximate the model space with document models. Two methods for decomposing p(w,q_1...q_m): (1) Independent (i.i.d.) sampling (Bayesian model averaging): p(w|Q,R) = Σ_{θ_D} p(w|θ_D)·p(θ_D|Q,R) ∝ ∫_Θ p(w|θ_D)·p(θ_D|R)·p(Q|θ_D) dθ_D ≈ Σ_{D∈C} p(w|θ_D)·p(θ_D|R)·Π_{j=1..m} p(q_j|θ_D). (2) Conditional sampling, p(w,Q) = p(w)·p(Q|w): p(w,q_1...q_m) = p(w)·Π_{i=1..m} Σ_{D∈C} p(q_i|D)·p(D|w), with p(D|w) = p(w|D)·p(D)/p(w) (the original formulation in [Lavrenko & Croft 01]).
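
A minimal sketch of the i.i.d.-sampling estimate follows, with a uniform prior over document models assumed and the sum over Θ restricted to the supplied (in practice, top-ranked) document models.

```python
import math

def relevance_model(query, doc_models, vocab):
    # doc_models: list of functions w -> p(w|theta_D), smoothed (nonzero).
    # Weight each document model by its query likelihood p(Q|theta_D).
    weights = [math.exp(sum(math.log(dm(q)) for q in query))
               for dm in doc_models]
    z = sum(weights)
    # p(w|Q,R) ~ weighted average of the document models; the result can
    # serve as the query model in the KL-divergence sketch above.
    return {w: sum(wt * dm(w) for wt, dm in zip(weights, doc_models)) / z
            for w in vocab}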

68 Kernel-based Allocation [Lavrenko 04]. A general generative model for text: p(w_1...w_n) = ∫_Θ Π_{i=1..n} p(w_i|θ)·p(θ) dθ, with a kernel-based density p(θ) = (1/N) Σ_{w̄∈T} K_w̄(θ) over the training data T, where the kernel function K_w̄(θ) reflects similarity(w̄, θ). Choices of the kernel: a delta kernel gives p(w_1...w_n) = (1/N) Σ_{w̄∈T} Π_{i=1..n} p(w_i|w̄), the average probability of w_1...w_n over all training points (an infinite mixture model); a Dirichlet kernel allows a training point to spread its influence.

69 Query Model Estimation [Lafferty & Zhai 01b, Zhai & Lafferty 01b]. Question: how to estimate a better query model than the ML estimate based on the original query? "Massive feedback": improve a query model through co-occurrence patterns learned from a document-term Markov chain that outputs the query [Lafferty & Zhai 01b], or from thesauri and the corpus [Bai et al. 05, Collins-Thompson & Callan 05]. Model-based feedback: improve the estimate of the query model by exploiting pseudo-relevance feedback, either by interpolating the original query model with a learned feedback model [Zhai & Lafferty 01b] or by estimating a more integrated mixture model using pseudo-feedback documents [Tao & Zhai 06].

70 Feedback as Model Interpolation [Zhai & Lafferty 01b]. Documents D are ranked against the query Q by D(θ_Q' || θ_D) over the results, where the query model is updated with a feedback model θ_F estimated from the feedback docs F = {d_1, d_2, ..., d_n} (via a generative model or divergence minimization): θ_Q' = (1 - α)·θ_Q + α·θ_F. α = 0: no feedback (θ_Q' = θ_Q); α = 1: full feedback (θ_Q' = θ_F).

71 θ_F Estimation Method I: Generative Mixture Model. Each word in F = {D_1, ..., D_n} is generated either from the background p(w|C) (with probability λ) or from the topic model p(w|θ): log p(F|θ) = Σ_{D∈F} Σ_w c(w;D) log[(1 - λ)·p(w|θ) + λ·p(w|C)]; maximum likelihood gives θ_F = argmax_θ log p(F|θ). The learned topic model is called a "parsimonious language model" in [Hiemstra et al. 04].
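
The EM updates for this mixture are simple enough to sketch; the toy feedback documents, the stand-in background model, and the iteration count are all assumptions.

```python
from collections import Counter

def estimate_feedback_model(feedback_docs, p_coll, lam=0.9, iters=30):
    # EM for: log p(F|theta) = sum_w c(w;F) * log((1-lam)*p(w|theta) + lam*p(w|C))
    counts = Counter(w for d in feedback_docs for w in d)
    total = sum(counts.values())
    p_topic = {w: c / total for w, c in counts.items()}   # init with ML estimate
    for _ in range(iters):
        # E-step: probability each occurrence of w came from the topic model
        t = {w: (1 - lam) * p_topic[w] /
                ((1 - lam) * p_topic[w] + lam * p_coll(w)) for w in counts}
        # M-step: re-estimate the topic model from topic-attributed counts
        norm = sum(counts[w] * t[w] for w in counts)
        p_topic = {w: counts[w] * t[w] / norm for w in counts}
    return p_topic   # interpolate with the original query model (slide 70)

docs = ["airport security bomb the the of".split(),
        "airport terrorist attack of the".split()]
p_coll = lambda w: 0.2 if w in ("the", "of", "to") else 0.001
top = sorted(estimate_feedback_model(docs, p_coll).items(),
             key=lambda kv: -kv[1])[:5]
print(top)   # content words rise; "the"/"of" are absorbed by the background
```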

72 θ_F Estimation Method II: Empirical Divergence Minimization. Make θ close to the feedback doc models θ_{D_1}, ..., θ_{D_n} of F = {D_1, ..., D_n} and far from the background model θ_C: D_λ(θ; F, C) = (1/n) Σ_{j=1..n} D(θ || θ_{D_j}) - λ·D(θ || θ_C); divergence minimization gives θ_F = argmin_θ D_λ(θ; F, C).

73 Example of Feedback Query Model (TREC topic 412: "airport security"; mixture model approach, Web database, top 10 docs). With λ = 0.9, θ_F is dominated by content words: security, airport, beverage, alcohol, bomb, terrorist, author, license, bond, counter-terror, terror, newsnet, attack, operation, headline, ... With λ = 0.7, common words creep in: the, security, airport, beverage, alcohol, to, of, and, author, bomb, terrorist, in, license, state, by, ... (the p(w|θ_F) values are omitted here).

74 Model-based feedback improves over the simple LM [Zhai & Lafferty 01b] (absolute scores omitted; improvements are relative to the simple LM):
Collection  Metric  Simple LM   Mixture    Improv.  Div.Min.   Improv.
AP88-89     AvgPr   -           -          +41%     -          +40%
            InitPr  -           -          -4%      -          +0%
            Recall  3067/4805   -/4805     +27%     3665/4805  +19%
TREC8       AvgPr   -           -          +10%     -          +5%
            InitPr  -           -          -3%      -          -3%
            Recall  2853/4728   -/4728     +11%     3129/4728  +10%
WEB         AvgPr   -           -          +9%      -          +11%
            InitPr  -           -          -1%      -          -2%
            Recall  1755/2279   -/2279     +0%      1798/2279  +2%
Translation models, relevance models, and feedback-based query models have all been shown to improve performance significantly over the simple LMs. (Parameter tuning is necessary in many cases, but see [Tao & Zhai 06] for parameter-free pseudo feedback.)

75 Part 4: LMs for Special Retrieval Tasks 1. Introduction 2. The Basic Language Modeling Approach 3. More Advanced Language Models 4. Language Models for Special Retrieval Tasks - Cross-lingual IR - Distributed IR - Structured document retrieval - Personalized/context-sensitive search - Modeling redundancy - Predicting query difficulty - Subtopic retrieval 5. A General Framework for Applying SLMs to IR 6. Summary We are here ChengXiang Zhai,

76 Cross-Lingual IR. Use a query in language A (e.g., English) to retrieve documents in language B (e.g., Chinese). Cross-lingual p(q|D,R) [Xu et al. 01]: p(q|D,R) = Π_{i=1..m} [α·p(q_i|REF_English) + (1 - α)·Σ_{c∈V_Chinese} p(c|D)·p_trans(q_i|c)], where p_trans is estimated with a bilingual lexicon or parallel corpora. Cross-lingual p(D|Q,R) [Lavrenko et al. 02]: p(c|q_1...q_m) ∝ p(c, q_1...q_m) for a Chinese word c. Method 1: p(c,q_1...q_m) = Σ_{(M_E,M_C)} p(M_E,M_C)·p(c|M_C)·Π_{i=1..m} p(q_i|M_E), over pairs of English/Chinese models estimated from parallel corpora. Method 2: p(c,q_1...q_m) = Σ_{M_C} p(M_C)·p(c|M_C)·Π_{i=1..m} p(q_i|M_C), with p(q_i|M_C) = Σ_{c∈V_C} p_trans(q_i|c)·p(c|M_C) (a translation model).

77 Distributed IR. Retrieve documents from multiple collections; the task is generally decomposed into two subtasks, collection selection and result fusion. Using LMs for collection selection [Xu & Croft 99, Si et al. 02]: treat collection selection as retrieving collections (as opposed to documents), estimating each collection model by maximum likelihood [Si et al. 02] or clustering [Xu & Croft 99]. Using LMs for result fusion [Si et al. 02]: assume query likelihood scoring for all collections, but on each collection a distinct reference LM is used for smoothing; adjust the biased score p(q|D, Collection) to recover the fair score p(q|D).

78 Structured Document Retrieval [Ogilvie & Callan 03]. A document D has parts D_1, ..., D_k (title, abstract, body parts, ...); to generate a query word, select a part D_j and generate the word from it: for Q = q_1 q_2 ... q_m, p(Q|D,R=1) = Π_{i=1..m} p(q_i|D,R=1) = Π_{i=1..m} Σ_{j=1..k} s(D_j|D,R=1)·p(q_i|D_j,R=1), where the part-selection probability s(D_j|D,R=1) serves as the weight for D_j and can be trained using EM. We want to combine different parts of a document with appropriate weights; anchor text can be treated as a part of a document; applicable to XML retrieval.
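
A minimal sketch of this part-mixture scoring follows; fixed part weights are assumed here, whereas the slide notes that s(D_j|D,R=1) can be trained with EM. The toy part models are assumptions.

```python
import math

def structured_log_likelihood(query, part_models, part_weights):
    # log p(Q|D) = sum_i log sum_j s(D_j) * p(q_i|D_j)
    # part_models: {part: function w -> p(w|D_j)}; part_weights sum to 1.
    score = 0.0
    for q in query:
        p_q = sum(part_weights[j] * part_models[j](q) for j in part_models)
        score += math.log(p_q)
    return score

parts = {"title": lambda w: {"language": 0.3, "models": 0.3}.get(w, 1e-4),
         "body":  lambda w: {"language": 0.05, "retrieval": 0.04}.get(w, 1e-4)}
weights = {"title": 0.4, "body": 0.6}
print(structured_log_likelihood(["language", "models"], parts, weights))
```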

79 Personalized/Context-Sensitive Search [Shen et al. 05, Tan et al. 06]. User information and search context can be used to estimate a better query model. Context-independent query LM: θ̂_Q = argmax_θ p(θ|Query, Collection). Context-sensitive query LM: θ̂_Q = argmax_θ p(θ|Query, User, SearchContext, Collection). Refinement of this model leads to specific retrieval formulas; simple models often end up interpolating many unigram language models based on different sources of evidence, e.g., short-term search history [Shen et al. 05] or long-term search history [Tan et al. 06].

80 Modeling Redundancy. Given two documents D1 and D2, decide how redundant D1 (or D2) is w.r.t. D2 (or D1): the redundancy of D1 is the extent to which D1 can be explained by a model estimated based on D2. Use a unigram mixture model [Zhai 02]: log p(D1|λ, θ_{D2}) = Σ_{w∈V} c(w,D1) log[λ·p(w|θ_{D2}) + (1 - λ)·p(w|REF)], where θ_{D2} is the LM of D2 and p(w|REF) is a reference LM; the ML estimate λ* = argmax_λ log p(D1|λ, θ_{D2}), computed with the EM algorithm, serves as the measure of redundancy. See [Zhang et al. 02] for a 3-component redundancy model. Along a similar line, we could measure document similarity in an asymmetric way [Kurland & Lee 05].

81 Predicting Query Difficulty [Cronen-Townsend et al. 02]. Observations: discriminative queries tend to be easier, and comparing the query model with the collection model can indicate how discriminative a query is. Method: define query clarity as the KL divergence between an estimated query model (or relevance model) and the collection LM: clarity(Q) = Σ_w p(w|θ_Q) log [p(w|θ_Q) / p(w|Collection)]. An enriched query LM can be estimated by exploiting pseudo feedback (e.g., the relevance model). Correlation between the clarity scores and retrieval performance is found.
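
A minimal sketch of the clarity score follows; the toy query and collection models are assumptions, chosen so a focused query visibly scores higher than a vague one.

```python
import math

def clarity(query_model, p_coll):
    # clarity(Q) = sum_w p(w|theta_Q) * log(p(w|theta_Q) / p(w|Collection))
    return sum(p * math.log(p / p_coll(w))
               for w, p in query_model.items() if p > 0)

p_coll = lambda w: {"the": 0.05, "programming": 0.001}.get(w, 0.0005)
vague   = {"the": 0.5, "thing": 0.5}
focused = {"programming": 0.7, "languages": 0.3}
print(clarity(vague, p_coll), clarity(focused, p_coll))
# The focused (more discriminative) query gets the higher clarity score.
```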

82 Subtopic Retrieval [Zhai 02, Zhai et al 03] Subtopic retrieval: Aim at retrieving as many distinct subtopics of the query topic as possible E.g., retrieve different applications of robotics Need to go beyond independent relevance Two methods explored in [Zhai 02] Maximal Marginal Relevance: Maximizing subtopic coverage indirectly through redundancy elimination LMs can be used to model redundancy Maximal Diverse Relevance: Maximizing subtopic coverage directly through subtopic modeling Define a retrieval function based on subtopic representation of query and documents Mixture LMs can be used to model subtopics (essentially clustering) ChengXiang Zhai,

83 Unigram Mixture Models. Each subtopic is modeled with one unigram LM, and a document is treated as observations from a mixture model involving many subtopic LMs. Two different sampling strategies to generate a document: Strategy 1 (document clustering): choose a subtopic model and generate all the words in the document using the same model. Strategy 2 (aspect models [Hofmann 99; Blei et al. 02]): use a (potentially) different subtopic model when generating each word in a document, so two words in a document may be generated using different LMs. For subtopic retrieval, we assume a document may have multiple subtopics, so strategy 2 is more appropriate; it has many other applications as well.

84 Aspect Models. With subtopic LMs p(w|τ_1), ..., p(w|τ_k) and mixing weights λ_1, ..., λ_k, a word of document D = d_1...d_n is generated by p(w|τ_1..τ_k, λ_1..λ_k) = Σ_{i=1..k} λ_i·p(w|τ_i). Probabilistic LSI [Hofmann 99]: each document D has its own set of λ's: p(D|τ_1..τ_k, λ_1^D..λ_k^D) = Π_{i=1..n} Σ_{a=1..k} λ_a^D·p(d_i|τ_a); a flexible aspect distribution, but it needs regularization. Latent Dirichlet Allocation [Blei et al. 02, Lafferty & Minka 03]: the λ's are drawn from a common Dirichlet distribution: p(D|τ_1..τ_k, α_1..α_k) = ∫_Λ Π_{i=1..n} [Σ_{a=1..k} λ_a·p(d_i|τ_a)] Dir(λ|α_1..α_k) dλ; λ is now regularized.

85 Part 5: A General Framework for Applying SLMs to IR 1. Introduction 2. The Basic Language Modeling Approach 3. More Advanced Language Models 4. Language Models for Special Retrieval Tasks 5. A General Framework for Applying SLMs to IR - Risk minimization framework - Special cases We are here 6. Summary ChengXiang Zhai,

86 Risk Minimization: Motivation Long-standing IR Challenges Improve IR theory Develop theoretically sound and empirically effective models Go beyond the limited traditional notion of relevance (independent, topical relevance) Improve IR practice Optimize retrieval parameters automatically SLMs are very promising tools How can we systematically exploit SLMs in IR? Can SLMs offer anything hard/impossible to achieve in traditional IR? ChengXiang Zhai,

87 Idea 1: Retrieval as Decision-Making (a more general notion of relevance). Given a query, which documents should be selected (D), and how should these docs be presented to the user (π)? Choose (D,π): an unordered subset? a ranked list? a clustering?

88 Idea 2: Systematic Language Modeling. Query modeling turns the query into a query language model; doc modeling turns the documents into document language models; user modeling determines the loss function; together they drive the retrieval decision.

89 Generative Model of Document & Query [Lafferty & Zhai 01b]. A user U generates a query model θ_Q with p(θ_Q|U) and then a query q with p(q|θ_Q,U); a source S generates a document model θ_D with p(θ_D|S) and then a document d with p(d|θ_D,S); relevance links the two models through p(R|θ_Q,θ_D). The query and document are observed, θ_Q is partially observed, and θ_D and R are inferred.

90 Applying Bayesian Decision Theory [Lafferty & Zhai 01b, Zhai 02, Zhai & Lafferty 06]. Given the observed query q (from user U) and doc set C (from source S), each possible choice (D_1,π_1), (D_2,π_2), ..., (D_n,π_n) incurs a loss L(D,π,θ) that depends on the hidden models θ; the optimal choice minimizes the Bayes risk: (D*, π*) = argmin_{D,π} ∫_Θ L(D, π, θ)·p(θ|q, U, C, S) dθ.

91 Special Cases. Set-based models (choose D): e.g., the Boolean model. Ranking models (choose π), with independent loss: relevance-based loss leads to the probabilistic relevance model, generative relevance theory, and the two-stage LM; distance-based loss leads to the vector-space model and the KL-divergence model. With dependent loss: MMR loss and MDR loss lead to the subtopic retrieval model.

92 Optimal Ranking for Independent Loss. Decision space = {rankings}: π* = argmin_π ∫_Θ L(π,θ)·p(θ|q,U,C,S) dθ. Under sequential browsing, the loss decomposes over ranks, with a (stopping-probability) weight s_j for the document at rank j: L(π,θ) = Σ_{j=1..N} s_j·l(θ_{π_j}); if the loss l is independent across documents, the Bayes risk decomposes the same way (independent loss implies independent risk implies independent scoring), so the optimal π* ranks documents by r(d|q,U,C,S) = ∫_Θ l(θ_d)·p(θ_d|q,U,C,S) dθ_d: the risk ranking principle [Zhai 02].

93 Automatic Parameter Tuning Retrieval parameters are needed to model different user preferences customize a retrieval model to specific queries and documents Retrieval parameters in traditional models EXTERNAL to the model, hard to interpret Parameters are introduced heuristically to implement intuition No principles to quantify them, must set empirically through many experiments Still no guarantee for new queries/documents Language models make it possible to estimate parameters ChengXiang Zhai,

94 Parameter Setting in Risk Minimization Estimate Query model parameters Query Query Language Model User model parameters Set Estimate Doc model parameters Loss Function User Documents Document Language Models ChengXiang Zhai,

95 Generative Relevance Hypothesis [Lavrenko 04]. Generative relevance hypothesis: for a given information need, queries expressing that need and documents relevant to that need can be viewed as independent random samples from the same underlying generative model. This is a special case of risk minimization in which document models and query models are in the same space. Implications for retrieval models: the shared generative model makes it possible to match queries and documents even if they are in different languages or media, and to estimate/improve a relevant document model based on example queries, or vice versa.

96 Risk Minimization: Summary Risk minimization is a general probabilistic retrieval framework Retrieval as a decision problem (=risk min.) Separate/flexible language models for queries and docs Advantages A unified framework for existing models Automatic parameter tuning due to LMs Allows for modeling complex retrieval tasks Lots of potential for exploring LMs For more information, see [Zhai 02] ChengXiang Zhai,

97 Part 6: Summary 1. Introduction 2. The Basic Language Modeling Approach 3. More Advanced Language Models 4. Language Models for Special Retrieval Tasks 5. A General Framework for Applying SLMs to IR 6. Summary We are here SLMs vs. traditional methods: Pros & Cons What we have achieved so far Challenges and future directions ChengXiang Zhai,

98 SLMs vs. Traditional IR: Pros & Cons. Pros: statistical foundations (better parameter setting); a more principled way of handling term weighting; more powerful for modeling subtopics, passages, etc.; leverage LMs developed in related areas; empirically as effective as well-tuned traditional models, with potential for automatic parameter tuning. Cons: lack of discrimination (a common problem with generative models); less robust in some cases (e.g., when queries are semi-structured); computationally complex; empirically, performance appears to be inferior to well-tuned full-fledged traditional methods (at least, no evidence for beating them).

99 What We Have Achieved So Far Framework and justification for using LMs for IR Several effective models are developed Basic LM with Dirichlet prior smoothing is a reasonable baseline Basic LM with informative priors often improves performance Translation model handles polysemy & synonyms Relevance model incorporates LMs into the classic probabilistic IR model KL-divergence model ties feedback with query model estimation Mixture models can model redundancy and subtopics Completely automatic tuning of parameters is possible LMs can be applied to virtually any retrieval task with great potential for modeling complex IR problems ChengXiang Zhai,

100 Challenges and Future Directions Challenge 1: Establish a robust and effective LM that Optimizes retrieval parameters automatically Performs as well as or better than well-tuned traditional retrieval methods with pseudo feedback Is as efficient as traditional retrieval methods Can LMs consistently (convincingly) outperform traditional methods without sacrificing efficiency? Challenge 2: Demonstrate consistent and substantial improvement by going beyond unigram LMs Model limited dependency between terms Derive more principled weighting methods for phrases Can we do much better by going beyond unigram LMs? ChengXiang Zhai,

101 Challenges and Future Directions (cont.) Challenge 3: Develop LMs that can support life-time learning Develop LMs that can improve accuracy for a current query through learning from past relevance judgments Support collaborative information retrieval How can we learn effectively from past relevance judgments? Challenge 4: Develop LMs that can model document structures and subtopics Recognize query-specific boundaries of relevant passages Passage-based/subtopic-based feedback Combine different structural components of a document How can we break the document unit in a principled way? ChengXiang Zhai,


More information

WebSci and Learning to Rank for IR

WebSci and Learning to Rank for IR WebSci and Learning to Rank for IR Ernesto Diaz-Aviles L3S Research Center. Hannover, Germany diaz@l3s.de Ernesto Diaz-Aviles www.l3s.de 1/16 Motivation: Information Explosion Ernesto Diaz-Aviles

More information

The Opposite of Smoothing: A Language Model Approach to Ranking Query-Specific Document Clusters

The Opposite of Smoothing: A Language Model Approach to Ranking Query-Specific Document Clusters Journal of Artificial Intelligence Research 41 (2011) 367 395 Submitted 03/2011; published 07/2011 The Opposite of Smoothing: A Language Model Approach to Ranking Query-Specific Document Clusters Oren

More information

Document indexing, similarities and retrieval in large scale text collections

Document indexing, similarities and retrieval in large scale text collections Document indexing, similarities and retrieval in large scale text collections Eric Gaussier Univ. Grenoble Alpes - LIG Eric.Gaussier@imag.fr Eric Gaussier Document indexing, similarities & retrieval 1

More information

TREC-10 Web Track Experiments at MSRA

TREC-10 Web Track Experiments at MSRA TREC-10 Web Track Experiments at MSRA Jianfeng Gao*, Guihong Cao #, Hongzhao He #, Min Zhang ##, Jian-Yun Nie**, Stephen Walker*, Stephen Robertson* * Microsoft Research, {jfgao,sw,ser}@microsoft.com **

More information

CS47300: Web Information Search and Management

CS47300: Web Information Search and Management CS47300: Web Information Search and Management Federated Search Prof. Chris Clifton 13 November 2017 Federated Search Outline Introduction to federated search Main research problems Resource Representation

More information

Using Maximum Entropy for Automatic Image Annotation

Using Maximum Entropy for Automatic Image Annotation Using Maximum Entropy for Automatic Image Annotation Jiwoon Jeon and R. Manmatha Center for Intelligent Information Retrieval Computer Science Department University of Massachusetts Amherst Amherst, MA-01003.

More information

A Markov Random Field Model for Term Dependencies

A Markov Random Field Model for Term Dependencies A Markov Random Field Model for Term Dependencies Donald Metzler metzler@cs.umass.edu Center for Intelligent Information Retrieval Department of Computer Science University of Massachusetts Amherst, MA

More information

Indri at TREC 2005: Terabyte Track (Notebook Version)

Indri at TREC 2005: Terabyte Track (Notebook Version) Indri at TREC 2005: Terabyte Track (Notebook Version) Donald Metzler, Trevor Strohman, Yun Zhou, W. B. Croft Center for Intelligent Information Retrieval University of Massachusetts, Amherst Abstract This

More information

IN4325 Query refinement. Claudia Hauff (WIS, TU Delft)

IN4325 Query refinement. Claudia Hauff (WIS, TU Delft) IN4325 Query refinement Claudia Hauff (WIS, TU Delft) The big picture Information need Topic the user wants to know more about The essence of IR Query Translation of need into an input for the search engine

More information

Information Retrieval. (M&S Ch 15)

Information Retrieval. (M&S Ch 15) Information Retrieval (M&S Ch 15) 1 Retrieval Models A retrieval model specifies the details of: Document representation Query representation Retrieval function Determines a notion of relevance. Notion

More information

Relating the new language models of information retrieval to the traditional retrieval models

Relating the new language models of information retrieval to the traditional retrieval models Relating the new language models of information retrieval to the traditional retrieval models Djoerd Hiemstra and Arjen P. de Vries University of Twente, CTIT P.O. Box 217, 7500 AE Enschede The Netherlands

More information

CS54701: Information Retrieval

CS54701: Information Retrieval CS54701: Information Retrieval Basic Concepts 19 January 2016 Prof. Chris Clifton 1 Text Representation: Process of Indexing Remove Stopword, Stemming, Phrase Extraction etc Document Parser Extract useful

More information

Reducing Redundancy with Anchor Text and Spam Priors

Reducing Redundancy with Anchor Text and Spam Priors Reducing Redundancy with Anchor Text and Spam Priors Marijn Koolen 1 Jaap Kamps 1,2 1 Archives and Information Studies, Faculty of Humanities, University of Amsterdam 2 ISLA, Informatics Institute, University

More information

Mining Web Data. Lijun Zhang

Mining Web Data. Lijun Zhang Mining Web Data Lijun Zhang zlj@nju.edu.cn http://cs.nju.edu.cn/zlj Outline Introduction Web Crawling and Resource Discovery Search Engine Indexing and Query Processing Ranking Algorithms Recommender Systems

More information

James Mayfield! The Johns Hopkins University Applied Physics Laboratory The Human Language Technology Center of Excellence!

James Mayfield! The Johns Hopkins University Applied Physics Laboratory The Human Language Technology Center of Excellence! James Mayfield! The Johns Hopkins University Applied Physics Laboratory The Human Language Technology Center of Excellence! (301) 219-4649 james.mayfield@jhuapl.edu What is Information Retrieval? Evaluation

More information

CS54701: Information Retrieval

CS54701: Information Retrieval CS54701: Information Retrieval Federated Search 10 March 2016 Prof. Chris Clifton Outline Federated Search Introduction to federated search Main research problems Resource Representation Resource Selection

More information

Information Retrieval: Retrieval Models

Information Retrieval: Retrieval Models CS473: Web Information Retrieval & Management CS-473 Web Information Retrieval & Management Information Retrieval: Retrieval Models Luo Si Department of Computer Science Purdue University Retrieval Models

More information

Lecture 7: Relevance Feedback and Query Expansion

Lecture 7: Relevance Feedback and Query Expansion Lecture 7: Relevance Feedback and Query Expansion Information Retrieval Computer Science Tripos Part II Ronan Cummins Natural Language and Information Processing (NLIP) Group ronan.cummins@cl.cam.ac.uk

More information

Context-Sensitive Information Retrieval Using Implicit Feedback

Context-Sensitive Information Retrieval Using Implicit Feedback Context-Sensitive Information Retrieval Using Implicit Feedback Xuehua Shen Department of Computer Science University of Illinois at Urbana-Champaign Bin Tan Department of Computer Science University of

More information

University of Virginia Department of Computer Science. CS 4501: Information Retrieval Fall 2015

University of Virginia Department of Computer Science. CS 4501: Information Retrieval Fall 2015 University of Virginia Department of Computer Science CS 4501: Information Retrieval Fall 2015 5:00pm-6:15pm, Monday, October 26th Name: ComputingID: This is a closed book and closed notes exam. No electronic

More information

A Taxonomy of Semi-Supervised Learning Algorithms

A Taxonomy of Semi-Supervised Learning Algorithms A Taxonomy of Semi-Supervised Learning Algorithms Olivier Chapelle Max Planck Institute for Biological Cybernetics December 2005 Outline 1 Introduction 2 Generative models 3 Low density separation 4 Graph

More information

Extracting Relevant Snippets for Web Navigation

Extracting Relevant Snippets for Web Navigation Proceedings of the Twenty-Third AAAI Conference on Artificial Intelligence (2008) Extracting Relevant Snippets for Web Navigation Qing Li Southwestern University of Finance and Economics, China liq t@swufe.edu.cn

More information

Beyond Independent Relevance: Methods and Evaluation Metrics for Subtopic Retrieval

Beyond Independent Relevance: Methods and Evaluation Metrics for Subtopic Retrieval Beyond Independent Relevance: Methods and Evaluation Metrics for Subtopic Retrieval Cheng Zhai Computer Science Department University of Illinois at Urbana-Champaign Urbana, IL 61801 czhai@cs.uiuc.edu

More information

Probabilistic Information Retrieval Part I: Survey. Alexander Dekhtyar department of Computer Science University of Maryland

Probabilistic Information Retrieval Part I: Survey. Alexander Dekhtyar department of Computer Science University of Maryland Probabilistic Information Retrieval Part I: Survey Alexander Dekhtyar department of Computer Science University of Maryland Outline Part I: Survey: Why use probabilities? Where to use probabilities? How

More information

Mining long-lasting exploratory user interests from search history

Mining long-lasting exploratory user interests from search history Mining long-lasting exploratory user interests from search history Bin Tan, Yuanhua Lv and ChengXiang Zhai Department of Computer Science, University of Illinois at Urbana-Champaign bintanuiuc@gmail.com,

More information

Mining the Search Trails of Surfing Crowds: Identifying Relevant Websites from User Activity Data

Mining the Search Trails of Surfing Crowds: Identifying Relevant Websites from User Activity Data Mining the Search Trails of Surfing Crowds: Identifying Relevant Websites from User Activity Data Misha Bilenko and Ryen White presented by Matt Richardson Microsoft Research Search = Modeling User Behavior

More information

Department of Computer Science and Engineering B.E/B.Tech/M.E/M.Tech : B.E. Regulation: 2013 PG Specialisation : _

Department of Computer Science and Engineering B.E/B.Tech/M.E/M.Tech : B.E. Regulation: 2013 PG Specialisation : _ COURSE DELIVERY PLAN - THEORY Page 1 of 6 Department of Computer Science and Engineering B.E/B.Tech/M.E/M.Tech : B.E. Regulation: 2013 PG Specialisation : _ LP: CS6007 Rev. No: 01 Date: 27/06/2017 Sub.

More information

vector space retrieval many slides courtesy James Amherst

vector space retrieval many slides courtesy James Amherst vector space retrieval many slides courtesy James Allan@umass Amherst 1 what is a retrieval model? Model is an idealization or abstraction of an actual process Mathematical models are used to study the

More information

Fall Lecture 16: Learning-to-rank

Fall Lecture 16: Learning-to-rank Fall 2016 CS646: Information Retrieval Lecture 16: Learning-to-rank Jiepu Jiang University of Massachusetts Amherst 2016/11/2 Credit: some materials are from Christopher D. Manning, James Allan, and Honglin

More information

Regularization and model selection

Regularization and model selection CS229 Lecture notes Andrew Ng Part VI Regularization and model selection Suppose we are trying select among several different models for a learning problem. For instance, we might be using a polynomial

More information

Federated Text Search

Federated Text Search CS54701 Federated Text Search Luo Si Department of Computer Science Purdue University Abstract Outline Introduction to federated search Main research problems Resource Representation Resource Selection

More information

Mining Web Data. Lijun Zhang

Mining Web Data. Lijun Zhang Mining Web Data Lijun Zhang zlj@nju.edu.cn http://cs.nju.edu.cn/zlj Outline Introduction Web Crawling and Resource Discovery Search Engine Indexing and Query Processing Ranking Algorithms Recommender Systems

More information

Context-Based Topic Models for Query Modification

Context-Based Topic Models for Query Modification Context-Based Topic Models for Query Modification W. Bruce Croft and Xing Wei Center for Intelligent Information Retrieval University of Massachusetts Amherst 140 Governors rive Amherst, MA 01002 {croft,xwei}@cs.umass.edu

More information

Social Search Networks of People and Search Engines. CS6200 Information Retrieval

Social Search Networks of People and Search Engines. CS6200 Information Retrieval Social Search Networks of People and Search Engines CS6200 Information Retrieval Social Search Social search Communities of users actively participating in the search process Goes beyond classical search

More information

Effective Tweet Contextualization with Hashtags Performance Prediction and Multi-Document Summarization

Effective Tweet Contextualization with Hashtags Performance Prediction and Multi-Document Summarization Effective Tweet Contextualization with Hashtags Performance Prediction and Multi-Document Summarization Romain Deveaud 1 and Florian Boudin 2 1 LIA - University of Avignon romain.deveaud@univ-avignon.fr

More information

Custom IDF weights for boosting the relevancy of retrieved documents in textual retrieval

Custom IDF weights for boosting the relevancy of retrieved documents in textual retrieval Annals of the University of Craiova, Mathematics and Computer Science Series Volume 44(2), 2017, Pages 238 248 ISSN: 1223-6934 Custom IDF weights for boosting the relevancy of retrieved documents in textual

More information

WEIGHTING QUERY TERMS USING WORDNET ONTOLOGY

WEIGHTING QUERY TERMS USING WORDNET ONTOLOGY IJCSNS International Journal of Computer Science and Network Security, VOL.9 No.4, April 2009 349 WEIGHTING QUERY TERMS USING WORDNET ONTOLOGY Mohammed M. Sakre Mohammed M. Kouta Ali M. N. Allam Al Shorouk

More information

Re-ranking Documents Based on Query-Independent Document Specificity

Re-ranking Documents Based on Query-Independent Document Specificity Re-ranking Documents Based on Query-Independent Document Specificity Lei Zheng and Ingemar J. Cox Department of Computer Science University College London London, WC1E 6BT, United Kingdom lei.zheng@ucl.ac.uk,

More information

Fall CS646: Information Retrieval. Lecture 2 - Introduction to Search Result Ranking. Jiepu Jiang University of Massachusetts Amherst 2016/09/12

Fall CS646: Information Retrieval. Lecture 2 - Introduction to Search Result Ranking. Jiepu Jiang University of Massachusetts Amherst 2016/09/12 Fall 2016 CS646: Information Retrieval Lecture 2 - Introduction to Search Result Ranking Jiepu Jiang University of Massachusetts Amherst 2016/09/12 More course information Programming Prerequisites Proficiency

More information

Language Modeling Based Local Set Re-ranking using Manual Relevance Feedback

Language Modeling Based Local Set Re-ranking using Manual Relevance Feedback Language Modeling Based Local Set Re-ranking using Manual Relevance Feedback Manoj Kumar Chinnakotla and ushpak Bhattacharyya Department of Computer Science and Engineering, Indian Institute of Technology,

More information

A Deep Relevance Matching Model for Ad-hoc Retrieval

A Deep Relevance Matching Model for Ad-hoc Retrieval A Deep Relevance Matching Model for Ad-hoc Retrieval Jiafeng Guo 1, Yixing Fan 1, Qingyao Ai 2, W. Bruce Croft 2 1 CAS Key Lab of Web Data Science and Technology, Institute of Computing Technology, Chinese

More information

COMP6237 Data Mining Searching and Ranking

COMP6237 Data Mining Searching and Ranking COMP6237 Data Mining Searching and Ranking Jonathon Hare jsh2@ecs.soton.ac.uk Note: portions of these slides are from those by ChengXiang Cheng Zhai at UIUC https://class.coursera.org/textretrieval-001

More information

Chapter 3: Supervised Learning

Chapter 3: Supervised Learning Chapter 3: Supervised Learning Road Map Basic concepts Evaluation of classifiers Classification using association rules Naïve Bayesian classification Naïve Bayes for text classification Summary 2 An example

More information

Semi-Parametric and Non-parametric Term Weighting for Information Retrieval

Semi-Parametric and Non-parametric Term Weighting for Information Retrieval Semi-Parametric and Non-parametric Term Weighting for Information Retrieval Donald Metzler 1 and Hugo Zaragoza 1 Yahoo! Research {metzler,hugoz}@yahoo-inc.com Abstract. Most of the previous research on

More information

Chapter 27 Introduction to Information Retrieval and Web Search

Chapter 27 Introduction to Information Retrieval and Web Search Chapter 27 Introduction to Information Retrieval and Web Search Copyright 2011 Pearson Education, Inc. Publishing as Pearson Addison-Wesley Chapter 27 Outline Information Retrieval (IR) Concepts Retrieval

More information

Machine Learning. Supervised Learning. Manfred Huber

Machine Learning. Supervised Learning. Manfred Huber Machine Learning Supervised Learning Manfred Huber 2015 1 Supervised Learning Supervised learning is learning where the training data contains the target output of the learning system. Training data D

More information

CSE 5243 INTRO. TO DATA MINING

CSE 5243 INTRO. TO DATA MINING CSE 5243 INTRO. TO DATA MINING Graph Data & Introduction to Information Retrieval Huan Sun, CSE@The Ohio State University 11/21/2017 Slides adapted from Prof. Srinivasan Parthasarathy @OSU 2 Chapter 4

More information

Information Retrieval. CS630 Representing and Accessing Digital Information. What is a Retrieval Model? Basic IR Processes

Information Retrieval. CS630 Representing and Accessing Digital Information. What is a Retrieval Model? Basic IR Processes CS630 Representing and Accessing Digital Information Information Retrieval: Retrieval Models Information Retrieval Basics Data Structures and Access Indexing and Preprocessing Retrieval Models Thorsten

More information

CADIAL Search Engine at INEX

CADIAL Search Engine at INEX CADIAL Search Engine at INEX Jure Mijić 1, Marie-Francine Moens 2, and Bojana Dalbelo Bašić 1 1 Faculty of Electrical Engineering and Computing, University of Zagreb, Unska 3, 10000 Zagreb, Croatia {jure.mijic,bojana.dalbelo}@fer.hr

More information

A Survey of Query Expansion until June 2012

A Survey of Query Expansion until June 2012 A Survey of Query Expansion until June 2012 Yogesh Kakde Indian Institute of Technology, Bombay 25th June 2012 Abstract Here we present a survey of important work done on Query Expansion (QE) between the

More information

Fast exact maximum likelihood estimation for mixture of language model

Fast exact maximum likelihood estimation for mixture of language model Available online at www.sciencedirect.com Information rocessing and Management 44 (2008) 1076 1085 www.elsevier.com/locate/infoproman Fast exact maximum likelihood estimation for mixture of language model

More information

Implementation of a High-Performance Distributed Web Crawler and Big Data Applications with Husky

Implementation of a High-Performance Distributed Web Crawler and Big Data Applications with Husky Implementation of a High-Performance Distributed Web Crawler and Big Data Applications with Husky The Chinese University of Hong Kong Abstract Husky is a distributed computing system, achieving outstanding

More information

Information Retrieval and Web Search

Information Retrieval and Web Search Information Retrieval and Web Search Relevance Feedback. Query Expansion Instructor: Rada Mihalcea Intelligent Information Retrieval 1. Relevance feedback - Direct feedback - Pseudo feedback 2. Query expansion

More information

On Duplicate Results in a Search Session

On Duplicate Results in a Search Session On Duplicate Results in a Search Session Jiepu Jiang Daqing He Shuguang Han School of Information Sciences University of Pittsburgh jiepu.jiang@gmail.com dah44@pitt.edu shh69@pitt.edu ABSTRACT In this

More information

Beyond Independent Relevance: Methods and Evaluation Metrics for Subtopic Retrieval

Beyond Independent Relevance: Methods and Evaluation Metrics for Subtopic Retrieval Beyond Independent Relevance: Methods and Evaluation Metrics for Subtopic Retrieval ChengXiang Zhai Computer Science Department University of Illinois at Urbana-Champaign William W. Cohen Center for Automated

More information

DD2475 Information Retrieval Lecture 7: Probabilistic Information Retrieval, Language Models

DD2475 Information Retrieval Lecture 7: Probabilistic Information Retrieval, Language Models Only Simplified Model of Reality DD2475 Information Retrieval Lecture 7: Probabilistic Information Retrieval, Language Models Hedvig Kjellström hedvig@kth.se www.csc.kth.se/dd2475 User Information Need

More information

RMIT University at TREC 2006: Terabyte Track

RMIT University at TREC 2006: Terabyte Track RMIT University at TREC 2006: Terabyte Track Steven Garcia Falk Scholer Nicholas Lester Milad Shokouhi School of Computer Science and IT RMIT University, GPO Box 2476V Melbourne 3001, Australia 1 Introduction

More information

A User Profiles Acquiring Approach Using Pseudo-Relevance Feedback

A User Profiles Acquiring Approach Using Pseudo-Relevance Feedback A User Profiles Acquiring Approach Using Pseudo-Relevance Feedback Xiaohui Tao and Yuefeng Li Faculty of Science & Technology, Queensland University of Technology, Australia {x.tao, y2.li}@qut.edu.au Abstract.

More information

Chapter 9. Classification and Clustering

Chapter 9. Classification and Clustering Chapter 9 Classification and Clustering Classification and Clustering Classification and clustering are classical pattern recognition and machine learning problems Classification, also referred to as categorization

More information

Term-Specific Smoothing for the Language Modeling Approach to Information Retrieval: The Importance of a Query Term

Term-Specific Smoothing for the Language Modeling Approach to Information Retrieval: The Importance of a Query Term Term-Specific Smoothing for the Language Modeling Approach to Information Retrieval: The Importance of a Query Term Djoerd Hiemstra University of Twente, Centre for Telematics and Information Technology

More information

Classification and Clustering

Classification and Clustering Chapter 9 Classification and Clustering Classification and Clustering Classification/clustering are classical pattern recognition/ machine learning problems Classification, also referred to as categorization

More information

EXPERIMENTS ON RETRIEVAL OF OPTIMAL CLUSTERS

EXPERIMENTS ON RETRIEVAL OF OPTIMAL CLUSTERS EXPERIMENTS ON RETRIEVAL OF OPTIMAL CLUSTERS Xiaoyong Liu Center for Intelligent Information Retrieval Department of Computer Science University of Massachusetts, Amherst, MA 01003 xliu@cs.umass.edu W.

More information

Modern Information Retrieval

Modern Information Retrieval Modern Information Retrieval Chapter 5 Relevance Feedback and Query Expansion Introduction A Framework for Feedback Methods Explicit Relevance Feedback Explicit Feedback Through Clicks Implicit Feedback

More information

Relevance Models for Topic Detection and Tracking

Relevance Models for Topic Detection and Tracking Relevance Models for Topic Detection and Tracking Victor Lavrenko, James Allan, Edward DeGuzman, Daniel LaFlamme, Veera Pollard, and Steven Thomas Center for Intelligent Information Retrieval Department

More information

Knowledge Discovery and Data Mining 1 (VO) ( )

Knowledge Discovery and Data Mining 1 (VO) ( ) Knowledge Discovery and Data Mining 1 (VO) (707.003) Data Matrices and Vector Space Model Denis Helic KTI, TU Graz Nov 6, 2014 Denis Helic (KTI, TU Graz) KDDM1 Nov 6, 2014 1 / 55 Big picture: KDDM Probability

More information

Letter Pair Similarity Classification and URL Ranking Based on Feedback Approach

Letter Pair Similarity Classification and URL Ranking Based on Feedback Approach Letter Pair Similarity Classification and URL Ranking Based on Feedback Approach P.T.Shijili 1 P.G Student, Department of CSE, Dr.Nallini Institute of Engineering & Technology, Dharapuram, Tamilnadu, India

More information