DD2475 Information Retrieval Lecture 7: Probabilistic Information Retrieval, Language Models

Size: px

Start display at page:

Download "DD2475 Information Retrieval Lecture 7: Probabilistic Information Retrieval, Language Models"

Maximilian Smith
5 years ago
Views:

Only Simplified Model of Reality DD2475 Information Retrieval Lecture 7: Probabilistic Information Retrieval, Language Models Hedvig

Documents Document Uncertain guess of document relevance! Probabilistic Methods Today!

probabilistic information retrieval (Manning Chapter 11) -!Probabilistic methods the right way to do it -!Mathematically sound -!

1 Only Simplified Model of Reality DD2475 Information Retrieval Lecture 7: Probabilistic Information Retrieval, Language Models Hedvig Kjellström User Information Need Query Uncertain understanding! of user need! Documents Document Uncertain guess of document relevance! Probabilistic Methods Today!Traditionally in Artificial Intelligence, Information Retrieval, Machine Learning, Computer Vision, etc:!probabilistic information retrieval (Manning Chapter 11) -!Probabilistic methods the right way to do it -!Mathematically sound -!Computationally expensive not practical!it may be different now! -!Probability ranking principle -!Binary independence model!language models (Manning Chapter 12) -!Query likelihood model -!Language model vs binary independence model

The Document Ranking Problem Query Probabilistic Information Retrieval (Manning Chapter 11)!Have collection of documents!user issues query!return list of documents!ranking method core of IR system: -!

2 The Document Ranking Problem Query Probabilistic Information Retrieval (Manning Chapter 11)!Have collection of documents!user issues query!return list of documents!ranking method core of IR system: -!Want best document first, second best second, etc. Document Lecture 4!Idea 1: Rank according to query-document similarity -!CosineSim(tf-idf(document),tf-idf(query))!Idea 2: Rank according to probability of document being relevant w.r.t. information need -!p(relevant document,query) Random Variables Probabilities!A Boolean-valued random variable denotes an event/ hypothesis some degree of uncertainty as to whether A occurs -!relevant (the document is relevant to the query)!a random variable can also be discrete, continuous, vector-valued, etc -!document (a certain document) -!query (a certain query)!p(a) - fraction of worlds in which A = true = Green square / Red square! 0.1 A = false! A = true!

3 Conditional Probabilities Conditional Probabilities!p(B A) - fraction of worlds where B = true in which A = true as well = Brown square / Green square A = false, B = false!!p(a) = Green square / Red square! 0.1!p(A! B) = Brown square / Red square! 0.03! p(b A) = p(a! B) / p(b)! 0.3 A = false, B = false! B = true! A = true! B = true! A = true! Bayes Rule Probability Ranking Principle (PRP) p(a D) = -!p(a) = prior probability of hypothesis A PRIOR -!p(d) = prior probability of data D EVIDENCE -!p(d A) = probability of D given A - LIKELIHOOD -!P(A D) = probability of A given D POSTERIOR!Boolean hypotheses: Odds O(A D) = p(d A)p(A) p(d) p(a D) p( A D) = p(a D) 1" p(a D)!Long version, van Rijsbergen (1979: ): -! If a reference retrieval system's response to each request is a ranking of the documents in the collection in order of decreasing probability of relevance to the user who submitted the request, where the probabilities are estimated as accurately as possible on the basis of whatever data have been made available to the system for this purpose, the overall effectiveness of the system to its user will be the best that is obtainable on the basis of those data.!short version: -!Have representative training document collection and training queries, learn p(relevant document,query) -!Ranking of new documents w.r.t. new queries (similar to, or in, the training collection) with this function is bound to better than anything else

4 Probability Ranking Principle (PRP) Probability Ranking Principle (PRP)!Let d be a document in the collection -!R represents relevance of doc w.r.t. given (fixed) query -!NR represents non-relevance of doc w.r.t. given (fixed) query R={0,1} (NR/R)!!Need to find p(r d,q) - probability that a document d is relevant given query q!how can this be done? -!Reformulate using Bayes rule to distributions easier to learn p(r d,q) = p(nr d,q) = p(d R,q)p(R q) p(d q) p(d NR,q)p(NR q) p(d q) p(r d,q) + p(nr d,q) =1 O(R d,q) = p(r d,q) p(nr d,q)!p(r q), p(nr q) - prior probability of retrieving a (non-)relevant document given query q!prp in action: Rank all documents by p(r d,q) or O(R d,q) Probability Ranking Principle (PRP)!How compute p(d R,q), p(d NR,q), p(r q), p(nr q)? -!Do not know exact probabilities, have to use estimates -!I.e., use simplified model of reality -! simplest model!questionable assumptions -!More later Documents Query Document Lecture 1 More in Lecture 8!Traditionally used with PRP! Binary = Boolean: Documents d are boolean vectors where present in document x iff term i is! Independence : terms occur in documents independently -!Different documents can be modeled as same vector!bernoulli Naive Bayes model Documents Query Document

5 ! q = (q 1,,q n )!Documents d: binary vectors!(queries q: also binary vectors ) O(R q,! x ) =!! p(r q, x ) p(nr q, x! p(r q) = ) p(nr q) " p( x R,q) p( x! NR,q)!Given query q, -!for each document d, compute p(r q,d) -!Equivalent to computing p(r q,x) where x represents d -!Interested only in ranking Constant for a given query Using independence assumption: Needs estimation!will use odds and Bayes Rule: So: O(R q,d) = O(R q) " n # i=1 p(x i R,q) p(x i NR,q) O(R q,d) = O(R q) " Since x i is either 0 or 1: n # i=1 p(x i R,q) p(x i NR,q) All matching terms Non-matching query terms Let Assume, for all terms not occurring in the query (q i =0): All matching terms All query terms Then... This can be! changed (e.g., in! relevance feedback)!

6 Binary Independence Model All boils down to computing RSV Constant for each query Retrieval Status Value: Only quantity to be estimated for rankings So, how do we compute c i s from our data? Estimating RSV coefficients from document training set For each term i look at this table of document counts:!s << (N S) -!r i! n/n -!log (1 r i )/r i! log (N n)/n Lecture 4!n << (N n) -!log (N n)/n! log N/n = idf!estimates: For now, assume no zero terms. See book page 208.!Theoretical justification for the idf score! -!Bayes-optimal

7 Sec ! Exercise 5 Minutes Exercise 5 Minutes!Possible to get reasonable approximations of probabilities!possible to get reasonable approximations of probabilities BUT it requires restrictive assumptions.!discuss in pairs/groups: -!What are the assumptions made by PRP and BIM? BUT it requires restrictive assumptions: -!term independence (term order, term co-occurrence) -!terms not in query don t affect the outcome -!boolean representation of documents/queries/relevance -!document relevance values are independent! Really, it s bad to keep on returning duplicates!some of these assumptions can be removed -!Trade-offs Summary Comparison!Standard Vector Space Model +!Empirical for the most part; success measured by results!!few properties provable!probabilistic Model +!Based on a firm theoretical foundation +!Theoretically justified optimal ranking scheme!!binary word-in-doc weights (not using term frequencies)!!independence of terms (can be alleviated)!!amount of computation!!has never worked convincingly better in practice Language Models (Manning Chapter 12)

Standard Probabilistic IR Dealing with the Query Part User Information Need Query

but how does user generate query based on information need?

(LM) Information need query! P(q M d ) generation!

documents!language Model approach models this process!! d1 d2 dn!! document collection!

Assumption: User choosing queries in methodical way but with some noise, based on what

8 Standard Probabilistic IR Dealing with the Query Part User Information Need Query Information need P(R q,d) d1 Documents Document query matching d2!but how does user generate query based on information need?!! dn document collection IR Based on Language Model (LM) IR Based on Language Model (LM) Information need query! P(q M d ) generation!common search heuristic: form query from words you expect to find in matching documents!language Model approach models this process!! d1 d2 dn!! document collection!model user s generation of queries as a stochastic process -! Assumption: User choosing queries in methodical way but with some noise, based on what documents they are looking for!approach: -!Infer a LM M d for each document d -!Estimate p(q M d ) for each document d -!Rank the documents according to p(q M d )! High p(q M d ) probable that M d generated q

9 Stochastic Language Models Stochastic Language Models!Model probability of generating string s in the language M:!Compare languages/models Model M Model M1 Model M2 0.2 the 0.1 a 0.01 man 0.01 woman 0.03 said 0.02 likes the man likes the woman * * * *! p(s M) = ! the! 0.01! class! ! sayst! ! pleaseth! ! yon! ! maiden! 0.01! woman! 0.2! the! ! class! 0.03! sayst! 0.02! pleaseth! 0.1! yon! 0.01! maiden! ! woman! the class pleaseth yon maiden * * * *! * * * *! p(s M2) > p(s M1) Sec ! Exercise 2 Minutes Stochastic Language Models!Stochastic language model: -!Probability of each word in language -!Multiply probabilities of query terms to get probability of query!discuss in pairs/groups: -!What are the limitations of this model? -!How can the model be improved?!a statistical model taking order into account M p( M) = p( M) p( M, ) p( M, ) p( M, )

10 Model Approximations Fundamental problem of LMs!Unigram Language Models -!What we had in first example p( M) = p( M) p( M) p( M) p( M)!Bigram (generally, n-gram) Language Models p( M) = p( M) p( M, ) p( M, ) p( M, )!Other Language Models -!Grammar-based models (PCFGs), etc.! Probably not the first thing to try in IR!Usually we don t know the model M -!But have a sample of text (document d) representative of that model p( M( ))!Estimate language model M from sample d!then compute the observation probability of q M d q! LMs vs. BIMs for IR LMs vs. BIMs for IR!LMs computationally tractable, intuitively appealing -!Conceptually simple and explanatory -!Formal mathematical model -!Natural use of collection statistics!lms provide effective retrieval if -!Language models are accurate representations of the data -!Users have some sense of term distribution!lm has no explicit Relevance -!Relevance feedback is difficult to integrate, as are user preferences, and other general issues of relevance -!Current extensions focus on putting relevance back into the model!lm assumes that documents and expressions of information user need are of the same type -!Unrealistic -!Very simple models of language -!Current extensions add more model layers

11 Next!Computer hall session (December 9, ) -!Grå (Osquars Backe 2, level 5) -!Examination of Assignment 2 -!Sign up in Doodle (see course homepage after 10.30)!Written exam (December 13, ) -!E52 -!Covers Manning Chapters 1-9 (NOTE THE CHANGE)

Probabilistic Information Retrieval Part I: Survey. Alexander Dekhtyar department of Computer Science University of Maryland

Probabilistic Information Retrieval Part I: Survey Alexander Dekhtyar department of Computer Science University of Maryland Outline Part I: Survey: Why use probabilities? Where to use probabilities? How