DD2475 Information Retrieval Lecture 5: Evaluation, Relevance Feedback, Query Expansion
DD2475 Information Retrieval
Lecture 5: Evaluation, Relevance Feedback, Query Expansion
Hedvig Kjellström, hedvig@kth.se

Lectures 2-4: All Tools for a Complete Information Retrieval System
[diagram: the components of a complete IR system, annotated with the lecture (2, 3 or 4) that covered each of them]

Today
- Evaluation (Manning Chapter 8)
  - Benchmarks
  - Precision and recall
  - Making results usable to a user
- Relevance feedback and query expansion (Manning Chapter 9)
  - Improving results
  - Globally: query expansion/reformulation
  - Locally: relevance feedback

Evaluation (Manning Chapter 8)
Sec. 8.1 Quality Measure: Relevance
- The most common proxy for the quality of an IR system: relevance of the search results
- Relevance measurement requires:
  - A relevance metric
  - A benchmark document collection
  - A benchmark suite of queries
  - A binary label Relevant/Nonrelevant for each query-document combination

Sec. 8.3 Unranked Retrieval Evaluation Metric: Precision and Recall
- Precision: the fraction of retrieved documents that are also relevant = P(relevant | retrieved)
- Recall: the fraction of relevant documents that are also retrieved = P(retrieved | relevant)

                  Relevant   Nonrelevant
  Retrieved          tp          fp
  Not Retrieved      fn          tn

  P = tp/(tp + fp)
  R = tp/(tp + fn)

- What is the relative size of the sets tp, tn, fp, fn?

Sec. 8.3 Another Metric: Accuracy
- View the engine as a classifier
- Accuracy: the fraction of classifications that are correct
  - (tp + tn) / (tp + fp + fn + tn)
- Accuracy is a commonly used evaluation measure in machine learning classification

Sec. 8.3 Why Not Just Use Accuracy?
- Why is accuracy not a very useful measure in IR?
- How to build a highly "accurate" search engine on a low budget: answer "0 matching results found" to every query; since almost all documents are nonrelevant to any given query, accuracy stays high. (More in Lecture 9)
- People doing information retrieval want to find something, and have a certain tolerance for junk.
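The contingency table above maps directly onto a few lines of code. Below is a minimal sketch (not from the lecture; the set-based bookkeeping and the toy numbers are my own) that computes precision, recall and accuracy for one query:

```python
def evaluate_unranked(retrieved, relevant, collection_size):
    """Compute tp/fp/fn/tn and the derived measures for one query."""
    retrieved, relevant = set(retrieved), set(relevant)
    tp = len(retrieved & relevant)       # relevant and retrieved
    fp = len(retrieved - relevant)       # retrieved but not relevant
    fn = len(relevant - retrieved)       # relevant but not retrieved
    tn = collection_size - tp - fp - fn  # neither retrieved nor relevant
    precision = tp / (tp + fp) if retrieved else 0.0
    recall = tp / (tp + fn) if relevant else 0.0
    accuracy = (tp + tn) / collection_size
    return precision, recall, accuracy

# Toy numbers: a collection of 1000 documents, 20 of them relevant;
# the engine returns 10 documents of which 8 are relevant.
relevant_docs = {f"d{i}" for i in range(20)}
retrieved_docs = {f"d{i}" for i in range(8)} | {"d100", "d200"}
print(evaluate_unranked(retrieved_docs, relevant_docs, 1000))
# -> (0.8, 0.4, 0.986): accuracy looks great even though recall is poor
```

Note that an engine that retrieves nothing at all would still reach accuracy 0.98 on this toy collection, which is exactly why accuracy is not a useful measure in IR.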
Sec. 8.3 Precision/Recall
- Ideal:
  - No false positives, precision = 1
  - No false negatives, recall = 1
- Reality:
  - Get one or the other by tweaking parameters
- Recall is a non-decreasing function of the number of documents retrieved (recall = 1 when all documents are retrieved)
- Precision decreases as recall increases
  - Not a theorem, but a result with strong empirical confirmation

Sec. 8.3 Difficulties with Precision/Recall Evaluation
- Should average over large document collection/query ensembles
- Need human relevance assessments
  - People aren't reliable assessors
- Assessments have to be binary
  - Nuanced assessments?
- Heavily skewed by collection/authorship
  - Results may not translate from one domain to another

Sec. 8.3 Combined Measure: F
- The combined measure that assesses the precision/recall tradeoff is the F measure (a weighted harmonic mean):

  F = 1 / (alpha/P + (1 - alpha)/R) = (beta^2 + 1)PR / (beta^2 P + R), where beta^2 = (1 - alpha)/alpha

- People usually use the balanced F1 measure, i.e., with beta = 1 (alpha = 1/2):

  F1 = 2PR / (P + R)

Sec. 8.3 F1 (Purple) and Other Averages
[plot comparing the harmonic mean F1 with other ways of averaging precision and recall]
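A small sketch (my own, following the weighted-harmonic-mean definition above) showing how F behaves compared with a plain average:

```python
def f_measure(precision, recall, beta=1.0):
    """Weighted harmonic mean of precision and recall.
    beta > 1 favours recall, beta < 1 favours precision."""
    if precision == 0.0 and recall == 0.0:
        return 0.0
    b2 = beta ** 2
    return (b2 + 1) * precision * recall / (b2 * precision + recall)

# The harmonic mean punishes imbalance between P and R,
# unlike the arithmetic mean:
print(f_measure(0.8, 0.4))   # F1 ~= 0.53
print((0.8 + 0.4) / 2)       # arithmetic mean = 0.6
```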
Sec. 8.4 Ranked Retrieval Evaluation Metric: Precision-Recall Curve
- Vary the cut-off point (top K results) and compute precision and recall at each cut-off
  - Precision P = tp/(tp + fp)
  - Recall R = tp/(tp + fn)
- Of course, averaging over a bunch of queries
[plot: precision-recall curve for a ranked result list]

Exercise (5 minutes)
- The sequence that generated this curve: 1000 data points in total
- Work out in pairs: fill in the empty boxes in the table (Relevant, tp (cumulative), fp (cumulative), fn (cumulative), Precision, Recall)

Sec. 8.4 Evaluation Measures
- Average performance over many queries
- Summary measure: 11-point interpolated average precision
  - Early standard measure
  - Evaluates performance at all recall levels
- Summary measure: Precision-at-K
  - Precision of the top K results
  - Natural for web search
  - Averages badly, arbitrary parameter K
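The ranked measures above are also easy to script. A sketch (my own; the toy ranking is made up) of per-cut-off precision/recall points, Precision-at-K, and 11-point interpolated average precision for a single query:

```python
def pr_points(ranked, relevant):
    """Precision/recall after each cut-off K in a ranked result list."""
    relevant = set(relevant)
    points, tp = [], 0
    for k, doc in enumerate(ranked, start=1):
        if doc in relevant:
            tp += 1
        points.append((tp / k, tp / len(relevant)))  # (precision, recall) at K
    return points

def precision_at_k(ranked, relevant, k):
    """Precision of the top k results."""
    relevant = set(relevant)
    return sum(1 for doc in ranked[:k] if doc in relevant) / k

def interpolated_11pt(points):
    """11-point interpolated average precision: at each recall level
    0.0, 0.1, ..., 1.0 take the max precision at any recall >= that level."""
    interp = []
    for r_level in (i / 10 for i in range(11)):
        candidates = [p for p, r in points if r >= r_level]
        interp.append(max(candidates) if candidates else 0.0)
    return sum(interp) / len(interp)

ranked = ["d3", "d7", "d1", "d9", "d2", "d8"]
relevant = ["d1", "d2", "d5"]
pts = pr_points(ranked, relevant)
print(pts)                                  # P/R after each cut-off K
print(precision_at_k(ranked, relevant, 5))  # P@5 = 0.4
print(interpolated_11pt(pts))
```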
Sec. 8.1 What is Assessed?
- The user has an information need
- The information need is translated into a query (by the user)
- Relevance is assessed relative to the information need, not the query
- Example:
  - Information need: "I'm looking for information on whether drinking red wine is more effective at reducing your risk of heart attacks than white wine."
  - Query: WINE RED WHITE HEART ATTACK EFFECTIVE

Sec. 8.5 From Document Collections to Test Collections
- Test queries
  - Must be relevant to the documents available
  - Best designed by domain experts
  - Random query terms are generally not a good idea
- Relevance assessments
  - Human judges, time-consuming
  - Human judges, error?

Sec. 8.5 Kappa Measure for Inter-Judge (Dis)Agreement
- Kappa measure
  - Agreement measure among judges
  - Designed for categorical judgments
  - Corrects for chance agreement
- kappa = [ P(A) - P(E) ] / [ 1 - P(E) ]
- P(A): proportion of the time the judges agree
- P(E): probability that the agreement is by chance
- kappa = 0 for chance agreement, 1 for total agreement

Sec. 8.5 Kappa Measure: Example

  #documents   Judge 1       Judge 2
  300          Relevant      Relevant
  70           Nonrelevant   Nonrelevant
  20           Relevant      Nonrelevant
  10           Nonrelevant   Relevant

- P(A) = 370/400 = 0.925
- P(nonrelevant) = (80 + 90)/800 = 0.2125
- P(relevant) = (320 + 310)/800 = 0.7875
- P(E) = 0.2125^2 + 0.7875^2 = 0.665
- Kappa = (0.925 - 0.665)/(1 - 0.665) = 0.776
- Kappa > 0.8: good agreement
- 0.67 < Kappa < 0.8: reasonable agreement
- Depends on the purpose of the study
- For more than 2 judges: average pairwise kappas
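The kappa computation in the example is mechanical enough to check in code. A minimal sketch (my own) that reproduces the numbers above from the 2x2 agreement table; the marginal probabilities are pooled over both judges, as in the lecture:

```python
def kappa(rel_rel, nonrel_nonrel, rel_nonrel, nonrel_rel):
    """Kappa for two judges with binary Relevant/Nonrelevant labels.
    Arguments are the counts of each (judge 1, judge 2) label combination."""
    n = rel_rel + nonrel_nonrel + rel_nonrel + nonrel_rel
    p_agree = (rel_rel + nonrel_nonrel) / n
    # Pool the marginals over both judges, as in the lecture example.
    p_relevant = (2 * rel_rel + rel_nonrel + nonrel_rel) / (2 * n)
    p_nonrelevant = (2 * nonrel_nonrel + rel_nonrel + nonrel_rel) / (2 * n)
    p_chance = p_relevant ** 2 + p_nonrelevant ** 2
    return (p_agree - p_chance) / (1 - p_chance)

# Reproduces the slide: P(A) = 0.925, P(E) = 0.665, kappa ~= 0.776
print(kappa(300, 70, 20, 10))
```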
Sec. 8.2 Standard Relevance Benchmarks
- Text REtrieval Conference (TREC)
  - IR test bed run by the National Institute of Standards and Technology (NIST)
  - Reuters and other benchmark document collections used
  - Retrieval tasks specified, sometimes as queries
  - Each query-document combination marked as Relevant or Nonrelevant by human experts
- Cross Language Evaluation Forum (CLEF)
  - Concentrated on European languages and cross-language information retrieval
- Many others

Can We Avoid Human Judgment?
- No
- Human judgement makes experimental work hard
  - Especially on a large scale
- In some very specific settings, proxies can be used
  - E.g. use the exact cosine score to compare approximate vector space retrieval to exact retrieval (baseline)
- Can reuse (human-labeled) test collections
  - Watch for overtraining, overfitting

Evaluation of Large Search Engines
- Search engines have test collections of queries and hand-ranked results
- Recall is difficult to measure on the web
- Often precision at top K, e.g., K = 10
- Non-relevance-based measures
  - Clickthrough on first result
  - Studies of user behavior in the lab

Sec. 8.6 Measures for a Search Engine
- How fast does it index?
  - Number of documents/hour
- How fast does it search?
  - Latency as a function of index size
- Expressiveness of query language
  - Ability to express complex information needs
  - Speed on complex queries
- Uncluttered UI
- Is it free?
Sec. 8.6 Measures for a Search Engine
- All of the preceding criteria are measurable
- The key measure: user happiness
  - What is this?
  - Speed of response/size of index are factors
  - But relevance is also important
- Need a way of quantifying user happiness

Measuring User Happiness
- Issue: who is the user we are trying to make happy?
  - Depends on the setting
- Web engine:
  - Users find what they want and return to the engine
- Ecommerce site:
  - Users find what they want and buy
- Enterprise (company/government/academic):
  - Care about user productivity

Sec. 8.7 Result Summaries
- Having ranked the documents matching a query, we wish to present a results list
- Most commonly, a list of the document titles plus a short summary, aka "10 blue links"
- The title comes from document metadata. What about the summaries?
  - This description is crucial
  - The user can identify good/relevant hits based on the description
- Two basic kinds:
  - Static: always the same, regardless of the query
  - Dynamic: a query-dependent attempt to explain why the document was retrieved for the query at hand
Sec. 8.7 Static Summaries
- A subset of the document
- Simplest heuristic: the first 50 words of the document
  - Summary cached at indexing time
- More sophisticated: extract from each document a set of key sentences
  - Simple NLP heuristics to score each sentence
  - The summary is made up of the top-scoring sentences

Sec. 8.7 Dynamic Summaries (recall Lecture 4)
- Find small windows in the document that contain query terms
  - Requires fast window lookup in a document cache
- Score each window with respect to the query
  - Use various features such as window width, position in the document, etc.
- Most sophisticated: NLP used to synthesize a summary
  - Seldom used in IR; cf. text summarization work

Alternative Results Presentations
- An active area of HCI research
- Recent versions of Mac OS X: Cover Flow

Relevance Feedback and Query Expansion (Manning Chapter 9)
Sec. 9.1 Improving Results
- Local methods (documents relating to the query)
  - Relevance feedback
- Global methods (all documents in the collection)
  - Query expansion

Sec. 9.1 Relevance Feedback
- Relevance feedback: user feedback on the relevance of documents in the initial set of results
  - The user issues a (short, simple) query
  - The user marks some results as relevant or non-relevant
  - The system computes a better representation of the information need based on the feedback
  - Possibly iterate
- Key idea: it may be difficult to formulate a good query when you don't know the collection well, so iterate
- We will use "ad hoc retrieval" to refer to regular retrieval without relevance feedback
- Three examples of relevance feedback that highlight different aspects

Example 1: Similar Pages
Example 2: Relevance Feedback in Image Search
- Experimental image search engine (Newsam et al., ICIP 2001), not active anymore
- Query-document similarity according to different measures
[screenshots: results for the initial query, the relevance feedback step, and the results after relevance feedback]
Example 3: Relevance Feedback in Text Search
- Text search example
- Initial query: NEW SPACE SATELLITE APPLICATIONS

Results for Initial Query (ranked by query-document similarity)
  1. 08/13/91, NASA Hasn't Scrapped Imaging Spectrometer
  2. 07/09/91, NASA Scratches Environment Gear From Satellite Plan
  3. 04/04/90, Science Panel Backs NASA Satellite Plan, But Urges Launches of Smaller Probes
  4. 09/09/91, A NASA Satellite Project Accomplishes Incredible Feat: Staying Within Budget
  5. 07/24/90, Scientist Who Exposed Global Warming Proposes Satellites for Climate Research
  6. 08/22/90, Report Provides Support for the Critics Of Using Big Satellites to Study Climate
  7. 04/13/87, Arianespace Receives Satellite Launch Pact From Telesat Canada
  8. 12/02/87, Telecommunications Tale of Two Companies

Relevance Feedback
- The user marks three of the results above (+) as relevant

Expanded Query after Relevance Feedback
- Reweighted query terms: new, space, satellite, application, nasa, eos, launch, aster, instrument, arianespace, bundespost, ss, rocket, scientist, broadcast, earth, oil, measure
Results for Expanded Query (updated query-document similarity)
  1. 07/09/91, NASA Scratches Environment Gear From Satellite Plan (old rank 2)
  2. 08/13/91, NASA Hasn't Scrapped Imaging Spectrometer (old rank 1)
  3. 08/07/89, When the Pentagon Launches a Secret Satellite, Space Sleuths Do Some Spy Work of Their Own
  4. 07/31/89, NASA Uses Warm Superconductors For Fast Circuit
  5. 12/02/87, Telecommunications Tale of Two Companies (old rank 8)
  6. 07/09/91, Soviets May Adapt Parts of SS-20 Missile For Commercial Use
  7. 07/12/88, Gaping Gap: Pentagon Lags in Race To Match the Soviets In Rocket Launchers
  8. 06/14/90, Rescue of Satellite By Space Agency To Cost $90 Million

Key Concept: Centroid
- Centroid = center of mass of a set of points
- Recall: we represent documents as points in a high-dimensional space
- Definition: the centroid of a set of documents C is the vector mean

  centroid(C) = (1/|C|) * sum of d over all d in C

Rocchio Algorithm: The Theoretically Best Query
- Uses the vector space model to pick the query with optimal terms and weights
- Seeks the query q_opt that maximizes

  sim(q, centroid(C_r)) - sim(q, centroid(C_nr))

- i.e., tries to separate the documents marked relevant from those marked non-relevant
[diagram: the optimal query separating relevant documents (x) from non-relevant documents (o)]
- Problem: we don't know the truly relevant documents
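A sketch of the centroid and of the "theoretically best" query as the difference of the relevant and non-relevant centroids (assuming NumPy; the toy term space and weights are made up for illustration):

```python
import numpy as np

def centroid(doc_vectors):
    """Vector mean of a set of document vectors: (1/|C|) * sum of d over C."""
    return np.mean(doc_vectors, axis=0)

# Toy document vectors in a 4-term space (weights are made up).
relevant_vecs = np.array([[0.9, 0.1, 0.0, 0.2],
                          [0.8, 0.3, 0.1, 0.0]])
nonrelevant_vecs = np.array([[0.1, 0.0, 0.9, 0.7],
                             [0.0, 0.2, 0.8, 0.9]])

# Under cosine similarity, the separation above is maximized by the
# difference of the two centroids:
q_opt = centroid(relevant_vecs) - centroid(nonrelevant_vecs)
print(q_opt)  # large weights on the "relevant" terms, negative on the rest
```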
Rocchio 1971 Algorithm (SMART)
- Used in practice (a code sketch follows below):

  q_m = alpha * q_0 + beta * (1/|D_r|) * sum of d_j over D_r - gamma * (1/|D_nr|) * sum of d_j over D_nr

- D_r = set of known relevant document vectors
- D_nr = set of known non-relevant document vectors
  - Different from C_r and C_nr!
- q_m = modified query vector; q_0 = original query vector
- alpha, beta, gamma: weights (hand-chosen or set empirically)

Relevance Feedback on Initial Query
- The new query moves toward the relevant documents and away from the non-relevant documents
[diagram: initial query and revised query among known relevant (x) and known non-relevant (o) documents]

Relevance Feedback: Assumptions
- A1: The user has sufficient knowledge for the initial query
  - Violations: misspellings, cross-language retrieval, vocabulary mismatch
- A2: Relevance prototypes are well-behaved
  - Term distribution in relevant documents is similar
  - Term distribution in non-relevant documents differs from that in relevant documents
  - Violations: synonyms, two subjects in one query
- Positive feedback is more valuable than negative feedback (so set gamma < beta; e.g. gamma = 0.25, beta = 0.75)
- Many systems (both examples earlier) allow only positive relevance feedback (gamma = 0)

Exercise (5 minutes)
- Discuss in pairs/groups: why is positive feedback more valuable than negative feedback?
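A minimal sketch of the Rocchio update itself (assuming NumPy; beta = 0.75 and gamma = 0.25 are the slide's example values, while alpha = 1.0 and the clamping of negative weights to zero are my own assumptions):

```python
import numpy as np

def rocchio(q0, rel_docs, nonrel_docs, alpha=1.0, beta=0.75, gamma=0.25):
    """Rocchio (1971) relevance-feedback update of a query vector.
    rel_docs / nonrel_docs: arrays of known relevant / non-relevant doc vectors."""
    qm = alpha * np.asarray(q0, dtype=float)
    if len(rel_docs):
        qm = qm + beta * np.mean(rel_docs, axis=0)
    if len(nonrel_docs):
        qm = qm - gamma * np.mean(nonrel_docs, axis=0)
    # Negative term weights make little sense in a term-vector query;
    # clamping them to zero is a common practical choice (an assumption here).
    return np.maximum(qm, 0.0)

q0 = np.array([1.0, 0.0, 0.0, 0.5])                           # original query
rel = np.array([[0.9, 0.1, 0.0, 0.2], [0.8, 0.3, 0.1, 0.0]])  # marked relevant
nonrel = np.array([[0.1, 0.0, 0.9, 0.7]])                     # marked non-relevant
print(rocchio(q0, rel, nonrel))  # moves toward relevant terms, away from the rest
```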
Relevance Feedback: Problems
- Long queries are inefficient for a typical IR engine
  - Long response times for the user
  - High cost for the retrieval system
  - Why are long queries inefficient?
- Users are often reluctant to provide explicit feedback
- It is often harder to understand why a particular document was retrieved after applying relevance feedback

Query Expansion
- In relevance feedback, users give additional input (relevant/non-relevant) on documents
  - Local, dynamic analysis of documents relating to the query
- In query expansion, users give additional input (refined search terms) on words or phrases
  - Global, static analysis of all documents in the collection

How is the Query Augmented?
- Manual thesaurus
  - Maintained by human specialists
  - Used e.g. in medicine
- Automatic thesaurus
  - More to come
- Refinements based on query log mining
  - Common on the web
  - The system suggests queries used before by others (see the sketch below)
  - Compare to collaborative filtering

Query Assist
[screenshot: query suggestions in a web search interface]
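As a rough illustration of query-log-based suggestion (the toy log and the rank-by-frequency heuristic are my own assumptions, not a description of any particular engine):

```python
from collections import Counter

# A tiny toy query log; a real log would hold millions of past queries.
query_log = [
    "information retrieval", "information retrieval evaluation",
    "information theory", "information retrieval evaluation",
    "precision recall", "information retrieval course",
]

def suggest(prefix, log, k=3):
    """Suggest the k most frequent past queries that extend the typed prefix."""
    counts = Counter(q for q in log if q.startswith(prefix) and q != prefix)
    return [q for q, _ in counts.most_common(k)]

print(suggest("information", query_log))
# e.g. ['information retrieval evaluation', 'information retrieval', 'information theory']
```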
Thesaurus-Based Query Expansion
- For each query term, expand the query from the thesaurus
  - feline → feline cat
- May weight added terms less than the original terms
- Generally increases recall
- May significantly decrease precision, particularly with ambiguous terms
  - interest rate → interest rate fascinate evaluate
- Widely used in many science/engineering fields
- There is a high cost of manually producing and maintaining a thesaurus

Automatic Thesaurus Generation
- Attempt to generate a thesaurus automatically by analyzing the collection of documents
- Definition 1: two words are similar if they co-occur with similar words (see the sketch below)
- Definition 2: two words are similar if they occur in the same grammatical relations with the same words
  - You can harvest, peel, eat, prepare, etc. apples and pears, so apples and pears must be similar
- Definition 1 is more robust, definition 2 is more accurate
  - Why is that?

Automatic Thesaurus Generation: Example
[table: automatically generated word associations]

Automatic Thesaurus Generation: Discussion
- The quality of the associations is usually a problem
- Term ambiguity may introduce irrelevant statistically correlated terms
  - Apple computer → Apple red fruit computer
- Problems:
  - False positives: words deemed similar that are not
  - False negatives: words deemed dissimilar that are similar
- Since terms are highly correlated anyway, expansion may not retrieve many additional documents
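A sketch of definition 1 above (two words are similar if they co-occur with similar words), using document-level co-occurrence counts and cosine similarity; the toy corpus is my own:

```python
import math
from collections import defaultdict

docs = [
    "you can harvest peel eat and prepare apples",
    "you can harvest peel eat and prepare pears",
    "the satellite launch was delayed",
]

# Build co-occurrence vectors: for each word, count the other words
# appearing in the same document.
cooc = defaultdict(lambda: defaultdict(int))
for doc in docs:
    words = doc.split()
    for w in words:
        for v in words:
            if v != w:
                cooc[w][v] += 1

def cosine(a, b):
    """Cosine similarity between two sparse count vectors (dicts)."""
    dot = sum(a[k] * b.get(k, 0) for k in a)
    na = math.sqrt(sum(x * x for x in a.values()))
    nb = math.sqrt(sum(x * x for x in b.values()))
    return dot / (na * nb) if na and nb else 0.0

# "apples" and "pears" co-occur with the same verbs, so they come out similar;
# "satellite" shares no context words with "apples".
print(cosine(cooc["apples"], cooc["pears"]))      # high
print(cosine(cooc["apples"], cooc["satellite"]))  # 0.0
```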
Next
- Computer hall session (November 25)
  - Grå (Osquars Backe 2, level 5)
  - Examination of computer assignment 1
  - Book a time in the Doodle (see course homepage)
- Lecture 6 (November 30)
  - D32
  - Readings: Manning Chapter 10