Why System Evaluation? Performance Evaluation of Information Retrieval Systems

Many slides in this section are adapted from Prof. Joydeep Ghosh (UT ECE), who in turn adapted them from Prof. Dik Lee (Univ. of Science and Tech, Hong Kong).

There are many retrieval models/algorithms/systems; which one is the best? What is the best component for:
- Ranking function (dot-product, cosine, ...)
- Term selection (stopword removal, stemming, ...)
- Term weighting (TF, TF-IDF, ...)
How far down the ranked list will a user need to look to find some/all relevant documents?

Difficulties in Evaluating IR Systems

Effectiveness is related to the relevancy of retrieved items. Relevancy is not typically binary but continuous. Even if relevancy is binary, it can be a difficult judgment to make. Relevancy, from a human standpoint, is:
- Subjective: depends upon a specific user's judgment.
- Situational: relates to the user's current needs.
- Cognitive: depends on human perception and behavior.
- Dynamic: changes over time.

Human Labeled Corpora (Gold Standard)

Start with a corpus of documents. Collect a set of queries for this corpus. Have one or more human experts exhaustively label the relevant documents for each query. This typically assumes binary relevance judgments and requires considerable human effort for large document/query corpora.

Precision and Recall

[Figure: the entire document collection partitioned into relevant vs. irrelevant and retrieved vs. not retrieved documents.]

recall = (number of relevant documents retrieved) / (total number of relevant documents)
precision = (number of relevant documents retrieved) / (total number of retrieved documents)

Precision: the ability to retrieve top-ranked documents that are mostly relevant.
Recall: the ability of the search to find all of the relevant items in the corpus.
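As a minimal sketch, both ratios can be computed from sets of document IDs. The IDs below are made up for illustration (999 is a hypothetical stand-in for a relevant document the system never returns):

```python
def precision(retrieved, relevant):
    """Fraction of retrieved documents that are relevant."""
    retrieved, relevant = set(retrieved), set(relevant)
    return len(retrieved & relevant) / len(retrieved)

def recall(retrieved, relevant):
    """Fraction of all relevant documents that were retrieved."""
    retrieved, relevant = set(retrieved), set(relevant)
    return len(retrieved & relevant) / len(relevant)

retrieved = [588, 589, 576, 590]             # what the system returned
relevant = {588, 589, 590, 592, 772, 999}    # gold-standard judgments

print(precision(retrieved, relevant))  # 3 of 4 retrieved are relevant -> 0.75
print(recall(retrieved, relevant))     # 3 of 6 relevant were found   -> 0.5
```

Note that the two measures pull in opposite directions: retrieving everything drives recall to 1 while precision collapses, which is the trade-off the following slides quantify.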
Determining Recall is Difficult

The total number of relevant items is sometimes not available. Two workarounds:
- Sample across the database and perform relevance judgments on the sampled items.
- Apply different retrieval algorithms to the same database for the same query; the aggregate of relevant items is taken as the total relevant set.

Trade-off between Recall and Precision

[Figure: precision vs. recall curve. The ideal is the upper-right corner; one extreme returns only relevant documents but misses many useful ones, the other returns most relevant documents but includes lots of junk.]

Computing Recall/Precision Points

For a given query, produce the ranked list of retrievals. Adjusting a threshold on this ranked list produces different sets of retrieved documents, and therefore different recall/precision measures. Mark each document in the ranked list that is relevant according to the gold standard. Compute a recall/precision pair for each position in the ranked list that contains a relevant document.

Computing Recall/Precision Points: An Example

n   doc #  relevant
1   588    x
2   589    x
3   576
4   590    x
5   986
6   592    x
7   984
8   988
9   578
10  985
11  103
12  591
13  772    x
14  990

Let the total # of relevant docs = 6. Check each new recall point:
R = 1/6 = 0.167; P = 1/1 = 1
R = 2/6 = 0.333; P = 2/2 = 1
R = 3/6 = 0.5;   P = 3/4 = 0.75
R = 4/6 = 0.667; P = 4/6 = 0.667
R = 5/6 = 0.833; P = 5/13 = 0.385
One relevant document is missing from the list, so we never reach 100% recall.

Interpolating a Recall/Precision Curve

Interpolate a precision value for each standard recall level:
r_j in {0.0, 0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9, 1.0}; r_0 = 0.0, r_1 = 0.1, ..., r_10 = 1.0.
The interpolated precision at the j-th standard recall level is the maximum known precision at any recall level between the j-th and (j+1)-th level:

P(r_j) = max { P(r) : r_j <= r <= r_{j+1} }

Interpolating a Recall/Precision Curve: An Example

[Figure: interpolated precision plotted against recall at the standard levels 0.2, 0.4, 0.6, 0.8, 1.0.]
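The procedure above can be sketched in Python. The ranked list and the total of six relevant documents come from the example (999 stands in for the unretrieved sixth relevant document); the interpolation uses the common convention of taking the maximum precision at any recall level at or above r_j, a slight simplification of the between-adjacent-levels form on the slide:

```python
# Ranked list from the example; relevant docs marked by membership in the set.
ranked = [588, 589, 576, 590, 986, 592, 984, 988, 578, 985, 103, 591, 772, 990]
relevant = {588, 589, 590, 592, 772, 999}  # 999: hypothetical missing 6th doc
total_relevant = len(relevant)

# One (recall, precision) pair at each rank that holds a relevant document.
points = []
hits = 0
for rank, doc in enumerate(ranked, start=1):
    if doc in relevant:
        hits += 1
        points.append((hits / total_relevant, hits / rank))

for r, p in points:
    print(f"R={r:.3f}  P={p:.3f}")   # R=0.167 P=1.000 ... R=0.833 P=0.385

# Interpolated precision at the 11 standard recall levels 0.0, 0.1, ..., 1.0.
levels = [j / 10 for j in range(11)]
interp = [max((p for r, p in points if r >= lvl), default=0.0) for lvl in levels]
```

Since the sixth relevant document never appears, every level above 0.833 gets interpolated precision 0, reflecting the "never reach 100% recall" point above.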
Average Recall/Precision Curve

Typically we average performance over a large set of queries: compute the average precision at each standard recall level across all queries, then plot the average precision/recall curve to evaluate overall system performance on a document/query corpus.

Compare Two or More Systems

The curve closest to the upper right-hand corner of the graph indicates the best performance.

[Figure: precision/recall curves comparing a "Stem" and a "NoStem" system.]

Sample RP Curve for CF Corpus

[Figure: sample recall/precision curve for the CF corpus.]

Problems with Recall/Precision

Recall/precision and its related measures need a pair of numbers and are not very intuitive. Single-value measures include:
- R-precision
- F-measure
- E-measure
- Fallout rate
- ESL
- ASL

R-Precision

Precision at the R-th position in the ranking of results for a query that has R relevant documents.

[Same ranked list as in the earlier example.]
R = # of relevant docs = 6
R-Precision = 4/6 = 0.67

F-Measure

One measure of performance that takes into account both recall and precision: the harmonic mean of recall and precision:

F = 2PR / (P + R) = 2 / (1/R + 1/P)

Compared to the arithmetic mean, both P and R need to be high for the harmonic mean to be high.
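Both single-value measures can be sketched on the same running example (999 again is a hypothetical stand-in for the relevant document the system never returns):

```python
def r_precision(ranked, relevant):
    """Precision at rank R, where R is the number of relevant documents."""
    R = len(relevant)
    return sum(1 for d in ranked[:R] if d in relevant) / R

def f_measure(p, r):
    """Harmonic mean of precision and recall: F = 2PR / (P + R)."""
    return 2 * p * r / (p + r) if p + r else 0.0

ranked = [588, 589, 576, 590, 986, 592, 984, 988, 578, 985, 103, 591, 772, 990]
relevant = {588, 589, 590, 592, 772, 999}

print(r_precision(ranked, relevant))   # top 6 contain 4 relevant -> 4/6
print(f_measure(0.75, 0.5))            # harmonic mean pulls toward the low value
```

The harmonic mean of 0.75 and 0.5 is 0.6, below their arithmetic mean of 0.625, illustrating why both components must be high for F to be high.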
E Measure (parameterized F Measure)

A variant of the F measure that allows weighting emphasis on precision over recall:

E = (1 + β²) PR / (β² P + R) = (1 + β²) / (β²/R + 1/P)

The value of β controls the trade-off:
β = 1: equally weight precision and recall (E = F).
β > 1: weight recall more.
β < 1: weight precision more.

Fallout Rate

Problems with both precision and recall:
- The number of irrelevant documents in the collection is not taken into account.
- Recall is undefined when there is no relevant document in the collection.
- Precision is undefined when no document is retrieved.

Fallout = (no. of nonrelevant items retrieved) / (total no. of nonrelevant items in the collection)

Other Measures

Expected Search Length (ESL) [Cooper 1968]: the average number of documents that must be examined to retrieve a given number of relevant documents:

ESL = Σ_{i=1}^{N} p_i · e_i

where N is the maximum number of relevant documents requested, e_i is the expected search length when i relevant documents are requested, and p_i is the probability of a request for i relevant documents.

Five Types of ESL

Type 1: A user may just want the answer to a very specific factual question or a single statistic. Only one relevant document is needed to satisfy the search request.
Type 2: A user may actually want only a fixed number, for example six, of relevant documents to a query.
Type 3: A user may wish to see all documents relevant to the topic.
Type 4: A user may want to sample a subject area as in Type 2, but wish to specify the ideal size for the sample as some proportion, say one-tenth, of the relevant documents.
Type 5: A user may wish to read all relevant documents in case there are fewer than five, and exactly five in case more than five exist.

Other Measures (cont.)
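A sketch of the E measure and fallout formulas, following the convention above in which larger β shifts emphasis toward recall:

```python
def e_measure(p, r, beta=1.0):
    """E = (1 + beta^2) * P * R / (beta^2 * P + R); beta = 1 reduces to F."""
    b2 = beta * beta
    return (1 + b2) * p * r / (b2 * p + r)

def fallout(nonrelevant_retrieved, nonrelevant_total):
    """Fraction of the collection's nonrelevant documents that were retrieved."""
    return nonrelevant_retrieved / nonrelevant_total

print(e_measure(0.75, 0.5))            # beta = 1: equals F -> 0.6
print(e_measure(0.75, 0.5, beta=2.0))  # pulled toward recall (0.5)
print(fallout(10, 1000))               # 0.01
```

Letting β grow without bound drives E toward R, and letting it shrink toward zero drives E toward P, which is one quick way to check which side a given β favors.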
Average Search Length (ASL) [Losee 1998]: the expected position of a relevant document in the ordered list of all documents.
N: total number of documents
Q: probability that the ranking is optimal (perfect)
A: expected proportion of all documents examined in order to reach the average position of a relevant document in an optimal ranking

ASL = N [QA + (1 − Q)(1 − A)]

Problems

While F-measure, E-measure, ESL, and ASL are single-value measurements:
- they are not easy to measure (compute);
- or they are not intuitive;
- or the data required for the measure are typically not available (e.g., ASL);
- and they don't work well in a web search environment.
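Losee's ASL formula is direct to compute once Q and A are known; the collection size and parameter values below are invented for illustration:

```python
def asl(N, Q, A):
    """Average Search Length: ASL = N * [Q*A + (1 - Q)*(1 - A)].

    N: total number of documents in the collection
    Q: probability that the ranking is optimal
    A: expected proportion of documents examined under an optimal ranking
    """
    return N * (Q * A + (1 - Q) * (1 - A))

# A perfect ranking (Q = 1) examines only the fraction A of the collection:
print(asl(1000, 1.0, 0.05))   # 50 documents on average
# A ranking that is never optimal (Q = 0) examines the complementary fraction:
print(asl(1000, 0.0, 0.05))   # 950 documents on average
```

The difficulty flagged above is visible here: Q and A are exactly the quantities that are rarely available in practice, which is why ASL sees little use in web settings.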
RankPower

We propose a single, effective measure for interactive information search systems such as the web. It takes into consideration both the placement of the relevant documents and the number of relevant documents in a set of retrieved documents for a given query.

Some definitions

For a given query, N documents are returned. Among the N returned documents, R_N is the set of relevant documents, with |R_N| = C_N <= N. Each relevant document i in R_N is placed at rank L_i. The average rank of the returned relevant documents is

R_avg(N) = (1/C_N) Σ_{i=1}^{C_N} L_i

Some properties

- It is a function of two variables: the individual ranks of the relevant documents and the number of relevant documents.
- For a fixed C_N, the more documents listed earlier, the more favorable the value (smaller values are favored).
- If the size of the returned set increases and the number of relevant documents in it also increases, the average rank increases (unbounded).
- In the ideal case where every single returned document is relevant, the average rank is simply (N + 1)/2.

RankPower definition

RankPower(N) = R_avg(N) / C_N = (Σ_{i=1}^{C_N} L_i) / C_N²

RankPower properties

- It is a decreasing function of C_N, since the denominator (C_N²) increases faster than the numerator.
- It is bounded below by 1/2, so the measure can be used as a benchmark to compare different systems.
- It weighs placement very heavily (see the example below for an explanation); documents placed earlier in the list are much favored.
- If two sets of returned documents have the same average rank, the one with more documents is favored.

Examples

Compare two systems, each of which returns a list of 10 documents. System A has two relevant documents listed 1st and 2nd, with a RankPower of (1 + 2)/2² = 0.75. Let's examine some scenarios in which system B can match or surpass system A. If system B returns 3 relevant documents, then unless two of the three are listed 1st and 2nd, it is less favored than A, since the two best such cases give (1+3+4)/3² = 0.89 and (2+3+4)/3² = 1, both greater than A's 0.75. System B needs 6 relevant documents in its top-10 list to beat A if it doesn't capture both the 1st and 2nd places.
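The arithmetic of the example can be checked with a small sketch (remember that smaller RankPower is better):

```python
def rank_power(positions):
    """RankPower = average rank of relevant docs / count = sum(L_i) / C^2."""
    c = len(positions)
    return sum(positions) / (c * c)

print(rank_power([1, 2]))                # System A: 0.75
print(rank_power([1, 3, 4]))             # B's best case with 3 relevant: ~0.89
print(rank_power([2, 3, 4]))             # B's next best case: 1.0
print(rank_power([1, 3, 4, 5, 6, 7]))    # 6 relevant, 1st but not 2nd: ~0.72
```

The last line shows why 6 relevant documents suffice: holding the 1st place but not the 2nd, system B's score 26/36 ≈ 0.72 finally dips below A's 0.75, whereas 5 relevant documents in that configuration give 19/25 = 0.76 and still lose.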
Examples (cont.)

The RankPower measure is tested in a real web search environment. We compare the results of sending 7 queries to AltaVista and MARS (one of our intelligent web search projects), limiting to the first 20 returned results.

            R_avg   C_N    RankPower
MARS        6.59    9.33   0.71
AltaVista   10.4    8.50   1.22

RankPower: A Variation

R_avg(N): average rank of relevant docs among the retrieved docs
C_N: count of relevant docs among the retrieved docs
S_i: position of the i-th relevant document

RankPower(N) = R_avg(N) / C_N = (Σ_{i=1}^{C_N} S_i) / C_N²,  with RankPower(N) >= 0.5

RankPowerAlt(N) = C_N (C_N + 1) / (2 Σ_{i=1}^{C_N} S_i),  with 0 < RankPowerAlt(N) <= 1

Subjective Relevance Measure

Novelty Ratio: the proportion of items retrieved and judged relevant by the user of which they were previously unaware. Measures the ability to find new information on a topic.

Coverage Ratio: the proportion of relevant items retrieved out of the total relevant documents known to the user prior to the search. Relevant when the user wants to locate documents which they have seen before (e.g., the budget report for Year 2000).
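These two subjective ratios are simple set proportions; a sketch with hypothetical document sets standing in for the user's judgments:

```python
def novelty_ratio(retrieved_relevant, previously_known):
    """Proportion of retrieved relevant items the user did not already know."""
    new_items = retrieved_relevant - previously_known
    return len(new_items) / len(retrieved_relevant)

def coverage_ratio(retrieved_relevant, previously_known):
    """Proportion of the user's previously known relevant items retrieved."""
    found = retrieved_relevant & previously_known
    return len(found) / len(previously_known)

retrieved_relevant = {1, 2, 3, 4, 5}   # retrieved and judged relevant by user
previously_known = {4, 5, 6, 7}        # relevant docs the user knew beforehand

print(novelty_ratio(retrieved_relevant, previously_known))   # 3/5 = 0.6
print(coverage_ratio(retrieved_relevant, previously_known))  # 2/4 = 0.5
```

Unlike the system-oriented measures earlier, both ratios depend on what each individual user already knows, so they can only be gathered through user studies rather than from a gold-standard corpus.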