Web Information Retrieval Exercises: Evaluation in Information Retrieval
Evaluating an IR system. Note: an information need is translated into a query, and relevance is assessed relative to the information need, not the query. E.g., information need: I'm looking for information on whether drinking red wine is more effective at reducing your risk of heart attacks than white wine. Query: wine red white heart attack effective. Evaluate whether the doc addresses the information need, not whether it contains these words.
Standard relevance benchmarks. TREC: the National Institute of Standards and Technology (NIST) has run a large IR test bed for many years. Reuters and other benchmark doc collections are also used. Retrieval tasks are specified, sometimes as queries. Human experts mark, for each query and each doc, Relevant or Irrelevant (or at least for the subset of docs that some system returned for that query).
Unranked results. We next assume that the search engine returns a set of documents as potentially relevant, without performing any ranking. We want to assess the quality of these results.
Precision and Recall. Precision: fraction of retrieved docs that are relevant = P(relevant | retrieved). Recall: fraction of relevant docs that are retrieved = P(retrieved | relevant).

                Relevant   Irrelevant
Retrieved       tp         fp
Not retrieved   fn         tn

Precision P = tp/(tp + fp). Recall R = tp/(tp + fn).
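The two definitions can be sketched directly from the contingency-table counts; the tp/fp/fn values below are hypothetical, chosen only for illustration:

```python
# Minimal sketch: precision and recall from contingency-table counts.

def precision(tp, fp):
    # Fraction of retrieved docs that are relevant.
    return tp / (tp + fp)

def recall(tp, fn):
    # Fraction of relevant docs that are retrieved.
    return tp / (tp + fn)

tp, fp, fn = 8, 10, 12   # e.g. 8 relevant retrieved, 10 junk, 12 relevant missed
print(precision(tp, fp))  # 8/18 ≈ 0.444
print(recall(tp, fn))     # 8/20 = 0.4
```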
A combined measure: F. A combined measure that assesses the precision/recall tradeoff is the F measure (weighted harmonic mean): F = 1 / (α(1/P) + (1 − α)(1/R)) = (β² + 1)PR / (β²P + R). People usually use the balanced F1 measure, i.e., with β = 1 (equivalently α = ½).
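The weighted form above can be sketched as a one-line function; the P and R values used are hypothetical:

```python
def f_measure(P, R, beta=1.0):
    # Weighted harmonic mean: F = (beta^2 + 1) * P * R / (beta^2 * P + R).
    # beta = 1 gives the balanced F1; beta > 1 weights recall more heavily.
    return (beta**2 + 1) * P * R / (beta**2 * P + R)

P, R = 0.5, 0.4                 # hypothetical precision and recall
print(f_measure(P, R))          # balanced F1 = 2PR/(P+R) ≈ 0.444
print(f_measure(P, R, beta=2))  # ≈ 0.417, pulled toward recall
```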
Accuracy. Given a query, an engine classifies each doc as Relevant or Irrelevant. The accuracy of an engine is the fraction of these classifications that are correct: (tp + tn) / (tp + fp + fn + tn). Accuracy is a commonly used evaluation measure in machine learning classification work. Q.: Why is this not a very useful evaluation measure in IR?
Why not just use accuracy? How to build a 99.9999% accurate search engine on a low budget: answer every search with "0 matching results found." People doing information retrieval want to find something and have a certain tolerance for junk.
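The arithmetic behind the joke can be sketched with hypothetical numbers: a large collection in which almost no documents are relevant to any given query.

```python
# Sketch: why accuracy is misleading in IR. Hypothetical collection of
# 1,000,000 docs with only 10 relevant to the query.
N, relevant = 1_000_000, 10

# An engine that retrieves nothing: tp = fp = 0, fn = relevant, tn = rest.
tp, fp, fn, tn = 0, 0, relevant, N - relevant
accuracy = (tp + tn) / (tp + fp + fn + tn)
print(accuracy)  # 0.99999 — yet recall is 0 and precision is undefined
```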
Precision/Recall Q.: An IR system returns 8 relevant documents, and 10 nonrelevant documents. There are a total of 20 relevant documents in the collection. What is the precision of the system on this search, and what is its recall?
Precision/Recall Q.: An IR system returns 8 relevant documents, and 10 nonrelevant documents. There are a total of 20 relevant documents in the collection. What is the precision of the system on this search, and what is its recall? A.: By applying the definitions: Precision = 8/18 ≈ 0.44; Recall = 8/20 = 0.4.
Precision/Recall Q.: The balanced F measure (a.k.a. F1) is defined as the harmonic mean of precision and recall. What is the advantage of using the harmonic mean rather than averaging (using the arithmetic mean)?
Precision/Recall Q.: The balanced F measure (a.k.a. F1) is defined as the harmonic mean of precision and recall. What is the advantage of using the harmonic mean rather than averaging (using the arithmetic mean)? A.: The arithmetic mean is pulled toward the higher of the two values, so it is not a good representative of both; the harmonic mean, on the other hand, is closer to min(precision, recall), so it is small whenever either value is small. For example, when all documents are returned, recall is 1 and the arithmetic mean exceeds 0.5 even though precision is very low and the purpose of the search engine is not served.
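The return-everything case can be checked numerically; the precision value below is a hypothetical tiny number standing in for "almost all junk":

```python
# Sketch: arithmetic vs harmonic mean when an engine returns everything.
# Full retrieval makes R = 1; P is hypothetically tiny.
P, R = 0.001, 1.0

arithmetic = (P + R) / 2
harmonic = 2 * P * R / (P + R)
print(arithmetic)  # > 0.5 even though the results are nearly all junk
print(harmonic)    # ≈ 0.002, close to min(P, R)
```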
Precision/Recall Q.: Derive the equivalence between the two formulas for F measure, given that α = 1/( β 2 + 1).
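A worked sketch of the derivation: substitute α = 1/(β² + 1), hence 1 − α = β²/(β² + 1), into the harmonic-mean form and simplify:

```latex
\begin{aligned}
F &= \frac{1}{\alpha \frac{1}{P} + (1-\alpha)\frac{1}{R}}
   = \frac{1}{\frac{1}{(\beta^2+1)P} + \frac{\beta^2}{(\beta^2+1)R}} \\
  &= \frac{\beta^2+1}{\frac{1}{P} + \frac{\beta^2}{R}}
   = \frac{(\beta^2+1)PR}{R + \beta^2 P}
\end{aligned}
```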
Ranked retrieval performance
Interpolated precision. If you can increase precision by increasing recall, then you should get to count that: the interpolated precision at a recall level r is the maximum of the precisions at all recall levels ≥ r (i.e., to the right of r on the precision-recall curve).
11-point interpolated average precision. The standard measure in the TREC competitions: you take the interpolated precision at 11 recall levels, varying from 0 to 1 in steps of 0.1 (the value at recall 0 is always interpolated!), and average them.
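The measure can be sketched end to end; the ranking and the total number of relevant documents below are hypothetical (True marks a relevant document):

```python
# Sketch: 11-point interpolated average precision for a hypothetical ranking.
ranking = [True, False, True, False, True]
total_relevant = 3   # relevant docs in the whole collection

# Precision/recall at each cutoff down the ranked list.
points = []
tp = 0
for k, rel in enumerate(ranking, start=1):
    tp += rel
    points.append((tp / total_relevant, tp / k))  # (recall, precision)

def interpolated(r):
    # Interpolated precision at recall r: max precision at recall >= r.
    return max((p for rec, p in points if rec >= r), default=0.0)

eleven_point = sum(interpolated(i / 10) for i in range(11)) / 11
print(eleven_point)  # average over recall levels 0.0, 0.1, ..., 1.0
```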
Ranked retrieval results Q.: What are the possible values for interpolated precision at a recall level of 0?
Ranked retrieval results Q.: What are the possible values for interpolated precision at a recall level of 0? A.: any value between 0 and 1. If the first returned document is relevant it is 1; otherwise it equals the highest precision achieved anywhere down the ranking.
Ranked retrieval results Q.: Must there always be a break-even point between precision and recall? Either show there must be or give a counter-example.
Ranked retrieval results Q.: Consider a ranked list of results. Must there always be a break-even point between precision and recall? Either show there must be or give a counter-example. A.: Note that precision = recall iff fp = fn or tp = 0. If the highest-ranked element is not relevant, then tp = 0 and that is a trivial break-even point. If the highest-ranked element is relevant, consider going down the list: if an item is R, fn decreases by 1 and fp does not change; if it is N, fp increases by 1 and fn does not change. So the difference fp - fn increases by exactly 1 at each step. At the beginning of the list fp < fn; at the end of the list fp > fn. Since fp - fn changes by one per step, it must pass through 0, so there has to be a break-even point.
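The fp = fn argument can be checked by scanning a ranking; the list below is a hypothetical example whose 3 relevant documents all appear in it:

```python
# Sketch: find precision/recall break-even points down a ranked list.
ranking = [True, False, True, False, True]   # True = relevant
total_relevant = 3

break_even = []
tp = 0
for k, rel in enumerate(ranking, start=1):
    tp += rel
    fp, fn = k - tp, total_relevant - tp
    if fp == fn:                 # here precision == recall (tp > 0)
        break_even.append(k)

print(break_even)  # fp - fn grows by 1 per step, so it must hit 0
```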
Ranked retrieval results Q.: What is the relationship between the value of F1 and the break-even point?
Ranked retrieval results Q.: What is the relationship between the value of F1 and the break-even point? A.: At the break-even point: F1 = P = R.
Ranked retrieval results Q.: Consider an information need for which there are 4 relevant documents in the collection. Contrast two systems run on this collection. Their top 10 results are judged for relevance as follows (the leftmost item is the top ranked search result): System 1 R N R N N N N N R R System 2 N R N N R R R N N N a. What is the MAP of each system? Which has a higher MAP? b. Does this result intuitively make sense? What does it say about what is important in getting a good MAP score? c. What is the R-precision of each system? (Does it rank the systems the same as MAP?)
Ranked retrieval results A.: a. MAP(System 1) = (1/4)(1/1 + 2/3 + 3/9 + 4/10) = 0.6; MAP(System 2) = (1/4)(1/2 + 2/5 + 3/6 + 4/7) ≈ 0.493. System 1 has the higher MAP. b. For a good MAP score, it is essential to have relevant documents among the first few (3-5) retrieved ones. c. R-precision(System 1) = 2/4 = 1/2; R-precision(System 2) = 1/4. This ranks the systems the same as MAP.
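The numbers above can be reproduced with a short sketch (True marks a relevant document in the top-10 ranking; 4 relevant docs exist in total):

```python
# Sketch: MAP (here, average precision for one query) and R-precision.
system1 = [True, False, True, False, False, False, False, False, True, True]
system2 = [False, True, False, False, True, True, True, False, False, False]

def average_precision(ranking, total_relevant):
    # Mean of the precision values at ranks where relevant docs appear,
    # divided by the total number of relevant docs in the collection.
    tp, precisions = 0, []
    for k, rel in enumerate(ranking, start=1):
        if rel:
            tp += 1
            precisions.append(tp / k)
    return sum(precisions) / total_relevant

def r_precision(ranking, total_relevant):
    # Precision at rank R, where R = number of relevant docs (here 4).
    return sum(ranking[:total_relevant]) / total_relevant

print(average_precision(system1, 4), r_precision(system1, 4))  # 0.6, 0.5
print(average_precision(system2, 4), r_precision(system2, 4))  # ≈0.493, 0.25
```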
Ranked retrieval results/1 Q.: The following list of Rs and Ns represents relevant (R) and nonrelevant (N) returned documents in a ranked list of 20 documents retrieved in response to a query from a collection of 10,000 documents. The top of the ranked list (the document the system thinks is most likely to be relevant) is on the left of the list. This list shows 6 relevant documents. Assume that there are 8 relevant documents in total in the collection. R R N N N N N N R N R N N N R N N N N R a. What is the precision of the system on the top 20? b. What is the F1 on the top 20? c. What is the interpolated precision of the system at 20% recall? d. What is the interpolated precision of the system at 25% recall?
Ranked retrieval results/2 A.: a. The precision on the top 20 is 6/20 = 0.3. b. Recall is 6/8 = 0.75, so F1 = 2PR/(P + R) = 3/7 ≈ 0.43. c. Observe that the first two documents are relevant; since there are 8 relevant documents in total, they correspond to 25% recall at precision 1, so the interpolated precision at 20% recall is 1. d. Consider the precision values at all recall levels exceeding 25% and pick the maximum.
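All four answers can be worked out mechanically; the sketch below encodes the ranking from the question (for d it follows the slide's hint and takes the maximum precision at recall strictly above 25%):

```python
# Sketch: answers a-d for the ranked list of 20 documents (R = relevant).
ranking = [c == 'R' for c in "RRNNNNNNRNRNNNRNNNNR"]
total_relevant = 8   # relevant docs in the whole collection

tp = sum(ranking)                 # 6 relevant docs in the top 20
P, R = tp / len(ranking), tp / total_relevant
F1 = 2 * P * R / (P + R)
print(P, F1)                      # a. 0.3   b. ≈ 0.43

# Recall/precision at each cutoff down the list.
points, seen = [], 0
for k, rel in enumerate(ranking, start=1):
    seen += rel
    points.append((seen / total_relevant, seen / k))

# c. Interpolated precision at 20% recall: max precision at recall >= 0.2.
print(max(p for rec, p in points if rec >= 0.2))   # 1.0
# d. Per the slide's hint: max precision at recall strictly above 25%.
print(max(p for rec, p in points if rec > 0.25))   # 4/11 ≈ 0.36
```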