A Simple and Efficient Sampling Method for Estimating AP and nDCG
1 A Simple and Efficient Sampling Method for Estimating AP and nDCG. Emine Yilmaz, Microsoft Research, Cambridge, UK. Evangelos Kanoulas and Javed Aslam, Northeastern University, Boston, USA.
2 Introduction. Obtaining relevance judgments: relevance judgments are expensive. TREC uses depth-k pooling, but document collections can be very large and depth pooling is still expensive (85,600 judgments for TREC 8; at 3 min/doc, 40 hrs/wk, 50 wks/year, that is 2.14 man-years!). Evaluation with incomplete judgments: bpref (Buckley and Voorhees, SIGIR 04); evaluation using condensed lists (Sakai, SIGIR 07); methods for ranking systems with fewer judgments (Carterette et al., SIGIR 06; Moffat et al., SIGIR 07); methods directly estimating measures with fewer judgments (Aslam et al., SIGIR 06; Yilmaz and Aslam, CIKM 06).
3 Motivation. Inferred AP (Yilmaz and Aslam, CIKM 06): no confidence intervals associated with the estimates; the incomplete relevance judgments must be a random subset of the complete judgments. Importance sampling (Aslam et al., SIGIR 06): difficult to compute confidence intervals; overly complicated. Goal: combine the advantages of the two approaches, providing confidence intervals for inferred AP and extending inferred AP to incorporate nonrandom judgments.
4 Inferred AP [Yilmaz and Aslam, CIKM 06]. Average precision as a random experiment: (1) select a relevant document at random; let k be the rank of this document; (2) select a rank at random from the set {1, ..., k}; (3) output the binary relevance of the document at this rank. AP is the average (step 1) of precisions at relevant documents (steps 2 and 3).
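Since the judged documents are a uniform random subset, the expected outcome of steps 2 and 3 can be estimated from the judgments alone. A sketch of the estimator, assuming every ranked document belongs to the sampled pool (the published CIKM 06 formula carries an extra factor for documents outside the pool); |rel| and |nonrel| count judged documents above rank k, and ε is a small smoothing constant:

```latex
% Expected precision at a judged relevant document at rank k:
% with prob. 1/k the experiment picks rank k itself (relevant);
% with prob. (k-1)/k it picks a rank above k, whose expected
% relevance is estimated from the judged documents in the top k-1.
E[PC(k)] \;=\; \frac{1}{k}
  \;+\; \frac{k-1}{k}\cdot\frac{|rel|+\epsilon}{|rel|+|nonrel|+2\epsilon}

% infAP: average over the judged relevant documents
\mathrm{infAP} \;=\; \frac{1}{|\mathrm{judged\ rel}|}
  \sum_{k \,\in\, \mathrm{judged\ rel}} E[PC(k)]
```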
5-7 Inferred AP [Yilmaz and Aslam, CIKM 06]. Example ranked list (R = relevant, N = nonrelevant): 1.R 2.N 3.R 4.R 5.N 6.R 7.N 8.N 9.R 10.N. Given a random subset of judgments, the expected precision is computed at each judged relevant rank and the results are averaged: PC(1) = 1, PC(3) = 0.625, PC(9) = ..., giving infAP = 0.7268.
8 Variance in Inferred AP. Inferred AP is unbiased in expectation but varies in practice. To obtain variance estimates and confidence intervals, the random experiment can be realized as two-stage sampling.
9 Variance in Inferred AP. Two-stage sampling, stage 1: a sample of cut-off levels (relevant documents) and their average precisions; this contributes the 1st variance component.
10 Variance in Inferred AP. Two-stage sampling, stage 2: a sample of documents above each selected cut-off level, used to compute the precisions; this contributes the 2nd variance component.
11 Variance in Inferred AP. By the law of total variance: total variance in inferred AP = stage-1 variance + stage-2 variance. Variance of mean infAP = total variance in infAP / (# of queries)². Confidence intervals are assigned to mean infAP according to the Central Limit Theorem.
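In symbols (a sketch; the exact slide formulas are not in the transcription): stage 1 picks the cut-off level K, stage 2 estimates the precision PC at K, Q is the number of queries, and z_{α/2} is the standard normal quantile.

```latex
% Law of total variance over the two sampling stages
\operatorname{Var}(\widehat{AP})
  = \operatorname{Var}_K\!\big(E[\widehat{PC}\mid K]\big)   % stage 1
  + E_K\!\big[\operatorname{Var}(\widehat{PC}\mid K)\big]   % stage 2

% Per-query variances combine into the variance of the mean,
% which yields a CLT-based confidence interval
\operatorname{Var}(\overline{\mathrm{infAP}})
  = \frac{1}{Q^2}\sum_{q=1}^{Q}\operatorname{Var}(\widehat{AP}_q),
\qquad
\overline{\mathrm{infAP}} \pm z_{\alpha/2}
  \sqrt{\operatorname{Var}(\overline{\mathrm{infAP}})}
```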
12-13 Confidence Intervals for Mean InfAP [figures]
14 Confidence Intervals for Mean InfAP. Percentage of mean infAP values deviating from actual MAP values (TREC 8). Cumulative distribution function of infAP values: the CDF of the number of standard deviations by which mean infAP deviates from the actual MAP, compared against the CDF of the normal distribution. Kolmogorov-Smirnov test: for 90% of systems the normality hypothesis cannot be rejected (α = 0.05).
15 Confidence Intervals for Mean InfAP [figure]
16 Stratified Random Sampling. Goal: an unbiased estimator of AP with decreased variance. Evaluation measures give more weight to documents towards the top of the list, so a top-heavy sampling strategy can reduce the variance of mean infAP.
17 Stratified Random Sampling. Divide the complete pool of judgments into strata (disjoint contiguous subsets); randomly sample some documents from each stratum to be judged; the sampling percentage can differ across strata; evaluate search engines with the sampled documents. Example over the ranked list above: 1st stratum, p = 60%; 2nd stratum, p = 40%.
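A minimal Python sketch of this sampling step (the two-stratum split and rates come from the running example; the pool, boundaries, and helper names are otherwise assumptions):

```python
import random

def stratified_sample(pooled_docs, strata):
    """Pick a random subset of a ranked pool to judge, stratum by stratum.

    pooled_docs: list of doc ids, ordered as in the pool.
    strata: list of (start, end, rate); rate is the fraction of the
            stratum pooled_docs[start:end] that gets judged.
    """
    judged = set()
    for start, end, rate in strata:
        stratum = pooled_docs[start:end]
        k = round(rate * len(stratum))
        judged.update(random.sample(stratum, k))  # uniform within the stratum
    return judged

# Running example: 10 pooled documents, top stratum sampled at 60%,
# bottom stratum at 40% (a top-heavy design).
docs = [f"d{i}" for i in range(1, 11)]
print(stratified_sample(docs, [(0, 5, 0.6), (5, 10, 0.4)]))
```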
18 Extended infAP (xinfAP). Select a relevant document at random (1st step): the selected relevant document can fall in any of the strata, so by the definition of conditional expectation the estimate decomposes as E[AP] = Σ_s P(relevant document from stratum s) · E[precision | stratum s].
19-20 Extended infAP (xinfAP). Select a relevant document at random (1st step): the probability of picking a relevant document from stratum s is estimated from the judged relevant documents in s and the stratum's sampling rate, as sketched below.
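The equations on these slides did not survive transcription; a hedged reconstruction from the sampling design, with r_s the number of judged relevant documents in stratum s and p_s its sampling rate:

```latex
% Estimated number of relevant documents in stratum s
\hat{R}_s = \frac{r_s}{p_s}

% Probability that a relevant document drawn uniformly at random
% falls in stratum s
P(\text{stratum } s) = \frac{\hat{R}_s}{\sum_{s'} \hat{R}_{s'}}
```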
21 Extended infAP (xinfAP). Example: the ranked list above split into a 1st stratum (p = 60%) and a 2nd stratum (p = 40%).
22 Extended infAP (xinfAP). Select a relevant document at random (1st step): within each stratum, the judged documents are a uniform random subset of all documents, so the uniform distribution over the relevant documents is realized by averaging the precisions at the judged relevant documents of that stratum.
23-27 Extended infAP (xinfAP). Precision at a relevant document at rank k (2nd and 3rd steps): select a rank at random from the set {1, ..., k} and output the binary relevance of the document at this rank. With probability 1/k we pick the current document (known to be relevant); with probability (k-1)/k we pick a document above rank k, where the probability of picking a document (above k) from stratum s is proportional to the number of top-(k-1) documents falling in s, and the expected relevance within each stratum is estimated from that stratum's judged documents (see the sketch below).
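The slide equations are again lost; a hedged reconstruction of the expected precision at a judged relevant rank k, where r_{s,k} and n_{s,k} count the judged relevant and nonrelevant documents among the top k-1 documents falling in stratum s, and ε is a small smoothing constant:

```latex
E[PC(k)] \;=\; \frac{1}{k}
  \;+\; \frac{k-1}{k}\sum_{s}
        \frac{|\,s \cap \text{top } (k-1)\,|}{k-1}\cdot
        \frac{r_{s,k} + \epsilon}{r_{s,k} + n_{s,k} + 2\epsilon}
```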
28-29 Extended infAP (xinfAP). Worked example: expected precisions computed over the two-stratum sample of the ranked list above (1st stratum p = 60%, 2nd stratum p = 40%).
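A compact Python sketch of the whole xinfAP computation as described on slides 18-29 (the function and variable names, the smoothing value, and the handling of strata without judged documents are assumptions, not the authors' reference implementation):

```python
from collections import defaultdict

EPS = 1e-5  # small smoothing constant; the actual value is an assumption

def xinfap(ranking, strata_of, rates, judgments):
    """Estimate AP from a stratified random sample of judgments.

    ranking:   list of doc ids in ranked order (rank 1 first).
    strata_of: dict doc id -> stratum id (stratification of the pool).
    rates:     dict stratum id -> sampling rate used when judging.
    judgments: dict doc id -> 1 (relevant) / 0 (nonrelevant), sampled docs only.
    """
    # Estimated number of relevant docs per stratum: judged relevant / rate.
    rel_hat = defaultdict(float)
    for doc, rel in judgments.items():
        if rel:
            rel_hat[strata_of[doc]] += 1.0 / rates[strata_of[doc]]
    total_rel = sum(rel_hat.values())

    # Per-stratum expected precisions at judged relevant ranks.
    pc_by_stratum = defaultdict(list)
    for idx, doc in enumerate(ranking):
        k = idx + 1
        if judgments.get(doc) != 1:
            continue
        # E[PC(k)]: prob 1/k pick rank k itself; prob (k-1)/k pick above,
        # split over strata by how many of the top k-1 docs fall in each.
        e_pc = 1.0 / k
        if k > 1:
            above = ranking[: k - 1]
            for s in rates:
                in_s = [d for d in above if strata_of[d] == s]
                if not in_s:
                    continue
                r = sum(1 for d in in_s if judgments.get(d) == 1)
                n = sum(1 for d in in_s if judgments.get(d) == 0)
                e_pc += ((k - 1) / k) * (len(in_s) / (k - 1)) \
                        * (r + EPS) / (r + n + 2 * EPS)
        pc_by_stratum[strata_of[doc]].append(e_pc)

    # Weight each stratum by its estimated share of relevant documents.
    return sum((rel_hat[s] / total_rel) * (sum(pcs) / len(pcs))
               for s, pcs in pc_by_stratum.items())
```

On the running example this would be called with the 10-document ranking, a map splitting the list into a top and a bottom stratum, rates of 0.6 and 0.4, and the sampled judgments.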
30-32 TREC Terabyte 06. Judging setup: the available judgments consist of a depth-50 pool together with a remainder of which P% is judged. Standard measures are computed from the depth-50 pool alone; inferred AP additionally uses the P% judged sample of the remainder.
33 Simulate Terabyte Setup on TREC 8 data. Assume complete judgments: the depth-100 pool. Form different depth-k pools, k ∈ {1, 2, 3, 4, 5, 10, 20, 30, 40, 50}. For each k, compute the total number of documents in the depth-k pool; randomly sample an equal number of documents from the complete judgment set (excluding the depth-k pool); assume the remaining documents are unjudged; evaluate search engines with the sampled documents.
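A minimal sketch of one round of this simulation, assuming hypothetical helpers depth_pool(k) (documents in the depth-k pool) and complete_judgments (the depth-100 judgment set):

```python
import random

def simulate_terabyte(depth_pool, complete_judgments, k):
    """Build a simulated judgment set: the depth-k pool plus an equally
    sized random sample drawn from the rest of the complete judgments;
    everything else is treated as unjudged."""
    pool_k = set(depth_pool(k))
    rest = [d for d in complete_judgments if d not in pool_k]
    sampled = random.sample(rest, min(len(pool_k), len(rest)))
    return pool_k | set(sampled)
```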
34 Comparison of the measures: RMS error [figure]
35 Comparison of the measures: Kendall's tau [figure]
36 Inferred nDCG (infNDCG). Apply the same methodology to nDCG: estimate DCG and the ideal DCG (DCG_I) separately. E[DCG_I] can be computed using the estimated number of relevant documents for each relevance grade.
37 DCG as a Random Experiment. For each rank i, associate a variable equal to that rank's discounted gain (scaled so that its expectation equals DCG). DCG as a random experiment: (1) select a document at random; let i be its rank; (2) output the value of the associated variable.
38 Estimating DCG with Incomplete Judgments. DCG as a random experiment, as above: since the judged documents form a random sample within each stratum, the expected output, and hence DCG, can be estimated from the judged documents alone due to properties of conditional expectation.
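A hedged reconstruction of the lost formulas, using the standard graded-gain DCG definition (the logarithm base and gain function on the original slides may differ); N is the list length and rel_i the relevance grade at rank i:

```latex
DCG \;=\; \sum_{i=1}^{N} \frac{2^{rel_i}-1}{\log_2(i+1)}
\qquad
x_i \;=\; N\cdot\frac{2^{rel_i}-1}{\log_2(i+1)}

% Picking a rank I uniformly from {1, ..., N} and outputting x_I
% yields E[x_I] = DCG, so judged documents give an unbiased estimate.
```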
39-40 Overall Results: TREC 8 [figures]
41-42 Overall Results: TREC 10 [figures]
43 Conclusions