Advances on the Development of Evaluation Measures. Ben Carterette Evangelos Kanoulas Emine Yilmaz


1 Advances on the Development of Evaluation Measures Ben Carterette Evangelos Kanoulas Emine Yilmaz

2 Information Retrieval Systems Match information seekers with the information they seek

3 Why is Evaluation so Important? "What you can't measure you can't improve" (Lord Kelvin). Most retrieval systems are tuned to optimize for an objective evaluation metric.

4 Outline Intro to evaluation Different approaches to evaluation Traditional evaluation measures User model based evaluation measures Session Evaluation Novelty and Diversity 4

5 Online Evaluation Design interactive experiments; use users' actions (click / no click) to evaluate the quality of the results.

6 Online Evaluation Standard click metrics Clickthrough rate Queries per user Probability user skips over results they have considered (pskip) Result interleaving

7 What is result interleaving? A way to compare rankers online: given the two rankings produced by two methods, present a combination of the rankings to users and assign credit based on clicks.

8 Team Draft Interleaving (Radlinski et al., 2008) Interleaving two rankings Input: Two rankings Repeat: Toss a coin to see which team picks next Winner picks their best remaining player Loser picks their best remaining player Output: One ranking Credit assignment Ranking providing more of the clicked results wins
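A minimal sketch of this procedure in Python (illustrative only; the function names and the simple more-clicked-results-wins rule are assumptions based on the description above, not the authors' reference code):

```python
import random

def team_draft_interleave(ranking_a, ranking_b):
    """Team-draft interleaving: returns the combined ranking plus each team's picks."""
    all_docs = set(ranking_a) | set(ranking_b)
    interleaved, team_a, team_b = [], [], []
    while len(interleaved) < len(all_docs):
        # Coin toss decides which team picks first in this round
        order = [(ranking_a, team_a), (ranking_b, team_b)]
        if random.random() < 0.5:
            order.reverse()
        for ranking, team in order:
            # The team adds its best remaining (not yet shown) document
            doc = next((d for d in ranking if d not in interleaved), None)
            if doc is not None:
                interleaved.append(doc)
                team.append(doc)
    return interleaved, team_a, team_b

def winner(clicked, team_a, team_b):
    """Credit assignment: the ranking that contributed more clicked results wins."""
    a, b = len(set(clicked) & set(team_a)), len(set(clicked) & set(team_b))
    return "A" if a > b else "B" if b > a else "tie"
```

Applied to the Napa Valley example on the next slides, clicks falling mostly on documents contributed by Ranking B make B the winner.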

9 Team Draft Interleaving. Ranking A: 1. Napa Valley The authority for lodging 2. Napa Valley Wineries - Plan your wine 3. Napa Valley College 4. Been There Tips Napa Valley 5. Napa Valley Wineries and Wine 6. Napa Country, California Wikipedia. Ranking B: 1. Napa Country, California Wikipedia (en.wikipedia.org/wiki/napa_valley) 2. Napa Valley The authority for lodging 3. Napa: The Story of an American Eden 4. Napa Valley Hotels Bed and Breakfast 5. NapaValley.org 6. The Napa Valley Marathon. Presented Ranking (team draft of A and B): 1. Napa Valley The authority for lodging 2. Napa Country, California Wikipedia 3. Napa: The Story of an American Eden 4. Napa Valley Wineries Plan your wine 5. Napa Valley Hotels Bed and Breakfast 6. Napa Valley College 7. NapaValley.org

10 Team Draft Interleaving (continued). The same presented ranking with the user's clicks marked: the clicked results were contributed by Ranking B, so B wins!

11 Offline Evaluation Controlled laboratory experiments The user's interaction with the engine is only simulated Ask experts to judge each query result Predict how users behave when they search Aggregate judgments to evaluate

12 Offline Evaluation (Documents → Judge → User model → Evaluate) Ask experts to judge each query result Predict how users behave when they search Aggregate judgments to evaluate

13 Online vs. Offline Evaluation. Online pros: cheap; measures actual user reactions. Online cons: need to go live; noisy; slow; not duplicable. Offline pros: fast to evaluate; easy to try new ideas; portable. Offline cons: needs ground truth; slow to obtain judgments; expensive; inconsistent; difficult to model how users behave.

14 Outline Intro to evaluation Different approaches to evaluation Traditional evaluation measures User model based evaluation measures Session Evaluation Novelty and Diversity 14

15 Traditional Experiment Search engines produce result lists; judges assess them to answer: how many good docs have I missed/found?

16 Depth-k Pooling The top-k documents from each of the M systems (sys 1 ... sys M) are pooled and given to the judge.

17 Depth-k Pooling Each system ranks documents to depth z; only the top k from each ranking (documents A, B, C, D, ...) enter the pool for judging.

18 Depth-k Pooling Pooled documents are judged relevant (R) or non-relevant (N); documents outside the pool remain unjudged (?).

19 Depth-k Pooling Unjudged documents are treated as non-relevant (N) during evaluation.

20 Reusable Test Collections Document Corpus Topics Topic 1 Topic 2 Topic N Relevance Judgments 20

21 Evaluation Metrics: Precision vs Recall Retrieved list: R N R N N R N N N R

22 Visualizing Retrieval Performance: Precision-Recall Curves List: R N R N N R N N N R

23 Evaluation Metrics: Average Precision List: R N R N N R N N N R
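To make the three metrics concrete, a short sketch computing precision@k, recall@k, and average precision for the example list above (assuming, for illustration, that the 4 relevant documents shown are all the relevant documents that exist):

```python
ranking = ["R", "N", "R", "N", "N", "R", "N", "N", "N", "R"]
R_total = 4  # assumption: all relevant documents appear in the list

def precision_at(k, ranking):
    return sum(1 for d in ranking[:k] if d == "R") / k

def recall_at(k, ranking, R_total):
    return sum(1 for d in ranking[:k] if d == "R") / R_total

def average_precision(ranking, R_total):
    # Mean of precision@k over the ranks k at which a relevant document appears
    return sum(precision_at(k, ranking)
               for k, d in enumerate(ranking, start=1) if d == "R") / R_total

print(precision_at(5, ranking))             # 0.4
print(recall_at(5, ranking, R_total))       # 0.5
print(average_precision(ranking, R_total))  # (1/1 + 2/3 + 3/6 + 4/10)/4 ~= 0.64
```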

24 Outline Intro to evaluation Different approaches to evaluation Traditional evaluation measures User model based evaluation measures Session Evaluation Novelty and Diversity 24

25 User Models Behind Traditional Metrics Precision@k: users always look at the top k documents; what fraction of the top k documents are relevant? Recall: users would like to find all the relevant documents; what fraction of these documents have been retrieved by the search engine?

26 User Model of Average Precision (Robertson 08) 1. User steps down a ranked list one-by-one 2. Stops browsing due to satisfaction: stops with a certain probability after observing a relevant document 3. Gains utility from each relevant document

27 User Model of Average Precision (Robertson 08) The probability that the user stops browsing is uniform over all the relevant documents: P(n) = 1/R if the document at rank n is relevant, 0 otherwise. The utility the user gains when stopping at a relevant document at rank n is precision at rank n: U(n) = (1/n) Σ_{k=1}^{n} rel(k). AP can then be written as AP = Σ_n P(n) U(n).
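A quick check that this user-model formulation reproduces ordinary AP on the earlier example list (sketch; again assumes all R relevant documents appear in the ranking):

```python
ranking = ["R", "N", "R", "N", "N", "R", "N", "N", "N", "R"]
R = 4  # total number of relevant documents (assumed all retrieved)

def ap_user_model(ranking, R):
    total = 0.0
    for n, doc in enumerate(ranking, start=1):
        stop_prob = (1.0 / R) if doc == "R" else 0.0            # P(n)
        utility = sum(1 for d in ranking[:n] if d == "R") / n   # precision@n = U(n)
        total += stop_prob * utility
    return total

print(ap_user_model(ranking, R))  # ~= 0.64, identical to classic AP
```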

28 User Model Based Evaluation Measures Directly aim at evaluating user satisfaction An effectiveness measure should be correlated with the user's experience Hence the interest in effectiveness measures based on explicit models of user interaction: devise a user model correlated with user behavior, then infer an evaluation metric from the user model

29 Basic User Model Simple model of user interaction: 1. User steps down ranked results one-by-one 2. Stops at a document at rank k with some probability P(k) 3. Gains some utility U(k) from relevant documents. M = Σ_k U(k) P(k)

30 Basic User Model 1. Discount: What is the chance a user will visit a document? Model of the browsing behavior 2. Utility: What does the user gain by visiting a document?

31 Model Browsing Behavior black powder ammunition Position-based models The chance of observing a document depends on the position it is presented in the ranked list.

32 Rank Biased Precision black powder ammunition Browsing model: Query → View Next Item → with some probability view the next item, otherwise Stop

33 Rank Biased Precision black powder ammunition RBP = (1 - θ) Σ_{i=1}^{∞} rel_i θ^(i-1)
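A one-function sketch of RBP as written above, with persistence parameter θ (binary gains assumed for illustration):

```python
def rbp(rels, theta=0.8):
    """Rank-Biased Precision: RBP = (1 - theta) * sum_i rel_i * theta**(i-1)."""
    return (1 - theta) * sum(rel * theta ** (i - 1)
                             for i, rel in enumerate(rels, start=1))

print(rbp([1, 0, 1, 0, 0, 1, 0, 0, 0, 1], theta=0.8))  # ~= 0.42
```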

34 Discounted Cumulative Gain black powder ammunition Example ranking with graded relevance (HR R N N HR R N R N N): relevance score rel_r per document; gain = 2^rel_r - 1; discount by rank = 1/log_2(r+1); discounted gain = gain x discount

35 Discounted Cumulative Gain DCG can be written as: DCG = Σ_{r=1}^{N} P(user visits doc r) x Utility(r). The discount function models the probability that the user visits (clicks on) the document at rank r; currently P(user clicks on doc r) = 1/log_2(r+1)

36 Discounted Cumulative Gain Instead of stopping probability, think about viewing probability This fits in discounted gain model framework:

37 Normalised Discounted Cumulative Gain black powder ammunition Same example (HR R N N HR R N R N N): gain = 2^rel_r - 1, discount = 1/log_2(r+1), discounted gain as before; NDCG = DCG / optDCG
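A small sketch of DCG and nDCG with the gain and discount defined above, using the graded example ranking from the slide (HR = 2, R = 1, N = 0):

```python
import math

GRADE = {"HR": 2, "R": 1, "N": 0}

def dcg(grades):
    # gain 2^rel - 1, discount 1/log2(r+1)
    return sum((2 ** GRADE[g] - 1) / math.log2(r + 1)
               for r, g in enumerate(grades, start=1))

def ndcg(grades):
    ideal = sorted(grades, key=lambda g: GRADE[g], reverse=True)  # optimal reordering
    return dcg(grades) / dcg(ideal)

ranking = ["HR", "R", "N", "N", "HR", "R", "N", "R", "N", "N"]
print(dcg(ranking), ndcg(ranking))
```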

38 Model Browsing Behavior black powder ammunition Cascade-based models The user views search results from top to bottom At each rank i, the user has a certain probability of being satisfied. Probability of satisfaction proportional to the relevance grade of the document at rank i. Once the user is satisfied with a document, he terminates the search.

39 Rank Biased Precision black powder ammunition Browsing model: Query → View Next Item → with some probability view the next item, otherwise Stop

40 Expected Reciprocal Rank [Chapelle et al. CIKM09] black powder ammunition Browsing model: Query → View Next Item → Relevant? (highly / somewhat / no) → Stop if satisfied, otherwise view the next item

41 Expected Reciprocal Rank [Chapelle et al. CIKM09] black powder ammunition φ(r) = 1/r: utility of finding "the perfect document" at rank r; g_r: relevance grade of the r-th document; R_r = (2^g_r - 1) / 2^g_max: probability of relevance of doc r. P(user stops at position r) = R_r Π_{i=1}^{r-1} (1 - R_i), and ERR = Σ_{r=1}^{n} (1/r) P(user stops at position r).
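A sketch of ERR following the formula above, for grades 0..g_max (the example grades below mirror the HR/R/N ranking used earlier, with g_max = 2 assumed):

```python
def err(grades, g_max=2):
    """Expected Reciprocal Rank for graded relevance (grades 0..g_max)."""
    total, p_not_stopped = 0.0, 1.0
    for r, g in enumerate(grades, start=1):
        R_r = (2 ** g - 1) / 2 ** g_max      # probability the doc satisfies the user
        total += p_not_stopped * R_r / r     # stop here with prob p_not_stopped * R_r
        p_not_stopped *= (1 - R_r)
    return total

print(err([2, 1, 0, 0, 2, 1, 0, 1, 0, 0]))   # graded version of the example ranking
```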

42 Metrics derived from Query Logs Use the query logs to understand how users behave Learn the parameters of the user model from the query logs Utility, discount, etc.

43 Metrics derived from Query Logs Users tend to stop searching if they are satisfied or frustrated; P(observe a doc at rank r) is highly affected by snippet quality. P(Stop|R) by relevance grade: Bad 0.49, Fair 0.41, Good 0.37, Excellent 0.53, Perfect 0.76. P(C|R) (click probability given relevance) by grade: Bad 0.50, Fair 0.49, Good 0.45, Excellent 0.59, Perfect 0.79.

44 Metrics derived from Query Logs Users behave differently for different queries: informational vs. navigational. Per-grade P(C|R) and P(Stop|R) differ between navigational and informational queries across the grades Bad, Fair, Good, Excellent, Perfect.

45 Expected Browsing Utility (Yilmaz et al. CIKM 10) D_EBU(r) = P(E_r) P(C|R_r); EBU = Σ_{r=1}^{n} D_EBU(r) R_r
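A heavily simplified sketch of the EBU idea: the discount D_EBU(r) multiplies the probability of examining rank r by the probability of clicking given the snippet's relevance grade. The click probabilities reuse the P(C|R) values from slide 43; the examination model and the normalized gain are placeholder assumptions, since in the paper both are estimated from query logs:

```python
# P(C|R) keyed by grade 0..4 (Bad..Perfect), taken from the table on slide 43
P_CLICK_GIVEN_REL = {0: 0.50, 1: 0.49, 2: 0.45, 3: 0.59, 4: 0.79}

def ebu(grades, p_continue=0.8, g_max=4):
    """EBU sketch: sum_r P(E_r) * P(C|R_r) * gain(r).
    P(E_r) here is a simple position-based examination model (an assumption);
    the normalized graded gain is also an assumption."""
    total, p_examine = 0.0, 1.0
    for g in grades:
        gain = (2 ** g - 1) / (2 ** g_max - 1)
        total += p_examine * P_CLICK_GIVEN_REL[g] * gain
        p_examine *= p_continue                    # chance the user keeps browsing
    return total

print(ebu([4, 1, 0, 0, 3, 1, 0, 1, 0, 0]))
```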

46 Basic User Model 1. Discount: What is the chance a user will visit a document? Model of the browsing behavior 2. Utility: What does the user gain by visiting a document? Mostly ad-hoc, no clear user model

47 Graded Average Precision (Robertson et al. SIGIR 10) What does it mean for one document to be more useful than another? One possible meaning: one document is useful to more users than another. Hence the following: assume grades of relevance, but the user has a threshold relevance grade which defines a binary view; different users have different thresholds, described by a probability distribution over users.

48 Graded Average Precision [Robertson et al. SIGIR10] User has binary view of relevance by thresholding the relevance scale Relevance Scale Highly Relevant Relevant Considered relevant with probability g 1 Irrelevant

49 Graded Average Precision [Robertson et al. SIGIR10] User has binary view of relevance by thresholding the relevance scale Relevance Scale Highly Relevant Relevant Considered relevant with probability g 2 Irrelevant

50 Graded Average Precision Assume relevance grades {0...c}: 0 for non-relevant, plus c positive grades. g_i = P(user threshold is at i) for i in {1...c}, i.e. the user regards grades {i...c} as relevant and grades {0...(i-1)} as not relevant; the g_i sum to one. Step down the ranked list, stopping at documents that may be relevant, then calculate expected precision at each of these (expected over the population of users).

51 Graded Average Precision (GAP) Relevance 1 HR 2 R 3 N 4 N 5 R 6 HR 7 R

52 Graded Average Precision (GAP) With probability g_1 (threshold at grade 1) the ranking HR R N N R HR R is seen as Rel Rel N N Rel Rel Rel, so prec@6 = 4/6

53 Graded Average Precision (GAP) With probability g_2 (threshold at grade 2) only the highly relevant documents count: HR R N N R HR R is seen as Rel N N N N Rel N, so prec@6 = 2/6

54 Graded Average Precision (GAP) Expected precision at rank 6 over the population of users: wprec@6 = g_1 (4/6) + g_2 (2/6)
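A sketch that reproduces this expected-precision calculation: with probability g_i the user thresholds the grades at i, and the weighted precision is the g-weighted average of the resulting binary precisions (the threshold distribution below is an illustrative assumption):

```python
GRADE = {"N": 0, "R": 1, "HR": 2}

def weighted_precision_at(k, ranking, g):
    """Expected precision@k over users whose relevance threshold i has probability g[i].
    Under threshold i, grades >= i count as relevant (slides 51-54)."""
    wprec = 0.0
    for i, g_i in g.items():
        rel = sum(1 for d in ranking[:k] if GRADE[d] >= i)
        wprec += g_i * rel / k
    return wprec

ranking = ["HR", "R", "N", "N", "R", "HR", "R"]
g = {1: 0.5, 2: 0.5}   # example threshold distribution, must sum to 1 (assumption)
print(weighted_precision_at(6, ranking, g))   # 0.5*(4/6) + 0.5*(2/6) = 0.5
```

Full GAP then averages these expected precisions over the stopping points, as described on slide 50.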

55 Probability Models Almost all the measures we've discussed are based on probabilistic models of users. Most have one or more parameters representing something about user behavior. Is there a way to incorporate variability in the user population? How do we estimate parameter values? Is a single point estimate good enough?

56 Choosing Parameter Values Parameter θ models a user: higher θ means more patience and more results viewed; lower θ means less patience and fewer results viewed. Different approaches: minimize variance in evaluation (Kanoulas & Aslam, CIKM 09); use a click log and fit a model to gaps between clicks (Zhang et al., IRJ, 2010). All try to infer a single value for the parameters.

57 Distribution of Patience for RBP Form a distribution P(θ) Sampling from P(θ) is like sampling a user defined by their patience How can we form a proper distribution of θ? Idea: mine logged search engine user data Look at ranks users are clicking Estimate patience based on absence or presence of clicks

58 Modeling Patience from Log Data We assume a flat prior on θ that we want to update using log data L. Decompose L into individual search sessions. For each session q, count: c_q, the total number of clicks, and r_q, the total number of no-clicks. Model c_q with a negative binomial distribution conditional on r_q and θ.

59 Modeling Patience from Log Data Marginalize P(θ|L) over r; apply Bayes' rule to P(θ|r, L): P(L|θ, r) is the likelihood of the observed clicks

60 Complete Model Expression Model components result in three equations to estimate P(θ|L)

61 Empirical Patience Profiles: Navigational Queries

62 Empirical Patience Profiles: Informational Queries

63 Extend to ERR Parameters

64 Evaluation Using Parameter Distributions Monte Carlo procedure: sample a parameter value from P(θ|L) (or a vector of values for ERR), compute the measure with the sampled value, and iterate to form a distribution P(RBP) or P(ERR)
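A sketch of this Monte Carlo procedure for RBP; the posterior P(θ|L) is stood in for by a Beta distribution, an assumption, since the real posterior comes from the log-based model on the preceding slides:

```python
import random

def rbp(rels, theta):
    return (1 - theta) * sum(rel * theta ** (i - 1)
                             for i, rel in enumerate(rels, start=1))

def rbp_distribution(rels, n_samples=10000, alpha=8, beta=2):
    """Sample theta ~ P(theta|L) (approximated here by Beta(alpha, beta), an
    assumption) and compute RBP for each sample, yielding a distribution P(RBP)."""
    return [rbp(rels, random.betavariate(alpha, beta)) for _ in range(n_samples)]

samples = rbp_distribution([1, 0, 1, 0, 0, 1, 0, 0, 0, 1])
print(sum(samples) / len(samples))   # expected RBP under the patience distribution
```

The same samples support the marginal-distribution analysis a few slides later: P(RBP_1 > RBP_2) can be estimated by the fraction of samples in which system 1's value exceeds system 2's.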

65 Marginal Distribution Analysis S1 = [R N N N N N N N N N], S2 = [N R R R R R R R R R]

66 Distribution of RBP

67 Distribution of ERR

68 Marginal Distribution Analysis Given two systems, over all choices of θ: what is P(M1 > M2)? What is P((M1 - M2) > t)?

69 Marginal Distribution Analysis

70 Outline Intro to evaluation Different approaches to evaluation Traditional evaluation measures User model based evaluation measures Session Evaluation Novelty and Diversity 71

71 Why sessions? The current evaluation framework assesses the effectiveness of systems over one-shot queries, but users reformulate their initial query. This would still be fine if optimizing a system for one-shot queries led to optimal performance over an entire session.

72 Why sessions? When was the DuPont Science Essay Contest created? Initial Query : DuPont Science Essay Contest Reformulation : When was the DSEC created? e.g. retrieval systems should accumulate information along a session

73 Example of a multi-query session: successive queries about Paris hotels ("Paris", "Luxurious Hotels", "Hilton", "Hotels Paris").

74 Extend the evaluation framework From one query evaluation To multi-query sessions evaluation

75 Construct appropriate test collections Rethink of evaluation measures

76 Basic test collection A set of information needs: "A friend from Kenya is visiting you and you'd like to surprise him by cooking a traditional swahili dish. You would like to search online to decide which dish you will cook at home." A static sequence of m queries (initial query, 1st reformulation, 2nd reformulation, ..., (m-1)th reformulation), e.g.: kenya cooking traditional; kenya cooking traditional swahili; kenya swahili traditional food recipes

77 Basic Test Collection Factual/Amorphous, Known-item search Intellectual/Amorphous, Explanatory search Factual/Amorphous, Known-item search

78 Experiment Ranked results for the queries: kenya cooking traditional; kenya cooking traditional swahili; kenya swahili traditional food recipes

79 Experiment (continued) Ranked results for the queries: kenya cooking traditional; kenya cooking traditional swahili; kenya swahili traditional food recipes

80 Construct appropriate test collections Rethink of evaluation measures

81 What is a good system?

82 How can we measure goodness?

83 Measuring goodness The user steps down a ranked list of documents and observes each one of them until a decision point, and then either (a) abandons the search or (b) reformulates. While stepping down or sideways, the user accumulates utility.

84 What are the challenges?

85 Evaluation over a single ranked list kenya cooking traditional; kenya cooking traditional swahili; kenya swahili traditional food recipes

86

87 Session DCG [Järvelin et al. ECIR 2008] kenya cooking traditional; kenya cooking traditional swahili. DCG(RL1) is weighted by 1/log_c(1 + c - 1) and DCG(RL2) by 1/log_c(2 + c - 1), where DCG(RLi) = Σ_{r=1}^{k} (2^rel(r) - 1) / log_b(r + b - 1); in general, sDCG = Σ_i (1/log_c(i + c - 1)) DCG(RLi)
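A sketch of sDCG as reconstructed above; the log bases b (within a list) and c (across reformulations) are free parameters, and the values and graded judgments below are illustrative assumptions:

```python
import math

def dcg(rels, b=2):
    """Within-list DCG with discount 1/log_b(r + b - 1)."""
    return sum((2 ** rel - 1) / math.log(r + b - 1, b)
               for r, rel in enumerate(rels, start=1))

def sdcg(session, b=2, c=4):
    """Session DCG: the i-th reformulation's DCG is discounted by 1/log_c(i + c - 1).
    b and c are free parameters; the values here are illustrative assumptions."""
    return sum(dcg(rels, b) / math.log(i + c - 1, c)
               for i, rels in enumerate(session, start=1))

# three ranked lists of graded judgments for the kenya cooking session (made-up grades)
session = [[0, 1, 1, 0, 0], [1, 1, 0, 0, 0], [2, 1, 1, 0, 0]]
print(sdcg(session))
```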

88 Session Metrics Session DCG [Järvelin et al ECIR 2008] The user steps down the ranked list until rank k and reformulates [Deterministic; no early abandonment]

89 Model-based measures Probabilistic space of users following different paths: Ω is the space of all paths, P(ω) is the probability of a user following a path ω in Ω, U(ω) is the utility of path ω in Ω; expected utility = Σ_{ω in Ω} P(ω) U(ω) [Yang and Lad ICTIR 2009]

90 Expected Global Utility [Yang and Lad ICTIR 2009] 1. User steps down ranked results one-by-one 2. Stops browsing documents based on a stochastic process that defines a stopping probability distribution over ranks and reformulates 3. Gains something from relevant documents, accumulating utility

91 Expected Global Utility [Yang and Lad ICTIR 2009] The probability of a user following a path ω: P(ω) = P(r_1, r_2, ..., r_K), where r_i is the stopping and reformulation point in list i. Assumption: stopping positions in each list are independent, so P(r_1, r_2, ..., r_K) = P(r_1) P(r_2) ... P(r_K). Use a geometric distribution (as in RBP) to model the stopping and reformulation behaviour: P(r_i = r) = (1 - θ) θ^(r-1)

92 Expected Global Utility Example: three ranked lists Q1, Q2, Q3 of relevant (R) and non-relevant (N) documents; the stopping/reformulation rank in each list is drawn from a geometric distribution with parameter θ

93 Session Metrics Session DCG [Järvelin et al ECIR 2008] The user steps down the ranked list until rank k and reformulates [Deterministic; no early abandonment] Expected global utility [Yang and Lad ICTIR 2009] The user steps down a ranked list of documents until a decision point and reformulates [Stochastic; no early abandonment]

94 Model-based measures Probabilistic space of users following different paths: Ω is the space of all paths, P(ω) is the probability of a user following a path ω in Ω, M_ω is a measure over a path ω; esM = Σ_{ω in Ω} P(ω) M_ω [Kanoulas et al. SIGIR 2011]

95 Probability of a path Example, over three ranked lists Q1, Q2, Q3: the probability of a path is (1) the probability of abandoning at reformulation 2, times (2) the probability of reformulating at rank 3

96 (1) Probability of abandoning the session at reformulation i: geometric distribution with parameter p_reform (illustrated over the Q1, Q2, Q3 example)

97 (1) Probability of abandoning the session at reformulation i: truncated geometric distribution with parameter p_reform (truncated at the last reformulation of the session)

98 (2) Probability of reformulating at rank j: geometric distribution with parameter p_down (abandonment at reformulation i: truncated geometric with parameter p_reform)
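Putting these pieces together, a simplified sketch of an expected session measure: enumerate paths, weight each by the (truncated) geometric probabilities above, and average a per-path measure. For brevity a single examined-rank cut-off is shared by all lists in a path and the per-path measure just counts relevant documents seen; both are simplifying assumptions rather than the exact formulation of Kanoulas et al.:

```python
def esm(session, p_reform=0.5, p_down=0.7, max_rank=10, path_measure=sum):
    """Expected session measure (sketch): esM = sum_w P(w) * M_w.
    A path w = (i, j): the user abandons the session at reformulation i
    (truncated geometric in p_reform) and examines ranks 1..j of each list
    visited (geometric in p_down, truncated at max_rank)."""
    K = len(session)
    p_abandon = [(1 - p_reform) ** (i - 1) * p_reform for i in range(1, K + 1)]
    total_a = sum(p_abandon)
    p_abandon = [p / total_a for p in p_abandon]           # truncate at reformulation K
    p_rank = [(1 - p_down) ** (j - 1) * p_down for j in range(1, max_rank + 1)]
    total_r = sum(p_rank)
    p_rank = [p / total_r for p in p_rank]                 # truncate at max_rank

    expected = 0.0
    for i, p_i in enumerate(p_abandon, start=1):
        for j, p_j in enumerate(p_rank, start=1):
            seen = [rel for rels in session[:i] for rel in rels[:j]]
            expected += p_i * p_j * path_measure(seen)     # M_w = relevant docs seen
    return expected

session = [[0, 1, 1, 0, 0], [1, 1, 0, 0, 0], [1, 1, 1, 0, 0]]
print(esm(session))
```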

99 Session Metrics Session DCG [Järvelin et al ECIR 2008] The user steps down the ranked list until rank k and reformulates [Deterministic; no early abandonment] Expected global utility [Yang and Lad ICTIR 2009] The user steps down a ranked list of documents until a decision point and reformulates [Stochastic; no early abandonment] Expected session measures [Kanoulas et al. SIGIR 2011] The user steps down a ranked list of documents until a decision point and either abandons the query or reformulates [Stochastic; allows early abandonment]

100 Outline Intro to evaluation Different approaches to evaluation Traditional evaluation measures User model based evaluation measures Session Evaluation Novelty and Diversity 101

101 Novelty The redundancy problem: the first relevant document contains some useful information; every document with the same information after that is worth less to the user, but worth the same to traditional evaluation measures. Novelty retrieval attempts to ensure that ranked results do not have much redundancy.

102 Example query: oil-producing nations (members of OPEC, North Atlantic nations, South American nations). 10 relevant articles about OPEC are probably not as useful as one relevant article about each group, and one relevant article about all oil-producing nations might be even better.

103 How to Evaluate? One approach: List subtopics, aspects, or facets of the topic Judge each document relevant or not to each possible subtopic For oil-producing nations, subtopics could be names of nations Saudi Arabia, Russia, Canada,

104 Subtopic Relevance Example

105 Evaluation Measures Subtopic recall and precision (Zhai et al., 2003) Subtopic recall at rank k: count unique subtopics in the top k documents, divide by the total number of known unique subtopics. Subtopic precision at recall r: find the least k at which subtopic recall r is achieved, find the least k at which subtopic recall r could possibly be achieved (by a perfect system), divide the latter by the former. Models a user that wants all subtopics and doesn't care about redundancy as long as they are seeing new information.
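A sketch of subtopic recall and subtopic precision as defined above; the document-to-subtopic judgments and the ideal ranking below are illustrative assumptions:

```python
def subtopic_recall_at(k, ranking, subtopics, n_subtopics):
    covered = set().union(*(subtopics[d] for d in ranking[:k]))
    return len(covered) / n_subtopics

def min_rank_for_recall(r, ranking, subtopics, n_subtopics):
    """Smallest k at which subtopic recall r is reached (None if never)."""
    for k in range(1, len(ranking) + 1):
        if subtopic_recall_at(k, ranking, subtopics, n_subtopics) >= r:
            return k
    return None

def subtopic_precision(r, ranking, ideal_ranking, subtopics, n_subtopics):
    """minRank of a perfect system divided by minRank of the evaluated system."""
    k_sys = min_rank_for_recall(r, ranking, subtopics, n_subtopics)
    k_opt = min_rank_for_recall(r, ideal_ranking, subtopics, n_subtopics)
    return k_opt / k_sys if k_sys else 0.0

# illustrative judgments: which subtopics (1..3) each document covers
subtopics = {"d1": {1}, "d2": {1}, "d3": {2}, "d4": {2, 3}, "d5": {3}}
ranking = ["d1", "d2", "d3", "d4", "d5"]
ideal = ["d4", "d1", "d3", "d2", "d5"]     # covers all 3 subtopics within 2 docs
print(subtopic_recall_at(3, ranking, subtopics, 3))            # 2/3
print(subtopic_precision(1.0, ranking, ideal, subtopics, 3))   # 2/4 = 0.5
```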

106 Subtopic Relevance Evaluation Copyright Ben Carterette

107 Diversity Short keyword queries are inherently ambiguous An automatic system can never know the user's intent Diversification attempts to retrieve results that may be relevant to a space of possible intents

108 Evaluation Measures Subtopic recall and precision, this time with judgments for intents rather than subtopics. Measures that know about intents: the intent-aware family of measures (Agrawal et al.), the D and D♯ measures (Sakai et al.), α-nDCG (Clarke et al.), ERR-IA (Chapelle et al.)

109 Intent-Aware Measures Assume there is a probability distribution P(i Q) over intents for a query Q Probability that a randomly-sampled user means intent i when submitting query Q The intent-aware version of a measure is its weighted average over this distribution
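A sketch of the intent-aware construction with precision@k as the base measure (any measure can be plugged in; the intent distribution and judgments are illustrative assumptions):

```python
def precision_at(k, rels):
    return sum(rels[:k]) / k

def intent_aware(measure, k, intent_probs, rels_per_intent):
    """M-IA = sum_i P(i|Q) * M computed against intent i's judgments."""
    return sum(p * measure(k, rels_per_intent[i]) for i, p in intent_probs.items())

intent_probs = {"schedule": 0.6, "roster": 0.3, "tickets": 0.1}   # P(i|Q), illustrative
rels_per_intent = {                                               # per-intent binary judgments
    "schedule": [1, 0, 1, 0, 0],
    "roster":   [0, 1, 0, 0, 1],
    "tickets":  [0, 0, 0, 1, 0],
}
print(intent_aware(precision_at, 5, intent_probs, rels_per_intent))  # 0.38
```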

110 Worked example (figure): the intent-aware measure is the weighted average of the per-intent measure values, with intent weights such as 0.35 and 0.1, giving 0.23 here.

111 D-measure Take the idea of intent-awareness and apply it to computing document gain: the gain for a document is the (weighted) average of its gains for the subtopics it is relevant to. D-nDCG is nDCG computed using intent-aware gains.

112 Worked example (figure): D-DCG is the sum of intent-aware gains discounted by rank, e.g. 0.35/log 2 + .../log 3 + ...

113 α-nDCG α-nDCG is a generalization of nDCG that accounts for both novelty and diversity; α is a geometric penalization for redundancy. Redefine the gain of a document: +1 for each subtopic it is relevant to, multiplied by (1-α) for each document higher in the ranking in which that subtopic already appeared. The discount is the same as usual.
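A sketch of the α-nDCG gain and discount described above; only α-DCG is computed here, since finding the ideal ranking needed for the normalization is itself a hard optimization problem (the subtopic judgments below are illustrative):

```python
import math
from collections import defaultdict

def alpha_dcg(ranking, subtopics, alpha=0.5):
    """alpha-DCG: gain(d) = sum over d's subtopics of (1 - alpha)^(times already seen)."""
    seen = defaultdict(int)          # how often each subtopic has already appeared
    score = 0.0
    for r, d in enumerate(ranking, start=1):
        gain = sum((1 - alpha) ** seen[s] for s in subtopics[d])
        score += gain / math.log2(r + 1)
        for s in subtopics[d]:
            seen[s] += 1
    return score

subtopics = {"d1": {1}, "d2": {1, 2}, "d3": {2}, "d4": {3}}   # illustrative judgments
print(alpha_dcg(["d1", "d2", "d3", "d4"], subtopics))
```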

114 Worked example (figure): per-subtopic gain contributions of 1, (1-α), (1-α)^2, ... depending on how many higher-ranked documents already covered the subtopic.

115 ERR-IA Intent-aware version of ERR, but with appealing properties other IA measures do not have: it ranges between 0 and 1, and it is submodular (diminishing returns for relevance to a given subtopic, i.e. built-in redundancy penalization). It also has appealing properties over α-nDCG: it easily handles graded subtopic judgments and intent distributions.

116 Granularity of Judging What exactly is a subtopic? Perhaps any piece of information a user may be interested in finding? At what granularity should subtopics be defined? For example: cardinals has many possible meanings; cardinals baseball team is still very broad; cardinals baseball team schedule covers 6 months; cardinals baseball team schedule august covers ~25 games; cardinals baseball team schedule august 12th

117 Preference Judgments for Novelty What about evaluating novelty with no subtopic judgments? Preference judgments: is document A more relevant than document B? Conditional preference judgments: is document A better than document B given that I've just seen document C? Assumption: the preference is based on novelty over C. Is it true? Come to our presentation on Wednesday

118

119 Conclusions Strong interest in using evaluation measures to model user behavior and satisfaction Driven by availability of user logs, increased computational power, good abstract models DCG, RBP, ERR, EBU, session measures, diversity measures all model users in different ways Cranfield-style evaluation is still important! But there is still much to understand about users and how they derive satisfaction

120 Conclusions Ongoing and future work: Models with more degrees of freedom Direct simulation of users from start of session to finish Application to other domains Thank you! Slides will be available online
