Web Search and Text Mining Lecture 8
Outline
Exploring user clickthrough data for learning ranking functions
Extracting implicit relevance judgments
Learning ranking functions based on preference data
Implicit and Explicit Relevance Judgments
Explicit relevance judgments: produced by human editors, annotators, experts, etc.
Implicit relevance judgments: derived from clickthrough data.
Absolute relevance: a particular document is (or is not) relevant to a query, possibly with a degree of relevance. Relative relevance: a particular document is (or is not) more relevant to a query than some other document. Both notions above are defined at the query level (w.r.t. a single query). Absolute relevance can also be defined at the collection level, because for some ambiguous queries seemingly relevant documents are not in fact as relevant.
Summary of Literature
Kemp & Ramamohanarao (PKDD 2003): document transformation by appending queries to clicked documents; the approach interprets a clickthrough as absolute relevance.
Cohen & Freund, 1998/1999: relative relevance feedback from log data (a different class of functions: combinations of experts).
Beeferman & Berger: use query logs to annotate documents with indexing terms.
Joachims and co-workers, 2002: learning ranking functions from clickthroughs (more on this later).
Agichtein and co-workers, 2006: modeling user behavior for learning ranking functions.
How Is Clickthrough Data Collected? Links on the result page point to a proxy server, which records user clicks. The proxy then uses the HTTP Location header to redirect the user to the target page. The process is transparent to users.
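The mechanism above can be sketched as a tiny redirect handler: the result link points at the proxy, which logs (query, rank, url) and answers with an HTTP 302 whose Location header sends the browser on to the real page. All function and field names below are illustrative, not from the lecture.

```python
# Minimal sketch of a click-logging redirect endpoint.
# The lecture only states that result links point through a proxy that
# records the click and then forwards via the HTTP Location header;
# the structure of the log record here is an assumption.

def log_and_redirect(log, query, rank, target_url):
    """Record one click event, then build an HTTP 302 redirect response."""
    log.append({"query": query, "rank": rank, "url": target_url})
    status = 302
    headers = {"Location": target_url}  # the browser follows this transparently
    return status, headers

clicks = []
status, headers = log_and_redirect(clicks, "ranking svm", 3,
                                   "http://example.com/paper.html")
```

Because the redirect is issued immediately after logging, the user never notices the intermediate hop.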
User Behavior Study
Eye-tracking study of: query formulation, assessment of the result pages, and selection of links to click.
Table 1: Questions used in the study.
Navigational:
- Find the homepage of Michael Jordan, the statistician.
- Find the page displaying the route map for Greyhound buses.
- Find the homepage of the 1000 Acres Dude Ranch.
- Find the homepage for graduate housing at Carnegie Mellon University.
- Find the homepage of Emeril, the chef who has a television cooking program.
Informational:
- Where is the tallest mountain in New York located?
- With the heavy coverage of the democratic presidential primaries, you are excited to cast your vote for a candidate. When are the democratic presidential primaries in New York?
- Which actor starred as the main character in the original Time Machine movie?
- A friend told you that Mr. Cornell used to live close to campus, near University and Steward Ave. Does anybody live in his house now? If so, who?
- What is the name of the researcher who discovered the first modern antibiotic?
What is being looked at, and what is being clicked?
Phase I: 34 undergraduate students from Cornell, of all majors; data recorded for 29 subjects.
Phase II: three conditions: normal (6 subjects), swapped 1st and 2nd results (5), reversed top 10 results (5).
Eye fixations: stable gazes lasting 200-300 milliseconds. There is a correspondence between fixations and the position of abstracts. The number of lines per abstract on Google ranges from two to five.
Relevance Influences User Click Behavior
1. In the reversed condition, significantly more abstracts are scanned.
2. Average rank of clicked documents: normal 2.66, reversed 4.03.
3. Average number of clicks: normal 0.80, reversed 0.64.
Figure 1: Percentage of time an abstract was viewed/clicked on, depending on the rank of the result (percentages taken over search sessions). Mostly only the first two results are looked at.
Evidence Supporting Relative Relevance
Interpreting each click as an endorsement of the document would be an absolute relevance assessment. Two issues argue against this:
1. Trust bias: documents ranked higher tend to be clicked more even when they are not more relevant (the user is influenced by the presentation order).
2. Quality bias: the overall quality of all the abstracts in the result set influences the click decision (average rank of clicked abstracts, with relevance determined by judges: normal 2.67, reversed 3.27).
Figure 2: Mean number of abstracts viewed above and below a clicked link, depending on its rank. (Google returns 10 results on the first page.) Observations: 1) typically at most one abstract below the clicked link is looked at, usually the one immediately following it; 2) users scan from top to bottom.
Using Both Clicked and Non-clicked Documents
Example: results l1, l2, l3, l4, l5, l6, l7, with l1, l3, and l5 clicked.
1. Click > Skip Above: rel(l3) > rel(l2), rel(l5) > rel(l2), rel(l5) > rel(l4).
Earlier clicks are less informative than later clicks.
2. Last Click > Skip Above: rel(l5) > rel(l2), rel(l5) > rel(l4).
Later clicks are more informed than earlier ones.
3. Click > Earlier Click: assuming the order of clicks is 3, 1, then 5: rel(l1) > rel(l3), rel(l5) > rel(l1), rel(l5) > rel(l3). (This strategy turns out not to be supported by the data.)
Some abstracts are not looked at at all, but the abstract immediately before a clicked link is very likely to have been viewed.
4. Last Click > Skip Previous.
5. Click > No-Click Next (less valuable, since it is aligned with the current ranking).
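The five strategies can be made concrete as a small function that turns the ranks clicked in a session into preference pairs. This is a sketch based on the strategy definitions above; names are illustrative and the exact formulation in the underlying paper may differ slightly.

```python
def click_preferences(clicks, n_results):
    """Turn the clicked ranks of one result page into pairwise preferences.

    clicks: clicked ranks (1-based) in the order they were clicked.
    Returns a dict mapping strategy name -> set of (better, worse) rank pairs,
    each pair meaning rel(l_better) > rel(l_worse).
    """
    clicked = set(clicks)
    skipped = [r for r in range(1, n_results + 1) if r not in clicked]
    last = clicks[-1]
    prefs = {}
    # 1. Click > Skip Above: a clicked result beats every skipped result above it.
    prefs["click > skip above"] = {
        (c, s) for c in clicked for s in skipped if s < c}
    # 2. Last Click > Skip Above: only the final click generates preferences.
    prefs["last click > skip above"] = {
        (last, s) for s in skipped if s < last}
    # 3. Click > Earlier Click: later clicks beat earlier ones
    #    (the strategy not supported by the eye-tracking data).
    prefs["click > earlier click"] = {
        (clicks[j], clicks[i])
        for i in range(len(clicks)) for j in range(i + 1, len(clicks))}
    # 4. Last Click > Skip Previous: the final click beats the result just
    #    above it, provided that result was skipped.
    prefs["last click > skip previous"] = (
        {(last, last - 1)} if (last - 1) in skipped else set())
    # 5. Click > No-Click Next: a clicked result beats the unclicked one below it.
    prefs["click > no-click next"] = {
        (c, c + 1) for c in clicked if (c + 1) in skipped}
    return prefs

# Example from the slides: l1..l7, clicked in the order 3, 1, 5.
p = click_preferences([3, 1, 5], 7)
```

Running this on the slide's example reproduces the preference pairs listed for each strategy, e.g. rel(l3) > rel(l2), rel(l5) > rel(l2), rel(l5) > rel(l4) for Click > Skip Above.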
Table 4: Accuracy of several strategies for generating pairwise preferences from clicks. The basis of comparison is either the explicit judgments of the abstracts or the explicit judgments of the pages themselves. Error bars are the larger of the two sides of the 95% binomial confidence interval around the mean. Columns: explicit abstract judgments (Phase I normal; Phase II normal / swapped / reversed / all) and explicit page judgments (Phase II all).

Strategy                | PhI normal  | PhII normal | PhII swapped | PhII reversed | Abstracts all | Pages all
Inter-Judge Agreement   | 89.5        | N/A         | N/A          | N/A           | 82.5          | 86.4
Click > Skip Above      | 80.8 ± 3.6  | 88.0 ± 9.5  | 79.6 ± 8.9   | 83.0 ± 6.7    | 83.1 ± 4.4    | 78.2 ± 5.6
Last Click > Skip Above | 83.1 ± 3.8  | 89.7 ± 9.8  | 77.9 ± 9.9   | 84.6 ± 6.9    | 83.8 ± 4.6    | 80.9 ± 5.1
Click > Earlier Click   | 67.2 ± 12.3 | 75.0 ± 25.8 | 36.8 ± 22.9  | 28.6 ± 27.5   | 46.9 ± 13.9   | 64.3 ± 15.4
Click > Skip Previous   | 82.3 ± 7.3  | 88.9 ± 24.1 | 80.0 ± 18.0  | 79.5 ± 15.4   | 81.6 ± 9.5    | 80.7 ± 9.6
Click > No-Click Next   | 84.1 ± 4.9  | 75.6 ± 14.5 | 66.7 ± 13.1  | 70.0 ± 15.7   | 70.4 ± 8.0    | 67.4 ± 8.2

The table shows the percentage of times the preferences generated from clicks agree with the direction of a strict preference of a relevance judge.
Query Chains
Query 1: NDLF
1. http://.../staffweb/smg/smg970319.html
2. http://.../staffweb/smg/smg970226.html
3. http://.../staffweb/smg/smg960417.html
4. http://.../staffweb/smg/smg960403.html
5. http://.../staffweb/smg/smg960828.html
Query 2: Ezra Cornell residence
1. Dear Uncle Ezra Questions for Tuesday, May...
2. Dear Uncle Ezra Questions for Thursday,...
3. Ezra Cornell had close Albion ties
4. October 1904 Albion 100 Years Ago
5. Cornell competes with Off-Housing market.
Figure 3: Two example queries and result sets.
No relevant results appear in the top 10, but users continue their search with refined queries.
Generating Implicit Relevance Judgments
In many cases the results do not contain relevant documents, or the relevant documents are ranked so low that users do not see them. In those cases, however, users reformulate their queries, and the reformulated queries are often more successful. (NDLF = National Digital Library Foundation.)
Query Sessions
Basic idea: make use of the sequence of queries and clicks in a search session. Example: multiple occurrences of "special collections" followed by "rare books" imply query similarity. A query chain is a sequence of reformulated queries. Using query chains, many more documents can be considered for relevance judgments.
Issues
1. Automatically detect query chains in query logs.
2. Infer relevance judgments both for individual query results and across queries in the same chain.
3. Use the relevance judgments to train a ranking SVM.
Figure 4: Feedback strategies. We either consider a single query q, or a query q that was preceded by an earlier query q' in the same chain. Given a query, a dot represents a result document and an x indicates the result was clicked on. We generate a constraint for each arrow shown, with respect to the query marked. Strategies: Click > Skip Above and Click First > No-Click Second (each applied both within a single query and with respect to an earlier query in the chain), Click > Skip Earlier Query, and Click > Top Two Earlier Query.
Figure 5: Sample query chain and the feedback that would be generated using all six feedback strategies. Two queries were run, each returning three documents; one document in each result set was clicked (x).

q1: d1, d2 (x), d3        q2: d4 (x), d5, d6

Generated preferences:
d2 >q1 d1
d4 >q2 d5
d4 >q1 d5
d4 >q1 d1
d4 >q1 d3
Accuracy of Feedback Judgments (16 judges)
Table 1: Accuracy of the strategies for generating pairwise preferences from clicks. The basis of comparison is the explicit page judgments. Note that the first two rows each cover two preference strategies (the single-query and earlier-query variants).

Strategy                         | Accuracy
Click > Skip Above               | 78.2 ± 5.6
Click First > No-Click Second    | 63.4 ± 16.5
Click > Skip Earlier Query       | 68.0 ± 8.4
Click > Top Two Earlier Query    | 84.5 ± 6.1
Inter-Judge Agreement            | 86.4
Detecting Query Chains
1285 queries were manually grouped into query chains and used as training data for query chain detection. For each pair of queries from the same IP address within half an hour, a feature vector with 16 features is extracted. An SVM trained on these features achieves average accuracy 94.3% and precision 96.5%, vs. 91.6% without using the features. The most important features: CosineDistance(q1, q2) and CosineDistance(doc ids of r1, doc ids of r2).
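The two most important features are simple to compute. Below is a sketch that treats queries as bags of words and result lists as sets of document ids; the paper's actual feature extraction may normalize or tokenize differently.

```python
import math

def cosine_similarity(a, b):
    """Cosine similarity of two bags (dicts mapping item -> count)."""
    dot = sum(a[t] * b.get(t, 0) for t in a)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

def bag(items):
    """Build a count dictionary from a list of tokens or doc ids."""
    counts = {}
    for t in items:
        counts[t] = counts.get(t, 0) + 1
    return counts

# CosineDistance(q1, q2) over query words:
q1, q2 = "special collections".split(), "rare books collections".split()
sim_queries = cosine_similarity(bag(q1), bag(q2))

# CosineDistance(doc ids of r1, doc ids of r2): overlap of the two result lists.
r1 = ["d1", "d2", "d3", "d4"]
r2 = ["d2", "d5", "d1", "d6"]
sim_results = cosine_similarity(bag(r1), bag(r2))
```

Two reformulations of the same information need tend to share query words and, even more reliably, to retrieve overlapping result sets, which is why these two features carry most of the signal.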
Table 2: Features used to learn to classify query chains. q1 and q2 are two queries issued at times t1 and t2, with t1 < t2; r1 and r2 are the respective result sets (top 10 results).
- CosineDistance(q1, q2)
- CosineDistance(doc ids of r1, doc ids of r2)
- CosineDistance(abstracts of r1, abstracts of r2)
- TrigramMatch(q1, q2)
- ShareOneWord(q1, q2)
- ShareTwoWords(q1, q2)
- SharePhraseOfTwoWords(q1, q2)
- NumberOfDifferentWords(q1, q2)
- t2 − t1 compared against thresholds of 5, 10, 30, and 100 seconds (four indicator features)
- t2 − t1 > 100 seconds
- NormalizedNumberOfClicks(r1)
- NormalizedMin(|r1|, |r2|)
- NormalizedMax(|r1|, |r2|)
Learning Ranking Functions
Training data has the form d_i >_q d_j, meaning document d_i is preferred over document d_j for query q.
We learn a linear retrieval function
  rel(d_i, q) = w · Φ(d_i, q),
where Φ(d_i, q) is a feature vector for the query-document pair (d_i, q). The preference d_i >_q d_j should then imply
  w · Φ(d_i, q) > w · Φ(d_j, q).
Not every preference constraint can be satisfied simultaneously, so, as with classification SVMs, we allow some constraints to be violated by introducing slack variables ξ_ij ≥ 0:

  w · Φ(d_i, q) ≥ w · Φ(d_j, q) + 1 − ξ_ij

Although we cannot efficiently find a w that minimizes the number of violated constraints, we can minimize an upper bound on that number, Σ ξ_ij. Simultaneously maximizing the margin leads to the convex quadratic optimization problem

  min_{w, ξ}  (1/2) w · w + C Σ_ij ξ_ij
  subject to  ∀(q, i, j): w · Φ(d_i, q) ≥ w · Φ(d_j, q) + 1 − ξ_ij
              ∀ i, j:     ξ_ij ≥ 0

This is equivalent to training an SVM on the difference vectors Φ(d_i, q) − Φ(d_j, q). We will later add more constraints to the optimization problem, taking advantage of prior knowledge in the learning-to-rank setting.
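The equivalence to a classification SVM on difference vectors suggests a very simple implementation: treat each preference d_i >_q d_j as a constraint w · (Φ(d_i, q) − Φ(d_j, q)) ≥ 1 − ξ and minimize regularized hinge loss. The subgradient-descent sketch below illustrates the idea on toy data; it is not the solver used in the paper.

```python
def train_ranking_svm(pairs, dim, C=1.0, epochs=100, lr=0.01):
    """Pairwise hinge-loss training on difference vectors.

    pairs: list of (phi_better, phi_worse) feature vectors, each encoding
    a preference constraint w . phi_better >= w . phi_worse + 1 - slack.
    Minimizes (1/2)||w||^2 + C * sum(slack) by subgradient descent.
    """
    w = [0.0] * dim
    for _ in range(epochs):
        for better, worse in pairs:
            diff = [b - c for b, c in zip(better, worse)]
            margin = sum(wi * di for wi, di in zip(w, diff))
            for k in range(dim):
                # subgradient of (1/2)||w||^2 + C * max(0, 1 - w . diff)
                grad = w[k] - (C * diff[k] if margin < 1 else 0.0)
                w[k] -= lr * grad
    return w

# Toy data: two features per document; the first feature predicts relevance.
pairs = [([1.0, 0.3], [0.2, 0.9]),   # preferred doc vs. skipped doc
         ([0.9, 0.5], [0.1, 0.4])]
w = train_ranking_svm(pairs, dim=2)
```

After training, w scores every difference vector in the training set positively, i.e. the learned linear function ranks each preferred document above its counterpart.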
Constructing the Mapping Φ(d, q)
Φ(d, q) consists of:
1) Rank features: 28 in total, for ranks 1, 2, ..., 10, 15, 20, ..., 100. The feature for a given rank is set to 1 if the document is at or above that rank in the original results.
2) Term/document features.
Assume the search engine has F original ranking functions.
In the experiments in this paper, F consists of a single ranking function, as provided by Nutch, for the sake of simplicity. The mapping is

  Φ(d, q) = [ φ_rank^{f_1}(d, q); ... ; φ_rank^{f_F}(d, q); φ_terms(d, q) ]

with rank features (1 denotes the indicator function and r_f(q) the result list of ranking function f):

  φ_rank^f(d, q) = [ 1(Rank(d in r_f(q)) ≤ 1), ..., 1(Rank(d in r_f(q)) ≤ 10),
                     1(Rank(d in r_f(q)) ≤ 15), ..., 1(Rank(d in r_f(q)) ≤ 100) ]^T

and term/document association features

  φ_terms(d, q) = [ 1(d = d_1 ∧ t_1 ∈ q), ..., 1(d = d_M ∧ t_N ∈ q) ]^T

for documents d_1, ..., d_M and terms t_1, ..., t_N.
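A sketch of the 28 rank features: one indicator per threshold rank in {1, ..., 10, 15, 20, ..., 100}, set to 1 when the document is at or above that rank. The counts also match the prior-constraint example that follows: a document at rank 100 activates one feature, at rank 95 two, and at rank 1 all 28.

```python
# Threshold ranks 1..10 and 15, 20, ..., 100: 10 + 18 = 28 features.
THRESHOLDS = list(range(1, 11)) + list(range(15, 101, 5))

def phi_rank(rank):
    """Rank features: indicator 1(rank <= threshold) for each threshold.

    rank: position of the document in the original ranking (1-based),
    or None if the document does not appear in the top 100.
    """
    if rank is None:
        return [0] * len(THRESHOLDS)
    return [1 if rank <= t else 0 for t in THRESHOLDS]
```

A document outside the top 100 gets an all-zero rank-feature vector, so its score must come entirely from the term/document features.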
Prior Constraints
Without additional constraints, trivial solutions tend to simply reverse the original search engine order. One set of constraints on the rank-feature weights, w_i ≥ w_min > 0, limits how quickly the training data can change the original ranking.
Example: consider a result set of 100 documents, with d_i ranked at position i.
For the document d_i ranked at position i in r_f(q), the rank features are

  φ_rank^f(d_100, q) = [0, ..., 0, 0, 1]^T
  φ_rank^f(d_95, q)  = [0, ..., 0, 1, 1]^T
  φ_rank^f(d_1, q)   = [1, ..., 1, 1, 1]^T

Calling the part of w that corresponds to the rank features w_rank, the constraints w_i ≥ w_min give

  w_rank · φ_rank^f(d_100, q) ≥ w_min
  w_rank · φ_rank^f(d_95, q)  ≥ 2 w_min
  w_rank · φ_rank^f(d_1, q)   ≥ 28 w_min

Now say we have a document d that is preferred over d_1 but does not appear in the original results. For d to be ranked higher we need rel(d, q) > rel(d_1, q), and since only the term feature for (d, t) is non-zero in φ_terms(d, q), the weight of that term feature must be large. Thus w_min controls how much training evidence is needed before an unseen document can outrank a top-ranked one.
Figure 6: Two example rankings with four results each, and the combined output generated by starting with the top-ranked document from ranking r:

  ranking r:        d1, d2, d3, d4
  ranking r':       d2, d5, d1, d6
  combined(r, r'):  d1, d2, d5, d3, d4, d6
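One way to realize the combination in Figure 6 is balanced interleaving: repeatedly take the next document from whichever ranking has so far contributed less of its own prefix, skipping duplicates and breaking ties in favor of r. The sketch below is consistent with the figure, though it may not match the paper's procedure in every detail (e.g. randomized tie-breaking).

```python
def combine(r, r_prime):
    """Interleave two rankings, starting with the top result of r.

    Walk both lists with pointers; always advance the pointer that is
    behind (ties go to r), appending each document the first time it is
    seen. This keeps the top of the combined list balanced between the
    tops of r and r_prime.
    """
    combined, seen = [], set()
    i = j = 0
    while i < len(r) or j < len(r_prime):
        if j >= len(r_prime) or (i < len(r) and i <= j):
            d, i = r[i], i + 1
        else:
            d, j = r_prime[j], j + 1
        if d not in seen:
            seen.add(d)
            combined.append(d)
    return combined

# The example from Figure 6:
result = combine(["d1", "d2", "d3", "d4"], ["d2", "d5", "d1", "d6"])
```

Because any prefix of the combined list contains nearly equally deep prefixes of both input rankings, counting which ranking's results attract more clicks gives an unbiased paired comparison of the two retrieval functions.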
Results and Discussion
The ranking functions were evaluated on the Cornell University Library (CUL) search engine from 10 December 2004 through 18 February 2005. When a user connected to the search engine, an evaluation mode was randomly selected for that user, who was then shown a combined ranking of the two retrieval functions being compared.

Table 3: Results on the Cornell Library search engine. rel_0 is the original retrieval function, rel_QC the function trained using query chains, and rel_NC the function trained without query chains.

Evaluation Mode    | User Prefers Chains | Other     | Indifferent
rel_QC vs. rel_0   | 392 (32%)           | 239 (20%) | 579 (47%)
rel_QC vs. rel_NC  | 211 (17%)           | 160 (13%) | 855 (70%)