Reducing Click and Skip Errors in Search Result Ranking


Jiepu Jiang
Center for Intelligent Information Retrieval
College of Information and Computer Sciences
University of Massachusetts Amherst

James Allan
Center for Intelligent Information Retrieval
College of Information and Computer Sciences
University of Massachusetts Amherst

ABSTRACT
Most current search systems show descriptive summaries of results so that searchers can decide whether or not to click and read the results in detail. However, searchers may incorrectly click on a non-relevant result and/or skip relevant ones without clicking, for many reasons (e.g., misleading summaries, redundant content). A public dataset shows that 41% of clicked results are non-relevant and that searchers skip 45% of relevant results without clicking after reading their summaries. Reducing such click and skip errors is crucial for improving the actual performance of search systems. This paper explores a new paradigm of search result ranking that considers both result relevance and the chances of click and skip errors. Ideally, we should place relevant results with greater click probabilities at higher ranks than those with smaller click probabilities, such that searchers are less likely to skip relevant results. Similarly, we demote non-relevant results with higher click probabilities, such that searchers are less likely to click on non-relevant results. The proposed approach takes advantage of various features that are predictive of click and skip behavior among relevant and non-relevant results, including characteristics of result webpages and summaries and their relation to the search query and past searcher interaction within the same search session. We train ranking models using these features to optimize an nDCG-style metric, which measures the quality of a result list by both relevance and click probabilities. Experimental results on an open dataset show that our approach reduces 15%-20% of the click and skip errors in search result ranking, with only a slight decline in result relevance (nDCG@10 decreased by only 3.9%).

Categories and Subject Descriptors
H.3.3 [Information Storage and Retrieval]: Information Search and Retrieval - retrieval models. H.1.2 [Models and Principles]: User/Machine Systems - human factors.

Keywords
Click; interactive search; search result ranking; IR evaluation; summarization.

1. INTRODUCTION
The 10-blue-links paradigm has ruled search engines for decades. It is an interaction mode in which search systems deliver results to searchers through a search result page (SERP) filled with a ranked list of result summaries. These summaries are previews of result documents, usually customized to the query, such that searchers can spend a small amount of effort reading the summaries and assess whether it is worthwhile to click on the results and read them in detail. This mode is adopted in almost all current search systems, and its accuracy greatly affects the actual performance of search systems.
Although searchers can sometimes learn relevant information solely from reading result summaries, in most cases they need to read result documents to acquire relevant information. In such cases, it helps little to place a relevant result at the top if most searchers would not click on the result after viewing its summary. Similarly, clicking a non-relevant result costs searchers time and effort and may cause frustration, which has a greater adverse effect than skipping a non-relevant result. However, few studies discuss these factors in search result ranking and evaluation: results are ranked and evaluated by how relevant the documents are, ignoring whether searchers would click or skip the summaries.
One solution to the problem is to generate better search result summaries, such that searchers can make better decisions about clicking and skipping. Much work has been done in this direction [23, 24], yet the reported click accuracy for search engines and state-of-the-art retrieval systems is not satisfying. For example, Yilmaz et al. [26] reported, based on a commercial search engine query log, that the probabilities of clicking results judged as Perfect, Excellent, Good, Fair, and Bad are 0.94, 0.71, 0.55, 0.45, and 0.49, respectively. Similarly, in the TREC 2014 session track dataset [5] query logs (collected using an Indri-based retrieval system), only 41% of the clicked results are relevant and searchers skipped 45% of the relevant results after reading their summaries.
We approach the problem from a different angle by optimizing search result ranking for both relevance and searchers' click and skip behavior. Existing approaches do not take click and skip errors into account in ranking: they purely rank relevant results on top of non-relevant ones. Assuming we know whether searchers are going to click on or skip a result after reading its summary, we can perform a more sophisticated ranking of search results. For example, comparing two equally relevant results, we should rank the one searchers would click (or are more likely to click) at a higher rank. Similarly, we can demote non-relevant results with high click probabilities to avoid click errors. Yet it is currently unclear how to perform such rankings of search results. In this paper, we present the first study tackling this challenge. We consider both relevance and click and skip errors when ranking search results. Our goal is to reduce the chances of click and skip

errors while maintaining the relevance of results. The proposed approach consists of two parts. First, we design features that are predictive of searcher click and skip behavior on relevant and non-relevant results. Section 4 designs and analyzes various features covering query-independent characteristics of summaries and results, the similarity between the search query and results, past searcher interaction in the same session, and situational factors. Second, we propose an nDCG-style metric that measures the quality of search results considering both result relevance and the chances of click and skip errors. This metric can be used as an objective function to train ranking models that solve the problem. Section 5 introduces this metric. We train LambdaMART rankers to optimize the metric and compare them with rankers optimized for regular nDCG. Experimental results show the trained rankers can reduce the chances of click and skip errors by 15% to 20%, with only a slight decline in result relevance (nDCG@10 decreased by 3.9%). The proposed approach also outperforms many state-of-the-art retrieval models. The rest of the article introduces our approach and experiments.

2. RELATED WORK
This section reviews related work on three topics: searcher click and skip behavior, web search studies on searchers' implicit feedback, and search models using session information.
Searcher click and skip behavior on a SERP is greatly affected by the result summaries. Tombros and Sanderson [23] first applied query-biased summaries in IR systems, and found increased accuracy and speed of searchers' result clicking compared with query-independent summaries. White et al. [24] compared query-biased summaries with those using only document titles and abstracts, and came to similar conclusions. This suggests that searcher click and skip errors may relate to the relationship between summaries and search queries. Cutrell and Guan [9] experimented with summaries of different lengths in both navigational and informational search tasks. They observed that searchers achieved the best clicking accuracy with short summaries in navigational search tasks, but with long summaries in informational search tasks. This suggests that searcher click and skip behavior may relate to textual characteristics of summaries as well as to search tasks. Clarke et al. [7] predicted searcher clicks using result summary features. In comparison, we rank results by both searcher clicks and the relevance of search results at the same time. Shokouhi et al. [22] studied how repeated results in different queries were clicked by searchers. Similar information is included in our approach as features.
Note that we are interested in searchers' click and skip behavior only after they view result summaries. This is essentially a different task compared with the many studies that predict clicks based on the position of results on the SERP [6, 12]. Those studies mainly aim at accurately explaining the position bias of clicks in search logs, such that click-through data can be used more accurately as implicit feedback. Our study is also different from studies [2, 21] that use clicks as implicit feedback for improving relevance-based ranking of search results. In contrast, we use clicks as ground truth for how searchers would behave (click or skip) after reading result summaries, such that we can reduce click and skip errors in search result ranking. The approach adopted in this paper shares similarity with studies of predicting and optimizing click-through rates (CTRs) of ads [20].
Those studies predict position-independent CTRs for ads, such that ads with high CTRs can be ranked at the top to maximize ad clicks. In comparison, our task is more complex: we optimize click and skip errors for results, which depend on both click probability and result relevance.
Our approach for reducing searcher click and skip errors incorporates a large amount of searcher interaction information, which makes it similar to many web search studies on searchers' implicit feedback. Joachims et al. [15, 16] explored the validity of using click-through data to generate searcher preferences over results. Agichtein et al. [3] incorporated browsing features and query-text features to improve click-through features in similar tasks. Although our dataset has relevance judgments, Joachims's studies [15, 16] shed light on our approach of extracting labels for searcher click and skip behavior from search logs. Agichtein et al. [2] also demonstrated the effectiveness of using implicit searcher feedback to improve search result ranking. Hassan et al. [14] and Ageev et al. [1] predicted search success using searcher interaction in a session. Guo et al. [13] predicted search engine switching. These studies show that past searcher interaction data are useful for predicting future searcher interactions, which sheds light on our approach.
Our study also relates to previous work using local search context within a search session to improve search performance. For example, Shen et al. [21] proposed a language-model-based relevance feedback approach to incorporate previous search queries and clicks within a search session; Guan et al. [11] utilized query change information by separately considering changes of terms in query reformulations, as well as whether these terms appear in past results, as relevance feedback information; Luo et al. [19] further modeled query change information as actions and optimized the session search problem using a POMDP framework. Their work shares similarity with our study in that similar datasets were employed (search logs with local session contexts). But these studies still aim at relevance-based ranking and are evaluated using the relevance of results. Our study mainly differs from their work in that we consider both the relevance of results and the chances of click and skip errors in search result ranking and evaluation.

3. CLICK AND SKIP ERRORS
In this section, we define click and skip errors and discuss the approach of extracting labels for these errors from query logs.

3.1 Definition and Goal
We discuss searcher click and skip errors in the de facto standard search result page (SERP) setting. The search engine shows searchers a ranked list of result summaries linking to the result webpages. Searchers read the summaries and make (possibly imperfect) decisions on whether to click on the results and read the full webpages. A summary usually consists of the title and URL of the webpage and a query-biased snippet [23, 24] with query terms highlighted. This mode is adopted by most current search engines. We consider two properties of a search result: (1) the relevance of the document: similar to most current studies, we consider multi-level relevance, but for simplicity we also use binary relevant (R) and non-relevant (NR) labels in discussion; and (2) whether searchers will click on the result and read the webpage in detail after viewing its summary: we use click and skip for the cases where searchers do and do not click on the result after viewing its summary.
The two properties divide results into four categories: R+click, R+skip, NR+click, and NR+skip. We refer to NR+click as click errors, i.e., searchers incorrectly click on a non-relevant result. We call R+skip skip errors, i.e., searchers read a relevant result's summary but do not click on it. Note that R+skip cases are not necessarily searchers' errors; they may instead be correct decisions (e.g., skipping a previously clicked relevant result or one with duplicate content). Here we simply call them skip errors to simplify discussion.

We assume that R+click is preferred over R+skip, and NR+skip is preferred over NR+click, in search result ranking. In this paper, we do not consider the cases where searchers learn relevant information directly from reading result summaries (in such cases, R+skip is preferable because it requires less effort). This mostly happens when searchers are looking for simple, factual information (e.g., "release date Windows 10", "UPS phone number"). For most other information needs, searchers still need to access result documents in order to acquire relevant information, especially when the information needs are complex, exploratory, or transactional (e.g., "U.S. history", "Mac vs. Surface", "book a hotel").
The goal of conventional search systems is to rank R over NR. Our goal here is to further rank R+click over R+skip, and NR+skip over NR+click. This ranking reduces the chances of click and skip errors, because it places results with high risks of click and skip errors (e.g., R+skip and NR+click) at lower ranks.

3.2 Dataset
Our study relies on search logs to extract ground truth for click and skip labels. We use the TREC 2014 session track dataset [5], because it provided the largest publicly accessible non-anonymous query logs with relevance judgments at the time of our study. The dataset provides information for 1,257 search sessions and 4,666 queries on 60 different search topics. A search session includes a series of queries for the search topic, the results and summaries shown on the SERPs, and searcher interaction information (e.g., clicks). Searchers in these sessions were paid workers on Amazon Mechanical Turk. They were shown topic descriptions and were requested to work on the search topics using an experimental search system for at least 3 minutes. Their search interaction was recorded. The session track removes SERP and searcher interaction information for the last query of each session, because its goal is to evaluate techniques for retrieving results for the last query using past searcher interaction.
The experimental search system was built on Indri and worked on a full index of the ClueWeb12 dataset. The system ranks results by matching query terms and phrases to webpages' contents, titles, URLs, and anchor texts linking to the webpages. Due to the large size of the ClueWeb12 dataset (about 733 million webpages), a result caching strategy was adopted to ensure that the system could respond to requests within at most 6 seconds [5]. Result summaries are generated using Indri's built-in functions. According to its source code (see src/snippetbuilder.cpp in Indri 5.8), Indri constructs summaries by selecting document fragments with the highest coverage of query terms, with a bias toward the beginning of documents. This is similar to many other approaches [23].
Result relevance was judged by assessors from NIST. The judgment pool included all results displayed to the searchers, as well as the top results from participants' runs. Due to its large volume, only part of the sessions was included in the judgment pool. More details of the dataset are discussed in Carterette et al. [5].

3.3 Extracting Click and Skip Labels
Relevance of results and searcher clicks are readily available in the dataset, so we can easily get labels for R and NR and the multi-level relevance of results. But to extract click and skip labels we further need to know whether searchers viewed a summary (because both labels are defined under the condition that searchers viewed the summaries).
The observed clicks on a SERP are largely affected by the ranks of result summaries. For example, searchers may not click on a result summary simply because it is ranked at the bottom of the SERP (they did not have the chance to view it), which cannot be counted as a skip. That type of information can only be collected using biometric sensors (e.g., eye-tracking devices), which is costly. Alternatively, we automatically extract click and skip labels by assuming that if a searcher clicks a result, the searcher has read its summary and all summaries at higher positions on the SERP. Using this assumption, we automatically extract click and skip labels from SERPs with at least one observed click. We assign click labels to all observed clicks. We find the lowest-ranked clicked result on a SERP and assign skip labels to unclicked results ranked at higher positions on the SERP. We do not consider SERPs without any clicks, and we exclude results with no relevance judgments. We also exclude the first query of each session, because our approach makes use of past searcher interaction information. Using this approach, we obtain labels for 1,085 webpages. The average rank of clicked summaries is 3.40; on average, the lowest rank of clicked results on SERPs with at least one click is 3.13.
The correctness of the assumption is well supported by previous eye-tracking studies. Joachims et al. [15] showed that searchers viewed at least 90% of clicked summaries at ranks 1 to 5 and 81.8% at rank 6. According to these statistics, we believe this assumption has about 90% accuracy in assigning click labels. Granka et al. [10] reported that when searchers clicked results at ranks 2, 3, and 4, respectively 79%, 80%, and 77% of the results at higher positions were also viewed by searchers. Similar results were reported in Joachims et al.'s study [15]. Cutrell and Guan [9] reported an even higher chance of viewing summaries if searchers clicked another summary at a lower rank. According to these results, we expect about 80% accuracy in assigning skip labels to summaries. Although this data labeling approach is not perfect, we believe it is the best we can do in an automatic way. More importantly, it makes our approach generalizable, because preparing such a dataset requires no more effort than preparing conventional search datasets (the majority of the effort is relevance judgments).
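To make the labeling rule concrete, the sketch below shows one way it could be implemented. The SERP representation (a list of results ordered by rank, each with a clicked flag and a judged flag) and the function name are illustrative assumptions rather than our exact implementation.

def extract_click_skip_labels(serp):
    """Assign click/skip labels to results on one SERP.

    `serp` is a list of dicts ordered by rank, each with keys 'doc_id',
    'clicked' (bool), and 'judged' (bool).  Following the assumption in
    Section 3.3, every summary ranked above the lowest clicked result
    is treated as viewed.
    """
    clicked_ranks = [i for i, r in enumerate(serp) if r['clicked']]
    if not clicked_ranks:
        return {}  # SERPs without clicks are not used

    lowest_clicked = max(clicked_ranks)
    labels = {}
    for i, result in enumerate(serp):
        if not result['judged']:
            continue  # results without relevance judgments are excluded
        if result['clicked']:
            labels[result['doc_id']] = 'click'
        elif i < lowest_clicked:
            labels[result['doc_id']] = 'skip'
        # results below the lowest click get no label ("unknown")
    return labels


# Example: clicks at ranks 1 and 3 -> the unclicked result at rank 2 is a skip.
serp = [
    {'doc_id': 'd1', 'clicked': True,  'judged': True},
    {'doc_id': 'd2', 'clicked': False, 'judged': True},
    {'doc_id': 'd3', 'clicked': True,  'judged': True},
    {'doc_id': 'd4', 'clicked': False, 'judged': True},
]
print(extract_click_skip_labels(serp))  # {'d1': 'click', 'd2': 'skip', 'd3': 'click'}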
4. FEATURES
This section introduces features for ranking search results. We also analyze the values of these features among the four categories of results, as a way of examining whether they are predictive of click and skip behavior among relevant and non-relevant results. Considering that existing IR methods work well for relevance-based ranking, we focus on features that may help identify click and skip. We show comparisons of mean feature values in the four classes of results. We test significance using Welch's t-test (a variant of the t-test that does not assume equal variances); * and ** mean p<0.05 and p<0.01. We only mark significant differences for click vs. skip, R+click vs. R+skip, and NR+click vs. NR+skip.

4.1 Query-independent Features
Cutrell and Guan [9] observed different accuracy of clicking relevant results when searchers were shown result snippets of different lengths. This suggests that click and skip behavior can be affected by query-independent characteristics of summaries and webpages. Table 1 lists the query-independent features.

Table 1. A list of query-independent features.
Length: length of summary title/snippet, with or without stop words.
Length_URL: length of the URL (by characters / by levels of directories).
#unique words: number of unique words in title/snippet.
%stopwords: percent of stop words in title/snippet.
%wordstag: percent of words in <a>, <p>, <li>, <td>, heading tags (<h1> to <h6>), and emphasis tags (<b>, <i>, <em>, <strong>).
TF/IDF/TF-IDF: average/maximum/minimum values of TF/IDF/TF-IDF for words in the result summary title/snippet.
#fragment: number of document fragments in the snippet.
pagerank: PageRank of the webpage (in percentile).
spamrank: Waterloo spam rank of the webpage (in percentile) [8].

Figure 1. Query-independent features with significant differences between clicked and skipped webpage snippets: (a) title length (without stop words), click < skip **, NR+click < NR+skip **; (b) max TF of words in title, click < skip *, NR+click < NR+skip *; (c) min IDF of words in title, click > skip *, NR+click > NR+skip *; (d) PageRank (percentile), click < skip *.

Figure 1 shows examples of features with significant differences between click and skip. Figures 1(a), 1(b), and 1(c) show that clicked and skipped summaries differ in textual characteristics. Clicked summaries have shorter titles compared with the skipped ones. Clicked and skipped summaries also differ significantly in the maximum word frequency and minimum word IDF of their titles. Figure 1(d) shows that clicked webpages have relatively lower PageRank percentile scores than the skipped ones (but the mean scores for clicked webpages are still over 70). We suspect this is related to the nature of the search tasks in the dataset (complex factual and informational search tasks). In these tasks, searchers are less likely to be interested in webpages with very high PageRank scores (which are usually homepages). Different trends may be observed in other tasks (e.g., navigational search). Despite the task dependency, these results suggest that click and skip behavior relates to webpage centrality. Most other features in Table 1 show significant differences between relevant and non-relevant results, but no clear differences between clicked and skipped ones.
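As an illustration, the following sketch computes a few of the Table 1 title features. The tokenizer, the stop-word list, and the IDF lookup are simplified placeholders rather than the exact feature extraction we used.

import math

STOPWORDS = {'the', 'a', 'an', 'of', 'in', 'and', 'to'}  # placeholder list

def title_features(title, idf):
    """Compute a few query-independent features of a summary title.

    `idf` maps a term to its inverse document frequency; terms not in
    the map get a default value.  The returned features correspond to
    Table 1: length with/without stop words, %stopwords, max TF, min IDF.
    """
    terms = title.lower().split()
    content_terms = [t for t in terms if t not in STOPWORDS]
    tf = {}
    for t in terms:
        tf[t] = tf.get(t, 0) + 1
    return {
        'title_len': len(terms),
        'title_len_no_stop': len(content_terms),
        'pct_stopwords': 1 - len(content_terms) / max(len(terms), 1),
        'max_tf': max(tf.values()) if tf else 0,
        'min_idf': min(idf.get(t, 10.0) for t in terms) if terms else 0.0,
    }

idf = {'click': 4.2, 'errors': 3.1, 'in': 0.3, 'ranking': 3.8}
print(title_features('Reducing Click Errors in Ranking', idf))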
4.2 Features Using Current Search Query
Table 2 lists features using the current query, and Figure 2 shows examples with significant differences between click and skip.

Table 2. Features using the current search query.
%qterm: percent of query terms in title/snippet/URL.
#qtermtag: frequency of query terms in different HTML tags.
%qtermtag: percent of query terms in different HTML tags.
ratio %qtermtag: the ratio of %qtermtag in a given HTML tag to that of the whole webpage.
QLscore: normalized query likelihood score of the summary/webpage.
#window(k): number of windows of size k in the webpage covering all terms, any bigrams, or any trigrams in queries.
minwindow: minimum window size in the webpage covering all terms, any bigrams, or any trigrams in queries.

Figure 2. Features using the current query with significant differences between clicked and skipped webpage snippets: (a) percent of current query terms in title (all click/skip differences are significant); (b) number of windows (size = 5) in the webpage covering all query terms (all click/skip differences are significant); (c) ratio of query term coverage in <a> to that in the whole webpage (click > skip *, NR+click > NR+skip *); (d) ratio of query term coverage in <p> to that in the whole webpage (all click/skip differences are significant).

We adopt features measuring the similarity between the query and the result summary or webpage, including the coverage of query terms in the summary title, snippet, and URL, and the query likelihood (QL) score of the webpage [27]. When matching query terms in URLs, we count a term as occurring if it is a subsequence of the URL. The original QL scores are not comparable between different queries, so we normalize them by query length, i.e., each query term is assigned a weight 1/|Q|, where |Q| is the number of query terms in the query. Figure 2(a) shows that clicked summaries have higher coverage of query terms in their titles compared with skipped ones. We found similar patterns for the coverage of query terms in snippets and URLs as well. QL scores show significant differences between R+click and R+skip, but not between NR+click and NR+skip.
We also use features matching query term phrases in webpages, including the number of windows of size k covering all query terms, and the minimum window size covering all query terms. We set the minimum window size to the document length when not all query terms occur in the webpage. Figure 2(b) shows that it is more likely to find small windows (5 words) covering all query terms in clicked result webpages than in skipped ones.
The distributions of query terms in different HTML tags are also useful features for distinguishing click and skip documents. While analyzing searcher click errors in the dataset, we found that many click errors happened when result summaries looked relevant (e.g., had high coverage of query terms) but the actual webpages did not provide any helpful information. One common characteristic of these results is that query terms mainly occur in navigational content, such as hyperlinks and directories. Thus, we compute the coverage of query terms in each specific HTML tag (e.g., <a>, <p>) and in the whole webpage, and use their ratio as a feature. As Figures 2(c) and 2(d) show, results with click errors (NR+click) have a higher ratio in navigational content (e.g., in the <a> tag) but a lower ratio in the main content of the webpage (e.g., in the <p> tag). This suggests that webpage structure plays an important role in search result summarization, as well as in click and skip behavior.
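The sketch below illustrates two of the Table 2 features: query term coverage (%qterm) and the minimum window covering all query terms (minwindow). The whitespace tokenization and list-of-terms input are simplifying assumptions.

def query_term_coverage(query_terms, text_terms):
    """Fraction of distinct query terms that occur in the text (%qterm)."""
    if not query_terms:
        return 0.0
    text = set(text_terms)
    return sum(1 for t in set(query_terms) if t in text) / len(set(query_terms))


def min_window_size(query_terms, doc_terms):
    """Smallest window in the document containing all query terms (minwindow).

    Returns the document length when some query term never occurs,
    mirroring the convention described above.
    """
    needed = set(query_terms)
    if not needed:
        return 0
    best = len(doc_terms)
    if not needed.issubset(set(doc_terms)):
        return best
    count, covered, left = {}, 0, 0
    for right, term in enumerate(doc_terms):
        if term in needed:
            count[term] = count.get(term, 0) + 1
            if count[term] == 1:
                covered += 1
        while covered == len(needed):          # shrink from the left
            best = min(best, right - left + 1)
            t = doc_terms[left]
            if t in needed:
                count[t] -= 1
                if count[t] == 0:
                    covered -= 1
            left += 1
    return best


doc = 'the click model explains skip behavior and click errors'.split()
print(query_term_coverage(['click', 'errors'], doc))   # 1.0
print(min_window_size(['click', 'errors'], doc))       # 2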

4.3 Features Using Session Information
This section analyzes features using past searcher interaction in the same session, e.g., past queries, SERPs, and clicks. Table 3 lists these features and Figure 3 shows the values of some of them.

Table 3. Features using session information.
%past qterm: percent of past query terms in title/snippet/URL. We consider all past query terms and ADD/KEEP/RMV terms separately.
%past qterm by query type: %past qterm considering only (1) past queries with clicks, or (2) past query reformulations whose second query has clicks.
%past SERP result term: coverage of words from past clicked/skipped SERP titles in the summary's title/snippet.
QLscore past clicked docs: normalized query likelihood score of past queries, past queries with clicks, and past clicked/all SERP titles/snippets.
Repeated Result: whether a summary of the same result (by URL) was clicked or skipped by the searcher on previous SERPs.

Figure 3. Features using session information with significant differences between clicked and skipped webpage snippets: (a) percent of past query terms in title, click > skip **, NR+click > NR+skip **; (b) percent of ADD terms in title, all click/skip differences significant; (c) percent of KEEP terms in title, click > skip **, NR+click > NR+skip **; (d) whether the URL was previously visited, all click/skip differences significant.

Past search queries in the same session were used in many studies for improving search performance [11, 19, 21]. In our study, we also examine whether past queries can help identify click and skip. We use the coverage of past query terms in the summary title, snippet, and URL as features. We compute the coverage of terms for each past query and use the mean value as features. Following Guan et al.'s work [11], we also separately consider three types of terms in past query reformulations. For a query q1 and its reformulation q2, ADD terms are those in q2 but not in q1, KEEP terms are those in both queries, and RMV terms are those in q1 but not in q2. We compute the coverage of ADD, KEEP, and RMV terms for each past query reformulation and use their mean values as features. Figure 3(b) shows that the coverage of ADD terms can help identify click and skip among both relevant and non-relevant results. In contrast, the coverage of KEEP terms only shows significant differences between NR+click and NR+skip. The coverage of past query terms shows patterns similar to KEEP terms (Figure 3(a)).
Whether searchers previously visited exactly the same result is also an important reason for skips. Figure 3(d) shows that 20.3% of R+skip and 10.4% of NR+skip results were previously visited by the searcher within the same session, compared with 3.4% and 3.5% for R+click and NR+click. We also adopt variants of the past query term coverage features that only consider past queries with clicks, and past query reformulations whose second query has clicks. In addition, we measure the similarity of result summaries and documents with past SERP titles and summaries as features. We separately consider past clicked and skipped SERPs. Due to limited space, we do not include analysis of these features in this paper.
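The ADD/KEEP/RMV split described above is a simple set operation over the two queries of a reformulation; the following sketch illustrates it (the whitespace tokenization is a simplifying assumption).

def query_change_terms(q1, q2):
    """Split the terms of a reformulation q1 -> q2 into ADD/KEEP/RMV sets,
    following the definition above (Guan et al. [11])."""
    t1, t2 = set(q1.lower().split()), set(q2.lower().split())
    return {'ADD': t2 - t1, 'KEEP': t1 & t2, 'RMV': t1 - t2}


print(query_change_terms('mac vs surface', 'surface pro reviews'))
# {'ADD': {'pro', 'reviews'}, 'KEEP': {'surface'}, 'RMV': {'mac', 'vs'}}

The corresponding feature value for a candidate result is then the coverage of each term type in its title, snippet, or URL, averaged over all past reformulations in the session.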
4.4 Situational Features
We also suspect that searchers' click and skip decisions relate to situational factors that are independent of any specific result, i.e., in a certain period of a session, searchers are more likely to click or skip results. Situational features include: the number of previous clicks, SAT clicks, and DSAT clicks; and the number of previous queries, and of those with clicks. We found that on average searchers have 0.51 SAT clicks when they skip a relevant summary, compared with 0.39 when they click a relevant one (p<0.05). This suggests that searchers are probably more likely to skip relevant results if they have already visited some relevant information in previous searches.

4.5 Summary
In summary, this section introduced and analyzed features using different types of result information (e.g., title, snippet, URL, and webpage) and diverse searcher interaction evidence (e.g., the current search query, past search queries, and click and skip behavior on previous SERPs). The analysis shows significantly different patterns between clicked and skipped results on many features. This suggests our features are potentially predictive of searchers' click and skip behavior. In addition, it suggests it is possible to automatically identify click and skip behavior on relevant and non-relevant results. But we are conservative regarding the generalizability of any specific feature pattern. Readers need to be aware that these searcher interaction data were collected using one specific search system and result summarization method.

5. RANKING
The ultimate goal of our study is to rank search results considering both the relevance of results and potential searcher click and skip errors after viewing result summaries. The task would be simple if we could make perfect predictions of relevance and click/skip behavior. In that case, we could apply a rule and rank results first by relevance and then by predicted click or skip. Unfortunately, making such perfect predictions seems beyond the limit of current technology. We tackle this challenge by training LambdaMART [4] rankers to optimize an nDCG-style metric. We name the metric click-sensitive nDCG (cs-nDCG). Similar to nDCG, cs-nDCG prefers ranking more relevant results over less relevant ones, and prefers relevant results at the top. In addition to these properties, cs-nDCG penalizes relevant results with low click probabilities and non-relevant ones with high click probabilities. This section introduces the metric and our ranking approach.

5.1 Problem Definition
We consider both results that can be labeled for click and skip and those that cannot. This is because the criterion in Section 3.3 can only automatically label less than half of the top 10 results (on average the lowest rank of clicked results is 3.13), which greatly limits the number of ranking candidates. We assign unknown labels to the results that cannot be automatically labeled for click and skip. In total we have six categories of results: R+click, R+skip, R+unknown, NR+click, NR+skip, and NR+unknown. Our goal is to rank results in the following order (at this point we assume binary relevance to simplify discussion):
R+click > R+unknown > R+skip = NR+skip > NR+unknown > NR+click
This is because we do not know how searchers would behave on results without click and skip labels (unknown). In this case, it is reasonable to assume searchers are less likely to click on these results compared with results with observed clicks (click), but more likely to click on them compared with those with observed skip behavior (skip). Therefore, we rank R+unknown between R+click and R+skip, because searchers are more likely to click on and benefit from R+unknown compared with R+skip, and less likely to do so compared with R+click.

For similar reasons, we should rank NR+unknown between NR+skip and NR+click. We do not prefer any particular ordering between R+skip and NR+skip because, if searchers skip a result, the relevance of the result means little to them. The rest of this section introduces cs-nDCG, which evaluates the quality of a ranked list based on the ranking rule presented above. It also takes into consideration multiple relevance levels and discounts the influence of an individual result by its rank in the list.

5.2 Click-Sensitive nDCG (cs-nDCG)
Let L = {D_1, D_2, ..., D_n} be a ranked list of n search results and their summaries. We measure D_i's contribution to the quality of the ranked list, g(D_i), as in Equation (1), where r_i is the relevance degree of D_i judged by the assessors (r_i = 0 means D_i is not relevant), p_i is the probability of clicking D_i after viewing its summary, and g_p is the penalty for clicking a non-relevant result.

g(D_i) = (2^{r_i} - 1) · p_i  if r_i > 0;   g(D_i) = -g_p · p_i  if r_i = 0    (1)

In essence, Equation (1) weights the gain of a relevant result by its click probability, adopting the same gain function as nDCG. In addition, Equation (1) sets an adverse effect for a clicked non-relevant result, which can offset the positive contribution (gain) of relevant results. The scale of the adverse effect is controlled by the parameter g_p, and we weight the adverse effect of a result by its click probability. We set p_i to 1 if D_i is labeled click, and to 0 if D_i has the label skip. For R+unknown and NR+unknown, we make our best guess and set p_i to the probability of clicking a relevant and a non-relevant result after viewing its summary, respectively. The two probabilities can be estimated from the dataset using the existing click and skip labels, i.e., P(click | R) = |R+click| / |R| and P(click | NR) = |NR+click| / |NR|, where |X| is the number of instances with label X. In our dataset, P(click | R) = 0.55.
Similar to nDCG, we sum up the contributions of results at each rank with a position-based discount function. We refer to the sum as the cs-DCG of the ranked list, as in Equation (2). The position-based discount function is the same as the one in nDCG, which ensures that results at the top of the ranked list have a greater impact on the whole list.

cs-DCG(L) = Σ_{i=1}^{n} g(D_i) / log_2(i + 1)    (2)

Click-sensitive nDCG (cs-nDCG) for a ranked list L is calculated by normalizing its cs-DCG to the range [0, 1]. Different from DCG, cs-DCG can take negative values due to the penalty for clicking non-relevant results. Therefore, we normalize cs-DCG by both its lower bound and its upper bound, as in Equation (3). The ideal ranked list (L_best) is defined as a list where all results have the highest relevance label (4 in our dataset) and perfect click accuracy (always clicked). The worst ranked list (L_worst) includes all observed NR+click results, followed by non-relevant results without observations of click and skip.

cs-nDCG(L) = [cs-DCG(L) - cs-DCG(L_worst)] / [cs-DCG(L_best) - cs-DCG(L_worst)]    (3)

Similar to the case of nDCG [25], when training LambdaMART models using cs-nDCG, we update the model using the following λ-gradients, where Δcs-nDCG_{ij} is the change in cs-nDCG obtained by swapping a pair of results D_i and D_j in the ranked list:

λ_{ij} = S_{ij} · | Δcs-nDCG_{ij} · ∂C/∂o_{ij} |    (4)
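The sketch below illustrates how Equations (1)-(3) can be computed for a ranked list of (relevance, label) pairs. The value of P(click | NR) is a placeholder to be estimated from the training log, and the functions are an illustrative sketch rather than our exact implementation.

import math

# Click probabilities for results without an observed label.
# P(click|R) = 0.55 is the value estimated from our dataset;
# P_CLICK_NR is a placeholder to be estimated from the same log.
P_CLICK_R, P_CLICK_NR = 0.55, 0.35

def gain(rel, label, gp=3.0):
    """Contribution g(D) of one result (Equation 1).
    rel:   graded relevance (0 = non-relevant)
    label: 'click', 'skip', or 'unknown' (no observed behavior)
    """
    if label == 'click':
        p = 1.0
    elif label == 'skip':
        p = 0.0
    else:  # unknown: fall back to the estimated click probability
        p = P_CLICK_R if rel > 0 else P_CLICK_NR
    return (2 ** rel - 1) * p if rel > 0 else -gp * p

def cs_dcg(results, gp=3.0):
    """Equation 2: position-discounted sum of contributions."""
    return sum(gain(rel, label, gp) / math.log2(i + 2)
               for i, (rel, label) in enumerate(results))

def cs_ndcg(results, gp=3.0, max_rel=4):
    """Equation 3: normalize cs-DCG by its best and worst bounds."""
    n = len(results)
    best = cs_dcg([(max_rel, 'click')] * n, gp)            # ideal list
    nr_click = [r for r in results if r[0] == 0 and r[1] == 'click']
    nr_unknown = [r for r in results if r[0] == 0 and r[1] == 'unknown']
    worst = cs_dcg(nr_click + nr_unknown, gp)              # worst list
    return (cs_dcg(results, gp) - worst) / (best - worst)

# A ranked list of (relevance, label) pairs.
ranked = [(2, 'click'), (0, 'click'), (3, 'unknown'), (1, 'skip')]
print(round(cs_ndcg(ranked), 3))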
Table 4. Contribution g(D) of results with different relevance labels and click/unknown/skip labels (g_p = 1).

5.3 Properties of cs-nDCG
One may notice that cs-nDCG is essentially equivalent to regular nDCG if we set p_i = 1 for all relevant results and p_i = 0 for all non-relevant results. In that case, g(D_i) equals the gain function of regular nDCG. Therefore, regular nDCG can be considered a special case of cs-nDCG under the assumption that searchers always click relevant results and always skip non-relevant results. Clearly, these are not warranted assumptions.
cs-nDCG prefers ranking by the criterion in Section 5.1. Table 4 shows the actual contributions of results, i.e., g(D), with click, skip, and unknown labels in our dataset. It ensures that among relevant results with equal relevance labels, we prefer R+click over R+unknown. Due to the exponential gain function and the estimated click probability for relevant results (0.55) in our dataset, R+click and R+unknown results with higher relevance levels are always preferred over those with lower relevance levels; e.g., R+unknown (r = 4) has a contribution of 8.27 and R+click (r = 3) has a contribution of 7. This ensures that in most cases we still prefer relevant results with higher relevance labels, while R+click is preferred over R+unknown. The contribution function also ensures that NR+skip is preferred over NR+unknown, and NR+unknown over NR+click. The penalty parameter g_p controls the scale of the adverse effect of clicking a non-relevant result on searcher experience: the greater the value of g_p, the more the quality of a ranked list is affected by NR+click (click errors).

5.4 Incorporating Partly Judged Queries
In essence, our approach is to train ranking models using both relevance judgments and observed searcher click and skip behavior in query logs. Ideally, we would have relevance and click/skip labels for many documents in many queries. In practice, however, we usually have relevance judgments on many documents (depending on the size of the judgment pool), but only for a small set of queries. In contrast, we can only obtain a few click and skip observations per query (depending on the lowest click rank), but we can easily get labels for more queries due to the scalability of query logs. The cs-nDCG metric addresses the problem of limited click/skip observations per query: we set click probabilities based on result relevance when there is no observed click/skip behavior. This section discusses another situation. Once we have relevance judgments on a set of topics, we can collect more and more click/skip observations on queries of similar topics (or exactly the same queries). We may reuse old relevance judgments for new queries, but these queries may have unjudged results. It is risky to simply assume all unjudged results are non-relevant and put them into training. Instead, we simply set Δcs-nDCG to 0 if either or both results of a swapping pair have not been judged, i.e., we skip updating the model in the case of unjudged results. Similarly, when training using regular nDCG as the target metric, we can also incorporate partly judged queries by setting ΔnDCG and the λ-gradients to 0 if either or both results of the swapping pair have not been judged.
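A sketch of this pair-skipping rule is shown below: the swap delta used in the λ-gradient is computed only when both results of the pair are judged. The metric is passed in as a function (e.g., cs-nDCG from the previous sketch); the remaining LambdaMART training-loop details are omitted, and the data representation is illustrative.

import math

def delta_metric(metric, results, i, j):
    """Metric change obtained by swapping the results at ranks i and j."""
    swapped = list(results)
    swapped[i], swapped[j] = swapped[j], swapped[i]
    return metric(swapped) - metric(results)

def pair_weight(metric, results, judged, i, j):
    """|Delta| factor of the lambda-gradient in Equation (4), set to 0 when
    either member of the swapped pair has no relevance judgment (Section 5.4)."""
    if not (judged[i] and judged[j]):
        return 0.0
    return abs(delta_metric(metric, results, i, j))

# Example with a toy metric (relevance discounted by rank); during training
# the metric would be cs-nDCG as defined above.
toy = lambda L: sum(rel / math.log2(i + 2) for i, (rel, _) in enumerate(L))
ranked = [(2, 'click'), (0, 'click'), (3, 'unknown')]
print(pair_weight(toy, ranked, judged=[True, True, False], i=0, j=2))  # 0.0: pair skipped
print(pair_weight(toy, ranked, judged=[True, True, True], i=0, j=2))   # 0.5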

The TREC session track dataset offers opportunities for studying the helpfulness of using partly judged queries for training. Among the 1,257 sessions in the dataset, only the results for queries in the first 120 sessions (and the top results of participants' submissions for these sessions) were included in the relevance judgment pool. The rest of the sessions overlap with the 120 sessions in topics. We reuse the judged topics' relevance judgments for other sessions with the same topic. Some of these sessions' queries have unjudged results in the top 10. In our experiments, we also evaluate how helpful it is to incorporate these partly judged queries into training.

6. EXPERIMENTS
We apply our approaches to re-rank the top 10 results of queries in the TREC 2014 session track dataset. We only re-rank the top 10 results because the dataset only provides the top 10 results and because we observed only 3.13 click/skip labels on average per query. We evaluate results by both the relevance of a ranked list and the chances of committing click and skip errors in the list.
We select queries from the dataset with full relevance judgments and at least one click in the top 10 results. In order to apply features using past search interaction information, we do not select the first query of each session. This selects 145 queries, which are used for evaluation. To make use of partly judged queries, we further select other queries (excluding the first query of each session) with at least 2 judged results and at least one click in the top 10 results. This selects 222 partly judged queries. We use 10-fold cross-validation in all experiments: each fold uses 70% of the 145 fully judged queries for training, 20% for validation, and 10% for testing. We also incorporate the 222 partly judged queries into training using the approach discussed in Section 5.4.
We do not evaluate results using cs-nDCG, since this metric has not yet been validated as an evaluation metric. Instead, we evaluate results using regular nDCG@10 and the following metrics measuring click and skip errors:
Chances of click & skip errors at rank k (E@k). We assume searchers would commit a click or skip error on a result in the re-ranked list if we observed such an error on the same result for this query in the log. E@k equals the total number of click and skip errors on results at ranks 1 to k divided by k. This metric is similar to P@k, but it counts click & skip errors.
Discounted cumulated errors at rank k (DCE@k). Similar to discounted cumulated gain (DCG), we measure the cumulated number of click and skip errors at rank k with a position-based discount. We use the same discount function as in DCG.
Number of pairwise ranking errors. We count the following pairwise ranking errors in the re-ranked results: R+skip>R+click, NR+click>NR+skip, NR+click>R+click, NR+skip>R+click, and NR+click>R+skip, where A>B means A is ranked higher than B. R+skip>R+click is counted only among pairs of results with the same level of relevance. In addition, we use R&NR for any ranking error between a relevant and a non-relevant result, i.e., the sum of NR+click>R+click, NR+skip>R+click, and NR+click>R+skip. Similarly, we use click&skip for the sum of R+skip>R+click and NR+click>NR+skip, which are the click and skip ranking errors. We also count the sum of click&skip and R&NR errors.
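For clarity, the sketch below shows how E@k and DCE@k can be computed from a re-ranked list and the set of results with observed errors for the query; the data representation (document identifiers and an error set) is an illustrative assumption.

import math

def e_at_k(ranked_docs, error_docs, k):
    """E@k: fraction of the top-k results on which an error (a click on a
    non-relevant result or a skip of a relevant one) was observed in the log."""
    return sum(1 for d in ranked_docs[:k] if d in error_docs) / k

def dce_at_k(ranked_docs, error_docs, k):
    """DCE@k: cumulated click & skip errors at ranks 1..k with the DCG
    position discount 1 / log2(rank + 1)."""
    return sum(1 / math.log2(i + 2)
               for i, d in enumerate(ranked_docs[:k]) if d in error_docs)

# error_docs: results with an observed NR+click or R+skip for this query.
error_docs = {'d2', 'd5'}
ranking = ['d1', 'd2', 'd3', 'd4', 'd5']
print(round(e_at_k(ranking, error_docs, 3), 3))    # 0.333
print(round(dce_at_k(ranking, error_docs, 5), 3))  # 1.018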
We compare our approach (LambdaMART ranking models trained for cs-nDCG) with several baselines. Few previous studies target this problem, so we compare our approach with LambdaMART ranking models trained for regular nDCG. We also compare with state-of-the-art ad hoc search and session search approaches with strong reported performance on the TREC session track datasets, including:
Query likelihood model with Dirichlet smoothing (QL) [27].
Context-sensitive relevance feedback (CSRF) [21]. CSRF is a relevance feedback approach considering past search queries and the contents of clicked results within the same session. Variants of CSRF were ranked at the top in the TREC 2011 and 2012 session tracks [17, 18].
Query change model (QCM) [11]. QCM considers changes of terms in past query reformulations (i.e., the ADD, KEEP, and RMV terms in Section 4.3) and whether these terms occur in past search results as relevance feedback. Variants of QCM were ranked at the top in the TREC 2014 session track [5].
We also provide evaluation for the original ranking of results in the query log, and for the perfect and worst rankings of results by nDCG and by cs-nDCG. We report mean values of the evaluation metrics on the 145 fully judged queries in the following sections. We test the significance of differences using a paired t-test.

7. EVALUATION
In this section, we first compare our approach with the baselines (Section 7.1). Section 7.2 conducts a feature ablation analysis. In Section 7.3, we compare results using only fully judged queries for training with those incorporating partly judged queries into training.

7.1 Click-Sensitive Ranking Models
Table 5 shows the re-ranking effectiveness of our approaches and the baseline approaches, using both fully judged and partly judged queries for training in each fold. Reported results are mean values of the metrics on the 145 fully judged test queries. The results show our approach can significantly reduce the chances of click and skip errors at the cost of only a slight decline in the relevance of the ranked list. As Table 5 shows, when setting g_p to 3, our ranking model trained for cs-nDCG@10 ("All Features (cs-nDCG, gp=3)") significantly reduced the discounted cumulated click & skip errors (DCE@5) by 14.7% (to 0.526; smaller values are better) compared with the model using the same features but trained for regular nDCG@10 ("All Features (nDCG)"). In addition, this model reduced the chances of click and skip errors (E@k) from 21.4% to 17.9% at rank 1 (p<0.05), from 21.4% to 17.6% at rank 2 (p<0.05), and from 21.6% to 19.3% at rank 3. Similarly, the total number of pairwise ranking errors (counting only results with observed click/skip labels) was reduced from 1.35 to 1.09 (by 19.8%, p<0.01). While reducing click & skip errors by about 15% to 20%, the relevance of the ranked list only slightly decreased: nDCG@10 declined by 3.9% (to 0.274, p<0.01). The number of pairwise ranking errors between relevant and non-relevant results (R & NR) slightly increased (p > 0.05), but those related to click and skip (click & skip) were reduced by 24.4% (to 0.897, p<0.01). This indicates our approach is effective for reducing click and skip errors, with only a small cost in result relevance. We come to the same conclusion when setting g_p to 1 in cs-nDCG@10 to train the ranking models, though the effectiveness of the model changes slightly compared with g_p = 3.
Figure 4 further explores the effectiveness of LambdaMART ranking models trained with different values of g_p. Figure 4(a) shows that the value of g_p does not much affect the relevance of the ranked list (nDCG@10).
As g_p increases from 0.25 to 3, click and skip errors slightly decrease: DCE@5 declines to 0.526, E@1 to 0.179, and E@2 also declines. There is a slight increase in the chance of error at rank 3 (E@3); we suspect this is because we only re-rank the top 10 results. In that case, the total number of click and skip errors in the top 10 results is a constant, so when the chances of error at ranks 1 and 2 decline, the chances of error at lower ranks naturally increase.

Table 5. Comparison of ranking models (using all features) optimized for nDCG and cs-nDCG, and other baseline approaches. For each run, the table reports nDCG@10, DCE@5, E@1-E@3, and the numbers of pairwise ranking errors (All, R & NR, click & skip, R+skip>R+click, NR+click>NR+skip, NR+click>R+click, NR+skip>R+click, and NR+click>R+skip). Runs: Original; Perfect (Relevance); Worst (Relevance); Perfect (cs-nDCG); Worst (cs-nDCG); QL (nDCG); QL (cs-nDCG, gp=1); CSRF (nDCG); CSRF (cs-nDCG, gp=1); QCM (nDCG); QCM (cs-nDCG, gp=1); All Features (nDCG); All Features (cs-nDCG, gp=1); All Features (cs-nDCG, gp=3). Parentheses indicate the best result in a column (excluding oracle runs). Darker and lighter shading for QL, CSRF, and QCM indicate significant differences versus All Features (cs-nDCG, gp=1) at the 0.01 and 0.05 levels. Darker and lighter shading for All Features (cs-nDCG, gp=1) and All Features (cs-nDCG, gp=3) indicate significant differences versus All Features (nDCG) at the 0.01 and 0.05 levels.

Figure 4. Effects of g_p on the effectiveness of ranking models: (a) effects of g_p on nDCG@10 and the chances of click & skip errors (DCE@5, E@1, E@2, E@3, and DCE@5 when training with nDCG); (b) effects of g_p on pairwise ranking errors (all pairwise errors, R & NR, click & skip, R+skip>R+click, and all pairwise errors when training with nDCG).

Figure 4(b) further shows the effects of g_p on the number of pairwise ranking errors. Similarly, we observe slight declines in the total number of ranking errors (All), R & NR errors, the total number of click & skip errors, and click & skip errors among non-relevant results (NR+click>NR+skip). But click & skip errors among relevant results increase. This confirms our expectation about cs-nDCG: the parameter g_p controls the scale of the adverse effect of clicking non-relevant results. Greater values of g_p prefer ranking models with fewer click errors (NR+click), but at the cost of increasing click and skip errors among the relevant results. Despite the variability of the models' effectiveness at different values of g_p, models trained using cs-nDCG consistently reduce the chances of click and skip errors compared with those trained using regular nDCG.
As Table 5 shows, both our approaches and the baselines committed significantly more click & skip errors than R & NR ones. The best approach can reduce the number of R & NR errors to only 11.6% of the total possible R & NR errors in the dataset, and achieve an nDCG@10 (0.285) very close to the best value we can reach by re-ranking the top 10 results. But there is still large room for improvement on click & skip errors: even our best approach committed, on average, about 40.5% of the possible click & skip ranking errors in the top 10 results (2.159 possible errors on average in this dataset). This indicates the limited understanding in current work, including our own study, of click & skip errors. In contrast, existing methods already achieve satisfactory performance on ranking relevant and non-relevant results, as indicated by the low R & NR errors in Table 5.
Compared with the other ad hoc search and session search baselines (e.g., QL, CSRF, and QCM), our approach consistently outperforms all of them in both the relevance of results and the chances of click and skip errors. This is not surprising, considering that much more information is involved in our features compared with the baselines. It also indicates that our feature set is effective both for relevance-based ranking of results and for click and skip error reduction.
To summarize, the results in this section demonstrate that our approach can effectively rank search results considering both result relevance and the chances of click and skip errors. Compared with existing methods, our approach can significantly reduce the chances of searcher click and skip errors by about 15% to 20%, with only a slight decrease in nDCG@10.

7.2 Effectiveness of Features
We further analyze the effectiveness of different feature sets in this section. Table 6 shows results for models trained for cs-nDCG (g_p = 3) using different sets of features.

Table 6. Ranking models using different features. Columns: nDCG@10, DCE@5, E@1, E@2, E@3, R&NR, and click&skip errors. Runs: All Features; No Session; No Summary; Webpage; Title; Snippet; URL; Current Query; Past Queries; Q Independent. Darker and lighter shading indicate significant differences vs. All Features at the 0.01 and 0.05 levels.

We first compare features using the full webpage content with those using different elements of result summaries (title, snippet, URL). As Table 6 shows, models using webpage content can achieve better relevance of results, but significantly more click and skip errors. In fact, features using only the result summary title achieved the best DCE@5, with even fewer errors than the combination of all other features, although they yield limited relevance of results. This indicates


More information

Information Retrieval

Information Retrieval Introduction to Information Retrieval Evaluation Rank-Based Measures Binary relevance Precision@K (P@K) Mean Average Precision (MAP) Mean Reciprocal Rank (MRR) Multiple levels of relevance Normalized Discounted

More information

Advanced Topics in Information Retrieval. Learning to Rank. ATIR July 14, 2016

Advanced Topics in Information Retrieval. Learning to Rank. ATIR July 14, 2016 Advanced Topics in Information Retrieval Learning to Rank Vinay Setty vsetty@mpi-inf.mpg.de Jannik Strötgen jannik.stroetgen@mpi-inf.mpg.de ATIR July 14, 2016 Before we start oral exams July 28, the full

More information

Search Engines Chapter 8 Evaluating Search Engines Felix Naumann

Search Engines Chapter 8 Evaluating Search Engines Felix Naumann Search Engines Chapter 8 Evaluating Search Engines 9.7.2009 Felix Naumann Evaluation 2 Evaluation is key to building effective and efficient search engines. Drives advancement of search engines When intuition

More information

On Duplicate Results in a Search Session

On Duplicate Results in a Search Session On Duplicate Results in a Search Session Jiepu Jiang Daqing He Shuguang Han School of Information Sciences University of Pittsburgh jiepu.jiang@gmail.com dah44@pitt.edu shh69@pitt.edu ABSTRACT In this

More information

THIS LECTURE. How do we know if our results are any good? Results summaries: Evaluating a search engine. Making our good results usable to a user

THIS LECTURE. How do we know if our results are any good? Results summaries: Evaluating a search engine. Making our good results usable to a user EVALUATION Sec. 6.2 THIS LECTURE How do we know if our results are any good? Evaluating a search engine Benchmarks Precision and recall Results summaries: Making our good results usable to a user 2 3 EVALUATING

More information

SEO: SEARCH ENGINE OPTIMISATION

SEO: SEARCH ENGINE OPTIMISATION SEO: SEARCH ENGINE OPTIMISATION SEO IN 11 BASIC STEPS EXPLAINED What is all the commotion about this SEO, why is it important? I have had a professional content writer produce my content to make sure that

More information

Informativeness for Adhoc IR Evaluation:

Informativeness for Adhoc IR Evaluation: Informativeness for Adhoc IR Evaluation: A measure that prevents assessing individual documents Romain Deveaud 1, Véronique Moriceau 2, Josiane Mothe 3, and Eric SanJuan 1 1 LIA, Univ. Avignon, France,

More information

Ryen W. White, Matthew Richardson, Mikhail Bilenko Microsoft Research Allison Heath Rice University

Ryen W. White, Matthew Richardson, Mikhail Bilenko Microsoft Research Allison Heath Rice University Ryen W. White, Matthew Richardson, Mikhail Bilenko Microsoft Research Allison Heath Rice University Users are generally loyal to one engine Even when engine switching cost is low, and even when they are

More information

for Searching Social Media Posts

for Searching Social Media Posts Mining the Temporal Statistics of Query Terms for Searching Social Media Posts ICTIR 17 Amsterdam Oct. 1 st 2017 Jinfeng Rao Ferhan Ture Xing Niu Jimmy Lin Task: Ad-hoc Search on Social Media domain Stream

More information

A Dynamic Bayesian Network Click Model for Web Search Ranking

A Dynamic Bayesian Network Click Model for Web Search Ranking A Dynamic Bayesian Network Click Model for Web Search Ranking Olivier Chapelle and Anne Ya Zhang Apr 22, 2009 18th International World Wide Web Conference Introduction Motivation Clicks provide valuable

More information

Whitepaper Spain SEO Ranking Factors 2012

Whitepaper Spain SEO Ranking Factors 2012 Whitepaper Spain SEO Ranking Factors 2012 Authors: Marcus Tober, Sebastian Weber Searchmetrics GmbH Greifswalder Straße 212 10405 Berlin Phone: +49-30-3229535-0 Fax: +49-30-3229535-99 E-Mail: info@searchmetrics.com

More information

Relevance and Effort: An Analysis of Document Utility

Relevance and Effort: An Analysis of Document Utility Relevance and Effort: An Analysis of Document Utility Emine Yilmaz, Manisha Verma Dept. of Computer Science University College London {e.yilmaz, m.verma}@cs.ucl.ac.uk Nick Craswell, Filip Radlinski, Peter

More information

Whitepaper US SEO Ranking Factors 2012

Whitepaper US SEO Ranking Factors 2012 Whitepaper US SEO Ranking Factors 2012 Authors: Marcus Tober, Sebastian Weber Searchmetrics Inc. 1115 Broadway 12th Floor, Room 1213 New York, NY 10010 Phone: 1 866-411-9494 E-Mail: sales-us@searchmetrics.com

More information

An Active Learning Approach to Efficiently Ranking Retrieval Engines

An Active Learning Approach to Efficiently Ranking Retrieval Engines Dartmouth College Computer Science Technical Report TR3-449 An Active Learning Approach to Efficiently Ranking Retrieval Engines Lisa A. Torrey Department of Computer Science Dartmouth College Advisor:

More information

Information Retrieval

Information Retrieval Multimedia Computing: Algorithms, Systems, and Applications: Information Retrieval and Search Engine By Dr. Yu Cao Department of Computer Science The University of Massachusetts Lowell Lowell, MA 01854,

More information

Information Retrieval

Information Retrieval Introduction to Information Retrieval CS3245 Information Retrieval Lecture 9: IR Evaluation 9 Ch. 7 Last Time The VSM Reloaded optimized for your pleasure! Improvements to the computation and selection

More information

Evaluating an Associative Browsing Model for Personal Information

Evaluating an Associative Browsing Model for Personal Information Evaluating an Associative Browsing Model for Personal Information Jinyoung Kim, W. Bruce Croft, David A. Smith and Anton Bakalov Department of Computer Science University of Massachusetts Amherst {jykim,croft,dasmith,abakalov}@cs.umass.edu

More information

HUKB at NTCIR-12 IMine-2 task: Utilization of Query Analysis Results and Wikipedia Data for Subtopic Mining

HUKB at NTCIR-12 IMine-2 task: Utilization of Query Analysis Results and Wikipedia Data for Subtopic Mining HUKB at NTCIR-12 IMine-2 task: Utilization of Query Analysis Results and Wikipedia Data for Subtopic Mining Masaharu Yoshioka Graduate School of Information Science and Technology, Hokkaido University

More information

A Comparative Analysis of Cascade Measures for Novelty and Diversity

A Comparative Analysis of Cascade Measures for Novelty and Diversity A Comparative Analysis of Cascade Measures for Novelty and Diversity Charles Clarke, University of Waterloo Nick Craswell, Microsoft Ian Soboroff, NIST Azin Ashkan, University of Waterloo Background Measuring

More information

A Task Level Metric for Measuring Web Search Satisfaction and its Application on Improving Relevance Estimation

A Task Level Metric for Measuring Web Search Satisfaction and its Application on Improving Relevance Estimation A Task Level Metric for Measuring Web Search Satisfaction and its Application on Improving Relevance Estimation Ahmed Hassan Microsoft Research Redmond, WA hassanam@microsoft.com Yang Song Microsoft Research

More information

Evaluation. David Kauchak cs160 Fall 2009 adapted from:

Evaluation. David Kauchak cs160 Fall 2009 adapted from: Evaluation David Kauchak cs160 Fall 2009 adapted from: http://www.stanford.edu/class/cs276/handouts/lecture8-evaluation.ppt Administrative How are things going? Slides Points Zipf s law IR Evaluation For

More information

Effective Structured Query Formulation for Session Search

Effective Structured Query Formulation for Session Search Effective Structured Query Formulation for Session Search Dongyi Guan Hui Yang Nazli Goharian Department of Computer Science Georgetown University 37 th and O Street, NW, Washington, DC, 20057 dg372@georgetown.edu,

More information

Information Retrieval

Information Retrieval Information Retrieval CSC 375, Fall 2016 An information retrieval system will tend not to be used whenever it is more painful and troublesome for a customer to have information than for him not to have

More information

Towards Predicting Web Searcher Gaze Position from Mouse Movements

Towards Predicting Web Searcher Gaze Position from Mouse Movements Towards Predicting Web Searcher Gaze Position from Mouse Movements Qi Guo Emory University 400 Dowman Dr., W401 Atlanta, GA 30322 USA qguo3@emory.edu Eugene Agichtein Emory University 400 Dowman Dr., W401

More information

Information Retrieval and Web Search

Information Retrieval and Web Search Information Retrieval and Web Search IR Evaluation and IR Standard Text Collections Instructor: Rada Mihalcea Some slides in this section are adapted from lectures by Prof. Ray Mooney (UT) and Prof. Razvan

More information

Query Likelihood with Negative Query Generation

Query Likelihood with Negative Query Generation Query Likelihood with Negative Query Generation Yuanhua Lv Department of Computer Science University of Illinois at Urbana-Champaign Urbana, IL 61801 ylv2@uiuc.edu ChengXiang Zhai Department of Computer

More information

From Passages into Elements in XML Retrieval

From Passages into Elements in XML Retrieval From Passages into Elements in XML Retrieval Kelly Y. Itakura David R. Cheriton School of Computer Science, University of Waterloo 200 Univ. Ave. W. Waterloo, ON, Canada yitakura@cs.uwaterloo.ca Charles

More information

A RECOMMENDER SYSTEM FOR SOCIAL BOOK SEARCH

A RECOMMENDER SYSTEM FOR SOCIAL BOOK SEARCH A RECOMMENDER SYSTEM FOR SOCIAL BOOK SEARCH A thesis Submitted to the faculty of the graduate school of the University of Minnesota by Vamshi Krishna Thotempudi In partial fulfillment of the requirements

More information

University of Delaware at Diversity Task of Web Track 2010

University of Delaware at Diversity Task of Web Track 2010 University of Delaware at Diversity Task of Web Track 2010 Wei Zheng 1, Xuanhui Wang 2, and Hui Fang 1 1 Department of ECE, University of Delaware 2 Yahoo! Abstract We report our systems and experiments

More information

CS54701: Information Retrieval

CS54701: Information Retrieval CS54701: Information Retrieval Basic Concepts 19 January 2016 Prof. Chris Clifton 1 Text Representation: Process of Indexing Remove Stopword, Stemming, Phrase Extraction etc Document Parser Extract useful

More information

Evaluating search engines CE-324: Modern Information Retrieval Sharif University of Technology

Evaluating search engines CE-324: Modern Information Retrieval Sharif University of Technology Evaluating search engines CE-324: Modern Information Retrieval Sharif University of Technology M. Soleymani Fall 2016 Most slides have been adapted from: Profs. Manning, Nayak & Raghavan (CS-276, Stanford)

More information

Classification and retrieval of biomedical literatures: SNUMedinfo at CLEF QA track BioASQ 2014

Classification and retrieval of biomedical literatures: SNUMedinfo at CLEF QA track BioASQ 2014 Classification and retrieval of biomedical literatures: SNUMedinfo at CLEF QA track BioASQ 2014 Sungbin Choi, Jinwook Choi Medical Informatics Laboratory, Seoul National University, Seoul, Republic of

More information

Chapter 27 Introduction to Information Retrieval and Web Search

Chapter 27 Introduction to Information Retrieval and Web Search Chapter 27 Introduction to Information Retrieval and Web Search Copyright 2011 Pearson Education, Inc. Publishing as Pearson Addison-Wesley Chapter 27 Outline Information Retrieval (IR) Concepts Retrieval

More information

Overview of the NTCIR-13 OpenLiveQ Task

Overview of the NTCIR-13 OpenLiveQ Task Overview of the NTCIR-13 OpenLiveQ Task Makoto P. Kato, Takehiro Yamamoto (Kyoto University), Sumio Fujita, Akiomi Nishida, Tomohiro Manabe (Yahoo Japan Corporation) Agenda Task Design (3 slides) Data

More information

Evaluating Search Engines by Modeling the Relationship Between Relevance and Clicks

Evaluating Search Engines by Modeling the Relationship Between Relevance and Clicks Evaluating Search Engines by Modeling the Relationship Between Relevance and Clicks Ben Carterette Center for Intelligent Information Retrieval University of Massachusetts Amherst Amherst, MA 01003 carteret@cs.umass.edu

More information

Navigation Retrieval with Site Anchor Text

Navigation Retrieval with Site Anchor Text Navigation Retrieval with Site Anchor Text Hideki Kawai Kenji Tateishi Toshikazu Fukushima NEC Internet Systems Research Labs. 8916-47, Takayama-cho, Ikoma-city, Nara, JAPAN {h-kawai@ab, k-tateishi@bq,

More information

Whitepaper Italy SEO Ranking Factors 2012

Whitepaper Italy SEO Ranking Factors 2012 Whitepaper Italy SEO Ranking Factors 2012 Authors: Marcus Tober, Sebastian Weber Searchmetrics GmbH Greifswalder Straße 212 10405 Berlin Phone: +49-30-3229535-0 Fax: +49-30-3229535-99 E-Mail: info@searchmetrics.com

More information

Automatic Query Type Identification Based on Click Through Information

Automatic Query Type Identification Based on Click Through Information Automatic Query Type Identification Based on Click Through Information Yiqun Liu 1,MinZhang 1,LiyunRu 2, and Shaoping Ma 1 1 State Key Lab of Intelligent Tech. & Sys., Tsinghua University, Beijing, China

More information

Better Contextual Suggestions in ClueWeb12 Using Domain Knowledge Inferred from The Open Web

Better Contextual Suggestions in ClueWeb12 Using Domain Knowledge Inferred from The Open Web Better Contextual Suggestions in ClueWeb12 Using Domain Knowledge Inferred from The Open Web Thaer Samar 1, Alejandro Bellogín 2, and Arjen P. de Vries 1 1 Centrum Wiskunde & Informatica, {samar,arjen}@cwi.nl

More information

TREC 2017 Dynamic Domain Track Overview

TREC 2017 Dynamic Domain Track Overview TREC 2017 Dynamic Domain Track Overview Grace Hui Yang Zhiwen Tang Ian Soboroff Georgetown University Georgetown University NIST huiyang@cs.georgetown.edu zt79@georgetown.edu ian.soboroff@nist.gov 1. Introduction

More information

Advanced Search Techniques for Large Scale Data Analytics Pavel Zezula and Jan Sedmidubsky Masaryk University

Advanced Search Techniques for Large Scale Data Analytics Pavel Zezula and Jan Sedmidubsky Masaryk University Advanced Search Techniques for Large Scale Data Analytics Pavel Zezula and Jan Sedmidubsky Masaryk University http://disa.fi.muni.cz The Cranfield Paradigm Retrieval Performance Evaluation Evaluation Using

More information

Detecting Good Abandonment in Mobile Search

Detecting Good Abandonment in Mobile Search Detecting Good Abandonment in Mobile Search Kyle Williams, Julia Kiseleva, Aidan C. Crook Γ, Imed Zitouni Γ, Ahmed Hassan Awadallah Γ, Madian Khabsa Γ The Pennsylvania State University, University Park,

More information

A Formal Approach to Score Normalization for Meta-search

A Formal Approach to Score Normalization for Meta-search A Formal Approach to Score Normalization for Meta-search R. Manmatha and H. Sever Center for Intelligent Information Retrieval Computer Science Department University of Massachusetts Amherst, MA 01003

More information

An Investigation of Basic Retrieval Models for the Dynamic Domain Task

An Investigation of Basic Retrieval Models for the Dynamic Domain Task An Investigation of Basic Retrieval Models for the Dynamic Domain Task Razieh Rahimi and Grace Hui Yang Department of Computer Science, Georgetown University rr1042@georgetown.edu, huiyang@cs.georgetown.edu

More information

Robust Relevance-Based Language Models

Robust Relevance-Based Language Models Robust Relevance-Based Language Models Xiaoyan Li Department of Computer Science, Mount Holyoke College 50 College Street, South Hadley, MA 01075, USA Email: xli@mtholyoke.edu ABSTRACT We propose a new

More information

nding that simple gloss (i.e., word-by-word) translations allowed users to outperform a Naive Bayes classier [3]. In the other study, Ogden et al., ev

nding that simple gloss (i.e., word-by-word) translations allowed users to outperform a Naive Bayes classier [3]. In the other study, Ogden et al., ev TREC-9 Experiments at Maryland: Interactive CLIR Douglas W. Oard, Gina-Anne Levow, y and Clara I. Cabezas, z University of Maryland, College Park, MD, 20742 Abstract The University of Maryland team participated

More information

Window Extraction for Information Retrieval

Window Extraction for Information Retrieval Window Extraction for Information Retrieval Samuel Huston Center for Intelligent Information Retrieval University of Massachusetts Amherst Amherst, MA, 01002, USA sjh@cs.umass.edu W. Bruce Croft Center

More information

Predicting Query Performance on the Web

Predicting Query Performance on the Web Predicting Query Performance on the Web No Author Given Abstract. Predicting performance of queries has many useful applications like automatic query reformulation and automatic spell correction. However,

More information

Information Retrieval. (M&S Ch 15)

Information Retrieval. (M&S Ch 15) Information Retrieval (M&S Ch 15) 1 Retrieval Models A retrieval model specifies the details of: Document representation Query representation Retrieval function Determines a notion of relevance. Notion

More information

Northeastern University in TREC 2009 Web Track

Northeastern University in TREC 2009 Web Track Northeastern University in TREC 2009 Web Track Shahzad Rajput, Evangelos Kanoulas, Virgil Pavlu, Javed Aslam College of Computer and Information Science, Northeastern University, Boston, MA, USA Information

More information

Overview of the TREC 2013 Crowdsourcing Track

Overview of the TREC 2013 Crowdsourcing Track Overview of the TREC 2013 Crowdsourcing Track Mark D. Smucker 1, Gabriella Kazai 2, and Matthew Lease 3 1 Department of Management Sciences, University of Waterloo 2 Microsoft Research, Cambridge, UK 3

More information

Better Contextual Suggestions in ClueWeb12 Using Domain Knowledge Inferred from The Open Web

Better Contextual Suggestions in ClueWeb12 Using Domain Knowledge Inferred from The Open Web Better Contextual Suggestions in ClueWeb12 Using Domain Knowledge Inferred from The Open Web Thaer Samar 1, Alejandro Bellogín 2, and Arjen P. de Vries 1 1 Centrum Wiskunde & Informatica, {samar,arjen}@cwi.nl

More information

Information Retrieval

Information Retrieval Information Retrieval ETH Zürich, Fall 2012 Thomas Hofmann LECTURE 6 EVALUATION 24.10.2012 Information Retrieval, ETHZ 2012 1 Today s Overview 1. User-Centric Evaluation 2. Evaluation via Relevance Assessment

More information

UMass at TREC 2017 Common Core Track

UMass at TREC 2017 Common Core Track UMass at TREC 2017 Common Core Track Qingyao Ai, Hamed Zamani, Stephen Harding, Shahrzad Naseri, James Allan and W. Bruce Croft Center for Intelligent Information Retrieval College of Information and Computer

More information

Information Retrieval

Information Retrieval Information Retrieval Lecture 7 - Evaluation in Information Retrieval Seminar für Sprachwissenschaft International Studies in Computational Linguistics Wintersemester 2007 1/ 29 Introduction Framework

More information

Information Retrieval. Lecture 7 - Evaluation in Information Retrieval. Introduction. Overview. Standard test collection. Wintersemester 2007

Information Retrieval. Lecture 7 - Evaluation in Information Retrieval. Introduction. Overview. Standard test collection. Wintersemester 2007 Information Retrieval Lecture 7 - Evaluation in Information Retrieval Seminar für Sprachwissenschaft International Studies in Computational Linguistics Wintersemester 2007 1 / 29 Introduction Framework

More information

5 Choosing keywords Initially choosing keywords Frequent and rare keywords Evaluating the competition rates of search

5 Choosing keywords Initially choosing keywords Frequent and rare keywords Evaluating the competition rates of search Seo tutorial Seo tutorial Introduction to seo... 4 1. General seo information... 5 1.1 History of search engines... 5 1.2 Common search engine principles... 6 2. Internal ranking factors... 8 2.1 Web page

More information

High Accuracy Retrieval with Multiple Nested Ranker

High Accuracy Retrieval with Multiple Nested Ranker High Accuracy Retrieval with Multiple Nested Ranker Irina Matveeva University of Chicago 5801 S. Ellis Ave Chicago, IL 60637 matveeva@uchicago.edu Chris Burges Microsoft Research One Microsoft Way Redmond,

More information

Overview of the NTCIR-13 OpenLiveQ Task

Overview of the NTCIR-13 OpenLiveQ Task Overview of the NTCIR-13 OpenLiveQ Task ABSTRACT Makoto P. Kato Kyoto University mpkato@acm.org Akiomi Nishida Yahoo Japan Corporation anishida@yahoo-corp.jp This is an overview of the NTCIR-13 OpenLiveQ

More information

Natural Language Processing

Natural Language Processing Natural Language Processing Information Retrieval Potsdam, 14 June 2012 Saeedeh Momtazi Information Systems Group based on the slides of the course book Outline 2 1 Introduction 2 Indexing Block Document

More information

6.2 DATA DISTRIBUTION AND EXPERIMENT DETAILS

6.2 DATA DISTRIBUTION AND EXPERIMENT DETAILS Chapter 6 Indexing Results 6. INTRODUCTION The generation of inverted indexes for text databases is a computationally intensive process that requires the exclusive use of processing resources for long

More information

Fighting Search Engine Amnesia: Reranking Repeated Results

Fighting Search Engine Amnesia: Reranking Repeated Results Fighting Search Engine Amnesia: Reranking Repeated Results ABSTRACT Milad Shokouhi Microsoft milads@microsoft.com Paul Bennett Microsoft Research pauben@microsoft.com Web search engines frequently show

More information

doi: / _32

doi: / _32 doi: 10.1007/978-3-319-12823-8_32 Simple Document-by-Document Search Tool Fuwatto Search using Web API Masao Takaku 1 and Yuka Egusa 2 1 University of Tsukuba masao@slis.tsukuba.ac.jp 2 National Institute

More information

Search Costs vs. User Satisfaction on Mobile

Search Costs vs. User Satisfaction on Mobile Search Costs vs. User Satisfaction on Mobile Manisha Verma, Emine Yilmaz University College London mverma@cs.ucl.ac.uk, emine.yilmaz@ucl.ac.uk Abstract. Information seeking is an interactive process where

More information

Incorporating Satellite Documents into Co-citation Networks for Scientific Paper Searches

Incorporating Satellite Documents into Co-citation Networks for Scientific Paper Searches Incorporating Satellite Documents into Co-citation Networks for Scientific Paper Searches Masaki Eto Gakushuin Women s College Tokyo, Japan masaki.eto@gakushuin.ac.jp Abstract. To improve the search performance

More information

Estimating Embedding Vectors for Queries

Estimating Embedding Vectors for Queries Estimating Embedding Vectors for Queries Hamed Zamani Center for Intelligent Information Retrieval College of Information and Computer Sciences University of Massachusetts Amherst Amherst, MA 01003 zamani@cs.umass.edu

More information

Interactive Campaign Planning for Marketing Analysts

Interactive Campaign Planning for Marketing Analysts Interactive Campaign Planning for Marketing Analysts Fan Du University of Maryland College Park, MD, USA fan@cs.umd.edu Sana Malik Adobe Research San Jose, CA, USA sana.malik@adobe.com Eunyee Koh Adobe

More information

Part 7: Evaluation of IR Systems Francesco Ricci

Part 7: Evaluation of IR Systems Francesco Ricci Part 7: Evaluation of IR Systems Francesco Ricci Most of these slides comes from the course: Information Retrieval and Web Search, Christopher Manning and Prabhakar Raghavan 1 This lecture Sec. 6.2 p How

More information

Improving Patent Search by Search Result Diversification

Improving Patent Search by Search Result Diversification Improving Patent Search by Search Result Diversification Youngho Kim University of Massachusetts Amherst yhkim@cs.umass.edu W. Bruce Croft University of Massachusetts Amherst croft@cs.umass.edu ABSTRACT

More information

Indexing and Query Processing

Indexing and Query Processing Indexing and Query Processing Jaime Arguello INLS 509: Information Retrieval jarguell@email.unc.edu January 28, 2013 Basic Information Retrieval Process doc doc doc doc doc information need document representation

More information

Information Retrieval

Information Retrieval Introduction to Information Retrieval Lecture 5: Evaluation Ruixuan Li http://idc.hust.edu.cn/~rxli/ Sec. 6.2 This lecture How do we know if our results are any good? Evaluating a search engine Benchmarks

More information

Heading-aware Snippet Generation for Web Search

Heading-aware Snippet Generation for Web Search Heading-aware Snippet Generation for Web Search Tomohiro Manabe and Keishi Tajima Graduate School of Informatics, Kyoto Univ. {manabe@dl.kuis, tajima@i}.kyoto-u.ac.jp Web Search Result Snippets Are short

More information

Experiments on Related Entity Finding Track at TREC 2009 Qing Yang,Peng Jiang, Chunxia Zhang, Zhendong Niu

Experiments on Related Entity Finding Track at TREC 2009 Qing Yang,Peng Jiang, Chunxia Zhang, Zhendong Niu Experiments on Related Entity Finding Track at TREC 2009 Qing Yang,Peng Jiang, Chunxia Zhang, Zhendong Niu School of Computer, Beijing Institute of Technology { yangqing2005,jp, cxzhang, zniu}@bit.edu.cn

More information

Experiments with ClueWeb09: Relevance Feedback and Web Tracks

Experiments with ClueWeb09: Relevance Feedback and Web Tracks Experiments with ClueWeb09: Relevance Feedback and Web Tracks Mark D. Smucker 1, Charles L. A. Clarke 2, and Gordon V. Cormack 2 1 Department of Management Sciences, University of Waterloo 2 David R. Cheriton

More information

Information Retrieval: Retrieval Models

Information Retrieval: Retrieval Models CS473: Web Information Retrieval & Management CS-473 Web Information Retrieval & Management Information Retrieval: Retrieval Models Luo Si Department of Computer Science Purdue University Retrieval Models

More information

Information Retrieval

Information Retrieval Natural Language Processing SoSe 2015 Information Retrieval Dr. Mariana Neves June 22nd, 2015 (based on the slides of Dr. Saeedeh Momtazi) Outline Introduction Indexing Block 2 Document Crawling Text Processing

More information