Improving Web Search Ranking by Incorporating User Behavior Information


Eugene Agichtein, Microsoft Research
Eric Brill, Microsoft Research
Susan Dumais, Microsoft Research

ABSTRACT
We show that incorporating user behavior data can significantly improve the ordering of top results in a real web search setting. We examine alternatives for incorporating feedback into the ranking process and explore the contributions of user feedback compared to other common web search features. We report results of a large-scale evaluation over 3,000 queries and 12 million user interactions with a popular web search engine. We show that incorporating implicit feedback can augment other features, improving the accuracy of a competitive web search ranking algorithm by as much as 31% relative to the original performance.

Categories and Subject Descriptors
H.3.3 Information Search and Retrieval: Relevance feedback, search process; H.3.5 Online Information Services: Web-based services.

General Terms
Algorithms, Measurement, Experimentation

Keywords
Web search, implicit relevance feedback, web search ranking.

1. INTRODUCTION
Millions of users interact with search engines daily. They issue queries, follow some of the links in the results, click on ads, spend time on pages, reformulate their queries, and perform other actions. These interactions can serve as a valuable source of information for tuning and improving web search result ranking and can complement more costly explicit judgments. Implicit relevance feedback for ranking and personalization has become an active area of research. Recent work by Joachims and others exploring implicit feedback in controlled environments has shown the value of incorporating implicit feedback into the ranking process.
Our motivation for this work is to understand how implicit feedback can be used in a large-scale operational environment to improve retrieval. How does it compare to and complement evidence from page content, anchor text, or link-based features such as inlinks or PageRank? While it is intuitive that user interactions with the web search engine should reveal at least some information that could be used for ranking, estimating user preferences in real web search settings is a challenging problem, since real user interactions tend to be noisier than commonly assumed in the controlled settings of previous studies.

Our paper explores whether implicit feedback can be helpful in realistic environments, where user feedback can be noisy (or adversarial) and a web search engine already uses hundreds of features and is heavily tuned. To this end, we explore different approaches for ranking web search results using real user behavior obtained as part of normal interactions with the web search engine. The specific contributions of this paper include:

- Analysis of alternatives for incorporating user behavior into web search ranking (Section 3).
- An application of a robust implicit feedback model derived from mining millions of user interactions with a major web search engine (Section 4).
- A large-scale evaluation over real user queries and search results, showing significant improvements derived from incorporating user feedback (Section 6).

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. SIGIR 06, August 6 11, 2006, Seattle, Washington, USA. Copyright 2006 ACM /06/ $5.00.
We summarize our findings and discuss extensions to the current work in Section 7, which concludes the paper.

2. BACKGROUND AND RELATED WORK
Ranking search results is a fundamental problem in information retrieval. Most common approaches primarily focus on similarity of query and a page, as well as the overall page quality [3,4,24]. However, with increasing popularity of search engines, implicit feedback (i.e., the actions users take when interacting with the search engine) can be used to improve the rankings.

Implicit relevance measures have been studied by several research groups. An overview of implicit measures is compiled in Kelly and Teevan [14]. This research, while developing valuable insights into implicit relevance measures, was not applied to improve the ranking of web search results in realistic settings.

Closely related to our work, Joachims [11] collected implicit measures in place of explicit measures, introducing a technique based entirely on clickthrough data to learn ranking functions. Fox et al. [8] explored the relationship between implicit and explicit measures in Web search, and developed Bayesian models to

19 ACM SIGIR Forum 11 Vol. 52 No. 2 December 2018

correlate implicit measures and explicit relevance judgments for both individual queries and search sessions. This work considered a wide range of user behaviors (e.g., dwell time, scroll time, reformulation patterns) in addition to the popular clickthrough behavior. However, the modeling effort was aimed at predicting explicit relevance judgments from implicit user actions and not specifically at learning ranking functions. Other studies of user behavior in web search include Pharo and Järvelin [19], but were not directly applied to improve ranking.

More recently, Joachims et al. [12] presented an empirical evaluation of interpreting clickthrough evidence. By performing eye tracking studies and correlating predictions of their strategies with explicit ratings, the authors showed that it is possible to accurately interpret clickthroughs in a controlled, laboratory setting. Unfortunately, the extent to which previous research applies to real-world web search is unclear. At the same time, while recent work (e.g., [26]) on using clickthrough information for improving web search ranking is promising, it captures only one aspect of the user interactions with web search engines.

We build on existing research to develop robust user behavior interpretation techniques for the real web search setting. Instead of treating each user as a reliable expert, we aggregate information from multiple, unreliable, user search session traces, as we describe in the next two sections.

3. INCORPORATING IMPLICIT FEEDBACK
We consider two complementary approaches to ranking with implicit feedback: (1) treating implicit feedback as independent evidence for ranking results, and (2) integrating implicit feedback features directly into the ranking algorithm. We describe the two general ranking approaches next.
The specific implicit feedback features are described in Section 4, and the algorithms for interpreting and incorporating implicit feedback are described in Section 5.

3.1 Implicit Feedback as Independent Evidence
The general approach is to re-rank the results obtained by a web search engine according to observed clickthrough and other user interactions for the query in previous search sessions. Each result is assigned a score according to expected relevance/user satisfaction based on previous interactions, resulting in some preference ordering based on user interactions alone.

While there has been significant work on merging multiple rankings, we adopt a simple and robust approach of ignoring the original rankers' scores, and instead simply merge the rank orders. The main reason for ignoring the original scores is that since the feature spaces and learning algorithms are different, the scores are not directly comparable, and re-normalization tends to remove the benefit of incorporating classifier scores.

We experimented with a variety of merging functions on the development set of queries (and using a set of interactions from a different time period from final evaluation sets). We found that a simple rank merging heuristic combination works well, and is robust to variations in score values from original rankers. For a given query q, the implicit score $IS_d$ is computed for each result d from available user interaction features, resulting in the implicit rank $I_d$ for each result. We compute a merged score $S_M(d)$ for d by combining the ranks obtained from implicit feedback, $I_d$, with the original rank of d, $O_d$:

\[
S_M(d, I_d, O_d, w_I) =
\begin{cases}
w_I \dfrac{1}{I_d + 1} + \dfrac{1}{O_d + 1} & \text{if implicit feedback exists for } d \\[2ex]
\dfrac{1}{O_d + 1} & \text{otherwise}
\end{cases}
\]

where the weight $w_I$ is a heuristically tuned scaling factor representing the relative importance of the implicit feedback. The query results are ordered by decreasing values of $S_M$ to produce the final ranking.
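The merged score of Section 3.1 can be sketched in a few lines of Python. This is only an illustration under stated assumptions: ranks are taken to start at 1, the function names `merged_score` and `rerank` are hypothetical, and results without implicit feedback are simply absent from the `implicit_ranks` mapping.

```python
def merged_score(orig_rank, implicit_rank=None, w_i=3.0):
    """S_M = w_I / (I_d + 1) + 1 / (O_d + 1) when implicit feedback exists,
    and 1 / (O_d + 1) otherwise. Ranks are assumed to be 1-based."""
    score = 1.0 / (orig_rank + 1)
    if implicit_rank is not None:
        score += w_i / (implicit_rank + 1)
    return score


def rerank(results, implicit_ranks, w_i=3.0):
    """Reorder `results` (document ids in original rank order) by decreasing S_M.

    `implicit_ranks` maps a document id to its implicit rank I_d; documents
    with no recorded interactions are missing from the mapping.
    """
    scored = [
        (merged_score(orig_rank, implicit_ranks.get(doc), w_i), doc)
        for orig_rank, doc in enumerate(results, start=1)
    ]
    # sorted() is stable, so ties keep the original order.
    return [doc for _, doc in sorted(scored, key=lambda pair: -pair[0])]


if __name__ == "__main__":
    original = ["a", "b", "c", "d"]
    feedback = {"c": 1, "d": 2}  # only c and d have implicit feedback
    print(rerank(original, feedback, w_i=3.0))
    print(rerank(original, feedback, w_i=1000.0))
```

With a very large `w_i`, every result that has implicit feedback is placed above every result that has none, while a small `w_i` leaves the original order largely intact.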
One special case of this model arises when setting $w_I$ to a very large value, effectively forcing clicked results to be ranked higher than un-clicked results, an intuitive and effective heuristic that we will use as a baseline. Applying more sophisticated classifier and ranker combination algorithms may result in additional improvements, and is a promising direction for future work.

The approach above assumes that there are no interactions between the underlying features producing the original web search ranking and the implicit feedback features. We now relax this assumption by integrating implicit feedback features directly into the ranking process.

3.2 Ranking with Implicit Feedback Features
Modern web search engines rank results based on a large number of features, including content-based features (i.e., how closely a query matches the text or title or anchor text of the document), and query-independent page quality features (e.g., PageRank of the document or the domain). In most cases, automatic (or semi-automatic) methods are developed for tuning the specific ranking function that combines these feature values. Hence, a natural approach is to incorporate implicit feedback features directly as features for the ranking algorithm. During training or tuning, the ranker can be tuned as before but with additional features. At runtime, the search engine would fetch the implicit feedback features associated with each query-result URL pair. This model requires a ranking algorithm to be robust to missing values: more than 50% of queries to web search engines are unique, with no previous implicit feedback available. We now describe such a ranker that we used to learn over the combined feature sets including implicit feedback.

3.3 Learning to Rank Web Search Results
A key aspect of our approach is exploiting recent advances in machine learning, namely trainable ranking algorithms for web search and information retrieval (e.g., [5, 11] and classical results reviewed in [3]).
In our setting, explicit human relevance judgments (labels) are available for a set of web search queries and results. Hence, an attractive choice is a supervised machine learning technique that learns a ranking function that best predicts relevance judgments.

RankNet is one such algorithm. It is a neural net tuning algorithm that optimizes feature weights to best match explicitly provided pairwise user preferences. While the specific training algorithms used by RankNet are beyond the scope of this paper, it is described in detail in [5], which includes extensive evaluation and comparison with other ranking methods. An attractive feature of RankNet is both train- and run-time efficiency: runtime ranking can be

quickly computed and can scale to the web, and training can be done over thousands of queries and associated judged results.

We use a 2-layer implementation of RankNet in order to model non-linear relationships between features. Furthermore, RankNet can learn with many (differentiable) cost functions, and hence can automatically learn a ranking function from human-provided labels, an attractive alternative to heuristic feature combination techniques. Hence, we will also use RankNet as a generic ranker to explore the contribution of implicit feedback for different ranking alternatives.

4. IMPLICIT USER FEEDBACK MODEL
Our goal is to accurately interpret noisy user feedback obtained by tracing user interactions with the search engine. Interpreting implicit feedback in a real web search setting is not an easy task. We characterize this problem in detail in [1], where we motivate and evaluate a wide variety of models of implicit user activities. The general approach is to represent user actions for each search result as a vector of features, and then train a ranker on these features to discover feature values indicative of relevant (and non-relevant) search results. We first briefly summarize our features and model, and the learning approach (Section 4.2), in order to provide sufficient information to replicate our ranking methods and the subsequent experiments.

4.1 Representing User Actions as Features
We model observed web search behaviors as a combination of a "background" component (i.e., query- and relevance-independent noise in user behavior, including positional biases with result interactions), and a "relevance" component (i.e., query-specific behavior indicative of relevance of a result to a query). We design our features to take advantage of aggregated user behavior.
The feature set comprises directly observed features (computed directly from observations for each query), as well as query-specific derived features, computed as the deviation from the overall query-independent distribution of values for the corresponding directly observed feature values.

The features used to represent user interactions with web search results are summarized in Table 4.1. This information was obtained via opt-in client-side instrumentation from users of a major web search engine.

We include the traditional implicit feedback features such as clickthrough counts for the results, as well as our novel derived features such as the deviation of the observed clickthrough number for a given query-URL pair from the expected number of clicks on a result in the given position. We also model the browsing behavior after a result was clicked, e.g., the average page dwell time for a given query-URL pair, as well as its deviation from the expected (average) dwell time. Furthermore, the feature set was designed to provide essential information about the user experience to make feedback interpretation robust. For example, web search users can often determine whether a result is relevant by looking at the result title, URL, and summary; in many cases, looking at the original document is not necessary. To model this aspect of user experience we include features such as overlap in words in title and words in query (TitleOverlap) and the fraction of words shared by the query and the result summary.
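As an illustration of a derived feature, the sketch below computes a click deviation for one query-URL pair: the observed click rate minus the click rate expected from position alone. The log format (tuples of query, URL, position, clicked) and the function names are assumptions made for this example; the exact estimator behind the paper's ClickDeviation feature is not spelled out in this section.

```python
from collections import defaultdict


def expected_click_prob_by_position(log):
    """Background model: fraction of impressions clicked at each position,
    pooled over all queries. `log` is a list of (query, url, position, clicked)."""
    shown = defaultdict(int)
    clicks = defaultdict(int)
    for _query, _url, position, clicked in log:
        shown[position] += 1
        clicks[position] += int(clicked)
    return {pos: clicks[pos] / shown[pos] for pos in shown}


def click_deviation(log, query, url):
    """Observed click rate for (query, url) minus the rate expected from the
    positions at which the result was shown. Returns None if never shown."""
    background = expected_click_prob_by_position(log)
    rows = [(pos, clicked) for q, u, pos, clicked in log if q == query and u == url]
    if not rows:
        return None
    observed = sum(int(clicked) for _pos, clicked in rows) / len(rows)
    expected = sum(background[pos] for pos, _clicked in rows) / len(rows)
    return observed - expected


if __name__ == "__main__":
    toy_log = [
        ("q1", "u1", 1, True),
        ("q1", "u2", 2, False),
        ("q2", "u3", 1, False),
        ("q2", "u4", 2, True),
        ("q2", "u4", 2, True),
    ]
    print(click_deviation(toy_log, "q2", "u4"))
```

A positive value means the result attracts more clicks than its position alone would predict, which is the kind of query-specific signal the derived features are meant to capture.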
Clickthrough features
  Position              Position of the URL in current ranking
  ClickFrequency        Number of clicks for this query, URL pair
  ClickProbability      Probability of a click for this query and URL
  ClickDeviation        Deviation from expected click probability
  IsNextClicked         1 if clicked on next position, 0 otherwise
  IsPreviousClicked     1 if clicked on previous position, 0 otherwise
  IsClickAbove          1 if there is a click above, 0 otherwise
  IsClickBelow          1 if there is a click below, 0 otherwise

Browsing features
  TimeOnPage            Page dwell time
  CumulativeTimeOnPage  Cumulative time for all subsequent pages after search
  TimeOnDomain          Cumulative dwell time for this domain
  TimeOnShortUrl        Cumulative time on URL prefix, no parameters
  IsFollowedLink        1 if followed link to result, 0 otherwise
  IsExactUrlMatch       0 if aggressive normalization used, 1 otherwise
  IsRedirected          1 if initial URL same as final URL, 0 otherwise
  IsPathFromSearch      1 if only followed links after query, 0 otherwise
  ClicksFromSearch      Number of hops to reach page from query
  AverageDwellTime      Average time on page for this query
  DwellTimeDeviation    Deviation from average dwell time on page
  CumulativeDeviation   Deviation from average cumulative dwell time
  DomainDeviation       Deviation from average dwell time on domain

Query-text features
  TitleOverlap          Words shared between query and title
  SummaryOverlap        Words shared between query and snippet
  QueryURLOverlap       Words shared between query and URL
  QueryDomainOverlap    Words shared between query and URL domain
  QueryLength           Number of tokens in query
  QueryNextOverlap      Fraction of words shared with next query

Table 4.1: Some features used to represent post-search navigation history for a given query and search result URL.

Having described our feature set, we briefly review our general method for deriving a user behavior model.

4.2 Deriving a User Feedback Model
To learn to interpret the observed user behavior, we correlate user actions (i.e., the features in Table 4.1 representing the actions) with the explicit user judgments for a set of training queries.
We find all the instances in our session logs where these queries were submitted to the search engine, and aggregate the user behavior features for all search sessions involving these queries.

Each observed query-URL pair is represented by the features in Table 4.1, with values averaged over all search sessions, and assigned one of six possible relevance labels, ranging from Perfect to Bad, as assigned by explicit relevance judgments. These labeled feature vectors are used as input to the RankNet training algorithm (Section 3.3), which produces a trained user behavior model. This approach is particularly attractive as it does not require heuristics beyond feature engineering. The resulting user behavior model is used to help rank web search results either directly or in combination with other features, as described below.

5. EXPERIMENTAL SETUP
The ultimate goal of incorporating implicit feedback into ranking is to improve the relevance of the returned web search results. Hence, we compare the ranking methods over a large set of judged queries

with explicit relevance labels provided by human judges. In order for the evaluation to be realistic we obtained a random sample of queries from web search logs of a major search engine, with associated results and traces for user actions. We describe this dataset in detail next. The metrics we use to evaluate the ranking alternatives are described in Section 5.2, the alternatives themselves are listed in Section 5.3, and the experiments are reported in Section 6.

5.1 Datasets
We compared our ranking methods over a random sample of 3,000 queries from the search engine query logs. The queries were drawn from the logs uniformly at random by token without replacement, resulting in a query sample representative of the overall query distribution. On average, 30 results were explicitly labeled by human judges using a six point scale ranging from Perfect down to Bad. Overall, there were over 83,000 results with explicit relevance judgments. In order to compute various statistics, documents with label Good or better are considered relevant, and those with lower labels non-relevant. Note that the experiments were performed over the results already highly ranked by a web search engine, which corresponds to a typical user experience that is limited to the small number of highly ranked results for a typical web search query.

The user interactions were collected over a period of 8 weeks using voluntary opt-in information. In total, over 1.2 million unique queries were instrumented, resulting in over 12 million individual interactions with the search engine. The data consisted of user interactions with the web search engine (e.g., clicking on a result link, going back to search results, etc.) performed after a query was submitted. These actions were aggregated across users and search sessions and converted to the features in Table 4.1.

To create the training, validation, and test query sets, we created three different random splits of 1,500 training, 500 validation, and 1,000 test queries.
The splits were done randomly by query, so that there was no overlap in training, validation, and test queries.

5.2 Evaluation Metrics
We evaluate the ranking algorithms over a range of accepted information retrieval metrics, namely Precision at K (P(K)), Normalized Discounted Cumulative Gain (NDCG), and Mean Average Precision (MAP). Each metric focuses on a different aspect of system performance, as we describe below.

Precision at K: As the most intuitive metric, P(K) reports the fraction of documents ranked in the top K results that are labeled as relevant. In our setting, we require a relevant document to be labeled Good or higher. The position of relevant documents within the top K is irrelevant, and hence this metric measures overall user satisfaction with the top K results.

NDCG at K: NDCG is a retrieval measure devised specifically for web search evaluation [10]. For a given query q, the ranked results are examined from the top ranked down, and the NDCG is computed as:

\[
N_q = M_q \sum_{j=1}^{K} \frac{2^{r(j)} - 1}{\log(1 + j)}
\]

where $M_q$ is a normalization constant calculated so that a perfect ordering would obtain NDCG of 1, and each $r(j)$ is an integer relevance label (0 = Bad and 5 = Perfect) of the result returned at position j. Note that unlabeled and Bad documents do not contribute to the sum, but will reduce NDCG for the query by pushing down the relevant labeled documents, reducing their contributions. NDCG is well suited to web search evaluation, as it rewards relevant documents in the top ranked results more heavily than those ranked lower.

MAP: Average precision for each query is defined as the mean of the precision at K values computed after each relevant document was retrieved. The final MAP value is defined as the mean of average precisions of all queries in the test set. This metric is the most commonly used single-value summary of a run over a set of queries.

5.3 Ranking Methods Compared
Recall that our goal is to quantify the effectiveness of implicit behavior for real web search.
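The three metrics of Section 5.2 can be sketched as follows. This is a minimal illustration, not the evaluation code used in the paper. Two assumptions are made and marked in the comments: the integer label that corresponds to Good is not given in the text, so the threshold `relevant_min=3` is a placeholder parameter, and the ideal ordering for NDCG is taken over the labels of the returned list only.

```python
import math


def precision_at_k(labels, k, relevant_min=3):
    """Fraction of the top-k results whose label is at least `relevant_min`.
    ASSUMPTION: relevant_min=3 stands in for the label 'Good'."""
    return sum(1 for r in labels[:k] if r >= relevant_min) / k


def dcg_at_k(labels, k):
    """Sum over positions j = 1..k of (2^r(j) - 1) / log(1 + j)."""
    return sum((2 ** r - 1) / math.log(1 + j)
               for j, r in enumerate(labels[:k], start=1))


def ndcg_at_k(labels, k):
    """DCG normalized by the DCG of the ideal ordering of the same labels,
    so a perfect ordering scores 1. The log base cancels in the ratio."""
    ideal = dcg_at_k(sorted(labels, reverse=True), k)
    return dcg_at_k(labels, k) / ideal if ideal > 0 else 0.0


def average_precision(labels, relevant_min=3):
    """Mean of the precision values computed at each relevant result."""
    hits, total = 0, 0.0
    for j, r in enumerate(labels, start=1):
        if r >= relevant_min:
            hits += 1
            total += hits / j
    return total / hits if hits else 0.0


def mean_average_precision(labels_per_query, relevant_min=3):
    """Mean of the per-query average precision values."""
    aps = [average_precision(labels, relevant_min) for labels in labels_per_query]
    return sum(aps) / len(aps) if aps else 0.0


if __name__ == "__main__":
    ranked_labels = [5, 0, 3, 0]  # labels of results in ranked order (0 = Bad, 5 = Perfect)
    print(precision_at_k(ranked_labels, 2))
    print(ndcg_at_k(ranked_labels, 4))
    print(average_precision(ranked_labels))
```

Here `labels` is the list of integer relevance labels in ranked order, with unlabeled results treated as 0 by the caller.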
One dimension is to compare the utility of implicit feedback with other information available to a web search engine. Specifically, we compare effectiveness of implicit user behaviors with content-based matching, static page quality features, and combinations of all features.

BM25F: As a strong web search baseline we used the BM25F scoring, which was used in one of the best performing systems in the TREC 2004 Web track [23,27]. BM25F and its variants have been extensively described and evaluated in IR literature, and hence serve as a strong, reproducible baseline. The BM25F variant we used for our experiments computes separate match scores for each field for a result document (e.g., body text, title, and anchor text), and incorporates query-independent link-based information (e.g., PageRank, ClickDistance, and URL depth). The scoring function and field-specific tuning is described in detail in [23]. Note that BM25F does not directly consider explicit or implicit feedback for tuning.

RN: The ranking produced by a neural net ranker (RankNet, described in Section 3.3) that learns to rank web search results by incorporating BM25F and a large number of additional static and dynamic features describing each search result. This system automatically learns weights for all features (including the BM25F score for a document) based on explicit human labels for a large set of queries. A system incorporating an implementation of RankNet is currently in use by a major search engine and can be considered representative of the state of the art in web search.

BM25F-RerankCT: The ranking produced by incorporating clickthrough statistics to reorder web search results ranked by BM25F above. Clickthrough is a particularly important special case of implicit feedback, and has been shown to correlate with result relevance. This is a special case of the ranking method in Section 3.1, with the weight $w_I$ set to 1000 and the ranking $I_d$ given simply by the number of clicks on the result corresponding to d.
In effect, this ranking brings to the top all returned web search results with at least one click (and orders them in decreasing order by number of clicks). The relative ranking of the remainder of results is unchanged and they are inserted below all clicked results. This method serves as our baseline implicit feedback reranking method.

BM25F-RerankAll: The ranking produced by reordering the BM25F results using all user behavior features (Section 4). This method learns a model of user preferences by correlating feature values with explicit relevance labels using the RankNet neural net algorithm (Section 4.2). At runtime, for a given query the

implicit score $I_r$ is computed for each result r with available user interaction features, and the implicit ranking is produced. The merged ranking is computed as described in Section 3.1. Based on the experiments over the development set we fix the value of $w_I$ to 3 (the effect of the $w_I$ parameter for this ranker turned out to be negligible).

BM25F+All: Ranking derived by training the RankNet (Section 3.3) learner over the feature set of the BM25F score as well as all implicit feedback features (Section 3.2). We used the 2-layer implementation of RankNet [5] trained on the queries and labels in the training and validation sets.

RN+All: Ranking derived by training the 2-layer RankNet ranking algorithm (Section 3.3) over the union of all content, dynamic, and implicit feedback features (i.e., all of the features described above as well as all of the new implicit feedback features we introduced).

The ranking methods above span the range of the information used for ranking, from not using the implicit or explicit feedback at all (i.e., BM25F) to a modern web search engine using hundreds of features and tuned on explicit judgments (RN). As we will show next, incorporating user behavior into these ranking systems dramatically improves the relevance of the returned documents.

6. EXPERIMENTAL RESULTS
Implicit feedback for web search ranking can be exploited in a number of ways. We compare alternative methods of exploiting implicit feedback, both by re-ranking the top results (i.e., the BM25F-RerankCT and BM25F-RerankAll methods that reorder BM25F results), as well as by integrating the implicit features directly into the ranking process (i.e., the RN+All and BM25F+All methods which learn to rank results over the implicit feedback and other features). We compare our methods over strong baselines (BM25F and RN) over the NDCG, Precision at K, and MAP measures defined in Section 5.2. The results were averaged over three random splits of the overall dataset. Each split contained 1500 training, 500 validation, and 1000 test queries, all query sets disjoint.
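The split protocol can be sketched as below. This is an assumed reconstruction for illustration only: the paper does not say how the random splits were generated, so the seeded shuffle, the function name `split_queries`, and the placeholder query ids are all hypothetical. The sketch also assumes the query ids are distinct.

```python
import random


def split_queries(queries, n_train=1500, n_valid=500, n_test=1000, seed=0):
    """Shuffle the queries with a fixed seed and cut them into disjoint
    training, validation, and test sets. Assumes `queries` has no duplicates."""
    if n_train + n_valid + n_test > len(queries):
        raise ValueError("not enough queries for the requested split sizes")
    shuffled = list(queries)
    random.Random(seed).shuffle(shuffled)
    train = shuffled[:n_train]
    valid = shuffled[n_train:n_train + n_valid]
    test = shuffled[n_train + n_valid:n_train + n_valid + n_test]
    return train, valid, test


if __name__ == "__main__":
    all_queries = [f"q{i}" for i in range(3000)]  # placeholder query ids
    # Three different random splits; reported metrics would be averaged over them.
    for seed in (0, 1, 2):
        train, valid, test = split_queries(all_queries, seed=seed)
        print(seed, len(train), len(valid), len(test))
```

Splitting by query rather than by query-result pair keeps all results of a query in the same set, which is what makes the three sets disjoint at the query level.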
We first present the results over all 1000 test queries (i.e., including queries for which there are no implicit measures so we use the original web rankings). We then drill down to examine the effects on reranking for the attempted queries in more detail, analyzing where implicit feedback proved most beneficial.

We first experimented with different methods of re-ranking the output of the BM25F search results. Figures 6.1 and 6.2 report NDCG and Precision for BM25F, as well as for the strategies reranking results with user feedback (Section 3.1). Incorporating all user feedback (either in the reranking framework or as features to the learner directly) results in significant improvements (using two-tailed t-test with p=0.01) over both the original BM25F ranking as well as over reranking with clickthrough alone. The improvement is consistent across the top 10 results and largest for the top result: NDCG at 1 for BM25F+All is 22 compared to 18 of the original results, and precision at 1 similarly increases from to 3. Based on these results we will use the direct feature combination (i.e., BM25F+All) ranker for subsequent comparisons involving implicit feedback.

[Figure 6.1: NDCG at K for BM25F, BM25F-RerankCT, BM25F-Rerank-All, and BM25F+All for varying K]

[Figure 6.2: Precision at K for BM25F, BM25F-RerankCT, BM25F-Rerank-All, and BM25F+All for varying K]

Interestingly, using clickthrough alone, while giving significant benefit over the original BM25F ranking, is not as effective as considering the full set of features in Table 4.1. While we analyze user behavior (and most effective component features) in a separate paper [1], it is worthwhile to give a concrete example of the kind of noise inherent in real user feedback in a web search setting.

[Figure 6.3: Relative clickthrough frequency for queries with varying Position of Top Relevant result (PTR). Axes: result position vs. relative click frequency; series: PTR=2, PTR=3, PTR=5]

If users considered only the relevance of a result to their query, they would click on the topmost relevant results. Unfortunately, as Joachims and others have shown, presentation also influences which results users click on quite dramatically. Users often click on results above the relevant one, presumably because the short summaries do not provide enough information to make accurate relevance assessments and they have learned that on average top-ranked items are relevant. Figure 6.3 shows relative clickthrough frequencies for queries with known relevant items at positions other than the first position; the position of the top relevant result (PTR) ranges from 2-10 in the figure. For example, for queries with first relevant result at position 5 (PTR=5), there are more clicks on the non-relevant results in higher ranked positions than on the first relevant result at position 5. As we will see, learning over a richer behavior feature set results in substantial accuracy improvement over clickthrough alone.

We now consider incorporating user behavior into a much richer feature set, RN (Section 5.3), used by a major web search engine. RN incorporates BM25F, link-based features, and hundreds of other features. Figure 6.4 reports NDCG at K and Figure 6.5 reports Precision at K. Interestingly, while the original RN rankings are significantly more accurate than BM25F alone, incorporating implicit feedback features (BM25F+All) results in ranking that significantly outperforms the original RN rankings. In other words, implicit feedback incorporates sufficient information to replace the hundreds of other features available to the RankNet learner trained on the RN feature set.

[Figure 6.4: NDCG at K for BM25F, BM25F+All, RN, and RN+All for varying K]

Furthermore, enriching the RN features with the implicit feedback set exhibits significant gain on all measures, allowing RN+All to outperform all other methods.
This demonstrates the complementary nature of implicit feedback with other features available to a state of the art web search engine.

[Figure 6.5: Precision at K for BM25F, BM25F+All, RN, and RN+All for varying K]

We summarize the performance of the different ranking methods in Table 6.1. We report the Mean Average Precision (MAP) score for each system. While not intuitive to interpret, MAP allows quantitative comparison on a single metric. The gains marked with * are significant at p=0.01 level using two-tailed t-test.

                        MAP   Gain   P(1)   Gain
  BM25F
  BM25F-Rerank-CT       * *
  BM25F-RerankImplicit  *
  BM25F+Implicit        *
  RN
  RN+All                * *

Table 6.1: Mean Average Precision (MAP) for all strategies.

So far we reported results averaged across all queries in the test set. Unfortunately, less than half had sufficient interactions to attempt reranking. Out of the 1000 queries in test, between 46% and 49%, depending on the train-test split, had sufficient interaction information to make predictions (i.e., there was at least 1 search session in which at least 1 result URL was clicked on by the user). This is not surprising: web search is heavy-tailed, and there are many unique queries. We now consider the performance on the queries for which user interactions were available. Figure 6.6 reports NDCG for the subset of the test queries with the implicit feedback features. The gains at top 1 are dramatic. The NDCG at 1 of BM25F+All increases to 0.75 (a 31% relative gain), achieving performance comparable to RN+All operating over a much richer feature set.

[Figure 6.6: NDCG at K for BM25F, BM25F+All, RN, and RN+All on test queries with user interactions]

Similarly, gains on precision at top 1 are substantial (Figure 6.7), and are likely to be apparent to web search users. When implicit feedback is available, the BM25F+All system returns a relevant document at top 1 almost 70% of the time, compared to 53% of the time when implicit feedback is not considered by the original BM25F system.

[Figure 6.7: Precision at K for BM25F, BM25F+All, RN, and RN+All on test queries with user interactions]

We summarize the results on the MAP measure for attempted queries in Table 6.2. MAP improvements are both substantial and significant, with improvements over the BM25F ranker most pronounced.

  Method      MAP   Gain    P(1)   Gain
  RN
  RN+All            (19%)          (10%)
  BM25F
  BM25F+All         (24%)          (31%)

Table 6.2: Mean Average Precision (MAP) on attempted queries for best performing methods.

We now analyze the cases where implicit feedback was shown most helpful. Figure 6.8 reports the MAP improvements over the baseline BM25F run for each query with MAP under. Note that most of the improvement is for poorly performing queries (i.e., MAP < 0.1). Interestingly, incorporating user behavior information degrades accuracy for queries with high original MAP score. One possible explanation is that these easy queries tend to be navigational (i.e., having a single, highly-ranked most appropriate answer), and user interactions with lower-ranked results may indicate divergent information needs that are better served by the less popular results (with correspondingly poor overall relevance ratings).

[Figure 6.8: Gain of BM25F+All over original BM25F ranking. Axes: Frequency, Average Gain]

To summarize our experimental results, incorporating implicit feedback in a real web search setting resulted in significant improvements over the original rankings, using both BM25F and RN baselines.
Our rich set of implicit features, such as time on page and deviations from the average behavior, provides advantages over using clickthrough alone as an indicator of interest. Furthermore, incorporating implicit feedback features directly into the learned ranking function is more effective than using implicit feedback for reranking. The improvements observed over large test sets of queries (1,000 total, between 466 and 495 with implicit feedback available) are both substantial and statistically significant.

7. CONCLUSIONS AND FUTURE WORK

In this paper we explored the utility of incorporating noisy implicit feedback obtained in a real web search setting to improve web search ranking. We performed a large-scale evaluation over 3,000 queries and more than 12 million user interactions with a major search engine, establishing the utility of incorporating noisy implicit feedback to improve web search relevance.

We compared two alternatives of incorporating implicit feedback into the search process, namely reranking with implicit feedback and incorporating implicit feedback features directly into the trained ranking function. Our experiments showed significant improvement over methods that do not consider implicit feedback. The gains are particularly dramatic for the top K=1 result in the final ranking, with precision improvements as high as 31%, and the gains are substantial for all values of K. Our experiments showed that implicit user feedback can further improve web search performance, when incorporated directly with popular content- and link-based features.
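The per-query analysis behind Figure 6.8 (gain as a function of how well the original ranking performed) can be sketched roughly as below. This is only an illustration under stated assumptions: the function name, the dictionary inputs keyed by query, the ten equal-width buckets, and the definition of gain as the absolute difference in average precision are all hypothetical choices, not details confirmed by the paper.

```python
from collections import defaultdict
from typing import Dict, List, Tuple


def gain_by_baseline_bucket(
    baseline_ap: Dict[str, float],
    improved_ap: Dict[str, float],
    n_buckets: int = 10,
) -> List[Tuple[float, int, float]]:
    """Group queries by baseline average precision and average the gain per group.

    Returns (bucket_lower_bound, query_count, mean_gain) tuples, sorted by bucket.
    Gain is assumed to be improved AP minus baseline AP for the same query.
    Queries missing from either input are skipped.
    """
    if n_buckets <= 0:
        raise ValueError("n_buckets must be positive")

    gains_per_bucket: Dict[int, List[float]] = defaultdict(list)
    for query, base in baseline_ap.items():
        if query not in improved_ap:
            continue
        # Clamp so that a baseline AP of exactly 1.0 falls in the last bucket.
        idx = min(int(base * n_buckets), n_buckets - 1)
        gains_per_bucket[idx].append(improved_ap[query] - base)

    return [
        (idx / n_buckets, len(gains), sum(gains) / len(gains))
        for idx, gains in sorted(gains_per_bucket.items())
    ]


if __name__ == "__main__":
    # Toy numbers for illustration only; they are not results from the paper.
    baseline = {"q1": 0.05, "q2": 0.08, "q3": 0.95}
    improved = {"q1": 0.25, "q2": 0.18, "q3": 0.90}
    for lower, count, mean_gain in gain_by_baseline_bucket(baseline, improved):
        print(f"baseline AP >= {lower:.1f}: {count} queries, mean gain {mean_gain:+.3f}")
```

A table of this shape makes it easy to see whether gains concentrate in the low-baseline buckets and turn negative in the high-baseline ones, which is the pattern the analysis describes.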

Interestingly, implicit feedback is particularly valuable for queries with poor original ranking of results (e.g., MAP lower than 0.1). One promising direction for future work is to apply recent research on automatically predicting query difficulty, and only attempt to incorporate implicit feedback for the difficult queries. As another research direction we are exploring methods for extending our predictions to the previously unseen queries (e.g., query clustering), which should further improve the web search experience of users.

ACKNOWLEDGMENTS

We thank Chris Burges and Matt Richardson for an implementation of RankNet for our experiments. We also thank Robert Ragno for his valuable suggestions and many discussions.

8. REFERENCES

[1] E. Agichtein, E. Brill, S. Dumais, and R. Ragno, Learning User Interaction Models for Predicting Web Search Result Preferences. In Proceedings of the ACM Conference on Research and Development on Information Retrieval (SIGIR), 2006
[2] J. Allan, HARD Track Overview in TREC 2003, High Accuracy Retrieval from Documents, 2003
[3] R. Baeza-Yates and B. Ribeiro-Neto, Modern Information Retrieval, Addison-Wesley
[4] S. Brin and L. Page, The Anatomy of a Large-scale Hypertextual Web Search Engine, in Proceedings of WWW, 1997
[5] C.J.C. Burges, T. Shaked, E. Renshaw, A. Lazier, M. Deeds, N. Hamilton, G. Hullender, Learning to Rank using Gradient Descent, in Proceedings of the International Conference on Machine Learning, 2005
[6] D.M. Chickering, The WinMine Toolkit, Microsoft Technical Report MSR-TR [number missing], 2002
[7] M. Claypool, D. Brown, P. Lee and M. Waseda. Inferring user interest. IEEE Internet Computing
[8] S. Fox, K. Karnawat, M. Mydland, S. T. Dumais and T. White. Evaluating implicit measures to improve the search experience. In ACM Transactions on Information Systems, 2005
[9] J. Goecks and J. Shavlick. Learning users' interests by unobtrusively observing their normal behavior. In Proceedings of the IJCAI Workshop on Machine Learning for Information Filtering
[10] K. Jarvelin and J. Kekalainen. IR evaluation methods for retrieving highly relevant documents. In Proceedings of the ACM Conference on Research and Development on Information Retrieval (SIGIR), 2000
[11] T. Joachims, Optimizing Search Engines Using Clickthrough Data. In Proceedings of the ACM Conference on Knowledge Discovery and Datamining (SIGKDD), 2002
[12] T. Joachims, L. Granka, B. Pang, H. Hembrooke, and G. Gay, Accurately Interpreting Clickthrough Data as Implicit Feedback, Proceedings of the ACM Conference on Research and Development on Information Retrieval (SIGIR), 2005
[13] T. Joachims, Making Large-Scale SVM Learning Practical. Advances in Kernel Methods, in Support Vector Learning, MIT Press, 1999
[14] D. Kelly and J. Teevan, Implicit feedback for inferring user preference: A bibliography. In SIGIR Forum, 2003
[15] J. Konstan, B. Miller, D. Maltz, J. Herlocker, L. Gordon, and J. Riedl. GroupLens: Applying collaborative filtering to usenet news. In Communications of ACM
[16] M. Morita and Y. Shinoda, Information filtering based on user behavior analysis and best match text retrieval. Proceedings of the ACM Conference on Research and Development on Information Retrieval (SIGIR), 1994
[17] D. Oard and J. Kim. Implicit feedback for recommender systems. In Proceedings of the AAAI Workshop on Recommender Systems
[18] D. Oard and J. Kim. Modeling information content using observable behavior. In Proceedings of the 64th Annual Meeting of the American Society for Information Science and Technology
[19] N. Pharo and K. Järvelin. The SST method: a tool for analyzing web information search processes. In Information Processing & Management, 2004
[20] P. Pirolli, The Use of Proximal Information Scent to Forage for Distal Content on the World Wide Web. In Working with Technology in Mind: Brunswikian Resources for Cognitive Science and Engineering, Oxford University Press, 2004
[21] F. Radlinski and T. Joachims, Query Chains: Learning to Rank from Implicit Feedback. In Proceedings of the ACM Conference on Knowledge Discovery and Data Mining (SIGKDD)
[22] F. Radlinski and T. Joachims, Evaluating the Robustness of Learning from Implicit Feedback, in Proceedings of the ICML Workshop on Learning in Web Search, 2005
[23] S. E. Robertson, H. Zaragoza, and M. Taylor, Simple BM25 extension to multiple weighted fields, in Proceedings of the Conference on Information and Knowledge Management (CIKM), 2004
[24] G. Salton and M. McGill. Introduction to modern information retrieval. McGraw-Hill, 1983
[25] E.M. Voorhees, D. Harman, Overview of TREC, 2001
[26] G.R. Xue, H.J. Zeng, Z. Chen, Y. Yu, W.Y. Ma, W.S. Xi, and W.G. Fan, Optimizing web search using web click-through data, in Proceedings of the Conference on Information and Knowledge Management (CIKM), 2004
[27] H. Zaragoza, N. Craswell, M. Taylor, S. Saria, and S. Robertson. Microsoft Cambridge at TREC 13: Web and Hard Tracks. In Proceedings of TREC
