An Improved Usage-Based Ranking


Chen Ding 1, Chi-Hung Chi 1,2, and Tiejian Luo 2

1 School of Computing, National University of Singapore, Lower Kent Ridge Road, Singapore 119260
chich@comp.nus.edu.sg
2 The Graduate School of Chinese Academy of Sciences, Beijing, P.R. China 100039

Abstract. A good ranking is critical to a positive search experience. With usage data collected from past search activities, ranking can be improved over current approaches, which are largely based on text or link information. In this paper, we propose a usage-based ranking algorithm. It calculates the rank score from time duration, taking the propagated effect of linked pages into account; this is an improvement over methods based on simple selection frequency. It also applies several heuristics to further improve the accuracy of the top-positioned results.

1 Introduction

The ranking module is a key component of a web information retrieval system because, by improving the quality and accuracy of the top-ranked results, it alleviates the cognitive overload users face when identifying the most relevant documents in a huge result list. The key to a good ranking for web search is to make full use of the sources available on the web instead of being confined to pure text information. One example is the link connectivity of the web graph, which has been investigated in many studies ([2], [4], [7]).

In traditional IR systems, users looking for specific information often spend some time providing feedback to refine the search, and such feedback can improve the final ranking. On the web, however, the dominant searches are informal ones. Without a clear and urgent information need in mind, and with easy access to information on the same topic from different web resources, users are unlikely to spend much time on a single search. Explicit feedback is therefore quite rare on the web. However, powerful web tracking techniques make it easy to capture user behavior during browsing. From information such as which links users click and how long they spend on a page, the user's degree of satisfaction with the relevance of a web page can be estimated. This is in effect a kind of implicit feedback. We believe such usage data can be a good source for relevance judgment and quality ranking.

Usage data has been investigated in many studies, but it is normally collected from a single web site and used to improve the site's presentation or to help user navigation ([5], [8], [13]). There is limited work on utilizing usage data in web information retrieval systems, especially in the ranking algorithm. The systems that do use usage data in ranking [10] determine the relevance of a web page by its selection frequency. This measurement is not accurate enough to indicate real relevance. The time spent reading a page, the operations of saving or printing it or adding it to a bookmark, and the action of following its links are all good indicators, perhaps better than simple selection frequency. It is therefore worth exploring how this kind of actual user behavior can be applied to the ranking mechanism. The purpose of this study is to develop a more accurate ranking algorithm that utilizes usage data.

In this paper, we develop a usage-based ranking algorithm. The time spent reading and navigating a web page constitutes the basic rank score, and several heuristics further increase the precision of the top-positioned results. We believe this kind of ranking can supplement current algorithms (e.g. text-based and link-based ones) and provide high accuracy and quality.

2 Related Work

The traditional relevance feedback mechanism ([11]) benefits a single search session by refining the initial query towards the relevant documents. Other users submitting the same query cannot benefit from it, so the performance improvement from relevance feedback is on a per-user basis. On the web, by contrast, implicit feedback is collected from many users and is not aimed at benefiting a single user's retrieval experience. Its underlying rationale is closer to collaborative filtering: the relevance or quality of a web page can be determined from a large amount of collaborative judgment data, since users submitting the same query usually share opinions on result relevance. Collaborative filtering is thus an area closely related to our work.

Collaborative filtering uses others' knowledge as recommendations that serve as inputs to a system which aggregates and directs results to the appropriate recipients ([9]). It relies on the fact that people's tastes are not randomly distributed: there are general trends and patterns within the tastes of one person as well as between groups of people. Most early collaborative filtering systems ([1]) use explicit judgments provided by users to recommend documents. Their major disadvantage is that they require user participation, which is normally not desired in web search. To address this problem, several systems have been developed that extract user judgments by observing user activities and inferring their evaluation of the documents. We review two such systems that aim to improve web searching.

KSS ([10]) uses a proxy to record users' access patterns as they browse the web. The number of times a link is accessed is stored in the proxy and annotated beside the link to indicate its value. KSS also has a meta search engine that attempts to find the best search engines for a given query based on the selection frequency of results. This count of previous accesses can also be used to rank result lists merged from multiple search engines.

In the DirectHit search engine, each URL in the result list points back to the search engine server first and is then redirected to the target. In this way, which results users actually follow can be recorded in the server log. DirectHit gradually accumulates this data to identify which result pages are popular and which are not; in later searches for the same query, the returned pages can be ranked by their popularity (i.e., their access count). The exact ranking mechanism is not public.

3 Usage-Based Ranking

The general idea of usage-based ranking is to monitor which pages users choose from the result list and what actions they take on those pages; from this behavioral information, the users' judgments on the relevance of the pages can be induced, and usage-based scores are assigned to them the next time the same query is submitted by other users.

Intuitively, if a page is selected more frequently, its chance of being judged relevant is higher; if it is selected less frequently, its chance is lower. It therefore seems natural to assign a score proportional to selection frequency, and DirectHit and KSS use this kind of selection-frequency-based method to judge the relevance degree of a web page. However, it is not very accurate. For instance, a user may click a page and return to the result list immediately because the page is not relevant; if this pattern is observed across many different clicks, then judging the page's relevance by its selection frequency is incorrect. The causes may be inadvertent human mistakes, misleading page titles, or returned summaries that do not represent the real content.

Selection frequency is therefore not a good measurement, and the time spent on a page may be better. If users spend time reading through a page, the page is more likely to be relevant than one that users merely click on, so the usage-based relevance score of a page is better measured by the time spent on it; [3] and [6] confirm this observation. Of course, the longer the page, the longer users may spend on it, and users sometimes spend little time on a page simply because it is quite short even though its content is very relevant. To compensate for this effect, the time duration should be normalized by the length of the web page.
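As a minimal sketch of this normalization (not the authors' implementation), assuming hypothetical proxy-log fields for entry time, exit time, and page length:

```python
# A minimal sketch of length-normalized dwell time. The record fields are
# assumptions: entry/exit timestamps from a proxy log and page length in bytes.

def normalized_duration(time_in: float, time_out: float, page_length: int) -> float:
    """Time spent on a page, divided by its length so that short but
    relevant pages are not penalized."""
    dwell = max(time_out - time_in, 0.0)
    return dwell / max(page_length, 1)
```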

Most web pages contain hyperlinks to other pages. When two pages are connected because of their content, the relevance of one page to a query can imply the relevance of the other. Hence, a high percentage of links accessed from a page can be a strong indication of that page's relevance. This is particularly important for index pages, which contain many related links and on which users spend less time than on content pages. In this way, in addition to text information, hyperlink information also contributes to the relevance of a page. Moreover, the total time spent on the linked pages is a more appropriate measurement than the access percentage alone. Thus, when users follow hyperlinks in a page, the time spent on the linked pages is propagated back to that page to increase its relevance degree.

From the above analysis, time duration and access via hyperlinks are the two major factors for measuring relevance, and they are used to calculate the basic usage-based rank score. The hyperlink effect can be propagated recursively, with a fading factor, along the link hierarchy in which the first-level nodes are search results and higher-level nodes are expanded from the first level by hyperlinks. Apart from duration, the usage-based rank is also related to the page's latest access time from the search results: of two pages with the same duration value, the one with the later access time should have a higher rank score, because it is more likely to reflect current user interest in the query. The basic ranking formula is

$$URank^0 = \sum_{i=1}^{n_{Q,D}} \frac{lt_i}{lt_Q}\, Dur(i), \qquad Dur(i) = dur(i) + F_u \sum_{L_D \in\, \text{linked pages from}\, D} dur(L_D, i)$$

where lt_Q is the latest access time of query Q; lt_i is the access time of document D in the i-th access for Q; n_{Q,D} is the number of accesses to D from Q; dur(i) is the time spent on D in the i-th access, normalized by the length of D; and F_u is the fading factor for durations propagated from linked pages.
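The following sketch illustrates this basic score under assumed data structures (one Access record per click on document D for query Q); the fading-factor value and the single level of propagation are assumptions, since the paper propagates recursively along the link hierarchy:

```python
# Sketch of the basic usage-based score URank^0: each access contributes its
# length-normalized duration plus a faded sum of durations on pages reached
# by links from D, weighted by recency relative to the query's latest access
# time lt_Q.

from dataclasses import dataclass, field
from typing import List

F_U = 0.5  # fading factor F_u for propagated durations (assumed value)

@dataclass
class Access:
    lt: float    # access time lt_i of D in this access for Q
    dur: float   # dur(i): length-normalized time spent on D
    linked_durs: List[float] = field(default_factory=list)  # dur(L_D, i) for pages linked from D

def basic_urank(accesses: List[Access], lt_q: float) -> float:
    """URank^0 = sum_i (lt_i / lt_Q) * Dur(i)."""
    score = 0.0
    for a in accesses:
        dur_i = a.dur + F_U * sum(a.linked_durs)  # Dur(i), one propagation level
        score += (a.lt / lt_q) * dur_i            # recency weight lt_i / lt_Q
    return score
```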

After the basic score of a web page has been calculated, an adjustment value is applied when certain conditions hold. The main purpose of the adjustment is to decrease the score of high-positioned pages that previous users judged not very relevant, and to increase the score of low-positioned pages that previous judgments show to be quite relevant. We formulate four heuristics.

Heuristic 1. If a web page has a high rank and its selection frequency is less than the average selection frequency for the query, it receives a negative adjustment:

$$\Delta URank = \left(\frac{clickrate(D,Q)}{avg(clickrate(D,Q))} - 1\right)(HR\_THRES - rank'(D,Q)), \qquad clickrate(D,Q) = \frac{freq(D,Q)}{freq(Q)}$$

where freq(D,Q) is the selection frequency of D for Q (the same value as n_{Q,D}); freq(Q) is the frequency of Q; rank'(D,Q) is the average rank position of D in previous searches for Q; and the average is taken over all result documents for Q. A document whose rank is less than HR_THRES is considered to have a high rank.

Heuristic 2. If a web page has a high rank and its average duration is less than the lower bound LB_DUR, it receives a negative adjustment:

$$\Delta URank = \left(\frac{(1/n_{Q,D})\sum_{i=1}^{n_{Q,D}} Dur(i)}{LB\_DUR} - 1\right)(HR\_THRES - rank'(D,Q))$$

Heuristic 3. If a web page has a high rank but has never been accessed, it receives a negative adjustment:

$$\Delta URank = \left(\frac{hrfreq(D,Q)}{freq(Q)} - HRFREQ\_THRES\right)(HR\_THRES - rank'(D,Q))$$

where hrfreq(D,Q) measures how many times document D occurs in a high position of the ranked list for Q and is accessed, and HRFREQ_THRES is a threshold on hrfreq(D,Q).

Heuristic 4. If a document has a low rank and its selection frequency is larger than a threshold LB_CLICKRATE, it receives a positive adjustment:

$$\Delta URank = \left(1 - \frac{LB\_CLICKRATE}{clickrate(D,Q)}\right)(rank'(D,Q) - LR\_THRES)$$

A document whose rank is larger than LR_THRES is considered to have a low rank.

After the basic score and the adjustment value have been computed, the reliability of the combined value is estimated from statistical data, and the final score is further weighted by this reliability factor. The reliability of the rank score is determined by the query frequency, the page selection frequency for the query, and related statistics. The final usage-based rank score is therefore the basic score, adjusted by a (negative or positive) value and multiplied by a reliability factor:

$$URank = rf(Q)\,(URank^0 + \Delta URank), \qquad rf(Q) = lt_Q \cdot freq(Q) \cdot (lt_Q - ft_Q)$$

where rf(Q) is the reliability factor and ft_Q is the time of the first submission of query Q. The reliability factor is determined by the usage data collected for the query: the more recent the latest submission of the query, the more reliable its usage-based rank; the more frequently the query is submitted, the more reliable the rank; and the longer the query has existed in the query database, the more reliable the rank. All thresholds in the above calculation are selected through iterative tests on real log data, and all duration and rank-position values are normalized before the calculation.
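The sketch below ties the pieces together, following the reconstructed formulas above. All threshold values here are placeholders (the paper tunes them on real log data), and the argument names simply mirror the paper's symbols:

```python
# Sketch of the four heuristic adjustments and the final reliability-weighted
# score. Thresholds are assumed placeholder values, not the paper's.

HR_THRES = 10        # ranks below this are "high-positioned"
LR_THRES = 50        # ranks above this are "low-positioned"
LB_DUR = 5.0         # lower bound on average duration
LB_CLICKRATE = 0.2   # click-rate threshold for Heuristic 4
HRFREQ_THRES = 0.1   # threshold for Heuristic 3

def adjustment(rank: float, clickrate: float, avg_clickrate: float,
               avg_dur: float, hrfreq: int, freq_q: int) -> float:
    """Delta URank for one document; rank is its average previous rank'(D, Q)."""
    delta = 0.0
    if rank < HR_THRES:                                 # high-positioned page
        if clickrate < avg_clickrate:                   # Heuristic 1
            delta += (clickrate / avg_clickrate - 1) * (HR_THRES - rank)
        if avg_dur < LB_DUR:                            # Heuristic 2
            delta += (avg_dur / LB_DUR - 1) * (HR_THRES - rank)
        if hrfreq == 0:                                 # Heuristic 3 (never accessed)
            delta += (hrfreq / freq_q - HRFREQ_THRES) * (HR_THRES - rank)
    elif rank > LR_THRES and clickrate > LB_CLICKRATE:  # Heuristic 4
        delta += (1 - LB_CLICKRATE / clickrate) * (rank - LR_THRES)
    return delta

def final_urank(basic: float, delta: float,
                lt_q: float, ft_q: float, freq_q: int) -> float:
    """URank = rf(Q) * (URank^0 + Delta URank)."""
    rf_q = lt_q * freq_q * (lt_q - ft_q)  # reliability factor rf(Q)
    return rf_q * (basic + delta)
```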

4 Experiment

Since our algorithm targets general queries, we chose 21 general-topic queries for the experiment: intellectual property law Singapore, mobile agent, Ian Thorpe, information retrieval, travel Europe, World Wide Web conference, classical guitar, machine learning researcher, free web server software, amusement park Japan, MP3 Discman, client server computing, concord aircraft, Internet research Singapore, computer science department National University Singapore, information agent, ATM technology, movie awards, quest correct information web hyper search engine, Scatter Gather Douglass Cutting, and WAP specification.

Each query was submitted to an existing search engine (Google), and the top 200 results were collected as the test database. This number is large enough because users usually review only the top 20 to 50 result documents. Usage-based ranking alone may not work when no usage data is available for a query, so it should serve as a complement to existing algorithms. We chose a ranking algorithm based on both text and link information as the basis; to obtain its rank scores, we downloaded the full documents and performed the whole indexing and ranking procedure on them. The final rank was a linear combination of the basic rank score and the usage-based rank score.

We defined two sessions in the experiment. In session 1, evaluators judged the relevance of results returned by the basic ranking algorithm; the whole evaluation procedure was logged in the proxy server, and the usage-based ranking was calculated from these evaluation results. In session 2, the new rankings (the combination of the basic ranking and the usage-based ranking) were presented to different evaluators to see whether an improvement was made. To evaluate performance, we used the top-n precision value: the ratio of the number of relevant results within the top n results to n.
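A straightforward sketch of this metric:

```python
# Top-n precision as defined above: the number of relevant results within
# the top n, divided by n.

from typing import Sequence

def precision_at_n(relevance: Sequence[bool], n: int) -> float:
    """relevance[i] is True if the i-th ranked result was judged relevant."""
    if n <= 0:
        return 0.0
    return sum(relevance[:n]) / n

# Example: precision_at_n([True, False, True, True], 3) == 2 / 3
```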

Figure 1 shows the comparison of top-30 precision values for Google and our ranking algorithm (basic plus usage-based). The results from session 1 differ from those of session 2 because relevance was judged by different groups of people. The figure shows the comparison for all 21 queries as well as for the average query, the average general query, and the average specific query.

[Fig. 1. Comparison graph of top-30 precision values for the 21 queries and their averages (avg, avg-g, avg-s), plotting Google, session 1, and session 2.]

The figure shows that for most queries, the precision values from session 2 were better than those from session 1, and both were better than Google's. Since the precision values judged by new users were comparable to those judged by earlier users, usage data collected from previous searches can benefit later searches conducted by different users, which verifies the effectiveness of our proposed ranking algorithm. The improvement over the Google results implies that usage-based ranking can further enhance Google's text-and-link-based ranking and produce a better result list. The overall conclusion from these observations is that usage-based ranking improves retrieval effectiveness when combined with other ranking algorithms, and that our proposed algorithm achieves the expected performance.

5 Conclusion

This study shows that usage data from past searches can benefit later searches when it is utilized in the ranking module. In our usage-based ranking algorithm, the basic rank score is calculated from the time users spend reading a page and browsing its connected pages; high-ranked pages may receive a negative adjustment if their positions do not match their actual usage, and low-ranked pages may receive a positive adjustment if users tend to dig them out from low positions.

References

1. M. Balabanovic, Y. Shoham, "Fab: Content-based, Collaborative Recommendation," Communications of the ACM, 40(3), pp. 66-72, 1997.
2. J. M. Kleinberg, "Authoritative Sources in a Hyperlinked Environment," IBM Research Report RJ 10076, 1997.
3. J. Konstan, B. Miller, D. Maltz, J. Herlocker, L. Gordon, J. Riedl, "GroupLens: Applying Collaborative Filtering to Usenet News," Communications of the ACM, 40(3), pp. 77-87, 1997.
4. M. Marchiori, "The Quest for Correct Information on the Web: Hyper Search Engines," Proceedings of the 6th World Wide Web Conference (WWW6), 1997.
5. B. Mobasher, R. Cooley, J. Srivastava, "Automatic Personalization Based on Web Usage Mining," Technical Report TR99-010, Department of Computer Science, DePaul University, 1999.
6. M. Morita, Y. Shinoda, "Information Filtering Based on User Behavior Analysis and Best Match Text Retrieval," Proceedings of the 17th ACM SIGIR Conference, 1994.
7. L. Page, S. Brin, R. Motwani, T. Winograd, "The PageRank Citation Ranking: Bringing Order to the Web," Stanford University working paper SIDL-WP-1997-0072, 1997.
8. M. Perkowitz, O. Etzioni, "Towards Adaptive Web Sites: Conceptual Framework and Case Study," Proceedings of the 8th World Wide Web Conference (WWW8), 1999.
9. P. Resnick, H. Varian, "Recommender Systems," Communications of the ACM, 40(3), 1997.
10. G. Rodriguez-Mula, H. Garcia-Molina, A. Paepcke, "Collaborative Value Filtering on the Web," Proceedings of the 7th World Wide Web Conference (WWW7), 1998.
11. G. Salton, M. J. McGill, Introduction to Modern Information Retrieval, McGraw-Hill, 1983.
12. T. W. Yan, M. Jacobsen, H. Garcia-Molina, U. Dayal, "From User Access Patterns to Dynamic Hypertext Linking," Proceedings of the 5th World Wide Web Conference (WWW5), 1996.