ihits: Extending HITS for Personal Interests Profiling

Ziming Zhuang
School of Information Sciences and Technology
The Pennsylvania State University
zzhuang@ist.psu.edu

Abstract

Ever since the boom of the World Wide Web, profiling online users' interests has been an important task for content providers. The traditional approach involves manual entry of user data, which is labor- and time-intensive. More recent approaches build the profiles with machine learning and clustering techniques by analyzing the content of the Web pages a user has visited. Because such solutions rely heavily on textual information, they can differentiate between topics of interest, but it remains difficult to determine a user's level of interest in a given topic or to gauge how interests shift over time. In this paper we propose ihits, an extension of the HITS (Hypertext-Induced Topic Search) algorithm that automatically derives a ranked list of a user's interests through link analysis of the Web pages the user has visited, with the visit pattern obtained from the browsing history. We evaluate the approach by comparing the automatically generated interest profiles against the users' manual entries to examine its accuracy and effectiveness. Our evaluation shows that the approach is promising and achieves satisfactory results. This study introduces a novel way to build a user-interests profiling system that automatically captures and ranks users' browsing interests.

1. Introduction

The term Web usage mining refers to the automatic discovery of useful information from the secondary data generated by users' interactions with the Web. Spiliopoulou described the ideal direction of Web usage mining as analyzing actual usage data in order to predict a user's future behavior from his profile of interests, and ultimately adapting the Web for the greatest benefit to the user [1]. Research and applications in Web usage mining fall into two major categories: personalized and impersonalized. Personalized mining, which aims to learn the interests of a specific user and later use this captured knowledge to better serve his information needs, is the theme of this paper. Whether done explicitly or implicitly, capturing users' interests and building user profiles remain the two major problems to be solved. Existing approaches to user-interests profiling can be roughly divided into two streams: manual profiling, which requires time and effort from the users to explicitly express their personal interests, and automatic profiling, where the system learns the users' interests from the history of their interactions without any explicit input. We propose ihits, an automatic approach to building user profiles with ranked interests. Our initial evaluation shows that the approach is promising and achieves satisfactory results.

This paper is organized as follows. Section 2 gives an overview of previous studies on profiling users' interests. Section 3 presents the ihits approach, including the algorithm and the detailed procedure. Section 4 describes the experiment designed to evaluate its performance and discusses the results. Section 5 offers some insights into the limitations of this study and our plans for future work. Section 6 concludes the paper.

2. Related Studies

User profiling has been studied extensively in the areas of recommender systems and information filtering. In particular, personal interests profiling is the process of gathering information about a user's interests; this information can then be used to build user profiles that make further personalization possible. There are currently two major approaches.

2.1 Explicit (Knowledge-based) Profiling

Explicit profiling requires the direct involvement of the users. It is defined in [2] as a knowledge-based approach that engineers static models of users and dynamically matches users to the closest model. The knowledge of users' interests is captured through explicit input from the users, via online or offline questionnaires, interviews, profile subscriptions, etc. One example is SIFT, developed at Stanford University [3], with which a user may subscribe to an existing profile of interesting topics via email and optionally update some parameters to tailor the profile further to his personal preferences. An alternative is to ask users to rank or grade the Web pages they have visited according to their perception of the pages' relevance to their own interests. One example is the NewsWeeder system [4], which uses per-news-article ratings from the users as training data for a machine learning algorithm that composes the user-interests profile. As mentioned above, a major disadvantage of explicit profiling is that it requires a direct investment of time and effort from the users. Because the approach relies on pre-defined knowledge, it is also limited in its ability to capture newly emerging interests or to handle shifts in users' interests.

2.2 Implicit (Behavior-based) Profiling

Implicit profiling, also known as indirect profiling, is based on observing and analyzing users' navigation patterns as well as the content and link structure of Web pages. It is described in [2] as an approach that uses the user's behavior as a model, commonly applying machine-learning techniques to discover useful patterns in that behavior. Implicit profiling usually involves machine learning and clustering techniques. Web pages are clustered based on content- or link-based information to discover and group similar pages into distinct topics. Hyoung and Philip employ a divisive hierarchical clustering algorithm on keywords in the Web pages visited by a user to generate a hierarchy of the user's interests [5]. TF-IDF term weights, nearest neighbors, and naïve Bayes are often used in keyword selection, the results of which represent the topics of a user's interests. Machine learning techniques are also used to exploit the user's browsing history (server logs and/or client logs) to find potential interests; in such cases, cues such as the time spent browsing a Web page [6] and the visit frequency are used. The News Dude system [7] uses a combined strategy in which long-term and short-term interests are modeled in different ways. Sakagami and Kamba use records of users' scroll and mouse operations to determine, to a certain degree, which parts of a page the user is most interested in [8]. Yoshinori takes a similar approach but extracts topics by sentences or lines instead of by pages, in order to achieve higher precision when detecting a user's interests [9]. Combining explicit and implicit profiling has also proved promising: the NewT system [10] incorporates users' relevance feedback to discover their interests and refine its news filters, and this implicit learning process effectively reduces the users' burden of offering input while improving system performance. More recently, ontology-based user profiling has been demonstrated as a novel approach. Two experimental systems, Quickstep and Foxtrot, build user profiles with semantically rich approaches [2]; owing to this semantic richness, such approaches can achieve higher profiling accuracy through ontological inference and external reference.

3. Research Approach

In this study we propose ihits, a new approach that implicitly gathers information about a user's interests through (1) link analysis on the pool of Web pages the user has visited, and (2) the user's visit frequencies and durations taken from the browsing history. Our approach is rooted in the HITS (Hypertext-Induced Topic Search) algorithm, which first appeared in [11]. Wang et al. [12] used a similar approach to build an expert-finding system, but their goal was to find the top N experts (the users ranked with the N highest expertise weights) for a given topic. We build ihits towards a different goal: implicitly profiling users' interests. To the best of our knowledge, no prior study resembles our proposed approach to personal interests profiling.

3.1 Two Assumptions of Positive Two-Way Feedback

ihits is based on two empirical assumptions of positive two-way feedback. The first assumption is essentially the one that originated with the HITS algorithm [11]: a high-quality authority page draws its value from incoming links from high-quality hub pages, and a high-quality hub page draws its value from outgoing links to high-quality authority pages. The second assumption is how we incorporate the variable representing the user's level of interest: the more interest a user has in a given topic, the more often and the longer he is likely to visit the higher-quality pages in that topic domain; conversely, the higher the quality of the pages in the domain, the more often and the longer they are likely to be visited by users who are more interested in the topic. Thus a user's level of interest and the quality of the Web pages he visits reinforce each other iteratively, and based on these two-way feedbacks the ihits algorithm can implicitly capture the user's level of interest.

3.2 The ihits Algorithm

Let S be a set of Web pages in a given topic domain T. Let A_T(p) and H_T(p) be the authority and hub values of a Web page p in S, and let I_T(u) be the interest level of user u towards the topic domain T. We express the two assumptions of two-way feedback as:

A_T(p) = \gamma \sum_{m \in S,\, m \to p} H_T(m) + (1 - \gamma) \sum_{u \to p} I_T(u)    (I)

H_T(p) = \gamma \sum_{n \in S,\, p \to n} A_T(n) + (1 - \gamma) \sum_{u \to p} I_T(u)    (II)

I_T(u) = (1 - \gamma) \left[ \sum_{p \in S,\, u \to p} A_T(p) + \sum_{p \in S,\, u \to p} H_T(p) \right]    (III)

In the equations above, an arrow denotes a hyperlink from the left operand to the right operand when both operands are Web pages, or a visit by user u to page p. The variable γ (0 ≤ γ ≤ 1) adjusts the influence of the two feedback assumptions: a large γ makes the first assumption more significant, whereas a small γ makes the second more significant. When γ = 1, the equations degrade to the original HITS computation.

Based on the link structure of the Web pages in S, we construct the adjacency matrix Adj that captures the linkage between every pair of pages in S:

Adj = [a_{pq}], where a_{pq} = 1 if there exists a hyperlink from page p to page q, and a_{pq} = 0 otherwise.    (IV)

We represent a user's interest in topic T by the frequency and durations of his visits to the pages in S, encoded in the visit matrix V = [v_{up}]. Let F(u, p) be the frequency of user u's visits to page p, and let D_i(u, p) be the duration, in seconds, of user u's i-th visit to page p. We compute:

v_{up} = \lg \left[ \beta F(u, p) + (1 - \beta) \max_{i = 1, \dots, F(u, p)} D_i(u, p) \right]    (V)

In equation (V), for any given user, F(u, p) is capped at 10, and so is the maximum duration \max_i D_i(u, p). Here we assume that 10 seconds is the maximum time a user needs to judge whether the current page is worth further reading; this assumption can easily be adjusted to individual reading habits. The parameter β (0 ≤ β ≤ 1) balances visit frequency against visit duration: a large β increases the influence of frequency, while a small β increases the influence of duration. Equation (V) thus incorporates both the frequency and the duration of user u's visits to page p, and each element of V falls into [0, 1].

Representing the authority, hub, and user-interest values by the vectors A, H, and I, we can use (IV) and (V) to rewrite equations (I)-(III) in matrix form:

A = \gamma \, Adj^T H + (1 - \gamma) \, V^T I    (E1)

H = \gamma \, Adj \, A + (1 - \gamma) \, V^T I    (E2)

I = (1 - \gamma) \, V (A + H)    (E3)

Equations (E1)-(E3) drive the computation described in the next section; a short sketch of how the two matrices can be assembled in code follows below.
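
For concreteness, the following minimal Python sketch shows one way to assemble the two matrices. It is illustrative only: the function and argument names are ours, "lg" is read as the base-10 logarithm (so that the cap of 10 maps to exactly 1), and negative values are floored at 0 to keep every element of V within [0, 1].

```python
import numpy as np

def build_adjacency(pages, links):
    # Equation (IV): a_pq = 1 iff there is a hyperlink from page p to page q.
    # `pages` is a list of page identifiers; `links` is an iterable of
    # (source, target) pairs compiled from the pages' linkage information.
    index = {p: i for i, p in enumerate(pages)}
    adj = np.zeros((len(pages), len(pages)))
    for p, q in links:
        adj[index[p], index[q]] = 1.0
    return adj

def build_visit_matrix(users, pages, freq, max_dur, beta=0.5):
    # Equation (V): v_up = lg[beta * F(u, p) + (1 - beta) * max_i D_i(u, p)],
    # with the frequency and the maximum duration both capped at 10.
    # `freq[(u, p)]` and `max_dur[(u, p)]` come from the browsing-history logs.
    v = np.zeros((len(users), len(pages)))
    for i, u in enumerate(users):
        for j, p in enumerate(pages):
            f = min(freq.get((u, p), 0), 10)
            d = min(max_dur.get((u, p), 0.0), 10.0)
            score = beta * f + (1 - beta) * d
            if score > 0:
                # lg of a score in (0, 10] lies in (-inf, 1]; flooring at 0
                # keeps the element in [0, 1] as stated above.
                v[i, j] = max(0.0, np.log10(score))
    return v
```
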
3.3 Procedure to Generate the Interests Profile

Based on the algorithm above, we build the ranked user-interests profile through the following steps (a sketch of the iterative computation itself follows the procedure):

Procedure ihitscompute()
Input:
- Set S_r, which denotes the pages that belong to topic T and have previously been visited by user u.
- User u's visit pattern (logs of his visit frequencies and durations).
Output: User u's top N most interesting topics.

S1: Expand S_r by adding the pages that either point to, or are pointed to by, pages in S_r, producing the page set S, and construct the adjacency matrix Adj of S;
S2: Retrieve the logs of the user's visit frequencies and durations, and construct the visit matrix V;
S3: Apply the ihits equations of the previous section iteratively until the computation converges;
S4: Assign I as user u's interest level towards topic T;
S5: If T already exists in user u's profile, update T's interest level to I (simply overwriting the previous value, or, if the effect of time is taken into account, incorporating the previous value appropriately); otherwise add topic T to u's profile as a new topic, together with its interest level I;
S6: Sort the topics T_i in user u's profile by their interest levels I_i in descending order, and return the top N topics as user u's most interesting topics.
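
Step S3 is the core of the procedure. The sketch below, under the same illustrative assumptions as the previous listing, iterates equations (E1)-(E3) to a fixed point; the per-round L2 normalization mirrors standard HITS practice, and the tolerance and iteration cap are our choices, since a formal convergence proof is still outstanding (see Section 5.1).

```python
import numpy as np

def ihits_iterate(adj, v, gamma=0.5, tol=1e-8, max_iter=1000):
    # Iterate (E1)-(E3) from uniform starting vectors until the values settle.
    n_users, n_pages = v.shape
    a, h = np.ones(n_pages), np.ones(n_pages)
    i_vec = np.ones(n_users)
    for _ in range(max_iter):
        a_new = gamma * adj.T @ h + (1 - gamma) * v.T @ i_vec  # (E1)
        h_new = gamma * adj @ a + (1 - gamma) * v.T @ i_vec    # (E2)
        i_new = (1 - gamma) * v @ (a_new + h_new)              # (E3)
        # Normalize each round, as in standard HITS, to keep values bounded.
        a_new /= max(np.linalg.norm(a_new), 1e-12)
        h_new /= max(np.linalg.norm(h_new), 1e-12)
        i_new /= max(np.linalg.norm(i_new), 1e-12)
        if (np.abs(a_new - a).max() < tol and np.abs(h_new - h).max() < tol
                and np.abs(i_new - i_vec).max() < tol):
            return a_new, h_new, i_new
        a, h, i_vec = a_new, h_new, i_new
    return a, h, i_vec
```

With the converged I in hand, steps S4-S6 reduce to recording I as the interest level for topic T and sorting the profile, e.g. sorted(profile.items(), key=lambda kv: kv[1], reverse=True)[:N].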

4. Evaluation and Results

In this section we describe how we test the approach, presenting the experiment design and the results for discussion. Plans for further evaluation are given in the next section.

4.1 Experiment Design

We first randomly choose seven different topics and select one representative Web page for each (see Table 1). A subject is employed to first rank the seven topics on a scale of 1 to 7, with 1 for the topic of greatest interest and 7 for the least. After that, we generate a random sequence of the numbers 1 to 7, which fixes the browsing order of the seven topics, and in that order we ask the same subject to freely browse the corresponding Web pages. We record the subject's visit frequencies and durations with GoldenEye (http://www.monitoring-spy-software.com/), a background monitoring tool. After all seven topics are covered, we construct the adjacency matrix and the visit matrix. First, we manually compile the outgoing links on the Web pages, and we find the incoming links using the Google search engine's special query parameter link:url, which returns a list of Web pages that point to url; from this linkage information we build the adjacency matrix with equation (IV). For the visit matrix, we extract the subject's visit frequencies and durations from the log files exported by the monitoring software and apply equation (V). We then follow the six-step procedure of the previous section to obtain the results of our initial evaluation, summarized in the next subsection.

Table 1. Selected topics and Web pages

Topic No.  Topic Term   Web page
1          JAVA         java.sun.com
2          Movie        www.imdb.com
3          Travel       www.letsgo.com
4          Photography  www.photo.net
5          News         www.cnn.com
6          Tax          taxes.yahoo.com
7          Music        www.mp3.com

4.2 Evaluation and Results

Evaluation is based on recall, defined here as the percentage of overlap between the test subject's ranking of the seven topics and the ihits ranking of his interest levels in them. As Table 2 shows, the two rankings agree on five of the seven topics, giving a recall of 5/7, i.e. 71.43%.

Table 2. Evaluation results

Topic No.  User's Ranking  ihits Ranking
1          6               6
2          4               3
3          2               2
4          1               1
5          3               4
6          5               5
7          7               7

Recall: 71.43%
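
As a sanity check, the recall figure can be recomputed from Table 2 in a few lines of Python (an illustrative snippet, not part of the experimental pipeline):

```python
user_rank  = [6, 4, 2, 1, 3, 5, 7]   # "User's Ranking" column of Table 2
ihits_rank = [6, 3, 2, 1, 4, 5, 7]   # "ihits Ranking" column of Table 2
matches = sum(u == r for u, r in zip(user_rank, ihits_rank))
print(f"Recall: {matches / len(user_rank):.2%}")  # -> Recall: 71.43%
```
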
5. Discussion and Future Work

5.1 The Convergence Problem

We have not yet rigorously proved the convergence of the ihits algorithm. However, a similar system in [12] shows a very strong tendency to converge, and since the elements of our visit matrix fall into [0, 1], we believe the same tendency holds for our approach. Although the computation did converge quickly in our experiments, a mathematical proof of convergence remains future work.

5.2 Different Weights for Novice and Expert Users

We believe the weight γ can be adjusted to fit the level of expertise of different users. A small γ is appropriate for novice users, who are less aware of Web page quality, so that we rely more on their visit patterns; a large γ is suitable for expert users, who are more aware of page quality, so that it is reasonable to give more credit to the quality factor. To train the system to choose an appropriate weight, a machine learning approach can be used with a small training set composed of manual entries.

5.3 Limitations of the Initial Evaluation

We have to point out four major limitations of the initial evaluation. First, the subject ranks his interest in the seven topics before he actually does the browsing, which may influence his browsing behavior and hurt the validity of the collected data. Second, we offer only a single Web page, rather than a multi-page set, for each of the seven topics, which makes the algorithm depend more on the user's visit patterns and less on the quality of the Web pages. Third, the small volume of data collected here cannot guarantee with confidence that the approach is also effective on large datasets. Fourth, we evaluate performance only by measuring the percentage of overlap; in the future we should also take into account the distance between the same topic's positions in the two ranking lists.

5.4 Plans for Future Evaluation

To examine the ihits approach further, we are currently planning an evaluation that involves much less bias. We first randomly choose five topics and, for each, obtain the first 20 URLs retrieved by a popular search engine (e.g., Google). We then shuffle these 100 (5 x 20) URLs into a single random list; in this way we try to minimize the bias introduced by the search engine's ranking algorithm and by the user's browsing sequence. During the experiment, we employ a number of test subjects to browse these 100 URLs freely according to their own interests, suggesting that they visit the pages they find more interesting earlier and the less interesting ones later. We record their visit patterns (URL, frequency, duration) with the background monitoring software. Post-experiment questionnaires ask the subjects to rank the five topics from one to five. In this way we hope to overcome the four limitations of our initial evaluation. In the meantime, we are also developing a search interface for the CiteSeer Digital Library that incorporates the ihits approach; its usage data will be collected for evaluation purposes.

6. Conclusions

An effective user-interests profiling system usually requires multiple approaches, whether explicit or implicit. In this paper we propose ihits, a novel approach that automatically generates a ranked list of a user's interests with an extended HITS algorithm analyzing both the linkage information of Web pages and the user's browsing patterns. The initial evaluation shows that the approach has the potential to reach satisfactory results and is worth further exploration, and we have discussed our plans for future work. We believe this study is promising and may eventually deliver a novel tool for user-interests profiling based on link analysis and Web usage logs.

7. References

[1] M. Spiliopoulou. (1999). Data Mining for the Web. In Proc. of Principles of Data Mining and Knowledge Discovery (PKDD), 1999.
[2] S. Middleton, N. Shadbolt, D. De Roure. (2004). Ontological User Profiling in Recommender Systems. ACM Transactions on Information Systems, Vol. 22, No. 1, January 2004, Pages 54-88.
[3] T. Yan, H. Garcia-Molina. (1995). SIFT - A Tool for Wide-Area Information Dissemination. In Proc. of the 1995 USENIX Technical Conference, 1995.
[4] K. Lang. (1995). NewsWeeder: Learning to Filter NetNews. In Proc. of the International Conference on Machine Learning, 1995, Pages 331-339.
[5] K. Hyoung, C. Philip. (2003). Learning Implicit User Interests Hierarchy for Context in Personalization. IUI 2003.
[6] M. Morita, Y. Shinoda. (1994). Information Filtering Based on User Behavior Analysis and Best Match Text Retrieval. In Proc. of the 17th SIGIR Conference, 1994.
[7] D. Billsus, M. Pazzani. (1999). A Personal News Agent that Talks, Learns and Explains. In Proc. of Autonomous Agents 1999, Seattle, WA, USA.
[8] H. Sakagami, T. Kamba. (1997). Learning Personal Preferences on Online Newspaper Articles from User Behaviors. In Proc. of the 6th WWW Conference, 1997.
[9] H. Yoshinori. (2004). Implicit User Profiling for On Demand Relevance Feedback. IUI 2004.
[10] B. Sheth. (1994). NewT: A Learning Approach to Personalized Information Filtering. Master's thesis, Department of Electrical Engineering and Computer Science, MIT, 1994.
[11] J. Kleinberg. (1998). Authoritative Sources in a Hyperlinked Environment. In Proc. of the 9th ACM-SIAM Symposium on Discrete Algorithms, 1998.
[12] J. Wang, Z. Chen, L. Tao, W. Ma, W. Liu. (2002). Ranking User's Relevance to a Topic through Link Analysis on Web Logs. In Proc. of WIDM '02, November 8, 2002, Virginia, USA.