Mining Web Logs for Personalized Site Maps

Fergus Toolan, Nicholas Kushmerick
Smart Media Institute, Computer Science Department, University College Dublin
{fergus.toolan, nick}@ucd.ie

Abstract. Navigating through a large Web site can be a frustrating exercise. Many sites employ Site Maps to help visitors understand the overall structure of the site. However, by their very nature, unpersonalized Site Maps show most visitors large amounts of irrelevant content. We propose techniques based on Web usage mining to deliver Personalized Site Maps that are specialized to the interests of each individual visitor. The key challenge is to resolve the tension between simplicity (showing just relevant content) and comprehensibility (showing sufficient context so that visitors can understand how the content is related to the overall structure of the site). We develop two baseline algorithms (one that relies on shortest paths, and one that mines the server log for popular paths), and compare them to a novel approach that mines the server log for popular path fragments that can be dynamically assembled to reconstruct popular paths. Our experiments with two large Web sites confirm that the mined path fragments provide much better coverage of visitors' sessions than the baseline approach of mining entire paths.

1. Introduction

Finding relevant information in a large Web site can be tedious and frustrating. Site Maps are commonly used by Web developers to help visitors understand and navigate complex sites. For example, Figure 1(a) shows a portion of Apple.com's Site Map. By their very nature, Site Maps present nearly all of a Web site's content. Of course, most visitors are interested in just a small subset of this content [3]. Figure 1(b) illustrates how this Site Map could be personalized for a particular visitor who is interested in just a few aspects of Apple.com.
Our research goal is to develop techniques that enable Web sites to automatically deliver Personalized Site Maps. Achieving this goal involves solving two sub-problems. The first challenge is to determine which content items (i.e., Web pages) each visitor is actually interested in. The second challenge is to display these relevant pages in a way that helps visitors understand how the relevant pages are related. Web designers invest substantial effort in crafting Site Maps in order to help visitors understand the overall structure of the Web site, and Personalized Site Maps must not throw the baby out with the bathwater by ignoring this structure. For example, a visitor interested in 17 inch Studio Displays should not simply be pointed to the most relevant page; she should also be shown how the page is related to the site's Products section.

We adopt a simple solution to the first problem: we assume that the visitor expresses her interests with an explicit query such as "inexpensive studio displays" that is processed with standard information retrieval techniques, and we defer to future work more sophisticated approaches based on collaborative filtering and other forms of user models.

In this paper, we focus on the second sub-problem: how to organize a set of relevant Web pages to reflect the site's structure. We note that there is a trade-off between two competing considerations. On the one hand, Personalized Site Maps should be as simple as possible. This suggests a trivial approach in which a Personalized Site Map displays just the shortest paths from each of the relevant pages to the site's home page. On the other hand, short paths are not necessarily intuitively meaningful to visitors [4]. For example, Web sites often contain numerous navigational cross-links, so the shortest path between two pages may well involve completely unrelated parts of the site.
We adopt the assumption that the most comprehensible path between two pages will be the one that has been most popular with previous site visitors [7]. We therefore add to our Personalized Site Maps paths that have been frequently traversed by past site visitors, which are often not the shortest.

(Submitted to First International Workshop on Mining for Enhanced Web Search; draft of 01/08/02.)

The technical challenge of our work concerns how to compute the most popular path between a given pair of pages. The naïve approach would be to extract the N most popular paths from the server's access log. However, given the inherent diversity of visitors' interests, N must be extremely large in order to obtain sufficient coverage over actual site visitors. To address this sparseness, we propose a novel algorithm for mining fragments of paths, rather than entire paths, from the server logs, and then assembling the fragments. For example, suppose that A>B>C>D>E>F>G and A>B>C>D>H>I>J are two paths that occur frequently in past visitors' sessions, where the notation x>y indicates a traversal by a particular visitor from page x to page y. Using the naïve approach we need to store the two paths in their entirety. However, we could store only A>B>C>D, D>E>F>G and D>H>I>J and then recreate the full paths. Because the shared prefix A>B>C>D is stored only once, this path fragment method allows us to compress the previous sessions much more than storing entire paths.

We make the following contributions. First, we formalize the problem of constructing Personalized Site Maps (Section 2). Second, we describe our algorithm for solving this problem, which mines popular path fragments from server logs (Section 3). Finally, using data from two Web sites, we empirically demonstrate that shortest paths are often quite unpopular (thus providing evidence that shortest paths are not intuitively meaningful), and that our mined path fragments provide better coverage than simply storing entire paths (Section 4).

Figure 1: (a) The Apple.com Site Map, and (b) a fictitious personalized version that displays only the few pages that are relevant to a particular visitor.

2. Problem formalization

We formalize the problem of constructing a Personalized Site Map as follows. We take as input a Web site's graph $G=(V,E)$ and its distinguished home page (root) $r \in V$. Each node in $V$ corresponds to a Web page, and a directed edge $(u,v) \in E$ represents a hyperlink between the corresponding documents. We also assume that a set of relevant nodes $R = \{r_1, r_2, \ldots, r_n\}$ has been identified during the initial relevance-assessment step. A Personalized Site Map is a subgraph $G'=(V',E')$ of the original graph, such that $G'$ contains the root and the relevant nodes (i.e., $\{r, r_1, r_2, \ldots, r_n\} \subseteq V' \subseteq V$), as well as sufficient additional nodes and edges from $G$ so that $G'$ contains a path from the root $r$ to each relevant node $r_i$.

So far, this personalization task is highly under-constrained, as there may be many such subgraphs $G'=(V',E')$. To decide between alternative subgraphs, we exploit the actual visitor usage data from the site's server log. The intuition is that we want to select the alternative $G'$ whose edges $E'$ are the most popular among previous site visitors.

Our task thus reduces to the following: given a Web site server log, we want to mine sufficient data from the log in order to be able to reliably reconstruct the most popular path from any node $u \in V$ to any other node $v \in V$. Naturally, without access to the entire log some data will necessarily be lost and this reconstruction process cannot be perfect. We will therefore be interested in empirically comparing the coverage of alternative algorithms.

3. Algorithms

We begin with a brief discussion of server log pre-processing. We then describe three alternative algorithms for constructing Personalized Site Maps. The first baseline algorithm, SP, ignores the server log and simply assumes that shorter paths are more popular than longer paths. The second algorithm, PP, extracts the N most popular paths from the server log, and tries to reconstruct the most popular path between two pages using these N paths. As mentioned above, PP is ineffective because path traversal logs for large graphs are necessarily very sparse, and thus N must be very large to ensure adequate coverage. Our third algorithm, MP, mines path fragments from server logs, and then dynamically assembles them into a path from a given node u to another node v. Since MP discards strictly more information than PP, it can make mistakes, but our experiments in Section 4 demonstrate its effectiveness in practice.

3.1. Server log pre-processing. Web server logs contain a large amount of noise that must be discarded, and they often omit data that must be inferred [1]. Noise corresponds to requests for images, applets, etc., which are logged yet are irrelevant for our purposes because they are embedded in page views. Data may be missing due to caching by, for example, the browser or Internet service provider. This arises most commonly when the visitor uses the browser's back button. For example, if a user traverses the path u>v, then hits the back button, and then traverses u>w, this will appear in the log as u>v>w. We use a simple path completion algorithm to automatically insert entries that must be missing, given the known structure of the site graph. The final problem with server logs is that requests are stored in the order that the server receives them.
Specifically, if multiple people are browsing the site concurrently, their requests are intermingled in the log file. We use simple session extraction heuristics to segment the entire log into a sequence of sessions. First, we partition requests by IP address. Second, we use an inter-access delay threshold D to split a sequence of accesses from a given IP into one or more sessions. Based on the preliminary experiments described in Section 4.2, we set D = 15 minutes.

3.2. SP ("shortest path") algorithm. The simplest technique in any route planning system is to take the shortest path between two points [4]. Essentially, in order to estimate the most popular path from node u to node v, the SP algorithm ignores the behaviour of past visitors entirely and simply assumes that short paths are more popular than long paths. To evaluate the SP algorithm, we measure its coverage. The coverage of the SP algorithm is the fraction of extracted sessions in which users went from page u to page v via the shortest path. In Section 4 we empirically demonstrate that the coverage of SP is in fact quite low.

3.3. PP ("popular paths") algorithm. The PP algorithm simply records the N most frequent sessions extracted from the server log during the pre-processing step. It is an example of sequential pattern discovery from Web logs, as seen in [5] and [2]. The coverage of PP is the fraction of extracted sessions in which the user navigated from page u to page v via the most popular path from u to v. In Section 4 we demonstrate that N must be quite large in order to obtain sufficient coverage over the entire Web graph.

3.4. MP ("mined paths") algorithm. The MP algorithm expands each server log session into the set of all its subpaths of length between K_min and K_max. The N most popular such fragments are then used to reconstruct a path from a page u to a page v. To do so, MP considers all possible ways to assemble the mined fragments, subject to the constraint that adjacent fragments must overlap on A pages.
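As a minimal sketch of the fragment-mining step (not the authors' implementation), the following Python code expands each session into all contiguous subpaths of length K_min to K_max, keeps the N most frequent, and joins two fragments when the last A pages of one equal the first A pages of the other. The function names `mine_fragments` and `assemble`, and the representation of a session as a list of page identifiers, are assumptions made for illustration.

```python
from collections import Counter


def mine_fragments(sessions, k_min, k_max, n):
    """Return the n most frequent contiguous subpaths of length k_min..k_max.

    `sessions` is assumed to be an iterable of page-identifier sequences.
    """
    counts = Counter()
    for session in sessions:
        for k in range(k_min, k_max + 1):
            # Slide a window of length k over the session.
            for i in range(len(session) - k + 1):
                counts[tuple(session[i:i + k])] += 1
    return [fragment for fragment, _ in counts.most_common(n)]


def assemble(left, right, overlap):
    """Join two fragments that overlap on exactly `overlap` pages.

    Returns the joined path, or None if the overlap constraint fails.
    """
    left, right = tuple(left), tuple(right)
    if overlap <= 0 or len(left) < overlap or len(right) < overlap:
        return None
    if left[-overlap:] != right[:overlap]:
        return None
    return left + right[overlap:]


# The two frequent paths from the Introduction share the fragment A>B>C>D.
sessions = [list("ABCDEFG"), list("ABCDHIJ")]
top_fragment = mine_fragments(sessions, k_min=4, k_max=4, n=1)
```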
For instance, if A=2 then the two fragments u>v>w and v>w>x can be assembled to create a path u>v>w>x. This overlap constraint corresponds to an assumption that Web navigation can be modelled as a Markov process of order A [9]. In our experiments we use A=2, K_min=4, and K_max=15. We leave to future work a systematic exploration of optimal values for these parameters. The coverage of MP is the fraction of extracted sessions that can be recovered from the mined fragments. In Section 4 we demonstrate that, for a given value of N, MP has better coverage than PP.

4. Experiments

We now describe an experimental evaluation of the techniques discussed in the previous section. We begin with a discussion of the two datasets we used for our experiments. We then describe the results of experiments designed to answer the following questions:

1. How sensitive are our results to the inter-access delay threshold D used to segment the raw server log into sessions? (Section 4.2)
2. How frequently is the shortest path between two pages the most popular path? (Section 4.3)
3. How does the coverage of PP compare to that of MP, as a function of the amount N of mined data? (Section 4.4)

4.1. Datasets. We evaluated our techniques on two Web sites: the server for the Computer Science Department of University College Dublin (www.cs.ucd.ie), and Music Machines (machines.hyperreal.org; see footnote 1). Figure 2 summarises these datasets.

                          UCD CS                 Music Machines
Time Period               Apr 2000 to Dec 2001   Feb 1997 to Apr 1999
Total Requests            4,327,397              14,722,468
After Pre-processing      1,258,643              2,996,322
Number of Distinct IPs    55,429                 270,092
Number of Sessions        236,675                554,801
Mean Session Length       5.32                   5.40

Figure 2: Summary of the experimental data. The total number of requests includes images, applets, etc. The number after pre-processing is the number of requests for actual page views. The number of distinct IPs is the number of IP addresses from which the server received requests in the time period. Note that the number of IP addresses is not equal to the number of actual visitors, due to noise introduced by proxy servers, and it is not equal to the number of sessions, because each visitor may initiate several sessions in the log file time period.

Footnote 1: The Music Machines server logs were archived by Mike Perkowitz and are available at www.cs.washington.edu/ai/adaptive-data.

4.2. Threshold Experiments. The first experiments relate to the inter-access delay threshold D used to segment the raw server log into sessions. Specifically, we want to ensure that the results from our subsequent experiments are not overly sensitive to the setting of this free parameter.
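The session extraction heuristic whose threshold is varied in these experiments can be sketched as follows. This is an illustrative reading of Section 3.1, not the authors' code: the function name `extract_sessions` and the input format (tuples of IP address, timestamp in seconds, and page) are assumptions, and noise removal and path completion are taken to have happened already.

```python
from collections import defaultdict


def extract_sessions(requests, threshold_minutes=15):
    """Split a request log into sessions.

    `requests` is assumed to be an iterable of (ip, timestamp_seconds, page).
    Requests are partitioned by IP address; a gap longer than the
    inter-access delay threshold D starts a new session.
    """
    by_ip = defaultdict(list)
    for ip, timestamp, page in requests:
        by_ip[ip].append((timestamp, page))

    limit = threshold_minutes * 60
    sessions = []
    for hits in by_ip.values():
        hits.sort()  # chronological order within one IP address
        current = [hits[0][1]]
        for (previous_time, _), (timestamp, page) in zip(hits, hits[1:]):
            if timestamp - previous_time > limit:
                sessions.append(current)
                current = []
            current.append(page)
        sessions.append(current)
    return sessions


# Hypothetical log: two interleaved visitors, one 16-minute pause.
log = [
    ("10.0.0.1", 0, "a"),
    ("10.0.0.2", 10, "x"),
    ("10.0.0.1", 60, "b"),
    ("10.0.0.1", 60 + 16 * 60, "c"),
]
sessions_15 = extract_sessions(log, threshold_minutes=15)
sessions_20 = extract_sessions(log, threshold_minutes=20)
```

With D = 15 minutes the 16-minute pause splits the first visitor's requests into two sessions, whereas with D = 20 minutes they remain one session, which is why fewer sessions are extracted as D grows.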

Figure 3: Number of sessions extracted from the UCD log, as a function of the inter-access delay threshold D (in minutes).

Figure 3 shows the number of distinct sessions extracted from the log files of the UCD Web site as the session threshold increases from 5 to 45 minutes. While the number of sessions grows rapidly as D decreases, the variation is much smaller at the intuitively reasonable larger thresholds.

Our second experiment compared the overlap between the popular paths mined by the PP algorithm, using thresholds of 10, 15 and 20 minutes. As shown in Figure 4, there is a substantial overlap between the various sets of mined paths.

Figure 4: Overlap (%) between paths mined from the UCD log by the PP algorithm, for three pairs of inter-access delay thresholds (10-15, 10-20 and 15-20 minutes).

Based on this data, we conclude that our technique is relatively stable across values of the inter-access delay threshold D. We set the threshold D=15 minutes for the remainder of the experiments.

4.3. Comparison of shortest and popular paths. The next experiment seeks to confirm that shortest paths are not necessarily the most popular. Figure 5 shows the fraction of popular paths that are in fact the shortest path, as a function of the number N of paths mined by the PP algorithm, for both Web sites. For example, of the N=40 most popular paths mined by PP, 60% are in fact the shortest path. We can see that for the most popular paths (i.e., for small values of N), shortness is indeed a good proxy for popularity. However, as N increases and less popular paths are included, the overlap between PP and SP decreases substantially. We conclude that, as predicted, popular paths are frequently not the shortest paths.

Figure 5: Overlap (%) between popular and shortest paths, as a function of the number of mined paths N, for the UCD and Music Machines (MM) sites.

4.4. Coverage of mined and popular paths. So far, our experiments have been concerned with demonstrating that SP and PP do indeed generate different paths. In this section we investigate the benefits of using mined path fragments as opposed to just using the popular paths. For various values of N, we measure the fraction of the extracted sessions that can be reconstructed in their entirety using MP. With N=5000 we can reconstruct 27% of sessions from the UCD log file in their entirety, and 14% of the Music Machines sessions.

We can generalize this experiment by measuring the fraction of individual sessions that each algorithm can reconstruct. That is, we know that 27% of the UCD sessions can be fully (100%) reconstructed, but presumably many other sessions can be, say, 75% reconstructed. We therefore measured the average fraction of a given session that can be reconstructed by each of the two algorithms. For both Web sites, our experimental results in Figures 6 and 7 demonstrate that the mined path fragments can be used to reconstruct a greater proportion of individual traversals than relying solely on popular paths. Specifically, we measure the coverage of MP and PP for various values of N up to 1000, and find that for each value of N, MP has higher coverage than PP, and the coverage gap grows rapidly as N increases.

Figure 6: Coverage of the MP and PP algorithms for the UCD site, as a function of the number of mined paths or fragments N (curves: Popular, Mined, Difference).
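The whole-session coverage measurement for MP can be sketched as below. The paper does not spell out the reconstruction procedure, so this is one plausible reading under stated assumptions: a session counts as reconstructed if it can be tiled left to right by mined fragments, with each adjacent pair overlapping on exactly A pages. The names `can_reconstruct` and `coverage` are hypothetical.

```python
def can_reconstruct(session, fragments, overlap):
    """Check whether `session` can be assembled from `fragments`.

    Assumption: adjacent fragments must overlap on exactly `overlap` pages.
    """
    session = tuple(session)
    n = len(session)
    fragment_set = {tuple(f) for f in fragments}
    # Only fragments longer than the overlap can extend a partial path.
    lengths = sorted({len(f) for f in fragment_set if overlap < len(f) <= n})

    # `reached` holds prefix lengths of the session that are reconstructable.
    reached = {k for k in lengths if session[:k] in fragment_set}
    frontier = list(reached)
    while frontier:
        end = frontier.pop()
        start = end - overlap  # next fragment re-covers the last `overlap` pages
        for k in lengths:
            new_end = start + k
            if (new_end <= n and new_end not in reached
                    and session[start:new_end] in fragment_set):
                reached.add(new_end)
                frontier.append(new_end)
    return n in reached


def coverage(sessions, fragments, overlap):
    """Fraction of sessions that can be reconstructed in their entirety."""
    sessions = list(sessions)
    if not sessions:
        return 0.0
    hits = sum(can_reconstruct(s, fragments, overlap) for s in sessions)
    return hits / len(sessions)


fragments = [("u", "v", "w"), ("v", "w", "x")]
example_sessions = [["u", "v", "w", "x"], ["u", "v", "w", "y"]]
example_coverage = coverage(example_sessions, fragments, overlap=2)
```

Under the same reading, the corresponding PP measurement would simply test whether a session appears among the N stored paths.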

Figure 7: Coverage of the MP and PP algorithms for the Music Machines site, as a function of the number of mined paths or fragments N (curves: Popular, Mined, Difference).

5. Related Work

Previous research in Web usage and server log mining addresses two major issues: the pre-processing of the raw data, and the discovery of patterns or rules in the data. Our work relates to both of these areas.

Pre-processing is discussed in detail in [1]. The aim of pre-processing Web server log files is to obtain the set of sessions (visits) recorded in the log files. It can be divided into three distinct phases: data cleansing, user/session identification, and path completion. Our system implements all three of these components.

The second phase of Web usage mining is pattern discovery [1,2,5,6]. Pattern discovery involves the extraction of meaningful information, such as association rules, classification rules, or sequential patterns. The PP and MP algorithms can be seen as the pattern discovery phase for the Personalized Site Map task.

The construction of improved site maps is discussed in [3]. Li et al. discuss the need for topic-focused site maps that home in on the user's interests and display the corresponding section of the map. They also discuss the granularity of the site map, that is, the level of detail the map should show. Their method extracts logical domains from the Web site, where each logical domain is associated with a certain topic. Unlike our system, they use semantic knowledge from the pages' contents.

6. Future Work

Our Personalized Site Map algorithms have been fully implemented. Our current focus involves measuring the effectiveness of our approach. Our experiments have demonstrated that our technique works well, in the sense that we are able to build site maps containing popular (as opposed to merely short) paths.
We believe that users will find popular paths more intuitive than paths that are merely short, but we have not yet established this empirically. We intend to conduct user trials of the system to obtain users' judgements of the quality of the Personalized Site Maps. For example, one important topic is the generality/specificity of the pages on the paths: do the pages earlier in the path contain more general information than later pages?

We are also exploring other applications for our path-mining algorithm. At its core, our approach predicts which pages are likely to be viewed next, given a prefix of a visitor's trajectory. Therefore, a second potential application concerns using this predictive ability for pre-fetching and caching [8].
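To make the idea of next-page prediction concrete, here is a small sketch. It is not the PP-based procedure used in the preliminary experiment, whose details are not given; instead it swaps in a simple order-A lookup in the spirit of the Markov assumption from Section 3.4, counting which page most often follows the last A pages of a prefix. The names `build_next_page_table` and `recommend_next` are hypothetical.

```python
from collections import Counter, defaultdict


def build_next_page_table(paths, context=2):
    """Count, for every run of `context` pages, which page followed it."""
    table = defaultdict(Counter)
    for path in paths:
        for i in range(context, len(path)):
            table[tuple(path[i - context:i])][path[i]] += 1
    return table


def recommend_next(prefix, table, context=2):
    """Recommend the most frequent successor of the last `context` pages.

    Returns None when the prefix is too short or was never observed.
    """
    key = tuple(prefix[-context:])
    if len(key) < context or key not in table:
        return None
    return table[key].most_common(1)[0][0]


# Hypothetical mined paths: after B>C, visitors went to D twice and E once.
mined_paths = [list("ABCD"), list("ABCE"), list("ABCD")]
table = build_next_page_table(mined_paths, context=2)
suggestion = recommend_next(list("ABC"), table, context=2)
```

A recommendation counts as correct when the visitor's actual next page equals the suggestion, which is the kind of accuracy figure reported for the pre-fetching experiment.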

Our preliminary investigation of pre-fetching shows promising results. For the UCD site, we allow the PP algorithm to recommend a likely next page after each session prefix. The entire dataset contains over 236,000 sessions, leading to 832,307 recommendations from PP. Of these recommendations, 43,349 (5.2%) were correct (i.e., the user did indeed visit the recommended page next). We intend to extend this experiment to caching of multiple pages, and to compare our approach with existing page-prediction algorithms.

Another possible direction would be to introduce a collaborative element to the system. We could rate each popular path for a user based on whether or not it appears in his sessions. Standard collaborative filtering techniques could then be used to recommend a particular popular path with greater confidence than our current PP algorithm.

7. Conclusions

We have introduced the problem of automatically constructing Personalized Site Maps. The key challenge is to display to the visitor a subgraph that both contains relevant content items and organizes them in a coherent and meaningful manner. Our approach is based on the assumption that the best way to indicate the relationship between a given pair of pages is to show the path between them that has been most popular with past visitors. Based on this assumption, we propose a naïve algorithm (PP) for mining popular paths from raw server logs, and a more sophisticated algorithm (MP) for mining path fragments. The key idea of MP is to mine a collection of path fragments that can be dynamically assembled in order to reconstruct many popular paths. Our experiments with two large Web sites confirm that MP can reconstruct a larger fraction of visitors' sessions than PP.

Acknowledgments. We thank Barry Smyth for helpful discussions. This research was funded by grant N-00014-00-1-0021 from the US Office of Naval Research, and grant 01/F.1/C015 from Science Foundation Ireland.
References

[1] Cooley, R. Web Usage Mining. PhD Thesis, Department of Computer Science, University of Minnesota, 2001.
[2] Gaul, W., Schmidt-Thieme, L. Mining web navigation path fragments. In Proceedings of the Workshop on Web Mining for E-Commerce -- Challenges and Opportunities, Boston, MA, August 2000.
[3] Li, W-S., Ayan, N. F., Kolak, O., Vu, Q. Constructing Multi-Granular and Topic Focused Web Sites. In Proceedings of WWW10-2000.
[4] McGinty, L., Smyth, B. Case Based Route Planning. In Proceedings of the 11th Conference on Artificial Intelligence and Cognitive Science, Galway, Ireland, 2000.
[5] Srikant, R., Agrawal, R. Mining Sequential Patterns: Generalisations and Performance Improvements. In Proceedings of the 5th International Conference Extending Database Systems, 1996.
[6] Srivastava, J., Cooley, R., Deshpande, M., Tan, P-N. Web Usage Mining: Discovery and Applications of Usage Patterns from Web Data. SIGKDD Explorations, Vol. 1, Issue 2, 2000.
[7] Wexelblat, A., Maes, P. Footprints: History-Rich Tools for Information Foraging. In Proceedings of CHI 99 Conference on Human Factors in Computing, 1999.
[8] Yang, Q., Zhang, H. H., Li, T. Mining Web Logs for Prediction in WWW Caching and Pre-fetching. In Seventh ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (KDD'01), San Francisco, 2001.
[9] Ypma, A., Heskes, T. Categorization of Web Pages and User Clustering with Mixtures of Hidden Markov Models. In Proceedings of the International Workshop on Web Knowledge Discovery and Data Mining (WEBKDD 02), July 23 2002, Edmonton, Canada.