Mining Web Logs for Personalized Site Maps

Fergus Toolan, Nicholas Kushmerick
Smart Media Institute, Computer Science Department, University College Dublin
{fergus.toolan, nick}@ucd.ie

Abstract. Navigating through a large Web site can be a frustrating exercise. Many sites employ Site Maps to help visitors understand the overall structure of the site. However, by their very nature, unpersonalized Site Maps show most visitors large amounts of irrelevant content. We propose techniques based on Web usage mining to deliver Personalized Site Maps that are specialized to the interests of each individual visitor. The key challenge is to resolve the tension between simplicity (showing just relevant content) and comprehensibility (showing sufficient context so that visitors can understand how the content is related to the overall structure of the site). We develop two baseline algorithms (one that relies on shortest paths, and one that mines the server log for popular paths), and compare them to a novel approach that mines the server log for popular path fragments that can be dynamically assembled to reconstruct popular paths. Our experiments with two large Web sites confirm that the mined path fragments provide much better coverage of visitors' sessions than the baseline approach of mining entire paths.

1. Introduction

Finding relevant information in a large Web site can be tedious and frustrating. Site Maps are commonly used by Web developers to help visitors understand and navigate complex sites. For example, Figure 1(a) shows a portion of Apple.com's Site Map. By their very nature, Site Maps present nearly all of a Web site's content. Of course, most visitors are interested in just a small subset of this content [3]. Figure 1(b) illustrates how this Site Map could be personalized for some particular visitor who is interested in just a few aspects of Apple.com.

Our research goal is to develop techniques to enable Web sites to automatically deliver Personalized Site Maps. Achieving this goal involves solving two sub-problems. The first challenge is to determine what content items (i.e., Web pages) each visitor is actually interested in. The second challenge is to display these relevant pages in a way that helps visitors understand how the relevant pages are related. Web designers invest substantial effort in crafting Site Maps in order to help visitors understand the overall structure of the Web site, and Personalized Site Maps must not throw the baby out with the bathwater by ignoring this structure. For example, a visitor interested in 17-inch Studio Displays should not simply be pointed to the most relevant page; she should be shown how the page is related to the site's Products section.

We adopt a simple solution to the first problem: we assume that the visitor expresses her interests with an explicit query such as "inexpensive studio displays" that is processed with standard information retrieval techniques, and we defer to future work more sophisticated approaches based on collaborative filtering and other forms of user models.

In this paper, we focus on the second sub-problem: how to organize a set of relevant Web pages to reflect the site's structure. We note that there is a trade-off between two competing considerations. On the one hand, Personalized Site Maps should be as simple as possible. This suggests a trivial approach in which a Personalized Site Map displays just the shortest paths from each of the relevant pages to the site's home page. However, short paths are not necessarily intuitively meaningful to visitors [4]. For example, Web sites often contain numerous navigational cross-links, so the shortest path between two pages may well involve completely unrelated parts of the site.
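
The cross-link effect can be illustrated with a minimal sketch. The toy site graph and the page names below are hypothetical and are not taken from the paper; the search is an ordinary breadth-first search, used here only to show how a shortest path can bypass the section a visitor would expect to pass through.

```python
from collections import deque

def shortest_path(graph, source, target):
    """Breadth-first search: return one shortest path as a list of pages, or None."""
    parents = {source: None}
    queue = deque([source])
    while queue:
        page = queue.popleft()
        if page == target:
            # Walk the parent links back to the source.
            path = []
            while page is not None:
                path.append(page)
                page = parents[page]
            return path[::-1]
        for nxt in graph.get(page, ()):
            if nxt not in parents:
                parents[nxt] = page
                queue.append(nxt)
    return None

# Hypothetical toy site: "support" carries a cross-link straight to the
# display page, bypassing the Products section.
site = {
    "home": ["products", "support"],
    "products": ["displays"],
    "displays": ["studio-17"],
    "support": ["studio-17"],
}

print(" > ".join(shortest_path(site, "home", "studio-17")))
# prints: home > support > studio-17
```

In this toy graph the shortest route runs through the support page, although a visitor looking for a display would more plausibly expect to reach it through the Products section.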
We adopt the assumption that the most comprehensible path between two pages will be the one that has been most popular with previous site visitors [7]. We therefore add to our Personalized Site Maps paths that have been frequently traversed by past site visitors, which are often not the shortest.

The technical challenge of our work concerns how to compute the most popular path between a given pair of pages. The naïve approach would be to extract the N most popular paths from the server's access log. However, given the inherent diversity of visitors' interests, N must be extremely large in order to obtain sufficient coverage over actual site visitors. To address this sparseness, we propose a novel algorithm for mining fragments of paths, rather than entire paths, from the server logs, and then assembling the fragments. For example, suppose that A>B>C>D>E>F>G and A>B>C>D>H>I>J are two paths that occur frequently in past visitors' sessions, where the notation x>y indicates a traversal by a particular visitor from page x to page y. Using the naïve approach we need to store the two paths in their entirety. However, we could store only A>B>C>D, D>E>F>G and D>H>I>J and then recreate the full paths. This path fragment method allows us to compress the previous sessions much more than storing entire paths.

We make the following contributions. First, we formalize the problem of constructing Personalized Site Maps (Section 2). Second, we describe our algorithm for solving this problem, which mines popular path fragments from server logs (Section 3). Finally, using data from two Web sites, we empirically demonstrate that shortest paths are often quite unpopular (thus providing evidence that shortest paths are not intuitively meaningful), and that our mined path fragments provide better coverage than simply storing entire paths (Section 4).

Figure 1: (a) The Apple.com Site Map, and (b) a fictitious personalized version that displays only the few pages that are relevant to a particular visitor. [Panel titles: "Apple.com Site Map" and "Apple.com Personalized Site Map".]

(Submitted to the First International Workshop on Mining for Enhanced Web Search; draft of 01/08/02.)

2. Problem formalization

We formalize the problem of constructing a Personalized Site Map as follows. We take as input a Web site's graph G = (V, E) and its distinguished home page (root) r ∈ V. Each node in V corresponds to a Web page, and a directed edge (u, v) ∈ E represents a hyperlink between the corresponding documents. We also assume that a set of relevant nodes R = {r1, r2, ..., rn} has been identified during the initial relevance-assessment step. A Personalized Site Map is a subgraph G' = (V', E') of the original graph, such that G' contains the root and the relevant nodes (i.e., {r, r1, r2, ..., rn} ⊆ V' ⊆ V), as well as sufficient additional nodes and edges from G so that G' contains a path from the root r to each relevant node ri.

So far, this personalization task is highly under-constrained, as there may be many such subgraphs G' = (V', E'). To decide between alternative subgraphs, we exploit the actual visitor usage data from the site's server log. The intuition is that we want to select the alternative G' whose edges E' are the most popular among previous site visitors. Our task thus reduces to the following: given a Web site server log, we want to mine sufficient data from the log in order to be able to reliably reconstruct the most popular path from any node u ∈ V to any other node v ∈ V. Naturally, without access to the entire log some data will necessarily be lost and this reconstruction process cannot be perfect. We will therefore be interested in empirically comparing the coverage of alternative algorithms.

3. Algorithms

We begin with a brief discussion of server log pre-processing. We then describe three alternative algorithms for constructing Personalized Site Maps. The first baseline algorithm, SP, ignores the server log and simply assumes that shorter paths are more popular than longer paths. The second algorithm, PP, extracts the N most popular paths from the server log, and tries to reconstruct the most popular path between two pages using these N paths. As mentioned above, PP is ineffective because path traversal logs for large graphs are necessarily very sparse and thus N must be very large to ensure adequate coverage. Our third algorithm, MP, mines path fragments from server logs, and then dynamically assembles them into a path from a given node u to another node v. Since MP discards strictly more information than PP, it can make mistakes, but our experiments in Section 4 demonstrate its effectiveness in practice.

3.1. Server log pre-processing. Web server logs contain a large amount of noise that must be discarded, and they often lack data that must be inferred [1]. Noise corresponds to requests for images, applets, etc., which are logged yet are irrelevant for our purposes because they are embedded in page views. Data may be missing due to caching by, for example, the browser or Internet service provider. This arises most commonly when the visitor uses the browser's back button. For example, if a user traverses the path u>v, then hits the back button, and then traverses u>w, this will appear in the log as u>v>w. We use a simple path completion algorithm to automatically insert entries that must be missing given the known structure of the site graph. The final problem with server logs is that requests are stored in the order that the server receives them.
Specifically, if multiple people are browsing the site concurrently, their requests are intermingled in the log file. We use simple session extraction heuristics to segment the entire log into a sequence of sessions. First, we partition requests by IP address. Second, we use an inter-access delay threshold D to split a sequence of accesses from a given IP into one or more sessions. Based on the preliminary experiments described in Section 4.2, we set D = 15 minutes.

3.2. SP ("shortest path") algorithm. The simplest technique in any route planning system is the shortest path between two points [4]. Essentially, in order to estimate the most popular path from node u to node v, the SP algorithm ignores the past visitors entirely and simply assumes that short paths are more popular than long paths. To evaluate the SP algorithm, we measure its coverage. The coverage of the SP algorithm is the fraction of extracted sessions in which users went from page u to page v via the shortest path. In Section 4 we empirically demonstrate that the coverage of SP is in fact quite low.

3.3. PP ("popular paths") algorithm. The PP algorithm simply records the N most frequent sessions extracted from the server log during the pre-processing step. It is an example of sequential pattern discovery from web logs, as seen in [5] and [2]. The coverage of PP is the fraction of extracted sessions in which the user navigated from page u to page v via the most popular path from u to v. In Section 4 we demonstrate that N must be quite large in order to obtain sufficient coverage over the entire Web graph.

3.4. MP ("mined paths") algorithm. The MP algorithm expands each server log session into the set of all subpaths of length between K_min and K_max. The N most popular such fragments are then used to reconstruct a path from a page u to a page v. To do so, MP considers all possible ways to assemble the mined fragments, subject to the constraint that adjacent fragments must overlap on A pages.
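
The PP and MP mining steps can be sketched as follows. This is an illustration under stated assumptions, not the authors' implementation: the function names are hypothetical, subpaths are taken to be contiguous, path length is counted in pages, and ties between equally frequent paths are broken arbitrarily.

```python
from collections import Counter

def popular_paths(sessions, n):
    """PP sketch: the n most frequent complete sessions."""
    counts = Counter(tuple(s) for s in sessions)
    return [path for path, _ in counts.most_common(n)]

def mine_fragments(sessions, n, k_min, k_max):
    """MP sketch: the n most frequent contiguous subpaths whose length
    (assumed here to be counted in pages) lies between k_min and k_max."""
    counts = Counter()
    for s in sessions:
        for k in range(k_min, min(k_max, len(s)) + 1):
            for i in range(len(s) - k + 1):
                counts[tuple(s[i:i + k])] += 1
    return [frag for frag, _ in counts.most_common(n)]

def join(left, right, overlap):
    """Assemble two fragments when the last `overlap` pages of `left`
    equal the first `overlap` pages of `right`; otherwise return None."""
    if (len(left) >= overlap and len(right) >= overlap
            and left[-overlap:] == right[:overlap]):
        return left + right[overlap:]
    return None

# The two frequent paths used as an example in the introduction.
sessions = [list("ABCDEFG"), list("ABCDHIJ")]

# Each full session occurs once, but the shared fragment A>B>C>D occurs twice.
print(mine_fragments(sessions, n=1, k_min=4, k_max=4))
# prints: [('A', 'B', 'C', 'D')]
```

A full reconstruction would search over chains of such joins between a given pair of pages; the sketch shows only the pairwise step that the overlap constraint governs.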
For instance, if A=2 then the two fragments u>v>w and v>w>x can be assembled to create a path u>v>w>x. This overlap constraint corresponds to an assumption that Web navigation can be modelled as a Markov process of order A [9]. In our experiments we use A=2, K_min=4, and K_max=15. We leave to future work a systematic exploration of optimal values for these parameters. The coverage of MP is the fraction of extracted sessions that can be recovered from the mined fragments. In Section 4 we demonstrate that, for a given value of N, MP has better coverage than PP.

4. Experiments

We now describe an experimental evaluation of the techniques we discussed in the previous section. We begin with a discussion of the two datasets we used for our experiments. We then describe the results of experiments designed to answer the following questions:

1. How sensitive are our results to the inter-access delay threshold D used to segment the raw server log into sessions? (Section 4.2)
2. How frequently is the shortest path between two pages the most popular path? (Section 4.3)
3. How does the coverage of PP compare to that of MP, as a function of the amount N of mined data? (Section 4.4)

4.1. Datasets. We evaluated our techniques on two Web sites: the server for the Computer Science Department of University College Dublin (www.cs.ucd.ie), and Music Machines (machines.hyperreal.org; see footnote 1). Figure 2 summarises these datasets.

                          UCD CS                 Music Machines
Time period               Apr 2000 to Dec 2001   Feb 1997 to Apr 1999
Total requests            4,327,397              14,722,468
After pre-processing      1,258,643              2,996,322
Number of distinct IPs    55,429                 270,092
Number of sessions        236,675                554,801
Mean session length       5.32                   5.40

Figure 2: Summary of the experimental data. The total number of requests includes images, applets, etc. The number after pre-processing is the number of requests for actual page views. The number of distinct IPs is the number of IP addresses from which the server received requests in the time period. Note that the number of IP addresses is not equal to the number of actual visitors, due to noise introduced by proxy servers, and it is not equal to the number of sessions, because each visitor may initiate several sessions in the log file time period.

Footnote 1: The Music Machines server logs were archived by Mike Perkowitz and are available at www.cs.washington.edu/ai/adaptive-data.

4.2. Threshold Experiments. The first experiments relate to the inter-access delay threshold D used to segment the raw server log into sessions. Specifically, we want to ensure that the results from our subsequent experiments are not overly sensitive to the setting of this free parameter.
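
To make the role of D concrete, the session extraction heuristic of Section 3.1 can be sketched as below. The record layout (IP address, timestamp in seconds, page) and the function name are assumptions made for illustration, as is the choice to start a new session only when the gap strictly exceeds the threshold.

```python
from collections import defaultdict

def extract_sessions(requests, threshold_minutes=15):
    """Split page requests into sessions.

    `requests` is an iterable of (ip, timestamp_seconds, page) tuples in any
    order (a hypothetical record layout). Requests are first partitioned by
    IP address; a new session starts whenever the gap between consecutive
    requests from the same IP exceeds the threshold.
    """
    by_ip = defaultdict(list)
    for ip, ts, page in requests:
        by_ip[ip].append((ts, page))

    gap = threshold_minutes * 60
    sessions = []
    for hits in by_ip.values():
        hits.sort()  # chronological order within one IP
        current = [hits[0][1]]
        for (prev_ts, _), (ts, page) in zip(hits, hits[1:]):
            if ts - prev_ts > gap:
                sessions.append(current)
                current = []
            current.append(page)
        sessions.append(current)
    return sessions

# Hypothetical interleaved log from two IP addresses.
log = [
    ("10.0.0.1", 0, "/"),
    ("10.0.0.2", 30, "/"),
    ("10.0.0.1", 60, "/staff"),
    ("10.0.0.2", 90, "/research"),
    ("10.0.0.1", 60 + 16 * 60, "/courses"),  # 16 minutes after the previous hit
]

print(len(extract_sessions(log, threshold_minutes=15)))  # prints: 3
print(len(extract_sessions(log, threshold_minutes=20)))  # prints: 2
```

In this toy log, raising the threshold from 15 to 20 minutes merges the two sessions of the first IP address into one, which is the kind of sensitivity the experiments in this section examine.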

Figure 3: Number of sessions extracted from the UCD log, as a function of the inter-access delay threshold D. [Plot: number of sessions, 150,000 to 300,000, against D from 5 to 45 minutes.]

Figure 3 shows the number of distinct sessions extracted from the log files of the UCD web site as the session threshold increases from five to 45 minutes. While the number of sessions grows rapidly as D decreases, the variation is much smaller at the intuitively reasonable larger thresholds. Our second experiment compared the overlap between the popular paths mined by the PP algorithm, using thresholds of 10, 15 and 20 minutes. As shown in Figure 4, there is a substantial overlap between the various sets of mined paths.

Figure 4: Overlap between paths mined from the UCD log by the PP algorithm, for three pairs of inter-access delay thresholds. [Plot: overlap in percent, 0 to 100, for the threshold pairs 10-15, 10-20 and 15-20 minutes.]

Based on this data, we conclude that our technique is relatively stable across values of the inter-access delay threshold D. We set the threshold D=15 minutes for the remainder of the experiments.

4.3. Comparison of shortest and popular paths. The next experiment seeks to confirm that shortest paths are not necessarily the most popular. Figure 5 shows the fraction of popular paths that are in fact the shortest path, as a function of the number N of paths mined by the PP algorithm, for both web sites. For example, of the N=40 most popular paths mined by PP, 60% are in fact the shortest path. We can see that for the most popular paths (i.e., for small values of N), shortness is indeed a good proxy for popularity. However, as N increases the overlap between PP and SP decreases substantially. We conclude that, as predicted, popular paths are frequently sub-optimal (that is, not the shortest).

Figure 5: Overlap between popular and shortest paths. [Plot: overlap in percent, 0 to 100, against the number of mined paths N, 10 to 450; series: UCD and MM.]

4.4. Coverage of mined and popular paths. So far, our experiments have been concerned with demonstrating that SP and PP do indeed generate different paths. In this section we investigate the benefits of using mined paths as opposed to just using the popular paths. For various values of N, we measure the fraction of the extracted sessions that can be reconstructed in their entirety using MP. With N=5000 we can reconstruct 27% of sessions from the UCD log file in their entirety, and 14% of the Music Machines sessions.

We can generalize this experiment by measuring the fraction of individual sessions that each algorithm can reconstruct. That is, we know that 27% of the UCD sessions can be fully (100%) reconstructed, but presumably many other sessions can be, say, 75% reconstructed. We therefore measured the average fraction of a given session that can be reconstructed for each of the two algorithms. For both Web sites, our experimental results in Figures 6 and 7 demonstrate that the mined path fragments can be used to reconstruct a greater proportion of individual traversals than relying solely on popular paths. Specifically, we measure the coverage of MP and PP for various values of N up to 1000, and find that for each value of N, MP has higher coverage than PP, and the coverage gap grows rapidly as N increases.

Figure 6: Coverage of the MP and PP algorithms for the UCD site. [Plot: coverage, 0% to 30%, against the number of mined paths or fragments N, 0 to 1000; series: Popular, Mined, Difference.]

Figure 7: Coverage of the MP and PP algorithms for the Music Machines site. [Plot: coverage, 0% to 3%, against the number of mined paths or fragments N, 0 to 1000; series: Popular, Mined, Difference.]

5. Related Work

Previous research in Web usage and server log mining addresses two major issues: the pre-processing of the raw data, and the discovery of patterns or rules in the data. Our work relates to both of these areas.

Pre-processing is discussed in detail in [1]. The aim of pre-processing web server log files is to obtain the set of sessions (visits) recorded in the log files. It can be divided into three distinct phases: data cleansing, user/session identification and path completion. Our system implements all of these components of Web usage mining.

The second phase of Web usage mining is that of pattern discovery [1,2,5,6]. Pattern discovery involves the extraction of some meaningful information, such as association rules, classification rules, or sequential patterns. The PP and MP algorithms can be seen as the pattern discovery phase for the Personalized Site Map task.

The construction of improved site maps is discussed in [3]. Li et al. discuss the need for topic-focused site maps that home in on the users' interests and try to display the corresponding section of the map. They also discuss the granularity of the site map, which is the level of detail the map should show. They use the method of extracting logical domains from the web site, where each logical domain is associated with a certain topic. Unlike our system, they use semantic knowledge from the pages' contents.

6. Future Work

Our Personalized Site Map algorithms have been fully implemented. Our current focus involves measuring the effectiveness of our approach. Our experiments have demonstrated that our technique works well, in the sense that we are able to build site maps containing popular (as opposed to merely short) paths.
We believe that users will find popular paths more intuitive than paths that are merely short, but we have not yet established this empirically. We intend to conduct user trials of the system to get users' judgements of the quality of the Personalized Site Maps. For example, one important topic is the generality/specificity of the pages on the paths. Do the pages earlier in the path contain more general information than later pages?

We are also exploring other applications for our path-mining algorithm. At its core, we have developed an approach to predicting which pages are likely to be viewed next, given a prefix of a visitor's trajectory. Therefore, a second potential application concerns using this predictive ability for pre-fetching and caching [8].
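
The idea of predicting the next page from a session prefix can be sketched as follows. This is an illustrative frequency table over the most recent pages, in the spirit of the order-A Markov assumption of Section 3.4; it is not the authors' recommendation procedure, and the function names and toy sessions are hypothetical.

```python
from collections import Counter, defaultdict

def build_predictor(sessions, order=2):
    """Map each run of `order` consecutive pages to a Counter of the
    pages that followed it in past sessions."""
    table = defaultdict(Counter)
    for s in sessions:
        for i in range(len(s) - order):
            table[tuple(s[i:i + order])][s[i + order]] += 1
    return table

def predict_next(table, prefix, order=2):
    """Recommend the page that most often followed the last `order`
    pages of `prefix`, or None when that context was never observed."""
    followers = table.get(tuple(prefix[-order:]))
    if not followers:
        return None
    return followers.most_common(1)[0][0]

# Hypothetical past sessions.
past = [
    ["home", "products", "displays"],
    ["home", "products", "displays"],
    ["home", "products", "store"],
]
table = build_predictor(past)

print(predict_next(table, ["home", "products"]))  # prints: displays
print(predict_next(table, ["support", "faq"]))    # prints: None
```

A pre-fetching component could request the recommended page ahead of time and count a recommendation as correct when the visitor's next request matches it.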

Our preliminary investigation of pre-fetching shows promising results. For the UCD site, we allow the PP algorithm to recommend a likely next page after each session prefix. The entire dataset contains over 236,000 sessions, leading to 832,307 recommendations from PP. Of these recommendations, 43,349 (5.2%) were correct (i.e., the user did indeed visit the recommended page next). We intend to extend this experiment to caching of multiple pages, and to compare our approach to existing page-prediction algorithms.

Another possible direction would be to introduce a collaborative element to the system. We could rate each popular path for a user based on whether or not it appears in that user's sessions. Standard collaborative filtering techniques could then be used to recommend a particular popular path with greater confidence than our current PP algorithm.

7. Conclusions

We have introduced the problem of automatically constructing Personalized Site Maps. The key challenge is to display to the visitor a subgraph that both contains relevant content items and organizes them in a coherent and meaningful manner. Our approach is based on the assumption that the best way to indicate the relationship between a given pair of pages is to show the path between them that has been most popular with past visitors. Based on this assumption, we propose a naïve algorithm (PP) for mining popular paths from raw server logs, and a more sophisticated algorithm (MP) for mining path fragments. The key idea of MP is to mine a collection of path fragments that can be dynamically assembled in order to reconstruct many popular paths. Our experiments with two large Web sites confirm that MP can reconstruct a larger fraction of visitors' sessions than PP.

Acknowledgments. We thank Barry Smyth for helpful discussions. This research was funded by grant N-00014-00-1-0021 from the US Office of Naval Research, and grant 01/F.1/C015 from Science Foundation Ireland.
References

[1] Cooley, R. Web Usage Mining. PhD Thesis, Department of Computer Science, University of Minnesota, 2001.
[2] Gaul, W., Schmidt-Thieme, L. Mining web navigation path fragments. In Proceedings of the Workshop on Web Mining for E-Commerce -- Challenges and Opportunities, Boston, MA, August 2000.
[3] Li, W-S., Ayan, N. F., Kolak, O., Vu, Q. Constructing Multi-Granular and Topic Focused Web Sites. In Proceedings of WWW10-2000.
[4] McGinty, L., Smyth, B. Case Based Route Planning. In Proceedings of the 11th Conference on Artificial Intelligence and Cognitive Science, Galway, Ireland, 2000.
[5] Srikant, R., Agrawal, R. Mining Sequential Patterns: Generalisations and Performance Improvements. In Proceedings of the 5th International Conference Extending Database Systems, 1996.
[6] Srivastava, J., Cooley, R., Deshpande, M., Tan, P.-N. Web Usage Mining: Discovery and Applications of Usage Patterns from Web Data. SIGKDD Explorations, Vol. 1, Issue 2, 2000.
[7] Wexelblat, A., Maes, P. Footprints: History-Rich Tools for Information Foraging. In Proceedings of CHI 99 Conference on Human Factors in Computing, 1999.
[8] Yang, Q., Zhang, H. H., Li, T. Mining Web Logs for Prediction in WWW Caching and Pre-fetching. In Seventh ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (KDD'01), San Francisco, 2001.
[9] Ypma, A., Heskes, T. Categorization of Web Pages and User Clustering with Mixtures of Hidden Markov Models. In Proceedings of the International Workshop on Web Knowledge Discovery and Data Mining (WEBKDD 02), July 23 2002, Edmonton, Canada.