SUGGEST: A Web Usage Mining System


Ranieri Baraglia, Paolo Palmerini†
CNUCE, Istituto del Consiglio Nazionale delle Ricerche (CNR), Pisa, Italy
†also Università Ca' Foscari, Venezia, Italy
(Ranieri.Baraglia, Paolo.Palmerini)@cnuce.cnr.it

Abstract

During their navigation, web users leave many records of their activity. This huge amount of data can be a useful source of knowledge. Sophisticated mining processes are needed for this knowledge to be extracted, understood and used. In this paper we propose a Web Usage Mining (WUM) system, called SUGGEST, designed to efficiently integrate the WUM process with the ordinary web server functionalities. It can provide useful information to make web user navigation easier and to optimize web server performance. Two quantities are introduced in order to give a measure of the quality of our WUM system.

Keywords: Data mining, Web usage mining, User classification, Web personalization, Adaptive web system.

1 Introduction

The problem of extracting knowledge from the huge amount of data left by web users during their navigation is a research task that has gained increasing attention in recent years. Data can be stored in browser caches or in cookies at the client level, and in access log files at the server or proxy level. The analysis of such data can be used to understand users' preferences and behavior, in a process commonly referred to as Web Usage Mining (WUM) [6, 2]. The knowledge extracted can be used for different goals, such as service personalization, site structure simplification, and web server performance improvement.

In the past, several WUM projects have been proposed [11, 7, 8, 5, 9]. The Analog system [11] is structured according to two main components, performed online and offline with respect to the web server activity. Past user activity recorded in server log files is processed to form clusters of user sessions.
The online component builds active user sessions, which are then classified into one of the clusters found by the offline component. The classification makes it possible to identify pages related to the ones in the active session and to return the requested page together with a list of related documents. Analog was one of the first WUM projects. The geometrical approach it uses for clustering is affected by several limitations, related to scalability and to the effectiveness of the results found. Nevertheless, the architectural solution it introduced was maintained in several more recent projects.

In [8] Perkowitz et al. propose Page Gather, a WUM system that builds index pages containing links to pages that are similar to one another. Page Gather finds clusters of pages instead of clusters of sessions. Starting from the user activity sessions, the co-occurrence matrix $M$ is built. The element $M_{ij}$ of $M$ is defined as the conditional probability that page $i$ is visited during a session if page $j$ is visited in the same session. A minimum threshold value for $M_{ij}$ is used to prune uninteresting entries. The directed acyclic graph associated with $M$ is then partitioned by finding the graph's cliques. Finally, cliques are merged to form the clusters. Page Gather's main concern is the creation of index pages. The system has no online component, and the static index pages are kept in a separate Suggestion Section of the site. One important concept introduced in [8] is the hypothesis that users behave coherently during their navigation, i.e. pages within the same session are in general conceptually related. This assumption is called visit coherence. We will show in Section 3 how to use this concept to obtain a measure of quality for a WUM system.

The WebWatcher system [4] is an interface agent for the World Wide Web. It accompanies a user through the pages by suggesting hyperlinks that it believes will be of interest.
The system interacts with the user, who can fill in predefined forms with keywords to specify his or her interests. To suggest a hyperlink, WebWatcher uses a measure of hyperlink quality, which is interpreted as the probability that a user will select that hyperlink. The measure is based both on the keywords specified by a user and associated with each selected hyperlink, and on information coming from the hypertext structure. WebWatcher is implemented as a server and operates much like a proxy.

In [7], clusters of URLs are found using the Association Rule Hypergraph Partitioning technique. The online component of the system finds the cluster that best matches a fixed-width sliding window of the current active session, also taking into account the topology of the site. This component is implemented as a set of CGI scripts that dynamically create customized pages.

In this paper we propose SUGGEST, a WUM system designed to dynamically generate links to pages (suggestions) of potential interest for a user. It was implemented as an extension to the Apache web server. Since the criterion to be used for validating the results obtained by a WUM system is still an open problem, we introduce and discuss a measure of quality for the SUGGEST system, which could be applied more generally to evaluate other WUM systems.

The paper is organized as follows. Section 2 describes the main features of the SUGGEST project. Results of an experimental evaluation are reported in Section 3. Finally, in Section 4 we draw some conclusions along with future directions.

2 The SUGGEST system

The main goal of SUGGEST is to extract useful information from the user access data collected in web server logs. Such information is then exploited to generate suggestions for a user. Like Analog, SUGGEST adopts a two-level architecture, composed of an offline creation of historical knowledge and an online engine that interprets user behavior. Moreover, it exploits an algorithm similar to that used in Page Gather for cluster creation, and introduces an effective online component that automatically classifies active user sessions and personalizes the requested HTML pages on the fly. The personalization is achieved by means of a set of suggestions dynamically generated on the basis of the active user session. Suggestions for users belonging to the same class may be different.
The online component is implemented in such a way that no modification is needed for the local web site (Apache web server), and it can be easily extended to proxy servers. After pre-processing the data recorded in the web server log files, SUGGEST creates clusters of related pages based on users' past activity, and then classifies new users by comparing the pages in their active sessions with the pages inside the clusters created. A set of suggestions is then obtained for each request.

The offline component is run at fixed time intervals, say once a week or once a month, depending on the specific characteristics of the web site. In this phase, the access log file is pre-processed and analyzed in order to first produce user sessions, and then to create clusters of pages which can be considered related according to the users' behavior. The offline component is in turn composed of two phases: pre-processing and clustering.

During the first phase we create user sessions. We begin by removing all the uninteresting entries from the input access log file, which is assumed to be in Common Log Format. Namely, we remove all the non-HTML requests, such as images or CGI scripts. Scans of the entire site coming from robot-like agents are also removed. We used the technique described in [10] to model robot behavior. Then we create user sessions by identifying users with their IP address, and sessions by means of a predefined timeout between two subsequent requests from the same user. Following Catledge and Pitkow [1], we fixed the timeout value at 30 minutes.

The clustering phase finds sets of similar pages starting from the user sessions obtained in the pre-processing phase. We decided to follow the approach proposed in the Page Gather project, but with some modifications. The main difference is in the definition of the co-occurrence matrix $M$. We think that the interest in a page depends on its content and not on the order in which the page is visited during a session.
Therefore we adopt a symmetric co-occurrence matrix, and we define

$$M_{ij} = \frac{N_{ij}}{\max(N_i, N_j)} \qquad (1)$$

where $N_{ij}$ is the number of sessions containing both pages $i$ and $j$, and $N_i$ and $N_j$ are the numbers of sessions containing page $i$ and page $j$, respectively. Dividing by the maximum of the single occurrences of the two pages has the effect of reducing the relative importance of index pages. Such pages are very likely to be visited together with any other page and are nevertheless of little interest as potential suggestions, since they are too obvious.

From the matrix $M$ we then build the undirected graph whose nodes are the pages and whose edges are the non-null elements of $M$. To limit the number of edges in this graph we apply a threshold filter specified by the parameter MinFreq. Elements of $M$ whose value is less than MinFreq correspond to weakly correlated pages and are thus discarded. In order to find groups of strongly correlated pages, we partition the graph by finding its connected components. Pages within the same cluster are ranked according to their occurrence frequency. Moreover, all the clusters whose size is lower than a threshold value MinClusterSize are discarded, because they are considered not significant.

We implemented the online component of SUGGEST as an extension to the Apache web server. Apache provides a mechanism to extend the web server functionalities by means of dynamically loadable modules, which can be used to perform specialized functions, such as custom authentication or dynamic page modification.
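The offline phase described above (sessions built from IP address and a 30-minute timeout, the symmetric co-occurrence matrix of equation (1), MinFreq filtering, connected components, and MinClusterSize pruning) can be illustrated with a minimal Python sketch. It is not the authors' implementation: the function names, the `(ip, timestamp, page)` input format, and the reading of $N_i$ as the number of sessions containing page $i$ are assumptions made for illustration, and log cleaning and robot removal are omitted.

```python
from collections import defaultdict
from itertools import combinations

SESSION_TIMEOUT = 30 * 60  # 30 minutes, in seconds


def build_sessions(requests, timeout=SESSION_TIMEOUT):
    """Group (ip, unix_time, page) records into sessions (sets of pages).

    A user is identified by IP address; a gap longer than `timeout`
    between two consecutive requests of the same user starts a new session.
    """
    by_ip = defaultdict(list)
    for ip, ts, page in requests:
        by_ip[ip].append((ts, page))
    sessions = []
    for hits in by_ip.values():
        hits.sort()
        current, last_ts = set(), None
        for ts, page in hits:
            if last_ts is not None and ts - last_ts > timeout:
                sessions.append(current)
                current = set()
            current.add(page)
            last_ts = ts
        sessions.append(current)
    return sessions


def cooccurrence(sessions):
    """Return (M, N) where M[(i, j)] = N_ij / max(N_i, N_j) for i < j."""
    n_single = defaultdict(int)
    n_pair = defaultdict(int)
    for session in sessions:
        for page in session:
            n_single[page] += 1
        for a, b in combinations(sorted(session), 2):
            n_pair[(a, b)] += 1
    m = {(a, b): n / max(n_single[a], n_single[b])
         for (a, b), n in n_pair.items()}
    return m, dict(n_single)


def cluster_pages(m, n_single, min_freq, min_cluster_size):
    """Connected components of the graph whose edges have M_ij >= min_freq.

    Components smaller than min_cluster_size are dropped; pages inside a
    cluster are ranked by decreasing occurrence frequency.
    """
    adjacency = defaultdict(set)
    for (a, b), weight in m.items():
        if weight >= min_freq:
            adjacency[a].add(b)
            adjacency[b].add(a)
    seen, clusters = set(), []
    for start in adjacency:
        if start in seen:
            continue
        seen.add(start)
        stack, component = [start], []
        while stack:  # iterative depth-first search
            node = stack.pop()
            component.append(node)
            for neighbour in adjacency[node]:
                if neighbour not in seen:
                    seen.add(neighbour)
                    stack.append(neighbour)
        if len(component) >= min_cluster_size:
            component.sort(key=lambda p: (-n_single[p], p))
            clusters.append(component)
    return clusters
```

A typical call chain would be `sessions = build_sessions(records)`, then `m, n = cooccurrence(sessions)`, then `cluster_pages(m, n, min_freq, min_cluster_size)`. Storing only the pairs that actually co-occur keeps the matrix sparse, which matters when the site has many pages.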

As requests arrive at the server, they are recorded in a buffer of active sessions. Each session is identified on the basis of the client IP address and is associated with a timestamp that permits us to determine when the session is closed. In order to classify an active session, we look for the cluster that includes the largest number of pages in that session. Once the cluster is found, we need to determine the pages that will constitute the suggestions. The final suggestions are composed of a static and a dynamic part. The static part is given by the most relevant pages in the cluster, according to the order determined in the offline phase. The dynamic part is obtained from the pages in the same cluster that are most closely related to the pages of the session that determined the classification. This relation is based on the values stored in the matrix $M$. The static and dynamic suggestions are ranked together and returned as a set of interesting pages. It is worth noting that this is a new feature introduced by our system: by means of the dynamic technique just described, users belonging to the same class can receive different sets of suggestions, depending on the pages visited in their active session. The suggestions are implemented by inserting a list of links to the pages found at the end of the requested page. Other modifications can be applied as necessary (e.g. a personalized banner).

3 Experimental evaluation

Measuring the performance of recommendation systems poses more than one problem. It is difficult to characterize the quality of the suggestions obtained and to quantify how useful the system is. We therefore study how our system behaves when its parameters vary, and we introduce two measures of suggestion quality: the coherence of the suggestions and their overlap with the users' actual behavior. We also study whether the SUGGEST system can be used to improve web server performance, by guessing which page a user is most likely to request next and prefetching it accordingly.

The experimental evaluation of SUGGEST was conducted using three public-domain access log files: Berkeley, NASA and USASK, produced by the web servers of the Computer Science Department of Berkeley University, the Kennedy Space Center, and Saskatchewan University, respectively. Data are stored according to the Common Log Format. The characteristics of the datasets we used are given in Table 1.

Table 1. Datasets used in the experiments.
Columns: Dataset, Size (MBytes), Records (thousands), Period (days).
Rows: Berkeley, NASA, USASK (the numeric entries are illegible in the source).

As shown in Figure 1, the percentage of sessions formed by a predefined number of pages quickly decreases when the minimum number of pages in a session increases. Moreover, for all the datasets the average length of a user session is about 3 pages. Since for this value we still have almost half of all the sessions, we choose it as the minimum length for an active session to be classified.

Figure 1. Minimum number of pages in a session.

All evaluation tests were run on a dual-processor SMP 800 MHz Pentium III PC with 512 MBytes of RAM and two SCSI disks with 27 GBytes of total capacity, running the Linux operating system.

Figures 2 and 3 show the number of clusters and the percentage of pages not clustered as a function of the MinFreq parameter.

Figure 2. Number of clusters found.

The number of clusters increases up to MinFreq = 0.5. This is due to the fact that, by deleting entries from $M$, we obtain a less connected graph. When the graph becomes highly disconnected (MinFreq > 0.5), the clusters found are smaller than the MinClusterSize threshold and are thus discarded. Therefore the total number of clusters found does not increase.

Figure 3. Percentage of outliers.

Similar reasoning can be used to describe the behavior of the number of outliers, i.e. the number of pages that do not belong to any cluster and will therefore not contribute to the online classification. From Figure 3 we can observe that, as MinFreq increases, the percentage of outliers also grows. It is worth noting that the three datasets show qualitatively similar behavior.

Once the clusters are created, we are interested in determining whether the pages in a cluster are actually related to one another. Some techniques to evaluate cluster quality were introduced in previous works. Fu et al. [3] verify whether the pages in a cluster are related to the same topic, assuming a priori knowledge of their contents. In Analog [11], the authors verify whether the pages in a cluster are linked according to the web site structure. Since we have no knowledge of either the content or the general structure of the sites that produced the datasets, we evaluate the quality of the clusters produced by the offline phase using the visit coherence index, which quantifies the intrinsic coherence of a session. It measures the percentage of pages inside a user session which belong to the cluster representing that session. As in the Page Gather system, the basic assumption here is that the coherence hypothesis holds for every session.

To evaluate the visit coherence, we split the datasets obtained from the pre-processing phase into two halves, apply the clustering to one half, and measure whether the suggestions generated on the basis of the second half still maintain the expected coherence. To verify whether the coherence hypothesis holds for every session in the second half of the dataset, we define a quantity $\gamma_i$ which measures the fraction of pages of $S_i$ that belong to the representative cluster for that session:

$$\gamma_i = \frac{|\{p \in S_i \mid p \in C_i\}|}{N_i} \qquad (2)$$

where $p$ is a page, $S_i$ is the $i$-th session, $C_i$ is the cluster representing $S_i$, and $N_i$ is the number of pages in the $i$-th session. The average value of $\gamma_i$ over all the $N_S$ sessions contained in the dataset partition under consideration is given by:

$$\Gamma = \frac{\sum_{i=1}^{N_S} \gamma_i}{N_S} \qquad (3)$$

Figure 4 plots $\Gamma$ as a function of MinFreq, in percent. For small values of MinFreq almost all pages in every session belong to the same cluster. This can be considered an experimental confirmation of the visit coherence hypothesis. In this case, due to the large number of pages in the cluster, we limit the suggestions to those with the highest rank.

Figure 4. Coherence of visit.

To measure the quality of the suggestions generated during the online phase, we used a technique similar to that used to evaluate the cluster quality. Sessions found in one half of the dataset are submitted to the online module, which classifies them and generates suggestions. Then, for every session, the fraction $\omega_i$ of pages belonging both to the session and to the corresponding set of suggestions $\mathit{Suggest}_i$ is computed using expression (4):

$$\omega_i = \frac{|\{p \in S_i \mid p \in \mathit{Suggest}_i\}|}{N_i} \qquad (4)$$

The average value of $\omega_i$ is obtained by summing over all the sessions:

$$\Omega = \frac{\sum_{i=1}^{N_S} \omega_i}{N_S} \qquad (5)$$

where $N_S$ is half of the total number of sessions.
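The online classification of Section 2 and the two per-session quality measures (the coherence of equation (2) and the suggestion overlap of equation (4)) can be sketched in a few lines of Python. This is a hedged illustration only: the function names, the `n_static` and `n_dynamic` parameters, and the way the dynamic part is scored (best $M$ value against any page of the session) are assumptions, since the text does not specify how static and dynamic suggestions are ranked together. Clusters are assumed to be lists of pages already ranked by the offline phase, and `m` is assumed to map a sorted page pair to its co-occurrence value.

```python
def classify(session, clusters):
    """Return the cluster sharing the most pages with the session, or None."""
    best, best_hits = None, 0
    for cluster in clusters:
        hits = len(session & set(cluster))
        if hits > best_hits:
            best, best_hits = cluster, hits
    return best


def suggest(session, cluster, m, n_static=2, n_dynamic=2):
    """Static part: top-ranked cluster pages. Dynamic part: the remaining
    cluster pages most strongly co-occurring with pages of the session."""
    static = cluster[:n_static]

    def score(page):
        return max((m.get(tuple(sorted((page, q))), 0.0)
                    for q in session if q != page), default=0.0)

    candidates = [p for p in cluster if p not in static]
    dynamic = sorted(candidates, key=lambda p: (-score(p), p))[:n_dynamic]
    return static + dynamic


def coherence(session, clusters):
    """Fraction of the session's pages inside its representative cluster."""
    cluster = classify(session, clusters)
    if cluster is None:
        return 0.0
    return len(session & set(cluster)) / len(session)


def overlap(session, suggestions):
    """Fraction of the session's pages that also appear in the suggestions."""
    return len(session & set(suggestions)) / len(session)


def average(values):
    """Mean over all sessions, as in equations (3) and (5)."""
    values = list(values)
    return sum(values) / len(values) if values else 0.0
```

For a held-out half of the sessions, the dataset-level coherence would then be `average(coherence(s, clusters) for s in held_out)`, with the overlap average computed analogously from `overlap(s, suggest(s, classify(s, clusters), m))` for the sessions that could be classified.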

Figure 5 shows, in percent, the overlap between the generated suggestions and the session pages. For small values of MinFreq we can say that the SUGGEST system is able to correctly identify user behavior. When MinFreq increases, the number of outliers increases too, and consequently the number of suggestions generated decreases.

Figure 5. Quality of suggestions.

This conclusion leads to the possibility of using SUGGEST also to optimize web server performance. If the web server can forecast which pages a user is most likely to visit in the next requests, it can prefetch them in order to have the pages already available in memory when the request arrives. For this purpose we measure the average number of times that, given a sub-session of length $n$, page $n+1$ of the session is included in the suggestions generated by SUGGEST. This fraction is called $\alpha$, and its behavior as a function of MinFreq is plotted in Figure 6.

Figure 6. Prefetching quality.

For small values of MinFreq we can correctly guess the next page a user is going to request during navigation with a probability of up to 70%. Prefetching can be applied more effectively to proxy servers, where the latency we want to hide is due to a remote server request, and not only to a disk access. The SUGGEST system can also be applied to proxy servers without any change, since the Apache web server can also run in proxy mode.

We tested the overheads introduced by this first implementation of the SUGGEST system using the ab benchmarking tool. In Figure 7 we plot the execution time for a single Apache process to satisfy an HTTP request. We vary the degree of concurrency by submitting an increasing number of requests to the server. The two lines refer to standard Apache and to Apache using the SUGGEST system. As the number of concurrent requests increases, the performance of SUGGEST degrades proportionally, due to the mutually exclusive access to shared memory areas by the Apache processes. Some optimizations to overcome this limitation are still ongoing.

Figure 7. Total execution time.

4 Conclusions and future work

In this paper we have studied the problem of building a Web Usage Mining system. We proposed SUGGEST, a system that classifies requests made to a web server by analyzing the past navigation behavior of users. SUGGEST combines different features of previously proposed WUM systems. The layered architecture of SUGGEST can be used as a reference for further improvements of clustering and classification algorithms. A novel contribution of this work is the introduction of quantities that can be used to evaluate the quality of the suggestions found. Moreover, the original technique adopted for generating suggestions permits a more dynamic personalization with respect to previous systems.

We are currently working on several modifications and general improvements of the system: (a) an experimental evaluation of SUGGEST on a real-world production web

server is needed in order to observe how user navigation can be influenced by the presence of suggestions; (b) applying the system to a proxy server, for example by modifying the cache replacement policies; (c) unifying the offline and the online components in one single online module where clusters are hierarchically built and updated as soon as new requests arrive.

5 Acknowledgements

This research was partially supported by the Fondazione Cassa di Risparmio di Pisa within the project WebDigger: a data mining environment for the web data analysis.

References

[1] L. D. Catledge and J. E. Pitkow. Characterizing browsing strategies in the world-wide web. Computer Networks and ISDN Systems, 27.
[2] O. Etzioni. The world wide web: quagmire or gold mine? Communications of the ACM, 39:65-68, November.
[3] Y. Fu, K. Sandhu, and M.-Y. Shih. Clustering of web users based on access patterns. In KDD '99 Workshop on Web Usage Analysis and User Profiling (WEBKDD '99), August.
[4] T. Joachims, D. Freitag, and T. Mitchell. Webwatcher: A tour guide for the world wide web. Fifteenth International Joint Conference on Artificial Intelligence.
[5] T. Kamdar and A. Joshi. On creating adaptive web servers using weblog mining. Technical Report Tr-CS-00-05, Department of Computer Science and Electrical Engineering, University of Maryland, Baltimore County, November.
[6] R. Kosala and H. Blockeel. Web mining research: a survey. In ACM SIGKDD, pages 1-15, July.
[7] B. Mobasher, R. Cooley, and J. Srivastava. Automatic personalization based on web usage mining. Communications of the ACM, 43(8), August.
[8] M. Perkowitz and O. Etzioni. Adaptive web sites: Conceptual cluster mining. In International Joint Conference on Artificial Intelligence.
[9] D. F. T. Joachims. Webwatcher: A tour guide for the world wide web. In Proceedings of IJCAI97.
[10] P.-N. Tan and V. Kumar. Modeling of web robot navigational patterns. In WEBKDD 2000 Workshop on Web Mining for E-Commerce: Challenges and Opportunities, August.
[11] T. W. Yan, M. Jacobsen, H. Garcia-Molina, and D. Umeshwar. From user access patterns to dynamic hypertext linking. Fifth International World Wide Web Conference, May 1996.
