Web User Session Clustering Using Modified K-Means Algorithm


G. Poornalatha¹ and Prakash S. Raghavendra²
Department of Information Technology, National Institute of Technology Karnataka (NITK), Surathkal, Mangalore, India
¹ poornalathag@yahoo.com, ² srp@nitk.ac.in

A. Abraham et al. (Eds.): ACC 2011, Part II, CCIS 191. Springer-Verlag Berlin Heidelberg 2011

Abstract. The proliferation of the internet and the attractiveness of the web in recent years have made web mining a research area of great importance. Web mining offers many advantages, which makes it attractive to researchers. The analysis of web users' navigational patterns within a web site can provide useful information for applications such as server performance enhancement, web site restructuring, and direct marketing in e-commerce. Navigation paths may be explored using a similarity criterion in order to draw useful inferences about web usage. The objective of this paper is to propose an effective clustering technique that groups user sessions by modifying the K-means algorithm, and to suggest a method to compute the distance between sessions based on the similarity of their web access paths, which handles user sessions of variable length.

Keywords: web mining, clustering, K-means, Jaccard index.

1 Introduction

The present generation lives in an information era. The evolution of the internet and the popularity of the web have enabled even an ordinary person to use the information available at their fingertips for various purposes. The web has been adopted as a critical communication and information medium by a majority of the population. Because of the rapid growth in web use, manually analyzing, understanding, and producing useful information from the vast quantity of data available on the web is complicated and time-consuming. Thus, techniques are required to extract the valuable information hidden in web data so as to improve web performance.

This paper focuses on clustering web user sessions based on their navigation paths, which are of variable length. Clustering groups user sessions such that usage patterns within a single cluster are similar, while sessions in different clusters are dissimilar. The knowledge discovered from the clustering may be used to analyze how users use the web site, to recommend restructuring of the web site, to pre-fetch or cache pages, and to predict the next page

visited by the user in order to reduce latency. As a result, understanding users' navigation patterns on a web site is important both for the browser, which can pre-fetch pages, and for the web site designer, who decides how to redesign the site.

A number of clustering approaches have been proposed in the literature. For example, Federico et al. [1] present a survey of developments in web usage mining, covering techniques such as association rules, clustering, and sequence patterns. Yunjuan et al. [2] suggest that the focus of web usage mining should shift from single user sessions to groups of user sessions, and apply clustering to identify such groups of similar sessions. They introduce a clustering technique that uses a belief function based on Dempster-Shafer theory. Chaofeng Li et al. [3] presented an algorithm for clustering web sessions based on increase of similarities. Here the number of clusters is defined according to knowledge of the application field, and ROCK is used to decide the initial point for each cluster. Dariusz Krol et al. [4] investigated the behavior of internet system users using cluster analysis. Sessions are represented as vectors in which each dimension corresponds to a web page and stores the user's interest in that page during the session. The sessions are clustered using the Hard C-Means algorithm. Yongjian Fu et al. [5] proposed a generalization-based clustering method that employs attribute-oriented induction to reduce the large dimensionality of the data. Prakash S. Raghavendra et al. [6] modeled user behavior as a vector of the time spent at each URL. The cosine between vectors is taken as the similarity/distance measure instead of the Euclidean distance, and the standard K-means algorithm is modified accordingly. Jin-Hua Xu et al. [7] presented a vector analysis and K-means based algorithm for mining user clusters.
In the web usage domain, there are two kinds of interesting clusters to be discovered: usage clusters and page clusters. In both applications, permanent or dynamic HTML pages can be created that suggest related hyperlinks to the user according to the user's query or past history of information [8]. George Pallis et al. [9] assessed the quality of user session clusters in order to make inferences about users' navigation behavior.

Studies have shown that the most commonly used partitioning-based clustering algorithm is K-means, which is well suited to large data sets. K-means clustering is a method of cluster analysis that aims to partition n observations into k clusters such that each observation belongs to the cluster with the nearest mean. Euclidean distance is generally used as the metric. The main advantages of this algorithm are its simplicity and speed, which allow it to run on large data sets. Its disadvantage is that it does not yield the same result on each run, since the resulting clusters depend on the initial random assignments.

In this paper, a method is proposed to compare variable-length sessions, and the basic K-means algorithm is modified so that the initial centroid assignments have less impact on the resulting clusters. The Jaccard index is used to analyze the goodness of the clusters obtained, whereas [9] uses the chi-square test to validate clusters obtained with the EM algorithm. The main contribution of this paper is an improved way of comparing user sessions represented as vectors, which are inherently of variable length, and employing the Jaccard index to analyze the

effectiveness of the clustering on two standard sets of web server logs. The results obtained by the proposed technique are encouraging.

The rest of the paper is organized as follows. Section 2 describes the proposed clustering method in detail. Section 3 discusses the results, followed by the conclusion in Section 4.

2 Clustering

2.1 Modified K-Means

The basic K-means algorithm selects the initial cluster centroids randomly and, in each iteration, replaces each centroid with the average value of the points in its cluster. In the modified K-means algorithm, the old cluster centroid is instead updated by an amount delta, where delta is the average distance between the centroid and the members of its cluster. That is, instead of assigning a new point as the centroid, the existing centroid is moved by delta. This makes it possible to use K-means for web session clustering, since web sessions are vectors and not scalar data points. The modified algorithm is given in Algorithm Modified K-means.

Table 1 presents a numerical example that compares the basic and modified K-means algorithms on the data set D = {11, 22, 18, 15, 25, 36, 27, 8, 39, 10}. According to this empirical study, the modified K-means algorithm is better than the basic K-means algorithm in terms of the number of iterations taken to converge and the quality of the clusters formed, irrespective of the initial centroids selected.

Table 1. Comparison of basic and modified K-means

No.  Initial centroids      Basic K-means                          Modified K-means
                            Clusters                  Iterations   Clusters                Iterations
1    m1=8,  m2=18, m3=36    c1 = 11,15,8,10           5            c1 = 11,18,15,8,10      3
                            c2 = 22,18,25,27                       c2 = 22,25,27
                            c3 = 36,39                             c3 = 36,39
2    m1=11, m2=22, m3=18    c1 = 11,15,8,10                        c1 = 11,15,8,10
                            c2 = 25,36,27,39                       c2 = 36,39
                            c3 = 22,18                             c3 = 22,18,25,27
3    m1=27, m2=8,  m3=10    c1 = 22,18,25,36,27,39                 c1 = 36,39
                            c2 = 8                                 c2 = 11,15,8,10
                            c3 = 11,15,10                          c3 = 22,18,25,27

Note: the source also lists the iteration counts 4, 20, 5 and 6 for rows 2 and 3, but their assignment to the two Iterations columns is not recoverable from the extracted text.

Algorithm: Modified K-means
Input: a set of data D = {d_1, d_2, ..., d_n}, the desired number of clusters k
Output: a set of clusters C = {c_1, c_2, ..., c_k} of D
Method:
    Select any k data points {d_1, d_2, ..., d_k} from D and set m_i = d_i to get M = {m_1, m_2, ..., m_k}, where 0 < i < k+1
    newC = empty, newM = empty
    Repeat
        for each d_i, compute Dist = {dist_1, dist_2, ..., dist_k} where dist_j = |d_i − m_j|, 0 < j < k+1, 0 < i < n+1
        assign d_i to c_j where dist_j = min(Dist), 0 < j < k+1
        for each c_j, delta_j = sum(distances of each d_i in c_j) / number of data points in c_j
        newM = {m_1 + delta_1, m_2 + delta_2, ..., m_k + delta_k}
        if (C == newC) or (M == newM) break
        copy C into newC and newM into M
    until false

2.2 Modified K-Means for Web Session Clustering

In general, web user sessions are not simple data points but n-dimensional vectors. Suppose a user visits pages P1, P2, P7 of a web site in sequence; the session is then represented as the vector s = {P1, P2, P7}. Before clustering web user sessions, Algorithm Modified K-means is adapted as given in Algorithm Modified K-means for Web Session Clustering. To find the dissimilarity between any two sessions s_i and s_j, we propose a function that computes the variable length vector distance (VLVD) between them, as given in Function VLVD.

Algorithm: Modified K-Means for Web Session Clustering
Input: a set of web user sessions WS = {s_1, s_2, ..., s_n}, the desired number of clusters k
Output: a set of clusters C = {c_1, c_2, ..., c_k} of WS
Method:
    Select any k sessions {s_1, s_2, ..., s_k} from WS and set m_i = s_i to get M = {m_1, m_2, ..., m_k}, where 0 < i < k+1
    newC = empty, newM = empty
    Repeat
        for each s_i, compute Dist = {dist_1, dist_2, ..., dist_k} where dist_j = VLVD(s_i, m_j), 0 < j < k+1, 0 < i < n+1
        assign s_i to c_j where dist_j = min(Dist), 0 < j < k+1
        for each c_j, delta_j = sum(distances of each s_i in c_j) / number of sessions in c_j
        newM = {m_1 + delta_1, m_2 + delta_2, ..., m_k + delta_k}
        if (C == newC) or (M == newM) break
        copy C into newC and newM into M
    until false

Function: VLVD
Input: two web user sessions s_i and s_j
Output: distance d between s_i and s_j
Method:
    Set l_1 = |s_i|, where |s_i| is the length of the session s_i
    Set l_2 = |s_j|, where |s_j| is the length of the session s_j
    Set C = |s_i ∩ s_j|
    Set dist = l_1 + l_2 − 2C
    Set len = l_1 + l_2
    d = dist / len
    return d
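The VLVD function and the scalar form of the modified K-means update can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: sessions are treated as sets of distinct pages, the distance between a data point and a centroid is taken as the absolute difference, ties are assigned to the lowest cluster index, the loop stops when the cluster assignments repeat, and `max_iter` is a hypothetical safety cap. The sessions in the demo are the ones used in the worked VLVD example in this section.

```python
from typing import Iterable, List, Sequence, Tuple


def vlvd(s_i: Iterable[str], s_j: Iterable[str]) -> float:
    """Variable length vector distance (VLVD) between two sessions.

    Assumption: a session is treated as a set of distinct pages, so its
    length is the number of distinct pages it contains.
    """
    a, b = set(s_i), set(s_j)
    if not a and not b:
        return 0.0  # assumption: two empty sessions count as identical
    common = len(a & b)                  # C = |s_i ∩ s_j|
    dist = len(a) + len(b) - 2 * common  # pages that are not shared
    return dist / (len(a) + len(b))      # d lies in [0, 1]


def modified_kmeans_1d(
    data: Sequence[float],
    init_centroids: Sequence[float],
    max_iter: int = 100,  # hypothetical safety cap, not part of the paper
) -> Tuple[List[List[float]], List[float], int]:
    """Modified K-means on scalar data, under the assumptions stated above."""
    centroids = list(init_centroids)
    prev_clusters = None
    clusters: List[List[float]] = []
    for iteration in range(1, max_iter + 1):
        clusters = [[] for _ in centroids]
        dist_sums = [0.0] * len(centroids)
        for x in data:
            dists = [abs(x - m) for m in centroids]
            j = dists.index(min(dists))  # ties go to the lowest cluster index
            clusters[j].append(x)
            dist_sums[j] += dists[j]
        if clusters == prev_clusters:
            return clusters, centroids, iteration  # assignments are stable
        # Move each centroid by delta, the mean member-to-centroid distance.
        centroids = [
            m + (s / len(c) if c else 0.0)
            for m, s, c in zip(centroids, dist_sums, clusters)
        ]
        prev_clusters = clusters
    return clusters, centroids, max_iter


if __name__ == "__main__":
    s1 = ["P1", "P2", "P3", "P4", "P5"]
    others = [("S2", ["P4", "P5"]), ("S3", ["P1", "P2", "P5"]),
              ("S4", ["P6", "P7"]), ("S5", s1)]
    for name, s in others:
        print(f"VLVD(S1, {name}) = {vlvd(s1, s):.2f}")
    data = [11, 22, 18, 15, 25, 36, 27, 8, 39, 10]
    clusters, _, _ = modified_kmeans_1d(data, [8, 18, 36])
    print(clusters)
```

For the example sessions, `vlvd` gives 3/7, 0.25, 1.0 and 0.0 for S2 to S5 respectively. With the initial centroids 8, 18 and 36, the scalar sketch ends with the clusters {11, 18, 15, 8, 10}, {22, 25, 27} and {36, 39}, which matches the modified K-means entry in the first row of Table 1. It does not necessarily reproduce the other rows or the iteration counts, since the paper leaves details such as tie-breaking and the exact stopping test open.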

The majority of the algorithms discussed by researchers represent each web session as a binary vector of length n, where n is the number of pages in the web site. Since the issue of variable-length session vectors is not addressed efficiently by the majority of researchers, the function VLVD(s_i, s_j) is designed to handle variable-length session vectors when computing the distance, or dissimilarity, between any two sessions. The VLVD function counts the pages that differ between two sessions, similar to the Hamming distance. The Hamming distance requires the two vectors to be of the same length; the VLVD function removes this restriction. The value of d lies between 0 and 1. The value 1 indicates that the two sessions are completely different, whereas 0 indicates that they are identical. Consider an example data set with 5 sessions to illustrate the VLVD function.

Example:
    S1: P1 P2 P3 P4 P5
    S2: P4 P5
    S3: P1 P2 P5
    S4: P6 P7
    S5: P1 P2 P3 P4 P5

    VLVD(S1, S2) = (5 + 2 − 2·2) / (5 + 2) = 3/7 ≈ 0.43
    VLVD(S1, S3) = (5 + 3 − 2·3) / (5 + 3) = 2/8 = 0.25
    VLVD(S1, S4) = (5 + 2 − 2·0) / (5 + 2) = 1.0
    VLVD(S1, S5) = (5 + 5 − 2·5) / (5 + 5) = 0.0

The example shows that sessions S1 and S5 are identical, whereas S1 and S4 are entirely different. S3 is closer to S1 than S2 is. Thus it is possible to measure the distance between sessions even though they are not vectors of equal length.

3 Results and Discussions

To implement the modified K-means with the VLVD function, two data sets are considered. The first set is the NASA log taken from the NASA Kennedy Space Center WWW server in Florida, which consists of approximately 1,000,000+ entries. The log contains the data collected from 00:00:00 July 1, 1995 through 23:59:59 July 31, 1995, a total of 31 days. The data is preprocessed and, based on domain knowledge obtained after constructing distinct user requests, 30 categories of pages are formulated.
The second set is the MSNBC data set taken from msnbc.com, which gives the page visits of users who visited msnbc.com on September 28. Visits are recorded at the level of URL category and in time order; therefore, preprocessing was not required for this data set. Table 2 summarizes the details of the two data sets, and the page categories of the two data sets are described in Tables 3 and 4 respectively.

Table 2. Dataset

Data set   Time period             File size    Number of sessions considered   Number of page categories
NASA       1/7/1995 to 31/7/1995   …,532 KB     …                               30
MSNBC      28/9/…                  …,287 KB     …                               17

Table 3. Web page categories NASA data set

P1   /elv/                P11  /icon/           P21  /shuttle/countdown/
P2   /facilities/         P12  /images/         P22  /shuttle/movies/
P3   /shuttle/mission/    P13  /logistics/      P23  /software/
P4   /downs/              P14  /mdss/           P24  /statistics/
P5   /base-ops/           P15  /msfc/           P25  /history/apollo/
P6   /bio-med/            P16  /news/           P26  /history/gemini/
P7   /facts/              P17  /pao/            P27  /history/mercury/
P8   /finance/            P18  /payloads/       P28  /shuttle/
P9   /history/            P19  /persons/        P29  /shuttle/resources/
P10  /htbin/              P20  /procurement/    P30  /shuttle/technology/

Table 4. Web page categories MSNBC data set

P1  Front page    P7   Misc        P13  Msn-sports
P2  News          P8   Weather     P14  Sports
P3  Tech          P9   Msn-news    P15  Summary
P4  Local         P10  Health      P16  Bbs
P5  Opinion       P11  Living      P17  Travel
P6  On-air        P12  Business

3.1 Analysis of Clusters NASA Data Set

Fig. 1 shows the frequency of access to various page categories in various clusters of NASA data. /history/apollo/ and /shuttle/missions/ categories are viewed more


frequently in cluster 1 than the other categories, while cluster 2 concentrates on the /shuttle/missions/ category most of the time. In cluster 3 the category /elv/ is viewed the majority of the time, and 50% of the frequency is to the /shuttle/missions/ category. The users in cluster 4 are more interested in the /shuttle/missions/ and /history/apollo/ categories. As in cluster 4, the frequency is higher for the categories /shuttle/missions/ and /history/apollo/ in cluster 5, together with the category /shuttle/countdown/, whereas cluster 4 users are not interested in /shuttle/countdown/: the frequency for this category is zero in cluster 4. The categories of clusters 1 and 4 may look similar, but the usage patterns of these two clusters differ: in cluster 1, /history/apollo/ is viewed more than /shuttle/missions/, whereas the reverse holds in cluster 4. In cluster 4 around 40% of the frequency is to /history/apollo/. Overall, the most frequently visited category on this web site is /shuttle/missions/. Thus, the clusters formed show different patterns of usage in combination with the /shuttle/missions/ category.

Fig. 1. Normalized frequency of web page categories (NASA dataset)
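The per-cluster category frequencies discussed above could be computed along the following lines. The paper does not state how the frequencies in Fig. 1 are normalized, so this sketch assumes the simplest choice: the number of visits to a category within a cluster divided by the total number of page visits in that cluster. The function name and the toy cluster are hypothetical and serve only as illustration.

```python
from collections import Counter
from typing import Dict, Iterable


def normalized_category_frequency(
    cluster_sessions: Iterable[Iterable[str]],
) -> Dict[str, float]:
    """Share of page visits per category within one cluster.

    Assumption: frequency = visits to the category / all visits in the cluster.
    """
    counts = Counter(
        page for session in cluster_sessions for page in session
    )
    total = sum(counts.values())
    if total == 0:
        return {}  # an empty cluster has no frequencies to report
    return {page: n / total for page, n in counts.items()}


if __name__ == "__main__":
    # Hypothetical toy cluster, not taken from the NASA log.
    toy_cluster = [
        ["/elv/", "/shuttle/missions/"],
        ["/shuttle/missions/"],
        ["/shuttle/missions/", "/history/apollo/"],
    ]
    for page, share in sorted(normalized_category_frequency(toy_cluster).items()):
        print(f"{page}: {share:.2f}")
```

Under this normalization the shares within a cluster sum to 1, which makes clusters of different sizes directly comparable.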


3.2 Analysis of Clusters MSNBC Data Set

Fig. 2 shows the frequency of access to the various page categories in the clusters of the MSNBC data. In cluster 1, more than 60% of the time the request is to the misc and on-air categories, and 40% of the time to the weather and sports categories, which shows that users of this cluster are more interested in these categories. In cluster 2, users visit the front page followed by the news and local categories the majority of the time, indicating their interest in local information and news. The users in cluster 3 do not favor any specific category: they visit the front page and only briefly visit other categories. Cluster 4 clearly shows that more than 50% of the visits are to the misc category. In contrast, users in cluster 5 are more interested in opinion, and subsequently in the on-air and summary categories.

Fig. 2. Normalised frequency of web page categories (MSNBC dataset)

3.3 Analysis of the Clusters Formed by the Proposed Method

The graphs in Figs. 1 and 2 clearly indicate the patterns obtained by the proposed method for the two data sets. The Jaccard index, also known as the Jaccard similarity coefficient, is a statistic used for comparing the similarity and diversity of sample sets. The Jaccard index between two sample sets A and B is computed as:

    Jac(A, B) = |A ∩ B| / |A ∪ B|    (1)

If Jac(A, B) is equal to 1, the samples A and B are identical. In our example, (1) is used to compare the five clusters formed for the NASA data set, and the average value for each cluster is less than 0.3, as shown in Table 5. This indicates that the clusters obtained overlap little and hence are well separated from one another. Thus, it can be inferred that the clustering is reasonably good. A similar analysis could be done on the clusters of the MSNBC data set if data on the actual pages of the site in each category were available along with the main page categories. Because these details are unavailable, the Jaccard index is not applied to the clusters obtained for the MSNBC data set. However, the analysis of the NASA data set indicates the goodness of the proposed clustering method.

Table 5. Jaccard index for NASA data set

                        cluster 1   cluster 2   cluster 3   cluster 4   cluster 5
Average Jaccard index   …           …           …           …           …

4 Conclusion

With the explosive growth of web-based applications, there is significant interest in analyzing web usage data in order to understand users' web page navigation and to apply the resulting knowledge to better serve their needs. This paper presents a modified K-means algorithm together with the VLVD function, which computes the distance between user sessions and handles sessions of uneven length. As future work, it is planned to test this method on a larger number of user sessions and a larger number of clusters. The clusters obtained by the proposed method could also be used to develop a recommender system and to design a web page prediction system that helps reduce web page latency for the user. This would also help the web site administrator to reorganize the web site accordingly.

References

1. Facca, F.M., Lanzi, P.L.: Mining interesting knowledge from web logs: a survey. Journal of Data and Knowledge Engineering 53 (2005)
2. Xie, Y., Phoha, V.V.: Web User clustering from Access Log Using Belief function. In: Proceedings of the First International Conference On Knowledge Capture (K-CAP 2001). ACM Press, New York (2001)
3. Li, C.: Algorithm of Web Session Clustering Based on Increase of Similarities. In: Proceedings of International Conference on Information Management, Innovation Management and Industrial Engineering. IEEE, Los Alamitos (2008)
4. Krol, D., Scigajlo, M., Trawinski, B.: Investigation of Internet System User Behavior Using Cluster Analysis. In: Proceedings of the Seventh International Conference on Machine Learning and Cybernetics. IEEE, Los Alamitos (2008)
5. Fu, Y., Sandhu, K., Shih, M.-Y.: Clustering of Web Users Based on Access Patterns. In: KDD workshop on Web Mining, San Diego, CA (1999)
6. Raghavendra, P.S., Chowdhury, S.R., Kameswari, S.V.: Comparative Study of Neural Networks and K-Means Classification in Web Usage Mining. In: Proceedings of 5th IEEE International Conference for Internet Technology and Secured Transaction (ICITST). IEEE, Los Alamitos (2010)
7. Xu, J.-H., Liu, H.: Web User Clustering Analysis based on KMeans Algorithm. In: Proceedings of 2010 International Conference on Information, Networking and Automation (ICINA), pp. V26 V29. IEEE, Los Alamitos (2010)
8. Srivastava, J., Cooley, R., Deshpande, M.: Web Usage Mining: Discovery and Applications of Usage Patterns from Web Data. In: ACM SIGKDD, vol. 1 (2000)
9. Pallis, G., Angelis, L., Vakali, A.: Validation and interpretation of Web users' sessions clusters. Journal of Information Processing & Management 43 (2007)
