CHAPTER-6 WEB USAGE MINING USING CLUSTERING


6.1 Related Work in Clustering Technique
6.2 Quantifiable Analysis of Distance Measurement Techniques
6.3 Approaches to Formation of Clusters
6.4 Conclusion

Prediction Model for Web Caching and Prefetching with Web Usage Mining to optimize web objects

This chapter deals in detail with the clustering technique of web usage mining for pattern discovery. Cluster analysis is not a single algorithm; it comprises many different algorithms for pattern discovery that can be chosen according to the application and the targeted results. The first section of the chapter presents a literature survey of clustering techniques in general. The second section provides a quantifiable analysis of distance measurement techniques, and the remaining sections deal with the new approach to cluster formation and the associated analysis.

6.1 Related Work in Clustering Technique
Clustering is an unsupervised data mining technique that divides data among different groups for the pattern discovery task. It is also known as exploratory data analysis, since no labeled data are available [5]. The ultimate goal of clustering is to separate a finite set of unlabeled data into a discrete and finite set of useful, valid and hidden groups. A cluster is defined by internal homogeneity and external separation [87]. The clustering procedure is applied to preprocessed data consisting of only selected attributes. After preprocessing, an appropriate algorithm is selected or designed to achieve the targeted results. Finally, result interpretation is performed to provide meaningful insights to the end user. In [98] several characteristics of clusters are described which can be considered for better cluster formation. Clustering is very useful in several applications of data mining, document retrieval, image segmentation and pattern classification, for the purposes of pattern analysis, grouping, decision making and machine learning. Clustering has been studied extensively in the literature in the form of information retrieval and text mining [32,102], while very little work has been done for web analysis.
There are three main elements in cluster analysis: (1) effective similarity measures, (2) criterion functions and (3) algorithms.

Similarity Measures
Similarity is expressed as the distance between two objects. Different domains such as web analysis, information retrieval, recommendation systems and social network analysis require efficient techniques for measuring similarity among diverse objects. Various similarity measures have been proposed in the literature, falling into two main categories: (I) content based and (II) link based. The content based methods [118,46,73,95] evaluate similarity among objects such as web pages, persons and multimedia objects. The link based similarity measures [6,93] are useful for determining the similarity of the links of two web objects for search engine purposes. Link based similarity measures are out of the scope of the proposed research, since they are vital for the search engine aspect rather than for web usage mining. The most popular and commonly used distance metric studied in the literature is the Euclidean distance [1,68,53]. It is the ordinary distance between two points that is

measured with a ruler and is derived from the Pythagorean formula. The distance between two data points in the plane with coordinates (p, q) and (r, s) is given by:

DIST((p, q), (r, s)) = sqrt((p - r)^2 + (q - s)^2)

The usefulness of the Euclidean distance depends on the circumstances. For points in a plane it provides fairly good results, but with slow speed. It is also not useful for string distance measurement. The Euclidean distance can be extended to any number of dimensions. Another popular distance metric is the Manhattan distance, which computes the distance from one data point to another when a grid-like path is followed. The Manhattan distance between two items is the sum of the absolute differences of their corresponding elements [103]. The distance between two data points P = (p1, p2, ..., pn) and Q = (q1, q2, ..., qn) is given by:

DIST(P, Q) = Σ (i = 1 to n) |pi - qi|

It is the summation of the horizontal and vertical components, whereas a diagonal distance would be computed using the Pythagorean formula. It is closely related to the Euclidean distance and exhibits similar characteristics. It is useful in gaming applications such as chess to determine movement distance from one square to another. Minkowski is another well-known distance measurement technique that can be considered a generalization of both the Euclidean and Manhattan distances. Several studies [4,39,43] have been carried out based on the Minkowski distance to determine similarity among objects. The Minkowski distance of order p between two points P = (x1, x2, ..., xn) and Q = (y1, y2, ..., yn) in R^n is defined as:

DIST(P, Q) = ( Σ (i = 1 to n) |xi - yi|^p )^(1/p)

If the value of p equals one, the Minkowski distance is known as the Manhattan distance, while for p equal to two it becomes the Euclidean distance.
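As an illustration, the three metrics above can be sketched in Python; the function names are ours, chosen for this sketch, and the Minkowski form subsumes the other two:

```python
def minkowski(p, q, order):
    """Minkowski distance of the given order between two equal-length points."""
    return sum(abs(a - b) ** order for a, b in zip(p, q)) ** (1.0 / order)

def euclidean(p, q):
    # Order 2: the ordinary straight-line (Pythagorean) distance.
    return minkowski(p, q, 2)

def manhattan(p, q):
    # Order 1: the grid-path distance, a sum of absolute coordinate differences.
    return minkowski(p, q, 1)
```

For the points (0, 0) and (3, 4), the Euclidean distance is 5 while the Manhattan distance is 7, showing how the order p changes the metric.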
All the distances discussed so far are ordinary distances that could be measured with a ruler; they are not suitable for measuring the similarity of two strings and are therefore not appropriate in the context of the proposed research, since a web session consists of a number of strings in the form of URLs. The Hamming distance is a popular similarity measure for strings that determines similarity by counting the number of positions at which the corresponding characters differ. More formally, the Hamming distance between two strings P and Q of length n is:

DIST(P, Q) = Σ (i = 1 to n) [pi ≠ qi]

Hamming distance theory is used widely in several applications such as the quantification of information, the study of the properties of codes and secure communication. The main limitation of the Hamming distance is that it is only applicable to strings of the same length. It is widely used in the field of error-free communication [2,112] because the byte length remains the same for both parties involved in the communication.
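A minimal sketch of the position-wise comparison behind the Hamming distance (our own illustrative helper):

```python
def hamming(p, q):
    """Count the positions at which the corresponding characters differ."""
    if len(p) != len(q):
        # The Hamming distance is defined only for equal-length strings.
        raise ValueError("Hamming distance requires strings of equal length")
    return sum(a != b for a, b in zip(p, q))
```

The explicit length check reflects the limitation noted above: unequal-length web sessions cannot be compared this way.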

The Hamming distance is not very useful in the context of web usage mining, since the lengths of web sessions are not all the same. The Levenshtein distance, or edit distance, is a more sophisticated distance measurement technique for string similarity. It is the key distance in several fields such as optical character recognition, text processing, computational biology, fraud detection and cryptography, and has been studied extensively by many authors [79,42,9]. The Levenshtein distance between two strings S1 and S2 is given by Lev(|S1|, |S2|), where:

Lev(i, j) = max(i, j) if min(i, j) = 0, otherwise
Lev(i, j) = min( Lev(i-1, j) + 1, Lev(i, j-1) + 1, Lev(i-1, j-1) + [S1i ≠ S2j] )

The Levenshtein distance measurement technique is well suited to web sessions, since it is applicable to strings of unequal length. Several bioinformatics distance measurement techniques that are used to align protein or nucleotide sequences can also be applied from a web mining perspective to cluster web sessions of unequal length. One of the most important techniques of this category was invented by Saul B. Needleman and Christian D. Wunsch [80] to align protein sequences. This technique uses dynamic programming, which means solving a complex problem by breaking it down into simpler subproblems. It is a global alignment technique for which closely related sequences of similar length are most appropriate. Alignment is performed from the beginning to the end of the sequence to find the best possible alignment. The technique uses a scoring system: a positive or higher value is assigned for a match and a negative or lower value is assigned for a mismatch. It uses gap penalties to maximize the meaning of the sequence; the gap penalty is subtracted for each gap that has been introduced. There are two main types of gap penalties: open and extension.
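Before moving on to alignment scoring, the Levenshtein recurrence above translates directly into a dynamic programming table; a minimal illustrative sketch:

```python
def levenshtein(s1, s2):
    """Edit distance via dynamic programming (insert/delete/substitute, cost 1 each)."""
    m, n = len(s1), len(s2)
    # dist[i][j] = edit distance between the prefixes s1[:i] and s2[:j]
    dist = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(m + 1):
        dist[i][0] = i          # base case: delete all of s1[:i]
    for j in range(n + 1):
        dist[0][j] = j          # base case: insert all of s2[:j]
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            cost = 0 if s1[i - 1] == s2[j - 1] else 1
            dist[i][j] = min(dist[i - 1][j] + 1,         # deletion
                             dist[i][j - 1] + 1,         # insertion
                             dist[i - 1][j - 1] + cost)  # substitution
    return dist[m][n]
```

Unlike the Hamming helper, this function accepts strings of unequal length, which is what makes the edit distance a candidate for web sessions.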
The open penalty is applied at the start of a gap, and the gaps following it are given a gap extension penalty, which is smaller than the open penalty. Typical values are 12 for gap opening and 4 for gap extension. According to the Needleman-Wunsch algorithm, an initial matrix is created with dimension N * M, where N, the number of rows, equals the number of characters of the first string plus one, and M, the number of columns, equals the number of characters of the second string plus one. The extra row and column are used to align with a gap. After that, a scoring scheme is introduced, which can be user defined with specific scores. The simple basic scoring scheme is: if the two sequences at the i-th and j-th positions are the same, the matching score is 1 (S(i, j) = 1); if the two sequences at the i-th and j-th positions are not the same, the mismatch

score is taken as -1 (S(i, j) = -1). The gap penalty is also taken as -1. Whenever an operation such as insertion or deletion is performed, the dynamic programming matrix is built in three steps:

1. Initialization Phase: In the initialization phase the gap score is added cumulatively along the first row and the first column.
2. Matrix Filling Phase: This is the most crucial phase; matrix filling starts from the upper left-hand corner of the matrix. To find the maximum score of each cell, the diagonal, left and upper scores of the current position must be known. The match or mismatch score is added to the diagonal value, the gap penalty is added to the left and upper values, and the maximum of these three values is placed at position (i, j). The equation to calculate the maximum score is:

M(i, j) = max[ M(i-1, j-1) + S(i, j), M(i, j-1) + W, M(i-1, j) + W ]

where i and j denote the row and column, M(i, j) is the matrix value of the required cell, S(i, j) is the score of the required cell, and W is the gap penalty.
3. Alignment through Trace Back: The final step of the Needleman-Wunsch algorithm is the trace back for the best possible alignment, which can be identified using the maximum alignment score.

The Needleman-Wunsch distance measurement technique is a strong candidate for string similarity, so it is also considered in the proposed research context. Smith-Waterman is another important bioinformatics technique for aligning strings. It compares segments of all possible lengths and optimizes the measure of similarity. Temple F. Smith and Michael S. Waterman [97] were the founders of this technique. The main difference in comparison with Needleman-Wunsch is that negative scoring matrix cells are set to zero, which makes local alignment visible.
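The three-phase Needleman-Wunsch procedure described above can be sketched as follows; the scoring values mirror the simple scheme in the text (match +1, mismatch -1, gap -1), and only the alignment score is returned, without the trace back:

```python
def needleman_wunsch(s1, s2, match=1, mismatch=-1, gap=-1):
    """Global alignment score: M(i,j) = max(diag + S, left + W, up + W)."""
    m, n = len(s1), len(s2)
    M = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(m + 1):
        M[i][0] = i * gap      # initialization: cumulative gap scores, first column
    for j in range(n + 1):
        M[0][j] = j * gap      # initialization: cumulative gap scores, first row
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            s = match if s1[i - 1] == s2[j - 1] else mismatch
            M[i][j] = max(M[i - 1][j - 1] + s,   # diagonal: match/mismatch
                          M[i][j - 1] + gap,     # left: gap in s1
                          M[i - 1][j] + gap)     # up: gap in s2
    return M[m][n]
```

Identical strings score their full length (e.g. "GATTACA" against itself scores 7), while length differences are penalized through the leading and trailing gaps, which is why global alignment is sensitive to session length.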
This technique compares segments of varying lengths instead of looking at the entire sequence at once. The main advantages of the Smith-Waterman technique are: it gives the conserved regions between the two sequences; it can align partially overlapping sequences; and it can align a subsequence of a sequence to itself. Like Needleman-Wunsch, this technique uses a scoring matrix system. For the scoring system and gap analysis, the same concepts used in Needleman-Wunsch are applicable in Smith-Waterman. It also uses the same steps of initialization, matrix filling and alignment through trace

back. The equation to calculate the maximum score is the same as in Needleman-Wunsch. The main differences between Needleman-Wunsch and Smith-Waterman are: Needleman-Wunsch performs global alignment while Smith-Waterman focuses on local alignment. In Needleman-Wunsch cell scores may be positive or negative, while in Smith-Waterman negative cells are set to zero. In Needleman-Wunsch the score cannot decrease between two cells of a pathway, while in Smith-Waterman the score can increase, decrease or remain the same between two cells of a pathway. Table 6.1 describes the comparison of different distance metric techniques in the context of the proposed research.

Table 6.1 Comparison of distance metric techniques

1. Euclidean Distance
   Description: The distance between two points that would be measured with a ruler, calculated using the Pythagorean theorem.
   Advantages: (1) It is fast for determining correlation among points. (2) It is a fair measure because it compares data points based on actual values.
   Disadvantages: (1) It is not suitable for ordinal data such as strings. (2) It requires actual data, not ranks.

2. Levenshtein
   Description: A string metric for measuring the difference between two strings.
   Advantages: It is fast and well suited to string similarity.
   Disadvantages: It does not consider the order of the sequence of characters while comparing.

Table 6.1 Comparison of distance metric techniques (continued)

3. Needleman-Wunsch
   Description: A bioinformatics algorithm that provides global alignment between strings while comparing.
   Advantages: It is well suited to string comparison because it considers the ordering of the sequence of characters.
   Disadvantages: It requires strings of the same length while comparing.

4. Smith-Waterman
   Description: A bioinformatics algorithm that provides local alignment between strings while comparing.
   Advantages: It is well suited to string comparison because it considers the ordering of the sequence of characters, and it is applicable to strings of similar or dissimilar length.
   Disadvantages: It is more complex than global alignment techniques.

From the above table it can be seen that the Euclidean distance is not suitable for the proposed research, because web sessions consist of sequences of web objects in string format. The Levenshtein distance is a very good technique for string sequence similarity, but for the prediction model of web caching and prefetching the ordering of web objects is an important aspect that is ignored by this distance metric, so it is also not appropriate in the proposed research context. Both Needleman-Wunsch and Smith-Waterman consider the ordering of the sequence for string matching, so they are suitable for this context. Web sessions are not always of the same length, so the Needleman-Wunsch algorithm is not a perfect fit for the formation of web session clusters, as it only provides global alignment. The Smith-Waterman algorithm is applicable to sequences of both similar and dissimilar lengths, so it is the ideal algorithm for the formation of clusters in this proposed research.

Categories of Clustering Algorithms
Once an appropriate distance metric is identified, the next step is to determine the appropriate category of clustering algorithm. Clustering algorithms are basically of two main types, described in the following figure.
Hierarchical clustering is also known as connectivity based clustering. It is based on the fundamental concept that objects are more related to nearby objects than to objects far away. Hierarchical clustering algorithms connect different objects based on their distance measurement. The different clusters in hierarchical clustering are represented in binary tree

format. Hierarchical clustering algorithms are either top-down or bottom-up. Top-down clustering is also known as a splitting algorithm; it proceeds by splitting clusters recursively until an individual object is reached. Bottom-up clustering is known as a merging algorithm.

(Figure-6.1 Categories of Clustering Algorithms)

Bottom-up clustering algorithms [ ] begin with n clusters, each containing a single sample or point. Two clusters are then merged so that the distance between them becomes as small as possible. The graphical representation of both techniques is as follows:

(Figure-6.2 Graphical Representation of Hierarchical Clustering Techniques)

The main advantages of hierarchical algorithms are:
(1) It is not required to specify the number of clusters in advance.

(2) The generation of smaller clusters is possible, which may be helpful for the discovery of important information.

However, there are a number of limitations of this category of clustering:
(a) Objects might be grouped incorrectly, so the result should be examined closely before proceeding to the next phase.
(b) The use of different distance techniques may generate different results.
(c) Interpretation of the result is subjective.
(d) Interpretation of the hierarchy is complex and often confusing.
(e) Research shows that most hierarchical algorithms do not revisit clusters once they are built.

Hierarchical clustering is not ideal in the context of the proposed research because it is not flexible in cluster formation; it provides a rigid means in terms of optimization of clustering results. Sometimes the grouping of clusters is also not up to the mark. In hierarchical clustering, ordinary distance metric techniques are used that do not suit the web usage mining process. Partitioning clustering techniques are another category of clustering techniques. They are very effective, relocation based clustering techniques. There are two main approaches: (I) centroid and (II) medoid. The gravity center of the objects is taken as the measure representing each cluster. The k-means algorithm is the well-known centroid based algorithm. There are three main steps in the k-means algorithm: (i) a center point is determined and each cluster is associated with a center point; (ii) each point is assigned to the cluster with the closest center point; (iii) K, the number of clusters, must be specified. According to k-means, the initial cluster centers are selected randomly [44-125]. Using the Euclidean distance measurement technique, every item is assigned to its nearest cluster center. Each cluster center is then moved to the mean of its assigned items. The assignment and update steps are repeated until the change in cluster assignments becomes less than a threshold value.
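The k-means steps above can be sketched in plain Python (a simplified illustration; `math.dist` requires Python 3.8+, and the initial centers are drawn at random as the text describes):

```python
import math
import random

def kmeans(points, k, iters=100, seed=0):
    """Plain k-means: assign each point to its nearest center, then move each
    center to the mean of its assigned points, until assignments settle."""
    rng = random.Random(seed)
    centers = rng.sample(points, k)          # random initial cluster centers
    clusters = [[] for _ in range(k)]
    for _ in range(iters):
        clusters = [[] for _ in range(k)]
        for p in points:
            # assign to the nearest center using Euclidean distance
            idx = min(range(k), key=lambda c: math.dist(p, centers[c]))
            clusters[idx].append(p)
        # move each center to the mean of its assigned points
        new_centers = [
            tuple(sum(coord) / len(cl) for coord in zip(*cl)) if cl else centers[i]
            for i, cl in enumerate(clusters)
        ]
        if new_centers == centers:
            break                            # assignments settled
        centers = new_centers
    return centers, clusters
```

Note that the mean in the update step is only defined for numeric data; on string-valued web sessions no such mean exists, which is exactly one of the limitations discussed next.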
The k-means algorithm exhibits a number of characteristics: (A) It is most suitable for large data sets. (B) It is sensitive to noise.

(C) It terminates at a local optimum, meaning it is optimal only within a neighboring set of candidate solutions. (D) The resulting clusters tend to be spherical in shape. K-means is a good partition based clustering algorithm, but it exhibits certain limitations in the context of the proposed research work:
1. It is not possible to predict the number of clusters K in advance in the proposed research.
2. K-means has problems when clusters are of different sizes. In the proposed research it is not possible for all clusters to have the same size.
3. K-means has an outlier problem.
4. Empty clusters are possible in K-means.
5. It is applicable only when the mean of clusters is defined, and it is not suitable for categorical data. In the proposed research the data may be categorical.
6. Results depend heavily on the initial partitions.
7. One object may not be part of more than one cluster, whereas according to the proposed research one session might be part of a number of clusters.
8. The distance to the centroid is calculated based on an ordinary distance metric technique such as the Euclidean distance, which would be measured with a ruler and calculated using the Pythagorean theorem. It does not suit string data, while web sessions are in the form of strings.

K-medoid is another powerful partitioning clustering algorithm, based on the medoid philosophy. It is more robust to noise and outliers than k-means because it minimizes a sum of pairwise dissimilarities instead of a sum of squared Euclidean distances. The medoid is the most centrally located point in the cluster, minimizing the average dissimilarity to all the objects in the cluster. The most important and common algorithm of this category is Partitioning Around Medoids (PAM). The PAM algorithm, like k-means, requires selecting K clusters randomly. Each object is associated with the closest medoid, where the closest distance is defined using a valid distance metric technique, most commonly the Euclidean distance.
For each pair of a medoid and a non-medoid object, the two are swapped and the total cost of the configuration is computed; the configuration with the lowest cost is selected. The steps of association to the closest medoid and swapping are repeated until there is no change in the medoids. The main advantage of K-medoid is that the outlier problem is removed, but it still faces a number of the same limitations as k-means: K clusters are required in advance, one object may not be part of more than one cluster, and an ordinary distance metric technique such as the Euclidean distance is used for calculating the medoid.
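A naive sketch of the PAM swap loop described above (illustrative only; practical PAM implementations use a more efficient cost update, and Euclidean distance is used here purely as the textbook default):

```python
import math
from itertools import product

def total_cost(points, medoids):
    """Sum of distances from every point to its closest medoid."""
    return sum(min(math.dist(p, m) for m in medoids) for p in points)

def pam(points, k):
    """Naive PAM: greedily swap a medoid with a non-medoid object
    whenever the swap lowers the total configuration cost."""
    medoids = list(points[:k])               # naive initial choice
    improved = True
    while improved:
        improved = False
        for m, p in product(list(medoids), points):
            if p in medoids:
                continue                     # only swap in non-medoid objects
            candidate = [p if x == m else x for x in medoids]
            if total_cost(points, candidate) < total_cost(points, medoids):
                medoids = candidate          # keep the cheaper configuration
                improved = True
    return medoids
```

Because each medoid is an actual data point rather than a mean, the same loop would work with any pairwise dissimilarity function in place of `math.dist`, which is why medoid-style relocation pairs naturally with string distances later in this chapter.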

In both the k-means and K-medoid techniques, one object cannot be part of more than one cluster, while in the proposed research context one web object could be part of more than one cluster. One popular technique of clustering is Fuzzy C-Means [81-126], which attempts to divide n elements into a collection of m fuzzy clusters with respect to some criterion. The Fuzzy C-Means algorithm is simple and contains three main steps. The first step is to select the number of clusters. The second step is to randomly assign to each point coefficients for being in the clusters. The third step has two sub-steps: compute the centroid for each cluster, and, for each point, compute its coefficients of being in the clusters. The third step is repeated until the change in the coefficients between two iterations is no more than a threshold value. In this kind of clustering every object has a degree of belonging to clusters, as in fuzzy logic, rather than belonging completely to just one cluster. The main advantage is that the algorithm minimizes intra-cluster variance, but it has the same problems: it requires predicting the number of clusters in advance, and results depend on the initial choice of weights. The following table describes the above mentioned clustering techniques in the context of the proposed research work in a condensed manner.

Table 6.2 Clustering technique characteristics

1. Top-Down Hierarchical Clustering
   Characteristics: It splits clusters recursively until an individual object is reached. It is not required to specify the number of clusters in advance. Smaller clusters are possible.
   Justification with Proposed Research: This technique does not revisit clusters once they are built, but web session clusters require repetition until an appropriate cluster is formed.

2. Bottom-Up Hierarchical Clustering
   Characteristics: It begins with n clusters and then merges two clusters so that the distance between them becomes as small as possible.
   Justification with Proposed Research: The same problem as top-down clustering of not revisiting clusters after formation.

3. K-Means (Centroid based Partitioning Algorithm)
   Characteristics: Using the Euclidean distance measurement technique, it assigns every item to its nearest cluster center.
   Justification with Proposed Research: It requires predicting the number of clusters K in advance, which is not possible in the proposed research context.

Table 6.2 Clustering technique characteristics (continued)

3. K-Means (continued)
   Justification with Proposed Research: It uses an ordinary distance measurement technique and is applicable only to numerical data, but web sessions are in the form of string data, and the sequence is also important in the formation of clusters. One object may not be part of more than one cluster, but according to the proposed research one session may be part of a number of clusters. Outlier or noise problems may arise.

4. K-Medoid (Medoid based Partitioning Algorithm)
   Characteristics: It is based on the medoid philosophy. It is more robust to noise and outliers, since it minimizes a sum of pairwise dissimilarities instead of a sum of squared Euclidean distances.
   Justification with Proposed Research: It exhibits similar limitations to k-means in the context of the proposed research work.

5. Fuzzy C-Means
   Characteristics: It attempts to divide n elements into a collection of m fuzzy clusters with respect to some criterion, so one object may be part of a number of clusters.
   Justification with Proposed Research: It requires predicting the number of clusters K in advance, which is not possible in the proposed research work. It also uses an ordinary distance measurement technique that is not suitable in the context of the proposed research work.

Table 6.2 shows that no clustering technique is perfect for the prediction model for web caching and web prefetching. There are two main limitations of all the above mentioned clustering techniques: first, no appropriate distance measurement technique is used, and second, the number of clusters cannot be predicted in advance. In the proposed

research context, no clustering technique can be used directly as it is. The first important point is to determine an appropriate distance measurement technique in the context of the proposed research work. The next section of this chapter deals with a quantifiable analysis of the distance measurement techniques that suit the proposed research context.

6.2 Quantifiable Analysis of Distance Measurement Techniques
It is clear from the previous section that the Levenshtein, Needleman-Wunsch and Smith-Waterman distance measurement techniques are appropriate for the clustering of web sessions in terms of the prediction model of web caching and web prefetching. This section analyzes all the distance measurement techniques in a quantifiable manner and decides which one is the most efficient in the context of the current work.

Infrastructure and Experimental Environment
The following infrastructure is used in the experiment of analyzing the distance measurement techniques in a quantifiable manner:
(1) Personal Computer: Intel Pentium 4 CPU, 2.40 GHz, 1 GB of RAM, 20 GB hard disk.
(2) Online tool for distance measurement: this tool is available at for distance measurement based on many distance measurement techniques.
(3) Internet: used for download purposes.
(4) Sample raw log file: downloaded from the NASA site, containing transactions of users between to
(5) Operating System: Microsoft Windows XP Professional version 2002, Service Pack 2.

As far as the experimental environment is concerned, the following steps are taken:
(a) Each web object number is converted into the corresponding alphabet letter, since the online tool deals with string data. For example:
Cluster 1: 2, 5, 7, 8, 9, 10 is converted to BEGHIJ
Cluster 2: 6, 8, 9, 12, 15, 2, 5 is converted to FHILOBE
(b) The same sample data as for the Markov model is taken to experiment with the results of the Levenshtein distance measurement technique.

Levenshtein Analysis

Table 6.3 describes the distance between clusters using the Levenshtein distance measurement technique. The distance between sessions is calculated using the equation described in Section 6.1. Figure 6.3 is a snapshot of the online tool used for distance measurement. The online tool requires strings to match the distance, so the sessions are converted.

(Figure-6.3 Snapshot of online tool for distance measurement)

Table 6.3 Distance between clusters using Levenshtein

(Figure-6.4 Levenshtein Pattern Analysis)

From the above graph it is observed that a threshold value of 70 is ideal for the Levenshtein distance; it provides a mean accuracy of , and all patterns have a mean accuracy between 70 and . The following are several limitations of the Levenshtein measure for pattern discovery in the current research context:
(1) Session 1: Session 10: Here both sessions request the same web objects and the order of the web objects is also similar, but the distance between them is only 55.
(2) Session 8: Session 20: Here the case is the same as in (1), but the distance between them is 88, which means that Levenshtein also considers the lengths of the two strings.
(3) Session 5: Session 13:

Here the order of the web objects is not exactly the same and some web objects differ between the two sessions, yet the distance between them is 69.
(4) Session 1: Session 2: Here the order is not the same but several web objects are similar, such as 2, 5, 8, 9, yet the distance between them is 0.
(5) Session 1: Session 3: Here the case is the same as the previous one but the distance measure is .

Needleman-Wunsch Analysis
Table 6.4 describes the distance between clusters using the Needleman-Wunsch distance measurement technique. The distance between sessions is calculated using the equation described in Section 6.1. Figure 6.5 describes the analysis of patterns according to the Needleman-Wunsch distance measurement technique. From the analysis it is found that it is very difficult to decide the threshold value for this technique. The ideal threshold value is 85, but it covers only 32% of the clusters, so it affects the cache hit ratio. According to the Needleman-Wunsch distance metric every session is half similar to every other session, which is also not true. It is based on global alignment, so it also considers the lengths of the sessions while comparing. There are several limitations in pattern discovery based on Needleman-Wunsch:
(1) Session 1: Session 2: Here the distance is 50% even though the ordering as well as the number of web objects is dissimilar.
(2) Session 1: Session 3: Here more web objects are similar than in the previous case, yet the distance is still 50%.
(3) Session 1: Session 10:

Table 6.4 Distance between clusters using Needleman Wunsch

(Figure-6.5 Needleman Wunsch Pattern Analysis)

Here both sessions request the same pages and the order of the web objects is also similar, but the distance between them is only 55.
(4) Session 3: Session 18: Here the order as well as the number of web objects differ, yet the distance between them is 88%.
(5) Session 6: Session 7: Here the order as well as the number of web objects differ, yet their distance is 64%, which is higher than in case (3).

Smith Waterman Analysis
Table 6.5 describes the distance between clusters using the Smith-Waterman technique. Figure 6.6 describes the analysis of patterns using the Smith-Waterman distance measurement technique.

Table 6.5 Distance between clusters using Smith Waterman

(Figure-6.6 Smith Waterman Pattern Analysis)

Any threshold value from 65 to 100 is ideal for Smith-Waterman, and the value is decided based on the space available in the proxy server. It is based on local alignment, so it does not take the lengths of the strings into consideration. Several observations on the Smith-Waterman distance metric are as follows:
(1) Session 1: Session 10: Here the order as well as all the web objects referred to in both sessions are similar, so the distance is 100%.
(2) Session 1: Session 14: Here the lengths of the strings are dissimilar, but a certain portion of the second string is the same as the first string, in order, so the distance is 100%.
(3) Session 3: Session 9:

Here the first string is not exactly the same as the second, but it is nearly the same, so the distance is 92%.

(4) Session 4 vs. Session 15: Here the first string is nearly similar to the second string, but less similar than in the previous case (3), so the distance is lower: 83%.

(5) Session 1 vs. Session 2: Here only two web pages are similar, so the distance is 33%.

(6) Session 1 vs. Session 3: Here a total of four web objects are similar, and out of those, two are in order, so the distance is naturally more than in the previous case: 50%.

6.3 Approaches to Formation of Clusters

As discussed in the related work on clustering techniques, hierarchical clustering is not an efficient technique for forming clusters because most hierarchical algorithms never revisit a cluster once it is formed. For the prediction model of web caching and prefetching, partitioning relocation clustering is an ideal basis for cluster formation. There are two main partitioning techniques: (1) K-means and (2) K-medoids. Both require the specification of K, the number of clusters to be formed by the technique. In the proposed research the number of clusters cannot be predicted, so it is impossible to supply a value of K at the start, and neither technique is therefore suitable in this context. Moreover, in both K-means and K-medoids one object cannot be part of more than one cluster, while in the proposed research context one web object can be part of more than one cluster. One popular clustering technique is Fuzzy C-Means, which attempts to divide n elements into m fuzzy clusters according to some criterion, but this algorithm also requires choosing the number of clusters, so it does not fit the proposed research work perfectly either. A new approach for the formation of clusters is therefore suggested in this work.
The steps of this approach are as follows:

[1] Determine the distance metric based on the Smith-Waterman distance metric technique.

Based on the quantifiable analysis of all relevant distance measurement techniques, the Smith-Waterman technique is found to be the most suitable in the context of the proposed work: it handles comparisons of both similar and dissimilar lengths, and it also considers the ordering of web objects.

[2] Decide the threshold value in the context of the proxy server cache memory. The threshold value should be decided based on the capacity of the proxy server cache memory, from the application's perspective.

[3] Form clusters of web objects based on the threshold value. Cluster formation is done according to the selected threshold value.

[4] Repeat step 3 with a new threshold value if required. Cluster formation is repeated according to the new threshold value whenever the application requires it.

6.4 Conclusion

This chapter has described related work on clustering data mining techniques from the perspective of web data. From the literature survey it is found that there are two main categories of clustering techniques: (1) hierarchical and (2) partition-based. Hierarchical clustering is divided into two main approaches, top-down and bottom-up, but both suffer from many limitations and do not suit the proposed work; the main limitation is that clusters are never revisited once they are formed. Partitioning techniques overcome this limitation of hierarchical clustering in terms of optimizing cluster formation. They come in two main variants, centroid-based and medoid-based, but both require the number of clusters to be known in advance, which is not feasible in the context of the current work. A further limitation is that both use ordinary distance measures such as Euclidean distance, which is neither suitable for categorical data nor sensitive to the ordering of objects.
Another main limitation is that one object must belong to exactly one cluster, which does not hold in the proposed research, so this chapter also described the Fuzzy C-Means technique. Fuzzy C-Means overcomes that limitation by allowing one object to be part of more than one cluster, but it still requires the number of clusters to be known in advance, so it is not a perfect technique for forming clusters in the context of the proposed work. The challenge of the proposed work is to identify an appropriate distance measurement technique that suits all categories of data and also considers the ordering of objects. This chapter dealt with distance measurement techniques, identified those appropriate to the proposed work, carried out a quantitative analysis of them, and identified the most relevant one in this context. Finally, the chapter has given a new approach to forming clusters. The new approach is based on the appropriate distance measurement technique for the web caching and prefetching criteria. It is a threshold-based approach, where the value of the threshold is decided based on the memory of the proxy server, and it is iterative, meaning it provides the liberty to select a new threshold value if the previous one is not satisfactory.
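The threshold-based, iterative formation procedure of Section 6.3 can be sketched as follows. The similarity function would in practice be the Smith-Waterman metric (0-100); here a trivial overlap measure stands in so the sketch is self-contained, and the sessions and threshold values are hypothetical. Note that, unlike K-means or K-medoids, a session may fall into more than one cluster:

```python
# Threshold-based cluster formation (a sketch of Section 6.3's steps).
# `sim` is any 0-100 similarity; a placeholder overlap measure is used
# here instead of Smith-Waterman so the example runs standalone.

def overlap_sim(a, b):
    # Placeholder metric: percentage of the shorter session's pages
    # that also occur in the other session (ignores ordering).
    common = len(set(a) & set(b))
    return 100 * common / min(len(a), len(b))

def form_clusters(sessions, threshold, sim=overlap_sim):
    """Step 3: every session seeds a cluster containing all sessions
    within `threshold` similarity of it. A session may appear in
    several clusters; duplicate clusters are merged."""
    clusters = []
    for seed in sessions:
        cluster = frozenset(i for i, s in enumerate(sessions)
                            if sim(seed, s) >= threshold)
        if cluster not in clusters:
            clusters.append(cluster)
    return clusters

sessions = [["P1", "P2", "P3"], ["P1", "P2"], ["P7", "P8"]]
# Step 2: threshold chosen from proxy-cache capacity (hypothetical value).
print(form_clusters(sessions, threshold=90))
# Step 4: repeat with a new threshold if the first proves unsatisfactory.
print(form_clusters(sessions, threshold=40))
```

Swapping `overlap_sim` for the Smith-Waterman similarity of Section 6.2 yields the order-aware, length-tolerant behavior the chapter argues for.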


More information

Efficiency of k-means and K-Medoids Algorithms for Clustering Arbitrary Data Points

Efficiency of k-means and K-Medoids Algorithms for Clustering Arbitrary Data Points Efficiency of k-means and K-Medoids Algorithms for Clustering Arbitrary Data Points Dr. T. VELMURUGAN Associate professor, PG and Research Department of Computer Science, D.G.Vaishnav College, Chennai-600106,

More information

数据挖掘 Introduction to Data Mining

数据挖掘 Introduction to Data Mining 数据挖掘 Introduction to Data Mining Philippe Fournier-Viger Full professor School of Natural Sciences and Humanities philfv8@yahoo.com Spring 2019 S8700113C 1 Introduction Last week: Association Analysis

More information

Clustering part II 1

Clustering part II 1 Clustering part II 1 Clustering What is Cluster Analysis? Types of Data in Cluster Analysis A Categorization of Major Clustering Methods Partitioning Methods Hierarchical Methods 2 Partitioning Algorithms:

More information

INF4820 Algorithms for AI and NLP. Evaluating Classifiers Clustering

INF4820 Algorithms for AI and NLP. Evaluating Classifiers Clustering INF4820 Algorithms for AI and NLP Evaluating Classifiers Clustering Murhaf Fares & Stephan Oepen Language Technology Group (LTG) September 27, 2017 Today 2 Recap Evaluation of classifiers Unsupervised

More information

Chapter DM:II. II. Cluster Analysis

Chapter DM:II. II. Cluster Analysis Chapter DM:II II. Cluster Analysis Cluster Analysis Basics Hierarchical Cluster Analysis Iterative Cluster Analysis Density-Based Cluster Analysis Cluster Evaluation Constrained Cluster Analysis DM:II-1

More information

Mining di Dati Web. Lezione 3 - Clustering and Classification

Mining di Dati Web. Lezione 3 - Clustering and Classification Mining di Dati Web Lezione 3 - Clustering and Classification Introduction Clustering and classification are both learning techniques They learn functions describing data Clustering is also known as Unsupervised

More information

Machine Learning. Unsupervised Learning. Manfred Huber

Machine Learning. Unsupervised Learning. Manfred Huber Machine Learning Unsupervised Learning Manfred Huber 2015 1 Unsupervised Learning In supervised learning the training data provides desired target output for learning In unsupervised learning the training

More information

Workload Characterization Techniques

Workload Characterization Techniques Workload Characterization Techniques Raj Jain Washington University in Saint Louis Saint Louis, MO 63130 Jain@cse.wustl.edu These slides are available on-line at: http://www.cse.wustl.edu/~jain/cse567-08/

More information

Dynamic Programming User Manual v1.0 Anton E. Weisstein, Truman State University Aug. 19, 2014

Dynamic Programming User Manual v1.0 Anton E. Weisstein, Truman State University Aug. 19, 2014 Dynamic Programming User Manual v1.0 Anton E. Weisstein, Truman State University Aug. 19, 2014 Dynamic programming is a group of mathematical methods used to sequentially split a complicated problem into

More information

Unsupervised Learning

Unsupervised Learning Unsupervised Learning Unsupervised learning Until now, we have assumed our training samples are labeled by their category membership. Methods that use labeled samples are said to be supervised. However,

More information

Visual Representations for Machine Learning

Visual Representations for Machine Learning Visual Representations for Machine Learning Spectral Clustering and Channel Representations Lecture 1 Spectral Clustering: introduction and confusion Michael Felsberg Klas Nordberg The Spectral Clustering

More information

Data Informatics. Seon Ho Kim, Ph.D.

Data Informatics. Seon Ho Kim, Ph.D. Data Informatics Seon Ho Kim, Ph.D. seonkim@usc.edu Clustering Overview Supervised vs. Unsupervised Learning Supervised learning (classification) Supervision: The training data (observations, measurements,

More information

Mouse, Human, Chimpanzee

Mouse, Human, Chimpanzee More Alignments 1 Mouse, Human, Chimpanzee Mouse to Human Chimpanzee to Human 2 Mouse v.s. Human Chromosome X of Mouse to Human 3 Local Alignment Given: two sequences S and T Find: substrings of S and

More information

CHAPTER VII INDEXED K TWIN NEIGHBOUR CLUSTERING ALGORITHM 7.1 INTRODUCTION

CHAPTER VII INDEXED K TWIN NEIGHBOUR CLUSTERING ALGORITHM 7.1 INTRODUCTION CHAPTER VII INDEXED K TWIN NEIGHBOUR CLUSTERING ALGORITHM 7.1 INTRODUCTION Cluster analysis or clustering is the task of grouping a set of objects in such a way that objects in the same group (called cluster)

More information

Sequence analysis Pairwise sequence alignment

Sequence analysis Pairwise sequence alignment UMF11 Introduction to bioinformatics, 25 Sequence analysis Pairwise sequence alignment 1. Sequence alignment Lecturer: Marina lexandersson 12 September, 25 here are two types of sequence alignments, global

More information

Sequence Alignment. part 2

Sequence Alignment. part 2 Sequence Alignment part 2 Dynamic programming with more realistic scoring scheme Using the same initial sequences, we ll look at a dynamic programming example with a scoring scheme that selects for matches

More information

Lesson 3. Prof. Enza Messina

Lesson 3. Prof. Enza Messina Lesson 3 Prof. Enza Messina Clustering techniques are generally classified into these classes: PARTITIONING ALGORITHMS Directly divides data points into some prespecified number of clusters without a hierarchical

More information

Unsupervised Learning. Andrea G. B. Tettamanzi I3S Laboratory SPARKS Team

Unsupervised Learning. Andrea G. B. Tettamanzi I3S Laboratory SPARKS Team Unsupervised Learning Andrea G. B. Tettamanzi I3S Laboratory SPARKS Team Table of Contents 1)Clustering: Introduction and Basic Concepts 2)An Overview of Popular Clustering Methods 3)Other Unsupervised

More information