CHAPTER-6 WEB USAGE MINING USING CLUSTERING


6.1 Related work in Clustering Technique
6.2 Quantifiable Analysis of Distance Measurement Techniques
6.3 Approaches to Formation of Clusters
6.4 Conclusion

This chapter deals in detail with clustering, a web usage mining technique for pattern discovery. Cluster analysis is not a single algorithm; it comprises many different algorithms for pattern discovery that can be chosen according to the application and the targeted results. The first section of the chapter presents a literature survey of clustering techniques. The second section deals with partitioning clustering techniques in detail. The next section covers distance measurement techniques, and the remaining sections present the new approach to pattern discovery and its associated analysis.

6.1 Related work in Clustering Technique

Clustering is an unsupervised data mining technique that divides data among different groups for the pattern discovery task. It is also known as exploratory data analysis, since no labeled data are available [5]. The ultimate goal of clustering is to separate a finite set of unlabeled data into a discrete and finite set of useful, valid and hidden groupings. A cluster is characterized by internal homogeneity and external separation [87]. The clustering procedure is applied to preprocessed data consisting only of selected attributes. After preprocessing, an appropriate algorithm is selected or designed to achieve the targeted results. Finally, the results are interpreted to provide meaningful insights to the end user. In [98] several characteristics of clusters are described that can be considered for better cluster formation. Clustering is very useful in several applications such as data mining, document retrieval, image segmentation and pattern classification, for purposes of pattern analysis, grouping, decision making and machine learning. Clustering has been studied extensively in the literature on information retrieval and text mining [32,102], but very little work has been done for web analysis. There are three main elements in cluster analysis: (1) effective similarity measures, (2) criterion functions and (3) the algorithm itself.

6.1.1 Similarity Measures

Similarity is expressed as the distance between two objects. Different domains such as web analysis, information retrieval, recommendation systems and social network analysis require efficient techniques for measuring similarity among diverse objects, and various similarity measures have been proposed in the literature. There are two main categories of similarity measures: (I) content based and (II) link based. Content-based methods [118,46,73,95] evaluate similarity among objects such as web pages, persons and multimedia objects. Link-based similarity measures [6,93] determine the similarity of the links of two web objects for search engine purposes. Link-based similarity measures are out of the scope of the proposed research, since they are vital for the search engine aspect rather than for web usage mining.

The most popular and commonly used distance metric studied in the literature is the Euclidean distance [1,68,53]. It is the ordinary distance between two points, as would be measured with a ruler, and is derived from the Pythagorean formula.

The distance between two data points in the plane with coordinates (p, q) and (r, s) is:

DIST((p, q), (r, s)) = sqrt((p - r)^2 + (q - s)^2)

The usefulness of the Euclidean distance depends on the circumstances. For points in a plane it provides reasonably good results, but with slow speed, and it is not useful for string distance measurement. The Euclidean distance can be extended to any number of dimensions.

Another very popular distance metric is the Manhattan distance, which computes the distance from one data point to another when a grid-like path is followed. The Manhattan distance between two items is the sum of the absolute differences of their corresponding elements [103]. The distance between two data points P = (p1, p2, ..., pn) and Q = (q1, q2, ..., qn) is:

DIST(P, Q) = sum for i = 1 to n of |pi - qi|

It is the summation of horizontal and vertical components, whereas the diagonal distance would be computed using the Pythagorean formula. Since it is closely related to the Euclidean distance, it exhibits similar characteristics. It is generally useful in gaming applications such as chess, for determining the distance from one square to another.

The Minkowski distance is another famous distance measurement technique, and can be considered a generalization of both the Euclidean and Manhattan distances. Several studies [4,39,43] have used the Minkowski distance to determine similarity among objects. The Minkowski distance of order p between two points P = (x1, x2, ..., xn) and Q = (y1, y2, ..., yn) in R^n is defined as:

DIST(P, Q) = ( sum for i = 1 to n of |xi - yi|^p )^(1/p)

If p equals one, the Minkowski distance is the Manhattan distance, while for p = 2 it becomes the Euclidean distance.

All distances discussed so far are ordinary distances, as measured with a ruler; they are not suitable for measuring the similarity of two strings, and hence they are not appropriate in the context of the proposed research, since a web session consists of a number of strings in the form of URLs. The Hamming distance is a popular similarity measure for strings that determines similarity by counting the number of positions at which the corresponding characters differ. More formally, the Hamming distance between two equal-length strings P and Q is the number of positions i at which Pi differs from Qi. Hamming distance theory is widely used in several applications, such as the quantification of information, the study of code properties, and secure communication. The main limitation of the Hamming distance is that it is applicable only to strings of the same length. The Hamming distance is widely used in the field of error-free communication [2,112], because the byte length remains the same for both parties involved in the communication.
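As a minimal illustration, the following Python sketch (provided here for exposition only; it is not part of the thesis experiments) implements the four ordinary metrics discussed above and shows why the Hamming distance rejects strings of unequal length:

```python
import math

def euclidean(p, q):
    """Ordinary straight-line distance; suitable for numeric data only."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))

def manhattan(p, q):
    """Sum of absolute coordinate differences (grid-like path)."""
    return sum(abs(a - b) for a, b in zip(p, q))

def minkowski(p, q, r):
    """Generalization: r = 1 gives Manhattan, r = 2 gives Euclidean."""
    return sum(abs(a - b) ** r for a, b in zip(p, q)) ** (1.0 / r)

def hamming(s1, s2):
    """Number of positions that differ; strings must be of equal length."""
    if len(s1) != len(s2):
        raise ValueError("Hamming distance requires equal-length strings")
    return sum(c1 != c2 for c1, c2 in zip(s1, s2))

print(euclidean((1, 2), (4, 6)))    # 5.0
print(manhattan((1, 2), (4, 6)))    # 7
print(minkowski((1, 2), (4, 6), 2)) # 5.0 (same as Euclidean)
print(hamming("BEGHIJ", "BEGHIL"))  # 1
```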

The Hamming distance is not very useful in the context of web usage mining, since the lengths of web sessions are not all the same. The Levenshtein distance, or edit distance, is a more sophisticated distance measurement technique for string similarity. It is the key distance in several fields such as optical character recognition, text processing, computational biology, fraud detection and cryptography, and has been studied extensively by many authors [79,42,9]. The Levenshtein distance between two strings S1 and S2 is given by lev(|S1|, |S2|), where:

lev(i, j) = max(i, j), if min(i, j) = 0; otherwise
lev(i, j) = min( lev(i-1, j) + 1, lev(i, j-1) + 1, lev(i-1, j-1) + cost )

where cost = 0 if S1_i = S2_j and cost = 1 otherwise. The Levenshtein distance measurement technique is well suited to web sessions, since it is applicable to strings of unequal length.
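The recurrence above translates directly into a small dynamic program. The following sketch (our own illustration, assuming the simple unit costs of the formula) computes the edit distance between two encoded sessions:

```python
def levenshtein(s1, s2):
    """Edit distance via dynamic programming; works for unequal lengths."""
    m, n = len(s1), len(s2)
    # dp[i][j] = distance between the prefixes s1[:i] and s2[:j]
    dp = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(m + 1):
        dp[i][0] = i                      # i deletions
    for j in range(n + 1):
        dp[0][j] = j                      # j insertions
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            cost = 0 if s1[i - 1] == s2[j - 1] else 1
            dp[i][j] = min(dp[i - 1][j] + 1,        # deletion
                           dp[i][j - 1] + 1,        # insertion
                           dp[i - 1][j - 1] + cost) # substitution
    return dp[m][n]

# Sessions encoded as strings (web object 2 -> 'B', 5 -> 'E', ...)
print(levenshtein("BEGHIJ", "FHILOBE"))  # edit distance between two sessions
```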

Several bioinformatics distance measurement techniques that are used to align protein or nucleotide sequences can also be applied from a web mining perspective to cluster web sessions of unequal size. One of the most important techniques of this category was invented by Saul B. Needleman and Christian D. Wunsch [80] to align protein sequences of unequal size. The technique uses dynamic programming, that is, it solves a complex problem by breaking it down into simpler subproblems. It is a global alignment technique, most appropriate for closely related sequences of about the same length. Alignment is done from the beginning to the end of the sequences to find the best possible alignment. The technique uses a scoring system: a positive or higher value is assigned for a match, and a negative or lower value is assigned for a mismatch. It uses gap penalties to maximize the meaning of the sequence; the penalty is subtracted for each gap that is introduced. There are two main types of gap penalty, opening and extension. The opening penalty is always applied at the start of a gap, and the gaps following it are charged a gap extension penalty that is smaller than the opening penalty. Typical values are 12 for gap opening and 4 for gap extension.

According to the Needleman-Wunsch algorithm, an initial matrix is created with dimension N x M, where N, the number of rows, equals the number of characters of the first string plus one, and M, the number of columns, equals the number of characters of the second string plus one. The extra row and column are used to align against a gap. A scoring scheme is then introduced, which can be user defined with specific scores. A simple basic scoring scheme is: if the characters at positions i and j are the same, the match score is 1 (S(i, j) = 1); otherwise the mismatch score is taken as -1 (S(i, j) = -1). The gap penalty is taken as -1. The dynamic programming matrix is computed in three steps:

1. Initialization phase: the gap score is accumulated along the first row and the first column, each cell adding the gap penalty to the previous cell of the row or column.

2. Matrix filling phase: this is the most crucial phase, and filling starts from the upper left-hand corner of the matrix. To find the maximum score of each cell, it is required to know the diagonal, left and upper scores of the current position. The match or mismatch score is added to the diagonal value, and the gap penalty to the left and upper values; position (i, j) is then filled with the maximum of the three. The equation for the maximum score is:

M(i, j) = max[ M(i-1, j-1) + S(i, j), M(i, j-1) + W, M(i-1, j) + W ]

where i and j index the rows and columns, M(i, j) is the matrix value of the required cell, S(i, j) is the match/mismatch score of that cell, and W is the gap penalty.

3. Alignment through trace back: the final step of the Needleman-Wunsch algorithm is tracing back for the best possible alignment, which is identified by following the maximum alignment score.

The Needleman-Wunsch distance measurement technique is an ideal one for string similarity, so this technique is also considered in the proposed research context.

Smith-Waterman is another important bioinformatics technique for aligning strings. It compares segments of all possible lengths and optimizes the measure of similarity. Temple F. Smith and Michael S. Waterman [97] were the founders of this technique. The main difference in comparison with Needleman-Wunsch is that negative scoring matrix cells are set to zero, which makes local alignments visible. The technique compares segments of varying lengths instead of looking at the entire sequence at once. The main advantages of the Smith-Waterman technique are: it identifies conserved regions between the two sequences; it can align partially overlapping sequences; and it can align a subsequence of the sequence to itself. Like Needleman-Wunsch, this technique uses a scoring matrix; the scoring and gap concepts used in Needleman-Wunsch apply equally to Smith-Waterman. It also uses the same steps of initialization, matrix filling and alignment through trace back, and the equation for the maximum score is the same as in Needleman-Wunsch.
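Since the two algorithms share the same matrix-filling equation and differ mainly in initialization and in the clamping of negative cells, both can be sketched with one routine. The helper below is our own illustrative sketch (with the unit match/mismatch/gap scores assumed above), not the exact implementation used by the online tool referenced later in this chapter:

```python
def align_score(s1, s2, match=1, mismatch=-1, gap=-1, local=False):
    """Fill the DP matrix for Needleman-Wunsch (global) or, with
    local=True, Smith-Waterman (negative cells clamped to zero)."""
    m, n = len(s1), len(s2)
    M = [[0] * (n + 1) for _ in range(m + 1)]
    if not local:                          # global alignment penalizes leading gaps
        for i in range(m + 1):
            M[i][0] = i * gap
        for j in range(n + 1):
            M[0][j] = j * gap
    best = 0
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            s = match if s1[i - 1] == s2[j - 1] else mismatch
            score = max(M[i - 1][j - 1] + s,   # diagonal: match/mismatch
                        M[i][j - 1] + gap,     # left: gap in s1
                        M[i - 1][j] + gap)     # up: gap in s2
            if local:
                score = max(score, 0)          # Smith-Waterman: no negative cells
            M[i][j] = score
            best = max(best, score)
    return best if local else M[m][n]

print(align_score("BEGHIJ", "BEGHIJLMNJ"))             # global score: 2
print(align_score("BEGHIJ", "BEGHIJLMNJ", local=True)) # local score: 6
```

The usage lines show the practical difference: the global score is dragged down by the trailing gaps, while the local score rewards the exactly matching prefix.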

The main differences between Needleman-Wunsch and Smith-Waterman are:

(1) Needleman-Wunsch performs global alignment, while Smith-Waterman focuses on local alignment.
(2) Needleman-Wunsch works even if the alignment score for a pair of residues is >= 0, while Smith-Waterman needs scores that may be both positive and negative, since mismatches must be penalized.
(3) For Needleman-Wunsch no gap penalty is required for processing, while Smith-Waterman requires a gap penalty to work efficiently.
(4) In Needleman-Wunsch the score cannot decrease between two cells of a pathway, while in Smith-Waterman the score can increase, decrease or remain the same between two cells of a pathway.

Table 6.1 compares the distance metric techniques in the context of the proposed research.

Table 6.1 Comparison of distance metric techniques

1. Euclidean Distance
   Description: the distance between two points as would be measured with a ruler, calculated using the Pythagorean theorem.
   Advantages: (1) fast for determining correlation among points; (2) a fair measure, because it compares data points based on actual values.
   Disadvantages: (1) not suitable for ordinal data such as strings; (2) requires actual values, not ranks.

2. Levenshtein
   Description: a string metric for measuring the difference between two strings.
   Advantages: fast, and well suited to string similarity.
   Disadvantages: does not consider the order of the character sequence while comparing.

3. Needleman-Wunsch
   Description: a bioinformatics algorithm that provides global alignment between strings while comparing.
   Advantages: good for string comparison because it considers the ordering of the character sequence.
   Disadvantages: requires strings of the same length while comparing.

4. Smith-Waterman
   Description: a bioinformatics algorithm that provides local alignment between strings while comparing.
   Advantages: good for string comparison because it considers the ordering of the character sequence, and it is applicable to strings of similar or dissimilar length.
   Disadvantages: more complex than global alignment techniques.

From the above table it can be seen that the Euclidean distance is not suitable for the proposed research, because web sessions consist of sequences of web objects in string format. The Levenshtein distance is a very good technique for string sequence similarity, but for a prediction model of web caching and prefetching the ordering of web objects is an important aspect that this metric ignores, so it is also not appropriate in the proposed research context. Both Needleman-Wunsch and Smith-Waterman consider the ordering of the sequence for string matching, so they are ideal for this context. Web sessions are not always of the same length, so the Needleman-Wunsch algorithm is not a perfect fit for the formation of web session clusters, as it only provides global alignment. The Smith-Waterman algorithm is applicable to sequences of both equal and unequal length, so it is the ideal algorithm for the formation of clusters in this proposed research.

6.1.2 Categories of Clustering Algorithms

Once an appropriate distance metric is identified, the next step is to determine an appropriate category of clustering algorithm. Clustering algorithms are basically of two main types, described in the following figure. Hierarchical clustering is also known as connectivity-based clustering. It is based on the fundamental idea that objects are more related to nearby objects than to objects far away. Hierarchical clustering algorithms connect different objects based on their distance. The different clusters in hierarchical clustering are represented in a binary tree format.

format. Hierarchical clustering algorithms are either top-down or bottom-up. Top down Clustering is also known as splitting algorithm. It proceeds by splitting clusters recursively until individual object is reached. Bottom up Clustering is known as merging algorithms. Bottom Up Clustering Algorithms Hierarchical Clustering Partitioning Clustering Bottom UP Top Down Centroid Medoid (Figure-6.1 Categories of Clustering Algorithms) clustering algorithms [104-124] begin with any n clusters and each cluster contains a single sample or point. Then two clusters will merge so distance among them becomes as least as possible. The graphical representation of both techniques is as follows: C1 C2 C1,C2 Bottom Up C3 C1,C2,C3,C4,C5 C4 C3,C4 C3,C4,C5 C5 Top Down (Figure-6.2 Graphical Representation of Hierarchical Clustering Techniques) The main advantages of hierarchical algorithms are: (1) It is not required to specify number of clusters in advance. Prediction Model for Web Caching and Prefetching with Web Usage Mining to optimize web objects 82

(2) The generation of smaller clusters is possible, which may help in the discovery of important information.

But there are a number of limitations of this category of clustering:

(a) Objects might be grouped incorrectly, so the result should be examined closely before proceeding to the next phase.
(b) Using a different distance technique may generate different results.
(c) Interpretation of the results is subjective.
(d) Interpretation of the hierarchy is complex and often confusing.
(e) Research shows that most hierarchical algorithms do not revisit clusters once they are built.

Hierarchical clustering is not ideal in the context of the proposed research because it is not flexible in cluster formation: it is rigid in terms of optimizing the clustering results, and sometimes the grouping of clusters is not up to the mark. Moreover, hierarchical clustering uses ordinary distance metric techniques that do not suit the web usage mining process.

Partitioning clustering techniques are the other category of clustering techniques. They are very effective, relocation-based clustering techniques, with two main approaches: (I) centroid and (II) medoid. The gravity center of the objects is taken as the measure representing each cluster. The K-means algorithm is the well-known centroid-based algorithm. There are three main steps in the K-means algorithm:

(i) A center point is determined, and each cluster is associated with a center point.
(ii) Each point is assigned to the cluster with the closest center point.
(iii) In K-means, the number of clusters K must be specified.

According to K-means, the initial cluster centers are selected randomly [44-125]. Every item is assigned to its nearest cluster center using the Euclidean distance measurement technique, and each cluster center is then moved to the mean of its assigned items. Assignment and center movement are repeated until the change in cluster assignments falls below a threshold value. The K-means algorithm exhibits a number of characteristics:

(A) It is most suitable for large data sets.
(B) It is sensitive to noise.

(C) It terminates at a local optimum, meaning the result is optimal only within a neighboring set of candidate solutions.
(D) The clusters formed tend to be globular in shape.

K-means is an ideal partition-based clustering algorithm, but it exhibits certain limitations in the context of the proposed research work:

1. It is not possible to predict the number of clusters K in advance in the proposed research.
2. K-means has problems when clusters differ in size, and in the proposed research it cannot be assumed that all clusters have the same size.
3. K-means has an outlier problem.
4. Empty clusters are possible in K-means.
5. It is applicable only when the mean of a cluster is defined, and it is not suitable for categorical data, while in the proposed research the data may be categorical.
6. Results depend heavily on the initial partitions.
7. One object cannot be part of more than one cluster, whereas in the proposed research one session might be part of a number of clusters.
8. The distance to a centroid is calculated with an ordinary distance metric such as the Euclidean distance, which is measured as with a ruler and calculated using the Pythagorean theorem; it does not suit string data, while web sessions are in the form of strings.

K-medoid is another powerful partitioning clustering algorithm, based on the medoid philosophy. It is more robust to noise and outliers than K-means, because it minimizes a sum of pairwise dissimilarities instead of a sum of squared Euclidean distances. A medoid is the most centrally located point in the cluster, the one with the smallest average dissimilarity to all the objects in the cluster. The most important and common algorithm of this category is Partitioning Around Medoids (PAM). Like K-means, PAM requires selecting K clusters randomly. Each object is associated with its closest medoid, where closeness is defined using a valid distance metric, most commonly the Euclidean distance. For each pair of a medoid and a non-medoid object, the two are swapped and the total cost of the configuration is computed; the configuration with the lowest cost is selected. The steps of association to the closest medoid and swapping are repeated until there is no change in the medoids. The main advantage of K-medoid is that the outlier problem is removed, but it still faces a number of the same limitations as K-means: K clusters are required in advance, one object cannot be part of more than one cluster, and an ordinary distance metric such as the Euclidean distance is used for calculating the medoid.
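For concreteness, a minimal sketch of the K-means (Lloyd) iteration follows; it is our own illustration, with hypothetical helper names, not code from the thesis experiments. It makes two limitations above visible in code: K is fixed before any point is assigned, and every point lands in exactly one cluster.

```python
import random

def kmeans(points, k, iters=100):
    """Lloyd's algorithm: K must be fixed in advance, and every point is
    assigned to exactly one cluster."""
    centers = random.sample(points, k)           # pick K initial centers
    for _ in range(iters):
        clusters = [[] for _ in range(k)]
        for p in points:                         # nearest-center assignment
            d = [sum((a - b) ** 2 for a, b in zip(p, c)) for c in centers]
            clusters[d.index(min(d))].append(p)
        new_centers = [
            tuple(sum(x) / len(c) for x in zip(*c)) if c else centers[i]
            for i, c in enumerate(clusters)      # move each center to its mean
        ]
        if new_centers == centers:               # stop when centers stabilize
            break
        centers = new_centers
    return centers, clusters

pts = [(1, 1), (1, 2), (8, 8), (9, 8), (9, 9)]
centers, clusters = kmeans(pts, k=2)             # k must be guessed up front
```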

In both the K-means and K-medoid techniques, one object cannot be part of more than one cluster, while in the proposed research context one web object could be part of more than one cluster. One popular technique that addresses this is Fuzzy C-means [81-126], which attempts to divide n elements into a collection of m fuzzy clusters with respect to some criterion. The Fuzzy C-means algorithm is simple and contains three main steps. The first step is to select the number of clusters. The second step is to randomly assign to each point coefficients for being in the clusters. The third step has two sub-steps: compute the centroid of each cluster, and then, for each point, recompute its coefficients of being in the clusters. The third step is repeated until the change in the coefficients between two iterations is no more than a threshold value. In this kind of clustering, every object has a degree of belonging to each cluster, as in fuzzy logic, rather than belonging completely to just one cluster. The main advantage is that the algorithm minimizes intra-cluster variance, but the same problems remain: it requires the number of clusters in advance, and the results depend on the initial choice of weights. The following table summarizes the above clustering techniques in the context of the proposed research work.

Table 6.2 Clustering techniques characteristics

1. Top-Down Hierarchical Clustering
   Characteristics: splits clusters recursively until individual objects are reached; the number of clusters need not be specified in advance; smaller clusters are possible.
   Justification with proposed research: this technique does not revisit a cluster once it is built, but web session clusters require repetition until appropriate clusters are formed.

2. Bottom-Up Hierarchical Clustering
   Characteristics: begins with n clusters and repeatedly merges the two clusters whose mutual distance is least.
   Justification with proposed research: the same problem as top-down clustering of not revisiting clusters after formation.

3. K-Means (centroid-based partitioning algorithm)
   Characteristics: assigns every item to its nearest cluster center using the Euclidean distance measurement technique.
   Justification with proposed research: it requires predicting the number of clusters K in advance, which is not possible in the proposed research context. It uses an ordinary distance measurement technique applicable only to numerical data, while web sessions are string data, and the sequence is important in the formation of clusters. One object cannot be part of more than one cluster, whereas one session may be part of a number of clusters. Outlier or noise problems may arise.

4. K-Medoid (medoid-based partitioning algorithm)
   Characteristics: based on the medoid philosophy; more robust to noise and outliers, since it minimizes a sum of pairwise dissimilarities instead of a sum of squared Euclidean distances.
   Justification with proposed research: it exhibits similar limitations to K-means in the context of the proposed research work.

5. Fuzzy C-Means
   Characteristics: attempts to divide n elements into a collection of m fuzzy clusters with respect to some criterion, so one object may be part of a number of clusters.
   Justification with proposed research: it requires predicting the number of clusters in advance, which is not possible in the proposed research work, and it also uses an ordinary distance measurement technique that is not suitable in this context.

Table 6.2 shows that no clustering technique is a perfect fit for the prediction model for web caching and web prefetching. All of the above techniques share two main limitations: no appropriate distance measurement technique is used, and the number of clusters cannot be predicted in advance. No clustering technique can therefore be used directly in the proposed research context.
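A minimal NumPy sketch of the Fuzzy C-means update rules (our own illustration, with a standard fuzzifier m = 2) shows that although memberships are soft, the number of clusters c must still be supplied up front:

```python
import numpy as np

def fuzzy_c_means(X, c, m=2.0, iters=100, tol=1e-5):
    """Fuzzy C-means: every point gets a degree of membership in every
    cluster, but the cluster count c is still required in advance."""
    n = len(X)
    U = np.random.rand(n, c)
    U = U / U.sum(axis=1, keepdims=True)                 # random initial memberships
    for _ in range(iters):
        Um = U ** m
        centers = (Um.T @ X) / Um.sum(axis=0)[:, None]   # membership-weighted centroids
        dist = np.linalg.norm(X[:, None, :] - centers[None, :, :], axis=2)
        dist = np.fmax(dist, 1e-10)                      # avoid division by zero
        inv = dist ** (-2.0 / (m - 1))
        U_new = inv / inv.sum(axis=1, keepdims=True)     # updated memberships
        if np.abs(U_new - U).max() < tol:                # stop when change < threshold
            break
        U = U_new
    return centers, U

X = np.array([[1.0, 1.0], [1.5, 2.0], [8.0, 8.0], [9.0, 8.5]])
centers, U = fuzzy_c_means(X, c=2)   # each row of U sums to 1 across clusters
```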

The first important point is to determine an appropriate distance measurement technique in the context of the proposed research work. The next section of this chapter deals with a quantifiable analysis of the distance measurement techniques that suit the proposed research context.

6.2 Quantifiable Analysis of Distance Measurement Techniques

It is clear from section 6.1.1 that the Levenshtein, Needleman-Wunsch and Smith-Waterman distance measurement techniques are appropriate for clustering web sessions for the prediction model of web caching and web prefetching. This section analyzes all of these distance measurement techniques in a quantifiable manner and decides which one is the most efficient in the context of the current work.

6.2.1 Infrastructure and experimental environment

The following infrastructure is used in the experiment analyzing the distance measurement techniques in a quantifiable manner:

(1) Personal computer: Intel Pentium 4 CPU, 2.40 GHz, 1 GB of RAM, 20 GB hard disk.
(2) Online tool for distance measurement: available at http://asecuritysite.com/forensics/simstring, supporting distance measurement based on many techniques.
(3) Internet: used for download purposes.
(4) Sample raw log file: downloaded from the NASA site (http://ita.ee.lbl.gov/html/contrib/nasa-http.html), containing transactions of users between 15-11-2009 and 31-11-2009.
(5) Operating system: Microsoft Windows XP Professional version 2002, Service Pack 2.

As far as the experimental environment is concerned, the following steps are taken:

(a) Each web object number is converted into a corresponding alphabet character, since the online tool deals with string data (a sketch of this conversion follows the list). For example:
Cluster 1: 2, 5, 7, 8, 9, 10 is converted to BEGHIJ
Cluster 2: 6, 8, 9, 12, 15, 2, 5 is converted to FHILOBE
(b) The same sample data as for the Markov model is taken to experiment with the results of the Levenshtein distance measurement technique.
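The conversion in step (a) is a simple character mapping; a short sketch of it (our own illustration, assuming object ids in the range 1-26) is:

```python
def encode_session(objects):
    """Map web object numbers to letters (1 -> 'A', 2 -> 'B', ...) so the
    string-based online tool can compare sessions."""
    return "".join(chr(ord("A") + n - 1) for n in objects)

print(encode_session([2, 5, 7, 8, 9, 10]))      # BEGHIJ
print(encode_session([6, 8, 9, 12, 15, 2, 5]))  # FHILOBE
```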

6.2.2 Levenshtein Analysis

Table 6.3 describes the distances between clusters using the Levenshtein distance measurement technique. The distance between sessions is calculated using the equation described in section 6.1.1. Figure 6.3 is a snapshot of the online tool used for distance measurement; the tool requires strings as input, so the sessions are converted as described above.

(Figure-6.3 Snapshot of online tool for distance measurement)

Table 6.3 Distance between clusters using Levenshtein (row i lists the distances from cluster i to clusters 1-25)

1: 0 0 25 31 8 17 20 21 21 55 25 21 8 43 33 21 27 25 7 19 14 14 25 20 46
2: 0 0 33 38 33 21 20 21 29 27 31 29 31 50 50 21 33 25 29 25 0 0 0 40 15
3: 25 33 0 31 31 33 42 21 79 33 25 36 15 0 33 29 27 75 21 25 33 8 50 17 46
4: 31 38 31 0 23 8 54 29 36 46 81 71 31 14 77 7 87 25 36 38 31 23 23 38 8
5: 8 33 31 23 0 18 27 14 36 45 19 21 69 14 25 21 20 19 79 19 9 9 0 27 8
6: 17 21 33 8 18 0 36 14 29 27 6 14 15 7 8 79 7 31 14 25 36 9 27 9 31
7: 20 20 42 54 27 36 0 36 36 45 44 36 38 7 50 43 47 44 29 31 30 20 30 30 31
8: 21 21 21 29 14 14 36 0 36 36 19 14 14 7 29 21 20 31 7 88 36 14 21 21 21
9: 21 29 79 36 36 29 36 36 0 43 31 36 21 14 36 29 33 56 21 31 21 7 36 29 29
10: 55 27 33 46 45 27 45 36 43 0 44 36 31 7 50 29 47 31 36 31 27 9 18 45 38
11: 25 31 25 81 19 6 44 19 31 44 0 69 25 38 62 6 88 12 25 19 25 31 19 31 12
12: 21 29 36 71 21 14 36 14 36 36 69 0 21 14 50 0 73 19 29 25 36 14 29 29 21
13: 8 31 15 31 69 15 38 14 21 31 25 21 0 14 31 14 27 12 57 12 15 15 15 23 0
14: 43 50 0 14 14 7 7 7 14 7 38 14 14 0 14 0 27 19 14 12 14 21 7 29 29
15: 33 50 33 77 25 8 50 29 36 50 62 50 31 14 0 14 67 31 21 31 25 17 25 42 15
16: 21 21 29 7 21 79 43 21 29 29 6 0 14 0 14 0 7 31 7 31 29 14 21 21 36
17: 27 33 27 87 20 7 47 20 33 47 88 73 27 27 67 7 0 12 27 25 27 27 20 33 7
18: 25 25 75 25 19 31 44 31 56 31 12 19 12 19 31 31 12 0 6 25 25 19 38 19 56
19: 7 29 21 36 79 14 29 7 21 36 25 29 57 14 21 7 27 6 0 6 14 21 0 21 0
20: 19 25 25 38 19 25 31 88 31 31 19 25 12 12 31 31 25 25 6 0 31 12 19 19 25
21: 14 0 33 31 9 36 30 36 21 27 25 36 15 14 25 29 27 25 14 31 0 14 12 10 38
22: 14 0 8 23 9 9 20 14 7 9 31 14 15 21 17 14 27 19 21 12 14 0 25 10 31
23: 25 0 50 23 0 27 30 21 36 18 19 29 15 7 25 21 20 38 0 19 12 25 0 20 46
24: 20 40 17 38 27 9 30 21 29 45 31 29 23 29 42 21 33 19 21 19 10 10 20 0 15
25: 46 15 46 8 8 31 31 21 29 38 12 21 0 29 15 36 7 56 0 25 38 31 46 15 0

(Figure-6.4 Levenshtein Pattern Analysis: percentage of high-, average- and low-accuracy patterns against threshold values from 50 to 80)

From the above graph it is observed that a threshold value of 70 is ideal for the Levenshtein distance: it provides a mean accuracy of 78.99, and all patterns have a mean accuracy between 70 and 78.99. The following are several limitations of the Levenshtein measure for pattern discovery in the current research context:

(1) Session 1: 2 5 7 8 9 10
Session 10: 2 5 7 8 9 10 12 13 14 10
Here both sessions request the same web objects and the order of the web objects is also similar, but the distance between them is only 55.

(2) Session 8: 7 6 5 2 1 5 6 9 10 12 14 11 10 9
Session 20: 7 6 5 2 1 5 6 9 10 12 14 11 10 9 13 5
Here the case is the same as in (1), but the distance between them is 88, which means the Levenshtein measure also takes the lengths of the two strings into account.

(3) Session 5: 5 7 9 11 12 13 14 15 2 3 14
Session 13: 3 6 9 11 12 13 14 15 2 3 14 10 12

Here the orders of the web objects are not exactly the same and some web objects differ between the two sessions, yet the distance between them is 69.

(4) Session 1: 2 5 7 8 9 10
Session 2: 6 8 9 12 15 2 5
Here the order is not the same, though several web objects are similar, such as 2, 5, 8 and 9, yet the distance between them is 0.

(5) Session 1: 2 5 7 8 9 10
Session 3: 3 4 5 6 7 9 10 11 12 15 14 13
Here the case is the same as the previous one, but the distance measure is 25.

6.2.3 Needleman-Wunsch Analysis

Table 6.4 describes the distances between clusters using the Needleman-Wunsch distance measurement technique. The distance between sessions is calculated using the equation described in section 6.1.1. Figure 6.5 shows the analysis of patterns according to the Needleman-Wunsch distance measurement technique. From the analysis it is found that it is very difficult to decide on a threshold value for this technique. The ideal threshold value is 85, but it covers only 32% of clusters, so it affects the cache hit ratio. According to the Needleman-Wunsch distance metric, every session is half similar to every other session, which is not true. Because it is based on global alignment, it also takes the lengths of sessions into account while comparing. There are several limitations in pattern discovery based on Needleman-Wunsch:

(1) Session 1: 2 5 7 8 9 10
Session 2: 6 8 9 12 15 2 5
Here the distance is 50% even though the ordering as well as the set of web objects is dissimilar.

(2) Session 1: 2 5 7 8 9 10
Session 3: 3 4 5 6 7 9 10 11 12 15 14 13
Here more web objects are similar than in the previous case, yet the distance is still 50%.

(3) Session 1: 2 5 7 8 9 10
Session 10: 2 5 7 8 9 10 12 13 14 10

Table 6.4 Distance between clusters using Needleman-Wunsch (row i lists the distances from cluster i to clusters 1-25)

1: 0 50 50 50 50 50 50 50 54 55 59 57 50 71 50 50 53 50 54 50 57 50 56 50 58
2: 50 0 58 54 55 50 60 54 50 55 50 50 50 50 62 50 53 56 54 53 50 50 50 60 58
3: 50 58 0 54 50 62 58 57 82 62 50 57 50 50 58 57 50 88 54 56 50 50 58 54 65
4: 50 54 54 0 65 54 62 57 64 62 81 82 58 50 85 54 87 53 57 69 54 58 50 58 50
5: 50 55 50 65 0 50 50 50 57 59 50 57 77 50 58 50 50 50 79 59 50 50 50 59 50
6: 50 50 62 54 50 0 64 50 57 55 50 50 54 50 54 89 50 59 50 56 50 50 55 50 62
7: 50 60 58 62 50 64 0 64 54 68 53 50 50 50 62 61 53 62 57 56 55 55 55 55 54
8: 50 54 57 57 50 50 64 0 57 64 53 54 54 50 61 54 53 59 54 88 50 50 50 57 54
9: 54 50 82 64 57 57 54 57 0 64 56 64 54 57 61 57 57 72 54 56 50 54 54 57 50
10: 55 55 62 62 59 55 68 64 64 0 53 57 50 54 67 54 57 59 54 56 50 50 50 59 54
11: 59 50 50 81 50 50 53 53 56 53 0 78 53 69 69 50 91 50 56 50 53 53 53 50 53
12: 57 50 57 82 57 50 50 54 64 57 78 0 54 57 68 50 83 50 54 59 54 54 50 50 50
13: 50 50 50 58 77 54 50 54 54 50 53 54 0 54 54 54 57 53 75 50 50 54 50 50 50
14: 71 50 50 50 50 50 50 50 57 54 69 57 54 0 50 50 57 50 57 53 54 54 50 50 57
15: 50 62 58 85 58 54 62 61 61 67 69 68 54 50 0 57 73 56 50 62 50 50 50 62 54
16: 50 50 57 54 50 89 61 54 57 54 50 50 54 50 57 0 54 62 54 59 50 50 54 54 64
17: 53 53 50 87 50 50 53 53 57 57 91 83 57 57 73 54 0 53 62 54 59 50 54 54 64
18: 50 56 88 53 50 59 62 59 72 59 50 50 53 50 56 62 53 0 53 56 50 50 56 53 69
19: 54 54 54 57 79 50 57 54 54 54 56 54 75 57 50 54 62 53 0 53 57 54 50 54 50
20: 50 53 56 69 59 56 56 88 56 56 50 59 50 53 62 59 54 56 53 0 50 53 50 50 56
21: 57 50 50 54 50 50 55 50 50 50 53 54 50 54 50 50 59 50 57 50 0 50 50 50 58
22: 50 50 50 58 50 50 55 50 54 50 53 54 54 54 50 50 50 50 54 53 50 0 50 50 50
23: 56 50 58 50 50 55 55 50 54 50 53 50 50 50 50 54 54 56 50 50 50 50 0 55 65
24: 50 60 54 58 59 50 55 57 57 59 50 50 50 50 62 54 54 53 54 50 50 50 55 0 54
25: 58 58 65 50 50 62 54 54 50 54 53 50 50 57 54 64 64 69 50 56 58 50 65 54 0

(Figure-6.5 Needleman Wunsch Pattern Analysis: percentage of high-, average- and low-accuracy patterns against threshold values from 55 to 90)

Here both sessions request the same pages and the order of the web objects is also similar, but the distance between them is only 55.

(4) Session 3: 3 4 5 6 7 9 10 11 12 15 14 13
Session 18: 8 9 10 2 3 4 5 6 7 9 10 11 12 15 14 13
Here the order as well as the number of web objects differ, yet the distance between them is 88%.

(5) Session 6: 3 8 7 9 4 6 10 11 12 13 15
Session 7: 2 3 4 6 9 11 12 14 15 8
Here the order as well as the number of web objects differ, yet their distance is 64%, which is higher than in case (3).

6.2.4 Smith-Waterman Analysis

Table 6.5 describes the distances between clusters using the Smith-Waterman technique. Figure 6.6 shows the analysis of patterns using the Smith-Waterman distance measurement technique.

Table 6.5 Distance between clusters using Smith-Waterman (row i lists the distances from cluster i to clusters 1-25)

1: 0 33 50 50 42 25 17 33 50 100 50 58 17 100 50 25 50 50 42 33 33 17 17 33 83
2: 33 0 29 57 36 21 36 29 29 36 57 43 43 100 64 21 57 29 36 29 21 29 29 43 29
3: 50 29 0 29 36 41 50 38 92 32 29 38 25 25 29 42 29 100 33 38 36 43 50 10 58
4: 50 57 29 0 27 23 50 35 27 50 100 62 27 31 83 19 100 31 27 35 43 36 31 45 27
5: 42 36 36 27 0 36 45 14 36 50 27 14 73 23 27 36 27 36 100 14 14 29 25 35 45
6: 25 21 41 23 36 0 30 18 41 32 23 18 32 14 23 100 23 41 36 18 21 21 25 20 41
7: 17 36 50 50 45 30 0 30 40 25 50 20 60 25 50 35 50 60 45 30 14 29 38 25 30
8: 33 29 38 35 14 18 30 0 43 32 32 25 19 18 38 14 32 32 11 100 57 29 25 20 23
9: 50 29 92 27 36 41 40 43 0 32 25 25 23 21 29 36 25 79 29 43 36 43 38 14 46
10: 100 36 32 50 50 32 25 32 32 0 50 41 36 55 50 32 50 32 50 32 43 29 12 45 59
11: 50 57 29 100 27 23 50 32 25 50 0 57 27 29 83 18 87 25 25 28 43 36 31 45 27
12: 58 43 38 62 14 18 20 25 25 41 57 0 12 25 42 14 61 32 14 25 43 36 44 30 46
13: 17 43 25 27 73 32 60 19 23 36 27 12 0 23 29 27 27 23 62 19 14 29 25 35 27
14: 100 100 25 31 23 14 25 18 21 55 29 25 23 0 42 11 29 21 18 18 29 29 25 30 38
15: 50 64 29 83 27 23 50 38 29 50 83 42 29 42 0 21 83 33 25 38 43 36 31 45 29
16: 25 21 42 19 36 100 35 14 36 32 18 14 27 11 21 0 18 36 29 14 21 29 25 20 35
17: 50 57 29 100 27 23 50 32 25 50 87 61 27 29 83 18 0 27 25 30 43 36 31 45 27
18: 50 29 100 31 36 41 60 32 79 32 25 32 23 21 33 36 27 0 29 28 36 43 50 20 58
19: 42 36 33 27 100 36 45 11 29 50 25 14 62 18 25 29 25 29 0 11 14 36 25 35 38
20: 33 29 38 35 14 18 30 100 43 32 28 25 19 18 38 14 30 28 11 0 57 29 25 20 23
21: 33 21 36 43 14 21 14 57 36 43 43 43 14 29 43 21 43 36 14 57 0 14 14 21 36
22: 17 29 43 36 29 21 29 29 43 29 36 36 29 29 36 29 36 43 36 29 14 0 57 14 29
23: 17 29 50 31 25 25 38 25 38 12 31 44 25 25 31 25 31 50 25 25 14 57 0 12 44
24: 33 43 10 45 35 20 25 20 14 45 45 30 35 30 45 20 45 20 35 20 21 14 12 0 30
25: 83 29 58 27 45 41 30 23 46 59 27 46 27 38 29 35 27 58 38 23 36 29 44 30 0
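The scores in Tables 6.3-6.5 are percentages, whereas a raw Smith-Waterman alignment score is an unbounded integer. One plausible way to reproduce percentage figures like those above is to normalize the raw score by the shorter session's self-alignment score; note that this normalization is our own assumption for illustration, since the exact formula used by the online tool is not documented here:

```python
def sw_score(s1, s2, match=1, mismatch=-1, gap=-1):
    """Best local alignment score (Smith-Waterman, negatives clamped to 0)."""
    M = [[0] * (len(s2) + 1) for _ in range(len(s1) + 1)]
    best = 0
    for i in range(1, len(s1) + 1):
        for j in range(1, len(s2) + 1):
            s = match if s1[i - 1] == s2[j - 1] else mismatch
            M[i][j] = max(0, M[i - 1][j - 1] + s,
                          M[i][j - 1] + gap, M[i - 1][j] + gap)
            best = max(best, M[i][j])
    return best

def similarity_percent(s1, s2):
    """Assumed normalization: scale the raw score by the shorter string's
    self-alignment score, so an exact substring match scores 100."""
    return round(100.0 * sw_score(s1, s2) / min(len(s1), len(s2)))

# Session 1 (BEGHIJ) occurs in order inside session 10 (BEGHIJLMNJ) -> 100
print(similarity_percent("BEGHIJ", "BEGHIJLMNJ"))
```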

(Figure-6.6 Smith Waterman Pattern Analysis: percentage of high-, average- and low-accuracy patterns against threshold values from 50 to 100)

Any threshold value from 65 to 100 is suitable for Smith-Waterman; the choice is made based on the space available in the proxy server. Because it is based on local alignment, it does not take the lengths of the strings into consideration. Several observations on the Smith-Waterman distance metric follow:

(1) Session 1: 2 5 7 8 9 10
Session 10: 2 5 7 8 9 10 12 13 14 10
Here the order as well as all web objects referred to in both sessions are similar, so the distance is 100%.

(2) Session 1: 2 5 7 8 9 10
Session 14: 6 8 9 12 15 2 5 1 2 5 7 8 9 10
Here the lengths of the strings are dissimilar, but a certain portion of the second string matches the first string in order, so the distance is 100%.

(3) Session 3: 3 4 5 6 7 9 10 11 12 15 14 13
Session 9: 6 4 5 6 7 9 10 11 12 15 14 13 10

Here the first string is not exactly the same as the second, but it is nearly the same, so the distance is 92%.

(4) Session 4: 2 4 6 8 9 10 12 14 15 3 9 8 6
Session 15: 2 4 6 8 9 10 12 14 15 3 2 1
Here the first string is nearly similar to the second string, but not as similar as in the previous case (3), so the distance is lower, namely 83%.

(5) Session 1: 2 5 7 8 9 10
Session 2: 6 8 9 12 15 2 5
Here only two web pages are similar, so the distance is 33%.

(6) Session 1: 2 5 7 8 9 10
Session 3: 3 4 5 6 7 9 10 11 12 15 14 13
Here a total of four web objects are similar, of which two are in order, so the distance is obviously more than in the previous case, namely 50%.

6.3 Approaches to Formation of Clusters

As discussed in the related work on clustering techniques, hierarchical clustering is not an efficient technique for the formation of clusters, because most hierarchical algorithms do not revisit clusters once they are built. For the prediction model of web caching and prefetching, partitioning (relocation-based) clustering is the natural choice for cluster formation. There are two main techniques for partitioning clustering: (1) K-means and (2) K-medoids. Both require the specification of K, the number of clusters that will be formed by the technique. In the proposed research the number of clusters cannot be predicted, so it is impossible to give a value of K at the initial level, and both of these techniques are therefore unsuitable in this context. Furthermore, in both K-means and K-medoids one object cannot be part of more than one cluster, while in the proposed research context one web object can be part of more than one cluster. The popular Fuzzy C-means technique attempts to divide n elements into a collection of m fuzzy clusters with respect to some criterion, but this algorithm also requires choosing the number of clusters, so it too does not fit this proposed research work perfectly. A new approach for the formation of clusters is therefore suggested in this work, consisting of the following steps (a small sketch of the approach is given at the end of this section):

[1] Determine the distance metric based on the Smith-Waterman distance metric technique.

Based on the quantifiable analysis of all the relevant distance measurement techniques, the Smith-Waterman technique is found to be the most suitable in the context of the proposed work. It is suitable for comparisons of strings of either similar or dissimilar length, and it also considers the ordering of web objects.

[2] Decide the threshold value in the context of the proxy server cache memory.
The threshold value should be decided based on the capacity of the proxy server cache memory, from the application's perspective.

[3] Form clusters of web objects based on the threshold value.
The formation of clusters is done based on the selected threshold value.

[4] Repeat step 3 with a new threshold value if required.
The formation of clusters is repeated according to a new threshold value, if the application requires it.
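Putting the steps together, the following sketch (our own illustration, reusing the sw_score and similarity_percent helpers from the sketch in section 6.2.4) shows how the threshold-based, possibly overlapping clusters might be formed:

```python
# Step [1]: reuses sw_score() and similarity_percent() from section 6.2.4.

def threshold_clusters(sessions, threshold):
    """Step [3]: group every session with all sessions whose Smith-Waterman
    similarity reaches the threshold; clusters may overlap and no cluster
    count K is fixed in advance."""
    clusters = []
    for s in sessions:
        cluster = sorted(t for t in sessions
                         if similarity_percent(s, t) >= threshold)
        if cluster not in clusters:          # drop duplicate groupings
            clusters.append(cluster)
    return clusters

# Steps [2] and [4]: choose a threshold suited to the proxy cache memory,
# and re-run with a new threshold if the resulting clusters are unsuitable.
sessions = ["BEGHIJ", "FHILOBE", "BEGHIJLMNJ"]
for th in (70, 50):
    print(th, threshold_clusters(sessions, th))
```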

6.4 Conclusion

This chapter has described related work on clustering data mining techniques from the perspective of web data. From the literature survey it was found that there are two main categories of clustering techniques: (1) hierarchical and (2) partition based. Hierarchical clustering divides into two main approaches, top-down and bottom-up, but both suffer from many limitations and do not suit the context of the proposed work; the main limitation is that clusters are not revisited once they are formed. Partitioning techniques overcome the limitations of hierarchical clustering in terms of optimizing cluster formation. There are two main approaches, centroid based and medoid based, but both require knowing the number of clusters in advance, which is not feasible in the context of the current work. Another limitation is that both techniques use ordinary distance measurement techniques, such as the Euclidean distance, which are not suitable for categorical data and do not consider the ordering of objects. A further limitation is that one object must be part of exactly one cluster, which does not hold in the proposed research, so this chapter also described the Fuzzy C-means technique. Fuzzy C-means overcomes that limitation, allowing one object to be part of more than one cluster, but it still requires knowing the number of clusters in advance, so it is not a perfect technique for forming clusters in the context of the proposed work. The challenge of the proposed work is to identify an appropriate distance measurement technique that is suitable for all categories of data and also considers the ordering of objects. This chapter dealt with distance measurement techniques, identified the appropriate ones in the context of the proposed work, carried out a quantitative analysis of those techniques, and identified the most relevant one. Finally, the chapter has given a new approach to forming clusters, based on the appropriate distance measurement technique in the context of the web caching and prefetching criteria. The new approach is a threshold-based approach in which the value of the threshold is decided based on the memory of the proxy server. The new approach is iterative, meaning it provides the liberty to select a new threshold value if the previous one is not up to the mark.