CHAPTER-6 WEB USAGE MINING USING CLUSTERING


6.1 Related Work in Clustering Technique
6.2 Quantifiable Analysis of Distance Measurement Techniques
6.3 Approaches to Formation of Clusters
6.4 Conclusion

Prediction Model for Web Caching and Prefetching with Web Usage Mining to optimize web objects

This chapter deals in detail with the clustering technique of web usage mining for pattern discovery. Cluster analysis is not a single algorithm; it comprises many different algorithms for pattern discovery that can be chosen according to the application and the targeted results. The first section of the chapter presents a literature survey of clustering techniques in general. The second section provides a quantifiable analysis of distance measurement techniques, and the remaining sections deal with the new approach to cluster formation and the associated analysis.

6.1 Related Work in Clustering Technique
Clustering is an unsupervised data mining technique that divides data among different groups for the pattern discovery task. It is also known as exploratory data analysis, since no labeled data are available [5]. The ultimate goal of clustering is to separate a finite set of unlabeled data into a discrete and finite set of useful, valid and hidden groups. A cluster is defined by internal homogeneity and external separation [87]. The clustering procedure is applied to preprocessed data consisting of only selected attributes. After preprocessing, an appropriate algorithm is selected or designed to achieve the targeted results. Finally, result interpretation is performed to provide meaningful insights to the end user. In [98] several characteristics of clusters are described which can be considered for better cluster formation. Clustering is very useful in several applications of data mining, document retrieval, image segmentation and pattern classification, for the purposes of pattern analysis, grouping, decision making and machine learning. Clustering has been studied extensively in the literature in the form of information retrieval and text mining [32,102], while very little work has been done for web analysis.
There are three main elements in cluster analysis: (1) effective similarity measures, (2) criterion functions and (3) algorithms.

Similarity Measures
Similarity is expressed as the distance between two objects. Different domains such as web analysis, information retrieval, recommendation systems and social network analysis require efficient techniques for measuring similarity among diverse objects. Various similarity measures have been proposed in the literature, falling into two main categories: (I) content based and (II) link based. The content based methods [118,46,73,95] evaluate similarity among objects such as web pages, persons and multimedia objects. The link based similarity measures [6,93] are useful for determining the similarity of the links of two web objects for search engine purposes. Link based similarity measures are out of the scope of the proposed research, since they are vital for the search engine aspect rather than for web usage mining. The most popular and commonly used distance metric studied in the literature is the Euclidean distance [1,68,53]. It is the ordinary distance between two points that is

measured with a ruler and is derived from the Pythagorean formula. The distance between two data points in the plane with coordinates (p, q) and (r, s) is given by:

DIST((p, q), (r, s)) = sqrt((p - r)^2 + (q - s)^2)

The usefulness of the Euclidean distance depends on the circumstances. For points in a plane it provides fairly good results, but with slow speed. It is also not useful for string distance measurement. The Euclidean distance can be extended to any number of dimensions. Another popular distance metric is the Manhattan distance, which computes the distance from one data point to another when a grid-like path is followed. The Manhattan distance between two items is the sum of the absolute differences of their corresponding elements [103]. The distance between two data points P = (p1, p2, ..., pn) and Q = (q1, q2, ..., qn) is given by:

DIST(P, Q) = Σ (i = 1 to n) |pi - qi|

It is the summation of the horizontal and vertical components, whereas a diagonal distance would be computed using the Pythagorean formula. It is closely related to the Euclidean distance and exhibits similar characteristics. It is useful in gaming applications such as chess to determine movement distance from one square to another. Minkowski is another well-known distance measurement technique that can be considered a generalization of both the Euclidean and Manhattan distances. Several studies [4,39,43] have been carried out based on the Minkowski distance to determine similarity among objects. The Minkowski distance of order p between two points P = (x1, x2, ..., xn) and Q = (y1, y2, ..., yn) in R^n is defined as:

DIST(P, Q) = ( Σ (i = 1 to n) |xi - yi|^p )^(1/p)

If the value of p equals one, the Minkowski distance is known as the Manhattan distance, while for p equal to two it becomes the Euclidean distance.
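As an illustration, the three metrics above can be sketched in Python; the function names are ours, chosen for this sketch, and the Minkowski form subsumes the other two:

```python
def minkowski(p, q, order):
    """Minkowski distance of the given order between two equal-length points."""
    return sum(abs(a - b) ** order for a, b in zip(p, q)) ** (1.0 / order)

def euclidean(p, q):
    # Order 2: the ordinary straight-line (Pythagorean) distance.
    return minkowski(p, q, 2)

def manhattan(p, q):
    # Order 1: the grid-path distance, a sum of absolute coordinate differences.
    return minkowski(p, q, 1)
```

For the points (0, 0) and (3, 4), the Euclidean distance is 5 while the Manhattan distance is 7, showing how the order p changes the metric.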
All the distances discussed so far are ordinary distances that could be measured with a ruler; they are not suitable for measuring the similarity of two strings and are therefore not appropriate in the context of the proposed research, since a web session consists of a number of strings in the form of URLs. The Hamming distance is a popular similarity measure for strings that determines similarity by counting the number of positions at which the corresponding characters differ. More formally, the Hamming distance between two strings P and Q of length n is:

DIST(P, Q) = Σ (i = 1 to n) [pi ≠ qi]

Hamming distance theory is used widely in several applications such as the quantification of information, the study of the properties of codes and secure communication. The main limitation of the Hamming distance is that it is only applicable to strings of the same length. It is widely used in the field of error-free communication [2,112] because the byte length remains the same for both parties involved in the communication.
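A minimal sketch of the position-wise comparison behind the Hamming distance (our own illustrative helper):

```python
def hamming(p, q):
    """Count the positions at which the corresponding characters differ."""
    if len(p) != len(q):
        # The Hamming distance is defined only for equal-length strings.
        raise ValueError("Hamming distance requires strings of equal length")
    return sum(a != b for a, b in zip(p, q))
```

The explicit length check reflects the limitation noted above: unequal-length web sessions cannot be compared this way.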

The Hamming distance is not very useful in the context of web usage mining, since the lengths of web sessions are not all the same. The Levenshtein distance, or edit distance, is a more sophisticated distance measurement technique for string similarity. It is the key distance in several fields such as optical character recognition, text processing, computational biology, fraud detection and cryptography, and has been studied extensively by many authors [79,42,9]. The Levenshtein distance between two strings S1 and S2 is given by Lev(|S1|, |S2|), where:

Lev(i, j) = max(i, j) if min(i, j) = 0, otherwise
Lev(i, j) = min( Lev(i-1, j) + 1, Lev(i, j-1) + 1, Lev(i-1, j-1) + [S1i ≠ S2j] )

The Levenshtein distance measurement technique is well suited to web sessions, since it is applicable to strings of unequal length. Several bioinformatics distance measurement techniques that are used to align protein or nucleotide sequences can also be applied from a web mining perspective to cluster web sessions of unequal length. One of the most important techniques of this category was invented by Saul B. Needleman and Christian D. Wunsch [80] to align protein sequences. This technique uses dynamic programming, which means solving a complex problem by breaking it down into simpler subproblems. It is a global alignment technique for which closely related sequences of similar length are most appropriate. Alignment is performed from the beginning to the end of the sequence to find the best possible alignment. The technique uses a scoring system: a positive or higher value is assigned for a match and a negative or lower value is assigned for a mismatch. It uses gap penalties to maximize the meaning of the sequence; the gap penalty is subtracted for each gap that has been introduced. There are two main types of gap penalties: open and extension.
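Before moving on to alignment scoring, the Levenshtein recurrence above translates directly into a dynamic programming table; a minimal illustrative sketch:

```python
def levenshtein(s1, s2):
    """Edit distance via dynamic programming (insert/delete/substitute, cost 1 each)."""
    m, n = len(s1), len(s2)
    # dist[i][j] = edit distance between the prefixes s1[:i] and s2[:j]
    dist = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(m + 1):
        dist[i][0] = i          # base case: delete all of s1[:i]
    for j in range(n + 1):
        dist[0][j] = j          # base case: insert all of s2[:j]
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            cost = 0 if s1[i - 1] == s2[j - 1] else 1
            dist[i][j] = min(dist[i - 1][j] + 1,         # deletion
                             dist[i][j - 1] + 1,         # insertion
                             dist[i - 1][j - 1] + cost)  # substitution
    return dist[m][n]
```

Unlike the Hamming helper, this function accepts strings of unequal length, which is what makes the edit distance a candidate for web sessions.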
The open penalty is applied at the start of a gap, and the gaps following it are given a gap extension penalty, which is smaller than the open penalty. Typical values are 12 for gap opening and 4 for gap extension. According to the Needleman-Wunsch algorithm, an initial matrix is created with dimension N * M, where N, the number of rows, equals the number of characters of the first string plus one, and M, the number of columns, equals the number of characters of the second string plus one. The extra row and column are used to align with a gap. After that, a scoring scheme is introduced, which can be user defined with specific scores. The simple basic scoring scheme is: if the two sequences at the i-th and j-th positions are the same, the matching score is 1 (S(i, j) = 1); if the two sequences at the i-th and j-th positions are not the same, the mismatch

score is taken as -1 (S(i, j) = -1). The gap penalty is also taken as -1. Whenever an operation such as insertion or deletion is performed, the dynamic programming matrix is built in three steps:

1. Initialization Phase: In the initialization phase the gap score is added cumulatively along the first row and the first column.
2. Matrix Filling Phase: This is the most crucial phase; matrix filling starts from the upper left-hand corner of the matrix. To find the maximum score of each cell, the diagonal, left and upper scores of the current position must be known. The match or mismatch score is added to the diagonal value, the gap penalty is added to the left and upper values, and the maximum of these three values is placed at position (i, j). The equation to calculate the maximum score is:

M(i, j) = max[ M(i-1, j-1) + S(i, j), M(i, j-1) + W, M(i-1, j) + W ]

where i and j denote the row and column, M(i, j) is the matrix value of the required cell, S(i, j) is the score of the required cell, and W is the gap penalty.
3. Alignment through Trace Back: The final step of the Needleman-Wunsch algorithm is the trace back for the best possible alignment, which can be identified using the maximum alignment score.

The Needleman-Wunsch distance measurement technique is a strong candidate for string similarity, so it is also considered in the proposed research context. Smith-Waterman is another important bioinformatics technique for aligning strings. It compares segments of all possible lengths and optimizes the measure of similarity. Temple F. Smith and Michael S. Waterman [97] were the founders of this technique. The main difference in comparison with Needleman-Wunsch is that negative scoring matrix cells are set to zero, which makes local alignment visible.
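The three-phase Needleman-Wunsch procedure described above can be sketched as follows; the scoring values mirror the simple scheme in the text (match +1, mismatch -1, gap -1), and only the alignment score is returned, without the trace back:

```python
def needleman_wunsch(s1, s2, match=1, mismatch=-1, gap=-1):
    """Global alignment score: M(i,j) = max(diag + S, left + W, up + W)."""
    m, n = len(s1), len(s2)
    M = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(m + 1):
        M[i][0] = i * gap      # initialization: cumulative gap scores, first column
    for j in range(n + 1):
        M[0][j] = j * gap      # initialization: cumulative gap scores, first row
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            s = match if s1[i - 1] == s2[j - 1] else mismatch
            M[i][j] = max(M[i - 1][j - 1] + s,   # diagonal: match/mismatch
                          M[i][j - 1] + gap,     # left: gap in s1
                          M[i - 1][j] + gap)     # up: gap in s2
    return M[m][n]
```

Identical strings score their full length (e.g. "GATTACA" against itself scores 7), while length differences are penalized through the leading and trailing gaps, which is why global alignment is sensitive to session length.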
This technique compares segments of varying lengths instead of looking at the entire sequence at once. The main advantages of the Smith-Waterman technique are: it gives the conserved regions between the two sequences; it can align partially overlapping sequences; and it can align a subsequence of a sequence to itself. Like Needleman-Wunsch, this technique uses a scoring matrix system. For the scoring system and gap analysis, the same concepts used in Needleman-Wunsch are applicable in Smith-Waterman. It also uses the same steps of initialization, matrix filling and alignment through trace

back. The equation to calculate the maximum score is the same as in Needleman-Wunsch. The main differences between Needleman-Wunsch and Smith-Waterman are: Needleman-Wunsch performs global alignment while Smith-Waterman focuses on local alignment. In Needleman-Wunsch cell scores may be positive or negative, while in Smith-Waterman negative cells are set to zero. In Needleman-Wunsch the score cannot decrease between two cells of a pathway, while in Smith-Waterman the score can increase, decrease or remain the same between two cells of a pathway. Table 6.1 describes the comparison of different distance metric techniques in the context of the proposed research.

Table 6.1 Comparison of distance metric techniques

1. Euclidean Distance
   Description: The distance between two points that would be measured with a ruler, calculated using the Pythagorean theorem.
   Advantages: (1) It is fast for determining correlation among points. (2) It is a fair measure because it compares data points based on actual values.
   Disadvantages: (1) It is not suitable for ordinal data such as strings. (2) It requires actual data, not ranks.

2. Levenshtein
   Description: A string metric for measuring the difference between two strings.
   Advantages: It is fast and well suited to string similarity.
   Disadvantages: It does not consider the order of the sequence of characters while comparing.

Table 6.1 Comparison of distance metric techniques (continued)

3. Needleman-Wunsch
   Description: A bioinformatics algorithm that provides global alignment between strings while comparing.
   Advantages: It is well suited to string comparison because it considers the ordering of the sequence of characters.
   Disadvantages: It requires strings of the same length while comparing.

4. Smith-Waterman
   Description: A bioinformatics algorithm that provides local alignment between strings while comparing.
   Advantages: It is well suited to string comparison because it considers the ordering of the sequence of characters, and it is applicable to strings of similar or dissimilar length.
   Disadvantages: It is more complex than global alignment techniques.

From the above table it can be seen that the Euclidean distance is not suitable for the proposed research, because web sessions consist of sequences of web objects in string format. The Levenshtein distance is a very good technique for string sequence similarity, but for the prediction model of web caching and prefetching the ordering of web objects is an important aspect that is ignored by this distance metric, so it is also not appropriate in the proposed research context. Both Needleman-Wunsch and Smith-Waterman consider the ordering of the sequence for string matching, so they are suitable for this context. Web sessions are not always of the same length, so the Needleman-Wunsch algorithm is not a perfect fit for the formation of web session clusters, as it only provides global alignment. The Smith-Waterman algorithm is applicable to sequences of both similar and dissimilar lengths, so it is the ideal algorithm for the formation of clusters in this proposed research.

Categories of Clustering Algorithms
Once an appropriate distance metric is identified, the next step is to determine the appropriate category of clustering algorithm. Clustering algorithms are basically of two main types, described in the following figure.
Hierarchical clustering is also known as connectivity based clustering. It is based on the fundamental concept that objects are more related to nearby objects than to objects far away. Hierarchical clustering algorithms connect different objects based on their distance measurement. The different clusters in hierarchical clustering are represented in binary tree

format. Hierarchical clustering algorithms are either top-down or bottom-up. Top-down clustering is also known as a splitting algorithm; it proceeds by splitting clusters recursively until an individual object is reached. Bottom-up clustering is known as a merging algorithm.

(Figure-6.1 Categories of Clustering Algorithms)

Bottom-up clustering algorithms [ ] begin with n clusters, each containing a single sample or point. Two clusters are then merged so that the distance between them becomes as small as possible. The graphical representation of both techniques is as follows:

(Figure-6.2 Graphical Representation of Hierarchical Clustering Techniques)

The main advantages of hierarchical algorithms are:
(1) It is not required to specify the number of clusters in advance.

(2) The generation of smaller clusters is possible, which may be helpful for the discovery of important information.

However, there are a number of limitations of this category of clustering:
(a) Objects might be grouped incorrectly, so the result should be examined closely before proceeding to the next phase.
(b) The use of different distance techniques may generate different results.
(c) Interpretation of the result is subjective.
(d) Interpretation of the hierarchy is complex and often confusing.
(e) Research shows that most hierarchical algorithms do not revisit clusters once they are built.

Hierarchical clustering is not ideal in the context of the proposed research because it is not flexible in cluster formation; it provides a rigid means in terms of optimization of clustering results. Sometimes the grouping of clusters is also not up to the mark. In hierarchical clustering, ordinary distance metric techniques are used that do not suit the web usage mining process. Partitioning clustering techniques are another category of clustering techniques. They are very effective, relocation based clustering techniques. There are two main approaches: (I) centroid and (II) medoid. The gravity center of the objects is taken as the measure representing each cluster. The k-means algorithm is the well-known centroid based algorithm. There are three main steps in the k-means algorithm: (i) a center point is determined and each cluster is associated with a center point; (ii) each point is assigned to the cluster with the closest center point; (iii) K, the number of clusters, must be specified. According to k-means, the initial cluster centers are selected randomly [44-125]. Using the Euclidean distance measurement technique, every item is assigned to its nearest cluster center. Each cluster center is then moved to the mean of its assigned items. The assignment and update steps are repeated until the change in cluster assignments becomes less than a threshold value.
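The k-means steps above can be sketched in plain Python (a simplified illustration; `math.dist` requires Python 3.8+, and the initial centers are drawn at random as the text describes):

```python
import math
import random

def kmeans(points, k, iters=100, seed=0):
    """Plain k-means: assign each point to its nearest center, then move each
    center to the mean of its assigned points, until assignments settle."""
    rng = random.Random(seed)
    centers = rng.sample(points, k)          # random initial cluster centers
    clusters = [[] for _ in range(k)]
    for _ in range(iters):
        clusters = [[] for _ in range(k)]
        for p in points:
            # assign to the nearest center using Euclidean distance
            idx = min(range(k), key=lambda c: math.dist(p, centers[c]))
            clusters[idx].append(p)
        # move each center to the mean of its assigned points
        new_centers = [
            tuple(sum(coord) / len(cl) for coord in zip(*cl)) if cl else centers[i]
            for i, cl in enumerate(clusters)
        ]
        if new_centers == centers:
            break                            # assignments settled
        centers = new_centers
    return centers, clusters
```

Note that the mean in the update step is only defined for numeric data; on string-valued web sessions no such mean exists, which is exactly one of the limitations discussed next.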
The k-means algorithm exhibits a number of characteristics: (A) It is most suitable for large data sets. (B) It is sensitive to noise.

(C) It terminates at a local optimum, meaning it is optimal only within a neighboring set of candidate solutions. (D) The resulting clusters tend to be spherical in shape. K-means is a good partition based clustering algorithm, but it exhibits certain limitations in the context of the proposed research work:
1. It is not possible to predict the number of clusters K in advance in the proposed research.
2. K-means has problems when clusters are of different sizes. In the proposed research it is not possible for all clusters to have the same size.
3. K-means has an outlier problem.
4. Empty clusters are possible in K-means.
5. It is applicable only when the mean of clusters is defined, and it is not suitable for categorical data. In the proposed research the data may be categorical.
6. Results depend heavily on the initial partitions.
7. One object may not be part of more than one cluster, whereas according to the proposed research one session might be part of a number of clusters.
8. The distance to the centroid is calculated based on an ordinary distance metric technique such as the Euclidean distance, which would be measured with a ruler and calculated using the Pythagorean theorem. It does not suit string data, while web sessions are in the form of strings.

K-medoid is another powerful partitioning clustering algorithm, based on the medoid philosophy. It is more robust to noise and outliers than k-means because it minimizes a sum of pairwise dissimilarities instead of a sum of squared Euclidean distances. The medoid is the most centrally located point in the cluster, minimizing the average dissimilarity to all the objects in the cluster. The most important and common algorithm of this category is Partitioning Around Medoids (PAM). The PAM algorithm, like k-means, requires selecting K clusters randomly. Each object is associated with the closest medoid, where the closest distance is defined using a valid distance metric technique, most commonly the Euclidean distance.
For each pair of a medoid and a non-medoid object, the two are swapped and the total cost of the configuration is computed; the configuration with the lowest cost is selected. The steps of association to the closest medoid and swapping are repeated until there is no change in the medoids. The main advantage of K-medoid is that the outlier problem is removed, but it still faces a number of the same limitations as k-means: K clusters are required in advance, one object may not be part of more than one cluster, and an ordinary distance metric technique such as the Euclidean distance is used for calculating the medoid.
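A naive sketch of the PAM swap loop described above (illustrative only; practical PAM implementations use a more efficient cost update, and Euclidean distance is used here purely as the textbook default):

```python
import math
from itertools import product

def total_cost(points, medoids):
    """Sum of distances from every point to its closest medoid."""
    return sum(min(math.dist(p, m) for m in medoids) for p in points)

def pam(points, k):
    """Naive PAM: greedily swap a medoid with a non-medoid object
    whenever the swap lowers the total configuration cost."""
    medoids = list(points[:k])               # naive initial choice
    improved = True
    while improved:
        improved = False
        for m, p in product(list(medoids), points):
            if p in medoids:
                continue                     # only swap in non-medoid objects
            candidate = [p if x == m else x for x in medoids]
            if total_cost(points, candidate) < total_cost(points, medoids):
                medoids = candidate          # keep the cheaper configuration
                improved = True
    return medoids
```

Because each medoid is an actual data point rather than a mean, the same loop would work with any pairwise dissimilarity function in place of `math.dist`, which is why medoid-style relocation pairs naturally with string distances later in this chapter.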

In both the k-means and K-medoid techniques, one object cannot be part of more than one cluster, while in the proposed research context one web object could be part of more than one cluster. One popular technique of clustering is Fuzzy C-Means [81-126], which attempts to divide n elements into a collection of m fuzzy clusters with respect to some criterion. The Fuzzy C-Means algorithm is simple and contains three main steps. The first step is to select the number of clusters. The second step is to randomly assign to each point coefficients for being in the clusters. The third step has two sub-steps: compute the centroid for each cluster, and, for each point, compute its coefficients of being in the clusters. The third step is repeated until the change in the coefficients between two iterations is no more than a threshold value. In this kind of clustering every object has a degree of belonging to clusters, as in fuzzy logic, rather than belonging completely to just one cluster. The main advantage is that the algorithm minimizes intra-cluster variance, but it has the same problems: it requires predicting the number of clusters in advance, and results depend on the initial choice of weights. The following table describes the above mentioned clustering techniques in the context of the proposed research work in a condensed manner.

Table 6.2 Clustering technique characteristics

1. Top-Down Hierarchical Clustering
   Characteristics: It splits clusters recursively until an individual object is reached. It is not required to specify the number of clusters in advance. Smaller clusters are possible.
   Justification with Proposed Research: This technique does not revisit clusters once they are built, but web session clusters require repetition until an appropriate cluster is formed.

2. Bottom-Up Hierarchical Clustering
   Characteristics: It begins with n clusters and then merges two clusters so that the distance between them becomes as small as possible.
   Justification with Proposed Research: The same problem as top-down clustering of not revisiting clusters after formation.

3. K-Means (Centroid based Partitioning Algorithm)
   Characteristics: Using the Euclidean distance measurement technique, it assigns every item to its nearest cluster center.
   Justification with Proposed Research: It requires predicting the number of clusters K in advance, which is not possible in the proposed research context.

Table 6.2 Clustering technique characteristics (continued)

3. K-Means (continued)
   Justification with Proposed Research: It uses an ordinary distance measurement technique and is applicable only to numerical data, but web sessions are in the form of string data, and the sequence is also important in the formation of clusters. One object may not be part of more than one cluster, but according to the proposed research one session may be part of a number of clusters. Outlier or noise problems may arise.

4. K-Medoid (Medoid based Partitioning Algorithm)
   Characteristics: It is based on the medoid philosophy. It is more robust to noise and outliers, since it minimizes a sum of pairwise dissimilarities instead of a sum of squared Euclidean distances.
   Justification with Proposed Research: It exhibits similar limitations to k-means in the context of the proposed research work.

5. Fuzzy C-Means
   Characteristics: It attempts to divide n elements into a collection of m fuzzy clusters with respect to some criterion, so one object may be part of a number of clusters.
   Justification with Proposed Research: It requires predicting the number of clusters K in advance, which is not possible in the proposed research work. It also uses an ordinary distance measurement technique that is not suitable in the context of the proposed research work.

Table 6.2 shows that no clustering technique is perfect for the prediction model for web caching and web prefetching. There are two main limitations of all the above mentioned clustering techniques: first, no appropriate distance measurement technique is used, and second, the number of clusters cannot be predicted in advance. In the proposed

research context, no clustering technique can be used directly as it is. The first important point is to determine an appropriate distance measurement technique in the context of the proposed research work. The next section of this chapter deals with a quantifiable analysis of the distance measurement techniques that suit the proposed research context.

6.2 Quantifiable Analysis of Distance Measurement Techniques
It is clear from the previous section that the Levenshtein, Needleman-Wunsch and Smith-Waterman distance measurement techniques are appropriate for the clustering of web sessions in terms of the prediction model of web caching and web prefetching. This section analyzes all the distance measurement techniques in a quantifiable manner and decides which one is the most efficient in the context of the current work.

Infrastructure and Experimental Environment
The following infrastructure is used in the experiment of analyzing the distance measurement techniques in a quantifiable manner:
(1) Personal Computer: Intel Pentium 4 CPU, 2.40 GHz, 1 GB of RAM, 20 GB hard disk.
(2) Online tool for distance measurement: this tool is available at for distance measurement based on many distance measurement techniques.
(3) Internet: used for download purposes.
(4) Sample raw log file: downloaded from the NASA site, containing transactions of users between to
(5) Operating System: Microsoft Windows XP Professional version 2002, Service Pack 2.

As far as the experimental environment is concerned, the following steps are taken:
(a) Each web object number is converted into the corresponding alphabet letter, since the online tool deals with string data. For example:
Cluster 1: 2, 5, 7, 8, 9, 10 is converted to BEGHIJ
Cluster 2: 6, 8, 9, 12, 15, 2, 5 is converted to FHILOBE
(b) The same sample data as for the Markov model is taken to experiment with the results of the Levenshtein distance measurement technique.

Levenshtein Analysis

Table 6.3 describes the distance between clusters using the Levenshtein distance measurement technique. The distance between sessions is calculated using the equation described in Section 6.1. Figure 6.3 is a snapshot of the online tool used for distance measurement. The online tool requires strings to match the distance, so the sessions are converted.

(Figure-6.3 Snapshot of online tool for distance measurement)

Table 6.3 Distance between clusters using Levenshtein

(Figure-6.4 Levenshtein Pattern Analysis)

From the above graph it is observed that a threshold value of 70 is ideal for the Levenshtein distance; it provides a mean accuracy of , and all patterns have a mean accuracy between 70 and . The following are several limitations of the Levenshtein measure for pattern discovery in the current research context:
(1) Session 1: Session 10: Here both sessions request the same web objects and the order of the web objects is also similar, but the distance between them is only 55.
(2) Session 8: Session 20: Here the case is the same as in (1), but the distance between them is 88, which means that Levenshtein also considers the lengths of the two strings.
(3) Session 5: Session 13:

Here the order of the web objects is not exactly the same and some web objects differ between the two sessions, yet the distance between them is 69.
(4) Session 1: Session 2: Here the order is not the same but several web objects are similar, such as 2, 5, 8, 9, yet the distance between them is 0.
(5) Session 1: Session 3: Here the case is the same as the previous one but the distance measure is .

Needleman-Wunsch Analysis
Table 6.4 describes the distance between clusters using the Needleman-Wunsch distance measurement technique. The distance between sessions is calculated using the equation described in Section 6.1. Figure 6.5 describes the analysis of patterns according to the Needleman-Wunsch distance measurement technique. From the analysis it is found that it is very difficult to decide the threshold value for this technique. The ideal threshold value is 85, but it covers only 32% of the clusters, so it affects the cache hit ratio. According to the Needleman-Wunsch distance metric every session is half similar to every other session, which is also not true. It is based on global alignment, so it also considers the lengths of the sessions while comparing. There are several limitations in pattern discovery based on Needleman-Wunsch:
(1) Session 1: Session 2: Here the distance is 50% even though the ordering as well as the number of web objects is dissimilar.
(2) Session 1: Session 3: Here more web objects are similar than in the previous case, yet the distance is still 50%.
(3) Session 1: Session 10:

Table 6.4 Distance between clusters using Needleman Wunsch

(Figure-6.5 Needleman Wunsch Pattern Analysis)

Here both sessions request the same pages and the order of the web objects is also similar, but the distance between them is only 55.
(4) Session 3: Session 18: Here the order as well as the number of web objects differ, yet the distance between them is 88%.
(5) Session 6: Session 7: Here the order as well as the number of web objects differ, yet their distance is 64%, which is higher than in case (3).

Smith Waterman Analysis
Table 6.5 describes the distance between clusters using the Smith-Waterman technique. Figure 6.6 describes the analysis of patterns using the Smith-Waterman distance measurement technique.

Table 6.5 Distance between clusters using Smith Waterman

(Figure-6.6 Smith Waterman Pattern Analysis)

Any threshold value from 65 to 100 is ideal for Smith-Waterman, and the value is decided based on the space available in the proxy server. It is based on local alignment, so it does not take the lengths of the strings into consideration. Several observations on the Smith-Waterman distance metric are as follows:
(1) Session 1: Session 10: Here the order as well as all the web objects referred to in both sessions are similar, so the distance is 100%.
(2) Session 1: Session 14: Here the lengths of the strings are dissimilar, but a certain portion of the second string is the same as the first string, in order, so the distance is 100%.
(3) Session 3: Session 9:

Here the first string is not exactly the same as the second, but it is nearly the same, so the distance is 92%.

(4) Session 4 vs. Session 15: Here the first string is nearly similar to the second string, but less similar than in the previous case (3), so the distance is lower: 83%.

(5) Session 1 vs. Session 2: Here only two web pages are similar, so the distance is 33%.

(6) Session 1 vs. Session 3: Here a total of four web objects are similar, and out of those, two are in order, so the distance is naturally more than in the previous case: 50%.

6.3 Approaches to Formation of Clusters

As discussed in the related work on clustering techniques, hierarchical clustering is not an efficient technique for forming clusters because most hierarchical algorithms never revisit a cluster once it is formed. For the prediction model of web caching and prefetching, partitioning relocation clustering is an ideal basis for cluster formation. There are two main partitioning techniques: (1) K-means and (2) K-medoids. Both require the specification of K, the number of clusters to be formed by the technique. In the proposed research the number of clusters cannot be predicted, so it is impossible to supply a value of K at the start, and neither technique is therefore suitable in this context. Moreover, in both K-means and K-medoids one object cannot be part of more than one cluster, while in the proposed research context one web object can be part of more than one cluster. One popular clustering technique is Fuzzy C-Means, which attempts to divide n elements into m fuzzy clusters according to some criterion, but this algorithm also requires choosing the number of clusters, so it does not fit the proposed research work perfectly either. A new approach for the formation of clusters is therefore suggested in this work.
The steps of this approach are as follows:

[1] Determine the distance metric based on the Smith-Waterman distance metric technique.

Based on the quantifiable analysis of all relevant distance measurement techniques, the Smith-Waterman technique is found to be the most suitable in the context of the proposed work: it handles comparisons of both similar and dissimilar lengths, and it also considers the ordering of web objects.

[2] Decide the threshold value in the context of the proxy server cache memory. The threshold value should be decided based on the capacity of the proxy server cache memory, from the application's perspective.

[3] Form clusters of web objects based on the threshold value. Cluster formation is done according to the selected threshold value.

[4] Repeat step 3 with a new threshold value if required. Cluster formation is repeated according to the new threshold value whenever the application requires it.

6.4 Conclusion

This chapter has described related work on clustering data mining techniques from the perspective of web data. From the literature survey it is found that there are two main categories of clustering techniques: (1) hierarchical and (2) partition-based. Hierarchical clustering is divided into two main approaches, top-down and bottom-up, but both suffer from many limitations and do not suit the proposed work; the main limitation is that clusters are never revisited once they are formed. Partitioning techniques overcome this limitation of hierarchical clustering in terms of optimizing cluster formation. They come in two main variants, centroid-based and medoid-based, but both require the number of clusters to be known in advance, which is not feasible in the context of the current work. A further limitation is that both use ordinary distance measures such as Euclidean distance, which is neither suitable for categorical data nor sensitive to the ordering of objects.
Another main limitation is that one object must belong to exactly one cluster, which does not hold in the proposed research, so this chapter also described the Fuzzy C-Means technique. Fuzzy C-Means overcomes that limitation by allowing one object to be part of more than one cluster, but it still requires the number of clusters to be known in advance, so it is not a perfect technique for forming clusters in the context of the proposed work. The challenge of the proposed work is to identify an appropriate distance measurement technique that suits all categories of data and also considers the ordering of objects. This chapter dealt with distance measurement techniques, identified those appropriate to the proposed work, carried out a quantitative analysis of them, and identified the most relevant one in this context. Finally, the chapter has given a new approach to forming clusters. The new approach is based on the appropriate distance measurement technique for the web caching and prefetching criteria. It is a threshold-based approach, where the value of the threshold is decided based on the memory of the proxy server, and it is iterative, meaning it provides the liberty to select a new threshold value if the previous one is not satisfactory.
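The threshold-based, iterative formation procedure of Section 6.3 can be sketched as follows. The similarity function would in practice be the Smith-Waterman metric (0-100); here a trivial overlap measure stands in so the sketch is self-contained, and the sessions and threshold values are hypothetical. Note that, unlike K-means or K-medoids, a session may fall into more than one cluster:

```python
# Threshold-based cluster formation (a sketch of Section 6.3's steps).
# `sim` is any 0-100 similarity; a placeholder overlap measure is used
# here instead of Smith-Waterman so the example runs standalone.

def overlap_sim(a, b):
    # Placeholder metric: percentage of the shorter session's pages
    # that also occur in the other session (ignores ordering).
    common = len(set(a) & set(b))
    return 100 * common / min(len(a), len(b))

def form_clusters(sessions, threshold, sim=overlap_sim):
    """Step 3: every session seeds a cluster containing all sessions
    within `threshold` similarity of it. A session may appear in
    several clusters; duplicate clusters are merged."""
    clusters = []
    for seed in sessions:
        cluster = frozenset(i for i, s in enumerate(sessions)
                            if sim(seed, s) >= threshold)
        if cluster not in clusters:
            clusters.append(cluster)
    return clusters

sessions = [["P1", "P2", "P3"], ["P1", "P2"], ["P7", "P8"]]
# Step 2: threshold chosen from proxy-cache capacity (hypothetical value).
print(form_clusters(sessions, threshold=90))
# Step 4: repeat with a new threshold if the first proves unsatisfactory.
print(form_clusters(sessions, threshold=40))
```

Swapping `overlap_sim` for the Smith-Waterman similarity of Section 6.2 yields the order-aware, length-tolerant behavior the chapter argues for.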


More information

Efficiency of k-means and K-Medoids Algorithms for Clustering Arbitrary Data Points

Efficiency of k-means and K-Medoids Algorithms for Clustering Arbitrary Data Points Efficiency of k-means and K-Medoids Algorithms for Clustering Arbitrary Data Points Dr. T. VELMURUGAN Associate professor, PG and Research Department of Computer Science, D.G.Vaishnav College, Chennai-600106,

More information

数据挖掘 Introduction to Data Mining

数据挖掘 Introduction to Data Mining 数据挖掘 Introduction to Data Mining Philippe Fournier-Viger Full professor School of Natural Sciences and Humanities philfv8@yahoo.com Spring 2019 S8700113C 1 Introduction Last week: Association Analysis

More information

Clustering part II 1

Clustering part II 1 Clustering part II 1 Clustering What is Cluster Analysis? Types of Data in Cluster Analysis A Categorization of Major Clustering Methods Partitioning Methods Hierarchical Methods 2 Partitioning Algorithms:

More information

INF4820 Algorithms for AI and NLP. Evaluating Classifiers Clustering

INF4820 Algorithms for AI and NLP. Evaluating Classifiers Clustering INF4820 Algorithms for AI and NLP Evaluating Classifiers Clustering Murhaf Fares & Stephan Oepen Language Technology Group (LTG) September 27, 2017 Today 2 Recap Evaluation of classifiers Unsupervised

More information

Chapter DM:II. II. Cluster Analysis

Chapter DM:II. II. Cluster Analysis Chapter DM:II II. Cluster Analysis Cluster Analysis Basics Hierarchical Cluster Analysis Iterative Cluster Analysis Density-Based Cluster Analysis Cluster Evaluation Constrained Cluster Analysis DM:II-1

More information

Mining di Dati Web. Lezione 3 - Clustering and Classification

Mining di Dati Web. Lezione 3 - Clustering and Classification Mining di Dati Web Lezione 3 - Clustering and Classification Introduction Clustering and classification are both learning techniques They learn functions describing data Clustering is also known as Unsupervised

More information

Machine Learning. Unsupervised Learning. Manfred Huber

Machine Learning. Unsupervised Learning. Manfred Huber Machine Learning Unsupervised Learning Manfred Huber 2015 1 Unsupervised Learning In supervised learning the training data provides desired target output for learning In unsupervised learning the training

More information

Workload Characterization Techniques

Workload Characterization Techniques Workload Characterization Techniques Raj Jain Washington University in Saint Louis Saint Louis, MO 63130 Jain@cse.wustl.edu These slides are available on-line at: http://www.cse.wustl.edu/~jain/cse567-08/

More information

Dynamic Programming User Manual v1.0 Anton E. Weisstein, Truman State University Aug. 19, 2014

Dynamic Programming User Manual v1.0 Anton E. Weisstein, Truman State University Aug. 19, 2014 Dynamic Programming User Manual v1.0 Anton E. Weisstein, Truman State University Aug. 19, 2014 Dynamic programming is a group of mathematical methods used to sequentially split a complicated problem into

More information

Unsupervised Learning

Unsupervised Learning Unsupervised Learning Unsupervised learning Until now, we have assumed our training samples are labeled by their category membership. Methods that use labeled samples are said to be supervised. However,

More information

Visual Representations for Machine Learning

Visual Representations for Machine Learning Visual Representations for Machine Learning Spectral Clustering and Channel Representations Lecture 1 Spectral Clustering: introduction and confusion Michael Felsberg Klas Nordberg The Spectral Clustering

More information

Data Informatics. Seon Ho Kim, Ph.D.

Data Informatics. Seon Ho Kim, Ph.D. Data Informatics Seon Ho Kim, Ph.D. seonkim@usc.edu Clustering Overview Supervised vs. Unsupervised Learning Supervised learning (classification) Supervision: The training data (observations, measurements,

More information

Mouse, Human, Chimpanzee

Mouse, Human, Chimpanzee More Alignments 1 Mouse, Human, Chimpanzee Mouse to Human Chimpanzee to Human 2 Mouse v.s. Human Chromosome X of Mouse to Human 3 Local Alignment Given: two sequences S and T Find: substrings of S and

More information

CHAPTER VII INDEXED K TWIN NEIGHBOUR CLUSTERING ALGORITHM 7.1 INTRODUCTION

CHAPTER VII INDEXED K TWIN NEIGHBOUR CLUSTERING ALGORITHM 7.1 INTRODUCTION CHAPTER VII INDEXED K TWIN NEIGHBOUR CLUSTERING ALGORITHM 7.1 INTRODUCTION Cluster analysis or clustering is the task of grouping a set of objects in such a way that objects in the same group (called cluster)

More information

Sequence analysis Pairwise sequence alignment

Sequence analysis Pairwise sequence alignment UMF11 Introduction to bioinformatics, 25 Sequence analysis Pairwise sequence alignment 1. Sequence alignment Lecturer: Marina lexandersson 12 September, 25 here are two types of sequence alignments, global

More information

Sequence Alignment. part 2

Sequence Alignment. part 2 Sequence Alignment part 2 Dynamic programming with more realistic scoring scheme Using the same initial sequences, we ll look at a dynamic programming example with a scoring scheme that selects for matches

More information

Lesson 3. Prof. Enza Messina

Lesson 3. Prof. Enza Messina Lesson 3 Prof. Enza Messina Clustering techniques are generally classified into these classes: PARTITIONING ALGORITHMS Directly divides data points into some prespecified number of clusters without a hierarchical

More information

Unsupervised Learning. Andrea G. B. Tettamanzi I3S Laboratory SPARKS Team

Unsupervised Learning. Andrea G. B. Tettamanzi I3S Laboratory SPARKS Team Unsupervised Learning Andrea G. B. Tettamanzi I3S Laboratory SPARKS Team Table of Contents 1)Clustering: Introduction and Basic Concepts 2)An Overview of Popular Clustering Methods 3)Other Unsupervised

More information