Sponsored by AIAT.or.th and KINDML, SIIT


CC: BY NC ND

Table of Contents

Chapter 4. Clustering and Association Analysis
    Cluster Analysis or Clustering
        Distance and similarity measurement
        Clustering Methods
            Partition-based Methods
            Hierarchical-based clustering
            Density-based clustering
            Grid-based clustering
            Model-based clustering
    Association Analysis and Frequent Pattern Mining
        Apriori algorithm
        FP-Tree algorithm
        CHARM algorithm
        Association Rules with Hierarchical Structure
        Efficient Association Rule Mining with Hierarchical Structure
    Historical Bibliography
    Exercise

Chapter 4. Clustering and Association Analysis

This chapter presents two unsupervised tasks, clustering and association rule mining. Clustering groups input data into a number of groups without examples or clues, while association rule mining finds frequently co-occurring events using correlation analysis. The chapter provides basic techniques for clustering and association rule mining, in that order.

Cluster Analysis or Clustering

Unlike classification, cluster analysis (also called clustering or data segmentation) handles data objects whose class labels are unknown or not given. Clustering is known as unsupervised learning because it does not rely on predefined classes or class-labeled training examples. For this reason, clustering is a form of learning by observation rather than learning by examples (as in classification). Although classification is an effective means for distinguishing groups or classes of objects, it often requires the costly process of labeling a large set of training tuples or patterns, which the classifier uses to model each group. It is therefore often more desirable to proceed in the reverse direction: first partition the set of data into groups based on data similarity (e.g., using clustering), and then assign labels to the relatively small number of groups. In other words, we can analyze data objects directly when the class label of each object is known (classification tasks), but in large databases it is quite common that objects have no label assigned (clustering tasks), since the process of assigning class labels is very costly. In such situations, where we have a set of data objects without any given class label, one often needs to arrange them into groups based on their similarity in order to discover some structure within the object collection. Moreover, a cluster of data objects can be treated collectively as one group and so may be considered a form of data compression. As a by-product of clustering, it is also possible to detect outliers, i.e., objects that deviate from the norm. One additional advantage of such a clustering-based process is that it is adaptable to changes and helps identify useful features that distinguish different groups.

In data mining, efforts have focused on finding methods for efficient and effective cluster analysis in large databases. Active research themes include the scalability of clustering methods, the effectiveness of methods for clustering complex shapes and types of data, high-dimensional clustering techniques, and methods for clustering mixed numerical and categorical data in large databases. Clustering attempts to group a set of given data objects into classes or clusters so that objects within a cluster have high similarity to one another but are very dissimilar to objects in other clusters. Similarities and dissimilarities are assessed based on the attribute values describing the objects, using distance measures. There are many clustering approaches, including partitioning methods, hierarchical methods, density-based methods, grid-based methods, model-based methods, methods for high-dimensional data (such as frequent-pattern-based methods), and constraint-based clustering. Conceptually, cluster analysis is an important human activity. Early in childhood, we learn how to distinguish between dogs and cats, between birds and fish, or between animals and plants by continuously improving subconscious clustering schemes.
With automated clustering, dense and sparse regions are identified in object space and, therefore, overall distribution patterns and interesting correlations among data attributes are discovered. Clustering has its roots in many areas, including data mining, statistics, information theory and machine learning. It has been widely used in numerous applications, such as market research, pattern recognition, data analysis, biological discovery, and image processing. In business, clustering can help marketing

people discover distinct groups in their customer bases and characterize customer groups based on purchasing patterns. In biology, it can be used to derive plant and animal taxonomies, categorize genes with similar functionality, and gain insight into structures inherent in populations. Clustering may also assist in the identification of areas of similar land usage in an earth observation database, in the identification of groups of houses in a city according to house type, value, and geographic location, and in the identification of groups of automobile insurance policy holders with a high average claim cost. It can also be used to help classify online documents, e.g., web documents, for information discovery.

As one major data mining function, cluster analysis can be used as a stand-alone tool to gain insight into the distribution of data, to observe the characteristics of each cluster, and to focus on a particular set of clusters for further analysis. Alternatively, it may serve as a preprocessing step for other algorithms, such as characterization, attribute subset selection, and classification, which would then operate on the detected clusters and the selected attributes or features. In the context of the huge amounts of data collected in databases, cluster analysis has recently become a highly active topic in data mining research. As a branch of statistics, cluster analysis has been extensively studied for many years, focusing mainly on distance-based cluster analysis. Cluster analysis tools based on k-means, k-medoids, and several other methods have also been built into many statistical analysis software packages or systems, such as DBMiner, Weka, KNIME, RapidMiner, SPSS, and SAS. In machine learning, clustering is an example of unsupervised learning.

The result of clustering can be expressed in different ways. Firstly, the groups that are identified may be exclusive, so that any instance belongs to only one group. Secondly, it is also possible to have overlapping clusters, i.e., an instance may fall into several groups. Thirdly, the clusters may be probabilistic, in the sense that an instance belongs to each group with a certain probability. Fourthly, the clusters may be hierarchical, such that there is a crude division of instances into groups at the top level, and each of these groups is refined further, perhaps all the way down to individual instances. Which of these possibilities should be applied depends on the constraints and purposes of the clustering. Two more factors are whether clusters are formed in discrete domains (categorical or nominal attributes), numeric domains (integer- or real-valued attributes), or hybrid domains, and whether clusters are formed incrementally or statically.

The following is a list of typical issues that we need to consider for clustering in real applications. Firstly, practical clustering techniques are required to handle large-scale data, say millions of data objects, since real applications usually involve such large amounts of data. Secondly, clustering techniques need to handle various types of attributes, including interval-based (numerical) data, binary data, categorical or nominal data, and ordinal data, or mixtures of these data types. Thirdly, without any constraint, most clustering techniques will naturally try to find clusters of spherical shape.
However, intuitively and practically, clusters can be of any shape, and clustering techniques should be able to detect clusters of arbitrary shape. Fourthly, clustering techniques should autonomously determine the suitable number of clusters and their associated elements without any required extra steps. Fifthly, clustering should be able to detect and eliminate noisy data, i.e., outliers or erroneous data. Sixthly, it is necessary to enable incremental clustering that is insensitive to the order of the input data. Many techniques cannot incorporate newly inserted data into existing clustering structures but must build the clusters from scratch, and most of them are sensitive to the order of the input data. Seventhly, in many cases it is inevitable to handle data with high dimensionality, since the data in many tasks have objects with a large number of attributes. Finding clusters of data objects in high-dimensional space is challenging, especially considering that such

data can be sparse and highly skewed. Eighthly, it is often necessary in practice to perform clustering under various kinds of constraints. For example, when we are assigned to locate a number of convenience stores, we may cluster residents by taking their static characteristics into account, as well as considering geographical constraints such as the city's rivers, ponds, roads, and highway networks, and the type and number of residents per cluster (customers for a store). A challenging task is to find groups of data with good clustering behavior that satisfy the specified constraints. Ninthly, the selection of clustering techniques and features has to be done with interpretability and usability in mind. Clustering may need to be tied to specific purposes, which match semantic interpretations and applications. It is important to study how an application goal may influence the selection of clustering features and methods.

The next sections describe distance or similarity, which is the basis for clustering similar objects, and then illustrate well-known types of clustering methods, including partition-based methods, hierarchical methods, density-based methods, grid-based methods, and model-based methods. Special topics on clustering in high-dimensional space, constraint-based clustering, and outlier analysis are partially discussed.

Distance and similarity measurement

To enable the process of clustering, a form of distance or similarity needs to be defined to determine which pairs of objects are close to each other. In general, a data object is characterized by a set of attributes or features. These can be of various types, including interval-scaled variables, binary variables, categorical, ordinal, and ratio-scaled variables, or combinations of these variable types. Data objects can be represented in the form of a mathematical schema, as shown in Section 2.1 and summarized briefly here. Assume that a dataset $D$ is composed of $n$ instance objects $o_1, \ldots, o_n$, each of which, $o_i$, is represented by a vector of $p$ attribute values $a_{i1}, \ldots, a_{ip}$ over the attributes $A_1, \ldots, A_p$. In other words, a dataset can be viewed as a matrix with $n$ rows and $p$ columns, called a data matrix, as in Figure 4-1. Each instance object $o_i$ is characterized by the values of the attributes, which express different aspects of the instance.

$$ D = \begin{pmatrix} o_1 \\ o_2 \\ \vdots \\ o_n \end{pmatrix} = \begin{pmatrix} a_{11} & a_{12} & \cdots & a_{1p} \\ a_{21} & a_{22} & \cdots & a_{2p} \\ \vdots & \vdots & \ddots & \vdots \\ a_{n1} & a_{n2} & \cdots & a_{np} \end{pmatrix}, \qquad o_i = (a_{i1}, a_{i2}, \ldots, a_{ip}) $$

Figure 4-1: A matrix representation of data objects with their attribute values.

Conceptually, we can measure the dissimilarity (also called distance) or similarity between any pair of objects. The result can be represented in the form of a two-dimensional matrix, called a dissimilarity (distance) matrix or a similarity matrix. This matrix holds an object-by-object structure and stores the collection of proximity values (dissimilarity or similarity values) for all pairs of objects. It is represented by an $n \times n$ matrix, as shown in Figure 4-2.

In Figure 4-2 (a), $d(i,j)$ expresses the difference or dissimilarity between objects $i$ and $j$. In most works, $d(i,j)$ is defined to be a nonnegative number. It is close to 0 when objects $i$ and $j$ are very similar or close to each other, and it becomes larger when they are more different from each other. In general, the symmetry $d(i,j) = d(j,i)$ is assumed, and the difference between two identical objects is set to zero, i.e., $d(i,i) = 0$. On the other hand, in Figure 4-2 (b), $s(i,j)$ expresses the similarity between objects $i$ and $j$. Normally, $s(i,j)$ is a number that takes a maximum of 1 when objects $i$ and $j$ are identical, and it becomes smaller when they are farther from each other. As with the dissimilarity measure, it is natural to assume the symmetry $s(i,j) = s(j,i)$, and the similarity between two identical objects is set to one, i.e., $s(i,i) = 1$.

To calculate distance or similarity, since there are various types of attributes, the measurement of the distance between two objects, say $o_i$ and $o_j$, becomes an important issue. The point is how to fairly define the distance for interval-scaled attributes, binary attributes, categorical attributes, ordinal attributes, and ratio-scaled attributes, since the distance (or similarity) between two objects, i.e., $d(i,j)$ or $s(i,j)$ in Figure 4-2, naturally comes from the combination of the distances over their attributes. As a naive way, in many works the combination is defined by a summation over the per-attribute contributions, i.e., $d(i,j) = \sum_{k=1}^{p} d_k(i,j)$ (or, for similarity, $s(i,j) = \sum_{k=1}^{p} s_k(i,j)$). While different types of attributes may have different distance (or similarity) calculations, it is possible to classify them into two main approaches.

(a) Dissimilarity (distance) matrix (minimum value = 0):

$$ \begin{pmatrix} 0 & d(1,2) & \cdots & d(1,n-1) & d(1,n) \\ d(2,1) & 0 & \cdots & d(2,n-1) & d(2,n) \\ \vdots & \vdots & \ddots & \vdots & \vdots \\ d(n-1,1) & d(n-1,2) & \cdots & 0 & d(n-1,n) \\ d(n,1) & d(n,2) & \cdots & d(n,n-1) & 0 \end{pmatrix} $$

(b) Similarity matrix (maximum value = 1):

$$ \begin{pmatrix} 1 & s(1,2) & \cdots & s(1,n-1) & s(1,n) \\ s(2,1) & 1 & \cdots & s(2,n-1) & s(2,n) \\ \vdots & \vdots & \ddots & \vdots & \vdots \\ s(n-1,1) & s(n-1,2) & \cdots & 1 & s(n-1,n) \\ s(n,1) & s(n,2) & \cdots & s(n,n-1) & 1 \end{pmatrix} $$

Figure 4-2: A dissimilarity (distance) matrix and a similarity matrix

The first approach is to normalize all attributes into a fixed standard scale, say 0.0 to 1.0 (or -1.0 to 1.0), and then use a distance measure, such as the Euclidean distance or the Manhattan distance, or a similarity measure, such as the cosine similarity, to determine the distance between a pair of objects. The details of this approach are described in an earlier section. The second approach is to use different measurements for different types of attributes, as follows.
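For concreteness, the following is a minimal Python sketch (not part of the original text) of how such an $n \times n$ dissimilarity matrix can be built from a data matrix. It assumes NumPy is available and uses the Euclidean distance as an example; the function names and toy data are illustrative only.

```python
import numpy as np

def dissimilarity_matrix(data, dist):
    """Build the n-by-n dissimilarity matrix of Figure 4-2 (a).

    data : (n, p) array, one row per object o_i
    dist : function taking two p-dimensional rows and returning d(i, j)
    """
    n = len(data)
    d = np.zeros((n, n))
    for i in range(n):
        for j in range(i + 1, n):
            d[i, j] = d[j, i] = dist(data[i], data[j])  # symmetry: d(i, j) = d(j, i)
    return d                                            # diagonal stays 0: d(i, i) = 0

# Example with Euclidean distance on a toy 4-object, 2-attribute dataset
X = np.array([[1.0, 2.0], [1.5, 1.8], [8.0, 8.0], [9.0, 9.5]])
D = dissimilarity_matrix(X, lambda a, b: np.sqrt(np.sum((a - b) ** 2)))
print(np.round(D, 2))
```

Any of the per-attribute measures described below can be passed in place of the Euclidean distance.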

1. Interval-scaled attributes

[Normalization Step]
Option 1: transform the original value $a_{if}$ of attribute $f$ into a standardized value using the mean absolute deviation of the attribute, denoted by $s_f$, and the mean value of the attribute, denoted by $m_f$:
$$ z_{if} = \frac{a_{if} - m_f}{s_f}, \qquad s_f = \frac{1}{n}\sum_{i=1}^{n} |a_{if} - m_f|, \qquad m_f = \frac{1}{n}\sum_{i=1}^{n} a_{if} $$
Option 2: transform the original value into a standardized value using the standard deviation of the attribute, denoted by $\sigma_f$, and the mean value of the attribute, denoted by $m_f$:
$$ z_{if} = \frac{a_{if} - m_f}{\sigma_f} $$

[Distance Measurement Step]
For the distance measurement in Figure 4-2, we can use a standard measure such as the Euclidean distance or the Manhattan distance between two objects, say $o_i$ and $o_j$, as follows.

Euclidean distance: $d(i,j) = \sqrt{\sum_{k=1}^{p} (a_{ik} - a_{jk})^2}$

Manhattan distance: $d(i,j) = \sum_{k=1}^{p} |a_{ik} - a_{jk}|$

Both the Euclidean distance and the Manhattan distance satisfy the following mathematical requirements of a distance function.
1. $d(i,j) \ge 0$: The distance is a nonnegative number.
2. $d(i,i) = 0$: The distance from an object to itself is zero.
3. $d(i,j) = d(j,i)$: The distance is a symmetric function.
4. $d(i,j) \le d(i,h) + d(h,j)$: The distance satisfies the triangular inequality; the direct distance from object $i$ to object $j$ is not larger than the indirect detour over any other object $h$.

1. Interval-scaled attributes (continued)

It is also possible to measure similarity instead of distance. As in Figure 4-2, two common similarity measures are the dot product and the cosine similarity. Their formulae are given below. When the objects are normalized (to unit length), the dot product and the cosine similarity become identical.

Dot product: $s(i,j) = \sum_{k=1}^{p} a_{ik}\, a_{jk}$

Cosine similarity: $s(i,j) = \dfrac{\sum_{k=1}^{p} a_{ik}\, a_{jk}}{\sqrt{\sum_{k=1}^{p} a_{ik}^2}\; \sqrt{\sum_{k=1}^{p} a_{jk}^2}}$

Note that the dot product has no bound, while the cosine similarity ranges between -1 and 1 (in this task, between 0 and 1).

2. Categorical attributes

A categorical attribute can be viewed as a generalization of the binary attribute in that it can take more than two states. For example, product brand is a categorical attribute that may take one value from a set of more than two possible values, say Oracle, Microsoft, Google, and Facebook. For the distance of a categorical attribute, it is possible to follow the same approach as for an interval-scaled attribute by setting the per-attribute distance to 0 if two objects have the same value and to 1 otherwise. That is, when the value of a categorical attribute A of two objects, $o_i$ and $o_j$, is the same, the dissimilarity and the similarity between these two objects for that attribute are set to 0 and 1, respectively; otherwise they are set to 1 and 0, respectively:
$d_A(i,j) = 0$ and $s_A(i,j) = 1$ when object $o_i$ and object $o_j$ have the same value for attribute A;
$d_A(i,j) = 1$ and $s_A(i,j) = 0$ when object $o_i$ and object $o_j$ have different values for attribute A.

3. Ordinal attributes

A discrete ordinal attribute is positioned between a categorical attribute and a numeric-valued attribute in the sense that an ordinal attribute has a finite number of discrete values (like a categorical attribute), but these values can be ordered in a meaningful sequence (like a numeric-valued attribute). An ordinal attribute is useful for recording subjective assessments of qualities that cannot be measured objectively. For example, height can be high, middle, or low; weight can be heavy, mild, or light; and so on. This ordinal property reflects continuity on an unknown scale, but the actual magnitude is not known. To handle the scale of an ordinal attribute, we can treat an ordinal variable by normalization as follows. First, the values of the ordinal attribute $f$ are mapped to ranks: suppose that each value $a_{if}$ has been mapped to a rank $r_{if} \in \{1, \ldots, M_f\}$, where $M_f$ is the number of ordered states of the attribute. Second, each rank is mapped to a value between 0 and 1 as follows:
$$ z_{if} = \frac{r_{if} - 1}{M_f - 1} $$
Third, the dissimilarity can then be computed using any of the distance measures for interval-scaled variables, using $z_{if}$ to represent the value of attribute $f$ for the i-th object.

4. Binary attributes

Even though it is possible to use the same approach as for interval-scaled attributes, treating binary variables as if they were interval-scaled may not be suitable, since it may lead to improper clustering results. It is therefore necessary to use methods specific to binary data for computing dissimilarities. As one approach, a dissimilarity matrix can be calculated from the given binary data. Normally we consider all binary variables to have the same weight. With this setting, a 2-by-2 contingency table can be constructed to calculate the dissimilarity between object i and object j, where a is the number of attributes that equal 1 for both objects, b the number that equal 1 for object i and 0 for object j, c the number that equal 0 for object i and 1 for object j, and d the number that equal 0 for both:

                      Object j
                      1        0        Total
    Object i   1      a        b        a+b
               0      c        d        c+d
               Total  a+c      b+d      a+b+c+d

Two types of dissimilarity measures for binary attributes are the symmetric binary dissimilarity and the asymmetric binary dissimilarity:

Symmetric binary dissimilarity: $d(i,j) = \dfrac{b + c}{a + b + c + d}$

Asymmetric binary dissimilarity: $d(i,j) = \dfrac{b + c}{a + b + c}$

The asymmetric binary dissimilarity is used when the positive and negative outcomes of a binary attribute are not equally important, such as the positive and negative outcomes of a disease test. That is, the value 1 of an attribute has a different importance level from the value 0 of that attribute. For example, we may give more importance to the outcome HIV positive (1), which occurs rarely, and less importance to the outcome HIV negative (0), which is the common case. As the negation of these dissimilarities, we can calculate the symmetric binary similarity and the asymmetric binary similarity as follows:

Symmetric binary similarity: $s(i,j) = 1 - d(i,j) = \dfrac{a + d}{a + b + c + d}$

Asymmetric binary similarity: $s(i,j) = \dfrac{a}{a + b + c}$

5. Ratio-scaled attributes

A ratio-scaled attribute takes a positive value on a nonlinear scale, such as an exponential scale, for example
$$ A e^{Bt} $$
where A is a positive numeric constant, B is a numeric constant, and t is the variable of interest. It is not good to treat ratio-scaled attributes like interval-scaled attributes, since the scale is likely to be distorted by the exponential growth. There are two common methods to compute the dissimilarity between objects described by ratio-scaled attributes.
1. The first method is to apply a logarithmic transformation to the value $x_{if}$ of a ratio-scaled attribute $f$ for object i, using the formula $y_{if} = \log(x_{if})$. The transformed value can then be treated as an interval-valued attribute. For some data, other transformations, such as a log-log transformation, may be more suitable.
2. The second method is to treat $x_{if}$ as a continuous ordinal attribute and treat its ranks as interval-valued.

A small code sketch illustrating several of these per-type measures is given below.
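Bringing the per-type measures together, the following is a small Python sketch (not part of the original text) of the Euclidean, Manhattan, and cosine measures for interval-scaled vectors and of the symmetric and asymmetric dissimilarities for binary vectors. It assumes NumPy is available, and the function names are illustrative only.

```python
import numpy as np

def euclidean(x, y):
    return float(np.sqrt(np.sum((x - y) ** 2)))

def manhattan(x, y):
    return float(np.sum(np.abs(x - y)))

def cosine_similarity(x, y):
    # ranges between -1 and 1; between 0 and 1 for nonnegative data
    return float(np.dot(x, y) / (np.linalg.norm(x) * np.linalg.norm(y)))

def binary_dissimilarity(x, y, symmetric=True):
    # x, y are 0/1 vectors; a, b, c, d as in the contingency table above
    a = np.sum((x == 1) & (y == 1))
    b = np.sum((x == 1) & (y == 0))
    c = np.sum((x == 0) & (y == 1))
    d = np.sum((x == 0) & (y == 0))
    return float((b + c) / (a + b + c + d)) if symmetric else float((b + c) / (a + b + c))

x = np.array([1, 0, 1, 1])
y = np.array([1, 1, 0, 1])
print(binary_dissimilarity(x, y), binary_dissimilarity(x, y, symmetric=False))
```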

Clustering Methods

While there are many existing clustering algorithms, the major clustering methods can be classified into partition-based methods, hierarchical methods, density-based methods, grid-based methods, and model-based methods. Their details are discussed below.

1. Partitioning methods

A partitioning method divides the $n$ objects (data tuples) into $k$ partitions of the data, where each partition represents a cluster and $k \le n$. This method needs the number of partitions, $k$, to be specified beforehand. Usually a partitioning method assigns each object to exactly one cluster, but some techniques, such as fuzzy partitioning, allow an object to be assigned to several clusters. The steps of partitioning methods are as follows. First, the method assigns each object to a cluster, randomly or heuristically, as an initial partition; each cluster has at least one object assigned from the beginning. Second, the method relocates objects among clusters iteratively, attempting to improve the partitioning by moving objects from one group to another based on a predefined criterion, namely that objects in the same cluster are close to each other, whereas they are far apart or very different from objects in different clusters. At present, there are a few popular heuristic methods, such as (1) the k-means algorithm, where each cluster is represented by the mean value of the objects in the cluster, and (2) the k-medoids algorithm, where each cluster is represented by one of the objects located near the center of the cluster. Partitioning methods work well for constructing a number of spherical-shaped clusters in small- to medium-sized databases. However, they need some modification to deal with clusters of complex shapes and with very large data sets.

2. Hierarchical methods

Unlike a partitioning method, a hierarchical method does not specify the number of clusters beforehand but attempts to create a hierarchical structure for the given set of data objects. The two types of hierarchical methods are agglomerative and divisive. The agglomerative method is the bottom-up approach, where the process starts with each object forming a separate group and then successively merges the objects or groups that are close to one another, until all of the groups are merged into one (the topmost level of the hierarchy) or until a termination condition holds. On the other hand, the divisive method is the top-down approach, where the procedure begins with all of the objects in the same cluster and then, in each successive iteration, a cluster is split into smaller clusters, until eventually each object is in its own cluster or until a termination condition holds. Although hierarchical methods have small computation costs because they avoid considering a combinatorial number of different merging or splitting choices, they may suffer from erroneous decisions at each merging or splitting step, since once a step is done it can never be undone. To avoid this problem, two solutions are (1) to perform careful analysis of linkages among objects at each hierarchical partitioning, as in Chameleon, or (2) to integrate hierarchical agglomeration with other approaches by first using a hierarchical agglomerative algorithm to group objects into microclusters, and then performing macroclustering on the microclusters using another clustering method such as iterative relocation, as in BIRCH.

3. Density-based methods

Most methods that use the distance between objects for clustering tend to find clusters of spherical shape. However, in general, clusters can have arbitrary shape. To address this, it is possible to apply the notion of density in order to grow clusters of any shape.
The general idea is to start from a single point in a cluster and then to grow the given cluster as long as the density (number of data points or objects) in the neighborhood exceeds a threshold. For each data point within a given cluster, the neighborhood of a given radius has to contain at least a minimum number of points. A density-based method tries to filter out

noise (outliers) and discover clusters of arbitrary shape. Examples of the density-based approach are DBSCAN and its extension OPTICS, and DENCLUE.

4. Grid-based methods

While the point-to-point (object-pair similarity) calculation in most clustering methods is slow, a grid-based method first divides the object space into a finite number of cells forming a grid structure. Clustering operations are then applied on this grid structure. Grid-based methods are superior in their fast processing time: rather than on the number of data objects, the time complexity depends only on the number of cells in each dimension of the quantized space. A typical grid-based method is STING. It is also possible to combine grid-based and density-based approaches, as done in WaveCluster.

5. Model-based methods

Instead of using a simple similarity definition, a model-based method predefines a suitable model for each of the clusters and then finds the best fit of the data to the given model. The model may form clusters by constructing a density function that reflects the spatial distribution of the data points. With statistical criteria, we can automatically detect the number of clusters and obtain more robustness. Some well-known model-based methods are EM, COBWEB, and SOM.

It is hard to say which type of clustering fits best with a given task. The choice depends both on the type of data available and on the particular purpose of the application. It is possible to explore the methods one by one, examine their resulting clusters, and compare them to find the most practical one. Some clustering methods combine the ideas of several approaches and become mixed-type clustering. Moreover, some clustering tasks, such as text clustering or DNA microarray clustering, may have high dimensionality, causing difficulty in clustering since the data become sparse. Clustering high-dimensional data is challenging due to the curse of dimensionality. Many dimensions may not be relevant. As the number of dimensions increases, the data become increasingly sparse, so that the distance measurement between pairs of points becomes meaningless and the average density of points anywhere in the data is likely to be low. For this task, two influential subspace clustering methods are CLIQUE and PROCLUS. Rather than searching over the entire data space, they search for clusters in subspaces (or subsets of dimensions) of the data. Frequent-pattern-based clustering, another clustering methodology, extracts distinct frequent patterns among subsets of dimensions that occur frequently. It uses such patterns to group objects and generate meaningful clusters. pCluster is an example of frequent-pattern-based clustering that groups objects based on their pattern similarity. Beyond simple clustering, constraint-based clustering performs clustering under user-specified or application-oriented constraints. Users may have some preferences on clustering data and specify them as constraints in the clustering process. A constraint is a user's expectation or describes properties of the desired clustering results. For example, objects in a space may be clustered under the existence of obstacles, or they may be clustered when some objects are known to be in, or not in, the same cluster.

Partition-based Methods

A partitioning method divides the $n$ objects (data tuples) into $k$ partitions of the data, where each partition represents a cluster and $k \le n$. This method needs the number of partitions, $k$, to be specified beforehand. As the most classic partitioning method, the k-means method receives in advance the number of clusters $k$ to construct.
With this parameter, k points are chosen at random as cluster centers.

Next, all instances are assigned to their closest cluster center according to an ordinary distance metric, such as the Euclidean distance. Next, the centroid, or mean, of the instances in each cluster is calculated to be a new centroid. These centroids are taken as the new center values for their respective clusters. The whole process is repeated with the new cluster centers. Finally, the iteration continues until the same points are assigned to each cluster in consecutive rounds, at which stage the cluster centers have stabilized and will remain the same from then on. In summary, the four steps of the k-means algorithm are as follows.
1. Partition the objects into k non-empty subsets.
2. Compute the seed points as the centroids of the clusters of the current partition. The centroid is the center (mean point) of the cluster.
3. Assign each object to the cluster with the nearest seed point.
4. Go back to Step 2; stop when there are no more new assignments, that is, when no member changes its group.

The k-means method can be described formally as follows. Given a set of objects $T = \{o_1, \ldots, o_n\}$, each object $o_i$ is represented by a vector of measured attribute values, without any class label. Algorithm 4.1 shows pseudocode for the k-means method.

Algorithm 4.1. k-means Algorithm
Input: a dataset T = {o_1, ..., o_n} and the number of clusters k.
Output: a set of clusters C = {c_1, ..., c_k}, where each element is a cluster with its members.
Procedure:
(1) FOREACH o in T { cluster(o) := a cluster chosen at random from C; }   // Randomly assign each object to a cluster
(2) WHILE some members change their groups {
(3)   FOREACH c_j in C { m_j := mean of the objects currently in c_j; }   // Calculate the centroid of each cluster
(4)   FOREACH o in T {
(5)     FOREACH c_j in C { d_j := distance(o, m_j); }                     // Calculate the distance to each centroid
(6)     j* := argmin_j d_j;                                               // Select the best cluster for the object
(7)     cluster(o) := c_{j*}; }                                           // Assign the object to the best cluster
(8) }
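As a concrete illustration of Algorithm 4.1 (not part of the original text), here is a minimal NumPy sketch. It assumes numeric data and the Euclidean distance, and it returns the assignments, the centroids, and the total distance from each object to its assigned centroid; the function name and toy data are illustrative only.

```python
import numpy as np

def kmeans(X, k, max_iter=100, rng=None):
    """Minimal k-means: random initial assignment, then iterate Steps 2-4 of Algorithm 4.1."""
    rng = np.random.default_rng(rng)
    labels = rng.integers(0, k, size=len(X))          # Step 1: random initial partition
    for _ in range(max_iter):
        # Step 2: centroid (mean) of each current cluster; reseed an empty cluster randomly
        centroids = np.array([X[labels == j].mean(axis=0) if np.any(labels == j)
                              else X[rng.integers(len(X))] for j in range(k)])
        # Step 3: assign each object to the nearest centroid (Euclidean distance)
        dists = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        new_labels = dists.argmin(axis=1)
        if np.array_equal(new_labels, labels):        # Step 4: stop when no member changes group
            break
        labels = new_labels
    total = np.sum(dists[np.arange(len(X)), labels])  # total distance to the assigned centroids
    return labels, centroids, total

X = np.array([[1, 1], [1.5, 2], [8, 8], [9, 9.5], [1, 0.5], [8.5, 9]])
labels, centroids, total = kmeans(X, k=2, rng=0)
print(labels, total)
```

Because the result depends on the random initial partition, the function can be run several times with different seeds and the result with the smallest total distance kept, as discussed below.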

Figure 4-3: A graphical example of k-means clustering over four rounds (Round 1 to Round 4)

For more clarity, Figure 4-3 and Figure 4-4 show a graphical example of the k-means algorithm and its calculation at each step, respectively. Compared with other methods, the k-means clustering method is simple and effective. Its principle is to swap members among clusters until the total distance from each cluster's points to its center becomes minimal and no more swapping is needed. However, there are situations in which k-means fails to find a good clustering, as in Figure 4-5. This figure shows a local optimum due to improper initial clusters. The four objects are arranged at the vertices of a rectangle in two-dimensional space. In the figure, the two initial clusters are A and B, where P1 and P3 are in cluster A, and P2 and P4 are grouped in cluster B. Graphically, the two initial cluster centers fall at the middle points of the long sides. This clustering result seems stable. However, the two natural clusters should be formed by grouping together the two vertices at either end of each short side; that is, P1 and P2 should be in cluster A, and P3 and P4 in cluster B. In the k-means method, the final clusters are quite sensitive to the initial cluster centers. Completely different clustering results may be obtained even if slight changes are made in the initial random cluster assignment. To increase the chance of finding a global minimum, one can execute the algorithm several times with different initial choices and choose the best final result, the one with the smallest total distance.

Figure 4-4: Numerical calculation of the k-means clustering in Figure 4-3. For each of the four rounds, the table lists each point's coordinates (x, y), its distances to the current centroids of clusters A and B, and the cluster to which it is assigned.

Figure 4-5: A local optimum due to improper initial clusters

Many variants of the basic k-means method have been developed. Some of them try to produce a hierarchical clustering result (shown in the next section) with a cutting point at k groups and then perform k-means clustering on that result. However, all of these methods still leave open the question of how large k should be, and it is hard to estimate the likely number of clusters. One solution is to try different values of k and choose the best clustering result, i.e., the one with the largest intra-cluster similarity and the smallest inter-cluster similarity, as sketched below. Another solution for finding k is to begin by finding a few clusters and determining whether it is worth splitting them. For example, we choose k=2, perform k-means clustering until it terminates, and then consider splitting each cluster.
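One way to carry out the first solution is sketched below (not part of the original text). It assumes scikit-learn is available and uses the library's total within-cluster squared distance (inertia_) as a simple quality score, with several random restarts per value of k; the toy data are made up for the example.

```python
import numpy as np
from sklearn.cluster import KMeans

X = np.random.default_rng(0).normal(size=(200, 2))   # toy data; replace with the real data matrix

for k in range(2, 7):
    km = KMeans(n_clusters=k, n_init=10, random_state=0).fit(X)   # 10 random restarts per k
    print(k, km.inertia_)   # pick the k where the total within-cluster distance stops dropping sharply
```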

Hierarchical-based clustering

As one of the common clustering methods, hierarchical clustering groups data objects into a tree of clusters, without a predefined number of clusters. The basic operation is to merge similar objects or object groups, or to split dissimilar objects or object groups into different groups. However, the most serious drawback of a pure hierarchical clustering method is its inability to reassign objects once a merge or split decision has been executed. If a particular merge or split decision later turns out to be wrong, the method cannot correct it. To alleviate this, it is possible to incorporate some iterative relocation mechanism into the original version. In general, there are two types of hierarchical clustering methods, agglomerative and divisive, depending on whether the hierarchical structure (tree) is formed in a bottom-up (merging) or top-down (splitting) style. In the bottom-up fashion, the agglomerative hierarchical clustering approach starts with each object in its own cluster and then merges these atomic clusters into larger and larger clusters, until all of the objects are included in a single cluster or until certain termination conditions are satisfied. Most hierarchical clustering methods belong to this type; they may differ, however, in their definition of intercluster similarity. On the other hand, in the top-down fashion, the divisive hierarchical clustering approach performs the reverse of agglomerative hierarchical clustering by starting with all objects in one cluster. It subdivides the cluster into smaller and smaller pieces, until each object forms a cluster on its own or until certain termination conditions are satisfied, such as a desired number of clusters being obtained or the diameter of each cluster being within a certain threshold.

Agglomerative versus divisive hierarchical clustering

While the two directions of hierarchical clustering are top-down and bottom-up, both of them can be represented in the form of a tree structure. In general, such a tree structure is called a dendrogram. It is commonly used to represent the process of hierarchical clustering and shows how objects are grouped together step by step. Figure 4-6 shows a dendrogram for seven objects in (a) agglomerative clustering and (b) divisive clustering. Here, Step 0 is the initial stage, while Step 6 is the final stage, when a single cluster has been constructed. Agglomerative hierarchical clustering places each object into a cluster of its own and then merges clusters step by step according to some criterion. For example, in Figure 4-6 (a) at Step 4, the agglomerative method merges {a,b,c} with {d} to form {a,b,c,d} in the bottom-up manner. This approach is also known as AGNES (AGglomerative NESting). In Figure 4-6 (b) at Step 3, the divisive method divides {e,f,g} into {e,f} and {g}. This approach is also called DIANA (DIvisive ANAlysis). In either agglomerative or divisive hierarchical clustering, the user can specify the desired number of clusters as a termination condition; that is, it is possible to terminate at any step to obtain a clustering result. If the user requests three clusters, the agglomerative method will terminate at Step 4, while the divisive method will terminate at Step 2.

Figure 4-6: Dendrogram: agglomerative (bottom-up) vs. divisive (top-down) clustering

Distance measurement among clusters

As stated in Section 4.1.1, there are several possible distance definitions to express the distance between two single objects. However, in hierarchical clustering, an additional requirement is to define the distance between two clusters, each of which may include more than one object. Four widely-used measures are single linkage, complete linkage, centroid comparison, and element comparison.

Figure 4-7 shows a graphical representation of these four methods. The formulation of each measure can be defined as follows.

1. Single linkage (minimum distance): $d_{\min}(C_i, C_j) = \min_{o \in C_i,\, o' \in C_j} d(o, o')$
2. Complete linkage (maximum distance): $d_{\max}(C_i, C_j) = \max_{o \in C_i,\, o' \in C_j} d(o, o')$
3. Centroid comparison (mean distance): $d_{\mathrm{mean}}(C_i, C_j) = d(m_i, m_j)$
4. Element comparison (average distance): $d_{\mathrm{avg}}(C_i, C_j) = \dfrac{1}{|C_i|\,|C_j|} \sum_{o \in C_i} \sum_{o' \in C_j} d(o, o')$

where $m_i$ is the centroid of $C_i$ and $m_j$ is the centroid of $C_j$.

Figure 4-7: Graphical representation of the four definitions of cluster distance

Firstly, for single linkage, if we use the minimum distance $d_{\min}(C_i, C_j)$ to measure the distance between clusters, the method is sometimes called a nearest-neighbor clustering algorithm. In this method, the clustering process is terminated when the distance between the nearest clusters exceeds an arbitrary threshold. It is possible to view the data points as nodes of a graph, where edges form a path between the nodes in a cluster. When two clusters, $C_i$ and $C_j$, are merged, an edge is added between the nearest pair of nodes in $C_i$ and $C_j$. This merging process results in a tree-like graph. An agglomerative hierarchical clustering algorithm that uses the minimum distance measure is also known as a minimal spanning tree algorithm.

Secondly, for the complete linkage, an algorithm uses the maximum distance, $d_{\max}(C_i, C_j)$, to measure the distance between clusters. Also called a farthest-neighbor clustering algorithm, the clustering process is terminated when the maximum distance between the nearest clusters exceeds a predefined threshold. In this method, each cluster can be viewed as a complete subgraph, with edges connecting all of the nodes in the cluster. The distance between two clusters is determined by the most distant nodes in the two clusters. Farthest-neighbor algorithms aim to keep the increase in the diameter of the clusters at each iteration as small as possible. This approach performs well when the true clusters are compact and approximately equal in size; otherwise, the clusters produced can be meaningless.

The nearest-neighbor clustering and the farthest-neighbor clustering represent two extreme ways of defining the distance between clusters, and both are quite sensitive to outliers or noisy data. Rather than these two methods, it is sometimes better to use the mean or average distance instead, as a compromise between the minimum and maximum distances that also mitigates the outlier and noise problem. The mean distance comes from calculating the centroid of each cluster and then measuring the distance between a pair of centroids. Also known as centroid comparison, this method is computationally simple and cheap. Every time two clusters are merged (for agglomerative clustering) or a cluster is split (for divisive clustering), a new centroid is calculated for the newly merged cluster, or two centroids are calculated for the two newly split clusters. Agglomerative clustering is slightly simpler here than divisive clustering, since a so-called weighted combination technique can be used to calculate the centroid of the newly merged cluster, which cannot be applied in the divisive approach. For both methods, it is necessary to calculate the distance between the new centroid(s) and the other centroids. Compared with the centroid comparison, the element comparison calculates the distance between two clusters as the average distance among all elements of the two clusters. This is much more computationally expensive than the centroid comparison. On the other hand, the average distance is advantageous in that it can handle categorical as well as numeric data: the mean vector for categorical data can be difficult or impossible to define, whereas the average distance can still be computed. A short code sketch using these linkage definitions is given at the end of this subsection.

Problems in the hierarchical approach

While hierarchical clustering is simple, it has a drawback in how to select the points to merge or split. Each merging or splitting step is important, since the next step proceeds on the newly generated clusters and the method never reconsiders the results of previous steps or swaps objects between clusters. If the previous merge or split decisions are not well determined, low-quality clusters may be generated. To address this problem, it is possible to perform multiple-phase clustering by incorporating other clustering techniques into hierarchical clustering. Three common methods are BIRCH, ROCK, and Chameleon. BIRCH partitions objects hierarchically using tree structures whose leaf nodes (or low-level non-leaf nodes) are treated as microclusters, depending on the scale of resolution; after that, it applies other clustering algorithms to perform macroclustering on the microclusters. ROCK merges clusters based on their interconnectivity after hierarchical clustering. Chameleon explores dynamic modeling in hierarchical clustering.
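To connect the four cluster-distance definitions to practice, here is a brief sketch (not part of the original text) using SciPy's agglomerative clustering routines, assuming SciPy is available; the method argument selects the inter-cluster distance ('single', 'complete', 'average' for element comparison, or 'centroid'), and the toy data are made up for the example.

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

X = np.array([[1, 1], [1.2, 1.1], [1.1, 0.9],     # a small toy dataset
              [5, 5], [5.2, 5.1], [9, 9]])

# Agglomerative clustering with single linkage (minimum distance);
# try method='complete', 'average', or 'centroid' for the other measures.
Z = linkage(X, method='single', metric='euclidean')

# Cut the dendrogram to obtain three clusters (the termination condition)
labels = fcluster(Z, t=3, criterion='maxclust')
print(labels)
```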
Density-based clustering

Unlike partition-based or hierarchical-based clustering, which tend to discover clusters with a spherical shape, density-based clustering methods have been designed to discover clusters of arbitrary shape. In this approach, dense regions of objects in the data space are separated by regions of low density. Among density-based methods, DBSCAN grows clusters according to a density-based connectivity analysis.

OPTICS extends DBSCAN to produce a cluster ordering obtained from a wide range of parameter settings. DENCLUE clusters objects based on a set of density distribution functions. In this book, we describe DBSCAN.

DBSCAN: Density-Based Spatial Clustering of Applications with Noise

DBSCAN is a density-based clustering algorithm that forms clusters from regions containing objects with sufficiently high density. By this nature, it discovers clusters of arbitrary shape in spatial databases with noise. In this method, a cluster is defined as a maximal set of density-connected points. The following are definitions of the concepts used in the method.

ε-neighborhood: The set of neighboring objects within a radius ε of a given object is called the ε-neighborhood of the object.

Core object: When an object has an ε-neighborhood containing more than a minimum number of objects, MinObjs, it is called a core object.

Directly density-reachable: Given a set of objects D, an object p is directly density-reachable from an object q if p is within the ε-neighborhood of q and q is a core object.

Density-reachable: An object p is density-reachable from an object q with respect to ε and MinObjs in a set of objects D if there is a chain of objects p_1, ..., p_m, with p_1 = q and p_m = p, such that p_{i+1} is directly density-reachable from p_i with respect to ε and MinObjs, for 1 ≤ i < m.

Density-connected: An object p is density-connected to an object q with respect to ε and MinObjs in a set of objects D if there is an object o in D such that both p and q are density-reachable from o with respect to ε and MinObjs.

A density-based cluster: A density-based cluster is a set of density-connected objects that is maximal with respect to density-reachability. Every object not contained in any cluster is considered to be noise.

Note that density-reachability is the transitive closure of direct density-reachability, and this relationship is asymmetric; only core objects are mutually density-reachable. Density-connectivity, however, is a symmetric relation.

Figure 4-8: An example of density-based clustering

For example, given twenty objects (a - t) as shown in Figure 4-8, the neighboring elements of each object are shown in the left list in the figure; they are derived based on the given radius ε. Moreover, the core objects (marked with *) are a, b, d, f, g, i, j, k, m, o, p, q, and s, since they have more than three neighbors (MinObjs = 3). The directly density-reachable objects of each core object q can be listed as follows.

    a*: b, c, d           k*: i, j, m, o, p
    b*: a, d, f           m*: k, o, p
    d*: a, b, c, f, h     o*: k, m, p, q, s
    f*: b, d, h           p*: k, m, o, q
    g*: e, i, j, k        q*: o, p, s, t
    i*: g, j, k           s*: o, q, t
    j*: e, g, i, k, m

The density-reachable objects of each core object q can be listed as follows.

    a*: b, c, d, f, h                  k*: e, g, i, j, m, o, p, q, s, t
    b*: a, c, d, f, h                  m*: e, g, i, j, k, o, p, q, s, t
    d*: a, b, c, f, h                  o*: e, g, i, j, k, m, p, q, s, t
    f*: a, b, c, d, h                  p*: e, g, i, j, k, m, o, q, s, t
    g*: e, i, j, k, m, o, p, q, s, t   q*: e, g, i, j, k, m, o, p, s, t
    i*: e, g, j, k, m, o, p, q, s, t   s*: e, g, i, j, k, m, o, p, q, t
    j*: e, g, i, k, m, o, p, q, s, t

The clustering result is then defined by the density-connected property. In this example, the result is the two clusters below, and the objects l, n, and r become noise.

    Cluster 1: a, b, c, d, f, h
    Cluster 2: e, g, i, j, k, m, o, p, q, s, t

When we apply some form of spatial indexing, the computational complexity of DBSCAN is O(n log n), where n is the number of database objects; without any index, it is O(n^2). With appropriate settings of the user-defined parameters ε and MinObjs, the algorithm is effective at finding arbitrary-shaped clusters.
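As a small practical illustration (not part of the original text), the following sketch uses scikit-learn's DBSCAN implementation, assuming scikit-learn is available; eps plays the role of the radius ε, min_samples that of MinObjs, and the label -1 marks noise objects. The toy data are made up for the example.

```python
import numpy as np
from sklearn.cluster import DBSCAN

rng = np.random.default_rng(0)
blob1 = rng.normal(loc=[0, 0], scale=0.3, size=(30, 2))     # a dense region
blob2 = rng.normal(loc=[5, 5], scale=0.3, size=(30, 2))     # another dense region
outliers = np.array([[2.5, 2.5], [8.0, 0.0]])               # sparse points -> noise
X = np.vstack([blob1, blob2, outliers])

db = DBSCAN(eps=1.0, min_samples=3).fit(X)   # eps ~ radius, min_samples ~ MinObjs
print(db.labels_)                            # cluster ids; -1 marks noise objects
```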

Grid-based clustering

In contrast with partition-based, hierarchical-based, and density-based methods, the grid-based clustering approach uses a multiresolution grid data structure to quantize the object space into a finite number of cells that form a grid structure, on which all of the clustering operations are performed. This approach aims to improve the processing time. Its time complexity is typically independent of the number of data objects; instead, it depends on the number of cells in each dimension of the quantized space. Some typical grid-based methods are STING, WaveCluster, and CLIQUE. Here, STING explores statistical information stored in the grid cells, WaveCluster clusters objects using a wavelet transformation method, and CLIQUE represents a grid- and density-based approach for clustering in high-dimensional data space.

STING: STatistical INformation Grid

STING is a grid-based multiresolution clustering technique in which the spatial area is divided into rectangular cells. There are usually several levels of such rectangular cells corresponding to different levels of resolution, and these cells form a hierarchical structure: each cell at a high level is partitioned to form a number of cells at the next lower level. Statistical information regarding the attributes in each grid cell (such as the mean, maximum, and minimum values) is precomputed and stored. These statistical parameters are useful for query processing, as described below.

Figure 4-9: Grid structure in grid-based clustering

Figure 4-9 shows a hierarchical structure for STING clustering. Statistical parameters and characteristics of higher-level cells can easily be computed from those of the lower-level cells. These parameters can be attribute-independent parameters such as count; attribute-dependent parameters such as the mean, stdev (standard deviation), min (minimum), and max (maximum); and the distribution type of the attribute values in the cell, such as normal, uniform, exponential, or none (if the distribution is unknown). When the data are loaded into the database, the parameters (e.g., count, mean, stdev, min, and max) of the bottom-level cells can be calculated directly from the data. Since STING uses a multiresolution approach to cluster analysis, the quality of STING clustering depends on the granularity of the lowest level of the grid structure. If the granularity is too fine, the cost of processing will increase substantially; on the other hand, if the bottom level of the grid structure is too coarse, the quality of the cluster analysis may be reduced. Because STING does not take into consideration the spatial relationship between the children and their neighboring cells when constructing a parent cell, the shapes of the resulting clusters are isothetic; that is, all of the cluster boundaries are either horizontal or vertical, and no diagonal boundary is detected. Because of this characteristic, the quality and accuracy of the clusters may be lower, a tradeoff for the fast processing time.

Model-based clustering

Instead of using a simple similarity definition, a model-based method predefines a suitable model for each of the clusters and then finds the best fit of the data to the given model. Some well-known model-based methods are EM, COBWEB, and SOM.

EM Algorithm: Expectation-Maximization method

The EM (Expectation-Maximization) algorithm (Algorithm 4.2) is a popular iterative refinement algorithm used in several applications, such as speech recognition and image processing. It was developed to estimate suitable values of unknown parameters by maximizing an expectation.

Algorithm 4.2. The EM Algorithm

1. Initialization step: To obtain a seed for the probability calculation, we start by making an initial guess of the parameter vector. There are several possible choices for this; as one simple approach, the method randomly partitions the objects into k groups and then, for each group (cluster) $C_j$, calculates its mean $m_j$ (its center), similar to k-means partitioning.

2. Repetition step: To improve the initial cluster parameters, we iteratively refine the parameters (or clusters) by alternating between an expectation step and a maximization step.

(a) Expectation step. The probability that an object $x_i$ belongs to a cluster $C_j$ is defined as
$$ P(x_i \in C_j) = P(C_j \mid x_i) = \frac{P(C_j)\, P(x_i \mid C_j)}{P(x_i)} $$
Here, $P(x_i \mid C_j)$ is the probability that the object occurs in the cluster. We can define it so that it follows the normal distribution (i.e., Gaussian distribution) with the mean ($m_j$) and the standard deviation ($\sigma_j$) of the cluster. $P(C_j)$ is the prior probability that the cluster occurs; without bias, we can set all clusters to have the same prior probability. The denominator $P(x_i)$ does not depend on any cluster, so it can be ignored in the comparison. Finally, it is possible to use only $P(x_i \mid C_j)$ as the (unnormalized) probability that the object $x_i$ belongs to the cluster $C_j$.

(b) Maximization step. Using the above probability estimates, we re-estimate (or refine) the model parameters, for example
$$ m_j = \frac{\sum_{i} x_i\, P(x_i \in C_j)}{\sum_{i} P(x_i \in C_j)}, \qquad \sigma_j^2 = \frac{\sum_{i} (x_i - m_j)^2\, P(x_i \in C_j)}{\sum_{i} P(x_i \in C_j)} $$
As its name suggests, this step maximizes the likelihood of the distributions given the data.
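As an illustration only (not part of the original text), the following compact NumPy sketch runs EM for a one-dimensional mixture of k Gaussians with equal priors, mirroring the expectation and maximization steps of Algorithm 4.2; the function name and toy data are made up for the example.

```python
import numpy as np

def em_gaussian(x, k, n_iter=50, rng=None):
    """EM for a 1-D mixture of k Gaussians with equal (uniform) priors."""
    rng = np.random.default_rng(rng)
    means = rng.choice(x, size=k, replace=False)       # initial guess: k objects as cluster means
    stds = np.full(k, x.std() + 1e-9)
    for _ in range(n_iter):
        # Expectation step: responsibility of each cluster for each object
        dens = np.exp(-0.5 * ((x[:, None] - means) / stds) ** 2) / (stds * np.sqrt(2 * np.pi))
        resp = dens / dens.sum(axis=1, keepdims=True)
        # Maximization step: re-estimate means and standard deviations from the weighted data
        weights = resp.sum(axis=0)
        means = (resp * x[:, None]).sum(axis=0) / weights
        stds = np.sqrt((resp * (x[:, None] - means) ** 2).sum(axis=0) / weights) + 1e-9
    return means, stds, resp

x = np.concatenate([np.random.default_rng(1).normal(0, 1, 100),
                    np.random.default_rng(2).normal(6, 1, 100)])
means, stds, resp = em_gaussian(x, k=2, rng=0)
print(np.round(means, 2), np.round(stds, 2))
```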

Although there have been several variants of EM methods, most of them can be viewed as an extension of the k-means algorithm, which assigns an object to the cluster that is closest to it, based on the cluster mean or the cluster representative. Instead of assigning each object to a single dedicated cluster, the EM method assigns each object to a cluster according to a weight representing the probability of membership; that is, no rigid boundaries between clusters are defined. Afterwards, new means are computed based on the weighted measures. Since we do not know beforehand which objects should be grouped together, EM begins with an initial estimate (or guess) for each parameter of the mixture model (collectively referred to as the parameter vector). The parameters can be set by randomly grouping the objects into k clusters, or simply by selecting k objects from the set of objects to be the means of the clusters. After this initial setting, the EM algorithm iteratively rescores the objects against the mixture density produced by the parameter vector. The rescored objects are then used to update the parameter estimates. During this calculation, each object is assigned a probability of how likely it is to belong to a given cluster. The details are as given in Algorithm 4.2 above. While the EM algorithm is simple, easy to implement, and converges quickly, it sometimes falls into a local optimum. Convergence is guaranteed for certain forms of optimization functions, with a computational complexity of $O(d \cdot n \cdot t)$, where $d$ is the number of input features, $n$ is the number of objects, and $t$ is the number of iterations.

Sometimes known as Bayesian clustering, such methods focus on the computation of class-conditional probability density and are commonly used in the statistics community. In industry, AutoClass is a popular Bayesian clustering method that uses a variant of the EM algorithm. The best clustering maximizes the ability to predict the attributes of an object given the correct cluster of the object. AutoClass can also estimate the number of clusters. It has been applied to several domains and was able to discover a new class of stars based on infrared astronomy data.

Conceptual Clustering

Unlike conventional clustering, which does not focus on a detailed description of each cluster, conceptual clustering forms a conceptual tree by also deriving characteristic descriptions for each group, where each group corresponds to a node (concept or class) in the tree. In other words, conceptual clustering has two steps: clustering and characterization. Clustering quality is then not solely a function of the individual objects but also of the generality and simplicity of the derived concept descriptions. Most conceptual clustering methods use probability measurements to determine the concepts or clusters. As an example of this type, COBWEB is a popular and simple method of incremental conceptual clustering. Its input objects are expressed by categorical attribute-value pairs. COBWEB creates a hierarchical clustering in the form of a classification tree. Figure 4-10 shows an example of a classification tree for a set of animal data. A classification tree differs from a decision tree in the sense that an intermediate node in a classification tree specifies a concept, whereas an intermediate node in a decision tree specifies an attribute test. In a classification tree, each node refers to a concept and contains a probabilistic description of that concept, which summarizes the objects classified under the node.
In summary, COBWEB works as follows. Given a set of objects, each object is represented by an n-dimensional attribute vector, depicting the measured values of the n attributes of the object. The Bayesian (statistical) classifier assigns (or predicts) a class for an object when that class has the highest posterior probability over the others, conditioned on the object's attribute values; that is, the Bayesian classifier predicts that the object belongs to the class with the highest posterior probability.

Figure 4-10: A classification tree. This figure is based on (Fisher, 1987).

The probabilistic description includes the probability of the concept and conditional probabilities of the form $P(A_i = v_{ij} \mid C_k)$, where $A_i = v_{ij}$ is an attribute-value pair (that is, the $i$-th attribute takes its $j$-th possible value) and $C_k$ is the concept class. Normally, the counts needed for the probability calculation are accumulated and stored at each node. The sibling nodes at a given level of a classification tree form a partition. To classify an object using a classification tree, a partial matching function is employed to descend the tree along a path of best-matching nodes. COBWEB uses a heuristic evaluation measure called category utility to guide the construction of the tree. Category utility (CU) is defined as
$$ CU = \frac{1}{n} \sum_{k=1}^{n} P(C_k) \left[ \sum_{i} \sum_{j} P(A_i = v_{ij} \mid C_k)^2 - \sum_{i} \sum_{j} P(A_i = v_{ij})^2 \right] $$
where $n$ is the number of nodes (also called concepts or categories) forming a partition $\{C_1, \ldots, C_n\}$ at the given level of the tree. In other words, category utility measures the increase in the expected number of attribute values that can be correctly guessed given a partition, over the expected number of correct guesses with no such knowledge.

As an incremental approach, when a new object arrives, COBWEB descends the tree along an appropriate path, updating counts along the way, and searches for the best node in which to place the object. The decision is made by selecting the placement that yields the highest category utility of the resulting partition. COBWEB also computes the category utility of the partition that would result if a new node were created for the object. The object is therefore placed in an existing class, or a new class is created for it, based on the partition with the highest category utility value. COBWEB thus has the ability to automatically adjust the number of classes in a partition; that is, there is no need to provide the number of clusters in advance, as in k-means or hierarchical clustering. However, the COBWEB operators are highly sensitive to the input order of the objects. To address this problem, COBWEB has two additional operators, called merging and splitting. When an object is incorporated, the two best hosts are considered for merging into a single class. Moreover, COBWEB considers splitting the children of the best host among the existing categories, based on category utility. The merging and splitting operators implement a bidirectional search: a merge can undo a previous split, and a previous merge can be undone by a later split.
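To make the category utility measure concrete, here is a small illustrative sketch (not part of the original text) that computes CU for a given partition of objects described by categorical attributes, following the formula above; the toy animal data and function name are made up for the example.

```python
from collections import Counter

def category_utility(partition):
    """partition: list of clusters; each cluster is a list of objects,
    and each object is a tuple of categorical attribute values."""
    objects = [obj for cluster in partition for obj in cluster]
    n_obj, n_attr = len(objects), len(objects[0])

    def guess_score(objs):
        # sum over attributes i and values j of P(A_i = v_ij)^2 within objs
        total = 0.0
        for i in range(n_attr):
            counts = Counter(o[i] for o in objs)
            total += sum((c / len(objs)) ** 2 for c in counts.values())
        return total

    baseline = guess_score(objects)   # expected correct guesses with no partition knowledge
    cu = sum(len(c) / n_obj * (guess_score(c) - baseline) for c in partition)
    return cu / len(partition)        # divide by the number of concepts in the partition

# Toy example: animals described by (body cover, can fly)
cluster1 = [("feathers", "yes"), ("feathers", "yes")]
cluster2 = [("fur", "no"), ("fur", "no"), ("scales", "no")]
print(category_utility([cluster1, cluster2]))
```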

However, COBWEB still has a number of limitations. First, it assumes that the probability distributions of the attributes are statistically independent of one another, which is not always true. Second, it is expensive to store the probability distribution representation of clusters, especially when the attributes have a large number of values; the time and space complexities depend not only on the number of attributes but also on the number of values for each attribute. Moreover, the classification tree is not height-balanced for skewed input data, which may cause the time and space complexity to degrade dramatically. As an extension of COBWEB, CLASSIT deals with continuous (real-valued) data. It stores a continuous normal distribution (i.e., a mean and a standard deviation) for each attribute in each node and applies a generalized category utility measure, computed as an integral over continuous attributes instead of a sum over discrete attributes as in COBWEB. While conceptual clustering is popular in the machine learning community, both COBWEB and CLASSIT struggle when clustering data from large databases.

Association Analysis and Frequent Pattern Mining

Another form of knowledge that we can mine from data is frequent patterns or associations, which include frequent itemsets, subsequences, substructures, and association rules. For example, in a supermarket database, a set of items such as bread and butter is likely to appear together frequently; such a set is called a frequent itemset. In an electronics shop database, there may be a frequent subsequence in which a PC, then a digital camera, and then a memory card are bought in that order; this is called a frequent subsequence. A substructure is a more complex pattern: it can refer to different structural forms, such as subgraphs, subtrees, or sublattices, possibly combined with itemsets or subsequences. If a substructure occurs frequently, it is called a frequent structured pattern. Such frequent patterns are important in mining associations, correlations, and many other interesting relationships among data. Again in a supermarket database, a customer who buys ice is also likely to buy water; this kind of pattern is expressed as an association rule. After performing frequent itemset mining, we can use the result to discover associations and correlations among items in large transactional or relational data sets. The discovery of interesting correlation relationships among huge amounts of business transaction records can help in many business decision-making processes, such as catalog design, cross-marketing, and customer shopping behavior analysis.

A typical example of frequent itemset mining and association rule mining is market basket analysis. Its formulation can be summarized as follows. Let I be the set of possible items, and let D be a set of database transactions, where each transaction T is a set of items (T is a subset of I). An itemset A is a set of items, and a transaction T is said to contain the itemset A if and only if A is a subset of T. Let D_A be the set of transactions that contain A. The support of an itemset A is defined as the number of transactions that include all items in A, divided by the total number of transactions. It corresponds to the probability that A occurs in a transaction, i.e., P(A):

    support(A) = |D_A| / |D| = P(A).

An itemset A is called a frequent itemset if and only if its support is greater than or equal to a threshold called the minimum support (minsup), i.e., support(A) >= minsup. Note that an itemset represents a set of items.
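The following is a minimal sketch of the support computation just defined, using the six retail transactions of Figure 4-11 that appear later in this section. The function name is illustrative.

def support(itemset, transactions):
    itemset = set(itemset)
    # fraction of transactions that contain every item of the itemset
    hits = sum(1 for t in transactions if itemset <= set(t))
    return hits / len(transactions)

transactions = [
    {"coke", "ice", "paper", "shoes", "water"},
    {"ice", "orange", "shirt", "water"},
    {"paper", "shirt", "water"},
    {"coke", "orange", "paper", "shirt", "water"},
    {"ice", "orange", "shirt", "shoes", "water"},
    {"paper", "shirt", "water"},
]
print(support({"paper", "water"}, transactions))        # 4/6, about 0.67
print(support({"ice", "water"}, transactions) >= 0.5)   # True: frequent at minsup = 0.5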
When an itemset contains k items, it is called a k-itemset. For example, in a supermarket database, the set {bread, butter} is a 2-itemset. The occurrence frequency of an itemset is the number of transactions that contain the itemset, and its support is this frequency expressed as a ratio of the total number of transactions.

An association rule is an implication of the form A => B, where A and B are itemsets and A and B do not overlap (A and B are disjoint). The support of the rule, denoted support(A => B), corresponds to the support of A U B (the union of the sets A and B), which is taken to be the probability P(A U B). As another measure, the confidence of the rule, denoted confidence(A => B), specifies the percentage of transactions containing A that also contain B. It corresponds to the conditional probability P(B|A):

    support(A => B) = P(A U B),
    confidence(A => B) = P(B|A) = support(A U B) / support(A).

An association rule is called a frequent (strong) rule if and only if A U B is a frequent itemset and the confidence of the rule is greater than or equal to a threshold called the minimum confidence (minconf), i.e., confidence(A => B) >= minconf.

Besides support and confidence, another important measure is lift. If the value of lift is lower than one, the probability that the conclusion B will occur under the condition A, i.e., P(B|A), is lower than the probability of the conclusion without the precondition, i.e., P(B), which makes the rule meaningless. Therefore, we expect an association rule to have a lift no smaller than a threshold called the minimum lift (minlift). The formal definition of lift is

    lift(A => B) = P(B|A) / P(B) = confidence(A => B) / support(B).

In general, association rule mining can be viewed as a two-step process:
1. Find all frequent itemsets.
2. Generate strong association rules from the frequent itemsets.
In the first step, the set of itemsets that occur at least as frequently as a predetermined minimum support (minsup) is found from the transactional database; in the second step, strong association rules are generated from the frequent itemsets obtained in the first step. The strong association rules are required to have support no less than the minimum support, confidence no less than the minimum confidence, and lift no less than the minimum lift. Since the second step is much less costly than the first, the overall performance of mining association rules is normally dominated by the first step.

One research issue in mining frequent itemsets from a large data set is that a huge number of itemsets may satisfy a low minimum support. Furthermore, if an itemset is frequent, each of its subsets is frequent as well, so a long itemset contains a combinatorial number of shorter frequent sub-itemsets. For example, a frequent itemset of length 30 includes up to 2^30 - 1 (about 1.07 x 10^9) non-empty frequent subsets: the first term, C(30,1) = 30, counts the frequent 1-itemsets, the second term, C(30,2) = 435, counts the frequent 2-itemsets, and so on, up to C(30,30) = 1. To cope with such an extremely large number of itemsets, the concepts of closed frequent itemset and maximal frequent itemset are introduced.
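Before turning to closed and maximal itemsets, the following short sketch illustrates the rule measures defined above (support, confidence and lift) on the transactions of Figure 4-11. The function names are illustrative.

def rule_measures(A, B, transactions):
    def sup(s):
        s = set(s)
        return sum(1 for t in transactions if s <= set(t)) / len(transactions)
    sup_rule = sup(set(A) | set(B))          # P(A U B)
    conf = sup_rule / sup(A)                 # P(B | A)
    lift = conf / sup(B)                     # P(B | A) / P(B)
    return sup_rule, conf, lift

transactions = [
    {"coke", "ice", "paper", "shoes", "water"},
    {"ice", "orange", "shirt", "water"},
    {"paper", "shirt", "water"},
    {"coke", "orange", "paper", "shirt", "water"},
    {"ice", "orange", "shirt", "shoes", "water"},
    {"paper", "shirt", "water"},
]
# Rule {orange} => {shirt}: support 0.50, confidence 1.00, lift 1.2, as in Figure 4-11 (c)
print(rule_measures({"orange"}, {"shirt"}, transactions))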

Here, we describe frequent itemsets and association rules, as well as closed frequent itemsets and maximal frequent itemsets, using the example in Figure 4-11.

  Transaction ID   Items
  1                coke, ice, paper, shoes, water
  2                ice, orange, shirt, water
  3                paper, shirt, water
  4                coke, orange, paper, shirt, water
  5                ice, orange, shirt, shoes, water
  6                paper, shirt, water
(a) A toy example of a transactional database for retailing

  Itemset                 Trans      Freq.
  coke                    14         2
  ice                     125        3
  orange                  245        3
  paper                   1346       4
  shirt                   23456      5
  shoes                   15         2
  water                   123456     6
  ice, orange             25         2
  ice, paper              1          1
  ice, shirt              25         2
  ice, water              125        3
  orange, paper           4          1
  orange, shirt           245        3
  orange, water           245        3
  paper, shirt            346        3
  paper, water            1346       4
  shirt, water            23456      5
  orange, shirt, water    245        3
  paper, shirt, water     346        3
(b) Itemsets with their transaction sets and occurrence frequencies (minimum support count = 3, i.e., 50%)

  No.   Left Items       Right Items      Support       Confidence    Lift
  1     ice              water            3/6=0.50      3/3=1.00      (3/3)/(6/6)=1.0
  2     water            ice              3/6=0.50      3/6=0.50      (3/6)/(3/6)=1.0
  3     orange           shirt            3/6=0.50      3/3=1.00      (3/3)/(5/6)=1.2
  4     shirt            orange           3/6=0.50      3/5=0.60      (3/5)/(3/6)=1.2
  5     orange           water            3/6=0.50      3/3=1.00      (3/3)/(6/6)=1.0
  6     water            orange           3/6=0.50      3/6=0.50      (3/6)/(3/6)=1.0
  7     paper            shirt            3/6=0.50      3/4=0.75      (3/4)/(5/6)=0.9
  8     shirt            paper            3/6=0.50      3/5=0.60      (3/5)/(4/6)=0.9
  9     paper            water            4/6=0.67      4/4=1.00      (4/4)/(6/6)=1.0
  10    water            paper            4/6=0.67      4/6=0.67      (4/6)/(4/6)=1.0
  11    shirt            water            5/6=0.83      5/5=1.00      (5/5)/(6/6)=1.0
  12    water            shirt            5/6=0.83      5/6=0.83      (5/6)/(5/6)=1.0
  13    orange, shirt    water            3/6=0.50      3/3=1.00      (3/3)/(6/6)=1.0
  14    orange, water    shirt            3/6=0.50      3/4=0.75      (3/4)/(5/6)=0.9
  15    shirt, water     orange           3/6=0.50      3/5=0.60      (3/5)/(3/6)=1.2
  16    water            orange, shirt    3/6=0.50      3/6=0.50      (3/6)/(3/6)=1.0
  17    shirt            orange, water    3/6=0.50      3/5=0.60      (3/5)/(3/6)=1.2
  18    orange           shirt, water     3/6=0.50      3/3=1.00      (3/3)/(5/6)=1.2
  19    paper, shirt     water            3/6=0.50      3/3=1.00      (3/3)/(6/6)=1.0
  20    paper, water     shirt            3/6=0.50      3/4=0.75      (3/4)/(5/6)=0.9
  21    shirt, water     paper            3/6=0.50      3/5=0.60      (3/5)/(4/6)=0.9
  22    water            paper, shirt     3/6=0.50      3/6=0.50      (3/6)/(3/6)=1.0
  23    shirt            paper, water     3/6=0.50      3/5=0.60      (3/5)/(4/6)=0.9
  24    paper            shirt, water     3/6=0.50      3/4=0.75      (3/4)/(5/6)=0.9
(c) Candidate association rules with their support, confidence (minimum confidence = 66.67%) and lift

Figure 4-11: Frequent itemsets and frequent rules (association rules)

Given the transaction database in Figure 4-11 (a), we can find the set of frequent itemsets shown in Figure 4-11 (b) when the minimum support is set to 0.5. The set of frequent rules (association rules) is found as displayed in Figure 4-11 (c) when the minimum confidence is set to 66.67%. Moreover, a valid rule needs to have a lift of at least 1.0. In this example, the thirteen frequent itemsets can be summarized as in Figure 4-12.

  No.   Frequent Itemset        Transaction Set   Support
  1.    ice                     125               3/6
  2.    orange                  245               3/6
  3.    paper                   1346              4/6
  4.    shirt                   23456             5/6
  5.    water                   123456            6/6
  6.    ice, water              125               3/6
  7.    orange, shirt           245               3/6
  8.    orange, water           245               3/6
  9.    paper, shirt            346               3/6
  10.   paper, water            1346              4/6
  11.   shirt, water            23456             5/6
  12.   orange, shirt, water    245               3/6
  13.   paper, shirt, water     346               3/6

Figure 4-12: Summary of frequent itemsets with their transaction sets and supports.

The concepts of closed frequent itemsets and maximal frequent itemsets are defined as follows.

[Closed Frequent Itemset] An itemset X is closed in a data set S if there exists no proper super-itemset Y (with X a proper subset of Y) such that Y has the same support count as X in S. An itemset X is a closed frequent itemset in S if X is both closed and frequent in S.

[Maximal Frequent Itemset] An itemset X is a maximal frequent itemset (or max-itemset) in S if X is frequent and there exists no super-itemset Y such that X is a proper subset of Y and Y is frequent in S.

Normally, it is possible to recover the whole set of frequent itemsets, together with their supports, from the set of closed frequent itemsets, but it is not possible to do so from the set of maximal frequent itemsets: the set of closed frequent itemsets contains complete information about the corresponding frequent itemsets, whereas the set of maximal frequent itemsets records only the supports of the maximal itemsets themselves.
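A small sketch of these definitions is given below: it filters a table of frequent itemsets (keyed by frozenset, valued by support count) down to the closed and the maximal ones. The function name is illustrative.

def closed_and_maximal(freq):
    closed, maximal = [], []
    for x, sup_x in freq.items():
        supersets = [y for y in freq if x < y]
        # closed: no proper superset has the same support count
        if not any(freq[y] == sup_x for y in supersets):
            closed.append(x)
        # maximal: no proper superset is frequent at all
        if not supersets:
            maximal.append(x)
    return closed, maximal

freq = {  # frequent itemsets of Figure 4-12 with their support counts (out of 6)
    frozenset({"ice"}): 3, frozenset({"orange"}): 3, frozenset({"paper"}): 4,
    frozenset({"shirt"}): 5, frozenset({"water"}): 6,
    frozenset({"ice", "water"}): 3, frozenset({"orange", "shirt"}): 3,
    frozenset({"orange", "water"}): 3, frozenset({"paper", "shirt"}): 3,
    frozenset({"paper", "water"}): 4, frozenset({"shirt", "water"}): 5,
    frozenset({"orange", "shirt", "water"}): 3, frozenset({"paper", "shirt", "water"}): 3,
}
closed, maximal = closed_and_maximal(freq)
print(len(closed))                     # 6 closed frequent itemsets, as in Figure 4-13
print(sorted(map(sorted, maximal)))    # the maximal frequent itemsets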

Using the above example, the frequent closed itemsets are shown in Figure 4-13. The frequent itemsets that are not closed were struck through in the original table and are marked "(not closed)" below; there are six frequent closed itemsets. The column Transaction Set indicates the set of transactions that include the itemset. For example, the frequent itemset {orange, water} appears in the 2nd, 4th and 5th transactions of Figure 4-11 (a). The frequent itemset {orange} is not closed, since it has the same transaction set as its superset, the frequent itemset {orange, shirt, water}. Note that the smallest superset of {orange} with the same transaction set is the closed frequent itemset {orange, shirt, water}; therefore, the closed frequent itemset covering {orange} is {orange, shirt, water}.

  No.   No. (closed)   Frequent Itemset                                                              Transaction Set   Support
  1.    -              ice (not closed)                                                              125               3/6
  2.    -              orange (not closed)                                                           245               3/6
  3.    -              paper (not closed)                                                            1346              4/6
  4.    -              shirt (not closed)                                                            23456             5/6
  5.    1              water                                                                         123456            6/6
  6.    2              (ice, water), covering (ice)                                                  125               3/6
  7.    -              orange, shirt (not closed)                                                    245               3/6
  8.    -              orange, water (not closed)                                                    245               3/6
  9.    -              paper, shirt (not closed)                                                     346               3/6
  10.   3              (paper, water), covering (paper)                                              1346              4/6
  11.   4              (shirt, water), covering (shirt)                                              23456             5/6
  12.   5              (orange, shirt, water), covering (orange, water), (orange, shirt), (orange)   245               3/6
  13.   6              (paper, shirt, water), covering (paper, shirt)                                346               3/6

Figure 4-13: The six closed frequent itemsets with their transaction sets and supports.

In general, the set of closed frequent itemsets contains complete information regarding the frequent itemsets. For example, the transaction set and support of the frequent itemsets {orange}, {orange, shirt} and {orange, water} are equal to those of the closed frequent itemset {orange, shirt, water}, since there is no smaller closed frequent itemset that includes them; that is, TransactionSet({orange}) = TransactionSet({orange, shirt}) = TransactionSet({orange, water}) = TransactionSet({orange, shirt, water}). Moreover, in this example, the maximal frequent itemsets are {ice, water}, {orange, shirt, water} and {paper, shirt, water}, since no proper superset of any of them is frequent.

For the association rules in Figure 4-11 (c), we can discard the 2nd, 4th, 6th, 8th, 15th, 16th, 17th, 21st, 22nd and 23rd rules, since their confidences are lower than the minimum confidence of 0.6667. Moreover, the 7th, 8th, 14th, 20th, 21st, 23rd and 24th rules are discarded since their lifts are lower than 1.0. Finally, ten frequent rules are left. As stated above, association rule mining can be viewed as a two-step process: (1) find all frequent itemsets, and (2) generate strong association rules from the frequent itemsets. The second step is relatively simple, while the first consumes most of the time and space. The following subsections describe a set of existing algorithms, namely Apriori, FP-Tree and CHARM, that efficiently discover the frequent itemsets.

Apriori algorithm

The Apriori algorithm was proposed by R. Agrawal and R. Srikant in 1994 for mining frequent itemsets for Boolean association rules. As its name suggests, the algorithm uses prior knowledge of frequent itemset properties to prune candidate itemsets that cannot be frequent, so that their supports never have to be counted. It explores the space of itemsets iteratively in a level-wise manner, where frequent k-itemsets are used to explore (k+1)-itemsets. In the first step, the set of frequent 1-itemsets is found by scanning the database to accumulate the count of each item; the items that satisfy the minimum support are collected into a set denoted L1. Next, L1 is used to find L2, the set of frequent 2-itemsets, which is used to find L3, and so on, until no more frequent k-itemsets can be found. Finding each Lk requires one full scan of the database.

To improve the efficiency of the level-wise generation of frequent itemsets, an important property called the Apriori property is used to reduce the search space. The Apriori property states that all non-empty subsets of a frequent itemset must also be frequent. Equivalently, if an itemset X does not satisfy the minimum support threshold minsup (i.e., X is not frequent), then none of its supersets can be frequent: adding an item i to X can never produce an itemset with larger support, so support(X U {i}) <= support(X) < minsup. In other words, X U {i} is not frequent if X is not frequent. This property is called anti-monotone in the sense that if a set cannot pass a test, all of its supersets will fail the same test as well; it is monotonic in the context of failing the test. Exploiting this property, each level of the algorithm can be organized as a two-step (join and prune) process, as follows. A runnable sketch is given after Algorithm 4.3 below.

[Join Step] Generate Ck, the set of candidate k-itemsets, by joining Lk-1 with itself, then determine which members of Ck are frequent to obtain the frequent k-itemsets Lk. If the items within a transaction or itemset are sorted in lexicographic order, Ck can easily be generated from Lk-1 by joining pairs of itemsets in Lk-1 that share a common prefix of length k-2: two frequent (k-1)-itemsets that differ only in their last item are joined into a candidate k-itemset. Note that this join step implicitly uses the Apriori property.

[Prune Step] The join step produces the set of candidate k-itemsets Ck, and we have to identify which of them are frequent (i.e., have a count no less than the minimum support). In other words, Ck is a superset of Lk: members of Ck may or may not be frequent, but all of the frequent k-itemsets are included in Ck. A scan of the database to determine the count of each candidate in Ck yields Lk. In many cases the set of candidates is huge, resulting in heavy computation. To reduce the size of Ck, the Apriori property is applied once more: since all non-empty subsets of a frequent itemset must also be frequent, any candidate with an infrequent (k-1)-subset can be removed without counting. This subset testing can be done quickly by maintaining a hash tree of all frequent itemsets.

Algorithm 4.3 shows the pseudocode of the Apriori algorithm for discovering frequent itemsets for mining Boolean association rules. In the pseudocode, the function GenerateCandidate corresponds to the join step, while the function CheckNoInfrequentSubset corresponds to the prune step. The following shows an example of the Apriori algorithm with six retailing transactions, where minsup = 0.5 and minconf = 0.6667 (as in Figure 4-11).

Algorithm 4.3: The Apriori Algorithm
OBJECTIVE: Find frequent itemsets using an iterative level-wise approach based on candidate generation.
Input:  D       # a database of transactions
        minsup  # minimum support count
Output: L       # frequent itemsets in D

Procedure Main():
 1: L_1 = Find_frequent_1_Itemsets(D)            # find frequent 1-itemsets from D
 2: for (k = 2; L_{k-1} is not empty; k++) {
 3:   L_k = empty set
 4:   C_k = GenerateCandidate(L_{k-1})           # generate candidates (join + prune)
 5:   foreach transaction t in D {               # scan D for counts
 6:     C_t = Subset(C_k, t)                     # find which candidates are contained in t
 7:     foreach candidate c in C_t
 8:       count[c] = count[c] + 1                # add one to the count of c
 9:   }
10:   foreach c in C_k {                         # check all candidates in C_k
11:     if (count[c] >= minsup)                  # if c reaches the minimum support
12:       L_k = L_k U {c}                        # add c to the set of frequent k-itemsets
13:   }
14: }
15: return L = U_k L_k                           # return all sets of frequent k-itemsets

Procedure GenerateCandidate(L_{k-1}):            # L_{k-1} is the set of frequent (k-1)-itemsets
 1: C_k = empty set
 2: foreach itemset x_1 in L_{k-1} {
 3:   foreach itemset x_2 in L_{k-1} {
 4:     if ( (x_1[1]=x_2[1]) & (x_1[2]=x_2[2]) & ... & (x_1[k-2]=x_2[k-2]) &
 5:          (x_1[k-1] < x_2[k-1]) ) {
 6:       for (i = 1; i <= k-2; i++)
 7:         c[i] = x_1[i]                        # copy the shared prefix
 8:       c[k-1] = x_1[k-1]
 9:       c[k]   = x_2[k-1]                      # join: append the two differing last items
10:       if ( CheckNoInfrequentSubset(c, L_{k-1}) )
11:         C_k = C_k U {c}
12:     }
13:   }
14: }
15: return C_k

Procedure CheckNoInfrequentSubset(c, L_{k-1}):   # c: a candidate k-itemset
 1: foreach (k-1)-subset s of c {
 2:   if (s is not in L_{k-1})
 3:     return FALSE                             # prune: some subset of c is not frequent
 4: }
 5: return TRUE
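The following is a compact, runnable sketch that mirrors Algorithm 4.3. It is only an illustration: itemsets are represented as frozensets, minsup is given as a fraction of the transactions, and all names are assumptions rather than a reference implementation.

from itertools import combinations

def apriori(transactions, minsup):
    n = len(transactions)
    transactions = [frozenset(t) for t in transactions]

    def frequent(candidates):
        counts = {c: sum(1 for t in transactions if c <= t) for c in candidates}
        return {c: cnt for c, cnt in counts.items() if cnt / n >= minsup}

    # L1: frequent 1-itemsets
    items = {frozenset([i]) for t in transactions for i in t}
    L = frequent(items)
    result = dict(L)
    k = 2
    while L:
        # join step: merge frequent (k-1)-itemsets that share k-2 items
        prev = list(L)
        candidates = {a | b for a in prev for b in prev if len(a | b) == k}
        # prune step: every (k-1)-subset of a candidate must be frequent
        candidates = {c for c in candidates
                      if all(frozenset(s) in L for s in combinations(c, k - 1))}
        L = frequent(candidates)
        result.update(L)
        k += 1
    return result   # {frequent itemset: support count}

db = [{"coke", "ice", "paper", "shoes", "water"}, {"ice", "orange", "shirt", "water"},
      {"paper", "shirt", "water"}, {"coke", "orange", "paper", "shirt", "water"},
      {"ice", "orange", "shirt", "shoes", "water"}, {"paper", "shirt", "water"}]
for itemset, count in sorted(apriori(db, 0.5).items(), key=lambda kv: (len(kv[0]), sorted(kv[0]))):
    print(sorted(itemset), count)   # the thirteen frequent itemsets of the worked example below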

  Transaction ID   Items
  1                coke, ice, paper, shoes, water
  2                ice, orange, shirt, water
  3                paper, shirt, water
  4                coke, orange, paper, shirt, water
  5                ice, orange, shirt, shoes, water
  6                paper, shirt, water

Step 1: Scan the database to count the frequency of the 1-itemsets and generate the set of candidate 1-itemsets, C1.
Output (C1):
  Itemset    Support
  {coke}     2
  {ice}      3
  {orange}   3
  {paper}    4
  {shirt}    5
  {shoes}    2
  {water}    6

Step 2: From the set of candidate 1-itemsets C1, identify the frequent ones and generate the set of frequent 1-itemsets, L1. Here the infrequent ones are omitted.
Output (L1):
  Itemset    Support
  {ice}      3
  {orange}   3
  {paper}    4
  {shirt}    5
  {water}    6

Step 3: From the set of frequent 1-itemsets L1, generate the set of candidate 2-itemsets, C2, and count their frequencies.
Output (C2):
  Itemset           Support
  {ice,orange}      2
  {ice,paper}       1
  {ice,shirt}       2
  {ice,water}       3
  {orange,paper}    1
  {orange,shirt}    3
  {orange,water}    3
  {paper,shirt}     3
  {paper,water}     4
  {shirt,water}     5

Step 4: From the set of candidate 2-itemsets C2, identify the frequent ones and generate the set of frequent 2-itemsets, L2. Here the infrequent ones are omitted.
Output (L2):
  Itemset           Support
  {ice,water}       3
  {orange,shirt}    3
  {orange,water}    3
  {paper,shirt}     3
  {paper,water}     4
  {shirt,water}     5

Step 5: From the set of frequent 2-itemsets L2, generate the set of candidate 3-itemsets, C3, and count their frequencies, performing the join and prune steps described above.
Output (C3):
  Itemset                  Support
  {orange,shirt,water}     3
  {paper,shirt,water}      3

For the join step, {orange,shirt} and {orange,water} are joined to generate {orange,shirt,water}, and {paper,shirt} and {paper,water} are joined to generate {paper,shirt,water}. For the prune step, every 2-item subset of each candidate, in particular {shirt,water}, exists in L2, so neither candidate is pruned.

Step 6: From the set of candidate 3-itemsets C3, identify the frequent ones and generate the set of frequent 3-itemsets, L3. Here, all candidate 3-itemsets are frequent.
Output (L3):
  Itemset                  Support
  {orange,shirt,water}     3
  {paper,shirt,water}      3

Step 7: From the set of frequent 3-itemsets L3, try to generate the set of candidate 4-itemsets, C4; no candidate can be generated, so the algorithm stops.

Finally, the set of frequent itemsets is L = L1 U L2 U L3, listed as follows.

  Itemset                  Support
  {ice}                    3
  {orange}                 3
  {paper}                  4
  {shirt}                  5
  {water}                  6
  {ice,water}              3
  {orange,shirt}           3
  {orange,water}           3
  {paper,shirt}            3
  {paper,water}            4
  {shirt,water}            5
  {orange,shirt,water}     3
  {paper,shirt,water}      3

To improve the efficiency of Apriori-based mining, many variations of the original algorithm have been proposed. Some common techniques are hash-based frequency counting, transaction reduction, partitioning, sampling, and dynamic itemset counting.

FP-Tree algorithm

Although the Apriori algorithm reduces the number of candidates generated, it can still generate a very large number of them, and it may need to scan the database repeatedly and check a large set of candidates by pattern matching; going over every transaction in the database to determine the supports of the candidate itemsets is costly. To address these issues, a method called frequent-pattern growth (FP-growth) was proposed to avoid candidate generation altogether. Applying a divide-and-conquer strategy, FP-growth proceeds as follows.

1. Scan the database once to determine which 1-itemsets are frequent and which are not.
2. Eliminate the infrequent 1-itemsets from each transaction.
3. Compress the transactions in the database into a frequent-pattern tree (FP-tree).
4. Divide the compressed database into a set of conditional databases, a special kind of projected database, each associated with one frequent item or pattern fragment.
5. Mine each conditional database recursively.

The following displays an example of FP-growth, given eight transactions of office material sales. Here, minsup is set to 50%.

  Transaction ID   Items
  1                Binder, Clip, Memo, Paper, Scissors, Stapler
  2                Card, Clip, Pad, Paper, Punch, Tape
  3                Binder, Clip, Pad, Paper, Pin, Ruler, Tape
  4                Clip, Memo, Paper, Pin, Stapler, Tape
  5                Card, Memo, Pad, Ruler, Scissors, Tape
  6                Card, Memo, Pad, Paper, Punch, Ruler, Stapler
  7                Binder, Clip, Memo, Paper, Stapler, Tape
  8                Clip, Pad, Paper, Ruler, Scissors, Stapler, Tape

The database can be transformed into a vertical format as follows.

  Item       Transaction ID Set
  Binder     137
  Card       256
  Clip       123478
  Memo       14567
  Pad        23568
  Paper      1234678
  Pin        34
  Punch      26
  Ruler      3568
  Scissors   158
  Stapler    14678
  Tape       234578

The infrequent 1-itemsets (appearing in fewer than four transactions) are Binder, Card, Pin, Punch and Scissors. They are eliminated from the database as follows.

  Transaction ID   Items
  1                Clip, Memo, Paper, Stapler
  2                Clip, Pad, Paper, Tape
  3                Clip, Pad, Paper, Ruler, Tape
  4                Clip, Memo, Paper, Stapler, Tape
  5                Memo, Pad, Ruler, Tape
  6                Memo, Pad, Paper, Ruler, Stapler
  7                Clip, Memo, Paper, Stapler, Tape
  8                Clip, Pad, Paper, Ruler, Stapler, Tape

Then, for each transaction, the remaining items are re-ordered according to a fixed, frequency-based item order; in this example the order Paper, Clip, Tape, Memo, Pad, Ruler, Stapler is used.

  Transaction ID   Items
  1                Paper, Clip, Memo, Stapler
  2                Paper, Clip, Tape, Pad
  3                Paper, Clip, Tape, Pad, Ruler
  4                Paper, Clip, Tape, Memo, Stapler
  5                Tape, Memo, Pad, Ruler
  6                Paper, Memo, Pad, Ruler, Stapler
  7                Paper, Clip, Tape, Memo, Stapler
  8                Paper, Clip, Tape, Pad, Ruler, Stapler

Each reduced transaction is used to construct a frequent-pattern (FP) tree as follows.
1. Create the root of the tree, labeled null.
2. Scan database D a second time.
3. Process the items of each transaction in the order above, creating one branch per transaction and incrementing the counts of shared prefix nodes; when all transactions have been processed, the tree is complete.
4. Finally, links are created among the nodes with the same label (and from a header table to the first node of each item).
The snapshots (a)-(h) below illustrate the process of building the FP-tree from the first transaction to the eighth transaction, in order.

[Snapshots (a)-(f): the FP-tree after inserting transactions 1 through 6.]

[Snapshots (g)-(h): the FP-tree after inserting transactions 7 and 8.]

Then the links are created from the header table to the nodes, and among the nodes that share the same item label.
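The following is a minimal sketch of the FP-tree construction just described, applied to the ordered transactions above. The node structure and names are illustrative assumptions, not taken from a specific library.

class FPNode:
    def __init__(self, item, parent):
        self.item, self.parent, self.count = item, parent, 1
        self.children = {}

def build_fp_tree(ordered_transactions):
    root = FPNode(None, None)
    header = {}                                    # item -> list of nodes carrying that label
    for trans in ordered_transactions:
        node = root
        for item in trans:
            if item in node.children:              # shared prefix: just raise the count
                node = node.children[item]
                node.count += 1
            else:                                  # new branch for the rest of the transaction
                child = FPNode(item, node)
                node.children[item] = child
                header.setdefault(item, []).append(child)
                node = child
    return root, header

ordered = [["Paper", "Clip", "Memo", "Stapler"],
           ["Paper", "Clip", "Tape", "Pad"],
           ["Paper", "Clip", "Tape", "Pad", "Ruler"],
           ["Paper", "Clip", "Tape", "Memo", "Stapler"],
           ["Tape", "Memo", "Pad", "Ruler"],
           ["Paper", "Memo", "Pad", "Ruler", "Stapler"],
           ["Paper", "Clip", "Tape", "Memo", "Stapler"],
           ["Paper", "Clip", "Tape", "Pad", "Ruler", "Stapler"]]
root, header = build_fp_tree(ordered)
print(root.children["Paper"].count)   # 7: seven transactions pass through the Paper prefix node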

From the FP-tree, we can mine frequent itemsets as follows. Starting from each frequent 1-itemset, we construct its conditional pattern base, then build its conditional FP-tree, and perform mining recursively on that tree. An example of the conditional pattern bases and conditional FP-trees is shown in the following table.

  1-itemset     Conditional Pattern Base                                  Conditional FP-tree     Frequent Patterns Generated
  Ruler (R)     {PP,C,T,P:2}, {PP,M,P:1}, {T,M,P:1}                       {P:4}                   {P,R:4}
  Stapler (S)   {PP,C,M:1}, {PP,C,T,P,R:1}, {PP,C,T,M:2}, {PP,M,P,R:1}    {PP:4, C:4}             {PP,S:4}, {C,S:4}, {PP,C,S:4}
  Pad (P)       {PP,C,T:3}, {PP,M:1}, {T,M:1}                             {PP:4, T:3}, {T:1}      {PP,P:4}, {T,P:4}
  Memo (M)      {PP,C:1}, {PP,C,T:2}, {PP:1}, {T:1}                       {PP:4}                  {PP,M:4}
  Tape (T)      {PP,C:5}                                                  {PP:5, C:5}             {PP,T:5}, {C,T:5}, {PP,C,T:5}
  Clip (C)      {PP:6}                                                    {PP:6}                  {PP,C:6}
  Paper (PP)    -                                                         -                       -
  (PP = Paper, C = Clip, T = Tape, M = Memo, P = Pad, R = Ruler, S = Stapler)

For each frequent 1-itemset, the conditional pattern base is listed in the second column. From the conditional pattern bases, we construct the conditional FP-trees shown in the third column, and from each conditional FP-tree we iteratively mine the frequent patterns shown in the last column. If a conditional FP-tree consists of a single path, all combinations of its items are generated directly. For example, Tape (T) has the single-path conditional FP-tree {PP:5, C:5}, so the three combinations {PP,T}, {C,T} and {PP,C,T} are generated.

In general, the FP-growth method transforms the problem of finding long frequent patterns into searching for shorter ones recursively and then concatenating the suffix. With frequency ordering in the tree construction, it offers good selectivity and substantially reduces the search cost. However, when the database is large, it is sometimes unrealistic to construct the main FP-tree in memory. One common solution is to partition the database into a set of projected databases, and then construct and mine an FP-tree in each projected database; this process can be applied recursively to any projected database whose FP-tree still cannot fit in main memory. The performance of FP-growth is known to be about an order of magnitude faster than that of the Apriori algorithm.
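The following short sketch illustrates the single-path case described above: when the conditional FP-tree of a suffix item is a single prefix path, every combination of the path items, joined with the suffix, is emitted as a frequent pattern. The function name is illustrative.

from itertools import combinations

def patterns_from_single_path(path, suffix):
    """path: list of (item, count) pairs along the single path, root side first."""
    out = {}
    for r in range(1, len(path) + 1):
        for combo in combinations(path, r):
            items = frozenset(i for i, _ in combo) | {suffix}
            out[items] = min(c for _, c in combo)   # support = smallest count on the path
    return out

# Conditional FP-tree of Tape (T) from the table above: the single path {PP:5, C:5}
for pattern, count in patterns_from_single_path([("PP", 5), ("C", 5)], "T").items():
    print(sorted(pattern), count)   # {PP,T}:5, {C,T}:5, {PP,C,T}:5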

CHARM algorithm

As stated previously, a huge number of frequent itemsets is usually generated, especially when the minsup threshold is set low or when long patterns exist in the data set. As in our previous example, Figure 4-13 showed that even though the closed frequent itemsets are a reduced set of the patterns generated in frequent itemset mining, they preserve the complete information regarding the set of frequent itemsets: from the set of closed frequent itemsets, we can easily derive the set of frequent itemsets and their supports. Therefore, in most cases it is more practical to mine the set of closed frequent itemsets rather than the set of all frequent itemsets. A well-known method for this is the CHARM algorithm. The CHARM algorithm utilizes the vertical database format rather than the horizontal one. Figure 4-14 shows the horizontal and vertical formats of the database and the reduced set of items when minsup is set to 50%, i.e., three transactions. The pseudo-code of the CHARM algorithm is summarized in Algorithm 4.4. In the algorithm, the main function (CHARM) calls the CHARM_EXTEND function to extend the lattice by adding new nodes, and the CHARM_PROPERTY function checks whether a newly created node has the same tidset (transaction ID set) as its parents; if so, a parent is replaced by the new node or removed from the lattice.

  Transaction ID   Items
  1                coke, ice, paper, shoes, water
  2                ice, orange, shirt, water
  3                paper, shirt, water
  4                coke, orange, paper, shirt, water
  5                ice, orange, shirt, shoes, water
  6                paper, shirt, water
(a) Horizontal format

  Itemset   Trans     Freq.           Itemset   Trans
  coke      14        2               ice       125
  ice       125       3               orange    245
  orange    245       3               paper     1346
  paper     1346      4               shirt     23456
  shirt     23456     5               water     123456
  shoes     15        2
  water     123456    6
(b) Vertical format                    (c) Reduced set of items

Figure 4-14: A database in horizontal and vertical format, with its reduced set of items.

Algorithm 4.4: The CHARM algorithm

Procedure CHARM_EXTEND(Nodes, C):
 1: for each X_i x t(X_i) in Nodes
 2:   NewN = empty set; X = X_i
 3:   for each X_j x t(X_j) in Nodes that comes after X_i
 4:     X = X_i U X_j and Y = t(X_i) intersect t(X_j)
 5:     CHARM_PROPERTY(Nodes, NewN)
 6:   if NewN is not empty then CHARM_EXTEND(NewN, C)
 7:   C = C U {X}                      # add X if it is not subsumed by an existing closed itemset

Procedure CHARM_PROPERTY(Nodes, NewN):
 1: if |Y| >= minsup then
 2:   if t(X_i) = t(X_j) then
 3:     Remove X_j from Nodes
 4:     Replace all X_i with X
 5:   else if t(X_i) is a proper subset of t(X_j) then
 6:     Replace all X_i with X
 7:   else if t(X_i) is a proper superset of t(X_j) then
 8:     Remove X_j from Nodes
 9:     Add X x Y to NewN
10:   else
11:     Add X x Y to NewN
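A small sketch of the tidset logic behind CHARM is given below: candidate itemsets are extended by intersecting tidsets, and an itemset whose tidset equals that of an extension can be replaced by the extension, since it cannot be closed. Names and layout are illustrative.

vertical_db = {               # vertical format of the reduced database (Figure 4-14 (c))
    "ice": {1, 2, 5}, "orange": {2, 4, 5}, "paper": {1, 3, 4, 6},
    "shirt": {2, 3, 4, 5, 6}, "water": {1, 2, 3, 4, 5, 6},
}

t_ice = vertical_db["ice"]
t_iw = t_ice & vertical_db["water"]   # tidset of {ice, water}
if t_iw == t_ice:
    # identical tidsets: {ice} is subsumed, so it is replaced by {ice, water}
    print("replace {ice} with {ice, water}, support", len(t_iw))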

The following is an example of the complete lattice. The CHARM algorithm starts from the root node and processes the leftmost branches first. While descending a branch, the CHARM_PROPERTY function is used to check whether lower nodes should be created, or whether a parent node should be replaced or eliminated. The following illustrates the CHARM algorithm mining closed frequent itemsets from the supermarket database. First, five nodes are created under the root node. Next, the leftmost node in the first level is expanded to generate nodes in the second level. Some of these nodes have a support lower than minsup (= 3). The node {I,W} has the same tidset as its parent {I}; therefore, the parent {I} is replaced by {I,W}.

Next, the second node in the first level is expanded to generate nodes in the second level. The last node {S,W} has the same tidset as its parent {S}; therefore, {S} is replaced by {S,W}. Moreover, the first node {S,O} has the same tidset as its other parent {O}; therefore, {O} is eliminated. The first node in the second level is then expanded to generate a node in the third level, but its support is lower than minsup. Next, the fourth node in the first level is expanded to generate a node in the second level. The node {P,W} has the same tidset as its parent {P}; therefore, {P} is replaced by {P,W}.

Finally, the resulting lattice includes six nodes (closed itemsets). Their closed frequent itemsets, tidsets, covered frequent itemsets and supports are summarized in a table. The CHARM algorithm thus generates a reduced set of itemsets in the form of closed itemsets; intuitively, the denser the database, the more compact the derived set of closed itemsets.

Association Rules with Hierarchical Structure

While mining frequent itemsets and association rules is the most common task in association analysis, another interesting task is to mine multilevel associations, where a concept hierarchy is assumed. Tasks related to multilevel association rules involve concepts at different levels of abstraction. In many cases it is hard to find strong associations among data items at low or primitive levels of abstraction with enough support, due to the sparsity of data at those levels. As an alternative, it is possible to discover strong associations at higher levels of abstraction.

Such a high-level association may represent common-sense knowledge, which may nevertheless be novel to the analyst. At this point, we may need to handle multiple levels of abstraction with sufficient flexibility for easy traversal among the different abstraction spaces. Figure 4-15 shows an example of a concept hierarchy of products, a set of sales transactions, and the corresponding modified transactions, while Figure 4-16 shows the process of mining association rules with the concept hierarchy. A concept hierarchy defines a sequence of mappings from a set of low-level concepts to higher-level, more general concepts. Items (data) can be generalized by replacing low-level concepts with their higher-level concepts, or ancestors, from the concept hierarchy.

(a) Concept hierarchy of products

  TID   Items
  1     apple, orange, paper, shirt, water
  2     coke, orange, paper, ruler, water
  3     apple, coke, paper, shoes, water
  4     coke, orange, paper, shirt
  5     apple, orange, water
  6     apple, ruler, shirt, shoes, water
  7     coke, orange, paper, shirt, shoes
  8     coke, orange, paper, shirt
  9     ruler, shirt, shoes, water
  A     coke, orange, paper, ruler
(b) An example of sales transactions

  TID   Items
  1     fruit, stationary, clothing, drink
  2     drink, fruit, stationary
  3     fruit, drink, stationary, clothing
  4     drink, fruit, stationary, shirt
  5     fruit, drink
  6     fruit, stationary, clothing, drink
  7     drink, orange, stationary, clothing
  8     drink, orange, stationary, clothing
  9     stationary, clothing, drink
  A     drink, fruit, stationary
(c) The modified sales transactions, obtained by replacing each item with its higher-level concept

Figure 4-15: An example of a concept hierarchy of products, sales transactions and their modified transactions
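The following brief sketch illustrates the generalization step: each item in a transaction is replaced by its higher-level concept before mining. The parent map below is an assumption reconstructed from the example transactions, since the hierarchy in Figure 4-15 (a) is shown only as a figure.

parent = {   # assumed item -> higher-level concept mapping, consistent with the example
    "apple": "fruit", "orange": "fruit",
    "coke": "drink", "water": "drink",
    "paper": "stationary", "ruler": "stationary",
    "shirt": "clothing", "shoes": "clothing",
}

def generalize(transaction):
    # replace every item by its parent concept and drop duplicates
    return {parent.get(item, item) for item in transaction}

t1 = {"apple", "orange", "paper", "shirt", "water"}
print(sorted(generalize(t1)))   # ['clothing', 'drink', 'fruit', 'stationary'], as in row 1 of (c)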

In this example, assume that the minimum support and the minimum confidence are set to 0.5 and 0.8, respectively. In Figure 4-15, (a) illustrates the concept hierarchy of products, (b) shows the transactions, and (c) shows the modified transactions in which each item is replaced by its higher-level concept. From the modified transactions, the frequent itemsets listed in Figure 4-16 (a) can be mined, and the frequent rules shown in Figure 4-16 (b) can then be found. Since all single higher-level items (i.e., clothing, drink, fruit, and stationary) are frequent, we continue mining frequent itemsets and association rules at the lower level. From the original transactions in Figure 4-15 (b), the frequent itemsets and rules can be found as listed in Figure 4-16 (c) and (d), respectively.

  No.    Items                                 Transactions   Count
  (1-1)  clothing                              136789         6
  (1-2)  drink                                 123456789A     10
  (1-3)  fruit                                 123456A        7
  (1-4)  stationary                            12346789A      9
  (2-1)  clothing, drink                       136789         6
  (2-2)  clothing, fruit                       136            3
  (2-3)  clothing, stationary                  136789         6
  (2-4)  drink, fruit                          123456A        7
  (2-5)  drink, stationary                     12346789A      9
  (2-6)  fruit, stationary                     12346A         6
  (2-7)  clothing, drink, stationary           136789         6
  (2-8)  drink, fruit, stationary              12346A         6
  (2-9)  clothing, drink, fruit, stationary    136            3
(a) Frequent itemsets when the minimum support is set to 50% (one level higher)

  Rule                              Confidence
  clothing => drink                 6/6  = 1.00
  drink => clothing                 6/10 = 0.60
  clothing => stationary            6/6  = 1.00
  stationary => clothing            6/9  = 0.67
  drink => fruit                    7/10 = 0.70
  fruit => drink                    7/7  = 1.00
  drink => stationary               9/10 = 0.90
  stationary => drink               9/9  = 1.00
  fruit => stationary               6/7  = 0.86
  stationary => fruit               6/9  = 0.67
  clothing, drink => stationary     6/6  = 1.00
  clothing, stationary => drink     6/6  = 1.00
  drink, stationary => clothing     6/9  = 0.67
  drink, fruit => stationary        6/7  = 0.86
  drink, stationary => fruit        6/9  = 0.67
  fruit, stationary => drink        6/6  = 1.00
(b) Possible association rules when the minimum confidence is set to 80% (one level higher)

  No.     Items                  Transactions   Count
  (1-1)   apple                  1356           4   (infrequent)
  (1-2)   coke                   23478A         6
  (1-3)   orange                 124578A        7
  (1-4)   paper                  123478A        7
  (1-5)   ruler                  269A           4   (infrequent)
  (1-6)   shirt                  146789         6
  (1-7)   shoes                  3679           4   (infrequent)
  (1-8)   water                  123569         6
  (2-1)   coke, orange           2478A          5
  (2-2)   coke, paper            23478A         6
  (2-3)   coke, shirt            478            3   (infrequent)
  (2-4)   coke, water            23             2   (infrequent)
  (2-5)   orange, paper          12478A         6
  (2-6)   orange, shirt          1478           4   (infrequent)
  (2-7)   orange, water          125            3   (infrequent)
  (2-8)   paper, shirt           1478           4   (infrequent)
  (2-9)   paper, water           123            3   (infrequent)
  (2-10)  shirt, shoes           679            3   (infrequent)
  (2-11)  shirt, water           169            3   (infrequent)
  (3-1)   coke, orange, paper    2478A          5
(c) Itemsets at the lower level when the minimum support is set to 50% (itemsets marked "infrequent" were struck through in the original)

  Rule                        Confidence
  coke => orange              5/6 = 0.83
  coke => paper               6/6 = 1.00
  orange => paper             6/7 = 0.86
  orange => coke              5/7 = 0.71
  paper => coke               6/7 = 0.86
  paper => orange             6/7 = 0.86
  coke, orange => paper       5/5 = 1.00
  coke, paper => orange       5/6 = 0.83
  orange, paper => coke       5/6 = 0.83
(d) Possible association rules at the lower level when the minimum confidence is set to 80%

Figure 4-16: An example of mining association rules with the hierarchical structure

Such multilevel association rules can be mined efficiently from data at multiple levels of abstraction using concept hierarchies under a support-confidence framework. As shown above, a top-down strategy can generally be used together with existing association rule mining algorithms such as Apriori, FP-tree, CHARM and their variations. That is, during mining, counts are accumulated for the calculation of frequent itemsets at each concept level, starting at the top concept level and working downward toward the more specific concept levels, until no more frequent itemsets can be found. For the example in Figure 4-16, the frequent itemsets and rules in (a)-(b) are mined before the frequent itemsets and rules in (c)-(d). A number of variations of this approach can be applied; five major ones are enumerated as follows.

1. Uniform Minimum Support. With a uniform minimum support for all levels, the same minimum support threshold is used when mining at every level of abstraction. In the above example, a minimum support threshold of 0.5 is used for all levels, and the mining results for level 1 and level 2 are those shown in Figure 4-16. When the threshold of 0.5 is applied to apple, orange, coke, water, ruler, paper, shoes, and shirt, the infrequent items apple (sup = 4/10), ruler (sup = 4/10), and shoes (sup = 4/10) are eliminated. When the same threshold is applied to fruit, drink, stationary and clothing, all of these general concepts are found to be frequent, even though three of their sub-items are not. Search under a uniform minimum support threshold is simple: when an Apriori-like optimization is applied, the search can avoid examining itemsets containing any item whose ancestors do not reach the minimum support, because frequent itemsets at a higher level are found before those at a lower level. However, the uniform support approach has drawbacks. If the minimum support threshold is set too high, it may miss meaningful associations occurring at low abstraction levels; if it is set too low, it may generate many uninteresting associations at high abstraction levels.

2. Reduced Minimum Support. Using reduced minimum support at lower levels (referred to as reduced support), each level of abstraction has its own minimum support threshold: the deeper the level of abstraction, the smaller the corresponding threshold. For example, given a hierarchy annotated with reduced minimum supports, items at different levels can be mined with different thresholds, as follows.

In this hierarchy, the minimum support thresholds for levels 1, 2 and 3 are 0.5, 0.8 and 0.9, respectively. With these thresholds, as shown in Figure 3.52, only the two higher concepts drink and stationary are considered; in Figure 3.52 (d)-(e), only frequent itemsets and frequent rules involving drink and stationary are considered. Moreover, the level-1 items can be mined in the same way but with a lower support (i.e., 0.5), as shown in Figure 3.52 (f)-(g).

3. Level-cross filtering by single items. Similar to reduced minimum support at lower levels, each level of abstraction has its own minimum support threshold, with deeper levels receiving smaller thresholds; in addition, the higher-level mining result is used as a filter. For example, given the following hierarchy with reduced minimum supports, some trivial itemsets and rules can be eliminated as follows. In this hierarchy, the minimum support thresholds for levels 1, 2 and 3 are 0.5, 0.8 and 0.9, respectively. With these thresholds, only the two higher concepts drink and stationary are considered, while fruit and clothing are eliminated; therefore, apple, orange, shoes, and shirt are not considered for mining at all. In this way, a pruning mechanism is provided.

4. Level-cross filtering by k-itemset. This case is similar to level-cross filtering by single items, but it filters k-itemsets instead of single items: if a k-itemset at the higher level does not pass the minimum support, the k-itemsets at the lower level under those concepts are not examined. For example, given the following hierarchy with reduced minimum supports, all 2-itemsets at the lower level are examined in case (a): since {drink, stationary} passes the minimum support, {coke, ruler}, {coke, paper}, {water, ruler} and {water, paper} will be examined. On the other hand, all 2-itemsets at the lower level are eliminated in case (b): since {drink, fruit} does not pass the minimum support, {coke, apple}, {coke, orange}, {water, apple} and {water, orange} are pruned out without examination.

(a) The higher-level concept passes the minimum support, so all lower-level 2-itemsets are examined.
(b) The higher-level concept does not pass the minimum support, so all lower-level 2-itemsets are pruned out.

5. Controlled level-cross filtering by 1-itemset (single item) or k-itemset. This case is similar to level-cross filtering by single items or by k-itemsets, but a separate threshold, set independently of the minimum support, controls whether the lower level is examined. The following shows the case of a 1-itemset (single item) and the case of a 2-itemset.

(a) Controlled level-cross filtering by 1-itemset (single item)
(b) Controlled level-cross filtering by k-itemset

Efficient Association Rule Mining with Hierarchical Structure

Association rule mining (ARM) finds the set of all subsets of items (called itemsets) that frequently occur in the database records or transactions, and then extracts rules telling us how one subset of items influences the presence of another subset. However, plain association rules may not provide the desired knowledge, because they are limited to the granularity of the individual items. For example, the rule "5% of customers who buy wheat bread also buy chocolate milk" is less expressive and less useful than the more general rule "30% of customers who buy bread also buy milk". For this purpose, generalized association rule mining (GARM) was developed, using the information of a pre-defined taxonomy over the items. The taxonomy is a piece of knowledge, e.g., the classification of products (or items) into brands, categories, product groups, and so forth. Given a taxonomy in which only leaf nodes (leaf items) appear in the transactional database, more informative, intuitive and flexible rules, called generalized association rules, can be mined from the database.

Generalized Association Rules and Generalized Frequent Itemsets

With the presence of a concept hierarchy or taxonomy, the formal problem description of generalized association rule mining differs from that of ordinary association rule mining. For clarity, all explanations in this section are illustrated using the example shown in Figure 4-17. Let T be a concept hierarchy or taxonomy: a directed acyclic graph over items whose edges represent is-a relationships, as in Figure 4-17 (a). The items in T are composed of a set of leaf items and a set of non-leaf items.

(a) Concept hierarchy or taxonomy

  TID   Items          Item   Tidset
  1     ACDE           A      1245
  2     ABC            B      2356
  3     BCDE           C      123456
  4     ACD            D      13456
  5     ABCDE          E      1356
  6     BCDE
(b) Horizontal database (left) vs. vertical database (middle) vs. extended vertical database (right; it additionally lists the non-leaf items U, V and W with the tidsets of all transactions that contain one of their leaf descendants)

Figure 4-17: An example of mining generalized association rules

Let I be the set of distinct items, and let the transactions be identified by a set of transaction identifiers (tids). In this example, the leaf items are A, B, C, D and E, the non-leaf items are U, V and W, and the tids are 1 to 6. A subset of I is called an itemset, and a set of tids is called a tidset. Normally, a transactional database is represented in the horizontal format, where each transaction corresponds to an itemset, as shown in the left table of Figure 4-17 (b). An alternative is the vertical format, where each item corresponds to the tidset of the transactions that contain it, as shown in the middle table of Figure 4-17 (b). Note that the original database contains only leaf items. The original vertical database can be extended to cover the non-leaf items, where a transaction that supports an item also supports that item's related (ancestor) items from the taxonomy, as shown in the right table of Figure 4-17 (b).
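A short sketch of the extension step follows: a non-leaf item is supported by every transaction that contains one of its leaf descendants. The taxonomy used below is a hypothetical assumption for illustration only, since the exact edges of Figure 4-17 (a) are not reproduced in the text; only the leaf tidsets are taken from the figure.

taxonomy = {"U": {"A", "B"}, "V": {"C", "D"}, "W": {"D", "E"}}   # assumed non-leaf -> leaf descendants

vertical = {                      # leaf items -> tidsets (Figure 4-17 (b), middle table)
    "A": {1, 2, 4, 5}, "B": {2, 3, 5, 6}, "C": {1, 2, 3, 4, 5, 6},
    "D": {1, 3, 4, 5, 6}, "E": {1, 3, 5, 6},
}

extended = dict(vertical)
for ancestor, leaves in taxonomy.items():
    tids = set()
    for leaf in leaves:
        tids |= vertical[leaf]    # a transaction supporting a leaf also supports its ancestor
    extended[ancestor] = tids

print(sorted(extended["U"]))      # tidset of the generalized (non-leaf) item U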

49 example, and. A generalized itemset is an itemset each element of which is not an ancestor of the others,. For example, ( for short), ( for short) are generalized itemsets. Let be a finite set of all generalized itemsets. Note that, for and (power of I). The support of G, denoted by, is defined by a percentage of the number of transactions in which occurs as a subset to the total number of transactions, thus. Any is called generalized frequent itemset (GFI) when its support is at least a user-specified minimum support (minsup) threshold. In GARM, a meaningful rule is an implication of the form, where and no item in is an ancestor of any items in. For example, and are meaningful rules, while is a meaningless rule because its support is redundant with. The support of a rule, defined as =, is the percentage of the number of transactions containing both and to the total number of transactions. The confidence of a rule, defined as, is the conditional probability that a transaction contains, given that it contains. For example, the support of is and the confidence is =1 or 100%. The meaningful rule is called a generalized association rule (GAR) when its confidence is at least a user-specified minimum confidence (minconf) threshold. The task of GARM is to discover all GARs the supports and confidences of which are at least minsup and minconf, respectively. Here, it is possible to consider two relationships, namely subset-superset and ancestordescendant relationships, based on lattice theory. Similar to ARM, GARM occupies the subsetsuperset relationship which represents a lattice of generalized itemsets. As the second relationship, an ancestor-descendant relationship is originally introduced to represent a set of k- generalized itemset taxonomies. By these relationships, it is possible to mine a smaller set of generalized closed frequent itemsets instead of mining a large set of conventional generalized frequent itemsets. Two algorithms called SET and cset are introduced to mine generalized frequent itemsets and generalized closed frequent itemsets, respectively. By a number of experiments, SET and cset outperform the previous well-known algorithms in both computational time and memory utilization. The number of generalized closed frequent itemsets is much more smaller than the number of generalized frequent itemsets Historical Bibliography As unsupervised learning, clustering has been studied extensively in many disciplines due to its broad applications. Several textbooks are dedicated to the methods of cluster analysis, including Hartigan (1975), Jain and Dubes (1988), Kaufman and Rousseeuw (1990), and Arabie, Hubert, and De Sorte (1996). Many survey articles on different aspects of clustering methods include those done by Jain, Murty and Flynn (1999) and Parsons, Haque, and Liu (2004). As a partitioning method, Lloyd (1957) and later MacQueen (1967) introduce the k-means algorithm. Later, Bradley, Fayyad, and Reina (1998) proposed a k-means based scalable clustering algorithm. Instead of using means, Kaufman and Rousseeuw (1990) proposed to use the nearest object to the means as the cluster center and call the method the k-medoids algorithm with two versions of PAM and CLARA. To cluster categorical data (in contrast with numerical data), Chaturvedi, Green, and Carroll (1994, 2001) proposed the k-modes clustering algorithm. Later Haung (1998) proposed independently the k-modes (for clustering categorical data) and k- prototypes (for clustering hybrid data) algorithms. 
As an extension to CLARA, the CLARANS

algorithm was later proposed by Ng and Han (1994). Ester, Kriegel, and Xu (1995) proposed efficient spatial access techniques, such as the R*-tree and focusing techniques, to improve the performance of CLARANS. An early survey of agglomerative hierarchical clustering algorithms was conducted by Day and Edelsbrunner (1984). Kaufman and Rousseeuw (1990) also introduced agglomerative hierarchical clustering, such as AGNES, and divisive hierarchical clustering, such as DIANA. Later, Zhang, Ramakrishnan, and Livny (1996) proposed the BIRCH algorithm, which integrates hierarchical clustering with distance-based iterative relocation or other nonhierarchical clustering methods to improve the clustering quality of hierarchical methods. BIRCH partitions objects hierarchically using tree structures whose leaf nodes (or low-level nonleaf nodes) are treated as microclusters, depending on the scale of resolution, and then applies other clustering algorithms to perform macroclustering on the microclusters. Proposed by Guha, Rastogi, and Shim (1998, 1999), CURE and ROCK utilize linkage (nearest-neighbor) analysis and its transformations to improve conventional hierarchical clustering. Exploring dynamic modeling in hierarchical clustering, Chameleon was proposed by Karypis, Han, and Kumar (1999). As an early density-based clustering method, Ester, Kriegel, Sander, and Xu (1996) proposed DBSCAN, the first algorithm to exploit density in clustering, with some parameters that need to be specified. After that, Ankerst, Breunig, Kriegel, and Sander (1999) proposed a cluster-ordering method, OPTICS, which facilitates density-based clustering without a fixed parameter setting. Almost at the same time, Hinneburg and Keim (1998) proposed the DENCLUE algorithm, which uses a set of density distribution functions to glue similar objects together. As a grid-based multi-resolution approach, STING was proposed by Wang, Yang, and Muntz (1997) to cluster objects using statistical information collected in grid cells. Instead of working in the original feature space, Sheikholeslami, Chatterjee, and Zhang (1998) applied the wavelet transform to implement a multi-resolution clustering method, WaveCluster, which combines the grid- and density-based approaches. As another hybrid of the grid- and density-based approaches, CLIQUE was designed, based on Apriori, by Agrawal, Gehrke, Gunopulos, and Raghavan (1998) to cope with high-dimensional clustering using dimension-growth subspace clustering. For model-based clustering, Dempster, Laird, and Rubin (1977) proposed the well-known statistics-based EM (Expectation-Maximization) algorithm, and the handling of missing data in EM methods was presented by Lauritzen (1995). As a variant of the EM algorithm, AutoClass was proposed by Cheeseman and Stutz (1996), incorporating Bayesian theory. While conceptual clustering was first introduced by Michalski and Stepp (1983), the best-known early example is COBWEB, invented by Fisher (1987); a succeeding version is CLASSIT by Gennari, Langley, and Fisher (1989). The task of association rule mining was first introduced by Agrawal, Imielinski, and Swami (1993). The Apriori algorithm for frequent itemset mining and a method to generate association rules from frequent itemsets were presented by Agrawal and Srikant (1994a, 1994b). Agrawal and Srikant (1994b), Han and Fu (1995), and Park, Chen, and Yu (1995) described transaction reduction techniques in their papers.
Later, Pasquier, Bastide, Taouil, and Lakhal (1999) proposed a method to mine frequent closed itemsets, namely A-Close, based on the Apriori algorithm. Pei, Han, and Mao (2000) then proposed CLOSET, an efficient closed itemset mining algorithm based on the frequent pattern growth method; as a further refinement, CLOSET+ was invented by Wang, Han, and Pei (2003). Savasere, Omiecinski, and Navathe (1995) introduced the partitioning technique. Toivonen (1996) explored sampling techniques, while Brin, Motwani, Ullman, and Tsur (1997) provided a dynamic itemset counting approach. Han, Pei, and Yin (2000) proposed the FP-growth algorithm, a pattern-growth approach for mining frequent itemsets without candidate generation. Grahne and Zhu (2003) introduced FPClose, a prefix-tree-based algorithm for mining closed itemsets using the pattern-growth approach. Zaki (2000)

proposed an approach for mining frequent itemsets by exploring the vertical data format, called ECLAT. Zaki and Hsiao (2002) presented an extension for mining closed frequent itemsets with the vertical data format, called CHARM. Bayardo (1998) gave the first study on mining max-patterns. Multilevel association mining was studied by Han and Fu (1995) and Srikant and Agrawal (1995, 1997). In Srikant and Agrawal (1995, 1997), five algorithms named Basic, Cumulate, Stratify, Estimate and EstMerge were proposed; these algorithms apply the horizontal database format and a breadth-first search strategy, like Apriori-based algorithms. Later, Hipp, Myka, Wirth, and Guntzer (1998) proposed a method, namely Prutax, that uses hash-tree checking with the vertical database format to avoid generating meaningless itemsets and thus to reduce the computational time needed for multiple scans of the database. Lui and Chung (2000) proposed an efficient method to discover generalized association rules with multiple minimum supports. A parallel algorithm for generalized association rule mining (GARM) was also proposed by Shintani and Kitsuregawa (1998). Some recent applications that utilize GARM are shown by Michail (2000) and Hwang and Lim (2002). Later, Sriphaew and Theeramunkong (2002, 2003, 2004) introduced two types of constraints on two generalized itemset relationships, called the subset-superset and ancestor-descendant constraints, to mine only a small set of generalized closed frequent itemsets instead of a large set of conventional generalized frequent itemsets. Two algorithms, named SET and cSET, were proposed by Sriphaew and Theeramunkong (2004) to efficiently find generalized frequent itemsets and generalized closed frequent itemsets, respectively.

Exercise

1. Apply the k-means algorithm to cluster the following data.

  AreaType    Humidity    Temperature
  Sea
  Mountain
  Mountain
  Mountain
  Sea

Here, use the following settings for clustering.
- Distance: Euclidean distance, d(x_i, x_j) = \sqrt{(x_{i1}-x_{j1})^2 + (x_{i2}-x_{j2})^2 + \cdots + (x_{ip}-x_{jp})^2}
- For the nominal attribute: use 0 and 1.
- For numeric attributes: use decimal-scaling normalization.

2. Apply hierarchical clustering to the previous problem.

3. Explain the merits and demerits of grid-based clustering.

4. Apply DBSCAN to cluster the following data points. Here, set the minimum number of objects (MinPts) to three and the radius of the neighborhood to the given value.

5. Explain the concepts of conceptual clustering and the EM algorithm, including their merits and demerits.

6. Assume that the database is as follows. Find the frequent itemsets and frequent rules when the minimum support and minimum confidence are set to 60% and 80%, respectively. Show the process of the Apriori, FP-Tree and CHARM methods.

  TID   ITEMS
  1     apple, bacon, bread, pizza, potato, tuna, water
  2     bacon, bread, cookie, corn, nut, shrimp, water
  3     apple, bread, cookie, nut, pizza, potato, shrimp, tuna, water
  4     apple, bacon, bread, cookie, nut, potato, water
  5     bread, cookie, corn, nut, pizza, potato, water
  6     apple, cookie, corn, nut, potato, shrimp, water
  7     bacon, bread, cookie, corn, nut, tuna, water
  8     apple, bread, nut, potato, shrimp, water


More information

University of Florida CISE department Gator Engineering. Clustering Part 4

University of Florida CISE department Gator Engineering. Clustering Part 4 Clustering Part 4 Dr. Sanjay Ranka Professor Computer and Information Science and Engineering University of Florida, Gainesville DBSCAN DBSCAN is a density based clustering algorithm Density = number of

More information

CSE 5243 INTRO. TO DATA MINING

CSE 5243 INTRO. TO DATA MINING CSE 5243 INTRO. TO DATA MINING Cluster Analysis: Basic Concepts and Methods Huan Sun, CSE@The Ohio State University 09/25/2017 Slides adapted from UIUC CS412, Fall 2017, by Prof. Jiawei Han 2 Chapter 10.

More information

Cluster Analysis. Mu-Chun Su. Department of Computer Science and Information Engineering National Central University 2003/3/11 1

Cluster Analysis. Mu-Chun Su. Department of Computer Science and Information Engineering National Central University 2003/3/11 1 Cluster Analysis Mu-Chun Su Department of Computer Science and Information Engineering National Central University 2003/3/11 1 Introduction Cluster analysis is the formal study of algorithms and methods

More information

Cluster Analysis: Basic Concepts and Methods

Cluster Analysis: Basic Concepts and Methods HAN 17-ch10-443-496-9780123814791 2011/6/1 3:44 Page 443 #1 10 Cluster Analysis: Basic Concepts and Methods Imagine that you are the Director of Customer Relationships at AllElectronics, and you have five

More information

Clustering CS 550: Machine Learning

Clustering CS 550: Machine Learning Clustering CS 550: Machine Learning This slide set mainly uses the slides given in the following links: http://www-users.cs.umn.edu/~kumar/dmbook/ch8.pdf http://www-users.cs.umn.edu/~kumar/dmbook/dmslides/chap8_basic_cluster_analysis.pdf

More information

Cluster Analysis. Ying Shen, SSE, Tongji University

Cluster Analysis. Ying Shen, SSE, Tongji University Cluster Analysis Ying Shen, SSE, Tongji University Cluster analysis Cluster analysis groups data objects based only on the attributes in the data. The main objective is that The objects within a group

More information

CSE 5243 INTRO. TO DATA MINING

CSE 5243 INTRO. TO DATA MINING CSE 5243 INTRO. TO DATA MINING Cluster Analysis: Basic Concepts and Methods Huan Sun, CSE@The Ohio State University Slides adapted from UIUC CS412, Fall 2017, by Prof. Jiawei Han 2 Chapter 10. Cluster

More information

Cluster analysis. Agnieszka Nowak - Brzezinska

Cluster analysis. Agnieszka Nowak - Brzezinska Cluster analysis Agnieszka Nowak - Brzezinska Outline of lecture What is cluster analysis? Clustering algorithms Measures of Cluster Validity What is Cluster Analysis? Finding groups of objects such that

More information

Knowledge Discovery in Databases

Knowledge Discovery in Databases Ludwig-Maximilians-Universität München Institut für Informatik Lehr- und Forschungseinheit für Datenbanksysteme Lecture notes Knowledge Discovery in Databases Summer Semester 2012 Lecture 8: Clustering

More information

Unsupervised Learning and Clustering

Unsupervised Learning and Clustering Unsupervised Learning and Clustering Selim Aksoy Department of Computer Engineering Bilkent University saksoy@cs.bilkent.edu.tr CS 551, Spring 2009 CS 551, Spring 2009 c 2009, Selim Aksoy (Bilkent University)

More information

UNIT V CLUSTERING, APPLICATIONS AND TRENDS IN DATA MINING. Clustering is unsupervised classification: no predefined classes

UNIT V CLUSTERING, APPLICATIONS AND TRENDS IN DATA MINING. Clustering is unsupervised classification: no predefined classes UNIT V CLUSTERING, APPLICATIONS AND TRENDS IN DATA MINING What is Cluster Analysis? Cluster: a collection of data objects Similar to one another within the same cluster Dissimilar to the objects in other

More information

Keywords Clustering, Goals of clustering, clustering techniques, clustering algorithms.

Keywords Clustering, Goals of clustering, clustering techniques, clustering algorithms. Volume 3, Issue 5, May 2013 ISSN: 2277 128X International Journal of Advanced Research in Computer Science and Software Engineering Research Paper Available online at: www.ijarcsse.com A Survey of Clustering

More information

Cluster Analysis. CSE634 Data Mining

Cluster Analysis. CSE634 Data Mining Cluster Analysis CSE634 Data Mining Agenda Introduction Clustering Requirements Data Representation Partitioning Methods K-Means Clustering K-Medoids Clustering Constrained K-Means clustering Introduction

More information

Clustering. CE-717: Machine Learning Sharif University of Technology Spring Soleymani

Clustering. CE-717: Machine Learning Sharif University of Technology Spring Soleymani Clustering CE-717: Machine Learning Sharif University of Technology Spring 2016 Soleymani Outline Clustering Definition Clustering main approaches Partitional (flat) Hierarchical Clustering validation

More information

Unsupervised Learning and Clustering

Unsupervised Learning and Clustering Unsupervised Learning and Clustering Selim Aksoy Department of Computer Engineering Bilkent University saksoy@cs.bilkent.edu.tr CS 551, Spring 2008 CS 551, Spring 2008 c 2008, Selim Aksoy (Bilkent University)

More information

DATA MINING LECTURE 7. Hierarchical Clustering, DBSCAN The EM Algorithm

DATA MINING LECTURE 7. Hierarchical Clustering, DBSCAN The EM Algorithm DATA MINING LECTURE 7 Hierarchical Clustering, DBSCAN The EM Algorithm CLUSTERING What is a Clustering? In general a grouping of objects such that the objects in a group (cluster) are similar (or related)

More information

Clustering part II 1

Clustering part II 1 Clustering part II 1 Clustering What is Cluster Analysis? Types of Data in Cluster Analysis A Categorization of Major Clustering Methods Partitioning Methods Hierarchical Methods 2 Partitioning Algorithms:

More information

What is Cluster Analysis? COMP 465: Data Mining Clustering Basics. Applications of Cluster Analysis. Clustering: Application Examples 3/17/2015

What is Cluster Analysis? COMP 465: Data Mining Clustering Basics. Applications of Cluster Analysis. Clustering: Application Examples 3/17/2015 // What is Cluster Analysis? COMP : Data Mining Clustering Basics Slides Adapted From : Jiawei Han, Micheline Kamber & Jian Pei Data Mining: Concepts and Techniques, rd ed. Cluster: A collection of data

More information

Road map. Basic concepts

Road map. Basic concepts Clustering Basic concepts Road map K-means algorithm Representation of clusters Hierarchical clustering Distance functions Data standardization Handling mixed attributes Which clustering algorithm to use?

More information

INF4820. Clustering. Erik Velldal. Nov. 17, University of Oslo. Erik Velldal INF / 22

INF4820. Clustering. Erik Velldal. Nov. 17, University of Oslo. Erik Velldal INF / 22 INF4820 Clustering Erik Velldal University of Oslo Nov. 17, 2009 Erik Velldal INF4820 1 / 22 Topics for Today More on unsupervised machine learning for data-driven categorization: clustering. The task

More information

COMP 465: Data Mining Still More on Clustering

COMP 465: Data Mining Still More on Clustering 3/4/015 Exercise COMP 465: Data Mining Still More on Clustering Slides Adapted From : Jiawei Han, Micheline Kamber & Jian Pei Data Mining: Concepts and Techniques, 3 rd ed. Describe each of the following

More information

Unsupervised Learning Hierarchical Methods

Unsupervised Learning Hierarchical Methods Unsupervised Learning Hierarchical Methods Road Map. Basic Concepts 2. BIRCH 3. ROCK The Principle Group data objects into a tree of clusters Hierarchical methods can be Agglomerative: bottom-up approach

More information

CS570: Introduction to Data Mining

CS570: Introduction to Data Mining CS570: Introduction to Data Mining Scalable Clustering Methods: BIRCH and Others Reading: Chapter 10.3 Han, Chapter 9.5 Tan Cengiz Gunay, Ph.D. Slides courtesy of Li Xiong, Ph.D., 2011 Han, Kamber & Pei.

More information

Clustering Analysis Basics

Clustering Analysis Basics Clustering Analysis Basics Ke Chen Reading: [Ch. 7, EA], [5., KPM] Outline Introduction Data Types and Representations Distance Measures Major Clustering Methodologies Summary Introduction Cluster: A collection/group

More information

Unsupervised Data Mining: Clustering. Izabela Moise, Evangelos Pournaras, Dirk Helbing

Unsupervised Data Mining: Clustering. Izabela Moise, Evangelos Pournaras, Dirk Helbing Unsupervised Data Mining: Clustering Izabela Moise, Evangelos Pournaras, Dirk Helbing Izabela Moise, Evangelos Pournaras, Dirk Helbing 1 1. Supervised Data Mining Classification Regression Outlier detection

More information

PAM algorithm. Types of Data in Cluster Analysis. A Categorization of Major Clustering Methods. Partitioning i Methods. Hierarchical Methods

PAM algorithm. Types of Data in Cluster Analysis. A Categorization of Major Clustering Methods. Partitioning i Methods. Hierarchical Methods Whatis Cluster Analysis? Clustering Types of Data in Cluster Analysis Clustering part II A Categorization of Major Clustering Methods Partitioning i Methods Hierarchical Methods Partitioning i i Algorithms:

More information

Cluster Analysis. Prof. Thomas B. Fomby Department of Economics Southern Methodist University Dallas, TX April 2008 April 2010

Cluster Analysis. Prof. Thomas B. Fomby Department of Economics Southern Methodist University Dallas, TX April 2008 April 2010 Cluster Analysis Prof. Thomas B. Fomby Department of Economics Southern Methodist University Dallas, TX 7575 April 008 April 010 Cluster Analysis, sometimes called data segmentation or customer segmentation,

More information

CSE 5243 INTRO. TO DATA MINING

CSE 5243 INTRO. TO DATA MINING CSE 5243 INTRO. TO DATA MINING Cluster Analysis: Basic Concepts and Methods Huan Sun, CSE@The Ohio State University 09/28/2017 Slides adapted from UIUC CS412, Fall 2017, by Prof. Jiawei Han 2 Chapter 10.

More information

Data Mining. Dr. Raed Ibraheem Hamed. University of Human Development, College of Science and Technology Department of Computer Science

Data Mining. Dr. Raed Ibraheem Hamed. University of Human Development, College of Science and Technology Department of Computer Science Data Mining Dr. Raed Ibraheem Hamed University of Human Development, College of Science and Technology Department of Computer Science 2016 201 Road map What is Cluster Analysis? Characteristics of Clustering

More information

Data Mining Cluster Analysis: Basic Concepts and Algorithms. Lecture Notes for Chapter 8. Introduction to Data Mining

Data Mining Cluster Analysis: Basic Concepts and Algorithms. Lecture Notes for Chapter 8. Introduction to Data Mining Data Mining Cluster Analysis: Basic Concepts and Algorithms Lecture Notes for Chapter 8 Introduction to Data Mining by Tan, Steinbach, Kumar Tan,Steinbach, Kumar Introduction to Data Mining 4/18/004 1

More information

Data Mining Chapter 9: Descriptive Modeling Fall 2011 Ming Li Department of Computer Science and Technology Nanjing University

Data Mining Chapter 9: Descriptive Modeling Fall 2011 Ming Li Department of Computer Science and Technology Nanjing University Data Mining Chapter 9: Descriptive Modeling Fall 2011 Ming Li Department of Computer Science and Technology Nanjing University Descriptive model A descriptive model presents the main features of the data

More information

Contents. Preface to the Second Edition

Contents. Preface to the Second Edition Preface to the Second Edition v 1 Introduction 1 1.1 What Is Data Mining?....................... 4 1.2 Motivating Challenges....................... 5 1.3 The Origins of Data Mining....................

More information

Gene Clustering & Classification

Gene Clustering & Classification BINF, Introduction to Computational Biology Gene Clustering & Classification Young-Rae Cho Associate Professor Department of Computer Science Baylor University Overview Introduction to Gene Clustering

More information

Unsupervised Learning : Clustering

Unsupervised Learning : Clustering Unsupervised Learning : Clustering Things to be Addressed Traditional Learning Models. Cluster Analysis K-means Clustering Algorithm Drawbacks of traditional clustering algorithms. Clustering as a complex

More information

DS504/CS586: Big Data Analytics Big Data Clustering II

DS504/CS586: Big Data Analytics Big Data Clustering II Welcome to DS504/CS586: Big Data Analytics Big Data Clustering II Prof. Yanhua Li Time: 6pm 8:50pm Thu Location: KH 116 Fall 2017 Updates: v Progress Presentation: Week 15: 11/30 v Next Week Office hours

More information

Data Mining. Part 2. Data Understanding and Preparation. 2.4 Data Transformation. Spring Instructor: Dr. Masoud Yaghini. Data Transformation

Data Mining. Part 2. Data Understanding and Preparation. 2.4 Data Transformation. Spring Instructor: Dr. Masoud Yaghini. Data Transformation Data Mining Part 2. Data Understanding and Preparation 2.4 Spring 2010 Instructor: Dr. Masoud Yaghini Outline Introduction Normalization Attribute Construction Aggregation Attribute Subset Selection Discretization

More information

DS504/CS586: Big Data Analytics Big Data Clustering II

DS504/CS586: Big Data Analytics Big Data Clustering II Welcome to DS504/CS586: Big Data Analytics Big Data Clustering II Prof. Yanhua Li Time: 6pm 8:50pm Thu Location: AK 232 Fall 2016 More Discussions, Limitations v Center based clustering K-means BFR algorithm

More information

Cluster Analysis for Microarray Data

Cluster Analysis for Microarray Data Cluster Analysis for Microarray Data Seventh International Long Oligonucleotide Microarray Workshop Tucson, Arizona January 7-12, 2007 Dan Nettleton IOWA STATE UNIVERSITY 1 Clustering Group objects that

More information

Kapitel 4: Clustering

Kapitel 4: Clustering Ludwig-Maximilians-Universität München Institut für Informatik Lehr- und Forschungseinheit für Datenbanksysteme Knowledge Discovery in Databases WiSe 2017/18 Kapitel 4: Clustering Vorlesung: Prof. Dr.

More information

CS Introduction to Data Mining Instructor: Abdullah Mueen

CS Introduction to Data Mining Instructor: Abdullah Mueen CS 591.03 Introduction to Data Mining Instructor: Abdullah Mueen LECTURE 8: ADVANCED CLUSTERING (FUZZY AND CO -CLUSTERING) Review: Basic Cluster Analysis Methods (Chap. 10) Cluster Analysis: Basic Concepts

More information

Distance-based Methods: Drawbacks

Distance-based Methods: Drawbacks Distance-based Methods: Drawbacks Hard to find clusters with irregular shapes Hard to specify the number of clusters Heuristic: a cluster must be dense Jian Pei: CMPT 459/741 Clustering (3) 1 How to Find

More information

Community Detection. Jian Pei: CMPT 741/459 Clustering (1) 2

Community Detection. Jian Pei: CMPT 741/459 Clustering (1) 2 Clustering Community Detection http://image.slidesharecdn.com/communitydetectionitilecturejune0-0609559-phpapp0/95/community-detection-in-social-media--78.jpg?cb=3087368 Jian Pei: CMPT 74/459 Clustering

More information

Information Retrieval and Web Search Engines

Information Retrieval and Web Search Engines Information Retrieval and Web Search Engines Lecture 7: Document Clustering December 4th, 2014 Wolf-Tilo Balke and José Pinto Institut für Informationssysteme Technische Universität Braunschweig The Cluster

More information

INF4820 Algorithms for AI and NLP. Evaluating Classifiers Clustering

INF4820 Algorithms for AI and NLP. Evaluating Classifiers Clustering INF4820 Algorithms for AI and NLP Evaluating Classifiers Clustering Erik Velldal & Stephan Oepen Language Technology Group (LTG) September 23, 2015 Agenda Last week Supervised vs unsupervised learning.

More information

Clustering and Visualisation of Data

Clustering and Visualisation of Data Clustering and Visualisation of Data Hiroshi Shimodaira January-March 28 Cluster analysis aims to partition a data set into meaningful or useful groups, based on distances between data points. In some

More information

University of Florida CISE department Gator Engineering. Clustering Part 2

University of Florida CISE department Gator Engineering. Clustering Part 2 Clustering Part 2 Dr. Sanjay Ranka Professor Computer and Information Science and Engineering University of Florida, Gainesville Partitional Clustering Original Points A Partitional Clustering Hierarchical

More information

Table Of Contents: xix Foreword to Second Edition

Table Of Contents: xix Foreword to Second Edition Data Mining : Concepts and Techniques Table Of Contents: Foreword xix Foreword to Second Edition xxi Preface xxiii Acknowledgments xxxi About the Authors xxxv Chapter 1 Introduction 1 (38) 1.1 Why Data

More information

Chapter 2 Basic Structure of High-Dimensional Spaces

Chapter 2 Basic Structure of High-Dimensional Spaces Chapter 2 Basic Structure of High-Dimensional Spaces Data is naturally represented geometrically by associating each record with a point in the space spanned by the attributes. This idea, although simple,

More information

ISSN: [Saurkar* et al., 6(4): April, 2017] Impact Factor: 4.116

ISSN: [Saurkar* et al., 6(4): April, 2017] Impact Factor: 4.116 IJESRT INTERNATIONAL JOURNAL OF ENGINEERING SCIENCES & RESEARCH TECHNOLOGY AN OVERVIEW ON DIFFERENT CLUSTERING METHODS USED IN DATA MINING Anand V. Saurkar *, Shweta A. Gode * Department of Computer Science

More information

MultiDimensional Signal Processing Master Degree in Ingegneria delle Telecomunicazioni A.A

MultiDimensional Signal Processing Master Degree in Ingegneria delle Telecomunicazioni A.A MultiDimensional Signal Processing Master Degree in Ingegneria delle Telecomunicazioni A.A. 205-206 Pietro Guccione, PhD DEI - DIPARTIMENTO DI INGEGNERIA ELETTRICA E DELL INFORMAZIONE POLITECNICO DI BARI

More information

Hierarchical Clustering

Hierarchical Clustering What is clustering Partitioning of a data set into subsets. A cluster is a group of relatively homogeneous cases or observations Hierarchical Clustering Mikhail Dozmorov Fall 2016 2/61 What is clustering

More information

Cluster Analysis. Angela Montanari and Laura Anderlucci

Cluster Analysis. Angela Montanari and Laura Anderlucci Cluster Analysis Angela Montanari and Laura Anderlucci 1 Introduction Clustering a set of n objects into k groups is usually moved by the aim of identifying internally homogenous groups according to a

More information

Notes. Reminder: HW2 Due Today by 11:59PM. Review session on Thursday. Midterm next Tuesday (10/10/2017)

Notes. Reminder: HW2 Due Today by 11:59PM. Review session on Thursday. Midterm next Tuesday (10/10/2017) 1 Notes Reminder: HW2 Due Today by 11:59PM TA s note: Please provide a detailed ReadMe.txt file on how to run the program on the STDLINUX. If you installed/upgraded any package on STDLINUX, you should

More information

Unsupervised Learning. Presenter: Anil Sharma, PhD Scholar, IIIT-Delhi

Unsupervised Learning. Presenter: Anil Sharma, PhD Scholar, IIIT-Delhi Unsupervised Learning Presenter: Anil Sharma, PhD Scholar, IIIT-Delhi Content Motivation Introduction Applications Types of clustering Clustering criterion functions Distance functions Normalization Which

More information

Clustering. Informal goal. General types of clustering. Applications: Clustering in information search and analysis. Example applications in search

Clustering. Informal goal. General types of clustering. Applications: Clustering in information search and analysis. Example applications in search Informal goal Clustering Given set of objects and measure of similarity between them, group similar objects together What mean by similar? What is good grouping? Computation time / quality tradeoff 1 2

More information

Unsupervised Learning Partitioning Methods

Unsupervised Learning Partitioning Methods Unsupervised Learning Partitioning Methods Road Map 1. Basic Concepts 2. K-Means 3. K-Medoids 4. CLARA & CLARANS Cluster Analysis Unsupervised learning (i.e., Class label is unknown) Group data to form

More information

Data Mining Algorithms

Data Mining Algorithms for the original version: -JörgSander and Martin Ester - Jiawei Han and Micheline Kamber Data Management and Exploration Prof. Dr. Thomas Seidl Data Mining Algorithms Lecture Course with Tutorials Wintersemester

More information

University of Florida CISE department Gator Engineering. Clustering Part 5

University of Florida CISE department Gator Engineering. Clustering Part 5 Clustering Part 5 Dr. Sanjay Ranka Professor Computer and Information Science and Engineering University of Florida, Gainesville SNN Approach to Clustering Ordinary distance measures have problems Euclidean

More information

Network Traffic Measurements and Analysis

Network Traffic Measurements and Analysis DEIB - Politecnico di Milano Fall, 2017 Introduction Often, we have only a set of features x = x 1, x 2,, x n, but no associated response y. Therefore we are not interested in prediction nor classification,

More information

Hierarchical Clustering

Hierarchical Clustering Hierarchical Clustering Produces a set of nested clusters organized as a hierarchical tree Can be visualized as a dendrogram A tree like diagram that records the sequences of merges or splits 0 0 0 00

More information

Cluster Analysis: Basic Concepts and Algorithms

Cluster Analysis: Basic Concepts and Algorithms 7 Cluster Analysis: Basic Concepts and Algorithms Cluster analysis divides data into groups (clusters) that are meaningful, useful, or both. If meaningful groups are the goal, then the clusters should

More information

10701 Machine Learning. Clustering

10701 Machine Learning. Clustering 171 Machine Learning Clustering What is Clustering? Organizing data into clusters such that there is high intra-cluster similarity low inter-cluster similarity Informally, finding natural groupings among

More information

Exploratory data analysis for microarrays

Exploratory data analysis for microarrays Exploratory data analysis for microarrays Jörg Rahnenführer Computational Biology and Applied Algorithmics Max Planck Institute for Informatics D-66123 Saarbrücken Germany NGFN - Courses in Practical DNA

More information

Contents. Foreword to Second Edition. Acknowledgments About the Authors

Contents. Foreword to Second Edition. Acknowledgments About the Authors Contents Foreword xix Foreword to Second Edition xxi Preface xxiii Acknowledgments About the Authors xxxi xxxv Chapter 1 Introduction 1 1.1 Why Data Mining? 1 1.1.1 Moving toward the Information Age 1

More information

Statistics 202: Data Mining. c Jonathan Taylor. Clustering Based in part on slides from textbook, slides of Susan Holmes.

Statistics 202: Data Mining. c Jonathan Taylor. Clustering Based in part on slides from textbook, slides of Susan Holmes. Clustering Based in part on slides from textbook, slides of Susan Holmes December 2, 2012 1 / 1 Clustering Clustering Goal: Finding groups of objects such that the objects in a group will be similar (or

More information

Data Informatics. Seon Ho Kim, Ph.D.

Data Informatics. Seon Ho Kim, Ph.D. Data Informatics Seon Ho Kim, Ph.D. seonkim@usc.edu Clustering Overview Supervised vs. Unsupervised Learning Supervised learning (classification) Supervision: The training data (observations, measurements,

More information

Information Retrieval and Web Search Engines

Information Retrieval and Web Search Engines Information Retrieval and Web Search Engines Lecture 7: Document Clustering May 25, 2011 Wolf-Tilo Balke and Joachim Selke Institut für Informationssysteme Technische Universität Braunschweig Homework

More information

Hard clustering. Each object is assigned to one and only one cluster. Hierarchical clustering is usually hard. Soft (fuzzy) clustering

Hard clustering. Each object is assigned to one and only one cluster. Hierarchical clustering is usually hard. Soft (fuzzy) clustering An unsupervised machine learning problem Grouping a set of objects in such a way that objects in the same group (a cluster) are more similar (in some sense or another) to each other than to those in other

More information

Olmo S. Zavala Romero. Clustering Hierarchical Distance Group Dist. K-means. Center of Atmospheric Sciences, UNAM.

Olmo S. Zavala Romero. Clustering Hierarchical Distance Group Dist. K-means. Center of Atmospheric Sciences, UNAM. Center of Atmospheric Sciences, UNAM November 16, 2016 Cluster Analisis Cluster analysis or clustering is the task of grouping a set of objects in such a way that objects in the same group (called a cluster)

More information

Data Mining Cluster Analysis: Advanced Concepts and Algorithms. Lecture Notes for Chapter 8. Introduction to Data Mining, 2 nd Edition

Data Mining Cluster Analysis: Advanced Concepts and Algorithms. Lecture Notes for Chapter 8. Introduction to Data Mining, 2 nd Edition Data Mining Cluster Analysis: Advanced Concepts and Algorithms Lecture Notes for Chapter 8 Introduction to Data Mining, 2 nd Edition by Tan, Steinbach, Karpatne, Kumar Outline Prototype-based Fuzzy c-means

More information

Analysis and Extensions of Popular Clustering Algorithms

Analysis and Extensions of Popular Clustering Algorithms Analysis and Extensions of Popular Clustering Algorithms Renáta Iváncsy, Attila Babos, Csaba Legány Department of Automation and Applied Informatics and HAS-BUTE Control Research Group Budapest University

More information

Getting to Know Your Data

Getting to Know Your Data Chapter 2 Getting to Know Your Data 2.1 Exercises 1. Give three additional commonly used statistical measures (i.e., not illustrated in this chapter) for the characterization of data dispersion, and discuss

More information

Classification. Vladimir Curic. Centre for Image Analysis Swedish University of Agricultural Sciences Uppsala University

Classification. Vladimir Curic. Centre for Image Analysis Swedish University of Agricultural Sciences Uppsala University Classification Vladimir Curic Centre for Image Analysis Swedish University of Agricultural Sciences Uppsala University Outline An overview on classification Basics of classification How to choose appropriate

More information

Clustering. Lecture 6, 1/24/03 ECS289A

Clustering. Lecture 6, 1/24/03 ECS289A Clustering Lecture 6, 1/24/03 What is Clustering? Given n objects, assign them to groups (clusters) based on their similarity Unsupervised Machine Learning Class Discovery Difficult, and maybe ill-posed

More information

Clustering. CS294 Practical Machine Learning Junming Yin 10/09/06

Clustering. CS294 Practical Machine Learning Junming Yin 10/09/06 Clustering CS294 Practical Machine Learning Junming Yin 10/09/06 Outline Introduction Unsupervised learning What is clustering? Application Dissimilarity (similarity) of objects Clustering algorithm K-means,

More information

CS 2750 Machine Learning. Lecture 19. Clustering. CS 2750 Machine Learning. Clustering. Groups together similar instances in the data sample

CS 2750 Machine Learning. Lecture 19. Clustering. CS 2750 Machine Learning. Clustering. Groups together similar instances in the data sample Lecture 9 Clustering Milos Hauskrecht milos@cs.pitt.edu 539 Sennott Square Clustering Groups together similar instances in the data sample Basic clustering problem: distribute data into k different groups

More information

Finding Clusters 1 / 60

Finding Clusters 1 / 60 Finding Clusters Types of Clustering Approaches: Linkage Based, e.g. Hierarchical Clustering Clustering by Partitioning, e.g. k-means Density Based Clustering, e.g. DBScan Grid Based Clustering 1 / 60

More information

Notes. Reminder: HW2 Due Today by 11:59PM. Review session on Thursday. Midterm next Tuesday (10/09/2018)

Notes. Reminder: HW2 Due Today by 11:59PM. Review session on Thursday. Midterm next Tuesday (10/09/2018) 1 Notes Reminder: HW2 Due Today by 11:59PM TA s note: Please provide a detailed ReadMe.txt file on how to run the program on the STDLINUX. If you installed/upgraded any package on STDLINUX, you should

More information

Clustering Techniques

Clustering Techniques Clustering Techniques Marco BOTTA Dipartimento di Informatica Università di Torino botta@di.unito.it www.di.unito.it/~botta/didattica/clustering.html Data Clustering Outline What is cluster analysis? What

More information

Chapter 4: Text Clustering

Chapter 4: Text Clustering 4.1 Introduction to Text Clustering Clustering is an unsupervised method of grouping texts / documents in such a way that in spite of having little knowledge about the content of the documents, we can

More information

MSA220 - Statistical Learning for Big Data

MSA220 - Statistical Learning for Big Data MSA220 - Statistical Learning for Big Data Lecture 13 Rebecka Jörnsten Mathematical Sciences University of Gothenburg and Chalmers University of Technology Clustering Explorative analysis - finding groups

More information

Unsupervised learning, Clustering CS434

Unsupervised learning, Clustering CS434 Unsupervised learning, Clustering CS434 Unsupervised learning and pattern discovery So far, our data has been in this form: We will be looking at unlabeled data: x 11,x 21, x 31,, x 1 m x 12,x 22, x 32,,

More information

Based on Raymond J. Mooney s slides

Based on Raymond J. Mooney s slides Instance Based Learning Based on Raymond J. Mooney s slides University of Texas at Austin 1 Example 2 Instance-Based Learning Unlike other learning algorithms, does not involve construction of an explicit

More information

INF4820, Algorithms for AI and NLP: Hierarchical Clustering

INF4820, Algorithms for AI and NLP: Hierarchical Clustering INF4820, Algorithms for AI and NLP: Hierarchical Clustering Erik Velldal University of Oslo Sept. 25, 2012 Agenda Topics we covered last week Evaluating classifiers Accuracy, precision, recall and F-score

More information

A Comparative Study of Various Clustering Algorithms in Data Mining

A Comparative Study of Various Clustering Algorithms in Data Mining Available Online at www.ijcsmc.com International Journal of Computer Science and Mobile Computing A Monthly Journal of Computer Science and Information Technology ISSN 2320 088X IMPACT FACTOR: 6.017 IJCSMC,

More information

What to come. There will be a few more topics we will cover on supervised learning

What to come. There will be a few more topics we will cover on supervised learning Summary so far Supervised learning learn to predict Continuous target regression; Categorical target classification Linear Regression Classification Discriminative models Perceptron (linear) Logistic regression

More information