Data Mining. Clustering. Hamid Beigy. Sharif University of Technology. Fall 1394


Table of contents

1 Introduction
2 Data matrix and dissimilarity matrix
3 Proximity Measures
4 Clustering methods
  Partitioning methods
  Hierarchical methods
  Model-based clustering
  Density based clustering
  Grid-based clustering
5 Cluster validation and assessment


Introduction

Clustering is the process of grouping a set of data objects into multiple groups, or clusters, so that objects within a cluster have high similarity to one another but are very dissimilar to objects in other clusters. Dissimilarities and similarities are assessed based on the attribute values describing the objects and often involve distance measures. Clustering as a data mining tool has its roots in many application areas, such as biology, security, business intelligence, and Web search.

Requirements for cluster analysis

Clustering is a challenging research field; the following are its typical requirements:
- Scalability
- Ability to deal with different types of attributes
- Discovery of clusters with arbitrary shape
- Requirements for domain knowledge to determine input parameters
- Ability to deal with noisy data
- Incremental clustering and insensitivity to input order
- Capability of clustering high-dimensionality data
- Constraint-based clustering
- Interpretability and usability

Comparing clustering methods

The clustering methods can be compared using the following aspects:
- The partitioning criteria: In some methods, all the objects are partitioned so that no hierarchy exists among the clusters; other methods decompose the data hierarchically into multiple levels.
- Separation of clusters: In some methods, the data are partitioned into mutually exclusive clusters, while in other methods the clusters may not be exclusive, that is, a data object may belong to more than one cluster.
- Similarity measure: Some methods determine the similarity between two objects by the distance between them, while in other methods the similarity may be defined by connectivity based on density or contiguity.
- Clustering space: Many clustering methods search for clusters within the entire data space. These methods are useful for low-dimensional data sets. With high-dimensional data, however, there can be many irrelevant attributes, which can make similarity measurements unreliable; consequently, clusters found in the full space are often meaningless. It is often better to instead search for clusters within different subspaces of the same data set.


Data matrix and dissimilarity matrix

Suppose that we have n objects described by p attributes. The objects are x_1 = (x_11, x_12, ..., x_1p), x_2 = (x_21, x_22, ..., x_2p), and so on, where x_ij is the value of the j-th attribute for object x_i. For brevity, we hereafter refer to object x_i as object i. The objects may be tuples in a relational database and are also referred to as data samples or feature vectors. Main memory-based clustering and nearest-neighbor algorithms typically operate on either of the following two data structures.

Data matrix: This structure stores the n objects in the form of a table or n x p matrix:

  x_11 ... x_1f ... x_1p
  ...
  x_i1 ... x_if ... x_ip
  ...
  x_n1 ... x_nf ... x_np

Dissimilarity matrix: This structure stores a collection of proximities that are available for all pairs of objects. It is often represented by an n x n matrix or table, with d(i, i) = 0 on the diagonal:

  0       d(1,2)  d(1,3)  ...  d(1,n)
  d(2,1)  0       d(2,3)  ...  d(2,n)
  ...
  d(n,1)  d(n,2)  d(n,3)  ...  0
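
To make the two structures concrete, here is a minimal sketch, assuming NumPy is available and using the Euclidean distance (introduced formally in the next slides), that builds the n x n dissimilarity matrix from an n x p data matrix; the function name and example data are illustrative only.

```python
import numpy as np

def dissimilarity_matrix(X):
    """Build the n x n matrix of pairwise Euclidean distances for an n x p data matrix X."""
    n = X.shape[0]
    D = np.zeros((n, n))
    for i in range(n):
        for j in range(i + 1, n):
            d = np.linalg.norm(X[i] - X[j])   # Euclidean distance between objects i and j
            D[i, j] = D[j, i] = d             # the matrix is symmetric with a zero diagonal
    return D

X = np.array([[1.0, 2.0], [2.0, 2.0], [8.0, 9.0]])  # toy data matrix: 3 objects, 2 attributes
print(dissimilarity_matrix(X))
```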


Proximity Measures

Proximity measures for nominal attributes: The dissimilarity between two objects i and j can be computed as the ratio of mismatches:

  d(i, j) = (p - m) / p

where m is the number of matches (attributes on which i and j are in the same state) and p is the total number of attributes describing the objects.

Proximity measures for binary attributes: Binary attributes are either symmetric or asymmetric. Treating binary attributes as if they were numeric can be misleading, so methods specific to binary data are needed. If all binary attributes are given the same weight, we obtain the 2 x 2 contingency table below, where q is the number of attributes that equal 1 for both objects, r is the number of attributes that equal 1 for object i but 0 for object j, s is the number of attributes that equal 0 for object i but 1 for object j, and t is the number of attributes that equal 0 for both objects. The total number of attributes is p = q + r + s + t.

Contingency table for binary attributes:

                object j: 1   object j: 0   sum
  object i: 1       q             r         q + r
  object i: 0       s             t         s + t
  sum             q + s         r + t         p

For symmetric binary attributes, where each state is equally valuable, dissimilarity is calculated as

  d(i, j) = (r + s) / (q + r + s + t)

For asymmetric binary attributes, where the number of negative matches, t, is considered unimportant and the number of positive matches, q, is important, dissimilarity is calculated as

  d(i, j) = (r + s) / (q + r + s)

The coefficient 1 - d(i, j) is called the Jaccard coefficient.
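
A minimal sketch of the symmetric and asymmetric binary dissimilarities above; the function name and the example vectors are illustrative, not from the lecture.

```python
def binary_dissimilarity(x, y, asymmetric=False):
    """Dissimilarity between two binary vectors via the q, r, s, t contingency counts."""
    q = sum(1 for a, b in zip(x, y) if a == 1 and b == 1)  # 1/1 matches
    r = sum(1 for a, b in zip(x, y) if a == 1 and b == 0)
    s = sum(1 for a, b in zip(x, y) if a == 0 and b == 1)
    t = sum(1 for a, b in zip(x, y) if a == 0 and b == 0)  # 0/0 matches
    if asymmetric:
        return (r + s) / (q + r + s)      # negative matches t are ignored
    return (r + s) / (q + r + s + t)      # symmetric binary dissimilarity

i = [1, 0, 1, 0, 0, 0]
j = [1, 0, 0, 0, 0, 0]
print(binary_dissimilarity(i, j, asymmetric=True))   # 0.5; Jaccard coefficient = 1 - 0.5
```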

Proximity Measures (cont.)

Dissimilarity of numeric attributes: The most popular distance measure is the Euclidean distance

  d(i, j) = sqrt( (x_i1 - x_j1)^2 + (x_i2 - x_j2)^2 + ... + (x_ip - x_jp)^2 )

Another well-known measure is the Manhattan distance

  d(i, j) = |x_i1 - x_j1| + |x_i2 - x_j2| + ... + |x_ip - x_jp|

The Minkowski distance is a generalization of the Euclidean and Manhattan distances

  d(i, j) = ( |x_i1 - x_j1|^h + |x_i2 - x_j2|^h + ... + |x_ip - x_jp|^h )^(1/h)

Dissimilarity of ordinal attributes: We first replace each x_if by its corresponding rank r_if in {1, ..., M_f} and then normalize it using

  z_if = (r_if - 1) / (M_f - 1)

The dissimilarity can then be computed by applying the distance measures for numeric attributes to the z_if.
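
A minimal sketch of the Minkowski distance, with h = 1 giving the Manhattan and h = 2 the Euclidean distance; the function name and example points are illustrative.

```python
def minkowski(x, y, h=2):
    """Minkowski distance of order h between two numeric vectors."""
    return sum(abs(a - b) ** h for a, b in zip(x, y)) ** (1.0 / h)

p, q = [1.0, 2.0, 3.0], [4.0, 6.0, 3.0]
print(minkowski(p, q, h=1))  # Manhattan: 7.0
print(minkowski(p, q, h=2))  # Euclidean: 5.0
```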

Proximity Measures (cont.)

Dissimilarity for attributes of mixed types: A preferable approach is to process all attribute types together, performing a single analysis:

  d(i, j) = ( sum_{f=1}^{p} delta_ij^(f) d_ij^(f) ) / ( sum_{f=1}^{p} delta_ij^(f) )

where the indicator delta_ij^(f) = 0 if either x_if or x_jf is missing, or if x_if = x_jf = 0 and attribute f is asymmetric binary; otherwise delta_ij^(f) = 1. The distance d_ij^(f) is computed based on the type of attribute f.
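
A minimal sketch of the mixed-type combination formula, assuming the per-attribute distances d_ij^(f) have already been computed and scaled to [0, 1]; the attribute-type handling is simplified and all names are illustrative.

```python
def mixed_dissimilarity(per_attr_dist, indicator):
    """Combine per-attribute distances d_f (scaled to [0, 1]) using indicators delta_f."""
    num = sum(d * w for d, w in zip(per_attr_dist, indicator))
    den = sum(indicator)
    return num / den if den > 0 else 0.0  # all attributes missing or ignored: no evidence

# three attributes: one numeric, one nominal, one asymmetric binary with a 0/0 match (delta = 0)
d_f     = [0.25, 1.0, 0.0]
delta_f = [1,    1,   0]
print(mixed_dissimilarity(d_f, delta_f))  # (0.25 + 1.0) / 2 = 0.625
```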


Clustering methods

There are many clustering methods in the literature. It is difficult to provide a crisp categorization because the categories may overlap, so that a method may have features from several categories; furthermore, some applications may have clustering criteria that require the integration of several clustering techniques. In what follows, let D be a data set of n objects to be clustered; an object is described by d variables, where each variable is also called an attribute or a dimension. In general, the major fundamental clustering methods can be classified into the following categories.

Partitioning methods: find mutually exclusive clusters of spherical shape; distance-based; may use the mean or medoid (etc.) to represent a cluster center; effective for small- to medium-size data sets.

Hierarchical methods: the clustering is a hierarchical decomposition (i.e., multiple levels); cannot correct erroneous merges or splits; may incorporate other techniques such as microclustering or consider object linkages.

Density-based methods: can find arbitrarily shaped clusters; clusters are dense regions of objects in space separated by low-density regions; each point must have a minimum number of points within its neighborhood; may filter out outliers.

Grid-based methods: use a multiresolution grid data structure; fast processing time (typically independent of the number of data objects, yet dependent on grid size).

Partitioning methods

The simplest and most fundamental version of cluster analysis is partitioning, which organizes the objects of a set into several exclusive groups or clusters. Formally, given a data set, D, of n objects, and k, the number of clusters to form, a partitioning algorithm organizes the objects into k partitions (k <= n), where each partition represents a cluster. The clusters are formed to optimize an objective partitioning criterion, such as a dissimilarity function based on distance, so that the objects within a cluster are similar to one another and dissimilar to objects in other clusters in terms of the data set attributes.

k-means clustering algorithm

Suppose a data set, D, contains n objects in Euclidean space. Partitioning methods distribute the objects in D into k clusters, C_1, ..., C_k, such that C_i ⊂ D and C_i ∩ C_j = ∅ for 1 <= i, j <= k, i != j. An objective function is used to assess the partitioning quality so that objects within a cluster are similar to one another but dissimilar to objects in other clusters; that is, the objective function aims for high intracluster similarity and low intercluster similarity. A centroid-based partitioning technique uses the centroid of a cluster, C_i, to represent that cluster. The difference between an object p in C_i and µ_i, the representative of the cluster, is measured by ||p - µ_i||. The quality of cluster C_i can be measured by the within-cluster variation, the sum of squared errors between all objects in C_i and the centroid µ_i; summed over all clusters this gives

  E = sum_{i=1}^{k} sum_{p in C_i} ||p - µ_i||^2

k-means clustering algorithm (cont.)

Worked one-dimensional example: data set {2, 3, 4, 10, 11, 12, 20, 25, 30} with k = 2.
Initial means: µ_1 = 2, µ_2 = 4.
Iteration t = 1: µ_1 = 2.5, µ_2 = 16.
Iteration t = 2: µ_1 = 3, µ_2 = 18.
Iteration t = 3: µ_1 = 4.75, µ_2 = 19.60.
Iteration t = 4: µ_1 = 7, µ_2 = 25.
Iteration t = 5: converged.

k-means clustering algorithm (cont.)

The k-means method is not guaranteed to converge to the global optimum and often terminates at a local optimum; the results may depend on the initial random selection of cluster centers. To obtain good results in practice, it is common to run the k-means algorithm multiple times with different initial cluster centers. The time complexity of the k-means algorithm is O(nkt), where n is the total number of objects, k is the number of clusters, and t is the number of iterations. Normally, k << n and t << n, so the method is relatively scalable and efficient in processing large data sets. There are several variants of the k-means method, which can differ in the selection of the initial k means, the calculation of dissimilarity, and the strategies for calculating cluster means. The k-modes method is a variant of k-means that extends the k-means paradigm to cluster nominal data by replacing the means of clusters with modes. Partitioning Around Medoids (PAM) is a realization of the k-medoids method.
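
A minimal sketch of the basic k-means iteration described above, assuming NumPy; initialization and empty-cluster handling are simplified, and all names are illustrative.

```python
import numpy as np

def kmeans(X, k, iters=100, seed=0):
    """Basic k-means: returns cluster labels and centroids for an n x p data matrix X."""
    rng = np.random.default_rng(seed)
    centers = X[rng.choice(len(X), size=k, replace=False)]   # random initial centroids
    for _ in range(iters):
        # assignment step: each object goes to its nearest centroid
        dists = np.linalg.norm(X[:, None, :] - centers[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # update step: recompute each centroid as the mean of its cluster
        new_centers = np.array([X[labels == j].mean(axis=0) if np.any(labels == j)
                                else centers[j] for j in range(k)])
        if np.allclose(new_centers, centers):   # converged
            break
        centers = new_centers
    return labels, centers

X = np.array([[2.], [3.], [4.], [10.], [11.], [12.], [20.], [25.], [30.]])
labels, centers = kmeans(X, k=2)
print(labels, centers.ravel())
```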

Hierarchical methods

A hierarchical clustering method works by grouping data objects into a hierarchy or tree of clusters. An agglomerative method (e.g., AGNES) starts with each object in its own cluster and successively merges clusters, while a divisive method (e.g., DIANA) starts with all objects in one cluster and successively splits it.

[Figure 10.6: agglomerative and divisive hierarchical clustering on data objects {a, b, c, d, e}, with successive merges {a, b}, {d, e}, {c, d, e}, {a, b, c, d, e}.]

Distance measures in hierarchical methods

Whether an agglomerative or a divisive method is used, a core need is to measure the distance between two clusters, where each cluster is generally a set of objects. Four widely used measures for the distance between clusters are listed below (a small computation sketch follows the list), where ||p - q|| is the distance between two objects or points p and q, µ_i is the mean of cluster C_i, and n_i is the number of objects in C_i. They are also known as linkage measures.

Minimum distance: d_min(C_i, C_j) = min_{p in C_i, q in C_j} ||p - q||
Maximum distance: d_max(C_i, C_j) = max_{p in C_i, q in C_j} ||p - q||
Mean distance: d_mean(C_i, C_j) = ||µ_i - µ_j||
Average distance: d_avg(C_i, C_j) = (1 / (n_i n_j)) sum_{p in C_i, q in C_j} ||p - q||
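
A minimal sketch computing the four linkage measures for two clusters, assuming NumPy; names and example clusters are illustrative.

```python
import numpy as np

def linkage_distances(Ci, Cj):
    """Return (min, max, mean, average) distances between clusters Ci and Cj (arrays of points)."""
    pair = np.linalg.norm(Ci[:, None, :] - Cj[None, :, :], axis=2)  # all pairwise ||p - q||
    d_min = pair.min()
    d_max = pair.max()
    d_mean = np.linalg.norm(Ci.mean(axis=0) - Cj.mean(axis=0))      # distance between centroids
    d_avg = pair.mean()                                             # average over all n_i * n_j pairs
    return d_min, d_max, d_mean, d_avg

Ci = np.array([[0.0, 0.0], [1.0, 0.0]])
Cj = np.array([[4.0, 0.0], [6.0, 0.0]])
print(linkage_distances(Ci, Cj))  # (3.0, 6.0, 4.5, 4.5)
```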

Hierarchical methods (cont.)

[Figure: dendrogram representation for hierarchical clustering of data objects {a, b, c, d, e}, showing levels l = 0 to l = 4 against a similarity scale from 1.0 down to 0.0.]

In a single-linkage approach, each cluster is represented by all of the objects in the cluster, and the similarity between two clusters is measured by the similarity of the closest pair of data points belonging to different clusters.

Model-based clustering

k-means is closely related to a probabilistic model known as the Gaussian mixture model:

  p(x) = sum_{k=1}^{K} pi_k N(x | µ_k, Σ_k)

where pi_k, µ_k, and Σ_k are parameters. The pi_k are called mixing proportions, and each Gaussian N(x | µ_k, Σ_k) is called a mixture component. The model is simply a weighted sum of Gaussians, but it is much more powerful than a single Gaussian because it can model multi-modal distributions. [Figure: example of a mixture of three Gaussians.] Note that for p(x) to be a probability distribution, we require that sum_k pi_k = 1 and that pi_k > 0 for all k; thus, we may interpret the pi_k as probabilities themselves. The full set of parameters is θ = {{pi_k}, {µ_k}, {Σ_k}}.

Model-based clustering (cont.)

Let X = {x_1, ..., x_n} be drawn i.i.d. from a mixture of Gaussians. The log-likelihood of the observations equals

  ln p(X | θ) = sum_{i=1}^{n} ln ( sum_{j=1}^{k} pi_j N(x_i | µ_j, Σ_j) )

Setting the derivative of ln p(X | θ) with respect to µ_j equal to zero, we obtain

  0 = sum_{i=1}^{n} [ pi_j N(x_i | µ_j, Σ_j) / sum_{l=1}^{k} pi_l N(x_i | µ_l, Σ_l) ] Σ_j (x_i - µ_j)

Let

  γ(z_ij) = pi_j N(x_i | µ_j, Σ_j) / sum_{l=1}^{k} pi_l N(x_i | µ_l, Σ_l)

Model-based clustering (cont.)

We had

  0 = sum_{i=1}^{n} [ pi_j N(x_i | µ_j, Σ_j) / sum_{l=1}^{k} pi_l N(x_i | µ_l, Σ_l) ] Σ_j (x_i - µ_j)

Multiplying by Σ_j^{-1} and rearranging, we obtain

  µ_j = (1 / n_j) sum_{i=1}^{n} γ(z_ij) x_i,   where n_j = sum_{i=1}^{n} γ(z_ij)

Similarly to the above step, we obtain

  Σ_j = (1 / n_j) sum_{i=1}^{n} γ(z_ij) (x_i - µ_j)(x_i - µ_j)^T

  pi_j = n_j / n

Please read Section 9.2 of Bishop.
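
A minimal sketch of these EM updates for a Gaussian mixture, assuming NumPy and SciPy's multivariate_normal for the component densities; initialization and convergence checks are simplified, and all names are illustrative.

```python
import numpy as np
from scipy.stats import multivariate_normal

def em_gmm(X, k, iters=50, seed=0):
    """EM for a Gaussian mixture model: returns mixing proportions, means, covariances."""
    n, d = X.shape
    rng = np.random.default_rng(seed)
    pi = np.full(k, 1.0 / k)
    mu = X[rng.choice(n, size=k, replace=False)]             # initialize means from data points
    sigma = np.array([np.cov(X.T) + 1e-6 * np.eye(d)] * k)
    for _ in range(iters):
        # E-step: responsibilities gamma(z_ij)
        dens = np.column_stack([pi[j] * multivariate_normal.pdf(X, mu[j], sigma[j])
                                for j in range(k)])          # n x k
        gamma = dens / dens.sum(axis=1, keepdims=True)
        # M-step: update n_j, mu_j, Sigma_j, pi_j
        n_j = gamma.sum(axis=0)
        mu = (gamma.T @ X) / n_j[:, None]
        for j in range(k):
            diff = X - mu[j]
            sigma[j] = (gamma[:, j, None] * diff).T @ diff / n_j[j] + 1e-6 * np.eye(d)
        pi = n_j / n
    return pi, mu, sigma

X = np.vstack([np.random.randn(100, 2), np.random.randn(100, 2) + 5.0])
print(em_gmm(X, k=2)[1])   # estimated component means, roughly (0, 0) and (5, 5)
```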

Density based clustering

[Figure 10.13: clusters of arbitrary shape.]

The general idea of density-based methods is to continue growing a given cluster as long as the density in its neighborhood exceeds some threshold. How can we find dense regions in density-based clustering? The density of an object o can be measured by the number of objects close to o. DBSCAN (Density-Based Spatial Clustering of Applications with Noise) finds core objects, that is, objects that have dense neighborhoods, and connects core objects and their neighborhoods to form dense regions as clusters.

Density based clustering (cont.)

How does DBSCAN quantify the neighborhood of an object? A user-specified parameter ε > 0 is used to specify the radius of the neighborhood we consider for every object.

Definition (ε-neighborhood): The ε-neighborhood of an object x is the space within a radius ε centered at x.

Due to the fixed neighborhood size parameterized by ε, the density of a neighborhood can be measured simply by the number of objects in the neighborhood. To determine whether a neighborhood is dense or not, DBSCAN uses another user-specified parameter, MinPts.

Definition (Core object): An object is a core object if the ε-neighborhood of the object contains at least MinPts objects.

[Figure 15.1: a density-based data set; a core object x whose ε-neighborhood contains the objects x, y, and z.]

Density based clustering (cont.)

Given a set, D, of objects, we can identify all core objects with respect to the given parameters, ε and MinPts. The clustering task is thereby reduced to using core objects and their neighborhoods to form dense regions, where the dense regions are clusters.

Definition (Directly density-reachable): For a core object q and an object p, we say that p is directly density-reachable from q (with respect to ε and MinPts) if p is within the ε-neighborhood of q.

An object p is directly density-reachable from another object q if and only if q is a core object and p is in the ε-neighborhood of q.

[Figure 10.14: density-reachability and density-connectivity in density-based clustering, with objects q, m, p, s, r, o and MinPts = 3; based on Ester, Kriegel, Sander, and Xu (1996).]

Density based clustering (cont.)

How can we assemble a large dense region using small dense regions centered by core objects?

Definition (Density-reachable): An object p is density-reachable from q (with respect to ε and MinPts in D) if there is a chain of objects p_1, ..., p_n such that p_1 = q, p_n = p, and p_{i+1} is directly density-reachable from p_i with respect to ε and MinPts, for 1 <= i < n, p_i in D.

Density based clustering (cont.)

To connect core objects as well as their neighbors in a dense region, DBSCAN uses the notion of density-connectedness.

Definition (Density-connected): Two objects p_1, p_2 in D are density-connected with respect to ε and MinPts if there is an object q in D such that both p_1 and p_2 are density-reachable from q with respect to ε and MinPts.

Density based clustering (cont.)

How does DBSCAN find clusters? The steps below summarize the algorithm; a code sketch follows the list.

1. Initially, all objects in the data set D are marked as unvisited.
2. DBSCAN randomly selects an unvisited object p, marks p as visited, and checks whether p is a core point or not.
3. If p is not a core point, it is marked as a noise point. Otherwise, a new cluster C is created for p, and all the objects in the ε-neighborhood of p are added to a candidate set N.
4. DBSCAN iteratively adds to C those objects in N that do not belong to any cluster.
5. In this process, for an object p' in N that carries the label "unvisited", DBSCAN marks it as visited and checks its ε-neighborhood.
6. If p' is a core point, the objects in its ε-neighborhood are added to N.
7. DBSCAN continues adding objects to C until C can no longer be expanded, that is, until N is empty. At this point, cluster C is completed and is output.
8. To find the next cluster, DBSCAN randomly selects an unvisited object from the remaining ones.
9. The clustering process continues until all objects are visited.
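
A minimal sketch of this procedure, assuming NumPy and Euclidean distances; a label of -1 marks noise, and all names are illustrative rather than the lecture's notation.

```python
import numpy as np

def dbscan(X, eps, min_pts):
    """Basic DBSCAN: returns a label per object (-1 = noise, 0..k-1 = cluster ids)."""
    n = len(X)
    labels = np.full(n, -1)            # -1 also stands for "not yet assigned"
    visited = np.zeros(n, dtype=bool)
    dist = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=2)
    cluster_id = 0
    for i in range(n):
        if visited[i]:
            continue
        visited[i] = True
        neighbors = list(np.where(dist[i] <= eps)[0])
        if len(neighbors) < min_pts:
            continue                   # noise (may later be absorbed by a cluster)
        labels[i] = cluster_id
        seeds = neighbors              # candidate set N
        while seeds:
            j = seeds.pop()
            if not visited[j]:
                visited[j] = True
                j_neighbors = np.where(dist[j] <= eps)[0]
                if len(j_neighbors) >= min_pts:      # j is a core point: expand N
                    seeds.extend(j_neighbors)
            if labels[j] == -1:
                labels[j] = cluster_id               # add j to cluster C
        cluster_id += 1
    return labels

X = np.array([[1, 1], [1, 2], [2, 1], [8, 8], [8, 9], [25, 25]], dtype=float)
print(dbscan(X, eps=2.0, min_pts=2))   # two clusters and one noise point
```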

Density based clustering (example)

[Figure: clusters found by density-based clustering on an example two-dimensional data set.]

Grid-based clustering

The grid-based clustering approach uses a multiresolution grid data structure.

[Figure 10.19: hierarchical structure for STING clustering, showing the first layer, the (i-1)-st layer, and the i-th layer.]
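
A minimal sketch of the basic grid idea (assigning objects to cells of a fixed-resolution grid and keeping per-cell counts), as a single-level stand-in for a STING-like multiresolution structure; the cell size and names are illustrative.

```python
import numpy as np
from collections import Counter

def grid_cells(X, cell_size):
    """Map each object to a grid cell index and count objects per cell."""
    cells = [tuple((x // cell_size).astype(int)) for x in X]   # cell index per object
    return cells, Counter(cells)

X = np.array([[0.2, 0.3], [0.4, 0.1], [5.1, 5.2], [5.3, 5.4]])
cells, counts = grid_cells(X, cell_size=1.0)
print(counts)   # two occupied cells: (0, 0) with 2 objects and (5, 5) with 2 objects
```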


Cluster validation and assessment

Cluster evaluation assesses the feasibility of clustering analysis on a data set and the quality of the results generated by a clustering method. The major tasks of clustering evaluation include the following:

1. Assessing clustering tendency: For a given data set, we assess whether a nonrandom structure exists in the data. Clustering analysis on a data set is meaningful only when there is a nonrandom structure in the data.
2. Determining the number of clusters in a data set: Algorithms such as k-means require the number of clusters as a parameter. Moreover, the number of clusters can be regarded as an interesting and important summary statistic of a data set. Therefore, it is desirable to estimate this number even before a clustering algorithm is used to derive detailed clusters. A simple rule of thumb is to set the number of clusters to about sqrt(n/2) for a data set of n points.
3. Measuring clustering quality: After applying a clustering method to a data set, we want to assess how good the resulting clusters are. There are also measures that score clusterings and thus allow two sets of clustering results on the same data set to be compared.

Assessing clustering tendency

Cluster validation and assessment (cont.)

How good is the clustering generated by a method, and how can we compare the clusterings generated by different methods?

1. Internal criterion: Typical objective functions in clustering formalize the goal of attaining high intra-cluster similarity and low inter-cluster similarity. But good scores on an internal criterion do not necessarily translate into good effectiveness in an application. An alternative to internal criteria is direct evaluation in the application of interest.
2. External criterion: An external criterion evaluates how well the clustering matches the gold-standard classes. The Rand index measures the percentage of decisions that are correct (a small computation sketch follows).
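
A minimal sketch of the Rand index as the fraction of correct pairwise same-cluster/different-cluster decisions against gold-standard labels; the names and example labels are illustrative.

```python
from itertools import combinations

def rand_index(pred, gold):
    """Fraction of object pairs on which the clustering and the gold standard agree."""
    agree = total = 0
    for i, j in combinations(range(len(pred)), 2):
        same_pred = pred[i] == pred[j]
        same_gold = gold[i] == gold[j]
        agree += (same_pred == same_gold)   # both say "same cluster" or both say "different"
        total += 1
    return agree / total

pred = [0, 0, 1, 1, 1]
gold = ['a', 'a', 'b', 'b', 'a']
print(rand_index(pred, gold))   # 6 of 10 pairs agree -> 0.6
```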