CS570: Introduction to Data Mining
Cluster Analysis
Reading: Han, Chapters 10.4, 10.6, 11.1.3; Tan, Chapters 8.4, 8.5, 9.2.2, 9.3
Anca Doloc-Mihu, Ph.D.
Slides courtesy of Li Xiong, Ph.D.; 2011 Han, Kamber & Pei, Data Mining: Concepts and Techniques, Morgan Kaufmann; and 2006 Tan, Steinbach & Kumar, Introduction to Data Mining, Pearson Addison Wesley.
Cluster Analysis Overview
- Partitioning methods
- Hierarchical methods
- Graph-based methods (CHAMELEON)
- Self-organizing maps (SOM)
- Density-based methods
- EM method
- Cluster evaluation
- Outlier analysis
Spatial Data
- A cluster is regarded as a dense region separated by regions of low density
- A cluster can have an arbitrary shape
- Streaks and noise may be present
Density-Based Clustering Methods
- Clustering based on density
- Major features:
  - Clusters of arbitrary shape
  - Handles noise
  - Needs density parameters as a termination condition
- Several interesting studies:
  - DBSCAN: Ester et al. (KDD'96)
  - OPTICS: Ankerst et al. (SIGMOD'99)
  - DENCLUE: Hinneburg & Keim (KDD'98)
  - CLIQUE: Agrawal et al. (SIGMOD'98) (more grid-based)
DBSCAN: Basic Concepts
- Density = number of points within a specified radius
- Core point: has high density
- Border point: has lower density, but lies in the neighborhood of a core point
- Noise point: neither a core point nor a border point
(Figure: core, border, and noise points)
DBSCAN: Definitions
- Two parameters:
  - Eps: the radius of the neighborhood
  - MinPts: the minimum number of points in an Eps-neighborhood of a point
- Eps-neighborhood of a point p: N_Eps(p) = {q ∈ D | dist(p, q) <= Eps}
- Core point: a point q with |N_Eps(q)| >= MinPts
(Example figure: MinPts = 5, Eps = 1 cm)
DBSCAN: Definitions (cont.)
- Directly density-reachable: p is directly density-reachable from q if p belongs to N_Eps(q) and q is a core point
- Density-reachable: p is density-reachable from q if there is a chain of points p_1, ..., p_n with p_1 = q and p_n = p such that each p_{i+1} is directly density-reachable from p_i
- Density-connected: p and q are density-connected if there is a point o such that both p and q are density-reachable from o w.r.t. Eps and MinPts
(Example figures: MinPts = 5, Eps = 1 cm)
DBSCAN: Cluster Definition
- A cluster is defined as a maximal set of density-connected points
(Figure: outlier, border, and core points; Eps = 1 cm, MinPts = 5)
DBSCAN: The Algorithm
- Arbitrarily select an unvisited point p and mark it as visited
- If p is a core point, retrieve all points density-reachable from p w.r.t. Eps and MinPts; together with p they form a cluster
- Otherwise, mark p as noise and visit the next unvisited point in the database
- Continue until all points have been processed
(A minimal sketch of the algorithm follows.)
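To make the procedure concrete, here is a minimal Python sketch of DBSCAN, assuming numpy and a brute-force Eps-neighborhood query; the names and structure are illustrative, not the original implementation:

```python
# Minimal DBSCAN sketch (illustrative, not the original implementation).
import numpy as np

def dbscan(X, eps, min_pts):
    n = len(X)
    labels = np.full(n, -1)            # -1 = noise (may be relabeled as border)
    visited = np.zeros(n, dtype=bool)
    cluster_id = 0

    def neighbors(i):                  # brute-force Eps-neighborhood query
        return np.where(np.linalg.norm(X - X[i], axis=1) <= eps)[0]

    for p in range(n):
        if visited[p]:
            continue
        visited[p] = True
        if len(neighbors(p)) < min_pts:
            continue                   # not a core point: tentatively noise
        seeds = list(neighbors(p))     # p is a core point: start a cluster
        labels[p] = cluster_id
        while seeds:                   # expand to all density-reachable points
            q = seeds.pop()
            if labels[q] == -1:
                labels[q] = cluster_id # border or core point joins the cluster
            if not visited[q]:
                visited[q] = True
                q_nbrs = neighbors(q)
                if len(q_nbrs) >= min_pts:
                    seeds.extend(q_nbrs)   # q is a core point: keep expanding
        cluster_id += 1
    return labels
```

Each neighborhood query here costs O(n), which yields the O(n^2) total cost noted later; an index structure brings the query cost down.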
DBSCAN: Sensitive to Parameters
(Figure: clustering results on the same data under different Eps and MinPts settings)
DBSCAN: Determining Eps and MinPts
- Basic idea: suppose the neighborhood size is k
  - For points within a cluster, the k-th nearest neighbors are at roughly the same distance
  - Noise points have their k-th nearest neighbor at a farther distance
- Plot the sorted distance of every point to its k-th nearest neighbor; the knee of the curve suggests a value for Eps (see the sketch below)
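A sketch of this heuristic, assuming numpy, matplotlib, and scikit-learn are available; the choice k = 4 is illustrative (k is usually set to MinPts):

```python
# k-distance plot: sort every point's distance to its k-th nearest neighbor
# and look for the knee, which suggests a value for Eps.
import numpy as np
import matplotlib.pyplot as plt
from sklearn.neighbors import NearestNeighbors

def k_distance_plot(X, k=4):
    nn = NearestNeighbors(n_neighbors=k + 1).fit(X)  # +1: a point is its own 0-th NN
    dist, _ = nn.kneighbors(X)
    kth = np.sort(dist[:, k])          # sorted k-th NN distance of every point
    plt.plot(kth)
    plt.xlabel("points sorted by distance")
    plt.ylabel(f"{k}-th nearest neighbor distance")
    plt.show()
```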
DBSCAN: Core, Border and Noise Points
(Figure: original points and their classification into core, border, and noise points; Eps = 10, MinPts = 4)
When DBSCAN Does NOT Work Well
- Varying densities (Figure: original points clustered with MinPts = 4 at Eps = 9.75 and at Eps = 9.92)
- High-dimensional data
DBSCAN: Features
- Complexity: O(n^2); can be reduced to O(n log n) using index structures
- Advantages:
  - Does not require the number of clusters in advance (vs. k-means)
  - Can find arbitrarily shaped clusters
  - Can identify noise
  - Mostly insensitive to the ordering of the points
- Disadvantages:
  - Sensitive to parameters
  - Does not handle data sets with varying densities well
(A library usage sketch follows.)
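In practice the index-structure speedup comes from a library implementation; a brief usage sketch with scikit-learn's DBSCAN (the data set and parameter values are illustrative):

```python
# DBSCAN via scikit-learn, using a k-d tree for neighborhood queries.
from sklearn.cluster import DBSCAN
from sklearn.datasets import make_moons

X, _ = make_moons(n_samples=300, noise=0.05, random_state=0)
labels = DBSCAN(eps=0.2, min_samples=5, algorithm="kd_tree").fit_predict(X)
# labels == -1 marks noise; the two non-convex "moons" come out as clusters,
# something k-means cannot do.
```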
OPTICS: A Cluster-Ordering Method (1999)
- OPTICS: Ordering Points To Identify the Clustering Structure; Ankerst, Breunig, Kriegel, and Sander (SIGMOD'99)
- Produces a special ordering of the database w.r.t. its density-based clustering structure
- This cluster ordering contains information equivalent to the density-based clusterings corresponding to a broad range of parameter settings
- Good for both automatic and interactive cluster analysis, including finding the intrinsic clustering structure
- Can be represented graphically or using visualization techniques
OPTICS: Some Extensions from DBSCAN
- Core distance of p: the smallest distance that makes p a core point
- Reachability distance of p w.r.t. o: max{core-distance(o), dist(o, p)}
(Example figure: MinPts = 5, ε = 3 cm; r(p1, o) = 2.8 cm, r(p2, o) = 4 cm)
OPTICS: The Algorithm
- Arbitrarily select an unvisited point p and mark it as visited
- If p is a core point, retrieve all points density-reachable from p w.r.t. Eps and MinPts, updating each point's reachability distance, and output the points in ascending order of reachability distance
- Otherwise, visit the next unvisited point in the database
- Continue until all points have been processed
OPTICS: Example
(Figure: reachability plot, i.e., reachability distance against the cluster order of the objects; the distance is undefined for the first point, and valleys correspond to clusters)
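A sketch of producing such a reachability plot with scikit-learn's OPTICS implementation (not the original code); the data set and parameters are illustrative:

```python
# Reachability plot via scikit-learn's OPTICS.
import matplotlib.pyplot as plt
from sklearn.cluster import OPTICS
from sklearn.datasets import make_blobs

X, _ = make_blobs(n_samples=500, centers=4, random_state=0)
optics = OPTICS(min_samples=5).fit(X)
plt.plot(optics.reachability_[optics.ordering_])  # first value is inf (undefined)
plt.xlabel("cluster order of the objects")
plt.ylabel("reachability distance")
plt.show()                                        # valleys correspond to clusters
```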
Cluster Analysis Overview
- Partitioning methods
- Hierarchical methods
- Graph-based methods (CHAMELEON)
- Self-organizing maps (SOM)
- Density-based methods
- Other: EM method, COBWEB
- Cluster evaluation
- Outlier analysis
Probabilistic Model-Based Clustering
- Assume the data are generated by a mathematical model
- Attempt to optimize the fit between the given data and the model
- Typical methods:
  - Statistical approach: EM (Expectation Maximization)
  - Machine learning approach: COBWEB
  - Neural network approach: SOM (Self-Organizing Feature Map)
Clustering by Mixture Model
- Assume the data are generated by a mixture of probabilistic models
- A generalization of k-means
- Each cluster can be represented by a probabilistic model, e.g., a Gaussian (continuous) or a Poisson (discrete) distribution
Expectation Maximization (EM)
- Start with an initial estimate of the parameters of the mixture model
- Iteratively refine the parameters:
  - Expectation step: compute the expected (probabilistic) membership of each data point x_i in each cluster C_j
  - Maximization step: compute maximum-likelihood estimates of the parameters given those memberships
- Repeat until the parameters converge, i.e., the change falls below a threshold
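A minimal sketch of EM-based mixture clustering with scikit-learn's GaussianMixture, which alternates the E and M steps above until the log-likelihood gain falls below tol or max_iter is reached; the data and parameter values are illustrative:

```python
# Gaussian mixture clustering via EM.
from sklearn.datasets import make_blobs
from sklearn.mixture import GaussianMixture

X, _ = make_blobs(n_samples=500, centers=3, random_state=0)
gmm = GaussianMixture(n_components=3, tol=1e-3, max_iter=100).fit(X)
resp = gmm.predict_proba(X)   # E-step output: soft cluster memberships
labels = gmm.predict(X)       # hard assignment to the most likely component
```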
Conceptual Clustering
- Conceptual clustering:
  - Generates a concept description for each concept (class)
  - Produces a hierarchical category or classification scheme
  - Related to decision tree learning and mixture model learning
- COBWEB (Fisher, 1987):
  - A popular and simple method of incremental conceptual learning
  - Creates a hierarchical clustering in the form of a classification tree
  - Each node refers to a concept and contains a probabilistic description of that concept
COBWEB Classification Tree
(Figure: an example classification tree)
COBWEB: Learning the Classification Tree
- Incrementally builds the classification tree
- Given a new object:
  - Search for the best node at which to incorporate the object, or add a new node for the object
  - Update the probabilistic description at each node
  - Merge and split nodes as needed
- Uses a heuristic measure, category utility, to guide construction of the tree (a common statement of the measure follows)
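For reference, a common statement of category utility (the Gluck-Corter form used by COBWEB; quoted from memory, so treat the exact normalization as a sketch), where C_1, ..., C_k are the children of a node and A_i = v_ij ranges over attribute-value pairs:

```latex
CU(\{C_1,\dots,C_k\}) = \frac{1}{k}\sum_{l=1}^{k} P(C_l)
  \left[\sum_i \sum_j P(A_i = v_{ij} \mid C_l)^2
      - \sum_i \sum_j P(A_i = v_{ij})^2\right]
```

The bracketed term rewards partitions whose clusters make attribute values more predictable than they are in the parent node.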
COBWEB: Comments
- Limitations:
  - The assumption that attributes are independent of each other is often too strong, since correlations may exist
  - Not suitable for clustering large databases: the tree may become skewed, and maintaining the probability distributions is expensive
Cluster Analysis Overview
- Partitioning methods
- Hierarchical methods
- Graph-based methods (CHAMELEON)
- Self-organizing maps (SOM)
- Density-based methods
- Other: EM method, COBWEB
- Cluster evaluation
- Outlier analysis
Cluster Evaluation
- Determine the clustering tendency of the data, i.e., distinguish whether non-random structure exists
- Determine the correct number of clusters
- Evaluate how well the clustering results fit the data without external information
- Evaluate how well the clustering results match externally known results
- Compare different clustering algorithms/results
Clusters Found in Random Data
(Figure: four panels over the same uniformly random points — the raw random points, and the "clusters" reported on them by DBSCAN, K-means, and complete link)
Measures of Cluster Validity
- Unsupervised (internal indices): measure the goodness of a clustering structure without respect to external information, e.g., sum of squared error (SSE)
- Supervised (external indices): measure the extent to which cluster labels match externally supplied class labels, e.g., entropy
- Relative: compare two different clustering results; often an external or internal index is used for this function, e.g., SSE or entropy
Internal Measures: Cohesion and Separation
- Cluster cohesion: how closely related the objects in a cluster are
- Cluster separation: how distinct or well-separated a cluster is from other clusters
- Example (squared error):
  - Cohesion: within-cluster sum of squares, WSS = \sum_i \sum_{x \in C_i} (x - m_i)^2
  - Separation: between-cluster sum of squares, BSS = \sum_i |C_i| (m - m_i)^2
  where m_i is the centroid of cluster C_i and m is the overall mean
Internal Measures: Cohesion and Separation (Graph View)
- Cluster cohesion is the sum of the weights of all links within a cluster
- Cluster separation is the sum of the weights of the links between nodes in the cluster and nodes outside the cluster
Internal Measures: Cohesion and Separation (Example)
Example (SSE for the 1-D points 1, 2, 4, 5; overall mean m = 3): BSS + WSS = constant
- K = 1 cluster:
  WSS = (1-3)^2 + (2-3)^2 + (4-3)^2 + (5-3)^2 = 10
  BSS = 4 \times (3-3)^2 = 0
  Total = 10 + 0 = 10
- K = 2 clusters ({1, 2} with m_1 = 1.5, and {4, 5} with m_2 = 4.5):
  WSS = (1-1.5)^2 + (2-1.5)^2 + (4-4.5)^2 + (5-4.5)^2 = 1
  BSS = 2 \times (3-1.5)^2 + 2 \times (4.5-3)^2 = 9
  Total = 1 + 9 = 10
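The arithmetic can be checked mechanically; a small sketch (the data and helper name are illustrative):

```python
# Verify BSS + WSS = constant for the 1-D points {1, 2, 4, 5}.
import numpy as np

X = np.array([1.0, 2.0, 4.0, 5.0])
m = X.mean()                                   # overall mean = 3

def wss_bss(clusters):
    wss = sum(((c - c.mean()) ** 2).sum() for c in clusters)
    bss = sum(len(c) * (m - c.mean()) ** 2 for c in clusters)
    return wss, bss

print(wss_bss([X]))                # K=1: (10.0, 0.0)
print(wss_bss([X[:2], X[2:]]))     # K=2: (1.0, 9.0); total is 10 in both cases
```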
Internal Measures: SSE
- SSE is good for comparing two clusterings
- Can also be used to estimate the number of clusters (see the sketch below)
(Figure: a data set and its SSE plotted against the number of clusters K; the knee of the curve suggests the natural K)
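A sketch of this elbow heuristic using K-means' SSE (exposed as inertia_ in scikit-learn); the data set is illustrative:

```python
# Plot SSE against K and look for the knee.
import matplotlib.pyplot as plt
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

X, _ = make_blobs(n_samples=500, centers=5, random_state=0)
ks = range(1, 11)
sse = [KMeans(n_clusters=k, n_init=10, random_state=0).fit(X).inertia_ for k in ks]
plt.plot(ks, sse, marker="o")
plt.xlabel("K")
plt.ylabel("SSE")
plt.show()     # the knee (here around K = 5) suggests the number of clusters
```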
Internal Measures: SSE (cont.)
Another example on a more complicated data set
(Figure: a data set with seven visually distinct regions, and the SSE of clusters found using K-means)
Statistical Framework for Cluster Validity
- The more atypical a clustering result is, the more likely it represents valid structure in the data
- Use values obtained from random data as a baseline
- Example (statistical framework for SSE): the clustering of the well-separated points has SSE = 0.005, while the SSE of three clusters in each of 500 sets of random data points ranges roughly from 0.016 to 0.034
(Figure: the clustered points, and a histogram of SSE over the 500 random data sets)
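A sketch of the baseline idea: cluster many uniform random data sets of the same size and dimensionality, and compare the real clustering's SSE against the resulting distribution (all values here are illustrative):

```python
# Build a baseline SSE distribution from random data.
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)
baseline = [
    KMeans(n_clusters=3, n_init=10, random_state=0)
    .fit(rng.uniform(size=(100, 2)))
    .inertia_
    for _ in range(500)
]
# An SSE far below min(baseline) is unlikely to arise from random structure.
print(min(baseline), max(baseline))
```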
External Measures
- Compare clustering results with ground truth or a manual clustering
- Classification-oriented measures: entropy, purity, precision, recall, F-measure
- Similarity-oriented measures: Rand and Jaccard scores
External Measures: Classification-Oriented Measures
- Entropy: the degree to which each cluster consists of objects of a single class
- Precision: the fraction of a cluster that consists of objects of a specified class
- Recall: the extent to which a cluster contains all objects of a specified class
External Measures: Similarity-Oriented Measures
Given a reference clustering T and a clustering S, count over all pairs of points:
- f_00: pairs in different clusters in both T and S
- f_01: pairs in different clusters in T but the same cluster in S
- f_10: pairs in the same cluster in T but different clusters in S
- f_11: pairs in the same cluster in both T and S

Rand = (f_00 + f_11) / (f_00 + f_01 + f_10 + f_11)
Jaccard = f_11 / (f_01 + f_10 + f_11)

(A small sketch of the pair counting follows.)
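A pair-counting sketch for both indices (the two labelings are illustrative):

```python
# Rand and Jaccard indices by counting point pairs.
from itertools import combinations

def pair_counts(t, s):
    f00 = f01 = f10 = f11 = 0
    for i, j in combinations(range(len(t)), 2):
        same_t, same_s = t[i] == t[j], s[i] == s[j]
        if same_t and same_s:
            f11 += 1
        elif same_t:
            f10 += 1
        elif same_s:
            f01 += 1
        else:
            f00 += 1
    return f00, f01, f10, f11

t = [0, 0, 0, 1, 1, 1]               # reference clustering T
s = [0, 0, 1, 1, 2, 2]               # clustering S being evaluated
f00, f01, f10, f11 = pair_counts(t, s)
print("Rand    =", (f00 + f11) / (f00 + f01 + f10 + f11))
print("Jaccard =", f11 / (f01 + f10 + f11))
```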
Using Similarity Matrix for Cluster Validation
- Order the similarity matrix with respect to cluster labels and inspect visually (see the sketch below)
(Figure: well-separated points and the corresponding block-diagonal similarity matrix)
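A sketch of this visual check, assuming scikit-learn and matplotlib; deriving similarity from distances, as done here, is one of several reasonable choices:

```python
# Reorder a pairwise similarity matrix by cluster label and look for
# crisp diagonal blocks.
import numpy as np
import matplotlib.pyplot as plt
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import pairwise_distances

X, _ = make_blobs(n_samples=100, centers=3, random_state=0)
labels = KMeans(n_clusters=3, n_init=10, random_state=0).fit_predict(X)
order = np.argsort(labels)            # group points by cluster
D = pairwise_distances(X)
S = 1 - D / D.max()                   # map distances to similarities in [0, 1]
plt.imshow(S[order][:, order], cmap="viridis")
plt.colorbar(label="similarity")
plt.show()
```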
Using Similarity Matrix for Cluster Validation (cont.)
- Clusters in random data are not so crisp
(Figure: random points and their similarity matrix ordered by DBSCAN cluster labels)
Using Similarity Matrix for Cluster Validation (cont.)
- Clusters in random data are not so crisp
(Figure: random points and their similarity matrix ordered by K-means cluster labels)
Using Similarity Matrix for Cluster Validation (cont.)
- Clusters in random data are not so crisp
(Figure: random points and their similarity matrix ordered by complete-link cluster labels)
Using Similarity Matrix for Cluster Validation (cont.)
(Figure: the more complicated seven-region data set from the SSE example, with its similarity matrix ordered by the DBSCAN clustering)
Cluster Analysis Overview
- Partitioning methods
- Hierarchical methods
- Density-based methods
- Other methods
- Cluster evaluation
- Outlier analysis
References (1)
R. Agrawal, J. Gehrke, D. Gunopulos, and P. Raghavan. Automatic subspace clustering of high dimensional data for data mining applications. SIGMOD'98.
M. R. Anderberg. Cluster Analysis for Applications. Academic Press, 1973.
M. Ankerst, M. Breunig, H.-P. Kriegel, and J. Sander. OPTICS: Ordering points to identify the clustering structure. SIGMOD'99.
P. Arabie, L. J. Hubert, and G. De Soete. Clustering and Classification. World Scientific, 1996.
F. Beil, M. Ester, and X. Xu. Frequent term-based text clustering. KDD'02.
M. M. Breunig, H.-P. Kriegel, R. Ng, and J. Sander. LOF: Identifying density-based local outliers. SIGMOD'00.
M. Ester, H.-P. Kriegel, J. Sander, and X. Xu. A density-based algorithm for discovering clusters in large spatial databases. KDD'96.
M. Ester, H.-P. Kriegel, and X. Xu. Knowledge discovery in large spatial databases: Focusing techniques for efficient class identification. SSD'95.
D. Fisher. Knowledge acquisition via incremental conceptual clustering. Machine Learning, 2:139-172, 1987.
D. Gibson, J. Kleinberg, and P. Raghavan. Clustering categorical data: An approach based on dynamic systems. VLDB'98.
References (2)
V. Ganti, J. Gehrke, and R. Ramakrishnan. CACTUS: Clustering categorical data using summaries. KDD'99.
S. Guha, R. Rastogi, and K. Shim. CURE: An efficient clustering algorithm for large databases. SIGMOD'98.
S. Guha, R. Rastogi, and K. Shim. ROCK: A robust clustering algorithm for categorical attributes. ICDE'99, pp. 512-521, Sydney, Australia, March 1999.
A. Hinneburg and D. A. Keim. An efficient approach to clustering in large multimedia databases with noise. KDD'98.
A. K. Jain and R. C. Dubes. Algorithms for Clustering Data. Prentice Hall, 1988.
G. Karypis, E.-H. Han, and V. Kumar. CHAMELEON: A hierarchical clustering algorithm using dynamic modeling. COMPUTER, 32(8):68-75, 1999.
L. Kaufman and P. J. Rousseeuw. Finding Groups in Data: An Introduction to Cluster Analysis. John Wiley & Sons, 1990.
E. Knorr and R. Ng. Algorithms for mining distance-based outliers in large datasets. VLDB'98.
G. J. McLachlan and K. E. Basford. Mixture Models: Inference and Applications to Clustering. John Wiley & Sons, 1988.
P. Michaud. Clustering techniques. Future Generation Computer Systems, 13, 1997.
R. Ng and J. Han. Efficient and effective clustering method for spatial data mining. VLDB'94.
References (3)
L. Parsons, E. Haque, and H. Liu. Subspace clustering for high dimensional data: A review. SIGKDD Explorations, 6(1), June 2004.
E. Schikuta. Grid clustering: An efficient hierarchical clustering method for very large data sets. Proc. 1996 Int. Conf. on Pattern Recognition.
G. Sheikholeslami, S. Chatterjee, and A. Zhang. WaveCluster: A multi-resolution clustering approach for very large spatial databases. VLDB'98.
A. K. H. Tung, J. Han, L. V. S. Lakshmanan, and R. T. Ng. Constraint-based clustering in large databases. ICDT'01.
A. K. H. Tung, J. Hou, and J. Han. Spatial clustering in the presence of obstacles. ICDE'01.
H. Wang, W. Wang, J. Yang, and P. S. Yu. Clustering by pattern similarity in large data sets. SIGMOD'02.
W. Wang, J. Yang, and R. Muntz. STING: A statistical information grid approach to spatial data mining. VLDB'97.
T. Zhang, R. Ramakrishnan, and M. Livny. BIRCH: An efficient data clustering method for very large databases. SIGMOD'96.
Clustering: Rich Applications and Multidisciplinary Efforts
- Pattern recognition
- Spatial data analysis
  - Create thematic maps in GIS by clustering feature spaces
  - Detect spatial clusters, or use clustering for other spatial mining tasks
- Image processing
- Economic science (especially market research)
- WWW
  - Document clustering
  - Cluster weblog data to discover groups of similar access patterns