Cluster Analysis. Outline. Motivation. Examples. Applications. Han and Kamber, ch. 8


Outline
- Cluster Analysis (Han and Kamber, ch. 8)
- Partitioning Methods
- Hierarchical Methods
- Density-Based Methods
- Grid-Based Methods
- Model-Based Methods
(CS course by Rattikorn Hewett, Texas Tech University)

Motivation
Given a data set where we do not know what to look for, the first step in identifying useful patterns is to group data by their similarity. Once data are grouped (clustered/categorized), the properties of each group can be analyzed.

Cluster Analysis
- Is the process of grouping data into meaningful classes/clusters (e.g., objects with similar measures)
- Is unsupervised learning: learning by observation rather than by examples
- Can be used as a stand-alone tool to gain insight into the data distribution, or as a preprocessing step for other algorithms

Example Applications
- Bioinformatics: categorize genes with similar functions
- Marketing: characterize customers based on purchasing patterns for targeted marketing programs
- Land use: identify areas of similar land use from earth-observation data
- Insurance: identify groups of motor insurance policy holders with a high average claim cost
- City planning: identify groups of houses according to house type, value, and geographical location
- Earthquake studies: characterize earthquake epicenters along continental faults
- WWW: document classification; cluster Weblog data to discover groups of similar access patterns

Evaluation and Issues
Basic criteria for evaluating a clustering method:
- Quality of results (depends on the similarity measure applied): high-quality clusters have high intra-class similarity and low inter-class similarity
- Ability to discover some or all of the hidden patterns
Main issue in data mining: scalability
- Size of databases: need efficient clustering techniques for very large, high-dimensional databases
- Complexity of data types and shapes: need clustering techniques for mixed numerical & categorical data

Requirements
- Scalability: large databases, high dimensionality
- Ability to deal with different types of attributes
- Performance: robust (able to deal with noise and outliers), insensitive to the order of input records
- Discovery of clusters with arbitrary shape
- Practical aspects: minimal requirements for domain knowledge to determine input parameters, incorporation of user-specified constraints, interpretability and usability

Outline
- Typical data structures
- Partitioning Methods
- Hierarchical Methods
- Density-Based Methods
- Grid-Based Methods
- Model-Based Methods

Typical Data Structures
Memory-based clustering algorithms typically operate on two data structures:
- Data matrix (two-mode matrix: rows and columns represent different entities): n data instances with p attributes
  $\begin{bmatrix} x_{11} & \cdots & x_{1f} & \cdots & x_{1p} \\ \vdots & & \vdots & & \vdots \\ x_{i1} & \cdots & x_{if} & \cdots & x_{ip} \\ \vdots & & \vdots & & \vdots \\ x_{n1} & \cdots & x_{nf} & \cdots & x_{np} \end{bmatrix}$
- Dissimilarity matrix (one-mode matrix): entry d(i, j) = difference/dissimilarity between data instances i and j
  $\begin{bmatrix} 0 & & & & \\ d(2,1) & 0 & & & \\ d(3,1) & d(3,2) & 0 & & \\ \vdots & \vdots & \vdots & \ddots & \\ d(n,1) & d(n,2) & \cdots & \cdots & 0 \end{bmatrix}$
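As a minimal illustration of the two data structures above, the following sketch builds a dissimilarity matrix from a small data matrix using Euclidean distance; the data values and the choice of Euclidean distance are assumptions made purely for illustration.

```python
import numpy as np

# Data matrix: n = 4 instances (rows), p = 2 attributes (columns); values made up.
X = np.array([[1.0, 2.0],
              [2.0, 1.5],
              [8.0, 8.0],
              [9.0, 7.5]])

def dissimilarity_matrix(X):
    """Return the n x n matrix of pairwise Euclidean distances d(i, j)."""
    n = X.shape[0]
    D = np.zeros((n, n))
    for i in range(n):
        for j in range(i):
            D[i, j] = D[j, i] = np.linalg.norm(X[i] - X[j])
    return D

print(dissimilarity_matrix(X))
```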

Input Data Types
- Interval-scaled variables: rough measurements on a continuous linear scale, e.g., weights, coordinates
- Binary variables
- Nominal, ordinal, and ratio-scaled variables
  - Nominal: extend binary variables to more than two values, e.g., map-color = {blue, yellow, red, green}
  - Ordinal: extend nominal variables to be ordered in a meaningful sequence, e.g., medal = {gold, silver, bronze}
  - Ratio-scaled: positive measurements on a non-linear scale, e.g., exponential growth and decay
- Mixed: attributes have different data types

Preparing Input Data
From a given data set, we need to:
1) Standardize/normalize the data (if needed)
2) Compute the dissimilarity matrix

Interval-Scaled Variables (1): Standardization
Standardize the data for each column (attribute) f:
- Calculate the mean absolute deviation
  $s_f = \frac{1}{n}\left(|x_{1f} - m_f| + |x_{2f} - m_f| + \cdots + |x_{nf} - m_f|\right)$, where $m_f = \frac{1}{n}(x_{1f} + x_{2f} + \cdots + x_{nf})$
- Calculate the standardized measurement (z-score)
  $z_{if} = \frac{x_{if} - m_f}{s_f}$
Using the mean absolute deviation is more robust than using the standard deviation: deviations are not squared, which reduces the effect of outliers.

Interval-Scaled Variables (2): Distances
Distances are normally used to measure the similarity or dissimilarity between two data objects.
Minkowski distance:
$d(i, j) = \left(|x_{i1} - x_{j1}|^q + |x_{i2} - x_{j2}|^q + \cdots + |x_{ip} - x_{jp}|^q\right)^{1/q}$
where $i = (x_{i1}, x_{i2}, \ldots, x_{ip})$ and $j = (x_{j1}, x_{j2}, \ldots, x_{jp})$ are two p-dimensional data objects, and q is a positive integer.
- If q = 1, d is the Manhattan distance
- If q = 2, d is the Euclidean distance
Weighted Euclidean distance:
$d(i, j) = \left(w_1|x_{i1} - x_{j1}|^2 + w_2|x_{i2} - x_{j2}|^2 + \cdots + w_p|x_{ip} - x_{jp}|^2\right)^{1/2}$
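A minimal sketch of the two computations defined above: z-score standardization using the mean absolute deviation, and the Minkowski distance (Manhattan for q = 1, Euclidean for q = 2, with optional weights). The example points are made up.

```python
import numpy as np

def standardize(X):
    """z-score using the mean absolute deviation, as defined above."""
    m = X.mean(axis=0)                  # column means m_f
    s = np.abs(X - m).mean(axis=0)      # mean absolute deviations s_f
    return (X - m) / s

def minkowski(x, y, q=2, w=None):
    """Minkowski distance; q=1 Manhattan, q=2 Euclidean; optional weights w."""
    w = np.ones_like(x, dtype=float) if w is None else np.asarray(w, dtype=float)
    return (w * np.abs(x - y) ** q).sum() ** (1.0 / q)

x, y = np.array([1.0, 2.0, 3.0]), np.array([4.0, 0.0, 3.0])
print(minkowski(x, y, q=1))   # Manhattan: 5.0
print(minkowski(x, y, q=2))   # Euclidean: ~3.606
print(standardize(np.array([[1.0, 2.0], [3.0, 6.0], [5.0, 10.0]])))
```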

Binary Variables: Computing the Dissimilarity Matrix
A contingency table for binary data (a, b, c, d count the attribute-value combinations between instances i and j; b counts 1-0 pairs, c counts 0-1 pairs, d counts 0-0 pairs):

                 instance j
                 1         0         sum
instance i   1   a         b         a + b
             0   c         d         c + d
             sum a + c     b + d     p

Simple matching coefficient, for symmetric binary variables (the values 1 and 0 are equally informative, e.g., gender):
$d(i, j) = \frac{b + c}{a + b + c + d}$
Jaccard coefficient, for asymmetric binary variables (the outcome 1 is more important than 0, e.g., 1 = HIV positive, 0 = HIV negative):
$d(i, j) = \frac{b + c}{a + b + c}$

Example
Name  Gender  Fever  Cough  Test-1  Test-2  Test-3  Test-4
Jack  M       Y      N      P       N       N       N
Mary  F       Y      N      P       N       P       N
Jim   M       Y      P      N       N       N       N

Gender is a symmetric attribute; the remaining attributes are asymmetric binary. Let the values Y and P be set to 1 and the value N be set to 0. Using the Jaccard coefficient on the asymmetric attributes:
d(Jack, Mary) = (0 + 1) / (2 + 0 + 1) = 0.33
d(Jack, Jim)  = (1 + 1) / (1 + 1 + 1) = 0.67
d(Jim, Mary)  = (1 + 2) / (1 + 1 + 2) = 0.75

Nominal Variables
A generalization of the binary variable: it can take more than two states, e.g., red, yellow, blue, green.
Computing the dissimilarity matrix:
- Method 1: simple matching. Let m = number of matches (variables on which instances i and j have the same state) and p = total number of variables; then
  $d(i, j) = \frac{p - m}{p}$
- Method 2: use a large number of binary variables, creating a new binary variable for each of the nominal states

Ordinal Variables
An ordinal variable can be discrete or continuous; order is important, e.g., rank. It can be treated like an interval-scaled variable:
- Let $x_{if}$ be the value of variable f for the i-th instance; replace $x_{if}$ by its rank $r_{if} \in \{1, \ldots, M_f\}$
- Normalize each variable so that its range is [0, 1] by replacing $r_{if}$ with $z_{if} = \frac{r_{if} - 1}{M_f - 1}$
- Compute the dissimilarity using the methods for interval-scaled variables
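A minimal sketch of the Jaccard coefficient defined above, applied to the asymmetric attributes of the Jack/Mary/Jim example (Y/P encoded as 1, N as 0); the encoding lists are transcribed directly from the example table.

```python
import numpy as np

def jaccard_dissimilarity(i, j):
    """Asymmetric binary dissimilarity d(i, j) = (b + c) / (a + b + c)."""
    i, j = np.asarray(i), np.asarray(j)
    a = np.sum((i == 1) & (j == 1))   # 1-1 matches
    b = np.sum((i == 1) & (j == 0))   # 1-0 mismatches
    c = np.sum((i == 0) & (j == 1))   # 0-1 mismatches
    return (b + c) / (a + b + c)

# Asymmetric attributes from the example (Fever, Cough, Test-1..Test-4):
jack = [1, 0, 1, 0, 0, 0]
mary = [1, 0, 1, 0, 1, 0]
jim  = [1, 1, 0, 0, 0, 0]

print(round(jaccard_dissimilarity(jack, mary), 2))  # 0.33
print(round(jaccard_dissimilarity(jack, jim), 2))   # 0.67
print(round(jaccard_dissimilarity(jim, mary), 2))   # 0.75
```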

Ratio-Scaled Variables
A ratio-scaled variable is a positive measurement on a non-linear, approximately exponential scale, e.g., $Ae^{Bt}$ or $Ae^{-Bt}$.
Methods:
- Treat them like interval-scaled variables: not a good choice, since the scale can be distorted
- Apply a logarithmic transformation, $y_{if} = \log(x_{if})$, then use the techniques for interval-scaled variables
- Treat $x_{if}$ as continuous ordinal data and treat the ranks as interval-scaled values (recommended)

Mixed Variable Types
A database may contain all six types of variables: symmetric binary, asymmetric binary, nominal, ordinal, interval, and ratio. One may use a weighted formula to combine their effects (a code sketch follows below):
$d(i, j) = \frac{\sum_{f=1}^{p} \delta_{ij}^{(f)} d_{ij}^{(f)}}{\sum_{f=1}^{p} \delta_{ij}^{(f)}}$
where $\delta_{ij}^{(f)} = 0$ if either (1) $x_{if}$ or $x_{jf}$ is missing, or (2) f is asymmetric binary and $x_{if} = x_{jf} = 0$; otherwise $\delta_{ij}^{(f)} = 1$. Each $d_{ij}^{(f)}$ is computed according to the data type of f.

Outline
- Partitioning Methods
- Hierarchical Methods
- Density-Based Methods
- Grid-Based Methods
- Model-Based Methods

Partitioning Methods
Given a database of n objects and k = number of clusters to form (k ≤ n), find k partitions of the database that optimize a similarity function such that:
- Objects of different clusters/partitions are dissimilar
- Objects of the same partition are similar
Basic idea: create initial partitions, then use iterative relocation (move objects from one group to another) to improve the objective function (e.g., minimize error).
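Referring back to the mixed-variable formula above, here is a minimal sketch of the weighted combination for a record with one interval attribute, one asymmetric binary attribute, and one nominal attribute. The record layout, the attribute names, and the range used to normalize the interval attribute are assumptions for illustration, not part of the original slides.

```python
def mixed_dissimilarity(x, y, ranges):
    """
    Weighted mixed-type dissimilarity: sum(delta_f * d_f) / sum(delta_f).
    Each record is a dict: {'interval': float, 'asym_binary': 0/1, 'nominal': str}.
    `ranges` maps interval attributes to their value range (for normalization).
    """
    total, weight = 0.0, 0.0

    # interval-scaled contribution: normalized absolute difference
    total += abs(x['interval'] - y['interval']) / ranges['interval']
    weight += 1

    # asymmetric binary: skip the attribute when both values are 0 (delta = 0)
    if not (x['asym_binary'] == 0 and y['asym_binary'] == 0):
        total += 0 if x['asym_binary'] == y['asym_binary'] else 1
        weight += 1

    # nominal: simple matching contribution (0 if equal, 1 otherwise)
    total += 0 if x['nominal'] == y['nominal'] else 1
    weight += 1

    return total / weight

a = {'interval': 20.0, 'asym_binary': 1, 'nominal': 'red'}
b = {'interval': 30.0, 'asym_binary': 0, 'nominal': 'red'}
print(mixed_dissimilarity(a, b, ranges={'interval': 50.0}))  # (0.2 + 1 + 0) / 3 = 0.4
```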

Partitioning Methods (cont.)
Approaches:
- Global optimum: exhaustively enumerate all partitions
- Heuristic methods. Basic idea: pick a reference point for each cluster, then assign data objects so as to minimize the sum of dissimilarities between each object and the reference point of its cluster
  - k-means: each cluster is represented by the center (mean) of the cluster
  - k-medoids: each cluster is represented by one of the objects in the cluster

The k-means Method
Input: k = number of clusters and a database of n objects
Output: a set of k clusters that minimizes the squared error
Algorithm (sketch; a code sketch follows below):
1. Randomly pick k objects as cluster centers
2. Repeat:
   - Assign each object to the cluster with the nearest center
   - Update the center (mean value of the objects in the cluster) of each cluster
   Until no change (i.e., the squared error converges)
Squared error:
$E = \sum_{i=1}^{k} \sum_{p \in C_i} |p - m_i|^2$
where $m_i$ is the mean of cluster $C_i$ and p is an object (point).
Example: arbitrarily choose K objects as initial cluster centers; assign each object to the most similar center; update the cluster means; reassign; repeat until the assignments stop changing.

Advantages and Disadvantages
Strengths:
- Relatively efficient: O(tkn), where n is the number of objects, k the number of clusters, and t the number of iterations; normally k, t << n. Compare other clustering methods, e.g., PAM: O(k(n-k)^2) per iteration.
Weaknesses:
- Applicable only when the mean is defined: what about categorical data?
- Need to specify the number of clusters, k, in advance
- The mean is sensitive to noisy data and outliers
- Not suitable for discovering clusters with non-convex shapes
- Often terminates at a local optimum; the global optimum may be found using techniques such as deterministic annealing and genetic algorithms
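A minimal NumPy sketch of the k-means algorithm sketched above (random initial centers, assign/update until convergence, squared error E). The data, random seed, and iteration cap are illustrative assumptions.

```python
import numpy as np

def k_means(X, k, max_iter=100, seed=0):
    """Minimal k-means: assign to nearest center, update means, repeat."""
    rng = np.random.default_rng(seed)
    centers = X[rng.choice(len(X), size=k, replace=False)]   # random initial centers
    for _ in range(max_iter):
        # Assign each object to the cluster with the nearest center
        dists = np.linalg.norm(X[:, None, :] - centers[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # Update each center to the mean of its assigned objects
        new_centers = np.array([X[labels == i].mean(axis=0) if np.any(labels == i)
                                else centers[i] for i in range(k)])
        if np.allclose(new_centers, centers):   # assignments/means have converged
            break
        centers = new_centers
    sse = ((X - centers[labels]) ** 2).sum()    # squared error E
    return labels, centers, sse

# Two well-separated blobs (made-up data)
X = np.vstack([np.random.default_rng(1).normal(0, 0.5, (20, 2)),
               np.random.default_rng(2).normal(5, 0.5, (20, 2))])
labels, centers, sse = k_means(X, k=2)
print(centers, sse)
```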

Variations of the k-means Method
Variants of k-means differ in:
- Selection of the initial k means
- Dissimilarity calculations
- Strategies for calculating cluster means
Notable variants:
- k-modes method [Huang]: handles categorical data by replacing cluster means with modes and using a frequency-based method to update the modes
- k-prototypes method: integrates k-means and k-modes to handle mixed data types
- EM (Expectation Maximization): extends k-means by assigning each object to a cluster according to a weight representing its probability of membership
- Scalable k-means: discard objects whose cluster membership is already ascertained

Problem with the k-means Method
The k-means algorithm is sensitive to outliers, since an object with an extremely large value may substantially distort the distribution of the data.
k-medoids: instead of taking the mean value of the objects in a cluster as the reference point, use a medoid, the most centrally located object in the cluster. k-means is centroid-based; k-medoids is based on the most central object.

The k-medoids Method
Input: k = number of clusters and a database of n objects
Output: a set of k clusters that minimizes the sum of dissimilarities of all objects to their nearest medoid (most central object)
Algorithm (sketch; a code sketch of the swap cost follows below):
1. Randomly pick k objects as medoids of the k clusters
2. Repeat:
   - Assign each remaining object to the cluster with the nearest medoid
   - Randomly select a non-medoid object O_random
   - Compute the cost of swapping a medoid O_j with O_random
   - If the cost is < 0, swap (i.e., O_random becomes a medoid and O_j is no longer one)
   Until no change (i.e., the error converges)
Cost = (error with swapping) - (error without swapping); cost < 0 means the swap reduces the error.

Variations of the k-medoids Method
PAM (Partitioning Around Medoids) [Kaufman and Rousseeuw]:
- Randomly select an initial set of medoids
- Iteratively replace a medoid by the non-medoid that gives the greatest error reduction
PAM is more robust than k-means in the presence of noise and outliers, because a medoid is less influenced by outliers or other extreme values than a mean. PAM works effectively for small data sets but does not scale well to large data sets: O(k(n-k)^2) per iteration, where n is the number of data objects and k the number of clusters.
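A minimal sketch of the swap-cost test described above: the total cost of a medoid set is the sum of distances from each object to its nearest medoid, and a (medoid, non-medoid) swap is accepted only when it lowers that cost. The single-pass loop and the made-up data are simplifications for illustration, not the full PAM procedure.

```python
import numpy as np

def total_cost(X, medoid_idx):
    """Sum of distances from each object to its nearest medoid."""
    d = np.linalg.norm(X[:, None, :] - X[medoid_idx][None, :, :], axis=2)
    return d.min(axis=1).sum()

def try_swaps(X, medoid_idx):
    """One pass of PAM-style swaps: accept a swap if it reduces the total cost."""
    medoid_idx = list(medoid_idx)
    best = total_cost(X, medoid_idx)
    for j in range(len(medoid_idx)):
        for o in range(len(X)):
            if o in medoid_idx:
                continue
            candidate = medoid_idx.copy()
            candidate[j] = o                      # swap medoid j with non-medoid o
            cost = total_cost(X, candidate)
            if cost < best:                       # swap cost < 0: accept
                medoid_idx, best = candidate, cost
    return medoid_idx, best

X = np.array([[0.0, 0.0], [0.5, 0.2], [0.1, 0.6],
              [8.0, 8.0], [8.3, 7.7], [25.0, 0.0]])
print(try_swaps(X, medoid_idx=[0, 5]))   # the outlier at (25, 0) is swapped out
```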

Variations of the k-medoids Method (cont.)
CLARA (Clustering LARge Applications) [Kaufman & Rousseeuw]:
- Draws multiple samples of the data set, applies PAM on each sample, and returns the best clustering as the output
- Strength: deals with larger data sets than PAM
- Weaknesses: efficiency depends on the sample size; a good clustering of the samples will not necessarily represent a good clustering of the whole data set if the samples are biased
CLARANS (Clustering Large Applications based on RANdomized Search) [Ng & Han]:
- Combines CLARA's idea with random search to avoid sampling bias
- CLARANS does not confine the medoid search to a fixed sample; it randomly selects nodes to be searched after a local optimum is found
- More efficient than PAM and CLARA

Outline
- Partitioning Methods
- Hierarchical Methods
- Density-Based Methods
- Grid-Based Methods

Hierarchical Clustering Methods
- Agglomerative methods, bottom-up (merge iteratively): place each object in its own cluster, then merge clusters into larger clusters until all objects are in a single cluster or a termination condition holds. Most hierarchical methods are of this kind; they differ in the definition of inter-cluster similarity. Example: AGNES (AGglomerative NESting).
- Divisive methods, top-down (divide iteratively): start with all objects in one cluster and subdivide the cluster into smaller pieces until termination (e.g., the specified number of clusters is reached, or the two closest clusters are close enough). Example: DIANA (DIvisive ANAlysis).

Measures of Distances Between Clusters
Let $d(p, p')$ be a distance between points p and p', let $p_i$ and $p_j$ be points in clusters $C_i$ and $C_j$ respectively, let $n_i$ be the number of points in cluster $C_i$, and let $m_i$ be the mean of the points in cluster $C_i$. (A code sketch using these measures follows below.)
- Minimum distance: $d_{min}(C_i, C_j) = \min_{p_i \in C_i,\, p_j \in C_j} d(p_i, p_j)$
- Maximum distance: $d_{max}(C_i, C_j) = \max_{p_i \in C_i,\, p_j \in C_j} d(p_i, p_j)$
- Mean distance: $d_{mean}(C_i, C_j) = d(m_i, m_j)$
- Average distance: $d_{avg}(C_i, C_j) = \frac{1}{n_i n_j} \sum_{p_i \in C_i} \sum_{p_j \in C_j} d(p_i, p_j)$
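A minimal sketch of agglomerative (AGNES-style) clustering using the minimum distance (single link) defined above: each object starts in its own cluster and the two closest clusters are merged until the requested number of clusters remains. The data and the stopping criterion (a target cluster count) are illustrative assumptions.

```python
import numpy as np
from itertools import combinations

def single_link(ci, cj, X):
    """d_min(C_i, C_j): smallest pairwise distance between the two clusters."""
    return min(np.linalg.norm(X[a] - X[b]) for a in ci for b in cj)

def agnes(X, target_k):
    clusters = [[i] for i in range(len(X))]       # each object starts alone
    while len(clusters) > target_k:
        # find and merge the two closest clusters under single link
        i, j = min(combinations(range(len(clusters)), 2),
                   key=lambda ij: single_link(clusters[ij[0]], clusters[ij[1]], X))
        clusters[i] += clusters[j]
        del clusters[j]
    return clusters

X = np.array([[0.0, 0.0], [0.2, 0.1], [5.0, 5.0], [5.1, 4.9], [10.0, 0.0]])
print(agnes(X, target_k=3))   # [[0, 1], [2, 3], [4]]
```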

AGNES & DIANA
- Introduced in [Kaufmann and Rousseeuw]; implemented in statistical analysis packages, e.g., S+
- Use the single-link method (the similarity between two clusters is the similarity between the closest pair of points, one from each cluster) and the dissimilarity matrix
- AGNES: merge the nodes that have the least dissimilarity, proceeding in a non-descending fashion; eventually all nodes belong to the same cluster
- DIANA proceeds in the inverse order of AGNES

Issues
Major weaknesses of plain hierarchical methods:
- They do not scale well: agglomerative clustering has time complexity of at least O(n^2), where n is the total number of objects
- The merge/split process is irreversible
To address these drawbacks:
- Integration of hierarchical clustering with iterative relocation: BIRCH (Balanced Iterative Reducing and Clustering using Hierarchies) [Zhang, Ramakrishnan & Livny]
- Careful analysis of object linkages at each level of the hierarchy: CURE (Clustering Using REpresentatives) [Guha, Rastogi & Shim], ROCK (RObust Clustering using linKs) [Guha, Rastogi & Shim], CHAMELEON [Karypis, Han & Kumar]

BIRCH
Basic idea: incrementally construct a CF (Clustering Feature) tree, a hierarchical data structure for multiphase clustering.
- Phase 1: scan the DB to build an initial in-memory CF tree that preserves the inherent hierarchical structure of the data (similar to building a B+-tree)
- Phase 2: apply a (selected) clustering algorithm to cluster the leaf nodes of the CF tree

Clustering Feature
Clustering Feature: CF = (N, LS, SS), where
- N: number of data points in the sub-cluster
- LS: linear sum of the points, $\sum_{i=1}^{N} x_i$
- SS: square sum of the points, $\sum_{i=1}^{N} x_i^2$
CF summarizes the statistics (0th, 1st, and 2nd moments) of a sub-cluster; the slide illustrates this with the CF of five two-dimensional points (a computed example follows below).
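A minimal sketch of a Clustering Feature as defined above: it stores (N, LS, SS), can absorb new points incrementally, and two CFs can be merged by adding their statistics componentwise. The example points are illustrative, since the numbers in the original slide's example did not survive extraction.

```python
import numpy as np

class ClusteringFeature:
    """CF = (N, LS, SS): count, linear sum, and square sum of a sub-cluster."""
    def __init__(self, dim):
        self.N = 0
        self.LS = np.zeros(dim)   # sum of points
        self.SS = np.zeros(dim)   # sum of squared coordinates

    def add(self, x):
        x = np.asarray(x, dtype=float)
        self.N += 1
        self.LS += x
        self.SS += x ** 2

    def merge(self, other):
        """CFs are additive: merging two sub-clusters just adds their statistics."""
        self.N += other.N
        self.LS += other.LS
        self.SS += other.SS

    def centroid(self):
        return self.LS / self.N

cf = ClusteringFeature(dim=2)
for p in [(3, 4), (2, 6), (4, 5), (4, 7), (3, 8)]:   # illustrative points
    cf.add(p)
print(cf.N, cf.LS, cf.SS, cf.centroid())   # 5 [16. 30.] [54. 190.] [3.2 6. ]
```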

BIRCH (cont.)
- Scales linearly: finds a good clustering with a single scan and improves the quality with a few additional scans
- Weaknesses: handles only numeric data; sensitive to the order of the data records; because it uses a notion of diameter for clustering, it does not perform well for non-spherical cluster shapes

CURE
Basic idea: use multiple representative points for each cluster, which adjusts well to non-spherical shapes; shrink the representative points toward the center of the cluster to dampen the effect of outliers.
- Scales well without sacrificing clustering quality
- Does not handle categorical data, and results are significantly affected by the parameter settings

ROCK
Basic idea: construct a sparse graph from a similarity matrix using the concept of interconnectivity, then perform hierarchical clustering on the sparse graph.
- Similarity measures are not distance-based: cluster similarity is based on the number of points from different clusters that have neighbors in common
- Suitable for clustering categorical data

CHAMELEON
Hierarchical clustering based on k-nearest neighbors and dynamic modeling.
- Measures similarity based on interconnectivity and closeness (proximity)
- If these measures between two clusters are high relative to the internal measures within the clusters, merge the clusters
- The merge is based on a dynamic model that adapts to the internal structure of the clusters being merged
- Compare: CURE ignores the interconnectivity of objects; ROCK ignores the closeness of clusters

Outline
- Partitioning Methods
- Hierarchical Methods
- Density-Based Methods
- Grid-Based Methods
- Model-Based Methods

Density-Based Clustering Methods
Clustering based on density (a local cluster criterion), such as density-connected points.
Major features:
- Discover clusters of arbitrary shape
- Handle noise
- One scan of the data
- Need density parameters as a termination condition
Several interesting studies: DBSCAN [Ester et al.], OPTICS [Ankerst et al.], DENCLUE [Hinneburg & Keim], CLIQUE [Agrawal et al.]

Terminology
- ε-neighborhood of an object p: $N_\varepsilon(p) = \{q \mid dist(p, q) \le \varepsilon\}$
- q is a core object if $N_\varepsilon(q)$ contains at least MinPts objects (the minimum number of objects)
- p is directly density-reachable from q if p is in $N_\varepsilon(q)$ and q is a core object
- Density-reachable: p is density-reachable from q if there is a chain of points $p_1, \ldots, p_n$ with $p_1 = q$ and $p_n = p$ such that $p_{i+1}$ is directly density-reachable from $p_i$ w.r.t. ε and MinPts
- Density-connected: a point p is density-connected to a point q if there is a point o such that both p and q are density-reachable from o w.r.t. ε and MinPts
- Density-reachability is transitive; density-connectedness is symmetric
(The slide's figures illustrate these notions with example points p, q, o and concrete values of MinPts and ε.)

Density-Based Clustering Algorithms
DBSCAN (Density-Based Spatial Clustering of Applications with Noise):
- Grows regions of sufficiently high density into clusters
- A cluster is a maximal set of density-connected points with respect to density-reachability
- Objects not in any cluster are considered noise
- Requires input parameters ε and MinPts
OPTICS (Ordering Points To Identify the Clustering Structure):
- Computes a cluster ordering for automatic and interactive analysis
- The ordering represents the density-based clustering structure: objects are selected so that clusters with small radius (high density) are finished first
DENCLUE (DENsity-based CLUstEring):
- Solid mathematical foundation: uses influence functions
- Good for data sets with large amounts of noise
- Allows a compact mathematical description of arbitrarily shaped clusters in high-dimensional data sets
- Significantly faster than existing algorithms, including DBSCAN, but needs a large number of parameters
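A minimal sketch of DBSCAN as described above: core objects are found from their ε-neighborhoods, clusters are grown by density-reachability, and points that end up in no cluster are labeled noise. The parameter values and data are made up for illustration.

```python
import numpy as np

def dbscan(X, eps, min_pts):
    """Label each point with a cluster id, or -1 for noise."""
    n = len(X)
    labels = np.full(n, -1)             # -1 = noise / unassigned
    visited = np.zeros(n, dtype=bool)
    neighbors = [np.flatnonzero(np.linalg.norm(X - X[i], axis=1) <= eps)
                 for i in range(n)]     # epsilon-neighborhood of each point
    cluster = 0
    for i in range(n):
        if visited[i] or len(neighbors[i]) < min_pts:
            continue                    # not an unvisited core object: skip as seed
        visited[i] = True
        labels[i] = cluster
        queue = list(neighbors[i])      # grow the cluster by density-reachability
        while queue:
            j = queue.pop()
            if labels[j] == -1:
                labels[j] = cluster     # border or core point joins the cluster
            if not visited[j]:
                visited[j] = True
                if len(neighbors[j]) >= min_pts:   # j is a core object: expand
                    queue.extend(neighbors[j])
        cluster += 1
    return labels

X = np.vstack([np.random.default_rng(0).normal(0, 0.3, (20, 2)),
               np.random.default_rng(1).normal(4, 0.3, (20, 2)),
               [[10.0, 10.0]]])         # an isolated noise point
print(dbscan(X, eps=0.8, min_pts=4))
```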

STING
- Each cell at a high level is partitioned into smaller cells at the next lower level
- Each cell stores precomputed statistical information, e.g., mean, max, min, and distribution type
- Query answering is a top-down process:
  1. Determine the highest layer at which to start
  2. For each cell in the layer, compute a confidence interval reflecting the cell's relevance to the query
  3. Remove irrelevant cells and move to the next lower level
  4. Repeat until the query specification is met, then return the region of relevant cells
- Pros: fast, easy to parallelize. Cons: cluster boundaries are only horizontal or vertical

WaveCluster
A multi-resolution clustering approach:
- Impose a grid structure on the data space
- Apply a wavelet transform to the feature space (built from the grid cell information); the transform decomposes the signal into different frequency sub-bands and preserves the relative distance between objects at different resolutions
- Cluster by finding dense regions in the transformed space
- Both grid-based and density-based, with input parameters: the number of grid cells per dimension, the wavelet, and the number of applications of the wavelet transform
Advantages:
- Scales to large databases: O(n)
- Does not require the number of clusters or a neighborhood radius
- Effective removal of outliers; detects arbitrarily shaped clusters at different scales
- Not sensitive to noise or to input order
(The slide's figure shows the transformation at high, medium, and low resolution; the data are decomposed into four frequency sub-bands capturing the mean neighborhood of each point, horizontal edges, vertical edges, and corners.)

CLIQUE
Grid-based and density-based, using the Apriori principle:
- Partition the space into a grid structure; a cluster is a maximal set of connected dense units
- Narrowing the search space: a k-dimensional candidate dense unit cannot be dense if its (k-1)-dimensional projection units are not dense
- Determine the minimal cover of each cluster
- Pros: scales well, good for clustering high-dimensional data in large databases; insensitive to the order of the input
- Cons: the simplicity of the method comes at the expense of clustering accuracy
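A minimal sketch of the grid-based idea shared by these methods: quantize the space into cells, keep the cells whose point count exceeds a density threshold, and connect adjacent dense cells into clusters (roughly the CLIQUE notion of a cluster as a maximal set of connected dense units, restricted here to two dimensions). The cell size, density threshold, and data are illustrative assumptions, not parameters from any of the cited algorithms.

```python
import numpy as np
from collections import defaultdict, deque

def grid_clusters(X, cell_size, density_threshold):
    """Group points into clusters of connected dense grid cells (2-D sketch)."""
    cells = defaultdict(list)
    for idx, x in enumerate(X):
        cells[tuple((x // cell_size).astype(int))].append(idx)     # quantize
    dense = {c for c, pts in cells.items() if len(pts) >= density_threshold}

    clusters, seen = [], set()
    for start in dense:
        if start in seen:
            continue
        # breadth-first search over edge-adjacent dense cells
        component, queue = [], deque([start])
        seen.add(start)
        while queue:
            cx, cy = queue.popleft()
            component.extend(cells[(cx, cy)])
            for nb in [(cx + 1, cy), (cx - 1, cy), (cx, cy + 1), (cx, cy - 1)]:
                if nb in dense and nb not in seen:
                    seen.add(nb)
                    queue.append(nb)
        clusters.append(sorted(component))
    return clusters

rng = np.random.default_rng(0)
X = np.vstack([rng.normal(1, 0.4, (30, 2)), rng.normal(6, 0.4, (30, 2))])
print(grid_clusters(X, cell_size=1.0, density_threshold=3))
```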

Outline
- Partitioning Methods
- Hierarchical Methods
- Density-Based Methods
- Grid-Based Methods
- Model-Based Methods

Model-Based Clustering Methods
Goal: fit the data to some mathematical model.
- Statistical approach (conceptual learning): COBWEB, CLASSIT, AutoClass
- Neural-net approach: SOM (Self-Organizing Map)

Statistical Approach
Conceptual learning: cluster (group like objects) + characterize (generalize and simplify) the resulting concepts.
COBWEB [Fisher]:
- Builds a classification tree where each node (not each branch) has a label
- Uses a heuristic evaluation function called category utility, which measures the difference in the expected number of attribute values that can be predicted when a hypothesized category is used versus when it is not; it rewards intra-class similarity and inter-class dissimilarity
- Based on the assumption that attributes are independent
- Cons: the independence assumption; expensive to compute and update the probability distributions; the tree is not height-balanced, which is bad for skewed data
CLASSIT: an extension of COBWEB for continuous data, with similar problems.
AutoClass: uses Bayesian statistical analysis; popular in industry.

Neural-Net Approach
Competitive learning:
- A hierarchical architecture of neuron units in which each unit in a given layer takes inputs from all units of the previous layer
- Units within a cluster in a layer compete; the one that best corresponds to the output of the previous layer wins
- The winning unit adjusts the weights on its connections
- A cluster corresponds to a mapping of low-level features to high-level features
SOM (Self-Organizing Feature Map): the unit whose weight vector is closest to the current object becomes the winner. More details in the reference papers.

Outline
- Partitioning Methods
- Hierarchical Methods
- Density-Based Methods
- Grid-Based Methods
- Model-Based Methods

Outlier Discovery
Outliers are data objects that do not comply with the general behavior of the data and are considerably dissimilar (inconsistent) to the remaining data. Generally, data mining algorithms try to minimize the effect of outliers, but one person's noise could be another person's signal.
Example applications:
- Credit card fraud detection
- Telecom fraud detection
- Marketing customized to purchasers with extreme incomes
- Finding unusual responses to medical treatments
Outlier discovery task: find the top n outlier points.

Approaches
Statistical-based:
- Assume a data distribution model (e.g., normal) and apply hypothesis testing to see whether objects belong to the distribution
- Drawbacks: most tests are for single attributes; the data distribution may not be known
Distance-based (a code sketch follows below):
- Outliers are objects that do not have enough neighbors
- Various algorithms: index-based, nested-loop, cell-based
- Drawbacks: does not scale well to high dimensions; requires experimentation to set the input parameters
Deviation-based:
- Outliers are objects that deviate from the main characteristics of the data
- Example techniques: the sequential exception technique; the OLAP data cube technique (explore regions of anomalies)
Sequential exception technique:
- Given a set S of n objects, build a sequence of subsets $S_1 \subseteq S_2 \subseteq \cdots \subseteq S_m$, where $m \le n$
- For each subset, test its similarity with the preceding subset in the sequence
- Find the smallest subset whose removal results in the greatest reduction of dissimilarity in the residual set; this subset, called the exception set, contains the outliers
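A minimal sketch of the distance-based notion above: an object is flagged as an outlier if it has too few neighbors within a given radius. The nested-loop style check, the radius, the neighbor threshold, and the data are illustrative assumptions.

```python
import numpy as np

def distance_based_outliers(X, radius, min_neighbors):
    """Return indices of objects with fewer than min_neighbors within `radius`."""
    outliers = []
    for i in range(len(X)):
        # nested-loop style check: count how many other objects lie within the radius
        dists = np.linalg.norm(X - X[i], axis=1)
        neighbor_count = np.sum(dists <= radius) - 1   # exclude the point itself
        if neighbor_count < min_neighbors:
            outliers.append(i)
    return outliers

rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0, 1, (50, 2)), [[12.0, 12.0]]])   # one far-away point
print(distance_based_outliers(X, radius=3.0, min_neighbors=3))   # [50]
```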

Outline
- Partitioning Methods
- Hierarchical Methods
- Density-Based Methods
- Grid-Based Methods
- Model-Based Methods

Problems and Challenges
Considerable progress has been made in scalable clustering methods:
- Partitioning: k-means, k-medoids, CLARANS
- Hierarchical: BIRCH, CURE
- Density-based: DBSCAN, CLIQUE, OPTICS
- Grid-based: STING, WaveCluster
- Model-based: AutoClass, DENCLUE, COBWEB
However, current clustering techniques do not address all of the requirements adequately. One open direction is constraint-based clustering analysis: constraints may exist in the data space (e.g., bridges and highways) or in user queries.

Constraint-Based Clustering Analysis
- Fewer parameters, but more user-desired constraints
- Example: an ATM allocation problem with obstacle objects (the slide's figure contrasts the clustering obtained when obstacles are ignored with the clustering obtained when obstacles are taken into account)

Summary
- Cluster analysis groups objects based on their similarity and has wide applications
- Measures of similarity can be computed for various types of data
- Clustering algorithms can be categorized into partitioning methods, hierarchical methods, density-based methods, grid-based methods, and model-based methods
- Outlier detection and analysis are very useful for fraud detection, etc., and can be performed by statistical, distance-based, or deviation-based approaches
- There are still many open research issues in cluster analysis, such as constraint-based clustering

References
- R. Agrawal, J. Gehrke, D. Gunopulos, and P. Raghavan. Automatic subspace clustering of high dimensional data for data mining applications. SIGMOD.
- M. R. Anderberg. Cluster Analysis for Applications. Academic Press.
- M. Ankerst, M. Breunig, H.-P. Kriegel, and J. Sander. OPTICS: Ordering points to identify the clustering structure. SIGMOD.
- P. Arabie, L. J. Hubert, and G. De Soete. Clustering and Classification. World Scientific.
- M. Ester, H.-P. Kriegel, J. Sander, and X. Xu. A density-based algorithm for discovering clusters in large spatial databases. KDD.
- M. Ester, H.-P. Kriegel, and X. Xu. Knowledge discovery in large spatial databases: Focusing techniques for efficient class identification. SSD.
- D. Fisher. Knowledge acquisition via incremental conceptual clustering. Machine Learning.
- D. Gibson, J. Kleinberg, and P. Raghavan. Clustering categorical data: An approach based on dynamic systems. VLDB.
- S. Guha, R. Rastogi, and K. Shim. CURE: An efficient clustering algorithm for large databases. SIGMOD.
- A. K. Jain and R. C. Dubes. Algorithms for Clustering Data. Prentice Hall.
- L. Kaufman and P. J. Rousseeuw. Finding Groups in Data: An Introduction to Cluster Analysis. John Wiley & Sons.
- E. Knorr and R. Ng. Algorithms for mining distance-based outliers in large datasets. VLDB.
- G. J. McLachlan and K. E. Basford. Mixture Models: Inference and Applications to Clustering. John Wiley & Sons.
- P. Michaud. Clustering techniques. Future Generation Computer Systems.
- R. Ng and J. Han. Efficient and effective clustering methods for spatial data mining. VLDB.
- E. Schikuta. Grid clustering: An efficient hierarchical clustering method for very large data sets. Proc. Int. Conf. on Pattern Recognition.
- G. Sheikholeslami, S. Chatterjee, and A. Zhang. WaveCluster: A multi-resolution clustering approach for very large spatial databases. VLDB.
- W. Wang, J. Yang, and R. Muntz. STING: A statistical information grid approach to spatial data mining. VLDB.
- T. Zhang, R. Ramakrishnan, and M. Livny. BIRCH: An efficient data clustering method for very large databases. SIGMOD.