Preview Lecture Clustering
! Introduction
! Partitioning methods
! Hierarchical methods
! Model-based methods
! Density-based methods

What is Clustering?
! Cluster: a collection of data objects
! Similar to one another within the same cluster
! Dissimilar to the objects in other clusters
! Cluster analysis
! Grouping a set of data objects into clusters
! Clustering is unsupervised classification: no predefined classes
! Typical applications
! As a stand-alone tool to get insight into data distribution
! As a preprocessing step for other algorithms

Examples of Clustering Applications
! Marketing: Help marketers discover distinct groups in their customer bases, and then use this knowledge to develop targeted marketing programs
! Land use: Identification of areas of similar land use in an earth observation database
! Insurance: Identifying groups of motor insurance policy holders with a high average claim cost
! Urban planning: Identifying groups of houses according to their house type, value, and geographical location
! Seismology: Observed earthquake epicenters should be clustered along continental faults

What Is a Good Clustering?
! A good clustering method will produce clusters with
! High intra-class similarity
! Low inter-class similarity
! Precise definition of clustering quality is difficult
! Application-dependent
! Ultimately subjective

Requirements for Clustering in Data Mining
! Scalability
! Ability to deal with different types of attributes
! Discovery of clusters with arbitrary shape
! Minimal domain knowledge required to determine input parameters
! Ability to deal with noise and outliers
! Insensitivity to order of input records
! Robustness wrt high dimensionality
! Incorporation of user-specified constraints
! Interpretability and usability
Similarity and Dissimilarity Between Objects
! Same as we used for IBL (e.g., Lp norm)
! Euclidean distance (p = 2): d(i, j) = sqrt(|x_i1 − x_j1|² + |x_i2 − x_j2|² + … + |x_ip − x_jp|²)
! Properties of a metric d(i,j):
! d(i,j) ≥ 0
! d(i,i) = 0
! d(i,j) = d(j,i)
! d(i,j) ≤ d(i,k) + d(k,j)

Major Clustering Approaches
! Partitioning: Construct various partitions and then evaluate them by some criterion
! Hierarchical: Create a hierarchical decomposition of the set of objects using some criterion
! Model-based: Hypothesize a model for each cluster and find the best fit of models to data
! Density-based: Guided by connectivity and density functions

Partitioning Algorithms
! Partitioning method: Construct a partition of a database D of n objects into a set of k clusters
! Given a k, find a partition of k clusters that optimizes the chosen partitioning criterion
! Global optimum: exhaustively enumerate all partitions
! Heuristic methods: k-means and k-medoids algorithms
! k-means (MacQueen, 1967): Each cluster is represented by the center of the cluster
! k-medoids or PAM (Partition Around Medoids) (Kaufman & Rousseeuw, 1987): Each cluster is represented by one of the objects in the cluster

K-Means Clustering
! Given k, the k-means algorithm consists of four steps:
! Select initial centroids at random.
! Assign each object to the cluster with the nearest centroid.
! Compute each centroid as the mean of the objects assigned to it.
! Repeat previous steps until no change.

K-Means Clustering (contd.)

Comments on the K-Means Method
! Example
! Strengths
! Relatively efficient: O(tkn), where n is # objects, k is # clusters, and t is # iterations. Normally, k, t << n.
! Often terminates at a local optimum. The global optimum may be found using techniques such as simulated annealing and genetic algorithms
! Weaknesses
! Applicable only when the mean is defined (what about categorical data?)
! Need to specify k, the number of clusters, in advance
! Trouble with noisy data and outliers
! Not suitable to discover clusters with non-convex shapes
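The four k-means steps above can be sketched in a few lines. This is a minimal illustration, not a production implementation; for reproducibility it seeds the centroids with the first k points, whereas the slides call for random initialization.

```python
def kmeans(points, k, max_iters=100):
    """Minimal k-means sketch following the four steps on the slide.
    Deterministic first-k initialization (the slides use random init)."""
    centroids = list(points[:k])                    # step 1: initial centroids
    assignment = None
    for _ in range(max_iters):
        # step 2: assign each object to the cluster with the nearest centroid
        new_assignment = [
            min(range(k),
                key=lambda c: sum((a - b) ** 2 for a, b in zip(pt, centroids[c])))
            for pt in points
        ]
        if new_assignment == assignment:            # step 4: stop when nothing changes
            break
        assignment = new_assignment
        # step 3: recompute each centroid as the mean of its assigned objects
        for c in range(k):
            members = [pt for pt, a in zip(points, assignment) if a == c]
            if members:
                centroids[c] = tuple(sum(d) / len(members) for d in zip(*members))
    return centroids, assignment

pts = [(1.0, 1.0), (1.2, 0.8), (0.9, 1.1), (8.0, 8.0), (8.2, 7.9), (7.8, 8.1)]
centroids, labels = kmeans(pts, k=2)
```

On these two well-separated blobs the algorithm converges in a few iterations to one centroid per blob; with a bad initialization it can instead stop at a local optimum, as the slide notes.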
Hierarchical Clustering
! Use the distance matrix as the clustering criterion. This method does not require the number of clusters k as an input, but needs a termination condition
[Figure: objects a, b, c, d, e merged step by step into clusters ab, de, cde, abcde — agglomerative (AGNES) runs bottom-up, divisive (DIANA) top-down]

AGNES (Agglomerative Nesting)
! Produces tree of clusters (nodes)
! Initially: each object is a cluster (leaf)
! Recursively merges nodes that have the least dissimilarity
! Criteria: min distance, max distance, avg distance, center distance
! Eventually all nodes belong to the same cluster (root)

A Dendrogram Shows How the Clusters are Merged Hierarchically
Decompose data objects into several levels of nested partitioning (tree of clusters), called a dendrogram. A clustering of the data objects is obtained by cutting the dendrogram at the desired level. Then each connected component forms a cluster.

DIANA (Divisive Analysis)
! Inverse order of AGNES
! Start with root cluster containing all objects
! Recursively divide into subclusters
! Eventually each cluster contains a single object

Other Hierarchical Clustering Methods
! Major weaknesses of agglomerative clustering methods
! Do not scale well: time complexity of at least O(n²), where n is the number of total objects
! Can never undo what was done previously
! Integration of hierarchical with distance-based clustering
! BIRCH: uses CF-tree and incrementally adjusts the quality of sub-clusters
! CURE: selects well-scattered points from the cluster and then shrinks them towards the center of the cluster by a specified fraction

BIRCH
! BIRCH: Balanced Iterative Reducing and Clustering using Hierarchies (Zhang, Ramakrishnan & Livny, 1996)
! Incrementally construct a CF (Clustering Feature) tree
! Parameters: max diameter, max children
! Phase 1: scan DB to build an initial in-memory CF tree (each node: #points, sum, sum of squares)
! Phase 2: use an arbitrary clustering algorithm to cluster the leaf nodes of the CF-tree
! Scales linearly: finds a good clustering with a single scan and improves the quality with a few additional scans
! Weaknesses: handles only numeric data, sensitive to order of data records
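The AGNES loop described above (start with singleton clusters, repeatedly merge the pair with the least dissimilarity) can be sketched on 1-D data; the target number of clusters serves as the termination condition here, and the `linkage` parameter selects min distance (single link) or max distance (complete link) from the slide's criteria list.

```python
def agnes(points, k, linkage=min):
    """Minimal agglomerative (AGNES) sketch for 1-D points: each object
    starts as its own cluster; merge the least-dissimilar pair until k
    clusters remain. linkage=min gives single link, linkage=max complete link."""
    clusters = [[p] for p in points]          # initially: each object is a cluster

    def dissim(a, b):
        # dissimilarity between clusters = linkage over all pairwise distances
        return linkage(abs(x - y) for x in a for y in b)

    while len(clusters) > k:
        # find the pair of clusters with the least dissimilarity and merge it
        i, j = min(
            ((i, j) for i in range(len(clusters)) for j in range(i + 1, len(clusters))),
            key=lambda ij: dissim(clusters[ij[0]], clusters[ij[1]]),
        )
        clusters[i] = clusters[i] + clusters[j]
        del clusters[j]
    return clusters

clusters = agnes([1.0, 1.1, 1.2, 5.0, 5.1, 9.0], k=3)
```

The O(n²) pairwise search inside the loop is exactly the scaling weakness the slide points out, and each `del` is an irreversible merge: the method can never undo what was done previously.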
Clustering Feature Vector
Clustering Feature: CF = (N, LS, SS)
N: number of data points
LS: linear sum of the N data points, Σ_{i=1..N} X_i
SS: sum of squares of the N data points, Σ_{i=1..N} X_i²
[Figure: example CF vector computed from five 2-D points]

CF Tree
[Figure: CF tree with branching factor B and leaf capacity L — a root and non-leaf nodes holding (CF, child) entries, and leaf nodes holding CF entries linked by prev/next pointers]

CURE (Clustering Using REpresentatives)
! CURE: non-spherical clusters, robust wrt outliers
! Uses multiple representative points to evaluate the distance between clusters
! Stops the creation of a cluster hierarchy if a level consists of k clusters

Drawbacks of Distance-Based Methods
! Drawbacks of square-error-based clustering methods
! Consider only one point as representative of a cluster
! Good only for convex clusters, of similar size and density, and if k can be reasonably estimated

Cure: The Algorithm
! Draw random sample s
! Partition sample into p partitions with size s/p
! Partially cluster partitions into s/pq clusters
! Cluster partial clusters, shrinking representatives towards centroid
! Label data on disk

Data Partitioning and Clustering
[Figure: worked example of sampling, partitioning into p partitions of size s/p, and partial clustering into s/pq clusters]
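The CF vector defined above is easy to compute, and its key property is additivity: the CF of a merged cluster is the component-wise sum of the sub-cluster CFs, which is what lets BIRCH maintain the tree incrementally. A small sketch:

```python
def make_cf(points):
    """Clustering Feature CF = (N, LS, SS) of a set of d-dim points:
    N = count, LS = component-wise linear sum, SS = sum of squared norms."""
    n = len(points)
    ls = tuple(sum(p[d] for p in points) for d in range(len(points[0])))
    ss = sum(x * x for p in points for x in p)
    return (n, ls, ss)

def merge_cf(cf1, cf2):
    """CFs are additive: merging two sub-clusters just adds their CFs."""
    (n1, ls1, ss1), (n2, ls2, ss2) = cf1, cf2
    return (n1 + n2, tuple(a + b for a, b in zip(ls1, ls2)), ss1 + ss2)

a = make_cf([(3, 4), (2, 6), (4, 5)])   # CF of one sub-cluster
b = make_cf([(6, 2)])                   # CF of a new point
merged = merge_cf(a, b)                 # CF of the combined cluster
```

From (N, LS, SS) alone one can derive the centroid (LS/N) and the cluster radius/diameter, so BIRCH never needs to revisit the raw points.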
Cure: Shrinking Representative Points
! Shrink the multiple representative points towards the gravity center by a fraction α
! Multiple representatives capture the shape of the cluster

Model-Based Clustering
! Basic idea: Clustering as probability estimation
! One model for each cluster
! Generative model:
! Probability of selecting a cluster
! Probability of generating an object in the cluster
! Find max. likelihood or MAP model
! Missing information: Cluster membership
! Use EM algorithm
! Quality of clustering: Likelihood of test objects

Mixtures of Gaussians
! Cluster model: Normal distribution (mean, covariance)
! Assume: diagonal covariance, known variance, same for all clusters
! Max. likelihood: mean = avg. of samples
! But what points are samples of a given cluster?
! Estimate prob. that point belongs to cluster
! Mean = weighted avg. of points, weight = prob.
! But to estimate probs. we need the model
! Chicken-and-egg problem: use EM algorithm

EM Algorithm for Mixtures
! Initialization: Choose means at random
! E step:
! For all points and means, compute Prob(point | mean)
! Prob(mean | point) = Prob(mean) Prob(point | mean) / Prob(point)
! M step:
! Each mean = weighted avg. of points
! Weight = Prob(mean | point)
! Repeat until convergence

EM Algorithm (contd.)
! Guaranteed to converge to a local optimum
! K-means is a special case

AutoClass
! Developed at NASA (Cheeseman & Stutz, 1996)
! Mixture of Naïve Bayes models
! Variety of possible models for Prob(attribute | class)
! Missing information: Class of each example
! Apply EM algorithm as before
! Special case of learning a Bayes net with missing values
! Widely used in practice
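The E and M steps above can be sketched for the simplest case on the slides: 1-D Gaussians with known, shared variance and equal mixing weights, so only the means are estimated. This is a toy illustration with hand-picked initial means rather than the random initialization the slides describe.

```python
import math

def em_means(xs, k=2, sigma=1.0, means=None, iters=50):
    """EM sketch for a mixture of k 1-D Gaussians: known shared variance,
    equal priors, only the means are re-estimated each iteration."""
    means = list(means) if means else xs[:k]   # init (slides: choose at random)
    for _ in range(iters):
        # E step: Prob(mean | point) ∝ Prob(point | mean) under equal priors
        resp = []
        for x in xs:
            w = [math.exp(-((x - m) ** 2) / (2 * sigma ** 2)) for m in means]
            s = sum(w)
            resp.append([wi / s for wi in w])
        # M step: each mean = weighted average of points, weight = Prob(mean | point)
        means = [
            sum(r[j] * x for r, x in zip(resp, xs)) / sum(r[j] for r in resp)
            for j in range(k)
        ]
    return sorted(means)

data = [0.0, 0.2, -0.1, 10.0, 9.8, 10.1]
fitted = em_means(data, k=2, means=[0.5, 9.5])
```

With well-separated data the responsibilities become nearly hard assignments, and the fitted means converge to the per-blob averages; replacing the soft responsibilities with hard 0/1 assignments recovers exactly the k-means special case noted above.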
COBWEB
! Grows tree of clusters (Fisher, 1987)
! Each node contains: P(cluster), P(attribute | cluster) for each attribute
! Objects presented sequentially
! Options: Add to node, new node; merge, split
! Quality measure: Category utility: Increase in predictability of attributes / #clusters

A COBWEB Tree

Neural Network Approaches / Competitive Learning
! Neuron = Cluster = Centroid in instance space
! Layer = Level of hierarchy
! Several competing sets of clusters in each layer
! Objects sequentially presented to network
! Within each set, neurons compete to win object
! Winning neuron is moved towards the object
! Can be viewed as mapping from low-level features to high-level ones

Self-Organizing Feature Maps
! Clustering is also performed by having several units competing for the current object
! The unit whose weight vector is closest to the current object wins
! The winner and its neighbors learn by having their weights adjusted
! SOMs are believed to resemble processing that can occur in the brain
! Useful for visualizing high-dimensional data in 2- or 3-D space

Density-Based Clustering
! Clustering based on density (local cluster criterion), such as density-connected points
! Major features:
! Discover clusters of arbitrary shape
! Handle noise
! One scan
! Need density parameters as termination condition
! Representative algorithms:
! DBSCAN (Ester et al., 1996)
! DENCLUE (Hinneburg & Keim, 1998)
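A single competitive-learning step, as described above, amounts to finding the neuron whose centroid is closest to the presented object and moving it toward that object. A minimal sketch (the learning rate `lr` is an illustrative choice; a SOM would additionally update the winner's neighbors):

```python
def competitive_step(neurons, x, lr=0.5):
    """One winner-take-all step: the neuron (centroid) nearest to object x
    wins and is moved a fraction lr of the way toward x."""
    win = min(range(len(neurons)),
              key=lambda i: sum((a - b) ** 2 for a, b in zip(neurons[i], x)))
    neurons[win] = tuple(a + lr * (b - a) for a, b in zip(neurons[win], x))
    return win

neurons = [(0.0, 0.0), (10.0, 10.0)]
win = competitive_step(neurons, (2.0, 0.0))   # first neuron wins and moves
```

Presenting objects one at a time and nudging only the winner is what makes this an online analogue of k-means centroid updates.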
Definitions (I)
! Two parameters:
! Eps: Maximum radius of the neighborhood
! MinPts: Minimum number of points in an Eps-neighborhood of a point
! N_Eps(p) = {q ∈ D | dist(p,q) <= Eps}
! Directly density-reachable: A point p is directly density-reachable from a point q wrt. Eps, MinPts iff
! 1) p belongs to N_Eps(q)
! 2) q is a core point: |N_Eps(q)| >= MinPts

Definitions (II)
! Density-reachable:
! A point p is density-reachable from a point q wrt. Eps, MinPts if there is a chain of points p_1, …, p_n, p_1 = q, p_n = p such that p_{i+1} is directly density-reachable from p_i
! Density-connected:
! A point p is density-connected to a point q wrt. Eps, MinPts if there is a point o such that both p and q are density-reachable from o wrt. Eps and MinPts

DBSCAN: Density Based Spatial Clustering of Applications with Noise
! Relies on a density-based notion of cluster: A cluster is defined as a maximal set of density-connected points
! Discovers clusters of arbitrary shape in spatial databases with noise
[Figure: core, border, and outlier points for given Eps and MinPts]

DBSCAN: The Algorithm
! Arbitrarily select a point p
! Retrieve all points density-reachable from p wrt Eps and MinPts
! If p is a core point, a cluster is formed
! If p is a border point, no points are density-reachable from p and DBSCAN visits the next point of the database
! Continue the process until all of the points have been processed

DENCLUE: Using Density Functions
! DENsity-based CLUstEring (Hinneburg & Keim, 1998)
! Major features
! Good for data sets with large amounts of noise
! Allows a compact mathematical description of arbitrarily shaped clusters in high-dimensional data sets
! Significantly faster than other algorithms (faster than DBSCAN by a factor of up to 45)
! But needs a large number of parameters

DENCLUE
! Uses grid cells but only keeps information about grid cells that actually contain data points and manages these cells in a tree-based access structure
! Influence function: describes the impact of a data point within its neighborhood
! Overall density of the data space can be calculated as the sum of the influence functions of all data points
! Clusters can be determined mathematically by identifying density attractors
! Density attractors are local maxima of the overall density function
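The DBSCAN procedure described a few slides back can be sketched directly from the definitions: grow a cluster from each unvisited core point by repeatedly adding density-reachable points, expanding only through core points, and label everything never reached as noise. A toy version (brute-force neighborhood queries, noise marked as -1):

```python
def dbscan(points, eps, min_pts):
    """Minimal DBSCAN sketch: cluster ids 0,1,..., noise labeled -1."""
    def neighbors(i):
        return [j for j in range(len(points))
                if sum((a - b) ** 2 for a, b in zip(points[i], points[j])) <= eps ** 2]

    labels = [None] * len(points)
    cid = 0
    for i in range(len(points)):
        if labels[i] is not None:
            continue
        nbrs = neighbors(i)
        if len(nbrs) < min_pts:          # not a core point: tentatively noise
            labels[i] = -1
            continue
        labels[i] = cid                  # start a new cluster from this core point
        queue = [j for j in nbrs if j != i]
        while queue:
            j = queue.pop()
            if labels[j] == -1:
                labels[j] = cid          # previously-noise point becomes a border point
            if labels[j] is not None:
                continue
            labels[j] = cid
            jn = neighbors(j)
            if len(jn) >= min_pts:       # expand only through core points
                queue.extend(jn)
        cid += 1
    return labels

pts = [(0, 0), (0, 1), (1, 0), (1, 1), (10, 10), (10, 11), (11, 10), (5, 5)]
labels = dbscan(pts, eps=1.5, min_pts=3)
```

On this data the two dense groups become clusters 0 and 1 while the isolated point at (5, 5) stays noise, matching the border/core/outlier picture on the DBSCAN slide.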
Influence Functions
! Example: Gaussian influence function, the overall density it induces, and its gradient:
f_Gaussian(x, y) = e^(−d(x, y)² / 2σ²)
f^D_Gaussian(x) = Σ_{i=1..N} e^(−d(x, x_i)² / 2σ²)
∇f^D_Gaussian(x, x_i) = Σ_{i=1..N} (x_i − x) · e^(−d(x, x_i)² / 2σ²)

Density Attractors

Center-Defined & Arbitrary Clusters

Clustering: Summary
! Introduction
! Partitioning methods
! Hierarchical methods
! Model-based methods
! Density-based methods
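The Gaussian influence and density functions above translate directly into code; this 1-D sketch shows why dense regions produce the high density values that density attractors sit on (σ = 1 is an illustrative choice).

```python
import math

def influence(x, y, sigma=1.0):
    """Gaussian influence of data point y at location x (1-D case)."""
    return math.exp(-((x - y) ** 2) / (2 * sigma ** 2))

def density(x, data, sigma=1.0):
    """Overall density at x = sum of the influence functions of all data points."""
    return sum(influence(x, xi, sigma) for xi in data)

data = [1.0, 1.2, 0.8, 5.0]
# density is much higher near the tight group around 1.0 than near the lone point at 5.0
```

A density attractor is a local maximum of `density`; hill-climbing along the gradient formula above from each point would find it.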