Chapter 6 Continued: Partitioning Methods

Chapter 6 Continued: Partitioning Methods Partitioning methods fix the number of clusters k and seek the best possible partition for that k. The goal is to choose the partition which gives the optimal value for some clustering criterion, or objective function. In reality, we cannot search all possible partitions to try to optimize the clustering criterion, but the algorithms are designed to search intelligently among the partitions. For a fixed k, partitioning methods are able to investigate far more possible partitions than a hierarchical method is. In practice, it is recommended to run a partitioning method for several choices of k and examine the resulting clusterings. STAT J530 Page 1

K-means Clustering The goal of K-means, the most well-known partitioning method, is to find the partition of n objects into k clusters that minimizes a within-cluster sum of squares criterion. In the traditional K-means approach, closeness to the cluster centers is defined in terms of the squared Euclidean distance $d_E^2(x, \bar{x}_c) = (x - \bar{x}_c)'(x - \bar{x}_c) = \sum_{m=1}^{q} (x_m - \bar{x}_{cm})^2$, where $x = (x_1, \ldots, x_q)'$ is any particular observation and $\bar{x}_c$ is the centroid (multivariate mean vector) for, say, cluster c. STAT J530 Page 2

K-means Clustering (Continued) The goal is to minimize the sum, over all objects within all clusters, of these squared Euclidean distances: $WSS = \sum_{c=1}^{k} \sum_{i \in c} d_E^2(x_i, \bar{x}_c)$. In practice, K-means will not generally achieve the global minimum of this criterion over the whole space of partitions. In fact, only under certain conditions will it achieve even a local minimum (Selim and Ismail, 1984). STAT J530 Page 3
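
As a concrete illustration (not part of the original slides), here is a minimal R sketch that computes the WSS criterion for an arbitrary partition; the data matrix x and the cluster labels are simulated for the example.

## Compute WSS for a data matrix 'x' and a vector of cluster labels 'cluster'
wss <- function(x, cluster) {
  sum(sapply(split(seq_len(nrow(x)), cluster), function(idx) {
    centroid <- colMeans(x[idx, , drop = FALSE])       # cluster centroid
    sum(sweep(x[idx, , drop = FALSE], 2, centroid)^2)  # squared distances to it
  }))
}

## Example: simulated data and an arbitrary assignment into k = 3 clusters
set.seed(1)
x <- matrix(rnorm(100 * 2), ncol = 2)
wss(x, sample(1:3, 100, replace = TRUE))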

The K-means Algorithm The K-means algorithm (MacQueen, 1967) begins by randomly allocating the n objects into k clusters (or by randomly specifying k centroids). One at a time, the algorithm moves each object to the cluster whose centroid is closest to it, using the measure of closeness $d_E^2(x, \bar{x}_c)$. When an object is moved, the centroids are immediately recalculated for the cluster gaining the object and for the cluster losing it. The method repeatedly cycles through the objects until no reassignments of objects take place. The final clustering result will depend somewhat on the initial configuration of the objects, so in practice it is good to rerun the algorithm a few times (with different starting points) to make sure the result is stable. The R function kmeans performs K-means clustering. STAT J530 Page 4
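
A minimal usage sketch for kmeans() with simulated data; kmeans() defaults to the Hartigan-Wong algorithm (MacQueen's variant is available via the algorithm argument), and nstart reruns the procedure from several random starting configurations, keeping the solution with the smallest WSS.

## K-means on simulated data, rerun from 25 random starts
set.seed(530)
x <- matrix(rnorm(150 * 2), ncol = 2)
km <- kmeans(x, centers = 3, nstart = 25)
km$cluster        # cluster assignment for each object
km$tot.withinss   # the minimized WSS criterion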

Ward's Method The method of Ward (1963) is a hybrid of hierarchical clustering and K-means. It begins with n clusters and joins clusters together, one step at a time. At each step, the method searches over all possible ways to join a pair of clusters so that the K-means criterion WSS is minimized for that step. It begins with each object as its own cluster (so that WSS = 0) and concludes with all objects in one cluster. The R function hclust performs Ward's method when a Ward option is specified via the method argument (method = "ward" in older versions of R; "ward.D" or "ward.D2" in current versions). STAT J530 Page 5
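
A minimal sketch with simulated data; "ward.D2" applies Ward's criterion to Euclidean distances.

## Ward's method via hclust() on simulated data
set.seed(530)
x <- matrix(rnorm(60 * 2), ncol = 2)
hc <- hclust(dist(x), method = "ward.D2")
plot(hc)             # dendrogram
cutree(hc, k = 3)    # cut the tree into k = 3 clusters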

K-medoids Clustering The K-medoids algorithm (Kaufman and Rousseeuw, 1987) is a robust alternative to K-means. It attempts to minimize the criterion $\mathrm{Crit}_{\mathrm{Med}} = \sum_{c=1}^{k} \sum_{i \in c} d(x_i, m_c)$, where $m_c$ is a medoid, or most representative object, for cluster c. The algorithm begins (in the "build step") by selecting k such representative objects. It proceeds by assigning each object to the cluster with the closest medoid. Then (in the "swap step"), if swapping any non-medoid object with a medoid results in a decrease in the criterion $\mathrm{Crit}_{\mathrm{Med}}$, the swap is made. The algorithm stops when no swap can decrease $\mathrm{Crit}_{\mathrm{Med}}$. STAT J530 Page 6

K-medoids Clustering (Continued) Like K-means, the K-medoids algorithm does not globally minimize its criterion in general. The R function pam in the cluster package performs K-medoids clustering. An advantage of K-medoids is that (unlike kmeans) the function can accept a dissimilarity matrix, as well as a raw data matrix. This is because the criterion to be minimized is a direct sum of pairwise dissimilarities between objects. The pam function also produces tools called the silhouette plot and average silhouette width to guide the choice of k (see examples). STAT J530 Page 7
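
A minimal sketch of pam() on simulated data, showing both the raw-data and the dissimilarity-matrix interfaces and the silhouette plot mentioned above.

## K-medoids with pam() from the cluster package
library(cluster)
set.seed(530)
x <- matrix(rnorm(100 * 2), ncol = 2)

pm <- pam(x, k = 3)       # raw data matrix: Euclidean dissimilarities by default
pm$medoids                # the k most representative objects

d <- dist(x)              # pam() also accepts a dissimilarity object directly
pm2 <- pam(d, k = 3)
plot(silhouette(pm2))     # silhouette plot
pm2$silinfo$avg.width     # average silhouette width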

Specialized Partitioning Methods The K-medoids algorithm is computationally infeasible for very large n (n > 5000 or so). The R function clara (Clustering Large Applications) is designed as a large-sample version of pam: the medoids are calculated from randomly selected subsets of the data, and the build step and swap step are carried out on these subsets rather than on the entire data set. Fuzzy cluster analysis (implemented by fanny in R) allows each object to have partial membership in several clusters. Rather than assigning each object to only one cluster, it assigns the object a membership coefficient for each cluster, reflecting the degree to which the object belongs to that cluster. STAT J530 Page 8
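
A minimal sketch of clara() and fanny() from the cluster package, using simulated data; the sample sizes and the number of subsets are arbitrary choices for the example.

library(cluster)
set.seed(530)

## clara: pam-style clustering computed on random subsets, suitable for large n
big_x <- matrix(rnorm(20000 * 2), ncol = 2)
cl <- clara(big_x, k = 3, samples = 50)
table(cl$clustering)

## fanny: fuzzy clustering; 'membership' holds each object's membership
## coefficients across the k clusters (each row sums to 1)
small_x <- matrix(rnorm(100 * 2), ncol = 2)
fz <- fanny(small_x, k = 3)
head(fz$membership)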

Objective Methods to Determine the Number of Clusters k At some point we need to choose a single value of k to get a clustering solution. A variety of criteria have been proposed to pick the best value of k. The average silhouette width is based on the difference between the average dissimilarity of objects to other objects in their own cluster and the average dissimilarity of objects to the objects in the nearest neighboring cluster. The larger the average silhouette width, the better the clustering of the objects. We could calculate the average silhouette width for clusterings based on several values of k and choose the k with the largest average silhouette width. The silhouette function in the cluster package of R computes silhouette widths (and hence the average silhouette width) for any clustering result and distance matrix. STAT J530 Page 9
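
A minimal sketch (simulated data) of this strategy, using pam() so that the average silhouette width is returned directly for each k.

library(cluster)
set.seed(530)
x <- matrix(rnorm(150 * 2), ncol = 2)

## Average silhouette width for k = 2, ..., 8; choose the k with the largest value
avg_sil <- sapply(2:8, function(k) pam(x, k)$silinfo$avg.width)
names(avg_sil) <- 2:8
avg_sil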

Other Methods to Determine the Number of Clusters Other criteria for choosing k include the Dunn index and the Davies-Bouldin index, which are implemented with the clv.dunn and clv.davies.bouldin functions in the clv package. Especially with K-means clustering, a common way to choose k is to plot the within-cluster sum of squares WSS of the K-means partitions for a variety of choices of k. As k increases, the corresponding WSS will decrease, and at some point it will level off. The best choice of k usually occurs near the "elbow" in this plot. STAT J530 Page 10
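
A minimal sketch of the elbow plot with simulated data.

## WSS for k = 1, ..., 10 using kmeans(); look for the "elbow" in the plot
set.seed(530)
x <- matrix(rnorm(150 * 2), ncol = 2)
wss_k <- sapply(1:10, function(k) kmeans(x, centers = k, nstart = 25)$tot.withinss)
plot(1:10, wss_k, type = "b", xlab = "k", ylab = "Within-cluster sum of squares")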

Model-based Clustering Neither hierarchical nor partitioning methods assume a specific statistical model for the data. They are strictly exploratory tools, and no formal inference about a wider population is possible. Model-based clustering assumes that the population generating the data consists of k subpopulations, which correspond to the k clusters we seek. Therefore, the distribution for the data is assumed to be composed of k densities. This idea was originally proposed by Scott and Symons (1971) but fully developed in recent years by Banfield and Raftery (1993) and Fraley and Raftery (2002). STAT J530 Page 11

Clustering Model Setup Let $\gamma = [\gamma_1, \ldots, \gamma_n]$ be a vector of cluster labels, such that $\gamma_i = j$ if observation $x_i$ is from the j-th subpopulation. Suppose the subpopulation densities are denoted by $f_j(x; \theta_j)$, where $\theta_j$ contains the set of unknown parameters for the j-th density. Then the likelihood, given the observed data, is $L(\theta_1, \ldots, \theta_k, \gamma \mid x_1, \ldots, x_n) = \prod_{i=1}^{n} f_{\gamma_i}(x_i; \theta_{\gamma_i})$. Fitting the model amounts to choosing $\theta_1, \ldots, \theta_k, \gamma$ to maximize this likelihood. The estimated $\gamma$ is the clustering vector that defines which cluster each object is assigned to. STAT J530 Page 12

The Multivariate Normality Assumption We may assume that each subpopulation (j = 1, ..., k) follows a multivariate normal density with mean vector $\mu_j$ and covariance matrix $\Sigma_j$ as its parameters. Then the likelihood becomes $L(\theta_1, \ldots, \theta_k, \gamma) \propto \prod_{j=1}^{k} \prod_{i:\,\gamma_i = j} |\Sigma_j|^{-1/2} \exp\left[-\tfrac{1}{2}(x_i - \mu_j)'\Sigma_j^{-1}(x_i - \mu_j)\right]$. The MLE of $\mu_j$ is $\bar{x}_j$, the sample mean vector for the observations in subpopulation j. STAT J530 Page 13

The Multivariate Normality Assumption (continued) Replacing $\mu_j$ with $\bar{x}_j$, the log-likelihood function is a constant plus $-\tfrac{1}{2}\sum_{j=1}^{k}\left\{\mathrm{trace}(W_j \Sigma_j^{-1}) + n_j \ln|\Sigma_j|\right\}$, where $W_j$ is the matrix of sums of squares and cross-products of the variables for the observations in subpopulation j, and $n_j$ is the number of such observations. We can assume a certain structure for the covariance matrices $\Sigma_j$ (j = 1, ..., k) and then determine computationally the value of $\gamma$ that maximizes this (log) likelihood. STAT J530 Page 14

Possible Covariance Structures We could consider a few possible covariance structures. A simple (maybe unrealistic!) assumption is that every subpopulation has the same spherical covariance structure, $\Sigma_j = \sigma^2 I$ for all j. In this case, $\gamma$ is chosen so that the total within-group sum of squares $\mathrm{trace}\left(\sum_{j=1}^{k} W_j\right)$ is minimized. This tends to produce clusters that are spherical and roughly equal in size. A slightly more complicated assumption is that each subpopulation has the same covariance structure, i.e., $\Sigma_j = \Sigma$ for all j = 1, ..., k. This tends to produce clusters that are elliptical, with roughly the same shape and orientation. STAT J530 Page 15

Other Covariance Structures An extremely unrestrictive assumption is that each subpopulation may have a completely different covariance structure $\Sigma_j$, j = 1, ..., k. This may produce clusters that differ in size, shape, and orientation. We might also consider assumptions that are less restrictive than the equal-covariances assumption yet more parsimonious than the unstructured-covariances assumption. The covariance structure we assume leads to a clustering solution in which the sizes, shapes, and orientations of the clusters might be the same or might differ. In practice, the R function Mclust in the mclust package considers many such models, letting the covariance assumptions and the number of clusters k vary. Usually the Bayesian Information Criterion (BIC) is used to choose the best of all these competing models and thus determine the model-based clustering result. STAT J530 Page 16
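
A minimal sketch of Mclust() on simulated two-group data; Mclust() fits a range of covariance structures over a range of k and reports the model with the best BIC.

library(mclust)
set.seed(530)
x <- rbind(matrix(rnorm(100 * 2, mean = 0), ncol = 2),
           matrix(rnorm(100 * 2, mean = 4), ncol = 2))
mc <- Mclust(x)
summary(mc)              # chosen covariance model and number of clusters
plot(mc, what = "BIC")   # BIC across models and values of k
head(mc$classification)  # estimated cluster labels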

Clustering Binary Data When the q variables measured on each observation are binary (e.g., representing the presence or absence of some characteristic), the objects may still be clustered based on a distance measure. Suppose, for each individual (i = 1, ..., n), we let the binary variable $X_{ij}$ (for j = 1, ..., q) take the value 0 or 1. Then two individuals have a match on a binary variable if both individuals have the same value for that variable (either both 0 or both 1). Otherwise, the two individuals are said to have a mismatch on the binary variable. Calculating squared Euclidean distances $\sum_{j=1}^{q}(X_{ij} - X_{i'j})^2$ between each pair of rows i and i' of this sort of data matrix of 0s and 1s amounts to counting the total number of mismatches for each pair of objects. Once we calculate the distances, we can input them into a standard clustering algorithm like K-medoids or a hierarchical method. STAT J530 Page 17
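
A minimal sketch with simulated 0/1 data; for binary data the squared Euclidean distance between two rows equals the number of mismatching variables, which is the same as the Manhattan distance.

set.seed(530)
xb <- matrix(rbinom(10 * 6, size = 1, prob = 0.4), nrow = 10)

mismatches <- dist(xb, method = "manhattan")  # number of mismatches per pair
as.matrix(mismatches)[1:4, 1:4]

## These distances can then be passed to, e.g., pam() or hclust():
## cluster::pam(mismatches, k = 2)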

Meaning of Matches and Mismatches for Binary Data Using squared Euclidean distance essentially treats 0-0 matches and 1-1 matches as equally important. Is this appropriate? It depends on the situation: if a binary variable measures a very rare (or very common) characteristic, then a 1-1 match may be more meaningful than a 0-0 match (or vice versa). If a variable equals 1 when an individual is a strict vegan and 0 otherwise, then a 1-1 match might indicate two similar individuals, but a 0-0 match would be less informative. If a variable equals 1 when an individual knows how to read and 0 otherwise, then a 0-0 match might indicate two similar individuals, but a 1-1 match would be less informative. STAT J530 Page 18

Other Measures of Distance for Binary Data Define a 2 2 table counting the matches (a = total 0-0 matches, d = total 1-1 matches) and mismatches (b = total 0-1 mismatches, c = total 1-0 mismatches) for a pair of objects: Y i Y i 0 1 Totals 0 a b a + b 1 c d c + d Totals a + c b + d q = a + b + c + d Defining the distance between the two objects to be b+c q matches and 1-1 matches. gives equal weights to 0-0 STAT J530 Page 19

Other Measures of Distance for Binary Data (Continued) Defining the distance between the two objects to be (b + c)/(b + c + d) ignores 0-0 matches, treating them as irrelevant (vegan example?). Defining the distance between the two objects to be (b + c)/(a + b + c) ignores 1-1 matches, treating them as irrelevant (reading example?). Several other distance measures based on a, b, c, d are possible (see Johnson and Wichern, p. 674). STAT J530 Page 20
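
A minimal sketch (simulated 0/1 data) of the three binary distance measures discussed above; dist(..., method = "binary") computes the (b + c)/(b + c + d) measure, and flipping the 0/1 coding gives the (b + c)/(a + b + c) measure.

set.seed(530)
xb <- matrix(rbinom(10 * 6, size = 1, prob = 0.4), nrow = 10)
q  <- ncol(xb)

d_simple  <- dist(xb, method = "manhattan") / q  # (b+c)/q: equal weight to 0-0 and 1-1 matches
d_jaccard <- dist(xb, method = "binary")         # (b+c)/(b+c+d): ignores 0-0 matches
d_flip    <- dist(1 - xb, method = "binary")     # (b+c)/(a+b+c): ignores 1-1 matches

round(as.matrix(d_jaccard)[1:3, 1:3], 2)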