Cluster Analysis
Angela Montanari and Laura Anderlucci

1 Introduction

Clustering a set of n objects into k groups is usually motivated by the aim of identifying internally homogeneous groups according to a specific set of variables. In order to accomplish this objective, the most common starting point is the computation of a dissimilarity matrix, which contains information about the dissimilarity between the observed units. According to the nature of the observed variables (quantitative, qualitative, binary or mixed-type variables), different measures of dissimilarity can be defined and used.

1.1 Dissimilarity measures

Let us consider the n × p data matrix, where n is the number of observed units and p is the number of observed variables; let i and j denote two generic objects and x_{ik} the value of the k-th variable for the i-th unit.

1.1.1 Quantitative variables

If the observed variables are all quantitative, each unit can be identified with a point in p-dimensional space, and the dissimilarity between two objects i and j can be measured through:

a) Euclidean distance:

   d_{ij} = \sqrt{ \sum_{k=1}^{p} (x_{ik} - x_{jk})^2 }

b) City-block or Manhattan distance:

   d_{ij} = \sum_{k=1}^{p} |x_{ik} - x_{jk}|

Both of these distances are special cases, for l = 2 and l = 1 respectively, of a more general family of distances known as the

c) Minkowski metric:

   d_{ij} = \left\{ \sum_{k=1}^{p} |x_{ik} - x_{jk}|^{l} \right\}^{1/l}

The larger l is, the greater the weight given to large differences |x_{ik} - x_{jk}|.

For all three distances the properties of a distance function hold:

1. d_{ij} \geq 0
2. d_{ii} = 0
3. d_{ij} = d_{ji}
4. d_{ij} \leq d_{it} + d_{tj} (triangle inequality)
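As an illustration (not part of the original notes), the following NumPy sketch computes the Minkowski family of distances for a pair of units; the function name and the toy data are ours.

```python
import numpy as np

def minkowski_distance(x_i, x_j, l=2):
    """Minkowski distance between two p-dimensional units.

    l = 2 gives the Euclidean distance, l = 1 the Manhattan distance.
    """
    x_i, x_j = np.asarray(x_i, dtype=float), np.asarray(x_j, dtype=float)
    return np.sum(np.abs(x_i - x_j) ** l) ** (1.0 / l)

# Example: two units described by p = 3 quantitative variables
x_i = [1.0, 2.0, 3.0]
x_j = [2.0, 0.0, 3.0]

print(minkowski_distance(x_i, x_j, l=2))  # Euclidean
print(minkowski_distance(x_i, x_j, l=1))  # Manhattan
print(minkowski_distance(x_i, x_j, l=4))  # heavier weight on large differences
```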

1.1.2 Binary variables

When dealing with binary or categorical variables it is not possible to define true distance measures, but one can use dissimilarity measures which:

- are close to zero when i and j are similar to each other;
- increase when the differences between the two units are larger;
- are symmetric;
- are equal to zero when considering the dissimilarity of an object with itself.

Nevertheless, these dissimilarity measures generally do not satisfy the triangle inequality.

Consider a variable x_{ik} that can assume only two values, coded as 0 and 1. In order to calculate the dissimilarity between two units, the relevant information is summarised in the following 2 × 2 table:

                 object j
                 1    0
   object i  1   a    b
             0   c    d

where a is the number of variables that take value 1 for both objects i and j, b is the number of variables that take value 1 in object i and value 0 in object j, c is the number of variables for which the opposite holds, and d is the number of variables that take value 0 for both objects i and j. Furthermore, a + b + c + d = p, with p equal to the number of observed variables.

Some of the best known dissimilarity measures are:

a) Simple (matching) dissimilarity coefficient:

   d_{ij} = \frac{b + c}{a + b + c + d}

b) Jaccard coefficient:

   d_{ij} = \frac{b + c}{a + b + c}

The Jaccard coefficient is recommended when the binary variables indicate presence/absence of a specific characteristic. In this case the 1/1 agreement can have a different meaning from the 0/0 agreement; indeed, the absence of a certain characteristic in both objects i and j does not contribute to the similarity between them. Hence, it can be more meaningful to divide by a + b + c rather than by a + b + c + d.

1.1.3 Categorical variables

When the observed variables are categorical with more than two levels, a simple measure of the dissimilarity between two units is given by

a) d_{ij} = 1 - c_{ij}

where c_{ij} is the ratio between the number of variables that take the same value in both units i and j and the total number of observed variables p.
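As a small illustration (our own, not from the original notes), the sketch below computes the simple matching and Jaccard dissimilarities for two binary units and the simple matching dissimilarity d_{ij} = 1 - c_{ij} for two categorical units; function names and example values are invented.

```python
import numpy as np

def binary_dissimilarities(x_i, x_j):
    """Simple matching and Jaccard dissimilarities for two binary (0/1) units."""
    x_i, x_j = np.asarray(x_i), np.asarray(x_j)
    a = np.sum((x_i == 1) & (x_j == 1))   # 1/1 agreements
    b = np.sum((x_i == 1) & (x_j == 0))
    c = np.sum((x_i == 0) & (x_j == 1))
    d = np.sum((x_i == 0) & (x_j == 0))   # 0/0 agreements
    simple_matching = (b + c) / (a + b + c + d)
    jaccard = (b + c) / (a + b + c)       # 0/0 agreements are ignored
    return simple_matching, jaccard

def categorical_dissimilarity(x_i, x_j):
    """d_ij = 1 - c_ij, where c_ij is the proportion of matching variables."""
    x_i, x_j = np.asarray(x_i), np.asarray(x_j)
    return 1.0 - np.mean(x_i == x_j)

# p = 6 presence/absence variables
print(binary_dissimilarities([1, 1, 0, 0, 1, 0], [1, 0, 0, 0, 1, 1]))
# p = 4 categorical variables with more than two levels
print(categorical_dissimilarity(["red", "small", "round", "A"],
                                ["red", "large", "round", "B"]))
```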

1.1.4 Mixed-type variables

When the observed variables are of different natures, a dissimilarity measure that is able to deal with their specific characteristics is the

a) Gower index:

   d_{ij} = 1 - s_{ij},  with  s_{ij} = \frac{\sum_{k=1}^{p} \delta_{ijk} s_{ijk}}{\sum_{k=1}^{p} \delta_{ijk}}

where, for quantitative variables,

   s_{ijk} = 1 - \frac{|x_{ik} - x_{jk}|}{R_k}

with R_k the range of variable x_k. For qualitative or binary variables, s_{ijk} = 1 if i and j agree on variable k and s_{ijk} = 0 if they disagree. The indicator \delta_{ijk} is equal to 1 if both values x_{ik} and x_{jk} are recorded and equal to 0 otherwise; furthermore, for asymmetric binary variables it is set to 0 when a 0/0 agreement is observed.
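The following is a minimal sketch of the Gower dissimilarity (not part of the original notes); for brevity it omits the special rule for asymmetric binary variables, and the function name and example data are invented.

```python
import numpy as np

def gower_dissimilarity(x_i, x_j, is_quantitative, ranges):
    """Gower dissimilarity d_ij = 1 - s_ij between two units.

    is_quantitative[k] is True for quantitative variables, False for
    qualitative/binary ones; ranges[k] is the range R_k of variable k
    (used only for quantitative variables). Missing values are np.nan.
    Note: the asymmetric-binary 0/0 rule is not implemented here.
    """
    num, den = 0.0, 0.0
    for k in range(len(x_i)):
        # delta_{ijk}: both values must be recorded
        if np.isnan(x_i[k]) or np.isnan(x_j[k]):
            continue
        if is_quantitative[k]:
            s_ijk = 1.0 - abs(x_i[k] - x_j[k]) / ranges[k]
        else:
            s_ijk = 1.0 if x_i[k] == x_j[k] else 0.0
        num += s_ijk
        den += 1.0
    return 1.0 - num / den

# Example: one quantitative variable (range 10) and two binary ones
x_i = [3.0, 1, 0]
x_j = [7.0, 1, np.nan]   # the third variable is missing for unit j
print(gower_dissimilarity(x_i, x_j, [True, False, False], [10.0, None, None]))
```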

2 Agglomerative Hierarchical Clustering Methods

Hierarchical clustering methods are criteria that create nested partitions, allowing the data to be examined at different levels of within-cluster homogeneity. These nested relationships are often represented by dendrograms; by cutting the dendrogram at a certain level of dissimilarity we obtain a partition of the objects into groups. Each partition into k groups is optimal in a step-wise sense: it is the best partition into k groups one can obtain given the partition into k+1 groups defined at the previous step, for agglomerative methods, or given the partition into k-1 groups yielded at the previous step, for divisive methods.

Notice that, for agglomerative hierarchical clustering methods, once two clusters are merged they remain merged until the end of the hierarchy; a mistake committed in an early phase of the classification can never be corrected. This is both the strength and the weakness of this kind of method: the larger the number of units, the higher the probability of misclassifying a unit.

If the aim is to find a number of clusters much smaller than the number of units (k << n), divisive methods have a lower risk of committing such a mistake, since the number of steps needed to reach a partition into k groups is smaller. On the other hand, divisive methods have a high computational complexity, which is why agglomerative approaches are usually preferred.

Starting from the dissimilarity matrix, agglomerative hierarchical clustering methods initially consider each unit as a separate cluster, and then merge into one cluster the two units that have the lowest dissimilarity. Next, the dissimilarities between this new cluster and the others are evaluated, and again the two groups with the lowest value in the dissimilarity matrix are merged. The different clustering methods in this class correspond to the different ways in which the dissimilarity between clusters is defined and evaluated.

2.1 Single Linkage (Nearest Neighbour) method

The distance between two clusters C and G is defined as the dissimilarity between their two closest points:

   d_{CG} = \min_{i \in C, \, j \in G} d_{ij}

i.e. it is the shortest distance between any element i of C and any element j of G. Clusters obtained with the single linkage method tend to have a long and narrow shape and to be internally poorly homogeneous, since only two similar observations from two different clusters are needed in order to merge them (chaining effect). For this reason the method is not able to identify overlapping clusters, but it works well when clusters are well separated, because in every cluster yielded by the procedure each unit is more similar to the units belonging to its own cluster than to units of different groups.
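As a quick illustration (not from the original notes), the sketch below computes the single linkage distance between two given clusters directly from a dissimilarity matrix; the toy data and cluster memberships are invented.

```python
import numpy as np
from scipy.spatial.distance import pdist, squareform

# Toy dissimilarity matrix for 6 units (Euclidean distances on 1-D data)
x = np.array([[0.0], [0.5], [1.0], [5.0], [5.5], [6.0]])
D = squareform(pdist(x))

C = [0, 1, 2]   # indices of the units in cluster C
G = [3, 4, 5]   # indices of the units in cluster G

# Single linkage: dissimilarity between the two closest points of C and G
d_single = D[np.ix_(C, G)].min()
# (replacing .min() with .max() or .mean() gives the complete and
#  average linkage distances introduced in the next subsections)
print(d_single)   # 4.0
```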

2.2 Complete Linkage (Furthest Neighbour) method

The distance between two clusters C and G is defined as the dissimilarity between their two furthest-apart points:

   d_{CG} = \max_{i \in C, \, j \in G} d_{ij}

i.e. it is the longest distance between any element i of C and any element j of G. In other words, d_{CG} is the diameter of the smallest sphere that can enclose the cluster obtained by merging C and G. Clusters yielded by this procedure tend to be spherical and compact; nothing can be said about their separation.

2.3 Average Linkage

The distance between two clusters C and G is defined as the average of all the dissimilarities between pairs of points, one from each cluster:

   d_{CG} = \frac{1}{n_C n_G} \sum_{i \in C, \, j \in G} d_{ij}

where n_C is the number of elements in cluster C and n_G is the number of elements in cluster G.

2.4 Centroid Linkage

The distance between two clusters C and G is defined as the Euclidean distance between the two cluster centres:

   d_{CG} = \| \bar{x}_C - \bar{x}_G \|

where \bar{x}_C is the mean vector of the observed variables in cluster C and \bar{x}_G is the mean vector of the observed variables in cluster G. This method is suitable for quantitative variables only. A problem that may be encountered with this approach is that the dissimilarity at which a merge occurs can be lower than the one observed at the previous merge (because of the centroid shift), and this makes the results less interpretable.

2.5 Ward method

The distance between two clusters C and G is defined in terms of the deviance between the two clusters. Let x_{ikg} be the value of variable k observed on unit i in cluster g, and let \bar{x}_{kg} be the average of variable k in cluster g; then

   E_g = \sum_{k=1}^{p} \sum_{i=1}^{n_g} (x_{ikg} - \bar{x}_{kg})^2

is the deviance within the g-th cluster and E = \sum_g E_g is the within deviance of the whole partition. As usual, the classification process starts by considering each unit as a cluster; in this case the within deviance is 0. Each merge leads to an increase of the within deviance; the algorithm looks for the merge implying the smallest increase of the within deviance of the whole partition, i.e. the two groups that are merged are those whose fusion keeps the between-cluster deviance of the resulting partition as large as possible. Like centroid linkage, this method can only be applied to quantitative data but, unlike it, it does not have problems with the ordering of the dissimilarities at the different levels of the hierarchy. Both methods identify spherical clusters.
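To see the whole agglomerative procedure at work, here is a minimal sketch using SciPy (an illustration of ours, not part of the original notes); the toy data are simulated, and each of the linkage criteria above corresponds to one value of the method argument.

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster
from scipy.spatial.distance import pdist

rng = np.random.default_rng(0)
# Toy data: two well-separated groups of quantitative observations
X = np.vstack([rng.normal(0, 1, size=(10, 2)),
               rng.normal(5, 1, size=(10, 2))])

D = pdist(X, metric='euclidean')          # condensed dissimilarity matrix

# One hierarchy per linkage criterion discussed above
for method in ['single', 'complete', 'average', 'centroid', 'ward']:
    Z = linkage(D, method=method)         # agglomerative merge history
    labels = fcluster(Z, t=2, criterion='maxclust')  # cut the dendrogram at k = 2
    print(method, labels)
```

Note that, as discussed above, centroid and Ward linkage assume Euclidean distances on quantitative data.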

3 Partitioning Clustering Methods

The hierarchical methods described in the previous section generate n different nested classifications, ranging from n one-element clusters to a single cluster containing all n observations. Partitioning methods instead give rise to a single partition into K groups, where K is either specified in advance or determined by the clustering method itself. In the partitioning approach, the number of clusters K is decided a priori and the data are split into clusters in such a way that the within-cluster dissimilarity is minimized. The function to be minimized is

   \sum_{g=1}^{K} \sum_{i: x_i \in C_g} \| x_i - \bar{x}_{C_g} \|^2

The problem looks simple, but in practice the numbers involved make it impossible to consider all possible partitions of n individuals into K groups. That is why several algorithms designed to search for the minimum of the clustering criterion have been developed.

3.1 K-means Algorithm

A very popular partitioning method is the K-means algorithm. The classification begins by choosing the first K observations of the data matrix as centroids and carries on by assigning each of the remaining n - K units to the closest centroid. This first partition into K groups is then improved by taking the centroids of the K groups as new aggregation points and reassigning each unit to the closest centroid. The resulting classification depends strongly on the order in which units are assigned to the different groups; nevertheless, it has been shown that this effect is of little relevance when the clusters are clearly separated.
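The sketch below is a small illustrative implementation of the procedure just described (it is ours, not the one used in any particular software); it takes the first K observations as initial centroids and then iterates the reassignment step, with invented toy data.

```python
import numpy as np

def k_means(X, K, n_iter=100):
    """Minimal K-means sketch: first K rows as initial centroids,
    then repeated reassignment to the closest centroid."""
    centroids = X[:K].copy()
    for _ in range(n_iter):
        # assign each unit to the closest centroid (squared Euclidean distance)
        dists = ((X[:, None, :] - centroids[None, :, :]) ** 2).sum(axis=2)
        labels = dists.argmin(axis=1)
        # recompute centroids (keep the old one if a cluster becomes empty)
        new_centroids = np.array([
            X[labels == g].mean(axis=0) if np.any(labels == g) else centroids[g]
            for g in range(K)
        ])
        if np.allclose(new_centroids, centroids):
            break                            # assignments have stabilised
        centroids = new_centroids
    return labels, centroids

rng = np.random.default_rng(1)
X = np.vstack([rng.normal(0, 1, size=(15, 2)), rng.normal(6, 1, size=(15, 2))])
labels, centroids = k_means(X, K=2)
print(labels)
```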

4 How to choose the number of clusters

So far we have not discussed how to choose an appropriate number of clusters. There are plenty of methods in the literature designed to address this problem, but as yet there is no unique solution. A few methods are presented here.

4.1 The Average Silhouette Width (ASW)

For a partition of n units into K clusters C_1, ..., C_K, suppose object i has been assigned to cluster C_h. We denote by a(i) the average dissimilarity of i to all other objects of cluster C_h:

   a(i) = \frac{1}{|C_h| - 1} \sum_{j \in C_h, \, j \neq i} d(i, j)

This expression makes sense only when C_h contains objects other than i. Let us now consider any cluster C_l different from C_h and define the average dissimilarity of i to all objects of C_l:

   d(i, C_l) = \frac{1}{|C_l|} \sum_{j \in C_l} d(i, j)

After computing d(i, C_l) for all clusters C_l different from C_h, we select the smallest of them:

   b(i) = \min_{C_l : \, i \notin C_l} d(i, C_l)

The cluster for which this minimum is attained is called the neighbour of object i; it is, so to speak, the second-best choice for clustering object i. The silhouette s(i) is obtained by combining a(i) and b(i) as follows:

   s(i) = \begin{cases}
            1 - a(i)/b(i), & \text{if } a(i) < b(i) \\
            0,             & \text{if } a(i) = b(i) \\
            b(i)/a(i) - 1, & \text{if } a(i) > b(i)
          \end{cases}

This can be rearranged into a single formula:

   s(i) = \frac{b(i) - a(i)}{\max\{a(i), b(i)\}}

and it is easily seen that -1 \leq s(i) \leq 1 for each object i. When s(i) is at its largest (that is, close to 1), the within dissimilarity a(i) is much smaller than the smallest between dissimilarity b(i). Therefore we can say that i is well classified: the second-best choice is not nearly as close as the actual one. When s(i) is about zero, a(i) and b(i) are approximately equal, so it is not clear whether i should have been assigned to C_h or to its neighbour; it lies roughly equally far from both. The worst situation occurs when s(i) is close to -1, i.e. when a(i) is much larger than b(i): then i lies on average closer to its neighbour than to C_h, and it would have seemed better to assign object i to its neighbour. The silhouette s(i) therefore measures how well unit i has been classified.

By averaging the s(i) over all observations i = 1, ..., n, we obtain the so-called average silhouette width (ASW):

   \bar{s}(K) = \frac{1}{n} \sum_{i=1}^{n} s(i)

If K is not fixed and needs to be estimated, the ASW estimate \hat{K}_{ASW} is obtained by maximizing \bar{s}(K) over K. This criterion leads to a clustering that emphasises the separation between the clusters and their neighbouring clusters.

4.2 Pearson Gamma

If K is not fixed and needs to be estimated, the Pearson Gamma (PG) estimator \hat{K}_{\Gamma} maximises the Pearson correlation \rho(d, m) between the vector d of pairwise dissimilarities and the binary vector m that is 0 for every pair of observations in the same cluster and 1 for every pair of observations in different clusters. PG emphasises a good approximation of the dissimilarity structure by the clustering, in the sense that pairs of observations lying in different clusters should tend to have large dissimilarities.
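As a closing illustration (ours, not part of the original notes), the sketch below evaluates both criteria for several candidate values of K on a Ward hierarchy built from simulated data; the function names and the toy data are invented.

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster
from scipy.spatial.distance import pdist, squareform

rng = np.random.default_rng(2)
X = np.vstack([rng.normal(0, 1, size=(20, 2)),
               rng.normal(6, 1, size=(20, 2)),
               rng.normal((0, 8), 1, size=(20, 2))])

d = pdist(X, metric='euclidean')     # vector of pairwise dissimilarities
D = squareform(d)                    # the same dissimilarities as a square matrix
Z = linkage(d, method='ward')

def average_silhouette_width(D, labels):
    """Average of the silhouettes s(i) computed from the dissimilarity matrix D."""
    n = len(labels)
    s = np.zeros(n)
    for i in range(n):
        same = (labels == labels[i])
        same[i] = False
        if not same.any():           # singleton cluster: leave s(i) = 0
            continue
        a = D[i, same].mean()
        b = min(D[i, labels == l].mean() for l in set(labels) if l != labels[i])
        s[i] = (b - a) / max(a, b)
    return s.mean()

def pearson_gamma(d, labels):
    """Correlation between the dissimilarity vector d and the 0/1 vector m
    indicating whether a pair of observations falls in different clusters."""
    n = len(labels)
    m = np.array([float(labels[i] != labels[j])
                  for i in range(n) for j in range(i + 1, n)])
    return np.corrcoef(d, m)[0, 1]

# Evaluate both criteria for a range of candidate values of K
for K in range(2, 7):
    labels = fcluster(Z, t=K, criterion='maxclust')
    print(K, average_silhouette_width(D, labels), pearson_gamma(d, labels))
```

The value of K maximising each criterion gives the corresponding estimate \hat{K}_{ASW} or \hat{K}_{\Gamma}.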

Remark. When the number of clusters is known (hence there is no need to estimate it), these indexes can still be used as measures of clustering quality.