Data Mining and Data Warehousing
Henryk Maciejewski
Data Mining: Clustering

Clustering Algorithms
Contents: K-means, hierarchical algorithms, linkage functions, vector quantization, SOM
Clustering Formulation
(Figure: a data table of objects described by attributes, fed into a clustering model)
Find groups of similar points (observations) in multidimensional space.
No target variable (unsupervised learning).
Methods of Clustering: Overview
Variety of methods:
- Hierarchical clustering: creates a hierarchy of clusters (one cluster entirely contained within another cluster)
- Non-hierarchical methods: create disjoint clusters
- Overlapping clusters: objects can belong to more than one cluster simultaneously
- Fuzzy clusters: defined by the probability (grade) of membership of each object in each cluster
Useful data preprocessing prior to clustering:
- PCA (Principal Component Analysis) to reduce the dimensionality of the data
- Data standardization (transform the data to reduce the influence of variables with larger variance on the results of clustering)
Introductory Example
97 countries described by 3 attributes: Birth, Death, and InfantDeath rate (given as number per 1000; data from 1995).
Example, cntd.
Analysis I: clustering the raw data with the K-means algorithm.
Result: 3 clusters (number of observations in each cluster: 13, 32, 52).
Example: Profiles of Clusters
Example: Profiles of Clusters
Notice: the data were clustered based on the InfantDeath rate only!
Example: Standardization of Data
Analysis II: data standardized prior to clustering (variables divided by their standard deviations).
Result: 3 clusters (with 35, 46, 16 observations).
Data clustered based on InfantDeath and Death.
Comparing Analysis II with Analysis I, observe that variables with the largest variance have the largest influence on the results of clustering.
Example: Profiles of Clusters (Analysis II)
Methods of Clustering
Non-hierarchical methods:
- K-means clustering: non-deterministic, O(n), where n is the number of observations
Hierarchical methods:
- Agglomerative (join small clusters) or divisive (split big clusters)
- Deterministic, O(n^2) to O(n^3), depending on the clustering method (i.e., the definition of inter-cluster distance)
Methods of Clustering: Remarks
- Clustering large datasets: use K-means. If results of hierarchical clustering are needed, first use K-means to obtain e.g. 50 clusters, then run hierarchical clustering on the K-means results.
- Consensus clustering: discover real clusters in data by analyzing the stability of results with noise injected.
K-means Algorithm
1. Select k points (centroids of the initial clusters; selected randomly)
2. Assign each observation to the nearest centroid (nearest cluster)
3. For each cluster, find the new centroid
4. Repeat steps 2 and 3 until no change occurs in the cluster assignments
K-means Algorithm
Result: k separate clusters.
The algorithm requires that the correct number of clusters k is specified in advance (a difficult problem: how to know the real number of clusters in the data).
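The K-means steps above can be sketched in Python (a minimal NumPy sketch; the function name and parameters are illustrative, not from the lecture):

```python
import numpy as np

def kmeans(X, k, max_iter=100, seed=0):
    """Minimal k-means following the steps on the slide."""
    X = np.asarray(X, dtype=float)
    rng = np.random.default_rng(seed)
    # Step 1: select k observations as the initial centroids (randomly)
    centroids = X[rng.choice(len(X), size=k, replace=False)].copy()
    labels = np.full(len(X), -1)
    for _ in range(max_iter):
        # Step 2: assign each observation to the nearest centroid
        dists = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        new_labels = dists.argmin(axis=1)
        # Step 4: stop when no change occurs in the cluster assignments
        if np.array_equal(new_labels, labels):
            break
        labels = new_labels
        # Step 3: for each cluster, find the new centroid (the cluster mean)
        for j in range(k):
            if np.any(labels == j):
                centroids[j] = X[labels == j].mean(axis=0)
    return labels, centroids

# two well-separated blobs; k-means should recover them
rng = np.random.default_rng(1)
X = np.vstack([rng.normal(0, 0.1, (50, 2)), rng.normal(10, 0.1, (50, 2))])
labels, centroids = kmeans(X, k=2)
```

Note that the result depends on the random initial centroids (the non-determinism mentioned earlier), which is why the algorithm is often restarted several times.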
Hierarchical Clustering: Notation
- x_i: observations, i = 1..n
- C_k: clusters
- G: current number of clusters
- D_KL: distance between clusters C_K and C_L
The between-cluster distance D_KL is the linkage function (various definitions are available; the results of clustering depend on D_KL).
Hierarchical Clustering Algorithm (agglomerative hierarchical clustering)
1. C_k = {x_k}, k = 1..n; G = n
2. Find K, L such that D_KL = min D_IJ, 1 <= I, J <= G, I != J
3. Replace clusters C_K and C_L by the cluster C_K ∪ C_L; G = G - 1
4. Repeat steps 2 and 3 while G > 1
Result: a hierarchy of clusters (dendrogram).
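The agglomerative loop above can be written down directly (a didactic Python sketch, not an efficient implementation; single linkage is used here as one concrete choice of D_KL):

```python
import numpy as np

def single_linkage(X, I, J):
    # D_KL: minimum pairwise Euclidean distance between the two clusters
    return min(np.linalg.norm(X[i] - X[j]) for i in I for j in J)

def agglomerative(X, dist_kl):
    """Agglomerative clustering exactly as in steps 1-4 of the slide:
    start from n singleton clusters, repeatedly merge the pair (K, L)
    with minimal D_KL, until one cluster remains.
    Returns the merge history: a list of (cluster_K, cluster_L, D_KL)."""
    clusters = [[i] for i in range(len(X))]      # step 1: C_k = {x_k}, G = n
    history = []
    while len(clusters) > 1:                     # step 4: repeat while G > 1
        # step 2: find K, L such that D_KL is minimal
        best = None
        for a in range(len(clusters)):
            for b in range(a + 1, len(clusters)):
                d = dist_kl(X, clusters[a], clusters[b])
                if best is None or d < best[2]:
                    best = (a, b, d)
        a, b, d = best
        history.append((list(clusters[a]), list(clusters[b]), d))
        # step 3: replace C_K and C_L by C_K ∪ C_L; G decreases by 1
        clusters[a] = clusters[a] + clusters[b]
        del clusters[b]
    return history

X = np.array([[0.0, 0.0], [0.0, 1.0], [5.0, 5.0], [5.0, 6.5]])
history = agglomerative(X, single_linkage)
```

The merge history is exactly the information a dendrogram displays: which clusters were joined, and at what distance.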
Hierarchy of Clusters - Dendrogram
Definitions of Distance Between Clusters
Different definitions of the distance between clusters:
- Average linkage
- Single linkage
- Complete linkage
- Density linkage
- Ward's minimum variance method
(The SAS CLUSTER procedure accepts 11 different definitions of inter-cluster distance.)
Average Linkage
Notation:
- x_i: observations, i = 1..n
- d(x, y): distance between observations (Euclidean distance assumed from now on)
- C_k: clusters
- N_K: number of observations in cluster C_K
- D_KL: distance between clusters C_K and C_L
- mean_CK: mean observation in cluster C_K
- W_K = Σ_{x_i ∈ C_K} ||x_i − mean_CK||²: variance in the cluster
Average linkage: D_KL = (1 / (N_K N_L)) Σ_{x_i ∈ C_K} Σ_{x_j ∈ C_L} d(x_i, x_j)
- Tends to join clusters with small variance
- Resulting clusters tend to have similar variance
Complete Linkage
(Notation as before.)
Complete linkage: D_KL = max_{x_i ∈ C_K, x_j ∈ C_L} d(x_i, x_j)
- Resulting clusters tend to have similar diameter
Single Linkage
(Notation as before.)
Single linkage: D_KL = min_{x_i ∈ C_K, x_j ∈ C_L} d(x_i, x_j)
- Tends to produce elongated clusters, irregular in shape
Ward's Minimum Variance Method
(Notation as before.)
B_KL = W_M − W_K − W_L, where C_M = C_K ∪ C_L
Ward's minimum variance method joins the pair of clusters with the smallest B_KL (the increase in within-cluster variance caused by the merge).
- Tends to join small clusters
- Tends to produce clusters with similar numbers of observations
Density Linkage
Notation:
- x_i: observations, i = 1..n
- d(x, y): distance between observations
- r: a fixed constant
- f(x): the proportion of observations within the sphere of radius r centered at x, divided by the volume of the sphere (a measure of the density of points near observation x)
Density linkage: single linkage is performed using a modified distance d*, defined (as in the SAS uniform-kernel variant) by d*(x_i, x_j) = (1/2)(1/f(x_i) + 1/f(x_j)) if d(x_i, x_j) <= r, and d*(x_i, x_j) = ∞ otherwise.
- Capable of discovering clusters of irregular shape
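The standard linkage functions above are available in SciPy (density linkage is not; the SAS CLUSTER procedure or dedicated packages would be needed for that). A small Python sketch comparing the methods on the same data, with illustrative variable names:

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

# two groups with very different spreads, to contrast linkage behaviour
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0, 0.2, (30, 2)), rng.normal(5, 1.0, (30, 2))])

cuts = {}
for method in ("single", "complete", "average", "ward"):
    Z = linkage(X, method=method)                # full merge hierarchy
    # cut the dendrogram so that exactly 2 clusters remain
    cuts[method] = fcluster(Z, t=2, criterion="maxclust")
```

Plotting the cuts for each method on such data is a quick way to see the tendencies listed on the slides (e.g., single linkage chaining, complete linkage preferring compact clusters of similar diameter).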
Example: Average Linkage (elongated clusters in data)
Example: K-means (elongated clusters in data)
Example: Density Linkage (elongated clusters in data)
Example: K-means (nonconvex clusters in data)
Example: Centroid Linkage (nonconvex clusters in data)
Example: Density Linkage (nonconvex clusters in data)
Example: True Clusters (clusters of unequal size)
Example: K-means (clusters of unequal size)
Example: Ward's Method (clusters of unequal size)
Example: Average Linkage (clusters of unequal size)
Example: Centroid Linkage (clusters of unequal size)
Example: Single Linkage (clusters of unequal size)
Example: Well-Separated Data (any method will work)
Example: Poorly Separated Data (true clusters)
Example: Poorly Separated Data (method: K-means)
Example: Poorly Separated Data (method: Ward's)
Clustering Methods: Final Remarks
Standardization of variables prior to clustering:
- Often necessary; otherwise variables with large variance tend to have a large influence on the clustering
- Often the standardized measurement z_ij is computed as the z-score: z_ij = (x_ij − m_j) / s_j, where x_ij is the original measurement of observation i on variable j, m_j is the mean value of variable j, and s_j is the mean absolute deviation of variable j (or its standard deviation)
- Other ideas: divide each variable by its range, its maximum value, or its standard deviation
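The z-score formula is a one-liner in Python (a minimal sketch using the standard deviation as the scale s_j; the function name is illustrative):

```python
import numpy as np

def standardize(X):
    """z-score each variable: z_ij = (x_ij - m_j) / s_j, with m_j the
    column mean and s_j the column standard deviation."""
    X = np.asarray(X, dtype=float)
    return (X - X.mean(axis=0)) / X.std(axis=0)

# the second variable is on a much larger scale and would dominate raw
# Euclidean distances; after standardization both contribute comparably
X = np.array([[1.0, 100.0], [2.0, 300.0], [3.0, 200.0], [4.0, 400.0]])
Z = standardize(X)
```

After this transformation every variable has mean 0 and standard deviation 1, which is exactly what the introductory country example needed in Analysis II.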
Clustering Methods: Final Remarks
The number of clusters:
- There is no satisfactory theory to determine the right number of clusters in the data
- Various criteria can be observed to help determine the right number of clusters, e.g. criteria based on the variance accounted for by the clusters: R² = 1 − P_G / T, or the semipartial R² = B_KL / T, where T is the total variance of the observations, P_G = Σ W_K over the G clusters, and B_KL = W_M − W_K − W_L with C_M = C_K ∪ C_L
- Cubic Clustering Criterion (CCC)
- Data visualization is often useful for determining the number of clusters: a scatterplot for 2-3 dimensional data; in high dimensions, apply a PCA transformation (or similar) and visualize the data in the 2-3 dimensional space of the first principal components
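The R² criterion is easy to compute for any clustering (a Python sketch; the function name and the synthetic data are illustrative):

```python
import numpy as np

def r_squared(X, labels):
    """R^2 = 1 - P_G / T, where T is the total sum of squares around the
    grand mean and P_G is the sum of within-cluster sums of squares W_K."""
    X = np.asarray(X, dtype=float)
    T = ((X - X.mean(axis=0)) ** 2).sum()
    P_G = sum(((X[labels == k] - X[labels == k].mean(axis=0)) ** 2).sum()
              for k in np.unique(labels))
    return 1.0 - P_G / T

rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0, 0.3, (40, 2)), rng.normal(5, 0.3, (40, 2))])
labels = np.repeat([0, 1], 40)
r2 = r_squared(X, labels)
```

For a single all-inclusive cluster P_G = T, so R² = 0; for well-separated clusters that are split correctly, R² approaches 1. Plotting R² against the number of clusters and looking for an "elbow" is a common use of this criterion.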
Example: R², Semipartial R²
Example: Number of Clusters, Useful Checks
- PST2: 3, 6, or 9 (one before a peak in the value)
- PSF: 9 (peak in the value)
- CCC: 18 (CCC around 3)
Kohonen VQ (Vector Quantization)
Algorithm similar to K-means. Idea of the VQ algorithm:
1. Select k points (initial cluster centroids)
2. For observation x_i, find the nearest centroid (the winning seed), denoted S_n
3. Modify S_n according to the formula S_n = S_n (1 − L) + x_i L, where L is the learning constant (decreasing during the learning process)
4. Repeat steps 2 and 3 over all training observations
5. Repeat steps 2-4 for a given number of iterations
VQ: MacQueen's Method
For L = const, the VQ algorithm does not converge.
MacQueen's method: the learning constant L is the reciprocal of the number of observations N_n in the cluster associated with the winning seed S_n.
This algorithm converges.
Kohonen SOM (Self-Organizing Maps)
1. Select k initial points (cluster centroids) and represent them on a 2D map
2. For observation x_i, find the winning seed S_n
3. Modify all centroids: S_j = S_j (1 − K(j,n) L) + x_i K(j,n) L, where L is the learning constant (decreasing during training) and K(j,n) is a function decreasing with increasing distance on the 2D map between the centroids S_j and S_n (K(j,j) = 1)
4. Repeat steps 2 and 3 over all training observations
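The SOM update can be sketched in Python (a minimal sketch; the Gaussian neighborhood function, the linear learning-rate decay, and all parameter names are illustrative choices, not prescribed by the lecture):

```python
import numpy as np

def som(X, grid_w, grid_h, n_epochs=20, L0=0.5, sigma=1.0, seed=0):
    """Minimal SOM on a grid_w x grid_h map. For each observation the
    winning seed S_n is found, and every seed S_j is moved by
    S_j <- S_j (1 - K(j,n) L) + x_i K(j,n) L, with K(j,n) a Gaussian of
    the grid distance between units j and n (so K(n,n) = 1)."""
    X = np.asarray(X, dtype=float)
    rng = np.random.default_rng(seed)
    k = grid_w * grid_h
    # 2D grid coordinates of each map unit
    coords = np.array([(i % grid_w, i // grid_w) for i in range(k)], float)
    seeds = X[rng.choice(len(X), size=k, replace=False)].copy()
    for epoch in range(n_epochs):
        L = L0 * (1 - epoch / n_epochs)          # decreasing learning constant
        for x in X:
            n = np.linalg.norm(seeds - x, axis=1).argmin()   # winning seed
            grid_d = np.linalg.norm(coords - coords[n], axis=1)
            K = np.exp(-(grid_d ** 2) / (2 * sigma ** 2))    # K(j, n)
            # S_j (1 - K L) + x K L, rewritten as S_j + (x - S_j) K L
            seeds += (x - seeds) * (K * L)[:, None]
    return seeds, coords

rng = np.random.default_rng(2)
X = np.vstack([rng.normal(0, 0.5, (50, 2)), rng.normal(4, 0.5, (50, 2))])
seeds, coords = som(X, grid_w=3, grid_h=3)
```

Because neighboring units on the map are pulled together by K(j,n), nearby units end up with similar codebook vectors: the map preserves topology, which is what makes the 2D visualization meaningful.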
Example: SOM-based clustering of wine data (R language, dataset wines, package kohonen)
R implementation of the SOM algorithm: function som() (package kohonen).
Results: structure wine.som; important members:
- wine.som$codes         # codebook vectors
- wine.som$unit.classif  # winning units for all data points
Codebook vectors represent the clusters created at each 2D grid element (the attributes of a codebook vector are the mean values of the respective attributes of the cluster elements).
Results: assignment of observations (individual wines) to the 2D grid.
The seeds (codebook vectors) can be grouped further, e.g. with hierarchical clustering (the hclust function).
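The same two-stage idea (SOM first, then hierarchical clustering of the codebook vectors) can be sketched in Python with SciPy. The codes array below is a synthetic stand-in for wine.som$codes, and unit_classif is a hypothetical winning-unit vector in the role of wine.som$unit.classif (0-based here):

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

# synthetic stand-in for the SOM codebook vectors: one row per map unit
rng = np.random.default_rng(0)
codes = np.vstack([rng.normal(0, 0.2, (8, 3)), rng.normal(3, 0.2, (8, 3))])

Z = linkage(codes, method="ward")                  # hierarchy over map units
unit_cluster = fcluster(Z, t=2, criterion="maxclust")   # cluster id per unit

# each observation inherits the cluster of its winning map unit
# (hypothetical winning-unit indices, one per observation)
unit_classif = np.array([0, 3, 9, 12, 5, 15])
obs_cluster = unit_cluster[unit_classif]
```

This mirrors the K-means-then-hierarchical strategy recommended earlier for large datasets: the SOM reduces n observations to a small number of seeds, and the expensive hierarchical step runs only on those seeds.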
Example: SOM in R