Cluster Analysis Summer School on Geocomputation, 27 June – 2 July 2011, Vysoké Pole Lecture delivered by: doc. Mgr. Radoslav Harman, PhD., Faculty of Mathematics, Physics and Informatics, Comenius University, Bratislava, Slovakia

Structure of cluster analyses Cluster analyses split into nonhierarchical methods (K-means, K-medoids, model-based) and hierarchical methods (agglomerative, divisive).

Nonhierarchical clustering Finds a decomposition of the objects 1,...,n into k disjoint clusters C_1,...,C_k of similar objects: C_1 ∪ ... ∪ C_k = {1,...,n}, C_i ∩ C_j = ∅ for i ≠ j. The objects are (mostly) characterized by vectors of features x_1,...,x_n ∈ ℝ^p. Example (p=2, k=3, n=9): C_1 = {1,3,7,9}, C_2 = {2,5,8}, C_3 = {4,6}, with |C_1| = 4, |C_2| = 3, |C_3| = 2. How do we understand decomposition into clusters of similar objects? How is this decomposition calculated? Many different approaches: k-means, k-medoids, model-based clustering,...

K-means clustering The target function to be minimized with respect to the selection of clusters: Σ_{i=1}^{k} Σ_{r ∈ C_i} ρ²(x_r, c_i), where c_i = (1/|C_i|) Σ_{r ∈ C_i} x_r is the centroid of C_i and ρ(x, y) = √( Σ_{t=1}^{p} (x(t) − y(t))² ) is the Euclidean distance; x(t), y(t) are the t-th components of the vectors x, y. (Illustration: a bad vs. a good clustering with respect to this target function.)
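
As a worked illustration of the target function above, the following short R sketch computes it for a given clustering; the function name kmeans_objective and its inputs are hypothetical, not part of the lecture.

    # A minimal sketch of the k-means target function.
    # X: an n x p matrix of feature vectors; cluster: labels 1..k, one per row of X.
    kmeans_objective <- function(X, cluster) {
      sum(sapply(unique(cluster), function(i) {
        Xi <- X[cluster == i, , drop = FALSE]
        ci <- colMeans(Xi)              # centroid of cluster C_i
        sum(sweep(Xi, 2, ci)^2)         # sum of squared Euclidean distances to c_i
      }))
    }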

K-means clustering Finding the clustering that minimizes the k-means target function is a computationally difficult problem. There are, however, many efficient heuristics that find a good, although not always optimal, solution. Example: Lloyd's algorithm. Create a random initial clustering C_1,...,C_k. Until a prescribed maximum number of iterations is reached, or no reassignment of objects occurs, do: calculate the centroids c_1,...,c_k of the clusters; for every i=1,...,k, form the new cluster C_i from all the points that are closer to c_i than to any other centroid. (A sketch of this procedure in R follows below.)
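
A minimal base-R sketch of Lloyd's algorithm as described above; the function name lloyd_kmeans is hypothetical, empty clusters are not handled, and in practice stats::kmeans (shown later) should be used instead.

    # A minimal sketch of Lloyd's algorithm (illustrative only).
    lloyd_kmeans <- function(X, k, max_iter = 100) {
      n <- nrow(X)
      cluster <- sample(rep_len(1:k, n))             # random initial clustering
      for (iter in 1:max_iter) {
        # centroids of the current clusters
        centers <- t(sapply(1:k, function(i) colMeans(X[cluster == i, , drop = FALSE])))
        # squared Euclidean distance of every point to every centroid
        d2 <- sapply(1:k, function(i) rowSums(sweep(X, 2, centers[i, ])^2))
        new_cluster <- max.col(-d2)                  # index of the closest centroid
        if (all(new_cluster == cluster)) break       # no reassignment: stop
        cluster <- new_cluster
      }
      list(cluster = cluster, centers = centers)
    }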

Illustration of the k-means algorithm (p=2, k=3, n=11), one figure per step: choose an initial clustering → calculate the centroids of the clusters → assign the points to the closest centroids → create the new clustering → calculate the new centroids → assign the points to the closest centroids → create the new clustering → ... → the clustering is the same as in the previous step, therefore STOP.

Properties of the k-means algorithm Advantages: Simple to understand and implement. Fast and convergent in a finite number of steps. Disadvantages: Different initial clusterings can lead to different final clusterings; it is thus advisable to run the procedure several times with different (random) initial clusterings. The resulting clustering depends on the units of measurement; if the variables are of a different nature or differ greatly in magnitude, it is advisable to standardize them. Not suitable for finding clusters with nonconvex shapes. The objects must be described by real (Euclidean) vectors so that centroids and distances to centroids can be computed; a matrix of pairwise distances or dissimilarities alone is not enough.

Computational issues of k-means Complexity: linear in the number of objects, provided that we bound the number of iterations. In R (library stats): kmeans(x, centers, iter.max, nstart, algorithm), where x is a data frame of real feature vectors, centers is the number of clusters, iter.max is the maximum number of iterations, nstart is the number of restarts, and algorithm is the method used ("Hartigan-Wong", "Lloyd", "Forgy", "MacQueen").
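
A hedged usage sketch of kmeans() on simulated two-dimensional data; the simulated data and the parameter values are illustrative, not from the lecture.

    set.seed(1)
    X <- rbind(matrix(rnorm(100, mean = 0), ncol = 2),
               matrix(rnorm(100, mean = 4), ncol = 2))   # two well-separated groups
    km <- kmeans(X, centers = 2, iter.max = 50, nstart = 20, algorithm = "Lloyd")
    km$cluster        # cluster assignment of each object
    km$centers        # the centroids
    km$tot.withinss   # value of the k-means target function
    plot(X, col = km$cluster, pch = 19)
    points(km$centers, pch = 4, cex = 2)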

The elbow method Selecting k, the number of clusters, is frequently a problem. It is often done by graphical heuristics, such as the elbow method: plot α(k) = Σ_{i=1}^{k} Σ_{r ∈ C_i^(k)} ρ²(x_r, c_i^(k)) against k, where C_1^(k),...,C_k^(k) is the optimal clustering obtained by assuming k clusters and c_1^(k),...,c_k^(k) are the corresponding centroids, and choose the value of k at the "elbow" of the curve.
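
A minimal sketch of the elbow heuristic in R, reusing the simulated data X from the kmeans example above; the range 1:10 for k is an arbitrary illustrative choice.

    alpha <- sapply(1:10, function(k) kmeans(X, centers = k, nstart = 20)$tot.withinss)
    plot(1:10, alpha, type = "b",
         xlab = "k (number of clusters)", ylab = "alpha(k)")   # look for the bend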

K-medoids clustering Instead of centroids it uses medoids, the most central objects (the best "representatives") of each cluster. This allows using only dissimilarities d(r,s) of all pairs (r,s) of objects. The aim is to find the clusters C_1,...,C_k that minimize the target function Σ_{i=1}^{k} Σ_{r ∈ C_i} d(r, m_i), where for each i the medoid m_i minimizes Σ_{r ∈ C_i} d(r, m_i). (Illustration: a bad vs. a good choice of medoids.)

K-medoids algorithm As with k-means, finding the clustering that minimizes the k-medoids target function is a difficult problem, and many efficient heuristics exist that find a good, although not always optimal, solution. Example: Partitioning Around Medoids (PAM). Randomly select k objects m_1,...,m_k as initial medoids. Until the maximum number of iterations is reached or no improvement of the target function has been found, do: calculate the clustering based on m_1,...,m_k by associating each point with the nearest medoid, and calculate the value of the target function; for all pairs (m_i, x_s), where x_s is a non-medoid point, try to improve the target function by taking x_s to be a new medoid and m_i to be a non-medoid point.

Properties of the k-medoids algorithm Advantages: Simple to understand and implement. Fast and convergent in a finite number of steps. Usually less sensitive to outliers than k-means. Allows using general dissimilarities of objects. Disadvantages: Different initial sets of medoids can lead to different final clusterings; it is thus advisable to run the procedure several times with different initial sets of medoids. The resulting clustering depends on the units of measurement; if the variables are of a different nature or differ greatly in magnitude, it is advisable to standardize them.

Computational issues of k-medoids Complexity: at least quadratic, depending on the actual implementation. In R (library cluster): pam(x, k, diss, metric, medoids, stand, ...), where x is a data frame of real feature vectors or a matrix of dissimilarities, k is the number of clusters, diss indicates whether x is a dissimilarity matrix (TRUE, FALSE), metric is the metric used (euclidean, manhattan), medoids is a vector of initial medoids, and stand indicates whether to standardize the data (TRUE, FALSE).
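
A hedged usage sketch of pam() from the cluster package, building a dissimilarity matrix from the simulated data X used above (illustrative only).

    library(cluster)
    d  <- dist(X, method = "manhattan")   # any dissimilarity matrix can be used
    pm <- pam(d, k = 2, diss = TRUE)
    pm$id.med       # indices of the medoid objects
    pm$clustering   # cluster assignment of each object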

The silhouette a(r) is the average dissimilarity of the object r to the objects of the same cluster; b(r) is the average dissimilarity of the object r to the objects of the neighboring cluster (the nearest of the other clusters). The silhouette of the object r, a measure of how well r is clustered, is s(r) = (b(r) − a(r)) / max(b(r), a(r)) ∈ [−1, 1]. If s(r) is close to 1, the object r is well clustered; close to 0, the object r is at the boundary of clusters; less than 0, the object r is probably placed in a wrong cluster.

The silhouette
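
A hedged sketch of computing and plotting silhouettes with the cluster package, reusing the k-means result km and the data X from the earlier example.

    library(cluster)
    sil <- silhouette(km$cluster, dist(X))
    summary(sil)    # average silhouette width per cluster and overall
    plot(sil)       # silhouette plot: one bar s(r) per object, grouped by cluster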

Model-based clustering We assume that the feature vectors of the objects from the j-th cluster follow a multivariate normal distribution N_p(µ_j, Σ_j). The clustering is calculated by maximizing a (mathematically complicated) likelihood function. Idea: find the most probable (most "likely") assignment of objects to clusters and, simultaneously, the most likely positions of the cluster centers µ_j and covariance matrices Σ_j representing the shapes of the clusters. (Illustration: an unlikely vs. a likely assignment.)

Model-based clustering Disadvantages compared to k-means and k-medoids: More difficult to understand properly. Computationally more complex to solve. Cannot work with dissimilarities alone (a disadvantage compared to k-medoids). Advantages over k-means and k-medoids: Can find elliptic clusters with very high eccentricity, while k-means and k-medoids tend to form spherical clusters. The result does not depend on the scale of the variables (no standardization is necessary). Can find hidden clusters inside other, more dispersed clusters. Allows formal testing of the most appropriate number of clusters.

Computational issues of model-based clustering Complexity: computationally a very hard problem, solved iteratively. We can use the so-called EM algorithm, or algorithms of stochastic optimization. Modern computers can deal with problems with hundreds of variables and thousands of objects in a reasonable time. In R (library mclust): Mclust(data, modelNames, ...), where data is a data frame of real feature vectors and modelNames specifies the covariance models used (EII, VII, EEI, VEI, EVI, VVI, EEE, EEV, VEV, VVV).
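
A hedged usage sketch of Mclust() on the simulated data X from above; the argument G (the numbers of clusters to compare via BIC) and the chosen model names are illustrative.

    library(mclust)
    fit <- Mclust(X, G = 1:5, modelNames = c("EII", "VVI", "VVV"))
    summary(fit)                 # selected model, number of clusters, log-likelihood
    fit$classification           # cluster assignment of each object
    plot(fit, what = "classification")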

Comparison of nonhierarchical clustering methods on artificial 2D data: k-means vs. model-based clustering (a series of figures).

Comparison of nonhierarchical clustering methods on the Landsat data: p=36-dimensional measurements of color intensity for n=4435 areas; k-means vs. model-based clustering.

Hierarchical clustering Creates a hierarchy of objects represented by a tree of similarities called a dendrogram. Most appropriate for clustering objects that were formed by a process of merging, splitting, or gradual variation, such as countries, animals, commercial products, languages, fields of science, etc. Advantages: For most methods it is enough to have the dissimilarity matrix D of the objects, where D_rs = d(r,s) is the dissimilarity between objects r and s. Does not require knowledge of the number of clusters. Disadvantages: Depends on the scale of the data. Computationally complex for large datasets. Different methods sometimes lead to very different dendrograms.

Example of a dendrogram (vertical axis: height; horizontal axis: objects). The dendrogram is created either bottom-up (agglomerative, or ascending, clustering), or top-down (divisive, or descending, clustering).

Agglomerative clustering Algorithm: Create the set of clusters formed by the individual objects (each object forms its own cluster). While there is more than one top-level cluster, do: find the two top-level clusters with the smallest mutual distance and join them into a new top-level cluster. Different measures of the distance between clusters give different variants: single linkage, complete linkage, average linkage, Ward's distance.

Single linkage in agglomerative clustering The distance of two clusters is the dissimilarity of the least dissimilar objects of the clusters: D_S(C_i, C_j) = min_{r ∈ C_i, s ∈ C_j} d(r, s). (Illustration: the clustering and the resulting dendrogram.)

Average linkage in agglomerative clustering The distance of two clusters is the average of the mutual dissimilarities of the objects in the clusters: D_A(C_i, C_j) = (1 / (|C_i| |C_j|)) Σ_{r ∈ C_i} Σ_{s ∈ C_j} d(r, s). (Illustration: the clustering and the resulting dendrogram.)

Other methods of measuring the distance of clusters in agglomerative clustering Complete linkage: the distance of the clusters is the dissimilarity of the most dissimilar objects: D_C(C_i, C_j) = max_{r ∈ C_i, s ∈ C_j} d(r, s). Ward's distance: requires that for each object r we have the real vector of features x_r (the matrix of dissimilarities is not enough). It is the difference between the extension of the two clusters combined and the sum of the extensions of the two individual clusters: D_W(C_i, C_j) = Σ_{r ∈ C_i ∪ C_j} ρ²(x_r, c_ij) − ( Σ_{r ∈ C_i} ρ²(x_r, c_i) + Σ_{s ∈ C_j} ρ²(x_s, c_j) ), where c_i, c_j, c_ij are the centroids of C_i, C_j, C_i ∪ C_j and ρ is the (Euclidean) distance between vectors.

Computational issues of agglomerative clustering Complexity: at least quadratic in the number of objects (depending on the implementation). In R (library cluster): agnes(x, diss, metric, stand, method, ...), where x is a data frame of real feature vectors or a matrix of dissimilarities, diss indicates whether x is a dissimilarity matrix (TRUE, FALSE), metric is the metric used (euclidean, manhattan), stand indicates whether to standardize the data (TRUE, FALSE), and method is the method of measuring the distance of clusters (single, average, complete, ward).
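
A hedged usage sketch of agnes() on the simulated data X from above; pltree() draws the dendrogram and cutree() extracts a flat clustering (the choice of two clusters is illustrative).

    library(cluster)
    ag <- agnes(X, diss = FALSE, metric = "euclidean", stand = TRUE, method = "average")
    pltree(ag, main = "Agglomerative clustering, average linkage")   # dendrogram
    cutree(as.hclust(ag), k = 2)   # cut the tree into 2 clusters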

Divisive clustering Algorithm: Form a single cluster consisting of all objects. For each bottom-level cluster containing at least two objects: find the most eccentric object, which initiates a splinter group (the object with the maximal average dissimilarity to the other objects); find all objects in the cluster that are more similar to the most eccentric object than to the rest of the objects (for instance, the objects whose average dissimilarity to the eccentric object is lower than their average dissimilarity to the rest of the objects); divide the cluster into two subclusters accordingly. Continue until all bottom-level clusters consist of a single object. (A sketch of one splitting step in R follows below.)
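
A minimal base-R sketch of one splitting step as described above, using the simplified criterion from the slide (not the full iterative refinement used by DIANA); the function name split_cluster and its interface are hypothetical.

    # One divisive splitting step on a dissimilarity matrix d (illustrative only).
    split_cluster <- function(d) {
      d <- as.matrix(d)
      n <- nrow(d)
      avg_to_others <- rowSums(d) / (n - 1)     # average dissimilarity to the other objects
      seed <- which.max(avg_to_others)          # the most eccentric object
      splinter <- seed
      for (r in setdiff(seq_len(n), seed)) {
        to_seed <- d[r, seed]
        to_rest <- mean(d[r, setdiff(seq_len(n), c(r, seed))])
        if (to_seed < to_rest) splinter <- c(splinter, r)   # joins the splinter group
      }
      list(splinter = sort(splinter), rest = setdiff(seq_len(n), splinter))
    }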

Illustration of divisive clustering: a sequence of figures showing the successive splits of the clusters, ending with the resulting dendrogram.

Computational issues of divisive clustering Complexity: at least linear in the number of objects (depending on the implementation and on the kind of splitting subroutine). In R (library cluster): diana(x, diss, metric, stand, ...), where x is a data frame of real feature vectors or a matrix of dissimilarities, diss indicates whether x is a dissimilarity matrix (TRUE, FALSE), metric is the metric used (euclidean, manhattan), and stand indicates whether to standardize the data (TRUE, FALSE).
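
A hedged usage sketch of diana() on the simulated data X from above.

    library(cluster)
    di <- diana(X, diss = FALSE, metric = "euclidean", stand = TRUE)
    pltree(di, main = "Divisive clustering (DIANA)")   # dendrogram
    di$dc   # divisive coefficient: a rough measure of the clustering structure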

Comparison of hierarchical clustering methods n=25 objects: European countries (Albania, Austria, Belgium, Bulgaria, Czechoslovakia, Denmark, East Germany, Finland, France, Greece, Hungary, Ireland, Italy, Netherlands, Norway, Poland, Portugal, Romania, Spain, Sweden, Switzerland, UK, USSR, West Germany, Yugoslavia). p=9-dimensional vectors of features: consumption of various kinds of food (Red Meat, White Meat, Eggs, Milk, Fish, Cereals, Starchy foods, Nuts, Fruits/Vegetables).
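
A hedged sketch of how such a comparison could be produced in R, assuming a data frame food containing the nine consumption variables for the 25 countries (the name food is hypothetical; the dataset itself is not included here).

    library(cluster)
    par(mfrow = c(1, 3))
    for (m in c("single", "complete", "average")) {
      pltree(agnes(food, stand = TRUE, method = m),
             main = paste(m, "linkage"))   # one dendrogram per linkage
    }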

Agglomerative - single linkage

Agglomerative - complete linkage

Agglomerative - average linkage

Divisive clustering

Thank you for your attention