Clustering. Chapter 10 in Introduction to statistical learning


Clustering (Chapter 10 in Introduction to Statistical Learning)

Clustering
- Clustering is the art of finding groups in data (Kaufman and Rousseeuw, 1990).
- What is a cluster? A group of objects separated from other clusters.

Digression: means and medians
The mean is the minimizer of $\arg\min_y \sum_{x \in D} \|x - y\|^2$. Using $\arg\min_y \sum_{x \in D} \|x - y\|$ instead gives rise to the geometric median, which is more robust to outliers. Issue: no closed-form solution.
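Since there is no closed form, the geometric median is usually found iteratively. A minimal sketch of Weiszfeld's algorithm (not from the slides; the tolerance and the handling of points that coincide with the current estimate are simplifications):

```python
import numpy as np

def geometric_median(X, tol=1e-7, max_iter=200):
    """Weiszfeld's algorithm: an iteratively re-weighted mean.

    X: (n, p) array of points. Returns an approximate geometric median.
    """
    y = X.mean(axis=0)                      # start from the ordinary mean
    for _ in range(max_iter):
        d = np.linalg.norm(X - y, axis=1)
        d = np.maximum(d, 1e-12)            # avoid division by zero at data points
        w = 1.0 / d
        y_new = (w[:, None] * X).sum(axis=0) / w.sum()
        if np.linalg.norm(y_new - y) < tol:
            break
        y = y_new
    return y
```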

Means, medians, medoids
It may be useful to restrict exemplars to be one of the given data points. Medoid: a representative object of a data set or a cluster whose average distance to all the objects in the cluster is minimal. How would we compute the medoid for a set of points? (See the sketch below.)
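One straightforward answer, sketched here in Python (a brute-force scan, not from the slides): compute all pairwise distances and return the point with the smallest average distance to the others.

```python
import numpy as np

def medoid(X):
    """Return the point of X with minimal average distance to all points.

    X: (n, p) array. Brute force: all O(n^2) pairwise Euclidean distances.
    """
    diffs = X[:, None, :] - X[None, :, :]      # (n, n, p) pairwise differences
    dists = np.linalg.norm(diffs, axis=2)      # (n, n) Euclidean distances
    return X[dists.mean(axis=1).argmin()]      # row with smallest mean distance
```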

Clustering
A clustering is a partition of the elements in your dataset into K subsets such that the following holds:
- Each observation belongs to a cluster: $C_1 \cup C_2 \cup \dots \cup C_K = \{1, \dots, n\}$.
- The clusters are non-overlapping: $C_k \cap C_{k'} = \emptyset$ for all $k \neq k'$; no observation belongs to more than one cluster.

A plausible objective
We would like a clustering to minimize the within-cluster variation. Let's assume it is measured via a function $W(C_k)$. So the overall objective is:
$$\min_{C_1, \dots, C_K} \sum_{k=1}^{K} W(C_k)$$
A possible definition of $W(C_k)$:
$$W(C_k) = \frac{1}{|C_k|} \sum_{i, j \in C_k} \|x_i - x_j\|^2$$

A plausible objective (continued)
With $\mu_k$ denoting the centroid of cluster $k$, the within-cluster variation can be rewritten as:
$$W(C_k) = \frac{1}{|C_k|} \sum_{i, j \in C_k} \|x_i - x_j\|^2 = 2 \sum_{i \in C_k} \|x_i - \mu_k\|^2$$
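The identity above (average of pairwise squared distances equals twice the sum of squared deviations from the centroid) is easy to sanity-check numerically; a small sketch with made-up data:

```python
import numpy as np

rng = np.random.default_rng(0)
Ck = rng.normal(size=(5, 2))                 # a small made-up cluster

# left-hand side: sum of all pairwise squared distances, divided by |C_k|
pairwise = ((Ck[:, None, :] - Ck[None, :, :]) ** 2).sum(axis=2)
lhs = pairwise.sum() / len(Ck)

# right-hand side: twice the sum of squared deviations from the centroid
mu = Ck.mean(axis=0)
rhs = 2 * ((Ck - mu) ** 2).sum()

print(np.isclose(lhs, rhs))                  # True
```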

The k-means objective function
Putting it all together:
$$\min_{C_1, \dots, C_K} \sum_{k=1}^{K} \sum_{i \in C_k} \|x_i - \mu_k\|^2$$
A problem with this problem: it is NP-hard to solve exactly.

The k-means algorithm
A heuristic algorithm for solving the problem:
Algorithm 10.1 (K-Means Clustering)
1. Randomly assign a number, from 1 to K, to each of the observations. These serve as initial cluster assignments for the observations.
2. Iterate until the cluster assignments stop changing:
   (a) For each of the K clusters, compute the cluster centroid. The kth cluster centroid is the vector of the p feature means for the observations in the kth cluster.
   (b) Assign each observation to the cluster whose centroid is closest (where closest is defined using Euclidean distance).
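A minimal NumPy sketch of Algorithm 10.1. The random seed, the empty-cluster handling, and the iteration cap are simplifications not prescribed by the slides:

```python
import numpy as np

def kmeans(X, K, max_iter=100, seed=0):
    """Plain k-means following Algorithm 10.1: random initial assignments,
    then alternate (a) recompute centroids, (b) reassign to nearest centroid."""
    rng = np.random.default_rng(seed)
    labels = rng.integers(K, size=len(X))            # step 1: random assignment
    for _ in range(max_iter):
        # (a) centroid = vector of feature means for each cluster
        centroids = np.array([
            X[labels == k].mean(axis=0) if np.any(labels == k)
            else X[rng.integers(len(X))]             # re-seed an empty cluster
            for k in range(K)
        ])
        # (b) assign each observation to the closest centroid (Euclidean)
        dists = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        new_labels = dists.argmin(axis=1)
        if np.array_equal(new_labels, labels):       # stop when assignments are stable
            break
        labels = new_labels
    return labels, centroids
```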

Example
[Figure 10.6; panels: Data, Step 1, Iteration 1 Step 2a, Iteration 1 Step 2b, Iteration 2 Step 2a, Final Results. Caption: The progress of the K-means algorithm on the example of Figure 10.5 with K = 3. Top left: the observations are shown. Top center: in Step 1 of the algorithm, each observation is randomly assigned to a cluster. Top right: in Step 2(a), the cluster centroids are computed.]

Local minima
[Figure 10.7: K-means clustering performed six times on the data from Figure 10.5 with K = 3, each time with a different random assignment of the observations in Step 1 of the K-means algorithm. Above each plot is the value of the objective (10.11): 320.9, 235.8, 235.8, 235.8, 235.8, 310.9. Three different local optima were obtained.]

Running time?
What is the running time per iteration of Algorithm 10.1 (shown above)? Each iteration computes K centroids and the n x K observation-to-centroid distances, so an iteration costs on the order of nKp operations for n observations with p features.

Running time? (continued)
The algorithm typically converges very quickly, and in fact it is guaranteed to converge in a finite number of iterations: neither step can increase the objective, and there are only finitely many possible assignments.

Running time? (continued)
The algorithm can easily be kernelized: the distance computations in step 2(b) can be expressed entirely in terms of inner products between observations (see the expansion below).
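To make the kernelization claim concrete, here is the standard expansion of the squared distance from a point to a cluster centroid in feature space (not part of the slides). With a kernel $\kappa(x, x') = \langle \phi(x), \phi(x') \rangle$ and centroid $\mu_k = \frac{1}{|C_k|} \sum_{j \in C_k} \phi(x_j)$:
$$\|\phi(x) - \mu_k\|^2 = \kappa(x, x) - \frac{2}{|C_k|} \sum_{j \in C_k} \kappa(x, x_j) + \frac{1}{|C_k|^2} \sum_{j, l \in C_k} \kappa(x_j, x_l)$$
Every term involves only kernel evaluations, so the assignment step never needs explicit centroids.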

Dealing with local minima
Run the algorithm multiple times with different initializations and keep the solution with the lowest objective.
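A short sketch of the multiple-restart strategy using scikit-learn, which implements it via the n_init parameter (the data X and the choice of 10 restarts are placeholders):

```python
import numpy as np
from sklearn.cluster import KMeans

X = np.random.default_rng(0).normal(size=(300, 2))   # placeholder data

# n_init=10 runs k-means from 10 random initializations and keeps the
# solution with the lowest within-cluster sum of squares (inertia_).
km = KMeans(n_clusters=3, init="random", n_init=10, random_state=0).fit(X)
print(km.inertia_, km.labels_[:10])
```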

Initialization
A good initialization can lead to faster convergence to a better local optimum. The standard choice: K random data points. More sophisticated approaches:
- Cluster each of several subsamples of the data, then cluster the resulting cluster centers using K-means and use them as the initialization. [P.S. Bradley and Usama M. Fayyad. "Refining Initial Points for K-Means Clustering." Proceedings of the Fifteenth International Conference on Machine Learning, ICML '98.]
- K-means++: choose the first center randomly; each subsequent center is chosen with probability proportional to its squared distance to the closest center already chosen. The default in scikit-learn; a sketch follows below. [Arthur, D. and Vassilvitskii, S. (2007). "k-means++: the advantages of careful seeding." Proceedings of the Eighteenth Annual ACM-SIAM Symposium on Discrete Algorithms.]
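A minimal sketch of the k-means++ seeding rule described above (the seeding only; in practice it is followed by the usual k-means iterations):

```python
import numpy as np

def kmeans_pp_init(X, K, seed=0):
    """k-means++ seeding: first center uniformly at random, then each new
    center sampled with probability proportional to its squared distance
    to the nearest center chosen so far."""
    rng = np.random.default_rng(seed)
    centers = [X[rng.integers(len(X))]]
    for _ in range(K - 1):
        d2 = np.min(
            [np.sum((X - c) ** 2, axis=1) for c in centers], axis=0
        )                                     # squared distance to nearest center
        probs = d2 / d2.sum()
        centers.append(X[rng.choice(len(X), p=probs)])
    return np.array(centers)
```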

Related algorithms
- K-medoids
- Partitioning around medoids (PAM)

Sensitivity to scaling
[Figure, two panels. (Left) On this data 2-means detects the right clusters. (Right) After rescaling the y-axis, this configuration has a higher between-cluster scatter than the intended one.]
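Because of this sensitivity, features are often standardized before running k-means. A small scikit-learn sketch of that preprocessing step (the data X and K = 2 are placeholders); whether rescaling is appropriate depends on whether the original units are meaningful for the problem:

```python
import numpy as np
from sklearn.preprocessing import StandardScaler
from sklearn.cluster import KMeans

X = np.random.default_rng(1).normal(size=(200, 2)) * [1.0, 100.0]  # very different scales

X_std = StandardScaler().fit_transform(X)      # zero mean, unit variance per feature
labels = KMeans(n_clusters=2, random_state=0).fit_predict(X_std)
```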

Assumptions behind the model
K-means assumes spherical clusters. There are probabilistic extensions (e.g., Gaussian mixture models) that address this to some extent. It is probably the most widely used clustering algorithm because of its simplicity, speed, and ease of implementation.

Silhouettes
How do we know we have a good clustering? The silhouette coefficient is defined for each example:
$$s(x) = \frac{b(x) - a(x)}{\max(b(x), a(x))}$$
- a(x): the mean distance between the example and all other points in the same cluster.
- b(x): the mean distance between the example and all points in the next nearest cluster.
[Peter J. Rousseeuw (1987). "Silhouettes: a Graphical Aid to the Interpretation and Validation of Cluster Analysis." Journal of Computational and Applied Mathematics 20: 53-65.]

Silhouettes (continued)
To visualize the quality of a clustering, sort the values of s(x) and group them by cluster. [Two silhouette plots shown; figures generated using the scikit-learn silhouette method.]
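A brief scikit-learn sketch of computing per-example silhouette values and grouping them by cluster, as in the plots (the data and K are placeholders):

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_samples, silhouette_score

X = np.random.default_rng(2).normal(size=(300, 2))   # placeholder data
labels = KMeans(n_clusters=3, random_state=0).fit_predict(X)

s = silhouette_samples(X, labels)                     # s(x) for every example
print("mean silhouette:", silhouette_score(X, labels))
for k in range(3):
    vals = np.sort(s[labels == k])[::-1]              # sort within each cluster
    print(f"cluster {k}: n={len(vals)}, mean s={vals.mean():.3f}")
```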

Dendrograms
Definition: given a dataset D, a dendrogram is a binary tree with the elements of D at its leaves. An internal node of the tree represents the subset of elements in the leaves of the subtree rooted at that node.

Hierarchical clustering
Algorithm outline:
- Start with each data point in a separate cluster.
- At each step, merge the closest pair of clusters.

Hierarchical clustering (continued)
To merge the closest pair we need a measure of distance between clusters: a linkage function $L : 2^X \times 2^X \to \mathbb{R}$ calculates the distance between arbitrary subsets of the instance space, given a distance metric $\mathrm{Dis} : X \times X \to \mathbb{R}$.

Linkage functions
- Single linkage: smallest pairwise distance between elements from each cluster.
- Complete linkage: largest distance between elements from each cluster.
- Average linkage: the average distance between elements from each cluster.
- Centroid linkage: distance between cluster means.
A sketch of these definitions follows below.
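A compact sketch of the four linkage functions for two clusters represented as NumPy arrays (Euclidean distance plays the role of Dis; the helper names are illustrative, not from the slides):

```python
import numpy as np

def _pairwise(A, B):
    """All Euclidean distances between rows of A and rows of B."""
    return np.linalg.norm(A[:, None, :] - B[None, :, :], axis=2)

def single_linkage(A, B):   return _pairwise(A, B).min()
def complete_linkage(A, B): return _pairwise(A, B).max()
def average_linkage(A, B):  return _pairwise(A, B).mean()
def centroid_linkage(A, B): return np.linalg.norm(A.mean(axis=0) - B.mean(axis=0))
```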

Dendrograms revisited
Interpretation of the vertical dimension: the distance between the clusters when they were merged (the level associated with the cluster). The leaves have level 0. [Two figures: a scatter plot of the data over axes X1 and X2, and the corresponding dendrogram.]

Hierarchical clustering
Algorithm 10.2 (Hierarchical Clustering)
1. Begin with n observations and a measure (such as Euclidean distance) of all the $\binom{n}{2} = n(n-1)/2$ pairwise dissimilarities. Treat each observation as its own cluster.
2. For $i = n, n-1, \dots, 2$:
   (a) Examine all pairwise inter-cluster dissimilarities among the i clusters and identify the pair of clusters that are least dissimilar (that is, most similar). Fuse these two clusters. The dissimilarity between these two clusters indicates the height in the dendrogram at which the fusion should be placed.
   (b) Compute the new pairwise inter-cluster dissimilarities among the i - 1 remaining clusters.
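In practice this agglomerative procedure is usually run with scipy's hierarchical clustering routines; a short sketch (the data, the linkage choice, and the number of clusters at the cut are placeholders):

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, dendrogram, fcluster
from scipy.spatial.distance import pdist

X = np.random.default_rng(3).normal(size=(30, 2))   # placeholder data

D = pdist(X, metric="euclidean")          # the n(n-1)/2 pairwise dissimilarities
Z = linkage(D, method="complete")         # agglomerative merges; try "single", "average"

labels = fcluster(Z, t=3, criterion="maxclust")   # cut the tree into 3 clusters
# dendrogram(Z)  # draws the tree when a matplotlib figure is available
```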

Linkage matters
[Figure, three panels: Average Linkage, Complete Linkage, Single Linkage.]

Distance vs. similarity
Hierarchical clustering can be performed with respect to a distance measure or a similarity measure. Correlation is often a better choice. [Figure: Observations 1, 2, and 3 plotted against Variable Index.]
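A small sketch of using correlation-based dissimilarity (one minus the correlation between observations) with scipy's hierarchical clustering, as an alternative to Euclidean distance:

```python
import numpy as np
from scipy.spatial.distance import pdist
from scipy.cluster.hierarchy import linkage, fcluster

X = np.random.default_rng(4).normal(size=(20, 50))   # 20 observations, 50 variables

# "correlation" gives 1 - Pearson correlation between pairs of rows, so
# observations with similar profiles (up to scale and shift) end up close.
D = pdist(X, metric="correlation")
Z = linkage(D, method="average")
labels = fcluster(Z, t=3, criterion="maxclust")
```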

Summary
- Clustering depends on the choice of similarity/distance and on preprocessing.
- Different methods will give different results.
- Clustering algorithms will find as many clusters as you ask for: we need methods for deciding the number of clusters.
- Clustering is sensitive to noise.
- There are hard choices to make: there is no teaching signal as we had in supervised learning.