Statistics 202: Data Mining. © Jonathan Taylor. Week 8. Based in part on slides from the textbook and slides of Susan Holmes. December 2, 2012.


Week 8. Based in part on slides from the textbook and slides of Susan Holmes. December 2, 2012.

Part I: Clustering

Clustering. Goal: find groups of objects such that the objects in a group are similar (or related) to one another and different from (or unrelated to) the objects in other groups. It is an unsupervised problem that tries to produce labels from unlabelled data. There are many different techniques, but most revolve around forming groups with small within-cluster distances relative to between-cluster distances.

What is Cluster Analysis? Finding groups of objects such that the objects in a group will be similar (or related) to one another and different from (or unrelated to) the objects in other groups: intra-cluster distances are minimized, inter-cluster distances are maximized. (From Tan, Steinbach & Kumar, Introduction to Data Mining.)

Clustering: Applications. Understanding structure in a dataset: group related documents for browsing; group genes and proteins that have similar functionality; group stocks with similar price fluctuations. Summarization: reduce the size of large data sets (sometimes known as vector quantization...).

Cluster analysis. Figure (ESL 14.14): DNA microarray data; average-linkage hierarchical clustering has been applied independently to the rows (genes) and columns (samples), determining the ordering of the rows and columns.

Cluster analysis. Original image...

Cluster analysis. Some compression...

Cluster analysis. Too much compression?

Cluster analysis. The notion of a cluster can be ambiguous: how many clusters are there? The same points can plausibly be grouped into two, four, or six clusters. (Figure from Tan, Steinbach & Kumar.)

Clustering: Types of clustering. Partitional: a division of the data objects into non-overlapping subsets (clusters) such that each data object is in exactly one subset. Hierarchical: a set of nested clusters organized as a hierarchical tree; each data object is in exactly one subset for any horizontal cut of the tree...

Cluster analysis. Figure (ESL 14.4): simulated data in the plane, clustered into three classes (represented by orange, blue and green) by the K-means clustering algorithm. A partitional example.

Cluster analysis. Figure (ESL 14.12): dendrogram from agglomerative hierarchical clustering with average linkage applied to the human tumor microarray data (samples labelled by tumor type: LEUKEMIA, BREAST, CNS, NSCLC, OVARIAN, RENAL, MELANOMA, PROSTATE, COLON, ...). A hierarchical example.

Clustering: Other distinctions. Exclusivity: are points in only one cluster? Soft vs. hard: can we give a score for each case and each cluster? Partial vs. complete: do we cluster all points, or only some? Heterogeneity: are the clusters similar in size, shape, etc.?

Types of clusters: well-separated. A well-separated cluster is a set of points such that any point in the cluster is closer (or more similar) to every other point in the cluster than to any point not in the cluster. (Figure: 3 well-separated clusters. From Tan, Steinbach & Kumar.)

Types of clusters: center-based. A center-based cluster is a set of objects such that an object in a cluster is closer (more similar) to the center of its own cluster than to the center of any other cluster. The center of a cluster is often a centroid, the average of all the points in the cluster, or a medoid, the most representative point of the cluster. (Figure: 4 center-based clusters. From Tan, Steinbach & Kumar.)

Types of clusters: contiguity-based. A contiguous cluster (nearest neighbor or transitive): a cluster is a set of points such that a point in the cluster is closer (or more similar) to one or more other points in the cluster than to any point not in the cluster. (Figure: 8 contiguous clusters. From Tan, Steinbach & Kumar.)

Types of clusters: density-based. A cluster is a dense region of points, separated from other regions of high density by low-density regions. Used when the clusters are irregular or intertwined, and when noise and outliers are present. These clusters are made up of dense regions of points, often described by one of the other cluster types, but separated by low-density regions, often in the form of noise. (Figure: 6 density-based clusters. From Tan, Steinbach & Kumar.)

Clustering: Mathematical characterizations. Most clustering algorithms are based on a dissimilarity measure d. Data may be of mixed type, so some of the similarities we saw earlier may be used. Most clustering algorithms do not insist that the dissimilarity is truly a distance. (Local) Given two elements of the partition $C_r, C_s$, we might consider the single-linkage dissimilarity
$$d_{SL}(C_r, C_s) = \min_{x \in C_r,\, y \in C_s} d(x, y).$$
These types of measures are often used in hierarchical clustering algorithms.
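As a concrete illustration, here is a minimal R sketch of the single-linkage dissimilarity between two clusters; the data matrix and the index vectors for the two clusters are made up for the example.

    # Hypothetical data: 20 points in the plane, split into two made-up clusters
    set.seed(1)
    X <- matrix(rnorm(20 * 2), ncol = 2)
    C_r <- 1:10      # indices of the cases in cluster r
    C_s <- 11:20     # indices of the cases in cluster s

    D <- as.matrix(dist(X))   # pairwise Euclidean dissimilarities d(x, y)

    # d_SL(C_r, C_s): smallest dissimilarity between a point in C_r and one in C_s
    d_SL <- min(D[C_r, C_s])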

Clustering: Mathematical characterizations. (Global) A clustering is a partition $C = \{C_1, \ldots, C_k\}$ of the cases. It writes the total point scatter as
$$T = \frac{1}{2} \sum_{i,j=1}^{n} d^2(x_i, x_j) = W(C) + B(C)$$
and tries to minimize $W(C)$, the within-cluster scatter. This problem is NP-hard... These types of measures are often used in partitional clustering algorithms.
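The decomposition $T = W(C) + B(C)$ can be checked numerically; the following R sketch uses squared Euclidean distance and an arbitrary made-up partition into three clusters.

    set.seed(2)
    X <- matrix(rnorm(30 * 2), ncol = 2)
    labels <- sample(1:3, 30, replace = TRUE)     # an arbitrary partition C

    D2 <- as.matrix(dist(X))^2                    # squared distances d^2(x_i, x_j)

    T_total <- sum(D2) / 2                        # total point scatter T
    W <- sum(sapply(1:3, function(k) sum(D2[labels == k, labels == k]))) / 2
    B <- sum(sapply(1:3, function(k) sum(D2[labels == k, labels != k]))) / 2

    all.equal(T_total, W + B)                     # TRUE: T = W(C) + B(C)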

Clustering: Most common models. Partitional: K-means / K-medoids; mixture models. Hierarchical: agglomerative (bottom-up) hierarchical clustering. We'll spend some time on each of these...

Cluster analysis. Figure (ESL 14.4): simulated data in the plane, clustered into three classes (orange, blue and green) by the K-means clustering algorithm. K-means clustering.

Cluster analysis. Figure (ESL 14.12): dendrogram from agglomerative hierarchical clustering with average linkage applied to the human tumor microarray data. Agglomerative hierarchical clustering.

Part II: K-means clustering

K-means: Outline. K-means, K-medoids. Choosing the number of clusters: gap test, silhouette plot. Mixture modelling, EM algorithm.

K-means. Figure: simulated data in the plane, clustered into three classes (represented by red, blue and green) by the K-means clustering algorithm. From ESL.

K-means: Algorithm (Euclidean). (1) For each data point, the closest cluster center (in Euclidean distance) is identified. (2) Each cluster center is replaced by the coordinatewise average of all data points that are closest to it. (3) Steps 1 and 2 are alternated until convergence. The algorithm converges to a local minimum of the within-cluster sum of squares. Typically one uses multiple runs from random starting guesses and chooses the solution with the lowest within-cluster sum of squares.
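A bare-bones R sketch of the two alternating steps (in practice one would use the built-in kmeans()); the function name simple_kmeans and the fixed iteration count are illustrative choices, not part of the lecture.

    simple_kmeans <- function(X, k, n_iter = 20) {
      # X: a numeric matrix of observations (one per row)
      # Initialize: pick k data points at random as the cluster centers
      centers <- X[sample(nrow(X), k), , drop = FALSE]
      for (it in 1:n_iter) {   # a fixed iteration count stands in for a convergence check
        # Step 1: assign each point to its closest center (squared Euclidean distance)
        d2 <- sapply(1:k, function(l) rowSums(sweep(X, 2, centers[l, ])^2))
        cluster <- max.col(-d2)
        # Step 2: replace each center by the coordinatewise mean of its assigned points
        for (l in 1:k) {
          if (any(cluster == l))
            centers[l, ] <- colMeans(X[cluster == l, , drop = FALSE])
        }
      }
      list(centers = centers, cluster = cluster)
    }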

K-means: Non-Euclidean. We can replace the squared Euclidean distance with some other dissimilarity measure d. (1) This changes the assignment rule: each data point is assigned to the cluster center that minimizes d. (2) Each cluster center is replaced by the point that minimizes the sum of all pairwise d's to the cluster members. (3) Steps 1 and 2 are alternated until convergence. The algorithm converges to a local minimum of the within-cluster sum of d's.

K-means. Figure (ESL): successive iterations of the K-means algorithm on simulated data (panels: Initial Centroids, Initial Partition, Iteration Number 2, Iteration Number 20).

K-means. Figure: decrease in $W(C)$, the within-cluster sum of squares.

K-means: Importance of choosing initial centroids. Figure: another example of the iterations of K-means. (From Tan, Steinbach & Kumar.)

K-means: Two different K-means clusterings. Figure: original points, an optimal clustering, and a sub-optimal clustering of the same data. (From Tan, Steinbach & Kumar.)

The Iris data (K-means).
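A K-means fit like the one behind this figure can be reproduced (up to the randomness of the starting values) with base R's kmeans(); nstart asks for multiple random starts and keeps the solution with the lowest within-cluster sum of squares, as recommended above. The seed and the value of nstart are arbitrary.

    set.seed(202)
    fit <- kmeans(iris[, 1:4], centers = 3, nstart = 25)

    table(fit$cluster, iris$Species)   # compare recovered clusters with the species labels
    fit$tot.withinss                   # total within-cluster sum of squares, W(C)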

K-means: Issues to consider. Non-quantitative features, e.g. categorical variables, are typically coded by dummy variables and then treated as quantitative. How many centroids k do we use? As k increases, both training and test error decrease! (By test error we mean the within-cluster sum of squares for data held out when fitting the clusters.) It is also possible to get empty clusters...

K-means: Choosing K. Ideally, the within-cluster sum of squares flattens out quickly and we might choose the value of K at this elbow. We might also compare the observed within-cluster sum of squares to a null model, like a uniform distribution on a box containing the data. This is the basis of the gap statistic.
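The gap statistic is available in R through clusGap() in the cluster package; a sketch on the iris measurements (the choices of K.max, B, nstart and the seed are arbitrary, and this is not necessarily how the figures below were produced).

    library(cluster)

    set.seed(202)
    gap <- clusGap(iris[, 1:4], FUNcluster = kmeans, nstart = 25,
                   K.max = 8, B = 50)
    gap$Tab    # columns logW, E.logW, gap and SE.sim for K = 1, ..., 8

    # Smallest K with gap(K) >= gap(K+1) - SE(K+1), the rule described in the figures below
    maxSE(gap$Tab[, "gap"], gap$Tab[, "SE.sim"], method = "Tibs2001SEmax")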

K-means. Figure (ESL 14.11, left panel): observed (green) and expected (blue) values of $\log W_K$ for the simulated data of Figure 14.4; both curves have been translated to equal zero at one cluster. (Right panel): gap curve, equal to the difference between the observed and expected values of $\log W_K$. The blue curve is $W_K$ under the uniform null model, the green curve is for the data.

K-means. Figure (ESL 14.11, continued): the gap estimate $K^*$ is the smallest $K$ producing a gap within one standard deviation of the gap at $K+1$; here $K^* = 2$. The largest gap is at 2, and the formal rule also takes into account the variability of estimating the gap.

K-medoid: Algorithm. Same as K-means, except that the centroid is estimated not by the average, but by the observation having minimum total pairwise distance to the other cluster members. Advantage: the centroid is one of the observations, which is useful, e.g., when features are 0 or 1. Also, one only needs pairwise distances for K-medoids rather than the raw observations. In R, the function pam implements this using Euclidean distance (not squared distance).
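A sketch of pam() on the iris measurements; it accepts either raw observations or a precomputed dissimilarity (diss = TRUE), which is the "pairwise distances only" point made above.

    library(cluster)

    fit <- pam(iris[, 1:4], k = 3)
    fit$medoids                            # each medoid is an actual observation
    table(fit$clustering, iris$Species)

    # Equivalently, from a precomputed dissimilarity object -- no raw features needed
    d <- dist(iris[, 1:4])
    fit_d <- pam(d, k = 3, diss = TRUE)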

K-medoid: Example, country dissimilarities. This example comes from a study in which political science students were asked to provide pairwise dissimilarity measures for 12 countries.

         BEL   BRA   CHI   CUB   EGY   FRA   IND   ISR   USA   USS   YUG
    BRA  5.58
    CHI  7.00  6.50
    CUB  7.08  7.00  3.83
    EGY  4.83  5.08  8.17  5.83
    FRA  2.17  5.75  6.67  6.92  4.92
    IND  6.42  5.00  5.58  6.00  4.67  6.42
    ISR  3.42  5.50  6.42  6.42  5.00  3.92  6.17
    USA  2.50  4.92  6.25  7.33  4.50  2.25  6.33  2.75
    USS  6.08  6.67  4.25  2.67  6.00  6.17  6.17  6.92  6.17
    YUG  5.25  6.83  4.50  3.75  5.75  5.42  6.08  5.83  6.67  3.67
    ZAI  4.75  3.00  6.08  6.67  5.00  5.58  4.83  6.17  5.67  6.50  6.92

K-medoid. Figure: left panel: dissimilarities reordered and blocked according to 3-medoid clustering; the heat map is coded from most similar (dark red) to least similar (bright red). Right panel: two-dimensional multidimensional scaling plot, with the 3-medoid clusters indicated by different colors.

The Iris data: K-medoid (PAM).

K-medoid: Silhouette. For each case $1 \leq i \leq n$, a set of cases $C$ and a dissimilarity $d$, define
$$\bar d(i, C) = \frac{1}{\#C} \sum_{j \in C} d(i, j).$$
Each case $1 \leq i \leq n$ is assigned to a cluster $C_{l(i)}$. The silhouette width is defined for each case as
$$\text{silhouette}(i) = \frac{\min_{j \neq l(i)} \bar d(i, C_j) - \bar d(i, C_{l(i)})}{\max\left(\bar d(i, C_{l(i)}),\ \min_{j \neq l(i)} \bar d(i, C_j)\right)}.$$
High values of the silhouette indicate good clusterings. In R this is computable for pam objects.
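A sketch of how the silhouette widths are obtained in R for a pam fit, continuing with the iris measurements; plots like the ones on the following slides come from plot() on the silhouette object.

    library(cluster)

    fit <- pam(iris[, 1:4], k = 3)
    sil <- silhouette(fit)          # silhouette width for every observation
    summary(sil)                    # per-cluster and overall average widths
    plot(sil)                       # the silhouette plot

    fit$silinfo$avg.width           # average silhouette width; higher is better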

The Iris data: silhouette plot for K-medoid.

The Iris data: average silhouette width.

Mixture modelling: A soft clustering algorithm. Imagine we actually had labels Y for the cases; then this would be a classification problem. For this classification problem we might consider using a Gaussian discriminant model like LDA or QDA. We would then have to estimate $(\mu_j, \Sigma_j)$ within each cluster, which would be easy... The next model is based on this realization...

Mixture modelling: EM algorithm. The abbreviation: E = expectation, M = maximization. It is a special case of a majorization-minimization algorithm and is widely used throughout statistics. It is particularly useful for situations in which there is some hidden data that, if observed, would make the problem easy...

Mixture modelling: EM algorithm. In this mixture model framework, we assume that the data were drawn from the same model as in QDA (or LDA):
$$Y \sim \text{Multinomial}(1, \pi) \quad \text{(choose a label)}$$
$$X \mid Y = l \sim N(\mu_l, \Sigma_l).$$
Only, we have lost our labels and only observe $X_{n \times p}$. The goal is still the same: to estimate $\pi$ and $(\mu_l, \Sigma_l)$, $1 \leq l \leq k$.

Mixture modelling: EM algorithm. The algorithm keeps track of $(\mu_l, \Sigma_l)$, $1 \leq l \leq k$. It also tracks guesses at $Y$ in the form of a matrix $\Gamma_{n \times k}$. It alternates between guessing $Y$ and estimating $\pi$, $(\mu_l, \Sigma_l)$, $1 \leq l \leq k$.

Mixture modelling: EM algorithm. Initialize $\Gamma, \mu, \Sigma, \pi$. Repeat for $1 \leq t \leq T$: Estimate $\Gamma$ (these are called the responsibilities):
$$\gamma^{(t+1)}_{il} = \frac{\pi^{(t)}_l \, \phi_{\mu^{(t)}_l, \Sigma^{(t)}_l}(X_i)}{\sum_{l'=1}^{K} \pi^{(t)}_{l'} \, \phi_{\mu^{(t)}_{l'}, \Sigma^{(t)}_{l'}}(X_i)}.$$
Estimate $\mu_l$, $1 \leq l \leq k$:
$$\mu^{(t+1)}_l = \frac{\sum_{i=1}^{n} \gamma^{(t+1)}_{il} X_i}{\sum_{i=1}^{n} \gamma^{(t+1)}_{il}}.$$
This is just a weighted average with weights $\gamma^{(t+1)}_{il}$.

Mixture modelling: EM algorithm. Estimate $\Sigma_l$, $1 \leq l \leq k$:
$$\Sigma^{(t+1)}_l = \frac{\sum_{i=1}^{n} \gamma^{(t+1)}_{il} (X_i - \mu^{(t+1)}_l)(X_i - \mu^{(t+1)}_l)^T}{\sum_{i=1}^{n} \gamma^{(t+1)}_{il}}.$$
This is just a weighted estimate of the covariance matrix with weights $\gamma^{(t+1)}_{il}$. Estimate $\pi_l$:
$$\pi^{(t+1)}_l = \frac{1}{n} \sum_{i=1}^{n} \gamma^{(t+1)}_{il}.$$
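A compact R sketch of these E- and M-step updates for a Gaussian mixture, using dmvnorm() from the mvtnorm package for the density $\phi$; the initialization, the fixed iteration count and the function name are made up for the example.

    library(mvtnorm)   # dmvnorm(): the multivariate normal density phi

    em_gaussian_mixture <- function(X, k, n_iter = 50) {
      # X: a numeric matrix of observations (one per row)
      n <- nrow(X)
      # Crude initialization of pi, mu, Sigma; Gamma is filled in by the first E-step
      pi_l  <- rep(1 / k, k)
      mu    <- X[sample(n, k), , drop = FALSE]
      Sigma <- replicate(k, cov(X), simplify = FALSE)

      for (t in 1:n_iter) {
        # E-step: responsibilities gamma_il proportional to pi_l * phi_{mu_l, Sigma_l}(X_i)
        dens  <- sapply(1:k, function(l) pi_l[l] * dmvnorm(X, mu[l, ], Sigma[[l]]))
        Gamma <- dens / rowSums(dens)

        # M-step: weighted means, weighted covariances and mixing proportions
        for (l in 1:k) {
          w          <- Gamma[, l]
          mu[l, ]    <- colSums(w * X) / sum(w)
          Xc         <- sweep(X, 2, mu[l, ])
          Sigma[[l]] <- crossprod(Xc * sqrt(w)) / sum(w)
        }
        pi_l <- colMeans(Gamma)
      }
      list(pi = pi_l, mu = mu, Sigma = Sigma, responsibilities = Gamma)
    }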

Mixture modelling: EM algorithm. The quantities $\Gamma$ are not really parameters; they are estimates of the random labels $Y$, which were unobserved. If we had observed $Y$, then each row of $\Gamma$ would be all zeros except one entry, which would be 1. In that case, estimation of $\pi_l, \mu_l, \Sigma_l$ would be just as it would have been in QDA... The EM algorithm simply replaces the unobserved $Y$ with a guess...

The Iris data: Gaussian mixture modelling.
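In practice one would typically fit such a mixture with the mclust package rather than hand-coding EM; a sketch of a fit on the iris measurements (the choice G = 3 mirrors the three species, and the exact model behind the figure is not specified in the slides).

    library(mclust)

    fit <- Mclust(iris[, 1:4], G = 3)       # 3-component Gaussian mixture
    summary(fit)                            # chosen covariance model, log-likelihood, BIC
    head(fit$z)                             # soft assignments: the responsibilities Gamma
    table(fit$classification, iris$Species)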

The Iris data (K-means).

The Iris data: silhouette plot for K-medoid.