
K-means and Hierarchical Clustering
Xiaohui Xie, University of California, Irvine

Clustering

Given $n$ data points $X = \{x_1, x_2, \dots, x_n\}$. Clustering is the partitioning of the set $X$ into subsets (clusters), so that the data in each subset share some similarity according to some defined distance measure. Similarity measure between any two points: $d(x_i, x_j)$. For example, squared Euclidean distance: $d(x_i, x_j) = \|x_i - x_j\|^2$. Number of clusters: $K$. The $K$ subsets (clusters) $S_1, S_2, \dots, S_K$ satisfy

$$S_i \subseteq X \quad \forall i \qquad (1)$$
$$S_i \cap S_j = \emptyset \quad \forall i \neq j \qquad (2)$$
$$\bigcup_{i=1}^{K} S_i = X \qquad (3)$$
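For concreteness, a minimal NumPy sketch of the distance measure and of a label-based encoding of a partition (the array values and names here are mine, chosen purely for illustration):

```python
import numpy as np

def d(xi, xj):
    """Squared Euclidean distance d(x_i, x_j) = ||x_i - x_j||^2."""
    return float(np.sum((xi - xj) ** 2))

# A clustering assigns every point to exactly one of the K clusters,
# so the clusters are disjoint and their union is all of X.
X = np.array([[0.0, 0.0], [0.1, 0.2], [5.0, 5.0]])
labels = np.array([0, 0, 1])       # S_1 = {x_1, x_2}, S_2 = {x_3}
print(d(X[0], X[2]))               # 50.0
```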

K-means

Choose $K$ and randomly guess $K$ cluster center locations. Repeat until convergence:
1. Each datapoint finds out which center it is closest to.
2. Each center finds the centroid of the points it owns.
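As a concrete illustration of the two alternating steps, here is a minimal NumPy sketch of the procedure (function and variable names are mine, not from the slides):

```python
import numpy as np

def kmeans(X, K, n_iter=100, seed=0):
    """Lloyd's algorithm: alternate point-to-center assignment and centroid updates."""
    rng = np.random.default_rng(seed)
    # Randomly guess K cluster center locations (here: pick K data points).
    centers = X[rng.choice(len(X), size=K, replace=False)]
    for _ in range(n_iter):
        # Step 1: each datapoint finds the center it is closest to.
        dists = ((X[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
        labels = dists.argmin(axis=1)
        # Step 2: each center moves to the centroid of the points it owns.
        new_centers = np.array([X[labels == k].mean(axis=0) if np.any(labels == k)
                                else centers[k] for k in range(K)])
        if np.allclose(new_centers, centers):   # converged: no further update
            break
        centers = new_centers
    return labels, centers
```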

K-means Questions

- What is it trying to optimize?
- Are we sure it will terminate?
- Are we sure it will find an optimal clustering?
- How should we start it?
- How to automatically choose the number of centers?

K-means Error Function

Given $n$ datapoints $X = \{x_1, x_2, \dots, x_n\}$. Partition them into $K$ clusters $S_1, S_2, \dots, S_K$ with centers $c_1, \dots, c_K$. Define an error function:

$$V(S_1, \dots, S_K, c_1, \dots, c_K) = \sum_{i=1}^{K} \sum_{x_j \in S_i} d(x_j, c_i)$$
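A one-function sketch of this error for the squared Euclidean case, usable with the output of the kmeans sketch above (the function name is my own):

```python
import numpy as np

def kmeans_error(X, labels, centers):
    """V(S_1..S_K, c_1..c_K): total squared distance of each point to its own center."""
    return float(((X - centers[labels]) ** 2).sum())
```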

K-means Error Function (cont.)

Assign each datapoint to the closest center. Suppose after the assignment, the updated clusters are $S_1', \dots, S_K'$. Then

$$V(S_1', \dots, S_K', c_1, \dots, c_K) \le V(S_1, \dots, S_K, c_1, \dots, c_K)$$

Each center finds the centroid of the points it owns:

$$c_i' = \arg\min_{c} \sum_{x_j \in S_i'} d(x_j, c)$$

For the Euclidean distance measure: $c_i' = \frac{1}{|S_i'|} \sum_{x_j \in S_i'} x_j$. After the update:

$$V(S_1', \dots, S_K', c_1', \dots, c_K') \le V(S_1', \dots, S_K', c_1, \dots, c_K)$$

Thus both steps decrease the error function (if there is any update).

Will we find the optimal configuration?

Not necessarily. Can you find a configuration that has converged, but does not have the minimum error?

Trying to find good optima

- Carefully choose the starting centers.
- Run K-means multiple times, each from a different start configuration.
- Many other ideas.
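One way to act on the multiple-restart advice, shown here with scikit-learn's KMeans (the random data is purely illustrative):

```python
import numpy as np
from sklearn.cluster import KMeans

X = np.random.default_rng(0).normal(size=(300, 2))

# n_init repeats the whole algorithm from 10 different random starting centers
# and keeps the run with the lowest error (stored in inertia_).
best = KMeans(n_clusters=3, n_init=10, random_state=0).fit(X)
print(best.inertia_, best.cluster_centers_)
```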

How to choose the number of centers?

In general a difficult problem. Related to model selection.

- Bayesian information criterion (BIC): $\mathrm{BIC} = V + \lambda m K \log n$, where $m$ is the dimension of the data; in other words, $mK$ is the total number of free parameters.
- Cross-validation
- Non-parametric Bayesian methods
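A rough sketch of the BIC recipe on top of scikit-learn's KMeans; the penalty follows the slide's formula, the weight lam is left as a free choice, and the function name is mine:

```python
import numpy as np
from sklearn.cluster import KMeans

def choose_K_by_bic(X, K_values, lam=1.0):
    """Score each K by V + lam * m * K * log(n) and return the K with the smallest score."""
    n, m = X.shape
    scores = {}
    for K in K_values:
        fit = KMeans(n_clusters=K, n_init=10, random_state=0).fit(X)
        V = fit.inertia_                          # within-cluster squared error
        scores[K] = V + lam * m * K * np.log(n)
    return min(scores, key=scores.get), scores
```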

Single-linkage hierarchical clustering

Initialize: every point is its own cluster.
Find the most similar pair of clusters, where the distance between clusters can be the
- minimum distance between points in the clusters,
- maximum distance between points in the clusters, or
- average distance between points in the clusters.
Merge the pair into a parent cluster.
Repeat until you have merged the whole dataset into one cluster.

Pros and Cons of Hierarchical Clustering

Pros: the result is a dendrogram, or hierarchy of datapoints. To choose $K$ clusters, just cut the $K-1$ longest links.
Cons: no real statistical or information-theoretical foundation for the clustering.
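A short SciPy sketch of the previous two slides: building the hierarchy with one of the three linkage choices, then cutting it into K clusters (the toy data is illustrative):

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

X = np.random.default_rng(0).normal(size=(20, 2))

# Build the dendrogram bottom-up; method='single', 'complete', or 'average'
# corresponds to the minimum / maximum / average cluster distances above.
Z = linkage(X, method='single')

# Choosing K clusters amounts to cutting the K-1 longest links of the dendrogram.
K = 3
labels = fcluster(Z, t=K, criterion='maxclust')
```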

Probabilistic interpretation of K-means

Input: $n$ data points $x = \{x_1, \dots, x_n\}$.
Model: a mixture model of $K$ components (clusters). Each component is described by a normal distribution: $x \sim N(\mu_i, \sigma_i)$ for the $i$-th component. In other words, if $x$ belongs to cluster $i$,

$$p(x \mid x \in S_i) = \frac{1}{\sqrt{2\pi\sigma_i^2}}\, e^{-\frac{\|x - \mu_i\|^2}{2\sigma_i^2}}$$

We will use $\theta_i = (\mu_i, \sigma_i)$ and $\theta = (\theta_1, \dots, \theta_K)$ to denote the parameters of the model. The assignment of the data to each of the clusters: $z = \{z_1, \dots, z_n\}$ with $z_i \in \{1, 2, \dots, K\}$.
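To make the generative model concrete, a small 1-D sampling sketch (the particular means, variances, and mixing weights are made up for illustration):

```python
import numpy as np

rng = np.random.default_rng(0)
K, n = 3, 500
mu    = np.array([-4.0, 0.0, 5.0])   # component means mu_i
sigma = np.array([0.5, 1.0, 0.8])    # component standard deviations sigma_i
pi    = np.array([0.3, 0.5, 0.2])    # mixing weights P(z_i = j)

# Generative view of the mixture: first draw the cluster label z_i,
# then draw x_i from the normal distribution of that cluster.
z = rng.choice(K, size=n, p=pi)
x = rng.normal(loc=mu[z], scale=sigma[z])
```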

Probabilistic K-means: inference

The probability of generating the data given the model and the labels:

$$p(x \mid z, \theta) = \prod_{i=1}^{n} p(x_i \mid z_i, \theta), \quad \text{where } p(x_i \mid z_i, \theta) = p(x_i \mid \mu_{z_i}, \sigma_{z_i})$$

The probability of generating the data given the model only:

$$p(x \mid \theta) = \prod_{i=1}^{n} \sum_{j=1}^{K} P(z_i = j)\, p(x_i \mid z_i = j, \theta)$$
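The marginal likelihood $p(x \mid \theta)$ can be evaluated directly; a small 1-D sketch (function name mine), usable with the sample drawn above:

```python
import numpy as np

def log_likelihood(x, mu, sigma, pi):
    """log p(x | theta) = sum_i log sum_j P(z_i = j) p(x_i | mu_j, sigma_j), 1-D Gaussians."""
    # comp[i, j] = density of x_i under component j
    comp = np.exp(-(x[:, None] - mu[None, :]) ** 2 / (2 * sigma[None, :] ** 2)) \
           / np.sqrt(2 * np.pi * sigma[None, :] ** 2)
    return float(np.sum(np.log(comp @ pi)))
```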

Probabilistic K-means: EM algorithm

ML estimate of the mixture model: $\hat{\theta} = \arg\max_\theta \log p(x \mid \theta)$.

EM algorithm, E-step:

$$q(z_i = j) = p(z_i = j \mid \theta, x_i) \qquad (4)$$
$$\propto \frac{1}{\sigma_j}\, e^{-\frac{\|x_i - \mu_j\|^2}{2\sigma_j^2}} \qquad (5)$$

Probabilistic K-means: EM algorithm

EM algorithm, M-step:

$$\hat{\theta} = \arg\max_{\theta} \sum_{i=1}^{n} \sum_{j=1}^{K} q(z_i = j) \log p(x_i \mid \theta_j) \qquad (6)$$
$$\hat{\theta}_j = \arg\max_{\theta_j} \sum_{i=1}^{n} q(z_i = j) \log p(x_i \mid \theta_j) \qquad (7)$$
$$\phantom{\hat{\theta}_j} = \arg\min_{\theta_j} \sum_{i=1}^{n} q(z_i = j) \left[ \|x_i - \mu_j\|^2 / (2\sigma_j^2) + \log \sigma_j \right] \qquad (8)$$

Probabilistic K-means: EM algorithm

If we hold $\sigma_i = \sigma$ fixed for all $i$ and only learn $\mu_i$, then

$$\hat{\mu}_j = \arg\min_{\mu_j} \sum_{i=1}^{n} q(z_i = j)\, \|x_i - \mu_j\|^2 \qquad (9)$$
$$\hat{\mu}_j = \frac{\sum_{i=1}^{n} q(z_i = j)\, x_i}{\sum_{i=1}^{n} q(z_i = j)} \qquad (10)$$

In summary, probabilistic K-means alternates between

E-step: $q(z_i = j) = e^{-\|x_i - \mu_j\|^2 / (2\sigma^2)} / Z$

M-step: $\mu_j = \frac{\sum_{i=1}^{n} q(z_i = j)\, x_i}{\sum_{i=1}^{n} q(z_i = j)}$
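A compact NumPy sketch of these two updates for the fixed-variance case (names and the fixed iteration count are my own choices; a fuller implementation would monitor convergence of the log-likelihood):

```python
import numpy as np

def soft_kmeans(X, K, sigma=1.0, n_iter=50, seed=0):
    """Fixed-variance EM: soft assignments (E-step), then weighted means (M-step)."""
    rng = np.random.default_rng(seed)
    mu = X[rng.choice(len(X), size=K, replace=False)]
    for _ in range(n_iter):
        # E-step: q(z_i = j) proportional to exp(-||x_i - mu_j||^2 / (2 sigma^2))
        d2 = ((X[:, None, :] - mu[None, :, :]) ** 2).sum(axis=2)
        logits = -d2 / (2 * sigma ** 2)
        logits -= logits.max(axis=1, keepdims=True)      # subtract max for numerical stability
        q = np.exp(logits)
        q /= q.sum(axis=1, keepdims=True)
        # M-step: mu_j = sum_i q(z_i = j) x_i / sum_i q(z_i = j)
        mu = (q.T @ X) / q.sum(axis=0)[:, None]
    return q, mu
```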

Relationship to the standard K-means (1)

Under what condition does the EM algorithm become the standard K-means? Note that in the E-step:

$$q(z_i = j) = \frac{e^{-\frac{\|x_i - \mu_j\|^2}{2\sigma^2}}}{\sum_{k=1}^{K} e^{-\frac{\|x_i - \mu_k\|^2}{2\sigma^2}}} \qquad (11)$$
$$= \frac{e^{-\frac{\|x_i - \mu_j\|^2 - \|x_i - \mu_m\|^2}{2\sigma^2}}}{1 + \sum_{k=1, k \neq m}^{K} e^{-\frac{\|x_i - \mu_k\|^2 - \|x_i - \mu_m\|^2}{2\sigma^2}}} \qquad (12)$$

where $m = \arg\min_k \|x_i - \mu_k\|^2$ is the index of the cluster closest to $x_i$ (we assume $m$ is unique).

Relationship to the standard K-means (2)

Let $\sigma \to 0$. For all $k \neq m$, $\|x_i - \mu_k\|^2 - \|x_i - \mu_m\|^2 > 0$, hence the denominator in Eq. (12) $\to 1$ as $\sigma \to 0$. The numerator $\to 1$ only when $j = m$, otherwise $\to 0$. In summary, as $\sigma \to 0$,

$$q(z_i = j) = \begin{cases} 1 & \text{if } j = m \\ 0 & \text{otherwise} \end{cases} \qquad (13)$$

where $m = \arg\min_k \|x_i - \mu_k\|^2$ is the index of the cluster closest to $x_i$. Note that this is just the standard K-means procedure.
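A numerical illustration of this limit (toy numbers of my own choosing): shrinking $\sigma$ turns the soft responsibilities of Eq. (11) into a one-hot assignment to the closest center.

```python
import numpy as np

x_i = np.array([0.2, 0.0])
mu = np.array([[0.0, 0.0], [1.0, 0.0], [3.0, 3.0]])   # the first center is closest to x_i

d2 = ((x_i - mu) ** 2).sum(axis=1)
for sigma in [1.0, 0.3, 0.1, 0.03]:
    logits = -d2 / (2 * sigma ** 2)
    q = np.exp(logits - logits.max())   # E-step responsibilities, Eq. (11)
    q /= q.sum()
    print(sigma, np.round(q, 4))
# As sigma shrinks, q(z_i = j) collapses onto the closest center,
# i.e. the hard assignment used by standard K-means.
```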