
Machine Learning
B. Unsupervised Learning
B.1 Cluster Analysis

Lars Schmidt-Thieme
Information Systems and Machine Learning Lab (ISMLL)
Institute for Computer Science, University of Hildesheim, Germany

Outline
1. k-means & k-medoids
2. Gaussian Mixture Models
3. Hierarchical Cluster Analysis

Syllabus
  Tue. 21.10. (1)  0. Introduction
A. Supervised Learning
  Wed. 22.10. (2)  A.1 Linear Regression
  Tue. 28.10. (3)  A.2 Linear Classification
  Wed. 29.10. (4)  A.3 Regularization
  Tue.  4.11. (5)  A.4 High-dimensional Data
  Wed.  5.11. (6)  A.5 Nearest-Neighbor Models
  Tue. 11.11. (7)  A.6 Decision Trees
  Wed. 12.11. (8)  A.7 Support Vector Machines
  Tue. 18.11. (9)  A.8 A First Look at Bayesian and Markov Networks
B. Unsupervised Learning
  Wed. 19.11. (10) B.1 Clustering
  Tue. 25.11. (11) B.2 Dimensionality Reduction
  Wed. 26.11. (12) B.3 Frequent Pattern Mining
C. Reinforcement Learning
  Tue.  2.12. (13) C.1 State Space Models
  Wed.  3.12. (14) C.2 Markov Decision Processes

1. k-means & k-medoids

Partitions

Let $X$ be a set. A set $P \subseteq \mathcal{P}(X)$ of subsets of $X$ is called a partition of $X$ if the subsets
1. are pairwise disjoint: $A \cap B = \emptyset$ for all $A, B \in P$ with $A \neq B$,
2. cover $X$: $\bigcup_{A \in P} A = X$, and
3. do not contain the empty set: $\emptyset \notin P$.

Partitions

Let $X := \{x_1, \dots, x_N\}$ be a finite set. A set $\mathcal{P} := \{X_1, \dots, X_K\}$ of subsets $X_k \subseteq X$ is called a partition of $X$ if the subsets
1. are pairwise disjoint: $X_k \cap X_j = \emptyset$ for all $k, j \in \{1, \dots, K\}$ with $k \neq j$,
2. cover $X$: $\bigcup_{k=1}^{K} X_k = X$, and
3. do not contain the empty set: $X_k \neq \emptyset$ for all $k \in \{1, \dots, K\}$.

The sets $X_k$ are also called clusters, and a partition $\mathcal{P}$ a clustering. $K \in \mathbb{N}$ is called the number of clusters. $\operatorname{Part}(X)$ denotes the set of all partitions of $X$.

Partitions

Let $X$ be a finite set. A surjective function $p : \{1, \dots, |X|\} \to \{1, \dots, K\}$ is called a partition function of $X$. The sets $X_k := p^{-1}(k)$ form a partition $\mathcal{P} := \{X_1, \dots, X_K\}$.

Partitions

Let $X := \{x_1, \dots, x_N\}$ be a finite set. A binary $N \times K$ matrix $P \in \{0,1\}^{N \times K}$ is called a partition matrix of $X$ if it
1. is row-stochastic: $\sum_{k=1}^{K} P_{i,k} = 1$ for all $i \in \{1, \dots, N\}$, and
2. does not contain a zero column: $P_{\cdot,k} \neq (0, \dots, 0)^T$ for all $k \in \{1, \dots, K\}$.

The sets $X_k := \{i \in \{1, \dots, N\} \mid P_{i,k} = 1\}$ form a partition $\mathcal{P} := \{X_1, \dots, X_K\}$. $P_{\cdot,k}$ is called the membership vector of class $k$.
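
As an illustration of the equivalence between these representations, here is a small sketch (Python/NumPy; not part of the original slides, all names chosen here) that converts an assignment vector, i.e. a partition function, into a partition matrix and back:

```python
import numpy as np

def assignments_to_matrix(p, K):
    """Turn an assignment vector p (p[i] in {0,...,K-1}) into a binary N x K partition matrix."""
    N = len(p)
    P = np.zeros((N, K), dtype=int)
    P[np.arange(N), p] = 1            # exactly one 1 per row -> row-stochastic
    assert (P.sum(axis=0) > 0).all(), "a partition matrix must not contain a zero column"
    return P

def matrix_to_assignments(P):
    """Recover the assignment vector from a partition matrix."""
    return P.argmax(axis=1)

p = np.array([0, 0, 1, 2, 1, 2])      # toy partition function over N=6 points, K=3 clusters
P = assignments_to_matrix(p, K=3)
assert (matrix_to_assignments(P) == p).all()
```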

The Cluster Analysis Problem

Given
- a set $\mathcal{X}$ called data space, e.g., $\mathcal{X} := \mathbb{R}^m$,
- a set $X \subseteq \mathcal{X}$ called data, and
- a function $D : \mathcal{P}(\mathcal{X}) \times \operatorname{Part}(X) \to \mathbb{R}_0^+$ called distortion measure, where $D(\mathcal{P})$ measures how bad a partition $\mathcal{P} \in \operatorname{Part}(X)$ is for a data set $X \subseteq \mathcal{X}$,

find a partition $\mathcal{P} = \{X_1, X_2, \dots, X_K\} \in \operatorname{Part}(X)$ with minimal distortion $D(\mathcal{P})$.

The Cluster Analysis Problem (given K)

Given
- a set $\mathcal{X}$ called data space, e.g., $\mathcal{X} := \mathbb{R}^m$,
- a set $X \subseteq \mathcal{X}$ called data,
- a function $D : \mathcal{P}(\mathcal{X}) \times \operatorname{Part}(X) \to \mathbb{R}_0^+$ called distortion measure, and
- a number $K \in \mathbb{N}$ of clusters,

find a partition $\mathcal{P} = \{X_1, X_2, \dots, X_K\} \in \operatorname{Part}_K(X)$ with $K$ clusters and minimal distortion $D(\mathcal{P})$.

k-means: Distortion = Sum of Distances to Cluster Centers

Sum of squared distances to cluster centers:
$$D(\mathcal{P}) := \sum_{i=1}^{n} \sum_{k=1}^{K} P_{i,k} \, \lVert x_i - \mu_k \rVert^2 = \sum_{k=1}^{K} \sum_{i=1:\, P_{i,k}=1}^{n} \lVert x_i - \mu_k \rVert^2$$
with
$$\mu_k := \frac{\sum_{i=1}^{n} P_{i,k}\, x_i}{\sum_{i=1}^{n} P_{i,k}} = \operatorname{mean} \{ x_i \mid P_{i,k} = 1,\ i = 1, \dots, n \}.$$

- Minimizing $D$ over partitions with a varying number of clusters leads to the singleton clustering with distortion 0; only the cluster analysis problem with given $K$ makes sense.
- Minimizing $D$ is not easy, as reassigning a point to a different cluster also shifts the cluster centers.
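
The distortion is straightforward to evaluate for a given partition. The following sketch (Python/NumPy, names chosen here for illustration) computes $D(\mathcal{P})$ from the data and a cluster assignment vector:

```python
import numpy as np

def kmeans_distortion(X, assignments, K):
    """Sum of squared distances of the points to their cluster means."""
    D = 0.0
    for k in range(K):
        cluster = X[assignments == k]
        if len(cluster) == 0:
            continue                      # empty clusters contribute nothing
        mu_k = cluster.mean(axis=0)       # cluster center = mean of its points
        D += ((cluster - mu_k) ** 2).sum()
    return D

X = np.array([[1.0], [2.0], [4.0], [7.0], [9.0], [12.0]])
print(kmeans_distortion(X, np.array([0, 0, 0, 1, 1, 1]), K=2))
```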

k-means: Minimizing Distances to Cluster Centers

Add cluster centers $\mu$ as auxiliary optimization variables:
$$D(\mathcal{P}, \mu) := \sum_{i=1}^{n} \sum_{k=1}^{K} P_{i,k} \, \lVert x_i - \mu_k \rVert^2$$

Block coordinate descent:
1. Fix $\mu$, optimize $P$ — reassign data points to clusters:
$$P_{i,k} := \delta(k = l_i), \quad l_i := \operatorname*{arg\,min}_{k \in \{1, \dots, K\}} \lVert x_i - \mu_k \rVert^2$$
2. Fix $P$, optimize $\mu$ — recompute cluster centers:
$$\mu_k := \frac{\sum_{i=1}^{n} P_{i,k}\, x_i}{\sum_{i=1}^{n} P_{i,k}}$$

Iterate until the partition is stable.

k-means: Initialization

k-means is usually initialized by picking $K$ data points as cluster centers at random:
1. pick the first cluster center $\mu_1$ out of the data points at random, and then
2. sequentially select the data point with the largest sum of distances to the already chosen cluster centers as the next cluster center:
$$\mu_k := x_i, \quad i := \operatorname*{arg\,max}_{i \in \{1, \dots, n\}} \sum_{l=1}^{k-1} \lVert x_i - \mu_l \rVert^2, \quad k = 2, \dots, K$$

Different initializations may lead to different local minima: run k-means with different random initializations and keep only the one with the smallest distortion (random restarts).

k-means Algorithm

1:  procedure cluster-kmeans(D := {x_1, ..., x_N} ⊆ R^M, K ∈ N, ε ∈ R^+)
2:      i_1 ~ unif({1, ..., N}),  μ_1 := x_{i_1}
3:      for k := 2, ..., K do
4:          i_k := arg max_{n ∈ {1,...,N}} Σ_{l=1}^{k-1} ||x_n − μ_l||,  μ_k := x_{i_k}
5:      repeat
6:          μ^old := μ
7:          for n := 1, ..., N do
8:              P_n := arg min_{k ∈ {1,...,K}} ||x_n − μ_k||
9:          for k := 1, ..., K do
10:             μ_k := mean {x_n | P_n = k}
11:     until (1/K) Σ_{k=1}^{K} ||μ_k − μ_k^old|| < ε
12:     return P

Note: In implementations, the two loops over the data (lines 7 and 9) can be combined into one loop.
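
A compact executable version of this procedure (a sketch in Python/NumPy, not the lecture's reference implementation; function and variable names are chosen here) could look as follows:

```python
import numpy as np

def kmeans(X, K, eps=1e-6, rng=np.random.default_rng(0)):
    """k-means with farthest-point initialization, following the pseudocode above."""
    N = len(X)
    # initialization: first center at random, the rest by largest sum of distances
    centers = [X[rng.integers(N)]]
    for _ in range(1, K):
        dist_sums = np.array([sum(np.linalg.norm(x - mu) for mu in centers) for x in X])
        centers.append(X[dist_sums.argmax()])
    mu = np.array(centers)

    while True:
        mu_old = mu.copy()
        # step 1: assign each point to its closest center
        assignments = np.array([np.argmin([np.linalg.norm(x - m) for m in mu]) for x in X])
        # step 2: recompute centers as cluster means (keep the old center for empty clusters)
        mu = np.array([X[assignments == k].mean(axis=0) if (assignments == k).any() else mu[k]
                       for k in range(K)])
        if np.linalg.norm(mu - mu_old, axis=1).mean() < eps:
            return assignments, mu

X = np.array([[1.0], [2.0], [4.0], [7.0], [9.0], [12.0]])
assignments, mu = kmeans(X, K=2)
```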

Example

[Figure: six points x_1, ..., x_6 on a line with coordinates between 1 and 12, clustered with K = 2. The slides show the k-means iterations step by step: after initialization (μ_1 := x_4, μ_2 := x_1), the assignment and center-update steps are alternated; the distortion decreases from d = 33 to d = 23.7 and finally to d = 16, where the partition is stable.]

How Many Clusters K?

[Figure: plot of the distortion D against the number of clusters K; D drops from n·σ² at K = 1 towards 0 at K = n. A "knee" (elbow) in this curve is commonly used to choose K.]
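
One simple way to inspect this curve is to run k-means for a range of K and compare the resulting distortions. This sketch reuses the hypothetical kmeans and kmeans_distortion helpers and the toy data X from the earlier code examples:

```python
distortions = []
for K in range(1, len(X) + 1):
    assignments, mu = kmeans(X, K)
    distortions.append(kmeans_distortion(X, assignments, K))

# look for the K after which the distortion stops dropping sharply (the "elbow")
for K, d in enumerate(distortions, start=1):
    print(K, round(d, 2))
```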

k-medoids: k-means for General Distances

One can generalize k-means to a general distance $d$:
$$D(\mathcal{P}, \mu) := \sum_{i=1}^{n} \sum_{k=1}^{K} P_{i,k} \, d(x_i, \mu_k)$$

- Step 1, assigning data points to clusters, remains the same:
$$P_{i,k} := \delta(k = l_i), \quad l_i := \operatorname*{arg\,min}_{k \in \{1, \dots, K\}} d(x_i, \mu_k)$$
- But step 2, finding the best cluster representatives $\mu_k$, is no longer solved by the mean and may be difficult in general.
- Idea of k-medoids: choose the cluster representatives from the cluster's own data points:
$$\mu_k := x_j, \quad j := \operatorname*{arg\,min}_{j \in \{1, \dots, n\}:\ P_{j,k} = 1} \ \sum_{i=1}^{n} P_{i,k} \, d(x_i, x_j)$$

k-medoids: k-means for General Distances

- k-medoids is a "kernel method": it requires no access to the variables, just to the distance measure.
- For the Manhattan distance / $L_1$ distance, step 2, finding the best cluster representatives $\mu_k$, can be solved without restricting to cluster data points:
$$(\mu_k)_j := \operatorname{median} \{ (x_i)_j \mid P_{i,k} = 1,\ i = 1, \dots, n \}, \quad j = 1, \dots, m$$
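
The medoid update needs only pairwise distances, which the following sketch makes explicit (Python/NumPy, illustrative names, assuming a precomputed N x N distance matrix):

```python
import numpy as np

def update_medoids(dist, assignments, K):
    """Step 2 of k-medoids: for each cluster, pick the member that minimizes the
    sum of distances to all other members (dist is an N x N pairwise distance matrix)."""
    medoids = np.empty(K, dtype=int)
    for k in range(K):
        members = np.flatnonzero(assignments == k)
        within = dist[np.ix_(members, members)].sum(axis=1)   # distance sums within the cluster
        medoids[k] = members[within.argmin()]
    return medoids

def assign_to_medoids(dist, medoids):
    """Step 1 of k-medoids: assign every point to its closest medoid."""
    return dist[:, medoids].argmin(axis=1)
```

Alternating these two functions until the assignments stop changing mirrors the block coordinate descent used for k-means.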

2. Gaussian Mixture Models

Soft Partitions: Row-Stochastic Matrices

Let $X := \{x_1, \dots, x_N\}$ be a finite set. An $N \times K$ matrix $P \in [0,1]^{N \times K}$ is called a soft partition matrix of $X$ if it
1. is row-stochastic: $\sum_{k=1}^{K} P_{i,k} = 1$ for all $i \in \{1, \dots, N\}$, and
2. does not contain a zero column: $P_{\cdot,k} \neq (0, \dots, 0)^T$ for all $k \in \{1, \dots, K\}$.

$P_{i,k}$ is called the membership degree of instance $i$ in class $k$, or the cluster weight of instance $i$ in cluster $k$. $P_{\cdot,k}$ is called the membership vector of class $k$. $\operatorname{SoftPart}(X)$ denotes the set of all soft partitions of $X$.

Note: Soft partitions are also called soft clusterings or fuzzy clusterings.

The Soft Clustering Problem

Given
- a set $\mathcal{X}$ called data space, e.g., $\mathcal{X} := \mathbb{R}^m$,
- a set $X \subseteq \mathcal{X}$ called data, and
- a distortion measure $D : \mathcal{P}(\mathcal{X}) \times \operatorname{SoftPart}(X) \to \mathbb{R}_0^+$, where $D(P)$ measures how bad a soft partition $P \in \operatorname{SoftPart}(X)$ is for a data set $X \subseteq \mathcal{X}$,

find a soft partition $P \in \operatorname{SoftPart}(X)$ with minimal distortion $D(P)$.

The Soft Clustering Problem (given K)

Given
- a set $\mathcal{X}$ called data space, e.g., $\mathcal{X} := \mathbb{R}^m$,
- a set $X \subseteq \mathcal{X}$ called data,
- a distortion measure $D : \mathcal{P}(\mathcal{X}) \times \operatorname{SoftPart}(X) \to \mathbb{R}_0^+$, and
- a number $K \in \mathbb{N}$ of clusters,

find a soft partition $P \in \operatorname{SoftPart}_K(X) \subseteq [0,1]^{|X| \times K}$ with $K$ clusters and minimal distortion $D(P)$.

Mixture Models

Mixture models assume that there exists an unobserved nominal variable $Z$ with $K$ levels:
$$p(X, Z) = p(Z)\, p(X \mid Z) = \prod_{k=1}^{K} \big( \pi_k \, p(X \mid Z = k) \big)^{\delta(Z = k)}$$
The complete data log-likelihood of the completed data $(X, Z)$ then is
$$\ell(\Theta; X, Z) := \sum_{i=1}^{n} \sum_{k=1}^{K} \delta(z_i = k) \big( \ln \pi_k + \ln p(X = x_i \mid Z = k; \theta_k) \big)$$
with $\Theta := (\pi_1, \dots, \pi_K, \theta_1, \dots, \theta_K)$.

$\ell$ cannot be computed because the $z_i$ are unobserved.

Mixture Models: Expected Log-Likelihood

Given an estimate $\Theta^{(t-1)}$ of the parameters, mixtures aim to optimize the expected complete data log-likelihood:
$$Q(\Theta; \Theta^{(t-1)}) := E[\ell(\Theta; X, Z) \mid \Theta^{(t-1)}] = \sum_{i=1}^{n} \sum_{k=1}^{K} E[\delta(Z_i = k) \mid x_i, \Theta^{(t-1)}] \big( \ln \pi_k + \ln p(X = x_i \mid Z = k; \theta_k) \big)$$
which is relaxed to
$$Q(\Theta, r; \Theta^{(t-1)}) = \sum_{i=1}^{n} \sum_{k=1}^{K} \Big[ r_{i,k} \big( \ln \pi_k + \ln p(X = x_i \mid Z = k; \theta_k) \big) + \big( r_{i,k} - E[\delta(Z_i = k) \mid x_i, \Theta^{(t-1)}] \big)^2 \Big]$$

Mixture Models: Expected Log-Likelihood

Block coordinate descent (EM algorithm): alternate until convergence
1. Expectation step:
$$r_{i,k}^{(t-1)} := E[\delta(Z_i = k) \mid x_i, \Theta^{(t-1)}] = p(Z = k \mid X = x_i; \Theta^{(t-1)}) = \frac{p(X = x_i \mid Z = k; \theta_k^{(t-1)})\, \pi_k^{(t-1)}}{\sum_{k'=1}^{K} p(X = x_i \mid Z = k'; \theta_{k'}^{(t-1)})\, \pi_{k'}^{(t-1)}} \tag{0}$$
2. Maximization step:
$$\Theta^{(t)} := \operatorname*{arg\,max}_{\Theta} Q(\Theta, r^{(t-1)}; \Theta^{(t-1)}) = \operatorname*{arg\,max}_{\pi_1, \dots, \pi_K, \theta_1, \dots, \theta_K} \sum_{i=1}^{n} \sum_{k=1}^{K} r_{i,k} \big( \ln \pi_k + \ln p(X = x_i \mid Z = k; \theta_k) \big)$$

Mixture Models: Expected Log-Likelihood

2. Maximization step:
$$\Theta^{(t)} = \operatorname*{arg\,max}_{\pi_1, \dots, \pi_K, \theta_1, \dots, \theta_K} \sum_{i=1}^{n} \sum_{k=1}^{K} r_{i,k} \big( \ln \pi_k + \ln p(X = x_i \mid Z = k; \theta_k) \big) \tag{1}$$
$$\pi_k^{(t)} = \frac{\sum_{i=1}^{n} r_{i,k}}{n}$$
$$\sum_{i=1}^{n} \frac{r_{i,k}}{p(X = x_i \mid Z = k; \theta_k)} \, \frac{\partial\, p(X = x_i \mid Z = k; \theta_k)}{\partial \theta_k} = 0 \quad \forall k \tag{*}$$
(*) needs to be solved for the specific cluster-specific distributions $p(X \mid Z)$.

Gaussian Mixtures

Gaussian mixtures use Gaussians for $p(X \mid Z)$:
$$p(X = x \mid Z = k) = \frac{1}{\sqrt{(2\pi)^m |\Sigma_k|}} \, e^{-\frac{1}{2}(x - \mu_k)^T \Sigma_k^{-1} (x - \mu_k)}, \quad \theta_k := (\mu_k, \Sigma_k)$$
$$\mu_k^{(t)} = \frac{\sum_{i=1}^{n} r_{i,k}^{(t-1)}\, x_i}{\sum_{i=1}^{n} r_{i,k}^{(t-1)}} \tag{2}$$
$$\Sigma_k^{(t)} = \frac{\sum_{i=1}^{n} r_{i,k}^{(t-1)}\, (x_i - \mu_k^{(t)})(x_i - \mu_k^{(t)})^T}{\sum_{i=1}^{n} r_{i,k}^{(t-1)}} = \frac{\sum_{i=1}^{n} r_{i,k}^{(t-1)}\, x_i x_i^T}{\sum_{i=1}^{n} r_{i,k}^{(t-1)}} - \mu_k^{(t)} \mu_k^{(t)\,T} \tag{3}$$

Gaussian Mixtures: EM Algorithm, Summary

1. Expectation step: for all $i, k$
$$\tilde{r}_{i,k}^{(t-1)} = \pi_k^{(t-1)} \, \frac{1}{\sqrt{(2\pi)^m |\Sigma_k^{(t-1)}|}} \, e^{-\frac{1}{2}(x_i - \mu_k^{(t-1)})^T (\Sigma_k^{(t-1)})^{-1} (x_i - \mu_k^{(t-1)})} \tag{0a}$$
$$r_{i,k}^{(t-1)} = \frac{\tilde{r}_{i,k}^{(t-1)}}{\sum_{k'=1}^{K} \tilde{r}_{i,k'}^{(t-1)}} \tag{0b}$$
2. Maximization step: for all $k$
$$\pi_k^{(t)} = \frac{\sum_{i=1}^{n} r_{i,k}^{(t-1)}}{n} \tag{1}$$
$$\mu_k^{(t)} = \frac{\sum_{i=1}^{n} r_{i,k}^{(t-1)}\, x_i}{\sum_{i=1}^{n} r_{i,k}^{(t-1)}} \tag{2}$$
$$\Sigma_k^{(t)} = \frac{\sum_{i=1}^{n} r_{i,k}^{(t-1)}\, x_i x_i^T}{\sum_{i=1}^{n} r_{i,k}^{(t-1)}} - \mu_k^{(t)} \mu_k^{(t)\,T} \tag{3}$$
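
These update equations translate almost line by line into code. The following is a minimal sketch (Python/NumPy with scipy.stats for the Gaussian density; all names chosen here, not the lecture's reference implementation):

```python
import numpy as np
from scipy.stats import multivariate_normal

def gmm_em(X, K, n_iter=100, rng=np.random.default_rng(0)):
    """EM for a Gaussian mixture: returns responsibilities r and parameters (pi, mu, Sigma)."""
    N, m = X.shape
    pi = np.full(K, 1.0 / K)
    mu = X[rng.choice(N, size=K, replace=False)]           # initialize centers from data points
    Sigma = np.array([np.cov(X.T) + 1e-6 * np.eye(m)] * K)
    for _ in range(n_iter):
        # E-step (0a/0b): responsibilities proportional to pi_k * N(x_i; mu_k, Sigma_k)
        r = np.column_stack([pi[k] * multivariate_normal.pdf(X, mu[k], Sigma[k]) for k in range(K)])
        r /= r.sum(axis=1, keepdims=True)
        # M-step (1)-(3): mixture weights, weighted means and covariances
        Nk = r.sum(axis=0)
        pi = Nk / N
        mu = (r.T @ X) / Nk[:, None]
        Sigma = np.array([((r[:, k, None] * (X - mu[k])).T @ (X - mu[k])) / Nk[k]
                          + 1e-6 * np.eye(m) for k in range(K)])
    return r, pi, mu, Sigma
```

The small diagonal term added to each covariance is a common regularization against degenerate clusters; it is not part of the equations above.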

Gaussian Mixtures for Soft Clustering

- The responsibilities $r \in [0,1]^{N \times K}$ are a soft partition: $P := r$.
- The negative expected log-likelihood can be used as cluster distortion:
$$D(P) := -\max_{\Theta} Q(\Theta, r)$$
- To optimize $D$, we can simply run EM.
- For hard clustering, assign points to the cluster with the highest responsibility (hard EM):
$$r_{i,k}^{(t-1)} = \delta\big(k = \operatorname*{arg\,max}_{k' \in \{1, \dots, K\}} r_{i,k'}^{(t-1)}\big) \tag{0b'}$$

Gaussian Mixtures for Soft Clustering / Example

[Figure: EM iterations of a Gaussian mixture with K = 2 on a two-dimensional data set; the responsibilities and the elliptical cluster shapes are refined from iteration to iteration. From [Mur12, fig. 11.11].]

Model-based Cluster Analysis

Different parametrizations of the covariance matrices $\Sigma_k$ restrict the possible cluster shapes:
- full $\Sigma$: all sorts of ellipsoid clusters,
- diagonal $\Sigma$: ellipsoid clusters with axis-parallel axes,
- unit $\Sigma$: spherical clusters.

One also distinguishes
- cluster-specific $\Sigma_k$: each cluster can have its own shape,
- shared $\Sigma_k = \Sigma$: all clusters have the same shape.

k-means: Hard EM with Spherical Clusters

1. Expectation step (with unit covariance $\Sigma_k = I$ and equal mixture weights):
$$\tilde{r}_{i,k}^{(t-1)} = \frac{1}{\sqrt{(2\pi)^m |\Sigma_k^{(t-1)}|}} \, e^{-\frac{1}{2}(x_i - \mu_k^{(t-1)})^T (\Sigma_k^{(t-1)})^{-1}(x_i - \mu_k^{(t-1)})} = \frac{1}{\sqrt{(2\pi)^m}} \, e^{-\frac{1}{2}(x_i - \mu_k^{(t-1)})^T (x_i - \mu_k^{(t-1)})} \tag{0a}$$
Hard assignment (0b'):
$$\operatorname*{arg\,max}_{k' \in \{1,\dots,K\}} r_{i,k'}^{(t-1)} = \operatorname*{arg\,max}_{k' \in \{1,\dots,K\}} \frac{1}{\sqrt{(2\pi)^m}} \, e^{-\frac{1}{2}(x_i - \mu_{k'}^{(t-1)})^T (x_i - \mu_{k'}^{(t-1)})} = \operatorname*{arg\,min}_{k' \in \{1,\dots,K\}} \lVert x_i - \mu_{k'}^{(t-1)} \rVert^2$$
i.e., hard EM with spherical clusters reassigns each point to its closest cluster center, exactly as k-means does.

3. Hierarchical Cluster Analysis

Hierarchies

Let $X$ be a set. A tree $(H, E)$, $E \subseteq H \times H$ (edges pointing towards the root), whose leaf nodes $h$ correspond bijectively to the elements $x_h \in X$, together with a surjective map $L : H \to \{0, \dots, d\}$, $d \in \mathbb{N}$, with
- $L(\mathrm{root}) = 0$,
- $L(h) = d$ for all leaves $h \in H$, and
- $L(h) \leq L(g)$ for all $(g, h) \in E$,
called the level map, is called a hierarchy over $X$.

$d$ is called the depth of the hierarchy. $\operatorname{Hier}(X)$ denotes the set of all hierarchies over $X$.

Hierarchies / Example

[Figure: a dendrogram over the six points $x_1, x_3, x_4, x_2, x_5, x_6$, built up step by step; the levels $L = 0, 1, 2, 3, 4, 5$ of the hierarchy are marked from the root down to the leaves.]

Hierarchies: Nodes Correspond to Subsets

Let $(H, E)$ be such a hierarchy. The nodes of a hierarchy correspond to subsets of $X$:
- leaf nodes $h$ correspond to singleton subsets by definition: $\operatorname{subset}(h) := \{x_h\}$, with $x_h \in X$ the element corresponding to leaf $h$,
- interior nodes $h$ correspond to the union of the subsets of their children:
$$\operatorname{subset}(h) := \bigcup_{g \in H:\ (g,h) \in E} \operatorname{subset}(g)$$
- thus the root node $h$ corresponds to the full set $X$: $\operatorname{subset}(h) = X$.

Hierarchies: Nodes Correspond to Subsets

[Example: the nodes of the hierarchy from the previous example correspond to the subsets
$\{x_1\}, \{x_3\}, \{x_4\}, \{x_2\}, \{x_5\}, \{x_6\}$ (leaves),
$\{x_1, x_3\}$, $\{x_2, x_5\}$, $\{x_1, x_3, x_4\}$, $\{x_2, x_5, x_6\}$ (interior nodes), and
$\{x_1, x_3, x_4, x_2, x_5, x_6\}$ (root).]

Hierarchies: Levels Correspond to Partitions

Let $(H, E)$ be such a hierarchy. The levels $l \in \{0, \dots, d\}$ correspond to partitions:
$$\mathcal{P}_l(H, L) := \{ h \in H \mid L(h) \leq l,\ \nexists\, g \in H : L(g) \leq l,\ \operatorname{subset}(g) \subsetneq \operatorname{subset}(h) \}$$
i.e., the deepest nodes of level at most $l$, identified with their subsets.

Hierarchies: Levels Correspond to Partitions

[Example: for the hierarchy from the previous example, the levels correspond to the partitions
$\{\{x_1, x_3, x_4, x_2, x_5, x_6\}\}$,
$\{\{x_1, x_3, x_4\}, \{x_2, x_5, x_6\}\}$,
$\{\{x_1, x_3\}, \{x_4\}, \{x_2, x_5, x_6\}\}$,
$\{\{x_1, x_3\}, \{x_4\}, \{x_2, x_5\}, \{x_6\}\}$,
$\{\{x_1, x_3\}, \{x_4\}, \{x_2\}, \{x_5\}, \{x_6\}\}$,
$\{\{x_1\}, \{x_3\}, \{x_4\}, \{x_2\}, \{x_5\}, \{x_6\}\}$.]

The Hierarchical Cluster Analysis Problem

Given
- a set $\mathcal{X}$ called data space, e.g., $\mathcal{X} := \mathbb{R}^m$,
- a set $X \subseteq \mathcal{X}$ called data, and
- a distortion measure $D : \mathcal{P}(\mathcal{X}) \times \operatorname{Hier}(X) \to \mathbb{R}_0^+$, where $D(H)$ measures how bad a hierarchy $H \in \operatorname{Hier}(X)$ is for a data set $X \subseteq \mathcal{X}$,

find a hierarchy $H \in \operatorname{Hier}(X)$ with minimal distortion $D(H)$.

Distortions for Hierarchies

Example of a distortion for hierarchies:
$$D(H) := \sum_{K=1}^{n} D(\mathcal{P}_K(H))$$
where $\mathcal{P}_K(H)$ denotes the partition at level $K - 1$ (with $K$ classes) and $D$ denotes a distortion for partitions.

Agglomerative and Divisive Hierarchical Clustering

Hierarchies are usually learned by greedy search, level by level.

Agglomerative clustering:
1. start with the singleton partition $\mathcal{P}_n$:
$$\mathcal{P}_n := \{X_k \mid k = 1, \dots, n\}, \quad X_k := \{x_k\}, \quad k = 1, \dots, n$$
2. in each step $K = n, \dots, 2$, build $\mathcal{P}_{K-1}$ by joining the two clusters $k, l \in \{1, \dots, K\}$ that lead to the minimal distortion
$$D(\{X_1, \dots, \widehat{X_k}, \dots, \widehat{X_l}, \dots, X_K, X_k \cup X_l\})$$

Note: $\widehat{X_k}$ denotes that the class $X_k$ is omitted from the partition.

Agglomerative and Divisive Hierarchical Clustering

Divisive clustering:
1. start with the all partition $\mathcal{P}_1$:
$$\mathcal{P}_1 := \{X\}$$
2. in each step $K = 1, \dots, n-1$, build $\mathcal{P}_{K+1}$ by splitting one cluster $X_k$ into two clusters $X_k', X_l'$ that lead to the minimal distortion
$$D(\{X_1, \dots, \widehat{X_k}, \dots, X_K, X_k', X_l'\}), \quad X_k = X_k' \cup X_l'$$

Note: $\widehat{X_k}$ denotes that the class $X_k$ is omitted from the partition.
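
In practice, agglomerative clustering is readily available in standard libraries. The following sketch (Python, using scipy.cluster.hierarchy; the toy data are chosen here for illustration) builds a single-link hierarchy and reads off the partition with K = 2 clusters:

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

X = np.array([[1.0], [2.0], [4.0], [7.0], [9.0], [12.0]])
Z = linkage(X, method='single')                   # agglomerative, single link
labels = fcluster(Z, t=2, criterion='maxclust')   # cut the hierarchy into K = 2 clusters
print(labels)
```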

Class-wise Defined Partition Distortions

If the partition distortion can be written as a sum of distortions of its classes,
$$D(\{X_1, \dots, X_K\}) = \sum_{k=1}^{K} D(X_k),$$
then the optimal pair to join depends only on $X_k$ and $X_l$:
$$D(\{X_1, \dots, \widehat{X_k}, \dots, \widehat{X_l}, \dots, X_K, X_k \cup X_l\}) - D(\{X_1, \dots, X_K\}) = D(X_k \cup X_l) - \big( D(X_k) + D(X_l) \big)$$

Closest Cluster Pair Partition Distortions

For a cluster distance $d : \mathcal{P}(\mathcal{X}) \times \mathcal{P}(\mathcal{X}) \to \mathbb{R}_0^+$ with
$$d(A \cup B, C) \geq \min\{ d(A, C), d(B, C) \}, \quad A, B, C \subseteq \mathcal{X},$$
a partition can be judged by the closest cluster pair it contains:
$$D(\{X_1, \dots, X_K\}) = \min_{k, l = 1, \dots, K,\ k \neq l} d(X_k, X_l)$$
Such a distortion has to be maximized. To increase it, the closest cluster pair has to be joined.

Single Link Clustering
$$d_{\mathrm{sl}}(A, B) := \min_{x \in A,\, y \in B} d(x, y), \quad A, B \subseteq \mathcal{X}$$

Complete Link Clustering
$$d_{\mathrm{cl}}(A, B) := \max_{x \in A,\, y \in B} d(x, y), \quad A, B \subseteq \mathcal{X}$$

Average Link Clustering
$$d_{\mathrm{al}}(A, B) := \frac{1}{|A| \cdot |B|} \sum_{x \in A,\, y \in B} d(x, y), \quad A, B \subseteq \mathcal{X}$$

Recursion Formulas for Cluster Distances

$$d_{\mathrm{sl}}(X_i \cup X_j, X_k) := \min_{x \in X_i \cup X_j,\, y \in X_k} d(x, y) = \min\Big\{ \min_{x \in X_i,\, y \in X_k} d(x, y),\ \min_{x \in X_j,\, y \in X_k} d(x, y) \Big\} = \min\{ d_{\mathrm{sl}}(X_i, X_k),\ d_{\mathrm{sl}}(X_j, X_k) \}$$

$$d_{\mathrm{cl}}(X_i \cup X_j, X_k) := \max_{x \in X_i \cup X_j,\, y \in X_k} d(x, y) = \max\Big\{ \max_{x \in X_i,\, y \in X_k} d(x, y),\ \max_{x \in X_j,\, y \in X_k} d(x, y) \Big\} = \max\{ d_{\mathrm{cl}}(X_i, X_k),\ d_{\mathrm{cl}}(X_j, X_k) \}$$

$$d_{\mathrm{al}}(X_i \cup X_j, X_k) := \frac{1}{|X_i \cup X_j| \cdot |X_k|} \sum_{x \in X_i \cup X_j,\, y \in X_k} d(x, y) = \frac{|X_i|}{|X_i| + |X_j|} \, d_{\mathrm{al}}(X_i, X_k) + \frac{|X_j|}{|X_i| + |X_j|} \, d_{\mathrm{al}}(X_j, X_k)$$

Thanks to these recursion formulas, agglomerative hierarchical clustering requires computing the distance matrix $D \in \mathbb{R}^{n \times n}$ only once:
$$D_{i,j} := d(x_i, x_j), \quad i, j = 1, \dots, n$$
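
To make the use of the recursion concrete, here is a small sketch (plain Python/NumPy, illustrative names, single link only) of agglomerative clustering that updates cluster distances via $d_{\mathrm{sl}}(X_i \cup X_j, X_k) = \min\{d_{\mathrm{sl}}(X_i, X_k), d_{\mathrm{sl}}(X_j, X_k)\}$ instead of recomputing them from the raw points:

```python
import numpy as np

def single_link_agglomerative(D):
    """Agglomerative single-link clustering from a pairwise distance matrix D.
    Returns the sequence of merges ((members_a, members_b), distance)."""
    clusters = {i: [i] for i in range(len(D))}                     # active clusters, keyed by an id
    dist = {frozenset((i, j)): D[i, j] for i in clusters for j in clusters if i < j}
    merges = []
    while len(clusters) > 1:
        pair, d_ab = min(dist.items(), key=lambda item: item[1])   # closest cluster pair
        a, b = tuple(pair)
        merges.append(((clusters[a], clusters[b]), d_ab))
        clusters[a] = clusters[a] + clusters[b]                    # join the pair into cluster a
        del clusters[b]
        # recursion formula: d_sl(A u B, C) = min(d_sl(A, C), d_sl(B, C))
        for k in clusters:
            if k != a:
                dist[frozenset((a, k))] = min(dist[frozenset((a, k))], dist[frozenset((b, k))])
        dist = {p: d for p, d in dist.items() if b not in p}       # drop distances involving b
    return merges

x = np.array([1.0, 2.0, 4.0, 7.0, 9.0, 12.0])
D = np.abs(x[:, None] - x[None, :])
for (A, B), d in single_link_agglomerative(D):
    print(A, "+", B, "at distance", d)
```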

Conclusion (1/2)

- Cluster analysis aims at detecting latent groups in data, without labeled examples (cf. record linkage).
- Latent groups can be described at three different granularities:
  - partitions segment the data into K subsets (hard clustering),
  - hierarchies structure the data into a hierarchy, i.e., a sequence of consistent partitions (hierarchical clustering),
  - soft clusterings / row-stochastic matrices build overlapping groups to which data points can belong with some membership degree (soft clustering).
- k-means finds a K-partition by finding K cluster centers with the smallest Euclidean distance to all their cluster points.
- k-medoids generalizes k-means to general distances; it finds a K-partition by selecting K data points as cluster representatives with the smallest distance to all their cluster points.

Conclusion (2/2)

- The hierarchical single link, complete link and average link methods find a hierarchy by greedy search over consistent partitions, starting from the singleton partition (agglomerative); they are efficient due to recursion formulas and require only a distance matrix.
- Gaussian Mixture Models find soft clusterings by modeling the data of each class with a class-specific multivariate Gaussian distribution p(X | Z) and estimating expected class memberships (expected likelihood).
- The Expectation Maximization algorithm (EM) can be used to learn Gaussian Mixture Models via block coordinate descent.
- k-means is a special case of a Gaussian Mixture Model with hard/binary cluster memberships (hard EM) and spherical cluster shapes.

Readings

- k-means: [HTFF05], ch. 14.3.6, 13.2.3, 8.5; [Bis06], ch. 9.1; [Mur12], ch. 11.4.2
- Hierarchical cluster analysis: [HTFF05], ch. 14.3.12; [Mur12], ch. 25.5; [PTVF07], ch. 16.4
- Gaussian mixtures: [HTFF05], ch. 14.3.7; [Bis06], ch. 9.2; [Mur12], ch. 11.2.3; [PTVF07], ch. 16.1

References

- Christopher M. Bishop. Pattern Recognition and Machine Learning. Springer, 2006.
- Trevor Hastie, Robert Tibshirani, Jerome Friedman, and James Franklin. The Elements of Statistical Learning: Data Mining, Inference and Prediction, volume 27. 2005.
- Kevin P. Murphy. Machine Learning: A Probabilistic Perspective. The MIT Press, 2012.
- William H. Press, Saul A. Teukolsky, William T. Vetterling, and Brian P. Flannery. Numerical Recipes. Cambridge University Press, 3rd edition, 2007.