Clustering and The Expectation-Maximization Algorithm


1 Clustering and The Expectation-Maximization Algorithm
Unsupervised Learning
Marek Petrik, 3/7
Some of the figures in this presentation are taken from An Introduction to Statistical Learning, with applications in R (Springer, 2013) with permission from the authors: G. James, D. Witten, T. Hastie and R. Tibshirani

2 Learning Methods
1. Supervised learning: learning a function $f$: $Y = f(x) + \epsilon$
1.1 Regression
1.2 Classification
2. Unsupervised learning: discover interesting properties of data (no labels) $X_1, X_2, \ldots$
2.1 Dimensionality reduction or embedding
2.2 Clustering

3 Principal Components Analysis
Reduce dimensionality: start with features $X_1, \ldots, X_p$ and construct fewer features $Z_1, \ldots, Z_M$, where
$$Z_1 = \phi_{11} X_1 + \phi_{21} X_2 + \cdots + \phi_{p1} X_p$$
Weights are usually normalized (using the $\ell_2$ norm):
$$\sum_{j=1}^{p} \phi_{j1}^2 = 1$$
The data has the greatest variance along $Z_1$.
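The first loading vector can be sketched in plain Python as follows. This is a minimal illustration that assumes power iteration on the sample covariance matrix, a method the slides do not name; the function name and the start vector are hypothetical choices.

```python
import math

def first_principal_component(X, iters=500):
    """Return (phi, scores): a unit-norm loading vector and the Z_1 scores.

    Assumes the data has nonzero variance and that the start vector is not
    orthogonal to the leading eigenvector (an assumption of this sketch).
    """
    n, p = len(X), len(X[0])
    means = [sum(row[j] for row in X) / n for j in range(p)]
    Xc = [[row[j] - means[j] for j in range(p)] for row in X]  # centered data
    # Sample covariance matrix S (p x p).
    S = [[sum(r[a] * r[b] for r in Xc) / (n - 1) for b in range(p)]
         for a in range(p)]
    # Arbitrary start vector (1, 2, ..., p); it is normalized in the loop.
    phi = [float(j + 1) for j in range(p)]
    for _ in range(iters):
        v = [sum(S[a][b] * phi[b] for b in range(p)) for a in range(p)]
        norm = math.sqrt(sum(c * c for c in v))
        phi = [c / norm for c in v]  # enforce sum_j phi_j1^2 = 1
    scores = [sum(phi[j] * r[j] for j in range(p)) for r in Xc]
    return phi, scores
```

For points lying exactly on a line with direction (2, 1), the returned loading vector is that direction scaled to unit length.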

4 1st Principal Component
[Figure: Ad Spending vs. Population]
1st principal component: the direction with the largest variance, $Z_1 = \phi_{11} (\text{pop} - \overline{\text{pop}}) + \phi_{21} (\text{ad} - \overline{\text{ad}})$

5 More Unsupervised Learning: Discovering Structure of Data
1. K-Means Clustering
2. Hierarchical Clustering
3. Expectation-Maximization Method (not covered in ISL, see ESL 8.5)

6 Clustering
Simplify data in a different way than PCA.
PCA finds a low-dimensional representation of the data.
Clustering finds homogeneous subgroups among the observations.

8 Clustering: Assumptions and Goals
There exists a method for measuring similarity between data points.
Some points are more similar than others.
We want to identify similarity patterns:
1. Discover the different types of a disease
2. Market segmentation: types of users that visit a website
3. Discover movie or book genres
4. Discover types of topics in documents
Discover latent patterns that exist but may not be observed or observable.

9 Clustering Algorithms
K-Means: simple and effective.
Hierarchical clustering: many complex clusters.
Many other clustering methods, mostly heuristics.
EM: a general algorithm for dealing with latent variables by maximizing likelihood.

10 K-Means Clustering
Cluster data into complete and non-overlapping sets.
[Figure: example clusterings with K=2, K=3, and K=4]

14 K-Means Objective
$C_k$ is the $k$-th cluster; $i \in C_k$ means that the $i$-th observation is in cluster $k$.
Find clusters that are homogeneous, where $W(C_k)$ measures the homogeneity of cluster $k$ (smaller means more homogeneous):
$$\min_{C_1, \ldots, C_K} \sum_{k=1}^{K} W(C_k)$$
Define homogeneity as the in-cluster variance:
$$\min_{C_1, \ldots, C_K} \sum_{k=1}^{K} \frac{1}{|C_k|} \sum_{i, i' \in C_k} \sum_{j=1}^{p} (x_{ij} - x_{i'j})^2$$
This is an NP-hard problem.
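To make the objective concrete, here is a small sketch that evaluates the in-cluster variance for a given assignment of observations to clusters. The function names and the toy data in the usage note are illustrative assumptions, not part of the slides.

```python
def within_cluster_variation(points):
    """W(C_k): sum of squared distances over all ordered pairs, divided by |C_k|."""
    total = sum(
        sum((a - b) ** 2 for a, b in zip(p, q))
        for p in points
        for q in points
    )
    return total / len(points)

def kmeans_objective(X, labels, K):
    """Sum of W(C_k) over the non-empty clusters k = 0, ..., K-1."""
    objective = 0.0
    for k in range(K):
        members = [x for x, l in zip(X, labels) if l == k]
        if members:  # an empty cluster contributes nothing
            objective += within_cluster_variation(members)
    return objective
```

On the four points (0, 0), (0, 1), (10, 10), (10, 11), grouping the two nearby pairs gives an objective of 2.0, while pairing each point with a distant one gives 400.0.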

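15 K-Means Algorithm
Heuristic solution to the minimization problem:
1. Randomly assign cluster numbers to observations
2. Iterate while clusters change
2.1 For each cluster, compute the centroid
2.2 Assign each observation to the cluster with the closest centroid
Note that
$$\frac{1}{|C_k|} \sum_{i, i' \in C_k} \sum_{j=1}^{p} (x_{ij} - x_{i'j})^2 = 2 \sum_{i \in C_k} \sum_{j=1}^{p} (x_{ij} - \bar{x}_{kj})^2$$
where $\bar{x}_{kj}$ is the mean of feature $j$ over the observations in cluster $k$.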
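The two steps of the algorithm can be sketched in plain Python as below. The `init_labels` parameter and the handling of empty clusters (re-seeding at a random observation) are assumptions added for this sketch; the slides do not say what to do when a cluster becomes empty.

```python
import random

def kmeans(X, K, init_labels=None, seed=0, max_iter=100):
    """Return (labels, centroids) after iterating until the labels stop changing."""
    rng = random.Random(seed)
    p = len(X[0])
    # Step 1: random cluster numbers, unless the caller supplies a start.
    labels = (list(init_labels) if init_labels is not None
              else [rng.randrange(K) for _ in X])
    centroids = []
    for _ in range(max_iter):
        # Step 2.1: centroid of each cluster.
        centroids = []
        for k in range(K):
            members = [x for x, l in zip(X, labels) if l == k]
            if members:
                centroids.append(
                    [sum(m[j] for m in members) / len(members) for j in range(p)])
            else:
                # Assumption: re-seed an empty cluster at a random observation.
                centroids.append(list(rng.choice(X)))
        # Step 2.2: assign each observation to the closest centroid.
        new_labels = [
            min(range(K),
                key=lambda k: sum((xj - cj) ** 2 for xj, cj in zip(x, centroids[k])))
            for x in X
        ]
        if new_labels == labels:
            break
        labels = new_labels
    return labels, centroids
```

Starting from the deliberately poor assignment `[0, 1, 0, 1]` on the points (0, 0), (0, 1), (10, 10), (10, 11), one reassignment step separates the two nearby pairs and the next iteration confirms that nothing changes.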

16 K-Means Illustration
[Figure panels: Data; Step 1; Iteration 1, Step 2a; Iteration 1, Step 2b; Iteration 2, Step 2a; Final Results]

17 Properties of K-Means
Local minimum: does not necessarily find the optimal solution.
Multiple runs can result in different solutions.
Choose the result of the run with the minimal objective.
Cluster labels do not matter.
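The advice to run K-Means several times and keep the run with the smallest objective can be sketched as follows. The function names, the number of runs, and the empty-cluster rule are assumptions of this sketch; the objective is computed in its centroid form, 2 times the sum of squared distances to the cluster means.

```python
import random

def sq_dist(a, b):
    return sum((u - v) ** 2 for u, v in zip(a, b))

def centroids_of(X, labels, K, rng):
    cents = []
    for k in range(K):
        members = [x for x, l in zip(X, labels) if l == k]
        # Assumption: an empty cluster is re-seeded at a random observation.
        cents.append([sum(c) / len(members) for c in zip(*members)]
                     if members else list(rng.choice(X)))
    return cents

def kmeans_once(X, K, rng, max_iter=100):
    """One K-Means run from a random assignment; returns (labels, objective)."""
    labels = [rng.randrange(K) for _ in X]
    for _ in range(max_iter):
        cents = centroids_of(X, labels, K, rng)
        new_labels = [min(range(K), key=lambda k: sq_dist(x, cents[k])) for x in X]
        if new_labels == labels:
            break
        labels = new_labels
    cents = centroids_of(X, labels, K, rng)  # means of the final clusters
    objective = 2 * sum(sq_dist(x, cents[l]) for x, l in zip(X, labels))
    return labels, objective

def best_of_runs(X, K, runs=10, seed=0):
    """Run K-Means several times; return the best (labels, objective) and all objectives."""
    rng = random.Random(seed)
    results = [kmeans_once(X, K, rng) for _ in range(runs)]
    return min(results, key=lambda r: r[1]), [r[1] for r in results]
```

Because cluster labels do not matter, two runs that find the same partition may number the clusters differently; comparing objectives rather than label vectors avoids that problem.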

18 Multiple Runs of K-Means

19 Hierarchical Clustering
Multiple levels of similarity are needed in complex domains.
Build a similarity tree.
[Figure: observations plotted against $X_1$ and $X_2$]

20 Dendrogram: Similarity Tree

21 Hierarchical Clustering Algorithm
1. Begin with $n$ observations and compute all $\binom{n}{2}$ pairwise dissimilarity measures
2. For $i = n, n-1, \ldots, 2$:
2.1 Fuse the 2 most similar clusters
2.2 Update the dissimilarities among the $i - 1$ remaining clusters
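A minimal sketch of this agglomerative procedure in plain Python follows. It assumes Euclidean distance between observations and, for brevity, recomputes the cluster dissimilarities in every round instead of updating them incrementally; the `linkage` parameter (Python's built-in `max` for complete linkage, `min` for single linkage) is an illustrative choice.

```python
import math

def agglomerative(X, linkage=max):
    """Return the list of merges as (cluster_a, cluster_b, dissimilarity).

    Clusters are lists of observation indices. linkage=max gives complete
    linkage and linkage=min gives single linkage.
    """
    clusters = [[i] for i in range(len(X))]  # start: every observation alone
    merges = []
    while len(clusters) > 1:
        best = None
        # Find the pair of clusters with the smallest dissimilarity.
        for a in range(len(clusters)):
            for b in range(a + 1, len(clusters)):
                d = linkage(math.dist(X[i], X[j])
                            for i in clusters[a] for j in clusters[b])
                if best is None or d < best[0]:
                    best = (d, a, b)
        d, a, b = best
        merges.append((sorted(clusters[a]), sorted(clusters[b]), d))
        fused = clusters[a] + clusters[b]
        clusters = [c for idx, c in enumerate(clusters) if idx not in (a, b)]
        clusters.append(fused)
    return merges
```

The merge list contains the information a dendrogram displays: which clusters were fused and at what dissimilarity.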

22 Hierarchical Clustering Algorithm: Illustration
[Figure: steps of the algorithm on a scatter plot with axes $X_1$ and $X_2$]

23 Dissimilarity Measure: Linkage
1. Complete
2. Single
3. Average
4. Centroid
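The four linkage types differ only in how the pairwise distances between two clusters are summarized. A short sketch, assuming Euclidean distance (the function names are illustrative):

```python
import math

def pairwise(A, B):
    """All distances between a point of cluster A and a point of cluster B."""
    return [math.dist(a, b) for a in A for b in B]

def complete_linkage(A, B):
    return max(pairwise(A, B))   # largest pairwise distance

def single_linkage(A, B):
    return min(pairwise(A, B))   # smallest pairwise distance

def average_linkage(A, B):
    d = pairwise(A, B)
    return sum(d) / len(d)       # mean pairwise distance

def centroid_linkage(A, B):
    ca = [sum(c) / len(A) for c in zip(*A)]
    cb = [sum(c) / len(B) for c in zip(*B)]
    return math.dist(ca, cb)     # distance between the two centroids
```

Average and centroid linkage generally differ: for clusters {(0, 0), (0, 2)} and {(3, 1), (5, 1)} the centroids are 4.0 apart, while the mean pairwise distance is slightly larger.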

24 Impact of Dissimilarity Measure
[Figure panels: Average Linkage; Complete Linkage; Single Linkage]

26 Clustering in Practice
Fraught with problems: there is no clear measure of quality (like MSE).
How to choose k? Problem dependent.
Standardize features? Center them?
What dissimilarity to use?
Be careful about over-explaining clustering results: source:

27 Expectation-Maximization
Maximum likelihood approach to clustering.
General method for dealing with latent features / labels.
Especially useful with generative models.
A heuristic method used to solve complex optimization problems.
Generalization of the idea: Minorization-Maximization.
Gentle introduction: piyush/teaching/em algorithm.pdf

28 Recall LDA

32 LDA: Linear Discriminant Analysis
Generative model: capture the probability of the predictors for each label.
Predict:
1. $\Pr[\text{balance} \mid \text{default} = \text{yes}]$ and $\Pr[\text{default} = \text{yes}]$
2. $\Pr[\text{balance} \mid \text{default} = \text{no}]$ and $\Pr[\text{default} = \text{no}]$
Classes are normal: $\Pr[\text{balance} \mid \text{default} = \text{yes}]$

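33 LDA vs Logistic Regression
Logistic regression: $\Pr[\text{default} = \text{yes} \mid \text{balance}]$
Linear discriminant analysis:
$\Pr[\text{balance} \mid \text{default} = \text{yes}]$ and $\Pr[\text{default} = \text{yes}]$
$\Pr[\text{balance} \mid \text{default} = \text{no}]$ and $\Pr[\text{default} = \text{no}]$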

34 LDA with 1 Feature
Classes are normal and class probabilities $\pi_k$ are scalars:
$$f_k(x) = \frac{1}{\sigma \sqrt{2\pi}} \exp\left( -\frac{1}{2\sigma^2} (x - \mu_k)^2 \right)$$
Key assumption: class variances $\sigma_k^2$ are the same.
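As a small illustration of how the class densities and class probabilities combine, the sketch below evaluates $f_k(x)$ and applies Bayes' rule to obtain the posterior probability of each class. The function names and the numbers in the usage note are assumptions chosen for the example.

```python
import math

def gaussian_density(x, mu, sigma):
    """f_k(x) for a normal class with mean mu and standard deviation sigma."""
    return math.exp(-(x - mu) ** 2 / (2 * sigma ** 2)) / (sigma * math.sqrt(2 * math.pi))

def lda_posteriors(x, mus, priors, sigma):
    """Pr[class k | x] via Bayes' rule, with one shared sigma for all classes."""
    weighted = [pi * gaussian_density(x, mu, sigma) for mu, pi in zip(mus, priors)]
    total = sum(weighted)
    return [w / total for w in weighted]
```

With class means 0 and 2, equal class probabilities, and a shared standard deviation of 1, a point at x = 1 is equally likely to belong to either class, and a point at x = 0 is more likely to belong to the first.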

37 EM For LDA
Labels are missing, guess them.
Find the most likely model and latent observations:
$$\max_{\text{model}} \log \ell(\text{model}) = \max_{\text{model}} \log \sum_{\text{latent}} \Pr[\text{data}, \text{latent} \mid \text{model}] = \max_{\text{model}} \log \sum_{\text{latent}} \Pr[\text{data} \mid \text{latent}, \text{model}] \Pr[\text{latent} \mid \text{model}]$$
Difficult and non-convex optimization problem ($\log \sum$).

38 EM Derivation
Iteratively approximate and optimize the log-likelihood function:
1. Construct a concave lower bound
2. Maximize the lower bound
3. Repeat
Notation: model $\theta$, data $x$, latent variables $z$.
$$\max_{\theta} \log \ell(\theta) = \max_{\theta} \log \Pr[x \mid \theta] = \max_{\theta} \log \sum_{z} \Pr[x \mid z, \theta] \Pr[z \mid \theta]$$

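39 EM Derivation
Suppose we have an estimate of the model $\theta_n$. How do we compute a $\theta_{n+1}$ that improves on it?
Multiply and divide by $\Pr[z \mid x, \theta_n]$ inside the sum, then apply Jensen's inequality to the concave $\log$:
$$\log \sum_{z} \Pr[x, z \mid \theta] = \log \sum_{z} \Pr[z \mid x, \theta_n] \frac{\Pr[x \mid z, \theta] \Pr[z \mid \theta]}{\Pr[z \mid x, \theta_n]} \ge \sum_{z} \Pr[z \mid x, \theta_n] \log \frac{\Pr[x \mid z, \theta] \Pr[z \mid \theta]}{\Pr[z \mid x, \theta_n]}$$
Maximize the lower bound. The denominator inside the $\log$ does not depend on $\theta$, so
$$\theta_{n+1} = \arg\max_{\theta} \sum_{z} \Pr[z \mid x, \theta_n] \log \frac{\Pr[x \mid z, \theta] \Pr[z \mid \theta]}{\Pr[z \mid x, \theta_n]} = \arg\max_{\theta} \sum_{z} \Pr[z \mid x, \theta_n] \log \Pr[x, z \mid \theta]$$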

40 EM Algorithm
1. E-Step: Estimate $\Pr[z \mid x, \theta_n]$ for all values of $z$ (construct the lower bound)
2. M-Step: Maximize the lower bound:
$$\theta_{n+1} = \arg\max_{\theta} \sum_{z} \Pr[z \mid x, \theta_n] \log \Pr[x, z \mid \theta]$$
This can be solved using traditional MLE methods with weighted samples.

41 EM for Mixture of Gaussians
Rough sketch:
1. Randomly assign cluster weights to observations
2. Iterate while clusters change
2.1 For each cluster, compute the centroid based on the cluster weights of the observations
2.2 Assign each observation new cluster weights based on the distances from the centroids
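The rough sketch above can be written out for one-dimensional data as follows. This is an illustration under stated assumptions rather than a definitive implementation: it uses one variance shared by all clusters (mirroring the LDA assumption), runs a fixed number of iterations instead of testing for convergence, and the `init_resp` parameter and the small numerical guards are additions made for this sketch.

```python
import math
import random

def em_gmm_1d(x, K, init_resp=None, iters=50, seed=0):
    """EM for a 1-D Gaussian mixture with a shared variance.

    Returns (pis, mus, var, resp), where resp[i][k] is the weight of
    observation i for cluster k.
    """
    rng = random.Random(seed)
    n = len(x)
    # Step 1: random cluster weights, unless the caller supplies them.
    if init_resp is None:
        resp = []
        for _ in range(n):
            w = [rng.random() + 1e-9 for _ in range(K)]
            s = sum(w)
            resp.append([v / s for v in w])
    else:
        resp = [list(r) for r in init_resp]
    for _ in range(iters):
        # M-step (2.1): weighted MLE of class probabilities, means, variance.
        nk = [max(sum(r[k] for r in resp), 1e-12) for k in range(K)]
        pis = [nk[k] / n for k in range(K)]
        mus = [sum(r[k] * xi for r, xi in zip(resp, x)) / nk[k] for k in range(K)]
        var = sum(r[k] * (xi - mus[k]) ** 2
                  for r, xi in zip(resp, x) for k in range(K)) / n
        var = max(var, 1e-12)  # guard against a collapsed variance
        # E-step (2.2): new weights Pr[z = k | x_i, theta_n]. The factor
        # 1 / (sigma * sqrt(2 pi)) cancels because the variance is shared.
        resp = []
        for xi in x:
            w = [pis[k] * math.exp(-(xi - mus[k]) ** 2 / (2 * var)) for k in range(K)]
            s = sum(w)
            resp.append([v / s for v in w] if s > 0 else [1.0 / K] * K)
    return pis, mus, var, resp
```

Unlike K-Means, each observation keeps a weight for every cluster; the weights approach 0 or 1 only when the clusters are well separated, as in two tight groups around 0 and 10.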

42 Other Applications of EM
Very powerful and general idea!
Training with missing data for many model types.
Hidden variables in Bayesian nets.
Identifying confounding variables.
Solving difficult (complex) optimization problems: MM.
