Introduction to Machine Learning CMU-10701

Introduction to Machine Learning CMU-10701: Clustering and EM. Barnabás Póczos & Aarti Singh

Contents: Clustering, K-means, Mixture of Gaussians, Expectation Maximization, Variational Methods

Clustering

What is clustering? Clustering is the process of grouping a set of objects into classes of similar objects, with high intra-class similarity and low inter-class similarity. It is the most common form of unsupervised learning.

What is similarity? Hard to define, but we know it when we see it. The real meaning of similarity is a philosophical question. We will take a more pragmatic approach and think in terms of a distance (rather than a similarity) between random variables.

The K-means Clustering Problem

K-means Clustering Problem. The K-means clustering problem: partition the n observations into K sets (K ≤ n), S = {S_1, S_2, ..., S_K}, such that the sets minimize the within-cluster sum of squares. (The accompanying figure shows an example with K = 3.)

K-means Clustering Problem. The K-means clustering problem: partition the n observations into K sets (K ≤ n), S = {S_1, S_2, ..., S_K}, such that the sets minimize the within-cluster sum of squares. How hard is this problem? The problem is NP-hard, but there are good heuristic algorithms that seem to work well in practice: the K-means algorithm and mixtures of Gaussians.
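The within-cluster sum of squares itself did not survive the transcription; a standard reconstruction, with μ_i denoting the mean of the points in S_i, is:

```latex
\arg\min_{S}\; \sum_{i=1}^{K} \sum_{x \in S_i} \lVert x - \mu_i \rVert^{2},
\qquad
\mu_i = \frac{1}{|S_i|} \sum_{x \in S_i} x .
```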

K-means Clustering Alg: Step 1. Given n objects, guess the cluster centers k_1, k_2, k_3. (They correspond to the cluster means μ_1, ..., μ_3 in the previous slide.)

K-means Clustering Alg: Step 2. Build a Voronoi diagram based on the cluster centers k_1, k_2, k_3. Decide the class memberships of the n objects by assigning them to the nearest cluster center.

K-means Clustering Alg: Step 3. Re-estimate the cluster centers (aka the centroids or means), assuming the memberships found above are correct.

K-means Clustering Alg: Step 4. Build a new Voronoi diagram. Decide the class memberships of the n objects based on this diagram.

K-means Clustering Alg: Step 5. Re-estimate the cluster centers.

K-means Clustering Alg: Step 6. Stop when everything is settled. (The Voronoi diagrams don't change anymore.)

K-means Clustering Algorithm. Input: the data plus the desired number of clusters, K. Initialize: the K cluster centers (randomly if necessary). Iterate: 1. Decide the class memberships of the n objects by assigning them to the nearest cluster centers. 2. Re-estimate the K cluster centers (aka the centroids or means), assuming the memberships found above are correct. Termination: if none of the n objects changed membership in the last iteration, exit; otherwise go to 1.
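As a concrete illustration of the algorithm above, here is a minimal NumPy sketch; the function name kmeans, the random initialization from the data, and the membership-based stopping test are one reasonable reading of the slides, not their exact code.

```python
import numpy as np

def kmeans(X, K, max_iter=100, seed=0):
    """Minimal K-means sketch: X is an (n, d) data array, K the desired number of clusters."""
    rng = np.random.default_rng(seed)
    # Initialize the K cluster centers by picking K distinct data points at random.
    centers = X[rng.choice(len(X), size=K, replace=False)].astype(float)
    labels = np.full(len(X), -1)
    for _ in range(max_iter):
        # Step 1: assign each object to the nearest cluster center.
        dists = np.linalg.norm(X[:, None, :] - centers[None, :, :], axis=2)
        new_labels = dists.argmin(axis=1)
        if np.array_equal(new_labels, labels):
            break  # no object changed membership in the last iteration: terminate
        labels = new_labels
        # Step 2: re-estimate each center as the mean (centroid) of its members.
        for k in range(K):
            if np.any(labels == k):
                centers[k] = X[labels == k].mean(axis=0)
    return centers, labels
```

For example, calling kmeans(X, K=3) on an (n, 2) data array X returns three centers and the per-point cluster labels.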

K-means Clustering Algorithm, computational complexity. At each iteration, computing the distance between each of the n objects and the K cluster centers is O(Kn), and computing the cluster centers is O(n), since each object gets added once to some cluster. Assuming these two steps are each done once in each of l iterations, the total cost is O(lKn). Can you prove that the K-means algorithm is guaranteed to terminate?

Seed Choice (illustrative figure slides)

Seed Choice. The results of the K-means algorithm can vary based on the random seed selection: some seeds can result in a poor convergence rate or in convergence to a sub-optimal clustering, and the K-means algorithm can easily get stuck in local minima. Remedies: select good seeds using a heuristic (e.g., pick the object least similar to any existing mean), try out multiple starting points (very important!), or initialize with the results of another method.
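A small sketch of the "try multiple starting points" remedy, reusing the hypothetical kmeans function from the earlier sketch and scoring each run by its within-cluster sum of squares:

```python
import numpy as np

def kmeans_restarts(X, K, n_restarts=10):
    """Run K-means from several random seeds and keep the lowest-cost solution."""
    best = None
    for seed in range(n_restarts):
        centers, labels = kmeans(X, K, seed=seed)       # kmeans from the sketch above
        cost = ((X - centers[labels]) ** 2).sum()       # within-cluster sum of squares
        if best is None or cost < best[0]:
            best = (cost, centers, labels)
    return best[1], best[2]
```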

Alternating Optimization

K-means Algorithm (more formally). Randomly initialize the k centers. Classify: at iteration t, assign each point j ∈ {1, ..., n} to the nearest center (the classification at iteration t). Recenter: μ_i becomes the centroid of the new sets (the cluster centers are re-estimated at iteration t).
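The classify and recenter updates were equations on the slide that did not come through; in standard notation they are:

```latex
\text{Classify:}\quad C^{(t)}(j) \;=\; \arg\min_{i}\, \lVert \mu_i^{(t)} - x_j \rVert^{2},
\qquad
\text{Recenter:}\quad \mu_i^{(t+1)} \;=\; \frac{1}{\lvert\{j : C^{(t)}(j)=i\}\rvert}\, \sum_{j:\,C^{(t)}(j)=i} x_j .
```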

What is K-means optimizing? Define the following potential function F of the centers and the point allocation C; it has two equivalent versions, and the optimal solution of the K-means problem is its minimizer.
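The potential function itself was lost in transcription; the standard form, in its two equivalent versions (over point allocations C and over sets S_i), together with the optimal K-means solution, is:

```latex
F(\mu, C) \;=\; \sum_{j=1}^{n} \lVert \mu_{C(j)} - x_j \rVert^{2}
\;=\; \sum_{i=1}^{K} \sum_{x \in S_i} \lVert \mu_i - x \rVert^{2},
\qquad
\min_{\mu,\, C}\; F(\mu, C).
```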

K-means Algorithm. Optimize the potential function by alternating two steps: (1) fix the centers and minimize F over the allocation, which is exactly the first step of the algorithm (assign each point to the nearest cluster center); (2) fix the allocation and minimize F over the centers, which is exactly the second step (re-center).

K-means Algorithm. Optimizing the potential function this way is coordinate descent on F: step (1) plays the role of an expectation step and step (2) of a maximization step. Today we will see a generalization of this approach, the EM algorithm.

Gaussian Mixture Model

Density Estimation, generative approach: there is a latent parameter Θ, and for all i we draw the observed x_i given Θ. What if the basic model doesn't fit all the data? ⇒ Mixture modelling / partitioning algorithms: different parameters for different parts of the domain.

Partitioning Algorithms. K-means uses hard assignment: each object belongs to only one cluster. Mixture modeling uses soft assignment: a probability that an object belongs to each cluster.

Gaussian Mixture Model. A mixture of K Gaussian distributions (a multi-modal distribution): there are K components, component i has an associated mean vector μ_i, and component i generates data from a Gaussian centred on μ_i. Each data point is generated by picking a component at random and then drawing the point from that component's Gaussian.

Gaussian Mixture Model. The mixture of K Gaussian distributions (a multi-modal distribution) is written out with its pieces labelled: the hidden variable (which mixture component generated the point), the observed data, the mixture components, and the mixture proportions.
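The labelled mixture density on this slide did not survive transcription; the standard form it refers to is below, written with general covariances Σ_i (the first part of the lecture uses the spherical case Σ_i = σ²I):

```latex
p(x \mid \theta) \;=\; \sum_{i=1}^{K}
\underbrace{P(y = i)}_{\text{mixture proportion}}\;
\underbrace{\mathcal{N}\!\left(x \mid \mu_i, \Sigma_i\right)}_{\text{mixture component}},
\qquad y:\ \text{hidden variable},\quad x:\ \text{observed data}.
```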

Mixture of Gaussians Clustering. Assume each component is a Gaussian with mean μ_i and a shared spherical covariance σ²I. For a given x we want to decide whether it belongs to cluster i or cluster j, so we cluster x based on the posteriors P(y = i | x).

Mixture of Gaussians Clustering. Under the same assumption, the posterior of each cluster given x can be written out explicitly; see the reconstruction below.
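The posterior used for clustering was an equation on the slide; for the shared spherical covariance σ²I assumed here it reads:

```latex
P(y = i \mid x) \;=\;
\frac{P(y = i)\,\mathcal{N}\!\left(x \mid \mu_i, \sigma^{2} I\right)}
     {\sum_{j=1}^{K} P(y = j)\,\mathcal{N}\!\left(x \mid \mu_j, \sigma^{2} I\right)} .
```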

Piecewise linear decision boundary: with equal covariances the quadratic terms cancel, so the boundary between any two clusters is linear.

MLE for GMM. What if we don't know the parameters? ⇒ Maximum likelihood estimation (MLE).

K-means and GMM. MLE: what happens if we assume hard assignment, P(y_j = i) = 1 if i = C(j) and 0 otherwise? In this case the MLE estimation becomes the same as K-means!

General GMM. Gaussian Mixture Model (a multi-modal distribution): there are K components, component i has an associated mean vector μ_i, and each component generates data from a Gaussian with mean μ_i and covariance matrix Σ_i. Each data point is generated according to the following recipe: 1) pick a component at random, choosing component i with probability P(y=i); 2) draw the datapoint x ~ N(μ_i, Σ_i).
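A small NumPy sketch of this two-step generative recipe; the specific mixture proportions, means, and covariances below are made up for the example, not taken from the slides.

```python
import numpy as np

rng = np.random.default_rng(0)

# Illustrative 2-component GMM in 2D (the numbers are hypothetical).
pis    = np.array([0.3, 0.7])                                   # mixture proportions P(y = i)
mus    = np.array([[0.0, 0.0], [4.0, 4.0]])                     # component means
sigmas = np.array([np.eye(2), [[2.0, 0.5], [0.5, 1.0]]])        # component covariances

def sample_gmm(n):
    """Draw n points with the recipe: 1) pick a component, 2) draw x from its Gaussian."""
    ys = rng.choice(len(pis), size=n, p=pis)                                  # step 1
    xs = np.array([rng.multivariate_normal(mus[y], sigmas[y]) for y in ys])   # step 2
    return xs, ys

X, y = sample_gmm(500)
```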

General GMM. The Gaussian mixture density (a multi-modal distribution) has the same form as before, with each mixture component now a full-covariance Gaussian weighted by its mixture proportion.

General GMM. Assume each component has its own covariance matrix Σ_i and cluster based on the posteriors: this gives a quadratic decision boundary, because the second-order terms don't cancel out.

General GMM, MLE Estimation. What if we don't know the parameters? ⇒ Maximize the marginal likelihood (MLE). Doable, but often slow: the objective is non-linear and not analytically solvable.
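The objective being maximized did not come through; the marginal (log-)likelihood in question is:

```latex
\hat{\theta}_{\mathrm{MLE}}
= \arg\max_{\theta} \sum_{j=1}^{m} \log p(x_j \mid \theta)
= \arg\max_{\theta} \sum_{j=1}^{m} \log \sum_{i=1}^{K} P(y = i)\,
  \mathcal{N}\!\left(x_j \mid \mu_i, \Sigma_i\right).
```

The log of a sum over components is what makes this non-linear and not analytically solvable.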

Expectation-Maximization (EM). A general algorithm to deal with hidden data, but we will study it first in the context of unsupervised learning (hidden class labels = clustering). EM is an optimization strategy for objective functions that can be interpreted as likelihoods in the presence of missing data. EM is much simpler than gradient methods: there is no need to choose a step size. EM is an iterative algorithm with two linked steps: the E-step fills in hidden values using inference, and the M-step applies a standard MLE/MAP method to the completed data. We will prove that this procedure monotonically improves the likelihood (or leaves it unchanged). EM always converges to a local optimum of the likelihood.

Expectation-Maximization (EM), a simple case. We have unlabeled data x_1, x_2, ..., x_m. We know there are K classes, we know the class priors P(y=1) = π_1, P(y=2) = π_2, ..., P(y=K) = π_K, and we know the common variance σ². We don't know the means μ_1, μ_2, ..., μ_K, and we want to learn them. We can write the likelihood of the data (independent data, marginalizing over the class) ⇒ learn μ_1, μ_2, ..., μ_K.

Expectation (E) step. We want to learn the means μ_1, ..., μ_K; our estimator at the end of iteration t-1 is μ^(t-1). At iteration t, construct the function Q (the E step). This is equivalent to assigning clusters to each data point in K-means, but in a soft way.

Maximization (M) step. At iteration t, maximize the function Q in μ^(t) (the M step), using the weights we calculated in the E step; the joint distribution is simple. This is equivalent to updating the cluster centers in K-means.

EM for spherical, same-variance GMMs. E-step: compute the expected classes of all datapoints for each class (in the K-means E-step we do hard assignment; EM does soft assignment). M-step: compute the maximum of the function Q (in this example, update μ given our data's class membership distributions, i.e., the weights). Iterate. The M-step is exactly the same as MLE with weighted data.
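The E-step and M-step formulas for this spherical, same-variance case were equations on the slide; the standard versions, with w_ij denoting the soft membership of point x_j in class i, are:

```latex
\text{E-step:}\quad
w_{ij}^{(t)} = P\!\left(y_j = i \mid x_j, \mu^{(t-1)}\right)
= \frac{\pi_i \exp\!\left(-\tfrac{1}{2\sigma^{2}} \lVert x_j - \mu_i^{(t-1)} \rVert^{2}\right)}
       {\sum_{k=1}^{K} \pi_k \exp\!\left(-\tfrac{1}{2\sigma^{2}} \lVert x_j - \mu_k^{(t-1)} \rVert^{2}\right)},
\qquad
\text{M-step:}\quad
\mu_i^{(t)} = \frac{\sum_{j=1}^{m} w_{ij}^{(t)}\, x_j}{\sum_{j=1}^{m} w_{ij}^{(t)}} .
```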

EM for general GMMs, the more general case. We have unlabeled data x_1, x_2, ..., x_m and we know there are K classes, but we don't know the class priors P(y=1) = π_1, ..., P(y=K) = π_K, we don't know the covariances Σ_1, ..., Σ_K, and we don't know the means μ_1, μ_2, ..., μ_K. We want to learn all of these; our estimator at the end of iteration t-1 is θ^(t-1). The idea is the same: at iteration t, construct the function Q (E step) and maximize it in θ^(t) (M step).

EM for general GMMs. At iteration t, construct the function Q (E step) and maximize it in θ^(t) (M step). E-step: compute the expected classes of all datapoints for each class. M-step: compute the MLEs given our data's class membership distributions (the weights).
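A compact NumPy/SciPy sketch of these E and M steps for a general-covariance GMM; the function and variable names are illustrative, and the small ridge added to the covariances for numerical stability is an implementation choice, not something the slides discuss.

```python
import numpy as np
from scipy.stats import multivariate_normal

def em_gmm(X, K, n_iter=100, seed=0):
    """EM for a general-covariance Gaussian mixture. X: (m, d) data array."""
    rng = np.random.default_rng(seed)
    m, d = X.shape
    pis = np.full(K, 1.0 / K)                               # mixture proportions
    mus = X[rng.choice(m, size=K, replace=False)].astype(float)  # init means from data
    Sigmas = np.array([np.cov(X.T) + 1e-6 * np.eye(d) for _ in range(K)])
    for _ in range(n_iter):
        # E-step: soft class memberships (weights) for every data point.
        W = np.column_stack([
            pis[i] * multivariate_normal.pdf(X, mus[i], Sigmas[i]) for i in range(K)
        ])
        W /= W.sum(axis=1, keepdims=True)
        # M-step: weighted MLEs of the proportions, means, and covariances.
        Nk = W.sum(axis=0)
        pis = Nk / m
        mus = (W.T @ X) / Nk[:, None]
        for i in range(K):
            diff = X - mus[i]
            Sigmas[i] = (W[:, i, None] * diff).T @ diff / Nk[i] + 1e-6 * np.eye(d)
    return pis, mus, Sigmas, W
```

For example, pis, mus, Sigmas, W = em_gmm(X, K=2) fits a two-component mixture to the sample X drawn in the earlier sketch.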

EM for general GMMs: Example (a sequence of figure slides showing the fitted mixture after the 1st, 2nd, 3rd, 4th, 5th, 6th, and 20th iterations).

GMM for Density Estimation

General EM algorithm. What is EM in the general case, and why does it work?

General EM algorithm, notation. Observed data: x. Unknown (hidden) variables: y. Parameters: θ. For example, in clustering the unknown variables are the cluster assignments; in a mixture of Gaussians the parameters are the mixture proportions, means, and covariances. Goal: maximize the marginal likelihood p(x | θ) in θ.

General EM algorithm, other examples: Hidden Markov Models. Observed data: the observation sequence. Unknown variables: the hidden state sequence. Parameters: the initial probabilities, the transition probabilities, and the emission probabilities. Goal: maximize the marginal likelihood of the observations.

General EM algorithm. Goal: maximize the marginal likelihood log p(x | θ). The algorithm alternates an E step and an M step, both defined through a free-energy function F(q, θ). We are going to discuss why this approach works.

General EM algorithm. Free energy F(q, θ). E step: maximize F over q with the parameters fixed (let us see why this recovers the usual E step). M step: maximize F over the parameters θ with q fixed; this is the only place where we maximize over θ.
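The free-energy definition and the two steps were equations on the slides; in standard notation, with q a distribution over the hidden variables y, they are:

```latex
F(q, \theta)
= \mathbb{E}_{q(y)}\!\left[\log \frac{p(x, y \mid \theta)}{q(y)}\right]
= \log p(x \mid \theta) \;-\; \mathrm{KL}\!\left(q(y)\,\Vert\, p(y \mid x, \theta)\right),
```
```latex
\text{E step:}\quad q^{(t)} = \arg\max_{q} F\!\left(q, \theta^{(t-1)}\right) = p\!\left(y \mid x, \theta^{(t-1)}\right),
\qquad
\text{M step:}\quad \theta^{(t)} = \arg\max_{\theta} F\!\left(q^{(t)}, \theta\right).
```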

General EM algorithm. Theorem: during the EM algorithm the marginal likelihood is not decreasing! Proof: as sketched below.
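The proof on the slide is a chain of (in)equalities that was lost in transcription; a standard reconstruction using the free energy above:

```latex
\log p\!\left(x \mid \theta^{(t-1)}\right)
\;\overset{\text{E step}}{=}\; F\!\left(q^{(t)}, \theta^{(t-1)}\right)
\;\overset{\text{M step}}{\le}\; F\!\left(q^{(t)}, \theta^{(t)}\right)
\;\overset{\mathrm{KL}\ge 0}{\le}\; \log p\!\left(x \mid \theta^{(t)}\right).
```

The first equality holds because the E step sets q^(t) to the exact posterior, closing the KL gap; the last inequality holds because the KL divergence is nonnegative.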

General EM algorithm (summary of the steps). Goal: maximize the marginal likelihood. E step: set q to the posterior of the hidden variables under the current parameters. M step: maximize the free energy over the parameters. During the EM algorithm the marginal likelihood is not decreasing!

Convergence of EM (figure: the sequence of EM lower-bound F-functions). EM monotonically converges to a local maximum of the likelihood!

Convergence of EM. Different sequences of EM lower-bound F-functions arise depending on the initialization, so use multiple, randomized initializations in practice.

Variational Methods

Variational methods. They work with the same free energy, but variational methods might decrease the marginal likelihood!

Variational methods. Partial E step: increase the free energy over q, but not necessarily to the best maximum, which would be the exact posterior. Partial M step: increase the free energy over the parameters without necessarily maximizing it. Because the steps are only partial, variational methods might decrease the marginal likelihood!

Summary: EM Algorithm. A way of maximizing the likelihood function for hidden-variable models. It finds the MLE of the parameters when the original (hard) problem can be broken up into two (easy) pieces: 1. Estimate some missing or unobserved data from the observed data and the current parameters. 2. Using this completed data, find the MLE parameter estimates. Alternate between filling in the latent variables using the best guess (the posterior) in the E step and updating the parameters based on this guess in the M step. In the M-step we optimize a lower bound F on the likelihood L; in the E-step we close the gap, making the bound F equal to the likelihood L. EM performs coordinate ascent on F and can get stuck in local optima.