Unsupervised Learning: Clustering and the EM Algorithm


Susanna Ricco
CPS 271, 25 October 2007
Material borrowed from: Lise Getoor, Andrew Moore, Tom Dietterich, Sebastian Thrun, Rich Maclin, ... (and Ron Parr)

Supervised Learning
Given data in the form <x, y>, where y is the target to learn.
Good news: it is easy to tell whether our algorithm is giving the right answer.

Unsupervised Learning
Given data in the form <x>, without any explicit target.
Bad news: how do we define good performance?
Good news: we can use our results for more than just predicting y.

Unsupervised Learning is Model Learning
Goal: produce a global summary of the data.
How? Assume the data are sampled from an underlying model with easily summarized properties.
Why? To filter out noise and to compress the data.

Good Clusters
We want points in a cluster to be:
1. as similar as possible to other points in the same cluster, and
2. as different as possible from points in other clusters.
Warning: the definitions of "similar" and "different" depend on the specific application. We have already seen many ways to measure the distance between two data points.

Types of Clustering Algorithms
Hierarchical methods, e.g., hierarchical agglomerative clustering.
Partition-based methods, e.g., K-means.
Probabilistic model-based methods, e.g., learning mixture models.
Spectral methods (I'm not going to talk about these).

Hierarchical Clustering
Build a hierarchy of nested clusters, either by gradually merging similar clusters (agglomerative methods) or by dividing loose superclusters (divisive methods).
The result is displayed as a dendrogram showing the sequence of merges or splits.

Agglomerative Hierarchical Clustering
Initialize C_i = {x^{(i)}} for i \in [1, n]. While more than one cluster is left:
1. Let C_i, C_j be the pair of clusters that minimizes D(C_i, C_j).
2. Set C_i = C_i \cup C_j.
3. Remove C_j from the list of clusters.
4. Store the current clusters.
(A NumPy sketch of this loop appears after the distance measures below.)

Measuring Distance
What is D(C_i, C_j)?
Single link method: D(C_i, C_j) = \min \{ d(x, y) : x \in C_i, y \in C_j \}
Complete link method: D(C_i, C_j) = \max \{ d(x, y) : x \in C_i, y \in C_j \}
Average link method: D(C_i, C_j) = \frac{1}{|C_i| |C_j|} \sum_{x \in C_i} \sum_{y \in C_j} d(x, y)
Centroid measure: D(C_i, C_j) = d(c_i, c_j), where c_i and c_j are the cluster centroids.
Ward's measure: D(C_i, C_j) = \sum_{u \in C_i \cup C_j} d(u, \bar{u}) - \sum_{x \in C_i} d(x, \bar{x}) - \sum_{y \in C_j} d(y, \bar{y}), the increase in within-cluster scatter caused by the merge.
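To make the agglomerative loop concrete, here is a minimal NumPy sketch of the pseudo-code above with a pluggable linkage function. It is illustrative rather than efficient, and the names (agglomerative_cluster, single_link) are my own, not from the original slides.

import numpy as np

def single_link(Ci, Cj):
    # Single link: D(C_i, C_j) = min over all pairs of the Euclidean distance d(x, y).
    return min(np.linalg.norm(x - y) for x in Ci for y in Cj)

def agglomerative_cluster(X, linkage=single_link):
    # Follow the pseudo-code: start with singleton clusters, repeatedly merge the
    # pair minimizing D(C_i, C_j), and store the clustering at every level.
    clusters = [[x] for x in X]                       # C_i = {x^(i)} for i in [1, n]
    levels = [[list(c) for c in clusters]]
    while len(clusters) > 1:
        i, j = min(((i, j) for i in range(len(clusters))
                           for j in range(i + 1, len(clusters))),
                   key=lambda ij: linkage(clusters[ij[0]], clusters[ij[1]]))
        clusters[i] = clusters[i] + clusters[j]       # C_i = C_i union C_j
        del clusters[j]                               # remove C_j from the list
        levels.append([list(c) for c in clusters])    # store the current clusters
    return levels

# Example: levels[-3] is the three-cluster partition of eight random 2-D points.
X = np.random.RandomState(0).randn(8, 2)
print(len(agglomerative_cluster(X)[-3]))              # -> 3

Swapping in a different linkage function reproduces the other distance measures listed above.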

Result
[Figure: dendrogram of hidden-unit activation patterns for verbs and nouns (animates, humans, animals, inanimates, food, breakables); the horizontal axis is distance on an arbitrary scale.]
Figure 2. Hierarchical clustering diagram of hidden-unit activation patterns in response to different words. The similarity between words and groups of words is reflected in the tree structure; items that are closer are joined further down the tree (i.e., to the right as shown here).
Elman, J.L. An alternative view of the mental lexicon. Trends in Cognitive Sciences, 7, 301-306.

Divisive Hierarchical Clustering
Begin with one single cluster and split it to form smaller clusters.
It can be difficult to choose potential splits:
Monolithic methods split based on the values of a single variable.
Polythetic methods consider all variables together.
Divisive methods are less popular than agglomerative methods.

Partition-based Clustering
Pick some number of clusters K.
Assign each point x^{(i)} to a single cluster C_k so that SCORE(C, D) is minimized/maximized. (What is the score function?)
The total number of possible allocations is K^n, so we use iterative improvement instead of an intractable exhaustive search.

The K-Means Algorithm
A popular partition-based clustering algorithm, with the score function given by
SCORE(C, D) = \sum_{k=1}^{K} \sum_{x \in C_k} d(x, c_k),
where c_k = \frac{1}{n_k} \sum_{x \in C_k} x and d(x, y) = \|x - y\|^2.

Pseudo-code for K-Means
1. Initialize the K cluster centers c_k.
2. For each x^{(i)}, assign it to the cluster with the closest center: x^{(i)} is assigned to \hat{k} = \arg\min_k d(x^{(i)}, c_k).
3. For each cluster, recompute the center: c_k = \frac{1}{n_k} \sum_{x \in C_k} x.
4. Check convergence (have the cluster centers moved?).
5. If not converged, go to 2.
(A short NumPy sketch of these steps follows the example figures below.)

K-Means Example
[Figure: eight unlabeled 2-D points x^{(1)}, ..., x^{(8)}.] Original unlabeled data.
[Figure: the same points with centers c_1, c_2, c_3.] Pick initial centers randomly.
[Figure] Assign points to the nearest cluster.
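As referenced above, here is a compact NumPy rendering of the pseudo-code, offered as a sketch rather than a reference implementation; the function name kmeans, the random initialization, and the convergence check are my own choices.

import numpy as np

def kmeans(X, K, n_iters=100, seed=0):
    rng = np.random.RandomState(seed)
    centers = X[rng.choice(len(X), K, replace=False)]        # 1. initialize the K centers
    for _ in range(n_iters):
        # 2. assign each point to the cluster with the closest center
        dists = np.linalg.norm(X[:, None, :] - centers[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # 3. recompute each center as the mean of its assigned points
        new_centers = np.array([X[labels == k].mean(axis=0) if np.any(labels == k)
                                else centers[k] for k in range(K)])
        # 4./5. stop once the centers stop moving, otherwise repeat
        if np.allclose(new_centers, centers):
            break
        centers = new_centers
    return labels, centers

labels, centers = kmeans(np.random.RandomState(1).randn(200, 2), K=3)

The assignment step dominates each iteration at roughly O(nKd) distance computations, which bears on the "time complexity per iteration" question raised on the next slide.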

K-Means Example (continued)
[Figure] Recompute cluster centers.
[Figure] Reassign points to the nearest clusters.
[Figure] Recompute cluster centers.

Understanding K-Means
What is the time complexity per iteration?
Does the algorithm terminate?
Does the algorithm converge to the global optimum?

K-Means Convergence
Model the data as drawn from spherical Gaussians centered at the cluster centers. Then
\log P(\text{data} \mid \text{assignments}) = \text{const} - \frac{1}{2} \sum_{k=1}^{K} \sum_{x \in C_k} (x - c_k)^2.
How does this change when we reassign a point? How does it change when we recompute the means?
Monotonic improvement plus a finite number of possible assignments implies convergence.

Demo
http://home.dei.polimi.it/matteucc/clustering/tutorial_html/appletkm.html

Variations on K-Means
What if we don't know K? Allow merging or splitting of clusters using heuristics.
What if means don't make sense? Use k-medoids instead.

Mixture Models
Assume the data were generated using the following procedure:
1. Pick one of K components according to P(z_k). This selects a (hidden) class label z_k.
2. Generate a data point by sampling from p(x | z_k).
This results in the probability distribution of a single point
p(x^{(i)}) = \sum_{k=1}^{K} P(z_k) \, p(x^{(i)} \mid z_k),
where p(x | z_k) is any distribution (Gaussian, Poisson, exponential, etc.).
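The two-step generative procedure is easy to simulate. Below is a small sketch with one-dimensional Gaussian components; the mixing weights and component parameters are made-up numbers chosen only for illustration.

import numpy as np

rng = np.random.RandomState(0)
P_z   = np.array([0.5, 0.3, 0.2])        # P(z_k): probability of picking each component
means = np.array([-4.0, 0.0, 5.0])       # parameters of p(x | z_k); here 1-D Gaussians
stds  = np.array([1.0, 0.5, 2.0])

def sample_point():
    z = rng.choice(len(P_z), p=P_z)      # 1. pick a (hidden) component label z_k
    x = rng.normal(means[z], stds[z])    # 2. sample a data point from p(x | z_k)
    return x, z

samples = [sample_point() for _ in range(1000)]
# Marginally, each x is distributed as p(x) = sum_k P(z_k) p(x | z_k).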

Gaussian Mixture Model (GMM)
The most common mixture model is a Gaussian mixture model:
p(x \mid z_k) = N(\mu_k, \Sigma_k).
With this model, the likelihood of the data becomes
p(D) = \prod_{i=1}^{n} \sum_{k=1}^{K} P(z_k) \, p(x^{(i)} \mid z_k; \mu_k, \Sigma_k).
(A small numerical sketch of this likelihood and of the posterior P(z_k | x) appears at the end of this section.)

LDA and GMMs
LDA: We built models p(x | z_k) and P(z_k) using maximum likelihood given our labeled training data, then used these models to compute P(z_k | x) to classify new query points.
Clustering with GMMs: We want to find P(z_k) and p(x | z_k) to learn the underlying model and find clusters, and we want to compute P(z_k | x) for each point in the training set to assign it to a cluster.
Can we use maximum likelihood to infer both the model and the assignments? That requires solving a non-linear system of equations, with no efficient analytic solution.

Problem: Missing Labels
If we knew the assignments, we could learn the component models easily; we did this to train an LDA.
If we knew the component models, we could estimate the most likely assignments easily; this is just classification.

Solution: The Expectation Maximization (EM) Algorithm
We deal with missing labels by alternating between two steps:
1. Expectation: Fix the model and estimate the missing labels.
2. Maximization: Fix the missing labels (or a distribution over the missing labels) and find the model that maximizes the expected log-likelihood of the data.
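As promised above, here is a small numerical sketch of the GMM likelihood and of the posterior P(z_k | x) (the soft assignments), using scipy.stats.multivariate_normal; the two-component parameter values are invented for illustration.

import numpy as np
from scipy.stats import multivariate_normal

weights = np.array([0.6, 0.4])                               # P(z_k), hypothetical values
mus     = [np.zeros(2), np.array([3.0, 3.0])]                # mu_k
Sigmas  = [np.eye(2), np.array([[1.0, 0.5], [0.5, 1.0]])]    # Sigma_k

def gmm_loglik(X):
    # p(x^(i)) = sum_k P(z_k) N(x^(i); mu_k, Sigma_k); the data log-likelihood sums the logs.
    per_comp = np.stack([w * multivariate_normal.pdf(X, m, S)
                         for w, m, S in zip(weights, mus, Sigmas)], axis=1)
    return np.log(per_comp.sum(axis=1)).sum(), per_comp

X = np.random.RandomState(0).randn(5, 2)
loglik, per_comp = gmm_loglik(X)
posteriors = per_comp / per_comp.sum(axis=1, keepdims=True)  # P(z_k | x): soft assignments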

Simple Example: Labeled Data
The clusters correspond to grades in a class. The model to learn is
P(A) = 1/2, P(B) = \mu, P(C) = 2\mu, P(D) = 1/2 - 3\mu.
What is the maximum likelihood estimate for \mu?
Training data: a people got an A, b people got a B, c people got a C, and d people got a D.
Likelihood:
P(a, b, c, d \mid \mu) = K \left(\tfrac{1}{2}\right)^a (\mu)^b (2\mu)^c \left(\tfrac{1}{2} - 3\mu\right)^d
\log P(a, b, c, d \mid \mu) = \log K + a \log \tfrac{1}{2} + b \log \mu + c \log(2\mu) + d \log\left(\tfrac{1}{2} - 3\mu\right)
\frac{\partial}{\partial \mu} \log P(a, b, c, d \mid \mu) = \frac{b}{\mu} + \frac{2c}{2\mu} - \frac{3d}{1/2 - 3\mu}
For the MLE, set \frac{\partial}{\partial \mu} \log P = 0 and solve for \mu to get
\mu = \frac{b + c}{6(b + c + d)}.

Simple Example: Hidden Labels
What if we only know that there are h high grades (A's and B's together)? The exact labels are missing. Now how do we find the maximum likelihood estimate of \mu?
1. Expectation: Fix \mu and infer the expected values of a and b:
a = \frac{1/2}{1/2 + \mu} h, \qquad b = \frac{\mu}{1/2 + \mu} h,
since we know a/b = (1/2)/\mu and a + b = h.
2. Maximization: Fix these fractions a and b and compute the maximum likelihood \mu as before:
\mu_{\text{new}} = \frac{b + c}{6(b + c + d)}.
3. Repeat.
(A few lines of Python running this alternation appear after the formal setup below.)

Formal Setup for the General EM Algorithm
Let D = {x^{(1)}, ..., x^{(n)}} be the n observed data vectors.
Let Z = {z^{(1)}, ..., z^{(n)}} be the n values of the hidden variables (i.e., the cluster labels).
The log-likelihood of the observed data given the model is
L(\theta) = \log p(D \mid \theta) = \log \sum_{Z} p(D, Z \mid \theta).
Note: both \theta and Z are unknown.
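As referenced above, the E/M alternation for the grade example fits in a few lines of Python. The counts h, c, d below are invented purely to show the iteration settling on a fixed point.

# EM for the grade example: P(A) = 1/2, P(B) = mu, P(C) = 2*mu, P(D) = 1/2 - 3*mu.
# We observe h high grades (A or B, unlabeled), plus c C's and d D's.
h, c, d = 20, 10, 10              # made-up counts, for illustration only

mu = 0.1                          # any starting guess in (0, 1/6)
for _ in range(50):
    # E step: split the h high grades using a/b = (1/2)/mu and a + b = h
    a = (0.5 / (0.5 + mu)) * h
    b = (mu  / (0.5 + mu)) * h
    # M step: reuse the labeled-data MLE with the expected counts plugged in
    mu = (b + c) / (6.0 * (b + c + d))

print(mu)                         # the alternation settles on a fixed point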

Fun with Jensen's Inequality
Let Q(Z) be any distribution over the hidden variables:
\log P(D \mid \theta) = \log \sum_{Z} Q(Z) \frac{p(D, Z \mid \theta)}{Q(Z)}
\geq \sum_{Z} Q(Z) \log \frac{p(D, Z \mid \theta)}{Q(Z)}
= \sum_{Z} Q(Z) \log p(D, Z \mid \theta) + \sum_{Z} Q(Z) \log \frac{1}{Q(Z)}
= F(Q, \theta).

General EM Algorithm
Alternate between the two steps until convergence:
E step: Maximize F with respect to Q, keeping \theta fixed. Solution: Q^{k+1} = p(Z \mid D, \theta^k).
M step: Maximize F with respect to \theta, keeping Q fixed. Solution:
\theta^{k+1} = \arg\max_\theta F(Q^{k+1}, \theta) = \arg\max_\theta \sum_{Z} p(Z \mid D, \theta^k) \log p(D, Z \mid \theta).

General EM Algorithm in English
Alternate the two steps until the model parameters don't change much:
E step: Estimate the distribution over the labels given a fixed model.
M step: Choose new parameters for the model to maximize the expected log-likelihood of the observed data and the hidden variables.

Convergence
The EM Algorithm will converge because:
During the E step, we make F(Q^{k+1}, \theta^k) = \log P(D \mid \theta^k).
During the M step, we choose \theta^{k+1} to increase F.
Recall that F is a lower bound, which implies F(Q^{k+1}, \theta^{k+1}) \leq \log P(D \mid \theta^{k+1}).
Putting these together, \log P(D \mid \theta^k) = F(Q^{k+1}, \theta^k) \leq F(Q^{k+1}, \theta^{k+1}) \leq \log P(D \mid \theta^{k+1}).
The log-likelihood never decreases, which implies convergence! (Why?)
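A tiny numerical check of the bound may help: for any Q, F(Q, \theta) \leq \log P(D \mid \theta), with equality exactly when Q is the posterior p(Z | D, \theta), which is what the E step computes. The joint probabilities below are arbitrary made-up numbers over a four-state hidden variable.

import numpy as np

# A toy joint p(D, Z | theta) over a four-state hidden variable Z (made-up numbers).
p_DZ = np.array([0.10, 0.05, 0.20, 0.01])     # p(D, Z = z | theta) for each z
log_p_D = np.log(p_DZ.sum())                  # log P(D | theta)

def F(Q):
    # Jensen lower bound: F(Q, theta) = sum_Z Q(Z) log [ p(D, Z | theta) / Q(Z) ]
    return np.sum(Q * (np.log(p_DZ) - np.log(Q)))

Q_uniform   = np.full(4, 0.25)                # an arbitrary distribution over Z
Q_posterior = p_DZ / p_DZ.sum()               # the E-step choice: Q(Z) = p(Z | D, theta)

print(log_p_D, F(Q_uniform), F(Q_posterior))
# F(Q_uniform) < log P(D | theta), while F(Q_posterior) matches it (up to rounding):
# the E step makes the lower bound tight, which drives the convergence argument above.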

Notes
Things to remember:
There is often a closed form for both the E and M steps.
You must specify stopping criteria.
Complexity depends on the number of iterations and on the time to compute the E and M steps.
EM may (will) converge to a local optimum.

Example: EM for GMM
[Figure] Initial model parameters.
[Figure] After the first iteration.
[Figure] After the second iteration.
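For reference, a minimal EM loop for a GMM in the spirit of these figures is sketched below. It assumes full covariances, a fixed number of iterations, and only a crude regularizer for numerical stability; it is not the code that produced the slides.

import numpy as np
from scipy.stats import multivariate_normal

def em_gmm(X, K, n_iters=50, seed=0):
    rng = np.random.RandomState(seed)
    n, d = X.shape
    weights = np.full(K, 1.0 / K)                          # P(z_k)
    mus = X[rng.choice(n, K, replace=False)]               # initial means
    Sigmas = np.array([np.cov(X.T) + 1e-6 * np.eye(d) for _ in range(K)])
    for _ in range(n_iters):
        # E step: responsibilities r[i, k] = P(z_k | x^(i)) under the current parameters
        r = np.stack([weights[k] * multivariate_normal.pdf(X, mus[k], Sigmas[k])
                      for k in range(K)], axis=1)
        r /= r.sum(axis=1, keepdims=True)
        # M step: update weights, means, and covariances from the soft assignments
        Nk = r.sum(axis=0)
        weights = Nk / n
        mus = (r.T @ X) / Nk[:, None]
        for k in range(K):
            diff = X - mus[k]
            Sigmas[k] = (r[:, k, None] * diff).T @ diff / Nk[k] + 1e-6 * np.eye(d)
    return weights, mus, Sigmas, r

weights, mus, Sigmas, resp = em_gmm(np.random.RandomState(2).randn(300, 2), K=3)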

Example: EM for GMM (continued)
[Figure] After the third iteration.
[Figure] After the fourth iteration.
[Figure] After the fifth iteration.
[Figure] After the sixth iteration.

Example: EM for GMM (continued)
[Figure] After convergence.

Relation to K-Means
Similarities: K-Means uses a GMM with covariance \Sigma = I (fixed), uniform P(z_k) (fixed), and unknown means, and it alternates between estimating labels and recomputing the unknown model parameters.
Difference: K-Means makes a hard assignment to a cluster during the E step.

How to Pick K?
Do we want to pick the K that maximizes the likelihood? No: the likelihood never decreases as K grows, so maximizing it simply favors more clusters.
Other options:
Cross-validation.
Add a complexity penalty to the objective function.
Prior knowledge.
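One concrete version of the "complexity penalty" option is the Bayesian information criterion. The sketch below assumes scikit-learn's GaussianMixture is available; the candidate range 1-6 and the stand-in data are arbitrary.

import numpy as np
from sklearn.mixture import GaussianMixture

X = np.random.RandomState(0).randn(500, 2)        # stand-in data

# Fit a GMM for each candidate K and keep the one with the lowest BIC, which
# penalizes the log-likelihood by the number of free parameters.
bics = {k: GaussianMixture(n_components=k, random_state=0).fit(X).bic(X)
        for k in range(1, 7)}
best_K = min(bics, key=bics.get)
print(best_K, bics)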

Summary
Clustering: infer the assignments to hidden variables and the hidden model parameters simultaneously.
EM Algorithm: a powerful, popular, general method for doing this.
EM applications: image segmentation, SLAM, estimating motion models for tracking, Hidden Markov Models, etc.