Clustering K-means
Machine Learning CSE546, Emily Fox, University of Washington, November 4, 2013

Clustering images
Set of images [Goldberger et al.]

Clustering web search results

Some Data

K-means
1. Ask user how many clusters they'd like (e.g., k = 5).
2. Randomly guess k cluster center locations.
3. Each datapoint finds out which center it's closest to (thus each center "owns" a set of datapoints).
4. Each center finds the centroid of the points it owns...
5. ...and jumps there.
6. Repeat until terminated!

K-means
Randomly initialize k centers: µ^(0) = µ_1^(0), ..., µ_k^(0)
Classify: assign each point j ∈ {1, ..., N} to the nearest center:
  C(j) ← argmin_i ||µ_i − x_j||²
Recenter: µ_i becomes the centroid of the points it owns:
  µ_i ← argmin_µ Σ_{j : C(j) = i} ||µ − x_j||²
Equivalent to µ_i ← average of its points!
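A minimal sketch of this procedure in Python/NumPy follows; it is illustrative rather than the lecture's reference code, and the initialization (random data points as centers) and stopping test are my own assumptions.

```python
import numpy as np

def kmeans(X, k, n_iters=100, seed=0):
    """Plain K-means: alternate the 'Classify' and 'Recenter' steps.

    X: (N, d) data matrix; k: number of clusters chosen by the user.
    Returns (centers, assignments). A sketch, not optimized code.
    """
    rng = np.random.default_rng(seed)
    # Randomly guess k cluster center locations (here: k random data points).
    centers = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(n_iters):
        # Classify: each datapoint finds the center it is closest to.
        dists = np.linalg.norm(X[:, None, :] - centers[None, :, :], axis=2)
        assign = dists.argmin(axis=1)
        # Recenter: each center jumps to the centroid of the points it owns.
        new_centers = np.array([
            X[assign == i].mean(axis=0) if np.any(assign == i) else centers[i]
            for i in range(k)
        ])
        if np.allclose(new_centers, centers):   # terminated: centers stopped moving
            break
        centers = new_centers
    return centers, assign
```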

What is K-means optimizing?
Potential function F(µ, C) of centers µ and point allocations C:
  F(µ, C) = Σ_{j=1}^N ||µ_{C(j)} − x_j||²
Optimal K-means: min_µ min_C F(µ, C)

Does K-means converge??? Part 1
Optimize the potential function: fix µ, optimize C.
Assigning each point to its closest center, C(j) ← argmin_i ||µ_i − x_j||², minimizes F over C for the fixed centers, so this step can only decrease F.

Does K-means converge??? Part 2
Optimize the potential function: fix C, optimize µ.
For each cluster i, setting µ_i to the mean of the points assigned to it minimizes Σ_{j : C(j) = i} ||µ_i − x_j||², so this step can only decrease F as well.

Coordinate descent algorithms
Want: min_a min_b F(a, b)
Coordinate descent: fix a, minimize over b; fix b, minimize over a; repeat.
Converges (if F is bounded) to a, often good, local optimum, as we saw in the applet (play with it!). (For LASSO it converged to the global optimum, because of convexity.)
K-means is a coordinate descent algorithm!
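To make the coordinate-descent view concrete, this small helper (an illustrative addition, not from the slides) evaluates the potential F(µ, C); logging it once per K-means iteration should show a non-increasing sequence.

```python
import numpy as np

def kmeans_potential(X, centers, assign):
    """K-means objective F(mu, C) = sum_j ||mu_{C(j)} - x_j||^2.

    Both the 'Classify' and 'Recenter' steps can only decrease this value,
    so logging it once per iteration gives a simple convergence check.
    """
    diffs = X - centers[assign]   # (N, d) residual of each point to its assigned center
    return float(np.sum(diffs ** 2))
```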

Mixtures of Gaussians
Machine Learning CSE546, Emily Fox, University of Washington, November 4, 2013

(One) bad case for K-means
Clusters may overlap.
Some clusters may be wider than others.

Density Estimation
Estimate a density based on x_1, ..., x_N.
[Figure: contour plot of the joint density]

Density as Mixture of Gaussians
Approximate the density with a mixture of Gaussians.
[Figures: mixture of 3 Gaussians; contour plot of the joint density]

Gaussians in d Dimensions
  P(x) = 1 / ((2π)^{d/2} |Σ|^{1/2}) · exp( −(1/2) (x − µ)ᵀ Σ⁻¹ (x − µ) )
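A short sketch of evaluating this d-dimensional Gaussian density, once directly from the formula and once via scipy.stats.multivariate_normal as a cross-check; the example mean and covariance are made up for illustration.

```python
import numpy as np
from scipy.stats import multivariate_normal

def gaussian_pdf(x, mu, Sigma):
    """P(x) = (2*pi)^(-d/2) |Sigma|^(-1/2) exp(-0.5 (x-mu)^T Sigma^{-1} (x-mu))."""
    d = len(mu)
    diff = x - mu
    quad = diff @ np.linalg.solve(Sigma, diff)            # (x-mu)^T Sigma^{-1} (x-mu)
    norm = (2 * np.pi) ** (-d / 2) * np.linalg.det(Sigma) ** (-0.5)
    return norm * np.exp(-0.5 * quad)

# Illustrative parameters (not from the lecture).
mu = np.array([0.5, 0.4])
Sigma = np.array([[0.02, 0.005], [0.005, 0.01]])
x = np.array([0.45, 0.42])
print(gaussian_pdf(x, mu, Sigma))
print(multivariate_normal(mean=mu, cov=Sigma).pdf(x))     # should match
```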

Density as Mixture of Gaussians
Approximate the density with a mixture of Gaussians:
  p(x_i | π, µ, Σ) = Σ_{z_i=1}^K π_{z_i} N(x_i | µ_{z_i}, Σ_{z_i})
[Figures: mixture of 3 Gaussians; our actual observations. From C. Bishop, Pattern Recognition & Machine Learning]

Clustering our Observations
Imagine we have an assignment of each x_i to a Gaussian.
Introduce a latent cluster indicator variable z_i. Then we have
  p(x_i | z_i, π, µ, Σ) = N(x_i | µ_{z_i}, Σ_{z_i})
[Figures: (a) our actual observations, (b) complete data labeled by true cluster assignments. From C. Bishop, Pattern Recognition & Machine Learning]

Clustering our Observations
We must infer the cluster assignments from the observations.
Posterior probabilities of assignment to each cluster *given* the model parameters (soft assignments to clusters):
  r_ik = p(z_i = k | x_i, π, µ, Σ) = π_k N(x_i | µ_k, Σ_k) / Σ_{j=1}^K π_j N(x_i | µ_j, Σ_j)
[Figure: (c) soft assignments to clusters. From C. Bishop, Pattern Recognition & Machine Learning]

Unsupervised Learning: not as hard as it looks
Sometimes easy, sometimes impossible, and sometimes in between.
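A hedged sketch of computing these responsibilities r_ik with NumPy/SciPy; the function name and parameter layout are my own, and the log-space arithmetic is just a numerical-stability choice, not part of the slides.

```python
import numpy as np
from scipy.stats import multivariate_normal
from scipy.special import logsumexp

def responsibilities(X, pis, mus, Sigmas):
    """r[i, k] = p(z_i = k | x_i, pi, mu, Sigma) for a Gaussian mixture."""
    K = len(pis)
    # log( pi_k * N(x_i | mu_k, Sigma_k) ) for every point and component
    log_joint = np.column_stack([
        np.log(pis[k]) + multivariate_normal(mus[k], Sigmas[k]).logpdf(X)
        for k in range(K)
    ])
    # Normalize over components in log space.
    return np.exp(log_joint - logsumexp(log_joint, axis=1, keepdims=True))
```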

Summary of GMM Concept
Estimate a density based on x_1, ..., x_N:
  p(x_i | π, µ, Σ) = Σ_{z_i=1}^K π_{z_i} N(x_i | µ_{z_i}, Σ_{z_i})
[Figures: complete data labeled by true cluster assignments; surface plot of the joint density, marginalizing cluster assignments]

Summary of GMM Components
Observations: x_i ∈ R^d, i = 1, 2, ..., N
Hidden cluster labels: z_i ∈ {1, 2, ..., K}, i = 1, 2, ..., N
Hidden mixture means: µ_k ∈ R^d, k = 1, 2, ..., K
Hidden mixture covariances: Σ_k ∈ R^{d×d}, k = 1, 2, ..., K
Hidden mixture probabilities: π_k, with Σ_{k=1}^K π_k = 1
Gaussian mixture marginal and conditional likelihood:
  p(x_i | π, µ, Σ) = Σ_{z_i=1}^K π_{z_i} p(x_i | z_i, µ, Σ)
  p(x_i | z_i, µ, Σ) = N(x_i | µ_{z_i}, Σ_{z_i})
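To make the generative story concrete, this short sketch samples data exactly as the model describes: draw z_i from the mixture probabilities π, then draw x_i from the corresponding Gaussian. The parameter values are invented for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)

# Illustrative 2-component GMM in R^2 (values not from the lecture).
pis = np.array([0.6, 0.4])
mus = [np.array([0.0, 0.0]), np.array([3.0, 3.0])]
Sigmas = [np.eye(2), 0.5 * np.eye(2)]

N = 500
z = rng.choice(len(pis), size=N, p=pis)                                  # hidden labels z_i ~ pi
X = np.array([rng.multivariate_normal(mus[k], Sigmas[k]) for k in z])    # x_i | z_i ~ N(mu_{z_i}, Sigma_{z_i})
```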

Expectation Maximization
Machine Learning CSE546, Emily Fox, University of Washington, November 6, 2013

Next... back to Density Estimation
What if we want to do density estimation with multimodal or "clumpy" data?

But we don't see class labels!!!
MLE: argmax_θ Π_i P(z_i, x_i)
But we don't know z_i.
Maximize the marginal likelihood instead:
  argmax_θ Π_i P(x_i) = argmax_θ Π_i Σ_{k=1}^K P(z_i = k, x_i)

Special case: spherical Gaussians and hard assignments
  P(z_i = k, x_i) = P(z_i = k) · 1/((2π)^{d/2} |Σ_k|^{1/2}) exp( −(1/2) (x_i − µ_k)ᵀ Σ_k⁻¹ (x_i − µ_k) )
If P(x | z = k) is spherical, with the same σ for all classes:
  P(x_i | z_i = k) ∝ exp( −||x_i − µ_k||² / (2σ²) )
If each x_i belongs to one class C(i) (hard assignment), the marginal likelihood becomes
  Π_{i=1}^N Σ_{k=1}^K P(x_i, z_i = k) ∝ Π_{i=1}^N exp( −||x_i − µ_{C(i)}||² / (2σ²) )
Same as K-means!!!
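A tiny numeric check of this equivalence (toy data and fixed centers, all made up): with a shared spherical covariance σ²I and hard assignments C(i), the negative log of Π_i P(x_i | z_i = C(i)) equals F(µ, C)/(2σ²) plus a constant, so maximizing it over µ and C is exactly the K-means problem.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(20, 2))                       # toy data
mus = np.array([[-1.0, 0.0], [1.0, 0.0]])          # two fixed centers
C = np.linalg.norm(X[:, None] - mus[None], axis=2).argmin(axis=1)   # hard assignments
sigma2 = 0.3

# Per-point Gaussian log-likelihoods with shared spherical covariance sigma^2 * I.
d = X.shape[1]
log_lik = -0.5 * d * np.log(2 * np.pi * sigma2) \
          - np.sum((X - mus[C]) ** 2, axis=1) / (2 * sigma2)

F = np.sum((X - mus[C]) ** 2)                      # K-means potential F(mu, C)
# Negative log-likelihood = F / (2 sigma^2) + constant; the two printed values match.
print(-log_lik.sum())
print(F / (2 * sigma2) + 0.5 * len(X) * d * np.log(2 * np.pi * sigma2))
```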

Supervised Learning of Mixtures of Gaussians
Mixtures of Gaussians:
  Prior class probabilities: P(z = k)
  Likelihood function per class: P(x | z = k)
Suppose, for each data point, we know both the location x and the class z. Then learning is easy :)
  For the prior: P(z = k) = (# points in class k) / N
  For the likelihood function: fit µ_k, Σ_k to the points in class k (sample mean and covariance).

EM: Reducing Unsupervised Learning to Supervised Learning
If we knew the assignment of points to classes → supervised learning!
Expectation-Maximization (EM):
  Guess the assignment of points to classes.
  In standard ("soft") EM: each point is associated with a probability of being in each class.
  Recompute the model parameters.
  Iterate.
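A brief sketch of the "learning is easy when z is known" case: maximum-likelihood estimates of the class priors, means, and covariances from fully labeled data. Function and variable names are illustrative assumptions.

```python
import numpy as np

def fit_gmm_supervised(X, z, K):
    """MLE of a Gaussian mixture when every point's class label z_i is observed.

    Assumes every class 0..K-1 appears at least once in z.
    """
    pis = np.array([(z == k).mean() for k in range(K)])                     # P(z = k)
    mus = np.array([X[z == k].mean(axis=0) for k in range(K)])              # per-class mean
    Sigmas = np.array([np.cov(X[z == k].T, bias=True) for k in range(K)])   # per-class covariance (MLE)
    return pis, mus, Sigmas
```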

Form of Likelihood
Conditioned on the class of point x_i:
  p(x_i | z_i, µ, Σ) = N(x_i | µ_{z_i}, Σ_{z_i})
Marginalizing the class assignment:
  p(x_i | π, µ, Σ) = Σ_{z_i=1}^K π_{z_i} N(x_i | µ_{z_i}, Σ_{z_i})

Gaussian Mixture Model
The most commonly used mixture model.
Observations: x_i ∈ R^d
Parameters: θ = {π_k, µ_k, Σ_k}_{k=1}^K
Likelihood: p(x_i | θ) = Σ_{k=1}^K π_k N(x_i | µ_k, Σ_k)
Example: z_i = country of origin, x_i = height of the i-th person; the k-th mixture component = distribution of heights in country k.

Example (taken from Kevin Murphy's ML textbook)
Data: gene expression levels.
Goal: cluster genes with similar expression trajectories.

Mixture models are useful for...
Density estimation: allows for a multimodal density.
Clustering: we want membership information for each observation, e.g., the topic of the current document.
  Soft clustering: p(z_i = k | x_i, θ)
  Hard clustering: z_i* = argmax_k p(z_i = k | x_i, θ)

Issues
Label switching: "color = label" does not matter; we can switch labels and the likelihood is unchanged.
The log likelihood is not convex in the parameters. The problem is simpler for the complete-data likelihood.

ML Estimate of Mixture Model Params
Log likelihood:
  L_x(θ) := log p({x_i} | θ) = Σ_i log Σ_{z_i} p(x_i, z_i | θ)
Want the ML estimate:
  θ̂_ML = argmax_θ L_x(θ)
Neither convex nor concave, and has local optima.
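A short helper (my own naming, not the lecture's code) for evaluating this marginal log likelihood L_x(θ); it is handy for monitoring the EM iterations below, and uses logsumexp purely for numerical stability.

```python
import numpy as np
from scipy.stats import multivariate_normal
from scipy.special import logsumexp

def gmm_log_likelihood(X, pis, mus, Sigmas):
    """L_x(theta) = sum_i log sum_k pi_k N(x_i | mu_k, Sigma_k)."""
    log_joint = np.column_stack([
        np.log(pis[k]) + multivariate_normal(mus[k], Sigmas[k]).logpdf(X)
        for k in range(len(pis))
    ])                                    # shape (N, K): log pi_k + log N(x_i | mu_k, Sigma_k)
    return float(logsumexp(log_joint, axis=1).sum())
```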

If complete data were observed...
Assume the class labels z_i were observed in addition to x_i:
  L_{x,z}(θ) = Σ_i log p(x_i, z_i | θ)
Computing the ML estimates is then easy: the objective separates over clusters k!
Example: mixture of Gaussians (MoG), θ = {π_k, µ_k, Σ_k}_{k=1}^K.

Iterative Algorithm
Motivates a coordinate-ascent-like algorithm:
1. Infer the missing values z_i given the current estimate of the parameters θ̂.
2. Optimize the parameters to produce a new θ̂ given the filled-in data z_i.
3. Repeat.
Example: MoG (derivation soon + HW); a sketch of the resulting loop is given below.
1. Infer responsibilities:
  r_ik = p(z_i = k | x_i, θ̂^(t−1)) = π̂_k N(x_i | µ̂_k, Σ̂_k) / Σ_{j=1}^K π̂_j N(x_i | µ̂_j, Σ̂_j)
2. Optimize parameters:
  max w.r.t. π_k:  π̂_k = (1/N) Σ_i r_ik
  max w.r.t. µ_k, Σ_k:  µ̂_k = Σ_i r_ik x_i / Σ_i r_ik,   Σ̂_k = Σ_i r_ik (x_i − µ̂_k)(x_i − µ̂_k)ᵀ / Σ_i r_ik
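Putting the E-step and M-step together, here is a compact, self-contained sketch of the EM loop for a mixture of Gaussians; the initialization, the small ridge added to each covariance, and the convergence test are illustrative choices, not the course's reference implementation.

```python
import numpy as np
from scipy.stats import multivariate_normal
from scipy.special import logsumexp

def em_gmm(X, K, n_iters=100, tol=1e-6, seed=0):
    """EM for a Gaussian mixture model. Returns (pis, mus, Sigmas, log_liks)."""
    N, d = X.shape
    rng = np.random.default_rng(seed)
    # Simple initialization: uniform weights, random data points as means, shared data covariance.
    pis = np.full(K, 1.0 / K)
    mus = X[rng.choice(N, size=K, replace=False)].copy()
    Sigmas = np.array([np.cov(X.T) + 1e-6 * np.eye(d) for _ in range(K)])
    log_liks = []
    for _ in range(n_iters):
        # E-step: responsibilities r_ik = p(z_i = k | x_i, theta_hat), in log space for stability.
        log_joint = np.column_stack([
            np.log(pis[k]) + multivariate_normal(mus[k], Sigmas[k]).logpdf(X)
            for k in range(K)
        ])
        log_norm = logsumexp(log_joint, axis=1, keepdims=True)
        r = np.exp(log_joint - log_norm)                     # (N, K)
        # M-step: weighted MLE updates.
        Nk = r.sum(axis=0)                                   # effective counts per cluster
        pis = Nk / N
        mus = (r.T @ X) / Nk[:, None]
        for k in range(K):
            diff = X - mus[k]
            Sigmas[k] = (r[:, k, None] * diff).T @ diff / Nk[k] + 1e-6 * np.eye(d)
        # Marginal log likelihood under the parameters used in this E-step (non-decreasing).
        log_liks.append(float(log_norm.sum()))
        if len(log_liks) > 1 and abs(log_liks[-1] - log_liks[-2]) < tol:
            break
    return pis, mus, Sigmas, log_liks
```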

E.M. Convergence
EM is coordinate ascent on an interesting potential function.
Coordinate ascent for a bounded potential function → convergence to a local optimum is guaranteed.
This algorithm is REALLY USED, and in high-dimensional state spaces too, e.g., vector quantization for speech data.

Gaussian Mixture Example: Start
[Figure: initial configuration of the Gaussian mixture fit]

[Figures: the Gaussian mixture fit after the 1st, 2nd, 3rd, 4th, 5th, 6th, and 20th iterations]

Some Bio Assay data

GMM clustering of the assay data

Resulting Density Estimator

Expectation Maximization (EM): Setup
More broadly applicable than just to the mixture models considered so far.
Model:
  x: observable, "incomplete" data
  y: not (fully) observable, "complete" data
  θ: parameters
Interested in maximizing (w.r.t. θ):
  p(x | θ) = Σ_y p(x, y | θ)
Special case: x = g(y).

Expectation Maximization (EM): Derivation
Step 1: Rewrite the desired likelihood in terms of complete-data terms:
  p(y | θ) = p(y | x, θ) p(x | θ)
Step 2: Assume an estimate θ̂ of the parameters and take the expectation with respect to p(y | x, θ̂). Writing
  U(θ, θ̂) := E[log p(y | θ) | x, θ̂]   and   V(θ, θ̂) := −E[log p(y | x, θ) | x, θ̂],
this gives  L_x(θ) = log p(x | θ) = U(θ, θ̂) + V(θ, θ̂).

Expectation Maximization (EM): Derivation
Step 3: Consider the log likelihood of the data at any θ relative to the log likelihood at θ̂, i.e., L_x(θ) − L_x(θ̂).
Aside, Gibbs inequality: E_p[log p(x)] ≥ E_p[log q(x)] (the difference is the KL divergence KL(p‖q) ≥ 0).

Expectation Maximization (EM): Derivation
  L_x(θ) − L_x(θ̂) = [U(θ, θ̂) − U(θ̂, θ̂)] + [V(θ, θ̂) − V(θ̂, θ̂)]
Step 4: Determine conditions under which the log likelihood at θ exceeds that at θ̂.
Using the Gibbs inequality, V(θ, θ̂) ≥ V(θ̂, θ̂), so:
  If U(θ, θ̂) ≥ U(θ̂, θ̂), then L_x(θ) ≥ L_x(θ̂).
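For reference, the argument on these two slides can be written out compactly. This is a reconstruction of the standard EM bound under the sign convention V(θ, θ̂) = −E[log p(y | x, θ) | x, θ̂], which makes L_x = U + V as on the "Coordinate Ascent Behavior" slide; treat that convention as an assumption of this sketch.

```latex
% Reconstructed convention: q(y) := p(y \mid x, \hat\theta),
% U(\theta,\hat\theta) := E_q[\log p(y \mid \theta)],
% V(\theta,\hat\theta) := -E_q[\log p(y \mid x, \theta)].
\begin{align*}
\log p(x \mid \theta)
  &= \log p(y \mid \theta) - \log p(y \mid x, \theta)\\
\Rightarrow\ L_x(\theta)
  &= U(\theta,\hat\theta) + V(\theta,\hat\theta)
  \qquad \text{(take $E_q$ of both sides)}\\
V(\theta,\hat\theta) - V(\hat\theta,\hat\theta)
  &= E_q\!\left[\log \frac{p(y \mid x,\hat\theta)}{p(y \mid x,\theta)}\right] \ge 0
  \qquad \text{(Gibbs inequality)}\\
\Rightarrow\ L_x(\theta) - L_x(\hat\theta)
  &= \bigl[U(\theta,\hat\theta) - U(\hat\theta,\hat\theta)\bigr]
   + \bigl[V(\theta,\hat\theta) - V(\hat\theta,\hat\theta)\bigr]
  \ \ge\ U(\theta,\hat\theta) - U(\hat\theta,\hat\theta).
\end{align*}
```

So any θ that increases U(·, θ̂) cannot decrease the log likelihood, which is exactly what the M-step exploits.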

Motivates EM Algorithm
Initial guess: θ̂^(0)
Estimate at iteration t: θ̂^(t)
E-Step: Compute U(θ, θ̂^(t)) = E[log p(y | θ) | x, θ̂^(t)]
M-Step: Compute θ̂^(t+1) = argmax_θ U(θ, θ̂^(t))

Example: Mixture Models
E-Step: Compute U(θ, θ̂^(t)) = E[log p(y | θ) | x, θ̂^(t)]
M-Step: Compute θ̂^(t+1) = argmax_θ U(θ, θ̂^(t))
Consider y_i = {z_i, x_i} i.i.d., with p(x_i, z_i | θ) = π_{z_i} p(x_i | z_i, θ) = π_{z_i} N(x_i | µ_{z_i}, Σ_{z_i}). Then
  E_{q_t}[log p(y | θ)] = Σ_i E_{q_t}[log p(x_i, z_i | θ)] = Σ_i Σ_{k=1}^K r_ik^(t) [log π_k + log N(x_i | µ_k, Σ_k)]

Coordinate Ascent Behavior
Bound the log likelihood:
  L_x(θ) = U(θ, θ̂^(t)) + V(θ, θ̂^(t))
  L_x(θ̂^(t)) = U(θ̂^(t), θ̂^(t)) + V(θ̂^(t), θ̂^(t))
[Figure from the KM (Kevin Murphy) textbook]

Comments on EM
Since the Gibbs inequality is satisfied with equality only if p = q, any step that changes θ should strictly increase the likelihood.
In practice, the M-Step can be replaced by merely increasing U instead of maximizing it (Generalized EM).
Under certain conditions (e.g., in the exponential family), one can show that EM converges to a stationary point of L_x(θ).
Often there is a natural choice for y; it has physical meaning.
If you want to choose any y, not necessarily with x = g(y), replace p(y | θ) in U with p(y, x | θ).

Initialization
In the mixture model case, where y_i = {z_i, x_i}, there are many ways to initialize the EM algorithm. Examples:
  Choose K observations at random to define each cluster; assign the other observations to the nearest centroid to form initial parameter estimates.
  Pick the centers sequentially to provide good coverage of the data.
  Grow the mixture model by splitting (and sometimes removing) clusters until K clusters are formed.
Initialization can be quite important to convergence rates in practice.

What you should know
K-means for clustering: the algorithm converges because it is coordinate descent on the K-means potential.
EM for a mixture of Gaussians: how to learn maximum likelihood parameters (a local maximum of the likelihood) in the case of unlabeled data.
Be happy with this kind of probabilistic analysis.
Remember: EM can get stuck in local optima, and empirically it DOES.
EM is coordinate ascent.