ECE 5424: Introduction to Machine Learning

ECE 5424: Introduction to Machine Learning Topics: Unsupervised Learning: K-means, GMM, EM Readings: Barber 20.1-20.3 Stefan Lee Virginia Tech

Tasks. Supervised Learning: Classification maps x to a discrete y; Regression maps x to a continuous y. Unsupervised Learning: Clustering maps x to a discrete cluster ID c; Dimensionality Reduction maps x to a continuous z. (C) Dhruv Batra 2

Unsupervised Learning: learning with only X; Y is not present in the training data. Some example unsupervised learning problems: Clustering / Factor Analysis, Dimensionality Reduction / Embeddings, Density Estimation with Mixture Models. (C) Dhruv Batra 3

New Topic: Clustering Slide Credit: Carlos Guestrin 4

Synonyms: Clustering, Vector Quantization, Latent Variable Models, Hidden Variable Models, Mixture Models. Algorithms: K-means, Expectation Maximization (EM). (C) Dhruv Batra 5

Some Data (C) Dhruv Batra Slide Credit: Carlos Guestrin 6

K-means 1. Ask user how many clusters they'd like. (e.g. k=5) (C) Dhruv Batra Slide Credit: Carlos Guestrin 7

K-means 1. Ask user how many clusters they'd like. (e.g. k=5) 2. Randomly guess k cluster Center locations (C) Dhruv Batra Slide Credit: Carlos Guestrin 8

K-means 1. Ask user how many clusters they'd like. (e.g. k=5) 2. Randomly guess k cluster Center locations 3. Each datapoint finds out which Center it's closest to. (Thus each Center owns a set of datapoints) (C) Dhruv Batra Slide Credit: Carlos Guestrin 9

K-means 1. Ask user how many clusters they'd like. (e.g. k=5) 2. Randomly guess k cluster Center locations 3. Each datapoint finds out which Center it's closest to. 4. Each Center finds the centroid of the points it owns (C) Dhruv Batra Slide Credit: Carlos Guestrin 10

K-means 1. Ask user how many clusters they'd like. (e.g. k=5) 2. Randomly guess k cluster Center locations 3. Each datapoint finds out which Center it's closest to. 4. Each Center finds the centroid of the points it owns 5. Repeat until terminated! (C) Dhruv Batra Slide Credit: Carlos Guestrin 11

K-means. Randomly initialize k centers: \mu^{(0)} = \mu_1^{(0)}, \ldots, \mu_k^{(0)}. Assign: assign each point i \in \{1, \ldots, n\} to its nearest center: C(i) = \arg\min_j \|x_i - \mu_j\|^2. Recenter: each \mu_j becomes the centroid of the points assigned to it. (C) Dhruv Batra Slide Credit: Carlos Guestrin 12
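Below is a minimal NumPy sketch of the assign/recenter loop just described; the function name kmeans, the initialization-by-sampling, and the convergence check are illustrative choices, not the course's reference implementation.

```python
import numpy as np

def kmeans(X, k, n_iters=100, seed=0):
    """Plain K-means on an (N, d) data matrix X with k clusters."""
    rng = np.random.default_rng(seed)
    # Randomly initialize the k centers mu^(0) by picking k distinct data points
    mu = X[rng.choice(len(X), size=k, replace=False)].copy()
    for _ in range(n_iters):
        # Assign: C(i) = argmin_j ||x_i - mu_j||^2
        dists = ((X[:, None, :] - mu[None, :, :]) ** 2).sum(axis=2)
        C = dists.argmin(axis=1)
        # Recenter: each mu_j becomes the centroid of the points it owns
        new_mu = np.array([X[C == j].mean(axis=0) if np.any(C == j) else mu[j]
                           for j in range(k)])
        if np.allclose(new_mu, mu):  # stop once the centers no longer move
            break
        mu = new_mu
    return mu, C
```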

K-means Demo http://mlehman.github.io/kmeans-javascript/ (C) Dhruv Batra 13

What is K-means optimizing? An objective F(\mu, C), a function of the centers \mu and the point allocations C: F(\mu, C) = \sum_{i=1}^{N} \|x_i - \mu_{C(i)}\|^2. Equivalently, with a 1-of-k encoding a_{ij} of the assignments: F(\mu, a) = \sum_{i=1}^{N} \sum_{j=1}^{k} a_{ij} \|x_i - \mu_j\|^2. Optimal K-means: \min_\mu \min_a F(\mu, a). (C) Dhruv Batra 14
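As a small companion to the K-means sketch above, the objective is easy to evaluate directly; this helper and its names are illustrative, not part of the slides.

```python
import numpy as np

def kmeans_objective(X, mu, C):
    # F(mu, C) = sum_i ||x_i - mu_{C(i)}||^2, where C[i] indexes x_i's assigned center
    return float(((X - mu[C]) ** 2).sum())
```

Running this after each assign/recenter pass would show the objective decreasing monotonically, which is the coordinate-descent view discussed next.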

Coordinate descent algorithms. Want: \min_a \min_b F(a, b). Coordinate descent: fix a, minimize over b; fix b, minimize over a; repeat. Converges (if F is bounded) to a (often good) local optimum, as we saw in the applet (play with it!). K-means is a coordinate descent algorithm! (C) Dhruv Batra Slide Credit: Carlos Guestrin 15

K-means as Coordinate Descent. Optimize the objective function \min_{\mu_1,\ldots,\mu_k} \min_{a_1,\ldots,a_N} F(\mu, a) = \min_{\mu_1,\ldots,\mu_k} \min_{a_1,\ldots,a_N} \sum_{i=1}^{N} \sum_{j=1}^{k} a_{ij} \|x_i - \mu_j\|^2. Step 1: fix \mu, optimize a (or C). (C) Dhruv Batra Slide Credit: Carlos Guestrin 16

K-means as Coordinate Descent. Optimize the objective function \min_{\mu_1,\ldots,\mu_k} \min_{a_1,\ldots,a_N} F(\mu, a) = \min_{\mu_1,\ldots,\mu_k} \min_{a_1,\ldots,a_N} \sum_{i=1}^{N} \sum_{j=1}^{k} a_{ij} \|x_i - \mu_j\|^2. Step 2: fix a (or C), optimize \mu. (C) Dhruv Batra Slide Credit: Carlos Guestrin 17
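A short worked step, added here for completeness (not on the original slide): with the assignments a fixed, the objective decouples across centers, and setting the gradient with respect to \mu_j to zero, \nabla_{\mu_j} F(\mu, a) = 2 \sum_{i=1}^{N} a_{ij} (\mu_j - x_i) = 0, gives \mu_j = \frac{\sum_{i} a_{ij} x_i}{\sum_{i} a_{ij}}, i.e., the centroid of the points assigned to cluster j. This is exactly the Recenter step.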

One important use of K-means: bag-of-words models in computer vision. (C) Dhruv Batra 18

Bag of Words model (word counts for one document): aardvark 0, about 2, all 2, Africa 1, apple 0, anxious 0, ..., gas 1, ..., oil 1, Zaire 0. (C) Dhruv Batra Slide Credit: Carlos Guestrin 19
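For concreteness, here is a toy sketch of how such a count vector arises; the vocabulary and document below are made up for illustration.

```python
from collections import Counter

vocabulary = ["aardvark", "about", "africa", "apple", "gas", "oil", "zaire"]
doc = "about the oil and gas fields in africa ... about exports"

# Count word occurrences, then read off counts in vocabulary order
counts = Counter(doc.lower().split())
bow_vector = [counts.get(word, 0) for word in vocabulary]
print(bow_vector)  # [0, 2, 1, 0, 1, 1, 0]
```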

Object Bag of words. Slide Credit: Fei-Fei Li

Slide Credit: Fei-Fei Li

Interest Point Features: detect patches [Mikolajczyk and Schmid 02] [Matas et al. 02] [Sivic et al. 03], normalize each patch, and compute a SIFT descriptor [Lowe 99]. Slide credit: Josef Sivic

Patch Features Slide credit: Josef Sivic

Dictionary formation. Slide credit: Josef Sivic

Clustering (usually k-means) Vector quantization Slide credit: Josef Sivic
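The quantization step above is easy to sketch in code. Below is a hedged example (the function names and the use of scikit-learn's KMeans are my choices, not from the slides): cluster pooled local descriptors into a visual dictionary, then describe an image by its normalized codeword histogram.

```python
import numpy as np
from sklearn.cluster import KMeans

def build_visual_dictionary(all_descriptors, n_words=500):
    # all_descriptors: (M, 128) array of descriptors pooled from many training images
    return KMeans(n_clusters=n_words, n_init=10).fit(all_descriptors)

def bag_of_visual_words(image_descriptors, dictionary):
    # Assign each descriptor to its nearest codeword, then count frequencies
    words = dictionary.predict(image_descriptors)
    hist = np.bincount(words, minlength=dictionary.n_clusters)
    return hist / max(hist.sum(), 1)  # normalized codeword histogram
```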

Clustered Image Patches Fei-Fei et al. 2005

Image representation: a histogram of codeword frequencies (frequency vs. codewords). Slide Credit: Fei-Fei Li

(One) bad case for K-means: clusters may overlap, and some clusters may be wider than others. GMMs to the rescue! (C) Dhruv Batra Slide Credit: Carlos Guestrin 28

GMM [figure] (C) Dhruv Batra Figure Credit: Kevin Murphy 29

Recall Multi-variate Gaussians (C) Dhruv Batra 30

GMM [figure] (C) Dhruv Batra Figure Credit: Kevin Murphy 31

Hidden Data Causes Problems #1: the fully observed (log) likelihood factorizes; the marginal (log) likelihood doesn't factorize, so all parameters are coupled! (C) Dhruv Batra 32
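Concretely (GMM notation assumed from the surrounding slides): with z_i observed, \ell_{\text{obs}}(\theta) = \sum_{i=1}^{N} \log \left[ \pi_{z_i} \, \mathcal{N}(x_i \mid \mu_{z_i}, \Sigma_{z_i}) \right] splits into separate terms per component, while the marginal likelihood \ell_{\text{marg}}(\theta) = \sum_{i=1}^{N} \log \sum_{j=1}^{k} \pi_j \, \mathcal{N}(x_i \mid \mu_j, \Sigma_j) has the sum inside the log, which ties all the \pi_j, \mu_j, \Sigma_j together.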

GMM vs Gaussian Joint Bayes Classifier (on board): observed Y vs unobserved Z; likelihood vs marginal likelihood. (C) Dhruv Batra 33

Hidden Data Causes Problems #2 [figure] (C) Dhruv Batra Figure Credit: Kevin Murphy 34

Hidden Data Causes Problems #2: Identifiability [figure: likelihood surface over \mu_1 and \mu_2] (C) Dhruv Batra Figure Credit: Kevin Murphy 35

Hidden Data Causes Problems #3: the likelihood has singularities if one Gaussian collapses. [figure: p(x) vs. x] (C) Dhruv Batra 36

Special case: spherical Gaussians and hard assignments. If P(X \mid Z = k) is spherical, with the same \sigma for all classes: P(x_i \mid z = j) \propto \exp\left(-\frac{1}{2\sigma^2} \|x_i - \mu_j\|^2\right). If each x_i belongs to exactly one class C(i) (hard assignment), the marginal likelihood is \prod_{i=1}^{N} \sum_{j=1}^{k} P(x_i, z = j) \propto \prod_{i=1}^{N} \exp\left(-\frac{1}{2\sigma^2} \|x_i - \mu_{C(i)}\|^2\right). Maximum (marginal) likelihood estimation is the same as K-means! (C) Dhruv Batra Slide Credit: Carlos Guestrin 37
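To spell out the last claim (an added step, not on the original slide): taking the log of that hard-assignment likelihood gives \log \prod_{i=1}^{N} P(x_i, z = C(i)) = -\frac{1}{2\sigma^2} \sum_{i=1}^{N} \|x_i - \mu_{C(i)}\|^2 + \text{const}, so maximizing it over \mu and C is exactly minimizing the K-means objective F(\mu, C).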

The K-means GMM assumption: there are k components; component i has an associated mean vector \mu_i. [figure: component means \mu_1, \mu_2, \mu_3] (C) Dhruv Batra Slide Credit: Carlos Guestrin 38

The K-means GMM assumption: there are k components; component i has an associated mean vector \mu_i; each component generates data from a Gaussian with mean \mu_i and covariance matrix \sigma^2 I. Each data point is generated according to the following recipe: [figure: component means \mu_1, \mu_2, \mu_3] (C) Dhruv Batra Slide Credit: Carlos Guestrin 39

The K-means GMM assumption: there are k components; component i has an associated mean vector \mu_i; each component generates data from a Gaussian with mean \mu_i and covariance matrix \sigma^2 I. Each data point is generated according to the following recipe: 1. Pick a component at random: choose component i with probability P(y=i). [figure: component mean \mu_2] (C) Dhruv Batra Slide Credit: Carlos Guestrin 40

The K-means GMM assumption: there are k components; component i has an associated mean vector \mu_i; each component generates data from a Gaussian with mean \mu_i and covariance matrix \sigma^2 I. Each data point is generated according to the following recipe: 1. Pick a component at random: choose component i with probability P(y=i). 2. Datapoint x \sim \mathcal{N}(\mu_i, \sigma^2 I). [figure: a sampled point x near \mu_2] (C) Dhruv Batra Slide Credit: Carlos Guestrin 41

The General GMM assumption: there are k components; component i has an associated mean vector \mu_i; each component generates data from a Gaussian with mean \mu_i and covariance matrix \Sigma_i. Each data point is generated according to the following recipe: 1. Pick a component at random: choose component i with probability P(y=i). 2. Datapoint x \sim \mathcal{N}(\mu_i, \Sigma_i). [figure: component means \mu_1, \mu_2, \mu_3] (C) Dhruv Batra Slide Credit: Carlos Guestrin 42
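To make the generative recipe concrete, here is a minimal sampling sketch; all parameter values are made-up examples and sample_gmm is an illustrative name, not code from the lecture.

```python
import numpy as np

def sample_gmm(n, pis, mus, Sigmas, seed=0):
    """Draw n points from a GMM: pick component z with probability pis[z], then x ~ N(mus[z], Sigmas[z])."""
    rng = np.random.default_rng(seed)
    zs = rng.choice(len(pis), size=n, p=pis)                  # step 1: pick components
    X = np.array([rng.multivariate_normal(mus[z], Sigmas[z])  # step 2: Gaussian draw
                  for z in zs])
    return X, zs

# Made-up 2-D example with k = 3 components
pis = [0.5, 0.3, 0.2]
mus = [np.array([0.0, 0.0]), np.array([5.0, 5.0]), np.array([-4.0, 3.0])]
Sigmas = [np.eye(2), 0.5 * np.eye(2), np.diag([2.0, 0.5])]
X, z = sample_gmm(500, pis, mus, Sigmas)
```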