COMP 551 Applied Machine Learning Lecture 13: Unsupervised learning

COMP 551 Applied Machine Learning Lecture 13: Unsupervised learning Associate Instructor: Herke van Hoof (herke.vanhoof@mail.mcgill.ca) Slides mostly by: Joelle Pineau (jpineau@cs.mcgill.ca) Class web page: www.cs.mcgill.ca/~jpineau/comp551 Unless otherwise noted, all material posted for this course is copyright of the instructor and cannot be reused or reposted without the instructor's written permission.

Today's quiz The hinge loss function is convex. True. The number of Lagrange multipliers in the soft SVM problem is determined by the number of features in the input set. False. A quadratic programming problem can be solved in polynomial time. True. SVM with a Gaussian kernel requires specification of the variance of the kernel; this can be selected with cross-validation. True. One disadvantage of the "kernel trick" is that the memory requirement grows linearly with the number of features computed. False. 2

Uploading code in CMT 3

What is unsupervised learning? Given only input data: D = <x_i>, i=1:n, find some patterns or regularity in the data. Typically use generative approaches: model the available data. Different classes of problems: 1. Clustering 2. Anomaly detection 3. Dimensionality reduction 4. Autoregression 4

A simple clustering example A fruit merchant approaches you with a set of apples to classify according to their variety. Tells you there are five varieties of apples in the dataset. Tells you the weight and colour of each apple in the dataset. Can you label each apple with the correct variety? What would you need to know / assume? Data = <x_1, ?>, <x_2, ?>, ..., <x_n, ?> 5

A simple clustering example You know there are 5 varieties. Assume each variety generates apples according to a (variety-specific) 2-D Gaussian distribution. 6

A simple clustering example You know there are 5 varieties. Assume each variety generates apples according to a (variety-specific) 2-D Gaussian distribution. If you know μ_i, σ_i² for each class, it's easy to classify the apples. If you know the class of each apple, it's easy to estimate μ_i, σ_i². 7

A simple clustering example You know there are 5 varieties. Assume each variety generates apples according to a (variety-specific) 2-D Gaussian distribution. If you know μ_i, σ_i² for each class, it's easy to classify the apples. If you know the class of each apple, it's easy to estimate μ_i, σ_i². What if we know neither? 8

A simple algorithm: K-means clustering Objective: Cluster n instances into K distinct classes. Preliminaries: Step 1: Pick the desired number of clusters, K. Step 2: Assume a parametric distribution for each class (e.g. Normal). Step 3: Randomly estimate the parameters of the K distributions. 9

A simple algorithm: K-means clustering Objective: Cluster n instances into K distinct classes. Preliminaries: Step 1: Pick the desired number of clusters, K. Step 2: Assume a parametric distribution for each class (e.g. Normal). Step 3: Randomly estimate the parameters of the K distributions. Iterate, until convergence: Step 4: Assign instances to the most likely classes based on the current parametric distributions. Step 5: Estimate the parametric distribution of each class based on the latest assignment. 10
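To make the two iterated steps concrete, here is a minimal K-means sketch in Python/NumPy. This is an illustrative sketch, not the course's reference implementation; the function name, the initialization at randomly chosen data points, and the iteration cap are assumptions.

```python
import numpy as np

def kmeans(X, K, n_iters=100, seed=0):
    """Minimal K-means (Lloyd's algorithm): hard assignment + centroid update."""
    rng = np.random.default_rng(seed)
    # Step 3: initialize the K centers at randomly chosen data points
    centers = X[rng.choice(len(X), size=K, replace=False)]
    for _ in range(n_iters):
        # Step 4: assign each point to the closest center (hard assignment)
        dists = np.linalg.norm(X[:, None, :] - centers[None, :, :], axis=2)  # (n, K)
        labels = dists.argmin(axis=1)
        # Step 5: re-estimate each center as the mean of the points it owns
        new_centers = np.array([X[labels == k].mean(axis=0) if np.any(labels == k)
                                else centers[k] for k in range(K)])
        if np.allclose(new_centers, centers):  # converged: centers stopped moving
            break
        centers = new_centers
    return centers, labels
```

For the apple example one would call kmeans(X, K=5) on the n-by-2 matrix of weights and colours.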

K-means algorithm This data could easily be modeled by Gaussians. 1. Ask user how many clusters. Image courtesy of Andrew Moore, Carnegie Mellon U. 11

K-means algorithm This data could easily be modeled by Gaussians. 1. Ask user how many clusters. 2. Randomly guess k centers: {μ_1, ..., μ_k} (assume σ² is known). Image courtesy of Andrew Moore, Carnegie Mellon U. 12

K-means algorithm This data could easily be modeled by Gaussians. 1. Ask user how many clusters. 2. Randomly guess k centers: {μ_1, ..., μ_k} (assume σ² is known). 3. Assign each data point to the closest center. Image courtesy of Andrew Moore, Carnegie Mellon U. 13

K-means algorithm This data could easily be modeled by Gaussians. 1. Ask user how many clusters. 2. Randomly guess k centers: {μ_1, ..., μ_k} (assume σ² is known). 3. Assign each data point to the closest center. 4. Each center finds the centroid of the points it owns. Image courtesy of Andrew Moore, Carnegie Mellon U. 14

K-means algorithm This data could easily be modeled by Gaussians. 1. Ask user how many clusters. 2. Randomly guess k centers: {μ_1, ..., μ_k} (assume σ² is known). 3. Assign each data point to the closest center. 4. Each center finds the centroid of the points it owns and jumps there. Image courtesy of Andrew Moore, Carnegie Mellon U. 15

K-means algorithm starts Image courtesy of Andrew Moore, Carnegie Mellon U. 16

K-means algorithm continues (2) Image courtesy of Andrew Moore, Carnegie Mellon U. 17

K-means algorithm continues (3) Image courtesy of Andrew Moore, Carnegie Mellon U. 18

K-means algorithm continues (4) Image courtesy of Andrew Moore, Carnegie Mellon U. 19

K-means algorithm continues (5) Image courtesy of Andrew Moore, Carnegie Mellon U. 20

K-means algorithm continues (6) Image courtesy of Andrew Moore, Carnegie Mellon U. 21

K-means algorithm continues (7) Image courtesy of Andrew Moore, Carnegie Mellon U. 22

K-means algorithm continues (8) Image courtesy of Andrew Moore, Carnegie Mellon U. 23

K-means algorithm continues (9) Image courtesy of Andrew Moore, Carnegie Mellon U. 24

K-means algorithm terminates Image courtesy of Andrew Moore, Carnegie Mellon U. 25

A simple algorithm: K-means clustering Objective: Cluster n instances into K distinct classes. Preliminaries: Step 1: Pick the desired number of clusters, K. Step 2: Assume a parametric distribution for each class (e.g. Normal). Step 3: Randomly estimate the parameters of the K distributions. Iterate, until convergence: Step 4: Assign instances to the most likely classes based on the current parametric distributions. [Hard assignment] Step 5: Estimate the parametric distribution of each class based on the latest assignment. [Maximization step] 26

Properties of K-means Optimality? 27

Properties of K-means Optimality? Converges to a local optimum. Can use random re-starts to get a better local optimum. Alternatively, can choose your initial centers carefully: Place μ_1 on top of a randomly chosen datapoint. Place μ_2 on top of the datapoint that is furthest from μ_1. Place μ_3 on top of the datapoint that is furthest from both μ_1 and μ_2. 28

Properties of K-means Optimality? Converges to a local optimum. Can use random re-starts to get a better local optimum. Alternatively, can choose your initial centers carefully: Place μ_1 on top of a randomly chosen datapoint. Place μ_2 on top of the datapoint that is furthest from μ_1. Place μ_3 on top of the datapoint that is furthest from both μ_1 and μ_2. Complexity? 29

Properties of K-means Optimality? Converges to a local optimum. Can use random re-starts to get a better local optimum. Alternatively, can choose your initial centers carefully: Place μ_1 on top of a randomly chosen datapoint. Place μ_2 on top of the datapoint that is furthest from μ_1. Place μ_3 on top of the datapoint that is furthest from both μ_1 and μ_2. Complexity? O(knm) per iteration, where k = #centers, n = #datapoints, m = dimensionality of data. 30

Properties of K-means Optimality? Converges to a local optimum. Can use random re-starts to get a better local optimum. Alternatively, can choose your initial centers carefully: Place μ_1 on top of a randomly chosen datapoint. Place μ_2 on top of the datapoint that is furthest from μ_1. Place μ_3 on top of the datapoint that is furthest from both μ_1 and μ_2. Complexity? O(knm) per iteration, where k = #centers, n = #datapoints, m = dimensionality of data. 31
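As a sketch of the careful initialization described above, a farthest-point heuristic (related in spirit to k-means++, though deterministic after the first random pick); the function name and the NumPy data matrix X are illustrative assumptions:

```python
import numpy as np

def farthest_point_init(X, K, seed=0):
    """Pick the first center at random, then repeatedly pick the point
    farthest from all centers chosen so far."""
    rng = np.random.default_rng(seed)
    centers = [X[rng.integers(len(X))]]          # mu_1: a randomly chosen datapoint
    for _ in range(K - 1):
        # distance of every point to its nearest already-chosen center
        dists = np.min(np.linalg.norm(X[:, None, :] - np.array(centers)[None, :, :],
                                      axis=2), axis=1)
        centers.append(X[dists.argmax()])        # next center: the farthest point
    return np.array(centers)
```

The resulting centers can then be passed to the K-means loop in place of random initialization.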

K-means: the good and the bad Good: We realize that maximizing over the parameters is easy once we have assignments. Bad: What about points that are about equally far from two clusters? We can only update the mean (not the variance). We have to assume equal variance between clusters. Copyright C.M. Bishop, PRML 32

Questions about K-means? 33

Beyond K-means How to fit data where variance is unknown or non-identical between clusters? Copyright C.M. Bishop, PRML 34

Gaussian Mixture Model Idea: Fit data with a combination of Gaussian distributions. What defines a set of Gaussians? 35

Gaussian Mixture Model Idea: Fit data with a combination of Gaussian distributions. Write p(x) as a linear combination of Gaussians: p(x) = Σ_{k=1:K} p(z_k) p(x | z_k), where p(z_k) is the probability of the k-th mixture component and p(x | z_k) = N(x | μ_k, σ_k²) is the probability of x under the k-th mixture component. Determining p(z | x) is easy once we know the parameters p(z_k), μ_k, σ_k² (Bayes rule). 36
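That Bayes-rule step can be written down in a few lines; the following sketch assumes one-dimensional components and NumPy arrays pi (the p(z_k)), mu, and sigma, all illustrative names rather than the course's code:

```python
import numpy as np
from scipy.stats import norm

def responsibilities(x, pi, mu, sigma):
    """p(z_k | x) for each component k, via Bayes rule:
    p(z_k | x) is proportional to p(z_k) * N(x | mu_k, sigma_k^2)."""
    joint = pi * norm.pdf(x, loc=mu, scale=sigma)  # p(z_k) p(x | z_k), shape (K,)
    return joint / joint.sum()                     # normalize over the K components

# e.g. responsibilities(1.3, pi=np.array([0.5, 0.5]),
#                       mu=np.array([0.0, 2.0]), sigma=np.array([1.0, 1.0]))
```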

Gaussian Mixture Model Maximum likelihood often gives a good parameter estimate: p(x | θ) = Σ_Z p(x, Z | θ). Why is it hard here? 37

Expectation Maximization (more generally) Iterative method for learning the maximum likelihood estimate of a probabilistic model, when the model contains unobservable variables. 38

Expectation Maximization (more generally) Iterative method for learning the maximum likelihood estimate of a probabilistic model, when the model contains unobservable variables. Main idea: If we knew all variables (e.g. cluster assignments), we could easily maximize the likelihood. With unobserved variables, we fantasize how the data should look based on the current parameter setting, i.e. compute the Expected sufficient statistics. Then we Maximize the parameter setting based on these statistics. 39

EM for clustering Objective: Cluster n instances into K distinct classes. Preliminaries: Step 1: Pick the desired number of clusters, K. Step 2: Assume a parametric distribution for each class (e.g. Normal). Step 3: Randomly initialize the parameters of the K distributions. Iterate, until convergence: Step 4: Assign responsibility for instances to classes based on the current parametric distributions: γ_j^i = p(z_i | x_j, θ_old). [Soft assignment] Step 5: Estimate the parametric distribution of each class based on the latest assignment. [Maximization step] 40
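Putting the soft-assignment and maximization steps together, here is a minimal EM loop for a one-dimensional Gaussian mixture; the initialization scheme, iteration count, and small variance floor are illustrative assumptions, not the course's reference code:

```python
import numpy as np
from scipy.stats import norm

def em_gmm_1d(x, K, n_iters=50, seed=0):
    """EM for a 1-D Gaussian mixture: E-step = soft assignment, M-step = re-estimate."""
    rng = np.random.default_rng(seed)
    pi = np.full(K, 1.0 / K)                     # mixing probabilities p(z_k)
    mu = rng.choice(x, size=K, replace=False)    # initialize means at random data points
    sigma = np.full(K, x.std() + 1e-6)           # common initial spread
    for _ in range(n_iters):
        # E-step: responsibilities gamma[j, k] = p(z_k | x_j, theta_old)
        joint = pi * norm.pdf(x[:, None], loc=mu, scale=sigma)   # shape (n, K)
        gamma = joint / joint.sum(axis=1, keepdims=True)
        # M-step: weighted maximum likelihood estimates of pi, mu, sigma
        Nk = gamma.sum(axis=0)                   # effective counts per component
        pi = Nk / len(x)
        mu = (gamma * x[:, None]).sum(axis=0) / Nk
        sigma = np.sqrt((gamma * (x[:, None] - mu) ** 2).sum(axis=0) / Nk) + 1e-6
    return pi, mu, sigma
```

For multivariate data the structure is the same, with mean vectors and covariance matrices in place of mu and sigma.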

EM for clustering Copyright C.M. Bishop, PRML 41

Expectation Maximization (more generally) Start with some initial parameter setting. Repeat (as long as desired): Expectation (E) step: Complete the data by assigning values to the missing items: Q(θ, θ_old) = Σ_Z p(Z | X, θ_old) log p(X, Z | θ). Maximization (M) step: Compute the maximum likelihood parameter setting based on the completed data: θ_new = argmax_θ Q(θ, θ_old). Once the data is completed (E-step), computing the log-likelihood and new parameters (M-step) is easy! This is what we did for K-means. 42

Why does it work? Instead of p(X | θ), we maximize Q(θ, θ_old) = Σ_Z p(Z | X, θ_old) log p(X, Z | θ). The original objective still improves if: the maximum of Q is also the maximum of the lower bound; the lower bound is exact at θ_old (not at a local maximum). 43 Copyright C.M. Bishop, PRML

Expectation Maximization: Properties Likelihood function is guaranteed to improve (or stay the same) with each iteration. Iterations can stop when no more improvements are achieved. Convergence to a local optimum of the likelihood function. Re-starts with different initial parameters are often necessary. Time complexity (per iteration) depends on model structure. EM is very useful in practice! 44

K-means or EM? K-means can be seen as a special case of EM (where the variance is fixed to a value that decreases to 0). K-means tends to converge faster. EM can deal with unknown or non-identical variance. K-means is sometimes used to initialize EM. 45

Anomaly detection http://www.anomalydetectionresearch.com 46

Anomaly detection Discriminative approaches tend to be ineffective when one class is much more rare than the other. 47

Anomaly detection Discriminative approaches tend to be ineffective when one class is much more rare than the other. A simple generative approach: Fit a model p(x) using the input data. Set a decision threshold ε and predict Y = {1 if p(x) > ε, 0 otherwise}. Use a validation set to measure performance (can use cross-validation to set ε). 48

Anomaly detection vs Supervised learning Anomaly detection: small number of positive examples (e.g. <10); large number of negative examples (e.g. >100). Supervised learning: similar number of positive and negative examples. http://opencourseonline.com/400/coursera-open-course-stanford-university-machine-learningvideo-playlist-15-anomaly-detection 49

Anomaly detection vs Supervised learning Anomaly detection: small number of positive examples (e.g. <10); large number of negative examples (e.g. >100); many different types of anomalies, so we don't want to fit a model for the positive class. Supervised learning: similar number of positive and negative examples; more homogeneity within classes, or enough data to sufficiently characterize each class. http://opencourseonline.com/400/coursera-open-course-stanford-university-machine-learningvideo-playlist-15-anomaly-detection 50

A simple example Does the distribution of nominal data look familiar? From: M. Zhao and V. Saligrama, Anomaly Detection with Score functions based on Nearest Neighbor Graphs, Neural Information Processing Systems (NIPS) Conference, 2009 51

A simple example Another GMM! From: M. Zhao and V. Saligrama, Anomaly Detection with Score functions based on Nearest Neighbor Graphs, Neural Information Processing Systems (NIPS) Conference, 2009 52

A simple example The GMM can again be fit with EM. Another GMM! Note that before we were mainly interested in the cluster assignments (which items go together); here we are interested in the final density. From: M. Zhao and V. Saligrama, Anomaly Detection with Score functions based on Nearest Neighbor Graphs, Neural Information Processing Systems (NIPS) Conference, 2009 53

Anomaly detection Discriminative approaches tend to be ineffective when one class is much more rare than the other. A simple generative approach: Fit a model p(x) using the input data (using a GMM?). Set a decision threshold ε and predict Y = {1 if p(x) > ε, 0 otherwise}. Use a validation set to measure performance (can use cross-validation to set ε). Note: the GMM is used here to model the nominal data only. We don't attempt to model the anomalous data (see above). 54
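A minimal sketch of this recipe using scikit-learn's GaussianMixture (one possible GMM implementation; the toy data, the number of components, and the threshold value are all illustrative assumptions):

```python
import numpy as np
from sklearn.mixture import GaussianMixture

rng = np.random.default_rng(0)
X_nominal = rng.normal(size=(500, 2))                        # stand-in for nominal training data
X_new = np.vstack([rng.normal(size=(5, 2)), [[8.0, 8.0]]])   # last point is an obvious outlier

# Fit p(x) on the nominal data only; the anomalous class is never modeled.
gmm = GaussianMixture(n_components=3, random_state=0).fit(X_nominal)

log_eps = np.log(1e-3)                   # decision threshold (tune on a validation set)
log_p = gmm.score_samples(X_new)         # log p(x) for each new point
y_pred = (log_p > log_eps).astype(int)   # 1 = nominal (p(x) > eps), 0 = flagged as anomalous
print(y_pred)
```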

Practical issues Need p(x) to be low for anomalous examples. Apply techniques for construction/selection of features to achieve this. Need a validation set to select features and learning parameters. 55
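One hedged way to implement the validation-set step above: scan candidate thresholds on held-out log-densities and keep the one with the best score. Using F1 on the rare class is an assumption here; the labels y_val = 1 for nominal, 0 for anomaly follow the convention on the earlier slide.

```python
import numpy as np
from sklearn.metrics import f1_score

def select_epsilon(log_p_val, y_val, n_candidates=100):
    """Pick the log-density threshold that maximizes F1 on a labeled validation set."""
    best_eps, best_f1 = None, -1.0
    for eps in np.linspace(log_p_val.min(), log_p_val.max(), n_candidates):
        y_pred = (log_p_val > eps).astype(int)     # same decision rule as before
        f1 = f1_score(y_val, y_pred, pos_label=0)  # score detection of the rare (anomaly) class
        if f1 > best_f1:
            best_eps, best_f1 = eps, f1
    return best_eps, best_f1
```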

Dimensionality reduction Given points in an m-dimensional space (for large m), project to a low dimensional space while preserving trends in the data. Principal Components Analysis Covered in detail in Lecture 9! 56
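As a quick reminder of what this looks like in code (a sketch using scikit-learn's PCA; the random data and the choice of 2 components are illustrative):

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 50))       # stand-in for data in an m = 50 dimensional space

pca = PCA(n_components=2).fit(X)     # learn the top-2 principal directions
Z = pca.transform(X)                 # project to 2-D, preserving as much variance as possible
print(Z.shape, pca.explained_variance_ratio_)
```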

Dimensionality reduction Learn neural networks to perform dimensionality reduction: auto-encoding. [Figure: input -> encoder -> code -> decoder -> output] Objective: recover the input as output. Covered in detail in the neural network lecture. 57
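For intuition, here is a minimal sketch of the idea with purely linear encoder and decoder layers in NumPy; real autoencoders use nonlinear hidden layers and a deep-learning framework, and the function name, learning rate, and iteration count here are illustrative assumptions:

```python
import numpy as np

def linear_autoencoder(X, code_dim, lr=1e-3, n_iters=2000, seed=0):
    """Tiny linear autoencoder trained by gradient descent on the squared reconstruction error.
    (With purely linear units this recovers the same subspace as PCA, up to a rotation of the code.)"""
    rng = np.random.default_rng(seed)
    n, d = X.shape
    W_enc = rng.normal(scale=0.1, size=(d, code_dim))   # encoder: input -> code
    W_dec = rng.normal(scale=0.1, size=(code_dim, d))   # decoder: code -> reconstruction
    for _ in range(n_iters):
        Z = X @ W_enc                                   # code
        err = Z @ W_dec - X                             # reconstruction error
        grad_dec = (Z.T @ err) / n                      # gradient of mean squared error w.r.t. W_dec
        grad_enc = (X.T @ (err @ W_dec.T)) / n          # gradient w.r.t. W_enc
        W_dec -= lr * grad_dec
        W_enc -= lr * grad_enc                          # lr may need tuning for your data scale
    return W_enc, W_dec
```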

Autoregressive models for time series The problem: Given a time series X = {x_1, x_2, ..., x_T}, predict x_t from x_{1:t-1}. 58

Autoregressive models for time series The problem: Given a time series X = {x_1, x_2, ..., x_T}, predict x_t from x_{1:t-1}. A simple autoregressive (AR) model: x_t = w_0 + Σ_{i=1:p} w_i x_{t-i} + ε_t, where the w_i are the parameters and ε_t is white noise. 59

Autoregressive models for time series The problem: Given a time series X = {x_1, x_2, ..., x_T}, predict x_t from x_{1:t-1}. A simple autoregressive (AR) model: x_t = w_0 + Σ_{i=1:p} w_i x_{t-i} + ε_t, where the w_i are the parameters and ε_t is white noise. A more general model, the autoregressive-moving-average (ARMA) model: x_t = w_0 + ε_t + Σ_{i=1:p} w_i x_{t-i} + Σ_{i=1:q} θ_i ε_{t-i}, where the w_i and θ_i are the parameters, and the ε_t ~ N(0, σ²) are assumed to be i.i.d. samples from a normal distribution. 60
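A minimal sketch of fitting the AR(p) model above by least squares; the lag-matrix construction, the toy series, and the choice p = 2 are illustrative, not the course's code:

```python
import numpy as np

def fit_ar(x, p):
    """Least-squares fit of x_t = w_0 + sum_{i=1..p} w_i * x_{t-i} + eps_t."""
    T = len(x)
    # design matrix: a column of ones (for w_0) plus the p lagged values
    rows = [np.concatenate(([1.0], x[t - p:t][::-1])) for t in range(p, T)]
    A = np.array(rows)                 # shape (T - p, p + 1)
    y = x[p:]                          # targets x_t
    w, *_ = np.linalg.lstsq(A, y, rcond=None)
    return w                           # [w_0, w_1, ..., w_p]

# toy usage: a series generated by an AR(2) process
rng = np.random.default_rng(0)
x = np.zeros(500)
for t in range(2, 500):
    x[t] = 0.5 * x[t - 1] - 0.3 * x[t - 2] + rng.normal(scale=0.1)
print(fit_ar(x, p=2))                  # roughly [0, 0.5, -0.3]
```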

What you should know The general form of the unsupervised learning problem. Basic functioning and properties of useful algorithms: K-means, Expectation-Maximization. A useful model: Gaussian mixture models. Characteristics of common problems: clustering, anomaly detection, dimensionality reduction, autoregression, autoencoding. 61

Hierarchical clustering A hierarchy of clusters, where the clusters at each level are created by merging clusters from the next lower level. 62

Hierarchical clustering A hierarchy of clusters, where the clusters at each level are created by merging clusters from the next lower level. Two general approaches: Recursively merge a pair of clusters. Recursively split the existing clusters. 63

Hierarchical clustering A hierarchy of clusters, where the clusters at each level are created by merging clusters from the next lower level. Two general approaches: Recursively merge a pair of clusters. Recursively split the existing clusters. Use a dissimilarity measure to select split/merge pairs: Measure pairwise distance between points in the 2 clusters, e.g. Euclidean distance, Manhattan distance. Measure distance over entire clusters using a linkage criterion, e.g. Min/Max/Mean over pairs of points. 64
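A brief sketch of the merge-based (agglomerative) approach using SciPy, with average linkage over Euclidean distances as the illustrative choice of criterion; the toy data and the cut into 2 clusters are assumptions:

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0, 0.5, size=(20, 2)),       # two well-separated blobs as toy data
               rng.normal(4, 0.5, size=(20, 2))])

# 'average' linkage = mean pairwise Euclidean distance between clusters
Z = linkage(X, method='average', metric='euclidean')    # full merge tree (dendrogram)
labels = fcluster(Z, t=2, criterion='maxclust')          # cut the tree into 2 clusters
print(labels)
```

Cutting the merge tree (dendrogram) at different levels yields the nested clusterings described above.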

Hierarchical clustering of news articles http://nlp.stanford.edu/ir-book/html/htmledition/hierarchical-agglomerative-clustering-1.html 65