Unsupervised Learning: Clustering


Unsupervised Learning: Clustering Vibhav Gogate The University of Texas at Dallas Slides adapted from Carlos Guestrin, Dan Klein & Luke Zettlemoyer

Machine Learning: supervised learning, unsupervised learning, and reinforcement learning; methods can be parametric or non-parametric. Within supervised learning, when Y is continuous: Gaussians (learned in closed form) and linear functions (learned either in closed form or using gradient descent). When Y is discrete: decision trees (greedy search; pruning); probability of class given features (1. learn P(Y) and P(X|Y), then apply Bayes' rule; 2. learn P(Y|X) with gradient descent); and non-probabilistic classifiers (linear: perceptron, gradient descent; nonlinear: neural nets with backprop; support vector machines).

Overview of Learning. Organized by the type of supervision (e.g., experience or feedback) and what is being learned: with labeled examples, learning a discrete function is classification, learning a continuous function is regression, and learning a policy is apprenticeship learning; with a reward signal, it is reinforcement learning; with no supervision at all, it is clustering.

Clustering systems: unsupervised learning. Clustering requires data, but no labels. It detects patterns, e.g. grouping emails or search results, customer shopping patterns, or program executions (intrusion detection). Useful when you don't know what you're looking for. But: you often get gibberish.

Clustering. Basic idea: group together similar instances. Example: 2D point patterns. What could "similar" mean? One option: small (squared) Euclidean distance, dist(x, y) = ||x - y||^2.

Outline: K-means clustering; agglomerative clustering; Expectation Maximization (EM).

K-Means: Algorithm. An iterative clustering algorithm: pick K random points as cluster centers (means), then alternate between (1) assigning each data instance to the closest cluster center and (2) changing each cluster center to the average of its assigned points. Stop when no point's assignment changes. A minimal sketch of this loop follows.
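
A minimal NumPy sketch of the loop described above. The function name kmeans and the choice to sample K data points as the initial means are illustrative assumptions, not something prescribed by the slides.

```python
import numpy as np

def kmeans(X, K, max_iters=100, seed=0):
    """Basic K-means: X is an (N, d) array of points, K the number of clusters."""
    rng = np.random.default_rng(seed)
    # Pick K random data points as the initial cluster centers (means).
    means = X[rng.choice(len(X), size=K, replace=False)].copy()
    assignments = None
    for _ in range(max_iters):
        # Phase I: assign each point to its closest mean (squared Euclidean distance).
        dists = ((X[:, None, :] - means[None, :, :]) ** 2).sum(axis=2)  # (N, K)
        new_assignments = dists.argmin(axis=1)
        if assignments is not None and np.array_equal(new_assignments, assignments):
            break  # no point's assignment changed: converged
        assignments = new_assignments
        # Phase II: move each mean to the average of its assigned points.
        for k in range(K):
            members = X[assignments == k]
            if len(members) > 0:          # keep the old mean if a cluster is empty
                means[k] = members.mean(axis=0)
    return assignments, means

# Example usage on two well-separated 2D blobs:
# X = np.vstack([np.random.randn(50, 2), np.random.randn(50, 2) + 5])
# labels, centers = kmeans(X, K=2)
```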

K-means clustering: Example. Pick K random points as cluster centers (means).

K-means clustering: Example. Iterative Step 1: assign data instances to the closest cluster center.

K-means clustering: Example. Iterative Step 2: change each cluster center to the average of its assigned points.

K-means clustering: Example. Repeat until convergence (the remaining example slides show further iterations of these two steps).

Example: K-Means for Segmentation (K=2). The goal of segmentation is to partition an image into regions, each of which has a reasonably homogeneous visual appearance (original image shown alongside the K=2 result).

Example: K-Means for Segmentation. Results for K=2, K=3, and K=10, shown alongside the original image.

Example: K-Means for Segmentation. Results for K=2 (4%), K=3 (8%), and K=10 (17%), shown alongside the original image.
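
One way such a segmentation can be computed is sketched below. This uses scikit-learn's KMeans purely for brevity (the slides do not reference any library), and `image` is a random placeholder standing in for a real RGB image.

```python
import numpy as np
from sklearn.cluster import KMeans

# Assume `image` is an (H, W, 3) float array of RGB values in [0, 1];
# a random placeholder is used here only so the snippet runs.
H, W = 64, 64
image = np.random.rand(H, W, 3)

pixels = image.reshape(-1, 3)                      # one row per pixel
km = KMeans(n_clusters=2, n_init=10).fit(pixels)   # cluster pixel colors
segmented = km.cluster_centers_[km.labels_].reshape(H, W, 3)  # each pixel -> its cluster mean
```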

K-Means as Optimization. Consider the total distance of the points to their assigned means: phi(a, c) = sum_i ||x_i - c_{a_i}||^2, where the x_i are the points, the a_i the assignments, and the c_k the means. There are two stages in each iteration: update assignments (fix the means c, change the assignments a) and update means (fix the assignments a, change the means c). This is coordinate descent on phi. Will it converge? Yes, if you can argue that neither update can increase phi.

Phase I: Update Assignments. For each point, re-assign it to the closest mean: this can only decrease the total distance phi!

Phase II: Update Means. Move each mean to the average of its assigned points: this also can only decrease the total distance. (Why? Fun fact: the point y with minimum squared Euclidean distance to a set of points {x_i} is their mean.)
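
That fun fact follows from a one-line calculation (not shown on the slides): set the gradient of the total squared distance to zero.

```latex
\frac{\partial}{\partial y}\sum_i \|x_i - y\|^2
  = \sum_i -2\,(x_i - y) = 0
  \quad\Longrightarrow\quad
  y = \frac{1}{n}\sum_i x_i .
```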

Initialization. K-means is non-deterministic: it requires initial means, and it does matter which ones you pick! What can go wrong? A bad initialization can leave the algorithm stuck with poorly placed centers. There are various schemes for preventing this kind of thing: variance-based split/merge, initialization heuristics.
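
The slides do not name a specific heuristic, so the sketch below assumes a k-means++-style seeding rule: pick the first center uniformly at random, then pick each subsequent center with probability proportional to its squared distance from the nearest center chosen so far.

```python
import numpy as np

def init_means_plus_plus(X, K, seed=0):
    """k-means++-style seeding: spread the initial centers out."""
    rng = np.random.default_rng(seed)
    means = [X[rng.integers(len(X))]]                 # first center: uniform at random
    for _ in range(K - 1):
        # Squared distance from each point to its nearest already-chosen center.
        d2 = np.min([((X - m) ** 2).sum(axis=1) for m in means], axis=0)
        probs = d2 / d2.sum()
        means.append(X[rng.choice(len(X), p=probs)])  # far-away points are more likely
    return np.array(means)
```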

K-Means Getting Stuck A local optimum:

K-Means Questions Will K-means converge? To a global optimum? Will it always find the true patterns in the data? If the patterns are very very clear? Runtime? Do people ever use it? How many clusters to pick?

Agglomerative Clustering. Agglomerative clustering: first merge very similar instances, then incrementally build larger clusters out of smaller clusters. Algorithm: maintain a set of clusters; initially, each instance is in its own cluster. Repeat: pick the two closest clusters and merge them into a new cluster. Stop when there is only one cluster left. Produces not one clustering, but a family of clusterings represented by a dendrogram.

Agglomerative Clustering How should we define closest for clusters with multiple elements?

Agglomerative Clustering. How should we define "closest" for clusters with multiple elements? Many options: closest pair (single-link clustering), farthest pair (complete-link clustering), average of all pairs, or Ward's method (minimum variance, like k-means), which finds the pair of clusters whose merge leads to the minimum increase in total within-cluster distance. Different choices create different clustering behaviors; a small sketch of these linkage options follows.
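
A hedged sketch of these linkage choices using SciPy's hierarchical-clustering routines (the slides do not prescribe any library); 'single', 'complete', 'average', and 'ward' correspond to the four options above.

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

# Toy data: two loose groups of 2D points.
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0, 1, (20, 2)), rng.normal(6, 1, (20, 2))])

for method in ["single", "complete", "average", "ward"]:
    Z = linkage(X, method=method)                     # dendrogram: a family of clusterings
    labels = fcluster(Z, t=2, criterion="maxclust")   # cut it into 2 clusters
    print(method, np.bincount(labels)[1:])            # cluster sizes under each linkage
```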

Clustering Behavior. Average, farthest, and nearest linkage on the mouse tumor data from [Hastie].

Agglomerative Clustering Questions Will agglomerative clustering converge? To a global optimum? Will it always find the true patterns in the data? Do people ever use it? How many clusters to pick?

EM: Soft Clustering. Clustering typically assumes that each instance is given a hard assignment to exactly one cluster. This does not allow uncertainty in class membership, or for an instance to belong to more than one cluster. It is problematic because data points that lie roughly midway between cluster centers are still assigned to exactly one of them. Soft clustering instead gives probabilities that an instance belongs to each of a set of clusters.

Probabilistic Clustering. Try a probabilistic model! This allows overlaps, clusters of different size, etc., and lets us tell a generative story for the data: P(X|Z) P(Z). Challenge: we need to estimate the model parameters without labeled Z's.

Z    X1    X2
?    0.1   2.1
?    0.5  -1.1
?    0.0   3.0
?   -0.1  -2.0
?    0.2   1.5

Finite Mixture Models. Given a dataset D = {x_1, ..., x_N}, a mixture model with K components has the form p(x) = sum_{k=1}^K alpha_k p_k(x | theta_k), where the alpha_k are the mixture weights (non-negative and summing to 1). Each point is assumed to be generated from exactly one mixture component: if z_k indicates membership in component k, only one of them equals 1 for any given point.

Finite Mixture Model: Probabilistic View. The membership weight w_ik = P(component k generated x_i | x_i, parameters) expresses our uncertainty about which of the K components generated the vector x_i.

Gaussian Mixture Models (GMMs). We can define a GMM by making each k-th component a Gaussian density with parameters mu_k (mean) and Sigma_k (covariance): p(x) = sum_k alpha_k N(x | mu_k, Sigma_k). Question: how do we learn these parameters from data?

Expectation Maximization

EM algorithm: Key Idea. Start with random parameters. Find a class for each example (E-step); since we are using probabilistic classification, each example is given a vector of probabilities. Now we have a supervised learning problem: estimate the parameters of the model using the maximum likelihood method (M-step). Iterate between the E-step and M-step until convergence.

EM: Two Easy Steps. E-step (yields an N x K matrix): compute the membership weights w_ik = alpha_k N(x_i | mu_k, Sigma_k) / sum_j alpha_j N(x_i | mu_j, Sigma_j) for all data points indexed by i and all mixture components indexed by k. M-step: use the membership weights and data to compute the new parameters: with N_k = sum_i w_ik, set alpha_k = N_k / N, mu_k = (1/N_k) sum_i w_ik x_i, and Sigma_k = (1/N_k) sum_i w_ik (x_i - mu_k)(x_i - mu_k)^T. A compact sketch of both steps follows.
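
A minimal NumPy sketch of these two steps for a GMM with full-covariance Gaussians and fixed K. The function name em_gmm, the initialization, and the small ridge added to the covariances are illustrative assumptions, not from the slides.

```python
import numpy as np
from scipy.stats import multivariate_normal

def em_gmm(X, K, n_iters=50, seed=0):
    """EM for a Gaussian mixture: X is (N, d); returns weights, means, covariances."""
    rng = np.random.default_rng(seed)
    N, d = X.shape
    alpha = np.full(K, 1.0 / K)                       # mixture weights
    mu = X[rng.choice(N, size=K, replace=False)]      # initial means: random data points
    Sigma = np.array([np.cov(X.T) + 1e-6 * np.eye(d) for _ in range(K)])
    for _ in range(n_iters):
        # E-step: membership weights w[i, k] = P(component k | x_i), an (N, K) matrix.
        w = np.column_stack([
            alpha[k] * multivariate_normal.pdf(X, mean=mu[k], cov=Sigma[k])
            for k in range(K)
        ])
        w /= w.sum(axis=1, keepdims=True)
        # M-step: re-estimate parameters from the weighted data.
        Nk = w.sum(axis=0)
        alpha = Nk / N
        mu = (w.T @ X) / Nk[:, None]
        for k in range(K):
            diff = X - mu[k]
            Sigma[k] = (w[:, k, None] * diff).T @ diff / Nk[k] + 1e-6 * np.eye(d)
    return alpha, mu, Sigma
```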

Gaussian Mixture Example: Start

After first iteration

After 2nd iteration

After 3rd iteration

After 4th iteration

After 5th iteration

After 6th iteration

After 20th iteration

Properties of EM. EM converges to a local optimum of the log-likelihood. This is because each iteration improves the log-likelihood; the proof is the same as for K-means: the E-step can never decrease the likelihood, and the M-step can never decrease the likelihood. If we make hard assignments instead of soft ones, the algorithm is equivalent to K-means!

What you should know. K-means for clustering: the algorithm converges because it is coordinate descent on the total distance phi. Know what agglomerative clustering is. EM for mixtures of Gaussians: remember, EM can get stuck in local optima, and empirically it DOES!