COMS 4771 Clustering. Nakul Verma


Supervised Learning
Data: labeled examples (x_1, y_1), ..., (x_n, y_n) from X x Y.
Assumption: there is a (relatively simple) function f: X -> Y such that f(x_i) = y_i for most i.
Learning task: given n examples from the data, find an approximation to f.
Goal: the approximation gives mostly correct predictions on unseen examples.
[Diagram: training phase, labeled training data (n examples) -> learning algorithm -> classifier; testing phase, unlabeled test data (unseen / future data) -> classifier -> prediction.]

Unsupervised Learning
Data: unlabeled examples x_1, ..., x_n.
Assumption: there is an underlying structure in the data.
Learning task: discover the structure given n examples from the data.
Goal: come up with a summary of the data using the discovered structure.
Partition the data into meaningful groups: clustering.
Find a low-dimensional representation that retains important information and suppresses irrelevant/noisy information: dimensionality reduction.
Let's take a closer look using an example.

Example: Handwritten Digits Revisited
Handwritten digit data, but with no labels. What can we do?
Suppose we know that there are 10 groupings; can we find the groups?
What if we don't know there are 10 groups? How can we discover/explore other structure in such data?
[Figure: a 2D visualization of the digits dataset.]

Handwritten digits visualization

Grouping the Data, aka Clustering
Data: x_1, ..., x_n.
Given: known target number of groups k.
Output: a partition of the data into k groups.
This is called the clustering problem, also known as unsupervised classification or quantization.

k-means
Given: data x_1, ..., x_n in R^d, and the intended number of groupings k.
Idea: find a set of representatives c_1, ..., c_k such that each data point is close to some representative.
Optimization: minimize over c_1, ..., c_k the objective sum_i min_j ||x_i - c_j||^2.
Unfortunately this is NP-hard, even for d = 2 and k = 2.
How do we optimize this? How do we solve the d = 1 or k = 1 case?
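
As a warm-up for the last question, here is a quick derivation (not on the slide) for the k = 1 case: the single optimal representative is the sample mean. For reference, the d = 1 case can also be solved exactly, e.g. by dynamic programming over the sorted points.

```latex
% k = 1: a single center c must minimize \sum_i \|x_i - c\|^2.
\nabla_c \sum_{i=1}^{n} \|x_i - c\|^2
  = -2 \sum_{i=1}^{n} (x_i - c) = 0
\quad\Longrightarrow\quad
c^{\ast} = \frac{1}{n} \sum_{i=1}^{n} x_i .
```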

Algorithm to Approximate k-means
Given: data x_1, ..., x_n, and the intended number of groupings k.
Alternating optimization algorithm:
Initialize cluster centers (say, randomly).
Repeat until no more changes occur:
Assign each data point to its closest center, assuming the centers are fixed (this creates a partition).
Find the optimal centers, assuming the data partition is fixed.
[Demo: the algorithm animated over several iterations.]

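A minimal NumPy sketch of this alternating scheme (Lloyd's algorithm), assuming Euclidean distance and random initialization from the data points; the slides do not prescribe a particular implementation.

```python
import numpy as np

def kmeans(X, k, n_iter=100, seed=0):
    """Alternating k-means updates: assign points to the nearest center, then recompute centers."""
    rng = np.random.default_rng(seed)
    # Initialize centers by picking k distinct data points at random.
    centers = X[rng.choice(len(X), size=k, replace=False)].astype(float)
    assign = None
    for _ in range(n_iter):
        # Assignment step: each point goes to its closest center (centers fixed).
        dists = np.linalg.norm(X[:, None, :] - centers[None, :, :], axis=2)
        new_assign = dists.argmin(axis=1)
        if assign is not None and np.array_equal(new_assign, assign):
            break  # no assignment changed, so the centers will not change either
        assign = new_assign
        # Update step: each center becomes the mean of its assigned points (partition fixed).
        for j in range(k):
            pts = X[assign == j]
            if len(pts) > 0:
                centers[j] = pts.mean(axis=0)
    cost = ((X - centers[assign]) ** 2).sum()  # the k-means objective
    return centers, assign, cost
```

For an (n, d) array X, kmeans(X, k=3) returns the centers, the hard assignments, and the objective value; note that the result depends on the seed, which is exactly the initialization sensitivity discussed on the next slide.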

k-means
Some properties of this alternating-updates algorithm:
The approximation can be arbitrarily bad compared to the best cluster assignment!
Performance quality is heavily dependent on the initialization!
How to select k? Is the right k = 2 or k = 3?
Solution: encode the clustering for all values of k (hierarchical clustering)!
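
Because of this sensitivity to initialization, a common remedy (not discussed on the slide) is to run the algorithm from several random initializations and keep the solution with the smallest objective. Scikit-learn is not part of the lecture, but its KMeans exposes this directly via the n_init parameter:

```python
import numpy as np
from sklearn.cluster import KMeans

X = np.random.default_rng(0).normal(size=(500, 2))  # toy data, for illustration only

# Run k-means from 10 random initializations and keep the run with the lowest objective.
km = KMeans(n_clusters=3, n_init=10, random_state=0).fit(X)
print(km.inertia_)      # k-means objective of the best run
print(km.labels_[:10])  # cluster assignments of the first few points
```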

Example: Clustering Without Committing to k
[Figure: the same data clustered at k = 3 (coarser resolution) and k = 6 (finer resolution).]

Hierarchical Clustering
Two approaches:
Top down (divisive): partition the data into two groups (say, by k-means with k = 2); recurse on each part; stop when the data cannot be partitioned any more (i.e. single points are left).
Bottom up (agglomerative): start with each data sample as its own cluster (so the initial number of clusters is n); repeatedly merge the closest pair of clusters; stop when only one cluster is left.
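
A minimal NumPy sketch of the bottom-up (agglomerative) variant, assuming single linkage (the distance between two clusters is the distance between their closest members); the slide leaves the cluster-distance choice open, so this is one illustrative option, written as a brute-force loop for clarity rather than efficiency.

```python
import numpy as np

def agglomerative(X):
    """Bottom-up clustering: start with singletons, repeatedly merge the closest pair of clusters.
    Returns the sequence of merges (single linkage)."""
    clusters = {i: [i] for i in range(len(X))}   # cluster id -> member indices
    merges = []
    while len(clusters) > 1:
        best = None
        # Find the pair of clusters whose closest member points are nearest to each other.
        for a in clusters:
            for b in clusters:
                if a >= b:
                    continue
                d = min(np.linalg.norm(X[i] - X[j])
                        for i in clusters[a] for j in clusters[b])
                if best is None or d < best[0]:
                    best = (d, a, b)
        d, a, b = best
        merges.append((a, b, d))
        clusters[a] = clusters[a] + clusters[b]   # merge cluster b into cluster a
        del clusters[b]
    return merges
```

Cutting the resulting merge sequence at different depths recovers a clustering for every k from n down to 1, which is exactly the "encode the clustering for all values of k" idea from the previous slide.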

Clustering via Probabilistic Mixture Modeling
An alternative way to cluster data.
Given: data x_1, ..., x_n and the intended number of clusters k.
Assume a joint probability distribution P over the joint space of observations and cluster labels:
a discrete distribution over the clusters, P[C = i] = π_i (mixing weights π_1, ..., π_k), and
for each cluster, some multivariate distribution P[X | C = i], e.g. a Gaussian.
Parameters: the mixing weights and the parameters of each cluster's distribution.
Modeling assumption: data (x_1, c_1), ..., (x_n, c_n) is drawn i.i.d. from P, BUT we only get to see partial information: x_1, x_2, ..., x_n (c_1, ..., c_n are hidden!).
Looks familiar?

Gaussian Mixture Model (GMM)
Given: data x_1, ..., x_n and k.
Assume a joint probability distribution over the joint space, with each mixture component a Gaussian:
P[x] = sum_{j=1..k} π_j N(x | μ_j, Σ_j),
where π_j is the mixing weight of component j and N(x | μ_j, Σ_j) is the j-th mixture component.
(This is called a mixture model.)
[Figure: an example GMM with three components in R^2.]
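
To make the generative story concrete, here is a small sketch (with parameters chosen arbitrarily for illustration, not taken from the slides) that samples from a two-component GMM in R^2: a hidden cluster label is drawn first, then an observation from that component's Gaussian, and only the observations are kept.

```python
import numpy as np

rng = np.random.default_rng(0)

# Illustrative parameters for a 2-component GMM in R^2 (not from the slides).
pis = np.array([0.3, 0.7])                        # mixing weights
mus = np.array([[0.0, 0.0], [4.0, 4.0]])          # component means
covs = np.array([np.eye(2), 0.5 * np.eye(2)])     # component covariances

n = 1000
c = rng.choice(2, size=n, p=pis)                                        # hidden cluster labels c_i
x = np.array([rng.multivariate_normal(mus[j], covs[j]) for j in c])    # observed points x_i

# In the clustering setting we only get to see x; the labels c are hidden.
```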

GMM: Parameter Learning
So how do we learn the parameters θ = (π_1, μ_1, Σ_1, ..., π_k, μ_k, Σ_k)?
MLE approach: given data x_1, ..., x_n i.i.d., maximize the log-likelihood
log P[x_1, ..., x_n | θ] = sum_i log ( sum_j π_j N(x_i | μ_j, Σ_j) ).
Ummm... now what? The log of a sum cannot really be simplified further!
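
For reference, a small NumPy/SciPy sketch of this log-likelihood (using direct density evaluation rather than a numerically stabilized log-sum-exp, which a real implementation would prefer); this is the objective that EM will try to increase.

```python
import numpy as np
from scipy.stats import multivariate_normal

def gmm_log_likelihood(X, pis, mus, covs):
    """log P[x_1..x_n | theta] = sum_i log sum_j pi_j * N(x_i | mu_j, Sigma_j)."""
    # Density of every point under every component, shape (n, k).
    dens = np.stack([multivariate_normal.pdf(X, mean=mus[j], cov=covs[j])
                     for j in range(len(pis))], axis=1)
    return np.log(dens @ pis).sum()
```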

GMM: Maximum Likelihood
MLE for mixture models (like GMMs) is NOT a convex optimization problem.
In fact, the maximum likelihood estimate for GMMs is degenerate!
Example: X = R, k = 2 (fit two Gaussians in 1d). Which pair of Gaussians gives higher likelihood?
If one component is centered exactly on a single data point, then as its σ → 0 the likelihood → ∞, so the unconstrained "MLE" is degenerate.
Aside: why doesn't this occur when fitting one Gaussian?
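
A tiny numerical illustration of the degeneracy (a toy example of mine, not from the slides): center one 1-d component exactly on a single data point and shrink its standard deviation; the log-likelihood grows without bound.

```python
import numpy as np
from scipy.stats import norm

x = np.array([-1.0, 0.3, 0.9, 2.5, 4.0])   # arbitrary 1-d data
mu2, sigma2 = 1.0, 1.0                      # a "reasonable" second component, held fixed

for sigma1 in [1.0, 0.1, 0.01, 1e-4]:
    # Component 1 sits exactly on the data point x[0]; component 2 covers the rest.
    dens = 0.5 * norm.pdf(x, loc=x[0], scale=sigma1) + 0.5 * norm.pdf(x, loc=mu2, scale=sigma2)
    print(sigma1, np.log(dens).sum())       # log-likelihood blows up as sigma1 -> 0
```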

GMM: (local) Maximum Likelihood
So, can we make any progress?
Observation: even though a global MLE maximizer is not appropriate, several local maximizers are desirable!
[Figure: starting from a non-maximized likelihood, a few steps of gradient ascent reach a desirable local maximum.]
A better algorithm for finding good parameters: Expectation Maximization (EM).

Expectation Maximization (EM) Algorithm
Similar in spirit to the alternating updates of the k-means algorithm.
Idea:
Initialize the parameters arbitrarily.
Given the current setting of the parameters, find the best (soft) assignment of data samples to the clusters (Expectation step).
Update all the parameters with respect to the current (soft) assignment so as to maximize the likelihood (Maximization step).
Repeat until no more progress is made.

EM for GMM
Initialize the parameters θ = (π_1, μ_1, Σ_1, ..., π_k, μ_k, Σ_k) arbitrarily.
Expectation step: for each i and j, compute the (soft) assignment of data x_i to cluster j:
w_ij = π_j N(x_i | μ_j, Σ_j) / sum_l π_l N(x_i | μ_l, Σ_l).
Maximization step: maximize the log-likelihood of the parameters with respect to the completed data; the closed-form weighted updates are
π_j = (1/n) sum_i w_ij,
μ_j = sum_i w_ij x_i / sum_i w_ij,
Σ_j = sum_i w_ij (x_i − μ_j)(x_i − μ_j)^T / sum_i w_ij.
Why does this work?
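
A compact NumPy/SciPy sketch of these E and M steps, assuming a fixed number of iterations, random initialization from the data, and a small ridge on the covariances; these are implementation choices of mine that the slides leave open.

```python
import numpy as np
from scipy.stats import multivariate_normal

def em_gmm(X, k, n_iter=50, seed=0):
    """EM for a Gaussian mixture: alternate soft assignments (E) and weighted parameter updates (M)."""
    n, d = X.shape
    rng = np.random.default_rng(seed)
    pis = np.full(k, 1.0 / k)
    mus = X[rng.choice(n, size=k, replace=False)].astype(float)
    covs = np.array([np.cov(X.T) + 1e-6 * np.eye(d) for _ in range(k)])

    for _ in range(n_iter):
        # E-step: w[i, j] proportional to pi_j * N(x_i | mu_j, Sigma_j), normalized over j.
        dens = np.stack([pis[j] * multivariate_normal.pdf(X, mean=mus[j], cov=covs[j])
                         for j in range(k)], axis=1)
        w = dens / dens.sum(axis=1, keepdims=True)

        # M-step: weighted updates for mixing weights, means, covariances.
        nj = w.sum(axis=0)                       # effective number of points per cluster
        pis = nj / n
        mus = (w.T @ X) / nj[:, None]
        for j in range(k):
            diff = X - mus[j]
            # Small ridge keeps the covariance invertible (an implementation choice, not from the slides).
            covs[j] = (w[:, j, None] * diff).T @ diff / nj[j] + 1e-6 * np.eye(d)
    return pis, mus, covs, w
```

Note the parallel with k-means: the E-step is a soft version of the assignment step, and the M-step generalizes the recompute-the-centers step.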

EM for GMM in Action Arbitrary θ assignment

EM for GMM in Action E step: soft assignment of data

EM for GMM in Action M step: Maximize parameter estimate

EM for GMM in Action After two rounds

EM for GMM in Action After five rounds

EM for GMM in Action After twenty rounds

What We Learned
Unsupervised learning problems: clustering and dimensionality reduction
k-means
Hierarchical clustering
Gaussian mixture models
The EM algorithm

Questions?

Next time: dimensionality reduction!