ALTERNATIVE METHODS FOR CLUSTERING


K-Means Algorithm

Termination conditions. Several possibilities, e.g.: a fixed number of iterations; the partition of the objects is unchanged; the centroid positions do not change.

Convergence of K-Means. Define the goodness measure of cluster c as the sum of squared distances of its points from the cluster centroid: SSE(c, s) = \sum_{d_i \in c} (d_i - s_c)^2, and the overall cost as G(C, s) = \sum_c SSE(c, s). Re-assignment monotonically decreases G: k-means is a coordinate descent algorithm (it optimizes one component at a time). At any step we have some value for G(C, s). 1) Fix s and optimize C by assigning each d to the closest centroid, so G(C', s) ≤ G(C, s). 2) Fix C' and optimize s by taking the new centroids, so G(C', s') ≤ G(C', s) ≤ G(C, s). The cost never increases, so the algorithm converges to a local minimum.
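The two steps map directly onto code. A minimal sketch of plain k-means follows (function and variable names are mine, and NumPy is an assumed dependency, not something the lecture prescribes); later sketches in this section reuse this kmeans() function.

```python
import numpy as np

def kmeans(points, k, n_iters=100, seed=0):
    """Plain k-means as coordinate descent on G(C, s), the total within-cluster SSE.

    points: (n, a) array of n points with a attributes; k: number of clusters.
    Each of the two steps below can only decrease (or keep) G(C, s).
    """
    rng = np.random.default_rng(seed)
    centroids = points[rng.choice(len(points), size=k, replace=False)]
    for _ in range(n_iters):
        # Step 1: fix the centroids s, optimize the assignment C (closest centroid)
        dists = np.linalg.norm(points[:, None, :] - centroids[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # Step 2: fix the assignment C, optimize the centroids s (cluster means)
        new_centroids = np.array([
            points[labels == j].mean(axis=0) if np.any(labels == j) else centroids[j]
            for j in range(k)
        ])
        if np.allclose(new_centroids, centroids):
            break  # centroid positions unchanged: one of the termination conditions above
        centroids = new_centroids
    # final assignment and cost G(C, s)
    labels = np.linalg.norm(points[:, None, :] - centroids[None, :, :], axis=2).argmin(axis=1)
    sse = ((points - centroids[labels]) ** 2).sum()
    return labels, centroids, sse
```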

Time Complexity: Assign points to clusters. Question: assuming the computation of a similarity is linear in the number of attributes A, what is the complexity of assigning points to clusters? Answer: with P the set of points, A the set of attributes of each point, and K the number of clusters, it is O(K · |P| · |A|).

Time Complexity: Update centroids. Question: what is the complexity of updating the centroids? Answer: with P the set of points and A the set of attributes of each point, it is O(|P| · |A|).

Overall Time Complexity. Question: what is the complexity of k-means if t iterations are necessary to converge? Answer: with P the set of points, A the set of attributes of each point, and K the number of clusters, it is O(t · K · |P| · |A|).
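A schematic sketch of where each factor comes from (purely illustrative; the function does no useful work and all names are mine):

```python
def complexity_structure(points, centroids, t):
    """Mirrors the loop structure behind the O(t * K * |P| * |A|) bound."""
    for _ in range(t):                                           # t iterations
        for p in points:                                         # |P| points
            for c in centroids:                                  # K centroids
                _ = sum((pi - ci) ** 2 for pi, ci in zip(p, c))  # |A| attributes per distance
        # the centroid update touches every point and attribute once more: O(|P| * |A|) per iteration
```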

MIXTURE MODELS AND THE EM ALGORITHM

Model-based clustering. In order to understand our data, we will assume that there is a generative process (a model) that creates/describes the data, and we will try to find the model that best fits the data. Models of different complexity can be defined, but we will assume that our model is a distribution from which data points are sampled. Example: the data is the height of all people in Greece. In most cases, a single distribution is not good enough to describe all data points: different parts of the data follow a different distribution. Example: the data is the height of all people in Greece and China. We need a mixture model. Different distributions correspond to different clusters in the data.

Gaussian Distribution

EM (Expectation Maximization) Algorithm. E-Step: assignment of points to clusters (k-means: hard assignment; EM: soft assignment). M-Step: k-means computes the new centroids; EM computes the new model parameters.

Gaussian Model. What is a model? A Gaussian distribution is fully defined by the mean μ and the standard deviation σ, so we define our model as the pair of parameters θ = (μ, σ). This is a general principle: a model is defined as a vector of parameters θ.

Fitting the model. We want to find the normal distribution that best fits our data, i.e., the best values for μ and σ. But what does "best fit" mean?

Maximum Likelihood Estimation (MLE)
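A standard way to make "best fit" precise is the maximum-likelihood criterion; in notation assumed here (not copied from the slides):

```latex
\hat{\theta} \;=\; \arg\max_{\theta} \prod_{i=1}^{n} p(x_i \mid \theta)
            \;=\; \arg\max_{\theta} \sum_{i=1}^{n} \log p(x_i \mid \theta)
```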

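For a single Gaussian the MLE has a closed form: the sample mean and the (biased) sample standard deviation. A small sketch, assuming 1-D data in a NumPy array (the toy dataset and the names are illustrative):

```python
import numpy as np

def gaussian_log_likelihood(x, mu, sigma):
    """Sum of log N(x_i | mu, sigma) over the data."""
    return np.sum(-0.5 * np.log(2 * np.pi * sigma**2) - (x - mu) ** 2 / (2 * sigma**2))

def fit_gaussian_mle(x):
    """MLE of theta = (mu, sigma): sample mean and maximum-likelihood standard deviation."""
    return x.mean(), x.std()   # np.std uses ddof=0 by default, which is the ML estimate

# toy check: the fitted parameters score at least as well as a perturbed model
x = np.random.default_rng(0).normal(loc=175.0, scale=7.0, size=1000)  # fake "heights"
mu_hat, sigma_hat = fit_gaussian_mle(x)
assert gaussian_log_likelihood(x, mu_hat, sigma_hat) >= gaussian_log_likelihood(x, mu_hat + 5.0, sigma_hat)
```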

Mixture of Gaussians. Suppose that you have the heights of people from Greece and China, and the distribution looks like the figure on the slide (dramatization).

Mixture of Gaussians. In this case the data is the result of a mixture of two Gaussians: one for Greek people and one for Chinese people. Identifying, for each value, which Gaussian is most likely to have generated it will give us a clustering.

Mixture Models
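The density of such a mixture can be written as a weighted sum of component densities (the mixing weights π_k are an assumed piece of notation, not taken from the slides):

```latex
p(x) \;=\; \sum_{k=1}^{K} \pi_k \, \mathcal{N}(x \mid \mu_k, \sigma_k),
\qquad \pi_k \ge 0, \quad \sum_{k=1}^{K} \pi_k = 1
```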

EM (Expectation Maximization) Algorithm
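A minimal sketch of EM for a 1-D mixture of k Gaussians, following the E-step/M-step split described above (initialization, stopping rule, and numerical safeguards are deliberately simplistic assumptions of this sketch):

```python
import numpy as np

def normal_pdf(x, mu, sigma):
    return np.exp(-((x - mu) ** 2) / (2 * sigma**2)) / (sigma * np.sqrt(2 * np.pi))

def em_gmm_1d(x, k=2, n_iters=100, seed=0):
    """EM for a 1-D Gaussian mixture: soft assignments (E) and parameter updates (M)."""
    rng = np.random.default_rng(seed)
    mu = rng.choice(x, size=k, replace=False)      # initial means: k random data points
    sigma = np.full(k, x.std())                    # initial standard deviations
    pi = np.full(k, 1.0 / k)                       # mixing weights
    for _ in range(n_iters):
        # E-step: responsibility of each component for each point (soft assignment)
        dens = np.stack([pi[j] * normal_pdf(x, mu[j], sigma[j]) for j in range(k)], axis=1)
        resp = dens / dens.sum(axis=1, keepdims=True)            # shape (n, k)
        # M-step: re-estimate theta = (pi, mu, sigma) from the soft counts
        nk = resp.sum(axis=0)
        mu = (resp * x[:, None]).sum(axis=0) / nk
        sigma = np.sqrt((resp * (x[:, None] - mu) ** 2).sum(axis=0) / nk)
        sigma = np.maximum(sigma, 1e-6)            # avoid a collapsing component
        pi = nk / len(x)
    return pi, mu, sigma, resp

# a hard clustering, if one is needed afterwards: labels = resp.argmax(axis=1)
```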

Finding the best number of clusters. In k-means the number of clusters K is given: we partition the n objects into a predetermined number of clusters. But finding the right number of clusters is part of the problem.

Bisecting K-means. A variant of k-means that can produce a hierarchical clustering.

Bisecting K-Means. The algorithm is exhaustive, terminating at singleton clusters (unless K is known). Terminating at singleton clusters is time consuming, and singleton clusters are meaningless; intermediate clusters are more likely to correspond to real classes. There is no criterion for stopping the bisections before singleton clusters are reached.
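A sketch of the bisecting strategy, reusing the kmeans() function from the earlier sketch; splitting the cluster with the largest SSE is one common heuristic and an assumption here (the lecture does not fix the choice):

```python
import numpy as np

def cluster_sse(c):
    """Sum of squared distances of the points in c to their centroid."""
    return ((c - c.mean(axis=0)) ** 2).sum()

def bisecting_kmeans(points, k_target):
    """Repeatedly bisect the largest-SSE cluster with 2-means until k_target clusters exist."""
    clusters = [points]
    while len(clusters) < k_target:
        worst = max(range(len(clusters)), key=lambda i: cluster_sse(clusters[i]))
        to_split = clusters.pop(worst)
        labels, _, _ = kmeans(to_split, 2)                    # bisect with 2-means
        parts = [to_split[labels == j] for j in (0, 1)]
        if any(len(p) == 0 for p in parts):                   # degenerate split: stop here
            clusters.append(to_split)
            break
        clusters.extend(parts)
    return clusters
```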

Bayesian Information Criterion (BIC). A strategy to stop the bisecting algorithm when meaningful clusters are reached, avoiding over-splitting: use BIC as the splitting criterion, i.e., to decide whether a cluster should be split or not. BIC measures the improvement of the cluster structure between a cluster and its two children clusters: compute the BIC score of the cluster and of its two children clusters. BIC approximates the probability that the model M_j describes the real clusters in the data.

BIC-based split. Parent cluster: BIC(K=1) = 1980. Two resulting clusters: BIC(K=2) = 2245. The BIC score of the parent cluster is less than the BIC score of the generated cluster structure, so we accept the bisection.
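Written as code, the decision rule on this slide is just a comparison of the two scores (bic() here is a placeholder for a scoring function such as the one sketched after the BIC slides below):

```python
def should_split(parent_points, child_a, child_b, bic):
    """Accept the bisection iff the two-cluster model scores higher than the parent."""
    return bic([child_a, child_b]) > bic([parent_points])

# with the numbers on the slide: BIC(K=1) = 1980 < BIC(K=2) = 2245, so the split is accepted
```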

X-Means. Forward search for the appropriate value of k in a given range [r_1, r_max]: recursively split each cluster and use the BIC score to decide whether to keep each split. 1. Run k-means with k = r_1. 2. Improve the structure. 3. If k > r_max, stop and return the best-scoring model. Use the local BIC score to decide on keeping a split, and the global BIC score to decide which K to output at the end.

X-Means. 1. Run k-means with k = 3. 2. Split each centroid into 2 children, moved in opposite (random) directions by a distance proportional to the region size. 3. Run 2-means in each region locally. 4. Compare the BIC of the parent and of the children: only centroids whose children score a higher BIC survive.
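A high-level sketch of the X-means search, reusing the kmeans() sketch from above and a bic() scoring function (sketched below). It compresses the improve-structure step into "split each region with 2-means and keep the children only if the local BIC improves", and it omits the final global-BIC comparison, so it simplifies the original procedure:

```python
import numpy as np

def x_means(points, r1, r_max, bic):
    """Forward search over k in [r1, r_max], keeping splits that improve the local BIC."""
    labels, centroids, _ = kmeans(points, r1)                 # step 1: k-means with k = r1
    while len(centroids) < r_max:
        new_centroids = []
        for j in range(len(centroids)):
            region = points[labels == j]
            if len(region) < 2:
                new_centroids.append(centroids[j])
                continue
            child_labels, children, _ = kmeans(region, 2)     # local 2-means in each region
            parts = [region[child_labels == 0], region[child_labels == 1]]
            # keep the two children only if the local BIC improves on the parent
            if all(len(p) > 0 for p in parts) and bic(parts) > bic([region]):
                new_centroids.extend(children)
            else:
                new_centroids.append(centroids[j])
        if len(new_centroids) == len(centroids):
            break                                             # no split accepted: structure is stable
        centroids = np.array(new_centroids)
        labels = np.linalg.norm(points[:, None] - centroids[None], axis=2).argmin(axis=1)
    return labels, centroids
```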

BIC Formula. The BIC score of a data collection is defined as (Kass and Wasserman, 1995): BIC(M_j) = \hat{l}_j(D) - \frac{p_j}{2} \log R, where \hat{l}_j(D) is the log-likelihood of the data set D under model M_j, p_j = A·K + 1 is the number of independent parameters, and R is the number of points. It approximates the probability that the model M_j describes the real clusters in the data.

BIC (Bayesian Information Criterion). \hat{l}_j(D) is the adjusted log-likelihood of the model: the likelihood that the data is explained by the clusters according to the spherical-Gaussian assumption of k-means. Focusing on the set D_n of points which belong to centroid n, it estimates how close the points of the cluster are to the centroid.
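A sketch of a BIC scorer matching the formula above, under the spherical-Gaussian (shared-variance) likelihood; the shared-variance estimate and the handling of mixing proportions are assumptions of this sketch, while p_j = A·K + 1 is taken from the slide.

```python
import numpy as np

def bic(clusters):
    """BIC(M) = l_hat(D) - (p / 2) * log R for a spherical-Gaussian clustering model.

    clusters: list of (n_i, A) arrays, one per cluster, holding the points assigned to it.
    """
    K = len(clusters)
    A = clusters[0].shape[1]
    R = sum(len(c) for c in clusters)
    centroids = [c.mean(axis=0) for c in clusters]
    # shared maximum-likelihood variance of the spherical model (an assumption of this sketch)
    sq_dev = sum(((c - m) ** 2).sum() for c, m in zip(clusters, centroids))
    var = max(sq_dev / (R * A), 1e-12)
    # log-likelihood: mixing proportion plus spherical Gaussian density, for every point
    log_l = 0.0
    for c, m in zip(clusters, centroids):
        r_n = len(c)
        log_l += r_n * np.log(r_n / R)
        log_l += -0.5 * r_n * A * np.log(2 * np.pi * var)
        log_l += -((c - m) ** 2).sum() / (2 * var)
    p = A * K + 1                       # number of free parameters, as given on the slide
    return log_l - 0.5 * p * np.log(R)

# this is the bic callable assumed by the should_split() and x_means() sketches above
```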