Clustering. Supervised vs. Unsupervised Learning


Clustering: Supervised vs. Unsupervised Learning

So far we have assumed that the training samples used to design the classifier were labeled by their class membership (supervised learning). We now assume that all one has is a collection of samples without being told their categories (unsupervised learning).

Clustering

Goal: group a collection of objects (data points) into subsets, or clusters, such that the objects within each cluster are more closely related to one another than to objects assigned to different clusters. Fundamental to all clustering techniques is the choice of a distance or dissimilarity measure between two objects.

Dissimilarities Based on Features

Each object is described by a feature vector $x_i = (x_{i1}, x_{i2}, \ldots, x_{iq})^T \in \mathbb{R}^q$, $i = 1, \ldots, N$. A dissimilarity between two objects is built from per-feature dissimilarities:

$$D(x_i, x_j) = \sum_{k=1}^{q} d_k(x_{ik}, x_{jk})$$

With $d_k(x_{ik}, x_{jk}) = (x_{ik} - x_{jk})^2$ we obtain the squared Euclidean distance

$$D(x_i, x_j) = \sum_{k=1}^{q} (x_{ik} - x_{jk})^2$$

and, more generally, the weighted squared Euclidean distance

$$D_w(x_i, x_j) = \sum_{k=1}^{q} w_k\,(x_{ik} - x_{jk})^2$$

Categorical features (e.g., color, shape): no natural ordering between values exists, so the degree of difference between pairs of values must be defined explicitly. If a variable assumes $M$ distinct values, this is an $M \times M$ symmetric matrix with elements $m_{rs} = m_{sr}$, $m_{rr} = 0$, $m_{rs} \ge 0$. A common choice is $m_{rs} = 1$ for all $r \ne s$.
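As a minimal illustration of these definitions (our sketch, not part of the original slides; all function names are ours), the following computes the squared Euclidean, weighted squared Euclidean, and simple-matching categorical dissimilarities:

```python
import numpy as np

def squared_euclidean(x_i, x_j):
    """D(x_i, x_j) = sum_k (x_ik - x_jk)^2."""
    d = np.asarray(x_i, dtype=float) - np.asarray(x_j, dtype=float)
    return float(np.sum(d ** 2))

def weighted_squared_euclidean(x_i, x_j, w):
    """D_w(x_i, x_j) = sum_k w_k (x_ik - x_jk)^2."""
    d = np.asarray(x_i, dtype=float) - np.asarray(x_j, dtype=float)
    return float(np.sum(np.asarray(w, dtype=float) * d ** 2))

def simple_matching(c_i, c_j):
    """Categorical dissimilarity: m_rs = 1 if the values differ, 0 otherwise,
    summed over the categorical features."""
    return sum(1 for a, b in zip(c_i, c_j) if a != b)

if __name__ == "__main__":
    print(squared_euclidean([1.0, 2.0], [3.0, 0.0]))              # 8.0
    print(weighted_squared_euclidean([1, 2], [3, 0], [0.5, 2]))   # 10.0
    print(simple_matching(["red", "square"], ["red", "circle"]))  # 1
```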

Clustering

Discovering patterns (e.g., groups) in data without any guidance (labels) sounds like an unpromising problem. Whether it is possible in principle to learn anything from unlabeled data depends upon the assumptions one is willing to accept.

Two families of clustering approaches:
Mixture modeling: assumes that the data are samples from a population described by a probability density function.
Combinatorial algorithms: work directly on the observed data, with no reference to an underlying probability model.

Clustering Algorithms: Mixture Modeling

The data are a sample from a population described by a probability density function. The density function is modeled as a mixture of component density functions (e.g., a mixture of Gaussians), and each component density describes one of the clusters. The parameters of the model (e.g., the means and covariance matrices for a mixture of Gaussians) are estimated so as to best fit the data (maximum likelihood estimation).

Suppose that we knew, somehow, that the given sample data come from a single normal distribution. Then the most we could learn from the data would be contained in the sample mean and the sample covariance matrix; these statistics constitute a compact description of the data.

Clustering Algorithms: Mixture Modeling

What happens if our knowledge is inaccurate, and the data are not actually normally distributed?

Example: all four data sets (figure) have identical mean and covariance matrix.

Example (continued)

Clearly, second-order statistics are not capable of revealing all of the structure in an arbitrary set of data.

Clustering Algorithms: Mixture Modeling

If we assume that the samples come from a mixture of c normal distributions, we can approximate a greater variety of situations. If the number of component densities is sufficiently high, we can approximate virtually any density function as a mixture model. However...

Clustering Algorithms: Mixture Modeling

The problem of estimating the parameters of a mixture density is not trivial. When we have little prior knowledge about the nature of the data, the assumption of specific parametric forms may lead to poor or meaningless results: there is a risk of imposing structure on the data instead of finding the structure that is there.

Combinatorial Algorithms

These algorithms work directly on the observed data, without regard to a probability model describing the data. They are commonly used in data mining, since often no prior knowledge about the process that generated the data is available.

Combinatorial Algorithms

Each data point $x_i \in \mathbb{R}^q$, $i = 1, \ldots, N$, is assigned to one, and only one, of $K$ prespecified clusters via an assignment $k = C(i)$, $k \in \{1, \ldots, K\}$. Goal: find a partition of the data into $K$ clusters that achieves a required objective, defined in terms of a dissimilarity function $D(x_i, x_j)$. Usually, the assignment of data to clusters is done so as to minimize a "loss" function that measures the degree to which the clustering goal is not met.

Since the goal is to assign close points to the same cluster, a natural loss function is the within-cluster scatter:

$$W(C) = \frac{1}{2} \sum_{k=1}^{K} \sum_{C(i)=k} \sum_{C(j)=k} D(x_i, x_j)$$

Then clustering becomes straightforward in principle: minimize $W$ over all possible assignments of the $N$ data points to $K$ clusters.
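A minimal sketch (ours, not from the slides) of the within-cluster scatter W(C) under squared Euclidean dissimilarity; X is an N-by-q NumPy array and labels holds the cluster assignment C(i) of each point:

```python
import numpy as np

def within_cluster_scatter(X, labels):
    """W(C) = 1/2 * sum_k sum_{C(i)=k} sum_{C(j)=k} ||x_i - x_j||^2."""
    X = np.asarray(X, dtype=float)
    labels = np.asarray(labels)
    total = 0.0
    for k in np.unique(labels):
        Xk = X[labels == k]                      # points assigned to cluster k
        diffs = Xk[:, None, :] - Xk[None, :, :]  # pairwise differences within the cluster
        total += 0.5 * np.sum(diffs ** 2)        # half the sum of squared pairwise distances
    return total

if __name__ == "__main__":
    X = np.array([[0.0, 0.0], [1.0, 0.0], [10.0, 10.0], [11.0, 10.0]])
    print(within_cluster_scatter(X, [0, 0, 1, 1]))  # 2.0: each cluster contributes 1.0
```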

Combinatorial Algorithms

Unfortunately, such optimization by complete enumeration is feasible only for very small data sets. The number of distinct partitions of $N$ points into $K$ clusters is

$$S(N, K) = \frac{1}{K!} \sum_{k=1}^{K} (-1)^{K-k} \binom{K}{k} k^N$$

For example, $S(10, 4) = 34{,}105$, while $S(19, 4) \approx 10^{10}$. We need to limit the search space and, in general, find a good suboptimal solution.

Initialization: a partition is specified.
Iterative step: the cluster assignments are changed in such a way that the value of the loss function is improved from its previous value.
Stop criterion: when no improvement can be reached, the algorithm terminates.

This is iterative greedy descent. Convergence is guaranteed, but only to a local optimum.
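A quick check of the partition-count formula (our illustration; the function name is arbitrary):

```python
from math import comb, factorial

def num_partitions(N, K):
    """S(N, K) = (1/K!) * sum_{k=1}^{K} (-1)^(K-k) * C(K, k) * k^N
    (the Stirling number of the second kind)."""
    return sum((-1) ** (K - k) * comb(K, k) * k ** N for k in range(1, K + 1)) // factorial(K)

if __name__ == "__main__":
    print(num_partitions(10, 4))   # 34105
    print(num_partitions(19, 4))   # 11259666950, roughly 10^10
```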

K-means

One of the most popular iterative descent clustering methods. Features: quantitative. Dissimilarity measure: squared Euclidean distance.

The within-cluster point scatter becomes

$$W(C) = \frac{1}{2} \sum_{k=1}^{K} \sum_{i \in C_k} \sum_{j \in C_k} \|x_i - x_j\|^2$$

which can be rewritten (using $x_i - x_j = (x_i - \bar{x}_k) - (x_j - \bar{x}_k)$) as

$$W(C) = \sum_{k=1}^{K} N_k \sum_{i \in C_k} \|x_i - \bar{x}_k\|^2$$

where $\bar{x}_k = \frac{1}{N_k} \sum_{i \in C_k} x_i$ is the mean vector of cluster $C_k$ and $N_k = |C_k|$.

K-means

The objective is

$$\min_{C} \sum_{k=1}^{K} N_k \sum_{i \in C_k} \|x_i - \bar{x}_k\|^2$$

We can solve this problem by noticing that, for any set of points $S$,

$$\bar{x}_S = \arg\min_{m} \sum_{i \in S} \|x_i - m\|^2$$

(obtained by setting $\frac{\partial}{\partial m} \sum_{i \in S} \|x_i - m\|^2 = 0$). So we can solve the enlarged optimization problem

$$\min_{C,\, m_1, \ldots, m_K} \sum_{k=1}^{K} N_k \sum_{i \in C_k} \|x_i - m_k\|^2$$

K-means: The Algorithm
1. Given a cluster assignment C, the total within-cluster scatter is minimized with respect to $\{m_1, \ldots, m_K\}$, giving the means of the currently assigned clusters.
2. Given a current set of means $\{m_1, \ldots, m_K\}$, the scatter is minimized by assigning each point to the closest current cluster mean.
3. Steps 1 and 2 are iterated until the assignments do not change.
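A compact NumPy sketch of these two alternating steps (our illustration, not the slides' code); it assumes X is an N-by-q array and returns the means and cluster assignments:

```python
import numpy as np

def kmeans(X, K, n_iter=100, seed=0):
    """Plain K-means: alternate nearest-mean assignment and mean re-estimation."""
    rng = np.random.default_rng(seed)
    X = np.asarray(X, dtype=float)
    means = X[rng.choice(len(X), size=K, replace=False)]  # initialize means at K distinct data points
    labels = np.full(len(X), -1)
    for _ in range(n_iter):
        # Assignment step: each point goes to the closest current cluster mean.
        dists = ((X[:, None, :] - means[None, :, :]) ** 2).sum(axis=2)
        new_labels = dists.argmin(axis=1)
        if np.array_equal(new_labels, labels):
            break                                   # assignments unchanged: converged
        labels = new_labels
        # Update step: each mean becomes the centroid of its currently assigned points.
        for k in range(K):
            if np.any(labels == k):
                means[k] = X[labels == k].mean(axis=0)
    return means, labels

if __name__ == "__main__":
    rng = np.random.default_rng(1)
    X = np.vstack([rng.normal(0, 0.5, (50, 2)), rng.normal(5, 0.5, (50, 2))])
    means, labels = kmeans(X, K=2)
    print(np.round(means))   # approximately [[0, 0], [5, 5]] (order may vary)
```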

K-means Example (K = 2)

(Figures, plotted on Height vs. Weight axes: initialize the means; assign points to clusters; re-estimate the means; re-assign points to clusters; repeat re-estimation and re-assignment until the means and assignments converge.)

K-means: Properties and Limitations

The algorithm converges to a local minimum, and the solution depends on the initial partition. One should run the algorithm with many different random choices for the initial means and keep the solution with the smallest value of the objective function.

The algorithm is also sensitive to outliers. A variation of K-means, K-medoids, improves robustness: the center of each cluster is restricted to be one of the points assigned to that cluster, and the center (medoid) is chosen as the point that minimizes the total distance to the other points in the cluster. K-medoids is more computationally intensive than K-means.

K-means: Properties and Limitations

The algorithm requires the number of clusters $K$. Often $K$ is unknown and must be estimated from the data. We can test $K \in \{1, 2, \ldots, K_{\max}\}$ and compute $\{W_1, W_2, \ldots, W_{K_{\max}}\}$. In general $W_1 > W_2 > \cdots > W_{K_{\max}}$. If $K^*$ is the actual number of clusters in the data, then for $K < K^*$ we can expect $W_K \gg W_{K+1}$, while for $K > K^*$ further splits provide a smaller decrease of $W$. Set $\hat{K}^*$ by identifying an "elbow" shape in the plot of $W_K$.

Gap statistic ("Estimating the number of clusters in a data set via the gap statistic", Tibshirani, Walther & Hastie, 2001): plot $\log W_K$ together with the curve $\log W_K$ obtained from uniformly distributed data, and estimate $\hat{K}^*$ as the point where the gap between the two curves is largest.
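A rough sketch of this idea (ours), using scikit-learn's KMeans and its inertia_ attribute as W_K; the single uniform reference set is a simplification of the full gap statistic, which averages over several reference draws:

```python
import numpy as np
from sklearn.cluster import KMeans

def scatter_curve(X, K_max=10, seed=0):
    """W_K (within-cluster scatter, sklearn's inertia_) for K = 1..K_max."""
    return np.array([KMeans(n_clusters=k, n_init=10, random_state=seed).fit(X).inertia_
                     for k in range(1, K_max + 1)])

def gap_curve(X, K_max=10, seed=0):
    """Gap_K = log W_K(uniform reference) - log W_K(data); pick the K with the largest gap."""
    rng = np.random.default_rng(seed)
    X_ref = rng.uniform(X.min(axis=0), X.max(axis=0), size=X.shape)  # uniform over the bounding box
    return np.log(scatter_curve(X_ref, K_max, seed)) - np.log(scatter_curve(X, K_max, seed))

if __name__ == "__main__":
    rng = np.random.default_rng(2)
    X = np.vstack([rng.normal(c, 0.3, (60, 2)) for c in [(0, 0), (4, 0), (2, 3)]])
    gaps = gap_curve(X, K_max=6)
    print("estimated K:", int(np.argmax(gaps)) + 1)   # expected: 3
```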

An Application of K-means: Image Segmentation

Goal of segmentation: partition an image into regions with homogeneous visual appearance (which could correspond to objects or parts of objects). Image representation: each pixel is represented as a three-dimensional point in RGB space, where R = intensity of red, G = intensity of green, B = intensity of blue.

(Figure: a pixel and its (R, G, B) representation.)

An Application of K-means: Image Segmentation

(Figures: segmentation results obtained by clustering the pixels and replacing each pixel with the mean (R, G, B) value of its cluster.)
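A brief sketch of this idea (ours; it assumes an RGB image is available as a uint8 NumPy array img of shape (H, W, 3), e.g. loaded with imageio or PIL):

```python
import numpy as np
from sklearn.cluster import KMeans

def segment_image(img, K=10, seed=0):
    """Cluster pixels in RGB space and paint each pixel with its cluster's mean colour."""
    pixels = img.reshape(-1, 3).astype(float)               # (H*W, 3) points in RGB space
    km = KMeans(n_clusters=K, n_init=10, random_state=seed).fit(pixels)
    mean_colours = km.cluster_centers_[km.labels_]           # each pixel -> its centroid colour
    return mean_colours.reshape(img.shape).astype(np.uint8)

# Example (hypothetical image file):
# img = imageio.v3.imread("photo.png")[:, :, :3]
# segmented = segment_image(img, K=3)
```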

An Application of K-means: (Lossy) Data Compression

The original image has N pixels; each pixel has (R, G, B) values, and each value is stored with 8 bits of precision, so transmitting the whole image costs 24N bits. Compression achieved by K-means: identify each pixel with its corresponding centroid. With K centroids we need ⌈log2 K⌉ bits per pixel, plus 24 bits for each centroid, so transmitting the whole image costs 24K + N⌈log2 K⌉ bits. For an original image of 240 × 180 = 43,200 pixels, that is 43,200 × 24 = 1,036,800 bits; the compressed images cost 43,248 bits (K = 2), 86,472 bits (K = 3), and 173,040 bits (K = 10). (A quick check of these numbers follows the figure note below.)

Where K-means Fails

(FIGURE 6.4: Two datasets in which K-means fails to capture the clear cluster structure.)
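A quick arithmetic check of the compression costs above (our sketch; the ceiling in ⌈log2 K⌉ reflects whole bits per pixel):

```python
from math import ceil, log2

def compressed_bits(n_pixels, K):
    """Cost of sending K centroids (24 bits each) plus ceil(log2 K) bits per pixel."""
    return 24 * K + n_pixels * ceil(log2(K))

n_pixels = 240 * 180                        # 43,200 pixels
print(24 * n_pixels)                        # 1,036,800 bits for the raw image
for K in (2, 3, 10):
    print(K, compressed_bits(n_pixels, K))  # 43,248 / 86,472 / 173,040 bits
```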

Kernel K-means

(FIGURE 6.5: Result of applying kernelised K-means to the data shown in Figure 6.4(a): (a) after one iteration, (b) after 5 iterations, (c) after 10 iterations, (d) at convergence (30 iterations).)

It is also hard to choose the number of clusters if our aim is just to cluster (remember that we mentioned how the number of clusters could be chosen as the one that gave best performance in some later task like classification). To overcome some of these drawbacks, we will now describe clustering with statistical mixture models. These models share some similarities with K-means but offer far richer representations of the data.

6.3 MIXTURE MODELS

In Figure 6.4(b) we showed a dataset for which the original K-means failed. The two clusters were stretched in such a way that some objects that should have belonged to one were in fact closer to the centre of the other. The problem our K-means algorithm had here was that its definition of a cluster was too crude: the characteristics of these stretched clusters cannot be represented by a single point and the squared distance. We need to be able to incorporate a notion of shape. Statistical mixture models represent each cluster as a probability density. This generalisation leads to a powerful approach, as we can model clusters with a wide variety of shapes in almost any type of data.
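As a pointer to what such a mixture model looks like in practice, here is a brief sketch (ours, not the slides' code) using scikit-learn's GaussianMixture, which fits Gaussian components with full covariance matrices and can therefore capture stretched, elliptical clusters of the kind plain K-means misses:

```python
import numpy as np
from sklearn.mixture import GaussianMixture

rng = np.random.default_rng(0)
# Two stretched (elongated) clusters that plain K-means tends to split incorrectly.
cov = np.array([[4.0, 0.0], [0.0, 0.05]])
X = np.vstack([rng.multivariate_normal([0, 0], cov, 200),
               rng.multivariate_normal([0, 1], cov, 200)])

gmm = GaussianMixture(n_components=2, covariance_type="full", random_state=0).fit(X)
labels = gmm.predict(X)            # soft assignments hardened to the most probable component
print(np.round(gmm.means_, 1))     # roughly [0, 0] and [0, 1] (order may vary)
```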