Supervised vs. Unsupervised Learning


Clustering

Supervised vs. Unsupervised Learning So far we have assumed that the training samples used to design the classifier were labeled by their class membership (supervised learning). We now assume that all one has is a collection of samples without being told their categories (unsupervised learning).

Clustering Goal: Grouping a collection of objects (data points) into subsets or clusters, such that those within each cluster are more closely related to one another than to objects assigned to different clusters. Fundamental to all clustering techniques is the choice of a distance or dissimilarity measure between two objects.

Dissimilarities based on Features
Each object is described by a feature vector $x_i = (x_{i1}, x_{i2}, \dots, x_{iq})^T \in \mathbb{R}^q$, $i = 1, \dots, N$.
Total dissimilarity: $D(x_i, x_j) = \sum_{k=1}^{q} d_k(x_{ik}, x_{jk})$
Squared Euclidean distance: $d_k(x_{ik}, x_{jk}) = (x_{ik} - x_{jk})^2$, giving $D(x_i, x_j) = \sum_{k=1}^{q} (x_{ik} - x_{jk})^2$
Weighted squared Euclidean distance: $D_w(x_i, x_j) = \sum_{k=1}^{q} w_k (x_{ik} - x_{jk})^2$
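
To make the formulas concrete, here is a minimal NumPy sketch of the squared and weighted squared Euclidean distances above; the function names and example points are illustrative, not part of the original slides.

```python
import numpy as np

def squared_euclidean(x_i, x_j):
    """D(x_i, x_j) = sum_k (x_ik - x_jk)^2."""
    d = np.asarray(x_i) - np.asarray(x_j)
    return float(np.dot(d, d))

def weighted_squared_euclidean(x_i, x_j, w):
    """D_w(x_i, x_j) = sum_k w_k (x_ik - x_jk)^2."""
    d = np.asarray(x_i) - np.asarray(x_j)
    return float(np.dot(np.asarray(w), d * d))

# Example usage with two points in R^3 and illustrative weights.
x1, x2 = [1.0, 2.0, 3.0], [2.0, 0.0, 3.0]
print(squared_euclidean(x1, x2))                             # 5.0
print(weighted_squared_euclidean(x1, x2, [1.0, 0.5, 2.0]))   # 1 + 2 + 0 = 3.0
```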

Categorical Features E.g.: color, shape, etc. No natural ordering between the values exists; the degree-of-difference between pairs of values must be defined explicitly. If a variable assumes $M$ distinct values, define an $M \times M$ symmetric matrix with elements $m_{rs} = m_{sr}$, $m_{rr} = 0$, $m_{rs} \geq 0$. Common choice: $m_{rs} = 1$ for all $r \neq s$.
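
A small sketch of the common choice above, with a hypothetical three-valued color variable; the matrix m encodes $m_{rs} = 1$ for $r \neq s$ and $m_{rr} = 0$.

```python
import numpy as np

# Hypothetical categorical variable with M = 3 values.
values = ["red", "green", "blue"]
M = len(values)

# Common choice: m_rs = 1 for r != s, m_rr = 0 (a symmetric M x M matrix).
m = np.ones((M, M)) - np.eye(M)

def categorical_dissimilarity(a, b):
    """Sum the per-variable degree-of-difference for two categorical vectors."""
    return sum(m[values.index(u), values.index(v)] for u, v in zip(a, b))

print(categorical_dissimilarity(["red", "blue"], ["red", "green"]))  # 0 + 1 = 1.0
```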

Clustering Discovering patterns (e.g., groups) in data without any guidance (labels) sounds like an unpromising problem. The question of whether or not it is possible in principle to learn anything from unlabeled data depends upon the assumptions one is willing to accept.

Clustering Mixture modeling: assumes that the data are samples from a population described by a probability density function. Combinatorial algorithms: work directly on the observed data, with no reference to an underlying probability model.

Clustering Algorithms: Mixture Modeling The data are a sample from a population described by a probability density function; the density function is modeled as a mixture of component density functions (e.g., a mixture of Gaussians). Each component density describes one of the clusters. The parameters of the model (e.g., the means and covariance matrices for a mixture of Gaussians) are estimated so as to best fit the data (maximum likelihood estimation).
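
As an illustration of this fitting step, the sketch below uses scikit-learn's GaussianMixture (which estimates the parameters by maximum likelihood via EM) on synthetic two-blob data; the data and parameter choices are my own, not the lecture's.

```python
import numpy as np
from sklearn.mixture import GaussianMixture

# Toy data: two Gaussian blobs in 2-D (synthetic, for illustration only).
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(loc=[0, 0], scale=0.5, size=(100, 2)),
               rng.normal(loc=[3, 3], scale=0.5, size=(100, 2))])

# Fit a 2-component Gaussian mixture by maximum likelihood (EM under the hood).
gmm = GaussianMixture(n_components=2, covariance_type="full", random_state=0).fit(X)

labels = gmm.predict(X)          # hard cluster assignments
print(gmm.means_)                # estimated component means
print(gmm.covariances_.shape)    # one covariance matrix per component
```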

Clustering Algorithms: Mixture Modeling Suppose that we knew, somehow, that the given sample data come from a single normal distribution. Then the most we could learn from the data would be contained in the sample mean and the sample covariance matrix. These statistics constitute a compact description of the data.

Clustering Algorithms: Mixture Modeling What happens if our knowledge is inaccurate, and the data are not actually normally distributed?

Example All four data sets have identical mean and covariance matrix

Example Clearly, second-order statistics are not capable of revealing all of the structure in an arbitrary set of data.

Clustering Algorithms: Mixture Modeling If we assume that the samples come from a mixture of $c$ normal distributions, we can approximate a greater variety of situations. If the number of component densities is sufficiently high, we can approximate virtually any density function with a mixture model. However...

Clustering Algorithms: Mixture Modeling The problem of estimating the parameters of a mixture density is not trivial. When we have little prior knowledge about the nature of the data, the assumption of specific parametric forms may lead to poor or meaningless results. There is a risk of imposing structure on the data instead of finding the structure.

Combinatorial Algorithms These algorithms work directly on the observed data, without regard to a probability model describing the data. Commonly used in data mining, since often no prior knowledge about the process that generated the data is available.

Combinatorial Algorithms
Data $x_i \in \mathbb{R}^q$, $i = 1, \dots, N$; prespecified number of clusters $K$, with cluster labels $k \in \{1, \dots, K\}$.
Each data point $x_i$ is assigned to one, and only one, cluster.
Goal: find a partition of the data into $K$ clusters that achieves a required objective, defined in terms of a dissimilarity function $D(x_i, x_j)$.
Usually, the assignment of data to clusters is done so as to minimize a "loss" function that measures the degree to which the clustering goal is not met.

Combinatorial Algorithms
Since the goal is to assign close points to the same cluster, a natural loss function is the within-cluster scatter:
$W(C) = \frac{1}{2} \sum_{k=1}^{K} \sum_{i \in C_k} \sum_{j \in C_k} D(x_i, x_j)$
Then clustering becomes straightforward in principle: minimize $W$ over all possible assignments of the $N$ data points to $K$ clusters.
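
A direct, if brute-force, way to evaluate this loss for a given assignment, using squared Euclidean distance as the dissimilarity (a sketch; the function name and toy data are illustrative):

```python
import numpy as np

def within_cluster_scatter(X, labels):
    """W(C) = 1/2 * sum_k sum_{i in C_k} sum_{j in C_k} ||x_i - x_j||^2."""
    W = 0.0
    for k in np.unique(labels):
        Xk = X[labels == k]
        # All pairwise squared distances within cluster k.
        diff = Xk[:, None, :] - Xk[None, :, :]
        W += 0.5 * np.sum(diff ** 2)
    return W

X = np.array([[0.0, 0.0], [0.0, 1.0], [5.0, 5.0], [6.0, 5.0]])
labels = np.array([0, 0, 1, 1])
print(within_cluster_scatter(X, labels))  # 1.0 + 1.0 = 2.0
```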

Combinatorial Algorithms
Unfortunately, such optimization by complete enumeration is feasible only for very small data sets. The number of distinct partitions of $N$ points into $K$ clusters is
$S(N, K) = \frac{1}{K!} \sum_{k=1}^{K} (-1)^{K-k} \binom{K}{k} k^N$
For example, $S(10, 4) = 34{,}105$ while $S(19, 4) \approx 10^{10}$.
We need to limit the search space and, in general, settle for a good suboptimal solution.
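
The count above is easy to verify numerically; this sketch evaluates the formula directly (Python's exact integer arithmetic keeps the large powers precise):

```python
from math import comb, factorial

def num_partitions(N, K):
    """S(N, K) = (1/K!) * sum_{k=1}^{K} (-1)^(K-k) * C(K, k) * k^N."""
    total = sum((-1) ** (K - k) * comb(K, k) * k ** N for k in range(1, K + 1))
    return total // factorial(K)   # the sum is always an exact multiple of K!

print(num_partitions(10, 4))   # 34105
print(num_partitions(19, 4))   # about 1.1e10, i.e. roughly 10^10
```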

Combinatorial Algorithms Initialization: a partition is specified. Iterative step: the cluster assignments are changed in such a way that the value of the loss function improves over its previous value. Stopping criterion: when no improvement can be achieved, the algorithm terminates. This is iterative greedy descent: convergence is guaranteed, but only to a local optimum.

K-means One of the most popular iterative descent clustering methods. Features: quantitative (numerical) type. Dissimilarity measure: squared Euclidean distance.

K-means
The "within-cluster point scatter" becomes:
$W(C) = \frac{1}{2} \sum_{k=1}^{K} \sum_{i \in C_k} \sum_{j \in C_k} \| x_i - x_j \|^2$
$W(C)$ can be rewritten as:
$W(C) = \sum_{k=1}^{K} |C_k| \sum_{i \in C_k} \| x_i - \bar{x}_k \|^2$
(obtained by writing $x_i - x_j = (x_i - \bar{x}_k) - (x_j - \bar{x}_k)$), where $\bar{x}_k = \frac{1}{|C_k|} \sum_{i \in C_k} x_i$ is the mean vector of cluster $C_k$ and $|C_k|$ is the number of points in cluster $C_k$.

K-means
The objective is:
$\min_{C} \sum_{k=1}^{K} |C_k| \sum_{i \in C_k} \| x_i - \bar{x}_k \|^2$
We can solve this problem by noticing that, for any set of data $S$,
$\bar{x}_S = \arg\min_{m} \sum_{i \in S} \| x_i - m \|^2$
(obtained by setting $\frac{\partial}{\partial m} \sum_{i \in S} \| x_i - m \|^2 = 0$).
So we can solve the enlarged optimization problem:
$\min_{C,\, \{m_k\}} \sum_{k=1}^{K} |C_k| \sum_{i \in C_k} \| x_i - m_k \|^2$

K-means: The Algorithm
1. Given a cluster assignment $C$, the total within-cluster scatter $\sum_{k=1}^{K} |C_k| \sum_{i \in C_k} \| x_i - m_k \|^2$ is minimized with respect to $\{m_1, \dots, m_K\}$, giving the means of the currently assigned clusters;
2. Given a current set of means $\{m_1, \dots, m_K\}$, the scatter is minimized with respect to $C$ by assigning each point to the closest current cluster mean;
3. Steps 1 and 2 are iterated until the assignments do not change.
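
A minimal NumPy implementation of steps 1-3 might look like the sketch below; it starts from randomly chosen points as the initial means and alternates the two steps until the assignments stop changing (this is an illustration, not the lecture's code):

```python
import numpy as np

def kmeans(X, K, n_iter=100, seed=0):
    """Minimal Lloyd-style K-means following steps 1-3 above."""
    rng = np.random.default_rng(seed)
    means = X[rng.choice(len(X), size=K, replace=False)]  # initial means: K random points
    labels = None
    for _ in range(n_iter):
        # Step 2: assign each point to the closest current cluster mean.
        dists = ((X[:, None, :] - means[None, :, :]) ** 2).sum(axis=2)
        new_labels = dists.argmin(axis=1)
        if labels is not None and np.array_equal(new_labels, labels):
            break  # Step 3: stop when the assignments no longer change
        labels = new_labels
        # Step 1: re-estimate each mean as the centroid of its assigned points.
        for k in range(K):
            if np.any(labels == k):
                means[k] = X[labels == k].mean(axis=0)
    return means, labels

# Example usage on two well-separated blobs.
rng = np.random.default_rng(1)
X = np.vstack([rng.normal([0, 0], 0.3, (50, 2)), rng.normal([4, 4], 0.3, (50, 2))])
means, labels = kmeans(X, K=2)
print(means)
```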

K-means Example (K=2): Initialize Means [figure: Height vs. Weight scatter plot; the two cluster means are marked ×]

K-means Example: Assign Points to Clusters

K-means Example: Re-estimate Means

K-means Example: Re-assign Points to Clusters

K-means Example: Re-estimate Means

K-means Example: Re-assign Points to Clusters

K-means Example: Re-estimate Means and Converge

K-means Example: Convergence [figure: final clusters in the Height vs. Weight plot]

K-means: Properties and Limitations The algorithm converges to a local minimum. The solution depends on the initial partition. One should start the algorithm with many different random choices for the initial means, and choose the solution having the smallest value of the objective function.
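
In practice this restart strategy is often delegated to a library. Assuming scikit-learn is acceptable, its KMeans estimator reruns the algorithm n_init times and keeps the run with the smallest objective value (exposed as inertia_); the data and parameter values below are illustrative:

```python
import numpy as np
from sklearn.cluster import KMeans

# Synthetic data: three blobs (illustrative only).
rng = np.random.default_rng(0)
X = np.vstack([rng.normal([0, 0], 0.4, (50, 2)),
               rng.normal([4, 0], 0.4, (50, 2)),
               rng.normal([2, 3], 0.4, (50, 2))])

# n_init controls the number of random restarts; the best run is kept.
km = KMeans(n_clusters=3, n_init=20, random_state=0).fit(X)
print(km.inertia_)          # within-cluster sum of squares of the best run
print(km.cluster_centers_)  # the corresponding cluster means
```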

K-means: Properties and Limitations The algorithm is sensitive to outliers. A variation of K-means, K-medoids, improves robustness: the center of each cluster is restricted to be one of the points assigned to that cluster, and the center (medoid) is chosen as the point that minimizes the total distance to the other points in the cluster. K-medoids is more computationally intensive than K-means.

K-means: Properties and Limitations
The algorithm requires the number of clusters $K$. Often $K$ is unknown and must be estimated from the data:
Test $K \in \{1, 2, \dots, K_{\max}\}$ and compute $\{W_1, W_2, \dots, W_{K_{\max}}\}$.
In general, $W_1 > W_2 > \dots > W_{K_{\max}}$.
If $K^*$ is the actual number of clusters in the data, then for $K < K^*$ we can expect $W_K \gg W_{K+1}$, while for $K > K^*$ further splits provide a smaller decrease of $W$.
Set $\hat{K}$ by identifying an "elbow shape" in the plot of $W_K$.
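
A sketch of this procedure, using scikit-learn's inertia_ (the within-cluster sum of squares) as the $W_K$ to be plotted; the synthetic data and K_max are illustrative:

```python
import numpy as np
import matplotlib.pyplot as plt
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)
X = np.vstack([rng.normal([0, 0], 0.4, (50, 2)),
               rng.normal([4, 0], 0.4, (50, 2)),
               rng.normal([2, 3], 0.4, (50, 2))])

K_max = 8
W = [KMeans(n_clusters=k, n_init=10, random_state=0).fit(X).inertia_
     for k in range(1, K_max + 1)]

plt.plot(range(1, K_max + 1), W, marker="o")
plt.xlabel("K")
plt.ylabel("W_K (within-cluster scatter)")
plt.show()   # look for the 'elbow' where the decrease in W_K flattens out
```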

Gap Statistic ("Estimating the number of clusters in a data set via the gap statistic", Tibshirani, Walther & Hastie, 2001)
Plot $\log W_K$ obtained from the data;
Plot the curve $\log W_K$ obtained from uniformly distributed reference data;
Estimate $\hat{K}$ to be the point where the gap between the two curves is largest.
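
A simplified sketch of the idea: it compares $\log W_K$ on the data with the average $\log W_K$ over uniform reference data drawn in the data's bounding box, but omits the standard-error rule of the full Tibshirani et al. procedure; all names and parameters are my own.

```python
import numpy as np
from sklearn.cluster import KMeans

def gap_statistic(X, K_max=8, B=10, seed=0):
    """Simplified gap statistic: gap(K) = mean_ref(log W_K) - log W_K(data)."""
    rng = np.random.default_rng(seed)
    lo, hi = X.min(axis=0), X.max(axis=0)

    def log_W(data, k):
        return np.log(KMeans(n_clusters=k, n_init=10, random_state=0).fit(data).inertia_)

    gaps = []
    for k in range(1, K_max + 1):
        ref = np.mean([log_W(rng.uniform(lo, hi, size=X.shape), k) for _ in range(B)])
        gaps.append(ref - log_W(X, k))
    return np.argmax(gaps) + 1, gaps

# Usage (with X as in the previous sketches):
# K_hat, gaps = gap_statistic(X)
```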

An Application of K-means: Image segmentation Goal of segmentation: partition an image into regions with homogeneous visual appearance (which could correspond to objects or parts of objects). Image representation: each pixel is represented as a three-dimensional point in RGB space, where R = intensity of red, G = intensity of green, B = intensity of blue.

An Application of K-means: Image segmentation [figure: pixels shown as (R, G, B) values]

An Application of K-means: Image segmentation [figure: pixels replaced by their cluster mean (R, G, B)]

An Application of K-means: Image segmentation [figure: further segmentation examples]
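
A sketch of such a segmentation with scikit-learn and Pillow; "photo.jpg" and K = 5 are placeholders, and the pixels are simply clustered in RGB space and replaced by their cluster means.

```python
import numpy as np
from PIL import Image
from sklearn.cluster import KMeans

# Hypothetical input file; any RGB image will do.
img = np.asarray(Image.open("photo.jpg").convert("RGB"), dtype=float) / 255.0
H, W_, _ = img.shape

# Each pixel becomes a point in 3-D RGB space.
pixels = img.reshape(-1, 3)

# Cluster the pixels and replace each one by its cluster mean (R, G, B).
K = 5
km = KMeans(n_clusters=K, n_init=10, random_state=0).fit(pixels)
segmented = km.cluster_centers_[km.labels_].reshape(H, W_, 3)

Image.fromarray((segmented * 255).astype(np.uint8)).save("segmented_K5.png")
```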

An Application of K-means: (Lossy) Data compression
The original image has $N$ pixels; each pixel has (R, G, B) values, and each value is stored with 8 bits of precision, so transmitting the whole image costs $24N$ bits.
Compression achieved by K-means: identify each pixel with the corresponding centroid. With $K$ centroids we need $\lceil \log_2 K \rceil$ bits per pixel, plus 24 bits for each centroid, so transmitting the whole image costs $24K + N \lceil \log_2 K \rceil$ bits.
Original image: $240 \times 180 = 43{,}200$ pixels, i.e. $43{,}200 \times 24 = 1{,}036{,}800$ bits.
Compressed images: $K = 2$: 43,248 bits; $K = 3$: 86,472 bits; $K = 10$: 173,040 bits.
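
The bit counts on this slide can be reproduced directly (using a whole number of bits per pixel, i.e. ceil(log2 K)):

```python
from math import ceil, log2

def compressed_bits(N, K):
    """24*K bits for the centroids plus ceil(log2 K) bits per pixel for its index."""
    return 24 * K + N * ceil(log2(K))

N = 240 * 180                        # 43,200 pixels
print(24 * N)                        # 1,036,800 bits for the original image
for K in (2, 3, 10):
    print(K, compressed_bits(N, K))  # 43,248 / 86,472 / 173,040 bits
```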