Unsupervised Learning and Clustering

Selim Aksoy
Department of Computer Engineering, Bilkent University
saksoy@cs.bilkent.edu.tr
CS 551, Spring 2008

Introduction
Until now we have assumed that the training examples were labeled by their class membership. Procedures that use labeled samples are said to be supervised. In this chapter, we will study clustering as an unsupervised procedure that uses unlabeled samples.

Introduction
Unsupervised procedures are used for several reasons: Collecting and labeling a large set of sample patterns can be costly or may not be feasible. One can train with a large amount of unlabeled data, and then use supervision to label the groupings found. Unsupervised methods can be used for feature extraction. Exploratory data analysis can provide insight into the nature or structure of the data.

Data Description
Assume that we have a set of unlabeled multi-dimensional patterns. One way of describing this set of patterns is to compute their sample mean and covariance. This description uses the assumption that the patterns form a cloud that can be modeled with a hyperellipsoidal shape. However, we must be careful about any assumptions we make about the structure of the data.

Data Description
Figure 1: These four data sets have identical first-order and second-order statistics. We need to find other ways of modeling their structure. Clustering is an alternative way of describing the data in terms of groups of patterns.

Clusters
A cluster is composed of a number of similar objects collected or grouped together. Other definitions of clusters (from Jain and Dubes, 1988): A cluster is a set of entities which are alike, and entities from different clusters are not alike. A cluster is an aggregation of points in the test space such that the distance between any two points in the cluster is less than the distance between any point in the cluster and any point not in it. Clusters may be described as connected regions of a multi-dimensional space containing a relatively high density of points, separated from other such regions by a region containing a relatively low density of points.

Clustering
Cluster analysis organizes data by abstracting the underlying structure either as a grouping of individuals or as a hierarchy of groups. These groupings are based on measured or perceived similarities among the patterns. Clustering is unsupervised: category labels and other information about the source of the data influence the interpretation of the clusters, not their formation.

Clustering
Clustering is a very difficult problem because data can reveal clusters with different shapes and sizes. Figure 2: The number of clusters in the data often depends on the resolution (fine vs. coarse) with which we view the data. How many clusters do you see in this figure? 5, 8, 10, more?

Clustering
Clustering algorithms can be divided into several groups: exclusive (each pattern belongs to only one cluster) vs. nonexclusive (each pattern can be assigned to several clusters); hierarchical (a nested sequence of partitions) vs. partitional (a single partition).

Clustering
Implementations of clustering algorithms can also be grouped: agglomerative (merging atomic clusters into larger clusters) vs. divisive (subdividing large clusters into smaller ones); serial (processing patterns one by one) vs. simultaneous (processing all patterns at once); graph-theoretic (based on connectedness) vs. algebraic (based on error criteria).

Clustering
Hundreds of clustering algorithms have been proposed in the literature. Most of them are based on two popular techniques: iterative squared-error partitioning and agglomerative hierarchical clustering. One of the main challenges is to select an appropriate measure of similarity to define clusters, which is often both data (cluster shape) and context dependent.

Similarity Measures
The most obvious measure of similarity (or dissimilarity) between two patterns is the distance between them. If distance is a good measure of dissimilarity, then we can expect the distance between patterns in the same cluster to be significantly less than the distance between patterns in different clusters. Then, a very simple way of doing clustering would be to choose a threshold on distance and group the patterns that are closer than this threshold.
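
As an illustration of this idea, the sketch below (an illustrative addition, not part of the original slides; all names and data are hypothetical) links every pair of patterns closer than the threshold and returns the connected components of the resulting graph as clusters.

```python
import numpy as np

def threshold_clustering(X, threshold):
    """Group patterns whose pairwise Euclidean distance is below a threshold.

    Clusters are the connected components of the graph that links every
    pair of patterns closer than the threshold.
    """
    n = len(X)
    dists = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=-1)
    adjacent = dists < threshold

    labels = -np.ones(n, dtype=int)
    current = 0
    for start in range(n):
        if labels[start] != -1:
            continue
        stack = [start]              # flood-fill one connected component
        labels[start] = current
        while stack:
            i = stack.pop()
            for j in np.flatnonzero(adjacent[i]):
                if labels[j] == -1:
                    labels[j] = current
                    stack.append(j)
        current += 1
    return labels

# Toy example: two well-separated blobs give two clusters.
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0, 1, size=(20, 2)), rng.normal(10, 1, size=(20, 2))])
print(threshold_clustering(X, threshold=3.0))
```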

Similarity Measures
Figure 3: The distance threshold affects the number and size of clusters that are shown by lines drawn between points closer than the threshold.

Criterion Functions
The next challenge after selecting the similarity measure is the choice of the criterion function to be optimized. Suppose that we have a set $D = \{x_1, \ldots, x_n\}$ of $n$ samples that we want to partition into exactly $k$ disjoint subsets $D_1, \ldots, D_k$. Each subset is to represent a cluster, with samples in the same cluster being somehow more similar to each other than they are to samples in other clusters. The simplest and most widely used criterion function for clustering is the sum-of-squared-error criterion.

Squared-error Partitioning
Suppose that the given set of $n$ patterns has somehow been partitioned into $k$ clusters $D_1, \ldots, D_k$. Let $n_i$ be the number of samples in $D_i$ and let $m_i$ be the mean of those samples,
$$m_i = \frac{1}{n_i} \sum_{x \in D_i} x.$$
Then, the sum-of-squared errors is defined by
$$J_e = \sum_{i=1}^{k} \sum_{x \in D_i} \|x - m_i\|^2.$$
For a given cluster $D_i$, the mean vector $m_i$ (centroid) is the best representative of the samples in $D_i$.
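
As a concrete reference, a minimal NumPy sketch of this criterion might look as follows (the function name and label encoding are assumptions, not from the slides):

```python
import numpy as np

def sum_of_squared_error(X, labels):
    """J_e: sum over all clusters of squared distances to the cluster mean."""
    labels = np.asarray(labels)
    J_e = 0.0
    for k in np.unique(labels):
        cluster = X[labels == k]
        m = cluster.mean(axis=0)            # m_i, the centroid of cluster D_i
        J_e += np.sum((cluster - m) ** 2)   # sum of ||x - m_i||^2 over D_i
    return J_e
```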

Squared-error Partitioning
A general algorithm for iterative squared-error partitioning:
1. Select an initial partition with k clusters. Repeat steps 2 through 5 until the cluster membership stabilizes.
2. Generate a new partition by assigning each pattern to its closest cluster center.
3. Compute new cluster centers as the centroids of the clusters.
4. Repeat steps 2 and 3 until an optimum value of the criterion function is found (e.g., when a local minimum is found or a predefined number of iterations is completed).
5. Adjust the number of clusters by merging and splitting existing clusters or by removing small or outlier clusters.
This algorithm, without step 5, is also known as the k-means algorithm.
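
A minimal sketch of steps 1 to 4 (the basic k-means loop, without the optional step 5); initialization by sampling k data points is one of the options mentioned on the next slide, and all names here are illustrative:

```python
import numpy as np

def k_means(X, k, n_iter=100, seed=None):
    """Basic iterative squared-error partitioning (steps 1-4, i.e. k-means)."""
    rng = np.random.default_rng(seed)
    # Step 1: initial partition, here via k randomly selected data points.
    centers = X[rng.choice(len(X), size=k, replace=False)]
    labels = None
    for _ in range(n_iter):
        # Step 2: assign each pattern to its closest cluster center.
        dists = np.linalg.norm(X[:, None, :] - centers[None, :, :], axis=-1)
        new_labels = dists.argmin(axis=1)
        if labels is not None and np.array_equal(new_labels, labels):
            break                        # step 4: membership has stabilized
        labels = new_labels
        # Step 3: recompute centers as the centroids of the new clusters.
        centers = np.array([X[labels == j].mean(axis=0) if np.any(labels == j)
                            else centers[j] for j in range(k)])
    return labels, centers
```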

Squared-error Partitioning
k-means is computationally efficient and gives good results if the clusters are compact, hyperspherical in shape, and well separated in the feature space. However, choosing k and choosing the initial partition are the main drawbacks of this algorithm. The value of k is often chosen empirically or from prior knowledge about the data. The initial partition is often chosen by generating k random points uniformly distributed within the range of the data, or by randomly selecting k points from the data.

Squared-error Partitioning
Numerous attempts have been made to improve the performance of the basic k-means algorithm: incorporating a fuzzy criterion, resulting in fuzzy k-means; using genetic algorithms, simulated annealing, or deterministic annealing to optimize the resulting partition; using iterative splitting to find the initial partition. Another alternative is model-based clustering using Gaussian mixtures, which allows more flexible shapes for individual clusters (k-means with Euclidean distance assumes spherical shapes). In model-based clustering, the value of k corresponds to the number of components in the mixture.
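
For the model-based alternative, a sketch using scikit-learn's GaussianMixture (assuming that library is available; this is one possible realization, not the specific method prescribed in the slides):

```python
import numpy as np
from sklearn.mixture import GaussianMixture

# Toy data: two elongated Gaussian blobs.
rng = np.random.default_rng(0)
X = np.vstack([rng.normal([0, 0], [3.0, 0.5], size=(100, 2)),
               rng.normal([8, 8], [0.5, 3.0], size=(100, 2))])

# Fit a Gaussian mixture with k components; full covariances allow
# ellipsoidal clusters instead of the spherical ones implied by k-means
# with Euclidean distance.
gmm = GaussianMixture(n_components=2, covariance_type="full", random_state=0)
hard_labels = gmm.fit_predict(X)    # hard assignment to the most likely component
soft_labels = gmm.predict_proba(X)  # posterior probabilities per component
```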

Examples
Figure 4: Examples of k-means with different initializations of five clusters for the same data: (a) good initialization, (b) good initialization, (c) bad initialization, (d) bad initialization.

Hierarchical Clustering
The k-means algorithm produces a flat data description where the clusters are disjoint and are at the same level. In some applications, groups of patterns share some characteristics when viewed at a particular level. Hierarchical clustering tries to capture these multi-level groupings using hierarchical representations rather than flat partitions.

Hierarchical Clustering
In hierarchical clustering, for a set of $n$ samples, the first level consists of $n$ clusters (each cluster containing exactly one sample), the second level contains $n-1$ clusters, the third level contains $n-2$ clusters, and so on until the last ($n$-th) level at which all samples form a single cluster. Given any two samples, at some level they will be grouped together in the same cluster and remain together at all higher levels.

Hierarchical Clustering
A natural representation of hierarchical clustering is a tree, also called a dendrogram, which shows how the samples are grouped. If there is an unusually large gap between the similarity values for two particular levels, one can argue that the level with the smaller number of clusters represents a more natural grouping.

Hierarchical Clustering
Figure 5: A dendrogram can represent the results of hierarchical clustering algorithms. The vertical axis shows a generalized measure of similarity among clusters.

Hierarchical Clustering
Agglomerative Hierarchical Clustering:
1. Specify the number of clusters. Place every pattern in a separate cluster and repeat steps 2 and 3 until a partition with the required number of clusters is obtained.
2. Find the closest pair of clusters according to a distance measure.
3. Merge these two clusters.
4. Return the resulting clusters.

Hierarchical Clustering
Popular distance measures between two clusters $D_i$ and $D_j$:
$$d_{\min}(D_i, D_j) = \min_{x \in D_i,\, x' \in D_j} \|x - x'\|$$
$$d_{\max}(D_i, D_j) = \max_{x \in D_i,\, x' \in D_j} \|x - x'\|$$
$$d_{\text{avg}}(D_i, D_j) = \frac{1}{\#D_i\, \#D_j} \sum_{x \in D_i} \sum_{x' \in D_j} \|x - x'\|$$
$$d_{\text{mean}}(D_i, D_j) = \|m_i - m_j\|$$
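
The sketch below implements the agglomerative procedure above with either the $d_{\min}$ or the $d_{\max}$ cluster distance (these correspond to the single and complete linkage variants discussed next); it is a naive, unoptimized illustration with assumed names, not a production implementation:

```python
import numpy as np

def agglomerative(X, n_clusters, linkage="min"):
    """Agglomerative clustering with d_min (single) or d_max (complete) linkage."""
    clusters = [[i] for i in range(len(X))]      # start: one pattern per cluster
    dists = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=-1)
    reduce_fn = np.min if linkage == "min" else np.max
    while len(clusters) > n_clusters:
        # Find the pair of clusters that are closest under the chosen linkage.
        best, best_d = None, np.inf
        for a in range(len(clusters)):
            for b in range(a + 1, len(clusters)):
                d = reduce_fn(dists[np.ix_(clusters[a], clusters[b])])
                if d < best_d:
                    best, best_d = (a, b), d
        a, b = best
        clusters[a] += clusters[b]               # merge the two closest clusters
        del clusters[b]
    labels = np.empty(len(X), dtype=int)
    for k, members in enumerate(clusters):
        labels[members] = k
    return labels
```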

Hierarchical Clustering
When $d_{\min}$ is used to measure the distance between clusters, the algorithm is called the nearest-neighbor clustering algorithm. Moreover, if the algorithm is terminated when the distance between the nearest clusters exceeds a threshold, it is called the single linkage algorithm, where patterns represent the nodes of a graph, edges connect patterns belonging to the same cluster, and merging two clusters corresponds to adding an edge between the nearest pair of nodes in these clusters.

Hierarchical Clustering
When $d_{\max}$ is used to measure the distance between clusters, the algorithm is called the farthest-neighbor clustering algorithm. Moreover, if the algorithm is terminated when the distance between the nearest clusters exceeds a threshold, it is called the complete linkage algorithm, where patterns represent the nodes of a graph, edges connect all patterns belonging to the same cluster, and merging two clusters corresponds to adding edges between every pair of nodes in these clusters.

Hierarchical Clustering
Figure 6: Examples of single linkage clustering. Figure 7: Examples of complete linkage clustering.

Hierarchical Clustering
Stepwise-Optimal Hierarchical Clustering:
1. Specify the number of clusters. Place every pattern in a separate cluster and repeat steps 2 and 3 until a partition with the required number of clusters is obtained.
2. Find the pair of clusters whose merger increases the error criterion the least.
3. Merge these two clusters.
When the sum-of-squared-error criterion $J_e$ is used, the pair of clusters whose merger increases $J_e$ as little as possible is the pair for which the distance
$$d_e(D_i, D_j) = \frac{\#D_i\, \#D_j}{\#D_i + \#D_j} \|m_i - m_j\|^2$$
is minimum.
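
The merge cost $d_e$ translates into a few lines of NumPy; the sketch below (assumed names, clusters given as index lists) can be used inside the stepwise-optimal loop to pick the cheapest merger:

```python
import numpy as np

def merge_cost(X, cluster_i, cluster_j):
    """d_e(D_i, D_j): the increase in J_e caused by merging the two clusters."""
    n_i, n_j = len(cluster_i), len(cluster_j)
    m_i = X[cluster_i].mean(axis=0)
    m_j = X[cluster_j].mean(axis=0)
    return (n_i * n_j) / (n_i + n_j) * np.sum((m_i - m_j) ** 2)
```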

Graph-Theoretic Clustering
Graph: $(S, R)$, where $S$ is the set of nodes and $R \subseteq S \times S$ is the set of edges.
Clique: a set of nodes that are all connected to each other, i.e., a set $P \subseteq S$ with $P \times P \subseteq R$.
Goal: find clusters in a graph that are not as dense as cliques but are compact enough within user-specified thresholds.

Graph-Theoretic Clustering
Figure 8: An example graph.

Graph-Theoretic Clustering
$(X, Y) \in R$ means $Y$ is a neighbor of $X$, and $\text{Neighborhood}(X) = \{Y \mid (X, Y) \in R\}$. The conditional density $D(Y \mid X)$ is the number of nodes in the neighborhood of $X$ which have $Y$ as a neighbor:
$$D(Y \mid X) = \#\{N \in S \mid (N, Y) \in R \text{ and } (X, N) \in R\} = D(X \mid Y) = \#\{\text{Neighborhood}(X) \cap \text{Neighborhood}(Y)\}.$$
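
These definitions map directly to code. The sketch below assumes the edge set R is stored as a set of node pairs with both directions implied (an assumption about the representation, not stated in the slides):

```python
def neighborhood(X, R):
    """Neighborhood(X) = {Y | (X, Y) in R}, treating edges as undirected."""
    return {b for (a, b) in R if a == X} | {a for (a, b) in R if b == X}

def conditional_density(Y, X, R):
    """D(Y|X): number of nodes that are neighbors of both X and Y."""
    return len(neighborhood(X, R) & neighborhood(Y, R))

# Small example: nodes 2 and 3 are common neighbors of nodes 1 and 4.
R = {(1, 2), (1, 3), (2, 3), (2, 4), (3, 4)}
print(conditional_density(4, 1, R))   # -> 2
```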

Graph-Theoretic Clustering
Given an integer $K$, a dense region $Z$ around a node $X \in S$ is defined as
$$Z(X, K) = \{Y \in S \mid D(Y \mid X) \geq K\}.$$
$Z(X) = Z(X, J)$ is a dense region candidate around $X$, where $J = \max\{K \mid \#Z(X, K) \geq K\}$, because if $M$ is a major clique of size $L$, then $X, Y \in M$ implies that $D(Y \mid X) \geq L$; thus $M \subseteq Z(X, L)$ and $K \leq L \leq \#Z(X, K)$.

Graph-Theoretic Clustering
The association of a node $X$ to a subset $B$ of $S$ is
$$A(X \mid B) = \frac{\#\{\text{Neighborhood}(X) \cap B\}}{\#B},$$
where $0 \leq A(X \mid B) \leq 1$. The compactness of a subset $B$ of $S$ is
$$C(B) = \frac{1}{\#B} \sum_{X \in B} A(X \mid B),$$
where $0 \leq C(B) \leq 1$.
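
Continuing the same sketch and reusing the neighborhood helper from the previous block, association and compactness can be written as:

```python
def association(X, B, R):
    """A(X|B): fraction of the nodes in B that are neighbors of X."""
    return len(neighborhood(X, R) & set(B)) / len(B)

def compactness(B, R):
    """C(B): average association of the members of B to B itself."""
    return sum(association(X, B, R) for X in B) / len(B)
```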

Graph-Theoretic Clustering
A dense region $B$ of the graph $(S, R)$ should satisfy
1. $B = \{N \in Z(X) \mid A(N \mid Z(X)) \geq \tau_a\}$ for some $X \in S$,
2. $C(B) \geq \tau_c$,
3. $\#B \geq \tau_s$,
where $\tau_a$, $\tau_c$ and $\tau_s$ are thresholds supplied by the user for minimum association, minimum compactness, and minimum size, respectively.

Graph-Theoretic Clustering
Algorithm for finding a dense region around a node $X$:
1. Compute $D(Y \mid X)$ for every other node $Y$ in $S$.
2. Find a dense region candidate $Z(X, K^*)$ where $K^* = \max\{K \mid \#\{Y \mid D(Y \mid X) \geq K\} \geq K\}$.
3. Remove the nodes with a low association from the candidate set. Iterate until all of the remaining nodes have a high enough association.
4. Check whether the remaining nodes have a high enough average association.
5. Check whether the candidate set is large enough.
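
Putting the helpers together, a sketch of this procedure might look as follows; the thresholds default to the values used in Figure 9, and the structure of the iteration is an interpretation of steps 3 to 5 rather than the exact original algorithm:

```python
def dense_region(X, S, R, tau_a=0.5, tau_c=0.6, tau_s=3):
    """Find a dense region around node X, following steps 1-5 above."""
    # Steps 1-2: conditional densities and the dense region candidate Z(X, K*).
    D = {Y: conditional_density(Y, X, R) for Y in S if Y != X}
    K_star = max((K for K in range(1, len(S) + 1)
                  if sum(d >= K for d in D.values()) >= K), default=0)
    if K_star == 0:
        return None
    B = {Y for Y, d in D.items() if d >= K_star}
    # Step 3: iteratively drop nodes whose association to the candidate is low.
    while B:
        low = {Y for Y in B if association(Y, B, R) < tau_a}
        if not low:
            break
        B -= low
    # Steps 4-5: accept only if the region is compact enough and large enough.
    if B and compactness(B, R) >= tau_c and len(B) >= tau_s:
        return B
    return None
```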

Graph-Theoretic Clustering
Given the dense regions, the algorithm for graph-theoretic clustering proceeds as follows:
1. Define the dense-region relation $F$ as
$$F = \left\{ (B_1, B_2) \;\middle|\; B_1, B_2 \text{ are dense regions of } R,\; \frac{\#(B_1 \cap B_2)}{\#B_1} \geq \tau_o \text{ or } \frac{\#(B_1 \cap B_2)}{\#B_2} \geq \tau_o \right\},$$
where $\tau_o$ is a threshold supplied by the user for minimum overlap.
2. Merge the regions that have enough overlap if all of the nodes in the resulting merged set have high enough associations.
3. Iterate until no regions can be merged.

Graph-Theoretic Clustering
Figure 9: Clusters found in the example graph using the thresholds $\tau_a = 0.5$, $\tau_c = 0.6$, $\tau_s = 3$, $\tau_o = 0.9$: {1, 2, 3, 4, 6} (compactness 0.92), {7, 8, 9, 10} (compactness 1.00), {2, 5, 6, 7, 10} (compactness 0.68).

Graph-Theoretic Clustering
Figure 10: Another example graph.

Graph-Theoretic Clustering
Figure 11: Clusters found in the second example graph using the thresholds $\tau_a = 0.5$, $\tau_c = 0.8$, $\tau_s = 3$, $\tau_o = 0.75$: {1, 2, 3, 4, 6, 8} (compactness 0.78), {2, 4, 5, 8} (compactness 0.88), {5, 7, 9, 10} (compactness 1.00).

Cluster Validity
The procedures we have considered so far either assume that the number of clusters is known or use thresholds to decide how the clusters are formed. These may be reasonable assumptions for some applications but are usually unjustified if we are exploring a data set whose properties and structure are unknown. Furthermore, most of the iterative algorithms we use may find only a local extremum and may not give the best result.

Cluster Validity
Methods for validating the results of a clustering algorithm include: repeating the clustering procedure for different values of the parameters and examining the resulting values of the criterion function for large jumps or stable ranges; evaluating the goodness-of-fit using measures such as the chi-squared or Kolmogorov-Smirnov statistics; formulating hypothesis tests that check whether the clusters found could have been formed by chance, and whether the observed change in the error criterion has any significance.

Cluster Validity
The groupings found by unsupervised clustering can also be compared to known labels if ground truth is available. In an optimal result, the patterns with the same class labels in the ground truth should be assigned to the same cluster, and the patterns corresponding to different classes should appear in different clusters. The following measures quantify how well the results of the unsupervised clustering algorithm reflect the groupings in the ground truth: entropy and the Rand index.

Cluster Validity
Entropy is an information-theoretic criterion that measures the homogeneity of the distribution of the clusters with respect to the different classes. Given $K$ as the number of clusters resulting from the clustering algorithm and $C$ as the number of classes in the ground truth, let
$h_{ck}$ denote the number of patterns assigned to cluster $k$ with ground truth class label $c$,
$h_{c\cdot} = \sum_{k=1}^{K} h_{ck}$ denote the number of patterns with ground truth class label $c$,
$h_{\cdot k} = \sum_{c=1}^{C} h_{ck}$ denote the number of patterns assigned to cluster $k$.

Cluster Validity
The quality of individual clusters is measured in terms of the homogeneity of the class labels within each cluster. For each cluster $k$, the cluster entropy $E_k$ is given by
$$E_k = -\sum_{c=1}^{C} \frac{h_{ck}}{h_{\cdot k}} \log \frac{h_{ck}}{h_{\cdot k}}.$$
Then, the overall cluster entropy $E_{\text{cluster}}$ is given by a weighted sum of the individual cluster entropies:
$$E_{\text{cluster}} = \frac{1}{\sum_{k=1}^{K} h_{\cdot k}} \sum_{k=1}^{K} h_{\cdot k} E_k.$$
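
A small sketch computing $E_{\text{cluster}}$ from a $C \times K$ contingency matrix $h$ (the matrix layout and function name are assumptions for illustration):

```python
import numpy as np

def cluster_entropy(h):
    """E_cluster from a C x K contingency matrix h, where h[c, k] is the number
    of patterns with ground truth class c that were assigned to cluster k."""
    h = np.asarray(h, dtype=float)
    n = h.sum()
    E_cluster = 0.0
    for k in range(h.shape[1]):
        h_k = h[:, k].sum()                  # h_.k: number of patterns in cluster k
        p = h[:, k][h[:, k] > 0] / h_k       # class proportions within cluster k
        E_k = -np.sum(p * np.log(p))         # cluster entropy E_k
        E_cluster += h_k * E_k
    return E_cluster / n
```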

Cluster Validity
A smaller cluster entropy value indicates higher homogeneity. However, the cluster entropy continues to decrease as the number of clusters increases. To overcome this problem, another entropy criterion that measures how patterns of the same class are distributed among the clusters can be defined.

Cluster Validity
For each class $c$, the class entropy $E_c$ is given by
$$E_c = -\sum_{k=1}^{K} \frac{h_{ck}}{h_{c\cdot}} \log \frac{h_{ck}}{h_{c\cdot}}.$$
Then, the overall class entropy $E_{\text{class}}$ is given by a weighted sum of the individual class entropies:
$$E_{\text{class}} = \frac{1}{\sum_{c=1}^{C} h_{c\cdot}} \sum_{c=1}^{C} h_{c\cdot} E_c.$$

Cluster Validity
Unlike the cluster entropy, the class entropy increases when the number of clusters increases. Therefore, the two measures can be combined into an overall entropy measure as
$$E = \beta E_{\text{cluster}} + (1 - \beta) E_{\text{class}},$$
where $\beta \in [0, 1]$ is a weight that balances the two measures.
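
The class entropy and the combined measure follow the same pattern; this sketch reuses cluster_entropy and the numpy import from the previous block:

```python
def class_entropy(h):
    """E_class: how the patterns of each ground truth class spread over clusters."""
    h = np.asarray(h, dtype=float)
    n = h.sum()
    E_class = 0.0
    for c in range(h.shape[0]):
        h_c = h[c, :].sum()                  # h_c.: number of patterns of class c
        p = h[c, :][h[c, :] > 0] / h_c       # cluster proportions within class c
        E_class += h_c * -np.sum(p * np.log(p))
    return E_class / n

def overall_entropy(h, beta=0.5):
    """E = beta * E_cluster + (1 - beta) * E_class."""
    return beta * cluster_entropy(h) + (1 - beta) * class_entropy(h)
```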

Cluster Validity
The Rand index can also be used to measure the agreement of every pair of patterns according to both the unsupervised and the ground truth labelings. An agreement occurs if two patterns that belong to the same class are put into the same cluster, or two patterns that belong to different classes are put into different clusters.

Cluster Validity
The Rand index is computed as the proportion of all pattern pairs that agree in their labels. The index has a value between 0 and 1, where 0 indicates that the two labelings do not agree on any pair of patterns and 1 indicates that the two labelings are exactly the same.
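
A direct, pair-by-pair sketch of the Rand index (quadratic in the number of patterns; names are illustrative):

```python
from itertools import combinations

def rand_index(labels_true, labels_pred):
    """Proportion of pattern pairs on which the two labelings agree."""
    agree = total = 0
    for i, j in combinations(range(len(labels_true)), 2):
        same_class = labels_true[i] == labels_true[j]
        same_cluster = labels_pred[i] == labels_pred[j]
        agree += int(same_class == same_cluster)
        total += 1
    return agree / total

# Two identical partitions (up to label names) agree on every pair.
print(rand_index([0, 0, 1, 1], [1, 1, 0, 0]))   # -> 1.0
```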