Cluster Validation
Ke Chen
Reading: [5.., KPM], [Wang et al., 2009], [Yang & Chen, 2011]
COMP4 Machine Learning
Outline
- Motivation and Background
- Internal index
  - Motivation and general ideas
  - Variance-based internal indexes
  - Application: finding the proper cluster number
- External index
  - Motivation and general ideas
  - Rand index
  - Application: weighted clustering ensemble
- Summary
Motivation and Background
Motivation
- Supervised classification: class labels are known as ground truth, so accuracy can be computed from the given labels.
[Figure: oranges-vs-apples classification example with accuracy computed from the labels]
- Clustering analysis: no class labels, yet evaluation is still demanded.
Validation is needed to:
- compare clustering algorithms
- determine the number of clusters
- avoid finding patterns in noise
- find the best clusters from the data
Motivation and Background
Illustrative example: which one is the best?
[Figure: four panels over the same random-point data set — the raw data, two K-means (K=3) partitions, and an agglomerative (complete-link) clustering]
Motivation and Background
Cluster validation refers to procedures that evaluate the results of clustering in a quantitative and objective fashion.
- How to be quantitative: employ validity measures (indexes).
- How to be objective: validate the measures themselves!
[Diagram: a data set X and clustering algorithm(s) under different settings/configurations produce partitions P, which a validity index scores to select the best result m*]
Motivation and Background
- Internal criteria (indexes): validate without external information; applied across different numbers of clusters — can they determine the number of clusters?
- External criteria (indexes): validate against ground truth; compare two partitions — how similar are they?
Internal Index
Ground truth is unavailable, so unsupervised validation must rely on common sense or a priori knowledge. There are a variety of internal indexes:
- Variance-based methods
- Rate-distortion methods
- Davies-Bouldin index (DBI)
- Bayesian information criterion (BIC)
- Silhouette coefficient
- Minimum description length (MDL) principle
- Stochastic complexity (SC)
- Modified Hubert's Γ (MHΓ) index
Internal Index
Variance-based methods
- Minimise the within-cluster (intra-cluster) variance, SSW
- Maximise the between-cluster (inter-cluster) variance, SSB
Internal Index
Variance-based methods (cont.)
Assume an algorithm produces a partition of K clusters, where cluster i has n_i data points and centroid c_i, and d(·,·) is the distance used by the algorithm.
- Within-cluster variance: SSW(K) = \sum_{i=1}^{K} \sum_{j=1}^{n_i} d^2(x_{ij}, c_i)
- Between-cluster variance: SSB(K) = \sum_{i=1}^{K} n_i \, d^2(c_i, c)
where c is the global mean (centroid) of the whole data set.
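To make the two quantities concrete, here is a minimal NumPy sketch under the squared Euclidean distance (the function name `ssw_ssb` and the toy usage are illustrative, not from the lecture):

```python
import numpy as np

def ssw_ssb(X, labels):
    """Within-cluster (SSW) and between-cluster (SSB) variances of a
    partition, using the squared Euclidean distance for d^2(., .)."""
    X = np.asarray(X, dtype=float)
    labels = np.asarray(labels)
    c_global = X.mean(axis=0)                    # global centroid c
    ssw = ssb = 0.0
    for k in np.unique(labels):
        members = X[labels == k]                 # the n_i points of cluster i
        c_k = members.mean(axis=0)               # cluster centroid c_i
        ssw += ((members - c_k) ** 2).sum()      # sum_j d^2(x_ij, c_i)
        ssb += len(members) * ((c_k - c_global) ** 2).sum()  # n_i d^2(c_i, c)
    return ssw, ssb
```

For example, two 1-D clusters {0, 2} and {10, 12} give SSW = 4 and SSB = 100.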
Internal Index
Variance-based F-ratio index
- Measures the within-cluster variance relative to the between-cluster variance (cf. the original F-test).
- F-ratio (WB) index for a partition of K clusters:
  F(K) = K · SSW(K) / SSB(K)
where x_{ij} is the jth data point in cluster i, n_i is the number of data points in cluster i, and c_i is the centroid of cluster i.
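The index can be sketched as follows; sweeping K and taking the minimum of F(K) gives a cluster-number estimate. The helper k-means is a plain Lloyd iteration standing in for the lecture's PNN/IS algorithms (an assumption for illustration, as are all the function names):

```python
import numpy as np

def f_ratio(X, labels):
    """F-ratio (WB) index: F(K) = K * SSW(K) / SSB(K), to be minimised."""
    X = np.asarray(X, dtype=float)
    labels = np.asarray(labels)
    c = X.mean(axis=0)                            # global centroid
    ssw = ssb = 0.0
    for k in np.unique(labels):
        m = X[labels == k]
        ck = m.mean(axis=0)
        ssw += ((m - ck) ** 2).sum()              # within-cluster variance
        ssb += len(m) * ((ck - c) ** 2).sum()     # between-cluster variance
    return len(np.unique(labels)) * ssw / ssb

def lloyd_kmeans(X, K, n_iter=50, seed=0):
    """Plain Lloyd k-means used here only to generate candidate partitions."""
    X = np.asarray(X, dtype=float)
    rng = np.random.default_rng(seed)
    centroids = X[rng.choice(len(X), size=K, replace=False)].copy()
    for _ in range(n_iter):
        d2 = ((X[:, None, :] - centroids[None, :, :]) ** 2).sum(axis=-1)
        labels = d2.argmin(axis=1)
        for k in range(K):
            if (labels == k).any():
                centroids[k] = X[labels == k].mean(axis=0)
    return labels

def choose_k(X, k_range):
    """Pick the K whose partition minimises the F-ratio index."""
    return min(k_range, key=lambda K: f_ratio(X, lloyd_kmeans(X, K)))
```

In practice one would run several k-means restarts per K and keep the best partition before comparing F-ratio values, since a poor local optimum distorts the comparison.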
Internal Index
Application: finding the proper cluster number
[Figure: F-ratio vs number of clusters for algorithms PNN and IS on data set S; the minimum of each curve indicates the proper cluster number]
Internal Index
Application: finding the proper cluster number
[Figure: F-ratio vs number of clusters for data set S4; the PNN and IS curves reach their minima at the annotated cluster numbers]
External Index
Ground truth is available, but a clustering algorithm does not use such information during unsupervised learning. There are a variety of external indexes:
- Rand index
- Adjusted Rand index
- Pair-counting indexes
- Information-theoretic indexes
- Set-matching indexes
- DVI index
- Normalised mutual information (NMI) index
External Index
Main issues
- If the ground truth is known, the validity of a clustering can be verified by comparing the class labels with the clustering labels. However, this is much more complicated than in supervised classification (where the labels are used in training):
  - The cluster IDs in a partition produced by clustering are assigned arbitrarily, owing to unsupervised learning (the permutation problem).
  - The number of clusters may differ from the number of classes in the ground truth (the inconsistency problem).
- The most important problem for external indexes is how to find all possible correspondences between the ground truth and a partition (or between two candidate partitions when comparing them).
External Index
Rand Index
This is the first external index, proposed by Rand (1971) to address the correspondence problem.
Basic idea: consider all pairs of points in the data set, counting both agreements and disagreements with the ground truth.
The index is defined as RI(X, Y) = (a + d) / (a + b + c + d), where

                                   Y: pairs in the same class   Y: pairs in different classes
  X: pairs in the same cluster     a                            b
  X: pairs in different clusters   c                            d
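In code, the definition translates directly into counting pair agreements (a sketch; `rand_index` is an illustrative name):

```python
from itertools import combinations

def rand_index(X, Y):
    """Rand index RI(X, Y) = (a + d) / (a + b + c + d), computed by
    enumerating all point pairs; X and Y are sequences of labels."""
    a = b = c = d = 0
    for i, j in combinations(range(len(X)), 2):
        same_cluster = X[i] == X[j]
        same_class = Y[i] == Y[j]
        if same_cluster and same_class:
            a += 1                     # agree: together in both partitions
        elif same_cluster:
            b += 1                     # same cluster, different classes
        elif same_class:
            c += 1                     # different clusters, same class
        else:
            d += 1                     # agree: apart in both partitions
    return (a + d) / (a + b + c + d)
```

Identical partitions score 1 regardless of how the labels are named, e.g. `rand_index([1, 1, 2, 2], ['a', 'a', 'b', 'b'])` is 1.0.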
External Index
Rand Index (cont.)
Example: for a 5-point data set, we have

  point: 1  2  3  4  5
  X:     i  ii ii i  i
  Y:     q  p  q  p  q

To calculate a, b, c and d, we have to list all possible pairs (excluding the pairing of any data point with itself):
[1,2], [1,3], [1,4], [1,5]
[2,3], [2,4], [2,5]
[3,4], [3,5]
[4,5]
External Index
Rand Index (cont.)
Example (X: i ii ii i i; Y: q p q p q)
Initialisation: a = 0, b = 0, c = 0, d = 0
For the pair [1, 2]:
- X: [1, 2] assigned to (i, ii) → in different clusters
- Y: [1, 2] labelled (q, p) → in different classes
Thus, d ← d + 1 = 1
External Index
Rand Index (cont.)
Example (X: i ii ii i i; Y: q p q p q)
Current status: a = 0, b = 0, c = 0, d = 1
For the pair [1, 3]:
- X: [1, 3] assigned to (i, ii) → in different clusters
- Y: [1, 3] labelled (q, q) → in the same class
Thus, c ← c + 1 = 1
External Index
Rand Index (cont.)
Example (X: i ii ii i i; Y: q p q p q)
Current status: a = 0, b = 0, c = 1, d = 1
For the pair [1, 4]:
- X: [1, 4] assigned to (i, i) → in the same cluster
- Y: [1, 4] labelled (q, p) → in different classes
Thus, b ← b + 1 = 1
External Index
Rand Index (cont.)
Example (X: i ii ii i i; Y: q p q p q)
Current status: a = 0, b = 1, c = 1, d = 1
For the pair [1, 5]:
- X: [1, 5] assigned to (i, i) → in the same cluster
- Y: [1, 5] labelled (q, q) → in the same class
Thus, a ← a + 1 = 1
External Index
Rand Index (cont.)
Example (X: i ii ii i i; Y: q p q p q)
Current status: a = 1, b = 1, c = 1, d = 1
In-class exercise: continue until you have the final values of a, b, c and d.
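Running the same per-pair bookkeeping to completion can be sketched in a few lines (the counts follow mechanically from the slide's labels):

```python
from itertools import combinations

# The 5-point example: cluster labels X and class labels Y for points 1..5.
X = ['i', 'ii', 'ii', 'i', 'i']
Y = ['q', 'p', 'q', 'p', 'q']

a = b = c = d = 0
for i, j in combinations(range(5), 2):      # all 10 pairs
    same_cluster = X[i] == X[j]
    same_class = Y[i] == Y[j]
    if same_cluster and same_class:
        a += 1
    elif same_cluster:
        b += 1
    elif same_class:
        c += 1
    else:
        d += 1

# Final counts: a = 1, b = 3, c = 3, d = 3
ri = (a + d) / (a + b + c + d)              # RI = (1 + 3) / 10 = 0.4
```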
External Index
Rand Index: contingency table
In general, a, b, c and d are calculated from a contingency table. Assume there are k clusters in X and l classes in Y; n_ij is the number of points that fall in both cluster i and class j.
External Index
Rand Index: contingency table (cont.)
Example: for the 5-point data set (X: i ii ii i i; Y: q p q p q), the contingency table is

           p   q   | n_i.
  i        1   2   | 3
  ii       1   1   | 2
  n_.j     2   3   | 5
External Index
Rand Index: measuring the numbers of pairs
- Same cluster in X and same class in Y:
  a = (1/2) \sum_{i=1}^{k} \sum_{j=1}^{l} n_{ij} (n_{ij} - 1)
- Same cluster in X, different classes in Y:
  b = (1/2) ( \sum_{i=1}^{k} n_{i.}^2 - \sum_{i=1}^{k} \sum_{j=1}^{l} n_{ij}^2 )
- Different clusters in X, same class in Y:
  c = (1/2) ( \sum_{j=1}^{l} n_{.j}^2 - \sum_{i=1}^{k} \sum_{j=1}^{l} n_{ij}^2 )
- Different clusters in X and different classes in Y:
  d = (1/2) ( N^2 + \sum_{i=1}^{k} \sum_{j=1}^{l} n_{ij}^2 - ( \sum_{i=1}^{k} n_{i.}^2 + \sum_{j=1}^{l} n_{.j}^2 ) )
where n_{i.} = \sum_j n_{ij}, n_{.j} = \sum_i n_{ij}, and N is the total number of data points.
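These formulas can be checked with a short NumPy sketch (`pair_counts` is an illustrative name; the example table is the 5-point one above):

```python
import numpy as np

def pair_counts(table):
    """a, b, c, d from a k-by-l contingency table n[i, j]."""
    n = np.asarray(table)
    N = n.sum()
    sum_sq = (n ** 2).sum()                  # sum_ij n_ij^2
    row_sq = (n.sum(axis=1) ** 2).sum()      # sum_i n_i.^2
    col_sq = (n.sum(axis=0) ** 2).sum()      # sum_j n_.j^2
    a = (n * (n - 1)).sum() // 2
    b = (row_sq - sum_sq) // 2
    c = (col_sq - sum_sq) // 2
    d = (N ** 2 + sum_sq - row_sq - col_sq) // 2
    return int(a), int(b), int(c), int(d)
```

For the table [[1, 2], [1, 1]] (rows i, ii; columns p, q) this yields a = 1, b = 3, c = 3, d = 3, so RI = (1 + 3) / 10 = 0.4, matching the pair-by-pair count.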
Weighted Clustering Ensemble
Motivation
- Evidence-accumulation-based clustering ensemble (Fred & Jain, 2005) is a simple clustering ensemble algorithm that uses the evidence accumulated from multiple, diversified partitions generated with different algorithms, initial conditions, distance metrics, and so on.
- However, this clustering ensemble algorithm has limitations:
  - It does not distinguish between non-trivial and trivial partitions; i.e., all partitions used in the ensemble are treated as equally important.
  - It is sensitive to the cluster distance used in the hierarchical clustering that reaches the consensus.
- Cluster validity indexes provide an effective way to measure the non-triviality, or importance, of a partition:
  - Internal indexes directly measure the importance of a partition, quantitatively and objectively.
  - External indexes indirectly measure the importance of a partition by comparing it with the other partitions used in the ensemble, enabling another round of evidence accumulation.
Weighted Clustering Ensemble
Use validity index values as weights.
Yun Yang and Ke Chen, "Temporal data clustering via weighted clustering ensemble with different representations," IEEE Transactions on Knowledge and Data Engineering, 23(2), pp. 307-320, 2011.
Weighted Clustering Ensemble
Example: convert clustering results into binary distance matrices
[Figure: two example partitions of the points A, B, C, D — one with clusters C1 and C2, one that also has a cluster C3 — and the binary distance matrix derived from each: the entry for a pair of points is 0 if they share a cluster and 1 otherwise]
Weighted Clustering Ensemble
Evidence accumulation: form the collective weighted distance matrix by combining the binary distance matrices D_1, ..., D_M of all partitions (across all representations), each weighted by its validity-index value:
  D_E = w_1 D_1 + w_2 D_2 + ... + w_M D_M
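A minimal sketch of the two steps, assuming each partition is reduced to a 0/1 distance matrix and the weights come from validity-index values (normalising the weights to sum to one is an assumption of this sketch):

```python
import numpy as np

def binary_distance(labels):
    """Binary distance matrix of one partition: 0 if two points share
    a cluster, 1 otherwise."""
    labels = np.asarray(labels)
    return (labels[:, None] != labels[None, :]).astype(float)

def collective_distance(partitions, weights):
    """Validity-weighted sum D_E = sum_m w_m * D_m over all partitions."""
    w = np.asarray(weights, dtype=float)
    w = w / w.sum()                      # normalise the validity weights
    return sum(wm * binary_distance(p) for wm, p in zip(w, partitions))
```

The resulting D_E is then fed to a hierarchical clustering algorithm (e.g. complete link) to reach the consensus partition.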
Weighted Clustering Ensemble
Clustering analysis with the weighted clustering ensemble on the CAVIAR database:
- annotated video sequences of pedestrians, giving a set of high-quality moving trajectories
- clustering analysis of trajectories is useful for many applications
Experimental setting:
- Representations: PCF, DCF, PLS and PDWT
- Initial clustering analysis: K-means algorithm (4 < K < …), 6 initial settings
- Ensemble: 32 partitions in total (8 partitions per representation)
Weighted Clustering Ensemble
Clustering results on the CAVIAR database
[Figure in original]
Weighted Clustering Ensemble
Application: UCR time series benchmarks
Weighted Clustering Ensemble
Application: Rand index (%) values of clustering ensembles (Yang & Chen, 2011)
Summary
- Cluster validation is a process that evaluates clustering results against a pre-defined criterion.
- Two different types of cluster validation methods:
  - Internal indexes: no ground truth available; defined from common sense or a priori knowledge. Application: finding the proper number of clusters, etc.
  - External indexes: ground truth known, or a reference partition given ("relative index"). Application: performance evaluation of clustering against reference information.
- Beyond direct evaluation, both kinds of index can be applied in a weighted clustering ensemble, yielding better and more robust results by down-weighting trivial partitions.
K. Wang et al., "CVAP: Validation for cluster analysis," Data Science Journal, vol. 8, May 2009. [Code available online: http://www.mathworks.com/matlabcentral/fileexchange/authors/48]