Cluster Validation. Ke Chen. Reading: [25.1.2, KPM], [Wang et al., 2009], [Yang & Chen, 2011] COMP24111 Machine Learning


Outline
- Motivation and Background
- Internal index
  - Motivation and general ideas
  - Variance-based internal indexes
  - Application: finding the proper cluster number
- External index
  - Motivation and general ideas
  - Rand Index
  - Application: weighted clustering ensemble
- Summary

Motivation and Background

Motivation
- Supervised classification: class labels are known as ground truth, so accuracy can be computed from the given labels. [Example figure: classifying oranges vs. apples, Accuracy = 8/10 = 80%.]
- Clustering analysis: no class labels are available, yet evaluation is still demanded.

Validation is needed to:
- compare clustering algorithms;
- solve for the number of clusters;
- avoid finding patterns in noise;
- find the best clusters from the data.

Motivation and Background

Illustrative Example: which one is the best?
[Figure: scatter plots (y against x) of a random-point data set and the partitions found on it by K-means (K = 3) and by agglomerative clustering with complete link.]

Motivation and Background

Cluster validation refers to procedures that evaluate the results of clustering in a quantitative and objective fashion.
- How to be quantitative: employ validity measures (indexes).
- How to be objective: validate the measures themselves!

Validation pipeline: INPUT: DataSet (X) → Clustering Algorithm(s), run under different settings/configurations → Partitions P → Validity Index m*

Motivation and Background

Internal Criteria (Indexes)
- Validate without external information.
- Run with different numbers of clusters to solve for the number of clusters.

External Criteria (Indexes)
- Validate against the ground truth.
- Compare two partitions (how similar are they?).

Internal Index

Ground truth is unavailable, so unsupervised validation must be done with common sense or a priori knowledge. There are a variety of internal indexes:
- Variance-based methods
- Rate-distortion methods
- Davies-Bouldin index (DBI)
- Bayesian Information Criterion (BIC)
- Silhouette Coefficient
- Minimum description length (MDL) principle
- Stochastic complexity (SC)
- Modified Hubert's Γ (MHΓ) index

Internal Index

Variance-based methods
- Minimise the within-cluster variance (SSW): intra-cluster variance is minimized.
- Maximise the between-cluster variance (SSB): inter-cluster variance is maximized.

Internal Index

Variance-based methods (cont.)
Assume an algorithm leads to a partition of K clusters, where cluster i has $n_i$ data points and $c_i$ is its centroid, and $d(\cdot,\cdot)$ is the distance used by the algorithm.

Within-cluster variance (SSW):
$SSW(K) = \sum_{i=1}^{K} \sum_{j=1}^{n_i} d^2(x_{ij}, c_i)$

Between-cluster variance (SSB):
$SSB(K) = \sum_{i=1}^{K} n_i \, d^2(c_i, c)$

where $c$ is the global mean (centroid) of the whole data set.

Internal Index

Variance-based F-ratio index
Measures the ratio of the within-cluster variance to the between-cluster variance (related to the classical F-test).

F-ratio index (W-B index) for a partition of K clusters:
$F(K) = \frac{K \cdot SSW(K)}{SSB(K)} = \frac{K \sum_{i=1}^{K} \sum_{j=1}^{n_i} d^2(x_{ij}, c_i)}{\sum_{i=1}^{K} n_i \, d^2(c_i, c)}$

where $x_{ij}$ is the jth data point in cluster $c_i$ and $n_i$ is the number of data points in cluster $c_i$.
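To make the definitions concrete, here is a minimal NumPy sketch of the F-ratio index, assuming Euclidean distance, a data matrix X of shape (n_samples, n_features) and a label vector; the function and variable names are illustrative, not from the slides:

```python
import numpy as np

def f_ratio(X, labels):
    """F-ratio (W-B) index: K * SSW(K) / SSB(K). Smaller is better."""
    labels = np.asarray(labels)
    clusters = np.unique(labels)
    K = len(clusters)
    c = X.mean(axis=0)                                # global centroid c
    ssw = ssb = 0.0
    for k in clusters:
        members = X[labels == k]                      # the points x_ij in cluster i
        c_i = members.mean(axis=0)                    # cluster centroid c_i
        ssw += ((members - c_i) ** 2).sum()           # sum_j d^2(x_ij, c_i)
        ssb += len(members) * ((c_i - c) ** 2).sum()  # n_i * d^2(c_i, c)
    return K * ssw / ssb
```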

Internal Index

Application: finding the proper cluster number
[Figure: F-ratio (×10^5) plotted against the number of clusters for algorithms PNN and IS; the minimum of each curve indicates the proper cluster number.]

Internal Index

Application: finding the proper cluster number (cont.)
[Figure: F-ratio plotted against the number of clusters for algorithms PNN and IS on data set S4; the two curves reach their minima at 16 and at 15 clusters, marked on the plot.]
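As a sketch of how this model selection could be automated, the loop below runs K-means for a range of K and returns the K with the smallest F-ratio, reusing the f_ratio function above; scikit-learn's KMeans is an assumption of this sketch, standing in for the PNN and IS algorithms used on the slides:

```python
from sklearn.cluster import KMeans

def choose_k(X, k_min=2, k_max=25):
    """Return the K in [k_min, k_max] that minimises the F-ratio index."""
    scores = {}
    for k in range(k_min, k_max + 1):
        labels = KMeans(n_clusters=k, n_init=10, random_state=0).fit_predict(X)
        scores[k] = f_ratio(X, labels)       # defined above
    best_k = min(scores, key=scores.get)     # K at the curve's minimum
    return best_k, scores
```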

External Index

Ground truth is available, but a clustering algorithm does not use such information during unsupervised learning. There are a variety of external indexes:
- Rand Index
- Adjusted Rand Index
- Pair-counting index
- Information-theoretic index
- Set-matching index
- DVI index
- Normalised mutual information (NMI) index

External Index

Main issues
If the ground truth is known, the validity of a clustering can be verified by comparing the class labels against the clustering labels. However, this is much more complicated than in supervised classification (where the labels are used in training):
- The cluster IDs in a partition resulting from clustering are assigned arbitrarily due to unsupervised learning (a permutation problem).
- The number of clusters may differ from the number of classes in the ground truth (an inconsistency problem).

The most important problem for external indexes is how to find all possible correspondences between the ground truth and a partition (or between two candidate partitions, in the case of comparison).

External Index

Rand Index
This is the first external index, proposed by Rand (1971) to address the correspondence problem.
Basic idea: consider all pairs of points in the data set, looking at both agreement and disagreement with the ground truth.
The index is defined as RI(X, Y) = (a + d) / (a + b + c + d), with a, b, c, d counted over all point pairs:

                                  Y: pairs in the same class   Y: pairs in different classes
X: pairs in the same cluster                  a                              b
X: pairs in different clusters                c                              d

External Index

Rand Index (cont.)
Example: for a 5-point data set, we have

Point:  1   2   3   4   5
X:      i   ii  ii  i   i
Y:      q   p   q   p   q

To calculate a, b, c and d, we list all possible pairs (excluding pairing any data point with itself):
[1, 2], [1, 3], [1, 4], [1, 5]
[2, 3], [2, 4], [2, 5]
[3, 4], [3, 5]
[4, 5]

External Index

Rand Index (cont.)
Example (cont.), with the same labelling of the 5-point data set.
Initialisation: a = 0, b = 0, c = 0, d = 0.
For the data point pair [1, 2]:
X: [1, 2] assigned to (i, ii) → in different clusters
Y: [1, 2] labelled by (q, p) → in different classes
Thus, d ← d + 1 = 1.

External Index

Rand Index (cont.)
Current status: a = 0, b = 0, c = 0, d = 1.
For the data point pair [1, 3]:
X: [1, 3] assigned to (i, ii) → in different clusters
Y: [1, 3] labelled by (q, q) → in the same class
Thus, c ← c + 1 = 1.

External Index

Rand Index (cont.)
Current status: a = 0, b = 0, c = 1, d = 1.
For the data point pair [1, 4]:
X: [1, 4] assigned to (i, i) → in the same cluster
Y: [1, 4] labelled by (q, p) → in different classes
Thus, b ← b + 1 = 1.

External Index

Rand Index (cont.)
Current status: a = 0, b = 1, c = 1, d = 1.
For the data point pair [1, 5]:
X: [1, 5] assigned to (i, i) → in the same cluster
Y: [1, 5] labelled by (q, q) → in the same class
Thus, a ← a + 1 = 1.

External Index

Rand Index (cont.)
Current status: a = 1, b = 1, c = 1, d = 1.
In-class exercise: continue until you have the final values of a, b, c and d.
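A brute-force sketch of this pair-counting walk-through, applied to the slides' 5-point example (the label encoding is illustrative); running it completes the in-class exercise:

```python
from itertools import combinations

def rand_index(X_labels, Y_labels):
    """Brute-force pair counting, exactly as in the walk-through above."""
    a = b = c = d = 0
    for p, q in combinations(range(len(X_labels)), 2):
        same_cluster = X_labels[p] == X_labels[q]   # same cluster in X?
        same_class = Y_labels[p] == Y_labels[q]     # same class in Y?
        if same_cluster and same_class:
            a += 1
        elif same_cluster:
            b += 1
        elif same_class:
            c += 1
        else:
            d += 1
    return (a + d) / (a + b + c + d)

# The slides' 5-point example (points 1..5):
X = ['i', 'ii', 'ii', 'i', 'i']    # clustering labels
Y = ['q', 'p', 'q', 'p', 'q']      # ground-truth class labels
print(rand_index(X, Y))            # 0.4, i.e. (a + d)/10 with a=1, b=3, c=3, d=3
```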

External Index

Rand Index: Contingency Table
In general, a, b, c and d can be calculated from a contingency table. Assume there are k clusters in X and l classes in Y, and let $n_{ij}$ denote the number of points in both cluster i and class j.

External Index

Rand Index: Contingency Table (cont.)
Example: for the 5-point data set above (X: i ii ii i i; Y: q p q p q), the contingency table is

        p    q    total
i       1    2      3
ii      1    1      2
total   2    3      5

External Index

Rand Index: measuring the number of pairs
- In the same cluster/class in both X and Y:
$a = \frac{1}{2}\sum_{i=1}^{k}\sum_{j=1}^{l} n_{ij}(n_{ij}-1)$
- In the same cluster in X but different classes in Y:
$b = \frac{1}{2}\Big(\sum_{i=1}^{k} n_{i\cdot}^2 - \sum_{i=1}^{k}\sum_{j=1}^{l} n_{ij}^2\Big)$
- In different clusters in X but the same class in Y:
$c = \frac{1}{2}\Big(\sum_{j=1}^{l} n_{\cdot j}^2 - \sum_{i=1}^{k}\sum_{j=1}^{l} n_{ij}^2\Big)$
- In different clusters/classes in both X and Y:
$d = \frac{1}{2}\Big(N^2 + \sum_{i=1}^{k}\sum_{j=1}^{l} n_{ij}^2 - \big(\sum_{i=1}^{k} n_{i\cdot}^2 + \sum_{j=1}^{l} n_{\cdot j}^2\big)\Big)$

where $n_{i\cdot}$ and $n_{\cdot j}$ are the row and column sums of the contingency table and N is the total number of points.
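The same counts can be obtained directly from the contingency table. Below is a NumPy sketch under the same illustrative label encoding; it reproduces a = 1, b = 3, c = 3, d = 3 for the 5-point example:

```python
import numpy as np

def pair_counts(X_labels, Y_labels):
    """Compute a, b, c, d from the k-by-l contingency table n_ij."""
    clusters = sorted(set(X_labels))
    classes = sorted(set(Y_labels))
    n = np.zeros((len(clusters), len(classes)))
    for x, y in zip(X_labels, Y_labels):
        n[clusters.index(x), classes.index(y)] += 1  # build n_ij
    N = n.sum()
    s = (n ** 2).sum()                               # sum_ij n_ij^2
    rows = (n.sum(axis=1) ** 2).sum()                # sum_i n_i.^2
    cols = (n.sum(axis=0) ** 2).sum()                # sum_j n_.j^2
    a = (s - N) / 2                                  # = (1/2) sum n_ij (n_ij - 1)
    b = (rows - s) / 2
    c = (cols - s) / 2
    d = (N ** 2 + s - rows - cols) / 2
    return a, b, c, d

print(pair_counts(['i', 'ii', 'ii', 'i', 'i'],
                  ['q', 'p', 'q', 'p', 'q']))        # (1.0, 3.0, 3.0, 3.0)
```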

Weighted Clustering Ensemble

Motivation
Evidence-accumulation-based clustering ensemble (Fred & Jain, 2005) is a simple clustering ensemble algorithm that uses the evidence accumulated from multiple yet diversified partitions, generated with different algorithms, initial conditions, distance metrics, and so on. However, this clustering ensemble algorithm is subject to limitations:
- It does not distinguish between non-trivial and trivial partitions; i.e., all partitions used in the ensemble are treated as equally important.
- It is sensitive to the cluster distance used in the hierarchical clustering that reaches the consensus.

Cluster validity indexes provide an effective way to measure the non-trivialness, or importance, of a partition:
- Internal index: directly measures the importance of a partition, quantitatively and objectively.
- External index: indirectly measures the importance of a partition by comparing it with the other partitions used in the ensemble, for another round of evidence accumulation.

Weighted Clustering Ensemble

Use validity index values as weights.
Yun Yang and Ke Chen, "Temporal data clustering via weighted clustering ensemble with different representations," IEEE Transactions on Knowledge and Data Engineering, vol. 23, no. 2, pp. 307-320, 2011.

Weighted Clustering Ensemble

Example: convert clustering results into binary distance matrices
[Figure: four points A, B, C, D under two different partitions, one with Cluster 1 (C1) and Cluster 2 (C2) and one that also involves Cluster 3 (C3); each partition is converted into a binary distance matrix whose (x, y) entry is 0 when x and y fall in the same cluster and 1 otherwise.]

Weighted Clustering Ensemble

Evidence accumulation: form the collective weighted distance matrix from the binary distance matrices $D_1, \dots, D_M$ of the M partitions and their validity-index weights $w_1, \dots, w_M$:
$D^E = \sum_{m=1}^{M} w_m D_m$
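A compact sketch of this evidence-accumulation step, assuming each partition is given as a label vector and each weight comes from a validity index; normalising the weights to sum to one is an assumption of this sketch, not a detail from the slides:

```python
import numpy as np

def binary_distance_matrix(labels):
    """Entry (x, y) is 0 if x and y share a cluster, 1 otherwise."""
    labels = np.asarray(labels)
    return (labels[:, None] != labels[None, :]).astype(float)

def collective_distance(partitions, weights):
    """Weighted evidence accumulation: D^E = sum_m w_m * D_m."""
    w = np.asarray(weights, dtype=float)
    w = w / w.sum()  # normalised weights (assumption of this sketch)
    # D^E can then be fed to hierarchical clustering to reach the consensus.
    return sum(wi * binary_distance_matrix(p) for wi, p in zip(w, partitions))
```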

Weighted Clustering Ensemble

Clustering analysis with our weighted clustering ensemble on the CAVIAR database:
- annotated video sequences of pedestrians, giving a set of high-quality moving trajectories;
- clustering analysis of trajectories is useful for many applications.

Experimental setting:
- Representations: PCF, DCF, PLS and PDWT
- Initial clustering analysis: K-means algorithm (4 < K < …), 6 initial settings
- Ensemble: 32 partitions in total (8 partitions per representation)

Weighted Clustering Ensemble

Clustering results on the CAVIAR database
[Figure: clustering results on the CAVIAR trajectories.]

Weighted Clustering Ensemble

Application: UCR time series benchmarks
[Figure: the UCR time series benchmark data sets.]

Weighted Clustering Ensemble

Application: Rand index (%) values of clustering ensembles (Yang & Chen, 2011)
[Table: Rand index (%) results.]

Summary

Cluster validation is a process that evaluates clustering results with a pre-defined criterion. There are two different types of cluster validation methods:
- Internal indexes: no ground truth is available; they are defined based on common sense or a priori knowledge. Application: finding the proper number of clusters, etc.
- External indexes: the ground truth is known, or a reference partition is given (a "relative index"). Application: performance evaluation of clustering against reference information.

Apart from direct evaluation, both kinds of index may be applied in a weighted clustering ensemble, leading to better yet more robust results by ignoring trivial partitions.

K. Wang et al., "CVAP: Validation for cluster analysis," Data Science Journal, vol. 8, May 2009. [Code available online: http://www.mathworks.com/matlabcentral/fileexchange/authors/48]