Traditional clustering fails if:
-- in the input space the clusters are not linearly separable;
-- the distance measure is not adequate;
-- the assumptions limit the shape or the number of the clusters.

Our solution:
1. map to a high-dimensional space where the clusters are linearly separable;
2. minimize a convex objective function to learn an appropriate distance;
3. do mean shift clustering in the learned space.

Pipeline: nonlinear input plus a few must-link and cannot-link pairs (cluster labels are not specified) -> linearly separable kernel space -> Bregman LogDet divergence learning -> kernel mean shift. All parameters are determined from the processing itself.

Bregman divergence
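The slide carries only the title; as a reminder (not shown on the slide), for a strictly convex, differentiable generator $\varphi$ the Bregman divergence between $\mathbf{x}$ and $\mathbf{y}$ is

$$D_{\varphi}(\mathbf{x}, \mathbf{y}) = \varphi(\mathbf{x}) - \varphi(\mathbf{y}) - \nabla\varphi(\mathbf{y})^{\top}(\mathbf{x} - \mathbf{y}),$$

and for positive definite matrices the LogDet (Burg) matrix divergence used in this talk, generated by $\varphi(K) = -\log\det K$, is

$$D_{\mathrm{ld}}(K, K_0) = \operatorname{tr}\!\left(K K_0^{-1}\right) - \log\det\!\left(K K_0^{-1}\right) - n.$$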

Kullback-Leibler divergence
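Again only the title survives; the Kullback-Leibler divergence is the Bregman divergence generated by the negative entropy $\varphi(\mathbf{p}) = \sum_i p_i \log p_i$, which for probability vectors reduces to

$$D_{\mathrm{KL}}(\mathbf{p}\,\|\,\mathbf{q}) = \sum_i p_i \log \frac{p_i}{q_i}.$$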

For positive semidefinite matrices, like our kernel matrix with rank r, the optimization is convex and attains the global optimum.

Kernel Matrix Representation. The input data matrix is d x n (n points in d dimensions).

Gaussian kernel for a pairwise distance. The estimation of sigma already uses the Bregman LogDet divergence over distances of pairs of kernel points; all databases in our experiments have a minimum in sigma. The kernel matrix has at most rank n, the number of points. The reduced rank r < n kernel K, which retains 99 percent of the original kernel's energy, is used in the rest of the computation.
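A minimal sketch of these two steps, assuming a Gaussian (RBF) kernel and taking "energy" to mean the sum of eigenvalues of the Gram matrix; the 99 percent threshold is from the slide, the function names and the fixed sigma are only illustrative:

```python
import numpy as np

def gaussian_kernel_matrix(X, sigma):
    """n x n Gaussian kernel matrix for the rows of X (n points, d features)."""
    sq = np.sum(X**2, axis=1)
    d2 = sq[:, None] + sq[None, :] - 2.0 * X @ X.T   # squared pairwise distances
    return np.exp(-d2 / (2.0 * sigma**2))

def low_rank_kernel(K, energy=0.99):
    """Keep the smallest rank r whose eigenvalues retain `energy` of the total."""
    w, V = np.linalg.eigh(K)                 # eigenvalues in ascending order
    w, V = w[::-1], V[:, ::-1]               # sort descending
    w = np.clip(w, 0.0, None)                # guard against tiny negative values
    r = int(np.searchsorted(np.cumsum(w) / w.sum(), energy)) + 1
    K_r = V[:, :r] @ np.diag(w[:r]) @ V[:, :r].T
    return K_r, r

# Example: n = 200 random 2-D points
X = np.random.default_rng(0).normal(size=(200, 2))
K = gaussian_kernel_matrix(X, sigma=1.0)
K_r, r = low_rank_kernel(K)                  # the slide reports r << n on their data sets
```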

Two examples:
-- Olympic circles [5 classes]: n = 1500, r = 58; 300x faster (reduced rank vs. full rank).
-- USPS digits [10 classes] (here n = 1000): r = 499; 2x faster (reduced rank vs. full rank).

Linear Transformation in the Kernel Space. M and C are the sets of must-link and cannot-link constraint pairs. A symmetric d x d matrix is built from the current constraint pair alone. The value of s is computed from the constraint pair and the updated kernel matrix is obtained; s = 1 gives the projection into the null space. Some problems appear, see also the paper below. O. Tuzel, F. Porikli, P. Meer: Kernel methods for weakly supervised mean shift clustering. ICCV, 48-55, 2009.

Projecting into the null space. The dimensionality of the kernel space is reduced by one after each constraint is applied. Null space projection cannot handle cannot-link constraints and is sensitive to labeling errors.
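A rough sketch of the null-space projection mechanism under simplifying assumptions: the kernel-space points are given explicit coordinates through an eigendecomposition of K, and each must-link pair (i, j) contributes the difference of the two mapped points as a direction to project out. This only illustrates the idea described above, not the authors' implementation.

```python
import numpy as np

def kernel_space_embedding(K):
    """Explicit (n x r) coordinates Y with Y @ Y.T == K (up to numerical rank)."""
    w, V = np.linalg.eigh(K)
    keep = w > 1e-10
    return V[:, keep] * np.sqrt(w[keep])

def null_space_projection(K, must_link):
    """Project kernel-space points onto the null space of the must-link
    difference vectors; each applied constraint removes one dimension."""
    Y = kernel_space_embedding(K)
    for i, j in must_link:
        d = Y[i] - Y[j]
        n2 = d @ d
        if n2 > 1e-12:
            Y = Y - np.outer(Y @ d, d) / n2    # remove the component along d
    return Y @ Y.T                              # projected kernel matrix

# toy usage: after the projection the two constrained points coincide in kernel space
K = np.array([[1.0, 0.2, 0.1],
              [0.2, 1.0, 0.3],
              [0.1, 0.3, 1.0]])
K_proj = null_space_projection(K, must_link=[(0, 1)])
```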

An example of null space projection. 5 x 150 points, 4 labeled points per class, giving 6 pairwise must-link constraints per class, 6 x 5 = 30 in total, plus one mislabeled constraint.

Kernel Learning by Minimizing the LogDet Divergence. The kernel learning takes one constraint pair at a time, must-link or cannot-link, with a soft margin. At each iteration the slack variable is updated. For each constraint the optimization is solved by Bregman-projection-based updates; the updates are repeated until convergence to the global minimum. P. Jain et al.: Metric and kernel learning using a linear transformation. JMLR, 13:519-547, 2012. B. Kulis et al.: Low-rank kernel learning with Bregman matrix divergences. JMLR, 10:341-376, 2009.
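The slide does not reproduce the update itself; in the kernelized LogDet framework of the two references above, the Bregman projection for a single constraint on the pair $(i_t, j_t)$ takes a rank-one form (stated here from those references, with $\mathbf{e}_i$ the $i$-th standard basis vector and $\beta_t$ a scalar computed from the constraint type, its target distance and the slack variable):

$$K_{t+1} = K_t + \beta_t \, K_t (\mathbf{e}_{i_t} - \mathbf{e}_{j_t})(\mathbf{e}_{i_t} - \mathbf{e}_{j_t})^{\top} K_t .$$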

The initial kernel matrix K has rank r <= n; the update can then be rewritten in a reduced, rank-r form. The scalar variable is computed in each iteration using the kernel matrix and the constraint pair. The matrix is updated using the Cholesky decomposition. P. Jain et al.: Metric and kernel learning using a linear transformation. JMLR, 13:519-547, 2012.

Low rank kernel learning algorithm

Kernel Learning Using the LogDet Divergence. For very large datasets it is infeasible to learn the entire kernel matrix and to store it in memory. Generalization to out-of-sample points, where x or y or both are out of sample: distances can be computed using the learned kernel function.
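The slide omits the formula; in the framework of Jain et al. cited above, with $K_0$ the initial kernel matrix, $K$ the learned one, and $\mathbf{k}_x = [\kappa(x, x_1), \dots, \kappa(x, x_n)]^{\top}$ the vector of initial kernel values between $x$ and the n in-sample points, the learned kernel extends to arbitrary points as

$$\hat{\kappa}(x, y) = \kappa(x, y) + \mathbf{k}_x^{\top} K_0^{-1} (K - K_0) K_0^{-1} \mathbf{k}_y ,$$

so learned distances between out-of-sample points follow from $\hat{\kappa}(x,x) - 2\hat{\kappa}(x,y) + \hat{\kappa}(y,y)$.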

Kernel Mean Shift Clustering. Nonparametric kernel mean shift iteratively finds the closest mode (maximum) of the p.d.f. estimate. With k(x) the kernel profile and g(u) = -k'(u), each iteration moves to the new kernel-weighted mean, which is not an input point. The dimension of the mean shift space is at most 25, which is enough; mean shift can run in a lower dimension if the optimal kernel has rank less than twenty-five.
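A compact sketch of the mean shift iteration with a Gaussian profile on explicit coordinates (which is what the kernel-space embedding reduces to once explicit coordinates are available); the bandwidth value, stopping criterion and mode-grouping step below are simplifications for illustration only:

```python
import numpy as np

def mean_shift_modes(Y, bandwidth, n_iter=100, tol=1e-6):
    """Move every point of Y (n x D) uphill to the nearest mode of a
    Gaussian kernel density estimate; returns the converged positions."""
    modes = Y.copy()
    for _ in range(n_iter):
        shifted = np.empty_like(modes)
        for idx, m in enumerate(modes):
            d2 = np.sum((Y - m) ** 2, axis=1)
            w = np.exp(-d2 / (2.0 * bandwidth ** 2))            # g(.) for a Gaussian profile
            shifted[idx] = (w[:, None] * Y).sum(axis=0) / w.sum()  # weighted mean, not an input point
        if np.max(np.abs(shifted - modes)) < tol:
            modes = shifted
            break
        modes = shifted
    return modes

# points that converge to (numerically) the same mode belong to one cluster
Y = np.vstack([np.random.default_rng(1).normal(0, 0.1, (50, 2)),
               np.random.default_rng(2).normal(2, 0.1, (50, 2))])
modes = mean_shift_modes(Y, bandwidth=0.5)
labels = np.unique(np.round(modes, 2), axis=0, return_inverse=True)[1]
```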

Variable Bandwidth Parameter Selection. Only the must-link constraint pairs participate. Running the whole algorithm with different values of k validates this selection rule (only this needs the entire algorithm to check). Must-link pairs: Olympic circles 5 x 10 = 50, USPS digits 10 x 10 = 100, from 5 labeled points per class.

CONTINGENCY TABLE. True T, False F, Negative N, Positive P.

                 Ground-truth
Result           Not true    True
Not true         TN          FN
True             FP          TP

DEFINITIONS
precision = TP / (TP + FP)    [fraction of the returned results that are correct]
recall    = TP / (TP + FN)    [fraction of the true items that are returned]
quality   = TP / (TP + FP + FN)
F measure = 2 * precision * recall / (precision + recall)

ROC curve: ordinate is the true positive rate (recall) TP / (TP + FN); abscissa is the false alarm rate FP / (FP + TN).

A threshold decides whether a result is declared true or false. Changing the threshold rearranges all four boxes of the table as a function of the ground truth. At one extreme the threshold declares just one sample true, TP = 1: precision is one, recall is close to zero. At the other extreme the threshold declares all returned samples true, so TP equals the number of true samples: recall is one, while precision is low, equal to the fraction of true samples among all results.

A simple example. The ground truth has 15 items of class A, associated with not true, and 5 items of class B, associated with true.

A threshold returning a single true value gives the contingency table

                 Ground-truth
Result           Not true    True
Not true         15          4
True             0           1

with precision = 1 and recall = 0.2.

A threshold returning an intermediate result gives the contingency table

                 Ground-truth
Result           Not true    True
Not true         9           2
True             6           3

with precision = 0.33 and recall = 0.6.

A threshold returning all the results as true gives the contingency table

                 Ground-truth
Result           Not true    True
Not true         0           0
True             15          5

with precision = 0.25 and recall = 1.
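A small check of these three thresholds in code; the counts are exactly those of the example above (the helper name is illustrative):

```python
# Verify the precision/recall values of the three thresholds above.
def precision_recall(tp, fp, fn):
    return tp / (tp + fp), tp / (tp + fn)

# (TP, FP, FN) for: single true result, intermediate threshold, everything returned true
for tp, fp, fn in [(1, 0, 4), (3, 6, 2), (5, 15, 0)]:
    p, r = precision_recall(tp, fp, fn)
    print(f"precision = {p:.2f}, recall = {r:.2f}")
# prints: 1.00 / 0.20, 0.33 / 0.60, 0.25 / 1.00
```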

Adjusted Rand Index. A scalar measure to evaluate clustering performance from the clustering output; it is negative if the expected index is larger than the obtained one. TP true positive; TN true negative; FP false positive; FN false negative. It compensates for chance: randomly assigned cluster labels get a score close to zero.
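The slide lists the pair counts but not the formula; counting TP, TN, FP, FN over pairs of points (together in both partitions, in neither, only in the clustering, only in the ground truth), the Hubert-Arabie adjusted Rand index can be written as

$$\mathrm{ARI} = \frac{2\,(TP \cdot TN - FN \cdot FP)}{(TP+FN)(FN+TN) + (TP+FP)(FP+TN)},$$

which is 1 for a perfect match, about 0 for random labels, and negative when the obtained index falls below its expected value.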

Trade-off Parameter. Measured with the Adjusted Rand (AR) index for the entire algorithm. Pairwise constraints generated from 15 labeled points per class; learning with half the data and testing with the other half (two-fold cross-validation). Must-link constraints: Olympic circles 5 x 105 = 525, USPS digits 10 x 105 = 1050.

A few examples. The algorithm generalizes to out-of-sample points that were not taken into consideration in the kernel matrix optimization.

Comparison of Semi-Supervised Kernel Mean Shift Clustering (SKMS) with:
1. Constrained spectral clustering (E2CP);
2. Semi-supervised kernel k-means (SSKK); sigma is also given as input to these two.
After the LogDet Bregman divergence learning:
3. Kernel k-means (Kkm);
4. Kernel spectral clustering (KSC).
All four methods need the number of clusters as input.
1. Zhiwu Lu, Horace H.S. Ip: Constrained spectral clustering via exhaustive and efficient constraint propagation. ECCV, 1-14, 2010.
2. B. Kulis, S. Basu, I.S. Dhillon, R.J. Mooney: Semi-supervised graph clustering: A kernel approach. Machine Learning, 74:1-22, 2009.

Experimental Evaluation
Two synthetic data sets: Olympic circles (5 classes) and Concentric circles (10 classes); nonlinearly separable, can have intersecting boundaries.
Four real data sets: small number of classes, USPS (10 classes) and MIT Scene (8 classes); large number of classes, PIE faces (68 classes) and Caltech objects (50 classes).
50 independent runs. Average parameter values!

50 runs, most frequent parameters. Synthetic Example 1: Olympic Circles. 5 x 300 points along the five circles; 25 points per class selected at random. (Figure panels: original data, sampled points, result.) Experiment 1: varied the number of labeled points [5, 7, 10, 12, 15, 17, 20, 25] from each class to generate pairwise constraints. Maximum number of labeled points per class: 8.33%.

50 runs, most frequent parameters. Synthetic Example 2: Concentric Circles. 100 points along each of the ten concentric circles. Experiment 1: varied the number of labeled points [5, 7, 10, 12, 15, 17, 20, 25] from each class to generate pairwise constraints. Experiment 2: 25 labeled points per class; labeling errors introduced by swapping similarity pairs with dissimilarity pairs. Maximum number of labeled points per class: 25%.

50 runs, most frequent parameters. Real Example 1: USPS Digits. Ten classes with 1100 points per class, a total of 11000 points; 100 points per class are used in the 1000 x 1000 kernel matrix K. Varied the number of labeled points [5, 7, 10, 12, 15, 17, 20, 25] from each class to generate pairwise constraints. All 11000 data points are clustered by generalizing to the remaining 10000 points: ARI = 0.7529 +/- 0.051 (11000 x 11000 PDM). Maximum number of labeled points per class: 2.27%.

50 runs, most frequent parameters. Real Example 2: MIT Scene. Eight classes with 2688 points; the number of samples per class ranges between 260 and 410. 100 points per class are used in the 800 x 800 kernel matrix K. Varied the number of labeled points [5, 7, 10, 12, 15, 17, 20] from each class to generate pairwise constraints. All 2688 data points are clustered by generalizing to the remaining 1888 points. Maximum number of labeled points per class: 7.44%.

Is it really worth doing it per image? See three misclassified images from one class.

50 runs, most frequent parameters. Real Example 3: PIE Faces. 68 subjects with 21 samples per subject; 1428 x 1428 full initial kernel matrix K. Varied the number of labeled points [3, 4, 5, 6, 7] from each class to generate pairwise constraints. Obtained perfect clustering for more than 5 labeled points per class. Maximum number of labeled points per class: 33%.

50 runs, most frequent parameters. Real Example 4: Caltech 101 (subset). 50 categories with the number of samples ranging between 31 and 40 points per class; 1959 x 1959 full kernel matrix K. Varied the number of labeled points [5, 7, 10, 12, 15] from each class to generate pairwise constraints. Maximum number of labeled points per class: 38%.

Conclusion
-- recovers arbitrarily shaped nonparametric clusters;
-- performs well on databases with a large number of clusters, within the convention of clustering;
-- generalizes to out-of-sample points;
-- the number of clusters is not an input parameter.