Traditional clustering fails if:
-- in the input space the clusters are not linearly separable;
-- the distance measure is not adequate;
-- the assumptions limit the shape or the number of the clusters.
Our solution:
1. map to a high-dimensional space where the clusters are linearly separable;
2. minimize a convex objective function to learn an appropriate distance;
3. do mean shift clustering in the learned space.
Pipeline: nonlinear input with a few must-link and cannot-link pairs (cluster labels are not specified) --> linearly separable kernel space, learned with the Bregman LogDet divergence --> kernel mean shift clustering. All parameters are determined from the data during processing.
Bregman divergence: for a strictly convex function phi, D_phi(X, Y) = phi(X) - phi(Y) - tr(grad phi(Y)^T (X - Y)). Choosing phi(X) = -log det X gives the LogDet divergence D_ld(X, Y) = tr(X Y^-1) - log det(X Y^-1) - n.
Kullback-Leibler divergence: the LogDet divergence is, up to a factor of two, the KL divergence between two zero-mean Gaussians with covariances X and Y.
For positive semidefinite matrices, such as our rank-r kernel matrix, the optimization is convex and attains its global optimum.
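A minimal numeric check of these definitions (a sketch assuming full-rank symmetric positive definite matrices; the rank-deficient case restricts the divergence to the range space):

```python
import numpy as np

def logdet_divergence(X, Y):
    """LogDet Bregman divergence, generated by phi(X) = -log det X:
    D_ld(X, Y) = tr(X Y^-1) - log det(X Y^-1) - n for SPD X, Y."""
    n = X.shape[0]
    M = X @ np.linalg.inv(Y)
    _, logdet = np.linalg.slogdet(M)
    return np.trace(M) - logdet - n

# Sanity check on random SPD matrices: the divergence is nonnegative,
# zero only for X == Y, and not symmetric in its arguments.
rng = np.random.default_rng(0)
A = rng.standard_normal((4, 4)); X = A @ A.T + 4 * np.eye(4)
B = rng.standard_normal((4, 4)); Y = B @ B.T + 4 * np.eye(4)
print(logdet_divergence(X, Y) > 0)          # True
print(abs(logdet_divergence(X, X)) < 1e-9)  # True
```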
Kernel Matrix Representation. The input data matrix X is d x n; the mapping phi takes each point into the feature space, and the n x n kernel matrix K collects the inner products, K_ij = k(x_i, x_j).
Gaussian kernel for the pairwise distances: k(x, y) = exp(-||x - y||^2 / (2 sigma^2)). The scale sigma is estimated, already using the Bregman LogDet divergence, from the distances of pairs of kernel points; all databases in our experiments have a minimum in sigma. The kernel matrix has at most rank n, the number of points. A reduced-rank r < n kernel K retaining 99 percent of the original kernel's energy is used in the rest of the computation.
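As an illustration, a minimal sketch of building the Gaussian kernel matrix and truncating it at 99 percent of the spectral energy (energy is taken here as the sum of eigenvalues; the paper's exact criterion may differ):

```python
import numpy as np

def gaussian_kernel_matrix(X, sigma):
    """K_ij = exp(-||x_i - x_j||^2 / (2 sigma^2)) for a d x n data matrix X."""
    sq = np.sum(X**2, axis=0)
    d2 = sq[:, None] + sq[None, :] - 2.0 * (X.T @ X)
    return np.exp(-np.maximum(d2, 0.0) / (2.0 * sigma**2))

def low_rank_kernel(K, energy=0.99):
    """Truncate the eigendecomposition of K at the smallest rank r
    that retains the requested fraction of the total spectral energy."""
    w, V = np.linalg.eigh(K)            # ascending eigenvalues
    w, V = w[::-1], V[:, ::-1]          # reorder to descending
    r = int(np.searchsorted(np.cumsum(w) / w.sum(), energy)) + 1
    K_r = (V[:, :r] * w[:r]) @ V[:, :r].T
    return K_r, r
```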
Two examples:
-- Olympic circles [5 classes]: n = 1500, r = 58; x300 faster (reduced rank vs. full rank).
-- USPS digits [10 classes]: n = 1000, r = 499; x2 faster (reduced rank vs. full rank).
Linear Transformation in the Kernel Space. M and C are the sets of must-link and cannot-link constraint pairs. A symmetric d x d matrix is built from a single constraint pair alone; the value of the scaling s is computed from the constraint pair, and the updated kernel matrix is obtained. The case s = 1 is the projection onto the null space of the constraint direction. Some problems appear... see also the paper below.
O. Tuzel, F. Porikli, P. Meer: Kernel methods for weakly supervised mean shift clustering. ICCV, 48-55, 2009.
Projecting onto the null space. The dimensionality of the kernel space is reduced by one after each constraint is applied. Null space projection cannot handle cannot-link constraints and is sensitive to labeling errors.
An example of null space projection: 5 x 150 points. Four labeled points per class, taken pairwise, give 6 x 5 = 30 must-link constraints; plus one mislabeled constraint. A sketch of the kernel-matrix update follows.
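The projection can be applied directly to the kernel matrix; the closed form below follows from projecting the implicit feature points orthogonally to the direction phi(x_i) - phi(x_j) of a must-link pair (i, j):

```python
import numpy as np

def null_space_projection(K, i, j, eps=1e-12):
    """Collapse the must-link pair (i, j): with z = e_i - e_j the
    projected kernel is K' = K - (K z)(K z)^T / (z^T K z).
    The rank of the kernel matrix drops by one per constraint."""
    Kz = K[:, i] - K[:, j]                    # K z
    zKz = K[i, i] + K[j, j] - 2.0 * K[i, j]   # z^T K z
    if zKz < eps:                             # pair already coincides
        return K
    return K - np.outer(Kz, Kz) / zKz
```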
Kernel Learning by Minimizing the LogDet Divergence. The kernel learning takes one constraint pair at a time: must-link or cannot-link, with a soft margin. At each iteration the slack variable is updated. For each constraint, the optimization is solved by Bregman projection based updates; the updates are repeated until convergence to the global minimum.
P. Jain, et al.: Metric and kernel learning using a linear transformation. JMLR, 13:519-547, 2012.
B. Kulis, et al.: Low-rank kernel learning with Bregman matrix divergences. JMLR, 10:341-376, 2009.
The initial kernel matrix K has rank r <= n; the update can then be rewritten in terms of a rank-r factorization. The scalar variable is computed in each iteration from the kernel matrix and the constraint pair, and the factor is updated using the Cholesky decomposition.
P. Jain, et al.: Metric and kernel learning using a linear transformation. JMLR, 13:519-547, 2012.
Low-rank kernel learning algorithm.
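A minimal sketch of the update loop, written in the style of the kernelized ITML updates of Jain et al. (full n x n kernel with dense updates; the paper's low-rank version instead updates a Cholesky factor of the rank-r kernel). The targets u, l, the trade-off gamma, and the sweep schedule are illustrative assumptions:

```python
import numpy as np

def learn_kernel_logdet(K0, must, cannot, u, l, gamma=1.0,
                        max_sweeps=50, tol=1e-5):
    """Bregman projection updates for min D_ld(K, K0) subject to
    soft must-link (distance <= u) and cannot-link (distance >= l)
    constraints, taken one constraint pair at a time."""
    K = K0.copy()
    cons = [(i, j, +1.0) for i, j in must] + [(i, j, -1.0) for i, j in cannot]
    xi = np.array([u] * len(must) + [l] * len(cannot), dtype=float)
    lam = np.zeros(len(cons))                       # dual variables
    for _ in range(max_sweeps):
        change = 0.0
        for c, (i, j, delta) in enumerate(cons):
            p = K[i, i] + K[j, j] - 2.0 * K[i, j]   # pair distance under K
            if p < 1e-12:
                continue
            alpha = min(lam[c], delta / 2.0 * (1.0 / p - gamma / xi[c]))
            beta = delta * alpha / (1.0 - delta * alpha * p)
            xi[c] = gamma * xi[c] / (gamma + delta * alpha * xi[c])
            lam[c] -= alpha
            Kz = K[:, i] - K[:, j]                  # K (e_i - e_j)
            K += beta * np.outer(Kz, Kz)            # rank-one Bregman update
            change = max(change, abs(alpha))
        if change < tol:                            # converged sweep
            break
    return K
```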
Kernel Learning Using the LogDet Divergence. For very large datasets it is infeasible to learn the entire kernel matrix and to store it in memory. The learned kernel generalizes to out-of-sample points, where x or y or both are out of sample; distances can then be computed using the learned kernel function.
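A sketch of the out-of-sample rule derived by Jain et al. (JMLR 2012): the learned kernel function is k~(x, y) = k(x, y) + k_x^T S k_y, with S = K0^-1 (K - K0) K0^-1 and k_x = (k(x, x_1), ..., k(x, x_n))^T. The pseudo-inverse below is an assumption guarding the rank-deficient case:

```python
import numpy as np

def transfer_matrix(K0, K):
    """S = K0^+ (K - K0) K0^+, computed once after learning."""
    P = np.linalg.pinv(K0)
    return P @ (K - K0) @ P

def learned_kernel(kxy, kx, ky, S):
    """k~(x, y) = k(x, y) + k_x^T S k_y for out-of-sample x, y;
    kx, ky hold the initial kernel values against the n training points."""
    return kxy + kx @ S @ ky

# Learned squared feature-space distance between out-of-sample points:
# d~^2(x, y) = k~(x, x) + k~(y, y) - 2 k~(x, y).
```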
Kernel Mean Shift Clustering. The nonparametric kernel mean shift iteratively finds, for each point, the closest mode (maximum) of the p.d.f. estimate. With kernel profile k(x), the weights use g(u) = -k'(u); each iteration moves to a new weighted mean, which is not an input point. The dimension of the mean shift computation is at most 25, which is enough; mean shift can run in an even lower dimension when the optimal kernel has rank less than 25.
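A sketch of kernel mean shift run on the explicit low-rank embedding of the learned kernel matrix (Gaussian profile and per-point bandwidths are assumptions consistent with the slides, not the paper's exact implementation):

```python
import numpy as np

def kernel_mean_shift(K, h, n_iter=100, tol=1e-7):
    """Mean shift in the kernel space via the embedding Y = sqrt(w) V^T:
    each column of Y is an embedded point; every point climbs to the
    closest mode of the density estimate with per-point bandwidths h."""
    w, V = np.linalg.eigh(K)
    keep = w > 1e-10 * w.max()                   # drop null directions
    Y = (V[:, keep] * np.sqrt(w[keep])).T        # r x n embedded points
    modes = Y.copy()
    for _ in range(n_iter):
        d2 = (np.sum(modes**2, axis=0)[:, None]
              + np.sum(Y**2, axis=0)[None, :]
              - 2.0 * modes.T @ Y)               # squared distances
        W = np.exp(-d2 / (2.0 * h[None, :]**2))  # Gaussian g(u) weights
        new = (Y @ W.T) / W.sum(axis=1)          # weighted means
        if np.max(np.abs(new - modes)) < tol:
            return new
        modes = new
    return modes  # points converging to the same mode form one cluster
```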
Variable Bandwidth Parameter Selection. Only the must-link constraint pairs participate. The entire algorithm is run with different values of k to validate the choice; only this database is needed for the check. Five labeled points per class, taken pairwise, give 10 must-link pairs per class: Olympic circles 5 x 10 = 50; USPS digits 10 x 10 = 100.
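A sketch of one common adaptive rule consistent with the slide: each point's bandwidth is its feature-space distance to the k-th nearest neighbor, computed from the learned kernel matrix; the best k is then validated by running the entire algorithm for several values:

```python
import numpy as np

def knn_bandwidths(K, k):
    """h_i = feature-space distance from point i to its k-th nearest
    neighbor; distances come from d2_ij = K_ii + K_jj - 2 K_ij."""
    diag = np.diag(K)
    d2 = diag[:, None] + diag[None, :] - 2.0 * K
    d2.sort(axis=1)                        # column 0 is the point itself
    return np.sqrt(np.maximum(d2[:, k], 0.0))
```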
CONTINGENCY TABLE (T true, F false, P positive, N negative)

                  Ground-truth
  Result       Not true    True
  Not true     TN          FN
  True         FP          TP

DEFINITIONS
  precision = TP / (TP + FP)    [fraction of returned results that are correct]
  recall    = TP / (TP + FN)    [fraction of the true samples returned]
  quality   = TP / (TP + FP + FN)
  F-measure = 2 * precision * recall / (precision + recall)

ROC curve: ordinate is the true positive rate (recall) TP / (TP + FN); abscissa is the false alarm rate FP / (FP + TN).

A threshold decides whether a result is returned as true or not true. Changing the threshold rearranges all four boxes of the table as a function of the ground truth. At one extreme, just one sample is returned as true, so TP = 1: precision is one and recall is close to zero. At the other extreme the threshold returns all samples as true, so TP is the number of true samples: precision falls to the fraction of true samples in the data (close to zero when they are rare) and recall is one.
A simple example. The ground truth has 15 items of A, labeled not true, and 5 items of B, labeled true. A threshold returning a single true value gives the contingency table

                  Ground-truth
  Result       Not true    True
  Not true     15          4
  True         0           1

and precision = 1 while recall = 0.2. A threshold returning an intermediate result gives

                  Ground-truth
  Result       Not true    True
  Not true     9           2
  True         6           3

and precision = 0.33 while recall = 0.6. A threshold returning all the results as true gives

                  Ground-truth
  Result       Not true    True
  Not true     0           0
  True         15          5

and precision = 0.25 while recall = 1.
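The three thresholds of the example, checked in a few lines:

```python
def scores(tp, fp, fn, tn):
    """Precision, recall, and F-measure from contingency table counts."""
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    f = 2 * precision * recall / (precision + recall)
    return precision, recall, f

print(scores(tp=1, fp=0, fn=4, tn=15))   # (1.0, 0.2, 0.33)
print(scores(tp=3, fp=6, fn=2, tn=9))    # (0.33, 0.6, 0.43)
print(scores(tp=5, fp=15, fn=0, tn=0))   # (0.25, 1.0, 0.4)
```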
Adjusted Rand Index. A scalar measure to evaluate clustering performance from the clustering output: ARI = (index - expected index) / (maximum index - expected index), computed over pairs of points counted as TP true positive, TN true negative, FP false positive, FN false negative. It is negative when the expected index is larger than the obtained one. The adjustment compensates for chance: randomly assigned cluster labels get a score near zero.
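A quick illustration of the chance correction with scikit-learn's implementation (the data here are arbitrary):

```python
import numpy as np
from sklearn.metrics import adjusted_rand_score

truth = np.repeat(np.arange(5), 30)               # 5 classes, 30 points each
print(adjusted_rand_score(truth, truth))          # 1.0: perfect clustering
rng = np.random.default_rng(0)
random_labels = rng.integers(0, 5, truth.size)
print(adjusted_rand_score(truth, random_labels))  # ~0: chance level
```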
Trade-off Parameter. Measured with the Adjusted Rand (AR) index for the entire algorithm. Pairwise constraints from 15 labeled points per class, taken pairwise: C(15,2) = 105 pairs per class. Learning with half the data, testing with the other half: two-fold cross-validation. Must-link constraints: Olympic circles 5 x 105 = 525; USPS digits 10 x 105 = 1050.
A few examples. The algorithm generalizes to out-of-sample points that were not taken into consideration in the kernel matrix optimization.
Comparison with the Semi-Supervised Kernel Mean Shift Clustering (SKMS):
1. constrained spectral clustering (E2CP);
2. semi-supervised kernel k-means (SSKK); for these two, sigma is also given as input.
Applied after the LogDet Bregman divergence learning:
3. kernel k-means (Kkm);
4. kernel spectral clustering (KSC).
All four methods need the number of clusters as input.
1. Z. Lu, H.H.S. Ip: Constrained spectral clustering via exhaustive and efficient constraint propagation. ECCV, 1-14, 2010.
2. B. Kulis, S. Basu, I.S. Dhillon, R.J. Mooney: Semi-supervised graph clustering: a kernel approach. Machine Learning, 74:1-22, 2009.
Experimental Evaluation
Two synthetic data sets:
-- Olympic circles (5 classes);
-- concentric circles (10 classes);
nonlinearly separable, can have intersecting boundaries.
Four real data sets:
-- small number of classes: USPS digits (10 classes), MIT Scene (8 classes);
-- large number of classes: PIE faces (68 classes), Caltech objects (50 classes).
50 independent runs; average parameter values.
50 runs, most frequent parameters. Synthetic Example 1: Olympic Circles. 5 x 300 points along the five circles; 25 points per class selected at random. [Figure: original sample and clustering result.] Experiment 1: varied the number of labeled points [5, 7, 10, 12, 15, 17, 20, 25] from each class to generate pairwise constraints. Maximum number of labeled points per class: 8.33%.
50 runs, most frequent parameters. Synthetic Example 2: Concentric Circles. 100 points along each of the ten concentric circles. Experiment 1: varied the number of labeled points [5, 7, 10, 12, 15, 17, 20, 25] from each class to generate pairwise constraints. Experiment 2: 25 labeled points per class; labeling errors are introduced by swapping similarity pairs with dissimilarity pairs. Maximum number of labeled points per class: 25%.
50 runs, most frequent parameters. Real Example 1: USPS Digits. Ten classes with 1100 points per class, a total of 11000 points; 100 points per class give the 1000 x 1000 kernel matrix K. Varied the number of labeled points [5, 7, 10, 12, 15, 17, 20, 25] from each class to generate pairwise constraints. All 11000 data points are clustered by generalizing to the remaining 10000 points: ARI = 0.7529 +/- 0.051 over the 11000 x 11000 pairwise distance matrix. Maximum number of labeled points per class: 2.27%.
50 runs, most frequent parameters. Real Example 2: MIT Scene. Eight classes with 2688 points in total; the number of samples per class ranges between 260 and 410. 100 points per class give the 800 x 800 kernel matrix K. Varied the number of labeled points [5, 7, 10, 12, 15, 17, 20] from each class to generate pairwise constraints. All 2688 data points are clustered by generalizing to the remaining 1888 points. Maximum number of labeled points per class: 7.44%.
Is it really worth doing it per image? See three images misclassified from one class.
50 runs, most frequent parameters. Real Example 3: PIE Faces. 68 subjects with 21 samples per subject; the full initial kernel matrix K is 1428 x 1428. Varied the number of labeled points [3, 4, 5, 6, 7] from each class to generate pairwise constraints. Perfect clustering is obtained for more than 5 labeled points per class. Maximum number of labeled points per class: 33%.
50 runs, most frequent parameters. Real Example 4: Caltech 101 (subset). 50 categories with the number of samples per class ranging between 31 and 40; the full kernel matrix K is 1959 x 1959. Varied the number of labeled points [5, 7, 10, 12, 15] from each class to generate pairwise constraints. Maximum number of labeled points per class: 38%.
Conclusion
-- recovers arbitrarily shaped nonparametric clusters;
-- performs well on databases with a large number of clusters;
-- stays within the conventions of clustering: generalizes to out-of-sample points, and the number of clusters is not an input parameter.