Feature Selection for Image Retrieval and Object Recognition
Nuno Vasconcelos et al.
Statistical Visual Computing Lab, ECE, UCSD
Presented by Dashan Gao
References:
- Scalable Discriminant Feature Selection for Image Retrieval and Recognition. N. Vasconcelos and M. Vasconcelos. To appear in Proceedings of IEEE CVPR, 2004.
- Feature Selection by Maximum Marginal Diversity: Optimality and Implications for Visual Recognition. N. Vasconcelos. Proceedings of IEEE CVPR, 2003.
- Feature Selection by Maximum Marginal Diversity. N. Vasconcelos. Proceedings of Neural Information Processing Systems (NIPS), 2002.
Overview (1)
Image retrieval is a large-scale classification problem: a large number of classes, and large amounts of data per class.
A discriminant feature space of small dimensionality is a prerequisite for success. Feature selection (FS) makes learning easier and tractable in a lower-dimensional feature space X.
Goal: find a transformation T constrained to be a subset projection (each row of T is a row of the identity matrix), i.e. find the projection matrix T that optimizes a criterion for feature goodness.
Overview (2)
Weaknesses of traditional methods:
- Based on sub-optimal criteria, e.g. variance maximization (principal component analysis, PCA).
- Lack of scalability: they take infeasible time to compute.
- Difficult to extend to multi-class problems (e.g. boosting).
Ultimate goal: minimum probability of error (MPE), i.e. search for the Bayes-error-optimal feature space of a given classification problem.
Achievable goal (discriminant sense): maximize the separation between the different classes to be recognized.
Information-theoretic feature selection (ITFS)
Infomax goal: maximize the mutual information I(Y;X) between the selected features X and the class label Y.
Outline:
- Optimality properties, in the MPE and discriminant senses (Contribution 1)
- Trade-off between optimality and complexity (Contribution 2)
- Algorithmic implementation with low complexity
Bayes Error (BE)
The Bayes error of a feature space X with class label Y is
L* = 1 - E_x[ max_i P_{Y|X}(i|x) ].
Advantage: BE depends only on the feature space, and is thus the ultimate discriminant measure for FS.
Disadvantage: the nonlinearity of the max(.) operation makes it hard to optimize directly.
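As a quick numeric illustration (our numbers, not from the slides): at a point x of a three-class problem where the posteriors are
P_{Y|X}(.|x) = (0.7, 0.2, 0.1),
the Bayes rule picks class 1 and errs with probability 1 - 0.7 = 0.3; averaging this quantity over x gives L*. It is the inner max(.) that makes the criterion piecewise and awkward to optimize.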
Infomax principle
[Figure: Venn diagram of the entropies H(X), H(Y), H(X|Y), and H(Y|X).]
H(.) is the entropy; H(Y|X) is the conditional entropy of the label given the features (class posterior entropy, CPE). Since
I(Y;X) = H(Y) - H(Y|X),
maximizing I(Y;X) is equivalent to minimizing H(Y|X).
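To make the identity concrete, here is a minimal Python sketch (ours, not from the paper; the function name and interface are our own) that estimates I(Y;X) for a single binned feature from empirical histograms:

```python
# Minimal sketch: estimate I(Y;X) = H(Y) - H(Y|X) in bits for one
# discrete (binned) feature X and class label Y, from paired samples.
import numpy as np

def mutual_information(x_bins, y_labels, n_bins, n_classes):
    """Return I(Y;X) in bits, estimated from an empirical joint histogram."""
    joint = np.zeros((n_classes, n_bins))
    for y, x in zip(y_labels, x_bins):
        joint[y, x] += 1
    joint /= joint.sum()                      # P(Y, X)
    p_y = joint.sum(axis=1)                   # P(Y)
    p_x = joint.sum(axis=0)                   # P(X)
    h_y = -np.sum(p_y[p_y > 0] * np.log2(p_y[p_y > 0]))        # H(Y)
    ys, xs = np.nonzero(joint)
    # H(Y|X) = -sum_{y,x} P(y,x) log2 P(y|x), with P(y|x) = P(y,x)/P(x)
    h_y_given_x = -np.sum(joint[ys, xs] * np.log2(joint[ys, xs] / p_x[xs]))
    return h_y - h_y_given_x                  # I(Y;X) = H(Y) - H(Y|X)

# A feature that perfectly separates two equiprobable classes carries
# I(Y;X) = H(Y) = 1 bit:
y = np.array([0, 0, 1, 1]); x = np.array([0, 0, 1, 1])
print(mutual_information(x, y, n_bins=2, n_classes=2))   # -> 1.0
```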
Infomax example
Two classes (M = 2), two features x1 and x2.
[Figure: two-class scatter plots and their marginal projections onto x1 and x2; the feature with the higher variance is not the one that separates the classes.]
Note: variance-based criteria (e.g. PCA) fail in this case!
Infomax vs. BE
To show: the Bayes error is lower bounded by an increasing function of H(Y|X). By Fano's inequality,
H(Y|X) <= h_b(L*) + L* log(M - 1),
where h_b(.) is the binary entropy function. A small Bayes error L* therefore requires a small class posterior entropy, i.e. a large I(Y;X).
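For instance (our arithmetic): in a two-class problem (M = 2) the log(M - 1) term vanishes and the bound reduces to H(Y|X) <= h_b(L*). A feature space with H(Y|X) > h_b(0.1) ≈ 0.47 bits therefore cannot achieve a Bayes error of 0.1 or less, which is why minimizing H(Y|X) is a sensible surrogate for minimizing L*.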
Example
[Figure: plot comparing the LHS and RHS of the bound.]
Important observations:
- The gradients of the two curves have the same signs everywhere they are defined.
- The extrema of both sides are co-located.
- Hence the LHS and RHS have the same optimization solution.
Infomax vs. BE
The bound shows that the Bayes error dominates an increasing function of H(Y|X): only feature spaces with low class posterior entropy (high mutual information with the label) can achieve a low Bayes error.
Example: M = 2
[Figure: BE and CPE (H(Y|X)) as functions of the class mean µ.]
Infomax is optimal in the MPE sense for this example, and is in general a good approximation of BE: the infomax solutions will be very similar to those of BE.
Discriminant form of infomax
Noting that the mutual information can be written as an average divergence between the class-conditional densities and the overall density,
I(X;Y) = sum_i P_Y(i) KL( P_{X|Y}(x|i) || P_X(x) ),
Theorem 3: the infomax goal is equivalent to maximizing the separation between the different classes.
Feature Selection (FS)
Forward sequential search for FS: features are added to the current best subset one step at a time, with the goal of optimizing a cost function.
Denote the current subset by X_{1:k-1}, the added feature by X_k, and the new subset by X_{1:k} = {X_k, X_{1:k-1}}. We can prove (see the proof on the next slide) that
I(X_k; Y | X_{1:k-1}) = I(X_k; Y) - I(X_k; X_{1:k-1}) + I(X_k; X_{1:k-1} | Y).
Maximizing mutual information (infomax) is simpler than minimizing BE.
Proof:
I(X_k; Y | X_{1:k-1}) = H(X_k | X_{1:k-1}) - H(X_k | Y, X_{1:k-1})
 = [H(X_k) - I(X_k; X_{1:k-1})] - [H(X_k | Y) - I(X_k; X_{1:k-1} | Y)]
 = [H(X_k) - H(X_k | Y)] - I(X_k; X_{1:k-1}) + I(X_k; X_{1:k-1} | Y)
 = I(X_k; Y) - I(X_k; X_{1:k-1}) + I(X_k; X_{1:k-1} | Y).
Feature Selection (cont'd)
In the decomposition above:
- I(X_k; Y) favors discriminant features;
- -I(X_k; X_{1:k-1}) penalizes features redundant with previously selected ones;
- +I(X_k; X_{1:k-1} | Y) cancels that penalty when the redundancy provides information about Y.
This is a trade-off between maximizing discriminant power and minimizing redundancy.
Problem: infomax requires high-dimensional density estimates. We therefore seek a trade-off between optimality and complexity.
Maximum Marginal Diversity (MMD)
The marginal diversity of feature X_k is
md(X_k) = sum_i P_Y(i) KL( P_{X_k|Y}(x|i) || P_{X_k}(x) ) = I(X_k; Y).
MMD-based FS, a "naive infomax": select the subset of features that leads to a set of maximally diverse marginal densities.
Optimality condition (Lemma): MMD coincides with infomax if the mutual information between features is not affected by knowledge of the class label, i.e. I(X_k; X_{1:k-1}) = I(X_k; X_{1:k-1} | Y) for all k.
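Since MMD needs only 1-D marginals, it is straightforward to implement. Below is a minimal Python sketch (our own helper, not the authors' code) that ranks features by marginal diversity using per-class histograms:

```python
# Minimal MMD sketch: rank features by marginal diversity
# md(X_k) = sum_c P(Y=c) * KL( P(X_k|Y=c) || P(X_k) ) = I(X_k; Y).
import numpy as np

def marginal_diversity(X, y, n_bins=8):
    """X: (n_samples, n_features) real-valued; y: (n_samples,) integer labels.
    Returns (feature indices sorted by decreasing md, the md values)."""
    n, d = X.shape
    classes = np.unique(y)
    md = np.zeros(d)
    for k in range(d):
        edges = np.histogram_bin_edges(X[:, k], bins=n_bins)
        p_x, _ = np.histogram(X[:, k], bins=edges)
        p_x = p_x / n                                   # marginal P(X_k)
        for c in classes:
            p_c = np.mean(y == c)                       # class prior P(Y=c)
            p_xc, _ = np.histogram(X[y == c, k], bins=edges)
            p_xc = p_xc / p_xc.sum()                    # P(X_k | Y=c)
            nz = (p_xc > 0) & (p_x > 0)
            # accumulate P(c) * KL( P(X_k|Y=c) || P(X_k) )
            md[k] += p_c * np.sum(p_xc[nz] * np.log2(p_xc[nz] / p_x[nz]))
    return np.argsort(-md), md
```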
The Naive Bayes Classifier
Assumption: features are only conditionally independent given the class label, i.e. I(X_k; X_{1:k-1} | Y) = 0.
However, the optimality condition for MMD does not hold under this assumption: since I(X_k; X_{1:k-1}) is in general nonzero while I(X_k; X_{1:k-1} | Y) = 0, redundancy is penalized by infomax but invisible to MMD.
Features selected by MMD are therefore not good for the naive Bayes classifier!
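A small worked example (ours, for illustration): let Y be a fair coin and X_1, X_2 independent noisy copies of it, X_i = Y XOR N_i with independent noise N_i ~ Bernoulli(eps). Then
I(X_1; X_2 | Y) = 0 (the naive Bayes assumption holds),
but I(X_1; X_2) > 0, because both features are correlated with Y. The optimality condition I(X_1; X_2) = I(X_1; X_2 | Y) fails: infomax penalizes the redundancy between X_1 and X_2, while MMD, which looks only at I(X_k; Y), does not see it.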
MMD (continued)
Advantage: computation is simple, since only the marginal distribution of each feature is needed.
Disadvantages: the optimality condition can hardly be verified in practice, and there is no guarantee of optimality when it does not hold.
Fortunately, recent studies show that, for image recognition problems with biologically plausible features (e.g. wavelet coefficients), MMD is very close to the optimal solution.
Image statistics
Feature dependencies tend to be localized across both space and image scale. E.g., for a standard wavelet decomposition:
- co-located coefficients of equal orientation can be arbitrarily dependent on the class;
- the average dependence between such sets of coefficients does not depend on the image class (e.g. strong vertical frequencies imply weak horizontal frequencies, regardless of class).
This property motivates a condition more general than the MMD one, l-decomposability: the feature set is decomposable into mutually exclusive subsets of l-th order features; within subsets the features may be arbitrarily dependent, with no constraints, while the dependence between subsets does not depend on the image class.
More general case
- All the features are grouped into a collection of disjoint subsets.
- The features within each subset are allowed to have arbitrary dependencies.
- The dependencies between the subsets are constrained to be non-informative about the class.
l-decomposability
A feature set is l-decomposable if it can be partitioned into mutually exclusive subsets of l-th order such that dependencies within a subset are arbitrary, while the dependence between subsets does not depend on the class. Varying l yields a family of FS algorithms.
A family of FS Algorithms (cont'd)
Theorem: if the feature set is l-decomposable, the optimal infomax FS solution only requires density estimates of dimension l + 1.
A family of FS Algorithms (cont'd)
The parameter l is a trade-off between optimality and complexity:
- l = 0: the MMD case; all features are assumed to depend in a non-informative way. Sub-optimal but computationally efficient.
- l = n: all features may depend in informative ways. Optimal but computationally unscalable.
Infomax-based FS Algorithm
Greedy forward selection under the l-decomposability assumption: at each step, add the feature that maximizes the low-dimensional infomax gain, combining marginal diversity with the l-wise redundancy corrections.
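Below is a minimal Python sketch of greedy forward selection under the l = 1 (pairwise) approximation. The inputs md, mi_xx, and mi_xx_given_y are assumed precomputed (e.g. with histogram estimators like the ones sketched earlier), and the max-aggregation of pairwise penalties is our simplification, not necessarily the exact combination rule derived in the paper:

```python
# Hedged sketch of greedy infomax feature selection with l = 1:
# score each candidate by its marginal relevance minus its worst-case
# non-informative redundancy with the already-selected features.
import numpy as np

def greedy_infomax(md, mi_xx, mi_xx_given_y, n_select):
    """md[k]             : I(X_k; Y) for each feature (marginal diversity)
       mi_xx[j, k]       : I(X_j; X_k)       (pairwise, class-independent)
       mi_xx_given_y[j,k]: I(X_j; X_k | Y)   (pairwise, class-conditional)
       Returns the indices of the selected features, in selection order."""
    n = len(md)
    selected = [int(np.argmax(md))]          # start with the most diverse feature
    while len(selected) < n_select:
        best, best_score = None, -np.inf
        for k in range(n):
            if k in selected:
                continue
            # penalize redundancy I(X_k; X_j) except where it is matched by
            # class-conditional dependence I(X_k; X_j | Y), i.e. informative
            penalty = max(mi_xx[j, k] - mi_xx_given_y[j, k] for j in selected)
            score = md[k] - penalty
            if score > best_score:
                best, best_score = k, score
        selected.append(best)
    return selected
```

Setting the penalty term to zero recovers plain MMD ranking, the l = 0 member of the family.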
Algorithm Complexity
Suppose there are C classes, F feature vectors per class, and histograms with b bins along each axis. The dominant cost is the estimation of the low-dimensional histograms required by the infomax costs.
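To make the scaling concrete (our arithmetic, not from the slides): with b = 8 bins and l = 1, each density estimate is a 2-D histogram with b^2 = 64 cells, estimated per class; a full joint estimate over the 64-dimensional texture features used later would instead require 8^64 ≈ 6 x 10^57 cells, which is hopeless. This is why the small-l algorithms remain scalable.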
Experiments on MMD (1)
A simple example where the optimal feature subsets are known: two Gaussian classes with identity covariances and different means, n = 20 features. The average feature-selection quality is compared with Jain & Zongker's results, which used the Mahalanobis distance as the selection criterion.
[Figure: average quality vs. number of training samples for branch and bound, SFS, and MMD; MMD does better.]
In this example the optimality condition of MMD is satisfied. Feature-selection quality: the ratio between the number of correctly selected features and n.
Experiments on MMD (2)
Brodatz texture-based classification: 112 texture classes, a 64-dimensional (8x8) feature space, classifiers based on Gaussian mixtures.
[Figure: classification accuracy and cumulative marginal diversity (MD) vs. number of features.]
Experiments on MMD (3)
Image retrieval on the Brodatz texture database.
[Figure: PRA and MD vs. number of features.]
PRA: area under the precision/recall curve.
Experiments on MMD (4)
Features as filters: projecting the textures onto the five most informative basis functions yields detectors of lines, corners, t-junctions, and so forth.
Experiments on infomax (1)
Image retrieval on the Corel image database (15 classes, 1500 images), for different sizes l of the feature clusters.
[Figure: PRA vs. number of features for l = 0, l = 1, l = 2, and the variance criterion.]
Main observations:
- ITFS can significantly outperform variance-based methods (10 vs. 30 features for equivalent PRA).
- For ITFS there is no noticeable gain for l > 1!
Experiments on infomax (2)
Different numbers of histogram bins.
[Figure: PRA vs. number of features for several bin counts.]
Main observations:
- Infomax-based FS is quite insensitive to the quality of the density estimates (no noticeable variation above 8 bins per axis, small degradation at 4).
- Always significantly better than variance.
Experiments on infomax (3)
[Figure: image retrieval results on the Corel database.]
Conclusion
- Infomax-based feature selection is optimal in the MPE sense.
- An explicit understanding of the trade-off between optimality and complexity, and of the corresponding optimality conditions implied by infomax (the most important contribution).
- A scalable infomax-based FS algorithm for image retrieval and recognition.
Future work: evaluation of the optimality and efficiency of this infomax-based algorithm on other features (such as the rectangular features of the Viola & Jones face detector) and on other classification problems.