Feature Selection for Image Retrieval and Object Recognition


Feature Selection for Image Retrieval and Object Recognition. Nuno Vasconcelos et al., Statistical Visual Computing Lab, ECE, UCSD. Presented by Dashan Gao.

Scalable Discriminant Feature Selection for Image Retrieval and Recognition, N. Vasconcelos and M. Vasconcelos, to appear in IEEE CVPR 2004. Feature Selection by Maximum Marginal Diversity: Optimality and Implications for Visual Recognition, N. Vasconcelos, Proceedings of IEEE CVPR, 2003. Feature Selection by Maximum Marginal Diversity, N. Vasconcelos, Proceedings of Neural Information Processing Systems, 2002.

Overview (1) Image retrieval is a large-scale classification problem: a large number of classes and large amounts of data per class. A discriminant feature space of small dimensionality is a prerequisite for success. Feature selection (FS) makes learning easier and tractable in a lower-dimensional feature space X. Goal: find a transformation T, constrained to be a subset projection; that is, find the projection matrix T that optimizes a criterion for feature goodness.
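
As a concrete illustration (a minimal numpy sketch with hypothetical names, not code from the paper), a subset projection T is just a 0/1 matrix whose rows are the standard basis vectors of the retained feature indices:

```python
import numpy as np

def subset_projection(selected, n):
    """Build the k x n projection matrix T whose rows are the standard
    basis vectors of the selected feature indices."""
    T = np.zeros((len(selected), n))
    T[np.arange(len(selected)), selected] = 1.0
    return T

# Example: keep features 0, 3, and 5 of an 8-dimensional feature space.
X = np.random.randn(100, 8)          # 100 samples, 8 original features
T = subset_projection([0, 3, 5], 8)
Z = X @ T.T                          # shape (100, 3); same as X[:, [0, 3, 5]]
```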

Overview (2) Weaknesses of traditional methods: they are based on sub-optimal criteria, such as variance maximization (principal component analysis, PCA); they lack scalability, taking infeasible time to compute; and they are difficult to extend to multi-class problems (boosting). Ultimate goal: minimize probability of error (MPE), i.e., search for the Bayes-error-optimal space of a given classification problem. Achievable goal (discriminant sense): maximize the separation between the different classes to recognize.

Information-theoretic feature selection (ITFS) Infomax goal: maximize the mutual information between the selected features and the class labels. Outline: optimality properties in the MPE and discriminant senses (Contribution 1); the trade-off between optimality and complexity (Contribution 2); an algorithmic implementation with low complexity.

Bayes Error (BE) Advantage: BE depends only on the feature space and is thus the ultimate discriminant measure for FS. Disadvantage: the nonlinearity of the max(.) operation makes it hard to optimize directly.
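
For reference, the standard expression for the Bayes error of a classification problem with feature vector X and class label Y, the quantity whose max(.) nonlinearity the slide refers to (notation assumed, not copied from the slide):

```latex
L^{*}(X) \;=\; 1 - \mathbb{E}_{X}\!\left[\,\max_{i}\, P_{Y|X}(i \mid \mathbf{x})\,\right]
```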

Infomax principle (Diagram: entropies H(X), H(X|Y), H(Y), H(Y|X) and their overlap I(Y;X).) H(.) is entropy; H(Y|X) is the conditional entropy (class posterior entropy, CPE). Maximizing I(Y;X) is equivalent to minimizing H(Y|X).
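
A minimal histogram-based sketch (hypothetical function and variable names) of the quantities on this slide: it estimates I(Y;X) as H(Y) - H(Y|X) for a single discretized feature, which makes explicit why maximizing I(Y;X) and minimizing the class posterior entropy H(Y|X) are the same problem.

```python
import numpy as np

def entropy(p):
    """Shannon entropy (in bits) of a probability vector, ignoring zero bins."""
    p = p[p > 0]
    return -np.sum(p * np.log2(p))

def mutual_information(x_binned, y, n_bins, n_classes):
    """Histogram estimate of I(Y;X) = H(Y) - H(Y|X) for a discretized feature."""
    joint = np.zeros((n_classes, n_bins))
    for xi, yi in zip(x_binned, y):
        joint[yi, xi] += 1
    joint /= joint.sum()
    p_y = joint.sum(axis=1)            # marginal P(Y)
    p_x = joint.sum(axis=0)            # marginal P(X)
    h_y = entropy(p_y)
    # H(Y|X) = sum_x P(x) * H(Y | X = x)
    h_y_given_x = sum(p_x[b] * entropy(joint[:, b] / p_x[b])
                      for b in range(n_bins) if p_x[b] > 0)
    return h_y - h_y_given_x           # equals I(Y;X)
```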

Infomax example: 2 classes (M = 2), 2 features x1 and x2. (Figure: the two class distributions plotted over x1 and x2.) Note: variance-based criteria (e.g., PCA) fail in this case!

Infomax vs. BE. To show: Bayes error >= Infomax bound.

Example (Plot of the two sides of the bound.) Important observation: the gradients of the two curves have the same signs everywhere they are defined, and the extrema of both sides are co-located, so the LHS and RHS have the same optimization solution.

Infomax vs. BE: Bayes error >= Infomax.

Example: M = 2. (Plot: BE and CPE, i.e. H(Y|X), as functions of the class mean µ.) Infomax is optimal in the MPE sense! Infomax is a good approximation of BE, and the infomax solutions will be very similar to those of BE.

Discriminant form of infomax Theorem 3: the infomax goal is equivalent to the goal of maximizing the separation between the different classes.

Feature Selection (FS) Forward sequential search for FS (sketched below): at each step, a set of features is added to the current best subset so as to optimize a cost function. Denoting the current subset, the added features, and the resulting new subset, one can show that maximizing mutual information (infomax) is simpler than minimizing BE.
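
A sketch of the forward sequential search loop described above (assumed helper names; `score` stands for whichever cost, infomax or BE, is being optimized):

```python
def forward_selection(score, n_features, k):
    """Greedy forward search: at each step add the feature that maximizes
    score(current_subset + [f]); returns the list of selected indices."""
    selected = []
    for _ in range(k):
        remaining = [f for f in range(n_features) if f not in selected]
        best = max(remaining, key=lambda f: score(selected + [f]))
        selected.append(best)
    return selected
```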

Proof.

Feature Selection (cont'd) The infomax cost favors discriminant features and penalizes features that are redundant with previously selected ones, unless the redundancy provides information about Y: a trade-off between maximizing discriminant power and minimizing redundancy. Problem: infomax requires high-dimensional density estimates, so we must find a trade-off between optimality and complexity.

Maximum Marginal Diversity (MMD) Marginal diversity and MMD-based FS, a naïve infomax: select the subset of features that leads to the set of maximally diverse marginal densities. Optimality condition (Lemma): MMD is optimal if the average mutual information between features is not affected by knowledge of the class label.
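
A sketch of MMD-based ranking, written here as the class-averaged KL divergence between each feature's class-conditional marginal and its overall marginal (which equals the marginal mutual information I(X_k;Y)), estimated with histograms; names and details are assumptions, simplified from the paper:

```python
import numpy as np

def marginal_diversity(X_binned, y, n_bins, n_classes):
    """md(X_k) = sum_i P(Y=i) * KL( P(X_k | Y=i) || P(X_k) ), one value per feature."""
    n, d = X_binned.shape
    md = np.zeros(d)
    for k in range(d):
        # class-conditional histograms of feature k
        cond = np.zeros((n_classes, n_bins))
        for xi, yi in zip(X_binned[:, k], y):
            cond[yi, xi] += 1
        p_y = cond.sum(axis=1) / n
        cond = cond / cond.sum(axis=1, keepdims=True)
        marg = (p_y[:, None] * cond).sum(axis=0)        # overall marginal of X_k
        for i in range(n_classes):
            nz = cond[i] > 0
            md[k] += p_y[i] * np.sum(cond[i, nz] * np.log2(cond[i, nz] / marg[nz]))
    return md

# MMD-based FS: keep the k features with the largest marginal diversity, e.g.
# selected = np.argsort(-marginal_diversity(X_binned, y, 16, 2))[:k]
```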

The Naïve Bayes classifier. Assumption: the features are conditionally independent given the class label. However, the optimality condition for MMD does not hold under this assumption, so features selected by MMD are not necessarily good for the Naïve Bayes classifier!

MMD (continued) Advantage: computation is simple, since only the marginal distribution of each feature is considered. Disadvantage: in practice it is hard to verify whether the optimality condition holds, and there is no guarantee of optimality when it does not. Fortunately, recent studies show that, for image recognition problems, MMD is very close to the optimal solution for biologically plausible features, e.g., wavelet coefficients.

Image statistics Feature dependencies tend to be localized across both space and image scale. E.g., for a standard wavelet decomposition, co-located coefficients of equal orientation can be arbitrarily dependent on the class, but the average dependence between such sets of coefficients does not depend on the image class (strong vertical frequency => weak horizontal frequency). This property is captured by a more general notion than MMD, l-decomposability: the feature set is decomposable into mutually exclusive subsets of l-th order; features within a subset can be arbitrarily dependent, with no constraints, while the dependence between subsets does not depend on the image class.

More general case All the features are grouped into a collection of disjoint subsets. The features within each subset are allowed to have arbitrary dependencies, while the dependencies between subsets are constrained to be non-informative about the class.

l-decomposability leads to a family of FS algorithms.

A family of FS algorithms (cont'd) Theorem: the optimal infomax FS solution requires only low-dimensional density estimates, with the dimension determined by l.

A family of FS algorithms (cont'd) The parameter l trades off optimality and complexity. l = 0 is the MMD case: all features are assumed to depend in a non-informative way; sub-optimal but computationally efficient. l = n: all features may depend in informative ways; optimal but computationally unscalable.

Infomax-based FS algorithm.

Algorithm complexity Suppose there are C classes, F feature vectors per class, and histograms with b bins along each axis.
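
A general counting fact (not the slide's formula) that explains why the dimension of the jointly modeled feature groups dominates the cost: a joint histogram over d features with b bins per axis has

```latex
\text{cells per joint histogram of } d \text{ features} \;=\; b^{\,d}
```

so the density estimates grow exponentially with the dimension of the subsets that must be modeled jointly, which is what the parameter l controls.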

Experiments on MMD (1) A simple example in which the optimal feature subsets are known: two Gaussian classes with identity covariances and known means, n = 20. The average feature-selection quality is compared with Jain & Zongker's results (Mahalanobis distance criterion). (Plot: average quality vs. number of training samples for branch and bound, SFS, and MMD; MMD does better.) In this example the optimality condition of MMD is satisfied. Feature-selection quality: the ratio between the number of correctly selected features and n.
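
A rough reconstruction of a synthetic setup in the same spirit (not the exact experimental protocol; all parameters are assumptions, and the ranking reuses the `marginal_diversity` sketch above): two Gaussian classes with identity covariance whose means differ only on the first n coordinates, with selection quality measured as the fraction of those truly informative features recovered.

```python
import numpy as np

rng = np.random.default_rng(0)
n_informative, n_noise, n_samples = 20, 20, 200
d = n_informative + n_noise

# Class means differ only on the first n_informative coordinates.
mu = np.zeros(d)
mu[:n_informative] = np.linspace(0.5, 2.0, n_informative)

X0 = rng.standard_normal((n_samples, d))            # class 0: mean 0
X1 = rng.standard_normal((n_samples, d)) + mu       # class 1: mean mu
X = np.vstack([X0, X1])
y = np.array([0] * n_samples + [1] * n_samples)

# Discretize into 16 bins, rank features by marginal diversity (sketch above).
edges = np.linspace(X.min(), X.max(), 17)
X_binned = np.clip(np.digitize(X, edges) - 1, 0, 15)
md = marginal_diversity(X_binned, y, 16, 2)
selected = set(np.argsort(-md)[:n_informative])
quality = len(selected & set(range(n_informative))) / n_informative
print(f"selection quality: {quality:.2f}")
```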

Experiments on MMD (2) Texture classification on the Brodatz database: 112 texture classes, a 64-dimensional (8x8) feature space, and classifiers based on Gaussian mixtures. (Plots: classification accuracy and cumulative marginal diversity vs. number of features.)

Experiments on MMD (3) Image retrieval on the Brodatz texture database. (Plots: PRA and MD vs. number of features.) PRA: area under the precision/recall curve.

Experiments on MMD (4) Features as filters: projections of the textures onto the five most informative basis functions act as detectors of lines, corners, T-junctions, and so forth.

Experiment on infomax (1) Image retrieval on the Corel image database (15 classes, 1500 images), for different sizes l of the feature clusters. Main observations: ITFS can significantly outperform variance-based methods (10 vs. 30 features for equivalent PRA), and for ITFS there is no noticeable gain for l > 1! (Plot: PRA vs. number of features for l = 0, 1, 2 and the variance criterion.)

Experiment on infomax (2) Different numbers of histogram bins. Main observations: infomax-based FS is quite insensitive to the quality of the density estimates (no noticeable variation above 8 bins per axis, small degradation for 4), and it is always significantly better than the variance criterion. (Plot: PRA vs. number of features.)

Experiment on infomax (3) Image retrieval results on Corel.

Conclusion Infomax-based feature selection is optimal in the MPE sense. The work provides an explicit understanding of the trade-off between optimality and complexity and of the corresponding optimality condition implied by infomax (the most important contribution), together with a scalable infomax-based FS algorithm for image retrieval and recognition. Future work: evaluation of the optimality and efficiency of this infomax-based algorithm on other features (such as the rectangular features in the Viola & Jones face detector) and other classification problems.