TagProp: Discriminative Metric Learning in Nearest Neighbor Models for Image Annotation

TagProp: Discriminative Metric Learning in Nearest Neighbor Models for Image Annotation Matthieu Guillaumin, Thomas Mensink, Jakob Verbeek, Cordelia Schmid LEAR team, INRIA Rhône-Alpes, Grenoble, France

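TagProp
Goal: predict the relevance of keywords for images
[Figure: example image with the candidate keyword "jet"]

TagProp
Application 1: Image annotation
Propose a list of relevant keywords to assist a human annotator
[Figure: example image with the candidate keyword "jet"]

Corel Annotation - Examples

TagProp
Application 2: Keyword-based image search
Given one or more keywords, propose a list of relevant images
[Figure: example image with the candidate keyword "jet"]

Corel Retrieval - Examples

TagProp
Approach: generalize from a database of annotated images
[Figure: query image with the candidate keyword "jet", next to annotated database images with tags such as jet, smoke, sky, prop and clouds]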

Presentation Outline
1. Related work
2. Metric learning for nearest neighbors
3. Data sets & Feature extraction
4. Results
5. Conclusion

Related Work
Latent topic models
- Inspired by text-analysis models (pLSA, LDA)
- Generative model over keywords and image regions
- Trade-off: overfitting & capacity limited by the number of topics
Other mixture models
- Non-parametric KDE over image features
Binary classifiers
- One classifier for each keyword: many classifiers, (no) parameter sharing
- Many terms with very few examples
[Figure: one yes/no classifier per keyword: dog, sky, cartoon, people, house]

Related Work
Many approaches for image annotation
- Seen as machine translation [Duygulu et al. '02]
- Extensions of LDA [Barnard et al. '03]
- Multiple Bernoulli relevance model [Feng et al. '04]
- Supervised multiclass labeling [Carneiro et al. '07]
- Kernel-based ranking [Grangier & Bengio '08]
Local learning: use the most similar images to predict keywords (state-of-the-art image annotation results)
- Diffusion of labels over similarity graph [Liu et al. '09]
- Ad hoc nearest neighbor model [Makadia et al. '08]: simple model, combination of many visual features

Nearest Neighbor Image Annotation
[Figure: query image with the candidate keywords "jet" and "smoke", surrounded by neighbor images tagged with bear, polar, snow, tundra, ice, sky, prop, jet and clouds]
- How to choose the visual distance to define neighbors?
- How many neighbors to consider?
- How to transfer the tags of neighboring images?


A predictive model for keyword relevance
Notation
- relevance of keyword $w$ for image $i$: $y_{iw} \in \{+1, -1\}$
- visual distance between images $i$ and $j$: $d_{ij} \ge 0$
Use $d_{ij}$ to define weights $\pi_{ij}$ for a nearest neighbor model
The model outputs the probability of keyword relevance, $p(y_{iw} = +1)$

A predictive model for keyword relevance
Stochastic neighbor selection: with probability $\pi_{ij}$, pick image $j$ to predict the keywords of image $i$
- $\pi_{ij} \ge 0$, $\sum_j \pi_{ij} = 1$ and $\pi_{ii} = 0$
Predictions: weighted sum over neighbor images $j$
$$p(y_{iw} = +1) = \sum_j \pi_{ij} \, p(y_{iw} = +1 \mid j)$$
[Figure: query image with the candidate keywords "jet" and "smoke", linked to tagged neighbor images]

A predictive model for keyword relevance
Imagine we select image $j$ to predict keyword $w$ of image $i$:
$$p(y_{iw} = +1 \mid j) = \begin{cases} 1 - \epsilon & \text{if } y_{jw} = +1 \\ \epsilon & \text{otherwise} \end{cases}$$
[Figure: query image with the candidate keywords "jet" and "smoke", linked to tagged neighbor images]
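The weighted-sum prediction can be illustrated with a minimal Python sketch. The function names and the value of `eps` are assumptions made for illustration; the slides only state that a small $\epsilon$ is used.

```python
def neighbor_vote(label_jw, eps=1e-5):
    """p(y_iw = +1 | j): close to 1 if neighbor j carries keyword w, close to 0 otherwise.

    eps is an assumed value; the slides do not give one.
    """
    return 1.0 - eps if label_jw == +1 else eps


def predict_relevance(weights_i, labels_w, eps=1e-5):
    """p(y_iw = +1) = sum_j pi_ij * p(y_iw = +1 | j).

    weights_i: pi_ij for one image i (non-negative, summing to 1)
    labels_w:  y_jw in {+1, -1} for the same neighbors j
    """
    return sum(pi * neighbor_vote(y, eps) for pi, y in zip(weights_i, labels_w))


# Three neighbors; the two with the largest weights carry the keyword "jet".
p_jet = predict_relevance([0.5, 0.3, 0.2], [+1, +1, -1])
```

Because the weights sum to one, the prediction always stays between `eps` and `1 - eps`, so its logarithm is finite.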

A predictive model for keyword relevance
Objective: maximize the likelihood of leave-one-out predictions on the training data
$$L = \sum_{i,w} c_{iw} \log p(y_{iw})$$
where $c_{iw}$ is the annotation cost of keyword $w$ for image $i$
Optimization: gradient descent with constraints on the parameters to enforce
$$\pi_{ij} \ge 0, \quad \sum_j \pi_{ij} = 1$$
What about the weights?
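As a rough illustration of the leave-one-out objective, the sketch below evaluates $L$ for a given weight matrix. The function name, the list-of-lists data layout and the value of `eps` are assumptions, not taken from the slides.

```python
import math


def log_likelihood(pi, labels, costs, eps=1e-5):
    """L = sum_{i,w} c_iw * log p(y_iw).

    pi:     n x n weights; row i sums to 1 and pi[i][i] == 0 (leave-one-out)
    labels: n x W entries y_iw in {+1, -1}
    costs:  n x W positive costs c_iw
    """
    n, n_words = len(labels), len(labels[0])
    total = 0.0
    for i in range(n):
        for w in range(n_words):
            # Probability that keyword w is relevant for image i.
            p_pos = sum(
                pi[i][j] * ((1.0 - eps) if labels[j][w] == +1 else eps)
                for j in range(n)
            )
            # Probability assigned to the observed label y_iw.
            p_true = p_pos if labels[i][w] == +1 else 1.0 - p_pos
            total += costs[i][w] * math.log(p_true)
    return total


# Toy data: images 0 and 1 share a keyword, image 2 does not have it.
labels = [[+1], [+1], [-1]]
costs = [[1.0], [1.0], [1.0]]
good_pi = [[0.0, 0.9, 0.1], [0.9, 0.0, 0.1], [0.5, 0.5, 0.0]]
bad_pi = [[0.0, 0.1, 0.9], [0.1, 0.0, 0.9], [0.5, 0.5, 0.0]]
```

On this toy data, weights that favor the neighbor sharing the keyword (`good_pi`) give a higher log-likelihood than weights favoring the other image (`bad_pi`).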

Rank-based weights
- Fixed weight for the $k$-th neighbour: $\pi_{ij} = \gamma_k$ if $j$ is the $k$-th nearest neighbour of $i$
- $K$ parameters
- Effective neighborhood size set automatically
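A minimal sketch of rank-based weighting, assuming the per-rank weights are stored in a list `gamma` of length K and the distances of one image to its candidates in a dict; both names and the data layout are hypothetical.

```python
def rank_based_weights(distances_i, gamma):
    """Give gamma[k] to the k-th nearest neighbor of image i; all others get 0.

    distances_i: dict {neighbor index j: distance d_ij}, excluding i itself
    gamma:       K non-negative per-rank weights summing to 1
    """
    ranked = sorted(distances_i, key=distances_i.get)  # nearest first
    weights = {j: 0.0 for j in distances_i}
    for k, j in enumerate(ranked[:len(gamma)]):
        weights[j] = gamma[k]
    return weights


# Image 3 is nearest, so it receives the first-rank weight.
example = rank_based_weights({7: 0.4, 3: 0.1, 9: 0.9}, gamma=[0.7, 0.3])
```

Only the ordering of the distances matters here, not their magnitudes; ranks whose weight is zero no longer contribute to the prediction.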

Distance-based weights
Weights $\pi_{ij}$ depend smoothly on $d_{ij}$, with exponential decrease:
$$\pi_{ij} = \frac{\exp(-\lambda d_{ij})}{\sum_k \exp(-\lambda d_{ik})}$$
- Single parameter $\lambda$: decay rate, sets the effective neighborhood size
What is the right visual distance?
- Metric learning to linearly combine distances:
$$d_{ij} = w^{(1)} d^{(1)}_{ij} + w^{(2)} d^{(2)}_{ij} + \dots + w^{(n)} d^{(n)}_{ij}$$
- One parameter $w^{(m)}$ for each base distance $d^{(m)}$, $m = 1, \dots, n$
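The two formulas above can be sketched in a few lines of Python. The function names are illustrative, and subtracting the smallest distance is a standard numerical-stability step added here, not something the slides mention.

```python
import math


def combined_distance(base_distances, w):
    """d_ij = w(1) d(1)_ij + ... + w(n) d(n)_ij for one pair of images."""
    return sum(w_m * d_m for w_m, d_m in zip(w, base_distances))


def distance_based_weights(distances_i, lam):
    """pi_ij = exp(-lam * d_ij) / sum_k exp(-lam * d_ik) over the neighbors of i."""
    d_min = min(distances_i)  # shift for stability; it cancels in the ratio
    scores = [math.exp(-lam * (d - d_min)) for d in distances_i]
    total = sum(scores)
    return [s / total for s in scores]


# A larger decay rate concentrates the weight on the nearest neighbor.
soft = distance_based_weights([1.0, 2.0, 3.0], lam=0.5)
sharp = distance_based_weights([1.0, 2.0, 3.0], lam=5.0)
```

With `lam = 0` all neighbors receive equal weight, which shows how the single decay parameter controls the effective neighborhood size.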

Increasing the recall of rare words
- Keywords with low frequency in the database have low recall
- Neighbors that have the keyword do not account for enough mass in the weighted sum
- Systematically low relevance: need to boost it
- Adjust the dynamic range per keyword using a sigmoid:
$$x_{iw} = \sum_j \pi_{ij} \, p(y_{iw} = +1 \mid j)$$
$$p(y_{iw} = +1) = \sigma(\alpha_w x_{iw} + \beta_w)$$
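A small sketch of the per-keyword sigmoid modulation follows. The parameter values in the example are made up for illustration; in the model they are learned per keyword.

```python
import math


def sigmoid(z):
    """Numerically stable logistic function."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


def modulated_relevance(x_iw, alpha_w, beta_w):
    """p(y_iw = +1) = sigma(alpha_w * x_iw + beta_w), with per-keyword alpha_w, beta_w."""
    return sigmoid(alpha_w * x_iw + beta_w)


# Hypothetical rare keyword: the weighted neighbor vote is only 0.05, but a
# steep per-keyword sigmoid maps it to a relevance above 0.5.
boosted = modulated_relevance(0.05, alpha_w=40.0, beta_w=-1.0)
```

The sigmoid is monotonic, so it preserves the ranking of images for a given keyword while rescaling scores across keywords.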

Some practical issues
Computing the objective function and its gradient has a cost quadratic in the number of images
- We limit the size of neighborhoods to K
- The exponential decrease justifies ignoring images that are too far away anyway
The neighborhood is not fixed when learning the metric
- We include as many neighbors from each base distance as possible
- Overlap between neighborhoods allows using approx. 2K/D neighbors per distance
We use annotation costs to compensate for noisier keyword absences
- Balance the weight assigned to absences and presences
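One possible way to balance presences and absences is sketched below. The slides do not give the cost formula, so the specific choice here (each presence costs 1 over the number of presences, each absence 1 over the number of absences) is an assumption, as is the function name.

```python
def balanced_costs(labels):
    """Assumed cost scheme: presences and absences each carry total weight 1.

    labels: n x W entries y_iw in {+1, -1}; at least one presence and one
            absence are assumed, otherwise a division by zero occurs.
    """
    n_pos = sum(y == +1 for row in labels for y in row)
    n_neg = sum(y == -1 for row in labels for y in row)
    return [[1.0 / n_pos if y == +1 else 1.0 / n_neg for y in row]
            for row in labels]


# One presence among six annotations: it weighs as much as all five absences.
toy_costs = balanced_costs([[+1, -1, -1], [-1, -1, -1]])
```

Under this scheme the single presence is not swamped by the much more numerous, and noisier, absences.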

Optimization
Rank-based weights: K parameters
- Concave objective
- Gradient descent with convex constraints
Distance-based weights: # parameters = # base distances
- Gradient descent with convex constraints
Sigmoidal modulation: iterative optimization
- Optimize $\{\alpha_w, \beta_w\}$ for all words (concave)
- Optimize the $\pi_{ij}$


Data set 1: Corel 5k
- 5,000 images: landscape, animals, cities
- Vocabulary of 260 words
- Annotations designed for retrieval

Data set 2: ESP Game
- 20,000 images: photos, drawings, graphs
- Vocabulary of 268 words
- Annotations generated by players of an online game

Data set 3: IAPR TC-12
- 20,000 images: touristic photos, sports
- Vocabulary of 291 words
- Annotations extracted from descriptive text (nouns)

Feature extraction
Bag-of-words histograms
- SIFT [Lowe 04] and Hue [van de Weijer & Schmid 06]
- Dense grid and Harris interest points
- K-means quantization
Global color histograms
- Color spaces: RGB, HSV, LAB
- Each channel quantized in 16 bins
Global GIST descriptor [Oliva & Torralba 01]
Spatial 3x1 partitioning [Lazebnik et al. 06]
- Concatenate histograms from regions
- Done for all features except GIST


Evaluation measures
Measures are computed per keyword, then averaged
- Annotate images with the 5 most likely keywords
- Recall: # images correctly annotated / # images in ground truth
- Precision: # images correctly annotated / # images annotated
Mean average precision
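The per-keyword precision and recall described above can be sketched as follows. The function names are illustrative, and scoring a keyword as 0 when its denominator is zero is an assumed convention that the slides do not specify.

```python
def top_keywords(scores, n=5):
    """Return the n keywords with the highest predicted relevance for one image."""
    return sorted(scores, key=scores.get, reverse=True)[:n]


def mean_precision_recall(predicted, ground_truth, vocabulary):
    """Per-keyword precision and recall, averaged over the vocabulary.

    predicted, ground_truth: lists of keyword sets, one set per image
    """
    precisions, recalls = [], []
    for w in vocabulary:
        n_annotated = sum(w in p for p in predicted)
        n_truth = sum(w in g for g in ground_truth)
        n_correct = sum(w in p and w in g for p, g in zip(predicted, ground_truth))
        # Assumed convention: a keyword with an empty denominator scores 0.
        precisions.append(n_correct / n_annotated if n_annotated else 0.0)
        recalls.append(n_correct / n_truth if n_truth else 0.0)
    return sum(precisions) / len(vocabulary), sum(recalls) / len(vocabulary)


predicted = [{"sky", "jet"}, {"sky"}]
ground_truth = [{"jet"}, {"sky", "bear"}]
mean_p, mean_r = mean_precision_recall(predicted, ground_truth, ["sky", "jet", "bear"])
```

Averaging per keyword rather than per image means rare keywords count as much as frequent ones, which is why low recall on rare words hurts these scores.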

Variants of TagProp

                      COREL 5K      IAPR TC-12    ESP Game
                      P      R      P      R      P      R
Rank-based            28%    32%    35%    22%    27%    20%
Fixed distance        30%    33%    50%    20%    48%    19%
Fixed with sigmoid    28%    35%    41%    30%    39%    24%
ML with sigmoid       33%    42%    46%    35%    39%    27%

(P = precision, R = recall, ML = metric learning)
- Distance-based > Rank-based, comparable for COREL
- Sigmoid: trades precision for recall
- Metric learning: improves results significantly


Effect of sigmoid, detailed
[Figure: mean recall of words (%) for IAPR keywords, binned by how many images they occur in; ML (light blue) and ML with sigmoid (dark blue). Horizontal axis: number of images with given keyword.]

Comparison to state-of-the-art
[Figure: bar charts (%) for COREL 5K, IAPR TC-12 and ESP Game; improvements marked in the charts, in extraction order: +6, +10, +18, +6, +17, +2]

Retrieval performance on Corel 5k

                              All    Single    Multi    Easy    Difficult
Grangier 08                   26%    34%       26%      43%     22%
Fixed distance with sigmoid   31%    41%       30%      49%     27%
ML with sigmoid               36%    46%       35%      55%     32%

- 2241 (multi-word) queries with at least one relevant image
- Easy queries: three or more relevant images
- Mean average precision: +10 points overall (26% to 36%)
- Metric learning: improves results significantly
[D. Grangier and S. Bengio, "A discriminative kernel-based model to rank images from text queries", PAMI 2008]
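For reference, mean average precision over a set of queries can be computed as in the sketch below. It uses the standard definition and assumes each query's ranking covers every relevant image; the function names are illustrative.

```python
def average_precision(ranked_relevance):
    """Average of precision@rank taken at the rank of each relevant image.

    ranked_relevance: booleans in ranking order, True where the image is relevant;
                      the list is assumed to contain every relevant image.
    """
    hits, precisions = 0, []
    for rank, relevant in enumerate(ranked_relevance, start=1):
        if relevant:
            hits += 1
            precisions.append(hits / rank)
    return sum(precisions) / hits if hits else 0.0


def mean_average_precision(queries):
    """Mean of the per-query average precision."""
    return sum(average_precision(q) for q in queries) / len(queries)


# Two toy queries: a perfect ranking, and one whose only relevant image is second.
toy_map = mean_average_precision([[True], [False, True]])
```

Here the first query scores 1.0 and the second 0.5, so `toy_map` is 0.75.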

Distance-based weights
Learned linear combinations are dataset-specific

Comparison to JEC and SVM
- Makadia 08 (JEC), SVM & Fixed: the same combination of distances
- SVM: one per keyword, Gaussian kernel, C=100
- SVM, Fixed and ML: sigmoidal modulation on output scores
[Figure: one panel each for COREL 5K, IAPR TC-12 and ESP Game]

Conclusion
The main contributions
- Probabilistic nearest neighbor model
- Metric learning to find the optimal distance combination
- Effective neighborhood size set automatically
- Sigmoidal non-linearity to boost recall of rare words
State-of-the-art results
- Both on image annotation and keyword-based retrieval
- On three different data sets and two evaluation protocols
Future work
- Learn models specifically for annotation / retrieval
- Use localization of (some) concepts to improve performance

TagProp: Discriminative Metric Learning in Nearest Neighbor Models for Image Annotation Matthieu Guillaumin, Thomas Mensink, Jakob Verbeek, Cordelia Schmid LEAR Team, INRIA Grenoble, France Online Demo http://lear.inrialpes.fr/~verbeek