on learned visual embedding patrick pérez Allegro Workshop Inria Rhônes-Alpes 22 July 2015

Similar documents
Face2Face Comparing faces with applications Patrick Pérez. Inria, Rennes 2 Oct. 2014

Metric Learning for Large-Scale Image Classification:

Large-scale visual recognition Efficient matching

Mixtures of Gaussians and Advanced Feature Encoding

Metric learning approaches! for image annotation! and face recognition!

Metric Learning for Large Scale Image Classification:

Quasi Cosine Similarity Metric Learning

Large scale object/scene recognition

Compressed local descriptors for fast image and video search in large databases

Three things everyone should know to improve object retrieval. Relja Arandjelović and Andrew Zisserman (CVPR 2012)

PCCA: A New Approach for Distance Learning from Sparse Pairwise Constraints

Aggregating Descriptors with Local Gaussian Metrics

Is that you? Metric learning approaches for face identification

Large-scale visual recognition The bag-of-words representation

Bilinear Models for Fine-Grained Visual Recognition

Learning Visual Semantics: Models, Massive Computation, and Innovative Applications

Face detection and recognition. Detection Recognition Sally

Category-level localization

Previously. Part-based and local feature models for generic object recognition. Bag-of-words model 4/20/2011

Fisher Vector Faces in the Wild

Part-based and local feature models for generic object recognition

TagProp: Discriminative Metric Learning in Nearest Neighbor Models for Image Annotation

Computer Vision. Exercise Session 10 Image Categorization

Fisher Vector Faces in the Wild

Fast Indexing and Search. Lida Huang, Ph.D. Senior Member of Consulting Staff Magma Design Automation

Instances on a Budget

Explicit embeddings for nearest neighbor search with Mercer kernels

Supervised Hashing for Image Retrieval via Image Representation Learning

Data mining with sparse grids

Recognition of Animal Skin Texture Attributes in the Wild. Amey Dharwadker (aap2174) Kai Zhang (kz2213)

Is 2D Information Enough For Viewpoint Estimation? Amir Ghodrati, Marco Pedersoli, Tinne Tuytelaars BMVC 2014

Multiple Kernel Learning for Emotion Recognition in the Wild

arxiv: v1 [cs.cv] 20 Dec 2013

Learning to Recognize Faces in Realistic Conditions

Deep Face Recognition. Nathan Sun

Adaptive Binary Quantization for Fast Nearest Neighbor Search

Video Google faces. Josef Sivic, Mark Everingham, Andrew Zisserman. Visual Geometry Group University of Oxford

Some faces are more equal than others: Hierarchical organization for accurate and efficient large-scale identity-based face retrieval

Facial Expression Classification with Random Filters Feature Extraction

A Compact and Discriminative Face Track Descriptor

FaceNet. Florian Schroff, Dmitry Kalenichenko, James Philbin Google Inc. Presentation by Ignacio Aranguren and Rahul Rana

Contents. Preface to the Second Edition

Learning Compact and Effective Distance Metrics with Diversity Regularization. Pengtao Xie. Carnegie Mellon University

A Family of Contextual Measures of Similarity between Distributions with Application to Image Retrieval

Development in Object Detection. Junyuan Lin May 4th

Class 6 Large-Scale Image Classification

Metric Learning Applied for Automatic Large Image Classification

Preface to the Second Edition. Preface to the First Edition. 1 Introduction 1

IMPROVING VLAD: HIERARCHICAL CODING AND A REFINED LOCAL COORDINATE SYSTEM. Christian Eggert, Stefan Romberg, Rainer Lienhart

Neural Codes for Image Retrieval. David Stutz July 22, /48 1/48

Action recognition in videos

6.819 / 6.869: Advances in Computer Vision

Data mining with sparse grids using simplicial basis functions

Artistic ideation based on computer vision methods

Evaluation of GIST descriptors for web scale image search

Visual Object Recognition

ECS289: Scalable Machine Learning

Part based models for recognition. Kristen Grauman

Stable hyper-pooling and query expansion for event detection

Learning Compact Visual Attributes for Large-scale Image Classification

Negative evidences and co-occurrences in image retrieval: the benefit of PCA and whitening

Approximate Nearest Neighbor Search. Deng Cai Zhejiang University

Learning to Match Images in Large-Scale Collections

String distance for automatic image classification

Partial Least Squares Regression on Grassmannian Manifold for Emotion Recognition

Bag-of-features. Cordelia Schmid

Learning Low-rank Transformations: Algorithms and Applications. Qiang Qiu Guillermo Sapiro

Improved Fisher Vector for Large Scale Image Classification XRCE's participation for ILSVRC

DeepIndex for Accurate and Efficient Image Retrieval

A Scalable Unsupervised Feature Merging Approach to Efficient Dimensionality Reduction of High-dimensional Visual Data

arxiv: v1 [cs.cv] 23 Nov 2017

Descriptors for CV. Introduc)on:

3D Object Recognition using Multiclass SVM-KNN

Pattern Spotting in Historical Document Image

Rongrong Ji (Columbia), Yu Gang Jiang (Fudan), June, 2012

ImageCLEF 2011

VISOR: Towards On-the-Fly Large-Scale Object Category Retrieval

Shrinkage Expansion Adaptive Metric Learning

An Associate-Predict Model for Face Recognition FIPA Seminar WS 2011/2012

Structured Models in. Dan Huttenlocher. June 2010

TEXTURE CLASSIFICATION METHODS: A REVIEW

Binary Embedding with Additive Homogeneous Kernels

Deep Learning for Face Recognition. Xiaogang Wang Department of Electronic Engineering, The Chinese University of Hong Kong

Fantope Regularization in Metric Learning

Tom-vs-Pete Classifiers and Identity- Preserving Alignment for Face Verification. Thomas Berg Peter N. Belhumeur Columbia University

Large-Scale Face Manifold Learning

Deep Learning for Computer Vision

Sparse coding for image classification

Oriented pooling for dense and non-dense rotation-invariant features

CS 340 Lec. 4: K-Nearest Neighbors

Hashing with Graphs. Sanjiv Kumar (Google), and Shih Fu Chang (Columbia) June, 2011

Day 3 Lecture 1. Unsupervised Learning

Intro to Artificial Intelligence

arxiv: v2 [cs.cv] 27 Nov 2017

COMP 551 Applied Machine Learning Lecture 16: Deep Learning

A CBR SYSTEM FOR EFFICIENT FACE RECOGNITION UNDER PARTIAL OCCLUSION ICCBR 2017, TRONDHEIM, NORWAY

Enhanced and Efficient Image Retrieval via Saliency Feature and Visual Attention

Learning Discriminative Appearance-Based Models Using Partial Least Squares

Compressing Deep Neural Networks for Recognizing Places

Transcription:

on learned visual embedding patrick pérez Allegro Workshop Inria Rhônes-Alpes 22 July 2015

Vector visual representation Fixed-size image representation High-dim (100 100,000) Generic, unsupervised: BoW, FV, VLAD / DBM, SAE Generic, supervised: learned aggregators / CNN activations Class-specific, e.g. for faces: landmark-related SIFT, HoG, LBP, FV Key to compare images and fragments, with built-in invariance Verification (1-to-1) Search (1-to-N) Clustering (N-to-N) Recognition (1-to-K) local descriptors aggregated representation 2

VLAD: vector of locally aggregated descriptors C SIFT-like blocks, D = 128 C [Jégou et al. CVPR 10] 3

Face representation Sparse representation Layout of facial landmarks Multi-scale descriptor of facial landmarks Dense representation Fixed grid of overlapping blocks SIFT/HOG/LBP block description Fisher and CNN variants Landmarks still useful to normalize e.g., [Cinbis et al. ICCV 11] e.g., [Sivic et al. ICCV 09] 4

Embedding visual representation Further encoding to Reduce complexity and memory Improve discriminative power Specialize to specific tasks task Various types (possibly combined) Discrete (Hamming, VQ, PQ): Linear (PCA, metric learning): Non-linear (K-PCA, spectral, NMF, SC): 5

Outline Explicit embedding for visual search [JMIV 2015, with A. Bourrier, H. Jégou, F. Perronin and R. Gribonval] E-SVM encoding for visual search (and classification) [CVPR 2015, with J. Zepeda] representation E-SVM encoder Multiple metric learning for face verification [ACCV 2014, CVPR-w 2015, with G. Sharma and F. Jurie]?? 6 7/24/2015

Euclidean (approximate) search Nearest neighbor (1NN) search in Euclidean case Euclidean approximate NN (a-nn) for large scale Discrete embedding efficient to search with: binary hashing or VQ Product Quantization (PQ) [Jégou 2010]: asymmetric fine grain search 7

Beyond Euclidean Other (di)similarities χ 2 and histogram intersection (HI) kernels Data-driven kernels Appealing but costly Fast approximate search with Mercer kernels? Exploiting of kernel trick to transport techniques to implicit space Inspiration from classification with explicit embedding [Vedaldi and Zisserman, CVPR 10][Perronnin et al. CVPR 10] description hashing Kernel space implicit codes explicit embedding embedded description Euclidean encoding explicit codes 8

The implicit path Kernelized Locality Sensitive Hashing (KLSH) [Kulis and Grauman ICCV 09] Random draw of directions within RKHS subspace spanned by implicit maps of a random subset of input vectors Hashing function computed thanks to kernel trick Random Maximum Margin Hashing (RMMH) [Joly and Buisson CVPR 11] Each hashing function is a kernel SVM learned on a random subset of input vectors (one half labeled +1, the other -1) Outperforms KLSH 9

Explicit embedding Data-independent Truncated expansions or Fourier sampling Restricted to certain kernels (e.g., additive, multiplicative) Generic data-driven: Kernel PCA (KPCA) and the like Mercer kernel K to capture similarity Learning subset Low-rank approximation of kernel matrix 10

NN and a-nn search with KPCA Exact search KPCA encoding Exact Euclidean 1NN search Bound computation Most similar item is in short list truncated with bounds Approximate search KPCA encoding Euclidean a-knn search with PQ Similarity re-ranking of short list 11

Experiments 1NN local descriptors search N=1M SIFT (D=128), K=χ 2, M=1024, E=128, Tested also: KPCA+LSH (binary search in explicit space) [256bits] 12

Experiments 1NN image search N=1.2M images BoW (D=1000), K=χ 2, M=1024, E=128 Tested also: KPCA+LSH (binary search in explicit space) [256bits] 13

Discriminative encoding with E-SVM Boost discriminative power of representation Extract what is unique about image (representation) relative to all others Method Exemplar-SVM (E-SVM) [Malisiewicz 2012] to encode visual representation Symmetrical encoding even for asymmetric problems Recursive encoding Application: search and classification 14

Method Large generic set of images Exemplar-SVM Final encoding visual representation E-SVM encoder 15

Method E-SVM learning: stochastic gradient (SGD) with Pegasos Recursive encoding (RE-SVM) Image search: symmetrical embedding Query and database codes: Cosine similarity: Classification: learn and run classifier on E-SVM codes 16

Image search Holiday dataset, VLAD-64 (D=8192) 17

Image search Holiday and Oxford datasets 18

Face verification Given 2 face images: Same person? Persons unseen before Various types of supervision for learning Named faces (provide +/- pairs) Tracked faces (provide + pairs) Simultaneous faces (provide pairs) Labelled Faces in the Wild (LFW) +13,000 faces; +4,000 persons 10-fold testing with 300 +/- pairs per fold Restricted setting: only pair information for training Unrestricted setting: name information for training 19 7/24/2015

Linear metric learning Powerful approach to face verification Learning Mahalanobis distance in input space, via Typical training data: +/- pairs should become close/distant Verification of new faces: Several approaches Large margin nearest neighbor (LMNN) [Weinberger et al. NIPS 05] Information theoretic metric learning (ITML) [Davis et al. ICML 07] Logistic Discriminant Metric Learning (LDML) [Guillaumin et al. ICCV 09] Pairwise Constrained Component Analysis (PCCA) [Mignon & Jurie, CVPR 12] 20 7/24/2015

Low-rank metric learning Very high dimension (in range 1,000 100,000) Prohibitive size of Mahalanobis matrix Scarcity of training data Low-rank Mahalanobis metric learning: Learn linear projection (dim. reduction) and metric Minimize loss over training set Rank fixed by cross-validation Proposed: extension to latent variables and multiple metrics 21 7/24/2015

Losses Probabilistic logistic loss Generalized logistic loss Hinge loss 22 7/24/2015

Expanded parts model Expanded parts model [Sharma et al. CVPR 13] for human attributes and object/action recog. Objectives Avoid fixed layout Learn collection of discriminative parts and associated metrics Leverage the model to handle occlusions 23 7/24/2015

Expanded parts model Mine P discriminative parts and learn associated metrics Dissimilarity based on comparing K < P best parts Learning Minimize hinge loss: greedy on parts + gradient descent on matrices Prune down to P a large set of N random parts Projections initialized by whitened PCA Stochastic gradient: given annotated pair 24 7/24/2015

Experiments with occlusions LFW, unrestricted setting N = 500, P 50, K = 20,D = 10k, E = 20, 10 6 SGD iterations Random occlusions (20 80%) at test time, on one image only Focused occlusions 25 7/24/2015

Experiments with occlusions 26 7/24/2015

Comparing face sets Given groups of single-person faces e.g., labelled clusters, face tracks Comparing sets Based on face pair comparison, i.e. For face tracks: a single descriptor per track [Parkhi et al. CVPR 14] [Everingham et al.bmvc 06] 27 7/24/2015

Learning multiple metrics Metrics associated to L mined types of cross-pair variations Learning from annotated set pairs 28 7/24/2015

Learning multiple metrics Stochastic gradient: given annotated pair Subsample the sets (to ensure variety of cross-pair variations) Dissimilarity: Sub-gradient of pair s hinge loss: if Projections initialized by whitened PCA computed on random subsets 29 7/24/2015

New dataset From 8 different series (inc. Buffy, Dexter, MadMen, etc.) 400 high quality labelled face tracks, 23M faces, 94 actors Wide variety of poses, attributes, settings Ready for metric learning and test (700 pos., 7000 neg.) 30 7/24/2015

Comparing face tracks Parameters: D 14000, K = 3, 10 6 SGD iterations Method Subspace dim. E Aver. Precision known persons Aver. Precision unknown persons PCA+cosine sim + min-min 1000 24.8 20.4 PCA+cosine sim + min-min 100 21.4 20.2 Metric Learning + min-min 100 23.7 21.0 Latent ML (proposed) (3X)33 27.9 22.9 31 7/24/2015

Conclusion Learn embedding of visual description task Unsupervised learning of Task-dependent supervised learning of Also for deep learning 1-layer adaptation of CNN features for classification with linear SVM Ad-hoc dim. reduction or learned with L1 regularization (Kulkarni et al. BMVC15) Same performance as VGG-M 128 [Chatfield 2014], with 4x smaller codes 32