Bilinear Models for Fine-Grained Visual Recognition


Subhransu Maji, College of Information and Computer Sciences, University of Massachusetts, Amherst

Fine-grained visual recognition: distinguish between closely related categories, e.g., California gull vs. Ringed beak gull. Inter-category variation is small compared to intra-category variation: location, pose, viewpoint, background, lighting, gender, season, etc.

Part-based models: localize parts and compare corresponding locations, factoring out the variation due to pose, viewpoint, and location.

General image classification. Classical approaches treat the image as a collection of patches: bag-of-visual-words models [Leung & Malik 99, Csurka et al. 04] use orderless pooling, with no explicit modeling of pose or viewpoint; Fisher vectors are effective for fine-grained classification. Modern approaches: convolutional neural networks (CNNs).
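The orderless pooling the slide refers to can be sketched in a few lines of NumPy. This is an illustrative reconstruction of the classical bag-of-visual-words pipeline, not code from the talk; all names are made up:

```python
import numpy as np

def bovw_histogram(descriptors, codebook):
    """Orderless bag-of-visual-words pooling: assign each local
    descriptor to its nearest codeword and count occurrences.

    descriptors: (N, D) local features from one image
    codebook:    (K, D) visual words (e.g., k-means centers)
    returns:     (K,) L1-normalized histogram
    """
    # squared distances (N, K) via broadcasting
    d2 = ((descriptors[:, None, :] - codebook[None, :, :]) ** 2).sum(-1)
    words = d2.argmin(axis=1)                     # nearest codeword per patch
    hist = np.bincount(words, minlength=len(codebook)).astype(float)
    return hist / hist.sum()                      # spatial order is discarded
```

The histogram discards where each patch occurred, which is exactly the "no explicit modeling of pose or viewpoint" property the slide mentions.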

Texture representations vs. CNNs. Both follow the same pipeline: image → non-linear filters → feature field → encoder → representation φ(x). Texture models use handcrafted features followed by orderless pooling; CNNs use convolutional layers (c1–c5) followed by fully-connected (FC) layers (f6–f8).

Mix and match. Because the two pipelines share the structure image → non-linear filters → feature field → encoder → representation, the stages can be combined: handcrafted or CNN local descriptors, with orderless pooling or CNN FC pooling.

Mix and match: the standard texture representation uses handcrafted local descriptors with orderless pooling [Sivic and Zisserman 03, Csurka et al. 04, Perronnin and Dance 07, Perronnin et al. 10, Jegou et al. 10].

Mix and match: the standard application of CNNs uses CNN local descriptors with FC pooling (FC-CNN) [Chatfield et al. 14, Girshick et al. 14, Gong et al. 14, Razavin et al. 14].

Mix and match: orderless pooling of CNN local descriptors.

Mix and match: CNN descriptors pooled by a Fisher vector (FV-CNN) [Cimpoi et al., CVPR 15].

CNNs for texture recognition. Texture recognition accuracy:

Dataset   FV-SIFT   FC-CNN   FV-CNN
KT-2b     70.8      71.0     73.3
FMD       59.8      70.3     73.5
DTD       58.6      58.8     66.8

[Cimpoi et al., CVPR 15]. Significant improvements over simply using CNN features. KT-2b is a material dataset with 11 categories.

CNNs for texture recognition with very deep models. Texture recognition accuracy:

Dataset   FV-SIFT   FC-CNN   FV-CNN   FV-CNN (VD)
KT-2b     70.8      71.0     71.0     81.8
FMD       59.8      70.3     72.6     79.8
DTD       58.6      58.8     66.7     72.3

[Cimpoi et al., CVPR 15]. VD is the very deep model from the Oxford VGG group that performed among the best on ILSVRC 2014 (the ImageNet classification challenge): http://www.robots.ox.ac.uk/~vgg/research/very_deep/ (Talk @ Facebook)

Scenes as textures. MIT Indoor dataset (67 classes): previous best 70.8% [Zhou et al., NIPS 14]; FV-CNN reaches 81.0% [Cimpoi et al., CVPR 15]. Comparing domain-specific CNNs with texture models over ImageNet features (no data augmentation), the advantage of domain-specific training disappears with FV-CNN on AlexNet. Better CNNs lead to better performance, and FV pooling is better than FC pooling.

Drawbacks of FV-CNN. The features in FV-CNN are not learned: they are handcrafted (e.g., SIFT) or come from a CNN trained with a different architecture (e.g., with fully-connected layers), and the GMM parameters are learned in an unsupervised manner. Can we learn the features for FV models? The gradient of the FV with respect to x is nasty, since the soft assignments are

  \eta_i(x) = \frac{\exp\left(-\tfrac{1}{2}(x-\mu_i)^T \Sigma_i^{-1}(x-\mu_i)\right)}{\sum_j \exp\left(-\tfrac{1}{2}(x-\mu_j)^T \Sigma_j^{-1}(x-\mu_j)\right)}

Partial attempt: Sydorov et al. learn the GMM parameters discriminatively.
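The soft-assignment formula above can be sketched in NumPy. This is a hypothetical illustration (not the talk's code), assuming diagonal covariances for simplicity:

```python
import numpy as np

def gmm_posteriors(x, mu, var, prior):
    """Soft assignments eta_i(x) for a diagonal-covariance GMM.

    x:     (D,)   descriptor
    mu:    (K, D) component means
    var:   (K, D) diagonal covariances
    prior: (K,)   mixture weights
    """
    diff = x - mu                                     # (K, D)
    # log of N(x; mu_i, Sigma_i) up to a shared constant; keep the
    # log-determinant term since covariances differ per component
    log_lik = -0.5 * np.sum(diff**2 / var + np.log(var), axis=1)
    log_post = np.log(prior) + log_lik
    log_post -= log_post.max()                        # numerical stability
    post = np.exp(log_post)
    return post / post.sum()                          # eta(x), sums to 1
```

The softmax structure is what makes the gradient with respect to x awkward: every component's posterior depends on every other component's likelihood.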

Bilinear models for classification. A bilinear model for classification is a four-tuple B = (f_A, f_B, P, C): two feature extractors, a pooling function P, and a classifier C. Each feature extractor maps a location l in image I to a feature, f : L × I → R^{c×D} (e.g., SIFT gives a feature in R^{1×128}). At each location the two features are combined by a matrix product,

  bilinear(l, I) = f_A(l, I)^T f_B(l, I),

and pooling aggregates these over locations into the image descriptor

  \Phi(I) = \sum_{l \in L} bilinear(l, I),

which the classifier C maps to a class.
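A minimal NumPy sketch of the pooled bilinear descriptor. The shapes are illustrative (in B-CNN, f_A and f_B would be CNN feature maps), and the signed-square-root plus L2 normalization follows the B-CNN paper:

```python
import numpy as np

def bilinear_pool(fa, fb):
    """Sum-pool the outer product f_A(l)^T f_B(l) over locations l.

    fa: (L, c) features from stream A at L locations
    fb: (L, d) features from stream B at the same locations
    returns: flattened (c*d,) bilinear descriptor phi(I)
    """
    phi = fa.T @ fb                          # (c, d): sums outer products over l
    phi = phi.ravel()
    phi = np.sign(phi) * np.sqrt(np.abs(phi))    # signed square root
    return phi / (np.linalg.norm(phi) + 1e-12)   # L2 normalization
```

Note that `fa.T @ fb` computes the sum over locations of per-location outer products in one matrix multiply, which is why pooling is cheap.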

Fisher vector is a bilinear model. Fisher vector (FV) models [Perronnin et al., 10] locally encode statistics of each feature x weighted by the soft assignments η(x), so the FV fits the bilinear form above.
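The slide's formula did not survive extraction; the following is a hedged reconstruction of the standard connection (following Perronnin et al., 10, with diagonal covariances and per-component constant scalings omitted):

```latex
% FV blocks for component i, with \Sigma_i = \mathrm{diag}(\sigma_i^2):
\alpha_i(x) = \eta_i(x)\,\frac{x-\mu_i}{\sigma_i}, \qquad
\beta_i(x)  = \frac{\eta_i(x)}{\sqrt{2}}
              \left[\Big(\frac{x-\mu_i}{\sigma_i}\Big)^{\odot 2} - 1\right]

% Both blocks are fixed linear functions of the pooled outer product
%   \sum_x \eta(x)^T \,[\,x^T,\; (x^{\odot 2})^T,\; 1\,],
% so the FV fits the bilinear form B = (f_A, f_B, P, C) with
%   f_A(x) = \eta(x), \qquad f_B(x) = [\,x^T,\; (x^{\odot 2})^T,\; 1\,].
```

In words: pooling the outer product of the soft assignments with the (augmented) descriptor collects the per-component sufficient statistics, from which α and β follow by a fixed linear map.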

Bilinear CNN model. Decouple f_A and f_B by using two separate CNNs: the image passes through the convolutional (+ pooling) layers of CNN stream A and CNN stream B, the pooled bilinear vector is formed, and a softmax classifier predicts the class (e.g., "chestnut sided warbler").

Bilinear CNN model. Back-propagation through the bilinear layer is easy, which allows end-to-end training.
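The gradients through the bilinear layer are just matrix products. A hypothetical NumPy check (not the authors' code): with Φ = AᵀB and upstream gradient G = ∂L/∂Φ, we get ∂L/∂A = B Gᵀ and ∂L/∂B = A G, which a finite-difference test confirms:

```python
import numpy as np

def bilinear_forward(A, B):
    """Phi = A^T B: (L, c) x (L, d) -> (c, d), summed over locations."""
    return A.T @ B

def bilinear_backward(A, B, dPhi):
    """Gradients of a scalar loss w.r.t. A and B, given dL/dPhi."""
    dA = B @ dPhi.T   # (L, d) @ (d, c) -> (L, c)
    dB = A @ dPhi     # (L, c) @ (c, d) -> (L, d)
    return dA, dB

# numerical check of dA against finite differences
rng = np.random.default_rng(0)
A, B = rng.standard_normal((5, 3)), rng.standard_normal((5, 4))
G = rng.standard_normal((3, 4))          # arbitrary upstream gradient
dA, dB = bilinear_backward(A, B, G)
eps = 1e-6
i, j = 2, 1
Ap = A.copy(); Ap[i, j] += eps
num = (np.sum(G * bilinear_forward(Ap, B)) - np.sum(G * bilinear_forward(A, B))) / eps
assert abs(num - dA[i, j]) < 1e-4
```

Because both gradients are dense matrix multiplies, the layer slots into standard back-propagation with no special machinery, which is what makes end-to-end training straightforward.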

Experiments: methods. Local features: SIFT descriptors [Lowe, ICCV 99]; VGG-M (5 conv + 2 FC layers) [Chatfield et al., BMVC 14]; VGG-VD (16 conv + 2 FC layers) [Simonyan and Zisserman, ICLR 15]. Pooling architectures: fully-connected pooling (FC), Fisher vector pooling (FV), and bilinear pooling (B). Notation examples: FC-CNN (M) is fully-connected pooling with VGG-M; FV-CNN (D) is Fisher vector pooling with VGG-VD [Cimpoi et al., 15]; B-CNN (D, M) is bilinear pooling with VGG-VD and VGG-M.

Experiments: datasets. CUB 200-2011 (200 bird species, 11,788 images; small objects, clutter), FGVC Aircraft (100 variants, 10,000 images), and Stanford Cars (196 models, 16,185 images; clutter). All models are trained with image labels only; no part or object annotations are used at training or test time.

Results: birds classification. Accuracy on the CUB 200-2011 dataset; the model is provided with only the image at test time.

Method        w/o ft   w/ ft
FV-SIFT       18.8     -
FC-CNN (M)    52.7     58.8
FC-CNN (D)    61.0     70.4
FV-CNN (M)    61.1     64.1
FV-CNN (D)    71.3     74.7
B-CNN (M,M)   72.0     78.1
B-CNN (D,M)   80.1     84.1
B-CNN (D,D)   80.1     84.0

Fine-tuning helps across the board. Direct fine-tuning of FV-CNN is hard, so fine-tuned FC-CNN models are used instead; this indirect fine-tuning helps, and B-CNN (M,M) already outperforms multi-scale FV-CNN [Cimpoi et al., CVPR 15].

State of the art: 84.1 [1], 82.0 [2], 73.9 [3], 75.7 [4]
[1] Spatial Transformer Networks, Jaderberg et al., NIPS 15
[2] Fine-Grained Recognition without Part Annotations, Krause et al., CVPR 15 (+ object bounding boxes)
[3] Part-based R-CNNs, Zhang et al., ECCV 14 (+ part bounding boxes)
[4] Pose normalized CNNs, Branson et al., BMVC 14 (+ landmarks)

Results: comparison with the state of the art.

Dataset         FC-CNN   FV-CNN   B-CNN   SoTA
CUB-200-2011    70.4     74.7     84.1    84.1
FGVC-Aircraft   74.1     77.6     84.1    80.7
Stanford-Cars   79.8     85.7     91.3    92.6

Model visualization. Visualizing the top activations of the fine-tuned B-CNN (D, M) on CUB, for the D-Net and M-Net streams.

Most confused categories on CUB-200, Aircraft, and Stanford Cars.

Visualizing invariances: find invariant images I' whose representation matches the original, φ(I') ≈ φ(I).

Visualizing categories: synthesize an image that maximizes the class score plus an image prior,

  \hat{I} = \arg\max_I \; \mathrm{score}(c, I) + \Gamma(I),

where score(c, I) is the class log-likelihood and Γ(I) is the image prior.

Conclusion. Bilinear models achieve high accuracy on fine-grained tasks with limited data, allow end-to-end training of bag-of-visual-words-style models, and are fast at test time: B-CNN (D, D) runs at about 10 images/second on a Tesla K40. Code and models: http://vis-www.cs.umass.edu/bcnn

References:
Fisher vector CNNs [Cimpoi et al., CVPR 15]
Bilinear CNNs [Lin et al., ICCV 15]
Visualizing and Understanding Deep Texture Representations [Lin and Maji, arXiv:1511.05197, Nov 2015]

Tsung-Yu Lin, Aruni RoyChowdhury