Bilinear Models for Fine-Grained Visual Recognition
Subhransu Maji
College of Information and Computer Sciences, University of Massachusetts, Amherst
Fine-grained visual recognition
Example: distinguish between closely related categories, e.g., California gull vs. ring-billed gull
Inter-category variation vs. intra-category variation: location, pose, viewpoint, background, lighting, gender, season, etc.
Part-based models
Localize parts and compare features at corresponding locations
Factor out the variation due to pose, viewpoint, and location
General image classification
Classical approaches treat the image as a collection of patches:
Bag-of-Visual-Words models [Leung & Malik 99, Csurka et al. 04]: orderless pooling with no explicit modeling of pose or viewpoint
Fisher vectors are effective for fine-grained classification (e.g., predicting among [California gull, ring-billed gull, Heermann's gull, ...])
Modern approaches: Convolutional Neural Networks (CNNs)
Texture representations vs. CNNs
[Figure: both follow the pipeline image → non-linear filters → feature field → encoder → representation φ(x)]
Texture representation: handcrafted features with orderless (local and spatial) pooling
CNN: convolutional layers (c1–c5) followed by fully-connected (FC) layers (f6–f8)
Mix and match
[Figure: pipeline image → non-linear filters → feature field → encoder → representation φ(x)]
Filters: handcrafted local descriptors or CNN local descriptors
Encoders: orderless pooling or CNN FC pooling
Mix and match: standard texture representation
Handcrafted local descriptors + orderless pooling → φ(x)
[Sivic and Zisserman 03, Csurka et al. 04, Perronnin and Dance 07, Perronnin et al. 10, Jegou et al. 10]
Mix and match: standard application of CNNs (FC-CNN)
CNN local descriptors + CNN FC pooling → φ(x)
[Chatfield et al. 14, Girshick et al. 14, Gong et al. 14, Razavian et al. 14]
Mix and match: orderless pooling of CNN local descriptors
CNN local descriptors + orderless pooling → φ(x)
Mix and match: CNN descriptors pooled by a Fisher Vector (FV-CNN)
CNN local descriptors + Fisher Vector pooling → φ(x)
[Cimpoi et al., CVPR 15]
CNNs for texture recognition
Texture recognition accuracy [Cimpoi et al., CVPR 15]:

Dataset   FV-SIFT   FC-CNN   FV-CNN
KT-2b      70.8      71.0     73.3
FMD        59.8      70.3     73.5
DTD        58.6      58.8     66.8

Significant improvements over simply using CNN features
KT-2b dataset: 11 material categories
CNNs for texture recognition
Texture recognition accuracy [Cimpoi et al., CVPR 15]:

Dataset   FV-SIFT   FC-CNN   FV-CNN   FV-CNN (VD)
KT-2b      70.8      71.0     71.0      81.8
FMD        59.8      70.3     72.6      79.8
DTD        58.6      58.8     66.7      72.3

VD: the very deep model from the Oxford VGG group, among the best performers in ILSVRC 2014 (the ImageNet classification challenge)
http://www.robots.ox.ac.uk/~vgg/research/very_deep/
Scenes as textures
MIT Indoor dataset (67 classes)
Previous best: 70.8% with domain-specific CNNs [Zhou et al., NIPS 14]; FV-CNN: 81.0%, using an ImageNet-trained CNN with a texture model and no data augmentation [Cimpoi et al., CVPR 15]
The advantage of domain-specific training disappears with FV-CNN
Better CNNs (e.g., VGG-VD rather than AlexNet) lead to better performance, and FV pooling is better than FC pooling
Drawbacks of FV-CNN
The features in FV-CNN are not learned for the task: they are either handcrafted (e.g., SIFT) or come from a CNN trained with a different architecture (e.g., with fully-connected layers)
The GMM parameters are learned in an unsupervised manner
Can we learn the features for FV models? The gradient of the FV with respect to x is messy because of the soft-assignment weights:
$\eta_i(x) = \dfrac{\exp\left(-\frac{1}{2}(x-\mu_i)^T \Sigma_i^{-1}(x-\mu_i)\right)}{\sum_j \exp\left(-\frac{1}{2}(x-\mu_j)^T \Sigma_j^{-1}(x-\mu_j)\right)}$
Partial attempt: Sydorov et al., 14 learn the GMM parameters discriminatively
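To make the soft-assignment concrete, here is a minimal NumPy sketch (hypothetical names, assuming diagonal covariances and ignoring mixture weights) of the η_i(x) above; differentiating through this softmax-like ratio with respect to x and the GMM parameters is the awkward part of training FV models end-to-end.

```python
import numpy as np

def gmm_posteriors(x, mus, sigmas):
    """Soft-assignment weights eta_i(x) of a descriptor x (D,) to K Gaussian
    components with means mus (K, D) and diagonal covariances sigmas (K, D)."""
    # Squared Mahalanobis distance of x to each component.
    d2 = np.sum((x - mus) ** 2 / sigmas, axis=1)   # (K,)
    # Softmax over components, computed stably by subtracting the max.
    logits = -0.5 * d2
    logits -= logits.max()
    w = np.exp(logits)
    return w / w.sum()

# Toy example: 3 components in 2-D.
rng = np.random.default_rng(0)
eta = gmm_posteriors(rng.normal(size=2), rng.normal(size=(3, 2)), np.ones((3, 2)))
print(eta, eta.sum())   # weights sum to 1
```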
Bilinear models for classification
A bilinear model for classification is a four-tuple $\mathcal{B} = (f_A, f_B, \mathcal{P}, \mathcal{C})$: two feature extractors, a pooling function, and a classifier
A feature extractor is a function $f : \mathcal{L} \times \mathcal{I} \rightarrow \mathbb{R}^{c \times D}$ mapping a location l and an image I to a local feature (e.g., SIFT lies in $\mathbb{R}^{1 \times 128}$)
At each location the two features are combined by a matrix outer product:
$\text{bilinear}(l, I) = f_A(l, I)^T f_B(l, I)$
Pooling sums these over all locations to form the image descriptor, which is passed to the classifier $\mathcal{C}$ to predict the class:
$\phi(I) = \sum_{l} \text{bilinear}(l, I)$
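As a concrete illustration (a minimal NumPy sketch with hypothetical names, not the authors' implementation), the pooled bilinear descriptor is the sum over locations of the outer products of the two feature fields, which collapses to a single matrix product:

```python
import numpy as np

def bilinear_pool(feats_a, feats_b):
    """Pooled bilinear descriptor phi(I).

    feats_a: (L, cA) array, f_A evaluated at L locations of an image
    feats_b: (L, cB) array, f_B evaluated at the same L locations
    Returns a (cA * cB,) vector: the sum over locations of the outer products.
    """
    assert feats_a.shape[0] == feats_b.shape[0]
    phi = feats_a.T @ feats_b        # (cA, cB) = sum_l f_A(l)^T f_B(l)
    return phi.reshape(-1)

# Toy example: 100 locations with 64- and 32-dimensional local features.
rng = np.random.default_rng(0)
phi = bilinear_pool(rng.normal(size=(100, 64)), rng.normal(size=(100, 32)))
print(phi.shape)                     # (2048,)
```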
Fisher vector is a bilinear model
Fisher vector (FV) models [Perronnin et al., 10] locally encode statistics of the feature x weighted by the soft assignments η(x)
FV is a bilinear model with one factor given by the soft-assignment weights η(x) and the other by the local statistics of x
Bilinear CNN (B-CNN) model
Decouple f_A and f_B by using two separate CNNs
[Figure: an input image (e.g., a chestnut-sided warbler) is fed to CNN stream A and CNN stream B (convolutional + pooling layers); their outputs are combined into a pooled bilinear vector followed by a softmax classifier]
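A minimal PyTorch sketch of this two-stream architecture (an illustration under assumptions, not the authors' MatConvNet implementation; the backbones, feature sizes, and normalization details below are placeholders):

```python
import torch
import torch.nn as nn
from torchvision import models

class BilinearCNN(nn.Module):
    """Two CNN streams truncated at their last convolutional layers,
    combined by bilinear pooling and followed by a linear classifier."""
    def __init__(self, num_classes=200):
        super().__init__()
        # Stream A and stream B: truncated convolutional feature extractors.
        self.stream_a = models.vgg16(weights=None).features
        self.stream_b = models.vgg16(weights=None).features
        self.fc = nn.Linear(512 * 512, num_classes)

    def forward(self, x):
        fa = self.stream_a(x)                    # (N, 512, H, W)
        fb = self.stream_b(x)                    # (N, 512, H, W)
        n, ca, h, w = fa.shape
        cb = fb.shape[1]
        fa = fa.reshape(n, ca, h * w)
        fb = fb.reshape(n, cb, h * w)
        # Sum of per-location outer products == batched matrix product.
        phi = torch.bmm(fa, fb.transpose(1, 2))  # (N, ca, cb)
        phi = phi.reshape(n, ca * cb)
        # Signed square-root and l2 normalization of the pooled descriptor.
        phi = torch.sign(phi) * torch.sqrt(torch.abs(phi) + 1e-10)
        phi = nn.functional.normalize(phi)
        return self.fc(phi)                      # softmax applied in the loss

model = BilinearCNN()
logits = model(torch.randn(2, 3, 224, 224))
print(logits.shape)                              # torch.Size([2, 200])
```

Here both streams share the same backbone, giving a symmetric B-CNN (D,D)-style model; using different backbones for the two streams gives asymmetric models such as B-CNN (D,M).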
Bilinear CNN model
Back-propagation through the bilinear layer is easy, which allows end-to-end training
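To see why, write the bilinear combination at a location as $Z = A^T B$ with $A = f_A(l, I)$ and $B = f_B(l, I)$; then $\partial L/\partial A = B\,(\partial L/\partial Z)^T$ and $\partial L/\partial B = A\,(\partial L/\partial Z)$. A quick sanity check of these expressions against autograd (a sketch, not the authors' code):

```python
import torch

# One spatial location: A is (n, p), B is (n, q), bilinear output Z = A^T B.
A = torch.randn(10, 4, requires_grad=True)
B = torch.randn(10, 6, requires_grad=True)
Z = A.t() @ B                       # (p, q)
loss = (Z ** 2).sum()               # any scalar loss, just for the check
loss.backward()

G = 2 * Z.detach()                  # dL/dZ for this particular loss
print(torch.allclose(A.grad, B.detach() @ G.t()))   # dL/dA = B (dL/dZ)^T -> True
print(torch.allclose(B.grad, A.detach() @ G))       # dL/dB = A (dL/dZ)   -> True
```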
Experiments: Methods
Local features:
SIFT descriptors [Lowe, ICCV 99]
VGG-M (5 conv + 2 fc layers) [Chatfield et al., BMVC 14]
VGG-VD (16 conv + 2 fc layers) [Simonyan and Zisserman, ICLR 15]
Pooling architectures:
Fully-connected pooling (FC)
Fisher vector pooling (FV)
Bilinear pooling (B)
Notation examples:
FC-CNN (M): fully-connected pooling with VGG-M
FV-CNN (D): Fisher vector pooling with VGG-VD [Cimpoi et al., 15]
B-CNN (D,M): bilinear pooling with VGG-VD and VGG-M
Experiments: Datasets
CUB 200-2011: 200 species, 11,788 images
FGVC Aircraft: 100 variants, 10,000 images
Stanford Cars: 196 models, 16,185 images
The images contain small objects and background clutter
All models are trained with image labels only; no part or object annotations are used at training or test time
Results: Birds classification
Accuracy on the CUB 200-2011 dataset
Setting: only the image is provided at test time

Method         w/o ft   w/ ft
FV-SIFT         18.8      -
FC-CNN (M)      52.7     58.8
FC-CNN (D)      61.0     70.4
FV-CNN (M)      61.1     64.1
FV-CNN (D)      71.3     74.7
B-CNN (M,M)     72.0     78.1
B-CNN (D,M)     80.1     84.1
B-CNN (D,D)     80.1     84.0

Fine-tuning helps. Direct fine-tuning of FV-CNN is hard, so the fine-tuned FC-CNN models are used instead; this indirect fine-tuning also helps. B-CNN outperforms the multi-scale FV-CNN of [Cimpoi et al., CVPR 15].

SoTA: 84.1 [1], 82.0 [2], 73.9 [3], 75.7 [4]
[1] Spatial Transformer Networks, Jaderberg et al., NIPS 15
[2] Fine-Grained Recognition without Part Annotations, Krause et al., CVPR 15 (+ object bounding-boxes)
[3] Part-based R-CNNs, Zhang et al., ECCV 14 (+ part bounding-boxes)
[4] Pose-normalized CNNs, Branson et al., BMVC 14 (+ landmarks)
Results: Comparison with SoTA

Dataset         FC-CNN   FV-CNN   B-CNN   SoTA
CUB-200-2011     70.4     74.7     84.1    84.1
FGVC-Aircraft    74.1     77.6     84.1    80.7
Stanford-Cars    79.8     85.7     91.3    92.6
Model visualization
Visualizing the top activations of the fine-tuned B-CNN (D,M) on CUB
[Figure: image patches with the highest activations for units of the D-Net and the M-Net]
Most confused categories
[Figure: the most confused category pairs in CUB-200, FGVC Aircraft, and Stanford Cars]
Visualizing invariances
[Figure: "invariant images" — images $I'$ whose representations match that of a given image, $\phi(I') \approx \phi(I)$]
Visualizing categories
$\hat{I} = \arg\max_{I} \; \text{score}(c, I) + \text{prior}(I)$
i.e., maximize the class log-likelihood plus an image prior
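A minimal sketch of how such an image can be found by gradient ascent (an illustration only: the `model`, the total-variation prior, the optimizer, and all hyper-parameters below are assumptions, not the exact procedure from the paper):

```python
import torch

def synthesize(model, target_class, steps=200, lr=0.05, tv_weight=1e-4):
    """Gradient-ascent sketch: optimize an image to maximize the score of
    target_class under `model` plus a smoothness (total-variation) prior."""
    img = torch.zeros(1, 3, 224, 224, requires_grad=True)
    opt = torch.optim.Adam([img], lr=lr)
    for _ in range(steps):
        opt.zero_grad()
        score = model(img)[0, target_class]
        # Total-variation prior: penalize differences between neighboring pixels.
        tv = (img[..., 1:, :] - img[..., :-1, :]).abs().mean() \
           + (img[..., :, 1:] - img[..., :, :-1]).abs().mean()
        loss = -score + tv_weight * tv   # minimize the negative objective
        loss.backward()
        opt.step()
    return img.detach()
```

The same loop with a feature-matching objective (minimize ||φ(I') − φ(I)||² plus the prior) would yield images like the "invariant images" of the previous slide.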
Conclusion
Bilinear models:
achieve high accuracy on fine-grained tasks with limited data
allow end-to-end training of bag-of-visual-words models
are fast at test time: B-CNN (D,D) runs at ~10 images/second on a Tesla K40
Code and models: http://vis-www.cs.umass.edu/bcnn
References:
Fisher vector CNNs [Cimpoi et al., CVPR 15]
Bilinear CNNs [Lin et al., ICCV 15]
Visualizing and Understanding Deep Texture Representations [Lin and Maji, arXiv:1511.05197, Nov 2015]
Collaborators: Tsung-Yu Lin, Aruni RoyChowdhury