PASCAL VOC Classification: Local Features vs. Deep Features. Shuicheng YAN, NUS


PASCAL VOC
Why valuable? Multi-label, real scenarios!
Visual object recognition: object classification, object detection, object segmentation (example labels: person, horse, barrier, table, etc.)
PASCAL VOC: Visual Object Classes challenges
- Main tasks: object classification, detection and segmentation
- Other tasks: person layout, action recognition, etc.
- Held yearly, 2007 to 2012
- Tens of teams from universities and industry participated, including INRIA, Berkeley, Oxford, NEC, etc.
- Became the dataset for visual object recognition research
- Data: 20 object classes, ~23,000 images with fine labeling

PASCAL VOC: 2010-2014, NUS-PSL team results
- 2014: classification MAP raised to 0.91
- 2012, 2011, 2010: winner of the object classification task (cls)
- 2012: winner of the object segmentation task (seg)
- 2010: honorable mention in the object detection task (det)
NUS-PSL architecture for visual object recognition: a joint learning of cls-det-seg
- Cls: global information
- Det: local information
- Seg: fine-detailed information

PASCAL VOC: 2010-2014
[Figure: timeline of classification results]
- 2010: 73.8% (Context-SVM, LLC)
- 2011: 78.7% (GHM)
- 2012: 82.2% (Sub-category)
- 2013: 79.0% (Deep feature)
- 2014: 83.2% (Deep feature)
- 2014: 91.4% (HCP)
- Annotation on the figure: 25%

I. Spring of Local Features: 2010-2012

Pipeline
- Feature representation: low-level features, feature encoding, feature pooling
  GHM [2]: Generalized Hierarchical Matching (GHM) for object-centric problems; object-centric pooling.
- Model learning: classifier learning
  Subcategory mining [1]: automatically mining the visual subcategories based on ambiguity modeling.
- Context modeling
  Contextualization [3]: mutual contextualization for object classification and detection tasks; great performance improvement.
References:
[1] Jian Dong, Qiang Chen, Jiashi Feng, Wei Xia, Zhongyang Huang, Shuicheng Yan. Subcategory-aware Object Classification. In CVPR'13.
[2] Qiang Chen, Zheng Song, Yang Hua, Zhongyang Huang, Shuicheng Yan. Hierarchical Matching with Side Information for Image Classification. In CVPR'12.
[3] Zheng Song*, Qiang Chen*, Zhongyang Huang, Yang Hua, Shuicheng Yan. Contextualizing Object Detection and Classification. In CVPR'11.

Framework: NUS-PSL 2010
[Figure: framework diagram. Blocks shown: visual features (local feature extraction, feature coding, feature pooling with SPM and max pooling); kernel (linear kernel, nonlinear kernel); classification (SVM, kernel regression); detection results; post-processing (confidence refinement with exclusive prior). Example class: chair.]

Framework: NUS-PSL 2012
[Figure: the 2010 framework diagram extended with three new parts. I: contextualized object classification and detection. II: hierarchical matching (generalized SPM, GHM) in feature pooling. III: subcategory mining, with subcategory detection results. Other blocks shown: visual features (local feature extraction, feature coding including FK, flipping); kernel (linear and nonlinear kernels); classification (SVM, kernel regression); detection results; max pooling; post-processing (confidence refinement with exclusive prior). Example class: chair.]

Outline for VOC: 2010-2012
- Context model: Contextualized Object Classification and Detection
- Feature pooling: Generalized Hierarchical Matching/Pooling
- Subcategory learning: Sub-Category Aware Detection & Classification

Contextualized Object Classification and Detection
- Det: local patches with matched local shape/texture
- Cls: global probabilities of containing objects (occurrence probability)
- Can Det and Cls exchange information?

Observations
- Object classification and detection are mutually complementary.
- Each subject task serves as the context task for the other.
- Context is not robust for the subject task, so use it only when necessary.
- Scene/global-level information is not stable for object detection.
- False alarms in object detection harm object classification.
[Figure: example image labeled "person"]

Contextualized SVM - Formulation
Adaptive contextualization:
- Sample-specific classification: adaptive embedding of context features into the original classification hyperplane.
- n: feature dim; m: context dim.
- Configurable model complexity: a low-rank constraint reduces dim n x m to R x (n + m).
- Easy to solve and kernelize if the selection of ambiguous samples is fixed.
[Figure labels: original classification hyperplane (dim n); context model (dim m); selection of ambiguous samples]
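The slide's equations did not survive extraction, so the following is only a minimal sketch of one possible reading of "adaptive embedding" with a low-rank constraint: an n x m embedding matrix is factorized into two rank-R factors, and the context vector shifts the original hyperplane per sample. The names `w0`, `U`, `V` and `contextualized_score`, and all sizes, are hypothetical and not taken from the slides.

```python
import numpy as np

rng = np.random.default_rng(0)
n, m, R = 1000, 20, 5          # feature dim, context dim, rank (illustrative values)

w0 = rng.normal(size=n)        # original classification hyperplane (dim n)
U = rng.normal(size=(n, R))    # hypothetical low-rank factor on the feature side
V = rng.normal(size=(m, R))    # hypothetical low-rank factor on the context side

def contextualized_score(x, c):
    """Score feature x (dim n) with a hyperplane shifted by context c (dim m)."""
    w = w0 + U @ (V.T @ c)     # sample-specific hyperplane (assumed form)
    return float(w @ x)

full_params = n * m            # parameters of an unconstrained n x m embedding
low_rank_params = R * (n + m)  # parameters of the factorized embedding
print(full_params, low_rank_params)
```

With these illustrative sizes the factorized embedding needs R x (n + m) = 5,100 parameters instead of n x m = 20,000, which is the kind of saving the "configurable model complexity" bullet refers to.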

Contextualized SVM - Formulation
Ambiguity modeling:
- Define the ambiguity degree of a sample as the hinge loss of the subject task.
- Learn the Ambiguity-guided Mixture Model (AMM) through EM to maximize the objective shown on the slide.
- The multi-mode ambiguity term is defined as the posterior of each mixture component r.
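The ambiguity degree described above can be illustrated with a short sketch. It assumes the standard hinge loss max(0, 1 - y f(x)) with margin 1 and labels in {-1, +1}; the slide does not state the exact form, and the function name `ambiguity_degree` is hypothetical.

```python
import numpy as np

def ambiguity_degree(scores, labels):
    """Per-sample hinge loss max(0, 1 - y * f(x)); labels are in {-1, +1}."""
    scores = np.asarray(scores, dtype=float)
    labels = np.asarray(labels, dtype=float)
    return np.maximum(0.0, 1.0 - labels * scores)

# Three positive samples: confidently correct, inside the margin, misclassified.
amb = ambiguity_degree([2.0, 0.3, -0.5], [1, 1, 1])
print(amb)
```

Under this assumption, a confidently classified sample has ambiguity 0, while samples inside the margin or on the wrong side of the hyperplane get increasingly large values, so context would be applied mainly to the latter.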

Iterative Co-training of Detection and Classification
[Figure: detection and classification pipelines trained in alternation with Context SVM]
- Learn to detect (detection pipeline): detection feature plus context from the initial classification, then context from the 1st classification.
- Learn to classify (classification pipeline): classification feature plus context from the initial detection, then context from the 1st detection.
- Stages: a) initial model; b) 1st iteration of ContextSVM; c) 2nd iteration of ContextSVM.

Results Iterative contextualization: Mean AP values of 20 classes on VOC 2010 train/val

Results Comparison with the state of the art on VOC 2010

Exemplar results Representative examples of the baseline (without contextualization) and Context-SVM for classification task.

Generalized Hierarchical Matching/Pooling
- Traditional pooling: SPM = approximate geometric constraint
- Not optimal for object recognition due to misalignment
[Figure: (a) images; (b) SPM partitions; (c) object confidence map partition]

Hierarchical Pooling for Image Classification Design a general form of hierarchical matching with side information. Represent image with hierarchical structure

Hierarchical Matching Kernel
- The image similarity kernel is defined as the weighted sum of the cluster kernels.
- A general form of SPM, PMK, etc.
- Flexible enough to integrate other side information.
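The weighted-sum definition above can be sketched as follows. This is a minimal illustration, not the authors' implementation: the histogram-intersection cluster kernel, the weights, and the function names are all assumptions.

```python
import numpy as np

def cluster_kernel(a, b):
    """Histogram-intersection kernel between two pooled cluster features (assumed choice)."""
    return float(np.minimum(a, b).sum())

def hierarchical_matching_kernel(feats_x, feats_y, weights):
    """Weighted sum of per-cluster kernels; feats_* hold one pooled vector per cluster."""
    return sum(w * cluster_kernel(a, b)
               for w, a, b in zip(weights, feats_x, feats_y))

# Two images, each described by two clusters of 2-d pooled features.
feats_x = [np.array([1.0, 0.0]), np.array([0.5, 0.5])]
feats_y = [np.array([0.5, 0.5]), np.array([0.5, 0.0])]
k = hierarchical_matching_kernel(feats_x, feats_y, weights=[0.5, 1.0])
print(k)
```

If the clusters are the cells of a spatial pyramid and the weights depend on the pyramid level, this form reduces to an SPM-style kernel, which is the sense in which it generalizes SPM and PMK.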

Generalized Hierarchical Matching/Pooling
Encoded local features vs. side information: utilize side information to hierarchically pool local features.
[Figure: (a) side information and image; (b) hierarchical clustering by side information, level 1 (top), 2 (mid), 3 (bottom); (c) hierarchical structure representation; (d) matching/pooling within each cluster]
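Pooling within clusters defined by side information might look like the sketch below. The clustering rule (splitting features into 2^(l-1) equal-size groups by side-information rank at level l) and the use of max pooling are assumptions made for illustration; the slide does not specify them, and `ghm_style_pooling` is a hypothetical name.

```python
import numpy as np

def ghm_style_pooling(codes, side, levels=3):
    """Max-pool encoded local features within clusters defined by side information.

    codes: (N, D) encoded local features.
    side:  (N,) side-information value per feature, e.g. detection confidence.
    Level l (1-based) splits the features into 2**(l-1) groups by side-information
    rank (an assumed rule); each group is max-pooled and the results concatenated.
    """
    codes = np.asarray(codes, dtype=float)
    order = np.argsort(side)                    # rank features by side information
    pooled = []
    for level in range(levels):
        for group in np.array_split(order, 2 ** level):
            if group.size:
                pooled.append(codes[group].max(axis=0))
            else:                               # empty cluster: contribute zeros
                pooled.append(np.zeros(codes.shape[1]))
    return np.concatenate(pooled)

codes = [[1, 0], [0, 2], [3, 1], [0, 0]]
side = [0.1, 0.9, 0.4, 0.7]
print(ghm_style_pooling(codes, side, levels=2))
```

Replacing the side-information ranking with spatial grid cells would recover ordinary SPM pooling, so the same routine covers both cases.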

Side information design
Side information: detection confidence map
[Figure: images pass through a sliding-window process; a shape model and an appearance model score each sub-window; the scores are voted back to the image; the resulting object confidence maps are fused]

Results VOC

Sub-Category Mining
[Figure: ambiguity-guided subcategory mining on example classes (chair, sofa, diningtable), followed by subcategory-aware object classification in which subcategory models 1 to N feed a fusion model]

Sub-Category Mining
Subcategory mining based on both similarity and ambiguity:
- Calculate the sample intra-class similarity.
- Calculate the sample inter-class ambiguity.
- Detect dense subgraphs with the graph shift algorithm [1].
- Map subgraphs to subcategories.
[Figure: ambiguous categories (sofa, chair); instance affinity graph built from similarity and ambiguity; subgraphs detected by graph shift; visualization of the corresponding subcategories]
[1] Hairong Liu, Shuicheng Yan. Robust Graph Mode Seeking by Graph Shift. ICML 2010.

Sub-Category Aware Detection & Classification
[Figure: testing pipeline. Blocks shown: testing image; feature extraction; sliding/selective window search feeding the detection model; local feature extraction and coding, GHM pooling and image representation feeding the classification model; subcategory models 1 to N, each producing a subcategory classification result and a subcategory detection result; fusion model; category-level result]

Sub-Category Mining Result
[Figure: mined subcategories for bus and chair, plus outliers]

Summary of VOC results
(Ours = Our Best; Others = Others' Best; "-" = not given on the slide)
Class         2010 Ours  2010 Others  2011 Ours  2011 Others  2012 Ours  2012 Others
aeroplane     93         93.3         95.5       94.5         97.3       92
bicycle       79         77           81.1       82.6         84.2       74.2
bird          71.6       69.9         79.4       79.4         80.8       73
boat          77.8       77.2         82.5       80.7         85.3       77.5
bottle        54.3       53.7         58.2       57.8         60.8       54.3
bus           85.2       85.9         87.7       87.8         89.9       85.2
car           78.6       80.4         84.1       85.5         86.8       81.9
cat           78.8       79.4         83.1       83.9         89.3       76.4
chair         64.5       62.9         68.5       66.6         75.4       65.2
cow           64         66.2         74.7       74.2         77.8       63.2
diningtable   62.9       61.1         68.5       69.4         75.1       68.5
dog           69.6       71.1         76.4       75.2         83         68.9
horse         82         76.7         83.3       83           87.5       78.2
motorbike     84.4       81.7         87.5       88.1         90.1       81
person        91.6       90.2         92.8       93.5         95         91.6
pottedplant   48.6       53.3         56.5       58.7         57.8       55.9
sheep         65.4       66.3         77.7       75.5         79.2       69.4
sofa          59.6       58           67         66.3         73.4       65.4
train         89.4       87.5         91.2       90           94.5       86.7
tvmonitor     77.2       76.2         77.5       77.2         80.7       77.4
MAP           73.8       -            78.7       -            82.2       -
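As a quick sanity check on how the MAP row relates to the per-class numbers, the sketch below averages the twenty 2012 "Our Best" values from the slide. It assumes MAP here is the plain unweighted mean over the 20 classes; the variable names are illustrative.

```python
# Per-class values for 2012 "Our Best", in the slide's class order
# (aeroplane ... tvmonitor).
ours_2012 = [97.3, 84.2, 80.8, 85.3, 60.8, 89.9, 86.8, 89.3, 75.4, 77.8,
             75.1, 83.0, 87.5, 90.1, 95.0, 57.8, 79.2, 73.4, 94.5, 80.7]

# Assumed definition: MAP is the unweighted mean of the per-class values.
mean_ap = sum(ours_2012) / len(ours_2012)
print(f"{mean_ap:.1f}")
```

The mean comes out at about 82.2, matching the MAP reported for 2012.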

II. Spring of Deep Feature: 2013-2014

CNN: Single-label Image Classification
- Definition: assign one and only one label from a pre-defined set to an image.
- Explicit assumption: the object is roughly aligned.
- AlexNet [1] made a great breakthrough in single-label classification in ILSVRC 2012 (with a 10% gain over the previous methods).
[1] A. Krizhevsky, I. Sutskever, G. Hinton. ImageNet Classification with Deep Convolutional Neural Networks. NIPS 2012.

CNN: Multi-label Image Classification
- Definition: assign multiple labels from a pre-defined set to an image.
[Figure: single-label images vs. multi-label images]
Challenges:
- Foreground objects are not roughly aligned.
- Interactions between different objects, e.g. partial visibility and occlusion.
- A large number of training images is required: the label space expands from n to 2^n.
Training a CNN directly is unreasonable and unreliable!

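Hypotheses-CNN-Pooling (HCP)
Our framework:
- Hypotheses assumption: each hypothesis is single-labeled.
- All hypotheses are fed into a shared convolutional neural network.
- Scores for the individual hypotheses (e.g. dog, person, sheep) are fused by max pooling.
[Figure: shared CNN diagram. Labels shown: kernel sizes 11, 5, 3; feature-map sizes 55, 27, 13; channel widths 96, 256, 384, 384, 256 with max-pooling stages; two 4096-d layers; output of dimension c]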
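The fusion step in the framework can be sketched in a few lines. This is a minimal illustration assuming the shared network has already produced one score vector of length c per hypothesis; the function name and the example scores are hypothetical.

```python
import numpy as np

def cross_hypothesis_max_pooling(hypothesis_scores):
    """Fuse per-hypothesis class scores of shape (H, c) into one image-level vector (c,)."""
    return np.asarray(hypothesis_scores, dtype=float).max(axis=0)

# Three hypotheses scored against c = 3 classes: dog, person, sheep.
scores = [[0.90, 0.05, 0.05],   # hypothesis covering a dog
          [0.10, 0.85, 0.05],   # hypothesis covering a person
          [0.30, 0.40, 0.30]]   # noisy / background hypothesis
image_scores = cross_hypothesis_max_pooling(scores)
print(image_scores)
```

Because only the maximum response per class survives, a noisy or redundant hypothesis does not lower the image-level score, and several classes can score highly at once, which gives a multi-label prediction.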

Characteristics of Our Framework
- No ground-truth bounding box information is required for training on the multi-label image dataset.
- The proposed HCP infrastructure is robust to noisy and/or redundant hypotheses.
- No explicit hypothesis label is required for training.
- The shared CNN can be well pre-trained with a large-scale single-label image dataset.
- The HCP outputs are naturally multi-label prediction results.

Training of HCP
- Hypotheses extraction
- Initialization of HCP: pre-training on a large-scale single-label image set (e.g. ImageNet), then image-fine-tuning on a multi-label image set
- Hypotheses-fine-tuning

Hypotheses Extraction
Criteria:
- High object detection recall rate
- Small number of hypotheses
- High computational efficiency
Solution: BING [2] + boxes clustering
[2] M.-M. Cheng, J. Warrell, W.-Y. Lin, and P. H. S. Torr. BING: Binarized normed gradients for objectness estimation at 300fps. CVPR 2014.

Hypotheses Extraction

Initialization of HCP
- Step 1: pre-training on single-label images (e.g. ImageNet)
- Parameters transferring
- Step 2: image-fine-tuning on multi-label images (e.g. Pascal VOC)

Hypotheses-fine-tuning

Experimental Results A subset from detection dataset of ILSVRC 2013 is used for BING training

Experimental Results Performance on PASCAL VOC 2007 New

Experimental Results Performance on PASCAL VOC 2012 New-1 New-2

Experimental Results Complementary Analysis: Hand-crafted features vs. Deep features

Experimental Results
One test sample from VOC2007; 500 hypotheses for each image, 1~1.5s
[Figure: generate hypotheses; feed them into the shared CNN, giving per-hypothesis predictions (person, horse, car, person); cross-hypothesis max-pooling gives the image-level labels person, horse, car]

New Result: Network in Network (NIN)
- NIN: a CNN with non-linear filters, yet without the final fully-connected NN layer
- Intuitively less overfitting globally, and more discriminative locally
- With fewer parameters [4]
- Not used in our final submission due to the surgery of our main team member, but very effective
[Figure: CNN vs. NIN]
[4] Ian J. Goodfellow, David Warde-Farley, Mehdi Mirza, Aaron C. Courville, Yoshua Bengio. Maxout Networks. ICML (3) 2013: 1319-1327.

Better Local Abstraction
- Motivation: better local abstraction!
- Each local patch is projected to its feature vector using a small network.
- Cascaded Cross Channel Parametric Pooling (CCCP)
Min Lin, Qiang Chen, and Shuicheng Yan. "Network In Network." ICLR 2014.

CCCP: implemented as cascaded 1x1 convolutions
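The equivalence between cascaded 1x1 convolutions and a small per-pixel network can be checked with a short sketch. The layer sizes are illustrative, and `cccp_layer` is a hypothetical helper; ReLU is used as the activation, as in the Network In Network paper.

```python
import numpy as np

def cccp_layer(x, weight, bias):
    """1x1 convolution followed by ReLU.

    x: (C_in, H, W) feature map; weight: (C_out, C_in); bias: (C_out,).
    """
    out = np.einsum("oc,chw->ohw", weight, x) + bias[:, None, None]
    return np.maximum(out, 0.0)

rng = np.random.default_rng(0)
x = rng.normal(size=(8, 5, 5))                      # illustrative input feature map
w1, b1 = rng.normal(size=(16, 8)), rng.normal(size=16)
w2, b2 = rng.normal(size=(4, 16)), rng.normal(size=4)

y = cccp_layer(cccp_layer(x, w1, b1), w2, b2)       # two cascaded cccp layers

# The same computation written as a tiny MLP on one pixel's channel vector.
pixel = x[:, 2, 3]
mlp_out = np.maximum(w2 @ np.maximum(w1 @ pixel + b1, 0.0) + b2, 0.0)
print(np.allclose(y[:, 2, 3], mlp_out))
```

Each spatial position is processed independently with shared weights, so the cascade acts as the "small network" that projects a local patch's features to a more abstract vector.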

Global Average Pooling
- The last layer outputs a confidence map for each category.
- Saves tons of parameters.
[Figure: CNN vs. NIN]
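Global average pooling over per-category confidence maps can be sketched as below. This is a minimal illustration with made-up map values; the function name is hypothetical.

```python
import numpy as np

def global_average_pooling(confidence_maps):
    """Average each category's (H, W) confidence map into a single score.

    confidence_maps: array of shape (C, H, W); returns an array of shape (C,).
    """
    return np.asarray(confidence_maps, dtype=float).mean(axis=(1, 2))

# Two categories, each with a 2x2 confidence map.
maps = np.arange(8).reshape(2, 2, 2)
category_scores = global_average_pooling(maps)
print(category_scores)
```

The pooling step has no trainable weights at all, whereas a fully-connected layer on the same maps would need one weight per (input element, output unit) pair; this is where the parameter saving comes from.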

NIN in ILSVRC 2014
To avoid hyper-parameter tuning, we put cccp layers directly on the convolution layers of ZFNet (Network in ZFNet).
Network in ZFNet:
Layer   Details
Conv1   stride = 2, kernel = 7x7, channel_out = 96
Cccp1   output = 96
Conv2   stride = 2, kernel = 5x5, channel_out = 256
Cccp2   output = 256
Conv3   stride = 1, kernel = 3x3, channel_out = 512
Cccp3   output = 256
Conv4   stride = 1, kernel = 3x3, channel_out = 1024
Cccp4   output = 512
Cccp5   output = 384
Conv5   stride = 1, kernel = 3x3, channel_out = 512
Cccp6   output = 256
Fc1     output = 4096
Fc2     output = 4096
Fc3     output = 1000
ZFNet:
Layer   Details
Conv1   stride = 2, kernel = 7x7, channel_out = 96
Conv2   stride = 2, kernel = 5x5, channel_out = 256
Conv3   stride = 1, kernel = 3x3, channel_out = 512
Conv4   stride = 1, kernel = 3x3, channel_out = 1024
Conv5   stride = 1, kernel = 3x3, channel_out = 512
Fc1     output = 4096
Fc2     output = 4096
Fc3     output = 1000
(10.91%) With 256xN training and 3 view test
Matthew D. Zeiler and Rob Fergus. "Visualizing and understanding convolutional networks." Computer Vision - ECCV 2014. Springer International Publishing, 2014. 818-833.

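NIN in HCP
[Figure: the HCP framework with a shared NIN in place of the shared CNN; scores for individual hypotheses (e.g. dog, person, sheep) are fused by max pooling into an output of dimension c]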

Comparison with the state of the art on VOC 2012
Category  NUS-PSL[1]  PRE-1000C[2]  PRE-1512[2]  Chatfield et al.[3]  HCP-NIN  HCP-NIN+NUS-PSL
plane     97.3        93.5          94.6         96.8                 98.4     99.5
bicycle   84.2        78.4          82.9         82.5                 89.5     93.7
bird      80.8        87.7          88.2         91.5                 96.2     96.8
boat      85.3        80.9          84.1         88.1                 91.7     94.0
bottle    60.8        57.3          60.3         62.1                 72.5     77.7
bus       89.9        85.0          89.0         88.3                 91.1     95.3
car       86.8        81.6          84.4         81.9                 87.2     92.4
cat       89.3        89.4          90.7         94.8                 97.1     98.2
chair     75.4        66.9          72.1         70.3                 73.0     86.1
cow       77.8        73.8          86.8         80.2                 89.5     91.3
table     75.1        62.0          69.0         76.2                 75.1     83.5
dog       83.0        89.5          92.1         92.9                 96.3     97.3
horse     87.5        83.2          93.4         90.3                 93.0     96.8
motor     90.1        87.6          88.6         89.3                 90.5     96.3
person    95.0        95.8          96.1         95.2                 94.8     95.8
plant     57.8        61.4          64.3         57.4                 66.5     72.2
sheep     79.2        79.0          86.6         83.6                 90.3     91.5
sofa      73.4        54.3          62.3         66.4                 65.8     81.1
train     94.5        88.0          91.1         93.5                 95.6     97.6
tv        80.7        78.3          79.8         81.9                 82.0     90.0
MAP       82.2        78.7          82.8         83.2                 86.8     91.4
Annotations on the slide: "From 81.7%", "< 90.3%"
[1] S. Yan, J. Dong, Q. Chen, Z. Song, Y. Pan, W. Xia, H. Zhongyang, Y. Hua, and S. Shen. Generalized hierarchical matching for subcategory aware object classification. In Visual Recognition Challenge workshop, ECCV, 2012.
[2] M. Oquab, L. Bottou, I. Laptev, and J. Sivic. Learning and transferring mid-level image representations using convolutional neural networks. CVPR, 2014.
[3] K. Chatfield, K. Simonyan, A. Vedaldi, A. Zisserman. Return of the Devil in the Details: Delving Deep into Convolutional Nets. BMVC, 2014.

Demo Online Demo

Highest and Lowest Score Five Images for Each Class Aeroplane Bicycle Bird Boat Bottle

Highest and Lowest Score Five Images for Each Class Bus Car Cat Chair Cow

Highest and Lowest Score Five Images for Each Class Dining table Dog Horse Motorbike Person

Highest and Lowest Score Five Images for Each Class Pottedplant Sheep Sofa Train TV monitor

What's next?
- Better solution for small/occluded objects?
- More extra data?
- Better local features?
- Better deep features?
[Figure: timeline of classification results]
- 2009: 66.5% (LLC)
- 2010: 73.8% (Context-SVM)
- 2011: 78.7% (GHM)
- 2012: 82.2% (Sub-category)
- 2014: 83.2% (Deep feature)
- 2014: 91.4% (HCP)
- Annotation on the figure: 25%

Shuicheng YAN eleyans@nus.edu.sg