PASCAL VOC Classification: Local Features vs. Deep Features. Shuicheng YAN, NUS

PASCAL VOC
Why valuable? Multi-label, real scenarios!
Visual object recognition: object classification, object detection, object segmentation (person, horse, barrier, table, etc.)

PASCAL VOC
Visual object classes challenges
Main tasks: object classification, detection, and segmentation
Held yearly, 2007-2012
Tens of teams from universities and industry participated, including INRIA, Berkeley, Oxford, NEC, etc.
Became the dataset for visual object recognition research
Other tasks: person layout, action recognition, etc.
Data: 20 object classes, ~23,000 images with fine labeling

PASCAL VOC: 2010-2014
NUS-(PSL) team results
2014: classification MAP to 0.91
2012, 2011, 2010: winner of the object classification task (cls)
2012: winner of the object segmentation task (seg)
2010: honorable mention in the object detection task (det)

NUS-(PSL) architecture
Visual object recognition: a joint learning of cls-det-seg
Cls: global information
Det: local information
Seg: fine-detailed information

PASCAL VOC: 2010-2014
[Timeline chart of classification MAP by year and method]
Context-SVM LLC 2010: 73.8%
GHM 2011: 78.7%
Sub-category 2012: 82.2%
Deep feature 2013: 79.0%
Deep feature 2014: 83.2%
HCP 2014: 91.4%
Chart annotation: 25%

I. Spring of Local Features: 2010-2012

Pipeline
Feature representation: low-level features, feature encoding, feature pooling
  GHM [2]: Generalized Hierarchical Matching (GHM) for object-centric problems. Object-centric pooling.
Model learning: classifier learning
  Subcategory mining [1]: automatically mining the visual subcategories based on ambiguity modeling.
Context modeling
  Contextualization [3]: mutual contextualization for object classification and detection tasks. Great performance improvement.

1. Jian Dong, Qiang Chen, Jiashi Feng, Wei Xia, Zhongyang Huang, Shuicheng Yan. Subcategory-aware Object Classification. In CVPR'13.
2. Qiang Chen, Zheng Song, Yang Hua, Zhongyang Huang, Shuicheng Yan. Hierarchical Matching with Side Information for Image Classification. In CVPR'12.
3. Zheng Song*, Qiang Chen*, Zhongyang Huang, Yang Hua, and Shuicheng Yan. Contextualizing Object Detection and Classification. In CVPR'11.

Framework: NUS-PSL 2010
[Block diagram. Component labels: visual features (local feature extraction, feature coding); feature pooling (SPM, max pooling); kernel (linear kernel, nonlinear kernel); classification (SVM, kernel regression); detection results; post-processing (confidence refinement with exclusive prior). Example class: chair.]

Framework: NUS-PSL 2012
[Block diagram. Component labels: visual features (local feature extraction, feature coding, FK feature coding, flipping); feature pooling (SPM, generalized SPM, GHM, max pooling); kernel (linear kernel, nonlinear kernel); classification (SVM, kernel regression); detection results and subcategory detection results; post-processing (confidence refinement with exclusive prior). Numbered additions: (I) contextualized object classification and detection; (II) hierarchical matching; (III) subcategory mining. Example class: chair.]

Outline for VOC: 2010-2012
Context model: Contextualized Object Classification and Detection
Feature pooling: Generalized Hierarchical Matching/Pooling
Subcategory learning: Sub-Category Aware Detection & Classification

Contextualized Object Classification and Detection
Det: local patches with matched local shape/texture
Cls: global probabilities of containing objects (occurrence probability)
Can Det and Cls exchange information?

Observations
Object classification and detection are mutually complementary.
Each subject task serves as the context task for the other.
Context is not robust for the subject task, so use it only when necessary.
Scene/global-level information is not stable for object detection.
False alarms in object detection harm object classification.

Contextualized SVM - Formulation
Adaptive contextualization: sample-specific classification through adaptive embedding of context features.
Terms labeled on the slide: original classification hyperplane (dim n), context model (dim m), selection of ambiguous samples.
n: feature dim; m: context dim
Configurable model complexity: a low-rank constraint reduces the dimension from n x m to R x (n + m).
Easy to solve and kernelize if the selection of ambiguous samples is fixed.

Contextualized SVM - Formulation
Ambiguity modeling:
Define the ambiguity degree of a sample as the hinge loss of the subject task.
Learn the Ambiguity-guided Mixture Model (AMM) through EM by maximizing its objective.
The multi-mode ambiguity term is defined as the posterior of each mixture component r.
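The ambiguity degree above can be illustrated with a small sketch. It assumes the standard hinge loss with margin 1 and labels in {-1, +1}; the function name `ambiguity_degree` and the toy scores are hypothetical, and the slide does not give the exact form the team used.

```python
import numpy as np

def ambiguity_degree(labels, scores):
    """Per-sample hinge loss max(0, 1 - y * f(x)).

    labels: values in {-1, +1}; scores: classifier outputs f(x).
    Assumption: standard margin-1 hinge loss.
    """
    labels = np.asarray(labels, dtype=float)
    scores = np.asarray(scores, dtype=float)
    return np.maximum(0.0, 1.0 - labels * scores)

# Toy example: samples 2 and 4 fall inside the margin or on the wrong
# side, so they count as ambiguous (non-zero loss).
labels = [+1, +1, -1, -1]
scores = [2.0, 0.3, -1.5, 0.4]
ambiguity = ambiguity_degree(labels, scores)
print(ambiguity)
```

Samples with zero loss are confidently classified by the subject task alone; only samples with a positive value would need help from the context task.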

Iterative Co-training of Detection and Classification
Learn to detect (detection pipeline): detection feature plus context from the initial classification, then context from the 1st classification, each through a Context SVM.
Learn to classify (classification pipeline): classification feature plus context from the initial detection, then context from the 1st detection, each through a Context SVM.
Stages: (a) initial model, (b) 1st iteration of ContextSVM, (c) 2nd iteration of ContextSVM

Results Iterative contextualization: Mean AP values of 20 classes on VOC 2010 train/val

Results
Comparison with the state of the art on VOC 2010

Exemplar results Representative examples of the baseline (without contextualization) and Context-SVM for classification task.

Outline for VOC: 2010-2012
Context model: Contextualized Object Classification and Detection
Feature pooling: Generalized Hierarchical Matching/Pooling
Subcategory learning: Sub-Category Aware Detection & Classification

Generalized Hierarchical Matching/Pooling
Traditional pooling: SPM = approximate geometric constraint
Not optimal for object recognition due to misalignment
Figure panels: (a) images, (b) SPM partitions, (c) object confidence map partition

Hierarchical Pooling for Image Classification
Design a general form of hierarchical matching with side information.
Represent the image with a hierarchical structure.

Hierarchical Matching Kernel
The image similarity kernel is defined as the weighted sum over each cluster kernel.
General form of SPM, PMK, etc.
Flexible to integrate other side information.
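A minimal sketch of the "weighted sum over each cluster kernel" idea follows. It assumes each image is already represented by one pooled feature vector per cluster and uses a linear kernel within each cluster; both choices, and the name `hierarchical_matching_kernel`, are illustrative assumptions rather than the authors' exact formulation.

```python
import numpy as np

def hierarchical_matching_kernel(feats_a, feats_b, weights):
    """Weighted sum of per-cluster kernels between two images.

    feats_a, feats_b: (num_clusters, dim) pooled feature per cluster.
    weights: (num_clusters,) non-negative cluster weights.
    Assumption: a linear kernel (dot product) inside each cluster.
    """
    feats_a = np.asarray(feats_a, dtype=float)
    feats_b = np.asarray(feats_b, dtype=float)
    weights = np.asarray(weights, dtype=float)
    per_cluster = np.sum(feats_a * feats_b, axis=1)  # one kernel value per cluster
    return float(np.dot(weights, per_cluster))

# Two images, two clusters, 2-d pooled features.
image_a = [[1.0, 0.0], [0.0, 2.0]]
image_b = [[1.0, 1.0], [1.0, 1.0]]
similarity = hierarchical_matching_kernel(image_a, image_b, weights=[0.5, 0.25])
print(similarity)  # 0.5 * 1.0 + 0.25 * 2.0 = 1.0
```

With clusters defined by a fixed spatial grid this reduces to an SPM-style kernel; defining them from side information instead gives the generalized form the slide refers to.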

Generalized Hierarchical Matching/Pooling
Utilize side information to hierarchically pool local features (encoded local features vs. side information).
Figure panels: (a) side information and image; (b) hierarchical clustering by side information: level 1 (top), 2 (mid), 3 (bottom); (c) hierarchical structure representation; (d) matching/pooling within each cluster
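The pooling step can be sketched as follows, under simplifying assumptions: a single clustering level (the slide uses three), clusters obtained by thresholding a scalar side-information value per local feature, and max pooling inside each cluster. The function name `side_info_pooling` and the thresholds are hypothetical.

```python
import numpy as np

def side_info_pooling(codes, side_info, thresholds):
    """Max-pool encoded local features within side-information clusters.

    codes: (num_features, code_dim) encoded local features.
    side_info: (num_features,) scalar per feature, e.g. detection confidence
        at the feature's location.
    thresholds: increasing cut points; they define len(thresholds) + 1 clusters.
    Returns the concatenation of one pooled vector per cluster.
    """
    codes = np.asarray(codes, dtype=float)
    cluster_ids = np.digitize(side_info, thresholds)
    pooled = []
    for cluster in range(len(thresholds) + 1):
        members = codes[cluster_ids == cluster]
        if len(members) > 0:
            pooled.append(members.max(axis=0))
        else:
            pooled.append(np.zeros(codes.shape[1]))  # empty cluster
    return np.concatenate(pooled)

codes = [[0.1, 0.9], [0.8, 0.2], [0.4, 0.5]]
confidence = [0.2, 0.9, 0.7]  # low, high, high
representation = side_info_pooling(codes, confidence, thresholds=[0.5])
print(representation)  # low-confidence cluster first, then high-confidence
```

Repeating the same procedure with progressively finer thresholds and concatenating the results would give a multi-level representation analogous to the levels of an SPM pyramid.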

Side information design
Side information: detection confidence map
Images -> sliding-window process over sub-windows (shape model, appearance model) -> scores voted back to the image -> fusing object confidence maps
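One way to picture "score vote back to image" is the sketch below. It assumes each sliding window adds its detection score to every pixel it covers; summation as the voting rule, the `(x0, y0, x1, y1)` box convention, and the name `vote_confidence_map` are assumptions, since the slide does not specify them.

```python
import numpy as np

def vote_confidence_map(image_shape, windows, scores):
    """Accumulate window scores into a per-pixel confidence map.

    image_shape: (height, width).
    windows: list of (x0, y0, x1, y1), with x1 and y1 exclusive.
    scores: one detection score per window.
    Assumption: votes are combined by summation.
    """
    confidence = np.zeros(image_shape, dtype=float)
    for (x0, y0, x1, y1), score in zip(windows, scores):
        confidence[y0:y1, x0:x1] += score
    return confidence

windows = [(0, 0, 2, 2), (1, 1, 3, 3)]
scores = [1.0, 0.5]
conf_map = vote_confidence_map((4, 4), windows, scores)
print(conf_map)
```

Pixels covered by several high-scoring windows end up with the largest values, which is what makes the map usable as side information for pooling.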

Results VOC

Outline for VOC: 2010-2012
Context model: Contextualized Object Classification and Detection
Feature pooling: Generalized Hierarchical Matching/Pooling
Subcategory learning: Sub-Category Aware Detection & Classification

Sub-Category Mining
Ambiguity-guided subcategory mining (example classes: chair, sofa, diningtable)
Subcategory-aware object classification: subcategory models 1 to N, combined by a fusion model

Sub-Category Mining
Subcategory mining based on both similarity and ambiguity:
Calculate the sample intra-class similarity
Calculate the sample inter-class ambiguity
Detect dense subgraphs by the graph shift algorithm [1]
Subgraphs to subcategories
Figure labels: ambiguous categories (sofa, chair), instance affinity graph (similarity, ambiguity), graph shift visualization, detected subgraphs, corresponding subcategories
[1] Hairong Liu, Shuicheng Yan. Robust Graph Mode Seeking by Graph Shift. ICML 2010.

Sub-Category Aware Detection & Classification
Testing image -> feature extraction
Detection branch: sliding/selective window search -> detection model
Classification branch: local feature extraction and coding -> GHM pooling -> image representation -> classification model
Subcategory models 1 to N each produce a subcategory classification result and a subcategory detection result
Fusion model -> category-level result

Sub-Category Mining Result
Subcategories and outliers for bus and chair

Summary of VOC results (per-class AP and MAP)

                   2010              2011              2012
class           Our    Other's    Our    Other's    Our    Other's
                Best   Best       Best   Best       Best   Best
aeroplane       93.0   93.3       95.5   94.5       97.3   92.0
bicycle         79.0   77.0       81.1   82.6       84.2   74.2
bird            71.6   69.9       79.4   79.4       80.8   73.0
boat            77.8   77.2       82.5   80.7       85.3   77.5
bottle          54.3   53.7       58.2   57.8       60.8   54.3
bus             85.2   85.9       87.7   87.8       89.9   85.2
car             78.6   80.4       84.1   85.5       86.8   81.9
cat             78.8   79.4       83.1   83.9       89.3   76.4
chair           64.5   62.9       68.5   66.6       75.4   65.2
cow             64.0   66.2       74.7   74.2       77.8   63.2
diningtable     62.9   61.1       68.5   69.4       75.1   68.5
dog             69.6   71.1       76.4   75.2       83.0   68.9
horse           82.0   76.7       83.3   83.0       87.5   78.2
motorbike       84.4   81.7       87.5   88.1       90.1   81.0
person          91.6   90.2       92.8   93.5       95.0   91.6
pottedplant     48.6   53.3       56.5   58.7       57.8   55.9
sheep           65.4   66.3       77.7   75.5       79.2   69.4
sofa            59.6   58.0       67.0   66.3       73.4   65.4
train           89.4   87.5       91.2   90.0       94.5   86.7
tvmonitor       77.2   76.2       77.5   77.2       80.7   77.4
MAP             73.8   -          78.7   -          82.2   -

II. Spring of Deep Feature: 2013-2014

CNN: Single-label Image Classification
Definition: assign one and only one label from a pre-defined set to an image.
Explicit assumption: the object is roughly aligned.
AlexNet [1] made a great breakthrough in single-label classification in ILSVRC 2012 (with a 10% gain over the previous methods).
[1] A. Krizhevsky, I. Sutskever, G. Hinton. ImageNet Classification with Deep Convolutional Neural Networks. NIPS 2012.

CNN: Multi-label Image Classification
Definition: assign multiple labels from a pre-defined set to an image.
Single-label images vs. multi-label images
Challenges:
Foreground objects are not roughly aligned.
Interactions between different objects, e.g. partial visibility and occlusion.
A large number of training images are required.
The label space expands from n to 2^n.
Direct CNN training is unreasonable and unreliable!
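To make the "n to 2^n" point concrete for the 20 PASCAL VOC classes mentioned earlier in the talk, a quick calculation (variable names are illustrative):

```python
num_classes = 20  # PASCAL VOC object classes

single_label_outputs = num_classes      # exactly one class per image
multi_label_outputs = 2 ** num_classes  # any subset of classes per image

print(single_label_outputs)  # 20
print(multi_label_outputs)   # 1048576
```

Treating every label subset as its own output class is therefore impractical, which motivates predicting each label's presence from single-label hypotheses instead.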

Hypotheses-CNN-Pooling (HCP)
Our framework: hypotheses (assumed single-labeled) -> shared convolutional neural network -> scores for individual hypotheses -> max pooling -> c-dimensional output (e.g. dog, person, sheep)
[Network figure. Labels: kernel sizes 11, 5, 3, 3, 3; feature map sizes 55, 27, 13, 13, 13; channel counts 96, 256, 384, 384, 256 with max pooling; two 4096-dimensional layers.]

Characteristics of Our Framework
No ground-truth bounding box information is required for training on the multi-label image dataset.
The proposed HCP infrastructure is robust to noisy and/or redundant hypotheses.
No explicit hypothesis label is required for training.
The shared CNN can be well pre-trained with a large-scale single-label image dataset.
The HCP outputs are naturally multi-label prediction results.

Training of HCP
Hypotheses extraction
Initialization of HCP:
  Pre-training on a large-scale single-label image set, e.g. ImageNet
  Image-fine-tuning on a multi-label image set
Hypotheses-fine-tuning

Hypotheses Extraction
Criteria:
High object detection recall rate
Small number of hypotheses
High computational efficiency
Solution: BING [2] + boxes clustering
[2] M.-M. Cheng, J. Warrell, W.-Y. Lin, and P. H. S. Torr. BING: Binarized normed gradients for objectness estimation at 300fps. CVPR 2014.

Hypotheses Extraction

Initialization of HCP
Step 1: pre-training on single-label images (e.g. ImageNet)
Parameters transferring
Step 2: image-fine-tuning on multi-label images (e.g. Pascal VOC)

Hypotheses-fine-tuning

Experimental Results A subset from detection dataset of ILSVRC 2013 is used for BING training

Experimental Results Performance on PASCAL VOC 2007 New

Experimental Results Performance on PASCAL VOC 2012 New-1 New-2

Experimental Results Complementary Analysis: Hand-crafted features vs. Deep features

Experimental Results
One test sample from VOC2007
Generate hypotheses: 500 hypotheses for each image, 1~1.5s
Feed into the shared CNN: per-hypothesis predictions (person, horse, car, person)
Cross-hypothesis max-pooling: image-level labels person, horse, car
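The last step, cross-hypothesis max-pooling, can be sketched in a few lines. The score matrix below is made-up toy data standing in for the shared CNN's per-hypothesis outputs, and the function name is hypothetical.

```python
import numpy as np

def cross_hypothesis_max_pooling(hypothesis_scores):
    """Reduce per-hypothesis class scores to one score vector per image.

    hypothesis_scores: (num_hypotheses, num_classes) array, one row of
    class scores per hypothesis box.
    Returns a (num_classes,) vector: the maximum over hypotheses.
    """
    scores = np.asarray(hypothesis_scores, dtype=float)
    return scores.max(axis=0)

classes = ["person", "horse", "car"]
hypothesis_scores = [
    [0.9, 0.1, 0.0],  # hypothesis covering a person
    [0.2, 0.8, 0.1],  # hypothesis covering a horse
    [0.1, 0.0, 0.7],  # hypothesis covering a car
    [0.3, 0.2, 0.1],  # noisy background hypothesis
]
image_scores = cross_hypothesis_max_pooling(hypothesis_scores)
print(dict(zip(classes, image_scores.tolist())))
```

Because only the strongest response per class survives, a noisy or redundant hypothesis with low scores has no effect on the image-level prediction, which is consistent with the robustness claim made for the framework.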

New Result: Network in Network (NIN)
NIN: a CNN with non-linear filters, yet without the final fully-connected NN layer.
Intuitively less overfitting globally, and more discriminative locally [4].
With fewer parameters.
(Not finally used in our submission due to the surgery of our main team member, but very effective.)
Figure labels: CNN, NIN
[4] Ian J. Goodfellow, David Warde-Farley, Mehdi Mirza, Aaron C. Courville, Yoshua Bengio. Maxout Networks. ICML (3) 2013: 1319-1327.

Better Local Abstraction
Motivation: better local abstraction!
Each local patch is projected to its feature vector using a small network.
Cascaded Cross Channel Parametric Pooling (CCCP)
Lin, Min, Qiang Chen, and Shuicheng Yan. "Network In Network." ICLR 2014.

CCCP
Implemented as cascaded 1x1 convolutions
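The following NumPy sketch shows why a cascaded 1x1 convolution acts as a small network applied at every spatial location. The ReLU non-linearity, the random weights, and the layer widths (96 -> 96 -> 64) are assumptions chosen for illustration.

```python
import numpy as np

def conv1x1(feature_map, weights, bias):
    """1x1 convolution followed by ReLU.

    feature_map: (c_in, height, width)
    weights: (c_out, c_in); bias: (c_out,)
    Each output pixel is a linear mix of the input channels at that pixel.
    """
    mixed = np.einsum("oc,chw->ohw", weights, feature_map)
    mixed = mixed + bias[:, None, None]
    return np.maximum(mixed, 0.0)  # ReLU (assumed non-linearity)

rng = np.random.default_rng(0)
x = rng.standard_normal((96, 8, 8))  # output of an ordinary conv layer
w1, b1 = rng.standard_normal((96, 96)) * 0.1, np.zeros(96)
w2, b2 = rng.standard_normal((64, 96)) * 0.1, np.zeros(64)

hidden = conv1x1(x, w1, b1)       # first cccp layer
y = conv1x1(hidden, w2, b2)       # second cccp layer (cascade)
print(y.shape)  # (64, 8, 8)

# Same computation written as a per-pixel fully connected layer.
pixel = np.maximum(w1 @ x[:, 3, 4] + b1, 0.0)
print(np.allclose(hidden[:, 3, 4], pixel))  # True
```

The spatial size is unchanged; only the channel dimension is remixed, so the cascade behaves like a tiny multilayer perceptron shared across all positions.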

Global Average Pooling
Figure labels: CNN vs. NIN
The last layer gives a confidence map for each category.
Saves tons of parameters.
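Global average pooling itself is a one-line reduction: each category's confidence map is averaged over all spatial positions to give that category's score. A minimal sketch with toy 2x2 maps (the function name is illustrative):

```python
import numpy as np

def global_average_pooling(confidence_maps):
    """Average each per-category map over its spatial dimensions.

    confidence_maps: (num_categories, height, width)
    Returns a (num_categories,) score vector.
    """
    maps = np.asarray(confidence_maps, dtype=float)
    return maps.mean(axis=(1, 2))

maps = np.array([
    [[1.0, 3.0], [5.0, 7.0]],  # category 0
    [[0.0, 0.0], [0.0, 4.0]],  # category 1
])
category_scores = global_average_pooling(maps)
print(category_scores)  # [4. 1.]
```

The operation has no trainable weights, whereas a fully connected layer over the same maps needs one weight per input value and output unit; that is where the parameter saving comes from.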

NIN in ILSVRC 2014
To avoid hyper-parameter tuning, we put cccp layers directly on the convolution layers of ZFNet (Network in ZFNet).

With cccp layers:
layer    details
Conv1    stride = 2, kernel = 7x7, channel_out = 96
Cccp1    output = 96
Conv2    stride = 2, kernel = 5x5, channel_out = 256
Cccp2    output = 256
Conv3    stride = 1, kernel = 3x3, channel_out = 512
Cccp3    output = 256
Conv4    stride = 1, kernel = 3x3, channel_out = 1024
Cccp4    output = 512
Cccp5    output = 384
Conv5    stride = 1, kernel = 3x3, channel_out = 512
Cccp6    output = 256
Fc1      output = 4096
Fc2      output = 4096
Fc3      output = 1000

Without cccp layers:
layer    details
Conv1    stride = 2, kernel = 7x7, channel_out = 96
Conv2    stride = 2, kernel = 5x5, channel_out = 256
Conv3    stride = 1, kernel = 3x3, channel_out = 512
Conv4    stride = 1, kernel = 3x3, channel_out = 1024
Conv5    stride = 1, kernel = 3x3, channel_out = 512
Fc1      output = 4096
Fc2      output = 4096
Fc3      output = 1000

(10.91%) With 256xN training and 3 view test
Zeiler, Matthew D., and Rob Fergus. "Visualizing and understanding convolutional networks." Computer Vision - ECCV 2014. Springer International Publishing, 2014. 818-833.

NIN in HCP
Hypotheses -> shared NIN -> scores for individual hypotheses -> max pooling -> c-dimensional output (e.g. dog, person, sheep)

Comparison with the State of the Art on VOC 2012
Slide annotations: "From 81.7%", "< 90.3%"

Category   NUS-PSL[1]  PRE-1000C[2]  PRE-1512[2]  Chatfield et al.[3]  HCP-NIN  HCP-NIN+NUS-PSL
plane      97.3        93.5          94.6         96.8                 98.4     99.5
bicycle    84.2        78.4          82.9         82.5                 89.5     93.7
bird       80.8        87.7          88.2         91.5                 96.2     96.8
boat       85.3        80.9          84.1         88.1                 91.7     94.0
bottle     60.8        57.3          60.3         62.1                 72.5     77.7
bus        89.9        85.0          89.0         88.3                 91.1     95.3
car        86.8        81.6          84.4         81.9                 87.2     92.4
cat        89.3        89.4          90.7         94.8                 97.1     98.2
chair      75.4        66.9          72.1         70.3                 73.0     86.1
cow        77.8        73.8          86.8         80.2                 89.5     91.3
table      75.1        62.0          69.0         76.2                 75.1     83.5
dog        83.0        89.5          92.1         92.9                 96.3     97.3
horse      87.5        83.2          93.4         90.3                 93.0     96.8
motor      90.1        87.6          88.6         89.3                 90.5     96.3
person     95.0        95.8          96.1         95.2                 94.8     95.8
plant      57.8        61.4          64.3         57.4                 66.5     72.2
sheep      79.2        79.0          86.6         83.6                 90.3     91.5
sofa       73.4        54.3          62.3         66.4                 65.8     81.1
train      94.5        88.0          91.1         93.5                 95.6     97.6
tv         80.7        78.3          79.8         81.9                 82.0     90.0
MAP        82.2        78.7          82.8         83.2                 86.8     91.4

[1] S. Yan, J. Dong, Q. Chen, Z. Song, Y. Pan, W. Xia, H. Zhongyang, Y. Hua, and S. Shen. Generalized hierarchical matching for subcategory aware object classification. In Visual Recognition Challenge workshop, ECCV, 2012.
[2] M. Oquab, L. Bottou, I. Laptev, and J. Sivic. Learning and transferring mid-level image representations using convolutional neural networks. CVPR, 2014.
[3] K. Chatfield, K. Simonyan, A. Vedaldi, A. Zisserman. Return of the Devil in the Details: Delving Deep into Convolutional Nets. BMVC, 2014.

Demo Online Demo

Highest and Lowest Score Five Images for Each Class Aeroplane Bicycle Bird Boat Bottle

Highest and Lowest Score Five Images for Each Class Bus Car Cat Chair Cow

Highest and Lowest Score Five Images for Each Class Dining table Dog Horse Motorbike Person

Highest and Lowest Score Five Images for Each Class Pottedplant Sheep Sofa Train TV monitor

What's next?
Better solution for small/occluded objects?
More extra data?
Better local features?
Better deep features?
[Timeline chart of classification MAP by year and method]
LLC 2009: 66.5%
Context-SVM 2010: 73.8%
GHM 2011: 78.7%
Sub-category 2012: 82.2%
Deep feature 2014: 83.2%
HCP 2014: 91.4%
Chart annotation: 25%

Shuicheng YAN eleyans@nus.edu.sg