Rich feature hierarchies for accurate object detection and semantic segmentation

Ross Girshick, Jeff Donahue, Trevor Darrell, Jitendra Malik. Presented by Pandian Raju and Jialin Wu

Last class: SGD for document recognition (LeCun et al. 1998); ImageNet classification (Krizhevsky et al. 2012). What problems did they solve?

Image Classification ImageNet sample image Image credit: http://rocknrollnerd.github.io/ml/2015/05/27/leopard-sofa.html

Image Classification ImageNet sample image Image credit: https://www.pinterest.com/explore/facts-about-tigers

Image Classification ImageNet sample image Image credit: https://www.youtube.com/watch?v=a1ofpnxiwvm

Image Classification MS COCO sample image Image credit: https://github.com/pdollar/coco/blob/master/pythonapi/pycocodemo.ipynb

Object Detection MS COCO sample image Image credit: https://github.com/pdollar/coco/blob/master/pythonapi/pycocodemo.ipynb

Semantic segmentation Image credit: http://blog.qure.ai/notes/semantic-segmentation-deep-learning-review

Different visual recognition tasks: image classification, object detection, semantic segmentation

History

PASCAL VOC Detection history Image credit: Ross Girshick

DPM - Deformable Parts Model Image credit: http://www.embeddedvisionsystems.it/solutions/ip2lib/117-dpm

Feature learning with CNNs: Fukushima 1980 (Neocognitron); LeCun et al. 1998 (SGD for document recognition); Krizhevsky et al. 2012 (ImageNet classification, AlexNet)

Brute force

Brute force: forget it!

Regions: Gu et al. 2009, Recognition using regions

R-CNN high-level flow: (1) category-independent region proposals; (2) extract a feature vector from each region using a CNN; (3) classify each region using a linear SVM per class

Region proposal: about 2000 region proposals per image, from the selective search algorithm (Uijlings et al. 2012). Selective search combines ideas from exhaustive search and segmentation.

Feature vector using CNN: 5 convolutional layers, 2 fully connected layers. Output: 4096-dimensional feature vector

Linear SVMs: a sample region passes through the CNN, then through one SVM per class. SVM for Cat: No. SVM for Dog: Yes. SVM for Lion: No.
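The per-class decision on this slide can be sketched in a few lines. This is only an illustration: the function name `classify_region`, the toy 4-dimensional feature (standing in for the 4096-dimensional CNN output), and the weight values are all made up for the example and are not taken from the R-CNN code.

```python
import numpy as np

def classify_region(feature, weights, biases, class_names):
    """Score one region's feature vector with one linear SVM per class.

    weights has shape (num_classes, feature_dim); a positive score means
    that class's SVM says "Yes" for this region.
    """
    scores = weights @ feature + biases
    return {name: (float(s), bool(s > 0)) for name, s in zip(class_names, scores)}

# Toy values (hypothetical): a 4-d feature instead of the real 4096-d one.
class_names = ["cat", "dog", "lion"]
feature = np.array([1.0, 0.0, 2.0, 0.0])
weights = np.array([
    [-1.0, 0.0, 0.0, 0.0],   # "cat" SVM
    [0.5, 0.0, 0.25, 0.0],   # "dog" SVM
    [0.0, 1.0, -1.0, 0.0],   # "lion" SVM
])
biases = np.array([0.0, 0.0, 0.5])

decisions = classify_region(feature, weights, biases, class_names)
for name, (score, is_positive) in decisions.items():
    print(f"SVM for {name}: {'Yes' if is_positive else 'No'} (score {score:+.2f})")
```

Each SVM is independent, so adding a class only adds one more row of weights.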

Challenges. Localization: region proposals and bounding box regression. Limited training set: ?

Challenges. Localization: region proposals and bounding box regression. Limited training set: supervised pre-training with fine-tuning.

Training: supervised pre-training of the CNN on ILSVRC (2012); fine-tuning (SGD) on regions from PASCAL VOC; per-class SVMs (Cat, Dog, Car) trained on the CNN features

Testing Intersection-over-Union (IoU) Image credit: http://www.pyimagesearch.com/2016/11/07/intersection-over-union-iou-for-object-detection/
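As a minimal sketch of the IoU measure named on this slide, assuming boxes are given as `(x1, y1, x2, y2)` corner coordinates (a convention chosen here for illustration; the slide does not fix one):

```python
def iou(box_a, box_b):
    """Intersection-over-Union of two axis-aligned boxes (x1, y1, x2, y2)."""
    # Corners of the overlapping rectangle, if any.
    ix1 = max(box_a[0], box_b[0])
    iy1 = max(box_a[1], box_b[1])
    ix2 = min(box_a[2], box_b[2])
    iy2 = min(box_a[3], box_b[3])

    # Clamp to zero when the boxes do not overlap.
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)

    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0

print(iou((0, 0, 2, 2), (1, 1, 3, 3)))  # overlap area 1, union area 7
```

Identical boxes give 1.0 and disjoint boxes give 0.0.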

Testing: ignore unwanted regions (using non-maximum suppression). SVM scores: 0.87, 0.72
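A greedy form of non-maximum suppression can be sketched as follows. The box coordinates, the third score, and the `iou_threshold` value of 0.3 are assumptions made for the example; only the scores 0.87 and 0.72 come from the slide.

```python
def _iou(a, b):
    """IoU of two boxes given as (x1, y1, x2, y2)."""
    iw = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    ih = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = iw * ih
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / union if union > 0 else 0.0

def nms(boxes, scores, iou_threshold=0.3):
    """Greedy NMS: keep a box unless it overlaps a higher-scoring kept box."""
    order = sorted(range(len(boxes)), key=lambda i: scores[i], reverse=True)
    keep = []
    for i in order:
        if all(_iou(boxes[i], boxes[j]) <= iou_threshold for j in keep):
            keep.append(i)
    return keep

# Two heavily overlapping detections of one object, plus a separate one.
boxes = [(10, 10, 110, 110), (20, 20, 120, 120), (200, 200, 300, 300)]
scores = [0.87, 0.72, 0.60]
kept = nms(boxes, scores)
print(kept)  # indices of the surviving boxes
```

Here the 0.72 box overlaps the 0.87 box with IoU of about 0.68, so it is suppressed, while the distant box survives.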

Object Detection Results Slide credit: Ross Girshick

Semantic segmentation results: R-CNN is easily extended to the semantic segmentation task. O2P: the leading method for the task at the time (uses CPMC, Constrained Parametric Min-Cuts). Image credit: Ross Girshick

PROS

Intuitive: combining regions with CNNs

Performant: scales easily with the number of classes

Run time analysis: feature vectors are computed once and shared by all SVM classes (pipeline: supervised pre-training on ImageNet, fine-tuning (SGD) on regions from PASCAL VOC, CNN, SVM). Feature vector: low-dimensional (4K)
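To make the "once for all SVM classes" point concrete, the sketch below scores every region against every class with a single matrix product. The sizes are assumptions for illustration: 2000 regions and 4096-dimensional features follow the earlier slides, while 20 classes matches the PASCAL VOC category count; the random arrays stand in for real features and trained SVM weights.

```python
import numpy as np

def score_all_regions(features, svm_weights, svm_biases):
    """Score every region against every class in one matrix product.

    features: (num_regions, feat_dim), computed once by the CNN.
    svm_weights: (feat_dim, num_classes), one column per class SVM.
    """
    return features @ svm_weights + svm_biases

num_regions, feat_dim, num_classes = 2000, 4096, 20  # illustrative sizes

rng = np.random.default_rng(0)
features = rng.standard_normal((num_regions, feat_dim)).astype(np.float32)
svm_weights = rng.standard_normal((feat_dim, num_classes)).astype(np.float32)
svm_biases = np.zeros(num_classes, dtype=np.float32)

scores = score_all_regions(features, svm_weights, svm_biases)

# The class-specific cost is one multiply-add per (region, feature, class).
multiply_adds = num_regions * feat_dim * num_classes
print(scores.shape, multiply_adds)
```

Adding classes only widens `svm_weights`; the expensive CNN features are not recomputed.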

Run time analysis Slide credit: Ross Girshick

Impact: one of the commonly used methods for semantic segmentation

Ablation studies Analyzing performance impact of different layers

Ablation studies: effect of different layers and fine-tuning on mAP (mean average precision). Without fine-tuning, fc7 generalizes worse than fc6. Representational power: conv layers > fully connected layers. Fine-tuning increases mAP by 8 percentage points.

Visualization of network Showing which layers learn which features

Visualization of network: what features does each layer learn? Image credit: Ross Girshick

Evaluation: compared against several baselines and other methods

Failure modes: analyzed common failure modes and suggested a solution (bounding box regression, BB)

Detection error analysis Image credit: Ross Girshick

Bounding box regression. Most errors: mislocalizations. BB regression: a linear regression model that predicts a new detection window given the pool5 features of a region proposal.
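The slide does not spell out what the regressor predicts. The sketch below assumes the center/size parameterization described in the R-CNN paper, with boxes written as `(center_x, center_y, width, height)`; the function names and the example boxes are hypothetical. It leaves out the learned linear map from pool5 features to the deltas and shows only how targets are formed and how predicted deltas move a proposal.

```python
import math

def encode_targets(proposal, ground_truth):
    """Regression targets that map a proposal onto its ground-truth box."""
    px, py, pw, ph = proposal
    gx, gy, gw, gh = ground_truth
    return (
        (gx - px) / pw,     # scale-invariant shift of the center in x
        (gy - py) / ph,     # scale-invariant shift of the center in y
        math.log(gw / pw),  # log-space change of width
        math.log(gh / ph),  # log-space change of height
    )

def apply_deltas(proposal, deltas):
    """Turn predicted deltas into a new detection window."""
    px, py, pw, ph = proposal
    dx, dy, dw, dh = deltas
    return (pw * dx + px, ph * dy + py, pw * math.exp(dw), ph * math.exp(dh))

proposal = (50.0, 50.0, 100.0, 80.0)       # hypothetical region proposal
ground_truth = (60.0, 45.0, 120.0, 100.0)  # hypothetical annotated box

targets = encode_targets(proposal, ground_truth)
refined = apply_deltas(proposal, targets)
print(targets)
print(refined)  # recovers the ground-truth box up to rounding
```

With perfect predictions the refined window coincides with the ground truth; in practice the linear model only approximates these targets.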

Source code: well-documented source code on GitHub

Source code Image credit: https://github.com/rbgirshick/rcnn

CONS

Computationally costly: every proposal has to go through the whole network

Needs two steps for detection: can't unify the proposal step and the classification step

Uses SVMs: no end-to-end training

Violates spatial translation invariance: the devil is in the FC layers

No global information

Person or Not

Idea is simple: no more than image classification