Rich feature hierarchies for accurate object detection and semantic segmentation Ross Girshick, Jeff Donahue, Trevor Darrell, Jitendra Malik Presented by Pandian Raju and Jialin Wu
Last class: SGD for document recognition (LeCun et al. 1998) and ImageNet classification (Krizhevsky et al. 2012). What problems did they solve?
Image Classification ImageNet sample image Image credit: http://rocknrollnerd.github.io/ml/2015/05/27/leopard-sofa.html
Image Classification ImageNet sample image Image credit: https://www.pinterest.com/explore/facts-about-tigers
Image Classification ImageNet sample image Image credit: https://www.youtube.com/watch?v=a1ofpnxiwvm
Image Classification MS COCO sample image Image credit: https://github.com/pdollar/coco/blob/master/pythonapi/pycocodemo.ipynb
Object Detection MS COCO sample image Image credit: https://github.com/pdollar/coco/blob/master/pythonapi/pycocodemo.ipynb
Semantic segmentation Image credit: http://blog.qure.ai/notes/semantic-segmentation-deep-learning-review
Different visual recognition tasks Image classification Object detection Semantic segmentation
History
PASCAL VOC Detection history Image credit: Ross Girshick
DPM - Deformable Parts Model Image credit: http://www.embeddedvisionsystems.it/solutions/ip2lib/117-dpm
Feature learning with CNNs: Fukushima 1980, Neocognitron; LeCun et al. 1998, SGD for document recognition; Krizhevsky et al. 2012, ImageNet classification (AlexNet)
Brute force
Brute force Forget it!
Regions Gu et al. 2009 Recognition using regions
R-CNN high-level flow: (1) category-independent region proposals; (2) extract a feature vector from each region using a CNN; (3) classify each region using a linear SVM per class
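The three-stage flow above can be sketched end to end. This is only an illustration of how the stages connect, not the authors' implementation: `propose_regions` and `extract_features` are hypothetical stand-ins (random boxes and an intensity histogram) for selective search and the CNN, and the SVM weights are random.

```python
import numpy as np

def propose_regions(image, n=5):
    """Hypothetical stand-in for a region proposer: n random boxes (x1, y1, x2, y2)."""
    h, w = image.shape[:2]
    rng = np.random.default_rng(0)
    x1 = rng.integers(0, w // 2, size=n)
    y1 = rng.integers(0, h // 2, size=n)
    x2 = x1 + rng.integers(1, w // 2, size=n)
    y2 = y1 + rng.integers(1, h // 2, size=n)
    return np.stack([x1, y1, x2, y2], axis=1)

def extract_features(image, box, dim=8):
    """Hypothetical stand-in for the CNN: a normalized intensity histogram of the crop."""
    x1, y1, x2, y2 = box
    crop = image[y1:y2, x1:x2]
    hist, _ = np.histogram(crop, bins=dim, range=(0.0, 1.0))
    return hist / crop.size

def detect(image, weights, biases):
    """Stage 1: propose regions. Stage 2: featurize each. Stage 3: score with per-class linear models."""
    boxes = propose_regions(image)
    feats = np.stack([extract_features(image, b) for b in boxes])
    scores = feats @ weights.T + biases  # shape: (n_regions, n_classes)
    return boxes, scores

image = np.random.default_rng(1).random((32, 32))          # toy grayscale image
weights = np.random.default_rng(2).normal(size=(3, 8))     # 3 classes, 8-d features (assumed sizes)
biases = np.zeros(3)
boxes, scores = detect(image, weights, biases)
```

Each row of `scores` holds one region's score under every class model; the real system would use about 2000 proposals and 4096-dimensional features.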
Region proposal: about 2000 region proposals per image, generated by the selective search algorithm (Uijlings et al. 2012). Selective search combines ideas from exhaustive search and segmentation
Feature vector using CNN: 5 convolutional layers, 2 fully connected layers. Output: 4096-dimensional feature vector
Linear SVMs: a sample region goes through the CNN, then each class SVM scores it. SVM for Cat: No. SVM for Dog: Yes. SVM for Lion: No
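A minimal sketch of the per-class decision, assuming each linear SVM is a weight vector plus a bias applied to the region's feature vector. The weights, biases, and the 4-dimensional feature below are made-up toy values (the real feature vector is 4096-dimensional), chosen so the outcome mirrors the slide.

```python
import numpy as np

classes = ["cat", "dog", "lion"]

# Hypothetical learned parameters: one weight row and one bias per class.
W = np.array([[ 1.0, -1.0,  0.0, 0.0],
              [-1.0,  1.0,  0.5, 0.0],
              [ 0.0, -0.5, -1.0, 1.0]])
b = np.array([-0.5, -0.5, -0.5])

# Toy stand-in for the CNN feature vector of one sample region.
feature = np.array([0.2, 1.5, 0.4, 0.1])

# Every class SVM scores the same feature vector independently.
scores = W @ feature + b
decisions = {c: bool(s > 0) for c, s in zip(classes, scores)}
```

Because the feature vector is shared, adding a class only adds one more row to `W`.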
Challenges: Localization (addressed by region proposals and bounding box regression); Limited training set
Challenges: Localization (addressed by region proposals and bounding box regression); Limited training set (addressed by supervised pre-training with fine-tuning)
Training: supervised pre-training of the CNN on ILSVRC (2012); fine-tuning (SGD) on regions from PASCAL VOC; then per-class SVMs (Cat, Dog, Car) on the CNN features
Testing Intersection-over-Union (IoU) Image credit: http://www.pyimagesearch.com/2016/11/07/intersection-over-union-iou-for-object-detection/
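A short sketch of how IoU can be computed for two axis-aligned boxes. The `(x1, y1, x2, y2)` corner format and the function name `iou` are assumptions made for this example.

```python
def iou(box_a, box_b):
    """Intersection-over-Union of two boxes given as (x1, y1, x2, y2)."""
    ax1, ay1, ax2, ay2 = box_a
    bx1, by1, bx2, by2 = box_b

    # Width and height of the overlap rectangle (zero if the boxes are disjoint).
    inter_w = max(0.0, min(ax2, bx2) - max(ax1, bx1))
    inter_h = max(0.0, min(ay2, by2) - max(ay1, by1))
    inter = inter_w * inter_h

    # Union = sum of both areas minus the part counted twice.
    union = (ax2 - ax1) * (ay2 - ay1) + (bx2 - bx1) * (by2 - by1) - inter
    return inter / union if union > 0 else 0.0
```

For example, two 2x2 boxes offset by one unit in each direction share an area of 1 out of a union of 7, so `iou((0, 0, 2, 2), (1, 1, 3, 3))` is 1/7.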
Testing: ignore unwanted regions using non-maximum suppression. Example SVM scores for overlapping regions: 0.87, 0.72
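A minimal sketch of greedy non-maximum suppression for one class: keep the highest-scoring box, drop every remaining box that overlaps it too much, and repeat. The box coordinates and the 0.3 IoU threshold are assumed values for illustration; the two higher scores are the ones from the slide.

```python
import numpy as np

def iou_one_to_many(box, boxes):
    """IoU between one box and an array of boxes, all as (x1, y1, x2, y2)."""
    x1 = np.maximum(box[0], boxes[:, 0])
    y1 = np.maximum(box[1], boxes[:, 1])
    x2 = np.minimum(box[2], boxes[:, 2])
    y2 = np.minimum(box[3], boxes[:, 3])
    inter = np.clip(x2 - x1, 0, None) * np.clip(y2 - y1, 0, None)
    area = (box[2] - box[0]) * (box[3] - box[1])
    areas = (boxes[:, 2] - boxes[:, 0]) * (boxes[:, 3] - boxes[:, 1])
    return inter / (area + areas - inter)

def nms(boxes, scores, iou_threshold=0.3):
    """Return indices of the boxes kept, in order of decreasing score."""
    order = np.argsort(scores)[::-1]  # highest score first
    keep = []
    while order.size > 0:
        best, rest = order[0], order[1:]
        keep.append(int(best))
        # Discard boxes that overlap the selected box more than the threshold.
        overlaps = iou_one_to_many(boxes[best], boxes[rest])
        order = rest[overlaps <= iou_threshold]
    return keep

boxes = np.array([[10, 10, 50, 50],       # score 0.87
                  [12, 12, 52, 52],       # score 0.72, heavily overlaps the first box
                  [100, 100, 140, 140]],  # separate object
                 dtype=float)
scores = np.array([0.87, 0.72, 0.60])
kept = nms(boxes, scores)
```

Here the 0.72 box is suppressed because its IoU with the 0.87 box is about 0.82, while the third box survives because it overlaps neither.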
Object Detection Results Slide credit: Ross Girshick
Semantic segmentation results: R-CNN is easily extended to the semantic segmentation task. O2P: the leader in the task at the time (uses CPMC, Constrained Parametric Min-Cuts). Image credit: Ross Girshick
PROS
Intuitive: combines regions with CNNs
Performant: scales easily with the number of classes
Run time analysis: CNN feature vectors are computed once per region and shared by all class SVMs. Feature vectors are low-dimensional (4K). Pipeline: supervised pre-training on ImageNet, fine-tuning (SGD) on regions from PASCAL VOC, then CNN features fed to the SVMs
Run time analysis Slide credit: Ross Girshick
Impact: one of the commonly used methods for semantic segmentation
Ablation studies: analyzing the performance impact of different layers
Ablation studies: effect of different layers and fine-tuning on mAP (mean average precision). Without fine-tuning, fc7 generalizes worse than fc6. Representational power: conv layers > fully connected layers. Fine-tuning increases mAP by 8 percentage points.
Visualization of network: showing which layers learn which features
Visualization of network: what features does each layer learn? Image credit: Ross Girshick
Evaluation: compared against several baselines and other methods
Failure modes: analyzed common failure modes and suggested a fix (bounding box regression, BB)
Detection error analysis Image credit: Ross Girshick
Bounding box regression: most errors are mislocalizations. BB regression: a linear regression model that predicts a new detection window given the pool5 features of a region proposal.
Source code: properly documented source code on GitHub
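A sketch of how a linear model's outputs could be turned into a refined window. It assumes the center/size parameterization as I recall it from the R-CNN paper's appendix: the center shifts are scaled by the proposal's width and height, and the size changes are applied in log space. The weights `W` and the 16-dimensional `pool5` vector are random, hypothetical stand-ins for learned weights and real features.

```python
import numpy as np

def apply_box_deltas(proposal, deltas):
    """Refine a proposal (cx, cy, w, h) with predicted deltas (dx, dy, dw, dh)."""
    px, py, pw, ph = proposal
    dx, dy, dw, dh = deltas
    gx = pw * dx + px      # shift center x by a fraction of the width
    gy = ph * dy + py      # shift center y by a fraction of the height
    gw = pw * np.exp(dw)   # scale width (log-space delta)
    gh = ph * np.exp(dh)   # scale height (log-space delta)
    return np.array([gx, gy, gw, gh])

rng = np.random.default_rng(0)
pool5 = rng.normal(size=16)               # stand-in for the region's pool5 features
W = rng.normal(scale=0.01, size=(4, 16))  # hypothetical regression weights, one row per delta
deltas = W @ pool5                        # linear regression: deltas are linear in the features

proposal = np.array([50.0, 60.0, 20.0, 40.0])  # (center x, center y, width, height)
refined = apply_box_deltas(proposal, deltas)
```

With all-zero deltas the proposal is returned unchanged, which makes the identity the natural starting point for the regressor.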
Source code Image credit: https://github.com/rbgirshick/rcnn
CONS
Computationally costly: every proposal has to go through the whole network
Needs two steps for detection: can't unify the proposal step and the classification step
Uses SVMs: no end-to-end training
Violates spatial translation invariance: the devils are the FC layers
No global information
Person or Not
The idea is simple: no more than image classification applied to regions