R-FCN: Object Detection with Really-Friggin' Convolutional Networks
(or: Region-based Fully Convolutional Networks)
Jifeng Dai (Microsoft Research), Yi Li (Tsinghua Univ.), Kaiming He (FAIR), Jian Sun (Microsoft Research)
NIPS 2016
VGG Reading Group, Sam Albanie
Object Detection
[Figure: some members of the postdeepluvian* object detection family tree, with arXiv dates and venues: R-CNN (CVPR 2014), SPP-Net (ECCV 2014), Fast R-CNN (ICCV 2015), Faster R-CNN (NIPS 2015), R-CNN minus R (BMVC 2015), YOLO (CVPR 2016) and YOLO9000 ("now with more layers", arXiv Dec 2016), SSD (ECCV 2016) and DSSD (arXiv Jan 2017), G-CNN (CVPR 2016), ProNet (CVPR 2016), and R-FCN (NIPS 2016). Legend: fully connected bidirectional / unidirectional inspiration layers. *Serge Belongie-ism]
Motivation: Sharing is Caring*
[Figure: computation sharing with a ResNet-101 backbone (to scale). R-CNN runs all convolutions separately on each RoI (unshared); Faster R-CNN shares the convolutions up to RoI pooling but still runs unshared per-RoI layers afterwards; R-FCN shares all convolutions across RoIs.]
*Trademark: Salvation Army
Problem: Location Invariance
For image classification we want translation invariance: the label should not depend on where the object sits. For object detection we want translation variance: the output must respond to where objects are. In previous work, a RoI pooling layer was inserted before the final convolutions to break the invariance, at the cost of reduced sharing (everything after the RoI pooling runs once per region).
Solution: Position-Sensitive Score Maps
Waffle explanation: much like neural networks, it works on multiple layers.
Position-Sensitive Score Maps
Each group of channels takes responsibility for one relative spatial location: with a k x k grid of bins and C classes, the final convolution produces k²(C+1) score maps (one per bin per class, plus background). A sketch of the layout follows.
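A minimal numpy sketch of that channel layout (the sizes are made up for illustration, and the channel ordering is one plausible convention rather than necessarily the paper's):

```python
import numpy as np

# The final conv produces k*k*(C+1) channels: one (C+1)-class score map
# per relative spatial position inside an object.
k, C = 3, 20                      # k x k spatial bins, C classes + background
H, W = 40, 60                     # feature map resolution (hypothetical)
score_maps = np.random.randn(k * k * (C + 1), H, W)

# Channel block for bin (i, j): the scores that "take responsibility"
# for the (i, j)-th relative location of any object.
def bin_channels(i, j):
    start = (i * k + j) * (C + 1)
    return score_maps[start:start + C + 1]   # shape (C+1, H, W)

top_left = bin_channels(0, 0)     # maps voting for "top-left of an object"
print(top_left.shape)             # (21, 40, 60)
```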
Efficient Sharing of Diagrams
Backbone: ResNet-101
Minor modifications: remove the global average pooling (GAP) and fc layer; add a randomly initialised 1x1 convolution for dimensionality reduction (2048 → 1024 channels).
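A rough sketch of this surgery, using torchvision's ResNet-101 as a stand-in (the paper additionally dilates conv5 to keep an effective stride of 16, which is omitted here):

```python
import torch
import torch.nn as nn
import torchvision

# Drop global average pooling + fc, then reduce 2048 -> 1024 channels
# with a randomly initialised 1x1 conv, as described in the paper.
resnet = torchvision.models.resnet101(weights=None)
trunk = nn.Sequential(*list(resnet.children())[:-2])  # up to conv5, no GAP/fc
reduce_dim = nn.Conv2d(2048, 1024, kernel_size=1)

x = torch.randn(1, 3, 224, 224)
features = reduce_dim(trunk(x))
print(features.shape)  # torch.Size([1, 1024, 7, 7])
```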
Further Details
Bbox regression under the standard parameterisation
Standard multi-task loss function
Online Hard Example Mining (OHEM) during training
Faster R-CNN-style alternating optimisation
Dilation used at conv5 (the RPN works from conv4); gives a 2.6 mAP boost
Visualisation: Hit
Visualisation: Miss
Experiments
The Effect of Position Sensitivity on Fully Convolutional Strategies
(naive Faster R-CNN still has an fc layer after RoI pooling)
Without position sensitivity, Faster R-CNN takes a major performance hit when RoI pooling is placed late in the network.
Standard Benchmarks: VOC 2007
Standard Benchmarks: VOC 2012
The Effect of Depth
Accuracy saturates at ResNet-101
The Effect of Proposal Type Works pretty well with any proposal method
Summary
A little more efficient than Faster R-CNN
Simpler
Trades a little accuracy for that efficiency
Appendix/Details
Standard Benchmarks: MS COCO
The Effect of Proposal Numbers: VOC 2007
Position-Sensitive RoI Pooling: for all the indexing fans
Scores are average-pooled over each bin:
$$r_c(i,j \mid \Theta) = \frac{1}{n} \sum_{(x,y) \in \text{bin}(i,j)} z_{i,j,c}(x + x_0,\; y + y_0 \mid \Theta)$$
where $n$ is the number of pixels in the bin, $(x_0, y_0)$ is the top-left corner of the RoI, and for an RoI of size $w \times h$ the $(i,j)$-th bin spans:
$$\left\lfloor i\tfrac{w}{k} \right\rfloor \le x < \left\lceil (i{+}1)\tfrac{w}{k} \right\rceil, \qquad \left\lfloor j\tfrac{h}{k} \right\rfloor \le y < \left\lceil (j{+}1)\tfrac{h}{k} \right\rceil$$
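A numpy sketch of the pooling above, assuming a single RoI, feature stride 1, and integer RoI coordinates (the real layer also handles feature strides and sub-pixel rounding):

```python
import numpy as np

# Position-sensitive RoI pooling: bin (i, j) is average-pooled from its
# own dedicated block of (C+1) channels in the score maps.
def ps_roi_pool(score_maps, x0, y0, w, h, k, C):
    pooled = np.zeros((k, k, C + 1))
    for i in range(k):                       # horizontal bin index
        for j in range(k):                   # vertical bin index
            x_lo, x_hi = int(np.floor(i * w / k)), int(np.ceil((i + 1) * w / k))
            y_lo, y_hi = int(np.floor(j * h / k)), int(np.ceil((j + 1) * h / k))
            chans = slice((i * k + j) * (C + 1), (i * k + j + 1) * (C + 1))
            region = score_maps[chans, y0 + y_lo:y0 + y_hi, x0 + x_lo:x0 + x_hi]
            pooled[i, j] = region.mean(axis=(1, 2))  # average over the bin
    return pooled                            # shape (k, k, C+1)

k, C = 3, 20
maps = np.random.randn(k * k * (C + 1), 50, 50)
scores = ps_roi_pool(maps, x0=10, y0=8, w=24, h=18, k=k, C=C)
votes = scores.mean(axis=(0, 1))             # (C+1,) per-class votes for the RoI
```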
Standard Object Detection Multitask Loss Function
The class loss is computed by averaging the positional scores (i.e. voting) to produce a (C+1)-dim vector for each RoI, pushing it through a softmax and computing cross entropy. The regression loss is similar, producing a 4-dim vector which is passed into a smooth L1 (Huber) loss. The two losses are combined in a weighted sum:
$$L(s, t_{x,y,w,h}) = L_{\text{cls}}(s_{c^*}) + \lambda \, [c^* > 0] \, L_{\text{reg}}(t, t^*)$$
with $\lambda = 1$. Positive examples are the RoIs that have intersection-over-union (IoU) overlap of at least 0.5 with a ground-truth box, and negative otherwise.
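A sketch of the per-RoI loss under these definitions, with hypothetical inputs `votes` (the voted (C+1)-dim class scores), `c_star` (ground-truth class index, 0 = background), and `t`, `t_star` (predicted and target box parameterisations); λ = 1 as in the paper:

```python
import numpy as np

def smooth_l1(x):
    # Huber-style regression loss used in Fast/Faster R-CNN and R-FCN
    ax = np.abs(x)
    return np.where(ax < 1.0, 0.5 * x ** 2, ax - 0.5).sum()

def rfcn_loss(votes, c_star, t, t_star, lam=1.0):
    shifted = votes - votes.max()                        # numerical stability
    log_probs = shifted - np.log(np.exp(shifted).sum())  # log-softmax
    cls_loss = -log_probs[c_star]                        # cross entropy
    reg_loss = smooth_l1(t - t_star) if c_star > 0 else 0.0  # [c* > 0] gate
    return cls_loss + lam * reg_loss

loss = rfcn_loss(np.random.randn(21), c_star=5,
                 t=np.zeros(4), t_star=np.array([0.1, -0.2, 0.05, 0.0]))
```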
Bounding Box Regression in Object Detection: R-CNN Style
Predict bounding box updates with an additional 4k²-dim convolutional layer.
Given training pairs $\{(P^i, G^i)\}_{i=1,\dots,N}$, where $P^i = (P^i_x, P^i_y, P^i_w, P^i_h)$, parameterise the mapping with linear functions $d_x(P), d_y(P), d_w(P), d_h(P)$ such that:
$$\hat{G}_x = P_w d_x(P) + P_x, \qquad \hat{G}_y = P_h d_y(P) + P_y \quad \text{(scale invariant)}$$
$$\hat{G}_w = P_w \exp(d_w(P)), \qquad \hat{G}_h = P_h \exp(d_h(P)) \quad \text{(log space)}$$
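A small numpy sketch of the decoding step, applying predicted deltas to a proposal given in (centre-x, centre-y, width, height) form:

```python
import numpy as np

# Apply predicted deltas d = (dx, dy, dw, dh) to a proposal
# P = (Px, Py, Pw, Ph) to get the estimated ground-truth box G_hat.
def apply_deltas(P, d):
    Px, Py, Pw, Ph = P
    dx, dy, dw, dh = d
    Gx = Pw * dx + Px             # scale-invariant centre shift
    Gy = Ph * dy + Py
    Gw = Pw * np.exp(dw)          # log-space width/height scaling
    Gh = Ph * np.exp(dh)
    return Gx, Gy, Gw, Gh

print(apply_deltas((50.0, 40.0, 20.0, 10.0), (0.1, -0.2, 0.05, 0.0)))
# -> (52.0, 38.0, ~21.03, 10.0)
```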
OHEM: Online Hard Example Mining (bootstrapping)
Rank regions by loss and backpropagate only through the top-ranked ones. These hard examples evolve as the network trains. OHEM is particularly efficient in R-FCN because ranking all region proposals is (almost) free: nearly all computation is shared, so evaluating the per-RoI losses costs little.
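A sketch of the selection step, assuming the per-RoI losses have already been computed in a forward pass (B = 128 is illustrative):

```python
import numpy as np

# OHEM-style example selection: score every proposal's loss, keep only
# the B highest-loss RoIs for the backward pass.
def select_hard_examples(roi_losses, B=128):
    order = np.argsort(roi_losses)[::-1]    # highest loss first
    return order[:B]                        # indices of hard RoIs to train on

losses = np.random.rand(2000)               # e.g. one loss per proposal
hard_idx = select_hard_examples(losses, B=128)
```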
Alternating Optimisation: You put your left boot in, your left boot out
1. Train the RPN
2. Use its proposals to train Fast R-CNN
3. Use the resulting network to initialise the RPN
4. Retrain Fast R-CNN with the updated RPN, sharing convolutions
Dilated Convolutions
[Figure 1 from Yu, Fisher, and Vladlen Koltun, "Multi-scale context aggregation by dilated convolutions", ICLR 2016]
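A quick PyTorch illustration of the idea: a 3x3 kernel with dilation 2 covers a 5x5 field of view with no extra parameters, so spatial resolution can be kept while the receptive field matches that of the original strided network:

```python
import torch
import torch.nn as nn

# Dilation as used at conv5: same 3x3 weights, holes between taps.
# padding=2 with dilation=2 preserves the spatial resolution.
conv = nn.Conv2d(1024, 1024, kernel_size=3, dilation=2, padding=2)
x = torch.randn(1, 1024, 14, 14)
print(conv(x).shape)   # torch.Size([1, 1024, 14, 14]) - resolution preserved
```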