R-FCN: Object Detection via Region-based Fully Convolutional Networks
Jifeng Dai, Yi Li, Kaiming He, Jian Sun (Microsoft Research / Tsinghua University)
Speed/Accuracy Trade-offs for Modern Convolutional Object Detectors
Jonathan Huang, Vivek Rathod, Chen Sun, Menglong Zhu, Anoop Korattikara, Alireza Fathi, Ian Fischer, Zbigniew Wojna, Yang Song, Sergio Guadarrama, Kevin Murphy
Deep Learning Seminar, Tel-Aviv University. Instructor: Dr. Raja Giryes. Presenters: Gilad Uziel, Netzah Calamaro
Introduction
There are two families of methods for object detection:
- Region-based (two stages)
- Single-shot (one stage)
R-FCN is a hybrid of both:
- It uses a Region Proposal Network (RPN)
- It works on the entire image simultaneously
The Main Idea
[Figure: position-sensitive score maps. The last conv layer produces k² banks of per-class feature maps; a k × k grid of bins over each RoI pools each bin only from its corresponding bank, and the pooled scores are averaged ("voted") into one score per class. A pooling sketch follows.]
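To make the voting mechanism concrete, here is a minimal NumPy sketch of position-sensitive RoI pooling (shapes and names are our own, not the paper's code): each of the k × k bins pools only from its own bank of maps, and the k² pooled scores are averaged into one score per channel.

```python
import numpy as np

def ps_roi_pool(score_maps, roi, k, channels):
    """Position-sensitive RoI pooling with average voting (sketch).

    score_maps: (k*k*channels, H, W) array, the k^2 banks of maps.
    roi:        (x0, y0, x1, y1) in feature-map coordinates (ints).
    Returns a (channels,) vector: per-channel score after voting.
    """
    x0, y0, x1, y1 = roi
    votes = np.zeros(channels)
    for i in range(k):        # bin rows: top ... bottom parts of the RoI
        for j in range(k):    # bin columns: left ... right parts
            ya = y0 + (y1 - y0) * i // k
            yb = max(y0 + (y1 - y0) * (i + 1) // k, ya + 1)
            xa = x0 + (x1 - x0) * j // k
            xb = max(x0 + (x1 - x0) * (j + 1) // k, xa + 1)
            # bin (i, j) pools ONLY from its own bank of `channels` maps
            bank = score_maps[(i * k + j) * channels:(i * k + j + 1) * channels]
            votes += bank[:, ya:yb, xa:xb].mean(axis=(1, 2))
    return votes / (k * k)    # average voting over the k^2 bins

# Hypothetical usage: k = 3, C = 20 classes (+1 background) on a 40x60 map
scores = ps_roi_pool(np.random.randn(3 * 3 * 21, 40, 60),
                     roi=(10, 8, 31, 29), k=3, channels=21)
```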
R-FCN Architecture
Bounding Box Regression
Aside from the k²(C+1)-d convolutional layer, we append a sibling 4k²-d convolutional layer for bounding-box regression. Position-sensitive RoI pooling is performed on this bank of 4k² maps, producing a 4k²-d vector for each RoI. This vector is then aggregated into a 4-d vector by average voting, which parameterizes a bounding box as t = (t_x, t_y, t_w, t_h).
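The box branch can reuse the ps_roi_pool sketch from the earlier slide, just with 4 channels per bin instead of C+1 (illustrative shapes, not the paper's code):

```python
import numpy as np

# Class-agnostic box regression with k = 3: 4*k^2 = 36 maps on a 40x60 grid.
# Assumes ps_roi_pool as defined in the earlier sketch.
bbox_maps = np.random.randn(4 * 3 * 3, 40, 60)
t = ps_roi_pool(bbox_maps, roi=(10, 8, 31, 29), k=3, channels=4)
# t is the 4-d vector (tx, ty, tw, th) obtained by average voting
```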
Visualization
Visualization of R-FCN for the person category when an RoI does not correctly overlap the object (k × k = 3 × 3).
Visualization
Visualization of R-FCN for the person category when an RoI does correctly overlap the object (k × k = 3 × 3).
Loss Function
L(s, t_{x,y,w,h}) = L_cls(s_{c*}) + λ · [c* > 0] · L_reg(t, t*)
- L_cls(s_{c*}) is computed by the softmax (cross-entropy) function.
- L_reg(t, t*) is computed by the smooth-L1 function.
- [c* > 0] is an indicator that equals 1 if the argument is true and 0 otherwise.
- We set the balance weight λ = 1.
- c* is the RoI's ground-truth label (c* = 0 means background).
- t* is the RoI's ground-truth box.
A code sketch follows.
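As a sketch (our own notation, assuming raw class scores as input), the per-RoI loss can be written as:

```python
import numpy as np

def smooth_l1(x):
    """Smooth-L1 as in Fast R-CNN: 0.5*x^2 if |x| < 1, else |x| - 0.5."""
    x = np.abs(x)
    return np.where(x < 1, 0.5 * x ** 2, x - 0.5)

def rfcn_loss(class_scores, t, c_star, t_star, lam=1.0):
    """L = L_cls(s_{c*}) + lambda * [c* > 0] * L_reg(t, t*)  (sketch).

    class_scores: (C+1,) raw scores; c_star: ground-truth label (0 = bg);
    t, t_star: predicted and ground-truth 4-d box encodings.
    """
    log_probs = class_scores - np.log(np.sum(np.exp(class_scores)))  # log-softmax
    l_cls = -log_probs[c_star]                  # cross-entropy on the true class
    l_reg = smooth_l1(t - t_star).sum() if c_star > 0 else 0.0  # foreground only
    return l_cls + lam * l_reg
```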
Backpropagation (Online Hard Example Mining)
We define positive examples as RoIs that have an intersection-over-union (IoU) overlap with a ground-truth box of at least 0.5, and negative otherwise. Backpropagation is then performed based on the B = 128 RoIs (positive and negative) that have the highest loss among the selected examples. A selection sketch follows.
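A sketch of the hard-example selection, assuming rfcn_loss from the previous sketch and a hypothetical list `rois` of per-RoI tuples:

```python
import numpy as np

# Forward pass: compute the loss of every proposal, then keep the B = 128
# RoIs with the highest loss; only they contribute to backpropagation.
rois = [(np.random.randn(21), np.random.randn(4), 1, np.random.randn(4))
        for _ in range(300)]                   # (scores, t, c*, t*) per RoI
losses = np.array([rfcn_loss(s, t, c, ts) for s, t, c, ts in rois])
hard = np.argsort(-losses)[:128]               # indices of the hardest examples
# gradients would then be computed only for the RoIs in `hard`
```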
Backbone Architecture
The incarnation of R-FCN is based on ResNet-101. ResNet-101 has 100 convolutional layers followed by global average pooling and a 1000-class fc layer. We remove the average pooling layer and the fc layer and only use the convolutional layers to compute feature maps. The last convolutional block in ResNet-101 is 2048-d, and we attach a randomly initialized 1024-d 1×1 convolutional layer to reduce the dimension. Then we apply the k²(C+1)-channel convolutional layer to generate the score maps. A sketch follows.
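A minimal PyTorch sketch of this backbone modification (layer names and the use of torchvision are our assumptions, not the paper's original implementation):

```python
import torch.nn as nn
import torchvision

k, C = 7, 20                                    # VOC: 20 classes + background
backbone = torchvision.models.resnet101(weights=None)
features = nn.Sequential(*list(backbone.children())[:-2])   # drop avgpool + fc
reduce_dim = nn.Conv2d(2048, 1024, kernel_size=1)    # randomly initialized 1x1
cls_head = nn.Conv2d(1024, k * k * (C + 1), kernel_size=1)  # k^2(C+1) score maps
box_head = nn.Conv2d(1024, 4 * k * k, kernel_size=1)        # sibling 4k^2 maps
```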
Results
- Number of proposals: 300
- k × k = 7 × 7
- 83.6% mAP on PASCAL VOC 2007
- 170 ms test time per image
Speed/accuracy trade-offs for modern convolutional object detectors Comparative study of R-FCN, SSD and Faster R-CNN
Motivation of the 2nd Paper
Most works discuss only accuracy; this work also focuses on memory/speed and on the accuracy/speed/memory trade-off: selecting the right algorithm for a specific purpose and optimizing parameters within that algorithm.
1. Mobile devices (cellular) require a low memory footprint
2. Autonomous cars require real-time performance = speed and accuracy
3. Server-side applications such as Google/Facebook require accuracy, but throughput is still a bottleneck
4. Contests require accuracy
Comparing apples to apples requires an objective, comprehensive test bench that can show the differences, so such a test bench needs to be developed.
Demo videos: SSD + YOLO (Summit Bhala), Faster R-CNN performance (Kaiming He), R-FCN performance (Vatsal Srivastava)
Video links:
- YOLO vs SSD: https://www.youtube.com/watch?v=8ql69caj2ku
- R-FCN: https://www.youtube.com/watch?v=jljhxuzoeaq
- Faster R-CNN: https://www.youtube.com/watch?v=wzmsmkk9vua
Motivation of the 2nd Paper (cont.)
There are sweet spots on the trade-off graph where investing a lot of GPU time yields only a small accuracy gain. This may be viewed the other way around: one may invest much less GPU time with little accuracy loss.
Comparative Architecture Reminder: SSD, Faster R-CNN, R-FCN (see the appendix slides at the end)
Results
Experimental Platform
Six feature extractors are used across all detectors:
- VGG-16
- ResNet-101
- Inception V2
- Inception V3
- Inception ResNet (V2)
- MobileNet
Experimental Platform: Additional Details
Loss Function and Input Configuration
- Loss function: matching anchors to ground-truth instances, argmax vs. bipartite matching (see the matching sketch after this list); boxes are encoded as w × h rectangles; L1-norm location loss
- Input size configuration: variable-resolution input sizes cover a wide range of operating points
- Computation platform: Intel Xeon E5-1650 (6 cores), Nvidia GTX Titan GPU (roughly 4 times more computation than a home gamer card)
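A minimal sketch of argmax anchor matching (our own names and thresholds; bipartite matching would instead assign each ground-truth box to one anchor greedily):

```python
import numpy as np

def iou(box_a, box_b):
    """Intersection-over-union of two (x0, y0, x1, y1) boxes."""
    xa, ya = max(box_a[0], box_b[0]), max(box_a[1], box_b[1])
    xb, yb = min(box_a[2], box_b[2]), min(box_a[3], box_b[3])
    inter = max(0, xb - xa) * max(0, yb - ya)
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    return inter / (area_a + area_b - inter)

def argmax_match(anchors, gt_boxes, pos_thresh=0.5):
    """Argmax matching sketch: each anchor is assigned the ground-truth box
    with the highest IoU, and is positive only if that IoU >= pos_thresh."""
    matches = []
    for a in anchors:
        ious = [iou(a, g) for g in gt_boxes]
        best = int(np.argmax(ious))
        matches.append((best, ious[best] >= pos_thresh))
    return matches   # (gt index, is_positive) per anchor
```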
Training and Hyper-parameter Tuning
Asynchronous SGD: when some mini-batch finishes computing its gradient, the gradient is applied to the shared parameters without waiting for the others, and that worker continues training. This might cause delayed (stale) gradients but is faster. A conceptual sketch follows.
Results:
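A toy sketch of the asynchronous update rule (all names hypothetical): each worker applies its gradient to the shared parameters as soon as it is ready, possibly having computed it from stale parameters.

```python
import threading

params = {"w": 0.0}                 # shared parameters
lock = threading.Lock()

def worker(grads, lr=0.1):
    for g in grads:                 # g may have been computed from stale params
        with lock:                  # only the update itself is atomic;
            params["w"] -= lr * g   # workers never wait for each other's batches

threads = [threading.Thread(target=worker, args=([0.5, -0.2, 0.1],))
           for _ in range(4)]
for t in threads:
    t.start()
for t in threads:
    t.join()
print(params["w"])                  # all updates applied, in arbitrary order
```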
[Scatter plot: overall mAP (y-axis) vs GPU time in milliseconds (x-axis)]
Mean Average Precision (mAP): Conclusions
- R-FCN and SSD are faster than Faster R-CNN on average
- Faster R-CNN is more accurate
- Sweet spot: a point where, to obtain a little more accuracy, much speed must be sacrificed. Another way to view it: little GPU time is invested without sacrificing too much accuracy
- Larger feature extractors are slower
(What is Inception? See the appendix slide.)
[Scatter plot: mAP vs GPU time (milliseconds), colored by feature extractor]
mAP Colored by Feature Extractor: Conclusions
- Larger feature extractors are slower
- The colored clusters show the relation to the feature extractor; each architecture (R-FCN, Faster R-CNN, SSD) was implemented with various feature extractors, which makes the test bench variable
- MobileNet and Inception V2 are faster on average than Inception ResNet V2 because they are smaller feature extractors
- Sweet spot: a point where, to obtain a little more accuracy, much speed must be sacrificed
[Scatter plot: memory consumption vs GPU time (milliseconds) for different feature extractors]
Memory vs. GPU Time: Conclusions
- Larger feature extractors are both slower and demand more memory; it comes together: larger means more memory and usually more GPU time
- Inception ResNet V2 is the most memory- and time-consuming
- MobileNet with SSD is the fastest and has minimal GPU/memory consumption
- Sweet spots: R-FCN w/ ResNet-101, and Faster R-CNN w/ ResNet-101 with only 50 proposals
- R-FCN w/ ResNet-101 runs at about 100 ms GPU time with high accuracy and not-too-high memory consumption
[Bar graph: mAP for each object size, by meta-architecture and feature extractor]
mAP by Object Size: Conclusions
- How to read the bar graph: it partitions each feature-extractor model by object size (small, medium, large); three meta-architectures are drawn per feature extractor
- SSD has (very) poor performance on small objects; it is competitive with Faster R-CNN and R-FCN on larger objects, outperforming them when paired with lightweight feature extractors
- For small objects, improved resolution may compensate for their size in terms of accuracy
[Scatter plot: mAP on small objects vs mAP on large objects, colored by input resolution; SSD points highlighted]
mAP on Small vs Large Objects: Conclusions
- High-resolution models lead to significantly better mAP on small objects (roughly 2×) and somewhat better results on large objects
- In SSD, higher resolution improves large-object accuracy but is less successful at improving small-object accuracy
- For R-FCN, Faster R-CNN, and SSD: strong performance on small objects implies strong performance on large objects. The opposite is not true: SSD performs well on large objects but poorly on small ones
[Scatter plot: detection mAP vs Top-1 classification accuracy of the feature extractor on ImageNet]
mAP vs Top-1 ImageNet Accuracy: Conclusions
- There is an overall correlation between classification accuracy (feature extraction) and detection accuracy (overall)
- This correlation appears to be more significant for Faster R-CNN and R-FCN
- SSD's performance appears to be less reliant on its feature extractor's classification accuracy; SSD is unable to fully leverage the power of the ResNet and Inception ResNet feature extractors
- Using cheaper feature extractors does not hurt SSD too much; on large objects it is competitive with Faster R-CNN and R-FCN
[Figure: effect of proposing fewer regions in (a) Faster R-CNN and (b) R-FCN on mAP (solid lines) and GPU time (dashed lines)]
Conclusions from Figure (a): Faster R-CNN
- With the Inception ResNet feature extractor and 50 proposals, 96% of the 300-proposal accuracy is obtained while reducing GPU runtime by a factor of 3
- Using Inception ResNet, which reaches 35.4% mAP with 300 proposals, accuracy stays similar (29% mAP) with only 10 proposals; the sweet spot is around 50 proposals
- Similar trade-offs hold for other feature extractors, although less pronounced
Conclusions from Figure (b): R-FCN
- Savings from using fewer proposals in the R-FCN setting are minimal, since the box classifier (the expensive part) is only run once per image
- At 100 proposals, the speed and accuracy of Faster R-CNN models with ResNet become comparable to equivalent R-FCN models using 300 proposals, in both mAP and GPU speed
- Faster R-CNN: a dramatic proposals-to-GPU-time effect, a less significant proposals-to-accuracy effect. R-FCN: a mild effect of proposals on both GPU time and accuracy
State-of-the-art Detection on the MS COCO Dataset
(What is multi-crop inference? What are mAP and AR? See the appendix slides.)
Facts:
- Run on the COCO dataset; average precision is averaged over IoU thresholds from 50% to 95%
- Table 3: the test uses an ensemble of the 5 best-performing Faster R-CNN ResNet feature extractors
Interpretation of Tables 2, 3, 4:
- The ensemble's average accuracy is 41.3% mAP, better than the previous result of 37.1%
- Roughly 60% relative accuracy improvement for small objects over the previous result
Thank you Questions?
Reminder: SSD
Modifies the proposal generator to directly output class probabilities (instead of objectness).
1) No separate proposal generator as in R-CNN.
2) Direct link from the feature extractor to the detection generator.
Pros: very fast (suitable for mobile applications, autonomous vehicles).
Cons: not good at detecting smaller objects (YOLO), but using feature maps from different layers can help a lot (SSD).
(back)
Reminder: Faster R-CNN
Start with the feature extractor, continue with the proposal generator, then the box classifier.
- Feature extractor: 5 convolutional layers
- Proposal generator: inserted after conv5 of the feature extractor; output = bounding boxes and objectness
- Box classifier: input = crops of conv5 at the bounding boxes, with RoI pooling to get fixed-size feature maps; passed through fully connected layers; output = class probabilities
Pro: best accuracy.
Con: GPU runtime depends on the number of proposals.
(back)
Reminder: R-FCN
Translation variance in detection: we want the classification network to output the same thing if the cat moves from the top left to the bottom right (object classification), but the Region Proposal Network (object location) to output differently.
- The box classifier is given the crop of fc6 instead of conv5, so the computation for each proposal is reduced
- New position-sensitive score maps: shape = (k·k·(C+1), h, w), encoding the position into the channel dimension
- New position-sensitive RoI pooling: input = (k·k·(C+1), roi_h, roi_w); pooled = (C+1, k, k); output = (C+1). In other words, the top-left bin only pools from its own subset of filters (see the ps_roi_pool sketch earlier)
- Classifier: input = the pooled feature maps
Pro: a variation of R-FCN (TA-FCN) is the best instance-segmentation architecture.
Pro: fast and fairly accurate.
Con: less accurate than Faster R-CNN.
(back)
What are mAP and AR?
mAP = mean average precision (averaged over classes and, for COCO, over IoU thresholds); AR = average recall.
(back)
Multi-crop Inference
A novel pooling strategy that crops different regions from the convolutional feature maps and applies max-pooling a varying number of times.
(back)
Loss function: for an anchor a and image I,
L(a, I; θ) = α · 1[a is positive] · ℓ_loc(φ(b_a; a) − f_loc(I; a, θ)) + β · ℓ_cls(y_a, f_cls(I; a, θ))
where:
- α, β - weights balancing the localization and classification losses
- f_loc(I; a, θ) - predicted box encoding; ℓ_loc - location loss
- f_cls(I; a, θ) - predicted class; ℓ_cls - classification loss
- φ(b_a; a) - encoding of ground-truth box b_a with respect to anchor a
- y_a - class label of anchor a (background for a negative anchor)
- I - image; θ - model parameters
(back)
What is Region-of-Interest (RoI) pooling?
Used, for example, to detect multiple cars and pedestrians in a single image. Its purpose is to perform max pooling on inputs of non-uniform sizes to obtain fixed-size feature maps (e.g., 7 × 7). A minimal sketch follows.
(back)
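A minimal NumPy sketch of RoI max pooling (our own shapes and names): the RoI is divided into a fixed out_size × out_size grid and each cell is max-pooled, so arbitrarily sized RoIs become fixed-size maps.

```python
import numpy as np

def roi_max_pool(feature_map, roi, out_size=7):
    """RoI max pooling (sketch). feature_map: (C, H, W);
    roi: (x0, y0, x1, y1) integer feature-map coordinates."""
    x0, y0, x1, y1 = roi
    out = np.zeros((feature_map.shape[0], out_size, out_size))
    for i in range(out_size):
        for j in range(out_size):
            ya = y0 + (y1 - y0) * i // out_size
            yb = max(y0 + (y1 - y0) * (i + 1) // out_size, ya + 1)
            xa = x0 + (x1 - x0) * j // out_size
            xb = max(x0 + (x1 - x0) * (j + 1) // out_size, xa + 1)
            out[:, i, j] = feature_map[:, ya:yb, xa:xb].max(axis=(1, 2))
    return out

# Any RoI size is reduced to a fixed 7x7 grid per channel:
pooled = roi_max_pool(np.random.randn(256, 40, 60), roi=(5, 3, 50, 35))
```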
Inception Module
By parallelizing layers and combining their outputs, less computation is invested while achieving an effect equivalent to using additional depth. A simplified sketch follows.
(back)
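A simplified PyTorch sketch of an Inception-style block (channel widths are arbitrary): parallel branches with different receptive fields are concatenated along the channel dimension, which is cheaper than stacking equivalent depth.

```python
import torch
import torch.nn as nn

class InceptionBlock(nn.Module):
    """Parallel 1x1 / 3x3 / 5x5 / pooling branches, concatenated (sketch)."""
    def __init__(self, in_ch):
        super().__init__()
        self.b1 = nn.Conv2d(in_ch, 16, kernel_size=1)
        self.b3 = nn.Sequential(nn.Conv2d(in_ch, 16, 1),          # 1x1 bottleneck
                                nn.Conv2d(16, 32, 3, padding=1))  # cheaper 3x3
        self.b5 = nn.Sequential(nn.Conv2d(in_ch, 16, 1),
                                nn.Conv2d(16, 32, 5, padding=2))
        self.pool = nn.Sequential(nn.MaxPool2d(3, stride=1, padding=1),
                                  nn.Conv2d(in_ch, 16, 1))

    def forward(self, x):
        # spatial size is preserved; outputs are stacked on the channel axis
        return torch.cat([self.b1(x), self.b3(x), self.b5(x), self.pool(x)], dim=1)

out = InceptionBlock(64)(torch.randn(1, 64, 28, 28))   # -> (1, 96, 28, 28)
```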