Lecture 10: Optimizing Object Detection: A Case Study of R-CNN, Fast R-CNN, Faster R-CNN, and Single Shot Detection. Visual Computing Systems (Stanford CS348V, Winter 2018)

Today's task: object detection. Image classification answers: what is the object in this image? ("tennis racket") Object detection also involves localization: where is the tennis racket in this image (if there is one at all)?

Quick review: why did we say that DNNs learn good features?

Krizhevsky (AlexNet) image classification network [Krizhevsky 2012]. Input: fixed-size image. Output: probability of each label (for 1000 class labels). A stack of conv layers assigns the input image one of 1000 potential labels: the DNN produces a feature vector in a ~4K-dimensional space that is fed to a multi-way classifier ("softmax") to produce per-label probabilities.
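To make that last step concrete, here is a minimal sketch (PyTorch, with made-up tensors; not AlexNet's actual code) of a 4096-dimensional feature vector being mapped to 1000 per-label probabilities by a linear layer followed by a softmax:

```python
# Minimal sketch: a 4096-d feature vector from the last hidden layer is mapped
# to 1000 class scores, and softmax turns the scores into per-label probabilities.
import torch
import torch.nn as nn

features = torch.randn(1, 4096)          # stand-in for the conv+FC feature vector
classifier = nn.Linear(4096, 1000)       # multi-way classifier ("softmax" layer)
probs = torch.softmax(classifier(features), dim=1)
print(probs.shape, probs.sum().item())   # torch.Size([1, 1000]), ~1.0
```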

VGG-16 image classification network [Simonyan 2015]. Input: fixed-size image. Output: probability of each label (for 1000 class labels). The network assigns the input image one of 1000 potential labels.

Today: several object detection papers. R-CNN [Girshick 2014], Fast R-CNN [Girshick 2015], Faster R-CNN [Ren, He, Girshick, Sun 2015]: each paper improves on both the wall-clock performance and the detection accuracy of the previous one. Also: Single Shot Detection (SSD) [Liu 2016], and Mask R-CNN for instance segmentation [He 2017].

Using the classification network as a subroutine for object detection [Girshick 2014]. Search over all regions of the image and all region sizes for objects (a sliding window over the image, repeated for multiple potential object scales):

for all region top-left positions (x, y):
  for all region sizes (w, h):
    cropped = image_crop(image, bbox(x, y, w, h))
    resized = image_resize(cropped, 227, 227)
    label = detect_object(resized)
    if label != background:
      # region defined by bbox(x, y, w, h) contains an object of class label
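A hedged sketch of that brute-force loop, assuming a hypothetical classify_region helper that stands in for resizing the crop to 227x227 and running the classification CNN:

```python
# Sketch of the brute-force sliding-window detector from the slide, assuming a
# hypothetical classify_region(patch) -> label function stands in for the CNN.
import numpy as np

def sliding_window_detect(image, classify_region, sizes=((64, 64), (128, 128)), stride=16):
    """image: HxWx3 array. Returns (x, y, w, h, label) for non-background hits."""
    H, W = image.shape[:2]
    detections = []
    for (h, w) in sizes:                       # all candidate region sizes
        for y in range(0, H - h + 1, stride):  # all top-left positions
            for x in range(0, W - w + 1, stride):
                crop = image[y:y + h, x:x + w]
                label = classify_region(crop)  # would resize to 227x227 and run the CNN
                if label != "background":
                    detections.append((x, y, w, h, label))
    return detections

# Example with a dummy classifier (real use: resize crop and run AlexNet/VGG).
dummy = lambda patch: "background"
print(sliding_window_detect(np.zeros((256, 256, 3)), dummy))
```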

Optimization 1: filter detection work via object proposals. Selective search [Uijlings IJCV 2013]. Input: image. Output: list of regions (at various scales) that are likely to contain objects. Idea: the proposal algorithm filters out parts of the image that are not likely to contain objects. [Slide shows Figure 2 from the paper: "Two examples of our selective search showing the necessity of d..."]

Object detection pipeline executed only on proposed regions [Girshick 2014]. Input image (of any size) → object proposal generator → list of proposed regions (~2000) → crop/resample each proposed region → pixel region (of canonical size) → classification DNN → object label.

Object detection performance on PASCAL VOC. (Slide shows example training data: cow, airplanes.)

Table 1: Detection average precision (%) on VOC 2010 test.

Method           aero  bike  bird  boat  bottle bus   car   cat   chair cow   table dog   horse mbike persn plant sheep sofa  train tv    mAP
DPM v5 [18]      49.2  53.8  13.1  15.3  35.5   53.4  49.7  27.0  17.2  28.8  14.7  17.8  46.4  51.2  47.7  10.8  34.2  20.7  43.8  38.3  33.4
UVA [34]         56.2  42.4  15.3  12.6  21.8   49.3  36.8  46.1  12.9  32.1  30.0  36.5  43.5  52.9  32.9  15.3  41.1  31.8  47.0  44.8  35.1
Regionlets [36]  65.0  48.9  25.9  24.6  24.5   56.1  54.5  51.2  17.0  28.9  30.2  35.8  40.2  55.7  43.5  14.3  43.9  32.6  54.0  45.9  39.7
SegDPM [16]      61.4  53.4  25.6  25.2  35.5   51.7  50.6  50.8  19.3  33.8  26.8  40.4  48.3  54.4  47.1  14.8  38.7  35.0  52.8  43.1  40.4
R-CNN            67.1  64.1  46.7  32.0  30.5   56.4  57.2  65.9  27.0  47.3  40.9  66.6  57.8  65.9  53.6  26.7  56.5  38.1  52.8  50.2  50.2
R-CNN BB         71.8  65.8  53.0  36.8  35.9   59.7  60.0  69.9  27.9  50.6  41.4  70.0  62.0  69.0  58.1  29.5  59.4  39.3  61.2  52.4  53.7

R-CNN is most directly comparable to UVA and Regionlets since all three methods use selective search region proposals. Bounding box regression (BB) is described in Section 3.4 of the paper. At publication time, SegDPM was the top performer on the PASCAL VOC leaderboard. DPM and SegDPM use context rescoring not used by the other methods. R-CNN achieves similar performance (53.3% mAP) on VOC 2011/12 test.

R-CNN's DNN weights are pre-trained on the ImageNet object classification task (lots of data), then fine-tuned for the 20 VOC categories + 1 background category (task-specific data).

Optimization 2: region of interest (ROI) pooling. RGB input image (of any size) → "fully convolutional network": a sequence of convolution and max-pooling steps whose output size depends on the input size → ROI max pool, mapping an H x W x C response map down to a 7 x 7 x C region. Idea: the output of the early convolutional layers of the network on a cropped, downsampled input region is approximated by resampling the output of a fully convolutional evaluation of those conv layers on the whole image. Performance optimization: evaluate the convolutional layers once on the large input, then reuse the intermediate output many times to approximate the response for any subregion of the image.

Optimization 2: region of interest pooling. This is a form of approximate common subexpression elimination.

Before (redundant work: many regions overlap, so responses at the lower network layers are computed many times, for 1000s of regions per image):

for all proposed regions (x, y, w, h):
  cropped = image_crop(image, bbox(x, y, w, h))
  resized = image_resize(cropped, 227, 227)
  label = detect_object(resized)

After (conv layers computed once per image):

conv5_response = evaluate_conv_layers(image)
for all proposed regions (x, y, w, h):
  region_conv5 = roi_pool(conv5_response, bbox(x, y, w, h))
  label = evaluate_fully_connected_layers(region_conv5)
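Below is a small illustrative implementation of ROI max pooling over a precomputed conv feature map; the 7x7 output size and stride-16 feature map are assumptions chosen to resemble a VGG-style conv5, not a reproduction of the paper's exact code:

```python
# Sketch of ROI max pooling on a precomputed conv feature map, to make the
# "compute conv5 once, reuse per region" idea concrete.
import numpy as np

def roi_max_pool(feature_map, box, out_size=7, stride=16):
    """feature_map: C x H x W conv output; box: (x, y, w, h) in image pixels."""
    C, H, W = feature_map.shape
    x, y, w, h = (v / stride for v in box)          # project the box onto the feature map
    x0, y0 = int(np.floor(x)), int(np.floor(y))
    x1 = min(max(int(np.ceil(x + w)), x0 + 1), W)
    y1 = min(max(int(np.ceil(y + h)), y0 + 1), H)
    region = feature_map[:, y0:y1, x0:x1]
    out = np.zeros((C, out_size, out_size), dtype=feature_map.dtype)
    ys = np.linspace(0, region.shape[1], out_size + 1).astype(int)   # bin edges
    xs = np.linspace(0, region.shape[2], out_size + 1).astype(int)
    for i in range(out_size):
        for j in range(out_size):
            cell = region[:, ys[i]:max(ys[i + 1], ys[i] + 1),
                             xs[j]:max(xs[j + 1], xs[j] + 1)]
            out[:, i, j] = cell.max(axis=(1, 2))    # max pool within each output cell
    return out

conv5 = np.random.rand(512, 32, 32)                 # e.g., conv5 response for a 512x512 image
pooled = roi_max_pool(conv5, box=(100, 60, 180, 240))
print(pooled.shape)                                 # (512, 7, 7)
```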

Fast R-CNN pipeline [Girshick 2015]. Input image (of any size) → object proposal generator → list of proposed regions (~2000); in parallel, a DNN (conv layers only!) produces response maps. An ROI pooling layer extracts, for each proposed region, a pixel region of canonical size; fully connected layers then feed two heads: a class-label softmax (object label) and a bbox regressor (bbox).

Evaluation speed: 146x faster than R-CNN (47 sec/img → 0.32 sec/img). [This number excludes the cost of proposals.] Training speed: 9x faster than R-CNN. Training mini-batch: pick N images, then pick 128/N boxes from each image (this allows sharing of conv-layer pre-computation across multiple image-box training samples). Class predictions and bbox predictions are trained simultaneously: joint loss = class label loss + bbox loss (see the sketch after the table below). Note: training updates weights in BOTH the fully connected/softmax layers AND the conv layers.

Detection average precision (%) comparison (VOC 2012 test):

Method          Train set  aero  bike  bird  boat  bottle bus   car   cat   chair cow   table dog   horse mbike persn plant sheep sofa  train tv    mAP
BabyLearning    Prop.      77.7  73.8  62.3  48.8  45.4   67.3  67.0  80.3  41.3  70.8  49.7  79.5  74.7  78.6  64.5  36.0  69.9  55.7  70.4  61.7  63.8
R-CNN BB [10]   12         79.3  72.4  63.1  44.0  44.4   64.6  66.3  84.9  38.8  67.3  48.4  82.3  75.0  76.7  65.7  35.8  66.2  54.8  69.1  58.8  62.9
SegDeepM        12+seg     82.3  75.2  67.1  50.7  49.8   71.1  69.6  88.2  42.5  71.2  50.0  85.7  76.6  81.8  69.3  41.5  71.9  62.2  73.2  64.6  67.2
FRCN [ours]     12         80.1  74.4  67.7  49.4  41.4   74.2  68.8  87.8  41.9  70.1  50.2  86.1  77.3  81.1  70.4  33.3  67.0  63.3  77.2  60.0  66.1
FRCN [ours]     07++12     82.0  77.8  71.6  55.3  42.4   77.3  71.7  89.3  44.5  72.1  53.7  87.7  80.0  82.5  72.7  36.6  68.7  65.4  81.1  62.7  68.8
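The joint loss can be sketched as follows; the softmax classification term and the smooth-L1 box term follow the paper, while the tensor names and random stand-in data here are purely illustrative:

```python
# Sketch of Fast R-CNN's multi-task loss: softmax classification loss plus a
# bbox regression loss on the predicted offsets for the true class.
import torch
import torch.nn.functional as F

N, num_classes = 128, 21                       # 128 sampled RoIs, 20 classes + background
cls_scores = torch.randn(N, num_classes)       # class-label head output
bbox_pred = torch.randn(N, num_classes * 4)    # per-class box offsets
labels = torch.randint(0, num_classes, (N,))   # stand-in ground-truth labels
bbox_targets = torch.randn(N, 4)               # stand-in regression targets

loss_cls = F.cross_entropy(cls_scores, labels)

# Regression loss only for foreground RoIs, using the offsets of the true class.
fg = labels > 0
pred = bbox_pred[fg].view(-1, num_classes, 4)
pred_fg = pred[torch.arange(pred.shape[0]), labels[fg]]
loss_bbox = F.smooth_l1_loss(pred_fg, bbox_targets[fg])

loss = loss_cls + loss_bbox                    # joint loss = class label loss + bbox loss
print(loss.item())
```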

Problem: the bottleneck is now generating proposals. Selective search [Uijlings 13]: ~10 sec/image on a CPU. EdgeBoxes [Zitnick 14]: ~0.2 sec/image on a CPU. (The rest of the pipeline is as before: input image → object proposal generator plus DNN conv layers → proposed regions plus response maps → ROI pooling layer → fully connected layers → class-label softmax and bbox regression.) Idea: why not predict regions from the convolutional feature maps that must be computed for detection anyway? (Share computation between proposals and detection.)

Faster R-CNN using a region proposal network (RPN) [Ren 2015]. Input image (of any size) → DNN (conv layers only!) → response maps → region proposal network → list of proposed regions → ROI pooling layer, applied for each proposed region.

Faster R-CNN using a region proposal network (RPN). The input image (of any size) passes through the DNN (conv layers only!) to produce W x H x 512 response maps. The RPN then applies 512 3x3 conv filters (3x3x512x512 weights), projecting each spatial position into a 512-element vector (assuming VGG input conv layers, the receptive field for each output is ~228x228 pixels). At each point, assume k=9 anchor boxes of various aspect ratios and scales. Given the 512-element vector, two sibling 1x1 convs predict: an objectness score for each of the 9 anchors (a 2-way softmax per anchor; the "cls" layer, 512 x (9*2) weights, 2k scores) and a bbox correction to each anchor (the "reg" layer, 512 x (9*4) weights, 4k coordinates).
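A possible PyTorch rendering of this RPN head (layer shapes follow the slide; everything else, including the module name, is illustrative):

```python
# Sketch of the RPN head: a 3x3 conv produces a 512-d vector per spatial
# position, then two sibling 1x1 convs predict, for k=9 anchors, objectness
# scores and 4 box offsets per anchor.
import torch
import torch.nn as nn

class RPNHead(nn.Module):
    def __init__(self, in_channels=512, num_anchors=9):
        super().__init__()
        self.conv = nn.Conv2d(in_channels, 512, kernel_size=3, padding=1)
        self.cls = nn.Conv2d(512, num_anchors * 2, kernel_size=1)  # objectness (2-way per anchor)
        self.reg = nn.Conv2d(512, num_anchors * 4, kernel_size=1)  # bbox offsets per anchor

    def forward(self, feature_map):
        x = torch.relu(self.conv(feature_map))
        return self.cls(x), self.reg(x)

feat = torch.randn(1, 512, 38, 50)        # conv response maps for one image
scores, offsets = RPNHead()(feat)
print(scores.shape, offsets.shape)        # (1, 18, 38, 50), (1, 36, 38, 50)
```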

Training Faster R-CNN. The full pipeline: input image (of any size) → DNN (conv layers) → W x H x 512 response maps → RPN (512 3x3 conv filters, plus 1x1 convs for the per-anchor objectness softmax and bbox regressor) → list of proposed regions → ROI pooling layer → for each proposed region, a pixel region of canonical size → fully connected layers → class-label softmax (object label) and bbox regression (bbox). Goal: jointly learn
- the region prediction network weights, and
- the object classification network weights,
- while constraining the initial conv layers to be the same (for efficiency).

Alternating training strategy:
Train the region proposal network (RPN)
- using a loss based on ground-truth object bounding boxes
- positive example: intersection over union (IoU) with a ground-truth box above a threshold
- negative example: intersection over union below a (lower) threshold (see the IoU sketch below)
Then use the trained RPN to train Fast R-CNN
- using a loss based on detections and bbox regression
Use the conv layers from the trained detector to initialize the RPN, and fine-tune the RPN
- using a loss based on ground-truth boxes
Use the updated RPN to fine-tune Fast R-CNN
- using a loss based on detections and bbox regression
Repeat.
Notice: the solution learns to predict boxes that are good for the object-detection task
- end-to-end optimization for the object-detection task
- compare to using an off-the-shelf object-proposal algorithm
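A minimal sketch of the IoU computation and the positive/negative anchor labeling mentioned above; the 0.7/0.3 thresholds are commonly used values and are stated here as assumptions:

```python
# Sketch of the IoU test used to label anchors as positive/negative training
# examples for the RPN.
def iou(box_a, box_b):
    """Boxes are (x1, y1, x2, y2). Returns intersection-over-union in [0, 1]."""
    ix1, iy1 = max(box_a[0], box_b[0]), max(box_a[1], box_b[1])
    ix2, iy2 = min(box_a[2], box_b[2]), min(box_a[3], box_b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    return inter / (area_a + area_b - inter)

def label_anchor(anchor, gt_boxes, pos_thresh=0.7, neg_thresh=0.3):
    best = max((iou(anchor, gt) for gt in gt_boxes), default=0.0)
    if best >= pos_thresh:
        return 1        # positive example
    if best < neg_thresh:
        return 0        # negative example
    return -1           # in-between: ignored during training

print(iou((0, 0, 10, 10), (5, 5, 15, 15)))   # 25 / 175 ≈ 0.143
```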

Faster R-CNN results. Specializing region proposals for the object-detection task yields better accuracy (SS = selective search for object proposals):

method            # proposals  data          mAP (%)
SS                2000         12            65.7
SS                2000         07++12        68.4
RPN+VGG, shared   300          12            67.0
RPN+VGG, shared   300          07++12        70.4
RPN+VGG, shared   300          COCO+07++12   75.9

Shared convolutions also improve wall-clock performance (values are times in ms):

model  system             conv  proposal  region-wise  total  rate
VGG    SS + Fast R-CNN    146   1510      174          1830   0.5 fps
VGG    RPN + Fast R-CNN   141   10        47           198    5 fps
ZF     RPN + Fast R-CNN   31    3         25           59     17 fps

Summary
Knowledge of the algorithm and properties of the DNN is used to gain algorithmic speedups
- not just modifying the schedule of the loops
Key insight: sharing the results of convolutional-layer computations
- between different proposed regions (proposed object bboxes)
- between the region proposal logic and the detection logic
Example of end-to-end training
- back-propagate through the entire algorithm to train all components at once
- better accuracy: globally optimize the various parts of the algorithm to be optimal for the given task (Faster R-CNN: how to propose boxes is learned simultaneously with the detection logic)
- learning can be constrained to preserve performance characteristics (Faster R-CNN: conv layer weights shared across the RPN and the detection task)

Extending to instance segmentation

[Figure: Mask R-CNN results on COCO test images (people, umbrellas, kites, chairs, cars, donuts, etc.), using ResNet-101-FPN and running at 5 fps, with 35.7 mask AP (Table 1). Image credit: He et al. 2017]

Mask R-CNN. Extend Faster R-CNN to also emit a segmentation mask per box. Previously: box and class were emitted in parallel. Now: box, class, and segmentation are emitted in parallel. The ROI pooling layer (now ROIAlign) resamples the region generated by the Faster R-CNN region proposal network into a pixel region of canonical size (7x7). In the "Faster R-CNN w/ ResNet" head [19], the 7x7x1024 region passes through res5 to 7x7x2048, which is average-pooled to predict per-class scores (80) and bbox adjustments; a parallel mask branch produces a 14x14x256 feature map and then 14x14x80 binary masks, one for each of the 80 output classes.
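A rough sketch of such a mask branch (shapes follow the slide's 7x7x2048 → 14x14x256 → 14x14x80 path; the specific layers are illustrative, not the paper's exact head):

```python
# Sketch of a Mask R-CNN-style mask branch: upsample the per-region features and
# emit one binary mask logit map per class, trained with a per-pixel sigmoid.
import torch
import torch.nn as nn

mask_head = nn.Sequential(
    nn.ConvTranspose2d(2048, 256, kernel_size=2, stride=2),  # 7x7x2048 -> 14x14x256
    nn.ReLU(),
    nn.Conv2d(256, 80, kernel_size=1),                       # 14x14x80: one mask per class
)

roi_feat = torch.randn(1, 2048, 7, 7)         # res5 output for one proposed region
mask_logits = mask_head(roi_feat)
mask_probs = torch.sigmoid(mask_logits)       # per-pixel, per-class binary masks
print(mask_probs.shape)                       # (1, 80, 14, 14)
```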

Mask R-CNN for human pose. The loss is based on bitmaps with "hot" pixels at joint keypoint locations, rather than on segmentation masks.

Figure 7 (from the paper): keypoint detection results on COCO test using Mask R-CNN (ResNet-50-FPN), with person segmentation masks predicted from the same model. This model has a keypoint AP of 63.1 and runs at 5 fps.

Table 4: Keypoint detection AP on COCO test-dev. Mask R-CNN is a single model (ResNet-50-FPN) that runs at 5 fps. CMU-Pose+++ [6] is the 2016 competition winner and uses multi-scale testing.

method                       AP(kp)  AP50(kp)  AP75(kp)  APM(kp)  APL(kp)
CMU-Pose+++ [6]              61.8    84.9      67.5      57.1     68.2
G-RMI [32]                   62.4    84.0      68.5      59.1     68.1
Mask R-CNN, keypoint-only    62.7    87.0      68.4      57.4     71.1
Mask R-CNN, keypoint & mask  63.1    87.3      68.7      57.8     71.4

Table 5: Multi-task learning of box, mask, and keypoint for the person category, evaluated on minival. All entries are trained on the same data for fair comparison; the backbone is ResNet-50-FPN.

method                       APbb (person)  APmask (person)  AP(kp)
Faster R-CNN                 52.5           -                -
Mask R-CNN, mask-only        53.6           45.8             -
Mask R-CNN, keypoint-only    50.7           -                64.2
Mask R-CNN, keypoint & mask  52.0           45.1             64.7

An alternative approach to object detection. Recall the structure of the algorithms so far (reduce detection to classification):

for all proposed regions (x, y, w, h):
  cropped = image_crop(image, bbox(x, y, w, h))
  resized = image_resize(cropped, classifier_width, classifier_height)
  label = classify_object(resized)
  bbox_adjustment = adjust_bbox(resized)

New approach to detection:

for each level l of the network:
  for each (x, y) position in the output:
    use the region around (l, x, y) to directly predict which anchor boxes centered at (x, y) are valid, and a class score for each such box

If there are B anchor boxes and C classes, then at each (l, x, y) the prediction network has B*(C + 4) outputs: for each anchor box, C class probabilities plus 4 values to adjust the anchor box.
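A sketch of a single-shot prediction head for one feature-map level, assuming B anchor boxes and C classes as above (shapes are illustrative):

```python
# Sketch of per-position prediction: at every (x, y) of a P-channel feature map,
# a single 3x3 conv emits B*(C+4) numbers, i.e. C class scores plus 4 box
# adjustments for each of B anchor boxes.
import torch
import torch.nn as nn

P, B, C = 512, 4, 21                               # channels, anchors per cell, classes (incl. background)
predictor = nn.Conv2d(P, B * (C + 4), kernel_size=3, padding=1)

feature_map = torch.randn(1, P, 38, 38)            # e.g., an SSD conv4_3-like map
out = predictor(feature_map)
print(out.shape)                                   # (1, 100, 38, 38): 4 anchors x (21 + 4) per cell
```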

SSD: Single Shot MultiBox Detector [Liu ECCV 2016]. Multibox detectors operate on feature maps at several different scales. If a feature map has P channels (e.g., P=512 for Conv4_3 and 256 for the later maps), each classifier is a 3x3xP conv filter, with (C + 4) filters per anchor box (one of the C categories is "background").

SSD 300 architecture (diagram shows only the feature maps): 300x300x3 image → VGG-16 through the Conv4_3 layer (38x38x512; classifier: Conv 3x3x(4x(Classes+4))) → Conv6 (FC6) and Conv7 (FC7) (19x19x1024) → extra feature layers Conv8_2 (10x10x512), Conv9_2 (5x5x256; classifier: Conv 3x3x(6x(Classes+4))), Conv10_2 (3x3x256), Conv11_2 (1x1x256; classifier: Conv 3x3x(4x(Classes+4))), forming a deep hierarchy of fully convolutional layers. This yields 8732 detections per class, followed by non-maximum suppression; 74.3 mAP at 59 FPS.
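As a sanity check on the "8732 detections per class" number, here is the sum of (feature-map cells) x (anchor boxes per cell) over the six predictor layers in the SSD 300 diagram:

```python
# Quick check of the "Detections: 8732 per class" figure in the SSD 300 diagram.
layers = [          # (spatial size, anchors per cell), per the diagram
    (38, 4),        # Conv4_3
    (19, 6),        # Conv7
    (10, 6),        # Conv8_2
    (5, 6),         # Conv9_2
    (3, 4),         # Conv10_2
    (1, 4),         # Conv11_2
]
total = sum(s * s * b for s, b in layers)
print(total)        # 8732
```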

SSD anchor boxes. For each default box, the network predicts a location offset loc: (cx, cy, w, h) and confidences conf: (c_1, c_2, ..., c_p). (Figure from the paper: (a) image with ground-truth boxes, (b) 8x8 feature map, (c) 4x4 feature map.) Anchor boxes at each feature-map level have different sizes. Intuition: the receptive field of cells at higher levels of the network (lower-resolution feature maps) covers a larger fraction of the image, so those cells have the information needed to make predictions for larger boxes.
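A small sketch of generating anchor boxes for one feature-map cell at a given scale; the particular scales and aspect ratios below are illustrative assumptions, not the exact SSD defaults:

```python
# Sketch of generating anchor ("default") boxes for one feature-map cell:
# a base scale per level plus a few aspect ratios, in normalized image coords.
import math

def anchors_for_cell(cx, cy, scale, ratios=(1.0, 2.0, 0.5)):
    """Returns (cx, cy, w, h) boxes centered on the cell."""
    boxes = []
    for r in ratios:
        w = scale * math.sqrt(r)
        h = scale / math.sqrt(r)
        boxes.append((cx, cy, w, h))
    return boxes

# Coarser feature maps (higher layers) get larger scales, matching the intuition
# that their cells have larger receptive fields.
for level, (fm_size, scale) in enumerate([(8, 0.2), (4, 0.4)]):
    cx = cy = 0.5 / fm_size            # center of cell (0, 0) in normalized coordinates
    print(level, anchors_for_cell(cx, cy, scale))
```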

[Chart: object detection performance for 600x600 input images. Credit: TensorFlow detection model zoo]

Discussion: why did we say that DNNs learn good features? Consider Mask R-CNN.

Discussion. Today we saw our first examples of end-to-end learning.
- Idea: globally optimize all parts of a topology for the specific task at hand.
- Empirically: this often enables better accuracy (and also better performance).
It is interesting to consider the effects on the interpretability of these models (modularity is typically a favorable property of software).

Emerging theme (from today's lecture and the Inception, MobileNet, and related readings): computer vision practitioners are programming via low-level manipulation of DNN topology.
- See the shift from reasoning about individual layers to wiring up basic microarchitecture modules (e.g., the Inception module).
- "Differentiable programming."
Interesting question: what programming model constructs or automated compilation tools could help raise the level of abstraction?