Optimizing Object Detection:

Similar documents
Optimizing Object Detection:

Object detection with CNNs

Faster R-CNN: Towards Real-Time Object Detection with Region Proposal Networks

Spatial Localization and Detection. Lecture 8-1

Deep learning for object detection. Slides from Svetlana Lazebnik and many others

Object Detection Based on Deep Learning

Faster R-CNN: Towards Real-Time Object Detection with Region Proposal Networks

Yiqi Yan. May 10, 2017

arxiv: v1 [cs.cv] 4 Jun 2015

Object Detection. TA : Young-geun Kim. Biostatistics Lab., Seoul National University. March-June, 2018

Lecture 5: Object Detection

OBJECT DETECTION HYUNG IL KOO

Deep Learning for Object detection & localization

Fine-tuning Pre-trained Large Scaled ImageNet model on smaller dataset for Detection task

Exploiting Depth from Single Monocular Images for Object Detection and Semantic Segmentation

CS6501: Deep Learning for Visual Recognition. Object Detection I: RCNN, Fast-RCNN, Faster-RCNN

G-CNN: an Iterative Grid Based Object Detector

Rich feature hierarchies for accurate object detection and semantic segmentation

Single-Shot Refinement Neural Network for Object Detection -Supplementary Material-

Object Detection on Self-Driving Cars in China. Lingyun Li

Extend the shallow part of Single Shot MultiBox Detector via Convolutional Neural Network

Why Is Recognition Hard? Object Recognizer

YOLO9000: Better, Faster, Stronger

CS 1674: Intro to Computer Vision. Object Recognition. Prof. Adriana Kovashka University of Pittsburgh April 3, 5, 2018

Project 3 Q&A. Jonathan Krause

Unified, real-time object detection

Faster R-CNN Implementation using CUDA Architecture in GeForce GTX 10 Series

Automatic detection of books based on Faster R-CNN

REGION AVERAGE POOLING FOR CONTEXT-AWARE OBJECT DETECTION

Yield Estimation using faster R-CNN

Supplementary Material: Pixelwise Instance Segmentation with a Dynamically Instantiated Network

Introduction to Deep Learning for Facial Understanding Part III: Regional CNNs

arxiv: v1 [cs.cv] 23 Apr 2015

Object Detection. CS698N Final Project Presentation AKSHAT AGARWAL SIDDHARTH TANWAR

Mask R-CNN. presented by Jiageng Zhang, Jingyao Zhan, Yunhan Ma

Feature-Fused SSD: Fast Detection for Small Objects

MCMOT: Multi-Class Multi-Object Tracking using Changing Point Detection

Regionlet Object Detector with Hand-crafted and CNN Feature

arxiv: v1 [cs.cv] 15 Aug 2018

SSD: Single Shot MultiBox Detector

CIS680: Vision & Learning Assignment 2.b: RPN, Faster R-CNN and Mask R-CNN Due: Nov. 21, 2018 at 11:59 pm

arxiv: v1 [cs.cv] 3 Apr 2016

PT-NET: IMPROVE OBJECT AND FACE DETECTION VIA A PRE-TRAINED CNN MODEL

CPSC340. State-of-the-art Neural Networks. Nando de Freitas November, 2012 University of British Columbia

Rich feature hierarchies for accurate object detection and semantic segmentation

Learning Detection with Diverse Proposals

YOLO 9000 TAEWAN KIM

Object detection using Region Proposals (RCNN) Ernest Cheung COMP Presentation

arxiv: v2 [cs.cv] 30 Mar 2016

Real-time Object Detection CS 229 Course Project

arxiv: v1 [cs.cv] 5 Oct 2015

Convolutional Neural Networks. Computer Vision Jia-Bin Huang, Virginia Tech

Weakly Supervised Object Recognition with Convolutional Neural Networks

Future directions in computer vision. Larry Davis Computer Vision Laboratory University of Maryland College Park MD USA

Direct Multi-Scale Dual-Stream Network for Pedestrian Detection Sang-Il Jung and Ki-Sang Hong Image Information Processing Lab.

An Object Detection Algorithm based on Deformable Part Models with Bing Features Chunwei Li1, a and Youjun Bu1, b

Detection and Localization with Multi-scale Models

arxiv: v2 [cs.cv] 1 Oct 2014

Modern Convolutional Object Detectors

DPATCH: An Adversarial Patch Attack on Object Detectors

segdeepm: Exploiting Segmentation and Context in Deep Neural Networks for Object Detection

arxiv: v3 [cs.cv] 15 Sep 2018

Ranking Figure-Ground Hypotheses for Object Segmentation

MULTI-SCALE OBJECT DETECTION WITH FEATURE FUSION AND REGION OBJECTNESS NETWORK. Wenjie Guan, YueXian Zou*, Xiaoqun Zhou

arxiv: v2 [cs.cv] 8 Apr 2018

arxiv: v4 [cs.cv] 30 Nov 2016

Computer Vision Lecture 16

3 Object Detection. BVM 2018 Tutorial: Advanced Deep Learning Methods. Paul F. Jaeger, Division of Medical Image Computing

arxiv: v1 [cs.cv] 1 Apr 2017

RECOGNIZING objects and localizing them in images is

Proposal-free Network for Instance-level Object Segmentation

SSD: Single Shot MultiBox Detector. Author: Wei Liu et al. Presenter: Siyu Jiang

Dictionary Pair Classifier Driven Convolutional Neural Networks for Object Detection

Convolutional Networks for Computer Vision Applications

Object Recognition and Detection

arxiv: v1 [cs.cv] 6 Jul 2017

Rich feature hierarchies for accurate object detection and semant

AttentionNet for Accurate Localization and Detection of Objects. (To appear in ICCV 2015)

R-FCN: OBJECT DETECTION VIA REGION-BASED FULLY CONVOLUTIONAL NETWORKS

Final Report: Smart Trash Net: Waste Localization and Classification

Computer Vision Lecture 16

Kaggle Data Science Bowl 2017 Technical Report

A CLOSER LOOK: SMALL OBJECT DETECTION IN FASTER R-CNN. Christian Eggert, Stephan Brehm, Anton Winschel, Dan Zecha, Rainer Lienhart

Martian lava field, NASA, Wikipedia

Fuzzy Set Theory in Computer Vision: Example 3

Mask R-CNN. Kaiming He, Georgia, Gkioxari, Piotr Dollar, Ross Girshick Presenters: Xiaokang Wang, Mengyao Shi Feb. 13, 2018

Structured Prediction using Convolutional Neural Networks

Constrained Convolutional Neural Networks for Weakly Supervised Segmentation. Deepak Pathak, Philipp Krähenbühl and Trevor Darrell

All You Want To Know About CNNs. Yukun Zhu

arxiv: v1 [cs.cv] 11 Apr 2017

Deep Residual Learning

Towards Weakly- and Semi- Supervised Object Localization and Semantic Segmentation

EE-559 Deep learning Networks for semantic segmentation

Beyond Sliding Windows: Object Localization by Efficient Subwindow Search

arxiv: v1 [cs.cv] 14 Dec 2015

CONVOLUTIONAL Neural Networks (ConvNets) have. Deep Label Distribution Learning With Label Ambiguity

Joint Object Detection and Viewpoint Estimation using CNN features

Adaptive Object Detection Using Adjacency and Zoom Prediction

Content-Based Image Recovery

HIERARCHICAL JOINT-GUIDED NETWORKS FOR SEMANTIC IMAGE SEGMENTATION

Transcription:

Lecture 10: Optimizing Object Detection: A Case Study of R-CNN, Fast R-CNN, and Faster R-CNN Visual Computing Systems

Today s task: object detection Image classification: what is the object in this image? tennis racket Object detection: where is the tennis racket in this image? (if there is one at all?)

Krizhevsky (AlexNet) image classification network [Krizhevsky 2012] Input: fixed size image Output: probability of label (for 1000 class labels) convlayer convlayer convlayer convlayer convlayer Network assigns input image one of 1000 potential labels. DNN produces feature vector in 4K-dim space that is input to multi-way classifier ( softmax ) to produce perlabel probabilities

VGG-16 image classification network [Simonyan 2015] Input: fixed size image Output: probability of label (for 1000 class labels) Network assigns input image one of 1000 potential labels.

Today: three object detection papers R-CNN [Girshick 2014] Fast R-CNN [Girshick 2015] Faster R-CNN [Ren, He, Girshick, Sun 2015] Each paper improves on both the wall-clock performance and the detection accuracy of the previous

Using AlexNet as a subroutine in object detection [Girshick 2014] Search over all regions of the image for objects. ( Sliding window over image, repeated for multiple potential object scales) for all region top-left positions (x,y): for all region sizes (w,h): cropped = image_crop(image, bbox(x,y,w,h)) resized = image_resize(227,227) label = detect_object(resized) if (label!= background) // region defined by bbox(x,y,w,h) contains object // of class label

Optimization 1: filter detection work via object proposals Selective search [Uijlings IJCV 2013] Input: image Output: list of regions (various scales) that are likely to contain objects Idea: proposal algorithm filtersof parts the image not likelyshowing to containthe objects Figure 2: Two examples ourofselective search necessity of d

Object detection pipeline executed only on proposed regions [Girshick 2014] Input image: (of any size) Object Proposal generator List of proposed regions (~2000) Crop/ Resample for each proposed region Pixel region (of canonical size) Classification DNN object label

Object detection performance on Pascal VOC cow airplanes Example training data VOC 2010 test DPM v5 [18] UVA [34] Regionlets [36] SegDPM [16] R-CNN R-CNN BB aero 49.2 56.2 65.0 61.4 67.1 71.8 bike 53.8 42.4 48.9 53.4 64.1 65.8 bird 13.1 15.3 25.9 25.6 46.7 53.0 boat 15.3 12.6 24.6 25.2 32.0 36.8 bottle 35.5 21.8 24.5 35.5 30.5 35.9 bus 53.4 49.3 56.1 51.7 56.4 59.7 car 49.7 36.8 54.5 50.6 57.2 60.0 cat 27.0 46.1 51.2 50.8 65.9 69.9 chair 17.2 12.9 17.0 19.3 27.0 27.9 cow 28.8 32.1 28.9 33.8 47.3 50.6 table 14.7 30.0 30.2 26.8 40.9 41.4 dog 17.8 36.5 35.8 40.4 66.6 70.0 horse mbike person plant 46.4 51.2 47.7 10.8 43.5 52.9 32.9 15.3 40.2 55.7 43.5 14.3 48.3 54.4 47.1 14.8 57.8 65.9 53.6 26.7 62.0 69.0 58.1 29.5 sheep 34.2 41.1 43.9 38.7 56.5 59.4 sofa 20.7 31.8 32.6 35.0 38.1 39.3 train 43.8 47.0 54.0 52.8 52.8 61.2 tv 38.3 44.8 45.9 43.1 50.2 52.4 map 33.4 35.1 39.7 40.4 50.2 53.7 Table 1: Detection average precision (%) on VOC 2010 test. R-CNN is most directly comparable to UVA and Regionlets since all methods use selective search region proposals. Bounding box regression (BB) is described in Section 3.4. At publication time, SegDPM Fine DNNonweights obtained by pretraining for object classification onused ImageNet was the tuned top-performer the PASCAL VOC leaderboard. DPM and SegDPM use context rescoring not by the otherfor methods. the 20 VOC categories (+ 1 background category) was selected by a grid search over {0, 0.1,..., 0.5} on a validation set. We found that selecting this threshold care- being much faster (Section 2.2). Our method achieves sim15-769, Fall 2016 ilar performance (53.3% map) on VOC 2011/12CMU test.

Optimization 2: region of interest pooling RGB input image: (of any size) Fully convolutional network : sequence of convolutions and pooling steps: output size dependent on input size convlayer/ maxpool ROI pool HxWxC 7x7xC Idea: the output of early convolutional layers of network on downsampled input region is approximated by resampling output of fullyconvolutional implementation of conv layers. Performance optimization: can evaluate convolutional layers once on large input, then reuse intermediate output many times to approximate response of a subregion of image. ROI maxpool ROI maxpool

Optimization 2: region of interest pooling Form of approximate common subexpression elimination for all proposed regions (x,y,w,h): cropped = image_crop(image, bbox(x,y,w,h)) resized = image_resize(227,227) label = detect_object(resized) // 1000 s regions/image redundant work (many regions overlap, so responses at lower network layers are computed many times conv5_response = evaluate_conv_layers(image) for all proposed regions (x,y,w,h): region_conv5 = roi_pool(conv5_response, bbox(x,y,w,h)) label = evaluate_fully_connected_layers(region_conv5) computed once per image

Fast R-CNN pipeline [Girshick 2015] Input image: (of any size) Object Proposal generator DNN (conv layers only!) List of proposed regions (~2000) Response maps ROI pooling layer for each proposed region Pixel region (of canonical size) Fullyconnected layers class-label softmax bbox regression softmax object label bbox Evaluation speed: 146x faster than R-CNN (47sec/img 0.32 sec/img) [This number excludes cost of proposals] Training speed: 9x faster than R-CNN Training mini-batch: pick N images, pick 128/N boxes from each image (allows sharing of conv-layer pre-computation for multiple image-box training samples) Simultaneously train class predictions and bbox predictions: joint loss = class label loss + bbox loss Note: training updates weights in BOTH fully connected/softmax layers AND cons layers method BabyLearning train Prop. set aero bike bird boat bottle bus car cat chair cow table dog horse mbike persn plant sheep sofa train tv map BabyLearning R-CNN BB [10] Prop. 12 77.7 79.3 73.8 72.4 62.3 63.1 48.8 44.0 45.4 44.4 67.3 64.6 67.0 66.3 80.3 84.9 41.3 38.8 70.8 67.3 49.7 48.4 79.5 82.3 74.7 75.0 78.6 76.7 64.5 65.7 36.0 35.8 69.9 66.2 55.7 54.8 70.4 69.1 61.7 58.8 63.8 62.9 SegDeepM 12+seg 82.3 75.2 67.1 50.7 49.8 71.1 69.6 88.2 42.5 71.2 50.0 85.7 76.6 81.8 69.3 41.5 71.9 62.2 73.2 64.6 67.2 FRCN [ours] 12 80.1 74.4 67.7 49.4 41.4 74.2 68.8 87.8 41.9 70.1 50.2 86.1 77.3 81.1 70.4 33.3 67.0 63.3 77.2 60.0 66.1 FRCN [ours] 07++12 82.0 77.8 71.6 55.3 42.4 77.3 71.7 89.3 44.5 72.1 53.7 87.7 80.0 82.5 72.7 36.6 68.7 65.4 81.1 62.7 68.8

Problem: bottleneck is now generating proposals Selective search [Uijlings 13] ~ 10 sec/image on CPU EdgeBoxes [Zitnick 14] ~ 0.2 sec/image on CPU Input image: (of any size) Object Proposal generator DNN (conv layers only!) List of proposed regions (~2000) Response maps ROI pooling layer for each proposed region Pixel region (of canonical size) Fullyconnected layers class-label softmax bbox regression softmax object label bbox Idea: why not predict regions from the convolutional feature maps that must be computed for detection anyway? (share computation between proposals and detection)

Faster R-CNN using a region proposal network (RPN) [Ren 2015] Input image: (of any size) Region proposal network List of proposed regions for each proposed region DNN (conv layers only!) ROI pooling layer Response maps

Faster R-CNN using a region proposal network (RPN) Input image: (of any size) DNN (conv layers only!) Response maps WxHx512 512 3x3 conv filters (3x3x512x512 weights) 1x1 conv (2-way softmax) 512 x (9*2) weights 1x1 conv (bbox regressor) 512 x (9x4) weights objectness score (for 9 boxes) bbox offset (for 9 boxes) 3x3 conv projects into 512-element vector per spatial position (assuming VGG input conv layers, cls layer 2k scores 4k coordinates reg layer k anchor boxes receptive field for each output is ~228x228 pixels) 256-d intermediate layer At each point assume 9 anchor boxes of various aspect ratios and scales Given 512-element vector predict objectness score of each anchor + bbox correction to anchor sliding window conv feature map

Training faster R-CNN 512 3x3 conv filters (3x3x512x512 weights) Input image: (of any size) 1x1 conv (2-way softmax) 512 x (9*2) weights objectness score (for 9 boxes) List of proposed regions DNN (conv layers) Response maps WxHx512 1x1 conv (bbox regressor) 512 x (9x4) weights bbox offset (for 9 boxes) ROI pooling layer Goal: want to jointly learn - Region prediction network weights - Object classification network weights - While constraining initial conv layers to be the same (for efficiency) for each proposed region Pixel region (of canonical size) Fullyconnected layers class-label softmax bbox regression softmax object label bbox

Alternating training strategy Train RPN Then use trained RPN to train Fast R-CNN Use conv layers from R-CNN to initialize RPN Fine-tune RPN Use updated RPN to fine tune Fast R-CNN Repeat Notice: solution learns to predict boxes that are good for object-detection task - End-to-end optimization for object-detection task - Compare to using off-the-shelf object-proposal algorithm

Faster R-CNN results Specializing region proposals for object-detection task yields better accuracy. SS = uk:8080/anonymous/ynplxb.html. selective search : http://host.robots.ox.ac.uk:8080/anonymous/x method # proposals data map (%) SS 2000 12 65.7 SS 2000 07++12 68.4 RPN+VGG, shared 300 12 67.0 RPN+VGG, shared 300 07++12 70.4 RPN+VGG, shared 300 COCO+07++12 75.9 Shared convolutions improve algorithm performance: Times in ms model system conv proposal region-wise total rate VGG SS + Fast R-CNN 146 1510 174 1830 0.5 fps VGG RPN + Fast R-CNN 141 10 47 198 5 fps ZF RPN + Fast R-CNN 31 3 25 59 17 fps

Summary Detailed knowledge of algorithm and properties of DNN used to gain algorithmic speedups - Not just tune the schedule of the loops Key insight: sharing results of convolutional layer computations: - Between different proposed regions - Between region proposal logic and detection logic Push for end-to-end training - Clean: back-propagate through entire algorithm to train all components at once - Better accuracy: globally optimize the various parts of the algorithm to be optimal for task (here: how to propose boxes learned simultaneously with detection logic) - Can constrain learning to preserve performance characteristics (conv layer weights must age shared across RPN and detection task)

Emerging theme (from today s lecture and the Inception, SqueezeNet, and related readings) Computer vision practitioners are programming via lowlevel manipulation of DNN topology - See shift from reasoning about individual layers to writing up of basic microarchitecture modules (e.g., Inception module) What programming model constructs or automated compilation tools could help raise the level of abstraction?