Why Is Recognition Hard? Object Recognizer


Why Is Recognition Hard? Object Recognizer panda

Why is Recognition Hard? Object Recognizer panda Pose

Why is Recognition Hard? Object Recognizer panda Occlusion

Why is Recognition Hard? Object Recognizer panda Multiple objects

Why is Recognition Hard? Object Recognizer panda Inter-class similarity

Ideal Features. Ideal Feature Extractor output: window, top-left; clock, top-middle; shelf, left; drawing, middle; statue, bottom left; hat, bottom right. Q.: What objects are in the image? Where is the clock? What is on the top of the table?

Ideal Features are Non-Linear. I1: Ideal Feature Extractor -> club, angle = 90; man, frontal pose ... Ideal Feature Extractor -> club, angle = 270; man, frontal pose ... I2: Ideal Feature Extractor -> club, angle = 360; man, side pose ... INPUT IS NOT THE AVERAGE!

The Manifold of Natural Images

Ideal Feature Extraction. [Figure: the ideal feature extractor maps pixel space (Pixel 1, Pixel 2, ..., Pixel n) to a space with meaningful axes such as Pose and Expression.]

Learning Non-Linear Features f(x; θ) features Q.: Which class of non-linear functions shall we consider?

Learning Non-Linear Features. Given a dictionary of simple non-linear functions g_1, ..., g_n. Proposal #1: linear combination, f(x) ≈ Σ_j α_j g_j(x). Proposal #2: composition, f(x) ≈ g_1(g_2(... g_n(x) ...)).

Learning Non-Linear Features. Given a dictionary of simple non-linear functions g_1, ..., g_n. Proposal #1: linear combination, f(x) ≈ Σ_j α_j g_j(x) (kernel learning, boosting). Proposal #2: composition, f(x) ≈ g_1(g_2(... g_n(x) ...)) (deep learning; scattering networks / wavelet cascade; S.C. Zhu & D. Mumford grammar).

Linear Combination. Input image -> template matchers -> prediction of class. BAD: it may require an exponential number of templates!!!

Composition. Input image -> low-level parts -> mid-level parts -> high-level parts -> prediction of class. Reuse of intermediate parts: distributed representations. GOOD: (exponentially) more efficient.

Convolutional Neural Network

Convolution. Definition from Wolfram MathWorld: a convolution is an integral that expresses the amount of overlap of one function g as it is shifted over another function f. It therefore "blends" one function with another. f(x, y): input image; h(x, y): kernel (filter); g(x, y): output image. g = f ∗ h, where g(x, y) = ∫∫ f(x + u, y + v) h(u, v) du dv (continuous case) and g(x, y) = Σ_{u,v} f(x + u, y + v) h(u, v) (discrete case).
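A minimal NumPy sketch of the discrete formula above (function and variable names are illustrative):

```python
import numpy as np

def conv2d(f, h):
    """Slide's discrete formula: g(x, y) = sum_{u,v} f(x+u, y+v) * h(u, v).

    f: 2-D input image, h: 2-D kernel; 'valid' output, no padding.
    (Strictly this is cross-correlation; CNN libraries use the same convention.)
    """
    kh, kw = h.shape
    H, W = f.shape
    g = np.zeros((H - kh + 1, W - kw + 1))
    for x in range(g.shape[0]):
        for y in range(g.shape[1]):
            g[x, y] = np.sum(f[x:x + kh, y:y + kw] * h)
    return g

# Example: 3x3 averaging (blur) kernel applied to a random 8x8 image.
image = np.random.rand(8, 8)
kernel = np.ones((3, 3)) / 9.0
print(conv2d(image, kernel).shape)   # (6, 6)
```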

Connectivity & Weight Sharing. Fully connected: all different weights. Locally connected: all different weights. Convolutional: shared weights. A convolution layer has a much smaller number of parameters thanks to local connectivity and weight sharing.

Fully Connected Layer. Example: 1000x1000 image, 1M hidden units -> 10^12 parameters!!! Spatial correlation is local; better to put resources elsewhere!

Locally Connected Layer Example: 1000x1000 image 1M hidden units Filter size: 10x10 100M parameters Filter/Kernel/Receptive field: input patch which the hidden unit is connected to.

Locally Connected Layer STATIONARITY? Statistics are similar at different locations (translation invariance) Example: 1000x1000 image 1M hidden units Filter size: 10x10 100M parameters

Convolution Layer Share the same parameters across different locations: Convolutions with learned kernels

Convolution Layer Learn multiple filters. E.g.: 1000x1000 image 100 Filters Filter size: 10x10 10K parameters
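The three parameter counts above (fully connected, locally connected, convolutional) follow from simple arithmetic; a small Python sketch, biases ignored:

```python
# Parameter counts for the three examples in the slides (biases ignored).
image_pixels = 1000 * 1000          # 1000x1000 input image
hidden_units = 1_000_000            # 1M hidden units
filter_size  = 10 * 10              # 10x10 filter
num_filters  = 100                  # 100 learned kernels

fully_connected   = image_pixels * hidden_units      # every unit sees every pixel
locally_connected = hidden_units * filter_size       # every unit has its own 10x10 filter
convolutional     = num_filters * filter_size        # 100 shared 10x10 filters

print(f"{fully_connected:.0e}, {locally_connected:.0e}, {convolutional:.0e}")
# 1e+12, 1e+08, 1e+04  ->  10^12, 100M, 10K parameters
```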

Convolution Layer hidden unit / filter response

Convolution Layer output feature map 3D kernel (filter) Input feature maps

Convolution Layer output feature map many 3D kernels (filters) Input feature maps NOTE: the no. of output feature maps is usually larger than the no. of input feature maps

Convolution Layer Convolutional Layer Input feature maps Output feature map NOTE: the no. of output feature maps is usually larger than the no. of input feature maps

Key Ideas: Convolutional Nets. A standard neural net applied to images: scales quadratically with the size of the input and does not leverage stationarity. Solution: connect each hidden unit to a small patch of the input and share the weights across hidden units. This is called a convolutional network.

Convolution Layer. Detect the same feature at different positions in the input image. [Figure: input -> filter (kernel) -> features.]

Non-linearity or Activation Function. Sigmoid: 1 / (1 + e^(-x)). Tanh. Rectified Linear Unit (ReLU): max(0, x). ReLU simplifies back-propagation, makes learning faster, makes features sparse; it is the preferred option.
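A tiny NumPy sketch of two of the activations mentioned above (function names are mine):

```python
import numpy as np

def sigmoid(x):
    """Sigmoid: 1 / (1 + e^(-x)); saturates for large |x|, which slows learning."""
    return 1.0 / (1.0 + np.exp(-x))

def relu(x):
    """ReLU: max(0, x); cheap to compute and yields sparse feature responses."""
    return np.maximum(0.0, x)

x = np.array([-2.0, -0.5, 0.0, 0.5, 2.0])
print(sigmoid(x))   # values squashed into (0, 1)
print(relu(x))      # negative inputs become exactly 0
```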

Special Layers. Pooling (average, L2, max): pool layer i over a neighborhood N(x, y) to produce layer i+1 at location (x, y), e.g. h_{i+1,x,y} = max_{(j,k) ∈ N(x,y)} h_{i,j,k}. Local Contrast Normalization (over space/features): h_{i+1,x,y} = (h_{i,x,y} - m_{i,x,y}) / σ_{i,x,y}.

Pooling. Let us assume the filter is an eye detector. Q.: how can we make the detection robust to the exact location of the eye?

Pooling. By pooling (e.g., taking the max of) filter responses at different locations, we gain robustness to the exact spatial location of features: h_{i+1,x,y} = max_{(j,k) ∈ N(x,y)} h_{i,j,k}.
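A short NumPy sketch of max pooling as defined above (window size and stride are illustrative):

```python
import numpy as np

def max_pool(h, size=2, stride=2):
    """Max pooling: h_{i+1,x,y} = max over the size x size neighborhood N(x, y)."""
    H, W = h.shape
    out_h = (H - size) // stride + 1
    out_w = (W - size) // stride + 1
    out = np.zeros((out_h, out_w))
    for x in range(out_h):
        for y in range(out_w):
            out[x, y] = h[x * stride:x * stride + size,
                          y * stride:y * stride + size].max()
    return out

responses = np.random.rand(8, 8)   # e.g. responses of an "eye detector" filter
print(max_pool(responses).shape)   # (4, 4): small shifts of the eye give the same max
```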

Pooling Layer. NOTE: 1) the number of output feature maps is the same as the number of input feature maps; 2) spatial resolution is reduced (each patch is collapsed into one value; stride > 1). Input feature maps -> output feature maps.

Role of Pooling. Invariance to small transformations. Reduces the effect of noise, shift, and distortion. (Max or sum pooling.)

Local Contrast Normalization. h_{i+1,x,y} = (h_{i,x,y} - m_{i,x,y}) / σ_{i,x,y}

Local Contrast Normalization. h_{i+1,x,y} = (h_{i,x,y} - m_{i,x,y}) / σ_{i,x,y}. We want the same response.

Local Contrast Normalization. h_{i+1,x,y} = (h_{i,x,y} - m_{i,x,y}) / σ_{i,x,y}. Performed also across features and in the higher layers. Effects: improves invariance, improves optimization, increases sparsity.
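A simplified NumPy sketch of the normalization formula above, with the mean and standard deviation taken over a local spatial neighborhood (neighborhood size and epsilon are assumptions):

```python
import numpy as np

def local_contrast_normalize(h, size=9, eps=1e-5):
    """h_{i+1,x,y} = (h_{i,x,y} - m_{i,x,y}) / sigma_{i,x,y}, with m and sigma
    computed over a size x size window around each location."""
    H, W = h.shape
    half = size // 2
    padded = np.pad(h, half, mode="reflect")
    out = np.zeros_like(h)
    for x in range(H):
        for y in range(W):
            patch = padded[x:x + size, y:y + size]
            out[x, y] = (h[x, y] - patch.mean()) / (patch.std() + eps)
    return out

feat = np.random.rand(16, 16) * 5.0 + 2.0      # feature map with arbitrary scale/offset
print(local_contrast_normalize(feat).std())    # roughly unit scale after normalization
```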

Convolutional Nets: Typical Architecture. One stage (zoom): Convol. -> LCN -> Pooling. The convolutional layer increases the number of feature maps; the pooling layer decreases the spatial resolution.

Convolutional Nets: Typical Architecture One stage (zoom) Convol. LCN Pooling Convol. LCN Pooling Example with only two filters

Convolutional Nets: Typical Architecture One stage (zoom) Convol. LCN Pooling Convol. LCN Pooling A hidden unit in the first hidden layer is influenced by a small neighborhood (equal to size of filter).

Convolutional Nets: Typical Architecture One stage (zoom) Convol. LCN Pooling Convol. LCN Pooling A hidden unit after the pooling layer is influenced by a larger neighborhood (it depends on filter sizes and strides).

Convolutional Nets: Typical Architecture. Whole system: Input Image -> 1st stage -> 2nd stage -> 3rd stage -> Fully Conn. Layers -> Class Labels, where each stage (zoom) is Convol. -> LCN -> Pooling. After a few stages, the residual spatial resolution is very small: we have learned a descriptor for the whole image.
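A hedged PyTorch sketch of this whole-system layout; layer sizes, the 32x32 input, and the use of LocalResponseNorm as a stand-in for LCN are my assumptions, not prescribed by the slides:

```python
import torch
import torch.nn as nn

def stage(in_ch, out_ch):
    # One stage: convolution (more feature maps) -> normalization -> pooling (less resolution).
    return nn.Sequential(
        nn.Conv2d(in_ch, out_ch, kernel_size=5, padding=2),
        nn.ReLU(),
        nn.LocalResponseNorm(5),              # stand-in for local contrast normalization
        nn.MaxPool2d(kernel_size=2, stride=2),
    )

model = nn.Sequential(
    stage(3, 32),                  # 1st stage
    stage(32, 64),                 # 2nd stage
    stage(64, 128),                # 3rd stage
    nn.Flatten(),
    nn.Linear(128 * 4 * 4, 256),   # fully connected layers on the whole-image descriptor
    nn.ReLU(),
    nn.Linear(256, 10),            # class labels
)

x = torch.randn(1, 3, 32, 32)      # toy 32x32 RGB input
print(model(x).shape)              # torch.Size([1, 10])
```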

Convolutional Nets: Typical Architecture. Conceptually similar to: SIFT -> K-Means -> Pyramid Pooling -> SVM (Lazebnik et al., "...Spatial Pyramid Matching...", CVPR 2006) and SIFT -> Fisher Vector -> Pooling -> SVM (Sanchez et al., "Image classification with the Fisher Vector: Theory and practice", IJCV 2012).

Convolutional Neural Network??? A special kind of multi-layer neural network. Implicitly extracts relevant features. A feed-forward network that can extract topological properties from an image. Like almost every other neural network, CNNs are trained with a version of the back-propagation algorithm.

Deep Learning Framework

Deep Learning Framework. [Comparison table; framework names were shown as logos. Rows: main contributor, language (API) (1st / 2nd / 3rd choices, with C++ among them), language (backend): C++ for all frameworks.]

Deep Learning Framework: Pre-trained CNN Models. Official: Inception v3, https://www.tensorflow.org/versions/r0.7/tutorials/image_recognition/index.html; others: VGG 16, AlexNet, all Caffe models (https://github.com/szagoruyko/loadcaffe). Model Zoo: https://github.com/bvlc/caffe/wiki/model-zoo. Others (by Facebook): Inception v1, Inception v3, ResNet, NIN, VGG_M / VGG_S, PlaceCNN, GoogLeNet, CaffeNet, AlexNet, VGG16 / VGG19, ResNet.

Available Codes (Deep Learning Frameworks). Computer vision applications: DPPnet (ImageQA), SimpleBaseline (ImageQA), VQA_LSTM_CNN (ImageQA), NeuralTalk2 (caption generation), Fast R-CNN (object detection), FCN (semantic segmentation), DeconvNet (semantic segmentation), DecoupledNet (semantic segmentation), Ask Neurons (ImageQA). Generative models: VAE (variational auto-encoder), DCGAN (generative adversarial net), VAM (visual analogy-making), semi-supervised Ladder Network, LAPGAN (generative adversarial net), action-conditional video prediction. Useful modules: STN (Spatial Transformer Networks). Deep reinforcement learning: DQN (toy example, Atari games). Neural Turing Machine / memory networks: MemNN (End-to-End Memory Network), NTM (Neural Turing Machine), N2N (Net2Net), simple algorithm learning, Learning to Execute. RNN / language: lstm-char-cnn (character-aware language model), char-rnn (character language model), lstm (long short-term memory), LRCN (Long-term Recurrent Convolutional Network), sequence-to-sequence learning.

Deep Learning Framework: Pros and Cons. [Flattened comparison table; framework names were shown as logos.] Pros mentioned: Python API; distributed training; abstraction (you don't have to know everything); many basic tensor operations; simple weight sharing; many open-source models; easy to add a new module (Lua mostly, C++ / CUDA sometimes); very flexible (only tensors and nn functions are used); more deep-learning libraries; state-of-the-art detection and semantic segmentation implementations; MATLAB / Python interfaces; less effort for CNN training. Cons mentioned: abstraction (hard to control every single detail); Lua (fewer libraries than Python); complex weight sharing; poor RNN support; less flexible (control flow is not supported); fewer basic tensor operations; adding a new layer requires C++ / CUDA (always).

Object Detection - Region-based CNN -

Region Proposals Find blobby image regions that are likely to contain objects. Class-agnostic object detector. Look for blob-like regions.

Region Proposals Selective Search Bottom-up segmentation, merging regions at multiple scales Convert regions to boxes Uijlings et al, Selective Search for Object Recognition, IJCV 2013

Region-based CNN (R-CNN). Input image -> extract region proposals (~2k/image) -> compute CNN features on regions -> classify and refine regions.

Region-based CNN (R-CNN) at test time, Step 1: input image -> extract region proposals (~2k/image) with Selective Search.

Region-based CNN (R-CNN) at test time, Step 2: compute CNN features on each (warped) region.

Region-based CNN (R-CNN) at test time, Step 3: classify regions.

Region-based CNN (R-CNN) at test time, Step 4: object proposal refinement. Linear regression on CNN features maps the original proposal, via bounding-box regression, to the predicted object bounding box.

Bounding-Box Regression

R-CNN Training: Step 1 Supervised pre-training Train AlexNet for the 1000-way ILSVRC image classification task.

R-CNN Training: Step 2. Fine-tune the CNN for detection. Transfer the representation learned for ILSVRC classification to PASCAL (or ImageNet detection). Instead of 1000 ImageNet classes, we want 20 object classes + background. Throw away the final fully connected layer and reinitialize it from scratch. Keep training the CNN model using positive/negative regions from the detection images.
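A hedged PyTorch/torchvision sketch of this step; the slides do not name a framework, so torchvision's AlexNet stands in for the ILSVRC-pretrained model:

```python
import torch.nn as nn
import torchvision.models as models

# Start from an ILSVRC-pretrained model (torchvision's AlexNet as a stand-in).
model = models.alexnet(pretrained=True)

# Throw away the final 1000-way classification layer and reinitialize it
# for 20 PASCAL object classes + background = 21 outputs.
num_classes = 20 + 1
model.classifier[6] = nn.Linear(model.classifier[6].in_features, num_classes)

# Training then continues on warped positive/negative region crops from the
# detection images (fine-tuning, typically with a small learning rate).
```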

R-CNN Training: Step 3 Extract Features Extract region proposals for all images. For each region: warp to CNN input size, run forward through CNN, save pool5 features to disk. Have a big hard drive: features are ~200 GB for PASCAL dataset.

R-CNN Training: Step 4 Train one binary SVM per class to classify region features

R-CNN Training: Step 5 Bounding-box regression: For each class, train a linear regression model to map from cached features to offsets to GT boxes to make up for slightly wrong proposals.

R-CNN Results on PASCAL VOC (metric: mean average precision, higher is better)
Method                                   VOC 2007   VOC 2010
DPM v5 (Girshick et al. 2011)            33.7%      29.6%
UVA sel. search (Uijlings et al. 2013)   -          35.1%
Regionlets (Wang et al. 2013)            41.7%      39.7%
SegDPM (Fidler et al. 2013)              -          40.4%
R-CNN                                    54.2%      50.2%
R-CNN + bbox regression                  58.5%      53.7%

ImageNet Detection New Challenge introduced in 2013 PASCAL-like image complexity 200 object categories

R-CNN on ImageNet Detection

Object Detection - Fast R-CNN -

R-CNN Problems. Slow at test time: need to run a full forward pass of the CNN for each region proposal. SVMs and regressors are post-hoc: CNN features are not updated in response to the SVMs and regressors. Complex, very time-consuming multi-stage training pipeline (84 hours total): 2k proposals x (conv. time + FC time): 18 hours; feature extraction: 63 hours (disk I/O is costly); SVM and bounding-box regressor training: 3 hours.

Fast R-CNN Testing R-CNN Problem #1: Slow at test time due to independent forward passes of the CNN. Solution: Share computation of convolutional layers between proposals for an image.

Fast R-CNN Training R-CNN Problem #2: Post-hoc training CNN not updated in response to final classifiers and regressors. R-CNN Problem #3: Complex training pipeline. Solution: Just train the whole system end-to-end all at once.

Fast R-CNN. The convolutional feature maps are computed once over the entire image and shared across all proposals. Train the detector in a single stage, end-to-end: no caching features to disk, no post-hoc training steps. 1 x conv. time + 2k x FC time: 8.75 hours.

Fast R-CNN Testing Regions of Interest (RoIs) from a proposal method ConvNet conv5 feature map of image Forward whole image through ConvNet Input image

Fast R-CNN Testing Regions of Interest (RoIs) from a proposal method ConvNet RoI Pooling layer conv5 feature map of image Forward whole image through ConvNet Input image

Fast R-CNN Testing Softmax classifier Linear + softmax FCs Fully-connected layers Regions of Interest (RoIs) from a proposal method ConvNet RoI Pooling layer conv5 feature map of image Forward whole image through ConvNet Input image

Fast R-CNN Testing. Input image -> forward whole image through ConvNet -> conv5 feature map of image; Regions of Interest (RoIs) from a proposal method -> RoI Pooling layer -> fully-connected layers (FCs) -> softmax classifier (Linear + softmax) and bounding-box regressors (Linear).
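A rough NumPy sketch of what the RoI pooling layer does: the RoI on the shared conv5 feature map is divided into a fixed grid and each cell is max-pooled, giving a fixed-size input for the FCs. Shapes, coordinates, and the function name are illustrative, not the paper's implementation:

```python
import numpy as np

def roi_pool(feature_map, roi, output_size=7):
    """Divide the RoI into an output_size x output_size grid and max-pool each cell.
    feature_map: (C, H, W) array; roi: (x0, y0, x1, y1) in feature-map coordinates."""
    C = feature_map.shape[0]
    x0, y0, x1, y1 = roi
    xs = np.linspace(x0, x1, output_size + 1).astype(int)
    ys = np.linspace(y0, y1, output_size + 1).astype(int)
    out = np.zeros((C, output_size, output_size))
    for i in range(output_size):          # rows (y)
        for j in range(output_size):      # cols (x)
            cell = feature_map[:, ys[i]:max(ys[i + 1], ys[i] + 1),
                                  xs[j]:max(xs[j + 1], xs[j] + 1)]
            out[:, i, j] = cell.reshape(C, -1).max(axis=1)
    return out

conv5 = np.random.rand(256, 38, 50)           # conv5 feature map of the whole image
pooled = roi_pool(conv5, roi=(10, 5, 30, 20))
print(pooled.shape)                           # (256, 7, 7): fixed size for the FC layers
```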

Fast R-CNN Training. A multi-task loss (log loss + smooth L1 loss) is applied on top of the Linear + softmax classifier and the Linear bounding-box regressors, and gradients flow back through the FCs into the trainable ConvNet.

Fast R-CNN Results (all methods use VGG-16; timings exclude object proposal time, which is equal for all methods)
                    Fast R-CNN   R-CNN [1]   SPP-net [2]
Train time (h)      9.5          84          25
 - Train speedup    8.8x         1x          3.4x
Test time / image   0.32s        47.0s       2.3s
 - Test speedup     146x         1x          20x
mAP                 66.9%        66.0%       63.1%

Fast R-CNN

Fast R-CNN Results (all methods use VGG-16)
                    R-CNN        Fast R-CNN
Training time (h)   84           9.5
 - Speedup          1x           8.8x
Test time / image   47 seconds   0.32 seconds
 - Speedup          1x           146x
mAP (VOC 2007)      66.0         66.9

Object Detection - Faster R-CNN -

Fast R-CNN Problems: test-time speeds don't include region proposals.
Model        Time
R-CNN        SS time + 2000 x conv. time + 2000 x FC time
Fast R-CNN   SS time + 1 x conv. time + 2000 x FC time
Including time for selective search:
                                          R-CNN        Fast R-CNN
Test time / image                         47 seconds   0.32 seconds
 - Speedup                                1x           146x
Test time / image with selective search   50 seconds   2 seconds
 - Speedup                                1x           25x

Fast R-CNN Problems: test-time speeds don't include region proposals; with selective search included, Fast R-CNN still takes ~2 seconds per image. Solution: just make the CNN do region proposals too!!!

Faster R-CNN

Faster R-CNN Insert Region Proposal Network (RPN) after the last convolutional layer. RPN trained to produce region proposals directly; No need for external proposals. After RPN, use RoI pooling and an upstream classifier and bounding box regressor just like Fast R-CNN.

Region Proposal Networks (RPN). Goal: share computation with a Fast R-CNN object detection network. RPN input: an image of any size. Output: rectangular object proposals, each with an objectness score.

Region Proposal Networks (RPN). Slide a small window over the feature map and build a small network for: classifying object vs. not-object, and regressing bounding-box locations. The position of the sliding window provides localization information with reference (anchor) to the image; bounding-box regression provides finer localization information with reference (anchor) to this sliding window.

Region Proposal Networks (RPN). Use k anchor boxes at each location (e.g., k = 9: 3 scales x 3 aspect ratios). Anchors are translation invariant: use the same ones at every location. Regression gives offsets from the anchor boxes; classification gives the probability that each (regressed) anchor shows an object.
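A small NumPy sketch of generating the k = 9 anchors; the base size, scales, and ratios are common defaults assumed for illustration, not taken from the slides:

```python
import numpy as np

def make_anchors(base_size=16, scales=(8, 16, 32), ratios=(0.5, 1.0, 2.0)):
    """Generate k = len(scales) * len(ratios) anchors centered at the origin,
    as (x0, y0, x1, y1). 3 scales x 3 aspect ratios -> k = 9."""
    anchors = []
    for scale in scales:
        for ratio in ratios:
            area = (base_size * scale) ** 2
            w = np.sqrt(area / ratio)      # ratio = h / w
            h = w * ratio
            anchors.append([-w / 2, -h / 2, w / 2, h / 2])
    return np.array(anchors)

# The same 9 anchors are replicated at every sliding-window position (translation invariant).
anchors = make_anchors()
print(anchors.shape)                                  # (9, 4)
shifted = anchors + np.array([100, 100, 100, 100])    # the 9 anchors centered at (100, 100)
```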

RPN: Loss function. L({p_i}, {t_i}) = (1 / N_cls) Σ_i L_cls(p_i, p_i*) + λ (1 / N_reg) Σ_i p_i* L_reg(t_i, t_i*)

RPN: Loss function - classification. L_cls(p_i, p_i*) is a 2-class softmax loss. Discriminative training: p_i* = 1 if IoU > 0.7, p_i* = 0 if IoU < 0.3; otherwise the anchor does not contribute to the loss. Full loss: L({p_i}, {t_i}) = (1 / N_cls) Σ_i L_cls(p_i, p_i*) + λ (1 / N_reg) Σ_i p_i* L_reg(t_i, t_i*).

RPN: Loss function - regression. L_reg(t, t*) = Σ_{i ∈ {x, y, w, h}} smooth_L1(t_i - t_i*), where smooth_L1(x) = 0.5 x^2 if |x| < 1, and |x| - 0.5 otherwise. Box parameterization: t_x = (x - x_a) / w_a, t_y = (y - y_a) / h_a, t_w = log(w / w_a), t_h = log(h / h_a); t_x* = (x* - x_a) / w_a, t_y* = (y* - y_a) / h_a, t_w* = log(w* / w_a), t_h* = log(h* / h_a). Only positive samples contribute to the regression loss. Full loss: L({p_i}, {t_i}) = (1 / N_cls) Σ_i L_cls(p_i, p_i*) + λ (1 / N_reg) Σ_i p_i* L_reg(t_i, t_i*).
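A NumPy sketch of the smooth L1 loss and the box parameterization above; the (x, y, w, h) center format and the example numbers are illustrative:

```python
import numpy as np

def smooth_l1(x):
    """smooth_L1(x) = 0.5 x^2 if |x| < 1, |x| - 0.5 otherwise."""
    x = np.abs(x)
    return np.where(x < 1, 0.5 * x ** 2, x - 0.5)

def encode(box, anchor):
    """Parameterize a box (center x, y and size w, h) relative to an anchor,
    following the t / t* definitions above."""
    x, y, w, h = box
    xa, ya, wa, ha = anchor
    return np.array([(x - xa) / wa, (y - ya) / ha, np.log(w / wa), np.log(h / ha)])

# Regression loss for one positive anchor (illustrative numbers):
anchor   = np.array([100.0, 100.0, 64.0, 64.0])   # (x_a, y_a, w_a, h_a)
proposal = np.array([104.0, 98.0, 70.0, 60.0])    # predicted box
gt_box   = np.array([106.0, 97.0, 72.0, 58.0])    # ground-truth box
t, t_star = encode(proposal, anchor), encode(gt_box, anchor)
loss = smooth_l1(t - t_star).sum()                # summed over i in {x, y, w, h}
print(loss)
```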

Bounding-Box Regression

How to train Faster R-CNN. Balanced sampling (pos. vs. neg. = 1:1). An anchor is labeled as positive if it is the anchor (or anchors) with the highest IoU overlap with a GT box, or if its IoU overlap with a GT box is higher than 0.7. Negative labels are assigned to anchors whose IoU is lower than 0.3 for all GT boxes. Alternating optimization (in the first paper): 4-step training. Joint training (since publication): one network, four losses.
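A NumPy sketch of these labeling rules; the (x0, y0, x1, y1) box format and helper names are assumptions for illustration:

```python
import numpy as np

def iou(box, gt):
    """Intersection-over-union of two (x0, y0, x1, y1) boxes."""
    ix0, iy0 = max(box[0], gt[0]), max(box[1], gt[1])
    ix1, iy1 = min(box[2], gt[2]), min(box[3], gt[3])
    inter = max(0.0, ix1 - ix0) * max(0.0, iy1 - iy0)
    area = lambda b: (b[2] - b[0]) * (b[3] - b[1])
    return inter / (area(box) + area(gt) - inter)

def label_anchors(anchors, gt_boxes, hi=0.7, lo=0.3):
    """1 = positive, 0 = negative, -1 = ignored (does not contribute to the loss)."""
    overlaps = np.array([[iou(a, g) for g in gt_boxes] for a in anchors])
    best = overlaps.max(axis=1)
    labels = np.full(len(anchors), -1)
    labels[best < lo] = 0                    # negative: IoU < 0.3 for all GT boxes
    labels[best > hi] = 1                    # positive: IoU > 0.7 with some GT box
    labels[overlaps.argmax(axis=0)] = 1      # plus the highest-IoU anchor per GT box
    return labels

anchors = np.array([[0, 0, 50, 50], [100, 100, 180, 180], [20, 20, 60, 60]])
gt      = np.array([[95, 95, 175, 175]])
print(label_anchors(anchors, gt))            # [0 1 0]
# A balanced mini-batch then samples positives and negatives 1:1 from these labels.
```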

How to train Faster R-CNN. Alternating optimization. Step 1: train the RPN, initialized with an ImageNet pre-trained model (no parameter sharing). Step 2: train Fast R-CNN using the proposals generated in Step 1 (no parameter sharing). Step 3: use the convolutional layers trained in Step 2 to initialize the model, fix the shared convolutional layers, and update the RPN. Step 4: with the shared convolutional layers fixed, fine-tune the fully connected layers of Fast R-CNN.

How to train faster R-CNN Joint training one network, four losses RPN classification (anchor good / bad) RPN regression (anchor proposal) Fast R-CNN classification (over classes) Fast R-CNN regression (proposal box)

Object Detection: Datasets
Dataset                   PASCAL VOC   ImageNet Detection (ILSVRC)   MS COCO
# of classes              20           200                           80
# of images (train+val)   ~20k         ~470k                         ~120k
Avg. objects / image      2.4          1.1                           7.2

VOC 2007, ZF Conv. Net Results

Results VOC 2007, using VGG-16, Fast R-CNN vs. Faster R-CNN

Results VOC 2012, using VGG-16, Fast R-CNN vs. Faster R-CNN

Results VOC 2007, using VGG-16, different settings of anchors

Fast R-CNN vs. Faster R-CNN Results

Results
Model          Time
R-CNN          SS time + 2000 x conv. time + 2000 x FC time
Fast R-CNN     SS time + 1 x conv. time + 2000 x FC time
Faster R-CNN   1 x conv. time + 300 x FC time

                                     R-CNN        Fast R-CNN   Faster R-CNN
Test time / image (with proposals)   50 seconds   2 seconds    0.2 seconds
 - Speedup                           1x           25x          250x
mAP (VOC 2007)                       66.0         66.9         69.9

Summary: Object Detection. Find a variable number of objects by classifying image regions. Before CNNs: dense multi-scale sliding windows (HOG, DPM). Avoid dense sliding windows with region proposals. R-CNN: selective search + CNN classification / bounding-box regression. Fast R-CNN: swap the order of convolutions and region extraction. Faster R-CNN: compute region proposals within the network.