Why Is Recognition Hard?
[Figure: an object recognizer applied to panda images]
Challenges: pose variation, occlusion, multiple objects, inter-class similarity.
Ideal Features
[Figure: an ideal feature extractor outputs a list of objects and locations: window (top-left), clock (top-middle), shelf (left), drawing (middle), statue (bottom-left), hat (bottom-right).]
Q.: What objects are in the image? Where is the clock? What is on top of the table?
Ideal Features are Non-Linear
[Figure: the ideal feature extractor applied to images I_1 and I_2 and to a blend of them, producing (club angle, man pose) descriptions such as (90, frontal), (270, frontal), (360, side).]
THE INPUT IS NOT THE AVERAGE! Averaging two images does not average their ideal features.
The Manifold of Natural Images
Ideal Feature Extraction
[Figure: the ideal feature extractor maps pixel space (Pixel 1, Pixel 2, ..., Pixel n) to a low-dimensional space of factors such as pose and expression.]
Learning Non-Linear Features
x → f(x; θ) → features
Q.: Which class of non-linear functions shall we consider?
Learning Non-Linear Features
Given a dictionary of simple non-linear functions g_1, ..., g_n:
Proposal #1: linear combination f(x) = Σ_j α_j g_j(x) → kernel learning, boosting
Proposal #2: composition f(x) = g_1(g_2(... g_n(x))) → deep learning, scattering networks (wavelet cascade), S.-C. Zhu & D. Mumford's grammar
Linear Combination
[Figure: input image matched against templates → prediction of class]
Template matchers... BAD: it may require an exponential number of templates!
Composition
[Figure: input image → low-level parts → mid-level parts → high-level parts → prediction of class]
Reuse of intermediate parts → distributed representations. GOOD: (exponentially) more efficient.
Convolutional Neural Network
Convolution
Definition from Wolfram MathWorld: a convolution is an integral that expresses the amount of overlap of one function g as it is shifted over another function f. It therefore "blends" one function with another.
  f(x, y): input image    h(x, y): kernel (filter)    g(x, y): output image
Continuous: g(x, y) = (f * h)(x, y) = ∬ f(x + u, y + v) h(u, v) du dv
Discrete:   g(x, y) = Σ_{u,v} f(x + u, y + v) h(u, v)
(With the + sign inside f this is strictly cross-correlation; CNN literature uses the term convolution loosely.)
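The discrete formula above can be sketched in a few lines of pure Python (a toy illustration on nested lists, not an efficient implementation; `conv2d` and the example image are made up for this sketch):

```python
def conv2d(f, h):
    """Discrete 'convolution' as on the slide: g(x, y) = sum_{u,v} f(x+u, y+v) h(u, v).
    f: input image (list of lists), h: kernel. Valid padding: the output shrinks."""
    H, W = len(f), len(f[0])
    kH, kW = len(h), len(h[0])
    out = []
    for x in range(H - kH + 1):
        row = []
        for y in range(W - kW + 1):
            row.append(sum(f[x + u][y + v] * h[u][v]
                           for u in range(kH) for v in range(kW)))
        out.append(row)
    return out

# A 3x3 averaging kernel applied to a 4x4 image gives a 2x2 output map.
img = [[1, 2, 3, 4],
       [5, 6, 7, 8],
       [9, 10, 11, 12],
       [13, 14, 15, 16]]
box = [[1 / 9] * 3 for _ in range(3)]
print(conv2d(img, box))
```

In a convolution layer the kernel `h` is not fixed like this box filter; it is learned by back-propagation.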
Connectivity & Weight Sharing
Fully connected: all different weights. Locally connected: all different weights. Convolutional: shared weights.
The convolution layer has a much smaller number of parameters thanks to local connectivity and weight sharing.
Fully Connected Layer
Example: 1000x1000 image, 1M hidden units → 10^12 parameters!
Spatial correlation is local; better to put resources elsewhere.
Locally Connected Layer
Example: 1000x1000 image, 1M hidden units, filter size 10x10 → 100M parameters.
Filter / kernel / receptive field: the input patch to which a hidden unit is connected.
STATIONARITY? Statistics are similar at different locations (translation invariance).
Convolution Layer Share the same parameters across different locations: Convolutions with learned kernels
Convolution Layer
Learn multiple filters. E.g.: 1000x1000 image, 100 filters of size 10x10 → 10K parameters.
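The parameter counts quoted on the last few slides are simple arithmetic and can be checked directly (variable names are ours; the sketch ignores bias terms, as the slides do):

```python
# Parameter counts for the slide examples (1000x1000 input, 1M hidden units).
fully_connected = 1_000_000 * (1000 * 1000)   # every unit sees every pixel
locally_connected = 1_000_000 * (10 * 10)     # each unit sees its own 10x10 patch
convolutional = 100 * (10 * 10)               # 100 shared 10x10 filters

print(fully_connected)    # 10**12
print(locally_connected)  # 10**8  (100M)
print(convolutional)      # 10**4  (10K)
```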
Convolution Layer
[Figure: each hidden unit is one filter response; a 3D kernel (filter) maps the input feature maps to one output feature map; many 3D kernels produce many output feature maps.]
NOTE: the number of output feature maps is usually larger than the number of input feature maps.
Key Ideas: Convolutional Nets
A standard neural net applied to images scales quadratically with the size of the input and does not leverage stationarity. Solution: connect each hidden unit to a small patch of the input, and share the weights across hidden units. This is called a convolutional network.
Convolution Layer
Detect the same feature at different positions in the input image.
[Figure: a filter (kernel) slid over the input produces a map of features]
Non-linearity / Activation Function
Sigmoid: 1 / (1 + e^{-x}); Tanh.
Rectified Linear Unit (ReLU): max(0, x). Simplifies back-propagation, makes learning faster, makes features sparse. Preferred option.
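The two activations from this slide, as plain functions (a minimal sketch; real layers apply them elementwise to whole tensors):

```python
import math

def sigmoid(x):
    """Squashes any real input into (0, 1)."""
    return 1.0 / (1.0 + math.exp(-x))

def relu(x):
    """Rectified Linear Unit: zeroes out negative responses."""
    return max(0.0, x)

# ReLU's hard zero on negatives is what makes the feature vector sparse.
print([relu(x) for x in (-2.0, -0.5, 0.0, 1.5)])  # [0.0, 0.0, 0.0, 1.5]
print(sigmoid(0.0))                               # 0.5
```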
Special Layers
Pooling (average, L2, max) over a neighborhood N(x, y) of layer i:
  H_{i+1,x,y} = max_{(j,k) ∈ N(x,y)} h_{i,j,k}
Local Contrast Normalization (over space/features):
  H_{i+1,x,y} = (h_{i,x,y} - m_{i,x,y}) / σ_{i,x,y}
Pooling
Let us assume the filter is an eye detector. Q.: how can we make the detection robust to the exact location of the eye?
Pooling
By pooling (e.g., taking the max of) filter responses at different locations, we gain robustness to the exact spatial location of features.
  H_{i+1,x,y} = max_{(j,k) ∈ N(x,y)} h_{i,j,k}
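The max-pooling formula above, sketched in pure Python (a toy non-overlapping 2x2 version; `max_pool2d` and the example map are ours):

```python
def max_pool2d(h, size=2, stride=2):
    """Max pooling: each output value is the max over a size x size window."""
    H, W = len(h), len(h[0])
    out = []
    for x in range(0, H - size + 1, stride):
        out.append([max(h[x + j][y + k]
                        for j in range(size) for k in range(size))
                    for y in range(0, W - size + 1, stride)])
    return out

fmap = [[1, 3, 2, 0],
        [4, 2, 1, 1],
        [0, 1, 5, 2],
        [2, 3, 2, 4]]
print(max_pool2d(fmap))  # [[4, 2], [3, 5]]
```

Shifting a strong response by one pixel inside its 2x2 window leaves the pooled output unchanged, which is exactly the robustness the slide asks for.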
Pooling Layer
[Figure: input feature maps → pooling → output feature maps]
NOTE: 1) the number of output feature maps is the same as the number of input feature maps; 2) the spatial resolution is reduced (each patch is collapsed into one value; stride > 1).
Role of Pooling
Invariance to small transformations; reduces the effect of noise, shift, and distortion. (Variants: max pooling, sum/average pooling.)
Local Contrast Normalization
  H_{i+1,x,y} = (h_{i,x,y} - m_{i,x,y}) / σ_{i,x,y}
where m and σ are the mean and standard deviation over a local neighborhood. We want the same response regardless of the local contrast of the input.
Performed also across features and in the higher layers.
Effects: improves invariance, improves optimization, increases sparsity.
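A toy sketch of the formula, assuming for simplicity that the "neighborhood" is the whole patch (a real LCN layer computes m and σ in a sliding window N(x, y)):

```python
import math

def lcn(h, eps=1e-5):
    """Subtract the mean and divide by the standard deviation of the patch:
    H[x][y] = (h[x][y] - m) / sigma."""
    vals = [v for row in h for v in row]
    m = sum(vals) / len(vals)
    sigma = math.sqrt(sum((v - m) ** 2 for v in vals) / len(vals)) + eps
    return [[(v - m) / sigma for v in row] for row in h]

patch = [[10, 20], [30, 40]]
bright = [[100, 200], [300, 400]]
# The same pattern at a different contrast gives (nearly) the same response,
# which is the invariance the slide is after.
print(lcn(patch))
print(lcn(bright))
```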
Convolutional Nets: Typical Architecture
One stage: Convolution → LCN → Pooling.
The convolutional layer increases the number of feature maps; the pooling layer decreases the spatial resolution.
[Figure: example with only two filters]
A hidden unit in the first hidden layer is influenced by a small neighborhood (equal to the size of the filter).
A hidden unit after the pooling layer is influenced by a larger neighborhood (it depends on filter sizes and strides).
Convolutional Nets: Typical Architecture
Whole system: Input Image → 1st stage → 2nd stage → 3rd stage → Fully Conn. Layers → Class Labels, where each stage is Convolution → LCN → Pooling.
After a few stages, the residual spatial resolution is very small: we have learned a descriptor for the whole image.
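The shrinking resolution can be traced with the standard output-size formula. A minimal sketch with hypothetical layer sizes (the slides do not specify them): a 64x64 input through three stages of 5x5 convolution (stride 1, no padding) plus 2x2 pooling (stride 2):

```python
def out_size(n, filt, stride=1, pad=0):
    """Spatial output size of a conv/pool layer: floor((n + 2p - f) / s) + 1."""
    return (n + 2 * pad - filt) // stride + 1

n = 64
for stage in range(1, 4):
    n = out_size(n, 5)     # 5x5 convolution, stride 1
    n = out_size(n, 2, 2)  # 2x2 pooling, stride 2
    print(f"after stage {stage}: {n}x{n}")  # 30x30, 13x13, 4x4
```

After three stages the map is 4x4: almost all spatial resolution is gone, and what remains is a whole-image descriptor.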
Convolutional Nets: Typical Architecture
Conceptually similar to:
  SIFT → K-Means → Pyramid Pooling → SVM (Lazebnik et al., "...Spatial Pyramid Matching...", CVPR 2006)
  SIFT → Fisher Vector → Pooling → SVM (Sanchez et al., "Image Classification with the Fisher Vector: Theory and Practice", IJCV 2012)
Convolutional Neural Network
A special kind of multi-layer neural network that implicitly extracts relevant features: a feed-forward network that can extract topological properties from an image. Like almost every other neural network, CNNs are trained with a version of the back-propagation algorithm.
Deep Learning Framework
Deep Learning Framework
[Table: comparison of frameworks (names shown as logos in the original) by main contributor, API languages (ranked; C++ among them), and backend language (C++ for all).]
Pre-trained CNN Models
Official (TensorFlow): Inception v3 (https://www.tensorflow.org/versions/r0.7/tutorials/image_recognition/index.html); others: VGG 16, AlexNet.
Caffe: all Caffe models (https://github.com/szagoruyko/loadcaffe); Model Zoo (https://github.com/bvlc/caffe/wiki/model-zoo): PlaceCNN, GoogLeNet, CaffeNet, AlexNet, VGG16 / VGG19, NIN, VGG_M / VGG_S, ResNet, ...
Others (by Facebook): Inception v1, Inception v3, ResNet.
Available Codes
Computer Vision Applications: DPPnet (ImageQA), SimpleBaseline (ImageQA), VQA_LSTM_CNN (ImageQA), Ask Neurons (ImageQA), NeuralTalk2 (caption generation), Fast R-CNN (object detection), FCN / DeconvNet / DecoupledNet (semantic segmentation).
Generative Models: VAE (variational auto-encoder), DCGAN / LAPGAN (generative adversarial nets), VAM (visual analogy-making), semi-supervised Ladder Network, action-conditional video prediction.
Useful Modules: STN (Spatial Transformer Networks).
Deep Reinforcement Learning: DQN (toy example; Atari games).
Neural Turing Machine / Memory Networks: MemNN (end-to-end memory network), NTM (Neural Turing Machine), N2N (Net2Net), simple algorithm learning, Learning to Execute.
RNN / Language: lstm-char-cnn (character-aware language modeling), char-rnn (character language model), lstm (long short-term memory), LRCN (long-term recurrent convolutional network), sequence-to-sequence learning.
Pros and Cons
[Table: pros and cons of three frameworks (names shown as logos in the original), covering API language (Python / Lua / MATLAB+Python), distributed training, level of abstraction, basic tensor operations, ease or complexity of weight sharing, flexibility (control flow, "only tensors and nn functions"), RNN support, available open-source code, state-of-the-art detection and semantic segmentation implementations, and the effort of adding a new module or layer (Lua vs. C++/CUDA).]
Object Detection - Region-based CNN -
Region Proposals Find blobby image regions that are likely to contain objects. Class-agnostic object detector. Look for blob-like regions.
Region Proposals Selective Search Bottom-up segmentation, merging regions at multiple scales Convert regions to boxes Uijlings et al, Selective Search for Object Recognition, IJCV 2013
Region-based CNN (R-CNN)
Pipeline: input image → extract region proposals (~2k per image) → compute CNN features on regions → classify and refine regions.
R-CNN at test time, Step 1: extract region proposals (Selective Search).
Region-based CNN (R-CNN)
R-CNN at test time, Step 2: compute CNN features for each region proposal.
Region-based CNN (R-CNN)
R-CNN at test time, Step 3: classify regions.
Region-based CNN (R-CNN)
R-CNN at test time, Step 4: object proposal refinement. A linear regression on CNN features maps the original proposal, via bounding-box regression, to the predicted object bounding box.
Bounding-Box Regression
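The bounding-box regression of this slide can be sketched with the standard R-CNN-family parameterization (the t_x, t_y, t_w, t_h offsets that also appear later in the RPN regression slide); function names and example boxes are ours:

```python
import math

def bbox_transform(proposal, gt):
    """Regression targets mapping a proposal (x, y = center, w, h) to a GT box:
    t_x = (x* - x)/w, t_y = (y* - y)/h, t_w = log(w*/w), t_h = log(h*/h)."""
    x, y, w, h = proposal
    xg, yg, wg, hg = gt
    return ((xg - x) / w, (yg - y) / h, math.log(wg / w), math.log(hg / h))

def bbox_apply(proposal, t):
    """Invert the transform: apply predicted offsets to the proposal."""
    x, y, w, h = proposal
    tx, ty, tw, th = t
    return (x + tx * w, y + ty * h, w * math.exp(tw), h * math.exp(th))

prop = (50.0, 50.0, 20.0, 40.0)   # slightly wrong proposal
gt = (55.0, 48.0, 30.0, 36.0)     # ground-truth box
t = bbox_transform(prop, gt)
print(bbox_apply(prop, t))        # recovers the ground-truth box
```

The log parameterization for width and height keeps predicted sizes positive and makes the targets scale-invariant.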
R-CNN Training: Step 1 Supervised pre-training Train AlexNet for the 1000-way ILSVRC image classification task.
R-CNN Training: Step 2
Fine-tune the CNN for detection. Transfer the representation learned for ILSVRC classification to PASCAL (or ImageNet detection): instead of 1000 ImageNet classes, we want 20 object classes + background. Throw away the final fully connected layer and reinitialize it from scratch, then keep training the CNN using positive/negative regions from detection images.
R-CNN Training: Step 3
Extract features. Extract region proposals for all images; for each region: warp to the CNN input size, run it forward through the CNN, and save the pool5 features to disk. Have a big hard drive: features are ~200 GB for the PASCAL dataset.
R-CNN Training: Step 4 Train one binary SVM per class to classify region features
R-CNN Training: Step 5 Bounding-box regression: For each class, train a linear regression model to map from cached features to offsets to GT boxes to make up for slightly wrong proposals.
R-CNN Results on PASCAL
                                          VOC 2007   VOC 2010
DPM v5 (Girshick et al. 2011)             33.7%      29.6%
UVA sel. search (Uijlings et al. 2013)    n/a        35.1%
Regionlets (Wang et al. 2013)             41.7%      39.7%
SegDPM (Fidler et al. 2013)               n/a        40.4%
R-CNN                                     54.2%      50.2%
R-CNN + bbox regression                   58.5%      53.7%
Metric: mean average precision (higher is better).
ImageNet Detection New Challenge introduced in 2013 PASCAL-like image complexity 200 object categories
R-CNN on ImageNet Detection
Object Detection - Fast R-CNN -
R-CNN Problems
Slow at test time: need to run a full forward pass of the CNN for each region proposal.
SVMs and regressors are post-hoc: CNN features are not updated in response to the SVMs and regressors.
Complex, very time-consuming multi-stage training pipeline (84 hours in total): fine-tuning on 2k proposals x (conv time + FC time) (18 hours); SVM and bounding-box regressor training (3 hours); feature extraction (63 hours; disk I/O is costly).
Fast R-CNN Testing R-CNN Problem #1: Slow at test time due to independent forward passes of the CNN. Solution: Share computation of convolutional layers between proposals for an image.
Fast R-CNN Training R-CNN Problem #2: Post-hoc training CNN not updated in response to final classifiers and regressors. R-CNN Problem #3: Complex training pipeline. Solution: Just train the whole system end-to-end all at once.
Fast R-CNN
The ConvNet is applied to the entire image, and its feature maps are shared across all proposals. Train the detector in a single stage, end-to-end: no caching of features to disk, no post-hoc training steps. 1 x conv time + 2k x FC time: 8.75 hours.
Fast R-CNN Testing
Input image → forward the whole image through the ConvNet → conv5 feature map of the image → RoI pooling layer over the Regions of Interest (RoIs) from a proposal method → fully-connected layers (FCs) → two output heads: linear + softmax (softmax classifier) and linear (bounding-box regressors).
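The RoI pooling layer in the pipeline above can be sketched as a toy: divide each RoI into a fixed grid and max-pool each cell, so every proposal yields a fixed-size feature regardless of its shape (real implementations also handle fractional cell boundaries and the feature-map stride; `roi_pool` and the example map are ours):

```python
def roi_pool(fmap, roi, out_size=2):
    """Divide the RoI (x0, y0, x1, y1, in feature-map coordinates) into an
    out_size x out_size grid and max-pool each cell."""
    x0, y0, x1, y1 = roi
    out = []
    for i in range(out_size):
        row = []
        for j in range(out_size):
            ys = y0 + (y1 - y0) * i // out_size
            ye = y0 + (y1 - y0) * (i + 1) // out_size
            xs = x0 + (x1 - x0) * j // out_size
            xe = x0 + (x1 - x0) * (j + 1) // out_size
            row.append(max(fmap[y][x]
                           for y in range(ys, max(ye, ys + 1))
                           for x in range(xs, max(xe, xs + 1))))
        out.append(row)
    return out

fmap = [[1, 2, 1, 0],
        [0, 3, 1, 2],
        [2, 0, 4, 1],
        [1, 1, 0, 2]]
# A 4x4 RoI pooled to a fixed 2x2 output, ready for the FC layers.
print(roi_pool(fmap, (0, 0, 4, 4)))  # [[3, 2], [2, 4]]
```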
Fast R-CNN Training
Multi-task loss: log loss on the linear + softmax classification head, smooth L1 loss on the linear regression head. The FCs and the ConvNet are all trainable end-to-end.
Fast R-CNN Results
                    Fast R-CNN   R-CNN [1]   SPP-net [2]
Train time (h)      9.5          84          25
- Speedup           8.8x         1x          3.4x
Test time / image   0.32 s       47.0 s      2.3 s
- Test speedup      146x         1x          20x
mAP                 66.9%        66.0%       63.1%
Timings exclude object proposal time, which is equal for all methods. All methods use VGG-16.
Object Detection - Faster R-CNN -
Fast R-CNN Problems
Test-time speeds don't include region proposals.
  R-CNN:      SS time + 2000 x conv time + 2000 x FC time
  Fast R-CNN: SS time + 1 x conv time + 2000 x FC time
Including time for selective search:
                                       R-CNN        Fast R-CNN
  Test time / image                    47 seconds   0.32 seconds
  - Speedup                            1x           146x
  Test time / image with sel. search   50 seconds   2 seconds
  - Speedup                            1x           25x
Solution: just make the CNN do region proposals too!
Faster R-CNN
Faster R-CNN Insert Region Proposal Network (RPN) after the last convolutional layer. RPN trained to produce region proposals directly; No need for external proposals. After RPN, use RoI pooling and an upstream classifier and bounding box regressor just like Fast R-CNN.
Region Proposal Networks (RPN)
Goal: share computation with a Fast R-CNN object detection network.
RPN input: an image of any size. Output: rectangular object proposals, each with an objectness score.
Region Proposal Networks (RPN)
Slide a small window over the feature map and build a small network for: classifying object vs. not-object; regressing bounding-box locations.
The position of the sliding window provides localization information with reference to the image; the bounding-box regression provides finer localization with reference (anchor) to this sliding window.
Region Proposal Networks (RPN)
Use k anchor boxes at each location (e.g., k = 9: 3 scales x 3 aspect ratios). Anchors are translation invariant: use the same ones at every location. Regression gives offsets from the anchor boxes; classification gives the probability that each (regressed) anchor shows an object.
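The k = 9 anchor shapes can be generated in a few lines. A sketch with one common parameterization (fixed area per scale, aspect ratio = h/w; the exact base size and scales are an assumption, not taken from these slides):

```python
def make_anchors(base=16, scales=(8, 16, 32), ratios=(0.5, 1.0, 2.0)):
    """Generate k = len(scales) * len(ratios) anchor shapes (w, h) for one
    sliding-window position; the same set is reused at every location."""
    anchors = []
    for scale in scales:
        side = base * scale          # anchor area = side * side
        for ratio in ratios:         # ratio = h / w
            w = side / ratio ** 0.5
            h = side * ratio ** 0.5
            anchors.append((round(w), round(h)))
    return anchors

anchors = make_anchors()
print(len(anchors))  # 9 anchors: 3 scales x 3 aspect ratios
print(anchors[:3])   # the three aspect ratios at the smallest scale
```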
RPN: Loss function
L({p_i}, {t_i}) = (1/N_cls) Σ_i L_cls(p_i, p_i*) + λ (1/N_reg) Σ_i p_i* L_reg(t_i, t_i*)
RPN: Loss function - classification
Two-class softmax loss, with discriminative training labels:
  p_i* = 1 if IoU > 0.7
  p_i* = 0 if IoU < 0.3
  Otherwise, the anchor does not contribute to the loss.
RPN: Loss function - regression
L_reg(t, t*) = Σ_{i ∈ {x, y, w, h}} smooth_L1(t_i - t_i*), where
  smooth_L1(x) = 0.5 x^2    if |x| < 1
                 |x| - 0.5  otherwise
Predicted and target offsets, relative to the anchor (x_a, y_a, w_a, h_a):
  t_x = (x - x_a) / w_a,   t_y = (y - y_a) / h_a,   t_w = log(w / w_a),   t_h = log(h / h_a)
  t_x* = (x* - x_a) / w_a, t_y* = (y* - y_a) / h_a, t_w* = log(w* / w_a), t_h* = log(h* / h_a)
Only positive samples contribute to the regression loss.
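The smooth L1 term is easy to write out directly (function names are ours):

```python
def smooth_l1(x):
    """smooth_L1(x) = 0.5 x^2 if |x| < 1, |x| - 0.5 otherwise.
    Quadratic near zero (stable gradients), linear for large errors
    (less sensitive to outliers than a pure L2 loss)."""
    return 0.5 * x * x if abs(x) < 1 else abs(x) - 0.5

def l_reg(t, t_star):
    """L_reg(t, t*) = sum over (x, y, w, h) of smooth_L1(t_i - t_i*)."""
    return sum(smooth_l1(a - b) for a, b in zip(t, t_star))

print(smooth_l1(0.5))  # 0.125 (quadratic region)
print(smooth_l1(3.0))  # 2.5   (linear region)
print(l_reg((0.1, 0.2, 0.0, 0.0), (0.0, 0.0, 0.0, 0.0)))
```

Note the two branches meet at |x| = 1 with matching value (0.5) and slope, so the loss is smooth there.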
Bounding-Box Regression
How to train Faster R-CNN
Balanced sampling (pos. vs. neg. = 1:1). An anchor is labeled as positive if: it is the anchor (or anchors) with the highest IoU overlap with a GT box; or it has an IoU overlap with a GT box higher than 0.7. Negative labels are assigned to anchors with IoU lower than 0.3 for all GT boxes.
Alternating optimization (in the first paper): 4-step training. Joint training (since publication): one network, four losses.
How to train Faster R-CNN
Alternating optimization:
Step 1: Train the RPN, initialized with an ImageNet pre-trained model. [no parameter sharing]
Step 2: Train Fast R-CNN using the proposals generated in Step 1. [no parameter sharing]
Step 3: Use the convolutional layers trained in Step 2 to initialize the model, fix the shared convolutional layers, and update the RPN.
Step 4: Keeping the shared convolutional layers fixed, fine-tune the fully connected layers of Fast R-CNN.
How to train faster R-CNN Joint training one network, four losses RPN classification (anchor good / bad) RPN regression (anchor proposal) Fast R-CNN classification (over classes) Fast R-CNN regression (proposal box)
Object Detection: Datasets
                          PASCAL VOC   ImageNet Detection (ILSVRC)   MS COCO
# of classes              20           200                           80
# of images (train+val)   ~20k         ~470k                         ~120k
Avg. objects / image      2.4          1.1                           7.2
Results: VOC 2007, ZF ConvNet
Results VOC 2007, using VGG-16, Fast R-CNN vs. Faster R-CNN
Results VOC 2012, using VGG-16, Fast R-CNN vs. Faster R-CNN
Results
VOC 2007, using VGG-16, different settings of anchors.
Fast R-CNN vs. Faster R-CNN Results
Results
Test-time cost per image:
  R-CNN:        SS time + 2000 x conv time + 2000 x FC time
  Fast R-CNN:   SS time + 1 x conv time + 2000 x FC time
  Faster R-CNN: 1 x conv time + 300 x FC time

                                 R-CNN        Fast R-CNN   Faster R-CNN
Test time / image (w/ proposals) 50 seconds   2 seconds    0.2 seconds
- Speedup                        1x           25x          250x
mAP (VOC 2007)                   66.0         66.9         69.9
Summary: Object Detection
Find a variable number of objects by classifying image regions.
Before CNNs: dense multi-scale sliding windows (HoG, DPM). Region proposals avoid dense sliding windows.
R-CNN: Selective Search + CNN classification / bounding-box regression.
Fast R-CNN: swap the order of convolutions and region extraction.
Faster R-CNN: compute region proposals within the network.