Convolutional Networks for Computer Vision Applications


Convolutional Networks for Computer Vision Applications
Andrea Vedaldi
Latest version of the slides: http://www.robots.ox.ac.uk/~vedaldi/assets/teach/vedaldi16deepcv.pdf
Lab experience: https://www.robots.ox.ac.uk/~vgg/practicals/cnn-reg/

Computer Vision & CNNs 2 Image classification; Matching, optical flow, stereo; Coarse (high-level objects); Siamese architectures; Fine grained (dog, bird species); Object detection; Synthesis and visualization; Pre-images and matching statistics; R-CNN; Stochastic networks; Bounding box regression, YOLO; Adversarial networks; Image segmentation; Pose, parts, key points; Fully-connected networks; U architectures; CRF backprop; Action recognition; Attribute prediction; Sentence generation; Recurrent CNNs; LSTMs; Depth-map estimation; Face recognition and verification; Text recognition and spotting


Review 4 [Figure: a CNN with convolutional layers c1 to c5 and fully-connected layers f6 to f8, with weights w1 to w8, mapping an image to the label "bike".] A. Krizhevsky, I. Sutskever, and G. E. Hinton. ImageNet classification with deep convolutional neural networks. In Proc. NIPS, 2012.

Convolutional Neural Network (CNN) 5 A sequence of local and shift-invariant layers. Example: a convolution layer maps input data x to output data y through a filter bank F.

Data = 3D tensors 6 There is a vector of feature channels (e.g. RGB) at each spatial location (pixel). [Figure: an H × W image with channels c = 1, c = 2, c = 3 stacked into an H × W × C 3D tensor.]

Convolution with 3D filters 7 Each filter acts on multiple input channels. Local: filters look locally. Translation invariant: filters act the same everywhere. [Figure: a filter F applied to x and summed (Σ) to give y.]

Filter banks 8 Multiple filters produce multiple output channels. One filter = one output channel. [Figure: filters F(1) and F(2) applied to x, each summed (Σ) into one channel of y.]

Linear / non-linear chains 9 The basic blueprint of most architectures: filtering, ReLU, filtering & downsampling, ReLU, and so on. [Figure: a chain alternating linear filters (Σ) and non-linearities (S).]

10 Three years of progress From AlexNet (2012) to ResNet (2015)

How deep is enough? 11 AlexNet (2012) 5 convolutional layers 3 fully-connected layers

How deep is enough? 12 AlexNet (2012) VGG-M (2013) VGG-VD-16 (2014) 16 conv layers

How deep is enough? 13 AlexNet (2012) VGG-M (2013) VGG-VD-16 (2014) GoogLeNet (2014)


How deep is enough? 15 AlexNet (2012), VGG-M (2013), VGG-VD-16 (2014), GoogLeNet (2014), ResNet 50 (2015), ResNet 152 (2015). VGG-VD-16: 16 convolutional layers. ResNet 50: 50 convolutional layers. ResNet 152: 152 convolutional layers. A. Krizhevsky, I. Sutskever, and G. E. Hinton. ImageNet classification with deep convolutional neural networks. In Proc. NIPS, 2012. C. Szegedy, W. Liu, Y. Jia, P. Sermanet, S. Reed, D. Anguelov, D. Erhan, V. Vanhoucke, and A. Rabinovich. Going deeper with convolutions. In Proc. CVPR, 2015. K. Simonyan and A. Zisserman. Very deep convolutional networks for large-scale image recognition. In Proc. ICLR, 2015. K. He, X. Zhang, S. Ren, and J. Sun. Deep residual learning for image recognition. In Proc. CVPR, 2016.

Accuracy 16 3× more accurate in 3 years. [Plot: top-5 error (lower is more accurate) for caffe-alex, vgg-f, vgg-m, googlenet-dag, vgg-verydeep-16, resnet-50-dag, resnet-101-dag and resnet-152-dag.]

Speed 17 5× slower. [Plot: speed (images/s on Titan X) for caffe-alex, vgg-f, vgg-m, googlenet-dag, vgg-verydeep-16, resnet-50-dag, resnet-101-dag and resnet-152-dag.] Remark: 101 ResNet layers have about the same size/speed as 16 VGG-VD layers. Reason: far fewer feature channels (quadratic speed/space gain). Moral: optimize your architecture.

Model size 18 The number of parameters is about the same. [Plot: model size (MBs) for caffe-alex, vgg-f, vgg-m, googlenet-dag, vgg-verydeep-16, resnet-50-dag, resnet-101-dag and resnet-152-dag.] Remark: 101 ResNet layers have about the same size/speed as 16 VGG-VD layers. Reason: far fewer feature channels (quadratic speed/space gain). Moral: optimize your architecture.

Recent advances 19 Design guidelines Batch normalization Residual learning


Design guidelines 21 Guideline 1: Avoid tight bottlenecks. From bottom (image) to top (features), the spatial resolution H × W decreases and the number of channels C increases. Guideline: avoid tight information bottlenecks; decrease the data volume H × W × C slowly. K. Simonyan and A. Zisserman. Very deep convolutional networks for large-scale image recognition. In Proc. ICLR, 2015. C. Szegedy, V. Vanhoucke, S. Ioffe, and J. Shlens. Rethinking the inception architecture for computer vision. In Proc. CVPR, 2016.

Receptive field 22 Must be large enough. Receptive field of a neuron: the image region influencing a neuron; anything happening outside is invisible to the neuron. Importance: large image structures cannot be detected by neurons with small receptive fields. Enlarging the receptive field: large filters, or chains of small filters. [Figure: a neuron and the neuron's receptive field.]
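As a rough illustration of how chains of filters enlarge the receptive field, the sketch below computes its size along one axis. The function name and the (filter size, stride) encoding are assumptions made for this example, not notation from the slides.

```python
def receptive_field(layers):
    """Receptive field size (in input pixels, along one axis) of a neuron
    at the top of a chain of layers given as (filter_size, stride) pairs."""
    size = 1   # a neuron of the input sees exactly one pixel
    jump = 1   # distance, in input pixels, between adjacent neurons
    for filter_size, stride in layers:
        size += (filter_size - 1) * jump
        jump *= stride
    return size


one_big = receptive_field([(5, 1)])                 # a single 5-tap filter
two_small = receptive_field([(3, 1), (3, 1)])       # two chained 3-tap filters
with_stride = receptive_field([(3, 1), (3, 2), (3, 1)])
print(one_big, two_small, with_stride)  # prints: 5 5 9
```

Downsampling (stride greater than 1) makes every later filter cover more input pixels, which is why the third chain reaches 9 pixels with only three small filters.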

Design guidelines 23 Guideline 2: Prefer small filter chains. One big filter bank (5 × 5 filters + ReLU) vs. two smaller filter banks (3 × 3 filters + ReLU, then 3 × 3 filters + ReLU): prefer the latter. Benefit 1: fewer parameters, possibly faster. Benefit 2: same receptive field as the big filter. Benefit 3: packs two non-linearities (ReLUs).
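The parameter saving can be checked with a few lines of arithmetic. This is a minimal sketch that assumes the number of channels stays constant through the chain (64 here, a hypothetical value) and ignores biases.

```python
def conv_params(filter_size, in_channels, out_channels):
    """Number of weights in a bank of square filters (biases ignored)."""
    return filter_size * filter_size * in_channels * out_channels


C = 64  # hypothetical channel count, identical for input and output
big = conv_params(5, C, C)          # one 5 x 5 filter bank
chain = 2 * conv_params(3, C, C)    # two chained 3 x 3 filter banks
print(big, chain, chain / big)      # prints: 102400 73728 0.72
```

Under these assumptions the chain needs 18/25 = 72% of the weights of the single 5 × 5 bank while covering the same 5 × 5 receptive field.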

Design guidelines 24 Guideline 3: Keep the number of channels at bay. Input: H × W × C. Filter bank: Hf × Wf × C × K, where C = num. input channels and K = num. output channels. Num. of operations and num. of parameters: complexity ∝ C × K.

Design guidelines 25 Guideline 4: Less computation with filter groups. Instead of M filters, consider G groups of M/G filters: split the channels, apply the filter groups, put the results back together. Complexity ∝ (C × K) / G.

Design guidelines 26 Guideline 4: Less computation with filter groups. Full filters: complexity C × K. Group-sparse filters: complexity C × K / G. Groups = filters that, seen as a matrix, have a block structure with zero off-diagonal blocks. [Figure: y = F x with a full C × K matrix F vs. a block-diagonal F.]
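The C × K and (C × K) / G scalings can be made concrete with a small cost model. The sketch below assumes a stride-1 convolution whose output keeps the H × W resolution (i.e. padding is used); the function name and the example sizes are hypothetical.

```python
def conv_cost(H, W, Hf, Wf, C, K, groups=1):
    """Return (multiply-adds, parameters) of a convolution layer.

    Assumes stride 1 and an H x W output. With `groups` > 1, each of the
    K filters only sees C / groups of the input channels.
    """
    assert C % groups == 0 and K % groups == 0
    params = Hf * Wf * (C // groups) * K
    ops = H * W * params  # every weight is used once per output location
    return ops, params


full_ops, full_params = conv_cost(56, 56, 3, 3, 256, 256)
grp_ops, grp_params = conv_cost(56, 56, 3, 3, 256, 256, groups=4)
print(full_params, grp_params, full_ops // grp_ops)  # prints: 589824 147456 4
```

Halving both C and K divides the cost by four, whereas using G groups divides it by G while keeping the same number of channels.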

Design guidelines 27 Guideline 5: Low-rank decompositions. Filter bank: 3 × 3 × C × K. Decompose spatially: vertical 1 × 3 × C × K, horizontal 3 × 1 × K × K, vertical 1 × 3 × K × K. Decompose channels: groups 3 × 3 × C/G × K/G, network in network 1 × 1 × K × K. Make sure to mix the information.

Recent advances 28 Design guidelines Batch normalization Residual learning

Batch normalization 29 Condition features: standardize the response of each feature channel within the batch. Pick a feature channel k in a batch of N tensors, compute its moments (mean µk, variance σk²), then subtract the mean and divide by the standard deviation σk. Average over spatial locations and also over multiple images in the batch (e.g. 16-256). S. Ioffe and C. Szegedy. Batch normalization: Accelerating deep network training by reducing internal covariate shift. CoRR, 2015

Batch normalization 30 Training vs testing modes. Moments (mean & variance): in training, they are computed anew for each batch and batch-specific moment averages µ, σ are collected; in testing, they are fixed to their average values, which are used instead of batch-specific moments. [Figure: a BN block mapping x(1), x(2), x(3) to y(1), y(2), y(3).]

Batch normalization 31 Utilization. Batch normalization is used after filtering, before ReLU. It is always followed by a channel-specific scaling factor s and bias b. Implemented as a single block in MatConvNet. Noisy bias/variance estimation replaces dropout regularization. [Figure: xn, filters and biases (F, b), xn+1, BN (moments µ, σ), xn+2, scale and bias (s, b), xn+3, ReLU, xn+4.]
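A minimal sketch of batch normalization for a single feature channel, covering both the training and the testing mode, is given below. The class name, the exponential moving average (`momentum`) used to collect the moment averages and the small constant `eps` are assumptions of this example; the slides only state that averages are collected.

```python
import math


class ChannelBatchNorm:
    """Batch normalization for one feature channel.

    `values` holds every response of that channel in the batch, i.e. all
    spatial locations of all images, flattened into one list.
    """

    def __init__(self, scale=1.0, bias=0.0, momentum=0.9, eps=1e-5):
        self.scale, self.bias = scale, bias        # channel-specific s and b
        self.momentum, self.eps = momentum, eps    # assumed hyper-parameters
        self.avg_mean, self.avg_var = 0.0, 1.0     # moment averages for testing

    def forward(self, values, training):
        if training:
            # Batch-specific moments, also folded into the running averages.
            mean = sum(values) / len(values)
            var = sum((v - mean) ** 2 for v in values) / len(values)
            m = self.momentum
            self.avg_mean = m * self.avg_mean + (1 - m) * mean
            self.avg_var = m * self.avg_var + (1 - m) * var
        else:
            # Fixed moments: the averages collected during training.
            mean, var = self.avg_mean, self.avg_var
        std = math.sqrt(var + self.eps)
        return [self.scale * (v - mean) / std + self.bias for v in values]


bn = ChannelBatchNorm()
print(bn.forward([1.0, 2.0, 3.0, 4.0], training=True))   # zero mean, unit variance
print(bn.forward([1.0, 2.0, 3.0, 4.0], training=False))  # uses collected averages
```

In training mode the output depends on which other samples are in the batch, which is the source of the noise mentioned above; in testing mode each sample is normalized independently of the rest of the batch.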

Recent advances 32 Design guidelines Batch normalization Residual learning

Residual learning 33 Fixed identity // learned residual. [Figure: chains of filters and ReLUs from xn to xn+5 in which an identity branch and a learned residual branch are summed (Σ).] K. He, X. Zhang, S. Ren, and J. Sun. Deep residual learning for image recognition. In Proc. CVPR, 2016.
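The "fixed identity plus learned residual" idea can be sketched on plain vectors. The example below assumes the common formulation y = ReLU(x + F(x)) with F made of two linear maps and one ReLU; small dense matrices stand in for filter banks, and all names are hypothetical.

```python
def relu(v):
    return [max(0.0, a) for a in v]


def linear(W, v):
    """Matrix-vector product; W is a list of rows."""
    return [sum(w * a for w, a in zip(row, v)) for row in W]


def residual_block(x, W1, W2):
    """y = ReLU(x + F(x)), with learned residual F(x) = W2 ReLU(W1 x)."""
    residual = linear(W2, relu(linear(W1, x)))
    return relu([a + r for a, r in zip(x, residual)])


zeros = [[0.0, 0.0], [0.0, 0.0]]
print(residual_block([1.0, 2.0], zeros, zeros))  # prints: [1.0, 2.0]
```

With all residual weights at zero the block reduces to the identity on non-negative inputs, so adding such blocks to a network does not, by itself, make the function harder to represent.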

Summary 34 Impact of deep learning in vision. 2012: amazing results by AlexNet in the ImageNet challenge. 2013-15: massive 3× improvement. 2016-19: further massive improvements not unlikely. What have we learned: several incremental refinements; AlexNet was just a first proof of concept after all. Things that work: deeper architectures; smarter architectures (groups, low-rank decompositions, …); batch normalization; residual connections.

Semantic segmentation

Semantic image segmentation 36 Label individual pixels c1 c2 c3 c4 c5 f6 f7 f8 input = image convolutional fully-connected output = image

Convolutional layers 37 Local receptive field feature component features input image receptive field

Fully connected layers 38 Global receptive field class predictions fully-connected fully-connected fully-connected

Convolutional vs Fully Connected 39 Comparing the receptive fields. Convolutional layers: responses are spatially selective and can be used to localize things. Fully-connected layers: responses are global and do not characterize position well. Which one is more useful for pixel-level labelling? [Figure labels: downsampling filters, upsampling filters.]

Fully-connected layer = large filter 40 [Figure: the K weight vectors w(k) = F(k) form a W × H × C × K filter bank; applied to a W × H × C input it gives a 1 × 1 × K output.]

Fully-convolutional neural networks 41 class predictions J. Long, E. Shelhamer, and T. Darrell. Fully convolutional networks for semantic segmentation. In Proc. CVPR, 2015

Fully-convolutional neural networks 42 Dense evaluation: apply the whole network convolutionally; it estimates a vector of class probabilities at each pixel. Downsampling: in practice most networks downsample the data fast, so the output is very low resolution (e.g. 1/32 of the original).

Upsampling the resolution 43 Interpolating filter. Upsampling filters make it possible to increase the resolution of the output; this is very useful to get full-resolution segmentation results. [Figure: downsampling filters (Σ) followed by upsampling filters (Σ).]

Deconvolution layer 44 Or convolution transpose. Convolution as matrix multiplication: filtering with F is equivalent to multiplying by a banded matrix. Convolution transpose: multiplying by the transposed matrix, Fᵀ.
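The banded-matrix view is easy to verify in one dimension. The following sketch builds the matrix of a "valid" filtering operation (no filter flipping, no padding; both are simplifying assumptions) and applies its transpose; helper names are made up for this example.

```python
def conv_matrix(f, n):
    """Banded matrix M such that M x filters a length-n signal x with f."""
    m = n - len(f) + 1  # 'valid' output length
    return [[f[j - i] if 0 <= j - i < len(f) else 0.0 for j in range(n)]
            for i in range(m)]


def matvec(M, v):
    return [sum(a * b for a, b in zip(row, v)) for row in M]


def transpose(M):
    return [list(col) for col in zip(*M)]


f = [1.0, 2.0, 3.0]
M = conv_matrix(f, 5)                          # 3 x 5 banded matrix
y = matvec(M, [1.0, 0.0, 0.0, 0.0, 1.0])       # convolution: 5 samples -> 3
z = matvec(transpose(M), [1.0, 0.0, 0.0])      # transpose:   3 samples -> 5
print(y)  # prints: [1.0, 0.0, 3.0]
print(z)  # prints: [1.0, 2.0, 3.0, 0.0, 0.0]
```

The transpose maps a short signal to a longer one by "painting" a copy of the filter for every input sample, which is why it can serve as a learnable upsampling layer.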

U-architectures 45 From image to image net net net net net input image skip layers segmentation mask (output image)

U-architectures 46 Several variants: FCN, U-arch, deconvolution, … [Figures: the FCN skip architecture, combining pool5, pool4 and pool3 predictions into 32×, 16× and 8× upsampled predictions (FCN-32s, FCN-16s, FCN-8s); and the U-Net architecture, mapping a 572 × 572 input image tile to a 388 × 388 output segmentation map using conv 3×3 + ReLU, copy and crop, max pool 2×2, up-conv 2×2 and conv 1×1 blocks.] J. Long, E. Shelhamer, and T. Darrell. Fully convolutional networks for semantic segmentation. In Proc. CVPR, 2015. H. Noh, S. Hong, and B. Han. Learning deconvolution network for semantic segmentation. In Proc. ICCV, 2015. O. Ronneberger, P. Fischer, and T. Brox. U-net: Convolutional networks for biomedical image segmentation. In Proc. MICCAI, 2015.

Try it yourself: MatConvNet-FCN demo 47 Dense networks for semantic segmentation sofa cat person

Object detection boat : 0.853 person :0.993 person :0.972 person :0.981 person :0.907

R-CNN 49 Region-based Convolutional Neural Network. Pros: simple and effective. Cons: slow, as the CNN is re-evaluated for each tested region. [Figure: each region (chair, background, potted plant) is passed through a CNN (c1 to c5, f6, f7) and an SVM to obtain a label.] R. Girshick, J. Donahue, T. Darrell, J. Malik. Rich Feature Hierarchies for Accurate Object Detection and Semantic Segmentation. CVPR 2014

Region proposals 50 Cut down the number of candidates Proposal-method: Selective Search [van de Sande, Uijlings et al.] hierarchical segmentation each region generates a ROI ~ 2000 regions / image

From proposals to CNN features 51 Dilate, crop, reshape Propose Dilate Crop & scale Anisotropic 227 x 227

From proposal to CNN features 52 Evaluate CNN c1 c2 c3 c4 c5 f6 f7 Scale Anisotropic 227 x 227 CNN features Up to FC-7 AlexNet Feature vector 4096 D

Classification of a region 53 Run an SVM or similar on top old school aeroplane cat dog c1 c2 c3 c4 c5 f6 f7 SVM horse person Scale Anisotropic 227 x 227 CNN features Up to FC-7 AlexNet Feature vector 4096 D Label One out of N

Region adjustment 54 Bounding-box regression c1 c2 c3 c4 c5 f6 f7 Ridge regress. Scale Anisotropic 227 x 227 CNN features Up to FC-7 AlexNet Feature vector 4096 D Box adjustment dx1, dx2, dy1, dy2

R-CNN results on PASCAL VOC 55 At the time of introduction (2013) VOC 2007 VOC 2010 DPM v5 (Girshick et al. 2011) 33.7% 29.6% UVA sel. search (Uijlings et al. 2013) 35.1% Regionlets (Wang et al. 2013) 41.7% 39.7% SegDPM (Fidler et al. 2013) 40.4% R-CNN (TorontoNet) 54.2% 50.2% R-CNN (TorontoNet) + bbox regression 58.5% 53.7% R-CNN (VGG-VD) 62.1% R-CNN (ONet) + bbox regression 66.0% 62.9%

R-CNN summary 56 Region-based Convolutional Neural Network. Pipeline: image, region proposals, CNN features, then an SVM classifier (class) and ridge regression (box). The CNN is pre-trained on ImageNet, then fine-tuned; the SVM and the ridge regression are "old school", trained a-posteriori. Can we achieve end-to-end training?

Towards better R-CNNs 57 Region-based Convolutional Neural Network image Region proposals CNN features CNN classifier class CNN regressor box End-to-end training Except for region proposals Problem: this is still pretty slow!

Accelerating R-CNN 58 crop c1 c2 c3 c4 c5 f6 f7 chair c1 c2 c3 c4 c5 f6 f7 background c1 c2 c3 c4 c5 f6 f7 potted plant crop f6 f7 chair c1 c2 c3 c4 c5 f6 f7 background f6 f7 potted plant

The Spatial Pooling layer 59 Max pooling in arbitrary regions: any given region of the c1 to c5 feature map is max-pooled into a feature vector. He, Zhang, Ren & Sun, Spatial Pyramid Pooling (SPP) in Deep Convolutional Networks for Visual Recognition, ECCV 2014

The Spatial Pooling layer 60 As a building block: the SP layer takes a feature map and a list of regions and outputs region-specific feature vectors. He, Zhang, Ren & Sun, Spatial Pyramid Pooling (SPP) in Deep Convolutional Networks for Visual Recognition, ECCV 2014

The Spatial Pyramid Pooling Layer 61 Same as above, but for multiple subdivisions. [Figure: max pooling over several subdivisions of a region.]
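To illustrate max pooling over arbitrary regions and over multiple subdivisions, here is a small sketch on a single-channel feature map stored as a list of rows. The region encoding (half-open pixel coordinates), the bin boundaries and the pyramid levels (1 × 1 and 2 × 2) are assumptions of this example rather than details given in the slides.

```python
def region_max_pool(fmap, region, bins):
    """Max-pool a region of a single-channel feature map into bins x bins
    values, returned as a flat list (row-major order)."""
    x0, y0, x1, y1 = region  # half-open: columns x0..x1-1, rows y0..y1-1
    w, h = x1 - x0, y1 - y0
    out = []
    for by in range(bins):
        ya = y0 + (by * h) // bins
        yb = max(ya + 1, y0 + ((by + 1) * h) // bins)  # at least one row
        for bx in range(bins):
            xa = x0 + (bx * w) // bins
            xb = max(xa + 1, x0 + ((bx + 1) * w) // bins)  # at least one column
            out.append(max(fmap[y][x]
                           for y in range(ya, yb) for x in range(xa, xb)))
    return out


def spatial_pyramid_pool(fmap, region, levels=(1, 2)):
    """Concatenate the pooled values of every pyramid level."""
    feats = []
    for bins in levels:
        feats.extend(region_max_pool(fmap, region, bins))
    return feats


fmap = [[1, 2, 3, 4],
        [5, 6, 7, 8],
        [9, 10, 11, 12],
        [13, 14, 15, 16]]
print(spatial_pyramid_pool(fmap, (0, 0, 4, 4)))  # prints: [16, 6, 8, 14, 16]
print(spatial_pyramid_pool(fmap, (1, 1, 4, 3)))  # prints: [12, 6, 8, 10, 12]
```

Both regions, although of different sizes, yield a feature vector of the same length (1 + 4 values per channel), which is what allows fully-connected layers to be applied on top of arbitrary regions.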

Fast R-CNN 62 Summary same parameters f6 f7 chair c1 c2 c3 c4 c5 SPP r6 r7 box refinement f6 f7 background selective search r6 r7 box refinement f6 f7 potted plant still not so fast r6 r7 box refinement Ross Girshick. Fast R-CNN. ICCV 2015

R-CNN minus R 63 Fixed image-independent proposal set. Fixed proposal generation: take all bounding boxes in the training set, then run K-means clustering to distill a few thousand (~2,000-3,000) representative boxes. [Lenc Vedaldi BMVC 2015]

Vs other proposal sets 64 Matches the training set statistics by construction ground truth selective search 2K sliding windows 7K clustering 3K

R-CNN minus R 65 Replace image-specific boxes with a fixed pool f6 f7 chair c1 c2 c3 c4 c5 SPP r6 r7 box refinement f6 f7 background r6 r7 box refinement f6 f7 potted plant fixed boxes pool r6 r7 box refinement

Why does it work? 66 Answer: regression is quite powerful Dashed line: initial Solid line: corrected by the CNN

Quantitative comparisons 67 Image-specific vs fixed. [Plot: mAP (VOC07), baseline vs. bounding box regression (BBR), for Sel. Search (2K boxes), Slid. Win. (7K boxes), Clusters (2K boxes) and Clusters (7K boxes).] Selective search is much better than fixed generators. However, bounding box regression almost eliminates the difference. Clustering makes it possible to use significantly fewer boxes than sliding windows.

Faster R-CNN 68 Even better performance with fixed proposals. Ideas: better fixed region proposal sampling; proposal-shape-specific classifiers / regressors. [Figure: proposal prototypes (anchors), 3 scales and 3 aspect ratios together.] Ren, He, Girshick, & Sun. Faster R-CNN: Towards Real-Time Object Detection with Region Proposal Networks. NIPS 2015.
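A sketch of how 3 scales and 3 aspect ratios give 9 anchor prototypes per location is shown below. The specific scale and ratio values are assumptions chosen for illustration (the slides only give the counts), as are the function name and the box encoding (x0, y0, x1, y1).

```python
import math


def make_anchors(cx, cy, scales=(128, 256, 512), ratios=(0.5, 1.0, 2.0)):
    """Anchor boxes centred at (cx, cy), one per (scale, aspect ratio) pair.

    Each anchor has area scale**2 and height / width equal to the ratio.
    """
    anchors = []
    for s in scales:
        for r in ratios:
            w = s / math.sqrt(r)
            h = s * math.sqrt(r)
            anchors.append((cx - w / 2, cy - h / 2, cx + w / 2, cy + h / 2))
    return anchors


anchors = make_anchors(0.0, 0.0)
print(len(anchors))  # prints: 9
print(anchors[1])    # prints: (-64.0, -64.0, 64.0, 64.0)
```

Because the same prototypes are replicated at every location of the feature map, the proposal set is fixed and image-independent; only the per-anchor scores and box corrections depend on the image.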

Faster R-CNN 69 Even better performance with fixed proposals Ren, He, Girshick, & Sun. Faster R-CNN: Towards Real-Time Object Detection with Region Proposal Networks. NIPS 2015.

Shape-specific classifiers / regressors 70 f6,1 f7,1 f6,2 f7,2 f6,3 f7,3 shape 1 shape 2 shape 3 r6,1 r7,1 r6,2 r7,2 r6,3 r7,3 different parameters Model parameters: translation invariant but shape/scale specific Object aspects are learned by brute force Ren, He, Girshick, & Sun. Faster R-CNN: Towards Real-Time Object Detection with Region Proposal Networks. NIPS 2015.

Training: what is a positive or negative box? 71 Based on overlap with ground truth: treat as positive if overlap > 70%; treat as negative if overlap < 30%. Ren, He, Girshick, & Sun. Faster R-CNN: Towards Real-Time Object Detection with Region Proposal Networks. NIPS 2015.
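The labelling rule can be sketched as follows, assuming that "overlap" means intersection over union (IoU) and that boxes falling between the two thresholds are simply left out of training; both points are assumptions of this example, as are the function names.

```python
def iou(a, b):
    """Intersection over union of two boxes given as (x0, y0, x1, y1)."""
    iw = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    ih = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = iw * ih
    union = ((a[2] - a[0]) * (a[3] - a[1])
             + (b[2] - b[0]) * (b[3] - b[1]) - inter)
    return inter / union if union > 0 else 0.0


def label_box(box, ground_truth, pos=0.7, neg=0.3):
    """Label a candidate box from its best overlap with any ground-truth box."""
    best = max((iou(box, gt) for gt in ground_truth), default=0.0)
    if best > pos:
        return "positive"
    if best < neg:
        return "negative"
    return "ignored"  # neither clearly object nor clearly background


gt = [(0, 0, 10, 10)]
print(label_box((0, 0, 10, 10), gt))    # prints: positive
print(label_box((0, 0, 10, 5), gt))     # prints: ignored
print(label_box((20, 20, 30, 30), gt))  # prints: negative
```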

Fast and Faster R-CNN performance 72 Better, faster!
Method          Time / image    mAP (%)
R-CNN           ~50s            66.0
Fast R-CNN      ~2s             66.9
Faster R-CNN    198ms           69.9
Detection mAP on PASCAL VOC 2007, with VGG-16 pre-trained on ImageNet.

Multi-scale representations 73 Three strategies scale image scale feature filters fixed scale features model parameters shared for all scales recompute features for each scale cannot exploit fine details when visible brute force modeling of scale compute features at a single scale can model fine details just fine

Example detections 74 person : 0.992 dog : 0.983 horse : 0.993 car : 1.000 person : 0.974 cat : 0.928 dog : 0.989 bus: 0.980 boat : 0.853 person : 0.993 person : 0.753 person : 0.972 person : 0.981 person : 0.907

PASCAL Leaderboards (Nov 2014) 75 Detection challenge comp4: train on own data

PASCAL Leaderboards (Dec 2015) 76 Detection challenge comp4: train on own data

Other applications. Part II: A CNN example in text spotting. [Figure: text spotting example, "LORD NELSON", scores 1.00/1.00/1.00.]

Huge variety of applications 78 Siamese networks for face recognition/verification same different E.g. VGG-Face

Huge variety of applications 79 Text spotting CREAM E.g. SynthText and VGG-Text http://zeus.robots.ox.ac.uk/textsearch/#/search/

A two-step approach 80 Detection, then classification (e.g. "APARTMENTS"). We will focus on the classification step. Most previous approaches start by recognising individual characters [Yao et al. 2014, Bissacco et al. 2013, Jaderberg et al. 2014, Posner et al. 2010, Quack et al. 2009, Wang et al. 2011, Wang et al. 2012, Weinman et al. 2014, …]. The alternative is to directly map word images to words [Almazan et al. 2014, Goel et al. 2013, Mishra et al. 2012, Novikova et al. 2012, Rodriguez-Serrano et al. 2012, Goodfellow et al. 2013].

A massive classifier 81 Goal: map images to one of 90K classes (one per word). Architecture: a 32 × 100 gray scale input goes through c1 to c5 and f6 to f8, followed by the loss; the filter sizes are, in order, 5 × 5 × 64, 5 × 5 × 128, 3 × 3 × 256, 3 × 3 × 512, 3 × 3 × 512, 4096, 1 × 1 × 4096, 1 × 1 × 90,000. Each linear operator is followed by ReLU; c1, c2, c3, c5 are followed by 2 × 2 max pooling. 500 million parameters; evaluation requires 2.2ms on a GPU.

Learning a massive classifier 82 Massive training data: 9 million images spanning 90K words (100 examples per word). Learning algorithm: SGD, dropout (after fc6 and fc7), mini batches. Problem: in practice each batch must contain at least 1/5 of all the classes, so batch size = 18K (!!). Solution: incremental training; learn first using 5K classes only (1K minibatches), then incrementally add 5K more classes.

Synth Text dataset 83 Existing text recognition benchmark datasets are too small to train the model. Synth Text (http://www.robots.ox.ac.uk/~vgg/data/text/): a new synthetic dataset for text spotting; includes realistic visual effects; infinitely large (9M images available for download).

Synth Text generation 84 Font rendering: sample at random one of 1400 Google Fonts. Border/shadow: randomly add inset/outset border and shadow. Projective distortion. Blending: use a random crop from SVT as background; randomly sample alpha channel and mixing operator (normal, burn, …). Noise: elastic distortion, white noise, blur, JPEG compression, …

Overall system 85 Proposal generation: edge boxes, AST Proposal filtering: HOG, RF Bounding box-regression: CNN Text recognition: CNN Post-processing: merging, non-max. suppression

1.00/1.00/1.00 Qualitative results: text spotting 86 1.00/1.00/1.00 1.00/1.00/1.00 1.00/1.00/1.00

Qualitative results: text spotting 87 1.00/1.00/1.00 1.00/1.00/1.00 1.00/0.88/0.93 1.00/1.00/1.00

Qualitative results: text retrieval 88 APARTMENTS BORIS JOHNSON HOLLYWOOD

Qualitative results: text retrieval 89 POLICE CASTROL VISION

Backpropagation revisited

Backpropagation 91 Compute derivatives using the chain rule. [Figure: the forward pass maps the input x through c1 to c5 and f6 to f8 (weights w1 to w8) to the loss, which compares the prediction with the label "bike" and gives a scalar error in R; the backward pass computes derror/dw1, …, derror/dw8.]

Chain rule: scalar version 92 x0 x1 xn-1 f1 f2 fn-1 fn xn

Chain rule: scalar version 93 xn xn-1 x1 x0 A composition of n functions Derivative chain rule
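In the notation of the chain above (functions f_1, …, f_n and values x_0, …, x_n), the composition and its derivative can be written as follows. This is the standard scalar chain rule, stated here as a reminder rather than copied from the slides.

```latex
x_n = (f_n \circ f_{n-1} \circ \dots \circ f_1)(x_0),
\qquad x_i = f_i(x_{i-1}), \quad i = 1, \dots, n

\frac{d x_n}{d x_0}
  = \frac{d f_n}{d x_{n-1}} \cdot
    \frac{d f_{n-1}}{d x_{n-2}} \cdots
    \frac{d f_1}{d x_0}
```

Each factor is evaluated at the corresponding intermediate value x_{i-1} computed in the forward pass.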

Tensor-valued functions 94 E.g. linear convolution = bank of 3D filters
             height        width         channels    instances
input x      H             W             C           1 or N
filters F    Hf            Wf            C           K
output y     H - Hf + 1    W - Wf + 1    K           1 or N

Vector representation 95 3D tensors vec vectors

Derivative of tensor-valued functions 96 Derivative (Jacobian): every output element w.r.t. every input element! The vec operator allows us to use a familiar matrix notation for the derivatives

Chain rule: tensor version 97 Using vec() and matrix notation xn xn-1 x1 x0

The (unbearable) size of tensor derivatives 98 The size of these Jacobian matrices is huge. Example: for a 32 × 32 × 512 input and a 32 × 32 × 512 output, the Jacobian has 275 B elements: 1 TB of memory required!!

Unless the output is a scalar 99 This is always the case if the last layer is the loss function. Now the Jacobian has the same size as x. Example: for a 32 × 32 × 512 input and a 1 × 1 × 1 output, the Jacobian has 524K elements: just 2MB of memory.
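The two memory figures can be reproduced with a short calculation. The sketch assumes single-precision storage (4 bytes per element), which is not stated explicitly in the slides.

```python
from math import prod

BYTES_PER_ELEMENT = 4  # assumption: single-precision floats


def jacobian_elements(out_shape, in_shape):
    """Every output element is differentiated w.r.t. every input element."""
    return prod(out_shape) * prod(in_shape)


full = jacobian_elements((32, 32, 512), (32, 32, 512))
scalar = jacobian_elements((1, 1, 1), (32, 32, 512))

print(full, full * BYTES_PER_ELEMENT / 2**40)      # about 275 billion elements, 1 TiB
print(scalar, scalar * BYTES_PER_ELEMENT / 2**20)  # 524288 elements, 2 MiB
```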

Backpropagation 100 Assumed xn is a scalar (e.g. loss) xn xn-1 x1 x0 compute this first! small explicitly compute uber matrices do not explicitly compute

Backpropagation 101 Assumed xn is a scalar (e.g. loss) xn xn-1 x1 x0 small explicitly compute uber matrices do not explicitly compute

Backpropagation 102 Assumed xn is a scalar (e.g. loss) xn xn-1 x1 x0 small explicitly compute uber matrix do not explicitly compute

Backpropagation 103 Assumed xn is a scalar (e.g. loss) xn xn-1 x1 x0 small explicitly compute
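The idea of propagating a small derivative backwards, without ever forming the large Jacobians, can be sketched with a toy chain of layers. Each layer below exposes a `forward` and a `backward` method; the class names and this interface are assumptions made for the example.

```python
class Scale:
    """y = w * x, element-wise, with a single scalar weight w."""
    def __init__(self, w):
        self.w = w
    def forward(self, x):
        return [self.w * a for a in x]
    def backward(self, x, dzdy):
        return [self.w * g for g in dzdy]


class ReLU:
    def forward(self, x):
        return [max(0.0, a) for a in x]
    def backward(self, x, dzdy):
        return [g if a > 0 else 0.0 for a, g in zip(x, dzdy)]


class SumSquares:
    """Loss layer: the output is a scalar, kept as a one-element list."""
    def forward(self, x):
        return [sum(a * a for a in x)]
    def backward(self, x, dzdy):
        return [2.0 * a * dzdy[0] for a in x]


def backprop(layers, x0):
    """Return the scalar output x_n and its derivative w.r.t. x_0."""
    xs = [x0]
    for layer in layers:                 # forward pass, keeping every x_i
        xs.append(layer.forward(xs[-1]))
    grad = [1.0]                         # d x_n / d x_n
    for layer, x in zip(reversed(layers), reversed(xs[:-1])):
        grad = layer.backward(x, grad)   # same size as the layer input x
    return xs[-1][0], grad


loss, grad = backprop([Scale(3.0), ReLU(), SumSquares()], [1.0, -2.0])
print(loss, grad)  # prints: 9.0 [18.0, 0.0]
```

At every step `grad` has the size of one intermediate tensor, never the size of a full Jacobian, which is what makes the backward pass affordable.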

Projected function derivative 104 The BP-transpose function z y x function projected onto p projected function derivative An equivalent circuit is obtained by introducing a transposed function f T

Backpropagation network 105 BP induces a transposed network xn xn-1 x1 x0 dxn dxn-1 dx1 dx0 where Note: the BP network is linear in dx1, …, dxn-1, dxn. Why?

Backpropagation network 106 BP induces a transposed network forward x0 x1 xn-1 xn dx0 dx1 dxn-1 dxn backward

Projected function derivative 107 Interpretation of vector-matrix product in BP z y x function projected onto p projected function derivative

Example: MatConvNet 108 Modular: every part can be used directly. Portable C++ / CUDA core: could be used directly or in other languages (e.g. Python or Lua). Building blocks: convolution, pooling, normalization, … (vl_nnconv(), vl_nnpool(), vl_nnnormalize(), cuDNN); atomic operations, reusable, flexible, GPU support. Wrappers: SimpleNN, DagNN; pack a CNN model, simple to use.

Anatomy of a building block 109 forward (eval) x vl_nnconv y W, b y = vl_nnconv(x, W, b)

Anatomy of a building block 110 forward (eval) x vl_nnconv y z() z R W, b y = vl_nnconv(x, W, b)

Anatomy of a building block 111 forward (eval) x vl_nnconv y z() z R W, b y = vl_nnconv(x, W, b) backward (backprop) dz dx vl_nnconv dz dy z() dz dw dz db dzdx = vl_nnconv(x, W, b, dzdy)


Anatomy of a building block 113 forward (eval) x vl_nnconv y z() z R W, b dz dx y = vl_nnconv(x, W, b) backward (backprop) vl_nnconv dz dy z() Very fast implementations Native MATLAB GPU support dz dw dz db dzdx = vl_nnconv(x, W, b, dzdy)
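The forward/backward calling convention of a building block can be imitated in a few lines. The sketch below is a Python analogue for a 1D "valid" convolution with a single filter; the name `conv1d` and the 1D setting are illustrative assumptions, and this is not MatConvNet code.

```python
def conv1d(x, w, b, dzdy=None):
    """Forward:  y = conv1d(x, w, b)
    Backward: dzdx, dzdw, dzdb = conv1d(x, w, b, dzdy)

    x and w are lists, b is a scalar bias; the output has
    len(x) - len(w) + 1 samples (no padding, no filter flipping).
    """
    n_out = len(x) - len(w) + 1
    if dzdy is None:
        return [b + sum(w[k] * x[i + k] for k in range(len(w)))
                for i in range(n_out)]
    dzdx = [0.0] * len(x)
    dzdw = [0.0] * len(w)
    for i in range(n_out):
        for k in range(len(w)):
            dzdx[i + k] += w[k] * dzdy[i]   # derivative w.r.t. the input
            dzdw[k] += x[i + k] * dzdy[i]   # derivative w.r.t. the filter
    dzdb = sum(dzdy)                        # derivative w.r.t. the bias
    return dzdx, dzdw, dzdb


x, w, b = [1.0, 2.0, 3.0, 4.0], [1.0, -1.0], 0.5
y = conv1d(x, w, b)
dzdx, dzdw, dzdb = conv1d(x, w, b, [1.0, 1.0, 1.0])
print(y)                 # prints: [-0.5, -0.5, -0.5]
print(dzdx, dzdw, dzdb)  # prints: [1.0, 0.0, 0.0, -1.0] [6.0, 9.0] 3.0
```

Passing the extra argument `dzdy` switches the same function from evaluation to backpropagation, so a network can be assembled from such blocks without any block knowing what comes after it.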

Summary 114 Progress: CNNs are still new, potential still being unveiled; depth, architectures, batch normalization, residual connections. Image segmentation: fully-convolutional nets, a label for each pixel; deconvolution, U-architectures, skip layers. Object detection: region nets, from pixels to a list of objects; R-CNN, Fast R-CNN, R-CNN minus R, Faster R-CNN. Text spotting: brute force from synthetic data. Backpropagation revisited.