Deep Residual Learning


Deep Residual Learning
MSRA @ ILSVRC & COCO 2015 competitions
Kaiming He, with Xiangyu Zhang, Shaoqing Ren, Jifeng Dai, & Jian Sun
Microsoft Research Asia (MSRA)

MSRA @ ILSVRC & COCO 2015 Competitions: 1st places in all five main tracks.
ImageNet Classification: "ultra-deep" (quote Yann) 152-layer nets
ImageNet Detection: 16% better than 2nd
ImageNet Localization: 27% better than 2nd
COCO Detection: 11% better than 2nd
COCO Segmentation: 12% better than 2nd
*improvements are relative numbers

Revolution of Depth
[Chart: ImageNet classification top-5 error (%) by year.]
ILSVRC'10: 28.2 (shallow)
ILSVRC'11: 25.8 (shallow)
ILSVRC'12 AlexNet: 16.4 (8 layers)
ILSVRC'13: 11.7 (8 layers)
ILSVRC'14 VGG: 7.3 (19 layers)
ILSVRC'14 GoogleNet: 6.7 (22 layers)
ILSVRC'15 ResNet: 3.57 (152 layers)

Revolution of Depth: engines of visual recognition
[Chart: PASCAL VOC 2007 object detection mAP (%).]
HOG, DPM (shallow): 34
AlexNet (RCNN, 8 layers): 58
VGG (RCNN, 16 layers): 66
ResNet (Faster RCNN, 101 layers)*: 86
*w/ other improvements & more data

Revolution of Depth
AlexNet, 8 layers (ILSVRC 2012):
11x11 conv, 96, /4, pool/2
5x5 conv, 256, pool/2
3x3 conv, 384
3x3 conv, 384
3x3 conv, 256, pool/2
fc, 4096
fc, 4096
fc, 1000

Revolution of Depth
[Architecture diagrams: AlexNet, 8 layers (ILSVRC 2012); VGG, 19 layers (ILSVRC 2014); GoogleNet, 22 layers (ILSVRC 2014), built from Inception modules of parallel 1x1/3x3/5x5 convs and max pooling, with auxiliary softmax classifiers.]

Revolution of Depth
[Architecture diagrams: AlexNet, 8 layers (ILSVRC 2012); VGG, 19 layers (ILSVRC 2014); ResNet, 152 layers (ILSVRC 2015). An animation scrolled through the full ResNet-152: a 7x7 conv, 64, /2 stem with pooling, then stages of stacked 1x1/3x3/1x1 bottleneck blocks whose widths grow from 64 to 2048, with /2 downsampling between stages, ending in average pooling and a 1000-way fc.]
Kaiming He, Xiangyu Zhang, Shaoqing Ren, & Jian Sun. Deep Residual Learning for Image Recognition. arXiv 2015.

Is learning better networks as simple as stacking more layers?

Simply stacking layers?
[Plots: CIFAR-10 train error (%) and test error (%) vs. iter. (1e4) for 20-layer and 56-layer plain nets.]
Plain nets: stacking 3x3 conv layers.
The 56-layer net has higher training error and test error than the 20-layer net.

Simply stacking layers?
[Plots: error (%) vs. iter. (1e4) for plain-20/32/44/56 on CIFAR-10 and plain-18/34 on ImageNet-1000; solid: test/val, dashed: train.]
Overly deep plain nets have higher training error.
A general phenomenon, observed in many datasets.

[Diagram: a shallower model (18 layers) beside a deeper counterpart (34 layers) with extra layers inserted.]
A deeper model should not have higher training error.
A solution by construction:
original layers: copied from a learned shallower model
extra layers: set as identity
=> at least the same training error.
Optimization difficulties: solvers cannot find this solution when going deeper.

Deep Residual Learning
Plain net: H(x) is any desired mapping; hope the two stacked weight layers fit H(x).
[Diagram: x -> weight layer -> relu -> weight layer -> relu -> H(x)]

Deep Residual Learning
Residual net: H(x) is any desired mapping; hope the two weight layers fit the residual F(x), with an identity shortcut:
let H(x) = F(x) + x.
[Diagram: x -> weight layer -> relu -> weight layer -> F(x); identity x is added before the final relu.]

Deep Residual Learning
F(x) is a residual mapping w.r.t. identity: H(x) = F(x) + x.
If identity were optimal, it is easy to set the weights to 0.
If the optimal mapping is closer to identity, it is easier to find the small fluctuations.
[Diagram: the residual block, as above.]
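To make the block concrete, here is a minimal sketch of the basic two-layer residual block, assuming PyTorch (the talk shows no code; the class name and sizes are illustrative):

import torch
import torch.nn as nn

class BasicResidualBlock(nn.Module):
    # Two 3x3 conv layers fit the residual F(x); the shortcut adds x,
    # so the block outputs H(x) = F(x) + x.
    def __init__(self, channels):
        super().__init__()
        self.conv1 = nn.Conv2d(channels, channels, 3, padding=1, bias=False)
        self.bn1 = nn.BatchNorm2d(channels)
        self.conv2 = nn.Conv2d(channels, channels, 3, padding=1, bias=False)
        self.bn2 = nn.BatchNorm2d(channels)
        self.relu = nn.ReLU(inplace=True)

    def forward(self, x):
        identity = x                                  # the skip connection
        out = self.relu(self.bn1(self.conv1(x)))
        out = self.bn2(self.conv2(out))               # F(x)
        return self.relu(out + identity)              # H(x) = F(x) + x

# If the weight layers learn F(x) = 0, the block is exactly the identity.
block = BasicResidualBlock(64)
y = block(torch.randn(1, 64, 32, 32))                # same shape in and out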

Related Works: Residual Representations
VLAD & Fisher Vector [Jegou et al 2010], [Perronnin et al 2007]: encoding residual vectors; powerful shallower representations.
Product Quantization (IVF-ADC) [Jegou et al 2011]: quantizing residual vectors; efficient nearest-neighbor search.
MultiGrid & Hierarchical Precondition [Briggs et al 2000], [Szeliski 1990, 2006]: solving residual sub-problems; efficient PDE solvers.

Network Design
[Diagram: plain net vs. ResNet, both VGG-style stacks of 3x3 convs with /2 downsampling, ending in avg pool and fc 1000.]
Keep it simple. Our basic design (VGG-style):
all 3x3 conv (almost)
spatial size /2 => # filters x2
Simple design; just deep! (The design rule is sketched in code below.)
Other remarks: no max pooling (almost), no hidden fc, no dropout.
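As a rough illustration of the "spatial /2 => filters x2" rule, a hedged sketch assuming PyTorch; plain_stack and its defaults are invented for illustration, not the paper's exact configuration:

import torch.nn as nn

def plain_stack(depths=(2, 2, 2), base_channels=16, in_channels=3):
    # VGG-style plain body: all 3x3 convs; whenever the spatial size
    # is halved (stride 2), the number of filters is doubled.
    layers, in_ch, ch = [], in_channels, base_channels
    for stage, depth in enumerate(depths):
        for block in range(depth):
            stride = 2 if (stage > 0 and block == 0) else 1  # /2 at stage entry
            layers += [nn.Conv2d(in_ch, ch, 3, stride=stride, padding=1, bias=False),
                       nn.BatchNorm2d(ch),
                       nn.ReLU(inplace=True)]
            in_ch = ch
        ch *= 2  # spatial /2 at the next stage => filters x2
    return nn.Sequential(*layers)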

Training
All plain/residual nets are trained from scratch.
All plain/residual nets use Batch Normalization.
Standard hyper-parameters & augmentation.
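For reference, the paper's ImageNet recipe is SGD with momentum 0.9 and weight decay 1e-4, starting at learning rate 0.1 and dividing it by 10 when the error plateaus. A sketch assuming PyTorch; the stand-in model and the fixed milestones are illustrative, not the paper's schedule:

import torch
import torch.nn as nn

model = nn.Conv2d(3, 64, 3)   # stand-in; any network works here
optimizer = torch.optim.SGD(model.parameters(), lr=0.1,
                            momentum=0.9, weight_decay=1e-4)
# The paper divides lr by 10 when the error plateaus;
# fixed milestones stand in for that here.
scheduler = torch.optim.lr_scheduler.MultiStepLR(optimizer,
                                                 milestones=[30, 60], gamma=0.1)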

CIFAR-10 experiments
[Plots: error (%) vs. iter. (1e4). Left, plain-20/32/44/56: deeper plain nets are worse. Right, ResNet-20/32/44/56/110: deeper ResNets are better. solid: test, dashed: train.]
Deep ResNets can be trained without difficulties.
Deeper ResNets have lower training error, and also lower test error.

ImageNet experiments
[Plots: error (%) vs. iter. (1e4). Left, plain-18/34: the 34-layer plain net is worse. Right, ResNet-18/34: the 34-layer ResNet is better. solid: test/val, dashed: train.]
Deep ResNets can be trained without difficulties.
Deeper ResNets have lower training error, and also lower test error.

ImageNet experiments: a practical design for going deeper
[Diagram: two blocks of similar complexity. Left, the all-3x3 block on 64-d features: 3x3, 64 -> relu -> 3x3, 64. Right, the bottleneck block on 256-d features: 1x1, 64 -> relu -> 3x3, 64 -> relu -> 1x1, 256.]
The bottleneck block is used for ResNet-50/101/152.
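A minimal sketch of the bottleneck block, again assuming PyTorch (the class name and defaults are illustrative). ResNet-152 stacks blocks like this in stages of 3, 8, 36, and 3:

import torch
import torch.nn as nn

class Bottleneck(nn.Module):
    # 1x1 reduces 256-d to 64-d, 3x3 operates on the thin features,
    # 1x1 restores 256-d; the identity shortcut spans all three layers.
    def __init__(self, channels=256, mid=64):
        super().__init__()
        self.f = nn.Sequential(
            nn.Conv2d(channels, mid, 1, bias=False),          # 1x1, 64
            nn.BatchNorm2d(mid), nn.ReLU(inplace=True),
            nn.Conv2d(mid, mid, 3, padding=1, bias=False),    # 3x3, 64
            nn.BatchNorm2d(mid), nn.ReLU(inplace=True),
            nn.Conv2d(mid, channels, 1, bias=False),          # 1x1, 256
            nn.BatchNorm2d(channels),
        )
        self.relu = nn.ReLU(inplace=True)

    def forward(self, x):
        return self.relu(self.f(x) + x)   # H(x) = F(x) + x

out = Bottleneck()(torch.randn(1, 256, 14, 14))   # shape is preserved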

ImageNet experiments
This model has lower time complexity than VGG-16/19.
Deeper ResNets have lower error. [Chart: 10-crop testing, top-5 val error (%).]
ResNet-34: 7.4
ResNet-50: 6.7
ResNet-101: 6.1
ResNet-152: 5.7

ImageNet experiments
[Same chart as earlier: ImageNet classification top-5 error (%) by year, from the shallow ILSVRC'10/'11 entries (28.2, 25.8) through AlexNet 16.4, VGG 7.3, GoogleNet 6.7, down to the 152-layer ResNet at 3.57.]

Just classification? The real treasure from ImageNet is learning features.

Features matter. (quote [Girshick et al. 2014], the R-CNN paper)

task                                  2nd-place winner   MSRA   margin (relative)
ImageNet Localization (top-5 error)   12.0               9.0    27%
ImageNet Detection (mAP@.5)           53.6               62.1   16% (absolute 8.5% better!)
COCO Detection (mAP@.5:.95)           33.5               37.3   11%
COCO Segmentation (mAP@.5:.95)        25.1               28.2   12%

Our results are all based on ResNet-101.
Our features are well transferable.

Object Detection (brief)
Simply Faster R-CNN + ResNet.
[Diagram: image -> CNN -> feature map -> Region Proposal Net -> proposals; RoI pooling -> classifier.]

Faster R-CNN baseline   mAP@.5   mAP@.5:.95
VGG-16                  41.5     21.5
ResNet-101              48.4     27.2

COCO detection results (ResNet has a 28% relative gain).
Shaoqing Ren, Kaiming He, Ross Girshick, & Jian Sun. Faster R-CNN: Towards Real-Time Object Detection with Region Proposal Networks. NIPS 2015.
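For readers who want to try the combination, a hedged sketch using torchvision's off-the-shelf Faster R-CNN; it ships a ResNet-50-FPN backbone rather than the talk's ResNet-101, so this only stands in for the setup:

import torch
import torchvision

# Pretrained Faster R-CNN with a ResNet-50 FPN backbone (a stand-in for
# the talk's ResNet-101 backbone; the overall pipeline is the same).
model = torchvision.models.detection.fasterrcnn_resnet50_fpn(weights="DEFAULT")
model.eval()

images = [torch.rand(3, 480, 640)]        # one dummy RGB image
with torch.no_grad():
    outputs = model(images)               # per-image boxes, labels, scores
print(outputs[0]["boxes"].shape)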

Object Detection (brief)
RPN learns proposals by extremely deep nets; we use only 300 proposals (no SS/EB/MCG!).
Add what is just missing in Faster R-CNN:
iterative localization
context modeling
multi-scale testing
All are based on CNN features; all are end-to-end (train and/or inference).
All benefit more from deeper features: cumulative gains!
Shaoqing Ren, Kaiming He, Ross Girshick, & Jian Sun. Faster R-CNN: Towards Real-Time Object Detection with Region Proposal Networks. NIPS 2015.

Our results on COCO: too many objects, let's check carefully!
[Detection result images; the original images are from the COCO dataset.]
Shaoqing Ren, Kaiming He, Ross Girshick, & Jian Sun. Faster R-CNN: Towards Real-Time Object Detection with Region Proposal Networks. NIPS 2015.

Instance Segmentation (brief)
Solely CNN-based ("features matter").
Differentiable RoI warping layer (w.r.t. box coordinates).
Multi-task cascades; exact end-to-end training.
[Pipeline diagram: conv feature map -> CONVs -> box instances (RoIs) -> RoI warping, pooling -> FCs -> mask instances -> masking -> FCs -> categorized instances (e.g. person, horse), for each RoI.]
Jifeng Dai, Kaiming He, & Jian Sun. Instance-aware Semantic Segmentation via Multi-task Network Cascades. arXiv 2015.
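The crop-and-resize step of the cascade can be sketched with torchvision's roi_align, used here purely for illustration; MNC's RoI warping layer additionally backpropagates through the box coordinates, which roi_align below does not:

import torch
from torchvision.ops import roi_align

feature_map = torch.randn(1, 256, 50, 50)       # N, C, H, W conv features
boxes = torch.tensor([[0., 4., 4., 28., 28.]])  # (batch_idx, x1, y1, x2, y2)
# Warp each RoI to a fixed 14x14 grid of features
rois = roi_align(feature_map, boxes, output_size=(14, 14), spatial_scale=1.0)
print(rois.shape)   # torch.Size([1, 256, 14, 14])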

[Input image and instance segmentation results; the original image is from the COCO dataset.]
Jifeng Dai, Kaiming He, & Jian Sun. Instance-aware Semantic Segmentation via Multi-task Network Cascades. arXiv 2015.

Conclusions
Deeper is still better.
Features matter!
Faster R-CNN is just amazing.
MSRA team: Kaiming He, Xiangyu Zhang, Shaoqing Ren, Jifeng Dai, Jian Sun.
Shaoqing Ren, Kaiming He, Ross Girshick, & Jian Sun. Faster R-CNN: Towards Real-Time Object Detection with Region Proposal Networks. NIPS 2015.
Jifeng Dai, Kaiming He, & Jian Sun. Instance-aware Semantic Segmentation via Multi-task Network Cascades. arXiv 2015.