CNNS FROM THE BASICS TO RECENT ADVANCES. Dmytro Mishkin Center for Machine Perception Czech Technical University in Prague

1 CNNS FROM THE BASICS TO RECENT ADVANCES Dmytro Mishkin Center for Machine Perception Czech Technical University in Prague

2 OUTLINE. Short review of the CNN design. Architecture progress: AlexNet, VGGNet, ResNet, now; GoogLeNet. Is VGGNet a universal design? Automatic architecture search. Design choices.

3 DATASETS USED IN PRESENTATION: IMAGENET AND CIFAR-10. ImageNet: 1.2M training images (~256 x 256 px), 50k validation images, 1000 classes. CIFAR-10: 50k training images (32 x 32 px), 10k validation images, 10 classes. Russakovsky et al., ImageNet Large Scale Visual Recognition Challenge, 2015. Krizhevsky, Learning Multiple Layers of Features from Tiny Images, 2009.

4 IMAGENET WINNERS

5 CAFFENET ARCHITECTURE. AlexNet (original): Krizhevsky et al., ImageNet Classification with Deep Convolutional Neural Networks. CaffeNet: Jia et al., Caffe: Convolutional Architecture for Fast Feature Embedding. Image credit: Roberto Matheus Pinheiro Pereira, Deep Learning Talk; Srinivas et al., A Taxonomy of Deep Convolutional Neural Nets for Computer Vision.

6 CAFFENET ARCHITECTURE. Mishkin et al., Systematic evaluation of CNN advances on the ImageNet, arXiv 2016

7 VGGNET ARCHITECTURE. All convolutions are 3x3. Good performance, but slow.

8 INCEPTION (GOOGLENET): BUILDING BLOCK. Szegedy et al., Going Deeper with Convolutions, CVPR 2015. Image credit:

9 DEPTH LIMIT NOT REACHED YET IN DEEP LEARNING

10 DEPTH LIMIT NOT REACHED YET IN DEEP LEARNING

11 DEPTH LIMIT NOT REACHED YET IN DEEP LEARNING. VGGNet: 19 layers, 19.6 billion FLOPs (Simonyan et al., 2014). ResNet: 152 layers, 11.3 billion FLOPs (He et al., 2015). Stochastic ResNet: 1200 layers (Huang et al., Deep Networks with Stochastic Depth). Slide credit: Šulc et al., Very Deep Residual Networks with MaxOut for Plant Identification in the Wild, 2016.

12 RESNET: RESIDUAL BLOCK. He et al., Deep Residual Learning for Image Recognition, CVPR 2016
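A residual block adds its input back to the output of a small stack of layers, so the stack only has to learn a correction to the identity mapping. The following is a minimal NumPy sketch of that idea, assuming fully-connected layers in place of the convolutions used by He et al. and leaving out batch normalization; the names `residual_block`, `w1` and `w2` are illustrative and do not come from the slides.

```python
import numpy as np

def relu(x):
    return np.maximum(x, 0.0)

def residual_block(x, w1, w2):
    """Compute relu(F(x) + x), where F is two linear layers with a ReLU between."""
    h = relu(x @ w1)          # first layer of the residual branch F
    return relu(h @ w2 + x)   # identity shortcut is added before the final ReLU

rng = np.random.default_rng(0)
x = rng.normal(size=(4, 16))                # hypothetical batch of 4 feature vectors
w1 = rng.normal(scale=0.1, size=(16, 16))   # hypothetical branch weights
w2 = rng.normal(scale=0.1, size=(16, 16))
y = residual_block(x, w1, w2)

# With an all-zero residual branch the block reduces to relu(x).
zeros = np.zeros((16, 16))
identity_like = residual_block(x, zeros, zeros)
```

Because the shortcut carries `x` unchanged, a block whose branch weights are near zero behaves almost like the identity, which is one common explanation for why very deep stacks of such blocks remain trainable.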

13 XCEPTION, RESNEXT: DEPTH-SEP CONV. Inception, Xception. F. Chollet, Xception: Deep Learning with Depthwise Separable Convolutions, arXiv 2016

14 XCEPTION, RESNEXT: DEPTH-SEP CONV. ResNet block, ResNeXt block. Xie et al., Aggregated Residual Transformations for Deep Neural Networks, CVPR 2017
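To give a feel for why depthwise separable and grouped convolutions are attractive, the sketch below counts weights for three layer types. It assumes square kernels and ignores biases; the channel count of 128 and the 32 groups are hypothetical values chosen only for illustration.

```python
def standard_conv_params(k, c_in, c_out):
    """Weights of a dense k x k convolution."""
    return k * k * c_in * c_out

def depthwise_separable_params(k, c_in, c_out):
    """Weights of a depthwise k x k convolution followed by a 1x1 convolution."""
    depthwise = k * k * c_in   # one k x k filter per input channel
    pointwise = c_in * c_out   # 1x1 convolution that mixes channels
    return depthwise + pointwise

def grouped_conv_params(k, c_in, c_out, groups):
    """Weights of a grouped convolution: each output sees c_in / groups inputs."""
    return k * k * (c_in // groups) * c_out

C = 128  # hypothetical channel count, used for both input and output
dense = standard_conv_params(3, C, C)
separable = depthwise_separable_params(3, C, C)
grouped = grouped_conv_params(3, C, C, groups=32)
print(dense, separable, grouped)
```

With these assumed sizes the separable layer needs roughly an eighth of the weights of the dense one, and the grouped layer fewer still; how much accuracy is kept is an empirical question that the cited papers address.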

15 ACCURACY-COMPLEXITY TRADE-OFF. Canziani et al., An analysis of deep neural network models for practical applications, 2016

16 ACCURACY-COMPLEXITY TRADE-OFF. Canziani et al., An analysis of deep neural network models for practical applications, 2016

17 RESIDUAL NETWORKS: HOT TOPIC. Identity mapping, Wide ResNets, Stochastic depth, Residual Inception, ResNets + ELU, ResNet in ResNet, DC Nets, Weighted ResNet.

18 AUTOMATIC ARCHITECTURE SEARCH: REINFORCEMENT LEARNING. ENASNet Conv cell, ENASNet Reduction cell. Pham et al., Efficient Neural Architecture Search via Parameter Sharing, 2018

19 AUTOMATIC ARCHITECTURE SEARCH: EVOLUTION. AmoebaNet Conv cell, AmoebaNet Reduction cell. Real et al., Regularized Evolution for Image Classifier Architecture Search, 2018

20 TOO MUCH ARCH OPTIMIZATION MAY HURT. ImageNet classification; PASCAL semantic segmentation. Long et al., Fully Convolutional Networks for Semantic Segmentation, CVPR 2015

21 TOO MUCH ARCH OPTIMIZATION MAY HURT. Performance on ESPGame dataset, semantic metrics (columns: Visual ST, Visual MEN, Multimodal ST, Multimodal MEN; rows: AlexNet, GoogLeNet, VGGNet). Kiela et al., Comparing Data Sources and Architectures for Deep Visual Representation Learning in Semantics

22 NOW SMALL DESIGN CHOICES

23 CAFFENET ARCHITECTURE. Image credit: Hu et al., Transferring Deep Convolutional Neural Networks for the Scene Classification of High-Resolution Remote Sensing Imagery, 2015

24 LIST OF HYPER-PARAMETERS TESTED

25 REFERENCE METHODS: IMAGE SIZE SENSITIVE. Mishkin et al., Systematic evaluation of CNN advances on the ImageNet, arXiv 2016

26 CHOICE OF NON-LINEARITY

27 CHOICE OF NON-LINEARITY. Mishkin et al., Systematic evaluation of CNN advances on the ImageNet, arXiv 2016

28 NON-LINEARITIES ON CAFFENET. Mishkin et al., Systematic evaluation of CNN advances on the ImageNet, arXiv 2016

29 BATCH NORMALIZATION (AFTER EVERY CONVOLUTION LAYER). Ioffe et al., ICML 2015

30 BATCH NORMALIZATION: WHERE, BEFORE OR AFTER NON-LINEARITY? ImageNet, top-1 accuracy (columns: Network, No BN, Before ReLU, After ReLU; rows: CaffeNet128-FC, GoogLeNet). CIFAR-10, top-1 accuracy, FitNet4 network (columns: Non-linearity, BN Before, BN After; rows: TanH, ReLU, MaxOut). In short: better to test with your architecture and dataset :) Mishkin and Matas, All you need is a good init, ICLR 2016. Mishkin et al., Systematic evaluation of CNN advances on the ImageNet, arXiv 2016
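The two placements compared on this slide differ only in the order of two operations. The sketch below shows both orders on a toy 2-D activation matrix. It is a simplification under stated assumptions: training-mode statistics only, no learned scale or shift beyond the defaults, and normalization over the batch axis of a fully-connected output rather than per channel of a convolutional feature map. The input data is random and purely illustrative.

```python
import numpy as np

def batch_norm(x, gamma=1.0, beta=0.0, eps=1e-5):
    """Training-mode batch normalization over the batch axis (axis 0)."""
    mean = x.mean(axis=0, keepdims=True)
    var = x.var(axis=0, keepdims=True)
    return gamma * (x - mean) / np.sqrt(var + eps) + beta

def relu(x):
    return np.maximum(x, 0.0)

rng = np.random.default_rng(0)
# Hypothetical layer output: 256 samples, 8 features, deliberately off-centre.
pre_activations = rng.normal(loc=2.0, scale=3.0, size=(256, 8))

bn_before = relu(batch_norm(pre_activations))  # layer -> BN -> ReLU
bn_after = batch_norm(relu(pre_activations))   # layer -> ReLU -> BN
```

With BN before the ReLU, the next layer receives non-negative inputs; with BN after the ReLU, it receives zero-mean, unit-variance inputs. Which works better is, as the slide says, something to test per architecture and dataset.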

31 BATCH NORMALIZATION SOMETIMES WORKS TOO WELL AND HIDES PROBLEMS. Case: the CNN has fewer outputs than there are classes in the dataset (just a typo): 26 vs. 28. The plain CNN diverges; the batch-normalized CNN learns well.

32 NON-LINEARITIES ON CAFFENET, WITH BATCH NORMALIZATION. Mishkin et al., Systematic evaluation of CNN advances on the ImageNet, arXiv 2016

33 NON-LINEARITIES: TAKE AWAY MESSAGE. Use ELU without batch normalization, or ReLU + BN. Try maxout for the final layers. Fallback solution (if something goes wrong): ReLU. Mishkin et al., Systematic evaluation of CNN advances on the ImageNet, arXiv 2016

34 BUT IN SMALL DATA REGIME (~50K IMAGES) TRY LEAKY OR RANDOMIZED RELU. Accuracy [%], Network in Network architecture (columns: ReLU, VLReLU, RReLU, PReLU; rows: CIFAR, CIFAR). LogLoss, Plankton VGG architecture (columns: ReLU, VLReLU, RReLU, PReLU; row: KNDB). Xu et al., Empirical Evaluation of Rectified Activations in Convolutional Network, ICLR 2015
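For reference, the activation functions named on the last few slides can be written as scalar functions in a few lines. This is a sketch using the standard textbook definitions; the default `slope`, `alpha`, `lower` and `upper` values are illustrative assumptions, not the settings used in the cited experiments, and the randomized variant shows only the training-time behaviour.

```python
import math
import random

def relu(x):
    """Rectified linear unit: zero for negative inputs."""
    return x if x > 0 else 0.0

def leaky_relu(x, slope=0.01):
    """Leaky ReLU: a small fixed slope for negative inputs."""
    return x if x > 0 else slope * x

def randomized_relu(x, lower=1 / 8, upper=1 / 3, rng=random):
    """Randomized ReLU at training time: the negative slope is sampled per call."""
    return x if x > 0 else rng.uniform(lower, upper) * x

def elu(x, alpha=1.0):
    """Exponential linear unit: saturates smoothly at -alpha for negative inputs."""
    return x if x > 0 else alpha * math.expm1(x)

for value in (-2.0, 0.0, 1.5):
    print(value, relu(value), leaky_relu(value, slope=0.1), elu(value))
```

A "very leaky" ReLU is the same `leaky_relu` with a larger slope; the exact slope is a hyper-parameter.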

35 INPUT IMAGE SIZE. Mishkin et al., Systematic evaluation of CNN advances on the ImageNet, arXiv 2016

36 PADDING TYPES. No padding, stride = 2. Zero padding, stride = 2. Zero padding, stride = 1. Dumoulin and Visin, A guide to convolution arithmetic for deep learning, arXiv 2016
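The three cases on this slide follow from the usual output-size relation for a convolution along one axis, out = floor((in + 2 * padding - kernel) / stride) + 1. A small sketch, where the 5-pixel input and 3-pixel kernel are hypothetical sizes chosen for illustration:

```python
def conv_output_size(input_size, kernel, stride=1, padding=0):
    """Output length along one axis of a convolution (or pooling) layer."""
    return (input_size + 2 * padding - kernel) // stride + 1

no_pad_stride2 = conv_output_size(5, 3, stride=2, padding=0)    # no padding, stride 2
zero_pad_stride2 = conv_output_size(5, 3, stride=2, padding=1)  # zero padding, stride 2
zero_pad_stride1 = conv_output_size(5, 3, stride=1, padding=1)  # zero padding, stride 1
print(no_pad_stride2, zero_pad_stride2, zero_pad_stride1)
```

The last case keeps the spatial size unchanged, which is the "preserving spatial size" property that zero padding is usually chosen for.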

37 PADDING. Zero-padding: preserves spatial size, so information is not washed out; acts as a dropout-like augmentation by zeros. CaffeNet128 with conv padding: 47% top-1 accuracy; without conv padding: 41% top-1 accuracy.

38 MAX POOLING: PADDING AND KERNEL. Mishkin et al., Systematic evaluation of CNN advances on the ImageNet, arXiv 2016

39 POOLING METHODS. Mishkin et al., Systematic evaluation of CNN advances on the ImageNet, arXiv 2016

40 POOLING METHODS. Mishkin et al., Systematic evaluation of CNN advances on the ImageNet, arXiv 2016
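The deck's take-home messages recommend a sum of average and max pooling. The sketch below shows what that combination computes on a single 2-D feature map. It assumes non-overlapping 2x2 windows to keep the code short; the window size and stride used in the cited experiments may differ, and `sum_pool` is an illustrative name.

```python
import numpy as np

def sum_pool(feature_map, k=2):
    """Sum of max pooling and average pooling over non-overlapping k x k windows."""
    h, w = feature_map.shape
    # Crop to a multiple of k, then expose each window as its own pair of axes.
    cropped = feature_map[:h - h % k, :w - w % k]
    blocks = cropped.reshape(h // k, k, w // k, k)
    return blocks.max(axis=(1, 3)) + blocks.mean(axis=(1, 3))

feature_map = np.arange(16, dtype=float).reshape(4, 4)  # toy 4x4 input
pooled = sum_pool(feature_map)
print(pooled)
```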

41 LEARNING RATE POLICY: LINEAR. Mishkin et al., Systematic evaluation of CNN advances on the ImageNet, arXiv 2016

42 LEARNING RATE POLICY: LINEAR COSINE. Bello et al. performed a large-scale reinforcement-learning search over learning rate schedules: "Interestingly, we also found that the linear cosine decay generally allows for a larger initial learning rate and leads to faster convergence." Bello et al., Neural Optimizer Search with Reinforcement Learning, arXiv 2017
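Both schedules can be sketched as functions of the training step. The linear policy simply ramps the rate down to zero. The linear cosine form below multiplies a linear ramp by a cosine term and adds a small floor; its exact shape and the defaults `num_periods`, `alpha` and `beta` are assumptions based on a commonly used formulation and may not match the authors' settings exactly.

```python
import math

def linear_decay(base_lr, step, total_steps):
    """Learning rate that falls linearly from base_lr to zero."""
    progress = min(step, total_steps) / total_steps
    return base_lr * (1.0 - progress)

def linear_cosine_decay(base_lr, step, total_steps,
                        num_periods=0.5, alpha=0.0, beta=0.001):
    """Linear ramp multiplied by a cosine term, plus a small constant floor."""
    progress = min(step, total_steps) / total_steps
    linear = 1.0 - progress
    cosine = 0.5 * (1.0 + math.cos(2.0 * math.pi * num_periods * progress))
    return base_lr * ((alpha + linear) * cosine + beta)

for step in (0, 25, 50, 75, 100):  # hypothetical 100-step run
    print(step, linear_decay(0.01, step, 100), linear_cosine_decay(0.01, step, 100))
```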

43 IMAGE PREPROCESSING. Subtract mean pixel (training set), divide by std. RGB is the best (standard) colorspace for CNNs. Do nothing more unless you have a specific dataset. Subtract local mean pixel: B. Graham, 2015, Kaggle Diabetic Retinopathy Competition report.
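The baseline recipe on this slide can be sketched in NumPy as follows. The statistics are computed once on the training set and reused for validation and test data. The array layout `(N, H, W, 3)`, the function names and the random placeholder "images" are assumptions made for the example.

```python
import numpy as np

def fit_pixel_statistics(train_images):
    """Per-channel mean and std over a training set shaped (N, H, W, 3)."""
    mean = train_images.mean(axis=(0, 1, 2))
    std = train_images.std(axis=(0, 1, 2))
    return mean, std

def normalize(images, mean, std, eps=1e-8):
    """Subtract the training mean pixel and divide by the training std."""
    return (images - mean) / (std + eps)

rng = np.random.default_rng(0)
# Placeholder data standing in for real RGB training images.
train = rng.integers(0, 256, size=(16, 32, 32, 3)).astype(np.float64)
mean, std = fit_pixel_statistics(train)
train_norm = normalize(train, mean, std)
```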

44 IMAGE PREPROCESSING: WHAT DOESN'T WORK. Mishkin et al., Systematic evaluation of CNN advances on the ImageNet, arXiv 2016

45 IMAGE PREPROCESSING: LET'S LEARN THE COLORSPACE. Mishkin et al., Systematic evaluation of CNN advances on the ImageNet, arXiv 2016. Image credit:

46 DATASET QUALITY AND SIZE. Mishkin et al., Systematic evaluation of CNN advances on the ImageNet, arXiv 2016

47 NETWORK WIDTH: SATURATION AND SPEED PROBLEM. Mishkin et al., Systematic evaluation of CNN advances on the ImageNet, arXiv 2016

48 BATCH SIZE AND LEARNING RATE. Mishkin et al., Systematic evaluation of CNN advances on the ImageNet, arXiv 2016

49 CLASSIFIER DESIGN. Take home: put fully-connected before final layer, not earlier. Mishkin et al., Systematic evaluation of CNN advances on the ImageNet, arXiv 2016. Ren et al., Object Detection Networks on Convolutional Feature Maps, arXiv

50 APPLYING ALTOGETHER. >5 pp. additional top-1 accuracy for free. Mishkin et al., Systematic evaluation of CNN advances on the ImageNet, arXiv 2016

51 TAKE HOME MESSAGES
  Use ELU if training without batchnorm, or ReLU with BN.
  Apply a learned colorspace transformation of RGB (2 layers of 1x1 convolution).
  Use the linear learning rate decay policy.
  Use a sum of the average and max pooling layers.
  Use a mini-batch size of around 128 or 256. If this is too big for your GPU, decrease the learning rate proportionally to the batch size.
  Use fully-connected layers as convolutional and average the predictions for the final decision.
  When investing in a larger training set, check whether a plateau has already been reached.
  Cleanliness of the data is more important than its size.
  If you cannot increase the input image size, reduce the stride in the subsequent layers; it has roughly the same effect.
  If your network has a complex and highly optimized architecture, e.g. GoogLeNet, be careful with modifications.
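The batch-size rule above amounts to keeping the ratio of learning rate to batch size fixed. A one-function sketch, where the reference values (learning rate 0.01 at batch size 256) are hypothetical and `scaled_learning_rate` is an illustrative name:

```python
def scaled_learning_rate(reference_lr, reference_batch, batch_size):
    """Scale the learning rate in proportion to the mini-batch size."""
    return reference_lr * batch_size / reference_batch

# Hypothetical reference setting: lr 0.01 was tuned for a batch of 256,
# but only 64 images fit on the GPU.
smaller_batch_lr = scaled_learning_rate(0.01, 256, 64)
print(smaller_batch_lr)
```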

52 THANK YOU FOR THE ATTENTION. Any questions? All logs, graphs and network definitions: Feel free to add your tests. The paper is here:

53 ARCHITECTURE. Use filters that are as small as possible: 3x3 + ReLU + 3x3 + ReLU > 5x5 + ReLU; 3x1 + 1x3 > 3x3; 2x2 + 2x2 > 3x3. Exception: the 1st layer, where 3x3 is too computationally inefficient. He and Sun, Convolutional Neural Networks at Constrained Time Cost, CVPR 2015
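One reason often given for these preferences is that stacks of small filters cover the same receptive field with fewer weights. The slide itself does not spell this out, so the arithmetic below is offered as supporting illustration only. It ignores biases and assumes stride 1 and a hypothetical 64 channels in and out of every layer.

```python
def conv_params(kernel_h, kernel_w, in_channels, out_channels):
    """Weight count of one convolution layer, biases ignored."""
    return kernel_h * kernel_w * in_channels * out_channels

def receptive_field(kernel_sizes):
    """Receptive field along one axis of stacked stride-1 convolutions."""
    field = 1
    for k in kernel_sizes:
        field += k - 1
    return field

C = 64  # hypothetical channel count for every layer
single_5x5 = conv_params(5, 5, C, C)
stacked_3x3 = 2 * conv_params(3, 3, C, C)
single_3x3 = conv_params(3, 3, C, C)
factored_3x3 = conv_params(3, 1, C, C) + conv_params(1, 3, C, C)
stacked_2x2 = 2 * conv_params(2, 2, C, C)

print(receptive_field([3, 3]), receptive_field([5]))  # same coverage
print(single_5x5, stacked_3x3)
print(single_3x3, factored_3x3, stacked_2x2)
```

The stacked variants also insert an extra non-linearity between the small filters, which a single large filter does not have.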

54 WEIGHT INITIALIZATION FOR A VERY DEEP NET. Gaussian noise with variance: $\mathrm{Var}(\omega_l) = 0.01$ (AlexNet, Krizhevsky et al., 2012); $\mathrm{Var}(\omega_l) = 1/n_{inputs}$ (Glorot et al., 2010); $\mathrm{Var}(\omega_l) = 2/n_{inputs}$ (He et al., 2015). Orthonormal (Saxe et al., 2013): Glorot, SVD, $\omega_l = V$. Data-dependent: LSUV (Mishkin et al., 2016). Mishkin and Matas, All you need is a good init, ICLR 2016
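The variance rules on this slide translate directly into code. The sketch below draws Gaussian weights with the Glorot-style and He-style variances as they are written on the slide, and builds an orthonormal matrix with a QR decomposition. Using QR instead of the SVD mentioned on the slide is a substitution made here for brevity, and the layer sizes and function names are hypothetical.

```python
import numpy as np

def gaussian_init(shape, variance, rng):
    """Zero-mean Gaussian weights with the requested variance."""
    return rng.normal(0.0, np.sqrt(variance), size=shape)

def glorot_variance(n_inputs):
    return 1.0 / n_inputs   # form given on the slide

def he_variance(n_inputs):
    return 2.0 / n_inputs   # form given on the slide

def orthonormal_init(n_out, n_in, rng):
    """Matrix with orthonormal columns (or rows) from QR of a Gaussian matrix."""
    a = rng.normal(size=(max(n_out, n_in), min(n_out, n_in)))
    q, _ = np.linalg.qr(a)              # q has orthonormal columns
    return q if n_out >= n_in else q.T  # final shape is (n_out, n_in)

rng = np.random.default_rng(0)
n_in, n_out = 500, 2000                 # hypothetical layer sizes
w_glorot = gaussian_init((n_out, n_in), glorot_variance(n_in), rng)
w_he = gaussian_init((n_out, n_in), he_variance(n_in), rng)
w_orth = orthonormal_init(n_out, n_in, rng)
```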

55 WEIGHT INITIALIZATION INFLUENCES ACTIVATIONS. a) Layer gain $G_L < 1$: vanishing activation variance. b) Layer gain $G_L > 1$: exploding activation variance. Mishkin and Matas, All you need is a good init, ICLR 2016
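The effect is easy to reproduce on a stack of random linear layers. In the sketch below, `gain` is taken to be the factor by which each layer scales the standard deviation of its input. That reading of the slide's $G_L$ is an assumption, as are the depth, width and batch size; non-linearities are left out to keep the demonstration minimal.

```python
import numpy as np

def output_variances(gain, depth=20, width=256, batch=512, seed=0):
    """Activation variance after each layer of a random linear network."""
    rng = np.random.default_rng(seed)
    x = rng.normal(size=(batch, width))
    variances = []
    for _ in range(depth):
        # Weight std gain / sqrt(width) scales the activation std by about `gain`.
        w = rng.normal(0.0, gain / np.sqrt(width), size=(width, width))
        x = x @ w
        variances.append(float(x.var()))
    return variances

vanishing = output_variances(0.5)
stable = output_variances(1.0)
exploding = output_variances(1.5)
print(vanishing[-1], stable[-1], exploding[-1])
```

After 20 layers the variance has collapsed towards zero for a gain below one and grown by many orders of magnitude for a gain above one, while a gain of one keeps it near its starting value.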

56 ACTIVATIONS INFLUENCE THE MAGNITUDE OF GRADIENT COMPONENTS. Mishkin and Matas, All you need is a good init, ICLR 2016

57 LAYER-SEQUENTIAL UNIT-VARIANCE ORTHOGONAL INITIALIZATION
Algorithm 1. Layer-sequential unit-variance orthogonal initialization. $L$: convolution or fully-connected layer, $W_L$: its weights, $O_L$: layer output, $\varepsilon$: variance tolerance, $T_i$: iteration number, $T_{max}$: max number of iterations.
  Pre-initialize network with orthonormal matrices as in Saxe et al. (2013)
  for each convolutional and fully-connected layer $L$ do
    do
      forward pass with mini-batch
      calculate $\mathrm{Var}(O_L)$
      $W_L^{i+1} = W_L^{i} / \sqrt{\mathrm{Var}(O_L)}$
    until $|\mathrm{Var}(O_L) - 1.0| < \varepsilon$ or ($T_i > T_{max}$)
  end for
* The LSUV algorithm does not deal with biases and initializes them with zeros.
Mishkin and Matas, All you need is a good init, ICLR 2016
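The algorithm above can be sketched in NumPy for a small fully-connected network. This is not the authors' implementation: it assumes ReLU activations, zero biases, QR-based orthonormal pre-initialization, and hypothetical layer sizes and tolerance, and it measures the variance of each layer's output before the non-linearity.

```python
import numpy as np

def orthonormal(n_in, n_out, rng):
    """Orthonormal pre-initialization, returned with shape (n_in, n_out)."""
    a = rng.normal(size=(max(n_in, n_out), min(n_in, n_out)))
    q, _ = np.linalg.qr(a)
    return q if n_in >= n_out else q.T

def lsuv_init(layer_sizes, batch, tol=0.01, max_iters=10, seed=0):
    """Rescale each layer in turn until its output variance is close to 1."""
    rng = np.random.default_rng(seed)
    weights = [orthonormal(n_in, n_out, rng)
               for n_in, n_out in zip(layer_sizes[:-1], layer_sizes[1:])]
    x = batch
    for i, w in enumerate(weights):
        for _ in range(max_iters):
            var = (x @ w).var()            # forward pass with the mini-batch
            if abs(var - 1.0) < tol:
                break
            w = w / np.sqrt(var)           # W <- W / sqrt(Var(O_L))
        weights[i] = w
        x = np.maximum(x @ w, 0.0)         # ReLU output feeds the next layer
    return weights

def layer_output_variances(weights, batch):
    """Variance of every layer's pre-activation output for a given batch."""
    x, variances = batch, []
    for w in weights:
        out = x @ w
        variances.append(float(out.var()))
        x = np.maximum(out, 0.0)
    return variances

rng = np.random.default_rng(1)
batch = rng.normal(size=(256, 64))         # hypothetical mini-batch
weights = lsuv_init([64, 128, 128, 10], batch)
print(layer_output_variances(weights, batch))
```

For purely linear layer outputs a single rescaling step already reaches unit variance; the loop with `max_iters` mirrors the do-until structure of Algorithm 1.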

58 COMPARISON OF THE INITIALIZATIONS FOR DIFFERENT ACTIVATIONS. CIFAR-10 FitNet, accuracy [%]. CIFAR-10 FitResNet, accuracy [%]. Mishkin and Matas, All you need is a good init, ICLR 2016

59 LSUV INITIALIZATION IMPLEMENTATIONS. Caffe, Keras, Torch. Mishkin and Matas, All you need is a good init, ICLR 2016

60 CAFFENET TRAINING. Mishkin and Matas, All you need is a good init, ICLR 2016

61 GOOGLENET TRAINING. Mishkin and Matas, All you need is a good init, ICLR 2016
