Inception and Residual Networks
Hantao Zhang, Deep Learning with Python
https://en.wikipedia.org/wiki/residual_neural_network

Deep Neural Network Progress from the Large Scale Visual Recognition Challenge (ILSVRC)
GoogLeNet (slide credit: Google Inc.)

1x1 Convolution
Does it make any sense to do 1x1 convolutions? Can we do dimensionality reduction along the depth (channel) dimension?
1x1 Filters and Average Pooling
A key contribution is using average pooling instead of fully connected layers.

1x1 Convolutions
- When # of input channels > # of filters: acts as dimension reduction: (height, width, channels) -> (height, width, filters).
- When # of input channels = # of filters: a projection onto a space of the same dimension.
- Increases non-linearity without affecting the receptive field.
- Acts like a coordinate-dependent transformation in filter space.
- Less prone to over-fitting due to the small kernel size (1x1).
- Another perspective: a fully connected layer applied with weight sharing across spatial positions.
- Used in many networks, including GoogLeNet.
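A minimal tf.keras sketch of how a 1x1 convolution reduces the channel dimension while leaving the spatial dimensions untouched (shapes here are assumptions for illustration):

    import tensorflow as tf

    # Assumed example shapes: a batch of 32x32 feature maps with 256 channels.
    x = tf.random.normal([8, 32, 32, 256])

    # A 1x1 convolution with 64 filters projects each position's 256-dim channel
    # vector down to 64 dims; the spatial size stays 32x32.
    reduce_1x1 = tf.keras.layers.Conv2D(filters=64, kernel_size=1, activation='relu')
    y = reduce_1x1(x)
    print(y.shape)  # (8, 32, 32, 64)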
Choice of Modules
1x1 conv? 3x3 conv? 5x5 conv? Pooling? Which one? Pick them all!

Inception Module (naïve version)
1x1 conv, 3x3 conv, 5x5 conv, and pooling applied in parallel.
Inception Module
From the naïve version to the Inception module with dimensionality reduction: 1x1 convolutions are added before the 3x3 and 5x5 convolutions (and after pooling) to reduce the number of channels.

Inception Module in GoogLeNet
GoogLeNet stacks 9 inception layers.
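A minimal tf.keras sketch of an inception module with dimensionality reduction (the filter counts are illustrative assumptions, not GoogLeNet's actual configuration):

    import tensorflow as tf
    from tensorflow.keras import layers

    def inception_module(x, f1=64, f3_reduce=96, f3=128, f5_reduce=16, f5=32, pool_proj=32):
        """Inception module with dimensionality reduction (illustrative filter counts)."""
        # Branch 1: plain 1x1 convolution.
        b1 = layers.Conv2D(f1, 1, padding='same', activation='relu')(x)
        # Branch 2: 1x1 reduction followed by a 3x3 convolution.
        b2 = layers.Conv2D(f3_reduce, 1, padding='same', activation='relu')(x)
        b2 = layers.Conv2D(f3, 3, padding='same', activation='relu')(b2)
        # Branch 3: 1x1 reduction followed by a 5x5 convolution.
        b3 = layers.Conv2D(f5_reduce, 1, padding='same', activation='relu')(x)
        b3 = layers.Conv2D(f5, 5, padding='same', activation='relu')(b3)
        # Branch 4: 3x3 max pooling followed by a 1x1 projection.
        b4 = layers.MaxPooling2D(3, strides=1, padding='same')(x)
        b4 = layers.Conv2D(pool_proj, 1, padding='same', activation='relu')(b4)
        # Concatenate all branches along the channel axis.
        return layers.Concatenate(axis=-1)([b1, b2, b3, b4])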
Classification results on ImageNet (slide credit: Google Inc.)

    Team         Year   Place   Error (top-5)   Uses external data
    SuperVision  2012   -       16.4%           no
    SuperVision  2012   1st     15.3%           ImageNet 22k
    Clarifai     2013   -       11.7%           no
    Clarifai     2013   1st     11.2%           ImageNet 22k
    MSRA         2014   3rd     7.35%           no
    VGG          2014   2nd     7.32%           no
    GoogLeNet    2014   1st     6.67%           no

GoogLeNet: only 5M parameters! (12x fewer than AlexNet, 27x fewer than the VGG net!)
GoogLeNet
Approximates an optimal local sparse structure using available dense components:
- 1x1 convolutions capture dense clusters.
- 3x3 and 5x5 convolutions capture more spatially spread-out clusters.
- A pooling layer generally improves performance.
- The outputs of all of these are concatenated and passed to the next layer.
This gives rise to the (naïve) Inception module.

Deep Neural Network Progress from the Large Scale Visual Recognition Challenge (ILSVRC)
Problem: training deeper networks is more difficult because of the vanishing/exploding gradients problem.
Solution: residual networks.
Residual Network
- Introduced residual nets in contrast to plain nets.
- Reached a 3.57% top-5 error rate!
- Winner of ImageNet 2015 in all sub-competitions!

Residual Block
- Treats the learned mapping as a perturbation on top of preserved base information.
- Avoids the vanishing/exploding gradients problem.
Residual Network
The residual is the difference between an original image and a changed image: the shortcut preserves the base information, so the stacked layers only have to model the perturbation.

Residual Network: Shortcut Connections
- Identity shortcuts: y = F(x, {W_i}) + x
- Projection shortcuts: y = F(x, {W_i}) + W_s x, where W_s matches the dimensions
- In TensorFlow: compute y = F(x, {W_i}), optionally project x = W_s x, then y = tf.add(y, x)
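A minimal tf.keras sketch of a residual block with an identity shortcut versus a projection shortcut (filter counts and layer choices here are assumptions for illustration):

    import tensorflow as tf
    from tensorflow.keras import layers

    def residual_block(x, filters, stride=1):
        """Two 3x3 conv layers F(x) plus a shortcut; projection only when shapes differ."""
        f = layers.Conv2D(filters, 3, strides=stride, padding='same', activation='relu')(x)
        f = layers.Conv2D(filters, 3, strides=1, padding='same')(f)

        if stride != 1 or x.shape[-1] != filters:
            # Projection shortcut: a 1x1 convolution W_s matches spatial size and channels.
            shortcut = layers.Conv2D(filters, 1, strides=stride, padding='same')(x)
        else:
            # Identity shortcut: pass x through unchanged.
            shortcut = x

        # y = F(x, {W_i}) + shortcut (equivalent to tf.add), followed by ReLU.
        return layers.Activation('relu')(layers.Add()([f, shortcut]))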
Residual learning building block: x -> conv2d -> ReLU -> conv2d -> + (shortcut from x) -> ReLU -> y

Residual Network Code Example

    def _residual_v1(self, x, kernel_size, in_filter, out_filter, stride):
        """Residual unit with 2 sub layers."""
        with tf.name_scope('residual_v1') as name_scope:
            orig_x = x

            # First sub layer: conv -> batch norm -> ReLU (may downsample via stride).
            x = self._conv(x, kernel_size, out_filter, stride)
            x = self._batch_norm(x)
            x = self._relu(x)

            # Second sub layer: conv -> batch norm (stride 1).
            x = self._conv(x, kernel_size, out_filter, 1)
            x = self._batch_norm(x)

            # Shortcut: if the number of filters changed, downsample the input with
            # average pooling and zero-pad its channels to match out_filter.
            if in_filter != out_filter:
                orig_x = self._avg_pool(orig_x, stride, stride)
                pad = (out_filter - in_filter) // 2
                orig_x = tf.pad(orig_x, [[0, 0], [0, 0], [0, 0], [pad, pad]])

            # y = F(x) + shortcut, followed by ReLU.
            x = self._relu(tf.add(x, orig_x))
            return x
Deep Residual Network

ResNet as an ensemble model?
Remove a Layer?
What happens if we remove the second layer?

Residual Networks
How many layers should a residual block stack? With a single layer the block reduces to y = W_1 x + x, which is just a linear transformation of x and offers no advantage.
Network Design
Basic design (VGG-style):
- All 3x3 conv (almost)
- When the spatial size is halved, the number of filters is doubled
- Batch normalization
- Simple design, just deep
Other remarks:
- No max pooling (almost)
- No hidden fully connected layers
- No dropout

Network Design: ResNet-152
- Uses bottleneck blocks
- ResNet-152 (11.3 billion FLOPs) has lower complexity than the VGG-16/19 nets (15.3/19.6 billion FLOPs)
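A minimal tf.keras sketch of the bottleneck idea (the 64/256 filter counts are illustrative assumptions): a 1x1 convolution reduces the channel count, the 3x3 convolution operates in the reduced space, and a final 1x1 convolution restores the dimension:

    import tensorflow as tf
    from tensorflow.keras import layers

    def bottleneck_block(x, reduced_filters=64, out_filters=256):
        """1x1 reduce -> 3x3 -> 1x1 expand, plus an identity shortcut."""
        f = layers.Conv2D(reduced_filters, 1, padding='same', activation='relu')(x)
        f = layers.Conv2D(reduced_filters, 3, padding='same', activation='relu')(f)
        f = layers.Conv2D(out_filters, 1, padding='same')(f)
        # Identity shortcut (assumes x already has out_filters channels).
        return layers.Activation('relu')(layers.Add()([f, x]))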
Residual Networks
Compare a shallow network (18 layers) with a deeper counterpart (34 layers).
Degradation problem: a deeper model should not have higher training error, since it can be constructed from the shallower model:
- Original layers: copied from the shallower model
- Extra layers: set to the identity
- This construction achieves at least the same training error.
Therefore, solvers might have difficulties approximating identity mappings with multiple non-linear layers.

Deep Neural Network
Overly deep plain nets have higher training error, a general phenomenon observed in many datasets.
Kaiming He, Xiangyu Zhang, Shaoqing Ren, & Jian Sun. Deep Residual Learning for Image Recognition. arXiv 2015.
Residual Network
Deeper ResNets have lower training error.

Results
- Deep ResNets can be trained without difficulty.
- Deep ResNets have lower training error, and also lower test error.
Residual Networks: Results
1st place in all five main tracks of the ILSVRC & COCO 2015 competitions:
- ImageNet Classification
- ImageNet Detection
- ImageNet Localization
- COCO Detection
- COCO Segmentation
Kaiming He, Xiangyu Zhang, Shaoqing Ren, & Jian Sun. Deep Residual Learning for Image Recognition. arXiv 2015.
Residual Networks
Dealing with different dimensions between the block output and the shortcut:
(A) Zero padding (no extra parameters)
(B) Zero padding combined with 1x1 conv projections (projections only where dimensions change)
(C) All shortcuts as 1x1 conv projections

Possible Architectures for Residual Blocks
AlphaGo uses the first variant with rectified non-linearities.
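A minimal tf.keras sketch contrasting option (A), a parameter-free zero-padding shortcut, with option (C), a learned 1x1 conv projection (helper names and shapes here are assumptions for illustration):

    import tensorflow as tf
    from tensorflow.keras import layers

    def shortcut_zero_pad(x, out_filters, stride):
        """Option (A): downsample spatially, then zero-pad the channel dimension."""
        if stride > 1:
            x = layers.AveragePooling2D(pool_size=stride, strides=stride)(x)
        # Assumes the channel count of x is known statically.
        pad = out_filters - x.shape[-1]
        # Pad only the channel axis; no learnable parameters are introduced.
        return tf.pad(x, [[0, 0], [0, 0], [0, 0], [0, pad]])

    def shortcut_projection(x, out_filters, stride):
        """Option (C): a learned 1x1 convolution matches both spatial size and channels."""
        return layers.Conv2D(out_filters, 1, strides=stride, padding='same')(x)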
Highway Networks [Srivastava et al. 2015]
"We refer to networks with this architecture as highway networks, since they allow unimpeded information flow across several layers on information highways."
y = H(x, W_H) * T(x, W_T) + x * C(x, W_C)
where T(x, W_T) is the transform gate and C(x, W_C) is the carry gate.
For simplicity, let C = 1 - T:
y = H(x, W_H) * T(x, W_T) + x * (1 - T(x, W_T))
so that
y = x             if T(x, W_T) = 0
y = H(x, W_H)     if T(x, W_T) = 1
The transform gate T(x, W_T) = sigmoid(W_T x + b_T) is learned via backpropagation.
Initialize b_T with a negative value (e.g. -3) to give the layer an initial carry behavior (inspired by the initial bias of LSTM gates).
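A minimal tf.keras sketch of a fully connected highway layer with the carry gate tied to 1 - T and the gate bias initialized to -3 (layer sizes are assumptions for illustration):

    import tensorflow as tf
    from tensorflow.keras import layers

    class HighwayLayer(layers.Layer):
        """y = H(x) * T(x) + x * (1 - T(x)) with a dense H and a sigmoid transform gate T."""

        def __init__(self, units):
            super().__init__()
            # Assumes the input already has `units` features so x can be carried unchanged.
            self.h = layers.Dense(units, activation='relu')
            # Negative bias init (-3) makes T start near 0, so the layer initially carries x.
            self.t = layers.Dense(units, activation='sigmoid',
                                  bias_initializer=tf.keras.initializers.Constant(-3.0))

        def call(self, x):
            h = self.h(x)
            t = self.t(x)
            return h * t + x * (1.0 - t)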
Highway vs. Residual Networks [Srivastava et al. 2015]

Highway Nets: y = H(x, W_H) * T(x, W_T) + x * (1 - T(x, W_T))
- Can pass through the transform gate, the carry gate, or a linear combination of them
- Can behave differently for different features
- Learn the gate functions in a data-driven way
- Extra parameters (for the gates)
- No improvement with deeper nets

Residual Nets: y = F(x, {W_i}) + x
- Always pass both the identity and the transformation
- Same behavior for all features
- No data-driven gating
- Parameter-free shortcut
- Perform way better in practice

Highway vs. Residual Networks (CIFAR): comparison of highway nets and residual nets.
Densely Connected CNNs
Depth vs. Width
The authors of residual networks tried to make them as thin as possible in favor of increasing their depth and having fewer parameters, and even introduced a "bottleneck" block which makes ResNet blocks even thinner. Note, however, that the residual block with identity mapping, which makes it possible to train very deep networks, is at the same time a weakness of residual networks: as the gradient flows through the network there is nothing to force it to go through the residual block weights, so a block can avoid learning anything during training. It is therefore possible that either only a few blocks learn useful representations, or many blocks share very little information and contribute little to the final goal.

Exploring over 1000 layers
Testing 1202 layers: training finishes and the training error is similar, but the test error is high because of over-fitting.
Experimental Results
8 times faster to train.
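The wide residual networks referenced here multiply the number of filters in each block by a widening factor k. A minimal tf.keras sketch (base_filters and k are illustrative assumptions, not the paper's exact configuration):

    import tensorflow as tf
    from tensorflow.keras import layers

    def wide_residual_block(x, base_filters=16, k=4):
        """A basic 3x3/3x3 residual block whose width is scaled by the widening factor k."""
        filters = base_filters * k  # widening: k times more filters per layer
        f = layers.BatchNormalization()(x)
        f = layers.Activation('relu')(f)
        f = layers.Conv2D(filters, 3, padding='same')(f)
        f = layers.BatchNormalization()(f)
        f = layers.Activation('relu')(f)
        f = layers.Conv2D(filters, 3, padding='same')(f)
        if x.shape[-1] != filters:
            # Projection shortcut when the widened block changes the channel count.
            x = layers.Conv2D(filters, 1, padding='same')(x)
        return layers.Add()([f, x])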
Wide Networks
- Widening consistently improves performance across residual networks of different depths.
- Increasing both depth and width helps until the number of parameters becomes too high and stronger regularization is needed.
- Wide networks can successfully learn with 2 or more times as many parameters as thin ones; matching this capacity with thin networks would require doubling their depth, making them infeasibly expensive to train.

Conclusions
- Deeper networks can handle more complex problems: larger receptive field, more non-linearity.
- However, training deeper networks is more difficult because of the vanishing/exploding gradients problem; residual networks help avoid this problem.
- Wide networks are an alternative for achieving the same or better performance.
- No matter how much deeper or wider the network, keep the total size as small as possible.