Restricted Connectivity in Deep Neural Networks Yani Ioannou University of Cambridge April 7, 2017
Overview Introduction Research Overview PhD Research Motivation Structural Priors Spatial Structural Priors Filter-wise Structural Priors Summary/Future Work Collaborative Research
Introduction
About Me Ph.D. student in the Department of Engineering at the University of Cambridge. Supervised by Professor Roberto Cipolla, head of the Computer Vision and Robotics group in the Machine Intelligence Lab, and Dr. Antonio Criminisi, a principal researcher at Microsoft Research.
Research Overview
First Author Publications during PhD

Decision Forests, Convolutional Networks and the Models in-between. Y. Ioannou, D. Robertson, D. Zikic, P. Kontschieder, J. Shotton, M. Brown, A. Criminisi. MSR Technical Report 2015.

Training CNNs with Low-Rank Filters for Efficient Image Classification. Y. Ioannou, D. Robertson, J. Shotton, R. Cipolla, A. Criminisi. ICLR 2016.

Deep Roots: Improving CNN Efficiency with Hierarchical Filter Groups. Y. Ioannou, D. Robertson, R. Cipolla, A. Criminisi. CVPR 2017.
Motivation
ILSVRC: ImageNet Large-Scale Visual Recognition Challenge 1

1.2 million training images, 1000 classes. 50,000-image validation/test set.

In 2012, Alex Krizhevsky won the challenge with a CNN 2. AlexNet achieved 15.3% top-5 error, against 26.2% for the second-best entry. State-of-the-art now beats human error (~5%).

1 Russakovsky et al., ImageNet Large Scale Visual Recognition Challenge.
2 Krizhevsky, Sutskever, and Hinton, ImageNet Classification with Deep Convolutional Neural Networks.
AlexNet Complexity 3

61 million parameters. 724 million FLOPS (per sample). ImageNet has 1.28 million training samples; at dimensions of 227 × 227 × 3 per image, that is roughly 200 billion pixels.

3 Krizhevsky, Sutskever, and Hinton, ImageNet Classification with Deep Convolutional Neural Networks.
AlexNet Complexity - FLOPS

[Bar chart: per-layer FLOPS (×10^8) for conv1-conv5 and fc6-fc8.]
AlexNet Complexity - Parameters

[Bar chart: per-layer parameter counts (×10^7) for conv1-conv5 and fc6-fc8.] 96% of the parameters are in the fully connected layers.
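The per-layer breakdown can be sanity-checked from AlexNet's published layer shapes. A minimal sketch (weights only, biases omitted; the grouped layers conv2/conv4/conv5 see half their input channels):

```python
# AlexNet parameter counts per layer, weights only (layer shapes from the
# AlexNet paper; grouped layers see c_in/2 channels per filter).
layers = {                       # (c_out, h, w, c_in seen by each filter)
    "conv1": (96, 11, 11, 3),
    "conv2": (256, 5, 5, 48),    # 2 filter groups: 96/2 input channels
    "conv3": (384, 3, 3, 256),
    "conv4": (384, 3, 3, 192),   # 2 filter groups
    "conv5": (256, 3, 3, 192),   # 2 filter groups
    "fc6": (4096, 1, 1, 9216),   # 6 * 6 * 256 inputs
    "fc7": (4096, 1, 1, 4096),
    "fc8": (1000, 1, 1, 4096),
}
params = {k: c2 * h * w * c1 for k, (c2, h, w, c1) in layers.items()}
total = sum(params.values())
fc = params["fc6"] + params["fc7"] + params["fc8"]
print(round(total / 1e6, 1))     # 61.0 (million parameters)
print(round(100 * fc / total))   # 96 (% of parameters in the fc layers)
```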
Hinton Philosophy

"If you give me any sized dataset, what you ought to do if you want good generalization is get yourself into the small data regime. That is, however big your dataset, you ought to make a much bigger model so that that's small data. So, I think what the brain is doing is... regularizing the hell out of it, and that's a better thing to do than what statisticians used to think you should do, which is have a small model."

Geoffrey Hinton 4, Cambridge, June 2015

4 http://sms.cam.ac.uk/media/207973
[Scatter plot: top-5 error vs. log10(multiply-accumulate operations) for AlexNet, the VGG family (vgg-11 through vgg-19), GoogLeNet variants, ResNet-50, the MSRA models, and pre-ResNet 200; crop & mirror augmentation vs. extra augmentation is marked.]

[Scatter plot: top-5 error vs. log10(number of parameters) for the same networks.]
Generalization and Num. Parameters

[Figure: polynomial fits to noisy samples of a function. (a) 3rd-order polynomial, 3 points. (b) 20th-order polynomial, 3 points. (c) 3rd-order polynomial, 10 points. (d) 20th-order polynomial, 10 points.]
When fitting a curve, we often have little idea of what order of polynomial would best fit the data!

Weak Prior - Regularization. Knowing only that our model is over-parameterized is a relatively weak prior; however, we can encode it into the fit using a regularization penalty, for example on the L2 norm of the model weights. This restricts the model to effectively using only a small number of its parameters.

Strong Priors - Structural. With more prior information on the task, e.g. the convexity of the data, we may infer that a certain order of polynomial is more appropriate, and restrict learning to particular orders.
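The weak (L2) prior can be sketched as a ridge-penalized polynomial fit; the target function, noise level, and penalty strength below are arbitrary illustrative choices:

```python
import numpy as np

rng = np.random.default_rng(0)
x = np.linspace(0.0, 1.0, 10)
y = np.sin(2 * np.pi * x) + 0.1 * rng.standard_normal(10)

def ridge_polyfit(x, y, order, lam):
    # Design matrix with columns 1, x, x^2, ..., x^order.
    X = np.vander(x, order + 1, increasing=True)
    # Closed-form ridge solution: (X^T X + lam * I)^-1 X^T y.
    return np.linalg.solve(X.T @ X + lam * np.eye(order + 1), X.T @ y)

w_plain = ridge_polyfit(x, y, 9, 0.0)    # degree 9, 10 points: interpolates
w_ridge = ridge_polyfit(x, y, 9, 1e-3)   # L2 penalty shrinks the weights

# The penalty restricts the model to effectively using fewer parameters:
print(np.linalg.norm(w_ridge) < np.linalg.norm(w_plain))
```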
The Problem

Creating a massively over-parameterized network has consequences.

Training time: translates into 2-3 weeks of training on 8 GPUs (ResNet 200)!
Forward pass (ResNet 50): 2 ms GPU, 62 ms CPU.
Forward pass (GoogLeNet): 4.4 ms GPU, 300 ms CPU.

But what about the practicalities of using deep learning: on embedded devices, in realtime applications, backed by distributed/cloud computing?
Compression/Representation

Isn't this already being addressed? Approximation (compression/pruning) of neural networks and reduced representations (8-bit floats, even binary!) allow us to trade off compute vs. accuracy, and these methods will still apply to any network.

Instead, let's try to address the fundamental problem of over-parameterization.
Alternative Philosophy

Deep networks need many more parameters than data points because they aren't just learning to model data, but also learning what not to learn.

Idea: why don't we help the network, through structural priors, not to learn things it doesn't need to?

The Atlas Slave (Accademia, Florence)
Structural Priors 20
Typical Convolutional Layer

[Diagram: c2 filters, each of shape h × w × c1, are convolved with an H × W × c1 image/feature map to produce an H2 × W2 × c2 output feature map.]
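As a concrete shape check, a layer like this can be sketched naively (valid padding, stride 1; illustrative only, not an efficient implementation):

```python
import numpy as np

def conv2d(x, filters):
    """Apply c2 filters of shape h x w x c1 to an H x W x c1 feature map."""
    H, W, c1 = x.shape
    c2, h, w, _ = filters.shape
    H2, W2 = H - h + 1, W - w + 1
    out = np.zeros((H2, W2, c2))
    for i in range(H2):
        for j in range(W2):
            patch = x[i:i + h, j:j + w, :]                    # h x w x c1 window
            out[i, j] = np.tensordot(filters, patch, axes=3)  # one value per filter
    return out

x = np.random.rand(8, 8, 3)       # H x W x c1 image
f = np.random.rand(16, 3, 3, 3)   # c2 = 16 filters of shape 3 x 3 x 3
print(conv2d(x, f).shape)         # (6, 6, 16), i.e. H2 x W2 x c2
```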
Sparsity of Convolution

[Figure: connection weights and connection structure between the pixels of a zero-padded 3 × 4 input image and a 4 × 3 output feature map for (a) a fully connected layer, (b) a 3 × 3 square convolution, (c) a 3 × 1 vertical convolution, and (d) a 1 × 3 horizontal convolution.]
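The point of the figure - convolution is a fully connected layer with a sparse, weight-shared connection matrix - can be sketched in 1-D (the kernel and input below are arbitrary examples):

```python
import numpy as np

def conv_matrix(kernel, n):
    """Weight matrix of a 1-D 'valid' convolution over an n-pixel input."""
    k = len(kernel)
    M = np.zeros((n - k + 1, n))
    for i in range(n - k + 1):
        M[i, i:i + k] = kernel   # same weights on every row, zeros elsewhere
    return M

kernel = np.array([1.0, -2.0, 1.0])
x = np.arange(8, dtype=float)
# np.convolve flips its kernel, so flip it back to match the matrix form.
direct = np.convolve(x, kernel[::-1], mode="valid")
print(np.allclose(conv_matrix(kernel, 8) @ x, direct))  # True
```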
Why are CNNs uniformly structured?

"The marvelous powers of the brain emerge not from any single, uniformly structured connectionist network but from highly evolved arrangements of smaller, specialized networks which are interconnected in very specific ways." - Marvin Minsky, Perceptrons (1988 edition)

Deep networks are largely monolithic (uniformly connected), with few exceptions. Why don't we try to structure our networks closer to the specialized components required for learning from images?
Spatial Structural Priors

[Title page: Training CNNs with Low-Rank Filters for Efficient Image Classification. Yani Ioannou 1, Duncan Robertson 2, Jamie Shotton 2, Roberto Cipolla 1 & Antonio Criminisi 2. 1 University of Cambridge, 2 Microsoft Research. Published as a conference paper at ICLR 2016. Abstract: We propose a new method for creating computationally efficient convolutional neural networks (CNNs) by using low-rank representations of convolutional filters. Rather than approximating filters in previously-trained networks with more efficient versions, we learn a set of small basis filters from scratch; during training, the network learns to combine these basis filters into more complex filters...]
Proposed Method

Learning a basis for filters: m small basis filters are learned, and d composite filters are formed from combinations of them.

A learned basis of vertical/horizontal rectangular filters and square filters! The shape of the learned composite filters is a full w × h × c, but what can be effectively learned is limited by the number and complexity of the basis components.
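A rough parameter count shows why such a basis is cheaper than full filters; the layer sizes and basis size below are hypothetical, not taken from the paper:

```python
# Full layer: d filters of shape h x w x c (hypothetical sizes).
h, w, c, d = 3, 3, 256, 256
full = h * w * c * d                          # 589824 weights

# Low-rank alternative: m vertical h x 1 x c basis filters, then d
# horizontal 1 x w filters over the m basis responses (m is assumed).
m = 128
low_rank = (h * 1 * c * m) + (1 * w * m * d)  # 196608 weights
print(full, low_rank)                         # the basis is 3x cheaper here
```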
VGG/Imagenet Results

[Plots: top-5 error vs. multiply-accumulate operations (×10^9) and vs. parameters (×10^8) for the vgg-11 baseline and our networks: gmp, gmp-sf, gmp-lr, gmp-lr-lde, gmp-lr-join, gmp-lr-2x, gmp-lr-join-wfull.]
Imagenet Results

VGG-11 (low-rank): 24% smaller, 4% fewer FLOPS.
VGG-11 (low-rank/full-rank mix): 6% fewer FLOPS with lower error on ILSVRC val, but 6% larger.
GoogLeNet: 4% smaller, 26% fewer FLOPS. Or better results if you tune GoogLeNet further...
[Title page: Rethinking the Inception Architecture for Computer Vision. Christian Szegedy, Vincent Vanhoucke, Sergey Ioffe, Jonathon Shlens, Zbigniew Wojna. arXiv:1512.00567v3 [cs.CV], Dec 2015. The paper factorizes n × n convolutions into 1 × n and n × 1 convolutions in its Inception modules (Figures 6 and 7), reporting 21.2% top-1 and 5.6% top-5 single-frame error on the ILSVRC 2012 classification validation set at a cost of 5 billion multiply-adds per inference and fewer than 25 million parameters.]
Filter-wise Structural Priors

[Title page: Deep Roots: Improving CNN Efficiency with Hierarchical Filter Groups. Yani Ioannou 1, Duncan Robertson 2, Roberto Cipolla 1, Antonio Criminisi 2. 1 University of Cambridge, 2 Microsoft Research. arXiv, Nov 2016. Abstract: We propose a new method for creating computationally efficient and compact convolutional neural networks... a carefully designed sparse network connection structure can also have a regularizing effect.]
Typical Convolutional Layer

[Diagram: c2 filters, each of shape h × w × c1, are convolved with an H × W × c1 image/feature map to produce an H2 × W2 × c2 output feature map.]
AlexNet Filter Grouping

Uses 2 filter groups in most of the convolutional layers. This allowed training across two GPUs (model parallelism).
AlexNet Filter Grouping

[Plot: top-5 validation error (18-20%) vs. model parameters for AlexNet trained with no groups, 2 groups, and 4 groups.]
Grouped Convolutional Layer

[Diagram: the c2 filters are split into g groups of c2/g filters, each of shape h × w × c1/g; each group operates on its own c1/g-channel slice of the H × W × c1 feature map, and the group outputs are concatenated into an H2 × W2 × c2 output feature map.]
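A grouped layer can be sketched by slicing channels (naive valid-padding code, illustrative only):

```python
import numpy as np

def grouped_conv2d(x, filters, g):
    """c2 filters in g groups, each seeing only c1/g input channels."""
    H, W, c1 = x.shape
    c2, h, w, _ = filters.shape          # filters: c2 x h x w x (c1/g)
    outs = []
    for k in range(g):
        xs = x[:, :, k * (c1 // g):(k + 1) * (c1 // g)]   # this group's channels
        fs = filters[k * (c2 // g):(k + 1) * (c2 // g)]   # this group's filters
        o = np.zeros((H - h + 1, W - w + 1, c2 // g))
        for i in range(o.shape[0]):
            for j in range(o.shape[1]):
                o[i, j] = np.tensordot(fs, xs[i:i + h, j:j + w, :], axes=3)
        outs.append(o)
    return np.concatenate(outs, axis=-1)  # stack group outputs channel-wise

x = np.random.rand(8, 8, 4)               # H x W x c1
f = np.random.rand(8, 3, 3, 2)            # c2 = 8 filters, each sees c1/g = 2 channels
print(grouped_conv2d(x, f, g=2).shape)    # (6, 6, 8)
```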
Root Modules

Root-2 Module: d filters in g = 2 filter groups, of shape h × w × c/2, followed by a 1 × 1 convolution with full connectivity across channels.

Root-4 Module: d filters in g = 4 filter groups, of shape h × w × c/4, followed by the same full 1 × 1 convolution.
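Counting parameters shows what a root module saves; the channel counts below are chosen to match NiN's 192-channel layers, but the counting itself is a simplification of the paper's modules:

```python
# A root module, roughly: d grouped spatial filters (each seeing c/g input
# channels) followed by a full 1 x 1 layer that mixes across the groups.
def root_module_params(c, d, h, w, g, c_out):
    spatial = d * h * w * (c // g)   # grouped h x w x (c/g) filters
    mix = c_out * 1 * 1 * d          # full 1 x 1 convolution on d channels
    return spatial + mix

c = d = c_out = 192                  # e.g. a 192-channel NiN block (assumed)
for g in (1, 2, 4):
    print(g, root_module_params(c, d, 3, 3, g, c_out))
# The grouped spatial cost halves with each doubling of g, while the
# 1 x 1 mixing cost stays fixed.
```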
Network-in-Network

H × W × 3 input → conv1a, conv1b, conv1c → conv2a, conv2b, conv2c → conv3a, conv3b, conv3c → pool → output
NiN Root Architectures

[Diagram: a root-4 module and a root-2 module within the network.]
NiN Root Architectures

Network-in-Network: filter groups in each convolutional layer (the conv1a, conv2a, and conv3a filters are 5 × 5, 5 × 5, and 3 × 3 respectively; the b and c layers are 1 × 1).

Model     conv1 (a b c)   conv2 (a b c)   conv3 (a b c)
Orig.     1  1  1         1  1  1         1  1  1
root-2    1  1  1         2  1  1         1  1  1
root-4    1  1  1         4  1  1         2  1  1
root-8    1  1  1         8  1  1         4  1  1
root-16   1  1  1         16 1  1         8  1  1
CIFAR-10: Model Parameters vs. Error

[Plot: error (7.8-8.8%) vs. model parameters (×10^6) for NiN, root, tree, and column variants with 2-16 filter groups.] NiN: mean and standard deviation (error bars) are shown over 5 different random initializations.

CIFAR-10: FLOPS (Multiply-Add) vs. Error

[Plot: error vs. FLOPS (multiply-add, ×10^8) for the same variants.] NiN: mean and standard deviation (error bars) are shown over 5 different random initializations.
Inter-layer Filter Covariance

[Figure: inter-layer filter covariance between conv2c and conv3a (192 × 192 channels) for (a) Standard: g = 1, (b) Root-4: g = 2, (c) Root-32: g = 16. The block-diagonal sparsity learned by a root module is visible in the correlation of filters on layers conv3a and conv2c in the NiN network.]
ResNet 50 - All Filters

[Plots: top-5 error (7-9%) vs. model parameters (×10^7) and vs. FLOPS (multiply-add, ×10^9) for 2 to 64 filter groups.]

[Plots: top-5 error vs. GPU forward time (ms) and vs. CPU forward time (ms) for 2 to 64 filter groups.]
Imagenet Results

Networks with root modules have similar or higher accuracy than the baseline architectures, with much less computation:

ResNet 50 5: 40% smaller, 45% fewer FLOPS.
ResNet 200 6: 44% smaller, 25% fewer FLOPS.
GoogLeNet: 7% smaller, 44% fewer FLOPS.

But when you also increase the number of filters...

5 Caffe re-implementation. 6 Based on the Facebook Torch model.
[Title page: Aggregated Residual Transformations for Deep Neural Networks. Saining Xie, Ross Girshick, Piotr Dollár, Zhuowen Tu, Kaiming He. UC San Diego / Facebook AI Research. arXiv:1611.05431 [cs.CV], Nov 2016. ResNeXt repeats a building block that aggregates a set of transformations with the same topology, exposing cardinality (the size of the set of transformations) as an essential factor in addition to depth and width: "increasing cardinality is more effective than going deeper or wider when we increase the capacity." Figure 1 contrasts a ResNet block with a ResNeXt block of cardinality 32 at roughly the same complexity.]
Summary/Future Work
Summary

Using structural priors:
Models are less computationally complex.
They also use fewer parameters.
They significantly help generalization in deeper networks.
They significantly help generalization with larger datasets.
They are amenable to model parallelism (as with the original AlexNet), for better parallelism across GPUs/nodes.
Future Work: Research

We don't always have enough knowledge of the domain to propose good structural priors. Our results (and follow-up work) do show, however, that current methods of training/regularization seem to have limited effectiveness in letting DNNs learn such priors themselves. How can we otherwise learn structural priors?
Future Work: Applications

Both of these methods apply to most deep learning applications:
Smaller model state: easier storage and synchronization.
Faster training and testing of models behind ML cloud services.
Embedded devices/tensor processing units.

And more specific to each method:
Low-rank filters: even larger impact for volumetric imagery (Microsoft Radiomics).
Root Modules: model parallelization (Azure/Amazon cloud).