Restricted Connectivity in Deep Neural Networks

Yani Ioannou, University of Cambridge. April 7, 2017

Overview: Introduction; Research Overview; PhD Research: Motivation, Structural Priors (Spatial and Filter-wise); Summary/Future Work; Collaborative Research.

Introduction

About Me. Ph.D. student in the Department of Engineering at the University of Cambridge. Supervised by Professor Roberto Cipolla, head of the Computer Vision and Robotics group in the Machine Intelligence Lab, and Dr. Antonio Criminisi, a principal researcher at Microsoft Research.

Research Overview

First Author Publications during PhD: Decision Forests, Convolutional Networks and the Models in-between. Y. Ioannou, D. Robertson, D. Zikic, P. Kontschieder, J. Shotton, M. Brown, A. Criminisi. MSR Technical Report 2015. Training CNNs with Low-Rank Filters for Efficient Image Classification. Y. Ioannou, D. Robertson, J. Shotton, R. Cipolla, A. Criminisi. ICLR 2016. Deep Roots: Improving CNN Efficiency with Hierarchical Filter Groups. Y. Ioannou, D. Robertson, R. Cipolla, A. Criminisi. CVPR 2017.

Motivation

ILSVRC: Imagenet Large-Scale Visual Recognition Challenge¹. 1.2 million training images, 1000 classes. 50,000-image validation/test set. In 2012 Alex Krizhevsky won the challenge with a CNN². AlexNet's top-5 error of 15.3% was far better than the second best, 26.2%. The state of the art now beats human error (~5%). ¹ Russakovsky et al., ImageNet Large Scale Visual Recognition Challenge. ² Krizhevsky, Sutskever, and Hinton, ImageNet Classification with Deep Convolutional Neural Networks.

AlexNet Complexity³: 60 million parameters, 724 million FLOPS (per sample). Imagenet has 1.28 million training samples, each of dimensions 227×227×3, i.e. roughly 200 billion pixels. ³ Krizhevsky, Sutskever, and Hinton, ImageNet Classification with Deep Convolutional Neural Networks.

AlexNet Complexity - FLOPS. [Bar chart: FLOPS per layer (×10⁸) for conv1-conv5 and fc6-fc8.]

AlexNet Complexity - Parameters. [Bar chart: parameters per layer (×10⁷) for conv1-conv5 and fc6-fc8; 96% of the parameters are in the fully connected layers.]
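That 96% figure is easy to verify. A quick back-of-the-envelope check in Python, using the standard AlexNet layer shapes from the paper (grouped conv2/4/5, biases omitted):

```python
# Parameter counts from the standard AlexNet layer shapes (biases omitted).
fc6 = 6 * 6 * 256 * 4096   # 37,748,736
fc7 = 4096 * 4096          # 16,777,216
fc8 = 4096 * 1000          #  4,096,000
fc = fc6 + fc7 + fc8       # ~58.6 million

conv = (11 * 11 * 3 * 96     # conv1
        + 5 * 5 * 48 * 256   # conv2 (2 groups: each filter sees 48 channels)
        + 3 * 3 * 256 * 384  # conv3
        + 3 * 3 * 192 * 384  # conv4 (2 groups)
        + 3 * 3 * 192 * 256) # conv5 (2 groups) -> ~2.3 million total

print(fc / (fc + conv))      # ~0.96: 96% of parameters are fully connected
```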

Hinton Philosophy: "If you give me any sized dataset, what you ought to do if you want good generalization is get yourself into the small data regime. That is, however big your dataset, you ought to make a much bigger model so that that's small data. So, I think what the brain is doing is... regularizing the hell out of it, and that's a better thing to do than what statisticians used to think you should do, which is have a small model." Geoffrey Hinton⁴, Cambridge, June 2015. ⁴ http://sms.cam.ac.uk/media/207973

[Scatter plot: top-5 error (6-18%) vs. log₁₀ multiply-accumulate operations for alexnet, vgg-11/13/16 (C)/16 (D)/19, googlenet (1, 10 and 144 crops), resnet-50, msra-a/b/c and pre-resnet-200; markers distinguish crop & mirror augmentation from extra augmentation.]

[Scatter plot: top-5 error vs. log₁₀ number of parameters for the same networks.]

Generalization and Num. Parameters. [Figure: polynomial curve fitting. (a) 3rd-order polynomial fit to 3 points; (b) 20th-order polynomial fit to 3 points; (c) 3rd-order polynomial fit to 10 points; (d) 20th-order polynomial fit to 10 points. Each panel shows the samples, the true function, and the fit.]

When fitting a curve, we often have little idea of what order polynomial would best fit the data! Weak Prior - Regularization. Knowing only that our model is over-parameterized is a relatively weak prior; however, we can encode it into the fit with a regularization penalty, for example on the L2 norm of the model weights. This restricts the model to effectively using only a small number of its parameters. Strong Priors - Structural. With more prior information on the task, e.g. the convexity of the data, we may infer that a certain order of polynomial is more appropriate, and restrict learning to particular orders.
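As a concrete illustration of the weak prior, here is a minimal sketch (my own example, not from the talk) of fitting an over-parameterized 20th-order polynomial with an L2 (ridge) penalty:

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.uniform(0, 1, 10)
y = np.sin(2 * np.pi * x) + 0.1 * rng.standard_normal(10)

# Over-parameterized model: 21 coefficients for only 10 samples.
X = np.vander(x, N=21, increasing=True)

# Ridge solution: w = (X^T X + lam*I)^(-1) X^T y. The L2 penalty keeps most
# coefficients near zero, so the fit behaves like a much lower-order model.
lam = 1e-3
w = np.linalg.solve(X.T @ X + lam * np.eye(X.shape[1]), X.T @ y)
```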

The Problem. Creating a massively over-parameterized network has consequences. Training time: 2-3 weeks of training on 8 GPUs (ResNet 200)! Forward pass (ResNet 50): 12 ms GPU, 621 ms CPU. Forward pass (GoogLeNet): 4.4 ms GPU, 300 ms CPU. But what about the practicalities of using deep learning on embedded devices, in realtime applications, or backed by distributed/cloud computing?

Compression/Representation. Isn't that already being addressed? Approximation (compression/pruning) of neural networks, and reduced representations (8-bit floats, binary!), allow us to trade off compute vs. accuracy, and such methods will still apply to any network. Instead, let's try to address the fundamental problem of over-parameterization.

Alternative Philosophy. Deep networks need many more parameters than data points because they aren't just learning to model the data, but also learning what not to learn. Idea: why don't we help the network, through structural priors, not to learn things it doesn't need to? [Image: The Atlas Slave (Accademia, Florence)]

Structural Priors

Typical Convolutional Layer. [Figure: c₂ filters, each of shape h×w×c, are convolved with an H×W×c image/feature map to produce an H₂×W₂×c₂ output feature map.]
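For reference, a minimal PyTorch sketch of such a dense layer and its parameter count (the framework and layer sizes are my choices for illustration):

```python
import torch.nn as nn

# A standard "dense" convolutional layer: each of the c2 = 128 filters
# sees all c = 64 input channels (h = w = 3).
conv = nn.Conv2d(in_channels=64, out_channels=128, kernel_size=3, padding=1)

params = sum(p.numel() for p in conv.parameters())
print(params)  # 128*64*3*3 weights + 128 biases = 73,856
```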

Sparsity of Convolution. [Figure: connection structure and connection weights between a zero-padded 3×4-pixel input image and a 4×3 output feature map for (a) a fully connected layer, (b) a 3×3 square convolution, (c) a 3×1 vertical convolution, and (d) a 1×3 horizontal convolution. Convolution corresponds to a sparse, banded weight matrix with shared weights.]
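To make the figure concrete, here is a small sketch (my own construction) writing a 1-D convolution as the banded, weight-shared matrix the diagrams depict:

```python
import numpy as np

k = np.array([1.0, -2.0, 1.0])   # a 1x3 kernel
n = 6                            # input length, zero padding of 1

# Build the banded weight matrix: each output connects to 3 inputs, and the
# same kernel values repeat along the diagonals (weight sharing).
W = np.zeros((n, n))
for i in range(n):
    for j, kv in enumerate(k):
        col = i + j - 1
        if 0 <= col < n:
            W[i, col] = kv

x = np.arange(n, dtype=float)
# The matrix product equals a standard (cross-correlation style) convolution.
assert np.allclose(W @ x, np.convolve(x, k[::-1], mode="same"))
```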

Why are CNNs uniformly structured? "The marvelous powers of the brain emerge not from any single, uniformly structured connectionist network but from highly evolved arrangements of smaller, specialized networks which are interconnected in very specific ways." Marvin Minsky, Perceptrons (1988 edition). Deep networks are largely monolithic (uniformly connected), with few exceptions. Why don't we try to structure our networks closer to the specialized components required for learning images?

Spatial Structural Priors. Published as a conference paper at ICLR 2016: Training CNNs with Low-Rank Filters for Efficient Image Classification. Yani Ioannou¹, Duncan Robertson², Jamie Shotton², Roberto Cipolla¹ & Antonio Criminisi². ¹University of Cambridge, ²Microsoft Research. Abstract: We propose a new method for creating computationally efficient convolutional neural networks (CNNs) by using low-rank representations of convolutional filters. Rather than approximating filters in previously-trained networks with more efficient versions, we learn a set of small basis filters from scratch; during training, the network learns to combine these basis filters into more complex filters that...

Proposed Method: learning a basis for filters. [Figure: a layer of m basis filters of shape h×w×c produces an H×W×m feature map, followed by d filters of shape 1×1×m that combine the basis responses into an H×W×d output.] A learned basis of vertical/horizontal rectangular filters and square filters! The shape of the learned composite filters is a full w×h×c, but what can be effectively learned is limited by the number and complexity of the basis components.
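A minimal PyTorch sketch of the idea as I read this slide (the layer sizes and exact composition are illustrative, not the paper's exact architecture): a basis of cheap vertical and horizontal filters, learned from scratch, combined by a 1×1 convolution:

```python
import torch
import torch.nn as nn

class LowRankConv(nn.Module):
    """Replace a full 3x3 convolution with a learned basis of 3x1 and 1x3
    filters whose responses a 1x1 convolution combines into c_out filters."""

    def __init__(self, c_in: int, c_out: int, basis: int = 32):
        super().__init__()
        self.vertical = nn.Conv2d(c_in, basis, kernel_size=(3, 1), padding=(1, 0))
        self.horizontal = nn.Conv2d(c_in, basis, kernel_size=(1, 3), padding=(0, 1))
        # 1x1 conv learns to combine basis responses into composite filters.
        self.combine = nn.Conv2d(2 * basis, c_out, kernel_size=1)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        y = torch.cat([self.vertical(x), self.horizontal(x)], dim=1)
        return self.combine(y)
```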

VGG/Imagenet Results. [Scatter plots: top-5 error (10-14%) vs. multiply-accumulate operations (×10⁹) and vs. parameters (×10⁸) for the baseline vgg-11 and our variants (gmp, gmp-sf, gmp-lr, gmp-lr-2x, gmp-lr-join, gmp-lr-join-wfull, gmp-lr-lde).]

Imagenet Results. VGG-11 (low-rank): 24% smaller, 4% fewer FLOPS. VGG-11 (low-rank/full-rank mix): 6% fewer FLOPS with 1% lower error on ILSVRC val, but 6% larger. GoogLeNet: 4% smaller, 26% fewer FLOPS. Or better results if you tune GoogLeNet more...

[Slide: first page of Szegedy, Vanhoucke, Ioffe, Shlens and Wojna, "Rethinking the Inception Architecture for Computer Vision" (arXiv:1512.00567v3, Dec 2015), showing Inception modules with n×n convolutions factorized into 1×n and n×1 (n = 7 for the 17×17 grid) and modules with expanded filter bank outputs. Relevant here: GoogLeNet employed only 5 million parameters, a 12× reduction with respect to its predecessor AlexNet's 60 million, while VGGNet employed about 3× more parameters than AlexNet.]

Filter-wise Structural Priors. Deep Roots: Improving CNN Efficiency with Hierarchical Filter Groups. Yani Ioannou¹, Duncan Robertson², Roberto Cipolla¹, Antonio Criminisi². ¹University of Cambridge, ²Microsoft Research. Abstract: We propose a new method for creating computationally efficient and compact convolutional neural networks... [regularization] can be achieved by weight decay or dropout [5]. Furthermore, a carefully designed sparse network connection structure can also have a regularizing effect. Convolutional Neural Networks (CNNs) [6, 7] embody this idea, using a sparse...

Typical Convolutional Layer. [Figure, repeated: c₂ filters of shape h×w×c convolved with an H×W×c image/feature map produce an H₂×W₂×c₂ output feature map.]

AlexNet Filter Grouping. Uses 2 filter groups in most of the convolutional layers, which allowed training across two GPUs (model parallelism).

AlexNet Filter Grouping. [Plot: top-5 validation error (18-20%) vs. model parameters (# floats) for AlexNet trained with no groups, 2 groups and 4 groups; more filter groups give fewer parameters at similar or better error.]

Grouped Convolutional Layer. [Figure: c₂ filters of shape h×w×(c/g), arranged in g filter groups; each filter is convolved with only c/g of the input channels, producing an H₂×W₂×c₂ output feature map.]
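The same PyTorch layer as before, now with g = 4 filter groups (again an illustrative sketch, not a prescribed configuration):

```python
import torch.nn as nn

# With groups=4, each of the 128 filters sees only 64/4 = 16 input channels.
grouped = nn.Conv2d(64, 128, kernel_size=3, padding=1, groups=4)

params = sum(p.numel() for p in grouped.parameters())
print(params)  # 128*16*3*3 weights + 128 biases = 18,560 (~4x fewer weights)
```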

Root Modules. [Figure: Root-2 module: d filters in g = 2 filter groups, of shape h×w×(c/2). Root-4 module: d filters in g = 4 filter groups, of shape h×w×(c/4). In both cases the grouped spatial filters, producing a c₂-channel map, are followed by c₃ filters of shape 1×1×c₂ with full connectivity across channels.]
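A minimal sketch of a root-g module as I read the figure (the ReLUs and layer sizes are my assumptions):

```python
import torch.nn as nn

def root_module(c_in: int, c_mid: int, c_out: int, g: int) -> nn.Sequential:
    """Grouped spatial convolution followed by a full (ungrouped) 1x1
    convolution that mixes information across all channel groups."""
    return nn.Sequential(
        nn.Conv2d(c_in, c_mid, kernel_size=3, padding=1, groups=g),
        nn.ReLU(inplace=True),
        nn.Conv2d(c_mid, c_out, kernel_size=1),  # full channel mixing
        nn.ReLU(inplace=True),
    )

root2 = root_module(96, 192, 192, g=2)  # hypothetical NiN-like sizes
```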

Network-in-Network. [Diagram: H×W×3 input → conv1a, conv1b, conv1c → conv2a, conv2b, conv2c → conv3a, conv3b, conv3c → pool → output.]

NiN Root Architectures. [Diagram: the NiN architecture with a root-4 module and a root-2 module.]

NiN Root Architectures. Network-in-Network: filter groups in each convolutional layer.

Model   | conv1 a b c | conv2 a b c | conv3 a b c
Orig.   | 1  1  1     | 1  1  1     | 1  1  1
root-2  | 1  1  1     | 2  1  1     | 1  1  1
root-4  | 1  1  1     | 4  1  1     | 2  1  1
root-8  | 1  1  1     | 8  1  1     | 4  1  1
root-16 | 1  1  1     | 16 1  1     | 8  1  1

(conv1a, conv2a and conv3a are the 5×5, 5×5 and 3×3 spatial layers; the b and c layers are 1×1.)

CIFAR-10: Model Parameters vs. Error. [Plot: error (7.8-8.8%) vs. model parameters (×10⁶) for the standard NiN and root, tree and column variants with 2-16 filter groups. NiN: mean and standard deviation (error bars) are shown over 5 different random initializations.]

CIFAR-10: FLOPS (Multiply-Add) vs. Error. [Plot: error (7.8-8.8%) vs. FLOPS (×10⁸ multiply-adds) for the same NiN, root, tree and column variants. NiN: mean and standard deviation (error bars) are shown over 5 different random initializations.]

Inter-layer Filter Covariance. [Figure: 192×192 covariance between the filters of layers conv2c and conv3a for (a) Standard: g = 1, (b) Root-4: g = 2, (c) Root-32: g = 16. The block-diagonal sparsity learned by a root module is visible in the correlation of filters on layers conv3a and conv2c in the NiN network.]

ResNet 50 All Filters. [Plots: top-5 error (7-9%) vs. model parameters (×10⁷ floats) and vs. FLOPS (×10⁹ multiply-adds) for root variants with 2-64 filter groups.]

[Plots: top-5 error (7-9%) vs. GPU forward time (ms) and CPU forward time (420-620 ms) for the same ResNet 50 root variants.]

Imagenet Results. Networks with root modules have similar or higher accuracy than the baseline architectures with much less computation. ResNet 50⁵: 40% smaller, 45% fewer FLOPS. ResNet 200⁶: 44% smaller, 25% fewer FLOPS. GoogLeNet: 7% smaller, 44% fewer FLOPS. But when you also increase the number of filters... ⁵ Caffe re-implementation. ⁶ Based on Facebook Torch model.

[Slide: first page of Xie, Girshick, Dollár, Tu and He, "Aggregated Residual Transformations for Deep Neural Networks" (arXiv:1611.05431v1, Nov 2016), introducing ResNeXt. Figure 1 of the paper: left, a block of ResNet; right, a block of ResNeXt with cardinality = 32 and roughly the same complexity; a layer is shown as (# in channels, filter size, # out channels). From the abstract: "Moreover, increasing cardinality is more effective than going deeper or wider when we increase the capacity."]

Summary/Future Work

Summary. Using structural priors: models are less computationally complex; they also use fewer parameters; they significantly help generalization in deeper networks and with larger datasets; and they are amenable to model parallelization (as with the original AlexNet), for better parallelism across GPUs/nodes.

Future Work: Research. We don't always have enough knowledge of the domain to propose good structural priors. Our results (and follow-up work) do show, however, that current methods of training/regularization seem to have limited effectiveness in letting DNNs learn such priors themselves. How can we otherwise learn structural priors?

Future Work: Applications. Both of these methods apply to most deep learning applications: smaller model state means easier storage and synchronization; faster training and testing of models behind ML cloud services; embedded devices/tensor processing units. And more specific to each method: low-rank filters have an even larger impact for volumetric imagery (Microsoft Radiomics); root modules suit model parallelization (Azure/Amazon Cloud).