Stochastic Function Norm Regularization of DNNs

Amal Rannen Triki
Dept. of Computational Science and Engineering, Yonsei University, Seoul, South Korea
amal.rannen@yonsei.ac.kr

Matthew B. Blaschko
Center for Processing Speech and Images, Departement Elektrotechniek, KU Leuven, Belgium
matthew.blaschko@esat.kuleuven.be

(A. R. T. is also affiliated with KU Leuven.)

Abstract

Deep neural networks (DNNs) have had an enormous impact on image analysis. State-of-the-art training methods, based on weight decay and DropOut, result in impressive performance when a very large training set is available. However, they tend to overfit badly on small data sets. Indeed, the available regularization methods deal with the complexity of the network function only indirectly. In this paper, we study the feasibility of directly using $L_2$ function norms for regularization. Two methods to integrate this new regularization into a stochastic backpropagation framework are proposed. Moreover, the convergence of these new algorithms is studied. We finally show that they outperform the state-of-the-art methods in the low sample regime on benchmark datasets (MNIST and CIFAR-10). The obtained results demonstrate a very clear improvement, especially in the context of small sample regimes with data lying on a low-dimensional manifold. Source code of the method is made available at https://github.com/amalrt/DNN_Reg.

1 Introduction

Deep Neural Networks (DNNs) have been shown to be a powerful tool for learning from large datasets in many domains. However, they require a large number of training samples in order to have reasonable generalization power. This fact tends to limit their application to the types of images for which it is easy to construct a large labeled database, such as natural images, where collection can be done from image sharing websites and labels obtained at low cost via crowdsourcing. When only a few labeled samples are available, such as in medical imaging, DNNs tend to overfit the data and may not be competitive with methods for which tractable, stronger regularization is available, e.g. kernel methods.

Generally, in learning theory, one wants to minimize the statistical risk $R(W) = \int \ell(f_W(x), y)\, dP(x, y)$, where $\ell$ is a loss function and $P(x, y)$ is the joint distribution of inputs and outputs. However, as $P$ is generally unknown, only the empirical risk $R_N(W) = \frac{1}{N} \sum_{i=1}^{N} \ell(f_W(x_i), y_i)$ is accessible. This approximation is accurate when a large database is used, but when only a few samples are available, an appropriate regularization is required. Training a DNN with regularization amounts to solving $\arg\min_W R_N(W) + \lambda \Omega(f_W)$.

Much research has approached the idea of regularized DNN training. However, until now, this regularization tends to be weak compared, for example, to regularization in kernel methods [11]. Two categories of regularization can be distinguished in the existing literature: (i) regularizing by controlling the magnitude of the weights $W$ ($\ell_1$ and $\ell_2$ regularization [4, 8]); and (ii) regularizing by controlling the network architecture (DropOut [6] and DropConnect [13]). In order to make the regularization stronger, we study the feasibility of using $L_2$ function norms directly. The canonical $L_2$ function norm is

$\Omega(f_W) = \|f_W\|_2^2 = \int \langle f_W(x), f_W(x) \rangle \, dx.$    (1)
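To make the objective concrete, the following minimal sketch shows one stochastic gradient step on $R_N(W) + \lambda \hat{\Omega}(f_W)$, where $\hat{\Omega}$ is a Monte Carlo estimate of the squared $L_2$ function norm over a batch of regularization points. It is written in PyTorch rather than the MatConvNet used for the released code, and the names (net, optimizer, x, y, u, lam), the cross-entropy loss, and the use of the raw network outputs for the norm are illustrative assumptions, not details taken from the paper.

import torch
import torch.nn.functional as F


def regularized_step(net: torch.nn.Module,
                     optimizer: torch.optim.Optimizer,
                     x, y, u, lam: float) -> float:
    """One SGD step on R_N(W) + lam * Omega_hat(f_W).

    Omega_hat is a Monte Carlo estimate of the squared L2 function norm:
    the mean of <f_W(u_i), f_W(u_i)> over a batch u of regularization points.
    """
    optimizer.zero_grad()
    # Empirical risk R_N(W): cross-entropy on the labelled batch (x, y).
    risk = F.cross_entropy(net(x), y)
    # Function-norm penalty: ||f_W(u_i)||^2 averaged over the batch u.
    penalty = net(u).pow(2).sum(dim=1).mean()
    loss = risk + lam * penalty
    loss.backward()
    optimizer.step()
    return loss.item()

Depending on the estimator chosen in Section 2, the batch u is either drawn from held-out (possibly unlabelled) data or produced by a sampler whose density is proportional to $\|f_W(x)\|_2^2$.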

As the exact value of this norm, and more importantly of its gradient, is not accessible, we study two closely related weighted norms. In both cases, we reweight the norm using a probability distribution (we suppose for simplicity that all the distributions admit density functions).

2 Methods and algorithms: Function norm regularization

Data distribution: The first idea is to consider the function norm with respect to the data distribution:

$\Omega(f_W) = \int \langle f_W(x), f_W(x) \rangle \, dP(x),$    (2)

where $P$ is the marginal data distribution $P(x) := \int P(x, y)\, dy$. One can easily see that this is a valid function norm [3], and furthermore it is just a weighted canonical function norm in which the regularization has a stronger effect on the data that are more likely to be observed. For this first method, our regularization term, written another way, is $\mathbb{E}_{x \sim P(x)}[\langle f_W(x), f_W(x) \rangle]$. For a set $\{x_1, \dots, x_m\}$ drawn i.i.d. from $P(x)$,

$\frac{1}{m} \sum_{i=1}^{m} \langle f_W(x_i), f_W(x_i) \rangle$    (3)

is an unbiased estimate. For samples outside the training set, this is a U-statistic of order 1, and has an asymptotic Gaussian distribution for which finite sample estimates converge quickly [7]. Only the images of these samples under $f_W$ are needed for regularization, and not their labels, which is particularly interesting for applications such as medical imaging, where labeling the samples requires expert intervention and is usually highly expensive.

Slice sampling: The second idea is to use a probability density proportional to the integrand for the weighted norm. The regularization term is then

$\Omega(f_W) = \int \langle f_W(x), f_W(x) \rangle \, d\tilde{P}(x),$    (4)

where $\tilde{P}(x) \propto \|f_W(x)\|_2^2$. This idea is inspired by stochastic integration methods. A simple way to apply these methods is to draw samples in the domain of the function and to approximate the integral by the average value of the function at these points, in a quasi-Monte Carlo fashion [1]. However, as we are working in a very high dimension, this type of numerical integration is not appropriate: it suffers from the curse of dimensionality and results in a sample concentration independent of the characteristics of the function we want to integrate, i.e. $\|f_W(x)\|_2^2$. As this function is real and non-negative, we can instead sample from a distribution whose density is proportional to the function itself, using Markov chain methods such as Gibbs [2] and Metropolis-Hastings [5] sampling. As Gibbs sampling requires the approximation of the marginal distributions (which are not easily accessible in our case and, most importantly, non-standard), and Metropolis-Hastings sampling depends on the choice of the proposal distribution, we generate our samples using slice sampling [9]. This method draws samples uniformly from the hypervolume under the n-dimensional graph of a function f by iterating the following steps: (i) choose an initial sample $x_0$; (ii) draw a uniformly distributed height $y \sim \mathcal{U}(0, f(x_0))$; (iii) find a hyperrectangle around $x_0$ that contains the slice $S = \{x : f(x) > y\}$; (iv) draw a new sample $x_1$ uniformly from this hyperrectangle; (v) go back to the second step and iterate until the desired number of samples has been collected. In [9], several methods to define the hyperrectangle in the third step are proposed. Here we use the method called "stepping-out and shrinking-in". It consists of finding a neighborhood of $x_0$, enlarging it progressively until its boundary lies outside the slice, and drawing samples from this enlarged neighborhood.
If a drawn sample falls outside the slice, it is used to shrink the neighborhood. The samples drawn by this procedure have a concentration that depends on the variation of $\|f_W(x)\|_2^2$: they concentrate in the regions where its values are high, which limits the effect of the curse of dimensionality. Moreover, this means that the regularization has a stronger effect on the regions where the samples are more concentrated, i.e. the regions where the function values are high, resulting in a less complex function.
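As an illustration of the sampling procedure described above, here is a minimal NumPy sketch of a hyperrectangle slice sampler that uses only the shrinkage step (the stepping-out expansion, burn-in, and the implementation details of the paper's MatConvNet code are omitted; the function name, the width parameter w, and the interface are assumptions made for this sketch). The target f stands for the map $x \mapsto \|f_W(x)\|_2^2$, evaluated by a forward pass of the current network.

import numpy as np


def slice_sample(f, x0, n_samples, w=1.0, rng=None):
    """Minimal hyperrectangle slice sampler with shrinkage (after Neal, 2003).

    f  : non-negative function known up to a constant; here it stands in for
         x -> ||f_W(x)||_2^2.
    x0 : starting point, shape (d,).
    w  : initial edge length of the hyperrectangle in every dimension.
    """
    rng = np.random.default_rng() if rng is None else rng
    x = np.asarray(x0, dtype=float)
    samples = []
    for _ in range(n_samples):
        # (ii) Auxiliary height: the slice is S = {x' : f(x') > y}.
        y = rng.uniform(0.0, f(x))
        # (iii) Randomly position a hyperrectangle of edge w that contains x.
        lower = x - w * rng.uniform(size=x.size)
        upper = lower + w
        # (iv) Draw uniformly from the hyperrectangle; shrink it toward x
        # whenever the proposal falls outside the slice.
        while True:
            x_new = rng.uniform(lower, upper)
            if f(x_new) > y:
                break
            lo = x_new < x
            lower[lo] = x_new[lo]
            upper[~lo] = x_new[~lo]
        x = x_new
        samples.append(x.copy())
    return np.stack(samples)

A batch of such samples can then serve as the regularization points u in the training-step sketch from the introduction.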

This method does not require new images, which is interesting when we have access to only a small number of samples (labeled or unlabeled). An extended version of this paper is available on arXiv, in which we prove the convergence of stochastic gradient descent when applied to the regularized objectives and provide an algorithm that integrates the regularization into a classical training procedure [10].

3 Experiments and results

To test the proposed methods, two series of experiments on two different databases are considered: MNIST and CIFAR-10. The experiments are conducted using MatConvNet [12]. For both databases, the LeNet convolutional neural network is used. For each, we compare our regularization methods to (i) training with only weight decay and (ii) training with DropOut (with different rates) + weight decay.

MNIST: We report the results of the tests conducted with our algorithms on MNIST (60,000 samples for training and 10,000 samples for test). These experiments were conducted using only a 2.4 GHz Intel Core 2 Duo processor. In a first set of experiments, we test our first idea and the two reference methods listed above using a decreasing number of MNIST samples in the training set. The remaining training samples are used for regularization with the data distribution. For slice sampling, we generate 6,000 samples in the case of 600 training samples, and 4,000 samples in the case of 100 training samples. Figure 1 shows the evolution of the test top-1 error during training (the x-axis represents the number of epochs) for weight decay, weight decay + DropOut with rate 50%, weight decay + DropOut with rate 25%, and approximate function norm regularization using the data distribution and slice sampling.

Figure 1: MNIST - Test top-1 error during training for weight decay, weight decay + DropOut (25% and 50%), and function norm regularization (data distribution and slice sampling) - Left: 12,000 samples - Center: 600 samples - Right: 100 samples

In Figure 3 (left), we display the evolution of the test top-1 error as a function of the training time, using 100 samples for training. The notation (1:N) indicates the ratio (training samples : regularization samples). Table 1 shows the accuracy obtained on the test set at the end of training. As a reference, note that the best accuracy obtained when all the training data is used is 99.06%.

| Training | Regularization | Weight decay | WD + DO(25%) | WD + DO(50%) | Function norm (data dist.) | Function norm (slice sampling) |
| 12,000 | 48,000 | 98.26 | 97.65 | 95.82 | 97.99 | 97.99 (1:4) |
| 1,200 | 58,800 | 92.60 | 89.93 | 85.52 | 95.5 | 92.70 (1:4) |
| 600 | 59,400 | 88 | 83.2 | 73.97 | 92.84 | 97 (1:10) |
| 100 | 59,900 | 37 | 8.75 | 3.36 | 83 | 79.74 (1:40) |

Table 1: MNIST - Correct classification rate (%) for different methods

CIFAR: In this paragraph, we report the results of our algorithms when applied to CIFAR-10. The database is composed of 50,000 samples for training and 10,000 samples for test. These experiments were conducted on a machine equipped with a 4-core CPU (4.00 GHz) and a GeForce GTX 750 Ti GPU. In this set of experiments, we test our two algorithms, weight decay, and weight decay + DropOut (10% and 50%) with a decreasing number of CIFAR-10 samples in the training set.

The remaining training samples are used for regularization with the data distribution. For the slice sampling method, we generate 5,000 samples in the case of 500 training samples, and 4,000 samples in the case of 100 training samples. Figure 2 shows the top-1 error on the test set as a function of the number of epochs using, respectively, 10,000, 500, and 100 training samples. Figure 3 (right) shows the evolution of the test top-1 error when using 500 samples for training. Table 2 shows the accuracy obtained on the test set at the end of training. As a reference, note that the best accuracy obtained when all the training data is used is 88%.

Figure 2: CIFAR-10 - Test top-1 error vs. number of epochs - Left: 10,000 samples - Center: 500 samples - Right: 100 samples

| Training | Regularization | Weight decay | WD + DO(10%) | WD + DO(50%) | Function norm (data dist.) | Function norm (slice sampling) |
| 10,000 | 40,000 | 7.53 | 69.35 | 49 | 7.90 | 72.02 (1:4) |
| 5,000 | 45,000 | 67.27 | 64.59 | 28.08 | 68.26 | 68.2 (1:4) |
| 1,000 | 49,000 | 52.57 | 47.99 | 2.03 | 55 | 54.38 (1:10) |
| 500 | 49,500 | 43.64 | 37.6 | .57 | 39.96 | 48.47 (1:10) |

Table 2: CIFAR-10 - Correct classification rate (%) for different methods

4 Conclusion

In this work, two methods to introduce function norm regularization into neural network training have been developed, one of which does not require any additional training data. We have demonstrated both theoretically and empirically that their integration into stochastic backpropagation converges. Their efficiency has been shown on the MNIST and CIFAR-10 data sets. Our experiments suggest that the developed methods perform particularly well on small databases that lie in low-dimensional manifolds. For more complicated data with a small number of samples, our methods still outperform the state-of-the-art regularization methods available in the literature. Moreover, we have shown that both sampling methods yield similar results, which suggests that they are equally good approximations of the function norm.

Figure 3: Sampling method comparison - Evolution of the test top-1 error over training time - Left: MNIST, 100 samples, slice sampling ratio (1:40) - Right: CIFAR-10, 500 samples, slice sampling ratio (1:10)

References

[1] R. E. Caflisch. Monte Carlo and quasi-Monte Carlo methods. Acta Numerica, 7:1-49, 1998.
[2] G. Casella and E. I. George. Explaining the Gibbs sampler. The American Statistician, 46(3):167-174, 1992.
[3] J. Galambos. Advanced Probability Theory. CRC Press, 1995.
[4] I. Goodfellow, Y. Bengio, and A. Courville. Deep Learning. Book in preparation for MIT Press, 2016.
[5] W. K. Hastings. Monte Carlo sampling methods using Markov chains and their applications. Biometrika, 57(1):97-109, 1970.
[6] G. E. Hinton, N. Srivastava, A. Krizhevsky, I. Sutskever, and R. R. Salakhutdinov. Improving neural networks by preventing co-adaptation of feature detectors. arXiv:1207.0580, 2012.
[7] A. J. Lee. U-Statistics: Theory and Practice. CRC Press, 1990.
[8] J. Moody, S. Hanson, A. Krogh, and J. A. Hertz. A simple weight decay can improve generalization. Advances in Neural Information Processing Systems, 4:950-957, 1995.
[9] R. M. Neal. Slice sampling. The Annals of Statistics, 31(3):705-767, 2003.
[10] A. Rannen Triki and M. B. Blaschko. Stochastic function norm regularization of deep networks. CoRR, abs/1605.09085, 2016.
[11] B. Schölkopf and A. J. Smola. Learning with Kernels. MIT Press, 2001.
[12] A. Vedaldi and K. Lenc. MatConvNet: Convolutional neural networks for MATLAB. In Proceedings of the ACM International Conference on Multimedia, 2015.
[13] L. Wan, M. Zeiler, S. Zhang, Y. L. Cun, and R. Fergus. Regularization of neural networks using DropConnect. In Proceedings of the 30th International Conference on Machine Learning, pages 1058-1066, 2013.