Deep Learning With Noise


Yixin Luo
Computer Science Department
Carnegie Mellon University
yixinluo@cs.cmu.edu

Fan Yang
Department of Mathematical Sciences
Carnegie Mellon University
fanyang1@andrew.cmu.edu

Abstract

Recent works have shown that, by allowing some inaccuracy when training deep neural networks, not only the training performance but also the accuracy of the model can be improved. Taking those previous works as examples and guidance, our work studies the impact of introducing different types of noise into different components of deep neural network training. We experiment with noise types including Binomial, Gaussian, Rayleigh, and Gamma noise, and we study the effects of noise in different parts of the model, including neurons and network links in the input, hidden, and output layers, as well as matrix multiplication and gradient computation in the backward propagation process.

1 Introduction

Large-scale deep neural network models have become increasingly popular for solving hard classification problems and have demonstrated significant improvements in accuracy. Traditional statistical machine learning methods require a human domain expert who can construct a good set of input features. Deep learning models, in contrast, do not require a hand-crafted feature set, and hence are more powerful and suitable for hard AI tasks such as speech recognition or visual object classification. Without any hand crafting of the raw input data, deep neural network models learn a hierarchy of features by themselves in the first several layers of the network. Then, in the deepest layer of the model, a set of features is selected and weighted for each output to generate a prediction. By avoiding the inevitable human error in feature selection, deep learning often outperforms traditional approaches on these hard classification problems in terms of accuracy.

In order to train a more complicated model that includes this feature selection capability, a deep neural network is typically trained with more data than a traditional machine learning method. Due to the scale of the deep neural network (with multiple layers of neurons) and the scale of the input data set, the performance of these models, in addition to their accuracy, has become a significant factor in such implementations. Recent work [1, 2] on large-scale machine learning systems proposes to significantly improve performance by relaxing consistency when training neural network models (e.g., weights are not updated in every iteration). One interesting observation in these papers is that such relaxation surprisingly improves the accuracy of the deep learning model on the test data. However, the effect of noise on deep learning models has never been systematically studied, nor has the underlying reason for the improved accuracy been explained.

One hypothesis for this observation is that relaxing consistency introduces stochastic noise into the training process [1]. This implicitly mitigates over-fitting of the model and helps the model generalize better to test data. Another hypothesis is that the introduced noise eliminates the memorization effect of a deep neural network, and hence allows the model to capture general patterns of the training data that also apply well to the test data.

Our work, taking previous works as examples and guidance, tries to systematically study the effect of introducing different kinds of noise into different components of different types of deep neural networks. We observe that a reasonable amount, and a reasonable magnitude, of noise introduced into a deep learning model can improve both the accuracy and the convergence rate of the model. We hope that our work can provide insights for future approximate deep learning methods, and inspire and motivate more work to take advantage of the beneficial effects of noise.

2 Background and Related Work

In this section, we first introduce several common neural network models: the Logistic Regression (single neuron) model, the Multi-Layer Perceptron (MLP) model, and the Convolutional Neural Network (LeNet) model. We then summarize and compare our work to several related works that introduce noise into those models to improve accuracy.

2.1 Neural Network Models Explained

The simplest form of a neural network, which is also the primary component of any neural network model, is a single neuron. Figure 1a illustrates a single-neuron neural network, also known as the Logistic Regression (LR) model. The neuron shown in the figure consumes a vector of numbers (X) as input and produces a single number as its output, which typically represents the prediction made by the model. The neuron stores a vector of weights (W), with each weight representing how positively or negatively each input affects the output. The output of the neuron and the update to the neuron's weights can be computed as follows:

$$\mathrm{Output} = \tanh(W \cdot X)$$

$$W_{new} = W_{old} - \mathrm{learning\ rate} \cdot \nabla_{W_{old}} \mathrm{cost}(W)$$

A Multi-Layer Perceptron (MLP) model is essentially multiple layers of neurons connected by a network. Figure 1b illustrates a simple example of such a model, composed of three layers. The first layer is the input layer, which provides the raw inputs to the next layer. The second layer is the hidden layer, whose input is fully connected to the input layer and whose output is fully connected to the output layer. The hidden layer is known to be capable of extracting features from the input. The third layer is the output layer, which outputs the prediction results for the data. Note that the figure shows only one example of such a model; the model can become deeper, extracting more implicit features from the raw input, if we add more hidden layers between the input and output layers, each fully connected to its neighboring layers.

Figure 1: Illustration of two neural network models. (a) Single-layer neural network (LR) model. (b) Multi-Layer Perceptron (MLP) model.

A Convolutional Neural Network (LeNet) model adds multiple convolution layers in front of the MLP model. Figure 2 illustrates an example of this model. The convolution layer processes its input in multiple steps. First, the input is transformed into a two-dimensional array. Then a sliding window containing a small two-dimensional weight vector is applied to the input. The sliding window is capable of extracting 2-D features from inputs such as images. Finally, the processed input is downsampled by a 2x2 matrix, which reduces the size of the input by a factor of 4. The figure shows an example that contains two convolution layers and one hidden layer. We can also build a more complex model by adding more convolution layers or more hidden layers, which allows the model to extract more implicit features.
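As a minimal illustration of the update rule above, the following NumPy sketch implements a single tanh neuron with a squared-error cost; the cost function, toy dimensions, and data are assumptions on our part and the sketch is not taken from our Theano implementation.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy dimensions and data; the models in this paper are trained on MNIST/CIFAR.
n_features = 4
X = rng.normal(size=n_features)              # one input vector
y = 1.0                                      # target output in [-1, 1] for tanh
W = rng.normal(scale=0.1, size=n_features)   # weight vector of the neuron
learning_rate = 0.13                         # same value as in Table 2

def forward(W, X):
    # Output = tanh(W . X)
    return np.tanh(W @ X)

def grad_cost(W, X, y):
    # Gradient of an assumed squared-error cost 0.5 * (tanh(W.X) - y)^2 w.r.t. W.
    out = np.tanh(W @ X)
    return (out - y) * (1.0 - out ** 2) * X

# One noise-free gradient-descent step: W_new = W_old - learning_rate * gradient
W = W - learning_rate * grad_cost(W, X, y)
print(forward(W, X))
```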

Figure 2: Convolutional Neural Network (LeNet) model.

2.2 Comparison with Related Works

We summarize three recent works that explore three mechanisms for introducing noise into a multi-layer neural network (MLP). Dropout proposes to regularize fully connected neural networks by probabilistically dropping (setting to zero) an output of a hidden-layer neuron [3]; i.e., with a low probability (1 − p), one of the outputs of a hidden-layer neuron is set to 0 during forward propagation. This can effectively decrease test error rates by preventing over-fitting of the model. Inspired by Dropout [3], DropConnect proposes to probabilistically drop a weight of a hidden-layer neuron, as opposed to an output of a hidden-layer neuron in Dropout [2]. Maxout extends Dropout and DropConnect by probabilistically setting an output or a weight of a hidden-layer neuron to its maximum value [4]. While these works explore ideas similar to ours, we believe our work is more comprehensive, as we systematically and experimentally explore various noise models, various noise locations, and various neural networks.

3 Proposed Method

In this section, we give an overview of all the types of noise that we introduce into each model (LR, MLP, and LeNet) in our experiments.

3.1 Adding Noise into Logistic Regression

We first introduce noise into the gradient descent component of Logistic Regression. To be specific, in a noise-free Logistic Regression model, the weights are updated as follows:

$$W_{new} = W_{old} - \mathrm{learning\ rate} \cdot \nabla_{W_{old}} \mathrm{cost}(W)$$

In a noise-added Logistic Regression model, the weights are updated as

$$W_{new} = W_{old} - \mathrm{learning\ rate} \cdot (\mathrm{mask} \odot \nabla_{W_{old}} \mathrm{cost}(W))$$

or

$$W_{new} = W_{old} - \mathrm{mask}_{Gau} \odot \nabla_{W_{old}} \mathrm{cost}(W)$$

where the learning rate is a scalar, $\odot$ denotes element-wise multiplication, and mask is a random vector with the same dimension as W. We generate mask from the Binomial distribution Bin(1, 0.5), the Gaussian distribution N(learning rate, 2 · learning rate), the Rayleigh distribution Rayleigh(1), or the Gamma distribution Gamma(1, 1).
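As a minimal illustration of these noise-added updates (a NumPy sketch, not our Theano code), the four mask distributions can be generated and applied as follows; the gradient is a random placeholder, and we read the second parameter of the Gaussian as its standard deviation, which is an interpretation on our part.

```python
import numpy as np

rng = np.random.default_rng(0)
learning_rate = 0.13
W = rng.normal(scale=0.1, size=10)      # current weight vector
grad = rng.normal(size=W.shape)         # placeholder for grad_W cost(W)

# Masks drawn from the four distributions used in Section 3.1.
masks = {
    "binomial": rng.binomial(1, 0.5, size=W.shape),
    "gaussian": rng.normal(learning_rate, 2 * learning_rate, size=W.shape),
    "rayleigh": rng.rayleigh(1.0, size=W.shape),
    "gamma":    rng.gamma(1.0, 1.0, size=W.shape),
}

# Multiplicative mask inside the usual update: W <- W - lr * (mask * grad).
# The Rayleigh and Gamma masks are applied the same way in this sketch.
W_binomial = W - learning_rate * (masks["binomial"] * grad)

# Gaussian mask replacing the scalar learning rate: W <- W - mask_Gau * grad
W_gaussian = W - masks["gaussian"] * grad
```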

3.2 Adding Noise into Multi-layer Logistic Regression

Second, we introduce noise into the weights between layers. In our Multi-layer Logistic Regression model, there are three layers: an input layer, a hidden layer, and an output layer. Each layer consists of neurons, and neurons in different layers are connected by weights. During a noise-free training process, the weights between layers are transmitted and updated without any loss of information or variation. During a noise-added training process, however, the weights between layers are subject to some variation. To be specific, let W_input be the matrix of weights between the input layer and the hidden layer, and W_output be the matrix of weights between the hidden layer and the output layer. In a noise-added training process, we apply a combination of the following steps:

$$W_{input} = W_{input} \odot \mathrm{mask}$$

$$W_{output} = W_{output} \odot \mathrm{mask}$$

$$W_{input} = W_{input} + \mathrm{mask}$$

where mask is a random matrix with the same dimensions as W_input or W_output. We generate mask from the Binomial distribution Bin(1, 0.99) or the Gaussian distribution N(0, 0.01).

3.3 Adding Noise into Convolutional Neural Network

Last, we introduce noise into the feature mapping component of the model. The difference between the Convolutional Neural Network (LeNet) and Multi-layer Logistic Regression (MLP) is that LeNet has a feature mapping process before the MLP. Feature mapping is a process in which a small window moves along the image to extract local features; in other words, the window, acting as a function, computes a linear combination of the underlying pixels. In a noise-added feature mapping process, the extracted features are subject to some variation.
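As a minimal illustration of the weight perturbations in Sections 3.2 and 3.3 (a NumPy sketch, not our Theano code; the layer sizes are only examples), a multiplicative Bin(1, 0.99) mask and an additive N(0, 0.01) mask can be applied to a weight matrix as follows.

```python
import numpy as np

rng = np.random.default_rng(0)

n_in, n_hidden = 784, 500   # MNIST input size and the hidden-layer size from Table 2
W_input = rng.normal(scale=0.01, size=(n_in, n_hidden))  # weights: input -> hidden

# Multiplicative mask from Bin(1, 0.99): each weight is kept with probability 0.99
# and zeroed with probability 0.01 (the "dropconnect"-style configuration).
drop_mask = rng.binomial(1, 0.99, size=W_input.shape)
W_input_dropped = W_input * drop_mask

# Additive mask from N(0, 0.01): a small Gaussian perturbation of every weight
# (the "noise-variation" configuration). We read 0.01 as the standard deviation,
# which is an interpretation on our part.
gauss_mask = rng.normal(0.0, 0.01, size=W_input.shape)
W_input_perturbed = W_input + gauss_mask
```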

4 Experiments

In this section, we first present the datasets used in our experiments as well as the parameters for each model. Note that we fine-tuned these parameters to achieve the best possible outcomes before adding our modifications to the code. We then present the results and the findings from our experiments, including negative results and the lessons learned in this project.

4.1 Dataset and Implementation Parameters

We experiment on three datasets: a hand-written digit dataset (MNIST) and two tiny-image datasets (CIFAR-10 and CIFAR-100). The specifications of the datasets are summarized in Table 1.

Dataset     Description           Classes   Training Set Size   Testing Set Size
MNIST       hand-written digits   10        60,000              10,000
CIFAR-10    32x32 RGB images      10        50,000              10,000
CIFAR-100   32x32 RGB images      100       50,000              10,000

Table 1: Datasets: MNIST, CIFAR-10, CIFAR-100.

We preprocess CIFAR-10 and CIFAR-100 by grey-scaling every image using the following formula:

$$Y = 0.2126 R + 0.7152 G + 0.0722 B$$

In other words, every pixel in the image becomes a linear combination of its original RGB values. These two datasets are preprocessed due to technical implementation limitations (which will be fixed after the deadline), not for machine learning reasons.

Our neural network models are implemented using the Python Theano library. The starter code is from DeepLearning.net. The parameters of each neural network model are summarized in Table 2.

Model                                   Parameters
Logistic Regression (LR)                learning rate = 0.13
Multi-layer Logistic Regression (MLP)   LR + hidden units = 500
Convolutional Neural Network (LeNet)    MLP + window size = 5x5, downsample = 2x2

Table 2: Parameters of the neural network models.

We use stochastic logistic regression with a learning rate of 0.13. In the Multi-layer Logistic Regression model, there are 500 neurons in the hidden layer. During the feature mapping step of the Convolutional Neural Network, the sliding windows are of size 5 by 5 and the downsample is of size 2 by 2.

In our experiments, different models may run for different numbers of iterations. This is because we set a threshold on the accuracy increase when training a model: if the model's accuracy increase falls below the threshold, we stop training. Hence some models run for more iterations, as long as their accuracy increases stay above the threshold.
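For reference, the grey-scaling step described above amounts to the following NumPy sketch; the batch shape (N, 32, 32, 3) and the random stand-in data are assumptions, and the actual data-loading code is not shown.

```python
import numpy as np

def to_grayscale(images: np.ndarray) -> np.ndarray:
    """Convert a batch of RGB images (N, 32, 32, 3) to grayscale (N, 32, 32)
    using Y = 0.2126 R + 0.7152 G + 0.0722 B."""
    weights = np.array([0.2126, 0.7152, 0.0722])
    return images @ weights  # weighted sum over the last (channel) axis

# Example with random data standing in for CIFAR-10 images.
fake_batch = np.random.rand(4, 32, 32, 3)
gray = to_grayscale(fake_batch)
print(gray.shape)  # (4, 32, 32)
```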

4.2 Adding Noise into Logistic Regression

Figure 3 shows the test error rate of noise-free and noise-added Logistic Regression on MNIST.

Figure 3: Logistic Regression with Noise on MNIST.

In Figure 3, the vertical axis is the test error rate (%) and the horizontal axis is the number of iterations; all experiments run on MNIST. The noise-free line shows the test error rate of a noise-free Logistic Regression model. The noise(gaussian) line shows the test error rate when mask_Gau is applied during gradient descent. The noise(binomial), noise(rayleigh), and noise(gamma) lines show the test error rate when a mask generated from Bin(1, 0.5), Rayleigh(1), and Gamma(1, 1), respectively, is applied during gradient descent.

Finding 1: A reasonable amount, and a reasonable amplitude, of noise improves a deep neural network model's accuracy, while noise that is too significant does not.

As shown in Figure 3, the noise-added models achieve better accuracy than the noise-free model. The noise(binomial) model has the lowest test error rate (7.156%) among the five experiments. However, it also has the lowest convergence rate. This is a phenomenon we have observed throughout the project: though adding noise can improve accuracy, the side effect is that the model takes longer to train, which decreases the convergence rate.

4.3 Adding Noise into Multi-layer Logistic Regression

Figure 4 shows the test error rate of noise-free and noise-added Multi-layer Logistic Regression on MNIST.

Figure 4: Multi-layer Logistic Regression with Noise on MNIST.

In Figure 4, the vertical axis is the test error rate (%) and the horizontal axis is the number of iterations; all experiments run on MNIST. The noise-free line shows the test error rate of a noise-free Multi-layer Logistic Regression model. The dropconnect line shows the test error rate when a mask generated from Bin(1, 0.99) is applied to W_input. The dropout line shows the test error rate when a mask generated from Bin(1, 0.99) is applied to W_output. The dropconnect&out line shows the test error rate when masks generated from Bin(1, 0.99) are applied to both W_input and W_output. The noise-variation line shows the test error rate when a mask generated from the Gaussian N(0, 0.01) is added to W_input.

Finding 2: Deep learning models with noise perform no worse than the noise-free model.

As shown in Figure 4, the noise-added models perform no worse than the noise-free model. Since the test error rate of the noise-free model is already quite low (2.63%), it is difficult for the noise-added models to significantly improve accuracy. We notice that the dropout and dropconnect models perform better than the dropconnect&out and noise-variation models. It is difficult to provide a conclusive explanation for this observation at the moment because we have not finished fine-tuning our noise-added models. It is possible that noise from certain distributions is more likely to prevent overfitting and hence improve accuracy.

Figure 5 shows the test error rate of noise-free and noise-added Multi-layer Logistic Regression on CIFAR-10.

Figure 5: Multi-layer Logistic Regression with Noise on CIFAR-10.

In Figure 5, the vertical axis is the test error rate (%) and the horizontal axis is the number of iterations; all experiments run on CIFAR-10. The noise-free, dropconnect, and dropout lines use the same models as the corresponding experiments in Figure 4.

Finding 3: Deep learning models with noise can take more iterations to converge, as the test error fluctuates due to noise.

As shown in Figure 5, the noise-added models perform much better than the noise-free model, though the noise-added models take longer to train. An interesting observation is that, as the training iterations increase, the test error rate of the noise-added models fluctuates. This is another side effect of adding noise to the model.

Finding 4: Noise added at an earlier stage of a deep learning model can be better integrated and generates less fluctuation.

From the above MLP experiments, we observe that models with noise added between the input layer and the hidden layer outperform the other noise-added models. An intuitive explanation for this phenomenon is that noise added at an earlier stage of the model can be better integrated, while noise added at a later stage tends to cause more fluctuation.

4.4 Adding Noise into Convolutional Neural Network

Figure 6 shows the test error rate of the noise-free and noise-added Convolutional Neural Network on MNIST.

Figure 6: Convolutional Neural Network with Noise on MNIST.

In Figure 6, the vertical axis is the test error rate (%) and the horizontal axis is the number of iterations; all experiments run on MNIST. The noise-free line shows the test error rate of a noise-free Convolutional Neural Network. The noise@downsample line shows the test error rate when noise is added during the downsample process. The noise-before-hidden-layer line shows the test error rate when noise is added right before the hidden layer.

Finding 5: The convergence rate is faster for deep learning models with noise.

As shown in Figure 6, the three models perform equally well. We observe that, as the number of iterations increases, the noise-added models converge slightly faster. This phenomenon is interesting because it is unexpected. A similar phenomenon appears in Figure 7 as well.

Figure 7 shows the test error rate of the noise-free and noise-added Convolutional Neural Network on CIFAR-10.

Figure 7: Convolutional Neural Network with Noise on CIFAR-10.

In Figure 7, the vertical axis is the test error rate (%) and the horizontal axis is the number of iterations; all experiments run on CIFAR-10. The noise-free line shows the test error rate of a noise-free Convolutional Neural Network. The convo-dropconnect line shows the test error rate when the MLP part of the model has noise added between the input layer and the hidden layer. The convo-dropout line shows the test error rate when the MLP part of the model has noise added between the hidden layer and the output layer.

Finding 6: Noise improves both accuracy and convergence rate more with complex deep learning models.

As shown in Figure 7, the three models achieve the same lowest test error rate. However, over the course of training, the noise-added model (convo-dropconnect) converges faster than the noise-free model. The intuition behind this phenomenon is that, since the Convolutional Neural Network is a complicated model, noise is better integrated and absorbed. We conjecture that noise added to complicated deep learning models can improve not only accuracy but also the convergence rate.

4.5 Negative Results

Lesson Learned: Complex deep learning models can integrate noise better than simple models.

Figure 8 shows some negative results from our experiments. We run noise-free and noise-added Logistic Regression on CIFAR-100. The noise-added model performs much worse than the noise-free model. The explanation for this result is that the Logistic Regression model is too simple to integrate noise when running on CIFAR-100. This agrees with our previous conjecture that complicated models are better at integrating noise.

Figure 8: Negative Results on CIFAR-100.

5 Conclusions

In this project, we systematically study the effect of adding noise to deep neural networks through experiments. We conduct experiments that add different kinds of noise to different components of neural network models. The experimental results show that adding noise almost always improves accuracy. Our main observations are: (1) noise added during an early stage of the model can be better integrated, while noise added during a late stage tends to cause the accuracy to fluctuate; (2) complicated neural network models can integrate and absorb noise better than simple neural network models; and (3) adding noise can sometimes improve not only accuracy but also the convergence rate. We hope that this experimental study can provide insights for the future design of deep neural network models and machine learning hardware. Next-generation machine learning hardware can exploit the result that a reasonable amount of noise does not hurt, and can even improve, model accuracy.

Beyond this project, we hope to pursue three major research directions: (1) conduct more thorough experiments that quantitatively analyze the effect of noise on deep learning models; (2) provide a theoretical explanation for the effect of noise on deep learning models based on our experimental results and findings; and (3) design and explore more efficient computer hardware and systems for deep learning models.

References

[1] T. Chilimbi, Y. Suzue, J. Apacible, and K. Kalyanaraman, "Project Adam: Building an efficient and scalable deep learning training system," in OSDI, 2014.

[2] L. Wan, M. Zeiler, S. Zhang, Y. L. Cun, and R. Fergus, "Regularization of neural networks using DropConnect," in ICML, 2013.

[3] G. E. Hinton, N. Srivastava, A. Krizhevsky, I. Sutskever, and R. R. Salakhutdinov, "Improving neural networks by preventing co-adaptation of feature detectors," arXiv preprint arXiv:1207.0580, 2012.

[4] I. J. Goodfellow, D. Warde-Farley, M. Mirza, A. C. Courville, and Y. Bengio, "Maxout networks," in ICML, 2013.