DropConnect Regularization Method with Sparsity Constraint for Neural Networks


Chinese Journal of Electronics, Vol.25, No.1, Jan. 2016

DropConnect Regularization Method with Sparsity Constraint for Neural Networks

LIAN Zifeng 1, JING Xiaojun 1, WANG Xiaohan 2, HUANG Hai 1, TAN Youheng 1 and CUI Yuanhao 1
(1. Key Laboratory of Trustworthy Distributed Computing and Service (BUPT), Ministry of Education, Beijing University of Posts and Telecommunications, Beijing, China)
(2. School of Software and Microelectronics, Peking University, Beijing, China)

Abstract: DropConnect is a recently introduced algorithm for preventing the co-adaptation of feature detectors. Compared to Dropout, DropConnect achieves state-of-the-art results on several image recognition benchmarks. Motivated by its success, we extend the algorithm with the ability of sparse feature selection. In the original DropConnect algorithm, the dropping masks of the weights are generated from Bernoulli gating variables that are independent of the weights and activations. We introduce a new strategy that generates the masks depending on the outputs of the previous layer. With this method, neurons that are likely to produce sparser features are assigned a higher probability of remaining active during forward and backward propagation. We evaluate this sparsity-constrained DropConnect on the MNIST and CIFAR datasets in comparison with ordinary DropConnect and Dropout. The results show that the new method improves the sparsity of the learned features significantly without degrading precision.

Key words: DropConnect, Sparse regularization, Deep learning, Neural networks.

I. Introduction

Feedforward artificial neural networks are well suited to large labeled datasets, since their capacity can be scaled up easily by adding more layers or more units per layer. However, to retain the ability to extract high-dimensional features from large datasets, the corresponding networks must themselves be large. Large-scale neural networks often contain millions or billions of parameters, and the relationship between the input and the correct output is complicated as well. Given a limited amount of labeled training data and a random initialization, a network can learn many different weight settings that model the training set almost perfectly through forward and backward propagation. When evaluated on the test and validation sets, however, almost all of these settings perform worse than on the training set, because the feature detectors have been tuned too finely to the training data. This phenomenon is known as overfitting.

In order to deal with overfitting, a wide range of techniques for regularizing neural networks have been developed. Recently, two state-of-the-art regularization methods were proposed, Dropout [1] and DropConnect [2], the latter being a generalization of the former. When training a neural network with DropConnect, each weight between two layers is kept with probability (1 − p) and set to zero with probability p. Extensive experiments show that DropConnect improves a network's generalization ability and gives improved test performance. In the DropConnect of Li Wan et al. [2], the dropping masks of the weights are generated from Bernoulli gating variables, so every weight shares the same probability of being dropped.
In this paper, we impose a sparsity constraint on DropConnect when generating the dropping masks, changing the dropping probability from a constant into a function of the activations of the neurons in the previous layer. The experiments below show that this modified version of DropConnect, hereinafter referred to as Sparse DropConnect, improves the sparsity of the learned features significantly while keeping generalization performance comparable to DropConnect. The rest of this paper is organized as follows.

Manuscript Received May 11, 2015; Accepted June 23. This work is supported by the National Natural Science Foundation of China and the National High Technology Research and Development Program of China (No.2011AA01A204). © 2016 Chinese Institute of Electronics.

In Section II we review prior work, including a brief introduction to the Dropout and DropConnect strategies. Section III describes the detailed methodology of the Sparse DropConnect method. Section IV presents our experiments with different models on the MNIST and CIFAR-10 datasets, together with discussion and analysis of the results. Finally, Section V concludes the paper and discusses future work.

II. Related Works

Deep learning algorithms are special cases of representation learning that have achieved important empirical successes in traditional AI applications such as computer vision and natural language processing [3,4]. Deep learning algorithms can learn multiple levels of representation, so more abstract features can be discovered automatically, and abstract, complex representations are believed to be more useful than shallow ones. But as neural networks grow deeper and larger, overfitting becomes a major challenge when only a small amount of labeled data is available for training.

To keep neural networks from overfitting and from co-adaptation of feature detectors, several simple approaches have shown favorable effects, such as imposing an L2 penalty on the network weights, weight elimination [5], Bayesian methods [6], and early stopping of training. Denoising autoencoders (DAEs) by Vincent et al. [7,8] add noise to the input units of an autoencoder as a form of regularization, and the network is trained to reconstruct the noise-free input; DAEs learn markedly more robust features than ordinary autoencoders.

Recently, a new regularization method called Dropout was proposed by Hinton et al. in 2012 [1]. Li Wan et al. then took the idea a step further and introduced a generalization of Dropout, called DropConnect [2], for regularizing large fully connected layers within neural networks. When training a neural network with Dropout, each element of a layer's output is kept unchanged with probability (1 − p) and set to 0 with probability p. DropConnect generalizes Dropout in that each connection, rather than each output unit, can be dropped with probability p. DropConnect is similar to Dropout in that it introduces dynamic noise into the model, but differs in that the noise is applied to the weights W rather than to the output vectors of a layer. Both can be seen as stochastic regularization techniques. Like Dropout, DropConnect is best suited to fully connected layers, but it can also be used in other neural network models such as Convolutional neural networks (CNNs) [9] and Deep belief networks (DBNs) [10].

Based on these previous works, we improve the original DropConnect strategy in order to obtain better feature sparsity while keeping the regularization ability of DropConnect. The detailed methodology and the sparseness measure are described below.
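To make the Dropout masking described above concrete, here is a minimal NumPy sketch of a Dropout layer at training time; the function name, array shapes, and random generator are illustrative choices of ours, not code from any of the cited implementations. DropConnect applies the same kind of Bernoulli mask to the weight matrix instead of to the output vector, as formulated in Section III.

```python
import numpy as np

rng = np.random.default_rng(0)

def dropout_forward(y, p, rng):
    """Dropout (training time): keep each output element with probability 1 - p.

    y : activations of a layer, shape (n_units,)
    p : dropping probability
    """
    mask = rng.random(y.shape) >= p        # 1 with probability (1 - p)
    return y * mask                        # dropped elements become 0

# toy usage: a 5-unit layer output with p = 0.5
y = np.array([0.2, 0.9, 0.1, 0.7, 0.4])
print(dropout_forward(y, p=0.5, rng=rng))
```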
III. Methodology of Sparse DropConnect

In this section we first briefly review the DropConnect neural network model. Then the detailed methodology of Sparse DropConnect is elaborated, especially the calculation of the DropConnect probability of the weights. At the end of the section, a measure of feature sparseness is formulated, which will be used in the experiments to examine the performance of our method.

1. DropConnect formulation

Consider a feed-forward neural network with L fully connected hidden layers, in which one or more of the fully connected layers adopt DropConnect. Let l ∈ {1, ..., L} index a DropConnect layer, and let p denote the probability of dropping a connection, so each weight is retained with probability (1 − p). Let y_i^{(l)} denote the activation of unit i in layer l, with y^{(0)} = x being the input, and let z_i^{(l)} be the total weighted sum of the inputs to unit i in layer l. W^{(l)} and b^{(l)} are the weights and biases of layer l, the symbol * denotes element-wise multiplication, and f(·) denotes the activation function of layer l. The feed-forward operation of a DropConnect layer can then be described as follows:

$\mathrm{mask}^{(l)}_{ij} \sim \mathrm{Bernoulli}(1 - p)$   (1)
$\tilde{W}^{(l)} = W^{(l)} * \mathrm{mask}^{(l)}$   (2)
$z^{(l)}_i = \sum_{j=1}^{n} \tilde{W}^{(l)}_{ij} \, y^{(l-1)}_j + b^{(l)}_i$   (3)
$y^{(l)} = f(z^{(l)})$   (4)
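As a concrete reading of Eqs.(1)-(4), the following NumPy sketch runs one fully connected DropConnect layer at training time. The function name, the sigmoid choice for f, and the returned mask are illustrative assumptions on our part; the paper's actual implementation is built on cuda-convnet and is not reproduced here.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def dropconnect_forward(W, b, y_prev, p, rng, f=sigmoid):
    """One DropConnect layer, training time (Eqs. (1)-(4)).

    W      : weight matrix, shape (n_out, n_in)
    b      : bias vector, shape (n_out,)
    y_prev : activations of the previous layer, shape (n_in,)
    p      : probability of dropping a connection
    """
    mask = (rng.random(W.shape) < (1.0 - p)).astype(W.dtype)  # Eq. (1): keep with prob 1 - p
    W_masked = W * mask                                       # Eq. (2)
    z = W_masked @ y_prev + b                                 # Eq. (3)
    return f(z), mask                                         # Eq. (4); mask reused in backprop

# toy usage
rng = np.random.default_rng(1)
W = rng.standard_normal((3, 4)) * 0.1
b = np.zeros(3)
y_prev = rng.random(4)
y, mask = dropconnect_forward(W, b, y_prev, p=0.5, rng=rng)
```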

2. Sparse DropConnect

Sparse representations are a favorable compromise between dense input signals and local codes. In neural networks, sparse coding usually means a small set of active hidden neurons, or a small average activity ratio in the hidden layers. Given a potentially large set of input data, sparse representation attempts to find the smallest number of representative patterns automatically, such that these patterns, combined in appropriate proportions and weights, reproduce the original input patterns with minimum deviation. The sparse representation of the input then consists of those representative patterns.

In the DropConnect strategy, all the weights are dropped according to the same probability, without considering the sparsity of the neurons. What would happen if different units were assigned different propagation probabilities according to the sparseness of their activations? In other words, the masks of the weights become a function of the outputs produced by the neurons in the previous layer. Based on this intuition, the feed-forward operations of our Sparse DropConnect network can be formulated as follows:

$\mathrm{mask}^{(l)}_{ij} \sim \mathrm{Uniform}(0, 1)$   (5)
$\rho^{(l)}_{ij} = p_{\mathrm{drop}}(\tilde{y}^{(l-1)}_j)$   (6)
$\tilde{W}^{(l)} = W^{(l)} * \mathbb{1}(\mathrm{mask}^{(l)} > \rho^{(l)})$   (7)
$z^{(l)}_i = \sum_{j=1}^{n} \tilde{W}^{(l)}_{ij} \, y^{(l-1)}_j + b^{(l)}_i$   (8)
$y^{(l)} = f(z^{(l)})$   (9)

where W^{(l)}_{ij} denotes the weight associated with the connection between the j-th unit in layer l − 1 and the i-th unit in layer l, and the symbol * in Eq.(7) denotes element-wise multiplication. ρ^{(l)}_{ij} is the DropConnect probability associated with the weight W^{(l)}_{ij}, and p_drop(·) is the function that calculates this dropping probability; it is elaborated in Section III.3. The function takes a normalized activation of the previous layer, ỹ^{(l-1)}_j, as input and produces a DropConnect probability ρ^{(l)}_{ij} ranging from 0 to 1.0. The randomly generated masks are then modified according to ρ^{(l)}: if the initial random mask value associated with W^{(l)}_{ij} (denoted mask^{(l)}_{ij}) is bigger than ρ^{(l)}_{ij}, the mask is set to one and the corresponding weight remains active during forward and backward propagation; otherwise the mask is set to zero and the weight is omitted during propagation.

Since the dropping probabilities of the weights are functions of the previous layer's outputs, Sparse DropConnect adds the ability of selective feed-forward propagation to the normal DropConnect strategy. In addition, since the sparsity property is partially embodied in the distribution of the neural outputs, such selective propagation adds a new sparse regularization to DropConnect models.

To update the weight matrix W during the backward propagation phase, the DropConnect masks are applied to the gradient vectors so that only the weights and biases that were active in the forward pass are updated. During the testing phase, we use the same approximation method adopted by Hinton et al. in 2012 [1]:

$\mathbb{E}_M\big[f((M * W)x)\big] \approx f\big(\mathbb{E}_M[(M * W)x]\big)$   (10)

This method averages the outputs before the activation rather than after. Although it has not been justified mathematically, it works well in practice. The 1-dimensional Gaussian approximation used in DropConnect is not applicable here, since after the sparsity-constrained selection of Eq.(7) the masks are no longer Bernoulli variables, so the weighted sum cannot be approximated by a Gaussian variable.
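The following sketch numerically compares the two sides of Eq.(10) for ordinary Bernoulli masks: the exact average of the post-activation outputs over sampled masks, and the activation of the averaged pre-activation used as the test-time approximation. The layer sizes, keep probability, and sigmoid activation are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(2)
sigmoid = lambda z: 1.0 / (1.0 + np.exp(-z))

W = rng.standard_normal((3, 5)) * 0.5
x = rng.random(5)
p_keep = 0.5
n_samples = 10000

masks = rng.random((n_samples, *W.shape)) < p_keep   # sampled masks M
pre = np.einsum('sij,j->si', masks * W, x)           # (M * W) x for every sampled mask

avg_after = sigmoid(pre).mean(axis=0)                # E_M[ f((M*W)x) ]   (exact target)
avg_before = sigmoid(pre.mean(axis=0))               # f( E_M[(M*W)x] )   (Eq. (10) approximation)

print(avg_after, avg_before)                         # the two vectors are close in practice
```

In expectation, averaging before the activation reduces to scaling the weights by the mean of the mask, which is why this approximation is cheap to apply at test time.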
3. DropConnect probability of weights

Under the hypothesis of sparse representation, the average activity ratio must be as low as possible, typically a small value close to zero; that is, most activations of the hidden units should be near zero. The typical way to achieve this is to add an extra penalty term to the objective function, which penalizes activations y_i^{(l)} that deviate significantly from the anticipated average activity ratio for a given input, where y_i^{(l)} denotes the activation of the i-th hidden unit in layer l. Under the sparsity hypothesis, most of the outputs should take the value zero; for the non-zero outputs, the preferred values are either close to zero or considerably larger than the average activation value.

According to Eq.(6), the probabilities of dropping the connections between layer l − 1 and layer l depend on the outputs of layer l − 1. Before calculating these probabilities, the output vector of the previous layer, denoted y^{(l-1)}, is normalized to the range [0, 1]; we use ỹ^{(l-1)} to denote the normalized output vector. For each element ỹ^{(l-1)}_j of the normalized outputs, the DropConnect probability associated with W^{(l)}_{ij} is

$\rho^{(l)}_{ij} = p_{\mathrm{drop}}(\tilde{y}^{(l-1)}_j) = 4\,\tilde{y}^{(l-1)}_j\,(1 - \tilde{y}^{(l-1)}_j)$   (11)

Fig.1 shows the DropConnect probability as a function of the activation. Activations are normalized to [0, 1] before being passed to this probability function, and the corresponding DropConnect probabilities range from zero to 1.0. When ỹ^{(l-1)}_j equals 0.5, ρ^{(l)}_{ij} takes its maximum of 1.0; as ỹ^{(l-1)}_j moves from the middle towards either end of [0, 1], the DropConnect probability decreases smoothly from 1.0 towards zero.

Fig. 1. Dropping probability curve for normalized activations
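Putting Eqs.(5)-(9) and Eq.(11) together, the sketch below generates the sparsity-constrained mask and runs the forward pass for one layer at training time. The min-max normalization used to obtain ỹ, the epsilon guard, and the function names are our own assumptions; the paper only states that the previous layer's outputs are normalized to [0, 1].

```python
import numpy as np

def p_drop(y_norm):
    """Eq. (11): drop probability peaks at 1.0 for normalized activations near 0.5."""
    return 4.0 * y_norm * (1.0 - y_norm)

def sparse_dropconnect_mask(W, y_prev, rng):
    """Build the sparsity-constrained DropConnect mask (Eqs. (5)-(7)).

    W      : weights, shape (n_out, n_in)
    y_prev : previous-layer activations, shape (n_in,)
    """
    # normalize previous-layer activations to [0, 1] (assumed min-max scaling)
    y_min, y_max = y_prev.min(), y_prev.max()
    y_norm = (y_prev - y_min) / (y_max - y_min + 1e-12)

    rho = p_drop(y_norm)                              # Eq. (6): one value per input unit j
    u = rng.random(W.shape)                           # Eq. (5): mask ~ Uniform(0, 1)
    return (u > rho[np.newaxis, :]).astype(W.dtype)   # Eq. (7): keep where u > rho

def sparse_dropconnect_forward(W, b, y_prev, rng, f=lambda z: 1.0 / (1.0 + np.exp(-z))):
    mask = sparse_dropconnect_mask(W, y_prev, rng)
    z = (W * mask) @ y_prev + b                       # Eqs. (7)-(8)
    return f(z), mask                                 # Eq. (9); mask reused in backprop

# toy usage
rng = np.random.default_rng(3)
W = rng.standard_normal((3, 4)) * 0.1
y, mask = sparse_dropconnect_forward(W, np.zeros(3), rng.random(4), rng)
```

Note that ρ depends only on the source unit j, so all connections leaving a unit whose normalized activation is near 0.5 are dropped with high probability, while connections from units near either end of [0, 1] are mostly kept.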

4. Measure of feature sparseness

This subsection defines the sparseness measure of the feature vectors output by the neural networks, which will be used later to quantify the performance of the Sparse DropConnect strategy. The concept of sparse coding, or sparse representation, refers to a representational scheme in which only a few units of a large neural network are effectively used to represent typical data vectors [11]. This constraint implies that most neurons in the network take values of zero or close to zero, while in some special but reasonable cases a small set of neurons may output values far from zero, concentrated in the neighborhood of a relatively large value.

Two commonly used sparseness measures, as well as sparsity constraints, are the L1 norm and the L2 norm. When the L1 norm is used as a regularizer in feature extraction tasks, sparsity is obtained by learning each variable (weight or bias) individually under the L1-constrained objective; regularization by linear combinations of L2 norms is also known to induce sparsity in W [12]. Both methods are widely used in neural networks as sparsity constraints and show remarkable regularization power, and both norms are commonly used as measures of sparseness. In this paper, we use a sparseness measure based on the relationship between the L1 norm and the L2 norm, introduced by Patrik O. Hoyer [13]:

$\mathrm{sparseness}(x) = \dfrac{\sqrt{n} - \left(\sum_i |x_i|\right) \big/ \sqrt{\sum_i x_i^2}}{\sqrt{n} - 1}$   (12)

where x denotes the input feature vector and n is the number of elements of x. A multidimensional feature matrix is flattened to one dimension, and normalization is applied if the features range beyond [0, 1]. This function evaluates to 1.0 if and only if x contains a single non-zero component, and to zero if and only if all components are equal; in all other cases it interpolates smoothly between 0 and 1.0.

Fig. 2. Illustration of various degrees of sparseness

Fig.2 illustrates the sparseness measure used in this paper. Four vectors with sparseness levels of about 0.1, 0.4, 0.7, and 0.9 are shown; each bar denotes one element of a vector. The vector with the lowest sparseness (leftmost) has the most non-zero values, and most of its elements do not deviate far from the mean. At the highest level (rightmost), most elements are zero and only a few take significant values.
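Eq.(12) translates directly into a few lines of NumPy; the flattening step and the guard for an all-zero input are our own additions.

```python
import numpy as np

def hoyer_sparseness(x):
    """Eq. (12): 0.0 for a constant vector, 1.0 for a vector with a single non-zero entry."""
    x = np.asarray(x, dtype=float).ravel()        # flatten multidimensional features
    n = x.size
    l1 = np.abs(x).sum()
    l2 = np.sqrt((x ** 2).sum())
    if l2 == 0.0:                                 # all-zero input: guard against division by zero
        return 0.0
    return (np.sqrt(n) - l1 / l2) / (np.sqrt(n) - 1.0)

print(hoyer_sparseness([0, 0, 0, 1]))             # -> 1.0 (single non-zero component)
print(hoyer_sparseness([0.5, 0.5, 0.5, 0.5]))     # -> 0.0 (all components equal)
```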
IV. Experiments and Results Analysis

In order to make clear comparisons, we evaluate the Sparse DropConnect method using DNN and CNN models similar to those used in the DropConnect experiments of Li Wan et al. [2] for image classification. Our main expectation for Sparse DropConnect is an improvement in feature sparsity; the utmost classification precision is not the primary concern. Therefore, the models do not contain as many layers and neurons as the state-of-the-art records. Such settings may not produce excellent test errors, but when the same models are compared, the improvements with respect to sparsity should be demonstrated clearly. We run experiments on the MNIST [9] and CIFAR-10 [14] datasets. All datasets are pre-processed by mean normalization, but no whitening or data augmentation schemes are used.

Sparse DropConnect, DropConnect, Dropout, and a non-regularized baseline are applied respectively in the last fully connected layers. All experiments use mini-batch Stochastic gradient descent (SGD) with a mini-batch size of 100 and the momentum parameter fixed at 0.9. Our implementation is based on the deep learning library cuda-convnet [15], a fast C++/CUDA implementation of feed-forward neural networks written by Alex Krizhevsky [14,15].
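For reference, the update rule implied by the configuration above (mini-batch SGD with momentum 0.9) is sketched below. This is a generic sketch rather than cuda-convnet code; `grad_fn`, `iterate_minibatches`, and `current_lr` are hypothetical placeholders.

```python
import numpy as np

def sgd_momentum_step(w, v, grad, lr=0.1, momentum=0.9):
    """One SGD-with-momentum update for a parameter array w and its velocity v."""
    v = momentum * v - lr * grad      # accumulate velocity
    w = w + v                         # apply update
    return w, v

# sketch of the training loop over mini-batches of size 100
# for epoch in range(num_epochs):
#     for x_batch, t_batch in iterate_minibatches(train_set, batch_size=100):
#         grad = grad_fn(w, x_batch, t_batch)   # backprop through the (Sparse) DropConnect net
#         w, v = sgd_momentum_step(w, v, grad, lr=current_lr, momentum=0.9)
```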

For each dataset, two kinds of results are reported after training and testing. The first concerns discriminative ability, characterized by the test error and the cross entropy. The second concerns the sparsity of the features learned by the models, measured by the sparseness measure of Section III.4.

ReLU and sigmoid activation functions are both used in these experiments: ReLU units in the experiments aimed at classification accuracy, and sigmoid units in the experiments intended for sparseness measurement. There are two reasons for this choice. On the one hand, ReLU units do not suffer from the vanishing-gradient problem of sigmoid and tanh units, so a higher classification accuracy can be obtained; at the same time, by clamping negative values to zero, ReLU outputs are already sparser than those of other units, an effect that would entangle with the sparsity constraints and distort the sparseness measurements. On the other hand, the output range of a ReLU unit is [0, +∞), which makes it less suitable for sparseness analysis than a sigmoid unit, whose output range is fixed to [0, 1]. Although ReLU outputs can be normalized to [0, 1] by dividing by the maximum value within each mini-batch, this maximum differs from mini-batch to mini-batch, so the normalization distorts the probability distribution of the activations when all mini-batches are considered together. Therefore, ReLU and sigmoid units are used to obtain classification accuracy and feature sparseness, respectively.

1. MNIST

The MNIST handwritten digit dataset [9] consists of 28 × 28 black-and-white images. It contains 70,000 handwritten digit samples belonging to 10 classes, the digits 0 to 9. Each digit in the 60,000 training images and 10,000 test images is size-normalized to fit in a 20 × 20 pixel box while preserving its aspect ratio.

To compare with the DropConnect of Li Wan et al. [2], we train models with two fully connected hidden layers of 400 neurons each, using ReLU or sigmoid activation functions. The first layer takes the 28 × 28 raw image pixels as input; layers 2 and 3 are the 400-unit hidden layers responsible for learning image features; the last hidden layer's output is fed into a 10-class Softmax layer, which yields the classification result from 0 to 9. We use stochastic gradient descent with a mini-batch size of 100 and a cross-entropy objective. We train the model in five stages of epochs, annealing the learning rate from its initial value of 0.1 at each stage.

In Fig.3 we show the performance and convergence of our Sparse DropConnect method in the fully connected DNN models described above, compared with the No-Drop, Dropout, and DropConnect methods on MNIST; Sparse DropConnect, DropConnect, Dropout, and No-Drop are denoted sdc, dc, do, and na in the figure, respectively. Fig.3(a) plots the error rates of each method on the training and test sets, and Fig.3(b) shows the convergence properties of the four methods. From the two figures we can see that the No-Drop model overfits quickly, while Sparse DropConnect, DropConnect, and Dropout converge more slowly but reach better test performance in the end. Sparse DropConnect converges more slowly than Dropout but slightly faster than DropConnect, and reaches the lowest test error in the end.

The final error rates of each method are summarized in the second column of Table 1, and the third column gives the sparseness values according to the measure of Section III.4. Note that, due to the limited capacity of this DNN architecture, the test errors of all the models fall short of the state-of-the-art results, but Sparse DropConnect still performs best among them. What really matters here is the sparsity property, in which Sparse DropConnect obtains an outstanding result.

Table 1. Accuracy and sparseness on MNIST dataset
Model              | Test error (%) | Sparseness
Sparse DropConnect |                |
DropConnect        |                |
Dropout            |                |
No-Drop            |                |

In order to demonstrate the sparsity of the features produced by networks trained with Sparse DropConnect, DropConnect, Dropout, and No-Drop respectively, we train networks with the four methods as described above, feed the test set into the fully trained networks, and analyze the distributions and sparsity of the outputs. The test set of MNIST contains 10,000 samples.
Fig. 3. Comparison of performance and convergence properties. (a) Error rates on the training and test sets; (b) Logistic regression costs on the training and test sets

The last fully connected layer above the output layer, i.e. layer 3, is a hidden layer containing 400 sigmoid neurons, so a total of 4,000,000 values ranging from zero to one are output by layer 3 for the test set. To demonstrate the sparsity of the networks trained with the different methods, we plot histograms of these 4,000,000 layer-3 outputs in Fig.4; each subfigure corresponds to one of Sparse DropConnect, DropConnect, Dropout, and No-Drop. In Fig.4 we can see that the output values of the hidden units trained with Sparse DropConnect are overall smaller and more concentrated than those trained with the other methods, reflecting sparser outputs; most Sparse DropConnect outputs lie between 0 and 0.1, which is the biggest difference from the other distributions. Using the sparseness measure of Section III.4, the network trained with Sparse DropConnect achieves the best sparseness value of 0.54; the sparseness values of the other networks are 0.24 for DropConnect, 0.05 for Dropout, and 0.14 for No-Drop, respectively.

Fig. 4. Comparison of feature distributions and sparseness. (a) Sparse DropConnect (0.54); (b) DropConnect (0.24); (c) Dropout (0.05); (d) No-Drop (0.14)

2. CIFAR-10

The CIFAR-10 [13,14] dataset is a subset of the Tiny Images dataset [16]. It contains 60,000 32 × 32 color images in 10 classes, with 6,000 images per class; each class contains 5,000 training images and 1,000 test images. The experiments on CIFAR-10 are based on the simple convolutional network described by Alex Krizhevsky [15], named layers-80sec.cfg in the author's code repository. This model contains three convolutional layers with 32, 32, and 64 feature maps respectively, each followed by a pooling layer. The convolutional kernel size is 5 and the stride is 1; the three max-pooling layers summarize a 3 × 3 neighborhood and use a stride of 2. Between the last convolutional layer and the output layer there is a fully connected layer with 64 neurons (a structured summary of this layer stack is given at the end of this subsection). Such a model is designed for rapid training rather than optimal classification performance. In the last fully connected layer, Sparse DropConnect, DropConnect, Dropout, or No-Drop is adopted respectively to evaluate classification accuracy and feature sparsity.

Since these experiments are intended to demonstrate the effects of the different regularization methods rather than to obtain optimal classification results, the model does not need to be very complicated, nor does the number of training epochs need to be large. We therefore train the models for 150 epochs with an initial learning rate of 0.1 and the default weight decay in all experiments on CIFAR-10.

Table 2 summarizes the classification accuracy and feature sparseness of the models trained with Sparse DropConnect, DropConnect, Dropout, and No-Drop. As shown in Table 2, the Sparse DropConnect method achieves the best feature sparsity while keeping classification accuracy comparable to the Dropout and DropConnect strategies.

Table 2. Accuracy and sparseness on CIFAR-10 dataset
Model              | Test error (%) | Sparseness
Sparse DropConnect |                |
DropConnect        |                |
Dropout            |                |
No-Drop            |                |
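For readability, the layer stack described above can be summarized as the following structured sketch; the ordering (each convolution followed by its pooling layer) is taken from the description, while padding and per-layer output shapes are not specified in the text and are therefore omitted. The dictionary format is our own illustrative choice, not the cuda-convnet configuration syntax.

```python
# CIFAR-10 model used in these experiments, as described in the text above.
cifar10_model = [
    {"type": "conv",    "maps": 32, "kernel": 5, "stride": 1},
    {"type": "maxpool", "window": 3, "stride": 2},
    {"type": "conv",    "maps": 32, "kernel": 5, "stride": 1},
    {"type": "maxpool", "window": 3, "stride": 2},
    {"type": "conv",    "maps": 64, "kernel": 5, "stride": 1},
    {"type": "maxpool", "window": 3, "stride": 2},
    # the drop method here is one of: sparse_dropconnect / dropconnect / dropout / none
    {"type": "fc",      "units": 64, "drop_method": "sparse_dropconnect"},
    {"type": "softmax", "classes": 10},
]

for layer in cifar10_model:
    print(layer)
```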
V. Conclusion and Further Work

We have presented the Sparse DropConnect regularization method, a modified version of DropConnect constrained by a sparse feature selection strategy. By adaptively regulating the probability with which each neuron's connections are kept or dropped, Sparse DropConnect adds a sparsity property to conventional DropConnect models while maintaining their key advantage of preventing overfitting. A series of experiments was conducted on the MNIST and CIFAR-10 datasets using DNN and CNN models. Trained with Sparse DropConnect, the outputs of the fully connected layers in both the DNNs and the CNNs exhibit considerably sparser distributions than their counterparts trained without it, and the classification accuracy is also improved slightly. These experiments demonstrate the preferable performance of the Sparse DropConnect method with respect to both feature sparsity and discriminative ability.

Although Sparse DropConnect performs better than the other methods, the error bound of this method has not been calculated theoretically.

Meanwhile, our implementation of Sparse DropConnect is slightly slower than Dropout and DropConnect. In very deep models used for large datasets, the speed of the feature extractor is an important parameter, since it may cause a significant difference in overall training time.

References

[1] G.E. Hinton, N. Srivastava, A. Krizhevsky, et al., "Improving neural networks by preventing co-adaptation of feature detectors", arXiv preprint.
[2] L. Wan, et al., "Regularization of neural networks using DropConnect", Proceedings of the 30th International Conference on Machine Learning (ICML-13).
[3] Y. Bengio, "Learning deep architectures for AI", Foundations and Trends in Machine Learning, Vol.2, No.1, pp.1-127, 2009.
[4] Y. Bengio, et al., "Representation learning: A review and new perspectives", IEEE Transactions on Pattern Analysis and Machine Intelligence, Vol.35, No.8.
[5] A.S. Weigend, D.E. Rumelhart and B.A. Huberman, "Generalization by weight-elimination with application to forecasting", Neural Information Processing Systems (NIPS).
[6] D.J.C. MacKay, "Probable networks and plausible predictions: A review of practical Bayesian methods for supervised neural networks", Network: Computation in Neural Systems, Vol.6, No.3.
[7] P. Vincent, et al., "Extracting and composing robust features with denoising autoencoders", Proceedings of the 25th International Conference on Machine Learning, ACM.
[8] P. Vincent, et al., "Stacked denoising autoencoders: Learning useful representations in a deep network with a local denoising criterion", Proceedings of the 27th International Conference on Machine Learning, ACM.
[9] Y. LeCun, L. Bottou, Y. Bengio and P. Haffner, "Gradient-based learning applied to document recognition", Proceedings of the IEEE, Vol.86, No.11.
[10] G.E. Hinton, S. Osindero and Y.-W. Teh, "A fast learning algorithm for deep belief nets", Neural Computation, Vol.18, No.7.
[11] D.J. Field, "What is the goal of sensory coding?", Neural Computation, Vol.6.
[12] P. Zhao, G. Rocha and B. Yu, "The composite absolute penalties family for grouped and hierarchical variable selection", Annals of Statistics, Vol.37, No.6A.
[13] P.O. Hoyer, "Non-negative matrix factorization with sparseness constraints", The Journal of Machine Learning Research, Vol.5.
[14] A. Krizhevsky, "Learning multiple layers of features from tiny images", Master's Thesis, University of Toronto.
[15] A. Krizhevsky, "cuda-convnet", available at om/p/cuda-convnet/.
[16] A. Torralba, R. Fergus and W.T. Freeman, "80 million tiny images: A large data set for nonparametric object and scene recognition", IEEE Transactions on Pattern Analysis and Machine Intelligence, Vol.30, No.11.

LIAN Zifeng is pursuing the Ph.D. degree in the School of Information and Communication Engineering, Beijing University of Posts and Telecommunications. His research interests include pattern recognition, machine learning and deep learning. (Email: lianzf@bupt.edu.cn)

JING Xiaojun received the M.S. and Ph.D. degrees in 1995 and 1999 respectively, both in communications and information systems. From 2000 to 2002 he was a postdoctoral researcher at Beijing University of Posts and Telecommunications, where he is now a professor. (Email: jxiaojun@bupt.edu.cn)


More information

Deep Learning. Practical introduction with Keras JORDI TORRES 27/05/2018. Chapter 3 JORDI TORRES

Deep Learning. Practical introduction with Keras JORDI TORRES 27/05/2018. Chapter 3 JORDI TORRES Deep Learning Practical introduction with Keras Chapter 3 27/05/2018 Neuron A neural network is formed by neurons connected to each other; in turn, each connection of one neural network is associated

More information

Research on Pruning Convolutional Neural Network, Autoencoder and Capsule Network

Research on Pruning Convolutional Neural Network, Autoencoder and Capsule Network Research on Pruning Convolutional Neural Network, Autoencoder and Capsule Network Tianyu Wang Australia National University, Colledge of Engineering and Computer Science u@anu.edu.au Abstract. Some tasks,

More information

Real-time convolutional networks for sonar image classification in low-power embedded systems

Real-time convolutional networks for sonar image classification in low-power embedded systems Real-time convolutional networks for sonar image classification in low-power embedded systems Matias Valdenegro-Toro Ocean Systems Laboratory - School of Engineering & Physical Sciences Heriot-Watt University,

More information

JOINT INTENT DETECTION AND SLOT FILLING USING CONVOLUTIONAL NEURAL NETWORKS. Puyang Xu, Ruhi Sarikaya. Microsoft Corporation

JOINT INTENT DETECTION AND SLOT FILLING USING CONVOLUTIONAL NEURAL NETWORKS. Puyang Xu, Ruhi Sarikaya. Microsoft Corporation JOINT INTENT DETECTION AND SLOT FILLING USING CONVOLUTIONAL NEURAL NETWORKS Puyang Xu, Ruhi Sarikaya Microsoft Corporation ABSTRACT We describe a joint model for intent detection and slot filling based

More information

An Efficient Learning Scheme for Extreme Learning Machine and Its Application

An Efficient Learning Scheme for Extreme Learning Machine and Its Application An Efficient Learning Scheme for Extreme Learning Machine and Its Application Kheon-Hee Lee, Miso Jang, Keun Park, Dong-Chul Park, Yong-Mu Jeong and Soo-Young Min Abstract An efficient learning scheme

More information

Inception and Residual Networks. Hantao Zhang. Deep Learning with Python.

Inception and Residual Networks. Hantao Zhang. Deep Learning with Python. Inception and Residual Networks Hantao Zhang Deep Learning with Python https://en.wikipedia.org/wiki/residual_neural_network Deep Neural Network Progress from Large Scale Visual Recognition Challenge (ILSVRC)

More information

Final Report: Classification of Plankton Classes By Tae Ho Kim and Saaid Haseeb Arshad

Final Report: Classification of Plankton Classes By Tae Ho Kim and Saaid Haseeb Arshad Final Report: Classification of Plankton Classes By Tae Ho Kim and Saaid Haseeb Arshad Table of Contents 1. Project Overview a. Problem Statement b. Data c. Overview of the Two Stages of Implementation

More information

Restricted Boltzmann Machines. Shallow vs. deep networks. Stacked RBMs. Boltzmann Machine learning: Unsupervised version

Restricted Boltzmann Machines. Shallow vs. deep networks. Stacked RBMs. Boltzmann Machine learning: Unsupervised version Shallow vs. deep networks Restricted Boltzmann Machines Shallow: one hidden layer Features can be learned more-or-less independently Arbitrary function approximator (with enough hidden units) Deep: two

More information

Machine Learning With Python. Bin Chen Nov. 7, 2017 Research Computing Center

Machine Learning With Python. Bin Chen Nov. 7, 2017 Research Computing Center Machine Learning With Python Bin Chen Nov. 7, 2017 Research Computing Center Outline Introduction to Machine Learning (ML) Introduction to Neural Network (NN) Introduction to Deep Learning NN Introduction

More information

Residual Networks And Attention Models. cs273b Recitation 11/11/2016. Anna Shcherbina

Residual Networks And Attention Models. cs273b Recitation 11/11/2016. Anna Shcherbina Residual Networks And Attention Models cs273b Recitation 11/11/2016 Anna Shcherbina Introduction to ResNets Introduced in 2015 by Microsoft Research Deep Residual Learning for Image Recognition (He, Zhang,

More information

ETALON IMAGES: UNDERSTANDING THE CONVOLUTION NEURAL NETWORKS

ETALON IMAGES: UNDERSTANDING THE CONVOLUTION NEURAL NETWORKS ETALON IMAGES: UNDERSTANDING THE CONVOLUTION NEURAL WORKS Vladimir V. Molchanov 1, Boris V. Vishnyakov 1, Vladimir S. Gorbatsevich 1, Yury V. Vizilter 1 1 FGUP «State Research Institute of Aviation Systems»,

More information

Inception Network Overview. David White CS793

Inception Network Overview. David White CS793 Inception Network Overview David White CS793 So, Leonardo DiCaprio dreams about dreaming... https://m.media-amazon.com/images/m/mv5bmjaxmzy3njcxnf5bml5banbnxkftztcwnti5otm0mw@@._v1_sy1000_cr0,0,675,1 000_AL_.jpg

More information

Intro to Deep Learning. Slides Credit: Andrej Karapathy, Derek Hoiem, Marc Aurelio, Yann LeCunn

Intro to Deep Learning. Slides Credit: Andrej Karapathy, Derek Hoiem, Marc Aurelio, Yann LeCunn Intro to Deep Learning Slides Credit: Andrej Karapathy, Derek Hoiem, Marc Aurelio, Yann LeCunn Why this class? Deep Features Have been able to harness the big data in the most efficient and effective

More information

Deep Learning Benchmarks Mumtaz Vauhkonen, Quaizar Vohra, Saurabh Madaan Collaboration with Adam Coates, Stanford Unviersity

Deep Learning Benchmarks Mumtaz Vauhkonen, Quaizar Vohra, Saurabh Madaan Collaboration with Adam Coates, Stanford Unviersity Deep Learning Benchmarks Mumtaz Vauhkonen, Quaizar Vohra, Saurabh Madaan Collaboration with Adam Coates, Stanford Unviersity Abstract: This project aims at creating a benchmark for Deep Learning (DL) algorithms

More information

4.12 Generalization. In back-propagation learning, as many training examples as possible are typically used.

4.12 Generalization. In back-propagation learning, as many training examples as possible are typically used. 1 4.12 Generalization In back-propagation learning, as many training examples as possible are typically used. It is hoped that the network so designed generalizes well. A network generalizes well when

More information