Tuning the Layers of Neural Networks for Robust Generalization


208 Int'l Conf. Data Science ICDATA'18

C. P. Chiu and K. Y. Michael Wong
Department of Physics, Hong Kong University of Science and Technology, Clear Water Bay, Hong Kong, China
cpchiu@connect.ust.hk, phkywong@ust.hk

Abstract - Neural networks are known to generalize to test data. This generalization ability depends on fine tuning of the network architecture, which has mainly depended on design experience. In this work, we explore a simple way to identify the network layer responsible for the lack of performance robustness to translationally displaced input patterns, and hence provide evidence that the translational robustness of the network can be improved by modifying that particular layer, for small datasets. The method achieves a significant improvement in the weighted average error on the MNIST and Fashion MNIST datasets, with modification hints provided by the random epochs training process. It also provides a way to understand the development of the weight space of neural networks.

Keywords - Generalization robustness, weak layer identification, architecture search, data augmentation, random epochs training.

I. INTRODUCTION

Neural networks are known to generalize well across different tasks [1]-[4]. However, their generalization ability is often restricted by various factors, as illustrated by the occurrence of adversarial examples [5], [6]. A recent work has shown that neural networks may perform poorly even when dealing with input patterns that are simply rotationally or translationally displaced [7]. Different visualization techniques have been developed to understand the internal features of deep neural networks, in order to improve network architectures and to better understand the fundamental nature of their operation [8], [9]. These visualization techniques depend on the input image and are normally demanding in terms of resources. Model architecture plays an important role in generalization.
Different initialization [4], [10], regularization [4], [11], [12], and optimization [13] techniques have been proposed to enhance the generalization robustness of networks, but model architecture design has generally relied on experience. Automatic architecture search has therefore drawn attention, owing to the hand-crafted nature of current neural network research. Common automatic architecture search methods are based on the evolution of topologies [14] or on reinforcement learning [15]. Despite their promising performance [16], these methods are demanding in terms of resources and are not able to reveal the fundamental nature of deep neural networks.

In this work, we propose a method to identify the weak layer in terms of translational robustness using simple data augmentation, so that the demand for resources is low. We then demonstrate that the robustness can be improved by modifying the weak layer. This approach also provides a better understanding of how the weights of the network develop during the training process.

As an example of identifying the weak layer in a neural network, we consider the robustness of networks processing visual images when the input images are translated. For benchmarking, we select as a reference a network trained with augmented data, in which the input images are randomly translated in each epoch. Its performance is compared with that of networks trained on a less stochastic sequence of translated images, which normally results in weaker performance. The correlation of each layer of these networks with the reference network is monitored, and the layer with the weakest correlation is identified as the weakest layer for further tuning. Testing on the MNIST and Fashion MNIST datasets, we found this method to be effective.

The rest of the paper is organized as follows. In Section 2, we first describe the architecture of the neural network. Then, we propose two data augmentation methods that impose different extents of perturbation on the network.
Third, we use the zero normalized cross correlation and the weighted average error to measure generalization robustness. In Section 3, we discuss the experimental results of our proposed method for identifying the weak layer. We demonstrate that the method is effective on standard machine learning datasets such as MNIST and Fashion MNIST. In Section 4, we conclude and discuss possible applications and limitations of the method.

II. EXPERIMENTAL SETUP

A. Model Architecture

The network is based on the LeNet used to process the MNIST dataset [17]. The convolutional kernels are 5 × 5 pixels in both convolutional layers; the first convolutional layer has 20 kernels and the second has 50. The third and fourth layers are fully connected layers of sizes 800 and 500, respectively. The network is pre-trained on the MNIST dataset without data augmentation. Even when the network is further trained for another 200 epochs with weight decay, we observed no further improvement without data augmentation. The activation function of all layers defaults to tanh, and the output class is the arg max of the final softmax layer.
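For concreteness, the 800-unit input of the first fully connected layer can be recovered by tracing the feature-map sizes through the network. The sketch below assumes valid (unpadded) 5 × 5 convolutions, each followed by 2 × 2 max pooling, as in the original LeNet; the pooling details are our assumption, not stated above.

```python
# Trace feature-map sizes through the LeNet-like network described above.
# Assumes valid (no-padding) 5x5 convolutions, each followed by
# non-overlapping 2x2 max pooling, as in the original LeNet.

def conv_out(size, kernel):
    """Spatial size after a valid (unpadded) convolution."""
    return size - kernel + 1

def pool_out(size, stride=2):
    """Spatial size after non-overlapping 2x2 pooling."""
    return size // stride

size = 28                            # MNIST input is 28x28
size = pool_out(conv_out(size, 5))   # conv1 (20 kernels): 24 -> pool: 12
size = pool_out(conv_out(size, 5))   # conv2 (50 kernels): 8  -> pool: 4

flattened = 50 * size * size         # 50 channels of 4x4 feature maps
print(flattened)                     # 800: the first fully connected layer size
```

The 800 and 500 figures in the text are consistent with this layout: a flattened 50 × 4 × 4 tensor feeds the first fully connected layer.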

Fig. 1: Pre-trained model, weighted average error = 0.661.

The learning rate is 0.05, and the weight decay is 0.001. The network is initialized with a given seed for the pseudo-random number generator and undergoes mini-batch learning. The pre-trained model achieves a 0.91% error on the test set.

B. Training and Data Augmentation Methods

We deployed two data augmentation methods in this work.

a) Random Epochs Training (RET): For every epoch, the validation and training datasets are displaced horizontally and vertically by random amounts in the range [-10, 10], and the network is trained for 200 epochs in total.

b) Sequential Training: The network is trained and validated on the dataset displaced leftwards by 2 pixels for several epochs, and then on the dataset displaced further leftwards by another 2 pixels. This is repeated in turn for displacements in the rightward, upward, and downward directions. 10000 images of the MNIST training dataset are separated out and used as the validation set.

C. Zero Normalized Cross Correlation of the Weight Space

Direct visualization of the weight space is nearly impossible owing to the large number of parameters. Therefore, we use the zero normalized cross correlation (ZNCC) to check the similarity of different models, in the fully connected layers as well as the convolutional layers. Consider the comparison of two network models A and B. For the convolutional layers,

C_{ij}^{AB} = \frac{\langle W_{ij}^A W_{ij}^B \rangle_n - \langle W_{ij}^A \rangle_n \langle W_{ij}^B \rangle_n}{\sigma_{ij}^A \sigma_{ij}^B},   (1)

where W_{ij}^A and W_{ij}^B are the weights of the j-th kernel in the i-th layer of models A and B respectively, \langle \cdot \rangle_n denotes an average over the n weights of the kernel, and \sigma_{ij}^A and \sigma_{ij}^B are the standard deviations of W_{ij}^A and W_{ij}^B. This is further reduced to the root mean square of the ZNCCs of the kernels in a single layer i,

C_i^{AB} = \sqrt{\langle (C_{ij}^{AB})^2 \rangle_j}.   (2)

(a) The sequentially trained model at an intermediate stage when x = 0, y = 4; weighted average error = 0.666.
(b) The final state of the sequentially trained model, weighted average error = 0.624.

(c) The RET model, mean weighted average error over 10 trials = 0.358 ± 0.035.

Fig. 2: MNIST error maps of models trained with the different data augmentation methods (with the pre-training stage). x-axis: horizontal displacement of input test images; y-axis: vertical displacement of input test images; color bar: test set classification error.
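The two augmentation schemes can be sketched as below, assuming zero-filled translation of the whole image set; the function names (`translate`, `ret_epoch`, `sequential_shifts`) are ours, and the exact sequential schedule is our reading of the description above.

```python
import numpy as np

def translate(images, dx, dy):
    """Displace a batch of images (N, H, W) by (dx, dy) pixels,
    filling the vacated region with zeros (blank background)."""
    out = np.zeros_like(images)
    n, h, w = images.shape
    xs = slice(max(dx, 0), min(w + dx, w))   # target columns
    xo = slice(max(-dx, 0), min(w - dx, w))  # source columns
    ys = slice(max(dy, 0), min(h + dy, h))   # target rows
    yo = slice(max(-dy, 0), min(h - dy, h))  # source rows
    out[:, ys, xs] = images[:, yo, xo]
    return out

rng = np.random.default_rng(0)

def ret_epoch(train_images, max_shift=10):
    """Random Epochs Training: draw one random displacement per epoch
    and apply it to the whole training (and validation) set."""
    dx = int(rng.integers(-max_shift, max_shift + 1))
    dy = int(rng.integers(-max_shift, max_shift + 1))
    return translate(train_images, dx, dy), (dx, dy)

def sequential_shifts(step=2, reach=10):
    """Sequential training schedule: displace leftwards in 2-pixel steps,
    then rightwards, upwards, and downwards in turn."""
    dirs = [(-1, 0), (1, 0), (0, -1), (0, 1)]  # left, right, up, down
    return [(sx * d, sy * d) for sx, sy in dirs
            for d in range(step, reach + 1, step)]
```

In RET the displacement is resampled once per epoch, so within an epoch every image sees the same shift; the sequential schedule instead walks the dataset outwards in one direction at a time.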

For the fully connected layers, the ZNCC is similar to that of the convolutional layers and is given by

C_i^{AB} = \frac{\langle W_i^A W_i^B \rangle_n - \langle W_i^A \rangle_n \langle W_i^B \rangle_n}{\sigma_i^A \sigma_i^B}.   (3)

D. Error Maps and Weighted Average Errors

The trained networks are tested on the test set for each horizontal and vertical displacement. The classification errors on the displaced test sets are plotted in the two-dimensional space of pixel displacements as a map, to check the generalization ability. The generalization performance is further summarized by the weighted average error of the map, computed as

\text{weighted average error} = \frac{\sum_i \exp\!\left(-\frac{x_i^2 + y_i^2}{2\sigma^2}\right) E_i}{\sum_i \exp\!\left(-\frac{x_i^2 + y_i^2}{2\sigma^2}\right)},   (4)

where E_i is the classification error of the model when the pixel displacement of the test set is (x_i, y_i), and \sigma is set to 6, since the loss of information increases rapidly for translational displacements of the dataset beyond ±6. As seen in fig. 1, the generalization error rapidly increases to 0.9 accordingly.

III. EXPERIMENTAL RESULTS AND DISCUSSION

A. Error Maps of the Trained Models

Ideally, the error map should have a basin shape, since information is lost gradually as the displacement increases, until saturation. The classification error of the pre-trained model increases rapidly for small displacements of around ±6 pixels, as shown in fig. 1, with a weighted average error of 0.66. There are patches of unnaturally high error regions beyond the center in the pre-trained model. Beyond these patches, the network input is virtually blank for large pixel displacements, and the network classifies most of the largely displaced patterns as class 1, since the handwritten digits of class 1 in the MNIST dataset have the largest number of blank pixels among the 10 classes. After sequential training and random epochs training, the high error patches vanish, as seen in figs. 2b and 2c.
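The two robustness measures, eqs. (1)-(4), and the weak-layer selection they support can be sketched with NumPy as follows. The function names are ours, and treating a fully connected weight matrix as a single "kernel" is our simplification of eq. (3).

```python
import numpy as np

def zncc(wa, wb):
    """Zero normalized cross correlation of two weight arrays (eqs. 1 and 3):
    covariance of the flattened weights over both standard deviations."""
    a, b = wa.ravel(), wb.ravel()
    return ((a * b).mean() - a.mean() * b.mean()) / (a.std() * b.std())

def layer_zncc(kernels_a, kernels_b):
    """Per-layer score (eq. 2): root mean square of the per-kernel ZNCCs.
    For a fully connected layer, pass the whole weight matrix as one kernel."""
    cs = np.array([zncc(ka, kb) for ka, kb in zip(kernels_a, kernels_b)])
    return np.sqrt((cs ** 2).mean())

def weighted_average_error(errors, displacements, sigma=6.0):
    """Weighted average error (eq. 4): Gaussian-weighted mean of the
    classification error over the displacement grid, with sigma = 6 pixels."""
    d = np.asarray(displacements, dtype=float)
    w = np.exp(-(d[:, 0] ** 2 + d[:, 1] ** 2) / (2 * sigma ** 2))
    return (w * np.asarray(errors)).sum() / w.sum()

def weakest_layer(model_a, model_b):
    """Identify the weak layer: the one whose weights correlate least with
    the benchmark (RET) model. Each model is a list of per-layer kernel lists."""
    scores = [layer_zncc(la, lb) for la, lb in zip(model_a, model_b)]
    return int(np.argmin(scores)), scores
```

A constant error map gives a weighted average error equal to that constant, and identical models give per-layer scores of 1, which is a convenient sanity check.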
Data augmentation is crucial for robust generalization; however, it does not guarantee good generalization: the mean weighted average error is 0.36, as seen in fig. 2c, and the error map has an irregular contour. Sequential training only improves the generalization robustness slightly, with the weighted average error dropping from 0.66 for the pre-trained model to 0.62. When the pixel displacement of the dataset is small in sequential training, the basin of the classification error is shifted to the corresponding displacement, as shown in fig. 2a. Therefore, the corresponding change in the weight space should be responsible for the displacement of the center. In the RET model, the basin is broadened, as seen in fig. 2c, with a mean weighted average error of 0.36. Therefore, in the analyses below, the RET model is regarded as a benchmark for robust generalization. By comparing the sequentially trained and RET models layer by layer, the layer with the largest deviation can be found and identified as the layer responsible for the lack of translational robustness.

(a) Pre-trained model. (b) RET model.

Fig. 3: Normalized cross correlation (C_i^{AB}) between the sequentially trained model and (a) the pre-trained model and (b) the RET model.

B. Comparison of the Trained Models

As described previously, sequential training shifts the basin in the direction corresponding to the displacement of the dataset, as illustrated in fig. 2a. Fig. 3a shows the deviation of the network weights in the weight space from the pre-trained model through the training sequence. As the RET model achieves high generalization power with a significant drop in the weighted average error, we treat it as a benchmark model of generalization and check how the sequentially trained model deviates from it. As shown

(a) The RET model without the pre-training stage; the activation function is tanh. Mean weighted average error over 10 trials = 0.324 ± 0.041.

(b) The RET model without the pre-training stage; the activation function is ReLU. Mean weighted average error over 10 trials = 0.259 ± 0.026.

Fig. 4: Error maps of the RET model (without the pre-training stage) with different activation functions in layer 2: (a) tanh, (b) ReLU.

(a) The pre-trained model. Weighted average error = 0.702.

(b) The RET model. Mean weighted average error over 10 trials = 0.407 ± 0.046.

Fig. 5: Fashion MNIST error maps of the pre-trained and RET (after pre-training) models.

As shown in fig. 3b, the normalized cross correlation of the sequential model changes relatively little for layers 0, 1, and 3. This means that these layers already capture some translationally invariant features. However, only the first fully connected layer (layer 2) is subject to a large change compared with the other layers. It is therefore likely that improving the first fully connected layer is crucial for translationally invariant generalization for this model and dataset.

C. Robustness Improvement with Minimal Change

We have seen in fig. 1 that an irregular landscape with patches of high error regions is present in the pre-trained model. However, as shown in figs. 4a and 2c, the error map still has an irregular contour even when the network is trained without pre-training, and there is only a drop of around 0.04 from 0.36 in the mean weighted average error. Therefore, we suspect that the likely cause is the vanishing backpropagation gradient of the tanh activation function when the magnitude of its argument is too large. From the previous section, we know that layer 2 is possibly responsible for the lack of translational invariance. Hence, we focus only on the modification of layer 2. As shown in fig. 4b, when the activation functions in layer 2 are changed from tanh to ReLU, the basin of low error in the error map broadens to roughly [-10, 10] horizontally and vertically. The weighted average error is greatly reduced from 0.66 to 0.26 for the RET network.

D. Further Test on Fashion MNIST

Fashion MNIST, which is similar to MNIST, is a dataset of 28 × 28 clothing images [18]. It is a more

challenging dataset: for example, the network is required to learn to differentiate T-shirts, shirts, coats, shoes, etc. from grayscale information. Similar to the MNIST model, the network performance decays rapidly after a displacement of a few pixels, with a weighted average error of 0.7, as shown in fig. 5a. The network achieves better generalization with the RET model, with a 0.3 drop in the mean weighted average error. However, the basin is slightly shifted, as shown in fig. 5b. This is likely the result of the RET training strategy, which may be biased towards certain displacements that give a lower classification error. If the data augmentation were done randomly for individual images instead of individual epochs, this bias would likely be reduced. However, it should also be noted that the network may be biased owing to the network architecture. For the case without pre-training, the RET model performs worse than the case with pre-training, as the weighted average error increases from 0.41 in fig. 5b to 0.46 in fig. 7a.

(a) The RET model without the pre-training stage. Mean weighted average error over 10 trials = 0.456 ± 0.031.

(b) The RET model without the pre-training stage, layer 2 activation: ReLU. Mean weighted average error over 10 trials = 0.370 ± 0.030.

(c) The RET model without the pre-training stage, layer 1 and 2 activations: ReLU. Mean weighted average error over 10 trials = 0.380 ± 0.044.

(d) The RET model without the pre-training stage, layer 1 activation: ReLU. Mean weighted average error over 10 trials = 0.396 ± 0.028.

Fig. 7: Fashion MNIST error maps of the RET models (without the pre-training stage) with different activation functions.

Fig. 6: Fashion MNIST normalized cross correlation between the sequentially trained models and the pre-trained (and RET) models.
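The per-image alternative suggested above (one random displacement per image rather than per epoch) can be sketched as follows; this scheme is proposed but not tested in the paper, the zero-filled translation is the same assumption as before, and the helper names are ours.

```python
import numpy as np

rng = np.random.default_rng(0)

def shift_image(img, dx, dy):
    """Displace a single image (H, W) by (dx, dy), zero-filling the border."""
    out = np.zeros_like(img)
    h, w = img.shape
    ys, yo = ((slice(dy, h), slice(0, h - dy)) if dy >= 0
              else (slice(0, h + dy), slice(-dy, h)))
    xs, xo = ((slice(dx, w), slice(0, w - dx)) if dx >= 0
              else (slice(0, w + dx), slice(-dx, w)))
    out[ys, xs] = img[yo, xo]
    return out

def per_image_augment(batch, max_shift=10):
    """Draw an independent random displacement for every image in the batch,
    instead of one displacement for the whole epoch as in RET."""
    shifts = rng.integers(-max_shift, max_shift + 1, size=(len(batch), 2))
    return np.stack([shift_image(img, int(dx), int(dy))
                     for img, (dx, dy) in zip(batch, shifts)])
```

Because each image is displaced independently, no single displacement can dominate an epoch, which is why this variant is expected to reduce the basin-shift bias discussed above.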
Note that the basins for the Fashion MNIST dataset are broader than those for MNIST, because objects in the Fashion MNIST dataset cover broader regions of the image field. Therefore, the large displacements in RET for Fashion MNIST result in a larger mean weighted average error than for the RET model with the pre-training stage. From fig. 6, all layers of weights gradually deviate from the pre-trained model, and layer 2 is found to have a relatively larger change in ZNCC compared with the RET model. Therefore, it is likely that layer 2 is responsible for the lack of translational robustness. Again, by modifying the activation function of layer 2, fig. 7b shows that the network becomes less biased and achieves a lower mean weighted average error of 0.37. Unlike for the MNIST dataset, layers 0 and 3 have higher ZNCCs than for the MNIST data, as shown in fig. 6b. This is because objects in the Fashion MNIST dataset have very different features, hence layers 0 and 3 already capture the more robust, translationally invariant features of clothing. Although there is a slight drop in the correlation for layer 1, it should be noted that this may result from the influence of

the dramatic change of layer 2. Note that although changing the activation function gives us a way to explore performance improvements, success is not guaranteed. As seen in figs. 7c and 7d, if layer 1 is changed alone or together with layer 2, the error map becomes irregular and biased. The mean weighted average errors are larger for the RET models with modifications to layer 1 than for the one with a modification to layer 2 only.

IV. CONCLUSION

In this work, we demonstrated that the generalization power of a network can be improved by random epochs training. RET can further be utilized to identify the layer responsible for the lack of translational invariance, and hence to further improve the robustness of the network. Although we only demonstrated this idea on translated patterns from the MNIST and Fashion MNIST datasets, we believe that it is also applicable to, for example, image rotations in other types of small datasets. As the weight deviation changes with changes of the dataset, it can help to identify the layer responsible for the lack of robustness against a given distortion. However, for a large dataset like ImageNet, the dataset itself may be rich enough to have all kinds of distortions embedded within it. Thus, this method should be further verified and tested on large-scale datasets.

ACKNOWLEDGMENT

This work is supported by the Research Grants Council of Hong Kong (grant nos. 16322616 and 16306817).

REFERENCES

[1] A. Krizhevsky, I. Sutskever, and G. E. Hinton, ImageNet Classification with Deep Convolutional Neural Networks, Adv. NIPS 25, pp. 1097-1105, 2012.
[2] A. Graves, A. Mohamed, and G. Hinton, Speech recognition with deep recurrent neural networks, IEEE ICASSP, 2013.
[3] D. Silver, J. Schrittwieser, K. Simonyan, I. Antonoglou, A. Huang, A. Guez, T. Hubert, L. Baker, M. Lai, A. Bolton, Y. Chen, T. Lillicrap, F. Hui, L. Sifre, G. van den Driessche, T. Graepel, and D. Hassabis, Mastering the game of Go without human knowledge, Nature, 550, pp. 354-359, 2017.
[4] K. He, X. Zhang, S. Ren, and J. Sun, Delving deep into rectifiers: Surpassing human-level performance on ImageNet classification, Proc. IEEE ICCV, pp. 1026-1034, 2015.
[5] X. Yuan, P. He, Q. Zhu, R. R. Bhat, and X. Li, Adversarial Examples: Attacks and Defenses for Deep Learning, arXiv preprint arXiv:1712.07107, 2017.
[6] A. A. Alemi, I. Fischer, J. V. Dillon, and K. Murphy, Deep Variational Information Bottleneck, ICLR, 2017.
[7] L. Engstrom, D. Tsipras, L. Schmidt, and A. Madry, A Rotation and a Translation Suffice: Fooling CNNs with Simple Transformations, NIPS Workshop on Machine Learning and Computer Security, 2017.
[8] A. Mahendran and A. Vedaldi, Understanding Deep Image Representations by Inverting Them, IEEE CVPR, pp. 5188-5196, 2015.
[9] M. D. Zeiler and R. Fergus, Visualizing and Understanding Convolutional Networks, ECCV, Springer, pp. 818-833, 2014.
[10] X. Glorot and Y. Bengio, Understanding the difficulty of training deep feedforward neural networks, Proc. 13th Int. Conf. Artificial Intelligence and Statistics, pp. 249-256, 2010.
[11] A. Krogh and J. A. Hertz, A simple weight decay can improve generalization, Adv. NIPS, pp. 950-957, 1992.
[12] S. Ioffe and C. Szegedy, Batch Normalization: Accelerating Deep Network Training by Reducing Internal Covariate Shift, Proc. 32nd ICML, vol. 37, pp. 448-456, 2015.
[13] D. P. Kingma and J. Ba, Adam: A Method for Stochastic Optimization, ICLR, 2015.
[14] K. O. Stanley and R. Miikkulainen, Evolving neural networks through augmenting topologies, Evolutionary Computation, vol. 10, pp. 99-127, 2002.
[15] B. Zoph, V. Vasudevan, J. Shlens, and Q. V. Le, Learning Transferable Architectures for Scalable Image Recognition, arXiv preprint arXiv:1707.07012, 2017.
[16] E. Real, S. Moore, A. Selle, S. Saxena, Y. L. Suematsu, J. Tan, Q. V. Le, and A. Kurakin, Large-Scale Evolution of Image Classifiers, Proc. 34th ICML, vol. 70, pp. 2902-2911, 2017.
[17] Y. LeCun, L. Bottou, Y. Bengio, and P. Haffner, Gradient-Based Learning Applied to Document Recognition, Proc. IEEE, vol. 86, pp. 2278-2324, 1998.
[18] H. Xiao, K. Rasul, and R. Vollgraf, Fashion-MNIST: a Novel Image Dataset for Benchmarking Machine Learning Algorithms, arXiv:1708.07747, 2017.