Slide 1: Deep Neural Network Hyperparameter Optimization with Genetic Algorithms
EvoDevo: A Genetic Algorithm Framework
Aaron Vose, Jacob Balma, Geert Wenes, and Rangan Sukumar
Cray Inc., October 2017
Presenter: A. Vose
Slide 2: EvoDevo: Motivation and Description
- Improve time-to-accuracy as well as final accuracy in DNN training; take the trial and error out of DNN training.
- Genetic/Evolutionary Algorithm (GA/EA) framework: a fitness function plus crossover and migration mechanisms evolve local, somewhat isolated pools (demes) of hyperparameters over multiple generations per training epoch.
- Optimizes neural network hyperparameters and topology (see the genome sketch below):
  - Number of filters and kernel size of convolutional layers.
  - Size of fully-connected layers.
  - Dropout rate and momentum used during training.
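As a concrete illustration, the sketch below shows one way such a hyperparameter "genome" might be represented in Python. The field names and value ranges are illustrative assumptions (the momentum range matches Slide 8), not EvoDevo's actual schema.

    # Hypothetical genome for one individual; ranges are illustrative assumptions.
    import random

    def random_individual():
        return {
            "c1_filters": random.randint(8, 64),     # filters in first conv layer
            "c1_kernel":  random.choice([3, 5, 7]),  # kernel size of first conv layer
            "c2_filters": random.randint(16, 128),   # filters in second conv layer
            "c2_kernel":  random.choice([3, 5, 7]),
            "fullconn":   random.choice([256, 512, 1024, 2048]),  # FC layer width
            "dropout":    random.uniform(0.2, 0.8),  # dropout rate during training
            "momentum":   10 ** random.uniform(-4, -3),  # log-uniform in [1e-4, 1e-3]
        }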
Slide 3: EvoDevo
- Built into existing CNNs.
- Supports multiple toolkits: Google's TensorFlow, Microsoft's Cognitive Toolkit (CNTK), Keras, and others.
- A C wrapper provides a generic interface; multi-node support via MPI (see the sketch below).
References:
- Inspired by previous work in theoretical biology at UTK [1].
- See also recent MENNDL work out of ORNL [6].
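EvoDevo's generic interface is the C wrapper itself; as a rough Python illustration of the MPI pattern it enables, the hypothetical mpi4py sketch below scatters one individual per rank, evaluates it locally, and gathers fitnesses back to rank 0.

    # Hypothetical illustration only: EvoDevo's real interface is a C wrapper over
    # MPI. This mpi4py sketch shows the general scatter/evaluate/gather pattern.
    from mpi4py import MPI

    comm = MPI.COMM_WORLD
    rank, size = comm.Get_rank(), comm.Get_size()

    if rank == 0:
        # One individual per rank; a real run would draw these from the GA.
        population = [{"c1_filters": 32 + 2 * i} for i in range(size)]
    else:
        population = None

    individual = comm.scatter(population, root=0)  # each rank gets one individual
    fitness = float(individual["c1_filters"])      # placeholder for a real training run
    results = comm.gather((individual, fitness), root=0)

    if rank == 0:
        best = max(results, key=lambda r: r[1])
        print("best individual:", best[0], "fitness:", best[1])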
Slide 4: Datasets
MNIST:
- Size: 70,000 images (60,000 train and 10,000 test).
- Resolution: 28×28 greyscale pixels.
- Classes: 10 classes, one for each digit 0-9.
Figure 1: Selected example images from MNIST [4].
Slide 5: Datasets
CIFAR-10:
- Size: 60,000 images (50,000 train and 10,000 test).
- Resolution: 32×32 color pixels.
- Classes: 10 classes (airplane, bird, cat, ...).
Figure 2: Selected example images from CIFAR-10 [3].
Slide 6: NN Architectures
LeNet-5 in TensorFlow:
- Model: 7-layer (5 hidden) LeNet [5].
- Toolkit: Google's TensorFlow.
- Language: Python code calling the TF API (see the sketch below).
Figure 3: LeNet-5 neural network architecture.
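Below is a minimal sketch of a LeNet-5-style model parameterized by the evolved hyperparameters, written against the Keras API for brevity (the runs reported here used the lower-level TF Python API). The default arguments match the baseline topology on Slide 8.

    # A minimal LeNet-5-style sketch, parameterized by the hyperparameters EvoDevo
    # evolves; defaults match the Slide 8 baseline (5x32, 5x64, 1024).
    import tensorflow as tf

    def build_lenet5(c1_kern=5, c1_filt=32, c2_kern=5, c2_filt=64,
                     fullconn=1024, dropout=0.5):
        return tf.keras.Sequential([
            tf.keras.layers.Conv2D(c1_filt, c1_kern, padding="same",
                                   activation="relu", input_shape=(28, 28, 1)),
            tf.keras.layers.MaxPooling2D(2),
            tf.keras.layers.Conv2D(c2_filt, c2_kern, padding="same",
                                   activation="relu"),
            tf.keras.layers.MaxPooling2D(2),
            tf.keras.layers.Flatten(),
            tf.keras.layers.Dense(fullconn, activation="relu"),
            tf.keras.layers.Dropout(dropout),
            tf.keras.layers.Dense(10, activation="softmax"),  # 10 MNIST classes
        ])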
Slide 7: NN Architectures
ResNet-110 in CNTK:
- Model: 110-layer ResNet [2].
- Toolkit: Microsoft's CNTK.
- Language: configuration script read by CNTK (C++).
Figure 4: ResNet neural network architecture (34 layers shown for clarity).
Slide 8: Results: Time to Accuracy
LeNet-5, MNIST, TensorFlow:
- Model: 7-layer (5 hidden) LeNet.
- Momentum: 1e-4 (baseline) → 1e-3 (optimized).
- Topology (c1kern×c1filt, c2kern×c2filt, fullconn): 5×32, 5×64, 1024 (baseline) → 5×18, 5×32, 512 (optimized).
- Gain: 70% reduction in training time to a validation accuracy of 99.1%.
Genetic Algorithm:
- Fitness function: 5 samples of training time to 99.1% accuracy (see the sketch below).
- Optimization time: 24 hours.
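A hedged sketch of this fitness measurement follows: each individual is scored by averaging several measured training times to the 99.1% target. Here train_to_accuracy is a hypothetical stand-in for a real training run.

    # Hypothetical sketch of the Slide 8 time-to-accuracy measurement.
    import time, random

    def train_to_accuracy(individual, target=0.991):
        """Hypothetical: train a model built from `individual` until it reaches
        `target` validation accuracy; return the elapsed wall-clock time."""
        start = time.time()
        # ... build the model from individual's hyperparameters and train here ...
        time.sleep(random.uniform(0.01, 0.02))  # placeholder for real training
        return time.time() - start

    def runtime(individual, samples=5):
        # Average 5 samples to smooth out run-to-run training noise.
        return sum(train_to_accuracy(individual) for _ in range(samples)) / samples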
Slide 9: Results: Time to Accuracy
Figure 5: Validation accuracy during training.
Slide 10: Results: Final Accuracy
ResNet-110, CIFAR-10, CNTK:
- Model: 110-layer ResNet.
- Topology (cstack1filt, cstack2filt, cstack3filt): 16, 32, 64 (initial) → 32, 15, 128 (optimized).
- Error: 6.35% (initial) → 5.91% (optimized), a 7% reduction in final top-1 classification error.
Genetic Algorithm:
- Fitness function: 3 samples of validation accuracy at 2 epochs (see the sketch below).
- Optimization time: 24 hours.
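The accuracy-based fitness can be sketched the same way; val_accuracy_after below is a hypothetical stand-in for a short two-epoch training run.

    # Hypothetical sketch of the Slide 10 early-epoch accuracy measurement.
    import random

    def val_accuracy_after(individual, epochs=2):
        """Hypothetical: train for `epochs` epochs, return validation accuracy."""
        # ... build and briefly train a model from individual's hyperparameters ...
        return random.uniform(0.90, 0.94)  # placeholder for a real measurement

    def accuracy_fitness(individual, samples=3):
        # Accuracy already lies in [0, 1], so the sample mean can serve directly
        # as the selection weight in the roulette wheel on Slide 13.
        return sum(val_accuracy_after(individual) for _ in range(samples)) / samples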
Slide 11: Results: Final Accuracy
Figure 6: The best individual's two-epoch validation accuracy improves over successive generations of EvoDevo's evolutionary algorithm.
Slide 12: Genetic Algorithm Life Cycle Details
Typical Parameters:
- PARAM_EPOCHS = 31 epochs
- PARAM_GENERATIONS = 5 generations per epoch
- PARAM_DEMES = 4 demes (local populations) in a 2×2 grid
- PARAM_POPULATION_SIZE = 4 demes × 25 to 85 individuals
Infrastructure:
- Results obtained on 16 Cray XC-50 nodes with NVIDIA P100 GPUs.
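These parameters might be collected in a run configuration like the sketch below; the dict form and key names are illustrative, and only the values come from this slide.

    # Illustrative configuration mirroring the PARAM_* values above.
    CONFIG = {
        "epochs": 31,                      # training epochs per optimization run
        "generations_per_epoch": 5,        # GA generations between migrations
        "deme_grid": (2, 2),               # 4 demes arranged in a 2x2 grid
        "individuals_per_deme": (25, 85),  # population size range per deme
    }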
Slide 13: Genetic Algorithm Life Cycle Details
Generations and Epochs:

    g ← 0
    P_g ← initial population
    while g < PARAM_GENERATIONS:
        ∀ p ∈ P_g: p.runtime ← execute(p)
        ∀ p ∈ P_g: p.fitness ← e^(−((p.runtime − min)/(max − min))²)
        while |P_{g+1}| < PARAM_POPULATION_SIZE:
            p_a ← p ∈ P_g with probability p.fitness / Σ_{q ∈ P_g} q.fitness
            p_b ← p ∈ P_g with probability p.fitness / Σ_{q ∈ P_g} q.fitness
            c ← mutate(crossover(p_a, p_b))
            P_{g+1} ← P_{g+1} ∪ {c}
        g ← g + 1
        if MOD(g, PARAM_GENERATIONS / PARAM_EPOCHS) == 0:
            migrate_best_population_member(north, south, east, west)
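A minimal runnable Python translation of this life cycle for a single deme (migration omitted) might look like the following; execute, crossover, and mutate are hypothetical stand-ins to be wired up to real training runs and genome operators.

    # Single-deme sketch of the life cycle above; migration is omitted.
    import math
    import random

    POPULATION_SIZE = 25
    GENERATIONS = 5

    def execute(individual):
        # Placeholder: return training time to target accuracy for this individual.
        return random.uniform(10.0, 100.0)

    def roulette_select(population, weights):
        # Fitness-proportional ("roulette wheel") selection, as in the pseudocode.
        return random.choices(population, weights=weights, k=1)[0]

    def evolve(population, crossover, mutate):
        for g in range(GENERATIONS):
            runtimes = [execute(p) for p in population]
            lo, hi = min(runtimes), max(runtimes)
            span = (hi - lo) or 1.0  # avoid division by zero when all runtimes tie
            # Gaussian-shaped fitness: the fastest individual gets fitness e^0 = 1.
            fitness = [math.exp(-(((t - lo) / span) ** 2)) for t in runtimes]
            next_gen = []
            while len(next_gen) < POPULATION_SIZE:
                p_a = roulette_select(population, fitness)
                p_b = roulette_select(population, fitness)
                next_gen.append(mutate(crossover(p_a, p_b)))
            population = next_gen
        return population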
Slide 14: Genetic Algorithm Life Cycle Details
Crossover:
Figure 7: Crossover combines the hyperparameters of two parents to create a new child.
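A hedged sketch of crossover plus mutation over a dict-based genome follows; uniform per-gene crossover and multiplicative mutation of integer-valued topology genes are assumptions about EvoDevo's exact operators, and the keys mirror the topology names on Slide 8.

    # Hypothetical crossover/mutation operators over integer topology genes.
    import random

    def crossover(parent_a, parent_b):
        # For each hyperparameter, inherit the value from one parent at random.
        return {k: random.choice((parent_a[k], parent_b[k])) for k in parent_a}

    def mutate(child, rate=0.1, scale=0.2):
        # Occasionally perturb a gene by up to +/-20% (relative), keeping it >= 1.
        return {k: (max(1, round(v * random.uniform(1 - scale, 1 + scale)))
                    if random.random() < rate else v)
                for k, v in child.items()}

    # Example: crossing the Slide 8 baseline with the optimized topology.
    a = {"c1_kern": 5, "c1_filt": 32, "c2_kern": 5, "c2_filt": 64, "fullconn": 1024}
    b = {"c1_kern": 5, "c1_filt": 18, "c2_kern": 5, "c2_filt": 32, "fullconn": 512}
    child = mutate(crossover(a, b))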
Slide 15: Genetic Algorithm Life Cycle Details
Migration:
Figure 8: Migration copies the best individuals to neighboring demes each epoch.
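A hedged sketch of this migration step on a 2×2 deme grid follows; wraparound (torus) neighbors and replace-the-worst insertion are assumptions about the exact scheme.

    # Hypothetical per-epoch migration on a grid of demes.
    def neighbors(x, y, w=2, h=2):
        # North, south, east, west with wraparound on the deme grid.
        return [(x, (y - 1) % h), (x, (y + 1) % h),
                ((x + 1) % w, y), ((x - 1) % w, y)]

    def migrate(demes, fitness_of):
        # demes[x][y] is a list of individuals; copy each deme's best individual
        # into all four neighboring demes, replacing each neighbor's worst.
        w, h = len(demes), len(demes[0])
        best = {(x, y): max(demes[x][y], key=fitness_of)
                for x in range(w) for y in range(h)}  # snapshot before copying
        for (x, y), champ in best.items():
            for nx, ny in neighbors(x, y, w, h):
                worst = min(demes[nx][ny], key=fitness_of)
                demes[nx][ny][demes[nx][ny].index(worst)] = champ

Taking the snapshot of champions before any copying keeps one migration step from propagating an individual across several demes at once.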
Slide 16: Conclusions
Evolution of DNN Topologies and Hyperparameters with EvoDevo:
- HPC-scalable solution for exploring DNN topologies and hyperparameters.
- Simultaneous evolution of hyperparameters and topology widens the search space and maximizes training speed or validation accuracy.
- Supports individuals with distributed training node-sets (via MPI), enabling large data-parallel training tasks; population size scales with machine resources.
Time-to-Accuracy:
- Shown to significantly improve training time for DNNs.
- Selects for individuals that reach the target accuracy fastest.
Final Accuracy:
- Shown to improve validation accuracy over a known best topology on CIFAR-10.
- Prunes the search space of topologies when a good starting topology is not known (applies to new datasets, similar to MENNDL).
Slide 17: Future Work
Expand Hyperparameter Evolution:
- Stride of convolutional and pooling layers.
- Number of convolutional and fully-connected layers.
- Activation function (e.g., logistic, tanh, ReLU).
- Random seed value for better initial weights.
Larger Runs:
- Larger datasets such as CIFAR-100 and ImageNet.
- Larger EvoDevo runs on more compute nodes.
New Applications:
- Unsupervised learning with Generative Adversarial Networks (GANs).
Slide 18: References
[1] S. Gavrilets and A. Vose. Dynamic patterns of adaptive radiation. Proceedings of the National Academy of Sciences of the United States of America, 102(50):18040-18045, 2005.
[2] K. He, X. Zhang, S. Ren, and J. Sun. Deep residual learning for image recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 770-778, 2016.
[3] A. Krizhevsky and G. Hinton. Learning multiple layers of features from tiny images. 2009.
[4] Y. LeCun. The MNIST database of handwritten digits. http://yann.lecun.com/exdb/mnist/.
[5] Y. LeCun, L. Bottou, Y. Bengio, and P. Haffner. Gradient-based learning applied to document recognition. Proceedings of the IEEE, 86(11):2278-2324, 1998.
[6] T. E. Potok, C. D. Schuman, S. R. Young, R. M. Patton, F. Spedalieri, J. Liu, K.-T. Yao, G. Rose, and G. Chakma. A study of complex deep learning networks on high performance, neuromorphic, and quantum computers. In Proceedings of the Workshop on Machine Learning in High Performance Computing Environments, pages 47-55. IEEE Press, 2016.