Dropout Auston Sterling January 26, 2016
Outline: Motivation, Dropout, Fast Dropout, Maxout
Co-adaptation Each unit in a neural network should ideally compute one complete feature. Since units are trained together, multiple units may co-adapt, becoming dependent on one another to compute a feature. This is sub-optimal, requiring more computation and causing overfitting.
Co-adaptation¹ Which is preferable? ¹ Srivastava et al., Dropout: A Simple Way to Prevent Neural Networks from Overfitting.
Model Combination We can reduce overfitting by combining the outputs of many different neural nets. It is best to train each net on a different subset of the data so that, while each may overfit to its subset, the combined models have a broader view. This can be prohibitively expensive and requires large amounts of data.
Sexual Reproduction Genes are taken from either of two parents. Each gene must be useful by itself; there is no guarantee that co-dependent genes will also make it through. Genes that are useful on their own make it easier to incorporate beneficial new ones.
Dropout² For each step of training, retain the output of each unit with probability p (set it to 0 otherwise). Best results with p = 0.5 for hidden units and p close to 1 for inputs. When testing, use all units but multiply the weights by p. That's it! ² Hinton et al., Improving neural networks by preventing co-adaptation of feature detectors.
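The recipe above fits in a few lines of NumPy. The sketch below is illustrative only (the function name, toy shapes, and shared RNG are my own assumptions, not from the slides): keep each input with probability p during training, use everything but scale the weights by p at test time.

```python
import numpy as np

def dropout_affine(x, W, b, p=0.5, train=True, rng=np.random.default_rng(0)):
    """One affine layer with dropout on its inputs.

    p is the retention probability: each unit is kept with probability p
    during training; at test time every unit is used and the weights are
    multiplied by p, as described on the slide.
    """
    if train:
        mask = rng.random(x.shape) < p   # keep each unit with probability p
        return (x * mask) @ W + b        # dropped units contribute nothing
    return x @ (p * W) + b               # test time: rescale the weights

# Toy usage (hypothetical shapes): 4 inputs, 3 outputs.
x = np.ones(4)
W = np.full((4, 3), 0.5)
b = np.zeros(3)
y_train = dropout_affine(x, W, b, train=True)
y_test = dropout_affine(x, W, b, train=False)
```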
Dropout Notes Constrain the L2 norm of each unit's incoming weight vector (max-norm regularization) and use a large learning rate. For a network with a single hidden layer of N units and a softmax output, the final trained network (with weights scaled by p) is exactly equivalent to the geometric mean of the probability distributions over labels predicted by all 2^N possible networks.
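A minimal sketch of the max-norm constraint mentioned above, assuming the columns of W hold each unit's incoming weights; the radius c = 3 is an illustrative value, not one given on the slide. The projection is applied after every gradient step.

```python
import numpy as np

def max_norm_project(W, c=3.0):
    """Project each unit's incoming weight vector onto the L2 ball ||w||_2 <= c.

    Assumes column i of W holds the incoming weights of unit i; call this
    after each parameter update (the radius c here is illustrative).
    """
    norms = np.linalg.norm(W, axis=0, keepdims=True)
    scale = np.minimum(1.0, c / np.maximum(norms, 1e-12))
    return W * scale
```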
Dropout results
Fast Dropout Training³ Dropout is a Monte Carlo process, sampling the 2^N masks. Can the process be approximated without requiring so much sampling? If z is the mask and w the weights, Y(z) = w^T D_z x = Σ_{i=1}^{m} w_i x_i z_i tends to a normal distribution. Approximate Y(z) with a Gaussian and sample from it to compute gradients. ³ Wang and Manning, Fast dropout training.
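A rough sketch of this idea in the slide's notation, with retention probability p. The moment formulas follow from the z_i being independent Bernoulli(p); the function name and shapes are my own, and Wang and Manning's full method also propagates these moments analytically through the nonlinearity rather than only sampling the pre-activation.

```python
import numpy as np

def fast_dropout_preact(x, W, p=0.5, rng=np.random.default_rng(0)):
    """Gaussian approximation of Y(z) = sum_i w_i x_i z_i with z_i ~ Bernoulli(p).

    E[Y]   = p * W^T x
    Var[Y] = p * (1 - p) * (W^2)^T (x^2)
    Sampling Y from N(E[Y], Var[Y]) replaces sampling the 2^N dropout masks.
    """
    mu = p * (W.T @ x)
    var = p * (1 - p) * ((W ** 2).T @ (x ** 2))
    return mu + np.sqrt(var) * rng.standard_normal(mu.shape)
```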
Fast Dropout Results
Maxout Networks⁴ Alternative activation function: h_i(x) = max_{j ∈ [1,k]} (x^T W_{:,i,j} + b_{i,j}). Can approximate other activations. Universal approximator. ⁴ Goodfellow et al., Maxout networks.
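The activation above in a few lines of NumPy, with shapes following the paper's W ∈ R^{d×m×k}; the einsum formulation and toy sizes are my own illustration.

```python
import numpy as np

def maxout(x, W, b):
    """Maxout: h_i(x) = max_j (x^T W[:, i, j] + b[i, j]).

    W has shape (d, m, k): d inputs, m maxout units, k linear pieces each.
    """
    z = np.einsum('d,dmk->mk', x, W) + b   # all k affine pieces per unit
    return z.max(axis=1)                   # keep the largest piece

# Toy usage: 2 maxout units, each the max of k = 3 affine functions of a 4-d input.
rng = np.random.default_rng(0)
x = rng.standard_normal(4)
h = maxout(x, rng.standard_normal((4, 2, 3)), np.zeros((2, 3)))   # shape (2,)
```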
Maxout and Dropout Dropout's weight scaling is exact model averaging for a single softmax layer, and also for deeper networks made entirely of linear layers. Because each maxout unit is locally linear, the authors argue that combining linear operations with max keeps this averaging approximation accurate, so maxout works particularly well with dropout.
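The single-softmax-layer case can be checked numerically: enumerate all 2^N masked sub-networks, take the renormalized geometric mean of their predicted distributions, and compare with the weight-scaled network. This is a small self-contained check I wrote for the p = 0.5 case, not code from any of the cited papers.

```python
import itertools
import numpy as np

def softmax(s):
    e = np.exp(s - s.max())
    return e / e.sum()

rng = np.random.default_rng(0)
N, C = 6, 3                        # N droppable inputs to a softmax layer, C classes
x = rng.standard_normal(N)
W = rng.standard_normal((N, C))
b = rng.standard_normal(C)

# Renormalized geometric mean over all 2^N masked sub-networks (p = 0.5).
log_p = np.zeros(C)
for mask in itertools.product([0, 1], repeat=N):
    log_p += np.log(softmax((x * np.array(mask)) @ W + b))
geo_mean = np.exp(log_p / 2 ** N)
geo_mean /= geo_mean.sum()

# Test-time network: every unit kept, weights multiplied by p = 0.5.
scaled = softmax(x @ (0.5 * W) + b)

print(np.allclose(geo_mean, scaled))   # True: weight scaling is exact averaging here
```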
Bibliography
Goodfellow, Ian J. et al. Maxout networks. In: arXiv preprint arXiv:1302.4389 (2013).
Hinton, Geoffrey E. et al. Improving neural networks by preventing co-adaptation of feature detectors. In: CoRR abs/1207.0580 (2012). URL: http://arxiv.org/abs/1207.0580.
Srivastava, Nitish et al. Dropout: A Simple Way to Prevent Neural Networks from Overfitting. In: Journal of Machine Learning Research 15 (2014), pp. 1929-1958. URL: http://jmlr.org/papers/v15/srivastava14a.html.
Wang, Sida and Christopher Manning. Fast dropout training. In: Proceedings of the 30th International Conference on Machine Learning (ICML-13). Ed. by Sanjoy Dasgupta and David McAllester. Vol. 28. 2. JMLR Workshop and Conference Proceedings, May 2013, pp. 118-126. URL: http://jmlr.csail.mit.edu/proceedings/papers/v28/wang13a.pdf.