Adaptive Dropout Training for SVMs


1 Adaptive Dropout Training for SVMs. Jun Zhu, Department of Computer Science and Technology, Tsinghua University. Joint work with Ning Chen, Jingwei Zhuo, Jianfei Chen, and Bo Zhang. ShanghaiTech Symposium on Data Science, June 23-26, 2015

2 Outline: Overfitting in Big Data; Dropout training for SVMs; Adaptive dropout rates; Big learning with Bayesian methods; Conclusions

3 Overfitting: the bias-variance tradeoff. [Figure: risk versus complexity of the function class; the variance / estimation error grows with complexity while the bias / approximation error shrinks]

4 Overfitting in Big Data. Big Model + Big Data + Big/Super Cluster = Big ML. Example: a 9-layer sparse autoencoder [Le et al., 2012] with local receptive fields to scale up, local L2 pooling and local contrast normalization for invariant features, and 1B parameters (connections), trained on 10M 200x200 images with 1K machines (16K cores) for 3 days. It is able to build high-level concepts, e.g., cat faces and human bodies, and reaches 15.8% accuracy in recognizing 22K objects (a 70% relative improvement).

5 Overfitting in Big Data. Relevant information grows slower than linearly with the data size, while model capacity may grow faster than the amount of relevant information!

6 Overfitting in Big Data. Relevant information grows slower than linearly (Bialek et al., 2001). [Figure from the cited work illustrating the sublinear growth of predictive information]

7 Overfitting in Big Data. Regularization to prevent overfitting is increasingly important, not increasingly irrelevant! It is receiving increasing attention, e.g., dropout training (Hinton, 2012), with growing theoretical understanding and extensions: MCF (van der Maaten et al., 2013); logistic loss (Wager et al., 2013); generalization error (Wager et al., 2014); dropout SVM (Chen et al., 2014); adaptive dropout (Zhuo et al., 2015).

8 Amazon Reviews Classification: Positive or Negative?
"I love the deeper meaning behind this movie. I had watched it years ago when it first came out but never understood it til now. Great spillberg film"
"One of the best sci-fi /adventure movies I have ever seen. Great movie about robots and ones yearning to know ones creator. The ending will stick in your mind forever in a good way."
"a massive, manipulative tear jerker which did nothing to illuminate me on the subjects of love or parenting or relationship or science or sibling rivalry. stock characters"
"Very long. Boring stretches made this movie hard to finish"

9 Amazon Reviews Classification: Positive or Negative? The standard approach is regularized empirical risk minimization. Instead of regularizing parameters, can we incorporate knowledge directly from the data to do regularization? Yes: regularization by corrupting data.

10 Regularization by Corruptions. [Figure: the four example reviews shown as original features next to multiple dropout-corrupted copies of the same reviews; the randomly deleted words are lost in this transcription]

11 Feature Noising Models. Define a label-invariant corrupting distribution and assume the corruption is independent across features: p(x̃ | x) = ∏_{d=1}^D p(x̃_d | x_d). Example: "Very long. Boring stretches made this movie hard to finish" maps to a noisy copy with some words dropped. Corrupting distributions: dropout, Gaussian, Laplace, Poisson.
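To make the corruption model concrete, here is a minimal Python sketch (my illustration, not from the talk) of a label-invariant corrupting distribution applied independently per feature; the function name and parameters are illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)

def corrupt(x, kind="dropout", delta=0.5, sigma=1.0):
    """Draw one corrupted copy x~ from p(x~ | x), independently per feature."""
    if kind == "dropout":    # zero each feature with probability delta
        return x * (rng.random(x.shape) >= delta)
    if kind == "gaussian":   # additive Gaussian noise
        return x + rng.normal(0.0, sigma, x.shape)
    if kind == "laplace":    # additive Laplace noise
        return x + rng.laplace(0.0, sigma, x.shape)
    if kind == "poisson":    # resample counts with mean x (for count features)
        return rng.poisson(x).astype(float)
    raise ValueError(kind)

x = np.array([2.0, 0.0, 1.0, 3.0])  # e.g., word counts of a short review
print(corrupt(x, "dropout"))
```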

12 Feature Noising Models. Feature noising models control over-fitting! (Hinton et al., 2012) Explicit corruption: augment the data by corrupting the given training examples with a fixed noise distribution. Downside: this gets computationally prohibitive, unless...

13 Feature Noising Models. Feature noising models control over-fitting! (Hinton et al., 2012) Explicit corruption: augment the data by corrupting the given training examples with a fixed noise distribution. Downside: this gets computationally prohibitive, unless we use implicit corruption (MCF: marginalized corrupted features), which averages over the noise analytically.
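A sketch of the explicit-corruption baseline under dropout noise (names are illustrative): each training example is replicated K times with fresh noise, so the training-set size, and hence the training cost, grows linearly in K. Implicit corruption avoids exactly this by taking K to infinity analytically.

```python
import numpy as np

rng = np.random.default_rng(0)

def explicit_dropout_augment(X, y, K=8, delta=0.5):
    """Replicate each example K times, each copy with fresh dropout noise.
    Cost grows linearly in K -- the downside that implicit (marginalized)
    corruption avoids by averaging over the noise in closed form."""
    mask = rng.random((K,) + X.shape) >= delta   # (K, N, D) keep-masks
    X_aug = (mask * X).reshape(-1, X.shape[1])   # stack the K noisy copies
    y_aug = np.tile(y, K)
    return X_aug, y_aug

X = rng.random((100, 20))
y = rng.integers(0, 2, 100)
X_aug, y_aug = explicit_dropout_augment(X, y)
print(X_aug.shape, y_aug.shape)  # (800, 20) (800,)
```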

14 Regularization by Corrupting Data. Feature noising amounts to minimizing the expected loss under the corrupting distribution: min_w Σ_n E_{p(x̃_n | x_n)}[loss(w; x̃_n, y_n)]. Theoretical understanding: L2-regularization for additive Gaussian noise (Bishop, 1995); adaptive regularization for dropout logistic regression (Wager et al., 2013); generalization bound (Wager et al., 2014). Empirical results in various applications: document classification (van der Maaten et al., 2013); entity recognition (Wang et al., 2013); image classification (Wang & Manning, 2013). Expected losses with known treatments: quadratic loss (Bishop, 1995); exponential loss (van der Maaten et al., 2013); logistic loss (van der Maaten et al., 2013; Wager et al., 2013).
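For the quadratic loss the expected loss is available in closed form, which is the essence of implicit corruption. A small sketch under dropout noise (my own illustration, not the MCF code): because corruption is independent per feature, the expectation decomposes into the loss at the corruption mean plus a data-dependent variance penalty, and a Monte Carlo estimate agrees with it.

```python
import numpy as np

rng = np.random.default_rng(0)

def expected_quadratic_loss(w, x, y, delta):
    """Closed-form E[(y - w.x~)^2] under dropout: loss at the mean of the
    corruption plus a variance term that acts like a regularizer."""
    mean = (1 - delta) * x            # E[x~_d] = (1 - delta) * x_d
    var = delta * (1 - delta) * x**2  # Var[x~_d]
    return (y - w @ mean) ** 2 + np.sum(w**2 * var)

w, x, y, delta = rng.normal(size=5), rng.normal(size=5), 1.0, 0.3
mc = np.mean([(y - w @ (x * (rng.random(5) >= delta))) ** 2
              for _ in range(100000)])
print(expected_quadratic_loss(w, x, y, delta), mc)  # should agree closely
```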

15 Losses in Machine Learning. [Figure: common surrogate losses (e.g., quadratic, hinge, logistic) plotted as functions of the margin]

16 Dropout Training for Support Vector Machines. Explicit corruption for SVMs dates back to (Burges & Scholkopf, 1997). One technical challenge: the non-smoothness of the hinge loss makes the expected loss intractable to compute in closed form. Our work: develop an iteratively re-weighted least squares (IRLS) algorithm to minimize a variational bound; apply the same ideas to develop IRLS for dropout logistic regression; derive an adaptive learning rule to decide the noise level. [Chen et al., AAAI 2014; Zhuo et al., IJCAI 2015]

17 Variational Bound with Data Augmentation. Theorem: let ζ_n = ℓ − y_n w^T x̃_n and let φ(y_n | x̃_n, w) = exp(−2c · max(0, ζ_n)) be the pseudo-likelihood of the response variable for sample n. The pseudo-likelihood can be expressed as a scale-location mixture of Gaussians, φ(y_n | x̃_n, w) = ∫_0^∞ (2πλ_n)^{−1/2} exp(−(λ_n + cζ_n)² / (2λ_n)) dλ_n, where λ_n is a generalized inverse Gaussian variable; the proof follows (Polson & Scott, 2011) with some careful treatments.
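The augmentation identity behind the theorem can be checked numerically: integrating the scale-location Gaussian mixture over λ should recover exp(−2 max(0, u)), where u plays the role of cζ_n. A quick quadrature check:

```python
import numpy as np
from scipy.integrate import quad

def mixture_integral(u):
    """Integrate the scale-location Gaussian mixture over lambda in (0, inf);
    it should equal exp(-2 * max(0, u)) (Polson & Scott, 2011)."""
    integrand = lambda lam: (np.exp(-(lam + u) ** 2 / (2 * lam))
                             / np.sqrt(2 * np.pi * lam))
    val, _ = quad(integrand, 0, np.inf)
    return val

for u in (-2.0, -0.5, 0.0, 0.7, 2.0):
    print(u, mixture_integral(u), np.exp(-2 * max(0.0, u)))
```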

18 Variational Bound with Data Augmentation. The expected hinge loss is R(w) = 2c Σ_n E_p[max(0, ℓ − y_n w^T x̃_n)], where c is the regularization parameter. Using the data-augmentation representation above, we obtain a variational upper bound on this intractable objective that can be optimized by coordinate descent.

19 Iteratively Re-weighted Least Squares Algorithm. The variational optimization problem is solved by coordinate descent (variational EM). For q(λ_n) (i.e., the E-step): the optimal q(λ_n) is a generalized inverse Gaussian distribution.

20 Iteratively Re-weighted Least Squares Algorithm. For w (i.e., the M-step): a re-weighted least squares problem under feature noising, with adaptive weights h_n := E_q[λ_n^{−1}] = 1 / (c √(E_p[ζ_n²])) and re-weighted labels y_n^h = (ℓ + 1/(c h_n)) y_n. It reduces to the square loss by setting h_n = 1/c and ℓ = 0.
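Putting the two steps together, here is a rough IRLS sketch for dropout noise. The closed-form dropout moments of the discriminant are standard; the weight/label updates follow the (reconstructed) formulas on this slide, so treat this as a sketch under those assumptions rather than a verified reproduction of Chen et al. (2014).

```python
import numpy as np

def irls_dropout_svm(X, y, delta=0.5, c=1.0, ell=1.0, reg=1e-2, iters=20):
    N, D = X.shape
    w = np.zeros(D)
    for _ in range(iters):
        # Closed-form dropout moments of the discriminant w.x~ :
        m = (1 - delta) * (X @ w)                # E[w.x~_n]
        v = delta * (1 - delta) * (X**2 @ w**2)  # Var[w.x~_n]
        zeta2 = (ell - y * m) ** 2 + v           # E_p[zeta_n^2]  (y_n^2 = 1)
        # E-step weights and re-weighted labels (reconstructed slide updates):
        h = 1.0 / (c * np.sqrt(zeta2) + 1e-12)
        y_h = (ell + 1.0 / (c * h)) * y
        # M-step: minimize sum_n h_n * E[(y_h_n - w.x~_n)^2] + reg * ||w||^2,
        # a linear system thanks to the dropout moments above.
        A = reg * np.eye(D)
        b = np.zeros(D)
        for n in range(N):
            A += h[n] * ((1 - delta) ** 2 * np.outer(X[n], X[n])
                         + delta * (1 - delta) * np.diag(X[n] ** 2))
            b += h[n] * (1 - delta) * y_h[n] * X[n]
        w = np.linalg.solve(A, b)
    return w

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 10))
y = np.sign(X @ rng.normal(size=10))
w = irls_dropout_svm(X, y)
print("train accuracy:", np.mean(np.sign(X @ w) == y))
```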

21 Variational Bound for the Expected Logistic Loss. The expected logistic loss admits a similar treatment. Theorem: let ψ_n = y_n w^T x̃_n and let φ(y_n | x̃_n, w) = e^{ψ_n} / (1 + e^{ψ_n}) be the pseudo-likelihood of the response variable; then φ(y_n | x̃_n, w) = (1/2) e^{ψ_n/2} ∫_0^∞ exp(−ω_n ψ_n² / 2) p(ω_n) dω_n, where ω_n is a Polya-Gamma variable; the proof follows (Polson & Scott, 2012) with some careful treatments.

22 Variational Bound for the Expected Logistic Loss. The variational optimization problem is again solved by coordinate descent (variational EM). For q(ω_n) (i.e., the E-step): the optimal q(ω_n) is a Polya-Gamma distribution (Polson et al., 2012).

23 Iteratively Re-weighted Least Squares Algorithm. For w (i.e., the M-step): a re-weighted least squares problem under feature noising, with adaptive weights l_n := (1/c) E_q[ω_n], which evaluates in closed form via the Polya-Gamma mean at ψ̄_n = √(E_p[ψ_n²]), namely E_q[ω_n] = (1 / (2ψ̄_n)) · (e^{ψ̄_n} − 1) / (e^{ψ̄_n} + 1), and re-weighted labels y_n^l = (c / (2 l_n)) y_n. It reduces to the square loss by setting l_n = c/2.

24 Comparison of hinge / logistic loss under the IRLS framework:

            Parameters | Weight update | Label update
  Hinge     c, ℓ       | h_n           | y_n^h
  Logistic  c          | l_n           | y_n^l

Both losses iteratively minimize the expectation of a re-weighted quadratic loss; they differ only in the update rules for the weights and the labels at each iteration. The quadratic loss is a special case with a single iteration.

25 Experiments. We compare Dropout-SVM and Dropout-Logistic with state-of-the-art models: MCF-logistic and MCF-quadratic. All our predictors use L2-regularization, with parameters set by cross-validation. We use a single-parameter noising model: each feature is dropped independently with the same rate, x̃_id = 0 with probability δ and x̃_id = x_id otherwise.

26 Experiment 1: Review Classification (Positive / Negative). [Figure: test error under increasing dropout corruption on several Amazon review domains; the "No Corruption" baselines are marked in each panel; lower is better]

27 Experiment 1: Review Classification (Positive / Negative). Comparing explicit and implicit dropout corruption (Amazon Books; hinge loss). [Figure]

28 Experiment 2: Nightmare at Test Time. In some settings, features may be randomly unobserved at test time. We experiment with this "nightmare at test time" scenario on MNIST digits: train regular dropout classifiers on the original training set, then randomly delete features from the test images and measure classification error (a sketch of the deletion step follows).
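The deletion step is simple to reproduce; a minimal sketch, with a hypothetical already-trained classifier clf and held-out arrays X_test, y_test:

```python
import numpy as np

rng = np.random.default_rng(0)

def delete_features(X_test, frac):
    """Zero out a random fraction `frac` of each test image's pixels;
    the training set is left untouched."""
    return X_test * (rng.random(X_test.shape) >= frac)

# Hypothetical usage with a trained classifier `clf`:
# for frac in (0.0, 0.25, 0.5, 0.75):
#     err = 1.0 - clf.score(delete_features(X_test, frac), y_test)
#     print(frac, err)
```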

29 Experiment 2: Nightmare at Test Time. [Figure: classification error on test images as a function of the fraction of randomly deleted features; error grows with test corruption; lower is better]

30 Adaptive Dropout Rates. A Bayesian feature noising model: x̃_i = x_i ∘ δ_i (elementwise), with p(δ_i | π) = ∏_{d=1}^D (1 − π_d)^{δ_id} π_d^{1 − δ_id}. This allows different dimensions to have different dropout rates π_d, which can be inferred automatically under a non-informative prior: π̂_d = Σ_{i=1}^N I(y_i θ_d x_id < 0) / Σ_{i=1}^N I(x_id ≠ 0), where θ_d is the classifier weight on feature d. Group-wise structure among features is also allowed. [Zhuo, Zhu, & Zhang, 2015]
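The closed-form rate update is easy to implement; a sketch (variable names are illustrative) of the rule on this slide: drop feature d more often when it disagrees with the label, normalized by how often the feature is active.

```python
import numpy as np

def adaptive_dropout_rates(X, y, theta):
    """pi_d = sum_i I(y_i * theta_d * x_id < 0) / sum_i I(x_id != 0)."""
    disagree = (y[:, None] * theta[None, :] * X) < 0  # I(y_i theta_d x_id < 0)
    active = X != 0                                   # I(x_id != 0)
    return disagree.sum(axis=0) / np.maximum(active.sum(axis=0), 1)

rng = np.random.default_rng(0)
X = rng.normal(size=(50, 8))
y = rng.choice([-1, 1], 50)
theta = rng.normal(size=8)
print(adaptive_dropout_rates(X, y, theta))
```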

31 Adaptive Dropout Rates. Some results on the Amazon books and kitchen review data: adaptive rates can improve performance. [Figure]

32 Big Learning with Bayesian Methods. Why Bayes? Robust to overfitting; flexible in modeling; avoids (heavy) parameter tuning; generic algorithms for inference. Why not Bayes? Computationally too slow; not scalable to big data. Good news: much recent progress on scalable Bayesian methods.

33 Big Learning with Bayesian Methods. Stochastic/online methods (variational, MCMC) and distributed methods (variational, MCMC), the latter spanning data-parallel, graph-parallel, and model-parallel architectures. [Figure: master/slave, map-reduce, and server/client diagrams]

37 Big Learning with Bayesian Methods. Online/stochastic learning: online Bayesian passive-aggressive learning (Shi & Zhu, ICML 2014); stochastic subgradient MCMC (Hu et al., arXiv preprint, 2015); deep generative models (Li et al., arXiv preprint, 2015; Du et al., arXiv preprint, 2015). Distributed learning: distributed Bayesian inference (Xu et al., NIPS 2014); scalable topic graph learning (Chen et al., NIPS 2013); scalable dynamic LDA (Bhadury et al., 2015, preprint). A comprehensive survey: Big Learning with Bayesian Methods, Zhu et al., arXiv preprint.

38 Conclusions. Feature noising controls over-fitting. We developed dropout training for SVMs via an iteratively re-weighted least squares (IRLS) algorithm, applied the same ideas to derive IRLS for dropout logistic regression, and obtained an adaptive update rule for dropout levels. Future work: the kernel trick in dropout learning; Dropout-SVM in deep architectures; big learning with Bayesian methods.

39 Department of Computer Science and Technology Thank you!
