[POLS 8500] Stochastic Gradient Descent, Linear Model Selection and Regularization


1 [POLS 8500] Stochastic Gradient Descent, Linear Model Selection and Regularization L. Jason Anastasopoulos February 2, 2017

2 Gradient descent

Let's begin with our simple problem of estimating the parameters of a linear regression model with gradient descent, where the gradient $\nabla J(\theta)$ is in general defined as

$$\nabla J(\theta) = \left[ \frac{\partial J}{\partial \theta_0}, \frac{\partial J}{\partial \theta_1}, \ldots, \frac{\partial J}{\partial \theta_p} \right]$$

and in the case of linear regression is

$$\nabla J(\theta) = -\frac{1}{N}\left(y^T - \theta^T X^T\right)X$$

3 Gradient descent for linear regression

The gradient descent algorithm finds the parameters in the following manner:

repeat while ( $\|\eta \nabla J(\theta)\| > \epsilon$ ) {
    $\theta := \theta - \eta \nabla J(\theta) = \theta + \eta \frac{1}{N}\left(y^T - \theta^T X^T\right)X$
}

4 Gradient descent R function

As it turns out, this is quite easy to implement in R as a function, which we call gradientr below.
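
The function body itself is not shown; the following is a minimal sketch consistent with how gradientr is called on the later slides. It assumes an added intercept column, zero-initialized parameters, and a returned list with coef and the per-epoch L2 loss:

gradientr <- function(y, X, eta, iters) {
  # Add an intercept column and coerce to a matrix
  X <- as.matrix(cbind(rep(1, length(y)), X))
  N <- length(y)
  theta <- rep(0, ncol(X))   # initialize all parameters at zero
  loss <- numeric(iters)     # L2 loss recorded at each epoch
  for (i in 1:iters) {
    grad <- -(1 / N) * t(X) %*% (y - X %*% theta)  # gradient of the squared-error loss
    theta <- theta - eta * grad                    # gradient descent update
    loss[i] <- sum((y - X %*% theta)^2)            # L2 loss after the update
  }
  list(coef = theta, loss = loss)
}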

5 Normal equations in R

Let's also make a function that estimates the parameters with the normal equations:

$$\hat{\theta} = (X^T X)^{-1} X^T y$$

normalest <- function(y, X){
  X = data.frame(rep(1, length(y)), X)
  X = as.matrix(X)
  theta = solve(t(X) %*% X) %*% t(X) %*% y
  return(theta)
}

6 Running gradient descent

Now let's make up some fake data and see gradient descent in action with $\eta = 10$ and 1000 epochs:

y  = rnorm(n = 1000, mean = 0, sd = 1)
x1 = rnorm(n = 1000, mean = 0, sd = 1)
x2 = rnorm(n = 1000, mean = 0, sd = 1)
x3 = rnorm(n = 1000, mean = 0, sd = 1)
x4 = rnorm(n = 1000, mean = 0, sd = 1)
x5 = rnorm(n = 1000, mean = 0, sd = 1)

gdec.eta1 = gradientr(y = y, X = data.frame(x1, x2, x3, x4, x5),
                      eta = 10, iters = 1000)

7 Did we get the correct parameter values?

Let's check whether we got the correct parameter values:

Explicit.Coef <- normalest(y = y, X = data.frame(x1, x2, x3, x4, x5))
Gradient.Coef <- gdec.eta1$coef
data.frame(Explicit.Coef, Gradient.Coef)

##                  Explicit.Coef Gradient.Coef
## rep.1..length.y.
## x1
## x2
## x3
## x4
## x5

8 L2 loss for each epoch

Let's take a look at the L2 loss for each epoch:

[Figure: L2 loss by epoch]

9 What if we decreased η?

What happens if we decrease to $\eta = 1$?

[Figure: L2 loss by epoch]

10 Stochastic Gradient Descent

Gradient descent can often have slow convergence because each iteration requires calculating the gradient for every single training example. If we instead update the parameters as we iterate through the training examples one at a time, we can get excellent estimates despite having done far less work per update.

11 Stochastic Gradient Descent

For stochastic gradient descent, thus:

$$\nabla J(\theta) = -\frac{1}{N}\left(y^T - \theta^T X^T\right)X$$

becomes:

$$\nabla J(\theta)_i = -\frac{1}{N}\left(y_i - \theta^T X_i\right)X_i$$

where $i$ indexes the rows of the data set.

12 Stochastic Gradient Descent Algorithm

The stochastic gradient descent algorithm proceeds as follows for the case of linear regression:

Step 1: Randomly shuffle the data.
Step 2: repeat for $i := 1, \ldots, N$ {
    $\theta := \theta - \eta \nabla J(\theta)_i$
}

Part of the homework assignment will be to write an R function that performs stochastic gradient descent; one possible shape is sketched below.
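
A minimal sketch, assuming the same conventions as gradientr above (intercept column, zero initialization); this is only one possible shape, not the assigned solution:

sgdr <- function(y, X, eta, epochs) {
  X <- as.matrix(cbind(rep(1, length(y)), X))
  N <- length(y)
  theta <- rep(0, ncol(X))
  for (e in 1:epochs) {
    for (i in sample(1:N)) {          # Step 1: shuffle; Step 2: sweep the rows
      # gradient contribution of the single example i
      grad_i <- -(1 / N) * (y[i] - sum(theta * X[i, ])) * X[i, ]
      theta <- theta - eta * grad_i   # update after every example
    }
  }
  theta
}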

13 Linear model selection and regularization

Prediction accuracy. Recall the standard linear model:

$$Y = \theta_0 + \sum_{i=1}^{p} \theta_i x_i + \epsilon$$

If $n \gg p$, least squares will do well on test observations.
If $n$ is not much larger than $p$, there is a lot of variability in the least-squares fit, and overfitting.
If $p > n$, there is no unique solution that the normal equations can explicitly solve.

14 Linear model selection and regularization

Model interpretability. Inclusion of many irrelevant variables increases the chances of overfitting. We often need ways to perform feature selection for linear regression and other models.

15 Three ways to accomplish this

Subset Selection: find the subset of the $p$ predictors that best fits $Y$.
Shrinkage/Regularization: fit a model with all $p$ predictors, but shrink the values of some coefficients $\theta$ toward 0.
Dimension Reduction: project the $p$ predictors onto an $M$-dimensional subspace with $M < p$, and use the $M$ projections as predictors.

16 Best subset selection

$$Y = \theta_0 + \theta_1 x_1 + \cdots + \theta_p x_p$$

Choose the best subset of predictors for the model according to some criterion: $C_p$, AIC, BIC, or adjusted $R^2$.

17 Best subset selection algorithm

For the model $Y = \theta_0 + \theta_1 x_1 + \cdots + \theta_p x_p$:

1. Let $M_0$ denote the null model.
2. For $i = 1, \ldots, p$:
   Fit each of the $\binom{p}{i}$ models containing exactly $i$ predictors.
   Choose the best among these $\binom{p}{i}$ models, calling it $M_i$: the one with the lowest RSS (highest $R^2$).
3. Select the best model among $M_0, \ldots, M_p$ using some criterion: cross-validated prediction error, AIC, BIC, or adjusted $R^2$.

18 Example in R using bestglm

library(bestglm)

## Loading required package: leaps
## Warning: package 'leaps' was built under R version ...

19 Example in R using bestglm

str(zprostate)

## 'data.frame': 97 obs. of 10 variables:
##  $ lcavol : num
##  $ lweight: num
##  $ age    : num
##  $ lbph   : num
##  $ svi    : num
##  $ lcp    : num
##  $ gleason: num
##  $ pgg45  : num
##  $ lpsa   : num
##  $ train  : logi TRUE TRUE TRUE TRUE TRUE TRUE ...

20 Example in R using bestglm

train <- (zprostate[zprostate[, 10], ])[, -10]
X <- train[, 1:8]
y <- train[, 9]
out <- summary(regsubsets(x = X, y = y, nvmax = ncol(X)))
Subsets <- out$which
RSS <- out$rss

##   (Intercept) lcavol lweight   age  lbph   svi   lcp gleason
## 1        TRUE   TRUE   FALSE FALSE FALSE FALSE FALSE   FALSE
## 2        TRUE   TRUE    TRUE FALSE FALSE FALSE FALSE   FALSE
## 3        TRUE   TRUE    TRUE FALSE FALSE  TRUE FALSE   FALSE
## 4        TRUE   TRUE    TRUE FALSE  TRUE  TRUE FALSE   FALSE
## 5        TRUE   TRUE    TRUE FALSE  TRUE  TRUE FALSE   FALSE
##   RSS

21 Example in R using bestglm

Let bestglm() find the best model using the Bayesian information criterion (the default, BIC):

Xy <- cbind(as.data.frame(X), lpsa = y)
out <- bestglm(Xy, IC = "BIC")
out$BestModel

##
## Call:
## lm(formula = y ~ ., data = data.frame(Xy[, c(bestset[-1],
##     drop = FALSE], y = y))
##
## Coefficients:
## (Intercept)       lcavol      lweight
##

22 Example in R using bestglm

Let bestglm() find the best model using the Akaike information criterion (AIC):

Xy <- cbind(as.data.frame(X), lpsa = y)
out <- bestglm(Xy, IC = "AIC")
out$BestModel

##
## Call:
## lm(formula = y ~ ., data = data.frame(Xy[, c(bestset[-1],
##     drop = FALSE], y = y))
##
## Coefficients:
## (Intercept)       lcavol      lweight          age
##
##         svi          lcp        pgg45
##

23 Example in R using bestglm

Let bestglm() find the best model using cross-validated prediction error:

Xy <- cbind(as.data.frame(X), lpsa = y)
out <- bestglm(Xy, IC = "CV", CVArgs = list(Method = "HTF", K = 10, REP = 1))
out$BestModel

##
## Call:
## lm(formula = y ~ ., data = data.frame(Xy[, c(bestset[-1],
##     drop = FALSE], y = y))
##
## Coefficients:
## (Intercept)       lcavol      lweight          svi
##

24 Problems with best subset selection

What if you have $p = 50$ predictors and want to choose a model with a variable subset of $i = 10$? That would require you to estimate $\binom{50}{10} = 10{,}272{,}278{,}170$ models! Very computationally intensive.
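
You can verify this count directly in R:

choose(50, 10)   # number of distinct 10-variable subsets of 50 predictors
## [1] 10272278170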

25 Forward stepwise selection

Does not consider all possible subsets of models.
Starts with the null model $M_0$, adds predictors one at a time, and at each step chooses the predictor that gives the greatest additional improvement in fit.

26 Forward stepwise selection algorithm

For the model $Y = \theta_0 + \theta_1 x_1 + \cdots + \theta_p x_p$:

1. Let $M_0$ denote the null model.
2. For $i = 0, \ldots, p - 1$:
   Consider all $p - i$ models that augment the predictors in $M_i$ with one additional predictor.
   Choose the best among these $p - i$ models, calling it $M_{i+1}$: the one with the lowest RSS (highest $R^2$).
3. Select the best model among $M_0, \ldots, M_p$ using some criterion: $C_p$, AIC, BIC, or adjusted $R^2$.

27 Forward stepwise selection algorithm

Cuts down the number of models you need to estimate significantly: you only estimate the null model plus $p - i$ models on the $i$th iteration.

$$\text{Total models} = 1 + \sum_{i=0}^{p-1}(p - i) = 1 + \frac{p(p+1)}{2}$$
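
For the $p = 50$ example above, the contrast is stark:

p <- 50
1 + p * (p + 1) / 2   # total models fit by forward stepwise
## [1] 1276

Forward stepwise fits 1,276 models where best subset selection would fit more than 10 billion.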

28 Forward stepwise selection in R using step()

USJudgeRatings dataset:

# CONT  Number of contacts of lawyer with judge.
# INTG  Judicial integrity.
# DMNR  Demeanor.
# DILG  Diligence.
# CFMG  Case flow managing.
# DECI  Prompt decisions.
# PREP  Preparation for trial.
# FAMI  Familiarity with law.
# ORAL  Sound oral rulings.
# WRIT  Sound written rulings.
# PHYS  Physical ability.
# RTEN  Worthy of retention.

29 Forward stepwise selection in R using step()

What if we wanted to train a model to predict whether a judge would be worthy of retention (RTEN)?

library(datasets)
data(USJudgeRatings)

30 Forward stepwise selection in R using step()

null.model = lm(RTEN ~ 1, data = USJudgeRatings)
largest.model = lm(RTEN ~ ., data = USJudgeRatings[, 1:12])
forward.stepwise = step(null.model, direction = 'forward',
                        scope = formula(largest.model))

## Start:  AIC=9.26
## RTEN ~ 1

31 Model selection criteria

While $R^2$ provides a measure of fit, it always increases as the number of predictors increases:

$$R^2 = 1 - \frac{\sum_i \left(y_i - \left(\theta_0 + \sum_{j=1}^{p} \theta_j x_{ij}\right)\right)^2}{\sum_i (y_i - \bar{y})^2}$$

32 Model selection criteria

Thus, in order to avoid estimating models that tend to overfit the data, we need criteria that penalize models with more features. Here we consider $C_p$, the Akaike information criterion (AIC), the Bayesian information criterion (BIC), and adjusted $R^2$.

33 $C_p$

$$C_p = \frac{1}{n}\left(\mathrm{RSS} + 2d\hat{\sigma}^2\right)$$

In the equation above, a penalty of $2d\hat{\sigma}^2$ is added to the residual sum of squares, where $d$ is the number of parameters and $\hat{\sigma}^2$ is an estimate of the error variance.

34 AIC

$$\mathrm{AIC} = \frac{1}{n\hat{\sigma}^2}\left(\mathrm{RSS} + 2d\hat{\sigma}^2\right)$$

35 BIC

$$\mathrm{BIC} = \frac{1}{n}\left(\mathrm{RSS} + \log(n)\,d\hat{\sigma}^2\right)$$

36 Adjusted $R^2$

$$\text{Adjusted } R^2 = 1 - \frac{\mathrm{RSS}/(n - d - 1)}{\mathrm{TSS}/(n - 1)}$$
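
As a quick illustration, all four criteria can be computed by hand from a fitted lm object. A sketch using the simulated data from earlier, taking $\hat{\sigma}^2$ from the full model's residuals (one common convention):

fit <- lm(y ~ x1 + x2 + x3 + x4 + x5)
n <- length(y)
d <- length(coef(fit)) - 1        # number of predictors
RSS <- sum(resid(fit)^2)
TSS <- sum((y - mean(y))^2)
sigma2.hat <- RSS / (n - d - 1)   # estimated error variance

Cp.val  <- (RSS + 2 * d * sigma2.hat) / n
AIC.val <- (RSS + 2 * d * sigma2.hat) / (n * sigma2.hat)
BIC.val <- (RSS + log(n) * d * sigma2.hat) / n
adjR2   <- 1 - (RSS / (n - d - 1)) / (TSS / (n - 1))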

37 Validation set approach

In order to understand cross-validation we have to take a step back. We've already discussed the training set and the test set, but we haven't discussed the validation set.

38 Validation set approach

Training set: the data that you train your model with (i.e., estimate parameters).
Test set: data that you use to test how well your trained model predicts new data.
Validation set: data that is used to provide a more accurate estimate of the performance of one or several models, often to avoid overfitting.

39 Validation set approach

If the goal is to estimate the test error of a number of models, which gives us a sense of how accurately each model makes predictions, we might take the following steps:

Step 1: Divide the data into a training set and a validation set.
Step 2: Train the model on the training set; estimate the test error rate on the validation set.

40 Validation set approach: example

Bible and Quran: $n \approx 1100$ randomly sampled verses.

Step 1: Randomly divide the data into a training set ($n = 1050$) and a validation set ($n = 50$).
Step 2: Estimate the parameters on the training set; estimate the test error on the validation set.
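
A minimal sketch of such a split, assuming a hypothetical data frame verses with an outcome column y:

set.seed(1)
train.idx <- sample(nrow(verses), 1050)   # 1050 verses for training
train <- verses[train.idx, ]
valid <- verses[-train.idx, ]             # remaining verses for validation
fit <- lm(y ~ ., data = train)
val.mse <- mean((valid$y - predict(fit, newdata = valid))^2)  # validation error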

41 Validation set approach

Why not just estimate the error on a test set? We want to reserve the test set to verify the final accuracy of the model, but we want another data set to provide us with information about how we should adjust our model.

42 Problems with the validation set approach

1. The validation-set test error estimate is highly variable: it depends on which validation set is chosen.
2. The validation-set test error estimate may overestimate the error, because not enough training data is used to train the model.

43 Leave-One-Out Cross-Validation

Training set: $\{(x_2, y_2), \ldots, (x_n, y_n)\}$
Validation set: $\{(x_1, y_1)\}$

LOOCV solves both of these problems. The validation set is a single observation, and the training set is the rest of the data.

44 Leave-One-Out Cross-Validation Algorithm

for $i = 1, \ldots, n$:
    select the validation set $\{(x_i, y_i)\}$
    select the training set (the remaining $n - 1$ observations) and train on it
    estimate $\mathrm{MSE}_i = (y_i - \hat{y}_i)^2$

Estimate the overall MSE:

$$CV_{(n)} = \frac{1}{n}\sum_{i=1}^{n} \mathrm{MSE}_i$$
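
A minimal sketch of LOOCV for a linear model, assuming a data frame df whose response column is named y:

loocv <- function(df) {
  n <- nrow(df)
  mse <- numeric(n)
  for (i in 1:n) {
    fit <- lm(y ~ ., data = df[-i, ])   # train on all but observation i
    pred <- predict(fit, newdata = df[i, , drop = FALSE])
    mse[i] <- (df$y[i] - pred)^2        # squared error on the held-out point
  }
  mean(mse)                             # CV_(n)
}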

45 Leave-One-Out Cross-Validation Algorithm

Benefits:
- Provides an excellent and stable estimate of the MSE.
- Can be used for any kind of model.

Drawbacks:
- Computationally intensive.

46 k-fold Cross-Validation

The k-fold method solves the problem of computational complexity.
The data is divided into $k$ groups, or folds.
The first fold is used as a validation set and the model is trained on the remaining $k - 1$ folds; the procedure is repeated with each fold serving once as the validation set.

47 k-fold Cross-Validation Algorithm

Choose $k$.
Randomly divide the data $\Theta$ into $\Theta_1, \ldots, \Theta_k$ sets of size $n/k$.
for $i = 1, \ldots, k$:
    Hold out fold $\Theta_i$; train the model on the remaining $k - 1$ folds.
    Calculate $\mathrm{MSE}_i = \frac{k}{n}\sum_{j \in \Theta_i} (y_j - \hat{y}_j)^2$
Calculate the overall MSE:

$$CV_{(k)} = \frac{1}{k}\sum_{i=1}^{k} \mathrm{MSE}_i$$
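
A minimal k-fold sketch in the same style as the LOOCV function above (again assuming a data frame df with response column y):

kfold.cv <- function(df, k = 10) {
  n <- nrow(df)
  folds <- sample(rep(1:k, length.out = n))    # randomly assign each row to a fold
  mse <- numeric(k)
  for (i in 1:k) {
    held.out <- folds == i
    fit <- lm(y ~ ., data = df[!held.out, ])   # train on the other k - 1 folds
    pred <- predict(fit, newdata = df[held.out, ])
    mse[i] <- mean((df$y[held.out] - pred)^2)  # validation MSE on fold i
  }
  mean(mse)                                    # CV_(k)
}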

48 For next time

- Regularization
- Logistic Regression
- Linear Discriminant Analysis
- Naïve Bayes
