Lecture A. Exhaustive and non-exhaustive algorithms without resampling. Samuel Müller and Garth Tarr
2 "Doubt is not a pleasant mental state, but certainty is a ridiculous one." Voltaire ( ) Necessity to select models is great despite philosophical issues Myriad of model/variable selection methods based on loglikelihood, RSS, prediction errors, resampling, etc 2 / 42
3 Some motivating questions
- How to practically select model(s)?
- How to cope with an ever-increasing number of observed variables?
- How to visualise the model building process?
- How to assess the stability of a selected model?
4 Topics
- Selecting models
- Stepwise model selection
- Information criteria, including AIC and BIC
- Exhaustive and non-exhaustive searches
- Regularization methods
5 Body fat data
- N = 252 measurements of men (available in R package mfp)
- Training sample size n = 128 (available in R package mplot)
- Response variable: body fat percentage (Bodyfat)
- 13 explanatory variables (plus an Id column, removed below)

data("bodyfat", package = "mplot")
dim(bodyfat)
## [1] 128  15
names(bodyfat)
##  [1] "Id"      "Bodyfat" "Age"     "Weight"  "Height"  "Neck"
##  [7] "Chest"   "Abdo"    "Hip"     "Thigh"   "Knee"    "Ankle"
## [13] "Bic"     "Fore"    "Wrist"

Source: Johnson (1996)
6 Body fat: Boxplots
bfat = bodyfat[, -1]  # remove the Id column
boxplot(bfat, horizontal = TRUE, las = 1)
7 pairs(bfat)
8 Body fat data: Full model and power set
names(bfat)
##  [1] "Bodyfat" "Age"     "Weight"  "Height"  "Neck"    "Chest"
##  [7] "Abdo"    "Hip"     "Thigh"   "Knee"    "Ankle"   "Bic"
## [13] "Fore"    "Wrist"

Full model:
Bodyfat = β_1 + β_2 Age + β_3 Weight + β_4 Height + β_5 Neck + β_6 Chest + β_7 Abdo + β_8 Hip + β_9 Thigh + β_10 Knee + β_11 Ankle + β_12 Bic + β_13 Fore + β_14 Wrist + ϵ

Power set: here, the intercept is not subject to selection. Therefore p − 1 = 13, and thus the number of possible regression models is M = #A = 2^13 = 8192.
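As a quick sanity check on the size of the search space (a sketch assuming bfat as constructed above), the model count can be computed directly:

2^(ncol(bfat) - 1)  # drop the response column: 13 candidate variables
## [1] 8192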
9 Body fat: Best bivariate fitted model
M0 = lm(Bodyfat ~ Abdo, data = bfat)
summary(M0)
## Call:
## lm(formula = Bodyfat ~ Abdo, data = bfat)
## (Coefficient estimates omitted; both the intercept and Abdo have
## p-values < 2e-16. Residual standard error on 126 degrees of freedom;
## F-statistic on 1 and 126 DF, p-value < 2.2e-16.)
10 Body fat: Full fitted model
M1 = lm(Bodyfat ~ ., data = bfat)
summary(M1)
## Call:
## lm(formula = Bodyfat ~ ., data = bfat)
## (Coefficient estimates omitted; among the 13 predictors only Abdo is
## strongly significant, with a p-value of order 1e-13.)
11 The AIC and BIC
AIC is the most widely known and used model selection method (of course this does not imply it is the best/recommended method to use).

AIC (Akaike, 1973): AIC = −2 logLik + 2 p
BIC (Schwarz, 1978): BIC = −2 logLik + log(n) p

The smaller the AIC/BIC, the better the model. These and many other criteria choose models by minimizing an expression that can be written as Loss + Penalty.
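As a sketch of how these criteria relate to R's built-ins, the Loss + Penalty form can be computed by hand from logLik() for the full model M1 fitted above (note that R counts the error variance as an estimated parameter, so p here is the number of coefficients plus one):

ll = logLik(M1)      # log-likelihood of the fitted model
p = attr(ll, "df")   # number of estimated parameters (coefficients + sigma)
n = nobs(M1)
c(byHand = -2 * as.numeric(ll) + 2 * p, builtIn = AIC(M1))
c(byHand = -2 * as.numeric(ll) + log(n) * p, builtIn = BIC(M1))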
12 Stepwise variable selection
Stepwise, forward and backward:
1. Start with some model, typically the null model (with no explanatory variables) or the full model (with all variables)
2. For each variable in the current model, investigate the effect of removing it
3. Remove the least informative variable, unless this variable is nonetheless supplying significant information about the response
4. For each variable not in the current model, investigate the effect of including it
5. Include the most statistically significant variable not currently in the model (unless no significant variable exists)
6. Go to step 2. Stop only when steps 2-5 produce no change
13 Body fat: Forward search using BIC
M0 = lm(Bodyfat ~ 1, data = bfat)  # Null model
M1 = lm(Bodyfat ~ ., data = bfat)  # Full model
step.fwd.bic = step(M0, scope = list(lower = M0, upper = M1),
                    direction = "forward", trace = FALSE,
                    k = log(128))  # BIC: k = log(n)
# summary(step.fwd.bic)
round(summary(step.fwd.bic)$coef, 3)
## (Coefficient table for (Intercept), Abdo and Weight; values omitted.)

The best model has two features: Abdo and Weight. A backward search using BIC gives the same model (confirm yourself).
14 Body fat: Backward search using AIC
M0 = lm(Bodyfat ~ 1, data = bfat)  # Null model
M1 = lm(Bodyfat ~ ., data = bfat)  # Full model
step.bwd.aic = step(M1, scope = list(lower = M0, upper = M1),
                    direction = "backward", trace = FALSE, k = 2)
round(summary(step.bwd.aic)$coef, 3)
## (Coefficient table for (Intercept), Weight, Neck, Abdo and Fore; values omitted.)

The best model has four variables. Using AIC gives a larger model because 2 < log(128) ≈ 4.85. k = 2 is the default in step().
15 Body fat: Stepwise using AIC
M0 = lm(Bodyfat ~ 1, data = bfat)  # Null model
M1 = lm(Bodyfat ~ ., data = bfat)  # Full model
step.aic = step(M1, scope = list(lower = M0, upper = M1), trace = FALSE)
round(summary(step.aic)$coef, 3)
## (Coefficient table for (Intercept), Weight, Neck, Abdo and Fore; values omitted.)

Here, the default stepwise search gives the same model as the backward search using AIC. This does not hold in general.
16 Criticism of stepwise procedures
Never run (p-value based) automatic stepwise procedures on their own!

"Stepwise regression is probably the most abused computerized statistical technique ever devised. If you think you need stepwise regression to solve a particular problem you have, it is almost certain you do not. Professional statisticians rarely use automated stepwise regression." Wilkinson (1998)

See Wiegand (2010) for a more recent simulation study and review of the performance of stepwise procedures.
17 Exhaustive search
- Exhaustive search is the only technique guaranteed to find the predictor variable subset with the best evaluation criterion
- Since we look over the whole model space, we can identify the best model(s) at each model size
- Sometimes known as best subsets model selection
- The loss component is (typically) the residual sum of squares
- Main drawback: exhaustive searching is computationally intensive
18 The leaps package
- The leaps package performs regression subset selection for linear models, including exhaustive searches (Lumley and Miller, 2009)
- Main function: regsubsets() performs an exhaustive search of the model space to find the best subsets of variables for predicting y in linear regression
- Uses an efficient branch-and-bound algorithm
- Alternative function: leaps() is just a compatibility wrapper for regsubsets()
19 The regsubsets function
Input: full model and various optional parameters.
Output: a regsubsets object, but you interact with it using the summary function, which gives:
- which: a logical matrix indicating which elements are in each model
- rsq: the R-squared for each model
- rss: residual sum of squares for each model
- adjr2: adjusted R-squared
- cp: Mallows' Cp
- bic: Bayesian information criterion
- outmat: a version of the which component that is formatted for printing
20 Body fat: Exhaustive search
require(leaps)
rgsbst.out = regsubsets(Bodyfat ~ ., data = bfat,
                        nbest = 1,       # 1 best model for each dimension
                        nvmax = NULL,    # NULL for no limit on number of variables
                        force.in = NULL, force.out = NULL,
                        method = "exhaustive")
rgsbst.out
## Subset selection object
## Call: regsubsets.formula(Bodyfat ~ ., data = bfat, nbest = 1, nvmax = NULL,
##     force.in = NULL, force.out = NULL, method = "exhaustive")
## 13 Variables (and intercept)
##        Forced in Forced out
## Age        FALSE      FALSE
## Weight     FALSE      FALSE
## Height     FALSE      FALSE
## Neck       FALSE      FALSE
## Chest      FALSE      FALSE
## Abdo       FALSE      FALSE
## Hip        FALSE      FALSE
## Thigh      FALSE      FALSE
## Knee       FALSE      FALSE
## Ankle      FALSE      FALSE
## Bic        FALSE      FALSE
## Fore       FALSE      FALSE
## Wrist      FALSE      FALSE
## 1 subsets of each size up to 13
## Selection Algorithm: exhaustive
21 Body fat: Exhaustive search
summary.out = summary(rgsbst.out)
names(summary.out)
## [1] "which"  "rsq"    "rss"    "adjr2"  "cp"     "bic"    "outmat" "obj"
as.data.frame(summary.out$outmat)
## (The printed matrix has one row per model size, 1 to 13, with a "*"
## marking each variable included in the best model of that size.)
22 Body fat: Exhaustive search and Cp
summary.out$cp
## (13 Cp values, one per model size; values omitted.)
(id = which.min(summary.out$cp))
## [1] 4
summary.out$which[id, ]
## (Intercept)         Age      Weight      Height        Neck       Chest        Abdo
##        TRUE       FALSE        TRUE       FALSE        TRUE       FALSE        TRUE
##         Hip       Thigh        Knee       Ankle         Bic        Fore       Wrist
##       FALSE       FALSE       FALSE       FALSE       FALSE        TRUE       FALSE
names(which(summary.out$which[id, ] == TRUE))
## [1] "(Intercept)" "Weight"      "Neck"        "Abdo"        "Fore"
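The winning subset can then be refitted as an ordinary lm() object; a minimal sketch reusing the objects above (reformulate() is base R):

best.vars = names(which(summary.out$which[id, -1]))  # drop the intercept column
best.fit = lm(reformulate(best.vars, response = "Bodyfat"), data = bfat)
round(summary(best.fit)$coef, 3)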
23 Body fat: Exhaustive search and Cp
plot(rgsbst.out, scale = "Cp", main = "Mallows' Cp")
24 Body fat: Exhaustive search and Cp
plot(summary.out$cp ~ rowSums(summary.out$which),
     ylab = "Cp", xlab = "Number of parameters")
abline(v = rowSums(summary.out$which)[id], col = gray(0.5))
25 Regularisation methods with glmnet
Regularisation methods shrink estimated regression coefficients by imposing a penalty on their sizes. There are different choices for the penalty, and the penalty choice drives the properties of the method. In particular, glmnet implements:
- Lasso regression, using an L1 penalty
- Ridge regression, using an L2 penalty
- Elastic-net, using both an L1 and an L2 penalty
- Adaptive lasso, which takes advantage of feature weights
26 Lasso is least squares with L1 penalty
Simultaneously estimate and select coefficients through

β̂_lasso = β̂_lasso(λ) = argmin_{β ∈ R^p} { (y − Xβ)^T (y − Xβ) + λ ‖β‖_1 },

where the positive tuning parameter λ controls the shrinkage. The optimisation problem above is equivalent to solving

minimise RSS(β) = (y − Xβ)^T (y − Xβ) subject to ∑_{j=1}^p |β_j| ≤ t.
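To see why the L1 penalty can produce exact zeros, consider the special case of an orthonormal design (X^T X = I), where the penalised problem above separates by coordinate and has a closed-form soft-thresholding solution; a small sketch (the threshold λ/2 matches the unscaled RSS used above):

soft = function(b.ls, lambda) sign(b.ls) * pmax(abs(b.ls) - lambda / 2, 0)
soft(c(-3, -0.2, 0.5, 2), lambda = 1)  # small coefficients zeroed, large ones shrunk
## [1] -2.5  0.0  0.0  1.5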
27 Example: Diabetes
We consider the diabetes data and we only analyse main effects. The data from the lars package is the same as in the mplot package but is structured and scaled differently.

data("diabetes", package = "lars")
names(diabetes)  # list; slot x has 10 main effects
## [1] "x"  "y"  "x2"
dim(diabetes)
## [1] 442   3
x = diabetes$x
y = diabetes$y
28 Example: Diabetes
class(x) = NULL  # remove the AsIs class; otherwise a single boxplot is drawn
boxplot(x, cex.axis = 2)
29 Example: Diabetes
lm.fit = lm(y ~ x)  # least squares fit for the full model first
summary(lm.fit)
## Call:
## lm(formula = y ~ x)
## (Coefficient estimates omitted; the intercept, xsex, xbmi, xmap and xltg
## are strongly significant.)
30 Fitting a lasso regression
Lasso regression with the glmnet() function in library(glmnet) takes the design matrix as its first argument and the response vector as its second argument, i.e.

library(glmnet)
lasso.fit = glmnet(x, y)
31 Comments
lasso.fit is an object of class glmnet that contains all the relevant information of the fitted model for further use. Various methods are provided for the object, such as coef, plot, predict and print, that extract and present useful object information. The additional material shows that cross-validation can be used to choose a good value for λ using the function cv.glmnet.

set.seed(1)  # to get a reproducible lambda.min value
lasso.cv = cv.glmnet(x, y)
lasso.cv$lambda.min
## (value omitted)
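The cross-validated fit can then be queried directly at the selected penalty; both coef() and predict() accept s = "lambda.min" (or the more conservative s = "lambda.1se") for cv.glmnet objects:

coef(lasso.cv, s = "lambda.min")                      # coefficients at the CV-selected lambda
predict(lasso.cv, newx = x[1:5, ], s = "lambda.min")  # predictions for the first 5 observations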
32 Lasso regression coefficients
Clearly the regression coefficients are smaller for the lasso regression (this slide) than for least squares (next slide).

round(coef(lasso.fit, s = c(5, 1, 0.5, 0)), 2)
## 11 x 4 sparse Matrix of class "dgCMatrix"
## (Rows: (Intercept), age, sex, bmi, map, tc, ldl, hdl, tch, ltg, glu;
## one column per value of s; values omitted.)
33 Lasso regression vs least squares
round(coef(lm.fit), 2)
## (Least squares coefficients for the intercept and the 10 predictors; values omitted.)
beta.lasso = coef(lasso.fit, s = 1)
sum(abs(beta.lasso[-1]))
## (L1 norm of the lasso coefficients; value omitted.)
beta.ls = coef(lm.fit)
sum(abs(beta.ls[-1]))
## (L1 norm of the least squares coefficients; value omitted.)
34 Lasso coefficient plot
plot(lasso.fit, label = TRUE, cex.axis = 1.5, cex.lab = 1.5)
35 We observe that the lasso:
- Can shrink coefficients exactly to zero
- Simultaneously estimates and selects coefficients
- Paths are not necessarily monotone (e.g. the coefficients for hdl in the above table for different values of s)
36 Further remarks on lasso
- lasso is an acronym for least absolute shrinkage and selection operator
- Theoretically, when the tuning parameter λ = 0 there is no bias and β̂_lasso = β̂_LS (for p < n); however, glmnet approximates the solution
- When λ = ∞, we get β̂_lasso = 0, an estimator with zero variance
- For λ in between, we are balancing bias and variance
- As λ increases, more coefficients are set to zero (fewer variables are selected), and among the nonzero coefficients, more shrinkage is employed
37 Why does the lasso give zero coefficients?
This figure is taken from James, Witten, Hastie, and Tibshirani (2013), "An Introduction to Statistical Learning, with applications in R", with permission from the authors.
38 Ridge is least squares with L2 penalty
As with the lasso, ridge regression shrinks the estimated regression coefficients by imposing a penalty on their sizes:

minimise RSS(β) = (y − Xβ)^T (y − Xβ) subject to ∑_{j=1}^p β_j^2 ≤ t,

where the positive tuning parameter λ controls the shrinkage. However, unlike the lasso, ridge regression does not shrink the parameters all the way to zero. The ridge coefficients have a closed form solution (even when n < p):

β̂_ridge(λ) = (X^T X + λI)^{-1} X^T y
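A minimal sketch of this closed-form solution, assuming standardised predictors and a centred response so that no intercept is needed (glmnet standardises internally and scales λ differently, so the numbers will not match the cv.glmnet output below exactly; lambda = 10 is an arbitrary illustrative value):

ridge.coef = function(X, y, lambda)
  solve(t(X) %*% X + lambda * diag(ncol(X)), t(X) %*% y)
ridge.coef(scale(x), y - mean(y), lambda = 10)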
39 Fitting a ridge regression
Ridge regression with the glmnet() function, by changing one of its default options to alpha = 0:

ridge.fit = glmnet(x, y, alpha = 0)
ridge.cv = cv.glmnet(x, y, alpha = 0)
ridge.cv$lambda.min
## (value omitted)

Comments:
- In general, the optimal λ for ridge is very different from the optimal λ for the lasso
- The scale of β_j^2 and |β_j| is very different
- α is known as the elastic-net mixing parameter
40 Elastic-net: best of both worlds?
Minimising the (penalised) log-likelihood in linear regression with iid errors is equivalent to minimising the (penalised) RSS. More generally, the elastic-net minimises the following objective function over β_0 and β, over a grid of values of λ covering the entire range:

(1/N) ∑_{i=1}^N w_i l(y_i, β_0 + β^T x_i) + λ [ (1 − α) ∑_{j=1}^p β_j^2 + α ∑_{j=1}^p |β_j| ]

This is general enough to include logistic regression, generalised linear models and Cox regression, where l(y_i, ·) is the log-likelihood contribution of observation i and w_i is an optional observation weight.
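A sketch of an elastic-net fit using the mixing parameter directly (alpha = 1 recovers the lasso, alpha = 0 the ridge fit; alpha = 0.5 below is an arbitrary illustrative choice):

enet.fit = glmnet(x, y, alpha = 0.5)
enet.cv = cv.glmnet(x, y, alpha = 0.5)
coef(enet.cv, s = "lambda.min")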
41 References
Akaike, H. (1973). "Information theory and an extension of the maximum likelihood principle". In: Second International Symposium on Information Theory. Ed. by B. Petrov and F. Csaki. Budapest: Akademiai Kiado, pp. 267-281.
James, G., D. Witten, T. Hastie, and R. Tibshirani (2013). An Introduction to Statistical Learning with Applications in R. New York, NY: Springer.
Johnson, R. W. (1996). "Fitting Percentage of Body Fat to Simple Body Measurements". In: Journal of Statistics Education 4.1.
Lumley, T. and A. Miller (2009). leaps: regression subset selection. R package version 2.9.
Schwarz, G. (1978). "Estimating the Dimension of a Model". In: The Annals of Statistics 6.2, pp. 461-464.
Wiegand, R. E. (2010). "Performance of using multiple stepwise algorithms for variable selection". In: Statistics in Medicine 29.15.
Wilkinson, L. (1998). SYSTAT 8 statistics manual. SPSS Inc.
42 devtools::session_info(include_base = FALSE)
## (Session info output: R version 3.5.x on macOS Mojave, x86_64, darwin;
## attached packages include glmnet, leaps, mplot, Matrix, foreach, knitr
## and xaringan, with versions as on CRAN at the time of compilation.)