Package FWDselect. December 19, 2015


Title: Selecting Variables in Regression Models
Version: 2.1.0
Date: 2015-12-18
Author: Marta Sestelo [aut, cre], Nora M. Villanueva [aut], Javier Roca-Pardinas [aut]
Maintainer: Marta Sestelo <sestelo@uvigo.es>
Description: A simple method to select the best model or best subset of variables using different types of data (binary, Gaussian or Poisson) and applying it in different contexts (parametric or non-parametric).
URL: http://cran.r-project.org/package=fwdselect
BugReports: http://github.com/sestelo/fwdselect/issues
Depends: R (>= 3.1.0)
License: MIT + file LICENSE
LazyData: true
Imports: cvTools, mgcv, parallel, graphics, stats
RoxygenNote: 5.0.0
NeedsCompilation: no
Repository: CRAN
Date/Publication: 2015-12-19 09:34:15

R topics documented:

    diabetes
    episode
    FWDselect
    plot.qselection
    pollution
    print.qselection
    print.selection
    qselection
    selection
    test


diabetes    Diabetes data

Description

The diabetes data is a data frame with 11 variables and 442 measurements. These are the data used in Efron et al. (2004). The data have been standardized so that each column has zero mean and unit L2 norm.

Usage

    diabetes

Format

diabetes is a data frame with 11 variables (columns). The first column contains the response variable (diabetes$y), a quantitative measure of disease progression one year after baseline. The remaining columns contain the ten explanatory variables (age, sex, body mass index, average blood pressure and six blood serum measurements) recorded for each of the 442 diabetes patients.

Source

The original data are available in the lars package; see http://cran.r-project.org/web/packages/lars/.

References

Efron, B., Hastie, T., Johnstone, I. and Tibshirani, R. (2004). Least angle regression (with discussion). Annals of Statistics, 32:407-499.

Examples

    library(FWDselect)
    data(diabetes)
    head(diabetes)
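The standardization described for the diabetes data (zero mean and unit L2 norm per column) can be sketched in base R as follows. This is only an illustration: the helper name unit_l2 is hypothetical and not part of FWDselect, and the built-in mtcars data set stands in for the raw measurements.

```r
# Hypothetical helper (not part of FWDselect): centre each column and
# rescale it to unit Euclidean (L2) norm.
unit_l2 <- function(df) {
  m <- as.matrix(df)
  centred <- sweep(m, 2, colMeans(m))           # subtract the column means
  norms <- sqrt(colSums(centred^2))             # L2 norm of each centred column
  as.data.frame(sweep(centred, 2, norms, "/"))  # divide each column by its norm
}

# mtcars is used purely as stand-in data.
z <- unit_l2(mtcars[, c("mpg", "hp", "wt")])
colMeans(z)   # zero up to floating-point error
colSums(z^2)  # one up to floating-point error
```

Note that this scaling differs from scale(), which divides by the sample standard deviation rather than by the L2 norm.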

episode    Episode of SO2. Pollution incident data.

Description

Registered values of SO2 at different time instants. Each column of the dataset corresponds to the value of the series of bi-hourly SO2 means at instant t (instants are 5 minutes apart). The values in this dataset exceed the maximum permitted atmospheric SO2 level.

Usage

    episode

Format

episode is a data frame with 19 variables (columns).

    Y      Response variable: registered value of SO2 at a specific time instant, in microg/m3n. This is the value that we want to predict.
    In0    Registered value of SO2 at instant zero, in microg/m3n.
    In1    Registered value of SO2 one 5-min instant earlier, in microg/m3n.
    In2    Registered value of SO2 two 5-min instants (10 min) earlier, in microg/m3n.
    ...

Examples

    data(episode)
    head(episode)


FWDselect    FWDselect: Selecting Variables in Regression Models.

Details

This package introduces a simple method to select the best model using different types of data (binary, Gaussian or Poisson) and applying it in different contexts (parametric or non-parametric). The proposed method is a new forward stepwise-based selection procedure that selects a model containing a subset of variables according to an optimal criterion (obtained by cross-validation) while also taking the computational cost into account. Additionally, bootstrap resampling techniques are used to implement tests capable of detecting whether the unselected variables still have significant effects in the model.

    Package:  FWDselect
    Type:     Package
    Version:  2.1.0
    Date:     2015-12-18
    License:  MIT + file LICENSE

The name FWDselect is short for "forward selection", which summarizes one of the package's main functionalities: a forward stepwise-based selection procedure. The software helps the user select relevant variables and decide how many of them need to be included in a regression model. It provides both numerical and graphical output.

The package includes several functions for selecting the variables to be included in linear, generalized linear or generalized additive regression models. The best combination of q variables is obtained with the main function, selection. To obtain results for more than one subset size, use qselection, which returns a summary table showing the subset sizes, the selected variables and the information criterion values. The object returned by qselection is the argument required by plot.qselection, which provides a graphical output. Finally, the test function determines the number of variables that should be introduced into the model.

Author(s)

Marta Sestelo, Nora M. Villanueva and Javier Roca-Pardinas.

References

Burnham, K., Anderson, D. (2002). Model selection and multimodel inference: a practical information-theoretic approach. 2nd Edition. Springer.

Efron, B. (1979). Bootstrap methods: another look at the jackknife. Annals of Statistics, 7:1-26.

Efron, B. and Tibshirani, R. J. (1993). An introduction to the Bootstrap. Chapman and Hall, London.

Miller, A. (2002). Subset selection in regression. Chapman and Hall.

Sestelo, M., Villanueva, N. M. and Roca-Pardinas, J. (2013). FWDselect: Variable selection algorithm in regression models. Discussion Papers in Statistics and Operation Research, University of Vigo, 13/02.

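To make the idea of a forward stepwise search driven by a cross-validated criterion concrete, here is a minimal base-R sketch. It is a plain greedy forward selection using K-fold cross-validated mean squared prediction error for lm fits, and it is an assumption-laden illustration rather than the algorithm implemented in FWDselect. The helper names cv_variance and forward_select are hypothetical, and mtcars is used as stand-in data.

```r
# Hypothetical helper: K-fold cross-validated mean squared prediction
# error of lm(y ~ vars), for a fixed assignment of observations to folds.
cv_variance <- function(x, y, vars, folds) {
  sq_err <- numeric(length(y))
  for (k in unique(folds)) {
    test_idx <- folds == k
    train <- data.frame(y = y[!test_idx], x[!test_idx, vars, drop = FALSE])
    fit <- lm(y ~ ., data = train)
    pred <- predict(fit, newdata = x[test_idx, vars, drop = FALSE])
    sq_err[test_idx] <- (y[test_idx] - pred)^2
  }
  mean(sq_err)
}

# Hypothetical helper: greedy forward selection of q variables. At each
# step the variable giving the lowest cross-validated error is added.
forward_select <- function(x, y, q, nfolds = 5) {
  folds <- sample(rep(seq_len(nfolds), length.out = length(y)))
  selected <- character(0)
  for (step in seq_len(q)) {
    candidates <- setdiff(names(x), selected)
    scores <- vapply(
      candidates,
      function(v) cv_variance(x, y, c(selected, v), folds),
      numeric(1)
    )
    selected <- c(selected, names(which.min(scores)))
  }
  list(variables = selected,
       criterion = cv_variance(x, y, selected, folds))
}

set.seed(1)  # the fold assignment is random
x <- mtcars[, c("cyl", "disp", "hp", "wt", "qsec")]
res <- forward_select(x, mtcars$mpg, q = 2)
res$variables
res$criterion
```

Because each step only refits models that extend the current subset, the cost grows roughly linearly in the number of candidate variables per step, which is the usual motivation for forward procedures over exhaustive subset search.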

plot.qselection    Visualization of qselection object

Description

This function plots the cross-validation information criterion for several subsets of size q chosen by the user.

Usage

    ## S3 method for class 'qselection'
    plot(x = object, y = NULL, ylab = NULL, ...)

Arguments

    x       qselection object.
    y       NULL
    ylab    NULL
    ...     Other options.

Value

Simply returns a plot.

Author(s)

Marta Sestelo, Nora M. Villanueva and Javier Roca-Pardinas.

See Also

selection.

Examples

    library(FWDselect)
    data(diabetes)
    x = diabetes[, 2:11]
    y = diabetes[, 1]
    obj2 = qselection(x, y, qvector = c(1:9), method = "lm",
                      criterion = "variance", cluster = FALSE)
    plot(obj2)


pollution    Emission of SO2. Pollution incident data.

Description

Registered values of SO2 at different time instants. Each column of the dataset corresponds to the value of the series of bi-hourly SO2 means at instant t (instants are 5 minutes apart).

Usage

    pollution

Format

pollution is a data frame with 19 variables (columns).

    Y      Response variable: registered value of SO2 at a specific time instant, in microg/m3n. This is the value that we want to predict.
    In0    Registered value of SO2 at instant zero, in microg/m3n.
    In1    Registered value of SO2 one 5-min instant earlier, in microg/m3n.
    In2    Registered value of SO2 two 5-min instants (10 min) earlier, in microg/m3n.
    ...

Examples

    data(pollution)
    head(pollution)


print.qselection    Short qselection summary

Description

qselection summary

Usage

    ## S3 method for class 'qselection'
    print(x = object, ...)

Arguments

    x      qselection object.
    ...    Other options.

Value

The function returns a summary table with the subsets of size q, their information criterion values and the chosen variables for each one. Additionally, an asterisk is shown next to the subset size that minimizes the information criterion.

Author(s)

Marta Sestelo, Nora M. Villanueva and Javier Roca-Pardinas.

See Also

selection.

Examples

    library(FWDselect)
    data(diabetes)
    x = diabetes[, 2:11]
    y = diabetes[, 1]
    obj2 = qselection(x, y, qvector = c(1:9), method = "lm",
                      criterion = "variance", cluster = FALSE)
    obj2


print.selection    Short selection summary

Description

selection summary

Usage

    ## S3 method for class 'selection'
    print(x = model, ...)

Arguments

    x      selection object.
    ...    Other options.

Value

The function returns the best subset of size q and its information criterion value. If seconds = TRUE, this information is returned for each alternative model.

Author(s)

Marta Sestelo, Nora M. Villanueva and Javier Roca-Pardinas.

See Also

selection.

Examples

    library(FWDselect)
    data(diabetes)
    x = diabetes[, 2:11]
    y = diabetes[, 1]
    obj1 = selection(x, y, q = 1, method = "lm",
                     criterion = "variance", cluster = FALSE)
    obj1

qselection    Selecting variables for several subset sizes

Description

Function that obtains the best variables for more than one subset size. Returns a table with the covariates chosen for each model and the corresponding information criterion values. Additionally, an asterisk is shown next to the subset size that minimizes the information criterion.

Usage

    qselection(x, y, qvector, criterion = "deviance", method = "lm",
               family = "gaussian", nfolds = 5, cluster = TRUE, ncores = NULL)

Arguments

    x            A data frame containing all the covariates.
    y            A vector with the response values.
    qvector      A vector with more than one variable-subset size to be selected.
    criterion    The information criterion to be used. Default is the deviance. Other options are the coefficient of determination ("R2"), the residual variance ("variance"), the Akaike information criterion ("aic"), AIC with a correction for finite sample sizes ("aicc") and the Bayesian information criterion ("bic"). The deviance, coefficient of determination and variance are calculated by cross-validation.
    method       A character string specifying which regression method is used, i.e., linear models ("lm"), generalized linear models ("glm") or generalized additive models ("gam").
    family       A description of the error distribution and link function to be used in the model: ("gaussian"), ("binomial") or ("poisson").
    nfolds       Number of folds for the cross-validation procedure, used with the deviance, R2 or variance criterion.
    cluster      A logical value. If TRUE (default), the procedure is parallelized. Note that in some cases (e.g., a low number of initial variables) there are too few repetitions for parallelization to pay off, and serial computation is faster. R needs time to distribute tasks across the processors and to bind the results together afterwards. If this overhead exceeds the time needed for single-thread computation, parallelizing is not worthwhile.
    ncores       An integer value specifying the number of cores to be used in the parallelized procedure. If NULL (default), the number of cores used is the number of cores of the machine minus 1.
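
As a reminder of how the likelihood-based options for criterion relate to each other, the following base-R sketch computes AIC, AICc and BIC for a single lm fit using their standard textbook definitions. Whether FWDselect counts the parameters k in exactly this way is an assumption made here for illustration; mtcars is stand-in data.

```r
# Standard definitions, with L the maximized likelihood, k the number of
# estimated parameters and n the number of observations:
#   AIC  = -2 log L + 2 k
#   AICc = AIC + 2 k (k + 1) / (n - k - 1)
#   BIC  = -2 log L + log(n) k
fit <- lm(mpg ~ wt + hp, data = mtcars)

n <- nobs(fit)
k <- attr(logLik(fit), "df")  # coefficients plus the error variance

aic  <- AIC(fit)
bic  <- BIC(fit)
aicc <- aic + 2 * k * (k + 1) / (n - k - 1)  # small-sample correction

c(aic = aic, aicc = aicc, bic = bic)
```

The correction term vanishes as n grows, so "aic" and "aicc" lead to the same ranking of subsets in large samples and differ mainly when n is small relative to k.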

Value

    q            A vector of subset sizes.
    criterion    A vector of information criterion values.
    selection    Selected variables for each size.

Author(s)

Marta Sestelo, Nora M. Villanueva and Javier Roca-Pardinas.

See Also

selection, plot.qselection.

Examples

    library(FWDselect)
    data(diabetes)
    x = diabetes[, 2:11]
    y = diabetes[, 1]
    obj2 = qselection(x, y, qvector = c(1:9), method = "lm",
                      criterion = "variance", cluster = FALSE)
    obj2


selection    Selecting a subset of q variables

Description

Main function for selecting the best subset of q variables. Note that the selection procedure can be used with lm, glm or gam functions.

Usage

    selection(x, y, q, prevar = NULL, criterion = "deviance", method = "lm",
              family = "gaussian", seconds = FALSE, nmodels = 1, nfolds = 5,
              cluster = TRUE, ncores = NULL)

Arguments

    x            A data frame containing all the covariates.
    y            A vector with the response values.
    q            An integer specifying the size of the subset of variables to be selected.
    prevar       A vector containing the numbers of the variables in the best subset of size q-1. NULL by default.

    criterion    The information criterion to be used. Default is the deviance. Other options are the coefficient of determination ("R2"), the residual variance ("variance"), the Akaike information criterion ("aic"), AIC with a correction for finite sample sizes ("aicc") and the Bayesian information criterion ("bic"). The deviance, coefficient of determination and variance are calculated by cross-validation.
    method       A character string specifying which regression method is used, i.e., linear models ("lm"), generalized linear models ("glm") or generalized additive models ("gam").
    family       A description of the error distribution and link function to be used in the model: ("gaussian"), ("binomial") or ("poisson").
    seconds      A logical value. FALSE by default. If TRUE, rather than returning only the single best model, the function returns several of the best (equivalent) models.
    nmodels      Number of secondary models to be returned.
    nfolds       Number of folds for the cross-validation procedure, used with the deviance, R2 or variance criterion.
    cluster      A logical value. If TRUE (default), the procedure is parallelized. Note that in some cases (e.g., a low number of initial variables) there are too few repetitions for parallelization to pay off, and serial computation is faster. R needs time to distribute tasks across the processors and to bind the results together afterwards. If this overhead exceeds the time needed for single-thread computation, parallelizing is not worthwhile.
    ncores       An integer value specifying the number of cores to be used in the parallelized procedure. If NULL (default), the number of cores used is the number of cores of the machine minus 1.

Value

    Best model               The best model. If seconds = TRUE, the best alternative models are also returned.
    Variable name            Names of the variables.
    Variable number          Numbers of the variables.
    Information criterion    Information criterion used and its value.
    Prediction               The prediction of the best model.

Author(s)

Marta Sestelo, Nora M. Villanueva and Javier Roca-Pardinas.

Examples

    library(FWDselect)
    data(diabetes)
    x = diabetes[, 2:11]
    y = diabetes[, 1]

    obj1 = selection(x, y, q = 1, method = "lm",
                     criterion = "variance", cluster = FALSE)
    obj1

    # second models
    obj11 = selection(x, y, q = 1, method = "lm", criterion = "variance",
                      seconds = TRUE, nmodels = 2, cluster = FALSE)
    obj11

    # prevar argument
    obj2 = selection(x, y, q = 2, method = "lm",
                     criterion = "variance", cluster = FALSE)
    obj2
    obj3 = selection(x, y, q = 3, prevar = obj2$Variable_numbers,
                     method = "lm", criterion = "variance", cluster = FALSE)


test    Bootstrap based test for covariate selection

Description

Function that applies a bootstrap based test for covariate selection. It helps to determine the number of variables to be included in the model.

Usage

    test(x, y, method = "lm", family = "gaussian", nboot = 50,
         speedup = TRUE, qmin = NULL, unique = FALSE, q = NULL,
         bootseed = NULL, cluster = TRUE, ncores = NULL)

Arguments

    x           A data frame containing all the covariates.
    y           A vector with the response values.
    method      A character string specifying which regression method is used, i.e., linear models ("lm"), generalized linear models ("glm") or generalized additive models ("gam").
    family      A description of the error distribution and link function to be used in the model: ("gaussian"), ("binomial") or ("poisson").
    nboot       Number of bootstrap repeats.
    speedup     A logical value. If TRUE (default), the testing procedure is computationally efficient, since the alternative model is fitted with just one more variable than the null model. If FALSE, the alternative model is fitted with the best subset of variables of size greater than q, i.e., the one that minimizes an information criterion. The size of this subset must be supplied by the user through the argument qmin.

    qmin        NULL by default. If speedup is FALSE, qmin is an integer chosen by the user. To help select this value, it is recommended to inspect the graphical output of the plot function and choose the number q that minimizes the curve.
    unique      A logical value. FALSE by default. If TRUE, the test is performed only for one null hypothesis, given by the argument q.
    q           NULL by default. If unique is TRUE, q is the size of the subset of variables to be tested.
    bootseed    Seed to be used in the bootstrap procedure.
    cluster     A logical value. If TRUE (default), the testing procedure is parallelized.
    ncores      An integer value specifying the number of cores to be used in the parallelized procedure. If NULL (default), the number of cores used is the number of cores of the machine minus 1.

Details

In a regression framework, let $X_1, X_2, \ldots, X_p$ be a set of $p$ initial variables and $Y$ the response variable. We propose a procedure to test the null hypothesis of $q$ significant variables in the model ($q$ effects not equal to zero) versus the alternative in which the model contains more than $q$ variables. Based on the general model

    $Y = m(\mathbf{X}) + \varepsilon$, where $m(\mathbf{X}) = m_1(X_1) + m_2(X_2) + \ldots + m_p(X_p)$,

the following strategy is considered: for a subset of size $q$, we test the null hypothesis

    $H_0(q): \sum_{j=1}^{p} I_{\{m_j \neq 0\}} \leq q$

versus the general hypothesis

    $H_1: \sum_{j=1}^{p} I_{\{m_j \neq 0\}} > q$

Value

A list with two objects. The first one is a table containing

    Hypothesis    Number of the null hypothesis tested
    Statistic     Value of the T statistic
    pvalue        p-value obtained in the testing procedure
    Decision      Result of the test for a significance level of 0.05

The second object, nvar, indicates the number of variables that have to be included in the model.

Note

The detailed expressions of the formulas are described in the HTML help: http://cran.r-project.org/web/packages/fwdselect/fwdselect.pdf

Author(s)

Marta Sestelo, Nora M. Villanueva and Javier Roca-Pardinas.

References

Sestelo, M., Villanueva, N. M. and Roca-Pardinas, J. (2013). FWDselect: an R package for selecting variables in regression models. Discussion Papers in Statistics and Operation Research, University of Vigo, 13/01.

See Also

selection

Examples

    library(FWDselect)
    data(diabetes)
    x = diabetes[, 2:11]
    y = diabetes[, 1]
    test(x, y, method = "lm", cluster = FALSE, nboot = 5)

    ## for speedup = FALSE
    # obj2 = qselection(x, y, qvector = c(1:9), method = "lm",
    #                   cluster = FALSE)
    # plot(obj2) # we choose q = 7 for the argument qmin
    # test(x, y, method = "lm", cluster = FALSE, nboot = 5,
    #      speedup = FALSE, qmin = 7)
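
The general logic of a bootstrap test of "q variables are enough" can be illustrated with a deliberately simplified base-R sketch. It uses a residual bootstrap under a fixed null subset and a statistic based on the relative drop in residual sum of squares. Both choices are assumptions made for illustration, and this is not the statistic or resampling scheme implemented in the package's test function. The helper name boot_test is hypothetical and mtcars is stand-in data.

```r
# Simplified illustration only; NOT the procedure used by FWDselect.
# H0: the variables in `null_vars` are sufficient to explain y.
boot_test <- function(x, y, null_vars, nboot = 200) {
  # Statistic: relative drop in residual sum of squares when all remaining
  # variables are added to the null model (always >= 0 for nested lm fits).
  stat <- function(resp) {
    d0 <- data.frame(resp = resp, x[, null_vars, drop = FALSE])
    d1 <- data.frame(resp = resp, x)
    rss0 <- sum(resid(lm(resp ~ ., data = d0))^2)
    rss1 <- sum(resid(lm(resp ~ ., data = d1))^2)
    (rss0 - rss1) / rss1
  }

  null_fit <- lm(y ~ ., data = data.frame(y = y, x[, null_vars, drop = FALSE]))
  t_obs <- stat(y)

  # Residual bootstrap: new responses are generated under H0 from the
  # fitted null model plus resampled null residuals.
  t_boot <- replicate(nboot, {
    y_star <- fitted(null_fit) + sample(resid(null_fit), replace = TRUE)
    stat(y_star)
  })

  list(statistic = t_obs, pvalue = mean(t_boot >= t_obs))
}

set.seed(1)
x <- mtcars[, c("wt", "hp", "qsec", "drat")]
out <- boot_test(x, mtcars$mpg, null_vars = "wt", nboot = 100)
out
```

A small p-value suggests that the variables left out of the null subset still carry signal, which corresponds to moving on to a larger q in a sequential procedure.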
