Package CALIBERrfimpute


Package 'CALIBERrfimpute'                                       June 11, 2018

Type Package
Title Multiple Imputation Using MICE and Random Forest
Version 1.0-1
Date 2018-06-05
Description Functions to impute using Random Forest under Full Conditional
      Specifications (Multivariate Imputation by Chained Equations). The
      CALIBER programme is funded by the Wellcome Trust (086091/Z/08/Z) and
      the National Institute for Health Research (NIHR) under its Programme
      Grants for Applied Research programme (RP-PG-0407-10314). The author is
      supported by a Wellcome Trust Clinical Research Training Fellowship
      (0938/30/Z/10/Z).
License GPL-3
Depends mice (>= 2.20)
Imports mvtnorm, randomForest
Suggests missForest, rpart, survival, xtable
Author Anoop Shah [aut, cre], Jonathan Bartlett [ctb], Harry Hemingway [ths],
      Owen Nicholas [ths], Aroon Hingorani [ths]
Maintainer Anoop Shah <anoop@doctors.org.uk>
NeedsCompilation no
Repository CRAN
Date/Publication 2018-06-11 10:40:00 UTC

R topics documented:
      CALIBERrfimpute-package
      makemar
      mice.impute.rfcat
      mice.impute.rfcont
      setrfoptions
      simdata

CALIBERrfimpute-package
                        Imputation in MICE using Random Forest

Details

Multivariate Imputation by Chained Equations (MICE) is commonly used to impute missing values in analysis datasets using full conditional specifications. However, it requires that the predictor models are specified correctly, including interactions and nonlinearities. Random Forest is a regression and classification method which can accommodate interactions and non-linearities without requiring a particular statistical model to be specified.

The mice package provides the mice.impute.rf function for imputation using Random Forest, as of version 2.20. The CALIBERrfimpute package provides different, independently developed imputation functions using Random Forest in MICE.

This package contains reports of two simulation studies: the simulation study vignette compares Random Forest and parametric MICE in a linear regression example, and the vignette on survival analysis with interactions compares the Random Forest MICE algorithm for continuous variables (mice.impute.rfcont) with parametric MICE and the algorithms of Doove et al. in the mice package (mice.impute.cart and mice.impute.rf).

Package: CALIBERrfimpute
Type: Package
Version: 1.0-1
Date: 2018-06-05
License: GPL-3

Author(s)

Anoop Shah
Maintainer: anoop@doctors.org.uk

References

Shah AD, Bartlett JW, Carpenter J, Nicholas O, Hemingway H. Comparison of Random Forest and parametric imputation models for imputing missing data using MICE: a CALIBER study. American Journal of Epidemiology 2014. doi: 10.1093/aje/kwt312

Doove LL, van Buuren S, Dusseldorp E. Recursive partitioning for missing data imputation in the presence of interaction effects. Computational Statistics and Data Analysis 2014;72:92-104. doi: 10.1016/j.csda.2013.10.025

See Also

mice, randomForest, mice.impute.rfcont, mice.impute.rfcat, mice.impute.rf

makemar                 Creates artificial missing at random missingness

Description

Introduces missingness into x1 and x2 of a data.frame of the format produced by simdata, for use in the simulation study. The probability of missingness depends on the logistic of the fully observed variables y and x3; the data are therefore missing at random but not missing completely at random.

Usage

makemar(simdata, prop = 0.2)

Arguments

simdata    simulated dataset created by simdata.
prop       proportion of missing values to be introduced in x1 and x2.

Details

This function is used for simulation and testing.

Value

A data.frame with columns:

y     dependent variable, generated by the model y = x1 + x2 + x3 + normal error
x1    partially observed continuous variable
x2    partially observed continuous or binary (factor) variable
x3    fully observed continuous variable
x4    variable not in the model to predict y, but associated with x1, x2 and x3; used as an auxiliary variable in imputation

See Also

simdata
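The missing-at-random mechanism described above (missingness probability given by the logistic of the fully observed y and x3) can be sketched in base R. The intercept and slope values below are hypothetical illustration values, not the coefficients makemar actually uses:

```r
# Sketch of a MAR mechanism like makemar's: the probability that x1 is
# missing depends only on the fully observed variables y and x3.
# The intercept and slopes here are hypothetical illustration values.
set.seed(1)
n  <- 1000
x3 <- rnorm(n)
x1 <- x3 + rnorm(n)
y  <- x1 + x3 + rnorm(n)

p_miss <- plogis(-1.5 + 0.5 * y + 0.5 * x3)  # logistic of y and x3
x1[runif(n) < p_miss] <- NA                  # MAR, but not MCAR

mean(is.na(x1))  # observed proportion of missingness
```

Because y and x3 remain fully observed, the missingness probability is computable from observed data, which is what makes this mechanism missing at random rather than missing not at random.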

Examples

set.seed(1)
mydata <- simdata(n=100)
mymardata <- makemar(mydata, prop=0.1)

# Count the number of missing values
sapply(mymardata, function(x){sum(is.na(x))})
#  y x1 x2 x3 x4
#  0 11 10  0  0

mice.impute.rfcat       Impute categorical variables using Random Forest
                        within MICE

Description

This method can be used to impute logical or factor variables (binary or >2 levels) in MICE by specifying method = 'rfcat'. It was developed independently of the mice.impute.rf algorithm of Doove et al., and differs from it in some respects.

Usage

mice.impute.rfcat(y, ry, x, ntree_cat = NULL, nodesize_cat = NULL,
    maxnodes_cat = NULL, ntree = NULL, ...)

Arguments

y             a logical or factor vector of observed and missing values of the variable to be imputed.
ry            a logical vector stating whether y is observed or not.
x             a matrix of predictors to impute y.
ntree_cat     number of trees, default = 10. A global option can be set thus: setrfoptions(ntree_cat=10).
nodesize_cat  minimum size of nodes, default = 1. A global option can be set thus: setrfoptions(nodesize_cat=1). Smaller values of nodesize create finer, more precise trees but increase the computation time.
maxnodes_cat  maximum number of nodes, default NULL. If NULL, the number of nodes is determined by the number of observations and nodesize_cat.
ntree         an alternative argument for specifying the number of trees, overridden by ntree_cat. Provided for consistency with the mice.impute.rf function.
...           other arguments to pass to randomForest.

Details

This Random Forest imputation algorithm has been developed as an alternative to logistic or polytomous regression, and can accommodate non-linear relations and interactions among the predictor variables without requiring them to be specified in the model. The algorithm takes a bootstrap sample of the data to simulate sampling variability, fits a set of classification trees, and chooses each imputed value as the prediction of a randomly chosen tree.

Value

A vector of imputed values of y.

Note

This algorithm has been tested on simulated data and in survival analysis of real data with artificially introduced missingness completely at random. There was slight bias in hazard ratios compared to polytomous regression, but coverage of confidence intervals was correct.

Author(s)

Anoop Shah

References

Shah AD, Bartlett JW, Carpenter J, Nicholas O, Hemingway H. Comparison of Random Forest and parametric imputation models for imputing missing data using MICE: a CALIBER study. American Journal of Epidemiology 2014. doi: 10.1093/aje/kwt312

See Also

setrfoptions, mice.impute.rfcont, mice, mice.impute.rf, mice.impute.cart, randomForest

Examples

set.seed(1)

# A small sample dataset
mydata <- data.frame(
    x1 = as.factor(c('this', 'this', NA, 'that', 'this')),
    x2 = 1:5,
    x3 = c(TRUE, FALSE, TRUE, NA, FALSE))
mice(mydata, method = c('logreg', 'norm', 'logreg'), m = 2, maxit = 2)
mice(mydata[, 1:2], method = c('rfcat', 'rfcont'), m = 2, maxit = 2)
mice(mydata, method = c('rfcat', 'rfcont', 'rfcat'), m = 2, maxit = 2)

# A larger simulated dataset
mydata <- simdata(100, x2binary = TRUE)
mymardata <- makemar(mydata)
cat('\nNumber of missing values:\n')
print(sapply(mymardata, function(x){sum(is.na(x))}))

# Test imputation of a single column in a two-column dataset
cat('\nTest imputation of a simple dataset\n')
print(mice(mymardata[, c('y', 'x2')], method = 'rfcat'))

# Analyse data
cat('\nFull data analysis:\n')
print(summary(lm(y ~ x1 + x2 + x3, data = mydata)))

cat('\nMICE normal and logistic:\n')
print(summary(pool(with(mice(mymardata,
    method = c('', 'norm', 'logreg', '', '')), lm(y ~ x1 + x2 + x3)))))

# Set options for Random Forest
setrfoptions(ntree_cat = 10)
cat('\nMICE using Random Forest:\n')
print(summary(pool(with(mice(mymardata,
    method = c('', 'rfcont', 'rfcat', '', '')), lm(y ~ x1 + x2 + x3)))))

cat('\nDataset with unobserved levels of a factor\n')
data3 <- data.frame(x1 = 1:100,
    x2 = factor(c(rep('a', 25), rep('b', 25), rep('c', 25), rep('d', 25))))
data3$x2[data3$x2 == 'd'] <- NA
mice(data3, method = c('', 'rfcat'))

mice.impute.rfcont      Impute continuous variables using Random Forest
                        within MICE

Description

This method can be used to impute continuous variables in MICE by specifying method = 'rfcont'. It was developed independently of the mice.impute.rf algorithm of Doove et al., and differs from it in drawing imputed values from a normal distribution.

Usage

mice.impute.rfcont(y, ry, x, ntree_cont = NULL, nodesize_cont = NULL,
    maxnodes_cont = NULL, ntree = NULL, ...)

Arguments

y            a vector of observed and missing values of the variable to be imputed.
ry           a logical vector stating whether y is observed or not.
x            a matrix of predictors to impute y.
ntree_cont   number of trees, default = 10. A global option can be set thus: setrfoptions(ntree_cont=10).

nodesize_cont  minimum size of nodes, default = 5. A global option can be set thus: setrfoptions(nodesize_cont=5). Smaller values of nodesize create finer, more precise trees but increase the computation time.
maxnodes_cont  maximum number of nodes, default NULL. If NULL, the number of nodes is determined by the number of observations and nodesize_cont.
ntree          an alternative argument for specifying the number of trees, overridden by ntree_cont. Provided for consistency with the mice.impute.rf function.
...            other arguments to pass to randomForest.

Details

This Random Forest imputation algorithm has been developed as an alternative to normal-based linear regression, and can accommodate non-linear relations and interactions among the predictor variables without requiring them to be specified in the model. The algorithm takes a bootstrap sample of the data to simulate sampling variability, fits a regression forest and calculates the out-of-bag mean squared error. Each value is then imputed as a random draw from a normal distribution with mean defined by the Random Forest prediction and variance equal to the out-of-bag mean squared error. If only one tree is used (not recommended), a bootstrap sample is not taken in the first stage, because the Random Forest algorithm performs an internal bootstrap sample before fitting the tree.

Value

A vector of imputed values of y.

Note

This algorithm has been tested on simulated data with linear regression, and in survival analysis of real data with artificially introduced missingness at random. On the simulated data there was slight bias if the distribution of missing values was very different from that of the observed values, because imputed values were closer to the centre of the data than the missing values. However, in the survival analysis the hazard ratios were unbiased, coverage of confidence intervals was more conservative than with normal-based MICE, and the mean length of confidence intervals was shorter with mice.impute.rfcont.
Author(s)

Anoop Shah

References

Shah AD, Bartlett JW, Carpenter J, Nicholas O, Hemingway H. Comparison of Random Forest and parametric imputation models for imputing missing data using MICE: a CALIBER study. American Journal of Epidemiology 2014. doi: 10.1093/aje/kwt312

See Also

setrfoptions, mice.impute.rfcat, mice, mice.impute.rf, mice.impute.cart, randomForest
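The imputation draw described in Details can be sketched in base R. The prediction values and out-of-bag mean squared error below are hypothetical stand-ins for quantities that would come from a fitted randomForest model:

```r
# Sketch of the rfcont imputation draw (illustration only): each missing
# value is drawn from a normal distribution centred on the Random Forest
# prediction, with variance equal to the out-of-bag mean squared error.
set.seed(1)

rf_pred <- c(1.2, -0.4, 0.7)  # hypothetical forest predictions for 3 missing values
oob_mse <- 0.25               # hypothetical out-of-bag mean squared error

imputed <- rnorm(length(rf_pred), mean = rf_pred, sd = sqrt(oob_mse))
imputed
```

Drawing with this added noise, rather than imputing the prediction itself, preserves between-imputation variability, which multiple imputation needs in order to give valid standard errors under Rubin's rules.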

Examples

set.seed(1)

# A small dataset with a single row to be imputed
mydata <- data.frame(x1 = c(2, 3, NA, 4, 5, 1, 6, 8, 7, 9), x2 = 1:10,
    x3 = c(1, 3, NA, 4, 2, 8, 7, 9, 6, 5))
mice(mydata, method = c('norm', 'norm', 'norm'), m = 2, maxit = 2)
mice(mydata[, 1:2], method = c('rfcont', 'rfcont'), m = 2, maxit = 2)
mice(mydata, method = c('rfcont', 'rfcont', 'rfcont'), m = 2, maxit = 2)

# A larger simulated dataset
mydata <- simdata(100)
cat('\nSimulated multivariate normal data:\n')
print(data.frame(mean = colMeans(mydata), sd = sapply(mydata, sd)))

# Apply missingness pattern
mymardata <- makemar(mydata)
cat('\nNumber of missing values:\n')
print(sapply(mymardata, function(x){sum(is.na(x))}))

# Test imputation of a single column in a two-column dataset
cat('\nTest imputation of a simple dataset\n')
print(mice(mymardata[, c('y', 'x1')], method = 'rfcont'))

# Analyse data
cat('\nFull data analysis:\n')
print(summary(lm(y ~ x1 + x2 + x3, data = mydata)))

cat('\nMICE using normal-based linear regression:\n')
print(summary(pool(with(mice(mymardata, method = 'norm'),
    lm(y ~ x1 + x2 + x3)))))

# Set options for Random Forest
setrfoptions(ntree_cont = 10)
cat('\nMICE using Random Forest:\n')
print(summary(pool(with(mice(mymardata, method = 'rfcont'),
    lm(y ~ x1 + x2 + x3)))))

setrfoptions            Set Random Forest options for imputation using MICE

Description

A convenience function to set global options for the number of trees or the number of nodes.

Usage

setrfoptions(ntree_cat = NULL, ntree_cont = NULL,
    nodesize_cat = NULL, nodesize_cont = NULL,
    maxnodes_cat = NULL, maxnodes_cont = NULL)

Arguments

ntree_cat      number of trees to be used for imputing categorical variables (each imputed value is the prediction of a randomly chosen tree), default = 10.
ntree_cont     number of trees in the forest for imputing continuous variables, default = 10.
nodesize_cat   minimum node size for trees for imputing categorical variables, default = 1. A higher value can be used on larger datasets in order to save time.
nodesize_cont  minimum node size for trees for imputing continuous variables, default = 5. A higher value can be used on larger datasets in order to save time.
maxnodes_cat   maximum number of nodes in trees for imputing categorical variables. By default the size limit is set by the number of observations and nodesize_cat.
maxnodes_cont  maximum number of nodes in trees for imputing continuous variables. By default the size limit is set by the number of observations and nodesize_cont.

Details

This function sets global options, which have the prefix CALIBERrfimpute_.

Value

No return value. The function prints a message stating the new option setting.

See Also

mice.impute.rfcat, mice.impute.rfcont

Examples

# Set an option using setrfoptions
setrfoptions(ntree_cat=15)
options()$CALIBERrfimpute_ntree_cat

# Set an option directly
options(CALIBERrfimpute_ntree_cat=20)
options()$CALIBERrfimpute_ntree_cat

simdata                 Simulate multivariate data for testing

Description

Creates multivariate normal data, or normal and binary data, as used in the simulation study.

Usage

simdata(n = 2000, mymean = rep(0, 4), mysigma = matrix(c(
     1,   0.2, 0.1, -0.7,
     0.2, 1,   0.3,  0.1,
     0.1, 0.3, 1,    0.2,
    -0.7, 0.1, 0.2,  1), byrow = TRUE, nrow = 4, ncol = 4),
    residsd = 1, x2binary = FALSE)

Arguments

n         number of observations to create.
mymean    vector of length 4, giving the mean of each variable.
mysigma   variance-covariance matrix of the multivariate normal distribution from which x1-x4 are drawn.
residsd   residual standard deviation.
x2binary  if TRUE, x2 is converted to a binary factor variable (1, 2) with probability equal to the logistic of the underlying normally distributed variable.

Value

Data frame with 5 columns:

y     continuous, generated by y = x1 + x2 + x3 + normal error if x2 is continuous, or y = x1 + x2 + x3 - 1 + normal error if x2 is a factor with values 1 or 2
x1    continuous
x2    continuous or binary (factor) with value 1 or 2
x3    continuous
x4    continuous

See Also

makemar

Examples

set.seed(1)
simdata(n=4, x2binary=TRUE)
#             y          x1 x2         x3        x4
# 1 -0.06399616 -1.23307320  2 -0.6521442 1.6141842
# 2  1.00822173 -0.05167026  1  0.4659907 0.5421826
# 3  2.87886825  0.43816687  1  1.5217240 0.2808691
# 4  0.79129101 -0.72510640  1  0.7342611 0.1820001

Index

Topic package
    CALIBERrfimpute-package

CALIBERrfimpute (CALIBERrfimpute-package)
CALIBERrfimpute-package
makemar
mice
mice.impute.cart
mice.impute.rf
mice.impute.rfcat
mice.impute.rfcont
randomForest
setrfoptions
simdata