The partial Package
October 16, 2006

Version: 0.1
Date:
Title: partial package
Author: Andrea Lehnert-Batar
Maintainer: Andrea Lehnert-Batar
Depends: R (>= 2.0.1), e1071
Description: partial implements (partial) attributable risk estimates, corresponding variance estimates and confidence intervals.
License: GPL (version 2 or later)

R topics documented: AR, PartialAR, boot, icu

AR: (Adjusted) Attributable Risk Estimates, Variances and Confidence Intervals

Description

AR derives crude, adjusted, crude joint or adjusted joint attributable risk estimates for one or multiple exposure factors of primary interest, adjusted for secondary confounding variables, together with different variance estimates and confidence intervals.

Usage

AR(D, x, C = NULL, model = NULL, fmla, w = NULL,
   Var = c("none", "delta", "boot", "bayes", "jackknife"),
   CI = c("none", "normal", "logit", "percentile", "bca"),
   alpha = 0.05, B = 500)
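In the crude case (C = NULL, model = NULL) the quantity being estimated is the classical attributable risk, AR = (P(D=1) - P(D=1 | E=0)) / P(D=1), i.e. the proportional reduction in disease probability if the exposure were absent. A minimal sketch of this point estimate from a 2x2 table (an illustration in Python, not the package's implementation; the function name is ours):

```python
# Crude attributable risk from a 2x2 table of disease D and exposure E.
# Illustrative sketch only; the package estimates this from the data matrix.
def crude_ar(n11, n10, n01, n00):
    """n11: D=1,E=1   n10: D=1,E=0   n01: D=0,E=1   n00: D=0,E=0"""
    n = n11 + n10 + n01 + n00
    p_d = (n11 + n10) / n              # P(D = 1)
    p_d_unexposed = n10 / (n10 + n00)  # P(D = 1 | E = 0)
    return (p_d - p_d_unexposed) / p_d

# Disease risk 0.5 among exposed, 0.1 among unexposed:
print(crude_ar(50, 10, 50, 90))  # about 2/3 of the disease load is attributable
```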
Arguments

D: a vector containing a dichotomous indicator variable for the disease status.
x: a matrix containing a dichotomous indicator variable for the exposure status. If ncol(x) = 1, the crude or adjusted attributable risk for the risk factor in x is computed. If ncol(x) > 1, the joint attributable risk of the multiple risk factors in x is returned.
C: a matrix containing one or multiple confounding variables for adjusting the attributable risk. Every column of C must be dichotomous. If C = NULL, the crude attributable risk for the exposure in x is computed.
w: a weight vector which is used to define the resampling technique. If a nonparametric or bayesian bootstrap or the jackknife is used, w can be ignored by the user, as it is regulated by the input parameter Var. If the user wants to use a different resampling method instead, w can be changed individually.
model: if model = TRUE, the attributable risk is computed using the coefficients of a logistic regression model. If model = NULL, the attributable risk is computed with probabilities estimated directly from the contingency tables of the data set.
fmla: if model = TRUE, fmla defines the desired form of the logistic regression model and is an obligatory parameter.
Var: a character string indicating the method of variance estimation: Var = "delta" indicates a variance estimate derived via the delta method, Var = "boot" means application of a nonparametric bootstrap, Var = "bayes" indicates the bayesian bootstrap and Var = "jackknife" the jackknife. If the default Var = "none" is selected, only the point estimate of the attributable risk is returned.
CI: a character string indicating the method of confidence interval estimation: if CI = "normal", a confidence interval constructed from percentiles of the standard normal distribution is computed. If CI = "logit", a logit transformation of the attributable risk is used.
If the logit transformation is used together with variance estimation based on resampling methods (bootstrap or jackknife), moments of a truncated normal distribution are used for the construction of the confidence interval whenever the empirical distribution of the attributable risk contains negative values. CI = "percentile" and CI = "bca" yield confidence intervals based on the simple percentile method and the BCa method, respectively (only possible when Var = "boot" or Var = "bayes"). CI = "none" is the default.
alpha: the probability of error for the estimation of confidence intervals, yielding a 1 - alpha confidence level.
B: number of replications for resampling methods.

Details

Point estimates of the crude attributable risk are the same whether the model-free or the model-based approach is applied. For adjusted attributable risk estimates, the model-free approach yields estimates based on the case-load weighting approach, in which the attributable risk is written as a weighted sum over all adjustment levels of the confounding variables in C. If the model-based approach is applied with a main-effects model (for instance fmla = "D~x1+c1+c2"), the point estimate is based on the Mantel-Haenszel approach: an adjusted odds ratio is plugged into the formula of the attributable risk. If a fully saturated model is used for estimation (thus fmla = "D~(x1+c1+c2)^3"), the point estimate again equals the model-free estimate, as the interaction structure within the data is fully taken into account. If adjusted attributable risks are computed for large data sets containing no sparse strata, the model-free approach is recommended. But
if many of the strata defined by the adjustment levels of the confounding variables contain only few observations, the model-based approach should be used. The main benefit of the model-based approach is its flexibility, as a logistic regression model that exactly maps the interaction structure of the data set can be specified via the parameter fmla.
All supplied variants of variance estimation yield asymptotic estimates. The variance estimate based on the delta method is computationally least expensive. It is based on an expansion of the attributable risk about its mean by taking a one-step Taylor approximation. The nonparametric bootstrap (Var = "boot") is based on Efron's bootstrap: samples are drawn from the underlying data set and the attributable risk is recomputed for each of those samples. Rubin's bayesian bootstrap (Var = "bayes") is a weighted variant of the bootstrap: a weight vector is generated from a Dirichlet distribution, and for each sampled weight vector the attributable risk is computed with weighted observations. The jackknife can be seen as an approximation to Efron's nonparametric bootstrap; replications of the attributable risk are computed by leaving out every single observation once at a time. If the data set contains far more observations than the number of replications normally computed for a bayesian or nonparametric bootstrap, the computation of a bootstrap is recommended. For the number of replications, B = 500 is usually assumed to suffice. The number of replications should be increased if BCa intervals are computed. BCa intervals are therefore computationally expensive, but yield very stable results in many situations. Attention: results of simulation studies showed that the application of the simple percentile method can yield unsatisfactory results when applied to the adjusted AR.

Value

If only attributable risk estimates are computed, a single value for the point estimate is returned.
If Var != "none", a list containing the point estimate together with its corresponding variance estimate is returned. If CI != "none", a list containing the point estimate, the variance estimate and the corresponding confidence interval is returned.

Note

The variables D, x and C have to be dichotomous, and it has to be ensured that they are not defined as factors. D has to be a vector, whereas x and C have to be matrices. If the model-based approach is used, the colnames of x and C have to be used in fmla! Note that the implemented estimates of the attributable risk are only valid if the data has been obtained under a multinomial sampling model!

Author(s)

Andrea Lehnert-Batar

References

Basu S., Landis J.R. (1995) Model-based estimation of population attributable risk under cross-sectional sampling. American Journal of Epidemiology, 142.
Benichou J., Gail M.H. (1989) A delta method for implicitly defined random variables. The American Statistician, 43.
Benichou J. (2001) A review of adjusted estimators of attributable risk. Statistical Methods in Medical Research, 10.
Efron B. (1979) Bootstrap methods: another look at the jackknife. Annals of Statistics, 7, 1-26.
Gefeller O. (1992) An annotated bibliography on the attributable risk. Biometrical Journal, 34.
Lehnert-Batar A., Pfahlberg A., Gefeller O. (2006) Confidence intervals for attributable risk estimates under multinomial sampling. Biometrical Journal, to appear.
Quenouille M. (1949) Approximation tests of correlation in time series. Journal of the Royal Statistical Society, Series B, 11.
Rubin D.B. (1981) The bayesian bootstrap. The Annals of Statistics, 9.

See Also

PartialAR, boot

Examples

data(icu)
attach(data.frame(icu))

### Computation of crude AR for INF, model-free and ###
### model-based, with variance estimates            ###
Exp <- matrix(INF, ncol = 1)
colnames(Exp) <- "Infection"
AR(D = STA, x = Exp, Var = "delta")
AR(D = STA, x = Exp, model = TRUE, fmla = "STA~Infection", Var = "delta")

### Computation of adjusted AR, model-free, adjusted for SEX and RACE        ###
### First coerce variable RACE into a dummy matrix as it is not dichotomous! ###
RACENEW <- model.matrix(~ as.factor(RACE) - 1)
AR(D = STA, x = Exp, C = cbind(SEX, RACENEW), Var = "delta", CI = "normal")

### Computation of joint attributable risk for exposure factors INF and TYP ###
### adjusted for RACE and SEX                                               ###
Exp2 <- cbind(INF, TYP)
Conf <- cbind(RACENEW, SEX)
AR(D = STA, x = Exp2, C = Conf, Var = "delta")

### Computation of model-based AR for the exposure factor INF, ###
### adjusted for RACE and SEX                                  ###
Conf <- cbind(RACENEW[, -3], SEX)
colnames(Conf) <- c("white", "black", "sex")
AR(D = STA, x = Exp, C = Conf, model = TRUE,
   fmla = "STA~Infection+white+black+sex", Var = "delta", CI = "normal")

PartialAR: Partial Attributable Risk Estimates, Variances and Confidence Intervals

Description

PartialAR derives estimates for partial attributable risks for multiple exposure factors. Variance estimates and confidence intervals can optionally be returned. Variance estimates are available based on either resampling methods or the delta method. Confidence intervals based on the BCa method, or alternatively constructed with percentiles of the standard normal distribution, are available.
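Both AR and PartialAR obtain their resampling-based variance estimates from replications of the statistic. As a language-agnostic illustration (a Python sketch of the generic schemes behind Var = "boot", "bayes" and "jackknife", not the package's implementation; all names here are ours), the three replication types can be written as:

```python
import random

def replications(data, stat, type="boot", B=500, seed=1):
    """Illustrative resampling replications of a statistic."""
    rng = random.Random(seed)
    n = len(data)
    out = []
    if type == "boot":        # nonparametric bootstrap: resample with replacement
        for _ in range(B):
            out.append(stat([data[rng.randrange(n)] for _ in range(n)]))
    elif type == "bayes":     # bayesian bootstrap: Dirichlet(1,...,1) weights
        for _ in range(B):    # (gaps between sorted uniforms sum to one)
            u = sorted(rng.random() for _ in range(n - 1))
            w = [b - a for a, b in zip([0.0] + u, u + [1.0])]
            out.append(stat(data, weights=w))
    elif type == "jackknife": # leave one observation out at a time
        out = [stat(data[:i] + data[i + 1:]) for i in range(n)]
    return out

def wmean(x, weights=None):
    """(Weighted) mean, standing in for the attributable risk statistic."""
    w = weights or [1.0 / len(x)] * len(x)
    return sum(wi * xi for wi, xi in zip(w, x))

reps = replications([1, 2, 3, 4, 5], wmean, type="jackknife")
print(reps)  # one leave-one-out replication per observation
```

The empirical distribution of such replications is what the variance estimates and the percentile/BCa intervals are computed from.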
Usage

PartialAR(D, x, w = NULL, model = NULL, fmla,
          Var = c("none", "delta", "boot", "bayes", "jackknife"),
          CI = c("none", "normal", "percentile", "bca"),
          alpha = 0.05, B = 500)

Arguments

D: a vector containing a dichotomous indicator variable for the disease status.
x: a matrix containing different exposure variables. Each column of x indicates the exposure status of one exposure variable, which has to be dichotomous. Categorical risk factors first have to be transformed into dummy variables.
w: a weight vector which is used to define the resampling technique. If a nonparametric or bayesian bootstrap or the jackknife is used, w can be ignored by the user, as it is regulated by the input parameter Var. If the user wants to use a different resampling method instead, w can be changed individually.
model: if model = TRUE, the conditional probabilities necessary for the partial attributable risks are computed using estimated coefficients from a logistic regression. If model = NULL, the conditional probabilities are estimated directly from the data.
fmla: if model = TRUE, fmla defines the desired form of the logistic regression model and is an obligatory parameter.
Var: a character string indicating the method of variance estimation: Var = "delta" indicates a variance estimate derived via the delta method, Var = "boot" and Var = "bayes" mean a nonparametric bootstrap and bayesian bootstrap, respectively, and Var = "jackknife" the jackknife. Var = "none" is the default.
CI: a character string indicating the method of confidence interval estimation: if CI = "normal", a confidence interval constructed from percentiles of the standard normal distribution is computed. CI = "percentile" and CI = "bca" yield confidence intervals based on the simple percentile and the BCa method, respectively (only possible when Var = "boot" or Var = "bayes"). CI = "none" is the default.
alpha: the probability of error for the estimation of confidence intervals, yielding a 1 - alpha confidence level.
B: number of replications for resampling methods.

Details

The partial attributable risk additively decomposes the joint attributable risk of the risk factors in x into risk shares for each single factor. The parameter conceptually corresponds to the Shapley value from cooperative game theory. The partial attributable risk estimates make it possible to rank the risk factors according to their individual relevance for the disease load within the population under study.
If model = NULL, the conditional probabilities necessary for the computation of the partial attributable risks are estimated directly from the data set. If model = TRUE, a logistic regression is used for estimating the conditional probabilities. If the model-based approach is used, a formula defining the form of the logistic regression model has to be given in fmla. When choosing the model-based approach, the matrix x must contain colnames, which are used in fmla (see example!).
All supplied variants of variance estimation yield asymptotic estimates. The variance estimate based on the delta method is computationally least expensive. It is based on an expansion of the
partial attributable risk about its mean by taking a one-step Taylor approximation. If the sample size of the data set is small, the delta method can yield insufficient results. The nonparametric bootstrap (Var = "boot") is based on Efron's bootstrap: samples are drawn from the underlying data set and the partial attributable risks are recomputed for each of those samples. Rubin's bayesian bootstrap (Var = "bayes") is a weighted variant of the bootstrap: a weight vector is generated from a Dirichlet distribution, and for each sampled weight vector the partial attributable risks are computed with weighted observations. The jackknife can be seen as an approximation to Efron's nonparametric bootstrap; replications of the partial attributable risks are computed by leaving out every single observation once at a time. If the data set contains far more observations than the number of replications normally computed for a bayesian or nonparametric bootstrap, the computation of a bootstrap is recommended. For the number of replications, B = 500 is normally assumed to suffice. The number of replications should be increased if BCa intervals are computed. BCa intervals are therefore computationally expensive, but yield very stable results in many situations.
Results of a simulation study showed that the bayesian bootstrap tends to overestimate the variance. In situations where a sufficient number of observations is available within the different strata defined by the status of disease and exposures, the use of variance estimates derived via the delta method is advisable, as it is computationally least expensive. In sparse data situations, the nonparametric bootstrap combined with the BCa method should be chosen.

Value

If only partial attributable risk estimates are computed, a vector of length equal to the number of exposure variables in x is returned.
If Var != "none", a list containing the vector of partial attributable risks and the vector of corresponding variance estimates is returned. If CI != "none", a list containing the vector of partial attributable risks, their variance estimates and a matrix with the endpoints of the confidence intervals is returned.

Note

The variables in D and x have to be dichotomous, and it has to be ensured that they are not defined as factors. D has to be a vector, whereas x has to be a matrix. If the model-based approach is used, the colnames of x have to be used in fmla! Note that the implemented estimates of the attributable risk are only valid if the data has been obtained under a multinomial sampling model!

Author(s)

Andrea Lehnert-Batar

References

Cox L.A. Jr. (1985) A new measure of attributable risk for public health applications. Management Science, 31.
Efron B. (1979) Bootstrap methods: another look at the jackknife. Annals of Statistics, 7, 1-26.
Eide G.E., Gefeller O. (1995) Sequential and average attributable fractions as aids in the selection of preventive strategies. Journal of Clinical Epidemiology, 48.
Grömping U., Weimann U. (2004) The asymptotic distribution of the partial attributable risk in cross-sectional studies. Statistics, 38.
Land M., Vogel C., Gefeller O. (2001) Partitioning methods for multi-factorial risk attribution. Statistical Methods in Medical Research, 10.
Quenouille M. (1949) Approximation tests of correlation in time series. Journal of the Royal Statistical Society, Series B, 11.
See Also

AR, boot

Examples

data(icu)
attach(data.frame(icu))

### Computation of partial attributable risks together with corresponding      ###
### confidence intervals based on variance estimates derived from delta method ###
PartialAR(STA, cbind(CAN, INF, TYP, LOC), Var = "delta", CI = "normal")

### Computation of partial attributable risks by the model-based approach ###
### using a simple main-effects model                                     ###
Exp <- cbind(CAN, INF, TYP, LOC)
colnames(Exp) <- c("cancer", "infection", "admission", "coma")
PartialAR(D = STA, x = Exp, model = TRUE,
          fmla = "STA~cancer+infection+admission+coma",
          Var = "delta", CI = "normal")

boot: Bootstrap and Jackknife Replications for the Attributable Risk or Partial Attributable Risks

Description

boot computes replications of the (adjusted) attributable risk or of partial attributable risks in order to obtain the empirical distribution of the estimated parameter. The computation of replications is based on either the nonparametric bootstrap, the bayesian bootstrap or the jackknife.

Usage

boot(D, x, C = NULL, param = c("ar", "par"), model = NULL, fmla = fmla,
     type = c("boot", "bayes", "jackknife"), B = 500)

Arguments

D: a vector containing a dichotomous indicator variable for the disease status.
x: a matrix containing different exposure variables. Each column of x indicates the exposure status of one exposure variable, which has to be dichotomous. Categorical risk factors first have to be transformed into dummy variables.
C: a matrix containing one or multiple confounding variables for adjusting the attributable risk. Every column of C must be dichotomous. If C = NULL, replications of the crude attributable risk for the exposure in x are computed. This input parameter is only valid if param = "ar".
param: defines whether replications for the (adjusted) attributable risk (param = "ar") or for partial attributable risks (param = "par") are computed.
model: if model = TRUE, the model-based approach of estimating the (partial) attributable risk is used for the computation of replications.
fmla: if the model-based approach is used, fmla determines the logistic regression model.
B: number of replications.
type: a character string indicating the resampling method: type = "boot" computes replications based on the nonparametric bootstrap, type = "bayes" uses the bayesian bootstrap and type = "jackknife" the jackknife.

Details

The nonparametric bootstrap (type = "boot") is based on Efron's bootstrap: samples are drawn from the underlying data set and the attributable risk is recomputed for each of those samples. Rubin's bayesian bootstrap is a weighted variant of the bootstrap: a weight vector is generated from a Dirichlet distribution, and for each sampled weight vector the attributable risk is computed with weighted observations. The jackknife can be seen as an approximation to Efron's nonparametric bootstrap; replications of the attributable risk are computed by leaving out every single observation once at a time. If the data set contains far more observations than the number of replications normally computed for a bayesian or nonparametric bootstrap, the computation of a bootstrap is recommended. For the number of replications, B = 500 is usually assumed to suffice.

Value

If param = "ar", a vector of length B containing replications of the attributable risk is returned. If param = "par", a matrix with B rows and ncol(x) columns containing replications of the partial attributable risks is returned.

Note

The variables in D and x have to be dichotomous, and it has to be ensured that they are not defined as factors. D has to be a vector, whereas x has to be a matrix. If the model-based approach is used, the colnames of x have to be used in fmla! Note that the implemented estimates of the attributable risk are only valid if the data has been obtained under a multinomial sampling model!

Author(s)

Andrea Lehnert-Batar

References

Basu S., Landis J.R. (1995) Model-based estimation of population attributable risk under cross-sectional sampling. American Journal of Epidemiology, 142.
Benichou J., Gail M.H. (1989) A delta method for implicitly defined random variables.
The American Statistician, 43.
Benichou J. (2001) A review of adjusted estimators of attributable risk. Statistical Methods in Medical Research, 10.
Cox L.A. Jr. (1985) A new measure of attributable risk for public health applications. Management Science, 31.
Efron B. (1979) Bootstrap methods: another look at the jackknife. Annals of Statistics, 7, 1-26.
Eide G.E., Gefeller O. (1995) Sequential and average attributable fractions as aids in the selection of preventive strategies. Journal of Clinical Epidemiology, 48.
Lehnert-Batar A., Pfahlberg A., Gefeller O. (2006) Confidence intervals for attributable risk estimates under multinomial sampling. Biometrical Journal, to appear.
Quenouille M. (1949) Approximation tests of correlation in time series. Journal of the Royal Statistical Society, Series B, 11.
Rubin D.B. (1981) The bayesian bootstrap. The Annals of Statistics, 9.

See Also

AR

Examples

data(icu)
attach(data.frame(icu))

### Computation of nonparametric bootstrap replications for the ###
### adjusted AR of INF, adjusted for SEX                        ###
boot(D = STA, x = cbind(INF), C = cbind(SEX), param = "ar", type = "boot")

### Computation of nonparametric bootstrap replications ###
### of partial attributable risks                       ###
boot(D = STA, x = cbind(CAN, INF, TYP, LOC), param = "par", type = "boot")

icu: ICU Data

Description

The ICU data frame has 200 rows and 7 columns.

Usage

data(icu)

Format

This data frame contains the following columns:
STA: factor, vital status (0 = Lived, 1 = Died).
SEX: factor, sex (0 = male, 1 = female).
RACE: factor, race (1 = white, 2 = black, 3 = other).
CAN: factor, cancer part of present problem (0 = No, 1 = Yes).
INF: factor, infection probable at ICU admission (0 = No, 1 = Yes).
TYP: factor, type of admission (0 = Elective, 1 = Emergency).
LOC: factor, level of consciousness at ICU admission (0 = no coma, 1 = deep stupor or coma).

Details

The ICU data set consists of a sample of 200 subjects who were part of a much larger study on the survival of patients following admission to an adult intensive care unit (ICU). The data set presented here contains all variables which significantly explained the vital status at hospital discharge. The data were collected at Baystate Medical Center in Springfield, Massachusetts. These data are copyrighted by John Wiley & Sons Inc. and must be acknowledged and used accordingly.

Source

Hosmer and Lemeshow (2000), Applied Logistic Regression: Second Edition.
References

D.W. Hosmer and S. Lemeshow (2000), Applied Logistic Regression, Second Edition. New York: Wiley Series in Probability and Mathematical Statistics.
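The examples above coerce the three-level icu variable RACE into dummy columns (via model.matrix) before passing it to AR, since every column of x and C must be dichotomous. The recoding itself is simple and can be sketched generically (an illustration in Python, not tied to the package; names are ours):

```python
# Turn a categorical vector into one 0/1 indicator column per level,
# analogous to model.matrix(~ as.factor(RACE) - 1) in the AR examples.
def dummy_columns(values, levels=None):
    levels = levels or sorted(set(values))
    return {lev: [1 if v == lev else 0 for v in values] for lev in levels}

# RACE coded 1 = white, 2 = black, 3 = other, as in the icu data
race = [1, 1, 2, 3, 1]
d = dummy_columns(race)
print(d[1])  # indicator column for "white": [1, 1, 0, 0, 1]
```

Each row then has exactly one 1 across the indicator columns; dropping one column (as with RACENEW[, -3] in the model-based example) avoids collinearity in the logistic regression.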
Index

Topic datasets: icu
Topic manip: AR, boot, PartialAR
Categorical Data in a Designed Experiment Part 2: Sizing with a Binary Response Authored by: Francisco Ortiz, PhD Version 2: 19 July 2018 Revised 18 October 2018 The goal of the STAT COE is to assist in
More informationPackage spcadjust. September 29, 2016
Version 1.1 Date 2015-11-20 Title Functions for Calibrating Control Charts Package spcadjust September 29, 2016 Author Axel Gandy and Jan Terje Kvaloy . Maintainer
More informationBrief Guide on Using SPSS 10.0
Brief Guide on Using SPSS 10.0 (Use student data, 22 cases, studentp.dat in Dr. Chang s Data Directory Page) (Page address: http://www.cis.ysu.edu/~chang/stat/) I. Processing File and Data To open a new
More informationGene signature selection to predict survival benefits from adjuvant chemotherapy in NSCLC patients
1 Gene signature selection to predict survival benefits from adjuvant chemotherapy in NSCLC patients 1,2 Keyue Ding, Ph.D. Nov. 8, 2014 1 NCIC Clinical Trials Group, Kingston, Ontario, Canada 2 Dept. Public
More informationPackage ICsurv. February 19, 2015
Package ICsurv February 19, 2015 Type Package Title A package for semiparametric regression analysis of interval-censored data Version 1.0 Date 2014-6-9 Author Christopher S. McMahan and Lianming Wang
More informationPackage deming. June 19, 2018
Package deming June 19, 2018 Title Deming, Thiel-Sen and Passing-Bablock Regression Maintainer Terry Therneau Generalized Deming regression, Theil-Sen regression and Passing-Bablock
More informationMultiple imputation using chained equations: Issues and guidance for practice
Multiple imputation using chained equations: Issues and guidance for practice Ian R. White, Patrick Royston and Angela M. Wood http://onlinelibrary.wiley.com/doi/10.1002/sim.4067/full By Gabrielle Simoneau
More informationPackage MPCI. October 25, 2015
Package MPCI October 25, 2015 Type Package Title Multivariate Process Capability Indices (MPCI) Version 1.0.7 Date 2015-10-23 Depends R (>= 3.1.0), graphics, stats, utils Author Edgar Santos-Fernandez,
More informationPackage compeir. February 19, 2015
Type Package Package compeir February 19, 2015 Title Event-specific incidence rates for competing risks data Version 1.0 Date 2011-03-09 Author Nadine Grambauer, Andreas Neudecker Maintainer Nadine Grambauer
More informationHierarchical Mixture Models for Nested Data Structures
Hierarchical Mixture Models for Nested Data Structures Jeroen K. Vermunt 1 and Jay Magidson 2 1 Department of Methodology and Statistics, Tilburg University, PO Box 90153, 5000 LE Tilburg, Netherlands
More informationProduct Catalog. AcaStat. Software
Product Catalog AcaStat Software AcaStat AcaStat is an inexpensive and easy-to-use data analysis tool. Easily create data files or import data from spreadsheets or delimited text files. Run crosstabulations,
More informationInstall RStudio from - use the standard installation.
Session 1: Reading in Data Before you begin: Install RStudio from http://www.rstudio.com/ide/download/ - use the standard installation. Go to the course website; http://faculty.washington.edu/kenrice/rintro/
More informationStatistical Analysis Using Combined Data Sources: Discussion JPSM Distinguished Lecture University of Maryland
Statistical Analysis Using Combined Data Sources: Discussion 2011 JPSM Distinguished Lecture University of Maryland 1 1 University of Michigan School of Public Health April 2011 Complete (Ideal) vs. Observed
More informationPackage addhaz. September 26, 2018
Package addhaz September 26, 2018 Title Binomial and Multinomial Additive Hazard Models Version 0.5 Description Functions to fit the binomial and multinomial additive hazard models and to estimate the
More informationMissing Data: What Are You Missing?
Missing Data: What Are You Missing? Craig D. Newgard, MD, MPH Jason S. Haukoos, MD, MS Roger J. Lewis, MD, PhD Society for Academic Emergency Medicine Annual Meeting San Francisco, CA May 006 INTRODUCTION
More information100 Myung Hwan Na log-hazard function. The discussion section of Abrahamowicz, et al.(1992) contains a good review of many of the papers on the use of
J. KSIAM Vol.3, No.2, 99-106, 1999 SPLINE HAZARD RATE ESTIMATION USING CENSORED DATA Myung Hwan Na Abstract In this paper, the spline hazard rate model to the randomly censored data is introduced. The
More informationNotes on Simulations in SAS Studio
Notes on Simulations in SAS Studio If you are not careful about simulations in SAS Studio, you can run into problems. In particular, SAS Studio has a limited amount of memory that you can use to write
More informationPackage abe. October 30, 2017
Package abe October 30, 2017 Type Package Title Augmented Backward Elimination Version 3.0.1 Date 2017-10-25 Author Rok Blagus [aut, cre], Sladana Babic [ctb] Maintainer Rok Blagus
More informationResampling Methods. Levi Waldron, CUNY School of Public Health. July 13, 2016
Resampling Methods Levi Waldron, CUNY School of Public Health July 13, 2016 Outline and introduction Objectives: prediction or inference? Cross-validation Bootstrap Permutation Test Monte Carlo Simulation
More informationSTAT 311 (3 CREDITS) VARIANCE AND REGRESSION ANALYSIS ELECTIVE: ALL STUDENTS. CONTENT Introduction to Computer application of variance and regression
STAT 311 (3 CREDITS) VARIANCE AND REGRESSION ANALYSIS ELECTIVE: ALL STUDENTS. CONTENT Introduction to Computer application of variance and regression analysis. Analysis of Variance: one way classification,
More informationThe Use of Sample Weights in Hot Deck Imputation
Journal of Official Statistics, Vol. 25, No. 1, 2009, pp. 21 36 The Use of Sample Weights in Hot Deck Imputation Rebecca R. Andridge 1 and Roderick J. Little 1 A common strategy for handling item nonresponse
More informationPackage epitab. July 4, 2018
Type Package Package epitab July 4, 2018 Title Flexible Contingency Tables for Epidemiology Version 0.2.2 Author Stuart Lacy Maintainer Stuart Lacy Builds contingency tables that
More informationnquery Sample Size & Power Calculation Software Validation Guidelines
nquery Sample Size & Power Calculation Software Validation Guidelines Every nquery sample size table, distribution function table, standard deviation table, and tablespecific side table has been tested
More informationPackage binomlogit. February 19, 2015
Type Package Title Efficient MCMC for Binomial Logit Models Version 1.2 Date 2014-03-12 Author Agnes Fussl Maintainer Agnes Fussl Package binomlogit February 19, 2015 Description The R package
More informationDealing with Categorical Data Types in a Designed Experiment
Dealing with Categorical Data Types in a Designed Experiment Part II: Sizing a Designed Experiment When Using a Binary Response Best Practice Authored by: Francisco Ortiz, PhD STAT T&E COE The goal of
More informationAnnotated multitree output
Annotated multitree output A simplified version of the two high-threshold (2HT) model, applied to two experimental conditions, is used as an example to illustrate the output provided by multitree (version
More informationPoisson Regressions for Complex Surveys
Poisson Regressions for Complex Surveys Overview Researchers often use sample survey methodology to obtain information about a large population by selecting and measuring a sample from that population.
More informationComputation of the variance-covariance matrix
Computation of the variance-covariance matrix An example with the Countr package. Tarak Kharrat 1 and Georgi N. Boshnakov 2 1 Salford Business School, University of Salford, UK. 2 School of Mathematics,
More informationPackage OrderedList. December 31, 2017
Title Similarities of Ordered Gene Lists Version 1.50.0 Date 2008-07-09 Package OrderedList December 31, 2017 Author Xinan Yang, Stefanie Scheid, Claudio Lottaz Detection of similarities between ordered
More informationSTATISTICS (STAT) 200 Level Courses. 300 Level Courses. Statistics (STAT) 1
Statistics (STAT) 1 STATISTICS (STAT) 200 Level Courses STAT 250: Introductory Statistics I. 3 credits. Elementary introduction to statistics. Topics include descriptive statistics, probability, and estimation
More informationPackage anidom. July 25, 2017
Type Package Package anidom July 25, 2017 Title Inferring Dominance Hierarchies and Estimating Uncertainty Version 0.1.2 Date 2017-07-25 Author Damien R. Farine and Alfredo Sanchez-Tojar Maintainer Damien
More informationTelephone Survey Response: Effects of Cell Phones in Landline Households
Telephone Survey Response: Effects of Cell Phones in Landline Households Dennis Lambries* ¹, Michael Link², Robert Oldendick 1 ¹University of South Carolina, ²Centers for Disease Control and Prevention
More informationMultiple Imputation for Missing Data. Benjamin Cooper, MPH Public Health Data & Training Center Institute for Public Health
Multiple Imputation for Missing Data Benjamin Cooper, MPH Public Health Data & Training Center Institute for Public Health Outline Missing data mechanisms What is Multiple Imputation? Software Options
More informationModelling Personalized Screening: a Step Forward on Risk Assessment Methods
Modelling Personalized Screening: a Step Forward on Risk Assessment Methods Validating Prediction Models Inmaculada Arostegui Universidad del País Vasco UPV/EHU Red de Investigación en Servicios de Salud
More informationMean Tests & X 2 Parametric vs Nonparametric Errors Selection of a Statistical Test SW242
Mean Tests & X 2 Parametric vs Nonparametric Errors Selection of a Statistical Test SW242 Creation & Description of a Data Set * 4 Levels of Measurement * Nominal, ordinal, interval, ratio * Variable Types
More informationStatistical Matching using Fractional Imputation
Statistical Matching using Fractional Imputation Jae-Kwang Kim 1 Iowa State University 1 Joint work with Emily Berg and Taesung Park 1 Introduction 2 Classical Approaches 3 Proposed method 4 Application:
More informationPackage citools. October 20, 2018
Type Package Package citools October 20, 2018 Title Confidence or Prediction Intervals, Quantiles, and Probabilities for Statistical Models Version 0.5.0 Maintainer John Haman Functions
More informationHow to use the rbsurv Package
How to use the rbsurv Package HyungJun Cho, Sukwoo Kim, Soo-heang Eo, and Jaewoo Kang April 30, 2018 Contents 1 Introduction 1 2 Robust likelihood-based survival modeling 2 3 Algorithm 2 4 Example: Glioma
More informationAnalysis of Complex Survey Data with SAS
ABSTRACT Analysis of Complex Survey Data with SAS Christine R. Wells, Ph.D., UCLA, Los Angeles, CA The differences between data collected via a complex sampling design and data collected via other methods
More informationWeek 4: Simple Linear Regression III
Week 4: Simple Linear Regression III Marcelo Coca Perraillon University of Colorado Anschutz Medical Campus Health Services Research Methods I HSMP 7607 2017 c 2017 PERRAILLON ARR 1 Outline Goodness of
More informationThe relaimpo Package
The relaimpo Package October 1, 2007 Title Relative importance of regressors in linear models Version 1.2-2 Date 2007-09-30 Author Ulrike Groemping Description relaimpo provides several metrics for assessing
More informationLecture 12. August 23, Department of Biostatistics Johns Hopkins Bloomberg School of Public Health Johns Hopkins University.
Lecture 12 Department of Biostatistics Johns Hopkins Bloomberg School of Public Health Johns Hopkins University August 23, 2007 1 2 3 4 5 1 2 Introduce the bootstrap 3 the bootstrap algorithm 4 Example
More informationResearch with Large Databases
Research with Large Databases Key Statistical and Design Issues and Software for Analyzing Large Databases John Ayanian, MD, MPP Ellen P. McCarthy, PhD, MPH Society of General Internal Medicine Chicago,
More informationAPPENDIX B EXCEL BASICS 1
APPENDIX B EXCEL BASICS 1 Microsoft Excel is a powerful application for education researchers and students studying educational statistics. Excel worksheets can hold data for a variety of uses and therefore
More informationPackage oaxaca. January 3, Index 11
Type Package Title Blinder-Oaxaca Decomposition Version 0.1.4 Date 2018-01-01 Package oaxaca January 3, 2018 Author Marek Hlavac Maintainer Marek Hlavac
More informationStatCalc User Manual. Version 9 for Mac and Windows. Copyright 2018, AcaStat Software. All rights Reserved.
StatCalc User Manual Version 9 for Mac and Windows Copyright 2018, AcaStat Software. All rights Reserved. http://www.acastat.com Table of Contents Introduction... 4 Getting Help... 4 Uninstalling StatCalc...
More informationPackage biglars. February 19, 2015
Package biglars February 19, 2015 Type Package Title Scalable Least-Angle Regression and Lasso Version 1.0.2 Date Tue Dec 27 15:06:08 PST 2011 Author Mark Seligman, Chris Fraley, Tim Hesterberg Maintainer
More informationSTAT 113: Lab 9. Colin Reimer Dawson. Last revised November 10, 2015
STAT 113: Lab 9 Colin Reimer Dawson Last revised November 10, 2015 We will do some of the following together. The exercises with a (*) should be done and turned in as part of HW9. Before we start, let
More informationPackage MIICD. May 27, 2017
Type Package Package MIICD May 27, 2017 Title Multiple Imputation for Interval Censored Data Version 2.4 Depends R (>= 2.13.0) Date 2017-05-27 Maintainer Marc Delord Implements multiple
More informationStatistical matching: conditional. independence assumption and auxiliary information
Statistical matching: conditional Training Course Record Linkage and Statistical Matching Mauro Scanu Istat scanu [at] istat.it independence assumption and auxiliary information Outline The conditional
More informationLecture 1: Statistical Reasoning 2. Lecture 1. Simple Regression, An Overview, and Simple Linear Regression
Lecture Simple Regression, An Overview, and Simple Linear Regression Learning Objectives In this set of lectures we will develop a framework for simple linear, logistic, and Cox Proportional Hazards Regression
More informationHow to Use the Cancer-Rates.Info/NJ
How to Use the Cancer-Rates.Info/NJ Web- Based Incidence and Mortality Mapping and Inquiry Tool to Obtain Statewide and County Cancer Statistics for New Jersey Cancer Incidence and Mortality Inquiry System
More informationPackage bootlr. July 13, 2015
Type Package Package bootlr July 13, 2015 Title Bootstrapped Confidence Intervals for (Negative) Likelihood Ratio Tests Version 1.0 Date 2015-07-10 Author Keith A. Marill and Ari B. Friedman Maintainer
More informationHILDA PROJECT TECHNICAL PAPER SERIES No. 2/08, February 2008
HILDA PROJECT TECHNICAL PAPER SERIES No. 2/08, February 2008 HILDA Standard Errors: A Users Guide Clinton Hayes The HILDA Project was initiated, and is funded, by the Australian Government Department of
More informationRESAMPLING METHODS. Chapter 05
1 RESAMPLING METHODS Chapter 05 2 Outline Cross Validation The Validation Set Approach Leave-One-Out Cross Validation K-fold Cross Validation Bias-Variance Trade-off for k-fold Cross Validation Cross Validation
More informationFitting latency models using B-splines in EPICURE for DOS
Fitting latency models using B-splines in EPICURE for DOS Michael Hauptmann, Jay Lubin January 11, 2007 1 Introduction Disease latency refers to the interval between an increment of exposure and a subsequent
More informationData Analysis and Solver Plugins for KSpread USER S MANUAL. Tomasz Maliszewski
Data Analysis and Solver Plugins for KSpread USER S MANUAL Tomasz Maliszewski tmaliszewski@wp.pl Table of Content CHAPTER 1: INTRODUCTION... 3 1.1. ABOUT DATA ANALYSIS PLUGIN... 3 1.3. ABOUT SOLVER PLUGIN...
More informationPackage snn. August 23, 2015
Type Package Title Stabilized Nearest Neighbor Classifier Version 1.1 Date 2015-08-22 Author Package snn August 23, 2015 Maintainer Wei Sun Implement K-nearest neighbor classifier,
More informationPackage lol. R topics documented: December 13, Type Package Title Lots Of Lasso Version Date Author Yinyin Yuan
Type Package Title Lots Of Lasso Version 1.30.0 Date 2011-04-02 Author Package lol December 13, 2018 Maintainer Various optimization methods for Lasso inference with matri warpper
More informationSimulating from the Polya posterior by Glen Meeden, March 06
1 Introduction Simulating from the Polya posterior by Glen Meeden, glen@stat.umn.edu March 06 The Polya posterior is an objective Bayesian approach to finite population sampling. In its simplest form it
More informationPackage DTRreg. August 30, 2017
Package DTRreg August 30, 2017 Type Package Title DTR Estimation and Inference via G-Estimation, Dynamic WOLS, and Q-Learning Version 1.3 Date 2017-08-30 Author Michael Wallace, Erica E M Moodie, David
More informationUpdates and Errata for Statistical Data Analytics (1st edition, 2015)
Updates and Errata for Statistical Data Analytics (1st edition, 2015) Walter W. Piegorsch University of Arizona c 2018 The author. All rights reserved, except where previous rights exist. CONTENTS Preface
More information