Package SimHap. February 15, 2013


Type Package
Title A comprehensive modeling framework for epidemiological outcomes and a simulation-based approach to haplotypic analysis of population-based data
Version
Date
Depends stats, survival, nlme
Author Pamela A. McCaskie
Maintainer Matthew Kowgier <matthew.kowgier@oicr.on.ca>
Description SimHap is a package for genetic association testing. It can perform single SNP and multi-locus (haplotype) association analyses for continuous Normal, binary, longitudinal and right-censored outcomes measured in population-based samples. SimHap uses expectation-maximisation (EM) techniques to infer haplotypic phase in individuals, and incorporates a novel simulation-based approach to deal with the uncertainty of imputed haplotypes in association testing.
License GPL-2
Repository CRAN
Date/Publication :39:28
NeedsCompilation no

R topics documented:

    epi.bin
    epi.cc.match
    epi.long
    epi.quant
    epi.surv

    haplo.bin
    haplo.cc.match
    haplo.long
    haplo.quant
    haplo.surv
    infer.haplos
    infer.haplos.cc
    longpheno.dat
    make.haplo.rare
    pheno.dat
    prepare.cc
    snp.bin
    snp.cc.match
    SNP.dat
    snp.long
    snp.quant
    snp.surv
    SNP2Geno
    SNP2Haplo
    SNPlong.dat
    SNPsurv.dat
    summary.epibin
    summary.epiclogit
    summary.epilong
    summary.epiquant
    summary.episurv
    summary.hapbin
    summary.hapclogit
    summary.haplong
    summary.hapquant
    summary.hapsurv
    summary.snpbin
    summary.snpclogit
    summary.snplong
    summary.snpquant
    summary.snpsurv
    survpheno.dat
    Index


epi.bin                 Epidemiological analysis for binary outcomes

Description

    epi.bin is used to fit generalized linear regression models to epidemiological phenotype data for a binary outcome, assuming a binomial error distribution and logit link function.

Usage

    epi.bin(formula, pheno, sub = NULL)

Arguments

    formula   a symbolic description of the model to be fit. The details of model specification are given below.
    pheno     a dataframe containing phenotype data.
    sub       an expression representing a subset of the data on which to fit the model.

Details

    formula should be in the form outcome ~ predictor(s). A formula has an implied intercept term. See the documentation for the formula function for more details of allowed formulae.

Value

    epi.bin returns an object of class epibin containing the following items:

    formula   formula passed to epi.bin.
    results   a table containing the odds ratios, confidence intervals and p-values of the parameter estimates.
    fit.glm   a glm object fit using formula.
    ANOD      analysis of deviance table for the model fit using formula.
    loglik    the log-likelihood for the generalized linear model fit using formula.
    AIC       Akaike Information Criterion for the generalized linear model fit using formula.

Author(s)

    Pamela A. McCaskie

References

    Dobson, A.J. (1990) An Introduction to Generalized Linear Models. London: Chapman and Hall.
    Hastie, T.J., Pregibon, D. (1992) Generalized linear models. Chapter 6 of Statistical Models in S, eds Chambers, J.M., Hastie, T.J., Wadsworth & Brooks/Cole.
    McCaskie, P.A., Carter, K.W., Hazelton, M., Palmer, L.J. (2007) SimHap: A comprehensive modeling framework for epidemiological outcomes and a multiple-imputation approach to haplotypic analysis of population-based data, [online]
    McCullagh, P., Nelder, J.A. (1989) Generalized Linear Models. London: Chapman and Hall.
    Venables, W.N., Ripley, B.D. (2002) Modern Applied Statistics with S. New York: Springer.

See Also

    snp.bin, haplo.bin, epi.quant
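As a rough illustration of the kind of fit described for epi.bin, the sketch below calls glm directly with a binomial family and logit link on simulated data, then exponentiates the coefficients to obtain odds ratios with Wald confidence intervals. It does not call epi.bin. The data frame toy, its columns, and the helper or_table are hypothetical, and the layout of the real results table may differ.

```r
# Simulated phenotype data (hypothetical columns; this is not pheno.dat)
set.seed(1)
n <- 200
toy <- data.frame(age = rnorm(n, 50, 10), sbp = rnorm(n, 130, 15))
toy$plaque <- rbinom(n, 1, plogis(-6 + 0.05 * toy$age + 0.02 * toy$sbp))

# Logistic regression: binomial errors, logit link
fit <- glm(plaque ~ age + sbp, data = toy, family = binomial(link = "logit"))

# Odds ratios with Wald confidence intervals and p-values
or_table <- function(fit, level = 0.95) {
  est <- coef(summary(fit))
  z <- qnorm(1 - (1 - level) / 2)
  data.frame(
    OR    = exp(est[, "Estimate"]),
    lower = exp(est[, "Estimate"] - z * est[, "Std. Error"]),
    upper = exp(est[, "Estimate"] + z * est[, "Std. Error"]),
    p     = est[, "Pr(>|z|)"],
    row.names = rownames(est)
  )
}

res <- or_table(fit)
print(res)
print(logLik(fit))  # log-likelihood of the fitted model
print(AIC(fit))     # Akaike Information Criterion
```

The intercept row is included only for completeness; its exponentiated value is a baseline odds, not an odds ratio.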

Examples

    data(pheno.dat)
    mymodel <- epi.bin(formula=plaque~age+sbp, pheno=pheno.dat)

    # example with a subsetting variable, looking at males only
    mymodel <- epi.bin(formula=plaque~age+sbp, pheno=pheno.dat,
        sub=expression(sex==1))


epi.cc.match            Epidemiological analysis for matched case-control data

Description

    epi.cc.match is used to fit conditional logistic regression models to matched case-control data.

Usage

    epi.cc.match(formula, pheno, sub = NULL)

Arguments

    formula   a symbolic description of the model to be fit. The response must be a binary indicator of case-control status, and the formula must contain a variable indicating strata, or the matching sequence.
    pheno     a dataframe containing phenotype data.
    sub       an expression representing a subset of the data on which to fit the models.

Details

    formula should be in the form: response ~ predictor(s) + strata(strata_variable).

Value

    epi.cc.match returns an object of class epiclogit containing the following items:

    results     a table containing the odds ratios, confidence intervals and p-values of the parameter estimates.
    formula     formula passed to epi.cc.match.
    Wald        the Wald test for overall significance of the fitted model including SNP parameters.
    loglik      the log-likelihood for the model fit using formula.
    fit.coxph   an object of class clogit fit using formula. See clogit for details.
    rsquared    r-squared values for the fitted model.
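For readers unfamiliar with the response ~ predictor(s) + strata(strata_variable) form, here is a minimal sketch of a conditional logistic regression fitted directly with clogit from the survival package. It uses the infert data set that ships with R rather than pheno.dat, so the variable names (case, spontaneous, induced, stratum) are those of infert and are not part of SimHap.

```r
library(survival)  # provides clogit() and strata()

# 'stratum' identifies the matched sets in the infert data
fit <- clogit(case ~ spontaneous + induced + strata(stratum), data = infert)

# Odds ratios and 95% confidence intervals on the odds-ratio scale
or <- exp(coef(fit))
ci <- exp(confint(fit))
print(cbind(OR = or, ci))
```

The strata() term is what makes the likelihood conditional on each matched set; leaving it out would turn the fit into an unmatched analysis.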

Author(s)

    Pamela A. McCaskie

References

    McCaskie, P.A., Carter, K.W., Hazelton, M., Palmer, L.J. (2007) SimHap: A comprehensive modeling framework for epidemiological outcomes and a multiple-imputation approach to haplotypic analysis of population-based data, [online]

See Also

    epi.bin, clogit

Examples

    data(pheno.dat)
    mymodel <- epi.cc.match(formula=disease~sbp+dbp+strata(strat), pheno=pheno.dat)

    # example with subsetting variable
    mymodel <- epi.cc.match(formula=disease~sbp+dbp+strata(strat), pheno=pheno.dat,
        sub=expression(sex==1))


epi.long                Epidemiological analysis for longitudinal data

Description

    epi.long is used to fit linear mixed effects models to epidemiological data with longitudinal outcomes.

Usage

    epi.long(fixed, random, pheno, cor="corCAR1", value = 0.2, form=~1, sub = NULL)

Arguments

    fixed     as per lme. A two-sided linear formula object describing the fixed-effects part of the model including SNP parameters, with the response on the left of a ~ operator and the terms, separated by + operators, on the right.

    random    as per lme. A one-sided formula of the form ~x1+...+xn | g1/.../gm, with x1+...+xn specifying the model for the random effects and g1/.../gm the grouping structure (m may be equal to 1, in which case no / is required). The random effects formula will be repeated for all levels of grouping, in the case of multiple levels of grouping.
    pheno     a dataframe containing phenotype data.
    cor       a corStruct object describing the within-group correlation structure. Available correlation structures are corAR1, corCAR1, and corCompSymm. See the documentation of corClasses for a description of these. Defaults to corCAR1.
    value     for corAR1, the value of the lag 1 autocorrelation, which must be between -1 and 1. For corCAR1, the correlation between two observations one unit of time apart, which must be between 0 and 1. For corCompSymm, the correlation between any two correlated observations. Defaults to 0.2.
    form      a one-sided formula of the form ~ t, or ~ t | g, specifying a time covariate t and, optionally, a grouping factor g. A covariate for this correlation structure must be integer valued. When a grouping factor is present in form, the correlation structure is assumed to apply only to observations within the same grouping level; observations with different grouping levels are assumed to be uncorrelated. Defaults to ~ 1, which corresponds to using the order of the observations in the data as a covariate, and no groups.
    sub       an expression representing a subset of the data on which to fit the models.

Details

    cor will always default to corCAR1 and value will always default to 0.2. Be sure to change both parameters accordingly if desired. See corClasses for more details.

Value

    epi.long returns an object of class epilong. The summary function can be used to obtain and print a summary of the results.
    An object of class epilong is a list containing the following components:

    results          a table containing the coefficients, standard errors and p-values of the parameter estimates.
    fixed_formula    fixed effects formula.
    random_formula   random effects formula.
    fit.lme          an lme object fit using formula.
    ANOD             analysis of deviance table for the fitted model.
    loglik           the log-likelihood for the fitted model.
    AIC              Akaike Information Criterion for the model fit using formula.
    corstruct        correlation structure used in the fitted model.

Author(s)

    Pamela A. McCaskie

References

    Bates, D.M., Pinheiro, J.C. (1998) Computational methods for multilevel models. Available in PostScript or PDF formats at
    Box, G.E.P., Jenkins, G.M., Reinsel, G.C. (1994) Time Series Analysis: Forecasting and Control, 3rd Edition, Holden-Day.
    Davidian, M., Giltinan, D.M. (1995) Nonlinear Mixed Effects Models for Repeated Measurement Data, Chapman and Hall.
    Laird, N.M., Ware, J.H. (1982) Random-Effects Models for Longitudinal Data, Biometrics, 38,
    Lindstrom, M.J., Bates, D.M. (1988) Newton-Raphson and EM Algorithms for Linear Mixed-Effects Models for Repeated-Measures Data, Journal of the American Statistical Association, 83,
    Littell, R.C., Milliken, G.A., Stroup, W.W., Wolfinger, R.D. (1996) SAS Systems for Mixed Models, SAS Institute.
    McCaskie, P.A., Carter, K.W., Hazelton, M., Palmer, L.J. (2007) SimHap: A comprehensive modeling framework for epidemiological outcomes and a multiple imputation approach to haplotypic analysis of population-based data, [online]
    Pinheiro, J.C., Bates, D.M. (1996) Unconstrained Parametrizations for Variance-Covariance Matrices, Statistics and Computing, 6,
    Pinheiro, J.C., Bates, D.M. (2000) Mixed-Effects Models in S and S-PLUS, Springer.

See Also

    snp.long, haplo.long, corClasses

Examples

    data(longpheno.dat)
    mymodel <- epi.long(fixed=fev1f~height+weight, random=~1|ID,
        pheno=longpheno.dat, form=~year|ID)

    # example with a subsetting variable, looking at males only
    mymodel <- epi.long(fixed=fev1f~height+weight, random=~1|ID,
        pheno=longpheno.dat, form=~year|ID, sub=expression(sex==1))
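To make the meaning of value concrete for the continuous-time AR(1) structure, the following base R sketch builds the within-subject correlation matrix implied by "the correlation between two observations one unit of time apart". It assumes the usual form corr = rho^|s - t| and does not call nlme or SimHap; the helper car1_matrix and the visit times are hypothetical.

```r
# Correlation implied by a continuous-time AR(1) structure:
# corr(y_s, y_t) = rho^|s - t|, where rho is the correlation one time unit apart
car1_matrix <- function(times, rho = 0.2) {
  stopifnot(rho >= 0, rho < 1)
  outer(times, times, function(s, t) rho^abs(s - t))
}

# One subject measured at years 1, 2 and 4 (hypothetical visit times)
m <- car1_matrix(c(1, 2, 4), rho = 0.2)
print(round(m, 4))
```

With the default value of 0.2, observations two years apart have correlation 0.04 and observations three years apart 0.008, so the default implies that dependence fades quickly with time.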

epi.quant               Epidemiological analysis for quantitative outcomes

Description

    epi.quant is used to fit linear regression models to single SNP genotype and phenotype data for a continuous Normal outcome.

Usage

    epi.quant(formula, pheno, sub = NULL)

Arguments

    formula   a symbolic description of the full model to be fit. The details of model specification are given below.
    pheno     a dataframe containing phenotype data.
    sub       an expression representing a subset of the data on which to fit the models.

Details

    formula should be in the form response ~ predictor(s). A formula has an implied intercept term. See the documentation for the formula function for more details of allowed formulae.

Value

    epi.quant returns an object of class epiquant containing the following items:

    formula   formula passed to epi.quant.
    results   a table containing the coefficients, standard errors and p-values of the parameter estimates.
    fit.lm    an lm object fit using formula.
    ANOD      analysis of deviance table for the model fit using formula.
    loglik    the log-likelihood for the linear model fit using formula.
    AIC       Akaike Information Criterion for the linear model fit using formula.

Author(s)

    Pamela A. McCaskie
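The sub argument takes an R expression such as expression(sex==1). The manual does not say how SimHap evaluates it internally, so the sketch below shows only one plausible base R approach: evaluate the expression inside the data frame to get a logical filter, then fit an ordinary linear model on the retained rows. The data frame toy and its columns are simulated and hypothetical.

```r
# Simulated phenotype data (hypothetical; this is not pheno.dat)
set.seed(2)
toy <- data.frame(sex = rep(c(1, 2), each = 50),
                  age = rnorm(100, 50, 10))
toy$ldl <- 2 + 0.03 * toy$age + rnorm(100, sd = 0.5)

# A subsetting expression in the same style as sub=expression(sex==1)
sub <- expression(sex == 1)

# Evaluate the expression with the data frame's columns in scope
keep  <- eval(sub, envir = toy)
males <- toy[keep, ]

# Ordinary linear regression on the subset
fit <- lm(ldl ~ age, data = males)
print(coef(summary(fit)))  # coefficients, standard errors, p-values
print(logLik(fit))
print(AIC(fit))
```

Passing the condition as an unevaluated expression lets the column name sex be resolved against the phenotype data rather than the calling environment.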

References

    Chambers, J.M. (1992) Linear models. Chapter 4 of Statistical Models in S, eds Chambers, J.M., Hastie, T.J., Wadsworth & Brooks/Cole.
    McCaskie, P.A., Carter, K.W., Hazelton, M., Palmer, L.J. (2007) SimHap: A comprehensive modeling framework for epidemiological outcomes and a multiple imputation approach to haplotypic analysis of population-based data, [online]
    Wilkinson, G.N., Rogers, C.E. (1973) Symbolic descriptions of factorial models for analysis of variance. Applied Statistics, 22,

See Also

    snp.quant, haplo.quant, epi.bin

Examples

    data(pheno.dat)
    mymodel <- epi.quant(formula=ldl~age+sbp, pheno=pheno.dat)

    # example with a subsetting variable, looking at males only
    mymodel <- epi.quant(formula=ldl~age+sbp, pheno=pheno.dat,
        sub=expression(sex==1))


epi.surv                Epidemiological analysis for survival data

Description

    epi.surv is used to fit Cox proportional hazards models to epidemiological survival data.

Usage

    epi.surv(formula, pheno, sub = NULL)

Arguments

    formula   a symbolic description of the model to be fit. The response must be a survival object as returned by the Surv function.
    pheno     a dataframe containing phenotype data.
    sub       an expression representing a subset of the data on which to fit the models.

Details

    formula should be in the form response ~ predictor(s). A formula has an implied intercept term. See the documentation for the formula function for more details of allowed formulae.

Value

    epi.surv returns an object of class episurv containing the following items:

    results     a table containing the hazard ratios, confidence intervals and p-values of the parameter estimates.
    formula     formula passed to epi.surv.
    Wald        the Wald test for overall significance of the fitted model including SNP parameters.
    loglik      the log-likelihood for the model fit using formula.
    fit.coxph   an object of class coxph fit using formula. See coxph.object for details.
    rsquared    r-squared values for the fitted model.

Author(s)

    Pamela A. McCaskie

References

    Andersen, P., Gill, R. (1982) Cox's regression model for counting processes, a large sample study, Annals of Statistics, 10:
    McCaskie, P.A., Carter, K.W., Hazelton, M., Palmer, L.J. (2007) SimHap: A comprehensive modeling framework for epidemiological outcomes and a multiple imputation approach to haplotypic analysis of population-based data, [online]
    Therneau, T., Grambsch, P., Fleming, T. Martingale based residuals for survival models, Biometrika, 77(1):

See Also

    snp.surv, haplo.surv

Examples

    data(survpheno.dat)
    mymodel <- epi.surv(formula=Surv(time, status)~age, pheno=survpheno.dat)

    # example with subsetting variable
    mymodel <- epi.surv(formula=Surv(time, status)~age, pheno=survpheno.dat,
        sub=expression(sex==1))
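The response in these formulas is built with Surv from the survival package (note the capital S). As a minimal sketch of the underlying model, assuming only the survival package and its bundled lung data set rather than survpheno.dat, a Cox fit with hazard ratios and an overall Wald test can be obtained like this:

```r
library(survival)  # provides Surv() and coxph()

# lung ships with the survival package; time and status define the
# right-censored outcome
fit <- coxph(Surv(time, status) ~ age, data = lung)

hr <- exp(coef(fit))     # hazard ratio per year of age
ci <- exp(confint(fit))  # 95% confidence interval on the hazard-ratio scale
print(cbind(HR = hr, ci))

# Overall Wald test (statistic, degrees of freedom, p-value)
print(summary(fit)$waldtest)
```

This only shows the kind of quantities a results table of hazard ratios would be built from; the exact contents of the episurv object are as listed under Value.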

haplo.bin               Haplotype analysis for a binary trait

Description

    haplo.bin performs a series of generalized linear models using a simulation-based approach to account for uncertainty in haplotype assignment when phase is unknown.

Usage

    haplo.bin(formula1, formula2, pheno, haplo, sim, effect = "add", sub = NULL,
        adjust = FALSE)

Arguments

    formula1   a symbolic description of the full model including haplotype parameters to be fit. The details of model specification are given below.
    formula2   a symbolic description of the nested model excluding haplotype parameters, to be compared to formula1 in a likelihood ratio test.
    pheno      a phenotype data set.
    haplo      a haplo object made by make.haplo.rare. The subjects must be in the same order as they are in the phenotype data.
    sim        the number of simulations from which to evaluate the results.
    effect     the genetic effect type: "add" for additive, "dom" for dominant and "rec" for recessive. Defaults to additive. See note.
    sub        an expression representing a subset of the data on which to fit the models.
    adjust     a logical flag. If adjust=TRUE, the adjusted degrees of freedom is used. This is recommended when the computed degrees of freedom is larger than the complete data degrees of freedom. By default, adjust=FALSE.

Details

    formula1 should be in the form outcome ~ predictor(s) + haplotype(s) and formula2 should be in the form outcome ~ predictor(s). A formula has an implied intercept term. See the documentation for the formula function for more details of allowed formulae.

Value

    haplo.bin returns an object of class hapbin. The summary function can be used to obtain and print a summary of the results. An object of class hapbin is a list containing the following components:

    formula1   formula1 passed to haplo.bin.
    formula2   formula2 passed to haplo.bin.

    results            a table containing the coefficients, averaged over the sim models performed; standard errors, computed as the sum of the between-imputation and within-imputation variance; and p-values, based on a t-distribution with appropriately computed degrees of freedom, of the parameter estimates.
    empiricalresults   a list containing the odds ratios, confidence intervals and p-values calculated at each simulation.
    ANOD               analysis of deviance table for the model fit using formula1, averaged over all simulations.
    loglik             the average log-likelihood for the generalized linear model fit using formula1.
    WALD               a Wald test, testing for significant improvement of the model when haplotypic parameters are included.
    aic                Akaike Information Criterion for the generalized linear model fit using formula1, averaged over all simulations.
    aicpredicted       Akaike Information Criteria calculated at each simulation.
    effect             the haplotypic effect modelled, ADDITIVE, DOMINANT or RECESSIVE.

Note

    To model a codominant haplotypic effect, define the desired haplotype as a factor in the formula1 argument, e.g. factor(h.aaa), and use the default option for effect.

Author(s)

    Pamela A. McCaskie

References

    Barnard, J., Rubin, D.B. (1999) Small-sample degrees of freedom with multiple imputation. Biometrika, 86,
    Dobson, A.J. (1990) An Introduction to Generalized Linear Models. London: Chapman and Hall.
    Hastie, T.J., Pregibon, D. (1992) Generalized linear models. Chapter 6 of Statistical Models in S, eds Chambers, J.M., Hastie, T.J., Wadsworth & Brooks/Cole.
    Little, R.J.A., Rubin, D.B. (2002) Statistical Analysis with Missing Data. John Wiley and Sons, New Jersey.
    McCaskie, P.A., Carter, K.W., Hazelton, M., Palmer, L.J. (2007) SimHap: A comprehensive modeling framework for epidemiological outcomes and a multiple-imputation approach to haplotypic analysis of population-based data, [online]
    McCullagh, P., Nelder, J.A. (1989) Generalized Linear Models. London: Chapman and Hall.
    Rubin, D.B. (1996) Multiple imputation after 18+ years (with discussion). Journal of the American Statistical Association, 91:
    Venables, W.N., Ripley, B.D. (2002) Modern Applied Statistics with S. New York: Springer.

See Also

    snp.bin, haplo.quant, haplo.long
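The results table for haplo.bin combines estimates across the sim simulated data sets using between-imputation and within-imputation variance and a t-distribution. The sketch below shows the textbook form of Rubin's rules for a single coefficient, as in the multiple-imputation references cited above. It is an assumption that SimHap follows this exact form: the (1 + 1/m) factor and the degrees-of-freedom formula are the standard ones, the helper pool_rubin is hypothetical, and the small-sample adjustment selected by adjust=TRUE is not shown.

```r
# Pool one coefficient across m simulated data sets with Rubin's rules.
# est: per-simulation estimates; se: per-simulation standard errors
pool_rubin <- function(est, se) {
  m     <- length(est)
  qbar  <- mean(est)              # pooled estimate
  W     <- mean(se^2)             # within-imputation variance
  B     <- var(est)               # between-imputation variance
  Tvar  <- W + (1 + 1 / m) * B    # total variance
  df    <- (m - 1) * (1 + W / ((1 + 1 / m) * B))^2
  tstat <- qbar / sqrt(Tvar)
  p     <- 2 * pt(-abs(tstat), df = df)
  c(estimate = qbar, se = sqrt(Tvar), df = df, p = p)
}

# Three simulations with estimates 1, 2, 3 and unit standard errors
pooled <- pool_rubin(est = c(1, 2, 3), se = c(1, 1, 1))
print(pooled)
```

In this toy case the pooled estimate is 2, the total variance is 1 + (4/3) * 1 = 7/3, and the degrees of freedom are 2 * (1 + 3/4)^2 = 6.125. The formula requires some variation between simulations, since B = 0 would make the degrees of freedom infinite.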

Examples

    data(SNP.dat)

    # convert SNP.dat to format required by infer.haplos
    haplo.dat <- SNP2Haplo(SNP.dat)

    data(pheno.dat)

    # generate haplotype frequencies and haplotype design matrix
    myinfer <- infer.haplos(haplo.dat)

    # print haplotype frequencies generated by infer.haplos
    myinfer$hap.freq

    # generate haplo object where haplotypes with a frequency
    # below min.freq are grouped as a category called "rare"
    myhaplo <- make.haplo.rare(myinfer, min.freq=0.05)

    mymodel <- haplo.bin(formula1=plaque~age+sbp+h.n1aa,
        formula2=plaque~age+sbp, pheno=pheno.dat, haplo=myhaplo, sim=10)

    # example with a subsetting variable, looking at males only
    # and modelling a dominant haplotypic effect
    mymodel <- haplo.bin(formula1=plaque~age+sbp+h.n1aa,
        formula2=plaque~age+sbp, pheno=pheno.dat, haplo=myhaplo, sim=10,
        effect="dom", sub=expression(sex==1))


haplo.cc.match          Haplotype analysis for matched case-control data

Description

    haplo.cc.match fits a series of conditional logistic regression models to matched case-control data with haplotypes, using a simulation-based approach to account for uncertainty in haplotype assignment when phase is unknown.

Usage

    haplo.cc.match(formula1, formula2, pheno, haplo, sim, effect = "add", sub = NULL)

Arguments

    formula1   a symbolic description of the full model to be fit, including haplotype parameters. The response must be a binary indicator of case-control status, and the formula must contain a variable indicating strata, or the matching sequence.

    formula2   a symbolic description of the nested model excluding haplotype parameters, to be compared to formula1 in a likelihood ratio test. The response must be a binary indicator of case-control status, and the formula must contain a variable indicating strata, or the matching sequence.
    pheno      a dataframe containing phenotype data.
    haplo      a haplotype object made by make.haplo.rare.
    sim        the number of simulations from which to evaluate the results.
    effect     the genetic effect type: "add" for additive, "dom" for dominant and "rec" for recessive. Defaults to additive. See note.
    sub        optional. An expression using a binary operator, representing a subset of individuals on which to perform analysis, e.g. sub=expression(sex==1).

Details

    formula1 should be in the form: response ~ predictor(s) + strata(strata_variable) + haplotype(s) and formula2 should be in the form: response ~ predictor(s) + strata(strata_variable). If case-control data is not matched, the haplo.bin function should be used.

Value

    haplo.cc.match returns an object of class hapclogit. The summary function can be used to obtain and print a summary of the results. An object of class hapclogit is a list containing the following components:

    formula1           formula1 passed to haplo.cc.match.
    formula2           formula2 passed to haplo.cc.match.
    results            a table containing the odds ratios, confidence intervals and p-values of the parameter estimates, averaged over the n=sim models performed.
    empiricalresults   a list containing the odds ratios, confidence intervals and p-values calculated at each simulation.
    loglik             the average log-likelihood for the n=sim models fit using formula1.
    LRT                a likelihood ratio test, testing for significant improvement of the model when haplotypic parameters are included.
    ANOVA              analysis of variance, comparing the two models fit with and without haplotypic parameters.
    Wald               the Wald test for overall significance of the fitted model including haplotypes.
    rsquared           r-squared values for models fit using formula1 and formula2.
    effect             the haplotypic effect modelled, ADDITIVE, DOMINANT or RECESSIVE.

Note

    To model a codominant haplotypic effect, define the desired haplotype as a factor in the formula1 argument, e.g. factor(h.aaa), and use the default option for effect.

Author(s)

    Pamela A. McCaskie

References

    Little, R.J.A., Rubin, D.B. (2002) Statistical Analysis with Missing Data. John Wiley and Sons, New Jersey.
    McCaskie, P.A., Carter, K.W., Hazelton, M., Palmer, L.J. (2007) SimHap: A comprehensive modeling framework for epidemiological outcomes and a multiple-imputation approach to haplotypic analysis of population-based data, [online]
    Rubin, D.B. (1996) Multiple imputation after 18+ years (with discussion). Journal of the American Statistical Association, 91:

See Also

    snp.cc.match, haplo.bin

Examples

    data(SNP.dat)

    # convert SNP.dat to format required by infer.haplos
    haplo.dat <- SNP2Haplo(SNP.dat)

    data(pheno.dat)

    newdata <- prepare.cc(geno=haplo.dat, pheno=pheno.dat, cc.var="disease")
    newhaplo.dat <- newdata$geno
    newpheno.dat <- newdata$pheno

    # generates haplotype frequencies and haplotype design matrix
    myinfer <- infer.haplos.cc(geno=newhaplo.dat, pheno=newpheno.dat,
        cc.var="disease")

    # prints haplotype frequencies among cases
    myinfer$hap.freq.cases

    # prints haplotype frequencies among controls
    myinfer$hap.freq.controls

    # generate haplo object where haplotypes with a frequency
    # below min.freq are grouped as a category called "rare"
    myhaplo <- make.haplo.rare(myinfer, min.freq=0.05)

    mymodel <- haplo.cc.match(formula1=disease~sbp+dbp+h.n1aa+strata(strat),
        formula2=disease~sbp+dbp+strata(strat), haplo=myhaplo,
        pheno=pheno.dat, sim=10)

    # example using a subsetting variable - looking at males only
    mymodel <- haplo.cc.match(formula1=disease~sbp+dbp+h.n1aa+strata(strat),
        formula2=disease~sbp+dbp+strata(strat), haplo=myhaplo,
        pheno=pheno.dat, sim=10, sub=expression(sex==1))


haplo.long              Haplotype analysis for longitudinal data

Description

    haplo.long performs a series of linear mixed effects models using a simulation-based approach to account for uncertainty in haplotype assignment when phase is unknown.

Usage

    haplo.long(fixed, random, pheno, haplo, cor = NULL, value = 0.2, form = ~1, sim,
        effect = "add", sub = NULL, adjust = FALSE)

Arguments

    fixed     as per lme. A two-sided linear formula object describing the fixed-effects part of the model including SNP parameters, with the response on the left of a ~ operator and the terms, separated by + operators, on the right.
    random    as per lme. A one-sided formula of the form ~x1+...+xn | g1/.../gm, with x1+...+xn specifying the model for the random effects and g1/.../gm the grouping structure (m may be equal to 1, in which case no / is required). The random effects formula will be repeated for all levels of grouping, in the case of multiple levels of grouping.
    pheno     a dataframe containing phenotype data.
    haplo     a haplotype object made by make.haplo.rare. The subjects must be in the same order as they are in the phenotype data.
    cor       a corStruct object describing the within-group correlation structure. Available correlation structures are corAR1, corCAR1, and corCompSymm. See the documentation of corClasses for a description of these. Defaults to NULL, corresponding to no within-subject correlation.
    value     for corAR1, the value of the lag 1 autocorrelation, which must be between -1 and 1. For corCAR1, the correlation between two observations one unit of time apart, which must be between 0 and 1. For corCompSymm, the correlation between any two correlated observations. Defaults to 0.2.

    form      a one-sided formula of the form ~ t, or ~ t | g, specifying a time covariate t and, optionally, a grouping factor g. A covariate for this correlation structure must be integer valued. When a grouping factor is present in form, the correlation structure is assumed to apply only to observations within the same grouping level; observations with different grouping levels are assumed to be uncorrelated. Defaults to ~ 1, which corresponds to using the order of the observations in the data as a covariate, and no groups.
    sim       the number of simulations from which to evaluate the results.
    effect    the haplotypic effect type: "add" for additive, "dom" for dominant and "rec" for recessive. Defaults to additive. See note.
    sub       optional. An expression representing a subset of individuals on which to perform analysis, e.g. sub=expression(sex==1).
    adjust    a logical flag. If adjust=TRUE, the adjusted degrees of freedom is used. This is recommended when the computed degrees of freedom is larger than the complete data degrees of freedom. By default, adjust=FALSE.

Value

    haplo.long returns an object of class haplong. The summary function can be used to obtain and print a summary of the results. An object of class haplong is a list containing the following components:

    fixed_formula      fixed effects formula.
    random_formula     random effects formula.
    results            a table containing the coefficients, averaged over the sim models performed; standard errors, computed as the sum of the between-imputation and within-imputation variance; and p-values, based on a t-distribution with appropriately computed degrees of freedom, of the parameter estimates.
    empiricalresults   a list containing the coefficients, standard errors and p-values calculated at each simulation.
    ANOD               analysis of deviance table for the fitted model.
    loglik             the log-likelihood for the fitted model.
    AIC                Akaike Information Criterion for the linear model fit using formula.
    aicempirical       Akaike Information Criteria calculated at each simulation.
    corstruct          correlation structure used in the fitted model.
    effect             the haplotypic effect modelled, ADDITIVE, DOMINANT or RECESSIVE.

Note

    To model a codominant haplotypic effect, define the desired haplotype as a factor in the fixed argument, e.g. factor(h.aaa), and use the default option for effect.

Author(s)

    Pamela A. McCaskie

References

    Barnard, J., Rubin, D.B. (1999) Small-sample degrees of freedom with multiple imputation. Biometrika, 86,
    Bates, D.M., Pinheiro, J.C. (1998) Computational methods for multilevel models. Available in PostScript or PDF formats at
    Box, G.E.P., Jenkins, G.M., Reinsel, G.C. (1994) Time Series Analysis: Forecasting and Control, 3rd Edition, Holden-Day.
    Davidian, M., Giltinan, D.M. (1995) Nonlinear Mixed Effects Models for Repeated Measurement Data, Chapman and Hall.
    Laird, N.M., Ware, J.H. (1982) Random-Effects Models for Longitudinal Data, Biometrics, 38,
    Lindstrom, M.J., Bates, D.M. (1988) Newton-Raphson and EM Algorithms for Linear Mixed-Effects Models for Repeated-Measures Data, Journal of the American Statistical Association, 83,
    Littell, R.C., Milliken, G.A., Stroup, W.W., Wolfinger, R.D. (1996) SAS Systems for Mixed Models, SAS Institute.
    Little, R.J.A., Rubin, D.B. (2002) Statistical Analysis with Missing Data. John Wiley and Sons, New Jersey.
    McCaskie, P.A., Carter, K.W., Hazelton, M., Palmer, L.J. (2007) SimHap: A comprehensive modeling framework for epidemiological outcomes and a multiple imputation approach to haplotypic analysis of population-based data, [online]
    Pinheiro, J.C., Bates, D.M. (1996) Unconstrained Parametrizations for Variance-Covariance Matrices, Statistics and Computing, 6,
    Pinheiro, J.C., Bates, D.M. (2000) Mixed-Effects Models in S and S-PLUS, Springer.
    Rubin, D.B. (1996) Multiple imputation after 18+ years (with discussion). Journal of the American Statistical Association, 91:

See Also

    snp.long, haplo.quant, haplo.long

Examples

    data(SNPlong.dat)

    # convert SNPlong.dat to format required by infer.haplos
    longhaplo.dat <- SNP2Haplo(SNPlong.dat)

    data(longpheno.dat)

    # generate haplotype frequencies and haplotype design matrix
    myinfer <- infer.haplos(longhaplo.dat)

    # print haplotype frequencies generated by infer.haplos
    myinfer$hap.freq

    # generate haplo object where haplotypes with a frequency
    # below min.freq are grouped as a category called "rare"
    myhaplo <- make.haplo.rare(myinfer, min.freq=0.05)

    mymodel <- haplo.long(fixed=fev1f~h.acv2, random=~1|ID,
        pheno=longpheno.dat, haplo=myhaplo, cor="corCAR1",
        form=~year|ID, sim=10)

    # example with a subsetting variable - looking at males only
    mymodel <- haplo.long(fixed=fev1f~height+h.acv2, random=~1|ID,
        pheno=longpheno.dat, haplo=myhaplo, cor="corCAR1",
        form=~year|ID, sim=10, sub=expression(sex==1))


haplo.quant             Haplotype analysis for a Normally distributed quantitative trait

Description

    haplo.quant performs a series of linear models using a simulation-based approach to account for uncertainty in haplotype assignment when phase is unknown.

Usage

    haplo.quant(formula1, formula2, pheno, haplo, sim, effect = "add", sub = NULL,
        predict_variable = NULL, adjust = FALSE)

Arguments

    formula1           a symbolic description of the full model including haplotype parameters to be fit. The details of model specification are given below.
    formula2           a symbolic description of the nested model excluding haplotype parameters, to be compared to formula1 in a likelihood ratio test.
    pheno              a dataframe containing phenotype data.
    haplo              a haplotype object made by make.haplo.rare. The subjects must be in the same order as they are in the phenotype data.
    sim                the number of simulations from which to evaluate the results.
    effect             the haplotypic effect type: "add" for additive, "dom" for dominant and "rec" for recessive. Defaults to additive. See note.
    sub                optional. An expression representing a subset of individuals on which to perform analysis, e.g. sub=expression(sex==1).
    predict_variable   the name of a variable, given as a character string such as "h.n1aa" in the examples, for which estimated marginal means are predicted (see predicted under Value). Defaults to NULL.

adjust    a logical flag. If adjust=TRUE, the adjusted degrees of freedom are used. This is recommended when the computed degrees of freedom are larger than the complete-data degrees of freedom. By default, adjust=FALSE.

Details

formula1 should be in the form response ~ predictor(s) + haplotype(s) and formula2 should be in the form response ~ predictor(s). A formula has an implied intercept term. See formula for more details of allowed formulae.

Value

haplo.quant returns an object of class hapquant. The summary function can be used to obtain and print a summary of the results. An object of class hapquant is a list containing the following components:

formula1    formula1 passed to haplo.quant.
formula2    formula2 passed to haplo.quant.
results    a table containing the coefficients, averaged over the sim models performed; standard errors, computed from the sum of the between-imputation and within-imputation variance; and p-values of the parameter estimates, based on a t-distribution with appropriately computed degrees of freedom.
empiricalresults    a list containing the coefficients, standard errors and p-values calculated at each simulation.
rsquared    r-squared values for models fit using formula1 and formula2.
ANOD    analysis of deviance table for the model fit using formula1, averaged over all simulations.
loglik    the average log-likelihood for the linear model fit using formula1.
WALD    a Wald test, testing for significant improvement of the model when haplotypic parameters are included.
predicted    estimated marginal means of the outcome variable broken down by haplotype levels, evaluated at mean values of the model predictors, averaged over all simulations.
empiricalpredicted    estimated marginal means calculated at each simulation.
aic    Akaike Information Criterion for the linear model fit using formula1, averaged over all simulations.
aicpredicted    Akaike Information Criteria calculated at each simulation.
effect    the haplotypic effect modelled: ADDITIVE, DOMINANT or RECESSIVE.

Note

To model a codominant haplotypic effect, define the desired haplotype as a factor in the formula1 argument, e.g. factor(h.aaa), and use the default option for effect.
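The averaging of coefficients over the sim models, and the combination of between- and within-imputation variance described under Value, can be illustrated with a small base R sketch. This is not SimHap's internal code: the toy data, the posterior probabilities and all object names are hypothetical, and the pooling follows the standard multiple-imputation rules of Rubin (with the Barnard and Rubin style adjustment left out), which may differ in detail from what haplo.quant computes.

```r
set.seed(1)

# Hypothetical toy data: a quantitative outcome and, for each subject,
# posterior probabilities of carrying 0, 1 or 2 copies of a haplotype.
n <- 200
true_copies <- rbinom(n, size = 2, prob = 0.3)
y <- 1 + 0.5 * true_copies + rnorm(n)
post <- matrix(0.1, nrow = n, ncol = 3)
post[cbind(seq_len(n), true_copies + 1)] <- 0.8   # most weight on the truth

sim <- 10
est <- numeric(sim)   # haplotype coefficient at each simulation
se2 <- numeric(sim)   # its squared standard error at each simulation

for (s in seq_len(sim)) {
  # draw one haplotype configuration per subject from its posterior
  copies <- apply(post, 1, function(p) sample(0:2, size = 1, prob = p))
  fit <- lm(y ~ copies)                      # additive effect
  est[s] <- coef(fit)["copies"]
  se2[s] <- vcov(fit)["copies", "copies"]
}

# Pool across simulations (Rubin's rules)
q_bar     <- mean(est)                        # averaged coefficient
w_bar     <- mean(se2)                        # within-imputation variance
b         <- var(est)                         # between-imputation variance
total_var <- w_bar + (1 + 1 / sim) * b
df        <- (sim - 1) * (1 + w_bar / ((1 + 1 / sim) * b))^2
t_stat    <- q_bar / sqrt(total_var)
p_value   <- 2 * pt(abs(t_stat), df = df, lower.tail = FALSE)

round(c(estimate = q_bar, se = sqrt(total_var), df = df, p = p_value), 4)
```

Because the haplotype assignment is redrawn at every simulation, the pooled standard error is never smaller than the average single-model standard error, which is how the phase uncertainty enters the inference.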

Author(s)

Pamela A. McCaskie

References

Chambers, J.M. (1992) Linear models. Chapter 4 of Statistical Models in S, eds Chambers, J.M., Hastie, T.J., Wadsworth & Brooks/Cole.
Little, R.J.A., Rubin, D.B. (2002) Statistical Analysis with Missing Data. John Wiley and Sons, New Jersey.
McCaskie, P.A., Carter, K.W., Hazelton, M., Palmer, L.J. (2007) SimHap: A comprehensive modeling framework for epidemiological outcomes and a multiple imputation approach to haplotypic analysis of population-based data, [online]
Rubin, D.B. (1996) Multiple imputation after 18+ years (with discussion). Journal of the American Statistical Association, 91:
Wilkinson, G.N., Rogers, C.E. (1973) Symbolic descriptions of factorial models for analysis of variance. Applied Statistics, 22,
Barnard, J., Rubin, D.B. (1999) Small-sample degrees of freedom with multiple imputation. Biometrika, 86,

See Also

haplo.bin

Examples

    data(SNP.dat)

    # convert SNP.dat to format required by infer.haplos
    haplo.dat <- SNP2Haplo(SNP.dat)

    data(pheno.dat)

    # generate haplotype frequencies and haplotype design matrix
    myinfer <- infer.haplos(haplo.dat)

    # print haplotype frequencies generated by infer.haplos
    myinfer$hap.freq

    # generate haplo object where haplotypes with a frequency
    # below min.freq are grouped as a category called "rare"
    myhaplo <- make.haplo.rare(myinfer, min.freq=0.05)

    mymodel <- haplo.quant(formula1=HDL~AGE+SBP+h.n1aa,
        formula2=HDL~AGE+SBP, pheno=pheno.dat, haplo=myhaplo, sim=10)

    # example using a variable for which to predict marginal means
    mymodel <- haplo.quant(formula1=HDL~AGE+SBP+factor(h.n1aa),
        formula2=HDL~AGE+SBP, pheno=pheno.dat, haplo=myhaplo, sim=10,
        predict_variable="h.n1aa")

    # example with a subsetting variable, looking at males only
    # and modelling a dominant haplotypic effect
    mymodel <- haplo.quant(formula1=HDL~AGE+SBP+h.n1aa,
        formula2=HDL~AGE+SBP, pheno=pheno.dat, haplo=myhaplo, sim=10,
        effect="dom", sub=expression(SEX==1))

haplo.surv    Haplotype analysis for survival data

Description

haplo.surv fits a series of Cox proportional hazards models to survival data with haplotypes, using a simulation-based approach to account for uncertainty in haplotype assignment when phase is unknown.

Usage

    haplo.surv(formula1, formula2, pheno, haplo, sim, effect = "add",
        sub = NULL)

Arguments

formula1    a symbolic description of the full model to be fit, including haplotype parameters. The response must be a survival object as returned by the Surv function.
formula2    a symbolic description of the nested model excluding haplotype parameters, to be compared to formula1 in a likelihood ratio test. The response must be a survival object as returned by the Surv function.
pheno    a dataframe containing phenotype data.
haplo    a haplotype object made by make.haplo.rare.
sim    the number of simulations from which to evaluate the results.
effect    the genetic effect type: "add" for additive, "dom" for dominant and "rec" for recessive. Defaults to additive. See Note.
sub    optional. An expression using a binary operator, representing a subset of individuals on which to perform analysis, e.g. sub=expression(sex==1).

Details

formula1 should be in the form response ~ predictor(s) + haplotype(s) and formula2 should be in the form response ~ predictor(s). A formula has an implied intercept term. See documentation for the formula function for more details of allowed formulae.
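The relationship between formula1 and formula2 can be illustrated for a single fit with the survival package alone. The sketch below is only an analogy, not SimHap code: it uses the lung data shipped with survival, and the covariate ph.ecog stands in for a haplotype term (an assumption made purely for illustration). It compares a full and a nested Cox model with a likelihood ratio test, which is the comparison haplo.surv is described as carrying out over its simulated data sets.

```r
library(survival)

# Complete cases only, so that both models are fit to the same subjects
dat <- na.omit(lung[, c("time", "status", "age", "sex", "ph.ecog")])

# "formula1": full model; ph.ecog plays the role of the haplotype term
full   <- coxph(Surv(time, status) ~ age + sex + ph.ecog, data = dat)
# "formula2": nested model without that term
nested <- coxph(Surv(time, status) ~ age + sex, data = dat)

# Likelihood ratio test; loglik[2] is the log partial likelihood at the
# fitted coefficients (loglik[1] is the value for the null model)
lrt_stat <- 2 * (full$loglik[2] - nested$loglik[2])
lrt_df   <- length(coef(full)) - length(coef(nested))
p_value  <- pchisq(lrt_stat, df = lrt_df, lower.tail = FALSE)

c(statistic = lrt_stat, df = lrt_df, p = p_value)
```

Both models must be fit to exactly the same subjects for the test to be valid, which is why missing values are dropped before either model is fit.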

Value

haplo.surv returns an object of class hapsurv. The summary function can be used to obtain and print a summary of the results. An object of class hapsurv is a list containing the following components:

formula1    formula1 passed to haplo.surv.
formula2    formula2 passed to haplo.surv.
results    a table containing the hazard ratios, confidence intervals and p-values of the parameter estimates, averaged over the n=sim models performed.
empiricalresults    a list containing the hazard ratios, confidence intervals and p-values calculated at each simulation.
loglik    the average log-likelihood for the n=sim models fit using formula1.
LRT    a likelihood ratio test, testing for significant improvement of the model when haplotypic parameters are included.
predicted    estimated marginal means of the outcome variable broken down by haplotype levels, evaluated at mean values of the model predictors, averaged over all simulations.
ANOVA    analysis of variance, comparing the two models fit with and without haplotypic parameters.
Wald    the Wald test for overall significance of the fitted model including haplotypes.
rsquared    r-squared values for models fit using formula1 and formula2.
effect    the haplotypic effect modelled: ADDITIVE, DOMINANT or RECESSIVE.

Note

To model a codominant haplotypic effect, define the desired haplotype as a factor in the formula1 argument, e.g. factor(h.aaa), and use the default option for effect.

Author(s)

Pamela A. McCaskie

References

Andersen, P., Gill, R. (1982) Cox's regression model for counting processes, a large sample study, Annals of Statistics, 10:
Little, R.J.A., Rubin, D.B. (2002) Statistical Analysis with Missing Data. John Wiley and Sons, New Jersey.
McCaskie, P.A., Carter, K.W., Hazelton, M., Palmer, L.J. (2007) SimHap: A comprehensive modeling framework for epidemiological outcomes and a multiple imputation approach to haplotypic analysis of population-based data, [online]
Rubin, D.B. (1996) Multiple imputation after 18+ years (with discussion). Journal of the American Statistical Association, 91:
Therneau, T., Grambsch, P., Fleming, T. Martingale based residuals for survival models, Biometrika, 77(1):

See Also

snp.surv, haplo.bin, haplo.quant, haplo.long

Examples

    data(SNPsurv.dat)

    # convert SNPsurv.dat to format required by infer.haplos
    survhaplo.dat <- SNP2Haplo(SNPsurv.dat)

    data(survpheno.dat)

    # generate haplotype frequencies and haplotype design matrix
    myinfer <- infer.haplos(survhaplo.dat)

    # print haplotype frequencies generated by infer.haplos
    myinfer$hap.freq

    # generate haplo object where haplotypes with a frequency
    # below min.freq are grouped as a category called "rare"
    myhaplo <- make.haplo.rare(myinfer, min.freq=0.05)

    mymodel <- haplo.surv(formula1=Surv(time, status)~age+h.v1aa,
        formula2=Surv(time, status)~age, haplo=myhaplo,
        pheno=survpheno.dat, sim=10)

    # example using a subsetting variable - looking at males only
    mymodel <- haplo.surv(formula1=Surv(time, status)~age+h.v1aa,
        formula2=Surv(time, status)~age, haplo=myhaplo,
        pheno=survpheno.dat, sim=10, sub=expression(sex==1))

infer.haplos    Infer haplotype configurations when phase is unknown

Description

infer.haplos generates a haplotype object to be used in association analysis.

Usage

    infer.haplos(geno)

Arguments

geno    a genotype data frame where each SNP is represented by two columns, one for each allele, in the form of haplo.dat.

Value

infer.haplos returns a list containing the following items:

hapmat    a dataframe containing all possible haplotype configurations with their respective likelihoods, for each individual.
hap.freq    haplotype frequencies estimated using the EM algorithm, and the standard errors of these frequencies.
initfreq    initial haplotype frequencies to be used by other SimHap functions.

Author(s)

Pamela A. McCaskie

References

Excoffier, L., Slatkin, M. (1995) Maximum-likelihood estimation of molecular haplotype frequencies in a diploid population. Molecular Biology and Evolution, 12(5):
McCaskie, P.A., Carter, K.W., Hazelton, M., Palmer, L.J. (2007) SimHap: A comprehensive modeling framework for epidemiological outcomes and a multiple imputation approach to haplotypic analysis of population-based data, [online]

See Also

make.haplo.rare

Examples

    data(SNP.dat)

    # convert SNP.dat to format required by infer.haplos
    haplo.dat <- SNP2Haplo(SNP.dat)

    data(pheno.dat)

    # generate haplotype frequencies and haplotype design matrix
    myinfer <- infer.haplos(haplo.dat)

    # print haplotype frequencies generated by infer.haplos
    myinfer$hap.freq

    # generate haplo object where haplotypes with a frequency
    # below min.freq are grouped as a category called "rare"
    myhaplo <- make.haplo.rare(myinfer, min.freq=0.05)

    mymodel <- haplo.quant(formula1=HDL~AGE+SBP+h.n1aa,
        formula2=HDL~AGE+SBP, pheno=pheno.dat, haplo=myhaplo, sim=10)
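To make the EM step behind hap.freq more concrete, here is a minimal base R sketch for the simplest possible case, two biallelic SNPs, in the spirit of Excoffier and Slatkin (1995). It is not the infer.haplos implementation: the function name em_two_snp, the 0/1/2 genotype coding and the allele labels are assumptions made for illustration, and no standard errors are computed.

```r
# Toy EM for two biallelic SNPs. g1 and g2 count copies (0, 1, 2) of the
# "A" allele at SNP 1 and the "B" allele at SNP 2. Only subjects that are
# heterozygous at both SNPs have ambiguous phase (AB/ab versus Ab/aB).
em_two_snp <- function(g1, g2, tol = 1e-10, max_iter = 500) {
  stopifnot(length(g1) == length(g2), all(g1 %in% 0:2), all(g2 %in% 0:2))
  haps  <- c("AB", "Ab", "aB", "ab")
  label <- function(a, b) paste0(ifelse(a, "A", "a"), ifelse(b, "B", "b"))

  amb <- g1 == 1 & g2 == 1
  # the two haplotypes of every phase-known subject
  known <- c(label(g1[!amb] >= 1, g2[!amb] >= 1),
             label(g1[!amb] == 2, g2[!amb] == 2))
  fixed <- as.numeric(table(factor(known, levels = haps)))
  n_amb <- sum(amb)

  p <- rep(0.25, 4)                      # starting frequencies
  for (iter in seq_len(max_iter)) {
    # E-step: posterior probability that a double heterozygote is AB/ab
    cis   <- p[1] * p[4]
    trans <- p[2] * p[3]
    w <- if (n_amb > 0) cis / (cis + trans) else 0
    # M-step: expected haplotype counts divided by 2n chromosomes
    p_new <- (fixed + n_amb * c(w, 1 - w, 1 - w, w)) / (2 * length(g1))
    converged <- max(abs(p_new - p)) < tol
    p <- p_new
    if (converged) break
  }
  setNames(p, haps)
}

# hypothetical genotypes for eight subjects
g1 <- c(2, 2, 0, 0, 1, 1, 1, 0)
g2 <- c(2, 2, 0, 0, 1, 1, 2, 1)
em_two_snp(g1, g2)
```

With more than two SNPs the number of compatible haplotype pairs per subject grows quickly, which is why a per-individual table of configurations and likelihoods such as hapmat is needed.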

infer.haplos.cc    Infer haplotype configuration independently in cases and controls

Description

infer.haplos.cc generates a haplotype object to be used in association analysis.

Usage

    infer.haplos.cc(geno, pheno, cc.var)

Arguments

geno    a genotype data frame where each SNP is represented by two columns, one for each allele, in the form of haplo.dat.
pheno    a data frame containing phenotype data with at least two columns: a subject identifier and an indicator of disease status.
cc.var    the column name of the parameter indicating disease status. Must be entered in quotation marks, e.g. "DISEASE".

Details

cc.var must be binary, taking only values 0 or 1.

Value

infer.haplos.cc returns a list containing the following items:

hapmat    a dataframe containing all possible haplotype configurations with their respective likelihoods, for each individual.
hap.freq.cases    haplotype frequencies among cases estimated using the EM algorithm, and the standard errors of these frequencies.
hap.freq.controls    haplotype frequencies among controls estimated using the EM algorithm, and the standard errors of these frequencies.
init.freq.cases    initial haplotype frequencies among cases to be used by other SimHap functions.
init.freq.controls    initial haplotype frequencies among controls to be used by other SimHap functions.

Note

infer.haplos.cc is to be used in place of infer.haplos when haplotypes and haplotype frequencies are to be inferred independently in cases and controls. geno and pheno should have individuals in the same order, with the subject identifier column in ascending order.

Author(s)

Pamela A. McCaskie

References

McCaskie, P.A., Carter, K.W., Hazelton, M., Palmer, L.J. (2007) SimHap: A comprehensive modeling framework for epidemiological outcomes and a multiple imputation approach to haplotypic analysis of population-based data, [online]
Stram, D.O., Leigh Pearce, C., Bretsky, P., Freedman, M., Hirschhorn, J.N., Altshuler, D., Kolonel, L.N., Henderson, B.E., Thomas, D.C. (2003) Modeling and EM Estimation of Haplotype-Specific Relative Risks from Genotype Data for a Case-Control Study of Unrelated Individuals, Human Heredity, 55:

See Also

infer.haplos, prepare.cc

Examples

    data(SNP.dat)

    # convert SNP.dat to format required by infer.haplos
    haplo.dat <- SNP2Haplo(SNP.dat)

    data(pheno.dat)

    newdata <- prepare.cc(geno=haplo.dat, pheno=pheno.dat, cc.var="DISEASE")
    newhaplo.dat <- newdata$geno
    newpheno.dat <- newdata$pheno

    # generate haplotype frequencies and haplotype design matrix
    myinfer <- infer.haplos.cc(geno=newhaplo.dat, pheno=newpheno.dat,
        cc.var="DISEASE")

    # print haplotype frequencies among cases
    myinfer$hap.freq.cases

    # print haplotype frequencies among controls
    myinfer$hap.freq.controls

    # generate haplo object where haplotypes with a frequency
    # below min.freq are grouped as a category called "rare"
    myhaplo <- make.haplo.rare(myinfer, min.freq=0.05)

    mymodel <- haplo.quant(formula1=HDL~AGE+SBP+h.n1aa,
        formula2=HDL~AGE+SBP, pheno=newpheno.dat, haplo=myhaplo, sim=10)
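The make.haplo.rare step in the example above pools every haplotype whose estimated frequency falls below min.freq into one "rare" category. The idea can be sketched in base R as follows; the frequency vector and the helper group_rare are hypothetical and do not reproduce the structure of the object that make.haplo.rare actually returns.

```r
# Hypothetical estimated haplotype frequencies (not SimHap output)
hap_freq <- c(AAG = 0.52, ACG = 0.31, GAG = 0.12, GCT = 0.03, ACT = 0.02)

# Pool all haplotypes with frequency below min_freq into a "rare" category
group_rare <- function(freq, min_freq) {
  is_rare <- freq < min_freq
  out <- freq[!is_rare]
  if (any(is_rare)) out <- c(out, rare = sum(freq[is_rare]))
  out
}

# Relabel individual haplotypes to match the pooled table
relabel_rare <- function(hap, freq, min_freq) {
  rare_names <- names(freq)[freq < min_freq]
  ifelse(hap %in% rare_names, "rare", hap)
}

grouped <- group_rare(hap_freq, min_freq = 0.05)
grouped

relabel_rare(c("AAG", "GCT", "ACT", "GAG"), hap_freq, min_freq = 0.05)
```

Pooling keeps the frequencies summing to one while avoiding separate model terms for haplotypes carried by only a handful of subjects.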

longpheno.dat    Example phenotypic, longitudinal data

Description

longpheno.dat is an example phenotypic data set containing biological measures to be used by snp.long and haplo.long.

Usage

    data(longpheno.dat)

Format

A data frame with 601 observations on the following 12 variables.

ID    patient identifier.
year    year of survey.
time    time point in years, where 1966=1.
sex    1=male, 0=female.
age    age in years.
height    height in metres.
weight    weight in kilograms.
bmi    body-mass index.
fev1f    forced expiratory volume in the first second, a measure of lung function.

Details

longpheno.dat was simulated to take the format of the Busselton Health Survey from Western Australia.

Examples

    data(SNPlong.dat)

    # transform SNPlong.dat to an object containing 3 columns
    # per SNP - additive, dominant and recessive, where genotypes
    # defined in baseline serve as the baseline genotypes
    longgeno.dat <- SNP2Geno(SNPlong.dat, baseline=c("aa", "GG", "V2V2"))

    data(longpheno.dat)

    mymodel <- snp.long(fixed=fev1f~snp_1_add, random=~1|ID,
        geno=longgeno.dat, pheno=longpheno.dat, form=~year|ID)

make.haplo.rare    Group rare haplotypes together

Description

make.haplo.rare groups together haplotypes with frequencies below a specified threshold and processes the data into a format compatible with the haplotype analysis functions.

Usage

    make.haplo.rare(infer.object, min.freq)

Arguments

infer.object    result of a call to infer.haplos.
min.freq    minimum frequency of haplotypes to include in analysis. Haplotypes with a frequency below this value will be grouped together in a group called rare.

Value

hapdata    a data frame containing all haplotype configurations and their posterior probabilities for each individual, grouping rare haplotypes into a category called rare.
hapobject    a list containing the original haplotype information for each individual as well as haplotype frequency tables.

Author(s)

Pamela A. McCaskie

References

McCaskie, P.A., Carter, K.W., Hazelton, M., Palmer, L.J. (2007) SimHap: A comprehensive modeling framework for epidemiological outcomes and a multiple imputation approach to haplotypic analysis of population-based data, [online]

See Also

infer.haplos

Examples

    data(SNP.dat)

    # convert SNP.dat to format required by infer.haplos
    haplo.dat <- SNP2Haplo(SNP.dat)

    data(pheno.dat)

    # generate haplotype frequencies and haplotype design matrix
    myinfer <- infer.haplos(haplo.dat)

    # print haplotype frequencies generated by infer.haplos
    myinfer$hap.freq

    # generate haplo object where haplotypes with a frequency
    # below min.freq are grouped as a category called "rare"
    myhaplo <- make.haplo.rare(myinfer, min.freq=0.05)

    mymodel <- haplo.quant(formula1=HDL~AGE+SBP+h.n1aa,
        formula2=HDL~AGE+SBP, pheno=pheno.dat, haplo=myhaplo, sim=10)

pheno.dat    Example phenotypic data

Description

pheno.dat is an example phenotypic data set containing biological measures to be used by snp.quant, haplo.quant, snp.bin and haplo.bin.

Usage

    data(pheno.dat)

Format

A data frame with 180 observations on the following 16 variables.

ID    patient identifiers.
SEX    1=male, 0=female.
AGE    age in years.
SBP    systolic blood pressure (mmHg).
DBP    diastolic blood pressure (mmHg).
BMI    body-mass index.
WHR    waist-hip ratio.
HDL    plasma high density lipoprotein (mmol/l).
LDL    plasma low density lipoprotein (mmol/l).
DIABETES    a binary indicator of history of type 2 diabetes.
FH_IHD    a binary indicator of family history of ischaemic heart disease.
PLAQUE    a binary indicator of the presence of 1 or more carotid plaques.
SMOKE    a binary indicator of smoking history (0=never smoked, 1=ever smoked).

PY    pack-years of smoking.
DISEASE    a binary indicator of ischaemic heart disease.
STRAT    a matching variable indicating the pairs of matched cases and controls.

Details

pheno.dat is a simulated data set of coronary heart disease related phenotypes.

Examples

    data(SNP.dat)

    # convert SNP.dat to format required by snp.quant
    geno.dat <- SNP2Geno(SNP.dat, baseline=c("mm", "11", "GG", "CC"))

    data(pheno.dat)

    mymodel <- snp.quant(formula1=LDL~AGE+SBP+factor(snp_1_add),
        formula2=LDL~AGE+SBP, geno=geno.dat, pheno=pheno.dat)

    # example with a subsetting variable, looking at males only
    mymodel <- snp.quant(formula1=LDL~AGE+SBP+factor(snp_1_add),
        formula2=LDL~AGE+SBP, geno=geno.dat, pheno=pheno.dat,
        sub=expression(SEX==1))

prepare.cc    Prepare case-control data for inferring haplotypes

Description

prepare.cc prepares case-control data when there may be missing values in the case status variable. This eliminates problems when using infer.haplos.cc.

Usage

    prepare.cc(geno, pheno, cc.var)

Arguments

geno    a genotype data frame where each SNP is represented by two columns, one for each allele, in the form of haplo.dat.
pheno    a data frame containing phenotype data with at least two columns: a subject identifier and an indicator of disease status.
cc.var    the column name of the parameter indicating disease status. Must be entered in quotation marks, e.g. "DISEASE".
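The kind of preparation described here can be sketched in base R. What prepare.cc does internally is not documented in this section, so the helper below (clean_cc, a hypothetical name) only illustrates the stated requirements: subjects with missing case status are dropped from both data frames, the status variable is checked to be 0/1, and both data frames are kept in the same ascending order of subject identifier. The toy data, including an ID column in the genotype data, are assumptions.

```r
# Hypothetical toy data: one SNP stored as two allele columns
pheno <- data.frame(ID = c(3, 1, 2, 4), DISEASE = c(1, NA, 0, 1))
geno  <- data.frame(ID = c(3, 1, 2, 4),
                    SNP1.a1 = c("A", "A", "G", "G"),
                    SNP1.a2 = c("G", "A", "G", "A"))

clean_cc <- function(geno, pheno, cc.var, id.var = "ID") {
  # both data frames must describe the same subjects in the same order
  stopifnot(identical(geno[[id.var]], pheno[[id.var]]))

  # drop subjects whose case status is missing, from both data frames
  keep  <- !is.na(pheno[[cc.var]])
  pheno <- pheno[keep, , drop = FALSE]
  geno  <- geno[keep, , drop = FALSE]

  # case status must be binary, coded 0 or 1
  stopifnot(all(pheno[[cc.var]] %in% c(0, 1)))

  # sort both by ascending subject identifier
  ord <- order(pheno[[id.var]])
  list(geno = geno[ord, , drop = FALSE], pheno = pheno[ord, , drop = FALSE])
}

out <- clean_cc(geno, pheno, cc.var = "DISEASE")
out$pheno
out$geno
```

Filtering and sorting both data frames with the same index vectors is what keeps each row of the genotype data matched to the correct row of the phenotype data.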
