The gss Package. September 25, Description A comprehensive package for structural multivariate function estimation using smoothing splines.

Size: px

Start display at page:

Download "The gss Package. September 25, Description A comprehensive package for structural multivariate function estimation using smoothing splines."

Berenice Flowers
6 years ago
Views:

1 The gss Package September 25, 2004 Version Depends R (>= 1.7.0) Title General Smoothing Splines Author Chong Gu <chong@stat.purdue.edu> Maintainer Chong Gu <chong@stat.purdue.edu> A comprehensive package for structural multivariate function estimation using smoothing splines. License GPL R topics documented: LakeAcidity aids bacteriuria buffalo cdssden drkpk dssden family fitted.ssanova gastric gauss.quad gssanova gssanova hzdrate.sshzd mkfun.poly mkfun.tp mkran mkrk.factor mkterm.ssanova nlm nox ozone predict.ssanova print.ssanova

2 2 LakeAcidity project rkpk rkpk smolyak ssanova ssanova ssden sshzd stan summary.gssanova summary.gssanova summary.ssanova wesdr Index 47 LakeAcidity Water Acidity in Lakes Data extracted from the Eastern Lake Survey of 1984 conducted by the United States Environmental Protection Agency, concerning 112 lakes in the Blue Ridge. data(lakeacidity) Format A list containing 112 observations on the following variables. ph cal lat lon geog Surface ph. Calcium concentration. Latitude. Longitude. Geographic location, derived from lat and lon Details geog was generated from lat and lon using the code given in the Example section. Source Douglas, A. and Delampady, M. (1990), Eastern Lake Survey Phase I: Documentation for the Data Base and the Derived Data sets. Tech Report 160 (SIMS), Dept. Statistics, University of British Columbia. References Gu, C. and Wahba, G. (1993), Semiparametric analysis of variance with tensor product thin plate splines. Journal of the Royal Statistical Society Ser. B, 55,

3 bacteriuria 3 Examples ## Converting latitude and longitude to x-y coordinates ## Not run: convert <- function(lat, lon) { lat <- lat/180*pi lon <- lon/180*pi m.lat <- (max(lat)+min(lat))/2 m.lon <- (max(lon)+min(lon))/2 x <- cos(m.lat)*sin(m.lon-lon) y <- sin(lat-m.lat) cbind(x,y) } data(lakeacidity) convert(lakeacidity$lat,lakeacidity$lon) ## Clean up rm(lakeacidity,convert) ## End(Not run) aids AIDS Incubation A data set collected by Centers for Disease Control and Prevention concerning AIDS patients who were infected with the HIV virus through blood transfusion. data(aids) Format A data frame containing 295 observations on the following variables. incu Time from HIV infection to AIDS diagnosis. infe Time from HIV infection to end of data collection (July 1986). age Age at time of blood transfusion. Source Wang, M.-C. (1989), A semiparametric model for randomly truncated data. Journal of the American Statistical Association, 84, bacteriuria Treatment of Bacteriuria Bacteriuria patients were randomly assigned to two treatment groups. Weekly binary indicator of bacteriuria was recorded for every patient over 4 to 16 weeks. A total of 72 patients were represented in the data, with 36 each in the two treatment groups.

4 4 buffalo data(bacteriuria) Format A data frame containing 820 observations on the following variables. id trt time infect Identification of patients, a factor. Treatments 1 or 2, a factor. Weeks after randomization. Binary indicator of bacteriuria (bacteria in urine). Source Joe, H. (1997), Multivariate Models and Dependence Concepts. Hall. London: Chapman and References Gu and Ma (2003b), Generalized Nonparametric Mixed-Effect Models: Computation and Smoothing Parameter Selection. Available at html. buffalo Buffalo Annual Snowfall Annual snowfall accumulations in Buffalo, NY from 1910 to data(buffalo) Format A vector of 64 numerical values. Source Scott, D. W. (1985), Average shifted histograms: Effective nonparametric density estimators in several dimensions. The Annals of Statistics, 13,

5 cdssden 5 cdssden Evaluating Conditional PDF, CDF, and Quantiles of Smoothing Spline Density Estimates Evaluate conditional pdf, cdf, and quantiles for smoothing spline density estimates. cdssden(object, x, cond, int) cpssden(object, q, cond, int) cqssden(object, p, cond, int) object x cond int q p Object of class "ssden". Data frame or vector of points on which conditional density is to be evaluated. One row data frame of conditioning variables. Normalizing constant. Vector of points on which conditional cdf is to be evaluated. Vector of probabilities for which conditional quantiles are to be calculated. Details The argument x in cdssden is of the same form as the argument newdata in predict.lm, but can take a vector for 1-D conditional densities. cpssden and cqssden naturally only work for 1-D conditional densities. Value cdssden returns a list object with the following components. pdf int Vector of conditional pdf. Normalizing constant. cpssden and cpssden return a vector of conditional cdf or quantiles. Note cpssden and cqssden can be very slow. See Also ssden and dssden.

6 6 drkpk drkpk Numerical Engine for ssden and sshzd Calculate penalized likelihood density estimates and hazard estimates via the Newton iteration and evaluate the cross-validation score, as implemented in the RATFOR routine dnewton.r and hzdnewton.r, and minimize the cross-validation score using nlm and nlm0. sspdsty(s,r,q,cnt,qd.s,qd.r,qd.wt,prec,maxiter,alpha) mspdsty(s,r,q,cnt,qd.s,qd.r,qd.wt,prec,maxiter,alpha) msphzd(s,r,q,nobs,cnt,qd.s,qd.r,qd.wt,prec,maxiter,alpha) s r q Nobs cnt qd.s qd.r qd.wt prec maxiter alpha Unpenalized terms evaluated at data points. Basis of penalized terms evaluated at data points. Penalty matrix. Total number of lifetime observations. Bin-counts for histogram data. Unpenalized terms evaluated at quadrature nodes. Basis of penalized terms evaluated at quadrature nodes. Quadrature weights. Precision requirement for internal iterations. Maximum number of iterations allowed for internal iterations. Parameter defining cross-validation score for smoothing parameter selection. Details sspdsty is used by ssden to compute cross-validated density estimate with a single smoothing parameter. mspdsty is used by ssden to compute cross-validated density estimate with multiple smoothing parameters. msphzd is used by sshzd to compute cross-validated hazard estimate with single or multiple smoothing parameters. References Gu, C. and Wang, J. (2002), Penalized Likelihood Density Estimation: Direct Cross- Validation and Scalable Approximation. Available at stat.purdue.edu/~chong/manu. html. Gu, C. (2002), Smoothing Spline ANOVA Models. New York: Springer-Verlag.

7 dssden 7 dssden Evaluating PDF, CDF, and Quantiles of Smoothing Spline Density Estimates Evaluate pdf, cdf, and quantiles for smoothing spline density estimates. dssden(object, x) pssden(object, q) qssden(object, p) object x q p Object of class "ssden". Data frame or vector of points on which density is to be evaluated. Vector of points on which cdf is to be evaluated. Vector of probabilities for which quantiles are to be calculated. Details The argument x in dssden is of the same form as the argument newdata in predict.lm, but can take a vector for 1-D densities. pssden and qssden naturally only work for 1-D densities. Value A vector of pdf, cdf, or quantiles. See Also ssden and cdssden. family Utility Functions for Error Families Utility functions for fitting Smoothing Spline ANOVA models with non-gaussian responses.

8 8 family mkdata.binomial(y, eta, wt, offset) dev.resid.binomial(y, eta, wt) dev.null.binomial(y, wt, offset) cv.binomial(y, eta, wt, hat, alpha) y0.binomial(y, eta0, wt) proj0.binomial(y0, eta, offset) kl.binomial(eta0, eta1, wt) cfit.binomial(y, wt, offset) mkdata.poisson(y, eta, wt, offset) dev.resid.poisson(y, eta, wt) dev.null.poisson(y, wt, offset) cv.poisson(y, eta, wt, hat, alpha, sr, q) y0.poisson(eta0) proj0.poisson(y0, eta, wt, offset) kl.poisson(eta0, eta1, wt) cfit.poisson(y, wt, offset) mkdata.gamma(y, eta, wt, offset) dev.resid.gamma(y, eta, wt) dev.null.gamma(y, wt, offset) cv.gamma(y, eta, wt, hat, rss, alpha) y0.gamma(eta0) proj0.gamma(y0, eta, wt, offset) kl.gamma(eta0, eta1, wt) cfit.gamma(y, wt, offset) mkdata.inverse.gaussian(y, eta, wt, offset) dev.resid.inverse.gaussian(y, eta, wt) dev.null.inverse.gaussian(y, wt, offset) mkdata.nbinomial(y, eta, wt, offset, nu) dev.resid.nbinomial(y, eta, wt) dev.null.nbinomial(y, wt, offset) cv.nbinomial(y, eta, wt, hat, alpha) y0.nbinomial(y,eta0,nu) proj0.nbinomial(y0, eta, wt, offset) kl.nbinomial(eta0, eta1, wt, nu) cfit.nbinomial(y, wt, offset, nu) mkdata.weibull(y, eta, wt, offset, nu) dev.resid.weibull(y, eta, wt, nu) dev.null.weibull(y, wt, offset, nu) cv.weibull(y, eta, wt, hat, nu, alpha) y0.weibull(y, eta0, nu) proj0.weibull(y0, eta, wt, offset, nu) kl.weibull(eta0, eta1, wt, nu, int) cfit.weibull(y, wt, offset, nu) mkdata.lognorm(y, eta, wt, offset, nu) dev.resid.lognorm(y, eta, wt, nu)

9 fitted.ssanova 9 dev0.resid.lognorm(y, eta, wt, nu) dev.null.lognorm(y, wt, offset, nu) cv.lognorm(y, eta, wt, hat, nu, alpha) y0.lognorm(y, eta0, nu) proj0.lognorm(y0, eta, wt, offset, nu) kl.lognorm(eta0, eta1, wt, nu, y0) cfit.lognorm(y, wt, offset, nu) mkdata.loglogis(y, eta, wt, offset, nu) dev.resid.loglogis(y, eta, wt, nu) dev0.resid.loglogis(y, eta, wt, nu) dev.null.loglogis(y, wt, offset, nu) cv.loglogis(y, eta, wt, hat, nu, alpha) y0.loglogis(y, eta0, nu) proj0.loglogis(y0, eta, wt, offset, nu) kl.loglogis(eta0, eta1, wt, nu, y0) cfit.loglogis(y, wt, offset, nu) y eta wt offset nu Model response. Fitted values on link scale. Model weights. Model offset. Size for nbinomial. Inverse scale for log life time. Details These are not to be called by the user. mkdata.x create the pseudo data to be used in iterated penalized least squares fitting. dev.resid.x calculate the deviance residuals. dev.null.x calculate the deviance of the constant null model. See Also gssanova. fitted.ssanova Fitted Values and Residuals from Smoothing Spline ANOVA Fits Methods for extracting fitted values and residuals from smoothing spline ANOVA fits. fitted.ssanova(object,...) residuals.ssanova(object,...) fitted.gssanova(object,...) residuals.gssanova(object, type="working",...)

10 10 gauss.quad object type Details... Ignored. Object of class "ssanova" or "gssanova". Type of residuals desired, with two alternatives "working" (default) or "deviance". The fitted values for "gssanova" objects are on the link scale, the same scale the "working" residuals are on. gastric Gastric Cancer Data Survival of gastric cancer patients under chemotherapy and chemotherapy-radiotherapy combination. data(gastric) Format A data frame containing 90 observations on the following variables. futime status trt Follow-up time, in days. Censoring status. Factor indicating the treatments: 1 chemothrapy, 2 combination. Source Moreau, T., O Quigley, J., and Mesbah, M. (1985), A global goodness-of-fit statistic for the proportional hazards model. Applied Statistics, 34, gauss.quad Generating Gauss-Legendre Quadrature Generate Gauss-Legendre quadratures using the FORTRAN routine gaussq.f found at NETLIB. gauss.quad(size,interval)

11 gssanova 11 size interval Size of quadrature. Interval to be covered. Value gauss.quad returns a list object with the following components. pt wt Quadrature nodes. Quadrature weights. gssanova Fitting Smoothing Spline ANOVA Models with Non Gaussian Responses Fit smoothing spline ANOVA models to responses from selected exponential families with cubic spline, linear spline, or thin-plate spline marginals for numerical variables. Factors are also accepted. The symbolic model specification via formula follows the same rules as in lm and glm. gssanova(formula, family, type="cubic", data=list(), weights, subset, offset, na.action=na.omit, partial=null, method=null, varht=1, nu=null, prec=1e-7, maxiter=30, ext=.05, order=2) formula family type data weights subset Symbolic description of the model to be fit. of the error distribution. Supported are exponential families "binomial", "poisson", "Gamma", "inverse.gaussian", and "nbinomial". Also supported are accelerated life model families "weibull", "lognorm", and "loglogis". Type of numerical marginals to be used. Supported are type="cubic" for cubic spline marginals, type="linear" for linear spline marginals, and type="tp" for thin-plate spline marginals. Optional data frame containing the variables in the model. Optional vector of weights to be used in the fitting process. Optional vector specifying a subset of observations to be used in the fitting process. offset Optional offset term with known parameter 1. na.action partial Function which indicates what should happen when the data contain NAs. Optional extra unpenalized terms in partial spline models. method Score used to drive the performance-oriented iteration. Supported are method="v" for GCV, method="m" for GML, and method="u" for Mallow s CL.

12 12 gssanova varht Dispersion parameter needed for method="u". Ignored when method="v" or method="m" are specified. nu Inverse scale parameter in accelerated life model families. Ignored for exponential families. prec maxiter ext order Details Value Note Precision requirement for the iterations. Maximum number of iterations allowed for performance-oriented iteration, and for inner-loop multiple smoothing parameter selection when applicable. For cubic spline and linear spline marginals, this option specifies how far to extend the domain beyond the minimum and the maximum as a percentage of the range. The default ext=.05 specifies marginal domains of lengths 110 percent of their respective ranges. Prediction outside of the domain will result in an error. Ignored if type="tp" is specified. For thin-plate spline marginals, this option specifies the order of the marginal penalties. Ignored if type="cubic" or type="linear" are specified. The models are fitted by penalized likelihood method through the performance-oriented iteration, as described in the reference cited below. Only one link is implemented for each family. It is the logit link for "binomial", and the log link for "poisson", "Gamma", and "inverse.gaussian". For "nbinomial", the working parameter is the logit of the probability p; see NegBinomial. For "weibull", "lognorm", and "loglogis", it is the location parameter for the log life time. For family="binomial", "poisson", "nbinomial", "weibull", "lognorm", and "loglogis", the score driving the performance-oriented iteration defaults to method="u" with varht=1. For family="gamma" and "inverse.gaussian", the default is method="v". See ssanova for details and notes concerning smoothing spline ANOVA models. gssanova returns a list object of class "gssanova" which inherits from the class "ssanova". The method summary is used to obtain summaries of the fits. The method predict can be used to evaluate the fits at arbitrary points, along with the standard errors to be used in Bayesian confidence intervals, both on the scale of the link. The methods residuals and fitted.values extract the respective traits from the fits. For family="binomial", the response can be specified either as two columns of counts or as a column of sample proportions plus a column of total counts entered through the argument weights, as in glm. For family="nbinomial", the response may be specified as two columns with the second being the known sizes, or simply as a single column with the common unknown size to be estimated through the maximum likelihood method. For family="weibull", "lognorm", or "loglogis", the response consists of three columns, with the first giving the follow-up time, the second the censoring status, and the third the left-truncation time. For data with no truncation, the third column can be omitted.

13 gssanova1 13 Author(s) Chong Gu, References Gu, C. (1992), Cross-validating non Gaussian data. Journal of Computational and Graphical Statistics, 1, See Also Methods predict.ssanova, summary.gssanova, and fitted.gssanova. Examples ## Fit a cubic smoothing spline logistic regression model test <- function(x) {.3*(1e6*(x^11*(1-x)^6)+1e4*(x^3*(1-x)^10))-2} x <- (0:100)/100 p <- 1-1/(1+exp(test(x))) y <- rbinom(x,3,p) logit.fit <- gssanova(cbind(y,3-y)~x,family="binomial") ## The same fit logit.fit1 <- gssanova(y/3~x,"binomial",weights=rep(3,101)) ## Obtain estimates and standard errors on a grid est <- predict(logit.fit,data.frame(x=x),se=true) ## Plot the fit and the Bayesian confidence intervals plot(x,y/3,ylab="p") lines(x,p,col=1) lines(x,1-1/(1+exp(est$fit)),col=2) lines(x,1-1/(1+exp(est$fit+1.96*est$se)),col=3) lines(x,1-1/(1+exp(est$fit-1.96*est$se)),col=3) ## Clean up ## Not run: rm(test,x,p,y,logit.fit,logit.fit1,est) dev.off() ## End(Not run) gssanova1 Fitting Smoothing Spline ANOVA Models with Non Gaussian Responses Fit smoothing spline ANOVA models to responses from selected exponential families with cubic spline, linear spline, or thin-plate spline marginals for numerical variables. Factors are also accepted. The symbolic model specification via formula follows the same rules as in lm and glm. gssanova1(formula, family, type="cubic", data=list(), weights, subset, offset, na.action=na.omit, partial=null, alpha=null, nu=null, id.basis=null, nbasis=null, seed=null, random=null, ext=.05,order=2)

14 14 gssanova1 formula family type data weights subset Symbolic description of the model to be fit. of the error distribution. Supported are exponential families "binomial", "poisson", "Gamma", and "nbinomial". Also supported are accelerated life model families "weibull", "lognorm", and "loglogis". Type of numerical marginals to be used. Supported are type="cubic" for cubic spline marginals, type="linear" for linear spline marginals, and type="tp" for thin-plate spline marginals. Optional data frame containing the variables in the model. Optional vector of weights to be used in the fitting process. Optional vector specifying a subset of observations to be used in the fitting process. offset Optional offset term with known parameter 1. na.action partial alpha nu id.basis nbasis seed random ext order Details Function which indicates what should happen when the data contain NAs. Optional extra unpenalized terms in partial spline models. Tuning parameter defining cross-validation; larger values yield smoother fits. Defaults are alpha=1 for family="binomial" and alpha=1.4 for family="poisson" and family="gamma". Input for future support of accelerated life model families. Index designating selected knots. Number of knots to be selected. Ignored when id.basis is supplied. Seed for reproducible random selection of knots. Ignored when id.basis is supplied. Input for parametric random effects in nonparametric mixed-effect models. See mkran for details. For cubic spline and linear spline marginals, this option specifies how far to extend the domain beyond the minimum and the maximum as a percentage of the range. The default ext=.05 specifies marginal domains of lengths 110 percent of their respective ranges. Prediction outside of the domain will result in an error. Ignored if type="tp" is specified. For thin-plate spline marginals, this option specifies the order of the marginal penalties. Ignored if type="cubic" or type="linear" are specified. gssanova1 implements an alternative approach to the fitting of gssanova models. It works in some q-dimensional function space determined by a set of knots typically through random selection; the default dimension (nbasis) is set to be q = 10n ( 2/9), adequate for cubic splines. The algorithms are of the order O(nq 2 ), scaling much better than the O(n 3 ) of ssanova; the memory requirements are O(nq) versus O(n 2 ). gssanova1 adopts direct cross-validation instead of the indirect cross-validation of gssanova. Currently, only three families are supported, "binomial", "poisson", and "Gamma". Other families of gssanova shall be added in the future. Only one link is implemented for each family. It is the logit link for "binomial", and the log link for "poisson", and "Gamma". See ssanova for details and notes concerning smoothing spline ANOVA models.

15 gssanova1 15 Value Note gssanova1 returns a list object of class "gssanova1" which inherits from the classes "ssanova1", "gssanova", and "ssanova". The method summary is used to obtain summaries of the fits. The method predict can be used to evaluate the fits at arbitrary points, along with the standard errors to be used in Bayesian confidence intervals, both on the scale of the link. The methods residuals and fitted.values extract the respective traits from the fits. For family="binomial", the response can be specified either as two columns of counts or as a column of sample proportions plus a column of total counts entered through the argument weights, as in glm. Author(s) Chong Gu, chong@stat.purdue.edu References See Also gssanova and methods predict.ssanova1, summary.gssanova1, and fitted.gssanova. Examples ## Fit a cubic smoothing spline logistic regression model test <- function(x) {.3*(1e6*(x^11*(1-x)^6)+1e4*(x^3*(1-x)^10))-2} x <- (0:100)/100 p <- 1-1/(1+exp(test(x))) y <- rbinom(x,3,p) logit.fit <- gssanova1(cbind(y,3-y)~x,family="binomial") ## The same fit logit.fit1 <- gssanova1(y/3~x,"binomial",weights=rep(3,101), id.basis=logit.fit$id.basis) ## Obtain estimates and standard errors on a grid est <- predict(logit.fit,data.frame(x=x),se=true) ## Plot the fit and the Bayesian confidence intervals plot(x,y/3,ylab="p") lines(x,p,col=1) lines(x,1-1/(1+exp(est$fit)),col=2) lines(x,1-1/(1+exp(est$fit+1.96*est$se)),col=3) lines(x,1-1/(1+exp(est$fit-1.96*est$se)),col=3) ## Fit a mixed-effect logistic model data(bacteriuria) bact.fit <- gssanova1(infect~trt+time,family="binomial",data=bacteriuria, id.basis=(1:820)[bacteriuria$id%in%c(3,38)],random=~1 id) ## Predict fixed effects predict(bact.fit,data.frame(time=2:16,trt=as.factor(rep(1,15))),se=true) ## Estimated random effects bact.fit$b

16 16 hzdrate.sshzd ## Clean up ## Not run: rm(test,x,p,y,logit.fit,logit.fit1,est,bacteriuria,bact.fit) dev.off() ## End(Not run) hzdrate.sshzd Evaluating Smoothing Spline Hazard Estimates Evaluate smoothing spline hazard estimates by sshzd. hzdrate.sshzd(object, x, se=false) hzdcurve.sshzd(object, time, covariates=null, se=false) survexp.sshzd(object, time, covariates=null, start=0) object x se time covariates start Object of class "sshzd". Data frame or vector of points on which hazard is to be evaluated. Flag indicating if standard errors are required. Vector of time points. Vector of covariate values. Optional starting times of the intervals. Value For se=false, hzdrate.sshzd returns a vector of hazard evaluations. For se=true, hzdrate.sshzd returns a list consisting of the following components. fit se.fit fit se.fit Vector of hazard evaluations. Vector of standard errors for log hazard. Vector or columns of hazard curve(s). Vector or columns of standard errors for log hazard curve(s). survexp.sshzd returns a vector of expected survivals, which are based on the cumulative hazards over the specified intervals. See Also sshzd.

17 mkfun.poly 17 mkfun.poly Crafting Building Blocks for Polynomial Splines Craft numerical functions to be used by mkterm.linear, mkterm.cubic to assemble model terms. mkrk.linear(range) mkrk.cubic(range) mkphi.cubic(range) range Numerical vector whose minimum and maximum specify the range on which the function to be crafted is defined. Details These are not to be called by the user. Value A list of two components. fun env Function definition. Portable local constants derived from the argument. Note mkrk.x create a bivariate function fun(x,y,env,outer=false), where x, y are real arguments and local constants can be passed in through env. mkphi.cubic creates a univariate function fun(x,nu=1,env). See Also mkfun.tp, mkrk.factor.

18 18 mkfun.tp mkfun.tp Crafting Building Blocks for Thin-Plate Splines Craft numerical functions to be used by mkterm.tp to assemble model terms. mkrk.tp(dm,order,mesh,weight) mkphi.tp(dm,order,mesh,weight) mkrk.tp.p(dm,order) mkphi.tp.p(dm,order) dm Dimension of the variable d. order Order of the differential operator m. mesh weight Details Value Normalizing mesh. Normalizing weights. These are not to be called by the user. Thin-plate splines are defined for 2m > d. mkrk.tp.p generates the pseudo kernel, and mkphi.tp.p generates the (m+d 1)!/d!/(m 1)! lower order polynomials with total order less than m. mkphi.tp generates normalized lower order polynomials orthonormal w.r.t. a norm specified by mesh and weight as described in the reference, and mkrk.tp conditions the pseudo kernel to generate the reproducing kernel orthogonal to the lower order polynomials w.r.t. the norm. A list of two components. fun env Function definition. Portable local constants derived from the arguments. Note mkrk.tp creates a bivariate function fun(x,y,env,outer=false), where x, y are real arguments and local constants can be passed in through env. mkphi.tp creates a collection of univariate functions fun(x,nu,env), where x is the argument and nu is the index. References Gu, C. and Wahba, G. (1993), Semiparametric analysis of variance with tensor product thin plate splines. Journal of the Royal Statistical Society Ser. B, 55,

19 mkran 19 See Also mkfun.poly, mkrk.factor. mkran Generating Random Effects in Mixed-Effect Models Generate entries representing random effects in mixed-effect models. mkran(formula, data) formula data Symbolic description of the random effects. Data frame containing the variables in the model. Details Value This function generates random effects terms from simple grouping variables, for use in nonparametric mixed-effect models as described in Gu and Ma (2003a, b). The syntax of the formula resembles that of similar utilities for linear and nonlinear mixed-effect models, as described in Pinheiro and Bates (2000). Currently, mkran takes only two kinds of formulas, ~1 grp2 or ~grp1 grp2. Both grp1 and grp2 should be factors, and for the second formula, the levels of grp2 should be nested under those of grp1. The Z matrix is determined by grp2. When observations are ordered according to the levels of grp2, the Z matrix is block diagonal of 1 vectors. The Sigma matrix is diagonal. For ~1 grp2, it has one tuning parameter. For ~grp1 grp2, the number of parameters equals the number of levels of grp1, with each parameter shared by the grp2 levels nested under the same grp1 level. A list of three components. z sigma init Z matrix. Sigma matrix to be evaluated through sigma$fun(para,sigma$env). Initial parameter values. Note One may pass a formula or a list to the argument random in calls to ssanova1 orgssanova1 to fit nonparametric mixed-effect models. A formula will be converted to a list using mkran. A list should be of the same form as the value of mkran. Author(s) Chong Gu, chong@stat.purdue.edu

20 20 mkrk.factor References Gu and Ma (2003a), Optimal Smoothing in Nonparametric Mixed-Effect Models. Available at Gu and Ma (2003b), Generalized Nonparametric Mixed-Effect Models: Computation and Smoothing Parameter Selection. Available at html. Pinheiro and Bates (2000), Mixed-Effects Models in S and S-PLUS. New York: Springer- Verlag. Examples ## Toy data test <- data.frame(grp=as.factor(rep(1:2,c(2,3)))) ## First formula ran.test <- mkran(~1 grp,test) ran.test$z ran.test$sigma$fun(2,ran.test$sigma$env) # diag(10^(-2),2) ## Second formula ran.test <- mkran(~grp grp,test) ran.test$z ran.test$sigma$fun(c(1,2),ran.test$sigma$env) # diag(10^(-1),10^(-2)) ## Clean up ## Not run: rm(test,ran.test) mkrk.factor Crafting Building Blocks for Discrete Splines Craft numerical functions to be used by mkterm.x to assemble model terms involving factors. mkrk.nominal(levels) mkrk.ordinal(levels) levels Levels of the factor. Details These are not to be called by the user. For a nominal factor with levels 1, 2,..., k, the level means f(i) will be shrunk towards each other through a penalty proportional to where f(.) = (f(1) f(k))/k. (f(1) f(.)) (f(k) f(.)) 2 For a ordinal factor with levels 1 < 2 <... < k, the level means f(i) will be shrunk towards each other through a penalty proportional to (f(1) f(2)) (f(k 1) f(k)) 2

21 mkterm.ssanova 21 Value A list of two components. fun env Function definition. Portable local constants derived from the arguments. Note mkrk.x create a bivariate function fun(x,y,env,outer=false), where x, y are real arguments and local constants can be passed in through env. See Also mkterm.ssanova, mkfun.poly, mkfun.tp. mkterm.ssanova Assembling Model Terms for Smoothing Spline ANOVA Models Assemble numerical functions for calculating model terms in a Smoothing Spline ANOVA Model. mkterm.linear(mf, ext) mkterm.linear1(mf, range) mkterm.cubic(mf, ext) mkterm.cubic1(mf, range) mkterm.tp(mf, order, mesh, weight) mf ext range order mesh weight Model frame of the model formula. Size of the buffer zone beyond the data range. Data frame specifying domain range. Order of the differential operator. Normalizing mesh. Normalizing weights. Details These are not to be called by the user. For polynomial splines, ext specifies how far to go beyond the data range percentage wise. For example, if the minimum and maximum values of a variable in mf is 0 and 1 and ext=.05, then the marginal domain on which the model is defined would be [.95, 1.05]. See mkfun.tp for order, mesh, weight.

22 22 nox Value A list object with a component labels containing the labels of all model terms. For each of the model terms, there is a component holding the numerical functions for calculating the unpenalized and penalized parts within the term. Note The numerical functions are assembled using building blocks crafted by mkfun.poly, mkfun.tp, mkrk.factor. nlm0 Minimizing Univariate Functions on Finite Intervals Minimize univariate functions on finite intervals using 3-point quadratic fit, with goldensection safe-guard. nlm0(fun,range,prec=1e-7) fun range prec Function to be minimized. Interval on which the function to be minimized. Desired precision of the solution. Value nlm0 returns a list object with the following components. estimate minimum evaluations Minimizer. Minimum. Number of function evaluations. nox NOx in Engine Exhaust Data from an experiment in which a single-cylinder engine was run with ethanol to see how the NOx concentration in the exhaust depended on the compression ratio and the equivalence ratio. data(nox) Format A data frame containing 88 observations on the following variables.

23 ozone 23 nox comp equi NOx concentration in exhaust. Compression ratio. Equivalence ratio. Source Brinkman, N. D. (1981), Ethanol fuel a single-cylinder engine study of efficiency and exhaust emissions. SAE Transactions, 90, References Cleveland, W. S. and Devlin, S. J. (1988), Locally weighted regression: An approach to regression analysis by local fitting. Journal of the American Statistical Association, 83, Breiman, L. (1991), The pi method for estimating multivariate functions from noisy data. Technometrics, 33, ozone Ozone Concentration in Los Angeles Basin Daily measurements of ozone concentration and eight meteorological quantities in the Los Angeles basin for 330 days of data(ozone) Format A data frame containing 330 observations on the following variables. upo3 Upland ozone concentration, in ppm. vdht Vandenberg 500 millibar height, in meters. wdsp Wind speed, in miles per hour. hmdt Humidity. sbtp Sandburg Air Base temperature, in Celsius. ibht Inversion base height, in foot. dgpg Dagget pressure gradient, in mmhg. ibtp Inversion base temperature, in Fahrenheit. vsty Visibility, in miles. day Calendar day, between 1 and 366. Source Unknown. References Breiman, L. and Friedman, J. H. (1985), Estimating optimal transformations for multiple regression and correlation. Journal of the American Statistical Association, 80,

24 24 predict.ssanova Hastie, T. and Tibshirani, R. (1990), Generalized Additive Models. Chapman and Hall. predict.ssanova Predicting from Smoothing Spline ANOVA Fits Evaluate terms in a smoothing spline ANOVA fit at arbitrary points. Standard errors of the terms can be requested for use in constructing Bayesian confidence intervals. predict.ssanova(object, newdata, se.fit=false, include=object$terms$labels,...) predict.ssanova1(object, newdata, se.fit=false, include=object$terms$labels,...) Value object newdata se.fit include... Ignored. Object of class inheriting from "ssanova". Data frame or model frame in which to predict. Flag indicating if standard errors are required. List of model terms to be included in the prediction. The partial and offset terms, if present, are to be specified by "partial" and "offset", respectively. For se.fit=false, predict.ssanova returns a vector of the evaluated fit. For se.fit=true, predict.ssanova returns a list consisting of the following components. fit se.fit Vector of evaluated fit. Vector of standard errors. Note To supply the partial terms for partial spline models, add a component partial=i(...) in newdata; the as is function I(...) is necessary when partial has more than one column. For mixed-effect models through ssanova1 or gssanova1, the Z matrix is set to 0 if not supplied. To supply the Z matrix, add a component random=i(...) in newdata. Author(s) Chong Gu, chong@stat.purdue.edu References Gu, C. and Wahba, G. (1993), Smoothing spline ANOVA with component-wise Bayesian confidence intervals. Journal of Computational and Graphical Statistics, 2,

25 print.ssanova 25 See Also Fitting functions ssanova, ssanova and methods summary.ssanova, summary.gssanova, fitted.ssanova. Examples ## THE FOLLOWING EXAMPLE IS TIME-CONSUMING ## Not run: ## Fit a model with thin-plate marginals, where geog is 2-D data(lakeacidity) fit <- ssanova(ph~log(cal)*geog,"tp",lakeacidity) ## Obtain estimates and standard errors on a grid new <- data.frame(cal=1,geog=i(matrix(0,1,2))) new <- model.frame(~log(cal)+geog,new) predict(fit,new,se=true) ## Evaluate the geog main effect predict(fit,new,se=true,inc="geog") ## Evaluate the sum of the geog main effect and the interaction predict(fit,new,se=true,inc=c("geog","log(cal):geog")) ## Evaluate the geog main effect on a grid grid <- seq(-.04,.04,len=21) new <- model.frame(~geog,list(geog=cbind(rep(grid,21),rep(grid,rep(21,21))))) est <- predict(fit,new,se=true,inc="geog") ## Plot the fit and standard error par(pty="s") contour(grid,grid,matrix(est$fit,21,21),col=1) contour(grid,grid,matrix(est$se,21,21),add=true,col=2) ## Clean up rm(lakeacidity,fit,new,grid,est) dev.off() ## End(Not run) print.ssanova Print Functions for Smoothing Spline ANOVA Models Print functions for Smoothing Spline ANOVA models. print.gssanova1(x,...) print.ssanova(x,...) print.ssanova1(x,...) print.ssden(x,...) print.sshzd(x,...) print.summary.ssanova(x, digits=6,...) print.summary.gssanova(x, digits=6,...) print.summary.gssanova1(x, digits=6,...)

26 26 project x Object of class ssanova, summary.ssanova, summary.gssanova, or ssden. digits Number of significant digits to be printed in values.... Ignored. See Also ssanova, gssanova, summary.ssanova, summary.gssanova, ssden, sshzd. project Projecting Smoothing Spline ANOVA Fits for Model Diagnostics Calculate Kullback-Leibler projection of smoothing spline ANOVA fits for Model Diagnostics. project(object,...) project.ssanova1(object, include,...) project.gssanova1(object, include,...) project.ssden(object, include, mesh=false,...) project.sshzd(object, include, mesh=false,...) object Object of class "ssanova1", "gssanova1", "ssden", or "sshzd".... Additional arguments. Ignored in project.x. include List of model terms to be included in the reduced model space. The partial and offset terms, if present, are to be specified by "partial" and "offset", respectively. mesh Flag indicating whether to return evaluations of the projection. Details The entropy KL(fit0,null) can be decomposed as the sum of KL(fit0,fit1) and KL(fit1,null), where fit0 is the fit to be projected, fit1 is the projection in the reduced model space, and null is the constant fit. The ratio KL(fit0,fit1)/KL(fit0,null) serves as a diagnostic of the feasibility of the reduced model. For regression fits, smoothness safe-guard is used to prevent interpolation, and KL(fit0,fit1)+KL(fit1,null) may not match KL(fit0,null) perfectly. For mixed-effect models from ssanova1 and gssanova1, the estimated random effects are treated as offset.

27 rkpk 27 Value The functions return a list consisting of the following components. ratio kl check mesh KL(fit0,fit1)/KL(fit0,null); the smaller the value, the more feasible the reduced model is. KL(fit0,fit1). KL(fit0,fit1)/KL(fit0,null)+KL(fit1,null)/KL(fit0,null); a value closer to 1 is preferred. The evaluations of the projection. Author(s) Chong Gu, chong@stat.purdue.edu References Gu, C. (2003), Model Diagnostics for Smoothing Spline ANOVA Models. Available at See Also Fitting functions ssanova1, gssanova1, ssden, and sshzd. rkpk Interface to RKPACK Call RKPACK routines for numerical calculations for fitting and predicting from Smoothing Spline ANOVA models. sspreg(s, q, y, method="v", varht=1) mspreg(s, q, y, method="v", varht=1, prec=1e-7, maxiter=30) sspregpoi(family, s, q, y, wt, offset, method="u", varht=1, nu, prec=1e-7, maxiter=30) mspregpoi(family, s, q, y, wt, offset, method="u", varht=1, nu, prec=1e-7, maxiter=30) getcrdr(obj, r) getsms(obj) s q y method varht prec Design matrix of unpenalized terms. Penalty matrices of penalized terms. Model response. Method for smoothing parameter selection. Assumed dispersion parameter, needed only for method="u". Precision requirement for iterations.

28 28 rkpk1 maxiter family wt offset obj nu r Maximum number of iterations allowed. Error family. Model weights. Model offset. Object returned from a call to sspreg, mspreg, sspregpoi, or mspregpoi. Optional argument for nbinomial, weibull, lognorm, and loglogis families. Inputs for standard error calculation. Details sspreg is used by ssanova to fit Gaussian models with a single smoothing parameter. mspreg is used to fit Gaussian models with multiple single smoothing parameters. sspregpoi is used by gssanova to fit non Gaussian models with a single smoothing parameter. mspregpoi is used to fit non Gaussian models with multiple single smoothing parameters. getcrdr and getsms are used by predict.ssanova to calculate standard errors of the fitted terms. References Gu, C. (1989), RKPACK and its applications: Fitting smoothing spline models. In ASA Proceedings of Statistical Computing Section, pp rkpk1 Numerical Engine for ssanova1 and gssanova1 Calculate penalized least squares regression estimates via the normal equation and evaluate the GCV, GML, or Mallows CL scores, as implemented in the RATFOR routine reg.r, and minimize the cross-validation score using nlm. sspreg1(s,r,q,y,method,alpha,varht,random) mspreg1(s,r,q,y,method,alpha,varht,random) sspngreg1(family,s,r,q,y,wt,offset,alpha,nu,random) mspngreg1(family,s,r,q,y,wt,offset,alpha,nu,random) ngreg1(dc,family,sr,q,y,wt,offset,nu,alpha) ngreg.proj(dc,family,sr,q,y0,wt,offset,nu)

29 rkpk1 29 family s r q y wt offset method alpha nu varht random dc sr y0 of the error distribution. Supported are exponential families "binomial", "poisson", "Gamma", and "nbinomial". Also supported are accelerated life model families "weibull", "lognorm", and "loglogis". Unpenalized terms evaluated at data points. Basis of penalized terms evaluated at data points. Penalty matrix. Response vector. Model weights. Model offset. "v" for GCV, "m" for GML, or "u" for Mallows CL. Parameter modifying GCV or Mallows CL scores for smoothing parameter selection. Optional argument for future support of nbinomial, weibull, lognorm, and loglogis families. External variance estimate needed for method="u". Input for parametric random effects in nonparametric mixed-effect models. Coefficients of fits. cbind(s,r). Components of the fit to be projected. Details sspreg1 is used by ssanova1 to compute regression estimates with a single smoothing parameter. mspreg1 is used by ssanova1 to compute regression estimates with multiple smoothing parameters. ssngpreg1 is used by gssanova1 to compute non-gaussian regression estimates with a single smoothing parameter. mspngreg1 is used by gssanova1 to compute non-gaussian regression estimates with multiple smoothing parameters. ngreg1 is used by ssngpreg1 and mspngreg1 to perform Newton iteration with fixed smoothing parameters and to calculate cross-validation scores on return. ngreg.proj is used by project.gssanova1 to calculate Kullback-Leibler projection for non-gaussian regression. References Kim, Y.-J. and Gu, C. (2002) Penalized Least Squares Regression: Fast Computation via Efficient Approximation. Available at

30 30 ssanova smolyak Generating Smolyak Cubature Generate Smolyak cubatures using C routines modified from smolyak.c found in Knut Petras SMOLPACK. smolyak.quad(d,k) smolyak.size(d,k) d k Dimension of unit cube. Depth of algorithm. Details smolyak.quad and smolyak.size concern the delayed Smolyak cubature. Value smolyak.quad returns a list object with the following components. pt wt Quadrature nodes in rows of matrix. Quadrature weights. smolyak.size returns an integer. ssanova Fitting Smoothing Spline ANOVA Models Fit smoothing spline ANOVA models with cubic spline, linear spline, or thin-plate spline marginals for numerical variables. Factors are also accepted. The symbolic model specification via formula follows the same rules as in lm. ssanova(formula, type="cubic", data=list(), weights, subset, offset, na.action=na.omit, partial=null, method="v", varht=1, prec=1e-7, maxiter=30, ext=.05, order=2)

31 ssanova 31 formula type data weights subset Symbolic description of the model to be fit. Type of numerical marginals to be used. Supported are type="cubic" for cubic spline marginals, type="linear" for linear spline marginals, and type="tp" for thin-plate spline marginals. Optional data frame containing the variables in the model. Optional vector of weights to be used in the fitting process. Optional vector specifying a subset of observations to be used in the fitting process. offset Optional offset term with known parameter 1. na.action partial method varht prec maxiter ext order Details Function which indicates what should happen when the data contain NAs. Optional extra unpenalized terms in partial spline models. Method for smoothing parameter selection. Supported are method="v" for GCV, method="m" for GML (REML), and method="u" for Mallow s CL. External variance estimate needed for method="u". Ignored when method="v" or method="m" are specified. Precision requirement in the iteration for multiple smoothing parameter selection. Ignored when only one smoothing parameter is involved. Maximum number of iterations allowed for multiple smoothing parameter selection. Ignored when only one smoothing parameter is involved. For cubic spline and linear spline marginals, this option specifies how far to extend the domain beyond the minimum and the maximum as a percentage of the range. The default ext=.05 specifies marginal domains of lengths 110 percent of their respective ranges. Prediction outside of the domain will result in an error. Ignored if type="tp" is specified. For thin-plate spline marginals, this option specifies the order of the marginal penalties. Ignored if type="cubic" or type="linear" are specified. ssanova and the affiliated methods provide a front end to RKPACK, a collection of RAT- FOR routines for structural multivariate nonparametric regression via the penalized least squares method. The algorithms implemented in RKPACK are of the orders O(n 3 ) in execution time and O(n 2 ) in memory requirement. The constants in front of the orders vary with the complexity of the model to be fit. The model specification via formula is intuitive. For example, y~x1*x2 yields a model of the form y = c + f 1 (x1) + f 2 (x2) + f 12 (x1, x2) + e with the terms denoted by "1", "x1", "x2", and "x1:x2". Through the specifications of the side conditions, these terms are uniquely defined. In the current implementation, f 1 and f 12 integrate to 0 on the x1 domain for cubic spline and linear spline marginals, and add to 0 over the x1 (marginal) sampling points for thin-plate spline marginals. The penalized least squares problem is equivalent to a certain empirical Bayes model or a mixed effect model, and the model terms themselves are generally sums of finer terms of two types, the unpenalized terms (fixed effects) and the penalized terms (random effects).

32 32 ssanova Attached to every penalized term there is a smoothing parameter, and the model complexity is largely determined by the number of smoothing parameters. The method predict can be used to evaluate the sum of selected or all model terms at arbitrary points within the domain, along with standard errors derived from a certain Bayesian calculation. The method summary has a flag to request diagnostics for the practical identifiability and significance of the model terms. Value ssanova returns a list object of class "ssanova". The method summary is used to obtain summaries of the fits. The method predict can be used to evaluate the fits at arbitrary points, along with the standard errors to be used in Bayesian confidence intervals. The methods residuals and fitted.values extract the respective traits from the fits. Factors Factors are accepted as predictors. When a factor has 3 or more levels, all terms involving it are treated as penalized terms, with the level means being shrunk towards each other. The shrinking is done differently for nominal and ordinal factors; see mkrk.factor for details. Note The independent variables appearing in formula can be multivariate themselves. In particular, ssanova(y~x,"tp",order=order) can be used to fit ordinary thin-plate splines in any dimension, of any order permissible, and with standard errors available for Bayesian confidence intervals. Note that thin-plate splines reduce to polynomial splines in one dimension. For univariate marginals, the additive models using type="cubic" and type="tp" yield identical fit through different internal makes. For example, ssanova(y~x1+x2,"cubic") and ssanova(y~x1+x2,"tp") yield the same fit. The same is not true for models with interactions, however. Mathematically, the domain (through ext for type="cubic") or the order (through order for type="tp") could be specified individually for each of the variables. Such flexibility is not provided in our implementation, however, as it would be more a source for confusion than a practical utility. Author(s) Chong Gu, chong@stat.purdue.edu References Gu, C. (2002), Smoothing Spline ANOVA Models. New York: Springer-Verlag. Wahba, G. (1990), Spline Models for Observational Data. Philadelphia: SIAM. See Also Methods predict.ssanova, summary.ssanova, and fitted.ssanova.

33 ssanova1 33 Examples ## Fit a cubic spline x <- runif(100); y < *sin(2*pi*x) + rnorm(x) cubic.fit <- ssanova(y~x,method="m") ## The same fit with different internal makes tp.fit <- ssanova(y~x,"tp",method="m") ## Obtain estimates and standard errors on a grid new <- data.frame(x=seq(min(x),max(x),len=50)) est <- predict(cubic.fit,new,se=true) ## Plot the fit and the Bayesian confidence intervals plot(x,y,col=1); lines(new$x,est$fit,col=2) lines(new$x,est$fit+1.96*est$se,col=3) lines(new$x,est$fit-1.96*est$se,col=3) ## Clean up ## Not run: rm(x,y,cubic.fit,tp.fit,new,est) dev.off() ## End(Not run) ## Fit a tensor product cubic spline data(nox) nox.fit <- ssanova(log10(nox)~comp*equi,data=nox) ## Fit a spline with cubic and nominal marginals nox$comp<-as.factor(nox$comp) nox.fit.n <- ssanova(log10(nox)~comp*equi,data=nox) ## Fit a spline with cubic and ordinal marginals nox$comp<-as.ordered(nox$comp) nox.fit.o <- ssanova(log10(nox)~comp*equi,data=nox) ## Clean up ## Not run: rm(nox,nox.fit,nox.fit.n,nox.fit.o) ssanova1 Fitting Smoothing Spline ANOVA Models Fit smoothing spline ANOVA models with cubic spline, linear spline, or thin-plate spline marginals for numerical variables. Factors are also accepted. The symbolic model specification via formula follows the same rules as in lm. ssanova1(formula, type="cubic", data=list(), weights, subset, offset, na.action=na.omit, partial=null, method="v", alpha=1.4, varht=1, id.basis=null, nbasis=null, seed=null, random=null, ext=.05, order=2) formula type Symbolic description of the model to be fit. Type of numerical marginals to be used. Supported are type="cubic" for cubic spline marginals, type="linear" for linear spline marginals, and type="tp" for thin-plate spline marginals.

Package gss. R topics documented: April 19, 2018

Package gss. R topics documented: April 19, 2018 Version 2.1-8 Date 2018-04-18 Title General Smoothing Splines Author Chong Gu Maintainer Chong Gu Depends R (>= 2.14.0), stats Package gss April 19, 2018 A comprehensive