Semiparametric Tools: Generalized Additive Models

1 Semiparametric Tools: Generalized Additive Models

Jamie Monogan, Washington University (WUStL)
November 8, 2010

2 Regression Splines

Choose breakpoints (also called knots). This is the key tradeoff: more knots mean more flexibility, but the fit can be more compute-intensive and sometimes too wavy.

Strategies for placing knots:
1. cardinal knots: uniform over the range of the X data,
2. at quantiles,
3. adaptive (complex),
4. at selected $X_i$.

Effects of the choice:
1. the consequences of bad choices can be dramatic,
2. bad choices can miss important features.

Setup: interior knots are given by $\xi_1 < \xi_2 < \cdots < \xi_S$ over the range $(X_{(1)}, X_{(n)})$, along with the boundary knots $\xi_0 < X_{(1)}$ and $X_{(n)} < \xi_{S+1}$.
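The first two placement strategies can be sketched in a few lines of R. This is only an illustration on simulated data: the variable names (`cardinal_knots`, `quantile_knots`), the skewed sample, and the choice S = 4 are assumptions, not part of the original slides.

```r
# Simulated, right-skewed explanatory variable (an assumption for illustration)
set.seed(1)
x <- rexp(200)
S <- 4  # number of interior knots

# Strategy 1: cardinal knots, evenly spaced strictly inside the range of x
cardinal_knots <- seq(min(x), max(x), length.out = S + 2)[-c(1, S + 2)]

# Strategy 2: knots at evenly spaced quantiles of x
quantile_knots <- quantile(x, probs = seq_len(S) / (S + 1), names = FALSE)

print(cardinal_knots)
print(quantile_knots)
```

With skewed data the quantile knots cluster where the observations are dense, while the cardinal knots spread evenly and may leave few observations between the upper knots.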

3 Regression Splines

We must choose basis functions that join smoothly at the knots.

Truncated power series, with S knots:

$$s(x) = \delta_0 + \delta_1 x + \delta_2 x^2 + \delta_3 x^3 + \sum_{i=1}^{S} \delta_{3+i} (x - \xi_i)^3_+$$

where $(x - \xi_i)^3_+$ means we include only positive terms, else zero:

$$(x - \xi_i)^3_+ = \max[0, (x - \xi_i)^3]$$

4 Regression Splines

So $s(x)$ is a linear weighted combination of the $S + 4$ functions:

$$P_0(x) = 1, \quad P_1(x) = x, \quad P_2(x) = x^2, \quad P_3(x) = x^3,$$
$$P_{S_1}(x) = (x - \xi_1)^3_+, \quad P_{S_2}(x) = (x - \xi_2)^3_+, \quad P_{S_3}(x) = (x - \xi_3)^3_+, \quad \ldots$$

with one truncated term per knot, meaning that the function is linear in these $S + 4$ parameters and can be estimated with OLS.
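As a minimal sketch of the truncated power series fit, the basis can be built by hand and passed to lm(). The helper name `trunc_power_basis`, the simulated data, and the three knot locations are assumptions made for illustration.

```r
# Build the S + 4 truncated power basis columns for a cubic regression spline
trunc_power_basis <- function(x, knots) {
  poly_part <- cbind(1, x, x^2, x^3)                       # P0..P3
  knot_part <- sapply(knots, function(k) pmax(0, x - k)^3) # (x - xi_i)^3_+
  B <- cbind(poly_part, knot_part)
  colnames(B) <- c(paste0("P", 0:3), paste0("PS", seq_along(knots)))
  B
}

# Simulated data (an assumption for illustration)
set.seed(42)
x <- seq(-3, 3, length.out = 200)
y <- sin(2 * x) + rnorm(length(x), sd = 0.3)
knots <- c(-1.5, 0, 1.5)  # S = 3 interior knots

B <- trunc_power_basis(x, knots)
fit <- lm(y ~ B - 1)  # "- 1" because the basis already has a constant column
print(coef(fit))      # S + 4 = 7 OLS estimates
```

The truncated power basis is easy to read but can be poorly conditioned with many knots, which is one reason B-spline parameterizations are often preferred in practice.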

5 Cubic Regression Splines

Cubic splines add the condition that at the boundary knots, $s''(x)$ and $s'''(x)$ are both zero, so $s(x)$ is linear (but not necessarily flat) in the intervals $[\xi_0, \xi_1]$ and $[\xi_S, \xi_{S+1}]$.

Now estimate by applying the function:
1. Regress $y_i$ on $f(x_i)$ using OLS, say for $S = 3$:

$$\hat{y}_i = f(x_i) = \delta_0 + \delta_1 x_i + \delta_2 x_i^2 + \delta_3 x_i^3 + \delta_4 (x_i - \xi_1)^3_+ + \delta_5 (x_i - \xi_2)^3_+ + \delta_6 (x_i - \xi_3)^3_+$$

2. This obtains 7 coefficient estimates: $\hat{\delta}_i$, $i = 0, \ldots, 6$.

Extensions: B-splines, different parameterizations, ...
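The linearity condition beyond the boundary knots can be checked numerically. The sketch below uses ns() from the splines package, which builds a natural cubic spline basis in a B-spline parameterization rather than the truncated power form above; the simulated data and knot locations are assumptions.

```r
library(splines)

# Simulated data (an assumption for illustration)
set.seed(7)
x <- sort(runif(150, -3, 3))
y <- sin(2 * x) + rnorm(150, sd = 0.3)

# Natural cubic spline with three interior knots; boundary knots default to range(x)
fit <- lm(y ~ ns(x, knots = c(-1.5, 0, 1.5)))

# Predict at equally spaced points beyond the upper boundary knot
x_out <- max(x) + c(0.1, 0.2, 0.3, 0.4)
p_out <- predict(fit, newdata = data.frame(x = x_out))

# For a linear function, second differences on an equally spaced grid vanish
second_diff <- diff(p_out, differences = 2)
print(second_diff)
```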

6 Cubic Regression Splines

The R function for cubic splines is smooth.spline:

    # Simulated data: cosine values divided by scaled t(50) draws
    x1 <- seq(-10, 10, length=100)
    y1 <- cos(x1)/(rt(length(x1), 50)*0.5)
    par(mfrow=c(1,3), oma=c(3,3,3,3), mar=c(2,0,2,0), bg="whitesmoke")

    # Panel 1: every distinct x value is a knot
    plot(x1, y1, pch=3, ylim=range(y1)*1.2, col="slateblue")
    spline.out <- smooth.spline(x1, y1, all.knots=TRUE)
    lines(spline.out$x, spline.out$y, col="forest green")
    mtext("All X Are Knots", outer=FALSE, side=3, cex=1.1, line=1.5)

    # Panel 2: 4 knots
    plot(x1, y1, pch=3, ylim=range(y1)*1.2, yaxt="n", col="slateblue")
    spline.out <- smooth.spline(x1, y1, all.knots=FALSE, nknots=4)
    lines(spline.out$x, spline.out$y, col="forest green")
    mtext("4 Knots", outer=FALSE, side=3, cex=1.1, line=1.5)

    # Panel 3: 20 knots (nknots set to match the panel title)
    plot(x1, y1, pch=3, ylim=range(y1)*1.2, yaxt="n", col="slateblue")
    spline.out <- smooth.spline(x1, y1, all.knots=FALSE, nknots=20)
    lines(spline.out$x, spline.out$y, col="forest green")
    mtext("20 Knots", outer=FALSE, side=3, cex=1.1, line=1.5)

7 Cubic Regression Splines

[Figure: three panels of smooth.spline fits, titled "All X Are Knots", "4 Knots", and "20 Knots".]

8 Penalized Splines

The goal is:

$$\min \sum_{i=1}^{n} (y_i - f(x_i))^2 + \lambda \int_a^b (f''(t))^2 \, dt$$

where $\lambda$ is a fixed constant, and $a < x_{(1)}$, $x_{(n)} < b$.

The idea is that the first term promotes fit and the second term penalizes overfitting. The minimizing function $f(x)$ has an explicit and unique form: a cubic spline with knots at all of the $x_i$.

If $\lambda = \infty$, then we get linear regression. As $\lambda$ approaches 0, we get closer to interpolation.

9 Estimating Penalized Splines

For regression-style models, another way of notating the function to be minimized is:

$$\|y - X\beta\|^2 + \lambda \int_a^b (f''(t))^2 \, dt.$$

Because $f(t)$ is linear in the parameters by design, the penalty can be written as a quadratic form:

$$\int_a^b (f''(t))^2 \, dt = \beta' S \beta,$$

where $S$ is a matrix of known coefficients determined by the form of $f(t)$.

10 Estimating Penalized Splines

So the penalized least squares estimator is given by:

$$\hat{\beta} = (X'X + \lambda S)^{-1} X'y,$$

with the associated hat (influence) matrix:

$$A = X(X'X + \lambda S)^{-1} X'.$$

The matrix $A$ also gives the effective degrees of freedom (edf) for the smoothed fit by its trace. The maximum of $\mathrm{tr}(A)$ is the number of parameters minus the number of constraints, and the minimum of $\mathrm{tr}(A)$ is the maximum minus the rank of the $S$ matrix. As the smoothing parameter $\lambda$ goes from zero to infinity, the edf moves from the maximum down to the minimum.

We can use edf for hypothesis testing between two models.
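The estimator and the behaviour of tr(A) can be illustrated with a short sketch. As a stand-in for the integrated squared second derivative, it uses a P-spline style second-difference penalty on B-spline coefficients, so S here is an assumption chosen for simplicity, as are the simulated data and the helper name `pen_fit`.

```r
library(splines)

# Simulated data (an assumption for illustration)
set.seed(10)
n <- 200
x <- seq(0, 1, length.out = n)
y <- sin(2 * pi * x) + rnorm(n, sd = 0.3)

p <- 12                                                  # number of basis functions
X <- matrix(bs(x, df = p, intercept = TRUE), nrow = n)   # plain n x p design matrix
D <- diff(diag(p), differences = 2)                      # second-difference operator
S <- crossprod(D)                                        # penalty matrix, rank p - 2

# Penalized least squares: beta_hat = (X'X + lambda S)^{-1} X'y, edf = tr(A)
pen_fit <- function(lambda) {
  M <- solve(crossprod(X) + lambda * S)
  A <- X %*% M %*% t(X)
  list(beta = M %*% crossprod(X, y), edf = sum(diag(A)))
}

lambdas <- c(0, 0.1, 10, 1e8)
edf <- sapply(lambdas, function(l) pen_fit(l)$edf)
print(round(edf, 3))
```

With no constraints, the edf starts at p = 12 when λ = 0 and falls toward p minus rank(S) = 2 as λ grows, matching the limits stated above.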

11 Penalized Splines Code

The R function for penalized splines is also smooth.spline:

    nonlin.mat <- read.table("gam.test.dat", header=TRUE)
    attach(nonlin.mat)
    postscript("class.stat.comp/cognitive2i.ps")
    par(mfrow=c(1,3), oma=c(3,3,3,3), mar=c(2,0,2,0))

    # Panel 1: very small smoothing parameter, close to interpolation
    plot(x1, y1, pch=3, ylim=range(y1)*1.2, col="slateblue")
    spline.out <- smooth.spline(x1, y1, spar=0.1, all.knots=TRUE)
    lines(spline.out$x, spline.out$y, col="maroon2")
    mtext("Interpolation", outer=FALSE, side=3, cex=1.1, line=1.5)

12 Penalized Splines

    # Panel 2: moderate smoothing
    plot(x1, y1, pch=3, ylim=range(y1)*1.2, yaxt="n", col="slateblue")
    spline.out <- smooth.spline(x1, y1, spar=0.5)
    lines(spline.out$x, spline.out$y, col="maroon2")
    mtext("Parameter: 0.5", outer=FALSE, side=3, cex=1.1, line=1.5)

    # Panel 3: heavy smoothing (title corrected to match spar=0.95)
    plot(x1, y1, pch=3, ylim=range(y1)*1.2, yaxt="n", col="slateblue")
    spline.out <- smooth.spline(x1, y1, spar=0.95)
    lines(spline.out$x, spline.out$y, col="maroon2")
    mtext("Parameter: 0.95", outer=FALSE, side=3, cex=1.1, line=1.5)

    dev.off()
    detach(nonlin.mat)

13 Penalized Splines

[Figure: three panels of penalized spline fits, titled "Interpolation", "Parameter: 0.5", and "Parameter:".]

14 Thin Plate Splines

Some disadvantages of standard spline approaches:
- the user must stipulate knots,
- bases are given for only one variable,
- criteria for bases are unclear.

We want knot-free spline bases over any number of explanatory variables that have optimal properties. Thin plate splines (Duchon 1977; Wahba 1990; Green & Silverman 1994; Wood 2006, Chapter 4) are a good solution to this problem.

General strategy: penalize with derivative functions to produce a smooth function according to

$$y_i = g(x_i) + \epsilon_i$$

where $\epsilon_i$ is a random error term with good properties and $x_i$ is a $d$-length explanatory variable vector.

The method automatically calculates how much weight to give the conflicting goals of following the data and making the fit as smooth as possible, by putting the tradeoff into an explicit function.

15 Thin Plate Splines

Objective: find a function $f$ that minimizes the penalized vector norm:

$$\|y - f\|^2 + \lambda J_{md}(f)$$

where:
- $y$ is the $n$-length outcome variable vector,
- $f = (f(x_1), f(x_2), \ldots, f(x_n))'$,
- $\lambda$ is a smoothing parameter,
- $J_{md}$ is a penalty term based on the curviness of the smooth.

This is very much in line with the spline technology that we have been studying, except that more of the decisions are automatic rather than human.

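16 Thin Plate Splines Penalty Function

The penalty is defined with the following function:

$$J_{md} = \int \cdots \int_{\mathbb{R}^d} \sum_{\eta_1 + \cdots + \eta_d = m} \frac{m!}{\eta_1! \cdots \eta_d!} \left( \frac{\partial^m f}{\partial x_1^{\eta_1} \cdots \partial x_d^{\eta_d}} \right)^2 dx_1 \cdots dx_d$$

where $2m > d + 1$.

For instance, when $d = 2$ and $m = 2$, the sum runs over $(\eta_1, \eta_2) = (2,0), (1,1), (0,2)$ with weights $2!/(2!\,0!) = 1$, $2!/(1!\,1!) = 2$, and $2!/(0!\,2!) = 1$, so the cross term $\eta_1 = \eta_2 = 1$ is counted twice:

$$J_{22} = \int\!\!\int \left[ \left( \frac{\partial^2 f}{\partial x_1^2} \right)^2 + 2 \left( \frac{\partial^2 f}{\partial x_1 \partial x_2} \right)^2 + \left( \frac{\partial^2 f}{\partial x_2^2} \right)^2 \right] dx_1 \, dx_2.$$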

17 Thin Plate Splines

One function that minimizes $\|y - f\|^2 + \lambda J_{md}(f)$ is:

$$\hat{f}(x) = \sum_{i=1}^{n} \delta_i \, \eta_{md}(\|x - x_i\|) + \sum_{j=1}^{M} \alpha_j \phi_j(x),$$

where:
- $\delta$ and $\alpha$ are coefficient vectors to be estimated.
- $\delta$ has the linear constraint $T'\delta = 0$, with the matrix values $T_{ij} = \phi_j(x_i)$.
- The $\phi_j$ are linearly independent polynomials spanning the space of polynomials in $\mathbb{R}^d$ of degree less than $m$, as well as spanning the null space of $J_{md}$.

Returning to the example where $m = d = 2$:

$$\phi_1 = 1, \quad \phi_2 = x_1, \quad \phi_3 = x_2.$$

Finally:

$$\eta_{md}(\chi) = \begin{cases} \dfrac{(-1)^{m+1+d/2}}{2^{2m-1} \pi^{d/2} (m-1)! \, (m - d/2)!} \, \chi^{2m-d} \log(\chi) & d \text{ even} \\[2ex] \dfrac{\Gamma(d/2 - m)}{2^{2m} \pi^{d/2} (m-1)!} \, \chi^{2m-d} & d \text{ odd} \end{cases}$$

18 Thin Plate Splines

Now define the matrix $E$ with elements:

$$E_{ij} = \eta_{md}(\|x_i - x_j\|).$$

The fitting problem is now expressible as:

$$\text{minimize } \|y - E\delta - T\alpha\|^2 + \lambda \delta' E \delta, \quad \text{subject to } T'\delta = 0, \text{ with respect to } \delta, \alpha.$$

The penalty acts only on the rough components, those with $\delta$ parameters, while leaving the smooth components (those with $\alpha$ parameters) untouched.

Primary challenge (besides all the math) is computational efficiency: there are as many unknown quantities as data points, and estimation time is proportional to the cube of the number of unknowns, $n^3$.
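For the case m = d = 2, where $\eta_{22}(\chi) = \chi^2 \log(\chi) / (8\pi)$, the constrained problem can be solved directly as one linear system. The sketch below assumes the usual first-order conditions $(E + \lambda I)\delta + T\alpha = y$ and $T'\delta = 0$, which the slides do not derive; the simulated data and the names `eta22` and `tps_fit` are hypothetical.

```r
# Thin plate spline basis function for m = d = 2, defined as 0 at distance 0
eta22 <- function(r) ifelse(r > 0, r^2 * log(r) / (8 * pi), 0)

# Simulated two-dimensional data (an assumption for illustration)
set.seed(3)
n <- 30
X <- cbind(runif(n), runif(n))
y <- sin(2 * pi * X[, 1]) * cos(2 * pi * X[, 2]) + rnorm(n, sd = 0.1)

E <- eta22(as.matrix(dist(X)))  # E_ij = eta(||x_i - x_j||)
Tmat <- cbind(1, X)             # phi_1 = 1, phi_2 = x1, phi_3 = x2

# Solve [E + lambda I, T; T', 0] [delta; alpha] = [y; 0]
tps_fit <- function(lambda) {
  K <- rbind(cbind(E + lambda * diag(n), Tmat),
             cbind(t(Tmat), matrix(0, 3, 3)))
  sol <- solve(K, c(y, rep(0, 3)))
  delta <- sol[1:n]
  alpha <- sol[n + 1:3]
  list(delta = delta, alpha = alpha,
       fitted = as.vector(E %*% delta + Tmat %*% alpha))
}

fit0 <- tps_fit(0)     # no penalty: the surface interpolates the data
fit1 <- tps_fit(0.01)  # some penalty: a smoother surface
print(c(rss0 = sum((y - fit0$fitted)^2), rss1 = sum((y - fit1$fitted)^2)))
```

The system has n + 3 unknowns, which is why the cost grows with the cube of the sample size.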

19 Thin Plate Splines in R

    library(rgcvpack)

    # DEFINE A THREE-DIMENSIONAL FUNCTION (2 IN, 1 OUT)
    f <- function(x, y) {
      0.75*exp( -((10*x-1)^2 + (10*y-1)^2)/5 ) *exp( -((10*x-7)^2 + (10*y-5)^2)/5 ) *exp( -((10*x-4)^2 + (10*y-7)^2)/5 )
    }

    # CREATE A FAKE DATASET USING THIS FUNCTION
    set.seed(pi); n <- 15; x2 <- x1 <- seq(0,1,length=n)
    y <- outer(x1, x2, f); y <- y + rnorm(n^2,0,0.05*max(abs(y)))

    # THE FUNCTION NEEDS THESE AS VECTORS
    x1.vec <- rep(x1,n); x2.vec <- rep(x2,rep(n,n))
    y.vec <- as.vector(y)

20 Thin Plate Splines in R

    # RUN THE THIN PLATE SPLINE WITH ALL DATA POINTS AS KNOTS
    # ORDER 3
    thinpl.out <- fitTps(cbind(x1.vec,x2.vec), y.vec, m=3)

    # GRAPH
    par(mar=c(3,3,1,1), col.axis="white", col.lab="white",
        col.sub="white", col="white", bg="slategray")
    persp(x1, x2, matrix(predict(thinpl.out),n,n), theta=130, phi=20,
          expand=0.50, xlab="x1", ylab="x2", zlab="y",
          xlim=c(0,1), ylim=c(0,1), zlim=range(y),
          ticktype="detailed", scale=FALSE, main="Thin Plate Spline")

21 Thin Plate Splines in R

[Figure: perspective plot titled "Thin Plate Spline", with axes x1, x2, and y.]

22 Smoothing Parameter Selection

Recall that as $\lambda$ approaches 0, we get closer to interpolation (for a scalar parameter). Penalized maximum likelihood methods can only estimate the $\beta$ coefficients conditional on the smoothing parameters, $\lambda$.

There are two basic scenarios for estimation via minimizing the error quantity $E(M) = E\left( \|\mu - \hat{\mu}\|^2 / n \right)$:
- $\sigma^2$ known or assumed true: estimation uses Mallows' $C_p$ / UBRE (unbiased risk estimator).
- $\sigma^2$ unknown: estimation uses generalized cross validation (GCV).

23 Smoothing Parameter Selection, Scale Parameter Known

For regression, the expected mean square error is given by:

$$E(M) = E\left( \|\mu - X\hat{\beta}\|^2 \right)/n = E\left( \|y - Ay\|^2 \right)/n - \sigma^2 + 2\,\mathrm{tr}(A)\,\sigma^2/n$$

where $A = X(X'X + \lambda S)^{-1} X'$. This means we minimize

$$\|y - Ay\|^2/n - \sigma^2 + 2\,\mathrm{tr}(A)\,\sigma^2/n.$$

So the smoothing parameters affect the estimation through $A$.

24 Smoothing Parameter Selection, Scale Parameter Unknown

In this case minimize the mean square prediction error, $P = \sigma^2 + M$, which is the average squared error in predicting a new observation, $y_{n+1}$, using the fitted model.

$P$ is most easily estimated with cross validation (later generalized cross validation):
- Jackknife out each case iteratively.
- At each step, calculate $\hat{\mu}_{[i]}$, which is the prediction of $y_i$ from the model that does not include case $i$.
- Finally, calculate:

$$\hat{P} = \frac{1}{n} \sum_{i=1}^{n} (y_i - \hat{\mu}_{[i]})^2.$$

25 Smoothing Parameter Selection, Scale Parameter Unknown

The $\hat{\mu}_{[i]}$ term means that we have to do the full jackknifing loop through all of the data. Actually this can be done in one step using the complete-data model:

$$\hat{P} = \frac{1}{n} \sum_{i=1}^{n} \frac{(y_i - \hat{\mu}_i)^2}{(1 - A_{ii})^2}.$$

This is analogous to the shorthand method for calculating the jackknifed standard error.
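The equivalence between the jackknife loop and the one-step formula can be checked numerically for a penalized least squares fit with fixed λ. The basis, the second-difference penalty standing in for S, the value λ = 1, and the simulated data are all assumptions made for this sketch.

```r
library(splines)

# Simulated data (an assumption for illustration)
set.seed(25)
n <- 60
x <- seq(0, 1, length.out = n)
y <- sin(2 * pi * x) + rnorm(n, sd = 0.3)

p <- 10
X <- matrix(bs(x, df = p, intercept = TRUE), nrow = n)  # fixed design matrix
S <- crossprod(diff(diag(p), differences = 2))          # stand-in penalty matrix
lambda <- 1

# One-step version: uses only the complete-data fit and the diagonal of A
A <- X %*% solve(crossprod(X) + lambda * S, t(X))
mu_hat <- as.vector(A %*% y)
P_short <- mean(((y - mu_hat) / (1 - diag(A)))^2)

# Full jackknife loop: refit without case i, then predict case i
mu_loo <- sapply(seq_len(n), function(i) {
  Xi <- X[-i, , drop = FALSE]
  beta_i <- solve(crossprod(Xi) + lambda * S, crossprod(Xi, y[-i]))
  sum(X[i, ] * beta_i)
})
P_loop <- mean((y - mu_loo)^2)

print(c(P_short = P_short, P_loop = P_loop))
```

The two estimates agree up to rounding error, while the shortcut needs a single fit instead of n refits.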

26 Generalized Additive Models

Big picture: just like a GLM, except we will do component-wise smoothing of some right-hand side variables.

More computationally intensive than GLM estimation, with many more model-fitting choices to make.

Results are often given graphically for smoothed parameters, especially if there are many.

Definitive citations:
- Hastie and Tibshirani (1986), Generalized Additive Models (with discussion). Statistical Science 1.
- Wood (2006), Generalized Additive Models: An Introduction with R. Chapman & Hall/CRC.
- Hastie (1993), in Chambers and Hastie, Statistical Models in S. Chapman & Hall.
- Hastie and Tibshirani (1990), Generalized Additive Models. Chapman & Hall.

27 Generalized Additive Models

Structure:

$$Y = \alpha + \sum_{j=1}^{n} f_j(x_j) + \epsilon, \qquad E[\epsilon] = 0, \quad \mathrm{cor}(\epsilon_i, x_j) = 0, \quad \mathrm{Var}(\epsilon) = \sigma^2$$

Solved by an algorithm called backfitting.

Typically we think of the $f_j$ as univariate and smooth, but they don't have to be either:
- $f(x_{j1}, x_{j2})$, like an interaction or other single dimension mapping,
- or categorical specifications.
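The slides name backfitting without spelling it out. A minimal sketch of the usual cycle, for two smooth terms, is to smooth the partial residuals for one term while holding the other fixed, center the result, and repeat. The choice of smooth.spline with df = 6 as the component smoother, the fixed 20 iterations, and the simulated data are assumptions.

```r
# Simulated additive data (an assumption for illustration)
set.seed(27)
n <- 300
x1 <- runif(n)
x2 <- runif(n)
y <- 2 + sin(2 * pi * x1) + 4 * (x2 - 0.5)^2 + rnorm(n, sd = 0.2)

alpha <- mean(y)   # intercept estimate
f1 <- rep(0, n)    # current estimate of f_1 at the data points
f2 <- rep(0, n)    # current estimate of f_2 at the data points

for (iter in 1:20) {
  # Update f_1: smooth the partial residuals that exclude f_1
  r1 <- y - alpha - f2
  f1 <- predict(smooth.spline(x1, r1, df = 6), x1)$y
  f1 <- f1 - mean(f1)  # centering keeps alpha identified

  # Update f_2 in the same way
  r2 <- y - alpha - f1
  f2 <- predict(smooth.spline(x2, r2, df = 6), x2)$y
  f2 <- f2 - mean(f2)
}

fitted_vals <- alpha + f1 + f2
print(cor(fitted_vals, y))
```

A production implementation would iterate until the component functions stop changing rather than for a fixed number of passes.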

28 Generalized Additive Models

To avoid a plethora of free constants in each of the $f_j()$, it is common to assume $E[f_j(x_j)] = 0$, which can be achieved by centering if necessary.

Big point: unlike a GLM, each term is represented additively, and therefore we can use the same marginal interpretation as linear models (but without the linear assumption, obviously).

Two consequences:
1. The variation of the fitted response surface holding all but one explanatory variable constant does not depend on the values of the other explanatory variables.
2. Plots of the fits separately are very useful.

Botanical example: let's study our cherry tree data. The simple model of interest is:

$$\log(\text{Volume}_i) = f_1(\text{Height}_i) + f_2(\text{Girth}_i) + \epsilon_i$$
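A minimal sketch of this model, assuming the cherry tree data are the `trees` data frame that ships with R (columns Girth, Height, Volume) and that the mgcv package is installed:

```r
library(mgcv)

# Built-in cherry tree data: Girth, Height, Volume for 31 trees
data(trees)

# Additive model for log volume with default spline smooths of each predictor
fit <- gam(log(Volume) ~ s(Height) + s(Girth), data = trees)
print(summary(fit))
```

Calling `plot(fit, pages = 1)` afterwards draws the two estimated component functions side by side, which is the separate-plots view mentioned above.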

29 Details on GAM Model Specification

The R formula for gam is just like glm, except we have new smoother terms: s and te.
- The notation s(X1) gives a spline-based smooth for the X1 explanatory variable.
- The notation te(X2) gives a tensor product-based smooth for the X2 explanatory variable.

It is common to mix smoothed and unsmoothed terms in a model:

    Y ~ X1 + s(X2) + te(X3)

There can be nested smoothing specifications:

    Y ~ s(X1) + s(X2) + s(X1,X2)
    Y ~ s(X1,X2) + s(X2,X3)

We can also control the smooth with parameter vectors, for instance:

    Y ~ te(X1,X2, bs=c("tp","tp"), m=c(3,4), k=c(5,6))

which gives a tensor product smooth of X1 and X2 with thin plate marginal bases of dimension 5 for X1 and 6 for X2 (the k argument), and marginal penalties of order 3 for X1 and 4 for X2 (the m argument).
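A hedged sketch of a formula that mixes a parametric term, a spline smooth, and a controlled tensor product smooth. The data are simulated and the variable names X1 to X4 are placeholders; the use of method = "REML" is also an assumption rather than something the slides prescribe.

```r
library(mgcv)

# Simulated data with one linear effect, one smooth effect, and one interaction
set.seed(29)
n <- 400
dat <- data.frame(X1 = rnorm(n), X2 = runif(n), X3 = runif(n), X4 = runif(n))
dat$Y <- 0.5 * dat$X1 + sin(2 * pi * dat$X2) + dat$X3 * dat$X4 + rnorm(n, sd = 0.3)

# Unsmoothed X1, spline smooth of X2, tensor product smooth of X3 and X4
# with marginal basis dimensions k = (5, 6) and penalty orders m = (3, 4)
fit <- gam(Y ~ X1 + s(X2) + te(X3, X4, bs = c("tp", "tp"), m = c(3, 4), k = c(5, 6)),
           data = dat, method = "REML")
print(summary(fit))
```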

30 Full Syntax for gam

There are many modeling options:

    gam(formula, family=gaussian(), data=list(), weights=NULL, subset=NULL,
        na.action, offset=NULL, method="GCV.Cp", optimizer=c("outer","newton"),
        control=gam.control(), scale=0, select=FALSE, knots=NULL, sp=NULL,
        min.sp=NULL, H=NULL, gamma=1, fit=TRUE, paraPen=NULL, G=NULL, in.out, ...)

with:
- formula: a full R modeling formula, including smooth terms.
- family: a family object specifying the distribution and link to use in fitting, as in glm.
- data: an optional data frame, list or environment.
- weights: optional regression-style weights for each case.
- subset: an optional subset of the data to be used.
- na.action: the regular model treatment of missing data.
- offset: used to supply a model offset for use in fitting.
- control: control parameters, see gam.control.

31 Full Syntax for gam

- method: smoothing parameter estimation method. "GCV.Cp" uses GCV for unknown scale parameter and Mallows' Cp/UBRE/AIC for known scale. "GACV.Cp" is equivalent, but uses GACV in place of GCV. "REML" gives REML estimation, including of unknown scale; "P-REML" gives REML estimation, but using a Pearson estimate of the scale. "ML" and "P-ML" are similar, but use maximum likelihood in place of REML.
- optimizer: "perf" for performance iteration, "outer" for the more stable direct approach. "outer" can use several alternative optimizers, specified in the second element of optimizer: "newton" (default), "bfgs", "optim", "nlm" and "nlm.fd" (slow).
- scale: positive values for the scale parameter, negative for unknown; zero means 1 for Poisson and binomial and unknown for other distributions.
- select: if TRUE then the fit can add an extra penalty to each term.
- knots: list containing user specified knot values (must match the k value).
- sp: supplied smoothing parameter vector, in the order that the smooth terms appear in the model formula; negative elements indicate that the parameter should be estimated.
- min.sp: lower bounds for smoothing parameters.
- H: user supplied fixed quadratic penalty on the parameters, often a ridge penalty.

32 Full Syntax for gam

- gamma: multiplier to inflate the model d.f. in the GCV or UBRE/AIC score.
- fit: if TRUE then the model is fit; if FALSE then the model is set up and an object G containing what would be required to fit is returned.
- paraPen: optional list specifying any penalties to be applied to parametric model terms.
- G: object returned by a previous call to gam with fit=FALSE.
- in.out: optional list for initializing outer iteration.

33 Terrorism Data Analysis

Source: The International Policy Institute for Counter-Terrorism, Herzlia, Israel.

Provided on an online database with details of attacks in Israel since September,

Subsetted by Mark Harrison to give 103 suicide attacks over a three-year period from November 6, 2000 to November 3, 2003, when there was a steep drop.

Information provided: date and place of the attack, attack type, the type of target and device employed, organizational affiliation of the attacker, and the number of casualties, along with a written description of the attack.

Casualties are given personal attributes such as name, age, sex, nationality, and religion.
