More advanced use of mgcv. Simon Wood Mathematical Sciences, University of Bath, U.K.


Fine control of smoothness: gamma

Suppose that we fit a model but a component is too wiggly. For GCV/AIC we can increase the cost of degrees of freedom to be more BIC like, i.e. multiply the EDF by log(n)/2...

library(mgcv); library(MASS) ## mgcv for gam; MASS for the mcycle data (and mvrnorm, used later)
b <- gam(accel ~ s(times,k=40), data=mcycle)
plot(b, residuals=TRUE, pch=19, cex=.5) ## Too wiggly!!

gamma <- log(nrow(mcycle))/2 ## BIC like penalization
## gamma multiplies the EDF in the GCV or AIC score...
b <- gam(accel ~ s(times,k=40), data=mcycle, gamma=gamma)
plot(b, residuals=TRUE, pch=19, cex=.5)

[Figure: left panel, s(times,11.3) against times (default fit); right panel, s(times,9.13) against times (gamma penalized fit).]
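To be concrete about what gamma does (roughly, as a worked note rather than the exact internal formula): mgcv's GCV score is

   GCV = n*D / (n - gamma*EDF)^2

where D is the deviance, and in the AIC/UBRE case the usual penalty 2*EDF becomes 2*gamma*EDF. Setting gamma = log(n)/2 therefore makes each effective degree of freedom cost about log(n), as in BIC, rather than 2.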

Fine control of smoothness: sp

Alternatively the sp argument to gam (or to individual s terms) can be used to fix some smoothing parameters.

## try adaptive...
b <- gam(accel ~ s(times,bs="ad"), data=mcycle)
plot(b, residuals=TRUE, pch=19, cex=.5) ## not adaptive enough?!

## Decrease some elements of sp. sp[i] < 0 => estimate sp[i]...
b <- gam(accel ~ s(times,bs="ad"), data=mcycle, sp=c(-1,1e-5,1e-5,-1,-1))
plot(b, residuals=TRUE, pch=19, cex=.5) ## hmm!

[Figure: left panel, s(times,8.75) against times (all sp estimated); right panel, s(times,8.7) against times (some sp fixed).]
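To see why sp needs five elements here, it can help to inspect the fitted smoothing parameters first (a usage sketch, reusing the adaptive fit above):

b <- gam(accel ~ s(times,bs="ad"), data=mcycle)
b$sp          ## the estimated smoothing parameters, one per penalty
length(b$sp)  ## an adaptive smooth has several penalties; 5 here, matching the sp vector above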

Posterior inference

Suppose that we want to make inferences about some non-linear functional of the model. Simulation from the distribution of β|y is the answer. For example we might be interested in the trough to peak difference in the mcycle model just fitted.

pd <- data.frame(times=seq(10,40,length=100))
Xp <- predict(b, pd, type="lpmatrix") ## maps coefs to fitted curve
beta <- coef(b); Vb <- vcov(b)        ## posterior mean and cov of coefs
n <- 10000
br <- mvrnorm(n, beta, Vb)            ## simulate n replicate coef vectors from posterior
a.range <- rep(NA, n)
for (i in 1:n) { ## loop to get trough to peak diff for each sim
  pred.a <- Xp %*% br[i,]                  ## curve for this replicate
  a.range[i] <- max(pred.a) - min(pred.a)  ## range for this curve
}
quantile(a.range, c(.025,.975)) ## get 95% CI for the trough to peak difference
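The loop is clear, but as a minor aside the same computation can be vectorised (a sketch reusing Xp and br from above):

## Equivalent, loop-free computation of the trough to peak differences
pred <- Xp %*% t(br)  ## each column is one simulated curve
a.range2 <- apply(pred, 2, function(p) max(p) - min(p))
quantile(a.range2, c(.025,.975))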

Posterior simulation versus bootstrapping

Posterior simulation is very quick. It is much more efficient than bootstrapping. In any case bootstrapping is problematic...

1. For parametric bootstrapping the smoothing bias causes problems: the model simulated from is biased, and the fits to the samples will be yet more biased.
2. For non-parametric case-resampling the presence of replicate copies of the same data causes undersmoothing, especially with GCV based smoothness selection.

An objection to posterior simulation is that we condition on the estimated λ. This is fixable, by simulating replicate λ vectors and then simulating β vectors from the distribution implied by each λ, but in practice it usually adds little.

by variables

mgcv allows smooths to interact with simple parametric terms and factor variables, using the by argument to s and te.

Starting with metric by variables, consider the model

   y_i = α + f(t_i) x_i + ε_i

where f is a smooth function. gam(y ~ s(t,by=x)) would fit this (smooth not centered). No extra theory is required: gam just has to multiply each element of the ith row of the model matrix for f(t_i) by x_i, for each i, and everything else is unchanged.

Such models are sometimes called varying coefficient models. The idea is that the linear regression coefficient for x_i varies smoothly with t_i. When the smooth term is a function of location, the models are known as geographic regression models.
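A minimal simulated sketch of a varying coefficient fit (the variable names here are illustrative, not from the slides):

## Sketch: varying coefficient model on simulated data
library(mgcv)
set.seed(3)
n <- 300
t0 <- runif(n); x <- rnorm(n)
y <- 1 + sin(2*pi*t0)*x + rnorm(n)*.3  ## true coefficient of x is sin(2*pi*t0)
b <- gam(y ~ s(t0,by=x))
plot(b)  ## estimate of f(t0), the smoothly varying coefficient of x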

Example: geographic regression

The dataframe mack contains data from a fisheries survey sampling mackerel eggs.

[Figure: map of the mackerel egg survey data, plotted by lon and lat.]

One model is that egg densities are determined by a quadratic in (transformed) sea bed depth, but that the coefficients of this quadratic vary with location...

Mackerel model fit

The model can be fitted by

library(gamair); data(mack); data(coast) ## mack and coast data are from the gamair package
mack$log.na <- log(mack$net.area)
mack$t.bd <- (mack$b.depth)^.5
b <- gam(egg.count ~ offset(log.na) + s(lon,lat) + s(lon,lat,by=t.bd) +
         s(lon,lat,by=I(t.bd^2)),
         data=mack, family=Tweedie(p=1.1,link="log"), method="ML")
for (i in 1:3) { plot(b,select=i); lines(coast) }

[Figure: three contour maps over lon and lat, panel titles s(lon,lat,5.1), s(lon,lat,3):t.bd and s(lon,lat,1.31):I(t.bd^2), each shown with +/- 1 s.e. bands.]

Concurvity

Notice how uncertain the contours are in the previous model. There is a concurvity problem with the model: the covariate t.bd is itself very well modelled as a smooth function of the other covariates, lon and lat...

> b.conc <- gam(t.bd ~ s(lon,lat,k=50), data=mack, method="ML")
> summary(b.conc)
...
             edf Ref.df   F p-value
s(lon,lat) 46.83   48.8 1.3  <2e-16 ***
...
R-sq.(adj) = 0.885   Deviance explained = 89.4%

Logically this means that all the model terms could be fairly well approximated by smooth functions of location... This is a difficult issue to deal with.
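mgcv also provides a concurvity() function that gives a direct numerical summary of the problem; a minimal usage sketch, applied to the model b fitted above:

## Concurvity measures for each smooth: 0 = no problem, 1 = complete concurvity
concurvity(b, full=TRUE)   ## overall measure per model term
concurvity(b, full=FALSE)  ## pairwise breakdown between terms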

Smooth-factor interactions

Occasionally a smooth-factor interaction is required. The by variable argument to s and te permits this. If d is a factor and x is continuous, then s(x,by=d) creates a separate (centered) smooth of x for each level of d. To force the smooths to all have the same smoothing parameter, set the id argument to something, e.g. s(x,by=d,id=1).

Note that the smooths are all subject to centering constraints. With a metric by variable this would not be the case (unless the by variable is a constant, which it should not be).
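A minimal simulated sketch of both variants (illustrative names, not from the slides):

## Sketch: factor by variables, with and without a shared smoothing parameter
library(mgcv)
set.seed(4)
n <- 300
d <- factor(sample(c("a","b","c"), n, replace=TRUE))
x <- runif(n)
y <- as.numeric(d) * sin(2*pi*x) + rnorm(n)*.3  ## level-dependent effect of x
b1 <- gam(y ~ d + s(x,by=d))       ## separate smoothing parameter per level
b2 <- gam(y ~ d + s(x,by=d,id=1))  ## one shared smoothing parameter

The parametric d term is needed precisely because of the centering constraints just mentioned.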

Factor by example

As an example, recall the discrete Height class version of the trees data. We could try the model

   log E(Volume_i) = f_k(Girth_i)   if tree i is in height class k.

ct5 <- gam(Volume ~ s(Girth,by=Hclass) + Hclass,
           family=Gamma(link=log), data=trees, method="REML")
par(mfrow=c(1,3)); plot(ct5)

[Figure: three panels against Girth, titled s(Girth,1.18):Hclasssmall, s(Girth,1.61):Hclassmedium and s(Girth,1):Hclasslarge.]

Notice that with factor by variables the smooths have centering constraints applied, hence the need for the separate Hclass term in the model.
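The Hclass factor comes from an earlier part of the course that is not reproduced here; a plausible construction (the exact cut points are an assumption on our part) is:

## Hypothetical reconstruction of the height class factor used above
data(trees)
trees$Hclass <- cut(trees$Height,
                    breaks=quantile(trees$Height, c(0, 1/3, 2/3, 1)),
                    labels=c("small","medium","large"), include.lowest=TRUE)
table(trees$Hclass)  ## roughly equal-sized height classes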

The summation convention

s and te smooth terms accept matrix arguments and by variables to implement general Σ_j L_ij f_j terms. If X and L are n x p matrices then s(X,by=L) evaluates

   Σ_k f(X_ik) L_ik   for all i.

For example, consider data y_i ~ Poi where

   log{E(y_i)} = ∫ k_i(x) f(x) dx  ≈  (1/h) Σ_{k=1}^{p} k_i(x_k) f(x_k)

(the x_k are evenly spaced points, with spacing 1/h). Let X_ik = x_k and L_ik = k_i(x_k)/h. The model is fit by

gam(y ~ s(X,by=L), poisson)
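A small simulated sketch of this construction (illustrative names and kernels, not from the slides):

## Sketch: the summation convention on simulated Poisson data
library(mgcv)
set.seed(5)
n <- 200; p <- 50
xk <- seq(0, 1, length=p)          ## p evenly spaced points; here h = p per unit of x
f.true <- sin(2*pi*xk)             ## the smooth we hope to recover
X <- matrix(xk, n, p, byrow=TRUE)  ## row i holds the evaluation points x_k
K <- matrix(rexp(n*p), n, p)       ## row i holds the kernel values k_i(x_k)
L <- K/p                           ## L_ik = k_i(x_k)/h
eta <- drop(L %*% f.true)          ## Σ_k L_ik f(x_k) for each i
y <- rpois(n, exp(eta))
dat <- list(y=y, X=X, L=L)
b <- gam(y ~ s(X,by=L), family=poisson, data=dat)
plot(b)  ## estimate of f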

Summation example: predicting octane

Consider predicting the octane rating of fuel from the near infrared (NIR) spectrum of the fuel.

[Figure: an example spectrum, log(1/R) against wavelength (nm), for a fuel with octane = 85.3.]

There are 60 such spectrum (k_i(x)) - octane (y_i) pairs (x is wavelength), and a model might be

   y_i = α + ∫ f(x) k_i(x) dx + ε_i

where f(x) is a smooth function of wavelength.

Fitting the octane model

The following fits the model

library(pls); data(gasoline); gas <- gasoline
nm <- seq(900, 1700, by=2) ## create wavelength matrix...
gas$nm <- t(matrix(nm, length(nm), length(gas$octane)))
b <- gam(octane ~ s(nm,by=NIR,bs="ad"), data=gas)
plot(b, rug=FALSE, shade=TRUE, main="Estimated function")
plot(fitted(b), gas$octane, ...) ## remaining plot arguments elided on the slide

[Figure: left, the estimated function s(nm,7.9):NIR against wavelength (nm); right, measured octane against fitted values.]

...can predict octane quite well from NIR spectrum.
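A quick numerical check of that claim (in-sample, so optimistic; a usage sketch reusing b and gas from above):

cor(fitted(b), gas$octane)             ## correlation of fitted and measured octane
sqrt(mean((fitted(b) - gas$octane)^2)) ## in-sample root mean square error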

Model selection

Various model selection strategies are possible. Two stand out:

1. Use backward selection, based on GCV, REML or AIC, possibly guided by termwise approximate p-values and plots.
2. Let smoothness selection do all the work, by adding a penalty on the null space of each smooth: gam(...,select=TRUE) does this (see the sketch below).

The second option is nicely consistent with how we select between models of different smoothness, but works the optimizer rather hard.

If hypothesis-testing type selection is desired, approximate p-values can be used, or we can increase gamma so that single term deletion by AIC is equivalent to using a significance level of e.g. 5%, as opposed to the AIC default of about 15%, i.e. set gamma = 3.84/2.

It is rare for fully automatic selection to be fully satisfactory.
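A minimal sketch of option 2 on simulated data (gamSim's example 1, in which x3 has no effect on the response):

## Sketch: null-space penalization with select=TRUE
library(mgcv)
set.seed(6)
dat <- gamSim(1, n=400, dist="normal", scale=2) ## x0-x2 matter, x3 has zero effect
b <- gam(y ~ s(x0) + s(x1) + s(x2) + s(x3), data=dat,
         method="ML", select=TRUE)
summary(b) ## s(x3) should be shrunk to near zero effective degrees of freedom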

Mackerel selection example

As an example, consider the mack data again, but this time we'll use an additive structure, with a number of candidate predictors...

> b <- gam(egg.count ~ offset(log.na) + s(lon,lat,k=100) + s(t.bd) +
+          s(temp.20m) + s(c.dist) + s(temp.surf) + s(vessel,bs="re"),
+          data=mack, family=Tweedie(p=1.3,link="log"),
+          method="ML", select=TRUE)
> b
...
Estimated degrees of freedom:
[six per-term EDFs; those for s(temp.20m) and s(temp.surf) are of order 1e-4]
 total = 64.65149
ML score: 1555.35

So the smooths of temp.20m and temp.surf have been penalized out of the model.

Mackerel selection example continued

Refitting, we have...

b3 <- gam(egg.count ~ offset(log.na) + s(lon,lat,k=100) + s(t.bd) +
          s(c.dist) + s(vessel,bs="re"),
          data=mack, family=Tweedie(p=1.3,link="log"),
          method="ML")
data(coast)
par(mfrow=c(1,3)); plot(b3)

[Figure: three panels: the contour map of s(lon,lat,57.35) with +/- 1 s.e. over lon and lat, and the smooths of t.bd and c.dist (the latter with edf 1, i.e. effectively linear).]
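One way to sanity-check the reduced model against the full one (a usage sketch, not from the slides) is to compare the two fits directly:

## Compare the full (b) and reduced (b3) models; both were fitted by ML above
AIC(b, b3)
anova(b3) ## approximate significance of the retained terms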

Time to stop

Goodbye.