GAMs, GAMMs and other penalized GLMs using mgcv in R. Simon Wood Mathematical Sciences, University of Bath, U.K.


Simple example
Consider a very simple dataset relating the timber volume of cherry trees to their height and trunk girth. [Figure: pairs plot of Girth, Height and Volume for the cherry tree data.]

Trees: initial gam fit
A possible model is

$\log(\mu_i) = f_1(\mathrm{Height}_i) + f_2(\mathrm{Girth}_i), \quad \mathrm{Volume}_i \sim \mathrm{Gamma}(\mu_i, \phi)$

which can be estimated using...

library(mgcv)
ct1 <- gam(Volume ~ s(Height) + s(Girth), family=Gamma(link=log), data=trees)

gam produces a representation for the smooth functions, and estimates them along with their degree of smoothness.

Cherry tree results

par(mfrow=c(1,2))
plot(ct1, ...)  ## various options; calls plot.gam

[Figure: estimated smooths s(Height,1) and s(Girth,2.51) plotted against Height and Girth.]

This is a simple example of a Generalized Additive Model. Given the machinery for fitting GAMs, a wide variety of other models can also be estimated.
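For readers following along in R, a minimal sketch of how one might inspect such a fit (the trees data ships with R; the plotting options shown are just illustrative choices):

summary(ct1)                               ## approximate significance and EDF of each smooth
par(mfrow = c(1, 2))
plot(ct1, residuals = TRUE, shade = TRUE)  ## partial effects with partial residuals
gam.check(ct1)                             ## residual checks and basis dimension diagnostics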

The general model
Let $y$ be a response variable, $f_j$ a smooth function of predictors $x_j$ and $L_{ij}$ a linear functional. A general model...

$g(\mu_i) = A_i\alpha + \sum_j L_{ij} f_j + Z_i b, \quad b \sim N(0, \Sigma_\theta), \quad y_i \sim \mathrm{EF}(\mu_i, \phi)$

$A$ and $Z$ are model matrices, $g$ is a known link function. Popular examples are...

$g(\mu_i) = A_i\alpha + \sum_j f_j(x_{ji})$ (Generalized additive model)
$g(\mu_i) = A_i\alpha + \sum_j f_j(x_{ji}) + Z_i b, \ b \sim N(0, \Sigma_\theta)$ (GAMM)
$g(\mu_i) = \int h_i(x) f(x)\,dx$ (Signal regression model)
$g(\mu_i) = \sum_j f_j(t_{ji}) x_{ji}$ (Varying coefficient model)

The list goes on...

Representing the $f_j$
A convenient model representation arises by approximating each function with a basis expansion

$f(x) = \sum_{k=1}^{K} \gamma_k b_k(x)$   ($b_k$ known, e.g. spline basis functions)

$K$ is set to something generous to avoid bias. Some way of measuring the wiggliness of each $f$ is also useful, for example

$\int f''(x)^2\,dx = \gamma^T S \gamma$.   ($S$ known)

For many models we will need an identifiability constraint, $\sum_i f(x_i) = 0$. Automatic re-parameterization handles this.
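As a concrete illustration, mgcv exposes this basis-plus-penalty construction directly through smoothCon(); a minimal sketch, with illustrative data names:

library(mgcv)
set.seed(1)
dat <- data.frame(x = runif(100))
## rank-10 cubic regression spline basis for x, with the sum-to-zero
## identifiability constraint absorbed into the basis
sm <- smoothCon(s(x, bs = "cr", k = 10), data = dat, absorb.cons = TRUE)[[1]]
dim(sm$X)       ## n x (K-1) model matrix of basis functions b_k(x_i)
dim(sm$S[[1]])  ## matching penalty matrix S, so wiggliness = gamma' S gamma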

Basis & penalty eample y 0.0 0.2 0.4 0.6 0.8 1.0 basis functions y 2 0 2 4 6 8 10 f"()2 d = 377000 0.0 0.2 0.4 0.6 0.8 1.0 0.0 0.2 0.4 0.6 0.8 1.0 y 2 0 2 4 6 8 10 f"()2 d = 63000 y 2 0 2 4 6 8 10 f"()2 d = 200 0.0 0.2 0.4 0.6 0.8 1.0 0.0 0.2 0.4 0.6 0.8 1.0

Representing the model
The general model is

$g(\mu_i) = A_i\alpha + \sum_j L_{ij} f_j + Z_i b, \quad b \sim N(0, \Sigma_\theta), \quad y_i \sim \mathrm{EF}(\mu_i, \phi)$

Given a basis for each $f_j$, we can write $\sum_j L_{ij} f_j = B_i\gamma$, where $B$ is determined by the basis functions and the $L_{ij}$, while $\gamma$ is a vector containing all the basis coefficients. Then the model becomes...

$g(\mu_i) = A_i\alpha + B_i\gamma + Z_i b, \quad b \sim N(0, \Sigma_\theta), \quad y_i \sim \mathrm{EF}(\mu_i, \phi)$

Problem: if the bases were rich enough to give negligible bias, then MLE will overfit.

Estimation strategies
1. Control prediction error. Penalize the model likelihood, using the $f_j$ wiggliness measures, $\sum_j \lambda_j\gamma^T S_j\gamma$, to control model prediction error. Given $\lambda, \theta, \phi$, this yields the maximum penalized likelihood estimates (MPLE) $\hat\alpha, \hat\gamma, \hat b$. GCV or similar gives $\hat\lambda, \hat\theta, \hat\phi$.
2. Empirical Bayes. Put exponential priors on function wiggliness, so that $\gamma \sim N\big(0, (\sum_j \lambda_j S_j/\phi)^{-}\big)$. Given $\lambda, \theta, \phi$ the MAP estimates $\hat\alpha, \hat\gamma, \hat b$ are found by MPLE as under 1. Maximize the marginal likelihood (REML) to obtain $\hat\lambda, \hat\theta, \hat\phi$.
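In mgcv these two strategies correspond to the method argument of gam(); a minimal sketch reusing the cherry tree model from earlier:

## prediction-error criterion (GCV) vs marginal likelihood (REML)
ct_gcv  <- gam(Volume ~ s(Height) + s(Girth), family = Gamma(link = log),
               data = trees, method = "GCV.Cp")
ct_reml <- gam(Volume ~ s(Height) + s(Girth), family = Gamma(link = log),
               data = trees, method = "REML")
ct_gcv$sp; ct_reml$sp   ## the selected smoothing parameters, lambda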

Maximum Penalized Likelihood Estimation
Define the coefficient vector $\beta^T = (\alpha^T, \gamma^T, b^T)$. The deviance is $D(\beta) = 2\{l_{\max} - l(\beta)\}\phi$. $\beta$ is estimated by

$\hat\beta = \arg\min_\beta\; D(\beta) + \sum_j \lambda_j \beta^T S_j\beta + \beta^T \tilde\Sigma_\theta\beta\,\phi$

$\tilde\Sigma_\theta$ is a zero padded version of $\Sigma_\theta^{-1}$, so that $\beta^T\tilde\Sigma_\theta\beta = b^T\Sigma_\theta^{-1} b$. Similarly the $S_j$ here are zero padded versions of the originals. In practice this optimization is by penalized IRLS.

Estimating λ etc.
We treat $\hat\beta$ as a function of $\lambda, \theta, \phi$ for GCV/REML optimization. GCV or REML are optimized numerically, with each step requiring evaluation of the corresponding $\hat\beta$. Note that computationally the empirical Bayes approach is equivalent to generalized linear mixed modelling. However, in most applications it is not appropriate to treat the smooth functions as random quantities to be resampled from the prior at each replication of the data set. So the models are not frequentist random effects models.

More inference and degrees of freedom
The Bayesian approach yields the large sample result $\beta \sim N(\hat\beta, V_\beta)$ where $V_\beta$ is a penalized version of the information matrix at $\hat\beta$. Via an approach pioneered in Nychka (1988, JASA) it is possible to show that CIs based on this result have good frequentist properties. Effective degrees of freedom can also be computed. Let $\tilde\beta$ denote the unpenalized MLE of $\beta$; then

$\mathrm{EDF} = \sum_j \partial\hat\beta_j / \partial\tilde\beta_j$
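A minimal sketch of how these quantities can be pulled out of a fitted gam object (ct1 from the cherry tree example):

Vb <- vcov(ct1)        ## Bayesian posterior covariance matrix V_beta
se <- sqrt(diag(Vb))   ## coefficient standard errors used for the intervals
sum(ct1$edf)           ## total effective degrees of freedom of the model
summary(ct1)$edf       ## EDF of each smooth term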

Smooths for semi-parametric GLMs
To build adequate semi-parametric GLMs requires that we use functions with appropriate properties. In one dimension there are several alternatives, and not a lot to choose between them. In 2 or more dimensions there is a major choice to make. If the arguments of the smooth function are variables which all have the same units (e.g. spatial location variables) then an isotropic smooth may be appropriate. This will tend to exhibit the same degree of flexibility in all directions. If the relative scaling of the covariates of the smooth is essentially arbitrary (e.g. they are measured in different units), then scale invariant smooths should be used, which do not depend on this relative scaling.

Splines
Many smooths covered here are based on splines. Here's the original underlying idea... [Figure: wear plotted against size, with a spline curve through the data.] Mathematically the red curve is the function minimizing

$\sum_i (y_i - f(x_i))^2 + \lambda\int f''(x)^2\,dx.$

Splines have variable stiffness
Varying the flexibility of the strip (i.e. varying $\lambda$) changes the spline curve. [Figure: four fits to the wear vs. size data at different values of $\lambda$.] But irrespective of $\lambda$ the spline functions always have the same basis.

Penalized regression splines
Full splines have one basis function per data point. This is computationally wasteful, when penalization ensures that the effective degrees of freedom will be much smaller than this. Penalized regression splines simply use fewer spline basis functions. There are two alternatives:
1. Choose a representative subset of your data (the "knots"), and create the spline basis as if smoothing only those data. Once you have the basis, use it to smooth all the data.
2. Choose how many basis functions are to be used and then solve the problem of finding the set of this many basis functions that will optimally approximate a full spline.
I'll refer to 1 as knot based and 2 as eigen based.
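In mgcv the basis dimension is set through the k argument of s(), and knot locations can be supplied explicitly; a minimal sketch, with illustrative simulated data:

set.seed(2)
dat <- data.frame(x = runif(200))
dat$y <- sin(3 * pi * dat$x) + rnorm(200, sd = 0.3)
## rank-15 knot-based cubic regression spline, knots at chosen quantiles of x
b1 <- gam(y ~ s(x, bs = "cr", k = 15), data = dat,
          knots = list(x = quantile(dat$x, probs = seq(0, 1, length = 15))))
## rank-15 eigen-based thin plate regression spline: no knots need be chosen
b2 <- gam(y ~ s(x, bs = "tp", k = 15), data = dat)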

Knot based example: "cr"
In mgcv the "cr" basis is a knot based approximation to the minimizer of

$\sum_i (y_i - f(x_i))^2 + \lambda\int f''(x)^2\,dx$

i.e. a cubic spline. "cc" is a cyclic version. [Figure: the full cubic spline basis, a smaller regression spline basis, the data to smooth, and the resulting function estimates (full spline in black, regression spline in red).]

Eigen based example: "tp"
The "tp", thin plate regression spline, basis is an eigen approximation to a thin plate spline (including a cubic spline in 1 dimension). [Figure: the full spline basis, the thin plate regression spline basis, the data to smooth, and the resulting function estimates (full spline in black, regression spline in red).]

1 dimensional smoothing in mgcv
Various 1D smoothers are available in mgcv...
"cr" knot based cubic regression spline.
"cc" cyclic version of the above.
"ps" Eilers and Marx style P-splines: evenly spaced B-spline basis, with a discrete penalty on the coefficients.
"ad" adaptive smoother in which the strength of the penalty varies with the covariate.
"tp" thin plate regression spline. Optimal low rank eigen approximation to a full spline: flexible order of penalty derivative.
Smooth classes can be added (?smooth.construct).

1D smooths compared
[Figure: motorcycle crash acceleration data (accel vs. times) smoothed with the "tp", "cr", "ps", "cc" and "ad" bases.]
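A minimal sketch of how such a comparison might be run, using the mcycle data from the MASS package (the bases shown are an illustrative subset):

library(mgcv); library(MASS)
data(mcycle)
m_tp <- gam(accel ~ s(times, bs = "tp", k = 40), data = mcycle, method = "REML")
m_cr <- gam(accel ~ s(times, bs = "cr", k = 40), data = mcycle, method = "REML")
m_ps <- gam(accel ~ s(times, bs = "ps", k = 40), data = mcycle, method = "REML")
m_ad <- gam(accel ~ s(times, bs = "ad", k = 40), data = mcycle, method = "REML")
## overlay the fitted curves on the data
plot(accel ~ times, data = mcycle)
lines(mcycle$times, fitted(m_tp), col = 2)
lines(mcycle$times, fitted(m_cr), col = 3)
lines(mcycle$times, fitted(m_ps), col = 4)
lines(mcycle$times, fitted(m_ad), col = 5)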

Isotropic smooths
One way of generalizing splines from 1D to several dimensions is to turn the flexible strip into a flexible sheet (hyper sheet). This results in a thin plate spline. It is an isotropic smooth. Isotropy may be appropriate when different covariates are naturally on the same scale. In mgcv, terms like s(x,z) generate such smooths. [Figure: perspective plots of an isotropic thin plate spline smooth at three levels of smoothing.]

Thin plate spline details
In 2 dimensions a thin plate spline is the function minimizing

$\sum_i \{y_i - f(x_i, z_i)\}^2 + \lambda\int f_{xx}^2 + 2 f_{xz}^2 + f_{zz}^2 \,dx\,dz$

This generalizes to any number of dimensions, $d$, and any order of differential, $m$, such that $2m > d + 1$. A full thin plate spline has $n$ unknown coefficients. An optimal low rank eigen approximation is a thin plate regression spline. A t.p.r.s. uses far fewer coefficients than a full spline, and is therefore computationally efficient, while losing little statistical performance.
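A minimal sketch of fitting a 2D thin plate regression spline to simulated data (the true surface and names are illustrative):

set.seed(3)
n <- 400
dat <- data.frame(x = runif(n), z = runif(n))
dat$y <- sin(2 * pi * dat$x) * cos(2 * pi * dat$z) + rnorm(n, sd = 0.2)
## s(x, z) gives an isotropic thin plate regression spline by default (bs = "tp")
m_iso <- gam(y ~ s(x, z, k = 60), data = dat, method = "REML")
vis.gam(m_iso, view = c("x", "z"), plot.type = "persp")  ## perspective plot of the fit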

Scale invariant smoothing: tensor product smooths
Isotropic smooths assume that a unit change in one variable is equivalent to a unit change in another variable, in terms of function variability. When this is not the case, isotropic smooths can be poor. Tensor product smooths generalize from 1D to several dimensions using a lattice of bendy strips, with different flexibility in different directions. [Figure: perspective plot of a tensor product smooth f(x,z).]

Tensor product smooths
Carefully constructed tensor product smooths are scale invariant. Consider constructing a smooth of $x, z$. Start by choosing marginal bases and penalties, as if constructing 1-D smooths of $x$ and $z$, e.g.

$f_x(x) = \sum_i \alpha_i a_i(x), \quad f_z(z) = \sum_j \beta_j b_j(z)$
$J_x(f_x) = \int f_x''(x)^2\,dx = \alpha^T S_x\alpha \quad \& \quad J_z(f_z) = \beta^T S_z\beta$

Marginal reparameterization
Suppose we start with $f_z(z) = \sum_{j=1}^{6}\beta_j b_j(z)$, on the left. [Figure: the marginal basis before and after reparameterization.] We can always re-parameterize so that its coefficients are function heights at the knots (right). Do the same for $f_x$.

Making $f_z$ depend on $x$
We can make $f_z$ a function of $x$ by letting its coefficients vary smoothly with $x$. [Figure: the marginal smooth of $z$ and the resulting surface $f(x,z)$.]

The complete tensor product smooth
Use the $f_x$ basis to let the $f_z$ coefficients vary smoothly with $x$ (left). The construction is symmetric in $x$ and $z$ (see right). [Figure: two perspective views of the tensor product surface $f(x,z)$.]

Tensor product penalties - one per dimension
$x$-wiggliness: sum marginal $x$ penalties over the red curves. $z$-wiggliness: sum marginal $z$ penalties over the green curves. [Figure: the tensor product surface $f(x,z)$ with the red and green marginal curves marked.]

Isotropic vs. tensor product comparison
[Figure: isotropic thin plate spline fits (left column) and tensor product spline fits (right column), top and bottom rows.] Each figure smooths the same data. The only modification is that $x$ has been divided by 5 in the bottom row.
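A minimal sketch of that comparison in R, reusing the simulated data from the thin plate example above (rescaling one covariate and refitting; names illustrative):

dat2 <- dat
dat2$x <- dat$x / 5                                                  ## change the units of x only
m_iso2 <- gam(y ~ s(x, z, k = 60), data = dat2, method = "REML")     ## isotropic: fit changes
m_te   <- gam(y ~ te(x, z), data = dat,  method = "REML")            ## tensor product
m_te2  <- gam(y ~ te(x, z), data = dat2, method = "REML")            ## scale invariant: same fit
max(abs(fitted(m_te) - fitted(m_te2)))   ## essentially zero, up to numerical tolerance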

Soap film smoothing
Sometimes we require a smoother defined over a bounded region, where it is important not to smooth across boundary features. A useful smoother can be constructed by considering a distorted soap film suspended from a flexible boundary wire. [Figure: soap film smooths over a bounded, horseshoe shaped domain.]

Markov random fields
Sometimes data are associated with discrete geographic regions. A neighbourhood structure on these regions can be used to define a Markov random field, which induces a simple penalty structure on region specific coefficients. [Figure: map of the estimated region specific effects, s(district,2.71).]
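A minimal sketch of an MRF smooth in mgcv, using the Columbus crime data and district polygons that, to my knowledge, ship with the package:

library(mgcv)
data(columb)        ## crime rates by district
data(columb.polys)  ## district boundary polygons, defining the neighbourhood structure
m_mrf <- gam(crime ~ s(district, bs = "mrf", xt = list(polys = columb.polys)),
             data = columb, method = "REML")
plot(m_mrf)         ## maps the estimated district effects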

0 dimensional smooths
Having considered smoothing with respect to 1, 2 and several covariates, or a neighbourhood structure, consider the case where we want to smooth without reference to a covariate (or graph). This amounts to simply shrinking some model coefficients towards zero, as in ridge regression. In the context of penalized GLMs such zero dimensional smooths are equivalent to simple Gaussian random effects.
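In mgcv such terms are available as bs = "re" smooths; a minimal sketch with a simulated grouping factor (names illustrative):

set.seed(4)
n <- 200
fac <- factor(sample(letters[1:10], n, replace = TRUE))
x <- runif(n)
b <- rnorm(10)                                           ## true group effects
y <- sin(2 * pi * x) + b[as.integer(fac)] + rnorm(n, sd = 0.3)
## s(fac, bs = "re") is a simple Gaussian random intercept for each level of fac
m_re <- gam(y ~ s(x) + s(fac, bs = "re"), method = "REML")
gam.vcomp(m_re)   ## variance components, including the random effect standard deviation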

Specifying GAMs in R with mgcv
library(mgcv) loads a semi-parametric GLM package. gam(formula,family) is quite like glm. The family argument specifies the distribution and link function, e.g. Gamma(log). The response variable and linear predictor structure are specified in the model formula. Response and parametric terms are written exactly as for lm or glm. Smooth functions are specified by s or te terms. e.g.

$\log\{E(y_i)\} = \alpha + \beta x_i + f(z_i), \quad y_i \sim \mathrm{Gamma},$

is specified by...

gam(y ~ x + s(z), family=Gamma(link=log))

More specification: by variables
Smooth terms can accept a "by" variable argument, which allows $L_{ij} f_j$ terms other than just $f_j(x_i)$. e.g.

$E(y_i) = f_1(x_i) w_i + f_2(v_i), \quad y_i \sim \mathrm{Gaussian},$

becomes gam(y ~ s(x, by=w) + s(v)), i.e. $f_1(x_i)$ is multiplied by $w_i$ in the linear predictor. e.g. $E(y_i) = f_j(x_i, z_i)$ if factor $g_i$ is of level $j$, becomes gam(y ~ s(x, z, by=g) + g), i.e. there is one smooth function for each level of the factor variable g, with each $y_i$ depending on just one of these functions.
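A minimal sketch of a factor "by" smooth on simulated data (names illustrative):

set.seed(5)
n <- 300
g <- factor(sample(c("A", "B"), n, replace = TRUE))
x <- runif(n)
mu <- ifelse(g == "A", sin(2 * pi * x), cos(2 * pi * x))
y <- mu + rnorm(n, sd = 0.3)
## one smooth of x per level of g, plus a separate intercept for each level
m_by <- gam(y ~ s(x, by = g) + g, method = "REML")
plot(m_by, pages = 1)   ## shows the two level-specific smooths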

Yet more specification: a summation convention
s and te smooth terms accept matrix arguments and "by" variables, to implement general $L_{ij} f_j$ terms. If X and L are $n \times p$ matrices then s(X, by=L) evaluates $L_{ij} f_j = \sum_k f(X_{ik}) L_{ik}$ for all $i$. For example, consider data $y_i \sim \mathrm{Poi}$ where

$\log\{E(y_i)\} = \int k_i(x) f(x)\,dx \approx \frac{1}{h}\sum_{k=1}^{p} k_i(x_k) f(x_k)$

(the $x_k$ are evenly spaced points). Let $X_{ik} = x_k$ and $L_{ik} = k_i(x_k)/h$. The model is fit by

gam(y ~ s(X, by=L), poisson)
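A minimal sketch of such a linear functional (signal regression) term on simulated data, following the construction above; the kernels and the true coefficient function are illustrative:

set.seed(6)
n <- 200; p <- 50
xk <- seq(0, 1, length.out = p)              ## evenly spaced evaluation points
h  <- p                                      ## so the spacing is roughly 1/h
f_true <- function(x) sin(2 * pi * x)        ## the (unknown) coefficient function
X <- matrix(xk, n, p, byrow = TRUE)          ## X[i,k] = x_k for every observation
K <- matrix(rgamma(n * p, 2, 2), n, p)       ## k_i(x_k): illustrative signal "kernels"
L <- K / h                                   ## quadrature weights L[i,k] = k_i(x_k)/h
eta <- rowSums(L * matrix(f_true(xk), n, p, byrow = TRUE))  ## (1/h) sum_k k_i(x_k) f(x_k)
y <- rpois(n, exp(eta))
m_sig <- gam(y ~ s(X, by = L), family = poisson, data = list(y = y, X = X, L = L))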