GAMs: semi-parametric GLMs. Simon Wood, Mathematical Sciences, University of Bath, U.K.


Generalized linear models, GLM
1. A GLM models a univariate response, y_i, as g{E(y_i)} = X_i β, where y_i ~ Exponential family.
2. g is a known, smooth, monotonic link function.
3. X_i is the ith row of a known model matrix, which depends on measured predictor variables (covariates).
4. β is an unknown parameter vector, estimated by MLE.
5. Xβ (= η) is the linear predictor of the model.
6. The class includes log-linear models, general linear regression, logistic regression, ...
7. glm in R implements this class.

Generalized additive models, GAM
1. A GAM (semi-parametric GLM) is a GLM whose linear predictor depends linearly on unknown smooth functions.
2. In general g{E(y_i)} = A_i θ + Σ_j L_ij f_j, where y_i ~ Exponential family.
3. The parameters, θ, and smooth functions, f_j, are unknown.
4. The parametric model matrix, A, and the linear functionals, L_ij, depend on predictor variables.
5. Examples of L_ij f_j: f_1(x_i), f_2(z_i) w_i, ∫ k_i(v) f_j(v) dv, ...

Specifying GAMs in R with mgcv
library(mgcv) loads a semi-parametric GLM package. gam(formula,family) is used much like glm. The family argument specifies the distribution and link function, e.g. Gamma(link=log). The response variable and linear predictor structure are specified in the model formula: the response and parametric terms are written exactly as for lm or glm, while smooth functions are specified by s or te terms. For example,
log{E(y_i)} = α + β x_i + f(z_i), y_i ~ Gamma,
is specified by
gam(y ~ x + s(z), family=Gamma(link=log))
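As a minimal illustration of such a call (the simulated data and parameter values below are my own, not from the slides):

library(mgcv)
set.seed(1)
n <- 200
x <- runif(n); z <- runif(n)
mu <- exp(0.5 + 1.2*x + sin(2*pi*z))     # true mean on the response scale
y <- rgamma(n, shape=5, scale=mu/5)      # Gamma response with E(y) = mu
b <- gam(y ~ x + s(z), family=Gamma(link=log))
summary(b)   # parametric coefficients and smooth term summaries
plot(b)      # estimated smooth f(z) with a confidence band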

More specification: by variables
Smooth terms can accept a by variable argument, which allows L_ij f_j terms other than just f_j(x_i).
e.g. E(y_i) = f_1(z_i) w_i + f_2(v_i), y_i ~ Gaussian, becomes
gam(y ~ s(z,by=w) + s(v))
i.e. f_1(z_i) is multiplied by w_i in the linear predictor.
e.g. E(y_i) = f_j(x_i, z_i) if factor g_i is of level j, becomes
gam(y ~ s(x,z,by=g) + g)
i.e. there is one smooth function for each level of the factor variable g, with each y_i depending on just one of these functions.
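A quick sketch of both constructions, using simulated data of my own (the names w, v and g are purely illustrative):

library(mgcv)
set.seed(2)
n <- 300
z <- runif(n); v <- runif(n); w <- runif(n, 0.5, 1.5)
y1 <- sin(2*pi*z)*w + cos(2*pi*v) + rnorm(n)*0.2
b1 <- gam(y1 ~ s(z, by=w) + s(v))      # f1(z) multiplied by w, plus f2(v)

x <- runif(n); g <- factor(sample(1:3, n, replace=TRUE))
y2 <- as.numeric(g) * sin(2*pi*x) * cos(2*pi*z) + rnorm(n)*0.2
b2 <- gam(y2 ~ s(x, z, by=g) + g)      # one 2-D smooth of (x,z) per level of g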

Yet more specification: a summation convention
s and te smooth terms accept matrix arguments and by variables to implement general L_ij f_j terms.
If X and L are n × p matrices then s(X,by=L) evaluates L_ij f_j = Σ_k f(X_ik) L_ik, for all i.
For example, consider data y_i ~ Poi where
log{E(y_i)} = ∫ k_i(x) f(x) dx ≈ (1/h) Σ_{k=1}^p k_i(x_k) f(x_k)
(the x_k being evenly spaced points). Let X_ik = x_k (for all i) and L_ik = k_i(x_k)/h. The model is fit by
gam(y ~ s(X,by=L), poisson)
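A rough sketch of this kind of functional-predictor model, under my own choice of kernel, grid and true f (the gist only, not a definitive recipe):

library(mgcv)
set.seed(3)
n <- 100; p <- 50
xk <- seq(0, 1, length=p)                 # evenly spaced evaluation points
h  <- 1/(xk[2] - xk[1])                   # points per unit length (assumption)
f  <- function(x) sin(2*pi*x)             # 'true' smooth to be recovered
ti <- runif(n)                            # kernel centres, one per observation
K  <- outer(ti, xk, function(t, x) dnorm(x, mean=t, sd=0.1))  # k_i(x_k)
eta <- drop(K %*% f(xk))/h                # approximate integral = linear predictor
y  <- rpois(n, exp(eta))
X  <- matrix(xk, n, p, byrow=TRUE)        # X[i,k] = x_k for all i
L  <- K/h                                 # L[i,k] = k_i(x_k)/h
b  <- gam(y ~ s(X, by=L), family=poisson) # summation convention fit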

How to estimate GAMs?
The gam calls on the previous slides would also estimate the specified models, and it's useful to have some idea how. We need to estimate the parameters, θ, and the smooth functions, f_j. This includes estimating how smooth the f_j are.
To begin with we need to decide on two things:
1. How to represent the f_j by something computable.
2. How to formalize what is meant by smooth.
Here we'll discuss only the approach based on representing the f_j with penalized regression splines (as in the R package mgcv).

Bases and penalties for the f_j
Represent each f as a weighted sum of known basis functions, b_k:
f(x) = Σ_{k=1}^K γ_k b_k(x)
Now only the γ_k are unknown. Spline function theory supplies good choices for the b_k. K is chosen large enough to be sure that bias(f̂) ≪ var(f̂).
Choose a measure of function wiggliness, e.g.
∫ f″(x)² dx = γᵀ S γ, where S_ij = ∫ b″_i(x) b″_j(x) dx.
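In mgcv the basis and penalty for a single smooth can be inspected directly via smoothCon; a small sketch (the basis dimension and data are arbitrary choices of mine):

library(mgcv)
x <- seq(0, 1, length=200)
sm <- smoothCon(s(x, k=10, bs="cr"), data=data.frame(x=x))[[1]]
X <- sm$X        # 200 x 10 matrix of basis functions b_k evaluated at the data
S <- sm$S[[1]]   # 10 x 10 penalty matrix, so wiggliness is gamma' S gamma
matplot(x, X, type="l")   # plot the basis functions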

Basis & penalty example
[Figure: a set of spline basis functions, and three curves fitted to the same data with ∫ f″(x)² dx = 377000, 63000 and 200 respectively, illustrating decreasing wiggliness.]

A computable GAM
Representing each f_j with basis functions b_jk we have
g{E(y_i)} = A_i θ + Σ_j L_ij f_j = A_i θ + Σ_j Σ_k γ_jk L_ij b_jk = X_i β
where X contains A and the L_ij b_jk, while β contains θ and the γ_j vectors.
For notational convenience we can re-write the wiggliness penalties as γ_jᵀ S_j γ_j = βᵀ S_j β, where the second S_j is just the first padded with extra zeroes.
So the GAM has become a parametric GLM, with some associated penalties.
The L_ij f_j are often confounded via the intercept, so the f_j are usually constrained to sum to zero over the observed covariate values. A reparameterization achieves this.

Estimating the model
If the bases are large enough to be sure of avoiding bias, then MLE of the model will overfit/undersmooth... so we use maximum penalized likelihood estimation:
minimize −2 l(β) + Σ_j λ_j βᵀ S_j β w.r.t. β   (1)
where l is the log likelihood. The penalty λ_j βᵀ S_j β forces f_j to be smooth. How smooth is controlled by λ_j, which must be chosen.
For now, suppose that an angel has revealed values for λ in a dream. We'll eliminate the angel later.
Given λ, (1) is optimized by a penalized version of the IRLS algorithm used for GLMs.
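For the Gaussian identity-link case, penalized IRLS reduces to penalized least squares, which is easy to sketch by hand (basis and penalty obtained via smoothCon as above; the fixed λ value is arbitrary, standing in for the angel):

library(mgcv)
set.seed(4)
x <- runif(200); y <- sin(2*pi*x) + rnorm(200)*0.3
sm <- smoothCon(s(x, k=20, bs="cr"), data=data.frame(x=x))[[1]]
X <- sm$X; S <- sm$S[[1]]
lambda <- 0.01                                        # 'revealed by an angel'
beta <- solve(t(X) %*% X + lambda * S, t(X) %*% y)    # penalized least squares
fhat <- X %*% beta                                    # fitted smooth at the data
plot(x, y); points(x, fhat, col=2, pch=20)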

Bayesian motivation
We can be more principled about motivating (1). Suppose that our prior belief is that smooth models are more probable than wiggly ones. We can formalize this belief with an exponential prior on wiggliness,
exp(−½ Σ_j λ_j βᵀ S_j β),
i.e. an improper Gaussian prior β ~ N(0, (Σ_j λ_j S_j)⁻).
In this case the posterior mode, β̂, of β|y is the minimizer of (1).

The distribution of β|y
The Bayesian approach gives us more... For any exponential family distribution, var(y_i) = φ V(µ_i), where µ_i = E(y_i). Let W_ii⁻¹ = V(µ_i) g′(µ_i)². The Bayesian approach gives the large sample result
β|y ~ N(β̂, (XᵀWX + Σ_j λ_j S_j)⁻¹ φ).
This can be used to get CIs with good frequentist properties (by an argument of Nychka, 1988, JASA). Simulation from β|y is a very cheap way of making any further inferences required. mgcv uses this result for inference (e.g. ?vcov.gam).
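A sketch of posterior simulation from β|y for a fitted gam (the model and prediction grid are my own placeholders; vcov.gam returns this Bayesian covariance matrix by default):

library(mgcv); library(MASS)
set.seed(5)
x <- runif(300); y <- sin(2*pi*x) + rnorm(300)*0.3
b  <- gam(y ~ s(x))
Vb <- vcov(b)                          # Bayesian posterior covariance of beta|y
bs <- mvrnorm(1000, coef(b), Vb)       # 1000 draws from the (approximate) posterior
xp <- data.frame(x = seq(0, 1, length=100))
Xp <- predict(b, xp, type="lpmatrix")  # maps coefficients to the linear predictor
fits <- Xp %*% t(bs)                   # 100 x 1000 matrix of posterior curves
ci <- apply(fits, 1, quantile, probs=c(0.025, 0.975))  # pointwise 95% interval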

Degrees of Freedom, DoF
Penalization reduces the freedom of the parameters to vary, so dim(β) is not a good measure of the DoF of a GAM. A better measure considers the average degree of shrinkage of the coefficients, β̂.
Informally, suppose β̃ is the unpenalized parameter vector estimate; then approximately β̂ = F β̃, where
F = (XᵀWX + Σ_j λ_j S_j)⁻¹ XᵀWX.
The F_ii are shrinkage factors, and tr(F) = Σ_i F_ii is a measure of the effective degrees of freedom (EDF) of the model.
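Continuing the hand-rolled Gaussian sketch above (where W = I), the EDF for a given λ can be computed directly; a sketch only, not how mgcv computes it internally:

# X, S and lambda as in the penalized least squares sketch above
Fmat <- solve(t(X) %*% X + lambda * S) %*% (t(X) %*% X)
edf <- sum(diag(Fmat))   # effective degrees of freedom, tr(F)
edf                      # lies between 0 and ncol(X) = dim(beta)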

Estimating the smoothing parameters, λ
We need to estimate the appropriate degree of smoothing, i.e. λ. This involves
1. deciding on a statistical approach to take, and
2. producing a computational method to implement it.
There are three main statistical approaches:
1. Choose λ to minimize the error in predicting new data.
2. Treat the smooths as random effects, following the Bayesian smoothing model, and estimate the λ_j as variance parameters using a marginal likelihood approach.
3. Go fully Bayesian by completing the Bayesian model with a prior on λ (this requires simulation and is not pursued here).

A prediction error criterion: cross validation
[Figure: three example fits to the same data, with λ too high, λ about right and λ too low.]
1. Choose λ to try to minimize the error in predicting new data.
2. Minimize the average error in predicting single datapoints omitted from the fit, each datum being left out once.
3. An invariant version is Generalized Cross Validation, GCV:
V_g(λ) = {l_max − l(β̂_λ)} / (n − EDF_λ)²
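In mgcv, GCV-based smoothness selection is requested via the method argument (the data below are my own simulation):

library(mgcv)
set.seed(6)
x <- runif(400); y <- sin(2*pi*x) + rnorm(400)*0.3
bg <- gam(y ~ s(x), method="GCV.Cp")   # GCV smoothness selection
bg$sp                                   # the selected smoothing parameter(s)
sum(bg$edf)                             # total effective degrees of freedom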

Marginal likelihood based smoothness selection
[Figure: random draws from the prior implied by λ, for λ too low (prior variance too high), λ and prior variance about right, and λ too high (prior variance too low).]
1. Choose λ to maximize the average likelihood of random draws from the prior implied by λ.
2. If λ is too low, then almost all draws are too variable to have high likelihood. If λ is too high, then the draws all underfit and have low likelihood. The right λ maximizes the proportion of draws close enough to the data to give high likelihood.
3. Formally, maximize e.g. the log marginal likelihood
V_r(λ) = log ∫ f(y|β) f_λ(β) dβ.

Prediction error vs. likelihood λ estimation
[Figure: for two replicate data sets from the same truth, the estimated smooths (with EDF 12.07 and 1) and the corresponding log GCV and REML/n score curves plotted against log(λ).]
1. The pictures show GCV and REML scores for different replicates from the same truth.
2. Compared to REML, GCV penalizes overfit only weakly, and so tends to undersmooth.
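The two criteria can be compared directly on the same data (continuing the simulation style used above; on any particular replicate the difference may be small, but GCV occasionally undersmooths):

library(mgcv)
set.seed(7)
x <- runif(100); y <- sin(2*pi*x) + rnorm(100)*0.5
b.gcv  <- gam(y ~ s(x), method="GCV.Cp")
b.reml <- gam(y ~ s(x), method="REML")
c(GCV.edf = sum(b.gcv$edf), REML.edf = sum(b.reml$edf))  # compare effective DoF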

Computational strategies for smoothness selection
1. Single iteration. Use an approximate V to re-estimate λ at each step of the PIRLS iteration used to find β̂.
2. Nested iteration. Optimize V w.r.t. λ by a Newton method, with each trial λ requiring a full PIRLS iteration to find β̂_λ.
Strategy 1 need not converge and is often no cheaper than 2; strategy 2 is much harder to implement efficiently. In mgcv, 1 is known as performance iteration and 2 as outer iteration. gam defaults to 2 (see the method and optimizer arguments). gamm (mixed GAMs) and bam (large datasets) use 1.
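For completeness, the corresponding calls might look like the following (reusing x and y from the previous sketch; only arguments documented in mgcv are used, and defaults may differ between versions):

library(mgcv)
b.outer <- gam(y ~ s(x), method="REML")  # nested (outer) iteration, here with REML
b.big   <- bam(y ~ s(x))                 # bam: designed for large data sets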

Summary
A semi-parametric GLM has a linear predictor depending linearly on unknown smooth functions of covariates.
Representing these functions with intermediate rank linear basis expansions recovers an over-parameterized GLM.
Maximum penalized likelihood estimation of this GLM avoids overfit by penalizing function wiggliness; it uses penalized iteratively reweighted least squares.
The degree of penalization is chosen by REML, GCV, etc.
Viewing the fitting penalties as being induced by priors on function wiggliness provides a justification for PMLE and implies a posterior distribution for the model coefficients, which gives good frequentist inference.

Bibliography
Most of what has been discussed here is in some way based on the work of Grace Wahba; see, e.g., Wahba (1990) Spline Models for Observational Data, SIAM.
The notion of a GAM, and the software interface of gam and associated functions, is due to Hastie and Tibshirani; see Hastie and Tibshirani (1990) Generalized Additive Models, Chapman and Hall.
Chong Gu developed the first computational method for multiple smoothing parameter selection, in 1992; see, e.g., Gu (2002) Smoothing Spline ANOVA, Springer.
Wood (2006) Generalized Additive Models: An Introduction with R, CRC, provides more detail on the framework given here.