A review of spline function selection procedures in R

Similar documents
A comparison of spline methods in R for building explanatory models

Multivariable Regression Modelling

The pspline Package. August 4, Author S original by Jim Ramsey R port by Brian Ripley

Last time... Bias-Variance decomposition. This week

A toolbox of smooths. Simon Wood Mathematical Sciences, University of Bath, U.K.

Stat 8053, Fall 2013: Additive Models

Splines and penalized regression

Generalized Additive Models

Doubly Cyclic Smoothing Splines and Analysis of Seasonal Daily Pattern of CO2 Concentration in Antarctica

This is called a linear basis expansion, and h m is the mth basis function For example if X is one-dimensional: f (X) = β 0 + β 1 X + β 2 X 2, or

Spline Models. Introduction to CS and NCS. Regression splines. Smoothing splines

Splines. Patrick Breheny. November 20. Introduction Regression splines (parametric) Smoothing splines (nonparametric)

Moving Beyond Linearity

Applied Statistics : Practical 9

GAMs, GAMMs and other penalized GLMs using mgcv in R. Simon Wood Mathematical Sciences, University of Bath, U.K.

Nonparametric Approaches to Regression

Lecture 16: High-dimensional regression, non-linear regression

Nonparametric Risk Attribution for Factor Models of Portfolios. October 3, 2017 Kellie Ottoboni

Package slp. August 29, 2016

Generalized Additive Model

Estimating survival from Gray s flexible model. Outline. I. Introduction. I. Introduction. I. Introduction

Assessing the Quality of the Natural Cubic Spline Approximation

STAT 705 Introduction to generalized additive models

1D Regression. i.i.d. with mean 0. Univariate Linear Regression: fit by least squares. Minimize: to get. The set of all possible functions is...

GAMs semi-parametric GLMs. Simon Wood Mathematical Sciences, University of Bath, U.K.

Incorporating Geospatial Data in House Price Indexes: A Hedonic Imputation Approach with Splines. Robert J. Hill and Michael Scholz

Package gamm4. July 25, Index 10

A popular method for moving beyond linearity. 2. Basis expansion and regularization 1. Examples of transformations. Piecewise-polynomials and splines

Lecture 17: Smoothing splines, Local Regression, and GAMs

Lecture 24: Generalized Additive Models Stat 704: Data Analysis I, Fall 2010

Generalized additive models I

Comparing different interpolation methods on two-dimensional test functions

The theory of the linear model 41. Theorem 2.5. Under the strong assumptions A3 and A5 and the hypothesis that

Package rgcvpack. February 20, Index 6. Fitting Thin Plate Smoothing Spline. Fit thin plate splines of any order with user specified knots

Moving Beyond Linearity

Package splines2. June 14, 2018

Nonparametric regression using kernel and spline methods

Lecture 7: Splines and Generalized Additive Models

Linear Penalized Spline Model Estimation Using Ranked Set Sampling Technique

Package freeknotsplines

Knot-Placement to Avoid Over Fitting in B-Spline Scedastic Smoothing. Hirokazu Yanagihara* and Megu Ohtaki**

GENREG DID THAT? Clay Barker Research Statistician Developer JMP Division, SAS Institute

Curve fitting using linear models

Additive hedonic regression models for the Austrian housing market ERES Conference, Edinburgh, June

Semiparametric Tools: Generalized Additive Models

Nonlinearity and Generalized Additive Models Lecture 2

Spline fitting tool for scientometric applications: estimation of citation peaks and publication time-lags

arxiv: v2 [stat.ap] 14 Nov 2016

Generalized Additive Models

3 Nonlinear Regression

A technique for constructing monotonic regression splines to enable non-linear transformation of GIS rasters

Important Properties of B-spline Basis Functions

Chapter 10: Extensions to the GLM

Smoothing Scatterplots Using Penalized Splines

1. Assumptions. 1. Introduction. 2. Terminology

Statistical Modeling with Spline Functions Methodology and Theory

EE613 Machine Learning for Engineers LINEAR REGRESSION. Sylvain Calinon Robot Learning & Interaction Group Idiap Research Institute Nov.

EE613 Machine Learning for Engineers LINEAR REGRESSION. Sylvain Calinon Robot Learning & Interaction Group Idiap Research Institute Nov.

More advanced use of mgcv. Simon Wood Mathematical Sciences, University of Bath, U.K.

The mgcv Package. July 21, 2006

STA121: Applied Regression Analysis

Goals of the Lecture. SOC6078 Advanced Statistics: 9. Generalized Additive Models. Limitations of the Multiple Nonparametric Models (2)

See the course website for important information about collaboration and late policies, as well as where and when to turn in assignments.

Hierarchical generalized additive models: an introduction with mgcv

Lecture 27: Review. Reading: All chapters in ISLR. STATS 202: Data mining and analysis. December 6, 2017

Non-Linear Regression. Business Analytics Practice Winter Term 2015/16 Stefan Feuerriegel

Nonparametric Regression

Fall CSCI 420: Computer Graphics. 4.2 Splines. Hao Li.

An introduction to interpolation and splines

Organizing data in R. Fitting Mixed-Effects Models Using the lme4 Package in R. R packages. Accessing documentation. The Dyestuff data set

Interpolation by Spline Functions

NONPARAMETRIC REGRESSION SPLINES FOR GENERALIZED LINEAR MODELS IN THE PRESENCE OF MEASUREMENT ERROR

MS in Applied Statistics: Study Guide for the Data Science concentration Comprehensive Examination. 1. MAT 456 Applied Regression Analysis

Package bgeva. May 19, 2017

Spline Notes. Marc Olano University of Maryland, Baltimore County. February 20, 2004

MATLAB, R and S-PLUS Functions for Functional Data Analysis

Fitting latency models using B-splines in EPICURE for DOS

TECHNICAL REPORT NO December 11, 2001

MINI-PAPER A Gentle Introduction to the Analysis of Sequential Data

CPSC 340: Machine Learning and Data Mining. More Regularization Fall 2017

lecture 10: B-Splines

(Spline, Bezier, B-Spline)

Monograph: Sóskuthy, Márton (2017) Generalised additive mixed models for dynamic analysis in linguistics: a practical introduction. Working Paper.

Package robustgam. January 7, 2013

A Method for Comparing Multiple Regression Models

Using Multivariate Adaptive Regression Splines (MARS ) to enhance Generalised Linear Models. Inna Kolyshkina PriceWaterhouseCoopers

Linear Methods for Regression and Shrinkage Methods

Knowledge Discovery and Data Mining

EXST 7014, Lab 1: Review of R Programming Basics and Simple Linear Regression

REPLACING MLE WITH BAYESIAN SHRINKAGE CAS ANNUAL MEETING NOVEMBER 2018 GARY G. VENTER

CoxFlexBoost: Fitting Structured Survival Models

Workshop 8: Model selection

Four equations are necessary to evaluate these coefficients. Eqn

Package robustgam. February 20, 2015

GAMs with integrated model selection using penalized regression splines and applications to environmental modelling.

davidr Cornell University

Linear Model Selection and Regularization. especially usefull in high dimensions p>>100.

CS 450 Numerical Analysis. Chapter 7: Interpolation

Fitting to a set of data. Lecture on fitting

Abstract R-splines are introduced as splines t with a polynomial null space plus the sum of radial basis functions. Thin plate splines are a special c

Transcription:

Matthias Schmid Department of Medical Biometry, Informatics and Epidemiology University of Bonn joint work with Aris Perperoglou on behalf of TG2 of the STRATOS Initiative September 1, 2016

Introduction The Subject Fit a statistical model of the form g(y X ) = β 0 + f (X ) p explanatory variables X = (X 1,..., X p ) f unknown, allowed to be nonlinear but should be interpretable Common specification: f (X 1,..., X p ) = f 1 (X 1 ) +... + f p (X p ) Generalized additive models (GAMs) Splines are the most popular method to estimate f 1,..., f p 668 articles in Biometrics, 790 articles in Statistics in Medicine, 173 articles in Biometrical Journal GAM books by Hastie/Tibshirani and Wood are hugely popular (> 12, 000 and > 5, 000 citations, respectively) 2/24

Introduction Definition of Splines Set of piecewise polynomials, each of degree d Joined together at a set of knots τ 1,..., τ K Continuous in value + sufficiently smooth at the knots f j(xj) 4 2 0 2 4 τ 1 τ 2 τ 3 τ 4 X j 3/24

Spline Estimation Spline Estimation Splines are typically represented by a set of (non-unique) basis functions B 1,..., B K+d+1 f j (X j ) = K+d+1 k=1 β k B k (X j ) Linearization of the estimation problem Many types & subtypes, e.g., Natural splines: Required to be linear in τ 1 and τ K Penalized splines: Minimization of a critierion of the form Sum of residuals + λ J m with smoothing parameter λ and wiggliness parameter m ( ) Example: J m = m f j / xj m 2 dxj 4/24

Spline Estimation Spline Estimation (2) Types & subtypes (cont.): Smoothing splines: Natural cubic splines with knots at the data values x 1j,..., x nj P-splines: Use a B-spline basis with equidistant knots Approximate integrated squared derivative penalty by an m-th order difference penalty Thin plate regression splines: Penalized splines expressed in terms of radial basis functions Additional restrictions: Monotonic splines, cyclic splines,... 5/24

The Problem The Problem Spline modeling involves the selection of a comparatively large number of parameters Number & placement of knots Choice of basis & restrictions Choice of smoothing parameter / optimization procedure Choice of penalty order Mathematical properties are well understood, but... 6/24

The Problem The Problem (2) Little guidance for applied statisticians Even for level 2 users it is hard to choose between competing approaches Broadly speaking the default penalized thin plate regression splines tend to give the best MSE performance, but they are slower to set up than the other bases. The knot based penalized cubic regression splines (with derivative based penalties) usually come next in MSE performance, with the P-splines doing just a little worse. However the P-splines are useful in non-standard situations. (Excerpt from smooth.terms help file in R package mgcv) 7/24

The Problem The Problem (3) Little guidance so can we rely on software? Default results of some spline procedures in R: 0.0 0.2 0.4 0.6 0.8 1.0 5 0 5 X j f j(xj) bs ns tp (mgcv) ps (mgcv) smooth.spline true 8/24

The Problem Why R? Guidance on software plays a key role Why R? Most widely spread software amongst people who do research in biomedical statistics Also popular in the field of data science Most widely used software in statistics courses Start STRATOS work on splines with a project on spline implementations in R 9/24

First Steps First Steps Identify popular/relevant R packages and functions Basic spline packages & functions Regression packages Understand spline implementations Analyze documentation Compare approaches Provide first recommendations 10/24

First Steps Current Situation By July 2016, CRAN package repository featured 8,670 add-on packages 530 packages related to splines and/or spline modeling Even more packages were available on GITHUB, R-Forge etc. that we do not look into Restrict to packages that are both (a) popular, and (b) relevant for standard regression modeling 11/24

Basic Spline Packages Basic Spline Packages Download numbers extracted from RStudio logs 12/24

Basic Spline Packages Preliminary Findings Lack of harmonization Differences in terminology Sometimes even within the same package Lack of documentation R help files are often not detailed enough to fully understand how the implemented methods work 13/24

Basic Spline Packages Analysis of the splines Package Consider the evaluation of cubic B-spline basis functions Most important functions: splinedesign, bs Both splinedesign and bs have an argument named knots Define knot sequence 0.1, 0.2,..., 0.9 for the above example > knots <- seq(0.1, 0.9, 0.1) > splinedesign(x, knots = knots) Error in splinedesign(x, knots = knots) : die 'x' Daten müssen im Bereich 0.4 bis 0.6 liegen, außer es ist outer.ok = TRUE gesetzt > > dim(bs(x, knots = knots)) [1] 101 12 14/24

Basic Spline Packages Analysis of the splines Package (2) B-spline basis requires definition of additional knots Classical B-spline plot contained in many textbooks: B j (X j ) 0.0 0.2 0.4 0.6 0.8 1.0 0.0 0.2 0.4 0.6 0.8 1.0 X j 15/24

Basic Spline Packages Analysis of the splines Package (3) Basis functions produced by bs: B j (X j ) 0.0 0.2 0.4 0.6 0.8 1.0 0.0 0.2 0.4 0.6 0.8 1.0 X j 16/24

Basic Spline Packages Smoothing Splines Evaluation of the smoothing spline basis Apply ns function to a vector of length 10 Apply smooth.spline function > ns(x, knots = x, intercept = TRUE) Error in qr.default(t(const)) : NA/NaN/Inf in foreign function call (arg 1) > N <- ns(x, knots = sort(x)[2:9], intercept = TRUE) > dim(n) [1] 10 10 > > S <- smooth.spline(x, y) > length(s$fit$coef) [1] 12 17/24

Basic Spline Packages Smoothing Splines (2) Apply smooth.pspline function in R package pspline Details The method produces results similar to function smooth.spline, but the smoothing function is a natural smoothing spline rather than a B-spline smooth, and as a consequence will differ slightly for norder = 2 over the initial and final intervals.... References Heckman, N. and Ramsay, J. O. (1996) Spline smoothing with model based penalties. McGill University, unpublished manuscript. 18/24

Basic Spline Packages Smoothing Splines (3) R code for smooth.pspline function > smooth.pspline function (x, y, w = rep(1, length(x)), norder = 2, df = norder + 2, spar = 0, method = 1) { my.call <- match.call()... result <-.Fortran("pspline", as.integer(n), as.integer(nvar), as.integer(norder), as.double(x), as.double(w), as.double(y), as.double(yhat), as.double(lev), as.double(gcv), as.double(cv), as.double(df), as.double(spar), as.double(dfmax), as.double(work), as.integer(method), as.integer(irerun), as.integer(ier), PACKAGE = "pspline")... } <environment: namespace:pspline> 19/24

Regression Regression General features of popular regression packages package downloaded vignette book website data sets quantreg 1,461,837 x x 8 mgcv 1,055,188 x x 2 survival 761,623 x x 35 VGAM 230,784 x x x 50 gam 127,386 x 3 gamlss 60,868 (x) x 29 20/24

Regression Regression (2) Smooth terms in popular regression packages quantreg Allows for bs, ns etc. in model formula qss smoothing with total variation roughness penalty mgcv implements (among many others) Thin plate regression splines (tp, default) Penalized natural cubic splines with cardinal spline basis cr P-splines ps, bases on splinedesign survival Allows for bs, ns etc. in model formula P-splines (pspline, same as ps in mgcv) 21/24

Regression Regression (3) Smooth terms in popular regression packages (cont.) VGAM Allows for bs, ns etc. in model formula P-splines (ps, similar to mgcv) gam gamlss Smoothing splines (s, similar to smooth.spline) Functions s, gam.s similar to smooth.spline Various P-spline and smoothing spline methods 22/24

Regression Regression (4) Model selection in popular regression packages quantreg: information criteria (but no step function) mgcv: information criteria (but no step function), null space penalization / ridge penalty survival: information criteria (but no step function, works with stepaic from MASS VGAM: information criteria (but no step function) gam: information criteria, step.gam function gamlss: information criteria, stepgaic and related functions, null space penalty 23/24

Conclusion Conclusion, Next Steps and Future Work Many regression packages rely on basic routines Variable selection based on information criteria or penalties Details of spline routines in regression packages are often not contained in help files + may be difficult to retrieve from literature Notable exception: mgcv Next steps: Investigate routines for smoothing parameter optimization (GCV, UBRE, REML, etc.) Investigate decomposition of splines (near + nonlinear parts) Future work: Analyze monotonic splines, cyclic splines, multivariate splines Extend analysis to other regression-type methods involving splines (mboost, fda) 24/24