Splines and penalized regression

Splines and penalized regression November 23

Introduction

We are discussing ways to estimate the regression function f, where E(y|x) = f(x). One approach, of course, is to assume that f has a certain shape, such as linear or quadratic, that can be estimated parametrically. We have also discussed locally weighted linear/polynomial models as a way of allowing f to be more flexible. An alternative, more direct approach is penalization.

Controlling smoothness with penalization

Here, we directly solve for the function f that minimizes the following objective function, a penalized version of the least squares objective:

\[ \sum_{i=1}^{n} \{y_i - f(x_i)\}^2 + \lambda \int \{f''(u)\}^2 \, du \]

The first term measures the fit to the data, while the second penalizes curvature; note that for a line, f''(u) = 0 for all u. Here, λ is the smoothing parameter, and it controls the tradeoff between the two terms: λ = 0 imposes no restrictions, so f will interpolate the data, while λ = ∞ makes any curvature impossible, returning us to ordinary linear regression.

Splines: introduction

It may sound impossible to solve for such an f over all possible functions, but the solution turns out to be surprisingly simple. That solution, it turns out, depends on a class of functions called splines. We will begin by introducing splines themselves, then move on to discuss how they provide a solution to our penalized regression problem.

Basis functions

One approach to extending the linear model is to represent x using a collection of basis functions:

\[ f(x) = \sum_{m=1}^{M} \beta_m h_m(x) \]

Because the basis functions {h_m} are prespecified and the model is linear in these new variables, ordinary least squares approaches to model fitting and inference can be employed. This idea is probably not new to you, as transformations and expansions using polynomial bases are common.
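
For instance, a quadratic basis expansion can be fit with ordinary least squares in R; a minimal sketch, assuming hypothetical vectors x and y:

    fit <- lm(y ~ x + I(x^2))   # basis expansion with h1 = 1, h2 = x, h3 = x^2
    summary(fit)                # the usual OLS inference applies unchanged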

Global versus local bases

However, polynomial bases with global representations have undesirable side effects: each observation affects the entire curve, even at x values far from that observation. In previous lectures, we got around this problem with local weighting. In this lecture, we will instead explore an approach based on piecewise basis functions. As we will see, splines are piecewise polynomials joined together to make a single smooth curve.

The piecewise constant model

To understand splines, we will gradually build up a piecewise model, starting with the simplest one: the piecewise constant model. First, we partition the range of x into K + 1 intervals by choosing K points {ξ_k}, k = 1, ..., K, called knots. For our example involving bone mineral density, we will place the knots at the tertiles of the observed ages.
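
A minimal sketch of such a fit in R, assuming hypothetical vectors age and spnbmd (relative change in spinal bone mineral density):

    knots  <- quantile(age, probs = c(1/3, 2/3))                # tertiles of age
    groups <- cut(age, breaks = c(min(age), knots, max(age)),
                  include.lowest = TRUE)                        # K + 1 = 3 intervals
    fit <- lm(spnbmd ~ groups)                                  # one mean per interval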

The piecewise constant model (cont'd)

[Figure: piecewise constant fit of spnbmd versus age, shown separately for females and males]

The piecewise linear model

[Figure: piecewise linear fit of spnbmd versus age, shown separately for females and males]

The continuous piecewise linear model

[Figure: continuous piecewise linear fit of spnbmd versus age, shown separately for females and males]

Basis functions for piecewise continuous models

These constraints can be incorporated directly into the basis functions:

\[ h_1(x) = 1, \quad h_2(x) = x, \quad h_3(x) = (x - \xi_1)_+, \quad h_4(x) = (x - \xi_2)_+, \]

where (·)_+ denotes the positive portion of its argument:

\[ r_+ = \begin{cases} r & \text{if } r \ge 0 \\ 0 & \text{if } r < 0 \end{cases} \]

Basis functions for piecewise continuous models (cont'd)

It is easily checked that these basis functions lead to a composite function f(x) that:
- is continuous everywhere,
- is linear everywhere except at the knots, and
- has a different slope in each region.

Also, note that the degrees of freedom add up: 3 regions × 2 degrees of freedom per region - 2 constraints = 4 basis functions.
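
This basis is easy to construct by hand; a minimal sketch, assuming hypothetical vectors x and y and knot locations xi1 and xi2:

    pos <- function(r) pmax(r, 0)      # the (.)+ operation
    h3  <- pos(x - xi1)
    h4  <- pos(x - xi2)
    fit <- lm(y ~ x + h3 + h4)         # the intercept and x supply h1 and h2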

Splines

The preceding is an example of a spline: a piecewise polynomial of degree m - 1 that is continuous up to its first m - 2 derivatives. By requiring continuous derivatives, we ensure that the resulting function is as smooth as possible. We can obtain more flexible curves by increasing the degree of the spline and/or by adding knots. However, there is a tradeoff:
- Few knots / low degree: the resulting class of functions may be too restrictive (bias)
- Many knots / high degree: we run the risk of overfitting (variance)

The truncated power basis

The set of basis functions introduced earlier is an example of what is called the truncated power basis. Its logic is easily extended to splines of order m:

\[ h_j(x) = x^{j-1}, \quad j = 1, \dots, m \]
\[ h_{m+k}(x) = (x - \xi_k)_+^{m-1}, \quad k = 1, \dots, K \]

Note that a spline of order m with K knots has m + K degrees of freedom.

Quadratic splines

[Figure: quadratic spline fit of spnbmd versus age, shown separately for females and males]

Cubic splines

[Figure: cubic spline fit of spnbmd versus age, shown separately for females and males]

Additional notes

These types of fixed-knot models are referred to as regression splines. Recall that a cubic spline has 4 + K degrees of freedom: (K + 1) regions × 4 parameters per region - K knots × 3 constraints per knot. It is claimed that cubic splines are the lowest-order splines for which the discontinuity at the knots cannot be noticed by the human eye. There is rarely any need to go beyond cubic splines, which are by far the most common type of spline in practice.

Implementing regression splines

The truncated power basis has two principal virtues:
- conceptual simplicity, and
- the linear model is nested inside it, leading to simple tests of the null hypothesis of linearity.

Unfortunately, it also has a number of computational/numerical flaws: it is inefficient and can lead to overflow and nearly singular matrix problems. The more complicated, but numerically much more stable and efficient, B-spline basis is often employed instead.

B-splines in R

Fortunately, one can use B-splines without knowing the details behind their complicated construction. In the splines package (which by default is installed but not loaded), the bs() function will construct a B-spline basis for you:

    X  <- bs(x, knots = quantile(x, probs = c(1/3, 2/3)))
    X  <- bs(x, df = 5)
    X  <- bs(x, degree = 2, df = 10)
    Xp <- predict(X, newx = x)    # evaluate the same basis at (possibly new) x values

By default, bs uses degree = 3 and does not return a column for the intercept; if you specify df rather than knots, the knots are placed at evenly spaced quantiles of x.
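
A minimal sketch of using such a basis in a regression fit, assuming hypothetical vectors x and y:

    library(splines)
    fit  <- lm(y ~ bs(x, df = 5))                      # cubic regression spline with 5 df
    yhat <- predict(fit, newdata = data.frame(x = x))  # fitted values at the observed x's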

Natural cubic splines

Polynomial fits tend to be erratic at the boundaries of the data, and this is even worse for cubic splines. Natural cubic splines ameliorate this problem by adding four additional constraints: the function is required to be linear beyond the boundaries of the data.

Natural cubic splines (cont'd)

[Figure: natural cubic spline fit of spnbmd versus age, shown separately for females and males]

Natural cubic splines, 6 df

[Figure: natural cubic spline fit with 6 degrees of freedom, spnbmd versus age, by sex]

Natural cubic splines, 6 df (cont'd)

[Figure: relative change in spinal BMD versus age, with the fitted curves for females and males overlaid]

Splines vs. loess (6 df each)

[Figure: natural cubic spline and loess fits, 6 df each, of spnbmd versus age for females and males]

Natural splines in R

R also provides a function, ns, to compute a basis for natural cubic splines; it works almost exactly like bs, except that there is no option to change the degree. Note that a natural spline has m + K - 4 degrees of freedom; thus, a natural cubic spline with K knots has K degrees of freedom.
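
A minimal usage sketch, assuming hypothetical vectors x and y:

    library(splines)
    fit <- lm(y ~ ns(x, df = 6))    # natural cubic spline basis with 6 df
    # or supply the interior knots directly:
    fit <- lm(y ~ ns(x, knots = quantile(x, probs = c(0.25, 0.5, 0.75))))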

Mean and variance estimation

Because the basis functions are fixed, all standard approaches to inference for regression are valid. In particular, letting \( L = X(X^\top X)^{-1} X^\top \) denote the projection matrix,

\[ E(\hat{f}) = Lf, \qquad V(\hat{f}) = \sigma^2 L L^\top, \qquad \mathrm{CV} = \frac{1}{n} \sum_i \left( \frac{y_i - \hat{y}_i}{1 - l_{ii}} \right)^2 \]

Furthermore, extensions to logistic regression, Cox proportional hazards regression, etc., are straightforward.
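
Because the fit is an ordinary linear model in the basis, pointwise standard errors come straight from lm; a minimal sketch, assuming hypothetical vectors x and y and a grid of new values xx:

    library(splines)
    fit <- lm(y ~ ns(x, df = 6))
    pr  <- predict(fit, newdata = data.frame(x = xx), se.fit = TRUE)
    upper <- pr$fit + 2 * pr$se.fit    # approximate pointwise confidence band
    lower <- pr$fit - 2 * pr$se.fit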

Mean and variance estimation (cont'd)

[Figure: fitted spline with pointwise confidence bands for spnbmd versus age, shown separately for females and males]

Mean and variance estimation (cont'd)

[Figure: coronary heart disease study, K = 4; fitted f̂(x) and estimated probability π̂(x) plotted against systolic blood pressure]

Problems with knots

Fixed-df splines are useful tools, but they are not truly nonparametric. Choices regarding the number of knots and where they are located are fundamentally parametric choices, and they have a large effect on the fit. Furthermore, if knots are placed at quantiles, models with different numbers of knots will not be nested inside each other, which complicates hypothesis testing.

Controlling smoothness with penalization

We can avoid the knot selection problem altogether via the nonparametric formulation introduced at the beginning of the lecture: choose the function f that minimizes

\[ \sum_{i=1}^{n} \{y_i - f(x_i)\}^2 + \lambda \int \{f''(u)\}^2 \, du \]

We will now see that the solution to this problem lies in the family of natural cubic splines.

Terminology

First, some terminology:
- The parametric splines with fixed degrees of freedom that we have discussed so far are called regression splines.
- A spline that passes through the points {(x_i, y_i)} is called an interpolating spline, and is said to interpolate those points.
- A spline that describes and smooths noisy data by passing close to the points {(x_i, y_i)}, without being required to pass through them, is called a smoothing spline.

Natural cubic splines are the smoothest interpolators

Theorem: Of all twice-differentiable functions passing through the points {(x_i, y_i)}, the one that minimizes \( \int \{f''(u)\}^2 \, du \) is a natural cubic spline with knots at every unique value of x_i.

Natural cubic splines solve the nonparametric formulation

Theorem: Of all twice-differentiable functions, the one that minimizes

\[ \sum_{i=1}^{n} \{y_i - f(x_i)\}^2 + \lambda \int \{f''(u)\}^2 \, du \]

is a natural cubic spline with knots at every unique value of x_i.

Design matrix

Let {N_j}, j = 1, ..., n, denote the collection of natural cubic spline basis functions, and let N denote the n × n design matrix consisting of the basis functions evaluated at the observed values: \( N_{ij} = N_j(x_i) \). Writing \( f(x) = \sum_{j=1}^{n} N_j(x) \beta_j \), the vector of values of f at the observed points is \( f = N\beta \).

Solution

The penalized objective function is therefore

\[ (y - N\beta)^\top (y - N\beta) + \lambda \beta^\top \Omega \beta, \]

where \( \Omega_{jk} = \int N_j''(t) N_k''(t) \, dt \). The solution is therefore

\[ \hat{\beta} = (N^\top N + \lambda \Omega)^{-1} N^\top y \]
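
As a rough illustrative sketch in R with simulated data: instead of computing Ω by integrating products of second derivatives of the basis functions, this substitutes a second-order difference penalty on B-spline coefficients (the P-spline idea), which penalizes roughness in a closely related way:

    library(splines)
    set.seed(1)
    x <- sort(runif(200)); y <- sin(2 * pi * x) + rnorm(200, sd = 0.3)
    B <- bs(x, df = 20, intercept = TRUE)        # a rich B-spline basis
    D <- diff(diag(ncol(B)), differences = 2)    # second-difference operator
    Omega <- crossprod(D)                        # roughness penalty matrix (approximation)
    lambda <- 1
    beta_hat <- solve(crossprod(B) + lambda * Omega, crossprod(B, y))
    f_hat <- B %*% beta_hat                      # penalized fit at the observed x's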

Smoothing splines are linear smoothers

Note that the fitted values can be represented as

\[ \hat{y} = N(N^\top N + \lambda \Omega)^{-1} N^\top y = L_\lambda y \]

Thus, smoothing splines are linear smoothers, and we can use all the results that we derived back when discussing local regression.

Smoothing splines are linear smoothers (cont'd)

In particular:

\[ \mathrm{CV} = \frac{1}{n} \sum_i \left( \frac{y_i - \hat{y}_i}{1 - l_{ii}} \right)^2 \]
\[ E\{\hat{f}(x_0)\} = \sum_i l_i(x_0) f(x_i), \qquad V\{\hat{f}(x_0)\} = \sigma^2 \sum_i l_i(x_0)^2 \]
\[ \hat{\sigma}^2 = \frac{\sum_i (y_i - \hat{y}_i)^2}{n - 2\nu + \tilde{\nu}}, \qquad \nu = \mathrm{tr}(L), \quad \tilde{\nu} = \mathrm{tr}(L^\top L) \]

CV, GCV for BMD example

[Figure: CV and GCV plotted against log(λ) for the male and female BMD data]

Undersmoothing and oversmoothing of BMD data

[Figure: undersmoothed and oversmoothed smoothing spline fits to the male and female BMD data]

Pointwise confidence bands

[Figure: smoothing spline fits with pointwise confidence bands for spnbmd versus age, shown separately for females and males]

R implementation

Recall that local regression had a simple, standard function for basic one-dimensional smoothing (loess) and an extensive package for more comprehensive analyses (locfit). Spline-based smoothing is similar: smooth.spline requires no additional packages and implements simple one-dimensional smoothing:

    fit <- smooth.spline(x, y)
    plot(fit, type = "l")
    predict(fit, xx)

By default, the function chooses λ based on GCV, but this can be changed to CV, or you can specify λ yourself.
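
As a sketch of how the linear-smoother quantities from earlier can be recovered from the fitted object (assuming no ties in x and unit weights, so the returned components line up with the formulas above):

    fit <- smooth.spline(x, y)
    nu  <- fit$df                                     # effective degrees of freedom, tr(L)
    lev <- fit$lev                                    # leverages l_ii (diagonal of L_lambda)
    cv  <- mean(((fit$yin - fit$y) / (1 - lev))^2)    # leave-one-out CV score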

The mgcv package

If you have a binary outcome variable or multiple covariates, or want confidence intervals, however, smooth.spline is lacking. A very extensive package called mgcv provides those features, as well as much more. The basic function is called gam, which stands for generalized additive model (we'll discuss GAMs more in a later lecture).

The mgcv package (cont'd)

The syntax of gam is very similar to that of glm and locfit, with the function s() wrapped around any term for which you want a smooth function:

    library(mgcv)
    fit <- gam(y ~ s(x))
    fit <- gam(y ~ s(x), family = binomial)
    plot(fit)
    plot(fit, shade = TRUE)
    predict(fit, newdata = data.frame(x = xx), se.fit = TRUE)
    summary(fit)

Hypothesis testing

We are often interested in testing whether the smaller of two nested models provides an adequate fit to the data. In ordinary regression, this is accomplished via the F-statistic:

\[ F = \frac{(\mathrm{RSS}_0 - \mathrm{RSS})/q}{\hat{\sigma}^2}, \]

where RSS is the residual sum of squares and q is the difference in degrees of freedom between the two models, or via the likelihood ratio test:

\[ \Lambda = 2\{\ell(\hat{\beta}) - \ell(\hat{\beta}_0)\} \sim \chi^2_q \]

Hypothesis testing (cont'd)

These results do not hold exactly in the case of penalized least squares, but they still provide a way to compute useful approximate p-values, using ν = tr(L) as the degrees of freedom of a model. One can do this manually, or via

    anova(fit0, fit, test = "F")
    anova(fit0, fit, test = "Chisq")

It should be noted, however, that such tests treat λ as fixed, even though in reality it is almost always estimated from the data using CV or GCV.
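
For example, a minimal sketch comparing a linear fit to a smooth fit with mgcv, assuming hypothetical vectors x and y:

    library(mgcv)
    fit0 <- gam(y ~ x)       # null model: linear in x
    fit  <- gam(y ~ s(x))    # alternative: smooth function of x
    anova(fit0, fit, test = "F")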