Lecture 17: Smoothing splines, Local Regression, and GAMs

Lecture 17: Smoothing splines, Local Regression, and GAMs
Reading: Sections 7.5-7
STATS 202: Data mining and analysis
November 6, 2017

Cubic splines

Define a set of knots $\xi_1 < \xi_2 < \dots < \xi_K$. We want the function $Y = f(X)$ to:
1. Be a cubic polynomial between every pair of knots $\xi_i, \xi_{i+1}$.
2. Be continuous at each knot.
3. Have continuous first and second derivatives at each knot.

It turns out we can write $f$ in terms of $K + 3$ basis functions:
$$f(X) = \beta_0 + \beta_1 X + \beta_2 X^2 + \beta_3 X^3 + \beta_4 h(X, \xi_1) + \dots + \beta_{K+3} h(X, \xi_K),$$
where
$$h(x, \xi) = \begin{cases} (x - \xi)^3 & \text{if } x > \xi \\ 0 & \text{otherwise.} \end{cases}$$
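As a concrete illustration (not part of the original slides), here is a minimal numpy sketch that builds the truncated power basis above and fits its coefficients by least squares. The data, knot locations, and function name are made up for the example.

```python
# Illustrative sketch: truncated power basis for a cubic regression spline,
# fit by ordinary least squares. Data and knots are synthetic.
import numpy as np

def cubic_spline_basis(x, knots):
    """Columns: 1, x, x^2, x^3, h(x, xi_1), ..., h(x, xi_K)  (K + 4 columns in total)."""
    cols = [np.ones_like(x), x, x**2, x**3]
    cols += [np.where(x > xi, (x - xi)**3, 0.0) for xi in knots]  # h(x, xi) = (x - xi)^3_+
    return np.column_stack(cols)

rng = np.random.default_rng(0)
x = np.sort(rng.uniform(0, 10, 200))
y = np.sin(x) + rng.normal(scale=0.3, size=x.size)

knots = np.quantile(x, [0.25, 0.5, 0.75])      # knots at quantiles of X
B = cubic_spline_basis(x, knots)
beta, *_ = np.linalg.lstsq(B, y, rcond=None)   # least-squares fit of the coefficients
y_hat = B @ beta                               # fitted values of the regression spline
```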

Natural cubic splines

A natural cubic spline is a spline which is linear instead of cubic for $X < \xi_1$ and $X > \xi_K$.

[Figure: Wage vs. Age, comparing a natural cubic spline fit with a cubic spline fit.]

The predictions are more stable for extreme values of $X$.

Choosing the number and locations of knots

The locations of the knots are typically quantiles of $X$. The number of knots, $K$, is chosen by cross-validation.

[Figure: Cross-validated mean squared error as a function of the degrees of freedom of the natural spline and of the cubic spline.]
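A hedged sketch of that cross-validation step, reusing the same truncated power basis as above. The synthetic data, the 10-fold split, and the candidate values of K are illustrative choices, not the Wage example from the slides.

```python
# Illustrative sketch: choose the number of knots K by 10-fold cross-validation.
import numpy as np

def cubic_spline_basis(x, knots):
    cols = [np.ones_like(x), x, x**2, x**3]
    cols += [np.where(x > xi, (x - xi)**3, 0.0) for xi in knots]
    return np.column_stack(cols)

def cv_mse(x, y, n_knots, n_folds=10, seed=0):
    rng = np.random.default_rng(seed)
    folds = rng.permutation(x.size) % n_folds                  # random, balanced fold assignment
    errs = []
    for k in range(n_folds):
        tr, te = folds != k, folds == k
        # interior quantiles of the training x as knot locations
        knots = np.quantile(x[tr], np.linspace(0, 1, n_knots + 2)[1:-1])
        beta, *_ = np.linalg.lstsq(cubic_spline_basis(x[tr], knots), y[tr], rcond=None)
        errs.append(np.mean((y[te] - cubic_spline_basis(x[te], knots) @ beta) ** 2))
    return np.mean(errs)

rng = np.random.default_rng(1)
x = np.sort(rng.uniform(0, 10, 300))
y = np.sin(x) + rng.normal(scale=0.3, size=x.size)
best_K = min(range(1, 11), key=lambda K: cv_mse(x, y, K))      # K with smallest CV error
```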

Smoothing splines

Find the function $f$ which minimizes
$$\sum_{i=1}^n (y_i - f(x_i))^2 + \lambda \int f''(x)^2\,dx.$$
The first term is the RSS of the model; the second is a penalty for the roughness of the function.

Facts:
- The minimizer $\hat f$ is a natural cubic spline, with knots at each sample point $x_1, \dots, x_n$.
- Obtaining $\hat f$ is similar to a Ridge regression.

Regression splines vs. Smoothing splines

Cubic regression splines:
- Fix the locations of $K$ knots at quantiles of $X$; the number of knots $K < n$.
- Find the natural cubic spline $\hat f$ which minimizes the RSS $\sum_{i=1}^n (y_i - f(x_i))^2$.
- Choose $K$ by cross-validation.

Smoothing splines:
- Put $n$ knots at $x_1, \dots, x_n$. We could find a cubic spline which makes the RSS = 0 (overfitting!).
- Instead, we obtain the fitted values $\hat f(x_1), \dots, \hat f(x_n)$ through an algorithm similar to Ridge regression.
- The function $\hat f$ is the only natural cubic spline that has these fitted values.

Deriving a smoothing spline

1. Show that if you fix the values $f(x_1), \dots, f(x_n)$, the roughness $\int f''(x)^2\,dx$ is minimized by a natural cubic spline (Problem 5.7 in ESL).

2. Deduce that the solution to the smoothing spline problem is a natural cubic spline, which can be written in terms of its basis functions:
$$f(x) = \beta_0 + \beta_1 f_1(x) + \dots + \beta_{n+3} f_{n+3}(x).$$

Deriving a smoothing spline

3. Letting $N$ be the matrix with $N(i, j) = f_j(x_i)$, we can write the objective function as
$$(y - N\beta)^T (y - N\beta) + \lambda \beta^T \Omega_N \beta, \quad \text{where } \Omega_N(i, j) = \int f_i''(t) f_j''(t)\,dt.$$

4. By simple calculus, the coefficients $\hat\beta$ which minimize this objective are
$$\hat\beta = (N^T N + \lambda \Omega_N)^{-1} N^T y.$$

Deriving a smoothing spline

5. Note that the predicted values are a linear function of the observed values:
$$\hat y = \underbrace{N (N^T N + \lambda \Omega_N)^{-1} N^T}_{S_\lambda}\, y.$$

6. The degrees of freedom of a smoothing spline are
$$\mathrm{Trace}(S_\lambda) = S_\lambda(1,1) + S_\lambda(2,2) + \dots + S_\lambda(n,n).$$
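The following numpy sketch walks through steps 3-6 numerically. It is illustrative only: it uses a plain truncated power basis and approximates $\Omega_N$ by a Riemann sum over a grid, whereas the lecture's $f_j$ are natural cubic spline basis functions; the data and the value of $\lambda$ are synthetic.

```python
# Illustrative sketch of the penalized least-squares solve, the smoother matrix
# S_lambda, and the effective degrees of freedom Trace(S_lambda).
import numpy as np

rng = np.random.default_rng(0)
x = np.sort(rng.uniform(0, 1, 100))
y = np.sin(2 * np.pi * x) + rng.normal(scale=0.2, size=x.size)
knots = np.quantile(x, np.linspace(0.05, 0.95, 10))

def basis(t):                       # columns f_j(t): 1, t, t^2, t^3, (t - xi)^3_+
    t = np.atleast_1d(t)
    cols = [np.ones_like(t), t, t**2, t**3]
    cols += [np.where(t > xi, (t - xi)**3, 0.0) for xi in knots]
    return np.column_stack(cols)

def basis_dd(t):                    # second derivatives f_j''(t) of the same columns
    t = np.atleast_1d(t)
    cols = [np.zeros_like(t), np.zeros_like(t), 2 * np.ones_like(t), 6 * t]
    cols += [np.where(t > xi, 6 * (t - xi), 0.0) for xi in knots]
    return np.column_stack(cols)

N = basis(x)
grid = np.linspace(x.min(), x.max(), 2000)
D = basis_dd(grid)
dg = grid[1] - grid[0]
Omega = (D.T @ D) * dg                                        # approx. Omega_jk = int f_j'' f_k'' dt

lam = 1e-3
beta_hat = np.linalg.solve(N.T @ N + lam * Omega, N.T @ y)    # Ridge-like solve for beta_hat
S = N @ np.linalg.solve(N.T @ N + lam * Omega, N.T)           # smoother matrix S_lambda
df = np.trace(S)                                              # effective degrees of freedom
```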

Choosing the regularization parameter λ

We typically choose $\lambda$ through cross-validation. Fortunately, we can solve the problem for any $\lambda$ with the same complexity as diagonalizing an $n \times n$ matrix (just like in Ridge regression).

There is a shortcut for leave-one-out cross-validation (LOOCV):
$$\mathrm{RSS}_{loocv}(\lambda) = \sum_{i=1}^n \big(y_i - \hat f_\lambda^{(-i)}(x_i)\big)^2 = \sum_{i=1}^n \left[ \frac{y_i - \hat f_\lambda(x_i)}{1 - S_\lambda(i, i)} \right]^2.$$
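A small sketch of that shortcut, assuming a smoother matrix $S_\lambda$ is already available (e.g. computed as in the previous block). The `smoother_matrix` helper mentioned in the comment is hypothetical.

```python
# Illustrative sketch: LOOCV residual sum of squares for a linear smoother,
# computed without refitting the model n times.
import numpy as np

def rss_loocv(y, S):
    """RSS_loocv = sum_i ((y_i - f_hat(x_i)) / (1 - S_ii))^2 for a linear smoother y_hat = S y."""
    y_hat = S @ y
    return np.sum(((y - y_hat) / (1.0 - np.diag(S))) ** 2)

# To choose lambda, evaluate rss_loocv over a grid and keep the minimizer, e.g.
# best_lam = min(lambdas, key=lambda lam: rss_loocv(y, smoother_matrix(lam)))
# where smoother_matrix(lam) is a hypothetical helper returning S_lambda.
```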

Choosing the regularization parameter λ

[Figure: Smoothing spline fits of Wage vs. Age with 16 degrees of freedom and with 6.8 degrees of freedom chosen by LOOCV.]

Recap: regression splines vs. smoothing splines (see the comparison slide above).

Local linear regression

[Figure: Simulated data with local regression fits using two different spans.]

Local regression idea: at each point, use a regression function fit only to the nearest neighbors of that point. This generalizes KNN regression, which is a form of local constant regression. The span is the fraction of training samples used in each regression.

Local linear regression

To predict the regression function $f$ at an input $x$:
1. Assign a weight $K_i$ to each training point $x_i$, such that $K_i = 0$ unless $x_i$ is one of the $k$ nearest neighbors of $x$, and $K_i$ decreases as the distance $d(x, x_i)$ increases.
2. Perform a weighted least squares regression; i.e. find $(\beta_0, \beta_1)$ which minimize
$$\sum_{i=1}^n K_i (y_i - \beta_0 - \beta_1 x_i)^2.$$
3. Predict $\hat f(x) = \hat\beta_0 + \hat\beta_1 x$.
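A minimal sketch of these three steps, using tricube weights over the $k$ nearest neighbors (as in loess). The function name, data, and span value are illustrative, not from a specific package.

```python
# Illustrative sketch: local linear regression at a single query point x0.
import numpy as np

def local_linear(x_train, y_train, x0, span=0.5):
    k = max(3, int(span * len(x_train)))             # number of neighbors used
    d = np.abs(x_train - x0)
    idx = np.argsort(d)[:k]                          # k nearest neighbors of x0
    w = (1 - (d[idx] / d[idx].max()) ** 3) ** 3      # tricube weights, 0 at the k-th neighbor
    X = np.column_stack([np.ones(k), x_train[idx]])  # design matrix for beta_0 + beta_1 x
    W = np.diag(w)
    beta = np.linalg.solve(X.T @ W @ X, X.T @ W @ y_train[idx])  # weighted least squares
    return beta[0] + beta[1] * x0                    # prediction f_hat(x0)

rng = np.random.default_rng(0)
x = np.sort(rng.uniform(0, 1, 200))
y = np.sin(3 * x) + rng.normal(scale=0.2, size=200)
f_hat = np.array([local_linear(x, y, x0, span=0.3) for x0 in np.linspace(0, 1, 50)])
```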

Local linear regression

[Figure: Local linear regression fits of Wage vs. Age with span 0.2 (16.4 degrees of freedom) and span 0.7 (5.3 degrees of freedom).]

The span, $k/n$, is chosen by cross-validation.

Generalized Additive Models (GAMs)

GAMs are an extension of non-linear models to multiple predictors:
$$\text{wage} = \beta_0 + \beta_1 \cdot \text{year} + \beta_2 \cdot \text{age} + \beta_3 \cdot \text{education} + \epsilon$$
becomes
$$\text{wage} = \beta_0 + f_1(\text{year}) + f_2(\text{age}) + f_3(\text{education}) + \epsilon.$$
The functions $f_1, \dots, f_p$ can be polynomials, natural splines, smoothing splines, local regressions, ...

Fitting a GAM

If the functions $f_1, \dots, f_p$ have a basis representation (e.g. natural cubic splines, polynomials, step functions), we can simply use least squares to fit
$$\text{wage} = \beta_0 + f_1(\text{year}) + f_2(\text{age}) + f_3(\text{education}) + \epsilon.$$
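A hedged least-squares sketch under that assumption: each numeric predictor gets a small truncated power spline basis and the categorical predictor gets dummy variables, so the whole GAM is a single linear model. The synthetic year/age/education data only stand in for the Wage example.

```python
# Illustrative sketch: fitting a GAM by least squares on a basis expansion.
import numpy as np

rng = np.random.default_rng(0)
n = 500
year = rng.uniform(2003, 2009, n)
age = rng.uniform(20, 80, n)
edu = rng.integers(0, 4, n)                                  # 4 education levels, coded 0..3
wage = 0.5 * (year - 2003) + 10 * np.sin(age / 15) + 5 * edu + rng.normal(scale=2, size=n)

def spline_cols(x, n_knots=4):                               # basis columns for one predictor (no intercept)
    x = (x - x.mean()) / x.std()                             # standardize for numerical stability
    knots = np.quantile(x, np.linspace(0, 1, n_knots + 2)[1:-1])
    cols = [x, x**2, x**3] + [np.where(x > xi, (x - xi)**3, 0.0) for xi in knots]
    return np.column_stack(cols)

dummies = (edu[:, None] == np.arange(1, 4)).astype(float)    # step function: level 0 is the baseline
X = np.column_stack([np.ones(n), spline_cols(year), spline_cols(age), dummies])
beta, *_ = np.linalg.lstsq(X, wage, rcond=None)              # fits beta_0 + f1(year) + f2(age) + f3(education)
wage_hat = X @ beta
```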

Fitting a GAM

Otherwise, we can use backfitting:
1. Keep $f_2, \dots, f_p$ fixed, and fit $f_1$ using the partial residuals $y_i - \beta_0 - f_2(x_{i2}) - \dots - f_p(x_{ip})$ as the response.
2. Keep $f_1, f_3, \dots, f_p$ fixed, and fit $f_2$ using the partial residuals $y_i - \beta_0 - f_1(x_{i1}) - f_3(x_{i3}) - \dots - f_p(x_{ip})$ as the response.
3. Continue through $f_3, \dots, f_p$ in the same way.
4. Iterate.

This works for smoothing splines and local regression.
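A minimal sketch of backfitting with a generic one-dimensional smoother. Here the smoother is a simple k-nearest-neighbor running mean rather than the smoothing spline or local regression used in the lecture; all names and data are illustrative.

```python
# Illustrative sketch of the backfitting algorithm for an additive model.
import numpy as np

def knn_smooth(x, r, k=20):
    """Smooth the partial residuals r on x with a running mean over k nearest neighbors."""
    out = np.empty_like(r)
    for j, x0 in enumerate(x):
        idx = np.argsort(np.abs(x - x0))[:k]
        out[j] = r[idx].mean()
    return out - out.mean()                                  # center each f_j so beta_0 stays identifiable

def backfit(X, y, n_iter=20, k=20):
    n, p = X.shape
    beta0 = y.mean()
    f = np.zeros((n, p))                                     # current estimates f_j(x_ij)
    for _ in range(n_iter):
        for j in range(p):
            partial = y - beta0 - f.sum(axis=1) + f[:, j]    # partial residuals: remove all terms except f_j
            f[:, j] = knn_smooth(X[:, j], partial, k)
    return beta0, f

rng = np.random.default_rng(0)
X = rng.uniform(-1, 1, size=(300, 2))
y = np.sin(3 * X[:, 0]) + X[:, 1]**2 + rng.normal(scale=0.2, size=300)
beta0, f = backfit(X, y)                                     # y_hat = beta0 + f[:, 0] + f[:, 1]
```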

Properties of GAMs

- GAMs are a step from linear regression toward a fully nonparametric method.
- The only constraint is additivity. This can be partially addressed by adding key interaction variables $X_i X_j$.
- We can report degrees of freedom for most non-linear functions.
- As in linear regression, we can examine the significance of each of the variables.

Example: Regression for Wage

[Figure: Fitted functions $f_1(\text{year})$, $f_2(\text{age})$, and $f_3(\text{education})$ from the Wage data.]

year: natural spline with df = 4. age: natural spline with df = 5. education: step function.

Example: Regression for Wage

[Figure: Fitted functions $f_1(\text{year})$, $f_2(\text{age})$, and $f_3(\text{education})$ from the Wage data.]

year: smoothing spline with df = 4. age: smoothing spline with df = 5. education: step function.

GAMs for classification

We can model the log-odds in a classification problem using a GAM:
$$\log \frac{P(Y = 1 \mid X)}{P(Y = 0 \mid X)} = \beta_0 + f_1(X_1) + \dots + f_p(X_p).$$
The fitting algorithm is a version of backfitting, but we won't discuss the details.
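The backfitting-based fit alluded to on the slide is not shown here; instead, a hedged alternative sketch: when each $f_j$ has a finite basis, the additive logistic model can be fit by plain Newton/IRLS logistic regression on the expanded design matrix. The data and basis below are synthetic and illustrative.

```python
# Illustrative sketch: logistic GAM for the log-odds via a basis expansion and Newton/IRLS.
import numpy as np

rng = np.random.default_rng(0)
n = 1000
x1 = rng.uniform(-2, 2, n)
x2 = rng.uniform(-2, 2, n)
p_true = 1 / (1 + np.exp(-(np.sin(2 * x1) + 0.5 * x2**2 - 1)))
y = rng.binomial(1, p_true)

def spline_cols(x, n_knots=4):                               # cubic truncated power basis (no intercept)
    knots = np.quantile(x, np.linspace(0, 1, n_knots + 2)[1:-1])
    return np.column_stack([x, x**2, x**3] +
                           [np.where(x > xi, (x - xi)**3, 0.0) for xi in knots])

X = np.column_stack([np.ones(n), spline_cols(x1), spline_cols(x2)])
beta = np.zeros(X.shape[1])
for _ in range(25):                                          # Newton-Raphson / IRLS iterations
    p = 1 / (1 + np.exp(-np.clip(X @ beta, -30, 30)))
    W = p * (1 - p)
    grad = X.T @ (y - p)                                     # gradient of the log-likelihood
    hess = X.T @ (X * W[:, None])                            # (negative) Hessian
    beta += np.linalg.solve(hess, grad)
log_odds = X @ beta                                          # beta_0 + f1(x1) + f2(x2) on the logit scale
```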

Example: Classification for Wage > 250

[Figure: Fitted functions $f_1(\text{year})$, $f_2(\text{age})$, and $f_3(\text{education})$ on the logit scale; with the <HS category included, $f_3(\text{education})$ spans roughly $-400$ to $400$.]

year: linear. age: smoothing spline with df = 5. education: step function.

Example: Classification for Wage > 250

[Figure: Fitted functions $f_1(\text{year})$, $f_2(\text{age})$, and $f_3(\text{education})$ on the logit scale after excluding samples with education < HS; $f_3(\text{education})$ now spans roughly $-4$ to $4$.]

year: linear. age: smoothing spline with df = 5. education: step function. Samples with education < HS are excluded.