Lecture 17: Smoothing splines, Local Regression, and GAMs
Reading: Sections 7.5–7.7
STATS 202: Data mining and analysis
November 6, 2017
Cubic splines
Define a set of knots $\xi_1 < \xi_2 < \dots < \xi_K$. We want the function $f$ in $Y = f(X)$ to:
1. Be a cubic polynomial between every pair of knots $\xi_i, \xi_{i+1}$.
2. Be continuous at each knot.
3. Have continuous first and second derivatives at each knot.
It turns out we can write $f$ in terms of $K + 3$ basis functions:
$$f(x) = \beta_0 + \beta_1 x + \beta_2 x^2 + \beta_3 x^3 + \beta_4 h(x, \xi_1) + \dots + \beta_{K+3} h(x, \xi_K),$$
where
$$h(x, \xi) = \begin{cases} (x - \xi)^3 & \text{if } x > \xi, \\ 0 & \text{otherwise.} \end{cases}$$
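As a hedged illustration (not part of the lecture), here is a minimal numpy sketch that builds this truncated power basis on made-up data and estimates the coefficients by ordinary least squares. The helper name truncated_power_basis, the knot placement, and the data are all assumptions for illustration.

```python
import numpy as np

def truncated_power_basis(x, knots):
    """Columns 1, x, x^2, x^3, then h(x, xi) = (x - xi)^3 for x > xi, 0 otherwise."""
    cols = [np.ones_like(x), x, x**2, x**3]
    for xi in knots:
        cols.append(np.where(x > xi, (x - xi) ** 3, 0.0))
    return np.column_stack(cols)

# Toy data, purely for illustration.
rng = np.random.default_rng(0)
x = np.sort(rng.uniform(0, 10, size=200))
y = np.sin(x) + rng.normal(scale=0.3, size=x.size)

knots = np.quantile(x, [0.25, 0.50, 0.75])      # K = 3 knots at quantiles of x
B = truncated_power_basis(x, knots)             # n x (K + 4) design matrix
beta, *_ = np.linalg.lstsq(B, y, rcond=None)    # least-squares estimates of beta_0, ..., beta_{K+3}
y_hat = B @ beta                                # fitted cubic spline values
```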
Natural cubic splines
A natural cubic spline is linear, instead of cubic, for $X < \xi_1$ and $X > \xi_K$.
[Figure: Wage vs. Age, comparing a natural cubic spline fit to an ordinary cubic spline fit.]
The predictions are more stable for extreme values of $X$.
Choosing the number and locations of knots
The locations of the knots are typically quantiles of $X$. The number of knots, $K$, is chosen by cross-validation (see the sketch after the figure).
[Figure: Cross-validated mean squared error as a function of the degrees of freedom of the natural spline and of the cubic spline.]
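A minimal sketch of this selection procedure, under assumptions that are not in the slides: made-up data, a plain (not natural) cubic-spline basis, and 5-fold rather than leave-one-out cross-validation.

```python
import numpy as np

def spline_basis(x, knots):
    cols = [np.ones_like(x), x, x**2, x**3]
    cols += [np.where(x > xi, (x - xi) ** 3, 0.0) for xi in knots]
    return np.column_stack(cols)

rng = np.random.default_rng(1)
x = np.sort(rng.uniform(0, 10, size=300))
y = np.sin(x) + rng.normal(scale=0.3, size=x.size)

folds = rng.integers(0, 5, size=x.size)                      # 5-fold assignment
cv_mse = {}
for K in range(1, 9):                                        # candidate numbers of knots
    knots = np.quantile(x, np.linspace(0, 1, K + 2)[1:-1])   # K interior quantiles of x
    errs = []
    for f in range(5):
        tr, te = folds != f, folds == f
        beta, *_ = np.linalg.lstsq(spline_basis(x[tr], knots), y[tr], rcond=None)
        errs.append(np.mean((y[te] - spline_basis(x[te], knots) @ beta) ** 2))
    cv_mse[K] = np.mean(errs)

best_K = min(cv_mse, key=cv_mse.get)                         # knot count with lowest CV error
```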
Smoothing splines
Find the function $f$ which minimizes
$$\sum_{i=1}^{n} (y_i - f(x_i))^2 + \lambda \int f''(x)^2 \, dx.$$
The first term is the RSS of the model; the second is a penalty for the roughness of the function.
Facts: the minimizer $\hat f$ is a natural cubic spline, with knots at each sample point $x_1, \dots, x_n$. Obtaining $\hat f$ is similar to a Ridge regression.
Regression splines vs. Smoothing splines
Cubic regression splines:
- Fix the locations of $K$ knots at quantiles of $X$; the number of knots is $K < n$.
- Find the natural cubic spline $\hat f$ which minimizes the RSS, $\sum_{i=1}^{n} (y_i - f(x_i))^2$.
- Choose $K$ by cross-validation.
Smoothing splines:
- Put $n$ knots at $x_1, \dots, x_n$. We could find a cubic spline which makes the RSS equal to 0, which would overfit!
- Instead, we obtain the fitted values $\hat f(x_1), \dots, \hat f(x_n)$ through an algorithm similar to Ridge regression.
- The function $\hat f$ is the only natural cubic spline that has these fitted values.
Deriving a smoothing spline
1. Show that if you fix the values $f(x_1), \dots, f(x_n)$, the roughness $\int f''(x)^2 \, dx$ is minimized by a natural cubic spline (Problem 5.7 in ESL).
2. Deduce that the solution to the smoothing spline problem is a natural cubic spline, which can be written in terms of its basis functions:
$$f(x) = \beta_0 + \beta_1 f_1(x) + \dots + \beta_{n+3} f_{n+3}(x).$$
Deriving a smoothing spline
3. Letting $N$ be the matrix with $N(i, j) = f_j(x_i)$, we can write the objective function as
$$(y - N\beta)^T (y - N\beta) + \lambda \beta^T \Omega_N \beta, \quad \text{where } \Omega_N(i, j) = \int f_i''(t) f_j''(t) \, dt.$$
4. By simple calculus, the coefficients $\hat\beta$ which minimize this objective are
$$\hat\beta = (N^T N + \lambda \Omega_N)^{-1} N^T y.$$
Deriving a smoothing spline
5. Note that the predicted values are a linear function of the observed values:
$$\hat y = \underbrace{N (N^T N + \lambda \Omega_N)^{-1} N^T}_{S_\lambda} \, y.$$
6. The degrees of freedom for a smoothing spline are
$$\mathrm{Trace}(S_\lambda) = S_\lambda(1, 1) + S_\lambda(2, 2) + \dots + S_\lambda(n, n).$$
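To make steps 5 and 6 concrete, here is a hedged numerical sketch. It stands in for the natural-spline basis matrix $N$ with the identity (one "basis function" per sample point) and approximates $\Omega_N$ by a squared second-difference penalty $D^T D$; this discrete analogue is an assumption made for brevity, not the exact construction in the lecture.

```python
import numpy as np

rng = np.random.default_rng(2)
x = np.sort(rng.uniform(0, 10, size=100))
y = np.sin(x) + rng.normal(scale=0.3, size=x.size)
n = x.size

# Stand-in for the basis matrix N: one "basis function" per sample point.
N = np.eye(n)
# Stand-in for Omega_N: squared second differences penalize roughness (assumption).
D = np.diff(np.eye(n), n=2, axis=0)                       # (n-2) x n second-difference operator
Omega = D.T @ D

lam = 5.0
S_lam = N @ np.linalg.solve(N.T @ N + lam * Omega, N.T)   # smoother matrix S_lambda
y_hat = S_lam @ y                                         # fitted values are linear in y
df = np.trace(S_lam)                                      # effective degrees of freedom
```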
Choosing the regularization parameter λ
We typically choose $\lambda$ through cross-validation. Fortunately, we can solve the problem for every $\lambda$ at essentially the cost of diagonalizing a single $n \times n$ matrix (just like in Ridge regression).
There is a shortcut for leave-one-out cross-validation (LOOCV):
$$\mathrm{RSS}_{loocv}(\lambda) = \sum_{i=1}^{n} \left( y_i - \hat f_\lambda^{(-i)}(x_i) \right)^2 = \sum_{i=1}^{n} \left[ \frac{y_i - \hat f_\lambda(x_i)}{1 - S_\lambda(i, i)} \right]^2,$$
which can be evaluated from a single fit, as in the sketch below.
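A hedged sketch of the LOOCV shortcut, reusing the same simplified second-difference stand-in for the smoothing-spline penalty (an assumption, not the lecture's exact $\Omega_N$): compute $S_\lambda$ once per $\lambda$ and evaluate the formula above from its diagonal.

```python
import numpy as np

rng = np.random.default_rng(3)
x = np.sort(rng.uniform(0, 10, size=100))
y = np.sin(x) + rng.normal(scale=0.3, size=x.size)
n = x.size

D = np.diff(np.eye(n), n=2, axis=0)
Omega = D.T @ D                                    # stand-in roughness penalty (assumption)

def loocv_rss(lam):
    S = np.linalg.solve(np.eye(n) + lam * Omega, np.eye(n))   # S_lambda for this simplified setup
    resid = y - S @ y
    return np.sum((resid / (1.0 - np.diag(S))) ** 2)          # LOOCV RSS from one fit

grid = np.logspace(-2, 3, 30)
best_lam = min(grid, key=loocv_rss)                # lambda with the smallest LOOCV RSS
```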
Choosing the regularization parameter λ
[Figure: Smoothing spline fits of Wage vs. Age with 16 degrees of freedom and with 6.8 degrees of freedom chosen by LOOCV.]
Local linear regression
[Figure: Two panels of local regression fits to simulated data, highlighting the neighborhood and weights used for a local fit at one point.]
Idea: at each point, use a regression function fit only to the nearest neighbors of that point. This generalizes KNN regression, which is a form of local constant regression. The span is the fraction of training samples used in each regression.
Local linear regression
To predict the regression function $f$ at an input $x$:
1. Assign a weight $K_i$ to each training point $x_i$, such that $K_i = 0$ unless $x_i$ is one of the $k$ nearest neighbors of $x$, and $K_i$ decreases as the distance $d(x, x_i)$ increases.
2. Perform a weighted least squares regression; i.e. find $(\beta_0, \beta_1)$ which minimize
$$\sum_{i=1}^{n} K_i (y_i - \beta_0 - \beta_1 x_i)^2.$$
3. Predict $\hat f(x) = \hat\beta_0 + \hat\beta_1 x$.
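A minimal sketch of this procedure, with assumptions not stated on the slide: tricube weights for $K_i$, made-up data, and the illustrative helper name local_linear_predict.

```python
import numpy as np

def local_linear_predict(x0, x, y, span=0.3):
    """Predict f(x0) by weighted least squares over the k = span * n nearest neighbors of x0."""
    n = x.size
    k = max(2, int(np.ceil(span * n)))
    d = np.abs(x - x0)
    idx = np.argsort(d)[:k]                        # the k nearest neighbors of x0
    u = d[idx] / d[idx].max()                      # scaled distances in [0, 1]
    w = (1 - u ** 3) ** 3                          # tricube weights (0 for the farthest neighbor)
    X = np.column_stack([np.ones(k), x[idx]])      # local intercept and slope
    W = np.diag(w)
    beta = np.linalg.solve(X.T @ W @ X, X.T @ W @ y[idx])   # weighted least squares
    return beta[0] + beta[1] * x0

rng = np.random.default_rng(4)
x = np.sort(rng.uniform(0, 10, size=200))
y = np.sin(x) + rng.normal(scale=0.3, size=x.size)
f_hat = np.array([local_linear_predict(x0, x, y, span=0.3) for x0 in x])
```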
Local linear regression
[Figure: Local linear regression fits of Wage vs. Age with span 0.2 (16.4 degrees of freedom) and span 0.7 (5.3 degrees of freedom).]
The span, $k/n$, is chosen by cross-validation.
Generalized Additive Models (GAMs)
GAMs extend non-linear models to multiple predictors, replacing the linear model
$$\mathrm{wage} = \beta_0 + \beta_1 \cdot \mathrm{year} + \beta_2 \cdot \mathrm{age} + \beta_3 \cdot \mathrm{education} + \epsilon$$
with
$$\mathrm{wage} = \beta_0 + f_1(\mathrm{year}) + f_2(\mathrm{age}) + f_3(\mathrm{education}) + \epsilon.$$
The functions $f_1, \dots, f_p$ can be polynomials, natural splines, smoothing splines, local regressions, etc.
Fitting a GAM
If the functions $f_j$ have a basis representation (e.g. natural cubic splines, polynomials, step functions), we can simply use least squares to fit
$$\mathrm{wage} = \beta_0 + f_1(\mathrm{year}) + f_2(\mathrm{age}) + f_3(\mathrm{education}) + \epsilon,$$
as in the sketch below.
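A hedged least-squares sketch of this idea on made-up stand-ins for year, age, and education; the basis choices (polynomials and dummy variables), data, and helper names are assumptions for illustration, not the natural splines of the book's Wage example.

```python
import numpy as np

rng = np.random.default_rng(5)
n = 500
year = rng.uniform(2003, 2009, size=n)
age = rng.uniform(20, 80, size=n)
edu = rng.integers(0, 5, size=n)                        # 5 education levels coded 0..4
wage = (2.0 * (year - 2006) + 0.5 * (age - 50) - 0.01 * (age - 50) ** 2
        + 5.0 * edu + rng.normal(scale=5.0, size=n))    # made-up generating model

def poly_basis(x, degree):
    """Centered polynomial basis x, x^2, ..., x^degree (centering only for numerical stability)."""
    xc = x - x.mean()
    return np.column_stack([xc ** d for d in range(1, degree + 1)])

def step_basis(codes, n_levels):
    """Dummy (step-function) basis with the first level as the baseline."""
    return np.eye(n_levels)[codes][:, 1:]

# f1(year): polynomial, f2(age): polynomial, f3(education): step function; one joint LS fit.
X = np.column_stack([np.ones(n), poly_basis(year, 2), poly_basis(age, 3), step_basis(edu, 5)])
beta, *_ = np.linalg.lstsq(X, wage, rcond=None)
wage_hat = X @ beta
```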
Fitting a GAM
Otherwise, we can use backfitting:
1. Keep $f_2, \dots, f_p$ fixed, and fit $f_1$ using the partial residuals $y_i - \beta_0 - f_2(x_{i2}) - \dots - f_p(x_{ip})$ as the response.
2. Keep $f_1, f_3, \dots, f_p$ fixed, and fit $f_2$ using the partial residuals $y_i - \beta_0 - f_1(x_{i1}) - f_3(x_{i3}) - \dots - f_p(x_{ip})$ as the response.
3. ...
4. Iterate.
This works for smoothing splines and local regression; a sketch follows this list.
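A hedged sketch of backfitting on made-up data, using a simple k-nearest-neighbor average as the one-dimensional smoother in place of a smoothing spline or local regression; the names, data, and smoother choice are illustrative only.

```python
import numpy as np

def knn_smooth(x, r, k=30):
    """Simple 1-D smoother (k-NN average) standing in for a smoothing spline / local regression."""
    order = np.argsort(np.abs(x[:, None] - x[None, :]), axis=1)[:, :k]
    return r[order].mean(axis=1)

def backfit(X, y, n_iter=20, k=30):
    """Backfitting for y = beta0 + f_1(X_1) + ... + f_p(X_p); returns beta0 and fitted f_j(x_ij)."""
    n, p = X.shape
    beta0 = y.mean()
    F = np.zeros((n, p))                                    # current fitted values f_j(x_ij)
    for _ in range(n_iter):
        for j in range(p):
            partial = y - beta0 - F.sum(axis=1) + F[:, j]   # partial residuals excluding f_j
            F[:, j] = knn_smooth(X[:, j], partial, k)
            F[:, j] -= F[:, j].mean()                       # center f_j so beta0 stays identifiable
    return beta0, F

# Toy data, purely for illustration.
rng = np.random.default_rng(6)
X = rng.uniform(-2, 2, size=(400, 3))
y = np.sin(X[:, 0]) + X[:, 1] ** 2 + 0.5 * X[:, 2] + rng.normal(scale=0.2, size=400)
beta0, F = backfit(X, y)
```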
Properties of GAMs
GAMs are a step from linear regression toward a fully nonparametric method. The only constraint is additivity; this can be partially addressed by adding key interaction variables $X_i X_j$. We can report degrees of freedom for most non-linear functions, and, as in linear regression, we can examine the significance of each of the variables.
Example: Regression for Wage
[Figure: Fitted functions $f_1(\mathrm{year})$, $f_2(\mathrm{age})$, and $f_3(\mathrm{education})$ from a GAM for Wage; education levels are <HS, HS, <Coll, Coll, >Coll.]
year: natural spline with df = 4. age: natural spline with df = 5. education: step function.
Example: Regression for Wage
[Figure: Fitted functions $f_1(\mathrm{year})$, $f_2(\mathrm{age})$, and $f_3(\mathrm{education})$ from the corresponding smoothing-spline GAM for Wage.]
year: smoothing spline with df = 4. age: smoothing spline with df = 5. education: step function.
GAMs for classification
We can model the log-odds in a classification problem using a GAM:
$$\log \frac{P(Y = 1 \mid X)}{P(Y = 0 \mid X)} = \beta_0 + f_1(X_1) + \dots + f_p(X_p).$$
The fitting algorithm is a version of backfitting, but we won't discuss the details; a simpler fitting sketch (not backfitting) is given below.
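The slide leaves out the fitting details; the following hedged stand-in is not the backfitting algorithm alluded to. It simply fits the additive log-odds model by Newton-Raphson (IRLS) logistic regression on a basis-expanded design, with made-up data and illustrative basis choices.

```python
import numpy as np

def fit_logistic_irls(X, y, n_iter=25):
    """Newton-Raphson (IRLS) for logistic regression; X should contain an intercept column."""
    beta = np.zeros(X.shape[1])
    for _ in range(n_iter):
        p = 1.0 / (1.0 + np.exp(-X @ beta))
        w = p * (1.0 - p)
        beta = beta + np.linalg.solve(X.T @ (w[:, None] * X), X.T @ (y - p))
    return beta

rng = np.random.default_rng(7)
n = 2000
age = rng.uniform(20, 80, size=n)
edu = rng.integers(0, 5, size=n)
a = (age - 50.0) / 10.0                                 # rescaled age, for conditioning
true_logit = -4.0 + 0.8 * a - 0.2 * a ** 2 + 0.8 * edu  # made-up generating model
y = (rng.uniform(size=n) < 1.0 / (1.0 + np.exp(-true_logit))).astype(float)

# Log-odds additive in a polynomial of age and a step function (dummies) of education.
X = np.column_stack([np.ones(n), a, a ** 2, a ** 3, np.eye(5)[edu][:, 1:]])
beta_hat = fit_logistic_irls(X, y)
p_hat = 1.0 / (1.0 + np.exp(-X @ beta_hat))             # fitted P(Y = 1 | X)
```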
Example: Classification for Wage > 250
[Figure: Fitted functions $f_1(\mathrm{year})$, $f_2(\mathrm{age})$, and $f_3(\mathrm{education})$ from a logistic GAM for the event Wage > 250; education levels are <HS, HS, <Coll, Coll, >Coll.]
year: linear. age: smoothing spline with df = 5. education: step function.
Example: Classification for Wage > 250
[Figure: Fitted functions from the same logistic GAM, refit after excluding samples with education < HS.]
year: linear. age: smoothing spline with df = 5. education: step function. Exclude samples with education < HS.