Lecture 17: Smoothing splines, Local Regression, and GAMs

Lecture 17: Smoothing splines, Local Regression, and GAMs
Reading: Sections
STATS 202: Data mining and analysis
November 6, 2017

Cubic splines

Define a set of knots ξ_1 < ξ_2 < ... < ξ_K. We want the function Y = f(X) to:

1. Be a cubic polynomial between every pair of knots ξ_i, ξ_{i+1}.
2. Be continuous at each knot.
3. Have continuous first and second derivatives at each knot.

It turns out that we can write f in terms of K + 3 basis functions:

    f(X) = β_0 + β_1 X + β_2 X^2 + β_3 X^3 + β_4 h(X, ξ_1) + ... + β_{K+3} h(X, ξ_K),

where

    h(x, ξ) = (x − ξ)^3 if x > ξ, and 0 otherwise.
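
To make the basis concrete, here is a minimal sketch (Python; the data and variable names are illustrative, not from the course) that builds this truncated power basis and fits the coefficients by least squares:

```python
import numpy as np

def truncated_power_basis(x, knots):
    # Columns: 1, x, x^2, x^3, h(x, xi_1), ..., h(x, xi_K)
    cols = [np.ones_like(x), x, x**2, x**3]
    cols += [np.where(x > xi, (x - xi) ** 3, 0.0) for xi in knots]
    return np.column_stack(cols)

rng = np.random.default_rng(0)
x = np.sort(rng.uniform(0, 10, 200))
y = np.sin(x) + rng.normal(scale=0.3, size=x.size)

knots = np.quantile(x, [0.25, 0.50, 0.75])    # K = 3 knots at quantiles of X
B = truncated_power_basis(x, knots)           # n x (K + 4) design matrix
beta, *_ = np.linalg.lstsq(B, y, rcond=None)  # least-squares coefficients
fitted = B @ beta
```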

Natural cubic splines

A spline which is linear, instead of cubic, for X < ξ_1 and X > ξ_K.

[Figure: cubic spline and natural cubic spline fits of Wage on Age.]

The predictions are more stable for extreme values of X.

Choosing the number and locations of knots

The locations of the knots are typically quantiles of X. The number of knots, K, is chosen by cross-validation.

[Figure: cross-validated mean squared error against the degrees of freedom of a natural spline and of a cubic spline.]
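
A sketch of this selection process, assuming scikit-learn (≥ 1.0) is available. `SplineTransformer` places knots at quantiles of X when `knots="quantile"`; note it uses an equivalent B-spline basis rather than the truncated power basis above:

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import SplineTransformer

rng = np.random.default_rng(1)
X = rng.uniform(0, 10, (300, 1))
y = np.sin(X).ravel() + rng.normal(scale=0.3, size=300)

# Cubic spline basis with knots at quantiles of X, fit by least squares.
model = make_pipeline(SplineTransformer(degree=3, knots="quantile"),
                      LinearRegression())
# Choose the number of knots by 10-fold cross-validated MSE.
search = GridSearchCV(model,
                      {"splinetransformer__n_knots": list(range(3, 13))},
                      scoring="neg_mean_squared_error", cv=10)
search.fit(X, y)
print(search.best_params_)
```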

Smoothing splines

Find the function f which minimizes

    ∑_{i=1}^n (y_i − f(x_i))^2 + λ ∫ f″(x)^2 dx.

The first term is the RSS of the model; the second is a penalty for the roughness of the function.

Facts:
- The minimizer f̂ is a natural cubic spline, with knots at each sample point x_1, ..., x_n.
- Obtaining f̂ is similar to a Ridge regression.
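
As an aside, SciPy ships a routine for exactly this penalized problem; a sketch assuming SciPy ≥ 1.10 (where `make_smoothing_spline` was introduced), with `lam` playing the role of λ:

```python
import numpy as np
from scipy.interpolate import make_smoothing_spline

rng = np.random.default_rng(2)
x = np.sort(rng.uniform(0, 10, 200))   # abscissas must be increasing
y = np.sin(x) + rng.normal(scale=0.3, size=x.size)

# Minimizes sum_i (y_i - f(x_i))^2 + lam * integral of f''(t)^2 dt.
spline = make_smoothing_spline(x, y, lam=1.0)
grid = np.linspace(0, 10, 500)
smoothed = spline(grid)                # evaluate the fitted natural cubic spline
```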

Regression splines vs. Smoothing splines

Cubic regression splines:
- Fix the locations of K knots at quantiles of X; the number of knots K < n.
- Find the natural cubic spline f̂ which minimizes the RSS, ∑_{i=1}^n (y_i − f(x_i))^2.
- Choose K by cross-validation.

Smoothing splines:
- Put n knots at x_1, ..., x_n. We could find a cubic spline which makes the RSS = 0: overfitting!
- Instead, we obtain the fitted values f̂(x_1), ..., f̂(x_n) through an algorithm similar to Ridge regression.
- The function f̂ is the only natural cubic spline that has these fitted values.

Deriving a smoothing spline

1. Show that if you fix the values f(x_1), ..., f(x_n), the roughness ∫ f″(x)^2 dx is minimized by a natural cubic spline (Problem 5.7 in ESL).

2. Deduce that the solution to the smoothing spline problem is a natural cubic spline, which can be written in terms of its basis functions:

    f(x) = β_0 + β_1 f_1(x) + ... + β_{n+3} f_{n+3}(x).

Deriving a smoothing spline

3. Letting N be the matrix with entries N(i, j) = f_j(x_i), we can write the objective function as

    (y − Nβ)^T (y − Nβ) + λ β^T Ω_N β,    where Ω_N(i, j) = ∫ f_i″(t) f_j″(t) dt.

4. By simple calculus, the coefficients β̂ which minimize this objective are

    β̂ = (N^T N + λ Ω_N)^{−1} N^T y.

Deriving a smoothing spline

5. Note that the predicted values are a linear function of the observed values:

    ŷ = N (N^T N + λ Ω_N)^{−1} N^T y = S_λ y.

6. The degrees of freedom for a smoothing spline are

    Trace(S_λ) = S_λ(1, 1) + S_λ(2, 2) + ... + S_λ(n, n).
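
A small numerical sketch of steps 3 through 6. The basis `N` and penalty `Omega` below are illustrative stand-ins (a polynomial basis with a diagonal penalty), not the actual natural cubic spline basis and its second-derivative penalty; the linear algebra is the same:

```python
import numpy as np

def smoother_matrix(N, Omega, lam):
    # S_lam = N (N^T N + lam * Omega)^{-1} N^T
    return N @ np.linalg.solve(N.T @ N + lam * Omega, N.T)

rng = np.random.default_rng(3)
x = np.sort(rng.uniform(0, 1, 50))
y = np.sin(2 * np.pi * x) + rng.normal(scale=0.2, size=x.size)

N = np.vander(x, 8, increasing=True)             # illustrative basis matrix
Omega = np.diag(np.arange(8, dtype=float) ** 3)  # illustrative roughness penalty

for lam in (1e-4, 1e-1):
    S = smoother_matrix(N, Omega, lam)
    y_hat = S @ y              # step 5: fitted values are linear in y
    df = np.trace(S)           # step 6: effective degrees of freedom
    print(f"lam = {lam:g}: effective df = {df:.2f}")
```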

Choosing the regularization parameter λ

We typically choose λ through cross-validation. Fortunately, we can solve the problem for every value of λ with the same complexity as diagonalizing an n × n matrix (just like in Ridge regression).

There is a shortcut for LOOCV:

    RSS_loocv(λ) = ∑_{i=1}^n (y_i − f̂_λ^{(−i)}(x_i))^2 = ∑_{i=1}^n [ (y_i − f̂_λ(x_i)) / (1 − S_λ(i, i)) ]^2.
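
For a fixed basis and penalty this shortcut is exact, and it is easy to verify numerically; a self-contained sketch (same illustrative basis and penalty as in the previous block) compares it against brute-force leave-one-out refits:

```python
import numpy as np

rng = np.random.default_rng(4)
n = 40
x = np.sort(rng.uniform(0, 1, n))
y = np.sin(2 * np.pi * x) + rng.normal(scale=0.2, size=n)
N = np.vander(x, 6, increasing=True)             # illustrative basis
Omega = np.diag(np.arange(6, dtype=float) ** 3)  # illustrative penalty
lam = 1e-3

S = N @ np.linalg.solve(N.T @ N + lam * Omega, N.T)
shortcut = np.sum(((y - S @ y) / (1 - np.diag(S))) ** 2)

brute = 0.0                    # refit n times, leaving one point out each time
for i in range(n):
    keep = np.arange(n) != i
    beta = np.linalg.solve(N[keep].T @ N[keep] + lam * Omega,
                           N[keep].T @ y[keep])
    brute += (y[i] - N[i] @ beta) ** 2

print(shortcut, brute)         # the two agree up to rounding error
```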

Choosing the regularization parameter λ

[Figure: smoothing-spline fits of Wage on Age, including one whose 6.8 effective degrees of freedom were selected by LOOCV.]

Local linear regression

Idea: at each point, use a regression function fit only to the nearest neighbors of that point. This generalizes KNN regression, which is a form of local constant regression.

The span is the fraction of training samples used in each regression.

Local linear regression

To predict the regression function f at an input x:

1. Assign a weight K_i to each training point x_i, such that:
   - K_i = 0 unless x_i is one of the k nearest neighbors of x;
   - K_i decreases as the distance d(x, x_i) increases.

2. Perform a weighted least squares regression; i.e. find (β_0, β_1) which minimize

       ∑_{i=1}^n K_i (y_i − β_0 − β_1 x_i)^2.

3. Predict f̂(x) = β̂_0 + β̂_1 x.

A sketch of this procedure appears below.
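
A minimal sketch of the three steps. The tricube weight function is an assumption borrowed from loess; the slide only requires weights that vanish outside the k nearest neighbors and decay with distance:

```python
import numpy as np

def local_linear(x0, x, y, span=0.5):
    # Step 1: weights that are zero outside the k nearest neighbors of x0
    # and decay with distance (tricube kernel).
    k = int(np.ceil(span * len(x)))
    d = np.abs(x - x0)
    u = np.clip(d / np.sort(d)[k - 1], 0.0, 1.0)
    w = (1 - u**3) ** 3
    # Step 2: weighted least squares for (beta_0, beta_1).
    X = np.column_stack([np.ones_like(x), x])
    beta = np.linalg.solve(X.T @ (w[:, None] * X), X.T @ (w * y))
    # Step 3: predict at x0.
    return beta[0] + beta[1] * x0

rng = np.random.default_rng(5)
x = np.sort(rng.uniform(0, 10, 200))
y = np.sin(x) + rng.normal(scale=0.3, size=x.size)
grid = np.linspace(0, 10, 100)
fit = np.array([local_linear(x0, x, y, span=0.2) for x0 in grid])
```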

Local linear regression

[Figure: local linear regression fits of Wage on Age with span 0.2 (16.4 degrees of freedom) and span 0.7 (5.3 degrees of freedom).]

The span, k/n, is chosen by cross-validation.

Generalized Additive Models (GAMs)

An extension of non-linear models to multiple predictors: instead of the linear model

    wage = β_0 + β_1 · year + β_2 · age + β_3 · education + ε,

we fit

    wage = β_0 + f_1(year) + f_2(age) + f_3(education) + ε.

The functions f_1, ..., f_p can be polynomials, natural splines, smoothing splines, local regressions, etc.

Fitting a GAM

If the functions f_1, ..., f_p have a basis representation (natural cubic splines, polynomials, step functions), we can simply use least squares to fit

    wage = β_0 + f_1(year) + f_2(age) + f_3(education) + ε.
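
A sketch of this least-squares approach, assuming scikit-learn; the data are synthetic stand-ins for the Wage variables, and each predictor gets its own basis expansion inside one design matrix that is fit jointly:

```python
import numpy as np
from sklearn.compose import ColumnTransformer
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import OneHotEncoder, SplineTransformer

rng = np.random.default_rng(6)
n = 500
year = rng.integers(2003, 2010, n).astype(float)
age = rng.uniform(18, 80, n)
edu = rng.integers(0, 5, n).astype(float)        # coded education level
wage = 0.1 * year + np.sin(age / 10) + edu + rng.normal(size=n)
X = np.column_stack([year, age, edu])

# One basis expansion per predictor, then a single least-squares fit:
# wage = beta_0 + f_1(year) + f_2(age) + f_3(education) + noise.
gam = make_pipeline(
    ColumnTransformer([
        ("f1_year", SplineTransformer(n_knots=4), [0]),  # B-spline basis in year
        ("f2_age", SplineTransformer(n_knots=5), [1]),   # B-spline basis in age
        ("f3_edu", OneHotEncoder(), [2]),                # step function in education
    ]),
    LinearRegression(),
)
gam.fit(X, wage)
```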

Fitting a GAM

Otherwise, we can use backfitting:

1. Keep f_2, ..., f_p fixed, and fit f_1 using the partial residuals
   y_i − β_0 − f_2(x_i2) − ... − f_p(x_ip)
   as the response.

2. Keep f_1, f_3, ..., f_p fixed, and fit f_2 using the partial residuals
   y_i − β_0 − f_1(x_i1) − f_3(x_i3) − ... − f_p(x_ip)
   as the response.

3. Continue through f_3, ..., f_p, then iterate until the fits stabilize.

This works for smoothing splines and local regression; a sketch follows this list.
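
A minimal backfitting sketch. The inner smoother here is a cubic polynomial fit of the partial residuals, a deliberately simple stand-in; a smoothing spline or local regression smoother could be dropped in unchanged:

```python
import numpy as np

def smooth(xj, r):
    # Stand-in 1-D smoother: cubic polynomial fit to the partial residuals.
    return np.polyval(np.polyfit(xj, r, 3), xj)

def backfit(X, y, n_iter=20):
    n, p = X.shape
    beta0 = y.mean()
    f = np.zeros((n, p))                 # current fitted values of f_1, ..., f_p
    for _ in range(n_iter):
        for j in range(p):
            # Partial residuals: subtract the intercept and all other f_k.
            r = y - beta0 - (f.sum(axis=1) - f[:, j])
            f[:, j] = smooth(X[:, j], r)
            f[:, j] -= f[:, j].mean()    # center each f_j for identifiability
    return beta0, f

rng = np.random.default_rng(7)
X = rng.uniform(-2, 2, (400, 3))
y = 1 + np.sin(X[:, 0]) + X[:, 1]**2 + 0.5 * X[:, 2] + rng.normal(scale=0.3, size=400)
beta0, f_hat = backfit(X, y)
```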

Properties of GAMs

- GAMs are a step from linear regression toward a fully nonparametric method.
- The only constraint is additivity. This can be partially addressed by adding key interaction variables X_i X_j.
- We can report degrees of freedom for most non-linear functions.
- As in linear regression, we can examine the significance of each of the variables.

Example: Regression for Wage

[Figure: fitted functions f_1(year), f_2(age), and f_3(education), with education levels <HS, HS, <Coll, Coll, >Coll.]

- year: natural spline with df = 4.
- age: natural spline with df = 5.
- education: step function.

Example: Regression for Wage

[Figure: fitted functions f_1(year), f_2(age), and f_3(education), as above.]

- year: smoothing spline with df = 4.
- age: smoothing spline with df = 5.
- education: step function.

GAMs for classification

We can model the log-odds in a classification problem using a GAM:

    log [ P(Y = 1 | X) / P(Y = 0 | X) ] = β_0 + f_1(X_1) + ... + f_p(X_p).

The fitting algorithm is a version of backfitting, but we won't discuss the details.
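
For completeness, a sketch of fitting such a model with the third-party pygam package (an assumption: pygam is not part of the course material, and its backfitting-style internals stay hidden behind the API); the data are synthetic stand-ins:

```python
import numpy as np
from pygam import LogisticGAM, f, s   # third-party: pip install pygam

rng = np.random.default_rng(8)
n = 500
X = np.column_stack([
    rng.uniform(2003, 2009, n),       # year
    rng.uniform(18, 80, n),           # age
    rng.integers(0, 5, n),            # education level, coded 0..4
])
logit = -4 + 0.05 * (X[:, 1] - 40) + 0.5 * X[:, 2]
y = (rng.random(n) < 1 / (1 + np.exp(-logit))).astype(int)

# log-odds = beta_0 + f_1(year) + f_2(age) + f_3(education)
gam = LogisticGAM(s(0) + s(1) + f(2)).fit(X, y)
probs = gam.predict_proba(X)          # estimated P(Y = 1 | X)
```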

Example: Classification for Wage > 250

[Figure: fitted functions f_1(year), f_2(age), and f_3(education) on the logit scale.]

- year: linear.
- age: smoothing spline with df = 5.
- education: step function.

Example: Classification for Wage > 250

[Figure: the same fits after excluding samples with education < HS.]

- year: linear.
- age: smoothing spline with df = 5.
- education: step function.
- Samples with education < HS are excluded.
