Lecture 16: High-dimensional regression, non-linear regression

Reading: Sections 6.4, 7.1. STATS 202: Data mining and analysis, November 3, 2017.

High-dimensional regression

Most of the methods we've discussed work best when n is much larger than p. However, the case p > n is now common, due to experimental advances and cheaper computers:

1. Medicine: Instead of regressing heart disease onto just a few clinical observations (blood pressure, salt consumption, age), we use in addition 500,000 single nucleotide polymorphisms.

2. Marketing: Using search terms to understand online shopping patterns. A bag-of-words model defines one feature for every possible search term, which counts the number of times the term appears in a person's searches. There can be as many features as words in the dictionary.

Some problems we have talked about

When n = p, we can find a fit that goes through every point.

[Figure: two panels plotting Y against X.]

Least-squares regression doesn't have a unique solution when p > n. Fortunately, we can use regularization methods, such as variable selection, ridge regression, and the lasso.

Measures of training error are a bad proxy for the test error.

[Figure: R², training MSE, and test MSE plotted against the number of variables.]
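A quick illustration with simulated data (a sketch, not taken from the slides): as p approaches n, ordinary least squares drives the training MSE toward zero while the test MSE explodes, whereas a regularized fit such as the lasso remains usable.

```python
import numpy as np
from sklearn.linear_model import LinearRegression, LassoCV

rng = np.random.default_rng(7)
n, p, n_test = 50, 45, 500
beta = np.zeros(p)
beta[:5] = 2.0                                      # only 5 predictors matter
X = rng.normal(size=(n + n_test, p))
y = X @ beta + rng.normal(size=n + n_test)
Xtr, ytr, Xte, yte = X[:n], y[:n], X[n:], y[n:]

for name, model in [("least squares", LinearRegression()),
                    ("lasso (CV)", LassoCV(cv=5))]:
    model.fit(Xtr, ytr)
    print(f"{name:14s} train MSE = {np.mean((ytr - model.predict(Xtr))**2):6.2f}   "
          f"test MSE = {np.mean((yte - model.predict(Xte))**2):6.2f}")
```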

Some new problems

[Figure: two panels plotting Y against X.]

Furthermore, it becomes hard to estimate the noise σ². (The typical unbiased estimate is σ̂² = RSS / (n − p − 1) for a regression with an intercept and p additional predictors.) Measures of model fit such as Cp, AIC, and BIC fail without a reasonable noise estimate.
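A small sketch (simulated data, assumed setup) of why this estimate breaks down: as p approaches n, the residual sum of squares and the denominator n − p − 1 both collapse, so σ̂² no longer tracks the true noise level.

```python
import numpy as np

rng = np.random.default_rng(0)

def noise_estimate(X, y):
    """Unbiased noise estimate sigma^2_hat = RSS / (n - p - 1)
    for least squares with an intercept and p predictors."""
    n, p = X.shape
    Xd = np.column_stack([np.ones(n), X])          # add intercept column
    beta, *_ = np.linalg.lstsq(Xd, y, rcond=None)  # least-squares fit
    rss = np.sum((y - Xd @ beta) ** 2)
    return rss / (n - p - 1)

n, sigma = 50, 2.0
X_full = rng.normal(size=(n, 45))
y = 3 * X_full[:, 0] + rng.normal(scale=sigma, size=n)   # only one real predictor

# As p grows toward n, RSS shrinks toward 0 and so does the denominator,
# so the estimate becomes useless even though the true sigma^2 = 4.
for p in [1, 10, 30, 45]:
    print(p, noise_estimate(X_full[:, :p], y))
```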

Some new problems

[Figure: test error of the lasso against degrees of freedom, in three panels with p = 20, p = 50, and a larger p.]

Simulation with n = 100 observations. In each case, only 20 predictors are associated with the response. The plots show the test error of the lasso.

Message: Adding irrelevant predictors hurts the performance of the regression!
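A rough sketch of a simulation in this spirit (the exact design in the slides may differ): keep n = 100 and 20 truly relevant predictors, append irrelevant ones, and watch the cross-validated lasso's test error grow.

```python
import numpy as np
from sklearn.linear_model import LassoCV

rng = np.random.default_rng(1)
n, n_signal = 100, 20
beta_true = rng.normal(size=n_signal)

def simulate(p_total, n_test=500):
    # only the first n_signal columns carry signal; the rest are noise features
    X = rng.normal(size=(n + n_test, p_total))
    y = X[:, :n_signal] @ beta_true + rng.normal(size=n + n_test)
    model = LassoCV(cv=5).fit(X[:n], y[:n])          # lambda chosen by cross-validation
    resid = y[n:] - model.predict(X[n:])
    return np.mean(resid ** 2)                        # test MSE

for p in [20, 50, 500]:
    print(f"p = {p:4d}, lasso test MSE ≈ {simulate(p):.2f}")
```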

Interpreting coefficients when p > n

When p > n, we always have multicollinearity: some predictors are a linear combination of other predictors. Typically there isn't a single best set of variables for prediction: many sets are equally good. The lasso and ridge regression will simply choose one set of coefficients.

Take-away: Don't overstate the importance of the predictors selected in high dimensions.
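A toy illustration (not from the slides): duplicate a predictor and note that the lasso puts weight on only one of the two interchangeable copies; which copy gets "selected" is essentially arbitrary.

```python
import numpy as np
from sklearn.linear_model import Lasso

rng = np.random.default_rng(2)
n = 100
x = rng.normal(size=n)
X = np.column_stack([x, x])                  # two identical predictors
y = 2 * x + rng.normal(scale=0.5, size=n)

fit = Lasso(alpha=0.1).fit(X, y)
print(fit.coef_)   # typically one coefficient near 2 and the other exactly 0,
                   # even though either copy would predict equally well
```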

Beyond linearity

Problem: How do we model a non-linear relationship between inputs and output?

[Figure: degree-4 polynomial fits. Left: regression of wage onto age. Right: logistic regression of Pr(wage > 250 | age) for the classes wage > 250 and wage ≤ 250. Dotted lines are ±2 standard errors.]

Basis functions

Strategy: Leverage linear regression technology.

Define a model with derived features f_j(X):

Y = β_0 + β_1 f_1(X) + β_2 f_2(X) + ... + β_d f_d(X).

Fit this model through least-squares regression (possibly using regularization).

Options for features / basis functions f_1, ..., f_d (a code sketch of the first two options follows below):

1. Polynomials, f_i(x) = x^i.
2. Indicator functions, f_i(x) = 1(c_i ≤ x < c_{i+1}).

[Figure: piecewise-constant fit of wage onto age (left) and of Pr(wage > 250 | age) onto age (right).]
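A minimal sketch of options 1 and 2 (simulated data, hypothetical helper names): build the polynomial and step-function design matrices explicitly and fit each by ordinary least squares.

```python
import numpy as np

rng = np.random.default_rng(3)
x = np.sort(rng.uniform(18, 80, size=200))           # e.g. "age"
y = 50 + 0.8 * x - 0.008 * x**2 + rng.normal(scale=5, size=x.size)

def design_poly(x, degree):
    # columns 1, x, x^2, ..., x^degree (intercept included as x^0)
    return np.column_stack([x**i for i in range(degree + 1)])

def design_steps(x, cuts):
    # one indicator per bin 1(c_i <= x < c_{i+1}); the bins already span the
    # constant function, so no separate intercept column is needed
    edges = np.concatenate([[-np.inf], cuts, [np.inf]])
    cols = [((x >= lo) & (x < hi)).astype(float)
            for lo, hi in zip(edges[:-1], edges[1:])]
    return np.column_stack(cols)

for name, N in [("degree-4 polynomial", design_poly(x, 4)),
                ("piecewise constant", design_steps(x, cuts=[35, 50, 65]))]:
    beta, *_ = np.linalg.lstsq(N, y, rcond=None)      # least-squares fit
    print(name, "RSS =", round(np.sum((y - N @ beta) ** 2), 1))
```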

Basis functions

Options for features / basis functions f_1, ..., f_d:

3. Piecewise polynomials.

[Figure: wage vs. age under four fits: piecewise cubic, continuous piecewise cubic, cubic spline, and linear spline.]

Cubic splines

Define a set of knots (a.k.a. break points) ξ_1 < ξ_2 < ... < ξ_K. We want the function Y = f(X) to:

1. Be a cubic polynomial between every pair of knots ξ_i, ξ_{i+1}.
2. Be continuous at each knot.
3. Have continuous first and second derivatives at each knot.

It turns out we can write f in terms of K + 3 basis functions:

f(X) = β_0 + β_1 X + β_2 X² + β_3 X³ + β_4 h(X, ξ_1) + ... + β_{K+3} h(X, ξ_K),

where h(x, ξ) = (x − ξ)³ if x > ξ, and 0 otherwise.

Cubic splines are popular because (allegedly) most human eyes cannot detect discontinuities in the third derivative at the knots.
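A sketch of this basis in code (simulated data, hypothetical helper names): build the truncated power basis 1, x, x², x³, h(x, ξ_1), ..., h(x, ξ_K) and fit it by least squares, with knots placed at quantiles of x.

```python
import numpy as np

def h(x, xi):
    # truncated cubic: (x - xi)^3 for x > xi, else 0
    return np.where(x > xi, (x - xi) ** 3, 0.0)

def cubic_spline_design(x, knots):
    cols = [np.ones_like(x), x, x**2, x**3]
    cols += [h(x, xi) for xi in knots]
    return np.column_stack(cols)              # n x (K + 4) design matrix

rng = np.random.default_rng(4)
x = np.sort(rng.uniform(18, 80, size=200))
y = np.sin(x / 10) * 20 + rng.normal(scale=3, size=x.size)

knots = np.quantile(x, [0.25, 0.5, 0.75])     # knots at quantiles of X
N = cubic_spline_design(x, knots)
beta, *_ = np.linalg.lstsq(N, y, rcond=None)
yhat = N @ beta                               # fitted cubic spline
print("RSS =", np.sum((y - yhat) ** 2))
```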

Natural cubic splines

A natural cubic spline is a spline which is linear instead of cubic for X < ξ_1 and X > ξ_K.

[Figure: wage vs. age, comparing a natural cubic spline with a cubic spline.]

The predictions are more stable for extreme values of X.

Choosing the number and locations of knots

The locations of the knots are typically quantiles of X. The number of knots, K, is chosen by cross-validation (a sketch follows below):

[Figure: cross-validated mean squared error against the degrees of freedom of the natural spline (left) and of the cubic spline (right).]
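A sketch of this selection step (simulated data, hypothetical helper names): compute the K-fold cross-validated MSE of the truncated power basis fit for a range of knot counts, with knots re-placed at quantiles of the training fold each time.

```python
import numpy as np
from sklearn.model_selection import KFold

def cubic_spline_design(x, knots):
    # truncated power basis: 1, x, x^2, x^3, (x - xi)_+^3 for each knot xi
    cols = [np.ones_like(x), x, x**2, x**3]
    cols += [np.where(x > xi, (x - xi) ** 3, 0.0) for xi in knots]
    return np.column_stack(cols)

rng = np.random.default_rng(5)
x = np.sort(rng.uniform(18, 80, size=200))
y = np.sin(x / 10) * 20 + rng.normal(scale=3, size=x.size)

def cv_mse(x, y, n_knots, n_folds=10):
    errors = []
    for train, test in KFold(n_folds, shuffle=True, random_state=0).split(x):
        knots = np.quantile(x[train], np.linspace(0, 1, n_knots + 2)[1:-1])
        beta, *_ = np.linalg.lstsq(cubic_spline_design(x[train], knots),
                                   y[train], rcond=None)
        pred = cubic_spline_design(x[test], knots) @ beta
        errors.append(np.mean((y[test] - pred) ** 2))
    return np.mean(errors)

for K in range(1, 9):
    print(f"{K} knots: CV MSE ≈ {cv_mse(x, y, K):.2f}")
```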

Natural cubic splines vs. polynomial regression

Splines can fit complex functions with few parameters: the polynomial degree stays low, and flexibility comes from adding more knots. Polynomials require high-degree terms to be flexible, and high-degree polynomials can be unstable at the edges of the dataset.

[Figure: wage vs. age, comparing a degree-15 polynomial with a natural cubic spline with 15 degrees of freedom.]

Smoothing splines

Find the function f which minimizes

Σ_{i=1}^n (y_i − f(x_i))² + λ ∫ f''(x)² dx,

where the first term is the RSS of the model and the second is a penalty for the roughness of the function.

Facts:
- The minimizer f̂ is a natural cubic spline, with knots at each sample point x_1, ..., x_n.
- Obtaining f̂ is similar to a ridge regression.

Deriving a smoothing spline

1. Show that if you fix the values f(x_1), ..., f(x_n), the roughness ∫ f''(x)² dx is minimized by a natural cubic spline.

2. Deduce that the solution to the smoothing spline problem is a natural cubic spline, which can be written in terms of its basis functions:

   f(x) = β_0 + β_1 f_1(x) + ... + β_{n+3} f_{n+3}(x).

3. Letting N be the matrix with N(i, j) = f_j(x_i), we can write the objective function as

   (y − Nβ)^T (y − Nβ) + λ β^T Ω_N β,  where Ω_N(i, j) = ∫ f_i''(t) f_j''(t) dt.

4. Similar to the ridge regression setting, the coefficients β̂ which minimize

   (y − Nβ)^T (y − Nβ) + λ β^T Ω_N β

   are β̂ = (N^T N + λ Ω_N)^{-1} N^T y.

5. Note that the predicted values are a linear function of the observed values:

   ŷ = N (N^T N + λ Ω_N)^{-1} N^T y = S_λ y,  where S_λ = N (N^T N + λ Ω_N)^{-1} N^T.
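A linear-algebra sketch of steps 4 and 5 with generic placeholder inputs (a random N and Ω standing in for the actual spline basis and penalty): solve for β̂, form S_λ, and check that ŷ = S_λ y; the trace of S_λ gives the smoother's effective degrees of freedom.

```python
import numpy as np

rng = np.random.default_rng(6)
n, k = 50, 8
N = rng.normal(size=(n, k))            # stand-in for the spline basis evaluated at x_1..x_n
A = rng.normal(size=(k, k))
Omega = A.T @ A                        # stand-in symmetric positive semidefinite penalty
y = rng.normal(size=n)
lam = 1.0

# beta_hat = (N^T N + lambda * Omega)^{-1} N^T y  -- solve a linear system, don't invert
beta_hat = np.linalg.solve(N.T @ N + lam * Omega, N.T @ y)

# Predictions are linear in y:  y_hat = S_lambda @ y
S_lambda = N @ np.linalg.solve(N.T @ N + lam * Omega, N.T)
y_hat = S_lambda @ y
assert np.allclose(y_hat, N @ beta_hat)

# trace(S_lambda) is the effective degrees of freedom of the smoother
print("effective df:", np.trace(S_lambda))
```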
