Moving Beyond Linearity

The truth is never linear! Or almost never! But often the linearity assumption is good enough. When it's not... polynomials, step functions, splines, local regression, and generalized additive models offer a lot of flexibility, without losing the ease and interpretability of linear models.
Polynomial Regression

$y_i = \beta_0 + \beta_1 x_i + \beta_2 x_i^2 + \beta_3 x_i^3 + \cdots + \beta_d x_i^d + \epsilon_i$

[Figure: degree-4 polynomial fit of Wage on Age (left), and Pr(Wage > 250 | Age) (right), over ages 20-80.]
Details

Create new variables $X_1 = X$, $X_2 = X^2$, etc., and then treat as multiple linear regression.

Not really interested in the coefficients; more interested in the fitted function values at any value $x_0$:
$\hat f(x_0) = \hat\beta_0 + \hat\beta_1 x_0 + \hat\beta_2 x_0^2 + \hat\beta_3 x_0^3 + \hat\beta_4 x_0^4.$

Since $\hat f(x_0)$ is a linear function of the $\hat\beta_\ell$, we can get a simple expression for the pointwise variance $\mathrm{Var}[\hat f(x_0)]$ at any value $x_0$. In the figure we have computed the fit and pointwise standard errors on a grid of values for $x_0$. We show $\hat f(x_0) \pm 2\,\mathrm{se}[\hat f(x_0)]$.

We either fix the degree $d$ at some reasonably low value, or else use cross-validation to choose $d$.
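The pointwise standard errors above follow from ordinary least squares on the polynomial basis. A minimal Python sketch (simulated data; all names and values are illustrative, not from the slides):

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.uniform(20, 80, 200)                 # hypothetical "age"-like predictor
y = 50 + 2.5 * x - 0.03 * x**2 + rng.normal(0, 5, 200)

d = 4
X = np.vander(x, d + 1, increasing=True)     # columns 1, x, x^2, x^3, x^4
beta, *_ = np.linalg.lstsq(X, y, rcond=None)

resid = y - X @ beta
sigma2 = resid @ resid / (len(y) - (d + 1))  # residual variance estimate
cov = sigma2 * np.linalg.inv(X.T @ X)        # Var(beta_hat)

# fit and pointwise se[f_hat(x0)] on a grid of x0 values
x0 = np.linspace(20, 80, 5)
X0 = np.vander(x0, d + 1, increasing=True)
fit = X0 @ beta
se = np.sqrt(np.einsum('ij,jk,ik->i', X0, cov, X0))  # diag(X0 cov X0^T)
lower, upper = fit - 2 * se, fit + 2 * se    # f_hat(x0) +/- 2 se bands
```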
Details continued

Logistic regression follows naturally. For example, in the figure we model
$\Pr(y_i > 250 \mid x_i) = \frac{\exp(\beta_0 + \beta_1 x_i + \beta_2 x_i^2 + \cdots + \beta_d x_i^d)}{1 + \exp(\beta_0 + \beta_1 x_i + \beta_2 x_i^2 + \cdots + \beta_d x_i^d)}.$

To get confidence intervals, compute upper and lower bounds on the logit scale, and then invert to get them on the probability scale.

Can do separately on several variables: just stack the variables into one matrix, and separate out the pieces afterwards (see GAMs later).

Caveat: polynomials have notorious tail behavior, which is very bad for extrapolation.

Can fit using y ~ poly(x, degree = 3) in a formula.
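Inverting logit-scale bounds keeps the interval inside (0, 1), which direct probability-scale bounds would not guarantee. A tiny Python illustration (the fitted logit and its standard error are made-up numbers):

```python
import math

def inv_logit(z):
    """Map a logit value back to a probability in (0, 1)."""
    return 1 / (1 + math.exp(-z))

# hypothetical fitted logit eta = beta^T b(x0) and its standard error
eta, se_eta = -3.0, 0.8
lo, hi = eta - 2 * se_eta, eta + 2 * se_eta  # bounds on the logit scale
p_lo, p_hi = inv_logit(lo), inv_logit(hi)    # invert: both stay inside (0, 1)
```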
Bad tail behavior for polynomial regression

Consider the function known as Runge's example (left panel),
$f(x) = \frac{1}{1 + x^2},$
interpolated using a degree-15 polynomial (right panel). It turns out that high-order interpolation using a global polynomial is often dangerous.
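The danger is easy to reproduce numerically. The sketch below uses the classic scaling of Runge's function, $1/(1+25x^2)$ on $[-1, 1]$ (the slide's $1/(1+x^2)$ behaves the same way on a wide interval), with 16 equispaced nodes giving a degree-15 interpolant:

```python
import numpy as np

f = lambda x: 1 / (1 + 25 * x**2)   # classic Runge function on [-1, 1]

n = 16                              # 16 equispaced nodes -> degree-15 interpolant
nodes = np.linspace(-1, 1, n)
coefs = np.polyfit(nodes, f(nodes), n - 1)

grid = np.linspace(-1, 1, 1001)
err = np.abs(np.polyval(coefs, grid) - f(grid))
# the interpolant matches at the nodes but oscillates wildly near the endpoints
```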
Step Functions

Another way of creating transformations of a variable: cut the variable into distinct regions.
$C_1(X) = I(X < 35),\quad C_2(X) = I(35 \le X < 50),\ \ldots,\ C_3(X) = I(X \ge 65)$

[Figure: piecewise-constant fit of Wage on Age (left), and Pr(Wage > 250 | Age) (right), over ages 20-80.]
Step functions

Notice that for any value of $X$, $C_0(X) + C_1(X) + \cdots + C_K(X) = 1$, since $X$ must be in exactly one of the $K+1$ intervals. We then use least squares to fit a linear model using $C_1(X), C_2(X), \ldots, C_K(X)$ as predictors:
$y_i = \beta_0 + \beta_1 C_1(x_i) + \beta_2 C_2(x_i) + \cdots + \beta_K C_K(x_i) + \epsilon_i$

For logistic regression,
$\Pr(y_i > 250 \mid x_i) = \frac{\exp(\beta_0 + \beta_1 C_1(x_i) + \cdots + \beta_K C_K(x_i))}{1 + \exp(\beta_0 + \beta_1 C_1(x_i) + \cdots + \beta_K C_K(x_i))}$
Step functions continued

Easy to work with. Creates a series of dummy variables representing each group.

Useful way of creating interactions that are easy to interpret. For example, the interaction effect of Year and Age,
$I(\mathrm{Year} < 2005)\cdot \mathrm{Age}, \quad I(\mathrm{Year} \ge 2005)\cdot \mathrm{Age},$
would allow for different linear functions in each age category.

In R: I(year < 2005) or cut(age, c(18, 25, 40, 65, 90)).

Choice of cutpoints or knots can be problematic. For creating nonlinearities, smoother alternatives such as splines are available.
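The dummy-variable construction behind cut() can be sketched in a few lines of Python (the ages and cutpoints are arbitrary examples):

```python
import numpy as np

age = np.array([22, 30, 41, 55, 67, 73])
cuts = np.array([35, 50, 65])               # cutpoint choices (arbitrary here)
bins = np.digitize(age, cuts)               # 0..3: which interval each age is in

# indicator matrix: one dummy column per interval
C = (bins[:, None] == np.arange(len(cuts) + 1)).astype(float)
# each x lies in exactly one of the K+1 intervals, so rows sum to 1
assert np.all(C.sum(axis=1) == 1)
```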
Basis functions

Polynomial and piecewise-constant regression models are in fact special cases of a basis function approach:
$y_i = \beta_0 + \beta_1 b_1(x_i) + \beta_2 b_2(x_i) + \cdots + \beta_K b_K(x_i) + \epsilon_i,$
where the basis functions $b_1(\cdot), b_2(\cdot), \ldots, b_K(\cdot)$ are fixed and known.

For polynomial regression, $b_j(x_i) = x_i^j$. For piecewise-constant functions, $b_j(x_i) = I(c_j \le x_i < c_{j+1})$.
Piecewise Polynomials

Instead of a single polynomial in $X$ over its whole domain, we can rather use different polynomials in regions defined by knots. E.g. (see figure),
$y_i = \begin{cases} \beta_{01} + \beta_{11} x_i + \beta_{21} x_i^2 + \beta_{31} x_i^3 + \epsilon_i & \text{if } x_i < c; \\ \beta_{02} + \beta_{12} x_i + \beta_{22} x_i^2 + \beta_{32} x_i^3 + \epsilon_i & \text{if } x_i \ge c. \end{cases}$

Better to add constraints to the polynomials, e.g. continuity. Splines have the maximum amount of continuity.
[Figure: fits of Wage on Age (ages 20-70): piecewise cubic, continuous piecewise cubic, cubic spline, and linear spline.]
Linear Splines

A linear spline with knots at $\xi_k$, $k = 1, \ldots, K$, is a piecewise linear polynomial continuous at each knot. We can represent this model as
$y_i = \beta_0 + \beta_1 b_1(x_i) + \beta_2 b_2(x_i) + \cdots + \beta_{K+1} b_{K+1}(x_i) + \epsilon_i,$
where the $b_k$ are basis functions:
$b_1(x_i) = x_i$
$b_{k+1}(x_i) = (x_i - \xi_k)_+, \quad k = 1, \ldots, K.$

Here $(\cdot)_+$ means positive part, i.e.
$(x_i - \xi_k)_+ = \begin{cases} x_i - \xi_k & \text{if } x_i > \xi_k \\ 0 & \text{otherwise.} \end{cases}$
[Figure: a linear spline fit $f(x)$ (top) and one of its basis functions $b(x)$ (bottom), for $x \in [0, 1]$.]
Cubic Splines

A cubic spline with knots at $\xi_k$, $k = 1, \ldots, K$, is a piecewise cubic polynomial with continuous derivatives up to order 2 at each knot.

Again we can represent this model with truncated power basis functions:
$y_i = \beta_0 + \beta_1 b_1(x_i) + \beta_2 b_2(x_i) + \cdots + \beta_{K+3} b_{K+3}(x_i) + \epsilon_i,$
$b_1(x_i) = x_i$
$b_2(x_i) = x_i^2$
$b_3(x_i) = x_i^3$
$b_{k+3}(x_i) = (x_i - \xi_k)^3_+, \quad k = 1, \ldots, K,$
where
$(x_i - \xi_k)^3_+ = \begin{cases} (x_i - \xi_k)^3 & \text{if } x_i > \xi_k \\ 0 & \text{otherwise.} \end{cases}$
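The truncated power basis is simple to build directly. A Python sketch (simulated data; knot locations 0.33 and 0.66 mirror a later figure, everything else is illustrative):

```python
import numpy as np

def cubic_spline_basis(x, knots):
    """Truncated power basis: x, x^2, x^3, then (x - xi_k)^3_+ per knot."""
    cols = [x, x**2, x**3] + [np.clip(x - k, 0, None)**3 for k in knots]
    return np.column_stack(cols)

rng = np.random.default_rng(1)
x = np.sort(rng.uniform(0, 1, 120))
y = np.sin(2 * np.pi * x) + rng.normal(0, 0.2, 120)

knots = [0.33, 0.66]
B = np.column_stack([np.ones_like(x), cubic_spline_basis(x, knots)])
beta, *_ = np.linalg.lstsq(B, y, rcond=None)  # K + 4 = 6 fitted parameters
fit = B @ beta
```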
[Figure: a cubic spline fit $f(x)$ (top) and one of its truncated power basis functions $b(x)$ (bottom), for $x \in [0, 1]$.]
More for spline

Simple linear model: $y_i = \beta_0 + \beta_1 x_i + \epsilon_i$, with the corresponding design matrix $X$.
Quadratic (polynomial) model: $y_i = \beta_0 + \beta_1 x_i + \beta_2 x_i^2 + \epsilon_i$, with the corresponding $X$ matrix.
More for spline

Broken-stick model: $y_i = \beta_0 + \beta_1 x_i + \beta_{11}(x_i - 0.6)_+ + \epsilon_i$
Whip model: $y_i = \beta_0 + \beta_1 x_i + \beta_{11}(x_i - 0.5)_+ + \beta_{12}(x_i - 0.55)_+ + \cdots + \beta_{1k}(x_i - 0.95)_+ + \epsilon_i$, with the corresponding $X$ matrix.

Spline model for $f$:
$f(x) = \beta_0 + \beta_1 x + \sum_{k=1}^{K} b_k (x - \xi_k)_+$
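The broken-stick model above is just least squares on the positive-part basis. A Python sketch on simulated data (the true slope change of 3 at the knot 0.6 is an assumption for the demo):

```python
import numpy as np

pos = lambda t: np.clip(t, 0, None)          # positive part ( . )_+

rng = np.random.default_rng(2)
x = np.sort(rng.uniform(0, 1, 100))
# data from a true broken stick: slope changes by 3 at x = 0.6
y = 1 + 2 * x + 3 * pos(x - 0.6) + rng.normal(0, 0.1, 100)

X = np.column_stack([np.ones_like(x), x, pos(x - 0.6)])
b0, b1, b11 = np.linalg.lstsq(X, y, rcond=None)[0]
# b11 estimates the change in slope at the knot (true value 3)
```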
[Figures: linear spline, cubic spline, and B-spline basis functions.]
Natural cubic spline

Figure setup: $X$ consists of 50 points drawn at random from $U[0, 1]$, with an assumed error model with constant variance. The linear and cubic polynomial fits have two and four degrees of freedom, respectively, while the cubic spline and natural cubic spline each have six degrees of freedom. The cubic spline has two knots at 0.33 and 0.66, while the natural spline has boundary knots at 0.1 and 0.9, and four interior knots uniformly spaced between them.
Natural Cubic Splines

A natural cubic spline extrapolates linearly beyond the boundary knots. This adds $4 = 2 \times 2$ extra constraints, and allows us to put more internal knots for the same degrees of freedom as a regular cubic spline.

[Figure: natural cubic spline vs. cubic spline fits of Wage on Age, ages 20-70.]
Fitting splines in R is easy: bs(x, ...) for splines of any degree, and ns(x, ...) for natural cubic splines, in package splines.

[Figure: natural cubic spline fit of Wage on Age (left), and Pr(Wage > 250 | Age) (right), ages 20-80.]
Knot placement

One strategy is to decide $K$, the number of knots, and then place them at appropriate quantiles of the observed $X$.

A cubic spline with $K$ knots has $K + 4$ parameters or degrees of freedom. A natural spline with $K$ knots has $K$ degrees of freedom.

[Figure: comparison of a degree-14 polynomial and a natural cubic spline, each with 15 df: ns(age, df = 14) vs. poly(age, deg = 14), Wage on Age, ages 20-80.]
Smoothing spline

In the last section, we created splines by specifying a set of knots, producing a sequence of basis functions, and then using least squares to estimate the spline coefficients. Now we introduce a somewhat different approach: the smoothing spline.

In fitting a smooth curve to a set of data, we need to find a function $g(x)$ that fits the observed data well. That is, we want $\mathrm{RSS} = \sum_{i=1}^{n} (y_i - g(x_i))^2$ to be small. However, if we don't put any constraints on $g$, then we can make RSS zero simply by choosing $g$ to interpolate all of the $y_i$. That overfits: it is too rough. How might we ensure that $g$ is smooth?
Smoothing Splines

This section is a little bit mathematical. Consider this criterion for fitting a smooth function $g(x)$ to some data:
$\underset{g \in \mathcal{S}}{\text{minimize}}\ \sum_{i=1}^{n} (y_i - g(x_i))^2 + \lambda \int g''(t)^2\, dt$

The first term is RSS, and tries to make $g(x)$ match the data at each $x_i$.

The second term is a roughness penalty and controls how wiggly $g(x)$ is. It is modulated by the tuning parameter $\lambda \ge 0$.

The smaller $\lambda$, the more wiggly the function, eventually interpolating the $y_i$ when $\lambda = 0$.

As $\lambda \to \infty$, the function $g(x)$ becomes linear.
Smoothing Splines continued

The solution is a natural cubic spline, with a knot at every unique value of $x_i$. The roughness penalty still controls the roughness via $\lambda$.

Some details:

Smoothing splines avoid the knot-selection issue, leaving a single $\lambda$ to be chosen.

The algorithmic details are too complex to describe here. In R, the function smooth.spline() will fit a smoothing spline.

The vector of $n$ fitted values can be written as $\hat{g}_\lambda = S_\lambda y$, where $S_\lambda$ is an $n \times n$ matrix (determined by the $x_i$ and $\lambda$). The effective degrees of freedom are given by
$\mathrm{df}_\lambda = \sum_{i=1}^{n} \{S_\lambda\}_{ii}.$
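The smoother-matrix view can be illustrated numerically. The Python sketch below is not a true smoothing spline: it uses a second-difference penalty on a grid as a discrete stand-in for $\int g''(t)^2\,dt$, which gives the same $(I + \lambda K)^{-1}$ structure and shows the effective df, $\mathrm{trace}(S_\lambda)$, shrinking as $\lambda$ grows:

```python
import numpy as np

n = 50
x = np.linspace(0, 1, n)
y = np.sin(2 * np.pi * x) + np.random.default_rng(3).normal(0, 0.3, n)

# second-difference matrix: a discrete stand-in for the integral of g''^2
D = np.diff(np.eye(n), n=2, axis=0)                  # (n-2) x n

def smoother(lam):
    """S_lambda = (I + lambda K)^{-1} with K = D^T D."""
    return np.linalg.inv(np.eye(n) + lam * D.T @ D)

for lam in (0.1, 10.0):
    S = smoother(lam)
    ghat = S @ y                                     # fitted values g_hat = S y
    df = np.trace(S)                                 # effective degrees of freedom
```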
Smoothing spline

It can be shown that the smoothing-spline criterion has an explicit, finite-dimensional, unique minimizer, which is a natural cubic spline with knots at the unique values of the $x_i$, $i = 1, \ldots, N$. We can write $f(x)$ as
$f(x) = \sum_{j=1}^{N} N_j(x)\, \theta_j,$
where the $N_j(x)$ are an $N$-dimensional set of basis functions. The criterion thus reduces to
$(y - N\theta)^\top (y - N\theta) + \lambda\, \theta^\top \Omega_N \theta, \quad \{\Omega_N\}_{jk} = \int N_j''(t)\, N_k''(t)\, dt,$
with solution $\hat\theta = (N^\top N + \lambda \Omega_N)^{-1} N^\top y$.
Smoothing spline: Reinsch form

$\hat{y} = N(N^\top N + \lambda \Omega_N)^{-1} N^\top y = (I + \lambda K)^{-1} y, \tag{6}$
where $K = N^{-\top} \Omega_N N^{-1}$ is known as the penalty matrix.
Smoothing spline

The eigendecomposition of $S_\lambda$: with $d_k$ the corresponding eigenvalue of $K$, the eigenvalues of $S_\lambda$ are $\rho_k(\lambda) = 1/(1 + \lambda d_k)$.
Smoothing Splines continued: choosing $\lambda$

We can specify df rather than $\lambda$! In R: smooth.spline(age, wage, df = 10).

The leave-one-out (LOO) cross-validated error is given by
$\mathrm{RSS}_{cv}(\lambda) = \sum_{i=1}^{n} \left(y_i - \hat{g}_\lambda^{(-i)}(x_i)\right)^2 = \sum_{i=1}^{n} \left[ \frac{y_i - \hat{g}_\lambda(x_i)}{1 - \{S_\lambda\}_{ii}} \right]^2.$
In R: smooth.spline(age, wage)
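The LOO shortcut holds for any linear smoother with a fixed penalty, so it can be checked numerically without a smoothing-spline routine. The Python sketch below uses a ridge fit on a polynomial basis as a stand-in smoother (an assumption for the demo) and compares the shortcut to brute-force refitting:

```python
import numpy as np

rng = np.random.default_rng(4)
n = 40
x = np.sort(rng.uniform(0, 1, n))
y = np.sin(2 * np.pi * x) + rng.normal(0, 0.2, n)

# any linear smoother with a fixed penalty obeys the identity;
# ridge on a polynomial basis stands in for the smoothing spline here
X = np.vander(x, 6, increasing=True)
lam = 1e-3
S = X @ np.linalg.inv(X.T @ X + lam * np.eye(6)) @ X.T  # smoother matrix S_lambda

# shortcut: one fit, rescale each residual by 1 - {S}_ii
shortcut = np.sum(((y - S @ y) / (1 - np.diag(S))) ** 2)

# brute force: refit n times, each time leaving one observation out
brute = 0.0
for i in range(n):
    keep = np.arange(n) != i
    Xi, yi = X[keep], y[keep]
    bi = np.linalg.solve(Xi.T @ Xi + lam * np.eye(6), Xi.T @ yi)
    brute += (y[i] - X[i] @ bi) ** 2
```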
[Figure: smoothing spline fits of Wage on Age with 16 degrees of freedom and with 6.8 degrees of freedom (LOOCV).]
Local Regression

With a sliding weight function, we fit separate linear fits over the range of $X$ by weighted least squares. See the text for more details, and the loess() function in R.

[Figure: local regression at two target points on simulated data, $x \in [0, 1]$.]
Local regression

Locally weighted regression solves a separate weighted least squares problem at each target point $x_0$; the estimate is then
$\hat{f}(x_0) = \hat{\alpha}(x_0) + \hat{\beta}(x_0)\, x_0.$
Define the vector function $b(x)^\top = (1, x)$, let $B$ be the $N \times 2$ regression matrix with $i$th row $b(x_i)^\top$, and $W(x_0)$ the $N \times N$ diagonal matrix with $i$th diagonal element $K_\lambda(x_0, x_i)$; then
$\hat{f}(x_0) = b(x_0)^\top \left(B^\top W(x_0) B\right)^{-1} B^\top W(x_0)\, y.$
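The weighted least squares formula above translates directly to code. A Python sketch (Gaussian kernel and bandwidth are arbitrary choices; the slides leave the kernel unspecified):

```python
import numpy as np

def local_linear(x0, x, y, tau=0.2):
    """Local linear fit at x0 with a Gaussian kernel of bandwidth tau."""
    w = np.exp(-0.5 * ((x - x0) / tau) ** 2)   # K_lambda(x0, x_i)
    B = np.column_stack([np.ones_like(x), x])  # rows b(x_i)^T = (1, x_i)
    W = np.diag(w)
    theta = np.linalg.solve(B.T @ W @ B, B.T @ W @ y)
    return np.array([1.0, x0]) @ theta         # b(x0)^T theta_hat

rng = np.random.default_rng(5)
x = np.sort(rng.uniform(0, 1, 200))
y = np.sin(2 * np.pi * x) + rng.normal(0, 0.1, 200)
fhat = local_linear(0.5, x, y)                 # true f(0.5) = sin(pi) = 0
```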
Generalized Additive Models

Allows for flexible nonlinearities in several variables, but retains the additive structure of linear models:
$y_i = \beta_0 + f_1(x_{i1}) + f_2(x_{i2}) + \cdots + f_p(x_{ip}) + \epsilon_i.$

[Figure: fitted functions $f_1(\mathrm{year})$, $f_2(\mathrm{age})$, and $f_3(\mathrm{education})$ for the Wage data, with education levels <HS, HS, <Coll, Coll, >Coll.]
Fitting a GAM

If the functions $f_j$ have a basis representation, we can simply use least squares:
- Natural cubic splines
- Polynomials
- Step functions

$\mathrm{wage} = \beta_0 + f_1(\mathrm{year}) + f_2(\mathrm{age}) + f_3(\mathrm{education}) + \epsilon$
Fitting a GAM

Otherwise, we can use backfitting:
1. Keep $f_2, \ldots, f_p$ fixed, and fit $f_1$ using the partial residuals $y_i - \beta_0 - f_2(x_{i2}) - \cdots - f_p(x_{ip})$ as the response.
2. Keep $f_1, f_3, \ldots, f_p$ fixed, and fit $f_2$ using the partial residuals $y_i - \beta_0 - f_1(x_{i1}) - f_3(x_{i3}) - \cdots - f_p(x_{ip})$ as the response.
3. ...
4. Iterate.

This works for smoothing splines and local regression.
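The backfitting loop above can be sketched in Python with a simple stand-in smoother (polynomial fits to the partial residuals; the data-generating functions are illustrative):

```python
import numpy as np

rng = np.random.default_rng(6)
n = 300
x1, x2 = rng.uniform(-1, 1, n), rng.uniform(-1, 1, n)
y = 2 + np.sin(np.pi * x1) + x2**2 + rng.normal(0, 0.1, n)

def smooth(x, r, deg=5):
    """Stand-in smoother: polynomial fit to partial residuals, centered."""
    f = np.polyval(np.polyfit(x, r, deg), x)
    return f - f.mean()                        # center to keep beta0 identifiable

f1, f2 = np.zeros(n), np.zeros(n)
beta0 = y.mean()
for _ in range(20):                            # iterate until (near) convergence
    f1 = smooth(x1, y - beta0 - f2)            # fit f1 on its partial residuals
    f2 = smooth(x2, y - beta0 - f1)            # fit f2 on its partial residuals

resid = y - beta0 - f1 - f2
```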
GAM details

Can fit a GAM simply using, e.g., natural splines:
lm(wage ~ ns(year, df = 5) + ns(age, df = 5) + education)

Coefficients are not that interesting; the fitted functions are. The previous plot was produced using plot.gam.

Can mix terms (some linear, some nonlinear) and use anova() to compare models.

Can use smoothing splines or local regression as well:
gam(wage ~ s(year, df = 5) + lo(age, span = .5) + education)

GAMs are additive, although low-order interactions can be included in a natural way using, e.g., bivariate smoothers or interactions of the form ns(age, df = 5):ns(year, df = 5).
GAMs for classification

$\log\!\left(\frac{p(X)}{1 - p(X)}\right) = \beta_0 + f_1(X_1) + f_2(X_2) + \cdots + f_p(X_p).$

[Figure: fitted functions $f_1(\mathrm{year})$, $f_2(\mathrm{age})$, and $f_3(\mathrm{education})$ for the logistic GAM on the Wage data, with education levels HS, <Coll, Coll, >Coll.]

gam(I(wage > 250) ~ year + s(age, df = 5) + education, family = binomial)
Nonparametric Beta regression

Fractional data restricted to the standard unit interval (0, 1), often with a highly skewed distribution, are commonly encountered. The linear regression model cannot guarantee that the fitted or predicted values will fall into the unit interval (0, 1).

To overcome such issues, one possible approach is to first transform the response so that it can take values within $(-\infty, +\infty)$ and then apply a regression model to the transformed response. Such an approach still has shortcomings: the coefficients cannot easily be interpreted in terms of the original response, and the fractional response is generally asymmetric and highly skewed.

One appealing model is beta regression, in which the response variable is assumed to follow a beta distribution within the unit interval (0, 1).
Nonparametric Beta regression

Let $(y_i, X_i^\top)^\top$, $i = 1, \ldots, n$, be vectors that are independent and identically distributed as $(y, X)$, where $y$ is a response variable restricted to the unit interval (0, 1) and $X_i = (x_{i1}, \ldots, x_{ip})^\top \in \mathbb{R}^p$ is the $i$th observation of the $p$ covariates, which are assumed to be fixed and known.

The beta density is
$f(y; \mu, \phi) = \frac{\Gamma(\phi)}{\Gamma(\mu\phi)\,\Gamma((1-\mu)\phi)}\, y^{\mu\phi - 1} (1 - y)^{(1-\mu)\phi - 1}, \tag{1}$
where $\mu \in (0, 1)$ is the mean of $y$, $\phi > 0$ is a precision parameter, and $\Gamma(\cdot)$ is the gamma function. The variance of $y$ is $\mathrm{var}(y) = \mu(1 - \mu)/(1 + \phi)$.

Linear beta regression model:
$g(\mu_i) = \sum_{j=1}^{p} x_{ij}\,\beta_j, \tag{2}$
where $g(\cdot)$ is a strictly monotonic and twice-differentiable link function that maps (0, 1) into $\mathbb{R}$.
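The mean-precision parametrization in (1) corresponds to the usual Beta$(a, b)$ shapes via $a = \mu\phi$, $b = (1-\mu)\phi$. A short Python check of the density and the stated variance formula (the particular $\mu$, $\phi$ values are arbitrary):

```python
import math

mu, phi = 0.3, 10.0
a, b = mu * phi, (1 - mu) * phi      # Beta(a, b) shape parameters

mean = a / (a + b)                   # equals mu
var = a * b / ((a + b) ** 2 * (a + b + 1))

assert math.isclose(mean, mu)
# matches var(y) = mu (1 - mu) / (1 + phi)
assert math.isclose(var, mu * (1 - mu) / (1 + phi))

def dbeta_mu_phi(y, mu, phi):
    """Beta density f(y; mu, phi) in the mean-precision parametrization."""
    B = math.gamma(mu * phi) * math.gamma((1 - mu) * phi) / math.gamma(phi)
    return y ** (mu * phi - 1) * (1 - y) ** ((1 - mu) * phi - 1) / B
```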
Nonparametric Beta regression

We propose the nonparametric additive beta regression
$g(\mu_i) = \sum_{j=1}^{p} f_j(x_{ij}), \tag{3}$
where $\mu_i = E(y_i)$, $i = 1, \ldots, n$, and $g(\cdot)$ is a strictly monotonic and twice-differentiable link function that maps (0, 1) into $\mathbb{R}$; we use the logit link function $g(\mu) = \log\{\mu/(1 - \mu)\}$ in this study. The $f_j$'s are unknown smooth functions to be estimated, and we suppose that some of them are zero.

Using B-splines to approximate each of the unknown functions $f_j$, we have the approximation
$f_j(x) \approx f_{nj}(x) = \sum_{k=1}^{m_n} \theta_{jk}\, B_k(x)$
with $m_n$ coefficients $\theta_{jk}$, $k = 1, \ldots, m_n$, so that
$g(\mu_i) \approx \theta_0 + \sum_{j=1}^{p} \sum_{k=1}^{m_n} \theta_{jk}\, B_k(x_{ij}). \tag{4}$
Nonparametric Beta regression

The objective function:
$L(\theta, \phi) = l(\theta, \phi) - n \sum_{j=1}^{p} w_j(m_n)\, P_\lambda(\|\theta_j\|_2), \tag{5}$
where $l(\theta, \phi) = \sum_{i=1}^{n} l_i(\mu_i, \phi)$ is the log-likelihood function, with
$l_i(\mu_i, \phi) = \ln\Gamma(\phi) - \ln\Gamma(\mu_i\phi) - \ln\Gamma((1 - \mu_i)\phi) + (\mu_i\phi - 1)\ln y_i + \{(1 - \mu_i)\phi - 1\}\ln(1 - y_i);$
$P_\lambda(\cdot)$ is a penalty function, and $w_j(\cdot)$ is used to rescale the penalty with respect to the dimensionality of the parameter vector $\theta_j$. The penalized likelihood estimators are then defined as
$(\hat\theta_n, \hat\phi) = \arg\max_{\theta, \phi}\, L(\theta, \phi).$
Nonparametric Beta regression

Theorem (Estimation consistency). Define $\hat{F}_T = \{j : \|\hat\theta_{nj}\|_2 \ne 0,\ 1 \le j \le p\}$, and let $|M|$ denote the cardinality of any set $M \subset \{1, \ldots, p\}$. Under conditions (C1)-(C4), we can obtain the following:
(i) With probability converging to 1, $|\hat{F}_T| \le M_1 |F_T| = M_1 q$ for a finite constant $M_1 > 1$.
(ii) If $\max_j \{P'_\lambda(\|\theta_j\|_2)\}^2\, m_n / (n\lambda^2) \to 0$ as $n \to \infty$, then $P(F_T \subset \hat{F}_T) \to 1$.
(iii) $\displaystyle\sum_{j=1}^{p} \|\hat\theta_{nj} - \theta_j\|_2^2 = O_p\!\left(m_n^{-2d+1}\right) + O_p\!\left(\frac{\lambda^4 \max_j \{P'_\lambda(\|\theta_j\|_2)\}^2\, m_n^2}{n^2}\right).$
Theorem (Selection consistency). Under conditions (C1)-(C5), we have:
(i) $P(\hat\theta_{nj} = 0,\ j \notin F_T) \to 1$.
Define $\hat{f}_j(x) = \sum_{k=1}^{m_n} \hat\theta_{jk}\, B_k(x)$. Under conditions (C1)-(C5), we have:
(i) $P\big(\|\hat{f}_j\|_2 > 0,\ j \in F_T,\ \text{and}\ \|\hat{f}_j\|_2 = 0,\ j \in F / F_T\big) \to 1.$
(ii) $\displaystyle\sum_{j=1}^{p} \|\hat{f}_j - f_j\|_2^2 = O_p\!\left(m_n^{-2d}\right) + O_p\!\left(\frac{\lambda^4 m_n \big(\max_j \{P'_\lambda(\|\theta_j\|_2)\}\big)^2}{n^2}\right).$