2. Basis expansion and regularization
(Section based on chapter 5.)

A popular method for moving beyond linearity.

Idea: augment the vector of inputs x with additional variables that are transformations of x, and use linear models in this new space of derived input features.

Model: a linear basis expansion
  f(X) = Σ_{m=1}^{M} β_m h_m(X)
where h_m : R^p → R is the m-th transformation.

Examples of transformations (a small fitting sketch in R is given at the end of this part):
  h_m(x) = x_m for m = 1, ..., p recovers the original linear model
  h_m(x) = x_j^2 or h_m(x) = x_i x_j for i, j = 1, ..., p allows polynomial terms
  h_m(x) = log(x_j), h_m(x) = sqrt(x_j), ...
  h_m(x) = I(L_m ≤ x_j < U_m), an indicator of a region of x_j

Piecewise polynomials and splines

In this section we consider piecewise polynomials and splines. Until further notice, x is one-dimensional.
Piecewise polynomial function: divide the domain into contiguous intervals and represent f by a separate polynomial in each interval.
Examples: piecewise constant, piecewise linear, etc.
We may want to incorporate continuity restrictions at the knots.
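Returning to the basis-expansion idea above, here is a minimal sketch in R fitting such an expansion by ordinary least squares; the toy data and the interval [0.5, 1) used for the indicator term are assumptions for illustration.

  # A linear basis expansion fitted with lm(): x, x^2, log(x) and an indicator feature.
  set.seed(1)
  x <- runif(100, min = 0.1, max = 2)
  y <- 1 + 2 * x - x^2 + rnorm(100, sd = 0.2)
  # derived input features h_m(x); I() protects the transformations in the formula
  fit <- lm(y ~ x + I(x^2) + log(x) + I(x >= 0.5 & x < 1))
  summary(fit)$coefficients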
Piecewise constant and linear: incorporate constraints in the basis

Smoother functions are often preferred. These can be obtained by increasing the order of the local polynomial.

Cubic spline: continuous function with continuous first and second derivatives at the knots.
The following basis represents a cubic spline with knots at ξ_1 and ξ_2:
  h_1(x) = 1    h_3(x) = x^2    h_5(x) = (x − ξ_1)^3_+
  h_2(x) = x    h_4(x) = x^3    h_6(x) = (x − ξ_2)^3_+
where (·)_+ denotes the positive part.
A natural cubic spline has additional constraints: the function is required to be linear beyond the boundary knots.

Order-M splines

An order-M spline with knots ξ_j, j = 1, ..., K is a piecewise polynomial of order M with continuous derivatives up to order M − 2. The general form of the basis set is
  h_j(x) = x^{j−1} for j = 1, ..., M
  h_{M+k}(x) = (x − ξ_k)^{M−1}_+ for k = 1, ..., K
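As a sketch, the six cubic-spline basis functions above can be evaluated directly in R using the positive part pmax(·, 0); the knot values ξ_1 = 0.33 and ξ_2 = 0.66 are assumptions for illustration.

  # Truncated power basis of a cubic spline with two knots xi1, xi2.
  cubic_spline_basis <- function(x, xi1, xi2) {
    cbind(h1 = 1, h2 = x, h3 = x^2, h4 = x^3,
          h5 = pmax(x - xi1, 0)^3,   # (x - xi1)^3_+
          h6 = pmax(x - xi2, 0)^3)   # (x - xi2)^3_+
  }
  H <- cubic_spline_basis(seq(0, 1, length.out = 50), xi1 = 0.33, xi2 = 0.66)
  dim(H)   # 50 evaluation points by 6 basis functions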
Regression splines

Fixed-knot splines are also called regression splines. One needs to select (i) the order of the spline M, (ii) the number of knots K and (iii) their placement.
The spline is fitted via regression using the model
  y = f(x) = Σ_{m=1}^{M+K} β_m h_m(x)
Often the quantiles of the observations are used to determine the position of the knots (see the R function bs).
The smoothness of the curve is determined by the number and position of the knots: with fewer knots, the curve is smoother.
[Figure: regression spline fitted to noisy data; ydat against x on (0, 1).]

Smoothing splines

Avoid the knot selection problem by:
  maximising the number of knots
  adding a regularization term
Determine the function f that minimises the penalised residual sum of squares
  RSS(f, λ) = Σ_{i=1}^{N} (y_i − f(x_i))^2 + λ ∫ (f''(t))^2 dt
where λ is a smoothing parameter.
If λ = 0: any function f which interpolates the data is eligible.
If λ = ∞: the simple least-squares line fit.
It can be shown that the minimiser is a natural cubic spline:
  f(x) = Σ_{j=1}^{N} N_j(x) θ_j
where the N_j(x) are a set of basis functions representing the family of natural splines. Denoting {N}_{ij} = N_j(x_i), the criterion becomes
  RSS(θ, λ) = (y − Nθ)^T (y − Nθ) + λ θ^T Ω_N θ
where {Ω_N}_{jk} = ∫ N_j''(t) N_k''(t) dt, and the fitted smoothing spline is
  f̂(x) = Σ_{j=1}^{N} N_j(x) θ̂_j    with θ̂ = (N^T N + λ Ω_N)^{−1} N^T y.
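The sketch below contrasts the two approaches on assumed toy data: a regression spline fitted with bs() (cubic B-spline basis, interior knots at quantiles of x) and a smoothing spline fitted with smooth.spline(), where λ is selected automatically from the penalised criterion (generalized cross-validation by default).

  # Regression spline vs smoothing spline on toy data.
  library(splines)
  set.seed(1)
  x <- sort(runif(100))
  y <- sin(2 * pi * x) + rnorm(100, sd = 0.3)
  fit.rs <- lm(y ~ bs(x, df = 6))   # regression spline, knots at quantiles of x
  fit.ss <- smooth.spline(x, y)     # smoothing spline, lambda chosen automatically
  plot(x, y)
  lines(x, fitted(fit.rs), lty = 2)   # regression spline fit
  lines(fit.ss, lty = 1)              # smoothing spline fit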
Degrees of freedom and smoother matrix; automatic selection of the smoothing parameter

The vector of fitted values is linear in y: f̂ = S_λ y, where the smoother matrix
  S_λ = N (N^T N + λ Ω_N)^{−1} N^T
only depends on the {x_i}_i and on λ. The effective degrees of freedom of a smoothing spline is
  df_λ = trace(S_λ)
and λ can be selected automatically, for example by fixing df_λ or by cross-validation.

Other regularisation methods
  Regularization and Reproducing Kernel Hilbert Spaces
  Wavelet smoothing
  ...

Multi-dimensional splines

Each of the approaches above has a multidimensional analogue. Consider x = (x_1, x_2)^T ∈ R^2 and, for i = 1, 2, a basis of functions h_{ik}, k = 1, ..., M_i, for representing functions of x_i. We can then represent functions of x using the tensor product basis
  g(x) = Σ_{j=1}^{M_1} Σ_{k=1}^{M_2} θ_{jk} g_{jk}(x)    where g_{jk}(x) = h_{1j}(x_1) h_{2k}(x_2).
Smoothing splines can also be generalised.
Curse of dimensionality: the dimension of the basis grows exponentially fast (see the sketch below).
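A sketch of the tensor-product construction; the toy data and the basis sizes M_1 = M_2 = 6 are assumptions for illustration.

  # Tensor-product spline basis in two dimensions, fitted by least squares.
  library(splines)
  set.seed(1)
  x1 <- runif(200); x2 <- runif(200)
  y  <- sin(2 * pi * x1) * cos(2 * pi * x2) + rnorm(200, sd = 0.2)
  B1 <- bs(x1, df = 6)   # M1 = 6 basis functions in x1
  B2 <- bs(x2, df = 6)   # M2 = 6 basis functions in x2
  # one column g_jk(x) = h_1j(x1) * h_2k(x2) per pair (j, k): M1 * M2 columns in total
  G <- matrix(NA, nrow = length(y), ncol = ncol(B1) * ncol(B2))
  idx <- 1
  for (j in seq_len(ncol(B1))) {
    for (k in seq_len(ncol(B2))) {
      G[, idx] <- B1[, j] * B2[, k]
      idx <- idx + 1
    }
  }
  fit <- lm(y ~ G)   # 36 coefficients: the basis dimension grows as M1 * M2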
3. Kernel smoothing methods
(Section based on chapter 6.)

Kernel smoothing methods: a class of regression techniques that estimate the regression function f(x) over the domain R^p by fitting a different simple model separately at each query point x_0.
Only the observations close to x_0 are used to fit the simple model. The resulting estimated function f̂(x) is smooth.
The weighting function, or kernel, K_λ(x_0, x_i) assigns a weight to the observation x_i based on its distance from x_0.
λ typically controls the width of the neighbourhood; this is the only parameter that needs to be determined from the training data.

Be careful: same name for different notions!

The word kernel is associated with several distinct mathematical objects. The kernel smoothing methods should not be confused with the kernel methods (seen in the Data Mining course), which consider a positive definite kernel and an associated reproducing kernel Hilbert space; there, the kernel defines an inner product between feature maps.
In this section, the kernel (in one dimension) can be written as
  K_λ(x_0, x) = D((x − x_0) / h_λ(x_0))
where D is a non-negative real-valued integrable function (that integrates to 1).

One-dimensional kernel smoother

An intuitive estimate of the regression function f(x) = E(Y | X = x):
  for each x, use the set N_k(x) of the k nearest neighbours of x
  estimate f(x) by f̂(x) = Average(y_i | x_i ∈ N_k(x))
Remark: f̂(x) is discontinuous in x (see the sketch below). Rather than giving all the points in the neighbourhood the same weight, we can assign weights that depend on the distance between x and the observation.
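A minimal sketch of the k-nearest-neighbour running average; the toy data and k = 10 are assumptions for illustration. The fitted curve shows the discontinuities mentioned above.

  # k-nearest-neighbour running average, evaluated on a grid of query points.
  knn_average <- function(x0, x, y, k = 10) {
    idx <- order(abs(x - x0))[1:k]   # indices of the k observations closest to x0
    mean(y[idx])
  }
  set.seed(1)
  x <- sort(runif(100))
  y <- sin(2 * pi * x) + rnorm(100, sd = 0.3)
  x.grid <- seq(0, 1, length.out = 200)
  f.hat <- sapply(x.grid, knn_average, x = x, y = y, k = 10)
  plot(x, y)
  lines(x.grid, f.hat)   # piecewise-constant-looking, discontinuous fit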
One-dimensional kernel smoother

Example: the Nadaraya–Watson kernel-weighted average
  f̂(x) = Σ_{i=1}^{N} K_λ(x, x_i) y_i / Σ_{i=1}^{N} K_λ(x, x_i)
with the Epanechnikov kernel
  K_λ(x_0, x) = D((x − x_0) / λ),    D(t) = (3/4)(1 − t^2) if |t| ≤ 1 and 0 otherwise.

Settings to determine

The main settings characterising a kernel smoother K_λ(x_0, x) = D((x − x_0) / h_λ(x_0)) are:
  the smoothing parameter λ, which controls the width: a large λ implies lower variance but higher bias
  the function h_λ: a constant h_λ(x) tends to keep the bias of the estimate constant
  the function D; typical examples are
    the Epanechnikov kernel
    the tri-cube function, where D(t) = (1 − |t|^3)^3 if |t| ≤ 1, 0 otherwise
    the Gaussian density function, where the standard deviation plays the role of the window size
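A minimal sketch of the Nadaraya–Watson estimator with the Epanechnikov kernel; the toy data and the choice λ = 0.2 are assumptions for illustration.

  # Nadaraya-Watson kernel-weighted average with the Epanechnikov kernel.
  epanechnikov <- function(t) ifelse(abs(t) <= 1, 0.75 * (1 - t^2), 0)
  nw_estimate <- function(x0, x, y, lambda) {
    w <- epanechnikov((x - x0) / lambda)   # K_lambda(x0, x_i) for all observations
    sum(w * y) / sum(w)                    # kernel-weighted average of the y_i
  }
  set.seed(1)
  x <- sort(runif(100))
  y <- sin(2 * pi * x) + rnorm(100, sd = 0.3)
  x.grid <- seq(0, 1, length.out = 200)
  f.hat <- sapply(x.grid, nw_estimate, x = x, y = y, lambda = 0.2)
  plot(x, y)
  lines(x.grid, f.hat)   # smooth, unlike the k-NN running average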
Local linear regression

Locally weighted linear regression solves a separate weighted least squares problem at each target point x_0:
  min_{α(x_0), β(x_0)} Σ_{i=1}^{N} K_λ(x_0, x_i) [y_i − α(x_0) − β(x_0) x_i]^2
and then
  f̂(x_0) = α̂(x_0) + β̂(x_0) x_0 = b(x_0)^T (B^T W(x_0) B)^{−1} B^T W(x_0) y = Σ_{i=1}^{N} l_i(x_0) y_i
where b(x_0)^T = (1, x_0), B is the N × 2 matrix with i-th row b(x_i)^T, and W(x_0) is the N × N diagonal matrix with i-th diagonal element K_λ(x_0, x_i).
The estimate is linear in y. The weights l_i(x_0) combine the weighting kernel K_λ(x_0, ·) and the least squares operations; they are often called the equivalent kernel.
Using the linearity of the local regression and a series expansion of f around x_0, one can show that the bias E(f̂(x_0)) − f(x_0) only depends on quadratic and higher-order terms in the expansion of f.

Local polynomial regression

Similarly, we can fit a local polynomial of any degree d:
  min_{α(x_0), β_j(x_0), j=1,...,d} Σ_{i=1}^{N} K_λ(x_0, x_i) [y_i − α(x_0) − Σ_{j=1}^{d} β_j(x_0) x_i^j]^2
with solution
  f̂(x_0) = α̂(x_0) + Σ_{j=1}^{d} β̂_j(x_0) x_0^j.
The bias then only depends on terms of degree d + 1 or higher in the expansion of f.
Local linear fits tend to be biased in regions of curvature of the true function; local quadratic regression usually corrects this bias. The price for this bias reduction is increased variance, especially in the tails.

Selecting the width of the kernel

There is a natural bias–variance tradeoff as we change the width of the kernel.
  Narrow window: f̂(x_0) is estimated from a small number of points, hence small bias but large variance.
  Wide window: the variance of f̂(x_0) will be small relative to the variance of any y_i, but the bias will be higher.
As in the previous section, λ can be chosen via cross-validation. A sketch of local linear regression at a single target point is given below.
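A minimal sketch of local linear regression, solving the weighted least squares problem at each target point with lm(..., weights = ...); the kernel, toy data and λ = 0.2 are assumptions for illustration.

  # Local linear regression: a separate weighted least squares fit at each x0.
  epanechnikov <- function(t) ifelse(abs(t) <= 1, 0.75 * (1 - t^2), 0)
  local_linear <- function(x0, x, y, lambda) {
    w <- epanechnikov((x - x0) / lambda)                 # diagonal of W(x0)
    fit <- lm(y ~ x, weights = w)                        # min over alpha(x0), beta(x0)
    unname(predict(fit, newdata = data.frame(x = x0)))   # alpha.hat + beta.hat * x0
  }
  set.seed(1)
  x <- sort(runif(100))
  y <- sin(2 * pi * x) + rnorm(100, sd = 0.3)
  x.grid <- seq(0, 1, length.out = 200)
  f.hat <- sapply(x.grid, local_linear, x = x, y = y, lambda = 0.2)
  plot(x, y)
  lines(x.grid, f.hat)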
In R

The function loess fits a polynomial surface determined by one or more numerical predictors, using local fitting with tri-cube weights. Both the degree of the local polynomial and the smoothing parameter (span) can be chosen.
Example from http://research.stowers-institute.org/efg/r/statistics/loess.htm:

  period <- 120
  x <- 1:120
  y <- sin(2*pi*x/period) + runif(length(x), -1, 1)
  plot(x, y, main = "Sine Curve + Uniform Noise")
  y.loess <- loess(y ~ x, span = 0.75, data.frame(x = x, y = y))
  y.predict <- predict(y.loess, data.frame(x = x))
  lines(x, y.predict)

[Figure from http://research.stowers-institute.org/efg/r/statistics/loess.htm]

Local regression in R^p

Kernel smoothing and local regression generalise very naturally to two or more dimensions. However:
  boundary effects are a much bigger problem in two or more dimensions, because the fraction of points on the boundary is larger;
  it is impossible to simultaneously maintain low bias and low variance as the dimension increases, unless the total sample size increases exponentially in p.
When the ratio of dimension to sample size increases, we may want to incorporate some structural assumptions about the model. A two-predictor loess sketch is given below.
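As a sketch of local regression with two predictors (the data, span and grid are assumptions for illustration), loess accepts several numeric predictors directly:

  # Local regression with two predictors using loess (degree-2 local polynomials).
  set.seed(1)
  x1 <- runif(500); x2 <- runif(500)
  y  <- sin(2 * pi * x1) * cos(2 * pi * x2) + rnorm(500, sd = 0.2)
  fit <- loess(y ~ x1 + x2, span = 0.3, degree = 2)
  # predictions on a small grid of query points
  grid <- expand.grid(x1 = seq(0, 1, by = 0.25), x2 = seq(0, 1, by = 0.25))
  grid$y.hat <- predict(fit, newdata = grid)
  head(grid)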