A popular method for moving beyond linearity: basis expansion and regularization. Examples of transformations; piecewise polynomials and splines.

2. Basis expansion and regularization

A popular method for moving beyond linearity. (Section based on chapter 5.)

Idea: augment the vector of inputs x with additional variables, which are transformations of x, and then use linear models in this new space of derived input features.

Model: a linear basis expansion

    f(X) = \sum_{m=1}^{M} \beta_m h_m(X),

where h_m : R^p -> R is the m-th transformation.

Examples of transformations

- h_m(x) = x_m for m = 1, ..., p recovers the original linear model;
- h_m(x) = x_j^2 or h_m(x) = x_i x_j for i, j = 1, ..., p allows polynomial terms;
- h_m(x) = log(x_j), \sqrt{x_j}, ... allow other nonlinear transformations;
- h_m(x) = 1(L_m <= x < U_m), an indicator for a region of x.

(A short R sketch of a basis expansion follows below.)

Piecewise polynomials and splines

In this section we consider piecewise polynomials and splines. Until further notice, x is one-dimensional.

A piecewise polynomial function is obtained by dividing the domain into contiguous intervals and representing f by a separate polynomial in each interval. Examples: piecewise constant, piecewise linear, etc. We may want to impose continuity restrictions at the knots.
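As a toy illustration (not taken from the lecture; the simulated data and variable names are mine), the R sketch below builds a few basis functions by hand and fits the coefficients \beta_m by ordinary least squares:

    # Minimal sketch: a hand-built basis expansion
    # h_1(x) = x, h_2(x) = x^2, h_3(x) = log(x), fitted by least squares.
    set.seed(1)
    x <- runif(100, min = 1, max = 10)          # one-dimensional input
    y <- 2 + 0.5 * x - 0.1 * x^2 + log(x) + rnorm(100, sd = 0.3)

    H <- data.frame(h1 = x, h2 = x^2, h3 = log(x), y = y)  # derived features
    fit <- lm(y ~ h1 + h2 + h3, data = H)       # linear model in the derived space
    summary(fit)$coefficients                   # estimated beta_m

The point is only that, once the h_m are fixed, everything is an ordinary linear model in the derived features.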

Piecewise constant and linear: incorporate the constraints directly in the basis.

Smoother functions are often preferred; they can be obtained by increasing the order of the local polynomial.

Cubic spline: a continuous function with continuous first and second derivatives at the knots. The following basis represents a cubic spline with knots at \xi_1 and \xi_2 (an R sketch of this basis is given after this slide):

    h_1(x) = 1,    h_3(x) = x^2,    h_5(x) = (x - \xi_1)^3_+,
    h_2(x) = x,    h_4(x) = x^3,    h_6(x) = (x - \xi_2)^3_+.

A natural cubic spline adds the constraint that the function is linear beyond the boundary knots.

Order-M splines

An order-M spline with knots \xi_j, j = 1, ..., K, is a piecewise polynomial of order M with continuous derivatives up to order M - 2. The general form of the basis set is

    h_j(x) = x^{j-1},                   j = 1, ..., M,
    h_{M+k}(x) = (x - \xi_k)^{M-1}_+,   k = 1, ..., K.
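To make the truncated power basis concrete, here is a hedged R sketch (my own toy data; the knot positions are arbitrary) that builds h_1, ..., h_6 by hand and fits a cubic spline with two knots by least squares:

    # Truncated power basis of a cubic spline with knots xi1 and xi2.
    pos <- function(u) pmax(u, 0)                 # (.)_+ : positive part

    set.seed(2)
    x  <- sort(runif(200))
    y  <- sin(2 * pi * x) + rnorm(200, sd = 0.2)  # toy data
    xi1 <- 1/3; xi2 <- 2/3                        # knot locations (arbitrary)

    H <- cbind(1, x, x^2, x^3, pos(x - xi1)^3, pos(x - xi2)^3)  # h_1, ..., h_6
    fit <- lm.fit(H, y)                           # least-squares coefficients
    yhat <- H %*% fit$coefficients

    plot(x, y); lines(x, yhat, lwd = 2)           # fitted cubic spline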

Regression splines

Fixed-knot splines are also called regression splines. One needs to select (i) the order of the spline M, (ii) the number of knots K, and (iii) their placement. The spline is then fitted by regression using the model

    y = f(x) = \sum_{m=1}^{M+K} \beta_m h_m(x).

Often the quantiles of the observations are used to place the knots (see the R function bs). The smoothness of the curve is determined by the number and position of the knots: with fewer knots, the curve is smoother. (See the R sketch after this slide.)

(Figure: regression-spline fit of ydat against x on [0, 1].)

Smoothing splines

Avoid the knot-selection problem by using a maximal number of knots and adding a regularization term. Determine the function f that minimises the penalised residual sum of squares

    RSS(f, \lambda) = \sum_{i=1}^{N} (y_i - f(x_i))^2 + \lambda \int (f''(t))^2 dt,

where \lambda is a smoothing parameter:

- if \lambda = 0, any function f that interpolates the data is eligible;
- if \lambda = \infty, we recover the simple least-squares line fit.

It can be shown that the minimiser is a natural spline, so we can write

    f(x) = \sum_{j=1}^{N} N_j(x) \theta_j,

where the N_j(x) are basis functions for representing the family of natural splines. Writing {N}_{ij} = N_j(x_i), the criterion becomes

    RSS(\theta, \lambda) = (y - N\theta)^T (y - N\theta) + \lambda \theta^T \Omega_N \theta,

where {\Omega_N}_{jk} = \int N_j''(t) N_k''(t) dt, and the fitted smoothing spline is

    \hat f(x) = \sum_{j=1}^{N} N_j(x) \hat\theta_j,    with    \hat\theta = (N^T N + \lambda \Omega_N)^{-1} N^T y.
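As a hedged illustration of both approaches (my own simulated data; the knot choices are arbitrary), bs() from the splines package builds a regression-spline basis and smooth.spline() fits a smoothing spline:

    # Regression spline via splines::bs() and smoothing spline via smooth.spline().
    library(splines)

    set.seed(3)
    x <- sort(runif(150))
    y <- sin(2 * pi * x) + rnorm(150, sd = 0.2)

    # Regression spline: cubic B-spline basis with knots at the terciles of x.
    fit.bs <- lm(y ~ bs(x, knots = quantile(x, c(1/3, 2/3)), degree = 3))

    # Smoothing spline: lambda chosen here by generalized cross-validation.
    fit.ss <- smooth.spline(x, y, cv = FALSE)

    plot(x, y)
    lines(x, fitted(fit.bs), lwd = 2)    # regression spline
    lines(fit.ss, lwd = 2, lty = 2)      # smoothing spline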

Degrees of freedom and smoother matrix; automatic selection of the smoothing parameter

The estimated parameter \hat\theta is linear in y. The smoother matrix

    S_\lambda = N (N^T N + \lambda \Omega_N)^{-1} N^T

depends only on the {x_i} and on \lambda. The effective degrees of freedom of a smoothing spline is

    df_\lambda = trace(S_\lambda).

In practice \lambda is typically selected automatically, either by fixing df_\lambda in advance or by cross-validation (see the R sketch below).

Multi-dimensional splines; other regularisation methods

Each of the approaches above has multidimensional analogues. Consider x = (x_1, x_2)^T in R^2 and bases of functions h_{ik}, k = 1, ..., M_i, for representing functions of x_i, i = 1, 2. We can then represent functions of x using the tensor product basis

    g(x) = \sum_{j=1}^{M_1} \sum_{k=1}^{M_2} \theta_{jk} g_{jk}(x),    where g_{jk}(x) = h_{1j}(x_1) h_{2k}(x_2).

Smoothing splines can also be generalised to several dimensions. Curse of dimensionality: the dimension of the tensor product basis grows exponentially fast with p.

Other regularisation approaches include regularization in Reproducing Kernel Hilbert Spaces and wavelet smoothing.
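A hedged R sketch of both selection routes (my own toy data): smooth.spline() reports the effective degrees of freedom trace(S_\lambda) and lets you either fix df or let generalized cross-validation pick \lambda:

    set.seed(4)
    x <- sort(runif(150))
    y <- sin(2 * pi * x) + rnorm(150, sd = 0.2)

    fit.df  <- smooth.spline(x, y, df = 6)      # fix the effective degrees of freedom
    fit.gcv <- smooth.spline(x, y, cv = FALSE)  # choose lambda by GCV

    c(fixed_df = fit.df$df, gcv_df = fit.gcv$df, gcv_lambda = fit.gcv$lambda)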

3. Kernel smoothing methods

(Section based on chapter 6.)

Kernel smoothing methods are a class of regression techniques that estimate a regression function f(x) over the domain R^p by fitting a different simple model separately at each query point x_0. Only the observations close to x_0 are used to fit the simple model, and the resulting estimated function \hat f(x) is smooth. The weighting function, or kernel, K_\lambda(x_0, x_i) assigns a weight to the observation x_i based on its distance from x_0. The parameter \lambda typically controls the width of the neighbourhood; it is the only parameter that needs to be determined from the training data.

Be careful: the same name is used for different notions! The word kernel is associated with several distinct mathematical objects. Kernel smoothing methods should not be confused with kernel methods (seen in the Data Mining course), which rely on a positive definite kernel and an associated reproducing kernel Hilbert space; there, the kernel defines an inner product between feature maps. In this section, the kernel (in one dimension) can be written as

    K_\lambda(x_0, x) = D((x - x_0) / h_\lambda(x_0)),

where D is a non-negative real-valued integrable function (that integrates to 1).

One-dimensional kernel smoother

An intuitive estimate of the regression function f(x) = E(Y | X = x): for each x, use the set N_k(x) of the k nearest neighbours of x and estimate f(x) by

    \hat f(x) = Average(y_i | x_i \in N_k(x)).

Remark: this \hat f(x) is discontinuous in x. Rather than giving all points in the neighbourhood the same weight, we can assign weights that decrease with the distance to the query point (see the sketch below).
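A hedged R sketch of the k-nearest-neighbour running mean (toy data and helper function are my own), whose estimate is piecewise constant and hence discontinuous:

    set.seed(5)
    x <- sort(runif(100))
    y <- sin(4 * x) + rnorm(100, sd = 0.3)

    knn_average <- function(x0, x, y, k = 10) {
      idx <- order(abs(x - x0))[1:k]   # indices of the k nearest neighbours of x0
      mean(y[idx])                     # unweighted average of their responses
    }

    grid <- seq(0, 1, length.out = 400)
    fhat <- sapply(grid, knn_average, x = x, y = y, k = 10)

    plot(x, y)
    lines(grid, fhat)                  # piecewise-constant, jumpy estimate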

One-dimensional kernel smoother (continued)

Example: the Nadaraya-Watson kernel-weighted average

    \hat f(x) = \frac{\sum_{i=1}^{N} K_\lambda(x, x_i) y_i}{\sum_{i=1}^{N} K_\lambda(x, x_i)},

with the Epanechnikov kernel

    K_\lambda(x_0, x) = D((x - x_0) / \lambda),    D(t) = (3/4)(1 - t^2) if |t| <= 1, and 0 otherwise.

(An R sketch of this estimator follows below.)

Settings to determine

The main settings characterising a kernel smoother K_\lambda(x_0, x) = D((x - x_0) / h_\lambda(x_0)) are:

- the smoothing parameter \lambda, which controls the width of the window: a large \lambda implies lower variance but higher bias;
- the function h_\lambda: a constant h_\lambda(x) tends to keep the bias of the estimate constant;
- the function D; typical examples are
  - the Epanechnikov kernel,
  - the tri-cube function, D(t) = (1 - |t|^3)^3 if |t| <= 1, and 0 otherwise,
  - the Gaussian density, where the standard deviation plays the role of the window size.
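A hedged R sketch of the Nadaraya-Watson estimator with the Epanechnikov kernel (toy data and helper functions are my own; the bandwidth 0.2 is arbitrary):

    epanechnikov <- function(t) ifelse(abs(t) <= 1, 0.75 * (1 - t^2), 0)

    nadaraya_watson <- function(x0, x, y, lambda) {
      w <- epanechnikov((x - x0) / lambda)   # kernel weights K_lambda(x0, x_i)
      sum(w * y) / sum(w)                    # weighted average of the y_i
    }

    set.seed(6)
    x <- sort(runif(100))
    y <- sin(4 * x) + rnorm(100, sd = 0.3)

    grid <- seq(0, 1, length.out = 400)
    fhat <- sapply(grid, nadaraya_watson, x = x, y = y, lambda = 0.2)

    plot(x, y); lines(grid, fhat, lwd = 2)   # smooth kernel-weighted fit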

Local linear regression

Locally weighted regression solves a separate weighted least squares problem at each target point x_0:

    min_{\alpha(x_0), \beta(x_0)} \sum_{i=1}^{N} K_\lambda(x_0, x_i) [y_i - \alpha(x_0) - \beta(x_0) x_i]^2,

and then

    \hat f(x_0) = \hat\alpha(x_0) + \hat\beta(x_0) x_0
                = b(x_0)^T (B^T W(x_0) B)^{-1} B^T W(x_0) y
                = \sum_{i=1}^{N} l_i(x_0) y_i,

where b(x_0)^T = (1, x_0), B^T = (b(x_1), b(x_2), ..., b(x_N)), and W(x_0) is the N x N diagonal matrix with i-th diagonal element equal to K_\lambda(x_0, x_i).

The estimate is linear in y. The weights l_i(x_0) combine the weighting kernel K_\lambda(x_0, .) and the least squares operations; they are often called the equivalent kernel. Using this linearity and a series expansion of f around x_0, one can show that the bias E(\hat f(x_0)) - f(x_0) depends only on quadratic and higher-order terms in the expansion of f. (A direct R implementation is sketched below.)

Local polynomial regression

Similarly, we can fit a local polynomial of any degree d:

    min_{\alpha(x_0), \beta_j(x_0), j=1,...,d} \sum_{i=1}^{N} K_\lambda(x_0, x_i) [y_i - \alpha(x_0) - \sum_{j=1}^{d} \beta_j(x_0) x_i^j]^2,

with solution

    \hat f(x_0) = \hat\alpha(x_0) + \sum_{j=1}^{d} \hat\beta_j(x_0) x_0^j.

The bias then depends only on terms of degree d + 1 or higher in the expansion of f. Local linear fits tend to be biased in regions where the true function is curved; local quadratic regression usually corrects this bias. The price for this bias reduction is increased variance, especially in the tails.

Selecting the width of the kernel

There is a natural bias-variance tradeoff as we change the width of the kernel:

- narrow window: \hat f(x_0) is estimated from a small number of points, giving small bias but large variance;
- wide window: the variance of \hat f(x_0) is small relative to the variance of any y_i, but the bias is higher.

As in the previous section, \lambda can be chosen via cross-validation.
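A hedged R sketch of local linear regression at a target point, written directly from the closed form b(x_0)^T (B^T W B)^{-1} B^T W y (toy data and helper functions are my own):

    epanechnikov <- function(t) ifelse(abs(t) <= 1, 0.75 * (1 - t^2), 0)

    local_linear <- function(x0, x, y, lambda) {
      w <- epanechnikov((x - x0) / lambda)                # diagonal of W(x0)
      B <- cbind(1, x)                                    # rows b(x_i)^T = (1, x_i)
      theta <- solve(t(B) %*% (w * B), t(B) %*% (w * y))  # (alpha(x0), beta(x0))
      c(1, x0) %*% theta                                  # fitted value at x0
    }

    set.seed(7)
    x <- sort(runif(100)); y <- sin(4 * x) + rnorm(100, sd = 0.3)
    grid <- seq(0, 1, length.out = 200)
    fhat <- sapply(grid, local_linear, x = x, y = y, lambda = 0.2)
    plot(x, y); lines(grid, fhat, lwd = 2)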

In R

The function loess fits a polynomial surface determined by one or more numerical predictors, using local fitting with tri-cube weights. Both the degree of the local polynomial and the smoothing parameter (span) can be chosen. Example adapted from http://research.stowers-institute.org/efg/r/statistics/loess.htm:

    period <- 120
    x <- 1:120
    y <- sin(2 * pi * x / period) + runif(length(x), -1, 1)   # noisy sine curve
    plot(x, y, main = "Sine Curve + Uniform Noise")
    y.loess <- loess(y ~ x, data = data.frame(x = x, y = y), span = 0.75)
    y.predict <- predict(y.loess, data.frame(x = x))
    lines(x, y.predict)                                       # local-regression fit

(Figure from http://research.stowers-institute.org/efg/r/statistics/loess.htm.)

Local regression in R^p

Kernel smoothing and local regression generalise very naturally to two or more dimensions. However:

- boundary effects are a much bigger problem in two or more dimensions than in one, because the fraction of points on the boundary is larger;
- it is impossible to simultaneously maintain low bias and low variance as the dimension increases, unless the total sample size increases exponentially in p.

When the dimension-to-sample-size ratio increases, we may want to incorporate structural assumptions about the model.
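Since loess accepts more than one numeric predictor, here is a hedged two-dimensional sketch (my own simulated surface; names and grid are illustrative only). It shows the interface, not a way around the caveats above:

    set.seed(8)
    x1 <- runif(300); x2 <- runif(300)
    y  <- sin(2 * pi * x1) * cos(2 * pi * x2) + rnorm(300, sd = 0.2)

    fit2d <- loess(y ~ x1 + x2, span = 0.3, degree = 2)   # local quadratic surface

    x1s <- seq(0, 1, length.out = 40); x2s <- seq(0, 1, length.out = 40)
    grid <- expand.grid(x1 = x1s, x2 = x2s)
    zhat <- predict(fit2d, newdata = grid)                 # fitted surface on the grid
    persp(x1s, x2s, matrix(zhat, nrow = length(x1s)),
          theta = 30, phi = 30, xlab = "x1", ylab = "x2", zlab = "fitted y")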