Applications of the k-nearest neighbor method for regression and resampling


Hilary Murphy
5 years ago

1 Applications of the k-nearest neighbor method for regression and resampling

2 Objectives: Provide a structured approach to exploring a regression data set. Introduce and demonstrate the k-nearest neighbor (k-nn) method for regression and uncertainty analysis (including resampling scenarios) with a data set. Cover univariate and bivariate conditioning: what are the plausible values of Y given x1 and x2, and which of the historical values of Y are most likely to occur given the current values of the predictors x1 and x2?

3 Exploratory Data Analysis: Step 1: Trends. If there are systematic trends in the series: Are they consistent? Do they imply spurious connections? Do they suggest the need to check the data? Do they reflect the influence of a few extremes or outliers?
[Figure: time series of gy, rain, pc1 and pc2 plotted against yr.]

4 Step 2: The inter-variable relations: Linear? Homogeneous error structure? Correlations?
[Figure: pairwise plots of gy, rain, pc1 and pc2, and a correlation table for yr, gy, rain, pc1 and pc2; the values were lost in extraction.]

5 Step 3: Probability Structure: Normal? The marginal density function of each relevant variable.
[Figure: marginal density plots for gy, rain, pc1 and pc2.]

6 Quantile Plot to check Normality.
[Figure: normal quantile plots for gy (-ve skew), rain (+ve skew), pc1 (+ve skew) and pc2, each compared with the Normal Distribution.]
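A quantile plot check of this kind can be sketched in Python as below. This is a minimal sketch, not the S-PLUS workflow from the slides: the two series are synthetic stand-ins (the names `rain_like` and `gy_like` are hypothetical), and the moment-based skewness is an added numerical summary to read alongside the plot.

```python
import numpy as np
from statistics import NormalDist


def sample_skewness(x):
    """Moment-based skewness: mean of the cubed standardized deviations."""
    x = np.asarray(x, dtype=float)
    z = (x - x.mean()) / x.std()
    return float(np.mean(z ** 3))


def normal_quantile_pairs(x):
    """Return (theoretical, empirical) quantiles for a normal quantile plot."""
    x = np.sort(np.asarray(x, dtype=float))
    n = x.size
    p = (np.arange(1, n + 1) - 0.5) / n  # plotting positions in (0, 1)
    theoretical = np.array([NormalDist().inv_cdf(pi) for pi in p])
    return theoretical, x


# Synthetic stand-ins (assumption): one right-skewed, one left-skewed series.
rng = np.random.default_rng(0)
rain_like = rng.gamma(shape=2.0, scale=50.0, size=500)
gy_like = 10.0 - rng.gamma(shape=2.0, scale=1.0, size=500)

skew_rain = sample_skewness(rain_like)  # positive for a right-skewed sample
skew_gy = sample_skewness(gy_like)      # negative for a left-skewed sample
theo, emp = normal_quantile_pairs(rain_like)
```

Plotting `emp` against `theo` gives the quantile plot; points that bend away from a straight line indicate departure from normality.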

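7 Multivariate relations: GY fitted as a function of PC1 and PC2 (the fitted coefficients and the R^2 value were lost in extraction). High uncertainty outside the sampled area.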

8 Brush and Spin Plots for Visualization of multivariate relationships in Splus

9 Summary of Exploratory Data Analysis: GY (-ve skew) and Rain and PC1 (+ve skew) are not normally distributed. The GY-Rain relationship is quadratic with non-constant variance. The Rain-PC1 and GY-PC1 relationships also appear nonlinear. PC2 does not seem to be related to either Rain or GY; however, PC1 and PC2 appear correlated. PC1 and PC2 exhibit trends that appear related but are not reflected in either Rain or GY. If we are doing a forecast, Rain is not known, so at this point the only variable that appears important is PC1 (nonlinear). PC2 nonetheless appears marginally significant in the multivariate fit, and there is some indication that the interaction of PC1 and PC2 provides some insight.

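10 Regression
General: $y = f(x_1, x_2, \ldots, x_p) + e$
Linear: $f(\cdot) = a_1 x_1 + a_2 x_2 + \cdots + a_p x_p$
k-nn: $f(\cdot) = \sum_{i=1}^{k} w_i y_i$
Linear regression is also a weighted average: with $y = Xa + e$, the least-squares estimate is $\hat{a} = (X^T X)^{-1} X^T y$. Hence $y = X (X^T X)^{-1} X^T y + e = Hy + e$, so each fitted value is a weighted sum of all the observations, e.g., $\hat{y}_1 = h_{11} y_1 + h_{12} y_2 + \cdots + h_{1n} y_n$. H is a weight matrix that depends only on X, but all values of x are used, not just the k neighbors.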
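The weighted-average view of linear regression can be checked numerically. The sketch below uses NumPy on synthetic data (the coefficients in `a_true` and the sample size are arbitrary assumptions) to build H and confirm that the fitted values are weighted sums of all n observations.

```python
import numpy as np

# Synthetic design with an intercept and two predictors (assumed values).
rng = np.random.default_rng(1)
n = 30
X = np.column_stack([np.ones(n), rng.normal(size=n), rng.normal(size=n)])
a_true = np.array([1.0, 2.0, -0.5])
y = X @ a_true + rng.normal(scale=0.3, size=n)

# Least-squares coefficients: a_hat = (X^T X)^{-1} X^T y
a_hat = np.linalg.solve(X.T @ X, X.T @ y)

# Weight ("hat") matrix: H = X (X^T X)^{-1} X^T, which depends only on X
H = X @ np.linalg.solve(X.T @ X, X.T)

# Each fitted value is a weighted sum of all n observations.
y_fit = H @ y
first_fit = float(H[0] @ y)  # h_11 y_1 + h_12 y_2 + ... + h_1n y_n
```

With an intercept in X, each row of H sums to 1, so the fitted value is a weighted average in the literal sense. Unlike k-nn weights, the entries of H can be negative and are generally nonzero for every observation.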

11 k-nn Weights and neighbors
Weights:
(a) Equally weight all neighbors: Uniform
(b) Weight by distance: many choices
  (i) Use regression on the neighborhood (LOESS)
  (ii) Distance weights, e.g., rank based: $w_i = \frac{1/i}{\sum_{j=1}^{k} 1/j}$
No. of neighbors: more neighbors => less variance in the estimate, but more bias; choose by cross-validation.
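The uniform and rank-based weighting schemes can be sketched as follows. This is an illustrative sketch: the function names `rank_weights` and `knn_predict`, the use of Euclidean distance, and the toy data are assumptions rather than anything specified in the slides.

```python
import numpy as np


def rank_weights(k):
    """Rank-based weights w_i = (1/i) / sum_{j=1..k} (1/j), i = 1..k."""
    inv = 1.0 / np.arange(1, k + 1)
    return inv / inv.sum()


def knn_predict(X, y, x_star, k, weights="rank"):
    """Weighted average of the responses of the k nearest neighbors of x_star."""
    y = np.asarray(y, dtype=float)
    X = np.asarray(X, dtype=float).reshape(len(y), -1)  # one row per observation
    x_star = np.asarray(x_star, dtype=float).reshape(-1)
    dist = np.linalg.norm(X - x_star, axis=1)        # Euclidean distance (assumed)
    order = np.argsort(dist, kind="stable")[:k]      # nearest neighbor first
    w = rank_weights(k) if weights == "rank" else np.full(k, 1.0 / k)
    return float(w @ y[order])


# Toy data: the two nearest neighbors of 0.9 are x = 1 (rank 1) and x = 0 (rank 2).
X_toy = [[0.0], [1.0], [2.0], [10.0]]
y_toy = [0.0, 1.0, 2.0, 10.0]
pred_rank = knn_predict(X_toy, y_toy, 0.9, k=2, weights="rank")
pred_unif = knn_predict(X_toy, y_toy, 0.9, k=2, weights="uniform")
```

For k = 2 the rank weights are 2/3 and 1/3, so the rank-weighted estimate leans toward the closest neighbor while the uniform estimate treats both equally.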

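12 Logistic Map Example: a time series from the model $x_{t+1} = 1 - 4(x_t - 0.5)^2$, with k-nearest neighborhoods A and B for $x_t = x^*_A$ and $x_t = x^*_B$ respectively. Some dynamical systems can be represented as Iterated Function Systems; the example uses a 4-state Markov Chain discretization.
[Figure: panels showing $x_t$ against time with neighbors $D_1$, $D_2$, $D_3$, ..., $D_i$ marked, the values of $x_t$ with neighborhoods A and B around $x^*_A$ and $x^*_B$, and the state sequence.]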
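The logistic map example can be reproduced with a short simulation. In this sketch the starting value, series length, k, and the two conditioning points (0.2 and 0.8, standing in for the neighborhoods A and B) are all assumed for illustration.

```python
import numpy as np


def logistic_series(x0, n):
    """Iterate x_{t+1} = 1 - 4 (x_t - 0.5)^2, which equals 4 x_t (1 - x_t)."""
    x = np.empty(n)
    x[0] = x0
    for t in range(n - 1):
        x[t + 1] = 1.0 - 4.0 * (x[t] - 0.5) ** 2
    return x


x = logistic_series(0.3, 2000)       # assumed start value and length
state, successor = x[:-1], x[1:]     # pairs (x_t, x_{t+1})


def knn_successors(x_star, k):
    """Successors x_{t+1} of the k historical states x_t closest to x_star."""
    idx = np.argsort(np.abs(state - x_star))[:k]
    return successor[idx]


succ_A = knn_successors(0.2, 10)     # stand-in for neighborhood A
succ_B = knn_successors(0.8, 10)     # stand-in for neighborhood B
```

Because the map is deterministic, the successors of a tight neighborhood cluster near the map value at the conditioning point (0.64 for both 0.2 and 0.8), which is what a k-nn forecast exploits.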

13 Define the composition of the "feature vector" $D_t$ of dimension d.
(1) Dependence on two prior values of the same time series: $D_t = (x_{t-1}, x_{t-2})$; $d = 2$
(2) Dependence on multiple time scales (e.g., monthly + annual): $D_t = (x_{t-\tau_1}, x_{t-2\tau_1}, \ldots, x_{t-m_1\tau_1};\ x_{t-\tau_2}, x_{t-2\tau_2}, \ldots, x_{t-m_2\tau_2})$; $d = m_1 + m_2$
(3) Dependence on multiple variables and time scales: $D_t = (x1_{t-\tau_1}, \ldots, x1_{t-m_1\tau_1};\ x2_t, x2_{t-\tau_2}, \ldots, x2_{t-m_2\tau_2})$; $d = m_1 + m_2 + 1$
Identify the k nearest neighbors of $D_t$ in the data $D_1, \ldots, D_n$.
Define the kernel function for the j-th nearest neighbor: $K(j) = \frac{1/j}{\sum_{i=1}^{k} 1/i}$, $j = 1, \ldots, k$. It is derived by taking expected values of distances to each of the k nearest neighbors, assuming that the number of observations of D in a neighborhood $B_r(D^*)$ of $D^*$, with $r \to 0$ as $n \to \infty$, is locally Poisson with rate $\lambda(D^*)$.
Selection of k: GCV, FPE, Mutual Information, or rule of thumb ($k = n^{0.5}$).
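The steps on this slide (build the feature vectors, find the k nearest neighbors, resample a successor with kernel K(j)) can be sketched for case (1), d = 2. The function names, the synthetic series, and the Euclidean distance are assumptions made for the sketch; k follows the rule of thumb k = n^0.5.

```python
import numpy as np


def kernel(k):
    """K(j) = (1/j) / sum_{i=1..k} (1/i) for j = 1..k."""
    inv = 1.0 / np.arange(1, k + 1)
    return inv / inv.sum()


def lagged_features(x, d=2):
    """Rows are D_t = (x[t-1], ..., x[t-d]), paired with the target x[t]."""
    x = np.asarray(x, dtype=float)
    D = np.column_stack([x[d - lag: len(x) - lag] for lag in range(1, d + 1)])
    return D, x[d:]


def knn_resample(D, target, d_star, k, rng):
    """Draw one historical successor, picking the j-th neighbor with prob. K(j)."""
    dist = np.linalg.norm(D - np.asarray(d_star, dtype=float), axis=1)
    neighbors = np.argsort(dist, kind="stable")[:k]  # nearest first
    j = rng.choice(k, p=kernel(k))
    return float(target[neighbors[j]])


# Synthetic series (assumption): a noisy sinusoid.
rng = np.random.default_rng(2)
x = np.sin(0.3 * np.arange(200)) + 0.1 * rng.normal(size=200)

D, target = lagged_features(x, d=2)
k = int(round(np.sqrt(len(target))))   # rule of thumb k = n^0.5
d_star = np.array([x[-1], x[-2]])      # current state: condition the next value on it
draws = np.array([knn_resample(D, target, d_star, k, rng) for _ in range(500)])
```

The array `draws` is an ensemble of plausible next values. Every draw is an observed historical value, so the resampled scenarios never leave the range of the data.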

14 Illustration of the kernel with an exponentially distributed conditioning variable.
[Figure: K(.) plotted against x.]

15 Behavior of the selected kernel as an averaging function.
[Figure: three panels of K() against x, each comparing the centroid of K(j(i)) with the uniform kernel centroid.]

16 Model: V3 = 50 V1 + V2 + e, or $y = Xb + e$. Why do we need a semi-parametric approach? Compare neighbors based on $d = \lVert x^* - x \rVert^2$ with neighbors based on $d = \lVert (x^* - x)\, b \rVert^2$.
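The contrast between the two distance definitions can be illustrated on data simulated from the slide's model V3 = 50 V1 + V2 + e. The sample size, noise level, predictor ranges, target point, and k below are assumptions, and b is estimated by ordinary least squares as one possible choice.

```python
import numpy as np

# Simulate the model V3 = 50 V1 + V2 + e (ranges and noise level assumed).
rng = np.random.default_rng(3)
n = 500
V1 = rng.uniform(0.0, 1.0, n)
V2 = rng.uniform(0.0, 1.0, n)
V3 = 50.0 * V1 + V2 + rng.normal(scale=0.5, size=n)
X = np.column_stack([V1, V2])

# Regression coefficients b used to weight the distance (least squares).
b, *_ = np.linalg.lstsq(X, V3, rcond=None)


def neighbors(x_star, k, coef=None):
    """Indices of the k nearest neighbors under plain or b-weighted distance."""
    diff = X - x_star
    if coef is None:
        d = np.sum(diff ** 2, axis=1)   # d = ||x* - x||^2
    else:
        d = (diff @ coef) ** 2          # d = ||(x* - x) b||^2
    return np.argsort(d)[:k]


x_star = np.array([0.5, 0.5])
idx_plain = neighbors(x_star, 20)
idx_proj = neighbors(x_star, 20, coef=b)

spread_plain = float(V3[idx_plain].std())
spread_proj = float(V3[idx_proj].std())
```

The plain distance treats V1 and V2 as equally important, so its neighbors can differ noticeably in V1 and therefore in V3. The b-weighted distance selects points with a similar predicted response, which is the motivation for a semi-parametric choice of neighbors.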

17 Univariate and Bivariate Examples from the spreadsheet

18 Simulation and forecast using Splus Script

19 Summary: Check if a linear model (with or without transforms) can be used, i.e., relations are linear and distributions are Normal. Yes: use it. If you transformed y, it is better to backtransform the percentiles of the forecast than just the regression mean. No: use the k-nearest neighbor method with distance weights and k = 20 to 30 => you need at least [number missing in source] data points and a few predictors. If the data set is large (e.g., daily values), you can use k-nn directly and expect results to be generally comparable to or better than parametric regression.
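The advice about backtransforming percentiles can be illustrated with a log transform. This is a simplified sketch on synthetic data: the log transform, the coefficients, and the forecast point `x_new` are assumptions, and the percentiles ignore parameter uncertainty by treating the residuals as Normal with a fixed standard deviation.

```python
import numpy as np
from statistics import NormalDist

# Synthetic data (assumption): log(y) is linear in x with Normal errors.
rng = np.random.default_rng(4)
n = 400
x = rng.uniform(0.0, 2.0, n)
log_y = 1.0 + 0.8 * x + rng.normal(scale=0.6, size=n)

# Fit the linear model on the transformed scale.
A = np.column_stack([np.ones(n), x])
coef, *_ = np.linalg.lstsq(A, log_y, rcond=None)
sigma = float((log_y - A @ coef).std(ddof=2))   # residual standard deviation

# Forecast at a new predictor value (hypothetical point).
x_new = 1.0
mu = float(coef[0] + coef[1] * x_new)

# Percentiles on the log scale, then backtransformed to the original scale.
probs = [0.05, 0.5, 0.95]
pct_log = [mu + sigma * NormalDist().inv_cdf(p) for p in probs]
pct_y = np.exp(pct_log)

naive_mean = float(np.exp(mu))                         # backtransformed regression mean
lognormal_mean = float(np.exp(mu + 0.5 * sigma ** 2))  # mean of y under lognormality
```

The backtransformed regression mean `naive_mean` coincides with the median of y, not its mean, and understates `lognormal_mean`. Backtransformed percentiles keep their interpretation because the transform is monotone.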
