Instance-Based Learning: Nearest neighbor and kernel regression and classification
Emily Fox
University of Washington
February 3, 2017

Simplest approach: Nearest neighbor regression

Fit locally to each data point
- Predicted value = "closest" \(y_i\)
- This is 1 nearest neighbor (1-NN) regression
[Figure: price ($) y versus sq.ft. x; each stretch of the x-axis is labeled with its closest datapoint]

1-NN regression more formally
Dataset of (house, $) pairs: \((x_1, y_1), (x_2, y_2), \dots, (x_N, y_N)\)
Query point: \(x_q\)
1. Find the "closest" \(x_i\) in the dataset: \(x_{NN} = \arg\min_{x_i} \text{distance}(x_i, x_q)\)
2. Predict \(\hat{y}_q = y_{NN}\), the observed value of that nearest neighbor
[Figure: same plot of price ($) y versus sq.ft. x, with the closest datapoint marked for each region]
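
The two steps of 1-NN regression can be sketched in a few lines. This is a minimal illustration for the 1D case, not the lecture's own code; the function name `predict_1nn` and the toy square-footage and price values are assumptions made up for the example.

```python
def predict_1nn(xs, ys, x_q):
    """Return the target of the training point whose input is closest to x_q."""
    # Step 1: index of the nearest neighbor under 1-D absolute distance.
    nn = min(range(len(xs)), key=lambda i: abs(xs[i] - x_q))
    # Step 2: predict that neighbor's observed value.
    return ys[nn]


# Hypothetical toy data: square footage and sale price.
sqft = [1000.0, 1500.0, 2000.0, 3000.0]
price = [300_000.0, 420_000.0, 510_000.0, 800_000.0]

print(predict_1nn(sqft, price, 1600.0))  # nearest input is 1500 sq.ft.
```

Because the prediction copies a single neighbor, the fitted function is piecewise constant and jumps halfway between adjacent training inputs.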

Visualizing 1-NN in multiple dimensions
Voronoi tessellation (or diagram):
- Divide space into N regions, each containing 1 datapoint
- Defined such that any x in a region is "closest" to that region's datapoint
Don't explicitly form!

Distance metrics: Defining the notion of "closest"
In 1D, just Euclidean distance: \(\text{distance}(x_j, x_q) = |x_j - x_q|\)
In multiple dimensions:
- can define many interesting distance functions
- most straightforwardly, might want to weight different dimensions differently

Weighting housing inputs
Some inputs are more relevant than others:
- # bedrooms
- # bathrooms
- sq.ft. living
- sq.ft. lot
- floors
- year built
- year renovated
- waterfront

Scaled Euclidean distance
Formally, this is achieved via
\[ \text{distance}(x_j, x_q) = \sqrt{a_1 (x_j[1] - x_q[1])^2 + \dots + a_d (x_j[d] - x_q[d])^2} \]
where \(a_1, \dots, a_d\) are the weights on each input (defining relative importance).
Other example distance metrics:
- Mahalanobis, rank-based, correlation-based, cosine similarity, Manhattan, Hamming, ...
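
The scaled Euclidean distance can be written directly from its formula. The sketch below is illustrative only; the name `scaled_euclidean`, the two-feature houses, and the weight vectors are assumptions chosen for the example.

```python
import math


def scaled_euclidean(x_j, x_q, a):
    """sqrt(sum over inputs m of a[m] * (x_j[m] - x_q[m])**2)."""
    return math.sqrt(sum(w * (u - v) ** 2 for w, u, v in zip(a, x_j, x_q)))


# Hypothetical features: [# bedrooms, sq.ft. living in thousands].
house_j = [3.0, 2.0]
house_q = [4.0, 2.5]

print(scaled_euclidean(house_j, house_q, a=[1.0, 1.0]))  # plain Euclidean
print(scaled_euclidean(house_j, house_q, a=[0.0, 1.0]))  # ignores # bedrooms
```

Setting a weight to 0 removes that input from the notion of "closest", while a large weight makes differences in that input dominate.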

Different distance metrics lead to different predictive surfaces
[Figure: two panels, Euclidean distance and Manhattan distance]

Can 1-NN be used for classification?
Yes!! Just predict the class of the nearest neighbor

1-NN algorithm

Performing 1-NN search
- Query house: \(x_q\)
- Dataset: \(x_1, x_2, \dots, x_N\)
- Specify: distance metric
- Output: most similar house

1-NN algorithm
Initialize Dist2NN = ∞, closest house = Ø
For i = 1, 2, ..., N
    Compute: δ = distance(house i, query house)
    If δ < Dist2NN
        set closest house = house i
        set Dist2NN = δ
Return most similar house (the closest house to the query house)

1-NN in practice
[Figure: Nearest Neighbors Kernel (K = 1)]
Fit looks good for data dense in x and low noise
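
The 1-NN search loop above translates almost line for line into runnable code. This is a sketch under stated assumptions: houses are represented as lists of numeric features, and the names `nearest_neighbor` and `euclidean` are hypothetical helpers, not part of the lecture.

```python
import math


def euclidean(u, v):
    """Plain Euclidean distance between two feature lists."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))


def nearest_neighbor(houses, query, distance):
    """Linear scan: return (index, distance) of the house closest to query."""
    dist_to_nn = math.inf  # Dist2NN = infinity
    closest = None         # closest house = empty
    for i, house in enumerate(houses):
        delta = distance(house, query)
        if delta < dist_to_nn:
            closest, dist_to_nn = i, delta
    return closest, dist_to_nn


houses = [[1.0, 1.0], [2.0, 2.0], [5.0, 5.0]]
idx, d = nearest_neighbor(houses, [2.2, 1.9], euclidean)
print(idx, d)
```

The distance function is passed in as an argument, so the same scan works with any metric the user specifies.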

Sensitive to regions with little data
[Figure: Nearest Neighbors Kernel (K = 1)]
Not great at interpolating over large regions

Also sensitive to noise in data
[Figure: Nearest Neighbors Kernel (K = 1), f(x0) versus x0]
Fits can look quite wild. Overfitting?

k-nearest neighbors

Get more "comps"
More reliable estimate if you base the estimate on a larger set of comparable homes
[Figure: query house with $ = ??? surrounded by comparable houses with $ = 850k, $ = 749k, $ = 833k, $ = 901k]
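
k-NN regression more formally
Dataset of (house, $) pairs: \((x_1, y_1), (x_2, y_2), \dots, (x_N, y_N)\)
Query point: \(x_q\)
1. Find the k closest \(x_i\) in the dataset: \((x_{NN_1}, x_{NN_2}, \dots, x_{NN_k})\)
2. Predict \(\hat{y}_q = \frac{1}{k} \sum_{j=1}^{k} y_{NN_j}\)

Performing k-NN search
- Query house: \(x_q\)
- Dataset: \(x_1, x_2, \dots, x_N\)
- Specify: distance metric
- Output: most similar houses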
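
A minimal sketch of k-NN regression in 1D follows: find the k closest inputs, then average their targets. The name `predict_knn` and the toy data are assumptions for illustration, and the full sort is chosen for brevity rather than speed.

```python
def predict_knn(xs, ys, x_q, k):
    """Average the targets of the k training inputs closest to x_q (1-D)."""
    # Rank all training points by distance to the query and keep the first k.
    order = sorted(range(len(xs)), key=lambda i: abs(xs[i] - x_q))
    neighbors = order[:k]
    return sum(ys[i] for i in neighbors) / len(neighbors)


# Hypothetical toy data with one far-away point.
xs = [1.0, 2.0, 3.0, 10.0]
ys = [10.0, 20.0, 30.0, 100.0]

print(predict_knn(xs, ys, 2.1, k=1))  # copies the single nearest target
print(predict_knn(xs, ys, 2.1, k=3))  # averages the three nearest targets
```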

k-NN algorithm
Initialize: sort the first k houses by distance to the query house
    Dist2kNN = sort(δ_1, ..., δ_k)               (list of sorted distances)
    kNNHouses = sort(house 1, ..., house k)      (list of sorted houses)
For i = k+1, ..., N
    Compute: δ = distance(house i, query house)
    If δ < Dist2kNN[k]
        find j such that δ > Dist2kNN[j-1] but δ < Dist2kNN[j]
        remove furthest house and shift queue:
            kNNHouses[j+1:k] = kNNHouses[j:k-1]
            Dist2kNN[j+1:k] = Dist2kNN[j:k-1]
        set Dist2kNN[j] = δ and kNNHouses[j] = house i
Return the k most similar houses (the closest houses to the query house)

k-NN in practice
[Figure: Nearest Neighbors Kernel (K = 30), f(x0) versus x0]
- Much more reasonable fit in the presence of noise
- Boundary & sparse region issues
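
The sorted-queue k-NN search described above can be sketched with the standard library's `bisect` module, which finds the insertion position j and shifts the tail of the list in one call. This is one possible rendering, not the lecture's implementation; `knn_search` is a hypothetical name, and the distances to the query are assumed to be precomputed.

```python
import bisect


def knn_search(distances, k):
    """Return [(distance, index)] of the k nearest houses, nearest first.

    distances[i] is the precomputed distance from house i to the query.
    """
    # Sort the first k houses by distance to the query.
    queue = sorted((d, i) for i, d in enumerate(distances[:k]))
    for i in range(k, len(distances)):
        delta = distances[i]
        # Only houses closer than the current k-th neighbor enter the queue.
        if queue and delta < queue[-1][0]:
            queue.pop()                        # remove the furthest house
            bisect.insort(queue, (delta, i))   # insert at its sorted position
    return queue


print(knn_search([5.0, 1.0, 4.0, 0.5, 3.0], k=2))
```

Keeping the queue sorted means each candidate is compared against only the current k-th distance before any further work is done.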

k-NN in practice
[Figure: Nearest Neighbors Kernel (K = 30), f(x0) versus x0]
Discontinuities! A neighbor is either in or out

Issues with discontinuities
Overall predictive accuracy might be okay, but...
For example, in the housing application:
- If you are a buyer or seller, this matters
- Can be a jump in the estimated value of a house going just from 2640 sq.ft. to 2641 sq.ft.
- Don't really believe this type of fit
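
Weighted k-nearest neighbors

Weighted k-NN
Weigh more similar houses more than those less similar in the list of k-NN
Predict:
\[ \hat{y}_q = \frac{c_{qNN_1} y_{NN_1} + c_{qNN_2} y_{NN_2} + c_{qNN_3} y_{NN_3} + \dots + c_{qNN_k} y_{NN_k}}{\sum_{j=1}^{k} c_{qNN_j}} \]
where the \(c_{qNN_j}\) are the weights on the nearest neighbors.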

How to define weights?
Want weight \(c_{qNN_j}\) to be small when \(\text{distance}(x_{NN_j}, x_q)\) is large
and \(c_{qNN_j}\) to be large when \(\text{distance}(x_{NN_j}, x_q)\) is small

Kernel weights for d = 1
Define: \(c_{qNN_j} = \text{Kernel}_\lambda(|x_{NN_j} - x_q|)\) (simple isotropic case)
Gaussian kernel: \(\text{Kernel}_\lambda(|x_i - x_q|) = \exp(-(x_i - x_q)^2 / \lambda)\)
Note: never exactly 0!
[Figure: Gaussian kernel centered at 0, with -λ, 0, and λ marked on the axis]
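
As a sketch of how the Gaussian kernel weights plug into the weighted k-NN prediction in 1D, consider the code below. The function names and toy data are assumptions for illustration; the kernel follows the form exp(-(x_i - x_q)^2 / λ) given above.

```python
import math


def gaussian_kernel(dist, lam):
    """Kernel_lambda(dist) = exp(-dist**2 / lam); positive for finite dist."""
    return math.exp(-dist ** 2 / lam)


def predict_weighted_knn(xs, ys, x_q, k, lam):
    """Kernel-weighted average of the k nearest targets (1-D inputs)."""
    order = sorted(range(len(xs)), key=lambda i: abs(xs[i] - x_q))[:k]
    weights = [gaussian_kernel(abs(xs[i] - x_q), lam) for i in order]
    # Assumes the weights do not all underflow to 0 (lam not tiny
    # relative to the neighbor distances).
    return sum(c * ys[i] for c, i in zip(weights, order)) / sum(weights)


xs = [0.0, 2.0]
ys = [0.0, 10.0]
print(predict_weighted_knn(xs, ys, 1.0, k=2, lam=1.0))  # equidistant neighbors
print(predict_weighted_knn(xs, ys, 0.5, k=2, lam=1.0))  # pulled toward y = 0
```

Moving the query toward one neighbor shifts the prediction smoothly toward that neighbor's value, in contrast to the equal weighting of plain k-NN.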

Kernel weights for d ≥ 1
Define: \(c_{qNN_j} = \text{Kernel}_\lambda(\text{distance}(x_{NN_j}, x_q))\)
[Figure: kernel centered at 0, with -λ, 0, and λ marked on the axis]

Kernel regression

Weighted k-NN (recap)
Weigh more similar houses more than those less similar in the list of k-NN
Predict:
\[ \hat{y}_q = \frac{c_{qNN_1} y_{NN_1} + c_{qNN_2} y_{NN_2} + \dots + c_{qNN_k} y_{NN_k}}{\sum_{j=1}^{k} c_{qNN_j}} \]

Kernel regression
Instead of just weighting NN, weight all points
Predict:
\[ \hat{y}_q = \frac{\sum_{i=1}^{N} c_{qi} \, y_i}{\sum_{i=1}^{N} c_{qi}} = \frac{\sum_{i=1}^{N} \text{Kernel}_\lambda(\text{distance}(x_i, x_q)) \, y_i}{\sum_{i=1}^{N} \text{Kernel}_\lambda(\text{distance}(x_i, x_q))} \]
where \(c_{qi}\) is the weight on each datapoint. This is the Nadaraya-Watson kernel weighted average.
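
The Nadaraya-Watson prediction can be sketched directly from the formula. The example below assumes 1D inputs and a Gaussian kernel; the names `gaussian_kernel` and `kernel_regression` and the data are hypothetical.

```python
import math


def gaussian_kernel(dist, lam):
    return math.exp(-dist ** 2 / lam)


def kernel_regression(xs, ys, x_q, lam):
    """Nadaraya-Watson estimate: kernel-weighted average over ALL points."""
    weights = [gaussian_kernel(abs(x - x_q), lam) for x in xs]
    total = sum(weights)
    if total == 0.0:
        # Every weight underflowed: the query is too far from all data.
        raise ValueError("no training point has nonzero weight at x_q")
    return sum(c * y for c, y in zip(weights, ys)) / total


xs = [0.0, 1.0, 2.0, 3.0]
ys = [0.0, 1.0, 4.0, 9.0]
print(kernel_regression(xs, ys, 1.5, lam=0.5))
```

Unlike weighted k-NN, no neighbor set is selected first; distant points simply receive weights close to zero.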

Kernel regression in practice
[Figure: Epanechnikov Kernel (lambda = 0.2), f(x0) versus x0]
Kernel has bounded support: only a subset of the data is needed to compute the local fit

Choice of bandwidth λ
Often, choice of kernel matters much less than choice of λ
[Figure: f(x0) versus x0 for the Epanechnikov kernel with λ = 0.04, λ = 0.2, and λ = 0.4, plus a Boxcar kernel fit with λ = 0.2]
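
To make the effect of λ concrete, the sketch below measures held-out (leave-one-out) squared error of kernel regression for the three bandwidths named above. The slides do not give kernel formulas, so the standard textbook definitions are assumed: Epanechnikov 0.75(1 - t^2) for |t| < 1 and boxcar 1 for |t| ≤ 1, with t = distance/λ. The sine-curve data and all function names are made up for illustration.

```python
import math


def epanechnikov(dist, lam):
    """Assumed standard form: 0.75 * (1 - t^2) inside the window, else 0."""
    t = dist / lam
    return 0.75 * (1.0 - t * t) if abs(t) < 1.0 else 0.0


def boxcar(dist, lam):
    """Assumed standard form: equal weight inside the window, else 0."""
    return 1.0 if dist <= lam else 0.0


def kernel_predict(xs, ys, x_q, lam, kernel):
    w = [kernel(abs(x - x_q), lam) for x in xs]
    total = sum(w)
    if total == 0.0:
        return None  # no training point inside the kernel's support
    return sum(c * y for c, y in zip(w, ys)) / total


def loo_error(xs, ys, lam, kernel):
    """Mean squared error when each point is predicted from the others."""
    err = 0.0
    for i in range(len(xs)):
        pred = kernel_predict(xs[:i] + xs[i + 1:], ys[:i] + ys[i + 1:],
                              xs[i], lam, kernel)
        if pred is None:
            return math.inf  # bandwidth too small to reach any neighbor
        err += (pred - ys[i]) ** 2
    return err / len(xs)


# Hypothetical noiseless data on a grid with spacing 0.05.
xs = [i / 20 for i in range(21)]
ys = [math.sin(2 * math.pi * x) for x in xs]

for lam in (0.04, 0.2, 0.4):
    print(lam, loo_error(xs, ys, lam, epanechnikov), loo_error(xs, ys, lam, boxcar))
```

On this grid, λ = 0.04 is narrower than the spacing between points, so a held-out point has no neighbor in its window, while λ = 0.4 averages over most of a period and oversmooths.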

Choosing λ (or k in k-NN)
How to choose? Same story as always: cross validation

Formalizing the idea of local fits

Contrasting with global average
A globally constant fit weights all points equally:
\[ \hat{y}_q = \frac{1}{N} \sum_{i=1}^{N} y_i = \frac{\sum_{i=1}^{N} c \, y_i}{\sum_{i=1}^{N} c} \]
with equal weight c on each datapoint.
[Figure: Boxcar Kernel (lambda = 1), f(x0) versus x0]

Contrasting with global average
Kernel regression leads to a locally constant fit
- slowly add in some points and let others gradually die off
\[ \hat{y}_q = \frac{\sum_{i=1}^{N} \text{Kernel}_\lambda(\text{distance}(x_i, x_q)) \, y_i}{\sum_{i=1}^{N} \text{Kernel}_\lambda(\text{distance}(x_i, x_q))} \]
[Figure: two panels of f(x0) versus x0, Boxcar Kernel (lambda = 0.2) and Epanechnikov Kernel (lambda = 0.2)]

Local linear regression
So far, fitting a constant function locally at each point → "locally weighted averages"
Can instead fit a line or polynomial locally at each point → "locally weighted linear regression"

Local regression rules of thumb
- Local linear fit reduces bias at boundaries with minimum increase in variance
- Local quadratic fit doesn't help at boundaries and increases variance, but does help capture curvature in the interior
- With sufficient data, local polynomials of odd degree dominate those of even degree
Recommended default choice: local linear regression
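
The boundary claim can be illustrated with a small 1D sketch: fit a kernel-weighted least-squares line around the query and read off its value there, then compare with the locally constant fit. The Gaussian kernel, the closed-form weighted least squares, the function names, and the exactly linear toy data are all assumptions made for the example.

```python
import math


def _weights(xs, x_q, lam):
    return [math.exp(-(x - x_q) ** 2 / lam) for x in xs]


def local_constant_predict(xs, ys, x_q, lam):
    """Locally weighted average (kernel regression)."""
    w = _weights(xs, x_q, lam)
    return sum(c * y for c, y in zip(w, ys)) / sum(w)


def local_linear_predict(xs, ys, x_q, lam):
    """Weighted least-squares line in u = x - x_q; its intercept is the fit at x_q."""
    w = _weights(xs, x_q, lam)
    u = [x - x_q for x in xs]
    sw = sum(w)
    su = sum(c * a for c, a in zip(w, u))
    sy = sum(c * b for c, b in zip(w, ys))
    suu = sum(c * a * a for c, a in zip(w, u))
    suy = sum(c * a * b for c, a, b in zip(w, u, ys))
    denom = sw * suu - su * su
    if abs(denom) < 1e-12:
        return sy / sw  # degenerate case: fall back to the local average
    slope = (sw * suy - su * sy) / denom
    return (sy - slope * su) / sw


# Hypothetical data lying exactly on y = 2x + 1; query at the left boundary.
xs = [0.0, 1.0, 2.0, 3.0, 4.0]
ys = [2.0 * x + 1.0 for x in xs]
print(local_constant_predict(xs, ys, 0.0, lam=2.0))  # biased upward at the edge
print(local_linear_predict(xs, ys, 0.0, lam=2.0))    # recovers the line's value
```

At the boundary all neighbors lie on one side of the query, so the weighted average is pulled toward them, whereas the local line extrapolates the trend back to the query point.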

Discussion on k-NN and kernel regression

Nonparametric approaches
k-NN and kernel regression are examples of nonparametric regression
General goals of nonparametrics:
- Flexibility
- Make few assumptions about f(x)
- Complexity can grow with the number of observations N
Lots of other choices:
- Splines, trees, locally weighted structured regression models, ...

Limiting behavior of NN: Noiseless setting (\(\epsilon_i = 0\))
In the limit of getting an infinite amount of noiseless data, the MSE of the 1-NN fit goes to 0
[Figure: 1-NN fit compared with a quadratic fit]
Not true for parametric models!

Error vs. amount of data
[Figure: error versus # data points in training set]

Limiting behavior of NN: Noisy data setting
In the limit of getting an infinite amount of data, the MSE of the NN fit goes to 0 if k grows, too
[Figure: 1-NN fit, 200-NN fit, and quadratic fit]

NN and kernel methods for large d or small N
NN and kernel methods work well when the data cover the space, but...
- the more dimensions d you have, the more points N you need to cover the space
- need N = O(exp(d)) data points for good performance
This is where parametric models become useful...

Complexity of NN search
Naïve approach: Brute force search
- Given a query point \(x_q\)
- Scan through each point \(x_1, x_2, \dots, x_N\)
- O(N) distance computations per 1-NN query!
- O(N log k) per k-NN query!
What if N is huge??? (and many queries)
KD-trees! Locality-sensitive hashing, etc.
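
One way to get the O(N log k) brute-force k-NN query in practice is a bounded heap, which Python's standard `heapq.nsmallest` provides. The sketch below is an illustration with hypothetical names and 1D toy points; it is not a substitute for KD-trees or hashing when N is huge.

```python
import heapq


def knn_brute_force(points, query, k, distance):
    """Indices of the k nearest points, nearest first, from one scan."""
    # nsmallest scans all N candidates once while tracking only the best k.
    return heapq.nsmallest(k, range(len(points)),
                           key=lambda i: distance(points[i], query))


def abs_distance(p, q):
    return abs(p[0] - q[0])


points = [[0.0], [5.0], [1.0], [9.0]]
print(knn_brute_force(points, [0.9], k=2, distance=abs_distance))
```

Every query still computes all N distances, which is why the cost becomes a concern when there are many queries.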

k-NN for classification

Spam filtering example
- Input x: text of email, sender, IP, ...
- Output y: not spam or spam

Using k-NN for classification
Space of labeled emails (not spam vs. spam), organized by similarity of text
Query email: not spam vs. spam? Decide via majority vote of the k-NN
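
The majority vote can be sketched as follows. The example assumes each email has already been mapped to a numeric feature vector (the 2D vectors here are made up), and the names `classify_knn` and `euclidean` are hypothetical.

```python
import math
from collections import Counter


def euclidean(u, v):
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))


def classify_knn(points, labels, query, k, distance):
    """Predict the most common label among the k nearest labeled points."""
    order = sorted(range(len(points)), key=lambda i: distance(points[i], query))
    votes = Counter(labels[i] for i in order[:k])
    return votes.most_common(1)[0][0]


# Hypothetical email feature vectors and their labels.
emails = [[0.0, 0.0], [0.0, 1.0], [1.0, 0.0], [5.0, 5.0], [6.0, 5.0]]
labels = ["not spam", "not spam", "not spam", "spam", "spam"]

print(classify_knn(emails, labels, [0.5, 0.5], k=3, distance=euclidean))
print(classify_knn(emails, labels, [5.5, 5.0], k=3, distance=euclidean))
```

With two classes, an odd k avoids tied votes.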

Summary for nearest neighbor and kernel regression

What you can do now...
- Motivate the use of nearest neighbor (NN) regression
- Define distance metrics in 1D and multiple dimensions
- Perform NN and k-NN regression
- Analyze computational costs of these algorithms
- Discuss sensitivity of NN to lack of data, dimensionality, and noise
- Perform weighted k-NN and define weights using a kernel
- Define and implement kernel regression
- Describe the effect of varying the kernel bandwidth λ or # of nearest neighbors k
- Select λ or k using cross validation
- Compare and contrast kernel regression with a global average fit
- Define what makes an approach nonparametric and why NN and kernel regression are considered nonparametric methods
- Analyze the limiting behavior of NN regression
- Use NN for classification

Recap of topics so far
Emily Fox
University of Washington
February 3, 2017

What you have learned thus far
- Point estimation
- Regression
- Training, test, validation, generalization error
- Overfitting
- Bias-variance tradeoff
- Regularized regression = ridge, LASSO
- Cross validation
- Logistic regression
- Decision trees
- Boosting
- Instance-based learning

The ML pipeline
- Inputs x: sq.ft., # bedrooms, # bathrooms, text of review, loan application
- Features \(h_j(x)\): \(x[j]\), \(x[j]^p\), tf-idf
- Task:
  - Regression: x or \(h_j(x)\) → ℝ
  - Classification: x or \(h_j(x)\) → {0, 1, ..., k} (we've focused on {-1, +1})
- Model: linear models \(w^T h(x)\), decision trees, ensembles, NN
- Algorithm: optimize a loss function via gradient = 0, gradient ascent/descent, stochastic gradient ascent/descent, coordinate ascent/descent, boosting/AdaBoost
- Evaluation and model selection: training, validation or cross-validation, test error
- Concepts: bias-variance tradeoff, overfitting

Your Midterm Exam
Content: Everything up to today
Only 50 minutes, so arrive early and settle down quickly
Cheat sheet:
- Single 8 ½ x 11 handwritten sheet, front and back
No:
- Computer, phone, other materials, ...
The exam:
- Covers key concepts and ideas; work on understanding the big picture and the differences between methods