Going nonparametric: Nearest neighbor methods for regression and classification


Going nonparametric: Nearest neighbor methods for regression and classification. STAT/CSE 416: Machine Learning, Emily Fox, University of Washington, May 8, 2018.

Locality sensitive hashing for approximate NN search (cont'd)

Complexity of brute-force search. Given a query point, scan through each point: O(N) distance computations per 1-NN query, and O(N log k) per k-NN query. What if N is huge (and there are many queries)?

Moving away from exact NN search. Approximate neighbor finding: don't find the exact neighbor, but that's okay for many applications. Out of millions of articles, do we need the closest article, or just one that's pretty similar? Do we even fully trust our measure of similarity? Focus on methods that provide good probabilistic guarantees on the approximation.
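As a baseline for the costs above, here is a minimal brute-force 1-NN sketch (my own illustration, assuming the data sit in a NumPy array with one point per row); it performs exactly the O(N) distance computations the slide describes.

```python
import numpy as np

def brute_force_1nn(query, data):
    """Return the index of the row of `data` closest to `query` (Euclidean distance)."""
    dists = np.linalg.norm(data - query, axis=1)  # N distance computations
    return int(np.argmin(dists))

# Usage: data = np.random.rand(1000, 2); brute_force_1nn(np.array([0.5, 0.5]), data)
```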

Using score for NN search. Consider 2D data in the (#awesome, #awful) plane, partitioned by the line 1.0 #awesome - 1.5 #awful = 0:

2D Data       Sign(Score)   Bin index
x1 = [0, 5]   -             0
x2 = [1, 3]   -             0
x3 = [3, 0]   +             1

Only search the bin-0 region for queries with Score(x) < 0, and the bin-1 region for queries with Score(x) > 0: the candidate neighbors of a query point x are the datapoints falling on the same side of the line.

Using score for NN search. Store this in a HASH TABLE keyed by bin index, where each bin lists the indices of its datapoints, e.g. bin 0: {1,2,4,7,...}, bin 1: {3,5,6,8,...}. For a query with Score(x) < 0, search for the NN amongst the bin-0 set.
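A small sketch of this binning scheme (my own code, not the course's; the coefficients mirror the slide's line 1.0 #awesome - 1.5 #awful = 0):

```python
import numpy as np
from collections import defaultdict

data = np.array([[0, 5], [1, 3], [3, 0]], dtype=float)  # rows: (#awesome, #awful) counts
v = np.array([1.0, -1.5])                               # normal vector of the partition line

bins = defaultdict(list)                                # hash table: bin index -> datapoint indices
for i, x in enumerate(data, start=1):
    score = x @ v
    bins[int(score > 0)].append(i)                      # bin 0 if Score(x) < 0, bin 1 if Score(x) > 0

# For a query, only search the candidates in its own bin.
query = np.array([4.0, 1.0])
candidates = bins[int(query @ v > 0)]
```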

Three potential issues with the simple approach: 1. Challenging to find a good line. 2. Poor quality solution: points close together get split into separate bins. 3. Large computational cost: bins might contain many points, so we are still searching over a large set for each NN query.

How to define the line? Crazy idea: define the line randomly! [Figure: a random line through the origin in the (#awesome, #awful) plane.]

How bad can a random line be? Goal: if x and y are close (according to cosine similarity), we want their binned values to be the same. [Figure: points x and y at angle θ_xy; random lines falling outside the angle between them leave the bins the same, lines falling inside make the bins different.] If θ_xy is small (x, y close), they are unlikely to be placed into separate bins.

Three potential issues with the simple approach (revisited): 1. Challenging to find a good line. 2. Poor quality solution: points close together get split into separate bins. 3. Large computational cost: bins might contain many points (e.g. bin 0: {1,2,4,7,...}, bin 1: {3,5,6,8,...}), so we are still searching over a large set for each NN query.
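This can be quantified with the standard random-hyperplane argument (not spelled out on the slide, but consistent with it): a random line through the origin separates x and y exactly when it falls within the angle θ_xy between them, so

P(bins differ) = θ_xy / π,    P(bins agree) = 1 - θ_xy / π,

which means the bins agree with probability close to 1 when θ_xy is small.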

Improving efficiency: reducing the number of points examined per query. Reduce the search cost through more bins: draw several lines (Line 1, Line 2, Line 3) and label each region of the (#awesome, #awful) plane with the bit pattern of signs it receives under the three lines (e.g. [0 0 0], [0 1 0], ...). [Figure: the plane partitioned by three lines into bit-labeled regions.]

Using score for NN search. With three lines, each datapoint gets a 3-bit bin index:

2D Data       Sign(Score1)  Bin 1 index   Sign(Score2)  Bin 2 index   Sign(Score3)  Bin 3 index
x1 = [0, 5]   -1            0             -1            0             -1            0
x2 = [1, 3]   -1            0             -1            0             -1            0
x3 = [3, 0]   +1            1             +1            1             +1            1

HASH TABLE of bins and the data indices they contain:

Bin [0 0 0] = 0: {1,2}
Bin [0 0 1] = 1: --
Bin [0 1 0] = 2: {4,8,...}
Bin [0 1 1] = 3: --
Bin [1 0 0] = 4: --
Bin [1 0 1] = 5: --
Bin [1 1 0] = 6: {7,9,...}
Bin [1 1 1] = 7: {3,5,6}

For a query landing in bin [0 0 0], search for the NN amongst that set only.

Improving search quality by searching neighboring bins. The query point lands in some bin, but is its NN in that same bin? Not necessarily; it can even be worse than before, since each line can split close points into different bins. We are sacrificing accuracy for speed. [Figure: the three-line partition of the (#awesome, #awful) plane with its bit-labeled bins.]
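A sketch of the h-bit binning and hash-table construction (my own code, with randomly drawn lines and synthetic data rather than the slide's specific example):

```python
import numpy as np
from collections import defaultdict

rng = np.random.default_rng(0)
data = rng.random((12, 2))           # 12 points in the (#awesome, #awful) plane
h = 3
lines = rng.standard_normal((h, 2))  # h random lines through the origin

def bin_index(x, lines):
    """h-bit tuple of sign bits, one per line: 1 if the score is positive."""
    return tuple(int(b) for b in (lines @ x > 0))

table = defaultdict(list)            # hash table: h-bit bin index -> datapoint indices
for i, x in enumerate(data):
    table[bin_index(x, lines)].append(i)

# For a query, first search only the datapoints sharing its bin index.
query = rng.random(2)
candidates = table.get(bin_index(query, lines), [])
```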

Improving search quality by searching neighboring bins. Using the same hash table as above: after searching the query's own bin, search the next closest bins, obtained by flipping one bit of the bin index; then search the further bins obtained by flipping 2 bits, and so on. [Figure: the three-line partition, highlighting the query's bin and its neighboring bins.]
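A small helper (my own sketch) for enumerating the neighboring bins of an h-bit bin index, in order of how many bits are flipped:

```python
from itertools import combinations

def neighboring_bins(bits, num_flips):
    """Yield every bin index obtained by flipping exactly `num_flips` bits of `bits`."""
    for positions in combinations(range(len(bits)), num_flips):
        flipped = list(bits)
        for p in positions:
            flipped[p] = 1 - flipped[p]
        yield tuple(flipped)

# Usage: list(neighboring_bins((0, 0, 0), 1)) -> [(1, 0, 0), (0, 1, 0), (0, 0, 1)]
```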

Improving search quality by searching neighboring bins. The quality of the retrieved NN can only improve as more bins are searched. Algorithm: continue searching until the computational budget is reached or the quality of the NN is good enough.

LSH recap (a competitor data structure to kd-trees):
- Draw h random lines
- Compute the score of each point under each line and translate it to a binary index
- Use the h-bit binary vector per data point as its bin index
- Create a hash table
- For each query point x, search bin(x), then neighboring bins, until the time limit is reached
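Putting the recap together, here is a compact end-to-end sketch (my own code, under assumptions the slides leave open: NumPy data, lines through the origin, Euclidean distance for the final comparison, and a cap on bit flips instead of a wall-clock time limit):

```python
import numpy as np
from collections import defaultdict
from itertools import combinations

class SimpleLSH:
    def __init__(self, data, h=3, seed=0):
        rng = np.random.default_rng(seed)
        self.data = data
        self.lines = rng.standard_normal((h, data.shape[1]))   # draw h random lines
        self.table = defaultdict(list)                         # hash table: bin -> indices
        for i, x in enumerate(data):
            self.table[self._bin(x)].append(i)

    def _bin(self, x):
        """h-bit bin index from the signs of the h scores."""
        return tuple(int(b) for b in (self.lines @ x > 0))

    def query(self, q, max_flips=1):
        """Search bin(q), then bins within `max_flips` bit flips; return best candidate index."""
        b = self._bin(q)
        candidates = list(self.table.get(b, []))
        for flips in range(1, max_flips + 1):
            for pos in combinations(range(len(b)), flips):
                nb = tuple(bit ^ int(i in pos) for i, bit in enumerate(b))
                candidates += self.table.get(nb, [])
        if not candidates:
            return None
        dists = np.linalg.norm(self.data[candidates] - q, axis=1)
        return candidates[int(np.argmin(dists))]
```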

Moving to higher dimensions d. Draw random planes instead of random lines. [Figure: a plane in the 3D space with axes x[1] = #awesome, x[2] = #awful, x[3] = #great.] Score(x) = v1 #awesome + v2 #awful + v3 #great.

Cost of binning points in d dimensions. With Score(x) = v1 #awesome + v2 #awful + v3 #great, each data point needs d multiplies per plane to determine its bin index. This is a one-time cost, which is offset if there are many queries against a fixed dataset.

Summary for retrieval using nearest neighbors and locality sensitive hashing
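For concreteness, the one-time binning cost for N points, d dimensions, and h planes amounts to a single matrix multiply (a sketch with made-up sizes):

```python
import numpy as np

rng = np.random.default_rng(0)
N, d, h = 10_000, 3, 16
data = rng.random((N, d))                 # e.g. counts of #awesome, #awful, #great
planes = rng.standard_normal((d, h))      # one random plane per column

scores = data @ planes                    # N * d * h multiplies, paid once for a fixed dataset
bin_bits = (scores > 0).astype(np.uint8)  # N x h matrix of bin-index bits
```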

What you can do now:
- Implement nearest neighbor search for retrieval tasks
- Contrast document representations (e.g., raw word counts, tf-idf, ...); emphasize important words using tf-idf
- Contrast methods for measuring the similarity between two documents: Euclidean vs. weighted Euclidean distance; cosine similarity vs. similarity via the unnormalized inner product
- Describe the complexity of brute-force search
- Implement LSH for approximate nearest neighbor search

Motivating NN methods for regression: Fit globally vs. fit locally

Parametric models of f(x). [Figures: house price ($) y versus square feet x, with parametric fits of f(x) overlaid on the data.]

f(x) is not really a polynomial. [Figure: price ($) y vs. sq.ft. x.]

What alternative do we have? If we want to allow flexibility in f(x) having local structure, and don't want to infer structural breaks, what's a simple option we have, assuming we have plenty of data?

Simplest approach: Nearest neighbor regression. Fit locally to each data point: the predicted value is the y_i of the closest datapoint, i.e. 1-nearest neighbor (1-NN) regression. [Figure: price ($) vs. sq.ft., with each region of the fit labeled "Here, this is the closest datapoint".]

What people do naturally. A real estate agent assesses value by finding the sale of the most similar house. [Figure: the query house ($ = ???) next to its most similar recent sale and that sale's price.]

1-NN regression more formally. Dataset of (house, $) pairs: (x_1,y_1), (x_2,y_2), ..., (x_N,y_N). Query point: x_q.
1. Find the closest x_i in the dataset (the nearest neighbor).
2. Predict ŷ_q as that neighbor's value y_i.
[Figure: price ($) vs. sq.ft., as before.]

Visualizing 1-NN in multiple dimensions. Voronoi tessellation (or diagram): divide the space into N regions, each containing one datapoint, defined such that any x in a region is closest to that region's datapoint. We don't explicitly form it!

Different distance metrics lead to different predictive surfaces (e.g., Euclidean distance vs. Manhattan distance).

Can 1-NN be used for classification? Yes! Just predict the class of the nearest neighbor.

1-NN algorithm

Performing 1-NN search. Query house: x_q. Dataset: x_1, ..., x_N. Specify: a distance metric. Output: the most similar house.

1-NN algorithm:
Initialize Dist2NN = ∞, closest house = Ø
For i = 1, 2, ..., N:
  compute δ = distance(x_i, x_q)
  if δ < Dist2NN:
    set closest house = x_i
    set Dist2NN = δ
Return the most similar house (the closest house to the query house).
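A direct, runnable rendering of this loop (my own sketch; Euclidean distance stands in for the unspecified metric):

```python
import numpy as np

def one_nn(query, data, distance=lambda a, b: float(np.linalg.norm(a - b))):
    """Return (index, distance) of the datapoint closest to `query`."""
    best_i, dist2nn = None, np.inf
    for i, x in enumerate(data):
        delta = distance(x, query)
        if delta < dist2nn:
            best_i, dist2nn = i, delta
    return best_i, dist2nn
```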

1-NN in practice. [Figures: 1-NN fits, "Nearest Neighbors Kernel (K = 1)", f(x) vs. x.] The fit looks good for data dense in x and low noise, but it is sensitive to regions with little data and not great at interpolating over large regions.

Also sensitive to noise in the data. [Figure: 1-NN fit, "Nearest Neighbors Kernel (K = 1)", f(x) vs. x.] Fits can look quite wild. Overfitting?

k-nearest neighbors

Get more comps. You get a more reliable estimate if you base it on a larger set of comparable homes. [Figure: the query house ($ = ???) surrounded by several comparable sales and their prices.]

k-NN regression more formally. Dataset of (house, $) pairs: (x_1,y_1), (x_2,y_2), ..., (x_N,y_N). Query point: x_q.
1. Find the k closest x_i in the dataset.
2. Predict the average of their values, ŷ_q = (1/k) Σ_{j=1}^k y_NNj.

Performing k-NN search. Query house: x_q. Dataset: x_1, ..., x_N. Specify: a distance metric. Output: the k most similar houses.

k-NN algorithm:
Sort the first k houses by distance to the query house, giving a list of sorted distances Dist2kNN = sort(δ_1, ..., δ_k) and the corresponding list of sorted houses.
For i = k+1, ..., N:
  compute δ = distance(x_i, x_q)
  if δ < Dist2kNN[k]:
    find j such that δ > Dist2kNN[j-1] but δ < Dist2kNN[j]
    remove the furthest house and shift the queue:
      houses[j+1:k] = houses[j:k-1]
      Dist2kNN[j+1:k] = Dist2kNN[j:k-1]
    set Dist2kNN[j] = δ and houses[j] = x_i
Return the k most similar houses (the k closest houses to the query house).
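A runnable sketch in the spirit of this pseudocode (my own code; it keeps a sorted list of the k best distances seen so far, playing the role of the shifted queue):

```python
import numpy as np
from bisect import insort

def knn(query, data, k, distance=lambda a, b: float(np.linalg.norm(a - b))):
    """Return the indices of the k datapoints closest to `query`."""
    best = []                                  # sorted list of (distance, index) pairs
    for i, x in enumerate(data):
        delta = distance(x, query)
        if len(best) < k:
            insort(best, (delta, i))
        elif delta < best[-1][0]:              # closer than the furthest kept house
            best.pop()                         # remove the furthest house
            insort(best, (delta, i))           # insert, keeping the queue sorted
    return [i for _, i in best]
```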

k-nn in practice Nearest Neighbors Kernel (K = 3).5 f(x).5.5 49..2.3.4 x.6.7.8.9 k-nn in practice Nearest Neighbors Kernel (K = 3).5.5 f(x) Much more reasonable fit in the presence of noise.5 Boundary & sparse region issues 5..2.3.4 x.6.7.8.9 25

k-nn in practice Nearest Neighbors Kernel (K = 3).5.5 f(x) Discontinuities! Neighbor either in or out.5 5..2.3.4 x.6.7.8.9 Issues with discontinuities Overall predictive accuracy might be okay, but For example, in housing application: - If you are a buyer or seller, this matters - Can be a jump in estimated value of house going just from 264 sq.ft. to 264 sq.ft. - Don t really believe this type of fit 52 26

Weighted k-nearest neighbors

Weighted k-NN: weigh more-similar houses more heavily than less-similar ones in the list of k NN. Predict, with weights c_qNNj on the nearest neighbors:

ŷ_q = (c_qNN1 y_NN1 + c_qNN2 y_NN2 + c_qNN3 y_NN3 + ... + c_qNNk y_NNk) / Σ_{j=1}^k c_qNNj

How to define the weights? We want the weight c_qNNj to be small when distance(x_NNj, x_q) is large, and large when distance(x_NNj, x_q) is small.

Kernel weights for d = 1. Define c_qNNj = Kernel_λ(|x_NNj - x_q|) (simple isotropic case). Gaussian kernel: Kernel_λ(|x_i - x_q|) = exp(-(x_i - x_q)² / λ). Note: the weight is never exactly 0! [Figure: a kernel plotted on the interval from -λ to λ.]
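As a one-liner for reference (my own naming; `lam` stands in for the bandwidth λ):

```python
import numpy as np

def gaussian_kernel(dist, lam):
    """Kernel_lambda(dist) = exp(-dist**2 / lam); positive for every distance."""
    return np.exp(-dist**2 / lam)
```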

Kernel weights for d > 1. Define: c_qNNj = Kernel_λ(distance(x_NNj, x_q)). [Figure: a kernel plotted on the interval from -λ to λ.]

Kernel regression

Weighted k-nn Weigh more similar houses more than those less similar in list of k-nn Predict: weights on NN ŷ q = c qnn y NN + c qnn2 y NN2 + c qnn3 y NN3 + + c qnnk y NNk kx j= c qnnj 59 Kernel regression Instead of just weighting NN, weight all points Predict: NX weight on each datapoint NX Nadaraya-Watson kernel weighted average ŷ q = i= NX c qi y i c qi i= = i= Kernel λ (distance(x i,x q )) * y i NX Kernel λ (distance(x i,x q )) i= 6 3

Kernel regression in practice. [Figure: kernel regression fit with the Epanechnikov kernel (lambda = 0.2), f(x) vs. x.] The kernel has bounded support, so only a subset of the data is needed to compute each local fit.

Choice of bandwidth λ. Often, the choice of kernel matters much less than the choice of λ. [Figures: Epanechnikov kernel fits for several values of λ, plus a boxcar kernel fit at the same λ as the middle panel.]

Choosing λ (or k in k-NN). How do we choose? Same story as always: cross validation.
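A short sketch of that model-selection loop (my own code, using a single held-out validation set for 1-D k-NN regression; full cross validation repeats the same loop over folds and averages the errors):

```python
import numpy as np

def knn_predict(xq, x_train, y_train, k):
    idx = np.argsort(np.abs(x_train - xq))[:k]      # k nearest 1-D neighbors
    return y_train[idx].mean()

def choose_k(x_train, y_train, x_val, y_val, k_values):
    """Return the k with the lowest mean squared validation error."""
    errors = []
    for k in k_values:
        preds = np.array([knn_predict(xq, x_train, y_train, k) for xq in x_val])
        errors.append(np.mean((preds - y_val) ** 2))
    return k_values[int(np.argmin(errors))]
```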