Going nonparametric: Nearest neighbor methods for regression and classification

1 Going nonparametric: Nearest neighbor methods for regression and classification STAT/CSE 416: Machine Learning Emily Fox University of Washington May 3, 2018 Locality sensitive hashing for approximate NN search cont'd

2 Complexity of brute-force search Given a query point, scan through each point - O(N) distance computations per 1-NN query! - O(N log k) per k-NN query! What if N is huge??? (and many queries) 3 Moving away from exact NN search Approximate neighbor finding - Don't find exact neighbor, but that's okay for many applications Out of millions of articles, do we need the closest article or just one that's pretty similar? Do we even fully trust our measure of similarity??? Focus on methods that provide good probabilistic guarantees on approximation 4 2
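To make the O(N) cost concrete, here is a minimal NumPy sketch (not from the slides) of brute-force k-NN search; the array names and the Euclidean metric are illustrative assumptions.

```python
import numpy as np

def brute_force_knn(X, x_q, k=1):
    """Scan all N points: O(N*d) distance work per query."""
    # Squared Euclidean distance from the query to every data point
    dists = np.sum((X - x_q) ** 2, axis=1)
    # Indices of the k closest points (full sort shown for clarity;
    # np.argpartition would avoid the O(N log N) sort)
    order = np.argsort(dists)[:k]
    return order, dists[order]

# Toy usage with random data (illustrative only)
X = np.random.rand(1000, 2)        # N = 1000 points in d = 2
idx, d = brute_force_knn(X, np.array([0.5, 0.5]), k=5)
```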

3 Using score for NN search
2D Data | Sign(Score) | Bin index
x1 = [0, 5] | -1 | 0
x2 = [1, 3] | -1 | 0
x3 = [3, 0] | +1 | 1
[Figure: the (#awesome, #awful) plane split by the line 1.0 #awesome - 1.5 #awful = 0; only search the Score(x) < 0 side for queries with Score(x) < 0, and the Score(x) > 0 side for queries with Score(x) > 0]
Query point x with Score(x) < 0: candidate neighbors are the points in Bin 0
HASH TABLE - list containing indices of datapoints: Bin 0: {1,2,4,7,...}, Bin 1: {3,5,6,8,...}
Search for NN amongst this set
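A small sketch of this binning step, assuming the score line 1.0 #awesome - 1.5 #awful shown above; the dictionary-based hash table and variable names are illustrative.

```python
from collections import defaultdict
import numpy as np

# 2D data: columns are (#awesome, #awful) counts
X = np.array([[0, 5],
              [1, 3],
              [3, 0]])
v = np.array([1.0, -1.5])           # coefficients of the score line (assumed)

scores = X @ v                      # Score(x) for every point
bins = (scores > 0).astype(int)     # bin index: 0 if Score < 0, 1 if Score > 0

# Hash table: bin index -> list of data point indices
table = defaultdict(list)
for i, b in enumerate(bins):
    table[b].append(i)

# For a query, only search candidates in the query's own bin
x_q = np.array([2, 2])
candidates = table[int(x_q @ v > 0)]
```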

4 Three potential issues with simple approach 1. Challenging to find good line 2. Poor quality solution: - Points close together get split into separate bins 3. Large computational cost: - Bins might contain many points, so still searching over large set for each NN query 7 How to define the line? Crazy idea: Define line randomly! [Figure: a randomly drawn line through the (#awesome, #awful) plane] 8

5 How bad can a random line be? Goal: If x, y are close (according to cosine similarity), want binned values to be the same. [Figure: points x and y at angle θ_xy in the (#awesome, #awful) plane, with regions where the bins are the same vs. different] If θ_xy is small (x, y close), x and y are unlikely to be placed into separate bins. Three potential issues with simple approach 1. Challenging to find good line 2. Poor quality solution: - Points close together get split into separate bins 3. Large computational cost: - Bins might contain many points, so still searching over large set for each NN query Bin 0: {1,2,4,7,...}, Bin 1: {3,5,6,8,...} (lists containing indices of datapoints) 10

6 Improving efficiency: Reducing # points examined per query Reducing search cost through more bins [Figure: three random lines (Line 1, Line 2, Line 3) partition the (#awesome, #awful) plane into regions, each labeled with a 3-bit bin index]

7 Using score for NN search
2D Data: x1 = [0, 5], x2 = [1, 3], x3 = [3, 0], each mapped to Sign(Score 1) / Bin 1 index, Sign(Score 2) / Bin 2 index, Sign(Score 3) / Bin 3 index under the three lines
Bin index (binary = decimal): [0 0 0] = 0, [0 0 1] = 1, [0 1 0] = 2, [0 1 1] = 3, [1 0 0] = 4, [1 0 1] = 5, [1 1 0] = 6, [1 1 1] = 7
Data indices per occupied bin: {1,2}, --, {4,8,11}, {7,9,10}, {3,5,6}
Search for NN amongst this set
Improving search quality by searching neighboring bins
[Hash table as above]
Query point lands in a bin, but is the NN in that same bin? Not necessarily. Even worse than before: each line can split close points. Sacrificing accuracy for speed
[Figure: three lines and their 3-bit bin indices in the (#awesome, #awful) plane]

8 Improving search quality by searching neighboring bins
[Hash table as above: bins [0 0 0] = 0 through [1 1 1] = 7 with their data index lists]
Next closest bins (flip 1 bit)
[Figure: bins one bit-flip away from the query's bin]
Improving search quality by searching neighboring bins
[Hash table as above]
Further bin (flip 2 bits)
[Figure: bins two bit-flips away from the query's bin]

9 Improving search quality by searching neighboring bins
[Hash table as above]
Quality of retrieved NN can only improve with searching more bins
Algorithm: Continue searching until computational budget is reached or quality of NN is good enough
[Figure: three lines and their 3-bit bin indices]
LSH recap - a kd-tree competitor data structure:
- Draw h random lines
- Compute score for each point under each line and translate to binary index
- Use h-bit binary vector per data point as bin index
- Create hash table
- For each query point x, search bin(x), then neighboring bins until time limit
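Pulling the recap together, here is a compact sketch of LSH with random hyperplanes (my own illustration, not the course's reference code); the helper names, the choice of h, and the bit-flipping search order are assumptions.

```python
import numpy as np
from collections import defaultdict
from itertools import combinations

rng = np.random.default_rng(0)

def build_lsh_table(X, h=3):
    """Draw h random lines (hyperplanes) and hash every point to an h-bit bin."""
    d = X.shape[1]
    planes = rng.normal(size=(d, h))            # one random direction per bit
    bits = (X @ planes > 0).astype(int)         # sign of each score -> 0/1
    table = defaultdict(list)
    for i, b in enumerate(bits):
        table[tuple(b)].append(i)
    return planes, table

def query_lsh(x_q, planes, table, max_flips=1):
    """Search the query's bin, then neighboring bins obtained by flipping bits."""
    q_bits = tuple((x_q @ planes > 0).astype(int))
    candidates = list(table.get(q_bits, []))
    h = len(q_bits)
    for r in range(1, max_flips + 1):           # flip 1 bit, then 2 bits, ...
        for idx in combinations(range(h), r):
            nb = list(q_bits)
            for j in idx:
                nb[j] = 1 - nb[j]
            candidates += table.get(tuple(nb), [])
    return candidates

# Usage: restrict the brute-force NN scan to the LSH candidates
X = rng.random((1000, 2))
planes, table = build_lsh_table(X, h=3)
cand = query_lsh(np.array([0.3, 0.7]), planes, table, max_flips=1)
```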

10 Moving to higher dimensions d Draw random planes [Figure: a random plane in 3D with axes x[1] = #awesome, x[2] = #awful, x[3] = #great] Score(x) = v1 #awesome + v2 #awful + v3 #great 20

11 Cost of binning points in d-dim Score(x) = v1 #awesome + v2 #awful + v3 #great Per data point, need d multiplies to determine bin index per plane One-time cost, offset if many queries of fixed dataset Summary for retrieval using nearest neighbors and locality sensitive hashing
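As a sketch of that cost, binning all N points under h planes is a single N-by-d times d-by-h matrix product (d multiplies per point per plane); the shapes below are illustrative.

```python
import numpy as np

N, d, h = 100_000, 3, 16
X = np.random.rand(N, d)            # N documents, d word-count features
planes = np.random.randn(d, h)      # h random planes

# d multiplies per point per plane: O(N * d * h) total, paid once up front
bits = (X @ planes > 0).astype(np.uint8)   # (N, h) binary bin indices
```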

12 What you can do now Implement nearest neighbor search for retrieval tasks Contrast document representations (e.g., raw word counts, tf-idf, ...) - Emphasize important words using tf-idf Contrast methods for measuring similarity between two documents - Euclidean vs. weighted Euclidean - Cosine similarity vs. similarity via unnormalized inner product Describe complexity of brute force search Implement LSH for approximate nearest neighbor search 23 Motivating NN methods for regression: Fit globally vs. fit locally

13 Parametric models of f(x) [Figures: price ($) y vs. sq.ft. x, fit with simple parametric models]

14 Parametric models of f(x) [Figures: price ($) y vs. sq.ft. x, fit with more flexible parametric models]

15 f(x) is not really a polynomial [Figure: price ($) vs. sq.ft. data with local structure] What alternative do we have? If we: - Want to allow flexibility in f(x) having local structure - Don't want to infer structural breaks What's a simple option we have? - Assuming we have plenty of data

16 Simplest approach: Nearest neighbor regression Fit locally to each data point Predicted value = y_i of the closest datapoint: 1-nearest neighbor (1-NN) regression [Figure: price ($) vs. sq.ft.; at each query, the prediction is the y of the closest datapoint]

17 What people do naturally Real estate agent assesses value by finding sale of most similar house ($ = ??? for the query house; $ = 850k for the most similar sale)
1-NN regression more formally
Dataset of (x, $) pairs: (x1, y1), (x2, y2), ..., (xN, yN)
Query point: xq
1. Find closest xi in dataset
2. Predict ŷq = y of that nearest neighbor
[Figure: price ($) vs. sq.ft.; the prediction at each query is the y of the closest datapoint]

18 Visualizing 1-NN in multiple dimensions Voronoi tessellation (or diagram): - Divide space into N regions, each containing 1 datapoint - Defined such that any x in a region is closest to that region's datapoint Don't explicitly form! Different distance metrics lead to different predictive surfaces [Figures: 1-NN predictive surfaces under Euclidean distance vs. Manhattan distance]
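For concreteness, a tiny sketch of the two metrics the slide contrasts (standard definitions, not course code):

```python
import numpy as np

def euclidean(a, b):
    return np.sqrt(np.sum((a - b) ** 2))

def manhattan(a, b):
    return np.sum(np.abs(a - b))

# The same 1-NN rule with different metrics carves space into different regions,
# which is what produces the different predictive surfaces on the slide.
```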

19 Can 1-NN be used for classification? Yes!! Just predict class of nearest neighbor 1-NN algorithm

20 Performing 1-NN search Query house: xq Dataset: x1, ..., xN Specify: Distance metric Output: Most similar house
1-NN algorithm:
Initialize Dist2NN = ∞, x_NN = Ø
For i = 1, 2, ..., N
  Compute: δ = distance(xi, xq)   (xi = i-th house, xq = query house)
  If δ < Dist2NN
    set x_NN = xi
    set Dist2NN = δ
Return most similar house x_NN   (x_NN = closest house to query house)
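A direct Python transcription of this loop (my sketch; the feature representation and distance function are placeholders):

```python
import math

def distance(a, b):
    """Placeholder metric: Euclidean distance over numeric feature vectors."""
    return math.sqrt(sum((ai - bi) ** 2 for ai, bi in zip(a, b)))

def one_nn(dataset, x_q):
    """Return the single most similar item, mirroring the slide's pseudocode."""
    dist2nn, x_nn = math.inf, None
    for x_i in dataset:                 # For i = 1, 2, ..., N
        delta = distance(x_i, x_q)      # delta = distance(x_i, x_q)
        if delta < dist2nn:             # keep the closest house seen so far
            x_nn, dist2nn = x_i, delta
    return x_nn, dist2nn

houses = [(1000, 2), (2640, 4), (1500, 3)]   # toy (sq.ft., #bedrooms) vectors
print(one_nn(houses, (2600, 3)))
```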

21 1-NN in practice [Figure: Nearest Neighbors Kernel (K = 1) fit] Fit looks good for data dense in x and low noise Sensitive to regions with little data [Figure: Nearest Neighbors Kernel (K = 1) fit] Not great at interpolating over large regions

22 Also sensitive to noise in data [Figure: Nearest Neighbors Kernel (K = 1), f(x0) vs. x] Fits can look quite wild Overfitting? k-nearest neighbors

23 Get more comps More reliable estimate if you base estimate off of a larger set of comparable homes ($ = ??? for the query house; comps at $ = 850k, $ = 749k, $ = 833k, $ = 90k)
k-NN regression more formally
Dataset of (x, $) pairs: (x1, y1), (x2, y2), ..., (xN, yN)
Query point: xq
1. Find k closest xi in dataset
2. Predict ŷq = (1/k) Σ_{j=1}^k y_NNj (average the values of the k nearest neighbors)

24 Performing k-NN search Query house: xq Dataset: x1, ..., xN Specify: Distance metric Output: k most similar houses
k-NN algorithm:
Initialize Dist2kNN = sort(δ1, ..., δk) (list of sorted distances) and X_NN = sort(x1, ..., xk) (list of sorted houses), i.e., sort the first k houses by distance to the query house
For i = k+1, ..., N
  Compute: δ = distance(xi, xq)
  If δ < Dist2kNN[k]
    find j such that δ > Dist2kNN[j-1] but δ < Dist2kNN[j]
    remove furthest house and shift queue:
      X_NN[j+1:k] = X_NN[j:k-1]
      Dist2kNN[j+1:k] = Dist2kNN[j:k-1]
    set Dist2kNN[j] = δ and X_NN[j] = xi
Return k most similar houses X_NN (closest houses to query house)
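A runnable sketch of this sorted-queue update (mine, not course code); a heap would be the usual choice, but the insert-and-shift below follows the pseudocode more literally.

```python
import math

def distance(a, b):
    return math.sqrt(sum((ai - bi) ** 2 for ai, bi in zip(a, b)))

def knn_search(dataset, x_q, k):
    """Maintain the k closest points seen so far in a sorted list."""
    # Sort the first k houses by distance to the query (initial queue)
    first = sorted(dataset[:k], key=lambda x: distance(x, x_q))
    dist2knn = [distance(x, x_q) for x in first]
    nn = list(first)
    for x_i in dataset[k:]:                      # i = k+1, ..., N
        delta = distance(x_i, x_q)
        if delta < dist2knn[-1]:                 # closer than current furthest
            # find insertion point j that keeps the queue sorted
            j = next(idx for idx, d in enumerate(dist2knn) if delta < d)
            # drop the furthest house and shift the queue
            dist2knn = dist2knn[:j] + [delta] + dist2knn[j:-1]
            nn = nn[:j] + [x_i] + nn[j:-1]
    return nn, dist2knn

houses = [(1000, 2), (2640, 4), (1500, 3), (2200, 3), (1800, 2)]
print(knn_search(houses, (2000, 3), k=3))
```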

25 k-nn in practice Nearest Neighbors Kernel (K = 30) f(x0) Much more reasonable fit in the presence of noise Boundary & sparse region issues x k-nn in practice Nearest Neighbors Kernel (K = 30) f(x0) Discontinuities! Neighbor either in or out x

26 Issues with discontinuities Overall predictive accuracy might be okay, but... For example, in housing application: - If you are a buyer or seller, this matters - Can be a jump in estimated value of house going just from 2640 sq.ft. to 2641 sq.ft. - Don't really believe this type of fit Weighted k-nearest neighbors

27 Weighted k-NN Weigh more similar houses more than those less similar in list of k-NN Predict (weights on NN): ŷq = (c_qNN1 y_NN1 + c_qNN2 y_NN2 + c_qNN3 y_NN3 + ... + c_qNNk y_NNk) / Σ_{j=1}^k c_qNNj How to define weights? Want weight c_qNNj to be small when distance(x_NNj, xq) is large and c_qNNj to be large when distance(x_NNj, xq) is small
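A small sketch of the weighted prediction (illustrative; it assumes the k nearest neighbors have already been found, and uses a simple Gaussian weight, one of the kernel choices discussed next):

```python
import numpy as np

def weighted_knn_predict(X_nn, y_nn, x_q, lam=0.2):
    """ŷ_q = sum_j c_j * y_j / sum_j c_j over the k nearest neighbors."""
    dists = np.linalg.norm(X_nn - x_q, axis=1)
    c = np.exp(-dists**2 / lam)          # weights: large when close, small when far
    return np.sum(c * y_nn) / np.sum(c)

# X_nn, y_nn: features (sq.ft.) and prices of the k nearest neighbors of x_q
X_nn = np.array([[2640.0], [2380.0], [2900.0]])
y_nn = np.array([850_000.0, 749_000.0, 833_000.0])
print(weighted_knn_predict(X_nn, y_nn, np.array([2641.0]), lam=1e5))
```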

28 Kernel weights for d=1 Define: c_qNNj = Kernel_λ(|x_NNj - xq|) (simple isotropic case) Gaussian kernel: Kernel_λ(|xi - xq|) = exp(-(xi - xq)^2 / λ) Note: never exactly 0! [Figure: kernel shape over the interval -λ to λ] Kernel weights for d>1 Define: c_qNNj = Kernel_λ(distance(x_NNj, xq)) [Figure: kernel shape over the interval -λ to λ]
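To make the shapes concrete, here are the Gaussian kernel from the slide and the bounded-support Epanechnikov kernel used in the later plots; the function names are mine.

```python
import numpy as np

def gaussian_kernel(dist, lam):
    """exp(-dist^2 / lambda): positive everywhere, never exactly 0."""
    return np.exp(-dist**2 / lam)

def epanechnikov_kernel(dist, lam):
    """3/4 * (1 - (dist/lambda)^2) on [-lambda, lambda], 0 outside (bounded support)."""
    u = dist / lam
    return np.where(np.abs(u) <= 1, 0.75 * (1 - u**2), 0.0)
```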

29 Kernel regression Weighted k-NN Weigh more similar houses more than those less similar in list of k-NN Predict (weights on NN): ŷq = (c_qNN1 y_NN1 + c_qNN2 y_NN2 + c_qNN3 y_NN3 + ... + c_qNNk y_NNk) / Σ_{j=1}^k c_qNNj

30 Kernel regression Instead of just weighting NN, weight all points (weight on each datapoint) Predict (Nadaraya-Watson kernel weighted average): ŷq = Σ_{i=1}^N c_qi y_i / Σ_{i=1}^N c_qi = Σ_{i=1}^N Kernel_λ(distance(xi, xq)) * y_i / Σ_{i=1}^N Kernel_λ(distance(xi, xq)) Kernel regression in practice [Figure: Epanechnikov Kernel (lambda = 0.2), f(x0) vs. x] Kernel has bounded support Only subset of data needed to compute local fit
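A minimal Nadaraya-Watson sketch (my own, using the Epanechnikov kernel defined above; names and bandwidth are illustrative):

```python
import numpy as np

def epanechnikov(u):
    return np.where(np.abs(u) <= 1, 0.75 * (1 - u**2), 0.0)

def kernel_regression(x_train, y_train, x_query, lam=0.2):
    """Nadaraya-Watson: ŷ_q = sum_i K((x_i - x_q)/λ) y_i / sum_i K((x_i - x_q)/λ)."""
    preds = []
    for x_q in x_query:
        w = epanechnikov((x_train - x_q) / lam)   # weight on every datapoint
        preds.append(np.sum(w * y_train) / np.sum(w) if w.sum() > 0 else np.nan)
    return np.array(preds)

# Toy 1-D example
x = np.sort(np.random.rand(200))
y = np.sin(4 * x) + 0.3 * np.random.randn(200)
y_hat = kernel_regression(x, y, x, lam=0.2)
```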

31 Choice of bandwidth λ Often, choice of kernel matters much less than choice of λ [Figures: Epanechnikov kernel fits with λ = 0.04, 0.2, 0.4, and a Boxcar kernel fit with λ = 0.2, f(x0) vs. x] Choosing λ (or k in k-NN) How to choose? Same story as always: Cross Validation
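A sketch of choosing λ by K-fold cross-validation, reusing the kernel_regression helper and the toy x, y data from the previous sketch (illustrative; the candidate grid and fold count are assumptions):

```python
import numpy as np

def cv_select_bandwidth(x, y, lambdas, n_folds=5):
    """Pick the λ with the lowest average validation MSE across folds."""
    idx = np.random.permutation(len(x))
    folds = np.array_split(idx, n_folds)
    avg_mse = []
    for lam in lambdas:
        errs = []
        for f in folds:
            train = np.setdiff1d(idx, f)
            y_hat = kernel_regression(x[train], y[train], x[f], lam=lam)
            errs.append(np.nanmean((y[f] - y_hat) ** 2))  # nan where no support
        avg_mse.append(np.mean(errs))
    return lambdas[int(np.argmin(avg_mse))]

best_lam = cv_select_bandwidth(x, y, lambdas=[0.04, 0.1, 0.2, 0.4])
```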

32 Formalizing the idea of local fits Contrasting with global average A globally constant fit weights all points equally (equal weight on each datapoint): ŷq = (1/N) Σ_{i=1}^N y_i = Σ_{i=1}^N c y_i / Σ_{i=1}^N c [Figure: Boxcar Kernel fit, f(x0) vs. x, a constant global-average fit]

33 Contrasting with global average Kernel regression leads to locally constant fit - slowly add in some points and let others gradually die off: ŷq = Σ_{i=1}^N Kernel_λ(distance(xi, xq)) * y_i / Σ_{i=1}^N Kernel_λ(distance(xi, xq)) [Figures: Boxcar Kernel (lambda = 0.2) and Epanechnikov Kernel (lambda = 0.2) fits, f(x0) vs. x] Local linear regression So far, fitting constant function locally at each point → locally weighted averages Can instead fit a line or polynomial locally at each point → locally weighted linear regression

34 Local regression rules of thumb - Local linear fit reduces bias at boundaries with minimum increase in variance - Local quadratic fit doesn't help at boundaries and increases variance, but does help capture curvature in the interior - With sufficient data, local polynomials of odd degree dominate those of even degree Recommended default choice: local linear regression Discussion on k-NN and kernel regression
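For completeness, a sketch of locally weighted linear regression at a single query point (my illustration, solving a kernel-weighted least squares for an intercept and slope; the Gaussian weight is an assumed choice):

```python
import numpy as np

def local_linear_predict(x_train, y_train, x_q, lam=0.2):
    """Fit a weighted line around x_q and return its value at x_q."""
    w = np.exp(-(x_train - x_q) ** 2 / lam)        # Gaussian kernel weights
    A = np.column_stack([np.ones_like(x_train), x_train - x_q])  # centered design
    Aw = A * w[:, None]                            # apply weights row-wise
    # Solve (A^T W A) beta = A^T W y; beta[0] is the local intercept = prediction at x_q
    beta = np.linalg.solve(A.T @ Aw, A.T @ (w * y_train))
    return beta[0]
```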

35 Nonparametric approaches k-NN and kernel regression are examples of nonparametric regression General goals of nonparametrics: - Flexibility - Make few assumptions about f(x) - Complexity can grow with the number of observations N Lots of other choices: - Splines, trees, locally weighted structured regression models Limiting behavior of NN: Noiseless setting (ε_i = 0) In the limit of getting an infinite amount of noiseless data, the MSE of the 1-NN fit goes to 0

36 Limiting behavior of NN: Noiseless setting (ε_i = 0) In the limit of getting an infinite amount of noiseless data, the MSE of the 1-NN fit goes to 0 [Figures: 1-NN fit vs. quadratic fit] Not true for parametric models! Error vs. amount of data [Figure: error as a function of # data points in training set]

37 Limiting behavior of NN: Noisy data setting In the limit of getting an infinite amount of data, the MSE of the NN fit goes to 0 if k grows, too [Figures: 1-NN fit, 200-NN fit, quadratic fit] NN and kernel methods for large d or small N NN and kernel methods work well when the data cover the space, but - the more dimensions d you have, the more points N you need to cover the space - need N = O(exp(d)) data points for good performance This is where parametric models become useful

38 Complexity of NN search Naïve approach: Brute force search - Given a query point xq - Scan through each point x1, x2, ..., xN - O(N) distance computations per 1-NN query! - O(N log k) per k-NN query! What if N is huge??? (and many queries) KD-trees! Locality-sensitive hashing, etc. k-NN for classification

39 Spam filtering example Not spam vs. Spam Input: x (text of email, sender, IP, ...) Output: y Using k-NN for classification Space of labeled emails (not spam vs. spam), organized by similarity of text; for a query email, decide not spam vs. spam via majority vote of its k-NN
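A tiny majority-vote sketch on top of the knn_search idea from earlier (illustrative; here the neighbors' labels are assumed to be given):

```python
from collections import Counter

def knn_classify(neighbor_labels):
    """Majority vote over the labels of the k nearest neighbors."""
    votes = Counter(neighbor_labels)
    return votes.most_common(1)[0][0]

print(knn_classify(["spam", "not spam", "spam"]))   # -> "spam"
```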

40 Using k-NN for classification Space of labeled emails (not spam vs. spam), organized by similarity of text; for a query email, decide not spam vs. spam via majority vote of its k-NN Summary for nearest neighbor and kernel regression

41 What you can do now Motivate the use of nearest neighbor (NN) regression Perform k-nn regression Analyze computational costs of these algorithms Discuss sensitivity of NN to lack of data, dimensionality, and noise Perform weighted k-nn and define weights using a kernel Define and implement kernel regression Describe the effect of varying the kernel bandwidth λ or # of nearest neighbors k Select λ or k using cross validation Compare and contrast kernel regression with a global average fit Define what makes an approach nonparametric and why NN and kernel regression are considered nonparametric methods Analyze the limiting behavior of NN regression Use NN for classification 8 4
