Going nonparametric: Nearest neighbor methods for regression and classification

Size: px

Start display at page:

Download "Going nonparametric: Nearest neighbor methods for regression and classification"

Hilary Ball
5 years ago
Views:

1 Going nonparametric: Nearest neighbor methods for regression and classification STAT/CSE 46: Machine Learning Emily Fox University of Washington May 8, 28 Locality sensitive hashing for approximate NN search cont d

Complexity of brute-force search Given a query point, scan through each point - O(N) distance computations per -NN query! - O(Nlogk) per k-nn query! What if N is huge?

2 Complexity of brute-force search Given a query point, scan through each point - O(N) distance computations per -NN query! - O(Nlogk) per k-nn query! What if N is huge??? (and many queries) 3 Moving away from exact NN search Approximate neighbor finding - Don t find exact neighbor, but that s okay for many applications Out of millions of articles, do we need the closest article or just one that s pretty similar? Do we even fully trust our measure of similarity??? Focus on methods that provide good probabilistic guarantees on approximation 4 2

3 Using score for NN search #awful 2D Data Sign(Score) Bin index x = [, 5] - x 2 = [, 3] - x 3 = [3, ] Only search here for queries with Score(x)<. #awesome.5 #awful = Only search here for queries with Score(x)> #awesome 5 candidate neighbors if Score(x)< Query point x Using score for NN search 2D Data Sign(Score) Bin index x = [, 5] - x 2 = [, 3] - x 3 = [3, ] candidate neighbors if Score(x)< Bin List containing indices of datapoints: {,2,4,7, } {3,5,6,8, } HASH TABLE search for NN amongst this set 6 3

4 Three potential issues with simple approach. Challenging to find good line 2. Poor quality solution: - Points close together get split into separate bins 3. Large computational cost: - Bins might contain many points, so still searching over large set for each NN query 7 How to define the line? Crazy idea: Define line randomly! #awful #awesome.5 #awful = #awesome 8 4

#awful 4 3 2 9 Bins are the same θ xy x 2 3 4 y Bins are the same Bins are different If θ xy is small

approach. Challenging to find good line 2.

5 How bad can a random line be? Goal: If x,y are close (according to cosine similarity), want binned values to be the same. #awful Bins are the same θ xy x y Bins are the same Bins are different If θ xy is small (x,y close), unlikely to be placed into separate bins #awesome Three potential issues with simple approach. Challenging to find good line 2. Poor quality solution: - Points close together get split into separate bins 3. Large computational cost: - Bins might contain many points, so still searching over large set for each NN query Bin List containing indices of datapoints: {,2,4,7, } {3,5,6,8, } 5

6 Improving efficiency: Reducing # points examined per query Reducing search cost through more bins #awful [ ] Line 2 [ ] Line Line 3 #awesome [ ] [ ] 6

7 Using score for NN search 2D Data Sign (Score ) Bin index Sign (Score 2 ) Bin 2 index Sign (Score 3 ) Bin 3 index x = [, 5] x 2 = [, 3] x 3 = [3, ] Bin [ ] = Data indices: [ ] = [ ] = 2 [ ] = 3 [ ] = 4 [ ] = 5 [ ] = 6 [ ] = 7 {,2} -- {4,8,} {7,9,} {3,5,6} search for NN amongst this set 3 Improving search quality by searching neighboring bins Bin [ ] = Data indices: [ ] = [ ] = 2 [ ] = 3 [ ] = 4 [ ] = 5 [ ] = 6 [ ] = 7 {,2} -- {4,8,} {7,9,} {3,5,6} 4 Query point here, but is NN? Not necessarily Even worse than before Each line can split pts. Sacrificing accuracy for speed #awful [ ] Line 2 [ ] Line [ ] Line 3 [ ] #awesome 7

8 Improving search quality by searching neighboring bins Bin [ ] = Data indices: [ ] = [ ] = 2 [ ] = 3 [ ] = 4 [ ] = 5 [ ] = 6 [ ] = 7 {,2} -- {4,8,} {7,9,} {3,5,6} 5 Next closest bins (flip bit) #awful [ ] Line 2 [ ] Line [ ] Line 3 [ ] #awesome Improving search quality by searching neighboring bins Bin [ ] = Data indices: [ ] = [ ] = 2 [ ] = 3 [ ] = 4 [ ] = 5 [ ] = 6 [ ] = 7 {,2} -- {4,8,} {7,9,} {3,5,6} 6 Further bin (flip 2 bits) #awful [ ] Line 2 [ ] Line [ ] Line 3 [ ] #awesome 8

9 Improving search quality by searching neighboring bins Bin [ ] = Data indices: [ ] = [ ] = 2 [ ] = 3 [ ] = 4 [ ] = 5 [ ] = 6 [ ] = 7 {,2} -- {4,8,} {7,9,} {3,5,6} 7 Quality of retrieved NN can only improve with searching more bins Algorithm: Continue searching until computational budget is reached or quality of NN good enough #awful [ ] Line 2 [ ] Line [ ] Line 3 [ ] #awesome LSH recap kd-tree competitor data structure Draw h random lines Compute score for each point under each line and translate to binary index Use h-bit binary vector per data point as bin index Create hash table For each query point x, search bin(x), then neighboring bins until time limit 8 9

10 Moving to higher dimensions d Draw random planes x[2] #awful #awesome x[] #great x[3] Score(x) = v #awesome + v 2 #awful + v 3 #great 2

11 Cost of binning points in d-dim Score(x) = v #awesome + v 2 #awful + v 3 #great Per data point, need d multiplies to determine bin index per plane One-time cost offset if many queries of fixed dataset 2 Summary for retrieval using nearest neighbors and locality sensitive hashing

12 What you can do now Implement nearest neighbor search for retrieval tasks Contrast document representations (e.g., raw word counts, tf-idf, ) - Emphasize important words using tf-idf Contrast methods for measuring similarity between two documents - Euclidean vs. weighted Euclidean - Cosine similarity vs. similarity via unnormalized inner product Describe complexity of brute force search Implement LSH for approximate nearest neighbor search 23 Motivating NN methods for regression: Fit globally vs. fit locally 2

13 Parametric models of f(x) y price ($) sq.ft. x 25 Parametric models of f(x) y price ($) sq.ft. x 26 3

14 Parametric models of f(x) y price ($) sq.ft. x 27 Parametric models of f(x) y price ($) sq.ft. x 28 4

15 f(x) is not really a polynomial y price ($) sq.ft. x 29 What alternative do we have? If we: - Want to allow flexibility in f(x) having local structure - Don t want to infer structural breaks What s a simple option we have? - Assuming we have plenty of data 3 5

16 Simplest approach: Nearest neighbor regression Fit locally to each data point Predicted value = closest y i y nearest neighbor (-NN) regression 32 price ($) Here, this is the closest datapoint Here, this is the closest datapoint sq.ft. Here, this is the closest datapoint Here, this is the closest datapoint x 6

point: x q. Find closest x i in dataset y 2.

17 What people do naturally Real estate agent assesses value by finding sale of most similar house $ =??? $ = 85k 33 -NN regression more formally Dataset of (,$) pairs: (x,y ), (x 2,y 2 ),,(x N,y N ) Query point: x q. Find closest x i in dataset y 2. Predict price ($) Here, this is the closest datapoint Here, this is the closest datapoint Here, this is the closest datapoint Here, this is the closest datapoint 34 sq.ft. x 7

Visualizing -NN in multiple dimensions Voronoi tesselation (or diagram): - Divide space into N regions, each containing datapoint - Defined such that any x in region

18 Visualizing -NN in multiple dimensions Voronoi tesselation (or diagram): - Divide space into N regions, each containing datapoint - Defined such that any x in region is closest to region s datapoint Don t explicitly form! 35 Different distance metrics lead to different predictive surfaces Euclidean distance Manhattan distance 36 8

19 Can -NN be used for classification? Yes!! Just predict class of neighbor 37 -NN algorithm 9

Performing -NN search Query house: Dataset: Specify: Distance metric Output: Most similar house -NN algorithm closest house Initialize Dist2NN =, = Ø

20 Performing -NN search Query house: Dataset: Specify: Distance metric Output: Most similar house -NN algorithm closest house Initialize Dist2NN =, = Ø For i=,2,,n Compute: δ = distance(, ) If δ < Dist2NN set = i set Dist2NN = δ Return most similar house i q query house closest house to query house 2

-NN in practice.4 Nearest Neighbors Kernel (K = ).2.8.6.4.2 Fit looks good for data dense in x and low noise 4..2.3.4.5.6.7.8.9 Sensitive to regions with little data.

21 -NN in practice.4 Nearest Neighbors Kernel (K = ) Fit looks good for data dense in x and low noise Sensitive to regions with little data.4 Nearest Neighbors Kernel (K = ) Not great at interpolating over large regions

22 Also sensitive to noise in data Nearest Neighbors Kernel (K = ).5 Fits can look quite wild Overfitting?.5 f(x) x k-nearest neighbors 22

?? $ = 749k $ = 833k $ = 9k 45 k-nn regression more formally Dataset of

23 Get more comps More reliable estimate if you base estimate off of a larger set of comparable homes $ = 85k $ =??? $ = 749k $ = 833k $ = 9k 45 k-nn regression more formally Dataset of (,$) pairs: (x,y ), (x 2,y 2 ),,(x N,y N ) Query point: x q. Find k closest x i in dataset 2. Predict 46 23

24 Performing k-nn search Query house: Dataset: Specify: Distance metric Output: Most similar houses k-nn algorithm Initialize Dist2kNN = sort(δ,,δ k ) = sort(,, ) k For i=k+,,n Compute: δ = distance( i, q ) If δ < Dist2kNN[k] find j such that δ > Dist2kNN[j-] but δ < Dist2kNN[j] remove furthest house and shift queue: [j+:k] = [j:k-] Dist2kNN[j+:k] = Dist2kNN[j:k-] set Dist2kNN[j] = δ and [j] = Return k most similar houses sort first k houses by distance to query house list of sorted distances query house list of sorted houses i closest houses to query house 24

25 k-nn in practice Nearest Neighbors Kernel (K = 3).5 f(x) x k-nn in practice Nearest Neighbors Kernel (K = 3).5.5 f(x) Much more reasonable fit in the presence of noise.5 Boundary & sparse region issues x

26 k-nn in practice Nearest Neighbors Kernel (K = 3).5.5 f(x) Discontinuities! Neighbor either in or out x Issues with discontinuities Overall predictive accuracy might be okay, but For example, in housing application: - If you are a buyer or seller, this matters - Can be a jump in estimated value of house going just from 264 sq.ft. to 264 sq.ft. - Don t really believe this type of fit 52 26

27 Weighted k-nearest neighbors Weighted k-nn Weigh more similar houses more than those less similar in list of k-nn Predict: weights on NN ŷ q = c qnn y NN + c qnn2 y NN2 + c qnn3 y NN3 + + c qnnk y NNk kx j= c qnnj 54 27

28 How to define weights? Want weight c qnnj to be small when distance(x NNj,x q ) large and c qnnj to be large when distance(x NNj,x q ) small 55 Kernel weights for d= Define: c qnnj = Kernel λ ( x NNj -x q ) simple isotropic case Gaussian kernel: Kernel λ ( x i -x q ) = exp(-(x i -x q ) 2 /λ) Note: never exactly! 56 -λ λ 28

29 Kernel weights for d Define: c qnnj = Kernel λ (distance(x NNj,x q )) 57 -λ λ Kernel regression 29

30 Weighted k-nn Weigh more similar houses more than those less similar in list of k-nn Predict: weights on NN ŷ q = c qnn y NN + c qnn2 y NN2 + c qnn3 y NN3 + + c qnnk y NNk kx j= c qnnj 59 Kernel regression Instead of just weighting NN, weight all points Predict: NX weight on each datapoint NX Nadaraya-Watson kernel weighted average ŷ q = i= NX c qi y i c qi i= = i= Kernel λ (distance(x i,x q )) * y i NX Kernel λ (distance(x i,x q )) i= 6 3

31 Kernel regression in practice Epanechnikov Kernel (lambda =.2).5 f(x) Kernel has bounded support Only subset of data needed to compute local fit x Choice of bandwidth λ Often, choice of kernel matters much less than choice of λ λ =.4 λ =.2 λ =.4 Epanechnikov Kernel (lambda =.4) Epanechnikov Boxcar Kernel Kernel (lambda (lambda =.2) =.2) Epanechnikov Kernel (lambda =.4) f(x) f(x) f(x) f(x) Boxcar kernel x x x

32 Choosing λ (or k in k-nn) How to choose? Same story as always Cross Validation 63 32

Going nonparametric: Nearest neighbor methods for regression and classification

Going nonparametric: Nearest neighbor methods for regression and classification STAT/CSE 46: Machine Learning Emily Fox University of Washington May 3, 208 Locality sensitive hashing for approximate NN