Machine Learning: k-nearest Neighbors
Lecture 08
Razvan C. Bunescu
School of Electrical Engineering and Computer Science
bunescu@ohio.edu
Nonparametric Methods: k-nearest Neighbors

Input: a training dataset (x_1, t_1), (x_2, t_2), ..., (x_n, t_n), and a test instance x.
Output: estimated class label y(x).

1. Find the k instances x_1, x_2, ..., x_k nearest to x.
2. Let
   y(x) = \arg\max_{t \in T} \sum_{i=1}^{k} \delta(t, t_i)
   where \delta(t, t_i) = 1 if t = t_i and 0 if t \neq t_i is the Kronecker delta function.
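A minimal sketch of this majority-vote rule in Python (function and variable names are illustrative, not from the lecture):

```python
import numpy as np
from collections import Counter

def knn_classify(X_train, t_train, x, k=4):
    """Majority vote among the k training points nearest to x."""
    dists = np.linalg.norm(X_train - x, axis=1)   # Euclidean distance to each x_i
    nearest = np.argsort(dists)[:k]               # indices of the k nearest
    # arg max over t of sum_i delta(t, t_i): the most common neighbor label
    return Counter(t_train[i] for i in nearest).most_common(1)[0][0]
```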
k-nearest Neighbors (k-NN)

[Figure: classification of a test point, Euclidean distance, k = 4]
k-nearest Neighbors (k-NN)

[Figure: Euclidean distance, k = 1; the Voronoi diagram of the training points induces the decision boundary]
k-NN for Classification: Probabilistic Justification

Assume a dataset with N_j points in class C_j ⇒ the total number of points is N = \sum_j N_j.
Draw a sphere centered at x containing K points:
- the sphere has volume V;
- the sphere contains K_j points from class C_j.
If V is sufficiently small and K is sufficiently large, we can estimate [2.5.1]:
   p(x | C_j) = K_j / (N_j V),   p(x) = K / (N V),   p(C_j) = N_j / N
Bayes' theorem ⇒ p(C_j | x) = p(x | C_j) p(C_j) / p(x) = K_j / K
⇒ choose the class C_j with the most neighbors among the K.
Distance Metrics

Euclidean distance:
   d(x, y) = \|x - y\|_2 = \sqrt{(x - y)^T (x - y)}
Hamming distance: the number of (discrete) features that have different values in x and y.
Mahalanobis distance:
   d(x, y) = \sqrt{(x - y)^T \Sigma^{-1} (x - y)}
- a scale-invariant metric that normalizes for variance; \Sigma is the (sample) covariance matrix.
- if \Sigma = I ⇒ Euclidean distance.
- if \Sigma = diag(\sigma_1^2, \sigma_2^2, ..., \sigma_K^2) ⇒ normalized Euclidean distance.
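These three metrics are straightforward to sketch in NumPy (a sketch, not the lecture's code; the covariance matrix S is assumed to be given):

```python
import numpy as np

def euclidean(x, y):
    return np.sqrt((x - y) @ (x - y))

def hamming(x, y):
    # number of (discrete) features on which x and y differ
    return int(np.sum(x != y))

def mahalanobis(x, y, S):
    # S is the (sample) covariance matrix; S = I recovers Euclidean distance
    d = x - y
    return np.sqrt(d @ np.linalg.inv(S) @ d)
```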
Distance Metrics

Cosine similarity:
   d(x, y) = 1 - \cos(x, y) = 1 - x^T y / (\|x\| \|y\|)
- used for text and other high-dimensional data.
Levenshtein distance (edit distance):
- a distance metric on strings (sequences of symbols).
- the minimum number of basic edit operations (delete, insert, substitute) that can transform one string into the other.
- example: x = "athens", y = "hints" ⇒ d(x, y) = 4.
- used in bioinformatics.
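A sketch of both metrics in Python; the test at the end reproduces the slide's example d("athens", "hints") = 4:

```python
import numpy as np

def cosine_distance(x, y):
    return 1.0 - (x @ y) / (np.linalg.norm(x) * np.linalg.norm(y))

def levenshtein(x, y):
    """Minimum number of delete/insert/substitute operations turning x into y."""
    m, n = len(x), len(y)
    D = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(m + 1):
        D[i][0] = i                                # delete all of x[:i]
    for j in range(n + 1):
        D[0][j] = j                                # insert all of y[:j]
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            cost = 0 if x[i - 1] == y[j - 1] else 1
            D[i][j] = min(D[i - 1][j] + 1,         # delete
                          D[i][j - 1] + 1,         # insert
                          D[i - 1][j - 1] + cost)  # substitute (or match)
    return D[m][n]

print(levenshtein("athens", "hints"))              # prints 4
```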
Efficient Indexing

Linear search for the k nearest neighbors is not efficient for large training sets: O(N) time complexity.
For Euclidean distance, use a k-d tree:
- instances are stored at the leaves of the tree;
- internal nodes branch on threshold tests on individual features;
- the expected time to find the nearest neighbor is O(log N).
Indexing structures depend on the distance function:
- e.g., an inverted index for text retrieval with cosine similarity.
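For instance, SciPy ships a k-d tree; a minimal sketch (the toy data here is made up for illustration):

```python
import numpy as np
from scipy.spatial import KDTree

rng = np.random.default_rng(0)
X_train = rng.normal(size=(10000, 3))      # toy training set

tree = KDTree(X_train)                     # built once; instances stored at the leaves
dists, idx = tree.query(np.zeros(3), k=4)  # k nearest neighbors, O(log N) expected
```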
k-NN and the Curse of Dimensionality

Standard metrics weigh each feature equally:
- problematic when many features are irrelevant.
One solution is to weigh each feature differently (a sketch follows below):
- use a measure of each feature's ability to discriminate between classes, such as: information gain, chi-square statistic, Pearson correlation, signal-to-noise ratio, t-test.
- stretch the axes: lengthen the axes of relevant features, shorten the axes of irrelevant features.
- equivalent to a Mahalanobis distance with a diagonal covariance matrix.
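A sketch of the axis-stretching idea, assuming per-feature weights w_j have already been computed (e.g., from information gain); multiplying feature j by w_j before measuring distance amounts to a Mahalanobis distance with \Sigma = diag(1/w_j^2):

```python
import numpy as np

def weighted_euclidean(x, y, w):
    # w[j]: relevance weight for feature j (larger weight = stretched axis)
    d = w * (x - y)
    return np.sqrt(d @ d)
```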
Distance-Weighted k-NN

For any test point x, weight each of the k neighbors according to its distance from x:
1. Find the k instances x_1, x_2, ..., x_k nearest to x.
2. Let
   y(x) = \arg\max_{t \in T} \sum_{i=1}^{k} w_i \delta(t, t_i)
   where w_i = \|x - x_i\|^{-2} measures the similarity between x and x_i.
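A sketch of the weighted vote in Python; the small epsilon is my addition (not from the slides), guarding against division by zero when x coincides with a training point:

```python
import numpy as np
from collections import defaultdict

def weighted_knn_classify(X_train, t_train, x, k=4, eps=1e-12):
    dists = np.linalg.norm(X_train - x, axis=1)
    votes = defaultdict(float)
    for i in np.argsort(dists)[:k]:
        votes[t_train[i]] += 1.0 / (dists[i] ** 2 + eps)  # w_i = ||x - x_i||^-2
    return max(votes, key=votes.get)
```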
Kernel-based Distance-Weighted NN

For any test point x, weight all training instances according to their similarity with x.
1. Assume binary classification, T = {+1, -1}.
2. Compute the weighted majority:
   y(x) = \text{sign}\left( \sum_{i=1}^{N} K(x, x_i) t_i \right)
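A sketch using the Gaussian kernel introduced later in the lecture (the width 2\sigma^2 = 10 is an arbitrary choice for illustration):

```python
import numpy as np

def kernel_nn_classify(X_train, t_train, x, two_sigma_sq=10.0):
    # t_train entries are in {+1, -1}; every training instance gets a weighted vote
    K = np.exp(-np.sum((X_train - x) ** 2, axis=1) / two_sigma_sq)
    return int(np.sign(np.sum(K * t_train)))
```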
Regression with k-nearest Neighbors

Input: a training dataset (x_1, t_1), (x_2, t_2), ..., (x_n, t_n), and a test instance x.
Output: estimated function value y(x).

1. Find the k instances x_1, x_2, ..., x_k nearest to x.
2. Let
   y(x) = \frac{1}{k} \sum_{i=1}^{k} t_i
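A minimal sketch of this averaging rule (names are illustrative; inputs are assumed to be NumPy arrays):

```python
import numpy as np

def knn_regress(X_train, t_train, x, k=9):
    dists = np.linalg.norm(X_train - x, axis=1)
    nearest = np.argsort(dists)[:k]
    return float(np.mean(t_train[nearest]))  # y(x) = (1/k) * sum of the k targets
```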
Three Datasets & Linear Interpolation [http://www.autonlab.org/tutorials/mbl08.pdf]

[Figure: three example datasets fit by linear interpolation]
Linear interpolation does not always lead to good models of the data.
Regression with 1-Nearest Neighbor

[Figures: 1-NN regression fits on the three datasets]
⇒ 1-NN has high variance
Regression with 9-Nearest Neighbors

[Figures: regression fits with k = 1 vs. k = 9 on the three datasets]
Distance-Weighted k-NN for Regression

For any test point x, weight each of the k neighbors according to its similarity with x:
1. Find the k instances x_1, x_2, ..., x_k nearest to x.
2. Let
   y(x) = \frac{\sum_{i=1}^{k} w_i t_i}{\sum_{i=1}^{k} w_i},  where w_i = \|x - x_i\|^{-2}
For k = N ⇒ Shepard's method [Shepard, ACM '68].
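A sketch covering both the k-neighbor case and Shepard's method (k = None below stands in for k = N; the epsilon guard is my addition, not from the slides):

```python
import numpy as np

def weighted_knn_regress(X_train, t_train, x, k=None, eps=1e-12):
    # k = None uses all N training points, i.e., Shepard's method
    dists = np.linalg.norm(X_train - x, axis=1)
    order = np.argsort(dists)[:k]            # slicing with [:None] keeps everything
    w = 1.0 / (dists[order] ** 2 + eps)      # w_i = ||x - x_i||^-2
    return float(np.sum(w * t_train[order]) / np.sum(w))
```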
Kernel-based Distance-Weighted NN Regression

For any test point x, weight all training instances according to their similarity with x.
1. Return the weighted average:
   y(x) = \frac{\sum_{i=1}^{N} K(x, x_i) t_i}{\sum_{i=1}^{N} K(x, x_i)}
NN Regression with Gaussian Kernel

   K(x, x_i) = \exp\left( -\frac{\|x - x_i\|^2}{2\sigma^2} \right)

[Figures: fits with 2\sigma^2 = 10, 2\sigma^2 = 20, 2\sigma^2 = 80]
Increased kernel width means more influence from distant points.
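A sketch of this kernel-weighted average, with 2\sigma^2 exposed as the width parameter varied in the figures:

```python
import numpy as np

def gaussian_kernel_regress(X_train, t_train, x, two_sigma_sq=10.0):
    # kernel-weighted average over all targets; a larger 2*sigma^2 gives
    # distant points more influence, yielding a smoother fit
    K = np.exp(-np.sum((X_train - x) ** 2, axis=1) / two_sigma_sq)
    return float(np.sum(K * t_train) / np.sum(K))
```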
NN Regression with Gaussian Kernel

   K(x, x_i) = \exp\left( -\frac{\|x - x_i\|^2}{2\sigma^2} \right)

[Figures: kernel width set relative to the x axis: 2\sigma^2 = 1/16 of the x axis, 2\sigma^2 = 1/32 of the x axis, 2\sigma^2 = 1/32 of the x axis]
k-nearest Neighbor Summary

Training: memorize the training examples.
Testing: compute distance/similarity with the training examples.
- Trades decreased training time for increased test time.
Use the kernel trick to work in an implicit high-dimensional space.
Needs feature selection or feature weighting when many features are irrelevant.
An Instance-Based Learning (IBL) algorithm, also known as:
- memory-based learning
- lazy learning
- exemplar-based learning
- case-based learning