Nearest neighbor classifiers


Nearest Neighbor Classification (chapter e-6: distance-based methods)

NN(image):
1. Find the image in the training data which is closest to the query image.
2. Return its label.
[Figure: a query image and its closest training image.]

Measuring distance. Both supervised learning methods (such as nearest neighbor methods) and clustering (k-means and hierarchical clustering) need a way to measure closeness. How do we measure closeness? Distance measures for continuous data:

The Euclidean distance: $d_2(x, x') = \|x - x'\|_2 = \sqrt{(x - x')^\top (x - x')} = \sqrt{\sum_{i=1}^{d} (x_i - x'_i)^2}$ (based on the 2-norm).
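A minimal sketch of this NN rule in Python with NumPy (the names train_X, train_y, and query are illustrative, not from the slides):

    import numpy as np

    def nn_classify(train_X, train_y, query):
        """Return the label of the training point closest to the query (Euclidean distance)."""
        dists = np.linalg.norm(train_X - query, axis=1)  # distance from the query to every training point
        return train_y[np.argmin(dists)]                 # label of the closest training point

    # toy usage
    train_X = np.array([[0.0, 0.0], [1.0, 1.0], [5.0, 5.0]])
    train_y = np.array(["cat", "cat", "dog"])
    print(nn_classify(train_X, train_y, np.array([4.5, 4.8])))  # -> "dog"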

More distance measures.

The Manhattan distance: $d_1(x, x') = \|x - x'\|_1 = \sum_{i=1}^{d} |x_i - x'_i|$. This is the distance if we can only travel along the coordinate axes.

The Minkowski distance of order p: $d_p(x, x') = \|x - x'\|_p = \left( \sum_{i=1}^{d} |x_i - x'_i|^p \right)^{1/p}$. This is based on the p-norm (sometimes called the $L_p$ norm). The Euclidean and Manhattan distances are the special cases p = 2 and p = 1.

What happens when p goes to infinity? $d_\infty(x, x') = \max_i |x_i - x'_i|$. We get the Chebyshev distance.
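A small sketch of these distances (the vectors are toy values; minkowski below reduces to the Manhattan distance for p = 1, the Euclidean distance for p = 2, and approaches the Chebyshev distance as p grows):

    import numpy as np

    def minkowski(x, y, p):
        return np.sum(np.abs(x - y) ** p) ** (1.0 / p)

    def chebyshev(x, y):
        return np.max(np.abs(x - y))

    x, y = np.array([1.0, 2.0]), np.array([4.0, 6.0])
    print(minkowski(x, y, 1))   # Manhattan: 7.0
    print(minkowski(x, y, 2))   # Euclidean: 5.0
    print(minkowski(x, y, 50))  # already close to the Chebyshev distance
    print(chebyshev(x, y))      # 4.0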

The Minkowski distance. [Figure: the unit sphere for various values of p.]

Properties of a distance. A function d(x, x') is called a distance function if it satisfies the following conditions:
i. d(x, x) = 0 (the distance of a point to itself is zero)
ii. d(x, x') > 0 if x ≠ x' (all other distances are non-zero)
iii. d(x, x') = d(x', x) (distances are symmetric)
iv. d(x, x'') ≤ d(x, x') + d(x', x'') (detours make distances larger)
The last condition is called the triangle inequality. The Minkowski distance with p < 1 does not satisfy the triangle inequality.

Data normalization is very important in this context!

The Mahalanobis distance. Sometimes it is useful to use different scales for different coordinates. Therefore: use an ellipse rather than a circle to identify the points that are a fixed distance away, and also consider rotating the ellipse. [Figure: with a 45-degree rotation matrix R and a diagonal scaling matrix S, the rotated ellipse $x^\top R^\top S R x = 1/4$, the axis-parallel ellipse $x^\top S x = 1/4$, and the circle $x^\top x = 1/4$.]
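Since the slides stress that normalization matters here, the following is one common way to do it (a sketch, not prescribed by the slides): rescale each coordinate to zero mean and unit variance using statistics estimated from the training set, and apply the same transformation to queries.

    import numpy as np

    def standardize(train_X):
        """Return a function that z-scores data using the training-set statistics."""
        mu = train_X.mean(axis=0)
        sigma = train_X.std(axis=0)
        sigma[sigma == 0] = 1.0                      # guard against constant features
        return lambda X: (X - mu) / sigma

    train_X = np.array([[1.0, 100.0], [2.0, 300.0], [3.0, 200.0]])
    scale = standardize(train_X)
    print(scale(train_X))                            # each column now has mean 0 and unit variance
    print(scale(np.array([[2.5, 250.0]])))           # queries are rescaled with the *training* statistics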

The Mahalanobis distance. The shape of the ellipse can be estimated from data as the inverse of the covariance matrix $\Sigma$: $\mathrm{Dis}_M(x, y) = \sqrt{(x - y)^\top \Sigma^{-1} (x - y)}$. (The inverse of the covariance matrix has the effect of decorrelating and normalizing the features.)

The nearest neighbor rule, more formally. NN(image): 1. Find the image in the training data which is closest to the query image. 2. Return its label. In other words, classify a given test example to the class of the nearest training example. Given data $D = (x_1, y_1), \ldots, (x_N, y_N)$, reorder the data according to its similarity to an input x (breaking ties if needed), writing the reordered pairs as $(x_{[n]}(x), y_{[n]}(x))$, i.e. $d(x, x_{[1]}) \le d(x, x_{[2]}) \le \cdots \le d(x, x_{[N]})$. The prediction is $g(x) = y_{[1]}(x)$, the label of the nearest point to x in the data set.

Voronoi diagrams. The Voronoi diagram with respect to a collection of points $x_1, \ldots, x_N$: the Voronoi cell associated with point $x_i$ is the set of points that are closer to $x_i$ than to every other point in the collection.
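A sketch of the Mahalanobis distance estimated from data (the use of np.cov with rowvar=False for row-per-example data, and of a pseudo-inverse to guard against a singular covariance matrix, are implementation choices of this sketch):

    import numpy as np

    def mahalanobis(x, y, data):
        cov = np.cov(data, rowvar=False)   # estimate the covariance matrix from the data
        cov_inv = np.linalg.pinv(cov)      # its (pseudo-)inverse decorrelates and normalizes features
        d = x - y
        return np.sqrt(d @ cov_inv @ d)

    data = np.random.default_rng(0).normal(size=(200, 3))
    print(mahalanobis(data[0], data[1], data))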

Voronoi diagrams. The Voronoi diagram depends on the distance measure that is used: a Voronoi diagram computed from the Euclidean distance (L2 norm) differs from one computed from the Manhattan distance (L1 norm). [Figure: Voronoi diagrams of the same point set under the two distances.]

The decision boundary of the nearest neighbor classifier is the result of fusing adjacent Voronoi cells that are associated with the same class.

What is the accuracy of the nearest neighbor classifier when it is tested on the training set? (i.e., what is E_in?) It is perfect: each training point is its own nearest neighbor, so E_in = 0.

Property of the nearest neighbor classifier: $E_{\mathrm{out}} \le 2 E^*_{\mathrm{out}}$, where $E^*_{\mathrm{out}}$ is the error of an optimal classifier. More precisely: Theorem: For any δ > 0 there is a sufficiently large N such that, with probability at least 1 − δ, the resulting nearest neighbor classifier has $E_{\mathrm{out}} \le 2 E^*_{\mathrm{out}}$.

k-NN. Use the closest k neighbors to make a decision instead of a single nearest neighbor, and choose the label that occurs among the majority of the k nearest neighbors. Why do you expect this to work better? It can also produce confidence scores. How? Other refinements: an example's vote can be made inversely proportional to its distance from the query.

1-NN vs k-NN. Decision boundary of 1-NN vs k-NN on the digits data. [Figure 6.1 in chapter e-6: the 1-NN rule and a k-NN rule with a larger k, classifying a random sample of 500 digits, one digit class (blue circles) vs all other digits.] For the 1-NN rule, the in-sample error is zero, resulting in a complicated decision boundary with islands of red and blue regions. For the k-NN rule with a larger k, the in-sample error is not zero and the decision boundary is simpler.
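A sketch of the k-NN rule with plain majority voting and with the distance-weighted refinement mentioned above (names are illustrative):

    import numpy as np
    from collections import Counter

    def knn_classify(train_X, train_y, query, k=3, weighted=False):
        dists = np.linalg.norm(train_X - query, axis=1)
        idx = np.argsort(dists)[:k]                      # indices of the k nearest neighbors
        if not weighted:
            return Counter(train_y[idx]).most_common(1)[0][0]   # plain majority vote
        votes = {}
        for i in idx:                                    # each vote is inversely proportional to distance
            votes[train_y[i]] = votes.get(train_y[i], 0.0) + 1.0 / (dists[i] + 1e-12)
        return max(votes, key=votes.get)

One answer to the confidence-score question above: report the fraction of the k neighbors (or of the total vote weight) that agrees with the returned label.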

How to choose k? The value of k can be chosen using cross-validation, like any other classifier parameter. [Figure 6.3 in chapter e-6: E_out (%) versus the number of data points N for several choices of k, including the cross-validated choice.]

Interim conclusions. Properties of nearest neighbor classifiers:
- Simple and easy to implement
- No training required
- Expressive: can achieve zero training error
- Easy to explain the result
But:
- Running time can be an issue
- Not the best in terms of generalization

Running time for testing an example when the dataset has N examples? O(N). Expensive when dealing with large datasets.
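A sketch of choosing k by cross-validation, here using scikit-learn's KNeighborsClassifier and cross_val_score (the tooling and the candidate values of k are choices of this sketch; the slides only say that k is chosen like any other parameter):

    from sklearn.datasets import load_digits
    from sklearn.model_selection import cross_val_score
    from sklearn.neighbors import KNeighborsClassifier

    X, y = load_digits(return_X_y=True)
    scores = {k: cross_val_score(KNeighborsClassifier(n_neighbors=k), X, y, cv=5).mean()
              for k in [1, 3, 5, 7, 9, 15]}
    best_k = max(scores, key=scores.get)       # k with the highest cross-validated accuracy
    print(scores, "best k:", best_k)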

Running time for testing an example when the dataset has N examples is O(N). Solutions:
- Condense the dataset. [Figure: condensed data for 1-NN and for k-NN.]
- Efficient nearest neighbor search (KD-trees, ball trees, vantage-point trees).

Algorithm: KD-tree
- Cycle through the coordinates.
- Insert a node that corresponds to the median of the given coordinate, and put all other points in the left/right subtree on the basis of that coordinate.
[Figure: a KD-tree for the set of points (2,3), (5,4), (9,6), (4,7), (8,1), (7,2).]
A KD-tree can be used for implementing nearest neighbor search in O(log N) time. It is not effective for high-dimensional data (use a ball tree or a vantage-point tree instead). A sketch of KD-tree-based search follows below.

Nearest neighbors in high dimensions. Distance functions lose their usefulness in high dimensions. Consider the Euclidean distance, $d_2(x, x') = \sqrt{\sum_{i=1}^{d} (x_i - x'_i)^2}$: we expect that if d is large, many of the features won't be relevant, and so the signal contained in the informative dimensions can easily be corrupted by the noise. This can lead to low accuracy of a nearest neighbor classifier.

The curse of dimensionality: an umbrella term for the issues that can arise in high-dimensional data. Solutions: feature selection, dimensionality reduction.
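A sketch of efficient nearest neighbor search with a KD-tree, using SciPy's cKDTree as one possible implementation (the six points are the slide's example set):

    import numpy as np
    from scipy.spatial import cKDTree

    points = np.array([[2, 3], [5, 4], [9, 6], [4, 7], [8, 1], [7, 2]], dtype=float)
    tree = cKDTree(points)                     # build the tree once
    dist, idx = tree.query([9.0, 2.0], k=1)    # nearest neighbor of the query point (9, 2)
    print(dist, points[idx])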

The curse of dimensionality. Some of our intuition from low-dimensional spaces breaks in high dimensions. Example: in high dimensions, most of the volume of the unit sphere is very close to its surface. Let's compute the fraction of the volume that lies between r = 1 − ε and r = 1. Since the volume of a radius-r sphere in d dimensions is $V_d(r) = k_d r^d$, the required fraction is $\frac{V_d(1) - V_d(1-\varepsilon)}{V_d(1)} = 1 - (1-\varepsilon)^d$, which tends to 1 as d grows (for example, with ε = 0.01 and d = 1000 it is already $1 - 0.99^{1000} \approx 0.99996$). Related fact: the ratio of the volume of the unit sphere to the volume of the unit cube tends to 0 as d goes to infinity.

Multi-class problems. The nearest neighbor algorithm works much the same way for multi-class problems; in fact, nearest neighbor methods are easily adaptable to any ML problem. With multiclass data, the natural way to combine the classes of the nearest neighbors into the class of a test input is some form of majority voting. [Figure 6.6 in chapter e-6: (a) multiclass digits data and (b) 1-NN decision regions, with symmetry and average intensity as the two features.] On this data the multiclass success rate is well above random performance (10% for ten classes), but it is significantly below the two-class version (one digit vs all others), where the success rate was upwards of 98%; multiclass problems are typically much harder than two-class problems. Symmetry and intensity carry enough information to distinguish one digit from the others, but they are not powerful enough to solve the multiclass problem; better features would certainly help. One can also break the problem up into a sequence of two-class tasks and tailor the features to each two-class problem using domain knowledge.

Non-parametric vs parametric methods. Non-parametric methods don't have any parameters that are learned from the data, while parametric methods fix a specific form that the learned model will have. [Figure 6.5 in chapter e-6: the decision boundary of the flexible nonparametric nearest neighbor rule molds itself to the data, whereas the rigid parametric linear model always gives a linear separator.] The k-nearest neighbor method is also considered nonparametric (once the parameter k has been specified): there are no explicit parameters being learned, so nearest neighbor regression, below, is a nonparametric regression technique.

Nearest neighbor for regression. When the output is a real number (y ∈ R), the natural way to combine the outputs of the nearest neighbors is some form of averaging. The simplest way to extend the k-nearest-neighbor algorithm to regression is to take the average of the target values of the k nearest neighbors: $g(x) = \frac{1}{k} \sum_{i=1}^{k} y_{[i]}(x)$. [Figure 6.7 in chapter e-6: k-NN regression on a noisy toy data set for several values of k, with the target function shown in light gray.]

A general convergence result holds for the k-NN rule: under mild regularity conditions, no matter what the target f is, we can recover it as N grows, provided that k is chosen appropriately; this establishes the universal consistency of the k-NN rule. That is quite a powerful statement about such a simple learning model. Such convergence results under mild assumptions on f are a trademark of nonparametric methods, which has led to the folklore that nonparametric methods are, in some sense, more powerful than their parametric cousins: for the parametric linear model, one only gets such convergence to f with larger N if the target f happens to lie in the linearly parameterized hypothesis set. The distinction blurs, however, once we allow non-linear feature transforms (e.g. the polynomial transform): as the polynomial order increases, the number of parameters to be estimated grows and the hypothesis set H becomes more and more expressive, and one can let the order grow with N (but not too quickly) so that H eventually captures any fixed target.
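A sketch of the k-NN regression rule $g(x) = \frac{1}{k} \sum_{i=1}^{k} y_{[i]}(x)$ described above (names and the toy data are illustrative):

    import numpy as np

    def knn_regress(train_X, train_y, query, k=3):
        dists = np.linalg.norm(train_X - query, axis=1)
        idx = np.argsort(dists)[:k]        # k nearest neighbors of the query
        return train_y[idx].mean()         # average their target values

    train_X = np.linspace(0, 1, 20).reshape(-1, 1)
    train_y = np.sin(2 * np.pi * train_X).ravel()
    print(knn_regress(train_X, train_y, np.array([0.25]), k=3))  # roughly sin(pi/2) = 1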

Distances and kernels. The Euclidean distance can be expressed in terms of dot products: $\mathrm{Dis}_2(x, y) = \|x - y\| = \sqrt{(x - y)^\top (x - y)} = \sqrt{x \cdot x - 2\, x \cdot y + y \cdot y}$. Replacing the dot products with kernels gives $\mathrm{Dis}_K(x, y) = \sqrt{K(x, x) - 2 K(x, y) + K(y, y)}$. If we consider a kernel that satisfies K(x, x) = 1, then nearest neighbor classification with kernels or with distances is equivalent. As an alternative, consider kernels as measures of similarity: rather than looking for the closest points, look for the most similar points, and use the kernel directly.

Summary. Pros: simple and easy to implement; no training involved; one method that does it all. Cons: accuracy suffers in high dimensions; testing speed is an issue for large datasets.
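A sketch of the kernel-induced distance Dis_K and of using the kernel directly as a similarity (the RBF kernel here is just one example choice; for it K(x, x) = 1, so ranking by kernel similarity and ranking by Dis_K agree, as stated above):

    import numpy as np

    def rbf_kernel(x, y, gamma=1.0):
        return np.exp(-gamma * np.sum((x - y) ** 2))

    def kernel_distance(x, y, kernel):
        return np.sqrt(kernel(x, x) - 2 * kernel(x, y) + kernel(y, y))

    def most_similar_label(train_X, train_y, query, kernel):
        sims = np.array([kernel(x, query) for x in train_X])
        return train_y[np.argmax(sims)]    # most similar point instead of closest point

    train_X = np.array([[0.0, 0.0], [3.0, 3.0]])
    train_y = np.array(["a", "b"])
    q = np.array([2.5, 2.5])
    print(kernel_distance(q, train_X[1], rbf_kernel))
    print(most_similar_label(train_X, train_y, q, rbf_kernel))   # -> "b"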
