Introduction to Machine Learning Lecture 4. Mehryar Mohri Courant Institute and Google Research



1 Introduction to Machine Learning Lecture 4 Mehryar Mohri Courant Institute and Google Research mohri@cims.nyu.edu

2 Nearest-Neighbor Algorithms

3 Nearest Neighbor Algorithms
Definition: fix $k \geq 1$. Given a labeled sample $S = ((x_1, y_1), \ldots, (x_m, y_m)) \in (X \times \{0, 1\})^m$, the k-NN algorithm returns the hypothesis $h_S$ defined by
$$\forall x \in X, \quad h_S(x) = 1_{\sum_{i : y_i = 1} w_i \,>\, \sum_{i : y_i = 0} w_i},$$
where the weights $w_1, \ldots, w_m$ are chosen such that $w_i = \frac{1}{k}$ if $x_i$ is among the $k$ nearest neighbors of $x$ (and $w_i = 0$ otherwise).
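The definition above can be sketched in a few lines of Python. This is a minimal illustration, not the lecture's code: the function name `knn_predict`, the toy sample `S`, and the choice of Euclidean distance are all assumptions (the slide leaves the metric unspecified).

```python
import math

def knn_predict(sample, x, k=1):
    """Return the k-NN label in {0, 1} for the query point x.

    sample: list of (point, label) pairs with label in {0, 1}.
    Distance is Euclidean here, which is an assumption of this sketch.
    """
    # Keep the k training points closest to x.
    neighbors = sorted(sample, key=lambda pair: math.dist(pair[0], x))[:k]
    # Uniform weights: w_i = 1/k on the k nearest neighbors, 0 elsewhere.
    w = 1.0 / k
    weight_one = sum(w for _, label in neighbors if label == 1)
    weight_zero = sum(w for _, label in neighbors if label == 0)
    # h_S(x) = 1 exactly when label 1 carries strictly more weight than label 0.
    return 1 if weight_one > weight_zero else 0

# Hypothetical toy sample: three points near the origin, two far away.
S = [((0.0, 0.0), 0), ((1.0, 0.0), 0), ((0.0, 1.0), 1),
     ((5.0, 5.0), 1), ((6.0, 5.0), 1)]

print(knn_predict(S, (0.2, 0.1), k=1))  # nearest point is (0, 0), label 0
print(knn_predict(S, (5.5, 5.0), k=3))  # two of the three neighbors have label 1
```

Sorting the whole sample costs O(m log m) per query; it keeps the sketch short but is not how one would implement this for large samples.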

4 Voronoi Diagram
[Figure: Voronoi diagram]

5 Questions
- Performance: does it work?
- Choice of the weights: are there better choices than uniform? In particular, the weights can take into account the distance to each nearest neighbor.
- Choice of the distance metric: can a useful metric be defined (or even learned) for a particular problem?
- Computation in high dimension: data structures and algorithms to improve upon the naive algorithm.
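To make the question about weights concrete, here is one possible non-uniform choice: weights inversely proportional to distance. The slides do not prescribe this scheme; the function name `weighted_knn_predict`, the `eps` smoothing term, and the toy sample are assumptions made for illustration only.

```python
import math

def weighted_knn_predict(sample, x, k=3, eps=1e-12):
    """k-NN vote with weights 1 / (distance + eps) instead of uniform 1/k.

    Inverse-distance weighting is just one candidate scheme (an assumption
    of this sketch); eps avoids division by zero when x is a training point.
    """
    # (distance, label) pairs for the k nearest neighbors of x.
    nearest = sorted((math.dist(p, x), label) for p, label in sample)[:k]
    weight_one = sum(1.0 / (d + eps) for d, label in nearest if label == 1)
    weight_zero = sum(1.0 / (d + eps) for d, label in nearest if label == 0)
    return 1 if weight_one > weight_zero else 0

# Hypothetical sample: one close point labeled 1, two farther points labeled 0.
S = [((0.0, 0.0), 1), ((2.0, 0.0), 0), ((2.1, 0.0), 0)]

# A uniform 3-NN vote would return 0 (two votes against one); the single
# close neighbor outweighs the two distant ones under inverse-distance weights.
print(weighted_knn_predict(S, (0.5, 0.0), k=3))
```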

6 Bayes Classifier
Definition: the Bayes error is defined by
$$R^* = \inf_{h \text{ measurable}} \; \Pr_{(x, y) \sim D}[h(x) \neq y].$$
The Bayes classifier is a measurable hypothesis achieving that error.
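For a distribution over a finite input set, the Bayes classifier and Bayes error can be computed directly: predict the more probable label at each point, and pay the probability mass of the less probable one. The sketch below assumes a hypothetical joint distribution `D` given as a dictionary; none of these names come from the lecture.

```python
def bayes_classifier_and_error(joint):
    """Return (h, error) for a finite joint distribution over (x, y), y in {0, 1}.

    joint: dict mapping (x, y) -> probability; missing entries count as 0.
    """
    h = {}
    error = 0.0
    for x in {x for x, _ in joint}:
        p0 = joint.get((x, 0), 0.0)
        p1 = joint.get((x, 1), 0.0)
        # Predict the more probable label at x (ties broken toward 0).
        h[x] = 1 if p1 > p0 else 0
        # The mass of the less probable label at x is misclassified.
        error += min(p0, p1)
    return h, error

# Hypothetical distribution on two input points "a" and "b".
D = {("a", 0): 0.4, ("a", 1): 0.1, ("b", 0): 0.15, ("b", 1): 0.35}
h_star, r_star = bayes_classifier_and_error(D)
print(h_star, r_star)  # "a" -> 0, "b" -> 1; error 0.1 + 0.15 = 0.25 up to rounding
```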

7 Set-up
- Sample drawn i.i.d. according to some distribution $D$: $S = ((x_1, y_1), \ldots, (x_m, y_m)) \in (X \times \{0, 1\})^m$.
- Nearest neighbor of $x \in X$: $\mathrm{NN}(S, x) = \operatorname{argmin}_{x' \in S} d(x, x')$.
- Error of the hypothesis returned on point $x \in X$: $R(h_S, x) = 1_{y(h_S(x)) \neq y(x)}$, where $y(u)$ is the label of point $u$ (a random variable).

8 Convergence of NN Algorithm
Lemma: for any $x$ in the support of $D$, $\mathrm{NN}(S, x) \to x$ with probability one when $|S| \to +\infty$.
Proof: Let $x$ be in the support of the distribution. Then, for any $\epsilon > 0$, $\Pr[B(x, \epsilon)] > 0$. Thus,
$$\Pr\big[d(\mathrm{NN}(S, x), x) > \epsilon\big] = \big(1 - \Pr[B(x, \epsilon)]\big)^{|S|} \to 0.$$
Since $d(\mathrm{NN}(S, x), x)$ is decreasing with $|S|$, this also implies convergence with probability one.
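The identity in the proof can be checked numerically. The sketch below assumes a toy setting that is not in the slides: $D$ uniform on $[0, 1]$, $x = 0.5$ and $\epsilon = 0.05$, so that $\Pr[B(x, \epsilon)] = 0.1$. The function names are hypothetical.

```python
import random

def miss_probability(ball_mass, m):
    """Closed form from the proof: (1 - Pr[B(x, eps)])^m."""
    return (1.0 - ball_mass) ** m

def simulated_miss_rate(x, eps, m, trials=2000, seed=0):
    """Monte Carlo estimate of Pr[d(NN(S, x), x) > eps] for D = Uniform[0, 1].

    The uniform distribution is an assumption of this sketch.
    """
    rng = random.Random(seed)
    misses = 0
    for _ in range(trials):
        # Distance from x to its nearest neighbor in a fresh sample of size m.
        nn_dist = min(abs(rng.random() - x) for _ in range(m))
        if nn_dist > eps:
            misses += 1
    return misses / trials

for m in (10, 50, 100):
    # The closed-form probability decays geometrically in the sample size m.
    print(m, miss_probability(0.1, m), simulated_miss_rate(0.5, 0.05, m))
```

With ball mass 0.1, the closed form is $0.9^m$, which already falls below $10^{-4}$ at $m = 100$. The decay rate depends entirely on $\Pr[B(x, \epsilon)]$, which can be arbitrarily small for other distributions.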

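9 NN Algorithm - Limit Guarantee
Theorem: let $h_S$ be the hypothesis returned by the nearest neighbor algorithm. Then,
$$\lim_{|S| \to \infty} \mathop{\mathrm{E}}_{S \sim D^m}[R(h_S)] \leq 2R^* \left(1 - \frac{|Y|/2}{|Y| - 1} R^*\right).$$
Proof:
$$\begin{aligned}
\mathop{\mathrm{E}}_{S \sim D^m}[R(h_S, x)] &= \Pr_{S \sim D^m}\big[y(\mathrm{NN}(S, x)) \neq y(x)\big] \\
&= \sum_{x'} \Pr\big[y(x') \neq y(x) \mid \mathrm{NN}(S, x) = x'\big] \Pr_{S \sim D^m}[\mathrm{NN}(S, x) = x'] \\
&= \sum_{x'} \Big(1 - \Pr\big[y(x') = y(x) \mid \mathrm{NN}(S, x) = x'\big]\Big) \Pr_{S \sim D^m}[\mathrm{NN}(S, x) = x'] \\
&= \sum_{x'} \Big(1 - \sum_{y \in Y} \Pr[y \mid x] \Pr[y \mid x']\Big) \Pr_{S \sim D^m}[\mathrm{NN}(S, x) = x'].
\end{aligned}$$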

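10 NN Algorithm - Limit Guarantee
In view of the lemma, $\mathrm{NN}(S, x) \to x$ with probability one when $|S| \to +\infty$. Thus,
$$\lim_{|S| \to +\infty} \mathop{\mathrm{E}}_{S \sim D^m}[R(h_S, x)] = 1 - \sum_{y \in Y} \Pr[y \mid x]^2.$$
From this it can be concluded that
$$\lim_{|S| \to +\infty} \mathop{\mathrm{E}}_{S \sim D^m}[R(h_S)] = \mathop{\mathrm{E}}_{x \sim D}\Big[1 - \sum_{y \in Y} \Pr[y \mid x]^2\Big].$$
Let $y^* = \operatorname{argmax}_y \Pr[y \mid x]$, then
$$1 - \sum_{y \in Y} \Pr[y \mid x]^2 = 1 - \Pr[y^* \mid x]^2 - \sum_{y \neq y^*} \Pr[y \mid x]^2.$$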

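11 NN Algorithm - Limit Guarantee
Now, since the variance is non-negative,
$$\frac{1}{|Y| - 1} \sum_{y \neq y^*} \Pr[y \mid x]^2 - \Big(\frac{1}{|Y| - 1} \sum_{y \neq y^*} \Pr[y \mid x]\Big)^2 \geq 0.$$
Thus, in view of $\sum_{y \neq y^*} \Pr[y \mid x] = (1 - \Pr[y^* \mid x])$, and writing $R^*(x) = 1 - \Pr[y^* \mid x]$ for the Bayes error at $x$,
$$\begin{aligned}
\mathop{\mathrm{E}}_{x \sim D}\Big[1 - \sum_{y \in Y} \Pr[y \mid x]^2\Big]
&\leq \mathop{\mathrm{E}}_{x \sim D}\Big[1 - \Pr[y^* \mid x]^2 - \frac{(1 - \Pr[y^* \mid x])^2}{|Y| - 1}\Big] \\
&= \mathop{\mathrm{E}}_{x \sim D}\Big[1 - (1 - R^*(x))^2 - \frac{R^*(x)^2}{|Y| - 1}\Big] \\
&= \mathop{\mathrm{E}}_{x \sim D}\Big[2R^*(x) - \frac{|Y|}{|Y| - 1} R^*(x)^2\Big] \\
&\leq 2R^* - \frac{|Y|}{|Y| - 1} R^{*2} \qquad \text{(using } \mathrm{E}[R^*(x)^2] \geq \mathrm{E}[R^*(x)]^2\text{)}.
\end{aligned}$$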
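As a numerical sanity check, the pointwise version of this inequality (before taking the expectation over $x$) can be evaluated for a given vector of conditional probabilities $\Pr[y \mid x]$. The function names and the two example posteriors below are assumptions chosen for illustration.

```python
def nn_asymptotic_error(posterior):
    """Limit of the NN error at a point x: 1 - sum_y Pr[y|x]^2."""
    return 1.0 - sum(p * p for p in posterior)

def pointwise_bayes_error(posterior):
    """R*(x) = 1 - Pr[y*|x], with y* the most probable label."""
    return 1.0 - max(posterior)

def nn_limit_bound(r_star, num_classes):
    """Upper bound 2 R* - |Y| / (|Y| - 1) * R*^2 from the derivation above."""
    return 2.0 * r_star - num_classes / (num_classes - 1) * r_star ** 2

for posterior in [(0.8, 0.2), (0.6, 0.3, 0.1)]:
    r = pointwise_bayes_error(posterior)
    print(posterior, nn_asymptotic_error(posterior), nn_limit_bound(r, len(posterior)))
```

For the two-class posterior (0.8, 0.2) both quantities equal 0.32 up to rounding: with $|Y| = 2$ there is a single label other than $y^*$, so the variance term vanishes and the pointwise inequality is tight. For (0.6, 0.3, 0.1) the limit error is 0.54 against a bound of 0.56.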

12 Notes
- Similar results hold for the k-NN algorithm as $m = |S| \to \infty$, or with $(k \to \infty) \wedge (k/m \to 0)$.
- Guarantees hold only for an infinite amount of data, whereas machine learning deals with finite samples.
- The convergence rate can be arbitrarily slow.

13 NN Problem
Problem: given a sample $S = ((x_1, y_1), \ldots, (x_m, y_m))$, find the nearest neighbor of a test point $x$.
- General problem extensively studied in computer science.
- Exact vs. approximate algorithms.
- Dimensionality $N$ is crucial.
- Better algorithms for small intrinsic dimension (e.g., limited doubling dimension).

14 NN Problem - Case N = 2
Algorithm:
- compute the Voronoi diagram in $O(m \log m)$.
- use a point location data structure to determine the NN.
- complexity: $O(m)$ space, $O(\log m)$ time.

15 NN Problem - Case N > 2
- Voronoi diagram: size in $O\big(m^{\lceil N/2 \rceil}\big)$.
- Linear algorithm (no pre-processing):
  - compute the distance $\|x - x_i\|$ for all $i \in [1, m]$.
  - complexity of distance computation: $\Omega(Nm)$.
  - no additional space needed.
- Tree-based data structures: pre-processing.
  - often used in applications: k-d trees (k-dimensional trees).
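The linear algorithm is short enough to write out in full. The sketch below assumes Euclidean distance and uses a hypothetical function name, `linear_scan_nn`; the seven example points are the ones listed on the k-d tree illustration slide.

```python
import math

def linear_scan_nn(points, x):
    """Return (index, distance) of the nearest point to x by exhaustive search.

    Each distance costs O(N) and there are m of them, so a query costs O(Nm)
    with no pre-processing and no extra space beyond two scalars.
    """
    best_index, best_dist = -1, math.inf
    for i, p in enumerate(points):
        d = math.dist(p, x)  # Euclidean distance (an assumption of this sketch)
        if d < best_dist:
            best_index, best_dist = i, d
    return best_index, best_dist

points = [(3.0, 5.0), (4.0, 2.0), (5.0, 9.0), (1.0, 1.0),
          (8.0, 4.0), (2.0, 9.5), (7.0, 5.5)]
print(linear_scan_nn(points, (6.0, 5.0)))  # index 6, the point (7.0, 5.5)
```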

16 k-d Trees (Bentley, 1975)
- Binary space partitioning trees.
- Prominent tree-based data structure.
- Works for low or medium dimensionality.
- NN search:
  - $O(\log m)$ for randomly distributed points.
  - $O\big(N m^{1 - \frac{1}{N}}\big)$ in the worst case (Lee and Wong, 1977).
- Can be extended to k-NN search.
- High dimension: typically inefficient; approximate NN methods are used instead.

17 k-d Trees - Illustration
[Figure: k-d tree. Node labels recovered from the figure: (3, 5), Y; (4, 2), X; (5, 9), X; (1, 1); (8, 4); (2, 9.5); (7, 5.5).]

18 k-d Trees - Construction
Algorithm: for each non-leaf node,
- choose a dimension (e.g., the longest side of the hyperrectangle).
- choose a pivot (median).
- split the node according to (pivot, dimension).
The result is a balanced tree that gives a binary space partitioning.
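The construction steps can be sketched as a short recursive function. This is one possible reading, under stated assumptions: every node stores its pivot point (rather than keeping points only at the leaves), the "longest" dimension is taken from the bounding box of the points in the node, and the names `Node` and `build_kdtree` are hypothetical.

```python
from collections import namedtuple

# point: the pivot stored at this node; dim: index of the splitting dimension.
Node = namedtuple("Node", ["point", "dim", "left", "right"])

def build_kdtree(points):
    """Recursively build a k-d tree from a list of equal-length tuples."""
    if not points:
        return None
    n_dims = len(points[0])
    # Choose the dimension along which the points' bounding box is longest.
    dim = max(range(n_dims),
              key=lambda j: max(p[j] for p in points) - min(p[j] for p in points))
    # Choose the median along that dimension as pivot.
    ordered = sorted(points, key=lambda p: p[dim])
    mid = len(ordered) // 2
    # Split: smaller coordinates go left, larger go right.
    return Node(ordered[mid], dim,
                build_kdtree(ordered[:mid]),
                build_kdtree(ordered[mid + 1:]))

points = [(3.0, 5.0), (4.0, 2.0), (5.0, 9.0), (1.0, 1.0),
          (8.0, 4.0), (2.0, 9.5), (7.0, 5.5)]
tree = build_kdtree(points)
print(tree.point, tree.dim)            # root pivot (3.0, 5.0), split on dimension 1 (Y)
print(tree.left.point, tree.left.dim)  # (4.0, 2.0), split on dimension 0 (X)
```

On the seven points listed on the illustration slide, this rule yields the root (3, 5) split on Y and the internal nodes (4, 2) and (5, 9) split on X. Re-sorting at every level makes construction O(m log² m) here; a linear-time median selection would bring it to O(m log m).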

19 k-d Trees - NN Search
[Figure: NN search in a k-d tree]

20 k-d Trees - NN Search
Algorithm:
- find the region containing $x$ (starting from the root node, move to a child node based on the node test).
- save the region point $x_0$ as the current best.
- move up the tree and recursively search the regions intersecting the hypersphere $S(x, \|x - x_0\|)$:
  - update the current best if the current point is closer.
  - restart the search with each intersecting sub-tree.
  - move up the tree when there is no more intersecting sub-tree.
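A common recursive formulation of this search is sketched below. It is not taken from the slides and differs in small ways: points are stored at every node, the current best is updated on the way down as well as on the way up, and the tree is built by cycling through the dimensions for brevity. The names `build` and `nearest` are hypothetical.

```python
import math

def build(points, depth=0):
    """Compact k-d tree as nested tuples (point, dim, left, right).

    Cycles through the dimensions by depth (a simplification of this sketch).
    """
    if not points:
        return None
    dim = depth % len(points[0])
    ordered = sorted(points, key=lambda p: p[dim])
    mid = len(ordered) // 2
    return (ordered[mid], dim,
            build(ordered[:mid], depth + 1),
            build(ordered[mid + 1:], depth + 1))

def nearest(node, x, best=None):
    """Return (distance, point) for the stored point closest to x."""
    if node is None:
        return best
    point, dim, left, right = node
    d = math.dist(point, x)
    if best is None or d < best[0]:
        best = (d, point)  # update the current best if this point is closer
    # Descend first into the side of the split that contains x.
    near, far = (left, right) if x[dim] < point[dim] else (right, left)
    best = nearest(near, x, best)
    # Visit the other side only if the hypersphere around x with the current
    # best radius crosses the splitting hyperplane.
    if abs(x[dim] - point[dim]) < best[0]:
        best = nearest(far, x, best)
    return best

points = [(3.0, 5.0), (4.0, 2.0), (5.0, 9.0), (1.0, 1.0),
          (8.0, 4.0), (2.0, 9.5), (7.0, 5.5)]
tree = build(points)
print(nearest(tree, (6.0, 5.0)))  # nearest stored point is (7.0, 5.5)
```

The pruning test is what saves work: a sub-tree is skipped whenever the splitting hyperplane lies farther from $x$ than the current best distance, since no point on the far side can then be closer.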

21 References
- Jon Louis Bentley. Multidimensional binary search trees used for associative searching. Communications of the ACM, Vol. 18, No. 9, 1975.
- D. T. Lee and C. K. Wong. Worst-case analysis for region and partial region searches in multidimensional binary search trees and balanced quad trees. Acta Informatica, Vol. 9, Issue 1. Springer, NY, 1977.
Mehryar Mohri - Foundations of Machine Learning. Courant Institute, NYU
