Nearest Neighbor Predictors


September 2, 2018

Perhaps the simplest machine learning prediction method, from a conceptual point of view, and perhaps also the most unusual, is the nearest-neighbor method, which can be used for either classification or regression. In this method, we just remember the entire training set T. At test time, when given a new data point x in the data space X, the nearest-neighbor predictor returns the label (for classification) or value (for regression) y_n corresponding to the training data point x_n that is closest to x in a pre-specified metric.

Thus, there is no training: just store all of T = {(x_1, y_1), ..., (x_N, y_N)}, regardless of the nature of the data points x_n or of the labels or values y_n, and define some distance δ in the data space X, for example the squared Euclidean distance

    δ(x, x') = ||x − x'||^2 .

The classifier is then

    h(x) = y_{ν(x)}   where   ν(x) = arg min_{n=1,...,N} δ(x, x_n) .

In words: to find the value y = h(x) for a new data point x, find the index n = ν(x) of the training data point x_n that is closest to x and return the value y_n for that training sample.

There is no distinction between classification and regression here: if Y is categorical, h returns a label, and is therefore a classifier. Otherwise, h returns a value (anything, indeed: a single real number, a vector in any space, ...), and is therefore a regressor. In addition, the number of classes is irrelevant for nearest-neighbor classification: Y can have any number of elements.

A first unusual property of the nearest-neighbor predictor is that training takes no time and inference (computing the value of h(x) for a given x) is slow: computing the arg min over n, at least in a naive implementation, requires comparing x with all the training data points x_n, so the time required for prediction is linear in N, the size of the training set. This is the opposite of what happens with most machine learning methods, which take a possibly long training time and are usually fast at inference time.

One can do better than linear time, in practice, by building a data structure (typically based on R-trees, k-d trees, or locality-sensitive hashing schemes) to store T. This construction requires time initially, but then finding the nearest neighbor to x is faster in practice. Theoretically, building an exact nearest-neighbor search structure is an NP-hard problem. However, practically efficient implementations exist at the cost of some approximation. For instance, rather than obtaining the closest sample, some of these implementations guarantee that a sample within a factor ρ of closest is returned, for a given ρ > 1. This means that if the truly nearest sample to x is x_{ν*(x)} but h returns the value associated with a possibly different sample x_{ν(x)} found by the data structure, then

    δ(x, x_{ν(x)}) < ρ δ(x, x_{ν*(x)}) .
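To make the definition above concrete, the following is a minimal sketch of a nearest-neighbor predictor using the naive linear scan; the function name, the toy data, and the use of the squared Euclidean distance for δ are illustrative choices, not part of the original note.

```python
import numpy as np

def nearest_neighbor_predict(X_train, y_train, x):
    """Return the label or value y_n of the training point x_n closest to x.

    X_train: (N, d) array of training data points.
    y_train: length-N sequence of labels (classification) or values (regression).
    x:       length-d query point.
    """
    # delta(x, x_n) = ||x - x_n||^2 for all n: a linear scan over the training set.
    dists = np.sum((X_train - x) ** 2, axis=1)
    nu = np.argmin(dists)      # nu(x) = arg min_n delta(x, x_n)
    return y_train[nu]         # h(x) = y_{nu(x)}

# Tiny made-up example: works unchanged for labels or for numeric values.
X = np.array([[0.0, 0.0], [1.0, 1.0], [3.0, 2.0]])
y = ["blue circle", "blue circle", "red square"]
print(nearest_neighbor_predict(X, y, np.array([2.5, 2.0])))  # -> "red square"
```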

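As one concrete example of such a data structure (my choice, not necessarily the implementation used in the course), SciPy's k-d tree supports this kind of approximate query: its eps parameter guarantees that the returned neighbor is within a factor 1 + eps of the true nearest Euclidean distance, which corresponds to ρ = (1 + eps)^2 in terms of the squared distance δ used above. The data below are made up for illustration.

```python
import numpy as np
from scipy.spatial import cKDTree

rng = np.random.default_rng(0)
X_train = rng.normal(size=(10_000, 3))      # N = 10,000 made-up points, d = 3
y_train = rng.integers(0, 2, size=10_000)   # arbitrary binary labels

tree = cKDTree(X_train)   # building the tree plays the role of training time

x = np.array([0.1, -0.2, 0.3])

# Exact query (eps=0) versus approximate query: with eps > 0 the returned
# neighbor is at most (1 + eps) times farther, in Euclidean distance, than
# the truly nearest training point.
dist_exact, idx_exact = tree.query(x, k=1)
dist_approx, idx_approx = tree.query(x, k=1, eps=0.5)

print(y_train[idx_exact], dist_exact)     # h(x) with the exact nearest neighbor
print(y_train[idx_approx], dist_approx)   # h(x) with the approximate one
```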
In words, the distance between x and x_{ν(x)} is never more than ρ times the distance between x and the nearest training data point. One can pick any ρ strictly greater than one: the closer ρ is to one, the longer it takes to construct the data structure, and the more storage space the structure takes. With these data structures, if one is willing to pay the ρ approximation, the typical situation is restored: building the data structure can be considered as part of training time, which is now long, and testing the predictor is comparatively fast.

Incidentally, small approximations are often entirely acceptable in this type of problem. This is because the particular notion of distance chosen is often a crude way to capture dissimilarity (what is so special about the Euclidean distance anyway?), and a small discrepancy when looking for the nearest point to x is often negligible when compared to the looseness in the choice of distance.

In few dimensions, perhaps d = 2, 3, one can build an exact data structure called the Voronoi diagram: partition space into convex regions that are either polygons (in 2D) or polyhedra (in 3D), one region V_n for each training sample (x_n, y_n). The defining property of V_n is that any point x ∈ V_n is closer to x_n than it is to any other training data point. Figure 1(a) illustrates this concept for a subset of the data in a set introduced in a previous note. (A subset of the data is used to make the figure more legible; there would be no problem in computing the Voronoi diagram for the whole set.) If we only draw the edges between Voronoi regions that belong to data points with different labels, we obtain the partition that the nearest-neighbor classifier induces in the data space X, as shown by the black lines in Figure 1.

Figure 1: (a) The black lines are the edges of the Voronoi diagram of the sample points (blue circles and red squares), considered with no regard to their label values. Each cell (polygon) of the Voronoi diagram is subsequently colored in the color corresponding to the label of the training sample in it. The red and blue regions are the two subsets of the partition of the data space X into sets corresponding to the two classes. Any new data point x in the red region is classified as a red square. Any other data point is classified as a blue circle. The black lines trace the boundary between the two sets of the partition.
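For two-dimensional data, this construction is readily available in standard libraries; the sketch below uses scipy.spatial.Voronoi on a small set of made-up points to draw a diagram like the black edges in Figure 1(a) (the coloring of the cells by class label is not reproduced here).

```python
import numpy as np
from scipy.spatial import Voronoi, voronoi_plot_2d
import matplotlib.pyplot as plt

rng = np.random.default_rng(1)
points = rng.uniform(size=(20, 2))   # stand-in for 20 training data points in the plane

vor = Voronoi(points)                # one convex cell V_n per training point x_n

# vor.point_region[n] is the index of the cell of point n; vor.regions and
# vor.vertices describe the polygons themselves.
voronoi_plot_2d(vor)                 # draws the cell edges (the black lines in Figure 1)
plt.show()
```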

The Voronoi diagram is a purely conceptual construction for nearest-neighbor predictors: while this construction would lead to an efficient classifier for new data points x, it is rare that the data space is two- or three-dimensional in typical classification or regression problems. While the Voronoi diagram is a well-defined concept in more dimensions, there is no known efficient algorithm to construct it. Instead, the approximate algorithms mentioned earlier are used for classification when both d (the dimensionality of X) and N (the number of training samples) are very large. When d is large but N is not, the nearest neighbor of x can be found efficiently by a linear scan of all the elements in the training set. For regression, each Voronoi region has in general a different value y_n, so the hypothesis space H of all functions that the nearest-neighbor regressor can represent is the set of all real-valued functions that are piecewise constant over Voronoi-like regions.

Overfitting and k-Nearest Neighbors

It should be clear from both the concept and Figure 1 that nearest-neighbor predictors can represent functions (for regression) or class boundaries (for classification) of arbitrary complexity, as long as the training set T is large enough. In light of our previous discussions, it should also be clear that this high degree of expressiveness of H is not necessarily a good thing, as it can lead to overfitting: for instance, for binary classification, the exact boundary between the two classes will follow the vagaries of the exact placement of data points that are near other data points of the opposite class. Even the simple example in Figure 1 should convince you that a slightly different choice of training data would lead to quite different boundaries.

More explicitly, Figure 2 shows the boundaries defined by the nearest-neighbor classifiers for (a) a training set T with 200 points and (b-f) five random samples of 40 out of the 200 points in T. There is clearly quite a bit of variance in the results, and rather than drawing a boundary that somehow captures the bulk of the point distributions, each nearest-neighbor classifier follows every twist and turn of the complicated space between blue circles and red squares. This is an example of overfitting.

The notion of k-nearest neighbors, where k is a positive integer, is introduced to control overfitting to some extent. Rather than returning the value y_n of the training sample whose data point x_n is nearest to the new data point x, the k-nearest-neighbors regressor returns the average (or sometimes the median) of the values for the k data points that are nearest to x. The k-nearest-neighbors classifier, on the other hand, returns the majority of the labels for the k data points that are nearest to x. For classification, it is customary to set k to an odd value, so that the concept of majority requires no tie-breaking rule.

As an example, Figure 3 draws approximate decision regions for the same data sets as in Figure 2, but using a 9-nearest-neighbors classifier rather than a 1-nearest-neighbor classifier. The variance in the boundaries between regions is now smaller. In some sense, the 9-nearest-neighbors classifier captures more of the gist of the two point distributions, and less of the fine detail along the boundary between them. The 9-nearest-neighbors classifier overfits less than the 1-nearest-neighbor classifier.
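The following is a minimal sketch of the k-nearest-neighbors rule just described, with a majority vote for classification and an average for regression; the function names and the choice of squared Euclidean distance are illustrative assumptions rather than the course's prescribed implementation.

```python
import numpy as np
from collections import Counter

def k_nearest_indices(X_train, x, k):
    """Indices of the k training points nearest to x (squared Euclidean distance)."""
    dists = np.sum((X_train - x) ** 2, axis=1)
    return np.argsort(dists)[:k]

def knn_classify(X_train, y_train, x, k=9):
    """Majority label among the k nearest neighbors (odd k avoids ties for two classes)."""
    idx = k_nearest_indices(X_train, x, k)
    return Counter(y_train[i] for i in idx).most_common(1)[0][0]

def knn_regress(X_train, y_train, x, k=9):
    """Average (the median is another option) of the values of the k nearest neighbors."""
    idx = k_nearest_indices(X_train, x, k)
    return float(np.mean(np.asarray(y_train)[idx]))
```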
Figure 4 illustrates k-nearest-neighbors regression in one dimension (d = 1) for the dataset in panel (a) of the figure. The line for k = 1 (orange) overfits dramatically, as it interpolates each data point. The line for k = 100 does quite a bit of smoothing, and reduces overfitting.

Points for Further Exploration

Some of the following questions will be explored later in the course. Others will be covered in some of the homework assignments. The rest are left for you to muse about.

- How to choose the metric for a particular regression or classification task? For instance, if some features (entries of x) are irrelevant but vary widely, they may limit the usefulness of a metric that includes them.

Figure 2: (a) The decision boundary of the nearest-neighbor classifier for a training set T with 200 points, 100 points per class. (b-f) The decision boundary of the nearest-neighbor classifier for five sets of 40 points (20 per class) each, drawn randomly out of T. As a programming alternative (which works for any classifier when d = 2), these regions were approximated by creating a dense grid of points on the plane, classifying each point on the grid, and placing a small square of the proper color at that grid point.
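The grid-based approximation described in the caption of Figure 2 takes only a few lines; the sketch below is an illustrative version (not the exact code behind the figures) and assumes a classifier function classify(x) that maps a 2D point to an integer class label, such as one of the nearest-neighbor predictors sketched above.

```python
import numpy as np
import matplotlib.pyplot as plt

def plot_decision_regions(classify, x_range, y_range, steps=200):
    """Approximate decision regions by classifying a dense grid of points.

    classify: function mapping a length-2 point to an integer class label.
    x_range, y_range: (low, high) extents of the plotting window.
    """
    xs = np.linspace(*x_range, steps)
    ys = np.linspace(*y_range, steps)
    grid_x, grid_y = np.meshgrid(xs, ys)
    labels = np.array([classify(np.array([px, py]))
                       for px, py in zip(grid_x.ravel(), grid_y.ravel())])
    # Each grid point is drawn as a small square colored by its predicted class.
    plt.pcolormesh(grid_x, grid_y, labels.reshape(grid_x.shape), shading="auto")
    plt.show()
```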

Figure 3: (a-f) The approximate decision regions of the 9-nearest-neighbors classifier for the same training sets as in Figure 2.

Figure 4: (a) The horizontal axis denotes the gross living area in square feet of a set of 1379 homes in Ames, Iowa. The vertical axis is the sale price in thousands of dollars [1]. (b) An example of k-nearest-neighbors regression on the data in (a). The three lines are the k-nearest-neighbors regression plots for k = 1, 10, 100.

- Will the distance δ be informative in many dimensions, given our discussion on distances when d is high? This point is related to the previous one.

- How to choose k?

- Is it fruitful to think of weighted averaging in a k-nearest-neighbors regressor? Weights might help modulate the trade-off between accuracy (small k) and generalization (large k).

- Can a k-nearest-neighbors predictor be viewed as the solution of a loss minimization problem? What is the loss in that case?

- What applications does k-nearest-neighbors prediction suit best? Training is easy, and can be done incrementally (just add new data to T), but inference is potentially expensive.

References

[1] D. De Cock. Ames, Iowa: Alternative to the Boston housing data as an end of semester regression project. Journal of Statistics Education, 19(3), 2011.