Going nonparametric: Nearest neighbor methods for regression and classification

Going nonparametric: Nearest neighbor methods for regression and classification. STAT/CSE 416: Machine Learning. Emily Fox, University of Washington. May 3, 2018. Locality sensitive hashing for approximate NN search (cont'd)

Complexity of brute-force search. Given a query point, scan through each point: O(N) distance computations per 1-NN query, and O(N log k) per k-NN query. What if N is huge? (And there are many queries?)

Moving away from exact NN search: approximate neighbor finding. Don't find the exact neighbor; that's okay for many applications. Out of millions of articles, do we need the closest article, or just one that's pretty similar? Do we even fully trust our measure of similarity? Focus on methods that provide good probabilistic guarantees on the approximation.
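
To make the brute-force baseline concrete, here is a minimal NumPy sketch (the function and variable names are illustrative, not from the course materials): it scans all N points per query, which is exactly the O(N) cost described above.

import numpy as np

def brute_force_knn(X, x_q, k=1):
    """Return indices of the k nearest rows of X to query x_q.

    Scans every point: O(N) distance computations per query. The slide's
    O(N log k) comes from maintaining a size-k priority queue; the full
    sort used here is O(N log N) and kept only for clarity.
    """
    dists = np.linalg.norm(X - x_q, axis=1)   # Euclidean distance to each point
    return np.argsort(dists)[:k]              # indices of the k closest points

# Example: 1-NN and 3-NN queries over random data
X = np.random.rand(1000, 2)
x_q = np.array([0.5, 0.5])
print(brute_force_knn(X, x_q, k=1))
print(brute_force_knn(X, x_q, k=3))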

Using score for NN search. Define a score line in the (#awesome, #awful) plane, 1.0 #awesome - 1.5 #awful = 0, and bin points by the sign of their score:

2D Data     | Sign(Score) | Bin index
x1 = [0, 5] | -1          | 0
x2 = [1, 3] | -1          | 0
x3 = [3, 0] | +1          | 1

Only search the Score(x) < 0 half-plane for queries with Score(x) < 0, and the Score(x) > 0 half-plane for queries with Score(x) > 0: the candidate neighbors for a query point x with Score(x) < 0 are just the points in bin 0.

HASH TABLE: bin 0 stores the list of datapoint indices {1, 2, 4, 7, ...}; bin 1 stores {3, 5, 6, 8, ...}. Search for the NN amongst only this set.
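
A minimal sketch of this single-score binning on the toy data above (the line coefficients follow the reconstructed slide, and the query point is made up): each point's bin is just the sign of its score, and a query only searches the list stored for its own bin.

import numpy as np

# Toy 2D data: columns are (#awesome, #awful)
X = np.array([[0, 5], [1, 3], [3, 0]])
v = np.array([1.0, -1.5])          # line: 1.0*#awesome - 1.5*#awful = 0

scores = X @ v                     # Score(x) for each point
bins = (scores > 0).astype(int)    # bin index: 0 if Score(x) < 0, 1 otherwise
print(bins)                        # -> [0 0 1], matching the table above

# Hash table: bin index -> list of datapoint indices
table = {}
for i, b in enumerate(bins):
    table.setdefault(b, []).append(i)

x_q = np.array([2, 1])             # hypothetical query: only search bin(x_q)
candidates = table[int(x_q @ v > 0)]
print(candidates)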

Three potential issues with this simple approach: 1. Challenging to find a good line. 2. Poor quality solution: points close together get split into separate bins. 3. Large computational cost: bins might contain many points, so we are still searching over a large set for each NN query.

How to define the line? Crazy idea: define the line randomly! [Plot: the data in the (#awesome, #awful) plane split by a line.]

How bad can a random line be? Goal: if x and y are close (according to cosine similarity), we want their binned values to be the same. [Plot: points x and y at angle θ_xy in the (#awesome, #awful) plane, showing the regions where the bins are the same vs. different under a random line.] If θ_xy is small (x and y close), they are unlikely to be placed into separate bins.

Three potential issues with the simple approach (revisited): 1. Challenging to find a good line. 2. Poor quality solution: points close together get split into separate bins. 3. Large computational cost: bins might contain many points, so we are still searching over a large set for each NN query. (Hash table: bin 0 stores datapoint indices {1, 2, 4, 7, ...}; bin 1 stores {3, 5, 6, 8, ...}.)

Improving efficiency: reducing the number of points examined per query. Reduce search cost through more bins. [Plot: three lines (Line 1, Line 2, Line 3) partition the (#awesome, #awful) plane into regions with 3-bit bin indices such as [0 0 0], [0 1 0], [1 1 0], and [1 1 1].]

Using score for NN search with three lines:

2D Data     | Sign(Score 1) | Bin 1 index | Sign(Score 2) | Bin 2 index | Sign(Score 3) | Bin 3 index
x1 = [0, 5] | -1            | 0           | -1            | 0           | -1            | 0
x2 = [1, 3] | -1            | 0           | -1            | 0           | -1            | 0
x3 = [3, 0] | +1            | 1           | +1            | 1           | +1            | 1

HASH TABLE: Bin [0 0 0] = 0 holds data indices {1, 2}; [0 0 1] = 1 holds --; [0 1 0] = 2 holds {4, 8, 11}; [0 1 1] = 3 holds --; [1 0 0] = 4 holds --; [1 0 1] = 5 holds --; [1 1 0] = 6 holds {7, 9, 10}; [1 1 1] = 7 holds {3, 5, 6}. Search for the NN amongst only the set in the query's bin.

Improving search quality by searching neighboring bins: the query point lands in some bin, but is its NN in that bin? Not necessarily; this can even be worse than before, since each line can split a point from its true nearest neighbor. We are sacrificing accuracy for speed. [Plot: the three-line partition of the (#awesome, #awful) plane with its bin indices.]

Improving search quality by searching neighboring bins. Next closest bins: flip 1 bit of the query's bin index. Then search further bins: flip 2 bits. [Same hash table as above: bins [0 0 0] through [1 1 1] holding data indices {1, 2}, --, {4, 8, 11}, --, --, --, {7, 9, 10}, {3, 5, 6}; plot of the three-line partition with bin indices.]
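
A sketch of generating the neighboring bins to search, under the assumption that bin indices are represented as tuples of h bits (the function name is made up): flipping 1 bit gives the next-closest bins, flipping 2 bits the further ones, and so on.

from itertools import combinations

def neighboring_bins(bin_index, num_flips):
    """All bin indices obtained by flipping exactly `num_flips` bits."""
    h = len(bin_index)
    out = []
    for positions in combinations(range(h), num_flips):
        flipped = list(bin_index)
        for p in positions:
            flipped[p] = 1 - flipped[p]
        out.append(tuple(flipped))
    return out

print(neighboring_bins((0, 1, 0), 1))  # flip 1 bit: [(1,1,0), (0,0,0), (0,1,1)]
print(neighboring_bins((0, 1, 0), 2))  # flip 2 bits: the further bins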

Improving search quality by searching neighboring bins: the quality of the retrieved NN can only improve as more bins are searched. Algorithm: continue searching until the computational budget is reached or the quality of the NN is good enough.

LSH recap (a competitor data structure to KD-trees): Draw h random lines. Compute the score of each point under each line and translate it to a binary index. Use the h-bit binary vector per data point as its bin index. Create a hash table. For each query point x, search bin(x), then neighboring bins, until the time limit.
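
Putting the recap together, a compact sketch of the whole pipeline under a few assumptions not spelled out in the slides (random hyperplanes through the origin, Euclidean brute force within the visited bins, illustrative class and parameter names):

import numpy as np
from itertools import combinations

class SimpleLSH:
    def __init__(self, X, h, seed=0):
        rng = np.random.default_rng(seed)
        self.X = X
        self.planes = rng.normal(size=(X.shape[1], h))   # h random hyperplanes
        self.table = {}
        for i, key in enumerate(self._keys(X)):
            self.table.setdefault(key, []).append(i)

    def _keys(self, X):
        bits = (X @ self.planes > 0).astype(int)         # h-bit bin index per point
        return [tuple(row) for row in bits]

    def query(self, x_q, max_flips=1):
        key = self._keys(x_q[None, :])[0]
        candidates = list(self.table.get(key, []))
        # Search neighboring bins: flip up to max_flips bits of the query's bin index
        for r in range(1, max_flips + 1):
            for pos in combinations(range(len(key)), r):
                nb = tuple(b ^ 1 if i in pos else b for i, b in enumerate(key))
                candidates += self.table.get(nb, [])
        if not candidates:
            return None
        # Brute force only within the visited bins: approximate 1-NN
        dists = np.linalg.norm(self.X[candidates] - x_q, axis=1)
        return candidates[int(np.argmin(dists))]

X = np.random.rand(10000, 3)
lsh = SimpleLSH(X, h=8)
print(lsh.query(np.array([0.2, 0.7, 0.1]), max_flips=2))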

Moving to higher dimensions d: draw random planes. [Figure: a random plane in the space with axes x[1] = #awesome, x[2] = #awful, x[3] = #great.] Score(x) = v1 #awesome + v2 #awful + v3 #great.

Cost of binning points in d dimensions. With Score(x) = v1 #awesome + v2 #awful + v3 #great, each data point needs d multiplies to determine its bin index per plane. This is a one-time cost, offset if there are many queries on a fixed dataset.

Summary for retrieval using nearest neighbors and locality sensitive hashing

What you can do now:
- Implement nearest neighbor search for retrieval tasks
- Contrast document representations (e.g., raw word counts, tf-idf, ...); emphasize important words using tf-idf
- Contrast methods for measuring similarity between two documents: Euclidean vs. weighted Euclidean; cosine similarity vs. similarity via the unnormalized inner product
- Describe the complexity of brute force search
- Implement LSH for approximate nearest neighbor search

Motivating NN methods for regression: Fit globally vs. fit locally

Parametric models of f(x). [Plots: price ($) y vs. sq.ft. x, with parametric fits of increasing complexity.]

f(x) is not really a polynomial. [Plot: price ($) vs. sq.ft. with local structure a single polynomial cannot capture.] What alternative do we have? If we want to allow flexibility in f(x) having local structure, and don't want to infer "structural breaks," what's a simple option we have, assuming we have plenty of data?

Simplest approach: nearest neighbor regression. Fit locally to each data point: the predicted value is the y_i of the closest datapoint. This is 1-nearest neighbor (1-NN) regression. [Plot: price ($) vs. sq.ft., with each region of the x-axis annotated "Here, this is the closest datapoint."]

What people do naturally: a real estate agent assesses value by finding the sale of the most similar house ($ = ??? estimated from a comparable that sold for $ = 850k).

1-NN regression more formally. Dataset of (house, $) pairs: (x1, y1), (x2, y2), ..., (xN, yN). Query point: x_q. 1. Find the closest x_i in the dataset. 2. Predict ŷ_q = the y value of that nearest neighbor. [Plot: price ($) vs. sq.ft.]

Visualizing 1-NN in multiple dimensions: the Voronoi tessellation (or diagram) divides the space into N regions, each containing one datapoint, defined such that any x in a region is closer to that region's datapoint than to any other. We don't explicitly form it!

Different distance metrics lead to different predictive surfaces (e.g., Euclidean distance vs. Manhattan distance).

Can 1-NN be used for classification? Yes! Just predict the class of the nearest neighbor.

1-NN algorithm

Performing 1-NN search. Query house: x_q. Dataset: houses x_1, ..., x_N. Specify: distance metric. Output: most similar house.

1-NN algorithm:
Initialize Dist2NN = ∞, closest house = Ø
For i = 1, 2, ..., N:
  Compute δ = distance(x_i, x_q)
  If δ < Dist2NN: set closest house = x_i and Dist2NN = δ
Return the most similar house (the closest house to the query house)
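
A direct translation of the 1-NN algorithm above into runnable Python (a sketch: Euclidean distance is assumed, but any metric from the "Specify: distance metric" step could be swapped in; the house features are made up):

import math

def one_nn(dataset, query, distance=lambda a, b: math.dist(a, b)):
    """Return (index, house) of the closest house to the query."""
    best_dist = float("inf")    # Dist2NN = infinity
    best_i = None               # closest house = empty
    for i, x_i in enumerate(dataset):
        delta = distance(x_i, query)
        if delta < best_dist:   # found a closer house
            best_i, best_dist = i, delta
    return best_i, dataset[best_i]

houses = [(1500, 3), (2640, 4), (980, 2)]   # (sq.ft., #bedrooms), illustrative
print(one_nn(houses, (2500, 4)))            # -> (1, (2640, 4))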

1-NN in practice. [Plots: Nearest Neighbors Kernel (K = 1) fits on synthetic 1-D data.] The fit looks good for data dense in x with low noise, but it is sensitive to regions with little data and not great at interpolating over large regions.

1-NN is also sensitive to noise in the data. [Plot: Nearest Neighbors Kernel (K = 1) on noisy data.] Fits can look quite wild. Overfitting?

k-nearest neighbors

Get more comps: you get a more reliable estimate if you base it on a larger set of comparable homes ($ = ??? estimated from comps at $ = 850k, $ = 749k, $ = 833k, $ = 901k).

k-NN regression more formally. Dataset of (house, $) pairs: (x1, y1), (x2, y2), ..., (xN, yN). Query point: x_q. 1. Find the k closest x_i in the dataset (the nearest neighbors NN1, ..., NNk). 2. Predict the average of their values, ŷ_q = (1/k) Σ_{j=1}^{k} y_NNj.

Performing k-nn search Query house: Dataset: Specify: Distance metric Output: Most similar houses k-nn algorithm Initialize Dist2kNN = sort(δ,,δ k ) = sort(,, ) k For i=k+,,n Compute: δ = distance( i, q ) If δ < Dist2kNN[k] find j such that δ > Dist2kNN[j-] but δ < Dist2kNN[j] remove furthest house and shift queue: [j+:k] = [j:k-] Dist2kNN[j+:k] = Dist2kNN[j:k-] set Dist2kNN[j] = δ and [j] = Return k most similar houses sort first k houses by distance to query house list of sorted distances query house list of sorted houses i closest houses to query house 24

k-nn in practice Nearest Neighbors Kernel (K = 30).5 0.5 f(x0) Much more reasonable fit in the presence of noise 0 0.5 Boundary & sparse region issues 49 0 0. 0.2 0.3 0.4 x0 0.6 0.7 0.8 0.9 k-nn in practice Nearest Neighbors Kernel (K = 30).5 0.5 f(x0) Discontinuities! Neighbor either in or out 0 0.5 50 0 0. 0.2 0.3 0.4 x0 0.6 0.7 0.8 0.9 25

Issues with discontinuities: the overall predictive accuracy might be okay, but... For example, in the housing application: if you are a buyer or seller, this matters; there can be a jump in the estimated value of a house going just from 2640 sq.ft. to 2641 sq.ft., and we don't really believe this type of fit.

Weighted k-nearest neighbors

Weighted k-nn Weigh more similar houses more than those less similar in list of k-nn Predict: weights on NN ŷ q = c qnn y NN + c qnn2 y NN2 + c qnn3 y NN3 + + c qnnk y NNk kx j= c qnnj 53 How to define weights? Want weight c qnnj to be small when distance(x NNj,x q ) large and c qnnj to be large when distance(x NNj,x q ) small 54 27

Kernel weights for d = 1. Define c_q,NNj = Kernel_λ(|x_NNj - x_q|). A simple isotropic case, the Gaussian kernel: Kernel_λ(|x_i - x_q|) = exp(-(x_i - x_q)^2 / λ). Note: it is never exactly 0! [Plot: kernel shape over [-λ, λ].]

Kernel weights for d > 1. Define c_q,NNj = Kernel_λ(distance(x_NNj, x_q)).

Kernel regression

Weighted k-NN (recap): weigh the more similar houses more than the less similar ones in the list of k-NN, predicting ŷ_q = ( Σ_{j=1}^{k} c_q,NNj y_NNj ) / ( Σ_{j=1}^{k} c_q,NNj ).

Kernel regression: instead of weighting only the nearest neighbors, weight all points. Predict with a weight on each datapoint, the Nadaraya-Watson kernel weighted average:

ŷ_q = ( Σ_{i=1}^{N} c_qi y_i ) / ( Σ_{i=1}^{N} c_qi ) = ( Σ_{i=1}^{N} Kernel_λ(distance(x_i, x_q)) y_i ) / ( Σ_{i=1}^{N} Kernel_λ(distance(x_i, x_q)) )

Kernel regression in practice. [Plot: Epanechnikov Kernel (λ = 0.2).] The kernel has bounded support, so only a subset of the data is needed to compute each local fit.
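
A compact sketch of Nadaraya-Watson kernel regression using the Gaussian kernel defined earlier (the 1-D setup, names, and synthetic data are illustrative):

import numpy as np

def gaussian_kernel(dist, lam):
    # Kernel_lambda(|x_i - x_q|) = exp(-(x_i - x_q)^2 / lambda); never exactly 0
    return np.exp(-dist**2 / lam)

def kernel_regression(x_train, y_train, x_query, lam, kernel=gaussian_kernel):
    """Nadaraya-Watson: weighted average over ALL points, weights from the kernel."""
    c = kernel(np.abs(x_train - x_query), lam)   # c_qi for every datapoint
    return np.sum(c * y_train) / np.sum(c)

x = np.linspace(0, 1, 100)
y = np.sin(4 * x) + 0.1 * np.random.randn(100)
print(kernel_regression(x, y, x_query=0.5, lam=0.04))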

Choice of bandwidth λ. Often, the choice of kernel matters much less than the choice of λ. [Plots: Epanechnikov kernel fits with λ = 0.04, 0.2, and 0.4, plus a Boxcar kernel fit with λ = 0.2.]

Choosing λ (or k in k-NN): how to choose? Same story as always: cross validation.
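
A leave-one-out cross-validation sketch for choosing λ, assuming the kernel_regression function and the x, y data from the previous sketch are in scope (the candidate grid is arbitrary):

import numpy as np

def loo_cv_error(x, y, lam):
    """Mean squared leave-one-out error of kernel regression at bandwidth lam."""
    errs = []
    for i in range(len(x)):
        mask = np.arange(len(x)) != i                      # hold out point i
        pred = kernel_regression(x[mask], y[mask], x[i], lam)
        errs.append((pred - y[i]) ** 2)
    return np.mean(errs)

candidates = [0.01, 0.04, 0.2, 0.4]
best_lam = min(candidates, key=lambda lam: loo_cv_error(x, y, lam))
print(best_lam)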

Formalizing the idea of local fits. Contrasting with a global average: a globally constant fit weights all points equally (equal weight on each datapoint):

ŷ_q = (1/N) Σ_{i=1}^{N} y_i = ( Σ_{i=1}^{N} c y_i ) / ( Σ_{i=1}^{N} c )

[Plot: Boxcar Kernel with a bandwidth spanning the whole data range, giving a constant fit.]

Contrasting with a global average: kernel regression leads to a locally constant fit, slowly adding in some points and letting others gradually die off:

ŷ_q = ( Σ_{i=1}^{N} Kernel_λ(distance(x_i, x_q)) y_i ) / ( Σ_{i=1}^{N} Kernel_λ(distance(x_i, x_q)) )

[Plots: Boxcar Kernel (λ = 0.2) and Epanechnikov Kernel (λ = 0.2).]

Local linear regression: so far, we have been fitting a constant function locally at each point → locally weighted averages. We can instead fit a line or polynomial locally at each point → locally weighted linear regression.
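
A sketch of locally weighted linear regression at a single query point: solve a weighted least squares problem with kernel weights centered on x_q (1-D case with Gaussian weights; all names and data are illustrative):

import numpy as np

def local_linear_fit(x_train, y_train, x_q, lam):
    """Fit y ≈ w0 + w1*x by weighted least squares, with weights centered at x_q."""
    w = np.exp(-(x_train - x_q) ** 2 / lam)          # kernel weights
    A = np.column_stack([np.ones_like(x_train), x_train])
    W = np.diag(w)
    # Solve the normal equations (A^T W A) beta = A^T W y
    beta = np.linalg.solve(A.T @ W @ A, A.T @ W @ y_train)
    return beta[0] + beta[1] * x_q                   # prediction at the query point

x = np.linspace(0, 1, 200)
y = np.sin(4 * x) + 0.1 * np.random.randn(200)
print(local_linear_fit(x, y, x_q=0.0, lam=0.04))     # better behaved at the boundary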

Local regression rules of thumb: a local linear fit reduces bias at the boundaries with a minimal increase in variance; a local quadratic fit doesn't help at the boundaries and increases variance, but does help capture curvature in the interior; with sufficient data, local polynomials of odd degree dominate those of even degree. Recommended default choice: local linear regression.

Discussion on k-NN and kernel regression

Nonparametric approaches. k-NN and kernel regression are examples of nonparametric regression. General goals of nonparametrics: flexibility, making few assumptions about f(x), and complexity that can grow with the number of observations N. There are lots of other choices: splines, trees, locally weighted structured regression models, ...

Limiting behavior of NN, noiseless setting (ε_i = 0): in the limit of getting an infinite amount of noiseless data, the MSE of the 1-NN fit goes to 0.

Limiting behavior of NN, noiseless setting (ε_i = 0), continued. [Plots: 1-NN fit vs. quadratic fit as the amount of data grows.] This is not true for parametric models!

Error vs. amount of data. [Plot: error as a function of the # of data points in the training set.]

Limiting behavior of NN, noisy data setting: in the limit of getting an infinite amount of data, the MSE of the NN fit goes to 0 if k grows, too. [Plots: 1-NN fit, 200-NN fit, and quadratic fit.]

NN and kernel methods for large d or small N. NN and kernel methods work well when the data cover the space, but the more dimensions d you have, the more points N you need to cover the space: you need N = O(exp(d)) data points for good performance. This is where parametric models become useful.

Complexity of NN search. Naïve approach: brute force search. Given a query point x_q, scan through each point x_1, x_2, ..., x_N: O(N) distance computations per 1-NN query, O(N log k) per k-NN query. What if N is huge? (And there are many queries?) KD-trees! Locality-sensitive hashing, etc.

k-NN for classification

Spam filtering example. Input x: the text of the email, sender, IP, ... Output y: spam or not spam.

Using k-NN for classification: in the space of labeled emails (not spam vs. spam), organized by similarity of text, decide whether a query email is not spam vs. spam via a majority vote of its k-NN (see the sketch below).
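
A sketch of k-NN classification by majority vote, assuming the emails have already been turned into numeric feature vectors (e.g., tf-idf); the 2-D features and labels below are made up purely for illustration:

import numpy as np
from collections import Counter

def knn_classify(X, labels, x_q, k):
    """Majority vote over the labels of the k nearest training points."""
    dists = np.linalg.norm(X - x_q, axis=1)
    nearest = np.argsort(dists)[:k]
    votes = Counter(labels[i] for i in nearest)
    return votes.most_common(1)[0][0]

# Illustrative 2-D features for labeled emails
X = np.array([[0.9, 0.1], [0.8, 0.2], [0.1, 0.9], [0.2, 0.8], [0.15, 0.85]])
labels = ["spam", "spam", "not spam", "not spam", "not spam"]
print(knn_classify(X, labels, x_q=np.array([0.3, 0.7]), k=3))  # -> "not spam"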

Using k-nn for classification Space of labeled emails (not spam vs. spam), organized by similarity of text query email not spam vs. spam: decide via majority vote of k-nn 79 Summary for nearest neighbor and kernel regression 40

What you can do now:
- Motivate the use of nearest neighbor (NN) regression
- Perform k-NN regression
- Analyze the computational costs of these algorithms
- Discuss the sensitivity of NN to lack of data, dimensionality, and noise
- Perform weighted k-NN and define weights using a kernel
- Define and implement kernel regression
- Describe the effect of varying the kernel bandwidth λ or the # of nearest neighbors k
- Select λ or k using cross validation
- Compare and contrast kernel regression with a global average fit
- Define what makes an approach nonparametric and why NN and kernel regression are considered nonparametric methods
- Analyze the limiting behavior of NN regression
- Use NN for classification