Going nonparametric: Nearest neighbor methods for regression and classification


Going nonparametric: Nearest neighbor methods for regression and classification. STAT/CSE 416: Machine Learning, Emily Fox, University of Washington, May 8, 2018.

Locality sensitive hashing for approximate NN search (cont'd)

Complexity of brute-force search. Given a query point, scan through each point: O(N) distance computations per 1-NN query, and O(N log k) per k-NN query. What if N is huge (and there are many queries)?

Moving away from exact NN search. Approximate neighbor finding: don't find the exact neighbor, but that's okay for many applications. Out of millions of articles, do we need the closest article, or just one that's pretty similar? Do we even fully trust our measure of similarity? Focus on methods that provide good probabilistic guarantees on the approximation.
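As a baseline for the costs above, here is a minimal brute-force 1-NN sketch (my own illustration, assuming the data sit in a NumPy array with one point per row); it performs exactly the O(N) distance computations the slide describes.

```python
import numpy as np

def brute_force_1nn(query, data):
    """Return the index of the row of `data` closest to `query` (Euclidean distance)."""
    dists = np.linalg.norm(data - query, axis=1)  # N distance computations
    return int(np.argmin(dists))

# Usage: data = np.random.rand(1000, 2); brute_force_1nn(np.array([0.5, 0.5]), data)
```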

Using score for NN search. Consider 2D data in the (#awesome, #awful) plane, partitioned by the line 1.0 #awesome - 1.5 #awful = 0:

2D Data       Sign(Score)   Bin index
x1 = [0, 5]   -             0
x2 = [1, 3]   -             0
x3 = [3, 0]   +             1

Only search the bin-0 region for queries with Score(x) < 0, and the bin-1 region for queries with Score(x) > 0: the candidate neighbors of a query point x are the datapoints falling on the same side of the line.

Using score for NN search. Store this in a HASH TABLE keyed by bin index, where each bin lists the indices of its datapoints, e.g. bin 0: {1,2,4,7,...}, bin 1: {3,5,6,8,...}. For a query with Score(x) < 0, search for the NN amongst the bin-0 set.
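A small sketch of this binning scheme (my own code, not the course's; the coefficients mirror the slide's line 1.0 #awesome - 1.5 #awful = 0):

```python
import numpy as np
from collections import defaultdict

data = np.array([[0, 5], [1, 3], [3, 0]], dtype=float)  # rows: (#awesome, #awful) counts
v = np.array([1.0, -1.5])                               # normal vector of the partition line

bins = defaultdict(list)                                # hash table: bin index -> datapoint indices
for i, x in enumerate(data, start=1):
    score = x @ v
    bins[int(score > 0)].append(i)                      # bin 0 if Score(x) < 0, bin 1 if Score(x) > 0

# For a query, only search the candidates in its own bin.
query = np.array([4.0, 1.0])
candidates = bins[int(query @ v > 0)]
```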

Three potential issues with the simple approach: 1. Challenging to find a good line. 2. Poor quality solution: points close together get split into separate bins. 3. Large computational cost: bins might contain many points, so we are still searching over a large set for each NN query.

How to define the line? Crazy idea: define the line randomly! [Figure: a random line through the origin in the (#awesome, #awful) plane.]

How bad can a random line be? Goal: if x and y are close (according to cosine similarity), we want their binned values to be the same. [Figure: points x and y at angle θ_xy; random lines falling outside the angle between them leave the bins the same, lines falling inside make the bins different.] If θ_xy is small (x, y close), they are unlikely to be placed into separate bins.

Three potential issues with the simple approach (revisited): 1. Challenging to find a good line. 2. Poor quality solution: points close together get split into separate bins. 3. Large computational cost: bins might contain many points (e.g. bin 0: {1,2,4,7,...}, bin 1: {3,5,6,8,...}), so we are still searching over a large set for each NN query.
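This can be quantified with the standard random-hyperplane argument (not spelled out on the slide, but consistent with it): a random line through the origin separates x and y exactly when it falls within the angle θ_xy between them, so

P(bins differ) = θ_xy / π,    P(bins agree) = 1 - θ_xy / π,

which means the bins agree with probability close to 1 when θ_xy is small.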

Improving efficiency: reducing the number of points examined per query. Reduce the search cost through more bins: draw several lines (Line 1, Line 2, Line 3) and label each region of the (#awesome, #awful) plane with the bit pattern of signs it receives under the three lines (e.g. [0 0 0], [0 1 0], ...). [Figure: the plane partitioned by three lines into bit-labeled regions.]

Using score for NN search. With three lines, each datapoint gets a 3-bit bin index:

2D Data       Sign(Score1)  Bin 1 index   Sign(Score2)  Bin 2 index   Sign(Score3)  Bin 3 index
x1 = [0, 5]   -1            0             -1            0             -1            0
x2 = [1, 3]   -1            0             -1            0             -1            0
x3 = [3, 0]   +1            1             +1            1             +1            1

HASH TABLE of bins and the data indices they contain:

Bin [0 0 0] = 0: {1,2}
Bin [0 0 1] = 1: --
Bin [0 1 0] = 2: {4,8,...}
Bin [0 1 1] = 3: --
Bin [1 0 0] = 4: --
Bin [1 0 1] = 5: --
Bin [1 1 0] = 6: {7,9,...}
Bin [1 1 1] = 7: {3,5,6}

For a query landing in bin [0 0 0], search for the NN amongst that set only.

Improving search quality by searching neighboring bins. The query point lands in some bin, but is its NN in that same bin? Not necessarily; it can even be worse than before, since each line can split close points into different bins. We are sacrificing accuracy for speed. [Figure: the three-line partition of the (#awesome, #awful) plane with its bit-labeled bins.]
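A sketch of the h-bit binning and hash-table construction (my own code, with randomly drawn lines and synthetic data rather than the slide's specific example):

```python
import numpy as np
from collections import defaultdict

rng = np.random.default_rng(0)
data = rng.random((12, 2))           # 12 points in the (#awesome, #awful) plane
h = 3
lines = rng.standard_normal((h, 2))  # h random lines through the origin

def bin_index(x, lines):
    """h-bit tuple of sign bits, one per line: 1 if the score is positive."""
    return tuple(int(b) for b in (lines @ x > 0))

table = defaultdict(list)            # hash table: h-bit bin index -> datapoint indices
for i, x in enumerate(data):
    table[bin_index(x, lines)].append(i)

# For a query, first search only the datapoints sharing its bin index.
query = rng.random(2)
candidates = table.get(bin_index(query, lines), [])
```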

Improving search quality by searching neighboring bins. Using the same hash table as above: after searching the query's own bin, search the next closest bins, obtained by flipping one bit of the bin index; then search the further bins obtained by flipping 2 bits, and so on. [Figure: the three-line partition, highlighting the query's bin and its neighboring bins.]
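A small helper (my own sketch) for enumerating the neighboring bins of an h-bit bin index, in order of how many bits are flipped:

```python
from itertools import combinations

def neighboring_bins(bits, num_flips):
    """Yield every bin index obtained by flipping exactly `num_flips` bits of `bits`."""
    for positions in combinations(range(len(bits)), num_flips):
        flipped = list(bits)
        for p in positions:
            flipped[p] = 1 - flipped[p]
        yield tuple(flipped)

# Usage: list(neighboring_bins((0, 0, 0), 1)) -> [(1, 0, 0), (0, 1, 0), (0, 0, 1)]
```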

Improving search quality by searching neighboring bins. The quality of the retrieved NN can only improve as more bins are searched. Algorithm: continue searching until the computational budget is reached or the quality of the NN is good enough.

LSH recap (a competitor data structure to kd-trees):
- Draw h random lines
- Compute the score of each point under each line and translate it to a binary index
- Use the h-bit binary vector per data point as its bin index
- Create a hash table
- For each query point x, search bin(x), then neighboring bins, until the time limit is reached
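Putting the recap together, here is a compact end-to-end sketch (my own code, under assumptions the slides leave open: NumPy data, lines through the origin, Euclidean distance for the final comparison, and a cap on bit flips instead of a wall-clock time limit):

```python
import numpy as np
from collections import defaultdict
from itertools import combinations

class SimpleLSH:
    def __init__(self, data, h=3, seed=0):
        rng = np.random.default_rng(seed)
        self.data = data
        self.lines = rng.standard_normal((h, data.shape[1]))   # draw h random lines
        self.table = defaultdict(list)                         # hash table: bin -> indices
        for i, x in enumerate(data):
            self.table[self._bin(x)].append(i)

    def _bin(self, x):
        """h-bit bin index from the signs of the h scores."""
        return tuple(int(b) for b in (self.lines @ x > 0))

    def query(self, q, max_flips=1):
        """Search bin(q), then bins within `max_flips` bit flips; return best candidate index."""
        b = self._bin(q)
        candidates = list(self.table.get(b, []))
        for flips in range(1, max_flips + 1):
            for pos in combinations(range(len(b)), flips):
                nb = tuple(bit ^ int(i in pos) for i, bit in enumerate(b))
                candidates += self.table.get(nb, [])
        if not candidates:
            return None
        dists = np.linalg.norm(self.data[candidates] - q, axis=1)
        return candidates[int(np.argmin(dists))]
```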

Moving to higher dimensions d. Draw random planes instead of random lines. [Figure: a plane in the 3D space with axes x[1] = #awesome, x[2] = #awful, x[3] = #great.] Score(x) = v1 #awesome + v2 #awful + v3 #great.

Cost of binning points in d dimensions. With Score(x) = v1 #awesome + v2 #awful + v3 #great, each data point needs d multiplies per plane to determine its bin index. This is a one-time cost, which is offset if there are many queries against a fixed dataset.

Summary for retrieval using nearest neighbors and locality sensitive hashing
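For concreteness, the one-time binning cost for N points, d dimensions, and h planes amounts to a single matrix multiply (a sketch with made-up sizes):

```python
import numpy as np

rng = np.random.default_rng(0)
N, d, h = 10_000, 3, 16
data = rng.random((N, d))                 # e.g. counts of #awesome, #awful, #great
planes = rng.standard_normal((d, h))      # one random plane per column

scores = data @ planes                    # N * d * h multiplies, paid once for a fixed dataset
bin_bits = (scores > 0).astype(np.uint8)  # N x h matrix of bin-index bits
```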

What you can do now:
- Implement nearest neighbor search for retrieval tasks
- Contrast document representations (e.g., raw word counts, tf-idf, ...); emphasize important words using tf-idf
- Contrast methods for measuring the similarity between two documents: Euclidean vs. weighted Euclidean distance; cosine similarity vs. similarity via the unnormalized inner product
- Describe the complexity of brute-force search
- Implement LSH for approximate nearest neighbor search

Motivating NN methods for regression: Fit globally vs. fit locally

Parametric models of f(x). [Figures: house price ($) y versus square feet x, with parametric fits of f(x) overlaid on the data.]

f(x) is not really a polynomial. [Figure: price ($) y vs. sq.ft. x.]

What alternative do we have? If we want to allow flexibility in f(x) having local structure, and don't want to infer structural breaks, what's a simple option we have, assuming we have plenty of data?

Simplest approach: Nearest neighbor regression. Fit locally to each data point: the predicted value is the y_i of the closest datapoint, i.e. 1-nearest neighbor (1-NN) regression. [Figure: price ($) vs. sq.ft., with each region of the fit labeled "Here, this is the closest datapoint".]

What people do naturally. A real estate agent assesses value by finding the sale of the most similar house. [Figure: the query house ($ = ???) next to its most similar recent sale and that sale's price.]

1-NN regression more formally. Dataset of (house, $) pairs: (x_1,y_1), (x_2,y_2), ..., (x_N,y_N). Query point: x_q.
1. Find the closest x_i in the dataset (the nearest neighbor).
2. Predict ŷ_q as that neighbor's value y_i.
[Figure: price ($) vs. sq.ft., as before.]

Visualizing 1-NN in multiple dimensions. Voronoi tessellation (or diagram): divide the space into N regions, each containing one datapoint, defined such that any x in a region is closest to that region's datapoint. We don't explicitly form it!

Different distance metrics lead to different predictive surfaces (e.g., Euclidean distance vs. Manhattan distance).

Can 1-NN be used for classification? Yes! Just predict the class of the nearest neighbor.

1-NN algorithm

Performing 1-NN search. Query house: x_q. Dataset: x_1, ..., x_N. Specify: a distance metric. Output: the most similar house.

1-NN algorithm:
Initialize Dist2NN = ∞, closest house = Ø
For i = 1, 2, ..., N:
  compute δ = distance(x_i, x_q)
  if δ < Dist2NN:
    set closest house = x_i
    set Dist2NN = δ
Return the most similar house (the closest house to the query house).
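A direct, runnable rendering of this loop (my own sketch; Euclidean distance stands in for the unspecified metric):

```python
import numpy as np

def one_nn(query, data, distance=lambda a, b: float(np.linalg.norm(a - b))):
    """Return (index, distance) of the datapoint closest to `query`."""
    best_i, dist2nn = None, np.inf
    for i, x in enumerate(data):
        delta = distance(x, query)
        if delta < dist2nn:
            best_i, dist2nn = i, delta
    return best_i, dist2nn
```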

1-NN in practice. [Figures: 1-NN fits, "Nearest Neighbors Kernel (K = 1)", f(x) vs. x.] The fit looks good for data dense in x and low noise, but it is sensitive to regions with little data and not great at interpolating over large regions.

Also sensitive to noise in the data. [Figure: 1-NN fit, "Nearest Neighbors Kernel (K = 1)", f(x) vs. x.] Fits can look quite wild. Overfitting?

k-nearest neighbors

Get more comps. You get a more reliable estimate if you base it on a larger set of comparable homes. [Figure: the query house ($ = ???) surrounded by several comparable sales and their prices.]

k-NN regression more formally. Dataset of (house, $) pairs: (x_1,y_1), (x_2,y_2), ..., (x_N,y_N). Query point: x_q.
1. Find the k closest x_i in the dataset.
2. Predict the average of their values, ŷ_q = (1/k) Σ_{j=1}^k y_NNj.

Performing k-NN search. Query house: x_q. Dataset: x_1, ..., x_N. Specify: a distance metric. Output: the k most similar houses.

k-NN algorithm:
Sort the first k houses by distance to the query house, giving a list of sorted distances Dist2kNN = sort(δ_1, ..., δ_k) and the corresponding list of sorted houses.
For i = k+1, ..., N:
  compute δ = distance(x_i, x_q)
  if δ < Dist2kNN[k]:
    find j such that δ > Dist2kNN[j-1] but δ < Dist2kNN[j]
    remove the furthest house and shift the queue:
      houses[j+1:k] = houses[j:k-1]
      Dist2kNN[j+1:k] = Dist2kNN[j:k-1]
    set Dist2kNN[j] = δ and houses[j] = x_i
Return the k most similar houses (the k closest houses to the query house).
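A runnable sketch in the spirit of this pseudocode (my own code; it keeps a sorted list of the k best distances seen so far, playing the role of the shifted queue):

```python
import numpy as np
from bisect import insort

def knn(query, data, k, distance=lambda a, b: float(np.linalg.norm(a - b))):
    """Return the indices of the k datapoints closest to `query`."""
    best = []                                  # sorted list of (distance, index) pairs
    for i, x in enumerate(data):
        delta = distance(x, query)
        if len(best) < k:
            insort(best, (delta, i))
        elif delta < best[-1][0]:              # closer than the furthest kept house
            best.pop()                         # remove the furthest house
            insort(best, (delta, i))           # insert, keeping the queue sorted
    return [i for _, i in best]
```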

k-nn in practice Nearest Neighbors Kernel (K = 3).5 f(x).5.5 49..2.3.4 x.6.7.8.9 k-nn in practice Nearest Neighbors Kernel (K = 3).5.5 f(x) Much more reasonable fit in the presence of noise.5 Boundary & sparse region issues 5..2.3.4 x.6.7.8.9 25

k-nn in practice Nearest Neighbors Kernel (K = 3).5.5 f(x) Discontinuities! Neighbor either in or out.5 5..2.3.4 x.6.7.8.9 Issues with discontinuities Overall predictive accuracy might be okay, but For example, in housing application: - If you are a buyer or seller, this matters - Can be a jump in estimated value of house going just from 264 sq.ft. to 264 sq.ft. - Don t really believe this type of fit 52 26

Weighted k-nearest neighbors

Weighted k-NN: weigh more-similar houses more heavily than less-similar ones in the list of k NN. Predict, with weights c_qNNj on the nearest neighbors:

ŷ_q = (c_qNN1 y_NN1 + c_qNN2 y_NN2 + c_qNN3 y_NN3 + ... + c_qNNk y_NNk) / Σ_{j=1}^k c_qNNj

How to define the weights? We want the weight c_qNNj to be small when distance(x_NNj, x_q) is large, and large when distance(x_NNj, x_q) is small.

Kernel weights for d = 1. Define c_qNNj = Kernel_λ(|x_NNj - x_q|) (simple isotropic case). Gaussian kernel: Kernel_λ(|x_i - x_q|) = exp(-(x_i - x_q)² / λ). Note: the weight is never exactly 0! [Figure: a kernel plotted on the interval from -λ to λ.]
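As a one-liner for reference (my own naming; `lam` stands in for the bandwidth λ):

```python
import numpy as np

def gaussian_kernel(dist, lam):
    """Kernel_lambda(dist) = exp(-dist**2 / lam); positive for every distance."""
    return np.exp(-dist**2 / lam)
```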

Kernel weights for d > 1. Define: c_qNNj = Kernel_λ(distance(x_NNj, x_q)). [Figure: a kernel plotted on the interval from -λ to λ.]

Kernel regression

Weighted k-nn Weigh more similar houses more than those less similar in list of k-nn Predict: weights on NN ŷ q = c qnn y NN + c qnn2 y NN2 + c qnn3 y NN3 + + c qnnk y NNk kx j= c qnnj 59 Kernel regression Instead of just weighting NN, weight all points Predict: NX weight on each datapoint NX Nadaraya-Watson kernel weighted average ŷ q = i= NX c qi y i c qi i= = i= Kernel λ (distance(x i,x q )) * y i NX Kernel λ (distance(x i,x q )) i= 6 3

Kernel regression in practice. [Figure: kernel regression fit with the Epanechnikov kernel (lambda = 0.2), f(x) vs. x.] The kernel has bounded support, so only a subset of the data is needed to compute each local fit.

Choice of bandwidth λ. Often, the choice of kernel matters much less than the choice of λ. [Figures: Epanechnikov kernel fits for several values of λ, plus a boxcar kernel fit at the same λ as the middle panel.]

Choosing λ (or k in k-NN). How do we choose? Same story as always: cross validation.
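A short sketch of that model-selection loop (my own code, using a single held-out validation set for 1-D k-NN regression; full cross validation repeats the same loop over folds and averages the errors):

```python
import numpy as np

def knn_predict(xq, x_train, y_train, k):
    idx = np.argsort(np.abs(x_train - xq))[:k]      # k nearest 1-D neighbors
    return y_train[idx].mean()

def choose_k(x_train, y_train, x_val, y_val, k_values):
    """Return the k with the lowest mean squared validation error."""
    errors = []
    for k in k_values:
        preds = np.array([knn_predict(xq, x_train, y_train, k) for xq in x_val])
        errors.append(np.mean((preds - y_val) ** 2))
    return k_values[int(np.argmin(errors))]
```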