Instance-Based Learning: Nearest neighbor and kernel regression and classification


Instance-Based Learning: Nearest neighbor and kernel regression and classification
Emily Fox, University of Washington, February 3, 2017

Simplest approach: Nearest neighbor regression

1-nearest neighbor (1-NN) regression
Fit locally to each data point: the predicted value is the y_i of the closest datapoint.
[Figure: price ($) vs. sq.ft., highlighting the closest datapoint to the query]

1-NN regression more formally
Dataset of (sq.ft., $) pairs: (x_1, y_1), (x_2, y_2), ..., (x_N, y_N)
Query point: x_q
1. Find the closest x_i in the dataset: x_NN = argmin_i distance(x_i, x_q)
2. Predict: ŷ_q = y_NN
[Figure: price ($) vs. sq.ft., highlighting the closest datapoint to the query]

Visualizing 1-NN in multiple dimensions
Voronoi tessellation (or diagram):
- Divide the space into N regions, each containing 1 datapoint
- Defined such that any x in a region is closest to that region's datapoint
Don't explicitly form it!

Distance metrics: Defining the notion of "closest"
In 1D, just Euclidean distance: distance(x_j, x_q) = |x_j - x_q|
In multiple dimensions:
- can define many interesting distance functions
- most straightforwardly, might want to weight different dimensions differently

Weighting housing inputs
Some inputs are more relevant than others: # bedrooms, # bathrooms, sq.ft. living, sq.ft. lot, floors, year built, year renovated, waterfront, ...

Scaled Euclidean distance
Formally, this is achieved via
distance(x_j, x_q) = sqrt( a_1 (x_j[1] - x_q[1])^2 + ... + a_d (x_j[d] - x_q[d])^2 )
where a_1, ..., a_d are the weights on each input (defining relative importance).

Other example distance metrics: Mahalanobis, rank-based, correlation-based, cosine similarity, Manhattan, Hamming, ...
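As a minimal sketch (not from the slides; the function name, NumPy usage, and example weights are illustrative assumptions), the scaled Euclidean distance can be written as:

```python
import numpy as np

def scaled_euclidean_distance(x_j, x_q, a):
    """Scaled Euclidean distance: sqrt(sum_k a_k * (x_j[k] - x_q[k])^2)."""
    x_j, x_q, a = np.asarray(x_j, float), np.asarray(x_q, float), np.asarray(a, float)
    return np.sqrt(np.sum(a * (x_j - x_q) ** 2))

# Hypothetical usage: weight sq.ft. living heavily, bedroom/bathroom counts lightly.
# d = scaled_euclidean_distance([3, 2, 1800], [4, 2, 2100], a=[0.1, 0.1, 1.0])
```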

Different distance metrics lead to different predictive surfaces
[Figure: 1-NN predictive surfaces under Euclidean distance vs. Manhattan distance]

Can 1-NN be used for classification? Yes!! Just predict the class of the nearest neighbor.

1-NN algorithm

Performing 1-NN search
Query house: x_q
Dataset: x_1, ..., x_N
Specify: distance metric
Output: the most similar house

1-NN algorithm
Initialize Dist2NN = ∞, closest house = Ø
For i = 1, 2, ..., N:   (x_q is the query house, x_i the i-th house in the dataset)
  Compute: δ = distance(x_i, x_q)
  If δ < Dist2NN:
    set closest house = x_i
    set Dist2NN = δ
Return the most similar house (the closest house to the query house).

1-NN in practice
[Figure: Nearest Neighbors Kernel (K = 1) fit on 1D data]
Fit looks good for data dense in x and low noise.
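Translated roughly into Python, the brute-force scan above looks like the following sketch (function and variable names are my own; `distance` is any metric from the earlier slides):

```python
import numpy as np

def one_nn_search(X, x_q, distance):
    """Brute-force 1-NN: scan every datapoint and keep the one with the smallest distance."""
    dist2nn, closest = np.inf, None
    for i, x_i in enumerate(X):
        delta = distance(x_i, x_q)       # δ = distance(x_i, x_q)
        if delta < dist2nn:
            closest, dist2nn = i, delta  # update the closest house and Dist2NN
    return closest, dist2nn
```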

Sensitive to regions with little data
[Figure: Nearest Neighbors Kernel (K = 1) fit with a sparsely covered region]
Not great at interpolating over large regions.

Also sensitive to noise in data
[Figure: Nearest Neighbors Kernel (K = 1) fit on noisy data]
Fits can look quite wild. Overfitting?

k-nearest neighbors

Get more "comps"
More reliable estimate if you base the estimate on a larger set of comparable homes.
[Figure: query house ($ = ???) alongside comparable homes at $850k, $749k, $833k, $901k]

k-NN regression more formally
Dataset of (sq.ft., $) pairs: (x_1, y_1), (x_2, y_2), ..., (x_N, y_N)
Query point: x_q
1. Find the k closest x_i in the dataset: the nearest neighbors x_NN1, ..., x_NNk
2. Predict: ŷ_q = (y_NN1 + y_NN2 + ... + y_NNk) / k

Performing k-NN search
Query house: x_q
Dataset: x_1, ..., x_N
Specify: distance metric
Output: the k most similar houses

k-NN algorithm
Sort the first k houses by distance to the query house x_q:
  Initialize Dist2kNN = sort(δ_1, ..., δ_k)   (list of sorted distances)
  Initialize houses = sort(x_1, ..., x_k)     (list of houses sorted by distance)
For i = k+1, ..., N:
  Compute: δ = distance(x_i, x_q)
  If δ < Dist2kNN[k]:
    find j such that δ > Dist2kNN[j-1] but δ < Dist2kNN[j]
    remove the furthest house and shift the queue:
      houses[j+1:k] = houses[j:k-1]
      Dist2kNN[j+1:k] = Dist2kNN[j:k-1]
    set Dist2kNN[j] = δ and houses[j] = x_i
Return the k most similar houses (the k closest houses to the query house).

k-NN in practice
[Figure: Nearest Neighbors Kernel (K = 30) fit]
Much more reasonable fit in the presence of noise.
Boundary & sparse region issues.
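A minimal Python sketch of brute-force k-NN search and k-NN regression (it computes all N distances and takes the k smallest rather than maintaining the slide's sorted queue, but returns the same k neighbors; names and NumPy usage are my own):

```python
import numpy as np

def knn_search(X, x_q, k, distance):
    """Brute-force k-NN: compute all N distances and keep the k smallest."""
    dists = np.array([distance(x_i, x_q) for x_i in X])
    nn_idx = np.argsort(dists)[:k]        # indices of the k closest datapoints
    return nn_idx, dists[nn_idx]

def knn_regression_predict(X, y, x_q, k, distance):
    """k-NN regression: average the targets of the k nearest neighbors."""
    nn_idx, _ = knn_search(X, x_q, k, distance)
    return np.asarray(y)[nn_idx].mean()
```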

k-nn in practice Nearest Neighbors Kernel (K = 30) 1.5 1 0.5 f(x0) Discontinuities! Neighbor either in or out 0 0.5 1 23 0 0.1 0.2 0.3 0.4 x0 0.6 0.7 0.8 0.9 1 Issues with discontinuities Overall predictive accuracy might be okay, but For example, in housing application: - If you are a buyer or seller, this matters - Can be a jump in estimated value of house going just from 2640 sq.ft. to 2641 sq.ft. - Don t really believe this type of fit 24 12

Weighted k-nearest neighbors

Weighted k-NN
Weight more similar houses more heavily than less similar ones in the list of k-NN.
Predict (with weights c_qNNj on the nearest neighbors):
ŷ_q = (c_qNN1 y_NN1 + c_qNN2 y_NN2 + c_qNN3 y_NN3 + ... + c_qNNk y_NNk) / Σ_{j=1}^{k} c_qNNj

How to define weights?
Want the weight c_qNNj to be small when distance(x_NNj, x_q) is large, and large when distance(x_NNj, x_q) is small.

Kernel weights for d = 1
Define: c_qNNj = Kernel_λ(|x_NNj - x_q|)   (simple isotropic case)
Gaussian kernel: Kernel_λ(|x_i - x_q|) = exp(-(x_i - x_q)^2 / λ)
Note: never exactly 0!
[Figure: kernel centered at 0, plotted from -λ to λ]
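A minimal sketch of weighted k-NN using the Gaussian kernel above (a sketch under the assumption of a generic `distance` function and NumPy arrays; function names are my own):

```python
import numpy as np

def gaussian_kernel(dist, lam):
    """Gaussian kernel weight from the slide: exp(-dist^2 / lambda)."""
    return np.exp(-dist ** 2 / lam)

def weighted_knn_predict(X, y, x_q, k, lam, distance):
    """Weighted k-NN: kernel-weighted average of the targets of the k nearest neighbors."""
    dists = np.array([distance(x_i, x_q) for x_i in X])
    nn_idx = np.argsort(dists)[:k]
    c = gaussian_kernel(dists[nn_idx], lam)              # weights c_qNNj
    return np.sum(c * np.asarray(y)[nn_idx]) / np.sum(c)
```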

Kernel weights for d ≥ 1
Define: c_qNNj = Kernel_λ(distance(x_NNj, x_q))
[Figure: kernel centered at 0, plotted from -λ to λ]

Kernel regression

Weighted k-NN (recap)
Predict: ŷ_q = (c_qNN1 y_NN1 + c_qNN2 y_NN2 + c_qNN3 y_NN3 + ... + c_qNNk y_NNk) / Σ_{j=1}^{k} c_qNNj

Kernel regression
Instead of weighting only the NN, weight all points.
Predict with the Nadaraya-Watson kernel-weighted average (a weight c_qi on each datapoint):
ŷ_q = Σ_{i=1}^{N} c_qi y_i / Σ_{i=1}^{N} c_qi = Σ_{i=1}^{N} Kernel_λ(distance(x_i, x_q)) y_i / Σ_{i=1}^{N} Kernel_λ(distance(x_i, x_q))
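A minimal sketch of the Nadaraya-Watson estimator above, assuming the Gaussian kernel from the earlier slide (names are my own):

```python
import numpy as np

def kernel_regression_predict(X, y, x_q, lam, distance):
    """Nadaraya-Watson kernel regression: kernel-weighted average over ALL datapoints."""
    dists = np.array([distance(x_i, x_q) for x_i in X])
    c = np.exp(-dists ** 2 / lam)         # c_qi = Kernel_lambda(distance(x_i, x_q))
    return np.sum(c * np.asarray(y)) / np.sum(c)
```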

Kernel regression in practice
[Figure: Epanechnikov Kernel (lambda = 0.2) fit]
The Epanechnikov kernel has bounded support: only a subset of the data is needed to compute each local fit.

Choice of bandwidth λ
Often, the choice of kernel matters much less than the choice of λ.
[Figures: Epanechnikov Kernel fits with lambda = 0.04, 0.2, and 0.4, plus a Boxcar Kernel fit with lambda = 0.2]

Choosing λ (or k in k-NN)
How to choose? Same story as always: cross validation. (A sketch follows below.)
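A hedged sketch of selecting λ by K-fold cross validation (everything here is my own illustration: `predict_fn` stands for any of the predictors sketched above, and NumPy arrays are assumed):

```python
import numpy as np

def choose_lambda_by_cv(X, y, lambdas, n_folds, predict_fn, distance, seed=0):
    """Return the bandwidth with the lowest average validation MSE across folds.
    predict_fn(X_train, y_train, x_q, lam, distance) -> prediction for one query point."""
    idx = np.random.default_rng(seed).permutation(len(X))
    folds = np.array_split(idx, n_folds)
    avg_mse = []
    for lam in lambdas:
        fold_mse = []
        for val_idx in folds:
            train_idx = np.setdiff1d(idx, val_idx)
            preds = np.array([predict_fn(X[train_idx], y[train_idx], x_q, lam, distance)
                              for x_q in X[val_idx]])
            fold_mse.append(np.mean((y[val_idx] - preds) ** 2))
        avg_mse.append(np.mean(fold_mse))
    return lambdas[int(np.argmin(avg_mse))]
```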

Formalizing the idea of local fits

Contrasting with a global average
A globally constant fit weights all points equally (equal weight c on each datapoint):
ŷ_q = (1/N) Σ_{i=1}^{N} y_i = Σ_{i=1}^{N} c y_i / Σ_{i=1}^{N} c
[Figure: Boxcar Kernel (lambda = 1) fit, which reduces to the global average]

Contrasting with a global average
Kernel regression leads to a locally constant fit: slowly add in some points and let others gradually die off.
ŷ_q = Σ_{i=1}^{N} Kernel_λ(distance(x_i, x_q)) y_i / Σ_{i=1}^{N} Kernel_λ(distance(x_i, x_q))
[Figures: Boxcar Kernel (lambda = 0.2) and Epanechnikov Kernel (lambda = 0.2) fits]

Local linear regression
So far, we have been fitting a constant function locally at each point → locally weighted averages.
Can instead fit a line or polynomial locally at each point → locally weighted linear regression.

Local regression rules of thumb (see the sketch after this list):
- A local linear fit reduces bias at the boundaries with a minimal increase in variance
- A local quadratic fit doesn't help at the boundaries and increases variance, but does help capture curvature in the interior
- With sufficient data, local polynomials of odd degree dominate those of even degree
Recommended default choice: local linear regression
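A minimal sketch of locally weighted linear regression at one 1D query point, assuming a Gaussian kernel for the local weights (not necessarily the slides' exact formulation; names are my own):

```python
import numpy as np

def local_linear_predict(x, y, x_q, lam):
    """Fit a kernel-weighted least-squares line around x_q and evaluate it at x_q."""
    x, y = np.asarray(x, float), np.asarray(y, float)
    w = np.exp(-(x - x_q) ** 2 / lam)                # Gaussian kernel weights around x_q
    A = np.column_stack([np.ones_like(x), x])        # design matrix [1, x]
    WA = A * w[:, None]                              # apply weights without forming diag(w)
    beta = np.linalg.solve(A.T @ WA, A.T @ (w * y))  # weighted normal equations
    return beta[0] + beta[1] * x_q                   # local intercept + slope at x_q
```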

Discussion on k-NN and kernel regression

Nonparametric approaches
k-NN and kernel regression are examples of nonparametric regression.
General goals of nonparametrics:
- Flexibility
- Make few assumptions about f(x)
- Complexity can grow with the number of observations N
Lots of other choices: splines, trees, locally weighted structured regression models, ...

Limiting behavior of NN: Noiseless setting (ε_i = 0)
In the limit of getting an infinite amount of noiseless data, the MSE of the 1-NN fit goes to 0.
[Figures: 1-NN fit vs. quadratic fit as the amount of noiseless data grows]
Not true for parametric models!

Error vs. amount of data
[Figure: error vs. # data points in the training set]

Limiting behavior of NN: Noisy data setting
In the limit of getting an infinite amount of data, the MSE of the NN fit goes to 0 if k grows, too.
[Figures: 1-NN fit, 200-NN fit, and quadratic fit on noisy data]

NN and kernel methods for large d or small N
NN and kernel methods work well when the data cover the space, but:
- the more dimensions d you have, the more points N you need to cover the space
- need N = O(exp(d)) data points for good performance
This is where parametric models become useful.

Complexity of NN search
Naïve approach: brute force search
- Given a query point x_q
- Scan through each point x_1, x_2, ..., x_N
- O(N) distance computations per 1-NN query!
- O(N log k) per k-NN query!
What if N is huge??? (and many queries) KD-trees! Locality-sensitive hashing, etc.

k-nn for classification Spam filtering example Not spam Spam Input: x Output: y 50 Text of email, sender, IP, 25

Using k-nn for classification Space of labeled emails (not spam vs. spam), organized by similarity of text query email not spam vs. spam: decide via majority vote of k-nn 51 Using k-nn for classification Space of labeled emails (not spam vs. spam), organized by similarity of text query email not spam vs. spam: decide via majority vote of k-nn 52 26

Summary for nearest neighbor and kernel regression

What you can do now:
- Motivate the use of nearest neighbor (NN) regression
- Define distance metrics in 1D and multiple dimensions
- Perform NN and k-NN regression
- Analyze computational costs of these algorithms
- Discuss sensitivity of NN to lack of data, dimensionality, and noise
- Perform weighted k-NN and define weights using a kernel
- Define and implement kernel regression
- Describe the effect of varying the kernel bandwidth λ or the # of nearest neighbors k
- Select λ or k using cross validation
- Compare and contrast kernel regression with a global average fit
- Define what makes an approach nonparametric and why NN and kernel regression are considered nonparametric methods
- Analyze the limiting behavior of NN regression
- Use NN for classification

Recap of topics so far
Emily Fox, University of Washington, February 3, 2017

What you have learned thus far:
- Point estimation
- Regression
- Training, test, validation, generalization error
- Overfitting
- Bias-variance tradeoff
- Regularized regression (ridge, LASSO)
- Cross validation
- Logistic regression
- Decision trees
- Boosting
- Instance-based learning

The ML pipeline
- Inputs x: sq.ft., # bedrooms, # bathrooms, text of review, loan application, ...
- Features h_j(x): x[j], x[j]^p, tf-idf, ...
- Task: Regression (x or h_j(x) → R); Classification (x or h_j(x) → {0, 1, ..., k}; we've focused on {-1, +1})
- Model: linear models w^T h(x), decision trees, ensembles, NN
- Algorithm: optimize a loss function (gradient = 0, gradient ascent/descent, stochastic gradient ascent/descent, coordinate ascent/descent, boosting/AdaBoost)
- Evaluation and model selection: training, validation or cross-validation, test error
- Concepts: bias-variance tradeoff, overfitting

Your Midterm Exam
Content: everything up to today.
Only 50 mins, so arrive early and settle down quickly.
Cheat sheet:
- Single 8½ x 11 handwritten sheet, front and back
No:
- Computer, phone, other materials, ...
The exam:
- Covers key concepts and ideas; work on understanding the big picture and the differences between methods