Instance-based Learning. CE-717: Machine Learning, Sharif University of Technology. M. Soleymani, Fall 2015


Outline

Non-parametric approach
- Unsupervised: non-parametric density estimation
  - Parzen windows
  - k-nearest-neighbor density estimation
- Supervised: instance-based learners
  - Classification: kNN classification, weighted (or kernel) kNN
  - Regression: kNN regression, locally weighted linear regression

Introduction

Estimation of arbitrary density functions:
- Parametric density functions cannot usually fit the densities we encounter in practical problems (e.g., many common parametric densities are unimodal).
- Non-parametric methods do not assume that the form of the underlying densities is known in advance.

Non-parametric methods (for classification) can be categorized into:
- Generative: estimate $p(x \mid y = i)$ from $D_i$ using non-parametric density estimation
- Discriminative: estimate $p(y = i \mid x)$ from $D$ directly

Parametric vs. non-parametric methods

Parametric methods:
- Find parameters from the data and then use the inferred parameters to make decisions on new data points.
- Learning: finding parameters from data.

Non-parametric methods:
- Training examples are used explicitly to make predictions for each test point.
- No training phase is required beyond storing the data.

Both supervised and unsupervised learning methods can be categorized into parametric and non-parametric methods.

Histogram approximation idea

Histogram approximation of an unknown pdf:
$P(b_l) \approx k(b_l)/n$ for $l = 1, \dots, L$
where $k(b_l)$ is the number of samples (among the $n$) that lie in bin $b_l$.

The corresponding estimated pdf is piecewise constant:
$\hat{p}(x) = \frac{P(b_l)}{h}$ for $|x - \hat{x}_{b_l}| \le \frac{h}{2}$
where $h$ is the bin width and $\hat{x}_{b_l}$ is the mid-point of bin $b_l$.
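To make the histogram estimate concrete, here is a minimal numpy sketch of the formula above; the random sample data, bin count, and query point are illustrative assumptions, not from the slides.

```python
import numpy as np

def histogram_density(samples, num_bins, x):
    """Histogram estimate p_hat(x) = P(b_l) / h for the bin b_l containing x."""
    samples = np.asarray(samples, dtype=float)
    n = len(samples)
    lo, hi = samples.min(), samples.max()
    h = (hi - lo) / num_bins                       # bin width
    counts, edges = np.histogram(samples, bins=num_bins, range=(lo, hi))
    # P(b_l) ~= k(b_l) / n, and p_hat(x) = P(b_l) / h inside bin b_l
    idx = np.clip(np.searchsorted(edges, x, side="right") - 1, 0, num_bins - 1)
    return counts[idx] / (n * h)

# illustrative data
rng = np.random.default_rng(0)
data = rng.normal(loc=2.0, scale=0.5, size=500)
print(histogram_density(data, num_bins=20, x=2.0))
```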

Non-parametric density estimation

Probability of falling in a region $R$: $P = \int_R p(x)\,dx$. We can estimate a smoothed $p(x)$ by estimating $P$.

Let $D = \{x^{(i)}\}_{i=1}^{n}$ be a set of samples drawn i.i.d. according to $p(x)$. The probability that $k$ of the $n$ samples fall in $R$ is binomial:
$P_k = \binom{n}{k} P^k (1 - P)^{n-k}$, with $E[k] = nP$.

This binomial distribution peaks sharply about its mean for large $n$, so $k \approx nP$ and $k/n$ can be used as an estimate of $P$. The estimate is more accurate for larger $n$.

Non-parametric density estimation

Assumptions: $p(x)$ is continuous and the region $R$ enclosing $x$ is so small that $p$ is nearly constant within it:
$P = \int_R p(x')\,dx' \approx p(x)\,V$, where $V = \mathrm{Vol}(R)$ and $x \in R$.

Therefore $\hat{p}(x) = \frac{P}{V} \approx \frac{k/n}{V}$.

Let $V$ approach zero if we want to estimate $p(x)$ itself instead of its averaged version over $R$.

Non-parametric density estimation: Main approaches

Two approaches to satisfying these conditions:
- k-nearest-neighbor density estimator: fix $k$ and determine the value of $V$ from the data. The volume grows until it contains the $k$ nearest neighbors of $x$.
- Kernel density estimator (Parzen window): fix $V$ and determine $k$ from the data. The number of points falling inside the volume can vary from point to point.

Parzen window (kernel density estimator)

Extension of the histogram idea: hyper-cubes with side length $h$ (i.e., volume $h^d$) are centered on the samples.

Hypercube as a simple window function:
$\varphi(u) = \begin{cases} 1 & |u_1| \le \frac{1}{2}, \dots, |u_d| \le \frac{1}{2} \\ 0 & \text{otherwise} \end{cases}$

$\hat{p}(x) = \frac{k}{nV} = \frac{1}{n h^d} \sum_{i=1}^{n} \varphi\!\left(\frac{x - x^{(i)}}{h}\right)$

where $V = h^d$ and $k = \sum_{i=1}^{n} \varphi\!\left(\frac{x - x^{(i)}}{h}\right)$ is the number of samples in the hypercube around $x$.
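A minimal sketch of the hypercube Parzen-window estimate above, assuming numpy and an illustrative 2-D Gaussian sample; the function and variable names are my own.

```python
import numpy as np

def hypercube_kernel(u):
    """phi(u) = 1 if all |u_j| <= 1/2, else 0 (applied row-wise)."""
    return np.all(np.abs(u) <= 0.5, axis=-1).astype(float)

def parzen_hypercube(X_train, x, h):
    """p_hat(x) = (1 / (n h^d)) * sum_i phi((x - x_i) / h)."""
    X_train = np.atleast_2d(X_train)
    n, d = X_train.shape
    k = hypercube_kernel((x - X_train) / h).sum()   # samples inside the cube around x
    return k / (n * h ** d)

# illustrative usage on 2-D data
rng = np.random.default_rng(1)
X = rng.normal(size=(200, 2))
print(parzen_hypercube(X, x=np.array([0.0, 0.0]), h=0.5))
```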

Window or kernel function

Necessary conditions on the window function for the estimate to be a legitimate density:
$\varphi(x) \ge 0$ and $\int \varphi(x)\,dx = 1$

Windows are also called kernels or potential functions.

Kernel density estimator: Example

Gaussian kernels centered on the samples (with $\sigma = h$):
$\hat{p}(x) = \frac{1}{n} \sum_{i=1}^{n} \mathcal{N}(x \mid x^{(i)}, \sigma^2) = \frac{1}{n} \sum_{i=1}^{n} \frac{1}{\sqrt{2\pi}\,\sigma}\, e^{-\frac{(x - x^{(i)})^2}{2\sigma^2}}$

Example samples: 1, 1.2, 1.4, 1.5, 1.6, 2, 2.1, 2.15, 4, 4.3, 4.7, 4.75, 5

The choice of $\sigma$ is crucial.
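A small sketch of the Gaussian-kernel estimate evaluated on the example samples listed above; the evaluation grid and the printed summary are illustrative choices.

```python
import numpy as np

def gaussian_kde_1d(samples, sigma, x_grid):
    """p_hat(x) = (1/n) * sum_i N(x | x_i, sigma^2), evaluated on a grid."""
    samples = np.asarray(samples, dtype=float)
    diffs = x_grid[:, None] - samples[None, :]               # shape (grid, n)
    kernels = np.exp(-diffs ** 2 / (2 * sigma ** 2)) / (np.sqrt(2 * np.pi) * sigma)
    return kernels.mean(axis=1)

samples = [1, 1.2, 1.4, 1.5, 1.6, 2, 2.1, 2.15, 4, 4.3, 4.7, 4.75, 5]
x_grid = np.linspace(0, 6, 200)
for sigma in (0.02, 0.1, 0.5, 1.5):                          # widths from the example figure
    p = gaussian_kde_1d(samples, sigma, x_grid)
    print(sigma, p.max())
```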

Kernel density estimator: Example

[Figure: estimates for $\sigma = 0.02$, $0.1$, $0.5$, and $1.5$]

Window (or kernel) function: Width parameter

$\hat{p}(x) = \frac{1}{n} \sum_{i=1}^{n} \frac{1}{h^d}\, \varphi\!\left(\frac{x - x^{(i)}}{h}\right)$

Choosing $h$:
- Too large: low resolution
- Too small: much variability
[Duda, Hart, and Stork]

For unlimited $n$, by letting $V$ slowly approach zero as $n$ increases, the estimated density converges to the true one.

Width parameter

For a fixed $n$, a smaller $h$ results in higher variance while a larger $h$ leads to higher bias.

For a fixed $h$, the variance decreases as the number of sample points $n$ tends to infinity; for a large enough number of samples, a smaller $h$ gives a more accurate estimate.

In practice, where only a finite number of samples is available, a compromise between $h$ and $n$ must be made. $h$ can be set using techniques such as cross-validation when the density estimate is used for learning tasks such as classification.

Practical issues: Curse of dimensionality

A large $n$ is necessary to obtain an acceptable density estimate in high-dimensional feature spaces; $n$ must grow exponentially with the dimensionality $d$.

If $n$ equidistant points are required to densely fill a one-dimensional interval, $n^d$ points are needed to fill the corresponding $d$-dimensional hypercube. We need an exponentially large quantity of training data to ensure that the cells are not empty. There are also corresponding computational complexity requirements.

[Figure: filling a unit interval, square, and cube for $d = 1, 2, 3$]

k-nearest neighbor density estimator

The cell volume is a function of the point location: to estimate $p(x)$, let the cell around $x$ grow until it captures $k$ samples, called the $k$ nearest neighbors of $x$. Here $k$ is a function of $n$.

Two possibilities can occur:
- High density near $x$: the cell will be small, which provides good resolution.
- Low density near $x$: the cell will grow large, stopping only when higher-density regions are reached.
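A minimal sketch of the kNN density estimator, assuming Euclidean balls so that $V$ is the volume of the $d$-dimensional hypersphere whose radius reaches the $k$-th nearest neighbor; the data and parameter values are illustrative.

```python
import numpy as np
from math import gamma, pi

def knn_density(X_train, x, k):
    """p_hat(x) = k / (n * V), where V is the volume of the smallest ball
    around x containing its k nearest training samples."""
    X_train = np.atleast_2d(X_train)
    n, d = X_train.shape
    dists = np.linalg.norm(X_train - x, axis=1)
    r = np.sort(dists)[k - 1]                      # radius to the k-th nearest neighbor
    V = (pi ** (d / 2) / gamma(d / 2 + 1)) * r ** d
    return k / (n * V)

rng = np.random.default_rng(2)
X = rng.normal(size=(500, 2))
print(knn_density(X, x=np.array([0.0, 0.0]), k=10))
```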

k-nearest neighbor estimation: Example

The resulting estimate has discontinuities in its slopes. [Bishop]

Convergence

Both the k-nearest-neighbor density estimator and the kernel density estimator converge to the true probability density in the limit $n \to \infty$, provided $V$ shrinks suitably with $n$ and $k$ grows with $n$.

Non-parametric density estimation: Summary

- Generality of distributions: with enough samples, convergence to an arbitrarily complicated target density can be obtained.
- The number of samples required to assure convergence must be very large and grows exponentially with the dimensionality of the feature space.
- These methods are very sensitive to the choice of window width or number of nearest neighbors.
- There may be severe requirements for computation time and storage (all training samples must be kept): the training phase simply requires storage of the training set, but the computational cost of evaluating $p(x)$ grows linearly with the size of the data set.

Nonparametric learners

Memory-based or instance-based learners; lazy learning: (almost) all the work is done at test time.

Generic description:
- Memorize the training set $(x^{(1)}, y^{(1)}), \dots, (x^{(n)}, y^{(n)})$.
- Given a test point $x$, predict $\hat{y} = f(x; x^{(1)}, y^{(1)}, \dots, x^{(n)}, y^{(n)})$.

$f$ is typically expressed in terms of the similarity of the test sample $x$ to the training samples $x^{(1)}, \dots, x^{(n)}$.

Parzen window & generative classification

If
$\frac{\frac{1}{n_1 h^d} \sum_{x^{(i)} \in D_1} \varphi\!\left(\frac{x - x^{(i)}}{h}\right)}{\frac{1}{n_2 h^d} \sum_{x^{(i)} \in D_2} \varphi\!\left(\frac{x - x^{(i)}}{h}\right)} > \frac{P(C_2)}{P(C_1)}$
decide $C_1$; otherwise decide $C_2$.

$n_j = |D_j|$ ($j = 1, 2$): number of training samples in class $C_j$; $D_j$: set of training samples labeled as $C_j$.

For large $n$, this classifier has both high time and high memory requirements.
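A sketch of the two-class rule above. As an assumption, it uses a Gaussian window in place of the hypercube $\varphi$ (a common smooth choice, not the slide's exact kernel) and estimates the priors from class frequencies when none are given; the synthetic data and $h$ are illustrative.

```python
import numpy as np

def gaussian_parzen(X_class, x, h):
    """Class-conditional estimate p_hat(x | C_j) with a Gaussian window of width h."""
    d = X_class.shape[1]
    sq = np.sum((X_class - x) ** 2, axis=1)
    return np.mean(np.exp(-sq / (2 * h ** 2)) / ((2 * np.pi) ** (d / 2) * h ** d))

def parzen_classify(X1, X2, x, h, prior1=None):
    """Decide C1 if p_hat(x|C1) P(C1) > p_hat(x|C2) P(C2), else C2."""
    n1, n2 = len(X1), len(X2)
    p1 = prior1 if prior1 is not None else n1 / (n1 + n2)
    p2 = 1.0 - p1
    return 1 if gaussian_parzen(X1, x, h) * p1 > gaussian_parzen(X2, x, h) * p2 else 2

rng = np.random.default_rng(3)
X1 = rng.normal(loc=[0, 0], scale=1.0, size=(100, 2))   # class C1 samples
X2 = rng.normal(loc=[3, 3], scale=1.0, size=(100, 2))   # class C2 samples
print(parzen_classify(X1, X2, x=np.array([0.5, 0.5]), h=0.5))   # expected: class 1
```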

Parzen window & generative classification: Example

[Figure: decision regions for a smaller $h$ vs. a larger $h$. Duda, Hart, and Stork]

k-nearest neighbor estimation & generative classification

If
$\frac{k/(n_1 V_1)}{k/(n_2 V_2)} > \frac{P(C_2)}{P(C_1)}$
decide $C_1$; otherwise decide $C_2$.

$n_j = |D_j|$ ($j = 1, 2$): number of training samples in class $C_j$; $D_j$: set of training samples labeled as $C_j$.

$V_j$ denotes the hypersphere volume, where $r_j$ is the radius of the hypersphere centered at $x$ that contains $k$ samples of class $C_j$ ($j = 1, 2$). $k$ need not be the same for all classes.

k-nearest-neighbor (knn) rule

k-NN classifier: $k > 1$ nearest neighbors; the label for $x$ is predicted by majority voting among its $k$ nearest neighbors.

[Figure: 2-D example with $k = 5$; the query point $x = [x_1, x_2]$ is assigned the majority label among its 5 nearest neighbors]

What is the effect of $k$?

knn classifier

Given:
- Training data $\{(x^{(1)}, y^{(1)}), \dots, (x^{(n)}, y^{(n)})\}$ are simply stored.
- Test sample: $x$

To classify $x$ (see the sketch below):
- Find the $k$ nearest training samples to $x$.
- Out of these $k$ samples, identify the number of samples $k_j$ belonging to class $C_j$ ($j = 1, \dots, C$).
- Assign $x$ to the class $C_{\hat{j}}$ where $\hat{j} = \arg\max_{j = 1, \dots, C} k_j$.

It can be considered a discriminative method.
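A minimal sketch of the majority-vote kNN classifier described above; the toy training set and the choice of Euclidean distance are illustrative.

```python
import numpy as np
from collections import Counter

def knn_classify(X_train, y_train, x, k):
    """Assign x to the majority class among its k nearest training samples."""
    dists = np.linalg.norm(X_train - x, axis=1)          # Euclidean distances
    nearest = np.argsort(dists)[:k]                      # indices of the k nearest neighbors
    votes = Counter(y_train[i] for i in nearest)
    return votes.most_common(1)[0][0]

# illustrative usage
X_train = np.array([[0, 0], [0, 1], [1, 0], [5, 5], [5, 6], [6, 5]])
y_train = np.array([0, 0, 0, 1, 1, 1])
print(knn_classify(X_train, y_train, x=np.array([0.4, 0.6]), k=3))   # -> 0
```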

Probabilistic perspective of knn

kNN as a discriminative nonparametric classifier: non-parametric density estimation for $P(C_j \mid x)$:
$P(C_j \mid x) \approx \frac{k_j}{k}$
where $k_j$ is the number of training samples among the $k$ nearest neighbors of $x$ that are labeled $C_j$.

The Bayes decision rule is then used for assigning labels.

Nearest-neighbor classifier: Example

Voronoi tessellation: each cell consists of all points closer to a given training point than to any other training point. All points in a cell are labeled by the category of the corresponding training point. [Duda, Hart, and Stork]

knn classifier: Effect of k

[Figure: Bishop]

Nearest neighbor classifier: error bound

Nearest-neighbor: kNN with $k = 1$.
Decision rule: $\hat{y} = y^{(NN(x))}$ where $NN(x) = \arg\min_{i = 1, \dots, n} \|x - x^{(i)}\|$.

Cover & Hart (1967): the asymptotic risk of the NN classifier satisfies
$R^* \le R_{NN} \le 2R^*(1 - R^*) \le 2R^*$
where $R_n$ is the expected risk of the NN classifier with $n$ training examples drawn from $p(x, y)$, $R_{NN} = \lim_{n \to \infty} R_n$, and $R^*$ is the optimal Bayes risk.

k-nn classifier: error bound

Devroye et al. (1996): the asymptotic risk of the kNN classifier, $R_{kNN} = \lim_{n \to \infty} R_n$, satisfies
$R^* \le R_{kNN} \le R^* + \sqrt{\frac{2 R_{NN}}{k}}$
where $R^*$ is the optimal Bayes risk.

Bound on the k-nearest Neighbor Error Rate bounds on k-nn error rate for different values of k (infinite training data) [Duda, Hurt, and Strok] As k increases, the upper bounds get closer to the lower bound (the Bayes Error rate). 31 When k, the two bounds meet and k-nn rule becomes optimal.

Instance-based learner

Main ingredients for constructing an instance-based learner:
- A distance metric
- The number of nearest neighbors of the test point to look at
- A weighting function (optional)
- How to compute the output based on the neighbors

Distance measures

Euclidean distance (sensitive to irrelevant features):
$d(x, x') = \|x - x'\|_2^2 = (x_1 - x'_1)^2 + \dots + (x_d - x'_d)^2$

Weighted Euclidean distance:
$d_w(x, x') = w_1 (x_1 - x'_1)^2 + \dots + w_d (x_d - x'_d)^2$

Mahalanobis distance:
$d_A(x, x') = (x - x')^T A (x - x')$

Distance (metric) learning methods can be used to set the weights $w$ or the matrix $A$.

Other distances: Hamming, angle, Minkowski
$L_p(x, x') = \left(\sum_{i=1}^{d} |x_i - x'_i|^p\right)^{1/p}$
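Small helper functions sketching the four distances above; the weight vector, the matrix $A$, and the example points in the usage lines are arbitrary illustrations.

```python
import numpy as np

def euclidean_sq(x, xp):
    return float(np.sum((x - xp) ** 2))

def weighted_euclidean_sq(x, xp, w):
    return float(np.sum(w * (x - xp) ** 2))

def mahalanobis_sq(x, xp, A):
    diff = x - xp
    return float(diff @ A @ diff)

def lp_distance(x, xp, p):
    return float(np.sum(np.abs(x - xp) ** p) ** (1.0 / p))

x, xp = np.array([1.0, 2.0]), np.array([2.0, 0.0])
print(euclidean_sq(x, xp))                                   # 1 + 4 = 5
print(weighted_euclidean_sq(x, xp, np.array([1.0, 3.0])))    # 1 + 12 = 13
print(mahalanobis_sq(x, xp, np.eye(2)))                      # equals squared Euclidean
print(lp_distance(x, xp, p=1))                               # 1 + 2 = 3
```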

Distance measure: example

$d(x, x') = (x_1 - x'_1)^2 + (x_2 - x'_2)^2$ vs. $d(x, x') = (x_1 - x'_1)^2 + 3 (x_2 - x'_2)^2$

Weighted knn classification

Weight nearer neighbors more heavily:
$\hat{y} = f(x) = \arg\max_{c = 1, \dots, C} \sum_{j \in N_k(x)} w_j(x)\, I(c = y^{(j)})$
An example weighting function: $w_j(x) = \frac{1}{\|x - x^{(j)}\|^2}$

In weighted kNN, we can use all training examples instead of just $k$ (Shepard's method):
$\hat{y} = f(x) = \arg\max_{c = 1, \dots, C} \sum_{j = 1}^{n} w_j(x)\, I(c = y^{(j)})$

Weights can be given by a kernel function $w_j(x) = K(x, x^{(j)})$, e.g., $K(x, x^{(j)}) = e^{-\frac{d(x, x^{(j)})}{\sigma^2}}$.
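A sketch of weighted kNN voting using the kernel weight $K(x, x^{(j)}) = e^{-d(x, x^{(j)})/\sigma^2}$ from the slide, with Euclidean distance as $d$; the toy data and $\sigma$ are illustrative.

```python
import numpy as np

def weighted_knn_classify(X_train, y_train, x, k, sigma=1.0):
    """Weighted vote among the k nearest neighbors, w_j = exp(-d(x, x_j) / sigma^2)."""
    dists = np.linalg.norm(X_train - x, axis=1)
    nearest = np.argsort(dists)[:k]
    scores = {}
    for j in nearest:
        w = np.exp(-dists[j] / sigma ** 2)        # kernel weight of neighbor j
        scores[y_train[j]] = scores.get(y_train[j], 0.0) + w
    return max(scores, key=scores.get)

X_train = np.array([[0, 0], [0, 1], [1, 0], [5, 5], [5, 6], [6, 5]])
y_train = np.array([0, 0, 0, 1, 1, 1])
print(weighted_knn_classify(X_train, y_train, x=np.array([1.0, 1.0]), k=5))   # -> 0
```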

Weighting functions

[Figure: weighting functions plotted as a function of $d(x, x')$; adopted from Andrew Moore's tutorial on instance-based learning]

knn regression

Simplest kNN regression: let $x^{(1)}, \dots, x^{(k)}$ be the $k$ nearest neighbors of $x$ and $y^{(1)}, \dots, y^{(k)}$ be their labels.
$\hat{y} = \frac{1}{k} \sum_{j=1}^{k} y^{(j)}$

Problems of kNN regression for fitting functions:
- Discontinuities in the estimated function. Solution: weighted (or kernel) regression.
- 1NN suffers from a noise-fitting problem; kNN ($k > 1$) smooths away noise, but there are other deficiencies, e.g., it flattens the estimate at the ends of the data range.
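A minimal sketch of the simple kNN regressor above (averaging the targets of the k nearest neighbors); the noisy quadratic dataset is illustrative.

```python
import numpy as np

def knn_regress(X_train, y_train, x, k):
    """Predict the mean target of the k nearest neighbors of x."""
    dists = np.linalg.norm(X_train - x, axis=1)
    nearest = np.argsort(dists)[:k]
    return float(np.mean(y_train[nearest]))

# illustrative 1-D example: noisy samples of y = x^2
rng = np.random.default_rng(4)
X = rng.uniform(0, 3, size=(50, 1))
y = X[:, 0] ** 2 + rng.normal(scale=0.1, size=50)
print(knn_regress(X, y, x=np.array([1.5]), k=5))   # close to 2.25
```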

knn regression: examples

[Figures: three datasets fitted with $k = 1$ and $k = 9$; adopted from Andrew Moore's tutorial on instance-based learning]

Weighted (or kernel) knn regression

Give higher weights to nearer neighbors:
$\hat{y} = f(x) = \frac{\sum_{j \in N_k(x)} w_j(x)\, y^{(j)}}{\sum_{j \in N_k(x)} w_j(x)}$

In weighted kNN regression, we can use all training examples instead of just $k$:
$\hat{y} = f(x) = \frac{\sum_{j=1}^{n} w_j(x)\, y^{(j)}}{\sum_{j=1}^{n} w_j(x)}$
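A sketch of the weighted kNN regressor above. As an assumption, it uses a Gaussian kernel $e^{-\|x - x^{(j)}\|^2 / 2\sigma^2}$ for the weights $w_j(x)$; the sine dataset and parameters are illustrative.

```python
import numpy as np

def kernel_knn_regress(X_train, y_train, x, k, sigma):
    """Weighted average of the k nearest targets with Gaussian kernel weights."""
    dists = np.linalg.norm(X_train - x, axis=1)
    nearest = np.argsort(dists)[:k]
    w = np.exp(-dists[nearest] ** 2 / (2 * sigma ** 2))   # Gaussian kernel weights
    return float(np.sum(w * y_train[nearest]) / np.sum(w))

rng = np.random.default_rng(5)
X = rng.uniform(0, 3, size=(50, 1))
y = np.sin(X[:, 0]) + rng.normal(scale=0.05, size=50)
print(kernel_knn_regress(X, y, x=np.array([1.0]), k=10, sigma=0.3))   # near sin(1) ~ 0.84
```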

Kernel knn regression

[Figures: fits with $\sigma = 1/32$, $1/32$, and $1/16$ of the x-axis width; adapted from Andrew Moore's tutorial on instance-based learning]

Choosing a good parameter (kernel width) is important.

Kernel knn regression

In these datasets, some regions contain no samples; the best kernel widths have been used.

Disadvantages:
- fails to capture the simple structure of the data
- fails to extrapolate at the edges

[Figures adopted from Andrew Moore's tutorial on instance-based learning]

Locally weighted linear regression

For each test sample, it produces a linear approximation to the target function in a local region. Instead of computing the output by weighted averaging (as in kernel regression), we fit a parametric function locally:
$\hat{y} = f(x; x^{(1)}, y^{(1)}, \dots, x^{(n)}, y^{(n)})$
$\hat{y} = f(x; w_x) = w_x^T x$
$J(w_x) = \sum_{i \in N_k(x)} \left(y^{(i)} - w_x^T x^{(i)}\right)^2$  (unweighted local linear fit)

$w_x$ is found separately for each test sample.

Locally weighted linear regression

$\hat{y} = f(x; x^{(1)}, y^{(1)}, \dots, x^{(n)}, y^{(n)})$

Unweighted (on the $k$ nearest neighbors):
$J(w_x) = \sum_{i \in N_k(x)} \left(y^{(i)} - w_x^T x^{(i)}\right)^2$

Weighted (on the $k$ nearest neighbors), e.g., with $K(x, x^{(i)}) = e^{-\frac{\|x - x^{(i)}\|^2}{2\sigma^2}}$:
$J(w_x) = \sum_{i \in N_k(x)} K(x, x^{(i)}) \left(y^{(i)} - w_x^T x^{(i)}\right)^2$

Weighted on all training examples:
$J(w_x) = \sum_{i=1}^{n} K(x, x^{(i)}) \left(y^{(i)} - w_x^T x^{(i)}\right)^2$
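A sketch of locally weighted linear regression using the Gaussian kernel weighting over all training examples (the last cost above), solved in closed form via weighted least squares; the added bias feature, the sine dataset, and $\sigma$ are illustrative choices.

```python
import numpy as np

def lwlr_predict(X_train, y_train, x, sigma):
    """Locally weighted linear regression: fit w_x by weighted least squares with
    K(x, x_i) = exp(-||x - x_i||^2 / (2 sigma^2)), then predict w_x^T x."""
    n = X_train.shape[0]
    Xb = np.hstack([np.ones((n, 1)), X_train])             # add a bias feature
    xb = np.concatenate([[1.0], np.atleast_1d(x)])
    K = np.exp(-np.sum((X_train - x) ** 2, axis=1) / (2 * sigma ** 2))
    W = np.diag(K)                                          # per-sample kernel weights
    # normal equations of the weighted least-squares cost J(w_x)
    w_x = np.linalg.solve(Xb.T @ W @ Xb, Xb.T @ W @ y_train)
    return float(xb @ w_x)

rng = np.random.default_rng(6)
X = rng.uniform(0, 3, size=(80, 1))
y = np.sin(X[:, 0]) + rng.normal(scale=0.05, size=80)
print(lwlr_predict(X, y, x=np.array([1.0]), sigma=0.3))     # near sin(1) ~ 0.84
```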

Locally weighted linear regression: example

[Figures: fits with $\sigma = 1/16$, $1/32$, and $1/8$ of the x-axis width; adopted from Andrew Moore's tutorial on instance-based learning]

The results are better than those of weighted kNN regression.

Instance-based regression: summary

Idea 1: weighted kNN regression using a weighted average of the outputs of $x$'s neighbors (or of all training data):
$\hat{y} = \frac{\sum_{i=1}^{k} K(x, x^{(i)})\, y^{(i)}}{\sum_{j=1}^{k} K(x, x^{(j)})}$

Idea 2: locally weighted parametric regression. Fit a parametric model (e.g., a linear function) to the neighbors of $x$ (or to all training data).

Implicit assumption: the target function is reasonably smooth.

Parametric vs. nonparametric methods

Is the SVM classifier parametric?
$\hat{y} = \mathrm{sign}\!\left(w_0 + \sum_{i: \alpha_i > 0} \alpha_i y^{(i)} K(x, x^{(i)})\right)$

In general, we cannot summarize it in a simple parametric form: we need to keep around the support vectors (possibly all of the training data). However, the $\alpha_i$ are, in a sense, parameters that are found in the training phase.

Instance-based learning: summary

Learning is just storing the training data; prediction on new data is based on the stored training data themselves.

An instance-based learner does not rely on assumptions concerning the structure of the underlying density function.

With large datasets, instance-based methods are slow at prediction time; kd-trees, Locality-Sensitive Hashing (LSH), and other approximate kNN methods can help.