Chapter 4: Non-Parametric Techniques

Chapter 4: Non-Parametric Techniques. Outline: Introduction; Density Estimation; Parzen Windows; k_n-Nearest Neighbor Density Estimation; the k-Nearest Neighbor (k-NN) Decision Rule.

Supervised Learning

How to fit a density function to 1-D data? (Figure: a scatter of one-dimensional samples spread over roughly 35 to 60.)

Fitting a Normal density using MLE

How many bins? (Figure: histograms of the same data with 5, 10, 15, and 20 bins.)

Parzen Windows (Figure: density estimates with window widths 1.25 and 2.5.)

Parzen Windows (Figure: density estimates with window widths 5 and 10.)

Gaussian Mixture Model: a weighted combination of a small but unknown number of Gaussians. Can one recover the parameters of the Gaussians from unlabeled data?

Introduction. Classical parametric densities are unimodal (they have a single local maximum), whereas many practical problems involve multi-modal densities (subgroups). Nonparametric procedures can be used with arbitrary distributions and without assuming that the form of the underlying densities is known. There are two types of nonparametric methods: estimating the class-conditional densities p(x | ω_j), or bypassing density estimation and directly estimating the a-posteriori probabilities P(ω_j | x).

Density Estimation. Basic idea: the probability that a vector x falls in a region R is

P = ∫_R p(x') dx'   (1)

P is a smoothed (or averaged) version of the density function p(x). How do we estimate P? If we have a sample of size n drawn i.i.d. from p(x), the probability that exactly k of the n points fall in R is given by the binomial law:

P_k = C(n, k) P^k (1 − P)^(n−k)   (2)

where C(n, k) is the binomial coefficient, and the expected value of k is E[k] = nP   (3).

ML estimation of the unknown P = θ. The maximum of P_k as a function of θ is achieved at θ̂ = k/n ≈ P. The ratio k/n is therefore a good estimate of the probability P and can be used to estimate the density p(x). Assume p(x) is continuous and the region R is so small that p does not vary significantly within it. Then

∫_R p(x') dx' ≈ p(x) V   (4)

where x is a point in R and V is the volume enclosed by R.

Combining equations (1), (3) and (4) yields:

p(x) ≈ (k/n) / V

Conditions for convergence. The fraction (k/n)/V is a space-averaged value of p(x); convergence to the true p(x) is obtained only if V approaches zero. With n fixed: if V → 0 and k = 0 (no samples fall in R), the estimate is p(x) = 0; if V → 0 and k ≠ 0, the estimate diverges to infinity. Neither limit is useful with a fixed, finite sample.

The volume V needs to approach 0 if we want to obtain p(x) rather than just an averaged version of it. In practice, V cannot be made arbitrarily small, since the number of samples n is always limited, so we have to accept a certain amount of averaging in p(x). Theoretically, if an unlimited number of samples is available, we can circumvent this difficulty as follows. To estimate the density at x, form a sequence of regions R_1, R_2, ... containing x: the first region is used with one sample, the second with two samples, and so on. Let V_n be the volume of R_n, k_n the number of samples falling in R_n, and p_n(x) the n-th estimate of p(x):

p_n(x) = (k_n / n) / V_n   (7)

Necessary conditions for p_n(x) to converge to p(x):

1) lim_{n→∞} V_n = 0
2) lim_{n→∞} k_n = ∞
3) lim_{n→∞} k_n / n = 0

There are two ways of defining sequences of regions that satisfy these conditions:
(a) Shrink an initial region, e.g. V_n = 1/√n; it can be shown that p_n(x) → p(x) as n → ∞. This is called the Parzen-window estimation method.
(b) Specify k_n as a function of n, e.g. k_n = √n, and grow the volume V_n until it encloses the k_n nearest neighbors of x. This is called the k_n-nearest-neighbor estimation method.


Parzen Windows. For simplicity, assume that the region R_n is a d-dimensional hypercube with edge length h_n, so V_n = h_n^d. Let φ(u) be the following window function:

φ(u) = 1 if |u_j| ≤ 1/2 for j = 1, ..., d, and 0 otherwise.

Then φ((x − x_i)/h_n) is equal to unity if x_i falls within the hypercube of volume V_n centered at x, and equal to zero otherwise.

The number of samples in this hypercube is

k_n = Σ_{i=1}^{n} φ((x − x_i) / h_n)

and substituting k_n into equation (7) gives the estimate

p_n(x) = (1/n) Σ_{i=1}^{n} (1/V_n) φ((x − x_i) / h_n)

p_n(x) estimates p(x) as an average of window functions of x and the samples x_i, i = 1, ..., n. The window function φ can be of any form, e.g. Gaussian.
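The estimate above fits in a few lines of NumPy. The snippet below is a minimal, illustrative sketch (function and variable names are my own, not from the text) of p_n(x) = (1/n) Σ_i (1/V_n) φ((x − x_i)/h) using the hypercube window defined earlier, with V_n = h^d.

```python
import numpy as np

def hypercube_window(u):
    """phi(u): 1 if every coordinate satisfies |u_j| <= 1/2, else 0."""
    return np.all(np.abs(u) <= 0.5, axis=-1).astype(float)

def parzen_estimate(x, samples, h):
    """p_n(x) = (1/n) * sum_i (1/V_n) * phi((x - x_i)/h), with V_n = h**d."""
    samples = np.atleast_2d(samples)      # shape (n, d)
    n, d = samples.shape
    u = (x - samples) / h                 # broadcast the test point against all samples
    return hypercube_window(u).sum() / (n * h**d)

# Usage: estimate a 1-D density at a few points from 100 samples of N(0, 1).
rng = np.random.default_rng(0)
data = rng.normal(0.0, 1.0, size=(100, 1))
for x0 in (-1.0, 0.0, 1.0):
    print(x0, parzen_estimate(np.array([x0]), data, h=0.5))
```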

The behavior of the Parzen-window method, for the case p(x) ~ N(0, 1). Let φ(u) = (1/√(2π)) exp(−u²/2) and h_n = h_1/√n (n > 1), where h_1 is a known parameter. Then

p_n(x) = (1/n) Σ_{i=1}^{n} (1/h_n) φ((x − x_i) / h_n)

is an average of normal densities centered at the samples x_i.

Parzen Window Density Estimate

Numerical results: for n = 1 and h_1 = 1,

p_1(x) = φ(x − x_1) = (1/√(2π)) exp(−(x − x_1)²/2) ~ N(x_1, 1)

For n = 10 and h = 0.1, the contributions of the individual samples are clearly observable.


Analogous results are also obtained in two dimensions.


The unknown density p(x) = λ_1 U(a, b) + λ_2 T(c, d) is a mixture of a uniform and a triangle density.


Parzen-window density estimates

Classification example. How do we design classifiers based on Parzen-window density estimation? Estimate the class-conditional densities for each class, and classify a test sample by the label corresponding to the maximum a-posteriori density. The decision region of a Parzen-window classifier depends on the choice of window function and window width; in practice the window function chosen is usually Gaussian.
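As an illustration of this recipe (not code from the text), the hedged sketch below estimates each class-conditional density with a Gaussian window, weights it by a class prior estimated from the sample counts, and picks the class with the largest product; names such as parzen_classify are invented for the example.

```python
import numpy as np

def gaussian_parzen(x, samples, h):
    """Parzen estimate with a spherical Gaussian window of width h."""
    samples = np.atleast_2d(samples)
    n, d = samples.shape
    u = (x - samples) / h
    k = np.exp(-0.5 * np.sum(u * u, axis=1)) / (2 * np.pi) ** (d / 2)
    return k.sum() / (n * h**d)

def parzen_classify(x, class_samples, h):
    """Pick the class maximizing prior * class-conditional Parzen density."""
    n_total = sum(len(s) for s in class_samples.values())
    scores = {c: (len(s) / n_total) * gaussian_parzen(x, s, h)
              for c, s in class_samples.items()}
    return max(scores, key=scores.get)

# Usage on two toy 2-D classes.
rng = np.random.default_rng(1)
train = {0: rng.normal([0, 0], 1.0, size=(50, 2)),
         1: rng.normal([3, 3], 1.0, size=(50, 2))}
print(parzen_classify(np.array([0.5, 0.2]), train, h=1.0))   # expected: 0
print(parzen_classify(np.array([2.8, 3.1]), train, h=1.0))   # expected: 1
```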


k_n-Nearest Neighbor Density Estimation. Let the region volume be a function of the training data: center a region about x and let it grow until it captures k_n samples (k_n = f(n)), the k_n nearest neighbors of x. Two possibilities can occur: if the true density is high near x, the region will be small, which provides good resolution; if the true density at x is low, the region will grow large in order to capture k_n samples. We can obtain a family of estimates by setting k_n = k_1 √n and choosing different values for k_1.
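A minimal sketch of this estimator, assuming a Euclidean ball as the region and k_n = k_1 √n (function names are illustrative): the distance to the k_n-th neighbor fixes the radius, the ball volume gives V_n, and the estimate is p_n(x) = (k_n/n)/V_n.

```python
import numpy as np
from math import gamma, pi, sqrt

def knn_density(x, samples, k1=1.0):
    """p_n(x) = (k_n/n)/V_n with k_n = k1*sqrt(n) and V_n the volume of the
    smallest Euclidean ball centered at x that contains k_n samples."""
    samples = np.atleast_2d(samples)
    n, d = samples.shape
    kn = max(1, int(round(k1 * sqrt(n))))
    dists = np.sort(np.linalg.norm(samples - x, axis=1))
    r = dists[kn - 1]                                     # radius reaching the k_n-th neighbor
    volume = (pi ** (d / 2) / gamma(d / 2 + 1)) * r ** d  # volume of a d-dimensional ball
    return kn / (n * volume)

# Usage: 1-D standard normal; the estimate near 0 should be close to 1/sqrt(2*pi) ~ 0.40.
rng = np.random.default_rng(2)
data = rng.normal(size=(1000, 1))
print(knn_density(np.array([0.0]), data))
```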

Illustration. For k_n = √n = 1, the estimate becomes p_n(x) = (k_n/n)/V_n = 1/V_1 = 1/(2|x − x_1|).


Estimation of a-posteriori probabilities. Goal: estimate P(ω_i | x) from a set of n labeled samples. Place a cell of volume V around x and capture k samples; suppose k_i of the k samples are labeled ω_i. Then

p_n(x, ω_i) = (k_i / n) / V

and an estimate of the posterior is

p_n(ω_i | x) = p_n(x, ω_i) / Σ_{j=1}^{c} p_n(x, ω_j) = k_i / k

k_i/k is the fraction of the samples within the cell that are labeled ω_i. For minimum error rate, the most frequently represented category within the cell is selected. If k is large and the cell is sufficiently small, the performance approaches that of the optimal Bayes rule.

Nearest-Neighbor Decision Rule. Let D_n = {x_1, x_2, ..., x_n} be a set of n labeled prototypes, and let x′ ∈ D_n be the prototype closest to a test point x. The nearest-neighbor rule classifies x by assigning it the label associated with x′. The nearest-neighbor rule generally leads to an error rate greater than the minimum possible, namely the Bayes error rate. However, if the number of prototypes is large (unlimited), the error rate of the nearest-neighbor classifier is never worse than twice the Bayes rate (this can be proved).


Bound on the Nearest Neighbor Error Rate

k-Nearest Neighbor (k-NN) Decision Rule. Classify the test sample x by assigning it the label most frequently represented among the k nearest samples to x.
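A brute-force sketch of this rule (illustrative names, Euclidean distance assumed); setting k = 1 recovers the nearest-neighbor rule of the previous slides.

```python
import numpy as np
from collections import Counter

def knn_classify(x, prototypes, labels, k=3):
    """Assign x the label most frequently represented among its k nearest prototypes."""
    dists = np.linalg.norm(prototypes - x, axis=1)   # distance from x to every prototype
    nearest = np.argsort(dists)[:k]                  # indices of the k closest prototypes
    votes = Counter(labels[i] for i in nearest)
    return votes.most_common(1)[0][0]

# Usage with a tiny labeled set; k = 1 reduces to the nearest-neighbor rule.
X = np.array([[0.0, 0.0], [0.2, 0.1], [1.0, 1.0], [1.1, 0.9]])
y = np.array(["a", "a", "b", "b"])
print(knn_classify(np.array([0.1, 0.0]), X, y, k=3))   # "a"
print(knn_classify(np.array([1.05, 1.0]), X, y, k=1))  # "b"
```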


The k-Nearest Neighbor Rule. The bounds on the k-NN error rate for different values of k: as k increases, the upper bounds get progressively closer to the lower bound, the Bayes error rate. As k goes to infinity, the two bounds meet and the k-NN rule becomes optimal.

Metric. The nearest-neighbor rule relies on a metric, or distance function, between patterns. A metric D(·, ·) is a function that takes two vectors a and b as arguments and returns a scalar. For a function D(·, ·) to be called a metric, it must satisfy the following properties for any vectors a, b, and c: non-negativity, D(a, b) ≥ 0; reflexivity, D(a, b) = 0 if and only if a = b; symmetry, D(a, b) = D(b, a); and the triangle inequality, D(a, b) + D(b, c) ≥ D(a, c).

Metrics and Nearest Neighbors. A metric remains a metric when each dimension is scaled (in fact, under any positive-definite linear transformation), but the identity of the nearest neighbor can change. The left plot shows a scenario where the black point is closer to x; scaling the x_1 axis by 1/3, however, brings the red point closer to x than the black point.

Example Metrics. The Minkowski metric is L_k(a, b) = (Σ_{j=1}^{d} |a_j − b_j|^k)^{1/k}: k = 1 gives the Manhattan distance, k = 2 gives the Euclidean distance, and k = ∞ gives the max-distance (the maximum absolute difference between the elements of the two vectors). The Tanimoto metric is used for measuring the distance between sets.
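A small sketch of the Minkowski family (illustrative code, not from the text):

```python
import numpy as np

def minkowski(a, b, k):
    """L_k(a, b) = (sum_j |a_j - b_j|**k)**(1/k); k = inf gives the max-distance."""
    diff = np.abs(np.asarray(a, float) - np.asarray(b, float))
    if np.isinf(k):
        return diff.max()
    return (diff ** k).sum() ** (1.0 / k)

a, b = [1.0, 2.0, 3.0], [4.0, 0.0, 3.0]
print(minkowski(a, b, 1))        # Manhattan: 5.0
print(minkowski(a, b, 2))        # Euclidean: ~3.606
print(minkowski(a, b, np.inf))   # max-distance: 3.0
```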

Computational Complexity of the k-NN Rule. Complexity in terms of both time (search) and space (storage of prototypes) has received a lot of attention. Suppose we are given n labeled samples in d dimensions and we seek the sample closest to a test point x. The naive approach requires calculating the distance from x to every stored point and then finding the closest one. Three general algorithmic techniques reduce the computational burden: computing partial distances, pre-structuring the data, and editing the stored prototypes.

Nearest-Neighbor Editing. For each test point, k-NN has to make n distance computations, followed by selection of the k closest neighbors; imagine a billion web pages, or a million images. There are different ways to reduce the search complexity: pre-structuring the data (e.g., divide 2-D data uniformly distributed in [−1, 1]² into quadrants and compare the given example only against points in its quadrant), building search trees, and nearest-neighbor editing. Nearest-neighbor editing eliminates useless prototypes: prototypes that are surrounded by training examples of the same category label are removed, which leaves the decision boundaries intact.

Nearest-neighbor editing: algorithm. The complexity of this algorithm is O(d³ n^⌊d/2⌋ ln n), where ⌊d/2⌋ is the floor of d/2.
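The full editing algorithm tests whether all Voronoi neighbors of a prototype share its label; the sketch below is only a rough, hedged approximation of that idea, discarding a prototype when all of its k nearest fellow prototypes carry the same label so that only points near class boundaries are retained (all names are illustrative, and this is not the Voronoi-based procedure itself).

```python
import numpy as np

def approximate_edit(prototypes, labels, k=5):
    """Keep a prototype only if at least one of its k nearest other prototypes
    has a different label (a rough stand-in for the Voronoi-neighbor test)."""
    keep = []
    for i, x in enumerate(prototypes):
        dists = np.linalg.norm(prototypes - x, axis=1)
        dists[i] = np.inf                     # exclude the point itself
        nearest = np.argsort(dists)[:k]
        if np.any(labels[nearest] != labels[i]):
            keep.append(i)                    # near a boundary: keep it
    return np.array(keep)

rng = np.random.default_rng(3)
X = np.vstack([rng.normal(0.0, 1.0, (200, 2)), rng.normal(2.5, 1.0, (200, 2))])
y = np.array([0] * 200 + [1] * 200)
kept = approximate_edit(X, y, k=5)
print(len(kept), "of", len(X), "prototypes kept")   # typically a boundary subset
```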

A Branch and Bound Algorithm for Computing k-Nearest Neighbors. K. Fukunaga & P. Narendra, IEEE Trans. Computers, July 1975.

k-NN generally requires a large number of computations when n (the number of training samples) is large. Several solutions reduce this complexity. Hart's method: condense the training dataset while retaining most of the samples that provide sufficient discriminatory information between classes. Fischer and Patrick: reorder the examples so that many distance computations can be eliminated. Fukunaga and Narendra: hierarchically decompose the design samples into disjoint subsets and apply a branch-and-bound algorithm to search through the hierarchy.

Stage 1: Hierarchical Decomposition. Given a dataset of n samples, divide it into l groups, and recursively divide each of the l groups further into l groups each. Partitioning must be done so that similar samples are placed in the same group. The k-means algorithm is used to divide the dataset, since it scales well to large datasets.

Searching the Tree. For each node p in the tree, evaluate the following quantities:
S_p: the set of samples associated with node p
N_p: the number of samples at node p
M_p: the sample mean of the samples at node p
r_p: the farthest distance from the mean M_p to a sample in S_p
Two rules are then used to eliminate candidates from the nearest-neighbor computation.
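A hedged sketch of Stage 1 together with the per-node quantities above: a tiny Lloyd's k-means is written inline so the snippet stays self-contained, and names and parameters such as min_size and max_depth are invented for the example. Each node stores S_p, N_p, M_p and r_p.

```python
import numpy as np

def kmeans(X, k, iters=20, rng=None):
    """Plain Lloyd iterations; returns (centroids, labels)."""
    rng = rng or np.random.default_rng(0)
    centroids = X[rng.choice(len(X), size=k, replace=False)].astype(float)
    for _ in range(iters):
        labels = np.argmin(np.linalg.norm(X[:, None] - centroids[None], axis=2), axis=1)
        for j in range(k):
            if np.any(labels == j):
                centroids[j] = X[labels == j].mean(axis=0)
    labels = np.argmin(np.linalg.norm(X[:, None] - centroids[None], axis=2), axis=1)
    return centroids, labels

def decompose(X, l=3, min_size=20, depth=0, max_depth=3):
    """Recursively split X into l groups with k-means; each node stores
    S_p (points), N_p (count), M_p (mean) and r_p (farthest distance from M_p)."""
    mean = X.mean(axis=0)
    node = {"S_p": X, "N_p": len(X), "M_p": mean,
            "r_p": np.linalg.norm(X - mean, axis=1).max(), "children": []}
    if len(X) > max(min_size, l) and depth < max_depth:
        _, labels = kmeans(X, l)
        for j in range(l):
            part = X[labels == j]
            if len(part):
                node["children"].append(decompose(part, l, min_size, depth + 1, max_depth))
    return node

rng = np.random.default_rng(4)
tree = decompose(rng.normal(size=(500, 2)), l=3)
print(len(tree["children"]), "top-level groups; root r_p =", round(tree["r_p"], 2))
```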

Rule 1: no sample x_i in S_p can be the nearest neighbor to the test point x if

d(x, M_p) > B + r_p

where B is the distance to the current nearest neighbor. Proof: for any point x_i in S_p, the triangle inequality gives d(x, x_i) ≥ d(x, M_p) − d(x_i, M_p). Since d(x_i, M_p) ≤ r_p by definition, d(x, x_i) ≥ d(x, M_p) − r_p. Therefore x_i cannot be the nearest neighbor to x if d(x, M_p) − r_p > B, i.e. if d(x, M_p) > B + r_p.

Rule 2: a sample x_i in S_p cannot be the nearest neighbor to x if

d(x, M_p) > B + d(x_i, M_p)

Proof: along the same lines as the previous proof, using the triangle inequality. This rule is applied only at the leaf nodes. The distances d(x_i, M_p) are precomputed for all the points in the leaf nodes; this requires only the additional storage of n real numbers.

Branch and Bound Search Algorithm
1. Set B = ∞, the current level L = 1, and the current node to the root.
2. Expansion: place all nodes that are immediate successors of the current node in the active set of nodes.
3. For each node in the active set, test Rule 1; remove the nodes that fail.
4. Pick the active node closest to the test point x and expand it using step 2. If you are at the final level, go to step 5.
5. Apply Rule 2 to all the points in the chosen leaf node: for points that fail, do not compute the distance to x; for points that pass, compute the distances and update B whenever a smaller distance is found.
6. Pick the closest point using the computed distances.
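The hedged sketch below implements the two elimination rules in a single-level version of this search: groups are formed by a crude equal-size split rather than the k-means tree, purely to keep the example short and self-contained, and all names are illustrative. Groups are visited in increasing order of d(x, M_p), Rule 1 prunes whole groups, and Rule 2 prunes individual points using the cached d(x_i, M_p).

```python
import numpy as np

def build_groups(X, n_groups=10):
    """Partition the samples into groups and precompute M_p, r_p and d(x_i, M_p).
    (An equal-size split along one coordinate is used here for brevity.)"""
    order = np.argsort(X[:, 0])
    groups = []
    for idx in np.array_split(order, n_groups):
        pts = X[idx]
        mean = pts.mean(axis=0)                            # M_p
        d_to_mean = np.linalg.norm(pts - mean, axis=1)
        groups.append({"idx": idx, "points": pts, "mean": mean,
                       "radius": d_to_mean.max(),          # r_p
                       "d_to_mean": d_to_mean})            # cached d(x_i, M_p)
    return groups

def bb_nearest(x, groups):
    """Single-level branch-and-bound nearest-neighbor search using Rules 1 and 2."""
    B, best = np.inf, None
    d_node = [np.linalg.norm(x - g["mean"]) for g in groups]
    for gi in np.argsort(d_node):                          # visit the closest group first
        g = groups[gi]
        if d_node[gi] > B + g["radius"]:                   # Rule 1: whole group cannot help
            continue
        for pt, idx, dm in zip(g["points"], g["idx"], g["d_to_mean"]):
            if d_node[gi] > B + dm:                        # Rule 2: skip this point
                continue
            d = np.linalg.norm(x - pt)                     # distance actually computed
            if d < B:
                B, best = d, idx
    return best, B

rng = np.random.default_rng(5)
X = rng.normal(size=(2000, 2))
q = np.array([0.3, -0.2])
i, dist = bb_nearest(q, build_groups(X))
assert i == np.argmin(np.linalg.norm(X - q, axis=1))       # matches brute force
print(i, round(dist, 4))
```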

Extension to k-nearest Neighbors The same algorithm may be applied with B representing the distance to the k-th closest point in the dataset. When a distance is actually calculated in Step 5, it is compared against distances to its current k-nearest neighbors, and the current k-nearest neighbor table is updated as necessary.