University of Cambridge Engineering Part IIB
Paper 4F10: Statistical Pattern Processing
Handout 11: Non-Parametric Techniques
Mark Gales, mjfg@eng.cam.ac.uk
Michaelmas 2015

Introduction

Various forms of classifier, such as Gaussian class-conditional distributions and logistic regression, have already been examined. These start by assuming the form of the decision boundary, or of the class-conditional PDFs, and then estimate the parameters from training data. An alternative approach is to use the training data itself to determine the form of the decision boundary or class-conditional PDFs. These approaches are called non-parametric. We have already seen a classifier that is based directly on the training examples: the support vector machine (Gaussian processes are another example). Gaussian mixtures are sometimes described as a semi-parametric approach because of their ability to approximate many different functional forms.

In this lecture, techniques for calculating class-conditional density estimates, and for building classifiers, without assuming a particular form for the density in advance will be examined. First we will look in more detail at the density estimation problem and then at classification. In particular:

- kernel estimation (Parzen windows);
- the k-nearest neighbour method;
- the nearest neighbour classifier.

By default these approaches are not sparse (like Gaussian processes). A simple scheme for making a nearest neighbour classifier sparser will be described.

Example Applications

- Application of Genetic Algorithm and K-Nearest Neighbour Method in Real World Medical Fraud Detection Problem
- Application of the K-Nearest-Neighbour Rule to the Classification of Banded Chromosomes
- Development and application of K-nearest neighbour weather generating model
- Application of k-nearest neighbor on feature projections classifier to text categorization
- Parzen-Window Based Normalized Mutual Information for Medical Image Registration
- Predicting student performance: an application of data mining methods with an educational web-based service

Predicting Student Performance

Predict the performance of students from on-line assessment results; here only the simple pass/fail (2-class) classification is considered. Features:

- total number of correct answers
- success at the first try
- number of attempts to get the answer
- time spent until correct
- time spent on the problem
- total time spent on problems
- participation in the communication mechanism

Results:

              Classifier   2-class (%)
  Tree        C5.0         80.3
              CART         81.5
  Non-Tree    1-NN         76.8
              K-NN         82.3
              Parzen       75.0
              MLP          79.5

Histograms

The simplest form of non-parametric density estimation is the histogram.

[Figures: histogram estimates of the same 1-D sample using 2, 25 and 100 bins.]

Histograms (in 1-D) require the bin size to be determined:

- too large a bin causes over-smoothing and loss of detail;
- too small a bin causes a noisy estimate.

The problem rapidly gets worse as the dimension increases: the number of bins, if each side is kept a constant length, rises exponentially with the dimension (the curse of dimensionality).
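
To make the bin-size trade-off concrete, here is a minimal sketch (illustrative data and names, not from the handout) that builds 1-D histogram density estimates with 2, 25 and 100 bins; the counts are normalised so that each estimate integrates to one.

```python
import numpy as np

# Illustrative 1-D sample; the data and names here are made up for the sketch.
rng = np.random.default_rng(0)
x = rng.normal(0.0, 0.3, size=200)

for n_bins in (2, 25, 100):
    counts, edges = np.histogram(x, bins=n_bins, range=(-1.0, 1.0))
    widths = np.diff(edges)
    # normalise counts so that the piecewise-constant estimate integrates to one
    p_hat = counts / (counts.sum() * widths)
    print(n_bins, "bins: peak density estimate =", float(p_hat.max()))
```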

Non-Parametric Density Estimation

The histogram can be generalised to other forms of region. The probability that a new vector $\mathbf{x}$, drawn from the unknown density $p(\mathbf{x})$, will fall into a region $\mathcal{R}$ is

$$P = \int_{\mathcal{R}} p(\mathbf{x}')\,d\mathbf{x}'$$

If $n$ samples are drawn independently from $p(\mathbf{x})$ then

$$P[k \text{ of these fall in } \mathcal{R}] = \binom{n}{k} P^k (1-P)^{n-k}$$

This is a binomial distribution with $E\{k\} = nP$, so $k/n$ can be used as an estimate of $P$. The variance of this estimate about the true value $P$ is

$$E\left\{\left(\frac{k}{n} - P\right)^2\right\} = \frac{1}{n^2} E\left\{(k - nP)^2\right\} = \frac{P(1-P)}{n}$$

so the estimate peaks sharply around the mean as the number of samples increases ($n \to \infty$).

If $p(\mathbf{x})$ is continuous and smooth in the region $\mathcal{R}$, the probability can be approximated as

$$P = \int_{\mathcal{R}} p(\mathbf{x}')\,d\mathbf{x}' \approx p(\mathbf{x}) V$$

where $V$ is the volume of $\mathcal{R}$ and $\mathbf{x}$ is some point lying inside it. Hence

$$p(\mathbf{x}) \approx \frac{k/n}{V}$$

Region Selection

How good is the estimate $p(\mathbf{x}) \approx \frac{k/n}{V}$? For this estimate to be reasonable:

- $\mathcal{R}$ must be large enough that the binomial distribution is sharply peaked;
- $\mathcal{R}$ must be small enough that approximating $p(\mathbf{x})$ as constant over the region is accurate.

There are two basic choices in determining the volume $V$ of the region $\mathcal{R}$:

- use a fixed volume. This is normally achieved by selecting a particular window function which is used for all points;
- allow the volume to vary. The volume then depends on the number of points in a particular region of space.

Having estimated the class-conditional PDF from the data of each class, and estimated the class priors, a generative model has been obtained, so classification can be performed using Bayes' decision rule, where

$$P(\omega_j|\mathbf{x}) = \frac{p(\mathbf{x}|\omega_j)\,P(\omega_j)}{\sum_{i} p(\mathbf{x}|\omega_i)\,P(\omega_i)}$$

Hypercube Regions

Let $\mathcal{R}$ be a hypercube of dimension $d$ and edge length $h$. The volume of this is

$$V = h^d$$

We need to count the number of points that lie within this region. The number of points may be specified by defining an appropriate kernel, or window, function. One appropriate form is

$$\phi(\mathbf{u}) = \begin{cases} 1 & |u_j| \le 0.5,\ j = 1,\dots,d \\ 0 & \text{otherwise} \end{cases}, \qquad \mathbf{u} = \frac{\mathbf{x} - \mathbf{x}_i}{h}$$

This window is a unit hypercube centred at the origin, so $\phi((\mathbf{x}-\mathbf{x}_i)/h)$ is one if the sample $\mathbf{x}_i$ lies within the hypercube of side $h$ centred at $\mathbf{x}$, and zero otherwise. The total number of samples contained within the hypercube can be expressed as

$$k = \sum_{i=1}^{n} \phi\!\left(\frac{\mathbf{x} - \mathbf{x}_i}{h}\right)$$

and the corresponding density estimate is

$$p(\mathbf{x}) = \frac{1}{n} \sum_{i=1}^{n} \frac{1}{h^d}\,\phi\!\left(\frac{\mathbf{x} - \mathbf{x}_i}{h}\right)$$

This is similar to the histogram except that the cell locations (centres) are defined by the data points rather than being fixed.
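
A minimal sketch of this hypercube estimate, assuming the training samples are rows of a NumPy array; the function and variable names are illustrative rather than anything defined in the handout.

```python
import numpy as np

def hypercube_parzen(x, data, h):
    """Density estimate (1/n) * sum_i (1/h^d) * phi((x - x_i)/h) with the
    hypercube window phi(u) = 1 if every |u_j| <= 0.5, else 0."""
    n, d = data.shape
    u = (x - data) / h                          # scaled offsets, shape (n, d)
    inside = np.all(np.abs(u) <= 0.5, axis=1)   # phi evaluated at every sample
    return inside.sum() / (n * h ** d)

# toy usage on illustrative 2-D data
rng = np.random.default_rng(1)
data = rng.normal(size=(500, 2))
print(hypercube_parzen(np.zeros(2), data, h=0.5))
```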

Parzen Windows

The use of hypercubes to estimate PDFs is an example of a general class called Parzen window estimates. It is possible to use any form of window function that satisfies

$$\phi(\mathbf{u}) \ge 0, \qquad \int \phi(\mathbf{u})\,d\mathbf{u} = 1$$

These constraints ensure that the final density estimate is a valid probability density. One often-used window function is the normal distribution

$$\phi(\mathbf{u}) = \frac{1}{(2\pi h^2)^{d/2}} \exp\left(-\frac{\|\mathbf{u}\|^2}{2h^2}\right)$$

The density estimate is then given by

$$p(\mathbf{x}) = \frac{1}{n} \sum_{i=1}^{n} \frac{1}{(2\pi h^2)^{d/2}} \exp\left(-\frac{\|\mathbf{x} - \mathbf{x}_i\|^2}{2h^2}\right) = \sum_{i=1}^{n} \alpha_i\, k(\mathbf{x}, \mathbf{x}_i)$$

where $\alpha_i = 1/n$ and $k(\mathbf{x}, \mathbf{x}_i) = \phi(\mathbf{x} - \mathbf{x}_i)$. The similarity to the decision boundary in SVMs and the prediction mean in Gaussian processes is clear.

The window width $h$ is a free parameter associated with this estimate and influences the nature of the density estimate (compare the length scale in GPs and the kernel width in SVMs). It is commonly made a function of the number of samples $n$; one form is $h_n = h/\sqrt{n}$.
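
The Gaussian Parzen estimate above can be sketched in a few lines; again the data and names are illustrative, and the loop over window widths simply shows how $h$ controls the smoothness of the estimate.

```python
import numpy as np

def gaussian_parzen(x, data, h):
    """Density estimate (1/n) * sum_i (2*pi*h^2)^(-d/2) * exp(-||x - x_i||^2 / (2 h^2))."""
    n, d = data.shape
    sq_dist = np.sum((data - x) ** 2, axis=1)
    kernels = np.exp(-sq_dist / (2.0 * h * h)) / (2.0 * np.pi * h * h) ** (d / 2.0)
    return kernels.mean()                        # each alpha_i = 1/n

# toy usage: the same (illustrative) sample with wide and narrow windows
rng = np.random.default_rng(2)
data = rng.normal(size=(200, 1))
for h in (1.0, 0.5, 0.1):
    print("h =", h, " p_hat(0) =", round(float(gaussian_parzen(np.zeros(1), data, h)), 3))
```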

Parzen Window Example

Below are examples of Parzen window estimates of a univariate Gaussian PDF. The window width is set using $h$, which is defined for each case. The values of the estimate for different values of $n$ and $h$ are shown below (from DHS).

[Figure (from DHS): Parzen window estimates for h = 1, 0.5 and 0.1, and increasing n (n = 1, 10, 100, ...).]

Parzen Window Classification

The value of $h$ that results from these forms of density estimation will also affect the decision boundaries in a classification task. In the left diagram a small value of $h$ is used, and a larger value in the right diagram (figures from DHS).

[Figures (from DHS): decision boundaries from Parzen window classifiers with a small and a large window width.]

There are a number of issues with this type of approach:

- all data points must be stored, and evaluation of the PDF may be slow (the estimate is not sparse);
- the window function and widths must be chosen;
- it can be shown that the window in fact gives a biased estimate of the true PDF (see examples paper).

An alternative approach is to adapt the width of the windows according to the data available.
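
As a rough illustration of how such density estimates feed into classification, the sketch below builds a Gaussian Parzen estimate per class and applies Bayes' rule from the Region Selection section; the class data, priors and names are all assumptions made for the example.

```python
import numpy as np

def parzen_classify(x, class_data, priors, h):
    """Estimate each class-conditional density with a Gaussian Parzen window,
    then combine with the class priors using Bayes' rule."""
    scores = []
    for data, prior in zip(class_data, priors):
        n, d = data.shape
        sq = np.sum((data - x) ** 2, axis=1)
        p_x_given_class = np.mean(np.exp(-sq / (2 * h * h))
                                  / (2 * np.pi * h * h) ** (d / 2))
        scores.append(p_x_given_class * prior)
    scores = np.array(scores)
    return scores / scores.sum()                 # posterior P(omega_j | x)

# toy usage with two illustrative classes; a smaller h gives a more ragged boundary
rng = np.random.default_rng(3)
c0 = rng.normal([0.0, 0.0], 0.5, size=(100, 2))
c1 = rng.normal([1.0, 1.0], 0.5, size=(100, 2))
print(parzen_classify(np.array([0.5, 0.5]), [c0, c1], [0.5, 0.5], h=0.2))
```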

k-Nearest Neighbours Density

Rather than fixing the volume, it can be made to vary according to the density of points in a particular region of space. This is the k-nearest neighbour approach. The basic procedure for estimating the density at a point $\mathbf{x}$ is:

1. Select the value of $k$.
2. Centre a cell around $\mathbf{x}$ and allow it to grow until it captures the $k$ nearest neighbours (figure from DHS); compute the volume of the cell, $V$.
3. The density estimate is given by ($n$ total training samples)

$$p(\mathbf{x}) = \frac{k/n}{V}$$
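
A minimal sketch of this k-NN density estimate, assuming a Euclidean distance and using the standard volume formula for a d-dimensional ball; the data and names are illustrative.

```python
import numpy as np
from math import gamma, pi

def knn_density(x, data, k):
    """Grow a hypersphere around x until it captures the k nearest samples,
    then return p_hat(x) = (k/n) / V, with V the volume of that sphere."""
    n, d = data.shape
    dists = np.sort(np.linalg.norm(data - x, axis=1))
    r = dists[k - 1]                                  # radius just reaching the k-th neighbour
    V = (pi ** (d / 2) / gamma(d / 2 + 1)) * r ** d   # volume of a d-dimensional ball
    return (k / n) / V

# toy usage on an illustrative 2-D sample
rng = np.random.default_rng(4)
data = rng.normal(size=(500, 2))
print(round(float(knn_density(np.zeros(2), data, k=10)), 3))
```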

Nearest Neighbour Classifier

Rather than using non-parametric techniques to estimate PDFs which may then be used, if required, for classification, non-parametric techniques may be used directly for classification. For the nearest neighbour classifier the space is partitioned into regions, each carrying a particular class label. This is sometimes referred to as the Voronoi tessellation and is illustrated below for 2-D data.

[Figure: Voronoi tessellation of 2-D data induced by the nearest neighbour rule.]

Note that in the above a Euclidean distance measure is assumed for deciding which point is nearest, and it may be necessary to pre-process the data to make this appropriate. The classifier is easy to implement and no training of a model is required.
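
A minimal sketch of the nearest neighbour rule with a Euclidean distance measure; the prototypes and labels are illustrative.

```python
import numpy as np

def nn_classify(x, prototypes, labels):
    """Nearest neighbour rule: no training, simply return the label of the
    closest stored prototype under the Euclidean distance."""
    dists = np.linalg.norm(prototypes - x, axis=1)
    return labels[np.argmin(dists)]

# toy usage with three illustrative prototypes
protos = np.array([[0.0, 0.0], [1.0, 1.0], [0.0, 1.0]])
labels = np.array([0, 1, 1])
print(nn_classify(np.array([0.2, 0.1]), protos, labels))   # closest prototype has label 0
```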

k-NN Classification

An extension to the NN classifier is to use k-NN for classification. Assume that there are $n_i$ points from class $\omega_i$ in the data set, so $\sum_i n_i = n$. Construct a hypersphere of volume $V$ around $\mathbf{x}$ so that $k$ samples are contained, of which $k_i$ are from class $\omega_i$. The class-conditional density and prior are given by

$$p(\mathbf{x}|\omega_i) = \frac{k_i/n_i}{V}, \qquad P(\omega_i) = \frac{n_i}{n}$$

The density estimate for two different values of $k$ for a simple 1-D example is shown below (figure from DHS).

[Figure (from DHS): k-NN density estimates $p(x)$ for a 1-D example with $k = 3$ and $k = 5$.]

The posterior probability is given by

$$P(\omega_j|\mathbf{x}) \approx \frac{p(\mathbf{x}|\omega_j)\,P(\omega_j)}{\sum_{i} p(\mathbf{x}|\omega_i)\,P(\omega_i)} = \frac{k_j}{k}$$
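
The $k_j/k$ posterior estimate can be sketched directly; the data, labels and function name below are illustrative.

```python
import numpy as np

def knn_posterior(x, data, labels, k, n_classes):
    """Of the k nearest samples to x, k_j belong to class omega_j,
    giving the posterior estimate P(omega_j | x) ~= k_j / k."""
    nearest = np.argsort(np.linalg.norm(data - x, axis=1))[:k]
    counts = np.bincount(labels[nearest], minlength=n_classes)   # k_j for each class
    return counts / k

# toy usage: two illustrative classes, k = 5
rng = np.random.default_rng(5)
data = np.vstack([rng.normal([0.0, 0.0], 0.5, size=(50, 2)),
                  rng.normal([1.0, 1.0], 0.5, size=(50, 2))])
labels = np.array([0] * 50 + [1] * 50)
print(knn_posterior(np.array([0.6, 0.6]), data, labels, k=5, n_classes=2))
```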

Computational Cost

Consider $n$ labelled training points in $d$-dimensional space. For a direct implementation of the NN classifier each distance calculation takes $O(d)$, and each of the $n$ stored points is compared to the test sample, giving a total cost of $O(nd)$. As $n$ gets large this becomes impractical, so techniques have been developed to reduce the load. The set of distances between training points could be pre-computed and stored, but this requires storing $n(n-1)/2$ distances. Alternatively:

- partial distances: the distance is computed on some subset $r$ of the $d$ dimensions, $r < d$. This partial distance is used to reduce the search; for example, if the partial distance is already greater than the distance to the current closest point, the search for that point is terminated (a sketch follows this list).
- search trees: structure the search so that at each level only a subset of the training samples is evaluated; this determines which branch of the tree is then searched. This process typically does not guarantee that the nearest neighbour is found, but it can dramatically reduce the computational load.
- editing: eliminate useless prototypes (samples) during training, for example by removing prototypes that are surrounded by training points of the same category.
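
A rough sketch of the partial-distance idea (illustrative names and data; real implementations usually combine it with a search tree): the squared distance over the first r dimensions is used to prune prototypes before the full distance is computed.

```python
import numpy as np

def nn_partial_distance(x, prototypes, r):
    """Nearest neighbour search with partial distances: compute the squared
    distance over the first r dimensions and skip a prototype as soon as that
    partial sum already exceeds the best full squared distance found so far."""
    best_idx = 0
    best_sq = float(np.sum((prototypes[0] - x) ** 2))
    for i in range(1, len(prototypes)):
        partial = float(np.sum((prototypes[i, :r] - x[:r]) ** 2))
        if partial > best_sq:                    # partial distance already too large
            continue                             # terminate the search for this point
        full = partial + float(np.sum((prototypes[i, r:] - x[r:]) ** 2))
        if full < best_sq:
            best_idx, best_sq = i, full
    return best_idx

# toy usage on illustrative 10-D data, using the first 3 dimensions for pruning
rng = np.random.default_rng(6)
protos = rng.normal(size=(1000, 10))
x = rng.normal(size=10)
print(nn_partial_distance(x, protos, r=3))
```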

Nearest Neighbour Editing

A simple algorithm for editing, given training data D, is:

  Initialise j = 0; construct the full set of neighbours for D
  do
    j = j + 1
    examine the neighbours of x_j
    if any neighbour is not from the same class as x_j, mark x_j
  until j = n
  Discard all points that are not marked.

This process leaves all the decision boundaries unchanged, hence the performance of the edited classifier will be the same as that using the original, complete set of points. [It is non-trivial to get the full set of neighbours in high-dimensional space!]

This algorithm does not guarantee that the minimal set of points is found. However, it can dramatically reduce the number of points. It is possible to combine all three schemes for reducing the cost: editing may be used to reduce the number of prototypes; these may then be placed in a search tree; finally, partial distances may be used in the evaluation.
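
A low-dimensional sketch of this editing step, using scipy.spatial.Delaunay since the Delaunay triangulation gives the Voronoi neighbour relation; the function name and data are illustrative, and as noted above the approach is not practical in high dimensions.

```python
import numpy as np
from scipy.spatial import Delaunay

def voronoi_edit(X, y):
    """Keep only prototypes that have at least one Voronoi neighbour of a
    different class (the 'marked' points); the 1-NN decision boundaries are
    unchanged. Voronoi neighbours are read off the Delaunay triangulation."""
    tri = Delaunay(X)
    neighbours = [set() for _ in range(len(X))]
    for simplex in tri.simplices:                # points sharing a simplex are neighbours
        for a in simplex:
            for b in simplex:
                if a != b:
                    neighbours[a].add(b)
    keep = [i for i, nb in enumerate(neighbours)
            if any(y[j] != y[i] for j in nb)]
    return X[keep], y[keep]

# toy usage with two illustrative 2-D classes
rng = np.random.default_rng(7)
X = np.vstack([rng.normal([0.0, 0.0], 0.4, size=(200, 2)),
               rng.normal([1.0, 1.0], 0.4, size=(200, 2))])
y = np.array([0] * 200 + [1] * 200)
Xe, ye = voronoi_edit(X, y)
print(len(X), "prototypes reduced to", len(Xe))
```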

Non-Parametric Bayesian Schemes (ref)

In recent years there has also been interest in schemes where, for example, the number of components in a Gaussian mixture model (GMM) tends to infinity:

$$p(\mathbf{x}|\boldsymbol{\theta}) = \sum_{m=1}^{M} c_m\, \mathcal{N}(\mathbf{x};\, \boldsymbol{\mu}_m, \boldsymbol{\Sigma}_m)$$

First, rather than using a single value of the parameters $\boldsymbol{\theta}$, a Bayesian approach can be adopted. Here, for an $M$-component GMM,

$$p(\mathbf{x}|\mathcal{D}, M) = \int p(\mathbf{x}|\boldsymbol{\theta})\, p(\boldsymbol{\theta}|\mathcal{D}, M)\, d\boldsymbol{\theta}$$

where $p(\boldsymbol{\theta}|\mathcal{D}, M)$ is a distribution over the GMM parameters:

- the component priors $c_1, \dots, c_M$ (a prior for each component);
- the component distributions $\{\boldsymbol{\mu}_1, \boldsymbol{\Sigma}_1\}, \dots, \{\boldsymbol{\mu}_M, \boldsymbol{\Sigma}_M\}$.

What happens as $M \to \infty$? This is an infinite GMM. The change to the likelihood is

$$p(\mathbf{x}|\mathcal{D}) = \int p(\mathbf{x}|\boldsymbol{\theta})\, p(\boldsymbol{\theta}|\mathcal{D})\, d\boldsymbol{\theta}$$

where $p(\boldsymbol{\theta}|\mathcal{D})$ will also specify the number of components.

Unfortunately it is not possible to obtain a parametric form for $p(\boldsymbol{\theta}|\mathcal{D})$, so how can the system be trained? It is possible to use a sample-based approach (as with the RBM):

$$p(\mathbf{x}|\mathcal{D}) = \int p(\mathbf{x}|\boldsymbol{\theta})\, p(\boldsymbol{\theta}|\mathcal{D})\, d\boldsymbol{\theta} \approx \frac{1}{J} \sum_{j=1}^{J} p(\mathbf{x}|\boldsymbol{\theta}^{(j)})$$

where $\boldsymbol{\theta}^{(j)}$ is a sample of possible GMM parameters drawn from $p(\boldsymbol{\theta}|\mathcal{D})$. This changes the problem from training the distribution to drawing samples from $p(\boldsymbol{\theta}|\mathcal{D})$, and this is possible!

It is possible to apply the same sort of approach to:

- mixtures of experts: the infinite mixture of experts;
- hidden Markov models: the infinite HMM.
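
A minimal sketch of the sample-based approximation above, assuming the parameter samples (here diagonal-covariance GMMs) have already been drawn by a sampler of $p(\boldsymbol{\theta}|\mathcal{D})$ that is not shown; all numbers and names are illustrative.

```python
import numpy as np

def gmm_pdf(x, weights, means, variances):
    """p(x | theta) for a diagonal-covariance GMM: sum_m c_m N(x; mu_m, Sigma_m)."""
    d = len(x)
    total = 0.0
    for c, mu, var in zip(weights, means, variances):   # var holds the diagonal of Sigma_m
        norm = (2 * np.pi) ** (-d / 2) * np.prod(var) ** (-0.5)
        total += c * norm * np.exp(-0.5 * np.sum((x - mu) ** 2 / var))
    return total

def predictive_density(x, theta_samples):
    """Monte Carlo approximation p(x | D) ~= (1/J) sum_j p(x | theta_j), where each
    theta_j = (weights, means, variances) is assumed to come from a sampler of
    p(theta | D) that is not shown here."""
    return np.mean([gmm_pdf(x, *theta) for theta in theta_samples])

# two illustrative (hypothetical) parameter samples for a 1-D, 2-component GMM
theta_samples = [
    (np.array([0.5, 0.5]), np.array([[0.0], [1.0]]), np.array([[0.2], [0.2]])),
    (np.array([0.6, 0.4]), np.array([[0.1], [0.9]]), np.array([[0.3], [0.1]])),
]
print(round(float(predictive_density(np.array([0.5]), theta_samples)), 3))
```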