
Machine Learning Algorithms (IFT6266 A7), Prof. Douglas Eck, Université de Montréal. These slides follow closely the (English) course textbook Pattern Recognition and Machine Learning by Christopher Bishop.

A note: Methods. We (perhaps unwisely) skipped Bishop 2.1--2.4 until just before our graphical models lectures. Give it a look.

Nonparametric Methods. Parametric methods fit a small number of parameters to a data set and must make assumptions about the underlying distribution of the data. If those assumptions are wrong, the model is flawed, e.g. trying to fit a Gaussian model to multimodal data. Nonparametric methods fit a large number of parameters to a data set; the number of parameters scales with the number of data points, in general one parameter per data point. The main advantage is that we need to make only very weak assumptions about the underlying distribution of the data.

Histograms. Partition $x$ into bins of width $\Delta_i$ and let $n_i$ be the number of the $N$ observations falling in bin $i$. The density estimate is $p_i = \frac{n_i}{N\Delta_i}$, which satisfies $\int p(x)\,\mathrm{d}x = 1$. [Figure: histogram density estimates of the same data for different bin widths.]
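
As a concrete illustration of the formula above, here is a minimal sketch of the histogram density estimate in NumPy, assuming 1-D data and equally spaced bins (the function name and bin count are illustrative choices, not from the slides):

```python
import numpy as np

def histogram_density(x, num_bins=20):
    """Histogram density estimate p_i = n_i / (N * Delta_i) for 1-D data."""
    x = np.asarray(x)
    N = len(x)
    edges = np.linspace(x.min(), x.max(), num_bins + 1)
    counts, _ = np.histogram(x, bins=edges)   # n_i: points per bin
    widths = np.diff(edges)                   # Delta_i: bin widths
    return edges, counts / (N * widths)       # p_i

# The estimate integrates to one: sum_i p_i * Delta_i = 1.
x = np.random.default_rng(0).normal(size=1000)
edges, p = histogram_density(x)
print(np.sum(p * np.diff(edges)))             # ~1.0
```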

Histograms. Discontinuities at bin boundaries. Curse of dimensionality: if we divide each variable in a $D$-dimensional space into $M$ bins, we get $M^D$ bins (e.g. $M = 10$ bins per dimension with $D = 10$ variables already gives $10^{10}$ bins). Locality is important (the density is defined via evaluation of points in a local neighborhood). Smoothing is controlled by the bin width (equivalently, the bin count); we want neither too much nor too little smoothing. This relates to model complexity and regularization in parametric modeling.

Kernel density estimators. Assume a Euclidean space and observations drawn from an unknown density $p(\mathbf{x})$. Consider a small region $\mathcal{R}$ containing $\mathbf{x}$. The probability mass of the region is $P = \int_{\mathcal{R}} p(\mathbf{x})\,\mathrm{d}\mathbf{x}$. The number $K$ of the $N$ points that fall inside $\mathcal{R}$ is binomially distributed: $\mathrm{Bin}(K \mid N, P) = \frac{N!}{K!(N-K)!} P^K (1-P)^{N-K}$. From Chapter 2.1 on binary variables: $p(x = 1 \mid \mu) = \mu$, $\mathrm{Bin}(m \mid N, \mu) = \binom{N}{m} \mu^m (1-\mu)^{N-m}$ with $\binom{N}{m} = \frac{N!}{(N-m)!\,m!}$, $\mathbb{E}[m] = \sum_{m=0}^{N} m\,\mathrm{Bin}(m \mid N, \mu) = N\mu$, and $\mathrm{var}[m] = \sum_{m=0}^{N} (m - \mathbb{E}[m])^2\,\mathrm{Bin}(m \mid N, \mu) = N\mu(1-\mu)$.

Kernel density estimators. Thus we see that the mean fraction of points falling into the region is $\mathbb{E}[K/N] = P$, and the variance is $\mathrm{var}[K/N] = P(1-P)/N$. For large $N$ the distribution is sharply peaked around the mean, so $K \simeq NP$. If the region $\mathcal{R}$ is sufficiently small that the density $p(\mathbf{x})$ is roughly constant over it, then $P \simeq p(\mathbf{x})V$, where $V$ is the volume of $\mathcal{R}$. Combining these gives $p(\mathbf{x}) = \frac{K}{NV}$. This rests on contradictory assumptions: $\mathcal{R}$ must be small enough that the density is constant over the region, yet large enough that the number $K$ of points falling inside it gives a sharply peaked distribution. If we fix $K$ and determine $V$, we get K nearest neighbours; if we fix $V$ and determine $K$, we get the kernel density estimator.
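
A quick numerical check of the relation $p(\mathbf{x}) \simeq K/(NV)$, assuming 1-D standard-normal data and an arbitrarily chosen small interval as the region $\mathcal{R}$ (all names and constants here are illustrative):

```python
import numpy as np

rng = np.random.default_rng(0)
N = 100_000
data = rng.normal(size=N)                     # samples from the "unknown" density p(x)

x, half_width = 0.5, 0.05                     # region R = [x - 0.05, x + 0.05]
V = 2 * half_width                            # volume (length) of R
K = np.sum(np.abs(data - x) <= half_width)    # points falling inside R

estimate = K / (N * V)                        # p(x) ~= K / (N V)
truth = np.exp(-x**2 / 2) / np.sqrt(2 * np.pi)
print(f"K/(NV) = {estimate:.4f}, true density = {truth:.4f}")
```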

Kernel density estimators. We wish to determine the density around $\mathbf{x}$ from a region $\mathcal{R}$ centered on $\mathbf{x}$. We count points with the kernel $k(\mathbf{u}) = 1$ if $|u_i| \le 1/2$ for all $i = 1, \dots, D$, and $k(\mathbf{u}) = 0$ otherwise; $k(\mathbf{u})$ is a kernel function, here called a Parzen window. The total number of data points inside the cube of side $h$ centered on $\mathbf{x}$ is $K = \sum_{n=1}^{N} k\!\left(\frac{\mathbf{x} - \mathbf{x}_n}{h}\right)$. Substituting into $p(\mathbf{x}) = \frac{K}{NV}$, where the volume of a hypercube of side $h$ in $D$ dimensions is $V = h^D$, gives $p(\mathbf{x}) = \frac{1}{N} \sum_{n=1}^{N} \frac{1}{h^D}\, k\!\left(\frac{\mathbf{x} - \mathbf{x}_n}{h}\right)$.
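
A minimal sketch of the hypercube Parzen-window estimator defined above, assuming a single query point and a hand-picked bandwidth h (names and values are illustrative, not from the slides):

```python
import numpy as np

def parzen_hypercube(x, data, h):
    """p(x) = (1/N) sum_n (1/h^D) k((x - x_n)/h), with k(u) = 1 iff all |u_i| <= 1/2."""
    N, D = data.shape
    u = (x - data) / h
    K = np.sum(np.all(np.abs(u) <= 0.5, axis=1))   # points inside the cube of side h
    return K / (N * h**D)

data = np.random.default_rng(0).normal(size=(1000, 2))
print(parzen_hypercube(np.zeros(2), data, h=0.5))
```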

Parzen estimator. The model suffers from discontinuities at the hypercube boundaries. Substitute a Gaussian kernel (where $h$ is the standard deviation of the Gaussian components): $p(\mathbf{x}) = \frac{1}{N} \sum_{n=1}^{N} \frac{1}{(2\pi h^2)^{D/2}} \exp\!\left\{ -\frac{\|\mathbf{x} - \mathbf{x}_n\|^2}{2h^2} \right\}$. As expected, $h$ acts to smooth the estimate, with a tradeoff between noise sensitivity and oversmoothing. Any kernel can be used provided $k(\mathbf{u}) \ge 0$ and $\int k(\mathbf{u})\,\mathrm{d}\mathbf{u} = 1$. There is no computation for the training phase, but we must store the entire data set to evaluate the density. [Figure: Gaussian kernel density estimates for different values of h.]
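
The Gaussian-kernel version can be sketched the same way; again the bandwidth h below is an arbitrary illustrative choice:

```python
import numpy as np

def gaussian_parzen(x, data, h):
    """p(x) = (1/N) sum_n (2 pi h^2)^(-D/2) exp(-||x - x_n||^2 / (2 h^2))."""
    N, D = data.shape
    sq_dist = np.sum((x - data) ** 2, axis=1)
    return np.mean(np.exp(-sq_dist / (2 * h**2))) / (2 * np.pi * h**2) ** (D / 2)

data = np.random.default_rng(0).normal(size=(1000, 2))
print(gaussian_parzen(np.zeros(2), data, h=0.3))
```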

Nearest neighbor methods. One weakness of kernel density estimation is that the kernel width $h$ is fixed for all kernels. Instead, fix $K$ and determine $V$: consider a small sphere centered at $\mathbf{x}$ and allow the sphere to grow until it contains precisely $K$ points. The estimate of the density is given by the same formula, $p(\mathbf{x}) = \frac{K}{NV}$. Here $K$ governs the degree of smoothing. Compare the Parzen estimator (left) to K nearest neighbors (right). [Figure: kernel density estimates and K-nearest-neighbor estimates for several settings of h and K.]
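
A sketch of the corresponding K-nearest-neighbor density estimate: fix K, take V to be the volume of the smallest ball around x containing K points, and apply p(x) = K/(NV). The brute-force distance computation and the choice K = 30 are illustrative assumptions:

```python
import numpy as np
from math import gamma, pi

def knn_density(x, data, K):
    N, D = data.shape
    r = np.sort(np.linalg.norm(data - x, axis=1))[K - 1]    # radius enclosing K points
    V = (pi ** (D / 2) / gamma(D / 2 + 1)) * r**D            # volume of a D-ball of radius r
    return K / (N * V)

data = np.random.default_rng(0).normal(size=(2000, 2))
print(knn_density(np.zeros(2), data, K=30))
```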

Classification using KNN. We can do classification by applying KNN density estimation to each class and applying Bayes' theorem. To classify a new point $\mathbf{x}$, draw a sphere containing precisely $K$ points; suppose the sphere has volume $V$ and contains $K_k$ points from class $\mathcal{C}_k$, with $N_k$ points in class $\mathcal{C}_k$ overall. Then the $p(\mathbf{x}) = K/(NV)$ model provides the class-conditional density estimate $p(\mathbf{x} \mid \mathcal{C}_k) = \frac{K_k}{N_k V}$. We also obtain the unconditional density $p(\mathbf{x}) = \frac{K}{NV}$ and the class priors $p(\mathcal{C}_k) = \frac{N_k}{N}$. Applying Bayes' theorem yields $p(\mathcal{C}_k \mid \mathbf{x}) = \frac{p(\mathbf{x} \mid \mathcal{C}_k)\, p(\mathcal{C}_k)}{p(\mathbf{x})} = \frac{K_k}{K}$.
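
A minimal sketch of the resulting classifier: find the K nearest training points to x and report the posterior estimates $K_k/K$. The toy two-class data and all names below are illustrative, not from the slides:

```python
import numpy as np

def knn_classify(x, data, labels, K):
    nearest = np.argsort(np.linalg.norm(data - x, axis=1))[:K]   # K nearest training points
    classes, counts = np.unique(labels[nearest], return_counts=True)
    posteriors = counts / K                                      # K_k / K = p(C_k | x)
    return classes[np.argmax(posteriors)], dict(zip(classes, posteriors))

rng = np.random.default_rng(0)
data = np.vstack([rng.normal(-1.0, 1.0, size=(50, 2)),
                  rng.normal(+1.0, 1.0, size=(50, 2))])
labels = np.array([0] * 50 + [1] * 50)
print(knn_classify(np.array([0.2, 0.4]), data, labels, K=5))
```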

Classification using KNN. To minimize the misclassification rate, always choose the class with the largest $K_k / K$. For $K = 1$ this yields a decision boundary composed of hyperplanes that form the perpendicular bisectors of pairs of points from different classes. [Figure: decision boundaries in $(x_1, x_2)$ space; left: K = 3, right: K = 1.]

KNN Example. $K$ acts as a regularizer. Tree-based search can be used to find approximate nearest neighbors, as sketched below. [Figure: oil dataset; left: K = 1, middle: K = 3, right: K = 31.]
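
One way to realize the tree-based search mentioned above is a k-d tree; the sketch below uses SciPy's cKDTree, whose eps parameter permits approximate nearest-neighbor queries (the library choice and parameter values are assumptions, not from the slides):

```python
import numpy as np
from scipy.spatial import cKDTree

data = np.random.default_rng(0).normal(size=(10_000, 2))
tree = cKDTree(data)                                   # build the spatial index once
dists, idx = tree.query(np.zeros(2), k=31, eps=0.1)    # 31 (approximate) nearest neighbors
print(idx[:5], dists[:5])
```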