Generative and discriminative classification techniques

Generative and discriminative classification techniques Machine Learning and Category Representation 2014-2015 Jakob Verbeek, November 28, 2014 Course website: http://lear.inrialpes.fr/~verbeek/mlcr.14.15

Classification Given training data labeled for two or more classes Determine a surface that separates those classes Use that surface to predict the class membership of new data

Classification examples in category-level recognition Image classification: for each of a set of labels, predict if it is relevant or not for a given image. For example: Person = yes, TV = yes, car = no,...

Classification examples in category-level recognition Category localization: predict bounding box coordinates. Classify each possible bounding box as containing the category or not. Report most confidently classified box.

Classification examples in category-level recognition Semantic segmentation: classify pixels to categories (multi-class) Impose spatial smoothness by Markov random field models.

Classification examples in category-level recognition Event recognition: classify video as belonging to a certain category or not. Example of cliff diving category video recognized by our system.

Classification examples in category-level recognition Temporal action localization: find all instances of an action in a movie. Enables fast-forwarding to actions of interest, here drinking.

Classification The goal is to predict, for a test data input, the corresponding class label. Data input x: e.g. an image, but it could be anything; its format may be a vector or something else. Class label y: takes one out of at least 2 discrete values, possibly more. In binary classification we often refer to one class as positive and the other as negative. Classifier: a function f(x) that assigns a class to x, or probabilities over the classes. Training data: pairs (x, y) of inputs x and corresponding class labels y. Learning a classifier: determine the function f(x) from some family of functions based on the available training data. The classifier partitions the input space into regions where data is assigned to a given class; the specific form of these boundaries depends on the family of classifiers used.

Generative classification: principle Model the class conditional distribution over data x for each class y: p(x|y). Data of the class can be sampled (generated) from this distribution. Estimate the a-priori probability p(y) that a class will appear. Infer the probability over classes using Bayes' rule of conditional probability: p(y|x) = p(y) p(x|y) / p(x). The unconditional distribution over x is obtained by marginalizing over the class y: p(x) = Σ_y p(y) p(x|y).
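To make Bayes' rule concrete, here is a minimal sketch (not from the slides; the priors and likelihood values are invented for illustration) computing the class posterior from a prior and class-conditional densities evaluated at one test point:

```python
# Hypothetical numbers: two classes with priors p(y) and
# class-conditional densities p(x|y) evaluated at one test point x.
priors = {"pos": 0.3, "neg": 0.7}
likelihoods = {"pos": 0.50, "neg": 0.05}   # p(x|y) at the test point

# Unnormalized posteriors p(y) * p(x|y), then divide by p(x) = sum over y.
joint = {y: priors[y] * likelihoods[y] for y in priors}
evidence = sum(joint.values())             # p(x)
posterior = {y: joint[y] / evidence for y in joint}
print(posterior)                           # {'pos': ~0.81, 'neg': ~0.19}
```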

Generative classification: practice In order to apply Bayes' rule, we need to estimate two distributions. A-priori class distribution: in some cases the class prior probabilities are known in advance. If the frequencies in the training data set are representative of the true class probabilities, then estimate the prior by these frequencies. More elaborate methods exist, but are not discussed here. Class conditional data distributions: select a class of density models. Parametric models, e.g. Gaussian, Bernoulli, ...; semi-parametric models, e.g. mixtures of Gaussians or Bernoullis; non-parametric models, e.g. histograms or the nearest-neighbor method; or more structured models taking problem knowledge into account. Estimate the parameters of the model using the data in the training set associated with that class.

Estimation of the class conditional model Given a set of n samples from a certain class, X = {x_1, ..., x_n}, and a family of distributions P = {p_θ(x); θ ∈ Θ}. Question: how do we quantify the fit of a certain model to the data, and how do we find the best model defined in this sense? Maximum a-posteriori (MAP) estimation: use Bayes' rule again as follows. Assume a prior distribution over the parameters of the model, p(θ). Then the posterior likelihood of the model given the data is p(θ|X) = p(X|θ) p(θ) / p(X). Find the most likely model given the observed data: θ = argmax_θ p(θ|X) = argmax_θ {ln p(θ) + ln p(X|θ)}. Maximum likelihood parameter estimation: assume the prior over parameters is uniform (for bounded parameter spaces), or near uniform so that its effect on the posterior over the parameters is negligible. In this case the MAP estimator is given by θ = argmax_θ p(X|θ). For i.i.d. samples: θ = argmax_θ Π_{i=1}^n p(x_i|θ) = argmax_θ Σ_{i=1}^n ln p(x_i|θ).
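A minimal sketch of maximum likelihood estimation for i.i.d. samples, assuming a univariate Gaussian model family, where the argmax has a closed form (the sample mean and biased sample variance); the data below is synthetic:

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(loc=2.0, scale=1.5, size=1000)   # i.i.d. samples from the "true" model

# For a Gaussian p_theta(x), argmax_theta sum_i ln p(x_i | theta) has a
# closed form: the sample mean and the (biased) sample variance.
mu_ml = X.mean()
var_ml = ((X - mu_ml) ** 2).mean()

def log_likelihood(X, mu, var):
    # sum_i ln N(x_i | mu, var)
    return np.sum(-0.5 * np.log(2 * np.pi * var) - (X - mu) ** 2 / (2 * var))

print(mu_ml, var_ml, log_likelihood(X, mu_ml, var_ml))
```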

Generative classification methods Generative probabilistic methods use Bayes' rule for prediction; the problem is reformulated as one of parameter/density estimation: p(y|x) = p(y) p(x|y) / p(x), with p(x) = Σ_y p(y) p(x|y). Adding new classes to the model is easy: existing class conditional models stay as they are; estimate p(x|new class) from training examples of the new class; re-estimate the class prior probabilities.

Example of generative classification Three-class example in 2D with a parametric model: a single Gaussian model p(x|y) per class and a uniform class prior, with posterior p(y|x) = p(y) p(x|y) / p(x). Exercise 1: how is this model related to the Gaussian mixture model we looked at last week for clustering? Exercise 2: characterize the surface of equal class probability when the covariance matrices are the same for all classes.
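A sketch of this kind of classifier under the stated assumptions (one Gaussian fitted per class by maximum likelihood, class priors from label frequencies, Bayes' rule for the posterior); the 2D three-class data is synthetic and the function names are ours:

```python
import numpy as np

def fit_gaussian(X):
    """ML estimates of mean and covariance for one class."""
    mu = X.mean(axis=0)
    diff = X - mu
    return mu, diff.T @ diff / len(X)

def log_gauss(X, mu, cov):
    """Row-wise log of N(x | mu, cov)."""
    d = len(mu)
    diff = X - mu
    maha = np.einsum('ij,jk,ik->i', diff, np.linalg.inv(cov), diff)
    return -0.5 * (d * np.log(2 * np.pi) + np.log(np.linalg.det(cov)) + maha)

def predict_proba(X, models, log_priors):
    """Bayes' rule: p(y|x) proportional to p(y) p(x|y)."""
    log_joint = np.stack([lp + log_gauss(X, mu, cov)
                          for (mu, cov), lp in zip(models, log_priors)], axis=1)
    log_joint -= log_joint.max(axis=1, keepdims=True)   # numerical stability
    post = np.exp(log_joint)
    return post / post.sum(axis=1, keepdims=True)

# Toy 2D data with three classes (synthetic, for illustration only).
rng = np.random.default_rng(1)
X = np.vstack([rng.normal(m, 0.7, size=(50, 2)) for m in ([0, 0], [3, 0], [0, 3])])
y = np.repeat([0, 1, 2], 50)

models = [fit_gaussian(X[y == c]) for c in range(3)]
log_priors = [np.log((y == c).mean()) for c in range(3)]
print(predict_proba(np.array([[2.5, 0.2]]), models, log_priors))
```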

Density estimation, e.g. for class-conditional models Any type of data distribution may be used, preferably one that models the data well, so that we can hope for accurate classification results. If we do not have a clear understanding of the data generating process, we can use a generic approach: a Gaussian distribution or another reasonable parametric model (estimation in closed form, or otherwise often relatively simple); mixtures of such models (estimation using the EM algorithm, not much more complicated than for a single component); or non-parametric models, which can adapt to any data distribution given enough data for estimation. Examples of the latter: (multi-dimensional) histograms, and nearest neighbors. Their estimation is often trivial, given a single smoothing parameter.

Histogram density estimation Suppose we have N data points; use a histogram with C cells. Consider the maximum likelihood estimator θ = argmax_θ Π_{i=1}^N p_θ(x_i) = argmax_θ Σ_{c=1}^C n_c ln θ_c, where n_c is the number of data points in cell c and θ_c the density value in that cell. Take into account the constraint that the density should integrate to one, Σ_{c=1}^C v_c θ_c = 1, i.e. θ_C = (1 - Σ_{k=1}^{C-1} v_k θ_k) / v_C, with v_c the volume of cell c. Exercise: derive the maximum likelihood estimator. Some observations: discontinuous density estimate; cell size determines smoothness; the number of cells scales exponentially with the dimension of the data.
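A 1D sketch of the histogram estimator, assuming equal-width cells and using the standard ML result θ_c = n_c / (N v_c) (the quantity the exercise asks you to derive); the data and cell count are arbitrary:

```python
import numpy as np

def histogram_density(X, n_cells):
    """ML histogram density: theta_c = n_c / (N * v_c), equal-width cells over the data range."""
    edges = np.linspace(X.min(), X.max(), n_cells + 1)
    v = edges[1] - edges[0]                      # cell volume (width in 1D)
    counts, _ = np.histogram(X, bins=edges)
    theta = counts / (len(X) * v)                # density value per cell
    assert np.isclose((theta * v).sum(), 1.0)    # estimate integrates to one
    return edges, theta

rng = np.random.default_rng(0)
X = rng.normal(size=500)
edges, theta = histogram_density(X, n_cells=20)
print(theta)
```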

The Naive Bayes model Histogram estimation, and other methods, scale poorly with the data dimension. A fine division of each dimension gives many empty bins; a rough division of each dimension gives a poor density model. Even with just one cut per dimension there are 2^D cells. The number of parameters can be made linear in the data dimensionality by assuming independence between the dimensions: p(x) = Π_{d=1}^D p(x^(d)). For example, for the histogram model we estimate a histogram per dimension: still C^D cells, but only D x C parameters to estimate instead of C^D. The independence assumption can be (very) unrealistic for high dimensional data, but classification performance may still be good using the derived p(y|x). Partial independence, e.g. using graphical models, relaxes this problem. The principle can be applied to estimation with any type of density estimate.

Example of a naïve Bayes model Hand-written digit classification. Input: binary 28x28 scanned digit images, collected in a 784-long bit string. Desired output: class label of the image. A generative model over 28 x 28 pixel images would have to cover 2^784 possible images. Instead, use an independent Bernoulli model for each class: p(x|y=c) = Π_d p(x_d|y=c), with one probability per pixel per class, p(x_d=1|y=c) = θ_cd and p(x_d=0|y=c) = 1 - θ_cd. The maximum likelihood estimator of θ_cd is the average value of pixel/bit d over the images of class c. Classify using Bayes' rule: p(y|x) = p(y) p(x|y) / p(x).
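A sketch of this Bernoulli naïve Bayes recipe on synthetic 784-bit vectors standing in for binarized digits; the small pseudo-count used to avoid zero probabilities is our addition, not part of the slide's plain ML estimator:

```python
import numpy as np

def fit_bernoulli_nb(X, y, n_classes, alpha=1.0):
    """theta[c, d] = p(x_d = 1 | y = c): per-pixel estimate (with pseudo-count alpha)."""
    N, D = X.shape
    theta = np.zeros((n_classes, D))
    log_prior = np.zeros(n_classes)
    for c in range(n_classes):
        Xc = X[y == c]
        theta[c] = (Xc.sum(axis=0) + alpha) / (len(Xc) + 2 * alpha)
        log_prior[c] = np.log(len(Xc) / N)
    return theta, log_prior

def predict(X, theta, log_prior):
    """Bayes' rule with the independence assumption: sum per-pixel log-Bernoulli terms."""
    log_lik = X @ np.log(theta).T + (1 - X) @ np.log(1 - theta).T   # shape (N, n_classes)
    return np.argmax(log_lik + log_prior, axis=1)

# Tiny synthetic stand-in for 28x28 binary digits (784-bit vectors).
rng = np.random.default_rng(0)
proto = rng.random((3, 784))                       # three made-up class "templates"
y = rng.integers(0, 3, size=300)
X = (rng.random((300, 784)) < proto[y]).astype(float)

theta, log_prior = fit_bernoulli_nb(X, y, n_classes=3)
print((predict(X, theta, log_prior) == y).mean())  # training accuracy
```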

k-nearest-neighbor density estimation: principle Instead of having fixed cells as in the histogram method, center the cell on the test sample for which we evaluate the density; fix the number of samples in the cell, and find the corresponding cell size. The probability to find a point in a region A centered on x_0 with volume v_A is P(x ∈ A) = ∫_A p(x) dx. A smooth density is approximately constant in a small region, and thus P(x ∈ A) = ∫_A p(x) dx ≈ ∫_A p(x_0) dx = p(x_0) v_A. Alternatively, estimate P from the fraction of training data in A: P(x ∈ A) ≈ k/N, with N total data points and k of them in the sphere A. Combine the above to obtain the estimate p(x_0) ≈ k / (N v_A). Note: these density estimates are not guaranteed to integrate to one!

k-nearest-neighbor density estimation: practice Procedure in practice: choose k; for a given x, compute the volume v which contains k samples; estimate the density as p(x) ≈ k / (N v). The volume of a sphere with radius r in d dimensions is v(r, d) = π^(d/2) r^d / Γ(d/2 + 1). What effect does k have? (Data sampled from a mixture of Gaussians is plotted in green.) Larger k gives a larger region and a smoother estimate; k plays a similar role to the cell size in histogram estimation.
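A sketch of the k-NN density estimate p(x) ≈ k/(N v), assuming Euclidean distance; v is the volume of the d-ball whose radius is the distance from x to its k-th nearest training point:

```python
import numpy as np
from math import gamma, pi

def ball_volume(r, d):
    """Volume of a d-dimensional ball of radius r: pi^(d/2) r^d / Gamma(d/2 + 1)."""
    return (pi ** (d / 2)) * (r ** d) / gamma(d / 2 + 1)

def knn_density(x0, X, k):
    """p(x0) ~ k / (N * v), with v the ball reaching the k-th nearest neighbor."""
    N, d = X.shape
    dist = np.sort(np.linalg.norm(X - x0, axis=1))
    r = dist[k - 1]                       # radius that captures k samples
    return k / (N * ball_volume(r, d))

rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 2))            # samples from a 2D standard Gaussian
print(knn_density(np.zeros(2), X, k=10))  # should be near 1/(2*pi) ~ 0.159
```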

K-nearest-neighbors for classification Use Bayes' rule with kNN density estimation for p(x|y). Find the sphere volume v that captures k data points around x, for the estimate p(x) = k / (N v). Use the same sphere for each class, for the estimates p(x|y=c) = k_c / (N_c v), where k_c of the k neighbors belong to class c and N_c is the number of training points of class c. Estimate the class prior probabilities as p(y=c) = N_c / N. The class posterior distribution is then the fraction of the k neighbors in class c: p(y=c|x) = p(y=c) p(x|y=c) / p(x) = (N_c/N)(k_c/(N_c v)) / (k/(N v)) = k_c / k.
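As the algebra above shows, the posterior reduces to counting class labels among the k nearest neighbors; a minimal sketch under the same assumptions (Euclidean distance, synthetic two-class data):

```python
import numpy as np

def knn_posterior(x0, X, y, n_classes, k):
    """p(y = c | x0) = k_c / k: fraction of the k nearest neighbors in class c."""
    nearest = np.argsort(np.linalg.norm(X - x0, axis=1))[:k]
    counts = np.bincount(y[nearest], minlength=n_classes)
    return counts / k

rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0, 1, (100, 2)), rng.normal(3, 1, (100, 2))])
y = np.repeat([0, 1], 100)
print(knn_posterior(np.array([2.0, 2.0]), X, y, n_classes=2, k=15))
```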

Smoothing effects for large values of k Apply Bayes' rule with kNN density estimation for p(x|y), with a little twist: find the sphere volume v that captures k data points for the estimate p(x) = k/(N v), and use the same sphere for each class for the estimates p(x|y=c) = k_c/(N_c v). [Figures: the example data set and the resulting classifications for k = 1, 5, 10, and 100.]

Summary of generative classification methods (Semi-)parametric models, e.g. p(x|y) is a Gaussian or a mixture of Gaussians. Pros: no need to store the training data, just the class conditional models. Cons: may fit the data poorly, and might therefore lead to poor classification results. Non-parametric models. Pros: flexibility, no assumptions on distribution shape, learning is trivial; kNN can be used for anything that comes with a distance. Cons of histograms: only practical for low dimensional data (dimension < 5 or so); application to high dimensional data leads to exponentially many, and mostly empty, cells; resort to naïve Bayes modeling in higher dimensional cases. Cons of k-nearest neighbors: need to store all training data (memory cost) and to compute nearest neighbors at test time (computational cost).