Machine Learning Lecture 3: Probability Density Estimation II (Bastian Leibe, RWTH Aachen)


Machine Learning Lecture 3
Probability Density Estimation II
Bastian Leibe, RWTH Aachen
http://www.vision.rwth-aachen.de
leibe@vision.rwth-aachen.de
Many slides adapted from B. Schiele

Course Outline
- Fundamentals (2 weeks): Bayes Decision Theory, Probability Density Estimation
- Discriminative Approaches (5 weeks): Linear Discriminant Functions, Support Vector Machines, Ensemble Methods & Boosting, Randomized Trees, Forests & Ferns
- Generative Models (4 weeks): Bayesian Networks, Markov Random Fields

Topics of This Lecture
- Recap: Bayes Decision Theory
- Parametric Methods: recap of the Maximum Likelihood approach, Bayesian Learning
- Non-Parametric Methods: Histograms, Kernel density estimation, K-Nearest Neighbors, k-NN for classification, Bias-Variance tradeoff

Recap: Bayes Decision Theory
- Optimal decision rule: decide for C_1 if
  p(C_1 | x) > p(C_2 | x)
- This is equivalent to
  p(x | C_1) \, p(C_1) > p(x | C_2) \, p(C_2)
- Which is again equivalent to the likelihood-ratio test
  \frac{p(x | C_1)}{p(x | C_2)} > \frac{p(C_2)}{p(C_1)}
  where the right-hand side acts as the decision threshold.
- The decision rule partitions the input space into decision regions R_1, R_2, R_3, ...

Recap: Classifying with Loss Functions
- We can formalize the intuition that different decisions have different weights by introducing a loss matrix L_{kj}:
  L_{kj} = loss for decision C_j if the truth is C_k
- Example: cancer diagnosis (loss matrix shown on the slide).
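The likelihood-ratio test above is easy to turn into code. The sketch below is only an illustration, assuming two classes with hypothetical Gaussian class-conditional densities and made-up priors; none of these values come from the slides.

```python
import numpy as np
from scipy.stats import norm

# Hypothetical class-conditional densities p(x|C_1), p(x|C_2) and priors.
# All parameter values here are made up for illustration.
p_x_given_c1 = norm(loc=0.0, scale=1.0).pdf
p_x_given_c2 = norm(loc=2.0, scale=1.0).pdf
prior_c1, prior_c2 = 0.7, 0.3

def decide(x):
    """Likelihood-ratio test: decide C_1 if p(x|C_1)/p(x|C_2) > p(C_2)/p(C_1)."""
    likelihood_ratio = p_x_given_c1(x) / p_x_given_c2(x)
    threshold = prior_c2 / prior_c1
    return 1 if likelihood_ratio > threshold else 2

print(decide(0.5))  # falls on the C_1 side of the threshold
print(decide(1.8))  # falls on the C_2 side of the threshold
```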

Recap: Minimizing the Expected Loss
- The optimal solution is the one that minimizes the loss. But the loss function depends on the true class, which is unknown.
- Solution: minimize the expected loss. This can be done by choosing the decision regions R_j accordingly.
- Adapted decision rule: decide for C_1 if
  \frac{p(x | C_1)}{p(x | C_2)} > \frac{(L_{12} - L_{22})}{(L_{21} - L_{11})} \, \frac{p(C_2)}{p(C_1)}

Recap: Gaussian (or Normal) Distribution
- One-dimensional case (mean \mu, variance \sigma^2):
  N(x | \mu, \sigma^2) = \frac{1}{\sqrt{2\pi\sigma^2}} \exp\left\{ -\frac{(x - \mu)^2}{2\sigma^2} \right\}
- Multi-dimensional case (mean \mu, covariance \Sigma):
  N(x | \mu, \Sigma) = \frac{1}{(2\pi)^{D/2} |\Sigma|^{1/2}} \exp\left\{ -\frac{1}{2} (x - \mu)^T \Sigma^{-1} (x - \mu) \right\}

Recap: Maximum Likelihood Approach
- Computation of the likelihood
  - Single data point: p(x_n | \theta)
  - Assumption: all data points X = \{x_1, \ldots, x_N\} are independent:
    L(\theta) = p(X | \theta) = \prod_{n=1}^{N} p(x_n | \theta)
  - Log-likelihood:
    E(\theta) = -\ln L(\theta) = -\sum_{n=1}^{N} \ln p(x_n | \theta)
- Estimation of the parameters \theta (learning)
  - Maximize the likelihood (= minimize the negative log-likelihood): take the derivative and set it to zero:
    \frac{\partial}{\partial \theta} E(\theta) = -\sum_{n=1}^{N} \frac{\partial p(x_n | \theta) / \partial \theta}{p(x_n | \theta)} \stackrel{!}{=} 0

Topics of This Lecture
- Recap: Bayes Decision Theory
- Parametric Methods: recap of the Maximum Likelihood approach, Bayesian Learning
- Non-Parametric Methods: Histograms, Kernel density estimation, K-Nearest Neighbors, k-NN for classification, Bias-Variance tradeoff

Recap: Maximum Likelihood - Limitations
- Maximum Likelihood has several significant limitations.
- It systematically underestimates the variance of the distribution! E.g., consider the case N = 1, X = \{x_1\}: the maximum-likelihood estimate is \hat{\mu} = x_1 and \hat{\sigma} = 0.
- We say ML overfits to the observed data.
- We will still often use ML, but it is important to know about this effect.

Deeper Reason
- Maximum Likelihood is a Frequentist concept. In the Frequentist view, probabilities are the frequencies of random, repeatable events. These frequencies are fixed, but can be estimated more precisely when more data is available.
- This is in contrast to the Bayesian interpretation. In the Bayesian view, probabilities quantify the uncertainty about certain states or events. This uncertainty can be revised in the light of new evidence.
- Bayesians and Frequentists do not like each other too well.
(Slide adapted from Bernt Schiele)
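As a quick numerical illustration of the ML estimates and of the variance bias discussed above, here is a small sketch; the data are synthetic and serve only as an example.

```python
import numpy as np

rng = np.random.default_rng(0)
true_mu, true_sigma = 2.0, 3.0

for N in (1, 5, 50, 5000):
    X = rng.normal(true_mu, true_sigma, size=N)
    mu_ml = X.mean()                    # ML mean: (1/N) * sum_n x_n
    var_ml = ((X - mu_ml) ** 2).mean()  # ML variance (divides by N, biased low)
    var_unbiased = X.var(ddof=1) if N > 1 else float("nan")  # divides by N-1
    print(N, round(mu_ml, 3), round(var_ml, 3), round(var_unbiased, 3))

# For N = 1 the ML variance estimate is exactly 0, as noted on the slide:
# the estimate collapses onto the single observed data point.
```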

Bayesian vs. Frequentist View
- To see the difference, suppose we want to estimate the uncertainty whether the Arctic ice cap will have disappeared by the end of the century.
- This question makes no sense in a Frequentist view, since the event cannot be repeated numerous times.
- In the Bayesian view, we generally have a prior, e.g. from calculations of how fast the polar ice is melting. If we now get fresh evidence, e.g. from a new satellite, we may revise our opinion and update the uncertainty from the prior:
  Posterior \propto Likelihood \times Prior
- This generally allows us to get better uncertainty estimates for many situations.
- Main Frequentist criticism: the prior has to come from somewhere, and if it is wrong, the result will be worse.

Bayesian Approach to Parameter Learning
- Conceptual shift: Maximum Likelihood views the true parameter vector \theta as unknown, but fixed. In Bayesian learning, we consider \theta to be a random variable.
- This allows us to use knowledge about the parameters \theta, i.e., to use a prior for \theta.
- Training data then converts this prior distribution on \theta into a posterior probability density. The prior thus encodes the knowledge we have about the type of distribution we expect to see for \theta.
(Slide adapted from Bernt Schiele)

Bayesian Learning Approach
- Bayesian view: consider the parameter vector \theta as a random variable.
- When estimating the density from a dataset X, we compute
  p(x | X) = \int p(x, \theta | X) \, d\theta
  with p(x, \theta | X) = p(x | \theta, X) \, p(\theta | X) = p(x | \theta) \, p(\theta | X),
  where the last step uses the assumption that, given \theta, x no longer depends on X: the density is entirely determined by the parameter \theta (i.e., by the parametric form of the pdf). Hence
  p(x | X) = \int p(x | \theta) \, p(\theta | X) \, d\theta
- The posterior over the parameters follows from Bayes' theorem, with p(X) = \int L(\theta) \, p(\theta) \, d\theta:
  p(\theta | X) = \frac{p(X | \theta) \, p(\theta)}{p(X)} = \frac{L(\theta) \, p(\theta)}{\int L(\theta) \, p(\theta) \, d\theta}
- Inserting this above, we obtain
  p(x | X) = \int \frac{p(x | \theta) \, L(\theta) \, p(\theta)}{\int L(\theta) \, p(\theta) \, d\theta} \, d\theta
(Slide adapted from Bernt Schiele)

Bayesian Learning Approach: Discussion
  p(x | X) = \int \frac{p(x | \theta) \, L(\theta) \, p(\theta)}{\int L(\theta) \, p(\theta) \, d\theta} \, d\theta
- p(x | \theta): estimate for x based on the parametric form \theta.
- L(\theta): likelihood of the parametric form \theta given the data set X.
- p(\theta): prior for the parameters \theta.
- The denominator is a normalization: integrate over all possible values of \theta.
- If we now plug in a (suitable) prior p(\theta), we can estimate p(x | X) from the data set X.

Bayesian Density Estimation: Discussion
- In p(x | X) = \int p(x | \theta) \, p(\theta | X) \, d\theta, the probability p(\theta | X) makes the dependency of the estimate on the data explicit.
- If p(\theta | X) is very small everywhere, but large for one \hat{\theta}, then p(x | X) \approx p(x | \hat{\theta}). In this case, the estimate is determined entirely by \hat{\theta}.
- The more uncertain we are about \theta, the more we average over all parameter values.
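The integrals above can be made concrete with a brute-force numerical sketch. The example below approximates the posterior over a single parameter \theta (here, the mean of a Gaussian with known variance) on a grid and then forms the posterior predictive p(x|X) as a weighted average. All data, grid, and prior settings are made up for illustration and are not from the slides.

```python
import numpy as np
from scipy.stats import norm

# Synthetic data and a made-up Gaussian prior over the unknown mean theta.
rng = np.random.default_rng(1)
X = rng.normal(1.5, 1.0, size=20)          # observations; sigma = 1 assumed known
theta_grid = np.linspace(-5.0, 5.0, 2001)  # grid over the parameter
d_theta = theta_grid[1] - theta_grid[0]

prior = norm(0.0, 2.0).pdf(theta_grid)                       # p(theta)
log_lik = norm(theta_grid[:, None], 1.0).logpdf(X).sum(axis=1)
likelihood = np.exp(log_lik - log_lik.max())                 # L(theta), rescaled for stability

posterior = likelihood * prior
posterior /= (posterior * d_theta).sum()                     # p(theta|X), normalized

# Posterior predictive p(x|X) = \int p(x|theta) p(theta|X) dtheta at a query point.
x_query = 2.0
p_x_given_theta = norm(theta_grid, 1.0).pdf(x_query)
p_x_given_X = (p_x_given_theta * posterior * d_theta).sum()
print(p_x_given_X)
```

If the posterior is sharply peaked at one \hat{\theta}, the sum is dominated by p(x | \hat{\theta}); otherwise the prediction averages over many parameter values, exactly as discussed above.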

Bayesian Density Estimation
- Problem: in the general case, the integration over \theta is not possible analytically (or only possible stochastically).

Bayesian Learning Approach: Example
- Example where an analytical solution is possible: Normal distribution for the data, with \sigma^2 assumed known and fixed. Estimate the distribution of the mean:
  p(\mu | X) = \frac{p(X | \mu) \, p(\mu)}{p(X)}
- Prior: we assume a Gaussian prior over \mu, with mean \mu_0 and variance \sigma_0^2.
- Sample mean:
  \bar{x} = \frac{1}{N} \sum_{n=1}^{N} x_n
- Bayes estimate:
  \mu_N = \frac{\sigma^2 \mu_0 + N \sigma_0^2 \bar{x}}{\sigma^2 + N \sigma_0^2}, \qquad \frac{1}{\sigma_N^2} = \frac{1}{\sigma_0^2} + \frac{N}{\sigma^2}
- (The slide shows the resulting posterior p(\mu | X) for increasing N, with \mu_0 = 0.)
(Slide adapted from Bernt Schiele)

Summary: ML vs. Bayesian Learning
- Maximum Likelihood
  - Simple approach, often analytically possible.
  - Problem: the estimation is biased, tends to overfit to the data. Often needs some correction or regularization.
  - But: the approximation gets accurate for N \to \infty.
- Bayesian Learning
  - General approach, avoids the estimation bias through a prior.
  - Problems: need to choose a suitable prior (not always obvious); the integral over \theta is often no longer analytically feasible.
  - But: efficient stochastic sampling techniques are available.
- (In this lecture, we'll use both concepts wherever appropriate.)

Topics of This Lecture
- Recap: Bayes Decision Theory
- Parametric Methods: recap of the Maximum Likelihood approach, Bayesian Learning
- Non-Parametric Methods: Histograms, Kernel density estimation, K-Nearest Neighbors, k-NN for classification, Bias-Variance tradeoff

Non-Parametric Methods
- Non-parametric representations: often the functional form of the distribution is unknown.
- Estimate the probability density directly from the data:
  - Histograms
  - Kernel density estimation (Parzen window / Gaussian kernels)
  - k-Nearest-Neighbor

Histograms
- Basic idea: partition the data space into distinct bins with widths \Delta_i and count the number of observations, n_i, in each bin.
- Often, the same width is used for all bins, \Delta_i = \Delta.
- (The slide shows an example histogram.)
- This can be done, in principle, for any dimensionality D, but the required number of bins grows exponentially with D!
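Before turning to the bin-width discussion below, here is a small numerical check of the analytical Bayes estimate for the Gaussian mean given above. The data-generating parameters and prior settings are synthetic, chosen only to show how the posterior tightens as N grows.

```python
import numpy as np

def posterior_over_mean(X, sigma2, mu0, sigma0_sq):
    """Posterior p(mu|X) for Gaussian data with known variance sigma2 and
    Gaussian prior N(mu | mu0, sigma0_sq), using the slide's formulas."""
    N = len(X)
    x_bar = np.mean(X)
    mu_N = (sigma2 * mu0 + N * sigma0_sq * x_bar) / (sigma2 + N * sigma0_sq)
    sigma_N_sq = 1.0 / (1.0 / sigma0_sq + N / sigma2)
    return mu_N, sigma_N_sq

# Synthetic example: data drawn from N(0.8, 0.1), prior centred at mu0 = 0.
rng = np.random.default_rng(2)
for N in (1, 2, 10, 100):
    X = rng.normal(0.8, np.sqrt(0.1), size=N)
    mu_N, sigma_N_sq = posterior_over_mean(X, sigma2=0.1, mu0=0.0, sigma0_sq=0.1)
    print(N, round(mu_N, 3), round(sigma_N_sq, 4))  # posterior mean moves to the data, variance shrinks
```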

Histograms: Bin Width
- The bin width \Delta acts as a smoothing factor.
- (The slide shows three histograms of the same data: \Delta too small, not smooth enough; \Delta about OK; \Delta too large, too smooth.)

Summary: Histograms
- Properties
  - Very general. In the limit (N \to \infty), every probability density can be represented.
  - No need to store the data points once the histogram is computed.
  - Rather brute-force.
- Problems
  - High-dimensional feature spaces: a D-dimensional space with M bins per dimension will require M^D bins! This requires an exponentially growing number of data points ("curse of dimensionality").
  - Discontinuities at the bin edges.
  - Bin size? Too large: too much smoothing. Too small: too much noise.

Statistically Better-Founded Approach
- A data point x comes from the pdf p(x). The probability that x falls into a small region R is
  P = \int_R p(y) \, dy
- If R is sufficiently small, p(x) is roughly constant over R. With V the volume of R,
  P = \int_R p(y) \, dy \approx p(x) \, V
- If the number N of samples is sufficiently large, we can estimate P as
  P \approx \frac{K}{N} \quad \Rightarrow \quad p(x) \approx \frac{K}{N V}
- Two ways to use this estimate:
  - Kernel methods: fix the volume V and determine the number K of data points inside a fixed window.
  - K-Nearest Neighbor: fix K and determine the volume V.

Kernel Methods: Parzen Window
- Parzen window: hypercube of dimension D with edge length h. Kernel function:
  k(u) = \begin{cases} 1, & |u_i| \le \frac{1}{2}, \; i = 1, \ldots, D \\ 0, & \text{else} \end{cases}
- Number of data points inside the window and its volume:
  K = \sum_{n=1}^{N} k\!\left(\frac{x - x_n}{h}\right), \qquad V = h^D
- Probability density estimate:
  p(x) \approx \frac{K}{N V} = \frac{1}{N h^D} \sum_{n=1}^{N} k\!\left(\frac{x - x_n}{h}\right)
- Interpretations
  1. We place a kernel window k at location x and count how many data points fall inside it.
  2. We place a kernel window k around each data point x_n and sum up their influences at location x. This is a direct visualization of the density.
- Still, we have artificial discontinuities at the cube boundaries. We can obtain a smoother density model if we choose a smoother kernel function, e.g. a Gaussian.
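A minimal sketch of the Parzen-window estimate above, assuming a hypercube kernel; the data set and bandwidth are synthetic and only illustrate the formula.

```python
import numpy as np

def parzen_window_density(x, data, h):
    """Hypercube Parzen-window estimate p(x) ~ 1/(N h^D) * sum_n k((x - x_n)/h),
    with k(u) = 1 if all |u_i| <= 1/2, else 0."""
    data = np.atleast_2d(data)                  # shape (N, D)
    N, D = data.shape
    u = (np.atleast_1d(x) - data) / h           # scaled offsets, shape (N, D)
    inside = np.all(np.abs(u) <= 0.5, axis=1)   # k((x - x_n)/h) for each n
    K = inside.sum()
    return K / (N * h**D)

# Toy 1-D example with synthetic data.
rng = np.random.default_rng(3)
data = rng.normal(0.0, 1.0, size=(500, 1))
for x in (-2.0, 0.0, 2.0):
    print(x, parzen_window_density(x, data, h=0.5))
```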

Kernel Methods: Gaussian Kernel
- Gaussian kernel function:
  k(u) = \frac{1}{(2\pi h^2)^{1/2}} \exp\left\{ -\frac{u^2}{2h^2} \right\}
- With K = \sum_{n=1}^{N} k(x - x_n) and V = \int k(u) \, du = 1, the probability density estimate becomes
  p(x) \approx \frac{K}{N V} = \frac{1}{N} \sum_{n=1}^{N} \frac{1}{(2\pi h^2)^{D/2}} \exp\left\{ -\frac{\|x - x_n\|^2}{2h^2} \right\}
- The bandwidth h acts as a smoother. (The slide shows examples: h too small, not smooth enough; h about OK; h too large, too smooth.)

Kernel Methods: In General
- Any kernel k(u) such that \int k(u) \, du = 1 can be used. Then
  K = \sum_{n=1}^{N} k(x - x_n)
  and we get the probability density estimate
  p(x) \approx \frac{K}{N V} = \frac{1}{N} \sum_{n=1}^{N} k(x - x_n)
(Slide adapted from Bernt Schiele)

K-Nearest Neighbor Density Estimation
- Recall the two options for p(x) \approx K/(NV): kernel methods fix the volume V and determine K; K-Nearest Neighbor fixes K and increases the volume V until the K nearest data points are found.
- Nearest-neighbor density estimation: fix K, estimate V from the data. Consider a hypersphere centred on x and let it grow to a volume V^\star that includes K of the given N data points. Then
  p(x) \approx \frac{K}{N V^\star}
- K acts as a smoother. (The slide shows examples for different K: too small, not smooth enough; about OK; too large, too smooth.)
- Side note: strictly speaking, the model produced by K-NN is not a true density model, because the integral over all space diverges. E.g., consider K = 1 and a query exactly on a data point, x = x_j.
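The two estimators just described can be sketched in a few lines. This is an illustration under the slide's formulas, not a production implementation; the data, bandwidth, and K are made up, and the k-NN sphere volume is only handled for 1-D and 2-D here.

```python
import numpy as np

def gaussian_kde(x, data, h):
    """Gaussian-kernel density estimate following the slide's formula.
    data: array of shape (N, D); x: query point of shape (D,)."""
    data = np.atleast_2d(data)
    N, D = data.shape
    sq_dist = np.sum((data - np.atleast_1d(x)) ** 2, axis=1)
    return np.mean(np.exp(-sq_dist / (2 * h**2)) / (2 * np.pi * h**2) ** (D / 2))

def knn_density(x, data, K):
    """k-NN density estimate p(x) ~ K / (N * V*), where V* is the volume of the
    smallest sphere around x containing K data points (1-D and 2-D only here)."""
    data = np.atleast_2d(data)
    N, D = data.shape
    dist = np.sort(np.linalg.norm(data - np.atleast_1d(x), axis=1))
    r = dist[K - 1]                        # radius to the K-th nearest neighbour
    unit_ball = {1: 2.0, 2: np.pi}[D]      # volume of the unit ball in D dimensions
    return K / (N * unit_ball * r**D)

# Toy 1-D example with synthetic data.
rng = np.random.default_rng(4)
data = rng.normal(0.0, 1.0, size=(500, 1))
print(gaussian_kde(0.0, data, h=0.3), knn_density(0.0, data, K=10))
```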

Summary: Kernel and k-nn Density Estimation Properties Very general. In the limit (N!), every probability density can be represented. No computation involved in the training phase Simply storage of the training set Problems Requires storing and computing with the entire dataset. Computational cost linear in the number of data points. This can be improved, at the expense of some computation during training, by constructing efficient tree-based search structures. Kernel size / K in K-NN? Too large: too much smoothing Too small: too much noise 37 K-Nearest Neighbor Classification Bayesian Classification p(c j jx) = p(xjc j)p(c j ) p(x) Here we have K NV p(xjc j ) ¼ K j N j V p(c j ) ¼ N j N p(c j jx) ¼ K j N j NV N j V N K k-nearest Neighbor classification = K j K 38 K-Nearest Neighbors for Classification K-Nearest Neighbors for Classification Results on an example data set K = 3 K = 39 K acts as a smoothing parameter. Theoretical guarantee For N!, the error rate of the -NN classifier is never more than twice the optimal error (obtained from the true conditional class distributions). 40 Bias-Variance Tradeoff Probability density estimation Histograms: bin size? M too large: too smooth M too small: not smooth enough Kernel methods: kernel size? h too large: too smooth h too small: not smooth enough K-Nearest Neighbor: K? K too large: too smooth K too small: not smooth enough This is a general problem of many probability density estimation methods Including parametric methods and mixture models Too much bias Too much variance 4 Discussion The methods discussed so far are all simple and easy to apply. They are used in many practical applications. However Histograms scale poorly with increasing dimensionality. Only suitable for relatively low-dimensional data. Both k-nn and kernel density estimation require the entire data set to be stored. Too expensive if the data set is large. Simple parametric models are very restricted in what forms of distributions they can represent. Only suitable if the data has the same general form. We need density models that are efficient and flexible! Next lecture 42 7

References and Further Reading
- More information in Bishop's book:
  - Gaussian distribution and ML: Ch. 1.2.4 and 2.3.1-2.3.4
  - Bayesian Learning: Ch. 1.2.3 and 2.3.6
  - Nonparametric methods: Ch. 2.5
- Additional information can be found in Duda & Hart:
  - ML estimation: Ch. 3.2
  - Bayesian Learning: Ch. 3.3-3.5
  - Nonparametric methods: Ch. 4.1-4.5

Christopher M. Bishop, Pattern Recognition and Machine Learning, Springer, 2006.
R.O. Duda, P.E. Hart, D.G. Stork, Pattern Classification, 2nd Ed., Wiley-Interscience, 2000.