Computer Vision. Pattern Recognition Concepts, Part II. Luis F. Teixeira, MAP-i, 2012/13

Last lecture. The Bayes classifier yields the optimal decision rule if the prior and class-conditional distributions are known. This is unlikely for most applications, so we can: attempt to estimate p(x|ω_i) from data by means of density estimation techniques (Naïve Bayes and nearest-neighbour classifiers); assume p(x|ω_i) follows a particular distribution (i.e. Normal) and estimate its parameters (quadratic classifiers); or ignore the underlying distribution and attempt to separate the data geometrically (discriminative classifiers).

k-Nearest neighbour classifier. Given the training data D = {x_1,...,x_n} as a set of n labeled examples, the nearest neighbour classifier assigns to a test point x the label associated with its closest neighbour (or k neighbours) in D. Closeness is defined using a distance function.

Distance functions. A general class of metrics for d-dimensional patterns is the Minkowski metric, also known as the L_p norm: L_p(x, y) = (Σ_{i=1..d} |x_i − y_i|^p)^{1/p}. The Euclidean distance is the L_2 norm, L_2(x, y) = (Σ_{i=1..d} |x_i − y_i|^2)^{1/2}, and the Manhattan or city-block distance is the L_1 norm, L_1(x, y) = Σ_{i=1..d} |x_i − y_i|.

Distance functions. The Mahalanobis distance is based on the covariance of each feature with the class examples: D_M(x) = sqrt((x − μ)^T Σ^{-1} (x − μ)). It rests on the assumption that distances in the direction of high variance are less important, and it is highly dependent on a good estimate of the covariance.
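
These metrics translate directly into a few lines of code. The sketch below is an illustrative NumPy implementation (NumPy and the function names are my own choices, not part of the slides), estimating μ and Σ for the Mahalanobis distance from a set of class examples:

```python
import numpy as np

def minkowski(x, y, p):
    """L_p norm: (sum_i |x_i - y_i|^p)^(1/p)."""
    return np.sum(np.abs(x - y) ** p) ** (1.0 / p)

def mahalanobis(x, class_samples):
    """D_M(x) = sqrt((x - mu)^T Sigma^-1 (x - mu)), with mu and Sigma
    estimated from the examples of one class."""
    mu = class_samples.mean(axis=0)
    cov = np.cov(class_samples, rowvar=False)
    diff = x - mu
    return float(np.sqrt(diff @ np.linalg.inv(cov) @ diff))

x, y = np.array([1.0, 2.0]), np.array([4.0, 6.0])
print(minkowski(x, y, p=1))   # Manhattan (L1) distance: 7.0
print(minkowski(x, y, p=2))   # Euclidean (L2) distance: 5.0

rng = np.random.default_rng(0)
class_samples = rng.normal(size=(100, 2)) * [3.0, 0.5]  # high variance along the first axis
print(mahalanobis(np.array([3.0, 0.0]), class_samples))  # distance down-weighted along that axis
```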

1-Nearest neighbour classifier. Assign the label of the nearest training data point to each test data point. (Figure from Duda et al.: black = negative, red = positive; a novel test example that is closest to a positive example from the training set is classified as positive; Voronoi partitioning of the feature space for 2-category 2D data.)

k-Nearest neighbour classifier. For a new point, find the k closest points from the training data; the labels of the k points vote to classify. (Figure: k = 5, black = negative, red = positive; if the query lands here, the 5 nearest neighbours consist of 3 negatives and 2 positives, so we classify it as negative.)

k-Nearest neighbour classifier. The main advantage of kNN is that it leads to a very simple approximation of the (optimal) Bayes classifier.

kNN as a classifier. Advantages: simple to implement; flexible to feature / distance choices; naturally handles multi-class cases; can do well in practice with enough representative data. Disadvantages: large search problem to find the nearest neighbours, and highly susceptible to the curse of dimensionality; storage of all the data; must have a meaningful distance function.
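
As a concrete illustration of the voting rule above, here is a minimal kNN sketch (a toy NumPy implementation, not the course's code; in practice a library classifier such as scikit-learn's KNeighborsClassifier would be used, with an efficient nearest-neighbour search):

```python
import numpy as np
from collections import Counter

def knn_predict(X_train, y_train, x_test, k=5):
    """Classify x_test by a majority vote of its k nearest neighbours (Euclidean distance)."""
    dists = np.linalg.norm(X_train - x_test, axis=1)
    nearest = np.argsort(dists)[:k]
    votes = Counter(y_train[nearest])
    return votes.most_common(1)[0][0]

# Toy 2D data: class 0 around the origin, class 1 around (3, 3)
rng = np.random.default_rng(1)
X = np.vstack([rng.normal(0, 1, (20, 2)), rng.normal(3, 1, (20, 2))])
y = np.array([0] * 20 + [1] * 20)
print(knn_predict(X, y, np.array([2.5, 2.5]), k=5))  # expected: 1
```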

Dimensionality reduction. The curse of dimensionality: the number of examples needed to accurately train a classifier grows exponentially with the dimensionality of the model. In theory, the information provided by additional features should help improve the model's accuracy. In reality, however, additional features increase the risk of overfitting, i.e., memorizing noise in the data rather than its underlying structure. For a given sample size, there is a maximum number of features above which the classifier's performance degrades rather than improves.

Dimensionality reduction. The curse of dimensionality can be limited by: incorporating prior knowledge (e.g., parametric models); enforcing smoothness in the target function (e.g., regularization); reducing the dimensionality, either by creating a subset of new features from combinations of the existing features (feature extraction) or by choosing a subset of all the features (feature selection).

Dimensionality reduction. In feature extraction methods, two types of criteria are commonly used. Signal representation: the goal is to accurately represent the samples in a lower-dimensional space (e.g. Principal Component Analysis, or PCA). Classification: the goal is to enhance the class-discriminatory information in the lower-dimensional space (e.g. Fisher's Linear Discriminant Analysis, or LDA).
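
Both criteria are available in standard libraries; the hedged sketch below (scikit-learn and the Iris data are merely convenient stand-ins, not part of the slides) projects the same data to two dimensions with PCA and with LDA:

```python
from sklearn.datasets import load_iris
from sklearn.decomposition import PCA
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis

X, y = load_iris(return_X_y=True)

# PCA: directions of maximum variance, ignores the class labels
X_pca = PCA(n_components=2).fit_transform(X)

# LDA: directions that best separate the classes, uses the class labels
X_lda = LinearDiscriminantAnalysis(n_components=2).fit_transform(X, y)

print(X_pca.shape, X_lda.shape)  # (150, 2) (150, 2)
```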

Discriminative classifiers. Decision boundary-based classifiers: decision trees, neural networks, support vector machines.

Discriminative vs Generative. Generative models estimate the distributions; discriminative models only define a decision boundary.

Discriminative vs Generative. Discriminative models differ from generative models in that they do not allow one to generate samples from the joint distribution of x and y. However, for tasks such as classification and regression that do not require the joint distribution, discriminative models generally yield superior performance. On the other hand, generative models are typically more flexible than discriminative models in expressing dependencies in complex learning tasks.

Decision trees. Decision trees are hierarchical decision systems in which conditions are sequentially tested until a class is accepted. The feature space is split into unique regions corresponding to the classes, in a sequential manner. The search for the region to which a feature vector will be assigned is achieved via a sequence of decisions along a path of nodes.

Decision trees. Decision trees classify a pattern through a sequence of questions, in which the next question depends on the answer to the current question.

Decision trees. The most popular schemes among decision trees are those that split the space into hyper-rectangles with sides parallel to the axes. The sequence of decisions is applied to individual features, in the form "is the feature x_k < α?"
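
A CART-style tree makes exactly these axis-parallel "is x_k < α?" tests. The sketch below is an illustrative scikit-learn example (dataset and depth are arbitrary choices) that prints the learned sequence of threshold questions:

```python
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier, export_text

iris = load_iris()

# Each internal node tests a single feature against a threshold alpha
tree = DecisionTreeClassifier(max_depth=3, random_state=0).fit(iris.data, iris.target)
print(export_text(tree, feature_names=iris.feature_names))
```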

Artificial neural networks. A neural network is a set of connected input/output units where each connection has a weight associated with it. During the learning phase, the network learns by adjusting the weights so as to be able to predict the correct class output of the input signals.

Artificial neural networks. Examples of ANNs: Perceptron, Multilayer Perceptron (MLP), Radial Basis Function (RBF) network, Self-Organizing Map (SOM, or Kohonen map). Topologies: feed-forward, recurrent.

Perceptron. Defines a (hyper)plane that linearly separates the feature space. The inputs are real values and the output is +1 or −1. Activation functions: step, linear, logistic sigmoid, Gaussian.

Multilayer perceptron. To handle more complex problems (than linearly separable ones) we need multiple layers. Each layer receives its inputs from the previous layer and forwards its outputs to the next layer. The result is a combination of linear boundaries, which allows the separation of complex data. The weights are obtained through the back-propagation algorithm.
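
For illustration, the sketch below trains such a network with back-propagation on the classic XOR problem, which a single perceptron cannot solve; scikit-learn's MLPClassifier and the chosen layer size are assumptions made for this example, not part of the slides:

```python
import numpy as np
from sklearn.neural_network import MLPClassifier

# XOR: not linearly separable, needs at least one hidden layer
X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]], dtype=float)
y = np.array([0, 1, 1, 0])

mlp = MLPClassifier(hidden_layer_sizes=(8,), activation="logistic",
                    solver="lbfgs", max_iter=2000, random_state=0)
mlp.fit(X, y)
print(mlp.predict(X))  # usually [0 1 1 0]; a tiny network can occasionally get stuck in a local minimum
```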

Non-linearly separable problems.

RBF networks. RBF networks approximate functions using (radial) basis functions as the building blocks. Generally, the hidden unit function is Gaussian and the output layer is linear.

MLP vs RBF. Classification: MLPs separate classes via hyperplanes, while RBFs separate classes via hyperspheres. Learning: MLPs use distributed learning, RBFs use localized learning, and RBFs train faster. Structure: MLPs have one or more hidden layers, RBFs have only one; RBFs require more hidden neurons, so they suffer from the curse of dimensionality.

ANN as a classifier. Advantages: high tolerance to noisy data; ability to classify untrained patterns; well suited for continuous-valued inputs and outputs; successful on a wide array of real-world data; the algorithms are inherently parallel. Disadvantages: long training time; requires a number of parameters that are typically best determined empirically, e.g., the network topology or "structure"; poor interpretability, as it is difficult to interpret the symbolic meaning behind the learned weights and the "hidden units" in the network.

Support Vector Machine. The discriminant function is a hyperplane (a line in 2D) in feature space (similar to the Perceptron). In a nutshell: map the data to a predetermined, very high-dimensional space via a kernel function; find the hyperplane that maximizes the margin between the two classes; if the data are not separable, find the hyperplane that maximizes the margin and minimizes a (weighted average of the) misclassifications.

Linear classifiers.

Linear functions in R^2. Let w = (a, c) and x = (x, y). The line ax + cy + b = 0 can then be written as w·x + b = 0 (or w^T x + b = 0). The distance from a point (x_0, y_0) to the line is D = |a x_0 + c y_0 + b| / sqrt(a^2 + c^2) = |w^T x_0 + b| / ||w||.

Linear classifiers. Find a linear function that separates the positive and negative examples: x_i positive: x_i·w + b ≥ 0; x_i negative: x_i·w + b < 0. Which line is best?

Support Vector Machines. A discriminative classifier based on the optimal separating line (for the 2D case). Maximize the margin between the positive and negative training examples.

Support Vector Machines. We want the line that maximizes the margin: x_i positive (y_i = 1): x_i·w + b ≥ 1; x_i negative (y_i = −1): x_i·w + b ≤ −1. For the support vectors, x_i·w + b = ±1. The distance between a point and the line is |x_i·w + b| / ||w||, so for the support vectors this distance is 1/||w|| and the margin is therefore 2/||w||.

Finding the maximum margin line. 1. Maximize the margin 2/||w||. 2. Correctly classify all training data points: x_i positive (y_i = 1): x_i·w + b ≥ 1; x_i negative (y_i = −1): x_i·w + b ≤ −1. This is a quadratic optimization problem: minimize (1/2) w^T w subject to y_i(w·x_i + b) ≥ 1.

Finding the maximum margin line. Solution: w = Σ_i α_i y_i x_i (a weighted sum over the support vectors, with learned weights α_i), and b = y_i − w·x_i for any support vector. Then w·x + b = Σ_i α_i y_i x_i·x + b, and the classification function is f(x) = sign(w·x + b) = sign(Σ_i α_i y_i x_i·x + b). If f(x) < 0, classify as negative; if f(x) > 0, classify as positive.
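
To see these formulas at work, one can fit a linear SVM with a library and rebuild w and b from its support vectors. This is an illustrative scikit-learn sketch (the attributes dual_coef_, support_vectors_ and intercept_ are that library's way of exposing α_i y_i, x_i and b; the data is synthetic):

```python
import numpy as np
from sklearn.svm import SVC

rng = np.random.default_rng(0)
X = np.vstack([rng.normal(-2, 1, (30, 2)), rng.normal(2, 1, (30, 2))])
y = np.array([-1] * 30 + [1] * 30)

clf = SVC(kernel="linear", C=10.0).fit(X, y)

# w = sum_i alpha_i y_i x_i : dual_coef_ holds alpha_i * y_i for the support vectors
w = clf.dual_coef_ @ clf.support_vectors_
b = clf.intercept_

# f(x) = sign(w . x + b) agrees with the library's own prediction
x_new = np.array([[1.5, 1.0]])
print(np.sign(x_new @ w.T + b), clf.predict(x_new))
```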

Questions. What if the features are not 2D? What if the data is not linearly separable? What if we have more than just two categories?

Questions. What if the features are not 2D? The formulation generalizes to d dimensions: replace the line with a hyperplane. What if the data is not linearly separable? What if we have more than just two categories?

Questions. What if the features are not 2D? What if the data is not linearly separable? What if we have more than just two categories?

Soft-margin SVMs. Introduce slack variables ξ_i and allow some instances to fall within the margin, but penalize them. The constraint becomes y_i(w·x_i + b) ≥ 1 − ξ_i, with ξ_i ≥ 0. The objective function, (1/2)||w||^2 + C Σ_i ξ_i, penalizes misclassified instances and those within the margin; C trades off margin width against misclassifications. As C → ∞, we get closer to the hard-margin solution.

Soft-margin vs Hard-margin SVMs. Soft margin always has a solution, is more robust to outliers, and gives smoother surfaces (in the non-linear case). Hard margin does not require guessing a cost parameter (it requires no parameters at all).
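
In most SVM packages C is exactly this trade-off parameter. A small, hedged experiment (scikit-learn, arbitrary synthetic data and C values) shows that a small C tolerates many margin violations, i.e. many support vectors, while a large C approaches the hard-margin solution:

```python
import numpy as np
from sklearn.svm import SVC

rng = np.random.default_rng(0)
X = np.vstack([rng.normal(-1, 1.2, (50, 2)), rng.normal(1, 1.2, (50, 2))])
y = np.array([-1] * 50 + [1] * 50)  # overlapping classes, not separable

for C in (0.01, 1.0, 100.0):
    clf = SVC(kernel="linear", C=C).fit(X, y)
    print(f"C={C:>6}: {clf.n_support_.sum()} support vectors")
# Small C: wide margin, many violations; large C: closer to the hard-margin solution.
```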

Non-linear SVMs. Datasets that are linearly separable with some noise work out great. But what are we going to do if the dataset is just too hard? How about mapping the data to a higher-dimensional space, e.g. from x to (x, x^2)?

Non-linear SVMs. General idea: the original input space can be mapped to some higher-dimensional feature space where the training set is separable: Φ: x → φ(x).

The Kernel Trick. The linear classifier relies on the dot product between vectors, K(x_i, x_j) = x_i^T x_j. If every data point is mapped into a high-dimensional space via some transformation Φ: x → φ(x), the dot product becomes K(x_i, x_j) = φ(x_i)^T φ(x_j). A kernel function is a similarity function that corresponds to an inner product in some expanded feature space.

Non-linear SVMs. The kernel trick: instead of explicitly computing the lifting transformation φ(x), define a kernel function K such that K(x_i, x_j) = φ(x_i)·φ(x_j). This gives a non-linear decision boundary in the original feature space: f(x) = Σ_i α_i y_i K(x_i, x) + b.

Examples of kernel functions. Linear: K(x_i, x_j) = x_i^T x_j. Gaussian RBF: K(x_i, x_j) = exp(−||x_i − x_j||^2 / (2σ^2)). Histogram intersection: K(x_i, x_j) = Σ_k min(x_i(k), x_j(k)).
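
These kernels are straightforward to write as functions over matrices of samples. The sketch below is an illustrative NumPy/scikit-learn version (the callable-kernel interface of SVC and the toy "histogram" features are assumptions for the example, not something covered in the slides):

```python
import numpy as np
from sklearn.svm import SVC

def linear_kernel(X, Y):
    return X @ Y.T

def gaussian_rbf_kernel(X, Y, sigma=1.0):
    # exp(-||x_i - x_j||^2 / (2 sigma^2)), computed for all pairs
    sq = ((X[:, None, :] - Y[None, :, :]) ** 2).sum(-1)
    return np.exp(-sq / (2 * sigma ** 2))

def histogram_intersection_kernel(X, Y):
    # sum_k min(x_i(k), x_j(k)) for all pairs of histograms
    return np.minimum(X[:, None, :], Y[None, :, :]).sum(-1)

# Example: plug a custom kernel into an SVM on toy non-negative features
rng = np.random.default_rng(0)
X = rng.random((40, 8))
y = (X[:, 0] > 0.5).astype(int)
clf = SVC(kernel=histogram_intersection_kernel).fit(X, y)
print(clf.score(X, y))
```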

Questions. What if the features are not 2D? What if the data is not linearly separable? What if we have more than just two categories?

Multi-class SVMs. Achieve a multi-class classifier by combining a number of binary classifiers. One vs. all: training: learn an SVM for each class vs. the rest; testing: apply each SVM to the test example and assign it the class of the SVM that returns the highest decision value. One vs. one: training: learn an SVM for each pair of classes; testing: each learned SVM votes for a class to assign to the test example.
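
Both strategies are available as generic wrappers in scikit-learn (used here purely for illustration); for 3 classes, one-vs-all trains 3 binary SVMs and one-vs-one trains 3·(3−1)/2 = 3 pairwise SVMs:

```python
from sklearn.datasets import load_iris
from sklearn.multiclass import OneVsRestClassifier, OneVsOneClassifier
from sklearn.svm import SVC

X, y = load_iris(return_X_y=True)  # 3 classes

ovr = OneVsRestClassifier(SVC(kernel="linear")).fit(X, y)  # one SVM per class vs. the rest
ovo = OneVsOneClassifier(SVC(kernel="linear")).fit(X, y)   # one SVM per pair of classes

print(len(ovr.estimators_), len(ovo.estimators_))  # 3 and 3
print(ovr.predict(X[:5]), ovo.predict(X[:5]))
```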

SVM issues. Choice of kernel: the Gaussian or polynomial kernel is the default; if ineffective, more elaborate kernels are needed, and domain experts can give assistance in formulating appropriate similarity measures. Choice of kernel parameters: e.g. σ in the Gaussian kernel is typically set from the distance between the closest points with different classifications; in the absence of reliable criteria, rely on a validation set or cross-validation to set such parameters. Optimization criterion: hard margin vs. soft margin, usually settled by a series of experiments in which the parameters are tested.
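
The advice on kernel parameters is usually automated with a cross-validated grid search. The sketch below is a hedged example (scikit-learn, arbitrary grid values; not a recommendation of specific settings):

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import GridSearchCV
from sklearn.svm import SVC

X, y = load_iris(return_X_y=True)

# Search C (soft-margin trade-off) and gamma (RBF width, roughly 1/(2 sigma^2))
grid = GridSearchCV(SVC(kernel="rbf"),
                    param_grid={"C": [0.1, 1, 10, 100],
                                "gamma": [0.01, 0.1, 1]},
                    cv=5)
grid.fit(X, y)
print(grid.best_params_, grid.best_score_)
```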

SVM as a classifier. Advantages: many SVM packages are available; the kernel-based framework is very powerful and flexible; often a sparse set of support vectors, so it is compact at test time; works very well in practice, even with very small training sample sizes. Disadvantages: no direct multi-class SVM, so two-class SVMs must be combined; it can be tricky to select the best kernel function for a problem; computation and memory: during training, the matrix of kernel values must be computed for every pair of examples, and learning can take a very long time for large-scale problems.

Training: general strategy. We try to simulate the real-world scenario: the test data stands in for our future data. A validation set can act as our test set, and we use it to select our model. The whole aim is to estimate the model's true error from the sample data we have.

Validation set method. Randomly split off some portion of your data and leave it aside as the validation set; the remaining data is the training data. Learn a model from the training set, then estimate your future performance with the held-out set.
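
In code, this is a single call to a splitting utility; a minimal sketch (scikit-learn, with a 70/30 split and a linear SVM as arbitrary choices for illustration):

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC

X, y = load_iris(return_X_y=True)

# Randomly leave 30% aside as the validation set
X_train, X_val, y_train, y_val = train_test_split(X, y, test_size=0.3, random_state=0)

model = SVC(kernel="linear").fit(X_train, y_train)               # learn on the training data
print("estimated future accuracy:", model.score(X_val, y_val))   # evaluate on the held-out data
```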

Test set method. It is simple; however, we waste some portion of the data, and if we do not have much data we may be lucky or unlucky with our test data. With cross-validation we reuse the data.

LOOCV (Leave-one-out Cross-Validation). A single test data point: say we have N data points, indexed by k = 1..N, and let (x_k, y_k) be the k-th record. Temporarily remove (x_k, y_k) from the dataset, train on the remaining N−1 data points, and test the error on (x_k, y_k). Do this for each k = 1..N and report the mean error.

LOOCV (Leave-one-out Cross-Validation). The validation is repeated N times, once for each of the N data points; the validation data changes each time.

K-fold cross-validation. The data is split into k folds; in each run, test on one fold and train on the remaining (k − 1) folds. In 3-fold cross-validation there are 3 runs; in 5-fold cross-validation, 5 runs; in 10-fold cross-validation, 10 runs. The error is averaged over all runs.
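
Both LOOCV and k-fold are one-liners with standard cross-validation helpers; the sketch below (scikit-learn, with kNN as a placeholder model) reports the mean error for each scheme:

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import cross_val_score, KFold, LeaveOneOut
from sklearn.neighbors import KNeighborsClassifier

X, y = load_iris(return_X_y=True)
model = KNeighborsClassifier(n_neighbors=5)

# LOOCV: N runs, each holding out a single data point
loo_scores = cross_val_score(model, X, y, cv=LeaveOneOut())
print("LOOCV mean error:", 1 - loo_scores.mean())

# 5-fold CV: 5 runs, error averaged over all runs
kf_scores = cross_val_score(model, X, y, cv=KFold(n_splits=5, shuffle=True, random_state=0))
print("5-fold mean error:", 1 - kf_scores.mean())
```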

References.
Kristen Grauman, Discriminative classifiers for image recognition, http://www.cs.utexas.edu/~grauman/courses/spring2011/slides/lecture22_classifiers.pdf
Jaime S. Cardoso, Support Vector Machines, http://www.dcc.fc.up.pt/~mcoimbra/lectures/mapi_1112/CV_1112_6_SupportVectorMachines.pdf
Andrew Moore, Support Vector Machines Tutorial, http://www.autonlab.org/tutorials/svm.html
Christopher M. Bishop, Pattern Recognition and Machine Learning, Springer, 2006.
Richard O. Duda, Peter E. Hart, David G. Stork, Pattern Classification, John Wiley & Sons, 2001.