Computer Vision Pattern Recognition Concepts Part II - Luís F. Teixeira, MAP-I 2012/13
Last lecture
The Bayes classifier yields the optimal decision rule if the prior and class-conditional distributions are known. This is unlikely for most applications, so we can:
- attempt to estimate p(x|ω) from data, by means of density estimation techniques -> naïve Bayes and nearest-neighbour classifiers
- assume p(x|ω) follows a particular distribution (i.e. Normal) and estimate its parameters -> quadratic classifiers
- ignore the underlying distribution, and attempt to separate the data geometrically -> discriminative classifiers
k-Nearest neighbour classifier
Given the training data D = {x1, ..., xn} as a set of n labeled examples, the nearest neighbour classifier assigns a test point x the label associated with its closest neighbour (or k neighbours) in D. Closeness is defined using a distance function.
Distance functions
A general class of metrics for d-dimensional patterns is the Minkowski metric, also known as the Lp norm:
  Lp(x, y) = ( Σi=1..d |xi - yi|^p )^(1/p)
The Euclidean distance is the L2 norm:
  L2(x, y) = ( Σi=1..d |xi - yi|^2 )^(1/2)
The Manhattan or city-block distance is the L1 norm:
  L1(x, y) = Σi=1..d |xi - yi|
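The Lp family above can be sketched in a few lines; the function names here are illustrative, not from the lecture:

```python
# Minkowski (L_p) metric and its two common special cases.

def minkowski(x, y, p):
    """L_p(x, y) = (sum_i |x_i - y_i|^p)^(1/p)"""
    return sum(abs(a - b) ** p for a, b in zip(x, y)) ** (1.0 / p)

def euclidean(x, y):   # the L2 norm
    return minkowski(x, y, 2)

def manhattan(x, y):   # the L1 (city-block) norm
    return minkowski(x, y, 1)

# For the points (0, 0) and (3, 4): L2 = 5.0 and L1 = 7.0
```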
Distance functions
The Mahalanobis distance is based on the covariance of each feature with the class examples:
  DM(x) = sqrt( (x - µ)^T Σ^(-1) (x - µ) )
- Based on the assumption that distances in the direction of high variance are less important
- Highly dependent on a good estimate of the covariance
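A minimal NumPy sketch of this distance; in practice the mean µ and covariance Σ would be estimated from the class examples:

```python
import numpy as np

def mahalanobis(x, mu, cov):
    """D_M(x) = sqrt((x - mu)^T Sigma^{-1} (x - mu))"""
    d = np.asarray(x, float) - np.asarray(mu, float)
    return float(np.sqrt(d @ np.linalg.inv(cov) @ d))

# With the identity covariance it reduces to the Euclidean distance:
# mahalanobis([3, 4], [0, 0], np.eye(2)) -> 5.0
```

With an anisotropic covariance, displacement along the high-variance axis contributes less, which is exactly the assumption stated above.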
1-Nearest neighbour classifier
Assign the label of the nearest training data point to each test data point (black = negative, red = positive; from Duda et al.). A novel test example that is closest to a positive example from the training set is classified as positive. The figure shows the Voronoi partitioning of feature space for 2-category 2D data.
k-Nearest neighbour classifier
For a new point, find the k closest points from the training data; the labels of the k points vote to classify. With k = 5, if the query's 5 nearest neighbours consist of 3 negatives and 2 positives, we classify it as negative (black = negative, red = positive).
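The voting rule can be sketched directly; this is a minimal illustration over the Euclidean distance, with names of my own choosing:

```python
from collections import Counter
import math

def knn_classify(train, query, k):
    """train: list of (point, label) pairs. Returns the majority label
    among the k nearest neighbours of `query` (Euclidean distance)."""
    neighbours = sorted(train, key=lambda pl: math.dist(pl[0], query))[:k]
    votes = Counter(label for _, label in neighbours)
    return votes.most_common(1)[0][0]

# With 3 negatives and 2 positives among the 5 nearest, the query
# is classified negative; with k = 1 only the closest point decides.
```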
k-Nearest neighbour classifier
The main advantage of kNN is that it leads to a very simple approximation of the (optimal) Bayes classifier.
kNN as a classifier
Advantages:
- Simple to implement
- Flexible to feature / distance choices
- Naturally handles multi-class cases
- Can do well in practice with enough representative data
Disadvantages:
- Large search problem to find the nearest neighbours -> highly susceptible to the curse of dimensionality
- Storage of data
- Must have a meaningful distance function
Dimensionality reduction
The curse of dimensionality: the number of examples needed to accurately train a classifier grows exponentially with the dimensionality of the model.
- In theory, information provided by additional features should help improve the model's accuracy
- In reality, however, additional features increase the risk of overfitting, i.e., memorizing noise in the data rather than its underlying structure
- For a given sample size, there is a maximum number of features above which the classifier's performance degrades rather than improves
Dimensionality reduction
The curse of dimensionality can be limited by:
- incorporating prior knowledge (e.g., parametric models)
- enforcing smoothness in the target function (e.g., regularization)
- reducing the dimensionality:
  - creating a subset of new features by combinations of the existing features -> feature extraction
  - choosing a subset of all the features -> feature selection
Dimensionality reduction
In feature extraction methods, two types of criteria are commonly used:
- Signal representation: the goal of feature extraction is to accurately represent the samples in a lower-dimensional space (e.g. Principal Component Analysis, or PCA)
- Classification: the goal of feature extraction is to enhance the class-discriminatory information in the lower-dimensional space (e.g. Fisher's Linear Discriminant Analysis, or LDA)
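As a sketch of the signal-representation criterion, PCA can be written as projecting centred samples onto the top eigenvectors of the data covariance; this is an illustrative implementation, not a full library routine:

```python
import numpy as np

def pca_project(X, m):
    """X: (n_samples, d) array. Project onto the m principal
    components, i.e. the directions of maximum variance."""
    Xc = X - X.mean(axis=0)              # centre the data
    cov = np.cov(Xc, rowvar=False)
    vals, vecs = np.linalg.eigh(cov)     # eigenvalues in ascending order
    W = vecs[:, ::-1][:, :m]             # top-m eigenvectors
    return Xc @ W
```

The first projected coordinate carries the most variance, the second the next most, and so on.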
Discriminative classifiers
Decision boundary-based classifiers:
- Decision trees
- Neural networks
- Support vector machines
Discriminative vs Generative
Generative models estimate the distributions; discriminative models only define a decision boundary.
Discriminative vs Generative
Discriminative models differ from generative models in that they do not allow one to generate samples from the joint distribution of x and y. However, for tasks such as classification and regression that do not require the joint distribution, discriminative models generally yield superior performance. On the other hand, generative models are typically more flexible than discriminative models in expressing dependencies in complex learning tasks.
Decision trees
Decision trees are hierarchical decision systems in which conditions are sequentially tested until a class is accepted. The feature space is split into unique regions corresponding to the classes, in a sequential manner. The search for the region to which the feature vector will be assigned is achieved via a sequence of decisions along a path of nodes.
Decision trees
Decision trees classify a pattern through a sequence of questions, in which the next question depends on the answer to the current question.
Decision trees
The most popular schemes among decision trees are those that split the space into hyper-rectangles with sides parallel to the axes. The sequence of decisions is applied to individual features, in the form "is the feature xk < α?"
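A tree of such axis-parallel tests can be sketched as nested tuples; the tree below is hand-built for illustration (not learned from data):

```python
def classify(tree, x):
    """tree: either a class label (str), or a node (k, alpha, left, right)
    meaning 'if x[k] < alpha follow left, else follow right'."""
    while not isinstance(tree, str):
        k, alpha, left, right = tree
        tree = left if x[k] < alpha else right
    return tree

# Two tests "x0 < 0.5?" and "x1 < 0.5?" split the plane into
# three hyper-rectangles labelled A, B and C:
tree = (0, 0.5, "A", (1, 0.5, "B", "C"))
```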
Artificial neural networks
A neural network is a set of connected input/output units where each connection has a weight associated with it. During the learning phase, the network learns by adjusting the weights so as to be able to predict the correct class output of the input signals.
Artificial neural networks
Examples of ANN:
- Perceptron
- Multilayer Perceptron (MLP)
- Radial Basis Function (RBF)
- Self-Organizing Map (SOM, or Kohonen map)
Topologies:
- Feed-forward
- Recurrent
Perceptron
Defines a (hyper)plane that linearly separates the feature space. The inputs are real values and the output is +1/-1. Activation functions: step, linear, logistic sigmoid, Gaussian.
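A minimal perceptron with a step activation, trained with the classic update rule w <- w + η y x on misclassified points; this is an illustrative sketch, with the learning rate and epoch limit chosen arbitrarily:

```python
def train_perceptron(data, eta=1.0, epochs=100):
    """data: list of (x, y) pairs with y in {+1, -1}. Returns (w, b)."""
    d = len(data[0][0])
    w, b = [0.0] * d, 0.0
    for _ in range(epochs):
        errors = 0
        for x, y in data:
            # misclassified (or on the boundary): apply the update rule
            if y * (sum(wi * xi for wi, xi in zip(w, x)) + b) <= 0:
                w = [wi + eta * y * xi for wi, xi in zip(w, x)]
                b += eta * y
                errors += 1
        if errors == 0:          # converged: the data is separated
            break
    return w, b

def predict(w, b, x):            # step activation: output +1 or -1
    return 1 if sum(wi * xi for wi, xi in zip(w, x)) + b > 0 else -1
```

On linearly separable data (such as an AND-like problem) the loop converges; on non-separable data it simply stops after the epoch limit, which motivates the multilayer networks below.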
Multilayer perceptron
To handle more complex problems (than linearly separable ones) we need multiple layers. Each layer receives its inputs from the previous layer and forwards its outputs to the next layer. The result is a combination of linear boundaries which allows the separation of complex data. Weights are obtained through the back-propagation algorithm.
Non-linearly separable problems
RBF networks
RBF networks approximate functions using (radial) basis functions as the building blocks. Generally, the hidden unit function is Gaussian and the output layer is linear.
MLP vs RBF
Classification:
- MLPs separate classes via hyperplanes
- RBFs separate classes via hyperspheres
Learning:
- MLPs use distributed learning
- RBFs use localized learning
- RBFs train faster
Structure:
- MLPs have one or more hidden layers
- RBFs have only one layer
- RBFs require more hidden neurons => curse of dimensionality
ANN as a classifier
Advantages:
- High tolerance to noisy data
- Ability to classify untrained patterns
- Well-suited for continuous-valued inputs and outputs
- Successful on a wide array of real-world data
- Algorithms are inherently parallel
Disadvantages:
- Long training time
- Requires a number of parameters typically best determined empirically, e.g., the network topology or "structure"
- Poor interpretability: difficult to interpret the symbolic meaning behind the learned weights and the "hidden units" in the network
Support Vector Machine
The discriminant function is a hyperplane (a line in 2D) in feature space (similar to the Perceptron). In a nutshell:
- Map the data to a predetermined very high-dimensional space via a kernel function
- Find the hyperplane that maximizes the margin between the two classes
- If the data are not separable, find the hyperplane that maximizes the margin and minimizes (a weighted average of) the misclassifications
Linear classifiers
Linear functions in R2
Let w = (a, c) and x = (x, y). The line ax + cy + b = 0 can then be written as w · x + b = 0, or w^T x + b = 0.
The distance from a point (x0, y0) to the line is:
  D = (a x0 + c y0 + b) / sqrt(a^2 + c^2) = (w^T x0 + b) / ||w||
Linear classifiers
Find a linear function to separate the positive and negative examples:
  x positive: w · x + b >= 0
  x negative: w · x + b < 0
Which line is best?
Support Vector Machines
Discriminative classifier based on the optimal separating line (for the 2D case): maximize the margin between the positive and negative training examples.
Support Vector Machines
We want the line that maximizes the margin.
  x positive (y = 1):  w · x + b >= 1
  x negative (y = -1): w · x + b <= -1
For support vectors, w · x + b = ±1.
The distance between a point and the line is |w · x + b| / ||w||, so for support vectors this distance is 1 / ||w||. Therefore, the margin is 2 / ||w||.
Finding the maximum margin line
1. Maximize the margin 2 / ||w||
2. Correctly classify all training data points:
  xi positive (yi = 1):  w · xi + b >= 1
  xi negative (yi = -1): w · xi + b <= -1
Quadratic optimization problem: minimize (1/2) w^T w subject to yi (w · xi + b) >= 1.
Finding the maximum margin line
Solution:
  w = Σi αi yi xi  (the learned weights; the xi with αi > 0 are the support vectors)
  b = yi - w · xi  (for any support vector)
  w · x + b = Σi αi yi xi · x + b
Classification function:
  f(x) = sign(w · x + b) = sign(Σi αi yi xi · x + b)
If f(x) < 0, classify as negative; if f(x) > 0, classify as positive.
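The classification function only needs the support vectors and their multipliers; the sketch below evaluates it with hypothetical values (the αi, yi, xi and b are assumed, not solved for here):

```python
def svm_decision(support, x, b):
    """support: list of (alpha_i, y_i, x_i) triples.
    Returns f(x) = sign(sum_i alpha_i y_i x_i . x + b) as +1 or -1."""
    s = sum(a * y * sum(xi * xj for xi, xj in zip(sv, x))
            for a, y, sv in support) + b
    return 1 if s > 0 else -1

# Illustrative values: these give w = 0.5*(2,0) - 0.5*(0,0) = (1, 0),
# so with b = -1 the decision boundary is the vertical line x0 = 1.
support = [(0.5, +1, (2.0, 0.0)), (0.5, -1, (0.0, 0.0))]
b = -1.0
```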
Questions
- What if the features are not 2D?
- What if the data is not linearly separable?
- What if we have more than just two categories?
Questions
What if the features are not 2D? This generalizes to d dimensions: replace the line with a hyperplane.
Soft-margin SVMs
Introduce slack variables ξi and allow some instances to fall within the margin, but penalize them. The constraint becomes yi (w · xi + b) >= 1 - ξi, with ξi >= 0. The objective function penalizes misclassified instances within the margin. C trades off margin width and misclassifications; as C -> ∞, we get closer to the hard-margin solution.
Soft-margin vs Hard-margin SVMs
- Soft-margin always has a solution
- Soft-margin is more robust to outliers; smoother surfaces (in the non-linear case)
- Hard-margin does not require guessing the cost parameter (requires no parameters at all)
Non-linear SVMs
Datasets that are linearly separable with some noise work out great. But what are we going to do if the dataset is just too hard? How about mapping the data to a higher-dimensional space, e.g. lifting 1D points x to (x, x^2)?
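The lifting idea can be made concrete with a tiny example: 1D data with negatives clustered between positives admits no single threshold on x, but after mapping x -> (x, x^2) a horizontal line in the lifted space separates the classes. The points and the threshold 4 below are chosen by hand for illustration:

```python
def lift(x):
    """Map a 1D point to the 2D feature space (x, x^2)."""
    return (x, x * x)

pos = [-3.0, 3.0, 4.0]   # positives far from the origin
neg = [-1.0, 0.0, 1.0]   # negatives near the origin

# In the lifted space the line x^2 = 4 separates the two classes:
separable = (all(lift(x)[1] > 4 for x in pos)
             and all(lift(x)[1] < 4 for x in neg))
```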
Non-linear SVMs
General idea: the original input space can be mapped to some higher-dimensional feature space where the training set is separable: Φ: x -> φ(x)
The Kernel Trick
The linear classifier relies on the dot product between vectors: K(xi, xj) = xi^T xj. If every data point is mapped into a high-dimensional space via some transformation Φ: x -> φ(x), the dot product becomes K(xi, xj) = φ(xi)^T φ(xj). A kernel function is a similarity function that corresponds to an inner product in some expanded feature space.
Non-linear SVMs
The kernel trick: instead of explicitly computing the lifting transformation φ(x), define a kernel function K such that K(xi, xj) = φ(xi) · φ(xj). This gives a nonlinear decision boundary in the original feature space:
  Σi αi yi K(xi, x) + b
Examples of kernel functions
- Linear: K(xi, xj) = xi^T xj
- Gaussian RBF: K(xi, xj) = exp( -||xi - xj||^2 / (2σ^2) )
- Histogram intersection: K(xi, xj) = Σk min( xi(k), xj(k) )
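The three kernels written out as plain functions; the default σ is an arbitrary illustrative value:

```python
import math

def k_linear(xi, xj):
    return sum(a * b for a, b in zip(xi, xj))

def k_gaussian(xi, xj, sigma=1.0):
    sq = sum((a - b) ** 2 for a, b in zip(xi, xj))   # ||xi - xj||^2
    return math.exp(-sq / (2 * sigma ** 2))

def k_hist_intersection(xi, xj):
    # for histograms: overlap of the two distributions, bin by bin
    return sum(min(a, b) for a, b in zip(xi, xj))
```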
Multi-class SVMs
Achieve a multi-class classifier by combining a number of binary classifiers.
One vs. all:
- Training: learn an SVM for each class vs. the rest
- Testing: apply each SVM to the test example and assign to it the class of the SVM that returns the highest decision value
One vs. one:
- Training: learn an SVM for each pair of classes
- Testing: each learned SVM votes for a class to assign to the test example
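The one-vs.-all testing rule reduces to an arg-max over per-class decision values; the linear scorers below are hand-picked stand-ins for trained SVMs:

```python
def one_vs_all(scorers, x):
    """scorers: dict mapping class -> decision function.
    Assign the class whose scorer returns the highest decision value."""
    return max(scorers, key=lambda c: scorers[c](x))

# Hypothetical per-class decision functions (not trained models):
scorers = {
    "cat": lambda x: x[0] - 1.0,
    "dog": lambda x: x[1] - 1.0,
}
```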
SVM issues
Choice of kernel:
- a Gaussian or polynomial kernel is the default; if ineffective, more elaborate kernels are needed
- domain experts can give assistance in formulating appropriate similarity measures
Choice of kernel parameters:
- e.g. σ in the Gaussian kernel is the distance between the closest points with different classifications
- in the absence of reliable criteria, rely on the use of a validation set or cross-validation to set such parameters
Optimization criterion:
- hard margin vs. soft margin: a series of experiments in which parameters are tested
SVM as a classifier
Advantages:
- Many SVM packages available
- Kernel-based framework is very powerful and flexible
- Often a sparse set of support vectors: compact at test time
- Works very well in practice, even with very small training sample sizes
Disadvantages:
- No direct multi-class SVM; must combine two-class SVMs
- Can be tricky to select the best kernel function for a problem
- Computation and memory: during training, the matrix of kernel values must be computed for every pair of examples, and learning can take a very long time for large-scale problems
Training - general strategy
We try to simulate the real-world scenario. Test data is our future data. A validation set can be our test set: we use it to select our model. The whole aim is to estimate the model's true error on the sample data we have.
Validation set method
- Randomly split off some portion of your data and leave it aside as the validation set; the remaining data is the training data
- Learn a model from the training set
- Estimate your future performance with the test data
Test set method
It is simple; however:
- We waste some portion of the data
- If we do not have much data, we may be lucky or unlucky with our test data
With cross-validation we reuse the data.
LOOCV (Leave-one-out Cross-Validation)
Let us say we have N data points and k is the index over data points, k = 1..N. Let (xk, yk) be the k-th record.
- Temporarily remove (xk, yk) from the dataset
- Train on the remaining N-1 data points
- Test the error on (xk, yk)
Do this for each k = 1..N and report the mean error.
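The procedure is the same for any learner, so it can be sketched generically; `train_fn` and `error_fn` are placeholders standing in for an arbitrary training routine and error measure:

```python
def loocv(data, train_fn, error_fn):
    """Leave-one-out cross-validation.
    train_fn(data) -> model;  error_fn(model, point) -> error value."""
    errors = []
    for k in range(len(data)):
        held_out = data[k]
        rest = data[:k] + data[k + 1:]   # temporarily remove (x_k, y_k)
        model = train_fn(rest)           # train on the remaining N-1 points
        errors.append(error_fn(model, held_out))
    return sum(errors) / len(errors)     # mean error over the N runs
```

For example, with a trivial "majority label" learner on the labels [1, 1, 1, 0], the held-out 0 is always misclassified, giving a mean error of 1/4.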
LOOCV (Leave-one-out Cross-Validation)
Repeat the validation N times, once for each of the N data points; the validation data changes each time.
k-fold cross-validation
Test on one split and train on the remaining (k-1) splits. In 3-fold cross-validation there are 3 runs; in 5-fold, 5 runs; in 10-fold, 10 runs. The error is averaged over all runs.
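A generic sketch of the k runs, using an interleaved split for simplicity (a random shuffle would normally precede it); `train_fn` and `error_fn` are placeholders as in the LOOCV case:

```python
def k_fold_cv(data, k, train_fn, error_fn):
    """k-fold cross-validation: test on each fold in turn, train on
    the remaining (k-1) folds, and average the error over the k runs."""
    folds = [data[i::k] for i in range(k)]     # simple interleaved split
    run_errors = []
    for i in range(k):                         # one run per fold
        test = folds[i]
        train = [p for j, f in enumerate(folds) if j != i for p in f]
        model = train_fn(train)
        fold_err = sum(error_fn(model, p) for p in test) / len(test)
        run_errors.append(fold_err)
    return sum(run_errors) / k                 # error averaged over all runs
```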
References
- Kristen Grauman, Discriminative classifiers for image recognition, http://www.cs.utexas.edu/~grauman/courses/spring2011/slides/lecture22_classifiers.pdf
- Jaime S. Cardoso, Support Vector Machines, http://www.dcc.fc.up.pt/~mcoimbra/lectures/mapi_1112/CV_1112_6_SupportVectorMachines.pdf
- Andrew Moore, Support Vector Machines Tutorial, http://www.autonlab.org/tutorials/svm.html
- Christopher M. Bishop, Pattern Recognition and Machine Learning, Springer, 2006.
- Richard O. Duda, Peter E. Hart, David G. Stork, Pattern Classification, John Wiley & Sons, 2001.