INF 4300 Repetition
Anne Solberg (anne@ifi.uio.no)
7..7

Classifiers covered
- Gaussian classifier, with $\Sigma_k = \sigma^2 I$, $\Sigma_k = \Sigma$, or $\Sigma_k$ arbitrary
- kNN classifier
- Support Vector Machines. Recommendation: linear or radial basis function kernels

Approaching a classification problem
- Choose features
- Consider preprocessing/normalization
- Choose classifier
- Estimate classifier parameters on training data
- Estimate hyperparameters on validation data. Alternative: cross-validation on the training data set
- Compute the accuracy on test data

Measures of classification accuracy
- Average error rate
- Confusion matrices
- True/false positives/negatives
- Precision/recall and sensitivity/specificity
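The measures listed above can be sketched for a two-class problem as follows. This is a minimal illustration, not course code: the function name `binary_metrics` and the toy labels are assumptions made here.

```python
def binary_metrics(y_true, y_pred, positive=1):
    """Count TP/FP/FN/TN and derive the usual rates for a two-class problem."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    tn = sum(1 for t, p in pairs if t != positive and p != positive)
    total = tp + fp + fn + tn
    return {
        # rows: true class (positive, negative); columns: predicted class
        "confusion": [[tp, fn], [fp, tn]],
        "error_rate": (fp + fn) / total,
        "precision": tp / (tp + fp) if tp + fp else 0.0,
        "recall": tp / (tp + fn) if tp + fn else 0.0,  # same as sensitivity
        "specificity": tn / (tn + fp) if tn + fp else 0.0,
    }


# Toy labels, invented for illustration only
y_true = [1, 1, 1, 0, 0, 0, 0, 1]
y_pred = [1, 0, 1, 0, 1, 0, 0, 1]
metrics = binary_metrics(y_true, y_pred)
```

Recall and sensitivity are the same quantity; specificity is the corresponding rate for the negative class.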

The curse of dimensionality
- In practice, the curse means that, for a given sample size, there is a maximum number of features one can add before the classifier starts to degrade.
- For a finite training sample size, the correct classification rate initially increases when adding new features, attains a maximum, and then begins to decrease.
- For a high dimensionality, we will need lots of training data to get the best performance. => samples / feature / class.
Figure: Correct classification rate as a function of feature dimensionality, for different amounts of training data. Equal prior probabilities of the two classes are assumed.

Use few, but good features
- To avoid the curse of dimensionality we must take care in finding a set of relatively few features.
- A good feature has high within-class homogeneity, and should ideally have large between-class separation.
- In practice, one feature is not enough to separate all classes, but a good feature should:
  - separate some of the classes well
  - isolate one class from the others
- If two features look very similar (or have high correlation), they are often redundant and we should use only one of them.
- Class separation can be studied by:
  - visual inspection of the feature image overlaid with the training mask
  - scatter plots
- Evaluating features as done by training can be difficult to do automatically, so manual interaction is normally required.

How do we beat the curse of dimensionality?
- Generate few, but informative features
  - Careful feature design given the application
- Try a simple classifier first
  - Do the features work? Do we need additional features?
  - Iterate between feature extraction and classification
- Reducing the dimensionality
  - Feature selection: select a subset of the original features
  - Feature transforms: compute a new subset of features based on a linear combination of all features (next week)
    - Example: Principal component transform. Unsupervised, finds the combination that maximizes the variance in the data.
- When you are confident that the features are good, consider a more advanced classifier.

Suboptimal feature selection
- Select the best single features based on some quality criterion, e.g., estimated correct classification rate.
  - A combination of the best single features will often imply correlated features and will therefore be suboptimal.
- Sequential forward selection implies that when a feature is selected or removed, this decision is final.
- Stepwise forward-backward selection overcomes this.
  - A special case of the add-a, remove-r algorithm.
- Improved into floating search by making the number of forward and backward search steps data dependent.
  - Adaptive floating search
  - Oscillating search

Distance measures used in feature selection
- In feature selection, each feature combination must be ranked based on a criterion function.
- Criterion functions can either be distances between classes, or the classification accuracy on a validation test set.
- If the criterion is based on e.g. the mean values/covariance matrices for the training data, distance computation is fast.
- Better performance, at the cost of higher computation time, is found when the classification accuracy on a validation data set (different from training and testing) is used as the criterion for ranking features.
  - This will be slower, as classification of the validation data needs to be done for every combination of features.

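Class separability measures
- How do we get an indication of the separability between two classes?
- Euclidean distance between class means: $\|\mu_r - \mu_s\|$
- Bhattacharyya distance
  - Can be defined for different distributions
  - For Gaussian data, it is
    $$B = \frac{1}{8}(\mu_r - \mu_s)^T \left[\frac{\Sigma_r + \Sigma_s}{2}\right]^{-1} (\mu_r - \mu_s) + \frac{1}{2}\ln\frac{\left|\frac{1}{2}(\Sigma_r + \Sigma_s)\right|}{\sqrt{|\Sigma_r|\,|\Sigma_s|}}$$
- Mahalanobis distance between two classes: $\Delta^2 = (\mu_1 - \mu_2)^T \Sigma^{-1} (\mu_1 - \mu_2)$, with $\Sigma$ a covariance matrix pooled over the two classes using the class sample counts $N_1$ and $N_2$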

Method 2: Sequential backward selection
- Select $l$ features out of $d$
- Example: 4 features $x_1, x_2, x_3, x_4$
- Choose a criterion $C$ and compute it for the vector $[x_1, x_2, x_3, x_4]^T$
- Eliminate one feature at a time by computing $[x_1, x_2, x_3]^T$, $[x_1, x_2, x_4]^T$, $[x_1, x_3, x_4]^T$ and $[x_2, x_3, x_4]^T$
- Select the best combination, say $[x_1, x_2, x_3]^T$.
- From the selected 3-dimensional feature vector, eliminate one more feature, evaluate the criterion for $[x_1, x_2]^T$, $[x_1, x_3]^T$, $[x_2, x_3]^T$, and select the one with the best value.
- Number of combinations searched: $1 + \frac{1}{2}\left((d+1)d - l(l+1)\right)$
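Sequential backward selection can be sketched in a few lines. This is an illustration under assumptions: the criterion is passed in as a function, and the per-feature scores below are made up so that the example runs. The evaluation counter can be checked against the count $1 + \frac{1}{2}((d+1)d - l(l+1))$.

```python
def sequential_backward_selection(features, l, criterion):
    """Drop one feature at a time, keeping the subset with the best criterion value."""
    selected = list(features)
    evaluated = 1  # the full feature vector is evaluated once
    while len(selected) > l:
        # Every subset obtained by removing exactly one feature
        candidates = [[f for f in selected if f != drop] for drop in selected]
        evaluated += len(candidates)
        selected = max(candidates, key=criterion)
    return selected, evaluated


# Hypothetical criterion: sum of invented per-feature quality scores
scores = {"x1": 0.9, "x2": 0.5, "x3": 0.7, "x4": 0.2}


def toy_criterion(subset):
    return sum(scores[f] for f in subset)


sbs_subset, sbs_evaluated = sequential_backward_selection(
    ["x1", "x2", "x3", "x4"], 2, toy_criterion
)
```

With $d = 4$ and $l = 2$ the search evaluates 1 + 4 + 3 = 8 subsets. A real criterion would be a class distance or a validation accuracy rather than a sum of scores.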

Method 3: Sequential forward selection
- Compute the criterion value for each feature. Select the feature with the best value, say $x_1$.
- Form all possible combinations of two features (the winner at the previous step and a new feature), e.g. $[x_1, x_2]^T$, $[x_1, x_3]^T$, $[x_1, x_4]^T$, etc. Compute the criterion and select the best one, say $[x_1, x_3]^T$.
- Continue with adding a new feature.
- Number of combinations searched: $ld - l(l-1)/2$
- Backward selection is faster if $l$ is closer to $d$ than to 1.
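A matching sketch of sequential forward selection, again with an invented criterion. Here the toy criterion penalizes the pair (x1, x2) to mimic two highly correlated features, so the second feature picked is not simply the second-best single feature.

```python
def sequential_forward_selection(features, l, criterion):
    """Grow the subset one feature at a time; earlier choices are never revisited."""
    selected, evaluated = [], 0
    while len(selected) < l:
        remaining = [f for f in features if f not in selected]
        candidates = [selected + [f] for f in remaining]
        evaluated += len(candidates)
        selected = max(candidates, key=criterion)
    return selected, evaluated


# Hypothetical scores; x1 and x2 are treated as redundant (assumption for the demo)
single_scores = {"x1": 0.9, "x2": 0.8, "x3": 0.5, "x4": 0.2}


def redundancy_aware_criterion(subset):
    value = sum(single_scores[f] for f in subset)
    if "x1" in subset and "x2" in subset:
        value -= 0.6  # penalty standing in for high correlation
    return value


sfs_subset, sfs_evaluated = sequential_forward_selection(
    ["x1", "x2", "x3", "x4"], 2, redundancy_aware_criterion
)
```

For $d = 4$ and $l = 2$ this evaluates 4 + 3 = 7 combinations, in agreement with $ld - l(l-1)/2$.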

Linear feature transforms

Principal component or Karhunen-Loeve transform
- Let $x$ be a feature vector.
- Features are often correlated, which might lead to redundancies.
- We now derive a transform which yields uncorrelated features.
- We seek a linear transform $y = A^T x$, and the $y_i$s should be uncorrelated.
- The $y_i$s are uncorrelated if $E[y_i y_j^T] = 0$, $i \neq j$.
- If we can express the information in $x$ using uncorrelated features, we might need fewer coefficients.

The weights $w$: visualization and intuition

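Variance of y (cont.)
- Assume the mean of $x$ is subtracted.
- The variance of $y = w^T x$ over the $n$ samples is then $\sigma_y^2 = \frac{1}{n}\sum_{i=1}^{n} y_i^2 = w^T R w$, with $R = \frac{1}{n}\sum_{i=1}^{n} x_i x_i^T$.
- $R$ is the sample covariance matrix / scatter matrix (called σ on some slides).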

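Variance and projection residuals
- Single sample $x_i$: its projection onto $w$, assuming $\|w\| = 1$, has length $y_i = w^T x_i$, so $\|x_i\|^2 = y_i^2 + (\text{residual}_i)^2$.
- Sum over all $n$ samples (not dimensions): $\sum_i \|x_i\|^2$ is fixed by the data, so a larger $\sum_i y_i^2$ (proportional to the variance $\sigma_y^2$) leaves a smaller sum of squared residuals.
- Note: max variance <=> min projection residuals!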

Criterion function
- Goal: Find the transform minimizing the representation error.
- We start with a single weight vector, $w$, giving us a single feature, $y$.
- Let $J(w) = w^T R w = \sigma_y^2$.
- Now, let's find $\max_w J(w)$.
- As we learned on the previous slide, maximizing this is equivalent to minimizing the representation error.

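Maximizing variance of y
- Lagrangian function for maximizing $\sigma_y^2$ with the constraint $w^T w = 1$:
  $$L(w, \lambda) = w^T R w - \lambda\,(w^T w - 1)$$
- Equating the derivative to zero:
  $$\frac{\partial L}{\partial w} = 2Rw - 2\lambda w = 0 \quad\Rightarrow\quad Rw = \lambda w$$
- Unfamiliar with Lagrangian multipliers? See http://biostat.mc.vanderbilt.edu/wiki/pub/Main/CourseBios362/LagrangeMultipliers-Bishop-PatternRecognitionMachineLearning.pdf
- The maximizing $w$ is an eigenvector of $R$! And $\sigma_y^2 = \lambda$! [Why?]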
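One way to answer the «Why?», sketched under the slide's own assumptions (the constraint $w^T w = 1$ and the stationarity condition $Rw = \lambda w$ that follows from the Lagrangian):

```latex
\sigma_y^2 = w^T R w = w^T(\lambda w) = \lambda\, w^T w = \lambda
```

Every eigenvector of $R$ is a stationary point, and its variance equals its eigenvalue, so the maximum is attained by the eigenvector with the largest eigenvalue.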

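Eigendecomposition of covariance matrices
$$R = A \Lambda A^T = \sum_{i=1}^{n} \lambda_i a_i a_i^T$$
- $R$: real-valued, symmetric, «n-dimensional» covariance matrix
- $\lambda_1$: eigenvalue (let's say the largest); $a_1$: eigenvector corresponding to $\lambda_1$
- $\lambda_n$: smallest eigenvalue
- $a_i^T a_j = 0$ for $i \neq j$
- Remember: $\lambda_i$ = variance of $x^T a_i$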

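$w_2, w_3, \ldots$ (II/III)
- What does uncorrelated mean? Zero covariance.
- Covariance of $y_1$ and $y_2$: $E[y_1 y_2] = w_1^T R w_2$
- We already have that $w_1 = a_1$.
- Since $a_1^T R = \lambda_1 a_1^T$, requiring $w_1^T R w_2 = a_1^T R w_2 = \lambda_1 a_1^T w_2 = 0$ means requiring $w_2^T a_1 = 0$.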

$w_2, w_3, \ldots$ (III/III)
- We want $\max_{w_2} w_2^T R w_2$, s.t. $\|w_2\| = 1$ and $w_2^T a_1 = 0$.
- We can simply remove $\lambda_1 a_1 a_1^T$ from $R$, creating $R_{next} = R - \lambda_1 a_1 a_1^T$, and again find $\max_{w_2} w_2^T R_{next} w_2$ s.t. $\|w_2\| = 1$.
- Studying the decomposition of $R$ (a few slides back), we see that the solution is the eigenvector corresponding to the second largest eigenvalue.
- Similarly, $w_3$, $w_4$ etc. are given by the following eigenvectors, sorted according to their eigenvalues.
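The "remove $\lambda_1 a_1 a_1^T$ and repeat" idea can be tried numerically. The sketch below uses plain power iteration as the eigenvector solver, which is a choice made here for self-containment and not something the slides prescribe; the 2x2 matrix is a toy example with eigenvalues 3 and 1.

```python
def mat_vec(R, w):
    """Matrix-vector product R w for a list-of-rows matrix."""
    return [sum(r * x for r, x in zip(row, w)) for row in R]


def leading_eigenpair(R, iterations=200):
    """Power iteration: (largest eigenvalue, unit-length eigenvector) of a symmetric R."""
    w = [1.0 / (i + 1) for i in range(len(R))]  # arbitrary nonzero start vector
    for _ in range(iterations):
        v = mat_vec(R, w)
        norm = sum(x * x for x in v) ** 0.5
        w = [x / norm for x in v]
    lam = sum(wi * vi for wi, vi in zip(w, mat_vec(R, w)))  # w^T R w with ||w|| = 1
    return lam, w


def deflate(R, lam, a):
    """R_next = R - lam * a a^T"""
    n = len(R)
    return [[R[i][j] - lam * a[i] * a[j] for j in range(n)] for i in range(n)]


R = [[2.0, 1.0], [1.0, 2.0]]  # toy covariance matrix
lam1, a1 = leading_eigenpair(R)
lam2, a2 = leading_eigenpair(deflate(R, lam1, a1))
```

The second eigenvector found this way is orthogonal to the first, so the two resulting features are uncorrelated. In practice a library eigendecomposition would be used instead of power iteration.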

$w_2, w_3, \ldots$ (III+/III)
- $\max_w w^T R w$, s.t. $\|w\| = 1$
- $w_1 = a_1$, $w_2 = a_2$, $w_3 = a_3$, etc.
- Eigenvectors sorted by their corresponding eigenvalues

Principal component transform (PCA)
- Place the m «principal» eigenvectors (the ones with the largest eigenvalues) along the columns of $A$.
- Then the transform $y = A^T x$ gives you the m first principal components.
- The m-dimensional $y$:
  - has uncorrelated elements
  - retains as much variance as possible
  - gives the best (in the mean-square sense) description of the original data (through the «image»/projection/reconstruction $Ay$)
- Note: The eigenvectors themselves can often give interesting information.
- PCA is also known as the Karhunen-Loeve transform.

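Introduction to linear SVM
- Discriminant function: $g(x) = w^T x + w_0$, with $x$ the input pattern, $w$ the weights/orientation, and $w_0$ the threshold/bias.
- Two-class problem, $y_i \in \{-1, 1\}$: class indicator for pattern $x_i$.
- Class prediction: $\hat{y} = -1$ if $g(x) < 0$, and $\hat{y} = 1$ if $g(x) > 0$.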

Separable case: Many candidates
- Obviously we want the decision boundary to separate the classes...
- ... however, there can be many such hyperplanes.
- Which of these two candidates would you prefer? Why?

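Distance to the decision boundary
- Let $z$ be the distance from $x$ to the decision boundary $g(x) = 0$, and let $B$ be the point on the boundary closest to $x$.
- Since $w/\|w\|$ is a unit vector in the $w$ direction, $B = x - z\,w/\|w\|$.
- Because $B$ lies on the decision boundary, $w^T B + w_0 = 0$.
- Solve this for $z$:
  $$w^T\left(x - z\frac{w}{\|w\|}\right) + w_0 = 0 \quad\Rightarrow\quad z = \frac{w^T x + w_0}{\|w\|} = \frac{g(x)}{\|w\|}$$
- For the training points closest to the boundary, this is called the margin of the classifier.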
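As a quick numeric check of $z = g(x)/\|w\|$, here is a small sketch. The function name and the example hyperplane are chosen here for illustration.

```python
def signed_distance(w, w0, x):
    """Signed distance from x to the hyperplane w^T x + w0 = 0."""
    g = sum(wi * xi for wi, xi in zip(w, x)) + w0
    norm = sum(wi * wi for wi in w) ** 0.5
    return g / norm


w, w0 = [3.0, 4.0], -5.0                        # ||w|| = 5
d_far = signed_distance(w, w0, [3.0, 4.0])      # g = 20
d_origin = signed_distance(w, w0, [0.0, 0.0])   # g = -5
```

The sign tells which side of the boundary the point is on, and the absolute value is the geometric distance.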

Hyperplanes and margins
- If both classes are equally probable, the distance from the hyperplane to the closest points in both classes should be equal. This is called the margin.
- The margin for «direction 1» is $2z_1$, and for «direction 2» it is $2z_2$.
- From the previous slide: the distance from a point to the separating hyperplane is $z = \frac{|g(x)|}{\|w\|}$.
- Goal: Find $w$ and $w_0$ maximizing the margin!
- How would you write a program finding this? Not easy unless we state the objective function cleverly!

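Towards a clever objective function
- We can scale $g(x)$ such that $g(x)$ will be equal to 1 or -1 at the closest points in the two classes. This does not change the margin.
- This is equivalent to:
  1. Have a margin of $\frac{1}{\|w\|} + \frac{1}{\|w\|} = \frac{2}{\|w\|}$
  2. Require that $w^T x + w_0 \geq 1$ for all $x \in \omega_1$, and $w^T x + w_0 \leq -1$ for all $x \in \omega_2$
- Remember our goal: Find $w$ and $w_0$ yielding the maximum margin.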

Maximum-margin objective function
- The hyperplane with maximum margin can be found by solving the optimization problem (w.r.t. $w$ and $w_0$):
  minimize $J(w) = \frac{1}{2}\|w\|^2$ subject to $y_i(w^T x_i + w_0) \geq 1$, $i = 1, 2, \ldots, N$
- The ½ factor is for later convenience.
- Note: We assume here fully class-separable data!
- Checkpoint: Do you understand the formulation? How is this criterion related to maximizing the margin?
- Note! We are somewhat done: Matlab (or similar software) can solve this now. But we seek more insight!
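To make the checkpoint concrete, the sketch below evaluates a few candidate hyperplanes on a toy separable data set (invented here): it reports $J(w) = \frac{1}{2}\|w\|^2$, the margin $2/\|w\|$, and whether all constraints $y_i(w^T x_i + w_0) \geq 1$ hold. It does not solve the optimization problem; it only shows how a smaller $J$ corresponds to a larger margin among feasible candidates.

```python
def dot(a, b):
    return sum(ai * bi for ai, bi in zip(a, b))


def evaluate_hyperplane(w, w0, X, y):
    """Return (J, margin, feasible) for one candidate (w, w0)."""
    J = 0.5 * dot(w, w)
    margin = 2.0 / dot(w, w) ** 0.5
    feasible = all(yi * (dot(w, xi) + w0) >= 1 for xi, yi in zip(X, y))
    return J, margin, feasible


# Toy data: class -1 at x1 = 0, class +1 at x1 = 2
X = [[0.0, 0.0], [0.0, 1.0], [2.0, 0.0], [2.0, 1.0]]
y = [-1, -1, 1, 1]

widest = evaluate_hyperplane([1.0, 0.0], -1.0, X, y)     # boundary at x1 = 1
rescaled = evaluate_hyperplane([2.0, 0.0], -2.0, X, y)   # same boundary, larger ||w||
too_small = evaluate_hyperplane([0.5, 0.0], -0.5, X, y)  # violates the constraints
```

The rescaled candidate describes the same boundary but has a larger $J$ and a smaller nominal margin, while shrinking $w$ further breaks the constraints. Minimizing $J$ subject to the constraints therefore picks the scaling where the closest points satisfy $|g(x)| = 1$.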

Support vectors
- The feature vectors $x_i$ with a corresponding Lagrange multiplier $\lambda_i > 0$ are called the support vectors for the problem.
- The classifier defined by this hyperplane is called a Support Vector Machine.
- Depending on $y_i$ (+1 or -1), the support vectors will thus lie on either of the two hyperplanes $w^T x + w_0 = \pm 1$.
- The support vectors are the points in the training set that are closest to the decision hyperplane.
- The optimization has a unique solution: only one hyperplane satisfies the conditions.
Figure: The support vectors for hyperplane 1 are the blue circles. The support vectors for hyperplane 2 are the red circles.

The nonseparable case
- If the two classes are nonseparable, a hyperplane satisfying the conditions $w^T x + w_0 = \pm 1$ cannot be found.
- The feature vectors in the training set are now either:
  1. Vectors that fall outside the band and are correctly classified.
  2. Vectors that are inside the band and are correctly classified. They satisfy $0 \leq y_i(w^T x_i + w_0) < 1$.
  3. Vectors that are misclassified, expressed as $y_i(w^T x_i + w_0) < 0$.
Figure: correctly classified and erroneously classified samples relative to the band.

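Cost function, nonseparable case
- The cost function to minimize is now
  $$J(w, w_0, \xi) = \frac{1}{2}\|w\|^2 + C\sum_{i=1}^{N} I(\xi_i), \qquad I(\xi_i) = \begin{cases} 1, & \xi_i > 0 \\ 0, & \xi_i = 0 \end{cases}$$
  where $\xi$ is the vector of parameters $\xi_i$.
- $C$ is a parameter that controls how much misclassified training samples are weighted.
- We skip the mathematics and present the alternative dual formulation:
  $$\max_{\lambda}\left(\sum_{i=1}^{N}\lambda_i - \frac{1}{2}\sum_{i,j}\lambda_i\lambda_j y_i y_j x_i^T x_j\right) \quad \text{subject to} \quad \sum_{i=1}^{N}\lambda_i y_i = 0 \ \text{ and } \ 0 \leq \lambda_i \leq C \ \forall i$$
- All points between the two hyperplanes ($\xi_i > 0$) can be shown to have $\lambda_i = C$.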

SVMs: The nonlinear case (intro)
- The training samples are l-dimensional vectors; we have until now tried to find a linear separation in this l-dimensional feature space.
- This seems quite limiting.
- What if we increase the dimensionality (map our samples to a higher dimensional space) before applying our SVM?
- Perhaps we can find a better linear decision boundary in that space? Even if the feature vectors are not linearly separable in the input space, they might be (close to) separable in a higher dimensional space.

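SVMs and kernels
- Note that in both the optimization problem and the evaluation function, $g(x)$, the samples come into play as inner products only:
  $$g(x) = w^T x + w_0 = \sum_{i=1}^{N_s}\lambda_i y_i x_i^T x + w_0$$
  $$\max_{\lambda}\left(\sum_{i=1}^{N}\lambda_i - \frac{1}{2}\sum_{i,j}\lambda_i\lambda_j y_i y_j x_i^T x_j\right) \quad \text{s.t.} \quad \sum_{i=1}^{N}\lambda_i y_i = 0 \ \text{ and } \ 0 \leq \lambda_i \leq C \ \forall i$$
- If we have a function evaluating inner products, $K(x_i, x_j)$ (called «kernel»), we can ignore the samples themselves.
- Let's say we have $K(x_i, x_j)$ evaluating inner products in a higher dimensional space: no need to do the mapping of our samples explicitly!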

Useful kernels for classification
- Polynomial kernels: $K(x, z) = (x^T z + 1)^q$, $q > 0$
- Radial basis function kernels (very commonly used!): $K(x, z) = \exp\left(-\frac{\|x - z\|^2}{\sigma^2}\right)$
  - Note that we need to set the parameter $\sigma$. The «support» of each point is controlled by $\sigma$.
- Hyperbolic tangent kernels (often with $\beta = 2$ and $\gamma = 1$): $K(x, z) = \tanh(\beta x^T z + \gamma)$
- The inner product is related to the similarity of the two samples.
- The kernel inputs need not be numeric, e.g. kernels for text strings are possible.
- The kernels give inner-product evaluations in the, possibly infinite-dimensional, transformed space.
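The three kernel families can be written down directly. The sketch below assumes the textbook forms $(x^T z + 1)^q$, $\exp(-\|x - z\|^2/\sigma^2)$ and $\tanh(\beta x^T z + \gamma)$; other sources scale the RBF exponent differently (for example with $2\sigma^2$ or a parameter $\gamma$), so check the convention of the library you use.

```python
import math


def dot(x, z):
    return sum(xi * zi for xi, zi in zip(x, z))


def polynomial_kernel(x, z, q=2):
    return (dot(x, z) + 1) ** q


def rbf_kernel(x, z, sigma=1.0):
    sq_dist = sum((xi - zi) ** 2 for xi, zi in zip(x, z))
    return math.exp(-sq_dist / sigma ** 2)  # assumed scaling: sigma^2 in the denominator


def tanh_kernel(x, z, beta=2.0, gamma=1.0):
    return math.tanh(beta * dot(x, z) + gamma)


k_poly = polynomial_kernel([1.0, 2.0], [3.0, 4.0])   # (11 + 1)^2
k_same = rbf_kernel([1.0, 2.0], [1.0, 2.0])          # identical points
k_near = rbf_kernel([0.0, 0.0], [1.0, 0.0], sigma=1.0)
k_wide = rbf_kernel([0.0, 0.0], [1.0, 0.0], sigma=10.0)
```

A larger $\sigma$ makes distant points look more similar (`k_wide` is closer to 1 than `k_near`), which is the sense in which $\sigma$ controls the «support» of each point.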

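The kernel formulation of the objective function
- Given the appropriate kernel (e.g. «radial» with width $\sigma$) and the cost of misclassification $C$, the optimization task is:
  $$\max_{\lambda}\left(\sum_{i=1}^{N}\lambda_i - \frac{1}{2}\sum_{i,j}\lambda_i\lambda_j y_i y_j K(x_i, x_j)\right) \quad \text{subject to} \quad 0 \leq \lambda_i \leq C,\ i = 1, \ldots, N, \ \text{ and } \ \sum_{i=1}^{N}\lambda_i y_i = 0$$
- The resulting classifier is: assign $x$ to class $\omega_1$ if
  $$g(x) = \sum_{i=1}^{N_s}\lambda_i y_i K(x_i, x) + w_0 > 0$$
  and to class $\omega_2$ otherwise.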
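Once the multipliers are known, evaluating the classifier is a short sum over the support vectors. In the sketch below the support vectors, multipliers and $w_0$ are invented for illustration (they are not the output of a solver), and the RBF scaling $\exp(-\|x - z\|^2/\sigma^2)$ is an assumption.

```python
import math


def rbf_kernel(x, z, sigma=1.0):
    sq_dist = sum((xi - zi) ** 2 for xi, zi in zip(x, z))
    return math.exp(-sq_dist / sigma ** 2)


def decision_value(x, support_vectors, labels, lambdas, w0, kernel=rbf_kernel):
    """g(x) = sum_i lambda_i * y_i * K(x_i, x) + w0, summed over the support vectors."""
    return sum(
        lam * y * kernel(sv, x)
        for sv, y, lam in zip(support_vectors, labels, lambdas)
    ) + w0


def classify(x, support_vectors, labels, lambdas, w0):
    """Class 1 if g(x) > 0, class 2 otherwise."""
    return 1 if decision_value(x, support_vectors, labels, lambdas, w0) > 0 else 2


# Hypothetical trained model: one support vector per class
svs = [[2.0, 0.0], [0.0, 0.0]]
ys = [1, -1]
lams = [1.0, 1.0]  # satisfies sum(lambda_i * y_i) = 0
bias = 0.0

class_right = classify([1.9, 0.0], svs, ys, lams, bias)
class_left = classify([0.1, 0.0], svs, ys, lams, bias)
```

Only the support vectors enter the sum, which is why prediction cost scales with their number rather than with the size of the training set.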

Example of nonlinear decision boundary
- This illustrates how the nonlinear SVM might look in the original feature space (RBF kernel used).
- Figure 4.3 in PR by Theodoridis et al.

From 2 to M classes
- All we have discussed up until now involves only separating 2 classes. How do we extend the methods to M classes?
- Two common approaches:
  - One-against-all
    - For each class m, find the hyperplane that best discriminates this class from all other classes. Then classify a sample to the class having the highest output. (To use this, we need the VALUE of the inner product and not just the sign.)
  - Compare all sets of pairwise classifiers
    - Find a hyperplane for each pair of classes. This gives $M(M-1)/2$ pairwise classifiers. For a given sample, use a voting scheme for selecting the most-winning class.
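Both multi-class strategies reduce to a few lines once the two-class discriminants exist. In this sketch the discriminants are simple hand-written 1-D functions standing in for trained SVM outputs; the class names and centres are invented.

```python
from collections import Counter
from itertools import combinations


def one_against_all(x, discriminants):
    """discriminants: class -> g_m(x). Pick the class with the highest output VALUE."""
    return max(discriminants, key=lambda m: discriminants[m](x))


def pairwise_vote(x, pairwise):
    """pairwise: (a, b) -> g_ab(x), where g_ab(x) > 0 votes for a, otherwise for b."""
    votes = Counter(a if g(x) > 0 else b for (a, b), g in pairwise.items())
    return votes.most_common(1)[0][0]


# Hypothetical 1-D problem with class centres 0, 5 and 10
centres = {"A": 0.0, "B": 5.0, "C": 10.0}

# Stand-in one-against-all outputs: larger when x is closer to the class centre
ova = {m: (lambda x, c=c: -(x - c) ** 2) for m, c in centres.items()}

# Stand-in pairwise discriminants: boundary at the midpoint between two centres
pairs = {
    (a, b): (lambda x, mid=(centres[a] + centres[b]) / 2: mid - x)
    for a, b in combinations(sorted(centres), 2)
}

label_ova = one_against_all(6.0, ova)
label_vote = pairwise_vote(6.0, pairs)
```

With M = 3 classes there are $M(M-1)/2 = 3$ pairwise classifiers. Ties in the vote need a tie-breaking rule, which this sketch leaves to `Counter.most_common`.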

How to use an SVM classifier
- Find a library with all the necessary SVM functions
  - For example LibSVM: http://www.csie.ntu.edu.tw/~cjlin/libsvm/
  - Or use the PRTools toolbox: http://www.37steps.com/prtools/
- Read the introductory guides.
- Often a radial basis function kernel is a good starting point.
- Scale the data to the range [-1,1] (so that features with large values will not dominate).
- Find the optimal values of $C$ and $\sigma$ by performing a grid search on selected values and using a validation data set.
- Train the classifier using the best value from the grid search.
- Test using a separate test set.
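The scaling step can be sketched as follows. The function names are chosen here; the point of the sketch is that the per-feature minimum and maximum are estimated on the training data only and then reused unchanged for validation and test data.

```python
def fit_range(X_train):
    """Per-feature minimum and maximum, estimated on the training data only."""
    columns = list(zip(*X_train))
    return [min(c) for c in columns], [max(c) for c in columns]


def scale(X, mins, maxs):
    """Map each feature linearly so that the training range becomes [-1, 1]."""
    return [
        [
            2.0 * (v - lo) / (hi - lo) - 1.0 if hi > lo else 0.0  # constant feature -> 0
            for v, lo, hi in zip(row, mins, maxs)
        ]
        for row in X
    ]


X_train = [[0, 100], [5, 300], [10, 200]]  # toy data with very different feature ranges
mins, maxs = fit_range(X_train)
X_train_scaled = scale(X_train, mins, maxs)
X_test_scaled = scale([[20, 100]], mins, maxs)  # test values may fall outside [-1, 1]
```

Reusing the training range on test data keeps the test set untouched by parameter estimation.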

How to do a grid search
- Use n-fold cross-validation (e.g. 10-fold cross-validation).
  - 10-fold: divide the training data into 10 subsets of equal size. Train on 9 subsets and test on the last subset. Repeat this procedure 10 times.
- Grid search: try pairs of $(C, \sigma)$. Select the pair that gets the best classification performance on average over all the n validation test subsets.
- Use the following values of $C$ and $\sigma$:
  - $C = 2^{-5}, 2^{-3}, \ldots, 2^{15}$
  - $\sigma = 2^{-15}, 2^{-13}, \ldots, 2^{3}$
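The procedure can be sketched as two nested loops around a cross-validation loop. The `evaluate` callable is a placeholder for "train an SVM on the training folds and return the accuracy on the held-out fold"; the toy version below ignores the data and simply peaks at an arbitrary grid point so that the sketch runs on its own.

```python
import math


def k_fold_indices(n_samples, k):
    """Split sample indices 0..n_samples-1 into k interleaved folds."""
    return [list(range(i, n_samples, k)) for i in range(k)]


def grid_search(n_samples, k, evaluate):
    """Return the (C, sigma) pair with the best mean score over the k folds."""
    c_grid = [2.0 ** e for e in range(-5, 16, 2)]      # 2^-5, 2^-3, ..., 2^15
    sigma_grid = [2.0 ** e for e in range(-15, 4, 2)]  # 2^-15, 2^-13, ..., 2^3
    folds = k_fold_indices(n_samples, k)
    best_pair, best_score = None, -math.inf
    for C in c_grid:
        for sigma in sigma_grid:
            scores = []
            for val_idx in folds:
                held_out = set(val_idx)
                train_idx = [i for i in range(n_samples) if i not in held_out]
                scores.append(evaluate(train_idx, val_idx, C, sigma))
            mean_score = sum(scores) / k
            if mean_score > best_score:
                best_pair, best_score = (C, sigma), mean_score
    return best_pair


def toy_evaluate(train_idx, val_idx, C, sigma):
    """Hypothetical stand-in for train-and-validate; best at C = 2^1, sigma = 2^-3."""
    return -(math.log2(C) - 1) ** 2 - (math.log2(sigma) + 3) ** 2


best_C, best_sigma = grid_search(n_samples=20, k=10, evaluate=toy_evaluate)
```

After the search, the classifier is retrained on all training data with the selected pair and then evaluated once on the separate test set.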

Discriminant functions
- The decision rule "decide $\omega_i$ if $P(\omega_i|x) > P(\omega_j|x)$ for all $j \neq i$" can be written as "assign $x$ to $\omega_i$ if $g_i(x) > g_j(x)$".
- The classifier computes $J$ discriminant functions $g_i(x)$ and selects the class corresponding to the largest value of the discriminant function.
- Since classification consists of choosing the class that has the largest value, a scaling of the discriminant function $g_i(x)$ by $f(g_i(x))$ will not affect the decision if $f$ is a monotonically increasing function.
- This can lead to simplifications, as we will soon see.

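Equivalent discriminant functions
- The following choices of discriminant functions give equivalent decisions:
  $$g_i(x) = P(\omega_i|x) = \frac{p(x|\omega_i)P(\omega_i)}{p(x)}$$
  $$g_i(x) = p(x|\omega_i)P(\omega_i)$$
  $$g_i(x) = \ln p(x|\omega_i) + \ln P(\omega_i)$$
- The effect of the decision rules is to divide the feature space into $c$ decision regions $R_1, \ldots, R_c$.
- If $g_i(x) > g_j(x)$ for all $j \neq i$, then $x$ is in region $R_i$.
- The regions are separated by decision boundaries, surfaces in feature space where the discriminant functions for two classes are equal.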

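The conditional density $p(x|\omega_s)$
- Any probability density function can be used to model $p(x|\omega_s)$.
- A common model is the multivariate Gaussian density:
  $$p(x|\omega_s) = \frac{1}{(2\pi)^{d/2}|\Sigma_s|^{1/2}} \exp\left[-\frac{1}{2}(x - \mu_s)^T \Sigma_s^{-1} (x - \mu_s)\right]$$
- If we have $d$ features, $\mu_s$ is a vector of length $d$ and $\Sigma_s$ is a $d \times d$ matrix (both depend on class $s$):
  $$\mu_s = \begin{bmatrix} \mu_{1s} \\ \mu_{2s} \\ \vdots \\ \mu_{ds} \end{bmatrix}, \qquad \Sigma_s = \begin{bmatrix} \sigma_{11} & \sigma_{12} & \cdots & \sigma_{1d} \\ \sigma_{21} & \sigma_{22} & \cdots & \sigma_{2d} \\ \vdots & \vdots & \ddots & \vdots \\ \sigma_{d1} & \sigma_{d2} & \cdots & \sigma_{dd} \end{bmatrix}$$
- $|\Sigma_s|$ is the determinant of the matrix $\Sigma_s$, and $\Sigma_s^{-1}$ is the inverse.
- $\Sigma_s$ is a symmetric $d \times d$ matrix: $\sigma_{ii}$ is the variance of feature $i$, and $\sigma_{ij}$ is the covariance between feature $i$ and feature $j$. It is symmetric because $\sigma_{ij} = \sigma_{ji}$.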

The covariance matrix and ellipses
- In 2D, the Gaussian model can be thought of as approximating the classes in 2D feature space with ellipses.
- The mean vector $\mu = [\mu_1, \mu_2]$ defines the center point of the ellipses.
- $\sigma_{12}$, the covariance between the features, defines the orientation of the ellipse.
- $\sigma_{11}$ and $\sigma_{22}$ define the width of the ellipse.
- The ellipse defines points where the probability density is equal.
  - Equal in the sense that the distance to the mean, as computed by the Mahalanobis distance, is equal.
  - The Mahalanobis distance between a point $x$ and the class center $\mu$ is: $r^2 = (x - \mu)^T \Sigma^{-1} (x - \mu)$
- The main axes of the ellipse are determined by the eigenvectors of $\Sigma$. The eigenvalues of $\Sigma$ give their length.

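Euclidean distance vs. Mahalanobis distance
- Euclidean distance between point $x$ and class center $\mu$: $\|x - \mu\|^2 = (x - \mu)^T (x - \mu)$
  - Points with equal distance to $\mu$ lie on a circle.
- Mahalanobis distance between $x$ and $\mu$: $r^2 = (x - \mu)^T \Sigma^{-1} (x - \mu)$
  - Points with equal distance to $\mu$ lie on an ellipse.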
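A small 2-D sketch makes the difference visible. The covariance matrix below is a toy example with variance 4 along the first feature and 1 along the second; the 2x2 inverse is written out by hand to keep the block self-contained.

```python
def euclidean_sq(x, mu):
    """Squared Euclidean distance (x - mu)^T (x - mu)."""
    return sum((xi - mi) ** 2 for xi, mi in zip(x, mu))


def mahalanobis_sq(x, mu, cov):
    """Squared Mahalanobis distance (x - mu)^T cov^{-1} (x - mu) for a 2x2 covariance."""
    (a, b), (c, d) = cov
    det = a * d - b * c
    inv = [[d / det, -b / det], [-c / det, a / det]]
    dx = [x[0] - mu[0], x[1] - mu[1]]
    return (
        dx[0] * (inv[0][0] * dx[0] + inv[0][1] * dx[1])
        + dx[1] * (inv[1][0] * dx[0] + inv[1][1] * dx[1])
    )


mu = [0.0, 0.0]
cov = [[4.0, 0.0], [0.0, 1.0]]  # toy covariance: feature 1 varies more than feature 2
p_wide = [2.0, 0.0]             # two units along the high-variance axis
p_narrow = [0.0, 1.0]           # one unit along the low-variance axis
```

The two points have different Euclidean distances to the mean (4 and 1, squared) but the same Mahalanobis distance, so they lie on the same equal-density ellipse.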

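Discriminant functions for the normal density
- We saw last lecture that the minimum-error-rate classification can be computed using the discriminant functions
  $$g_i(x) = \ln p(x|\omega_i) + \ln P(\omega_i)$$
- With a multivariate Gaussian we get:
  $$g_i(x) = -\frac{1}{2}(x - \mu_i)^T \Sigma_i^{-1} (x - \mu_i) - \frac{d}{2}\ln 2\pi - \frac{1}{2}\ln|\Sigma_i| + \ln P(\omega_i)$$
- Let us look at this expression for some special cases: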

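Case 1: $\Sigma_i = \sigma^2 I$
- The discriminant functions simplify to linear functions when using such a shape on the probability distributions:
  $$g_i(x) = -\frac{1}{2\sigma^2}(x - \mu_i)^T (x - \mu_i) - \frac{d}{2}\ln 2\pi - \frac{1}{2}\ln|\sigma^2 I| + \ln P(\omega_i)$$
  $$= -\frac{1}{2\sigma^2}\left(x^T x - 2\mu_i^T x + \mu_i^T \mu_i\right) - \frac{d}{2}\ln 2\pi - \frac{1}{2}\ln|\sigma^2 I| + \ln P(\omega_i)$$
- The terms $\frac{d}{2}\ln 2\pi$ and $\frac{1}{2}\ln|\sigma^2 I|$ are common for all classes, so there is no need to compute them.
- Since $x^T x$ is common for all classes, an equivalent $g_i(x)$ is a linear function of $x$:
  $$g_i(x) = \frac{1}{\sigma^2}\mu_i^T x - \frac{1}{2\sigma^2}\mu_i^T \mu_i + \ln P(\omega_i)$$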
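The linear discriminant for the $\sigma^2 I$ case, $g_i(x) = \frac{1}{\sigma^2}\mu_i^T x - \frac{1}{2\sigma^2}\mu_i^T\mu_i + \ln P(\omega_i)$, is easy to try out. The class means, variance and priors below are invented for illustration.

```python
import math


def linear_discriminant(x, mu, sigma2, prior):
    """g_i(x) for the case Sigma_i = sigma^2 I (class-independent terms dropped)."""
    mu_x = sum(m * xi for m, xi in zip(mu, x))
    mu_mu = sum(m * m for m in mu)
    return mu_x / sigma2 - mu_mu / (2.0 * sigma2) + math.log(prior)


def classify(x, means, sigma2, priors):
    """Index of the class with the largest discriminant value."""
    scores = [linear_discriminant(x, mu, sigma2, p) for mu, p in zip(means, priors)]
    return scores.index(max(scores))


means = [[0.0, 0.0], [2.0, 0.0]]  # toy class means
sigma2 = 1.0

left = classify([0.9, 0.0], means, sigma2, [0.5, 0.5])
right = classify([1.1, 0.0], means, sigma2, [0.5, 0.5])
shifted = classify([1.1, 0.0], means, sigma2, [0.9, 0.1])
```

With equal priors the boundary sits halfway between the means; making the first class much more probable a priori moves the boundary towards the second mean, so the point at 1.1 changes class.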

Decision boundary
- The decision boundary (when $\Sigma_i = \sigma^2 I$) between class 1 and 2 in the feature space is a straight line.
- It intersects the line connecting the two class means at the point $x_0 = (\mu_1 + \mu_2)/2$ (if we do not consider prior probabilities).
- The decision boundary will also be normal to the line connecting the means.

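Case 2: Common covariance, $\Sigma_i = \Sigma$
- An equivalent formulation of the discriminant functions is
  $$g_i(x) = w_i^T x + w_{i0}, \quad \text{where } w_i = \Sigma^{-1}\mu_i \ \text{ and } \ w_{i0} = -\frac{1}{2}\mu_i^T \Sigma^{-1} \mu_i + \ln P(\omega_i)$$
- The decision boundaries are again hyperplanes.
- The decision boundary between classes $i$ and $j$ has the equation:
  $$w^T (x - x_0) = 0, \quad w = \Sigma^{-1}(\mu_i - \mu_j), \quad x_0 = \frac{1}{2}(\mu_i + \mu_j) - \frac{\ln\left[P(\omega_i)/P(\omega_j)\right]}{(\mu_i - \mu_j)^T \Sigma^{-1} (\mu_i - \mu_j)}(\mu_i - \mu_j)$$
- Because $w = \Sigma^{-1}(\mu_i - \mu_j)$ is generally not in the direction of $\mu_i - \mu_j$, the hyperplane will not be orthogonal to the line between the means.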

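Case 3: $\Sigma_i$ arbitrary
- The discriminant functions will be quadratic:
  $$g_i(x) = x^T W_i x + w_i^T x + w_{i0}$$
  where $W_i = -\frac{1}{2}\Sigma_i^{-1}$, $w_i = \Sigma_i^{-1}\mu_i$, and $w_{i0} = -\frac{1}{2}\mu_i^T \Sigma_i^{-1} \mu_i - \frac{1}{2}\ln|\Sigma_i| + \ln P(\omega_i)$
- The decision surfaces are hyperquadrics and can assume any of the general forms:
  - hyperplanes
  - hyperspheres
  - pairs of hyperplanes
  - hyperellipsoids
  - hyperparaboloids, ...
- The next slides show examples of this.
- In this general case we cannot intuitively draw the decision boundaries just by looking at the mean and covariance.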
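A one-dimensional special case already shows why the boundaries are hard to guess. With a single feature, the quadratic discriminant reduces to $g_i(x) = -\frac{(x - \mu_i)^2}{2\sigma_i^2} - \frac{1}{2}\ln\sigma_i^2 + \ln P(\omega_i)$ (the common $\ln 2\pi$ term dropped). The two classes below are invented: same mean, different variance.

```python
import math


def quadratic_discriminant_1d(x, mu, var, prior):
    """Univariate Gaussian discriminant with a class-specific variance."""
    return -0.5 * (x - mu) ** 2 / var - 0.5 * math.log(var) + math.log(prior)


def classify(x, params):
    """params: list of (mu, var, prior) per class; returns the winning class index."""
    scores = [quadratic_discriminant_1d(x, mu, var, p) for mu, var, p in params]
    return scores.index(max(scores))


# Hypothetical classes: a narrow and a wide Gaussian centred at the same point
params = [(0.0, 1.0, 0.5), (0.0, 9.0, 0.5)]

at_centre = classify(0.0, params)
far_right = classify(5.0, params)
far_left = classify(-5.0, params)
```

The narrow class wins near the common mean and the wide class wins on both sides, so the decision region of the wide class is not connected. A linear discriminant cannot produce this.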