INF 4300 Repetition. Anne Solberg.

INF 4300 Repetition. Anne Solberg, anne@ifi.uio.no.

Classifiers covered: the Gaussian classifier (with $\Sigma_k = \sigma^2 I$, $\Sigma_k = \Sigma$, and $\Sigma_k$ arbitrary), the kNN-classifier, and Support Vector Machines. Recommendation: linear or Radial Basis Function kernels.

Approaching a classification problem: Choose features. Consider preprocessing/normalization. Choose a classifier. Estimate classifier parameters on training data. Estimate hyperparameters on validation data (alternative: cross-validation on the training data set). Compute the accuracy on test data.

Measures of classification accuracy: Average error rate. Confusion matrices. True/false positives/negatives. Precision/recall and sensitivity/specificity.
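As a small numeric illustration of these measures, the sketch below computes them from a two-class confusion matrix; the matrix entries are invented for illustration only.

```python
import numpy as np

# Hypothetical 2-class confusion matrix: rows = true class, columns = predicted class.
confusion = np.array([[40, 10],   # true positive class
                      [ 5, 45]])  # true negative class

TP, FN = confusion[0, 0], confusion[0, 1]
FP, TN = confusion[1, 0], confusion[1, 1]

accuracy    = (TP + TN) / confusion.sum()   # average correct classification rate
error_rate  = 1 - accuracy
precision   = TP / (TP + FP)
recall      = TP / (TP + FN)                # = sensitivity
specificity = TN / (TN + FP)

print(f"accuracy={accuracy:.2f} precision={precision:.2f} "
      f"recall/sensitivity={recall:.2f} specificity={specificity:.2f}")
```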

The curse of dimensionality: In practice, the curse means that, for a given sample size, there is a maximum number of features one can add before the classifier starts to degrade. For a finite training sample size, the correct classification rate initially increases when adding new features, attains a maximum and then begins to decrease. For a high dimensionality, we will need lots of training data to get the best performance, i.e. enough samples per feature per class. (Figure: correct classification rate as a function of feature dimensionality, for different amounts of training data. Equal prior probabilities of the two classes are assumed.)

Use few, but good features: To avoid the curse of dimensionality we must take care in finding a set of relatively few features. A good feature has high within-class homogeneity, and should ideally have large between-class separation. In practice, one feature is not enough to separate all classes, but a good feature should separate some of the classes well, or isolate one class from the others. If two features look very similar or have high correlation, they are often redundant and we should use only one of them. Class separation can be studied by visual inspection of the feature image overlaid the training mask, or by scatter plots. Evaluating features as done by training can be difficult to do automatically, so manual interaction is normally required.

How do we beat the curse of dimensionality? Generate few, but informative features: careful feature design given the application. Try a simple classifier first: do the features work? Do we need additional features? Iterate between feature extraction and classification. Reduce the dimensionality: feature selection (select a subset of the original features) or feature transforms (compute a new set of features based on a linear combination of all features, next week). Example: the principal component transform, which is unsupervised and finds the combination that maximizes the variance in the data. When you are confident that the features are good, consider a more advanced classifier.

Suboptimal feature selection: Select the best single features based on some quality criterion, e.g., estimated correct classification rate. A combination of the best single features will often imply correlated features and will therefore be suboptimal. Sequential forward selection implies that when a feature is selected or removed, this decision is final. Stepwise forward-backward selection overcomes this; it is a special case of the add-a, remove-r algorithm, improved into floating search by making the number of forward and backward search steps data dependent (adaptive floating search, oscillating search).

Distance measures used in feature selection: In feature selection, each feature combination must be ranked based on a criterion function. Criterion functions can either be distances between classes, or the classification accuracy on a validation test set. If the criterion is based on e.g. the mean values/covariance matrices for the training data, distance computation is fast. Better performance at the cost of higher computation time is found when the classification accuracy on a validation data set (different from training and testing) is used as the criterion for ranking features. This will be slower, as classification of the validation data needs to be done for every combination of features.

Class separability measures: How do we get an indication of the separability between two classes? Euclidean distance between class means, $|\mu_r - \mu_s|$. Bhattacharyya distance, which can be defined for different distributions; for Gaussian data it is $B = \frac{1}{8}(\mu_r-\mu_s)^T\left[\frac{\Sigma_r+\Sigma_s}{2}\right]^{-1}(\mu_r-\mu_s) + \frac{1}{2}\ln\frac{\left|\frac{\Sigma_r+\Sigma_s}{2}\right|}{\sqrt{|\Sigma_r||\Sigma_s|}}$. Mahalanobis distance between two classes: $\Delta = (\mu_r-\mu_s)^T\Sigma^{-1}(\mu_r-\mu_s)$.
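A minimal sketch of the Gaussian Bhattacharyya distance written above, assuming each class is summarized by a sample mean and covariance matrix; the toy values are made up.

```python
import numpy as np

def bhattacharyya_gaussian(mu_r, cov_r, mu_s, cov_s):
    """Bhattacharyya distance between two Gaussian class models."""
    cov_avg = 0.5 * (cov_r + cov_s)
    diff = mu_r - mu_s
    term1 = 0.125 * diff @ np.linalg.inv(cov_avg) @ diff
    term2 = 0.5 * np.log(np.linalg.det(cov_avg) /
                         np.sqrt(np.linalg.det(cov_r) * np.linalg.det(cov_s)))
    return term1 + term2

# Toy example with two 2-D classes
mu_r, cov_r = np.array([0.0, 0.0]), np.eye(2)
mu_s, cov_s = np.array([2.0, 1.0]), np.array([[1.0, 0.3], [0.3, 2.0]])
print(bhattacharyya_gaussian(mu_r, cov_r, mu_s, cov_s))
```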

Method 2 - Sequential backward selection: Select l features out of d. Example: 4 features $x_1, x_2, x_3, x_4$. Choose a criterion C and compute it for the vector $[x_1, x_2, x_3, x_4]^T$. Eliminate one feature at a time by computing $[x_1, x_2, x_3]^T$, $[x_1, x_2, x_4]^T$, $[x_1, x_3, x_4]^T$ and $[x_2, x_3, x_4]^T$. Select the best combination, say $[x_1, x_2, x_3]^T$. From the selected 3-dimensional feature vector eliminate one more feature, evaluate the criterion for $[x_1, x_2]^T$, $[x_1, x_3]^T$, $[x_2, x_3]^T$ and select the one with the best value. Number of combinations searched: $1 + \frac{1}{2}\left((d+1)d - l(l+1)\right)$.

Method 3: Sequential forward selection: Compute the criterion value for each feature. Select the feature with the best value, say $x_1$. Form all possible combinations of the winner at the previous step and a new feature, e.g. $[x_1, x_2]^T$, $[x_1, x_3]^T$, $[x_1, x_4]^T$, etc. Compute the criterion and select the best one, say $[x_1, x_3]^T$. Continue with adding a new feature. Number of combinations searched: $ld - l(l-1)/2$. Backwards selection is faster if l is closer to d than to 1.
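A possible sketch of sequential forward selection as described above; the criterion function here (squared distance between class means) and the toy data are stand-ins for illustration, not prescribed by the lecture.

```python
import numpy as np

def sequential_forward_selection(X, y, criterion, n_selected):
    """Greedy forward selection: repeatedly add the feature that maximizes the criterion."""
    remaining = list(range(X.shape[1]))
    selected = []
    while len(selected) < n_selected:
        scores = [(criterion(X[:, selected + [f]], y), f) for f in remaining]
        best_score, best_f = max(scores)
        selected.append(best_f)
        remaining.remove(best_f)
    return selected

# Example criterion (an assumption for illustration): squared distance between
# the two class means, summed over the selected features.
def mean_separation(X_sub, y):
    m0 = X_sub[y == 0].mean(axis=0)
    m1 = X_sub[y == 1].mean(axis=0)
    return float(np.sum((m0 - m1) ** 2))

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 6))
y = (rng.random(100) > 0.5).astype(int)
X[y == 1, 2] += 2.0          # make feature 2 informative
print(sequential_forward_selection(X, y, mean_separation, n_selected=2))
```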

Linear feature transforms

Principal component or Karhunen-Loeve transform: Let x be a feature vector. Features are often correlated, which might lead to redundancies. We now derive a transform which yields uncorrelated features. We seek a linear transform $y = A^T x$, and the $y_i$'s should be uncorrelated. The $y_i$'s are uncorrelated if $E[y_i y_j^T] = 0$, $i \neq j$. If we can express the information in x using uncorrelated features, we might need fewer coefficients.

The weights - visualization and intuition. (Figure: samples and their projections onto the weight vector $w_1$, giving $y_1 = w_1^T x$.)

Variance of $y_1$, cont.: Assume the mean of x is subtracted. Then $\sigma_{y_1}^2 = w_1^T R\, w_1$, where $R = \frac{1}{n}\sum_i x_i x_i^T$ is the sample covariance matrix / scatter matrix (called $\Sigma_x$ on some slides).

Variance and projection residuals: For a single sample $x_i$, the projection onto $w$, assuming $\|w\| = 1$, is $y_i = w^T x_i$; the sample splits into its projection «$y_i$» and a projection residual. Summing over all n samples (not dimensions) gives the variance $\sigma_y^2$. Note: maximum variance is equivalent to minimum projection residuals!

Criterion function: Goal: find the transform minimizing the representation error. We start with a single weight-vector, $w$, giving us a single feature, $y = w^T x$. Let $J(w) = w^T R w = \sigma_y^2$. Now, let's find $\max_w J(w)$. As we learned on the previous slide, maximizing this is equivalent to minimizing the representation error.

Maximizing the variance of y: The Lagrangian function for maximizing $\sigma_y^2 = w^T R w$ with the constraint $w^T w = 1$ is $L(w, \lambda) = w^T R w - \lambda(w^T w - 1)$. Equating the gradient to zero gives $R w = \lambda w$. (Unfamiliar with Lagrangian multipliers? See http://biostat.mc.vanderbilt.edu/wiki/pub/Main/CourseBios36/LagrangeMultipliers-Bishop-PatternRecognitionMachineLearning.pdf.) The maximizing $w$ is an eigenvector of R! And $\sigma_y^2 = \lambda$! [Why?]

Eigendecomposition of covariance matrices: A real-valued, symmetric, «n-dimensional» covariance matrix R can be written $R = \sum_{i=1}^{n} \lambda_i a_i a_i^T$, with the eigenvalues sorted so that $\lambda_1$ is the largest and $\lambda_n$ the smallest, and $a_i$ the eigenvector corresponding to $\lambda_i$. The eigenvectors are orthonormal: $a_i^T a_j = 0$ for $i \neq j$. Remember: $\lambda_i$ = variance of $a_i^T x$.

$w_2$, $w_3$, ... (II/III): What does uncorrelated mean? Zero covariance. Covariance of $y_1$ and $y_2$: $E[y_1 y_2] = w_2^T R w_1$. We already have that $w_1 = a_1$. From the last slide, requiring $w_2^T R w_1 = \lambda_1 w_2^T a_1 = 0$ means requiring $w_2^T a_1 = 0$.

$w_2$, $w_3$, ... (III/III): We want $\max_{w_2} w_2^T R w_2$, s.t. $w_2^T w_2 = 1$ and $w_2^T a_1 = 0$. We can simply remove $\lambda_1 a_1 a_1^T$ from R, creating $R_{next} = R - \lambda_1 a_1 a_1^T$, and again find $\max_{w_2} w_2^T R_{next}\, w_2$ s.t. $w_2^T w_2 = 1$. Studying the decomposition of R a few slides back, we see that the solution is the eigenvector corresponding to the second largest eigenvalue. Similarly, $w_3$, $w_4$ etc. are given by the following eigenvectors sorted according to their eigenvalues.

$w_2$, $w_3$, ... (III+/III): $\max_{w} w^T R w$, s.t. $w^T w = 1$, gives $w_1 = a_1$, $w_2 = a_2$, $w_3 = a_3$ etc., the eigenvectors sorted by their corresponding eigenvalues.

Principal component transform (PCA): Place the m «principal» eigenvectors (the ones with the largest eigenvalues) along the columns of A. Then the transform $y = A^T x$ gives you the m first principal components. The m-dimensional y has uncorrelated elements, retains as much variance as possible, and gives the best (in the mean-square sense) description of the original data through the «image»/projection/reconstruction $Ay$. Note: the eigenvectors themselves can often give interesting information. PCA is also known as the Karhunen-Loeve transform.
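A minimal sketch of this transform via eigendecomposition of the sample covariance matrix; the toy data is random and only illustrates that the resulting components come out uncorrelated.

```python
import numpy as np

def pca_transform(X, m):
    """Project data X (n samples x d features) onto the m principal components.
       The mean is subtracted internally, as assumed in the derivation."""
    Xc = X - X.mean(axis=0)
    R = np.cov(Xc, rowvar=False)             # sample covariance matrix
    eigvals, eigvecs = np.linalg.eigh(R)     # eigh: for symmetric matrices, ascending order
    order = np.argsort(eigvals)[::-1]        # sort eigenvectors by decreasing eigenvalue
    A = eigvecs[:, order[:m]]                # columns = the m principal eigenvectors
    Y = Xc @ A                               # y = A^T x for each sample
    return Y, A, eigvals[order]

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 5)) @ rng.normal(size=(5, 5))    # correlated toy data
Y, A, lam = pca_transform(X, m=2)
print(np.round(np.cov(Y, rowvar=False), 3))  # approximately diagonal: uncorrelated components
```

np.linalg.eigh is used rather than the general eig because the covariance matrix is symmetric, which guarantees real eigenvalues and orthonormal eigenvectors.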

Introduction to linear SVM: Discriminant function $g(x) = w^T x + w_0$, where $w$ contains the weights/orientation and $w_0$ is the threshold/bias, applied to an input pattern x. Two-class problem, with class indicator $y_i \in \{-1, 1\}$ for pattern $x_i$. Class prediction: $y = -1$ if $g(x) < 0$ and $y = 1$ if $g(x) > 0$; $g(x) = 0$ is the decision boundary.

Separable case: many candidates. Obviously we want the decision boundary to separate the classes... however, there can be many such hyperplanes. Which of these two candidates would you prefer? Why?

Distance to the decision boundary: Since $w/\|w\|$ is a unit vector in the direction of $w$, a point x can be written in terms of its projection onto the decision boundary, $x_B = x - z \cdot w/\|w\|$, where z is the distance from x to the boundary. Because $x_B$ lies on the decision boundary, $w^T x_B + w_0 = 0$, i.e. $w^T(x - z\,w/\|w\|) + w_0 = 0$. Solving this for z gives $z = g(x)/\|w\|$, the distance from x to the decision boundary. At the closest training points this distance defines the margin of the classifier.

Hyperplanes and margins: If both classes are equally probable, the distance from the hyperplane to the closest points in both classes should be equal. This is called the margin. The margin for «direction 1» is $2z_1$, and for «direction 2» it is $2z_2$. From the previous slide, the distance from a point to the separating hyperplane is $z = g(x)/\|w\|$. Goal: find $w$ and $w_0$ maximizing the margin! How would you write a program finding this? Not easy unless we state the objective function cleverly!
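A small numeric illustration of the distance formula $z = g(x)/\|w\|$; the hyperplane and the points below are made up for illustration.

```python
import numpy as np

w, w0 = np.array([2.0, 1.0]), -3.0            # hypothetical hyperplane w^T x + w0 = 0

def signed_distance(x):
    """Signed distance from x to the hyperplane: g(x) / ||w||."""
    return (w @ x + w0) / np.linalg.norm(w)

points = np.array([[2.0, 1.0], [0.5, 0.5], [3.0, 2.0]])
distances = [signed_distance(p) for p in points]
print(distances)                              # the sign tells which side of the boundary
print(min(abs(d) for d in distances))         # distance of the closest point to the boundary
```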

Towards a clever objective function: We can scale $g(x)$ such that $g(x)$ will be equal to 1 or -1 at the closest points in the two classes. This does not change the margin, and is equivalent to having a margin of $\frac{1}{\|w\|} + \frac{1}{\|w\|} = \frac{2}{\|w\|}$ and requiring that $w^T x + w_0 \geq 1\ \forall x \in \omega_1$ and $w^T x + w_0 \leq -1\ \forall x \in \omega_2$. Remember our goal: find $w$ and $w_0$ yielding the maximum margin.

Maximum-margin objective function: The hyperplane with maximum margin can be found by solving the optimization problem (w.r.t. $w$ and $w_0$): minimize $J(w) = \frac{1}{2}\|w\|^2$ subject to $y_i(w^T x_i + w_0) \geq 1$, $i = 1, 2, \ldots, N$. The ½ factor is for later convenience. Note: we assume here fully class-separable data! Checkpoint: do you understand the formulation? How is this criterion related to maximizing the margin? Note! We are somewhat done -- Matlab or similar software can solve this now. But we seek more insight!

Support vectors: The feature vectors $x_i$ with a corresponding $\lambda_i > 0$ are called the support vectors for the problem. The classifier defined by this hyperplane is called a Support Vector Machine. Depending on $y_i$ (+1 or -1), the support vectors will thus lie on either of the two hyperplanes $w^T x + w_0 = \pm 1$. The support vectors are the points in the training set that are closest to the decision hyperplane. The optimization has a unique solution: only one hyperplane satisfies the conditions. (Figure: the support vectors for hyperplane 1 are the blue circles, and those for hyperplane 2 are the red circles.)

The nonseparable case: If the two classes are nonseparable, a hyperplane satisfying the conditions $w^T x + w_0 = \pm 1$ cannot be found. The feature vectors in the training set are now either: 1. Vectors that fall outside the band and are correctly classified. 2. Vectors that are inside the band and are correctly classified; they satisfy $0 \leq y_i(w^T x_i + w_0) < 1$. 3. Vectors that are misclassified, expressed as $y_i(w^T x_i + w_0) < 0$. (Figure: correctly classified and erroneously classified samples.)

Cost function, nonseparable case: The cost function to minimize is now $J(w, w_0, \xi) = \frac{1}{2}\|w\|^2 + C\sum_{i=1}^{N} I(\xi_i)$, where the $\xi_i$ are slack variables (the vector $\xi$ is part of the parameters) and $I(\xi_i) = 1$ if $\xi_i > 0$ and 0 otherwise. C is a parameter that controls how much misclassified training samples are weighted. We skip the mathematics and present the alternative dual formulation: $\max_{\lambda} \sum_{i=1}^{N}\lambda_i - \frac{1}{2}\sum_{i,j}\lambda_i\lambda_j y_i y_j x_i^T x_j$, subject to $0 \leq \lambda_i \leq C$, $i = 1, \ldots, N$, and $\sum_{i=1}^{N}\lambda_i y_i = 0$. All points between the two hyperplanes ($\xi_i > 0$) can be shown to have $\lambda_i = C$.

SVMs: the nonlinear case, intro. The training samples are l-dimensional vectors; we have until now tried to find a linear separation in this l-dimensional feature space. This seems quite limiting. What if we increase the dimensionality (map our samples to a higher dimensional space) before applying our SVM? Perhaps we can find a better linear decision boundary in that space? Even if the feature vectors are not linearly separable in the input space, they might be close to separable in a higher dimensional space.

SVMs and kernels: Note that in both the optimization problem, $\max_{\lambda} \sum_i \lambda_i - \frac{1}{2}\sum_{i,j}\lambda_i\lambda_j y_i y_j x_i^T x_j$ (s.t. $0 \leq \lambda_i \leq C$, $\sum_i \lambda_i y_i = 0$), and the evaluation function, $g(x) = \sum_{i=1}^{N_s}\lambda_i y_i x_i^T x + w_0$, the samples come into play as inner products only. If we have a function evaluating inner products, $K(x_i, x_j)$ (called a «kernel»), we can ignore the samples themselves. Let's say we have $K(x_i, x_j)$ evaluating inner products in a higher dimensional space: then there is no need to do the mapping of our samples explicitly!

Useful kernels for classification: Polynomial kernels, $K(x, z) = (x^T z + 1)^q$, $q > 0$. Radial basis function kernels (very commonly used!), $K(x, z) = \exp(-\|x - z\|^2 / \sigma^2)$; note that we need to set the parameter $\sigma$, and the «support» of each point is controlled by $\sigma$. Hyperbolic tangent kernels, $K(x, z) = \tanh(\beta x^T z + \gamma)$, often with $\beta = 2$ and $\gamma = 1$. The inner product is related to the similarity of the two samples. The kernel inputs need not be numeric, e.g. kernels for text strings are possible. The kernels give inner-product evaluations in the, possibly infinite-dimensional, transformed space.
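A minimal sketch of these three kernel functions; the parameter values are example choices, not values from the lecture.

```python
import numpy as np

def polynomial_kernel(x, z, q=2):
    """(x^T z + 1)^q"""
    return (x @ z + 1.0) ** q

def rbf_kernel(x, z, sigma=1.0):
    """exp(-||x - z||^2 / sigma^2); sigma controls the 'support' of each point."""
    return np.exp(-np.sum((x - z) ** 2) / sigma ** 2)

def tanh_kernel(x, z, beta=2.0, gamma=1.0):
    """tanh(beta * x^T z + gamma)"""
    return np.tanh(beta * (x @ z) + gamma)

x, z = np.array([1.0, 2.0]), np.array([0.5, -1.0])
print(polynomial_kernel(x, z), rbf_kernel(x, z), tanh_kernel(x, z))
```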

The kernel formulation of the objective function: Given the appropriate kernel (e.g. «radial» with width $\sigma$) and the cost of misclassification C, the optimization task is: $\max_{\lambda} \sum_{i=1}^{N}\lambda_i - \frac{1}{2}\sum_{i,j}\lambda_i\lambda_j y_i y_j K(x_i, x_j)$, subject to $0 \leq \lambda_i \leq C$, $i = 1, \ldots, N$, and $\sum_i \lambda_i y_i = 0$. The resulting classifier is: assign x to class $\omega_1$ if $g(x) = \sum_{i=1}^{N_s}\lambda_i y_i K(x_i, x) + w_0 > 0$, and to class $\omega_2$ otherwise.

Example of a nonlinear decision boundary: This illustrates how the nonlinear SVM might look in the original feature space (RBF kernel used). Figure 4.3 in PR by Theodoridis et al.

From 2 to M classes: All we have discussed up until now involves only separating 2 classes. How do we extend the methods to M classes? Two common approaches: One-against-all: for each class m, find the hyperplane that best discriminates this class from all other classes, then classify a sample to the class having the highest output. To use this, we need the VALUE of the inner product and not just the sign. Compare all sets of pairwise classifiers: find a hyperplane for each pair of classes; this gives M(M-1)/2 pairwise classifiers. For a given sample, use a voting scheme for selecting the most-winning class.

How to use a SVM classifier: Find a library with all the necessary SVM-functions, for example LibSVM (http://www.csie.ntu.edu.tw/~cjlin/libsvm/) or the PRTools toolbox (http://www.37steps.com/prtools/). Read the introductory guides. Often a radial basis function kernel is a good starting point. Scale the data to the range [-1, 1] so that features with large values will not dominate. Find the optimal values of C and the kernel parameter by performing a grid search on selected values, using a validation data set. Train the classifier using the best values from the grid search. Test using a separate test set.
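A minimal sketch of this workflow in Python with scikit-learn, whose SVC class wraps LIBSVM; the Iris data and the C/gamma values are placeholders chosen for illustration, not values from the lecture.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import MinMaxScaler
from sklearn.svm import SVC

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=0)

# Scale features to [-1, 1] so features with large values do not dominate
scaler = MinMaxScaler(feature_range=(-1, 1)).fit(X_train)
X_train, X_test = scaler.transform(X_train), scaler.transform(X_test)

# RBF kernel as a starting point; C and gamma would normally come from a grid search
clf = SVC(kernel="rbf", C=1.0, gamma=0.5).fit(X_train, y_train)
print("test accuracy:", clf.score(X_test, y_test))
```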

How to do a grid search: Use n-fold cross validation, e.g. 10-fold cross-validation. 10-fold: divide the training data into 10 subsets of equal size. Train on 9 subsets and test on the last subset. Repeat this procedure 10 times. Grid search: try pairs of (C, γ). Select the pair that gets the best classification performance on average over all the n validation test subsets. Use the following values of C and γ: C = 2^-5, 2^-3, ..., 2^15 and γ = 2^-15, 2^-13, ..., 2^3.
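A sketch of this grid search using scikit-learn's GridSearchCV with 10-fold cross-validation over the powers-of-two grid above; the dataset is again just a placeholder.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import GridSearchCV
from sklearn.preprocessing import MinMaxScaler
from sklearn.svm import SVC

X, y = load_iris(return_X_y=True)
X = MinMaxScaler(feature_range=(-1, 1)).fit_transform(X)   # scale to [-1, 1] as before

# Grid of (C, gamma) pairs over powers of two, as suggested above
param_grid = {
    "C":     [2.0 ** k for k in range(-5, 16, 2)],   # 2^-5, 2^-3, ..., 2^15
    "gamma": [2.0 ** k for k in range(-15, 4, 2)],   # 2^-15, 2^-13, ..., 2^3
}

search = GridSearchCV(SVC(kernel="rbf"), param_grid, cv=10)  # 10-fold cross-validation
search.fit(X, y)
print("best (C, gamma):", search.best_params_)
print("best cross-validation accuracy:", search.best_score_)
```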

Discriminant functions: The decision rule "decide $\omega_i$ if $P(\omega_i|x) > P(\omega_j|x)$ for all $j \neq i$" can be written as "assign x to $\omega_i$ if $g_i(x) > g_j(x)$". The classifier computes J discriminant functions $g_i(x)$ and selects the class corresponding to the largest value of the discriminant function. Since classification consists of choosing the class that has the largest value, a scaling of the discriminant function $g_i(x)$ by $f(g_i(x))$ will not affect the decision if f is a monotonically increasing function. This can lead to simplifications as we will soon see.

Equivalent discriminant functions: The following choices of discriminant functions give equivalent decisions: $g_i(x) = P(\omega_i|x) = \frac{p(x|\omega_i)P(\omega_i)}{p(x)}$, $g_i(x) = p(x|\omega_i)P(\omega_i)$, and $g_i(x) = \ln p(x|\omega_i) + \ln P(\omega_i)$. The effect of the decision rules is to divide the feature space into c decision regions $R_1, \ldots, R_c$. If $g_i(x) > g_j(x)$ for all $j \neq i$, then x is in region $R_i$. The regions are separated by decision boundaries, surfaces in feature space where the discriminant functions for two classes are equal.

The conditional density $p(x|\omega_s)$: Any probability density function can be used to model $p(x|\omega_s)$. A common model is the multivariate Gaussian density: $p(x|\omega_s) = \frac{1}{(2\pi)^{d/2}|\Sigma_s|^{1/2}}\exp\left(-\frac{1}{2}(x - \mu_s)^T\Sigma_s^{-1}(x - \mu_s)\right)$. If we have d features, $\mu_s$ is a vector of length d and $\Sigma_s$ is a d×d matrix that depends on class s; $|\Sigma_s|$ is the determinant of the matrix $\Sigma_s$, and $\Sigma_s^{-1}$ is its inverse. $\Sigma_s$ is a symmetric d×d matrix: $\sigma_{kk}^2$ is the variance of feature k, and $\sigma_{kl}$ is the covariance between feature k and feature l; symmetric because $\sigma_{kl} = \sigma_{lk}$.

The covariance matrix and ellipses: In 2D, the Gaussian model can be thought of as approximating the classes in 2D feature space with ellipses. The mean vector $\mu = [\mu_1, \mu_2]$ defines the center point of the ellipses. $\sigma_{12}$, the covariance between the features, defines the orientation of the ellipse, while $\sigma_{11}$ and $\sigma_{22}$ define the width of the ellipse. The ellipse defines points where the probability density is equal; equal in the sense that the distance to the mean as computed by the Mahalanobis distance is equal. The Mahalanobis distance between a point x and the class center $\mu$ is: $r^2 = (x - \mu)^T\Sigma^{-1}(x - \mu)$. The main axes of the ellipse are determined by the eigenvectors of $\Sigma$; the eigenvalues of $\Sigma$ give their length.

Euclidean distance vs. Mahalanobis distance: Euclidean distance between point x and class center $\mu$: $\|x - \mu\| = \sqrt{(x - \mu)^T(x - \mu)}$. Points with equal distance to $\mu$ lie on a circle. Mahalanobis distance between x and $\mu$: $r^2 = (x - \mu)^T\Sigma^{-1}(x - \mu)$. Points with equal distance to $\mu$ lie on an ellipse.
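A small sketch contrasting the two distances; the class covariance and the points are invented for illustration.

```python
import numpy as np

mu = np.array([0.0, 0.0])
Sigma = np.array([[4.0, 1.2],
                  [1.2, 1.0]])          # hypothetical class covariance
Sigma_inv = np.linalg.inv(Sigma)

def euclidean(x):
    return np.sqrt((x - mu) @ (x - mu))

def mahalanobis(x):
    d = x - mu
    return np.sqrt(d @ Sigma_inv @ d)

# Two points equally far from mu in the Euclidean sense...
a, b = np.array([2.0, 0.0]), np.array([0.0, 2.0])
print(euclidean(a), euclidean(b))        # equal
print(mahalanobis(a), mahalanobis(b))    # different: variance is larger along feature 1
```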

Discriminant functions for the normal density: We saw last lecture that the minimum-error-rate classification can be computed using the discriminant functions $g_i(x) = \ln p(x|\omega_i) + \ln P(\omega_i)$. With a multivariate Gaussian we get: $g_i(x) = -\frac{1}{2}(x - \mu_i)^T\Sigma_i^{-1}(x - \mu_i) - \frac{d}{2}\ln 2\pi - \frac{1}{2}\ln|\Sigma_i| + \ln P(\omega_i)$. Let us look at this expression for some special cases:

Case 1: $\Sigma_i = \sigma^2 I$. The discriminant functions simplify to linear functions when using such a shape for the probability distributions: $g_i(x) = -\frac{1}{2\sigma^2}(x - \mu_i)^T(x - \mu_i) - \frac{d}{2}\ln 2\pi - \frac{1}{2}\ln|\sigma^2 I| + \ln P(\omega_i)$. The terms $\frac{d}{2}\ln 2\pi$ and $\frac{1}{2}\ln|\sigma^2 I|$ are common for all classes, so there is no need to compute them. Since $x^T x$ is also common for all classes, an equivalent $g_i$ is a linear function of x: $g_i(x) = \frac{1}{\sigma^2}\mu_i^T x - \frac{1}{2\sigma^2}\mu_i^T\mu_i + \ln P(\omega_i)$.

The decision boundary when $\Sigma_i = \sigma^2 I$, which defines the border between class 1 and class 2 in the feature space, is a straight line. It intersects the line connecting the two class means halfway between them, at $x_0 = (\mu_1 + \mu_2)/2$, if we do not consider prior probabilities. The decision boundary will also be normal to the line connecting the means.
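A minimal sketch of the linear discriminant for the $\Sigma_i = \sigma^2 I$ case; the class means, priors, variance and test point are assumptions for illustration.

```python
import numpy as np

# Hypothetical class means, shared variance and priors
means  = [np.array([0.0, 0.0]), np.array([3.0, 1.0])]
priors = [0.5, 0.5]
sigma2 = 1.0

def g(x, i):
    """Linear discriminant for Sigma_i = sigma^2 I:
       g_i(x) = (1/sigma^2) mu_i^T x - (1/(2 sigma^2)) mu_i^T mu_i + ln P(w_i)."""
    mu = means[i]
    return (mu @ x) / sigma2 - (mu @ mu) / (2 * sigma2) + np.log(priors[i])

x = np.array([1.0, 0.2])
scores = [g(x, i) for i in range(len(means))]
print("discriminant values:", scores, "-> class", int(np.argmax(scores)))
```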

Case 2: Common covariance, $\Sigma_i = \Sigma$. An equivalent formulation of the discriminant functions is $g_i(x) = w_i^T x + w_{i0}$, where $w_i = \Sigma^{-1}\mu_i$ and $w_{i0} = -\frac{1}{2}\mu_i^T\Sigma^{-1}\mu_i + \ln P(\omega_i)$. The decision boundaries are again hyperplanes; the boundary between class i and j has the equation $w^T(x - x_0) = 0$ with $w = \Sigma^{-1}(\mu_i - \mu_j)$. Because $w = \Sigma^{-1}(\mu_i - \mu_j)$ is not in the direction of $\mu_i - \mu_j$, the hyperplane will not be orthogonal to the line between the means.

Case 3: $\Sigma_i$ arbitrary. The discriminant functions will be quadratic: $g_i(x) = x^T W_i x + w_i^T x + w_{i0}$, where $W_i = -\frac{1}{2}\Sigma_i^{-1}$, $w_i = \Sigma_i^{-1}\mu_i$, and $w_{i0} = -\frac{1}{2}\mu_i^T\Sigma_i^{-1}\mu_i - \frac{1}{2}\ln|\Sigma_i| + \ln P(\omega_i)$. The decision surfaces are hyperquadrics and can assume any of the general forms: hyperplanes, hyperspheres, pairs of hyperplanes, hyperellipsoids, hyperparaboloids, etc. The next slides show examples of this. In this general case we cannot intuitively draw the decision boundaries just by looking at the mean and covariance.