GenSVM: A Generalized Multiclass Support Vector Machine


Journal of Machine Learning Research 17 (2016) 1-42. Submitted 12/14; Revised 11/16; Published 12/16.

GenSVM: A Generalized Multiclass Support Vector Machine

Gerrit J.J. van den Burg (burg@ese.eur.nl)
Patrick J.F. Groenen (groenen@ese.eur.nl)
Econometric Institute, Erasmus University Rotterdam, P.O. Box 1738, 3000 DR Rotterdam, The Netherlands

Editor: Sathiya Keerthi

©2016 Gerrit J.J. van den Burg and Patrick J.F. Groenen.

Abstract

Traditional extensions of the binary support vector machine (SVM) to multiclass problems are either heuristics or require solving a large dual optimization problem. Here, a generalized multiclass SVM is proposed, called GenSVM. In this method classification boundaries for a K-class problem are constructed in a (K − 1)-dimensional space using a simplex encoding. Additionally, several different weightings of the misclassification errors are incorporated in the loss function, such that it generalizes three existing multiclass SVMs through a single optimization problem. An iterative majorization algorithm is derived that solves the optimization problem without the need of a dual formulation. This algorithm has the advantage that it can use warm starts during cross validation and during a grid search, which significantly speeds up the training phase. Rigorous numerical experiments compare linear GenSVM with seven existing multiclass SVMs on both small and large data sets. These comparisons show that the proposed method is competitive with existing methods in both predictive accuracy and training time, and that it significantly outperforms several existing methods on these criteria.

Keywords: support vector machines, SVM, multiclass classification, iterative majorization, MM algorithm, classifier comparison

1. Introduction

For binary classification, the support vector machine has shown to be very successful (Cortes and Vapnik, 1995). The SVM efficiently constructs linear or nonlinear classification boundaries and is able to yield a sparse solution through the so-called support vectors, that is, through those observations that are either not perfectly classified or are on the classification boundary. In addition, by regularizing the loss function the overfitting of the training data set is curbed. Due to its desirable characteristics several attempts have been made to extend the SVM to classification problems where the number of classes K is larger than two. Overall, these extensions differ considerably in the approach taken to include multiple classes. Three types of approaches for multiclass SVMs (MSVMs) can be distinguished.

First, there are heuristic approaches that use the binary SVM as an underlying classifier and decompose the K-class problem into multiple binary problems. The most commonly used heuristic is the one-vs-one (OvO) method, where decision boundaries are constructed between each pair of classes (Kreßel, 1999). OvO requires solving K(K − 1)/2 binary SVM problems, which can be substantial if the number of classes is large. An advantage of OvO is that the problems to be solved are smaller in size. On the other hand, the one-vs-all (OvA) heuristic constructs K classification boundaries, one separating each class from all the other classes (Vapnik, 1998). Although OvA requires fewer binary SVMs to be estimated, the complete data set is used for each classifier, which can create a high computational burden. Another heuristic approach is the directed acyclic graph (DAG) SVM proposed by Platt et al. (2000). DAGSVM is similar to the OvO approach except that the class prediction is done by successively voting away unlikely classes until only one remains. One problem with the OvO and OvA methods is that there are regions of the space for which class predictions are ambiguous, as illustrated in Figures 1a and 1b.

Figure 1: Illustration of ambiguity regions for common heuristic multiclass SVMs. In the shaded regions ties occur for which no classification rule has been explicitly trained. Figure (c) corresponds to an SVM where all classes are considered simultaneously, which eliminates any possible ties. Figures inspired by Statnikov et al. (2011). [Panels: (a) One vs. One, (b) One vs. All, (c) Non-heuristic.]

In practice, heuristic methods such as the OvO and OvA approaches are used more often than other multiclass SVM implementations. One of the reasons for this is that there are several software packages that efficiently solve the binary SVM, such as LibSVM (Chang and Lin, 2011). This package implements a variation of the sequential minimal optimization algorithm of Platt (1999). Implementations of other multiclass SVMs in high-level (statistical) programming languages are lacking, which reduces their use in practice.¹

The second type of extension of the binary SVM uses error correcting codes. In these methods the problem is decomposed into multiple binary classification problems based on a constructed coding matrix that determines the grouping of the classes in a specific binary subproblem (Dietterich and Bakiri, 1995; Allwein et al., 2001; Crammer and Singer, 2002b). Error correcting code SVMs can thus be seen as a generalization of OvO and OvA. In Dietterich and Bakiri (1995) and Allwein et al. (2001), a coding matrix is constructed that determines which class instances are paired against each other for each binary SVM. Both approaches require that the coding matrix is determined beforehand. However, it is a priori unclear how such a coding matrix should be chosen. In fact, as Crammer and Singer (2002b) show, finding the optimal coding matrix is an NP-complete problem.

1. An exception to this is the method of Lee et al. (2004), for which an R implementation exists. See http://www.stat.osu.edu/~yklee/software.html.

The third type of approaches are those that optimize one loss function to estimate all class boundaries simultaneously, the so-called single machine approaches (Rifkin and Klautau, 2004). In the literature, such methods have been proposed by, among others, Weston and Watkins (1998), Bredensteiner and Bennett (1999), Crammer and Singer (2002a), Lee et al. (2004), and Guermeur and Monfrini (2011). The method of Weston and Watkins (1998) yields a fairly large quadratic problem with a large number of slack variables, that is, K − 1 slack variables for each observation. The method of Crammer and Singer (2002a) reduces this number of slack variables by only penalizing the largest misclassification error. In addition, their method does not include a bias term in the decision boundaries, which is advantageous for solving the dual problem. Interestingly, this approach does not reduce parsimoniously to the binary SVM for K = 2. The method of Lee et al. (2004) uses a sum-to-zero constraint on the decision functions to reduce the dimensionality of the problem. This constraint effectively means that the solution of the multiclass SVM lies in a (K − 1)-dimensional subspace of the full K dimensions considered. The size of the margins is reduced according to the number of classes, such that asymptotic convergence is obtained to the Bayes optimal decision boundary when the regularization term is ignored (Rifkin and Klautau, 2004). Finally, the method of Guermeur and Monfrini (2011) is a quadratic extension of the method developed by Lee et al. (2004). This extension keeps the sum-to-zero constraint on the decision functions, drops the nonnegativity constraint on the slack variables, and adds a quadratic function of the slack variables to the loss function. This means that at the optimum the slack variables are only positive on average, which differs from common SVM formulations.

The existing approaches to multiclass SVMs suffer from several problems. All current single machine multiclass extensions of the binary SVM rely on solving a potentially large dual optimization problem. This can be disadvantageous when a solution has to be found in a small amount of time, since iteratively improving the dual solution does not guarantee that the primal solution is improved as well. Thus, stopping early can lead to poor predictive performance. In addition, the dual of such single machine approaches should be solvable quickly in order to compete with existing heuristic approaches. Almost all single machine approaches rely on misclassifications of the observed class with each of the other classes. By simply summing these misclassification errors (as in Lee et al., 2004) observations with multiple errors contribute more than those with a single misclassification do. Consequently, observations with multiple misclassifications have a stronger influence on the solution than those with a single misclassification, which is not a desirable property for a multiclass SVM, as it overemphasizes objects that are misclassified with respect to multiple classes. Here, it is argued that there is no reason to penalize certain misclassification regions more than others.

Single machine approaches are preferred for their ability to capture the multiclass classification problem in a single model. A parallel can be drawn here with multinomial regression and logistic regression. In this case, multinomial regression reduces exactly to the binary logistic regression method when K = 2, both techniques are single machine approaches, and many of the properties of logistic regression extend to multinomial regression.

Therefore, it can be considered natural to use a single machine approach for the multiclass SVM that reduces parsimoniously to the binary SVM when K = 2.

The idea of casting the multiclass SVM problem to K − 1 dimensions is appealing, since it reduces the dimensionality of the problem and is also present in other multiclass classification methods such as multinomial regression and linear discriminant analysis. However, the sum-to-zero constraint employed by Lee et al. (2004) creates an additional burden on the dual optimization problem (Dogan et al., 2011). Therefore, it would be desirable to cast the problem to K − 1 dimensions in another manner. Below a simplex encoding will be introduced to achieve this goal. The simplex encoding for multiclass SVMs has been proposed earlier by Hill and Doucet (2007) and Mroueh et al. (2012), although the method outlined below differs from these two approaches. Note that the simplex coding approach by Mroueh et al. (2012) was shown to be equivalent to that of Lee et al. (2004) by Ávila Pires et al. (2013). An advantage of the simplex encoding is that in contrast to methods such as OvO and OvA, there are no regions of ambiguity in the prediction space (see Figure 1c). In addition, the low dimensional projection also has advantages for understanding the method, since it allows for a geometric interpretation. The geometric interpretation of existing single machine multiclass SVMs is often difficult since most are based on a dual optimization approach with little attention for a primal problem based on hinge errors.

A new flexible and general multiclass SVM is proposed, called GenSVM. This method uses the simplex encoding to formulate the multiclass SVM problem as a single optimization problem that reduces to the binary SVM when K = 2. By using a flexible hinge function and an ℓp norm of the errors the GenSVM loss function incorporates three existing multiclass SVMs that use the sum of the hinge errors, and extends these methods. In the linear version of GenSVM, K − 1 linear combinations of the features are estimated next to the bias terms. In the nonlinear version, kernels can be used in a similar manner as can be done for binary SVMs. The resulting GenSVM loss function is convex in the parameters to be estimated. For this loss function an iterative majorization (IM) algorithm will be derived with guaranteed descent to the global minimum. By solving the optimization problem in the primal it is possible to use warm starts during a hyperparameter grid search or during cross validation, which makes the resulting algorithm very competitive in total training time, even for large data sets.

To evaluate its performance, GenSVM is compared to seven of the multiclass SVMs described above on several small data sets and one large data set. The smaller data sets are used to assess the classification accuracy of GenSVM, whereas the large data set is used to verify feasibility of GenSVM for large data sets. Due to the computational cost of these rigorous experiments only comparisons of linear multiclass SVMs are performed, and experiments on nonlinear MSVMs are considered outside the scope of this paper. Existing comparisons of multiclass SVMs in the literature do not determine any statistically significant differences in performance between classifiers, and resort to tables of accuracy rates for the comparisons (for instance Hsu and Lin, 2002). Using suggestions from the benchmarking literature, predictive performance and training time of all classifiers are compared using performance profiles and rank tests. The rank tests are used to uncover statistically significant differences between classifiers.

This paper is organized as follows. Section 2 introduces the novel generalized multiclass SVM.

In Section 3, features of the iterative majorization theory are reviewed and a number of useful properties are highlighted. Section 4 derives the IM algorithm for GenSVM, and presents pseudocode for the algorithm. Extensions of GenSVM to nonlinear classification boundaries are discussed in Section 5. A numerical comparison of GenSVM with existing multiclass SVMs on empirical data sets is done in Section 6. Section 7 concludes the paper.

2. GenSVM

Before introducing GenSVM formally, consider a small illustrative example of a hypothetical data set of n = 90 objects with K = 3 classes and m = 2 attributes. Figure 2a shows the data set in the space of these two attributes x1 and x2, with different classes denoted by different symbols. Figure 2b shows the (K − 1)-dimensional simplex encoding of the data after an additional RBF kernel transformation has been applied and the mapping has been optimized to minimize misclassification errors. In this figure, the triangle shown in the center corresponds to a regular K-simplex in K − 1 dimensions, and the solid lines perpendicular to the faces of this simplex are the decision boundaries. This (K − 1)-dimensional space will be referred to as the simplex space throughout this paper. The mapping from the input space to this simplex space is optimized by minimizing the misclassification errors, which are calculated by measuring the distance of an object to the decision boundaries in the simplex space. Prediction of a class label is also done in this simplex space, by finding the nearest simplex vertex for the object. Figure 2c illustrates the decision boundaries in the original space of the input attributes x1 and x2. In Figures 2b and 2c, the support vectors can be identified as the objects that lie on or beyond the dashed margin lines of their associated class. Note that the use of the simplex encoding ensures that for every point in the predictor space a class is predicted, hence no ambiguity regions can exist in the GenSVM solution.

The misclassification errors are formally defined as follows. Let x_i ∈ R^m be an object vector corresponding to m attributes, and let y_i denote the class label of object i with y_i ∈ {1, ..., K}, for i ∈ {1, ..., n}. Furthermore, let W ∈ R^{m×(K−1)} be a weight matrix, and define a translation vector t ∈ R^{K−1} for the bias terms. Then, object i is represented in the (K − 1)-dimensional simplex space by s_i' = x_i'W + t'. Note that here the linear version of GenSVM is described; the nonlinear version is described in Section 5.

To obtain the misclassification error of an object, the corresponding simplex space vector s_i is projected on each of the decision boundaries that separate the true class of an object from another class. For the errors to be proportional with the distance to the decision boundaries, a regular K-simplex in R^{K−1} is used with distance 1 between each pair of vertices. Let U_K be the K × (K − 1) coordinate matrix of this simplex, where a row u_k' of U_K gives the coordinates of a single vertex k. Then, it follows that with k ∈ {1, ..., K} and l ∈ {1, ..., K − 1} the elements of U_K are given by

$$u_{kl} = \begin{cases} -\dfrac{1}{\sqrt{2(l^2+l)}} & \text{if } k \leq l \\[1ex] \dfrac{l}{\sqrt{2(l^2+l)}} & \text{if } k = l+1 \\[1ex] 0 & \text{if } k > l+1. \end{cases} \qquad (1)$$

See Appendix A for a derivation of this expression.
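As an illustration of this simplex encoding, the following is a minimal NumPy sketch (not the authors' C implementation) that builds the coordinate matrix U_K of (1) and checks that all pairs of vertices are at distance 1; the function name simplex_coordinates is illustrative.

import numpy as np

def simplex_coordinates(K):
    """Return the K x (K-1) simplex coordinate matrix U_K of Eq. (1)."""
    U = np.zeros((K, K - 1))
    for k in range(1, K + 1):          # vertex index k = 1, ..., K
        for l in range(1, K):          # dimension index l = 1, ..., K-1
            denom = np.sqrt(2.0 * (l * l + l))
            if k <= l:
                U[k - 1, l - 1] = -1.0 / denom
            elif k == l + 1:
                U[k - 1, l - 1] = l / denom
            # entries with k > l + 1 remain zero
    return U

U = simplex_coordinates(4)
# all off-diagonal pairwise distances should equal 1
print(np.linalg.norm(U[:, None, :] - U[None, :, :], axis=-1).round(6))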

Figure 2: Illustration of GenSVM for a 2D data set with K = 3 classes. In (a) the original data is shown, with different symbols denoting different classes. Figure (b) shows the mapping of the data to the (K − 1)-dimensional simplex space, after an additional RBF kernel mapping has been applied and the optimal solution has been determined. The decision boundaries in this space are fixed as the perpendicular bisectors of the faces of the simplex, which is shown as the gray triangle. Figure (c) shows the resulting boundaries mapped back to the original input space, as can be seen by comparing with (a). In Figures (b) and (c) the dashed lines show the margins of the SVM solution. [Panels: (a) input space, (b) simplex space, (c) input space with boundaries.]

Figure 3 shows an illustration of how the misclassification errors are computed for a single object. Consider object A with true class y_A = 2. It is clear that object A is misclassified as it is not located in the shaded area that has vertex u_2 as the nearest vertex. The boundaries of the shaded area are given by the perpendicular bisectors of the edges of the simplex between vertices u_2 and u_1 and between vertices u_2 and u_3, and form the decision boundaries for class 2. The error for object A is computed by determining the distance from the object to each of these decision boundaries. Let q_A^(21) and q_A^(23) denote these distances to the class boundaries, which are obtained by projecting s_A' = x_A'W + t' on u_2 − u_1 and u_2 − u_3 respectively, as illustrated in the figure. Generalizing this reasoning, scalars q_i^(kj) can be defined to measure the projection distance of object i onto the boundary between class k and j in the simplex space, as

$$q_i^{(kj)} = (x_i'W + t')(u_k - u_j). \qquad (2)$$

It is required that the GenSVM loss function is both general and flexible, such that it can easily be tuned for the specific data set at hand. To achieve this, a loss function is constructed with a number of different weightings, each with a specific effect on the object distances q_i^(kj). In the proposed loss function, flexibility is added through the use of the Huber hinge function instead of the absolute hinge function, and by using the ℓp norm of the hinge errors instead of the sum. The motivation for these choices follows.

Figure 3: Graphical illustration of the calculation of distances q_A^(y_A j) for an object A with y_A = 2 and K = 3. The figure shows the situation in the (K − 1)-dimensional space. The distance q_A^(21) is calculated by projecting s_A' = x_A'W + t' on u_2 − u_1, and the distance q_A^(23) is found by projecting s_A' on u_2 − u_3. The boundary between the class 1 and class 3 regions has been omitted for clarity, but lies along u_2.

As is customary for SVMs, a hinge loss is used to ensure that instances that do not cross their class margin will yield zero error. Here, the flexible and continuous Huber hinge loss is used (after the Huber error in robust statistics, see Huber, 1964), which is defined as

$$h(q) = \begin{cases} 1 - q - \dfrac{\kappa+1}{2} & \text{if } q \leq -\kappa \\[1ex] \dfrac{1}{2(\kappa+1)}(1-q)^2 & \text{if } q \in (-\kappa, 1] \\[1ex] 0 & \text{if } q > 1, \end{cases} \qquad (3)$$

with κ > −1. The Huber hinge loss has been independently introduced in Chapelle (2007), Rosset and Zhu (2007), and Groenen et al. (2008). This hinge error is zero when an instance is classified correctly with respect to its class margin. However, in contrast to the absolute hinge error, it is continuous due to a quadratic region in the interval (−κ, 1]. This quadratic region allows for a softer weighting of objects close to the decision boundary. Additionally, the smoothness of the Huber hinge error is a desirable property for the iterative majorization algorithm derived in Section 4.1. Note that the Huber hinge error approaches the absolute hinge for κ ↓ −1, and the quadratic hinge for κ → ∞.

The Huber hinge error is applied to each of the distances q_i^(y_i j), for j ≠ y_i. Thus, no error is counted when the object is correctly classified. For each of the objects, errors with respect to the other classes are summed using an ℓp norm to obtain the total object error

$$\Biggl( \sum_{\substack{j=1 \\ j \neq y_i}}^{K} h^p\bigl( q_i^{(y_i j)} \bigr) \Biggr)^{1/p}.$$
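A minimal sketch of these two building blocks follows, assuming the simplex_coordinates helper from the snippet above and 0-based class labels; the names huber_hinge and object_error are illustrative and not taken from the GenSVM source.

import numpy as np

def huber_hinge(q, kappa):
    """Huber hinge h(q) of Eq. (3), with kappa > -1; q may be an array."""
    q = np.asarray(q, dtype=float)
    out = np.zeros_like(q)
    left = q <= -kappa
    mid = (~left) & (q <= 1.0)
    out[left] = 1.0 - q[left] - (kappa + 1.0) / 2.0
    out[mid] = (1.0 - q[mid]) ** 2 / (2.0 * (kappa + 1.0))
    return out                          # zero for q > 1

def object_error(x_i, y_i, W, t, U, p=1.0, kappa=0.0):
    """l_p norm of the Huber hinge errors of a single object."""
    s_i = x_i @ W + t                   # simplex-space representation
    others = [j for j in range(U.shape[0]) if j != y_i]
    q = np.array([s_i @ (U[y_i] - U[j]) for j in others])   # Eq. (2)
    return np.sum(huber_hinge(q, kappa) ** p) ** (1.0 / p)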

The ℓp norm is added to provide a form of regularization on Huber weighted errors for instances that are misclassified with respect to multiple classes. As argued in the Introduction, simply summing misclassification errors can lead to overemphasizing of instances with multiple misclassification errors. By adding an ℓp norm of the hinge errors the influence of such instances on the loss function can be tuned.

With the addition of the ℓp norm on the hinge errors it is possible to illustrate how GenSVM generalizes existing methods. For instance, with p = 1 and κ ↓ −1, the loss function solves the same problem as the method of Lee et al. (2004). Next, for p = 2 and κ ↓ −1 it resembles that of Guermeur and Monfrini (2011). Finally, for p = ∞ and κ ↓ −1 the ℓp norm reduces to the max norm of the hinge errors, which corresponds to the method of Crammer and Singer (2002a). Note that in each case the value of κ can additionally be varied to include an even broader family of loss functions.

To illustrate the effects of p and κ on the total object error, refer to Figure 4. In Figures 4a and 4b, the value of p is set to p = 1 and p = 2 respectively, while maintaining the absolute hinge error using κ = −0.95. A reference point is plotted at a fixed position in the area of the simplex space where there is a nonzero error with respect to two classes. It can be seen from this reference point that the value of the combined error is higher when p = 1. With p = 2 the combined error at the reference point approximates the Euclidean distance to the margin, when κ ↓ −1. Figures 4a, 4c, and 4d show the effect of varying κ. It can be seen that the error near the margin becomes more quadratic with increasing κ. In fact, as κ increases the error approaches the squared Euclidean distance to the margin, which can be used to obtain a quadratic hinge multiclass SVM. Both of these effects will become stronger when the number of classes increases, as increasingly more objects will have errors with respect to more than one class.

Next, let ρ_i ≥ 0 denote optional object weights, which are introduced to allow flexibility in the way individual objects contribute to the total loss function. With these individual weights it is possible to correct for different group sizes, or to give additional weights to misclassifications of certain classes. When correcting for group sizes, the weights can be chosen as

$$\rho_i = \frac{n}{n_k K}, \qquad i \in G_k, \qquad (4)$$

where G_k = {i : y_i = k} is the set of objects belonging to class k, and n_k = |G_k|. The complete GenSVM loss function combining all n objects can now be formulated as

$$L_{\mathrm{MSVM}}(W, t) = \frac{1}{n} \sum_{k=1}^{K} \sum_{i \in G_k} \rho_i \Biggl( \sum_{j \neq k} h^p\bigl( q_i^{(kj)} \bigr) \Biggr)^{1/p} + \lambda \,\mathrm{tr}\, W'W, \qquad (5)$$

where λ tr W'W is the penalty term to avoid overfitting, and λ > 0 is the regularization parameter. Note that for the case where K = 2, the above loss function reduces to the loss function for the binary SVM given in Groenen et al. (2008), with Huber hinge errors.
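Combining the pieces above, a minimal reference sketch of the loss in (5) with the group-size weights of (4) could look as follows (again assuming 0-based labels and the helpers defined earlier; this is an illustration, not the authors' optimized C implementation).

import numpy as np

def group_size_weights(y, K):
    """rho_i = n / (n_k * K) for objects in class k, as in Eq. (4)."""
    counts = np.bincount(y, minlength=K)
    return len(y) / (counts[y] * K)

def gensvm_loss(X, y, W, t, lam, p=1.0, kappa=0.0, rho=None):
    """Evaluate L_MSVM(W, t) of Eq. (5) for data X (n x m) and labels y."""
    n, K = X.shape[0], int(y.max()) + 1
    U = simplex_coordinates(K)
    rho = np.ones(n) if rho is None else rho
    S = X @ W + t                              # n x (K-1) simplex-space matrix
    total = 0.0
    for i in range(n):
        others = [j for j in range(K) if j != y[i]]
        q = np.array([S[i] @ (U[y[i]] - U[j]) for j in others])
        total += rho[i] * np.sum(huber_hinge(q, kappa) ** p) ** (1.0 / p)
    return total / n + lam * np.trace(W.T @ W)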

Figure 4: Illustration of the ℓp norm of the Huber weighted errors. Comparing figures (a) and (b) shows the effect of the ℓp norm. With p = 1 objects that have errors w.r.t. both classes are penalized more strongly than those with only one error, whereas with p = 2 this is not the case. Figures (a), (c), and (d) compare the effect of the κ parameter, with p = 1. This shows that with a large value of κ, the errors close to the boundary are weighted quadratically. Note that s1 and s2 indicate the dimensions of the simplex space. [Panels: (a) p = 1 and κ = −0.95, (b) p = 2 and κ = −0.95, (c) p = 1 and κ = 1.0, (d) p = 1 and κ = 5.0.]

The outline of a proof for the convexity of the loss function in (5) is given. First, note that the distances q_i^(kj) in the loss function are affine in W and t. Hence, if the loss function is convex in q_i^(kj) it is convex in W and t as well. Second, the Huber hinge function is trivially convex in q_i^(kj), since each separate piece of the function is convex, and the Huber hinge is continuous. Third, the ℓp norm is a convex function by the Minkowski inequality, and it is monotonically increasing by definition. Thus, it follows that the ℓp norm of the Huber weighted instance errors is convex (see for instance Rockafellar, 1997). Next, since it is required that the weights ρ_i are non-negative, the sum in the first term of (5) is a convex combination. Finally, the penalty term can also be shown to be convex, since tr W'W is the square of the Frobenius norm of W, and it is required that λ > 0. Thus, it holds that the loss function in (5) is convex in W and t.

Predicting class labels in GenSVM can be done as follows. Let (W*, t*) denote the parameters that minimize the loss function. Predicting the class label of an unseen sample x_{n+1} can then be done by first mapping it to the simplex space, using the optimal projection: s_{n+1}' = x_{n+1}'W* + t*'. The predicted class label is then simply the label corresponding to the nearest simplex vertex as measured by the squared Euclidean norm, or

$$\hat{y}_{n+1} = \arg\min_{k} \, \| s_{n+1} - u_k \|^2, \qquad \text{for } k = 1, \ldots, K. \qquad (6)$$
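A sketch of this prediction rule, reusing the helpers above (0-based labels; gensvm_predict is an illustrative name):

import numpy as np

def gensvm_predict(X_new, W, t, U):
    """Assign each row of X_new to the class of its nearest simplex vertex (Eq. 6)."""
    S = X_new @ W + t                                  # map to simplex space
    d2 = ((S[:, None, :] - U[None, :, :]) ** 2).sum(axis=-1)
    return d2.argmin(axis=1)                           # 0-based predicted labels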

3. Iterative Majorization

To minimize the loss function given in (5), an iterative majorization (IM) algorithm will be derived. Iterative majorization was first described by Weiszfeld (1937); however, the first application of the algorithm in the context of a line search comes from Ortega and Rheinboldt (1970, p. 253-255). During the late 1970s, the method was independently developed by De Leeuw (1977) as part of the SMACOF algorithm for multidimensional scaling, and by Voss and Eckhardt (1980) as a general minimization method. For the reader unfamiliar with the iterative majorization algorithm a more detailed description has been included in Appendix B, and further examples can be found in for instance Hunter and Lange (2004).

The asymptotic convergence rate of the IM algorithm is linear, which is less than that of the Newton-Raphson algorithm (De Leeuw, 1994). However, the largest improvements in the loss function will occur in the first few steps of the iterative majorization algorithm, where the asymptotic linear rate does not apply (Havel, 1991). This property will become very useful for GenSVM as it allows for a quick approximation to the exact SVM solution in few iterations.

There is no straightforward technique for deriving the majorization function for any given function. However, in the next section the derivation of the majorization function for the GenSVM loss function is presented using an outside-in approach. In this approach, each function that constitutes the loss function is majorized separately and the majorization functions are combined. Two properties of majorization functions that are useful for this derivation are now formally defined. In these expressions, x̄ is a supporting point, as defined in Appendix B.

P1. Let f_1 : Y → Z, f_2 : X → Y, and define f = f_1 ∘ f_2 : X → Z, such that for x ∈ X, f(x) = f_1(f_2(x)). If g_1 : Y × Y → Z is a majorization function of f_1, then g : X × X → Z defined as g = g_1 ∘ f_2 is a majorization function of f. Thus for x, x̄ ∈ X it holds that g(x, x̄) = g_1(f_2(x), f_2(x̄)) is a majorization function of f(x) at x̄.

P2. Let f_i : X → Z and define f : X → Z such that f(x) = Σ_i a_i f_i(x) for x ∈ X, with a_i ≥ 0 for all i. If g_i : X × X → Z is a majorization function for f_i at a point x̄ ∈ X, then g : X × X → Z given by g(x, x̄) = Σ_i a_i g_i(x, x̄) is a majorization function of f.

Proofs of these properties are omitted, as they follow directly from the requirements for a majorization function given in Appendix B. The first property allows for the use of the outside-in approach to majorization, as will be illustrated in the next section.

4. GenSVM Optimization and Implementation

In this section, a quadratic majorization function for GenSVM will be derived. Although it is possible to derive a majorization algorithm for general values of the ℓp norm parameter,² the following derivation will restrict this value to the interval p ∈ [1, 2] since this simplifies the derivation and avoids the issue that quadratic majorization can become slow for p > 2. Pseudocode for the derived algorithm will be presented, as well as an analysis of the computational complexity of the algorithm. Finally, an important remark on the use of warm starts in the algorithm is given.

2. For a majorization algorithm of the ℓp norm with p > 2, see Groenen et al. (1999).

4.1 Majorization Derivation

To shorten the notation, define V = [t  W']', z_i' = [1  x_i'], and δ_kj = u_k − u_j, such that q_i^(kj) = z_i'Vδ_kj. With this notation it becomes sufficient to optimize the loss function with respect to V. Formulated in this manner, (5) becomes

$$L_{\mathrm{MSVM}}(V) = \frac{1}{n} \sum_{k=1}^{K} \sum_{i \in G_k} \rho_i \Biggl( \sum_{j \neq k} h^p\bigl( q_i^{(kj)} \bigr) \Biggr)^{1/p} + \lambda \,\mathrm{tr}\, V'JV, \qquad (7)$$

where J is an (m + 1) × (m + 1) diagonal matrix with J_{i,i} = 1 for i > 1 and zero elsewhere. To derive a majorization function for this expression the outside-in approach will be used, together with the properties of majorization functions. In what follows, variables with a bar denote supporting points for the IM algorithm. The goal of the derivation is to find a quadratic majorization function in V such that

$$L_{\mathrm{MSVM}}(V) \leq \mathrm{tr}\, V'Z'AZV - 2\,\mathrm{tr}\, V'Z'B + C, \qquad (8)$$

where A, B, and C are coefficients of the majorization depending on V̄. The matrix Z is simply the n × (m + 1) matrix with rows z_i'. Property P2 above means that the summation over instances in the loss function can be ignored for now. Moreover, the regularization term is quadratic in V, and thus requires no majorization.

The outermost function for which a majorization function has to be found is thus the ℓp norm of the Huber hinge errors. Hence it is possible to consider the function f(x) = ||x||_p for majorization. A majorization function for f(x) can be constructed, but a discontinuity in the derivative at x = 0 will remain (Tsutsu and Morikawa, 2012). To avoid the discontinuity in the derivative of the ℓp norm, the following inequality is needed (Hardy et al., 1934, eq. 2.10.3)

$$\Biggl( \sum_{j \neq k} h^p\bigl( q_i^{(kj)} \bigr) \Biggr)^{1/p} \leq \sum_{j \neq k} h\bigl( q_i^{(kj)} \bigr).$$

This inequality can be used as a majorization function only if equality holds at the supporting point

$$\Biggl( \sum_{j \neq k} h^p\bigl( \bar{q}_i^{(kj)} \bigr) \Biggr)^{1/p} = \sum_{j \neq k} h\bigl( \bar{q}_i^{(kj)} \bigr).$$

It is not difficult to see that this only holds if at most one of the h(q̄_i^(kj)) errors is nonzero for j ≠ k. Thus an indicator variable ε_i is introduced which is 1 if at most one of these errors is nonzero, and 0 otherwise. Then it follows that

$$L_{\mathrm{MSVM}}(V) \leq \frac{1}{n} \sum_{k=1}^{K} \sum_{i \in G_k} \rho_i \Biggl[ \varepsilon_i \sum_{j \neq k} h\bigl( q_i^{(kj)} \bigr) + (1 - \varepsilon_i) \Biggl( \sum_{j \neq k} h^p\bigl( q_i^{(kj)} \bigr) \Biggr)^{1/p} \Biggr] + \lambda \,\mathrm{tr}\, V'JV. \qquad (9)$$

Now, the next function for which a majorization needs to be found is f_1(x) = x^{1/p}. From the inequality a^α b^β ≤ αa + βb, with α + β = 1 (Hardy et al., 1934, Theorem 37), a linear majorization inequality can be constructed for this function by substituting a = x, b = x̄, α = 1/p and β = 1 − 1/p (Groenen and Heiser, 1996). This yields

$$f_1(x) = x^{1/p} \leq \tfrac{1}{p}\, \bar{x}^{1/p - 1} x + \bigl(1 - \tfrac{1}{p}\bigr) \bar{x}^{1/p} = g_1(x, \bar{x}).$$

Applying this majorization and using property P1 gives

$$\Biggl( \sum_{j \neq k} h^p\bigl( q_i^{(kj)} \bigr) \Biggr)^{1/p} \leq \frac{1}{p} \Biggl( \sum_{j \neq k} h^p\bigl( \bar{q}_i^{(kj)} \bigr) \Biggr)^{1/p - 1} \sum_{j \neq k} h^p\bigl( q_i^{(kj)} \bigr) + \Bigl( 1 - \frac{1}{p} \Bigr) \Biggl( \sum_{j \neq k} h^p\bigl( \bar{q}_i^{(kj)} \bigr) \Biggr)^{1/p}.$$

Plugging this into (9) and collecting terms yields

$$L_{\mathrm{MSVM}}(V) \leq \frac{1}{n} \sum_{k=1}^{K} \sum_{i \in G_k} \rho_i \Biggl[ \varepsilon_i \sum_{j \neq k} h\bigl( q_i^{(kj)} \bigr) + (1 - \varepsilon_i)\, \omega_i \sum_{j \neq k} h^p\bigl( q_i^{(kj)} \bigr) \Biggr] + \Gamma^{(1)} + \lambda \,\mathrm{tr}\, V'JV,$$

with

$$\omega_i = \frac{1}{p} \Biggl( \sum_{j \neq k} h^p\bigl( \bar{q}_i^{(kj)} \bigr) \Biggr)^{1/p - 1}. \qquad (10)$$

The constant Γ^(1) contains all terms that only depend on previous errors q̄_i^(kj).

The next majorization step by the outside-in approach is to find a quadratic majorization function for f_2(x) = h^p(x), of the form

$$f_2(x) = h^p(x) \leq a(\bar{x}, p)\,x^2 - 2\,b(\bar{x}, p)\,x + c(\bar{x}, p) = g_2(x, \bar{x}).$$

Since this derivation is mostly an algebraic exercise it has been moved to Appendix C.

In the remainder of this derivation, a_ijk^(p) will be used to abbreviate a(q̄_i^(kj), p), with similar abbreviations for b and c. Using these majorizations and making the dependence on V explicit by substituting q_i^(kj) = z_i'Vδ_kj gives

$$L_{\mathrm{MSVM}}(V) \leq \frac{1}{n} \sum_{k=1}^{K} \sum_{i \in G_k} \rho_i \varepsilon_i \sum_{j \neq k} \Bigl[ a_{ijk}^{(1)}\, z_i'V\delta_{kj}\delta_{kj}'V'z_i - 2 b_{ijk}^{(1)}\, z_i'V\delta_{kj} \Bigr] + \frac{1}{n} \sum_{k=1}^{K} \sum_{i \in G_k} \rho_i (1 - \varepsilon_i)\omega_i \sum_{j \neq k} \Bigl[ a_{ijk}^{(p)}\, z_i'V\delta_{kj}\delta_{kj}'V'z_i - 2 b_{ijk}^{(p)}\, z_i'V\delta_{kj} \Bigr] + \Gamma^{(2)} + \lambda \,\mathrm{tr}\, V'JV,$$

where Γ^(2) again contains all constant terms. Due to the dependence on the matrix δ_kjδ_kj', the above majorization function is not yet in the desired quadratic form of (8). However, since the maximum eigenvalue of δ_kjδ_kj' is 1 by definition of the simplex coordinates, it follows that the matrix δ_kjδ_kj' − I is negative semidefinite. Hence, it can be shown that the inequality z_i'(V − V̄)(δ_kjδ_kj' − I)(V − V̄)'z_i ≤ 0 holds (Bijleveld and De Leeuw, 1991, Theorem 4). Rewriting this gives the majorization inequality

$$z_i'V\delta_{kj}\delta_{kj}'V'z_i \leq z_i'VV'z_i - 2\, z_i'V(I - \delta_{kj}\delta_{kj}')\bar{V}'z_i + z_i'\bar{V}(I - \delta_{kj}\delta_{kj}')\bar{V}'z_i.$$

With this inequality the majorization inequality becomes

$$L_{\mathrm{MSVM}}(V) \leq \frac{1}{n} \sum_{k=1}^{K} \sum_{i \in G_k} \rho_i \sum_{j \neq k} \Bigl[ \varepsilon_i a_{ijk}^{(1)} + (1 - \varepsilon_i)\omega_i a_{ijk}^{(p)} \Bigr] z_i'V(V - 2\bar{V})'z_i - \frac{2}{n} \sum_{k=1}^{K} \sum_{i \in G_k} \rho_i\, z_i'V \sum_{j \neq k} \Bigl[ \varepsilon_i \bigl( b_{ijk}^{(1)} - a_{ijk}^{(1)} \bar{q}_i^{(kj)} \bigr) + (1 - \varepsilon_i)\omega_i \bigl( b_{ijk}^{(p)} - a_{ijk}^{(p)} \bar{q}_i^{(kj)} \bigr) \Bigr] \delta_{kj} + \Gamma^{(3)} + \lambda \,\mathrm{tr}\, V'JV, \qquad (11)$$

where q̄_i^(kj) = z_i'V̄δ_kj. This majorization function is quadratic in V and can thus be used in the IM algorithm. To derive the first-order condition used in the update step of the IM algorithm (step 2 in Appendix B), matrix notation for the above expression is introduced. Let A be an n × n diagonal matrix with elements α_i, and let B be an n × (K − 1) matrix with rows β_i', where

$$\alpha_i = \frac{1}{n} \rho_i \sum_{j \neq k} \Bigl[ \varepsilon_i a_{ijk}^{(1)} + (1 - \varepsilon_i)\omega_i a_{ijk}^{(p)} \Bigr], \qquad (12)$$

$$\beta_i = \frac{1}{n} \rho_i \sum_{j \neq k} \Bigl[ \varepsilon_i \bigl( b_{ijk}^{(1)} - a_{ijk}^{(1)} \bar{q}_i^{(kj)} \bigr) + (1 - \varepsilon_i)\omega_i \bigl( b_{ijk}^{(p)} - a_{ijk}^{(p)} \bar{q}_i^{(kj)} \bigr) \Bigr] \delta_{kj}. \qquad (13)$$

Then the majorization function of L_MSVM(V) given in (11) can be written as

$$L_{\mathrm{MSVM}}(V) \leq \mathrm{tr}\, (V - 2\bar{V})'Z'AZV - 2\,\mathrm{tr}\, B'ZV + \Gamma^{(3)} + \lambda \,\mathrm{tr}\, V'JV = \mathrm{tr}\, V'(Z'AZ + \lambda J)V - 2\,\mathrm{tr}\, (\bar{V}'Z'A + B')ZV + \Gamma^{(3)}.$$

This majorization function has the desired functional form described in (8). Differentiation with respect to V and equating to zero yields the linear system

$$(Z'AZ + \lambda J)V = Z'AZ\bar{V} + Z'B. \qquad (14)$$

The update V⁺ that solves this system can then be calculated efficiently by Gaussian elimination.

4.2 Algorithm Implementation and Complexity

Pseudocode for GenSVM is given in Algorithm 1. As can be seen, the algorithm simply updates all instance weights at each iteration, starting by determining the indicator variable ε_i. In practice, some calculations can be done efficiently for all instances by using matrix algebra. When step doubling (see Appendix B) is applied in the majorization algorithm, line 25 is replaced by V ← 2V⁺ − V̄. In the implementation, step doubling is applied after a burn-in of 50 iterations.

The implementation used in the experiments described in Section 6 is written in C, using the ATLAS (Whaley and Dongarra, 1998) and LAPACK (Anderson et al., 1999) libraries. The source code for this C library is available under the open source GNU GPL license, through an online repository. A thorough description of the implementation is available in the package documentation.

The complexity of a single iteration of the IM algorithm is O(n(m + 1)²), assuming that n > m > K. As noted earlier, the convergence rate of the general IM algorithm is linear. Computational complexity of standard SVM solvers that solve the dual problem through decomposition methods lies between O(n²) and O(n³), depending on the value of λ (Bottou and Lin, 2007). An efficient algorithm for the method of Crammer and Singer (2002a) developed by Keerthi et al. (2008) has a complexity of O(nm̄K) per iteration, where m̄ ≤ m is the average number of nonzero features per training instance. In the methods of Lee et al. (2004) and Weston and Watkins (1998), a quadratic programming problem with n(K − 1) dual variables needs to be solved, which is typically done using a standard solver. An analysis of the exact convergence of GenSVM, including the expected number of iterations needed to achieve convergence at a factor ε, is outside the scope of the current work and a subject for further research.

4.3 Smart Initialization

When training machine learning algorithms to determine the optimal hyperparameters, it is common to use cross validation (CV). With GenSVM it is possible to initialize the matrix V such that the final result of a fold is used as the initial value V₀ for the next fold. This same technique can be used when searching for the optimal hyperparameter configuration in a grid search, by initializing the weight matrix with the outcome of the previous configuration. Such warm-start initialization greatly reduces the time needed to perform cross validation with GenSVM. It is important to note here that using warm starts is not easily possible with dual optimization approaches. Therefore, the ability to use warm starts can be seen as an advantage of solving the GenSVM optimization problem in the primal.
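The linear system in (14) is the core of each iteration of Algorithm 1 below. A minimal NumPy sketch of this single update step is given here; it assumes the per-instance coefficients α_i and β_i of (12) and (13) have already been computed (their calculation via Table 4 in Appendix C is omitted), and the function name im_update is illustrative rather than part of the GenSVM package.

import numpy as np

def im_update(Z, alpha, B, V_bar, lam):
    """Solve (Z'AZ + lam*J) V = Z'AZ V_bar + Z'B, with A = diag(alpha)."""
    m1 = Z.shape[1]                     # m + 1 columns (bias column included)
    J = np.eye(m1)
    J[0, 0] = 0.0                       # no penalty on the bias row of V
    ZAZ = Z.T @ (alpha[:, None] * Z)    # Z'AZ
    lhs = ZAZ + lam * J
    rhs = ZAZ @ V_bar + Z.T @ B
    return np.linalg.solve(lhs, rhs)    # the update V+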

Algorithm 1: GenSVM Algorithm

Input: X, y, ρ, p, κ, λ, ε
Output: V

1: K ← max(y)
2: t ← 1
3: Z ← [1  X]
4: Let V ← V₀
5: Generate J and U_K
6: L_t ← L_MSVM(V)
7: L_{t−1} ← (1 + 2ε)L_t
8: while (L_{t−1} − L_t)/L_t > ε do
9:   for i ← 1 to n do
10:    Compute q_i^(y_i j) = z_i'Vδ_{y_i j} for all j ≠ y_i
11:    Compute h(q_i^(y_i j)) for all j ≠ y_i by (3)
12:    if ε_i = 1 then
13:      Compute a_ijk^(1) and b_ijk^(1) for all j ≠ y_i according to Table 4 in Appendix C
14:    else
15:      Compute ω_i following (10)
16:      Compute a_ijk^(p) and b_ijk^(p) for all j ≠ y_i according to Table 4 in Appendix C
17:    end
18:    Compute α_i by (12)
19:    Compute β_i by (13)
20:  end
21:  Construct A from α_i
22:  Construct B from β_i
23:  Find V⁺ that solves (14)
24:  V̄ ← V
25:  V ← V⁺
26:  L_{t−1} ← L_t
27:  L_t ← L_MSVM(V)
28:  t ← t + 1
29: end

5. Nonlinearity

One possible method to include nonlinearity in a classifier is through the use of spline transformations (see for instance Hastie et al., 2009). With spline transformations each attribute vector x_j is transformed to a spline basis N_j, for j = 1, ..., m. The transformed input matrix N = [N_1, ..., N_m] is then of size n × l, where l depends on the degree of the spline transformation and the number of interior knots chosen. An application of spline transformations to the binary SVM can be found in Groenen et al. (2007).

A more common way to include nonlinearity in machine learning methods is through the use of the kernel trick, attributed to Aizerman et al. (1964). With the kernel trick, the dot product of two instance vectors in the dual optimization problem is replaced by the dot product of the same vectors in a high dimensional feature space. Since no dot products appear in the primal formulation of GenSVM, a different method is used here.

By applying a preprocessing step on the kernel matrix, nonlinearity can be included using the same algorithm as the one presented for the linear case. Furthermore, predicting class labels requires a postprocessing step on the obtained matrix V*. A full derivation is given in Appendix D.

6. Experiments

To assess the performance of the proposed GenSVM classifier, a simulation study was done comparing GenSVM with seven existing multiclass SVMs on 13 small data sets. These experiments are used to precisely measure predictive accuracy and total training time using performance profiles and rank plots. To verify the feasibility of GenSVM for large data sets an additional simulation study is done. The results of this study are presented separately in Section 6.4. Due to the large number of data sets and methods involved, experiments were only done for the linear kernel. Experiments on nonlinear multiclass SVMs would require even more training time than for linear MSVMs and are considered outside the scope of this paper.

6.1 Setup

Implementations of the heuristic multiclass SVMs (OvO, OvA, and DAG) were included through LibSVM (v. 3.16, Chang and Lin, 2011). LibSVM is a popular library for binary SVMs with packages for many programming languages; it is written in C++ and implements a variation of the SMO algorithm of Platt (1999). The OvO and DAG methods are implemented in this package, and a C implementation of OvA using LibSVM was created for these experiments.³ For the single-machine approaches the MSVMpack package was used (v. 1.3, Lauer and Guermeur, 2011), which is written in C. This package implements the methods of Weston and Watkins (W&W, 1998), Crammer and Singer (C&S, 2002a), Lee et al. (LLW, 2004), and Guermeur and Monfrini (MSVM², 2011). Finally, to verify if implementation differences are relevant for algorithm performance, the LibLinear (Fan et al., 2008) implementation of the method by Crammer and Singer (2002a) is also included (denoted LL C&S). This implementation uses the optimization algorithm by Keerthi et al. (2008).

To compare the classification methods properly, it is desirable to remove any bias that could occur when using cross validation (Cawley and Talbot, 2010). Therefore, nested cross validation is used (Stone, 1974), as illustrated in Figure 5. In nested CV, a data set is randomly split in a number of chunks. Each of these chunks is kept apart from the remaining chunks once, while the remaining chunks are combined to form a single data set. A grid search is then applied to this combined data set to find the optimal hyperparameters with which to predict the test chunk. This process is then repeated for each of the chunks. The predictions of the test chunk will be unbiased since it was not included in the grid search. For this reason, it is argued that this approach is preferred over approaches that simply report maximum accuracy rates obtained during the grid search.

3. The LibSVM code used for DAGSVM is the same code as was used in Hsu and Lin (2002) and is available at http://www.csie.ntu.edu.tw/~cjlin/libsvmtools.

Figure 5: An illustration of nested cross validation. A data set is initially split in five chunks. Each chunk is kept apart once, while a grid search using 10-fold CV is applied to the combined data from the remaining 4 chunks. The optimal parameters obtained there are then used to train the model one last time, and predict the chunk that was kept apart. [Diagram stages: combine chunks, keep apart, grid search using 10-fold CV (training phase), train at optimal configuration, test (testing phase).]

For the experiments 13 data sets were selected from the UCI repository (Bache and Lichman, 2013). The selected data sets and their relevant statistics are shown in Table 1. All attributes were rescaled to the interval [−1, 1]. The image segmentation and vowel data sets have a predetermined train and test set, and were therefore not used in the nested CV procedure. Instead, a grid search was done on the provided training set for each classifier, and the provided test set was predicted at the optimal hyperparameters obtained. For the data sets without a predetermined train/test split, nested CV was used with 5 initial chunks. Hence, 5 · 11 + 2 = 57 pairs of independent train and test data sets are obtained.

While running the grid search, it is desirable to remove any fluctuations that may result in an unfair comparison. Therefore, it was ensured that all methods had the same CV split of the training data for the same hyperparameter configuration (specifically, the value of the regularization parameter). In practice, it can occur that a specific CV split is advantageous for one classifier but not for others (either in time or performance). Thus, ideally the grid search would be repeated a number of times with different CV splits, to remove this variation. However, due to the size of the grid search this is considered to be infeasible. Finally, it should be noted here that during the grid search 10-fold cross validation was applied in a non-stratified manner, that is, without resampling of small classes.

The following settings were used in the numerical experiments. The regularization parameter was varied on a grid with λ ∈ {2^−18, 2^−16, ..., 2^18}. For GenSVM the grid search was extended with the parameters κ ∈ {−0.9, 0.5, 5.0} and p ∈ {1.0, 1.5, 2.0}. The stopping parameter for the GenSVM majorization algorithm was set at ε = 10^−6 during the grid search in the training phase and at ε = 10^−8 for the final model in the testing phase. In addition, two different weight specifications were used for GenSVM: the unit weights with ρ_i = 1 for all i, as well as the group-size correction weights introduced in (4). Thus, the grid search consists of 342 configurations for GenSVM, and 19 configurations for the other methods.
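To make the protocol concrete, the following is a rough sketch of the outer/inner loop structure, under the assumption of a hypothetical fit_and_score(train_idx, val_idx, params) helper that trains any one of the classifiers and returns its validation score; it is only an illustration of the nesting, not the MPI master-worker driver actually used in the experiments.

import itertools
import numpy as np

# Hyperparameter grid mirroring the settings above (19 lambdas x 3 kappas x 3 ps).
lambdas = [2.0 ** e for e in range(-18, 19, 2)]
grid = list(itertools.product(lambdas, [-0.9, 0.5, 5.0], [1.0, 1.5, 2.0]))

def fit_and_score(train_idx, val_idx, params):
    """Stand-in: train a classifier on train_idx, return its score on val_idx."""
    raise NotImplementedError

def nested_cv(n, n_chunks=5, n_folds=10, seed=0):
    """Yield, per outer chunk, the grid point with the best mean inner-CV score."""
    rng = np.random.default_rng(seed)
    chunks = np.array_split(rng.permutation(n), n_chunks)
    for held_out, test_chunk in enumerate(chunks):
        train_idx = np.concatenate([c for j, c in enumerate(chunks) if j != held_out])
        folds = np.array_split(train_idx, n_folds)
        best = max(grid, key=lambda params: np.mean(
            [fit_and_score(np.setdiff1d(train_idx, f), f, params) for f in folds]))
        yield test_chunk, best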

Data set              Instances (n)   Features (m)   Classes (K)   min n_k   max n_k
breast tissue                   106              9             6        14        22
iris                            150              4             3        50        50
wine                            178             13             3        48        71
image segmentation*        210/2100             18             7        30        30
glass                           214              9             6         9        76
vertebral                       310              6             3        60       150
ecoli                           336              8             8         2       143
vowel*                      528/462             10            11        48        48
balancescale                    625              4             3        49       288
vehicle                         846             18             4       199       218
contraception                  1473              9             3       333       629
yeast                          1484              8            10         5       463
car                            1728              6             4        65      1210

Table 1: Data set summary statistics. Data sets with an asterisk have a predetermined test data set. For these data sets, the number of instances is denoted for the train and test data sets respectively. The final two columns denote the size of the smallest and the largest class, respectively.

Since nested CV is used for most data sets, it is required to run 10-fold cross validation on a total of 28,158 hyperparameter configurations. To enhance the reproducibility of these experiments, the exact predictions made by each classifier for each configuration were stored in a text file. To run all computations in a reasonable amount of time, the computations were performed on the Dutch National LISA Compute Cluster. A master-worker program was developed using the message passing interface in Python (Dalcín et al., 2005). This allows for efficient use of multiple nodes by successively sending out tasks to worker threads from a single master thread. Since the total training time of a classifier is also of interest, it was ensured that all computations were done on the exact same core type.⁴ Furthermore, training time was measured from within the C programs, to ensure that only the time needed for the cross validation routine was measured. The total computation time needed to obtain the presented results was about 152 days; using the LISA Cluster this was done in five and a half days wall-clock time.

During the training phase it showed that several of the single machine methods implemented through MSVMpack did not converge to an optimal solution within a reasonable amount of time.⁵ Instead of limiting the maximum number of iterations of the method, MSVMpack was modified to stop after a maximum of 2 hours of training time per configuration. This results in 12 minutes of training time per cross validation fold. The solution found after this amount of training time was used for prediction during cross validation.

4. The specific type of core used is the Intel Xeon E5-2650 v2, with 16 threads at a clock speed of 2.6 GHz. At most 14 threads were used simultaneously, reserving one for the master thread and one for system processes.
5. The default MSVMpack settings were used with a chunk size of 4 for all methods.

Whenever training was stopped prematurely, this was recorded.⁶ Of the 57 training sets, 24 configurations had prematurely stopped training in one or more CV splits for the LLW method, versus 19 for W&W, 9 for MSVM², and 2 for C&S (MSVMpack). For the LibSVM methods, 13 optimal configurations for OvA reached the default maximum number of iterations in one or more CV folds, versus 9 for DAGSVM, and 3 for OvO. No early stopping was needed for GenSVM or for LL C&S.

6. For the classifiers implemented through LibSVM very long training times were only observed for the OvA method; however, due to the nature of this method it is not trivial to stop the calculations after a certain amount of time. This behavior was observed in about 1% of all configurations tested on all data sets, and is therefore considered negligible. Also, for the LibSVM methods it was recorded whenever the maximum number of iterations was reached.

Determining the optimal hyperparameters requires a performance measure on the obtained predictions. For binary classifiers it is common to use either the hit rate or the area under the ROC curve as a measure of classifier performance. The hit rate only measures the percentage of correct predictions of a classifier and has the well known problem that no correction is made for group sizes. For instance, if 90% of the observations of a test set belong to one class, a classifier that always predicts this class has a high hit rate, regardless of its discriminatory power. Therefore, the adjusted Rand index (ARI) is used here as a performance measure (Hubert and Arabie, 1985). The ARI corrects for chance and can therefore more accurately measure the discriminatory power of a classifier than the hit rate can. Using the ARI for evaluating supervised learning algorithms has previously been proposed by Santos and Embrechts (2009).

The optimal parameter configurations for each method on each data set were chosen such that the maximum predictive performance was obtained as measured with the ARI. If multiple configurations obtained the highest performance during the grid search, the configuration with the smallest training time was chosen. The results on the training data show that during cross validation GenSVM achieved the highest classification accuracy on 41 out of 57 data sets, compared to 15 and 12 for DAG and OvO, respectively. However, these are results on the training data sets and therefore can contain considerable bias. To accurately assess the out-of-sample prediction accuracy, the optimal hyperparameter configurations were determined for each of the 57 training sets, and the test sets were predicted with these parameters. To remove any variations due to random starts, building the classifier and predicting the test set was repeated 5 times for each classifier.

Below the simulation results on the small data sets will be evaluated using performance profiles and rank tests. Performance profiles offer a visual representation of classifier performance, while rank tests allow for identification of statistically significant differences between classifiers. For the sake of completeness, tables of performance scores and computation times for each method on each data set are provided in Appendix E. To promote reproducibility of the empirical results, all the code used for the classifier comparisons and all the obtained results will be released through an online repository.
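For reference, the ARI used as the selection criterion above can be computed from the confusion matrix of true and predicted labels with the standard Hubert and Arabie (1985) formula, as in the following sketch (equivalent to sklearn.metrics.adjusted_rand_score; the function name is illustrative).

import numpy as np
from scipy.special import comb

def adjusted_rand_index(y_true, y_pred):
    """Adjusted Rand index between two label vectors of equal length."""
    _, ti = np.unique(y_true, return_inverse=True)
    _, pi = np.unique(y_pred, return_inverse=True)
    C = np.zeros((ti.max() + 1, pi.max() + 1), dtype=int)
    np.add.at(C, (ti, pi), 1)                      # contingency table
    sum_comb = comb(C, 2).sum()                    # sum over all cells
    sum_a = comb(C.sum(axis=1), 2).sum()           # row marginals
    sum_b = comb(C.sum(axis=0), 2).sum()           # column marginals
    expected = sum_a * sum_b / comb(len(y_true), 2)
    max_index = (sum_a + sum_b) / 2.0
    return (sum_comb - expected) / (max_index - expected)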

6.2 Performance Profiles

One way to get insight in the performance of different classification methods is through performance profiles (Dolan and Moré, 2002). A performance profile shows the empirical cumulative distribution function of a classifier on a performance metric.

Let D denote the set of data sets, and C denote the set of classifiers. Further, let p_{d,c} denote the performance of classifier c ∈ C on data set d ∈ D as measured by the ARI. Now define the performance ratio v_{d,c} as the ratio between the best performance on data set d and the performance of classifier c on data set d, that is,

$$v_{d,c} = \frac{\max\{p_{d,c} : c \in C\}}{p_{d,c}}.$$

Thus the performance ratio is 1 for the best performing classifier on a data set and increases for classifiers with a lower performance. Then, the performance profile for classifier c is given by the function

$$P_c(\eta) = \frac{1}{N_D} \bigl| \{ d \in D : v_{d,c} \leq \eta \} \bigr|,$$

where N_D = |D| denotes the number of data sets. Thus, the performance profile estimates the probability that classifier c has a performance ratio below η. Note that P_c(1) denotes the empirical probability that a classifier achieves the highest performance on a given data set.

Figure 6: Performance profiles for classification accuracy created from all repetitions of the test set predictions. The methods OvA, C&S, LL C&S, MSVM², W&W, and LLW will always have a smaller probability of being within a factor η of the maximum performance than the GenSVM, OvO, or DAG methods. [Plot of P_c(η) against η ∈ [1, 2] for all nine classifiers.]

Figure 6 shows the performance profile for classification accuracy. Estimates of P_c(1) from Figure 6 show that there is a 28.42% probability that OvO achieves the optimal performance, versus 26.32% for both GenSVM and DAGSVM. Note that this includes cases where each of these methods achieves the best performance. Figure 6 also shows that although there is a small difference in the probabilities of GenSVM, OvO, and DAG within a factor of 1.08 of the best predictive performance, for η ≥ 1.08 GenSVM almost always has the highest probability. It can also be concluded that since the performance profiles of the MSVMpack implementation and the LibLinear implementation of the method of Crammer and Singer (2002a) nearly always overlap, implementation differences have a negligible effect on the classification performance of this method. Finally, the figure shows that OvA and the methods of Lee et al. (2004), Crammer and Singer (2002a), Weston and Watkins (1998), and Guermeur and Monfrini (2011) always have a smaller probability of being within a given factor of the optimal performance than GenSVM, OvO, or DAG do.
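A sketch of this computation: given a matrix of ARI scores with one row per data set and one column per classifier (assumed positive so the ratios are well defined), the ratios v_{d,c} and the profile P_c(η) on a grid of η values can be obtained as follows; ari_scores is an assumed input.

import numpy as np

def performance_profile(scores, etas):
    """scores: (n_datasets, n_classifiers) array, higher is better."""
    ratios = scores.max(axis=1, keepdims=True) / scores        # v_{d,c}
    # P_c(eta): fraction of data sets on which classifier c has ratio <= eta
    return np.array([(ratios <= eta).mean(axis=0) for eta in etas])

etas = np.linspace(1.0, 2.0, 101)
# profiles = performance_profile(ari_scores, etas)   # ari_scores assumed given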

Figure 7: Performance profiles for training time. GenSVM has a priori about a 40% chance of requiring the smallest time to perform the grid search on a given data set. The methods implemented through MSVMpack always have a lower chance of being within a factor τ of the smallest training time than any of the other methods. [Plot of T_c(τ) against τ on a logarithmic scale for all nine classifiers.]

Similarly, a performance profile can be constructed for the training time necessary to do the grid search. Let t_{d,c} denote the total training time for classifier c on data set d. Next, define the performance ratio for time as

$$w_{d,c} = \frac{t_{d,c}}{\min\{t_{d,c} : c \in C\}}.$$

Note that here the classifier with the smallest training time has preference. Therefore, comparison of classifier computation time is done with the lowest computation time achieved on a given data set d. Again, the ratio is 1 when the lowest training time is reached, and it increases for higher computation time. Hence, the performance profile for time is defined as

$$T_c(\tau) = \frac{1}{N_D} \bigl| \{ d \in D : w_{d,c} \leq \tau \} \bigr|.$$