A Multi-class SVM Classifier Utilizing Binary Decision Tree

Informatica 33 (2009) 233-241

Gjorgji Madzarov, Dejan Gjorgjevikj and Ivan Chorbev
Department of Computer Science and Engineering, Faculty of Electrical Engineering and Information Technology
Karpos 2 b.b., 1000 Skopje, Macedonia
E-mail: madzarovg@feit.ukim.edu.mk

Keywords: Support Vector Machine, multi-class classification, clustering, binary decision tree architecture

Received: July 7, 2008

In this paper a novel architecture of Support Vector Machine classifiers utilizing a binary decision tree (SVM-BDT) for solving multiclass problems is presented. The hierarchy of binary decision subtasks using SVMs is designed with a clustering algorithm. For consistency between the clustering model and the SVM, the clustering model utilizes distance measures in the kernel space, rather than in the input space. The proposed SVM based Binary Decision Tree architecture takes advantage of both the efficient computation of the decision tree architecture and the high classification accuracy of SVMs. The SVM-BDT architecture was designed to provide superior multi-class classification performance. Its performance was measured on samples from the MNIST, Pendigit, Optdigit and Statlog databases of handwritten digits and letters. The results of the experiments indicate that, while maintaining comparable or better accuracy than other SVM based approaches, ensembles of decision trees (Bagging and Random Forest) and a neural network, the training phase of SVM-BDT is faster. During the recognition phase, due to its logarithmic complexity, SVM-BDT is much faster than the widely used multi-class SVM methods like one-against-one and one-against-all. Furthermore, the experiments showed that the proposed method becomes more favourable as the number of classes in the recognition problem increases.

Povzetek: A method for building binary decision trees of SVMs for multi-class problems is presented.

1 Introduction

The recent results in pattern recognition have shown that support vector machine (SVM) classifiers often have superior recognition rates in comparison to other classification methods. However, the SVM was originally developed for binary decision problems, and its extension to multi-class problems is not straightforward. How to effectively extend it for solving multiclass classification problems is still an on-going research issue. The popular methods for applying SVMs to multiclass classification problems usually decompose the multi-class problem into several two-class problems that can be addressed directly using several SVMs.

For the reader's convenience, we introduce the SVM briefly in section 2. A brief introduction to several widely used multi-class classification methods that utilize binary SVMs is given in section 3. The kernel-based clustering introduced to convert the multi-class problem into an SVM-based binary decision-tree architecture is explained in section 4. In section 5, we discuss related work and compare SVM-BDT with other multi-class SVM methods via theoretical analysis and empirical estimation. The experimental results in section 6 compare the performance of the proposed SVM-BDT with traditional multi-class approaches based on SVMs, ensembles of decision trees and a neural network. Section 7 gives a conclusion of the paper.

2 Support vector machines for pattern recognition

The support vector machine is originally a binary classification method developed by Vapnik and colleagues at Bell laboratories [1][2], with further algorithm improvements by others [3]. For a binary problem, we have training data points {x_i, y_i}, i = 1, ..., l, y_i ∈ {-1, 1}, x_i ∈ R^d.
Suppose we have some hyperplane which separates the positive from the negative examples (a "separating hyperplane"). The points x which lie on the hyperplane satisfy w · x + b = 0, where w is normal to the hyperplane, |b|/||w|| is the perpendicular distance from the hyperplane to the origin, and ||w|| is the Euclidean norm of w. Let d+ (d-) be the shortest distance from the separating hyperplane to the closest positive (negative) example. Define the margin of a separating hyperplane to be d+ + d-. For the linearly separable case, the support vector algorithm simply looks for the separating hyperplane with the largest margin. This can be formulated as follows: suppose that all the training data satisfy the following constraints:

x_i · w + b ≥ +1 for y_i = +1,   (1)
x_i · w + b ≤ -1 for y_i = -1.   (2)

These can be combined into one set of inequalities:

y_i (x_i · w + b) - 1 ≥ 0  for all i.   (3)

Now consider the points for which the equality in Eq. (1) holds (requiring that there exists such a point is equivalent to choosing a scale for w and b). These points lie on the hyperplane H1: x_i · w + b = 1 with normal w and perpendicular distance from the origin |1 - b|/||w||. Similarly, the points for which the equality in Eq. (2) holds lie on the hyperplane H2: x_i · w + b = -1, with normal again w and perpendicular distance from the origin |-1 - b|/||w||. Hence d+ = d- = 1/||w|| and the margin is simply 2/||w||.

Figure 1: Linear separating hyperplanes for the separable case. The support vectors are circled.

Note that H1 and H2 are parallel (they have the same normal) and that no training points fall between them. Thus we can find the pair of hyperplanes which gives the maximum margin by minimizing ||w||^2, subject to constraints (3). Thus we expect the solution for a typical two dimensional case to have the form shown on Fig. 1.

We introduce nonnegative Lagrange multipliers α_i, i = 1, ..., l, one for each of the inequality constraints (3). Recall that the rule is that for constraints of the form c_i ≥ 0, the constraint equations are multiplied by nonnegative Lagrange multipliers and subtracted from the objective function to form the Lagrangian. For equality constraints, the Lagrange multipliers are unconstrained. This gives the Lagrangian:

L_P = (1/2)||w||^2 - Σ_{i=1}^{l} α_i y_i (x_i · w + b) + Σ_{i=1}^{l} α_i.   (4)

We must now minimize L_P with respect to w, b, and maximize it with respect to all α_i at the same time, all subject to the constraints α_i ≥ 0 (let's call this particular set of constraints C1). Now this is a convex quadratic programming problem, since the objective function is itself convex, and those points which satisfy the constraints also form a convex set (any linear constraint defines a convex set, and a set of N simultaneous linear constraints defines the intersection of N convex sets, which is also a convex set). This means that we can equivalently solve the following dual problem: maximize L_P, subject to the constraints that the gradient of L_P with respect to w and b vanishes, and subject also to the constraints that the α_i ≥ 0 (let's call that particular set of constraints C2). This particular dual formulation of the problem is called the Wolfe dual [4]. It has the property that the maximum of L_P, subject to constraints C2, occurs at the same values of w, b and α as the minimum of L_P subject to constraints C1. Requiring that the gradient of L_P with respect to w and b vanishes gives the conditions:

w = Σ_i α_i y_i x_i,   (5)
Σ_i α_i y_i = 0.   (6)

Since these are equality constraints in the dual formulation, we can substitute them into Eq. (4) to give

L_D = Σ_i α_i - (1/2) Σ_{i,j} α_i α_j y_i y_j x_i · x_j.   (7)

Note that we have now given the Lagrangian different labels (P for primal, D for dual) to emphasize that the two formulations are different: L_P and L_D arise from the same objective function but with different constraints, and the solution is found by minimizing L_P or by maximizing L_D. Note also that if we formulate the problem with b = 0, which amounts to requiring that all hyperplanes contain the origin, the constraint (6) does not appear. This is a mild restriction for high dimensional spaces, since it amounts to reducing the number of degrees of freedom by one.
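As a small illustration of this dual, the following numpy sketch maximizes L_D of Eq. (7) for the simplified b = 0 formulation mentioned above, so only the constraints α_i ≥ 0 remain and projected gradient ascent suffices. It is a toy sketch with hypothetical names, not the paper's implementation; in practice a dedicated solver such as SMO [13] or a QP package is used.

    import numpy as np

    def train_linear_svm_dual(X, y, steps=2000, lr=1e-3):
        # Maximize L_D (Eq. 7) for separable data with b = 0, keeping alpha_i >= 0.
        l = X.shape[0]
        Z = y[:, None] * X
        G = Z @ Z.T                                   # G_ij = y_i y_j (x_i . x_j)
        alpha = np.zeros(l)
        for _ in range(steps):
            grad = 1.0 - G @ alpha                    # dL_D / dalpha_i
            alpha = np.maximum(alpha + lr * grad, 0)  # gradient step + projection
        w = ((alpha * y)[:, None] * X).sum(axis=0)    # Eq. (5): w = sum_i alpha_i y_i x_i
        support = np.where(alpha > 1e-6)[0]           # support vectors have alpha_i > 0
        return w, alpha, support

    # toy usage: two separable point clouds labelled +1 / -1
    X = np.array([[2.0, 2.0], [2.5, 1.5], [-2.0, -1.5], [-1.5, -2.5]])
    y = np.array([1.0, 1.0, -1.0, -1.0])
    w, alpha, support = train_linear_svm_dual(X, y)
    print(w, support, np.sign(X @ w))                 # predicted labels sign(w . x)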
Support vector training (for the separable, linear case) therefore amounts to maximizing L_D with respect to the α_i, subject to constraint (6) and positivity of the α_i, with the solution given by (5). Notice that there is a Lagrange multiplier α_i for every training point. In the solution, those points for which α_i > 0 are called support vectors, and lie on one of the hyperplanes H1, H2. All other training points have α_i = 0 and lie either on H1 or H2 (such that the equality in Eq. (3) holds), or on that side of H1 or H2 such that the strict inequality in Eq. (3) holds. For these machines, the support vectors are the critical elements of the training set. They lie closest to the decision boundary; if all other training points were removed (or moved around, but so as not to cross H1 or H2), and training was repeated, the same separating hyperplane would be found.

The above algorithm for separable data, when applied to non-separable data, will find no feasible solution: this will be evidenced by the objective function (i.e. the dual Lagrangian) growing arbitrarily large. So how can we extend these ideas to handle non-separable data?

We would like to relax the constraints (1) and (2), but only when necessary, that is, we would like to introduce a further cost (i.e. an increase in the primal objective function) for doing so. This can be done by introducing positive slack variables e_i, i = 1, ..., l, in the constraints, which then become:

x_i · w + b ≥ +1 - e_i for y_i = +1,   (8)
x_i · w + b ≤ -1 + e_i for y_i = -1,   (9)
e_i ≥ 0 for all i.   (10)

Thus, for an error to occur, the corresponding e_i must exceed unity, so Σ_i e_i is an upper bound on the number of training errors. Hence a natural way to assign an extra cost for errors is to change the objective function to be minimized from ||w||^2/2 to ||w||^2/2 + C(Σ_i e_i), where C is a parameter to be chosen by the user, a larger C corresponding to assigning a higher penalty to errors.

How can the above methods be generalized to the case where the decision function f(x), whose sign represents the class assigned to data point x, is not a linear function of the data? First notice that the only way in which the data appear in the training problem is in the form of dot products x_i · x_j. Now suppose we first mapped the data (Figure 2) to some other (possibly even infinite dimensional) Euclidean space H, using a mapping which we will call Φ:

Φ: R^d → H.   (11)

Then of course the training algorithm would only depend on the data through dot products in H, i.e. on functions of the form Φ(x_i) · Φ(x_j). Now if there were a kernel function K such that K(x_i, x_j) = Φ(x_i) · Φ(x_j), we would only need to use K in the training algorithm, and would never need to explicitly even know what Φ is. The kernel function has to satisfy Mercer's condition [1]. One example of such a function is the Gaussian:

K(x_i, x_j) = exp(-||x_i - x_j||^2 / (2σ^2)).   (12)

In this particular example, H is infinite dimensional, so it would not be very easy to work with Φ explicitly. However, if one replaces x_i · x_j by K(x_i, x_j) everywhere in the training algorithm, the algorithm will happily produce a support vector machine which lives in an infinite dimensional space, and furthermore do so in roughly the same amount of time it would take to train on the un-mapped data. All the considerations of the previous sections hold, since we are still doing a linear separation, but in a different space. But how can we use this machine? After all, we need w, and that will live in H. But in the test phase an SVM is used by computing dot products of a given test point x with w, or more specifically by computing the sign of

f(x) = Σ_{i=1}^{N_s} α_i y_i s_i · x + b = Σ_{i=1}^{N_s} α_i y_i K(s_i, x) + b,   (13)

where the s_i are the support vectors. So again we can avoid computing Φ(x) explicitly and use K(s_i, x) = Φ(s_i) · Φ(x) instead.

Figure 2: General principle of the SVM: projection of the data into an optimal dimensional space.
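As an illustration of Eq. (12) and Eq. (13), the following minimal Python sketch (with hypothetical variable names, not code from the paper) evaluates a trained kernel SVM on a test point: only the support vectors s_i, their labels y_i, their multipliers α_i and the bias b are needed, never Φ itself.

    import numpy as np

    def gaussian_kernel(a, b, sigma):
        # Eq. (12): K(a, b) = exp(-||a - b||^2 / (2 sigma^2))
        return np.exp(-np.sum((a - b) ** 2) / (2.0 * sigma ** 2))

    def svm_decision(x, support_vectors, sv_labels, sv_alphas, b, sigma):
        # Eq. (13): f(x) = sum_i alpha_i y_i K(s_i, x) + b; the predicted class is sign(f(x))
        f = sum(a * y * gaussian_kernel(s, x, sigma)
                for a, y, s in zip(sv_alphas, sv_labels, support_vectors)) + b
        return np.sign(f), f

    # hypothetical trained model: two support vectors with their multipliers and labels
    S = [np.array([1.0, 1.0]), np.array([-1.0, -1.0])]
    label, score = svm_decision(np.array([0.8, 1.2]), S, [1.0, -1.0], [0.7, 0.7], 0.0, sigma=1.0)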
3 An overview of widely used multi-class classification methods

Although SVMs were originally designed as binary classifiers, approaches that address a multi-class problem as a single "all-together" optimization problem exist [5], but they are computationally much more expensive than solving several binary problems. A variety of techniques for decomposition of the multi-class problem into several binary problems using Support Vector Machines as binary classifiers have been proposed; several widely used ones are given in this section.

3.1 One-against-all (OvA)

For an N-class problem (N > 2), N two-class SVM classifiers are constructed [6]. The i-th SVM is trained while labeling the samples in the i-th class as positive examples and all the rest as negative examples. In the recognition phase, a test example is presented to all N SVMs and is labelled according to the maximum output among the N classifiers. The disadvantage of this method is its training complexity, as the number of training samples is large: each of the N classifiers is trained using all available samples.

3.2 One-against-one (OvO)

This algorithm constructs N(N-1)/2 two-class classifiers, using all the binary pair-wise combinations of the N classes.

Each classifier is trained using the samples of the first class as positive examples and the samples of the second class as negative examples. To combine these classifiers, the Max Wins algorithm is adopted. It finds the resultant class by choosing the class voted for by the majority of the classifiers [7]. The number of samples used for training each one of the OvO classifiers is smaller, since only samples from two of the N classes are taken into consideration. The lower number of samples causes smaller nonlinearity, resulting in shorter training times. The disadvantage of this method is that every test sample has to be presented to a large number of classifiers, N(N-1)/2. This results in slower testing, especially when the number of classes in the problem is big [8].

3.3 Directed acyclic graph SVM (DAGSVM)

Introduced by Platt et al. [9], the DAGSVM algorithm trains N(N-1)/2 classifiers in the same way as one-against-one. In the recognition phase, the algorithm depends on a rooted binary directed acyclic graph to make a decision [9]. DAGSVM creates a model for each pair of classes. When one such model, which is able to separate class c1 from class c2, classifies a certain test example into class c1, it does not really vote for class c1; rather, it votes against class c2, because the example must lie on the other side of the separating hyperplane than most of the class c2 samples. Therefore, from that point onwards the algorithm ignores all the models involving class c2. This means that after each classification with one of the binary models, one more class can be thrown out as a possible candidate, and after only N-1 steps just one candidate class remains, which therefore becomes the prediction for the current test example. This results in significantly faster testing, while achieving a similar recognition rate as one-against-one.

3.4 Binary tree of SVM (BTS)

This method uses multiple SVMs arranged in a binary tree structure [10]. An SVM in each node of the tree is trained using two of the classes. The algorithm then employs probabilistic outputs to measure the similarity between the remaining samples and the two classes used for training. All samples in the node are assigned to the two subnodes derived from the previously selected classes by similarity. This step repeats at every node until each node contains only samples from one class. The main problem that should be considered seriously here is training time, because aside from training, one has to test all samples in every node to find out which classes should be assigned to which subnode while building the tree. This may decrease the training performance considerably for huge training datasets.
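To make the recognition-phase differences concrete, here is a small Python sketch (illustrative only; the decision_value accessor and the model containers are assumptions, not code from the paper or from any particular library) of how OvA picks the class with the maximum output and how OvO combines its pairwise classifiers with Max Wins voting:

    import numpy as np
    from itertools import combinations

    def predict_ova(x, ova_models):
        # ova_models[k] is the binary SVM "class k vs. the rest"; take the largest output
        scores = [m.decision_value(x) for m in ova_models]
        return int(np.argmax(scores))

    def predict_ovo_max_wins(x, ovo_models, n_classes):
        # ovo_models[(i, j)] separates class i (positive) from class j (negative);
        # every pairwise decision casts one vote and the class with most votes wins
        votes = np.zeros(n_classes, dtype=int)
        for i, j in combinations(range(n_classes), 2):
            winner = i if ovo_models[(i, j)].decision_value(x) > 0 else j
            votes[winner] += 1
        return int(np.argmax(votes))

The N(N-1)/2 evaluations in the second function are exactly what makes OvO testing slow for problems with many classes, and what DAG and the tree based approaches below avoid.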
4 Support vector machines utilizing a binary decision tree

In this paper we propose a binary decision tree architecture that uses SVMs for making the binary decisions in the nodes. The proposed classifier architecture, SVM-BDT (Support Vector Machines utilizing a Binary Decision Tree), takes advantage of both the efficient computation of the tree architecture and the high classification accuracy of SVMs. Utilizing this architecture, N-1 SVMs need to be trained for an N-class problem, but only at most log2 N SVMs are required to be consulted to classify a sample. This can lead to a dramatic improvement in recognition speed when addressing problems with a big number of classes. An example of SVM-BDT that solves a 7-class pattern recognition problem utilizing a binary tree, in which each node makes a binary decision using an SVM, is shown on Figure 3. The hierarchy of binary decision subtasks should be carefully designed before the training of each classifier.

The recognition of each sample starts at the root of the tree. At each node of the binary tree a decision is made about the assignment of the input pattern into one of the two possible groups, represented by transferring the pattern to the left or to the right sub-tree. Each of these groups may contain multiple classes. This is repeated recursively downward the tree until the sample reaches a leaf node that represents the class it has been assigned to. There exist many ways to divide N classes into two groups, and proper grouping is critical for the good performance of SVM-BDT. For consistency between the clustering model and the way an SVM calculates the decision hyperplane, the clustering model utilizes distance measures in the kernel space, rather than in the input space. Because of this, all training samples are mapped into the kernel space with the same kernel function that is to be used in the training phase.

Figure 3: Illustration of SVM-BDT.

The SVM-BDT method that we propose is based on recursively dividing the classes into two disjoint groups in every node of the decision tree and training an SVM that will decide in which of the groups the incoming unknown sample should be assigned. The groups are determined by a clustering algorithm according to their class membership.
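The recognition procedure just described can be sketched in a few lines of Python (an illustrative sketch only; the node layout and the decision_value accessor are assumptions for the example, not code from the paper):

    class BDTNode:
        # internal node: holds one binary SVM plus a left and a right subtree;
        # leaf node: svm is None and label holds the single remaining class
        def __init__(self, svm=None, left=None, right=None, label=None):
            self.svm, self.left, self.right, self.label = svm, left, right, label

    def classify(node, x):
        # walk down the tree; at most about log2(N) SVM evaluations for N classes
        while node.svm is not None:
            node = node.left if node.svm.decision_value(x) > 0 else node.right
        return node.label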

Let us take a set of samples x1, x2, ..., xM, each one labeled by y_i ∈ {c1, c2, ..., cN}, where N is the number of classes. The SVM-BDT method starts by dividing the classes into two disjoint groups g1 and g2. This is performed by calculating N gravity centers for the N different classes. Then, the two classes that have the biggest Euclidean distance from each other are assigned to the two clustering groups. After this, the class with the smallest Euclidean distance from one of the clustering groups is found and assigned to the corresponding group. The gravity center of this group is then recalculated to represent the addition of the samples of the new class to the group. The process continues by finding the next unassigned class that is closest to either of the clustering groups, assigning it to the corresponding group and updating the group's gravity center, until all classes are assigned to one of the two possible groups. This defines a grouping of all the classes into two disjoint groups. This grouping is then used to train an SVM classifier in the root node of the decision tree, using the samples of the first group as positive examples and the samples of the second group as negative examples. The classes from the first clustering group are assigned to the first (left) subtree, while the classes of the second clustering group are assigned to the second (right) subtree. The process continues recursively (dividing each of the groups into two subgroups by applying the procedure explained above), until there is only one class per group, which defines a leaf in the decision tree.

Figure 4: SVM-BDT divisions of the seven classes.

For example, Figure 4 illustrates the grouping of 7 classes, while Figure 3 shows the corresponding decision tree of SVMs. After calculating the gravity centers for all classes, the classes c2 and c5 are found to be the furthest apart from each other, considering their Euclidean distance, and are assigned to groups g1 and g2 accordingly. The closest to group g1 is class c3, so it is assigned to group g1, followed by recalculation of g1's gravity center. In the next step, class c1 is the closest to group g2, so it is assigned to that group and the group's gravity center is recalculated. In the following iteration, class c7 is assigned to g1 and class c6 is assigned to g2, followed by recalculation of the groups' gravity centers. Finally class c4 is assigned to g1. This completes the first round of grouping that defines the classes that will be transferred to the left and the right subtree of the root node. The SVM classifier in the root is trained by considering samples from the classes {c2, c3, c4, c7} as positive examples and samples from the classes {c1, c5, c6} as negative examples. The grouping procedure is repeated independently for the classes of the left and the right subtree of the root, which results in grouping c7 and c4 in g1,1 and c2 and c3 in g1,2 in the left node of the tree, and c1 and c5 in g2,1 and c6 in g2,2 in the right node of the tree. The concept is repeated for each SVM associated to a node in the taxonomy. This will result in training only N-1 SVMs for solving an N-class problem.
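A compact Python sketch of this grouping step is given below. It is illustrative only: it works with Euclidean distances between gravity centers in the input space, whereas the paper applies the same procedure to samples mapped into the kernel space, and the function and variable names are assumptions for the example. The recursive splitting of each group and the training of the SVM in each node are omitted.

    import numpy as np
    from itertools import combinations

    def split_classes(X, y, classes):
        # gravity center of every class
        centers = {c: X[y == c].mean(axis=0) for c in classes}
        # the two classes that are furthest apart seed the groups g1 and g2
        c1, c2 = max(combinations(classes, 2),
                     key=lambda p: np.linalg.norm(centers[p[0]] - centers[p[1]]))
        groups = {0: [c1], 1: [c2]}
        g_center = {0: centers[c1].copy(), 1: centers[c2].copy()}
        unassigned = [c for c in classes if c not in (c1, c2)]
        while unassigned:
            # the unassigned class closest to either group is attached to that group
            c, g = min(((c, g) for c in unassigned for g in (0, 1)),
                       key=lambda cg: np.linalg.norm(centers[cg[0]] - g_center[cg[1]]))
            groups[g].append(c)
            # recalculate the group's gravity center over all samples of its classes
            g_center[g] = np.vstack([X[y == k] for k in groups[g]]).mean(axis=0)
            unassigned.remove(c)
        return groups[0], groups[1]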
5 Related work and discussion

Various multi-class classification algorithms can be compared by their predictive accuracy and their training and testing times. The training time T of a binary SVM is estimated empirically by a power law [13] stating that T ≈ αM^d, where M is the number of training samples and α is a proportionality constant. The parameter d is a constant which depends on the dataset and is typically in the range [1, 2]. According to this law, the estimated training time for OvA is

T_OvA ≈ αNM^d,   (11)

where N is the number of classes in the problem. Without loss of generality, let us assume that each of the N classes has the same number of training samples. Thus, each binary SVM of the OvO approach only requires 2M/N samples. Therefore, the training time for OvO is

T_OvO ≈ (N(N-1)/2) α (2M/N)^d ≈ 2^(d-1) N^(2-d) αM^d.   (12)

The training time for DAG is the same as for OvO. As for BTS and SVM-BDT, the training time is summed over all the nodes in the log2 N levels of the tree. In the i-th level there are 2^(i-1) nodes, and each node uses 2M/N training samples for BTS and M/2^(i-1) for SVM-BDT. Hence, the total training time for BTS is

T_BTS ≈ Σ_{i=1}^{log2 N} 2^(i-1) α (2M/N)^d = (N-1) α (2M/N)^d,   (13)

and for SVM-BDT it is

T_SVM-BDT ≈ Σ_{i=1}^{log2 N} 2^(i-1) α (M/2^(i-1))^d = αM^d Σ_{i=1}^{log2 N} 2^((i-1)(1-d)).   (14)
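As a rough illustration, take d = 2 and N = 8 with equal class sizes; the estimates above then evaluate to

T_OvA ≈ 8 αM^2,
T_OvO = T_DAG ≈ 28 · α(2M/8)^2 = 1.75 αM^2,
T_BTS ≈ 7 · α(2M/8)^2 ≈ 0.44 αM^2 (before the extra per-node testing pass discussed below),
T_SVM-BDT ≈ αM^2 (1 + 1/2 + 1/4) = 1.75 αM^2,

so the pairwise and tree based decompositions require a similar order of training work under this estimate, while OvA, which always uses all M samples per classifier, grows linearly with N.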

It must be noted that T_SVM-BDT in our algorithm does not include the time to build the hierarchy structure of the N classes, since it consumes insignificant time compared to the quadratic optimization time that dominates the total training time. On the other hand, in the process of building the tree, BTS requires testing of each trained SVM with all the training samples in order to determine the next step, therefore significantly increasing the total training time. According to the empirical estimation above, it is evident that the training speed of SVM-BDT is comparable with OvA, OvO, DAG and BTS.

In the testing phase, DAG performs faster than OvO and OvA, since it requires only N-1 binary SVM evaluations. SVM-BDT is even faster than DAG because the depth of the SVM-BDT decision tree is log2 N in the worst case, which is superior to N-1, especially when N >> 2. While testing, the inner product of the sample's feature vector and all the support vectors of the model is calculated for each sample. The total number of support vectors in the trained model directly contributes to the major part of the evaluation time, which was also confirmed by the experiments.

A multistage SVM (MSVM) for multi-class problems has been proposed by Liu et al. [11]. They use Support Vector Clustering (SVC) [12] to divide the training data into two parts that are used to train a binary SVM. For each partition, the same procedure is recursively repeated until the binary SVM gives an exact class label. An unsolved problem in MSVM is how to control the SVC to divide the training dataset into exactly two parts. However, this procedure is painful and unfeasible, especially for large datasets. The training set from one class could belong to both clusters, resulting in decreased predictive accuracy.

There are different approaches for solving multi-class problems which are not based on SVMs. Some of them are presented in the following discussion. However, the experimental results clearly show that their classification accuracy is significantly smaller than the SVM based methods.

Ensemble techniques have received considerable attention within recent machine learning research [16][17][18][19]. The basic goal is to train a diverse set of classifiers for a single learning problem and to vote or average their predictions. The approach is simple as well as powerful, and the obtained accuracy gains often have solid theoretical foundations [20][21]. Averaging the predictions of these classifiers helps to reduce the variance and often increases the reliability of the predictions. There are several techniques for obtaining a diverse set of classifiers. The most common technique is to use subsampling to diversify the training sets, as in Bagging [21] and Boosting [20]. Other techniques include the use of different feature subsets for every classifier in the ensemble [23], exploiting the randomness of the base algorithms [24], possibly by artificially randomizing their behavior [25], or using multiple representations of the domain objects. Finally, classifier diversity can be ensured by modifying the output labels, i.e., by transforming the learning task into a collection of related learning tasks that use the same input examples but different assignments of the class labels. Error-correcting output codes are the most prominent example of this type of ensemble methods [22]. Error-correcting output codes are a popular and powerful class binarization technique.
The basic idea is to transform an N-class problem into n binary problems (n > N), where each binary problem uses a subset of the classes as the positive class and the remaining classes as the negative class. As a consequence, each original class is encoded as an n-dimensional binary vector, one dimension for each prediction of a binary problem (+1 for positive and -1 for negative). The resulting matrix of the form {-1, +1}^(N x n) is called the coding matrix. New examples are classified by determining the row in the matrix that is closest to the binary vector obtained by submitting the example to the n classifiers. If the binary problems are chosen in a way that maximizes the distance between the class vectors, the reliability of the classification can be significantly increased. Error-correcting output codes can also be easily parallelized, but each subtask requires the total training set.

Similar to binarization, some approaches suggest mapping the original multiple classes into three classes. A related technique where multi-class problems are mapped to 3-class problems is proposed by Angulo and Català [26]. Like with pairwise classification, they propose generating one training set for each pair of classes. They label the two class values with target values +1 and -1, and additionally, samples of all other classes are labeled as a third class, with a target value of 0. This idea leads to an increased size of the training set compared to binary classification. The mapping into three classes was also used by Kalousis and Theoharis [27] for predicting the most suitable learning algorithm(s) for a given dataset. They trained a nearest-neighbor learner to predict the better algorithm of each pair of learning algorithms. Each of these pairwise problems had three classes: one for each algorithm and a third class named "tie", where both algorithms had similar performances.

Johannes Fürnkranz has investigated the use of round robin binarization (or pairwise classification) [28] as a technique for handling multi-class problems with separate-and-conquer rule learning algorithms (a.k.a. covering algorithms). In particular, round robin binarization helps Ripper [29] outperform C5.0 on multi-class problems, whereas C5.0 outperforms the original version of Ripper on the same problems.
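The error-correcting output code classification described earlier in this section reduces, at recognition time, to a nearest code word lookup; a minimal Python sketch (with an assumed decision_value accessor and a given {-1, +1} coding matrix, purely for illustration) is:

    import numpy as np

    def ecoc_predict(x, binary_models, coding_matrix):
        # coding_matrix has shape (N, n): row k is the +/-1 code word of class k.
        # Each of the n binary classifiers predicts one bit of the code word;
        # the predicted class is the row closest to those bits in Hamming distance.
        bits = np.array([1 if m.decision_value(x) > 0 else -1 for m in binary_models])
        hamming = np.sum(coding_matrix != bits, axis=1)
        return int(np.argmin(hamming))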

6 Experimental results

In this section, we present the results of our experiments with several multi-class problems. The performance was measured on the problem of recognition of handwritten digits and letters. Here, we compare the results of the proposed SVM-BDT method with the following methods:
1) one-against-all (OvA);
2) one-against-one (OvO);
3) DAG;
4) BTS;
5) Bagging;
6) Random Forests;
7) Multilayer Perceptron (MLP, neural network).

The training and testing of the SVM based methods (OvO, OvA, DAG, BTS and SVM-BDT) was performed using a custom developed application that uses the Torch library [14]. For solving the partial binary classification problems, we used SVMs with a Gaussian kernel. In these methods, we had to optimize the values of the kernel parameter σ and the penalty C. For parameter optimization we used experimental results. The achieved parameter values for the given datasets are given in Table 1.

Table 1. The optimized values for σ and C for the used datasets (MNIST, Pendigit, Optdigit, Statlog).
σ: 0, 5, 1.1
C: 100, 100, 100, 100

We also developed an application that uses the same (Torch) library for the neural network classification. The neural network used one hidden layer; the number of hidden units was determined experimentally. The classification based on ensembles of decision trees [30] (Bagging and Random Forest) was performed by Clus, a popular decision tree learner based on the principles stated by Blockeel et al. [31]. There were 100 models in the ensembles. The pruning method that we used was C4.5. The number of selected features in the Random Forest method was log2 M, where M is the number of features in the dataset.

The most important criterion in evaluating the performance of a classifier is usually its recognition rate, but very often the training and testing times of the classifier are equally important. In our experiments, four different multi-class classification problems were addressed by each of the eight previously mentioned methods. The training and testing time and the recognition performance were recorded for every method.

The first problem was recognition of isolated handwritten digits (10 classes) from the MNIST database. The MNIST database [15] contains grayscale images of isolated handwritten digits. From each digit image, after performing a slant correction, 40 features were extracted. The features consisted of 10 horizontal, 8 vertical and diagonal projections [5]. The MNIST database contains 60.000 training samples and 10.000 testing samples. The second and the third problems are 10-class problems from the UCI Repository [33] of machine learning databases: Optdigit and Pendigit. Pendigit has 16 features, 7494 training samples and 3498 testing samples. Optdigit has 64 features, 3823 training samples and 1797 testing samples. The fourth problem was recognition of isolated handwritten letters, a 26-class problem from the Statlog collection [34]. Statlog-letter contains 15.000 training samples and 5.000 testing samples, where each sample is represented by 16 features. The classifiers were trained using all available training samples of the set and were evaluated by recognizing all the test samples from the corresponding set. All tests were performed on a personal computer with an Intel Core 2 Duo processor at 1.8 GHz with the Windows XP operating system.

Tables 2 through 4 show the results of the experiments using 8 different approaches (5 approaches based on SVMs, two based on ensembles of decision trees and one neural network) on each of the 4 data sets. The first column of each table describes the classification method. Table 2 gives the prediction error rate of each method applied on each of the datasets.
Table 3 and Table 4 show the testing and training times of each algorithm for the datasets, measured in seconds, respectively. The results in the tables show that the SVM based methods outperform the other approaches in terms of classification accuracy. In terms of speed, the SVM based methods are faster, with different ratios for different datasets. Overall, the SVM based algorithms were significantly better compared to the non-SVM based methods.

The results in Table 2 show that for all datasets the one-against-all (OvA) method achieved the lowest error rate. For the MNIST, Pendigit and Optdigit datasets, the other SVM based methods (OvO, DAG, BTS and our method, SVM-BDT) achieved higher, but similar error rates. For the recognition of handwritten letters from the Statlog database, the OvO and DAG methods achieved very similar error rates that were about 1.5% higher than the OvA method. The BTS method showed the lowest error rate of all methods using one-against-one SVMs. Our SVM-BDT method achieved a better recognition rate than all the methods using one-against-one SVMs, including BTS. Of the non-SVM based methods, the Random Forest method achieved the best recognition accuracy for all datasets. The prediction performance of the MLP method was comparable to the Random Forest method for the 10-class problems, but noticeably worse for the 26-class problem. The MLP method is the fastest one in terms of training and testing time, which is evident in Table 3 and Table 4. The classification methods based on ensembles of decision trees were the slowest in the training and the testing phase, especially the Bagging method. Overall, the Random Forest method was more accurate than the other non-SVM based methods, while the MLP method was the fastest. The results in Table 3 show that the DAG method achieved the fastest testing time of all the SVM based methods for the MNIST dataset.

For the other datasets, the testing time of DAG is comparable with the BTS and SVM-BDT methods, and their testing time is noticeably better than the one-against-all (OvA) and one-against-one (OvO) methods. The SVM-BDT method was faster in the recognition phase for the Pendigit dataset and slightly slower than the DAG method for the Statlog dataset.

Table 2. The prediction error rate (%) of each method for every dataset
Classifier   MNIST    Pendigit   Optdigit   Statlog
OvA          1.93     1.70       1.17       3.0
OvO          2.43     1.94       1.55       4.7
DAG          2.50     1.97       1.7        4.74
BTS          2.4      1.94       1.51       4.70
SVM-BDT      2.45     1.94       1.1        4.54
R. Forest    3.9      3.7        3.18       4.98
Bagging      4.9      5.38       7.17       8.04
MLP          4.5      3.83       3.84       14.14

Table 3. Testing time of each method for every dataset, measured in seconds
Classifier   MNIST    Pendigit   Optdigit   Statlog
OvA          3.5      1.75       1.3        119.50
OvO          .89      3.3        1.9        10.50
DAG          9.4      0.55       0.8        1.50
BTS          .89      0.57       0.73       17.0
SVM-BDT      5.33     0.54       0.70       13.10
R. Forest    39.51    3.1        .7         11.07
Bagging      34.5     .13        1.70       9.7
MLP          .1       0.49       0.41       1.10

Table 4. Training time of each method for every dataset, measured in seconds
Classifier   MNIST    Pendigit   Optdigit   Statlog
OvA          48.94    4.99       3.94       554.0
OvO          11.9     3.11       .0         80.90
DAG          11.9     3.11       .0         80.90
BTS          40.73    5.1        5.5        387.10
SVM-BDT      304.5    1.0        1.59       3.30
R. Forest    54.78    17.08      .1         50.70
Bagging      355.31   30.87      49.4       11.75
MLP          45.34    .0         1.0        10.80

In terms of training speed, it is evident in Table 4 that among the SVM based methods, SVM-BDT is the fastest one in the training phase. For the three 10-class problems, the time needed to train the 10 classifiers for the OvA approach was about 4 times longer than training the 45 classifiers for the OvO and DAG methods. Due to the huge number of training samples in the MNIST dataset (60000), SVM-BDT's training time was longer compared to the other one-against-one methods. The huge number of training samples increases the nonlinearity of the hyperplane in the SVM, resulting in an increased number of support vectors and increased training time. Also, this delay exists only in the first level of the tree, where the entire training dataset is used for training. In the lower levels, the training time of the divided subsets is not as significant as the first level's delay. In the other 10-class problems, our method achieved the shortest training time. For the Statlog dataset, the time needed for training the one-against-all SVMs was almost 7 times longer than the time for training the 325 one-against-one SVMs. The BTS method is the slowest one in the training phase of the methods using one-against-one SVMs. It must be noted that as the number of classes in the dataset increases, the advantage of SVM-BDT becomes more evident. The SVM-BDT method was the fastest in training, achieving a better recognition rate than the methods using one-against-one SVMs. It was only slightly slower in recognition than DAG.

7 Conclusion

A novel architecture of Support Vector Machine classifiers utilizing a binary decision tree (SVM-BDT) for solving multiclass problems was presented. The SVM-BDT architecture was designed to provide superior multi-class classification performance, utilizing a decision tree architecture that requires much less computation for deciding a class for an unknown sample. A clustering algorithm that utilizes distance measures in the kernel space is used to convert the multi-class problem into a binary decision tree, in which the binary decisions are made by SVMs. The results of the experiments show that the speed of training and testing is improved, while keeping comparable or offering better recognition rates than the other SVM multi-class methods. The experiments showed that this method becomes more favourable as the number of classes in the recognition problem increases.

References
[1] V. Vapnik. The Nature of Statistical Learning Theory, 2nd Ed. Springer, New York, 1999.
[2] C. J. C. Burges. A tutorial on support vector machines for pattern recognition. Data Mining and Knowledge Discovery, 2(2):121-167, 1998.
[3] T. Joachims. Making large scale SVM learning practical. In B. Schölkopf, C. Burges and A. Smola (eds.), Advances in Kernel Methods - Support Vector Learning, MIT Press, Cambridge, MA, 1998.
[4] R. Fletcher. Practical Methods of Optimization, 2nd Ed. John Wiley & Sons, Chichester, 1987.
[5] J. Weston, C. Watkins. Multi-class support vector machines. Proceedings of ESANN99, M. Verleysen, Ed., Brussels, Belgium, 1999.

[6] V. Vapnik. Statistical Learning Theory. Wiley, New York, 1998.
[7] J. H. Friedman. Another approach to polychotomous classification. Technical report, Department of Statistics, Stanford University, 1997.
[8] P. Xu, A. K. Chan. Support vector machines for multi-class signal classification with unbalanced samples. Proceedings of the International Joint Conference on Neural Networks 2003, Portland, pp. 1116-1119, 2003.
[9] J. Platt, N. Cristianini, J. Shawe-Taylor. Large margin DAGs for multiclass classification. Advances in Neural Information Processing Systems, Vol. 12, pp. 547-553, 2000.
[10] B. Fei, J. Liu. Binary Tree of SVM: A New Fast Multiclass Training and Classification Algorithm. IEEE Transactions on Neural Networks, Vol. 17, No. 3, May 2006.
[11] X. Liu, H. Xing, X. Wang. A multistage support vector machine. 2nd International Conference on Machine Learning and Cybernetics, pages 1305-1308, 2003.
[12] A. Ben-Hur, D. Horn, H. Siegelmann, V. Vapnik. Support vector clustering. Journal of Machine Learning Research, 2:125-137, 2001.
[13] J. Platt. Fast training of support vector machines using sequential minimal optimization. In Advances in Kernel Methods - Support Vector Learning, pages 185-208, Cambridge, MA, 1999. MIT Press.
[14] R. Collobert, S. Bengio, J. Mariéthoz. Torch: a modular machine learning software library. Technical Report IDIAP-RR 02-46, IDIAP, 2002.
[15] MNIST database of handwritten digits, http://yann.lecun.com/exdb/mnist [Online].
[16] T. G. Dietterich. Machine learning research: Four current directions. AI Magazine, 18(4):97-136, Winter 1997.
[17] T. G. Dietterich. Ensemble methods in machine learning. In J. Kittler and F. Roli (eds.), First International Workshop on Multiple Classifier Systems, pp. 1-15. Springer-Verlag, 2000a.
[18] D. Opitz and R. Maclin. Popular ensemble methods: An empirical study. Journal of Artificial Intelligence Research, 11:169-198, 1999.
[19] E. Bauer and R. Kohavi. An empirical comparison of voting classification algorithms: Bagging, boosting, and variants. Machine Learning, 36:105-139, 1999.
[20] Y. Freund and R. E. Schapire. A decision-theoretic generalization of on-line learning and an application to boosting. Journal of Computer and System Sciences, 55(1):119-139, 1997.
[21] L. Breiman. Bagging predictors. Machine Learning, 24(2):123-140, 1996.
[22] T. G. Dietterich and G. Bakiri. Solving multiclass learning problems via error-correcting output codes. Journal of Artificial Intelligence Research, 2:263-286, 1995.
[23] S. D. Bay. Nearest neighbor classification from multiple feature subsets. Intelligent Data Analysis, 3(3):191-209, 1999.
[24] J. F. Kolen and J. B. Pollack. Back propagation is sensitive to initial conditions. In Advances in Neural Information Processing Systems 3 (NIPS-90), pp. 860-867. Morgan Kaufmann, 1991.
[25] T. G. Dietterich. An experimental comparison of three methods for constructing ensembles of decision trees: Bagging, boosting, and randomization. Machine Learning, 40(2):139-158, 2000b.
[26] C. Angulo and A. Català. K-SVCR: A multi-class support vector machine. In R. López de Mántaras and E. Plaza (eds.), Proceedings of the 11th European Conference on Machine Learning (ECML-2000), pp. 31-38. Springer-Verlag, 2000.
[27] A. Kalousis and T. Theoharis. Noemon: Design, implementation and performance results of an intelligent assistant for classifier selection. Intelligent Data Analysis, 3(5):319-337, 1999.
[28] J. Fürnkranz. Round robin classification. The Journal of Machine Learning Research, 2:721-747, 2002.
[29] W. W. Cohen. Fast effective rule induction. In A. Prieditis and S. Russell (eds.), Proceedings of the 12th International Conference on Machine Learning (ML-95), pp. 115-123, Lake Tahoe, CA, 1995. Morgan Kaufmann.
[30] D. Kocev, C. Vens, J. Struyf and S. Džeroski. Ensembles of multi-objective decision trees. Proceedings of the 18th European Conference on Machine Learning, pp. 624-631, 2007. Springer.
[31] H. Blockeel, J. Struyf. Efficient Algorithms for Decision Tree Cross-validation. Journal of Machine Learning Research, 3:621-650, 2002.
[32] D. Gorgevik, D. Cakmakov. An Efficient Three-Stage Classifier for Handwritten Digit Recognition. Proceedings of the 17th Int. Conference on Pattern Recognition, ICPR 2004, Vol. 4, pp. 507-510, IEEE Computer Society, Cambridge, UK, 23-26 August 2004.
[33] C. Blake, E. Keogh and C. Merz. UCI Repository of Machine Learning Databases, 1998. http://archive.ics.uci.edu/ml/datasets.html [Online].
[34] Statlog (Letter Recognition) Data Set, http://archive.ics.uci.edu/ml/datasets/Letter+Recognition [Online].
