Binarized support vector machines


Universidad Carlos III de Madrid, e-Archivo institutional repository. Departamento de Estadística, DES - Working Papers. Statistics and Econometrics. WS. Binarized support vector machines. Carrizosa, Emilio. Downloaded from e-Archivo, the institutional repository of Universidad Carlos III de Madrid.

Working Paper 08-65, Statistics and Econometrics Series 23. Departamento de Estadística, Universidad Carlos III de Madrid, Calle Madrid, Getafe (Spain). November 2007.

Binarized Support Vector Machines. Emilio Carrizosa,¹ Belén Martín-Barragán² and Dolores Romero Morales³

¹ Universidad de Sevilla, Departamento de Estadística e Investigación Operativa, Spain. ecarrizosa@us.es
² Universidad Carlos III de Madrid, Department of Statistics, Spain. belen.martin@uc3m.es
³ University of Oxford (United Kingdom). dolores.romero-morales@sbs.ox.ac.uk

Binarized Support Vector Machines

Emilio Carrizosa, Universidad de Sevilla (Spain). Belén Martín-Barragán, Universidad Carlos III de Madrid (Spain). Dolores Romero Morales, University of Oxford (United Kingdom).

July 28, 2008

Abstract. The widely used Support Vector Machine (SVM) method has been shown to yield very good results in Supervised Classification problems. Other methods, such as Classification Trees, have become more popular among practitioners than SVM thanks to their interpretability, which is an important issue in Data Mining. In this work, we propose an SVM-based method that automatically detects the most important predictor variables, and the role they play in the classifier. In particular, the proposed method is able to detect those values and intervals which are critical for the classification. The method involves the optimization of a Linear Programming problem with a large number of decision variables. The numerical experience reported shows that a rather direct use of the standard Column-Generation strategy leads to a classification method which, in terms of classification ability, is competitive against the standard linear SVM and Classification Trees. Moreover, the proposed method is robust, i.e., it is stable in the presence of outliers and invariant to changes of scale or

This work has been partially supported by projects MTM C03-01 of MEC, Spain, and FQM-329 of Junta de Andalucía, Spain.

measurement units of the predictor variables. When the complexity of the classifier is an important issue, a wrapper feature selection method is applied, yielding simpler, still competitive, classifiers.

Keywords: Supervised Classification, Binarization, Column Generation, Support Vector Machines.

1 Introduction and literature review

Classifying objects or individuals into different classes or groups is one of the aims of Data Mining. This topic has been addressed in different areas such as Statistics, Operations Research and Artificial Intelligence. A general introduction to Data Mining can be found in [19]. We focus on the well-known Supervised Classification problem, usually referred to as Discriminant Analysis by statisticians, where we have a set of objects Ω and the aim is to build a classification rule which predicts the class membership c_u of an object u into one of a predefined set of classes C, by means of its predictor vector x^u. The predictor vector x takes values in a set X, which is usually assumed to be a subset of IR^p, such as {0, 1}^p, and the components x_ℓ, ℓ = 1, 2, ..., p, of the predictor vector x are called predictor variables. The other piece of information defining u is the class c_u to which object u belongs. In this paper we restrict ourselves to the case in which two classes exist, C = {−1, 1}, since the multiclass case can be reduced to a series of two-class problems, as has been suggested e.g. in [20, 22, 35]. Not all the information about the objects in Ω is available, but only in a subset I, called the training sample, where both the predictor vector and the class membership of the objects are known. With this information, the classification rule must be built. The Support Vector Machines (SVM) [12] approach is based on margin maximization, which consists in finding the separating hyperplane which is furthest from the closest object. SVM has been shown to be a very powerful tool for Supervised Classification.
The most popular versions of SVM embed, via a kernel function, the original predictor variables into a higher (possibly infinite) dimensional space, [22]. In this way one obtains classifiers with good generalization properties but which are, in general, hard to interpret.

In some application fields, practitioners, such as doctors or businessmen, may be very unwilling to use a classifier they cannot interpret. For them, Data Mining methods sometimes proceed like a black box, so they would not feel confident enough to use the classifier unless they can interpret it somehow. For instance, it is easy to interpret and manage queries of the type: Is predictor variable 1 big? Is predictor variable 2 small? Does predictor variable 3 attain a very extreme value? where the concepts of big, small and extreme value must be quantified, e.g. in the form

    is predictor variable x_ℓ greater than or equal to b?    (1)

This type of query is used e.g. in Classification Trees. Because of their very intuitive graphical display, practitioners can interpret the classifier, describing how it works. Moreover, they can detect the values of a predictor variable critical for the classification. In this paper we work in an SVM-based framework, where the feature space is defined by binarizing each predictor variable, i.e., by transforming each numerical predictor variable into a large series of binary variables, obtained by making queries of type (1) for many different cutoffs b. Since we are using SVM after binarizing the predictor variables, we call our method Binarized Support Vector Machines (BSVM). Our aim is to show that, if SVM is run after such binarization, the resulting classifier has valuable properties concerning classification ability, interpretability and robustness. First, the numerical experience reported in Section 4 shows that BSVM yields lower misclassification rates than Classification Trees and is competitive (better in most cases we tested) against linear SVM. Second, BSVM takes us a step forward towards the interpretability of SVMs. Since the obtained classification rule is based on queries of type (1), the critical values and intervals of the predictor variables are identified, as done by Classification Trees.

Third, BSVM is linear, in the sense that it yields a classification rule which is linear in the features selected, and can be seen as a linear SVM after transforming the variables via a nonlinear mapping. Such a mapping, which can be visualized by means of a graphical procedure, is valuable in terms of interpretability, since it shows the way each predictor variable influences the classifier. The capability of BSVM to capture nonlinearities presents an advantage with respect to the standard linear SVM, with important consequences for classification ability, as shown in our numerical results. Finally, the use of queries of type (1) in our method makes it appealing in terms of robustness against outliers and changes of scale in the data. The numerical experience reported shows that our procedure behaves like Classification Trees, and clearly better than linear SVMs in the presence of outliers. Moreover, whereas the linear SVM is influenced by the scale of the data (even a linear change of scale in the predictor variables may change the classifier and thus the classification ability of the SVM methodology), our proposal yields a classifier which is invariant to changes of scale, in the sense that, if the data were modified by a monotonic (non)linear transformation, the classifier obtained would be the same. The idea of binarizing continuous predictor variables is not new at all in the field of Classification. Indeed, Classification Trees are precisely based on the strategy of defining, for different predictor variables, appropriate cutoffs; moreover, binarizing is also natural in other classification procedures, such as Neural Networks, where the well-known step activation function, already at the heart of the perceptron method, allows one to discretize any predictor variable or combination of them. Binarizing continuous predictor variables has also been proposed in the so-called rule extraction procedures, within SVM [3, 4, 16, 27, 29] and Neural Networks as well, e.g. [1, 2, 13].
When a rule extraction method is applied to a classifier, one obtains an alternative classifier which hopefully has a similar behavior on data, but is more interpretable, since it is based on simple rules, such as those derived from queries of type (1). Whereas our method shares with rule extraction procedures the aim of enhancing the interpretability of the output of an SVM, we are not proposing to replace the SVM classifier by a more interpretable one which, based on queries of type (1), mimics the behavior of the SVM classifier; instead, we are proposing a binarization of the data, via queries of type (1), and obtaining the SVM classifier for such transformed data. This way, we increase

interpretability with respect to standard SVM and, as shown in our numerical experience, provide a competitive method in terms of misclassification rates. As far as we are aware, there is no paper using binarization in this way for Support Vector Machines, and thus our strategy is new in the context of SVM. The classifier proposed in this paper, BSVM, is described in Section 2, where we analyze the interpretability of the proposed method and propose a visualization tool for plotting the role a predictor variable plays in the classifier. Since the number of features to be considered may be huge, the BSVM method yields an optimization problem with a large number of decision variables. In Section 3, a Column-Generation-based algorithm is proposed in order to solve such an optimization problem. Numerical results are presented in Section 4, to illustrate the classification ability as well as the desirable properties of BSVM, showing that the proposed approach gives a classifier which is competitive against the standard linear SVM or Classification Trees. Conclusions and some lines for future research are discussed in Section 5.

2 Binarized Support Vector Machines

Queries of type (1) are simple and hence are, by themselves, easy to interpret. In Section 2.1, we introduce a classifier which is made up of queries of this type. We propose in Section 2.2 a visualization tool which graphically represents the role any original predictor variable plays in the classifier. The search for a good classifier is based on SVM ideas, as described in Section 2.3.

2.1 Using simple queries

In practical applications, simple classification rules based on queries of type (1) are very desirable because of their interpretability. For example, a doctor would say that having high blood pressure is a symptom of disease. Choosing the threshold b from which a specific blood pressure would be considered high is not usually an easy task.

We theoretically consider all the possible queries of type (1), mathematically formalized by the function

$$\phi_{\ell b}(x)=\begin{cases}1 & \text{if } x_\ell \ge b\\ 0 & \text{otherwise}\end{cases}\qquad(2)$$

for b ∈ IR and ℓ = 1, 2, ..., p. In what follows, each function of type φ_ℓb is called a feature. Our method constructs a classifier after binarizing all predictor variables. This binarization procedure could be done in a tedious preprocessing step, where all the possible features are created. Instead, as will be seen later, we propose a method to generate, by means of a dynamic process, only those features which are more promising in terms of classification ability. The set of possible cutoff values b (and thus the number of features) is, in principle, infinite. However, given a training sample I, many of those possible cutoffs will yield exactly the same classification of the objects in I, which are the objects whose information is available. In this sense, given the training sample I, for any given predictor variable ℓ, all values of b between two consecutive values of x_ℓ lead to functions φ_ℓb which behave identically on the training sample I. Hence, if we construct, for each predictor variable ℓ, the finite set B_ℓ of midpoints between consecutive values of x_ℓ in the training sample I, it turns out that the set of functions {φ_ℓb : b ∈ B_ℓ} is, on I, as rich as the full set of step functions {φ_ℓb : b ∈ IR}. Instead of working with the full set of functions defined by all possible cutoffs b ∈ IR, the family of features under consideration in this paper is given by F = {φ_ℓb : b ∈ B_ℓ, ℓ = 1, 2, ..., p}. These features are used to classify in the following way: each feature φ_ℓb has an associated weight ω_ℓb, measuring its contribution to the classification into class −1 or 1. The weighted sum of those features plus a threshold β constitutes the score function,

$$f(x)=\omega'\Phi(x)+\beta=\sum_{\ell=1}^{p}\;\sum_{\{b\in B_\ell:\; x_\ell\ge b\}}\omega_{\ell b}+\beta,\qquad(3)$$

where Φ(x) = (φ_ℓb(x))_{b∈B_ℓ, ℓ=1,2,...,p} and ω′Φ(x) denotes the scalar product of the vectors ω and Φ(x),

$$\omega'\Phi(x)=\sum_{\ell=1}^{p}\sum_{b\in B_\ell}\omega_{\ell b}\,\phi_{\ell b}(x)=\sum_{\ell=1}^{p}\;\sum_{\{b\in B_\ell:\; x_\ell\ge b\}}\omega_{\ell b}.$$
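The midpoint cutoff sets B_ℓ and the feature map Φ can be sketched in a few lines of NumPy. This is an illustrative sketch, not the authors' implementation; the function names `midpoint_cutoffs` and `binarize` are ours.

```python
import numpy as np

def midpoint_cutoffs(X):
    """For each predictor variable, the set B_l of midpoints between
    consecutive distinct values observed in the training sample."""
    cutoffs = []
    for col in X.T:
        v = np.unique(col)                    # sorted distinct values
        cutoffs.append((v[:-1] + v[1:]) / 2)  # midpoints
    return cutoffs

def binarize(X, cutoffs):
    """Feature map Phi: one binary column per (variable, cutoff) pair,
    phi_{l,b}(x) = 1 iff x_l >= b."""
    cols = [(X[:, l][:, None] >= b[None, :]).astype(float)
            for l, b in enumerate(cutoffs)]
    return np.hstack(cols)

# Toy training sample with p = 2 predictor variables
X = np.array([[1.0, 10.0], [2.0, 10.0], [4.0, 30.0]])
B = midpoint_cutoffs(X)   # B[0] = [1.5, 3.0], B[1] = [20.0]
Phi = binarize(X, B)      # 3 objects x 3 binary features
```

On the training sample itself, these finitely many cutoffs generate every classification that the infinite family of step functions could produce, which is exactly the reduction argued above.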
Objects will be allocated to class −1 if f(x) < 0, and to class 1 if f(x) > 0. In case of ties, i.e. f(x) = 0, objects

can be allocated randomly or by some predefined order. In this paper, following a worst-case approach, they will be considered as misclassified. For a certain predictor variable ℓ, the coefficient ω_ℓb associated with feature φ_ℓb represents the amount with which the query "is x_ℓ ≥ b?" contributes to the score function (3). Those predictor variables ℓ for which ω_ℓb is zero for all b ∈ B_ℓ are not needed for the classification, and can be discarded. In other words, only those values b for which the corresponding ω_ℓb are non-zero can be considered as critical values, in terms of classification, in predictor variable ℓ. Moreover, the magnitude of ω_ℓb enables us to quantify the importance of the cutoff point b in separating individuals of classes −1 and 1. To fix ideas, let us consider, for instance, the Wisconsin Breast Cancer Database from the UCI Machine Learning Repository, [28], with data from cancer diagnosis. Each individual has 30 predictor variables, which, in principle, are to be taken into consideration. However, if we use BSVM, it turns out that only some of these predictor variables are shown to be relevant for the classification. In particular, with a specific choice of the parameter in our model, only 12 have at least one non-zero ω_ℓb, and are thus those which need to be taken into account. Tables 1 and 2 focus on two of these relevant predictor variables, namely Mean Concave Points and Worst Radius. For predictor variable Mean Concave Points, each cutoff b has an associated weight. For predictor variable Worst Radius, the only cutoff is b = 17.72. Since the output of the features proposed in this paper is always binary, the importance represented by the coefficient is always measured on the same scale.
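The score function (3) and the worst-case allocation rule can be written directly. A minimal sketch; the weights, threshold and function names below are hypothetical illustrations, not values from the paper.

```python
import numpy as np

def score(phi_x, omega, beta):
    """Score f(x) = omega . Phi(x) + beta for a binarized object."""
    return float(np.dot(omega, phi_x) + beta)

def classify(phi_x, omega, beta):
    """Allocate to class +1 or -1; a tie (f = 0) counts as misclassified
    under the worst-case convention, flagged here by returning 0."""
    f = score(phi_x, omega, beta)
    return 0 if f == 0 else (1 if f > 0 else -1)

# Hypothetical weights for three features and a threshold
omega = np.array([0.4, -0.1, 2.0])
beta = -1.0
```

Because each φ_ℓb is binary, the magnitudes of the entries of `omega` are directly comparable across features, which is what makes the weights interpretable as importances.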
This means that having Mean Concave Points greater than or equal to its critical value is more important for the classification than having Worst Radius greater than or equal to 17.72. Moreover, for predictor variable Worst Radius, the only important issue for classification purposes is whether this variable takes a value greater than or equal to its unique critical value, namely 17.72. In contrast, for variable Mean Concave Points other cutoffs are also used in the classifier.

2.2 Visualization tool

The weight ω_ℓb gives insightful knowledge about how predictor variable ℓ, together with the cutoff b, influences the classification. In this section, we focus on the influence of predictor variable ℓ as a whole, instead of a particular

cutoff.

Table 1: Cutoffs and weights for predictor variable Mean Concave Points.

Table 2: Cutoffs and weights for predictor variable Worst Radius.

The role of predictor variable ℓ in the score function is modeled by the stepwise function

$$s\;\mapsto \sum_{\{b\in B_\ell:\; s\ge b\}}\omega_{\ell b}.\qquad(4)$$

This stepwise function is useful since it approximates the most adequate mapping to be applied to predictor variable ℓ to optimize the classification task. As an illustration, Table 3 shows, for predictor variable Worst Texture, its cutoffs b, the corresponding weights ω_ℓb and the cumulative weights.

Table 3: Weights and cumulative weights for predictor variable Worst Texture.

We propose a plot of function (4) in order to gain insight into the contribution of predictor variable ℓ to the classifier. This plot is a valuable tool for practitioners, who can use it to interpret the classifier, describing the role every predictor variable plays in the classification. In particular, it allows a direct choice of the relevant predictor variables, and detection of critical values and intervals. In Figure 1 we show, for each of the 12 relevant predictor variables, i.e., those with at least one non-zero ω_ℓb, its

contribution to the score function. Predictor variables Mean Concave Points, Worst Area and Worst Concave Points are the most important, whereas Worst Texture and Worst Perimeter have a medium importance. Using this graphical representation, in predictor variable Mean Concave Points we can detect a critical value which, as said in the previous section, is concentrated at a single point. In the case of Worst Texture, in the interval between 21 and 27 the importance of this predictor variable grows little by little, whereas outside this interval it remains constant. We can say that in this case we have a critical interval instead of a critical cutoff. We have also plotted in Figure 1 the median (represented by a star) and the mean (represented by a cross). It can be seen that, although the mean or the median is sometimes a good choice for the cutoff, this does not happen in general, and BSVM prefers other choices.

Figure 1: Contribution of predictor variables
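The stepwise contribution (4) is simple to evaluate: for a value s of predictor variable ℓ, sum the weights of all cutoffs at or below s. A minimal sketch with hypothetical cutoffs and weights (the name `contribution` is ours); plotting it as a step curve reproduces the kind of display shown in Figure 1.

```python
def contribution(s, cutoffs_l, weights_l):
    """Stepwise function (4): cumulative weight of all cutoffs b <= s
    for one predictor variable."""
    return sum(w for b, w in zip(cutoffs_l, weights_l) if s >= b)

# Hypothetical cutoffs B_l = {1.5, 3.0} with weights {0.4, -0.1}
B_l = [1.5, 3.0]
W_l = [0.4, -0.1]
```

A flat stretch of this curve means the variable's value is irrelevant within that range; a jump marks a critical value, and a staircase of small jumps marks a critical interval, as with Worst Texture above.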

Graphical representations like Figure 1 can provide insight into the role a predictor variable plays in the classifier. In Figure 2, we present three different scenarios, found by applying the proposed method to other publicly available databases. More details can be found in Section 4. In the first case, there is just one value which is critical for classification. The second instance suggests an S-shaped transformation, similar to Worst Texture in Figure 1; it identifies a critical interval, within which the behavior is linear. Other types of mappings, harder to interpret, are, of course, possible. This is the case of the third function, which suggests a logarithmic transformation.

Figure 2: Transformations suggested by the method

Summing up, we have proposed a classifier that, using simple queries of type (1), allows us to visualize the role every predictor variable plays in the classification. Now, it is time to describe the procedure for finding the weights ω_ℓb associated with each predictor variable and cutoff. We follow an SVM-based framework, developed in the next section.

2.3 Support vector machines

In order to choose ω and β we follow an SVM-based approach, which consists in finding the separating hyperplane which maximizes the margin in the feature space, i.e. the space of the images Φ(x^u) of the objects u in the training sample I. The use of margin maximization is theoretically motivated by bounds on the generalization ability [31, 34, 35], where the probability of misclassifying a forthcoming individual is bounded by a function which is decreasing in the margin. The so-called hard-margin SVM approach proposes the choice of the separating hyperplane with maximal margin, i.e. the hyperplane which correctly classifies all objects in I and is furthest from the closest object. In contrast, the so-called soft-margin approach allows some objects to be misclassified. We use

this latter version as it has been empirically shown to avoid overfitting, a phenomenon which happens when a low misclassification rate in I does not generalize to forthcoming objects. The soft-margin maximization problem is formulated in this paper as

$$\begin{array}{rll}
\min & \|\omega\|+C\displaystyle\sum_{u\in I}\xi_u & \\
\text{s.t.} & c_u\left(\omega'\Phi(x^u)+\beta\right)+\xi_u\ge 1 & u\in I\\
& \xi_u\ge 0 & u\in I\\
& \omega\in\mathbb{R}^N,\ \beta\in\mathbb{R},
\end{array}\qquad(5)$$

where the decision variables are the weight vector ω, the threshold value β and the perturbations ξ_u associated with the misclassification of object u ∈ I. Here ‖·‖ denotes the L1 norm, C is a constant that trades off the margin in the correctly classified objects against the perturbations ξ_u, and N = Σ_{ℓ=1}^p |B_ℓ|, where |S| denotes the cardinality of a set S. An appropriate value of C is chosen here, as detailed in Section 4.1, by inspecting a wide range of values and then measuring the performance with respect to misclassification rates, assessed by cross-validation, [23]. SVM seeks the separating hyperplane maximizing a function of the distances. There are many different possible choices for the distance function, which lead to different variants of SVM, [9]. Whereas the distance between points has usually been taken in the literature to be the Euclidean norm, yielding the margin to be measured by the Euclidean norm as well, other norms such as the L1 norm and the L∞ norm have been considered and successfully applied. See for instance [5, 7, 10, 24, 25, 32, 36] and the references therein. Moreover, the choice of the L∞ norm to measure distances is equivalent to the use of the L1 norm regularization, also called the lasso penalty, [21, 33]. Contrary to the Euclidean case, in which a maximal margin hyperplane can be found by solving a quadratic program with linear constraints, if, as in this paper, the L1 norm regularization is used, then an optimal solution of the corresponding optimization problem can be found by solving a Linear Programming (LP) problem. In [30], empirical results show that, in terms of separation performance, L1, L∞ and Euclidean norm-based SVMs tend to be quite similar.
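For a candidate (ω, β), the objective of (5) can be evaluated directly, since at the optimum the slacks satisfy ξ_u = max(0, 1 − c_u f(x^u)). A small sketch under that observation; the data and the name `objective` are ours, not from the paper.

```python
import numpy as np

def objective(omega, beta, Phi, c, C):
    """Soft-margin objective of (5): ||omega||_1 + C * sum of slacks,
    with xi_u = max(0, 1 - c_u * f(x^u))."""
    margins = c * (Phi @ omega + beta)
    xi = np.maximum(0.0, 1.0 - margins)
    return np.abs(omega).sum() + C * xi.sum()

# Toy binarized sample: 3 objects, 3 binary features, classes -1/-1/+1
Phi = np.array([[0., 0., 0.], [1., 0., 0.], [1., 1., 1.]])
c = np.array([-1., -1., 1.])
obj = objective(np.array([0., 0., 2.]), -1.0, Phi, c, C=10.0)
```

Here the candidate separates the toy sample with all margins equal to 1, so the slack term vanishes and the objective reduces to the L1 norm of ω.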
Moreover, the L1 norm regularization contributes to sparsity in the classifier, yielding ω with many components equal to zero; see for instance [7, 15] or [26], which states that one of the principal advantages of 1-norm support vector machines (SVMs) is that, unlike conventional 2-norm SVMs, they are very effective in

reducing input space features for linear kernels. Since ‖·‖ is the L1 norm, Problem (5) can be formulated as the following LP problem:

$$\begin{array}{rll}
\min & \displaystyle\sum_{\ell=1}^{p}\sum_{b\in B_\ell}(\omega^+_{\ell b}+\omega^-_{\ell b})+C\sum_{u\in I}\xi_u & \\
\text{s.t.} & \displaystyle\sum_{\ell=1}^{p}\sum_{b\in B_\ell}(\omega^+_{\ell b}-\omega^-_{\ell b})\,c_u\,\phi_{\ell b}(x^u)+\beta c_u+\xi_u\ge 1 & u\in I\\
& \omega^+_{\ell b}\ge 0,\ \omega^-_{\ell b}\ge 0 & b\in B_\ell,\ \ell=1,2,\dots,p\\
& \xi_u\ge 0 & u\in I\\
& \beta\in\mathbb{R}.
\end{array}\qquad(6)$$

After finding the maximal margin hyperplane in the feature space defined by F, the score function has the form described in (3). As said before, for each feature, the absolute value of its coefficient indicates the importance of that feature for the classification. In particular, features whose corresponding coefficient is zero can be seen as irrelevant for classification purposes. Using basic Linear Programming theory, it is easy to see that the number of features with non-zero coefficient is not larger than the number of objects in the training sample.

3 Column Generation

Problem (6) is a large-scale linear program, which may be solved by any method that can handle SVM with the L1 norm regularization, including general-purpose LP procedures or more ad hoc methods, see e.g. [15]. In this paper, we propose that Problem (6) be solved by the well-known Mathematical Programming tool called Column Generation, initially introduced for the cutting-stock problem [17], and successfully used, under different variants, in several works on Support Vector Machines, such as [6, 8, 14, 26]. A brief discussion on how Column Generation is tailored to solve Problem (6) follows. Instead of solving Problem (6) directly, which has a high number of decision variables, the Column Generation technique solves a series of reduced problems where decision variables, corresponding to features in the set F, are iteratively added as needed.
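The LP (6) can be sketched with a generic solver such as `scipy.optimize.linprog`: stack the variables as (ω⁺, ω⁻, ξ, β⁺, β⁻), with the free β split into two nonnegative parts. This is an illustrative reduction, not the authors' implementation (they use CPLEX), and the function name `bsvm_lp` is ours.

```python
import numpy as np
from scipy.optimize import linprog

def bsvm_lp(Phi, c, C):
    """Solve the L1-regularized soft-margin LP (6) for a binarized
    sample Phi (n objects x N features) and classes c in {-1, +1}.
    Variable order: [omega_plus (N), omega_minus (N), xi (n), beta+, beta-]."""
    n, N = Phi.shape
    cost = np.concatenate([np.ones(2 * N), C * np.ones(n), [0.0, 0.0]])
    # Margin constraints c_u (Phi_u . (w+ - w-) + beta) + xi_u >= 1,
    # rewritten in linprog's A_ub x <= b_ub form.
    A = np.hstack([-c[:, None] * Phi, c[:, None] * Phi,
                   -np.eye(n), -c[:, None], c[:, None]])
    b = -np.ones(n)
    res = linprog(cost, A_ub=A, b_ub=b,
                  bounds=[(0, None)] * (2 * N + n + 2))
    w = res.x[:N] - res.x[N:2 * N]
    beta = res.x[2 * N + n] - res.x[2 * N + n + 1]
    return w, beta, res

# Toy binarized sample: separable with ||w||_1 = 2 at the optimum
Phi = np.array([[0., 0., 0.], [1., 0., 0.], [1., 1., 1.]])
c = np.array([-1., -1., 1.])
w, beta, res = bsvm_lp(Phi, c, C=10.0)
```

The dual values λ_u driving column generation can be read from the solver's inequality duals (in recent SciPy, `res.ineqlin.marginals`); with a dedicated solver such as CPLEX they are available directly.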

For F′ ⊆ F, let Master Problem (6-F′) be Problem (6) with the family of features F′. In each iteration, we first solve Problem (6-F′). The next step is to check whether the current solution is optimal for Problem (6) or not, and, in the latter case, to generate a new feature φ improving the objective value of the current solution. The generated feature is added to the family of features F′. This process is repeated until optimality is reached. To generate new features, we use the dual formulation of Problem (6),

$$\begin{array}{rll}
\max & \displaystyle\sum_{u\in I}\lambda_u & \\
\text{s.t.} & -1\le\displaystyle\sum_{u\in I}\lambda_u c_u\,\phi(x^u)\le 1 & \phi\in F\\
& \displaystyle\sum_{u\in I}\lambda_u c_u=0 & \\
& 0\le\lambda_u\le C & u\in I.
\end{array}\qquad(7)$$

The dual formulation of the Master Problem (6-F′) only differs from this one in the first set of constraints, which should hold for φ ∈ F′ instead of φ ∈ F. Let (ω*, β*) be an optimal solution of Master Problem (6-F′), and let (λ*_u)_{u∈I} be the values of the corresponding optimal dual solution. If the optimal solution of the Master Problem (6-F′) is also optimal for Problem (6), then, for every feature φ ∈ F, the constraints of the Dual Problem (7) will hold, i.e.

$$-1\le\sum_{u\in I}\lambda^*_u c_u\,\phi(x^u)\le 1.\qquad(8)$$

Denote Γ(φ) = Σ_{u∈I} λ*_u c_u φ(x^u). If (ω*, β*) is not optimal for Problem (6), then the most violated constraint gives us information about which feature is promising and could be added to F′, in the sense that adding such a feature to the set F′ would yield, at that iteration, the highest improvement of the objective function. Thus, we wish to generate a new feature φ ∈ F maximizing |Γ(φ)|. In the remainder of this section we give a more detailed description of our implementation of the Column Generation algorithm.

3.1 Initial set of features

In the column generation procedure new features are sequentially added to the problem, based on the dual values of the current solution. The Column Generation technique starts with an initial Master Problem, i.e. a set of features

F₀ must be chosen to initialize the algorithm. We have chosen to start with one feature per predictor variable, with the cutoff set equal to its median over the objects of I.

3.2 Generation of features

Finding the best φ ∈ F reduces to finding a predictor variable ℓ and a cutoff b ∈ B_ℓ such that |Γ(φ_ℓb)|, with φ_ℓb defined by (2), is maximal. This can be done by full inspection of the set {φ_ℓb⁺_ℓ, φ_ℓb⁻_ℓ : ℓ = 1, 2, ..., p}, where, for each predictor variable ℓ, the cutoff b⁺_ℓ (respectively b⁻_ℓ) is the value in B_ℓ for which Γ(φ_ℓb) is highest (respectively lowest). In this section we describe an algorithm for finding, for a fixed predictor variable ℓ, the cutoff b⁺_ℓ maximizing Γ(φ_ℓb); finding b⁻_ℓ can be done in a similar way. First, we sort all the objects in decreasing order of the values of predictor variable ℓ. Denote by u(i) the object in the i-th position. For simplicity, suppose there are no repeated values, i.e. x_ℓ^{u(1)} > x_ℓ^{u(2)} > ... > x_ℓ^{u(|I|)}; the case with repeated values will be analyzed later. Once ℓ is fixed and all the objects are sorted by the values of predictor variable ℓ, the value Γ(φ_ℓb) can be efficiently calculated with a recursive procedure. Indeed, for i ∈ {1, 2, ..., |I|}, we have Γ(φ_ℓb_{i+1}) = Γ(φ_ℓb_i) + λ_{u(i+1)} c_{u(i+1)}, where b_i denotes the cutoff chosen as (x_ℓ^{u(i)} + x_ℓ^{u(i+1)})/2. Moreover, since λ_u is nonnegative for all u ∈ I, we have that Γ(φ_ℓb_{i+1}) ≥ Γ(φ_ℓb_i) whenever c_{u(i+1)} = 1. Hence, checking whether φ_ℓb_i is a maximum is not needed for every i, but only for those i such that c_{u(i)} = 1 and c_{u(i+1)} = −1. In the case in which there are repeated values in {x_ℓ^u : u ∈ I}, the rule above does not apply. Let i and t be such that x_ℓ^{u(i−1)} > x_ℓ^{u(i)} = x_ℓ^{u(i+1)} = ... = x_ℓ^{u(i+t)} > x_ℓ^{u(i+t+1)}. Note that, in the set of objects where predictor variable ℓ has the same value, there could be objects belonging to different classes. In this case, the value b = b_i must be checked, whatever the value of c_{u(i+t+1)}. However, if c_{u(i)} = c_{u(i+1)} = ...
= c_{u(i+t)} and c_{u(i+t+1)} = 1, we know that setting b = b_j, for any j = i, i+1, ..., i+t, will be improved by setting b = b_{i+t+1}. This means that b = b_i does not give a maximum of Γ(φ_ℓb). Only if c_{u(i)} = 1 and c_{u(i+t+1)} = −1 is it worth considering b = b_i as a candidate for the maximum. The minimization of Γ is done analogously. For example, in the case of no repeated values, candidates for a minimum correspond to objects u(i) belonging to class −1 where the next object u(i+1) belongs to class 1.
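The running-sum scan underlying this cutoff search can be sketched as follows. This simplified version assumes no repeated values and omits the class-pattern pruning used in Algorithm 1; the name `best_cutoffs` is ours.

```python
import numpy as np

def best_cutoffs(x_col, c, lam):
    """For one predictor variable, scan the objects sorted by decreasing
    value, maintain the running sum Gamma = sum lam_u c_u phi_b(x^u),
    and record the midpoint cutoffs where it is largest and smallest."""
    order = np.argsort(-x_col)
    s, best_max, best_min = 0.0, 0.0, 0.0
    b_plus = b_minus = None
    for i, u in enumerate(order[:-1]):
        s += lam[u] * c[u]                          # recursive Gamma update
        b = (x_col[u] + x_col[order[i + 1]]) / 2    # midpoint cutoff b_i
        if s > best_max:
            best_max, b_plus = s, b
        if s < best_min:
            best_min, b_minus = s, b
    return b_plus, best_max, b_minus, best_min

# Toy data: values 1, 2, 4 with classes -1, -1, +1 and unit duals
bp, gmax, bm, gmin = best_cutoffs(np.array([1.0, 2.0, 4.0]),
                                  np.array([-1.0, -1.0, 1.0]),
                                  np.array([1.0, 1.0, 1.0]))
```

With the sort done once as preprocessing, each scan is linear in |I|, matching the complexity analysis of Algorithm 1 below.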

Taking into account all these considerations we obtain, for a fixed predictor variable ℓ, given the dual values λ_u, the algorithm described below, which finds the cutoffs b⁺_ℓ and b⁻_ℓ for which Γ(φ_ℓb⁺) and Γ(φ_ℓb⁻) are respectively maximal and minimal.

Algorithm 1: Choosing two cutoffs for predictor variable ℓ.

Step 0. Sort the objects decreasingly by x_ℓ: x_ℓ^{u(i)}.
Step 1. Set i ← 1, sum ← 0, max ← 0, min ← 0, i⁺ ← i and i⁻ ← i.
Step 2. Set sum ← sum + λ_{u(i)} c_{u(i)}.
Step 3.
  Step 3.1. If x_ℓ^{u(i)} = x_ℓ^{u(i+1)}, then go to Step 4.
  Step 3.2. Otherwise, if for some t > 0, x_ℓ^{u(i−t−1)} > x_ℓ^{u(i−t)} = ... = x_ℓ^{u(i)} and c_{u(i)} ≠ c_{u(i−j)} for some j = 1, ..., t, then:
    If sum > max, then set max ← sum and i⁺ ← i.
    If sum < min, then set min ← sum and i⁻ ← i.
  Step 3.3. Otherwise:
    If c_{u(i)} = 1, c_{u(i+1)} = −1 and sum > max, then set max ← sum and i⁺ ← i.
    If c_{u(i)} = −1, c_{u(i+1)} = 1 and sum < min, then set min ← sum and i⁻ ← i.
Step 4. Set i ← i + 1. If i ≤ |I|, then go to Step 2. Otherwise STOP: b⁺_ℓ = b_{i⁺} and b⁻_ℓ = b_{i⁻}.

We may notice that Step 0 can be performed in a preprocessing stage, with running time O(|I| log |I|), for all calls to Algorithm 1 for predictor variable ℓ. Hence, considering such sorting as preprocessing, it follows that each call to Algorithm 1 runs in O(|I|) time, since Step 3 is performed at most once for each object in the training sample, while the calculations involved in this step can be performed in constant time.

3.3 Implementation details

The column generation algorithm has been implemented as follows. The initial set of features F₀ is built, as described in Section 3.1, using features whose cutoffs are the medians of the predictor variables. Then, Problem

(6-F₀) is solved for this initial set of features. The dual values of the optimal solution found are used to generate new features. In every step of the column generation algorithm, instead of generating just one feature (the one maximizing |Γ(φ)|), we generate two features for every predictor variable, given by the cutoffs b⁺_ℓ, b⁻_ℓ for which Γ(φ_ℓb) is respectively maximal and minimal. This is done using Algorithm 1, as described in Section 3.2. We do it for all the predictor variables, thus obtaining 2p features. Those generated features having |Γ(φ)| > 1 are added to F′ and the LP problem (6-F′) is solved. These steps are repeated until all the generated features have |Γ(φ)| ≤ 1, in which case we have found an optimal solution of Problem (6). A summary of this Column Generation Algorithm follows.

Algorithm for Column Generation.

Step 0. Set F₀ ← {φ_{1b_1}, φ_{2b_2}, ..., φ_{pb_p}}, where b_ℓ is the median of predictor variable ℓ, for ℓ = 1, 2, ..., p. Set F′ ← F₀.
Step 1. Solve Problem (6-F′). Let (ω*, β*) be its optimal solution, with dual values λ*_u, u ∈ I.
Step 2. For each ℓ = 1, 2, ..., p do:
  Step 2.1. Run Algorithm 1 to choose b⁺_ℓ and b⁻_ℓ.
  Step 2.2. If Γ(φ_ℓb⁺) > 1, then set F′ ← F′ ∪ {φ_ℓb⁺}.
  Step 2.3. If Γ(φ_ℓb⁻) < −1, then set F′ ← F′ ∪ {φ_ℓb⁻}.
Step 3. If F′ has been modified, then go to Step 1; otherwise STOP: we have found an optimal solution of Problem (6).

In Step 1, we need to solve the LP problem (6-F′). In our numerical results we have used CPLEX as the LP solver. Our numerical experience shows that the number of features used by our classifier, i.e., the ones that have been generated by the BSVM and have non-zero coefficient in the classifier, is usually rather large. In order to obtain

a simpler classifier, we proceed with a wrapper feature selection procedure in which features are recursively deleted. In this procedure, which has been successfully applied in the standard SVM, see [18], all the generated features with zero coefficient in the classifier, as well as the feature with non-zero coefficient having the smallest absolute value, are eliminated. Then, the coefficients are recomputed by the optimization of Problem (6). This elimination procedure is repeated until the number of features with non-zero coefficient is below a number given in advance. As reported in our numerical experience in Section 4.5, this procedure typically allows one to reduce the number of features used in the classifier at the expense of a mild loss in classification ability.

4 Numerical results

4.1 Databases, benchmarking methods and accuracy measure

In this section we illustrate the classification ability as well as the most desirable properties of BSVM. With this aim, a series of numerical experiments has been performed using databases publicly available from the UCI Machine Learning Repository [28]. Nine different databases were used, namely: the Sonar Database, called here sonar; the Cylinder Bands Database, called here bands; the Credit Screening Database, called here credit; the Ionosphere Database, called here ionosphere; the New Diagnostic Database, contained in the Wisconsin Breast Cancer Databases, called here wdbc; the Cleveland Clinic Foundation Database, called here cleveland; the Boston Housing Database, called here housing; the Pima Indians Diabetes Database, called here pima; and the BUPA Liver-disorders Database, called here bupa. Where missing values exist, as occurs for instance in bands and credit, objects with missing values were removed from the database. In databases such as bands and credit, some of the predictor variables were nominal.
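Such nominal predictors are binarized into one 0/1 variable per observed category, as detailed next. A minimal sketch of this encoding (the function name is illustrative):

```python
def binarize_nominal(values):
    """Replace one nominal predictor variable by a set of binary dummy
    variables: one column per observed category, equal to 1 when the
    original value matches that category and 0 otherwise."""
    categories = sorted(set(values))  # the possible values of the nominal variable
    return [[1 if v == cat else 0 for cat in categories] for v in values]
```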
Each of these predictor variables has been replaced by a set of binary variables in the following way: for every possible value x̂ of the original nominal predictor variable, a new binary variable is built, taking value one when x is equal to x̂ and zero otherwise. The housing database is a typical regression dataset, but it is often used as a classification dataset, where the class indicates whether the median value of houses exceeds $21,000. Finally,

for each database, the final number of objects and the number of predictor variables (all quantitative) can be found in the second column of each table. All databases used are of small to moderate size. Very large datasets do not seem to be as suitable for a crude implementation of BSVM, since column generation methods are known to be time-consuming. In practice, for databases of much larger size than those used in these experiments, it might be convenient either to select a subsample of the data with a manageable size, or to perform some dimensionality reduction technique, such as Principal Component Analysis, to obtain an appropriate number of variables.

In order to compare the BSVM classifier with other classifiers, we have tested the performance of three different benchmarking methods: Classification Trees, both with pruning (TreePr) and without pruning (TreeCr); the linear SVM; and a benchmarking method for the SVM with L1-norm regularization, namely the NLPSVM proposed by Fung and Mangasarian in [15]. The classification accuracy of each method is measured by the leave-one-out correctness, as done in [15]. Below we give a brief description of this measure. In the leave-one-out correctness, in what follows looc, for each object, the training sample is equal to the whole database except for this object, while the testing sample is equal to the excluded object. For each training sample, we construct a classifier which is then applied to the corresponding testing sample, returning a 1 if the only object in the testing sample is correctly classified and 0 otherwise. The looc is equal to the percentage of correctly classified objects. Since looc uses all but one object to build the classifier, the resulting classification can be expected to be close to that of the classifier trained with the complete dataset. Given a training sample, it remains to specify the way the SVM as well as the BSVM classifiers are constructed.
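The looc measure just described can be sketched as follows; `fit` is an illustrative stand-in for any of the classifier-training procedures being compared:

```python
def leave_one_out_correctness(X, y, fit):
    """looc: for each object, train on all remaining objects and test on
    the excluded one; return the percentage correctly classified.

    fit(X_train, y_train) is assumed to return a classifier, i.e. a
    function mapping one object to a predicted label.
    """
    hits = 0
    for i in range(len(X)):
        X_train = X[:i] + X[i + 1:]   # whole database except object i
        y_train = y[:i] + y[i + 1:]
        clf = fit(X_train, y_train)
        hits += int(clf(X[i]) == y[i])  # 1 if the excluded object is correct
    return 100.0 * hits / len(X)
```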
Problem (6), as well as the corresponding optimization problem for the linear SVM, contains a parameter that needs to be tuned, namely the regularization parameter C, which trades off misclassification errors in the training sample against the generalization error. As in [15], we limit the values of this parameter to the values 2^i with i = −12, −11, ..., 12. Parameter C is then chosen so that it minimizes the misclassification rate after performing 10-fold cross-validation on the training data (which, as said before, contains all but one object). It has been empirically observed, e.g. [7, 11, 15], that the choice of the parameter C may strongly influence the number of features selected. Hence, one might also have chosen C by balancing misclassification rates and the number of features

selected.

4.2 Classification ability

In Table 4 we report the looc of both Classification Trees, with and without pruning, the linear SVM with normalized data, NLPSVM and BSVM. Note that for NLPSVM we only have results on five databases, the ones reported by Fung and Mangasarian [15] in their paper. From Table 4, we can see that BSVM outperforms the three benchmarking methods in six out of the nine databases we consider in this paper. This happens in the ionosphere, the housing and the bupa databases. For the sonar, credit and wdbc databases, Fung and Mangasarian [15] do not report any results, but BSVM outperforms Classification Trees and the linear SVM. In these six databases, the increase in accuracy of BSVM with respect to the best reported accuracy ranges from 0.35% for the wdbc database up to 12.01% for the sonar database. For the bands database, we are able to outperform Classification Trees, but not the linear SVM, which has an accuracy of 71.12% while we have 70.40%. Similarly, for the cleveland database, we are able to outperform Classification Trees, but the linear SVM and NLPSVM have a better classification accuracy, 84.51% and 85.90% respectively, while that of BSVM is 81.44%. For the pima database, BSVM underperforms the rest of the methods, where the decrease in accuracy of BSVM with respect to the best reported accuracy is equal to 4.44%. In summary, we conclude that BSVM can be seen as a CART-like method (which may be very good for practitioners, since relevant predictor variables and their critical values are identified), with a classification power which is at least comparable to, and in some cases much better than, competing benchmarking techniques such as CART or other versions of SVM.

4.3 Interpretability

Classification Trees are widely used in applied fields as diverse as Medicine (diagnosis), Computer Science (data structures), Botany (classification), and Psychology (decision theory), mainly because they are easy to interpret.
Therefore, high accuracy is not the only desirable property of a classifier, but also its interpretability. BSVM largely improves the interpretability of standard SVMs through the use of queries of type (1). In this section we illustrate how our method takes us one step further towards interpretability via the use of the visualization tool

Table 4: looc for BSVM and the benchmarking methods. [Columns: size, TreePr, TreeCr, SVM, NLPSVM, BSVM; rows: sonar, bands, credit, ionosphere, wdbc, cleveland, housing, pima, bupa; the numeric entries were lost in transcription.]

proposed in Section 2.2. To do this, we have considered two databases with different characteristics, namely bupa and sonar, which have respectively 6 and 60 predictor variables. Since the focus here is on interpretability, and not generalization ability, the whole database has been used to build the classifier. Parameter C has been chosen by 10-fold cross-validation on the whole database. If SVM is performed, bupa uses all 6 predictor variables, i.e., all predictor variables have an associated non-zero weight. BSVM uses all 6 predictor variables as well, where we say that a predictor variable has been used if there exists at least one feature associated with this predictor variable which has a non-zero coefficient in the classifier. The total number of features used by BSVM is 62. A quick look at Figure 3, with information from bupa, allows us to say that predictor variables 3, 4 and 5 have the strongest influence in the classifier. It is particularly simple to analyze from the picture the influence of predictor variable 3. Indeed, we can see that predictor variable 3 presents an S-shaped form, with a linear behavior within the interval with endpoints 0 and 25 and constant outside. Moreover, the slopes are negative, meaning that the higher the value of the predictor variable (up to 25), the stronger the tendency to be classified in the negative class. On top of that, we see that whether predictor variable 3 takes a value of, say, 25, or, instead, a much higher value, turns out to be irrelevant for classification purposes.
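Curves like the ones read off Figure 3 can be recovered from the classifier itself. Assuming each feature is a threshold query of type (1), i.e. an indicator 1[x ≥ b] with coefficient w (an assumption here, since the exact form of (1) is defined earlier in the paper), the contribution of one predictor variable to the BSVM score is a step function of its value. A minimal sketch:

```python
def variable_contribution(cutoffs, weights, x):
    """Contribution of one predictor variable to the BSVM score at value x,
    assuming each of its features is a threshold query 1[x >= b] with
    coefficient w.  The result is the step function visualized in the
    per-variable plots: each cutoff b that x reaches adds its weight w."""
    return sum(w for b, w in zip(cutoffs, weights) if x >= b)
```

Evaluating this function on a grid of x values reproduces the piecewise-constant shape (S-shaped, logarithmic-like, etc.) discussed for each predictor variable.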

A similar behavior is detected for predictor variable 4. From 0 to 25, the influence is linear, but after that it is stable and finally jumps around the value 45. Now the slopes are positive, implying that the higher the value of predictor variable 4, the stronger the tendency to be classified in the positive class. Predictor variable 5 shows a logarithmic behavior, again with positive slopes and one critical value after which the function remains almost constant.

Figure 3: Graphical representation for bupa

In general, this example shows that the presence of many cutoffs, as in predictor variables 3, 4 and 5, is not an inconvenience for interpreting the classifier, whose behavior is easily detected via the visualization tool proposed. Now, take the example of the database sonar, which has 60 predictor variables. If SVM is performed, all 60 are used again, whereas BSVM uses 98 features, involving 45 different predictor variables. Hence, BSVM is able to perform feature selection (it is able to detect 25% of the predictor variables as irrelevant for classification) in a database where the linear SVM would consider all variables relevant.

Figure 4: Graphical representation for sonar

In Figure 4 the 45 used predictor variables are shown, renumbered for convenience. A simple look at this figure indicates that there are only four out of the 45 predictor variables with a strong influence in the classifier, namely predictor variables 11, 20, 26 and 36. Predictor variables in positions 11 and 20 have just one main critical value, whereas predictor variables 26 and 36 present an almost linear behavior in a clearly identified interval. We see again that the use of BSVM allows us to choose the relevant predictor variables, interpret the way such variables influence the classifier, and detect the critical values and intervals for each predictor variable.

4.4 Size of the BSVM classifier

Simplicity in the classifier is another desirable property for practitioners. Because both BSVM and Classification Trees are based on simple queries of type (1), the size of the resulting classifier is a good proxy for their complexity. In Table 5, we compare the size of the classifiers constructed by BSVM and Classification Trees. We report average results over all the objects, i.e., over all the testing samples. For both Classification Trees, with and without pruning, we report the total number of nodes in the final tree as well as the number of leaf nodes. For the BSVM classifier, we report the number of used predictor variables as well as the number of used features. In our opinion, the fairest comparison of complexity is to measure the number of nodes generated by Classification Trees against the number of features used by BSVM. As the results show, the size of the Classification Tree with pruning is always smaller. For the Classification Tree without pruning, in four out of the nine databases, namely credit, cleveland, pima and bupa, BSVM is smaller in size, while in a fifth one, housing, both classifiers are of similar size. This result in complexity for BSVM should be balanced against its better classification ability.
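The two BSVM size measures reported here (used features and used predictor variables) follow directly from the coefficients. A sketch, assuming each feature is identified by a (variable index, cutoff) pair:

```python
def bsvm_size(features, coefs):
    """Size of a BSVM classifier: (number of features with non-zero
    coefficient, number of distinct predictor variables they involve).
    A predictor variable counts as 'used' if at least one of its
    features has a non-zero coefficient."""
    used = [f for f, w in zip(features, coefs) if w != 0]
    return len(used), len({var for var, cutoff in used})
```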
From Table 4, BSVM outperforms Classification Trees with pruning except for the pima database, with an increase in accuracy which ranges from 1.53% for the credit database up to 18.75% for the sonar database. Similarly, BSVM outperforms Classification Trees without pruning except for the pima database, in which both methods have basically the same accuracy. The increase in accuracy of BSVM ranges from 2.19% for the housing database up to 25.00% for the sonar database. From the numerical experience reported, we can assert that BSVM outperforms Classification Trees in terms

Table 5: Size of the benchmarking classifiers. [Columns: size; TreePr leaf nodes; TreePr nodes; TreeCr leaf nodes; TreeCr nodes; BSVM predictor variables; BSVM features; rows: sonar, bands, credit, ionosphere, wdbc, cleveland, housing, pima, bupa; the numeric entries were lost in transcription.]

of its classification ability, at the expense of a higher complexity of the classifier, which is comparable to the complexity of Classification Trees if no pruning is performed.

4.5 Reducing the number of used features

The complexity of the classifiers obtained by BSVM is usually high because a large number of features may be present in the rule, as we have discussed in the previous section. With the aim of reducing the complexity of the classifier obtained by our procedure, we have performed several experiments implementing the wrapping procedure described in Section 3.3. Due to the computational burden, we present results on five databases. The selection of the databases has been done based on the running times, and it does not affect our conclusions. In Table 6, results on the databases sonar, ionosphere, cleveland, housing and bupa are shown. For convenience, we repeat some of the results already given in previous tables. The third and fourth columns report the best accuracy among the three benchmarking methods as well as that of BSVM (when no wrapper procedure is applied), both processed from the data in Table 4. The fifth column contains the size (measured as the number of used features) of the BSVM classifier, obtained from Table 5. The sixth column reports the accuracy of BSVM when the size of the classifier is reduced to a maximum of 30 features. From those results we can

conclude that the classification ability of BSVM slightly deteriorates. More specifically, the cleveland database, which was one in which the linear SVM outperformed BSVM, has essentially the same accuracy as before. This is not surprising, since the average number of features used by BSVM is equal to 32, as reported in Table 5. For the ionosphere and the housing databases, the looc remains almost the same, and therefore BSVM outperforms the best of the three benchmarking methods while using at most 30 features. In the bupa database, the decrease in accuracy is equal to 1.74%, but even after this deterioration, BSVM with the wrapping procedure still gives us the best accuracy. The conclusions for the sonar database are similar. Thus, even with this limit on the number of features, BSVM is still able to outperform the three benchmarking methods, except for the case of the cleveland database in which, as happens without wrapping, BSVM is outperformed by the linear SVM and the NLPSVM. We decided to further investigate those databases in which the wrapping of the BSVM classifiers did not affect the accuracy, namely the cleveland, ionosphere and housing databases. In the housing database, the wrapped BSVM classifier, as said above, has the best accuracy, while its size is half of that of the tree without pruning (which is the better of the two trees in terms of accuracy). For cleveland and ionosphere, the size was not competitive enough. Therefore, we further reduced the number of used features to a maximum of 20. The looc of cleveland and ionosphere was exactly the same as that obtained when the number of used features was limited to 30. This is especially remarkable for the ionosphere database, in which we can reduce the number of features to less than a third (at most 20), while the accuracy decreases only by 0.29%.

Table 6: looc of BSVM after reducing the number of features to a maximum of 30. [Columns: size; best reported looc; BSVM looc; BSVM features; BSVM + Wrapper looc; rows: sonar, ionosphere, cleveland, housing, bupa; the numeric entries were lost in transcription.]
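The wrapper elimination loop of Section 3.3, whose effect is evaluated above, can be sketched as follows; `refit` is an illustrative stand-in for re-solving Problem (6) restricted to the retained features:

```python
def wrapper_eliminate(features, coefs, refit, max_features):
    """Wrapper feature selection: repeatedly drop every feature with a
    zero coefficient plus the non-zero feature of smallest absolute
    coefficient, then refit, until at most max_features features carry
    a non-zero coefficient.

    refit(features) is assumed to re-solve the underlying LP on the
    given feature set and return the new coefficient list.
    """
    while sum(1 for w in coefs if w != 0) > max_features:
        weights = dict(zip(features, coefs))
        nonzero = [f for f in features if weights[f] != 0]
        weakest = min(nonzero, key=lambda f: abs(weights[f]))
        features = [f for f in nonzero if f != weakest]  # drop zeros + weakest
        coefs = refit(features)                          # recompute coefficients
    return features, coefs
```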

4.6 Presence of outliers

The classifier proposed in this paper is based on threshold functions, so it seems that extreme observations, with very high or very low values, will not have a strong influence on the classifier. To empirically test this conjecture, a series of experiments has been performed in which some outliers were artificially introduced. Every cell in the database was chosen to be an outlier with a given probability. The cells chosen were modified by adding ρ times the range of their predictor variable in the training sample. We present here results for ρ = 10. For other values of ρ we obtained similar results, which are therefore not reported. As in Section 4.5, and again due to the computational burden, we present results on five databases. In Table 7, results on the databases sonar, ionosphere, cleveland, housing and bupa are shown. We compare BSVM against Classification Trees and the linear SVM. We do not report results on the NLPSVM since outliers are not discussed in the paper of Fung and Mangasarian [15]. The classification ability of the linear SVM classifier dramatically worsens when outliers are introduced, and it clearly underperforms the other two methods. Classification Trees and BSVM are only slightly affected. We outperform Classification Trees in all databases except sonar. Therefore, our conjecture is supported.

Table 7: looc for BSVM and the benchmarking methods, on databases with outliers. [Columns: size, TreePr, TreeCr, SVM, BSVM; rows: sonar, ionosphere, cleveland, housing, bupa; the numeric entries were lost in transcription.]

5 Conclusions and further research

In this paper a new SVM-based tool for supervised classification has been proposed. The classifier gives insightful knowledge about the way the predictor variables influence the classification. Indeed, the nonlinear behavior of the


More information

Topology-aware Key Management Schemes for Wireless Multicast

Topology-aware Key Management Schemes for Wireless Multicast Topoogy-aware Key Management Schemes for Wireess Muticast Yan Sun, Wade Trappe,andK.J.RayLiu Department of Eectrica and Computer Engineering, University of Maryand, Coege Park Emai: ysun, kjriu@gue.umd.edu

More information

Endoscopic Motion Compensation of High Speed Videoendoscopy

Endoscopic Motion Compensation of High Speed Videoendoscopy Endoscopic Motion Compensation of High Speed Videoendoscopy Bharath avuri Department of Computer Science and Engineering, University of South Caroina, Coumbia, SC - 901. ravuri@cse.sc.edu Abstract. High

More information

Searching, Sorting & Analysis

Searching, Sorting & Analysis Searching, Sorting & Anaysis Unit 2 Chapter 8 CS 2308 Fa 2018 Ji Seaman 1 Definitions of Search and Sort Search: find a given item in an array, return the index of the item, or -1 if not found. Sort: rearrange

More information

On-Chip CNN Accelerator for Image Super-Resolution

On-Chip CNN Accelerator for Image Super-Resolution On-Chip CNN Acceerator for Image Super-Resoution Jung-Woo Chang and Suk-Ju Kang Dept. of Eectronic Engineering, Sogang University, Seou, South Korea {zwzang91, sjkang}@sogang.ac.kr ABSTRACT To impement

More information

Fuzzy Equivalence Relation Based Clustering and Its Use to Restructuring Websites Hyperlinks and Web Pages

Fuzzy Equivalence Relation Based Clustering and Its Use to Restructuring Websites Hyperlinks and Web Pages Fuzzy Equivaence Reation Based Custering and Its Use to Restructuring Websites Hyperinks and Web Pages Dimitris K. Kardaras,*, Xenia J. Mamakou, and Bi Karakostas 2 Business Informatics Laboratory, Dept.

More information

Sparse Representation based Face Recognition with Limited Labeled Samples

Sparse Representation based Face Recognition with Limited Labeled Samples Sparse Representation based Face Recognition with Limited Labeed Sampes Vijay Kumar, Anoop Namboodiri, C.V. Jawahar Center for Visua Information Technoogy, IIIT Hyderabad, India Abstract Sparse representations

More information

MACHINE learning techniques can, automatically,

MACHINE learning techniques can, automatically, Proceedings of Internationa Joint Conference on Neura Networks, Daas, Texas, USA, August 4-9, 203 High Leve Data Cassification Based on Network Entropy Fiipe Aves Neto and Liang Zhao Abstract Traditiona

More information

PHASE retrieval has been an active research topic for decades [1], [2]. The underlying goal is to estimate an unknown

PHASE retrieval has been an active research topic for decades [1], [2]. The underlying goal is to estimate an unknown DOLPHIn Dictionary Learning for Phase Retrieva Andreas M. Timann, Yonina C. Edar, Feow, IEEE, and Juien Maira, Member, IEEE arxiv:60.063v [math.oc] 3 Aug 06 Abstract We propose a new agorithm to earn a

More information

Utility-based Camera Assignment in a Video Network: A Game Theoretic Framework

Utility-based Camera Assignment in a Video Network: A Game Theoretic Framework This artice has been accepted for pubication in a future issue of this journa, but has not been fuy edited. Content may change prior to fina pubication. Y.LI AND B.BHANU CAMERA ASSIGNMENT: A GAME-THEORETIC

More information

Quality of Service Evaluations of Multicast Streaming Protocols *

Quality of Service Evaluations of Multicast Streaming Protocols * Quaity of Service Evauations of Muticast Streaming Protocos Haonan Tan Derek L. Eager Mary. Vernon Hongfei Guo omputer Sciences Department University of Wisconsin-Madison, USA {haonan, vernon, guo}@cs.wisc.edu

More information

A Novel Linear-Polynomial Kernel to Construct Support Vector Machines for Speech Recognition

A Novel Linear-Polynomial Kernel to Construct Support Vector Machines for Speech Recognition Journa of Computer Science 7 (7): 99-996, 20 ISSN 549-3636 20 Science Pubications A Nove Linear-Poynomia Kerne to Construct Support Vector Machines for Speech Recognition Bawant A. Sonkambe and 2 D.D.

More information

Arithmetic Coding. Prof. Ja-Ling Wu. Department of Computer Science and Information Engineering National Taiwan University

Arithmetic Coding. Prof. Ja-Ling Wu. Department of Computer Science and Information Engineering National Taiwan University Arithmetic Coding Prof. Ja-Ling Wu Department of Computer Science and Information Engineering Nationa Taiwan University F(X) Shannon-Fano-Eias Coding W..o.g. we can take X={,,,m}. Assume p()>0 for a. The

More information

A Novel Method for Early Software Quality Prediction Based on Support Vector Machine

A Novel Method for Early Software Quality Prediction Based on Support Vector Machine A Nove Method for Eary Software Quaity Prediction Based on Support Vector Machine Fei Xing 1,PingGuo 1;2, and Michae R. Lyu 2 1 Department of Computer Science Beijing Norma University, Beijing, 1875, China

More information

Formulation of Loss minimization Problem Using Genetic Algorithm and Line-Flow-based Equations

Formulation of Loss minimization Problem Using Genetic Algorithm and Line-Flow-based Equations Formuation of Loss minimization Probem Using Genetic Agorithm and Line-Fow-based Equations Sharanya Jaganathan, Student Member, IEEE, Arun Sekar, Senior Member, IEEE, and Wenzhong Gao, Senior member, IEEE

More information

Real-Time Feature Descriptor Matching via a Multi-Resolution Exhaustive Search Method

Real-Time Feature Descriptor Matching via a Multi-Resolution Exhaustive Search Method 297 Rea-Time Feature escriptor Matching via a Muti-Resoution Ehaustive Search Method Chi-Yi Tsai, An-Hung Tsao, and Chuan-Wei Wang epartment of Eectrica Engineering, Tamang University, New Taipei City,

More information

A Method for Calculating Term Similarity on Large Document Collections

A Method for Calculating Term Similarity on Large Document Collections $ A Method for Cacuating Term Simiarity on Large Document Coections Wofgang W Bein Schoo of Computer Science University of Nevada Las Vegas, NV 915-019 bein@csunvedu Jeffrey S Coombs and Kazem Taghva Information

More information

Joint disparity and motion eld estimation in. stereoscopic image sequences. Ioannis Patras, Nikos Alvertos and Georgios Tziritas y.

Joint disparity and motion eld estimation in. stereoscopic image sequences. Ioannis Patras, Nikos Alvertos and Georgios Tziritas y. FORTH-ICS / TR-157 December 1995 Joint disparity and motion ed estimation in stereoscopic image sequences Ioannis Patras, Nikos Avertos and Georgios Tziritas y Abstract This work aims at determining four

More information

Hiding secrete data in compressed images using histogram analysis

Hiding secrete data in compressed images using histogram analysis University of Woongong Research Onine University of Woongong in Dubai - Papers University of Woongong in Dubai 2 iding secrete data in compressed images using histogram anaysis Farhad Keissarian University

More information

Collinearity and Coplanarity Constraints for Structure from Motion

Collinearity and Coplanarity Constraints for Structure from Motion Coinearity and Copanarity Constraints for Structure from Motion Gang Liu 1, Reinhard Kette 2, and Bodo Rosenhahn 3 1 Institute of Information Sciences and Technoogy, Massey University, New Zeaand, Department

More information

Indexed Block Coordinate Descent for Large-Scale Linear Classification with Limited Memory

Indexed Block Coordinate Descent for Large-Scale Linear Classification with Limited Memory Indexed Bock Coordinate Descent for Large-Scae Linear Cassification with Limited Memory Ian E.H. Yen Chun-Fu Chang Nationa Taiwan University Nationa Taiwan University r0092207@csie.ntu.edu.tw r99725033@ntu.edu.tw

More information

file://j:\macmillancomputerpublishing\chapters\in073.html 3/22/01

file://j:\macmillancomputerpublishing\chapters\in073.html 3/22/01 Page 1 of 15 Chapter 9 Chapter 9: Deveoping the Logica Data Mode The information requirements and business rues provide the information to produce the entities, attributes, and reationships in ogica mode.

More information

Design of IP Networks with End-to. to- End Performance Guarantees

Design of IP Networks with End-to. to- End Performance Guarantees Design of IP Networks with End-to to- End Performance Guarantees Irena Atov and Richard J. Harris* ( Swinburne University of Technoogy & *Massey University) Presentation Outine Introduction Mutiservice

More information

A Robust Sign Language Recognition System with Sparsely Labeled Instances Using Wi-Fi Signals

A Robust Sign Language Recognition System with Sparsely Labeled Instances Using Wi-Fi Signals A Robust Sign Language Recognition System with Sparsey Labeed Instances Using Wi-Fi Signas Jiacheng Shang, Jie Wu Center for Networked Computing Dept. of Computer and Info. Sciences Tempe University Motivation

More information

Optimization and Application of Support Vector Machine Based on SVM Algorithm Parameters

Optimization and Application of Support Vector Machine Based on SVM Algorithm Parameters Optimization and Appication of Support Vector Machine Based on SVM Agorithm Parameters YAN Hui-feng 1, WANG Wei-feng 1, LIU Jie 2 1 ChongQing University of Posts and Teecom 400065, China 2 Schoo Of Civi

More information

Backing-up Fuzzy Control of a Truck-trailer Equipped with a Kingpin Sliding Mechanism

Backing-up Fuzzy Control of a Truck-trailer Equipped with a Kingpin Sliding Mechanism Backing-up Fuzzy Contro of a Truck-traier Equipped with a Kingpin Siding Mechanism G. Siamantas and S. Manesis Eectrica & Computer Engineering Dept., University of Patras, Patras, Greece gsiama@upatras.gr;stam.manesis@ece.upatras.gr

More information

Space-Time Trade-offs.

Space-Time Trade-offs. Space-Time Trade-offs. Chethan Kamath 03.07.2017 1 Motivation An important question in the study of computation is how to best use the registers in a CPU. In most cases, the amount of registers avaiabe

More information

Priority Queueing for Packets with Two Characteristics

Priority Queueing for Packets with Two Characteristics 1 Priority Queueing for Packets with Two Characteristics Pave Chuprikov, Sergey I. Nikoenko, Aex Davydow, Kiri Kogan Abstract Modern network eements are increasingy required to dea with heterogeneous traffic.

More information

Extended Node-Arc Formulation for the K-Edge-Disjoint Hop-Constrained Network Design Problem

Extended Node-Arc Formulation for the K-Edge-Disjoint Hop-Constrained Network Design Problem Extended Node-Arc Formuation for the K-Edge-Disjoint Hop-Constrained Network Design Probem Quentin Botton Université cathoique de Louvain, Louvain Schoo of Management, (Begique) botton@poms.uc.ac.be Bernard

More information

Further Optimization of the Decoding Method for Shortened Binary Cyclic Fire Code

Further Optimization of the Decoding Method for Shortened Binary Cyclic Fire Code Further Optimization of the Decoding Method for Shortened Binary Cycic Fire Code Ch. Nanda Kishore Heosoft (India) Private Limited 8-2-703, Road No-12 Banjara His, Hyderabad, INDIA Phone: +91-040-3378222

More information

FIRST BEZIER POINT (SS) R LE LE. φ LE FIRST BEZIER POINT (PS)

FIRST BEZIER POINT (SS) R LE LE. φ LE FIRST BEZIER POINT (PS) Singe- and Muti-Objective Airfoi Design Using Genetic Agorithms and Articia Inteigence A.P. Giotis K.C. Giannakogou y Nationa Technica University of Athens, Greece Abstract Transonic airfoi design probems

More information

MULTIGRID REDUCTION IN TIME FOR NONLINEAR PARABOLIC PROBLEMS: A CASE STUDY

MULTIGRID REDUCTION IN TIME FOR NONLINEAR PARABOLIC PROBLEMS: A CASE STUDY MULTIGRID REDUCTION IN TIME FOR NONLINEAR PARABOLIC PROBLEMS: A CASE STUDY R.D. FALGOUT, T.A. MANTEUFFEL, B. O NEILL, AND J.B. SCHRODER Abstract. The need for paraeism in the time dimension is being driven

More information

RDF Objects 1. Alex Barnell Information Infrastructure Laboratory HP Laboratories Bristol HPL November 27 th, 2002*

RDF Objects 1. Alex Barnell Information Infrastructure Laboratory HP Laboratories Bristol HPL November 27 th, 2002* RDF Objects 1 Aex Barne Information Infrastructure Laboratory HP Laboratories Bristo HPL-2002-315 November 27 th, 2002* E-mai: Andy_Seaborne@hp.hp.com RDF, semantic web, ontoogy, object-oriented datastructures

More information

Special Edition Using Microsoft Excel Selecting and Naming Cells and Ranges

Special Edition Using Microsoft Excel Selecting and Naming Cells and Ranges Specia Edition Using Microsoft Exce 2000 - Lesson 3 - Seecting and Naming Ces and.. Page 1 of 8 [Figures are not incuded in this sampe chapter] Specia Edition Using Microsoft Exce 2000-3 - Seecting and

More information

CLOUD RADIO ACCESS NETWORK WITH OPTIMIZED BASE-STATION CACHING

CLOUD RADIO ACCESS NETWORK WITH OPTIMIZED BASE-STATION CACHING CLOUD RADIO ACCESS NETWORK WITH OPTIMIZED BASE-STATION CACHING Binbin Dai and Wei Yu Ya-Feng Liu Department of Eectrica and Computer Engineering University of Toronto, Toronto ON, Canada M5S 3G4 Emais:

More information

Forgot to compute the new centroids (-1); error in centroid computations (-1); incorrect clustering results (-2 points); more than 2 errors: 0 points.

Forgot to compute the new centroids (-1); error in centroid computations (-1); incorrect clustering results (-2 points); more than 2 errors: 0 points. Probem 1 a. K means is ony capabe of discovering shapes that are convex poygons [1] Cannot discover X shape because X is not convex. [1] DBSCAN can discover X shape. [1] b. K-means is prototype based and

More information

Proceedings of the International Conference on Systolic Arrays, San Diego, California, U.S.A., May 25-27, 1988 AN EFFICIENT ASYNCHRONOUS MULTIPLIER!

Proceedings of the International Conference on Systolic Arrays, San Diego, California, U.S.A., May 25-27, 1988 AN EFFICIENT ASYNCHRONOUS MULTIPLIER! [1,2] have, in theory, revoutionized cryptography. Unfortunatey, athough offer many advantages over conventiona and authentication), such cock synchronization in this appication due to the arge operand

More information

The Big Picture WELCOME TO ESIGNAL

The Big Picture WELCOME TO ESIGNAL 2 The Big Picture HERE S SOME GOOD NEWS. You don t have to be a rocket scientist to harness the power of esigna. That s exciting because we re certain that most of you view your PC and esigna as toos for

More information

1682 IEEE TRANSACTIONS ON FUZZY SYSTEMS, VOL. 22, NO. 6, DECEMBER Backward Fuzzy Rule Interpolation

1682 IEEE TRANSACTIONS ON FUZZY SYSTEMS, VOL. 22, NO. 6, DECEMBER Backward Fuzzy Rule Interpolation 1682 IEEE TRANSACTIONS ON FUZZY SYSTEMS, VOL. 22, NO. 6, DECEMBER 2014 Bacward Fuzzy Rue Interpoation Shangzhu Jin, Ren Diao, Chai Que, Senior Member, IEEE, and Qiang Shen Abstract Fuzzy rue interpoation

More information

Evolving Role Definitions Through Permission Invocation Patterns

Evolving Role Definitions Through Permission Invocation Patterns Evoving Roe Definitions Through Permission Invocation Patterns Wen Zhang EECS Dept. Vanderbit University Nashvie, TN, USA wen.zhang.1@vanderbit.edu David Liebovitz Dept. of Medicine Northwestern University

More information

5940 IEEE TRANSACTIONS ON WIRELESS COMMUNICATIONS, VOL. 13, NO. 11, NOVEMBER 2014

5940 IEEE TRANSACTIONS ON WIRELESS COMMUNICATIONS, VOL. 13, NO. 11, NOVEMBER 2014 5940 IEEE TRANSACTIONS ON WIRELESS COMMUNICATIONS, VOL. 13, NO. 11, NOVEMBER 014 Topoogy-Transparent Scheduing in Mobie Ad Hoc Networks With Mutipe Packet Reception Capabiity Yiming Liu, Member, IEEE,

More information

Outline. Introduce yourself!! What is Machine Learning? What is CAP-5610 about? Class information and logistics

Outline. Introduce yourself!! What is Machine Learning? What is CAP-5610 about? Class information and logistics Outine Introduce yoursef!! What is Machine Learning? What is CAP-5610 about? Cass information and ogistics Lecture Notes for E Apaydın 2010 Introduction to Machine Learning 2e The MIT Press (V1.0) About

More information

Interpreting Individual Classifications of Hierarchical Networks

Interpreting Individual Classifications of Hierarchical Networks Interpreting Individua Cassifications of Hierarchica Networks Wi Landecker, Michae D. Thomure, Luís M. A. Bettencourt, Meanie Mitche, Garrett T. Kenyon, and Steven P. Brumby Department of Computer Science

More information

Replication of Virtual Network Functions: Optimizing Link Utilization and Resource Costs

Replication of Virtual Network Functions: Optimizing Link Utilization and Resource Costs Repication of Virtua Network Functions: Optimizing Link Utiization and Resource Costs Francisco Carpio, Wogang Bziuk and Admea Jukan Technische Universität Braunschweig, Germany Emai:{f.carpio, w.bziuk,

More information

InnerSpec: Technical Report

InnerSpec: Technical Report InnerSpec: Technica Report Fabrizio Guerrini, Aessandro Gnutti, Riccardo Leonardi Department of Information Engineering, University of Brescia Via Branze 38, 25123 Brescia, Itay {fabrizio.guerrini, a.gnutti006,

More information

Enumeration of MSO Queries on Strings with Constant Delay and Logarithmic Updates

Enumeration of MSO Queries on Strings with Constant Delay and Logarithmic Updates Enumeration of MSO Queries on Strings with Constant Deay and Logarithmic Updates ABSTRACT Matthias Niewerth University of Bayreuth We consider the enumeration of MSO queries over strings under updates.

More information

AUTOMATIC gender classification based on facial images

AUTOMATIC gender classification based on facial images SUBMITTED TO IEEE TRANSACTIONS ON NEURAL NETWORKS 1 Gender Cassification Using a Min-Max Moduar Support Vector Machine with Incorporating Prior Knowedge Hui-Cheng Lian and Bao-Liang Lu, Senior Member,

More information

Community-Aware Opportunistic Routing in Mobile Social Networks

Community-Aware Opportunistic Routing in Mobile Social Networks IEEE TRANSACTIONS ON COMPUTERS VOL:PP NO:99 YEAR 213 Community-Aware Opportunistic Routing in Mobie Socia Networks Mingjun Xiao, Member, IEEE Jie Wu, Feow, IEEE, and Liusheng Huang, Member, IEEE Abstract

More information

Research of Classification based on Deep Neural Network

Research of  Classification based on Deep Neural Network 2018 Internationa Conference on Sensor Network and Computer Engineering (ICSNCE 2018) Research of Emai Cassification based on Deep Neura Network Wang Yawen Schoo of Computer Science and Engineering Xi

More information

A study of comparative evaluation of methods for image processing using color features

A study of comparative evaluation of methods for image processing using color features A study of comparative evauation of methods for image processing using coor features FLORENTINA MAGDA ENESCU,CAZACU DUMITRU Department Eectronics, Computers and Eectrica Engineering University Pitești

More information

An Indian Journal FULL PAPER ABSTRACT KEYWORDS. Trade Science Inc.

An Indian Journal FULL PAPER ABSTRACT KEYWORDS. Trade Science Inc. [Type text] [Type text] [Type text] ISSN : 0974-7435 Voume 10 Issue 16 BioTechnoogy 014 An Indian Journa FULL PAPER BTAIJ, 10(16), 014 [999-9307] Study on prediction of type- fuzzy ogic power system based

More information

A NEW APPROACH FOR BLOCK BASED STEGANALYSIS USING A MULTI-CLASSIFIER

A NEW APPROACH FOR BLOCK BASED STEGANALYSIS USING A MULTI-CLASSIFIER Internationa Journa on Technica and Physica Probems of Engineering (IJTPE) Pubished by Internationa Organization of IOTPE ISSN 077-358 IJTPE Journa www.iotpe.com ijtpe@iotpe.com September 014 Issue 0 Voume

More information

On Finding the Best Partial Multicast Protection Tree under Dual-Homing Architecture

On Finding the Best Partial Multicast Protection Tree under Dual-Homing Architecture On inding the est Partia Muticast Protection Tree under ua-homing rchitecture Mei Yang, Jianping Wang, Xiangtong Qi, Yingtao Jiang epartment of ectrica and omputer ngineering, University of Nevada Las

More information

Development of a hybrid K-means-expectation maximization clustering algorithm

Development of a hybrid K-means-expectation maximization clustering algorithm Journa of Computations & Modeing, vo., no.4, 0, -3 ISSN: 79-765 (print, 79-8850 (onine Scienpress Ltd, 0 Deveopment of a hybrid K-means-expectation maximization custering agorithm Adigun Abimboa Adebisi,

More information

Fastest-Path Computation

Fastest-Path Computation Fastest-Path Computation DONGHUI ZHANG Coege of Computer & Information Science Northeastern University Synonyms fastest route; driving direction Definition In the United states, ony 9.% of the househods

More information

Efficient Convex Optimization for Minimal Partition Problems with Volume Constraints

Efficient Convex Optimization for Minimal Partition Problems with Volume Constraints Efficient Convex Optimization for Minima Partition Probems with Voume Constraints Thomas Möenhoff, Caudia Nieuwenhuis, Eno Töppe, and Danie Cremers Department of Computer Science, Technica University of

More information