Support Vector Machines for Business Applications


Brian C. Lovell and Christian J. Walder
The University of Queensland and Max Planck Institute, Tübingen
{lovell, walder}@itee.uq.edu.au

Introduction

Recent years have seen an explosive growth in computing power and data storage within business organisations. From a business perspective, this means that most companies now have massive archives of customer and product data, and more often than not these archives are far too large for human analysis. An obvious question has therefore arisen: how can one turn these immense corporate data archives to commercial advantage? To this end, a number of common applications have arisen, from predicting which products a customer is most likely to purchase, to designing the perfect product based on responses to questionnaires. The theory and development of these processes has grown into a discipline of its own, known as Data Mining, which draws heavily on the related fields of Machine Learning, Pattern Recognition, and Mathematical Statistics. The Data Mining discipline is still developing, however, and a great deal of suboptimal and ad hoc analysis is being done. This is partly due to the complexity of the problems, but is also due to the vast number of available techniques. Even the most fundamental task in Data Mining, that of inductive inference, or making predictions based on examples, can be tackled by a great many different techniques. Some of these techniques are very difficult to tailor to a specific problem and require highly skilled human design; others are more generic in application and can be treated more like the proverbial black box. One particularly generic and powerful method, known as the Support Vector Machine (SVM), has proven to be both easy to apply and capable of producing results that range from good to excellent in comparison to other methods. While application of the method is relatively straightforward, the practitioner can still benefit greatly from a basic understanding of the underlying machinery. Unfortunately, most available tutorials on SVMs do require a very solid mathematical background, so we have written this chapter to make SVMs accessible to a wider community.

This chapter comprises a basic background on the problem of induction, followed by the main sections. In the first section we introduce the concepts and equations on which the SVM is based in an intuitive manner, and identify the relationship between the SVM and some of the other popular analysis methods. In the second section we survey some interesting applications of SVMs to practical real-world problems. Finally, the third section provides a set of guidelines and rules of thumb for applying the tool, with a pedagogical example that is designed to demonstrate everything that the SVM newcomer requires in order to immediately apply the tool to a specific problem domain. The chapter is intended as a brief introduction to the field that introduces the ideas and methodologies, as well as a hands-on introduction to freely available software, allowing the reader to rapidly determine the effectiveness of SVMs for their specific domain.

Background

SVMs are most commonly applied to the problem of inductive inference, or making predictions based on previously seen examples. To illustrate what is meant by this, let us consider the data presented in Tables 1 and 2. We see here an example of the problem of inductive inference, more specifically that of supervised learning. In supervised learning we are given a set of input data along with their corresponding labels. The input data comprises a number of examples about which several attributes are known (in this case, age, income, etc.). The label indicates which class a particular example belongs to.
In the example above, the label tells us whether or not a given person has a broadband internet connection to their home. This is called a binary classification problem because there are only two possible classes. In the second table, we are given the attributes for a different set of consumers, for whom the true class labels are unknown. Our goal

is to infer from the first table the most likely labels for the people in the second table, that is, whether or not they have a broadband internet connection to their home.

Age | Income | Years of Education | Gender | Broadband Home Internet Connection?
30 | $56,000 / yr | 16 | male | Yes
50 | $60,000 / yr | 12 | female | Yes
16 | $2,000 / yr | 11 | male | No
35 | $30,000 / yr | 12 | male | No

Table 1: training or labelled set. The dataset in Table 1 contains demographic information for four randomly selected people. These people were surveyed to determine whether or not they had a broadband home internet connection.

Age | Income | Years of Education | Gender | Broadband Home Internet Connection?
40 | $48,000 / yr | 17 | male | unknown
29 | $60,000 / yr | 18 | female | unknown

Table 2: unlabelled set. The dataset in Table 2 contains demographic information for people who may or may not be good candidates for broadband internet connection advertising. The question arising is, "Which of these people is likely to have a broadband internet connection at home?"

In the field of data mining, we often refer to these sets by the terms test set, training set, validation set, and so on, but there is some confusion in the literature about the exact definitions of these terms. For this reason we avoid this nomenclature, with the exception of the term training set. For our purposes, the training set shall be all that is given to us in order to infer some general correspondence between the input data and labels. We will refer to the set of data for which we would like to predict the labels as the unlabelled set. A schematic diagram for the above process is provided in Figure 1.

In the case of the SVM classifier (and most other learning algorithms for that matter), there are a number of parameters which must be chosen by the user. These parameters control various aspects of the algorithm, and in order to yield the best possible performance, it is necessary to make the right choices. The process of choosing parameters that yield good performance is often referred to as model selection. In order to understand this process, we have to consider what it is that we are aiming for in terms of classifier performance. From the point of view of the practitioner, the hope is that the algorithm will be able to make true predictions about unseen cases. Here the true values we are trying to predict are the class labels of the unlabelled data. From this perspective it is natural to measure the performance of a classifier by the probability of its misclassifying an unseen example. It is here that things become somewhat less straightforward, however, due to the following dilemma. In order to estimate the probability of a misclassification, we need to know the true underlying probability distributions of the data that we are dealing with. If we actually knew this, however, we wouldn't have needed to perform inductive inference in the first place! Indeed, knowledge of the true probability distributions allows us to calculate the theoretically best possible decision rule, corresponding to the so-called Bayesian classifier (Duda et al., 2001). In recent years, a great deal of research effort has gone into developing sophisticated theories that make statements about the probability of a particular classifier making errors on new unlabelled cases; these statements are typically referred to as generalization bounds. It turns out, however, that the research has a long way to go, and in practice one is usually forced to determine the parameters of the learning algorithm by much more pragmatic means.
Perhaps the most straightforward of these methods involves estimating the probability of misclassification using a set of real data for which the class labels are known; to do this one simply compares the labels predicted by the learning algorithm to the true known labels. The estimate of

misclassification probability is then given by the number of examples for which the algorithm made an error (that is, predicted a label other than the true known label) divided by the number of examples which were tested in this manner.

Labelled Training Data -> Learning Algorithm (SVM) -> Decision Rule; Unlabelled Data + Decision Rule -> Predicted Labels

Figure 1: The inductive inference process in schematic form. Based on a particular training set of examples with labels, the learning algorithm constructs a decision rule which can then be used to predict the labels of new unlabelled examples.

Some care needs to be taken, however, in how this procedure is conducted. A common pitfall for the inexperienced analyst involves making this estimate of misclassification probability using the training set from which the decision rule itself was inferred. The problem with this approach is easily seen from the following simple decision rule example. Imagine a decision rule that makes label predictions by way of the following procedure (sometimes referred to as the notebook classifier):

The notebook classifier decision rule: We wish to predict the label of the example X. If X is present in the training set, make the prediction that its label is the same as the corresponding label in the training set. Otherwise, toss a coin to determine the label.

For this method, while the estimated probability of misclassification on the training set will be zero, it is clear that for most real world problems the algorithm will perform no better than tossing a coin! The notebook classifier is a commonly used example to illustrate the phenomenon of overfitting, which refers to situations where the decision rule fits the training set well, but does not generalize well to previously unseen cases. What we are really aiming for is a decision rule that generalizes as well as possible, even if this means that it cannot perform as well on the training set.

Cross-validation: So it seems that we need a more sophisticated means of estimating the generalization performance of our inferred decision rules, if we are to successfully guide the model selection process. Fortunately there is a more effective means of estimating the generalization performance based on the training set. This procedure, which is referred to as cross-validation, or more specifically n-fold cross-validation, proceeds in the following manner (Duda et al., 2001), and is sketched in code after the list:

1. Split the training set into n equally sized and disjoint subsets (partitions), numbered 1 to n.
2. Construct a decision function using a conglomerate of all the data from subsets 2 to n.
3. Use this decision function to predict the labels of the examples in subset number 1.
4. Compare the predicted labels to the known labels in subset number 1.
5. Repeat steps 1 through 4 a further (n-1) times, each time testing on a different subset, and always excluding that subset from training.
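To make the procedure concrete, here is a minimal Python sketch of n-fold cross-validation. It is our own illustration rather than part of any SVM package; the train and predict arguments stand in for whatever learning algorithm is being evaluated.

import numpy as np

def cross_validation_error(X, y, train, predict, n=10):
    # Estimate the misclassification probability by n-fold cross-validation.
    # X, y: numpy arrays of training examples and labels; train(X, y)
    # returns a decision rule; predict(rule, X) returns predicted labels.
    folds = np.array_split(np.random.permutation(len(y)), n)
    errors = 0
    for i in range(n):
        test_idx = folds[i]
        train_idx = np.hstack([folds[j] for j in range(n) if j != i])
        rule = train(X[train_idx], y[train_idx])
        errors += np.sum(predict(rule, X[test_idx]) != y[test_idx])
    # Total errors over all n held-out subsets, divided by the total
    # number of training examples, as described in the text.
    return errors / len(y)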

Having done this, we can once again divide the number of misclassifications by the total number of training examples to get an estimate of the true generalization performance. The point is that since we have avoided checking the performance of the classifier on examples that the algorithm had already seen, we have calculated a far more meaningful measure of classifier quality. Commonly used values for n are 3 and 10, leading to so-called 3-fold and 10-fold cross-validation.

Now, while it is nice to have some idea of how well our decision function will generalize, we really want to use this measure to guide the model selection process. If there are only, say, two parameters to choose for the classification algorithm, it is common to simply evaluate the generalization performance (using cross-validation) for all combinations of the two parameters, over some reasonable range. As the number of parameters increases, however, this soon becomes infeasible due to the excessive number of parameter combinations. Fortunately one can often get away with just two parameters for the SVM algorithm, making this relatively straightforward model selection methodology widely applicable and quite effective on real world problems.

Now that we have a basic understanding of what supervised learning algorithms can do, as well as roughly how they should be used and evaluated, it is time to take a peek under the hood of one in particular, the SVM. While the main underlying idea of the SVM is quite intuitive, it will be necessary to delve into some mathematical details in order to better appreciate why the method has been so successful.

Main Thrust of the Chapter

The SVM is a supervised learning algorithm that infers from a set of labelled examples a function that takes new examples as input, and produces predicted labels as output. As such, the output of the algorithm is a mathematical function that is defined on the space from which our examples are taken, and takes on one of two values at all points in the space, corresponding to the two class labels that are considered in binary classification. One of the theoretically appealing things about the SVM is that the key underlying idea is in fact extremely simple. Indeed, the standard derivation of the SVM algorithm begins with possibly the simplest class of decision functions: linear ones. To illustrate what is meant by this, Figure 2 depicts three linear decision functions that happen to be correctly classifying some simple 2D training sets.

Figure 2: A simple 2D classification task, to separate the black dots from the circles. Three feasible but different linear decision functions are depicted, whereby the classifier predicts that any new samples in the gray region are black dots, and those in the white region are circles. Which is the best decision function and why?

Linear decision functions consist of a decision boundary that is a hyperplane (a line in 2D, a plane in 3D, etc.) separating the two different regions of the space. Such a decision function can be expressed by a mathematical function of an input vector x, the value of which is the predicted label for x (either +1 or -1). The linear classifier can therefore be written as

g(x) = sgn(f(x)), where f(x) = <w, x> + b.
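As a concrete illustration, the following short Python sketch evaluates such a linear decision function. The particular w and b are arbitrary values of our own choosing (a hypothetical 2D boundary), not parameters derived from the chapter's data.

import numpy as np

def f(x, w, b):
    # f(x) = <w, x> + b
    return np.dot(w, x) + b

def g(x, w, b):
    # g(x) = sgn(f(x)): the predicted class label, +1 or -1
    return 1 if f(x, w, b) >= 0 else -1

# A hypothetical 2D decision boundary x1 + x2 - 1 = 0:
w, b = np.array([1.0, 1.0]), -1.0
print(g(np.array([2.0, 2.0]), w, b))  # prints 1
print(g(np.array([0.0, 0.0]), w, b))  # prints -1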

In this way we have parameterized the function by the weight vector w and the scalar b. The notation <w, x> denotes the inner or scalar product of w and x, defined by

<w, x> = Σ_{i=1}^{d} w_i x_i,

where d is the dimensionality, and w_i is the i-th component of w, where w is of the form (w_1, w_2, ..., w_d). Having formalized our decision function, we can now formalize the problem which the linear SVM addresses: given a training set of vectors x_1, x_2, ..., x_n with corresponding class membership labels y_1, y_2, ..., y_n that take on the values +1 or -1, choose parameters w and b of the linear decision function that generalizes well to unseen examples.

Perceptron Algorithm: Probably the first algorithm to tackle this problem was the Perceptron algorithm (Rosenblatt, 1958). The Perceptron algorithm simply used an iterative procedure to incrementally adjust w and b until the decision boundary was able to separate the two classes of the training data. As such, the Perceptron algorithm would give no preference between the three feasible solutions in Figure 2; any one of the three could result. This seems rather unsatisfactory, as most people would agree that the rightmost decision function is the superior one. Moreover, this intuitive preference can be justified in various ways, for example by considering the effect of measurement noise on the data: small perturbations of the data could easily change the predicted labels of the training set in the first two examples, whereas the third is far more robust in this respect. In order to make use of this intuition, it is necessary to state more precisely why we prefer the third classifier: we prefer decision boundaries that not only correctly separate the two classes in the training set, but lie as far from the training examples as possible. This simple intuition is all that is required to lead to the linear SVM classifier, which chooses the hyperplane that separates the two classes with the maximum margin. The margin is just the distance from the hyperplane to the nearest training example.

Before we continue, it is important to note that while the above example shows a 2D dataset, which can be conveniently represented by points in a plane, in fact we will typically be dealing with higher dimensional data. For example, the example data in Table 1 could easily be represented as points in five dimensions as follows:

x_1 = [30 56000 16 1 0]; y_1 = +1
x_2 = [50 60000 12 0 1]; y_2 = +1
x_3 = [16 2000 11 1 0]; y_3 = -1
x_4 = [35 30000 12 1 0]; y_4 = -1

Actually, there are some design decisions to be made by the practitioner when translating attributes into the above type of numerical format, which we shall touch on in the next section. For example, here we have mapped the male/female column into two new numerical indicators. For now, just note that we have also listed the labels y_1 to y_4, which take on the value +1 or -1, in order to indicate the class membership of the examples (that is, y_i = 1 means that x_i has a broadband home internet connection).

In order to easily find the maximum margin hyperplane for a given data set using a computer, we would like to write the task as an optimization problem. Optimization problems consist of an objective function, which we typically want to find the maximum or minimum value of, along with a set of constraints, which are conditions that we must satisfy while finding the best value of the objective function. A simple example is to minimize x^2 subject to the constraint that x >= 2. The solution to this example optimization problem happens to be x = 2. To see how to compactly formulate the maximum margin hyperplane problem as an optimization problem, take a look at Figure 3.

Figure 3: Linearly separable classification problem, showing the decision surface <w, x> + b = 0 and the parallel hyperplanes <w, x> + b = -1 and <w, x> + b = +1.

The figure shows some 2D data drawn as circles and black dots, having labels +1 and -1 respectively. As before, we have parameterized our decision function by the vector w and the scalar b, which means that, in order for our hyperplane to correctly separate the two classes, we need to satisfy the following constraints:

<w, x_i> + b > 0, for all i such that y_i = +1
<w, x_i> + b < 0, for all i such that y_i = -1

To aid understanding, the first constraint above may be expressed as: <w, x_i> + b must be greater than zero whenever y_i is equal to one. It is easy to check that the two sets of constraints above can be combined into the following single set of constraints:

y_i(<w, x_i> + b) > 0, i = 1 ... n

However, meeting this constraint is not enough to separate the two classes optimally; we need to do so with the maximum margin. An easy way to see how to do this is the following. First note that we have plotted the decision surface as a solid line in Figure 3, which is the set satisfying

<w, x> + b = 0.    (1)

The set of constraints that we have so far is equivalent to saying that these data must lie on the correct side (according to class label) of this decision surface. Next, notice that we have also plotted as dotted lines two other hyperplanes, which are the hyperplanes where the function <w, x> + b is equal to -1 (on the lower left) and +1 (on the upper right). Now, in order to find the maximum margin hyperplane, we can see intuitively that we should keep the dotted lines parallel and equidistant to the decision surface, and maximize their distance from one another, while satisfying the constraint that the data lie on the correct side of the dotted lines associated with that class. In mathematical form, the final clause of this sentence (the constraints) can be written as

y_i(<w, x_i> + b) >= 1, i = 1 ... n.

All we need to do then is to maximize the distance between the dotted lines subject to the constraint set above. To aid in understanding, one commonly used analogy is to think of these data points as nails partially driven into a board. Now we successively place thicker and thicker pieces of timber between the nails representing the two classes until the timber just fits; the centreline of the timber now represents the

optimal decision boundary. It turns out that this distance is equal to 2/sqrt(<w, w>), and since maximizing 2/sqrt(<w, w>) is the same as minimizing <w, w>, we end up with the following optimization problem, the solution of which yields the parameters of the maximum margin hyperplane. The term ½ in the objective function below can be ignored, as it simply makes things neater from a certain mathematical point of view:

min_{w,b} ½ <w, w>
such that y_i(<w, x_i> + b) >= 1 for all i = 1, 2, ..., m

The above problem is quite simple, but it encompasses the key philosophy behind the SVM: maximum margin data separation. If the above problem had been scribbled onto a cocktail napkin and handed to the pioneers of the Perceptron back in the 1960s, then the Machine Learning discipline would probably have progressed a great deal further than it has to date! We cannot relax just yet, however, as there is a major problem with the above method: what if these data are not linearly separable? That is, what if it is not possible to find a hyperplane that separates all of the examples in each class from all of the examples in the other class? In this case there would be no combination of w and b that could ever satisfy the set of constraints above, let alone do so with maximum margin. This situation is depicted in Figure 4, where it becomes apparent that we need to soften the constraint that these data lie on the correct side of the +1 and -1 hyperplanes; that is, we need to allow some, but not too many, data points to violate these constraints by a preferably small amount. This alternative approach turns out to be very useful not only for datasets that are not linearly separable, but also, and perhaps more importantly, in allowing improvements in generalization.

Figure 4: Linearly inseparable classification problem, again showing the hyperplanes <w, x> + b = -1, 0 and +1.

Usually when we start talking about vague concepts such as "not too many" and "a small amount", we need to introduce a parameter into our problem, which we can vary in order to balance between various goals and objectives. The following optimization problem, known as the 1-norm soft margin SVM, is probably the one most commonly used to balance the goals of maximum margin separation and correctness of the training set classification. It achieves various trade-offs between these goals for various values of the parameter C, which is usually chosen by cross-validation on a training set, as discussed earlier.

min_{w,b,ξ} ½ <w, w> + C Σ_{i=1}^{m} ξ_i
such that y_i(<w, x_i> + b) >= 1 - ξ_i    (2)
          ξ_i >= 0 for all i = 1, 2, ..., m.

The easiest way to understand this problem is by comparison with the previous formulation that we gave, which is known as the hard margin SVM, in reference to the fact that the margin constraints are hard, and are not allowed to be violated at all. First note that we have an extra term in our objective function that is equal to the sum of the ξ_i's. Since we are minimizing the objective function, it is safe to say that we are looking for a solution that keeps the ξ_i values small. Moreover, since the ξ_i term is added to the original objective function after multiplication by C, we can say that as C increases we care less about the size of the margin, and more about keeping the ξ_i's small. The true meaning of the ξ_i's can only be seen from the constraint set, however. Here, instead of constraining the function y_i(<w, x_i> + b) to be greater than 1, we constrain it to be greater than 1 - ξ_i. That is, we allow the point x_i to violate the margin by an amount ξ_i. Thus, the value of C trades off how large a margin we would prefer against how many of the training set examples violate this margin (and by how much).

So far, we have seen that the maximally separating hyperplane is a good starting point for linear classifiers. We have also seen how to write down the problem of finding this hyperplane as an optimization problem consisting of an objective function and constraints. After this we saw a way of dealing with data that is not linearly separable, by allowing some training points to violate the margin somewhat. The next limitation we will address is in the form of solutions available. So far we have only considered very simple linear classifiers, and as such we can only expect to succeed in very simple cases. Fortunately it is possible to extend the previous analysis in an intuitive manner to more complex classes of decision functions. The basic idea is illustrated in Figure 5.

Figure 5: An example of a mapping Φ to a feature space in which the data become linearly separable.

The example in Figure 5 shows on the left a data set that is not linearly separable. In fact, the data is not even close to linearly separable, and one could never do very well with a linear classifier for the training set given. In spite of this, it is easy for a person to look at the data and suggest a simple elliptical decision surface that ought to generalize well. Imagine, however, that there is a mapping Φ which transforms these data to some new, possibly higher dimensional space, in which the data is linearly separable. If we knew Φ then we could map all of the data to the feature space, and perform normal SVM classification in this space. If we can achieve a reasonable margin in the feature space, then we can expect a reasonably good generalization performance, in spite of a possible increase in dimensionality.
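To make the mapping idea concrete, here is a small Python sketch of one classical choice of Φ. Both the mapping and the sample vectors are our own hypothetical illustration rather than anything prescribed by the chapter.

import numpy as np

def phi(x):
    # Degree-2 monomial feature map for 2D input:
    # (x1, x2) -> (x1^2, sqrt(2)*x1*x2, x2^2).
    # An elliptical boundary a*x1^2 + b*x2^2 = c in the input space
    # becomes a plane (a linear boundary) in this 3D feature space.
    return np.array([x[0]**2, np.sqrt(2.0) * x[0] * x[1], x[1]**2])

# The inner product in the feature space can be computed without ever
# applying phi, since <phi(x), phi(y)> = <x, y>^2; this is a first
# glimpse of the kernel trick discussed next.
x, y = np.array([1.0, 2.0]), np.array([3.0, 0.5])
print(np.dot(phi(x), phi(y)))  # 16.0
print(np.dot(x, y) ** 2)       # 16.0, identical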

This last observation, that a reasonable margin in the feature space should yield good generalization in spite of a possible increase in dimensionality, is far deeper than it may first appear. For some time, Machine Learning researchers have feared the curse of dimensionality, a name given to the widely-held belief that if the dimension of the feature space is large in comparison to the number of training examples, then it is difficult to find a classifier that generalizes well. It took the theory of Vapnik and Chervonenkis (Vapnik, 1998) to put a serious dent in this belief. In a nutshell, they formalized and proved exactly this observation, and thereby paved the way for methods that map data to very high dimensional feature spaces where they then perform maximum margin linear separation. Actually, a tricky practical issue also had to be overcome before the approach could flourish: if we map to a feature space that is too high in dimension, then it will become impossible to perform the required calculations (that is, to find w and b); that is, it would take too long on a computer. It is not obvious how to overcome this difficulty, and it took until 1995 for researchers to notice the following elegant and quite remarkable possibility.

The usual way of proceeding is to take the original soft margin SVM, and convert it to an equivalent Lagrangian dual problem. The derivation is not especially enlightening, however, so we will skip to the result, which is that the solution to the following dual or equivalent problem gives us the solution to the original SVM problem. The dual problem, which is to be solved by varying the α_i's, is as follows (Vapnik, 1998):

min_α ½ Σ_{i,j=1}^{m} y_i y_j α_i α_j <x_i, x_j> - Σ_{i=1}^{m} α_i
such that Σ_{i=1}^{m} y_i α_i = 0    (3)
          0 <= α_i <= C, i = 1, 2, ..., m.

The α_i's are known as the dual variables, and they define the corresponding primal variables w and b by the following relationships:

w = Σ_{i=1}^{m} α_i y_i x_i
α_i (y_i(<w, x_i> + b) - 1) = 0

Note that by the linearity of the inner product (that is, the fact that <a+b, c> = <a, c> + <b, c>), we can write the decision function in the following form:

f(x) = <w, x> + b = Σ_{i=1}^{m} α_i y_i <x_i, x> + b

Recall that it is the sign of f(x) that gives us the predicted label of x. A quite remarkable thing is that in order to determine the optimal values of the α_i's and b, and also to calculate f(x), we do not actually need to know any of the training or testing vectors; we only need to know the scalar value of their inner product with one another. This can be seen by noting that the vectors only ever appear by way of their inner product with one another. The elegant thing is that rather than explicitly mapping all of the data to the new space and performing linear SVM classification, we can operate in the original space, provided we can find a so-called kernel function k(.,.) which is equal to the inner product of the mapped data. That is, we need a kernel function k(.,.) satisfying:

k(x, y) = <Φ(x), Φ(y)>

In practice, the practitioner need not concern him or herself with the exact nature of the mapping Φ. In fact, it is usually more intuitive to concentrate on properties of the kernel functions anyway, and the prevailing wisdom states that the function k(x, y) should be a good measure of the similarity of the vectors x and y. Moreover, not just any function k can be used; it must also satisfy certain technical conditions, known as

Mercer's conditions. This procedure of implicitly mapping the data via the function k is typically called the kernel trick, and has found wide application after being popularized by the success of the SVM (Schölkopf & Smola, 1998). The two most widely used kernel functions are the following.

Polynomial Kernel: k(x, y) = (<x, y> + 1)^d

The polynomial kernel is valid for all positive integers d. The kernel corresponds to a mapping Φ that computes all degree-d monomial terms of the individual vector components of the original space. The polynomial kernel has been used to great effect on digit recognition problems.

Gaussian Kernel: k(x, y) = exp(-||x - y||^2 / (2σ^2))

The Gaussian kernel, which is similar to the Gaussian probability distribution from which it gets its name, is one of a group of kernel functions known as radial basis functions (RBFs). RBFs are kernel functions that depend only on the geometric distance between x and y. The kernel is valid for all non-zero values of the kernel width σ, and corresponds to a mapping Φ into an infinite dimensional, and therefore somewhat less interpretable, feature space. Nonetheless, the Gaussian is probably the most useful and commonly used kernel function.

Now that we know the form of the SVM dual problem, as well as how to generalize it using kernel functions, the only thing left to see is how to actually solve the optimization problem, in order to find the α_i's. The optimization problem is one example of a class of problems known as Quadratic Programs (QPs). The term program, as it is used here, is somewhat antiquated and in fact means a mathematical optimization problem, not a computer program. Fortunately there are many computer programs that can solve QPs such as this; these computer programs are known as Quadratic Program (QP) solvers. An important factor to note here is that there is considerable structure in the QP that arises in SVM training, and while it would be possible to use almost any QP solver on the problem, there are a number of sophisticated software packages tailored to take advantage of this structure, in order to decrease the requirements of computer time and memory. One property of the SVM QP that can be taken advantage of is its sparsity: the fact that in many cases, at the optimal solution, most of the α_i's will equal zero. It is interesting to see what this means in terms of the decision function f(x): those vectors with α_i = 0 do not actually enter into the final form of the solution. In fact, it can be shown that one can remove all of the corresponding training vectors before training even commences, and get the same final result. The vectors with non-zero values of α_i are known as the Support Vectors, a term that has its root in the theory of convex sets. As it turns out, the Support Vectors are the hard cases: the training examples that are most difficult to classify correctly (and that lie closest to the decision boundary). In our previous practical analogy, the support vectors are literally the nails that support the block of wood!

Now that we have an understanding of the machinery underlying it, we will soon proceed to solve a practical problem using the freely available SVM software package libsvm, written by Chang and Lin.

Relationship to Other Methods

We noted in the introduction that the SVM is an especially easy to use method that typically produces good results even when treated as a processing black box. This is indeed the case, and to better understand this it is necessary to consider what is involved in using some other methods. We will focus in detail on the extremely prevalent class of algorithms known as artificial neural networks, but first we provide a brief overview of some other related methods.
Linear Discriminant Analysis (Hand, 1981; Weiss & Kulikowski, 1991) is widely used in business and marketing applications, can work in multiple dimensions, and is well-grounded in the mathematical literature. It nonetheless has two major drawbacks. The first is that linear discriminant functions, as the

name implies, can only successfully classify linearly separable data, thus limiting their application to relatively simple problems. If we extend the method to higher order functions such as quadratic discriminators, generalization suffers. Indeed, such degradation in performance with increased numbers of parameters corroborated the belief in the curse of dimensionality finally disproved by Vapnik (Vapnik, 1998). The second problem is simply that generalization performance on real problems is usually significantly worse than either decision trees or artificial neural networks (e.g., see the comparisons in Weiss & Kulikowski, 1991).

Decision Trees are commonly used in classification problems with categorical data (Quinlan, 1993), although it is possible to derive categorical data from ordinal data by introducing binary valued features such as "age is less than 20". Decision trees construct a tree of questions to be asked of a given example in order to determine the class membership by way of class labels associated with leaf nodes of the decision tree. This approach is simple and has the advantage that it produces decision rules that can be interpreted by a human as well as a machine; however, the SVM is more appropriate for complex problems with many ordinal features.

Nearest Neighbour methods are very simple and therefore suitable for extremely large data sets. These methods simply search the training data set for the k examples that are closest (by the criterion of Euclidean distance, for example) to the given input. The most common class label associated with these k examples is then assigned to the given query example. When the training and testing computation times are not so important, however, the discriminative nature of the SVM will usually yield significantly improved results.

Artificial Neural Network (ANN) algorithms have become extremely widespread in the area of data mining and pattern recognition (Bishop, 1995). These methods were originally inspired by the neural connections that comprise the human brain, the basic idea being that in the human brain many simple units (neurons) are connected together in a manner that produces complex, powerful behaviour. To simulate this phenomenon, neurons are modeled by units whose output y is related to the input x by some activation function g by the relationship y = g(x). These units are then connected together in various architectures, whereby the output of a given unit is multiplied by some constant weight and then fed forward as input to the next unit, possibly in summation with a similarly scaled output from some other unit(s). Ultimately all of the inputs are fed to one single final unit, the output of which is typically compared to some threshold in order to produce a class membership prediction. This is a very general framework that provides many avenues for customisation:

- Choice of activation function
- Choice of network architecture (number of units and the manner in which they are connected)
- Choice of the weights by which the output of a given unit is multiplied to produce the input of another unit
- Algorithm for determining the weights given the training data

In comparison to the SVM, both the strength and weakness of the ANN lies in its flexibility: typically a considerable amount of experimentation is required in order to achieve good results, and moreover, since the optimization problems that are typically used to find the weights of the chosen network are non-convex, many numerical tricks are required in order to find a good solution to the problem. Nonetheless, given sufficient skill and effort in engineering a solution with an ANN, one can often tailor the algorithm very specifically to a given problem in a process that is likely to eventually yield superior results to the SVM.
Having said this, there are cases, for example in handwritten digit recognition, in which SVM performance is on par with highly engineered ANN solutions (DeCoste, 2002). By way of comparison, the SVM approach is likely to yield a very good solution with far less effort than is required for a good ANN solution.

Practical Application of the SVM

As we have seen, the theoretical underpinnings of the SVM are very compelling, especially since the algorithm involves very little trial and error, and is easy to apply. Nonetheless, the usefulness of the

algorithm can only be borne out by practical experience, and so in this sub-section we survey a number of studies that use the SVM algorithm in practical problems. Before we mention such specific cases, we first identify the general characteristics of those problems to which the SVM is particularly well suited. One key consideration is that in its basic form the SVM has limited capacity to deal with large training data sets. Typically the SVM can only handle problems of up to approximately 100,000 training examples before approximations must be made in order to yield reasonable training times. Having said this, the training times depend only marginally on the dimensionality of the features; it is often said that SVMs can defy the so-called curse of dimensionality, the difficulty that often occurs when the dimensionality is high in comparison with the number of training samples. It should also be noted that, with the exception of the string kernel case, the SVM is most naturally suited to ordinal features rather than categorical ones, although as we shall see in the next section, it is possible to handle both cases.

Before turning to some specific business and marketing cases, it is important to note that some of the most successful applications of the SVM have been in image processing, in particular handwritten digit recognition (DeCoste, 2002) and face recognition (Osuna, 1997). In these areas, a common theme of the application of SVMs is not so much increased accuracy, but rather a greatly simplified design and implementation process. As such, when considering popular areas such as face recognition, it is important to understand that very simple SVM implementations are often competitive with the complex and highly tuned systems that were developed over a longer period prior to the advent of the SVM. Another interesting application area for SVMs is on string data, for example in text mining or the analysis of genome sequences (Joachims, 2002). The key reason for the great success of SVMs in this area is the existence of string kernels; these are kernel functions defined on strings that elegantly avoid many of the combinatoric problems associated with other methods, whilst having the advantage over generative probability models such as the Hidden Markov Model that the SVM learns to discriminate between the two classes via the maximisation of the margin. The practical use of text categorisation systems is extremely widespread, with most large enterprises relying on such analysis of their customer interactions in order to provide automated response systems that are nonetheless tailored to the individual. Furthermore, the SVM has been successfully used in a study of text and data mining for direct marketing applications (Cheung, 2003), in which relatively limited customer information was automatically supplemented with the preferences of a larger population, in order to determine effective marketing strategies. To conclude this survey, note that while the majority of marketing teams do not publish their methodologies, since many of the important data mining software packages (for example Oracle Data Mining and SAS Enterprise Miner) have incorporated the SVM, it is likely that there is a significant and increasing use of the SVM in industrial settings.

A Worked Example

In "A Practical Guide to Support Vector Classification" (Chang et al., 2003), a simple procedure for applying the SVM classifier is provided for inexperienced practitioners. The procedure is intended to be easy to follow, quick, and capable of producing reasonable generalization performance. The steps they advocate can be paraphrased as follows:

1. Convert the data to the input format of the SVM software you intend to use
2. Scale the individual components of the data into a common range
3. Use the Gaussian kernel function
4. Use cross-validation to find the best parameters C (margin softness) and σ (Gaussian width)
5. With the values of C and σ determined by cross-validation, retrain on the entire training set

The above tasks are easily accomplished using, for example, the free libsvm software package, as we will demonstrate in detail in this section. We have chosen this tool because it is free, easy to use and of a high quality, although the majority of our discussion applies equally well to other SVM software packages, wherein the same steps will necessarily be required. The point of this chapter, then, is to illustrate in a concrete fashion the process of applying an SVM. The libsvm software package with which we do this consists of three main command-line tools, as well as a helper script in the python language. The basic functions of these tools are summarized here:

svm-scale: This simple program simply rescales the data as in step 2 above. The input is a data set, and the output is a new data set that has been rescaled.

grid.py: This script can be used to assist in the cross-validation parameter selection process. It simply calculates a cross-validation estimate of generalization performance for a range of values of C and the Gaussian kernel width σ. The results are then illustrated as a two dimensional contour plot of generalization performance versus C and σ.

svm-train: This is the most sophisticated part of libsvm, which takes as input a file containing the training examples, and outputs a model file: a list of Support Vectors and corresponding α_i's, as well as the bias term and kernel parameters. The program also takes a number of input arguments that are used to specify the type of kernel function and margin softness parameter. As well as some more technical options, the program also has the option (used by grid.py) of computing an n-fold cross-validation estimate of the generalization performance.

svm-predict: Having run svm-train, svm-predict can be used to predict the class labels of a new set of unseen data. The input to the program is a model file and a dataset, and the output is a file containing the predicted labels, sgn(f(x)), for the given dataset.

Detailed instructions for installing the software can be found on the libsvm website. We will now demonstrate these three steps using the example dataset at the beginning of the chapter, in order to predict which customers are likely to be home broadband internet users. To make the procedure clear, we will give details of all the required input files (containing the labelled and unlabelled data), the output file (containing the learnt decision function), and the command line statements required to produce and process these files.

Preprocessing (svm-scale)

All of our discussions so far have considered the input training examples as numerical vectors. In fact this is not necessary, as it is possible to define kernels on discrete quantities, but we will not worry about that here. Instead, notice that in our example training data in Table 1, each training example has several individual features, both numerical and categorical. There are three numerical features (age, income and years of education), and one categorical feature (gender). In constructing training vectors for the SVM from these training examples, the numerical features are directly assigned to individual components of the training vectors. Categorical features, however, must be dealt with slightly differently. Typically, if the categorical feature belongs to one of m different categories (here the categories are male and female, so that our m is 2), then we map this single categorical feature into m individual binary valued numerical features. A training example whose categorical feature corresponds to category n (the ordering is irrelevant) will have all zero values for these m binary valued numerical features, except for the n-th one, which we set to 1. This is a simple way of indicating that the features are not related to one another by relative magnitudes. Once again, the data in Table 1 would thus be represented by these four vectors, with corresponding class labels y_i:

x_1 = [30 56000 16 1 0]; y_1 = +1
x_2 = [50 60000 12 0 1]; y_2 = +1
x_3 = [16 2000 11 1 0]; y_3 = -1
x_4 = [35 30000 12 1 0]; y_4 = -1

In order to use the libsvm software, we must represent the above data in a file that is formatted according to the libsvm standard. The format is very simple, and best described with an example. The above data would be represented by a single file that looks like this:

+1 1:30 2:56000 3:16 4:1
+1 1:50 2:60000 3:12 5:1
-1 1:16 2:2000 3:11 4:1
-1 1:35 2:30000 3:12 4:1

Each line of the training file represents one training example, and begins with the class label (+1 or -1), followed by a space and then an arbitrary number of index:value pairs. There should be no spaces between the colons and the indexes or values, only between the individual index:value pairs. Note that if a feature takes on the value zero, it need not be included as an index:value pair, allowing data with many zeros to be represented by a smaller file.

Now that we have our training data file, we are ready to run svm-scale. As we discovered in the first section, ultimately all our data will be represented by the kernel function evaluation between individual vectors. The purpose of this program is to make some very simple adjustments to the data in order for it to be better represented by these kernel evaluations. In accordance with step 3 above we will be using the Gaussian kernel, which can be expressed by

k(x, y) = exp(-||x - y||^2 / (2σ^2)) = exp(-Σ_{d=1}^{D} (x_d - y_d)^2 / (2σ^2)).

Here we have written out the D individual components of the vectors x and y, which correspond to the (D = 5) individual numerical features of our training examples. It is clear from the summation on the right that if a given feature has a much larger range of variation than another feature, it will dominate the sum, and the feature with the smaller range of variation will essentially be ignored. For our example, this means that the income feature, which has the largest range of values, will receive an undue amount of attention from the SVM algorithm. Clearly this is a problem, and while the Machine Learning community has yet to give the final word on how to deal with it in an optimal manner, many practitioners simply rescale the data so that each feature falls in the same range, for example between zero and one. This can be easily achieved using svm-scale, which takes as input a data file in libsvm format, and outputs both a rescaled data file and a set of scaling parameters. The rescaled data should then be used to train the model, and the same scaling (as stored in the scaling parameters file) should be applied to any unlabelled data before applying the learnt decision function. The format of the command is as follows:

svm-scale -s scaling_parameters_file training_data_file > rescaled_training_data_file

In order to apply the same scaling transformation to the unlabelled set, svm-scale must be executed again with the following arguments:

svm-scale -r scaling_parameters_file unlabelled_data_file > rescaled_unlabelled_data_file

Here the file unlabelled_data_file contains the unlabelled data, and has an identical format to the training file, aside from the fact that the labels +1 and -1 are optional, and will be ignored if they exist.

Parameter selection (grid.py)

The parameter selection process is without doubt the most difficult step in applying an SVM. Fortunately the simplistic method we prescribe here is not only relatively straightforward, but also usually quite effective. Our goal is to choose the C and σ values for our SVM. Following the previous discussion about parameter or model selection, our basic method of tackling this problem is to make a cross-validation estimate of the generalization performance for a range of values of C and σ, and examine the results visually. Given the outcome of this step, we may either choose values for C and σ, or conduct a further search based on the results we have already seen. The following command will construct a plot of the cross-validation performance for our scaled dataset:

grid.py -log2c -5,15,1 -log2g -20,0,1 -v 10 rescaled_training_data_file

The search ranges of the C and σ values are specified by the -log2c and -log2g options, respectively. In both cases the numbers that follow take the form begin,end,stepsize, to indicate that we wish to search logarithmically using the values 2^begin, 2^(begin+stepsize), ..., 2^end. Specifying -v n indicates that we wish to do n-fold cross-validation (in the above command, n = 10), and the last argument to the command indicates which data file to use. The output of the program is a contour plot, saved in an image file of the name rescaled_training_data_file.png. The output image for the above command is depicted in Figure 6.

Figure 6: A contour plot of cross-validation accuracy for a given training set, as produced by grid.py.

The contour plot indicates with various line colours the cross-validation accuracy of the classifier, as a function of C and σ; this is measured as a percentage of correct classifications, so that we prefer large values. Note that σ is in fact referred to as gamma by the libsvm software; the variable name is of course arbitrary, but we choose to refer to it as σ for compatibility with the majority of SVM literature. Given such a contour plot of performance, as stated previously, there are generally two conclusions to be reached:

1. The optimal (or at least satisfactory) values of C and σ are contained within the plotting region.
2. It is necessary to continue the search for C and σ, over a different range than that of the plot, in order to achieve better performance.

In the first case, we can read the optimal values of C and σ from the output of the program on the command window. Each line of output indicates the best parameters that have been encountered up to that point, and so we can take the last line as our operating parameters. In the second case, we must choose which direction to continue the search. From Figure 6 it seems feasible to keep searching over a range of smaller σ and larger C. This whole procedure is usually quite effective, however there can be no denying that the search for the correct parameters is still something of a black art. Given this, we invite interested readers to experiment for themselves, in order to get a basic feel for how things behave. For our purposes, we shall assume that a good choice is C = 2^-2 = 0.25 and σ = 2^-2 = 0.25, and proceed to the next step.

Training (svm-train)

As we have seen, the cross-validation process does not use all of the data for training at each iteration; some of the training data must be excluded for evaluation purposes. For this reason it is still necessary to do a final training run on the entire training set, using the parameters that we have determined in the previous parameter selection process. The command to train is:

svm-train -g 0.25 -c 0.25 rescaled_training_data_file model_file

This command sets C and σ using the -c and -g switches, respectively. The other two arguments are the name of the training data file, and finally the file name for the learnt decision function or model.

Prediction (svm-predict)

The final step is very simple. Now that we have a decision function, stored in the file model_file, as well as a properly scaled set of unlabelled data, we can compute the predicted label of each of the examples in the set of unlabelled data by executing the command:

svm-predict rescaled_unlabelled_data_file model_file predictions_file

After executing this command, we will have a new file of the name predictions_file. Each line of this file will contain either +1 or -1, depending on the predicted label of the corresponding entry in the file rescaled_unlabelled_data_file.

Summary

The general problem of induction is an important one, and can add a great deal of value to large corporate databases. Analysing this data is not always simple, however, and it is fortunate that methods that are both easy to apply and effective have finally arisen, such as the Support Vector Machine. The basic concept underlying the Support Vector Machine is quite simple and intuitive, and involves separating our two classes of data from one another using a linear function that is the maximum possible distance from the data. This basic idea becomes a powerful learning algorithm when one overcomes the issue of linear separability (by allowing margin errors), and implicitly maps to more descriptive feature spaces (through the use of kernel functions). Moreover, there exist free and easy to use software packages, such as libsvm, that allow one to obtain good results with a minimum of effort. The continued uptake of these tools is inevitable, but is often impeded by the poor results obtained by novices. We hope that this chapter is a useful aid in avoiding this problem, as it quickly affords a basic understanding of both the theory and practice of the SVM.
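For convenience, the complete sequence of commands from the worked example is collected below. The file names are the placeholder names used throughout this chapter, and the grid.py search ranges are those suggested in the parameter selection step above.

# 1. Rescale the training data and save the scaling parameters
svm-scale -s scaling_parameters_file training_data_file > rescaled_training_data_file
# 2. Apply the same scaling to the unlabelled data
svm-scale -r scaling_parameters_file unlabelled_data_file > rescaled_unlabelled_data_file
# 3. Search for good values of C and sigma by 10-fold cross-validation
grid.py -log2c -5,15,1 -log2g -20,0,1 -v 10 rescaled_training_data_file
# 4. Retrain on the full training set with the chosen parameters
svm-train -g 0.25 -c 0.25 rescaled_training_data_file model_file
# 5. Predict labels for the unlabelled data
svm-predict rescaled_unlabelled_data_file model_file predictions_file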


More information

R s s f. m y s. SPH3UW Unit 7.3 Spherical Concave Mirrors Page 1 of 12. Notes

R s s f. m y s. SPH3UW Unit 7.3 Spherical Concave Mirrors Page 1 of 12. Notes SPH3UW Unt 7.3 Sphercal Concave Mrrors Page 1 of 1 Notes Physcs Tool box Concave Mrror If the reflectng surface takes place on the nner surface of the sphercal shape so that the centre of the mrror bulges

More information

CS246: Mining Massive Datasets Jure Leskovec, Stanford University

CS246: Mining Massive Datasets Jure Leskovec, Stanford University CS46: Mnng Massve Datasets Jure Leskovec, Stanford Unversty http://cs46.stanford.edu /19/013 Jure Leskovec, Stanford CS46: Mnng Massve Datasets, http://cs46.stanford.edu Perceptron: y = sgn( x Ho to fnd

More information

Intro. Iterators. 1. Access

Intro. Iterators. 1. Access Intro Ths mornng I d lke to talk a lttle bt about s and s. We wll start out wth smlartes and dfferences, then we wll see how to draw them n envronment dagrams, and we wll fnsh wth some examples. Happy

More information

Wishing you all a Total Quality New Year!

Wishing you all a Total Quality New Year! Total Qualty Management and Sx Sgma Post Graduate Program 214-15 Sesson 4 Vnay Kumar Kalakband Assstant Professor Operatons & Systems Area 1 Wshng you all a Total Qualty New Year! Hope you acheve Sx sgma

More information

Optimizing Document Scoring for Query Retrieval

Optimizing Document Scoring for Query Retrieval Optmzng Document Scorng for Query Retreval Brent Ellwen baellwe@cs.stanford.edu Abstract The goal of ths project was to automate the process of tunng a document query engne. Specfcally, I used machne learnng

More information

NAG Fortran Library Chapter Introduction. G10 Smoothing in Statistics

NAG Fortran Library Chapter Introduction. G10 Smoothing in Statistics Introducton G10 NAG Fortran Lbrary Chapter Introducton G10 Smoothng n Statstcs Contents 1 Scope of the Chapter... 2 2 Background to the Problems... 2 2.1 Smoothng Methods... 2 2.2 Smoothng Splnes and Regresson

More information

Classifying Acoustic Transient Signals Using Artificial Intelligence

Classifying Acoustic Transient Signals Using Artificial Intelligence Classfyng Acoustc Transent Sgnals Usng Artfcal Intellgence Steve Sutton, Unversty of North Carolna At Wlmngton (suttons@charter.net) Greg Huff, Unversty of North Carolna At Wlmngton (jgh7476@uncwl.edu)

More information

Hermite Splines in Lie Groups as Products of Geodesics

Hermite Splines in Lie Groups as Products of Geodesics Hermte Splnes n Le Groups as Products of Geodescs Ethan Eade Updated May 28, 2017 1 Introducton 1.1 Goal Ths document defnes a curve n the Le group G parametrzed by tme and by structural parameters n the

More information

Determining the Optimal Bandwidth Based on Multi-criterion Fusion

Determining the Optimal Bandwidth Based on Multi-criterion Fusion Proceedngs of 01 4th Internatonal Conference on Machne Learnng and Computng IPCSIT vol. 5 (01) (01) IACSIT Press, Sngapore Determnng the Optmal Bandwdth Based on Mult-crteron Fuson Ha-L Lang 1+, Xan-Mn

More information

Classifier Selection Based on Data Complexity Measures *

Classifier Selection Based on Data Complexity Measures * Classfer Selecton Based on Data Complexty Measures * Edth Hernández-Reyes, J.A. Carrasco-Ochoa, and J.Fco. Martínez-Trndad Natonal Insttute for Astrophyscs, Optcs and Electroncs, Lus Enrque Erro No.1 Sta.

More information

A mathematical programming approach to the analysis, design and scheduling of offshore oilfields

A mathematical programming approach to the analysis, design and scheduling of offshore oilfields 17 th European Symposum on Computer Aded Process Engneerng ESCAPE17 V. Plesu and P.S. Agach (Edtors) 2007 Elsever B.V. All rghts reserved. 1 A mathematcal programmng approach to the analyss, desgn and

More information

CS 534: Computer Vision Model Fitting

CS 534: Computer Vision Model Fitting CS 534: Computer Vson Model Fttng Sprng 004 Ahmed Elgammal Dept of Computer Scence CS 534 Model Fttng - 1 Outlnes Model fttng s mportant Least-squares fttng Maxmum lkelhood estmaton MAP estmaton Robust

More information

GSLM Operations Research II Fall 13/14

GSLM Operations Research II Fall 13/14 GSLM 58 Operatons Research II Fall /4 6. Separable Programmng Consder a general NLP mn f(x) s.t. g j (x) b j j =. m. Defnton 6.. The NLP s a separable program f ts objectve functon and all constrants are

More information

Machine Learning: Algorithms and Applications

Machine Learning: Algorithms and Applications 14/05/1 Machne Learnng: Algorthms and Applcatons Florano Zn Free Unversty of Bozen-Bolzano Faculty of Computer Scence Academc Year 011-01 Lecture 10: 14 May 01 Unsupervsed Learnng cont Sldes courtesy of

More information

TN348: Openlab Module - Colocalization

TN348: Openlab Module - Colocalization TN348: Openlab Module - Colocalzaton Topc The Colocalzaton module provdes the faclty to vsualze and quantfy colocalzaton between pars of mages. The Colocalzaton wndow contans a prevew of the two mages

More information

Parallel matrix-vector multiplication

Parallel matrix-vector multiplication Appendx A Parallel matrx-vector multplcaton The reduced transton matrx of the three-dmensonal cage model for gel electrophoress, descrbed n secton 3.2, becomes excessvely large for polymer lengths more

More information

Unsupervised Learning

Unsupervised Learning Pattern Recognton Lecture 8 Outlne Introducton Unsupervsed Learnng Parametrc VS Non-Parametrc Approach Mxture of Denstes Maxmum-Lkelhood Estmates Clusterng Prof. Danel Yeung School of Computer Scence and

More information

An Entropy-Based Approach to Integrated Information Needs Assessment

An Entropy-Based Approach to Integrated Information Needs Assessment Dstrbuton Statement A: Approved for publc release; dstrbuton s unlmted. An Entropy-Based Approach to ntegrated nformaton Needs Assessment June 8, 2004 Wllam J. Farrell Lockheed Martn Advanced Technology

More information

Three supervised learning methods on pen digits character recognition dataset

Three supervised learning methods on pen digits character recognition dataset Three supervsed learnng methods on pen dgts character recognton dataset Chrs Flezach Department of Computer Scence and Engneerng Unversty of Calforna, San Dego San Dego, CA 92093 cflezac@cs.ucsd.edu Satoru

More information

BOOSTING CLASSIFICATION ACCURACY WITH SAMPLES CHOSEN FROM A VALIDATION SET

BOOSTING CLASSIFICATION ACCURACY WITH SAMPLES CHOSEN FROM A VALIDATION SET 1 BOOSTING CLASSIFICATION ACCURACY WITH SAMPLES CHOSEN FROM A VALIDATION SET TZU-CHENG CHUANG School of Electrcal and Computer Engneerng, Purdue Unversty, West Lafayette, Indana 47907 SAUL B. GELFAND School

More information

Brave New World Pseudocode Reference

Brave New World Pseudocode Reference Brave New World Pseudocode Reference Pseudocode s a way to descrbe how to accomplsh tasks usng basc steps lke those a computer mght perform. In ths week s lab, you'll see how a form of pseudocode can be

More information

Review of approximation techniques

Review of approximation techniques CHAPTER 2 Revew of appromaton technques 2. Introducton Optmzaton problems n engneerng desgn are characterzed by the followng assocated features: the objectve functon and constrants are mplct functons evaluated

More information

Tsinghua University at TAC 2009: Summarizing Multi-documents by Information Distance

Tsinghua University at TAC 2009: Summarizing Multi-documents by Information Distance Tsnghua Unversty at TAC 2009: Summarzng Mult-documents by Informaton Dstance Chong Long, Mnle Huang, Xaoyan Zhu State Key Laboratory of Intellgent Technology and Systems, Tsnghua Natonal Laboratory for

More information

A Modified Median Filter for the Removal of Impulse Noise Based on the Support Vector Machines

A Modified Median Filter for the Removal of Impulse Noise Based on the Support Vector Machines A Modfed Medan Flter for the Removal of Impulse Nose Based on the Support Vector Machnes H. GOMEZ-MORENO, S. MALDONADO-BASCON, F. LOPEZ-FERRERAS, M. UTRILLA- MANSO AND P. GIL-JIMENEZ Departamento de Teoría

More information

Meta-heuristics for Multidimensional Knapsack Problems

Meta-heuristics for Multidimensional Knapsack Problems 2012 4th Internatonal Conference on Computer Research and Development IPCSIT vol.39 (2012) (2012) IACSIT Press, Sngapore Meta-heurstcs for Multdmensonal Knapsack Problems Zhbao Man + Computer Scence Department,

More information

Problem Definitions and Evaluation Criteria for Computational Expensive Optimization

Problem Definitions and Evaluation Criteria for Computational Expensive Optimization Problem efntons and Evaluaton Crtera for Computatonal Expensve Optmzaton B. Lu 1, Q. Chen and Q. Zhang 3, J. J. Lang 4, P. N. Suganthan, B. Y. Qu 6 1 epartment of Computng, Glyndwr Unversty, UK Faclty

More information

AP PHYSICS B 2008 SCORING GUIDELINES

AP PHYSICS B 2008 SCORING GUIDELINES AP PHYSICS B 2008 SCORING GUIDELINES General Notes About 2008 AP Physcs Scorng Gudelnes 1. The solutons contan the most common method of solvng the free-response questons and the allocaton of ponts for

More information

NUMERICAL SOLVING OPTIMAL CONTROL PROBLEMS BY THE METHOD OF VARIATIONS

NUMERICAL SOLVING OPTIMAL CONTROL PROBLEMS BY THE METHOD OF VARIATIONS ARPN Journal of Engneerng and Appled Scences 006-017 Asan Research Publshng Network (ARPN). All rghts reserved. NUMERICAL SOLVING OPTIMAL CONTROL PROBLEMS BY THE METHOD OF VARIATIONS Igor Grgoryev, Svetlana

More information

Helsinki University Of Technology, Systems Analysis Laboratory Mat Independent research projects in applied mathematics (3 cr)

Helsinki University Of Technology, Systems Analysis Laboratory Mat Independent research projects in applied mathematics (3 cr) Helsnk Unversty Of Technology, Systems Analyss Laboratory Mat-2.08 Independent research projects n appled mathematcs (3 cr) "! #$&% Antt Laukkanen 506 R ajlaukka@cc.hut.f 2 Introducton...3 2 Multattrbute

More information

SVM-based Learning for Multiple Model Estimation

SVM-based Learning for Multiple Model Estimation SVM-based Learnng for Multple Model Estmaton Vladmr Cherkassky and Yunqan Ma Department of Electrcal and Computer Engneerng Unversty of Mnnesota Mnneapols, MN 55455 {cherkass,myq}@ece.umn.edu Abstract:

More information

SLAM Summer School 2006 Practical 2: SLAM using Monocular Vision

SLAM Summer School 2006 Practical 2: SLAM using Monocular Vision SLAM Summer School 2006 Practcal 2: SLAM usng Monocular Vson Javer Cvera, Unversty of Zaragoza Andrew J. Davson, Imperal College London J.M.M Montel, Unversty of Zaragoza. josemar@unzar.es, jcvera@unzar.es,

More information

An Application of the Dulmage-Mendelsohn Decomposition to Sparse Null Space Bases of Full Row Rank Matrices

An Application of the Dulmage-Mendelsohn Decomposition to Sparse Null Space Bases of Full Row Rank Matrices Internatonal Mathematcal Forum, Vol 7, 2012, no 52, 2549-2554 An Applcaton of the Dulmage-Mendelsohn Decomposton to Sparse Null Space Bases of Full Row Rank Matrces Mostafa Khorramzadeh Department of Mathematcal

More information

Chapter 6 Programmng the fnte element method Inow turn to the man subject of ths book: The mplementaton of the fnte element algorthm n computer programs. In order to make my dscusson as straghtforward

More information

Performance Evaluation of Information Retrieval Systems

Performance Evaluation of Information Retrieval Systems Why System Evaluaton? Performance Evaluaton of Informaton Retreval Systems Many sldes n ths secton are adapted from Prof. Joydeep Ghosh (UT ECE) who n turn adapted them from Prof. Dk Lee (Unv. of Scence

More information

A New Approach For the Ranking of Fuzzy Sets With Different Heights

A New Approach For the Ranking of Fuzzy Sets With Different Heights New pproach For the ankng of Fuzzy Sets Wth Dfferent Heghts Pushpnder Sngh School of Mathematcs Computer pplcatons Thapar Unversty, Patala-7 00 Inda pushpndersnl@gmalcom STCT ankng of fuzzy sets plays

More information

Discriminative classifiers for object classification. Last time

Discriminative classifiers for object classification. Last time Dscrmnatve classfers for object classfcaton Thursday, Nov 12 Krsten Grauman UT Austn Last tme Supervsed classfcaton Loss and rsk, kbayes rule Skn color detecton example Sldng ndo detecton Classfers, boostng

More information

Using Neural Networks and Support Vector Machines in Data Mining

Using Neural Networks and Support Vector Machines in Data Mining Usng eural etworks and Support Vector Machnes n Data Mnng RICHARD A. WASIOWSKI Computer Scence Department Calforna State Unversty Domnguez Hlls Carson, CA 90747 USA Abstract: - Multvarate data analyss

More information

Unsupervised Learning and Clustering

Unsupervised Learning and Clustering Unsupervsed Learnng and Clusterng Why consder unlabeled samples?. Collectng and labelng large set of samples s costly Gettng recorded speech s free, labelng s tme consumng 2. Classfer could be desgned

More information

The Greedy Method. Outline and Reading. Change Money Problem. Greedy Algorithms. Applications of the Greedy Strategy. The Greedy Method Technique

The Greedy Method. Outline and Reading. Change Money Problem. Greedy Algorithms. Applications of the Greedy Strategy. The Greedy Method Technique //00 :0 AM Outlne and Readng The Greedy Method The Greedy Method Technque (secton.) Fractonal Knapsack Problem (secton..) Task Schedulng (secton..) Mnmum Spannng Trees (secton.) Change Money Problem Greedy

More information

SUMMARY... I TABLE OF CONTENTS...II INTRODUCTION...

SUMMARY... I TABLE OF CONTENTS...II INTRODUCTION... Summary A follow-the-leader robot system s mplemented usng Dscrete-Event Supervsory Control methods. The system conssts of three robots, a leader and two followers. The dea s to get the two followers to

More information

Content Based Image Retrieval Using 2-D Discrete Wavelet with Texture Feature with Different Classifiers

Content Based Image Retrieval Using 2-D Discrete Wavelet with Texture Feature with Different Classifiers IOSR Journal of Electroncs and Communcaton Engneerng (IOSR-JECE) e-issn: 78-834,p- ISSN: 78-8735.Volume 9, Issue, Ver. IV (Mar - Apr. 04), PP 0-07 Content Based Image Retreval Usng -D Dscrete Wavelet wth

More information

General Vector Machine. Hong Zhao Department of Physics, Xiamen University

General Vector Machine. Hong Zhao Department of Physics, Xiamen University General Vector Machne Hong Zhao (zhaoh@xmu.edu.cn) Department of Physcs, Xamen Unversty The support vector machne (SVM) s an mportant class of learnng machnes for functon approach, pattern recognton, and

More information

A MOVING MESH APPROACH FOR SIMULATION BUDGET ALLOCATION ON CONTINUOUS DOMAINS

A MOVING MESH APPROACH FOR SIMULATION BUDGET ALLOCATION ON CONTINUOUS DOMAINS Proceedngs of the Wnter Smulaton Conference M E Kuhl, N M Steger, F B Armstrong, and J A Jones, eds A MOVING MESH APPROACH FOR SIMULATION BUDGET ALLOCATION ON CONTINUOUS DOMAINS Mark W Brantley Chun-Hung

More information

Backpropagation: In Search of Performance Parameters

Backpropagation: In Search of Performance Parameters Bacpropagaton: In Search of Performance Parameters ANIL KUMAR ENUMULAPALLY, LINGGUO BU, and KHOSROW KAIKHAH, Ph.D. Computer Scence Department Texas State Unversty-San Marcos San Marcos, TX-78666 USA ae049@txstate.edu,

More information

The Codesign Challenge

The Codesign Challenge ECE 4530 Codesgn Challenge Fall 2007 Hardware/Software Codesgn The Codesgn Challenge Objectves In the codesgn challenge, your task s to accelerate a gven software reference mplementaton as fast as possble.

More information

Assignment # 2. Farrukh Jabeen Algorithms 510 Assignment #2 Due Date: June 15, 2009.

Assignment # 2. Farrukh Jabeen Algorithms 510 Assignment #2 Due Date: June 15, 2009. Farrukh Jabeen Algorthms 51 Assgnment #2 Due Date: June 15, 29. Assgnment # 2 Chapter 3 Dscrete Fourer Transforms Implement the FFT for the DFT. Descrbed n sectons 3.1 and 3.2. Delverables: 1. Concse descrpton

More information

5 The Primal-Dual Method

5 The Primal-Dual Method 5 The Prmal-Dual Method Orgnally desgned as a method for solvng lnear programs, where t reduces weghted optmzaton problems to smpler combnatoral ones, the prmal-dual method (PDM) has receved much attenton

More information

Improvement of Spatial Resolution Using BlockMatching Based Motion Estimation and Frame. Integration

Improvement of Spatial Resolution Using BlockMatching Based Motion Estimation and Frame. Integration Improvement of Spatal Resoluton Usng BlockMatchng Based Moton Estmaton and Frame Integraton Danya Suga and Takayuk Hamamoto Graduate School of Engneerng, Tokyo Unversty of Scence, 6-3-1, Nuku, Katsuska-ku,

More information

LECTURE NOTES Duality Theory, Sensitivity Analysis, and Parametric Programming

LECTURE NOTES Duality Theory, Sensitivity Analysis, and Parametric Programming CEE 60 Davd Rosenberg p. LECTURE NOTES Dualty Theory, Senstvty Analyss, and Parametrc Programmng Learnng Objectves. Revew the prmal LP model formulaton 2. Formulate the Dual Problem of an LP problem (TUES)

More information

X- Chart Using ANOM Approach

X- Chart Using ANOM Approach ISSN 1684-8403 Journal of Statstcs Volume 17, 010, pp. 3-3 Abstract X- Chart Usng ANOM Approach Gullapall Chakravarth 1 and Chaluvad Venkateswara Rao Control lmts for ndvdual measurements (X) chart are

More information

Outline. Self-Organizing Maps (SOM) US Hebbian Learning, Cntd. The learning rule is Hebbian like:

Outline. Self-Organizing Maps (SOM) US Hebbian Learning, Cntd. The learning rule is Hebbian like: Self-Organzng Maps (SOM) Turgay İBRİKÇİ, PhD. Outlne Introducton Structures of SOM SOM Archtecture Neghborhoods SOM Algorthm Examples Summary 1 2 Unsupervsed Hebban Learnng US Hebban Learnng, Cntd 3 A

More information

INF 4300 Support Vector Machine Classifiers (SVM) Anne Solberg

INF 4300 Support Vector Machine Classifiers (SVM) Anne Solberg INF 43 Support Vector Machne Classfers (SVM) Anne Solberg (anne@f.uo.no) 9..7 Lnear classfers th mamum margn for toclass problems The kernel trck from lnear to a hghdmensonal generalzaton Generaton from

More information

3D vector computer graphics

3D vector computer graphics 3D vector computer graphcs Paolo Varagnolo: freelance engneer Padova Aprl 2016 Prvate Practce ----------------------------------- 1. Introducton Vector 3D model representaton n computer graphcs requres

More information

Biostatistics 615/815

Biostatistics 615/815 The E-M Algorthm Bostatstcs 615/815 Lecture 17 Last Lecture: The Smplex Method General method for optmzaton Makes few assumptons about functon Crawls towards mnmum Some recommendatons Multple startng ponts

More information

A Statistical Model Selection Strategy Applied to Neural Networks

A Statistical Model Selection Strategy Applied to Neural Networks A Statstcal Model Selecton Strategy Appled to Neural Networks Joaquín Pzarro Elsa Guerrero Pedro L. Galndo joaqun.pzarro@uca.es elsa.guerrero@uca.es pedro.galndo@uca.es Dpto Lenguajes y Sstemas Informátcos

More information

CS434a/541a: Pattern Recognition Prof. Olga Veksler. Lecture 15

CS434a/541a: Pattern Recognition Prof. Olga Veksler. Lecture 15 CS434a/541a: Pattern Recognton Prof. Olga Veksler Lecture 15 Today New Topc: Unsupervsed Learnng Supervsed vs. unsupervsed learnng Unsupervsed learnng Net Tme: parametrc unsupervsed learnng Today: nonparametrc

More information

Programming in Fortran 90 : 2017/2018

Programming in Fortran 90 : 2017/2018 Programmng n Fortran 90 : 2017/2018 Programmng n Fortran 90 : 2017/2018 Exercse 1 : Evaluaton of functon dependng on nput Wrte a program who evaluate the functon f (x,y) for any two user specfed values

More information

Cluster Analysis of Electrical Behavior

Cluster Analysis of Electrical Behavior Journal of Computer and Communcatons, 205, 3, 88-93 Publshed Onlne May 205 n ScRes. http://www.scrp.org/ournal/cc http://dx.do.org/0.4236/cc.205.350 Cluster Analyss of Electrcal Behavor Ln Lu Ln Lu, School

More information

Solving two-person zero-sum game by Matlab

Solving two-person zero-sum game by Matlab Appled Mechancs and Materals Onlne: 2011-02-02 ISSN: 1662-7482, Vols. 50-51, pp 262-265 do:10.4028/www.scentfc.net/amm.50-51.262 2011 Trans Tech Publcatons, Swtzerland Solvng two-person zero-sum game by

More information

Compiler Design. Spring Register Allocation. Sample Exercises and Solutions. Prof. Pedro C. Diniz

Compiler Design. Spring Register Allocation. Sample Exercises and Solutions. Prof. Pedro C. Diniz Compler Desgn Sprng 2014 Regster Allocaton Sample Exercses and Solutons Prof. Pedro C. Dnz USC / Informaton Scences Insttute 4676 Admralty Way, Sute 1001 Marna del Rey, Calforna 90292 pedro@s.edu Regster

More information

S1 Note. Basis functions.

S1 Note. Basis functions. S1 Note. Bass functons. Contents Types of bass functons...1 The Fourer bass...2 B-splne bass...3 Power and type I error rates wth dfferent numbers of bass functons...4 Table S1. Smulaton results of type

More information

Steps for Computing the Dissimilarity, Entropy, Herfindahl-Hirschman and. Accessibility (Gravity with Competition) Indices

Steps for Computing the Dissimilarity, Entropy, Herfindahl-Hirschman and. Accessibility (Gravity with Competition) Indices Steps for Computng the Dssmlarty, Entropy, Herfndahl-Hrschman and Accessblty (Gravty wth Competton) Indces I. Dssmlarty Index Measurement: The followng formula can be used to measure the evenness between

More information

Skew Angle Estimation and Correction of Hand Written, Textual and Large areas of Non-Textual Document Images: A Novel Approach

Skew Angle Estimation and Correction of Hand Written, Textual and Large areas of Non-Textual Document Images: A Novel Approach Angle Estmaton and Correcton of Hand Wrtten, Textual and Large areas of Non-Textual Document Images: A Novel Approach D.R.Ramesh Babu Pyush M Kumat Mahesh D Dhannawat PES Insttute of Technology Research

More information

Term Weighting Classification System Using the Chi-square Statistic for the Classification Subtask at NTCIR-6 Patent Retrieval Task

Term Weighting Classification System Using the Chi-square Statistic for the Classification Subtask at NTCIR-6 Patent Retrieval Task Proceedngs of NTCIR-6 Workshop Meetng, May 15-18, 2007, Tokyo, Japan Term Weghtng Classfcaton System Usng the Ch-square Statstc for the Classfcaton Subtask at NTCIR-6 Patent Retreval Task Kotaro Hashmoto

More information

Active Contours/Snakes

Active Contours/Snakes Actve Contours/Snakes Erkut Erdem Acknowledgement: The sldes are adapted from the sldes prepared by K. Grauman of Unversty of Texas at Austn Fttng: Edges vs. boundares Edges useful sgnal to ndcate occludng

More information

RECOGNIZING GENDER THROUGH FACIAL IMAGE USING SUPPORT VECTOR MACHINE

RECOGNIZING GENDER THROUGH FACIAL IMAGE USING SUPPORT VECTOR MACHINE Journal of Theoretcal and Appled Informaton Technology 30 th June 06. Vol.88. No.3 005-06 JATIT & LLS. All rghts reserved. ISSN: 99-8645 www.jatt.org E-ISSN: 87-395 RECOGNIZING GENDER THROUGH FACIAL IMAGE

More information

Module Management Tool in Software Development Organizations

Module Management Tool in Software Development Organizations Journal of Computer Scence (5): 8-, 7 ISSN 59-66 7 Scence Publcatons Management Tool n Software Development Organzatons Ahmad A. Al-Rababah and Mohammad A. Al-Rababah Faculty of IT, Al-Ahlyyah Amman Unversty,

More information

Some Advanced SPC Tools 1. Cumulative Sum Control (Cusum) Chart For the data shown in Table 9-1, the x chart can be generated.

Some Advanced SPC Tools 1. Cumulative Sum Control (Cusum) Chart For the data shown in Table 9-1, the x chart can be generated. Some Advanced SP Tools 1. umulatve Sum ontrol (usum) hart For the data shown n Table 9-1, the x chart can be generated. However, the shft taken place at sample #21 s not apparent. 92 For ths set samples,

More information

BAYESIAN MULTI-SOURCE DOMAIN ADAPTATION

BAYESIAN MULTI-SOURCE DOMAIN ADAPTATION BAYESIAN MULTI-SOURCE DOMAIN ADAPTATION SHI-LIANG SUN, HONG-LEI SHI Department of Computer Scence and Technology, East Chna Normal Unversty 500 Dongchuan Road, Shangha 200241, P. R. Chna E-MAIL: slsun@cs.ecnu.edu.cn,

More information

SENSITIVITY ANALYSIS IN LINEAR PROGRAMMING USING A CALCULATOR

SENSITIVITY ANALYSIS IN LINEAR PROGRAMMING USING A CALCULATOR SENSITIVITY ANALYSIS IN LINEAR PROGRAMMING USING A CALCULATOR Judth Aronow Rchard Jarvnen Independent Consultant Dept of Math/Stat 559 Frost Wnona State Unversty Beaumont, TX 7776 Wnona, MN 55987 aronowju@hal.lamar.edu

More information

CSCI 104 Sorting Algorithms. Mark Redekopp David Kempe

CSCI 104 Sorting Algorithms. Mark Redekopp David Kempe CSCI 104 Sortng Algorthms Mark Redekopp Davd Kempe Algorthm Effcency SORTING 2 Sortng If we have an unordered lst, sequental search becomes our only choce If we wll perform a lot of searches t may be benefcal

More information

FEATURE EXTRACTION. Dr. K.Vijayarekha. Associate Dean School of Electrical and Electronics Engineering SASTRA University, Thanjavur

FEATURE EXTRACTION. Dr. K.Vijayarekha. Associate Dean School of Electrical and Electronics Engineering SASTRA University, Thanjavur FEATURE EXTRACTION Dr. K.Vjayarekha Assocate Dean School of Electrcal and Electroncs Engneerng SASTRA Unversty, Thanjavur613 41 Jont Intatve of IITs and IISc Funded by MHRD Page 1 of 8 Table of Contents

More information

(1) The control processes are too complex to analyze by conventional quantitative techniques.

(1) The control processes are too complex to analyze by conventional quantitative techniques. Chapter 0 Fuzzy Control and Fuzzy Expert Systems The fuzzy logc controller (FLC) s ntroduced n ths chapter. After ntroducng the archtecture of the FLC, we study ts components step by step and suggest a

More information