CHAPTER 3

SEQUENTIAL MINIMAL OPTIMIZATION TRAINED SUPPORT VECTOR CLASSIFIER FOR CANCER PREDICTION

3.1 INTRODUCTION

Raw microarray data is basically an image in which different colors indicate the hybridization (Xue et al 2004) of DNAs expressed under different conditions. The image is further converted into numerical data as pixel intensities, ideally reflecting the count of photons corresponding to the abundance of gene transcripts. This data is analyzed to study the cause of disease, the effectiveness of treatments and so on. Data mining is becoming an increasingly important tool for transforming this data into information. The challenge in studying microarray datasets is that they include a large number of features, typically 2000 to 3000. However, not all of these genes are required for classification (Wang and Palade 2007). As these genes do not influence the performance of the classification task, taking them into account during classification increases the dimension of the classification problem, poses computational difficulties and introduces unnecessary noise into the process. In diagnostic research, procedures are based on microarrays with enough probes to detect a certain disease. The process of selecting informative genes, that is, genes related to the particular study or disease, is called Gene Selection (GS) (Li et al 2001). This process is similar to feature selection in machine learning in general.
The classification accuracy achieved for classifying gene expression is higher for supervised learning methods like SVM and NN (Lee et al 2005). Many studies based on SVM for classifying gene expression are available in the literature (Furey et al 2000; Fujarewicz and Wiench 2003; Wei et al 2010). The SVM algorithm is a powerful supervised learning algorithm. In this chapter it is proposed to implement an SMO-trained support vector classifier with a polynomial kernel and to compare its classification efficiency with Naïve Bayes and CART classifiers on the colon cancer data available from the Kent Ridge Biomedical Data Repository.

3.2 METHODOLOGY

The colon cancer data is available in the Kent Ridge Biomedical Data Repository. The gene expression samples were analyzed with an Affymetrix oligonucleotide array complementary to more than 6500 human genes. Colon epithelial cell samples taken from 62 colon-cancer patients form the dataset. The original data on each sample consists of 6000 gene expression levels, of which 4000 were removed based on the confidence in the measured expression levels. Thus each sample contains 2000 gene expression levels. Of the 62 samples in the dataset, 40 samples are normal samples and the remaining are samples with colon cancer. Each sample was taken from tumors and normal healthy parts of the colons of the same patients and measured using high density oligonucleotide arrays (Ben-Dor et al 2000).

3.2.1 Support Vector Machine for Cancer Prediction

The support vector machine is a machine learning technique based on the structural risk minimization principle (Vapnik 1995). SVM uses a hyperplane to separate the positive examples from the negative examples.
SVM is widely used for classification as the classifier has to calculate only the inner product between two vectors of the training data. It is widely applied in biomedical research for classification, and SVMs have been reported to perform better than neural networks (Zien et al 2000). An SVM with a linear kernel, a polynomial kernel or a Radial Basis Function (RBF) kernel is used to classify genes using gene expression data.

3.2.1.1 SVM algorithm

The modelling of the SVM is shown in Equation (3.1) through Equation (3.7). For linear SVMs, K is linear and the output of the SVM can be expressed as

    u = w \cdot x - t                                                (3.1)

and

    w = \sum_i y_i \alpha_i x_i                                      (3.2)

where u is the SVM output, w, x and x_i are vectors and t is the threshold. Training of an SVM is done by finding the \alpha_i, expressed as minimizing a dual quadratic form:

    \min \Psi = \min_\alpha \frac{1}{2} \sum_i \sum_j y_i y_j K(x_i, x_j) \alpha_i \alpha_j - \sum_i \alpha_i    (3.3)

subject to the box constraint and the linear equality constraint

    0 \le \alpha_i \le C, \qquad \sum_i y_i \alpha_i = 0             (3.4)
The \alpha_i are the Lagrange multipliers. Data sets are not always linearly separable, and in such a case a hyperplane cannot split the training set into positive and negative examples. In such cases, a modification of the original optimization is given by (Cortes and Vapnik 1995):

    \min_{w, b, \xi} \frac{1}{2} \|w\|^2 + C \sum_{i=1}^{N} \xi_i    (3.5)

subject to

    y_i (w \cdot x_i - b) \ge 1 - \xi_i                              (3.6)

where the \xi_i are slack variables and C is a parameter which trades off a wide margin against a small number of margin failures. The output of a non-linear SVM is computed from the Lagrange multipliers:

    u = \sum_{j=1}^{N} y_j \alpha_j K(x_j, x) - t                    (3.7)

where K is a kernel function that measures the similarity between the input vector x and a stored training vector x_j. The Lagrange multipliers are computed from the quadratic program in Equation (3.8); the non-linearity alters the quadratic form:

    \min \Psi = \min_\alpha \frac{1}{2} \sum_{i=1}^{N} \sum_{j=1}^{N} y_i y_j K(x_i, x_j) \alpha_i \alpha_j - \sum_{i=1}^{N} \alpha_i    (3.8)

subject to

    0 \le \alpha_i \le C, \quad i = 1, \ldots, N, \qquad \sum_{i=1}^{N} y_i \alpha_i = 0
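As a concrete illustration, Equation (3.7) can be evaluated directly once the multipliers and the threshold are known. The sketch below assumes a polynomial kernel, as used in this chapter's experiments; the function and variable names are illustrative and are not code from the original study.

```python
import numpy as np

def poly_kernel(a, b, degree=3):
    # Polynomial kernel K(a, b) = (a . b + 1)^degree
    return (np.dot(a, b) + 1.0) ** degree

def svm_output(x, support_vectors, alphas, targets, t, kernel=poly_kernel):
    """Evaluate Equation (3.7): u = sum_j y_j * alpha_j * K(x_j, x) - t."""
    u = sum(y_j * a_j * kernel(x_j, x)
            for x_j, a_j, y_j in zip(support_vectors, alphas, targets))
    return u - t

# Toy illustration: two stored training vectors with opposite targets
sv = [np.array([1.0, 0.0]), np.array([0.0, 1.0])]
print(svm_output(np.array([1.0, 0.0]), sv, [1.0, 1.0], [1, -1], 0.0))
```

The sign of u gives the predicted class; a point near the positive support vector yields u > 0, and one near the negative support vector yields u < 0.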
The Quadratic Programming (QP) problem is solved using the SMO algorithm.

3.2.1.2 Training of SVM

The QP problems in SVM cannot be solved using standard QP techniques due to their huge size, as the matrix has I^2 elements, where I is the number of training examples. The chunking algorithm (Vapnik 1995) is used to solve the SVM QP; it removes the rows and columns of the matrix that correspond to zero Lagrange multipliers, thus breaking the QP down into smaller QPs. At every step, a QP problem is solved by taking the examples of every non-zero Lagrange multiplier from the last step together with the worst examples that violate the Karush-Kuhn-Tucker (KKT) conditions (Christopher Burges 1998). The process is repeated till the entire set of non-zero Lagrange multipliers is identified, so that the last step solves the full QP. Yet the major disadvantage of chunking is that large-scale training problems cannot be handled, as even the reduced matrix will not fit into memory.

3.2.2 Sequential Minimal Optimization

Sequential Minimal Optimization is a simple algorithm that solves the SVM QP problem during training of the SVM. The advantage of SMO is that the QP is solved without numerical optimization steps and no extra matrix storage is required. Using Osuna's theorem, SMO decomposes the overall QP problem into QP sub-problems. For solving the SVM QP problem, two Lagrange multipliers that comply with the linear equality constraint are used for a small optimization. At each step, SMO chooses two Lagrange multipliers to jointly optimize, finds the optimal values for these multipliers and updates the SVM to reflect the new optimal values.
The two Lagrange multipliers can be solved for analytically, and thus numerical QP optimization can be entirely avoided. The solving of the multipliers can be expressed in the algorithm as a loop (for example in Visual C++ code); each sub-problem is therefore solved quickly and the QP problem is solved fast. Thus, very large SVM training problems can be easily processed and stored in the memory of an ordinary personal computer or workstation. Because no matrix algorithms are used in SMO, it is less susceptible to numerical precision problems. There are two components to SMO: an analytic method for solving for the two Lagrange multipliers, and a heuristic for choosing which multipliers to optimize.

3.2.2.1 Analytical method of solving multipliers

SMO solves the QP, expressed as minimizing \Psi under the box constraint and the linear constraint, by decomposing the QP problem into fixed-size QP sub-problems. SMO computes the constraints on the two multipliers and solves for the constrained minimum. As there are only two multipliers, the constraints can be shown in two dimensions. Due to the box constraint the multipliers lie within a box, and the linear constraint makes the multipliers lie on a diagonal of the box. As the multipliers lie on the diagonal, the algorithm computes \alpha_2 first, and the ends of the diagonal segment are expressed in terms of \alpha_2.

If target y_1 does not equal target y_2, then the bounds for \alpha_2 are given by:

    M = \max(0, \alpha_2 - \alpha_1), \qquad N = \min(C, C + \alpha_2 - \alpha_1)

When target y_1 equals target y_2, then the bounds for \alpha_2 are given by:
    M = \max(0, \alpha_1 + \alpha_2 - C), \qquad N = \min(C, \alpha_1 + \alpha_2)

The minimum along the direction of the constraint is computed by SMO as shown in Equation (3.9):

    \alpha_2^{new} = \alpha_2 + \frac{y_2 (S_1 - S_2)}{\eta}         (3.9)

where S_i is the error on the i-th training example, S_i = u_i - y_i (u_i being the SVM output for the i-th training example), and \eta is the second derivative of the objective function along the diagonal, given in Equation (3.10):

    \eta = K(x_1, x_1) + K(x_2, x_2) - 2 K(x_1, x_2)                 (3.10)

Now the constrained minimum is found by clipping, as in Equations (3.11) and (3.12):

    \alpha_2^{new,clipped} =
        \begin{cases}
            N              & \text{if } \alpha_2^{new} \ge N \\
            \alpha_2^{new} & \text{if } M < \alpha_2^{new} < N \\
            M              & \text{if } \alpha_2^{new} \le M
        \end{cases}                                                  (3.11)

Then the value of \alpha_1 is computed from \alpha_2^{new,clipped} as:

    \alpha_1^{new} = \alpha_1 + z (\alpha_2 - \alpha_2^{new,clipped}), \qquad z = y_1 y_2    (3.12)

SMO terminates when all of the KKT optimality conditions are fulfilled:

    \alpha_i = 0 \Rightarrow y_i u_i \ge 1, \qquad
    0 < \alpha_i < C \Rightarrow y_i u_i = 1, \qquad
    \alpha_i = C \Rightarrow y_i u_i \le 1
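The analytic update of Equations (3.9) through (3.12) can be collected into one small routine. This is a minimal sketch under the notation above (M and N as the lower and upper bounds, S_i as the errors); it is not the original implementation, and the degenerate case \eta \le 0 is simply skipped here rather than handled as in the full SMO algorithm.

```python
def smo_pairwise_step(a1, a2, y1, y2, S1, S2, K11, K22, K12, C):
    """One analytic SMO update of a pair of multipliers, Equations (3.9)-(3.12).

    a1, a2  : current Lagrange multipliers
    y1, y2  : targets (+1 / -1); S1, S2: errors S_i = u_i - y_i
    K11, K22, K12 : kernel evaluations; C: box-constraint parameter
    """
    # Bounds M (lower) and N (upper) for the new a2 on the constraint diagonal
    if y1 != y2:
        M = max(0.0, a2 - a1)
        N = min(C, C + a2 - a1)
    else:
        M = max(0.0, a1 + a2 - C)
        N = min(C, a1 + a2)
    # Second derivative along the diagonal, Equation (3.10)
    eta = K11 + K22 - 2.0 * K12
    if eta <= 0 or M == N:
        return a1, a2  # degenerate case: leave the pair unchanged in this sketch
    # Unconstrained minimum, Equation (3.9), clipped to [M, N] per (3.11)
    a2_new = a2 + y2 * (S1 - S2) / eta
    a2_new = min(max(a2_new, M), N)
    # Recover a1 from the linear equality constraint, Equation (3.12)
    z = y1 * y2
    a1_new = a1 + z * (a2 - a2_new)
    return a1_new, a2_new
```

Note that the update preserves the linear constraint: y_1 \alpha_1 + y_2 \alpha_2 is the same before and after the step.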
3.2.2.2 Heuristics for choosing multipliers for optimization

Convergence is assured with SMO as it optimizes and alters the multipliers at every step, and each step decreases the objective function. Heuristics choose which multipliers to optimize so as to improve the speed of convergence. There are two separate choices of heuristic, one for each of the two Lagrange multipliers. The loop iterates over the examples whose Lagrange multipliers are neither 0 nor C, and the examples which violate the KKT conditions are optimized. The outer loop makes repeated passes over the non-bound examples and terminates when all the examples obey the KKT conditions within \epsilon. Typically the value of \epsilon is set to 10^{-3}.

The first-choice heuristic concentrates on the examples most likely to violate the KKT conditions. So only the set of non-bound examples is iterated over until the set is self-consistent; then SMO scans the entire set of examples for KKT violations. On selection of the first Lagrange multiplier, SMO selects the second Lagrange multiplier to maximize the size of the step taken during the joint optimization. As the evaluation of the kernel function K is time consuming, SMO approximates the step size by |S_1 - S_2| when computing \alpha_2^{new}. SMO records a cached error value S_i for each non-bound example in the training set and then chooses an error to approximately maximize the step size. Thus if S_1 is positive, the example with minimum error S_2 is chosen, and if S_1 is negative then the example with maximum error S_2 is chosen.
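A minimal sketch of the second-choice heuristic described above, assuming the cached errors are kept in a dictionary keyed by example index (an illustrative data structure, not the original implementation):

```python
def choose_second_multiplier(S1, cached_errors):
    """Pick the non-bound example whose cached error approximately
    maximizes the step size |S1 - S2|.

    cached_errors : dict mapping example index -> cached error S_i
    """
    if S1 > 0:
        # positive S1: the minimum cached error gives the largest |S1 - S2|
        return min(cached_errors, key=cached_errors.get)
    # negative (or zero) S1: the maximum cached error does
    return max(cached_errors, key=cached_errors.get)

errors = {0: -0.5, 1: 0.3, 2: 0.9}
print(choose_second_multiplier(0.4, errors))   # 0  (minimum error -0.5)
print(choose_second_multiplier(-0.4, errors))  # 2  (maximum error 0.9)
```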
3.3 RESULT AND DISCUSSION

The colon cancer dataset was trained and tested using 10-fold cross-validation. The test bench with the colon cancer dataset is furnished in Figure 3.1.

Figure 3.1 Snapshot of the colon cancer dataset used in the experiment
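The test bench itself is not reproduced here, but a 10-fold cross-validation experiment of the same shape can be sketched with scikit-learn. The data below is a synthetic stand-in with the dataset's dimensions (62 samples, 2000 genes); the real Kent Ridge matrix would be substituted for it, and scikit-learn's SVC with a polynomial kernel stands in for the SMO-trained classifier used in the study.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.svm import SVC

# Synthetic stand-in with the colon dataset's shape: 62 samples x 2000 genes
X, y = make_classification(n_samples=62, n_features=2000,
                           n_informative=50, random_state=0)

# SMO-style support vector classifier with a polynomial kernel
clf = SVC(kernel="poly", degree=3, C=1.0)

# 10-fold cross-validation, stratified so each fold keeps the class balance
scores = cross_val_score(clf, X, y, cv=StratifiedKFold(n_splits=10))
print("mean 10-fold accuracy: %.3f" % scores.mean())
```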
The classification accuracy of the three classifiers under test is represented in Figure 3.2, with the sensitivity and specificity plots shown in Figure 3.3.

Figure 3.2 Classification accuracy of various classifiers

It is seen that sequential minimal optimization gives the best overall classification accuracy. Though the classification accuracy of CART is lower than that of SMO, the sensitivity of the CART predictor outperforms SMO. Sensitivity measures the proportion of actual positives that are correctly identified and is given in Equation (3.13):

    sensitivity = (number of true positives) / (number of true positives + number of false negatives)    (3.13)
The specificity for a binary class problem is given in Equation (3.14). Specificity measures the classifier's ability to predict negative results:

    specificity = (number of true negatives) / (number of true negatives + number of false positives)    (3.14)

Figure 3.3 Sensitivity and specificity plots

A confusion matrix represents the predicted values obtained in supervised learning and is used to show the correct labels and mislabels. The confusion matrix obtained for all three classifiers is given in Table 3.1.

Table 3.1 Confusion matrix

                    Naïve Bayes          CART                 SMO
                 Positive  Negative  Positive  Negative  Positive  Negative
    Positive        14        8         9         13        17        5
    Negative        21        19        2         38        4         36
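Equations (3.13) and (3.14) reduce to two one-line functions. The example below applies them to the SMO counts that appear in Table 3.1 (true positives 17, false negatives 5, false positives 4, true negatives 36):

```python
def sensitivity(tp, fn):
    # Equation (3.13): TP / (TP + FN)
    return tp / (tp + fn)

def specificity(tn, fp):
    # Equation (3.14): TN / (TN + FP)
    return tn / (tn + fp)

# SMO counts from Table 3.1
print(round(sensitivity(17, 5), 3))   # 0.773
print(round(specificity(36, 4), 3))   # 0.9
```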
The algorithm computation time for all three classifiers is shown in Figure 3.4. The computation time was measured on an Intel Core i3 M350 processor running at 2.27 GHz with 3 GB RAM and the Windows 7 operating system.

Figure 3.4 The algorithm computation time

3.4 SUMMARY

In this chapter, a support vector machine classifier trained using sequential minimal optimization is investigated. Its classification accuracy is good compared to the Naïve Bayes and CART classifiers. However, one drawback of the proposed support vector machine based classifier for cancer prediction is the slow convergence of SMO for higher values of the complexity parameter 'C' (which trades off a wide margin against a small number of margin failures); thus the performance timing degrades as the value of 'C' increases. The next chapter focuses on LVQ, a highly intuitive learning model which is based on a different training paradigm.