Parallel Sequential Minimal Optimization for the Training of Support Vector Machines


Parallel Sequential Minimal Optimization for the Training of Support Vector Machines (1)

L.J. Cao (a), S.S. Keerthi (b), C.J. Ong (b), P. Uvaraj (c), X.J. Fu (c), H.P. Lee (c) and J.Q. Zhang (a)

(a) Financial Studies of Fudan University, HanDan Road, ShangHai, P.R. China, 200433
(b) Dept. of Mechanical Engineering, National University of Singapore, 10 Kent Ridge Crescent, Singapore 119260
(c) Institute of High Performance Computing, 1 Science Park Road, #01-01 The Capricorn, Science Park II, Singapore 117528

Abstract -- Sequential minimal optimization (SMO) is one popular algorithm for training support vector machines (SVMs), but it still requires a large amount of computation time for solving large size problems. This paper proposes a parallel implementation of SMO for training SVMs. The parallel SMO is developed using message passing interface (MPI). Specifically, the parallel SMO first partitions the entire training data set into smaller subsets and then simultaneously runs multiple CPU processors to deal with each of the partitioned data sets. Experiments show that there is great speedup on the adult data set and the MNIST data set when many processors are used. There are also satisfactory results on the Web data set.

Index Terms -- Support vector machine (SVM), sequential minimal optimization (SMO), message passing interface (MPI), parallel algorithm

(1) Corresponding author. Email: ljcao@fudan.edu.cn. The research work is funded by National Natural Science Research Fund No. 70501008 and sponsored by the Shanghai Pujiang program.

I. INTRODUCTION

Recently, a lot of research work has been done on support vector machines (SVMs), mainly due to their impressive generalization performance in solving various machine learning problems [1,2,3,4,5]. Given a set of data points {(X_i, y_i)}, i = 1, ..., l (where X_i ∈ R^d is the input vector of the i-th training data pattern, y_i ∈ {−1, 1} is its class label, and l is the total number of training data patterns), training an SVM in classification is equivalent to solving the following linearly constrained convex quadratic programming (QP) problem:

maximize:   R(α) = Σ_{i=1}^{l} α_i − (1/2) Σ_{i=1}^{l} Σ_{j=1}^{l} α_i α_j y_i y_j K(X_i, X_j)    (1)

subject to: Σ_{i=1}^{l} α_i y_i = 0,    (2)
            0 ≤ α_i ≤ c,  i = 1, ..., l

where K(X_i, X_j) is the kernel function. The most widely used kernel function is the Gaussian function exp(−‖X_i − X_j‖² / σ²), where σ is the width of the Gaussian kernel. α_i is the Lagrange multiplier to be optimized; one α_i is associated with each training data pattern. c is the regularization constant pre-determined by users. After solving the QP problem (1), the following decision function is used to determine the class label of a new data pattern X:

function(X) = Σ_{i=1}^{l} α_i y_i K(X_i, X) + b    (3)

where b is obtained from the solution of (1). So the main problem in training an SVM is reduced to solving the QP problem (1), where the number of variables α_i to be optimized is equal to the number of training data patterns l.
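To make the formulas concrete, the sketch below evaluates the Gaussian kernel and the decision function (3) for a new pattern X. It is only an illustration of the formulas above: the row-major dense layout of X, the function names, and treating σ as a plain parameter are assumptions of this sketch, not details taken from the paper's implementation.

#include <math.h>
#include <stddef.h>

/* Gaussian kernel K(X_i, X_j) = exp(-||X_i - X_j||^2 / sigma^2) for d-dimensional inputs. */
static double gaussian_kernel(const double *xi, const double *xj, int d, double sigma)
{
    double dist2 = 0.0;
    for (int k = 0; k < d; k++) {
        double diff = xi[k] - xj[k];
        dist2 += diff * diff;
    }
    return exp(-dist2 / (sigma * sigma));
}

/* Decision function (3): sum_i alpha_i * y_i * K(X_i, X) + b.
 * X is an l x d matrix stored row by row; only patterns with alpha_i > 0
 * (the support vectors) contribute to the sum. */
static double decision_function(const double *X, const double *alpha, const double *y,
                                int l, int d, double b, double sigma, const double *x_new)
{
    double value = b;
    for (int i = 0; i < l; i++)
        if (alpha[i] > 0.0)
            value += alpha[i] * y[i] * gaussian_kernel(X + (size_t)i * d, x_new, d, sigma);
    return value;
}

The predicted class label of the new pattern is then the sign of the returned value.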

For small size problems, standard QP techniques such as the projected conjugate gradient method can be directly applied. But for large size problems, standard QP techniques are not useful because they require a large amount of computer memory to store the kernel matrix K, whose number of elements is equal to the square of the number of training data patterns. To make SVMs more practical, special algorithms have been developed, such as Vapnik's chunking [6], Osuna's decomposition [7] and Joachims's SVMlight [8]. They make the training of SVMs possible by breaking the large QP problem (1) into a series of smaller QP problems and optimizing only a subset of the training data patterns at each step. The subset of training data patterns optimized at each step is called the working set; these approaches are therefore categorized as working set methods.

Based on the idea of the working set methods, Platt [9] proposed the sequential minimal optimization (SMO) algorithm, which fixes the size of the working set at two and uses a simple analytical approach to solve the resulting small QP problems. Several heuristics are used for choosing the two α_i to optimize at each step. As pointed out by Platt, SMO scales only quadratically in the number of training data patterns, while other algorithms scale cubically or worse. Later, Keerthi et al. [10,11] identified an inefficiency in Platt's SMO and suggested two modified versions of SMO that are much more efficient than Platt's original algorithm. The second modification is particularly good and is used in popular SVM packages such as LIBSVM [12]. We will refer to this modification as the modified SMO algorithm.

Recently, there have been a few works on developing parallel implementations of SVM training [13,14,15,16]. In [13], a mixture of SVMs is trained in parallel using subsets of a training data set. The results of the individual SVMs are then combined by training another multi-layer perceptron.

The experiment shows that the proposed parallel algorithm is much more efficient than using a single SVM. In the algorithm proposed by Dong et al. [14], multiple SVMs are also developed using subsets of a training data set. The support vectors of each SVM are then collected to train another new SVM. The experiment demonstrates that this algorithm is also very efficient. Zanghirati and Zanni [15] proposed a parallel implementation of SVMlight in which the whole quadratic programming problem is split into smaller subproblems. The subproblems are then solved by a variable projection method. The results show that the approach is comparable on scalar machines with a widely used technique and can achieve good efficiency and scalability on a multiprocessor system. Huang et al. [16] proposed a modular network implementation for SVM. They found that the modular network could significantly reduce the learning time of SVM algorithms without sacrificing much generalization performance.

This paper proposes a parallel implementation of the modified SMO on a multiprocessor system for speeding up the training of SVMs, especially with the aim of solving large size problems. The parallel SMO is developed using message passing interface (MPI) [17]. Unlike the sequential SMO, which handles the entire training data set using a single CPU processor, the parallel SMO first partitions the entire training data set into smaller subsets and then simultaneously runs multiple CPU processors to deal with each of the partitioned data sets. On the adult data set, the parallel SMO using 32 CPU processors is more than 21 times faster than the sequential SMO. On the web data set, the parallel SMO using 30 CPU processors is more than 10 times faster than the sequential SMO. On the MNIST data set, in terms of the time averaged over the one-against-all SVM classifiers, the parallel SMO using 30 CPU processors is more than 21 times faster than the sequential SMO.

This paper is organized as follows. Section II gives an overview of the modified SMO. Section III describes the parallel SMO developed using MPI. Section IV presents the experiments indicating the efficiency of the parallel SMO. A short conclusion then follows.

II. A BRIEF OVERVIEW OF THE MODIFIED SMO

We begin the description of the modified SMO by giving the notation used. Let

I_0 = {i : 0 < α_i < c},  I_1 = {i : y_i = 1, α_i = 0},  I_2 = {i : y_i = −1, α_i = c},
I_3 = {i : y_i = 1, α_i = c},  I_4 = {i : y_i = −1, α_i = 0},

where I = I_0 ∪ I_1 ∪ I_2 ∪ I_3 ∪ I_4 denotes the index set of the training data patterns. Define

f_i = Σ_{j=1}^{l} α_j y_j K(X_j, X_i) − y_i,
b_up = min{f_i : i ∈ I_0 ∪ I_1 ∪ I_2},  I_up = arg min_i f_i,
b_low = max{f_i : i ∈ I_0 ∪ I_3 ∪ I_4},  I_low = arg max_i f_i,

and τ = 10^−6.

The idea of the modified SMO is to optimize, at each step, the two α_i associated with b_up and b_low, whose indices are I_up and I_low, according to (4) and (5):

α_2^new = α_2^old − y_2 (f_1^old − f_2^old) / η    (4)

α_1^new = α_1^old + s (α_2^old − α_2^new)    (5)

where the variables associated with the two α_i are represented using the subscripts 1 and 2, s = y_1 y_2, and η = 2 K(X_1, X_2) − K(X_1, X_1) − K(X_2, X_2). α_1^new and α_2^new need to be clipped to [0, c], that is, 0 ≤ α_1^new ≤ c and 0 ≤ α_2^new ≤ c. After optimizing α_1 and α_2, the value f_i, denoting the error on the i-th training data pattern, is updated according to the following:

f_i^new = f_i^old + (α_1^new − α_1^old) y_1 K(X_1, X_i) + (α_2^new − α_2^old) y_2 K(X_2, X_i)    (6)
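To illustrate how (4) and (5) are applied together with the clipping to [0, c], the following C sketch optimizes the selected pair analytically in the non-degenerate case η < 0. The function name, the explicit bounds L and H implied by constraint (2), and the omission of the degenerate case η = 0 are simplifications of this illustration; the procedure actually used is the takeStep routine given in Appendix A.

#include <math.h>

/* Analytic update (4)-(5) of the selected pair (alph1, alph2).
 * F1, F2 are the cached errors f_1, f_2; K11, K12, K22 the kernel values;
 * C is the regularization constant.  Only the case eta < 0 is handled here. */
static void optimize_pair(double y1, double y2, double F1, double F2,
                          double K11, double K12, double K22, double C,
                          double *alph1, double *alph2)
{
    double s = y1 * y2;
    double gamma = *alph1 + s * (*alph2);      /* conserved by the equality constraint (2) */
    double L, H;

    if (s > 0) { L = fmax(0.0, gamma - C); H = fmin(C, gamma); }
    else       { L = fmax(0.0, -gamma);    H = fmin(C, C - gamma); }
    if (H <= L) return;                        /* feasible segment is a single point */

    double eta = 2.0 * K12 - K11 - K22;        /* <= 0 for a valid kernel */
    if (eta < 0.0) {
        double a2 = *alph2 - y2 * (F1 - F2) / eta;   /* Eq. (4) */
        if (a2 < L) a2 = L;
        else if (a2 > H) a2 = H;               /* clip so both multipliers stay in [0, C] */
        *alph1 += s * (*alph2 - a2);           /* Eq. (5) */
        *alph2 = a2;
    }
    /* eta == 0 (objective linear in a2) is handled separately, see Appendix A */
}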

Based on the updated values of f_i, the quantities b_up and b_low and the associated indices I_up and I_low are updated again according to their definitions. The updated values are then used to choose another two α_i to optimize at the next step. In addition, the value of Eq. (1), represented by Dual, is updated at each step:

Dual^new = Dual^old − ((α_1^new − α_1^old)/y_1) (f_1^old − f_2^old) + (1/2) η ((α_1^new − α_1^old)/y_1)²    (7)

And DualityGap, representing the difference between the primal and the dual objective functions of the SVM, is calculated by (8):

DualityGap = Σ_{i=1}^{l} α_i y_i f_i + Σ_{i=1}^{l} ε_i,  where ε_i = c·max(0, b − f_i) if y_i = 1 and ε_i = c·max(0, f_i − b) if y_i = −1    (8)

A more detailed description of Dual and DualityGap can be found in [8]. Dual and DualityGap are used for checking the convergence of the program. The modified SMO in its sequential form can be summarized as:

Sequential SMO Algorithm:
  Initialize α_i = 0, f_i = −y_i, Dual = 0, i = 1, ..., l
  Calculate b_up, I_up, b_low, I_low, DualityGap
  Until DualityGap ≤ τ · Dual:
    (1) Optimize α_{I_up}, α_{I_low}
    (2) Update f_i, i = 1, ..., l
    (3) Calculate b_up, I_up, b_low, I_low, DualityGap and update Dual
  Repeat
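Step (3) amounts to a single scan of the cached f_i values. The sketch below recovers membership in I_0, ..., I_4 directly from α and y instead of maintaining the index sets explicitly, which is an illustrative simplification of the bookkeeping used in the paper; the variable names are assumptions of this sketch.

/* One scan over the f cache to find the most violating pair:
 *   b_up  = min{ f_i : i in I_0 u I_1 u I_2 },  I_up  = argmin,
 *   b_low = max{ f_i : i in I_0 u I_3 u I_4 },  I_low = argmax. */
static void find_working_pair(const double *alpha, const double *fcache, const double *y,
                              int l, double C,
                              double *b_up, int *I_up, double *b_low, int *I_low)
{
    *b_up = 1e300;   *I_up = -1;
    *b_low = -1e300; *I_low = -1;
    for (int i = 0; i < l; i++) {
        int in_I0  = (alpha[i] > 0.0 && alpha[i] < C);
        int in_up  = in_I0 || (y[i] > 0 && alpha[i] == 0.0) || (y[i] < 0 && alpha[i] == C);  /* I_0, I_1, I_2 */
        int in_low = in_I0 || (y[i] > 0 && alpha[i] == C)   || (y[i] < 0 && alpha[i] == 0.0); /* I_0, I_3, I_4 */
        if (in_up  && fcache[i] < *b_up)  { *b_up  = fcache[i]; *I_up  = i; }
        if (in_low && fcache[i] > *b_low) { *b_low = fcache[i]; *I_low = i; }
    }
}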

III. THE PARALLEL SMO

MPI is not a new programming language, but a library of functions that can be used in C, C++ and FORTRAN [17]. MPI allows one to easily implement an algorithm in parallel by running multiple CPU processors to improve efficiency. The Single Program Multiple Data (SPMD) mode, in which different processors execute the same program on different data, is generally used in MPI for developing parallel programs.

In the sequential SMO algorithm, most of the computation time is dominated by updating the f_i array in step (2), as this step includes the kernel evaluations and is required for every training data pattern. As shown in our experiments, over 90% of the total computation time of the sequential SMO is spent updating the f_i array. So the first idea for improving the efficiency of SMO is to develop a parallel program for updating the f_i array. According to (6), the update of the f_i array is evaluated independently, one training data pattern at a time, so the SPMD mode can be used to execute this program in parallel. That is, the entire training data set is first equally partitioned into smaller subsets according to the number of processors used. Then each of the partitioned subsets is distributed to one CPU processor. By executing the program for updating the f_i array on all the processors, each processor updates a different subset of the f_i array based on its assigned training data patterns. In this way, much computation time can be saved. Let p denote the total number of processors used and t_f the amount of computation time used for updating the f_i array in the sequential SMO. By using the parallel program for updating the f_i array, the amount of computation time used to update the f_i array is almost reduced to t_f / p.
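The partitioning itself is straightforward under the SPMD mode: each MPI process derives its own block of training patterns from its rank and thereafter touches only the corresponding slice of the f array. The block layout and variable names below are assumptions of this sketch; the paper only states that the data set is split evenly across the processors.

#include <mpi.h>
#include <stdlib.h>

/* SPMD skeleton: every process runs this same program and works on its own
 * contiguous block of the l training patterns (and of the f array). */
int main(int argc, char **argv)
{
    int rank, nprocs;
    int l = 100000;                       /* total number of training patterns (from the data file) */

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &nprocs);

    int base = l / nprocs, rem = l % nprocs;
    int local_n = base + (rank < rem ? 1 : 0);                /* size of my block            */
    int offset  = rank * base + (rank < rem ? rank : rem);    /* global index of its start   */

    double *f_local = malloc((size_t)local_n * sizeof(double));
    /* ... read or receive the patterns offset .. offset+local_n-1, set
     *     f_local[i] = -y[offset + i], and run the SMO iterations, updating
     *     only this slice of f in step (2) ... */

    free(f_local);
    MPI_Finalize();
    return 0;
}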

Besides updating the f_i array, calculating b_up, b_low, I_up and I_low can also be performed in parallel, as the calculation involves examining all the training data points. By executing the program for calculating b_up, b_low, I_up and I_low on all the processors, each processor obtains one b_up and one b_low, as well as the associated I_up and I_low, based on its assigned training data patterns. The b_up, I_up, b_low and I_low of each processor are not global, in the sense that they are obtained from only a subset of all the training data patterns. The global b_up and the global b_low are respectively the minimum of the b_up values of the processors and the maximum of the b_low values of the processors, as described in Section II. By determining the global b_up and the global b_low, the associated I_up and I_low can then be found. The corresponding two α_i are then optimized by any one CPU processor.

According to (8), calculating DualityGap is also evaluated independently, one training data pattern at a time, so this program can likewise be executed in parallel using the SPMD mode. By running the program of Eq. (8) on multiple CPU processors, each processor calculates a different portion of DualityGap based on its assigned training data patterns. The value of DualityGap on the entire training data set is the sum of the DualityGap values of all the processors.

In summary, based on the SPMD parallel mode, the parallel SMO updates the f_i array and calculates b_up, b_low, I_up, I_low and DualityGap at each step in parallel using multiple CPU processors. The calculation of the other parts of SMO, which take little time, is done using one CPU processor, the same as in the sequential SMO. Due to the use of multiple processors, communication among processors is also required in the parallel SMO, such as obtaining the global b_up, I_up, b_low and I_low from the b_up, I_up, b_low and I_low of each processor.
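One natural way to realise this communication step with MPI is to pack each local b_up (or b_low) together with the owning rank and reduce with MPI_MINLOC (or MPI_MAXLOC), and to sum the local DualityGap contributions with MPI_SUM. The paper does not prescribe the particular MPI calls, so the following is an assumed, but standard, reduction scheme.

#include <mpi.h>

/* Combine the per-processor candidates into the global quantities of Section III:
 * the global b_up is the minimum of the local b_up values, the global b_low is the
 * maximum of the local b_low values, and DualityGap is the sum of the local parts.
 * The returned ranks identify the processors (Z1, Z2) that own I_up and I_low. */
static void reduce_globals(double local_b_up, double local_b_low, double local_gap,
                           double *b_up, int *rank_up,
                           double *b_low, int *rank_low, double *duality_gap)
{
    int rank;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    struct { double value; int rank; } in, out;

    in.value = local_b_up;  in.rank = rank;
    MPI_Allreduce(&in, &out, 1, MPI_DOUBLE_INT, MPI_MINLOC, MPI_COMM_WORLD);
    *b_up = out.value;  *rank_up = out.rank;       /* processor Z1 holding I_up  */

    in.value = local_b_low; in.rank = rank;
    MPI_Allreduce(&in, &out, 1, MPI_DOUBLE_INT, MPI_MAXLOC, MPI_COMM_WORLD);
    *b_low = out.value; *rank_low = out.rank;      /* processor Z2 holding I_low */

    MPI_Allreduce(&local_gap, duality_gap, 1, MPI_DOUBLE, MPI_SUM, MPI_COMM_WORLD);
}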

For making the parallel SMO efficient, the communication time should be kept small. A brief description of the parallel SMO can be summarized as follows.

Parallel SMO Algorithm:

Notation: p is the total number of processors used. l_k is the subset of training data patterns assigned to processor k, with the union of l_1, ..., l_p covering all the training data patterns. α_i^k, f_i^k, b_up^k, I_up^k, b_low^k, I_low^k and DualityGap^k, i ∈ l_k, denote the variables associated with processor k:

f_i^k = Σ_{j=1}^{l} α_j y_j K(X_j, X_i) − y_i,
b_up^k = min{f_i^k : i ∈ I_0 ∪ I_1 ∪ I_2, i ∈ l_k},  I_up^k = arg min_i f_i^k,
b_low^k = max{f_i^k : i ∈ I_0 ∪ I_3 ∪ I_4, i ∈ l_k},  I_low^k = arg max_i f_i^k.

b_up, I_up, b_low, I_low and DualityGap still denote the variables on the entire set of training data patterns: b_up = min_k {b_up^k} with I_up taken from the processor attaining this minimum, b_low = max_k {b_low^k} with I_low taken from the processor attaining this maximum, and DualityGap = Σ_{k=1}^{p} DualityGap^k.

  Initialize α_i = 0, f_i = −y_i, Dual = 0, i ∈ l_k, k = 1, ..., p
  Calculate b_up^k, I_up^k, b_low^k, I_low^k, DualityGap^k, k = 1, ..., p
  Obtain b_up, I_up, b_low, I_low and DualityGap
  Until DualityGap ≤ τ · Dual:
    (1) Optimize α_{I_up}, α_{I_low}
    (2) Update f_i^k, i ∈ l_k, k = 1, ..., p
    (3) Calculate b_up^k, I_up^k, b_low^k, I_low^k, DualityGap^k, k = 1, ..., p
    (4) Obtain b_up, I_up, b_low, I_low, DualityGap and update Dual
  Repeat
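For step (1), every processor needs the data of the two selected patterns in order to update its own slice of f in step (2). One simple way to provide it, sketched below, is to broadcast the scalars and the input vector from the owning ranks Z1 and Z2; packing the scalars into one buffer per owner is an illustrative choice, not a detail taken from the paper.

#include <mpi.h>

/* Distribute the data of the selected pair from the processors Z1 (owner of I_up)
 * and Z2 (owner of I_low) to every processor; d is the input dimension.  On the
 * owning ranks the arguments already hold the real values; on the other ranks
 * they are overwritten by the broadcast. */
static void broadcast_pair(int Z1, int Z2, int d,
                           double *alph1, double *y1, double *F1, double *X1,
                           double *alph2, double *y2, double *F2, double *X2)
{
    double head1[3] = { *alph1, *y1, *F1 };
    double head2[3] = { *alph2, *y2, *F2 };

    MPI_Bcast(head1, 3, MPI_DOUBLE, Z1, MPI_COMM_WORLD);
    MPI_Bcast(X1,    d, MPI_DOUBLE, Z1, MPI_COMM_WORLD);
    MPI_Bcast(head2, 3, MPI_DOUBLE, Z2, MPI_COMM_WORLD);
    MPI_Bcast(X2,    d, MPI_DOUBLE, Z2, MPI_COMM_WORLD);

    *alph1 = head1[0]; *y1 = head1[1]; *F1 = head1[2];
    *alph2 = head2[0]; *y2 = head2[1]; *F2 = head2[2];
}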

A more detailed description of the parallel SMO is given in the pseudo-code in Appendix A.

IV. EXPERIMENT

The parallel SMO is tested against the sequential SMO on three benchmarks: the adult data set, the web data set and the MNIST data set. Both algorithms are written in C. Both algorithms are run on an IBM p690 Regatta supercomputer which has a total of 7 nodes, with each node having 32 Power PC_POWER4 1.3 GHz processors. To ensure the same accuracy in the sequential SMO and the parallel SMO, the stopping criteria used in both algorithms, such as the value of τ, are all the same.

A. Adult Data Set

The first data set used to test the speed of the parallel SMO is the UCI adult data set [10]. The task is to predict whether a household has an income larger than $50,000 based on a total of 123 binary attributes. For each input vector, only an average of 14 binary attributes are true, represented by the value of 1; the other attributes are all false, represented by the value of 0. There are a total of 28,956 data patterns in the training data set. The Gaussian kernel is used for both the sequential SMO and the parallel SMO. The values of the Gaussian variance σ and of c are arbitrarily chosen as 100 and 1. These values are not necessarily the ones that give the best generalization performance of the SVM, as the purpose of this experiment is only to evaluate the computation time of the two algorithms.

Moreover, LIBSVM version 2.8, developed by Chang and Lin [12], is also investigated using a single processor in this experiment. The aim is to see whether the kernel cache used in LIBSVM provides efficiency in comparison with the sequential SMO, which has no kernel cache.

The elapsed time (measured in seconds) with different numbers of processors in the sequential SMO, the parallel SMO and LIBSVM is given in Table 1, together with the number of converged support vectors (denoted as SVs) and of bounded support vectors with α_i = c (denoted as BSVs). From the table, it can be observed that the elapsed time of the parallel SMO gradually reduces as the number of processors increases: it is reduced by almost half with the use of two processors, by almost three-quarters with the use of four processors, and so on. This result demonstrates that the parallel SMO is efficient in reducing the training time of the SVM. Moreover, the parallel SMO using one CPU processor takes slightly more time than the sequential SMO, due to the overhead of the MPI routines. The table also shows that LIBSVM running on a single processor requires less time than the sequential SMO, which demonstrates that kernel caching is effective in reducing the computation time of the kernel evaluations.

For evaluating the performance of the parallel SMO, the following two criteria are used: speedup and efficiency. They are respectively defined by

speedup = (elapsed time of the sequential SMO) / (elapsed time of the parallel SMO)    (9)

efficiency = speedup / (number of processors)    (10)

The speedup of the parallel SMO with respect to different numbers of processors is illustrated in Fig. 1. The figure shows that up to 16 processors the parallel SMO scales almost linearly with the number of processors. After that, the scalability of the parallel SMO is slightly reduced.
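As a purely illustrative reading of definitions (9) and (10) (the numbers here are hypothetical, not taken from Table 1): if the sequential SMO needed 2,000 s and the parallel SMO on 4 processors needed 520 s, then

speedup = 2000 / 520 ≈ 3.85,    efficiency = 3.85 / 4 ≈ 0.96,

that is, each of the four processors is kept usefully busy about 96% of the time relative to perfect linear scaling.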

The maximum value of the speedup is more than 21, corresponding to the use of 32 processors. This means that the training time of the parallel SMO running on 32 processors is only about 1/21 of that of the sequential SMO, which is very good.

The efficiency of the parallel SMO with different numbers of processors is illustrated in Fig. 2. As shown in the figure, the efficiency of the parallel SMO is 0.9788 when two processors are used, and it gradually decreases as the number of processors increases. The reason may lie in the fact that the use of more processors leads to more communication time, thus reducing the efficiency of the parallel SMO.

For a better understanding of the cost of the various subparts of the parallel SMO, the computation time of its different components (I/O; initialization; optimizing α_{I_up} and α_{I_low}; updating f_i and calculating b_up, I_up, b_low, I_low and DualityGap; and obtaining the global b_up, I_up, b_low, I_low and DualityGap) is reported in Table 2. The time for updating f_i and calculating b_up, I_up, b_low, I_low and DualityGap is called the parallel time, as the involved calculations are done in parallel. The time for obtaining the global b_up, I_up, b_low, I_low and DualityGap is called the communication time, as many processors are involved in this calculation. The table shows that the time for I/O, initialization, and optimizing α_{I_up} and α_{I_low} is small and almost independent of the number of processors, while a large amount of time is spent in the parallel time, which means that the updating of f_i and the calculating of b_up, I_up, b_low, I_low and DualityGap had better be performed in parallel using multiple processors. As expected, the parallel time decreases as the number of processors increases. In contrast, the communication time slightly increases as the number of processors increases.

This exactly explains why the efficiency of the parallel SMO decreases as the number of processors increases.

B. Web Data Set

The web data set is examined in the second experiment [10]. The problem is to classify whether a web page belongs to a certain category or not. There are a total of 24,692 data patterns in the training data set, with each data pattern composed of 300 sparse binary keyword attributes extracted from the web page. For this data set, the Gaussian function is still used as the kernel function of the sequential SMO and the parallel SMO. The values of the Gaussian variance σ and of c are respectively set to 0.064 and 64.

The elapsed time with different numbers of processors used in the sequential SMO, the parallel SMO and LIBSVM is given in Table 3, together with the total number of support vectors and bounded support vectors. As on the adult data set, the elapsed time of the parallel SMO gradually reduces as the number of processors increases, by almost half using two processors and almost three-quarters using four processors, and so on. The parallel SMO using one CPU processor again takes slightly more time than the sequential SMO, due to the use of the MPI routines. LIBSVM requires less time than the sequential SMO, due to the use of the kernel cache.

Based on the obtained results, the speedup and the efficiency of the parallel SMO are calculated and respectively illustrated in Fig. 3 and Fig. 4. Fig. 3 shows that the speedup of the parallel SMO increases with the number of processors (up to 30 processors), demonstrating the efficiency of the parallel SMO. For this data set, the maximum value of the speedup is more than 10, corresponding to the use of 30 processors.

As illustrated in Fig. 4, the efficiency of the parallel SMO decreases as the number of processors increases, due to the increase of the communication time. The computation time of the different components of the parallel SMO is reported in Table 4. The same conclusions are reached as on the adult data set: the time for I/O, initialization, and optimizing α_{I_up} and α_{I_low} is small and almost independent of the number of processors, and as the number of processors increases the parallel time decreases while the communication time slightly increases. In terms of speedup and efficiency, the result on the web data set is not as good as that on the adult data set. This can be explained by the fact that the ratio of the parallel time to the communication time on the web data set is much smaller than that on the adult data set, as illustrated in Table 2 and Table 4. This also means that the advantage of using the parallel SMO is more obvious for large size problems.

C. MNIST Data Set

The MNIST handwritten digit data set is also examined in the experiments. This data set consists of 60,000 training samples and 10,000 testing samples; each sample is composed of 576 features. The data set is available at http://www.cenparmi.concordia.ca/~people/jdong/herosvm/ and has also been used in Dong et al.'s work on speeding up the sequential SMO [18]. The MNIST data set is actually a ten-class classification problem. According to the one-against-the-rest method, ten SVM classifiers are constructed by separating one class from the rest. In our experiments, the Gaussian kernel is used in the sequential SMO and the parallel SMO for each of the ten SVM classifiers. The values of σ and c are respectively set to 0.6 and 10, the same as those used in [14].

The elapsed time with different numbers of processors in the sequential SMO, the parallel SMO and LIBSVM for each of the ten SVM classifiers is given in Table 5; the averaged value of the elapsed time over the ten SVM classifiers is also listed in this table. The number of converged support vectors and bounded support vectors is given in Table 6. Table 5 shows that there is still a benefit in using the kernel cache of LIBSVM in comparison with the sequential SMO. Fig. 5 and Fig. 6 respectively illustrate the speedup and the efficiency of the parallel SMO. Fig. 5 shows that the speedup of the parallel SMO increases with the number of processors. The maximum values of the speedup in the ten SVM classifiers range from 17.1 to 22.8. The averaged maximum value of the speedup is equal to 21.7, corresponding to the use of 30 processors. Fig. 6 shows that the efficiency of the parallel SMO decreases as the number of processors increases, due to the use of more communication time.

V. CONCLUSIONS

This paper proposes a parallel implementation of SMO using MPI. The parallel SMO uses multiple CPU processors to deal with the computation of SMO. By partitioning the entire training data set into smaller subsets and distributing each of the partitioned subsets to one CPU processor, the parallel SMO updates the f_i array and calculates b_up, b_low and DualityGap at each step in parallel using multiple CPU processors. This parallel mode is called the SPMD model in MPI. Experiments on three large data sets demonstrate the efficiency of the parallel SMO. The experiments also show that the efficiency of the parallel SMO decreases as the number of processors increases, as there is more communication time with the use of more processors.

For this reason, the parallel SMO is more useful for large size problems. The experiments also show that LIBSVM, which likewise uses a working set of size two, is more efficient than the sequential SMO. This can be explained by the fact that LIBSVM uses a kernel cache, while the sequential and parallel SMO do not take it into account. Future work will exploit the kernel cache for further improving the current version of the parallel SMO.

In the current version of the parallel SMO, the multi-class classification problem is handled by considering one class at a time. In future work, it is worth performing the multi-class classification in parallel by considering all the classes simultaneously, for further improving the efficiency of the parallel SMO. Such an approach requires a structured way of handling the communication between processors. This work is very useful for research settings where a machine with multiple CPU processors is available. Future work also needs to extend the parallel SMO from classification to regression estimation by implementing the same methodology for the SVM regressor.

References:
[1] V.N. Vapnik, The Nature of Statistical Learning Theory, New York: Springer-Verlag, 1995.
[2] C.J.C. Burges, "A tutorial on support vector machines for pattern recognition," Knowledge Discovery and Data Mining, Vol. 2, No. 2, pp. 955-974, 1998.
[3] L.J. Cao and F.E.H. Tay, "Support vector machines with adaptive parameters in financial time series forecasting," IEEE Transactions on Neural Networks, 14(6), pp. 1506-1518, 2003.
[4] S. Gutta, R.J. Jeffrey, P. Jonathon and H. Wechsler, "Mixture of experts for classification of gender, ethnic origin, and pose of human faces," IEEE Transactions on Neural Networks, 11(4), pp. 948-960, July 2000.
[5] K. Ikeda, "Effects of kernel function on Nu support vector machines in extreme cases," IEEE Transactions on Neural Networks, 17(1), pp. 1-9, Jan. 2006.
[6] V.N. Vapnik, Estimation of Dependence Based on Empirical Data, New York: Springer-Verlag, 1982.
[7] E. Osuna, R. Freund and F. Girosi, "An improved algorithm for support vector machines," NNSP'97: Proc. of the IEEE Signal Processing Society Workshop, Amelia Island, USA, pp. 276-285, 1997.
[8] T. Joachims, "Making large-scale support vector machine learning practical," in Advances in Kernel Methods: Support Vector Machines, B. Schölkopf, C. Burges and A. Smola (eds.), MIT Press, Cambridge, MA, December 1998.
[9] J.C. Platt, "Fast training of support vector machines using sequential minimal optimization," in Advances in Kernel Methods - Support Vector Learning, B. Schölkopf, C.J.C. Burges and A.J. Smola (eds.), pp. 185-208, MIT Press, 1999.

[10] S.S. Keerthi, S.K. Shevade, C. Bhattacharyya and K.R.K. Murthy, "Improvements to Platt's SMO algorithm for SVM classifier design," Neural Computation, Vol. 13, pp. 637-649, 2001.
[11] S.K. Shevade, S.S. Keerthi, C. Bhattacharyya and K.R.K. Murthy, "Improvements to the SMO algorithm for SVM regression," IEEE Transactions on Neural Networks, 11(5), pp. 1188-1193, Sept. 2000.
[12] C.C. Chang and C.J. Lin, LIBSVM: a library for support vector machines, available at http://www.csie.ntu.edu.tw/~cjlin/libsvm/.
[13] R. Collobert, S. Bengio and Y. Bengio, "A parallel mixture of SVMs for very large scale problems," Neural Computation, Vol. 14, No. 5, pp. 1105-1114, 2002.
[14] J.X. Dong, A. Krzyzak and C.Y. Suen, "A fast parallel optimization for training support vector machine," Proc. of the 3rd Int. Conf. on Machine Learning and Data Mining, P. Perner and A. Rosenfeld (eds.), Springer Lecture Notes in Artificial Intelligence (LNAI 2734), pp. 96-105, Leipzig, Germany, July 5-7, 2003.
[15] G. Zanghirati and L. Zanni, "A parallel solver for large quadratic programs in training support vector machines," Parallel Computing, Vol. 29, No. 4, pp. 535-551, 2003.
[16] B.H. Guang, K.Z. Mao, C.K. Siew and D.S. Huang, "Fast modular network implementation for support vector machines," IEEE Transactions on Neural Networks, Vol. 16, No. 6, pp. 1651-1663, Nov. 2005.
[17] P.S. Pacheco, Parallel Programming with MPI, San Francisco, CA: Morgan Kaufmann Publishers, 1997.
[18] J.X. Dong, A. Krzyzak and C.Y. Suen, "A fast SVM training algorithm," accepted by Pattern Recognition and Artificial Intelligence, 2002.

Appendix A: Pseudo-code for the parallel SMO

(Note: if a processor rank appears before a block of code, only the processor associated with that rank executes the code; otherwise, all the processors execute the code.)

n_sample = total number of training samples
p = total number of processors
local_nsample = n_sample / p

Procedure takeStep()
  if (i_up == i_low && Z1 == Z2) return 0;
  s = y1*y2;
  if (y1 == y2) gamma = alph1 + alph2; else gamma = alph1 - alph2;
  if (s == 1) {
    /* the bounds are the same for y2 == 1 and y2 == -1 */
    L = MAX(0, gamma - C);  H = MIN(C, gamma);
  } else {
    L = MAX(0, -gamma);  H = MIN(C, C - gamma);
  }
  if (H <= L) return 0;
  K11 = kernel(X1, X1);  K22 = kernel(X2, X2);  K12 = kernel(X1, X2);
  eta = 2*K12 - K11 - K22;
  if (eta < -EPS*(K11 + K22)) {
    a2 = alph2 - y2*(F1 - F2)/eta;
    if (a2 < L) a2 = L; else if (a2 > H) a2 = H;
  } else {
    /* degenerate case: the objective is linear in a2, move to the better bound */
    slope = y2*(F1 - F2);
    change = slope*(H - L);
    if (fabs(change) > 0) {
      if (slope > 0) a2 = H; else a2 = L;
    } else
      a2 = alph2;
  }
  /* snap a2 to the exact bounds (same rule for y2 == 1 and y2 == -1) */
  if (a2 > C - EPS*C) a2 = C; else if (a2 < EPS*C) a2 = 0;
  if (fabs(a2 - alph2) < EPS*(a2 + alph2 + EPS)) return 0;
  if (s == 1) a1 = gamma - a2; else a1 = gamma + a2;
  /* snap a1 to the exact bounds (same rule for y1 == 1 and y1 == -1) */
  if (a1 > C - EPS*C) a1 = C; else if (a1 < EPS*C) a1 = 0;
  update the value of Dual
  return 1
Endprocedure

Procedure ComputeDualityGap()
  DualityGap = 0;
  loop over the local_nsample training samples i
    if (y[i] == 1) DualityGap += C*MAX(0, (b - fcache[i]));
    else DualityGap += C*MAX(0, (-b + fcache[i]));
  loop over the training samples i in I_0, I_2 and I_3
    DualityGap += alpha[i]*y[i]*fcache[i];
  return DualityGap;
Endprocedure

Procedure Main()
  processor 0:
    read the first block of local_nsample training data patterns from the data file and save them into the matrix X
    for k = 1 to p-1
      read the k-th block of local_nsample training data patterns from the data file and send them to processor k
    end
  processors 1 to p-1:
    receive local_nsample training data patterns from processor 0 and save them into the matrix X
  (all the processors)
    initialize the alpha array to all zero (for the local_nsample training data patterns)
    initialize the fcache array to the negative of the y array (for the local_nsample training data patterns)
    store the indices of the positive class in I_1 and of the negative class in I_4 (for the local_nsample training data patterns)
    set b to zero
    initialize the value of Dual to zero
    DualityGap = ComputeDualityGap()   (for the local_nsample training data patterns)
    sum the DualityGap of each processor and broadcast it to every processor
    compute (b_up, i_up) and (b_low, i_low) using the index sets and the fcache array (for the local_nsample training data patterns)
    compute the global b_up and global b_low using the local b_up and local b_low of each processor
    find the processor Z1 containing the global b_up
    find the processor Z2 containing the global b_low
  processor Z1:
    alph1 = alpha[i_up]; y1 = y[i_up]; F1 = fcache[i_up]; X1 = X[i_up];
    broadcast alph1, y1, F1 and X1 to every processor
  processor Z2:
    alph2 = alpha[i_low]; y2 = y[i_low]; F2 = fcache[i_low]; X2 = X[i_low];
    broadcast alph2, y2, F2 and X2 to every processor
  numChanged = 1;
  while (DualityGap > tol*abs(Dual) && numChanged != 0) {
    processor 0: numChanged = takeStep();
    broadcast numChanged to every processor
    if (numChanged == 1) {
      processor 0: broadcast a1, a2 and Dual to every processor
      processor Z1:
        alpha[i_up] = a1;
        if (y1 == 1) { if (a1 == C) move i_up to I_3; else if (a1 == 0) move i_up to I_1; else move i_up to I_0; }
        else         { if (a1 == C) move i_up to I_2; else if (a1 == 0) move i_up to I_4; else move i_up to I_0; }
      processor Z2:
        alpha[i_low] = a2;
        if (y2 == 1) { if (a2 == C) move i_low to I_3; else if (a2 == 0) move i_low to I_1; else move i_low to I_0; }
        else         { if (a2 == C) move i_low to I_2; else if (a2 == 0) move i_low to I_4; else move i_low to I_0; }
      (all the processors)
        update fcache[i] for the indices i in its index sets using the new Lagrange multipliers (for the local_nsample training data patterns)
        compute (b_up, i_up) and (b_low, i_low) using the index sets and the fcache array (for the local_nsample training data patterns)
        compute the global b_up and global b_low using the local b_up and local b_low of each processor
        find the processor Z1 containing the global b_up
        find the processor Z2 containing the global b_low
      processor Z1:
        alph1 = alpha[i_up]; y1 = y[i_up]; F1 = fcache[i_up]; X1 = X[i_up];
        broadcast alph1, y1, F1 and X1 to every processor
      processor Z2:
        alph2 = alpha[i_low]; y2 = y[i_low]; F2 = fcache[i_low]; X2 = X[i_low];
        broadcast alph2, y2, F2 and X2 to every processor
      b = (b_up + b_low)/2
      DualityGap = ComputeDualityGap()
      sum the DualityGap of each processor and broadcast it to every processor
    }
  }   (end of while loop)
  b = (b_up + b_low)/2
  DualityGap = ComputeDualityGap()
  sum the DualityGap of each processor and broadcast it to every processor
  Primal = Dual + DualityGap
Endprocedure

Fig. 1. The speedup of the parallel SMO on the adult data set.

Fig. 2. The efficiency of the parallel SMO on the adult data set.

Fig. 3. The speedup of the parallel SMO on the web data set.

Fig. 4. The efficiency of the parallel SMO on the web data set.

Fig. 5. The speedup of the parallel SMO on the MNIST data set.

Fig. 6. The efficiency of the parallel SMO on the MNIST data set.

TABLE I
THE ELAPSED TIME (SECONDS) OF THE SEQUENTIAL SMO, THE PARALLEL SMO AND LIBSVM ON THE ADULT DATA SET.

           LIBSVM    Sequential   Parallel SMO
                     SMO          1P        2P        4P       8P       16P      32P
Time (s)   113.06    010.1        048.06    106.81    51.9     75.80    145.05   93.79
SVs        8563      10591        10763     10683     1085     10853    10948    110
BSVs       7649      8631         903       897       8953     9013     9038     915

TABLE II
THE COMPUTATION TIME OF THE DIFFERENT COMPONENTS OF THE PARALLEL SMO ON THE ADULT DATA SET.

Components                                        Number of processors
                                                  1P      2P      4P      8P      16P     32P
I/O                                               1       1       1       1       1       1
Initialization                                    0       0       0       0       0       0
Optimizing a_{I_up}, a_{I_low}                    0       0       0       0       0       0
Obtaining b_up, I_up, b_low, I_low, DualityGap    0               6       8       8       18
Updating F, calculating b_up, I_up, b_low,
  I_low, DualityGap                               041     1017    507     61      19      66

TABLE III
THE ELAPSED TIME (SECONDS) OF THE SEQUENTIAL SMO, THE PARALLEL SMO AND LIBSVM ON THE WEB DATA SET.

           LIBSVM    Sequential   Parallel SMO
                     SMO          1P        2P       4P      8P      16P     30P
Time (s)   104.7     17.75        191.33    95.70    5.4     31.59   3.11    16.0
SVs        58        67           703       71       76      75      805     817
BSVs       493       658          685       687      694     703     718     74

TABLE IV
THE COMPUTATION TIME OF THE DIFFERENT COMPONENTS OF THE PARALLEL SMO ON THE WEB DATA SET.

Components                                        Number of processors
                                                  1P      2P      4P      8P      16P     30P
I/O
Initialization                                    0       0       0       0       0       0
Optimizing a_{I_up}, a_{I_low}                    0       0       0       0       0       0
Obtaining b_up, I_up, b_low, I_low, DualityGap    0               1       1       3       3
Updating F, calculating b_up, I_up, b_low,
  I_low, DualityGap                               183     87      43      0       9       5

TABLE V
THE ELAPSED TIME (SECONDS) OF THE SEQUENTIAL SMO, THE PARALLEL SMO AND LIBSVM ON THE MNIST DATA SET.

Class      LIBSVM      Sequential   Parallel SMO
                       SMO          1P          2P         4P         8P        16P       30P
0          931.668     3597.97      3948.83     186.49     1006.46    483.51    83.19     10.10
1          753.418     3717.91      336.05      1845.33    895.45     46.50     66.70     196.09
2          5160.93     5644.19      5595.01     781.18     130.7      656.56    37.7      48.3
3          5737.956    601.50       5404.18     749.00     1330.94    703.06    399.      71.97
4          5145.859    6044.60      6143.85     771.65     1544.05    719.86    400.7     74.08
5          485.64      5568.70      559.6       551.38     1408.74    655.09    378.6     67.57
6          3448.498    43.65        46.76       099.81     973.81     491.43    94.33     194.78
7          541.564     5788.88      5796.86     314.36     1467.97    731.57    41.99     9.19
8          6565.783    7183.05      743.13      331.7      1800.8     8.35      468.53    314.70
9          764.706     8033.80      7960.56     3645.48    1844.40    93.33     554.03    353.78
Averaged   4963.403    5583.35      5517.485    675.4      1357.437   665.86    383.105   6.358

TABLE VI
THE NUMBER OF CONVERGED SUPPORT VECTORS (#SVs) AND BOUNDED SUPPORT VECTORS (#BSVs) IN THE SEQUENTIAL SMO, THE PARALLEL SMO AND LIBSVM ON THE MNIST DATA SET.

Columns: Class; #SVs and #BSVs for LIBSVM; #SVs and #BSVs for the sequential SMO; #SVs and #BSVs for the parallel SMO with 1P, 2P, 4P, 8P, 16P and 30P.

0 1871 1807 1 198 186 108 010 3 976 1989 4 16 1934 5 106 13 6 483 199 7 65 008 8 146 008 9 317 011 Averaged 48 004 01 1865 104 198 811 084 309 607 374 463 356 093 551 36 807 313 813 10 887 11 57 180 130 011 167 07 338 97 403 046 317 1 30 143 891 179 985 187 3035 163 3003 18 630 154 048 199 140 1968 547 09 3316 460 384 7 799 191 650 165 75 178 741 159 879 173 67 170 060 1946 170 1981 571 13 3073 594 674 19 68 615 718 018 805 04 85 05 896 13 686 04 073 1947 179 1934 479 163 3119 108 565 104 318 055 716 15 803 14 879 0 98 5 744 119 074 1958 131 1988 597 4 303 331 614 01 997 544 3036 117 307 16 3065 03 301 16 719 175 095 1996 190 039 314 409 3109 3 863 100 3190 651 3179 110 387 11 34 59 3384 55 788

TABLE V THE NUMBER OF CONVERGED SUPPORT VECTORS AND BOUNDED SUPPORT VECTORS IN THE SEQUENTIAL SMO AND THE PARALLEL SMO AND LIBSVM ON THE MNIST DATA SET. Class LIBSVM #SVs #BSVs Sequental SMO Parallel SMO 1P P 4P 8P 16P 30P #SVs #BSVs #SVs #BSVs #SVs #BSVs #SVs #BSVs #SVs #BSVs #SVs #BSVs #SVs #BSVs 0 1871 1807 1 198 186 108 010 3 976 1989 4 16 1934 5 106 13 6 483 199 7 65 008 8 146 008 9 317 011 Averaged 48 004 01 1865 104 198 811 084 309 607 374 463 356 093 551 36 807 313 813 10 887 11 57 180 130 011 167 07 338 97 403 046 317 1 30 143 891 179 985 187 3035 163 3003 18 630 154 048 199 140 1968 547 09 3316 460 384 7 799 191 650 165 75 178 741 159 879 173 67 170 060 1946 170 1981 571 13 3073 594 674 19 68 615 718 018 805 04 85 05 896 13 686 04 073 1947 179 1934 479 163 3119 108 565 104 318 055 716 15 803 14 879 0 98 5 744 119 074 1958 131 1988 597 4 303 331 614 01 997 544 3036 117 307 16 3065 03 301 16 719 175 095 1996 190 039 314 409 3109 3 863 100 3190 651 3179 110 387 11 34 59 3384 55 788 6 3