Evolutionary Support Vector Regression based on Multi-Scale Radial Basis Function Kernel

Tanasanee Phienthrakul and Boonserm Kijsirikul

This work was supported by the Thailand Research Fund and the Royal Golden Jubilee Ph.D. Program. T. Phienthrakul is with the Department of Computer Engineering, Chulalongkorn University, Bangkok 10330, Thailand (phone: 66--8-6956; fax: 66--8-6955; e-mail: tanasanee@yahoo.com). B. Kijsirikul, Dr., is with the Department of Computer Engineering, Faculty of Engineering, Chulalongkorn University, Bangkok 10330, Thailand (e-mail: boonserm.k@chula.ac.th).

Abstract

Kernel functions are used in support vector regression (SVR) to compute the inner product in a higher dimensional feature space. The performance of the approximation depends on the chosen kernel. The radial basis function (RBF) kernel is a Mercer's kernel that has been widely used in many problems. However, it still has restrictions in some complex problems. In order to obtain a more flexible kernel function, multi-scale RBF kernels are combined by a non-negative weighting linear combination. This proposed kernel is proved to be a Mercer's kernel. Then, an evolutionary strategy (ES) is applied for adjusting the parameters of the SVR and the kernel function. Moreover, subsets cross-validation is used for evaluating these parameters. The optimum values of these parameters are searched by a (5+10)-ES. The experimental results show the ability of the proposed method, which outperforms the statistical techniques.

I. INTRODUCTION

Support vector machines (SVMs) are learning algorithms proposed by Vapnik et al. [1], based on the idea of the empirical risk minimization principle. They have been widely used in many applications such as pattern recognition and function approximation. Basically, an SVM constructs a linear hyperplane in an augmented space by means of some defined kernels satisfying Mercer's condition [1], [2], [3]. These kernels map the input vectors into a very high dimensional space, possibly of infinite dimension, where a linear hyperplane is more likely to exist [3]. There are many types of kernel functions, such as linear kernels, polynomial kernels, sigmoid kernels, and RBF kernels. Each kernel function is suitable for some tasks, and it must be chosen for the task under consideration by hand or using prior knowledge [4].

The RBF kernel is among the most successful kernels in many problems, but it still has restrictions in some complex problems. Therefore, we propose to improve the efficiency of approximation by using a combination of RBF kernels at different scales. These kernels are combined with including weights. The weights, the widths of the RBF kernels, the deviation of the approximation, and the regularization parameter of the SVM are the adjustable parameters; these parameters are called hyperparameters. In general, these hyperparameters are determined by a grid search: the hyperparameters are varied with a fixed step-size over a range of values, but this kind of search consumes a lot of time. Hence, we propose to use evolutionary strategies (ESs) for choosing these hyperparameters. However, the objective function is an important part of evolutionary algorithms, and there are many ways to measure the fitness of the hyperparameters. In this work, we propose to use subsets cross-validation for evaluating the hyperparameters in the evolutionary process.

We give a short description of support vector regression in Section II. In Section III and Section IV we propose the multi-scale RBF kernel and apply evolutionary strategies to determine the appropriate hyperparameters, respectively. The proposed kernels with the help of ES are tested in Section V. Finally, the conclusions are given in Section VI.
II. SUPPORT VECTOR REGRESSION

The support vector machine is a learning algorithm that can be divided into support vector classification and support vector regression. Support vector regression (SVR) is a powerful method that is able to approximate a real-valued function in terms of a small subset (called support vectors) of the training examples.

Suppose we are given training data $\{(x_1, y_1), \ldots, (x_l, y_l)\} \subset X \times \mathbb{R}$, where $X$ denotes the space of input patterns. In ε-SV regression, our goal is to find a function f(x) that has at most ε deviation from the actually obtained targets $y_i$ for all the training data, and at the same time is as flat as possible. In other words, we do not care about errors as long as they are less than ε, but we do not accept any deviation larger than this [5].

We begin by describing the case of linear functions f, taking the form

\[ f(x) = \langle w, x \rangle + b \quad \text{with } w \in X,\ b \in \mathbb{R}. \tag{1} \]

Flatness in this case means that one seeks a small w. One way to ensure this is to minimize the norm, i.e. $\|w\|^2 = \langle w, w \rangle$. We can write this problem as a convex optimization problem:
\[
\begin{aligned}
\text{minimize}\quad & \tfrac{1}{2}\|w\|^2 \\
\text{subject to}\quad & y_i - \langle w, x_i \rangle - b \le \varepsilon \\
& \langle w, x_i \rangle + b - y_i \le \varepsilon
\end{aligned} \tag{2}
\]

Sometimes we may want to allow for some errors. The soft margin loss function was adapted to SV machines; one can introduce slack variables $\xi_i, \xi_i^*$ to cope with otherwise infeasible constraints of the optimization problem [5]. Hence we get

\[
\begin{aligned}
\text{minimize}\quad & \tfrac{1}{2}\|w\|^2 + C \sum_{i=1}^{l} (\xi_i + \xi_i^*) \\
\text{subject to}\quad & y_i - \langle w, x_i \rangle - b \le \varepsilon + \xi_i \\
& \langle w, x_i \rangle + b - y_i \le \varepsilon + \xi_i^* \\
& \xi_i, \xi_i^* \ge 0
\end{aligned} \tag{3}
\]

The constant C > 0 determines the trade-off between the flatness of f and the amount up to which deviations larger than ε are tolerated. In most cases this optimization problem can be solved more easily in its dual formulation [5]. An example of a linear SVR is shown in Fig. 1.

Fig. 1. An example of the soft margin for a linear SVR.

Moreover, seeking a proper linear hyperplane in an input space has restrictions. There is an important technique that enables these machines to produce complex nonlinear approximations from the original space. It works by mapping the input space into a higher dimensional feature space through a mapping function Φ [6]. This can be achieved by substituting $\Phi(x_i)$ for each training example $x_i$. However, a good property of the SVM is that it is not necessary to know the explicit form of Φ. Only the inner product in the feature space, called the kernel function $K(x, y) = \Phi(x) \cdot \Phi(y)$, must be defined. A function that can serve as a kernel must satisfy Mercer's condition [7]. Some of the common kernels are shown in Table I. Each kernel corresponds to some feature space, and because no explicit mapping to this feature space occurs, optimal linear separators can be found efficiently in feature spaces with millions of dimensions [8].

TABLE I
COMMON KERNEL FUNCTIONS

Kernel            Formula
Linear            K(x, y) = x · y
Polynomial        K(x, y) = (1 + x · y)^d
Sigmoid           K(x, y) = tanh(α x · y + β)
Exponential RBF   K(x, y) = exp(−γ ‖x − y‖)
Gaussian RBF      K(x, y) = exp(−γ ‖x − y‖²)
Multi-quadratic   K(x, y) = (‖x − y‖² + c²)^(1/2)

III. MULTI-SCALE RBF KERNEL

The Gaussian RBF kernel is widely used in many problems. It uses the Euclidean distance between two points in the original space to find the correlation in the augmented space [3]. This correlation is rather smooth, and there is only one parameter for adjusting the width of the RBF, which is not powerful enough for some complex problems. In order to get a better kernel, a combination of RBF kernels at different scales is proposed. The analytic expression of this kernel is the following:

\[ K(x, y) = \sum_{i=1}^{n} a_i K(x, y; \gamma_i), \tag{4} \]

where n is a positive integer, $a_i$ for i = 1, ..., n are arbitrary non-negative weighting constants, and $K(x, y; \gamma_i) = \exp(-\gamma_i \|x - y\|^2)$ is the RBF kernel at the width $\gamma_i$, for i = 1, ..., n.

In general, the function that maps the input space into the augmented feature space is unknown. However, the existence of such a function is assured by Mercer's theorem [1], [4]. Mercer's theorem is shown in Fig. 2.

Fig. 2. Mercer's theorem.
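For concreteness, the kernel in (4) can be computed as a Gram matrix and passed to any SVR solver that accepts precomputed kernels. The following sketch uses scikit-learn for illustration; the function name multi_scale_rbf, the toy data, and the example weights and widths are our own choices, not values from the paper.

    import numpy as np
    from sklearn.svm import SVR
    from sklearn.metrics.pairwise import euclidean_distances

    def multi_scale_rbf(X, Y, weights, widths):
        """Multi-scale RBF kernel of (4): a non-negative weighted sum of
        Gaussian RBF kernels evaluated at several widths gamma_i."""
        sq = euclidean_distances(X, Y, squared=True)   # ||x - y||^2
        K = np.zeros_like(sq)
        for a_i, g_i in zip(weights, widths):
            K += a_i * np.exp(-g_i * sq)               # each term is a Mercer kernel
        return K

    # Illustrative usage with SVR on a precomputed Gram matrix
    rng = np.random.default_rng(0)
    X_train = rng.uniform(0.0, 1.0, size=(40, 1))
    y_train = np.sin(6.0 * X_train).ravel()            # toy target, not from the paper

    weights, widths = [1.0, 0.5], [1.0, 20.0]          # example values of a_i, gamma_i
    gram = multi_scale_rbf(X_train, X_train, weights, widths)
    model = SVR(kernel="precomputed", C=1.0, epsilon=0.1).fit(gram, y_train)

    X_test = rng.uniform(0.0, 1.0, size=(5, 1))
    y_pred = model.predict(multi_scale_rbf(X_test, X_train, weights, widths))

Because each term is itself a Mercer kernel and the weights are non-negative, the resulting Gram matrix stays positive semi-definite, which is exactly the property the next section proves.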
The proposed kernel function can be proved to be an admissible kernel by Mercer's theorem. The proving process is shown in Fig. 3. The RBF is a well-known Mercer's kernel. Therefore, the non-negative linear combination of RBFs in (4) can be proved to be a Mercer's kernel.

Fig. 3. The proof of the proposed kernel.

As shown in (4), there are 2n parameters when n terms of RBF kernels are used (n parameters for adjusting the weights and n values of the widths of the RBFs). However, we notice that the number of parameters can be reduced to 2n − 1 by fixing the value of the first weight to 1. The multi-scale RBF kernel then becomes

\[ K(x, y) = K(x, y; \gamma_0) + \sum_{i=1}^{n-1} a_i K(x, y; \gamma_i). \tag{5} \]

This form of the multi-scale RBF kernel will be used in the rest of this paper.

IV. EVOLUTIONARY STRATEGIES FOR SVR BASED ON MULTI-SCALE RBF KERNEL

Evolutionary strategies (ES) [9] are based on the principles of adaptive selection found in the natural world. ES has been successfully used to solve various types of optimization problems [10]. Each generation (iteration) of the ES algorithm takes a population of individuals (potential solutions) and modifies the problem parameters to produce offspring (new solutions) [11]. Only the fittest individuals (better solutions) survive to produce new generations [11].

In order to obtain appropriate values of the hyperparameters, ES is considered. There are several different versions of ES. Nevertheless, we prefer to use the (µ+λ)-ES, in which µ parents produce λ offspring and both parents and offspring compete equally for survival [11]. The (5+10)-ES is applied to adjust these hyperparameters; the algorithm is shown in Fig. 4.

    t = 0;
    initialization(v_1, ..., v_5, σ);
    evaluate f(v_1), ..., f(v_5);
    while (t < 500) do
        for i = 1 to 10 do
            v'_i = recombination(v_1, ..., v_5);
            v'_i = mutate(v'_i);
            evaluate f(v'_i);
        end
        (v_1, ..., v_5) = select(v_1, ..., v_5, v'_1, ..., v'_10);
        σ = mutate_σ(σ);
        t = t + 1;
    end

Fig. 4. The (5+10)-ES algorithm.

This algorithm uses 5 solutions to produce 10 new solutions by a recombination method. These new solutions are mutated and evaluated, but only the 5 fittest solutions are selected from the 5+10 solutions to be the parents of the next generation. These processes are repeated until a fixed number of generations has been produced or the acceptance criterion is reached.

A. Initialization

Let v be the non-negative real-valued vector of hyperparameters, with 2n + 2 dimensions. The vector v is represented in the form

\[ v = (C, \varepsilon, n, \gamma_0, a_1, \gamma_1, a_2, \gamma_2, \ldots, a_{n-1}, \gamma_{n-1}), \tag{6} \]

where C is the regularization parameter, ε is the deviation of the approximation, n is the number of RBF terms, $\gamma_i$ for i = 0, ..., n−1 are the widths of the RBFs, and $a_i$ for i = 1, ..., n−1 are the weights of the RBFs.

The (5+10)-ES algorithm starts with the 0th generation (t = 0), in which 5 solutions $v_1, \ldots, v_5$ and a standard deviation vector $\sigma \in \mathbb{R}_+^{2n+2}$ are chosen by randomization. These 5 initial solutions are evaluated to calculate their fitness. Our goal is to find the v that optimizes the objective function f(v).
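As an illustration of this representation, a minimal initialization sketch is given below; the flat-array layout and the names random_solution, N_MAX, and PARAM_MAX are our own, with the bounds taken from the ranges stated in Section V.

    import numpy as np

    rng = np.random.default_rng(0)

    N_MAX = 10         # cap on the number of RBF terms (Section V)
    PARAM_MAX = 10.0   # upper bound used here for C, epsilon, widths, and weights

    def random_solution(n):
        """Random hyperparameter vector v = (C, eps, n, gamma_0, a_1, gamma_1,
        ..., a_{n-1}, gamma_{n-1}) of (6), stored as a flat array of 2n + 2 entries."""
        v = rng.uniform(0.0, PARAM_MAX, size=2 * n + 2)
        v[2] = n                          # the third entry holds the term count
        return v

    # 0th generation of the (5+10)-ES: five parents and one step size per entry
    n = int(rng.integers(1, N_MAX + 1))
    parents = [random_solution(n) for _ in range(5)]
    sigma = rng.uniform(0.0, 1.0, size=2 * n + 2)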
B. Recombination

The recombination function creates 10 new solutions. We use the global intermediary recombination method: ten pairs of solutions are selected from the 5 parent solutions, and the element-wise average of each pair is a new solution.

\[
\begin{aligned}
v'_1 &= \tfrac{1}{2}(v_1 + v_2) \\
v'_2 &= \tfrac{1}{2}(v_1 + v_3) \\
&\;\;\vdots \\
v'_{10} &= \tfrac{1}{2}(v_4 + v_5)
\end{aligned} \tag{7}
\]

C. Mutation

The $v'_i$ for i = 1, 2, ..., 10 are mutated by adding to each of them a vector $(z_1, z_2, \ldots, z_{2n+2})$, where $z_j$ is a random value drawn from a normal distribution with zero mean and standard deviation $\sigma_j$:

\[ \text{mutate}(v'_i) = (C + z_1,\ \varepsilon + z_2,\ n + z_3,\ \gamma_0 + z_4,\ \ldots,\ a_{n-1} + z_{2n+1},\ \gamma_{n-1} + z_{2n+2}), \quad z_j \sim N(0, \sigma_j). \tag{8} \]

In each generation, the standard deviation is adjusted by

\[ \text{mutate}_\sigma(\sigma) = (\sigma_1 e^{z_1},\ \sigma_2 e^{z_2},\ \ldots,\ \sigma_{2n+2} e^{z_{2n+2}}), \quad z_j \sim N(0, \tau), \tag{9} \]

where τ is an arbitrary constant.

D. Evaluation

In general, the percentage error is used to measure the efficiency of a regression model. However, this function may overfit the training data: data sometimes contain a lot of noise, and if the model fits these noisy data, the learned concept may be wrong. Hence, we would like to validate a set of hyperparameters on many training sets; good hyperparameters should perform well on all of them. However, we have only a fixed amount of training data. Therefore, 5-subsets cross-validation is proposed to avoid the overfitting problem. At the beginning, the training data are divided into five subsets, each of which has almost the same number of data. For each generation of the ES, models with the same hyperparameters are trained and validated five times. In the j-th iteration (j = 1, 2, 3, 4, 5), the model is trained on all subsets except the j-th one. Then the error of prediction is calculated on the j-th subset. The average of these five errors is used as the objective function f(v). These partitions are displayed in Fig. 5.

Fig. 5. Partitioning the training data into 5 subsets.

V. EXPERIMENTAL RESULTS

In order to verify the performance of the proposed method, SVRs with the multi-scale RBF kernels are trained and tested on time series datasets from the Neural Forecasting Competition [12]. These datasets are monthly time series drawn from a homogeneous population of empirical business time series [12]. Since these datasets are used for a competition, we can be sure that these data are not too easy. In each dataset, the data of the prior periods are used as the training data for predicting the next period.

The evolutionary strategies are used to find the optimal hyperparameters. The value of τ in the evaluation process of these experiments is 1.0. The number of RBF terms is a positive integer that is less than or equal to 10. The widths of the RBFs ($\gamma_i$), the weights of the RBFs ($a_i$), the deviation of the approximation (ε), and the regularization parameter (C) are real numbers between 0.0 and 10.0. These hyperparameters are searched within 500 generations of the ES.

The performance of the proposed method is evaluated by the symmetric mean absolute percentage error (SMAPE) [13], which is defined as

\[ \mathrm{SMAPE} = \frac{1}{l} \sum_{i=1}^{l} \frac{|y_i - \hat{y}_i|}{(y_i + \hat{y}_i)/2} \times 100, \tag{10} \]

where $y_i$ for i = 1, ..., l are the actual targets of the training data, $\hat{y}_i$ for i = 1, ..., l are the forecast values, and l is the number of training data.
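The operators of subsections B through D and the error measure in (10) translate directly into code. The sketch below, under the same assumptions as the earlier ones (scikit-learn, a user-supplied gram_fn that builds the multi-scale RBF Gram matrix from the data and a solution vector v, and names of our own choosing), implements the recombination of (7), the mutations of (8) and (9), the SMAPE of (10), and the 5-subsets cross-validation objective f(v); for simplicity it keeps the number of RBF terms fixed, whereas the paper also mutates n.

    import numpy as np
    from itertools import combinations
    from sklearn.model_selection import KFold
    from sklearn.svm import SVR

    rng = np.random.default_rng(0)

    def recombine(parents):
        """Global intermediary recombination of (7): the 10 new solutions are
        the element-wise averages of the 10 pairs taken from the 5 parents."""
        return [(p + q) / 2.0 for p, q in combinations(parents, 2)]

    def mutate(v, sigma):
        """Mutation of (8): add zero-mean Gaussian noise z_j ~ N(0, sigma_j),
        clipped so the hyperparameters stay non-negative (our simplification)."""
        return np.clip(v + rng.normal(0.0, sigma), 0.0, None)

    def mutate_sigma(sigma, tau=1.0):
        """Step-size mutation of (9): scale each sigma_j by e^{z_j}, z_j ~ N(0, tau)."""
        return sigma * np.exp(rng.normal(0.0, tau, size=sigma.shape))

    def smape(y, y_hat):
        """SMAPE of (10), in percent; assumes positive-valued series as in NN3."""
        return np.mean(np.abs(y - y_hat) / ((y + y_hat) / 2.0)) * 100.0

    def fitness(v, X, y, gram_fn):
        """Objective f(v): average SMAPE over 5-subsets cross-validation.
        gram_fn(A, B, v) returns the kernel of (5) between the rows of A and B."""
        C, eps = max(v[0], 1e-6), v[1]          # guard: SVR requires C > 0
        errors = []
        for tr, va in KFold(n_splits=5).split(X):
            gram = gram_fn(X[tr], X[tr], v)
            model = SVR(kernel="precomputed", C=C, epsilon=eps).fit(gram, y[tr])
            y_hat = model.predict(gram_fn(X[va], X[tr], v))
            errors.append(smape(y[va], y_hat))
        return float(np.mean(errors))

    def select(parents, offspring, f):
        """(5+10) selection: keep the 5 fittest of the 15 solutions (lower f is
        better); f is a one-argument closure, e.g. built with functools.partial."""
        return sorted(parents + offspring, key=f)[:5]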
The experimental results are shown in Table II.

TABLE II
THE SMAPE VALUES ON THE TRAINING DATA OF EACH DATASET (%)

Datasets   Cumulative Mean   Moving Average (N=5)   Exponential Smoothing (α=0.8)   ES-SVR (m-RBF)
NN3_101    4.906             3.379                  4.5050                          .838
NN3_102    35.89             38.344                 0.967                           5.683
NN3_103    96.579            93.749                 56.64                           7.959
NN3_104    39.0983           40.6747                5.9709                          7.7985
NN3_105    6.53              3.3809                 .847                            .307
NN3_106    8.873             8.433                  8.87                            3.000
NN3_107    4.4867            3.7889                 3.979                           .338
NN3_108    3.348             .764                   4.5493                          7.894
NN3_109    9.330             7.8606                 5.4935                          .5
NN3_110    39.046            4.59                   30.59                           3.8448
NN3_111    9.743             0.484                  9.4045                          7.0860
Average    6.9434            5.85                   8.458                           5.3

The experimental results show, by SMAPE, the ability of the proposed method to outperform the statistical techniques, i.e. the cumulative mean, the moving average, and the exponential smoothing. Moreover, the proposed method that combines the two techniques of ES and SVR based on the multi-scale RBF kernel yields SMAPE values less than SVR with the single RBF kernel on all datasets. The average SMAPE over the datasets of the proposed method is the best when compared with the other techniques.

Examples of the approximations are illustrated as graphs in Fig. 6. In these graphs, the proposed method is compared with the cumulative mean and the moving average. The proposed method and the moving average yield results that are similar to the empirical data. However, with the help of ES, the proposed method can determine the optimal hyperparameters in a more convenient way. For the cumulative mean, the results of the approximation are not good enough; it does not perform well on these data, which have both seasonal and trend characteristics.

Fig. 6. Graphs of the predictions by the proposed method, the cumulative mean, and the moving average on the NN3_111 dataset.
VI. CONCLUSION

The non-negative linear combination of multiple RBF kernels with including weights is proposed for support vector regression. The proposed kernel is proved to be an admissible kernel by Mercer's theorem. Then, the evolutionary strategy is applied to adjust the hyperparameters of the SVR, and the optimum values of these hyperparameters are searched. Moreover, subsets cross-validation on the error of prediction is used as the objective function in the evolutionary process to avoid the overfitting problem. The experimental results show the ability of the proposed method through the symmetric mean absolute percentage error (SMAPE). When SVR uses the proposed kernel, it is able to learn from data very well. Furthermore, the evolutionary strategy is effective in optimizing the hyperparameters. Hence, the proposed method is highly suitable for complex problems where we have no prior knowledge about their hyperparameters. Besides, this non-negative linear combination can be applied to other Mercer's kernels, such as sigmoid, polynomial, or Fourier series kernels, as the general form of a linear combination of Mercer's kernels has been proved to be a Mercer's kernel already.

REFERENCES

[1] V.N. Vapnik, Statistical Learning Theory, John Wiley and Sons, Inc., New York, USA, 1998.
[2] B. Schölkopf, C. Burges, and A.J. Smola, Advances in Kernel Methods: Support Vector Learning, MIT Press, Cambridge, MA, 1998.
[3] N.E. Ayat, M. Cheriet, L. Remaki, and C.Y. Suen, "KMOD - A New Support Vector Machine Kernel with Moderate Decreasing for Pattern Recognition," Proceedings on Document Analysis and Recognition, pp. 1215-1219, Seattle, USA, 10-13 Sept. 2001.
[4] V. Kecman, Learning and Soft Computing: Support Vector Machines, Neural Networks, and Fuzzy Logic Models, MIT Press, London, 2001.
[5] A.J. Smola and B. Schölkopf, "A Tutorial on Support Vector Regression," NeuroCOLT Technical Report NC-TR-98-030, Royal Holloway College, London, 1998.
[6] B. Schölkopf and A.J. Smola, Learning with Kernels: Support Vector Machines, Regularization, Optimization, and Beyond, MIT Press, London, 2002.
[7] J.S. Taylor and N. Cristianini, Kernel Methods for Pattern Analysis, Cambridge University Press, UK, 2004.
[8] S. Russell and P. Norvig, Artificial Intelligence: A Modern Approach, Prentice-Hall, 2003.
[9] H.-G. Beyer and H.-P. Schwefel, "Evolution strategies: A comprehensive introduction," Natural Computing, Vol. 1, No. 1, pp. 3-52, 2002.
[10] D.B. Fogel, Evolutionary Computation: Toward a New Philosophy of Machine Intelligence, IEEE Press, Piscataway, NJ, 1995.
[11] E. deDoncker, A. Gupta, and G. Greenwood, "Adaptive Integration Using Evolutionary Strategies," Proceedings of the 3rd International Conference on High Performance Computing, pp. 94-99, 19-22 December 1996.
[12] BI3S-Lab, Artificial Neural Network & Computational Intelligence Forecasting Competition, Hamburg, Germany, http://www.neuralforecasting-competition.com/datasets.htm [Accessed: March 2007].
[13] R.J. Hyndman and A.B. Koehler, "Another Look at Measures of Forecast Accuracy," International Journal of Forecasting, Vol. 22, Issue 4, Netherlands, 2006.