Using Neural Networks and Support Vector Machines in Data Mining

RICHARD A. WASNIOWSKI
Computer Science Department
California State University Dominguez Hills
Carson, CA 90747
USA

Abstract: - Multivariate data analysis techniques have the potential to improve data analysis. Support Vector Machines (SVMs) are a recent addition to the family of multivariate data analysis techniques. A brief introduction to the SVM technique is followed by an outline of its practical application.

Key-Words: - Support vector machines, data analysis

1 Introduction
A common problem in various areas of science is that of classification. One approach is to develop an algorithm that finds complex patterns in input example data (labeled training data) to learn the solution to the problem. This is called supervised learning. Such an algorithm can map each training example onto two categories, more than two categories, or a continuous, real-valued output. A potential problem with this approach is noise in the training data: there may be no correct underlying classification function, and two similar training examples may belong to different categories. Another problem is that the resulting algorithm may misclassify unseen data because it has overfit the training data. A better goal is to optimize generalization, the ability to correctly classify unseen data. The next section shows how the Support Vector Machine learning methodology addresses these problems. The description follows that of several references in the literature [4,5,21].

2 Problem Formulation
The support vector method was developed to construct separating hyperplanes for pattern recognition problems [4,5]. In the 1990s it was generalized for constructing nonlinear separating functions and for estimating real-valued functions. Applications of SVMs include text categorization, character recognition, bioinformatics, and face detection. Support Vector Machines (SVMs) are learning machines that can perform classification and real-valued function approximation. An SVM creates functions from a set of labeled training data and operates by finding a hypersurface in the space of possible inputs.
This hypersurface attempts to split the positive examples from the negative examples. The split is chosen to have the largest distance from the hypersurface to the nearest of the positive and negative examples. Intuitively, this makes the classification correct for testing data that is near, but not identical, to the training data. In detail, during the training phase the SVM takes a data matrix as input and labels each sample as either belonging to a given class (positive) or not (negative). The SVM treats each sample in the matrix as a point in a high-dimensional feature space, where the number of attributes determines the dimensionality of the space. The SVM learning algorithm then identifies a hyperplane in this space that best separates the positive and negative training samples. The trained SVM can then be used to make predictions about a test sample's membership in the class. In brief, an SVM nonlinearly maps its n-dimensional input space into a high-dimensional feature space, and in this high-dimensional feature space a linear classifier is constructed.

3 SVM-based algorithm
The main idea of the SVM approach is to map the training data into a high-dimensional feature space in which a decision boundary is determined by constructing the optimal separating hyperplane. Computations in the feature space are avoided by using a kernel function. The formal goal is to estimate the function f: R^N -> {±1} using input-output training data (x_1, y_1), ..., (x_l, y_l) ∈ R^N × {±1} such that f will correctly classify examples (x, y), i.e. f(x) = y; here l is the number of training examples. For generalization we restrict the class of functions from which f is chosen, since simply minimizing the training error does not necessarily result in good generalization. Support vector classifiers are based on the class of hyperplanes (w·x) + b = 0, with w ∈ R^N and b ∈ R, corresponding to the decision function f(x) = sgn[(w·x) + b]. Here w is called the weight vector and b the threshold; w and b are the parameters controlling the function and must be learned from the data.
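As a minimal illustration of the decision function f(x) = sgn[(w·x) + b], the following sketch classifies points against a fixed hyperplane; the weight vector and threshold here are arbitrary illustrative values, not ones learned from data:

```python
# Sketch of the SVM decision rule f(x) = sgn[(w.x) + b].
# The weight vector w and threshold b are illustrative, not learned.

def decide(w, b, x):
    """Classify x as +1 or -1 relative to the hyperplane (w.x) + b = 0."""
    s = sum(wi * xi for wi, xi in zip(w, x)) + b
    return 1 if s >= 0 else -1

w = [2.0, -1.0]   # assumed weight vector
b = -1.0          # assumed threshold
print(decide(w, b, [2.0, 1.0]))   # w.x + b = 4 - 1 - 1 = 2  -> +1
print(decide(w, b, [0.0, 1.0]))   # w.x + b = 0 - 1 - 1 = -2 -> -1
```

Learning w and b amounts to the optimization over the training data described above; only the sign of (w·x) + b matters at prediction time.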
The unique hyperplane with maximal margin of separation between the two classes is called the optimal hyperplane. The optimization problem thus is to find the optimal hyperplane. Both the optimization problem and the final decision function depend only on dot products between input vectors. This is crucial for the successful generalization to the nonlinear case. If f(x) is
a nonlinear function of x, one possible approach is to use a neural network, which consists of a network of simple linear classifiers. Problems with this approach include the large number of parameters and the existence of local minima. The SVM approach is to map the input data into a high, possibly infinite-dimensional feature space F via a nonlinear map Φ: R^N -> F. Then the optimal hyperplane algorithm can be used in F. This high dimensionality may lead to a practical computational problem in feature space. Since the input vectors appear in the problem only inside dot products, however, we only need to use dot products in feature space. If we can find a kernel function K such that K(x_1, x_2) = Φ(x_1)·Φ(x_2), then we do not need to know Φ explicitly. Mercer's theorem tells us when a function K(x, y) is a kernel, i.e. when there exists a mapping Φ such that K(x_1, x_2) = Φ(x_1)·Φ(x_2). We can choose from known kernel functions: the polynomial kernel of degree d, the Gaussian radial basis function, or the sigmoid kernel. We propose a new SVM-based method (BSVM), which uses a dynamic programming algorithm as a kernel function. A detailed description of the experiments can be found in [19]. The results of the computational experiments show that the BSVM method outperforms the existing algorithms we tested.

4 Neural Networks based algorithm
Over the past few years, Neural Networks, one of the branches of Artificial Intelligence technology, have gained popularity in the hydrological and hydraulic engineering community, and some encouraging results have been achieved. Recently, a new tool from the Artificial Intelligence field called the Support Vector Machine (SVM) has gained popularity in the Machine Learning community. It has been applied successfully to classification tasks such as pattern recognition and OCR, and more recently also to regression and time series. Mathematically, SVMs are a range of classification and regression algorithms that have been formulated from the principles of statistical learning theory. So far, these SVMs have been benchmarked against artificial neural networks (ANNs) and have outperformed ANNs in many application areas.
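The kernel identity K(x_1, x_2) = Φ(x_1)·Φ(x_2) from Section 3 can be checked numerically for the degree-2 polynomial kernel, whose feature map is known in closed form; the sketch below uses arbitrary example vectors:

```python
import math

# Check that the degree-2 polynomial kernel K(x, y) = (x.y)^2 equals the
# dot product of the explicit feature map
#   Phi(x1, x2) = (x1^2, x2^2, sqrt(2)*x1*x2),
# so Phi never needs to be computed in practice (the "kernel trick").

def dot(a, b):
    return sum(ai * bi for ai, bi in zip(a, b))

def poly_kernel(x, y):
    """Degree-2 polynomial kernel on R^2."""
    return dot(x, y) ** 2

def phi(x):
    """Explicit feature map corresponding to poly_kernel."""
    x1, x2 = x
    return (x1 * x1, x2 * x2, math.sqrt(2) * x1 * x2)

x, y = (1.0, 2.0), (3.0, 0.5)        # arbitrary example vectors
print(poly_kernel(x, y))             # 16.0
print(dot(phi(x), phi(y)))           # 16.0 (same value, via feature space)
```

The same pattern holds for the Gaussian RBF kernel, whose feature space is infinite-dimensional, which is exactly why the dot-product-only formulation matters.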
It has been hypothesised that this is because there are fewer model parameters to optimise in the SVM approach, reducing the possibility of overfitting the training data and thus increasing the actual performance. Compared with traditional artificial neural networks, training in SVMs is very robust due to their quadratic objective functions. It is useful to explore this new technology in the river flow modelling area, with the hope that it could overcome some of the problems in ANNs and may perform much better than traditional linear models. Both SVMs and ANNs can be represented as two-layer networks (where the weights are nonlinear in the first layer and linear in the second layer). However, while ANNs generally adapt all the parameters (using gradient- or clustering-based approaches), SVMs choose the parameters for the first layer to be the training input vectors, because this minimises the VC-dimension, as indicated in Figure 1.

Figure 1 SVM structure: a two-layer network with input vector x = (x_1, ..., x_n), kernel units K(x_i, x) built on the support vectors x_1, ..., x_N, weights α_1, ..., α_N, and decision rule y = Σ α_i K(x_i, x) + b.

Mathematically, a basic function for the statistical learning process is

f(x) = Σ_{i=1}^{M} α_i φ_i(x) = w·φ(x)    (1)

where the output is a linearly weighted sum of M nonlinear transformations carried out by φ(·). The range of models represented by Equation 1 is extremely broad. The SVM is a special form of them, and its decision function is represented as

f(x) = Σ_{i=1}^{N} α_i K(x, x_i) + b    (2)

where K is the kernel function, α_i and b are parameters, N is the number of training data, x_i are vectors used in the training process, and x is the independent vector. The parameters α_i and b are derived by maximising their objective function. In an SVM, all input data are organised as vectors (i.e., one-dimensional arrays) and some of these vectors are used in the modelling process (as demonstrated in Equation 2). This is quite different from other models like ANNs and linear TF models, which are global models. In these models, model parameters are derived from the training data set and then only the derived parameters are used in future simulations.
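The decision function of Equation 2 can be sketched directly; the support vectors, weights α_i, threshold b, and the Gaussian RBF kernel below are illustrative assumptions, not values fitted to any data:

```python
import math

# Sketch of the decision function of Equation 2,
#   f(x) = sum_i alpha_i * K(x, x_i) + b,
# evaluated only on the stored support vectors. All numbers are made up
# for illustration; the RBF kernel is one of the standard choices.

def rbf_kernel(x, y, gamma=1.0):
    """Gaussian radial basis function kernel."""
    return math.exp(-gamma * sum((a - b) ** 2 for a, b in zip(x, y)))

def svm_decision(x, support_vectors, alphas, b):
    """Evaluate f(x) = sum_i alpha_i * K(x, x_i) + b."""
    return sum(a * rbf_kernel(x, sv)
               for a, sv in zip(alphas, support_vectors)) + b

support_vectors = [(0.0, 0.0), (2.0, 2.0)]  # hypothetical support vectors
alphas = [1.0, -1.0]                        # hypothetical weights
b = 0.0                                     # hypothetical threshold
print(svm_decision((0.1, 0.0), support_vectors, alphas, b))  # positive (near first SV)
```

Note that only the support vectors enter the sum; all other training vectors are discarded after training, which is what keeps the model small.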
The data for training would play no part in the prediction process. The SVM is quite different: it uses the training data for model calibration to estimate the model parameters, but it also keeps the most important part of the input vectors in its model. These vectors are called support vectors (only a small number of training vectors are chosen). The unique structure of the kernel functions used for nonlinear transformation of
input vectors enables the SVM to discard most of the training vectors, so that the resulting model is much smaller. The reduced set of support vectors also improves the model's generalisation ability and decreases the computational load. Since SVM theory was originally created by the machine learning community, this type of model has been coined the Support Vector Machine. The SVM has a strong nonlinear ability, analogous to the nonlinear treatment applied to traditional linear models. As we know, it is possible to transform the input variables with certain nonlinear functions so that linear models can be used to model nonlinear processes (the generalized linear system framework). For example, an input vector x = (x_1, x_2) can be transformed into a higher-dimensional input vector z = (x_1, x_2, x_1^2, x_2^2, x_1 x_2), which can then be treated as a linear system. In a similar fashion, the SVM uses specific kernel functions which transform the input vector into an inner product of nonlinear functions in the model. The selection of a suitable kernel function for a specific problem is at present a very complicated process, and much more research work is still needed. A major problem in any model training is the decision about the complexity of the model structure. More complicated models tend to do well in training but badly in prediction. For example, a common problem in ANN applications is overfitting; sometimes the model even has more weights than there are training data points. Fixed moving windows over the rainfall and flow data are selected as input vectors. An input vector can have a mixture of various variables (e.g., rain, flow, temperature, date, etc.). At each computation step, we sequentially add the newly acquired data and remove the earlier ones, to predict the flow in the future. Before the training, several key parameters have to be selected manually.
They are: a) three parameters that control the SVM training: the cost of error C, the slackness tube width ε, and the kernel function; b) window sizes for the rainfall and flow data; c) scale factors for the rainfall and flow data. In the process above, C is useful for controlling the smoothness of the function. Large C values penalise the errors, hence the resulting SVMs have a small number of support vectors. The slackness tube width ε is a new concept (in the traditional least-squares method ε is always zero), and the input data which fall within the tube are not penalised. Three popular kernel functions are tested: the dth-degree polynomial (only d = 2 is used in this project), radial basis, and sigmoid functions. Various window sizes were tested (3 rain, 3 flow; 1 rain, 5 flow; 0 rain, 10 flow; 10 rain, 0 flow; 20 rain, 0 flow). Scale factors are used to transform the rainfall and flow data into a similar range; otherwise the data with high values (i.e., small units) would dominate the training process. Most of the work in this part is manual, hence a tedious process due to the huge number of combinations. In the model calibration stage, we find that the SVM can perform very well in many cases. With the data used in the training (Bird Creek), the polynomial kernel function performed much better than the radial basis and sigmoid kernel functions (see Figure 3). Comparison with the linear TF model clearly demonstrated the nonlinear effect of SVMs. The TF model's overestimation of small peaks was removed by the polynomial kernel function.

Figure 2 The influence of model complexity: training-set error decreases as model complexity grows, while test-set error first decreases and then increases (moving from high bias/low variance to low bias/high variance).

As indicated by Figure 2, choosing a suitable model structure which achieves the best test result is very important. In this respect, the SVM has an advantage over ANNs in that it can automatically minimize the number of support vectors, thus improving its generalisation ability. In the modelling process, the rainfall data series (x_t, x_{t-1}, ...) and flow data series (y_t, y_{t-1}, ...) are used to construct vectors for the training and testing.
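The preprocessing just described, rescaling the rainfall and flow series into a similar range and building moving-window input vectors with the next flow value as target, can be sketched as follows; the window sizes and the min-max scaling choice are illustrative assumptions, since the paper does not give its exact scale factors:

```python
# Sketch of the moving-window preprocessing: rescale rainfall and flow
# series into [0, 1], then build input vectors from the last few values
# of each series up to time t, with the flow at t+1 as the target.
# Window sizes and min-max scaling are illustrative assumptions.

def minmax_scale(series):
    """Rescale a series linearly into [0, 1]."""
    lo, hi = min(series), max(series)
    return [(v - lo) / (hi - lo) for v in series]

def make_samples(rain, flow, rain_window, flow_window):
    """Return (input_vector, target) pairs: each input holds the last
    rain_window rain values and flow_window flow values up to time t,
    and the target is the flow at t+1."""
    rain, flow = minmax_scale(rain), minmax_scale(flow)
    samples = []
    start = max(rain_window, flow_window) - 1
    for t in range(start, len(flow) - 1):
        vec = rain[t - rain_window + 1:t + 1] + flow[t - flow_window + 1:t + 1]
        samples.append((vec, flow[t + 1]))
    return samples

rain = [0.0, 1.0, 2.0, 3.0, 4.0]        # hypothetical rainfall series
flow = [10.0, 20.0, 30.0, 40.0, 50.0]   # hypothetical flow series
pairs = make_samples(rain, flow, rain_window=3, flow_window=3)
print(len(pairs))  # 2 samples: windows ending at t = 2 and t = 3
```

At each computation step the window slides forward: the newest observations are appended and the oldest drop out, exactly as described in the text.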
At each time step t, y_{t+1} is the target value. Despite the success of SVM training, in the testing stage we found that SVMs were usually less stable than linear TF models and tended to perform poorly in comparison. However, there were some interesting features of SVMs that could make them useful for modelling high flows. For example, in Figure 4, although the SVM-simulated flow is not as close to the measured flow as the TF model's, its predicted peak flow is much closer to the real peak than the TF model's.

5 Practical Experiments
In this section a data set is used to test BSVM. We consider the Swiss roll data. It is three-dimensional data that looks like Figure 1.
Fig. 1. Swiss Roll Data

The distance between the samples is the geodesic distance along the surface between two samples. The BSVM method is applied and the average testing misclassification error equals 3.9%. This shows that the BSVM method performs well in this case.

6 Conclusion
BSVM provides nonlinear function approximations by mapping input vectors into a high-dimensional feature space where a hyperplane is constructed to separate classes in the data. Computationally intensive calculations in the feature space are avoided through the use of kernel functions. BSVM corresponds to a linear method in feature space, which makes it theoretically easier to analyze.

References:
[1] Altschul, S. F., et al., 1997.
Gapped BLAST and PSI-BLAST: a new generation of protein database search programs. Nucleic Acids Research 25:3389-3402.
[2] Ben-Hur, A., D. Horn, H. T. Siegelmann, and V. Vapnik, Support Vector Clustering, Journal of Machine Learning Research, vol. 2, pp. 125-137, 2001.
[3] Boser, B. E., I. M. Guyon, and V. Vapnik, A Training Algorithm for Optimal Margin Classifiers, Proceedings of the Fifth Annual ACM Workshop on Computational Learning Theory, pp. 144-152, ACM Press, 1992.
[4] Joachims, T., Making Large-Scale SVM Learning Practical, Advances in Kernel Methods: Support Vector Learning, MIT Press, Cambridge, MA, 1998.
[5] Jaakkola, T., Diekhans, M., and Haussler, D., 2000. A discriminative framework for detecting remote protein homologies. Journal of Computational Biology 7:95-114.
[6] Burges, C., Simplified Support Vector Decision Rules, International Conference on Machine Learning, 1996.
[7] Romdhani, S., P. Torr, B. Schölkopf, and A. Blake, Computationally Efficient Face Detection, International Conference on Computer Vision, 2001.
[8] Liao, L., and Noble, W. S., 2002. Combining pairwise sequence similarity and support vector machines for remote protein homology detection. In: Proc. 6th Annual International Conference on Computational Molecular Biology (RECOMB 2002), New York: ACM, pp. 225-232.
[9] Eddy, S. R., 1995. Multiple alignment using hidden Markov models. In: Proc. 3rd International Conference on Intelligent Systems for Molecular Biology (ISMB 95), AAAI Press, pp. 114-120.
[10] Karplus, K., Barrett, C., and Hughey, R., 1998. Hidden Markov models for detecting remote protein homologies. Bioinformatics 14:846-856.
[11] Murzin, A. G., et al., 1995. SCOP: a structural classification of proteins database for the investigation of sequences and structures. Journal of Molecular Biology 247:536-540.
[12] Saigo, H., Vert, J.-P., Akutsu, T., and Ueda, N., 2002. Protein homology detection using string alignment kernels. Manuscript.
[13] DeCoste, D., and D. Mazzoni, Fast Query-Optimized Kernel Machine Classification via Incremental Approximate Nearest Support Vectors, International Conference on Machine Learning, 2003.
[14] Marchand, M., and J. Shawe-Taylor, The Set Covering Machine, Journal of Machine Learning Research, vol. 3, pp. 723-746, 2002.
[15] Müller, K. R., S. Mika, G. Rätsch, K. Tsuda, and B. Schölkopf, An Introduction to Kernel-Based Learning Algorithms, IEEE Transactions on Neural Networks, vol. 12(2), pp. 181-201, 2001.
[16] LeCun, Y., B. Boser, J. S. Denker, D. Henderson, R. E. Howard, W. Hubbard, and L. J. Jackel, Handwritten Digit Recognition with a Back-Propagation Network, Advances in Neural Information Processing Systems, 1990.
[17] Platt, J., Fast Training of Support Vector Machines Using Sequential Minimal Optimization, Advances in Kernel Methods: Support Vector Learning, MIT Press, Cambridge, MA, 1999.
[18] Smith, T., and Waterman, M. S., 1981. Identification of common molecular subsequences. Journal of Molecular Biology 147:195-197.
[19] Wasniowski, R., Improving the Performance of Support Vector Machines, RAW-98-104, June 1999.
[20] Wasniowski, R., The Use of Support Vector Machines in Data Mining, RAW-99-99A, June 1998.
[21] Vapnik, V., The Nature of Statistical Learning Theory, 2nd ed., Springer-Verlag, New York, 1999.
[22] Zhang, L., and B. Zhang, A Geometrical Representation of McCulloch-Pitts Neural Model and Its Applications, IEEE Transactions on Neural Networks, vol. 10(4), pp. 291-295, 1999.