SVM-based Learning for Multiple Model Estimation


Vladimir Cherkassky and Yunqian Ma
Department of Electrical and Computer Engineering, University of Minnesota, Minneapolis, MN 55455
{cherkass,myq}@ece.umn.edu

Abstract: This paper presents a new constructive learning methodology for multiple model estimation. Under the multiple model formulation, training data are generated by several (unknown) statistical models, so existing learning methods (for classification or regression) based on a single model formulation are no longer applicable. We describe a general framework for multiple model estimation using SVM methodology. The proposed constructive methodology is analyzed in detail for the regression formulation. We also present several empirical examples for the multiple-model regression formulation. These empirical results illustrate advantages of the proposed multiple model estimation approach.

1. Introduction

This paper describes constructive learning methods for the multiple model regression formulation proposed in [Cherkassky and Ma, 2002]. Under this formulation, the available (training) data $(x_i, y_i)$, $i = 1, 2, \ldots, n$ are generated by several (unknown) regression models, so the goal of learning is two-fold: partitioning the available data into several subsets, and estimating a regression model for each subset.

Hence, the problem of multiple model estimation is inherently more complex than traditional supervised learning, where all training data are used to estimate a single model. Conceptually, there are two principal approaches for designing constructive learning methods for multiple model estimation:
- (1) First partition the available data into several subsets, then estimate model parameters for each subset of the data;
- (2) First estimate a dominant model using all available data, and then partition the data into several subsets.
Under the first approach, learning starts with a clustering step (unsupervised learning) followed by supervised learning on a subset of the available data. A practical implementation of this approach is described in [Tanaka, 2001] using the framework of mixture density estimation, where each (hidden) model is modeled as a component in a mixture. The main practical limitations of this approach are as follows:
- Inherent complexity of density estimation (with finite samples). There is theoretical and empirical evidence that density estimation is much harder than supervised learning (regression) with finite samples [Vapnik, 1999; Cherkassky and Mulier, 1998];
- Moreover, the setting of multiple model estimation leads to clustering/density estimation in local regions of the input space. That is, for a given input value there may be several distinctly different output (response) values, corresponding to different models. Hence, data partitioning (clustering) should be based on the different response values, and this leads to clustering/density estimation in local regions of the input space. Practical implementation of such clustering using a (small) portion of the available data becomes very problematic with finite samples due to the curse of dimensionality.

Under the second approach, we apply a robust regression estimator to all available data in order to estimate a dominant model, where a dominant model is a model that describes the majority of the data samples. Clearly, this approach is better than the clustering/density estimation strategy because:
- It is based on a regression rather than a density estimation formulation;
- It uses all available data (rather than a portion of the data) for model estimation.
Hence, in this paper we focus on implementation of the second approach. The main practical requirement for this approach is the availability of a robust regression algorithm, where robustness refers to the capability of estimating a single (dominant) model when the available data are generated by several (hidden) models, possibly corrupted by additive noise. This notion of robustness is somewhat different from traditional robust estimation techniques, because standard robust methods are still based on a single-model formulation, where the goal of robust estimation is resistance (of the estimates) with respect to unknown noise models. Recently, robust statistical methods have been applied to computer vision problems that can be described using the multiple model formulation. In these studies, the existence of multiple models (in the data) is referred to as the presence of structured outliers [Chen et al., 2001]. Empirical evidence suggests that traditional robust statistical methods usually fail in the presence of structured outliers, especially when the model instances are corrupted by significant noise [Chen et al., 2001]. This can be explained as follows. When the data are generated by several (hidden) models, each of the data subsets (structures) has the same importance, and relative to any one of them the rest of the data are outliers. As a result, the notion of the breakdown point (in robust statistics), which presupposes that the majority of the data points belong to a single model, loses its meaning under the multiple-model formulation. Moreover, traditional robust estimation methods cannot handle more than 30% of outliers in the data [Rousseeuw and Leroy, 1987].

Hence, we need to develop new constructive learning methodologies for multiple model estimation. This paper describes new learning algorithms for multiple model estimation based on SVM methodology. The following simple example illustrates the desirable properties of robust algorithms for multiple model estimation. Consider a data set comprising two (linear) regression models: a dominant model M1 (70% of the data) and a secondary model M2 (30% of the data), shown in Fig. 1a. The data are corrupted by additive gaussian noise (with standard deviation 0.1). The results in Fig. 1b show the model estimated by (linear) SVM regression with insensitive zone ε = 0.084, and the model estimated by ordinary least squares (OLS). Both estimation algorithms use all available data (generated by both models). The OLS method produces a rather inaccurate model, whereas SVM produces a very accurate estimate of the dominant model M1. Further, Fig. 1c shows another data set generated using the same dominant model M1 but a completely different secondary model M2. Application of SVM (with the same ε-value) to this data set yields an (almost) identical estimate of the dominant model M1, as shown in Fig. 1d. However, application of OLS to this data set yields an estimate of M1 (shown in Fig. 1d) that is completely different from the estimate in Fig. 1b. Note that the number of (hidden) models is unknown (to an estimation method); we use two models only to simplify presentation. This example shows the existence of a robust regression algorithm that can accurately estimate a single (dominant) model from a data set generated by several models. Here robustness (in the context of multiple model estimation) refers to accurate estimation of the dominant model and the stability of such estimates in spite of significant potential variability of the data generated by the secondary model(s). Going back to the example in Fig. 1: after the dominant model M1 has been identified by a robust regression method, it may be possible to identify and remove the data samples generated by M1, and then apply robust regression to the remaining data in order to estimate the next model.

Hence, we propose a general methodology for multiple model estimation based on successive application of a robust regression algorithm to the available data, so that during each (successive) iteration we estimate a single (dominant) model and then partition the data into two subsets. This iterative procedure is outlined next:

Table 1: PROCEDURE for MULTIPLE MODEL ESTIMATION
Initialization: Available data = all training samples.
Step 1: Estimate the dominant model, i.e., apply robust regression to the available data, resulting in a dominant model M1 (describing the majority of the available data).
Step 2: Partition the available data into two subsets, i.e., samples generated by M1 and samples generated by other models (the remaining data). This partitioning is performed by analyzing the available data samples ordered according to their distance (residuals) to the dominant model M1.
Step 3: Remove the subset of data generated by the dominant model from the available data.
Iterate: Apply Steps 1-3 to the available data until some stopping criterion is met.

It is important to note that the above procedure relies heavily on the existence of a robust (regression) estimation algorithm that can reliably identify and estimate a dominant model (describing the majority of the available data) in the presence of (structured) outliers and noise. The existence of such a robust regression method based on Support Vector Machine (SVM) regression has been demonstrated in the example shown in Fig. 1. However, the results in Fig. 1 are purely empirical and require further explanation, since the original SVM methodology was developed for the single model formulation. Even though SVM is known for its robustness, its application to multiple model estimation is far from obvious.
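
As a concrete illustration, the sketch below implements the Table 1 procedure in Python, assuming scikit-learn's SVR as the robust regression step (Step 1) and the residual threshold of eq. (13) in Section 3 for Step 2; the function name, the large C value, and the min_samples stopping rule are our own illustrative choices, not part of the paper.

    import numpy as np
    from sklearn.svm import SVR

    def multiple_model_estimation(X, y, eps, noise_std, min_samples=10):
        """Iteratively estimate dominant models and partition the data (Table 1)."""
        X, y = np.asarray(X), np.asarray(y)
        models, partitions = [], []
        active = np.arange(len(y))            # indices of still-unassigned samples
        while len(active) >= min_samples:     # a simple stopping criterion
            svm = SVR(kernel='linear', C=1e3, epsilon=eps)
            svm.fit(X[active], y[active])     # Step 1: estimate dominant model
            res = np.abs(y[active] - svm.predict(X[active]))
            mask = res < 2.0 * noise_std      # Step 2: partition via residuals
            if not mask.any():                # nothing explained: stop iterating
                break
            models.append(svm)
            partitions.append(active[mask])
            active = active[~mask]            # Step 3: remove and iterate
        return models, partitions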

In the next section we provide conceptual and theoretical justification for using the SVM method in the context of multiple model estimation, i.e., we explain why SVM regression can be used in Step 1 of the iterative procedure outlined above. Section 3 describes the details and implementation of the partitioning Step 2. In addition, Section 3 provides guidelines on the selection of meta-parameters for SVM regression, important for the practical application of SVM. Section 4 presents empirical results for multiple model estimation. These results show successful application of the proposed SVM-based multiple model estimation for both linear and nonlinear regression models. Section 5 presents a clustering algorithm based on multiple model estimation, where the goal (of learning) is to partition the available (training) data into several subsets, such that each subset is generated by a different model. Finally, conclusions are given in Section 6.

2. SVM Regression for robust model estimation

It is well known that Support Vector Machine (SVM) methodology is robust under the standard single-model estimation setting [Vapnik, 1999]. That is, the SVM approach works well for estimating an indicator function (pattern recognition problem) and for estimating a real-valued function (regression problem) from noisy, sparse training data. In this section, we demonstrate SVM robustness under the multiple model estimation setting, i.e., we explain why SVM regression provides stable and accurate estimates of the dominant model when the available (training) data are generated by several (hidden) models. First, we review the standard (linear) SVM regression formulation [Vapnik, 1995]. The goal of regression is to select the best model from a set of admissible models (also known as approximating functions) $f(\mathbf{x}, \omega)$, where $\omega$ denotes a (generalized) set of parameters. The best model provides good prediction accuracy (generalization) for future (test) samples, and its selection is performed via minimization of some loss function (the empirical risk) over the available training data $(x_i, y_i)$, $i = 1, 2, \ldots, n$. The main feature of SVM regression responsible for its attractive properties is the notion of the ε-insensitive loss function:

$$L(y, f(\mathbf{x}, \omega)) = \begin{cases} 0 & \text{if } |y - f(\mathbf{x}, \omega)| \le \varepsilon \\ |y - f(\mathbf{x}, \omega)| - \varepsilon & \text{otherwise} \end{cases} \qquad (1)$$

Here the linear nature of the loss function accounts for SVM robustness, whereas the ε-insensitive zone leads to sparseness of SVM regression models [Vapnik, 1995].
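
For reference, eq. (1) translates directly into code; a minimal sketch in Python/NumPy (the language is our choice for illustration):

    import numpy as np

    def eps_insensitive_loss(y, f, eps):
        """Eq. (1): zero inside the eps-tube, linear growth outside it."""
        return np.maximum(np.abs(y - f) - eps, 0.0)

The linear (rather than quadratic) growth outside the tube is what bounds the influence of samples with large residuals.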

Let us consider (for simplicity) linear SVM regression:

$$f(\mathbf{x}, \omega) = \langle \omega, \mathbf{x} \rangle + b \qquad (2)$$

The SVM approach to linear regression amounts to (simultaneous) minimization of the ε-insensitive loss function (1) and minimization of the norm of the linear parameters $\omega$ [Vapnik, 1995]. This can be formally described by introducing (non-negative) slack variables $\xi_i, \xi_i^*$, $i = 1, \ldots, n$, to measure the deviation of training samples outside the ε-insensitive zone. Thus SVM regression can be formulated as minimization of the following functional:

$$\frac{1}{2}\|\omega\|^2 + C \sum_{i=1}^{n} (\xi_i + \xi_i^*) \qquad (3)$$

subject to the constraints

$$y_i - \langle \omega, \mathbf{x}_i \rangle - b \le \varepsilon + \xi_i, \qquad \langle \omega, \mathbf{x}_i \rangle + b - y_i \le \varepsilon + \xi_i^*, \qquad \xi_i, \xi_i^* \ge 0, \quad i = 1, \ldots, n$$

The constant C determines the trade-off between the model complexity (flatness) and the degree to which deviations larger than ε are tolerated in the optimization formulation. This optimization problem can be transformed into the dual problem [Vapnik, 1995], and its solution is given by

$$f(\mathbf{x}) = \sum_{i=1}^{n} (\alpha_i - \alpha_i^*) \langle \mathbf{x}_i, \mathbf{x} \rangle + b \qquad (4)$$

with coefficient values in the range $0 \le \alpha_i \le C$, $0 \le \alpha_i^* \le C$. In representation (4), typically only a fraction of the training samples appear with non-zero coefficients; such training samples are called support vectors. For most applications, the number of support vectors (SVs) $n_{SV}$ is smaller than the number of training samples. Thus, with ε-insensitive loss, SVM solutions are typically sparse.

For the nonlinear regression problem, the SVM approach first performs a mapping from the input space onto a high-dimensional feature space and then performs linear regression in that high-dimensional feature space. The SVM solution is

$$f(\mathbf{x}) = \sum_{i=1}^{n} (\alpha_i - \alpha_i^*)\, K(\mathbf{x}_i, \mathbf{x}) + b \qquad (5)$$

where $K(\mathbf{x}_i, \mathbf{x})$ is a kernel function. The choice of the kernel function and kernel parameters is determined by the user and is (usually) application-dependent. In this paper, we use RBF kernels

$$K(\mathbf{x}, \mathbf{x}_i) = \exp\left(-\frac{\|\mathbf{x} - \mathbf{x}_i\|^2}{2p^2}\right) \qquad (6)$$

where p is the RBF width parameter.
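
Eq. (6) and its use with an off-the-shelf solver can be sketched as follows, assuming the width p enters the exponent as 2p^2 (one common convention) and assuming scikit-learn's SVR, which writes the same kernel as exp(-gamma * ||x - x'||^2); the numeric values below are only placeholders:

    import numpy as np
    from sklearn.svm import SVR

    def rbf_kernel(x, xi, p):
        """Eq. (6): RBF kernel with width parameter p."""
        x, xi = np.asarray(x), np.asarray(xi)
        return np.exp(-np.sum((x - xi) ** 2) / (2.0 * p ** 2))

    # A width p corresponds to gamma = 1 / (2 * p**2) in scikit-learn's convention.
    p = 0.2
    svm = SVR(kernel='rbf', gamma=1.0 / (2.0 * p ** 2), C=1e3, epsilon=0.084)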

Next, we explain why SVM regression is suitable for estimating the dominant model under the multiple model formulation. We assume, for simplicity, the linear SVM formulation (4); however, similar arguments hold for nonlinear SVM as well. The objective function in (3) can be viewed as a primal problem, and its dual form can be obtained by constructing the Lagrange function and introducing a set of (dual) variables [Vapnik, 1995]. For the dual form, the so-called Karush-Kuhn-Tucker (KKT) conditions hold at the optimal solution; they state that the product between dual variables and constraints has to vanish:

$$\alpha_i\,(\varepsilon + \xi_i - y_i + \langle \omega, \mathbf{x}_i \rangle + b) = 0$$
$$\alpha_i^*\,(\varepsilon + \xi_i^* + y_i - \langle \omega, \mathbf{x}_i \rangle - b) = 0 \qquad (7)$$
$$(C - \alpha_i)\,\xi_i = 0$$
$$(C - \alpha_i^*)\,\xi_i^* = 0$$

We may further analyze the properties of the coefficients $\alpha_i$ (dual variables) in the SVM solution evident from the KKT conditions [Smola and Schölkopf, 1998]. First, only samples with corresponding $\alpha_i = C$ lie outside the ε-insensitive zone. Second, the condition $\alpha_i \alpha_i^* = 0$ implies that the dual variables $\alpha_i$ and $\alpha_i^*$ cannot both be nonzero, since nonzero slack cannot occur in both directions. Let us analyze the contribution of training samples to the SVM solution (4). As shown in Fig. 2, all data samples can be divided into three subsets: data points inside the ε-tube (labeled 1 in Fig. 2), data points on the ε-tube border (labeled 2 in Fig. 2), and data points outside the ε-tube (labeled 3 in Fig. 2). Note that data samples inside the ε-tube cannot be support vectors, whereas data samples on the ε-tube border and outside the ε-tube are support vectors, but they have different values of the slack variables ξ and dual variables α, as summarized in Table 2.

Table 2: Values of slack variables and dual variables for the different subsets

              Location              SV        Slack    Dual variable
    Subset 1  inside the ε-tube     not SV    ξ = 0    α = 0
    Subset 2  on the ε-tube         SV        ξ = 0    α ∈ (0, C)
    Subset 3  outside the ε-tube    SV        ξ > 0    α = C
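
The three subsets can be read off a fitted model programmatically; a sketch, assuming scikit-learn's SVR (whose dual_coef_ stores alpha_i - alpha_i* for the support vectors) and using a numerical tolerance in place of the exact equalities:

    import numpy as np

    def svr_subsets(svm, y, rtol=1e-3):
        """Split training samples into the three subsets of Table 2 / Fig. 2."""
        alpha = np.zeros(len(y))
        alpha[svm.support_] = np.abs(svm.dual_coef_.ravel())  # |alpha_i - alpha_i*|
        at_bound = np.isclose(alpha, svm.C, rtol=rtol)
        inside = np.isclose(alpha, 0.0)       # subset 1: not SVs (alpha = 0)
        on_tube = ~inside & ~at_bound         # subset 2: alpha in (0, C)
        return inside, on_tube, at_bound      # at_bound = subset 3 (outside tube)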

Recall that the coefficient vector $\omega$ in the (linear) SVM solution (4) is calculated as

$$\omega = \sum_{i=1}^{n} (\alpha_i - \alpha_i^*)\, \mathbf{x}_i \qquad (8)$$

where a non-zero contribution is provided only by the support vectors, i.e., the data points in subset 2 (on the ε-tube) and subset 3 (outside the ε-tube). Further, the value of $\omega$ is determined by the α-values and x-values of the training samples; however, for samples in subset 3 the value $\alpha_i = C$ (constant) does not depend on the y-values of the training samples. Hence, data points from subset 3 give the same contribution to the SVM solution regardless of their y-values, i.e., independent of how far their y-values lie from the ε-tube. This property enables robust SVM estimation of the dominant model during multiple model estimation. For example, consider the two points in Fig. 2 labeled Point 1 and Point 2. Although their y-values are quite different, their x-values are very close (or equal) and the corresponding $\alpha_i = C$, so their contributions to the SVM solution (8) are (approximately) the same. Similarly, one can analyze the contribution of the data samples to the bias term in the SVM solution. Following [Smola and Schölkopf, 1998], the bias term b is given by

$$b = y_i - \langle \omega, \mathbf{x}_i \rangle - \varepsilon \quad \text{for } \alpha_i \in (0, C)$$
$$b = y_i - \langle \omega, \mathbf{x}_i \rangle + \varepsilon \quad \text{for } \alpha_i^* \in (0, C) \qquad (9)$$

where the constraint $\alpha_i \in (0, C)$ corresponds to data points in subset 2 (on the border of the ε-tube). Hence, the points outside the ε-tube (in subset 3) do not contribute to the bias, i.e., outliers (samples outside the ε-tube) have no effect on the value of the bias.
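
This y-independence is easy to check numerically. The sketch below (synthetic data and parameter values are our own illustrative choices) fits a linear SVR, pushes an existing outlier much further from the tube, refits, and compares the coefficients:

    import numpy as np
    from sklearn.svm import SVR

    rng = np.random.default_rng(0)
    X = rng.uniform(0, 1, (100, 1))
    y = 0.8 * X.ravel() + 1 + rng.normal(0, 0.1, 100)
    y[0] += 2.0                               # make sample 0 an outlier (alpha = C)

    svm = SVR(kernel='linear', C=1e3, epsilon=0.084)
    w_before = svm.fit(X, y).coef_.ravel().copy()

    y[0] += 5.0                               # move the same outlier much further out
    w_after = svm.fit(X, y).coef_.ravel()
    print(w_before, w_after)                  # (nearly) identical coefficients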

In summary, our analysis of (linear) SVM regression indicates that:
- the SVM regression model depends mainly on the SVs on the border of the ε-insensitive zone;
- the SVM regression solution is very robust to outliers (i.e., data samples outside the ε-insensitive zone); in particular, the SVM solution does not depend on the y-values of such outliers.
These properties make SVM very attractive for use in the iterative procedure for multiple model estimation described in Section 1, where a robust estimator applied to all training data needs to provide a reliable estimate of the first dominant model. The main practical issue is specifying the conditions under which SVM regression yields an accurate estimate of the dominant model under the multiple model setting. To answer this question, recall that an SVM model depends mainly on the SVs on the border of the ε-insensitive zone. Hence, SVM regression provides an accurate estimate of the dominant model only if these SVs are generated by the dominant model. This will happen only if all (or most) samples in subset 1 (inside the ε-tube) and subset 2 (on the ε-tube) are generated by the dominant model. Since the SVM model is estimated using all training data (from several models), the latter condition implies that the majority of the data (say, over 55%) should be generated by the dominant model. The requirement that the majority of the available data be generated by a dominant model is standard in robust statistics [Lawrence and Arthur, 1990]. Here we have simply derived this condition for the SVM algorithm in the context of multiple model estimation.

3. SVM methodology for multiple model estimation

This section describes practical algorithms (based on SVM regression) for multiple model estimation. These algorithms follow the iterative procedure described in Section 1. However, practical implementation of this procedure requires addressing the following issues:
- how to set the meta-parameters of SVM regression;
- how to partition the data into two subsets (after the dominant model has been estimated).

To simplify the presentation, all descriptions in this paper assume that the training data are generated by two models, i.e., model M1 (dominant model) and model M2 (minor model). The goal is to accurately estimate the dominant model in the first iteration of the iterative procedure given in Section 1; the second model M2 is then estimated in the second iteration of the algorithm. Generalization to data sets with multiple models is straightforward.

Selection of SVM meta-parameters. Next we discuss the proper setting of ε (insensitive zone) and C (regularization parameter) in SVM regression for estimating the dominant model in Step 1 of the iterative procedure given in Section 1. There are many proposals for setting SVM meta-parameters for standard single-model estimation [Vapnik, 1995; Cherkassky and Mulier, 1998; Schölkopf et al., 1999; Hastie et al., 2001]. However, most theoretical prescriptions for setting meta-parameters are based on restrictive assumptions, and in practice SVM meta-parameters are often selected via resampling [Schölkopf et al., 1999]. In this paper, however, we are interested in selecting meta-parameters for the multiple model estimation setting. Recently, [Cherkassky and Ma, 2002] proposed analytic selection of SVM meta-parameters (for the standard single-model regression formulation), as detailed next. For SVM regression, the values of the meta-parameters are:

$$C = \max(\,|\bar{y} + 3\sigma_y|,\; |\bar{y} - 3\sigma_y|\,) \qquad (10)$$

where $\bar{y}$ is the mean of the training response values and $\sigma_y$ is the standard deviation of the training response values;

$$\varepsilon(\sigma, n) = 3\sigma \sqrt{\frac{\ln n}{n}} \qquad (11)$$

where σ is the standard deviation of the additive noise and n is the number of training samples.
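
Eqs. (10)-(11) are directly computable from the training responses and a noise estimate; a minimal sketch:

    import numpy as np

    def svm_meta_parameters(y, noise_std):
        """Analytic C and epsilon per eqs. (10)-(11)."""
        n = len(y)
        y_mean, y_std = np.mean(y), np.std(y)
        C = max(abs(y_mean + 3 * y_std), abs(y_mean - 3 * y_std))  # eq. (10)
        eps = 3 * noise_std * np.sqrt(np.log(n) / n)               # eq. (11)
        return C, eps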

Further, it can be shown that the value of the ε-parameter plays the most important role for SVM regression, whereas SVM solutions are rather insensitive to the regularization parameter as long as it is larger than the value given by (10) [Cherkassky and Ma, 2002]. This insensitivity to the regularization parameter is particularly true for the linear SVM regression formulation (3). In other words, one can use a very large value of the regularization parameter in (3), so that the SVM solution depends only on the proper setting of ε. So in the remainder of the paper we shall only be concerned with the selection of ε. In order to apply (11) for multiple model estimation, consider (for simplicity) only linear SVM. Then, in order to estimate the dominant model M1, we should know the standard deviation σ of the additive noise in the dominant model and the number of samples generated by the dominant model M1. Hence, we may consider two possibilities:
- First, the noise level for each (hidden) model is available or can somehow be estimated (using a priori knowledge). In this case, we simply use (11) for selecting the value of ε.
- Second, the noise level (standard deviation) and the number of samples for each model are not known.
Let us consider the second (more difficult) possibility. In this case, the selection of ε relies on the requirement that the majority of the available data be generated by the dominant model (being estimated by SVM). Hence, we need to select an ε-value such that most of the data (say 55%) lies inside the ε-tube. This can be done by a trial-and-error approach (i.e., trying different ε-values and examining the support vectors in the SVM estimates) or by using a more systematic approach called ν-SVM [Schölkopf et al., 1998]. This approach effectively implements SVM regression with a prespecified number of support vectors given by the parameter ν (i.e., a given fraction of the total number of samples). In the context of multiple model estimation, the requirement that 55% of the data lie inside the insensitive zone is equivalent to specifying ν = 0.45, i.e., that 45% of the data lie outside the ε-tube.
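
A sketch of this second possibility, assuming scikit-learn's NuSVR implementation of ν-SVM regression, where nu upper-bounds the fraction of samples outside the tube and ε is sized automatically (the synthetic data here are only placeholders):

    import numpy as np
    from sklearn.svm import NuSVR

    rng = np.random.default_rng(0)
    X = rng.uniform(0, 1, (100, 1))
    y = 0.8 * X.ravel() + 1 + rng.normal(0, 0.1, 100)

    # nu = 0.45 asks for roughly 55% of the data inside the (auto-sized) tube.
    svm = NuSVR(kernel='linear', nu=0.45, C=1e3)
    svm.fit(X, y)
    print(svm.support_.size, 'support vectors out of', len(y))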

Remarkably, the ability of SVM to accurately estimate the dominant model is not very sensitive to the chosen width of the ε-insensitive zone. For example, let us apply (linear) SVM to the data set shown in Fig. 1 in order to estimate the dominant model M1. Assuming the noise standard deviation σ = 0.1 is known, the value of the ε-insensitive zone according to (11) should be ε = 0.084. This value was used to generate the regression estimates shown in Fig. 1. In practice, we can only know/use crude estimates of the noise level (and hence crude ε-values). So we try to estimate the dominant model for the data set in Fig. 1a using SVM regression with three different ε-values (ε = 0.084, 0.042 and 0.126), i.e., the value ε = 0.084 specified by (11) along with half and one-and-a-half times that value. Fig. 3 shows the SVM estimates of the dominant model for the different ε-values; clearly these estimates are almost identical, in spite of the significant variations in ε. Hence, using inaccurate values of σ and n (the number of samples) for estimating the value of ε via (11) should not affect accurate estimation of the dominant model. For example, if the total number of samples is 100 (a known number), then the (unknown) number of samples in the dominant model should be at least 50. According to (11), the difference between $\sqrt{\ln 50 / 50}$ and $\sqrt{\ln 100 / 100}$ is about 25%, so using inaccurate values of the number of samples should result in (at worst) a 25% variation in ε-values. This variation would not affect the accuracy of the SVM regression estimates, as indicated by the empirical results in Fig. 3.

Data partitioning step. Following estimation of the dominant model, we need to partition the available data into two subsets, i.e., data generated by the dominant model and the remaining data (generated by other models). This is done by analyzing the (absolute values of the) residuals between the training response values $y_i$ and the SVM estimates $\hat{y}(\mathbf{x}_i)$ provided by the dominant model:

$$res_i = y_i - \hat{y}(\mathbf{x}_i), \quad i = 1, \ldots, n \qquad (12)$$

Namely, training samples with residual values smaller than a certain threshold are assigned to the dominant model, and samples with large absolute residuals are assigned to the other model(s). Empirically, we found that a good threshold equals twice the standard deviation of the additive noise in the dominant model M1:

$$\text{if } |res_i| < 2\sigma \text{ then } (\mathbf{x}_i, y_i) \in M1 \qquad (13)$$

Here we assume that the noise level σ is known a priori or can be (accurately) estimated from the data. In fact, the noise level (its standard deviation) can readily be estimated from the data, as outlined next. Let us plot the histogram of the residuals $res_i = y_i - \hat{y}(\mathbf{x}_i)$ for the training data. Since SVM provides very accurate estimates of the dominant model (in Step 1) and the majority of the data are produced by the dominant model, these samples will form one large cluster (of residuals) symmetric around zero, whereas samples generated by other models will produce a few smaller clusters. The standard deviation of the noise (in the dominant model) can then be estimated via the standard deviation of the residuals in the large cluster. Further, the empirical results (in Fig. 3) indicate that the overall quality of the multiple model estimation procedure is not sensitive to accurate knowledge/estimation of the noise level.
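
A sketch of this step: the partitioning follows eqs. (12)-(13) directly, while estimate_noise_std is only one crude stand-in for the histogram procedure (it takes the spread of the smallest half of the residuals; the 0.5 fraction is our illustrative choice):

    import numpy as np

    def partition_by_residuals(y, y_hat, noise_std):
        """Eqs. (12)-(13): True where a sample is assigned to the dominant model."""
        res = y - y_hat                              # eq. (12)
        return np.abs(res) < 2.0 * noise_std         # eq. (13)

    def estimate_noise_std(y, y_hat, core_fraction=0.5):
        """Crude stand-in for the histogram step: spread of the near-zero cluster."""
        res = y - y_hat
        core = np.sort(np.abs(res))[: int(core_fraction * len(res))]
        return np.std(np.concatenate([core, -core]))  # symmetrized around zero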

4. Empirical results

This section describes empirical results for the multiple model estimation procedure using synthetic data sets. We only show examples where the training samples $Z = (\mathbf{x}_i, y_i)$, $i = 1, 2, \ldots, n$ are generated by two models:
- model M1 generates $n_1$ samples according to $y = r_1(\mathbf{x}) + \delta_1$;
- model M2 generates $n_2$ samples according to $y = r_2(\mathbf{x}) + \delta_2$, so that $n_1 + n_2 = n$.
Note that the same algorithm has been successfully applied when the data are generated by a larger number of models (those results are not shown here due to space constraints). In all examples, the input values of the training data are generated as random samples from a uniform distribution. Both hidden models are defined on the same domain in the input space. We use additive gaussian noise to generate the training data for the examples presented in this section. However, the proposed method works well with other types of noise; also, the standard deviation (of the noise) may differ between models. To simplify the presentation, the standard deviation of the noise is assumed to be known (to the algorithm); in practical settings the noise level can be estimated from the data, as described in Section 3.

The first example assumes that both models are linear, so we apply the linear SVM regression method (without kernels). In this example we have 100 training samples generated as follows:
- model M1: $y = r_1(x) + \delta_1$, where $r_1(x) = 0.8x + 1$, $x \in [0, 1]$, $n_1 = 60$ (major model);
- model M2: $y = r_2(x) + \delta_2$, where $r_2(x) = 0.2x + 1$, $x \in [0, 1]$, $n_2 = 40$ (minor model).
We consider two noise levels: $\sigma_1 = \sigma_2 = 0.1$ (small noise) and $\sigma_1 = \sigma_2 = 0.3$ (large noise). The training data sets (with small noise and with large noise) are shown in Fig. 4, with samples generated by the major model labeled '+' and samples generated by the minor model labeled '·'. These data are hard to separate visually (by the human eye) in the case of large noise. However, the proposed method accurately estimates both models and separates the training data, as shown in Fig. 4. As expected, the model estimation accuracy is better for low noise; however, even with large noise the model estimates are quite good.
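
The first example's training set can be reproduced in a few lines (the random seed is arbitrary; the target functions are as given above):

    import numpy as np

    rng = np.random.default_rng(1)
    n1, n2, sigma = 60, 40, 0.1                    # major/minor counts, small noise
    x1 = rng.uniform(0, 1, n1)
    x2 = rng.uniform(0, 1, n2)
    y1 = 0.8 * x1 + 1 + rng.normal(0, sigma, n1)   # major model M1
    y2 = 0.2 * x2 + 1 + rng.normal(0, sigma, n2)   # minor model M2
    X = np.concatenate([x1, x2]).reshape(-1, 1)
    y = np.concatenate([y1, y2])

Feeding X, y to the iterative procedure of Table 1 (see the sketch in Section 1) should recover M1 on the first pass and M2 on the second.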

In the second example, both models are nonlinear, so we use nonlinear SVM regression. An RBF kernel (6) with width parameter p = 0.2 is used in this example. The available training data (100 samples in total) are generated as follows:
- model M1: $y = r_1(x) + \delta_1$, where $r_1(x) = \sin(2\pi x)$, $x \in [0, 1]$, $n_1 = 70$;
- model M2: $y = r_2(x) + \delta_2$, where $r_2(x) = \cos(2\pi x)$, $x \in [0, 1]$, $n_2 = 30$.
Again, we consider two noise levels: $\sigma_1 = \sigma_2 = 0.1$ (small noise) and $\sigma_1 = \sigma_2 = 0.3$ (large noise). The training data sets (with small noise and with large noise) are shown in Fig. 5. The results in Fig. 5 indicate that the proposed method provides very accurate model estimates, even in the case of large noise.

Finally, we show an example of multiple model estimation for higher-dimensional data. We consider linear models in a 4-dimensional input space, with the training data generated as follows:
- model M1: $y = r_1(\mathbf{x}) + \delta_1$, where $r_1(\mathbf{x}) = x_1 + x_2 + x_3 + x_4$, $\mathbf{x} \in [0, 1]^4$, $n_1 = 60$ (major model);
- model M2: $y = r_2(\mathbf{x}) + \delta_2$, where $r_2(\mathbf{x}) = 6 - x_1 - x_2 - x_3 - x_4$, $\mathbf{x} \in [0, 1]^4$, $n_2 = 40$ (minor model).
The training data are corrupted by additive noise with standard deviation $\sigma_1 = \sigma_2 = 0.1$. For this data set, we illustrate the data partitioning step in Fig. 6, which shows the distribution of residual values, i.e., the differences between the response values and the M1 model estimates (normalized by the standard deviation of the noise) calculated according to (12). Residual values for the first 60 samples (generated by model M1) are on the left-hand side of Fig. 6, and those for the next 40 samples (generated by model M2) are on the right-hand side.

Partitioning of the data samples is performed on the residual values according to (13), using threshold value 2; this threshold is indicated by the horizontal line in the middle of Fig. 6. That is, samples below this line are assumed to originate from M1, and samples above this line are assumed to originate from M2. As expected, the data partitioning is not perfectly accurate: some samples from M1 are classified as samples from M2, and vice versa. This is because the two models actually produce the same (or very close) response values in a small region of the input space, so perfectly accurate classification is not possible. However, the proposed multiple model estimation procedure provides very accurate model estimates for this data set. Namely, the estimates obtained for models M1 and M2 are:
- for model M1: $\hat{y}(\mathbf{x}) = 0.0 + 1.4x_1 + 0.95x_2 + 1.08x_3 + 1.08x_4$, with MSE (for M1) = 0.077;
- for model M2: $\hat{y}(\mathbf{x}) = 5.93 - 1.07x_1 - 0.93x_2 - x_3 - 0.99x_4$, with MSE (for M2) = 0.0044.
Clearly, the above estimates are very close to the target functions used to generate the noisy data. The MSE measure indicates the mean squared error between the regression estimates and the true target function (for each model), computed using 500 independently generated test samples.

5. Clustering using multiple model estimation

In many applications, the goal of data modeling (assuming that the data are generated by several models) is to cluster/partition the available data into several subsets corresponding to the different generating models. This goal is concerned mainly with accurate partitioning of the data, rather than with accurate estimation of the (hidden) regression models, even though these two objectives are highly correlated. The example shown in Fig. 6 illustrates the point: even though the data partitioning (implemented by the proposed algorithm) is not very accurate, the algorithm produces very accurate and robust estimates of the regression models. In this section we show how to improve the accuracy of data partitioning under the multiple model estimation formulation.

For example, consider the nonlinear data set described in Section 4 and shown in Fig. 5c. For this data set, some samples are very difficult to assign to the appropriate model, especially in regions of the input space where the models have similar response values. Here the proposed multiple model estimation algorithm correctly classifies 90.4% of the samples generated by the major model M1, and 53.5% of the samples generated by the minor model M2. However, the accuracy of data partitioning can be further improved using the simple post-processing procedure described next. This procedure uses the regression estimates provided by the proposed multiple model estimation algorithm. Let us denote the regression estimate for the major model M1 as $\hat{y}^{(1)}(\mathbf{x})$, and the regression estimate for the minor model M2 as $\hat{y}^{(2)}(\mathbf{x})$. Then each training sample $(\mathbf{x}_i, y_i)$, $i = 1, 2, \ldots, n$ can be assigned to one of the two models based on the (absolute) values of the residuals

$$res_i^{(1)} = |y_i - \hat{y}^{(1)}(\mathbf{x}_i)| \quad \text{and} \quad res_i^{(2)} = |y_i - \hat{y}^{(2)}(\mathbf{x}_i)|$$

That is:

$$\text{if } res_i^{(1)} < res_i^{(2)} \text{ then } (\mathbf{x}_i, y_i) \in M1 \text{ else } (\mathbf{x}_i, y_i) \in M2 \qquad (14)$$

Effectively, this post-processing method implements nearest neighbor classification on the (absolute values of the) residuals. Applying prescription (14) to partition the data set shown in Fig. 5c yields classification accuracy of 91.6% for samples generated by M1, and classification accuracy of 80% for samples generated by M2. Hence, the data re-partitioning technique (14) gives better accuracy than the data partitioning produced by the original multiple model estimation procedure.
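
Rule (14) in code is a one-line nearest-model assignment; a minimal sketch:

    import numpy as np

    def reassign(y, y_hat1, y_hat2):
        """Rule (14): assign each sample to the model with the smaller residual."""
        res1 = np.abs(y - y_hat1)            # residuals w.r.t. major model M1
        res2 = np.abs(y - y_hat2)            # residuals w.r.t. minor model M2
        return np.where(res1 < res2, 1, 2)   # 1 -> M1, 2 -> M2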

In conclusion, we comment on the applicability and implications of the clustering/data partitioning approach described in this section. This approach to clustering assumes that the training data are generated by several models, and the clustering relies heavily on the accurate estimates of the (regression) models obtained by the robust SVM-based algorithm. Hence, the problem setting itself combines supervised learning (i.e., estimation of regression models) and unsupervised learning (i.e., data partitioning or clustering). We expect this approach to clustering to outperform traditional clustering techniques for applications that can be described using the multiple model formulation. Finally, the proposed nearest neighbor rule (14) for data partitioning assumes that both (hidden) models have the same noise level (standard deviation) and the same misclassification cost. These assumptions hold true for the data set in Fig. 5c, which explains the improved classification accuracy for this example. In many applications, however, the noise levels and misclassification costs for different (hidden) models are not the same, and one should adjust rule (14) to account for these differences.

6. Summary and discussion

This paper presents a new algorithm for multiple model estimation. The proposed method is based on SVM learning adapted to the multiple model formulation. Empirical results presented in this paper demonstrate that SVM-based learning can be successfully applied to multiple model regression problems. In addition, we introduced a new clustering/data partitioning method suitable for the multiple model formulation. Future related work may focus on applications of the proposed methodology to real-world problems, ranging from computer vision (motion analysis) to financial engineering. As discussed in [Cherkassky and Ma, 2002], such applications should be based on a thorough understanding of each application domain, necessary for a meaningful specification/parameterization of the (hidden) models. Additional research may be concerned with better understanding the robustness of SVM methodology, and with comparing it with traditional robust methods in the context of multiple model estimation.

Finally, we point out that the proposed learning method assumes that all (hidden) models are defined on the same domain (i.e., the same region of the input space). In situations where different models are defined in different (disjoint) regions of the input space, the proposed algorithm cannot be successfully applied. Instead, one should use well-known (tree) partitioning algorithms such as CART, mixture of experts, and their variants [Hastie et al., 2001]. These algorithms effectively partition the input space into several (disjoint) regions and estimate the output (response) values in each region of the input space. It may be interesting to note that tree partitioning algorithms are based on a single model formulation, so they tend to enforce smoothness at the region boundaries. It may be possible to develop learning algorithms for regression in disjoint regions using the multiple model formulation, and then compare their accuracy with traditional tree partitioning methods.

REFERENCES
[1] V. Cherkassky and Y. Ma, Multiple Model Estimation: A New Formulation for Predictive Learning, IEEE Trans. Neural Networks (under review), 2002
[2] M. Tanaka, Mixture of Probabilistic Factor Analysis Model and Its Application, in Proc. ICANN 2001, LNCS 2130, pp. 83-88, 2001
[3] V. Vapnik, The Nature of Statistical Learning Theory (2nd ed.), Springer, 1999
[4] V. Cherkassky and F. Mulier, Learning from Data: Concepts, Theory and Methods, Wiley, 1998
[5] H. Chen, P. Meer and D. Tyler, Robust Regression for Data with Multiple Structures, in Proc. IEEE Computer Society Conf. on Computer Vision and Pattern Recognition (CVPR), pp. 1069-1075, 2001
[6] P. Rousseeuw and A. Leroy, Robust Regression and Outlier Detection, Wiley, New York, 1987
[7] V. Vapnik, The Nature of Statistical Learning Theory, Springer, 1995
[8] K. Lawrence and J. Arthur, Robust Regression: Analysis and Applications, M. Dekker, New York, 1990
[9] A. Smola and B. Schölkopf, A Tutorial on Support Vector Regression, NeuroCOLT Technical Report NC-TR-98-030, Royal Holloway College, University of London, UK, 1998
[10] T. Hastie, R. Tibshirani and J. Friedman, The Elements of Statistical Learning: Data Mining, Inference and Prediction, Springer, 2001
[11] B. Schölkopf, C. Burges and A. Smola, eds., Advances in Kernel Methods: Support Vector Learning, MIT Press, 1999

[12] V. Cherkassky and Y. Ma, Selection of Meta-Parameters for Support Vector Regression, Proc. ICANN 2002 (to appear), 2002
[13] B. Schölkopf, P. Bartlett, A. Smola, and R. Williamson, Support Vector Regression with Automatic Accuracy Control, in L. Niklasson, M. Bodén, and T. Ziemke (eds.), Proc. ICANN'98, Springer, pp. 111-116, 1998

FIGURE CAPTIONS

Fig. 1: Comparing the robust method vs. least squares estimation of the dominant model. (a) First data set (dominant model 70% of data samples, secondary model 30% of samples); (b) estimates of dominant model M1 by the robust method vs. least squares, for data set (a); (c) second data set (dominant model 70% of data samples, secondary model 30% of samples); (d) estimates of dominant model M1 by the robust method vs. least squares, for data set (c).

Fig. 2: Location of training data with respect to the ε-insensitive tube, showing the three possible subsets of data.

Fig. 3: SVM estimates of the dominant model do not depend on accurate selection of ε-values. Results show three SVM estimates for the data set in Fig. 1(a), using the optimal ε (for this data set), 0.5ε, and 1.5ε: solid line, optimal ε = 0.084; dashed line, ε = 0.042; dotted line, ε = 0.126.

Fig. 4: Example of the multiple model estimation procedure for linear models: (a) training data (with small noise); (b) model estimates for data set (a) obtained using the proposed algorithm; (c) training data (with large noise); (d) model estimates for data set (c) obtained using the proposed algorithm.

Fig. 5: Example of the multiple model estimation procedure for nonlinear models: (a) training data (with small noise); (b) model estimates for data set (a) obtained using the proposed algorithm; (c) training data (with large noise); (d) model estimates for data set (c) obtained using the proposed algorithm.

Fig. 6: Illustration of the data partitioning step (Step 2) in the proposed algorithm for the high-dimensional data set (residual/sigma vs. index of training samples). The horizontal threshold line is used to partition the data into two subsets.
