General Vector Machine

Hong Zhao, Department of Physics, Xiamen University


The support vector machine (SVM) is an important class of learning machines for function approximation, pattern recognition, time-series prediction, and related tasks. It maps samples into the feature space through so-called support vectors selected from the samples, and the feature vectors are then separated by the maximum-margin hyperplane. The present paper introduces the general vector machine (GVM) to replace the SVM. The support vectors are replaced by general projection vectors selected from the usual vector space, and a Monte Carlo (MC) algorithm is developed to find the general vectors. The general projection vectors improve the feature-extraction ability, and the MC algorithm can control the width of the separation margin of the hyperplane. By controlling the separation margin, we show that the maximum-margin hyperplane can often induce overlearning, and that the best learning machine is achieved with a properly chosen separation margin. Applications to function approximation, pattern recognition, and classification indicate that the developed method is very successful, particularly for small training sets. Additionally, our algorithm may enable some particular applications, such as transductive inference.

1. Introduction

In essence, a learning machine is a high-dimensional map from an input $x \in R^M$ to an output $y \in R^L$, $y = \phi(\alpha, x)$, where $\alpha$ represents the parameter set.

The goal of designing such a map is to find a parameter set, given a sample set $(x^\mu, y^\mu)$, $\mu = 1, \dots, P$, of $P$ samples, which guarantees that the map not only responds correctly to the sample set but also to the real set represented by the samples. Generalizing the knowledge learned from the limited samples to the real set is the key performance of a learning machine. This ability is determined by the prior knowledge incorporated into the learning machine. The prior knowledge is the information about the learning task that is available in addition to the training samples. A method for designing a learning machine is therefore usually composed of two parts, i.e., an algorithm for training the learning machine to respond to samples, and strategies for gaining the generalization ability. There are two fundamental methods for designing multilayer learning machines, i.e., the back-propagation (BP) method [1-3] and the support vector machine (SVM) method [4-5]. Both methods have achieved great success in various applications, ranging from the traditional domains of function approximation, pattern recognition, and time-series prediction to an increasingly wide variety of biological applications [1-7]. The BP method employs a set of deterministic equations to calculate the parameter set in an iterative manner to train the machine to respond to samples. To gain the generalization ability, it applies Occam's razor principle: entities should not be multiplied beyond necessity. Based on this principle, the best machine should have as small a size as possible. However, it is already clear that a smaller machine may not always give the best performance [4]. In addition, the BP algorithm is experience-dependent, particularly in choosing a proper learning rate. The SVM method is a great advance in designing learning machines. It gives up Occam's razor and follows the structural risk minimization

(SRM) principle of statistical learning theory [4] to gain the generalization ability. Based on the SRM principle, the machine with the smallest Vapnik-Chervonenkis (VC) dimension, instead of the smallest size, is supposed to be the best. In more detail, the SVM method maps the input vectors of samples into the high-dimensional feature space by so-called support vectors, and the feature vectors are then separated with the maximum-margin hyperplane calculated by a linear optimization algorithm. Fundamentally, however, whether the structural risk generally leads to the best machine, and particularly whether the maximum-margin hyperplane can be applied as a quantitative criterion of the best machine, are unknown [4,8]. Technically, how to calculate the VC dimension and how to choose the kernel function are also open problems [8]. In addition, support vectors are input vectors of special samples chosen from the training samples, which may impose a serious restriction when only a small training set is available. Another problem is that the SVM was originally developed for binary decision problems. Though there have been efforts to extend the method to multi-classification problems [9], it seems that the performance of such machines is still poorer than that of binary-classification SVMs [10]. The SVM method thus needs to be developed further. In this paper, we present a new method to design learning machines with input, hidden, and output layers. To train the learning machine to respond to samples, we develop a Monte Carlo (MC) algorithm. In fact, Hagan, Demuth, and Beale pointed out that randomly searching for suitable weights might be a possible way. It was abandoned, however, because they did not believe it to be practicable [2] owing to the computational difficulty; they turned instead to developing the BP algorithm. We revive this idea since, firstly, it is available for finding solutions of optimization problems with the complex restrictions demanded by our

generalization strategies. Secondly, and essentially, the speed of today's computers has greatly improved. Another equally important factor is that our algorithm evolves only a small part of the learning machine. At each round of adaptation, we adapt only one parameter, accepting the change if it does not make the performance worse, instead of evolving the entire system. For this reason, the training time is acceptable for usual applications. As an MC algorithm, it has high flexibility and is applicable to various neuron transfer functions and cost functions, and to either continuous or discontinuous, even discrete, system parameters. Using this algorithm, we can directly search for the proper projection vectors in the usual vector space. We thus call the learning machine designed by our method the general vector machine (GVM). In pursuing the generalization ability, we classify the prior knowledge into common and problem-dependent classes, and suggest corresponding strategies to maximally integrate them into the learning machine. That objects with small differences in features should belong to the same class may be the most common prior knowledge humans use to generalize experience. To incorporate this kind of prior knowledge, a learning machine should be insensitive to small input changes, i.e., it should have small input-output sensitivity. Mathematically, the amplitudes of derivatives, usually the second-order derivatives, define the structural risk and measure the input-output sensitivity of functions. Note that the concept of structural risk employed in the SVM is not identical to this general mathematical concept. Minimizing such a structural risk is indeed applied as a basic principle in the learning problem. However, in the case of multilayer learning machines, the structural risk, as a complex function of the system parameters and input vectors, is quite difficult to calculate, let alone to minimize. In the SVM method, the separating margin in the feature space between different classes is

maximized to decrease the input-output sensitivity. This strategy avoids the direct calculation of the structural risk. The problem is that extremely decreasing the structural risk does not always lead to the best machine, as our application examples will indicate. Another kind of common prior knowledge is: learning machines supervised under the same training method with the same training set should have as small an output uncertainty as possible. This is a basic requirement for a recognition system; it should have a sufficient degree of stability against small parameter changes, otherwise the system lacks reliability. For example, the brain is remarkably robust; it does not stop working just because a few cells die. We apply the design risk minimization (DRM) strategy -- learning machines with smaller design risk are better ones -- as a basic principle to maximally incorporate the common prior knowledge. The design risk is defined as the degree of dispersion of the outputs of learning machines designed using the same training set. It is just the error bar usually used to indicate the precision of experimental data. Minimizing it minimizes the uncertainty of the outputs of different learning machines designed by the same algorithm. Furthermore, on one hand, a small design risk means a small input-output sensitivity; the design risk can thus impose restrictions on the structural risk. On the other hand, the DRM principle is not equivalent to input-output sensitivity minimization. The DRM can approach the minimum of the real risk better than the input-output sensitivity minimization strategy. In the case of function approximation and smoothing, we will show that the minimum of the design risk is usually consistent with that of the real risk, while that of the structural risk has a degree of divergence from the real-risk minimum. For pattern recognition and object classification, this is also true when the real patterns can be considered as random variations of the training samples. We will explain why the DRM can achieve a better result and why the SRM may

induce the divergence. However, there is no rigorous proof of this. Indeed, the same is true for the SRM principle as well as Occam's razor principle, for which there is no rigorous proof guaranteeing convergence to the real-risk minimum when only a finite sample set is available. A particular set of samples has its own special background knowledge. In pattern recognition, for example, the patterns may have a special geometric symmetry. In this case, extremely minimizing either the structural or the design risk may result in deviation from the natural geometry, and thus decrease the correct rate of recognition. The knowledge about the geometric symmetry of patterns is a typical kind of problem-dependent prior knowledge. Besides, the physical interpretation of the input vectors of samples may also involve problem-dependent prior knowledge, such as the knowledge behind the physiological and biochemical indexes of a medical inspection. Problem-dependent prior knowledge is present in function approximation and smoothing, too, such as knowledge about the goal function. To maximally incorporate the problem-dependent prior knowledge, one needs to apply individualized strategies, including proper pretreatment of the sample input vectors, proper neuron transfer functions and cost functions, etc. To search for the minimum of the design risk effectively, we introduce a set of control parameters. This set includes parameters specifying the ranges of the weights connecting different neurons, the coefficients of the neuron transfer functions, and the biases of the neurons. The number of hidden-layer neurons and the width of the separation margin are also included. The control of the structural risk of the learning machine is approached by controlling the structural risk of individual neurons.

Using the design-risk-control strategy, the goal is to find the best control parameter set instead of finding the best learning machine. We introduce two performance indexes, the design risk and the average correct rate on the test set (or on a spurious test set), to identify the best control parameter set. In more detail, for function approximation and smoothing, the design risk is applied to uniquely define the best control parameter set, since we will demonstrate that the design risk gives a good estimate of the real risk. In this case, the design program is stopped when the design-risk minimum is approached. For other problems, we apply the average correct rate on the test set as the dominant performance index. We show that for pattern recognition and object classification, minimizing the design risk usually gives the highest average correct rate when the real patterns can be considered as random variations of the sample patterns. In this case, the two indexes are consistent with each other. When problem-dependent prior knowledge is involved, however, the two indexes may be inconsistent with each other. In this case, the best control parameter set should be decided as a balance between the two performance indexes. In more detail, if the design risk has become acceptable, we identify the control parameter set with the maximum average correct rate as the best control parameter set. If it is still too big, we have to search along the direction in which the design risk decreases continuously, until the risk is acceptable, though in that case the average correct rate becomes smaller. The MC algorithm provides a natural way to calculate the design risk as well as the average correct rate, since for a given training set one can design a number of GVMs at a fixed control parameter set. Each round of training starts by setting all the system parameters to random numbers in their available ranges. The initial system parameters are independent and identically distributed, and the training proceeds by MC adaptations.

These GVMs are thus statistically identical, and every GVM at the best control parameter set should have the same anticipated performance on the real set and can equally be applied as the learning machine to perform the task. This is different from previous methods, which select the learning machine with the best performance on the test set as the performing system, in which case one cannot guarantee that it also has the best performance on the real set. We can further construct the performing system by combining a sufficient number of GVMs designed at the same control parameter set and with the same training samples. We call it the joint GVM (J-GVM). The J-GVM can dramatically decrease the design risk as well as the structural risk. Moreover, the J-GVM may achieve a good balance between the goal of maximally extracting the features of input vectors and the goal of minimizing the risks. It can therefore remarkably improve the generalization ability for small training-set problems, as will be shown by our examples. The idea of the J-GVM is similar to the ensemble method [11]. The difference is that, besides being designed under the supervision of the DRM strategy, the GVMs used to construct the J-GVM are trained on the same training sample set. The rest of the paper is organized as follows. In the next section we introduce the architecture of the GVM, emphasizing the differences from that of the SVM. In Section 3 we present the MC algorithm for training a GVM to respond to samples. Section 4 introduces the idea of controlling the structural risk of the learning machine by controlling that of single neurons; the main control parameters are introduced in this section. Section 5 presents the DRM principle and explains why minimizing the design risk approaches the best fitting in function approximation and smoothing. Section 6 introduces several strategies for maximally incorporating the problem-dependent

prior knowledge. Section 7 constructs the J-GVM. The next three sections are application examples, covering function approximation and smoothing, pattern recognition, and classification, respectively. The function approximation examples show the consistency between the design-risk minimum and the best fitting; the fitting precision outperforms the SVM method as well as the usual spline algorithm. Pattern recognition is performed on the famous MNIST set of handwritten digits [12]. Our purpose is to show how to perform this kind of task using a GVM or a J-GVM. We focus mainly on the case of a small training set; the recognition rate achieved using all the training samples is also shown for comparison. By directly using the normalized gray-scale images without special preprocessing, so as to compare the algorithms fairly, we obtain recognition rates beyond the corresponding records of the BP neural network, the SVM, and even complex learning systems supervised by deep learning. The classification is performed on the Wisconsin breast cancer database [13-15]. This example fully reveals the advantage of our method for small training sets. The last section summarizes the main ideas and results. A particular application of our method, washing out the bad samples, is demonstrated at the end of that section.

2. The model

Figure 1. The architecture of a GVM (a) and a typical SVM (b). In (a), an input vector $(x_1, \dots, x_4)$ feeds hidden neurons $f(x, w_i)$ built on general weight vectors $w_i$ and two output neurons $y_1, y_2$; in (b), the input feeds kernels $K(x, x_i)$ built on support vectors $x_i$ and a single output $y$.

We study the three-layer learning machine composed of input, hidden, and output layers. The numbers of neurons in the input, hidden, and output layers are M, N, and L, respectively. The dynamics is given by the following formulas.

Hidden layer:

$$y_i = f(\beta_i h_i), \qquad h_i = \sum_{j=1}^{M} w_{ij} x_j + b_i, \qquad i = 1, \dots, N, \qquad (1)$$

where $y_i$, $f$, $h_i$, $\beta_i$, and $b_i$ respectively represent the output, neuron transfer function, local field, transfer-function coefficient, and bias of the $i$th neuron in the hidden layer. Here $x_j$ is the $j$th component of an input vector, and $w_{ij}$ is the weight connecting the input $x_j$ and the $i$th neuron in the hidden layer.

Output layer:

$$y_l = h_l, \qquad h_l = \sum_{i=1}^{N} w_{li} y_i, \qquad l = 1, \dots, L. \qquad (2)$$

Similarly, $y_l$ and $h_l$ are the output and the local field of the $l$th neuron in the output layer, and $w_{li}$ is the weight connecting $y_i$ and the $l$th neuron in the output layer. $M$ is the dimension of the input vector, $N$ is the number of neurons in the hidden layer, and $L$ is the dimension of the output vectors. Here we apply a linear transfer function to the output layer to simplify the analysis. For function approximation, it is applicable directly. For pattern recognition and classification, one can further apply nonlinear transfer functions to assign labels to output vectors after finishing the training. Figure 1(a) shows the architecture of a GVM, while Figure 1(b) shows that of a typical SVM. We plot two output neurons in the former and a single one in the latter to emphasize that our method is directly applicable to multi-classification problems, while the SVM method is usually for binary decision problems. For practical applications of the SVM method, one usually designs several binary-classification machines to perform the multi-classification.
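To make the architecture concrete, here is a minimal NumPy sketch of the forward map of Eqs. (1)-(2); the function and variable names are ours, and the $\pm 1$ output weights anticipate the simplification adopted for training in Section 3.

```python
import numpy as np

def gvm_forward(x, W, beta, b, w_out, f=np.tanh):
    """Forward pass of a three-layer GVM, following Eqs. (1)-(2).

    x      : input vector, shape (M,)
    W      : hidden weight matrix, shape (N, M); rows are the general vectors w_i
    beta   : transfer-function coefficients, shape (N,)
    b      : hidden biases, shape (N,)
    w_out  : output weights, shape (L, N)
    f      : neuron transfer function (sigmoid/tanh or Gaussian)
    """
    h = W @ x + b                  # hidden local fields, Eq. (1)
    y_hidden = f(beta * h)         # hidden outputs
    return w_out @ y_hidden        # linear output layer, Eq. (2): y_l = h_l

# Example: a 4-16-2 machine with random parameters
rng = np.random.default_rng(0)
M, N, L = 4, 16, 2
W = rng.uniform(-1, 1, (N, M))
beta = rng.uniform(-0.5, 0.5, N)
b = rng.uniform(-1, 1, N)
w_out = rng.choice([-1.0, 1.0], (L, N))    # w_li = +/-1, as in Sec. 3
print(gvm_forward(rng.normal(size=M), W, beta, b, w_out))
```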

The essential difference from the SVM is that we use general vectors $w_i \in R^M$ to replace the support vectors $x_i \in R^M$. To find the solution for the general vectors, we apply the MC algorithm described in the next section.

3. Monte Carlo algorithm

The MC algorithm is for establishing the correct response on the training set. A three-layer network maps input vectors to output vectors in two steps of transformation: the hidden layer maps the M-dimensional input vectors of samples into N-dimensional vectors in the feature space, and the output layer then maps them into L-dimensional output vectors. The two layers resemble two coupled mirrors. Changing both mirrors simultaneously, or fixing one and changing the other, can both establish the desired input-output correspondence. The SVM algorithm applies the former strategy. It adjusts the hidden layer by selected support vectors $x_i \in R^M$, and calculates the output layer by linear optimization theory. Introducing the support vectors is the soul of the SVM method. It dramatically decreases the freedom in choosing the hidden-layer parameters, and reduces the solution to a linear optimization problem in the feature space. This treatment, on the other hand, imposes a restriction on the projection vectors, since the support vectors can only be selected from the input vectors of the samples. Our idea is different. Let us denote the $i$th row of the weight matrix by $w_i \in R^M$, and call it the $i$th weight vector of the matrix. We give up the support vectors and directly apply the weight vectors to collect the features of the input vectors. To perform the training, we randomly initialize the parameters in the output layer and fix them afterwards, and then adjust the hidden-layer mirror, i.e., the weight vectors $w_i$ as

well as the transfer-function coefficients and neuron biases, to find the solution. Because there is a huge number of parameters in the hidden layer, the probability of finding solutions with the fixed output layer is still quite high. One can in principle adjust the parameters in both layers simultaneously to find the solution. However, once the output-layer mirror changes, the hidden-layer mirror has to adjust accordingly to match the change, which may consume abundant computing time. For the sake of simplicity, we set $w_{li} = \pm 1$ randomly in the output layer. The parameters in the hidden layer are also randomly initialized within their available ranges (to be defined in the next section). To supervise the training, a cost function $F(y^\mu, t^\mu)$, $\mu = 1, \dots, P$, is constructed from the training samples, where $t^\mu$ represents the actual output of the learning machine under the input $x^\mu$. In Section 6 we show in detail how to construct $F$. To start the training, we set all of the parameters to random numbers in their available ranges, and calculate the local fields of the neurons, $h_i^\mu, h_l^\mu$, $\mu = 1, \dots, P$; $i = 1, \dots, N$; $l = 1, \dots, L$, as well as the cost function $F$. We then repeatedly apply the following procedure to find the hidden-layer parameters: randomly adapt one hidden-layer parameter to a new value within its available range, and calculate the change in $F$; if $F$ does not become worse, accept the adaptation and renew the local fields as well as the outputs of the neurons and the cost function; otherwise give up the adaptation. In more detail, the hidden layer is renewed by the following rules:

(a) If $w_{ij} \to w_{ij} + \Delta$, then $h_i^\mu \to h_i^\mu + \Delta\, x_j^\mu$, $y_i^\mu = f(\beta_i h_i^\mu)$; (3)

(b) If $\beta_i \to \beta_i + \Delta$, then $h_i^\mu$ is unchanged and $y_i^\mu = f\big((\beta_i + \Delta)\, h_i^\mu\big)$; (4)

(c) If $b_i \to b_i + \Delta$, then $h_i^\mu \to h_i^\mu + \Delta$, $y_i^\mu = f(\beta_i h_i^\mu)$. (5)

The output layer is renewed by

$$h_l^\mu \to h_l^\mu + w_{li}\,\big(y_i^\mu - y_i^\mu(\mathrm{old})\big), \qquad (6)$$

where $y_i^\mu(\mathrm{old})$ represents the value before the adaptation. The renewal operations are performed over $\mu = 1, \dots, P$; $l = 1, \dots, L$. For a particular application, one can specify only certain classes of parameters as changeable and keep the others fixed. For designing learning machines with parameters taking continuous values, one can set $\Delta$ to be a small random number; it is not necessary to limit $\Delta$ to be arbitrarily small. For parameters with discrete states [16-17], $\Delta$ is set to push the parameter to jump from the present state to a random neighboring state. The training is stopped when $F = 0$ or after a sufficiently long training time, $t > t_0$. Our algorithm does not need to evolve the entire network, which would require about $O(NMP + NLP)$ multiply-add operations. Each adaptation of $F$ induces only $O(P + LP)$ multiply-add operations, and an accepted adaptation is optimal for the whole training set in the statistical sense. The examples shown in the application sections will indicate that the training time of our algorithm is practical for various applications. Applying the MC algorithm, the goal of extracting the features of samples is approached by simply projecting the input vectors onto the weight vectors in the hidden layer. The projections are considered as the features. Though a single weight vector may extract less information than a support vector does, the unlimited number of weight vectors can offset this drawback. Indeed, no one has proved that support vectors are the best ones. When the training set is quite small, the support vector method has an obvious limitation. We suppose that the optimal vector set that maximally extracts the sample features should be searched for over the entire vector space, and the MC algorithm has such a potential.
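The following is a minimal sketch of this training loop for the single-output case ($L = 1$), using the Gaussian transfer function of Eq. (11) and the empirical risk of Eq. (21) as the cost function; the identifiers, the proposal scheme (a fresh uniform draw within the available range), and the step budget are our choices, not a reference implementation.

```python
import numpy as np

def f(z):                      # Gaussian transfer function, Eq. (11)
    return np.exp(-z**2)

def train_gvm_mc(X, T, N=100, c_beta=0.5, c_w=1.0, c_b=1.0,
                 steps=200_000, seed=0):
    """Monte Carlo training sketch (Sec. 3): adapt one hidden-layer
    parameter at a time, accepting if the cost does not get worse."""
    rng = np.random.default_rng(seed)
    P, M = X.shape
    W = rng.uniform(-c_w, c_w, (N, M))
    beta = rng.uniform(-c_beta, c_beta, N)
    b = rng.uniform(-c_b, c_b, N)
    w_out = rng.choice([-1.0, 1.0], N)        # fixed output layer, w_li = +/-1
    H = X @ W.T + b                           # local fields for all samples
    Y = f(beta * H)
    out = Y @ w_out
    cost = np.mean((T - out) ** 2)            # empirical risk, Eq. (21)
    for _ in range(steps):
        i = rng.integers(N)
        kind = rng.integers(3)                # which parameter class to adapt
        if kind == 0:                         # rule (a): w_ij
            j = rng.integers(M)
            new = rng.uniform(-c_w, c_w)
            H_i, bta = H[:, i] + (new - W[i, j]) * X[:, j], beta[i]
        elif kind == 1:                       # rule (b): beta_i
            new = rng.uniform(-c_beta, c_beta)
            H_i, bta = H[:, i], new
        else:                                 # rule (c): b_i
            new = rng.uniform(-c_b, c_b)
            H_i, bta = H[:, i] + (new - b[i]), beta[i]
        Y_i = f(bta * H_i)
        out_new = out + w_out[i] * (Y_i - Y[:, i])     # Eq. (6)
        cost_new = np.mean((T - out_new) ** 2)
        if cost_new <= cost:                  # accept if not worse
            if kind == 0: W[i, j] = new
            elif kind == 1: beta[i] = new
            else: b[i] = new
            H[:, i], Y[:, i], out, cost = H_i, Y_i, out_new, cost_new
        if cost < 1e-4:                       # stop condition F_e < 10^-4
            break
    return W, beta, b, w_out, cost
```

Note that only the $i$th column of the stored local fields is touched per proposal, which is exactly why one adaptation costs $O(P + LP)$ operations rather than a full re-evaluation of the network.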

In principle, more weight vectors mean more feature information. As a consequence, we prefer large learning machines to small ones. The overtraining problem induced by a large network size can be suppressed by controlling the design risk, using the strategies introduced in the next two sections.

4. Controlling the structural risk

In this section we show how to control the structural risk of a GVM by controlling that of single neurons. The response of a function to a small input change can be expressed as the Taylor expansion

$$\Delta y \sim \sum_j \frac{\partial \phi}{\partial x_j}\,\Delta x_j + \frac{1}{2}\sum_{j,k} \frac{\partial^2 \phi}{\partial x_j \partial x_k}\,\Delta x_j \Delta x_k + \cdots. \qquad (7)$$

Therefore, the moments of the derivatives of $\phi(\alpha, x)$ determine the input-output sensitivity of the function. In applications, the second moment is usually applied to define the risk. The second moment of the derivative of the $i$th hidden-layer neuron with respect to the $j$th and $k$th components of an input vector is $\phi''_{ijk} = \partial^2 y_i / \partial x_j \partial x_k$, which can be derived explicitly:

$$\phi''_{ijk} = \beta_i^2\, w_{ij} w_{ik}\, f''(\beta_i h_i). \qquad (8)$$

All of the components over the indexes $j$ and $k$ determine the structural risk of the $i$th neuron. The structural risk of a GVM is a linear combination of the single-neuron risks, since we employ linear neurons in the output layer; i.e., for the $l$th output component of a GVM,

$$\frac{\partial^2 y_l}{\partial x_j \partial x_k} = \sum_{i=1}^{N} w_{li}\, \phi''_{ijk}.$$

The structural risk of a GVM is thus determined by $LM^2N$ terms of second derivatives in total. One may employ the average amplitude

$$R_S = \left\langle \left| \sum_{i=1}^{N} w_{li}\, \phi''_{ijk} \right| \right\rangle \qquad (9)$$

to measure the structural risk of a GVM, where $\langle \cdot \rangle$ represents the average over the index triples $(l, j, k)$. This is a hard task even to calculate, let alone to minimize. Following Eq. (8), the risk of a neuron is determined by the product of $\beta_i^2$, $w_{ij}$, $w_{ik}$, and $f''$, and it is therefore controlled by the amplitudes of these parameters. By limiting the parameters to the ranges

$$\beta_i \in [-c_\beta, c_\beta], \qquad w_{ij} \in [-c_w, c_w], \qquad (10)$$

and choosing neuron transfer functions with bounded second derivatives, we can control the amplitude of $\phi''_{ijk}$. Typical neuron transfer functions suitable for this requirement include

$$f(z) = e^{-z^2} \qquad (11)$$

with $f'' \in [-2, 0.9]$, and

$$f(z) = \tanh(z) \qquad (12)$$

with $f'' \in [-0.8, 0.8]$. The former is called the Gaussian neuron transfer function and the latter the sigmoid neuron transfer function. As the structural risk of a GVM is a linear combination of the risks of the neurons, it can be controlled by the parameters $c_\beta$ and $c_w$, which are called the control parameters. Nevertheless, for the zero-input vector and its nearby vectors, $f''$ takes the same value $f''(0)$ for all of the hidden-layer neurons if $b_i = 0$, following Eq. (1), which results in input-vector-value-dependent risks. To avoid this situation, the parameter $b_i$ is given a role in controlling the risk. Without loss of generality, suppose the input vectors are symmetric about the origin. We assign random values in the range

$$b_i \in [-c_b, c_b] \qquad (13)$$

with $c_b \sim \max_{i,j} |w_{ij} x_j|$.
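The quoted bounds can be verified numerically; the sketch below evaluates the analytic second derivatives of Eqs. (11)-(12) on a grid (standard calculus, involving no assumptions beyond the transfer functions themselves).

```python
import numpy as np

z = np.linspace(-5, 5, 200001)

# Gaussian transfer function, Eq. (11): f(z) = exp(-z^2)
f2_gauss = (4 * z**2 - 2) * np.exp(-z**2)     # analytic f''
print(f2_gauss.min(), f2_gauss.max())          # approx -2.0 and 0.89

# Sigmoid transfer function, Eq. (12): f(z) = tanh(z)
f2_tanh = -2 * np.tanh(z) / np.cosh(z)**2      # analytic f''
print(f2_tanh.min(), f2_tanh.max())            # approx -0.77 and 0.77
```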

As $b_i$ and $w_{ij}$ take random values, $f''$ takes different values even for the zero input vector, and thus the systematic dependence on the input-vector value is greatly reduced. The parameter $c_b$ is thus also applied as a control parameter. In principle, as a multi-parameter optimization problem, the globally optimal solution should be found by searching the control-parameter space. Fortunately, as can be seen from Eq. (8), it is the product $\beta_i^2 w_{ij} w_{ik} f''$ that determines the risk. One can thus fix the control parameter $c_w$ and search along the parameter $c_\beta$ alone to find the solution. Meanwhile, the control parameter $c_\beta$ also controls the range of the argument of $f$, and hence the distribution of $f''$. Sometimes one may also search for a more proper $c_b$, but our investigation indicates that the result is not sensitive to it, except in the particular situation near the zero input vector. In the SVM method, the polynomial transfer function

$$f(z) = z^n \qquad (14)$$

is also frequently employed. Since $f''(z) = n(n-1)z^{n-2}$, the range of the second derivative depends explicitly on the input vectors. It has a relatively big amplitude when $n$ is big, and may increase with the amplitude of the input vectors. This transfer function may therefore induce a higher structural risk. Hence, when we sometimes have to apply this function to maximize the problem-dependent prior knowledge, an additional strategy should be employed to suppress the risk.

5. The design risk minimization strategy

Extremely minimizing the structural risk may result in overtraining for function approximation. The goal of function approximation is to find a learning machine $\phi(\alpha, x)$ satisfying $\phi(\alpha, x) \approx g(x)$, where $g(x)$ is the goal function.

Minimizing the empirical risk can approach this relation on the sample set. To generalize it to the whole interval, the usual way is to decrease the structural risk of $\phi(\alpha, x)$ to pursue smoothness. Let $\int |d^2\phi(\alpha, x)/dx^2|\,dx$ and $\int |d^2 g(x)/dx^2|\,dx$ represent the structural risks of the learning machine and the goal function, respectively. Usually, the former is much bigger than the latter, since the learning machine is a big system with a large set of parameters. Therefore, the minimization initially decreases the risk of the former towards that of the latter. However, extreme minimization may make the former smaller than the latter, and thus overtraining occurs. This is possible since $\phi(\alpha, x)$ and $g(x)$ are not identical functions. We argue that the DRM strategy can avoid the over-minimization and lead to the best fitting. To calculate the design risk, we construct $n$ GVMs satisfying $F = 0$ with randomly initialized system parameters at a given set of control parameters. These GVMs are trained using the same set of training samples. We simply take the $\phi(\alpha_i, x)$ of a GVM as the response function $\phi_i(x)$. The squared error of the response functions $\phi_i(x)$, $i = 1, \dots, n$, of the $n$ GVMs,

$$E[\phi] = \frac{1}{n} \sum_{i=1}^{n} \left\langle \big( \phi_i(x) - \bar{\phi}(x) \big)^2 \right\rangle, \qquad (15)$$

defines the design risk, where $\bar{\phi}(x)$ is the average response function. Each $\phi_i(x)$ may be considered as a random fluctuation around the goal function $g(x)$, and thus one may expect $\bar{\phi}(x) \approx g(x)$ for sufficiently large $n$, since the random fluctuations may offset each other. In this case, $E[\phi]$ equals the average fitting error defined as

$$\bar{\Theta}(\phi) = \frac{1}{n} \sum_{i=1}^{n} \left\langle \big( \phi_i(x) - g(x) \big)^2 \right\rangle. \qquad (16)$$
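In practice, the design risk and the average fitting error are simple ensemble statistics; the sketch below evaluates Eqs. (15)-(16) on a common grid, with mock response functions standing in for trained GVMs.

```python
import numpy as np

def design_risk(Phi):
    """Design risk, Eq. (15): mean squared dispersion of the n response
    functions phi_i(x) around their ensemble average; Phi has shape
    (n, n_points), one row per GVM evaluated on a common grid."""
    return np.mean((Phi - Phi.mean(axis=0)) ** 2)

def average_fitting_error(Phi, g):
    """Average fitting error, Eq. (16), when the goal function g(x)
    is known on the same grid."""
    return np.mean((Phi - g) ** 2)

# Toy illustration with a known goal function and mock "GVM" outputs
x = np.linspace(-10, 10, 401)
g = np.sinc(x / np.pi)                                 # sin(x)/x on the grid
rng = np.random.default_rng(1)
Phi = g + 0.01 * rng.normal(size=(500, x.size))        # 500 response functions
print(design_risk(Phi), average_fitting_error(Phi, g))
```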

Therefore, the minimum of the design risk gives that of the average fitting error, and defines the best control parameter set. For pattern recognition, the moments of the derivatives in Eq. (7) alone do not determine the input-output sensitivity. In this problem, the outputs of the learning machine should be separated by a margin for classifying patterns into different classes. If the $\mu$th sample belongs to the $l_0$th class, the learning machine should respond with $y_{l_0} - y_l \ge d > 0$ for $l \ne l_0$. A variation of the $\mu$th sample, $x = x^\mu + \Delta x$, is expected to be classified into the same class as the $\mu$th sample. Inputting the variation into the learning machine, one has $y_l = y_l^\mu + \Delta y_l$. The variation is correctly classified into the $l_0$th class only if $y_{l_0} - y_l > 0$ for $l \ne l_0$; this condition can alternatively be represented as $d + \Delta y_{l_0} - \Delta y_l > 0$. Thus, correct classification is determined not only by the amplitudes of the derivatives of $\phi(\alpha, x)$, which affect the amplitudes of $\Delta y_{l_0}$ and $\Delta y_l$, but also by the separating margin. The input-output sensitivity in this case is the ease with which the condition $d + \Delta y_{l_0} - \Delta y_l > 0$ is violated by input variations $x = x^\mu + \Delta x$. Therefore, the input-output sensitivity here is not equivalent to the mathematical concept of the structural risk. To decrease the sensitivity, one may fix the separation margin and decrease the amplitudes of the derivatives of $\phi(\alpha, x)$, or fix the latter and increase the former. For this reason, the width $d$ of the separating margin is also applied as a control parameter in our algorithm.
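The robustness criterion can be checked directly; the helper below is a hypothetical illustration of the condition $d + \Delta y_{l_0} - \Delta y_l > 0$, with names of our own choosing.

```python
import numpy as np

def still_correct(dy, l0, d):
    """Condition d + dy[l0] - dy[l] > 0 for all l != l0 (Sec. 5): a
    perturbed input stays in class l0, provided the clean sample
    satisfied y[l0] - y[l] >= d."""
    mask = np.arange(dy.size) != l0
    return bool(np.all(d + dy[l0] - dy[mask] > 0))

dy = np.array([-0.6, 0.3, 0.9])            # output deviations from an input variation
print(still_correct(dy, l0=0, d=2.0))      # True: 2 - 0.6 - 0.9 > 0 holds
```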

We argue that the design risk is also a favorable indicator for supervising the learning machine for pattern recognition. We denote the correct rate of a GVM on the test set by $\kappa$ and apply it as the response function to measure the design risk. Designing a set of GVMs to obtain a series of response functions $\kappa_i$, $i = 1, \dots, n$, at a given control parameter set, we obtain two performance indexes: the average correct rate $\bar{\kappa}$ and the dispersion degree of the correct rates. The latter defines just the design risk. We minimize the design risk to pursue a better performance for the following reasons. Firstly, minimizing the design risk is a necessary demand on a design method. If learning machines designed by different users following the same algorithm have a big dispersion in recognizing the same test set, the method lacks reliability. Secondly, minimizing the design risk can suppress the input-output sensitivity of the learning machine, and can maximally incorporate the common prior knowledge that variations of a known pattern should be assigned to the same class. A bigger input-output sensitivity may induce bigger output fluctuations $\Delta y_l$, and in turn a bigger dispersion in the design risk. As a result, minimizing the design risk can usually maximize the average correct rate. This occurs when the real patterns can be considered as random variations of the training samples. In more detail, if the variations of the $\mu$th pattern can be represented by $x = x^\mu + \xi$, with $E[\xi] = 0$ and $E[\xi^2] = \sigma^2$, then minimizing the input-output sensitivity implies that variations with as big a mean square as possible can still be classified into the class of the $\mu$th pattern. It is just this case that matches exactly the common prior knowledge, since "variations of a known pattern should be assigned to the same class" puts no particular restriction on the variation. Grasping this point may help us understand the applicable scope of the so-called design principles, such as Occam's razor, the SRM principle, and the DRM strategy of this paper. A principle is for maximizing the common

prior knowledge and is thus applicable generally. However, it cannot be expected to always give the optimal solution for particular problems, since it may miss the problem-dependent prior knowledge.

6. Incorporating the problem-dependent prior knowledge

For function approximation, the problem-dependent prior knowledge is about the goal function underlying the data set. In this case, choosing a proper neuron transfer function may better match the features of the data. For example, if the data come from a polynomial goal function, applying a polynomial transfer function achieves a better fitting than using a sigmoid one. Our application examples will show that the more proper neuron transfer function can also be selected by checking the design risk, i.e., the design risk becomes smaller when a better neuron transfer function is applied. For pattern recognition, there are many types of problem-dependent prior knowledge, such as the interpretation of the input vectors, the rotational and translational invariance of patterns, the special geometric symmetry of patterns, etc. Applying individualized strategies to maximally utilize the problem-dependent prior knowledge is essential for further improving the generalization ability. The physical interpretation of the input vectors may involve problem-dependent prior knowledge. When a variation $x = x^\mu + \Delta x$ of the $\mu$th sample is input to a GVM, deviations $h_i = h_i^\mu + \Delta h_i$ and $h_l = h_l^\mu + \Delta h_l$ of the local fields in the hidden layer and in the output layer are induced in turn. In medical diagnosis, for example, each component of an input vector describes a biochemical indicator. As in the Wisconsin breast cancer database [14], the components are endowed with the meaning that the lower the value the more normal, and the higher the value the more likely

malignant. As a result, a negative $h_i$ represents the normal, and a positive $h_i$ indicates divergence from the normal. To match this feature, the sigmoid transfer function is preferable to the Gaussian transfer function. In other examples, as in pattern recognition, an input vector encodes a two-dimensional pattern. A component of such a vector has a standard reference value for a specific pattern, and the corresponding component of a new pattern with either a greater or a smaller value represents a deviation from the reference value. In this case, the Gaussian neuron transfer function may match the features better. Similarly, the properties of $h_l$ should also match the symmetry features of the samples. We introduce several cost functions to control the distribution of the local fields of the neurons in the output layer. The first one is defined by

$$F_1 = \frac{1}{PL} \sum_{\mu=1}^{P} \sum_{l=1,\; h_l^\mu s_l^\mu < d}^{L} \big( h_l^\mu s_l^\mu - d \big)^2, \qquad (17)$$

where $s_{l_0}^\mu = 1$ and $s_l^\mu = -1$ for $l \ne l_0$ if the $\mu$th sample belongs to the $l_0$th class. When it is minimized to $F_1 = 0$, the local fields of the neurons in the output layer satisfy $h_l^\mu s_l^\mu \ge d$ for all training samples; in this case, $h_{l_0}^\mu \ge d$ and $h_l^\mu \le -d$ otherwise. The distribution of the local fields is illustrated schematically in Fig. 2(a). We call this function the steep-margin cost function. The second one is

$$F_2 = \frac{1}{PL} \sum_{\mu=1}^{P} \sum_{l=1}^{L} \big( h_l^\mu s_l^\mu - d \big)^2, \qquad (18)$$

which compresses $h_l^\mu s_l^\mu$ around $d$. The resulting distribution of $h_l^\mu s_l^\mu$ is illustrated in Fig. 2(b). We call it the Gauss-margin cost function. In certain cases, $F_2$ may not be approachable due to the huge number of training samples. In this situation, introducing the following cost function may be helpful:

$$F_3 = \frac{1}{PL} \left[ \sum_{\mu=1}^{P} \sum_{l=1,\; h_l^\mu s_l^\mu < d_1}^{L} \big( h_l^\mu s_l^\mu - d_1 \big)^2 + \sum_{\mu=1}^{P} \sum_{l=1,\; h_l^\mu s_l^\mu > d_2}^{L} \big( h_l^\mu s_l^\mu - d_2 \big)^2 \right]. \qquad (19)$$

Its minimization drives $h_l^\mu s_l^\mu$ into the interval $[d_1, d_2]$. When $d_1 = d$ and $d_2 \to \infty$ it approaches $F_1$, while when $d_2 \sim d_1$ it approaches $F_2$. These cost functions can also be presented in other ways. For example, the second cost function can be modified as

$$F_2' = \frac{1}{PL} \sum_{\mu=1}^{P} \sum_{l=1,\; l \ne l_0}^{L} \big( h_{l_0}^\mu - h_l^\mu - d \big)^2. \qquad (20)$$

By minimizing this function, $h_{l_0}^\mu$ and $h_l^\mu$ for $l \ne l_0$ are separated by a distance $d$ for each sample pattern, but the local fields need not be distributed around the origin.
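Below is a sketch of the three cost functions, assuming the output local fields $h_l^\mu$ and the targets $s_l^\mu = \pm 1$ are stored as $(P, L)$ arrays; the function names are ours.

```python
import numpy as np

def steep_margin_cost(H, S, d):
    """Steep-margin cost F1, Eq. (17): only terms with h*s < d contribute."""
    v = H * S - d
    return np.mean(np.where(v < 0, v**2, 0.0))

def gauss_margin_cost(H, S, d):
    """Gauss-margin cost F2, Eq. (18): compresses h*s around d."""
    return np.mean((H * S - d) ** 2)

def interval_margin_cost(H, S, d1, d2):
    """Cost F3, Eq. (19): drives h*s into the interval [d1, d2]."""
    v = H * S
    lo = np.where(v < d1, (v - d1) ** 2, 0.0)
    hi = np.where(v > d2, (v - d2) ** 2, 0.0)
    return np.mean(lo + hi)
```

Note that averaging the masked squared terms over all $P \times L$ entries reproduces the $1/(PL)$ normalization of Eqs. (17)-(19).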

Figure 2. Schematic diagram of the separating margin: (a) the steep margin and (b) the Gauss margin, defined by the cost functions $F_1$ and $F_2$, respectively.

Proper pretreatment of the input vectors is also an effective way to incorporate the problem-dependent knowledge of transformation symmetries. For handwritten digit recognition, the patterns have symmetries under small spatial shifts, small rotations, and small distortions. For man-made objects, spatial shifts and angle variations are important, but distortions may have no counterpart. For these particular problems, generalization based on purely random variations is extravagant. To incorporate the particular symmetry restrictions, one can construct spurious training samples by shifts, rotations, distortions, the tangent-distance technique, etc. [18], and apply them in the training as well.

The drawback is that a big amount of additional calculation is induced for training the machine. Another way is to encode the symmetry properties into the input vectors of the samples. The so-called gradient-based feature-extraction algorithm e-grg [19], with feature vectors encoding eight direction-specific 5x5 gradient images, is one of the top-performing algorithms for this purpose. In these cases, however, especially when particular geometric restrictions are involved, the control parameter set with the maximum average correct rate and that with the minimum design risk may be inconsistent with each other, and the average correct rate on the test set should also be applied as a performance index to identify the best control parameter set.

7. Using a joint learning machine

There is a way to further decrease the risk, i.e., combining a large number of GVMs designed at the same control parameter set to construct a joint learning machine, a J-GVM. The structural risk of a J-GVM is the algebraic average over the GVMs. With the smaller risk, one may expect the J-GVM to perform better than a single GVM. The GVMs composing the J-GVM are statistically identical, since they are obtained with the same training set at the same control parameter set. For an individual GVM, the output involves not only the information related to the training samples but also random noise. The output of a J-GVM is an ensemble average over the many GVMs, so the noise part is suppressed. Based on this consideration, the J-GVM need not necessarily be constructed using GVMs at the best control-parameter set of individual GVMs. For pattern recognition, the application examples will show that a J-GVM constructed at a control parameter set where the GVMs have a relatively big risk may perform better. In such a case, the weight vectors have big freedom to extract the features of the input vectors, and therefore many more features of the samples may be extracted.
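Operationally, a J-GVM is just an output average over statistically identical GVMs; a minimal sketch follows (the names and the toy demonstration are ours).

```python
import numpy as np

def jgvm_predict(machines, x):
    """J-GVM output (Sec. 7): the ensemble average of the outputs of many
    GVMs trained on the same samples at the same control parameters; the
    sample-independent noise in the individual outputs tends to cancel."""
    return np.mean([m(x) for m in machines], axis=0)

# Toy demo: three mock "GVMs" whose outputs differ only by random noise
rng = np.random.default_rng(0)
machines = [lambda x, e=rng.normal(0, 0.1, 3): np.array([1.0, -1.0, -0.5]) + e
            for _ in range(3)]
y = jgvm_predict(machines, x=None)         # x is unused by the mock machines
print(y, "->", int(np.argmax(y)))          # class label from the averaged output
```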

For this reason, we require a sufficient number of GVMs to construct the J-GVM, so as to offset the training-sample-independent fluctuations. This strategy is similar to the conventional ensemble methods [6,11].

8. Application to function approximation and smoothing

We apply an $M \times N \times 1$ GVM for function approximation and smoothing. The MC algorithm is applied to train the GVM to respond to samples, the effect of which is measured by the empirical risk

$$F_e = \frac{1}{P} \sum_{\mu=1}^{P} \big( t^\mu - y^\mu \big)^2, \qquad (21)$$

where $t^\mu$ and $y^\mu$ are the target and actual output for the $\mu$th sample. The training is stopped when the empirical risk is smaller than a threshold or when a maximum number of MC steps is reached. The latter stop condition is for smoothing noisy data sets, in which case the former condition may not be reachable. Our examples include fitting the sin function $g(x) = \sin(x)$, the sinc function $g(x) = \sin(x)/x$, the Hermite 5th polynomial $g(x) = H_5(x)/8$, the Hermite 7th polynomial $g(x) = H_7(x)/16$, and the 2-D sinc function $g(x, y) = \sin(\sqrt{x^2 + y^2})/\sqrt{x^2 + y^2}$. For data smoothing, we add white noise to the sinc function as

$$g(x) = \frac{\sin(x)}{x} + \xi, \qquad E[\xi] = 0, \quad E[\xi^2] = \sigma^2. \qquad (22)$$

8.1 Finding the best control-parameter set

We prepare three noise-free training sets from the goal functions of the sin, sinc, and Hermite 5th polynomial for function approximation; a generation sketch follows.
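The following sketch shows how the training sets of this section might be generated, with the sample counts and intervals as quoted just below; treating the Hermite sets as scaled physicists' Hermite polynomials is our assumption.

```python
import numpy as np
from numpy.polynomial import hermite

rng = np.random.default_rng(0)

def make_set(g, c_x, P=20):
    """P samples (x, g(x)) drawn uniformly from [-c_x, c_x]."""
    x = rng.uniform(-c_x, c_x, P)
    return x, g(x)

x1, t1 = make_set(np.sin, np.pi)                       # one period of sin
x2, t2 = make_set(lambda x: np.sinc(x / np.pi), 10.0)  # sinc(x) = sin(x)/x
x3, t3 = make_set(lambda x: hermite.hermval(x, [0, 0, 0, 0, 0, 1]) / 8, 1.0)

# Noisy sinc set for smoothing, Eq. (22): 100 samples, sigma = 0.1
x4 = rng.uniform(-10, 10, 100)
t4 = np.sinc(x4 / np.pi) + 0.1 * rng.normal(size=100)
```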

Each set has 20 uniformly distributed samples $(x^\mu, g(x^\mu))$ from an interval $x \in [-c_x, c_x]$. The set of the sin function covers one period, with $c_x = \pi$. The sinc function is on the interval with $c_x = 10$. The domain of the Hermite polynomial is $[-1, 1]$, with $c_x = 1$. The stop condition is $F_e \le 10^{-4}$ for all three sets. One noisy set from the sinc function, i.e., 100 samples with $\sigma = 0.1$ as in the reference [4], is applied for data smoothing. For this set, the training is stopped after $10^5 N$ MC steps. The Gaussian transfer function is adopted for all training sets. The parameters of the GVM are limited to the intervals $\beta \in [-c_\beta, c_\beta]$, $w \in [-10/c_x, 10/c_x]$, and $b \in [-10, 10]$, respectively. The factor $1/c_x$ results in $\max |w_{ij} x_j| \le 10$ and $c_b = 10$, which guarantees that the optimal value of $c_\beta$ remains roughly universal for different sample sets (see the examples below). With these settings, only $c_\beta$ remains as a control parameter. At each point of $c_\beta$, 500 GVMs are designed with random initializations within the specified ranges of the system parameters. They are applied to calculate the average structural risk, the average fitting error, and the design risk. Because there is only one neuron in each of the input and output layers, the structural risk defined by Eq. (9) can be calculated directly. The first row of Fig. 3 shows $\langle R_S \rangle - R_S^g$, and the second row shows $\bar{\Theta}(\phi)$ and $E[\phi]$, as functions of the parameter $c_\beta$ for the four sample sets, respectively. The average structural risk $\langle R_S \rangle$ is calculated over the risks $R_S$ of the GVMs. The structural risk $R_S^g = \langle |d^2 g(x)/dx^2| \rangle$ of the goal function is a constant and is applied as a reference line for $\langle R_S \rangle$. It can be seen that $\langle R_S \rangle$ decreases rapidly with decreasing $c_\beta$. This fact indicates that the risk $R_S$ can indeed be decreased by decreasing that of the individual neurons. With the further decrease of $c_\beta$, $\langle R_S \rangle$ shows a minimum. However, it is usually

inconsistent with that of $\bar{\Theta}(\phi)$. For the first and last training sets, the differences between the minima of $\langle R_S \rangle$ and $\bar{\Theta}(\phi)$ are slight, while for the second and third sets the differences are remarkable. Particularly, for the third set, the minimum of $\langle R_S \rangle$ is around $c_\beta = 1.0$, while the minimum of $\bar{\Theta}(\phi)$ appears at a visibly different $c_\beta$ with a smaller fitting error. Therefore, minimizing the structural risk of a learning machine does not necessarily converge to the best machine. It can be seen from the figure that the difference is induced by over-minimization, i.e., the risk of the learning machine is minimized to a value smaller than that of the goal function. In the cases of Figs. 3(b) and 3(c), $\langle R_S \rangle - R_S^g$ shows negative intervals, and the differences between the minima of $\langle R_S \rangle$ and $\bar{\Theta}(\phi)$ are remarkable. In the cases of Figs. 3(a) and 3(d), $\langle R_S \rangle - R_S^g$ remains non-negative, and the minima of $\langle R_S \rangle$ are close to those of $\bar{\Theta}(\phi)$, correspondingly. Nevertheless, because $R_S^g$ is unknown in practical applications, one cannot judge whether the over-minimization occurs, and thus cannot apply the structural-risk criterion with confidence to judge whether the best fitting has been achieved. This effect cannot generally be avoided, since $\phi(\alpha, x)$ and $g(x)$ are in principle different functions; minimizing the risk of the former does not necessarily converge to that of the latter. On the contrary, the minima of $E[\phi]$ are approximately consistent with those of $\bar{\Theta}(\phi)$ for every data set. In the cases of the first two sets, $E[\phi]$ and $\bar{\Theta}(\phi)$ almost coincide with each other, indicating that $\bar{\phi}(x) \approx g(x)$ is satisfied. For the last two sets, $E[\phi]$ and $\bar{\Theta}(\phi)$ show a similar dependence on $c_\beta$, indicating that $\bar{\phi}(x)$ can still give the best approximation to $g(x)$. Therefore, the minimum of the design risk

defines the best control-parameter set for function approximation and smoothing.

Figure 3. The first row shows $\langle R_S \rangle - R_S^g$, and the second row shows $\bar{\Theta}(\phi)$ (up-triangles) and $E[\phi]$ (circles), as functions of $c_\beta$. (a) and (e): the sin function; (b) and (f): the sinc function; (c) and (g): the Hermite polynomial; (d) and (h): the sinc function with noise amplitude 0.1.

The remaining difference between $E[\phi]$ and $\bar{\Theta}(\phi)$ can be understood by investigating the fitting process for the noisy data. At the control parameter set with the minimum design risk, Fig. 4 shows two fitting curves of different GVMs for a set of noisy data created by Eq. (22). These curves converge to almost the same function, but have a systematic deviation from the goal function. Obviously, this is not a problem of the DRM strategy; instead, it is an effect of the finite training samples. For a finite training set, the noise may induce a systematic divergence from the goal function. Suppose we have two training sets, each with just several sample points; then, without additional prior knowledge, we cannot in principle distinguish which one is the noisy data and which one comes from the goal function. The noisy data set itself also defines a goal function. Applying the DRM strategy approaches this

goal function and results in the systematic deviation from the real goal function. To decrease the deviation induced by this mechanism, one has to apply samples with a smaller noise amplitude or obtain more training samples. This effect may also exist for noise-free finite training samples: given just several points, there is big uncertainty in judging which goal function they come from, and applying more training samples decreases the uncertainty. Furthermore, different from the noisy-data case, the deviation may be dramatically suppressed by choosing a more proper neuron transfer function that matches the real goal function better, as will be shown in Section 8.6.

Figure 4. Function smoothing for samples with noise intensity 0.1 (a) and 0.2 (b). 100 samples (stars) are distributed uniformly in the interval [-10, 10]. The goal function is the 1D sinc function (black line). In each plot, two fitting curves are shown, which almost overlap each other.

8.2 Improving the fitting by increasing the learning machine size

As can be seen from Table I, increasing the number of samples can greatly increase the fitting precision. This is generally true for either our algorithm or conventional methods. Here we emphasize that we can also improve the

fitting precision by increasing the number of hidden-layer neurons. Increasing the hidden-layer neurons from 100 to 1000, the fitting precision may be further improved by up to threefold. This is a remarkable difference from conventional methods. The BP method follows Occam's razor principle and pursues neural networks with as small a size as possible; if a hidden layer with many times more neurons than samples were applied, serious over-fitting would arise. For the SVM method, the hidden-layer neurons are limited by the number of samples: the number of hidden-layer neurons cannot exceed the number of samples.

Table I: Fitting precision vs. sample number and machine size, for the sin, sinc, and Hermite data sets with 10 and 20 samples. The rows compare a GVM with N = 100, a GVM with N = 1000, a J-GVM (N = 1000), and the spline algorithm. [Numerical entries of the source table are illegible.]

8.3 Fitting by a J-GVM

To obtain GVMs, the training is stopped when $F_e \le 10^{-4}$; the fitting precision of a single GVM therefore cannot go beyond this threshold. Table I shows that applying a J-GVM usually obtains better results than a GVM. The J-GVM is constructed from the 500 GVMs trained at the best control parameter $c_\beta$, where $E[\phi]$ takes its minimum. For certain data sets the precision can be improved by even one order of magnitude, which is much higher than the empirical risk.

8.4 Comparison with conventional algorithms of function approximation

Table I also shows the corresponding results using the widely applied spline algorithm for function approximation. For the last two sets, GVMs achieve obviously better performance. For the data of the sin function, the

spline algorithm achieves the same precision. However, this is because we apply the training stop condition $F_e \le 10^{-4}$, which limits the precision of the fitting. If one changes the stop condition to $F_e \le 10^{-5}$, one can further improve the precision of the GVMs to the order of $10^{-5}$. The J-GVM is consistently better than the traditional algorithm for every data set. As for comparison with the SVM, we would like to mention the example shown in the textbook of Vapnik [4]. In this example, 100 samples of the sinc function are applied as the training set. When choosing 14 samples from this set as support vectors, the fitting curve already shows big divergences, with an amplitude over 0.1, from the goal function. In our case, even for a training set with only 10 samples, one can reach a precision of $\bar{\Theta}(\phi) \sim 10^{-3}$ using a GVM; the fitting curve is already visually indistinguishable from the goal function.

8.5 Training time

For the sake of comparison, the training time is measured by the CPU time of a commonly used personal computer (specifically, 2.0 GHz). Figure 5 shows the average training time of a GVM as a function of $c_\beta$ for the sinc goal function with 20 samples. It can be seen that the training time increases rapidly with decreasing $c_\beta$. The training time may also increase exponentially when the GVM becomes too small. We have checked that for $N = 20$ the training fails, as the condition $F_e \le 10^{-4}$ cannot be reached within a reasonable training time. On the contrary, the training time decreases slightly with increasing machine size. Together with the fact that large machines may improve the fitting precision, our algorithm thus prefers large machines to small ones.

The over-fitting problem of large machines can be suppressed by further decreasing the design risk.

Figure 5. Training time as a function of the machine size.

8.6 Improving the fitting by proper neural transfer functions

The BP method is derived with the sigmoid neuron transfer function and thus offers no choice of transfer function. Choosing the transfer function (kernel function) remains a vexing issue in the SVM method; one usually needs to search for a different function for each different problem. For fitting the sinc function, the complex form

$$K(x, x_i) = 1 + x x_i + \frac{1}{2}\, |x - x_i|\, (x \wedge x_i)^2 + \frac{1}{3}\, (x \wedge x_i)^3$$

is employed, where $x_i$ represents a support vector and $x \wedge x_i = \min(x, x_i)$ [4]. Our algorithm does not depend sensitively on the transfer function, as shown by the above examples, where favorable fittings are obtained with the Gaussian transfer function for every data set. However, when the transfer function matches the features of the goal function better, one may obtain better results. Figure 6 shows the fitting results using the Gaussian, sigmoid, and polynomial neural transfer functions for the data sets of the goal functions sin, sinc, and Hermite 5th polynomial. Each set contains 10 sample points. The polynomial transfer function is defined by a sixth-order polynomial, $f(h) = h^6$.

Figure 6. The fitting precision as a function of $c_\beta$ using the Gaussian (a), sigmoid (b), and polynomial (c) neural transfer functions. The squares, stars, and circles are for the data sets from the sin, sinc, and Hermite polynomial goal functions, respectively.

It can be seen that fitting with the Gaussian transfer function achieves better results for all three data sets than with the sigmoid transfer function, but the difference is not remarkable. With the polynomial transfer function, the differences are big. For the data set of the sinc function, only a poor precision is reached; indeed, even the empirical risk can only be minimized down to about $F_e \approx 0.10$ for this set. For the data set of the sin function, a precision of $\bar{\Theta}(\phi) \approx 0.02$ can be reached, but this is still much worse than with the other two transfer functions. For the data set of the Hermite polynomial function, however, a much higher precision is achieved: with the training stop condition $F_e \le 10^{-4}$, the fitting precision remains below $\bar{\Theta}(\phi) \sim 10^{-4}$, as Fig. 6(c) indicates, and with $F_e \le 10^{-6}$ one can indeed achieve $\bar{\Theta}(\phi) \sim 10^{-6}$. These facts imply that the goal function can be recovered with remarkable precision. We have checked that the high-precision fitting can always be achieved for this data set when applying a transfer function $f_n(h) = h^n$ with $n \ge 5$. The perfect precision obviously comes from the fact that the polynomial transfer function matches the features of this data set

well. A similar discussion applies to the results in Figs. 6(a) and 6(b), where the Gaussian transfer function matches the data sets better than the sigmoid one. The more favorable neural transfer function can be chosen following the design-risk criterion or the prior knowledge of the data source: when a more favorable neural transfer function is applied, the design risk becomes smaller.

8.7 Applying the GVM as a universal fitting machine

For practical applications, one does not necessarily need to search for the best control-parameter set. It can be seen from the above examples that the best fitting is achieved around $c_\beta = 0.5$ for all of the data sets. Applying a sufficiently large machine with the Gaussian neuron transfer function, we can perform function approximation for various data sets with $c_\beta$ fixed at 0.5. Figure 7 shows that, applying a GVM with the Gaussian transfer function, we fulfill the fitting well for several complex goal functions. The first data set comes from the sinc goal function on the interval [-20, 20], and the second from the sin goal function on a wide interval. The third set is from the Hermite 7th polynomial, and the last set is from a square wave on [-10, 10]. For each set, only 20 samples are applied. For more complex goal functions, the training may not be achieved (the empirical risk cannot be decreased to the target threshold of $F_e$) at $c_\beta = 0.5$. In this case, one can allow the computer to increase the number of hidden-layer neurons until the training is achieved, as sketched below. Therefore, for usual applications of function approximation, the calculation of $E[\phi]$ using a large number of GVMs can be avoided; this kind of calculation is necessary only when higher fitting precision is essential.
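A sketch of this grow-until-trained recipe, assuming a train_gvm_mc routine like the one sketched in Section 3; the doubling schedule and the size cap are our choices.

```python
def fit_universal(X, T, N0=1000, threshold=1e-4, max_doublings=4):
    """Universal fitting recipe (Sec. 8.7): fix c_beta = 0.5 and grow the
    hidden layer until the empirical risk reaches the target threshold.
    Assumes the train_gvm_mc sketch from Sec. 3 is in scope."""
    N = N0
    for _ in range(max_doublings):
        *params, cost = train_gvm_mc(X, T, N=N, c_beta=0.5)
        if cost <= threshold:
            return params, N
        N *= 2                      # training failed: add hidden neurons
    raise RuntimeError("training did not reach the threshold")
```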

Figure 7. Fitting different data sets using a GVM with the Gaussian transfer function. (a) The sinc function on [-20, 20]. (b) The sin function. (c) The Hermite 7th polynomial on [-1, 1]. (d) A piecewise step function on [-10, 10]. There are 20 samples in each set.

The algorithm can also be applied directly to high-dimensional fitting. The SVM method gives a desirable fitting precision by choosing 153 support vectors from 20x20 samples of the two-dimensional sinc function [4]. Applying our strategy to design a GVM, Fig. 8 shows that a better result is achieved using 10x10 sample points only.

Figure 8. Fitting the 2D sinc function using 10x10 samples with a GVM ($N = 1000$, $c_\beta = 0.5$).

8.8 The role of the bias

Here we provide examples to explain the role of the neuron bias. The dependence of the structural risk on the control parameter $c_b$ is somewhat complex, as this parameter enters $f''$ implicitly. In particular, if one sets $c_b = 0$, then $h_i = 0$ for every hidden neuron under the zero input vector, so that $f''$ takes the same degenerate value for all neurons when applying the above transfer functions; i.e., it leads to the smallest

structural risk with the zero-vector input. As a consequence, the response of the GVM becomes rigid to inputs around the zero vector, and the fitting there may become quite difficult. Figure 9 shows two examples of such an effect, where two GVMs with the Gaussian transfer function designed at $b_i = 0$ are applied to fit the 1D sinc function and the sin function. Around the origin, one can clearly see big deviations from the goal functions, indicating that at the origin the fitting curve is stiff and thus difficult to shape.

Figure 9. Fitting with $b_i = 0$ for the sinc function (a) and the sin function (b).

To disentangle this problem, our idea is to decouple the dependence of $f''$ on the specific values of the inputs, so as to guarantee that the sensitivity of a GVM is determined by the intrinsic behavior of the data set instead of the values of the input vectors. For this purpose, we set $c_b \sim \max_{i,j} |w_{ij} x_j|$, in which case the random initializations of $w_{ij}$ and $b_i$ lead $f''$ to depend insensitively on the specific values of the input vectors.

9. Pattern recognition

We perform a standard handwritten digit recognition task to show how to design the GVM for pattern recognition. The MNIST dataset [13]

has 60,000 training samples and 10,000 test samples. Each sample is represented by a 28x28-dimensional vector. To perform this task, the BP method trains a 28x28-N-10 multi-classifier [13], while the SVM method usually designs ten binary classifiers. The GVM also has the 28x28-N-10 architecture. The input vector $x^\mu$ of the $\mu$th sample represents the $\mu$th pattern. The original data take integer values in (0, 255). We rescale the components as $x_j \to 0.1\,(x_j - 100)$ to form the input variables. The output target vector $y^\mu$ encodes the digit $\nu \in (0, \dots, 9)$, and is defined by $y_l^\mu = 1$ for $l = \nu$ and $y_l^\mu = -1$ otherwise. Such a GVM responds correctly to the training set if only $h_l^\mu s_l^\mu > 0$ for $\mu = 1, \dots, P$; $l = 1, \dots, L$, where $s_l^\mu = \mathrm{sign}(y_l^\mu)$. In this case, the empirical risk vanishes. For the sake of simplicity, we fix $c_w = 1$, $c_b = 100$, and $w_{li} = \pm 1$ below.

9.1 Finding the best control-parameter set

To decrease the input-output sensitivity, we can decrease the control parameter $c_\beta$ at fixed $d$, or increase $d$ at fixed $c_\beta$. In this way, we can search for the best control-parameter set along only one parameter axis. At each control parameter set, $n$ GVMs are designed to calculate the average recognition rate $\bar{\kappa}$, the average structural risk $\langle R_S \rangle$, and the design risk $E[\kappa]$. The design risk is calculated by using the recognition rate of a GVM on the test set as the response function. To estimate the structural risk, only the diagonal second derivatives $\phi''_{ijj}$ of the hidden-layer neurons are included in Eq. (9); otherwise the calculation would be a hard task.
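A sketch of the target coding and the correct-response test of this section; the explicit $\pm 1$ coding is an assumption consistent with $s_l^\mu = \mathrm{sign}(y_l^\mu)$, and the helper names are ours.

```python
import numpy as np

def encode_targets(digits, L=10):
    """Target vectors for MNIST (Sec. 9): +1 on the true class, -1 elsewhere."""
    S = -np.ones((digits.size, L))
    S[np.arange(digits.size), digits] = 1.0
    return S

def all_correct(H, S):
    """The empirical risk vanishes when h_l^mu * s_l^mu > 0 for every mu, l."""
    return bool(np.all(H * S > 0))

x_raw = np.array([0, 128, 255], dtype=float)
print(0.1 * (x_raw - 100))          # the input rescaling used in Sec. 9
```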

We have checked with the full calculation of $\phi''_{ijk}$ and found no qualitative difference. We first show the dependence of $\bar{\kappa}$, $\langle R_S \rangle$, and $E[\kappa]$ on $c_\beta$ with $d$ fixed, on a small training set. We train 500 GVMs with the first 1% of the MNIST training samples, applying $F_1$ as the cost function and the Gaussian function as the neuron transfer function. The results are shown in Fig. 10. The circles and triangles in Fig. 10 show the results with N=1000 at d=30 and with N=3000 at d=100, respectively. It can be seen that $\langle R_S \rangle$ and $E[\kappa]$ decrease monotonically with decreasing $c_\beta$. The recognition rate increases rapidly with decreasing $c_\beta$ initially; after the turning point at $c_\beta = 0.005$, it turns to decrease. Therefore, the best control parameter set is at the minimum of neither the structural risk nor the design risk. In this case, we have to also employ the average recognition rate as a performance indicator. The best control-parameter set is determined by combining the design risk and the average recognition rate, i.e., as a balance between a high recognition rate and an acceptable design risk. In this example, the turning point of the average rate can define the best control parameter set, since there the design risk is acceptably low. Figure 10 reveals more. Firstly, applying large machines can not only increase the recognition rate but also decrease the design risk. At the turning point, the average recognition rate is 88.8% for N=1000 and 89.5% for N=3000, while the design risk is about 0.2% for N=1000 and about 0.1% for N=3000. Secondly, it reveals that the turning point is insensitive to the machine size: for both N=1000 and N=3000, it appears around $c_\beta = 0.005$. We can thus fix the parameter $c_\beta$ at this value in our following studies.

Figure 10. The average recognition rate $\langle\gamma\rangle$ (a), the structural risk $\langle R_S\rangle$ (b), and the design risk $E[\gamma]$ (c) as functions of the control parameter $c$. The stars are for $d = 30$ and $N = 1000$, and the circles for $d = 100$ and $N = 3000$, respectively.

We then study the dependence of $\langle\gamma\rangle$, $R_S$, and $E[\gamma]$ on $d$ at $c = 0.005$. Fig. 11 shows the results for N = 3000. It can be seen that $\langle\gamma\rangle$ initially increases rapidly with increasing $d$, and begins to decrease after the turning point around d = 100. The design risk decreases monotonically with increasing $d$. Thus, over-maximizing the separating margin may also result in overtraining for this data set. The structural risk $R_S$ indeed increases slowly. This is because, though the parameter $c$ is fixed, the MC adaptation may drive the parameters to concentrate slightly towards the boundaries of the specified intervals, which increases the structural risk slightly.

Figure 11. The average recognition rate (a), the structural risk (b), and the design risk (c) as functions of the control parameter $d$ at $c = 0.005$ and $N = 3000$.

In principle, the larger the separating margin, the bigger the probability that a variation of a sample is classified into the same class. This is, however, true only when the test set can be considered as random variations of the training samples. Here we illustrate this guess by examples. We construct the following two test sets using virtual samples. The first one is created by adding random noise to the input vectors of the first 1% of samples, $x_i \to x_i + \varepsilon_i$ with $\varepsilon_i \in (0, E)$ and $E = 80$. For each sample, 10 virtual samples are created, and thus 6000 samples in total are involved in this set. The second one is obtained by shifting each of the 1% sample patterns by one pixel unit to its adjacent positions, which then gives 4800 samples in total. Figure 12 shows the average recognition rate measured on these two test sets for GVMs with N = 3000 as a function of the control parameter $d$. For comparison, the result for the original test set is also shown. The other control parameters are kept the same as in Fig. 11. One can see that, with increasing separating margin, $\langle\gamma\rangle$ for the noise set increases monotonically, while for the other two sets the over-training effect appears after the same turning point. These results indicate that the maximum-margin strategy applied by the SVM method is quantitatively correct for random variations. In other words, it is generally correct for maximizing the common prior knowledge that variations of a known pattern should be assigned to the same class, without particular restrictions on the variations. For practical applications, as for handwritten digits, patterns cannot be considered as completely random variations of the training samples since they are restricted by the particular geometry of digits. Therefore, the separating margin may not generally be applied as the quantitative criterion of the best learning machine.
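Both virtual test sets are easy to construct. The sketch below assumes 28×28 patterns and reads the shift operation as one pixel toward each of the 8 adjacent positions, which reproduces the 600 × 8 = 4800 count in the text; `np.roll` wraps pixels at the border, a simplification of a true shift.

```python
import numpy as np

def shifted_virtual_set(images, labels):
    """Shift each 28x28 pattern by one pixel to its 8 adjacent positions.

    600 base samples -> 600 * 8 = 4800 virtual samples, matching the text.
    """
    shifts = [(di, dj) for di in (-1, 0, 1) for dj in (-1, 0, 1) if (di, dj) != (0, 0)]
    out_x = [np.roll(np.roll(img, di, axis=0), dj, axis=1)
             for img in images for di, dj in shifts]
    out_y = np.repeat(labels, len(shifts))   # 8 copies of each label, in order
    return np.array(out_x), out_y

def noisy_virtual_set(images, labels, E=80, n_copies=10, seed=0):
    """Add uniform noise of amplitude E to each pattern, n_copies times each."""
    rng = np.random.default_rng(seed)
    reps = np.repeat(images.astype(float), n_copies, axis=0)
    return reps + rng.uniform(0, E, size=reps.shape), np.repeat(labels, n_copies)
```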

Figure 12. The average recognition rate as a function of $d$ at $c = 0.005$ and $N = 3000$ for the original test set (up-triangles), the test set of random variations (down-triangles), and the test set of shifted samples (solid triangles).

The dependence of $\langle\gamma\rangle$ on $d$ for the test set of shifted virtual samples is similar to that for the original test set. This fact can be interpreted as the shift operation keeping the geometry of the digit patterns. It also implies that the particular geometry determines the turning point. Because this set gives the same turning point as the original test set, one can apply it to find the best control-parameter set, so that the original samples can all be applied to train the learning machine; in this way the samples are maximally utilized.

9.2 Improving the recognition rate by increasing the machine size

Increasing the machine size can extract more features of the samples, and thus can increase the generalization ability. Figure 13 shows the dependence of the average recognition rate on the machine size, which indicates that increasing the size improves the recognition rate monotonically, though the effect may saturate when the size is large enough. The reason is that each weight vector extracts information of the samples from a different angle, and thus the more weight vectors, the more features of the samples can be extracted.

Figure 13. The dependence of the average recognition rate on the machine size. The hollow circles, squares, up-triangles and down-triangles are for GVMs with four successively larger machine sizes N, correspondingly.

The figure shows that the best control-parameter set also depends on the neuron number N. Since in section 9.1 we have shown that the best value of $c$ is insensitive to the other parameters, we can usually search the space $d \times N$ for the best control-parameter set by fixing $c$ at $c = 0.005$.

9.3 The role of neuron transfer functions and cost functions

For the sake of simplicity, we study here the cost functions $F_1$ and $F_2$, and transfer functions of the Gaussian and sigmoid types. Figure 14 shows the results with the combinations $F_1$-Gaussian, $F_1$-sigmoid, $F_2$-sigmoid, and $F_2$-Gaussian. They are obtained at N = 3000. The training stop conditions are thresholds of order $10^{-4}$ on $F_2$ and $F_1$, correspondingly. Obviously, the cost function $F_1$ with the Gaussian transfer function gives the best result, indicating that Gaussian-type functions match the nature of the spatial patterns better.
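For reference, the sketch below collects the transfer functions discussed here together with their second derivatives; the boundedness of $f''$ (or the lack of it, for the polynomial type used in section 9.4) is what matters for the per-neuron structural risk.

```python
import numpy as np

# Transfer functions and their second derivatives. A bounded f'' keeps the
# per-neuron structural risk under control; the polynomial type has an
# unbounded f'' and hence a remarkably big risk.

def gaussian(z):
    return np.exp(-z**2)

def gaussian_d2(z):            # f''(z) = (4z^2 - 2) exp(-z^2), so |f''| <= 2
    return (4 * z**2 - 2) * np.exp(-z**2)

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def sigmoid_d2(z):             # f'' = f(1 - f)(1 - 2f), bounded on the real line
    f = sigmoid(z)
    return f * (1 - f) * (1 - 2 * f)

def poly_d2(z, n=7):           # f(z) = z^n: f''(z) = n(n-1) z^(n-2), unbounded
    return n * (n - 1) * z**(n - 2)
```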

Figure 14. The dependence of the average recognition rate on the transfer functions and cost functions ($F_1$-Gauss, $F_1$-sigmoid, $F_2$-Gauss, and $F_2$-sigmoid combinations).

9.4 Using a J-GVM

Figure 15 shows $\gamma$, $\langle\gamma\rangle$, and $\gamma_J$ as functions of $d$. At each point of $d$, GVMs are designed using the first 1% of the MNIST samples. Fig. 15(a) shows the results applying the Gaussian transfer function and Fig. 15(b) those applying the polynomial transfer function with n = 7. The cost functions are both of the Gaussian type.

Figure 15. $\gamma$ (circles), $\langle\gamma\rangle$ (solid stars), and $\gamma_J$ (solid circles) as functions of $d$, (a) for the Gaussian transfer function and (b) for the polynomial transfer function, respectively.

One can see that, besides having a high value, the recognition rate of the J-GVM is relatively insensitive to the control parameter.
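A J-GVM decision can be sketched as below; the member-machine interface is assumed, and averaging the output vectors is the only essential step.

```python
import numpy as np

def jgvm_predict(member_outputs):
    """Joint decision of a J-GVM for one input sample.

    member_outputs: array of shape (n_machines, n_classes), one output
    vector per member GVM. The J-GVM output is the ensemble average and
    the predicted class is its largest component.
    """
    avg = np.asarray(member_outputs, dtype=float).mean(axis=0)
    return int(np.argmax(avg)), avg

# toy usage: individually noisy members still give a stable joint decision
rng = np.random.default_rng(0)
members = np.eye(10)[3] + 0.4 * rng.normal(size=(3, 10))   # true class is 3
digit, avg = jgvm_predict(members)
```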

In Fig. 15(a), with the Gaussian transfer function, the J-GVMs designed in $d \in (50, 150)$ all reach comparable recognition rates, and in Fig. 15(b), with the polynomial transfer function, the recognition rate of the J-GVM is only slightly dependent on $d$. This property is an essential advantage of a J-GVM, since the careful search of the parameter space for the best control-parameter set is avoided. By using sufficiently many GVMs to construct the J-GVM, the risk can be suppressed by the ensemble average even when the individual GVMs have relatively big risks. This is why the J-GVM has a high recognition rate and is insensitive to the control parameter.

Fig. 15(a) also shows another effect, i.e., the best control parameter $d$ for the optimal J-GVM is smaller than that for the optimal GVMs. As will also be seen in the next section, this is a common property of a J-GVM. The reason may be as follows. The recognition rate is determined by the features extracted from the samples, and different GVMs extract information from different angles. GVMs at their best control-parameter set have a relatively small design risk, and thus a relatively small dispersion in angles; on the contrary, GVMs with relatively big input-output sensitivity extract feature information from a wider range of angles.

Applying the polynomial transfer function is interesting. Though the recognition rates of the individual GVMs are quite low, the rate of the J-GVM is even higher than that obtained with the Gaussian transfer function. The reason may be that this transfer function matches the nature of digit patterns better: because the change of the gray level of digit patterns is steep, higher-order polynomial transfer functions fit this feature well. Nevertheless, because $f''(z) = n(n-1)z^{n-2}$, the structural risk of the neurons is remarkably big, and individual GVMs with the polynomial transfer function perform badly. By applying a J-GVM, the risk is suppressed by the ensemble average, and the advantage of the high-order polynomial transfer functions emerges.

9.5 Improving the performance further by proper pretreatment of samples

As explained in section 9.1, the maximum-margin strategy is generally applicable when the test samples can be considered as random variations of the training samples. For handwritten digits, as for usual spatial patterns, the variations cannot be considered as random: the particular geometry of spatial patterns excludes most random variations. Creating a virtual sample set that follows the geometric nature of the patterns is a way to avoid such excessive generalization, and the tangent-distance technique [18], involving shift, distortion, rotation, etc., can be used for this purpose. We constructed a spurious sample set in section 9.1 by shifting each of the first 1% of MNIST sample patterns by one pixel unit to its adjacent positions. It was applied as a test set there; here we apply it as a training set. Figure 16 shows that the recognition rate on the test set is dramatically improved compared with that using the original 1% of samples. The main drawback is that the training becomes time-consuming when the amount of samples is too large.

Certain simple pretreatments of the input vectors may also be effective. For example, when we smooth the 1% of samples by a Gaussian convolution with unit standard deviation and apply them to train the learning machine, the recognition rate is also improved; see Fig. 16. A more effective way of encoding the spatial information is the gradient-based feature-extraction technique developed in recent years [19]. A 200-dimensional numeric feature vector encoding eight direction-specific 5×5 gradient images is calculated for each sample using this technique. This is one of the three top-performing representations in [19] and is called e-grg in their paper.
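The Gaussian smoothing pretreatment, for instance, is a one-liner with a standard convolution routine; in this sketch `scipy.ndimage.gaussian_filter` with `sigma=1.0` plays the role of the unit-standard-deviation convolution mentioned above.

```python
import numpy as np
from scipy.ndimage import gaussian_filter

def smooth_samples(images, sigma=1.0):
    """Pretreat 28x28 gray-scale patterns by a Gaussian convolution.

    sigma=1.0 corresponds to the unit standard deviation used in the text.
    """
    return np.array([gaussian_filter(img.astype(float), sigma=sigma)
                     for img in images])
```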

Applying the first 1% of samples pretreated by this technique, we obtain a much higher recognition rate, as shown in Fig. 16. Note that we do not apply 1% of samples randomly chosen from the whole training set as the training samples, as done in ref. [19]. In that way one does not know which samples have actually been chosen, which may hinder a fair comparison among different researchers.

Figure 16. The recognition rate as a function of $d$ at $c = 0.005$ and $N = 3000$ for the J-GVM designed with different training sets constructed from the original 1% of the MNIST data set (original, smoothed, shifted, and gradient-based data).

9.6 The highest record on the data set

As with other training methods for learning machines, increasing the number of training samples increases the recognition rate, and the superiority of the GVM may weaken when the amount of samples is sufficiently large, as with other methods. Fig. 17(a) and Fig. 17(b) show the results of using the first 10% and all of the MNIST samples, respectively. In the first case 50 GVMs, and in the second case 10, are trained at each parameter point, and the J-GVM is constructed from these GVMs. In both cases the Gaussian transfer function is applied and the GVM size is fixed at N = 6000. The cost function $F_1$ is applied, with the training terminated when $F_1$ falls below a preset threshold.

In the training, the normalized gray-scale images are used directly so as to compare the algorithms themselves, removing the improvement resulting from pretreatment techniques. It can be seen that, using all of the training samples, the record is beyond those using the BP method with error rate 1.5% [20], the SVM with error rate 1.4% (combining 10 one-vs-rest binary SVMs into a ten-class digit classifier) [21], and the recently improved deep-learning method with error rate 1.5% [22]. The last record is obtained with a complex five-layer hierarchical model. Therefore, in the case of applying the original training set, our record is competitive. In spite of this, we still emphasize that the superiority of our method lies particularly with small sets of samples.

Figure 17. The recognition rate as a function of $d$ at $c = 0.005$ and $N = 6000$ for the J-GVM designed with the original first 1%, the first 10%, and the complete MNIST training set, correspondingly. The best record reaches 98.8%.

10 Classification

Classification is a special case of pattern recognition. The Wisconsin breast cancer database was established in 1992 with 699 samples [14]. As usual, the first 2/3 of the samples are applied as the training set, and the remainder as the test set. The inputs are 9-dimensional vectors whose components represent features from microscopic examination results and are normalized to take values in [0, 10]. The output indicates benign and malignant patients. This task can be achieved by a GVM with two neurons in the output layer: a patient is classified as benign if the output of the first neuron is bigger than that of the second one, and malignant otherwise.
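The two-neuron decision rule can be written down directly; `classify_patient` and `correct_rate` are hypothetical helper names for this sketch.

```python
import numpy as np

def classify_patient(outputs):
    """Two-output decision rule: benign if the first output is bigger."""
    h = np.asarray(outputs, dtype=float)
    return "benign" if h[0] > h[1] else "malignant"

def correct_rate(machine_outputs, labels):
    """Fraction of test patients whose prediction matches the label."""
    preds = [classify_patient(h) for h in machine_outputs]
    return float(np.mean([p == y for p, y in zip(preds, labels)]))
```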

We first study $\langle\gamma\rangle$ and $E[\gamma]$ as functions of the control parameter $d$, with the other parameters kept fixed at $w_{jk} \in [-1, 1]$, $w_l \in [-1, 1]$, $b_i \in [-10, 10]$, and N = 200. At each set of control parameters, 500 GVMs are used to calculate $\langle\gamma\rangle$ and $E[\gamma]$. Figure 18 shows the results. The symbols are: up-triangles for the $F_1$-sigmoid combination; down-triangles for the $F_1$-Gaussian combination; stars for the $F_2$-sigmoid combination; circles for the $F_2$-Gaussian combination. The stop conditions are thresholds of order $10^{-3}$ on $F_2$ and $F_1$, correspondingly. When the stop conditions cannot be fulfilled within a preset maximum training time, the search along $d$ is ceased.

Figure 18. $\langle\gamma\rangle$ (a) and $E[\gamma]$ (b) as functions of $d$ for the $F_1$-sigmoid, $F_1$-Gauss, $F_2$-sigmoid, and $F_2$-Gauss combinations.

It can be seen that the best result is given by the steep cost function $F_1$ with the sigmoid transfer function. This fact indicates that steep functions match the nature of this sample set well. The reason is that for a component of such a sample vector, a small value means normal, while a large one represents abnormal. With the $F_1$-sigmoid combination, the maximum of $\langle\gamma\rangle$ is approached around $d \sim 16$, after which it decreases slightly. This phenomenon might be explained by the fact that data from medical examinations can be regarded approximately as random variations of standard samples: the microscopic examination may induce random measurement errors, and meanwhile the biochemical indexes themselves may be influenced in a complex way by prompt accidental events of a patient. Fig. 18(b) shows that $E[\gamma]$ decreases monotonically with increasing $d$.

The essential feature exposed here is the big uncertainty. For example, with the $F_1$-sigmoid combination, the dispersion of the correct rate among GVMs is about 0.5% even at d = 16, where the average correct rate takes its maximal value of 99.0%. That is to say, different GVMs obtained by different users following the same training program may show correct rates ranging from 98.5% to 99.5%. The uncertainty is thus a serious problem. This drawback is a consequence of the small sample set, and applying a J-GVM is an effective way to overcome it. Fig. 19 shows the distribution of the recognition rates of GVMs as a function of the control parameter $d$. At each point of $d$, the correct rates of 500 GVMs designed with the $F_1$-sigmoid combination are shown as stars. The average correct rate of the GVMs and the correct rate of the J-GVM constructed from these GVMs are also shown in the figure, as triangles and circles respectively. Fig. 19(a) is for N = 200 and Fig. 19(b) for N = 500.
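The uncertainty discussed here is simply the dispersion of the correct rates across the 500 GVMs trained at one parameter point; a small sketch of this bookkeeping (the helper name is illustrative):

```python
import numpy as np

def rate_statistics(rates):
    """Summarize the correct rates of many GVMs at one parameter point.

    The mean is the average correct rate; the standard deviation is the
    user-to-user dispersion exposed by Fig. 19 (about 0.5% near d = 16).
    """
    r = np.asarray(rates, dtype=float)
    return {"mean": r.mean(), "std": r.std(), "min": r.min(), "max": r.max()}

# toy usage: a 0.5% spread around 99.0% spans roughly 98.5%-99.5%
rng = np.random.default_rng(0)
stats = rate_statistics(0.99 + 0.005 * rng.normal(size=500))
```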

In the case of N = 200, the maximal average correct rate is 99.01%, and in an interval of $d \in [4, 8]$ the rate of the J-GVM stays at 100%. With bigger GVMs, N = 500, the maximal average correct rate approaches 99.30%, and in a wider interval of $d$ beginning at $d = 4$ the correct rate of the J-GVM stays at 100%. This fact again indicates that big machines are more favorable. As to the records of the correct rate, our results are superior to previous studies [14-15], where a record of 97.5% was approached by an SVM-based learning machine.

Figure 19 also exposes the drawback of applying an individual learning machine as the performing machine. Even in the parameter region with a low average correct rate, as at the small-$d$ end in Fig. 19, certain GVMs may give a correct rate of 100%; however, at the same parameter point, another one may reach a much lower rate. Therefore, the correct rate on the test set is not a good indicator variable of the learning-machine performance: the higher record may be just a fluctuation. When applying such a machine to real patients, one cannot expect it to retain the same correct rate. Indeed, the learning machine with the lower record may not necessarily be worse than that with the 100% record in real applications. The J-GVM can avoid this kind of uncertainty.

Figure 19. $\gamma$ (circles), $\langle\gamma\rangle$ (solid stars), and $\gamma_J$ (solid circles) as functions of $d$, (a) for N = 200 and (b) for N = 500.

11 Summary and Discussion

(1) We develop an MC algorithm to gain the correct response to the training set. The basic idea is to drive the local fields of the neurons to move continuously towards the target distribution defined by the cost function. Applying this algorithm, one can obtain three-layer neural networks with either continuous or discrete parameters. The MC algorithm works well mainly because each random adaptation is performed on only one parameter in the hidden layer, and thereby making the decision costs only $O(P + LP)$ multiplication operations, instead of the $O(NMP + NLP)$ operations required for evolving the whole network. Compared with the SVM method, we give up support vectors and replace them by general weight vectors. Support vectors are special weight vectors: for small training-set problems, they are limited by the number of samples. The general weight vectors have no such limitation, and using enough weight vectors, the features of the input vectors can be maximally extracted.
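The bookkeeping behind this operation count can be sketched as follows. The sketch assumes linear hidden units, so that a single-parameter move is a rank-one update of the output fields (with a nonlinear transfer function, the touched neuron's activations would additionally be recomputed, still at O(P) cost); the cost function and acceptance rule are simplified placeholders, not the paper's exact scheme.

```python
import numpy as np

def mc_trial(W1, W2, X, H_j, O, j, k, delta, cost):
    """One Monte Carlo trial on the single hidden-layer parameter W1[j, k].

    H_j: cached local fields of hidden neuron j, length P
    O:   cached output fields, shape (L, P)
    Updating only the touched neuron costs O(P) for H_j plus O(L*P) for O,
    instead of O(N*M*P + N*L*P) for re-evaluating the whole network.
    """
    dH = delta * X[:, k]                    # O(P): change of neuron j's field
    O_new = O + np.outer(W2[:, j], dH)      # O(L*P): rank-one change of outputs
    if cost(O_new) < cost(O):               # keep the move only if the cost drops
        W1[j, k] += delta
        return H_j + dH, O_new, True
    return H_j, O, False
```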

(2) We classify the prior knowledge into common and problem-dependent parts, and suggest corresponding strategies for incorporating them into the learning machine to gain the maximum generalization ability. There are two classes of common prior knowledge. The first is that the learning machine should not be too sensitive to small changes of the inputs. This results from the prior knowledge that normal functions usually have sufficient smoothness, and that variations of a known pattern may belong to the same class. The second is a basic requirement on a design method itself: following the same rule, the same specified initial condition, and the same training set, learning machines designed by different users should have sufficiently small dispersion on the same test set. This requirement is usually not emphasized in previous training methods. Here we apply it as a basic principle to supervise the design of learning machines; this is the DRM strategy.

We have argued that the design risk can serve as the unique quantitative indicator variable of the best control-parameter set for function approximation and smoothing, which has been confirmed by examples. On the contrary, we have illustrated that the SRM principle may induce over-minimization of the risk and a deviation from the best fitting. We have also demonstrated that minimizing the design risk can lead to the maximization of the separating margin for classification, and thus gains better generalization ability. The maximum-margin strategy, however, is exactly applicable only when the real patterns can be considered as random variations of the sample patterns; in other words, it may induce overtraining for certain pattern samples, such as the handwritten digits. Real patterns usually have a particular geometric symmetry, and extremely maximizing the separating margin may result in a mismatch to this natural symmetry.

Therefore, incorporating problem-dependent prior knowledge is essential for improving the generalization ability further. For function approximation and smoothing, choosing a more proper neuron transfer function is one such way; the more proper function can also be chosen using the design-risk criterion. For pattern recognition and classification, there are many ways to maximize the problem-dependent prior knowledge, such as using a more proper neuron transfer function or cost function. One can also construct spurious samples following the particular geometric symmetry of the samples to extend the training set, or incorporate the particular geometric information into the input vectors. In this case, we need to combine the average recognition rate with the design risk to find the best control-parameter set. To calculate the average correct rate, the real test set is not necessarily required: one may construct a spurious sample set having the same geometric symmetry to serve as the test set.

As a result of the DRM strategy, instead of finding the best machine according to the result on the test set, we search for the best control-parameter set. At the best control-parameter set, each GVM is statistically equivalent for applications and can be applied as the performing learning machine.

(3) The structural risk is still a key parameter in our method. We usually search for the best control-parameter set along the direction of decreasing input-output sensitivity. However, we control the structural risk of a GVM by controlling that of the individual neurons. It is shown that the structural risk of a neuron is determined by the product of several classes of parameters, as well as by the second derivative of the neuron transfer function. By limiting the intervals of these parameters and applying proper neuron transfer functions whose second derivatives have a fixed upper bound, the risk of the individual neurons is controllable. As a linear combination of the risks of the individual neurons, the structural risk of the machine can thus be controlled.

(4) We can apply the J-GVM, constructed from a sufficient number of GVMs, as the performing machine. The output of a J-GVM is the ensemble average of these GVMs; it is an application of the ensemble strategy [11]. The J-GVM usually has better performance since it has smaller empirical risk, structural risk, and design risk. The flexibility of the Monte Carlo algorithm enables us to obtain a sufficient number of statistically identical GVMs using the same training set, so as to effectively diminish the noise part.

(5) We emphasize that the superiority of our method is for small sample-set problems. For function approximation and smoothing, examples show that the fitting precision using a small training set is obviously higher than those using the SVM method and the conventional spline algorithm.

In the example of the classification of breast cancer patients using in total 699 samples, the result is encouraging; in particular, with a J-GVM one always gets a 100% correct rate in a wide region of control parameters, indicating that the learning machine has a high confidence level. For handwritten digit recognition, historical records of the highest recognition rate may cause confusion since they have complex backgrounds: the training samples may be pretreated using various techniques, and the learning machine may have multilayer structures with more than three layers. In our paper we apply only the original gray-scale images without any pretreatment so as to compare the methods themselves fairly. We approach a recognition rate of 90% using only the first 1% of the samples, and of 97% using the first 10%. Using all samples, we obtain a recognition rate of 98.8%. This rate is beyond those using the BP method with error rate 1.5% [20], the SVM with error rate 1.4% [21], and the recently improved deep-learning method with error rate 1.5% [22]. Of these records, only that of the BP method is obtained with a three-layer neural network. The record of the SVM uses 10 binary classifiers; applying the multi-classifier SVM, only a correct rate of 96% is approached. The last record is obtained with a complex five-layer hierarchical model. In spite of this, we still emphasize that the superiority of our method lies particularly with small sets of samples. In fact, when the training samples are sufficient, the SVM method also has great freedom to select support vectors, which reduces the superiority of our strategy.

(6) Our method can be applied to other problems. It can be applied directly to many other traditional learning tasks, such as time-series prediction. What is more, the algorithm may induce some particular applications. For example, after running the Monte Carlo adaptation for a proper period, the local fields will distribute around $s_{il} h_{il} \approx d$. Those examples with much smaller $s_{il} h_{il}$ may then represent bad examples, such as two identical objects with opposite labels. To pick out these bad examples, the test set is not used; this may be an instance of so-called transductive inference [4]. Picking out bad examples might have practical importance for certain problems, such as finding misdiagnosed patients in the training set.
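A sketch of this picking-out step, assuming the target signs and trained local fields are available as arrays:

```python
import numpy as np

def suspect_samples(S, H, n_worst=20):
    """Rank training samples by the margin-like quantity s_il * h_il.

    S, H: arrays of shape (P, L) holding the target signs s_il and the
    trained local fields h_il. After the MC adaptation s_il * h_il
    clusters near d, so the samples whose worst output margin is much
    smaller are candidates for bad (e.g. mislabeled) examples -- found
    without using any test set.
    """
    margins = (np.asarray(S) * np.asarray(H)).min(axis=1)  # worst margin per sample
    order = np.argsort(margins)                            # most suspect first
    return order[:n_worst], margins[order[:n_worst]]
```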

Figure 20 below shows 20 such examples. One can see that, taking the third sample and the last one for example, no one could recognize them as 3 and 4, respectively.

Figure 20. Washing out the bad samples. The digit patterns are from the MNIST data set. The numbers above the patterns are the target digits; those below are the sequence numbers of the patterns in the data set, correspondingly.

As a new attempt at developing the design theory of learning machines, many viewpoints are presented without strict proof. Nevertheless, suggesting new algorithms based on empirical study and then investigating their theoretical basis is consistent with the common practice of this research field, whose aim is to solve practical problems.

Bibliography

1. Rumelhart, D.E., Hinton, G.E. & Williams, R.J. Nature 323, 533-536 (1986).


More information