Backpropagation: In Search of Performance Parameters

ANIL KUMAR ENUMULAPALLY, LINGGUO BU, and KHOSROW KAIKHAH, Ph.D.
Computer Science Department
Texas State University-San Marcos
San Marcos, TX-78666 USA
ae049@txstate.edu, lb40@txstate.edu, 0@TxState.edu

Abstract: - This work is an extensive study of the backpropagation network based on a new visual tool, Equal Opportunity for Recognition (EOR) for all inputs to be recalled, which is used to evaluate the overall network performance, in particular its generalization capabilities. The new procedure, EOR, is used as a means to assess the effect of other system parameters.

Keywords: backpropagation, network evaluation, generalization, EOR, Processing Elements (PEs) and parameters.

1 Introduction
A backpropagation network is a multilayer, associative, feed-forward neural network that features supervised learning using a gradient descent training procedure. Backpropagation is widely used in applications involving pattern recognition because of its powerful capability of generalization. While its system structure and learning algorithm are well documented, there exist no mathematical criteria to assess the performance, particularly the generalization capabilities, of the network with respect to such network parameters as the number of PEs on the hidden layer, the mean squared error, learning rates, and the initialization of weights and thresholds. Searching for a measure of system performance, we propose a visual method, the EOR plot, which can be used as an indicator of the overall system performance. With the aid of EOR plotting, we further studied the various parameters of the system as they relate to the overall system behavior, including MSE, hidden layer size, learning rates, weight and threshold initialization, and threshold updating.

2 Description of Experiments
To study the various properties of a backpropagation network, we started with the 26 capital letters of the English alphabet, each of which is represented on a 24 by 24 grid as follows. Each grid was converted to a binary vector of 576 elements.

Fig. 1: Input patterns

Each binary vector is associated with an 8-bit ASCII code corresponding to the English letter. Since one hidden layer is generally sufficient for most applications [4], we have designed a backpropagation network of three layers: an input layer with 576 PEs, an output layer with 8 PEs, and a hidden layer with a varying number of PEs. In the light of the EOR (Equal Opportunity for Recognition) plot as presented below, we studied all parameters of a backpropagation network based on our implementation of the network on Mathematica 4.

3 EOR
Backpropagation is a simple and powerful algorithm, yielding satisfactory results if properly implemented. Mathematical criteria, however, are still to be found that can be employed to evaluate system performance with respect to such network parameters as the MSE, hidden layer size, initial weights, and learning rates. Many rules for choosing hidden
layer size have been proposed; however, none of them seems to be superior, and all are the result of empirical conjecture. To guarantee the applicability of a network, however, some measures have to be taken to assess system performance. To avoid overtraining, for example, constant monitoring of system performance is necessary, including the incorporation of test data in the process of training.

Given a specific application, such as the recognition of the 26 capital English letters, noise reduction and generalization capabilities in the presence of random noise are among the essential requirements of the network. In other words, we need to establish the probabilistic performance of a network so that, first, all input patterns can be recovered successfully with an equal opportunity, and second, the probability that an individual input can be recovered meets the requirements of the application. Both factors are related to all the parameters of a network.

In the absence of a mathematical description, we propose the EOR plot (Equal Opportunity for Recognition) as a visual, probabilistic method to evaluate system performance. Given a set of system parameters, including MSE, initial weights, thresholds, learning rates, and hidden layer size, we train the network and estimate the probability that each individual input pattern is correctly recognized at a specified rate of random noise. The latter can be done by repeating the recall process on a sufficiently large number of randomly corrupted inputs and monitoring the behavior of the network. After all individual inputs have been processed, the performance of the network can be analyzed using EOR plots.

Using 9 hidden layer neurons with a range of 00 to +00 for weight and threshold random initialization, a learning rate of [value missing], an MSE of 005, and random noise rates of 0% and 5%, respectively, we obtain the following EOR plots as an estimation of the network performance.

[Fig. 2(a), 2(b): EOR plots; axes: probability vs. number of iterations in hundreds]
Fig. 2: EOR plots for 0% & 5% noise respectively

According to the two EOR plots, with 0% random noise, each input pattern can be correctly recognized with a probability of over 90% in spite of the slight variations; with 5% random noise, all patterns can be recognized. Fig. 3 depicts sample letters with 0% random noise.

Fig. 3: Specimen with 0% random noise

All corrupted patterns can be correctly recovered with a probability of more than 90%. As shown by our experiments, EOR plots can be used as an objective description of system performance. EOR plots can also be utilized in analyzing other network parameters.

4 Results and Analysis
4.1 Mean Squared Error (MSE)
MSE is generally used as an indicator of network convergence. However, MSE alone is not sufficient, and other network characteristics need to be considered. First, we will show that MSE is not always a sufficient descriptor of system performance. Using 8 hidden layer PEs and an MSE of 005 with a different range for random initialization of weights and thresholds, we obtained the following 5% random noise EOR plots. In figure
4(a), the range of weight and threshold initialization is -.0 to +.0; in figure 4(b), the range of weight and threshold initialization is -05 to +05.

[Fig. 4(a), 4(b): EOR plots]
Fig. 4: EOR plot for different weight and threshold initializations

Second, given a specific topology of a network, a small MSE does not always yield better system performance. As shown by our experiment, after a certain point, the EOR plot remains virtually the same without evidence of over-fitting. The following results are obtained using 8 hidden layer nodes, a learning rate of [value missing], and a range of 05 to +05 for weight and threshold random initialization, at an MSE of 5, 05, and 0.

[Fig. 5(a): MSE of 5; Fig. 5(b): MSE of 05; Fig. 5(c): MSE of 00]
Fig. 5: EOR plots for different MSE values

Therefore, while MSE is an important factor of a backpropagation network, it is not sufficient for drawing conclusions about system performance. Other factors, including weight initialization and the size of the hidden layer, also play an important role.

4.2 Weight Initialization
In a three-layer network, there are two weight sets. As a general rule, the weights should be randomly initialized to small values to avoid system oscillation, as justified by the derivative of the activation function. We started with a range of .0 to +.0 and gradually reduced the range. We observed that a smaller random initialization range yields better performance. For the following graphs, a network with 8 hidden layer nodes was used, together with an MSE of 05, a learning rate of [value missing], and different ranges for weight and threshold initialization.

[Fig. 6(a): weight and threshold initialization to +; Fig. 6(b): weight and threshold initialization -5 to +5]
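The appeal to the derivative of the activation function in Section 4.2 can be illustrated with a short sketch. It is not the Mathematica implementation used in the experiments. It assumes a standard logistic sigmoid, a single PE fed a random 576-element binary input, and three hypothetical initialization ranges (1.0, 0.1, 0.01) chosen only for illustration. The names `init_weights` and `mean_derivative` are likewise illustrative.

```python
import math
import random


def sigmoid(x):
    """Logistic activation (assumed; the paper only says 'sigmoid')."""
    return 1.0 / (1.0 + math.exp(-x))


def sigmoid_derivative(x):
    """Derivative of the logistic function; its maximum is 0.25 at x = 0."""
    s = sigmoid(x)
    return s * (1.0 - s)


def init_weights(n_in, n_out, r, rng):
    """Return an n_in x n_out weight matrix drawn uniformly from [-r, +r]."""
    return [[rng.uniform(-r, r) for _ in range(n_out)] for _ in range(n_in)]


def mean_derivative(n_in, r, rng, trials=200):
    """Average sigmoid derivative at the net input of one PE.

    Each trial draws a random binary input vector and a fresh weight
    vector from [-r, +r], then evaluates the derivative at the net input.
    """
    total = 0.0
    for _ in range(trials):
        inputs = [rng.randint(0, 1) for _ in range(n_in)]
        weights = [rng.uniform(-r, r) for _ in range(n_in)]
        net = sum(a * w for a, w in zip(inputs, weights))
        total += sigmoid_derivative(net)
    return total / trials


rng = random.Random(0)
for r in (1.0, 0.1, 0.01):  # hypothetical ranges, not the paper's values
    print(r, round(mean_derivative(576, r, rng), 4))
```

With a wide range the net input of a 576-input PE tends to land in the flat tails of the sigmoid, where the derivative, and therefore the weight correction, is close to zero. With a narrow range the derivative stays near its maximum of 0.25.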

[Fig. 6(c): weight and threshold initialization to +; Fig. 6(d): weight and threshold initialization 00 to +00; Fig. 6(e): weight and threshold initialization -0000 to +0000]
Fig. 6: EOR plots using various ranges for weight and threshold random initialization: to +, -5 to +5, - to +, -00 to +00, and -0000 to +0000, respectively.

Although the weights could all be initialized to zero, this would result in a highly symmetrical network; it is therefore not a good choice for network design. This echoes the statement made by Rumelhart et al. [6]: initial weights of exactly 0 cannot be used, since symmetries in the environment are not sufficient to break symmetries in initial weights.

4.3 Number of PEs on the Hidden Layer
To study the effect of the number of PEs in the hidden layer on system performance, we performed a series of experiments where all weights and thresholds were randomly initialized between 05 and +05, with a fixed MSE of 005 and a learning rate of [value missing]. When the weights are initialized to very small values, the same MSE yields similar system performance regardless of the range of initialization, and thus can be used to compare the effect of the number of PEs on the hidden layer.

With a small number of PEs on the hidden layer compared to the input and output layers, the learning curve exhibits a great deal of fluctuation and does not converge to the specified MSE. This implies that the network does not have enough learning capacity, i.e., memory. With 3 hidden layer PEs, we observed the following results:

[Fig. 7(a): Learning curve; Fig. 7(b): EOR plot]
Fig. 7: Learning curve and EOR plot for a network with 3 PEs on hidden layer

With more PEs on the hidden layer, more input patterns can be correctly recovered. Once the number of hidden layer PEs reaches an ideal range, the system performance stabilizes and shows very little improvement with the addition of new PEs. The following EOR plots are based on 5, 7, 9, [value missing], 4, 36, 48, 00 PEs, respectively.

[Fig. 8(a): EOR plot for 5 hidden layer PEs]
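Every EOR plot in this section rests on the per-pattern recognition probabilities described in Section 3. The sketch below shows one way such an estimate could be computed. It makes several assumptions not stated in the paper: noise is modeled as flipping a fixed fraction of randomly chosen bits, recall succeeds when the recalled label equals the pattern's index, and a nearest-prototype classifier (`make_nearest_prototype_recall`) stands in for the trained backpropagation network. The small random 64-bit patterns replace the 576-element letter grids purely to keep the example fast.

```python
import random


def corrupt(pattern, noise_rate, rng):
    """Flip a fraction noise_rate of the bits, chosen at random (assumed noise model)."""
    n_flip = round(noise_rate * len(pattern))
    noisy = list(pattern)
    for i in rng.sample(range(len(pattern)), n_flip):
        noisy[i] = 1 - noisy[i]
    return noisy


def eor_estimate(patterns, recall, noise_rate, trials, rng):
    """Estimate, for each pattern, the probability of correct recall under noise.

    recall(x) must return the index of the pattern the recognizer decides on.
    """
    probs = []
    for label, pattern in enumerate(patterns):
        hits = sum(
            recall(corrupt(pattern, noise_rate, rng)) == label
            for _ in range(trials)
        )
        probs.append(hits / trials)
    return probs


def make_nearest_prototype_recall(patterns):
    """Stand-in recognizer (NOT a backpropagation network): smallest Hamming distance."""
    def recall(x):
        dists = [sum(a != b for a, b in zip(x, p)) for p in patterns]
        return dists.index(min(dists))
    return recall


rng = random.Random(1)
patterns = [[rng.randint(0, 1) for _ in range(64)] for _ in range(5)]
recall = make_nearest_prototype_recall(patterns)
probs = eor_estimate(patterns, recall, noise_rate=0.10, trials=200, rng=rng)
print(probs)  # one estimated recognition probability per pattern
```

Plotting the returned list, one value per input pattern, gives the kind of profile an EOR plot summarizes: a flat, high profile indicates that all inputs have a comparable chance of being recognized.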

[Fig. 8(b): EOR plot for 7 hidden layer PEs; Fig. 8(c): EOR plot for 4 hidden layer PEs; Fig. 8(d): EOR plot for 48 hidden layer PEs; Fig. 8(e): EOR plot for 00 hidden layer PEs]
Fig. 8: EOR plot for a network with 5, 7, 4, 48, 00 PEs on hidden layer, respectively.

We observed that, using a fixed MSE, the number of iterations is related to the number of PEs on the hidden layer by the following curve.

[Fig. 9: plot; axes: number of iterations vs. hidden layer size]
Fig. 9: Relationship between hidden layer size and number of training iterations

Hidden layer PEs are the feature extractors. As the hidden layer size increases, for a fixed error, the number of iterations needed to train the network converges to a value and does not oscillate. This tells us that beyond a certain limit the hidden layer size has no effect on the number of iterations. Although increasing the hidden layer size brings down the number of iterations, there may not be much improvement in the total training time, because each iteration must update more weights.

4.4 Learning Rates
While learning rates are generally taken to be small numbers between 0 and 1, there is no criterion governing the selection of a learning rate. If it is too small, the error correction is trivial and the network does not learn well, with little chance of getting out of a local minimum; if it is too large, the learning process is one of oscillation, with little chance of convergence to the necessary MSE. The training of a network is aimed at its generalization performance, which is achieved by system convergence, the speed of which is adjusted by the learning rates.

To appreciate the effects of large learning rates, consider the learning curve of a network with 9 hidden layer PEs, a weight and threshold initialization range of 05 to +05, and a learning rate of 5, as depicted in Fig. 10.

[Fig. 10: learning curve; axes: error vs. number of iterations]
Fig. 10: Learning curve at a high learning rate (5).

To assess the effect of learning rate on system performance, we used a network with 6 hidden layer PEs, a range of 05 to +05 for weight and threshold initialization, an MSE of 005, and various learning rates.
With a learning rate of 00 and 0% random noise, the EOR plots are as follows, corresponding to the learning rates of 0, [value missing], [value missing], and [value missing]. The network did not converge with a learning rate of .0.
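The trade-off described in Section 4.4 (slow progress when the learning rate is too small, oscillation or failure to converge when it is too large) can be seen even in a one-weight toy problem. The sketch below is not the paper's network: it assumes plain gradient descent on the quadratic error E(w) = 0.5 * c * w^2, and the function name, the curvature c, the target error, and the four learning rates are hypothetical choices for illustration only.

```python
def iterations_to_converge(learning_rate, curvature=1.0, w0=1.0,
                           target=1e-4, max_iter=10_000):
    """Count gradient-descent steps until E(w) = 0.5*curvature*w**2 < target.

    Returns None if the error never reaches the target (divergence or
    too many iterations).
    """
    w = w0
    for step in range(max_iter):
        if 0.5 * curvature * w * w < target:
            return step
        if abs(w) > 1e6:          # the weight is blowing up: give up
            return None
        # Gradient of E with respect to w is curvature * w.
        w -= learning_rate * curvature * w
    return None


for eta in (0.01, 0.5, 1.9, 2.1):  # hypothetical learning rates
    print(eta, iterations_to_converge(eta))
```

In this toy model each step multiplies the weight by (1 - learning_rate * curvature), so a very small rate shrinks the error slowly, a rate just below 2/curvature makes the weight flip sign on every step while shrinking only slightly, and a rate above 2/curvature makes it grow without bound. A real network has many weights and a non-quadratic error surface, so this is only a qualitative picture of the behavior seen in the learning curves.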

[Fig. 11(a): EOR plot at learning rate of 0; Fig. 11(b): EOR plot at learning rate of [value missing]; Fig. 11(d): EOR plot at learning rate of [value missing]]
Fig. 11: EOR plots at the learning rates of 0, [value missing], and [value missing], respectively.

4.5 Thresholds
Thresholds, or biases, can be used on both the hidden layer and the output layer PEs to fine-tune the system convergence. Each PE on the hidden and output layers can have a threshold value, which is updated directly based on the delta value computed for that PE. Threshold updating not only speeds up system convergence, but is also potentially helpful in smoothing out system fluctuations that might be hard to deal with using weight updating alone.

$o_k = f\left(\sum_{i=1}^{n} a_i w_{ik} - \theta_k\right)$, (1)

where $o_k$ is the output of the $k$th node on the hidden or output layer, $\theta_k$ is the corresponding threshold, and $f$ is the sigmoid function. If $\delta_k$ is the delta value for the node, $\theta_k$ should be updated as follows:

$\theta_k(t) = \theta_k(t-1) - \varepsilon \delta_k$ (2)

where $\varepsilon$ is the threshold learning rate, $\delta_k$ is the delta value, and $\theta_k$ is the threshold value.

5 Conclusions
As there are no formulas that can be readily used to evaluate the performance of a backpropagation network, the Equal Opportunity for Recognition (EOR) plots represent a practical tool for system assessment with respect to the application conditions. As a probabilistic method, not only can it be used to describe system performance, it can also be incorporated into the recall process for demanding pattern recognition tasks. The EOR has shown great promise in finding the optimal initial conditions for our neural network. Future work can be in the direction of finding general principles for designing a backpropagation network with near-optimal initial conditions using EOR.

References:
[1] Freeman, James A. Simulating Neural Networks. Addison-Wesley, 1994.
[2] McAuley, Devin. The backpropagation network: learning by example, 1997.
[3] Mehrotra, Krishan, et al. Elements of artificial neural networks, Cambridge, MIT Press, 1997.
[4] Sureerattanan, Songyot, et al. New developments on backpropagation network training, IEICE Trans., vol. E83-A, No. 6, pp. 03-039, June 2000.
[5] Kolen, John F., and Pollack, Jordan B. Back Propagation is Sensitive to Initial Conditions, 1990.
[6] Rumelhart, D. E., Hinton, G. E., and Williams, R. J. Learning Representations by Back-Propagating Errors. Nature 323:533-536, 1986.
[7] Sarle, Warren S. ftp://ftp.sas.com/pub/neural, 00.