Understanding the difficulty of training deep feedforward neural networks

Xavier Glorot and Yoshua Bengio
DIRO, Université de Montréal, Montréal, Québec, Canada

Abstract

Whereas before 2006 it appears that deep multilayer neural networks were not successfully trained, since then several algorithms have been shown to successfully train them, with experimental results showing the superiority of deeper vs less deep architectures. All these experimental results were obtained with new initialization or training mechanisms. Our objective here is to understand better why standard gradient descent from random initialization is doing so poorly with deep neural networks, to better understand these recent relative successes and help design better algorithms in the future. We first observe the influence of the non-linear activation functions. We find that the logistic sigmoid activation is unsuited for deep networks with random initialization because of its mean value, which can drive especially the top hidden layer into saturation. Surprisingly, we find that saturated units can move out of saturation by themselves, albeit slowly, explaining the plateaus sometimes seen when training neural networks. We find that a new non-linearity that saturates less can often be beneficial. Finally, we study how activations and gradients vary across layers and during training, with the idea that training may be more difficult when the singular values of the Jacobian associated with each layer are far from 1. Based on these considerations, we propose a new initialization scheme that brings substantially faster convergence.

(Appearing in Proceedings of the 13th International Conference on Artificial Intelligence and Statistics (AISTATS) 2010, Chia Laguna Resort, Sardinia, Italy. Volume 9 of JMLR: W&CP. Copyright 2010 by the authors.)

1 Deep Neural Networks

Deep learning methods aim at learning feature hierarchies, with features from higher levels of the hierarchy formed by the composition of lower-level features. They include learning methods for a wide array of deep architectures, including neural networks with many hidden layers (Vincent et al., 2008) and graphical models with many levels of hidden variables (Hinton et al., 2006), among others (Zhu et al., 2009; Weston et al., 2008). Much attention has recently been devoted to them (see Bengio (2009) for a review), because of their theoretical appeal, inspiration from biology and human cognition, and because of empirical success in vision (Ranzato et al., 2007; Larochelle et al., 2007; Vincent et al., 2008) and natural language processing (NLP) (Collobert & Weston, 2008; Mnih & Hinton, 2009). Theoretical results reviewed and discussed by Bengio (2009) suggest that in order to learn the kind of complicated functions that can represent high-level abstractions (e.g. in vision, language, and other AI-level tasks), one may need deep architectures.

Most of the recent experimental results with deep architectures are obtained with models that can be turned into deep supervised neural networks, but with initialization or training schemes different from the classical feedforward neural networks (Rumelhart et al., 1986). Why are these new algorithms working so much better than the standard random initialization and gradient-based optimization of a supervised training criterion? Part of the answer may be found in recent analyses of the effect of unsupervised pre-training (Erhan et al., 2009), showing that it acts as a regularizer that initializes the parameters in a better basin of attraction of the optimization procedure, corresponding to an apparent local minimum associated with better generalization.
But earlier work (Bengio et al., 2007) had shown that even a purely supervised but greedy layer-wise procedure would give better results. So here, instead of focusing on what unsupervised pre-training or semi-supervised criteria bring to deep architectures, we focus on analyzing what may be going wrong with good old (but deep) multilayer neural networks. Our analysis is driven by investigative experiments to monitor activations (watching for saturation of hidden units) and gradients, across layers and across training iterations. We also evaluate the effects on these of choices of activation function (with the idea that it might affect saturation) and of initialization procedure (since unsupervised pre-training is a particular form of initialization and it has a drastic impact).

2 Experimental Setting and Datasets

Code to produce the new datasets introduced in this section is available from: ca/~lisa/twiki/bin/view.cgi/Public/DeepGradientsAISTATS

2.1 Online Learning on an Infinite Dataset: Shapeset-3×2

Recent work with deep architectures (see Figure 7 in Bengio (2009)) shows that even with very large training sets or online learning, initialization from unsupervised pre-training yields substantial improvement, which does not vanish as the number of training examples increases. The online setting is also interesting because it focuses on the optimization issues rather than on small-sample regularization effects, so we decided to include in our experiments a synthetic images dataset inspired from Larochelle et al. (2007) and Larochelle et al. (2009), from which as many examples as needed could be sampled, for testing the online learning scenario.

We call this dataset the Shapeset-3×2 dataset, with example images in Figure 1 (top). Shapeset-3×2 contains images of 1 or 2 two-dimensional objects, each taken from 3 shape categories (triangle, parallelogram, ellipse), and placed with random shape parameters (relative lengths and/or angles), scaling, rotation, translation and grey-scale. We noticed that with only one shape present in the image, the task of recognizing it was too easy. We therefore decided to also sample images with two objects, with the constraint that the second object does not overlap with the first by more than fifty percent of its area, to avoid hiding it entirely. The task is to predict which objects are present (e.g. triangle + ellipse, parallelogram + parallelogram, triangle alone, etc.) without having to distinguish between the foreground shape and the background shape when they overlap. This therefore defines nine configuration classes. The task is fairly difficult because we need to discover invariances over rotation, translation, scaling, object color, occlusion and relative position of the shapes. In parallel we need to extract the factors of variability that predict which object shapes are present. The size of the images is arbitrary, but we fixed it so as to work with deep dense networks efficiently.

2.2 Finite Datasets

The MNIST digits dataset (LeCun et al., 1998a) has 50,000 training images, 10,000 validation images (for hyper-parameter selection), and 10,000 test images, each showing a 28×28 grey-scale pixel image of one of the 10 digits.

CIFAR-10 (Krizhevsky & Hinton, 2009) is a labelled subset of the tiny-images dataset that contains 50,000 training examples (from which we extracted 10,000 as validation data) and 10,000 test examples. There are 10 classes corresponding to the main object in each image: airplane, automobile, bird, cat, deer, dog, frog, horse, ship, or truck. The classes are balanced. Each image is in color but small, so the input is a vector of 32×32×3 = 3072 real values.

Small-ImageNet is a set of tiny gray-level images computed from the higher-resolution and larger ImageNet set, with labels from the WordNet noun hierarchy. We used 90,000 examples for training, 10,000 for the validation set, and 10,000 for testing. There are 10 balanced classes: reptiles, vehicles, birds, mammals, fish, furniture, instruments, tools, flowers and fruits. Figure 1 (bottom) shows randomly chosen examples.

Figure 1: Top: Shapeset-3×2 images. The learner tries to predict which objects (parallelogram, triangle, or ellipse) are present, and 1 or 2 objects can be present, yielding 9 possible classifications. Bottom: Small-ImageNet images at full resolution.
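For concreteness, the nine configuration classes are simply the unordered selections of one or two shapes from the three categories. The following is a minimal illustrative sketch of that labelling (not the released dataset code; all names here are ours):

```python
from itertools import combinations_with_replacement

# Alphabetical order so that sorted() keys below match the enumerated tuples.
SHAPES = ["ellipse", "parallelogram", "triangle"]

single = [(s,) for s in SHAPES]                          # 3 one-shape classes
pairs = list(combinations_with_replacement(SHAPES, 2))   # 6 unordered two-shape classes
CLASSES = single + pairs                                 # 9 configuration classes in total
CLASS_INDEX = {c: i for i, c in enumerate(CLASSES)}

def label(shapes_in_image):
    """Map the unordered shapes present in an image to a class index 0..8."""
    return CLASS_INDEX[tuple(sorted(shapes_in_image))]

assert len(CLASSES) == 9
# e.g. label(["triangle"]) and label(["parallelogram", "triangle"]) are two of the nine classes.
```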
2.3 Experimental Setting

We optimized feedforward neural networks with one to five hidden layers, one thousand hidden units per layer, and a softmax logistic regression for the output layer. The cost function is the negative log-likelihood −log P(y|x), where (x, y) is the (input image, target class) pair. The neural networks were optimized with stochastic back-propagation on mini-batches of size ten, i.e., the average gradient g of −log P(y|x) with respect to the parameters was computed over the ten training pairs (x, y) of the mini-batch and used to update the parameters in that direction, with θ ← θ − εg. The learning rate ε is a hyper-parameter that is optimized based on validation set error after a large number of updates (5 million).

We varied the type of non-linear activation function in the hidden layers: the sigmoid 1/(1 + e^{−x}), the hyperbolic tangent tanh(x), and a newly proposed activation function (Bergstra et al., 2009) called the softsign, x/(1 + |x|). The softsign is similar to the hyperbolic tangent (its range is −1 to 1), but its tails are quadratic polynomials rather than exponentials, i.e., it approaches its asymptotes much more slowly.

In the comparisons, we search for the best hyper-parameters (learning rate and depth) separately for each model. Note that the best depth was always five for Shapeset-3×2, except for the sigmoid, for which it was four. We initialized the biases to 0 and the weights W_ij at each layer with the following commonly used heuristic:

W_ij ~ U[ −1/√n, 1/√n ],   (1)

where U[−a, a] is the uniform distribution over the interval (−a, a) and n is the size of the previous layer (the number of columns of W).
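For reference, the three activation functions and the commonly used initialization heuristic of eq. (1) can be written in a few lines. This is a minimal NumPy sketch of the setup described above, not the authors' implementation:

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def tanh(x):
    return np.tanh(x)

def softsign(x):
    # Same range (-1, 1) as tanh, but approaches its asymptotes polynomially.
    return x / (1.0 + np.abs(x))

def standard_init(fan_in, fan_out, rng=np.random):
    # Eq. (1): W_ij ~ U[-1/sqrt(n), 1/sqrt(n)], with n the size of the previous layer.
    bound = 1.0 / np.sqrt(fan_in)
    return rng.uniform(-bound, bound, size=(fan_in, fan_out))

def init_biases(fan_out):
    # Biases are initialized to 0.
    return np.zeros(fan_out)
```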
3 Effect of Activation Functions and Saturation During Training

Two things we want to avoid, and that can be revealed from the evolution of activations, are excessive saturation of activation functions on the one hand (then gradients will not propagate well), and overly linear units on the other (they will not compute something interesting).

3.1 Experiments with the Sigmoid

The sigmoid non-linearity has already been shown to slow down learning because of its non-zero mean, which induces important singular values in the Hessian (LeCun et al., 1998b). In this section we will see another symptomatic behavior due to this activation function in deep feedforward networks. We want to study possible saturation by looking at the evolution of activations during training. The figures in this section show results on the Shapeset-3×2 data, but similar behavior is observed with the other datasets.

Figure 2 shows the evolution of the activation values (after the non-linearity) at each hidden layer during training of a deep architecture with sigmoid activation functions. Layer 1 refers to the output of the first hidden layer, and there are four hidden layers. The graph shows the means and standard deviations of these activations. These statistics, along with histograms, are computed at different times during learning by looking at activation values for a fixed set of 300 test examples.

Figure 2: Mean and standard deviation (vertical bars) of the activation values (output of the sigmoid) during supervised learning, for the different hidden layers of a deep architecture. The top hidden layer quickly saturates at 0 (slowing down all learning), but then slowly desaturates around epoch 100.

We see that very quickly at the beginning, all the sigmoid activation values of the last hidden layer are pushed to their lower saturation value of 0. Inversely, the other layers have a mean activation value that is above 0.5 and decreasing as we go from the output layer to the input layer. We have found that this kind of saturation can last very long in deeper networks with sigmoid activations; e.g., the depth-five model never escaped this regime during training. The big surprise is that for an intermediate number of hidden layers (here four), the saturation regime may be escaped. At the same time that the top hidden layer moves out of saturation, the first hidden layer begins to saturate and therefore to stabilize.

We hypothesize that this behavior is due to the combination of random initialization and the fact that a hidden unit output of 0 corresponds to a saturated sigmoid. Note that deep networks with sigmoids but initialized from unsupervised pre-training (e.g. from RBMs) do not suffer from this saturation behavior. Our proposed explanation rests on the hypothesis that the transformation that the lower layers of the randomly initialized network compute initially is not useful to the classification task, unlike the transformation obtained from unsupervised pre-training. The logistic layer output softmax(b + Wh) might initially rely more on its biases b (which are learned very quickly) than on the top hidden activations h derived from the input image (because h would vary in ways that are not predictive of y, perhaps correlated mostly with other, possibly more dominant, variations of x). Thus the error gradient would tend to push Wh towards 0, which can be achieved by pushing h towards 0. In the case of symmetric activation functions like the hyperbolic tangent and the softsign, sitting around 0 is good because it allows gradients to flow backwards. However, pushing the sigmoid outputs to 0 would bring them into a saturation regime which prevents gradients from flowing backward and prevents the lower layers from learning useful features. Eventually but slowly, the lower layers move toward more useful features and the top hidden layer then moves out of the saturation regime. Note however that, even after this, the network moves into a solution of poorer quality (also in terms of generalization) than those found with symmetric activation functions, as can be seen in Figure 11.
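The kind of per-layer statistics plotted in Figure 2 can be gathered with a simple monitoring routine: periodically run a fixed set of examples (here, 300 test examples) through the network and record the mean and standard deviation of each hidden layer's post-activation values. A sketch, assuming a plain feedforward network stored as lists of weight matrices and bias vectors (the names below are placeholders, not the authors' code):

```python
import numpy as np

def layer_activation_stats(weights, biases, activation, X):
    """Return (mean, std) of the post-activation values of every hidden layer
    for a fixed monitoring set X (e.g. 300 held-out examples)."""
    stats = []
    z = X
    for W, b in zip(weights, biases):
        z = activation(z @ W + b)          # forward pass through one hidden layer
        stats.append((z.mean(), z.std()))  # saturation: mean near an asymptote, tiny std
    return stats

# usage sketch: stats = layer_activation_stats(Ws, bs, sigmoid, X_monitor)
```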

3.2 Experiments with the Hyperbolic Tangent

As discussed above, hyperbolic tangent networks do not suffer from the kind of top-hidden-layer saturation behavior observed with sigmoid networks, because of the symmetry of tanh around 0. However, with our standard weight initialization U[−1/√n, 1/√n], we observe a sequentially occurring saturation phenomenon starting with layer 1 and propagating up in the network, as illustrated in Figure 3. Why this happens remains to be understood.

Figure 3: Top: 98 percentiles (markers alone) and standard deviation (solid lines with markers) of the distribution of activation values for the hyperbolic tangent networks in the course of learning. We see the first hidden layer saturating first, then the second, etc. Bottom: 98 percentiles (markers alone) and standard deviation (solid lines with markers) of the distribution of activation values for the softsign during learning. Here the different layers saturate less and do so together.

3.3 Experiments with the Softsign

The softsign x/(1 + |x|) is similar to the hyperbolic tangent but might behave differently in terms of saturation because of its smoother asymptotes (polynomial instead of exponential). We see in Figure 3 that the saturation does not occur one layer after the other as it does for the hyperbolic tangent. It is faster at the beginning and then slow, and all layers move together towards larger weights.

We can also see at the end of training that the histogram of activation values is very different from that seen with the hyperbolic tangent (Figure 4). Whereas the latter yields modes of the activation distribution mostly at the extremes (asymptotes −1 and 1) or around 0, the softsign network has modes of activations around its knees (between the linear regime around 0 and the flat regime around −1 and 1). These are the areas where there is substantial non-linearity but where the gradients flow well.

Figure 4: Normalized histograms of activation values at the end of learning, averaged across units of the same layer and across 300 test examples. Top: activation function is the hyperbolic tangent; we see important saturation of the lower layers. Bottom: activation function is the softsign; we see many activation values around (−0.6, −0.8) and (0.6, 0.8), where the units do not saturate but are non-linear.

4 Studying Gradients and their Propagation

4.1 Effect of the Cost Function

We have found that the logistic regression or conditional log-likelihood cost function (−log P(y|x) coupled with softmax outputs) worked much better (for classification problems) than the quadratic cost which was traditionally used to train feedforward neural networks (Rumelhart et al., 1986). This is not a new observation (Solla et al., 1988), but we find it important to stress here. We found that the plateaus in the training criterion (as a function of the parameters) are less present with the log-likelihood cost function. We can see this in Figure 5, which plots the training criterion as a function of two weights for a two-layer network (one hidden layer) with hyperbolic tangent units, and a random input and target signal. There are clearly more severe plateaus with the quadratic cost.
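To make the two criteria being compared explicit, here is a small sketch (assuming one-hot targets) of the conditional log-likelihood cost with softmax outputs and of the quadratic cost on the same outputs:

```python
import numpy as np

def softmax(a):
    e = np.exp(a - a.max(axis=1, keepdims=True))  # numerically stabilized
    return e / e.sum(axis=1, keepdims=True)

def nll_cost(logits, y_onehot):
    # -log P(y | x) with softmax outputs, averaged over the mini-batch.
    p = softmax(logits)
    return -np.mean(np.sum(y_onehot * np.log(p + 1e-12), axis=1))

def quadratic_cost(logits, y_onehot):
    # Traditional squared-error criterion on the same softmax outputs.
    p = softmax(logits)
    return 0.5 * np.mean(np.sum((p - y_onehot) ** 2, axis=1))
```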
4.2 Gradients at Initialization

Theoretical Considerations and a New Normalized Initialization

We study the back-propagated gradients, or equivalently the gradient of the cost function with respect to the biases at each layer. Bradley (2009) found that back-propagated gradients were smaller as one moves from the output layer towards the input layer, just after initialization. He studied networks with linear activation at each layer, finding that the variance of the back-propagated gradients decreases as we go backwards in the network. We will also start by studying the linear regime.
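Bradley's observation is easy to reproduce numerically in the linear regime: back-propagate a random gradient through a stack of layers drawn from the standard initialization of eq. (1) and watch its variance shrink towards the input. A minimal sketch under those assumptions (square layers, purely linear units, no data involved):

```python
import numpy as np

rng = np.random.default_rng(0)
n, depth = 1000, 5

# Standard initialization, eq. (1): Var[W] = 1/(3n), so n * Var[W] = 1/3.
Ws = [rng.uniform(-1/np.sqrt(n), 1/np.sqrt(n), size=(n, n)) for _ in range(depth)]

g = rng.standard_normal(n)                 # gradient arriving at the top layer
for i, W in reversed(list(enumerate(Ws))):
    g = W @ g                              # back-propagation through one linear layer
    print(f"layer {i}: Var[grad] = {g.var():.2e}")   # shrinks by roughly 1/3 per layer
```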

Figure 5: Cross entropy (black, surface on top) and quadratic (red, bottom surface) cost as a function of two weights (one at each layer) of a network with two layers, W1 respectively on the first layer and W2 on the second, output layer.

For a dense artificial neural network using a symmetric activation function f with unit derivative at 0 (i.e. f′(0) = 1), if we write z^i for the activation vector of layer i and s^i for the argument vector of the activation function at layer i, we have s^i = z^i W^i + b^i and z^{i+1} = f(s^i). From these definitions we obtain

∂Cost/∂s^i_k = f′(s^i_k) W^{i+1}_{k,·} ∂Cost/∂s^{i+1},   (2)
∂Cost/∂w^i_{l,k} = z^i_l ∂Cost/∂s^i_k.   (3)

The variances will be expressed with respect to the input, output and weight initialization randomness. Consider the hypothesis that we are in a linear regime at initialization, that the weights are initialized independently and that the input features' variances are the same (= Var[x]). Then we can say that, with n_i the size of layer i and x the network input,

f′(s^i_k) ≈ 1,   (4)
Var[z^i] = Var[x] ∏_{i′=0}^{i−1} n_{i′} Var[W^{i′}].   (5)

We write Var[W^{i′}] for the shared scalar variance of all weights at layer i′. Then for a network with d layers,

Var[∂Cost/∂s^i] = Var[∂Cost/∂s^d] ∏_{i′=i}^{d} n_{i′+1} Var[W^{i′}],   (6)
Var[∂Cost/∂w^i] = (∏_{i′=0}^{i−1} n_{i′} Var[W^{i′}]) (∏_{i′=i}^{d−1} n_{i′+1} Var[W^{i′}]) Var[x] Var[∂Cost/∂s^d].   (7)

From a forward-propagation point of view, to keep information flowing we would like that

∀(i, i′), Var[z^i] = Var[z^{i′}].   (8)

From a back-propagation point of view we would similarly like to have

∀(i, i′), Var[∂Cost/∂s^i] = Var[∂Cost/∂s^{i′}].   (9)

These two conditions transform to

∀i, n_i Var[W^i] = 1,   (10)
∀i, n_{i+1} Var[W^i] = 1.   (11)

As a compromise between these two constraints, we might want to have

∀i, Var[W^i] = 2 / (n_i + n_{i+1}).   (12)

Note how both constraints are satisfied when all layers have the same width. If we also have the same initialization for the weights, we obtain the following interesting properties:

∀i, Var[∂Cost/∂s^i] = [n Var[W]]^{d−i} Var[x],   (13)
∀i, Var[∂Cost/∂w^i] = [n Var[W]]^{d} Var[x] Var[∂Cost/∂s^d].   (14)

We can see that the variance of the gradient on the weights is the same for all layers, but the variance of the back-propagated gradient might still vanish or explode as we consider deeper networks. Note how this is reminiscent of issues raised when studying recurrent neural networks (Bengio et al., 1994), which can be seen as very deep networks when unfolded through time.

The standard initialization that we have used (eq. 1) gives rise to variance with the following property:

n Var[W] = 1/3,   (15)

where n is the layer size (assuming all layers are of the same size). This will cause the variance of the back-propagated gradient to be dependent on the layer (and decreasing). The normalization factor may therefore be important when initializing deep networks because of the multiplicative effect through layers, and we suggest the following initialization procedure to approximately satisfy our objectives of maintaining activation variances and back-propagated gradient variances as one moves up or down the network. We call it the normalized initialization:

W ~ U[ −√6/√(n_j + n_{j+1}), √6/√(n_j + n_{j+1}) ].   (16)
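A sketch of the normalized initialization of eq. (16), together with a quick empirical check of the forward condition it is meant to approximate (eq. 8): with square tanh layers, activation variance decays much faster under the standard initialization than under the normalized one. This is an illustration under the stated assumptions, not the authors' code:

```python
import numpy as np

def normalized_init(fan_in, fan_out, rng=np.random):
    # Eq. (16): W ~ U[-sqrt(6)/sqrt(n_j + n_{j+1}), +sqrt(6)/sqrt(n_j + n_{j+1})],
    # i.e. Var[W] = 2 / (n_j + n_{j+1}) as in eq. (12).
    bound = np.sqrt(6.0) / np.sqrt(fan_in + fan_out)
    return rng.uniform(-bound, bound, size=(fan_in, fan_out))

rng = np.random.default_rng(0)
n, depth = 1000, 5
x = rng.standard_normal((200, n))   # unit-variance inputs

inits = [("standard", lambda i, o: rng.uniform(-1/np.sqrt(i), 1/np.sqrt(i), (i, o))),
         ("normalized", normalized_init)]
for name, init in inits:
    z = x
    variances = []
    for _ in range(depth):
        z = np.tanh(z @ init(n, n))   # forward pass through one tanh layer
        variances.append(z.var())
    print(name, ["%.3f" % v for v in variances])
# With the standard init the activation variance shrinks by roughly a factor of 3
# per layer; with the normalized init it stays of the same order across layers
# (the residual decay comes from the tanh non-linearity).
```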

Gradient Propagation Study

To empirically validate the above theoretical ideas, we have plotted normalized histograms of activation values, weight gradients and back-propagated gradients at initialization with the two different initialization methods. The results displayed (Figures 6, 7 and 8) are from experiments on Shapeset-3×2, but qualitatively similar results were obtained with the other datasets.

We monitor the singular values of the Jacobian matrix associated with layer i:

J^i = ∂z^{i+1} / ∂z^i.   (17)

When consecutive layers have the same dimension, the average singular value corresponds to the average ratio of infinitesimal volumes mapped from z^i to z^{i+1}, as well as to the ratio of average activation variance going from z^i to z^{i+1}. With our normalized initialization, this ratio is around 0.8, whereas with the standard initialization, it drops down to 0.5.

Figure 6: Normalized histograms of activation values with hyperbolic tangent activation, with standard (top) vs normalized (bottom) initialization. Top: the 0-peak increases for higher layers.

4.3 Back-propagated Gradients During Learning

The dynamics of learning in such networks is complex and we would like to develop better tools to analyze and track it. In particular, we cannot use simple variance calculations in our theoretical analysis because the weight values are no longer independent of the activation values and the linearity hypothesis is also violated.

As first noted by Bradley (2009), we observe (Figure 7) that at the beginning of training, after the standard initialization (eq. 1), the variance of the back-propagated gradients gets smaller as it is propagated downwards. However, we find that this trend is reversed very quickly during learning. Using our normalized initialization we do not see such decreasing back-propagated gradients (bottom of Figure 7).

Figure 7: Normalized histograms of back-propagated gradients with hyperbolic tangent activation, with standard (top) vs normalized (bottom) initialization. Top: the 0-peak decreases for higher layers.

What was initially really surprising is that even when the back-propagated gradients become smaller (standard initialization), the variance of the weight gradients is roughly constant across layers, as shown in Figure 8. However, this is explained by our theoretical analysis above (eq. 14). Interestingly, as shown in Figure 9, these observations on the weight gradients of the standard and normalized initializations change during training (here for a tanh network). Indeed, whereas the gradients initially have roughly the same magnitude, they diverge from each other (with larger gradients in the lower layers) as training progresses, especially with the standard initialization. Note that this might be one of the advantages of the normalized initialization, since having gradients of very different magnitudes at different layers may yield ill-conditioning and slower training.

Finally, we observe that the softsign networks share similarities with the tanh networks with normalized initialization, as can be seen by comparing the evolution of activations in both cases (resp. Figure 3-bottom and Figure 10).
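The 0.8 vs 0.5 ratio quoted in the gradient propagation study above can be checked directly at initialization: with tanh units operating near zero, f′(s^i) ≈ 1, so the layer Jacobian is approximately the (transposed) weight matrix and its singular values are those of W. A rough sketch under that approximation, not the authors' code:

```python
import numpy as np

rng = np.random.default_rng(0)
n = 1000

def mean_singular_value(W):
    # At initialization f'(s) is close to 1, so the layer Jacobian dz^{i+1}/dz^i
    # is approximately W^T; its singular values are those of W.
    return np.linalg.svd(W, compute_uv=False).mean()

W_std = rng.uniform(-1/np.sqrt(n), 1/np.sqrt(n), (n, n))                            # eq. (1)
W_norm = rng.uniform(-np.sqrt(6)/np.sqrt(2*n), np.sqrt(6)/np.sqrt(2*n), (n, n))     # eq. (16)

print("standard init  :", mean_singular_value(W_std))    # roughly 0.5
print("normalized init:", mean_singular_value(W_norm))   # roughly 0.85
```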
5 Error Curves and Conclusions

The final consideration that we care about is the success of training with the different strategies, and this is best illustrated with error curves showing the evolution of test error as training progresses and asymptotes. Figure 11 shows such curves with online training on Shapeset-3×2, while Table 1 gives final test error for all the datasets studied (Shapeset-3×2, MNIST, CIFAR-10, and Small-ImageNet).

As a baseline, we optimized RBF SVM models on one hundred thousand Shapeset examples and obtained 59.47% test error, while on the same set we obtained 50.47% with a depth-five hyperbolic tangent network with normalized initialization. These results illustrate the effect of the choice of activation and initialization. As a reference we include in Figure 11 the error curve for supervised fine-tuning from the initialization obtained after unsupervised pre-training with denoising auto-encoders (Vincent et al., 2008). For each network the learning rate is separately chosen to minimize error on the validation set. We can remark that on Shapeset-3×2, because of the task difficulty, we observe important saturations during learning; this might explain why the effects of the normalized initialization and of the softsign are more visible there.

Figure 8: Normalized histograms of weight gradients with hyperbolic tangent activation just after initialization, with standard initialization (top) and normalized initialization (bottom), for different layers. Even though with standard initialization the back-propagated gradients get smaller, the weight gradients do not!

Figure 9: Standard deviation intervals of the weight gradients with hyperbolic tangent, with standard initialization (top) and normalized initialization (bottom), during training. We see that the normalization allows us to keep the same variance of the weight gradients across layers during training (top: smaller variance for higher layers).

Figure 10: 98 percentile (markers alone) and standard deviation (solid lines with markers) of the distribution of activation values for hyperbolic tangent with normalized initialization during learning.

Table 1: Test error with different activation functions and initialization schemes for deep networks with 5 hidden layers (rows: Softsign, Softsign N, Tanh, Tanh N, Sigmoid; columns: Shapeset, MNIST, CIFAR-10, ImageNet). N after the activation function name indicates the use of normalized initialization. Results in bold are statistically different from non-bold ones under a null-hypothesis test.

Several conclusions can be drawn from these error curves:

- The more classical neural networks with sigmoid or hyperbolic tangent units and standard initialization fare rather poorly, converging more slowly and apparently towards ultimately poorer local minima.
- The softsign networks seem to be more robust to the initialization procedure than the tanh networks, presumably because of their gentler non-linearity.
- For tanh networks, the proposed normalized initialization can be quite helpful, presumably because the layer-to-layer transformations maintain magnitudes of activations (flowing upward) and gradients (flowing backward).

Other methods can alleviate discrepancies between layers during learning, e.g., exploiting second-order information to set the learning rate separately for each parameter. For example, one can exploit the diagonal of the Hessian (LeCun et al., 1998b) or a gradient variance estimate. Both of these methods were applied for Shapeset-3×2 with hyperbolic tangent and standard initialization. We observed a gain in performance, but one not reaching the result obtained from normalized initialization. In addition, we observed further gains by combining normalized initialization with second-order methods: the estimated Hessian might then focus on discrepancies between units, not having to correct important initial discrepancies between layers.
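As an illustration of the "gradient variance estimate" idea mentioned above (this is only a sketch of one possible per-parameter learning-rate scheme, not the procedure used in these experiments), one can keep a running estimate of each parameter's squared gradient and divide the global learning rate by its square root:

```python
import numpy as np

def make_adaptive_update(shape, lr=0.01, decay=0.99, eps=1e-8):
    # Running estimate of the second moment of each parameter's gradient;
    # parameters with consistently large gradients get a smaller effective step.
    second_moment = np.zeros(shape)

    def update(param, grad):
        nonlocal second_moment
        second_moment = decay * second_moment + (1 - decay) * grad ** 2
        return param - lr * grad / (np.sqrt(second_moment) + eps)

    return update

# usage sketch: upd = make_adaptive_update(W.shape); W = upd(W, grad_W)
```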

In all reported experiments we have used the same number of units per layer. However, we verified that we obtain the same gains when the layer size increases (or decreases) with layer number.

Figure 11: Test error during online training on the Shapeset-3×2 dataset, for various activation functions and initialization schemes (ordered from top to bottom in decreasing final error). N after the activation function name indicates the use of normalized initialization.

Figure 12: Test error curves during training on MNIST and CIFAR-10, for various activation functions and initialization schemes (ordered from top to bottom in decreasing final error). N after the activation function name indicates the use of normalized initialization.

The other conclusions from this study are the following:

- Monitoring activations and gradients across layers and training iterations is a powerful investigative tool for understanding training difficulties in deep nets.
- Sigmoid activations (not symmetric around 0) should be avoided when initializing from small random weights, because they yield poor learning dynamics, with initial saturation of the top hidden layer.
- Keeping the layer-to-layer transformations such that both activations and gradients flow well (i.e. with a Jacobian around 1) appears helpful, and eliminates a good part of the discrepancy between purely supervised deep networks and ones pre-trained with unsupervised learning.
- Many of our observations remain unexplained, suggesting further investigations to better understand gradients and training dynamics in deep architectures.

References

Bengio, Y. (2009). Learning deep architectures for AI. Foundations and Trends in Machine Learning, 2. Also published as a book by Now Publishers.

Bengio, Y., Lamblin, P., Popovici, D., & Larochelle, H. (2007). Greedy layer-wise training of deep networks. NIPS 19. MIT Press.

Bengio, Y., Simard, P., & Frasconi, P. (1994). Learning long-term dependencies with gradient descent is difficult. IEEE Transactions on Neural Networks, 5.

Bergstra, J., Desjardins, G., Lamblin, P., & Bengio, Y. (2009). Quadratic polynomials learn better image features (Technical Report 1337). Département d'Informatique et de Recherche Opérationnelle, Université de Montréal.

Bradley, D. (2009). Learning in modular systems. Doctoral dissertation, The Robotics Institute, Carnegie Mellon University.

Collobert, R., & Weston, J. (2008). A unified architecture for natural language processing: Deep neural networks with multitask learning. ICML 2008.

Erhan, D., Manzagol, P.-A., Bengio, Y., Bengio, S., & Vincent, P. (2009). The difficulty of training deep architectures and the effect of unsupervised pre-training. AISTATS 2009.

Hinton, G. E., Osindero, S., & Teh, Y. (2006). A fast learning algorithm for deep belief nets. Neural Computation, 18.

Krizhevsky, A., & Hinton, G. (2009). Learning multiple layers of features from tiny images (Technical Report). University of Toronto.

Larochelle, H., Bengio, Y., Louradour, J., & Lamblin, P. (2009). Exploring strategies for training deep neural networks. The Journal of Machine Learning Research, 10.

Larochelle, H., Erhan, D., Courville, A., Bergstra, J., & Bengio, Y. (2007). An empirical evaluation of deep architectures on problems with many factors of variation. ICML 2007.

LeCun, Y., Bottou, L., Bengio, Y., & Haffner, P. (1998a). Gradient-based learning applied to document recognition. Proceedings of the IEEE, 86.

LeCun, Y., Bottou, L., Orr, G. B., & Müller, K.-R. (1998b). Efficient backprop. In Neural Networks, Tricks of the Trade, Lecture Notes in Computer Science. Springer Verlag.

Mnih, A., & Hinton, G. E. (2009). A scalable hierarchical distributed language model. NIPS 21.

Ranzato, M., Poultney, C., Chopra, S., & LeCun, Y. (2007). Efficient learning of sparse representations with an energy-based model. NIPS 19.

Rumelhart, D. E., Hinton, G. E., & Williams, R. J. (1986). Learning representations by back-propagating errors. Nature, 323.

Solla, S. A., Levin, E., & Fleisher, M. (1988). Accelerated learning in layered neural networks. Complex Systems, 2.

Vincent, P., Larochelle, H., Bengio, Y., & Manzagol, P.-A. (2008). Extracting and composing robust features with denoising autoencoders. ICML 2008.

Weston, J., Ratle, F., & Collobert, R. (2008). Deep learning via semi-supervised embedding. ICML 2008. New York, NY, USA: ACM.

Zhu, L., Chen, Y., & Yuille, A. (2009). Unsupervised learning of probabilistic grammar-Markov models for object categories. IEEE Transactions on Pattern Analysis and Machine Intelligence, 31.


More information

Mathematics 256 a course in differential equations for engineering students

Mathematics 256 a course in differential equations for engineering students Mathematcs 56 a course n dfferental equatons for engneerng students Chapter 5. More effcent methods of numercal soluton Euler s method s qute neffcent. Because the error s essentally proportonal to the

More information

SPARSE-CODED NET MODEL AND APPLICATIONS

SPARSE-CODED NET MODEL AND APPLICATIONS TO APPEAR IN 2016 IEEE INTERNATIONAL WORKSHOP ON MACHINE LEARNING FOR SIGNAL PROCESSING SPARSE-CODED NET MODEL AND APPLICATIONS Youngjune Gwon 1, Mram Cha 2, Wllam Campbell 1, H. T. Kung 2, Cagr Dagl 1

More information

Neural Network Control for TCP Network Congestion

Neural Network Control for TCP Network Congestion 5 Amercan Control Conference June 8-, 5. Portland, OR, USA FrA3. Neural Network Control for TCP Network Congeston Hyun C. Cho, M. Sam Fadal, Hyunjeong Lee Electrcal Engneerng/6, Unversty of Nevada, Reno,

More information

BAYESIAN MULTI-SOURCE DOMAIN ADAPTATION

BAYESIAN MULTI-SOURCE DOMAIN ADAPTATION BAYESIAN MULTI-SOURCE DOMAIN ADAPTATION SHI-LIANG SUN, HONG-LEI SHI Department of Computer Scence and Technology, East Chna Normal Unversty 500 Dongchuan Road, Shangha 200241, P. R. Chna E-MAIL: slsun@cs.ecnu.edu.cn,

More information

General Vector Machine. Hong Zhao Department of Physics, Xiamen University

General Vector Machine. Hong Zhao Department of Physics, Xiamen University General Vector Machne Hong Zhao (zhaoh@xmu.edu.cn) Department of Physcs, Xamen Unversty The support vector machne (SVM) s an mportant class of learnng machnes for functon approach, pattern recognton, and

More information

Announcements. Supervised Learning

Announcements. Supervised Learning Announcements See Chapter 5 of Duda, Hart, and Stork. Tutoral by Burge lnked to on web page. Supervsed Learnng Classfcaton wth labeled eamples. Images vectors n hgh-d space. Supervsed Learnng Labeled eamples

More information

A New Feature of Uniformity of Image Texture Directions Coinciding with the Human Eyes Perception 1

A New Feature of Uniformity of Image Texture Directions Coinciding with the Human Eyes Perception 1 A New Feature of Unformty of Image Texture Drectons Concdng wth the Human Eyes Percepton Xng-Jan He, De-Shuang Huang, Yue Zhang, Tat-Mng Lo 2, and Mchael R. Lyu 3 Intellgent Computng Lab, Insttute of Intellgent

More information

IEEE TRANSACTIONS ON CIRCUITS AND SYSTEMS FOR VIDEO TECHNOLOGY 1. SSDH: Semi-supervised Deep Hashing for Large Scale Image Retrieval

IEEE TRANSACTIONS ON CIRCUITS AND SYSTEMS FOR VIDEO TECHNOLOGY 1. SSDH: Semi-supervised Deep Hashing for Large Scale Image Retrieval IEEE TRANSACTIONS ON CIRCUITS AND SYSTEMS FOR VIDEO TECHNOLOGY SSDH: Sem-supervsed Deep Hashng for Large Scale Image Retreval Jan Zhang, and Yuxn Peng arxv:607.08477v2 [cs.cv] 8 Jun 207 Abstract Hashng

More information

Single Sample Face Recognition via Learning Deep Supervised Auto-Encoders Shenghua Gao, Yuting Zhang, Kui Jia, Jiwen Lu, Yingying Zhang

Single Sample Face Recognition via Learning Deep Supervised Auto-Encoders Shenghua Gao, Yuting Zhang, Kui Jia, Jiwen Lu, Yingying Zhang 1 Sngle Sample Face Recognton va Learnng Deep Supervsed Auto-Encoders Shenghua Gao, Yutng Zhang, Ku Ja, Jwen Lu, Yngyng Zhang Abstract Ths paper targets learnng robust mage representaton for sngle tranng

More information

Dropout: A Simple Way to Prevent Neural Networks from Overfitting

Dropout: A Simple Way to Prevent Neural Networks from Overfitting Journal of Machne Learnng Research 15 (2014) 1929-1958 Submtted 11/13; Publshed 6/14 Dropout: A Smple Way to Prevent Neural Networks from Overfttng Ntsh Srvastava Geoffrey Hnton Alex Krzhevsky Ilya Sutskever

More information

A Bilinear Model for Sparse Coding

A Bilinear Model for Sparse Coding A Blnear Model for Sparse Codng Davd B. Grmes and Rajesh P. N. Rao Department of Computer Scence and Engneerng Unversty of Washngton Seattle, WA 98195-2350, U.S.A. grmes,rao @cs.washngton.edu Abstract

More information

Empirical Distributions of Parameter Estimates. in Binary Logistic Regression Using Bootstrap

Empirical Distributions of Parameter Estimates. in Binary Logistic Regression Using Bootstrap Int. Journal of Math. Analyss, Vol. 8, 4, no. 5, 7-7 HIKARI Ltd, www.m-hkar.com http://dx.do.org/.988/jma.4.494 Emprcal Dstrbutons of Parameter Estmates n Bnary Logstc Regresson Usng Bootstrap Anwar Ftranto*

More information

Adaptive Transfer Learning

Adaptive Transfer Learning Adaptve Transfer Learnng Bn Cao, Snno Jaln Pan, Yu Zhang, Dt-Yan Yeung, Qang Yang Hong Kong Unversty of Scence and Technology Clear Water Bay, Kowloon, Hong Kong {caobn,snnopan,zhangyu,dyyeung,qyang}@cse.ust.hk

More information

The Study of Remote Sensing Image Classification Based on Support Vector Machine

The Study of Remote Sensing Image Classification Based on Support Vector Machine Sensors & Transducers 03 by IFSA http://www.sensorsportal.com The Study of Remote Sensng Image Classfcaton Based on Support Vector Machne, ZHANG Jan-Hua Key Research Insttute of Yellow Rver Cvlzaton and

More information

Face Recognition Method Based on Within-class Clustering SVM

Face Recognition Method Based on Within-class Clustering SVM Face Recognton Method Based on Wthn-class Clusterng SVM Yan Wu, Xao Yao and Yng Xa Department of Computer Scence and Engneerng Tong Unversty Shangha, Chna Abstract - A face recognton method based on Wthn-class

More information

NUMERICAL SOLVING OPTIMAL CONTROL PROBLEMS BY THE METHOD OF VARIATIONS

NUMERICAL SOLVING OPTIMAL CONTROL PROBLEMS BY THE METHOD OF VARIATIONS ARPN Journal of Engneerng and Appled Scences 006-017 Asan Research Publshng Network (ARPN). All rghts reserved. NUMERICAL SOLVING OPTIMAL CONTROL PROBLEMS BY THE METHOD OF VARIATIONS Igor Grgoryev, Svetlana

More information

Sequential search. Building Java Programs Chapter 13. Sequential search. Sequential search

Sequential search. Building Java Programs Chapter 13. Sequential search. Sequential search Sequental search Buldng Java Programs Chapter 13 Searchng and Sortng sequental search: Locates a target value n an array/lst by examnng each element from start to fnsh. How many elements wll t need to

More information