MULTIVARIATE ANALYSIS AND ITS USE IN HIGH ENERGY PHYSICS

1 1 DR ADRIAN BEVAN MULTIVARIATE ANALYSIS AND ITS USE IN HIGH ENERGY PHYSICS 4) OPTIMISATION Lectures given at the department of Physics at CINVESTAV, Instituto Politécnico Nacional, Mexico City 28th August - 3rd Sept 2018

2 2 LECTURE PLAN: Introduction; Hyperparameter optimisation; Supervised learning; Loss functions; Gradient descent; Stochastic learning; Batch learning; Overtraining; Training validation; Dropout for deep networks; Weight regularisation for neural networks; Cross-validation methods; Summary; Suggested reading.

3 3 INTRODUCTION We initialise the weights of a neural network or other MVA algorithm randomly. The determination of a set of optimal weights requires some heuristic algorithm and some figure of merit. The algorithm is the optimisation method (typically derived from gradient descent). The figure of merit is called the loss or cost function and can take many forms. The optimisation process for machine learning algorithms is similar to the optimisation of a χ² or -lnL fit, where one minimises the χ² or -lnL as the figure of merit with respect to the model parameters.

4 4 HYPERPARAMETER OPTIMISATION Models have hyperparameters (HPs) that are required to fix the response function (*). The set of HPs forms a hyperspace. The purpose of optimisation is to select a point in hyperspace that optimises the performance of the model using some figure of merit (FOM). The figure of merit is called the cost or loss function; cf. least squares regression or a χ² or likelihood fit. (*) Minimisation problems related to likelihood fitting often split the hyper-parameters into parameters of interest (e.g. physical quantities) and other nuisance parameters that are not deemed to be interesting. Machine learning model parameters are the equivalent of nuisance parameters in the language of likelihood fits; e.g. see the book by Edwards, Likelihood (1992), Johns Hopkins University Press.

5 5 HYPERPARAMETER OPTIMISATION Consider a perceptron with N inputs. This has N+1 HPs: N weights and a bias: $y = f\left(\sum_{i=1}^{N} w_i x_i + \theta\right) = f(\mathbf{w}^T\mathbf{x} + \theta)$. For brevity the bias parameter is called a weight from the perspective of HP optimisation.

6 6 HYPERPARAMETER OPTIMISATION Consider a neural network with an N dimensional input feature space, M perceptrons on the input layer and 1 output perceptron. This has M(N+1) + (M+1) HPs. For an MLP with one hidden layer of K perceptrons: This has M(N+1) + K(M+1) + (K+1) HPs. For an MLP with two hidden layers of K and L perceptrons, respectively: This has M(N+1) + K(M+1) + L(K+1) +(L+1) HPs. and so on.
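The counting above can be automated: each perceptron in a layer carries one weight per input plus one bias. A minimal sketch of this counting in Python is given below; the function name and the example layer sizes are illustrative assumptions, not part of the lecture material.

```python
def count_hps(n_inputs, hidden_sizes, n_outputs=1):
    """Count the weights and biases of a fully connected MLP."""
    total = 0
    previous = n_inputs
    for layer_size in list(hidden_sizes) + [n_outputs]:
        total += layer_size * (previous + 1)  # each node: `previous` weights + 1 bias
        previous = layer_size
    return total

# M(N+1) + (M+1) with N=10 inputs, M=20 perceptrons and 1 output:
print(count_hps(10, [20]))        # 241
# M(N+1) + K(M+1) + (K+1) with an extra hidden layer of K=15 perceptrons:
print(count_hps(10, [20, 15]))    # 551
```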

7 7 HYPERPARAMETER OPTIMISATION Neural networks have a lot of HPs. Deep networks, especially CNNs, can have millions of HPs to optimise. This requires appropriate computing resources and appropriately efficient methods for HP optimisation. What is acceptable for an optimisation of 10 or 100 HPs will not generally scale well to a problem with millions of HPs. The more HPs to determine, the more data is required to obtain a generalisable solution for the HPs. By generalisable we mean that the model defined using a set of HPs will have reproducible behaviour when presented with unseen data. When this is not the case we have an overtrained model - one that has learned statistical fluctuations in our training set.

8 8 HYPERPARAMETER OPTIMISATION: SUPERVISED LEARNING The type of machine learning we are using is referred to as supervised learning. We present the algorithm with known (labeled) samples of data, and optimise the HPs in order to minimise the loss function. The loss function is a function of: labels (e.g. signal = +1, background = 0 or -1 depending on convention/choice of loss function) hyper parameters (i.e. weights, biases, etc.) activation functions architecture of the network (arrangement of perceptrons)

9 9 HYPERPARAMETER OPTIMISATION: LOSS FUNCTIONS There are a number of different types of loss (cost) function that are commonly used. These different figures of merit are measures of how well a model performs at ensuring the weights provide a reasonable output response. Loss functions can have additional terms added to them (e.g. weight regularisation for overtraining and for constructing adversarial networks). Three common loss functions are described in the following.

10 10 HYPERPARAMETER OPTIMISATION: LOSS FUNCTIONS L2-norm loss: This is like a χ² term, but without the error normalisation and with a factor of 1/2: $\varepsilon = \sum_{i=1}^{N} \frac{1}{2}(y_i - t_i)^2$ where N is the number of examples, y_i is the model output for the i-th example and t_i is the true target for the i-th example (label value). The L1-norm loss function is as above, but with the absolute difference $|y_i - t_i|$ in place of the factor of 1/2 and the square.

11 11 HYPERPARAMETER OPTIMISATION: LOSS FUNCTIONS Mean Square Error (MSE) loss: Very similar to the L2-norm loss function; just normalise the L2-norm loss by the number of training examples to compute an average: $\varepsilon = \frac{1}{N}\sum_{i=1}^{N}(y_i - t_i)^2$ where N is the number of examples, y_i is the model output for the i-th example and t_i is the true target for the i-th example (label value).

12 12 HYPERPARAMETER OPTIMISATION: LOSS FUNCTIONS Cross Entropy: This loss function is inspired by the likelihood of observing either target value, $P(t|x) = y^t(1-y)^{(1-t)}$. From the likelihood L of observing the training data set we can compute -lnL as $\varepsilon = -\sum_{n=1}^{N}\left[t_n \ln y_n + (1-t_n)\ln(1-y_n)\right]$ where N is the number of examples, t_n is the target type (0 or 1) for the n-th example (label value) and y_n is the output prediction of the model.
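As a concrete reference, the three loss functions above can be written in a few lines of Python; this is a minimal numpy sketch (not taken from the lecture code), with y the array of model outputs and t the array of labels.

```python
import numpy as np

def l2_loss(y, t):
    """L2-norm loss: sum over examples of 0.5*(y_i - t_i)^2."""
    return 0.5 * np.sum((y - t) ** 2)

def mse_loss(y, t):
    """Mean squared error: the L2-norm loss normalised by the number of examples."""
    return np.mean((y - t) ** 2)

def cross_entropy_loss(y, t, eps=1e-12):
    """Binary cross entropy: -sum[t ln(y) + (1-t) ln(1-y)]; eps guards against log(0)."""
    y = np.clip(y, eps, 1.0 - eps)
    return -np.sum(t * np.log(y) + (1.0 - t) * np.log(1.0 - y))
```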

13 13 HYPERPARAMETER OPTIMISATION: GRADIENT DESCENT Newtonian gradient descent is a simple concept. Guess an initial value for the weight parameter: w_0. [Figure: the cost y = ε as a function of the weight w, with the starting point w_0 marked.]

14 14 HYPERPARAMETER OPTIMISATION: GRADIENT DESCENT Newtonian gradient descent is a simple concept. Estimate the gradient at that point (tangent to the curve). [Figure: the cost y = ε as a function of w, with the tangent at w_0 drawn.]

15 15 HYPERPARAMETER OPTIMISATION: GRADIENT DESCENT Newtonian gradient descent is a simple concept. Compute Δw such that Δy is negative (to move toward the minimum): $\Delta y = \frac{dy}{dw}\Delta w$. Choose $\Delta w = -\alpha\frac{dy}{dw}$ to ensure Δy is always negative, where α is the learning rate: a small positive number. [Figure: the cost y = ε as a function of w, with the step away from w_0 indicated.]

16 16 HYPERPARAMETER OPTIMISATION: GRADIENT DESCENT Newtonian gradient descent is a simple concept. Compute a new weight value, $w_1 = w_0 + \Delta w$, and in general $w_{i+1} = w_i + \Delta w = w_i - \alpha\frac{dy}{dw}$, where α is the learning rate: a small positive number. Choosing $\Delta w = -\alpha\frac{dy}{dw}$ ensures Δy is always negative. [Figure: the cost y = ε as a function of w, showing the step from w_0 to w_1.]

17 17 HYPERPARAMETER OPTIMISATION: GRADIENT DESCENT Newtonian gradient descent is a simple concept. Repeat $w_{i+1} = w_i + \Delta w = w_i - \alpha\frac{dy}{dw}$ until some convergence criterion is satisfied. [Figure: the cost y = ε as a function of w, showing the successive steps w_1, w_2, ... approaching the minimum.]
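The loop described on the last few slides is short enough to write out in full. The sketch below, a toy example rather than anything from the lecture, minimises an assumed cost y(w) = (w - 3)^2 whose derivative dy/dw = 2(w - 3) is known analytically.

```python
def gradient_descent(w0, alpha=0.1, tol=1e-6, max_steps=10000):
    """Newtonian gradient descent for the toy cost y(w) = (w - 3)**2."""
    w = w0
    for step in range(max_steps):
        grad = 2.0 * (w - 3.0)        # dy/dw at the current point
        dw = -alpha * grad            # choose Δw = -α dy/dw so that Δy <= 0
        if abs(dw) < tol:             # a simple convergence criterion on the step size
            break
        w = w + dw                    # w_{i+1} = w_i + Δw
    return w, step

w_min, n_steps = gradient_descent(w0=10.0, alpha=0.1)
print(w_min, n_steps)                 # approaches the true minimum at w = 3
```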

18 18 HYPERPARAMETER OPTIMISATION: GRADIENT DESCENT We can extend this from a one parameter optimisation to a 2 parameter one, and follow the same principles, now in 2D. The successive points w i+1 can be visualised a bit like a ball rolling down a concave hill into the region of the minimum.

19 19 HYPERPARAMETER OPTIMISATION: GRADIENT DESCENT In general, for an n-dimensional hyperspace of hyper parameters we can follow the same brute force approach using $\Delta y = \Delta\mathbf{w}\cdot\nabla_{\mathbf{w}}y$, where $\nabla_{\mathbf{w}}y = \left(\frac{\partial y}{\partial w_{1,i}}, \frac{\partial y}{\partial w_{2,i}}, \ldots, \frac{\partial y}{\partial w_{n,i}}\right)$, and the update rule $\mathbf{w}_{i+1} = \mathbf{w}_i + \Delta\mathbf{w} = \mathbf{w}_i - \alpha\nabla_{\mathbf{w}}y$.

20 20 HYPERPARAMETER OPTIMISATION: BACK PROPAGATION The delta-rule or back propagation method is used for NNs to determine the weights; it is based on gradient descent. For a regression problem* we use the L2-norm loss function (with or without the factor of 1/2): $\varepsilon = \sum_{i=1}^{N}\frac{1}{2}(y_i - t_i)^2$, where y_i is the model output for example x_i and t_i is the corresponding label for the i-th example. We can compute the derivative of ε with respect to the weights; the derivatives depend on the activation function(s) in the model y(x). *Either this or cross entropy is used for classification problems.

21 21 HYPERPARAMETER OPTIMISATION: BACK PROPAGATION The parameters w and θ are updated using $w_{r+1} = w_r - \alpha\frac{\partial\varepsilon}{\partial w}$ and $\theta_{r+1} = \theta_r - \alpha\frac{\partial\varepsilon}{\partial\theta}$, where α is the small positive learning rate. The derivatives can be re-written in terms of the errors on the weights, and the errors on w and θ can be related to each other. Back propagation involves a forward pass, where the weights are fixed and the model predictions are made*, followed by a backward pass, where the errors on the bias parameters are computed and used to determine the errors on the weights. These in turn are used to update the HPs from epoch r to epoch r+1. *Weights are randomly initialised for the whole network.
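To make the forward/backward structure concrete, here is a minimal sketch of one back-propagation update for a single sigmoid perceptron trained with the L2 loss. It is an illustration of the chain rule only, under the assumption of one neuron with weight vector w, bias θ, input x and label t; a full network repeats the backward pass layer by layer.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def train_step(w, theta, x, t, alpha=0.1):
    # Forward pass: the weights are held fixed and the prediction is made.
    y = sigmoid(np.dot(w, x) + theta)
    # Backward pass: propagate the error from the output back to the bias and weights.
    delta = (y - t) * y * (1.0 - y)   # dε/dz = dε/dy * dy/dz for the sigmoid + L2 loss
    grad_w = delta * x                # dε/dw
    grad_theta = delta                # dε/dθ
    # Update the HPs from epoch r to epoch r+1.
    return w - alpha * grad_w, theta - alpha * grad_theta

rng = np.random.default_rng(0)
w, theta = rng.normal(size=3), 0.0    # randomly initialised weights
w, theta = train_step(w, theta, x=np.array([0.2, -1.0, 0.5]), t=1.0)
```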

22 22 HYPERPARAMETER OPTIMISATION Consider the example of an MLP used to approximate the function f(x) = x², and how the convergence depends on the learning rate. [Plots: cost versus epochs for α = 0.1 and for a smaller learning rate.] Too large a learning rate and the optimisation does not always lead to an improved cost; the small-change approximation breaks down. A small learning rate means the convergence takes much longer (a ×10 reduction in the learning rate will require a ×10 increase in optimisation steps). This example uses an L2 loss function and is a neural network with 1 input, a single layer of 50 nodes and 1 output; implemented in TensorFlow.

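The network used in these plots is simple enough to reproduce. The following is a minimal sketch in the spirit of the slide's TensorFlow example (1 input, 50 nodes, 1 output, L2/MSE loss); the activation function, epoch count and the smaller learning rate value are assumptions, since they are not shown on the slide.

```python
import numpy as np
import tensorflow as tf

x = np.linspace(-1.0, 1.0, 200).reshape(-1, 1).astype("float32")
y = x ** 2                                      # target function f(x) = x^2

def make_model(alpha):
    """1 input -> 50 hidden nodes -> 1 output, trained by plain SGD with MSE loss."""
    model = tf.keras.Sequential([
        tf.keras.layers.Dense(50, activation="tanh", input_shape=(1,)),
        tf.keras.layers.Dense(1),
    ])
    model.compile(optimizer=tf.keras.optimizers.SGD(learning_rate=alpha), loss="mse")
    return model

# Full-batch training with two different learning rates for comparison.
hist_large = make_model(0.1).fit(x, y, epochs=500, batch_size=len(x), verbose=0)
hist_small = make_model(0.01).fit(x, y, epochs=500, batch_size=len(x), verbose=0)
# Comparing hist_large.history["loss"] with hist_small.history["loss"] shows the
# trade-off between step size and the number of epochs needed to converge.
```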

24 24 HYPERPARAMETER OPTIMISATION: STOCHASTIC LEARNING [1] Data are inherently noisy. Individual training examples can be used to estimate the gradient. Training examples tend to cluster, so processing a batch of training data one example at a time results in sampling the ensemble in such a way as to give faster optimisation performance. Noise in the data can help the optimisation algorithm avoid getting locked into local minima. Often results in better optimisation performance than batch learning. [1] LeCun et al., Efficient BackProp, Neural Networks: Tricks of the Trade, Springer 1998

25 25 HYPERPARAMETER OPTIMISATION: BATCH LEARNING [1] Data are inherently noisy. We can use a sample of training data to estimate the gradient for minimisation (see later) to minimise the effect of this noise. The sample of data is used to obtain a better estimate of the gradient. This is referred to as batch learning. We can use mini-batches of data to speed up optimisation, which is motivated by the observation that for many problems there are clusters of similar training examples. [1] LeCun et al., Efficient BackProp, Neural Networks: Tricks of the Trade, Springer 1998

26 26 HYPERPARAMETER OPTIMISATION: BATCH LEARNING Returning to the example f(x) = x²; optimising on 1/4 of the data at a time (4 batches) leads to accelerated optimisation relative to optimising on all the data each epoch. [Plots: training with all the data versus batch training (4 batches), each with the same learning rate.] This example uses an L2 loss function and is a neural network with 1 input, a single layer of 50 nodes and 1 output; implemented in TensorFlow.
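A minimal sketch of the 4-batch variant, reusing the make_model() factory and the (x, y) arrays from the previous sketch: the data are shuffled and split each epoch, so the weights are updated four times per epoch instead of once. Keras provides the same behaviour directly through the batch_size argument of fit().

```python
import numpy as np

model = make_model(0.1)                          # model factory from the previous sketch
batch_size = len(x) // 4
rng = np.random.default_rng(seed=1)
for epoch in range(500):
    order = rng.permutation(len(x))              # reshuffle the examples each epoch
    for start in range(0, len(x), batch_size):
        idx = order[start:start + batch_size]
        model.train_on_batch(x[idx], y[idx])     # one gradient update per mini-batch

# Equivalent one-liner: make_model(0.1).fit(x, y, epochs=500, batch_size=batch_size)
```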

27 27 GRADIENT DESCENT: REFLECTION For a problem with a parabolic minimum and an appropriate learning rate, α, to fix the step size, we can guarantee convergence to a sensible minimum in some number of steps. If we translate the distribution to a fixed scale, then all of a sudden we can predict how many steps it will take to converge to the minimum from some distance away from it for a given α. If the problem hyperspace is not parabolic, this becomes more complicated. [1] LeCun et al., Efficient BackProp, Neural Networks Tricks of the Trade, Springer 1998

28 28 GRADIENT DESCENT: REFLECTION Based on the underlying nature of the gradient descent optimisation algorithm family, being derived to optimise a parabolic distribution, ideally we want to try and standardise the input distributions to a neural network. Use a unit Gaussian as a standard, e.g. map x to x' = (x-μ)/σ, or scale x to the range [-1, 1]. The transformed data inputs will be scale invariant in the sense that HPs such as the learning rate will become general, rather than problem (and therefore scale) dependent. If we don't do this the optimisation algorithm will work, but it may take longer to converge to the minimum, and could be more susceptible to divergent behaviour. [1] LeCun et al., Efficient BackProp, Neural Networks: Tricks of the Trade, Springer 1998
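A minimal sketch of the standardisation step, assuming the feature matrices are numpy arrays; the means and widths must be computed on the training data only and then reused for validation and unseen data.

```python
import numpy as np

def standardise(x_train, x_other):
    """Map each feature to (x - mu)/sigma using the training-sample mu and sigma."""
    mu = x_train.mean(axis=0)
    sigma = x_train.std(axis=0)
    return (x_train - mu) / sigma, (x_other - mu) / sigma
```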

29 29 OVERTRAINING Data are noisy. Optimisation can result in learning the noise in the training data. Overtraining is the term given to learning the noise, and this can be mitigated in a number of different ways: using more training data (not always possible); checking against different data sets to identify the onset of learning noise; changing the network configuration when training (dropout); weight regularisation (large weights are penalised in the cost). None of these methods guarantees that you avoid overtraining.

30 30 OVERTRAINING A model is over fitted if the HPs that have been determined are tuned to the statistical fluctuations in the data set. Simple illustration of the problem: 30 training examples The decision boundary selected here does a good job of separating the red and blue dots. Boundaries like this can be obtained by training models on limited data samples. The accuracies can be impressive. But would the performance be as good with a new, or a larger data sample?

31 31 OVERTRAINING A model is over fitted if the HPs that have been determined are tuned to the statistical fluctuations in the data set. Simple illustration of the problem: 30 training examples versus 1000 training examples. Increasing to 1000 training examples we can see the boundary doesn't do as well. This illustrates the kind of problem encountered when we overfit HPs of a model.

32 32 OVERTRAINING: TRAINING VALIDATION One way to avoid tuning to statistical fluctuations in the data is to impose a training convergence criterion based on a data sample independent from the training set: a validation sample. Use the cost evaluated for the training and validation samples to check whether the HPs are overtrained. If both samples have similar cost then the model response function is similar on two statistically independent samples. If the samples are large enough then one could reasonably assume that the response function would then be general when applied to an unseen data sample. "Large enough" is a model- and problem-dependent constraint.

33 33 OVERTRAINING: TRAINING VALIDATION Training convergence criteria that could be used: terminate training after N epochs; cost comparison: evaluate the performance on the training and validation sets, compare the two and place some threshold on the difference, Δcost < δcost; terminate the training when the gradient of the cost function with respect to the weights is below some threshold; or terminate the training when the Δcost starts to increase for the validation sample. One way to implement the last two criteria is sketched below.
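This is a minimal sketch using the Keras EarlyStopping callback; the thresholds (min_delta, patience) and the variable names x_train, y_train, x_val, y_val are illustrative assumptions.

```python
import tensorflow as tf

early_stop = tf.keras.callbacks.EarlyStopping(
    monitor="val_loss",         # watch the cost evaluated on the validation sample
    min_delta=1e-4,             # threshold on the improvement in cost
    patience=10,                # tolerate a few epochs without improvement
    restore_best_weights=True,  # roll back to the epoch with the best validation cost
)
# model.fit(x_train, y_train, validation_data=(x_val, y_val),
#           epochs=1000, callbacks=[early_stop])
```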

34 34 OVERTRAINING: TRAINING VALIDATION: EXAMPLE HIGGS KAGGLE DATA This example shows the Higgs Kaggle (H→ττ) data with an overtrained neural network. The hyper parameters are not optimal, but the test and train samples give similar performance. The test sample (green) has a consistently high cost value relative to the train sample (blue). 65% accuracy is attained before overtraining starts. [Plots: overtrained network.] This example uses all features of the data set, but only 2000 test/train events, with a learning rate of , and dropout is not being used. The network has a single layer with 256 nodes and a single output to classify if a given event is signal or background. There is no batch training used for this example. See for example code for this problem.

35 35 OVERTRAINING: TRAINING VALIDATION: EXAMPLE HIGGS KAGGLE DATA Continuing to train an over-trained network does not resolve the issue - the network remains over-trained. 65% accuracy is attained before overtraining starts. [Plots: overtrained network.] In this case the level of overtraining continues to increase and the difference in the performance of the train and test samples in terms of accuracy increases. After training cycles we have an accuracy of over 93% for the train sample, but less than 70% for the test sample. This is not a good configuration to use on unseen data as the outcome is unpredictable. For this example we should terminate after about 200 epochs, while the test/train performance remains similar, with an accuracy of 70%. Other network configurations may be better. See for example code for this problem.

36 36 OVERTRAINING: DROPOUT FOR DEEP NETWORKS A pragmatic way to mitigate overfitting is to compromise the model randomly in different epochs of the training by removing units from the network. Dropout is used during training; when evaluating predictions with the validation or unseen data the full network of Fig. (a) is used. That way the whole model will be effectively trained on a sub-sample of the data in the hope that the effect of statistical fluctuations will be limited. This does not remove the possibility that a model is overtrained; as with the previous discussion, HP generalisation is promoted by using this method. Srivastava et al., J. Machine Learning Research 15 (2014)

37 37 OVERTRAINING: DROPOUT FOR DEEP NETWORKS A variety of architectures has been explored with different training samples (see Ref. [1] for details). Dropout can be detrimental for small training samples; however, in general the results show that dropout is beneficial. For deep networks with typical training samples of O(500) examples or more this technique is expected to be beneficial. [1] Srivastava et al., J. Machine Learning Research 15 (2014)

38 38 OVERTRAINING: DROPOUT: EXAMPLE HIGGS KAGGLE DATA Changing the dropout keep probability from 0.9 to 0.7 stops the network becoming overtrained in the first 1000 epochs. ~73% accuracy is attained without overtraining for 1000 epochs, with a keep probability of 0.7 and a learning rate of . A better accuracy is attained for this network using dropout: above 70%. See for example code for this problem.
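A minimal sketch of a single-hidden-layer classifier like the one used in these Kaggle examples, with dropout added. Note that the Keras Dropout layer takes the fraction of units to drop, so a keep probability of 0.7 corresponds to Dropout(0.3); the input dimension and optimiser choice here are assumptions.

```python
import tensorflow as tf

n_features = 30                                       # assumed input dimension
model = tf.keras.Sequential([
    tf.keras.layers.Dense(256, activation="relu", input_shape=(n_features,)),
    tf.keras.layers.Dropout(0.3),                     # keep probability of 0.7
    tf.keras.layers.Dense(1, activation="sigmoid"),   # signal/background output
])
model.compile(optimizer="adam", loss="binary_crossentropy", metrics=["accuracy"])
# Dropout is active only during fit(); predictions are made with the full network.
```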

39 39 OVERTRAINING: WEIGHT REGULARISATION FOR NEURAL NETWORKS Weight regularisation involves adding a penalty term to the loss function used to optimise the HPs of a network. This term is based on the sum of the weights w_i (including bias parameters) in the network and takes the form $\lambda\sum_{i\,\in\,\mathrm{weights}} |w_i|$. This is the L1-norm regularisation term. The rationale is to add an additional cost term to the optimisation coming from the complexity of the network. The performance of the network will vary as a function of λ. To optimise a network using weight regularisation it will have to be trained a number of times in order to identify the value of λ corresponding to the min(cost) from the set of trained solutions. See Ch. 9 of Bishop's Neural Networks for Pattern Recognition; Loshchilov and Hutter, arXiv:

40 40 OVERTRAINING: WEIGHT REGULARISATION FOR NEURAL NETWORKS Weight regularisation involves adding a penalty term to the loss function used to optimise the HPs of a network. This term is based on the sum of the squared weights w_i (including bias parameters) in the network and takes the form $\lambda\sum_{i\,\in\,\mathrm{weights}} w_i^2$. This variant is the L2-norm regularisation term, also known as weight decay regularisation. The rationale is to add an additional cost term to the optimisation coming from the complexity of the network. The performance of the network will vary as a function of λ. To optimise a network using weight regularisation it will have to be trained a number of times in order to identify the value of λ corresponding to the min(cost) from the set of trained solutions. See Ch. 9 of Bishop's Neural Networks for Pattern Recognition; Loshchilov and Hutter, arXiv:

41 41 OVERTRAINING: WEIGHT REGULARISATION FOR NEURAL NETWORKS For example we can consider extending an MSE cost function to allow for weight regularisation. The MSE cost is given by $\varepsilon = \frac{1}{N}\sum_{i=1}^{N}(y_i - t_i)^2$. To allow for regularisation we add the sum-of-squared-weights term: $\varepsilon = \frac{1}{N}\sum_{i=1}^{N}(y_i - t_i)^2 + \lambda\sum_{i\,\in\,\mathrm{weights}} w_i^2$. This is a simple modification to make to the NN training process that adds a penalty for the inclusion of non-zero weights in the network. See Ch. 9 of Bishop's Neural Networks for Pattern Recognition; Loshchilov and Hutter, arXiv:
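A minimal sketch of this regularised cost in numpy, with the network weights collected in a list of arrays; in Keras the same effect is obtained per layer through a kernel_regularizer. The variable names here are assumptions.

```python
import numpy as np

def regularised_mse(y, t, weights, lam):
    """MSE cost plus lambda times the sum of squared weights."""
    mse = np.mean((y - t) ** 2)
    penalty = lam * sum(np.sum(w ** 2) for w in weights)
    return mse + penalty

# Keras equivalent (per layer):
# tf.keras.layers.Dense(256, kernel_regularizer=tf.keras.regularizers.l2(lam))
```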

42 42 OVERTRAINING: WEIGHT REGULARISATION FOR NEURAL NETWORKS The optimisation process needs to be run for each value of λ in order to choose the best (least cost) solution that is not overtrained. This means we have to train the network many times, with different values of λ, when using regularisation. e.g. for the Kaggle example with 1000 epochs we have: [Table: λ versus cost (train / test) and accuracy (train / test), with the cost of the weights increasing down the table; the numerical values are not reproduced here. The no-regularisation entry is marked as having the best cost, and the best performance for this sampling of λ is marked, with similar outputs obtained for a range of trainings.] See Ch. 9 of Bishop's Neural Networks for Pattern Recognition; Loshchilov and Hutter, arXiv:

43 43 OVERTRAINING: WEIGHT REGULARISATION FOR NEURAL NETWORKS The optimisation process needs to be run for each value of λ in order to choose the best (least cost) solution that is not overtrained. This means we have to train the network many times, with different values of λ, when using regularisation. e.g. for the Kaggle example with 1000 epochs we have: [Plot: cost and accuracy versus λ, where the cost includes $\lambda\sum_{i\,\in\,\mathrm{weights}} w_i^2$.] Accuracy falls off as λ increases, but there is little dependence for this example. See Ch. 9 of Bishop's Neural Networks for Pattern Recognition; Loshchilov and Hutter, arXiv:

44 44 OVER FITTING: CROSS VALIDATION An alternative way of thinking about the problem is to assume that the response function of the model will have some bias and some variance. The bias will be irreducible and mean that the predictions made will have some systematic effect related to the average output value. The variance will depend on the size of the training sample, and we would like to know what this is. We can use cross validation to estimate the prediction error. Any prediction bias can be measured using control samples. Geisser, S. (1975). The predictive sample reuse method with applications. J. Amer. Statist. Assoc., 70: For a review of cross validation see: S. Arlot and A. Celisse, Statistics Surveys Vol (2010).

45 45 OVER FITTING: CROSS VALIDATION Application of this concept to machine learning can be seen via k-fold cross validation and its variants*. Divide the data sample for training and validation into k equal sub-samples. From these one can prepare k sets of validation samples and residual training samples. Each set uses all examples, but the training and validation sub-sets are distinct. [Figure: k folds of the data, each with a different sub-sample held out for validation.] *Variants include the extremes of leave-1-out CV and hold-out CV as well as leave-p-out CV. These involve reserving 1 example, 50% of examples and p examples for testing, and the remainder of the data for training, respectively. Geisser, S. (1975). The predictive sample reuse method with applications. J. Amer. Statist. Assoc., 70: For a review of cross validation see: S. Arlot and A. Celisse, Statistics Surveys Vol (2010).

46 46 OVER FITTING: CROSS VALIDATION Application of this concept to machine learning can be seen via k-fold cross validation and its variants*. One can then train the model on each of the k training sets, validating the performance of the network on the corresponding validation set. We can measure the model error on each validation set. The model prediction error is the average error obtained from the k folds on their corresponding validation sets (a sketch of this procedure is given below). If desired we could combine the k models to compute some average that would have a prediction performance that is more robust than any of the individual folds. *Variants include the extremes of leave-1-out CV and hold-out CV as well as leave-p-out CV. These involve reserving 1 example, 50% of examples and p examples for testing, and the remainder of the data for training, respectively. Geisser, S. (1975). The predictive sample reuse method with applications. J. Amer. Statist. Assoc., 70: For a review of cross validation see: S. Arlot and A. Celisse, Statistics Surveys Vol (2010).
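A minimal sketch of k-fold cross validation using scikit-learn to produce the folds; build_model() is an assumed factory returning a freshly compiled network (loss only, no extra metrics), and X, y are the feature and label arrays. The returned mean is the cross-validated estimate of the prediction error.

```python
import numpy as np
from sklearn.model_selection import KFold

def kfold_error(build_model, X, y, k=5):
    """Average validation cost over k folds."""
    fold_costs = []
    for train_idx, val_idx in KFold(n_splits=k, shuffle=True, random_state=0).split(X):
        model = build_model()                              # fresh model for each fold
        model.fit(X[train_idx], y[train_idx], epochs=200, verbose=0)
        fold_costs.append(model.evaluate(X[val_idx], y[val_idx], verbose=0))
    return np.mean(fold_costs), np.std(fold_costs)
```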

47 47 OVER FITTING: CROSS VALIDATION Application of this concept to machine learning can be seen via k-fold cross validation and its variants*. The ensemble of response function outputs will vary, in analogy with the spread of a Gaussian distribution. This results in a family of ROC curves, with a representative performance that is neither the best nor the worst ROC. The example shown is for a Support Vector Machine, with the best, average and holdout ROC curves to indicate some sense of spread. [Figure: ROC curves of background rejection (1-eff) versus signal efficiency (true positive rate) for SVM_Best, SVM_Average and SVM_Holdout_RBF.] *Variants include the extremes of leave-1-out CV and hold-out CV as well as leave-p-out CV. These involve reserving 1 example, 50% of examples and p examples for testing, and the remainder of the data for training, respectively. Geisser, S. (1975). The predictive sample reuse method with applications. J. Amer. Statist. Assoc., 70: For a review of cross validation see: S. Arlot and A. Celisse, Statistics Surveys Vol (2010).

48 48 SUMMARY Hyper-parameter optimisation does not guarantee a robust solution. Minimisation of different cost functions will allow us to determine optimal hyper-parameters for our model. When the model starts to learn statistical fluctuations in the training sample we should stop the training to avoid overtraining (i.e. to avoid learning the statistical fluctuations in a given sample). Methods exist to mitigate overtraining [dropout, regularisation, cross validation]; however, none of them guarantees a generalised solution that is robust against overtraining. The only exception to this is to supply an effectively infinite sample of training data for a given problem.

49 49 SUGGESTED READING (HEP RELATED) There is not much HEP-specific literature regarding HP optimisation. Minuit is used to optimise parameters in TMVA, along with other algorithms; see the Minuit user guide and TMVA documentation for more details. Minuit is based on the DFP method of variable metric minimisation. Logarithmic grid searches are also discussed for specific algorithms with a few HPs (see the SVM notes).

50 50 SUGGESTED READING (NON-HEP) There are a few backup slides on the ADAM optimiser, which performs well on a number of deep learning problems, that you might wish to look at. In addition, the following books discuss the issue of parameter optimisation: Bishop, Neural Networks for Pattern Recognition, Oxford University Press (2013); Cristianini and Shawe-Taylor, Support Vector Machines and other kernel-based learning methods, Cambridge University Press (2014); Goodfellow, Deep Learning, MIT Press (2016); MacKay's Information Theory, Inference and Learning Algorithms, Cambridge University Press (2011).

51 51 APPENDIX The following slides contain additional information that may be of interest.

52 52 MULTIPLE MINIMA Often more complicated hyperspace optimisation problems are encountered, where there are multiple minima. The gradient descent minimisation algorithm is based on the assumption that there is a single minimum to be found. In reality there are often multiple minima. Sometimes the minima are degenerate, or near degenerate. How do we know we have converged on the global minimum? [Figure: a cost surface with one of several minima indicated.]

53 53 MULTIPLE MINIMA Often more complicated hyperspace optimisation problems are encountered, where there are multiple minima. The gradient descent minimisation algorithm is based on the assumption that there is a single minimum to be found. In reality there are often multiple minima. Sometimes the minima are degenerate, or near degenerate. How do we know we have converged on the global minimum? [Figure: the same cost surface with the global minimum indicated.]

54 54 GRID SEARCHES Just as we can scan through a parameter in order to minimise a likelihood ratio, we can scan through a HP to observe how the loss function changes. For simple models we can construct a 2D grid of points in m and c. Evaluating the loss function for each point in the 2D sample space we can construct a grid from which to select the minimum value. The assumption here is that our grid spacing is sufficient for the purpose of optimising the problem. This type of parameter search is often used for support vector machine HPs (kernel function parameters and cost). This method does not scale to large numbers of parameters.

55 55 GRID SEARCHES e.g. consider a linear regression study optimising the parameters for the model y = mx + c. The loss function for this problem results in a valley, as m and c are anticorrelated parameters in this 2D hyperspace. The contours of the loss function show a minimum, but this is selected from a discrete grid of points (need to ensure the grid spacing is sufficient for your needs). [Plots: the loss L as a function of m and c, shown as a surface and as contours.]
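A minimal sketch of such a 2D grid search for y = mx + c with an MSE loss; the toy data, grid ranges and spacing are illustrative assumptions, and the precision of the result is limited by the grid spacing.

```python
import numpy as np

x_data = np.linspace(0.0, 1.0, 50)
y_data = 2.0 * x_data + 0.5                         # toy data generated with m=2, c=0.5

m_grid = np.linspace(-5.0, 5.0, 101)
c_grid = np.linspace(-5.0, 5.0, 101)
M, C = np.meshgrid(m_grid, c_grid)

# MSE loss at every (m, c) grid point, averaged over the data.
loss = ((M[..., None] * x_data + C[..., None] - y_data) ** 2).mean(axis=-1)

best = np.unravel_index(np.argmin(loss), loss.shape)
print("best grid point: m =", M[best], ", c =", C[best])
```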

56 56 ADAM OPTIMISER This is a stochastic gradient descent algorithm. Consider a model f(θ) that is differentiable with respect to the HPs θ, so that the gradient $g_t = \nabla_\theta f_t(\theta_{t-1})$ can be computed. Here t is the training epoch; m_t and v_t are biased estimates of the first and second moments of the gradient; $\hat{m}_t$ and $\hat{v}_t$ are the bias-corrected estimators of the moments. Some initial guess for the HPs is taken, θ_0, and the HPs for a given epoch are denoted by θ_t. α is the step size; β_1 and β_2 are the exponential decay rates of the moving averages. Kingma and Ba, Proc. 3rd Int. Conf. Learning Representations, San Diego, 2015

57 57 ADAM OPTIMISER ADAptive Moment estimation based on gradient descent. Kingma and Ba, Proc. 3rd Int Conf. Learning Representations, San Diego, 2015
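The algorithm box from the original slide is not reproduced here; as a substitute, the following is a minimal sketch of the Adam update for a single parameter, following the moment estimates described on the previous slide. The gradient function and the toy cost are assumptions; the default α, β1, β2 and ε values follow the Adam paper.

```python
import numpy as np

def adam(grad, theta0, alpha=0.001, beta1=0.9, beta2=0.999, eps=1e-8, steps=1000):
    theta, m, v = theta0, 0.0, 0.0
    for t in range(1, steps + 1):
        g = grad(theta)                           # g_t: gradient of f_t at theta_{t-1}
        m = beta1 * m + (1.0 - beta1) * g         # biased first-moment estimate m_t
        v = beta2 * v + (1.0 - beta2) * g * g     # biased second-moment estimate v_t
        m_hat = m / (1.0 - beta1 ** t)            # bias-corrected moments
        v_hat = v / (1.0 - beta2 ** t)
        theta -= alpha * m_hat / (np.sqrt(v_hat) + eps)
    return theta

# Toy use: minimise (theta - 3)^2, whose gradient is 2*(theta - 3).
print(adam(lambda th: 2.0 * (th - 3.0), theta0=0.0, alpha=0.1))   # close to 3
```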

58 58 ADAM OPTIMISER Benchmarking performance using MNIST and CIFAR-10 data indicates that Adam with dropout minimises the loss function compared with the other optimisers tested. A faster drop-off in cost, and a lower overall cost, is obtained. Kingma and Ba, Proc. 3rd Int. Conf. Learning Representations, San Diego, 2015

59 59 DAVIDON-FLETCHER-POWELL This variable metric minimisation algorithm (1959, 1963) is described on Wikipedia as: "The DFP formula is quite effective, but it was soon superseded by the BFGS formula, which is its dual (interchanging the roles of y and s)." It is used in high energy physics via the MINUIT (FORTRAN) and Minuit2 (C++) implementations. The standard tools used for data analysis in HEP have these implementations available, and while the algorithm may no longer be optimal, it is still deemed good enough by many for the optimisation tasks: robust / reliable minimisation; underlying method derived assuming parabolic minima; understood and trusted by the HEP community.

60 60 DAVIDON-FLETCHER-POWELL This variable metric minimisation algorithm (1959, 1963) is described on Wikipedia as: "The DFP formula is quite effective, but it was soon superseded by the BFGS formula, which is its dual (interchanging the roles of y and s)." It is used in high energy physics via the MINUIT (FORTRAN) and Minuit2 (C++) implementations. This is mentioned only because it is the dominant algorithm used in particle physics at this time. Almost all minimisation problems are solved using this algorithm, where the dominant use is for maximum likelihood and χ² fit minimisation problems. The number of HPs required to solve those problems is small (up to a few hundred) in comparison with the numbers required for neural networks (esp. deep learning problems). If you don't (intend to) work in this field, you can now forget you heard about this algorithm.


The exam is closed book, closed notes except your one-page (two-sided) cheat sheet.

The exam is closed book, closed notes except your one-page (two-sided) cheat sheet. CS 189 Spring 2015 Introduction to Machine Learning Final You have 2 hours 50 minutes for the exam. The exam is closed book, closed notes except your one-page (two-sided) cheat sheet. No calculators or

More information

CS6375: Machine Learning Gautam Kunapuli. Mid-Term Review

CS6375: Machine Learning Gautam Kunapuli. Mid-Term Review Gautam Kunapuli Machine Learning Data is identically and independently distributed Goal is to learn a function that maps to Data is generated using an unknown function Learn a hypothesis that minimizes

More information

Evaluating Classifiers

Evaluating Classifiers Evaluating Classifiers Reading for this topic: T. Fawcett, An introduction to ROC analysis, Sections 1-4, 7 (linked from class website) Evaluating Classifiers What we want: Classifier that best predicts

More information

Deep Generative Models Variational Autoencoders

Deep Generative Models Variational Autoencoders Deep Generative Models Variational Autoencoders Sudeshna Sarkar 5 April 2017 Generative Nets Generative models that represent probability distributions over multiple variables in some way. Directed Generative

More information

Machine Learning. MGS Lecture 3: Deep Learning

Machine Learning. MGS Lecture 3: Deep Learning Dr Michel F. Valstar http://cs.nott.ac.uk/~mfv/ Machine Learning MGS Lecture 3: Deep Learning Dr Michel F. Valstar http://cs.nott.ac.uk/~mfv/ WHAT IS DEEP LEARNING? Shallow network: Only one hidden layer

More information

Cost Functions in Machine Learning

Cost Functions in Machine Learning Cost Functions in Machine Learning Kevin Swingler Motivation Given some data that reflects measurements from the environment We want to build a model that reflects certain statistics about that data Something

More information

CSE Data Mining Concepts and Techniques STATISTICAL METHODS (REGRESSION) Professor- Anita Wasilewska. Team 13

CSE Data Mining Concepts and Techniques STATISTICAL METHODS (REGRESSION) Professor- Anita Wasilewska. Team 13 CSE 634 - Data Mining Concepts and Techniques STATISTICAL METHODS Professor- Anita Wasilewska (REGRESSION) Team 13 Contents Linear Regression Logistic Regression Bias and Variance in Regression Model Fit

More information

Practical Methodology. Lecture slides for Chapter 11 of Deep Learning Ian Goodfellow

Practical Methodology. Lecture slides for Chapter 11 of Deep Learning  Ian Goodfellow Practical Methodology Lecture slides for Chapter 11 of Deep Learning www.deeplearningbook.org Ian Goodfellow 2016-09-26 What drives success in ML? Arcane knowledge of dozens of obscure algorithms? Mountains

More information

What is machine learning?

What is machine learning? Machine learning, pattern recognition and statistical data modelling Lecture 12. The last lecture Coryn Bailer-Jones 1 What is machine learning? Data description and interpretation finding simpler relationship

More information

Machine Learning for NLP

Machine Learning for NLP Machine Learning for NLP Support Vector Machines Aurélie Herbelot 2018 Centre for Mind/Brain Sciences University of Trento 1 Support Vector Machines: introduction 2 Support Vector Machines (SVMs) SVMs

More information

Machine Learning. The Breadth of ML Neural Networks & Deep Learning. Marc Toussaint. Duy Nguyen-Tuong. University of Stuttgart

Machine Learning. The Breadth of ML Neural Networks & Deep Learning. Marc Toussaint. Duy Nguyen-Tuong. University of Stuttgart Machine Learning The Breadth of ML Neural Networks & Deep Learning Marc Toussaint University of Stuttgart Duy Nguyen-Tuong Bosch Center for Artificial Intelligence Summer 2017 Neural Networks Consider

More information

Model Generalization and the Bias-Variance Trade-Off

Model Generalization and the Bias-Variance Trade-Off Charu C. Aggarwal IBM T J Watson Research Center Yorktown Heights, NY Model Generalization and the Bias-Variance Trade-Off Neural Networks and Deep Learning, Springer, 2018 Chapter 4, Section 4.1-4.2 What

More information

Theoretical Concepts of Machine Learning

Theoretical Concepts of Machine Learning Theoretical Concepts of Machine Learning Part 2 Institute of Bioinformatics Johannes Kepler University, Linz, Austria Outline 1 Introduction 2 Generalization Error 3 Maximum Likelihood 4 Noise Models 5

More information

Neural Networks. Single-layer neural network. CSE 446: Machine Learning Emily Fox University of Washington March 10, /10/2017

Neural Networks. Single-layer neural network. CSE 446: Machine Learning Emily Fox University of Washington March 10, /10/2017 3/0/207 Neural Networks Emily Fox University of Washington March 0, 207 Slides adapted from Ali Farhadi (via Carlos Guestrin and Luke Zettlemoyer) Single-layer neural network 3/0/207 Perceptron as a neural

More information

Gradient Descent. Wed Sept 20th, James McInenrey Adapted from slides by Francisco J. R. Ruiz

Gradient Descent. Wed Sept 20th, James McInenrey Adapted from slides by Francisco J. R. Ruiz Gradient Descent Wed Sept 20th, 2017 James McInenrey Adapted from slides by Francisco J. R. Ruiz Housekeeping A few clarifications of and adjustments to the course schedule: No more breaks at the midpoint

More information

Support Vector Machines

Support Vector Machines Support Vector Machines Chapter 9 Chapter 9 1 / 50 1 91 Maximal margin classifier 2 92 Support vector classifiers 3 93 Support vector machines 4 94 SVMs with more than two classes 5 95 Relationshiop to

More information

Gradient Descent Optimization Algorithms for Deep Learning Batch gradient descent Stochastic gradient descent Mini-batch gradient descent

Gradient Descent Optimization Algorithms for Deep Learning Batch gradient descent Stochastic gradient descent Mini-batch gradient descent Gradient Descent Optimization Algorithms for Deep Learning Batch gradient descent Stochastic gradient descent Mini-batch gradient descent Slide credit: http://sebastianruder.com/optimizing-gradient-descent/index.html#batchgradientdescent

More information

CS570: Introduction to Data Mining

CS570: Introduction to Data Mining CS570: Introduction to Data Mining Classification Advanced Reading: Chapter 8 & 9 Han, Chapters 4 & 5 Tan Anca Doloc-Mihu, Ph.D. Slides courtesy of Li Xiong, Ph.D., 2011 Han, Kamber & Pei. Data Mining.

More information

Classification Lecture Notes cse352. Neural Networks. Professor Anita Wasilewska

Classification Lecture Notes cse352. Neural Networks. Professor Anita Wasilewska Classification Lecture Notes cse352 Neural Networks Professor Anita Wasilewska Neural Networks Classification Introduction INPUT: classification data, i.e. it contains an classification (class) attribute

More information

Lecture #11: The Perceptron

Lecture #11: The Perceptron Lecture #11: The Perceptron Mat Kallada STAT2450 - Introduction to Data Mining Outline for Today Welcome back! Assignment 3 The Perceptron Learning Method Perceptron Learning Rule Assignment 3 Will be

More information

Conflict Graphs for Parallel Stochastic Gradient Descent

Conflict Graphs for Parallel Stochastic Gradient Descent Conflict Graphs for Parallel Stochastic Gradient Descent Darshan Thaker*, Guneet Singh Dhillon* Abstract We present various methods for inducing a conflict graph in order to effectively parallelize Pegasos.

More information

Facial Expression Classification with Random Filters Feature Extraction

Facial Expression Classification with Random Filters Feature Extraction Facial Expression Classification with Random Filters Feature Extraction Mengye Ren Facial Monkey mren@cs.toronto.edu Zhi Hao Luo It s Me lzh@cs.toronto.edu I. ABSTRACT In our work, we attempted to tackle

More information

Deep Learning. Volker Tresp Summer 2014

Deep Learning. Volker Tresp Summer 2014 Deep Learning Volker Tresp Summer 2014 1 Neural Network Winter and Revival While Machine Learning was flourishing, there was a Neural Network winter (late 1990 s until late 2000 s) Around 2010 there

More information

IST 597 Foundations of Deep Learning Fall 2018 Homework 1: Regression & Gradient Descent

IST 597 Foundations of Deep Learning Fall 2018 Homework 1: Regression & Gradient Descent IST 597 Foundations of Deep Learning Fall 2018 Homework 1: Regression & Gradient Descent This assignment is worth 15% of your grade for this class. 1 Introduction Before starting your first assignment,

More information

Deep Learning. Vladimir Golkov Technical University of Munich Computer Vision Group

Deep Learning. Vladimir Golkov Technical University of Munich Computer Vision Group Deep Learning Vladimir Golkov Technical University of Munich Computer Vision Group 1D Input, 1D Output target input 2 2D Input, 1D Output: Data Distribution Complexity Imagine many dimensions (data occupies

More information

Multivariate Data Analysis and Machine Learning in High Energy Physics (V)

Multivariate Data Analysis and Machine Learning in High Energy Physics (V) Multivariate Data Analysis and Machine Learning in High Energy Physics (V) Helge Voss (MPI K, Heidelberg) Graduierten-Kolleg, Freiburg, 11.5-15.5, 2009 Outline last lecture Rule Fitting Support Vector

More information

Ensemble methods in machine learning. Example. Neural networks. Neural networks

Ensemble methods in machine learning. Example. Neural networks. Neural networks Ensemble methods in machine learning Bootstrap aggregating (bagging) train an ensemble of models based on randomly resampled versions of the training set, then take a majority vote Example What if you

More information

CPSC 340: Machine Learning and Data Mining. Deep Learning Fall 2016

CPSC 340: Machine Learning and Data Mining. Deep Learning Fall 2016 CPSC 340: Machine Learning and Data Mining Deep Learning Fall 2016 Assignment 5: Due Friday. Assignment 6: Due next Friday. Final: Admin December 12 (8:30am HEBB 100) Covers Assignments 1-6. Final from

More information

Supervised Learning with Neural Networks. We now look at how an agent might learn to solve a general problem by seeing examples.

Supervised Learning with Neural Networks. We now look at how an agent might learn to solve a general problem by seeing examples. Supervised Learning with Neural Networks We now look at how an agent might learn to solve a general problem by seeing examples. Aims: to present an outline of supervised learning as part of AI; to introduce

More information

More on Neural Networks. Read Chapter 5 in the text by Bishop, except omit Sections 5.3.3, 5.3.4, 5.4, 5.5.4, 5.5.5, 5.5.6, 5.5.7, and 5.

More on Neural Networks. Read Chapter 5 in the text by Bishop, except omit Sections 5.3.3, 5.3.4, 5.4, 5.5.4, 5.5.5, 5.5.6, 5.5.7, and 5. More on Neural Networks Read Chapter 5 in the text by Bishop, except omit Sections 5.3.3, 5.3.4, 5.4, 5.5.4, 5.5.5, 5.5.6, 5.5.7, and 5.6 Recall the MLP Training Example From Last Lecture log likelihood

More information

5 Learning hypothesis classes (16 points)

5 Learning hypothesis classes (16 points) 5 Learning hypothesis classes (16 points) Consider a classification problem with two real valued inputs. For each of the following algorithms, specify all of the separators below that it could have generated

More information

Efficient Tuning of SVM Hyperparameters Using Radius/Margin Bound and Iterative Algorithms

Efficient Tuning of SVM Hyperparameters Using Radius/Margin Bound and Iterative Algorithms IEEE TRANSACTIONS ON NEURAL NETWORKS, VOL. 13, NO. 5, SEPTEMBER 2002 1225 Efficient Tuning of SVM Hyperparameters Using Radius/Margin Bound and Iterative Algorithms S. Sathiya Keerthi Abstract This paper

More information

Early Stopping. Sargur N. Srihari

Early Stopping. Sargur N. Srihari Early Stopping Sargur N. srihari@cedar.buffalo.edu This is part of lecture slides on Deep Learning: http://www.cedar.buffalo.edu/~srihari/cse676 1 Regularization Strategies 1. Parameter Norm Penalties

More information

CS535 Big Data Fall 2017 Colorado State University 10/10/2017 Sangmi Lee Pallickara Week 8- A.

CS535 Big Data Fall 2017 Colorado State University   10/10/2017 Sangmi Lee Pallickara Week 8- A. CS535 Big Data - Fall 2017 Week 8-A-1 CS535 BIG DATA FAQs Term project proposal New deadline: Tomorrow PA1 demo PART 1. BATCH COMPUTING MODELS FOR BIG DATA ANALYTICS 5. ADVANCED DATA ANALYTICS WITH APACHE

More information

Support Vector Machines

Support Vector Machines Support Vector Machines RBF-networks Support Vector Machines Good Decision Boundary Optimization Problem Soft margin Hyperplane Non-linear Decision Boundary Kernel-Trick Approximation Accurancy Overtraining

More information

Perceptron Introduction to Machine Learning. Matt Gormley Lecture 5 Jan. 31, 2018

Perceptron Introduction to Machine Learning. Matt Gormley Lecture 5 Jan. 31, 2018 10-601 Introduction to Machine Learning Machine Learning Department School of Computer Science Carnegie Mellon University Perceptron Matt Gormley Lecture 5 Jan. 31, 2018 1 Q&A Q: We pick the best hyperparameters

More information

Machine Learning Lecture 3

Machine Learning Lecture 3 Machine Learning Lecture 3 Probability Density Estimation II 19.10.2017 Bastian Leibe RWTH Aachen http://www.vision.rwth-aachen.de leibe@vision.rwth-aachen.de Announcements Exam dates We re in the process

More information

Learning from Data: Adaptive Basis Functions

Learning from Data: Adaptive Basis Functions Learning from Data: Adaptive Basis Functions November 21, 2005 http://www.anc.ed.ac.uk/ amos/lfd/ Neural Networks Hidden to output layer - a linear parameter model But adapt the features of the model.

More information

Partitioning Data. IRDS: Evaluation, Debugging, and Diagnostics. Cross-Validation. Cross-Validation for parameter tuning

Partitioning Data. IRDS: Evaluation, Debugging, and Diagnostics. Cross-Validation. Cross-Validation for parameter tuning Partitioning Data IRDS: Evaluation, Debugging, and Diagnostics Charles Sutton University of Edinburgh Training Validation Test Training : Running learning algorithms Validation : Tuning parameters of learning

More information

Neural Networks. Theory And Practice. Marco Del Vecchio 19/07/2017. Warwick Manufacturing Group University of Warwick

Neural Networks. Theory And Practice. Marco Del Vecchio 19/07/2017. Warwick Manufacturing Group University of Warwick Neural Networks Theory And Practice Marco Del Vecchio marco@delvecchiomarco.com Warwick Manufacturing Group University of Warwick 19/07/2017 Outline I 1 Introduction 2 Linear Regression Models 3 Linear

More information

A very different type of Ansatz - Decision Trees

A very different type of Ansatz - Decision Trees A very different type of Ansatz - Decision Trees A Decision Tree encodes sequential rectangular cuts But with a lot of underlying theory on training and optimization Machine-learning technique, widely

More information

The exam is closed book, closed notes except your one-page (two-sided) cheat sheet.

The exam is closed book, closed notes except your one-page (two-sided) cheat sheet. CS 189 Spring 2015 Introduction to Machine Learning Final You have 2 hours 50 minutes for the exam. The exam is closed book, closed notes except your one-page (two-sided) cheat sheet. No calculators or

More information

Natural Language Processing CS 6320 Lecture 6 Neural Language Models. Instructor: Sanda Harabagiu

Natural Language Processing CS 6320 Lecture 6 Neural Language Models. Instructor: Sanda Harabagiu Natural Language Processing CS 6320 Lecture 6 Neural Language Models Instructor: Sanda Harabagiu In this lecture We shall cover: Deep Neural Models for Natural Language Processing Introduce Feed Forward

More information

Modern Methods of Data Analysis - WS 07/08

Modern Methods of Data Analysis - WS 07/08 Modern Methods of Data Analysis Lecture XV (04.02.08) Contents: Function Minimization (see E. Lohrmann & V. Blobel) Optimization Problem Set of n independent variables Sometimes in addition some constraints

More information

Network Traffic Measurements and Analysis

Network Traffic Measurements and Analysis DEIB - Politecnico di Milano Fall, 2017 Sources Hastie, Tibshirani, Friedman: The Elements of Statistical Learning James, Witten, Hastie, Tibshirani: An Introduction to Statistical Learning Andrew Ng:

More information

Tutorial on Machine Learning Tools

Tutorial on Machine Learning Tools Tutorial on Machine Learning Tools Yanbing Xue Milos Hauskrecht Why do we need these tools? Widely deployed classical models No need to code from scratch Easy-to-use GUI Outline Matlab Apps Weka 3 UI TensorFlow

More information

CS6220: DATA MINING TECHNIQUES

CS6220: DATA MINING TECHNIQUES CS6220: DATA MINING TECHNIQUES Image Data: Classification via Neural Networks Instructor: Yizhou Sun yzsun@ccs.neu.edu November 19, 2015 Methods to Learn Classification Clustering Frequent Pattern Mining

More information

CPSC 340: Machine Learning and Data Mining. Feature Selection Fall 2016

CPSC 340: Machine Learning and Data Mining. Feature Selection Fall 2016 CPSC 34: Machine Learning and Data Mining Feature Selection Fall 26 Assignment 3: Admin Solutions will be posted after class Wednesday. Extra office hours Thursday: :3-2 and 4:3-6 in X836. Midterm Friday:

More information

Neural Network and Deep Learning. Donglin Zeng, Department of Biostatistics, University of North Carolina

Neural Network and Deep Learning. Donglin Zeng, Department of Biostatistics, University of North Carolina Neural Network and Deep Learning Early history of deep learning Deep learning dates back to 1940s: known as cybernetics in the 1940s-60s, connectionism in the 1980s-90s, and under the current name starting

More information

Topics in Machine Learning

Topics in Machine Learning Topics in Machine Learning Gilad Lerman School of Mathematics University of Minnesota Text/slides stolen from G. James, D. Witten, T. Hastie, R. Tibshirani and A. Ng Machine Learning - Motivation Arthur

More information

Day 3 Lecture 1. Unsupervised Learning

Day 3 Lecture 1. Unsupervised Learning Day 3 Lecture 1 Unsupervised Learning Semi-supervised and transfer learning Myth: you can t do deep learning unless you have a million labelled examples for your problem. Reality You can learn useful representations

More information

Classification: Feature Vectors

Classification: Feature Vectors Classification: Feature Vectors Hello, Do you want free printr cartriges? Why pay more when you can get them ABSOLUTELY FREE! Just # free YOUR_NAME MISSPELLED FROM_FRIEND... : : : : 2 0 2 0 PIXEL 7,12

More information

Feature Extractors. CS 188: Artificial Intelligence Fall Some (Vague) Biology. The Binary Perceptron. Binary Decision Rule.

Feature Extractors. CS 188: Artificial Intelligence Fall Some (Vague) Biology. The Binary Perceptron. Binary Decision Rule. CS 188: Artificial Intelligence Fall 2008 Lecture 24: Perceptrons II 11/24/2008 Dan Klein UC Berkeley Feature Extractors A feature extractor maps inputs to feature vectors Dear Sir. First, I must solicit

More information

Evaluation. Evaluate what? For really large amounts of data... A: Use a validation set.

Evaluation. Evaluate what? For really large amounts of data... A: Use a validation set. Evaluate what? Evaluation Charles Sutton Data Mining and Exploration Spring 2012 Do you want to evaluate a classifier or a learning algorithm? Do you want to predict accuracy or predict which one is better?

More information

Logical Rhythm - Class 3. August 27, 2018

Logical Rhythm - Class 3. August 27, 2018 Logical Rhythm - Class 3 August 27, 2018 In this Class Neural Networks (Intro To Deep Learning) Decision Trees Ensemble Methods(Random Forest) Hyperparameter Optimisation and Bias Variance Tradeoff Biological

More information

Chap.12 Kernel methods [Book, Chap.7]

Chap.12 Kernel methods [Book, Chap.7] Chap.12 Kernel methods [Book, Chap.7] Neural network methods became popular in the mid to late 1980s, but by the mid to late 1990s, kernel methods have also become popular in machine learning. The first

More information

Learning and Generalization in Single Layer Perceptrons

Learning and Generalization in Single Layer Perceptrons Learning and Generalization in Single Layer Perceptrons Neural Computation : Lecture 4 John A. Bullinaria, 2015 1. What Can Perceptrons do? 2. Decision Boundaries The Two Dimensional Case 3. Decision Boundaries

More information

Machine Learning Methods for Ship Detection in Satellite Images

Machine Learning Methods for Ship Detection in Satellite Images Machine Learning Methods for Ship Detection in Satellite Images Yifan Li yil150@ucsd.edu Huadong Zhang huz095@ucsd.edu Qianfeng Guo qig020@ucsd.edu Xiaoshi Li xil758@ucsd.edu Abstract In this project,

More information

Supervised Learning in Neural Networks (Part 2)

Supervised Learning in Neural Networks (Part 2) Supervised Learning in Neural Networks (Part 2) Multilayer neural networks (back-propagation training algorithm) The input signals are propagated in a forward direction on a layer-bylayer basis. Learning

More information

Perceptron as a graph

Perceptron as a graph Neural Networks Machine Learning 10701/15781 Carlos Guestrin Carnegie Mellon University October 10 th, 2007 2005-2007 Carlos Guestrin 1 Perceptron as a graph 1 0.9 0.8 0.7 0.6 0.5 0.4 0.3 0.2 0.1 0-6 -4-2

More information

Supplementary material for: BO-HB: Robust and Efficient Hyperparameter Optimization at Scale

Supplementary material for: BO-HB: Robust and Efficient Hyperparameter Optimization at Scale Supplementary material for: BO-: Robust and Efficient Hyperparameter Optimization at Scale Stefan Falkner 1 Aaron Klein 1 Frank Hutter 1 A. Available Software To promote reproducible science and enable

More information

Deep Learning for Computer Vision II

Deep Learning for Computer Vision II IIIT Hyderabad Deep Learning for Computer Vision II C. V. Jawahar Paradigm Shift Feature Extraction (SIFT, HoG, ) Part Models / Encoding Classifier Sparrow Feature Learning Classifier Sparrow L 1 L 2 L

More information

Unsupervised Learning

Unsupervised Learning Deep Learning for Graphics Unsupervised Learning Niloy Mitra Iasonas Kokkinos Paul Guerrero Vladimir Kim Kostas Rematas Tobias Ritschel UCL UCL/Facebook UCL Adobe Research U Washington UCL Timetable Niloy

More information

Fall 09, Homework 5

Fall 09, Homework 5 5-38 Fall 09, Homework 5 Due: Wednesday, November 8th, beginning of the class You can work in a group of up to two people. This group does not need to be the same group as for the other homeworks. You

More information

Logistic Regression and Gradient Ascent

Logistic Regression and Gradient Ascent Logistic Regression and Gradient Ascent CS 349-02 (Machine Learning) April 0, 207 The perceptron algorithm has a couple of issues: () the predictions have no probabilistic interpretation or confidence

More information

1 Case study of SVM (Rob)

1 Case study of SVM (Rob) DRAFT a final version will be posted shortly COS 424: Interacting with Data Lecturer: Rob Schapire and David Blei Lecture # 8 Scribe: Indraneel Mukherjee March 1, 2007 In the previous lecture we saw how

More information

CSE 573: Artificial Intelligence Autumn 2010

CSE 573: Artificial Intelligence Autumn 2010 CSE 573: Artificial Intelligence Autumn 2010 Lecture 16: Machine Learning Topics 12/7/2010 Luke Zettlemoyer Most slides over the course adapted from Dan Klein. 1 Announcements Syllabus revised Machine

More information