MULTIVARIATE ANALYSIS AND ITS USE IN HIGH ENERGY PHYSICS

1 1 DR ADRIAN BEVAN MULTIVARIATE ANALYSIS AND ITS USE IN HIGH ENERGY PHYSICS 4) OPTIMISATION Lectures given at the department of Physics at CINVESTAV, Instituto Politécnico Nacional, Mexico City 28th August - 3rd Sept 2018

2 2 LECTURE PLAN: Introduction; Hyperparameter optimisation; Supervised learning; Loss functions; Gradient descent; Stochastic learning; Batch learning; Overtraining; Training validation; Dropout for deep networks; Weight regularisation for neural networks; Cross-validation methods; Summary; Suggested reading.

3 3 INTRODUCTION We initialise the weights of a neural network or other MVA algorithm randomly. The determination of a set of optimal weights requires some heuristic algorithm and some figure of merit. The algorithm is the optimisation method (typically derived from gradient descent). The figure of merit is called the loss or cost function and can take many forms. The optimisation process for machine learning algorithms is similar to the optimisation of a χ² or -lnL fit, where one minimises the χ² or -lnL as the figure of merit with respect to the model parameters.

4 4 HYPERPARAMETER OPTIMISATION Models have hyperparameters (HPs) that are required to fix the response function (*). The set of HPs forms a hyperspace. The purpose of optimisation is to select a point in hyperspace that optimises the performance of the model using some figure of merit (FOM). The figure of merit is called the cost or loss function; cf. least squares regression or a χ² or likelihood fit. (*) Minimisation problems related to likelihood fitting often split the hyper-parameters into parameters of interest (e.g. physical quantities) and other nuisance parameters that are not deemed to be interesting. Machine learning model parameters are the equivalent of nuisance parameters in the language of likelihood fits; e.g. see the book by Edwards, Likelihood (1992), Johns Hopkins University Press.

5 5 HYPERPARAMETER OPTIMISATION Consider a perceptron with N inputs. This has N+1 HPs: N weights and a bias: $y = f\left(\sum_{i=1}^{N} w_i x_i + \theta\right) = f(\mathbf{w}^T\mathbf{x} + \theta)$. For brevity the bias parameter is called a weight from the perspective of HP optimisation.

6 6 HYPERPARAMETER OPTIMISATION Consider a neural network with an N dimensional input feature space, M perceptrons on the input layer and 1 output perceptron. This has M(N+1) + (M+1) HPs. For an MLP with one hidden layer of K perceptrons: This has M(N+1) + K(M+1) + (K+1) HPs. For an MLP with two hidden layers of K and L perceptrons, respectively: This has M(N+1) + K(M+1) + L(K+1) +(L+1) HPs. and so on.
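The counting above can be automated: each perceptron in a layer carries one weight per input plus one bias. A minimal sketch of this counting in Python is given below; the function name and the example layer sizes are illustrative assumptions, not part of the lecture material.

```python
def count_hps(n_inputs, hidden_sizes, n_outputs=1):
    """Count the weights and biases of a fully connected MLP."""
    total = 0
    previous = n_inputs
    for layer_size in list(hidden_sizes) + [n_outputs]:
        total += layer_size * (previous + 1)  # each node: `previous` weights + 1 bias
        previous = layer_size
    return total

# M(N+1) + (M+1) with N=10 inputs, M=20 perceptrons and 1 output:
print(count_hps(10, [20]))        # 241
# M(N+1) + K(M+1) + (K+1) with an extra hidden layer of K=15 perceptrons:
print(count_hps(10, [20, 15]))    # 551
```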

7 7 HYPERPARAMETER OPTIMISATION Neural networks have a lot of HPs. Deep networks, especially CNNs, can have millions of HPs to optimise. This requires appropriate computing resources and appropriately efficient methods for HP optimisation. What is acceptable for an optimisation of 10 or 100 HPs will not generally scale well to a problem with millions of HPs. The more HPs to determine, the more data is required to obtain a generalisable solution for the HPs. By generalisable we mean that the model defined using a set of HPs will have reproducible behaviour when presented with unseen data. When this is not the case we have an overtrained model - one that has learned statistical fluctuations in our training set.

8 8 HYPERPARAMETER OPTIMISATION: SUPERVISED LEARNING The type of machine learning we are using is referred to as supervised learning. We present the algorithm with known (labeled) samples of data, and optimise the HPs in order to minimise the loss function. The loss function is a function of: labels (e.g. signal = +1, background = 0 or -1 depending on convention/choice of loss function) hyper parameters (i.e. weights, biases, etc.) activation functions architecture of the network (arrangement of perceptrons)

9 9 HYPERPARAMETER OPTIMISATION: LOSS FUNCTIONS There are a number of different types of loss (cost) function that are commonly used. These different figures of merit are measures of how well a model performs at ensuring the weights provide a reasonable output response. Loss functions can have additional terms added to them (e.g. weight regularisation for overtraining and for constructing adversarial networks). Three common loss functions are described in the following.

10 10 HYPERPARAMETER OPTIMISATION: LOSS FUNCTIONS L2-norm loss: This is like a χ² term, but without the error normalisation and with a factor of 1/2: $\varepsilon = \sum_{i=1}^{N} \frac{1}{2}(y_i - t_i)^2$ where N is the number of examples, y_i is the model output for the i-th example and t_i is the true target for the i-th example (label value). The L1-norm loss function is as above, but with the absolute difference $|y_i - t_i|$ in place of the factor of 1/2 and the square.

11 11 HYPERPARAMETER OPTIMISATION: LOSS FUNCTIONS Mean Square Error (MSE) loss: Very similar to the L2-norm loss function; just normalise the L2-norm loss by the number of training examples to compute an average: $\varepsilon = \frac{1}{N}\sum_{i=1}^{N}(y_i - t_i)^2$ where N is the number of examples, y_i is the model output for the i-th example and t_i is the true target for the i-th example (label value).

12 12 HYPERPARAMETER OPTIMISATION: LOSS FUNCTIONS Cross Entropy: This loss function is inspired by the likelihood of observing either target value, $P(t|x) = y^t(1-y)^{(1-t)}$. From the likelihood L of observing the training data set we can compute -lnL as $\varepsilon = -\sum_{n=1}^{N}\left[t_n \ln y_n + (1-t_n)\ln(1-y_n)\right]$ where N is the number of examples, t_n is the target type (0 or 1) for the n-th example (label value) and y_n is the output prediction of the model.
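As a concrete reference, the three loss functions above can be written in a few lines of Python; this is a minimal numpy sketch (not taken from the lecture code), with y the array of model outputs and t the array of labels.

```python
import numpy as np

def l2_loss(y, t):
    """L2-norm loss: sum over examples of 0.5*(y_i - t_i)^2."""
    return 0.5 * np.sum((y - t) ** 2)

def mse_loss(y, t):
    """Mean squared error: the L2-norm loss normalised by the number of examples."""
    return np.mean((y - t) ** 2)

def cross_entropy_loss(y, t, eps=1e-12):
    """Binary cross entropy: -sum[t ln(y) + (1-t) ln(1-y)]; eps guards against log(0)."""
    y = np.clip(y, eps, 1.0 - eps)
    return -np.sum(t * np.log(y) + (1.0 - t) * np.log(1.0 - y))
```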

13 13 HYPERPARAMETER OPTIMISATION: GRADIENT DESCENT Newtonian gradient descent is a simple concept. Guess an initial value for the weight parameter: w_0. [Figure: the cost y = ε as a function of the weight w, with the starting point w_0 marked.]

14 14 HYPERPARAMETER OPTIMISATION: GRADIENT DESCENT Newtonian gradient descent is a simple concept. Estimate the gradient at that point (tangent to the curve). [Figure: the cost y = ε as a function of w, with the tangent at w_0 drawn.]

15 15 HYPERPARAMETER OPTIMISATION: GRADIENT DESCENT Newtonian gradient descent is a simple concept. Compute Δw such that Δy is negative (to move toward the minimum): $\Delta y = \frac{dy}{dw}\Delta w$. Choose $\Delta w = -\alpha\frac{dy}{dw}$ to ensure Δy is always negative, where α is the learning rate: a small positive number. [Figure: the cost y = ε as a function of w, with the step away from w_0 indicated.]

16 16 HYPERPARAMETER OPTIMISATION: GRADIENT DESCENT Newtonian gradient descent is a simple concept. Compute a new weight value, $w_1 = w_0 + \Delta w$, and in general $w_{i+1} = w_i + \Delta w = w_i - \alpha\frac{dy}{dw}$, where α is the learning rate: a small positive number. Choosing $\Delta w = -\alpha\frac{dy}{dw}$ ensures Δy is always negative. [Figure: the cost y = ε as a function of w, showing the step from w_0 to w_1.]

17 17 HYPERPARAMETER OPTIMISATION: GRADIENT DESCENT Newtonian gradient descent is a simple concept. Repeat $w_{i+1} = w_i + \Delta w = w_i - \alpha\frac{dy}{dw}$ until some convergence criterion is satisfied. [Figure: the cost y = ε as a function of w, showing the successive steps w_1, w_2, ... approaching the minimum.]
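The loop described on the last few slides is short enough to write out in full. The sketch below, a toy example rather than anything from the lecture, minimises an assumed cost y(w) = (w - 3)^2 whose derivative dy/dw = 2(w - 3) is known analytically.

```python
def gradient_descent(w0, alpha=0.1, tol=1e-6, max_steps=10000):
    """Newtonian gradient descent for the toy cost y(w) = (w - 3)**2."""
    w = w0
    for step in range(max_steps):
        grad = 2.0 * (w - 3.0)        # dy/dw at the current point
        dw = -alpha * grad            # choose Δw = -α dy/dw so that Δy <= 0
        if abs(dw) < tol:             # a simple convergence criterion on the step size
            break
        w = w + dw                    # w_{i+1} = w_i + Δw
    return w, step

w_min, n_steps = gradient_descent(w0=10.0, alpha=0.1)
print(w_min, n_steps)                 # approaches the true minimum at w = 3
```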

18 18 HYPERPARAMETER OPTIMISATION: GRADIENT DESCENT We can extend this from a one parameter optimisation to a 2 parameter one, and follow the same principles, now in 2D. The successive points w i+1 can be visualised a bit like a ball rolling down a concave hill into the region of the minimum.

19 19 HYPERPARAMETER OPTIMISATION: GRADIENT DESCENT In general, for an n-dimensional hyperspace of hyper parameters we can follow the same brute force approach using $\Delta y = \Delta\mathbf{w}\cdot\nabla_{\mathbf{w}}y$, where $\nabla_{\mathbf{w}}y = \left(\frac{\partial y}{\partial w_{1,i}}, \frac{\partial y}{\partial w_{2,i}}, \ldots, \frac{\partial y}{\partial w_{n,i}}\right)$, and the update rule $\mathbf{w}_{i+1} = \mathbf{w}_i + \Delta\mathbf{w} = \mathbf{w}_i - \alpha\nabla_{\mathbf{w}}y$.

20 20 HYPERPARAMETER OPTIMISATION: BACK PROPAGATION The delta-rule or back propagation method is used for NNs to determine the weights; it is based on gradient descent. For a regression problem* we use the L2-norm loss function (with or without the factor of 1/2): $\varepsilon = \sum_{i=1}^{N}\frac{1}{2}(y_i - t_i)^2$, where y_i is the model output for example x_i and t_i is the corresponding label for the i-th example. We can compute the derivative of ε with respect to the weights; the derivatives depend on the activation function(s) in the model y(x). *Either this or cross entropy is used for classification problems.

21 21 HYPERPARAMETER OPTIMISATION: BACK PROPAGATION The parameters w and θ are updated using $w_{r+1} = w_r - \alpha\frac{\partial\varepsilon}{\partial w}$ and $\theta_{r+1} = \theta_r - \alpha\frac{\partial\varepsilon}{\partial\theta}$, where α is the small positive learning rate. The derivatives can be re-written in terms of the errors on the weights, and the errors on w and θ can be related to each other. Back propagation involves a forward pass, where the weights are fixed and the model predictions are made*, followed by a backward pass, where the errors on the bias parameters are computed and used to determine the errors on the weights. These in turn are used to update the HPs from epoch r to epoch r+1. *Weights are randomly initialised for the whole network.
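To make the forward/backward structure concrete, here is a minimal sketch of one back-propagation update for a single sigmoid perceptron trained with the L2 loss. It is an illustration of the chain rule only, under the assumption of one neuron with weight vector w, bias θ, input x and label t; a full network repeats the backward pass layer by layer.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def train_step(w, theta, x, t, alpha=0.1):
    # Forward pass: the weights are held fixed and the prediction is made.
    y = sigmoid(np.dot(w, x) + theta)
    # Backward pass: propagate the error from the output back to the bias and weights.
    delta = (y - t) * y * (1.0 - y)   # dε/dz = dε/dy * dy/dz for the sigmoid + L2 loss
    grad_w = delta * x                # dε/dw
    grad_theta = delta                # dε/dθ
    # Update the HPs from epoch r to epoch r+1.
    return w - alpha * grad_w, theta - alpha * grad_theta

rng = np.random.default_rng(0)
w, theta = rng.normal(size=3), 0.0    # randomly initialised weights
w, theta = train_step(w, theta, x=np.array([0.2, -1.0, 0.5]), t=1.0)
```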

22 22 HYPERPARAMETER OPTIMISATION Consider the example of an MLP used to approximate the function f(x) = x², and how the convergence depends on the learning rate. [Plots: cost versus epochs for α = 0.1 and for a smaller learning rate.] Too large a learning rate and the optimisation does not always lead to an improved cost; the small-change approximation breaks down. A small learning rate means the convergence takes much longer (a ×10 reduction in the learning rate will require a ×10 increase in optimisation steps). This example uses an L2 loss function and is a neural network with 1 input, a single layer of 50 nodes and 1 output; implemented in TensorFlow.

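The network used in these plots is simple enough to reproduce. The following is a minimal sketch in the spirit of the slide's TensorFlow example (1 input, 50 nodes, 1 output, L2/MSE loss); the activation function, epoch count and the smaller learning rate value are assumptions, since they are not shown on the slide.

```python
import numpy as np
import tensorflow as tf

x = np.linspace(-1.0, 1.0, 200).reshape(-1, 1).astype("float32")
y = x ** 2                                      # target function f(x) = x^2

def make_model(alpha):
    """1 input -> 50 hidden nodes -> 1 output, trained by plain SGD with MSE loss."""
    model = tf.keras.Sequential([
        tf.keras.layers.Dense(50, activation="tanh", input_shape=(1,)),
        tf.keras.layers.Dense(1),
    ])
    model.compile(optimizer=tf.keras.optimizers.SGD(learning_rate=alpha), loss="mse")
    return model

# Full-batch training with two different learning rates for comparison.
hist_large = make_model(0.1).fit(x, y, epochs=500, batch_size=len(x), verbose=0)
hist_small = make_model(0.01).fit(x, y, epochs=500, batch_size=len(x), verbose=0)
# Comparing hist_large.history["loss"] with hist_small.history["loss"] shows the
# trade-off between step size and the number of epochs needed to converge.
```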

24 24 HYPERPARAMETER OPTIMISATION: STOCHASTIC LEARNING [1] Data are inherently noisy. Individual training examples can be used to estimate the gradient. Training examples tend to cluster, so processing a batch of training data one example at a time results in sampling the ensemble in such a way as to give faster optimisation performance. Noise in the data can help the optimisation algorithm avoid getting locked into local minima. Often results in better optimisation performance than batch learning. [1] LeCun et al., Efficient BackProp, Neural Networks: Tricks of the Trade, Springer 1998

25 25 HYPERPARAMETER OPTIMISATION: BATCH LEARNING [1] Data are inherently noisy. We can use a sample of training data to estimate the gradient for minimisation (see later) to minimise the effect of this noise. The sample of data is used to obtain a better estimate of the gradient. This is referred to as batch learning. We can use mini-batches of data to speed up optimisation, which is motivated by the observation that for many problems there are clusters of similar training examples. [1] LeCun et al., Efficient BackProp, Neural Networks: Tricks of the Trade, Springer 1998

26 26 HYPERPARAMETER OPTIMISATION: BATCH LEARNING Returning to the example f(x) = x²; optimising on 1/4 of the data at a time (4 batches) leads to accelerated optimisation relative to optimising on all the data each epoch. [Plots: training with all the data versus batch training (4 batches), each with the same learning rate.] This example uses an L2 loss function and is a neural network with 1 input, a single layer of 50 nodes and 1 output; implemented in TensorFlow.
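A minimal sketch of the 4-batch variant, reusing the make_model() factory and the (x, y) arrays from the previous sketch: the data are shuffled and split each epoch, so the weights are updated four times per epoch instead of once. Keras provides the same behaviour directly through the batch_size argument of fit().

```python
import numpy as np

model = make_model(0.1)                          # model factory from the previous sketch
batch_size = len(x) // 4
rng = np.random.default_rng(seed=1)
for epoch in range(500):
    order = rng.permutation(len(x))              # reshuffle the examples each epoch
    for start in range(0, len(x), batch_size):
        idx = order[start:start + batch_size]
        model.train_on_batch(x[idx], y[idx])     # one gradient update per mini-batch

# Equivalent one-liner: make_model(0.1).fit(x, y, epochs=500, batch_size=batch_size)
```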

27 27 GRADIENT DESCENT: REFLECTION For a problem with a parabolic minimum and an appropriate learning rate, α, to fix the step size, we can guarantee convergence to a sensible minimum in some number of steps. If we translate the distribution to a fixed scale, then all of a sudden we can predict how many steps it will take to converge to the minimum from some distance away from it for a given α. If the problem hyperspace is not parabolic, this becomes more complicated. [1] LeCun et al., Efficient BackProp, Neural Networks Tricks of the Trade, Springer 1998

28 28 GRADIENT DESCENT: REFLECTION Based on the underlying nature of the gradient descent optimisation algorithm family, being derived to optimise a parabolic distribution, ideally we want to try and standardise the input distributions to a neural network. Use a unit Gaussian as a standard, e.g. map x to x' = (x-μ)/σ, or scale x to the range [-1, 1]. The transformed data inputs will be scale invariant in the sense that HPs such as the learning rate will become general, rather than problem (and therefore scale) dependent. If we don't do this the optimisation algorithm will work, but it may take longer to converge to the minimum, and could be more susceptible to divergent behaviour. [1] LeCun et al., Efficient BackProp, Neural Networks: Tricks of the Trade, Springer 1998
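A minimal sketch of the standardisation step, assuming the feature matrices are numpy arrays; the means and widths must be computed on the training data only and then reused for validation and unseen data.

```python
import numpy as np

def standardise(x_train, x_other):
    """Map each feature to (x - mu)/sigma using the training-sample mu and sigma."""
    mu = x_train.mean(axis=0)
    sigma = x_train.std(axis=0)
    return (x_train - mu) / sigma, (x_other - mu) / sigma
```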

29 29 OVERTRAINING Data are noisy. Optimisation can result in learning the noise in the training data. Overtraining is the term given to learning the noise, and this can be mitigated in a number of different ways: using more training data (not always possible); checking against different data sets to identify the onset of learning noise; changing the network configuration when training (dropout); weight regularisation (large weights are penalised in the cost). None of these methods guarantees that you avoid overtraining.

30 30 OVERTRAINING A model is over fitted if the HPs that have been determined are tuned to the statistical fluctuations in the data set. Simple illustration of the problem: 30 training examples The decision boundary selected here does a good job of separating the red and blue dots. Boundaries like this can be obtained by training models on limited data samples. The accuracies can be impressive. But would the performance be as good with a new, or a larger data sample?

31 31 OVERTRAINING A model is over fitted if the HPs that have been determined are tuned to the statistical fluctuations in the data set. Simple illustration of the problem: 30 training examples versus 1000 training examples. Increasing to 1000 training examples we can see the boundary doesn't do as well. This illustrates the kind of problem encountered when we overfit HPs of a model.

32 32 OVERTRAINING: TRAINING VALIDATION One way to avoid tuning to statistical fluctuations in the data is to impose a training convergence criterion based on a data sample independent from the training set: a validation sample. Use the cost evaluated for the training and validation samples to check whether the HPs are overtrained. If both samples have similar cost then the model response function is similar on two statistically independent samples. If the samples are large enough then one could reasonably assume that the response function would then be general when applied to an unseen data sample. "Large enough" is a model- and problem-dependent constraint.

33 33 OVERTRAINING: TRAINING VALIDATION Training convergence criteria that could be used: terminate training after N epochs; cost comparison: evaluate the performance on the training and validation sets, compare the two and place some threshold on the difference, Δcost < δcost; terminate the training when the gradient of the cost function with respect to the weights is below some threshold; or terminate the training when the Δcost starts to increase for the validation sample. One way to implement the last two criteria is sketched below.
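This is a minimal sketch using the Keras EarlyStopping callback; the thresholds (min_delta, patience) and the variable names x_train, y_train, x_val, y_val are illustrative assumptions.

```python
import tensorflow as tf

early_stop = tf.keras.callbacks.EarlyStopping(
    monitor="val_loss",         # watch the cost evaluated on the validation sample
    min_delta=1e-4,             # threshold on the improvement in cost
    patience=10,                # tolerate a few epochs without improvement
    restore_best_weights=True,  # roll back to the epoch with the best validation cost
)
# model.fit(x_train, y_train, validation_data=(x_val, y_val),
#           epochs=1000, callbacks=[early_stop])
```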

34 34 OVERTRAINING: TRAINING VALIDATION: EXAMPLE HIGGS KAGGLE DATA This example shows the Higgs Kaggle (H→ττ) data with an overtrained neural network. The hyper parameters are not optimal, but the test and train samples give similar performance. The test sample (green) has a consistently high cost value relative to the train sample (blue). 65% accuracy is attained before overtraining starts. [Plots: overtrained network.] This example uses all features of the data set, but only 2000 test/train events, with a learning rate of , and dropout is not being used. The network has a single layer with 256 nodes and a single output to classify if a given event is signal or background. There is no batch training used for this example. See for example code for this problem.

35 35 OVERTRAINING: TRAINING VALIDATION: EXAMPLE HIGGS KAGGLE DATA Continuing to train an over-trained network does not resolve the issue - the network remains over-trained. 65% accuracy is attained before overtraining starts. [Plots: overtrained network.] In this case the level of overtraining continues to increase and the difference in the performance of the train and test samples in terms of accuracy increases. After training cycles we have an accuracy of over 93% for the train sample, but less than 70% for the test sample. This is not a good configuration to use on unseen data as the outcome is unpredictable. For this example we should terminate after about 200 epochs, while the test/train performance remains similar, with an accuracy of 70%. Other network configurations may be better. See for example code for this problem.

36 36 OVERTRAINING: DROPOUT FOR DEEP NETWORKS A pragmatic way to mitigate overfitting is to compromise the model randomly in different epochs of the training by removing units from the network. Dropout is used during training; when evaluating predictions with the validation or unseen data the full network of Fig. (a) is used. That way the whole model will be effectively trained on a sub-sample of the data in the hope that the effect of statistical fluctuations will be limited. This does not remove the possibility that a model is overtrained; as with the previous discussion, HP generalisation is promoted by using this method. Srivastava et al., J. Machine Learning Research 15 (2014)

37 37 OVERTRAINING: DROPOUT FOR DEEP NETWORKS A variety of architectures has been explored with different training samples (see Ref. [1] for details). Dropout can be detrimental for small training samples; however, in general the results show that dropout is beneficial. For deep networks with typical training samples of O(500) examples or more this technique is expected to be beneficial. [1] Srivastava et al., J. Machine Learning Research 15 (2014)

38 38 OVERTRAINING: DROPOUT: EXAMPLE HIGGS KAGGLE DATA Changing the dropout keep probability from 0.9 to 0.7 stops the network becoming overtrained in the first 1000 epochs. ~73% accuracy is attained without overtraining for 1000 epochs, with a keep probability of 0.7 and a learning rate of . A better accuracy is attained for this network using dropout: above 70%. See for example code for this problem.
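A minimal sketch of a single-hidden-layer classifier like the one used in these Kaggle examples, with dropout added. Note that the Keras Dropout layer takes the fraction of units to drop, so a keep probability of 0.7 corresponds to Dropout(0.3); the input dimension and optimiser choice here are assumptions.

```python
import tensorflow as tf

n_features = 30                                       # assumed input dimension
model = tf.keras.Sequential([
    tf.keras.layers.Dense(256, activation="relu", input_shape=(n_features,)),
    tf.keras.layers.Dropout(0.3),                     # keep probability of 0.7
    tf.keras.layers.Dense(1, activation="sigmoid"),   # signal/background output
])
model.compile(optimizer="adam", loss="binary_crossentropy", metrics=["accuracy"])
# Dropout is active only during fit(); predictions are made with the full network.
```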

39 39 OVERTRAINING: WEIGHT REGULARISATION FOR NEURAL NETWORKS Weight regularisation involves adding a penalty term to the loss function used to optimise the HPs of a network. This term is based on the sum of the weights w_i (including bias parameters) in the network and takes the form $\lambda\sum_{i\,\in\,\mathrm{weights}} |w_i|$. This is the L1-norm regularisation term. The rationale is to add an additional cost term to the optimisation coming from the complexity of the network. The performance of the network will vary as a function of λ. To optimise a network using weight regularisation it will have to be trained a number of times in order to identify the value of λ corresponding to the min(cost) from the set of trained solutions. See Ch. 9 of Bishop's Neural Networks for Pattern Recognition; Loshchilov and Hutter, arXiv:

40 40 OVERTRAINING: WEIGHT REGULARISATION FOR NEURAL NETWORKS Weight regularisation involves adding a penalty term to the loss function used to optimise the HPs of a network. This term is based on the sum of the squared weights w_i (including bias parameters) in the network and takes the form $\lambda\sum_{i\,\in\,\mathrm{weights}} w_i^2$. This variant is the L2-norm regularisation term, also known as weight decay regularisation. The rationale is to add an additional cost term to the optimisation coming from the complexity of the network. The performance of the network will vary as a function of λ. To optimise a network using weight regularisation it will have to be trained a number of times in order to identify the value of λ corresponding to the min(cost) from the set of trained solutions. See Ch. 9 of Bishop's Neural Networks for Pattern Recognition; Loshchilov and Hutter, arXiv:

41 41 OVERTRAINING: WEIGHT REGULARISATION FOR NEURAL NETWORKS For example we can consider extending an MSE cost function to allow for weight regularisation. The MSE cost is given by $\varepsilon = \frac{1}{N}\sum_{i=1}^{N}(y_i - t_i)^2$. To allow for regularisation we add the sum-of-squared-weights term: $\varepsilon = \frac{1}{N}\sum_{i=1}^{N}(y_i - t_i)^2 + \lambda\sum_{i\,\in\,\mathrm{weights}} w_i^2$. This is a simple modification to make to the NN training process that adds a penalty for the inclusion of non-zero weights in the network. See Ch. 9 of Bishop's Neural Networks for Pattern Recognition; Loshchilov and Hutter, arXiv:
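A minimal sketch of this regularised cost in numpy, with the network weights collected in a list of arrays; in Keras the same effect is obtained per layer through a kernel_regularizer. The variable names here are assumptions.

```python
import numpy as np

def regularised_mse(y, t, weights, lam):
    """MSE cost plus lambda times the sum of squared weights."""
    mse = np.mean((y - t) ** 2)
    penalty = lam * sum(np.sum(w ** 2) for w in weights)
    return mse + penalty

# Keras equivalent (per layer):
# tf.keras.layers.Dense(256, kernel_regularizer=tf.keras.regularizers.l2(lam))
```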

42 42 OVERTRAINING: WEIGHT REGULARISATION FOR NEURAL NETWORKS The optimisation process needs to be run for each value of λ in order to choose the best (least cost) solution that is not overtrained. This means we have to train the network many times, with different values of λ, when using regularisation. e.g. for the Kaggle example with 1000 epochs we have: [Table: λ versus cost (train / test) and accuracy (train / test), with the cost of the weights increasing down the table; the numerical values are not reproduced here. The no-regularisation entry is marked as having the best cost, and the best performance for this sampling of λ is marked, with similar outputs obtained for a range of trainings.] See Ch. 9 of Bishop's Neural Networks for Pattern Recognition; Loshchilov and Hutter, arXiv:

43 43 OVERTRAINING: WEIGHT REGULARISATION FOR NEURAL NETWORKS The optimisation process needs to be run for each value of λ in order to choose the best (least cost) solution that is not overtrained. This means we have to train the network many times, with different values of λ, when using regularisation. e.g. for the Kaggle example with 1000 epochs we have: [Plot: cost and accuracy versus λ, where the cost includes $\lambda\sum_{i\,\in\,\mathrm{weights}} w_i^2$.] Accuracy falls off as λ increases, but there is little dependence for this example. See Ch. 9 of Bishop's Neural Networks for Pattern Recognition; Loshchilov and Hutter, arXiv:

44 44 OVER FITTING: CROSS VALIDATION An alternative way of thinking about the problem is to assume that the response function of the model will have some bias and some variance. The bias will be irreducible and mean that the predictions made will have some systematic effect related to the average output value. The variance will depend on the size of the training sample, and we would like to know what this is. We can use cross validation to estimate the prediction error. Any prediction bias can be measured using control samples. Geisser, S. (1975). The predictive sample reuse method with applications. J. Amer. Statist. Assoc., 70: For a review of cross validation see: S. Arlot and A. Celisse, Statistics Surveys Vol (2010).

45 45 OVER FITTING: CROSS VALIDATION Application of this concept to machine learning can be seen via k-fold cross validation and its variants*. Divide the data sample for training and validation into k equal sub-samples. From these one can prepare k sets of validation samples and residual training samples. Each set uses all examples, but the training and validation sub-sets are distinct. [Figure: k folds of the data, each with a different sub-sample held out for validation.] *Variants include the extremes of leave-1-out CV and hold-out CV as well as leave-p-out CV. These involve reserving 1 example, 50% of examples and p examples for testing, and the remainder of the data for training, respectively. Geisser, S. (1975). The predictive sample reuse method with applications. J. Amer. Statist. Assoc., 70: For a review of cross validation see: S. Arlot and A. Celisse, Statistics Surveys Vol (2010).

46 46 OVER FITTING: CROSS VALIDATION Application of this concept to machine learning can be seen via k-fold cross validation and its variants*. One can then train the model on each of the k training sets, validating the performance of the network on the corresponding validation set. We can measure the model error on each validation set. The model prediction error is the average error obtained from the k folds on their corresponding validation sets (a sketch of this procedure is given below). If desired we could combine the k models to compute some average that would have a prediction performance that is more robust than any of the individual folds. *Variants include the extremes of leave-1-out CV and hold-out CV as well as leave-p-out CV. These involve reserving 1 example, 50% of examples and p examples for testing, and the remainder of the data for training, respectively. Geisser, S. (1975). The predictive sample reuse method with applications. J. Amer. Statist. Assoc., 70: For a review of cross validation see: S. Arlot and A. Celisse, Statistics Surveys Vol (2010).
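A minimal sketch of k-fold cross validation using scikit-learn to produce the folds; build_model() is an assumed factory returning a freshly compiled network (loss only, no extra metrics), and X, y are the feature and label arrays. The returned mean is the cross-validated estimate of the prediction error.

```python
import numpy as np
from sklearn.model_selection import KFold

def kfold_error(build_model, X, y, k=5):
    """Average validation cost over k folds."""
    fold_costs = []
    for train_idx, val_idx in KFold(n_splits=k, shuffle=True, random_state=0).split(X):
        model = build_model()                              # fresh model for each fold
        model.fit(X[train_idx], y[train_idx], epochs=200, verbose=0)
        fold_costs.append(model.evaluate(X[val_idx], y[val_idx], verbose=0))
    return np.mean(fold_costs), np.std(fold_costs)
```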

47 47 OVER FITTING: CROSS VALIDATION Application of this concept to machine learning can be seen via k-fold cross validation and its variants*. The ensemble of response function outputs will vary, in analogy with the spread of a Gaussian distribution. This results in a family of ROC curves, with a representative performance that is neither the best nor the worst ROC. The example shown is for a Support Vector Machine, with the best, average and holdout ROC curves to indicate some sense of spread. [Figure: ROC curves of background rejection (1-eff) versus signal efficiency (true positive rate) for SVM_Best, SVM_Average and SVM_Holdout_RBF.] *Variants include the extremes of leave-1-out CV and hold-out CV as well as leave-p-out CV. These involve reserving 1 example, 50% of examples and p examples for testing, and the remainder of the data for training, respectively. Geisser, S. (1975). The predictive sample reuse method with applications. J. Amer. Statist. Assoc., 70: For a review of cross validation see: S. Arlot and A. Celisse, Statistics Surveys Vol (2010).

48 48 SUMMARY Hyper-parameter optimisation does not guarantee a robust solution. Minimisation of different cost functions will allow us to determine optimal hyper-parameters for our model. When the model starts to learn statistical fluctuations in the training sample we should stop the training to avoid overtraining (i.e. to avoid learning the statistical fluctuations in a given sample). Methods exist to mitigate overtraining [dropout, regularisation, cross validation]; however, none of them guarantees a generalised solution that is robust against overtraining. The only exception to this is to supply an effectively infinite sample of training data for a given problem.

49 49 SUGGESTED READING (HEP RELATED) There is not much HEP-specific literature regarding HP optimisation. Minuit is used to optimise parameters in TMVA, along with other algorithms; see the Minuit user guide and TMVA documentation for more details. Minuit is based on the DFP method of variable metric minimisation. Logarithmic grid searches are also discussed for specific algorithms with a few HPs (see the SVM notes).

50 50 SUGGESTED READING (NON-HEP) There are a few backup slides on the ADAM optimiser, which performs well on a number of deep learning problems, that you might wish to look at. In addition, the following books discuss the issue of parameter optimisation: Bishop, Neural Networks for Pattern Recognition, Oxford University Press (2013); Cristianini and Shawe-Taylor, Support Vector Machines and other kernel-based learning methods, Cambridge University Press (2014); Goodfellow, Deep Learning, MIT Press (2016); MacKay's Information Theory, Inference and Learning Algorithms, Cambridge University Press (2011).

51 51 APPENDIX The following slides contain additional information that may be of interest.

52 52 MULTIPLE MINIMA Often more complicated hyperspace optimisation problems are encountered, where there are multiple minima. The gradient descent minimisation algorithm is based on the assumption that there is a single minimum to be found. In reality there are often multiple minima. Sometimes the minima are degenerate, or near degenerate. How do we know we have converged on the global minimum? [Figure: a cost surface with one of several minima indicated.]

53 53 MULTIPLE MINIMA Often more complicated hyperspace optimisation problems are encountered, where there are multiple minima. The gradient descent minimisation algorithm is based on the assumption that there is a single minimum to be found. In reality there are often multiple minima. Sometimes the minima are degenerate, or near degenerate. How do we know we have converged on the global minimum? [Figure: the same cost surface with the global minimum indicated.]

54 54 GRID SEARCHES Just as we can scan through a parameter in order to minimise a likelihood ratio, we can scan through a HP to observe how the loss function changes. For simple models we can construct a 2D grid of points in m and c. Evaluating the loss function for each point in the 2D sample space we can construct a grid from which to select the minimum value. The assumption here is that our grid spacing is sufficient for the purpose of optimising the problem. This type of parameter search is often used for support vector machine HPs (kernel function parameters and cost). This method does not scale to large numbers of parameters.

55 55 GRID SEARCHES e.g. consider a linear regression study optimising the parameters for the model y = mx + c. The loss function for this problem results in a valley, as m and c are anticorrelated parameters in this 2D hyperspace. The contours of the loss function show a minimum, but this is selected from a discrete grid of points (need to ensure the grid spacing is sufficient for your needs). [Plots: the loss L as a function of m and c, shown as a surface and as contours.]
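A minimal sketch of such a 2D grid search for y = mx + c with an MSE loss; the toy data, grid ranges and spacing are illustrative assumptions, and the precision of the result is limited by the grid spacing.

```python
import numpy as np

x_data = np.linspace(0.0, 1.0, 50)
y_data = 2.0 * x_data + 0.5                         # toy data generated with m=2, c=0.5

m_grid = np.linspace(-5.0, 5.0, 101)
c_grid = np.linspace(-5.0, 5.0, 101)
M, C = np.meshgrid(m_grid, c_grid)

# MSE loss at every (m, c) grid point, averaged over the data.
loss = ((M[..., None] * x_data + C[..., None] - y_data) ** 2).mean(axis=-1)

best = np.unravel_index(np.argmin(loss), loss.shape)
print("best grid point: m =", M[best], ", c =", C[best])
```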

56 56 ADAM OPTIMISER This is a stochastic gradient descent algorithm. Consider a model f(θ) that is differentiable with respect to the HPs θ, so that the gradient $g_t = \nabla_\theta f_t(\theta_{t-1})$ can be computed. Here t is the training epoch; m_t and v_t are biased estimates of the first and second moments of the gradient; $\hat{m}_t$ and $\hat{v}_t$ are the bias-corrected estimators of the moments. Some initial guess for the HPs is taken, θ_0, and the HPs for a given epoch are denoted by θ_t. α is the step size; β_1 and β_2 are the exponential decay rates of the moving averages. Kingma and Ba, Proc. 3rd Int. Conf. Learning Representations, San Diego, 2015

57 57 ADAM OPTIMISER ADAptive Moment estimation based on gradient descent. Kingma and Ba, Proc. 3rd Int Conf. Learning Representations, San Diego, 2015
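The algorithm box from the original slide is not reproduced here; as a substitute, the following is a minimal sketch of the Adam update for a single parameter, following the moment estimates described on the previous slide. The gradient function and the toy cost are assumptions; the default α, β1, β2 and ε values follow the Adam paper.

```python
import numpy as np

def adam(grad, theta0, alpha=0.001, beta1=0.9, beta2=0.999, eps=1e-8, steps=1000):
    theta, m, v = theta0, 0.0, 0.0
    for t in range(1, steps + 1):
        g = grad(theta)                           # g_t: gradient of f_t at theta_{t-1}
        m = beta1 * m + (1.0 - beta1) * g         # biased first-moment estimate m_t
        v = beta2 * v + (1.0 - beta2) * g * g     # biased second-moment estimate v_t
        m_hat = m / (1.0 - beta1 ** t)            # bias-corrected moments
        v_hat = v / (1.0 - beta2 ** t)
        theta -= alpha * m_hat / (np.sqrt(v_hat) + eps)
    return theta

# Toy use: minimise (theta - 3)^2, whose gradient is 2*(theta - 3).
print(adam(lambda th: 2.0 * (th - 3.0), theta0=0.0, alpha=0.1))   # close to 3
```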

58 58 ADAM OPTIMISER Benchmarking performance using MNIST and CIFAR-10 data indicates that Adam with dropout minimises the loss function compared with the other optimisers tested. A faster drop-off in cost, and a lower overall cost, is obtained. Kingma and Ba, Proc. 3rd Int. Conf. Learning Representations, San Diego, 2015

59 59 DAVIDON-FLETCHER-POWELL This variable metric minimisation algorithm (1959, 1963) is described on Wikipedia as: "The DFP formula is quite effective, but it was soon superseded by the BFGS formula, which is its dual (interchanging the roles of y and s)." It is used in high energy physics via the MINUIT (FORTRAN) and Minuit2 (C++) implementations. The standard tools used for data analysis in HEP have these implementations available, and while the algorithm may no longer be optimal, it is still deemed good enough by many for the optimisation tasks: robust / reliable minimisation; underlying method derived assuming parabolic minima; understood and trusted by the HEP community.

60 60 DAVIDON-FLETCHER-POWELL This variable metric minimisation algorithm (1959, 1963) is described on Wikipedia as: "The DFP formula is quite effective, but it was soon superseded by the BFGS formula, which is its dual (interchanging the roles of y and s)." It is used in high energy physics via the MINUIT (FORTRAN) and Minuit2 (C++) implementations. This is mentioned only because it is the dominant algorithm used in particle physics at this time. Almost all minimisation problems are solved using this algorithm, where the dominant use is for maximum likelihood and χ² fit minimisation problems. The number of HPs required to solve those problems is small (up to a few hundred) in comparison with the numbers required for neural networks (esp. deep learning problems). If you don't (intend to) work in this field, you can now forget you heard about this algorithm.


The exam is closed book, closed notes except your one-page (two-sided) cheat sheet.

The exam is closed book, closed notes except your one-page (two-sided) cheat sheet. CS 189 Spring 2015 Introduction to Machine Learning Final You have 2 hours 50 minutes for the exam. The exam is closed book, closed notes except your one-page (two-sided) cheat sheet. No calculators or

More information

CS6375: Machine Learning Gautam Kunapuli. Mid-Term Review

CS6375: Machine Learning Gautam Kunapuli. Mid-Term Review Gautam Kunapuli Machine Learning Data is identically and independently distributed Goal is to learn a function that maps to Data is generated using an unknown function Learn a hypothesis that minimizes

More information

Evaluating Classifiers

Evaluating Classifiers Evaluating Classifiers Reading for this topic: T. Fawcett, An introduction to ROC analysis, Sections 1-4, 7 (linked from class website) Evaluating Classifiers What we want: Classifier that best predicts

More information

Deep Generative Models Variational Autoencoders

Deep Generative Models Variational Autoencoders Deep Generative Models Variational Autoencoders Sudeshna Sarkar 5 April 2017 Generative Nets Generative models that represent probability distributions over multiple variables in some way. Directed Generative

More information

Machine Learning. MGS Lecture 3: Deep Learning

Machine Learning. MGS Lecture 3: Deep Learning Dr Michel F. Valstar http://cs.nott.ac.uk/~mfv/ Machine Learning MGS Lecture 3: Deep Learning Dr Michel F. Valstar http://cs.nott.ac.uk/~mfv/ WHAT IS DEEP LEARNING? Shallow network: Only one hidden layer

More information

Cost Functions in Machine Learning

Cost Functions in Machine Learning Cost Functions in Machine Learning Kevin Swingler Motivation Given some data that reflects measurements from the environment We want to build a model that reflects certain statistics about that data Something

More information

CSE Data Mining Concepts and Techniques STATISTICAL METHODS (REGRESSION) Professor- Anita Wasilewska. Team 13

CSE Data Mining Concepts and Techniques STATISTICAL METHODS (REGRESSION) Professor- Anita Wasilewska. Team 13 CSE 634 - Data Mining Concepts and Techniques STATISTICAL METHODS Professor- Anita Wasilewska (REGRESSION) Team 13 Contents Linear Regression Logistic Regression Bias and Variance in Regression Model Fit

More information

Practical Methodology. Lecture slides for Chapter 11 of Deep Learning Ian Goodfellow

Practical Methodology. Lecture slides for Chapter 11 of Deep Learning  Ian Goodfellow Practical Methodology Lecture slides for Chapter 11 of Deep Learning www.deeplearningbook.org Ian Goodfellow 2016-09-26 What drives success in ML? Arcane knowledge of dozens of obscure algorithms? Mountains

More information

What is machine learning?

What is machine learning? Machine learning, pattern recognition and statistical data modelling Lecture 12. The last lecture Coryn Bailer-Jones 1 What is machine learning? Data description and interpretation finding simpler relationship

More information

Machine Learning for NLP

Machine Learning for NLP Machine Learning for NLP Support Vector Machines Aurélie Herbelot 2018 Centre for Mind/Brain Sciences University of Trento 1 Support Vector Machines: introduction 2 Support Vector Machines (SVMs) SVMs

More information

Machine Learning. The Breadth of ML Neural Networks & Deep Learning. Marc Toussaint. Duy Nguyen-Tuong. University of Stuttgart

Machine Learning. The Breadth of ML Neural Networks & Deep Learning. Marc Toussaint. Duy Nguyen-Tuong. University of Stuttgart Machine Learning The Breadth of ML Neural Networks & Deep Learning Marc Toussaint University of Stuttgart Duy Nguyen-Tuong Bosch Center for Artificial Intelligence Summer 2017 Neural Networks Consider

More information

Model Generalization and the Bias-Variance Trade-Off

Model Generalization and the Bias-Variance Trade-Off Charu C. Aggarwal IBM T J Watson Research Center Yorktown Heights, NY Model Generalization and the Bias-Variance Trade-Off Neural Networks and Deep Learning, Springer, 2018 Chapter 4, Section 4.1-4.2 What

More information

Theoretical Concepts of Machine Learning

Theoretical Concepts of Machine Learning Theoretical Concepts of Machine Learning Part 2 Institute of Bioinformatics Johannes Kepler University, Linz, Austria Outline 1 Introduction 2 Generalization Error 3 Maximum Likelihood 4 Noise Models 5

More information

Neural Networks. Single-layer neural network. CSE 446: Machine Learning Emily Fox University of Washington March 10, /10/2017

Neural Networks. Single-layer neural network. CSE 446: Machine Learning Emily Fox University of Washington March 10, /10/2017 3/0/207 Neural Networks Emily Fox University of Washington March 0, 207 Slides adapted from Ali Farhadi (via Carlos Guestrin and Luke Zettlemoyer) Single-layer neural network 3/0/207 Perceptron as a neural

More information

Gradient Descent. Wed Sept 20th, James McInenrey Adapted from slides by Francisco J. R. Ruiz

Gradient Descent. Wed Sept 20th, James McInenrey Adapted from slides by Francisco J. R. Ruiz Gradient Descent Wed Sept 20th, 2017 James McInenrey Adapted from slides by Francisco J. R. Ruiz Housekeeping A few clarifications of and adjustments to the course schedule: No more breaks at the midpoint

More information

Support Vector Machines

Support Vector Machines Support Vector Machines Chapter 9 Chapter 9 1 / 50 1 91 Maximal margin classifier 2 92 Support vector classifiers 3 93 Support vector machines 4 94 SVMs with more than two classes 5 95 Relationshiop to

More information

Gradient Descent Optimization Algorithms for Deep Learning Batch gradient descent Stochastic gradient descent Mini-batch gradient descent

Gradient Descent Optimization Algorithms for Deep Learning Batch gradient descent Stochastic gradient descent Mini-batch gradient descent Gradient Descent Optimization Algorithms for Deep Learning Batch gradient descent Stochastic gradient descent Mini-batch gradient descent Slide credit: http://sebastianruder.com/optimizing-gradient-descent/index.html#batchgradientdescent

More information

CS570: Introduction to Data Mining

CS570: Introduction to Data Mining CS570: Introduction to Data Mining Classification Advanced Reading: Chapter 8 & 9 Han, Chapters 4 & 5 Tan Anca Doloc-Mihu, Ph.D. Slides courtesy of Li Xiong, Ph.D., 2011 Han, Kamber & Pei. Data Mining.

More information

Classification Lecture Notes cse352. Neural Networks. Professor Anita Wasilewska

Classification Lecture Notes cse352. Neural Networks. Professor Anita Wasilewska Classification Lecture Notes cse352 Neural Networks Professor Anita Wasilewska Neural Networks Classification Introduction INPUT: classification data, i.e. it contains an classification (class) attribute

More information

Lecture #11: The Perceptron

Lecture #11: The Perceptron Lecture #11: The Perceptron Mat Kallada STAT2450 - Introduction to Data Mining Outline for Today Welcome back! Assignment 3 The Perceptron Learning Method Perceptron Learning Rule Assignment 3 Will be

More information

Conflict Graphs for Parallel Stochastic Gradient Descent

Conflict Graphs for Parallel Stochastic Gradient Descent Conflict Graphs for Parallel Stochastic Gradient Descent Darshan Thaker*, Guneet Singh Dhillon* Abstract We present various methods for inducing a conflict graph in order to effectively parallelize Pegasos.

More information

Facial Expression Classification with Random Filters Feature Extraction

Facial Expression Classification with Random Filters Feature Extraction Facial Expression Classification with Random Filters Feature Extraction Mengye Ren Facial Monkey mren@cs.toronto.edu Zhi Hao Luo It s Me lzh@cs.toronto.edu I. ABSTRACT In our work, we attempted to tackle

More information

Deep Learning. Volker Tresp Summer 2014

Deep Learning. Volker Tresp Summer 2014 Deep Learning Volker Tresp Summer 2014 1 Neural Network Winter and Revival While Machine Learning was flourishing, there was a Neural Network winter (late 1990 s until late 2000 s) Around 2010 there

More information

IST 597 Foundations of Deep Learning Fall 2018 Homework 1: Regression & Gradient Descent

IST 597 Foundations of Deep Learning Fall 2018 Homework 1: Regression & Gradient Descent IST 597 Foundations of Deep Learning Fall 2018 Homework 1: Regression & Gradient Descent This assignment is worth 15% of your grade for this class. 1 Introduction Before starting your first assignment,

More information

Deep Learning. Vladimir Golkov Technical University of Munich Computer Vision Group

Deep Learning. Vladimir Golkov Technical University of Munich Computer Vision Group Deep Learning Vladimir Golkov Technical University of Munich Computer Vision Group 1D Input, 1D Output target input 2 2D Input, 1D Output: Data Distribution Complexity Imagine many dimensions (data occupies

More information

Multivariate Data Analysis and Machine Learning in High Energy Physics (V)

Multivariate Data Analysis and Machine Learning in High Energy Physics (V) Multivariate Data Analysis and Machine Learning in High Energy Physics (V) Helge Voss (MPI K, Heidelberg) Graduierten-Kolleg, Freiburg, 11.5-15.5, 2009 Outline last lecture Rule Fitting Support Vector

More information

Ensemble methods in machine learning. Example. Neural networks. Neural networks

Ensemble methods in machine learning. Example. Neural networks. Neural networks Ensemble methods in machine learning Bootstrap aggregating (bagging) train an ensemble of models based on randomly resampled versions of the training set, then take a majority vote Example What if you

More information

CPSC 340: Machine Learning and Data Mining. Deep Learning Fall 2016

CPSC 340: Machine Learning and Data Mining. Deep Learning Fall 2016 CPSC 340: Machine Learning and Data Mining Deep Learning Fall 2016 Assignment 5: Due Friday. Assignment 6: Due next Friday. Final: Admin December 12 (8:30am HEBB 100) Covers Assignments 1-6. Final from

More information

Supervised Learning with Neural Networks. We now look at how an agent might learn to solve a general problem by seeing examples.

Supervised Learning with Neural Networks. We now look at how an agent might learn to solve a general problem by seeing examples. Supervised Learning with Neural Networks We now look at how an agent might learn to solve a general problem by seeing examples. Aims: to present an outline of supervised learning as part of AI; to introduce

More information

More on Neural Networks. Read Chapter 5 in the text by Bishop, except omit Sections 5.3.3, 5.3.4, 5.4, 5.5.4, 5.5.5, 5.5.6, 5.5.7, and 5.

More on Neural Networks. Read Chapter 5 in the text by Bishop, except omit Sections 5.3.3, 5.3.4, 5.4, 5.5.4, 5.5.5, 5.5.6, 5.5.7, and 5. More on Neural Networks Read Chapter 5 in the text by Bishop, except omit Sections 5.3.3, 5.3.4, 5.4, 5.5.4, 5.5.5, 5.5.6, 5.5.7, and 5.6 Recall the MLP Training Example From Last Lecture log likelihood

More information

5 Learning hypothesis classes (16 points)

5 Learning hypothesis classes (16 points) 5 Learning hypothesis classes (16 points) Consider a classification problem with two real valued inputs. For each of the following algorithms, specify all of the separators below that it could have generated

More information

Efficient Tuning of SVM Hyperparameters Using Radius/Margin Bound and Iterative Algorithms

Efficient Tuning of SVM Hyperparameters Using Radius/Margin Bound and Iterative Algorithms IEEE TRANSACTIONS ON NEURAL NETWORKS, VOL. 13, NO. 5, SEPTEMBER 2002 1225 Efficient Tuning of SVM Hyperparameters Using Radius/Margin Bound and Iterative Algorithms S. Sathiya Keerthi Abstract This paper

More information

Early Stopping. Sargur N. Srihari

Early Stopping. Sargur N. Srihari Early Stopping Sargur N. srihari@cedar.buffalo.edu This is part of lecture slides on Deep Learning: http://www.cedar.buffalo.edu/~srihari/cse676 1 Regularization Strategies 1. Parameter Norm Penalties

More information

CS535 Big Data Fall 2017 Colorado State University 10/10/2017 Sangmi Lee Pallickara Week 8- A.

CS535 Big Data Fall 2017 Colorado State University   10/10/2017 Sangmi Lee Pallickara Week 8- A. CS535 Big Data - Fall 2017 Week 8-A-1 CS535 BIG DATA FAQs Term project proposal New deadline: Tomorrow PA1 demo PART 1. BATCH COMPUTING MODELS FOR BIG DATA ANALYTICS 5. ADVANCED DATA ANALYTICS WITH APACHE

More information

Support Vector Machines

Support Vector Machines Support Vector Machines RBF-networks Support Vector Machines Good Decision Boundary Optimization Problem Soft margin Hyperplane Non-linear Decision Boundary Kernel-Trick Approximation Accurancy Overtraining

More information

Perceptron Introduction to Machine Learning. Matt Gormley Lecture 5 Jan. 31, 2018

Perceptron Introduction to Machine Learning. Matt Gormley Lecture 5 Jan. 31, 2018 10-601 Introduction to Machine Learning Machine Learning Department School of Computer Science Carnegie Mellon University Perceptron Matt Gormley Lecture 5 Jan. 31, 2018 1 Q&A Q: We pick the best hyperparameters

More information

Machine Learning Lecture 3

Machine Learning Lecture 3 Machine Learning Lecture 3 Probability Density Estimation II 19.10.2017 Bastian Leibe RWTH Aachen http://www.vision.rwth-aachen.de leibe@vision.rwth-aachen.de Announcements Exam dates We re in the process

More information

Learning from Data: Adaptive Basis Functions

Learning from Data: Adaptive Basis Functions Learning from Data: Adaptive Basis Functions November 21, 2005 http://www.anc.ed.ac.uk/ amos/lfd/ Neural Networks Hidden to output layer - a linear parameter model But adapt the features of the model.

More information

Partitioning Data. IRDS: Evaluation, Debugging, and Diagnostics. Cross-Validation. Cross-Validation for parameter tuning

Partitioning Data. IRDS: Evaluation, Debugging, and Diagnostics. Cross-Validation. Cross-Validation for parameter tuning Partitioning Data IRDS: Evaluation, Debugging, and Diagnostics Charles Sutton University of Edinburgh Training Validation Test Training : Running learning algorithms Validation : Tuning parameters of learning

More information

Neural Networks. Theory And Practice. Marco Del Vecchio 19/07/2017. Warwick Manufacturing Group University of Warwick

Neural Networks. Theory And Practice. Marco Del Vecchio 19/07/2017. Warwick Manufacturing Group University of Warwick Neural Networks Theory And Practice Marco Del Vecchio marco@delvecchiomarco.com Warwick Manufacturing Group University of Warwick 19/07/2017 Outline I 1 Introduction 2 Linear Regression Models 3 Linear

More information

A very different type of Ansatz - Decision Trees

A very different type of Ansatz - Decision Trees A very different type of Ansatz - Decision Trees A Decision Tree encodes sequential rectangular cuts But with a lot of underlying theory on training and optimization Machine-learning technique, widely

More information

The exam is closed book, closed notes except your one-page (two-sided) cheat sheet.

The exam is closed book, closed notes except your one-page (two-sided) cheat sheet. CS 189 Spring 2015 Introduction to Machine Learning Final You have 2 hours 50 minutes for the exam. The exam is closed book, closed notes except your one-page (two-sided) cheat sheet. No calculators or

More information

Natural Language Processing CS 6320 Lecture 6 Neural Language Models. Instructor: Sanda Harabagiu

Natural Language Processing CS 6320 Lecture 6 Neural Language Models. Instructor: Sanda Harabagiu Natural Language Processing CS 6320 Lecture 6 Neural Language Models Instructor: Sanda Harabagiu In this lecture We shall cover: Deep Neural Models for Natural Language Processing Introduce Feed Forward

More information

Modern Methods of Data Analysis - WS 07/08

Modern Methods of Data Analysis - WS 07/08 Modern Methods of Data Analysis Lecture XV (04.02.08) Contents: Function Minimization (see E. Lohrmann & V. Blobel) Optimization Problem Set of n independent variables Sometimes in addition some constraints

More information

Network Traffic Measurements and Analysis

Network Traffic Measurements and Analysis DEIB - Politecnico di Milano Fall, 2017 Sources Hastie, Tibshirani, Friedman: The Elements of Statistical Learning James, Witten, Hastie, Tibshirani: An Introduction to Statistical Learning Andrew Ng:

More information

Tutorial on Machine Learning Tools

Tutorial on Machine Learning Tools Tutorial on Machine Learning Tools Yanbing Xue Milos Hauskrecht Why do we need these tools? Widely deployed classical models No need to code from scratch Easy-to-use GUI Outline Matlab Apps Weka 3 UI TensorFlow

More information

CS6220: DATA MINING TECHNIQUES

CS6220: DATA MINING TECHNIQUES CS6220: DATA MINING TECHNIQUES Image Data: Classification via Neural Networks Instructor: Yizhou Sun yzsun@ccs.neu.edu November 19, 2015 Methods to Learn Classification Clustering Frequent Pattern Mining

More information

CPSC 340: Machine Learning and Data Mining. Feature Selection Fall 2016

CPSC 340: Machine Learning and Data Mining. Feature Selection Fall 2016 CPSC 34: Machine Learning and Data Mining Feature Selection Fall 26 Assignment 3: Admin Solutions will be posted after class Wednesday. Extra office hours Thursday: :3-2 and 4:3-6 in X836. Midterm Friday:

More information

Neural Network and Deep Learning. Donglin Zeng, Department of Biostatistics, University of North Carolina

Neural Network and Deep Learning. Donglin Zeng, Department of Biostatistics, University of North Carolina Neural Network and Deep Learning Early history of deep learning Deep learning dates back to 1940s: known as cybernetics in the 1940s-60s, connectionism in the 1980s-90s, and under the current name starting

More information

Topics in Machine Learning

Topics in Machine Learning Topics in Machine Learning Gilad Lerman School of Mathematics University of Minnesota Text/slides stolen from G. James, D. Witten, T. Hastie, R. Tibshirani and A. Ng Machine Learning - Motivation Arthur

More information

Day 3 Lecture 1. Unsupervised Learning

Day 3 Lecture 1. Unsupervised Learning Day 3 Lecture 1 Unsupervised Learning Semi-supervised and transfer learning Myth: you can t do deep learning unless you have a million labelled examples for your problem. Reality You can learn useful representations

More information

Classification: Feature Vectors

Classification: Feature Vectors Classification: Feature Vectors Hello, Do you want free printr cartriges? Why pay more when you can get them ABSOLUTELY FREE! Just # free YOUR_NAME MISSPELLED FROM_FRIEND... : : : : 2 0 2 0 PIXEL 7,12

More information

Feature Extractors. CS 188: Artificial Intelligence Fall Some (Vague) Biology. The Binary Perceptron. Binary Decision Rule.

Feature Extractors. CS 188: Artificial Intelligence Fall Some (Vague) Biology. The Binary Perceptron. Binary Decision Rule. CS 188: Artificial Intelligence Fall 2008 Lecture 24: Perceptrons II 11/24/2008 Dan Klein UC Berkeley Feature Extractors A feature extractor maps inputs to feature vectors Dear Sir. First, I must solicit

More information

Evaluation. Evaluate what? For really large amounts of data... A: Use a validation set.

Evaluation. Evaluate what? For really large amounts of data... A: Use a validation set. Evaluate what? Evaluation Charles Sutton Data Mining and Exploration Spring 2012 Do you want to evaluate a classifier or a learning algorithm? Do you want to predict accuracy or predict which one is better?

More information

Logical Rhythm - Class 3. August 27, 2018

Logical Rhythm - Class 3. August 27, 2018 Logical Rhythm - Class 3 August 27, 2018 In this Class Neural Networks (Intro To Deep Learning) Decision Trees Ensemble Methods(Random Forest) Hyperparameter Optimisation and Bias Variance Tradeoff Biological

More information

Chap.12 Kernel methods [Book, Chap.7]

Chap.12 Kernel methods [Book, Chap.7] Chap.12 Kernel methods [Book, Chap.7] Neural network methods became popular in the mid to late 1980s, but by the mid to late 1990s, kernel methods have also become popular in machine learning. The first

More information

Learning and Generalization in Single Layer Perceptrons

Learning and Generalization in Single Layer Perceptrons Learning and Generalization in Single Layer Perceptrons Neural Computation : Lecture 4 John A. Bullinaria, 2015 1. What Can Perceptrons do? 2. Decision Boundaries The Two Dimensional Case 3. Decision Boundaries

More information

Machine Learning Methods for Ship Detection in Satellite Images

Machine Learning Methods for Ship Detection in Satellite Images Machine Learning Methods for Ship Detection in Satellite Images Yifan Li yil150@ucsd.edu Huadong Zhang huz095@ucsd.edu Qianfeng Guo qig020@ucsd.edu Xiaoshi Li xil758@ucsd.edu Abstract In this project,

More information

Supervised Learning in Neural Networks (Part 2)

Supervised Learning in Neural Networks (Part 2) Supervised Learning in Neural Networks (Part 2) Multilayer neural networks (back-propagation training algorithm) The input signals are propagated in a forward direction on a layer-bylayer basis. Learning

More information

Perceptron as a graph

Perceptron as a graph Neural Networks Machine Learning 10701/15781 Carlos Guestrin Carnegie Mellon University October 10 th, 2007 2005-2007 Carlos Guestrin 1 Perceptron as a graph 1 0.9 0.8 0.7 0.6 0.5 0.4 0.3 0.2 0.1 0-6 -4-2

More information

Supplementary material for: BO-HB: Robust and Efficient Hyperparameter Optimization at Scale

Supplementary material for: BO-HB: Robust and Efficient Hyperparameter Optimization at Scale Supplementary material for: BO-: Robust and Efficient Hyperparameter Optimization at Scale Stefan Falkner 1 Aaron Klein 1 Frank Hutter 1 A. Available Software To promote reproducible science and enable

More information

Deep Learning for Computer Vision II

Deep Learning for Computer Vision II IIIT Hyderabad Deep Learning for Computer Vision II C. V. Jawahar Paradigm Shift Feature Extraction (SIFT, HoG, ) Part Models / Encoding Classifier Sparrow Feature Learning Classifier Sparrow L 1 L 2 L

More information

Unsupervised Learning

Unsupervised Learning Deep Learning for Graphics Unsupervised Learning Niloy Mitra Iasonas Kokkinos Paul Guerrero Vladimir Kim Kostas Rematas Tobias Ritschel UCL UCL/Facebook UCL Adobe Research U Washington UCL Timetable Niloy

More information

Fall 09, Homework 5

Fall 09, Homework 5 5-38 Fall 09, Homework 5 Due: Wednesday, November 8th, beginning of the class You can work in a group of up to two people. This group does not need to be the same group as for the other homeworks. You

More information

Logistic Regression and Gradient Ascent

Logistic Regression and Gradient Ascent Logistic Regression and Gradient Ascent CS 349-02 (Machine Learning) April 0, 207 The perceptron algorithm has a couple of issues: () the predictions have no probabilistic interpretation or confidence

More information

1 Case study of SVM (Rob)

1 Case study of SVM (Rob) DRAFT a final version will be posted shortly COS 424: Interacting with Data Lecturer: Rob Schapire and David Blei Lecture # 8 Scribe: Indraneel Mukherjee March 1, 2007 In the previous lecture we saw how

More information

CSE 573: Artificial Intelligence Autumn 2010

CSE 573: Artificial Intelligence Autumn 2010 CSE 573: Artificial Intelligence Autumn 2010 Lecture 16: Machine Learning Topics 12/7/2010 Luke Zettlemoyer Most slides over the course adapted from Dan Klein. 1 Announcements Syllabus revised Machine

More information