MULTIVARIATE ANALYSIS AND ITS USE IN HIGH ENERGY PHYSICS
1 1 DR ADRIAN BEVAN MULTIVARIATE ANALYSIS AND ITS USE IN HIGH ENERGY PHYSICS 4) OPTIMISATION Lectures given at the department of Physics at CINVESTAV, Instituto Politécnico Nacional, Mexico City 28th August - 3rd Sept 2018
2 2 LECTURE PLAN Introduction Hyper parameter optimisation Supervised learning Loss functions Gradient descent Stochastic learning Batch learning Overtraining Training validation Dropout for deep networks Weight regularisation for neural networks Cross Validation methods Summary Suggested reading
3 3 INTRODUCTION We initialise the weights of a neural network or other MVA algorithm randomly. The determination of a set of optimal weights requires some heuristic algorithm and some figure of merit. The algorithm is the optimisation method (typically derived from gradient descent). The figure of merit is called the loss or cost function and can take many forms. The optimisation process for machine learning algorithms is similar to the optimisation of a χ2 or −lnL in a fit, where one minimises the χ2 or −lnL as the figure of merit with respect to the model parameters.
4 4 HYPERPARAMETER OPTIMISATION Models have hyperparameters (HPs) that are required to fix the response function (*). The set of HPs forms a hyperspace. The purpose of optimisation is to select a point in hyperspace that optimises the performance of the model using some figure of merit (FOM). The figure of merit is called the cost or loss function. cf. least squares regression or a χ2 or likelihood fit. (*) Minimisation problems related to likelihood fitting often split the parameters into parameters of interest (e.g. physical quantities) and other nuisance parameters that are not deemed to be interesting. Machine learning model parameters are the equivalent of nuisance parameters in the language of likelihood fits. e.g. see the book by Edwards, Likelihood (1992), Johns Hopkins Univ. Press.
5 5 HYPERPARAMETER OPTIMISATION Consider a perceptron with N inputs. This has N+1 HPs: N weights and a bias:

y = f\left( \sum_{i=1}^{N} w_i x_i + \theta \right) = f(w^T x + \theta)

For brevity the bias parameter is called a weight from the perspective of HP optimisation.
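The perceptron response above can be sketched in a few lines of Python (a minimal illustration; the activation f is taken here to be a sigmoid, and the input, weight, and bias values are arbitrary examples):

```python
import math

def perceptron(x, w, theta):
    """Single perceptron: y = f(w.x + theta), with a sigmoid activation f."""
    z = sum(wi * xi for wi, xi in zip(w, x)) + theta
    return 1.0 / (1.0 + math.exp(-z))  # f(z) = 1 / (1 + e^-z)

# N = 3 inputs -> N + 1 = 4 hyperparameters (3 weights + 1 bias)
y = perceptron(x=[0.5, -1.0, 2.0], w=[0.1, 0.4, -0.2], theta=0.3)
print(y)  # a value in (0, 1)
```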
6 6 HYPERPARAMETER OPTIMISATION Consider a neural network with an N dimensional input feature space, M perceptrons on the input layer and 1 output perceptron. This has M(N+1) + (M+1) HPs. For an MLP with one hidden layer of K perceptrons: This has M(N+1) + K(M+1) + (K+1) HPs. For an MLP with two hidden layers of K and L perceptrons, respectively: This has M(N+1) + K(M+1) + L(K+1) +(L+1) HPs. and so on.
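The parameter counting above can be checked with a short helper (a sketch; `layers` lists the input dimension followed by the width of each successive layer, ending in the single output):

```python
def count_hps(layers):
    """Count weights + biases in a fully connected network.

    layers = [N, M, ..., 1]: input dimension, then the number of perceptrons
    in each layer. A layer of n_out perceptrons fed by n_in values
    contributes n_out * (n_in + 1) HPs (n_in weights plus a bias each).
    """
    return sum(n_out * (n_in + 1) for n_in, n_out in zip(layers, layers[1:]))

N, M, K = 4, 8, 5
print(count_hps([N, M, 1]))     # M(N+1) + (M+1)
print(count_hps([N, M, K, 1]))  # M(N+1) + K(M+1) + (K+1)
```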
7 7 HYPERPARAMETER OPTIMISATION Neural networks have a lot of HPs. Deep networks, especially CNNs, can have millions of HPs to optimise. Requires appropriate computing resources. Requires appropriately efficient methods for HP optimisation. What is acceptable for an optimisation of 10 or 100 HPs will not generally scale well to a problem with millions of HPs. The more HPs to determine, the more data is required to obtain a generalisable solution for the HPs. By generalisable we mean that the model defined using a set of HPs will have reproducible behaviour when presented with unseen data. When this is not the case we have an overtrained model - one that has learned statistical fluctuations in our training set.
8 8 HYPERPARAMETER OPTIMISATION: SUPERVISED LEARNING The type of machine learning we are using is referred to as supervised learning. We present the algorithm with known (labeled) samples of data, and optimise the HPs in order to minimise the loss function. The loss function is a function of: labels (e.g. signal = +1, background = 0 or -1 depending on convention/choice of loss function) hyper parameters (i.e. weights, biases, etc.) activation functions architecture of the network (arrangement of perceptrons)
9 9 HYPERPARAMETER OPTIMISATION: LOSS FUNCTIONS There are a number of different types of loss (cost) function that are commonly used. These different figures of merit are measures of how well a model performs at ensuring the weights provide a reasonable output response. Loss functions can have additional terms added to them (e.g. weight regularisation for overtraining and for constructing adversarial networks). Three common loss functions are described in the following.
10 10 HYPERPARAMETER OPTIMISATION: LOSS FUNCTIONS L2-norm loss: This is like a χ2 term, but without the error normalisation and with a factor of 1/2.

\varepsilon = \frac{1}{2} \sum_{i=1}^{N} (y_i - t_i)^2

N = number of examples
y_i = model output for the i-th example
t_i = true target type for the i-th example (label values)

The L1 norm loss function is as above, with the squared difference replaced by the absolute difference and without the factor of 1/2: \varepsilon = \sum_{i=1}^{N} |y_i - t_i|.
11 11 HYPERPARAMETER OPTIMISATION: LOSS FUNCTIONS Mean Square Error (MSE) loss: Very similar to the L2 norm loss function; just normalise by the number of training examples to compute an average.

\varepsilon = \frac{1}{N} \sum_{i=1}^{N} (y_i - t_i)^2

N = number of examples
y_i = model output for the i-th example
t_i = true target type for the i-th example (label values)
12 12 HYPERPARAMETER OPTIMISATION: LOSS FUNCTIONS Cross Entropy: This loss function is inspired by the likelihood for observing either target value:

P(t|x) = y^t (1 - y)^{(1-t)}

From the likelihood L of observing the training data set we can compute the −lnL as

\varepsilon = -\sum_{n=1}^{N} \left[ t_n \ln y_n + (1 - t_n) \ln(1 - y_n) \right]

N = number of examples
t_n = target type (0 or 1) depending on example (label values)
y_n = output prediction of the model
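The three loss functions above can be written out directly (a minimal sketch in plain Python; the outputs `y` and labels `t` are arbitrary example values, with outputs in (0, 1) as required by the cross entropy):

```python
import math

def l2_loss(y, t):
    """L2-norm loss: 0.5 * sum of squared residuals."""
    return 0.5 * sum((yi - ti) ** 2 for yi, ti in zip(y, t))

def mse_loss(y, t):
    """Mean square error: average squared residual."""
    return sum((yi - ti) ** 2 for yi, ti in zip(y, t)) / len(y)

def cross_entropy_loss(y, t):
    """Cross entropy: -sum [t ln y + (1 - t) ln(1 - y)]."""
    return -sum(tn * math.log(yn) + (1 - tn) * math.log(1 - yn)
                for yn, tn in zip(y, t))

y = [0.9, 0.2, 0.8]  # model outputs
t = [1, 0, 1]        # labels
print(l2_loss(y, t), mse_loss(y, t), cross_entropy_loss(y, t))
```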
13 13 HYPERPARAMETER OPTIMISATION: GRADIENT DESCENT Newtonian gradient descent is a simple concept. Guess an initial value for the weight parameter: w_0. [Figure: the cost y = ε as a function of the weight w, with the initial guess w_0 marked.]
14 14 HYPERPARAMETER OPTIMISATION: GRADIENT DESCENT Newtonian gradient descent is a simple concept. Estimate the gradient at that point (tangent to the curve). [Figure: the cost y = ε versus w, with the tangent drawn at w_0.]
15 15 HYPERPARAMETER OPTIMISATION: GRADIENT DESCENT Newtonian gradient descent is a simple concept. Compute Δw such that Δy is negative (to move toward the minimum):

\Delta y = \frac{dy}{dw} \Delta w

Choose \Delta w = -\alpha \frac{dy}{dw} to ensure Δy is always negative, since then \Delta y = -\alpha (dy/dw)^2 \le 0. Here α is the learning rate: a small positive number.
16 16 HYPERPARAMETER OPTIMISATION: GRADIENT DESCENT Newtonian gradient descent is a simple concept. Compute a new weight value: w_1 = w_0 + Δw, and in general

w_{i+1} = w_i + \Delta w = w_i - \alpha \frac{dy}{dw}

where α is the learning rate: a small positive number, chosen to ensure Δy is always negative.
17 17 HYPERPARAMETER OPTIMISATION: GRADIENT DESCENT Newtonian gradient descent is a simple concept. Repeat until some convergence criterion is satisfied:

w_{i+1} = w_i + \Delta w = w_i - \alpha \frac{dy}{dw}

[Figure: successive iterates w_0, w_1, w_2, ... descending the cost curve toward the minimum.]
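The iteration above, applied to a simple parabolic cost, might look like this (an illustrative sketch; the cost y = (w − 2)^2, its minimum at w = 2, and the starting point are arbitrary example choices):

```python
def gradient_descent(dy_dw, w0, alpha, n_steps):
    """Iterate w <- w - alpha * dy/dw for a fixed number of steps."""
    w = w0
    for _ in range(n_steps):
        w = w - alpha * dy_dw(w)
    return w

# cost y = (w - 2)^2, so dy/dw = 2(w - 2); the minimum is at w = 2
w_min = gradient_descent(dy_dw=lambda w: 2.0 * (w - 2.0),
                         w0=10.0, alpha=0.1, n_steps=100)
print(w_min)  # close to 2.0
```

Each step shrinks the distance to the minimum by a constant factor (1 − 2α) here, which is the guaranteed-convergence behaviour for a parabolic cost discussed on the reflection slides below.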
18 18 HYPERPARAMETER OPTIMISATION: GRADIENT DESCENT We can extend this from a one parameter optimisation to a 2 parameter one, and follow the same principles, now in 2D. The successive points w i+1 can be visualised a bit like a ball rolling down a concave hill into the region of the minimum.
19 19 HYPERPARAMETER OPTIMISATION: GRADIENT DESCENT In general, for an n-dimensional hyperspace of hyper parameters we can follow the same brute force approach using:

w_{i+1} = w_i + \Delta w = w_i - \alpha \nabla y

where \nabla y = \left( \frac{\partial y}{\partial w_{1}}, \frac{\partial y}{\partial w_{2}}, \dots, \frac{\partial y}{\partial w_{n}} \right)^T is the gradient of the cost evaluated at w_i.
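In two dimensions the same vector update can be sketched as follows (illustrative; the convex cost y = w1^2 + 3 w2^2 and the starting point are arbitrary example choices):

```python
def gradient_descent_nd(grad, w0, alpha, n_steps):
    """Vector update: w_{i+1} = w_i - alpha * grad(w_i)."""
    w = list(w0)
    for _ in range(n_steps):
        g = grad(w)
        w = [wi - alpha * gi for wi, gi in zip(w, g)]
    return w

# y = w1^2 + 3 w2^2  =>  grad y = (2 w1, 6 w2); the minimum is at (0, 0)
w = gradient_descent_nd(grad=lambda w: [2.0 * w[0], 6.0 * w[1]],
                        w0=[4.0, -3.0], alpha=0.1, n_steps=200)
print(w)  # both components close to 0
```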
20 20 HYPERPARAMETER OPTIMISATION: BACK PROPAGATION The delta-rule or back propagation method is used for NNs to determine the weights; it is based on gradient descent. For a regression problem* we use the L2 norm loss function (with or without the factor of 1/2):

\varepsilon = \frac{1}{2} \sum_{i=1}^{N} (y_i - t_i)^2

y_i is the model output for example x_i; t_i is the corresponding label for the i-th example. We can compute the derivative of ε with respect to the weights. The derivatives depend on the activation function(s) in the model y(x).

*Either this or cross entropy is used for classification problems.
21 21 HYPERPARAMETER OPTIMISATION: BACK PROPAGATION The parameters w and θ are updated using:

w_{r+1} = w_r - \alpha \frac{\partial \varepsilon}{\partial w}
\theta_{r+1} = \theta_r - \alpha \frac{\partial \varepsilon}{\partial \theta}

where α is the small positive learning rate. The derivatives can be re-written in terms of the errors on the weights, and the errors on w and θ can be related to each other. Back propagation involves: A forward pass where the weights are fixed and the model predictions are made*. This is followed by the backward pass where the errors on the bias parameters are computed and used to determine the errors on the weights. These in turn are then used to update the HPs from epoch r to epoch r+1. *Weights are randomly initialised for the whole network.
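For a single sigmoid perceptron with the L2 loss, the forward and backward passes can be written explicitly (a minimal sketch on one training example; the chain rule gives ∂ε/∂θ = (y − t) y (1 − y) and ∂ε/∂w_i = (y − t) y (1 − y) x_i, and the data, initial weights, and learning rate are arbitrary example values):

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def train_step(x, t, w, theta, alpha):
    """One forward + backward pass for a single sigmoid perceptron, L2 loss."""
    # forward pass: weights fixed, prediction made
    y = sigmoid(sum(wi * xi for wi, xi in zip(w, x)) + theta)
    # backward pass: error on the bias, then the errors on the weights
    delta = (y - t) * y * (1.0 - y)                  # dE/dtheta
    w = [wi - alpha * delta * xi for wi, xi in zip(w, x)]  # dE/dw_i = delta * x_i
    theta = theta - alpha * delta
    return w, theta, 0.5 * (y - t) ** 2

w, theta = [0.1, -0.2], 0.05  # random-style initial values
losses = []
for _ in range(50):  # 50 epochs on one example
    w, theta, loss = train_step(x=[1.0, 2.0], t=1.0, w=w, theta=theta, alpha=0.5)
    losses.append(loss)
print(losses[0], losses[-1])  # the loss decreases with epoch
```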
22 22 HYPERPARAMETER OPTIMISATION Consider the example of an MLP trained to approximate the function f(x) = x 2, and how the convergence depends on the learning rate. [Figure: cost versus number of epochs for two learning rates, α = 0.1 and a smaller value.] Too large a learning rate and the optimisation does not always lead to an improved cost; the small change approximation breaks down. A small learning rate means the convergence takes much longer (a x10 reduction in the learning rate will require a x10 increase in optimisation steps). This example uses an L2 loss function and is a neural network with 1 input, a single layer of 50 nodes and 1 output; implemented in TensorFlow.
24 24 HYPERPARAMETER OPTIMISATION: STOCHASTIC LEARNING [1] Data are inherently noisy. Individual training examples can be used to estimate the gradient. Training examples tend to cluster, so processing a batch of training data one example at a time results in sampling the ensemble in such a way as to have faster optimisation performance. Noise in the data can help the optimisation algorithm avoid getting locked into local minima. Often results in better optimisation performance than batch learning. [1] LeCun et al., Efficient BackProp, Neural Networks: Tricks of the Trade, Springer 1998
25 25 HYPERPARAMETER OPTIMISATION: BATCH LEARNING [1] Data are inherently noisy. Can use a sample of training data to estimate the gradient for minimisation (see later) to minimise the effect of this noise. The sample of data is used to obtain a better estimate of the gradient. This is referred to as batch learning. Can use mini-batches of data to speed up optimisation, which is motivated by the observation that for many problems there are clusters of similar training examples. [1] LeCun et al., Efficient BackProp, Neural Networks: Tricks of the Trade, Springer 1998
26 26 HYPERPARAMETER OPTIMISATION: BATCH LEARNING Returning to the example f(x) = x 2; optimising on 1/4 of the data at a time (4 batches) leads to accelerated optimisation relative to optimising on all the data each epoch. [Figure: cost versus epoch for training on all the data at once, and for batch training with 4 batches, at the same learning rate.] This example uses an L2 loss function and is a neural network with 1 input, a single layer of 50 nodes and 1 output; implemented in TensorFlow.
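The mechanics of mini-batch training can be sketched on a much simpler model than the slide's TensorFlow network (illustrative only; here a one-parameter model y = w·x is fitted to noiseless example data by minimising the MSE, with one weight update per batch rather than one per epoch):

```python
import random

def minibatch_sgd(data, w0, alpha, n_epochs, n_batches):
    """Fit y = w * x by mini-batch gradient descent on the MSE loss."""
    w = w0
    batch_size = len(data) // n_batches
    for _ in range(n_epochs):
        random.shuffle(data)  # stochastic element: reshuffle each epoch
        for b in range(n_batches):
            batch = data[b * batch_size:(b + 1) * batch_size]
            # d/dw of (1/n) sum (w x - t)^2  =  (2/n) sum (w x - t) x
            g = 2.0 * sum((w * x - t) * x for x, t in batch) / len(batch)
            w -= alpha * g  # n_batches updates per epoch, not just one
    return w

random.seed(1)
data = [(x, 3.0 * x) for x in [-2.0, -1.0, -0.5, 0.5, 1.0, 2.0, 2.5, 3.0]]
w_fit = minibatch_sgd(data, w0=0.0, alpha=0.05, n_epochs=50, n_batches=4)
print(w_fit)  # close to the true slope 3.0
```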
27 27 GRADIENT DESCENT: REFLECTION For a problem with a parabolic minimum and an appropriate learning rate, α, to fix the step size, we can guarantee convergence to a sensible minimum in some number of steps. If we translate the distribution to a fixed scale, then all of a sudden we can predict how many steps it will take to converge to the minimum from some distance away from it for a given α. If the problem hyperspace is not parabolic, this becomes more complicated. [1] LeCun et al., Efficient BackProp, Neural Networks Tricks of the Trade, Springer 1998
28 28 GRADIENT DESCENT: REFLECTION Based on the underlying nature of the gradient descent optimisation algorithm family, being derived to optimise a parabolic distribution, ideally we want to try and standardise the input distributions to a neural network. Use a unit Gaussian as a standard, e.g.: map x to x' = (x-μ)/σ; or scale x to the range [-1, 1]. The transformed data inputs will be scale invariant in the sense that HPs such as the learning rate will become general, rather than problem (and therefore scale) dependent. If we don't do this the optimisation algorithm will work, but it may take longer to converge to the minimum, and could be more susceptible to divergent behaviour. [1] LeCun et al., Efficient BackProp, Neural Networks: Tricks of the Trade, Springer 1998
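The (x − μ)/σ transformation is a one-liner per feature (a sketch; the raw values are arbitrary examples standing in for one input feature of a training set):

```python
import math

def standardise(xs):
    """Map each x to (x - mu) / sigma so the feature has mean 0, variance 1."""
    n = len(xs)
    mu = sum(xs) / n
    sigma = math.sqrt(sum((x - mu) ** 2 for x in xs) / n)
    return [(x - mu) / sigma for x in xs]

raw = [510.0, 498.0, 505.0, 487.0, 500.0]  # e.g. one input feature
z = standardise(raw)
print(z)  # mean ~0, unit variance, whatever the original scale
```

Note that μ and σ should be computed on the training sample and then applied unchanged to validation and unseen data, so all samples see the same transformation.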
29 29 OVERTRAINING Data are noisy. Optimisation can result in learning the noise in the training data. Overtraining is the term given to learning the noise, and this can be mitigated in a number of different ways: Using more training data (not always possible). Checking against different data sets to identify the onset of learning noise. Changing the network configuration when training (dropout). Weight regularisation (large weights are penalised in the cost). None of these methods guarantees that you avoid over training.
30 30 OVERTRAINING A model is over fitted if the HPs that have been determined are tuned to the statistical fluctuations in the data set. Simple illustration of the problem: 30 training examples The decision boundary selected here does a good job of separating the red and blue dots. Boundaries like this can be obtained by training models on limited data samples. The accuracies can be impressive. But would the performance be as good with a new, or a larger data sample?
31 31 OVERTRAINING A model is over fitted if the HPs that have been determined are tuned to the statistical fluctuations in the data set. Simple illustration of the problem: [Figure: the same decision boundary shown with 30 and with 1000 training examples.] Increasing to 1000 training examples we can see the boundary doesn't do as well. This illustrates the kind of problem encountered when we overfit HPs of a model.
32 32 OVERTRAINING: TRAINING VALIDATION One way to avoid tuning to statistical fluctuations in the data is to impose a training convergence criterion based on a data sample independent from the training set: a validation sample. Use the cost evaluated for the training and validation samples to check to see if the HPs are over trained. If both samples have similar cost then the model response function is similar on two statistically independent samples. If the samples are large enough then one could reasonably assume that the response function would then be general when applied to an unseen data sample. "Large enough" is a model and problem dependent constraint.
33 33 OVERTRAINING: TRAINING VALIDATION Training convergence criteria that could be used: Terminate training after N epochs Cost comparison: Evaluate the performance on the training and validation sets. Compare the two and place some threshold on the difference Δcost < δ cost Terminate the training when the gradient of the cost function with respect to the weights is below some threshold. Terminate the training when the Δcost starts to increase for the validation sample.
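The last criterion, terminating when Δcost starts to increase for the validation sample, can be sketched as follows (illustrative; `val_costs` stands in for the cost evaluated on the validation sample after each epoch, and the patience window is an arbitrary choice):

```python
def early_stop_epoch(val_costs, patience=3):
    """Return the epoch to stop at: the point after which the validation
    cost fails to improve for `patience` consecutive epochs."""
    best_cost, best_epoch = float("inf"), 0
    for epoch, cost in enumerate(val_costs):
        if cost < best_cost:
            best_cost, best_epoch = cost, epoch
        elif epoch - best_epoch >= patience:
            return best_epoch  # validation cost has been rising: stop here
    return len(val_costs) - 1  # never triggered: trained to the end

# validation cost falls, then rises as the network starts to overtrain
val_costs = [1.0, 0.7, 0.5, 0.42, 0.40, 0.43, 0.47, 0.55, 0.6]
print(early_stop_epoch(val_costs))  # the epoch of the minimum: 4
```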
34 34 OVERTRAINING: TRAINING VALIDATION: EXAMPLE HIGGS KAGGLE DATA This example shows the Higgs Kaggle (H → ττ) data with an overtrained neural network. Hyper parameters are not optimal, but test and train samples give similar performance. The test sample (green) has a consistently high cost value relative to the train sample (blue). 65% accuracy is attained before overtraining starts. [Figure: train and test cost versus epoch for the overtrained network.] This example uses all features of the data set, but only 2000 test/train events with a learning rate of and dropout is not being used. The network has a single layer with 256 nodes and a single output to classify if a given event is signal or background. There is no batch training used for this example. See for example code for this problem
35 35 OVERTRAINING: TRAINING VALIDATION: EXAMPLE HIGGS KAGGLE DATA Continuing to train an over-trained network does not resolve the issue - the network remains over-trained. [Figure: train and test accuracy versus training cycle for the overtrained network.] In this case the level of overtraining continues to increase and the difference in the performance of the train and test samples in terms of accuracy increases. After training cycles we have an accuracy of over 93% for the train sample, but less than 70% for the test sample. This is not a good configuration to use on unseen data as the outcome is unpredictable. For this example we should terminate after about 200 epochs, while the test/train performance remain similar and have an accuracy of 70%. Other network configurations may be better. See for example code for this problem
36 36 OVERTRAINING: DROPOUT FOR DEEP NETWORKS A pragmatic way to mitigate overfitting is to compromise the model randomly in different epochs of the training by removing units from the network. Dropout is used during training; when evaluating predictions with the validation or unseen data the full network of Fig. (a) is used. That way the whole model will be effectively trained on a sub-sample of the data, in the hope that the effect of statistical fluctuations will be limited. This does not remove the possibility that a model is overtrained, but as with the previous discussion, generalisation of the HPs is promoted by using this method. Srivastava et al., J. Machine Learning Research 15 (2014)
37 37 OVERTRAINING: DROPOUT FOR DEEP NETWORKS A variety of architectures has been explored with different training samples (see Ref [1] for details). Dropout can be detrimental for small training samples; however, in general the results show that dropout is beneficial. For deep networks with typical training samples of O(500) examples or more this technique is expected to be beneficial. [1] Srivastava et al., J. Machine Learning Research 15 (2014)
38 38 OVERTRAINING: DROPOUT: EXAMPLE HIGGS KAGGLE DATA Changing the dropout keep probability from 0.9 to 0.7 stops the network becoming overtrained in the first 1000 epochs. ~73% accuracy attained without overtraining for 1000 epochs, with a keep probability of 0.7. A better accuracy is attained for this network using dropout; above 70%. See for example code for this problem
39 39 OVERTRAINING: WEIGHT REGULARISATION FOR NEURAL NETWORKS Weight regularisation involves adding a penalty term to the loss function used to optimise the HPs of a network. This term is based on the sum of the weights w_i (including bias parameters) in the network and takes the form:

\lambda \sum_{i \,\forall\, \mathrm{weights}} |w_i|

This is the L1 norm regularisation term. The rationale is to add an additional cost term to the optimisation coming from the complexity of the network. The performance of the network will vary as a function of λ. To optimise a network using weight regularisation it will have to be trained a number of times in order to identify the value of λ corresponding to the min(cost) from the set of trained solutions. See Ch. 9 of Bishop's Neural Networks for Pattern Recognition; Loshchilov, Frank Hutter, arXiv:
40 OVERTRAINING: WEIGHT REGULARISATION FOR NEURAL NETWORKS Weight regularisation involves adding a penalty term to the loss function used to optimise the HPs of a network. This term is based on the sum of the squared weights $w_i$ (including bias parameters) in the network and takes the form: $\lambda \sum_{i \in \mathrm{weights}} w_i^2$. This variant is the L2 norm regularisation term, also known as weight decay regularisation. The rationale is to add an additional cost term to the optimisation coming from the complexity of the network. The performance of the network will vary as a function of λ. To optimise a network using weight regularisation it will have to be trained a number of times in order to identify the value of λ corresponding to the min(cost) from the set of trained solutions. See Ch. 9 of Bishop's Neural Networks for Pattern Recognition; Loshchilov and Hutter, arXiv:
41 OVERTRAINING: WEIGHT REGULARISATION FOR NEURAL NETWORKS For example, we can consider extending an MSE cost function to allow for weight regularisation. The MSE cost is given by: $\varepsilon = \frac{1}{N}\sum_{i=1}^{N}(y_i - t_i)^2$. To allow for regularisation we add the sum-of-weights term: $\varepsilon = \frac{1}{N}\sum_{i=1}^{N}(y_i - t_i)^2 + \lambda \sum_{i \in \mathrm{weights}} w_i^2$. This is a simple modification to make to the NN training process that adds a penalty for the inclusion of non-zero weights in the network. See Ch. 9 of Bishop's Neural Networks for Pattern Recognition; Loshchilov and Hutter, arXiv:
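The regularised cost is straightforward to compute. The sketch below (hypothetical helper names, not from the lecture code) evaluates the MSE cost with and without the L2 penalty, summing the penalty over every weight array in the network:

```python
import numpy as np

def mse_cost(y, t):
    """Plain mean-squared-error cost: (1/N) * sum_i (y_i - t_i)^2."""
    return float(np.mean((y - t) ** 2))

def regularised_cost(y, t, weights, lam):
    """MSE plus the L2 penalty lam * sum_i w_i^2, summed over every
    weight (and bias) array in the network."""
    penalty = lam * sum(float(np.sum(w ** 2)) for w in weights)
    return mse_cost(y, t) + penalty

y = np.array([0.2, 0.8])            # network outputs
t = np.array([0.0, 1.0])            # targets
weights = [np.array([1.0, -2.0]), np.array([0.5])]
print(mse_cost(y, t))                               # ~0.04
print(regularised_cost(y, t, weights, lam=0.01))    # ~0.04 + 0.01 * 5.25
```

Large weights now carry an explicit cost, so the optimiser is pushed towards smaller-magnitude solutions as λ grows.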
42 OVERTRAINING: WEIGHT REGULARISATION FOR NEURAL NETWORKS The optimisation process needs to be run for each value of λ in order to choose the best (least cost) solution that is not overtrained. This means we have to train the network many times, with different values of λ, when using regularisation. e.g. for the Kaggle example with 1000 epochs: no regularisation gives the best cost, but the best performance for this sampling of λ gives similar outputs over a range of trainings. [Table: cost (train/test) and accuracy (train/test) as a function of λ, with the cost of the weights increasing with λ; the numerical values did not survive transcription.] See Ch. 9 of Bishop's Neural Networks for Pattern Recognition; Loshchilov and Hutter, arXiv:
43 OVERTRAINING: WEIGHT REGULARISATION FOR NEURAL NETWORKS The optimisation process needs to be run for each value of λ in order to choose the best (least cost) solution that is not overtrained. This means we have to train the network many times, with different values of λ, when using regularisation. e.g. for the Kaggle example with 1000 epochs, the cost includes $\lambda \sum_{i \in \mathrm{weights}} w_i^2$. Accuracy falls off as λ increases, but there is little dependence for this example. See Ch. 9 of Bishop's Neural Networks for Pattern Recognition; Loshchilov and Hutter, arXiv:
44 OVER FITTING: CROSS VALIDATION An alternative way of thinking about the problem is to assume that the response function of the model will have some bias and some variance. The bias will be irreducible and means that the predictions made will have some systematic effect related to the average output value. The variance will depend on the size of the training sample, and we would like to know what this is. We can use cross validation to estimate the prediction error. Any prediction bias can be measured using control samples. Geisser, S. (1975). The predictive sample reuse method with applications. J. Amer. Statist. Assoc., 70: For a review of cross validation see: S. Arlot and A. Celisse, Statistics Surveys Vol (2010).
45 OVER FITTING: CROSS VALIDATION Application of this concept to machine learning can be seen via k-fold cross validation and its variants*. Divide the data sample for training and validation into k equal sub-samples. From these one can prepare k sets of validation samples and residual training samples. Each set uses all examples, but the training and validation sub-sets are distinct. [Diagram: k copies of the data sample, each with a different fold held out as the validation block.] *Variants include the extremes of leave-1-out CV and hold-out CV as well as leave-p-out CV. These involve reserving 1 example, 50% of examples and p examples for testing, and the remainder of the data for training, respectively. Geisser, S. (1975). The predictive sample reuse method with applications. J. Amer. Statist. Assoc., 70: For a review of cross validation see: S. Arlot and A. Celisse, Statistics Surveys Vol (2010).
46 OVER FITTING: CROSS VALIDATION Application of this concept to machine learning can be seen via k-fold cross validation and its variants*. One can then train on each of the k training sets, validating the performance of the network on the corresponding validation set. We can measure the model error on each validation set. The model prediction error is the average error obtained from the k folds on their corresponding validation sets. If desired we could combine the k models to compute some average that would have a prediction performance that is more robust than any of the individual folds. *Variants include the extremes of leave-1-out CV and hold-out CV as well as leave-p-out CV. These involve reserving 1 example, 50% of examples and p examples for testing, and the remainder of the data for training, respectively. Geisser, S. (1975). The predictive sample reuse method with applications. J. Amer. Statist. Assoc., 70: For a review of cross validation see: S. Arlot and A. Celisse, Statistics Surveys Vol (2010).
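The k-fold splitting described above can be sketched as follows, assuming the examples are indexed 0..N-1; the helper name is invented for illustration:

```python
import numpy as np

def kfold_indices(n_examples, k, seed=0):
    """Yield k (train, validation) index pairs: the examples are shuffled,
    split into k disjoint folds, and each fold in turn is held out for
    validation while the rest form the training set."""
    rng = np.random.default_rng(seed)
    idx = rng.permutation(n_examples)
    folds = np.array_split(idx, k)
    for i in range(k):
        valid = folds[i]
        train = np.concatenate([folds[j] for j in range(k) if j != i])
        yield train, valid

# Every example appears in exactly one validation fold across the k splits.
for train, valid in kfold_indices(10, 5):
    print(len(train), len(valid))  # 8 2 for every fold
```

The k validation errors can then be averaged to estimate the model prediction error, as on the slide above.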
47 OVER FITTING: CROSS VALIDATION Application of this concept to machine learning can be seen via k-fold cross validation and its variants*. The ensemble of response function outputs will vary, in analogy with the spread of a Gaussian distribution. This results in a family of ROC curves, with a representative performance that is neither the best nor the worst ROC. The example shown is for a Support Vector Machine, with the best, average and hold-out ROC curves shown to indicate some sense of the spread. [Figure: ROC curves, background rejection (1 − eff, false negative rate) vs signal efficiency (true positive rate), for SVM_Best, SVM_Average and SVM_Holdout_RBF.] *Variants include the extremes of leave-1-out CV and hold-out CV as well as leave-p-out CV. These involve reserving 1 example, 50% of examples and p examples for testing, and the remainder of the data for training, respectively. Geisser, S. (1975). The predictive sample reuse method with applications. J. Amer. Statist. Assoc., 70: For a review of cross validation see: S. Arlot and A. Celisse, Statistics Surveys Vol (2010).
48 SUMMARY Hyper-parameter optimisation does not guarantee a robust solution. Minimisation of different cost functions allows us to determine optimal hyper-parameters for our model. When the model starts to learn statistical fluctuations in the training sample we should stop the training (i.e. to avoid learning the statistical fluctuations of a given sample). Methods exist to mitigate overtraining [dropout, regularisation, cross validation]; however, none of them guarantees a generalised solution that is robust against overtraining. The only exception is to supply an effectively infinite sample of training data for a given problem.
49 SUGGESTED READING (HEP RELATED) There is not much HEP-specific literature regarding HP optimisation. Minuit is used to optimise parameters in TMVA along with other algorithms; see the Minuit user guide and TMVA for more details. Minuit is based on the DFP method of variable metric minimisation. Logarithmic grid searches are also discussed for specific algorithms with a few HPs (see the SVM notes).
50 SUGGESTED READING (NON-HEP) There are a few backup slides on the ADAM optimiser, which performs well on a number of deep learning problems, that you might wish to look at. In addition the following books discuss the issue of parameter optimisation: Bishop, Neural Networks for Pattern Recognition, Oxford University Press (2013); Cristianini and Shawe-Taylor, Support Vector Machines and other kernel-based learning methods, Cambridge University Press (2014); Goodfellow, Deep Learning, MIT Press (2016); MacKay's Information Theory, Inference and Learning Algorithms, Cambridge University Press (2011)
51 APPENDIX The following slides contain additional information that may be of interest.
52 MULTIPLE MINIMA Often more complicated hyperspace optimisation problems are encountered, where there are multiple minima. The gradient descent minimisation algorithm is based on the assumption that there is a single minimum to be found. In reality there are often multiple minima. Sometimes the minima are degenerate, or near degenerate. How do we know we have converged on the global minimum? One of several minima
53 MULTIPLE MINIMA Often more complicated hyperspace optimisation problems are encountered, where there are multiple minima. The gradient descent minimisation algorithm is based on the assumption that there is a single minimum to be found. In reality there are often multiple minima. Sometimes the minima are degenerate, or near degenerate. How do we know we have converged on the global minimum? Global minimum
54 GRID SEARCHES Just as we can scan through a parameter in order to minimise a likelihood ratio, we can scan through a HP to observe how the loss function changes. For simple models we can construct a 2D grid of points in m and c. Evaluating the loss function for each point in the 2D sample space, we can construct a grid from which to select the minimum value. The assumption here is that our grid spacing is sufficient for the purpose of optimising the problem. This type of parameter search is often used for support vector machine HPs (kernel function parameters and cost). This method does not scale to large numbers of parameters.
55 GRID SEARCHES e.g. consider a linear regression study optimising the parameters of the model y = mx + c. The loss function for this problem results in a valley, as m and c are anticorrelated parameters in this 2D hyperspace. The contours of the loss function show a minimum, but this is selected from a discrete grid of points (need to ensure the grid spacing is sufficient for your needs). [Figures: surface and contour plots of the loss L as a function of m and c.]
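The m-c grid scan can be sketched directly for toy data generated from a known line; the grid ranges and spacing below are arbitrary choices for illustration:

```python
import numpy as np

# Toy data generated from y = 2x + 1, so the true minimum sits at m=2, c=1.
x = np.linspace(0, 1, 50)
y = 2.0 * x + 1.0

# Evaluate the MSE loss on a discrete 2D grid of (m, c) values.
m_grid = np.linspace(0, 4, 41)    # grid spacing 0.1 in m
c_grid = np.linspace(-1, 3, 41)   # grid spacing 0.1 in c
loss = np.empty((len(m_grid), len(c_grid)))
for i, m in enumerate(m_grid):
    for j, c in enumerate(c_grid):
        loss[i, j] = np.mean((m * x + c - y) ** 2)

# Select the grid point with the smallest loss.
i_best, j_best = np.unravel_index(np.argmin(loss), loss.shape)
print(m_grid[i_best], c_grid[j_best])  # close to m=2, c=1
```

The cost of this scan grows as (grid points)^(number of parameters), which is why the slide notes that grid searches do not scale to large numbers of parameters.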
56 ADAM OPTIMISER This is a stochastic gradient descent algorithm. Consider a model f(θ) that is differentiable with respect to the HPs θ, so that the gradient $g_t = \nabla f_t(\theta_{t-1})$ can be computed. t is the training epoch. $m_t$ and $v_t$ are biased estimates of the first and second moments; $\hat{m}_t$ and $\hat{v}_t$ are the bias-corrected estimators of the moments. Some initial guess for the HPs is taken, $\theta_0$, and the HPs for a given epoch are denoted by $\theta_t$. α is the step size; $\beta_1$ and $\beta_2$ are the exponential decay rates of the moving averages. Kingma and Ba, Proc. 3rd Int. Conf. Learning Representations, San Diego, 2015
57 ADAM OPTIMISER ADAptive Moment estimation based on gradient descent. Kingma and Ba, Proc. 3rd Int. Conf. Learning Representations, San Diego, 2015
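The update rule can be sketched as follows, using the moment definitions from the previous slide; the quadratic test function, step size and epoch count are illustrative choices, not values from the paper's benchmarks:

```python
import numpy as np

def adam_minimise(grad_fn, theta0, alpha=0.1, beta1=0.9, beta2=0.999,
                  eps=1e-8, n_epochs=500):
    """Adam loop following Kingma & Ba: keep exponential moving averages of
    the gradient (m_t) and squared gradient (v_t), correct their
    initialisation bias, then take a scaled gradient step."""
    theta = np.asarray(theta0, dtype=float)
    m = np.zeros_like(theta)
    v = np.zeros_like(theta)
    for t in range(1, n_epochs + 1):
        g = grad_fn(theta)                  # g_t = grad f_t(theta_{t-1})
        m = beta1 * m + (1 - beta1) * g     # biased first moment
        v = beta2 * v + (1 - beta2) * g**2  # biased second moment
        m_hat = m / (1 - beta1**t)          # bias-corrected moments
        v_hat = v / (1 - beta2**t)
        theta -= alpha * m_hat / (np.sqrt(v_hat) + eps)
    return theta

# Minimise f(theta) = sum_i (theta_i - 3)^2, whose gradient is 2*(theta - 3).
theta = adam_minimise(lambda th: 2 * (th - 3.0), theta0=[0.0, 10.0])
print(theta)  # both components end up close to 3
```

The bias correction matters early on: at t=1 the raw moments m and v are scaled down by (1 − β) and would otherwise produce tiny first steps.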
58 ADAM OPTIMISER Benchmarking performance using MNIST and CIFAR-10 data indicates that Adam with dropout minimises the loss function compared with the other optimisers tested: a faster drop-off in cost, and a lower overall cost obtained. Kingma and Ba, Proc. 3rd Int. Conf. Learning Representations, San Diego, 2015
59 DAVIDON-FLETCHER-POWELL This variable metric minimisation algorithm (1959, 1963) is described in Wikipedia as: "The DFP formula is quite effective, but it was soon superseded by the BFGS formula, which is its dual (interchanging the roles of y and s)." It is used in high energy physics via the MINUIT (FORTRAN) and Minuit2 (C++) implementations. The standard tools used for data analysis in HEP have these implementations available, and while the algorithm may no longer be optimal, it is still deemed good enough by many for the optimisation tasks: robust/reliable minimisation; the underlying method is derived assuming parabolic minima; understood and trusted by the HEP community.
60 DAVIDON-FLETCHER-POWELL This variable metric minimisation algorithm (1959, 1963) is described in Wikipedia as: "The DFP formula is quite effective, but it was soon superseded by the BFGS formula, which is its dual (interchanging the roles of y and s)." It is used in high energy physics via the MINUIT (FORTRAN) and Minuit2 (C++) implementations. This is mentioned only because it is the dominant algorithm used in particle physics at this time. Almost all minimisation problems are solved using this algorithm, where the dominant use is for maximum likelihood and χ² fit minimisation problems. The number of HPs required to solve those problems is small (up to a few hundred) in comparison with the numbers required for neural networks (esp. deep learning problems). If you don't (intend to) work in this field, you can now forget you heard about this algorithm.
More informationEfficient Tuning of SVM Hyperparameters Using Radius/Margin Bound and Iterative Algorithms
IEEE TRANSACTIONS ON NEURAL NETWORKS, VOL. 13, NO. 5, SEPTEMBER 2002 1225 Efficient Tuning of SVM Hyperparameters Using Radius/Margin Bound and Iterative Algorithms S. Sathiya Keerthi Abstract This paper
More informationEarly Stopping. Sargur N. Srihari
Early Stopping Sargur N. srihari@cedar.buffalo.edu This is part of lecture slides on Deep Learning: http://www.cedar.buffalo.edu/~srihari/cse676 1 Regularization Strategies 1. Parameter Norm Penalties
More informationCS535 Big Data Fall 2017 Colorado State University 10/10/2017 Sangmi Lee Pallickara Week 8- A.
CS535 Big Data - Fall 2017 Week 8-A-1 CS535 BIG DATA FAQs Term project proposal New deadline: Tomorrow PA1 demo PART 1. BATCH COMPUTING MODELS FOR BIG DATA ANALYTICS 5. ADVANCED DATA ANALYTICS WITH APACHE
More informationSupport Vector Machines
Support Vector Machines RBF-networks Support Vector Machines Good Decision Boundary Optimization Problem Soft margin Hyperplane Non-linear Decision Boundary Kernel-Trick Approximation Accurancy Overtraining
More informationPerceptron Introduction to Machine Learning. Matt Gormley Lecture 5 Jan. 31, 2018
10-601 Introduction to Machine Learning Machine Learning Department School of Computer Science Carnegie Mellon University Perceptron Matt Gormley Lecture 5 Jan. 31, 2018 1 Q&A Q: We pick the best hyperparameters
More informationMachine Learning Lecture 3
Machine Learning Lecture 3 Probability Density Estimation II 19.10.2017 Bastian Leibe RWTH Aachen http://www.vision.rwth-aachen.de leibe@vision.rwth-aachen.de Announcements Exam dates We re in the process
More informationLearning from Data: Adaptive Basis Functions
Learning from Data: Adaptive Basis Functions November 21, 2005 http://www.anc.ed.ac.uk/ amos/lfd/ Neural Networks Hidden to output layer - a linear parameter model But adapt the features of the model.
More informationPartitioning Data. IRDS: Evaluation, Debugging, and Diagnostics. Cross-Validation. Cross-Validation for parameter tuning
Partitioning Data IRDS: Evaluation, Debugging, and Diagnostics Charles Sutton University of Edinburgh Training Validation Test Training : Running learning algorithms Validation : Tuning parameters of learning
More informationNeural Networks. Theory And Practice. Marco Del Vecchio 19/07/2017. Warwick Manufacturing Group University of Warwick
Neural Networks Theory And Practice Marco Del Vecchio marco@delvecchiomarco.com Warwick Manufacturing Group University of Warwick 19/07/2017 Outline I 1 Introduction 2 Linear Regression Models 3 Linear
More informationA very different type of Ansatz - Decision Trees
A very different type of Ansatz - Decision Trees A Decision Tree encodes sequential rectangular cuts But with a lot of underlying theory on training and optimization Machine-learning technique, widely
More informationThe exam is closed book, closed notes except your one-page (two-sided) cheat sheet.
CS 189 Spring 2015 Introduction to Machine Learning Final You have 2 hours 50 minutes for the exam. The exam is closed book, closed notes except your one-page (two-sided) cheat sheet. No calculators or
More informationNatural Language Processing CS 6320 Lecture 6 Neural Language Models. Instructor: Sanda Harabagiu
Natural Language Processing CS 6320 Lecture 6 Neural Language Models Instructor: Sanda Harabagiu In this lecture We shall cover: Deep Neural Models for Natural Language Processing Introduce Feed Forward
More informationModern Methods of Data Analysis - WS 07/08
Modern Methods of Data Analysis Lecture XV (04.02.08) Contents: Function Minimization (see E. Lohrmann & V. Blobel) Optimization Problem Set of n independent variables Sometimes in addition some constraints
More informationNetwork Traffic Measurements and Analysis
DEIB - Politecnico di Milano Fall, 2017 Sources Hastie, Tibshirani, Friedman: The Elements of Statistical Learning James, Witten, Hastie, Tibshirani: An Introduction to Statistical Learning Andrew Ng:
More informationTutorial on Machine Learning Tools
Tutorial on Machine Learning Tools Yanbing Xue Milos Hauskrecht Why do we need these tools? Widely deployed classical models No need to code from scratch Easy-to-use GUI Outline Matlab Apps Weka 3 UI TensorFlow
More informationCS6220: DATA MINING TECHNIQUES
CS6220: DATA MINING TECHNIQUES Image Data: Classification via Neural Networks Instructor: Yizhou Sun yzsun@ccs.neu.edu November 19, 2015 Methods to Learn Classification Clustering Frequent Pattern Mining
More informationCPSC 340: Machine Learning and Data Mining. Feature Selection Fall 2016
CPSC 34: Machine Learning and Data Mining Feature Selection Fall 26 Assignment 3: Admin Solutions will be posted after class Wednesday. Extra office hours Thursday: :3-2 and 4:3-6 in X836. Midterm Friday:
More informationNeural Network and Deep Learning. Donglin Zeng, Department of Biostatistics, University of North Carolina
Neural Network and Deep Learning Early history of deep learning Deep learning dates back to 1940s: known as cybernetics in the 1940s-60s, connectionism in the 1980s-90s, and under the current name starting
More informationTopics in Machine Learning
Topics in Machine Learning Gilad Lerman School of Mathematics University of Minnesota Text/slides stolen from G. James, D. Witten, T. Hastie, R. Tibshirani and A. Ng Machine Learning - Motivation Arthur
More informationDay 3 Lecture 1. Unsupervised Learning
Day 3 Lecture 1 Unsupervised Learning Semi-supervised and transfer learning Myth: you can t do deep learning unless you have a million labelled examples for your problem. Reality You can learn useful representations
More informationClassification: Feature Vectors
Classification: Feature Vectors Hello, Do you want free printr cartriges? Why pay more when you can get them ABSOLUTELY FREE! Just # free YOUR_NAME MISSPELLED FROM_FRIEND... : : : : 2 0 2 0 PIXEL 7,12
More informationFeature Extractors. CS 188: Artificial Intelligence Fall Some (Vague) Biology. The Binary Perceptron. Binary Decision Rule.
CS 188: Artificial Intelligence Fall 2008 Lecture 24: Perceptrons II 11/24/2008 Dan Klein UC Berkeley Feature Extractors A feature extractor maps inputs to feature vectors Dear Sir. First, I must solicit
More informationEvaluation. Evaluate what? For really large amounts of data... A: Use a validation set.
Evaluate what? Evaluation Charles Sutton Data Mining and Exploration Spring 2012 Do you want to evaluate a classifier or a learning algorithm? Do you want to predict accuracy or predict which one is better?
More informationLogical Rhythm - Class 3. August 27, 2018
Logical Rhythm - Class 3 August 27, 2018 In this Class Neural Networks (Intro To Deep Learning) Decision Trees Ensemble Methods(Random Forest) Hyperparameter Optimisation and Bias Variance Tradeoff Biological
More informationChap.12 Kernel methods [Book, Chap.7]
Chap.12 Kernel methods [Book, Chap.7] Neural network methods became popular in the mid to late 1980s, but by the mid to late 1990s, kernel methods have also become popular in machine learning. The first
More informationLearning and Generalization in Single Layer Perceptrons
Learning and Generalization in Single Layer Perceptrons Neural Computation : Lecture 4 John A. Bullinaria, 2015 1. What Can Perceptrons do? 2. Decision Boundaries The Two Dimensional Case 3. Decision Boundaries
More informationMachine Learning Methods for Ship Detection in Satellite Images
Machine Learning Methods for Ship Detection in Satellite Images Yifan Li yil150@ucsd.edu Huadong Zhang huz095@ucsd.edu Qianfeng Guo qig020@ucsd.edu Xiaoshi Li xil758@ucsd.edu Abstract In this project,
More informationSupervised Learning in Neural Networks (Part 2)
Supervised Learning in Neural Networks (Part 2) Multilayer neural networks (back-propagation training algorithm) The input signals are propagated in a forward direction on a layer-bylayer basis. Learning
More informationPerceptron as a graph
Neural Networks Machine Learning 10701/15781 Carlos Guestrin Carnegie Mellon University October 10 th, 2007 2005-2007 Carlos Guestrin 1 Perceptron as a graph 1 0.9 0.8 0.7 0.6 0.5 0.4 0.3 0.2 0.1 0-6 -4-2
More informationSupplementary material for: BO-HB: Robust and Efficient Hyperparameter Optimization at Scale
Supplementary material for: BO-: Robust and Efficient Hyperparameter Optimization at Scale Stefan Falkner 1 Aaron Klein 1 Frank Hutter 1 A. Available Software To promote reproducible science and enable
More informationDeep Learning for Computer Vision II
IIIT Hyderabad Deep Learning for Computer Vision II C. V. Jawahar Paradigm Shift Feature Extraction (SIFT, HoG, ) Part Models / Encoding Classifier Sparrow Feature Learning Classifier Sparrow L 1 L 2 L
More informationUnsupervised Learning
Deep Learning for Graphics Unsupervised Learning Niloy Mitra Iasonas Kokkinos Paul Guerrero Vladimir Kim Kostas Rematas Tobias Ritschel UCL UCL/Facebook UCL Adobe Research U Washington UCL Timetable Niloy
More informationFall 09, Homework 5
5-38 Fall 09, Homework 5 Due: Wednesday, November 8th, beginning of the class You can work in a group of up to two people. This group does not need to be the same group as for the other homeworks. You
More informationLogistic Regression and Gradient Ascent
Logistic Regression and Gradient Ascent CS 349-02 (Machine Learning) April 0, 207 The perceptron algorithm has a couple of issues: () the predictions have no probabilistic interpretation or confidence
More information1 Case study of SVM (Rob)
DRAFT a final version will be posted shortly COS 424: Interacting with Data Lecturer: Rob Schapire and David Blei Lecture # 8 Scribe: Indraneel Mukherjee March 1, 2007 In the previous lecture we saw how
More informationCSE 573: Artificial Intelligence Autumn 2010
CSE 573: Artificial Intelligence Autumn 2010 Lecture 16: Machine Learning Topics 12/7/2010 Luke Zettlemoyer Most slides over the course adapted from Dan Klein. 1 Announcements Syllabus revised Machine
More information