Deep Reinforcement Learning and the Atari 2600 MARC G. BELLEMARE Google Brain
1 Deep Reinforcement Learning and the Atari 2600 MARC G. BELLEMARE Google Brain
2 DEEP REINFORCEMENT LEARNING Inputs Hidden layers Output Backgammon (Tesauro, 1994) Elevator control (Crites and Barto, 1996) Helicopter control (Ng et al., 2003) Transfer in tic-tac-toe (Rivest and Precup, 2003; Bellemare et al., unpublished) Atari game-playing (Mnih et al., 2015) Go (Silver et al., 2016)
3 Deep RL Inputs Hidden layers Output Challenge domains Algorithms
4 SOME CHALLENGES IN DEEP RL Training data is not i.i.d. Hard to get confidence intervals Potential divergence issues (e.g. van Hasselt et al., 2015) State space often unknown Model-free methods reign supreme (simulation a plus)
5 Diuk et al., 2008; Barbados RL Workshop, 2008; Naddaf, 2008; Bellemare, Naddaf, Veness, Bowling, 2013; Mnih et al., 2013, 2015; Hausknecht et al., 2014; Lipovetzky et al., 2015
6
7 160 × 210 pixels, 60 Hz, 128 bytes of RAM
8 Observation Action The Arcade Learning Environment (Bellemare et al., 2013)
9 Observation Action Reward The Arcade Learning Environment (Bellemare et al., 2013)
10
11 GENERAL COMPETENCY CC Wikipedia
12 NARROW COMPETENCY CC Wikipedia
13 Diverse Interesting Independent
14 Diverse We argue that reinforcement learning is particularly vulnerable to environment overfitting and propose as a remedy generalized methodologies [ ] Whiteson et al., 2011
15 Diverse
16 Diverse
17 Interesting (to people) CC Wikipedia
18 Independent (by and for people)
19 EARLY ATTEMPTS ( ): Basic/BASS features; RAM; LSH on pixels; DISCO (object detection)
20 [Figure: score distributions for the baseline feature sets (Basic, BASS, DISCO, LSH, RAM) — average baseline scores, median baseline scores, median inter-algorithm scores, and fraction of games vs. inter-algorithm score]
21 DEEP Q-NETWORKS (DQN) Mnih et al., 2015 Image credit: Volodymyr Mnih
22 [DQN pipeline: grayscale → pool + downsample → frame stack → DQN → policy; replay memory; target network; learning algorithm, T Q = r + γ P max Q]
23 LEARNING ALGORITHM: L(θ) = ½ ( r + γ max_{a′∈A} Q_θ̄(x′, a′) − Q_θ(x, a) )², with backup T Q = r + γ P max Q
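As a concrete illustration, here is a minimal NumPy sketch of this squared TD loss for a single transition. The tabular Q array, the `dqn_loss` helper, and the toy numbers are hypothetical stand-ins for the online network Q_θ and the target network Q_θ̄:

```python
import numpy as np

def dqn_loss(q_online, q_target, transition, gamma=0.99):
    """Squared TD error for one transition (x, a, r, x_next).
    q_online / q_target map a state to a vector of Q-values, one per
    action, standing in for the online and target networks."""
    x, a, r, x_next = transition
    target = r + gamma * np.max(q_target(x_next))  # T Q = r + gamma max Q
    td_error = target - q_online(x)[a]
    return 0.5 * td_error ** 2

# Toy tabular "network": two states, two actions.
Q = np.array([[0.0, 1.0],
              [2.0, 0.5]])
q_fn = lambda x: Q[x]
loss = dqn_loss(q_fn, q_fn, (0, 1, 1.0, 1), gamma=0.5)
# target = 1 + 0.5 * 2 = 2.0; prediction = 1.0; loss = 0.5 * 1^2 = 0.5
```

In the full agent the same expression is evaluated on a minibatch sampled from replay memory, with Q_θ̄ held fixed between target-network updates.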
24 Bellemare, Ostrovski, Guez, Thomas, Munos, 2015
25 [DQN pipeline: grayscale → pool + downsample → frame stack → DQN → policy; replay memory; target network; learning algorithm, T Q = r + γ P max Q]
26 Network variants — LSTM layer (Hausknecht and Stone, 2015); recurrent network (Mnih et al., 2016). [DQN pipeline diagram]
27 More variants — dueling networks (Wang et al., 2016); asynchronous actor-critic (Mnih et al., 2016); bootstrap Q-functions (Osband et al., 2016); noisy networks (Fortunato et al.; Plappert et al., 2017). [DQN pipeline diagram]
28 Replay-memory variants — prioritized replay (Schaul et al., 2016); importance sampling (Gruslys et al., 2018); off-policy + corrections (Gelada et al., in prep.). [DQN pipeline diagram]
29 Learning-algorithm variants — double Q-learning (van Hasselt et al., 2015); advantage learning (Bellemare et al., 2015); Q(λ) (Harutyunyan et al., 2016); Retrace (Munos et al., 2016); distributional methods (Bellemare et al., 2017; Dabney et al., 2018; …). [DQN pipeline diagram]
30 1. Distributional reinforcement learning 2. Exploration with pseudo-counts
31 [Example: a state x contrasting the expected return E R(x) with individual returns R(x); values shown: −$2000, $200]
32 Return: R(x₀) + γ R(x₁) + γ² R(x₂) + …   BELLMAN EQUATION: V(x) = E R(x) + γ E_{x′∼P} V(x′)
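The Bellman equation can be checked numerically by iterating the backup V ← R + γ P V to its fixed point. The two-state chain below is an invented toy example, not from the talk:

```python
import numpy as np

# Toy two-state chain illustrating V(x) = E[R(x)] + gamma * E_{x'~P}[V(x')].
R = np.array([1.0, 0.0])             # expected reward in each state
P = np.array([[0.0, 1.0],            # row x: distribution over next states x'
              [0.0, 1.0]])           # state 1 self-loops with zero reward
gamma = 0.5

V = np.zeros(2)
for _ in range(100):                 # iterate the Bellman operator to its fixed point
    V = R + gamma * P @ V
# converges to V = [1.0, 0.0]
```

Because the operator is a γ-contraction, any starting V converges to the same fixed point.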
33 GROUND TRUTH: V(x) = E R(x) + γ E_{x′∼P} V(x′)   IMPLIED MODEL (Schoknecht, 2002; Parr et al., 2008; Sutton et al., 2008)
34 IMPLIED MODEL: V(x) = E[ R(x₀) + γ R(x₁) + γ² R(x₂) + … | x₀ = x, aₜ ∼ π ] = E[R(x₀)] + γ E[R(x₁)] + γ² E[R(x₂)] + …
35 VENESS, BELLEMARE, HUTTER, CHUA, DESJARDINS (2015)
36 V(x) = E[ R(x₀) + γ R(x₁) + γ² R(x₂) + … ] = E Z(x)   VENESS, BELLEMARE, HUTTER, CHUA, DESJARDINS (2015)
37 BELLMAN EQUATION: V(x) = E Z(x) = E R(x) + γ E Z(X′).
38 DISTRIBUTIONAL BELLMAN EQUATION: Z(x) =_D R(x) + γ Z(X′)  (equality in distribution)
39 VALUE DISTRIBUTION: Z(x) =_D R(x) + γ Z(X′). Example: from x, with probability p(x′₁ | x) = 0.5 and r = 0 the agent reaches x′₁ (distribution Z(x′₁)); with probability p(x′₂ | x) = 0.5 and r = 1 it reaches x′₂ (distribution Z(x′₂)).
Bellman (1957): Bellman equation for the mean; Sobel (1982): for the variance; Engel (2003): for Bayesian uncertainty; Azar et al. (2011), Lattimore & Hutter (2012): for higher moments; Morimura et al. (2010, 2010b): for densities
40 $150 $300 + $200
41 $450 $300 + $200
42 $450 + $300 $200 +
43 $450 + $300 $200 +
44 C51: the network outputs a discrete distribution p(z | x, a) in place of Q(x, a). [Pipeline: grayscale → pool + downsample → stack → C51 → policy; replay memory; target network; learning algorithm, T Q = r + γ P max Q]
45 Policy: ε-greedy w.r.t. the expected value of the predicted distribution. [DQN pipeline diagram]
46 APPROXIMATION PROJECTION
47 1. From x, a, sample a transition: r, X′, A′ ∼ R(x, a), P(· | x, a), π(X′)  (a)
2. Compute the sample backup: T̂ Z(x, a) := r + γ Z(X′, A′)  (b)
3. Project onto the approximation support  (c)
4. Update towards the projection, i.e. take a KL-minimizing step  (d)
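Step 3, the categorical projection, is where the backed-up atoms r + γz are mapped back onto the fixed support. Below is a minimal sketch assuming evenly spaced atoms; the function name and the toy support are ours:

```python
import numpy as np

def categorical_projection(probs_next, r, gamma, support):
    """Project the backed-up distribution r + gamma * Z(x', a') onto the
    fixed support {z_1..z_N} (step 3 above). probs_next holds the
    probabilities of Z(x', a') on `support`, assumed evenly spaced."""
    v_min, v_max = support[0], support[-1]
    dz = support[1] - support[0]
    proj = np.zeros_like(probs_next)
    for p, z in zip(probs_next, support):
        tz = np.clip(r + gamma * z, v_min, v_max)   # shifted, clipped atom
        b = (tz - v_min) / dz                       # fractional index on support
        lo, hi = int(np.floor(b)), int(np.ceil(b))
        if lo == hi:
            proj[lo] += p                           # lands exactly on an atom
        else:
            proj[lo] += p * (hi - b)                # split mass linearly
            proj[hi] += p * (b - lo)
    return proj

support = np.linspace(-1.0, 1.0, 5)            # atoms: -1, -0.5, 0, 0.5, 1
p_next = np.array([0.0, 0.0, 1.0, 0.0, 0.0])   # Z(x', a') = 0 with prob. 1
proj = categorical_projection(p_next, r=0.25, gamma=1.0, support=support)
# atom 0 shifts to 0.25 and is split evenly between atoms 0.0 and 0.5
```

The projection preserves total probability mass; step 4 then takes a gradient step on the cross-entropy (KL) between `proj` and the predicted distribution.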
48
49
50 Mean and median human-normalized scores; >H.B. / >DQN: number of games above the human baseline / above DQN.
          Mean   Median  >H.B.  >DQN
DQN       228%    79%     24      0
DDQN      307%   118%
DUEL.     373%   151%
PRIOR.    434%   124%
PR. DUEL. 592%   172%
C51       701%   178%     40     50
51
52
53
54 DISTRIBUTIONAL PERSPECTIVE (Bellemare et al., 2016); QUANTILE REGRESSION (Dabney et al., 2017); CRAMER DISTANCE (Bellemare et al., 2017); IMPLICIT QUANTILE NETWORKS (Dabney et al., 2018); ANALYSIS OF C51 (Rowland et al., 2017); DISTRIBUTIONAL ACTOR-CRITIC (Barth-Maron et al., 2018); AS EXPECTED (Lyle et al., 2018, PGMRL workshop, ICML); GENERATIVE QUANTILE NETWORKS (Ostrovski et al., 2018); THE RAINBOW AGENT (Hessel et al., 2018)
55 1. Distributional reinforcement learning 2. Exploration with pseudo-counts
56 Pong Seaquest Montezuma's Revenge September 2017
57
58
59 ?
60 EXPLORATION
Bellman equation: Q(x, a) = r(x, a) + γ E_{x′∼P} max_{a′∈A} Q(x′, a′)
Exploration bonus (Strehl and Littman, 2008): Q(x, a) = r̂(x, a) + γ E_{x′∼P̂} max_{a′∈A} Q(x′, a′) + β / √N(x, a)
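A minimal sketch of the Strehl-and-Littman-style bonus β / √N(x, a) with tabulated visit counts; the `bonus` helper, the β default, and the convention for unvisited pairs are illustrative assumptions:

```python
import math
from collections import Counter

def bonus(counts, x, a, beta=1.0):
    """Optimism bonus beta / sqrt(N(x, a)). Unvisited state-action
    pairs get the maximal bonus (a convention chosen here)."""
    n = counts[(x, a)]          # Counter returns 0 for unseen keys
    return beta / math.sqrt(n) if n > 0 else beta

counts = Counter()
counts[("s0", "left")] = 4
b = bonus(counts, "s0", "left")   # 1 / sqrt(4) = 0.5
```

The bonus shrinks as a pair is revisited, so the agent is steered towards under-explored parts of the state space.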
61 The Cliff: each step gives r = −1; falling off the cliff gives r = −100. The safe path avoids the cliff; the optimal path from S to G runs along its edge. (Shown: state, transition fn., reward fn., optimal policy.)
62 Most observations experienced once: up to 90% singletons (Blundell et al., 2016)
63 GENERATIVE MODEL Train Generate
64 GENERATIVE MODEL: Train → Generate.   DENSITY MODEL: Train → Query ρₙ(x).
Assumptions: 1. density ≈ frequency; 2. model quality = generalization
65 ρₙ(x) = N̂ₙ(x) / n̂,   ρ′ₙ(x) = (N̂ₙ(x) + 1) / (n̂ + 1)   ⇒   N̂ₙ(x) = ρₙ(x) (1 − ρ′ₙ(x)) / (ρ′ₙ(x) − ρₙ(x))
Bellemare, Srinivasan, Ostrovski, Schaul, Saxton, Munos, 2016
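The two density evaluations can be inverted for N̂ₙ(x) directly. The sketch below checks the algebra by recovering a true count from the exact empirical density (the helper name is ours):

```python
def pseudo_count(rho, rho_prime):
    """Invert the density before/after one update on x:
    rho = N/n, rho' = (N + 1)/(n + 1)
    =>  N = rho * (1 - rho') / (rho' - rho)."""
    return rho * (1.0 - rho_prime) / (rho_prime - rho)

# Sanity check: with an exact empirical density, the pseudo-count
# recovers the true count (N = 3 visits out of n = 10 steps).
N, n = 3, 10
rho, rho_prime = N / n, (N + 1) / (n + 1)
nhat = pseudo_count(rho, rho_prime)   # 3.0 up to float error
```

With a learned density model the recovered N̂ₙ(x) is no longer an exact count, which is what makes it a *pseudo*-count usable over raw Atari frames.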
66 CODING PROBABILITY (DENSITY) ρₙ(x) → TRAIN on x → RECODING PROBABILITY ρ′ₙ(x)
67 THE CTS MODEL: ρₙ(x) := ∏_{i=1}^{K} ρⁱₙ(xᵢ | x_{<i}), a product of per-pixel factors, each pixel xᵢ,ⱼ predicted from its neighbours
68
69 [Plot: pseudo-counts vs. frames (1000s) — pseudo-counts of the start position and of a salient event; flat during periods without salient events, jumping at salient events]
70 EXPLORATION
Bellman equation: Q(x, a) = r(x, a) + γ E_{x′∼P} max_{a′∈A} Q(x′, a′)
Exploration bonus (Strehl and Littman, 2008): Q(x, a) = r̂(x, a) + γ E_{x′∼P̂} max_{a′∈A} Q(x′, a′) + β / √N(x, a)
Pseudo-count bonus: Q(x, a) = r̂(x, a) + γ E_{x′∼P̂} max_{a′∈A} Q(x′, a′) + β / √N̂(x)
71
72 [Figure: rooms reached from START to END by DQN + bonus vs. DQN]
73 [Plot: average score vs. training frames (millions)]
74 CREDIT ASSIGNMENT ISSUES IN EXPLORATION
Another important key factor: fast credit assignment.
Mixed Monte-Carlo (MMC) update:
Q(x_t, a_t) = q [ r⁺(x_t, a_t) + γ max_{a′∈A} Q(x_{t+1}, a′) ] + (1 − q) Σ_{i=0}^{∞} γⁱ r⁺(x_{t+i}, a_{t+i})
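The mixed Monte-Carlo update interpolates between the one-step TD target and the full Monte-Carlo return. A minimal sketch for a finite episode tail; the mixing convention q·TD + (1 − q)·MC, the helper name, and the defaults are assumptions for illustration, with r⁺ the bonus-augmented reward:

```python
def mmc_target(rewards, q_next_max, gamma=0.99, q=0.1):
    """Mixed Monte-Carlo target: q * (one-step TD target)
    + (1 - q) * (discounted Monte-Carlo return of the episode tail).
    `rewards` holds the bonus-augmented rewards r+ from time t onward."""
    td = rewards[0] + gamma * q_next_max                  # one-step backup
    mc = sum(gamma ** i * r for i, r in enumerate(rewards))  # MC return
    return q * td + (1 - q) * mc

target = mmc_target([1.0, 1.0], q_next_max=2.0, gamma=0.5, q=0.5)
# td = 1 + 0.5 * 2 = 2.0; mc = 1 + 0.5 * 1 = 1.5; target = 1.75
```

The Monte-Carlo term propagates a rare bonus all the way back along the trajectory in a single update, which is what speeds up credit assignment.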
75 EFFECT OF MIXED MONTE CARLO UPDATE
76 REMOVING EXTRINSIC REWARDS
77
78 [Figure: Atari games grouped by release year and agent performance — super-human agents, scoring exploits, sub-human agents]
80 & before (15 games): 93% super-human, 7% sub-human
81–83 (36 games): 27% sub-human, 8% exploit
84 and after (6 games): only 1/6 super-human
Games shown: Breakout, Space Invaders, Asteroids, Bowling, Freeway, Ice Hockey, Tennis, Ms. Pac-Man, Assault, Asterix, Beam Rider, Elevator Action, Enduro, James Bond, Seaquest, Up N Down, Bank Heist, Krull, Frostbite, Private Eye, Road Runner, Q*Bert, Venture, Pong, Video Pinball, Battlezone, Berzerk, Boxing, Carnival, Centipede, Crazy Climber, Fishing Derby, Phoenix, Skiing, Amidar, Atlantis, Chopper Cmd., Demon Attack, Gopher, Name This Game, Pooyan, River Raid, Star Gunner, Time Pilot, Yar's Revenge, Zaxxon, Tutankham, Alien, Gravitar, Journey Escape, Kangaroo, Pitfall!, Kung-Fu Master, H.E.R.O., Montezuma's Revenge, Solaris, Double Dunk
79 Deep Reinforcement Learning and the Atari 2600 MARC G. BELLEMARE Google Brain
More informationReinforcement Learning (2)
Reinforcement Learning (2) Bruno Bouzy 1 october 2013 This document is the second part of the «Reinforcement Learning» chapter of the «Agent oriented learning» teaching unit of the Master MI computer course.
More informationArtificial Neural Networks. Introduction to Computational Neuroscience Ardi Tampuu
Artificial Neural Networks Introduction to Computational Neuroscience Ardi Tampuu 7.0.206 Artificial neural network NB! Inspired by biology, not based on biology! Applications Automatic speech recognition
More informationReinforcement Learning and Optimal Control. ASU, CSE 691, Winter 2019
Reinforcement Learning and Optimal Control ASU, CSE 691, Winter 2019 Dimitri P. Bertsekas dimitrib@mit.edu Lecture 1 Bertsekas Reinforcement Learning 1 / 21 Outline 1 Introduction, History, General Concepts
More informationGeneralized Inverse Reinforcement Learning
Generalized Inverse Reinforcement Learning James MacGlashan Cogitai, Inc. james@cogitai.com Michael L. Littman mlittman@cs.brown.edu Nakul Gopalan ngopalan@cs.brown.edu Amy Greenwald amy@cs.brown.edu Abstract
More informationarxiv: v1 [cs.lg] 22 Jul 2018
NAVREN-RL: Learning to fly in real environment via end-to-end deep reinforcement learning using monocular images Malik Aqeel Anwar 1, Arijit Raychowdhury 2 Department of Electrical and Computer Engineering
More informationGradient Reinforcement Learning of POMDP Policy Graphs
1 Gradient Reinforcement Learning of POMDP Policy Graphs Douglas Aberdeen Research School of Information Science and Engineering Australian National University Jonathan Baxter WhizBang! Labs July 23, 2001
More informationPartially Observable Markov Decision Processes. Mausam (slides by Dieter Fox)
Partially Observable Markov Decision Processes Mausam (slides by Dieter Fox) Stochastic Planning: MDPs Static Environment Fully Observable Perfect What action next? Stochastic Instantaneous Percepts Actions
More informationPlanning and Control: Markov Decision Processes
CSE-571 AI-based Mobile Robotics Planning and Control: Markov Decision Processes Planning Static vs. Dynamic Predictable vs. Unpredictable Fully vs. Partially Observable Perfect vs. Noisy Environment What
More informationDeep Reinforcement Learning
Deep Reinforcement Learning 1 Outline 1. Overview of Reinforcement Learning 2. Policy Search 3. Policy Gradient and Gradient Estimators 4. Q-prop: Sample Efficient Policy Gradient and an Off-policy Critic
More informationDeep Q-Learning to play Snake
Deep Q-Learning to play Snake Daniele Grattarola August 1, 2016 Abstract This article describes the application of deep learning and Q-learning to play the famous 90s videogame Snake. I applied deep convolutional
More informationUsing Linear Programming for Bayesian Exploration in Markov Decision Processes
Using Linear Programming for Bayesian Exploration in Markov Decision Processes Pablo Samuel Castro and Doina Precup McGill University School of Computer Science {pcastr,dprecup}@cs.mcgill.ca Abstract A
More informationCellular Network Traffic Scheduling using Deep Reinforcement Learning
Cellular Network Traffic Scheduling using Deep Reinforcement Learning Sandeep Chinchali, et. al. Marco Pavone, Sachin Katti Stanford University AAAI 2018 Can we learn to optimally manage cellular networks?
More informationGated Path Planning Networks
Lisa Lee *1 Emilio Parisotto *1 Devendra Singh Chaplot 1 Eric Xing 1 Ruslan Salakhutdinov 1 Abstract Value Iteration Networks (VINs) are effective differentiable path planning modules that can be used
More informationObject-sensitive Deep Reinforcement Learning
EPiC Series in Computing Volume 50, 2017, Pages 20 35 GCAI 2017. 3rd Global Conference on Artificial Intelligence Object-sensitive Deep Reinforcement Learning Yuezhang Li, Katia Sycara, and Rahul Iyer
More informationHierarchical Average Reward Reinforcement Learning
Journal of Machine Learning Research 8 (2007) 2629-2669 Submitted 6/03; Revised 8/07; Published 11/07 Hierarchical Average Reward Reinforcement Learning Mohammad Ghavamzadeh Department of Computing Science
More informationarxiv: v2 [cs.ro] 20 Apr 2017
Learning a Unified Control Policy for Safe Falling Visak C.V. Kumar 1, Sehoon Ha 2, C. Karen Liu 1 arxiv:1703.02905v2 [cs.ro] 20 Apr 2017 Abstract Being able to fall safely is a necessary motor skill for
More informationResource Allocation for a Wireless Coexistence Management System Based on Reinforcement Learning
Resource Allocation for a Wireless Coexistence Management System Based on Reinforcement Learning WCS which cause independent interferences. Such coexistence managements require approaches, which mitigate
More informationClick to edit Master title style Approximate Models for Batch RL Click to edit Master subtitle style Emma Brunskill 2/18/15 2/18/15 1 1
Approximate Click to edit Master titlemodels style for Batch RL Click to edit Emma Master Brunskill subtitle style 11 FVI / FQI PI Approximate model planners Policy Iteration maintains both an explicit
More informationHierarchical Average Reward Reinforcement Learning
Hierarchical Average Reward Reinforcement Learning Mohammad Ghavamzadeh mgh@cs.ualberta.ca Department of Computing Science, University of Alberta, Edmonton, AB T6G 2E8, CANADA Sridhar Mahadevan mahadeva@cs.umass.edu
More informationPlanning and Reinforcement Learning through Approximate Inference and Aggregate Simulation
Planning and Reinforcement Learning through Approximate Inference and Aggregate Simulation Hao Cui Department of Computer Science Tufts University Medford, MA 02155, USA hao.cui@tufts.edu Roni Khardon
More informationAsynchronous Advantage Actor- Critic with Adam Optimization and a Layer Normalized Recurrent Network
DEGREE PROJECT IN MATHEMATICS, SECOND CYCLE, 30 CREDITS STOCKHOLM, SWEDEN 2017 Asynchronous Advantage Actor- Critic with Adam Optimization and a Layer Normalized Recurrent Network JOAKIM BERGDAHL KTH ROYAL
More informationMulti-View 3D Object Detection Network for Autonomous Driving
Multi-View 3D Object Detection Network for Autonomous Driving Xiaozhi Chen, Huimin Ma, Ji Wan, Bo Li, Tian Xia CVPR 2017 (Spotlight) Presented By: Jason Ku Overview Motivation Dataset Network Architecture
More informationarxiv: v1 [cs.lg] 17 Jun 2018
Lisa Lee * 1 Emilio Parisotto * 1 Devendra Singh Chaplot 1 Eric Xing 1 Ruslan Salakhutdinov 1 arxiv:1806.06408v1 [cs.lg] 17 Jun 2018 Abstract Value Iteration Networks (VINs) are effective differentiable
More informationThe Cross-Entropy Method for Mathematical Programming
The Cross-Entropy Method for Mathematical Programming Dirk P. Kroese Reuven Y. Rubinstein Department of Mathematics, The University of Queensland, Australia Faculty of Industrial Engineering and Management,
More informationarxiv: v2 [cs.ai] 17 Aug 2017
Schema Networks: Zero-shot Transfer with a Generative Causal Model of Intuitive Physics Ken Kansky Tom Silver David A. Mély Mohamed Eldawy Miguel Lázaro-Gredilla Xinghua Lou Nimrod Dorfman Szymon Sidor
More informationarxiv: v1 [cs.cv] 20 May 2018
Unsupervised Video Object Segmentation for Deep Reinforcement Learning arxiv:1805.07780v1 [cs.cv] 20 May 2018 Vik Goel School of Computer Science University of Waterloo v5goel@uwaterloo.ca Jameson Weng
More informationNeural Networks and Tree Search
Mastering the Game of Go With Deep Neural Networks and Tree Search Nabiha Asghar 27 th May 2016 AlphaGo by Google DeepMind Go: ancient Chinese board game. Simple rules, but far more complicated than Chess
More informationarxiv: v2 [cs.ai] 21 Sep 2017
Learning to Drive using Inverse Reinforcement Learning and Deep Q-Networks arxiv:161203653v2 [csai] 21 Sep 2017 Sahand Sharifzadeh 1 Ioannis Chiotellis 1 Rudolph Triebel 1,2 Daniel Cremers 1 1 Department
More informationChapter 10. Conclusion Discussion
Chapter 10 Conclusion 10.1 Discussion Question 1: Usually a dynamic system has delays and feedback. Can OMEGA handle systems with infinite delays, and with elastic delays? OMEGA handles those systems with
More informationReinforcement Learning: A brief introduction. Mihaela van der Schaar
Reinforcement Learning: A brief introduction Mihaela van der Schaar Outline Optimal Decisions & Optimal Forecasts Markov Decision Processes (MDPs) States, actions, rewards and value functions Dynamic Programming
More informationDetecting Salient Contours Using Orientation Energy Distribution. Part I: Thresholding Based on. Response Distribution
Detecting Salient Contours Using Orientation Energy Distribution The Problem: How Does the Visual System Detect Salient Contours? CPSC 636 Slide12, Spring 212 Yoonsuck Choe Co-work with S. Sarma and H.-C.
More informationMobile Robot Obstacle Avoidance based on Deep Reinforcement Learning
Mobile Robot Obstacle Avoidance based on Deep Reinforcement Learning by Shumin Feng Thesis submitted to the faculty of the Virginia Polytechnic Institute and State University in partial fulfillment of
More informationA DECENTRALIZED REINFORCEMENT LEARNING CONTROLLER FOR COLLABORATIVE DRIVING. Luke Ng, Chris Clark, Jan P. Huissoon
A DECENTRALIZED REINFORCEMENT LEARNING CONTROLLER FOR COLLABORATIVE DRIVING Luke Ng, Chris Clark, Jan P. Huissoon Department of Mechanical Engineering, Automation & Controls Group University of Waterloo
More informationAlgorithms for Solving RL: Temporal Difference Learning (TD) Reinforcement Learning Lecture 10
Algorithms for Solving RL: Temporal Difference Learning (TD) 1 Reinforcement Learning Lecture 10 Gillian Hayes 8th February 2007 Incremental Monte Carlo Algorithm TD Prediction TD vs MC vs DP TD for control:
More informationKnowledge-Defined Networking: Towards Self-Driving Networks
Knowledge-Defined Networking: Towards Self-Driving Networks Albert Cabellos (UPC/BarcelonaTech, Spain) albert.cabellos@gmail.com 2nd IFIP/IEEE International Workshop on Analytics for Network and Service
More informationFoundations of Artificial Intelligence
Foundations of Artificial Intelligence 45. AlphaGo and Outlook Malte Helmert and Gabriele Röger University of Basel May 22, 2017 Board Games: Overview chapter overview: 40. Introduction and State of the
More informationJustin A. Boyan and Andrew W. Moore. Carnegie Mellon University. Pittsburgh, PA USA.
Learning Evaluation Functions for Large Acyclic Domains Justin A. Boyan and Andrew W. Moore Computer Science Department Carnegie Mellon University Pittsburgh, PA 15213 USA jab@cs.cmu.edu, awm@cs.cmu.edu
More informationDeictic Image Mapping: An Abstraction For Learning Pose Invariant Manipulation Policies
Deictic Image Mapping: An Abstraction For Learning Pose Invariant Manipulation Policies Robert Platt, Colin Kohler, Marcus Gualtieri College of Computer and Information Science, Northeastern University
More informationExploring Boosted Neural Nets for Rubik s Cube Solving
Exploring Boosted Neural Nets for Rubik s Cube Solving Alexander Irpan Department of Computer Science University of California, Berkeley alexirpan@berkeley.edu Abstract We explore whether neural nets can
More informationExperiments with Tensor Flow
Experiments with Tensor Flow 06.07.2017 Roman Weber (Geschäftsführer) Richard Schmid (Senior Consultant) A Smart Home? 2 WEBGATE WELTWEIT WebGate USA Boston WebGate Support Center Brno, Tschechische Republik
More informationarxiv: v1 [cs.ai] 1 Dec 2016
Playing Doom with SLAM-Augmented Deep Reinforcement Learning Shehroze Bhatti Alban Desmaison Ondrej Miksik Nantas Nardelli N. Siddharth Philip H. S. Torr University of Oxford arxiv:1612.00380v1 [cs.ai]
More informationUtile Distinctions for Relational Reinforcement Learning
Utile Distinctions for Relational Reinforcement Learning William Dabney and Amy McGovern University of Oklahoma School of Computer Science amarack@ou.edu and amcgovern@ou.edu Abstract We introduce an approach
More informationEnsemble of Specialized Neural Networks for Time Series Forecasting. Slawek Smyl ISF 2017
Ensemble of Specialized Neural Networks for Time Series Forecasting Slawek Smyl slawek@uber.com ISF 2017 Ensemble of Predictors Ensembling a group predictors (preferably diverse) or choosing one of them
More informationValue Iteration. Reinforcement Learning: Introduction to Machine Learning. Matt Gormley Lecture 23 Apr. 10, 2019
10-601 Introduction to Machine Learning Machine Learning Department School of Computer Science Carnegie Mellon University Reinforcement Learning: Value Iteration Matt Gormley Lecture 23 Apr. 10, 2019 1
More informationover Multi Label Images
IBM Research Compact Hashing for Mixed Image Keyword Query over Multi Label Images Xianglong Liu 1, Yadong Mu 2, Bo Lang 1 and Shih Fu Chang 2 1 Beihang University, Beijing, China 2 Columbia University,
More informationDeep-Reinforcement Learning Multiple Access for Heterogeneous Wireless Networks
Deep-Reinforcement Learning Multiple Access for Heterogeneous Wireless Networks Yiding Yu, Taotao Wang, Soung Chang Liew arxiv:1712.00162v1 [cs.ni] 1 Dec 2017 Abstract This paper investigates the use of
More informationApplication of Deep Learning Techniques in Satellite Telemetry Analysis.
Application of Deep Learning Techniques in Satellite Telemetry Analysis. Greg Adamski, Member of Technical Staff L3 Technologies Telemetry and RF Products Julian Spencer Jones, Spacecraft Engineer Telenor
More information