Deep Reinforcement Learning and the Atari 2600 MARC G. BELLEMARE Google Brain


1 Deep Reinforcement Learning and the Atari 2600 MARC G. BELLEMARE Google Brain

2 DEEP REINFORCEMENT LEARNING Inputs Hidden layers Output Backgammon (Tesauro, 1994) Elevator control (Crites and Barto, 1996) Helicopter control (Ng et al., 2003) Transfer in tic-tac-toe (Rivest and Precup, 2003; Bellemare et al., unpublished) Atari game-playing (Mnih et al., 2015) Go (Silver et al., 2016)

3 Deep RL Inputs Hidden layers Output Challenge domains Algorithms

4 SOME CHALLENGES IN DEEP RL Training data is not i.i.d. Hard to get confidence intervals Potential divergence issues (e.g. van Hasselt et al., 2015) State space often unknown Model-free methods reign supreme (simulation a plus)

5 Diuk et al., 2008; Barbados RL Workshop, 2008; Naddaf, 2008; Bellemare, Naddaf, Veness, Bowling, 2013; Mnih et al., 2013, 2015; Hausknecht et al., 2014; Lipovetzky et al., 2015

6

7 210 x 160 pixels, 60 Hz, 128 bytes of RAM

8 Observation Action The Arcade Learning Environment (Bellemare et al., 2013)

9 Observation Action Reward The Arcade Learning Environment (Bellemare et al., 2013)

10

11 GENERAL COMPETENCY CC Wikipedia

12 NARROW COMPETENCY CC Wikipedia

13 Diverse Interesting Independent

14 Diverse We argue that reinforcement learning is particularly vulnerable to environment overfitting and propose as a remedy generalized methodologies [ ] Whiteson et al., 2011

15 Diverse

16 Diverse

17 Interesting (to people) CC Wikipedia

18 Independent (by and for people)

19 EARLY ATTEMPTS ( ) BASS/Basic Features RAM LSH on pixels DISCO: Object Detection

20 [Figure: baseline results for the Basic, BASS, DISCO, LSH, and RAM feature sets. Panels: Average Baseline Scores, Median Baseline Scores, Median Inter-Algorithm Scores, and Score Distribution (fraction of games vs. inter-algorithm score).]

21 DEEP Q-NETWORKS (DQN) Mnih et al., 2015 Image credit: Volodymyr Mnih

22 [DQN pipeline diagram: grayscale, pool + downsample, frame stack, DQN network, policy; replay memory; target network; learning algorithm with Bellman backup $\mathcal{T}Q = r + \gamma P \max_{a'} Q$.]
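As a rough companion to the diagram, the sketch below shows the grayscale / downsample / frame-stack preprocessing in plain NumPy. The helper names (`preprocess`, `FrameStack`), the 84x84 output size, and the nearest-neighbour downsampling are illustrative assumptions rather than the exact published pipeline.

```python
import numpy as np
from collections import deque

def preprocess(frame_rgb, out_size=(84, 84)):
    """Grayscale and downsample one 210x160x3 Atari frame (nearest neighbour)."""
    gray = frame_rgb.mean(axis=2)
    ys = np.linspace(0, gray.shape[0] - 1, out_size[0]).astype(int)
    xs = np.linspace(0, gray.shape[1] - 1, out_size[1]).astype(int)
    return gray[np.ix_(ys, xs)].astype(np.uint8)

class FrameStack:
    """Keep the last k preprocessed frames as the network input."""
    def __init__(self, k=4):
        self.frames = deque(maxlen=k)

    def push(self, frame_rgb):
        self.frames.append(preprocess(frame_rgb))
        while len(self.frames) < self.frames.maxlen:   # pad at episode start
            self.frames.append(self.frames[-1])
        return np.stack(self.frames, axis=0)           # shape (k, 84, 84)
```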

23 <latexit sha1_base64="o22ij337m584pl9kikl/mk0r0ni=">aaaichicfzvfbxtfemdppdql0jlciy8rrcojpjevqgivvwqouue4wumtne3wwht7e/yqu3fh7rry2e4x4npwhnjlw/dcz2huj5p7u3qsfxpzm9mzndk/qsq4nv3+p51b791+/0737gdrh3507/7h6w8+odxjxff2qhorqloaacz4ze4mn4kdpyorgqj2mrh8lvgxr5nspimpztjly0mmmy84jqzuk/ufpsnnbgbmkc30bgetkukt7+yuq6bnc5phsigym2cv+hlhkzgsiczjymljbsi8rpjuoxq4sdhwetjujoc2fxvbigxsofqiykhcxibqa+ux3cl6r7/tzx/ufvxs6hnlm5o8ulpayulnkswgcql1hd9pzdgsztgvzk3huwypozdkyi5ajilkemzz9b16cjoqrymcx2xqrq16wck1xsoalcuxm91kmfl/mjnjwnsbwskd6uzojvpmbhmczg2lazfsnbfijcjrdaq5ytsijqiekg6zqnrgob0g+lepkayjgdtdxqs5rqpfhozxjrkfr3mqyyitiirvnaiyfq3h7deaql/j0gl6wlmczy0syv8416bnfxrwoq8q9fwl7lxoxosevehriw4qdnci5xv63ql7fbrfoscvetyiwwodumax9gefhzad9wmfnracg4msb4e9apqmtmksa6rsqovna6laypsn3m7mhjyiv4vpjnngvs1qns1iz2g3+4/jlnhdhliiybziyb2i60s/zxidmnibiu2d6mdiljhefwflu2clwzcymrvyjba4nnip7lvbsthobjccnjbyhlznkisyav7zqqphlkb2kh9nhxut6xo2jtyr3rucamie39jd941hecmqhdzan4zuvxc41xlfpbum2jjkmavx06tmf8qkuzd1qghl2wejsgruzfln2ea2x8vhsxtxagvgwh9blfhkeh6dej7miu9n2uejldyo3m0/yuzmsugwn/e5t/mo6pnnbve4hbmnny6xwo9uuga7qmqa5xv7+wqbjse5rdj44+1mepchnky0bgm1ikbpuzgsknb0 r L( ) L( ) LEARNING ALGORITHM L( ) = 1 2 (r + max Q (x 0,a 0 ) Q (x, a)) 2 a 0 2A T Q = r + P max Q

24 Bellemare, Ostrovski, Guez, Thomas, Munos, 2015

25 [DQN pipeline diagram repeated: grayscale, pool + downsample, frame stack, DQN network, policy; replay memory; target network; learning algorithm; $\mathcal{T}Q = r + \gamma P \max_{a'} Q$.]

26 Network variants highlighted on the DQN pipeline diagram: LSTM layer (Hausknecht and Stone, 2015); recurrent network (Mnih et al., 2016). [Same pipeline: grayscale, pool + downsample, stack, policy, replay memory, target network, learning algorithm; $\mathcal{T}Q = r + \gamma P \max_{a'} Q$.]

27 Policy and architecture variants highlighted on the DQN pipeline diagram: dueling networks (Wang et al., 2016); asynchronous actor-critic (Mnih et al., 2016); bootstrapped Q-functions (Osband et al., 2016); noisy networks (Fortunato et al., 2017; Plappert et al., 2017). [Same pipeline; $\mathcal{T}Q = r + \gamma P \max_{a'} Q$.]

28 Replay-memory variants highlighted on the DQN pipeline diagram: prioritized replay (Schaul et al., 2016); importance sampling (Gruslys et al., 2018); off-policy corrections (Gelada et al., in prep.). [Same pipeline; $\mathcal{T}Q = r + \gamma P \max_{a'} Q$.]

29 Learning-algorithm variants highlighted on the DQN pipeline diagram: double Q-learning (van Hasselt et al., 2015); advantage learning (Bellemare et al., 2015); Q(λ) (Harutyunyan et al., 2016); Retrace (Munos et al., 2016); distributional methods (Bellemare et al., 2017; Dabney et al., 2018; ...). [Same pipeline; $\mathcal{T}Q = r + \gamma P \max_{a'} Q$.]

30 1. Distributional reinforcement learning 2. Exploration with pseudo-counts

31 [Figure: a state $x$ with expected reward $\mathbb{E}\,R(x) = -\$2000$, while the realized reward $R(x)$ is a random dollar amount.]

32 Return: $R(x_0) + \gamma R(x_1) + \gamma^2 R(x_2) + \cdots$ BELLMAN EQUATION: $V(x) = \mathbb{E}\,R(x) + \gamma\,\mathbb{E}_{x' \sim P}\,V(x')$

33 GROUND TRUTH: $V(x) = \mathbb{E}\,R(x) + \gamma\,\mathbb{E}_{x' \sim P}\,V(x')$ vs. IMPLIED MODEL (SCHOKNECHT, 2002; PARR ET AL., 2008; SUTTON ET AL., 2008)

34 <latexit sha1_base64="gehkvpxgpdusdoiecww/ybk0nco=">aaaiihicfzvbb9s2fmflbus875a2j30hanritycwggfdowtoxum3lmmcnhhthi5aubrnrjq0km7lmhrb6/b99k32ukobrutaabbi8/sfnqpdmxv5xonb4n/oru8+/ez2590vel9+9fu3327cuttw4ujsdkjdp5snllhm5we70vz77dssjajxz2/cy5cpf/ooscxd4fgvizyrzbbwkadeg8nz+g98gso+gt9c3+0izokiuy2wz6yanaojzdgzpei/idwjqpcsb6/7fzuzzsezuc/ucl4ukqi80c6ktxbxyddfbyiyces+m2s0qrj3phxspvpfbqiljfo6yoamixfq0jznbdrznvqd7uh2ohbdlhp9q3hgzp3bmfzcuhas0nqnsp3bg0hpdjgau58lpbxqlcl0kszyotqdipiamgykevqqlb6ahhj+guaztephifbqkvxqcqlnqsls44eynotadjoqpjqqrk56+tpe8cbaabbqpkxpwkc6ronkqb6xudx/cq1cjyevqnrojkea1k89pj8lqdamuzjzwges+ypfnurkveorkoou51wqobkweg8h7d0nyxydz2d6oje4/tzkfpm6srr0tejpw/rthb5t0ecv+rxfjyr0qewhftps0bmkpwvr/qrdb9hjcj1u0b0k3uuaxvkhfxzydfbjch23nn2daruuowj6rkykaisqzajlnz4bini6xlvnnl0c4ruyyokn7ocu1vkcjcz2jp00sim9xb6brjb7ktff0kshdpamzptg+gmsjtqu35tcmphcc4izk6xjfifji5dcr5kyhypq54kdbrllgst5eaujwmsxwlnyk2qusnfkrjpy1bausc5tv4leyjyt9dbfe3ixpjldnfx9lynrrucslewyn8g07dh4hkrml6n9iakil+uawpyzl0mz+lcrzz2lm9sc553mbaqhlibze2lycxzfo75dhrl4ulfkhkw/rndsiv423fx/wxkf9ioz48fknto3m7pfaw9onexclixfeseul8a61isr2m0rbjae4ldk4i230tbhhdaxhrba5yoyvuj5slky0avvxm0pwdwngh2ulmh6b3cyxstmw5fhnu6u5ep2z52lco4qu3kztrvjnw17sg0f/th/9qk4tbrwfeubtwnz1mprmfwrnbjolnq56pzv+bvzt7fxhxqfd5/k0ludwueevxu6l/4hb0imka==</latexit> IMPLIED MODEL V (x) =E R(x 0 )+ R(x 1 )+ 2 R(x 2 )+ x 0 = x, a t = E [R(x 0 )] + E [R(x 1 )] + 2 E [R(x 2 )] +.

35 VENESS, BELLEMARE, HUTTER, CHUA, DESJARDINS (2015)

36 <latexit sha1_base64="z3v7rkzg+vlq3h00sc3tusddgcs=">aaaid3icfzxrbts2fmfldqu7dje0wz/tczgjq7oggrumanghqlt66iylmzmmbhrtnsiksolql5j0j4cv9gx7mn0b9nwpsafye+xqkh1dugqwrz7f/4ih5/dijyir3e//07ly9ympr3wvf7rx4+nppv1s8+atkyoxkritgotynnpemcejdqk5fuw0kyyenmavvponlr94w6ticxsslwmbhgqw8ybtose03fxt9aonfcu9i75+hdble0y1wh6fote62kqn/bvohsizeoyk77ux/ve7uwu3t1a/1qpwnccmn9pfw55ymhnwdpev2ulmr7/tzx/ubrhlo+euz3b681qk/zguqhzpkohsy7ef6ikhunmqwlabf4olhj6tgrtdmyihuxotzypdd8dioycw8is0yq1vd0ncpzahb8qq6llqmmv8p6bnyw10y1vsbaorkw4etaypkovmes1cchyc6rjzaigfs8iawekdumlhvojoisruqxnry4pzdilmsgszp3ugmvd8opei/w1pvjmktmhszcmdr2/gip1ky6h55btmn2cg27lriszzlgvq0wo9bdgxffqyrz9u6jmwparqoxydvoigrc8q9kxf9yt0v0wpk/s4rfcqdc9rjksdvvbh01mnkntucvyosux55qdpmzazllhrayytb+orcyltt3i7gbnfivwopljilfvoojplo8yku+m+joxgd7dpgvuuxwf6gw10wgbvsryppl8gukjj+y0ppzkptccymblydba4nmip9cpbtxaucx5ysebg+ebyhl/af+a1nvizvctqlvlvrlbqylxdusr6zbsisyrm2aue+pce/rlaznajf97lvnnxcxzlfhrbocx7dkaq1ezs2o9ysor5hvdycuzplgmbgvnwmd3c5rjonksrxbaaxu7eymecogbfop4lc+kzue00usmj4p36qzxozcpnntmfj51wfjyy3gz6brnapplhxmnf5vl4kzuwg40ry03yjl17hccscdmsmnjjbdt6nxakuwqhtvorw2eocfagzvhsxbj5azjnrkodlwvg/w2o7e3inbwz5ugjtxzaxtezrxbxuc2bqd0y7e64/r338nve4+/lw+u686xzlbplum5957hzozn0thzq/nu50fmic7v7e/ep7p/dvwrplu7p87lte7p//wdfj+fc</latexit> V (x) =E R(x 0 )+ R(x 1 )+ 2 R(x 2 )+ = E Z (x) VENESS, BELLEMARE, HUTTER, CHUA, DESJARDINS (2015)

37 BELLMAN EQUATION: $V(x) = \mathbb{E}\,Z(x) = \mathbb{E}\,R(x) + \gamma\,\mathbb{E}\,Z(X')$

38 DISTRIBUTIONAL BELLMAN EQUATION: $Z(x) \overset{D}{=} R(x) + \gamma Z(X')$

39 $Z(x) \overset{D}{=} R(x) + \gamma Z(X')$ VALUE DISTRIBUTION. [Diagram: from state $x$, two successor states $x'_1$ (with $p(x'_1 \mid x) = 0.5$, $r = 0$) and $x'_2$ (with $p(x'_2 \mid x) = 0.5$, $r = 1$), each with its own distribution $Z(x'_1)$, $Z(x'_2)$.] Bellman (1957): Bellman equation for the mean; Sobel (1982): for the variance; Engel (2003): for Bayesian uncertainty; Azar et al. (2011), Lattimore & Hutter (2012): for higher moments; Morimura et al. (2010, 2010b): for densities.
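To make the distributional Bellman equation concrete for a two-successor example like the one diagrammed above, the sketch below applies $Z(x) \overset{D}{=} R(x) + \gamma Z(X')$ to distributions represented as lists of (value, probability) atoms; the representation and the numbers in the example are assumptions for illustration.

```python
def bellman_backup(successors, gamma=0.9):
    """Z(x) =_D R(x) + gamma * Z(X') for a finite set of successor states.

    successors: list of (transition_prob, reward, z_next), where z_next is a
    return distribution given as a list of (value, probability) atoms.
    """
    atoms = []
    for p_trans, reward, z_next in successors:
        for value, p in z_next:
            atoms.append((reward + gamma * value, p_trans * p))
    return atoms

# Two successors reached with probability 0.5 each, rewards 0 and 1 as in the
# diagram; the successor distributions themselves are made-up point masses.
z1 = [(0.0, 1.0)]    # Z(x'_1): return 0 with certainty
z2 = [(10.0, 1.0)]   # Z(x'_2): return 10 with certainty
z_x = bellman_backup([(0.5, 0.0, z1), (0.5, 1.0, z2)])
# z_x == [(0.0, 0.5), (10.0, 0.5)]
```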

40 $150 $300 + $200

41 $450 $300 + $200

42 $450 + $300 $200 +

43 $450 + $300 $200 +

44 [C51 pipeline diagram: instead of $Q(x, a)$, the network outputs a discrete distribution $p(z \mid x, a)$; grayscale, pool + downsample, frame stack, C51 network, policy; replay memory; target network; learning algorithm; $\mathcal{T}Q = r + \gamma P \max_{a'} Q$.]

45 [Pipeline diagram: the policy is ε-greedy with respect to the expected value of the learned distribution; grayscale, pool + downsample, frame stack, DQN, replay memory, target network, learning algorithm; $\mathcal{T}Q = r + \gamma P \max_{a'} Q$.]

46 APPROXIMATION PROJECTION

47
1. From $x, a$, sample a transition: $r, X', A' \sim R(x, a),\ p(\cdot \mid x, a),\ \pi(\cdot \mid X')$. (a)
2. Compute the sample backup: $\hat{\mathcal{T}} Z(x, a) := r + \gamma Z(X', A')$. (b)
3. Project onto the approximation support: $\Pi\,\hat{\mathcal{T}} Z(x, a)$. (c)
4. Update towards the projection, i.e. take a KL-minimizing step towards $\Pi\,\hat{\mathcal{T}} Z$. (d)
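The projection in step 3 is what makes the categorical (C51) backup concrete. Below is a minimal NumPy sketch of steps 2 and 3 under assumed settings (51 atoms on a fixed support $[v_{\min}, v_{\max}]$); the function name `categorical_projection` and all parameters are illustrative, not the published implementation.

```python
import numpy as np

def categorical_projection(rewards, probs_next, gamma=0.99,
                           v_min=-10.0, v_max=10.0, n_atoms=51):
    """Project r + gamma * z onto the fixed support {z_0, ..., z_{n_atoms-1}}.

    rewards:     array of shape (batch,)
    probs_next:  array of shape (batch, n_atoms), the distribution at (X', A')
    returns:     array of shape (batch, n_atoms), the projected target
    """
    support = np.linspace(v_min, v_max, n_atoms)
    delta_z = (v_max - v_min) / (n_atoms - 1)

    # Step 2: sample backup T_hat Z = r + gamma * z_j, clipped to the support.
    tz = np.clip(rewards[:, None] + gamma * support[None, :], v_min, v_max)

    # Step 3: distribute each atom's mass onto its two nearest support points.
    b = (tz - v_min) / delta_z                      # fractional index in [0, n_atoms-1]
    lower, upper = np.floor(b).astype(int), np.ceil(b).astype(int)
    target = np.zeros_like(probs_next)
    for i in range(rewards.shape[0]):
        for j in range(n_atoms):
            target[i, lower[i, j]] += probs_next[i, j] * (upper[i, j] - b[i, j])
            target[i, upper[i, j]] += probs_next[i, j] * (b[i, j] - lower[i, j])
            if lower[i, j] == upper[i, j]:          # b landed exactly on an atom
                target[i, lower[i, j]] += probs_next[i, j]
    return target
```

Step 4 then amounts to a cross-entropy (KL-minimizing) gradient step between this projected target and the predicted distribution at $(x, a)$.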

48

49

50
           Mean   Median   > H.B.   > DQN
DQN        228%     79%      24        0
DDQN       307%    118%
DUEL.      373%    151%
PRIOR.     434%    124%
PR. DUEL.  592%    172%
C51        701%    178%      40       50

51

52

53

54 DISTRIBUTIONAL PERSPECTIVE (Bellemare et al., 2017); QUANTILE REGRESSION (Dabney et al., 2017); CRAMER DISTANCE (Bellemare et al., 2017); IMPLICIT QUANTILE NETWORKS (Dabney et al., 2018); ANALYSIS OF C51 (Rowland et al., 2017); DISTRIBUTIONAL ACTOR-CRITIC (Barth-Maron et al., 2018); AS EXPECTED (Lyle et al., 2018; PGMRL workshop, ICML); GENERATIVE QUANTILE NETWORKS (Ostrovski et al., 2018); THE RAINBOW AGENT (Hessel et al., 2018)

55 1. Distributional reinforcement learning 2. Exploration with pseudo-counts

56 Pong, Seaquest, Montezuma's Revenge (September 2017)

57

58

59 ?

60 <latexit sha1_base64="qbbexo8ugwarbyypna0tfspy9ya=">aaah2xicfzxdbhq3fmchkcyktar62rurkxsgq2inqtskcomirsgiczuqhjd1auxxehat2dnt2wu7mxpru6q3frcephfc0ffp8cxsmh+ukxbnzpn9j318/bwkgmvt73+4dpnkf1evda7fwpvy5ldf31q/fedij3nf2sfnrkkoa6kz4de7nnwidpwqrmqg2kvg9injr94wpxksh5hlysastgmecuomucbrz/bulxrkpnqevgf8j/cusekqzouuutoxiw2enzdomcesywjicth4jdddyhceb/qq2bi/nlnv9jf7+ypahl8axa98hppb1xy4tohcsthqqbqe+f3ujc1rhlpbsju81ywl9jrm2qjmmeimxzyfc4bugideualgfxuue6srlkitlziapsrmppvmof+pmzms9w6dsulin3iy0c9jy+n0blhmi5siuuamqa7ukoqkciiwybcqoiwk0rlrhbqykhqfypqaonntuzvtolbmah7wkfh4hqe6lnkiqfitkidw13dm3tie5jcolayvm4vd2cgr9mwwnehxhr636oskfd2iwxw61al7fbrfoomkhbtosywetohohe606egfhrtodovuz81i6b0k3msg66mkpwofb7sldgk724xnmzil1ltzysuabksbopco95o5hyxcn2case7ylw7v2sloloxk/2hsorulqxatnjw6t/rxl+iawd5ubadclybryhl1wjbszbzaeeyzwgvgcaiburr6na0shcecsw4esczvxm2sm2gbxholdvsyte1z3s4k6ybwnwxkbfa+mziskd250mp3rur4kpadxmdn29mq7mbgeswkfqymzzvbelkaxzn/ppnendybhs1nnyqqevcrzz25zw0pio/mbmqjlicrp7zysmgg/a51frgtn87cr6oupi4+r584ivk2wgixz0wmg3b+pnjcz7t+c7z4hmkjk4vlpfasm5sdjsp9jfokfxqfwjkqhfyzvhhpwz8twssuqrbwk2l4fbxjkmmht88nbryl7hkxahflgbh/ixn3zzedr4z9+ohcpmc/15ldcff5zzupbrz9s <latexit sha1_base64="clsxqiehyqgyrzu7iv0cq+47hlg=">aaaif3icfzxbbhs3eebxarql7s1phtsarixassmy2qjaixqbykrbwtrwzcd2hjucwkw4emfyd0nsqwrmn/mn/zq+fx3tyz+i/9dhxuy9pfla2tk5m5zhdc9birg2g8e/nwsfxp/wrvfmrxsff/lpz59v3vriwmdlrdkrjuwstgkimearozlcchaskezkinil4pyj4y9em6v5hb2adcimkswjhnjkdkimm2/376365d56hhaglszibtfizboupub4tqqkjworhfeztasthdwxkdmepynckqymlocarwjtnrtbsft9rlbuu0fy71aranhadekt1q+usam8dgywmd3sdbyh2ypagl8ipa94xtnbn1z4ftolzjghgmh95g8sm7fegu4fszfwuroe0hmyz2cgrkqypbfzxvj0fzqzfmykfpfbmbbqyynuei0dsjtelhstoex/mboqtejwwskd6kzojvxhynmula2laj5subtixmg1cs24gmklnqiekg6zqnrboiyg2lmpkeyxgdtdxqo5rqpfhoyxjrlnxvnef0va5vwqoquw+gao2g80hhurzsymz6gpbm6ucps8trv0pejpwvrlhb5s0z0k3wnrgwo9anfhhq5b9lrct1t0r0l3wvswqg9bdldcd9nmsfr+be83nfvxhr63ninrgypajpq+cvoywjoqo25504aomoi/wf1mzrmc4xcwzsr37eeh6mwvprb2u/8wcshu4hkluxe4omten7pehwz2pmj7opoveiumvt+u50hqc1swmdnvooyihbu55py6lsucxyjldhpilauufvefjacvqcwgryym9iibp8s6ixunmwkbuncbrxqyp1f28h3lmtsxya6v4c+7avl3s2cxytikbm3zztcftkzxtn/ajfhndubhy9knsyofvgrdz25z28p8o9mnkiyfcozplbysnai/qt0f5stnc/frkcwpwvfbt52jwtfc3ame+7gjz8edzxs25ze7xamzndizcbeufk6nrbczsqxjvrf3rxbyepldkom37jvpfybqmmiqphjfjj+hpfkl7fjzzemwi5tfcansm3t/fsfulmqgjgz78ngl+zc9qjoh4k7ymzdtwzj+dtsfbpv73/ued4tb66b3pfe1d8/zve+9x95p3tg78qj3b+d256vone7v3t+6f3b/yk2vdqqf217t6f79h9rs7nk=</latexit> EXPLORATION Bellman equation: Q(x, a) =r(x, a)+ E max x0 P a 0 2A Q(x0,a 0 ) Exploration bonus (Strehl and Littman, 2008): Q(x, a) =ˆr(x, a) + E x 0 ˆP max a 0 2A Q(x0,a 0 )+p N(x, a)

61 [Figure: the Cliff Walking gridworld, from start S to goal G; r = -1 per step along the safe path, r = -100 for stepping off the cliff; panels show the transition function, reward function, and the optimal policy/path.]

62 Most observations experienced once: up to 90% singletons (Blundell et al., 2016)

63 GENERATIVE MODEL Train Generate

64 <latexit sha1_base64="dqatesxjbvsir0w2pjl4l/rzzue=">aaahl3icfzvtb9s2emfvrqu7rnva9dwwn8smat0qbnywyeohys3qpgxluidn0jzimcjqzbmhkzwkozuspkxfbh9s32zhwu700fward DENSITY MODEL Train Query n (x) Assumptions 1. density frequency 2. model quality = generalization

65 $\rho_n(x) = \dfrac{\hat{N}_n(x)}{\hat{n}}, \qquad \rho'_n(x) = \dfrac{\hat{N}_n(x) + 1}{\hat{n} + 1} \quad\Longrightarrow\quad \hat{N}_n(x) = \dfrac{\rho_n(x)\,\big(1 - \rho'_n(x)\big)}{\rho'_n(x) - \rho_n(x)}$ Bellemare, Srinivasan, Ostrovski, Schaul, Saxton, Munos, 2016
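The identities above can be turned directly into a pseudo-count computation from the coding probability $\rho_n(x)$ and the recoding probability $\rho'_n(x)$. A minimal sketch follows; the epsilon guard and the example numbers are assumptions for illustration.

```python
def pseudo_count(rho, rho_prime, eps=1e-8):
    """Solve rho = N_hat / n_hat and rho' = (N_hat + 1) / (n_hat + 1) for N_hat.

    rho:       coding probability of x (before training on x)
    rho_prime: recoding probability of x (after one more training step on x)
    """
    prediction_gain = rho_prime - rho        # positive for a learning-positive model
    return rho * (1.0 - rho_prime) / max(prediction_gain, eps)

# Example: the model assigns x probability 0.01 before and 0.012 after an update,
# giving roughly 4.9 pseudo-visits.
n_hat = pseudo_count(0.01, 0.012)
```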

66 CODING PROBABILITY (DENSITY) $\rho_n(x)$ → TRAIN on $x$ → RECODING PROBABILITY $\rho'_n(x)$

67 THE CTS MODEL: $\rho_n(x) := \prod_{i=1}^{K} \rho^i_n(x_i \mid x_{<i})$ over pixel factors $x_{i,j}$
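As a toy stand-in for such a factored density model, the sketch below keeps per-location pixel counts and treats the image density as a product of per-pixel probabilities; unlike the real CTS model it ignores the conditioning on neighbouring pixels, and all names and defaults are illustrative assumptions.

```python
import numpy as np

class FactoredPixelModel:
    """Toy factored density: rho(x) = product over pixels of rho_i(x_i).

    `shape` is the image shape as a tuple, e.g. (42, 42); pixel values are
    integers in [0, n_values). Counts start at a Laplace-style prior.
    """
    def __init__(self, shape, n_values=8, prior=1.0):
        self.counts = np.full(shape + (n_values,), prior)

    def log_prob(self, image):
        norm = self.counts.sum(axis=-1)
        per_pixel = np.take_along_axis(self.counts, image[..., None], axis=-1)[..., 0]
        return float(np.log(per_pixel / norm).sum())

    def update(self, image):
        # Increment the count of the observed value at every pixel location.
        np.add.at(self.counts, (*np.indices(image.shape), image), 1.0)
```

Querying the model before and after an update on the same frame yields the coding and recoding probabilities used by the pseudo-count formula above.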

68

69 [Plot: pseudo-counts of salient events vs. training frames (1000s); pseudo-counts for the start position and for a salient event, with periods without salient events marked.]

70 <latexit sha1_base64="qbbexo8ugwarbyypna0tfspy9ya=">aaah2xicfzxdbhq3fmchkcyktar62rurkxsgq2inqtskcomirsgiczuqhjd1auxxehat2dnt2wu7mxpru6q3frcephfc0ffp8cxsmh+ukxbnzpn9j318/bwkgmvt73+4dpnkf1evda7fwpvy5ldf31q/fedij3nf2sfnrkkoa6kz4de7nnwidpwqrmqg2kvg9injr94wpxksh5hlysastgmecuomucbrz/bulxrkpnqevgf8j/cusekqzouuutoxiw2enzdomcesywjicth4jdddyhceb/qq2bi/nlnv9jf7+ypahl8axa98hppb1xy4tohcsthqqbqe+f3ujc1rhlpbsju81ywl9jrm2qjmmeimxzyfc4bugideualgfxuue6srlkitlziapsrmppvmof+pmzms9w6dsulin3iy0c9jy+n0blhmi5siuuamqa7ukoqkciiwybcqoiwk0rlrhbqykhqfypqaonntuzvtolbmah7wkfh4hqe6lnkiqfitkidw13dm3tie5jcolayvm4vd2cgr9mwwnehxhr636oskfd2iwxw61al7fbrfoomkhbtosywetohohe606egfhrtodovuz81i6b0k3msg66mkpwofb7sldgk724xnmzil1ltzysuabksbopco95o5hyxcn2case7ylw7v2sloloxk/2hsorulqxatnjw6t/rxl+iawd5ubadclybryhl1wjbszbzaeeyzwgvgcaiburr6na0shcecsw4esczvxm2sm2gbxholdvsyte1z3s4k6ybwnwxkbfa+mziskd250mp3rur4kpadxmdn29mq7mbgeswkfqymzzvbelkaxzn/ppnendybhs1nnyqqevcrzz25zw0pio/mbmqjlicrp7zysmgg/a51frgtn87cr6oupi4+r584ivk2wgixz0wmg3b+pnjcz7t+c7z4hmkjk4vlpfasm5sdjsp9jfokfxqfwjkqhfyzvhhpwz8twssuqrbwk2l4fbxjkmmht88nbryl7hkxahflgbh/ixn3zzedr4z9+ohcpmc/15ldcff5zzupbrz9s <latexit sha1_base64="clsxqiehyqgyrzu7iv0cq+47hlg=">aaaif3icfzxbbhs3eebxarql7s1phtsarixassmy2qjaixqbykrbwtrwzcd2hjucwkw4emfyd0nsqwrmn/mn/zq+fx3tyz+i/9dhxuy9pfla2tk5m5zhdc9birg2g8e/nwsfxp/wrvfmrxsff/lpz59v3vriwmdlrdkrjuwstgkimearozlcchaskezkinil4pyj4y9em6v5hb2adcimkswjhnjkdkimm2/376365d56hhaglszibtfizboupub4tqqkjworhfeztasthdwxkdmepynckqymlocarwjtnrtbsft9rlbuu0fy71aranhadekt1q+usam8dgywmd3sdbyh2ypagl8ipa94xtnbn1z4ftolzjghgmh95g8sm7fegu4fszfwuroe0hmyz2cgrkqypbfzxvj0fzqzfmykfpfbmbbqyynuei0dsjtelhstoex/mboqtejwwskd6kzojvxhynmula2laj5subtixmg1cs24gmklnqiekg6zqnrboiyg2lmpkeyxgdtdxqo5rqpfhoyxjrlnxvnef0va5vwqoquw+gao2g80hhurzsymz6gpbm6ucps8trv0pejpwvrlhb5s0z0k3wnrgwo9anfhhq5b9lrct1t0r0l3wvswqg9bdldcd9nmsfr+be83nfvxhr63ninrgypajpq+cvoywjoqo25504aomoi/wf1mzrmc4xcwzsr37eeh6mwvprb2u/8wcshu4hkluxe4omten7pehwz2pmj7opoveiumvt+u50hqc1swmdnvooyihbu55py6lsucxyjldhpilauufvefjacvqcwgryym9iibp8s6ixunmwkbuncbrxqyp1f28h3lmtsxya6v4c+7avl3s2cxytikbm3zztcftkzxtn/ajfhndubhy9knsyofvgrdz25z28p8o9mnkiyfcozplbysnai/qt0f5stnc/frkcwpwvfbt52jwtfc3ame+7gjz8edzxs25ze7xamzndizcbeufk6nrbczsqxjvrf3rxbyepldkom37jvpfybqmmiqphjfjj+hpfkl7fjzzemwi5tfcansm3t/fsfulmqgjgz78ngl+zc9qjoh4k7ymzdtwzj+dtsfbpv73/ued4tb66b3pfe1d8/zve+9x95p3tg78qj3b+d256vone7v3t+6f3b/yk2vdqqf217t6f79h9rs7nk=</latexit> <latexit 
sha1_base64="6r4+sicrfo+xjaz8wmujrcalwqq=">aaaifxicfzxdbts2fmflbqu77kpjdrfeedokpjsrwmoafh0cnkilblisowmspgkng6kobckipjj0z4fv5z5ht7o7ybe73jpsjxb04uqfxqxyis/vf8jdcyjss0khzwdwt+fwbx9+dlt75+o1tz797po76xtfnop4rjic8dim1znhniqighmjtahniqimvrbeepdpm/7ydsgt4ujylbmyszanrca4m2iarp92ulxoswdkh9azm0qvvw8jntipgagwsicbiv1seqqflfsjlfdjfhpl0coiqvlusncgzt5hm7m3gyxjg7mk/ntsqdfuqqegyue/vsbmax1slr6kabo2we8ntgf5q9ont2z0npizttzul6gf87meypcqax3hdhiztkwzwuni1+hcq8l4jzvcbtyjjkgpbr5ssu6jxsdbrpaxgzjbqx6wsa2x0kolzgammywz/h8zm1mb3wyqpqpdimkej8zwrmncqmslkij5sexmsjirxyjme7jebunk4koinzhmosfi1ucmpzekmtnemwwvawwhflenfplvrklljc2klnwcpbx9jubwk49xw0s+pfwfvjjbg2ehfye1rnozcj1r0vcv+qpfdyt0t0wpkvsorycvomzr8wo9b9h9ct1v0emkpw7rvqrds5vj0ocvfnh01qcvetpy9g5k7hn2oombgjil1lzzucube0yhop+w9psx+ywi72aapmjydxmqs0wuwvz23cdrntl96kowoifidadpskchgn+mgn00/ykbmhorb2wptw2hrceu1moxim/s5hilvu5xlrrfozaclrhyprmaxvc4ar4mnj96nldx+tgrrjty17apsvnzrvb4i/b8ro/9gw//mir2ean/3ktxetczibvigwkwzq9wcwnnrjl7euimluua4ydnn8yqdjejyzrlpm57xhsa1yhi3aax7tjseajd6fvsc3fnyjrloo1uiih4v36sscyy4eafcyh8bs713gz7brnaivlxxmnf5vb4kz2uk11u5hrngxv3dsetiqxumnztftz6nxaluwqxtdoro+ekcfzjo3p+xbjzaskvnpvlmp1bgmd3jbg8muzjnwv5ef7x2u5xv7nnm6ndop1u2x1su4ff954my1vrjnpp+drzclznofpe+dezoscod/7tbhs+6tzr/t79o/tn969ceqtt+nzp1j7u3/8brvbt6a==</latexit> EXPLORATION Bellman equation: Q(x, a) =r(x, a)+ E max x0 P a 0 2A Q(x0,a 0 ) Exploration bonus (Strehl and Littman, 2008) Q(x, a) =ˆr(x, a) + E x 0 ˆP max a 0 2A Q(x0,a 0 )+p N(x, a) Pseudo-count bonus Q(x, a) =ˆr(x, a)+ E x 0 ˆP max a 0 2A Q(x0,a 0 )+q ˆN(x)

71

72 [Figure: trajectory from START to END, comparing DQN + bonus against DQN.]

73 [Plot: average score vs. training frames (millions).]

74 <latexit sha1_base64="jxoybhwqgddcnoviwxohiaiweek=">aaaiunicfzvtb9s2eidldmu8rnvs7uo+edpajosxwmoadh0cnkihbljiowle3isuqfg0tysuhjlu7dd6c/sn+7k/sk87vdjws1cbtk733b2pxypptwxxptp5p3hv/iefpthofrb5+cmvvvxq69hjcx3nfgvnnbkrgvhem8fddma4ewwwvyxix7al//pvwi/em6v5fj6axzqnjrmhfmqpmadytv7chouaoeptuwfaxdm76ok+ukgglkwjlnjiocukplsbr012er4tkqncksw9s54hzeoe6ugcbrjm143bidzbqvjx8csgiciypd1f2y76ht3svmnjpzoe5fud+b0egplfhv8dx40mmxksmxmkgte9rvznr5m+qc64udby8qfvpxowx0fez5kfhgqi9zxbmzqhjcpwkli8iweatqm9jmn2bwjiinehtysdoyegcdaouvalduq1rq9lpnyl6yoljgaiqyxr/h8ze1ka3szwso90jscz+mloetidgrbslkxrtcatowsnucavo0ysqcbuczgvohoicdxqceuxxtgcg+qwkzwnzacy0py2uqlgpz/qvejzreoljx+ib+kq/ukjwnmwsji+is1o5kajsg/iueihbtqo0bcf+rzgdwr0oezpcvskrrsf2q3rywk9rngjaj2q0dmcpa3rwwi9jkvf0scfffx11ucfel5z9ns59n3bq/pomzi51ltzfs2b+ksbqfsot6s5bznch2casz6wnxnuzvmwtrcz3rdhmtgthlarwmw+hx5fq0r/srltmtibih2b6g9iljhifwdz09hmtmawzmrpoelgwmkls9fxusjhjljkoihe0shvjlqfcpdyldzsbjs1t2mcjdzvrevy5ngsdbdwkcj7ubah77vhcc2q7a7h74fxsu5mwilfpbumluwqwrtikl4l+hmmibouawpbzr6kvcsgiosysza3pc0+qqsrrtaav+7qzgcxvkmtd3my4rtkkefg/ri9l53dftdfzfjmk0w4pr5sqrmtt7papazgxemn81b4lfbs8pzyjzvmk/bhdogwkby6dn64nugfm4sfyq1bwnze/zxkklxs9l+vfm7sa/wegnrbarrjv8vrcseya0egfbg/mu+ymzjlenxvbvvmqgvnp+y5nt33+mfwy25+azwdb5xvnw3hdz47l51fnb5z5tdgdqpxuggmnv7e+lfzan7pto81cp+vndltfpgfpkv+ig==</latexit> CREDIT ASSIGNMENT ISSUES IN EXPLORATION Another important key factor: fast credit assignment apple Q(x t,a t )=q r + (x t,a t )+ " 1 X max Q(x t+1,a 0 ) a 0 2A # +(1 q) i r + (x t+i,a t+i ) i=0 Mixed Monte-Carlo (MMC) update

75 EFFECT OF MIXED MONTE CARLO UPDATE

76 REMOVING EXTRINSIC REWARDS

77

78 [Figure: Atari games arranged by release year, coloured as super-human agents, sub-human agents, or scoring exploits. 1980 & before (15 games): 93% super-human, 7% sub-human. 1981-83 (36 games): 27% sub-human, 8% exploit. 1984 and after (6 games): only 1/6 super-human. Games shown include Breakout, Space Invaders, Asteroids, Bowling, Freeway, Ms. Pac-Man, Seaquest, Q*Bert, Pong, Video Pinball, Pitfall!, H.E.R.O., Montezuma's Revenge, Solaris, and Double Dunk, among others.]

79 Deep Reinforcement Learning and the Atari 2600 MARC G. BELLEMARE Google Brain
