CS885 Reinforcement Learning Lecture 9 (May 30, 2018): Model-based RL [SutBar] Chap. 8

1 CS885 Reinforcement Learning
Lecture 9 (May 30, 2018): Model-based RL [SutBar] Chap. 8
Pascal Poupart, CS885 Spring 2018

2 Outline
- Model-based RL
- Dyna
- Monte Carlo Tree Search

3 Model-free RL
- No explicit transition or reward models
- Q-learning: value-based method
- Policy gradient: policy-based method
- Actor critic: policy- and value-based method
[Diagram: agent (updates policy/value function) selects actions; environment returns state and reward]

4 Model-based RL
- Learn an explicit transition and/or reward model
- Plan based on the model
- Benefit: increased sample efficiency
- Drawback: increased complexity
[Diagram: agent updates a model and plans with it to update the policy/value function; environment returns state and reward in response to actions]

5 Maze Example
[Figure: the 4x3 maze; top row policy r r r into terminal +1; middle row u (wall) u next to terminal -1; bottom row u l l l; γ = 1; reward is -0.04 for non-terminal states]
We need to learn all the transition probabilities! Observed episodes:
(1,1) → (1,2) → (1,3) → (1,2) → (1,3) → (2,3) → (3,3) → (4,3) +1
(1,1) → (1,2) → (1,3) → (2,3) → (3,3) → (3,2) → (3,3) → (4,3) +1
(1,1) → (2,1) → (3,1) → (3,2) → (4,2) -1
Counting the moves out of (1,3) under action r gives P((2,3) | (1,3), r) = 2/3 and P((1,2) | (1,3), r) = 1/3.
Use this information in V*(s) = max_a R(s,a) + γ Σ_{s'} Pr(s'|s,a) V*(s').
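
To make the counting concrete, here is a minimal sketch (the list of extracted transitions is an assumption for illustration, not from the slides) that recovers the two estimates above from the recorded moves out of (1,3) under action r:

```python
# Estimate Pr(s'|s,a) by counting, for s = (1,3), a = r, from the episodes above.
from collections import Counter

# Next states observed after taking r in (1,3): twice in episode 1, once in episode 2.
next_states = [(1, 2), (2, 3), (2, 3)]

counts = Counter(next_states)
n = len(next_states)
for s2, k in counts.items():
    print(f"P({s2} | (1,3), r) = {k}/{n}")  # -> 1/3 for (1,2), 2/3 for (2,3)
```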

6 Model-based RL
Idea: at each step,
- execute an action
- observe the resulting state and reward
- update the transition and/or reward model
- update the policy and/or value function

7 Model-based RL (with Value Iteration)
ModelBasedRL(s)
  Repeat
    Select and execute a; observe s' and r
    Update counts: n(s,a) ← n(s,a) + 1;  n(s,a,s') ← n(s,a,s') + 1
    Update transition: Pr(s'|s,a) ← n(s,a,s') / n(s,a)  ∀ s'
    Update reward: R(s,a) ← R(s,a) + [r - R(s,a)] / n(s,a)
    Solve: V*(s) = max_a R(s,a) + γ Σ_{s'} Pr(s'|s,a) V*(s')  ∀ s
    s ← s'
  Until convergence of V*
  Return V*
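
A minimal tabular sketch of this loop, assuming a toy random MDP and ε-greedy exploration (both assumptions; the slide does not fix the environment or exploration scheme):

```python
import numpy as np

rng = np.random.default_rng(0)
S, A, gamma = 5, 2, 0.9
true_P = rng.dirichlet(np.ones(S), size=(S, A))    # hidden dynamics, shape (S, A, S)
true_R = rng.normal(size=(S, A))                   # hidden rewards

n_sa  = np.zeros((S, A))                           # counts n(s,a)
n_sas = np.zeros((S, A, S))                        # counts n(s,a,s')
P_hat = np.full((S, A, S), 1.0 / S)                # learned transition model
R_hat = np.zeros((S, A))                           # learned reward model
V = np.zeros(S)

s = 0
for step in range(3000):
    # epsilon-greedy action w.r.t. the current model and value function
    a = rng.integers(A) if rng.random() < 0.1 else int(np.argmax(R_hat[s] + gamma * P_hat[s] @ V))
    s2 = rng.choice(S, p=true_P[s, a])             # execute a, observe s' and r
    r = true_R[s, a]
    n_sa[s, a] += 1
    n_sas[s, a, s2] += 1
    P_hat[s, a] = n_sas[s, a] / n_sa[s, a]         # Pr(s'|s,a) <- n(s,a,s') / n(s,a)
    R_hat[s, a] += (r - R_hat[s, a]) / n_sa[s, a]  # running-average reward
    for _ in range(50):                            # "solve": value-iteration sweeps
        V = np.max(R_hat + gamma * P_hat @ V, axis=1)
    s = int(s2)

print("V* estimate:", np.round(V, 2))
```

The fifty value-iteration sweeps per step stand in for "solve until convergence"; in practice one would sweep until the Bellman residual is small.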

8 Complex Models
Use function approximation for the transition and reward models:
- Linear Gaussian model: Pr(s'|s,a) = N(s'; W_a s + b_a, Σ_a)
- Non-linear models:
  - Stochastic (e.g., Gaussian process): Pr(s'|s,a) = N(s'; μ(s,a), Σ(s,a))
  - Deterministic (e.g., neural network): s' = f(s,a) = NN(s,a)
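
As one concrete instance, a linear model per action can be fit by least squares from observed (s, a, s') triples; this sketch uses synthetic data and is an illustration, not a procedure given on the slide:

```python
import numpy as np

rng = np.random.default_rng(1)
d, A, N = 3, 2, 500
W_true = rng.normal(size=(A, d, d))          # hidden dynamics, one matrix per action

S_in = rng.normal(size=(N, d))               # observed states
acts = rng.integers(A, size=N)               # observed actions
S_out = np.stack([W_true[a] @ s for a, s in zip(acts, S_in)])
S_out += 0.01 * rng.normal(size=(N, d))      # transition noise

W_hat = np.zeros((A, d, d))
for a in range(A):
    X, Y = S_in[acts == a], S_out[acts == a]
    W_hat[a] = np.linalg.lstsq(X, Y, rcond=None)[0].T  # solves Y ≈ X Wᵀ

print("model error:", np.abs(W_hat - W_true).max())
```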

9 Partial Planning
- With complex models, fully optimizing the policy or value function at each time step is intractable
- Instead, consider partial planning:
  - a few steps of Q-learning
  - a few steps of policy gradient

10 Model-based RL (with Q-learning)
ModelBasedRL(s)
  Repeat
    Select and execute a; observe s' and r
    Update transition: θ_T ← θ_T - α_T ∇_{θ_T} ‖f_{θ_T}(s,a) - s'‖²
    Update reward: θ_R ← θ_R - α_R ∇_{θ_R} (R_{θ_R}(s,a) - r)²
    Repeat a few times:
      sample (ŝ, â) arbitrarily
      δ ← R(ŝ, â) + γ max_{â'} Q(f(ŝ, â), â') - Q(ŝ, â)
      Update Q: θ_Q ← θ_Q + α_Q δ ∇_{θ_Q} Q(ŝ, â)
    s ← s'
  Until convergence of Q
  Return Q
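
A sketch of the inner planning loop with a linear Q-function; the feature map phi, the learned models f_hat and R_hat, and all names here are illustrative assumptions:

```python
import numpy as np

def planning_updates(theta_q, phi, f_hat, R_hat, actions, sample_sa,
                     n_updates=10, gamma=0.99, alpha=0.01):
    """A few Q-learning steps on transitions simulated from the learned model."""
    for _ in range(n_updates):
        s, a = sample_sa()                      # sample (s, a) arbitrarily
        s2 = f_hat(s, a)                        # model-predicted next state
        q_next = max(theta_q @ phi(s2, a2) for a2 in actions)
        delta = R_hat(s, a) + gamma * q_next - theta_q @ phi(s, a)
        theta_q = theta_q + alpha * delta * phi(s, a)   # gradient of a linear Q
    return theta_q

# Toy usage: 1-D state, 2 actions, hand-picked features
rng = np.random.default_rng(2)
phi = lambda s, a: np.array([s, s * s, 1.0, float(a)])
theta = planning_updates(np.zeros(4), phi,
                         f_hat=lambda s, a: 0.9 * s + 0.1 * a,
                         R_hat=lambda s, a: -abs(s),
                         actions=[0, 1],
                         sample_sa=lambda: (float(rng.normal()), int(rng.integers(2))),
                         n_updates=200)
print("theta_Q:", np.round(theta, 3))
```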

11 Partial Planning vs Replay Buffer
- The previous algorithm is very similar to model-free Q-learning with a replay buffer
- Instead of updating the Q-function from samples in a replay buffer, we generate samples from the model
- Replay buffer: simple, real samples, no generalization to other state-action pairs
- Partial planning with a model: complex, simulated samples, generalization to other state-action pairs (can help or hurt)

12 Dyna
- Learn an explicit transition and/or reward model and plan based on the model
- Also learn directly from real experience
[Diagram: as in model-based RL, but the state and reward feed both the model update and the direct policy/value update]

13 Dyna-Q
Dyna-Q(s)
  Repeat
    Select and execute a; observe s' and r
    Update transition: θ_T ← θ_T - α_T ∇_{θ_T} ‖f_{θ_T}(s,a) - s'‖²
    Update reward: θ_R ← θ_R - α_R ∇_{θ_R} (R_{θ_R}(s,a) - r)²
    δ ← r + γ max_{a'} Q(s',a') - Q(s,a)
    Update Q: θ_Q ← θ_Q + α_Q δ ∇_{θ_Q} Q(s,a)
    Repeat a few times:
      sample (ŝ, â) arbitrarily
      δ ← R(ŝ, â) + γ max_{â'} Q(f(ŝ, â), â') - Q(ŝ, â)
      Update Q: θ_Q ← θ_Q + α_Q δ ∇_{θ_Q} Q(ŝ, â)
    s ← s'
  Return Q
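
A runnable tabular Dyna-Q sketch in the style of [SutBar] Chap. 8: direct Q-learning on real experience plus several planning updates on remembered transitions. The chain environment is an illustrative assumption:

```python
import numpy as np

rng = np.random.default_rng(3)
S, A, gamma, alpha, n_plan = 6, 2, 0.95, 0.5, 20
Q = np.zeros((S, A))
model = {}                                   # (s,a) -> (r, s') for visited pairs

def step(s, a):                              # toy chain: a=1 moves right, a=0 left
    s2 = min(s + 1, S - 1) if a == 1 else max(s - 1, 0)
    return (1.0 if s2 == S - 1 else 0.0), s2

s = 0
for t in range(2000):
    a = rng.integers(A) if rng.random() < 0.1 else int(np.argmax(Q[s]))
    r, s2 = step(s, a)
    Q[s, a] += alpha * (r + gamma * Q[s2].max() - Q[s, a])   # direct RL
    model[(s, a)] = (r, s2)                                  # model learning
    for _ in range(n_plan):                                  # planning on replayed pairs
        (ps, pa), (pr, ps2) = list(model.items())[rng.integers(len(model))]
        Q[ps, pa] += alpha * (pr + gamma * Q[ps2].max() - Q[ps, pa])
    s = 0 if s2 == S - 1 else s2             # restart the episode at the goal

print(np.round(Q.max(axis=1), 2))
```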

14 Dyna-Q
[Figure: gridworld maze; task: reach G from S]

15 Planning from the Current State
- Instead of planning at arbitrary states, plan from the current state
- This helps improve the choice of the next action
- Monte Carlo Tree Search

16 Tree Search
[Figure: a search tree rooted at the current state s1; each state node branches on actions a and b, and each action leads to several possible next states s2, ..., s21]

17 Tractable Tree Search
Combine 3 ideas:
- Leaf nodes: approximate leaf values with the value of a default policy π:
  Q(s,a) ≈ Q^π(s,a) ≈ (1/n(s,a)) Σ_k G_k, the average return of n(s,a) rollouts of π from (s,a)
- Chance nodes: approximate the expectation by sampling from the transition model:
  Q(s,a) ≈ R(s,a) + γ (1/n(s,a)) Σ_{s' ~ Pr(s'|s,a)} V(s')
- Decision nodes: expand only the most promising actions:
  a* = argmax_a Q(s,a) + c √(2 ln n(s) / n(s,a))  and  V(s) = Q(s,a*)
Resulting algorithm: Monte Carlo Tree Search
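
The decision-node rule is plain UCB1. A minimal sketch (the dict-based bookkeeping is an assumption):

```python
import math

# UCB1 selection at a decision node: a* = argmax_a Q(s,a) + c sqrt(2 ln n(s) / n(s,a)).
def ucb_action(Q, n_s, n_sa, actions, c=1.0):
    def score(a):
        if n_sa[a] == 0:
            return math.inf            # always try untried actions first
        return Q[a] + c * math.sqrt(2 * math.log(n_s) / n_sa[a])
    return max(actions, key=score)

# Toy usage: two actions at a node visited 40 times
Q = {'a': 0.4, 'b': 0.6}
n_sa = {'a': 10, 'b': 30}
print(ucb_action(Q, n_s=40, n_sa=n_sa, actions=['a', 'b']))  # bonus favors 'a'
```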

18 Computer Go
- Deep RL + Monte Carlo Tree Search
- Oct 2015: AlphaGo defeats the European Go champion Fan Hui 5-0

19 Monte Carlo Tree Search (with upper confidence bound)
UCT(s_0)
  create root node_0 with state(node_0) ← s_0
  while within computational budget do
    node_l ← TreePolicy(node_0)
    value ← DefaultPolicy(state(node_l))
    Backup(node_l, value)
  return action(BestChild(node_0, 0))

TreePolicy(node)
  while node is nonterminal do
    if node is not fully expanded then
      return Expand(node)
    else
      node ← BestChild(node, c)
  return node
(A runnable sketch combining slides 19-21 follows slide 21.)

20 Monte Carlo Tree Search (continued)
Expand(node)
  choose a among the untried actions of state(node)
  add a new child node' to node with state(node') ← f(state(node), a)   [deterministic transition]
  return node'

BestChild(node, c)
  return argmax_{node' ∈ children(node)} V(node') + c √(2 ln n(node) / n(node'))

DefaultPolicy(node)
  s ← state(node)
  while s is not terminal do
    sample a ~ π(a | s);  s ← f(s, a)
  return R(s, a)

21 Monte Carlo Tree Search (continued)
Single player:
Backup(node, value)
  while node is not null do
    V(node) ← [n(node) V(node) + value] / (n(node) + 1)
    n(node) ← n(node) + 1
    node ← parent(node)

Two players (adversarial):
BackupMinMax(node, value)
  while node is not null do
    V(node) ← [n(node) V(node) + value] / (n(node) + 1)
    n(node) ← n(node) + 1
    value ← -value
    node ← parent(node)
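
Putting slides 19-21 together, a compact runnable UCT sketch for a toy deterministic game (the counting game, node fields, and budget below are illustrative assumptions, not from the lecture): start at a number, actions add 1 or 2, reaching exactly 10 scores 1 and overshooting scores 0.

```python
import math, random

ACTIONS = (1, 2)
def is_terminal(s): return s >= 10
def next_state(s, a): return s + a                 # deterministic f(s, a)
def reward(s): return 1.0 if s == 10 else 0.0

class Node:
    def __init__(self, state, parent=None):
        self.state, self.parent = state, parent
        self.children, self.untried = {}, list(ACTIONS)
        self.n, self.value = 0, 0.0

def tree_policy(v, c):                              # descend, expanding if possible
    while not is_terminal(v.state):
        if v.untried:                               # not fully expanded
            a = v.untried.pop()
            child = Node(next_state(v.state, a), parent=v)
            v.children[a] = child
            return child
        v = best_child(v, c)
    return v

def best_child(v, c):                               # UCB over the children of v
    def ucb(child):
        return child.value + c * math.sqrt(2 * math.log(v.n) / child.n)
    return max(v.children.values(), key=ucb)

def default_policy(s):                              # random rollout to a terminal state
    while not is_terminal(s):
        s = next_state(s, random.choice(ACTIONS))
    return reward(s)

def backup(v, delta):                               # single-player backup
    while v is not None:
        v.value = (v.n * v.value + delta) / (v.n + 1)
        v.n += 1
        v = v.parent

def uct(s0, budget=2000, c=1.0):
    root = Node(s0)
    for _ in range(budget):
        leaf = tree_policy(root, c)
        backup(leaf, default_policy(leaf.state))
    best = best_child(root, 0)                      # exploit only at the end
    return next(a for a, ch in root.children.items() if ch is best)

print("best first move from 8:", uct(8))            # +2 reaches 10 exactly
```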

22 AlphaGo
Four steps:
1. Supervised learning of policy networks
2. Policy gradient with policy networks
3. Value gradient with value networks
4. Searching with policy and value networks (Monte Carlo Tree Search variant)

23 Search Tree
At each edge store Q(s,a), P(a|s), N(s,a), where N(s,a) is the visit count of (s,a).
[Figure: a sampled trajectory through the search tree]

24 Simulation
At each node, select the edge a* that maximizes
  a* = argmax_a Q(s,a) + u(s,a)
where
  u(s,a) ∝ P(a|s) / (1 + N(s,a)) is an exploration bonus
  Q(s,a) = (1/N(s,a)) Σ_i 1_i(s,a) [λ v_θ(s_i) + (1 - λ) z_i]
  1_i(s,a) = 1 if (s,a) was visited at simulation i, 0 otherwise
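
A sketch of this selection rule; the exact shape of the bonus (including the √ΣN factor used in AlphaGo) goes beyond the proportionality stated on the slide and is an assumption here:

```python
import numpy as np

def select_edge(Q, P, N, c=1.0):
    """Pick a* = argmax_a Q(s,a) + u(s,a), with u ∝ P(a|s) / (1 + N(s,a))."""
    u = c * P * np.sqrt(N.sum()) / (1.0 + N)   # exploration bonus (assumed form)
    return int(np.argmax(Q + u))

# Toy usage: 3 candidate moves at a node
Q = np.array([0.2, 0.5, 0.1])   # mixed rollout/value-network estimates
P = np.array([0.6, 0.3, 0.1])   # prior from the policy network
N = np.array([10, 40, 2])       # visit counts
print("selected action:", select_edge(Q, P, N))
```

Note how the bonus decays with N(s,a): heavily visited moves must justify themselves through Q, while the policy-network prior P steers early exploration.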
