Click to edit Master title style Approximate Models for Batch RL Click to edit Master subtitle style Emma Brunskill 2/18/15 2/18/15 1 1

Size: px

Start display at page:

Download "Click to edit Master title style Approximate Models for Batch RL Click to edit Master subtitle style Emma Brunskill 2/18/15 2/18/15 1 1"

Ashlyn Morrison
6 years ago
Views:

1 Approximate Click to edit Master titlemodels style for Batch RL Click to edit Emma Master Brunskill subtitle style 11

2 FVI / FQI PI Approximate model planners Policy Iteration maintains both an explicit representation of a policy and the value of that policy Image from David Silver 2

3 Forward Search w/generative Model a1 a2 s2 s2 Slide modified from David Silver 3

4 Exact/Exhaustive Forward Search a1 max exp a2 s2 exp s2 Slide modified from David Silver 4

5 How many nodes in a H-depth tree (as a function of state space S and action space A )? a1 max exp a2 s2 exp s2 Slide modified from David Silver 5

6 How many nodes in a H-depth tree (as a function of state space S and action space A )? ( S A )H a1 max exp a2 s2 exp s2 Slide modified from David Silver 6

7 Sparse Sampling: Don t Enumerate All Next States, Instead Sample Next States s ~ P(s s,a) a1 max exp a2 s2 exp s37 s2 Sample n next states, si ~ P(s s,a) Compute (1/n) Sumi V(s_i) Converges to expected future reward: Sums p(s s,a)v(s ) Slide modified from David Silver 7

8 Sparse Sampling: # nodes if sample n states at each action node? Independent of S! O(n A )H a1 max exp a2 s2 exp s37 s2 Sample n next states, si ~ P(s s,a) Compute (1/n) Sumi V(s_i) Converges to expected future reward: Sums p(s s,a)v(s ) Slide modified from David Silver 8

9 Sparse Sampling: # nodes if sample n states at each action node? Independent of S! O(n A )H a1 max exp a2 s2 exp s37 s2 Upside: Can choose n to achieve bounds on the accuracy of the value function at the root state, independent of state space size Downside: Still exponential in horizon, n still large for good bounds Slide modified from David Silver 9

10 Limitation of Sparse Sampling Slide modified from Alan Fern 10

11 Monte Carlo Tree Search Combine ideas of sparse sampling with an adaptive method for focusing on more promising parts of the ree Here more promising means the actions that are seem likely to yield higher long term reward Uses the idea of simulation search 11

12 Simulation Based Search Slide modified from David Silver 12

13 Simulation based Search Slide modified from David Silver 13

14 Simple Monte Carlo Search rollout policy dafa greedy improvement with respect to fixed rollout policy Slide modified from David Silver 14

15 Upper Confidence Tree (UCT) [Kocsis & Szepesvari, 2006] Combine forward search and simulation search Instance of Monte-Carlo Tree Search Repeated Monte Carlo simulation of rollout policy Rollouts add one or more nodes to search tree UCT Uses optimism under uncertainty idea Some nice theoretical properties Much better realtime performance than sparse sampling Slide modified from Alan Fern 15

16 a1 Slide modified from Alan Fern 16

17 a1 a2 Slide modified from Alan Fern 17

18 a1 a2 Slide modified from Alan Fern 18

19 a1 a2 Slide modified from Alan Fern 19

20 a1 s2 a2 1 Slide modified from Alan Fern 20

21 (Upper Confidence Bound) Slide modified from Alan Fern 21

22 Slide modified from Alan Fern 22

23 Requires us to have a simulator/ generative model Each pass down the tree, follow tree policy until reach a state leaf where not all actions have been tried. Then need to simulate starting from that state leaf the result of taking another action Slide modified from Alan Fern 23

24 Guarantees on UCT Slide modified from Remi Munos 24

25 Computer Go Previous game tree approaches faired poorly Slide modified from slides from Alan Fern & David Silver 25

26 Rules of Go Slide modified from David Silver 26

27 Position Evaluation in Go Slide modified from David Silver 27

28 Monte Carlo Evaluation in Go: Planning problem, just a very very hard one Slide modified from David Silver 28

29 Enormous Progress. MCTS Huge Impact Slide modified from David Silver 29

30 Going Back to Batch RL... Use supervised learning method to compute model Use learned model with MCTS planning Note: error in model will impact error in estimated values! Computes an action for current state, take action, then redo planning for next state 30

31 Autonomous Driving using Texplore (Hester and Stone 2013) 31

32 FVI / FQI API Approximate model planners Image from David Silver 32

33 33

Monte Carlo Tree Search PAH 2015

Monte Carlo Tree Search PAH 2015 MCTS animation and RAVE slides by Michèle Sebag and Romaric Gaudel Markov Decision Processes (MDPs) main formal model Π = S, A, D, T, R states finite set of states of the