Function Approximation. Pieter Abbeel UC Berkeley EECS

Size: px

Start display at page:

Download "Function Approximation. Pieter Abbeel UC Berkeley EECS"

Melvyn Horton
6 years ago
Views:

1 Function Approximation Pieter Abbeel UC Berkeley EECS

2 Outline Value iteration with function approximation Linear programming with function approximation

3 Value Iteration Algorithm: Start with for all s. For i=1,, H For all states s 2 S: Impractical for large state spaces = the expected sum of rewards accumulated when starting from state s and acting optimally for a horizon of i steps = the optimal action when in state s and getting to act for a horizon of i steps

Nine basis functions, 10,..., 18, each mapping the state to the absolute difference between heights of successive columns: h[k+1] h[k], k = 1,..., 9.

4 Example: tetris state: board configuration + shape of the falling piece ~2 200 states! action: rotation and translation applied to the falling piece 22 features aka basis functions Á i Ten basis functions, 0,..., 9, mapping the state to the height h[k] of each of the ten columns. Nine basis functions, 10,..., 18, each mapping the state to the absolute difference between heights of successive columns: h[k+1] h[k], k = 1,..., 9. One basis function, 19, that maps state to the maximum column height: max k h [k] One basis function, 20, that maps state to the number of holes in the board. One basis function, 21, that is equal to 1 in every state. ˆV θ (s) = 21 i=0 θ i φ i (s) =(s) [Bertsekas & Ioffe, 1996 (TD); Bertsekas & Tsitsiklis 1996 (TD);Kakade 2002 (policy gradient); Farias & Van Roy, 2006 (approximate LP)]

5 Function Approximation V(s) = + distance to closest ghost + θ 2 distance to closest power pellet + θ 3 in dead-end + θ 4 closer to power pellet than ghost is + = θ 0 θ 1 n θ i φ i (s) =(s) i=0

6 Function Approximation 0 th order approximation (1-nearest neighbor):. s x1 x2 x3 x x5 x6 x7 x8.... x9 x10 x11 x12 φ(s) = ˆV (s) = ˆV (x4) = θ 4 ˆV (s) =(s) Only store values for x1, x2,, x12 call these values θ 1, θ 2,...,θ 12 Assign other states value of nearest x state

7 Function Approximation 1 th order approximation (k-nearest neighbor interpolation): ˆV (s) =φ 1 (s)θ 1 + φ 2 (s)θ 2 + φ 5 (s)θ 5 + φ 6 (s)θ x1. x2 x3 x4 s.... x5 x6 x7 x8.... x9 x10 x11 x12 φ(s) = ˆV (s) =(s) Only store values for x1, x2,, x12 call these values θ 1, θ 2,...,θ 12 Assign other states interpolated value of nearest 4 x states

8 Function Approximation Examples: S = R, ˆV (s) =θ1 + θ 2 s S = R, ˆV (s) =θ1 + θ 2 s + θ 3 s 2 S = R, ˆV (s) = n i=0 θ i s i S, ˆV (s) = log( exp((s)) )

9 Function Approximation Main idea: Use approximation of the true value function, θ ˆV θ is a free parameter to be chosen from its domain V Θ S Representation size: à downto: Θ + : less parameters to estimate - : less expressiveness, typically there exist many V for which there is no θ such that ˆVθ = V

10 Supervised Learning Given: set of examples (s (1),V(s (1) )),,(s (2),V(s (2) )),...,(s (m),v(s (m) )) Asked for: best ˆV θ Representative approach: find through least squares: min θ Θ m ( ˆV θ (s (i) ) V (s (i) )) 2 i=1 θ

11 Supervised Learning Example Linear regression Observation Error or residual Prediction min θ 0,θ 1 n (θ 0 + θ 1 x (i) y (i) ) 2 i=

12 Overfitting To avoid overfitting: reduce number of features used Practical approach: leave-out validation Perform fitting for different choices of feature sets using just 70% of the data Pick feature set that led to highest quality of fit on the remaining 30% of data

13 Status Function approximation through supervised learning BUT: where do the supervised examples come from?

14 Value Iteration with Function Approximation Pick some (typically ) Initialize by choosing some setting for Iterate for i = 0, 1, 2,, H: Step 1: Bellman back-ups s S : S S Vi+1 (s) max a s T (s, a, s ) S << S θ (0) R(s, a, s )+γ ˆV θ (i)(s ) Step 2: Supervised learning find θ (i+1) as the solution of: min ˆVθ (i+1)(s) V 2 i+1 (s) θ s S

15 Value Iteration with Function Approximation --- Example Mini-tetris: two types of blocks, can only choose translation (not rotation) Example state: Reward = 1 for placing a block Sink state / Game over is reached when block is placed such that part of it extends above the red rectangle If you have a complete row, it gets cleared

16 Value Iteration with Function Approximation --- Example S = {,,, }

17 Value Iteration with Function Approximation --- Example S = {,,, } 10 features aka basis functions Á i Four basis functions, 0,..., 3, mapping the state to the height h[k] of each of the four columns. Three basis functions, 4,..., 6, each mapping the state to the absolute difference between heights of successive columns: h[k+1] h[k], k = 1,..., 3. One basis function, 7, that maps state to the maximum column height: max k h[k] One basis function, 8, that maps state to the number of holes in the board. One basis function, 9, that is equal to 1 in every state. Init \theta^{(0)} = ( -1, -1, -1, -1, -2, -2, -2, -3, -2, 10)

18 Value Iteration with Function Approximation --- Example Bellman back-ups for the states in S : V( ) = max {0.5 *(1+ V( ))+0.5*(1 + V( ) ), 0.5 *(1+ V( ))+0.5*(1 + V( ) ), 0.5 *(1+ V( ))+0.5*(1 + V( ) ), 0.5 *(1+ V( ))+0.5*(1 + V( ) ),

19 Value Iteration with Function Approximation --- Example Bellman back-ups for the states in S : V( ) = max {0.5 *(1+ V( ))+0.5*(1 + V( ) ), 0.5 *(1+ V( ))+0.5*(1 + V( ) ), 0.5 *(1+ V( ))+0.5*(1 + V( ) ), 0.5 *(1+ V( ))+0.5*(1 + V( ) ),

20 Value Iteration with Function Approximation --- Example S = {,,, } 10 features aka basis functions Á i Four basis functions, 0,..., 3, mapping the state to the height h[k] of each of the four columns. Three basis functions, 4,..., 6, each mapping the state to the absolute difference between heights of successive columns: h[k+1] h[k], k = 1,..., 3. One basis function, 7, that maps state to the maximum column height: max k h[k] One basis function, 8, that maps state to the number of holes in the board. One basis function, 9, that is equal to 1 in every state. Init θ (0) =( 1, 1, 1, 1, 2, 2, 2, 3, 2, 20)

21 Value Iteration with Function Approximation --- Example Bellman back-ups for the states in S : V( )=max {0.5 *(1+ ( ))+0.5*(1 + ( ) ), (6,2,4,0, 4, 2, 4, 6, 0, 1) (6,2,4,0, 4, 2, 4, 6, 0, 1) 0.5 *(1+ ( ))+0.5*(1 + ( ) ), (2,6,4,0, 4, 2, 4, 6, 0, 1) (2,6,4,0, 4, 2, 4, 6, 0, 1) 0.5 *(1+ ( ))+0.5*(1 + ( ) ), (sink-state, V=0) (sink-state, V=0) 0.5 *(1+ ( ))+0.5*(1 + ( ) )} (0,0,2,2, 0,2,0, 2, 0, 1) (0,0,2,2, 0,2,0, 2, 0, 1)

22 Value Iteration with Function Approximation --- Example Bellman back-ups for the states in S : V( )=max {0.5 *( )+0.5*( ), 0.5 *( )+0.5*( ), 0.5 *(1+ 0 )+0.5*(1 + 0 ), 0.5 *(1+ 6 )+0.5*(1 + 6 ), = 6.4 (for = 0.9)

23 Value Iteration with Function Approximation --- Example Bellman back-ups for the second state in S : θ (0) =( 1, 1, 1, 1, 2, 2, 2, 3, 2, 20) V( )=max {0.5 *(1+ ( ))+0.5*(1 + ( ) ), (sink-state, V=0) (sink-state, V=0) 0.5 *(1+ ( ))+0.5*(1 + ( ) ), (sink-state, V=0) (sink-state, V=0) 0.5 *(1+ ( ))+0.5*(1 + ( ) ), (sink-state, V=0) (sink-state, V=0) 0.5 *(1+ ( ))+0.5*(1 + ( ) )} = 19 (0,0,0,0, 0,0,0, 0, 0, 1) (0,0,0,0, 0,0,0, 0, 0, 1) -> V = 20 -> V = 20

24 Value Iteration with Function Approximation --- Example Bellman back-ups for the third state in S : θ (0) =( 1, 1, 1, 1, 2, 2, 2, 3, 2, 20) V( )=max {0.5 *(1+ ( ))+0.5*(1 + ( ) ), (4,4,0,0, 0,4,0, 4, 0, 1) (4,4,0,0, 0,4,0, 4, 0, 1) -> V = -8 -> V = *(1+ ( ))+0.5*(1 + ( ) ), (2,4,4,0, 2,0,4, 4, 0, 1) (2,4,4,0, 2,0,4, 4, 0, 1) -> V = -14 -> V = *(1+ ( ))+0.5*(1 + ( ) ), (0,0,0,0, 0,0,0, 0, 0, 1) (0,0,0,0, 0,0,0, 0, 0, 1) -> V = 20 -> V = 20 = 19

25 Value Iteration with Function Approximation --- Example Bellman back-ups for the fourth state in S : θ (0) =( 1, 1, 1, 1, 2, 2, 2, 3, 2, 20) V( )=max {0.5 *(1+ ( ))+0.5*(1 + ( ) ), (6,6,4,0, 0,2,4, 6, 4, 1) (6,6,4,0, 0,2,4, 6, 4, 1) -> V = -34 -> V = *(1+ ( ))+0.5*(1 + ( ) ), (4,6,6,0, 2,0,6, 6, 4, 1) (4,6.6,0, 2,0,6, 6, 4, 1) -> V = -38 -> V = *(1+ ( ))+0.5*(1 + ( ) ), (4,0,6,6, 4,6,0, 6, 4, 1) (4,0,6,6, 4,6,0, 6, 4, 1) -> V = -42 -> V = -42 = -29.6

26 Value Iteration with Function Approximation --- Example After running the Bellman back-ups for all 4 states in S we have: We now run supervised learning on these 4 examples to find a new µ: V( )= 6.4 (2,2,4,0, 0,2,4, 4, 0, 1) V( )= 19 (4,4,4,0, 0,0,4, 4, 0, 1) V( )= 19 (2,2,0,0, 0,2,0, 2, 0, 1) V( )= (4,0,4,0, 4,4,4, 4, 0, 1) min θ (6.4 ( )) 2 +(19 ( )) 2 +(19 ( )) 2 +(( 29.6) ( )) 2 à Running least squares gives new µ θ (1) =(0.195, 6.24, 2.11, 0, 6.05, 0.13, 2.11, 2.13, 0, 1.59)

27 Potential guarantees?

28 Simple example** µ 2µ Function approximator: [1 2] * µ

29 Simple example**

30 Composing operators** Definition. An operator G is a non-expansion with respect to a norm. if Fact. If the operator F is a contraction with respect to a norm. and the operator G is a non-expansion with respect to the same norm, then the sequential application of the operators G and F is a -contraction, i.e., Corollary. If the supervised learning step is a nonexpansion, then iteration in value iteration with function approximation is a -contraction, and in this case we have a convergence guarantee.

31 Averager function approximators are non-expansions** Examples: nearest neighbor (aka state aggregation) linear interpolation over triangles (tetrahedrons, )

32 Averager function approximators are non-expansions**

33 Linear regression L ** [Example taken from Gordon, 1995.]

34 Guarantees for fixed point** I.e., if we pick a non-expansion function approximator which can approximate J* well, then we obtain a good value function estimate. To apply to discretization: use continuity assumptions to show that J* can be approximated well by chosen discretization scheme

35 Outline Value iteration with function approximation Linear programming with function approximation

36 Infinite Horizon Linear Program min V µ 0 (s)v (s) s S s.t. V(s) s T (s, a, s )[R(s, a, s )+γv (s )], s S, a A µ 0 is a probability distribution over S, with µ 0 (s)> 0 for all s 2 S. Theorem. V * is the solution to the above LP.

37 Infinite Horizon Linear Program min V µ 0 (s)v (s) s S s.t. V(s) s T (s, a, s )[R(s, a, s )+γv (s )], s S, a A V (s) =(s) Let:, and consider S rather than S: min θ s S µ 0 (s)(s) s.t. (s) s T (s, a, s ) R(s, a, s )+γ(s ), s S,a A à Linear program that finds ˆV θ (s) =(s)

38 Approximate Linear Program Guarantees** min θ s S µ 0 (s)(s) s.t. (s) s T (s, a, s ) R(s, a, s )+γ(s ), s S,a A LP solver will converge Solution quality: [de Farias and Van Roy, 2002] Assuming one of the features is the feature that is equal to one for all states, and assuming S =S we have that: V Φθ 1,µ0 2 1 γ min V Φθ θ (slightly weaker, probabilistic guarantees hold for S not equal to S, these guarantees require size of S to grow as the number of features grows)

Func%on Approxima%on. Pieter Abbeel UC Berkeley EECS

Func%on Approxima%on. Pieter Abbeel UC Berkeley EECS Func%on Approxima%on Pieter Abbeel UC Berkeley EECS Value Itera4on Algorithm: Start with for all s. For i=1,, H For all states s 2 S: Imprac4cal for large state spaces = the expected sum of rewards accumulated