On Movement Skill Learning and Movement Representations for Robotics

1 On Movement Skill Learning and Movement Representations for Robotics
Gerhard Neumann, Graz University of Technology, Institute for Theoretical Computer Science
November 2, 2011

2 Modern Robotic Systems: Motivation... Many degrees of freedom, compliant actuators, highly dynamic movements...

3 In principle, the advanced morphology of these robots would allow us to perform a wide range of complex movements, such as different forms of locomotion (walking, running, trotting), jumping, playing tennis... Classical control methods often fail or are very hard to use for such complex movements. More promising approach : let the robot learn the movement by trial and error. Main topic of this thesis!

4 Movement Skill Learning for Robotics Movement skill learning can easily be formulated as a Reinforcement Learning problem: the agent has to search for a policy which optimizes the reward. So why is it challenging? High dimensional continuous state spaces. High dimensional continuous action spaces. Data is expensive : learning needs to be data efficient and needs to be safe.

5 Movement Skill Learning for Robotics Learning algorithms can be roughly divided into Value-based methods Policy-search methods

6 Value-based methods Estimate the expected discounted future reward for each state s when following policy π:
$$V^\pi(s) = E\left[\sum_{t=0}^{\infty} \gamma^t r_t\right]$$
Also denoted as the value function of policy π. Recursive form: $V^\pi(s) = E\left[r(s,a) + \gamma V^\pi(s')\right]$

7 Value-based methods + The value function can be used to assess the quality of each intermediate action of an episode, e.g. by the use of the Temporal Difference (TD) error
$$\delta_t = r_t + \gamma V^\pi(s_{t+1}) - V^\pi(s_t)$$
It evaluates whether the current step $(s_t, a_t, r_t, s_{t+1})$ was better or worse than expected. We can efficiently solve the temporal credit assignment problem. - The value function is very hard to estimate in high-dimensional continuous state and action spaces.
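
As a concrete illustration, a minimal Python sketch of this TD-error computation for one recorded episode; the value estimate V, the discount factor, and the toy episode are placeholders, not part of the thesis.

```python
import numpy as np

gamma = 0.98  # assumed discount factor

def V(s):
    # placeholder value estimate; in practice this is the learned V^pi
    return -float(np.sum(np.square(s)))

def td_errors(states, rewards):
    """delta_t = r_t + gamma * V(s_{t+1}) - V(s_t) for each step of the episode."""
    deltas = []
    for t in range(len(rewards)):
        v_next = V(states[t + 1]) if t + 1 < len(states) else 0.0
        deltas.append(rewards[t] + gamma * v_next - V(states[t]))
    return np.array(deltas)

# toy episode with 2-D states
states = [np.array([0.0, 1.0]), np.array([0.1, 0.8]), np.array([0.2, 0.5])]
rewards = [-1.0, -0.5]
print(td_errors(states, rewards))
```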

8 Policy Search Methods Rely on a parametric representation of the policy π(a|s; w) with policy parameters w. Directly optimize the policy parameters by performing rollouts on the real system. - We can only assess the quality of a whole trajectory instead of single actions. + However, as no value function has to be estimated, this assessment can be done very accurately. More successful than value-based methods. Performance strongly depends on the used movement representation.

9 Outline: The thesis is divided into 3 parts... Value-based Methods Graph-Based Reinforcement Learning Fitted Q-Iteration by Advantage Weighted Regression

10 Outline: The thesis is divided into 3 parts... Value-based Methods Graph-Based Reinforcement Learning Fitted Q-Iteration by Advantage Weighted Regression Movement Representations Kinematic Synergies Motion Templates Planning Movement Primitives

11 Outline: The thesis is divided into 3 parts... Value-based Methods Graph-Based Reinforcement Learning Fitted Q-Iteration by Advantage Weighted Regression Movement Representations Kinematic Synergies Motion Templates Planning Movement Primitives Policy Search Variational Inference for Policy Search in Changing Situations

12 Outline: The thesis is divided into 3 parts... Value-based Methods Graph-Based Reinforcement Learning Fitted Q-Iteration by Advantage Weighted Regression Movement Representations Kinematic Synergies Motion Templates Planning Movement Primitives Policy Search Variational Inference for Policy Search in Changing Situations

13 Outline: The thesis is divided into 3 parts... Value-based Methods Graph-Based Reinforcement Learning Fitted Q-Iteration by Advantage Weighted Regression

14 Fitted Q-iteration : Batch-Mode Reinforcement Learning (BMRL) Batch-mode RL methods use the whole history H of the agent to update the value or action-value function:
$$H = \{\langle s_i, a_i, r_i, s'_i \rangle\}_{1 \le i \le N}$$
Advantage : data points are used more efficiently than in online methods.

15 Fitted Q-iteration : Batch-Mode Reinforcement Learning (BMRL) Fitted Q-Iteration (Ernst et al., 2003) approximates the state-action value function Q(s,a) by iteratively using supervised regression techniques. Repeat K times:
$$Q^{k+1}(i) = r_i + \gamma V^k(s'_i) = r_i + \gamma \max_{a'} Q^k(s'_i, a')$$
$$D^k = \left\{\left[(s_i, a_i),\, Q^{k+1}(i)\right]\right\}_{1 \le i \le N}, \qquad Q^{k+1} = \text{Regress}(D^k)$$
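
A hedged sketch of this generic FQI loop, assuming for simplicity a small discrete set of candidate actions for the max (the continuous-action case is exactly what the following slides address) and using scikit-learn's ExtraTreesRegressor as the supervised learner; all names and settings are illustrative.

```python
import numpy as np
from sklearn.ensemble import ExtraTreesRegressor

def fitted_q_iteration(S, A, R, S_next, candidate_actions, gamma=0.98, K=50):
    """Generic FQI sketch. S, A, R, S_next hold the transition history
    <s_i, a_i, r_i, s'_i>; candidate_actions is a small set of actions
    (1-D arrays) used to approximate max_{a'} Q^k(s'_i, a')."""
    X = np.hstack([S, A])                          # regression inputs (s_i, a_i)
    targets = np.asarray(R, dtype=float).copy()    # Q^1(i) = r_i (V^0 = 0)
    q_model = None
    for _ in range(K):
        q_model = ExtraTreesRegressor(n_estimators=50).fit(X, targets)
        # V^k(s'_i) = max over the candidate actions of Q^k(s'_i, a')
        q_next = np.column_stack([
            q_model.predict(np.hstack([S_next, np.tile(a, (len(S_next), 1))]))
            for a in candidate_actions])
        targets = R + gamma * q_next.max(axis=1)
    return q_model
```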

16 Fitted Q-iteration : Batch-Mode Reinforcement Learning (BMRL) + FQI has proven to outperform classical online RL methods in many applications (Ernst et al., 2005). + Any type of supervised learning method can be used... E.g. neural networks (Riedmiller, 2005), regression trees (Ernst et al., 2005), Gaussian Processes - High computational demands...

17 FQI for Robotics... Continuous state spaces : Any type of supervised learning method can be used... E.g. neural networks, regression trees, Gaussian Processes. Continuous action spaces : We have to solve $Q^{k+1}(i) = r_i + \gamma \max_{a'} Q^k(s'_i, a')$ - Hm... how do we perform the $\max_{a'}$ operator in continuous action spaces?

18 FQI for Robotics... Hm... how do we perform the $\max_{a'}$ operator in continuous action spaces? Discretizations become prohibitively expensive in high dimensional spaces. We have to solve an optimization problem for each sample! E.g. use Cross-Entropy optimization for each data point $s_i$.

19 FQI for Robotics... Hm... how do we perform the $\max_{a'}$ operator in continuous action spaces? We show that an advantage-weighted regression can be used to approximate $\max_a Q(s,a)$. The regression uses the states $s_i$ as input values and $Q(s_i,a_i)$ as target values. The weighting $w_i = \exp(\tau \bar A(s_i,a_i))$ of each data point is based on the advantage function $A(s,a) = Q(s,a) - V(s)$.

20 FQI for Robotics... What is a weighted regression? Minimize the error function w.r.t. θ:
$$E = \sum_{i=1}^{N} w_i \left(V(s_i;\theta) - Q(s_i,a_i)\right)^2$$
$w_i$... each data point gets an individual weighting.
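
For illustration, a minimal sketch of such a weighted regression for a linear value model V(s; θ) = φ(s)ᵀθ, solved in closed form with numpy; the feature choice, ridge term, and toy data are placeholders.

```python
import numpy as np

def weighted_regression(Phi, q_targets, weights, ridge=1e-6):
    """Minimize E = sum_i w_i (Phi_i^T theta - Q_i)^2 in closed form.
    Phi: (N, d) state features, q_targets: (N,) values Q(s_i, a_i)."""
    W = np.diag(weights)
    A = Phi.T @ W @ Phi + ridge * np.eye(Phi.shape[1])
    b = Phi.T @ W @ q_targets
    return np.linalg.solve(A, b)

# toy usage: constant + linear features of 1-D states
s = np.linspace(-1.0, 1.0, 20)
Phi = np.column_stack([np.ones_like(s), s])
q = -s ** 2                   # pretend Q-values
w = np.exp(-5.0 * s ** 2)     # pretend per-sample weights
theta = weighted_regression(Phi, q, w)
```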

21 FQI for Robotics... We prove this by applying the following 2 steps: weighted regression for value estimation, and soft-greedy policy improvement.

22 Weighted regression for value estimation The value function of a stochastic policy π is given by
$$V^\pi(s) = \int_a \pi(a|s)\, Q(s,a)\, da$$
We show that this can be approximated without evaluating the integral by solving a weighted regression problem:
$$D_V = \{\, s_i,\, Q(s_i,a_i) \,\}, \qquad U = \{\pi(a_i|s_i)\}, \qquad \hat V = \text{WeightedReg}(D_V, U)$$

23 Proof We want to find an approximation $\hat V(s)$ of $V^\pi(s)$ by minimizing the error function
$$\text{Error}(\hat V) = \int_s \mu(s) \left(\int_a \pi(a|s)\, Q(s,a)\, da - \hat V(s)\right)^2 ds = \int_s \mu(s) \left(\int_a \pi(a|s)\left(Q(s,a) - \hat V(s)\right) da\right)^2 ds,$$
µ(s) : state distribution when following policy π(·|s).

24 Proof Squared error function :
$$\text{Error}(\hat V) = \int_s \mu(s) \left(\int_a \pi(a|s)\left(Q(s,a) - \hat V(s)\right) da\right)^2 ds$$
An upper bound of $\text{Error}(\hat V)$ is given by
$$\text{Error}_B(\hat V) = \int_s \int_a \mu(s)\, \pi(a|s) \left(Q(s,a) - \hat V(s)\right)^2 da\, ds \;\ge\; \text{Error}(\hat V).$$
Use of Jensen's inequality.

25 Proof It is easy to show that both error functions have the same minimum for $\hat V$. The upper bound $\text{Error}_B$ can be approximated straightforwardly by samples $\{(s_i,a_i), Q(s_i,a_i)\}_{1 \le i \le N}$:
$$\text{Error}_B(\hat V) \approx \sum_{i=1}^{N} \pi(a_i|s_i) \left(Q(s_i,a_i) - \hat V(s_i)\right)^2$$
No integral over the action space is needed!

26 FQI for Robotics... We prove this by applying the following 2 steps: weighted regression for value estimation, and soft-greedy policy improvement.

27 Soft-greedy policy improvement The optimal value function $V^*(s) = \max_a Q(s,a)$ can be approximated without evaluating $\max_a Q(s,a)$ by solving an advantage-weighted regression problem:
$$D_V = \{\, s_i,\, Q(s_i,a_i) \,\}, \qquad U = \{ \exp(\tau \bar A(s_i,a_i)) \}, \qquad \hat V = \text{WeightedReg}(D_V, U)$$
- τ... greediness parameter of the algorithm. - $\bar A(s,a)$... normalized advantage function.

28 Proof We approximate the value function $V^{\pi_1}$ of a soft-max policy $\pi_1$ by the use of weighted regression. Since a soft-max policy is an approximation of the greedy policy, we can replace $V^*(s) = \max_a Q(s,a)$ with $V^{\pi_1}(s)$.

29 Proof The used soft-max policy $\pi_1(a|s)$ is based on the advantage function $A(s,a) = Q(s,a) - V(s)$:
$$\pi_1(a|s) = \frac{\exp(\tau \bar A(s,a))}{\int_{a'} \exp(\tau \bar A(s,a'))\, da'}, \qquad \bar A(s,a) = \frac{A(s,a) - m_A(s)}{\sigma_A(s)}.$$
If we assume that the advantages A(s,a) are normally distributed, the denominator of $\pi_1$ is constant. Thus we can use $\exp(\tau \bar A(s,a)) \propto \pi_1(a|s)$ directly as weighting for the regression.
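
A small sketch of this weighting; note that in LAWER the statistics $m_A(s)$ and $\sigma_A(s)$ are computed locally around each state, while this toy version normalizes one batch of advantages globally for brevity.

```python
import numpy as np

def advantage_weights(advantages, tau=1.0):
    """u_i = exp(tau * A_bar), with A_bar the normalized advantage.
    The mean m_A and std sigma_A are taken over one batch here; in LAWER
    they are local statistics around each state s."""
    a_bar = (advantages - advantages.mean()) / (advantages.std() + 1e-8)
    return np.exp(tau * a_bar)

print(advantage_weights(np.array([-1.0, 0.0, 2.0]), tau=2.0))
```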

30 Concrete algorithm : LAWER The Locally-Advantage WEighted Regression (LAWER) algorithm implements the presented theoretical results. It combines Locally Weighted Regression (LWR, (Atkeson et al., 1997)) and advantage-weighted regression. The locality weighting $w_i(s)$ and the advantage weighting $u_i = \exp(\tau \bar A(s_i,a_i))$ can be multiplicatively combined.

31 Concrete algorithm : LAWER The value function is then given by a simple weighted linear regression:
$$V^{k+1}(s) = \tilde s^T (\tilde S^T U \tilde S)^{-1} \tilde S^T U Q^{k+1}$$
$\tilde s = [1, s^T]^T$, $\tilde S = [\tilde s_1, \tilde s_2, \ldots, \tilde s_N]^T$ is the state matrix, and $U = \text{diag}(w_i(s)\, u_i)$. In order to approximate $V(s) = \max_a Q^k(s,a)$, only the Q-values of neighboring state-action pairs are needed.
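
A minimal numpy sketch of this weighted linear regression evaluated at a single query state, combining an (assumed) Gaussian locality kernel with the advantage weights; the bandwidth and regularization are illustrative choices, not the settings of the thesis.

```python
import numpy as np

def lawer_value(s_query, S, Q, norm_advantages, bandwidth=0.5, tau=1.0):
    """Weighted linear regression at one query state: Gaussian locality
    weights w_i(s) times advantage weights u_i, then
    V(s) = s_tilde^T (S_tilde^T U S_tilde)^{-1} S_tilde^T U Q."""
    dists = np.sum((S - s_query) ** 2, axis=1)
    w_loc = np.exp(-dists / (2.0 * bandwidth ** 2))      # locality weighting w_i(s)
    u_adv = np.exp(tau * norm_advantages)                # advantage weighting u_i
    U = np.diag(w_loc * u_adv)
    S_tilde = np.hstack([np.ones((len(S), 1)), S])       # rows are [1, s_i^T]
    theta = np.linalg.solve(
        S_tilde.T @ U @ S_tilde + 1e-6 * np.eye(S_tilde.shape[1]),
        S_tilde.T @ U @ Q)
    return np.concatenate([[1.0], s_query]) @ theta
```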

32 Approximation of the policy For unseen states we need to approximate the soft-max policy by a Gaussian policy $\pi(a|s) = \mathcal N(a\,|\,\mu(s), \sigma^2)$. For estimating this policy we use reward-weighted regression (Peters & Schaal, 2007), but the advantage is used instead of the reward for the weighting. Thus, we optimize the long-term reward instead of the immediate reward.

33 Results We use the Cross-Entropy (CE) optimization method (de Boer et al., 2005) as a comparison to find the maximum Q-values $\max_a Q(s,a)$. We compare the LAWER algorithm to 3 different state-of-the-art CE-based fitted Q-iteration algorithms: Tree-based FQI (Ernst et al., 2005) (CE-Tree), Neural FQI (Riedmiller, 2005) (CE-Net), LWR-based FQI (CE-LWR). After each FQI cycle new data was collected. The immediate reward function was quadratic in the distance to the goal position $x_G$ and in the applied torque/force.

34 Pendulum swing-up task A pendulum needs to be swung up from the position at the bottom to the top position (Riedmiller, 2005). 2 experiments with different torque punishment factors ($c_2$) were carried out.
[Figure: average reward vs. number of data collections for LAWER, CE-Tree, CE-LWR, and CE-Net; panel (a): $c_2$ = , panel (b): $c_2$ = 0.025.]

35 Comparison of torque trajectories
[Figure: torque trajectories u over time for LAWER, CE-Tree, and CE-LWR; panels (c) and (d) show the two torque punishment factors (panel (d): $c_2$ = 0.025).]

36 Dynamic puddle-world The agent has to navigate from a start position to a goal position; it gets negative reward when going through puddles. Dynamic version of the puddle-world : the agent can set a force accelerating a k-dimensional point mass. This was done for k = 2 and k = 3 dimensions.
[Figure: puddle-world layout with start and goal positions.]

37 Comparison of the algorithms
[Figure: average reward vs. number of data collections for LAWER and CE-Tree; panel (e) 2-D and panel (f) 3-D.]
The CE-Tree method learns faster, but does not manage to learn high quality policies for the 3D setting. LAWER also works for high dimensional action spaces.

38 Comparison of torque trajectories
[Figure: torque trajectories $u_1, u_2, u_3$ over time for (g) LAWER and (h) CE-Tree.]

39 Conclusion We have proven that the greedy operator max a Q(s,a) can be approximated efficiently by an advantage-weighted regression. The resulting algorithm runs an order of magnitude faster than competing algorithms. In spite of the resulting soft-greedy policy improvement our algorithm was able to produce policies of higher quality. The Locally-Advantage Weighted Regression algorithm allows us to use fitted Q-iteration even for high dimensional continuous action spaces.

40 Outline: The thesis is divided into 3 parts... Value-based Methods Graph-Based Reinforcement Learning Fitted Q-Iteration by Advantage Weighted Regression Movement Representations Kinematic Synergies Motion Templates Planning Movement Primitives Policy Search Variational Inference for Policy Search in Changing Situations

41 Movement Representations for Motor Skill Learning Directly optimize a parametric movement representation No value estimation is needed What is a good representation for learning a movement? Episodic Tasks: Often it is sufficient to formulate the learning task in the episodic RL setup Single initial state, specified fixed duration of the movement Direct Policy Search can be applied easily in this setup

42 Movement Representations for Motor Skill Learning Episodic setup : Use a trajectory-based representation. We learn a parametric representation of the desired trajectory $[q_d(t;w), \dot q_d(t;w)]$. t runs over the fixed duration of the movement; there is no direct dependence on the high dimensional state, t is now a scalar, which significantly simplifies the learning problem. Can only be used in the episodic setup (single start states). This trajectory is then followed by using feedback control laws. Most common movement representations are trajectory based... Dynamic Movement Primitives (Ijspeert & Schaal, 2003), Splines (Kolter & Ng, 2009),...
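
As a simple illustration of such a trajectory-based representation (not the specific DMP or spline formulations cited above), a sketch where the desired position is a weighted sum of Gaussian basis functions of time and the desired velocity is obtained by differencing; the basis layout and parameter values are illustrative.

```python
import numpy as np

def desired_position(t, w, duration=1.0):
    """q_d(t; w): weighted sum of Gaussian basis functions of time."""
    centers = np.linspace(0.0, duration, len(w))
    width = (duration / len(w)) ** 2
    phi = np.exp(-(t - centers) ** 2 / (2.0 * width))
    phi /= phi.sum()
    return float(phi @ w)

w = np.array([0.0, 0.3, 0.8, 1.0, 1.0])               # learned parameters (toy values)
ts = np.linspace(0.0, 1.0, 101)
q_d = np.array([desired_position(t, w) for t in ts])
qdot_d = np.gradient(q_d, ts)                          # desired velocity by differencing
```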

43 Trajectory-based vs. Value Based Motor Skill learning Trajectory-Based: Can be seen as a single-step decision task. The agent chooses the parameters w as the action of a single, temporally extended step. Only one step per episode... Value-Based: One decision per time step of the agent. The agent chooses the torque u as the action of a single, very short time step. Up to a few hundred steps per episode...

44 Trajectory-based vs. Value Based Motor Skill learning Can we find a more intuitive solution for which the agent chooses new actions only at certain, characteristic time points of the movement? Temporal Abstraction: Sequencing of temporally extended actions, also called Motion Templates

45 Temporal Abstractions for Motor-Skill Learning Example : Drawing a triangle with a pen Flat Setup Abstracted Setup We have to make many unessential decisions The movement can be easily decomposed into 3 elemental motions

46 Temporal Abstractions for Motor-Skill Learning Example : Drawing a triangle with a pen Flat Setup Abstracted Setup We have to make many unessential decisions The movement can be easily decomposed into 3 elemental motions

47 Temporal Abstractions for Motor-Skill Learning Standard framework for temporally extended actions : Options (Sutton et al., 1999) Options are closed loop policies taking actions over a period of time However: They are mainly used in discrete environments. In many applications options are discrete temporally extended actions E.g. Go to another room, Follow the hallway or Frighten the poor monkey For motor tasks useful options are often difficult to specify.

48 Temporal Abstractions for Motor-Skill Learning : Illustration Pendulum Swing-up Task : Standard RL benchmark task Learn how to swing up and balance an inverted pendulum from the bottom position We additionally want to minimize the energy consumption Flat RL : Choose a new action every 50ms

49 Pendulum Swing-Up: Illustration How can we decompose the trajectory into options?
[Figure: swing-up torque trajectory over time, with positive peaks, negative peaks, and the final balancing motion marked.]
We have positive and negative peaks in the torque trajectory followed by a final balancing motion.

50 Pendulum Swing-Up: Illustration How can we decompose the trajectory into options? Specify the exact form of the peaks and the balancing motion for the options? Requires a lot of prior knowledge... The learning task becomes trivial... However : We can specify the functional form of the options. Use parameterized options...

51 Motion Templates Motion templates : Parameterized Options, used as our building blocks of motion. A motion template $m_p$ is defined by : its $k_p$-dimensional parameter space $\Theta_p$, its parameterized policy $u_p(s,t;\theta_p)$, its termination condition $c_p(s,t;\theta_p)$. s... state, t... execution time, $\theta_p \in \Theta_p$... parameters. The functional form of $u_p$ and $c_p$ is chosen by the designer, the parameters $\theta_p$ are learned by the agent.

52 Motion Templates At each decision time step $\sigma_k$ the agent has to choose : which motion template $m_p \in A(\sigma_k)$ to use ($A(\sigma_k)$... set of available motion templates in decision time step $\sigma_k$), and which parameterization $\theta_p \in \Theta_p$ of $m_p$ to use. Subsequently the policy $u_p$ is executed until the termination condition $c_p$ is fulfilled. Continuous time : the duration of the templates can be continuous valued. The agent has to learn the correct sequence and parameterization of the motion templates.
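
A minimal sketch of this structure in Python; the class layout, the names, and the dynamics function `step` are illustrative, not the implementation used in the thesis.

```python
from dataclasses import dataclass
from typing import Callable

@dataclass
class MotionTemplate:
    """A parameterized option: policy u_p(s, t; theta_p) and termination
    condition c_p(s, t; theta_p)."""
    policy: Callable       # (state, t, theta) -> control u
    terminates: Callable   # (state, t, theta) -> bool

def run_template_sequence(sequence, step, state, dt=0.01):
    """Execute a chosen sequence of (template, theta) pairs; `step` is an
    assumed system-dynamics function state' = step(state, u, dt)."""
    for template, theta in sequence:
        t = 0.0
        while not template.terminates(state, t, theta):
            u = template.policy(state, t, theta)
            state = step(state, u, dt)
            t += dt
    return state
```

A peak template such as $m_1$ would then supply a time-dependent policy $u_p(s,t;\theta_p)$ parameterized by its height, curvature, and duration, and terminate once t exceeds the duration.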

53 Pendulum Swing-up : Decomposition into Motion Templates How can we decompose the trajectory into motion templates?
[Figure: swing-up torque trajectory over time, with positive peaks, negative peaks, and the final balancing motion marked.]
We have positive and negative peaks in the torque trajectory followed by a final balancing motion.

54 Pendulum Swing-Up : Templates to model the peaks We use 2 templates per peak: one for the ascending part, $m_1$, and one for the descending part, $m_2$. Both just depend on the execution time of the template.

55 Pendulum Swing-Up : Decomposition into Motion Templates We use 2 templates per peak: the ascending part ($m_1$) and the descending part ($m_2$).
[Figure: torque profiles of $m_1$ and $m_2$ for different heights $a_1$, $a_2$.]
Parameters : $a_i$... height of the template

56 Pendulum Swing-Up : Decomposition into Motion Templates We use 2 templates per peak: the ascending part ($m_1$) and the descending part ($m_2$).
[Figure: torque profiles of $m_1$ and $m_2$ for different curvatures $o_1$, $o_2$.]
Parameters : $a_i$... height of the template, $o_i$... curvature of the template

57 Pendulum Swing-Up : Decomposition into Motion Templates We use 2 templates per peak: the ascending part ($m_1$) and the descending part ($m_2$).
[Figure: torque profiles of $m_1$ and $m_2$ for different durations $d_1$, $d_2$.]
Parameters : $a_i$... height of the template, $o_i$... curvature of the template, $d_i$... duration of the template

58 Pendulum Swing-Up : Decomposition into Motion Templates We fix the height of the descending peak template $m_2$ to be the height of $m_1$. $m_3$ and $m_4$ are the same templates, just for negative peaks.
[Figure: swing-up torque trajectory decomposed into the positive-peak templates ($m_1$, $m_2$), the negative-peak templates ($m_3$, $m_4$), and the balancing motion.]

59 Pendulum Swing-Up : Decomposition into Motion Templates The balancing template is implemented as a PD-controller:
MT: $m_5$ | Functional form: $k_1 \theta + k_2 \dot\theta$ | Parameters: $k_1$, $k_2$
$k_1$ and $k_2$ are the PD controller gains. $m_5$ always runs for 20 s, subsequently the episode is terminated.
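
A minimal sketch of this balancing template's control law; the sign and angle conventions (θ measured from the upright position) are assumptions, and the gains are toy values.

```python
def balancing_policy(theta, theta_dot, k1, k2):
    """u = k1*theta + k2*theta_dot; with theta measured from the upright
    position (convention assumed), stabilizing gains are negative."""
    return k1 * theta + k2 * theta_dot

# small deviation from upright with (assumed) stabilizing gains
u = balancing_policy(theta=0.05, theta_dot=-0.2, k1=-30.0, k2=-5.0)
```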

60 Pendulum Swing-Up : Constructing the motion The agent can either choose the peak templates in the predefined order ($m_2$, $m_3$, $m_4$, $m_1$, $m_2$,...)... or it can use the balancing template $m_5$ as the final template. Thus the agent has to learn the correct number of swing-ups and the correct parameterization of the swing-ups.
[Figure: swing-up torque trajectory decomposed into the peak templates and the balancing motion.]

61 Pendulum Swing-Up : Constructing the motion
[Figure: the same torque trajectory shown with the flat decomposition (positive/negative peak parts, balancing motion) and with the motion-template decomposition ($m_2$, $m_3$, $m_4$, $m_1$, $m_5$).]
Flat : Approximately 50 decisions/parameters are needed to reach the top position. Motion Templates : The whole motion consists of only 5 decisions / 13 parameters.

62 Pendulum Swing-Up : Accuracy of the policy Motion templates decrease the number of necessary decisions significantly Overall learning task is simplified Ok... where is the catch?

63 Pendulum Swing-Up : Accuracy of the policy Motion templates decrease the number of necessary decisions significantly Overall learning task is simplified Ok... where is the catch? A single decision has now much more influence on the outcome of the whole motion. Therefore a single decision has to be made much more precisely than in flat RL.

64 Algorithm for Motion Template Learning An RL algorithm is needed which can learn very precise continuous valued policies! For each template $m_p$, we use an extension of the Locally Advantage WEighted Regression (LAWER, (Neumann & Peters, 2009)) algorithm to learn the policy $\pi_p(\theta_p|s)$ for selecting the parameters of $m_p$.

65 Extensions of LAWER : LAWER for Motion Template Learning Due to the increased precision requirements of motion template learning we had to develop 2 substantial extensions of LAWER: adaptive tree-based kernels, and an additional optimization to improve the approximation of $V(s) = \max_a Q(s,a)$.

66 Extensions of LAWER : Adaptive Tree-Based Kernels The use of a uniform weighting kernel is often problematic in the case of... high dimensional input spaces ( curse of dimensionality ), spatially varying data densities, and spatially varying curvatures of the regression surface. This problem can be alleviated by varying the shape of the weighting kernel. We do this by the use of randomized regression trees...

67 Extensions of LAWER : Improved approximation of V (s) = max a Q(s,a) In order to estimate the weightings u i, the original LAWER needed the assumption of normally distributed advantage values. Often this assumption does not hold and the estimate of u i gets imprecise. We improve the estimate of the u i by an additional optimization...

68 Experiments Minimum-time problems with additional energy-consumption constraints ($c_2$): Pendulum Swing-Up, 2-link Pendulum Swing-Up, 2-link Pendulum Balancing. Iterative learning protocol: we collect L episodes with the currently estimated exploration policy. Subsequently the optimal policy is re-estimated and the performance (summed reward) of the optimal policy (without exploration) is evaluated.

69 Experiments : Pendulum Swing-Up Comparison of learning progress for different energy punishment factors (L = 50).
Figure: Learning curves (average reward vs. number of data collections) for the Gaussian kernel (MT Gauss), the tree-based kernel (MT Tree), and flat RL, for (left) $c_2$ = and (right) $c_2$ = 0.075.

70 Experiments : Pendulum Swing-Up Comparison of the flat and the motion template policy.
Figure: (a) Torque trajectories and motion templates learned for different energy punishment factors $c_2$. (b) Torque trajectories learned with flat RL.
Performance for $c_2$ = : flat RL 48.6, motion templates 38.5

71 Experiments : 2-Link Pendulum Swing-Up Same templates as for the 1-dimensional task. The peak templates now have 2 additional parameters, the height and the curvature for the second control dimension $u_2$. The parameters of the balancer template $m_5$ consist of two 2×2 matrices for the controller gains.
Figure: Comparison (average reward vs. number of data collections) for motion template learning with tree-based kernels (MT Tree) and flat RL.

72 Experiments : 2-Link Pendulum Swing-Up Learned motion template policy.
Figure: Left: Torque trajectories ($u_1$, $u_2$) and decomposition into the motion templates ($m_2$, $m_3$, $m_4$, $m_5$). Right: Illustration of the motion. The bold postures represent the switching time points of the motion templates.

73 Conclusions We have shown that by the use of motion templates, i.e. parametrized options, many motor tasks can be decomposed into elemental movements. Motion templates are the first movement representation which can be sequenced in time. While the whole motion consists of fewer decisions, a single decision has to be made more precisely. We propose a new algorithm for motion template learning which can cope with these precision requirements. We have shown that learning with motion templates can produce policies of higher quality than flat RL and could even be applied to tasks where flat RL was not successful.

74 Outline: The thesis is divided into 3 parts... Value-based Methods Graph-Based Reinforcement Learning Fitted Q-Iteration by Advantage Weighted Regression Movement Representations Kinematic Synergies Motion Templates Planning Movement Primitives Policy Search Variational Inference for Policy Search in Changing Situations

75 Policy Search for trajectory-based representations Back to trajectory-based representations Only 1 decision per episode : Choose parameter vector w Typically w is very high dimensional ( parameters) How can we optimize the parameters w? Policy Gradient Methods (Williams, 1992; Peters & Schaal, 2006) EM-based Methods (Kober & Peters, 2010) Inference-based Methods (Vlassis et al., 2009; Theodorou et al., 2010)

76 Inference-based Methods: Policy Search for changing Situations In different situations $s_0^i$ we have to choose different parameter vectors $w_i$. Can we generalize between solutions to avoid relearning? Learn a hierarchical policy $\pi_{MP}(w|s_0;\theta)$ which chooses the parameter vector w according to the situation $s_0$. In order to do so we will use approximate inference methods.

77 Outline Approximate Inference for Policy Search Decomposition of the log-likelihood Monte-Carlo EM based methods Variational Inference based methods Policy Search for Movement Primitives in changing situations 4-Link Balancing

78 Approximate Inference for Policy Search Using inference or inference-based methods has proven to be very useful for policy search: PoWER (Kober & Peters, 2010), Policy Improvement by Path Integrals (Theodorou et al., 2010), Reward-Weighted Regression, Cost-Regularized Kernel Regression (Kober et al., 2010), Monte Carlo EM Policy Search (Vlassis et al., 2009), CMA-ES (Heidrich-Meisner & Igel, 2009).

79 All these algorithms use the Moment-projection of a certain target distribution to estimate the policy. As we will see, this can be problematic in many cases (multi-modal solution space, complex reward functions,...). Here we will introduce the theory to use the Information-projection and show that this projection alleviates many of these problems.

80 Approximate Inference for Policy Search Formulating policy search as an inference problem... Observed variable : introduce a reward event $p(R=1|\tau)$, e.g. $p(R=1|\tau) \propto \exp(-C(\tau))$, with $C(\tau)$ the trajectory costs. Latent variables : trajectories τ. Probabilistic model : $p(R=1, \tau; \theta) = p(R=1|\tau)\, p(\tau;\theta)$. We want to find parameters θ which maximize the log-marginal likelihood
$$\log p(R;\theta) = \log \int_\tau p(R|\tau)\, p(\tau;\theta)\, d\tau$$

81 Approximate Inference for Policy Search Policy search can be seen as finding the maximum likelihood (ML) solution of $p(R;\theta)$:
$$p(R;\theta) = \int_\tau p(R|\tau)\, p(\tau;\theta)\, d\tau$$
Problem: huge trajectory space, the integral is intractable.

82 Decomposition of the log-likelihood We can decompose the log-likelihood by introducing a variational distribution q(τ) over the latent variable τ:
$$\log p(R;\theta) = L(q,\theta) + KL(q\,\|\,p_R)$$
Lower bound $L(q,\theta)$:
$$L(q,\theta) = \int_\tau q(\tau)\log p(R,\tau;\theta)\, d\tau + f_1(q) = \int_\tau q(\tau)\log p(\tau;\theta)\, d\tau + f_2(q)$$
Expected complete data log-likelihood...

83 Decomposition of the log-likelihood We can decompose the log-likelihood by introducing a variational distribution q(τ) over the latent variable τ:
$$\log p(R;\theta) = L(q,\theta) + KL(q\,\|\,p_R)$$
Kullback-Leibler divergence $KL(q\,\|\,p_R)$:
$$KL(q\,\|\,p_R) = \int_\tau q(\tau)\log \frac{q(\tau)}{p_R(\tau)}\, d\tau$$
Distance between the variational distribution q and the conditional distribution of the latent variable, $p_R(\tau) = p(\tau|R;\theta) \propto p(R|\tau)\, p(\tau;\theta)$... the reward-weighted model distribution.

84 Decomposition of the log-likelihood We can now iteratively increase the lower bound $L(q,\theta)$ by: E-Step: keep the model parameters θ fixed, minimize the KL-divergence $KL(q\,\|\,p_R)$ w.r.t. q. M-Step: keep the variational distribution q fixed, maximize the lower bound $L(q,\theta)$ w.r.t. θ.

85 Approximate Inference for Policy Search Two types of policy search algorithms emerge from this decomposition Monte-Carlo EM based Policy Search (Kober et al., 2010; Kober & Peters, 2010; Vlassis et al., 2009) Variational Inference Policy Search

86 Monte-Carlo (MC) EM based Algorithms MC-EM based algorithms use a sample based approximation of q in the E-step. E-Step, $\min_q KL(q\,\|\,p_R)$:
$$q(i) = p_R(i) \propto p(R|\tau_i)\, p(\tau_i;\theta^{\text{old}})$$
M-Step, $\max_\theta L(q,\theta)$: use q(i) to approximate the lower bound,
$$L(q,\theta) \approx \sum_i p_R(i)\log p(\tau_i;\theta) + \text{const} = -KL(p_R\,\|\,p(\tau;\theta)) + \text{const}$$
This is the same lower bound as the one given for PoWER and Reward-Weighted Regression.

87 Monte-Carlo (MC) EM based Algorithms Iteratively calculate the M(oment)-projection of $p_R$:
$$\min_\theta KL(p_R\,\|\,p) = \min_\theta \sum_i p_R(i)\log \frac{p_R(i)}{p(\tau_i;\theta)}$$
The model becomes reward attracted : it forces the model p to have high probability in regions with high reward. Negatively rewarded samples are neglected! Minimization? p can be easily calculated by matching the moments of p with the moments of $p_R$.
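
A small sketch of this reward-weighted moment matching for a Gaussian model; the reward-to-weight mapping exp(β·reward) and the toy data are illustrative, not the exact weighting of the cited algorithms.

```python
import numpy as np

def m_projection_gaussian(samples, rewards, beta=1.0):
    """Weight each sample tau_i by p(R|tau_i) ~ exp(beta * reward) and match
    the weighted mean and covariance (reward-weighted moment matching)."""
    w = np.exp(beta * (rewards - rewards.max()))   # shifted for numerical stability
    w /= w.sum()
    mu = w @ samples
    diff = samples - mu
    cov = (w[:, None] * diff).T @ diff
    return mu, cov

samples = np.random.randn(200, 2)                  # tau_i drawn from the old model
rewards = -np.sum((samples - np.array([1.0, -1.0])) ** 2, axis=1)
mu_new, cov_new = m_projection_gaussian(samples, rewards)
```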

88 Variational Inference based Algorithms For variational inference we use a parametric variational distribution $q(\tau) = q(\tau;\theta')$. E-Step, $\min_q KL(q\,\|\,p_R)$: use a sample-based approximation for the integral in the KL-divergence,
$$KL(q(\tau;\theta')\,\|\,p_R) \approx \sum_{\tau_i} q(\tau_i;\theta')\log \frac{q(\tau_i;\theta')}{p_R(i)}$$
M-Step, $\max_\theta L(q,\theta)$: if we use the same family of distributions for $p(\tau;\theta)$ and $q(\tau;\theta')$ we can simply set θ to θ'.

89 Variational Inference based Algorithms Iteratively calculate the I(nformation)-projection of $p_R$:
$$\min_\theta KL(p\,\|\,p_R) = \min_\theta \sum_i p(\tau_i;\theta)\log \frac{p(\tau_i;\theta)}{p_R(i)}$$
The model becomes cost-averse : it tries to avoid including regions with low reward in $p(\tau;\theta)$. It uses information from negatively and positively rewarded examples. Minimization? Non-convex optimization problem (computationally much more demanding than using the M-projection)... We use numerical gradient ascent.

90 Approximate Inference for Policy Search MC-EM : M-projection based,
$$\min_\theta KL(p_R\,\|\,p) = \min_\theta \sum_i p_R(i)\log \frac{p_R(i)}{p(\tau_i;\theta)}$$
Variational Inference : I-projection based,
$$\min_\theta KL(p\,\|\,p_R) = \min_\theta \sum_i p(\tau_i;\theta)\log \frac{p(\tau_i;\theta)}{p_R(i)}$$
Both algorithms are guaranteed to iteratively increase the lower bound...

91 I vs M-projection : Illustrative Examples Let's look at the differences in more detail...

92 I vs M-projection : Illustrative Examples We consider 1-step decision problems in continuous state and action spaces. We typically use a Gaussian distribution as model distribution:
$$p(s,a;\theta) = \mathcal N\!\left(\begin{bmatrix} s \\ a \end{bmatrix} \,\middle|\, \begin{bmatrix} \mu_s \\ \mu_a \end{bmatrix}, \begin{bmatrix} \Sigma_{ss} & \Sigma_{sa} \\ \Sigma_{as} & \Sigma_{aa} \end{bmatrix}\right),$$
with $\theta = \{\mu_s, \mu_a, \Sigma_{ss}, \Sigma_{as}, \Sigma_{aa}\}$.

93 I vs M-projection : Illustrative Examples 2-dimensional action space, no state variables, multimodal target distribution.
[Figure: Gaussian fits of the multimodal target obtained with the I-projection and with the M-projection.]
The M-projection averages over all modes, the I-projection concentrates on one mode.
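
A toy 1-D version of this comparison (the slide's example is 2-D): fit a single Gaussian to a bimodal target on a grid, once by moment matching (M-projection) and once by numerically minimizing KL(q‖p_R) (I-projection); the target, grid, and optimizer choice are illustrative.

```python
import numpy as np
from scipy.optimize import minimize
from scipy.stats import norm

# bimodal target p_R on a grid
x = np.linspace(-6.0, 6.0, 601)
dx = x[1] - x[0]
p_R = 0.5 * norm.pdf(x, -2.0, 0.5) + 0.5 * norm.pdf(x, 2.0, 0.5)
p_R /= np.sum(p_R) * dx

# M-projection: match mean and variance of p_R
mu_M = np.sum(x * p_R) * dx
std_M = np.sqrt(np.sum((x - mu_M) ** 2 * p_R) * dx)

# I-projection: minimize KL(q || p_R) = integral of q * log(q / p_R) numerically
def kl_q_pR(params):
    mu, log_std = params
    q = norm.pdf(x, mu, np.exp(log_std))
    return np.sum(q * (np.log(q + 1e-300) - np.log(p_R + 1e-300))) * dx

mu_I, log_std_I = minimize(kl_q_pR, x0=[0.5, 0.0], method="Nelder-Mead").x
print("M-projection:", mu_M, std_M, "  I-projection:", mu_I, np.exp(log_std_I))
```

Running this, the M-projection places its mean between the two modes with a large variance, while the I-projection collapses onto one mode, mirroring the behavior shown in the figure.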

94 I vs M-projection : Illustrative Examples We also want to have state variables... The policy $\pi(a|s;\theta)$ is obtained by conditioning on the state s. Policy π is a linear Gaussian model... In order to get more complex policies $\pi(a|s_t;\theta)$... For each state $s_t$, we re-estimate the model $p(s,a;\theta)$ locally (using either the M- or I-projection). We clamp $\mu_s$ at $s_t$.
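
A sketch of the standard Gaussian conditioning step that yields π(a|s) from the joint model (without the local re-estimation around $s_t$ described above); the toy numbers are illustrative.

```python
import numpy as np

def condition_gaussian(mu_s, mu_a, S_ss, S_sa, S_aa, s):
    """pi(a|s) from the joint Gaussian: mean mu_a + Sigma_as Sigma_ss^{-1} (s - mu_s),
    covariance Sigma_aa - Sigma_as Sigma_ss^{-1} Sigma_sa."""
    K = S_sa.T @ np.linalg.inv(S_ss)       # Sigma_as Sigma_ss^{-1}
    return mu_a + K @ (s - mu_s), S_aa - K @ S_sa

mu_s, mu_a = np.array([0.0]), np.array([0.5])
S_ss, S_sa, S_aa = np.array([[1.0]]), np.array([[0.6]]), np.array([[1.0]])
print(condition_gaussian(mu_s, mu_a, S_ss, S_sa, S_aa, s=np.array([1.0])))
```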

95 I vs M-projection : Illustrative Examples 1-dimensional state and action space, complex reward function (dark background indicates negative reward). The policy is estimated for 6 different states.
[Figure: estimated policies at states $s_1, \ldots, s_6$ for the M-projection and the I-projection.]
The M-projection includes areas of low reward in the distribution!

96 Policy Search for Motion Primitives Let's apply variational inference for policy search in changing situations. Movement representation : parametrized velocity profiles.

97 Multi-situation setting : How can we learn θ? Existing algorithms are all MC-EM based and therefore use the M-projection: Reward-Weighted Regression (Peters & Schaal, 2007), Cost-Regularized Kernel Regression (Kober et al., 2010). Online learning setup : as samples we always use the history of the agent...

98 Experiments : Cannon-Ball Task Learn to shoot a cannon ball at a desired location. State space $s_0$ : desired location, wind force. Parameter space w : launching angle and velocity of the ball. Comparison of the I- and M-projection; CRKR : Cost-Regularized Kernel Regression. Multi-modal solution space, the I-projection performs best.
[Figure: performance over episodes for the I-projection, the M-projection, and CRKR.]

99 Experiments : 4-link pendulum balancing A 4-link humanoid robot has to counterbalance different pushes. Situations : the robot gets pushed with different forces $F_i \in [0; 25]$ Ns at 4 different points of origin (4-dimensional state space). Movement primitives : sequence of sigmoidal velocity profiles (39 parameters)...
[Figure: snapshots of the balancing motion at t = 0.10 s, 0.60 s, 1.10 s, 1.60 s, and 2.10 s.]
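
For illustration, a sketch of one sigmoidal velocity-profile building block; the exact parameterization (39 parameters) used in the thesis is not reproduced here, and the parameter names and values are assumptions.

```python
import numpy as np

def sigmoid_profile(t, amplitude, center, slope):
    """One sigmoidal building block of a velocity profile."""
    return amplitude / (1.0 + np.exp(-slope * (t - center)))

t = np.linspace(0.0, 2.0, 200)
# a velocity "bump": one rising and one falling sigmoid (toy values)
qdot_d = sigmoid_profile(t, 1.0, 0.3, 20.0) - sigmoid_profile(t, 1.0, 0.8, 20.0)
```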

100 Experiments : 4-link pendulum balancing The 4-link humanoid robot has to counterbalance different pushes. After training, the robot has learned to balance almost every force. The robot learns completely different balancing strategies. We could not produce reliable results with the M-projection...
[Figure: snapshots of the balancing motion at t = 0.10 s, 0.60 s, 1.10 s, 1.60 s, and 2.10 s.]

101 Conclusion We can use the M-projection or the I-projection for policy search. The I-projection also uses information from bad samples, which are neglected by the M-projection! It can therefore be used with ease for multi-modal distributions or non-concave reward functions. Computationally quite demanding... More efficient methods to calculate the I-projection are needed. Is there still a big difference for more complex model distributions...?

102 The end Thanks for your attention!

103 Bibliography

Atkeson, C. G., Moore, A. W., & Schaal, S. (1997). Locally Weighted Learning. Artificial Intelligence Review, 11.
de Boer, P.-T., Kroese, D., Mannor, S., & Rubinstein, R. (2005). A Tutorial on the Cross-Entropy Method. Annals of Operations Research, 134(1).
Ernst, D., Geurts, P., & Wehenkel, L. (2005). Tree-Based Batch Mode Reinforcement Learning. Journal of Machine Learning Research, 6.
Ernst, D., Geurts, P., & Wehenkel, L. (2003). Iteratively Extending Time Horizon Reinforcement Learning. In: European Conference on Machine Learning (ECML).
Heidrich-Meisner, V., & Igel, C. (2009). Neuroevolution Strategies for Episodic Reinforcement Learning. Journal of Algorithms, 64(4).
Ijspeert, A. J., & Schaal, S. (2003). Learning Attractor Landscapes for Learning Motor Primitives. In: Advances in Neural Information Processing Systems 15. Cambridge, MA: MIT Press.
Kober, J., & Peters, J. (2010). Policy Search for Motor Primitives in Robotics. Machine Learning Journal, online first.
Kober, J., Oztop, E., & Peters, J. (2010). Reinforcement Learning to Adjust Robot Movements to New Situations. In: Proceedings of the 2010 Robotics: Science and Systems Conference (RSS 2010).
Kolter, Z., & Ng, A. (2009). Task-Space Trajectories via Cubic Spline Optimization. In: Proceedings of the 2009 IEEE International Conference on Robotics and Automation (ICRA 09). Piscataway, NJ, USA: IEEE Press.
Neumann, G., & Peters, J. (2009). Fitted Q-Iteration by Advantage Weighted Regression. In: Advances in Neural Information Processing Systems 22 (NIPS 2008). MIT Press.
Peters, J., & Schaal, S. (2006). Policy Gradient Methods for Robotics. In: Proceedings of the IEEE International Conference on Intelligent Robots and Systems (IROS).
Peters, J., & Schaal, S. (2007). Reinforcement Learning by Reward-Weighted Regression for Operational Space Control. In: Proceedings of the International Conference on Machine Learning (ICML).
Riedmiller, M. (2005). Neural Fitted Q-Iteration - First Experiences with a Data Efficient Neural Reinforcement Learning Method. In: Proceedings of the European Conference on Machine Learning (ECML).
Sutton, R., Precup, D., & Singh, S. (1999). Between MDPs and Semi-MDPs: A Framework for Temporal Abstraction in Reinforcement Learning. Artificial Intelligence, 112.
Theodorou, E., Buchli, J., & Schaal, S. (2010). Reinforcement Learning of Motor Skills in High Dimensions: A Path Integral Approach. In: IEEE International Conference on Robotics and Automation (ICRA).
Vlassis, N., Toussaint, M., Kontes, G., & Piperidis, S. (2009). Learning Model-Free Robot Control by a Monte Carlo EM Algorithm. Autonomous Robots, 27(2).
Williams, R. J. (1992). Simple Statistical Gradient-Following Algorithms for Connectionist Reinforcement Learning. Machine Learning.


More information

Getting a kick out of humanoid robotics

Getting a kick out of humanoid robotics Getting a kick out of humanoid robotics Using reinforcement learning to shape a soccer kick Christiaan W. Meijer Getting a kick out of humanoid robotics Using reinforcement learning to shape a soccer

More information

A Fuzzy Reinforcement Learning for a Ball Interception Problem

A Fuzzy Reinforcement Learning for a Ball Interception Problem A Fuzzy Reinforcement Learning for a Ball Interception Problem Tomoharu Nakashima, Masayo Udo, and Hisao Ishibuchi Department of Industrial Engineering, Osaka Prefecture University Gakuen-cho 1-1, Sakai,

More information

Neural Networks: promises of current research

Neural Networks: promises of current research April 2008 www.apstat.com Current research on deep architectures A few labs are currently researching deep neural network training: Geoffrey Hinton s lab at U.Toronto Yann LeCun s lab at NYU Our LISA lab

More information

Reinforcement Learning in Continuous Action Spaces through Sequential Monte Carlo Methods

Reinforcement Learning in Continuous Action Spaces through Sequential Monte Carlo Methods Reinforcement Learning in Continuous Action Spaces through Sequential Monte Carlo Methods Alessandro Lazaric Marcello Restelli Andrea Bonarini Department of Electronics and Information Politecnico di Milano

More information

Clustering web search results

Clustering web search results Clustering K-means Machine Learning CSE546 Emily Fox University of Washington November 4, 2013 1 Clustering images Set of Images [Goldberger et al.] 2 1 Clustering web search results 3 Some Data 4 2 K-means

More information

Sample-Efficient Reinforcement Learning for Walking Robots

Sample-Efficient Reinforcement Learning for Walking Robots Sample-Efficient Reinforcement Learning for Walking Robots B. Vennemann Delft Robotics Institute Sample-Efficient Reinforcement Learning for Walking Robots For the degree of Master of Science in Mechanical

More information

Markov Decision Processes. (Slides from Mausam)

Markov Decision Processes. (Slides from Mausam) Markov Decision Processes (Slides from Mausam) Machine Learning Operations Research Graph Theory Control Theory Markov Decision Process Economics Robotics Artificial Intelligence Neuroscience /Psychology

More information

This is an author produced version of Definition and composition of motor primitives using latent force models and hidden Markov models.

This is an author produced version of Definition and composition of motor primitives using latent force models and hidden Markov models. This is an author produced version of Definition and composition of motor primitives using latent force models and hidden Markov models. White Rose Research Online URL for this paper: http://eprints.whiterose.ac.uk/116580/

More information

Autoencoders, denoising autoencoders, and learning deep networks

Autoencoders, denoising autoencoders, and learning deep networks 4 th CiFAR Summer School on Learning and Vision in Biology and Engineering Toronto, August 5-9 2008 Autoencoders, denoising autoencoders, and learning deep networks Part II joint work with Hugo Larochelle,

More information

Performance Comparison of Sarsa(λ) and Watkin s Q(λ) Algorithms

Performance Comparison of Sarsa(λ) and Watkin s Q(λ) Algorithms Performance Comparison of Sarsa(λ) and Watkin s Q(λ) Algorithms Karan M. Gupta Department of Computer Science Texas Tech University Lubbock, TX 7949-314 gupta@cs.ttu.edu Abstract This paper presents a

More information

Unsupervised Learning: Clustering

Unsupervised Learning: Clustering Unsupervised Learning: Clustering Vibhav Gogate The University of Texas at Dallas Slides adapted from Carlos Guestrin, Dan Klein & Luke Zettlemoyer Machine Learning Supervised Learning Unsupervised Learning

More information

Reinforcement Learning in Discrete and Continuous Domains Applied to Ship Trajectory Generation

Reinforcement Learning in Discrete and Continuous Domains Applied to Ship Trajectory Generation POLISH MARITIME RESEARCH Special Issue S1 (74) 2012 Vol 19; pp. 31-36 10.2478/v10012-012-0020-8 Reinforcement Learning in Discrete and Continuous Domains Applied to Ship Trajectory Generation Andrzej Rak,

More information

Reinforcement Learning with Parameterized Actions

Reinforcement Learning with Parameterized Actions Proceedings of the Thirtieth AAAI Conference on Artificial Intelligence (AAAI-16) Reinforcement Learning with Parameterized Actions Warwick Masson and Pravesh Ranchod School of Computer Science and Applied

More information

Machine Learning on Physical Robots

Machine Learning on Physical Robots Machine Learning on Physical Robots Alfred P. Sloan Research Fellow Department or Computer Sciences The University of Texas at Austin Research Question To what degree can autonomous intelligent agents

More information

What can we learn from demonstrations?

What can we learn from demonstrations? What can we learn from demonstrations? Marc Toussaint Machine Learning & Robotics Lab University of Stuttgart IROS workshop on ML in Robot Motion Planning, Okt 2018 1/29 Outline Previous work on learning

More information

Behavioral Data Mining. Lecture 18 Clustering

Behavioral Data Mining. Lecture 18 Clustering Behavioral Data Mining Lecture 18 Clustering Outline Why? Cluster quality K-means Spectral clustering Generative Models Rationale Given a set {X i } for i = 1,,n, a clustering is a partition of the X i

More information

Imitation and Reinforcement Learning for Motor Primitives with Perceptual Coupling

Imitation and Reinforcement Learning for Motor Primitives with Perceptual Coupling Imitation and Reinforcement Learning for Motor Primitives with Perceptual Coupling Jens Kober, Betty Mohler, Jan Peters Abstract Traditional motor primitive approaches deal largely with open-loop policies

More information

Estimation of Bilateral Connections in a Network: Copula vs. Maximum Entropy

Estimation of Bilateral Connections in a Network: Copula vs. Maximum Entropy Estimation of Bilateral Connections in a Network: Copula vs. Maximum Entropy Pallavi Baral and Jose Pedro Fique Department of Economics Indiana University at Bloomington 1st Annual CIRANO Workshop on Networks

More information

Probabilistic Robotics

Probabilistic Robotics Probabilistic Robotics Sebastian Thrun Wolfram Burgard Dieter Fox The MIT Press Cambridge, Massachusetts London, England Preface xvii Acknowledgments xix I Basics 1 1 Introduction 3 1.1 Uncertainty in

More information

Sparse Distributed Memories in Reinforcement Learning: Case Studies

Sparse Distributed Memories in Reinforcement Learning: Case Studies Sparse Distributed Memories in Reinforcement Learning: Case Studies Bohdana Ratitch Swaminathan Mahadevan Doina Precup {bohdana,smahad,dprecup}@cs.mcgill.ca, McGill University, Canada Abstract In this

More information

Learning Continuous State/Action Models for Humanoid Robots

Learning Continuous State/Action Models for Humanoid Robots Proceedings of the Twenty-Ninth International Florida Artificial Intelligence Research Society Conference Learning Continuous State/Action Models for Humanoid Robots Astrid Jackson and Gita Sukthankar

More information

CS 687 Jana Kosecka. Reinforcement Learning Continuous State MDP s Value Function approximation

CS 687 Jana Kosecka. Reinforcement Learning Continuous State MDP s Value Function approximation CS 687 Jana Kosecka Reinforcement Learning Continuous State MDP s Value Function approximation Markov Decision Process - Review Formal definition 4-tuple (S, A, T, R) Set of states S - finite Set of actions

More information

Robot Motion Planning

Robot Motion Planning Robot Motion Planning slides by Jan Faigl Department of Computer Science and Engineering Faculty of Electrical Engineering, Czech Technical University in Prague lecture A4M36PAH - Planning and Games Dpt.

More information

RNNs as Directed Graphical Models

RNNs as Directed Graphical Models RNNs as Directed Graphical Models Sargur Srihari srihari@buffalo.edu This is part of lecture slides on Deep Learning: http://www.cedar.buffalo.edu/~srihari/cse676 1 10. Topics in Sequence Modeling Overview

More information

Probabilistic Graphical Models

Probabilistic Graphical Models Overview of Part Two Probabilistic Graphical Models Part Two: Inference and Learning Christopher M. Bishop Exact inference and the junction tree MCMC Variational methods and EM Example General variational

More information

Instance-Based Action Models for Fast Action Planning

Instance-Based Action Models for Fast Action Planning In Visser, Ribeiro, Ohashi and Dellaert, editors, RoboCup-2007. Robot Soccer World Cup XI, pp. 1-16, Springer Verlag, 2008. Instance-Based Action Models for Fast Action Planning Mazda Ahmadi and Peter

More information

Table of Contents. Chapter 1. Modeling and Identification of Serial Robots... 1 Wisama KHALIL and Etienne DOMBRE

Table of Contents. Chapter 1. Modeling and Identification of Serial Robots... 1 Wisama KHALIL and Etienne DOMBRE Chapter 1. Modeling and Identification of Serial Robots.... 1 Wisama KHALIL and Etienne DOMBRE 1.1. Introduction... 1 1.2. Geometric modeling... 2 1.2.1. Geometric description... 2 1.2.2. Direct geometric

More information

Reinforcement Learning of Clothing Assistance with a Dual-arm Robot

Reinforcement Learning of Clothing Assistance with a Dual-arm Robot Reinforcement Learning of Clothing Assistance with a Dual-arm Robot Tomoya Tamei, Takamitsu Matsubara, Akshara Rai, Tomohiro Shibata Graduate School of Information Science,Nara Institute of Science and

More information

Can we quantify the hardness of learning manipulation? Kris Hauser Department of Electrical and Computer Engineering Duke University

Can we quantify the hardness of learning manipulation? Kris Hauser Department of Electrical and Computer Engineering Duke University Can we quantify the hardness of learning manipulation? Kris Hauser Department of Electrical and Computer Engineering Duke University Robot Learning! Robot Learning! Google used 14 identical robots 800,000

More information

Monte Carlo Tree Search PAH 2015

Monte Carlo Tree Search PAH 2015 Monte Carlo Tree Search PAH 2015 MCTS animation and RAVE slides by Michèle Sebag and Romaric Gaudel Markov Decision Processes (MDPs) main formal model Π = S, A, D, T, R states finite set of states of the

More information