On Movement Skill Learning and Movement Representations for Robotics

1 On Movement Skill Learning and Movement Representations for Robotics
Gerhard Neumann, Graz University of Technology, Institute for Theoretical Computer Science
November 2, 2011

2 Modern Robotic Systems: Motivation... Many degrees of freedom, compliant actuators, highly dynamic movements...

3 In principle, the advanced morphology of these robots would allow us to perform a wide range of complex movements, such as different forms of locomotion (walking, running, trotting), jumping, playing tennis... Classical control methods often fail or are very hard to use for such complex movements. More promising approach : let the robot learn the movement by trial and error. Main topic of this thesis!

4 Movement Skill Learning for Robotics Movement skill learning can easily be formulated as a Reinforcement Learning problem: the agent has to search for a policy which optimizes the reward. So why is it challenging? High dimensional continuous state spaces. High dimensional continuous action spaces. Data is expensive : learning needs to be data efficient and needs to be safe.

5 Movement Skill Learning for Robotics Learning algorithms can be roughly divided into Value-based methods Policy-search methods

6 Value-based methods Estimate the expected discounted future reward for each state s when following policy π:
$$V^\pi(s) = E\left[\sum_{t=0}^{\infty} \gamma^t r_t\right]$$
Also denoted as the value function of policy π. Recursive form: $V^\pi(s) = E\left[r(s,a) + \gamma V^\pi(s')\right]$

7 Value-based methods + The value function can be used to assess the quality of each intermediate action of an episode, e.g. by the use of the Temporal Difference (TD) error
$$\delta_t = r_t + \gamma V^\pi(s_{t+1}) - V^\pi(s_t)$$
It evaluates whether the current step $(s_t, a_t, r_t, s_{t+1})$ was better or worse than expected. We can efficiently solve the temporal credit assignment problem. - The value function is very hard to estimate in high-dimensional continuous state and action spaces.
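
As a concrete illustration, a minimal Python sketch of this TD-error computation for one recorded episode; the value estimate V, the discount factor, and the toy episode are placeholders, not part of the thesis.

```python
import numpy as np

gamma = 0.98  # assumed discount factor

def V(s):
    # placeholder value estimate; in practice this is the learned V^pi
    return -float(np.sum(np.square(s)))

def td_errors(states, rewards):
    """delta_t = r_t + gamma * V(s_{t+1}) - V(s_t) for each step of the episode."""
    deltas = []
    for t in range(len(rewards)):
        v_next = V(states[t + 1]) if t + 1 < len(states) else 0.0
        deltas.append(rewards[t] + gamma * v_next - V(states[t]))
    return np.array(deltas)

# toy episode with 2-D states
states = [np.array([0.0, 1.0]), np.array([0.1, 0.8]), np.array([0.2, 0.5])]
rewards = [-1.0, -0.5]
print(td_errors(states, rewards))
```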

8 Policy Search Methods Rely on a parametric representation of the policy π(a|s; w) with policy parameters w. Directly optimize the policy parameters by performing rollouts on the real system. - We can only assess the quality of a whole trajectory instead of single actions. + However, as no value function has to be estimated, this assessment can be done very accurately. More successful than value-based methods. Performance strongly depends on the used movement representation.

9 Outline: The thesis is divided into 3 parts... Value-based Methods Graph-Based Reinforcement Learning Fitted Q-Iteration by Advantage Weighted Regression

10 Outline: The thesis is divided into 3 parts... Value-based Methods Graph-Based Reinforcement Learning Fitted Q-Iteration by Advantage Weighted Regression Movement Representations Kinematic Synergies Motion Templates Planning Movement Primitives

11 Outline: The thesis is divided into 3 parts... Value-based Methods Graph-Based Reinforcement Learning Fitted Q-Iteration by Advantage Weighted Regression Movement Representations Kinematic Synergies Motion Templates Planning Movement Primitives Policy Search Variational Inference for Policy Search in Changing Situations

12 Outline: The thesis is divided into 3 parts... Value-based Methods Graph-Based Reinforcement Learning Fitted Q-Iteration by Advantage Weighted Regression Movement Representations Kinematic Synergies Motion Templates Planning Movement Primitives Policy Search Variational Inference for Policy Search in Changing Situations

13 Outline: The thesis is divided into 3 parts... Value-based Methods Graph-Based Reinforcement Learning Fitted Q-Iteration by Advantage Weighted Regression

14 Fitted Q-iteration : Batch-Mode Reinforcement Learning (BMRL) Batch-mode RL methods use the whole history H of the agent to update the value or action-value function:
$$H = \{\langle s_i, a_i, r_i, s'_i \rangle\}_{1 \le i \le N}$$
Advantage : data points are used more efficiently than in online methods.

15 Fitted Q-iteration : Batch-Mode Reinforcement Learning (BMRL) Fitted Q-Iteration (Ernst et al., 2003) approximates the state-action value function Q(s,a) by iteratively using supervised regression techniques. Repeat K times:
$$Q^{k+1}(i) = r_i + \gamma V^k(s'_i) = r_i + \gamma \max_{a'} Q^k(s'_i, a')$$
$$D^k = \left\{\left[(s_i, a_i),\, Q^{k+1}(i)\right]\right\}_{1 \le i \le N}, \qquad Q^{k+1} = \text{Regress}(D^k)$$
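
A hedged sketch of this generic FQI loop, assuming for simplicity a small discrete set of candidate actions for the max (the continuous-action case is exactly what the following slides address) and using scikit-learn's ExtraTreesRegressor as the supervised learner; all names and settings are illustrative.

```python
import numpy as np
from sklearn.ensemble import ExtraTreesRegressor

def fitted_q_iteration(S, A, R, S_next, candidate_actions, gamma=0.98, K=50):
    """Generic FQI sketch. S, A, R, S_next hold the transition history
    <s_i, a_i, r_i, s'_i>; candidate_actions is a small set of actions
    (1-D arrays) used to approximate max_{a'} Q^k(s'_i, a')."""
    X = np.hstack([S, A])                          # regression inputs (s_i, a_i)
    targets = np.asarray(R, dtype=float).copy()    # Q^1(i) = r_i (V^0 = 0)
    q_model = None
    for _ in range(K):
        q_model = ExtraTreesRegressor(n_estimators=50).fit(X, targets)
        # V^k(s'_i) = max over the candidate actions of Q^k(s'_i, a')
        q_next = np.column_stack([
            q_model.predict(np.hstack([S_next, np.tile(a, (len(S_next), 1))]))
            for a in candidate_actions])
        targets = R + gamma * q_next.max(axis=1)
    return q_model
```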

16 Fitted Q-iteration : Batch-Mode Reinforcement Learning (BMRL) + FQI has proven to outperform classical online RL methods in many applications (Ernst et al., 2005). + Any type of supervised learning method can be used... E.g. neural networks (Riedmiller, 2005), regression trees (Ernst et al., 2005), Gaussian Processes - High computational demands...

17 FQI for Robotics... Continuous state spaces : Any type of supervised learning method can be used... E.g. neural networks, regression trees, Gaussian Processes. Continuous action spaces : We have to solve $Q^{k+1}(i) = r_i + \gamma \max_{a'} Q^k(s'_i, a')$ - Hm... how do we perform the $\max_{a'}$ operator in continuous action spaces?

18 FQI for Robotics... Hm... how do we perform the $\max_{a'}$ operator in continuous action spaces? Discretizations become prohibitively expensive in high dimensional spaces. We have to solve an optimization problem for each sample! E.g. use Cross-Entropy optimization for each data point $s_i$.

19 FQI for Robotics... Hm... how do we perform the $\max_{a'}$ operator in continuous action spaces? We show that an advantage-weighted regression can be used to approximate $\max_a Q(s,a)$. The regression uses the states $s_i$ as input values and $Q(s_i,a_i)$ as target values. The weighting $w_i = \exp(\tau \bar A(s_i,a_i))$ of each data point is based on the advantage function $A(s,a) = Q(s,a) - V(s)$.

20 FQI for Robotics... What is a weighted regression? Minimize the error function w.r.t. θ:
$$E = \sum_{i=1}^{N} w_i \left(V(s_i;\theta) - Q(s_i,a_i)\right)^2$$
$w_i$... each data point gets an individual weighting.
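
For illustration, a minimal sketch of such a weighted regression for a linear value model V(s; θ) = φ(s)ᵀθ, solved in closed form with numpy; the feature choice, ridge term, and toy data are placeholders.

```python
import numpy as np

def weighted_regression(Phi, q_targets, weights, ridge=1e-6):
    """Minimize E = sum_i w_i (Phi_i^T theta - Q_i)^2 in closed form.
    Phi: (N, d) state features, q_targets: (N,) values Q(s_i, a_i)."""
    W = np.diag(weights)
    A = Phi.T @ W @ Phi + ridge * np.eye(Phi.shape[1])
    b = Phi.T @ W @ q_targets
    return np.linalg.solve(A, b)

# toy usage: constant + linear features of 1-D states
s = np.linspace(-1.0, 1.0, 20)
Phi = np.column_stack([np.ones_like(s), s])
q = -s ** 2                   # pretend Q-values
w = np.exp(-5.0 * s ** 2)     # pretend per-sample weights
theta = weighted_regression(Phi, q, w)
```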

21 FQI for Robotics... We prove this by applying the following 2 steps: weighted regression for value estimation, and soft-greedy policy improvement.

22 Weighted regression for value estimation The value function of a stochastic policy π is given by
$$V^\pi(s) = \int_a \pi(a|s)\, Q(s,a)\, da$$
We show that this can be approximated without evaluating the integral by solving a weighted regression problem:
$$D_V = \{\, s_i,\, Q(s_i,a_i) \,\}, \qquad U = \{\pi(a_i|s_i)\}, \qquad \hat V = \text{WeightedReg}(D_V, U)$$

23 Proof We want to find an approximation $\hat V(s)$ of $V^\pi(s)$ by minimizing the error function
$$\text{Error}(\hat V) = \int_s \mu(s) \left(\int_a \pi(a|s)\, Q(s,a)\, da - \hat V(s)\right)^2 ds = \int_s \mu(s) \left(\int_a \pi(a|s)\left(Q(s,a) - \hat V(s)\right) da\right)^2 ds,$$
µ(s) : state distribution when following policy π(·|s).

24 Proof Squared error function :
$$\text{Error}(\hat V) = \int_s \mu(s) \left(\int_a \pi(a|s)\left(Q(s,a) - \hat V(s)\right) da\right)^2 ds$$
An upper bound of $\text{Error}(\hat V)$ is given by
$$\text{Error}_B(\hat V) = \int_s \int_a \mu(s)\, \pi(a|s) \left(Q(s,a) - \hat V(s)\right)^2 da\, ds \;\ge\; \text{Error}(\hat V).$$
Use of Jensen's inequality.

25 Proof It is easy to show that both error functions have the same minimum for $\hat V$. The upper bound $\text{Error}_B$ can be approximated straightforwardly by samples $\{(s_i,a_i), Q(s_i,a_i)\}_{1 \le i \le N}$:
$$\text{Error}_B(\hat V) \approx \sum_{i=1}^{N} \pi(a_i|s_i) \left(Q(s_i,a_i) - \hat V(s_i)\right)^2$$
No integral over the action space is needed!

26 FQI for Robotics... We prove this by applying the following 2 steps: weighted regression for value estimation, and soft-greedy policy improvement.

27 Soft-greedy policy improvement The optimal value function $V^*(s) = \max_a Q(s,a)$ can be approximated without evaluating $\max_a Q(s,a)$ by solving an advantage-weighted regression problem:
$$D_V = \{\, s_i,\, Q(s_i,a_i) \,\}, \qquad U = \{ \exp(\tau \bar A(s_i,a_i)) \}, \qquad \hat V = \text{WeightedReg}(D_V, U)$$
- τ... greediness parameter of the algorithm. - $\bar A(s,a)$... normalized advantage function.

28 Proof We approximate the value function $V^{\pi_1}$ of a soft-max policy $\pi_1$ by the use of weighted regression. Since a soft-max policy is an approximation of the greedy policy, we can replace $V^*(s) = \max_a Q(s,a)$ with $V^{\pi_1}(s)$.

29 Proof The used soft-max policy $\pi_1(a|s)$ is based on the advantage function $A(s,a) = Q(s,a) - V(s)$:
$$\pi_1(a|s) = \frac{\exp(\tau \bar A(s,a))}{\int_{a'} \exp(\tau \bar A(s,a'))\, da'}, \qquad \bar A(s,a) = \frac{A(s,a) - m_A(s)}{\sigma_A(s)}.$$
If we assume that the advantages A(s,a) are normally distributed, the denominator of $\pi_1$ is constant. Thus we can use $\exp(\tau \bar A(s,a)) \propto \pi_1(a|s)$ directly as weighting for the regression.
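
A small sketch of this weighting; note that in LAWER the statistics $m_A(s)$ and $\sigma_A(s)$ are computed locally around each state, while this toy version normalizes one batch of advantages globally for brevity.

```python
import numpy as np

def advantage_weights(advantages, tau=1.0):
    """u_i = exp(tau * A_bar), with A_bar the normalized advantage.
    The mean m_A and std sigma_A are taken over one batch here; in LAWER
    they are local statistics around each state s."""
    a_bar = (advantages - advantages.mean()) / (advantages.std() + 1e-8)
    return np.exp(tau * a_bar)

print(advantage_weights(np.array([-1.0, 0.0, 2.0]), tau=2.0))
```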

30 Concrete algorithm : LAWER The Locally-Advantage WEighted Regression (LAWER) algorithm implements the presented theoretical results. It combines Locally Weighted Regression (LWR, (Atkeson et al., 1997)) and advantage-weighted regression. The locality weighting $w_i(s)$ and the advantage weighting $u_i = \exp(\tau \bar A(s_i,a_i))$ can be multiplicatively combined.

31 Concrete algorithm : LAWER The value function is then given by a simple weighted linear regression:
$$V^{k+1}(s) = \tilde s^T (\tilde S^T U \tilde S)^{-1} \tilde S^T U Q^{k+1}$$
$\tilde s = [1, s^T]^T$, $\tilde S = [\tilde s_1, \tilde s_2, \ldots, \tilde s_N]^T$ is the state matrix, and $U = \text{diag}(w_i(s)\, u_i)$. In order to approximate $V(s) = \max_a Q^k(s,a)$, only the Q-values of neighboring state-action pairs are needed.
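
A minimal numpy sketch of this weighted linear regression evaluated at a single query state, combining an (assumed) Gaussian locality kernel with the advantage weights; the bandwidth and regularization are illustrative choices, not the settings of the thesis.

```python
import numpy as np

def lawer_value(s_query, S, Q, norm_advantages, bandwidth=0.5, tau=1.0):
    """Weighted linear regression at one query state: Gaussian locality
    weights w_i(s) times advantage weights u_i, then
    V(s) = s_tilde^T (S_tilde^T U S_tilde)^{-1} S_tilde^T U Q."""
    dists = np.sum((S - s_query) ** 2, axis=1)
    w_loc = np.exp(-dists / (2.0 * bandwidth ** 2))      # locality weighting w_i(s)
    u_adv = np.exp(tau * norm_advantages)                # advantage weighting u_i
    U = np.diag(w_loc * u_adv)
    S_tilde = np.hstack([np.ones((len(S), 1)), S])       # rows are [1, s_i^T]
    theta = np.linalg.solve(
        S_tilde.T @ U @ S_tilde + 1e-6 * np.eye(S_tilde.shape[1]),
        S_tilde.T @ U @ Q)
    return np.concatenate([[1.0], s_query]) @ theta
```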

32 Approximation of the policy For unseen states we need to approximate the soft-max policy by a Gaussian policy $\pi(a|s) = \mathcal N(a\,|\,\mu(s), \sigma^2)$. For estimating this policy we use reward-weighted regression (Peters & Schaal, 2007), but the advantage is used instead of the reward for the weighting. Thus, we optimize the long-term reward instead of the immediate reward.

33 Results We use the Cross-Entropy (CE) optimization method (de Boer et al., 2005) as a comparison to find the maximum Q-values $\max_a Q(s,a)$. We compare the LAWER algorithm to 3 different state-of-the-art CE-based fitted Q-iteration algorithms: Tree-based FQI (Ernst et al., 2005) (CE-Tree), Neural FQI (Riedmiller, 2005) (CE-Net), LWR-based FQI (CE-LWR). After each FQI cycle new data was collected. The immediate reward function was quadratic in the distance to the goal position $x_G$ and in the applied torque/force.

34 Pendulum swing-up task A pendulum needs to be swung up from the position at the bottom to the top position (Riedmiller, 2005). 2 experiments with different torque punishment factors ($c_2$) were carried out.
[Figure: average reward vs. number of data collections for LAWER, CE-Tree, CE-LWR, and CE-Net; panel (a): $c_2$ = , panel (b): $c_2$ = 0.025.]

35 Comparison of torque trajectories
[Figure: torque trajectories u over time for LAWER, CE-Tree, and CE-LWR; panels (c) and (d) show the two torque punishment factors (panel (d): $c_2$ = 0.025).]

36 Dynamic puddle-world The agent has to navigate from a start position to a goal position; it gets negative reward when going through puddles. Dynamic version of the puddle-world : the agent can set a force accelerating a k-dimensional point mass. This was done for k = 2 and k = 3 dimensions.
[Figure: puddle-world layout with start and goal positions.]

37 Comparison of the algorithms
[Figure: average reward vs. number of data collections for LAWER and CE-Tree; panel (e) 2-D and panel (f) 3-D.]
The CE-Tree method learns faster, but does not manage to learn high quality policies for the 3D setting. LAWER also works for high dimensional action spaces.

38 Comparison of torque trajectories
[Figure: torque trajectories $u_1, u_2, u_3$ over time for (g) LAWER and (h) CE-Tree.]

39 Conclusion We have proven that the greedy operator max a Q(s,a) can be approximated efficiently by an advantage-weighted regression. The resulting algorithm runs an order of magnitude faster than competing algorithms. In spite of the resulting soft-greedy policy improvement our algorithm was able to produce policies of higher quality. The Locally-Advantage Weighted Regression algorithm allows us to use fitted Q-iteration even for high dimensional continuous action spaces.

40 Outline: The thesis is divided into 3 parts... Value-based Methods Graph-Based Reinforcement Learning Fitted Q-Iteration by Advantage Weighted Regression Movement Representations Kinematic Synergies Motion Templates Planning Movement Primitives Policy Search Variational Inference for Policy Search in Changing Situations

41 Movement Representations for Motor Skill Learning Directly optimize a parametric movement representation No value estimation is needed What is a good representation for learning a movement? Episodic Tasks: Often it is sufficient to formulate the learning task in the episodic RL setup Single initial state, specified fixed duration of the movement Direct Policy Search can be applied easily in this setup

42 Movement Representations for Motor Skill Learning Episodic setup : Use a trajectory-based representation. We learn a parametric representation of the desired trajectory $[q_d(t;w), \dot q_d(t;w)]$. t runs over the fixed duration of the movement; there is no direct dependence on the high dimensional state, t is now a scalar, which significantly simplifies the learning problem. Can only be used in the episodic setup (single start states). This trajectory is then followed by using feedback control laws. Most common movement representations are trajectory based... Dynamic Movement Primitives (Ijspeert & Schaal, 2003), Splines (Kolter & Ng, 2009),...
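
As a simple illustration of such a trajectory-based representation (not the specific DMP or spline formulations cited above), a sketch where the desired position is a weighted sum of Gaussian basis functions of time and the desired velocity is obtained by differencing; the basis layout and parameter values are illustrative.

```python
import numpy as np

def desired_position(t, w, duration=1.0):
    """q_d(t; w): weighted sum of Gaussian basis functions of time."""
    centers = np.linspace(0.0, duration, len(w))
    width = (duration / len(w)) ** 2
    phi = np.exp(-(t - centers) ** 2 / (2.0 * width))
    phi /= phi.sum()
    return float(phi @ w)

w = np.array([0.0, 0.3, 0.8, 1.0, 1.0])               # learned parameters (toy values)
ts = np.linspace(0.0, 1.0, 101)
q_d = np.array([desired_position(t, w) for t in ts])
qdot_d = np.gradient(q_d, ts)                          # desired velocity by differencing
```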

43 Trajectory-based vs. Value Based Motor Skill learning Trajectory-Based: Can be seen as a single-step decision task. The agent chooses the parameters w as the action of a single, temporally extended step. Only one step per episode... Value-Based: One decision per time step of the agent. The agent chooses the torque u as the action of a single, very short time step. Up to a few hundred steps per episode...

44 Trajectory-based vs. Value Based Motor Skill learning Can we find a more intuitive solution for which the agent chooses new actions only at certain, characteristic time points of the movement? Temporal Abstraction: Sequencing of temporally extended actions, also called Motion Templates

45 Temporal Abstractions for Motor-Skill Learning Example : Drawing a triangle with a pen Flat Setup Abstracted Setup We have to make many unessential decisions The movement can be easily decomposed into 3 elemental motions

46 Temporal Abstractions for Motor-Skill Learning Example : Drawing a triangle with a pen Flat Setup Abstracted Setup We have to make many unessential decisions The movement can be easily decomposed into 3 elemental motions

47 Temporal Abstractions for Motor-Skill Learning Standard framework for temporally extended actions : Options (Sutton et al., 1999) Options are closed loop policies taking actions over a period of time However: They are mainly used in discrete environments. In many applications options are discrete temporally extended actions E.g. Go to another room, Follow the hallway or Frighten the poor monkey For motor tasks useful options are often difficult to specify.

48 Temporal Abstractions for Motor-Skill Learning : Illustration Pendulum Swing-up Task : Standard RL benchmark task Learn how to swing up and balance an inverted pendulum from the bottom position We additionally want to minimize the energy consumption Flat RL : Choose a new action every 50ms

49 Pendulum Swing-Up: Illustration How can we decompose the trajectory into options?
[Figure: swing-up torque trajectory over time, with positive peaks, negative peaks, and the final balancing motion marked.]
We have positive and negative peaks in the torque trajectory followed by a final balancing motion.

50 Pendulum Swing-Up: Illustration How can we decompose the trajectory into options? Specify the exact form of the peaks and the balancing motion for the options? Requires a lot of prior knowledge... The learning task becomes trivial... However : We can specify the functional form of the options. Use parameterized options...

51 Motion Templates Motion templates : Parameterized Options, used as our building blocks of motion. A motion template $m_p$ is defined by : its $k_p$-dimensional parameter space $\Theta_p$, its parameterized policy $u_p(s,t;\theta_p)$, its termination condition $c_p(s,t;\theta_p)$. s... state, t... execution time, $\theta_p \in \Theta_p$... parameters. The functional form of $u_p$ and $c_p$ is chosen by the designer, the parameters $\theta_p$ are learned by the agent.

52 Motion Templates At each decision time step $\sigma_k$ the agent has to choose : which motion template $m_p \in A(\sigma_k)$ to use ($A(\sigma_k)$... set of available motion templates in decision time step $\sigma_k$), and which parameterization $\theta_p \in \Theta_p$ of $m_p$ to use. Subsequently the policy $u_p$ is executed until the termination condition $c_p$ is fulfilled. Continuous time : the duration of the templates can be continuous valued. The agent has to learn the correct sequence and parameterization of the motion templates.
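
A minimal sketch of this structure in Python; the class layout, the names, and the dynamics function `step` are illustrative, not the implementation used in the thesis.

```python
from dataclasses import dataclass
from typing import Callable

@dataclass
class MotionTemplate:
    """A parameterized option: policy u_p(s, t; theta_p) and termination
    condition c_p(s, t; theta_p)."""
    policy: Callable       # (state, t, theta) -> control u
    terminates: Callable   # (state, t, theta) -> bool

def run_template_sequence(sequence, step, state, dt=0.01):
    """Execute a chosen sequence of (template, theta) pairs; `step` is an
    assumed system-dynamics function state' = step(state, u, dt)."""
    for template, theta in sequence:
        t = 0.0
        while not template.terminates(state, t, theta):
            u = template.policy(state, t, theta)
            state = step(state, u, dt)
            t += dt
    return state
```

A peak template such as $m_1$ would then supply a time-dependent policy $u_p(s,t;\theta_p)$ parameterized by its height, curvature, and duration, and terminate once t exceeds the duration.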

53 Pendulum Swing-up : Decomposition into Motion Templates How can we decompose the trajectory into motion templates?
[Figure: swing-up torque trajectory over time, with positive peaks, negative peaks, and the final balancing motion marked.]
We have positive and negative peaks in the torque trajectory followed by a final balancing motion.

54 Pendulum Swing-Up : Templates to model the peaks We use 2 templates per peak: one for the ascending part, $m_1$, and one for the descending part, $m_2$. Both just depend on the execution time of the template.

55 Pendulum Swing-Up : Decomposition into Motion Templates We use 2 templates per peak: the ascending part ($m_1$) and the descending part ($m_2$).
[Figure: torque profiles of $m_1$ and $m_2$ for different heights $a_1$, $a_2$.]
Parameters : $a_i$... height of the template

56 Pendulum Swing-Up : Decomposition into Motion Templates We use 2 templates per peak: the ascending part ($m_1$) and the descending part ($m_2$).
[Figure: torque profiles of $m_1$ and $m_2$ for different curvatures $o_1$, $o_2$.]
Parameters : $a_i$... height of the template, $o_i$... curvature of the template

57 Pendulum Swing-Up : Decomposition into Motion Templates We use 2 templates per peak: the ascending part ($m_1$) and the descending part ($m_2$).
[Figure: torque profiles of $m_1$ and $m_2$ for different durations $d_1$, $d_2$.]
Parameters : $a_i$... height of the template, $o_i$... curvature of the template, $d_i$... duration of the template

58 Pendulum Swing-Up : Decomposition into Motion Templates We fix the height of the descending peak template $m_2$ to be the height of $m_1$. $m_3$ and $m_4$ are the same templates, just for negative peaks.
[Figure: swing-up torque trajectory decomposed into the positive-peak templates ($m_1$, $m_2$), the negative-peak templates ($m_3$, $m_4$), and the balancing motion.]

59 Pendulum Swing-Up : Decomposition into Motion Templates The balancing template is implemented as a PD-controller:
MT: $m_5$ | Functional form: $k_1 \theta + k_2 \dot\theta$ | Parameters: $k_1$, $k_2$
$k_1$ and $k_2$ are the PD controller gains. $m_5$ always runs for 20 s, subsequently the episode is terminated.
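
A minimal sketch of this balancing template's control law; the sign and angle conventions (θ measured from the upright position) are assumptions, and the gains are toy values.

```python
def balancing_policy(theta, theta_dot, k1, k2):
    """u = k1*theta + k2*theta_dot; with theta measured from the upright
    position (convention assumed), stabilizing gains are negative."""
    return k1 * theta + k2 * theta_dot

# small deviation from upright with (assumed) stabilizing gains
u = balancing_policy(theta=0.05, theta_dot=-0.2, k1=-30.0, k2=-5.0)
```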

60 Pendulum Swing-Up : Constructing the motion The agent can either choose the peak templates in the predefined order ($m_2$, $m_3$, $m_4$, $m_1$, $m_2$,...)... or it can use the balancing template $m_5$ as the final template. Thus the agent has to learn the correct number of swing-ups and the correct parameterization of the swing-ups.
[Figure: swing-up torque trajectory decomposed into the peak templates and the balancing motion.]

61 Pendulum Swing-Up : Constructing the motion
[Figure: the same torque trajectory shown with the flat decomposition (positive/negative peak parts, balancing motion) and with the motion-template decomposition ($m_2$, $m_3$, $m_4$, $m_1$, $m_5$).]
Flat : Approximately 50 decisions/parameters are needed to reach the top position. Motion Templates : The whole motion consists of only 5 decisions / 13 parameters.

62 Pendulum Swing-Up : Accuracy of the policy Motion templates decrease the number of necessary decisions significantly Overall learning task is simplified Ok... where is the catch?

63 Pendulum Swing-Up : Accuracy of the policy Motion templates decrease the number of necessary decisions significantly Overall learning task is simplified Ok... where is the catch? A single decision has now much more influence on the outcome of the whole motion. Therefore a single decision has to be made much more precisely than in flat RL.

64 Algorithm for Motion Template Learning An RL algorithm is needed which can learn very precise continuous valued policies! For each template $m_p$, we use an extension of the Locally Advantage WEighted Regression (LAWER, (Neumann & Peters, 2009)) algorithm to learn the policy $\pi_p(\theta_p|s)$ for selecting the parameters of $m_p$.

65 Extensions of LAWER : LAWER for Motion Template Learning Due to the increased precision requirements of motion template learning we had to develop 2 substantial extensions of LAWER: adaptive tree-based kernels, and an additional optimization to improve the approximation of $V(s) = \max_a Q(s,a)$.

66 Extensions of LAWER : Adaptive Tree-Based Kernels The use of a uniform weighting kernel is often problematic in the case of... high dimensional input spaces ( curse of dimensionality ), spatially varying data densities, and spatially varying curvatures of the regression surface. This problem can be alleviated by varying the shape of the weighting kernel. We do this by the use of randomized regression trees...

67 Extensions of LAWER : Improved approximation of V (s) = max a Q(s,a) In order to estimate the weightings u i, the original LAWER needed the assumption of normally distributed advantage values. Often this assumption does not hold and the estimate of u i gets imprecise. We improve the estimate of the u i by an additional optimization...

68 Experiments Minimum-time problems with additional energy-consumption constraints ($c_2$): Pendulum Swing-Up, 2-link Pendulum Swing-Up, 2-link Pendulum Balancing. Iterative learning protocol: we collect L episodes with the currently estimated exploration policy. Subsequently the optimal policy is re-estimated and the performance (summed reward) of the optimal policy (without exploration) is evaluated.

69 Experiments : Pendulum Swing-Up Comparison of learning progress for different energy punishment factors (L = 50).
Figure: Learning curves (average reward vs. number of data collections) for the Gaussian kernel (MT Gauss), the tree-based kernel (MT Tree), and flat RL, for (left) $c_2$ = and (right) $c_2$ = 0.075.

70 Experiments : Pendulum Swing-Up Comparison of the flat and the motion template policy.
Figure: (a) Torque trajectories and motion templates learned for different energy punishment factors $c_2$. (b) Torque trajectories learned with flat RL.
Performance for $c_2$ = : flat RL 48.6, motion templates 38.5

71 Experiments : 2-Link Pendulum Swing-Up Same templates as for the 1-dimensional task. The peak templates now have 2 additional parameters, the height and the curvature for the second control dimension $u_2$. The parameters of the balancer template $m_5$ consist of two 2×2 matrices for the controller gains.
Figure: Comparison (average reward vs. number of data collections) for motion template learning with tree-based kernels (MT Tree) and flat RL.

72 Experiments : 2-Link Pendulum Swing-Up Learned motion template policy.
Figure: Left: Torque trajectories ($u_1$, $u_2$) and decomposition into the motion templates ($m_2$, $m_3$, $m_4$, $m_5$). Right: Illustration of the motion. The bold postures represent the switching time points of the motion templates.

73 Conclusions We have shown that by the use of motion templates, i.e. parametrized options, many motor tasks can be decomposed into elemental movements. Motion templates are the first movement representation which can be sequenced in time. While the whole motion consists of fewer decisions, a single decision has to be made more precisely. We propose a new algorithm for motion template learning which can cope with these precision requirements. We have shown that learning with motion templates can produce policies of higher quality than flat RL and could even be applied to tasks where flat RL was not successful.

74 Outline: The thesis is divided into 3 parts... Value-based Methods Graph-Based Reinforcement Learning Fitted Q-Iteration by Advantage Weighted Regression Movement Representations Kinematic Synergies Motion Templates Planning Movement Primitives Policy Search Variational Inference for Policy Search in Changing Situations

75 Policy Search for trajectory-based representations Back to trajectory-based representations Only 1 decision per episode : Choose parameter vector w Typically w is very high dimensional ( parameters) How can we optimize the parameters w? Policy Gradient Methods (Williams, 1992; Peters & Schaal, 2006) EM-based Methods (Kober & Peters, 2010) Inference-based Methods (Vlassis et al., 2009; Theodorou et al., 2010)

76 Inference-based Methods: Policy Search for changing Situations In different situations $s_0^i$ we have to choose different parameter vectors $w_i$. Can we generalize between solutions to avoid relearning? Learn a hierarchical policy $\pi_{MP}(w|s_0;\theta)$ which chooses the parameter vector w according to the situation $s_0$. In order to do so we will use approximate inference methods.

77 Outline Approximate Inference for Policy Search Decomposition of the log-likelihood Monte-Carlo EM based methods Variational Inference based methods Policy Search for Movement Primitives in changing situations 4-Link Balancing

78 Approximate Inference for Policy Search Using inference or inference-based methods has proven to be very useful for policy search: PoWER (Kober & Peters, 2010), Policy Improvement by Path Integrals (Theodorou et al., 2010), Reward-Weighted Regression, Cost-Regularized Kernel Regression (Kober et al., 2010), Monte Carlo EM Policy Search (Vlassis et al., 2009), CMA-ES (Heidrich-Meisner & Igel, 2009).

79 All these algorithms use the Moment-projection of a certain target distribution to estimate the policy. As we will see, this can be problematic in many cases (multi-modal solution space, complex reward functions,...). Here we will introduce the theory to use the Information-projection and show that this projection alleviates many of these problems.

80 Approximate Inference for Policy Search Formulating policy search as an inference problem... Observed variable : introduce a reward event $p(R=1|\tau)$, e.g. $p(R=1|\tau) \propto \exp(-C(\tau))$, with $C(\tau)$ the trajectory costs. Latent variables : trajectories τ. Probabilistic model : $p(R=1, \tau; \theta) = p(R=1|\tau)\, p(\tau;\theta)$. We want to find parameters θ which maximize the log-marginal likelihood
$$\log p(R;\theta) = \log \int_\tau p(R|\tau)\, p(\tau;\theta)\, d\tau$$

81 Approximate Inference for Policy Search Policy search can be seen as finding the maximum likelihood (ML) solution of $p(R;\theta)$:
$$p(R;\theta) = \int_\tau p(R|\tau)\, p(\tau;\theta)\, d\tau$$
Problem: huge trajectory space, the integral is intractable.

82 Decomposition of the log-likelihood We can decompose the log-likelihood by introducing a variational distribution q(τ) over the latent variable τ:
$$\log p(R;\theta) = L(q,\theta) + KL(q\,\|\,p_R)$$
Lower bound $L(q,\theta)$:
$$L(q,\theta) = \int_\tau q(\tau)\log p(R,\tau;\theta)\, d\tau + f_1(q) = \int_\tau q(\tau)\log p(\tau;\theta)\, d\tau + f_2(q)$$
Expected complete data log-likelihood...

83 Decomposition of the log-likelihood We can decompose the log-likelihood by introducing a variational distribution q(τ) over the latent variable τ:
$$\log p(R;\theta) = L(q,\theta) + KL(q\,\|\,p_R)$$
Kullback-Leibler divergence $KL(q\,\|\,p_R)$:
$$KL(q\,\|\,p_R) = \int_\tau q(\tau)\log \frac{q(\tau)}{p_R(\tau)}\, d\tau$$
Distance between the variational distribution q and the conditional distribution of the latent variable, $p_R(\tau) = p(\tau|R;\theta) \propto p(R|\tau)\, p(\tau;\theta)$... the reward-weighted model distribution.

84 Decomposition of the log-likelihood We can now iteratively increase the lower bound $L(q,\theta)$ by: E-Step: keep the model parameters θ fixed, minimize the KL-divergence $KL(q\,\|\,p_R)$ w.r.t. q. M-Step: keep the variational distribution q fixed, maximize the lower bound $L(q,\theta)$ w.r.t. θ.

85 Approximate Inference for Policy Search Two types of policy search algorithms emerge from this decomposition Monte-Carlo EM based Policy Search (Kober et al., 2010; Kober & Peters, 2010; Vlassis et al., 2009) Variational Inference Policy Search

86 Monte-Carlo (MC) EM based Algorithms MC-EM based algorithms use a sample based approximation of q in the E-step. E-Step, $\min_q KL(q\,\|\,p_R)$:
$$q(i) = p_R(i) \propto p(R|\tau_i)\, p(\tau_i;\theta^{\text{old}})$$
M-Step, $\max_\theta L(q,\theta)$: use q(i) to approximate the lower bound,
$$L(q,\theta) \approx \sum_i p_R(i)\log p(\tau_i;\theta) + \text{const} = -KL(p_R\,\|\,p(\tau;\theta)) + \text{const}$$
This is the same lower bound as the one given for PoWER and Reward-Weighted Regression.

87 Monte-Carlo (MC) EM based Algorithms Iteratively calculate the M(oment)-projection of $p_R$:
$$\min_\theta KL(p_R\,\|\,p) = \min_\theta \sum_i p_R(i)\log \frac{p_R(i)}{p(\tau_i;\theta)}$$
The model becomes reward attracted : it forces the model p to have high probability in regions with high reward. Negatively rewarded samples are neglected! Minimization? p can be easily calculated by matching the moments of p with the moments of $p_R$.
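
A small sketch of this reward-weighted moment matching for a Gaussian model; the reward-to-weight mapping exp(β·reward) and the toy data are illustrative, not the exact weighting of the cited algorithms.

```python
import numpy as np

def m_projection_gaussian(samples, rewards, beta=1.0):
    """Weight each sample tau_i by p(R|tau_i) ~ exp(beta * reward) and match
    the weighted mean and covariance (reward-weighted moment matching)."""
    w = np.exp(beta * (rewards - rewards.max()))   # shifted for numerical stability
    w /= w.sum()
    mu = w @ samples
    diff = samples - mu
    cov = (w[:, None] * diff).T @ diff
    return mu, cov

samples = np.random.randn(200, 2)                  # tau_i drawn from the old model
rewards = -np.sum((samples - np.array([1.0, -1.0])) ** 2, axis=1)
mu_new, cov_new = m_projection_gaussian(samples, rewards)
```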

88 Variational Inference based Algorithms For variational inference we use a parametric variational distribution $q(\tau) = q(\tau;\theta')$. E-Step, $\min_q KL(q\,\|\,p_R)$: use a sample-based approximation for the integral in the KL-divergence,
$$KL(q(\tau;\theta')\,\|\,p_R) \approx \sum_{\tau_i} q(\tau_i;\theta')\log \frac{q(\tau_i;\theta')}{p_R(i)}$$
M-Step, $\max_\theta L(q,\theta)$: if we use the same family of distributions for $p(\tau;\theta)$ and $q(\tau;\theta')$ we can simply set θ to θ'.

89 Variational Inference based Algorithms Iteratively calculate the I(nformation)-projection of $p_R$:
$$\min_\theta KL(p\,\|\,p_R) = \min_\theta \sum_i p(\tau_i;\theta)\log \frac{p(\tau_i;\theta)}{p_R(i)}$$
The model becomes cost-averse : it tries to avoid including regions with low reward in $p(\tau;\theta)$. It uses information from negatively and positively rewarded examples. Minimization? Non-convex optimization problem (computationally much more demanding than using the M-projection)... We use numerical gradient ascent.

90 Approximate Inference for Policy Search MC-EM : M-projection based,
$$\min_\theta KL(p_R\,\|\,p) = \min_\theta \sum_i p_R(i)\log \frac{p_R(i)}{p(\tau_i;\theta)}$$
Variational Inference : I-projection based,
$$\min_\theta KL(p\,\|\,p_R) = \min_\theta \sum_i p(\tau_i;\theta)\log \frac{p(\tau_i;\theta)}{p_R(i)}$$
Both algorithms are guaranteed to iteratively increase the lower bound...

91 I vs M-projection : Illustrative Examples Let's look at the differences in more detail...

92 I vs M-projection : Illustrative Examples We consider 1-step decision problems in continuous state and action spaces. We typically use a Gaussian distribution as model distribution:
$$p(s,a;\theta) = \mathcal N\!\left(\begin{bmatrix} s \\ a \end{bmatrix} \,\middle|\, \begin{bmatrix} \mu_s \\ \mu_a \end{bmatrix}, \begin{bmatrix} \Sigma_{ss} & \Sigma_{sa} \\ \Sigma_{as} & \Sigma_{aa} \end{bmatrix}\right),$$
with $\theta = \{\mu_s, \mu_a, \Sigma_{ss}, \Sigma_{as}, \Sigma_{aa}\}$.

93 I vs M-projection : Illustrative Examples 2-dimensional action space, no state variables, multimodal target distribution.
[Figure: Gaussian fits of the multimodal target obtained with the I-projection and with the M-projection.]
The M-projection averages over all modes, the I-projection concentrates on one mode.
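
A toy 1-D version of this comparison (the slide's example is 2-D): fit a single Gaussian to a bimodal target on a grid, once by moment matching (M-projection) and once by numerically minimizing KL(q‖p_R) (I-projection); the target, grid, and optimizer choice are illustrative.

```python
import numpy as np
from scipy.optimize import minimize
from scipy.stats import norm

# bimodal target p_R on a grid
x = np.linspace(-6.0, 6.0, 601)
dx = x[1] - x[0]
p_R = 0.5 * norm.pdf(x, -2.0, 0.5) + 0.5 * norm.pdf(x, 2.0, 0.5)
p_R /= np.sum(p_R) * dx

# M-projection: match mean and variance of p_R
mu_M = np.sum(x * p_R) * dx
std_M = np.sqrt(np.sum((x - mu_M) ** 2 * p_R) * dx)

# I-projection: minimize KL(q || p_R) = integral of q * log(q / p_R) numerically
def kl_q_pR(params):
    mu, log_std = params
    q = norm.pdf(x, mu, np.exp(log_std))
    return np.sum(q * (np.log(q + 1e-300) - np.log(p_R + 1e-300))) * dx

mu_I, log_std_I = minimize(kl_q_pR, x0=[0.5, 0.0], method="Nelder-Mead").x
print("M-projection:", mu_M, std_M, "  I-projection:", mu_I, np.exp(log_std_I))
```

Running this, the M-projection places its mean between the two modes with a large variance, while the I-projection collapses onto one mode, mirroring the behavior shown in the figure.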

94 I vs M-projection : Illustrative Examples We also want to have state variables... The policy $\pi(a|s;\theta)$ is obtained by conditioning on the state s. Policy π is a linear Gaussian model... In order to get more complex policies $\pi(a|s_t;\theta)$... For each state $s_t$, we re-estimate the model $p(s,a;\theta)$ locally (using either the M- or I-projection). We clamp $\mu_s$ at $s_t$.
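
A sketch of the standard Gaussian conditioning step that yields π(a|s) from the joint model (without the local re-estimation around $s_t$ described above); the toy numbers are illustrative.

```python
import numpy as np

def condition_gaussian(mu_s, mu_a, S_ss, S_sa, S_aa, s):
    """pi(a|s) from the joint Gaussian: mean mu_a + Sigma_as Sigma_ss^{-1} (s - mu_s),
    covariance Sigma_aa - Sigma_as Sigma_ss^{-1} Sigma_sa."""
    K = S_sa.T @ np.linalg.inv(S_ss)       # Sigma_as Sigma_ss^{-1}
    return mu_a + K @ (s - mu_s), S_aa - K @ S_sa

mu_s, mu_a = np.array([0.0]), np.array([0.5])
S_ss, S_sa, S_aa = np.array([[1.0]]), np.array([[0.6]]), np.array([[1.0]])
print(condition_gaussian(mu_s, mu_a, S_ss, S_sa, S_aa, s=np.array([1.0])))
```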

95 I vs M-projection : Illustrative Examples 1-dimensional state and action space, complex reward function (dark background indicates negative reward). The policy is estimated for 6 different states.
[Figure: estimated policies at states $s_1, \ldots, s_6$ for the M-projection and the I-projection.]
The M-projection includes areas of low reward in the distribution!

96 Policy Search for Motion Primitives Let's apply variational inference for policy search in changing situations. Movement representation : parametrized velocity profiles.

97 Multi-situation setting : How can we learn θ? Existing algorithms are all MC-EM based and therefore use the M-projection: Reward-Weighted Regression (Peters & Schaal, 2007), Cost-Regularized Kernel Regression (Kober et al., 2010). Online learning setup : as samples we always use the history of the agent...

98 Experiments : Cannon-Ball Task Learn to shoot a cannon ball at a desired location. State space $s_0$ : desired location, wind force. Parameter space w : launching angle and velocity of the ball. Comparison of the I- and M-projection; CRKR : Cost-Regularized Kernel Regression. Multi-modal solution space, the I-projection performs best.
[Figure: performance over episodes for the I-projection, the M-projection, and CRKR.]

99 Experiments : 4-link pendulum balancing A 4-link humanoid robot has to counterbalance different pushes. Situations : the robot gets pushed with different forces $F_i \in [0; 25]$ Ns at 4 different points of origin (4-dimensional state space). Movement primitives : sequence of sigmoidal velocity profiles (39 parameters)...
[Figure: snapshots of the balancing motion at t = 0.10 s, 0.60 s, 1.10 s, 1.60 s, and 2.10 s.]
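
For illustration, a sketch of one sigmoidal velocity-profile building block; the exact parameterization (39 parameters) used in the thesis is not reproduced here, and the parameter names and values are assumptions.

```python
import numpy as np

def sigmoid_profile(t, amplitude, center, slope):
    """One sigmoidal building block of a velocity profile."""
    return amplitude / (1.0 + np.exp(-slope * (t - center)))

t = np.linspace(0.0, 2.0, 200)
# a velocity "bump": one rising and one falling sigmoid (toy values)
qdot_d = sigmoid_profile(t, 1.0, 0.3, 20.0) - sigmoid_profile(t, 1.0, 0.8, 20.0)
```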

100 Experiments : 4-link pendulum balancing The 4-link humanoid robot has to counterbalance different pushes. After training, the robot has learned to balance almost every force. The robot learns completely different balancing strategies. We could not produce reliable results with the M-projection...
[Figure: snapshots of the balancing motion at t = 0.10 s, 0.60 s, 1.10 s, 1.60 s, and 2.10 s.]

101 Conclusion We can use the M-projection or the I-projection for policy search. The I-projection also uses information from bad samples, which are neglected by the M-projection! It can therefore be used with ease for multi-modal distributions or non-concave reward functions. Computationally quite demanding... More efficient methods to calculate the I-projection are needed. Is there still a big difference for more complex model distributions...?

102 The end Thanks for your attention!

103 Bibliography

Atkeson, C. G., Moore, A. W., & Schaal, S. (1997). Locally Weighted Learning. Artificial Intelligence Review, 11.
de Boer, P.-T., Kroese, D., Mannor, S., & Rubinstein, R. (2005). A Tutorial on the Cross-Entropy Method. Annals of Operations Research, 134(1).
Ernst, D., Geurts, P., & Wehenkel, L. (2005). Tree-Based Batch Mode Reinforcement Learning. Journal of Machine Learning Research, 6.
Ernst, D., Geurts, P., & Wehenkel, L. (2003). Iteratively Extending Time Horizon Reinforcement Learning. In: European Conference on Machine Learning (ECML).
Heidrich-Meisner, V., & Igel, C. (2009). Neuroevolution Strategies for Episodic Reinforcement Learning. Journal of Algorithms, 64(4).
Ijspeert, A. J., & Schaal, S. (2003). Learning Attractor Landscapes for Learning Motor Primitives. In: Advances in Neural Information Processing Systems 15. Cambridge, MA: MIT Press.
Kober, J., & Peters, J. (2010). Policy Search for Motor Primitives in Robotics. Machine Learning Journal, online first.
Kober, J., Oztop, E., & Peters, J. (2010). Reinforcement Learning to Adjust Robot Movements to New Situations. In: Proceedings of the 2010 Robotics: Science and Systems Conference (RSS 2010).
Kolter, Z., & Ng, A. (2009). Task-Space Trajectories via Cubic Spline Optimization. In: Proceedings of the 2009 IEEE International Conference on Robotics and Automation (ICRA 09). Piscataway, NJ, USA: IEEE Press.
Neumann, G., & Peters, J. (2009). Fitted Q-Iteration by Advantage Weighted Regression. In: Advances in Neural Information Processing Systems 22 (NIPS 2008). MIT Press.
Peters, J., & Schaal, S. (2006). Policy Gradient Methods for Robotics. In: Proceedings of the IEEE International Conference on Intelligent Robots and Systems (IROS).
Peters, J., & Schaal, S. (2007). Reinforcement Learning by Reward-Weighted Regression for Operational Space Control. In: Proceedings of the International Conference on Machine Learning (ICML).
Riedmiller, M. (2005). Neural Fitted Q-Iteration - First Experiences with a Data Efficient Neural Reinforcement Learning Method. In: Proceedings of the European Conference on Machine Learning (ECML).
Sutton, R., Precup, D., & Singh, S. (1999). Between MDPs and Semi-MDPs: A Framework for Temporal Abstraction in Reinforcement Learning. Artificial Intelligence, 112.
Theodorou, E., Buchli, J., & Schaal, S. (2010). Reinforcement Learning of Motor Skills in High Dimensions: A Path Integral Approach. In: IEEE International Conference on Robotics and Automation (ICRA).
Vlassis, N., Toussaint, M., Kontes, G., & Piperidis, S. (2009). Learning Model-Free Robot Control by a Monte Carlo EM Algorithm. Autonomous Robots, 27(2).
Williams, R. J. (1992). Simple Statistical Gradient-Following Algorithms for Connectionist Reinforcement Learning. Machine Learning.


More information

Getting a kick out of humanoid robotics

Getting a kick out of humanoid robotics Getting a kick out of humanoid robotics Using reinforcement learning to shape a soccer kick Christiaan W. Meijer Getting a kick out of humanoid robotics Using reinforcement learning to shape a soccer

More information

A Fuzzy Reinforcement Learning for a Ball Interception Problem

A Fuzzy Reinforcement Learning for a Ball Interception Problem A Fuzzy Reinforcement Learning for a Ball Interception Problem Tomoharu Nakashima, Masayo Udo, and Hisao Ishibuchi Department of Industrial Engineering, Osaka Prefecture University Gakuen-cho 1-1, Sakai,

More information

Neural Networks: promises of current research

Neural Networks: promises of current research April 2008 www.apstat.com Current research on deep architectures A few labs are currently researching deep neural network training: Geoffrey Hinton s lab at U.Toronto Yann LeCun s lab at NYU Our LISA lab

More information

Reinforcement Learning in Continuous Action Spaces through Sequential Monte Carlo Methods

Reinforcement Learning in Continuous Action Spaces through Sequential Monte Carlo Methods Reinforcement Learning in Continuous Action Spaces through Sequential Monte Carlo Methods Alessandro Lazaric Marcello Restelli Andrea Bonarini Department of Electronics and Information Politecnico di Milano

More information

Clustering web search results

Clustering web search results Clustering K-means Machine Learning CSE546 Emily Fox University of Washington November 4, 2013 1 Clustering images Set of Images [Goldberger et al.] 2 1 Clustering web search results 3 Some Data 4 2 K-means

More information

Sample-Efficient Reinforcement Learning for Walking Robots

Sample-Efficient Reinforcement Learning for Walking Robots Sample-Efficient Reinforcement Learning for Walking Robots B. Vennemann Delft Robotics Institute Sample-Efficient Reinforcement Learning for Walking Robots For the degree of Master of Science in Mechanical

More information

Markov Decision Processes. (Slides from Mausam)

Markov Decision Processes. (Slides from Mausam) Markov Decision Processes (Slides from Mausam) Machine Learning Operations Research Graph Theory Control Theory Markov Decision Process Economics Robotics Artificial Intelligence Neuroscience /Psychology

More information

This is an author produced version of Definition and composition of motor primitives using latent force models and hidden Markov models.

This is an author produced version of Definition and composition of motor primitives using latent force models and hidden Markov models. This is an author produced version of Definition and composition of motor primitives using latent force models and hidden Markov models. White Rose Research Online URL for this paper: http://eprints.whiterose.ac.uk/116580/

More information

Autoencoders, denoising autoencoders, and learning deep networks

Autoencoders, denoising autoencoders, and learning deep networks 4 th CiFAR Summer School on Learning and Vision in Biology and Engineering Toronto, August 5-9 2008 Autoencoders, denoising autoencoders, and learning deep networks Part II joint work with Hugo Larochelle,

More information

Performance Comparison of Sarsa(λ) and Watkin s Q(λ) Algorithms

Performance Comparison of Sarsa(λ) and Watkin s Q(λ) Algorithms Performance Comparison of Sarsa(λ) and Watkin s Q(λ) Algorithms Karan M. Gupta Department of Computer Science Texas Tech University Lubbock, TX 7949-314 gupta@cs.ttu.edu Abstract This paper presents a

More information

Unsupervised Learning: Clustering

Unsupervised Learning: Clustering Unsupervised Learning: Clustering Vibhav Gogate The University of Texas at Dallas Slides adapted from Carlos Guestrin, Dan Klein & Luke Zettlemoyer Machine Learning Supervised Learning Unsupervised Learning

More information

Reinforcement Learning in Discrete and Continuous Domains Applied to Ship Trajectory Generation

Reinforcement Learning in Discrete and Continuous Domains Applied to Ship Trajectory Generation POLISH MARITIME RESEARCH Special Issue S1 (74) 2012 Vol 19; pp. 31-36 10.2478/v10012-012-0020-8 Reinforcement Learning in Discrete and Continuous Domains Applied to Ship Trajectory Generation Andrzej Rak,

More information

Reinforcement Learning with Parameterized Actions

Reinforcement Learning with Parameterized Actions Proceedings of the Thirtieth AAAI Conference on Artificial Intelligence (AAAI-16) Reinforcement Learning with Parameterized Actions Warwick Masson and Pravesh Ranchod School of Computer Science and Applied

More information

Machine Learning on Physical Robots

Machine Learning on Physical Robots Machine Learning on Physical Robots Alfred P. Sloan Research Fellow Department or Computer Sciences The University of Texas at Austin Research Question To what degree can autonomous intelligent agents

More information

What can we learn from demonstrations?

What can we learn from demonstrations? What can we learn from demonstrations? Marc Toussaint Machine Learning & Robotics Lab University of Stuttgart IROS workshop on ML in Robot Motion Planning, Okt 2018 1/29 Outline Previous work on learning

More information

Behavioral Data Mining. Lecture 18 Clustering

Behavioral Data Mining. Lecture 18 Clustering Behavioral Data Mining Lecture 18 Clustering Outline Why? Cluster quality K-means Spectral clustering Generative Models Rationale Given a set {X i } for i = 1,,n, a clustering is a partition of the X i

More information

Imitation and Reinforcement Learning for Motor Primitives with Perceptual Coupling

Imitation and Reinforcement Learning for Motor Primitives with Perceptual Coupling Imitation and Reinforcement Learning for Motor Primitives with Perceptual Coupling Jens Kober, Betty Mohler, Jan Peters Abstract Traditional motor primitive approaches deal largely with open-loop policies

More information

Estimation of Bilateral Connections in a Network: Copula vs. Maximum Entropy

Estimation of Bilateral Connections in a Network: Copula vs. Maximum Entropy Estimation of Bilateral Connections in a Network: Copula vs. Maximum Entropy Pallavi Baral and Jose Pedro Fique Department of Economics Indiana University at Bloomington 1st Annual CIRANO Workshop on Networks

More information

Probabilistic Robotics

Probabilistic Robotics Probabilistic Robotics Sebastian Thrun Wolfram Burgard Dieter Fox The MIT Press Cambridge, Massachusetts London, England Preface xvii Acknowledgments xix I Basics 1 1 Introduction 3 1.1 Uncertainty in

More information

Sparse Distributed Memories in Reinforcement Learning: Case Studies

Sparse Distributed Memories in Reinforcement Learning: Case Studies Sparse Distributed Memories in Reinforcement Learning: Case Studies Bohdana Ratitch Swaminathan Mahadevan Doina Precup {bohdana,smahad,dprecup}@cs.mcgill.ca, McGill University, Canada Abstract In this

More information

Learning Continuous State/Action Models for Humanoid Robots

Learning Continuous State/Action Models for Humanoid Robots Proceedings of the Twenty-Ninth International Florida Artificial Intelligence Research Society Conference Learning Continuous State/Action Models for Humanoid Robots Astrid Jackson and Gita Sukthankar

More information

CS 687 Jana Kosecka. Reinforcement Learning Continuous State MDP s Value Function approximation

CS 687 Jana Kosecka. Reinforcement Learning Continuous State MDP s Value Function approximation CS 687 Jana Kosecka Reinforcement Learning Continuous State MDP s Value Function approximation Markov Decision Process - Review Formal definition 4-tuple (S, A, T, R) Set of states S - finite Set of actions

More information

Robot Motion Planning

Robot Motion Planning Robot Motion Planning slides by Jan Faigl Department of Computer Science and Engineering Faculty of Electrical Engineering, Czech Technical University in Prague lecture A4M36PAH - Planning and Games Dpt.

More information

RNNs as Directed Graphical Models

RNNs as Directed Graphical Models RNNs as Directed Graphical Models Sargur Srihari srihari@buffalo.edu This is part of lecture slides on Deep Learning: http://www.cedar.buffalo.edu/~srihari/cse676 1 10. Topics in Sequence Modeling Overview

More information

Probabilistic Graphical Models

Probabilistic Graphical Models Overview of Part Two Probabilistic Graphical Models Part Two: Inference and Learning Christopher M. Bishop Exact inference and the junction tree MCMC Variational methods and EM Example General variational

More information

Instance-Based Action Models for Fast Action Planning

Instance-Based Action Models for Fast Action Planning In Visser, Ribeiro, Ohashi and Dellaert, editors, RoboCup-2007. Robot Soccer World Cup XI, pp. 1-16, Springer Verlag, 2008. Instance-Based Action Models for Fast Action Planning Mazda Ahmadi and Peter

More information

Table of Contents. Chapter 1. Modeling and Identification of Serial Robots... 1 Wisama KHALIL and Etienne DOMBRE

Table of Contents. Chapter 1. Modeling and Identification of Serial Robots... 1 Wisama KHALIL and Etienne DOMBRE Chapter 1. Modeling and Identification of Serial Robots.... 1 Wisama KHALIL and Etienne DOMBRE 1.1. Introduction... 1 1.2. Geometric modeling... 2 1.2.1. Geometric description... 2 1.2.2. Direct geometric

More information

Reinforcement Learning of Clothing Assistance with a Dual-arm Robot

Reinforcement Learning of Clothing Assistance with a Dual-arm Robot Reinforcement Learning of Clothing Assistance with a Dual-arm Robot Tomoya Tamei, Takamitsu Matsubara, Akshara Rai, Tomohiro Shibata Graduate School of Information Science,Nara Institute of Science and

More information

Can we quantify the hardness of learning manipulation? Kris Hauser Department of Electrical and Computer Engineering Duke University

Can we quantify the hardness of learning manipulation? Kris Hauser Department of Electrical and Computer Engineering Duke University Can we quantify the hardness of learning manipulation? Kris Hauser Department of Electrical and Computer Engineering Duke University Robot Learning! Robot Learning! Google used 14 identical robots 800,000

More information

Monte Carlo Tree Search PAH 2015

Monte Carlo Tree Search PAH 2015 Monte Carlo Tree Search PAH 2015 MCTS animation and RAVE slides by Michèle Sebag and Romaric Gaudel Markov Decision Processes (MDPs) main formal model Π = S, A, D, T, R states finite set of states of the

More information