On Movement Skill Learning and Movement Representations for Robotics
1 On Movement Skill Learning and Movement Representations for Robotics. Gerhard Neumann, Graz University of Technology, Institute for Theoretical Computer Science. November 2, 2011
2 Modern Robotic Systems: Motivation... Many degrees of freedom, compliant actuators, highly dynamic movements...
3 In principle, the advanced morphology of these robots would allow us to perform a wide range of complex movements, such as different forms of locomotion (walking, running, trotting), jumping, playing tennis... Classical control methods often fail or are very hard to use for such complex movements. More promising approach: let the robot learn the movement by trial and error. Main topic of this thesis!
4 Movement Skill Learning for Robotics Movement skill learning can easily be formulated as a Reinforcement Learning problem: the agent has to search for a policy which optimizes the reward. So why is it challenging? High-dimensional continuous state spaces. High-dimensional continuous action spaces. Data is expensive: learning needs to be data efficient and it needs to be safe.
5 Movement Skill Learning for Robotics Learning algorithms can be roughly divided into Value-based methods Policy-search methods
6 Value-based methods Estimate the expected discounted future reward for each state s when following policy π: V^π(s) = E[ Σ_{t=0}^∞ γ^t r_t ], also denoted as the value function of policy π. Recursive form: V^π(s) = E[ r(s,a) + γ V^π(s') ]
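The recursive form above can be illustrated with a small numerical sketch (a toy two-state MDP under a fixed policy; the transition matrix and rewards are illustrative, not from the thesis): iterating the recursion converges to the same fixed point as solving the linear Bellman system directly.

```python
import numpy as np

# Toy illustration: policy evaluation on a 2-state MDP.
# P[s, s'] is the state-transition matrix under a fixed policy pi,
# r[s] the expected immediate reward, gamma the discount factor.
P = np.array([[0.9, 0.1],
              [0.2, 0.8]])
r = np.array([0.0, 1.0])
gamma = 0.9

# Iterate the recursive form V(s) = r(s) + gamma * E[V(s')] to a fixed point.
V = np.zeros(2)
for _ in range(1000):
    V = r + gamma * P @ V

# The fixed point solves the linear system (I - gamma * P) V = r exactly.
V_exact = np.linalg.solve(np.eye(2) - gamma * P, r)
```

Because the Bellman operator is a γ-contraction, the iteration error shrinks by a factor γ per sweep, so both routes agree to machine precision here.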
7 Value-based methods + The value function can be used to assess the quality of each intermediate action of an episode, e.g. by the use of the Temporal Difference (TD) error δ_t = r_t + γ V^π(s_{t+1}) − V^π(s_t), which evaluates whether the current step (s_t, a_t, r_t, s_{t+1}) was better or worse than expected. We can efficiently solve the temporal credit assignment problem. − The value function is very hard to estimate in high-dimensional continuous state and action spaces.
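The TD error for a single transition is a one-line computation; a tiny sketch with hypothetical numbers (values, reward, and discount are illustrative):

```python
# Hypothetical numbers: TD error for one transition (s_t, a_t, r_t, s_t+1).
gamma = 0.95
V = {"s0": 1.0, "s1": 2.0}   # current value estimates
r_t = 0.5

# delta_t = r_t + gamma * V(s_{t+1}) - V(s_t)
delta_t = r_t + gamma * V["s1"] - V["s0"]
# delta_t > 0: the step turned out better than the current estimate predicted.
```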
8 Policy Search Methods Rely on a parametric representation of the policy π(a|s; w) with policy parameters w. Directly optimize the policy parameters by performing rollouts on the real system. − We can only assess the quality of a whole trajectory instead of single actions. + However, as no value function is estimated, this can be done very accurately. More successful than value-based methods. Performance strongly depends on the used movement representation.
9 Outline: The thesis is divided into 3 parts... Value-based Methods: Graph-Based Reinforcement Learning; Fitted Q-Iteration by Advantage Weighted Regression. Movement Representations: Kinematic Synergies; Motion Templates; Planning Movement Primitives. Policy Search: Variational Inference for Policy Search in Changing Situations
13 Outline: The thesis is divided into 3 parts... Value-based Methods Graph-Based Reinforcement Learning Fitted Q-Iteration by Advantage Weighted Regression
14 Fitted Q-Iteration: Batch-Mode Reinforcement Learning (BMRL) Batch-mode RL methods use the whole history H of the agent to update the value or action-value function: H = { (s_i, a_i, r_i, s'_i) }, 1 ≤ i ≤ N. Advantage: data points are used more efficiently than in online methods.
15 Fitted Q-Iteration: Batch-Mode Reinforcement Learning (BMRL) Fitted Q-Iteration (Ernst et al., 2003) approximates the state-action value function Q(s,a) by iteratively using supervised regression techniques. Repeat K times: Q_{k+1}(i) = r_i + γ V_k(s'_i) = r_i + γ max_{a'} Q_k(s'_i, a'), D_k = { ((s_i, a_i), Q_{k+1}(i)) }, 1 ≤ i ≤ N, Q_{k+1} = Regress(D_k)
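The iteration above can be sketched on a toy problem; the deterministic chain MDP and the tabular "regressor" (the degenerate case of the supervised-learning step) are illustrative assumptions, not the thesis setup:

```python
import numpy as np

# Minimal Fitted Q-Iteration sketch on a toy chain MDP (illustrative).
# States 0..4, actions 0 (left) / 1 (right); moving right from state 3
# reaches the terminal goal state 4 with reward 1, everything else gives 0.
n_states, n_actions, gamma = 5, 2, 0.9

def step(s, a):
    s_next = min(max(s + (1 if a == 1 else -1), 0), n_states - 1)
    r = 1.0 if s_next == n_states - 1 else 0.0
    return s_next, r

# Batch H = {(s_i, a_i, r_i, s'_i)}: every non-terminal state-action pair once.
H = []
for s in range(n_states - 1):
    for a in range(n_actions):
        s_next, r = step(s, a)
        H.append((s, a, r, s_next))

Q = np.zeros((n_states, n_actions))
for k in range(50):  # repeat K times
    # Targets: Q_{k+1}(i) = r_i + gamma * max_a' Q_k(s'_i, a'); no bootstrap
    # at the terminal state. Computed synchronously from Q_k.
    targets = {(s, a): r + gamma * Q[s_next].max() * (s_next != n_states - 1)
               for (s, a, r, s_next) in H}
    for (s, a), t in targets.items():  # "Regress": here simply a table
        Q[s, a] = t

# Greedy policy: always move right toward the goal.
policy = Q.argmax(axis=1)
```

With a table the regression step is exact, so the iteration converges to the optimal Q-values (Q(3, right) = 1, Q(0, right) = γ³).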
16 Fitted Q-iteration : Batch-Mode Reinforcement Learning (BMRL) + FQI has proven to outperform classical online RL methods in many applications (Ernst et al., 2005). + Any type of supervised learning method can be used... E.g. neural networks (Riedmiller, 2005), regression trees (Ernst et al., 2005), Gaussian Processes - High computational demands...
17 FQI for Robotics... Continuous state spaces: any type of supervised learning method can be used, e.g. neural networks, regression trees, Gaussian Processes. Continuous action spaces: we have to solve Q_{k+1}(i) = r_i + γ max_{a'} Q_k(s'_i, a'). − Hm... how do we perform the max_{a'} operator in continuous action spaces?
18 FQI for Robotics... Hm... how do we perform the max_a operator in continuous action spaces? Discretizations become prohibitively expensive in high-dimensional spaces. We would have to solve an optimization problem for each sample, e.g. use Cross-Entropy optimization for each data point s_i!
19 FQI for Robotics... How do we perform the max_a operator in continuous action spaces? We show that an advantage-weighted regression can be used to approximate max_a Q(s,a). The regression uses the states s_i as input values and Q(s_i, a_i) as target values. The weighting w_i = exp(τ Ā(s_i, a_i)) of each data point is based on the advantage function A(s,a) = Q(s,a) − V(s).
20 FQI for Robotics... What is a weighted regression? Minimize the error function w.r.t. θ: E = Σ_{i=1}^N w_i (V(s_i; θ) − Q(s_i, a_i))², where each data point gets an individual weighting w_i.
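For linear features the weighted objective above has a closed-form minimizer; a minimal numpy sketch (the data, the feature map [1, s], and the weights are illustrative assumptions):

```python
import numpy as np

# Weighted linear regression: minimize E = sum_i w_i (V(s_i; theta) - y_i)^2
# with V(s; theta) = [1, s] @ theta. The minimizer has the closed form
# theta = (X^T W X)^{-1} X^T W y.
rng = np.random.default_rng(0)
s = rng.uniform(-1, 1, size=50)
y = 2.0 + 3.0 * s + 0.01 * rng.standard_normal(50)  # targets, e.g. Q(s_i, a_i)
w = rng.uniform(0.1, 1.0, size=50)                  # individual weightings w_i

X = np.column_stack([np.ones_like(s), s])
W = np.diag(w)
theta = np.linalg.solve(X.T @ W @ X, X.T @ W @ y)
# theta is close to [2, 3]: the weighted fit recovers intercept and slope.
```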
21 FQI for Robotics... We prove this by applying the following 2 steps: 1. weighted regression for value estimation, 2. soft-greedy policy improvement.
22 Weighted regression for value estimation The value function of a stochastic policy π is given by V^π(s) = ∫_a π(a|s) Q(s,a) da. We show that this can be approximated without evaluating the integral by solving a weighted regression problem: D_V = { s_i, Q(s_i, a_i) }, U = { π(a_i|s_i) }, ˆV = WeightedReg(D_V, U)
23 Proof We want to find an approximation ˆV(s) of V^π(s) by minimizing the error function Error(ˆV) = ∫_s µ(s) ( ∫_a π(a|s) Q(s,a) da − ˆV(s) )² ds = ∫_s µ(s) ( ∫_a π(a|s) (Q(s,a) − ˆV(s)) da )² ds, where µ(s) is the state distribution when following policy π(·|s).
24 Proof Squared error function: Error(ˆV) = ∫_s µ(s) ( ∫_a π(a|s) (Q(s,a) − ˆV(s)) da )² ds. An upper bound of Error(ˆV) is given by Error_B(ˆV) = ∫_s µ(s) ∫_a π(a|s) (Q(s,a) − ˆV(s))² da ds ≥ Error(ˆV), by the use of Jensen's inequality.
25 Proof It is easy to show that both error functions have the same minimum for ˆV. The upper bound Error_B can be approximated straightforwardly by samples {(s_i, a_i), Q(s_i, a_i)}, 1 ≤ i ≤ N: Error_B(ˆV) ≈ Σ_{i=1}^N π(a_i|s_i) (Q(s_i, a_i) − ˆV(s_i))². (1) No integral over the action space is needed!
26 FQI for Robotics... We prove this by applying the following 2 steps: 1. weighted regression for value estimation, 2. soft-greedy policy improvement.
27 Soft-greedy policy improvement The optimal value function V*(s) = max_a Q(s,a) can be approximated without evaluating max_a Q(s,a) by solving an advantage-weighted regression problem: D_V = { s_i, Q(s_i, a_i) }, U* = { exp(τ Ā(s_i, a_i)) }, (2) ˆV* = WeightedReg(D_V, U*) (3) - τ ... greediness parameter of the algorithm. - Ā(s,a) ... normalized advantage function.
28 Proof We approximate the value function V^{π_1} of a soft-max policy π_1 by the use of weighted regression. Since a soft-max policy is an approximation of the greedy policy, we can replace V*(s) = max_a Q(s,a) with V^{π_1}(s).
29 Proof The used soft-max policy π_1(a|s) is based on the advantage function A(s,a) = Q(s,a) − V(s): π_1(a|s) = exp(τ Ā(s,a)) / ∫_{a'} exp(τ Ā(s,a')) da', with Ā(s,a) = (A(s,a) − m_A(s)) / σ_A(s). If we assume that the advantages A(s,a) are normally distributed, the denominator of π_1 is constant. Thus we can use exp(τ Ā(s,a)) ∝ π_1(a|s) directly as weighting for the regression.
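How the advantage weighting behaves can be sketched numerically for one state with a few sampled actions (the Q-values are illustrative, and using the empirical mean of the sampled Q-values as a stand-in for V(s) is an assumption):

```python
import numpy as np

# Sketch of the advantage weighting u_i = exp(tau * A_bar(s_i, a_i)) for one
# state with several sampled actions (illustrative numbers).
tau = 1.0
Q = np.array([1.0, 2.0, 4.0, 3.0])   # Q(s, a_i) for sampled actions a_i
V = Q.mean()                          # stand-in estimate of V(s)

A = Q - V                             # advantage A(s, a) = Q(s, a) - V(s)
A_bar = (A - A.mean()) / A.std()      # normalized advantage
u = np.exp(tau * A_bar)               # soft-greedy weights

# All weights are strictly positive, and the action with the highest Q-value
# receives the largest weight; increasing tau makes the weighting greedier.
```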
30 Concrete algorithm: LAWER The Locally-Advantage WEighted Regression (LAWER) algorithm implements the presented theoretical results. It combines Locally Weighted Regression (LWR, (Atkeson et al., 1997)) and advantage-weighted regression. The locality weighting w_i(s) and the advantage weighting u_i = exp(τ Ā(s_i, a_i)) can be multiplicatively combined.
31 Concrete algorithm: LAWER The value function is then given by a simple weighted linear regression: V_{k+1}(s) = s̃^T (S̃^T U S̃)^{-1} S̃^T U Q_{k+1}, where s̃ = [1, s^T]^T, S̃ = [s̃_1, s̃_2, ..., s̃_N]^T is the state matrix and U = diag(w_i(s) u_i). In order to approximate V*(s) = max_a Q_k(s,a), only the Q-values of neighboring state-action pairs are needed.
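The locally weighted prediction above can be sketched as follows; the Gaussian locality kernel, its bandwidth, the toy data, and the uniform advantage weights are illustrative assumptions, not the thesis implementation:

```python
import numpy as np

# Sketch of a locally weighted value prediction in the spirit of
# V(s) = s_tilde^T (S^T U S)^{-1} S^T U Q with U = diag(w_i(s) * u_i):
# a Gaussian locality kernel w_i(s) combined multiplicatively with
# advantage weights u_i (illustrative 1-D data).
rng = np.random.default_rng(1)
S_raw = rng.uniform(-2, 2, size=(100, 1))   # states s_i
Q_vals = np.sin(S_raw[:, 0])                # stand-in targets Q(s_i, a_i)
u = np.ones(100)                            # advantage weights u_i (uniform here)

def predict_V(s_query, bandwidth=0.3):
    w = np.exp(-0.5 * ((S_raw[:, 0] - s_query) / bandwidth) ** 2)  # locality
    U = np.diag(w * u)
    S = np.column_stack([np.ones(100), S_raw])  # rows s_tilde_i = [1, s_i]
    beta = np.linalg.solve(S.T @ U @ S, S.T @ U @ Q_vals)
    return np.array([1.0, s_query]) @ beta

# The locally linear fit tracks the nonlinear target sin(s) near each query.
```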
32 Approximation of the policy For unseen states we need to approximate the soft-max policy by a Gaussian policy π(a|s) = N(a | µ(s), σ²). For estimating this policy we use reward-weighted regression (Peters & Schaal, 2007), but the advantage is used instead of the reward for the weighting. Thus, we optimize the long-term reward instead of the immediate reward.
33 Results We use the Cross-Entropy (CE) optimization method (de Boer et al., 2005) for comparison, to find the maximum Q-values max_a Q(s,a). We compare the LAWER algorithm to 3 different state-of-the-art CE-based fitted Q-iteration algorithms: Tree-based FQI (Ernst et al., 2005) (CE-Tree), Neural FQI (Riedmiller, 2005) (CE-Net), and LWR-based FQI (CE-LWR). After each FQI cycle new data was collected. The immediate reward function was quadratic in the distance to the goal position x_G and in the applied torque/force.
34 Pendulum swing-up task A pendulum needs to be swung up from the position at the bottom to the top position (Riedmiller, 2005). 2 experiments with different torque punishment factors (c_2) were carried out. [Figure: average reward vs. number of data collections for LAWER, CE-Tree, CE-LWR and CE-Net; panels (a) and (b) show the two settings, with c_2 = 0.025 in (b).]
35 Comparison of torque trajectories [Figure: torque u [N] over time for LAWER, CE-Tree and CE-LWR; panels (c) and (d) show the two torque punishment settings, with c_2 = 0.025 in (d).]
36 Dynamic puddle-world The agent has to navigate from a start position to a goal position; it gets negative reward when going through puddles. Dynamic version of the puddle-world: the agent can set a force accelerating a k-dimensional point mass. This was done for k = 2 and k = 3 dimensions. [Figure: puddle-world with start and goal positions.]
37 Comparison of the algorithms [Figure: average reward vs. number of data collections for LAWER and CE-Tree; (e) 2-D, (f) 3-D.] The CE-Tree method learns faster, but does not manage to learn high-quality policies for the 3D setting. LAWER also works for high-dimensional action spaces.
38 Comparison of torque trajectories [Figure: torque trajectories u_1, u_2, u_3 over time; (g) LAWER, (h) CE-Tree.]
39 Conclusion We have proven that the greedy operator max_a Q(s,a) can be approximated efficiently by an advantage-weighted regression. The resulting algorithm runs an order of magnitude faster than competing algorithms. Despite using only a soft-greedy policy improvement, our algorithm was able to produce policies of higher quality. The Locally-Advantage Weighted Regression algorithm allows us to use fitted Q-iteration even for high-dimensional continuous action spaces.
40 Outline: The thesis is divided into 3 parts... Value-based Methods Graph-Based Reinforcement Learning Fitted Q-Iteration by Advantage Weighted Regression Movement Representations Kinematic Synergies Motion Templates Planning Movement Primitives Policy Search Variational Inference for Policy Search in Changing Situations
41 Movement Representations for Motor Skill Learning Directly optimize a parametric movement representation No value estimation is needed What is a good representation for learning a movement? Episodic Tasks: Often it is sufficient to formulate the learning task in the episodic RL setup Single initial state, specified fixed duration of the movement Direct Policy Search can be applied easily in this setup
42 Movement Representations for Motor Skill Learning Episodic setup: use a trajectory-based representation. We learn a parametric representation of the desired trajectory [q_d(t; w), q̇_d(t; w)], where t is the execution time of the movement; there is no direct dependence on the high-dimensional state. t is now a scalar, which significantly simplifies the learning problem. Can only be used in the episodic setup (single start states). This trajectory is then followed by using feedback control laws. Most common movement representations are trajectory-based: Dynamic Movement Primitives (Ijspeert & Schaal, 2003), splines (Kolter & Ng, 2009), ...
43 Trajectory-based vs. Value-Based Motor Skill Learning Trajectory-based: can be seen as a single-step decision task; the agent chooses the parameters w as action of a single, temporally extended step. Only one step per episode... Value-based: one decision per time step of the agent; the agent chooses the torque u as action of a single, very short time step. Up to a few hundred steps per episode...
44 Trajectory-based vs. Value Based Motor Skill learning Can we find a more intuitive solution for which the agent chooses new actions only at certain, characteristic time points of the movement? Temporal Abstraction: Sequencing of temporally extended actions, also called Motion Templates
45 Temporal Abstractions for Motor-Skill Learning Example: drawing a triangle with a pen. Flat setup: we have to make many unessential decisions. Abstracted setup: the movement can be easily decomposed into 3 elemental motions.
47 Temporal Abstractions for Motor-Skill Learning Standard framework for temporally extended actions: Options (Sutton et al., 1999). Options are closed-loop policies taking actions over a period of time. However, they are mainly used in discrete environments. In many applications options are discrete temporally extended actions, e.g. go to another room, follow the hallway or frighten the poor monkey. For motor tasks, useful options are often difficult to specify.
48 Temporal Abstractions for Motor-Skill Learning : Illustration Pendulum Swing-up Task : Standard RL benchmark task Learn how to swing up and balance an inverted pendulum from the bottom position We additionally want to minimize the energy consumption Flat RL : Choose a new action every 50ms
49 Pendulum Swing-Up: Illustration How can we decompose the trajectory into options? [Figure: torque trajectory over time, showing a positive peak, a negative peak and a final balancing motion.] We have positive and negative peaks in the torque trajectory, followed by a final balancing motion.
50 Pendulum Swing-Up: Illustration How can we decompose the trajectory into options? Specify the exact form of the peaks and the balancing motion for the options? This requires a lot of prior knowledge... and the learning task becomes trivial. However: we can specify the functional form of the options and use parameterized options...
51 Motion Templates Motion templates : Parameterized Options Used as our building blocks of motion. A motion template m p is defined by : Its k p dimensional parameter space Θ p Its parameterized policy u p (s,t;θ p ) Its termination condition c p (s,t;θ p ) s... state, t... execution time, θ p Θ p... parameters The functional form of u p and c p is chosen by the designer, the parameters θ p are learned by the agent
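The template interface described above might be sketched like this; the class name, fields, toy 1-D integrator dynamics, and the constant-torque example are all hypothetical illustrations, not the thesis implementation:

```python
from dataclasses import dataclass
from typing import Callable

# Hypothetical sketch of a motion template m_p: a parameterized policy
# u_p(s, t; theta_p) plus a termination condition c_p(s, t; theta_p).
@dataclass
class MotionTemplate:
    policy: Callable[[float, float], float]      # u_p(s, t)
    terminates: Callable[[float, float], bool]   # c_p(s, t)

def run_template(mt, s0, dt=0.01, t_max=5.0):
    """Execute the template's policy until its termination condition fires."""
    s, t = s0, 0.0
    while t < t_max and not mt.terminates(s, t):
        u = mt.policy(s, t)
        s, t = s + dt * u, t + dt   # toy 1-D integrator dynamics
    return s, t

# Example: a constant-torque template with height a and fixed duration d.
theta = {"a": 2.0, "d": 0.5}   # illustrative parameters theta_p
template = MotionTemplate(
    policy=lambda s, t: theta["a"],
    terminates=lambda s, t: t >= theta["d"],
)
s_end, t_end = run_template(template, s0=0.0)
```

A sequence of such templates, each with its own learned parameters, then composes the whole motion; the duration parameter makes the decision points continuous in time.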
52 Motion Templates At each decision time step σ_k the agent has to choose: which motion template m_p ∈ A(σ_k) to use, where A(σ_k) is the set of available motion templates in decision time step σ_k, and which parameterization θ_p ∈ Θ_p of m_p to use. Subsequently the policy u_p is executed until the termination condition c_p is fulfilled. Continuous time: the duration of the templates can be continuous-valued. The agent has to learn the correct sequence and parameterization of the motion templates.
53 Pendulum Swing-up: Decomposition into Motion Templates How can we decompose the trajectory into motion templates? [Figure: torque trajectory over time, showing a positive peak, a negative peak and a final balancing motion.] We have positive and negative peaks in the torque trajectory, followed by a final balancing motion.
54 Pendulum Swing-Up: Templates to model the peaks We use 2 templates per peak: one for the ascending part, m_1, and one for the descending part, m_2. Both just depend on the execution time of the template.
55 Pendulum Swing-Up: Decomposition into Motion Templates We use 2 templates per peak: ascending part (m_1) and descending part (m_2). [Figure: torque profiles for different values of a_1 and a_2.] Parameters: a_i ... height of the template.
56 Pendulum Swing-Up: Decomposition into Motion Templates We use 2 templates per peak: ascending part (m_1) and descending part (m_2). [Figure: torque profiles for different values of o_1 and o_2.] Parameters: a_i ... height of the template, o_i ... curvature of the template.
57 Pendulum Swing-Up: Decomposition into Motion Templates We use 2 templates per peak: ascending part (m_1) and descending part (m_2). [Figure: torque profiles for different values of d_1 and d_2.] Parameters: a_i ... height of the template, o_i ... curvature of the template, d_i ... duration of the template.
58 Pendulum Swing-Up: Decomposition into Motion Templates We fix the height of the descending peak template m_2 to be the height of m_1. m_3 and m_4 are the same templates, just for negative peaks. [Figure: torque trajectory decomposed into m_1 (positive peak, ascending part), m_2 (positive peak, descending part), m_3 (negative peak, ascending part), m_4 (negative peak, descending part) and the balancing motion m_5.]
59 Pendulum Swing-Up: Decomposition into Motion Templates The balancing template m_5 is implemented as a PD controller with functional form −k_1 θ − k_2 θ̇ and parameters k_1, k_2, the PD controller gains. m_5 always runs for 20 s; subsequently the episode is terminated.
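A minimal simulation of such a balancing template; the linearized inverted-pendulum dynamics, the gain values, and the integration scheme are illustrative assumptions, not the thesis setup:

```python
# Sketch of the balancing template m5 as a PD controller
# u = -k1 * theta - k2 * theta_dot on a pendulum linearized around the
# upright position (hypothetical constants throughout).
k1, k2 = 30.0, 6.0         # controller gains (illustrative values)
g_over_l, dt = 9.81, 0.01  # linearized gravity term, integration step

theta, theta_dot = 0.2, 0.0   # small initial tilt from upright
for _ in range(2000):         # simulate 20 s, the fixed duration of m5
    u = -k1 * theta - k2 * theta_dot
    theta_ddot = g_over_l * theta + u     # linearized inverted pendulum
    theta_dot += dt * theta_ddot          # semi-implicit Euler step
    theta += dt * theta_dot

# With sufficiently large gains (k1 > g/l) the closed loop is stable and the
# pendulum settles at the upright position theta = 0.
```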
60 Pendulum Swing-Up: Constructing the motion The agent can either choose the peak templates in the predefined order (m_2, m_3, m_4, m_1, m_2, ...) ... or it can use the balancing template m_5 as final template. Thus the agent has to learn the correct number of swing-ups and the correct parameterization of the swing-ups. [Figure: torque trajectory decomposed into the peak templates and the balancing motion.]
61 Pendulum Swing-Up: Constructing the motion [Figure: torque trajectory and its decomposition into m_1 ... m_5.] Flat: approximately 50 decisions/parameters are needed to reach the top position. Motion templates: the whole motion consists of only 5 decisions / 13 parameters.
63 Pendulum Swing-Up : Accuracy of the policy Motion templates decrease the number of necessary decisions significantly Overall learning task is simplified Ok... where is the catch? A single decision has now much more influence on the outcome of the whole motion. Therefore a single decision has to be made much more precisely than in flat RL.
64 Algorithm for Motion Template Learning An RL algorithm is needed which can learn very precise continuous-valued policies! For each template m_p, we use an extension of the Locally Advantage WEighted Regression (LAWER, (Neumann & Peters, 2009)) algorithm to learn the policy π_p(θ_p|s) for selecting the parameters of m_p.
65 Extensions of LAWER: LAWER for Motion Template Learning Due to the increased precision requirements of motion template learning, we had to develop 2 substantial extensions of LAWER: adaptive tree-based kernels, and an additional optimization to improve the approximation of V*(s) = max_a Q(s,a).
66 Extensions of LAWER: Adaptive Tree-Based Kernels The use of a uniform weighting kernel is often problematic in the case of high-dimensional input spaces ("curse of dimensionality"), spatially varying data densities, and spatially varying curvatures of the regression surface. This problem can be alleviated by varying the shape of the weighting kernel. We do this by the use of randomized regression trees...
67 Extensions of LAWER: Improved approximation of V*(s) = max_a Q(s,a) In order to estimate the weightings u_i, the original LAWER needed the assumption of normally distributed advantage values. Often this assumption does not hold, and the estimate of u_i gets imprecise. We improve the estimate of the u_i by an additional optimization...
68 Experiments Minimum-time problems with additional energy-consumption constraints (c_2): Pendulum Swing-Up, 2-link Pendulum Swing-Up, 2-link Pendulum Balancing. Iterative learning protocol: we collect L episodes with the currently estimated exploration policy. Subsequently the optimal policy is re-estimated and the performance (summed reward) of the optimal policy (without exploration) is evaluated.
69 Experiments: Pendulum Swing-Up Comparison of learning progress for different energy punishment factors (L = 50). Figure: Learning curves (average reward vs. number of data collections) for the Gaussian kernel (MT Gauss), the tree-based kernel (MT Tree) and flat RL, for (left) c_2 = and (right) c_2 = 0.075.
70 Experiments: Pendulum Swing-Up Comparison of the flat and the motion template policy. Figure: (a) Torque trajectories and motion templates learned for different energy punishment factors c_2. (b) Torque trajectories learned with flat RL. Performance for c_2 = : flat RL 48.6, motion templates 38.5
71 Experiments: 2-Link Pendulum Swing-Up Same templates as for the 1-dimensional task. The peak templates now have 2 additional parameters, the height and the curvature for the second control dimension u_2. The parameters of the balancer template m_5 consist of two 2×2 matrices for the controller gains. Figure: Comparison (average reward vs. number of data collections) of motion template learning with tree-based kernels (MT Tree) and flat RL.
72 Experiments: 2-Link Pendulum Swing-Up Learned motion template policy. Figure: Left: torque trajectories u_1, u_2 and their decomposition into the motion templates m_2, m_3, m_4, m_5. Right: illustration of the motion. The bold postures represent the switching time points of the motion templates.
73 Conclusions We have shown that by the use of motion templates, i.e. parametrized options, many motor tasks can be decomposed into elemental movements. Motion templates are the first movement representation which can be sequenced in time. While the whole motion consists of fewer decisions, a single decision has to be made more precisely. We propose a new algorithm for motion template learning which can cope with these precision requirements. We have shown that learning with motion templates can produce policies of higher quality than flat RL, and could even be applied to tasks where flat RL was not successful.
74 Outline: The thesis is divided into 3 parts... Value-based Methods Graph-Based Reinforcement Learning Fitted Q-Iteration by Advantage Weighted Regression Movement Representations Kinematic Synergies Motion Templates Planning Movement Primitives Policy Search Variational Inference for Policy Search in Changing Situations
75 Policy Search for trajectory-based representations Back to trajectory-based representations. Only 1 decision per episode: choose the parameter vector w. Typically w is very high-dimensional ( parameters). How can we optimize the parameters w? Policy Gradient Methods (Williams, 1992; Peters & Schaal, 2006), EM-based Methods (Kober & Peters, 2010), Inference-based Methods (Vlassis et al., 2009; Theodorou et al., 2010).
76 Inference-based Methods: Policy Search for Changing Situations In different situations s_0^i we have to choose different parameter vectors w_i. Can we generalize between solutions to avoid relearning? Learn a hierarchical policy π_MP(w|s_0; θ) which chooses the parameter vector w according to the situation s_0. In order to do so we will use approximate inference methods.
77 Outline Approximate Inference for Policy Search Decomposition of the log-likelihood Monte-Carlo EM based methods Variational Inference based methods Policy Search for Movement Primitives in changing situations 4-Link Balancing
78 Approximate Inference for Policy Search Using inference or inference-based methods has proven to be very useful for policy search: PoWER (Kober & Peters, 2010), Policy Improvement with Path Integrals (Theodorou et al., 2010), Reward-Weighted Regression and Cost-Regularized Kernel Regression (Kober et al., 2010), Monte-Carlo EM Policy Search (Vlassis et al., 2009), CMA-ES (Heidrich-Meisner & Igel, 2009).
79 All these algorithms use the Moment-projection of a certain target distribution to estimate the policy As we will see this can be problematic in many cases... (multi-modal solution space, complex reward functions...) Here we will introduce the theory to use the Information-projection and show that this projection alleviates many of these problems
80 Approximate Inference for Policy Search Formulating policy search as an inference problem... Observed variable: introduce a reward event p(R = 1|τ), e.g. p(R = 1|τ) ∝ exp(−C(τ)), where C(τ) are the trajectory costs. Latent variables: trajectories τ. Probabilistic model: p(R = 1, τ; θ) = p(R = 1|τ) p(τ; θ). We want to find parameters θ which maximize the log-marginal likelihood log p(R; θ) = log ∫_τ p(R|τ) p(τ; θ) dτ
81 Approximate Inference for Policy Search Policy search can be seen as finding the maximum likelihood (ML) solution of p(R; θ) = ∫_τ p(R|τ) p(τ; θ) dτ. Problem: huge trajectory space, the integral is intractable.
82 Decomposition of the log-likelihood We can decompose the log-likelihood by introducing a variational distribution q(τ) over the latent variable τ: log p(R; θ) = L(q, θ) + KL(q || p_R). Lower bound L(q, θ): L(q, θ) = ∫_τ q(τ) log p(R, τ; θ) dτ + f_1(q) = ∫_τ q(τ) log p(τ; θ) dτ + f_2(q), the expected complete-data log-likelihood plus terms f_1, f_2 that do not depend on θ.
83 Decomposition of the log-likelihood We can decompose the log-likelihood by introducing a variational distribution q(τ) over the latent variable τ: log p(R; θ) = L(q, θ) + KL(q || p_R). Kullback-Leibler divergence: KL(q || p_R) = ∫_τ q(τ) log (q(τ) / p_R(τ)) dτ, the distance between the variational distribution q and the conditional distribution of the latent variable, p_R(τ) = p(τ|R; θ) ∝ p(R|τ) p(τ; θ) ... the reward-weighted model distribution.
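The decomposition log p(R; θ) = L(q, θ) + KL(q || p_R) can be verified numerically for a discrete latent variable; the toy distributions below are illustrative, not from the thesis:

```python
import numpy as np

# Numerical check of log p(R) = L(q) + KL(q || p_R) for a discrete latent
# variable tau taking 3 values (toy distributions).
p_tau = np.array([0.5, 0.3, 0.2])          # model p(tau; theta)
p_R_given_tau = np.array([0.9, 0.1, 0.4])  # reward event p(R = 1 | tau)
q = np.array([0.6, 0.2, 0.2])              # variational distribution q(tau)

p_R = np.sum(p_R_given_tau * p_tau)          # marginal likelihood p(R; theta)
posterior = p_R_given_tau * p_tau / p_R      # p_R(tau) = p(tau | R; theta)

lower_bound = np.sum(q * np.log(p_R_given_tau * p_tau / q))  # L(q, theta)
kl = np.sum(q * np.log(q / posterior))                        # KL(q || p_R)

# The identity holds for ANY q; the bound is tight exactly when q = posterior.
```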
84 Decomposition of the log-likelihood We can now iteratively increase the lower bound L(q, θ) by: E-step: keep the model parameters θ fixed and minimize the KL divergence KL(q || p_R) w.r.t. q. M-step: keep the variational distribution q fixed and maximize the lower bound L(q, θ) w.r.t. θ.
85 Approximate Inference for Policy Search Two types of policy search algorithms emerge from this decomposition Monte-Carlo EM based Policy Search (Kober et al., 2010; Kober & Peters, 2010; Vlassis et al., 2009) Variational Inference Policy Search
86 Monte-Carlo (MC) EM based Algorithms MC-EM based algorithms use a sample-based approximation of q in the E-step. E-step, min_q KL(q || p_R): q(i) = p_R(i) ∝ p(R|τ_i) p(τ_i; θ_old). M-step, max_θ L(q, θ): use q(i) to approximate the lower bound, L(q, θ) ≈ Σ_i p_R(i) log p(τ_i; θ) + const = −KL(p_R || p(τ; θ)) + const. This is the same lower bound as given for PoWER and Reward-Weighted Regression.
87 Monte-Carlo (MC) EM based Algorithms Iteratively calculate the M(oment)-projection of p_R: min_θ KL(p_R || p) = min_θ Σ_i p_R(i) log (p_R(i) / p(τ_i; θ)). The model becomes reward-attracted: it forces the model p to have high probability in regions with high reward. Negatively rewarded samples are neglected! Minimization? p can be easily calculated by matching the moments of p with the moments of p_R.
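The moment-matching step can be sketched for a 1-D Gaussian model; the proposal, the reward model, and all numbers are illustrative assumptions:

```python
import numpy as np

# M-projection sketch: min_theta KL(p_R || p) for a Gaussian model p reduces
# to matching the weighted mean and variance of the reward-weighted samples.
rng = np.random.default_rng(2)
tau = rng.normal(0.0, 2.0, size=5000)       # samples from the old model
reward_weight = np.exp(-(tau - 1.0) ** 2)   # p(R | tau), peaked around tau = 1

w = reward_weight / reward_weight.sum()     # normalized weights p_R(i)
mu = np.sum(w * tau)                        # weighted first moment
var = np.sum(w * (tau - mu) ** 2)           # weighted second central moment

# (mu, var) parameterize the Gaussian closest to p_R in KL(p_R || p); for
# this conjugate toy case the exact answer is mu = 8/9, var = 4/9.
```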
88 Variational Inference based Algorithms For variational inference we use a parametric variational distribution q(τ) = q(τ; θ'). E-step, min_q KL(q || p_R): use a sample-based approximation for the integral in the KL divergence, KL(q(τ; θ') || p_R) ≈ Σ_{τ_i} q(τ_i; θ') log (q(τ_i; θ') / p_R(i)). M-step, max_θ L(q, θ): if we use the same family of distributions for p(τ; θ) and q(τ; θ'), we can simply set θ to θ'.
89 Variational Inference based Algorithms Iteratively calculate the I(nformation)-projection of p_R: min_θ KL(p || p_R) = min_θ Σ_i p(τ_i; θ) log (p(τ_i; θ) / p_R(i)). The model becomes cost-averse: it tries to avoid including regions with low reward in p(τ; θ). Uses information from negatively and positively rewarded samples. Minimization? A non-convex optimization problem (computationally much more demanding than using the M-projection)... We use numerical gradient ascent.
90 Approximate Inference for Policy Search MC-EM, M-projection based: min_θ KL(p_R || p) = min_θ Σ_i p_R(i) log (p_R(i) / p(τ_i; θ)). Variational inference, I-projection based: min_θ KL(p || p_R) = min_θ Σ_i p(τ_i; θ) log (p(τ_i; θ) / p_R(i)). Both algorithms are guaranteed to iteratively increase the lower bound...
91 I- vs. M-projection: Illustrative Examples Let's look at the differences in more detail...
92 I- vs. M-projection: Illustrative Examples We consider 1-step decision problems in continuous state and action spaces. We typically use a Gaussian distribution as model distribution: p(s, a; θ) = N( [s; a] | [µ_s; µ_a], [[Σ_ss, Σ_sa], [Σ_as, Σ_aa]] ), with θ = {µ_s, µ_a, Σ_ss, Σ_as, Σ_aa}.
93 I- vs. M-projection: Illustrative Examples 2-dimensional action space, no state variables, multimodal target distribution. [Figure: I-projection vs. M-projection of a multimodal target.] The M-projection averages over all modes; the I-projection concentrates on one mode.
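The mode-averaging vs. mode-seeking behavior can be reproduced with a small 1-D numerical experiment; the bimodal target is an assumed example, and the grid search over Gaussian candidates stands in for the gradient-based minimization described above:

```python
import numpy as np

# Illustration: for a bimodal target the M-projection averages over both
# modes, while the I-projection concentrates on one of them.
x = np.linspace(-6, 6, 2401)
dx = x[1] - x[0]

def gauss(x, mu, sigma):
    return np.exp(-0.5 * ((x - mu) / sigma) ** 2) / (sigma * np.sqrt(2 * np.pi))

target = 0.5 * gauss(x, -2.0, 0.5) + 0.5 * gauss(x, 2.0, 0.5)  # bimodal p_R

# M-projection: match the moments of the target.
mu_M = np.sum(x * target) * dx
var_M = np.sum((x - mu_M) ** 2 * target) * dx   # 0.25 + 4 = 4.25

# I-projection: grid search min_theta KL(p || p_R) over Gaussian candidates
# (a stand-in for numerical gradient-based minimization).
best = None
for mu in np.linspace(-4, 4, 81):
    for sigma in np.linspace(0.3, 3.0, 28):
        p = gauss(x, mu, sigma)
        kl = np.sum(p * np.log((p + 1e-300) / (target + 1e-300))) * dx
        if best is None or kl < best[0]:
            best = (kl, mu, sigma)
_, mu_I, sigma_I = best

# mu_M sits between the modes with a large variance; mu_I sits on one of the
# modes with a small sigma, avoiding the low-probability region in between.
```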
94 I- vs. M-projection: Illustrative Examples We also want to have state variables... The policy π(a|s; θ) is obtained by conditioning on the state s. The policy π is a linear Gaussian model... In order to get more complex policies π(a|s_t; θ), for each state s_t we re-estimate the model p(s, a; θ) locally (using either the M- or I-projection) and clamp µ_s at s_t.
95 I- vs. M-projection: Illustrative Examples 1-dimensional state and action space, complex reward function (dark background indicates negative reward). The policy is estimated for 6 different states s_1, ..., s_6. [Figure: M-projection and I-projection policies for states s_1 to s_6.] The M-projection includes areas of low reward in the distribution!
96 Policy Search for Motion Primitives Let's apply variational inference for policy search in changing situations. Movement representation: parametrized velocity profiles.
97 Multi-situation setting: How can we learn θ? Existing algorithms are all MC-EM based and therefore use the M-projection: Reward-Weighted Regression (Peters & Schaal, 2007), Cost-Regularized Kernel Regression (Kober et al., 2010). Online learning setup: as samples we always use the history of the agent...
98 Experiments: Cannon-Ball Task Learn to shoot a cannon ball at a desired location. State space s_0: desired location, wind force. Parameter space w: launching angle and velocity of the ball. Comparison of the I- and M-projection and of CRKR (Cost-Regularized Kernel Regression). Multi-modal solution space: the I-projection performs best. [Figure: performance vs. episodes for I-projection, M-projection and CRKR.]
99 Experiments: 4-link pendulum balancing A 4-link humanoid robot has to counterbalance different pushes. Situations: the robot gets pushed with different forces F_i ∈ [0; 25] Ns at 4 different points of origin (4-dimensional state space). Movement primitives: sequence of sigmoidal velocity profiles (39 parameters). [Figure: motion snapshots at t = 0.10 s to t = 2.10 s.]
100 Experiments: 4-link pendulum balancing. After learning, the robot can balance almost every force, and it learns completely different balancing strategies for different pushes. We could not produce reliable results with the M-projection. [Figure: balancing motion at t = 0.10 s, 0.60 s, 1.10 s, 1.60 s, 2.10 s.]
101 Conclusion. We can use either the M-projection or the I-projection for policy search. The I-projection also uses the information in bad samples, which are neglected by the M-projection! It can therefore handle multi-modal distributions and non-concave reward functions with ease. However, it is computationally quite demanding; more efficient methods to calculate the I-projection are needed. Is there still a big difference for more complex model distributions?
102 The end Thanks for your attention!
103 Bibliography I
Atkeson, Chris G., Moore, Andrew W., & Schaal, Stefan. Locally Weighted Learning. Artificial Intelligence Review, 11.
de Boer, Pieter-Tjerk, Kroese, Dirk, Mannor, Shie, & Rubinstein, Reuven. A Tutorial on the Cross-Entropy Method. Annals of Operations Research, 134(1).
Ernst, D., Geurts, P., & Wehenkel, L. Tree-Based Batch Mode Reinforcement Learning. Journal of Machine Learning Research, 6.
104 Bibliography II
Ernst, Damien, Geurts, Pierre, & Wehenkel, Louis. Iteratively Extending Time Horizon Reinforcement Learning. In: European Conference on Machine Learning (ECML).
Heidrich-Meisner, V., & Igel, C. Neuroevolution Strategies for Episodic Reinforcement Learning. Journal of Algorithms, 64(4).
105 Bibliography III
Ijspeert, Auke Jan, & Schaal, Stefan. Learning Attractor Landscapes for Learning Motor Primitives. In: Advances in Neural Information Processing Systems 15. Cambridge, MA: MIT Press.
Kober, J., & Peters, J. Policy Search for Motor Primitives in Robotics. Machine Learning Journal, online first.
Kober, Jens, Oztop, Erhan, & Peters, Jan. Reinforcement Learning to Adjust Robot Movements to New Situations. In: Proceedings of the 2010 Robotics: Science and Systems Conference (RSS 2010).
106 Bibliography IV
Kolter, Z., & Ng, A. Task-Space Trajectories via Cubic Spline Optimization. In: Proceedings of the 2009 IEEE International Conference on Robotics and Automation (ICRA 2009). Piscataway, NJ, USA: IEEE Press.
Neumann, G., & Peters, J. Fitted Q-Iteration by Advantage Weighted Regression. In: Advances in Neural Information Processing Systems 22 (NIPS 2008). Cambridge, MA: MIT Press.
107 Bibliography V
Peters, J., & Schaal, S. Policy Gradient Methods for Robotics. In: Proceedings of the IEEE International Conference on Intelligent Robots and Systems (IROS).
Peters, J., & Schaal, S. Reinforcement Learning by Reward-Weighted Regression for Operational Space Control. In: Proceedings of the International Conference on Machine Learning (ICML).
108 Bibliography VI
Riedmiller, M. Neural Fitted Q-Iteration - First Experiences with a Data Efficient Neural Reinforcement Learning Method. In: Proceedings of the European Conference on Machine Learning (ECML).
Sutton, Richard, Precup, Doina, & Singh, Satinder. Between MDPs and Semi-MDPs: A Framework for Temporal Abstraction in Reinforcement Learning. Artificial Intelligence, 112.
109 Bibliography VII
Theodorou, E., Buchli, J., & Schaal, S. Reinforcement Learning of Motor Skills in High Dimensions: A Path Integral Approach. In: Proceedings of the 2010 IEEE International Conference on Robotics and Automation (ICRA).
Vlassis, Nikos, Toussaint, Marc, Kontes, Georgios, & Piperidis, Savas. Learning Model-Free Robot Control by a Monte Carlo EM Algorithm. Autonomous Robots, 27(2).
110 Bibliography VIII
Williams, Ronald J. Simple Statistical Gradient-Following Algorithms for Connectionist Reinforcement Learning. Machine Learning.