Learning Motor Behaviors: Past & Present Work


1 Learning Motor Behaviors: Past & Present Work. Stefan Schaal, Computer Science & Neuroscience, University of Southern California, Los Angeles, & ATR Computational Neuroscience Laboratory, Kyoto, Japan

2 Joint Work With: Auke Ijspeert, Aaron D'Souza, Jun Nakanishi, Jan Peters, Michael Mistry, Dimitris Pongas

3 How are Motor Skills Generated? A Question Shared by Biological and Robotics Research. Movies from collaborations with C. Atkeson, S. Kotosaka, S. Vijayakumar

4 How are Motor Skills Generated? A Question Shared by Biological and Robotics Research. Unfortunately, each of these skills required manual generation of representations, control policies, and learning mechanisms. Movies from collaborations with C. Atkeson, S. Kotosaka, S. Vijayakumar

5 What Motor Behaviors Exist? (in increasing level of difficulty)
- Tracking tasks, e.g., tracing a figure-8 on a piece of paper
- Regulator tasks, e.g., balance control (pole balancing, biped balancing, helicopter hover)
- Discrete tasks, e.g., reach for a cup, tennis forehand, basketball shot
- Periodic tasks, e.g., legged locomotion, swimming, dancing
- Complex sequences and superpositions of the above, e.g., assembly tasks, emptying the dishwasher, playing tennis, almost every daily-life behavior

6 Learning Motor Behaviors: Control Policies. The General Goal of Motor Learning: Control Policies u(t) = p(x(t), t, α)
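To make this concrete, here is a minimal sketch (not from the talk) of such a parameterized policy in Python; the linear-feedback form and the parameters K and x_des are illustrative assumptions:

```python
import numpy as np

def policy(x, t, alpha):
    """u = p(x, t, alpha): here an assumed linear feedback law toward a desired state."""
    K, x_des = alpha["K"], alpha["x_des"]   # gains and target are the parameters alpha
    return K @ (x_des - x)                  # motor command u(t)

alpha = {"K": np.array([[2.0, 0.5]]), "x_des": np.array([1.0, 0.0])}
u = policy(np.array([0.0, 0.0]), t=0.0, alpha=alpha)  # -> array([2.0])
```

Learning a motor behavior then amounts to adjusting α so that the commands u(t) accomplish the task.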

7 How Are Control Policies Used in Robotics? Direct Control (Model-Free) vs. Indirect Control (Model-Based)

8 Approaches to Learning Motor Behaviors in Robotics: Past to Present
- Supervised Learning: direct inverse model learning, forward model learning, distal teacher, feedback error learning
- Reinforcement Learning: value function-based approaches, policy gradients
- Motor Primitives: schemas, basis behaviors, units of action, macros, options; parameterized policies
- Imitation Learning: learning a policy from observation, learning the task goal from observation (inverse RL), learning an initial strategy for self-improvement

9 Supervised Learning of Motor Behaviors. Given: a parameterized policy, a task goal, a measure of (signed) error. Usually applied to discrete tasks. Goal: learn a task-level controller that produces the right motor command for the given goal from all initial conditions.

10 Supervised Learning of Motor Behaviors. Approaches: learn task models; direct inverse learning; forward model learning & search; distal teacher (Jordan & Rumelhart); feedback error learning (Kawato). [Block diagram: an inverse model maps x_desired to a feedforward command u_ff; a feedback controller contributes u_fb; their sum Σ drives the robot, producing output y.]
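As a hedged illustration of the feedback error learning idea, the sketch below trains a linear inverse model online, using the feedback command u_fb as the teaching signal for the feedforward command u_ff; the first-order toy plant, features, gains, and learning rate are assumptions for illustration, not the talk's setup:

```python
import numpy as np

a_plant, dt = 2.0, 0.05        # assumed leaky first-order plant: xdot = -a*x + u
Kp, lr = 5.0, 0.02             # feedback gain and learning rate (assumed)

def features(x_des):           # simple features of the desired state
    return np.array([1.0, x_des])

w = np.zeros(2)                # inverse-model parameters, learned online
x, x_des = 0.0, 1.0
for _ in range(2000):
    u_ff = w @ features(x_des)               # feedforward command from inverse model
    u_fb = Kp * (x_des - x)                  # feedback command = the teaching signal
    x += dt * (-a_plant * x + u_ff + u_fb)   # composite control u_ff + u_fb drives the plant
    w += lr * u_fb * features(x_des)         # update driven by the feedback error
# after training, u_ff alone holds x near x_des (here w @ [1, 1] -> approx. 2.0)
```

As the inverse model improves, the feedback contribution shrinks toward zero, which is the hallmark of this scheme.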

11 Supervised Learning of Motor Behaviors. Example: Learning Devil-Sticking

12 Supervised Learning of Motor Behaviors. Example: Learning Pole Balancing

13 Approaches to Learning Motor Behaviors in Robotics
- Supervised Learning: direct inverse model learning, forward model learning, distal teacher, feedback error learning
- Reinforcement Learning: value function-based approaches, policy gradients
- Motor Primitives: schemas, basis behaviors, units of action, macros, options; parameterized policies
- Imitation Learning: learning a policy from observation, learning the task goal from observation (inverse RL), learning an initial strategy for self-improvement

14 Reinforcement Learning: Value Function Based. Q-Learning or SARSA: requires function approximation for the action value function; usually only discrete actions considered; only low-dimensional robotic systems, e.g., acrobot. Q^π(x,u) = E{ r_1 + γ r_2 + γ² r_3 + … | x_0 = x, u_0 = u } (Watkins; Sutton)
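A minimal sketch of the Q-learning update on a toy discrete chain MDP; the environment, exploration rate, and table representation are illustrative assumptions (real motor systems need continuous function approximation, which is exactly the difficulty noted above):

```python
import numpy as np

n_states, n_actions, gamma, lr = 5, 2, 0.9, 0.1
Q = np.zeros((n_states, n_actions))            # tabular action value function

def step(s, a):                                # toy MDP: action 1 moves right,
    s2 = min(s + 1, n_states - 1) if a == 1 else max(s - 1, 0)
    return s2, float(s2 == n_states - 1)       # reward at the rightmost state

for episode in range(500):
    s = 0
    for _ in range(20):
        # epsilon-greedy exploration
        a = np.random.randint(n_actions) if np.random.rand() < 0.1 else int(Q[s].argmax())
        s2, r = step(s, a)
        # Q-learning target: r + gamma * max_u' Q(x', u')
        Q[s, a] += lr * (r + gamma * Q[s2].max() - Q[s, a])
        s = s2
```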

15 Reinforcement Learning: Value Function Based. RL in Continuous Time and Space: continuous version of actor-critic systems; closed-form solution for the optimal action for motor systems of the form ẋ = f(x) + g(x)u, i.e., u* ∝ g(x)^T ∂V(x)/∂x; particularly useful for model-based RL. V^π(x) = E{ r_1 + γ r_2 + γ² r_3 + … | x_0 = x } (Doya, Morimoto, Kimura)
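As a sketch of the closed-form action: assuming a quadratic control cost u^T R u (an assumption for this example), maximizing the Hamiltonian gives u* = (1/2) R^{-1} g(x)^T ∂V/∂x. The dynamics and value gradient below are illustrative placeholders:

```python
import numpy as np

def greedy_action(x, g, dV_dx, R_inv):
    """u* = (1/2) R^{-1} g(x)^T dV/dx, for reward rate r(x) - u^T R u (assumed)."""
    return 0.5 * R_inv @ g(x).T @ dV_dx(x)

g = lambda x: np.array([[0.0], [1.0]])     # control enters through the velocity state
dV_dx = lambda x: -2.0 * x                 # gradient of an assumed quadratic value
u = greedy_action(np.array([0.2, -0.1]), g, dV_dx, np.array([[1.0]]))  # -> [0.1]
```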

16 Reinforcement Learning: Value Function Based RL in Continuous Time and Space: Example

17 Reinforcement Learning: Policy Gradients. Motivation for Policy Gradients:
- value function approximation is too hard in complex motor systems, thus avoid value functions
- smooth policy improvement instead of greedy jumps
- even useful for hidden-state systems
- useful for parsimoniously parameterized policies
E.g., ∂J(θ)/∂θ = ∫_X d^π(x) ∫_U (∂π(u|x)/∂θ) (Q^π(x,u) − b(x)) du dx, with update θ ← θ + α ∂J(θ)/∂θ. Note that policy gradients can only achieve local optimization.
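A minimal REINFORCE-style sketch of this gradient, estimated from sampled rollouts with a baseline b to reduce variance; the one-step Gaussian-policy task is an illustrative assumption:

```python
import numpy as np

theta, sigma, alpha = 0.0, 0.5, 0.05
for it in range(200):
    grads, returns = [], []
    for _ in range(10):                        # sample rollouts from pi_theta
        u = theta + sigma * np.random.randn()  # Gaussian policy pi(u) = N(theta, sigma^2)
        R = -(u - 2.0) ** 2                    # reward: act close to 2.0
        grads.append((u - theta) / sigma**2)   # d log pi(u) / d theta
        returns.append(R)
    b = np.mean(returns)                       # baseline b(x)
    grad_J = np.mean([g * (R - b) for g, R in zip(grads, returns)])
    theta += alpha * grad_J                    # smooth, local policy improvement
```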

18 Reinforcement Learning: Policy Gradients. Examples: robot peg-in-hole insertion (Gullapalli), tuning biped locomotion (Benbrahim & Franklin; Tedrake). More results available, e.g., see Andrew Ng, Drew Bagnell, etc.

19 Approaches to Learning Motor Behaviors in Robotics
- Supervised Learning: direct inverse model learning, forward model learning, distal teacher, feedback error learning
- Reinforcement Learning: value function-based approaches, policy gradients
- Motor Primitives: schemas, basis behaviors, units of action, macros, options; parameterized policies
- Imitation Learning: learning a policy from observation, learning the task goal from observation (inverse RL), learning an initial strategy for self-improvement

20 Motor Primitives. Motivation 1: Divide & Conquer. Motivation 2: Suitable Parameterization. u(t) = p(x(t), t, α)

21 What is a Good Motor Primitive? From the view of biological research, previous suggestions included:
- Organizational principles: 2/3 power law, piecewise planarity, speed-accuracy tradeoff
- Optimization of energy, jerk, torque change, motor command change, task variance, stochastic feedback control, effort, etc.
- Equilibrium point/trajectory hypotheses
- VITE model of trajectory planning
- Force fields
- Pattern generators and dynamical systems theory, focusing mostly on coupling phenomena (e.g., inter-limb, perception-action, intra-limb) and the necessary interaction of control and musculoskeletal dynamics
- Contraction theory: a version of control theory for modular control
- and many more

22 What is a Good Motor Primitive? From the view of machine learning/robotics, previous suggestions included:
- handcrafted basis behaviors that have some level of generality, e.g., flocking, dispersing, door finding, object pick-up, closed-loop policies, etc.
- automatic regular coarse partitioning of the world, e.g., a very coarse grid, potentially with hidden state
- automatic detection of basis behaviors from examining the statistics of the world, e.g., states with drastic changes of value gradients, states that are common on successful trials, etc.

23 Movement Primitives as Attractor Systems. Note the similarity between a generic control policy u(t) = p(x(t), t, α) and nonlinear differential equations u(t) = ẋ_desired(t) = p(x_desired(t), goal, α). This view creates a natural distinction between two major movement classes: Rhythmic Movement and Discrete Movement.

24 Rhythmic & Discrete Movement Representation in the Brain. [Brain imaging figure: discrete-vs-rhythmic and rhythmic-vs-discrete contrasts involving areas PMdr, M1, S1, BA40, BA7, BA44, BA47.] Joint work with Dagmar Sternad, Rieko Osu, and Mitsuo Kawato. Nature Neuroscience 7, 2004

25 Movement Primitives as Attractor Systems: Goals. ẋ = f(x, goal). A class of dynamic systems that can code:
- point-to-point and periodic behavior as their attractor
- multi-dimensional systems that require phase locking
- attractors that have rather complex shape (e.g., complex phase relationships, movement reversals)
- learning and optimization
- coupling phenomena
- timing (without requiring explicit time)
- generalization (structural equivalence for parameter changes)
- robustness to disturbances and interactions with the environment
- stability guarantees

26 A Dynamic Systems Model for Discrete Movement. A learnable nonlinear point attractor with guaranteed stability properties. Behavioral phase: v̇ = α_v(β_v(g − x) − v), ẋ = α_x v. Nonlinear function f(x, v). Trajectory plan dynamics: ż = α_z(β_z(g − y) − z), ẏ = α_y(f(x, v) + z).

27 A Dynamic Systems Model for Discrete Movement. Use Gaussian basis functions to build a nonlinear learning system. Canonical dynamics: v̇ = α_v(β_v(g − x) − v), ẋ = α_x v. Trajectory plan dynamics: ż = α_z(β_z(g − y) − z), ẏ = α_y(f(x, v) + z). Local linear model approximation: f(x, v) = (Σ_{i=1..k} w_i b_i v) / (Σ_{i=1..k} w_i), where w_i = exp(−(1/2) d_i (x̃ − c_i)²) and x̃ = (x − x_0)/(g − x_0). Note: f is linear in the learning parameters b_i.
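A minimal numerical sketch of this system, under assumed constants (critically damped choices β = α/4, unit α_x and α_y) and assumed basis placements; a sketch of the equations above, not the talk's implementation:

```python
import numpy as np

def dmp_rollout(b, g, y0=0.0, dt=0.001, n_steps=1000,
                a_v=25.0, b_v=25.0/4, a_x=1.0, a_z=25.0, b_z=25.0/4, a_y=1.0):
    k = len(b)
    c = np.linspace(0.0, 1.0, k)              # basis centers in normalized phase
    d = np.full(k, 4.0 * k ** 2)              # basis widths (assumed)
    v, x, z, y = 0.0, y0, 0.0, y0
    ys = []
    for _ in range(n_steps):
        x_n = (x - y0) / (g - y0)             # normalized phase x~ (requires g != y0)
        w = np.exp(-0.5 * d * (x_n - c) ** 2) # Gaussian basis activations w_i
        f = (w * b).sum() * v / w.sum()       # f(x, v): linear in b, gated by v
        v += dt * a_v * (b_v * (g - x) - v)   # behavioral phase: point attractor
        x += dt * a_x * v                     #   at x = g, v = 0
        z += dt * a_z * (b_z * (g - y) - z)   # trajectory plan dynamics, forced
        y += dt * a_y * (z + f)               #   by f, converging to y = g
        ys.append(y)
    return np.array(ys)

traj = dmp_rollout(b=np.random.randn(10), g=1.0)  # a shaped reach toward g
```

Because f is gated by the phase velocity v, the forcing vanishes as the phase system converges, and the point attractor at g guarantees stability regardless of the weights b.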

28 An Example. [Figure: a fitted primitive shown as desired position, desired velocity, basis function activations in time, phase, and phase velocity.]

29 Extension to Periodic Systems. A learnable nonlinear limit cycle attractor with guaranteed stability properties. Behavioral phase: a phase oscillator with amplitude A, ṙ = α_r(A − r), φ̇ = ω. Nonlinear function f(r, φ). Trajectory plan dynamics: ż = α_z(β_z(g − y) − z), ẏ = α_y(f(r, φ) + z).
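A matching sketch for the periodic case; the von-Mises-like periodic basis functions and all constants are assumed concrete choices, not the talk's values:

```python
import numpy as np

def rhythmic_rollout(b, g=0.0, A=1.0, omega=2 * np.pi, dt=0.001, n_steps=3000,
                     a_r=5.0, a_z=25.0, b_z=25.0 / 4, a_y=1.0):
    k = len(b)
    c = np.linspace(0.0, 2 * np.pi, k, endpoint=False)  # centers around the cycle
    h = np.full(k, 10.0)                                # basis widths (assumed)
    r, phi, z, y = A, 0.0, 0.0, g
    ys = []
    for _ in range(n_steps):
        w = np.exp(h * (np.cos(phi - c) - 1.0))  # periodic (von Mises-like) bases
        f = (w * b).sum() * r / w.sum()          # f(r, phi), amplitude-gated
        r += dt * a_r * (A - r)                  # amplitude relaxes to A
        phi = (phi + dt * omega) % (2 * np.pi)   # phase advances at frequency omega
        z += dt * a_z * (b_z * (g - y) - z)      # anchor the oscillation around g
        y += dt * a_y * (z + f)
        ys.append(y)
    return np.array(ys)

cycle = rhythmic_rollout(b=np.random.randn(8))   # a periodic pattern around g
```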

30 Example: Policy Gradients with Movement Primitives Goal: Hit ball precisely Note: about 150 trials are needed.

31 Approaches to Learning Motor Behaviors in Robotics
- Supervised Learning: direct inverse model learning, forward model learning, distal teacher, feedback error learning
- Reinforcement Learning: value function-based approaches, policy gradients
- Motor Primitives: schemas, basis behaviors, units of action, macros, options; parameterized policies
- Imitation Learning: learning a policy from observation, learning the task goal from observation (inverse RL), learning an initial strategy for self-improvement

32 Imitation Learning. What can be learned from imitation? Control policies (assuming actions are observable), internal models, reward criteria (e.g., inverse reinforcement learning, Ng et al.), use of the demonstration as a soft constraint, value functions.

33 Imitation Learning: Example. Learning an internal model from demonstration

34 Imitation Learning: Example. Using the demonstrated behavior as a soft constraint

35 Imitation Learning with Motor Primitives. Given: a desired trajectory y_demo, ẏ_demo, ÿ_demo. Algorithm (canonical dynamics, trajectory plan dynamics, and basis functions as on slide 27):
- extract the movement duration and the movement goal g
- adjust the time constants of the canonical dynamics to the movement duration
- use locally weighted learning to solve the nonlinear function approximation problem with target y_target = ẏ_demo/α_y − z = f(x, v), where z can be calculated by integrating the differential equation with the desired trajectory information
Note: this is a one-shot learning problem, i.e., no iterations!
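A hedged sketch of this one-shot fit, reusing the constants of the discrete-primitive sketch above: integrate the phase and z dynamics along the demonstration, form y_target, and solve for each weight b_i in closed form by locally weighted regression:

```python
import numpy as np

def fit_primitive(y_demo, dt, k=10, a_v=25.0, b_v=25.0/4, a_x=1.0,
                  a_z=25.0, b_z=25.0/4, a_y=1.0):
    y0, g = y_demo[0], y_demo[-1]              # extract start and movement goal
    yd = np.gradient(y_demo, dt)               # demonstrated velocity
    c = np.linspace(0.0, 1.0, k)               # bases as in the rollout sketch;
    d = np.full(k, 4.0 * k ** 2)               #   duration assumed already matched
    v, x, z = 0.0, y0, 0.0
    V, W, F = [], [], []
    for y_t, yd_t in zip(y_demo, yd):
        x_n = (x - y0) / (g - y0)
        W.append(np.exp(-0.5 * d * (x_n - c) ** 2))
        V.append(v)
        F.append(yd_t / a_y - z)               # y_target from the slide
        v += dt * a_v * (b_v * (g - x) - v)    # integrate the phase dynamics
        x += dt * a_x * v
        z += dt * a_z * (b_z * (g - y_t) - z)  # integrate z with demo information
    V, W, F = np.array(V), np.array(W), np.array(F)
    # locally weighted regression: one closed-form solution per basis function
    b = (W * (V * F)[:, None]).sum(0) / ((W * (V ** 2)[:, None]).sum(0) + 1e-10)
    return b, g, y0

t = np.arange(0, 1, 0.001)
b, g, y0 = fit_primitive(np.sin(0.5 * np.pi * t), dt=0.001)  # smooth demo reach
```

Running the earlier dmp_rollout with the fitted b, g, and y0 should then approximately reproduce the demonstration, with no iterative optimization involved.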

36 Example: A Tennis Forehand as a Movement Primitive

37 Example: A Tennis Forehand as a Dynamic Primitive

38 Example: Various Rhythmic Movement Primitives

39 Example: Imitation Learning with Self-Improvement Goal: Hit ball precisely Note: about 150 trials are needed.

40 Movement Primitives for Planar Walking

41 Coupling of Mechanics and Control

42 Movement Primitives in Interaction with Sound

43 Discussion
- The amount of learning research in manipulator robotics is poor!
- Reinforcement learning in this domain is very hard!
- Finding good reward functions is hard!
- Policy gradients are of some use, at the cost of giving up global optimality and the discovery of new strategies.
- Imitation learning is great for initializing policies.
- Well-designed motor primitives can facilitate learning tremendously.
- But no autonomous learning framework yet...
