Learning Motor Behaviors: Past & Present Work
Stefan Schaal
Computer Science & Neuroscience, University of Southern California, Los Angeles
& ATR Computational Neuroscience Laboratory, Kyoto, Japan
sschaal@usc.edu, http://www-clmc.usc.edu
Joint Work With: Auke Ijspeert, Aaron D'Souza, Jun Nakanishi, Jan Peters, Michael Mistry, Dimitris Pongas
How are Motor Skills Generated? A Question Shared by Biological and Robotics Research Movies from collaborations with C. Atkeson, S. Kotosaka, S. Vijayakumar
How are Motor Skills Generated? A Question Shared by Biological and Robotics Research
Unfortunately, each of these skills required manual generation of representations, control policies, and learning mechanisms.
Movies from collaborations with C. Atkeson, S. Kotosaka, S. Vijayakumar
What Motor Behaviors Exist? (ordered by level of difficulty)
Tracking tasks: e.g., tracing a figure-8 on a piece of paper
Regulator tasks: e.g., balance control (pole balancing, biped balancing, helicopter hover)
Discrete tasks: e.g., reaching for a cup, a tennis forehand, a basketball shot
Periodic tasks: e.g., legged locomotion, swimming, dancing
Complex sequences and superpositions of the above: e.g., assembly tasks, emptying the dishwasher, playing tennis, almost every daily-life behavior
Learning Motor Behaviors: Control Policies
The General Goal of Motor Learning: Control Policies
$u(t) = p(x(t), t, \alpha)$
How Are Control Policies Used in Robotics?
Direct Control (Model-Free)
Indirect Control (Model-Based)
Approaches to Learning Motor Behaviors in Robotics: Past to Present
Supervised Learning: direct inverse model learning, forward model learning; distal teacher; feedback error learning
Reinforcement Learning: value function-based approaches; policy gradients
Motor Primitives: schemas, basis behaviors, units of actions, macros, options; parameterized policies
Imitation Learning: learning a policy from observation; learning the task goal from observation (inverse RL); learning an initial strategy for self-improvement
Supervised Learning of Motor Behaviors
Given: a parameterized policy, a task goal, and a measure of (signed) error; usually applied to discrete tasks
Goal: Learn a task-level controller that produces the right motor command for the given goal from all initial conditions.
Supervised Learning of Motor Behaviors
Approaches: Learn Task Models, Direct Inverse Learning, Forward Model Learning & Search, Distal Teacher (Jordan & Rumelhart), Feedback Error Learning (Kawato)
[Block diagram: the inverse model maps x_desired to a feedforward command u_ff, a feedback controller adds u_fb, and the summed command drives the robot, which outputs y]
A minimal sketch of the feedback error learning idea follows below.
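To make the feedback error learning idea concrete, here is a minimal sketch, assuming a 1-DOF system, a linear-in-features inverse model, and illustrative gains; the feature choice, gains, and learning rate are all assumptions, not Kawato's original implementation:

```python
import numpy as np

# Feedback error learning, minimal sketch: the feedback controller's
# output u_fb is used directly as the training error for the inverse
# model, so the feedforward command improves as tracking errors occur.

def features(x_des, xd_des, xdd_des):
    """Illustrative (hypothetical) feature vector for the inverse model."""
    return np.array([xdd_des, xd_des, np.sin(x_des), 1.0])

theta = np.zeros(4)           # inverse model parameters (learned online)
kp, kd, lr = 25.0, 5.0, 0.01  # PD gains and learning rate (assumed values)

def control_step(x, xd, x_des, xd_des, xdd_des):
    global theta
    phi = features(x_des, xd_des, xdd_des)
    u_ff = theta @ phi                             # feedforward command
    u_fb = kp * (x_des - x) + kd * (xd_des - xd)   # feedback command
    theta += lr * u_fb * phi                       # u_fb acts as the error signal
    return u_ff + u_fb                             # total motor command
```

As learning converges, the feedback contribution u_fb shrinks and the learned inverse model generates most of the motor command.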
Supervised Learning of Motor Behaviors
Example: Learning Devil Sticking
Supervised Learning of Motor Behaviors
Example: Learning Pole Balancing
Approaches to Learning Motor Behaviors in Robotics
Supervised Learning: direct inverse model learning, forward model learning; distal teacher; feedback error learning
Reinforcement Learning: value function-based approaches; policy gradients
Motor Primitives: schemas, basis behaviors, units of actions, macros, options; parameterized policies
Imitation Learning: learning a policy from observation; learning the task goal from observation (inverse RL); learning an initial strategy for self-improvement
Reinforcement Learning: Value Function Based
Q-Learning or SARSA (Watkins; Sutton)
requires function approximation for the action value function
usually only discrete actions are considered
only low-dimensional robotic systems, e.g., the acrobot
$Q^\pi(x,u) = E\{r_1 + \gamma r_2 + \gamma^2 r_3 + \cdots \mid x_0 = x, u_0 = u\}$
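For concreteness, a minimal tabular Q-learning sketch; the env object with reset()/step() is an assumed interface, and for the continuous robot problems mentioned above the table would be replaced by a function approximator over discretized actions:

```python
import numpy as np

# Tabular Q-learning (Watkins), minimal sketch with epsilon-greedy
# exploration; env is assumed to expose reset() -> state and
# step(action) -> (next_state, reward, done).

def q_learning(env, n_states, n_actions, episodes=500,
               alpha=0.1, gamma=0.95, eps=0.1):
    Q = np.zeros((n_states, n_actions))
    for _ in range(episodes):
        x = env.reset()
        done = False
        while not done:
            # epsilon-greedy action selection
            if np.random.rand() < eps:
                u = np.random.randint(n_actions)
            else:
                u = int(np.argmax(Q[x]))
            x_next, r, done = env.step(u)
            # TD(0) backup toward r + gamma * max_u' Q(x', u')
            Q[x, u] += alpha * (r + gamma * np.max(Q[x_next]) - Q[x, u])
            x = x_next
    return Q
```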
Reinforcement Learning: Value Function Based
RL in Continuous Time and Space (Doya, Morimoto, Kimura)
continuous version of actor-critic systems
closed-form solution for the optimal action for motor systems of the form $\dot{x} = f(x) + g(x)u$, namely $u^* \propto g(x)^\top \left(\frac{\partial V}{\partial x}\right)^\top$
particularly useful for model-based RL
$V^\pi(x) = E\{r_1 + \gamma r_2 + \gamma^2 r_3 + \cdots \mid x_0 = x\}$
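A sketch of where the closed-form action comes from, assuming the reward penalizes control quadratically, $r(x,u) = q(x) - \tfrac{c}{2}u^\top u$ (this reward structure and the cost weight $c$ are assumptions about one standard setting):

```latex
\text{For } \dot{x} = f(x) + g(x)\,u \ \text{ and } \ r(x,u) = q(x) - \tfrac{c}{2}\,u^\top u,
\text{ the optimality condition maximizes over } u:
\quad q(x) - \tfrac{c}{2}\,u^\top u + \frac{\partial V}{\partial x}\bigl(f(x) + g(x)\,u\bigr),
\text{ and setting the derivative with respect to } u \text{ to zero yields}
\quad u^* = \frac{1}{c}\, g(x)^\top \Bigl(\frac{\partial V}{\partial x}\Bigr)^{\!\top},
\text{ i.e., the optimal action is available in closed form once } V \text{ is known.}
```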
Reinforcement Learning: Value Function Based RL in Continuous Time and Space: Example
Reinforcement Learning: Policy Gradients
Motivation for Policy Gradients
value function approximation is too hard in complex motor systems, so avoid the value function
smooth policy improvement instead of greedy jumps
even useful for hidden-state systems
useful for parsimoniously parameterized policies
$\nabla_\theta J(\pi_\theta) = \int_X d^\pi(x) \int_U \nabla_\theta \pi(u \mid x) \left( Q^\pi(x,u) - b(x) \right) du \, dx, \qquad \theta_{new} = \theta + \alpha \nabla_\theta J(\pi)$
Note that policy gradients can only achieve local optimization.
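As a minimal illustration of the update above, a REINFORCE-style sketch with a Gaussian policy and a running-average baseline on a hypothetical one-step task (all constants and the toy reward are assumptions):

```python
import numpy as np

# Episodic policy gradient (REINFORCE with baseline) on a toy one-step
# problem: u ~ N(theta, sigma^2), reward r = -(u - u_star)^2. The
# score function grad_logp implements d/dtheta log pi(u | theta).

def reinforce(u_star=2.0, sigma=0.5, alpha=0.05, episodes=2000):
    theta, baseline = 0.0, 0.0
    for _ in range(episodes):
        u = theta + sigma * np.random.randn()         # sample an action
        r = -(u - u_star) ** 2                        # episode return
        grad_logp = (u - theta) / sigma ** 2          # score function
        theta += alpha * grad_logp * (r - baseline)   # gradient ascent step
        baseline += 0.1 * (r - baseline)              # running-average baseline
    return theta

print(reinforce())  # converges toward u_star, a local optimum in general
```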
Reinforcement Learning: Policy Gradients
Examples: robot peg-in-hole insertion (Gullapalli), tuning biped locomotion (Benbrahim & Franklin; Tedrake)
More results are available, e.g., from Andrew Ng, Drew Bagnell, etc.
Approaches to Learning Motor Behaviors in Robotics
Supervised Learning: direct inverse model learning, forward model learning; distal teacher; feedback error learning
Reinforcement Learning: value function-based approaches; policy gradients
Motor Primitives: schemas, basis behaviors, units of actions, macros, options; parameterized policies
Imitation Learning: learning a policy from observation; learning the task goal from observation (inverse RL); learning an initial strategy for self-improvement
Motor Primitives
Motivation 1: Divide & Conquer
Motivation 2: Suitable Parameterization
$u(t) = p(x(t), t, \alpha)$
What Is a Good Motor Primitive? From the view of biological research
Previous suggestions included:
Organizational principles: 2/3 power law, piecewise planarity, speed-accuracy tradeoff
Optimization of energy, jerk, torque change, motor command change, task variance, stochastic feedback control, effort, etc.
Equilibrium point/trajectory hypotheses
VITE model of trajectory planning
Force fields
Pattern generators and dynamical systems theory, focusing mostly on coupling phenomena (e.g., inter-limb, perception-action, intra-limb) and the necessary interaction of control and musculoskeletal dynamics
Contraction theory: a version of control theory for modular control
and many more
What Is a Good Motor Primitive? From the view of machine learning/robotics
Previous suggestions included:
handcrafted basis behaviors that have some level of generality, e.g., flocking, dispersing, door finding, object pick-up, closed-loop policies, etc.
automatic regular coarse partitioning of the world, e.g., a very coarse grid, potentially with hidden state
automatic detection of basis behaviors from examining the statistics of the world, e.g., states with drastic changes of value gradients, states that are common on successful trials, etc.
Movement Primitives as Attractor Systems
Note the similarity between a generic control policy $u(t) = p(x(t), t, \alpha)$ and nonlinear differential equations $u(t) = \dot{x}_{desired}(t) = p(x_{desired}(t), \text{goal}, \alpha)$
This view creates a natural distinction between two major movement classes:
Rhythmic Movement
Discrete Movement
Rhythmic & Discrete Movement Representation in the Brain
[fMRI contrasts of discrete vs. rhythmic movement, with activations including PMdr, M1/S1, BA40, BA7, BA44, BA47]
Joint work with Dagmar Sternad, Rieko Osu, and Mitsuo Kawato; Nature Neuroscience 7: 1137-1144, 2004
Movement Primitives as Attractor Systems: Goals
$\dot{x} = f(x, \text{goal})$
A class of dynamic systems that can code:
point-to-point and periodic behavior as their attractor
multi-dimensional systems that require phase locking
attractors that have rather complex shapes (e.g., complex phase relationships, movement reversals)
learning and optimization
coupling phenomena
timing (without requiring explicit time)
generalization (structural equivalence under parameter changes)
robustness to disturbances and interactions with the environment
stability guarantees
A Dynamic Systems Model for Discrete Movement
A learnable nonlinear point attractor with guaranteed stability properties
Behavioral phase: $\dot{v} = \alpha_v(\beta_v(g - x) - v)$, $\dot{x} = \alpha_x v$
Nonlinear function $f(x,v)$
Trajectory plan dynamics: $\dot{z} = \alpha_z(\beta_z(g - y) - z)$, $\dot{y} = \alpha_y(f(x,v) + z)$
A Dynamic Systems Model for Discrete Movement
Use Gaussian basis functions to build the nonlinear learning system (see the sketch below)
Canonical dynamics: $\dot{v} = \alpha_v(\beta_v(g - x) - v)$, $\dot{x} = \alpha_x v$
Trajectory plan dynamics: $\dot{z} = \alpha_z(\beta_z(g - y) - z)$, $\dot{y} = \alpha_y(f(x,v) + z)$
Local linear model approximation: $f(x,v) = \frac{\sum_{i=1}^{k} w_i b_i v}{\sum_{i=1}^{k} w_i}$, where $w_i = \exp\left(-\frac{1}{2} d_i (\tilde{x} - c_i)^2\right)$ and $\tilde{x} = \frac{x - x_0}{g - x_0}$
Note: $f$ is linear in the learning parameters $b_i$.
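Since the whole system is a set of first-order differential equations with $f$ linear in the parameters, it can be integrated directly; a minimal Euler-integration sketch of the discrete primitive, with commonly used critically damped gain choices assumed:

```python
import numpy as np

# Euler integration of the discrete movement primitive above. The gains
# follow a common critically damped choice; all constants are assumed,
# and g != x0 is required for the phase normalization.

def run_dmp(b, centers, widths, g=1.0, x0=0.0, dt=0.001, T=1.0,
            a_z=25.0, b_z=6.25, a_y=25.0, a_v=25.0, b_v=6.25, a_x=1.0):
    y, z = x0, 0.0   # trajectory plan dynamics state
    x, v = x0, 0.0   # canonical dynamics state
    Y = []
    for _ in range(int(T / dt)):
        x_tilde = (x - x0) / (g - x0)                 # normalized position
        w = np.exp(-0.5 * widths * (x_tilde - centers) ** 2)
        f = (w @ b) * v / (w.sum() + 1e-10)           # f(x,v), linear in b
        z += dt * a_z * (b_z * (g - y) - z)           # trajectory plan dynamics
        y += dt * a_y * (f + z)
        v += dt * a_v * (b_v * (g - x) - v)           # canonical dynamics
        x += dt * a_x * v
        Y.append(y)
    return np.array(Y)
```

Because $f$ is scaled by the canonical velocity $v$, which decays to zero at the goal, $y$ converges to $g$ regardless of the learned parameters.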
An Example
[Plots: desired position, desired velocity, basis functions in time, phase, and phase velocity]
Extension to Periodic Systems
A learnable nonlinear limit cycle attractor with guaranteed stability properties
Behavioral phase, a phase oscillator with amplitude $A$: $\dot{r} = \alpha_r(A - r)$, $\dot{\varphi} = \omega$
Nonlinear function $f(r,\varphi)$
Trajectory plan dynamics: $\dot{z} = \alpha_z(\beta_z(g - y) - z)$, $\dot{y} = \alpha_y(f(r,\varphi) + z)$
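The same integration scheme carries over to the periodic case; a brief sketch, where the periodic (von Mises style) basis functions over phase are an assumption about one reasonable choice:

```python
import numpy as np

# Sketch of the rhythmic primitive: a phase oscillator drives periodic
# basis functions, and the learned forcing term f(r, phi) shapes the
# limit cycle around the baseline g. All constants are assumed.

def run_rhythmic_dmp(b, centers, widths, g=0.0, A=1.0, omega=2 * np.pi,
                     dt=0.001, T=3.0, a_z=25.0, b_z=6.25, a_y=25.0, a_r=10.0):
    y, z, r, phi = g, 0.0, 0.0, 0.0
    Y = []
    for _ in range(int(T / dt)):
        w = np.exp(widths * (np.cos(phi - centers) - 1.0))  # periodic bases
        f = (w @ b) * r / (w.sum() + 1e-10)
        z += dt * a_z * (b_z * (g - y) - z)
        y += dt * a_y * (f + z)
        r += dt * a_r * (A - r)   # amplitude state converges to A
        phi += dt * omega         # phase advances at frequency omega
        Y.append(y)
    return np.array(Y)
```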
Example: Policy Gradients with Movement Primitives Goal: Hit ball precisely Note: about 150 trials are needed.
Approaches to Learning Motor Behaviors in Robotics
Supervised Learning: direct inverse model learning, forward model learning; distal teacher; feedback error learning
Reinforcement Learning: value function-based approaches; policy gradients
Motor Primitives: schemas, basis behaviors, units of actions, macros, options; parameterized policies
Imitation Learning: learning a policy from observation; learning the task goal from observation (inverse RL); learning an initial strategy for self-improvement
Imitation Learning
What can be learned from imitation?
control policies (assuming actions are observable)
internal models
reward criteria, e.g., inverse reinforcement learning (Ng et al.)
use of the demonstration as a soft constraint
value functions
Imitation Learning: Example
Learning an internal model from demonstration
Imitation Learning: Example
Using the demonstrated behavior as a soft constraint
Imitation Learning with Motor Primitives
Given: a desired trajectory $y_{demo}, \dot{y}_{demo}, \ddot{y}_{demo}$
Algorithm:
extract the movement duration and the movement goal $g$
adjust the time constants of the canonical dynamics to the movement duration
use Locally Weighted Learning to solve the nonlinear function approximation problem with target $y_{target} = \frac{\dot{y}_{demo}}{\alpha_y} - z = f(x,v)$, where $z$ can be calculated by integrating the differential equation with the desired trajectory information
Note: This is a one-shot learning problem, i.e., no iterations! (a sketch follows below)
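A minimal sketch of the one-shot fit, assuming the same Euler integration and constants as in the earlier sketches; the returned parameters b could be passed directly to run_dmp above:

```python
import numpy as np

# One-shot imitation: fit the primitive's parameters b_i to a demonstrated
# trajectory by locally weighted regression, with no iterations. z and the
# canonical states are obtained by integrating along the demonstration.

def fit_dmp(y_demo, yd_demo, dt, centers, widths,
            a_z=25.0, b_z=6.25, a_y=25.0, a_v=25.0, b_v=6.25, a_x=1.0):
    g, y0 = y_demo[-1], y_demo[0]   # movement goal and start from the demo
    n, k = len(y_demo), len(centers)
    z, v, x = 0.0, 0.0, y0
    F, V, W = np.zeros(n), np.zeros(n), np.zeros((n, k))
    for t in range(n):
        F[t] = yd_demo[t] / a_y - z        # target implied by dy = a_y*(f + z)
        V[t] = v
        x_tilde = (x - y0) / (g - y0)      # assumes g != y0
        W[t] = np.exp(-0.5 * widths * (x_tilde - centers) ** 2)
        z += dt * a_z * (b_z * (g - y_demo[t]) - z)   # integrate along demo
        v += dt * a_v * (b_v * (g - x) - v)           # canonical dynamics
        x += dt * a_x * v
    # weighted least squares per local model: f ~ b_i * v
    num = (W * (V * F)[:, None]).sum(axis=0)
    den = (W * (V ** 2)[:, None]).sum(axis=0) + 1e-10
    return num / den
```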
Example: A Tennis Forehand as a Movement Primitive
Example: A Tennis Forehand as a Dynamic Primitive
Example: Various Rhythmic Movement Primitives
Example: Imitation Learning with Self-Improvement Goal: Hit ball precisely Note: about 150 trials are needed.
Movement Primitives for Planar Walking
Coupling of Mechanics and Control
Movement Primitives in Interaction with Sound
Discussion
There is surprisingly little learning research in manipulator robotics!
Reinforcement learning in this domain is very hard, and finding good reward functions is hard!
Policy gradients are of some use, at the cost of giving up global optimality and the discovery of new strategies.
Imitation learning is great for initializing policies.
Well-designed motor primitives can facilitate learning tremendously.
But there is no autonomous learning framework yet...