Machine Learning on Physical Robots
Peter Stone
Alfred P. Sloan Research Fellow
Department of Computer Sciences
The University of Texas at Austin
Research Question
To what degree can autonomous intelligent agents learn in the presence of teammates and/or adversaries in real-time, dynamic domains?
Autonomous agents; multiagent systems; machine learning; robotics
Autonomous Intelligent Agents
- They must sense their environment.
- They must decide what action to take ("think").
- They must act in their environment.
Complete intelligent agents also:
- Interact with other agents (multiagent systems)
- Improve performance from experience (learning agents)
Example domains: autonomous bidding, cognitive systems, traffic management, robot soccer
RoboCup
Goal: by the year 2050, a team of humanoid robots that can beat the human World Cup champion team.
An international research initiative with several different leagues.
Drives research in many areas: control algorithms; machine vision and sensing; localization; distributed computing; real-time systems; ad hoc networking; mechanical design; multiagent systems; machine learning; robotics.
RoboCup Soccer
Sony Aibo (ERS-210A, ERS-7)
Creating a Team: Subtasks
- Vision
- Localization
- Walking
- Ball manipulation (kicking)
- Individual decision making
- Communication/coordination
Competitions
- Barely closed the loop by the American Open (May 2003)
- Improved significantly by international RoboCup (July 2003)
- Won 3rd place at the US Open (2004, 2005)
- Quarterfinalist at RoboCup (2004, 2005)
[Video highlights: many saves; lots of goals against CMU, Penn, and Germany; a nice clear; a counterattack goal.]
Post-Competition: The Research
- Model-based joint control [Stronger, S, 04]
- Learning sensor and action models [Stronger, S, 06]
- Machine learning for fast walking [Kohl, S, 04]
- Learning to acquire the ball [Fidelman, S, 06]
- Color constancy on mobile robots [Sridharan, S, 04]
- Robust particle filter localization [Sridharan, Kuhlmann, S, 05]
- Autonomous color learning [Sridharan, S, 05]
Learned Actuator/Sensor Models
- Mobile robots rely on models of their actions and sensors.
- Typically tuned manually: time-consuming.
- Autonomous Sensor and Actuator Model Induction (ASAMI)
- ASAMI is autonomous: no external feedback (developmental robotics).
- Implemented and validated on the Aibo ERS-7.
The Task
Sensor model: maps sensor input (beacon height in the image) to distance. The initial mapping, motivated by camera specs, is not accurate.
Action model: parametrized walking W(x) maps an action command to a velocity. W(0) steps in place; W(-300) has velocity -300; W(300) has velocity 300.
Experimental Setup
- Aibo alternates walking forwards and backwards.
- Forward phase: random action in [0, 300]; backward phase: random action in [-300, 0].
- Switches direction based on beacon size in the image.
- Aibo keeps itself pointed at the beacon.
Learning Action and Sensor Models
Both models provide information about the robot's location:
- Sensor model: observation obs_k gives a location estimate x_s(t_k) = S(obs_k)
- Action model: action command C(t) gives a velocity, so x_a(t) = x(0) + ∫_0^t A(C(s)) ds
Goal: learn arbitrary continuous functions A and S.
- Polynomial regression is used as the function approximator.
- The models are learned in arbitrary units.
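The two location estimates above can be sketched as follows. This is an illustrative reconstruction, not the original ASAMI code: function and variable names are assumptions, and the integral is approximated by a discrete sum over fixed time steps.

```python
import numpy as np

def sensor_location(obs, sensor_coeffs):
    """Sensor-based location estimate: x_s(t_k) = S(obs_k),
    with S represented as a polynomial in the observation."""
    return np.polyval(sensor_coeffs, obs)

def action_location(commands, dt, action_coeffs, x0=0.0):
    """Action-based location estimate: x_a(t) = x(0) + integral of
    A(C(s)) ds, approximated by summing per-step velocities."""
    velocities = np.polyval(action_coeffs, commands)
    return x0 + np.cumsum(velocities) * dt
```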
Learning a Sensor Model
- Assume an accurate action model.
- Plot x_a(t) against beacon height in the image.
- The best-fit polynomial is the learned sensor model.
[Figure: x_a(t) vs. beacon height; forward-walking and backward-walking observations with the best-fit cubic through all the data.]
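A minimal sketch of the best-fit step, using synthetic data in place of real beacon observations (the "true" cubic and the noise level are invented for illustration):

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic stand-ins: a "true" cubic sensor model and noisy
# action-model location estimates x_a at each observed beacon height.
true_coeffs = np.array([0.001, -0.15, -1.0, 400.0])
heights = np.linspace(10.0, 60.0, 50)
x_a = np.polyval(true_coeffs, heights) + rng.normal(0.0, 1.0, 50)

# The learned sensor model is the best-fit cubic through (height, x_a).
sensor_coeffs = np.polyfit(heights, x_a, deg=3)
learned = np.polyval(sensor_coeffs, heights)
```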
Learning an Action Model
- Assume an accurate sensor model.
- Plot x_s(t) against time.
- Compute the action model that minimizes the error; the problem is equivalent to another polynomial regression.
[Figure: x_s(t) observations over time, with the learned action model's trajectory.]
Learning Both Simultaneously
- Both models improve via bootstrapping.
- Maintain two notions of location, x_s(t) and x_a(t); each is used to fit the other model.
- Use weighted regression with weights w_i = γ^(n-i), γ < 1; the fits can still be computed incrementally.
- Ramping up: [diagram phasing in A_0, S_t, and A_t at t = 0, t = t_start, and t = 2·t_start]
Over 2.5 minutes, x_s(t) and x_a(t) converge. [Figure: x(t) estimates vs. time.]
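The incremental weighted regression can be sketched with exponentially decayed sufficient statistics; decaying them by γ on each new point yields exactly the weights w_i = γ^(n-i). This is a hypothetical implementation (class and method names are mine, not ASAMI's):

```python
import numpy as np

class DecayedPolyRegression:
    """Polynomial regression with weights w_i = gamma**(n-i),
    maintained incrementally via decayed sufficient statistics."""

    def __init__(self, degree, gamma=0.99):
        d = degree + 1
        self.degree = degree
        self.gamma = gamma
        self.M = np.zeros((d, d))  # decayed sum of phi(x) phi(x)^T
        self.b = np.zeros(d)       # decayed sum of phi(x) * y

    def add(self, x, y):
        phi = x ** np.arange(self.degree + 1)
        self.M = self.gamma * self.M + np.outer(phi, phi)
        self.b = self.gamma * self.b + phi * y

    def coeffs(self):
        # lstsq tolerates the singular M before enough points arrive.
        return np.linalg.lstsq(self.M, self.b, rcond=None)[0]
```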
Experimental Results
- Run ASAMI for a pre-set amount of time (2.5 minutes).
- Measure the actual models with stopwatch and ruler.
- Compare measured vs. learned models after best scaling.
[Figure: measured vs. learned sensor model (distance vs. beacon height) and action model (velocity vs. action command).]
[Figure: average error of the learned action and sensor models over 15 runs (mm and mm/s) vs. time.]
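The "best scaling" step has a closed form: since the learned models are in arbitrary units, find the single scale factor minimizing squared error against the measured model. A least-squares sketch (the example data values are made up):

```python
import numpy as np

def best_scale(learned, measured):
    """Scale s minimizing sum((s * learned - measured)**2), i.e. the
    least-squares solution s = <learned, measured> / <learned, learned>."""
    learned = np.asarray(learned, dtype=float)
    measured = np.asarray(measured, dtype=float)
    return float(learned @ measured / (learned @ learned))

# Example: learned model off by a uniform factor of 10.
scale = best_scale([1.0, 2.0, 3.0], [10.0, 20.0, 30.0])
```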
Summary
- ASAMI is autonomous: no external feedback.
- Computationally efficient.
- Starts with a poor action model and no sensor model.
- Learns accurate approximations to both models.
- The models are to scale with each other.
Outline
- Learning sensor and action models [Stronger, S, 06]
- Machine learning for fast walking [Kohl, S, 04]
- Learning to acquire the ball [Fidelman, S, 06]
- Color constancy on mobile robots [Sridharan, S, 04]
- Autonomous color learning [Sridharan, S, 06]
Policy Gradient RL to Learn a Fast Walk
- Goal: enable an Aibo to walk as fast as possible.
- Start with a parameterized walk; learn the fastest possible parameters.
- No simulator available: learn entirely on robots, with minimal human intervention.
Walking Aibos
- The walks that come with the Aibo are slow.
- RoboCup soccer (25+ Aibo teams internationally) motivates faster walks.
[Table: speeds of hand-tuned gaits (2003: German Team, UT Austin Villa, UNSW) vs. learned gaits (Hornby et al. 1999; Kim & Uther 2003); one entry reads 230 mm/s (±5), remaining values not recovered.]
A Parameterized Walk
- Developed from scratch as part of UT Austin Villa 2003.
- Trot gait with an elliptical locus on each leg.
Locus parameters (12 continuous parameters in all):
- Ellipse length
- Ellipse height
- Position on the x axis
- Position on the y axis
- Body height
- Timing values
Hand tuning by April 2003: 140 mm/s. Hand tuning by July 2003: 245 mm/s.
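As a sketch of how an elliptical locus turns such parameters into foot targets: the foot sweeps an ellipse once per step cycle, with lift clipped at ground level so part of the cycle is stance. Parameter names and this stance/flight convention are assumptions for illustration, not the actual UT Austin Villa gait code.

```python
import math

def locus_point(phase, ellipse_length, ellipse_height, x_pos, body_height):
    """Foot target on an elliptical locus for one leg, phase in [0, 1)."""
    angle = 2.0 * math.pi * phase
    # Fore-aft position sweeps the full ellipse width around x_pos.
    x = x_pos + (ellipse_length / 2.0) * math.cos(angle)
    # Vertical lift is clipped at the ground plane (z = -body_height).
    z = -body_height + max(0.0, (ellipse_height / 2.0) * math.sin(angle))
    return x, z
```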
Experimental Setup
Policy π = {θ_1, ..., θ_12}; V(π) = walk speed when using π.
Training scenario:
- Robots time themselves traversing a fixed distance.
- Multiple traversals (3) per policy to account for noise.
- Multiple robots evaluate policies simultaneously; an off-board computer collects results and assigns policies.
- No human intervention except battery changes.
Policy Gradient RL
- From π, we want to move in the direction of the gradient of V(π).
- Can't compute ∂V(π)/∂θ_i directly: estimate it empirically.
- Evaluate neighboring policies to estimate the gradient; each trial randomly varies every parameter.
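The gradient-estimation loop can be sketched as below, following the scheme on the slide: perturb every parameter in each trial policy, group the trial scores by each parameter's perturbation, and step along the estimate. The objective here is a synthetic stand-in; on the robot, V(π) is the measured walk speed, and the step-size constants are invented for illustration.

```python
import random

def evaluate(policy):
    # Synthetic stand-in for a timed traversal; peak at all thetas = 1.
    return -sum((theta - 1.0) ** 2 for theta in policy)

def policy_gradient_step(policy, epsilon=0.1, eta=0.5, n_trials=30):
    """One learning iteration: sample trial policies with every parameter
    perturbed by -epsilon, 0, or +epsilon, estimate each gradient
    component from the grouped scores, then take a fixed-size step."""
    trials = []
    for _ in range(n_trials):
        deltas = [random.choice((-epsilon, 0.0, epsilon)) for _ in policy]
        score = evaluate([t + d for t, d in zip(policy, deltas)])
        trials.append((deltas, score))

    gradient = []
    for i in range(len(policy)):
        plus = [s for d, s in trials if d[i] > 0]
        minus = [s for d, s in trials if d[i] < 0]
        if plus and minus:
            gradient.append(sum(plus) / len(plus) - sum(minus) / len(minus))
        else:
            gradient.append(0.0)

    # Normalize so each iteration moves a fixed distance eta.
    norm = sum(g * g for g in gradient) ** 0.5 or 1.0
    return [t + eta * g / norm for t, g in zip(policy, gradient)]
```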
Experiments
- Started from a stable but fairly slow gait.
- Used 3 robots simultaneously.
- Each iteration takes 45 traversals, minutes.
- 24 iterations = 1080 field traversals, about 3 hours.
[Videos: before learning, after learning.]
Results
[Figure: velocity (mm/s) of the learned UT Austin Villa gait during training vs. number of iterations, compared with the learned UNSW gait and the hand-tuned UNSW, UT Austin Villa, and German Team gaits.]
- Additional iterations didn't help.
- Spikes: evaluation noise? Large step size?
Learned Parameters
[Table: initial value, ε, and best value for each parameter — front ellipse (height, x offset, y offset); rear ellipse (height, x offset, y offset); ellipse length; ellipse skew multiplier; front height; rear height; time to move through locus; time on ground. Numeric values not recovered.]
Algorithmic Comparison, Robot Port
[Videos: before learning, after learning.]
Summary
- Used policy gradient RL to learn the fastest Aibo walk.
- All learning done on real robots.
- No human intervention (except battery changes).
Outline
- Learning sensor and action models [Stronger, S, 06]
- Machine learning for fast walking [Kohl, S, 04]
- Learning to acquire the ball [Fidelman, S, 06]
- Color constancy on mobile robots [Sridharan, S, 05]
- Autonomous color learning [Sridharan, S, 06]
Grasping the Ball
- Three stages: walk to the ball; slow down; lower the chin.
- Head proprioception and the IR chest sensor give ball distance.
- Movement specified by 4 parameters.
- Brittle!
Parameterization
- slowdown dist: when to slow down
- slowdown factor: how much to slow down
- capture angle: when to stop turning
- capture dist: when to put down the head
Learning the Chin Pinch
- Binary, noisy reinforcement signal: multiple trials per policy.
- The robot evaluates itself: no human intervention.
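Scoring a policy against a binary, noisy signal reduces to averaging repeated trials. A hypothetical sketch, in which `true_success_rate` stands in for the robot actually attempting captures:

```python
import random

def evaluate_capture_policy(true_success_rate, n_trials=15, seed=0):
    """Fraction of successful captures over n_trials Bernoulli attempts;
    averaging tames the noise in the individual binary outcomes."""
    rng = random.Random(seed)
    successes = sum(rng.random() < true_success_rate for _ in range(n_trials))
    return successes / n_trials
```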
Results
[Figure: successful captures out of 100 trials vs. training iterations for policy gradient, amoeba (downhill simplex), and hill climbing.]
What It Learned

Policy          | slowdown dist | slowdown factor | capture angle | capture dist | Success rate
Initial         | 200 mm        |                 | °             | 110 mm       | 36%
Policy gradient | 125 mm        |                 | °             | 152 mm       | 64%
Amoeba          | 208 mm        |                 | °             | 162 mm       | 69%
Hill climbing   | 240 mm        |                 | °             | 170 mm       | 66%

(slowdown factor and capture angle values not recovered)
Instance of Layered Learning
For domains too complex for tractably mapping state features S directly to outputs O:
- A hierarchical subtask decomposition is given: {L_1, L_2, ..., L_n}.
- Machine learning: exploit data to train and adapt.
- Learning in one layer feeds into the next layer.
[Diagram: hierarchy from the environment and world state through individual, multi-agent, team, and adversarial behaviors up to high-level goals, with machine learning opportunities at each layer.]
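The feed-forward structure can be sketched abstractly: each layer is trained on the signal produced by the layer below, and its output becomes the next layer's input. The layer names mirror the slide's diagram; the "training" here is a placeholder, not a real learner.

```python
def train_layer(name, training_signal):
    """Placeholder learner: returns a 'model' that tags its input.
    A real layer (e.g. the learned walk) would be trained here on
    the output of the layer below."""
    return lambda x: f"{name}({x})"

def layered_learning(layer_names, raw_input):
    models, signal = [], raw_input
    for name in layer_names:
        model = train_layer(name, signal)  # trained on the lower layer's output
        signal = model(signal)             # feeds into the next layer
        models.append(model)
    return models, signal

layers = ["individual", "multiagent", "team", "adversarial"]
_, top_level = layered_learning(layers, "world_state")
```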
Outline
- Learning sensor and action models [Stronger, S, 06]
- Machine learning for fast walking [Kohl, S, 04]
- Learning to acquire the ball [Fidelman, S, 06]
- Color constancy on mobile robots [Sridharan, S, 05]
- Autonomous color learning [Sridharan, S, 06]
More informationA Comparison of Two Approaches for Vision and Self-Localization on a Mobile Robot
2007 IEEE International Conference on Robotics and Automation Roma, Italy, 10-14 April 2007 FrC1.3 A Comparison of Two Approaches for Vision and Self-Localization on a Mobile Robot Daniel Stronger and
More informationSelective Visual Attention for Object Detection on a Legged Robot
In Gerhard Lakemeyer, Elizabeth Sklar, Domenico Sorenti, and Tomoichi Takahashi, editors, RoboCup-2006, Springer Verlag, 2007. Selective Visual Attention for Object Detection on a Legged Robot Daniel Stronger
More informationACE Project Report. December 10, Reid Simmons, Sanjiv Singh Robotics Institute Carnegie Mellon University
ACE Project Report December 10, 2007 Reid Simmons, Sanjiv Singh Robotics Institute Carnegie Mellon University 1. Introduction This report covers the period from September 20, 2007 through December 10,
More informationHierarchical Reinforcement Learning for Robot Navigation
Hierarchical Reinforcement Learning for Robot Navigation B. Bischoff 1, D. Nguyen-Tuong 1,I-H.Lee 1, F. Streichert 1 and A. Knoll 2 1- Robert Bosch GmbH - Corporate Research Robert-Bosch-Str. 2, 71701
More informationUsing Layered Color Precision for a Self-Calibrating Vision System
ROBOCUP2004 SYMPOSIUM, Instituto Superior Técnico, Lisboa, Portugal, July 4-5, 2004. Using Layered Color Precision for a Self-Calibrating Vision System Matthias Jüngel Institut für Informatik, LFG Künstliche
More informationIntroduction to Mobile Robotics
Introduction to Mobile Robotics Gaussian Processes Wolfram Burgard Cyrill Stachniss Giorgio Grisetti Maren Bennewitz Christian Plagemann SS08, University of Freiburg, Department for Computer Science Announcement
More informationIntroduction. Chapter Overview
Chapter 1 Introduction The Hough Transform is an algorithm presented by Paul Hough in 1962 for the detection of features of a particular shape like lines or circles in digitalized images. In its classical
More informationApplying the Episodic Natural Actor-Critic Architecture to Motor Primitive Learning
Applying the Episodic Natural Actor-Critic Architecture to Motor Primitive Learning Jan Peters 1, Stefan Schaal 1 University of Southern California, Los Angeles CA 90089, USA Abstract. In this paper, we
More informationAn Efficient Need-Based Vision System in Variable Illumination Environment of Middle Size RoboCup
An Efficient Need-Based Vision System in Variable Illumination Environment of Middle Size RoboCup Mansour Jamzad and Abolfazal Keighobadi Lamjiri Sharif University of Technology Department of Computer
More informationNuBot Team Description Paper 2013
NuBot Team Description Paper 2013 Zhiwen Zeng, Dan Xiong, Qinghua Yu, Kaihong Huang, Shuai Cheng, Huimin Lu, Xiangke Wang, Hui Zhang, Xun Li, Zhiqiang Zheng College of Mechatronics and Automation, National
More informationOptimization of a two-link Robotic Manipulator
Optimization of a two-link Robotic Manipulator Zachary Renwick, Yalım Yıldırım April 22, 2016 Abstract Although robots are used in many processes in research and industry, they are generally not customized
More informationEstimation of Planar Surfaces in Noisy Range Images for the RoboCup Rescue Competition
Estimation of Planar Surfaces in Noisy Range Images for the RoboCup Rescue Competition Johannes Pellenz pellenz@uni-koblenz.de with Sarah Steinmetz and Dietrich Paulus Working Group Active Vision University
More informationLocal Search Methods. CS 188: Artificial Intelligence Fall Announcements. Hill Climbing. Hill Climbing Diagram. Today
CS 188: Artificial Intelligence Fall 2006 Lecture 5: Robot Motion Planning 9/14/2006 Local Search Methods Queue-based algorithms keep fallback options (backtracking) Local search: improve what you have
More informationAcquiring a Robust Case Base for the Robot Soccer Domain
Acquiring a Robust Case Base for the Robot Soccer Domain Raquel Ros, Josep Lluís Arcos IIIA - Artificial Intelligence Research Institute CSIC - Spanish Council for Scientific Research Campus UAB, 08193
More informationSimplified Walking: A New Way to Generate Flexible Biped Patterns
1 Simplified Walking: A New Way to Generate Flexible Biped Patterns Jinsu Liu 1, Xiaoping Chen 1 and Manuela Veloso 2 1 Computer Science Department, University of Science and Technology of China, Hefei,
More informationTowards a Calibration-Free Robot: The ACT Algorithm for Automatic Online Color Training
Towards a Calibration-Free Robot: The ACT Algorithm for Automatic Online Color Training Patrick Heinemann, Frank Sehnke, Felix Streichert, and Andreas Zell Wilhelm-Schickard-Institute, Department of Computer
More informationUPennalizers. RoboCup Standard Platform League Team Report 2010
UPennalizers RoboCup Standard Platform League Team Report 2010 Jordan Brindza, Levi Cai, Ashleigh Thomas, Ross Boczar, Alyin Caliskan, Alexandra Lee, Anirudha Majumdar, Roman Shor, Barry Scharfman, Dan
More information10703 Deep Reinforcement Learning and Control
10703 Deep Reinforcement Learning and Control Russ Salakhutdinov Machine Learning Department rsalakhu@cs.cmu.edu Policy Gradient I Used Materials Disclaimer: Much of the material and slides for this lecture
More informationThomas Bräunl EMBEDDED ROBOTICS. Mobile Robot Design and Applications with Embedded Systems. Second Edition. With 233 Figures and 24 Tables.
Thomas Bräunl EMBEDDED ROBOTICS Mobile Robot Design and Applications with Embedded Systems Second Edition With 233 Figures and 24 Tables Springer CONTENTS PART I: EMBEDDED SYSTEMS 1 Robots and Controllers
More informationProblem characteristics. Dynamic Optimization. Examples. Policies vs. Trajectories. Planning using dynamic optimization. Dynamic Optimization Issues
Problem characteristics Planning using dynamic optimization Chris Atkeson 2004 Want optimal plan, not just feasible plan We will minimize a cost function C(execution). Some examples: C() = c T (x T ) +
More informationAdaptive Motion Control: Dynamic Kick for a Humanoid Robot
Adaptive Motion Control: Dynamic Kick for a Humanoid Robot Yuan Xu and Heinrich Mellmann Institut für Informatik, LFG Künstliche Intelligenz Humboldt-Universität zu Berlin, Germany {xu,mellmann}@informatik.hu-berlin.de
More informationCSE 490R P3 - Model Learning and MPPI Due date: Sunday, Feb 28-11:59 PM
CSE 490R P3 - Model Learning and MPPI Due date: Sunday, Feb 28-11:59 PM 1 Introduction In this homework, we revisit the concept of local control for robot navigation, as well as integrate our local controller
More informationTheory of Robotics and Mechatronics
Theory of Robotics and Mechatronics Final Exam 19.12.2016 Question: 1 2 3 Total Points: 18 32 10 60 Score: Name: Legi-Nr: Department: Semester: Duration: 120 min 1 A4-sheet (double sided) of notes allowed
More informationControl Approaches for Walking and Running
DLR.de Chart 1 > Humanoids 2015 > Christian Ott > 02.11.2015 Control Approaches for Walking and Running Christian Ott, Johannes Englsberger German Aerospace Center (DLR) DLR.de Chart 2 > Humanoids 2015
More informationFast and Robust Edge-Based Localization in the Sony Four-Legged Robot League
Fast and Robust Edge-Based Localization in the Sony Four-Legged Robot League Thomas Röfer 1 and Matthias Jüngel 2 1 Bremer Institut für Sichere Systeme, Technologie-Zentrum Informatik, FB 3, Universität
More informationThe NAO Robot, a case of study Robotics Franchi Alessio Mauro
The NAO Robot, a case of study Robotics 2013-2014 Franchi Alessio Mauro alessiomauro.franchi@polimi.it Who am I? Franchi Alessio Mauro Master Degree in Computer Science Engineer at Politecnico of Milan
More information08 An Introduction to Dense Continuous Robotic Mapping
NAVARCH/EECS 568, ROB 530 - Winter 2018 08 An Introduction to Dense Continuous Robotic Mapping Maani Ghaffari March 14, 2018 Previously: Occupancy Grid Maps Pose SLAM graph and its associated dense occupancy
More informationarxiv: v1 [cs.cv] 2 May 2016
16-811 Math Fundamentals for Robotics Comparison of Optimization Methods in Optical Flow Estimation Final Report, Fall 2015 arxiv:1605.00572v1 [cs.cv] 2 May 2016 Contents Noranart Vesdapunt Master of Computer
More informationModeling and kinematics simulation of freestyle skiing robot
Acta Technica 62 No. 3A/2017, 321 334 c 2017 Institute of Thermomechanics CAS, v.v.i. Modeling and kinematics simulation of freestyle skiing robot Xiaohua Wu 1,3, Jian Yi 2 Abstract. Freestyle skiing robot
More informationData Mining Practical Machine Learning Tools and Techniques. Slides for Chapter 6 of Data Mining by I. H. Witten and E. Frank
Data Mining Practical Machine Learning Tools and Techniques Slides for Chapter 6 of Data Mining by I. H. Witten and E. Frank Implementation: Real machine learning schemes Decision trees Classification
More informationA Fuzzy Reinforcement Learning for a Ball Interception Problem
A Fuzzy Reinforcement Learning for a Ball Interception Problem Tomoharu Nakashima, Masayo Udo, and Hisao Ishibuchi Department of Industrial Engineering, Osaka Prefecture University Gakuen-cho 1-1, Sakai,
More informationRobot Mapping. TORO Gradient Descent for SLAM. Cyrill Stachniss
Robot Mapping TORO Gradient Descent for SLAM Cyrill Stachniss 1 Stochastic Gradient Descent Minimize the error individually for each constraint (decomposition of the problem into sub-problems) Solve one
More informationStrategies for simulating pedestrian navigation with multiple reinforcement learning agents
Strategies for simulating pedestrian navigation with multiple reinforcement learning agents Francisco Martinez-Gil, Miguel Lozano, Fernando Ferna ndez Presented by: Daniel Geschwender 9/29/2016 1 Overview
More informationShip Patrol: Multiagent Patrol under Complex Environmental Conditions. Noa Agmon, Daniel Urieli and Peter Stone The University of Texas at Austin
Ship Patrol: Multiagent Patrol under Complex Environmental Conditions Noa Agmon, Daniel Urieli and Peter Stone The University of Texas at Austin Ship Patrol How It All Began Outline o Multiagent patrol
More informationLocally Weighted Learning for Control. Alexander Skoglund Machine Learning Course AASS, June 2005
Locally Weighted Learning for Control Alexander Skoglund Machine Learning Course AASS, June 2005 Outline Locally Weighted Learning, Christopher G. Atkeson et. al. in Artificial Intelligence Review, 11:11-73,1997
More informationReal-Time Vision on a Mobile Robot Platform
Real-Time Vision on a Mobile Robot Platform Mohan Sridharan and Peter Stone Department of Computer Sciences The University of Texas at Austin Austin, TX, 78731 Under review - not for citation Abstract
More informationCSE 490R P1 - Localization using Particle Filters Due date: Sun, Jan 28-11:59 PM
CSE 490R P1 - Localization using Particle Filters Due date: Sun, Jan 28-11:59 PM 1 Introduction In this assignment you will implement a particle filter to localize your car within a known map. This will
More informationKouretes 2008 Team Description Paper Nao Standard Platform League
Kouretes 2008 Team Description Paper Nao Standard Platform League Team Kouretes Technical University of Crete (TUC), Chania 73100, Crete, Greece Abstract. Kouretes is one of the 16 teams participating
More informationPersonal Computing ENIAC Apple II. Research Novelty tech Pervasive tools
Decision:!(s)! a 2 Personal Computing ENIAC Apple II Laptop OLPC Research Novelty tech Pervasive tools 3 Personal Computing ENIAC Apple II Laptop OLPC Internet ARPAnet Mosaic Gmail YouTube Graphics Sketchpad
More informationBONE CONTROLLER ASSET VERSION 0.1 REV 1
Foreword Thank you for purchasing the Bone Controller! I m an independent developer and your feedback and support really means a lot to me. Please don t ever hesitate to contact me if you have a question,
More informationLast update: May 6, Robotics. CMSC 421: Chapter 25. CMSC 421: Chapter 25 1
Last update: May 6, 2010 Robotics CMSC 421: Chapter 25 CMSC 421: Chapter 25 1 A machine to perform tasks What is a robot? Some level of autonomy and flexibility, in some type of environment Sensory-motor
More informationAn Interactive Software Environment for Gait Generation and Control Design of Sony Legged Robots
An Interactive Software Environment for Gait Generation and Control Design of Sony Legged Robots Dragos Golubovic and Huosheng Hu Department of Computer Science, University of Essex, Colchester CO4 3SQ,
More information