Marco Wiering Intelligent Systems Group Utrecht University
1 Reinforcement Learning for Robot Control Marco Wiering Intelligent Systems Group Utrecht University
2 Introduction. Robots move in the physical environment to perform tasks. The environment is continuous and uncertain. Programming a robot controller is often difficult. Reinforcement learning methods are useful for learning controllers from trial and error. We will examine how reinforcement learning can be used for optimizing robot controllers.
3 Contents of this presentation: Robot Control Problems, Dynamic Programming, Reinforcement Learning, Reinforcement Learning with Function Approximators, Experimental Results, Discussion.
4 Robotics. A robot is an active, moving artificial agent that lives in the physical environment. We will concentrate on autonomous robots: robots that make their own decisions using feedback from their sensors. The problem with designing robot controllers is that the real world is highly demanding: (1) it cannot be perfectly and completely perceived, (2) it is non-deterministic, (3) it is dynamic, and (4) it is continuous.
5 Robot Applications. Robots are used for many different tasks: manufacturing (repetitive tasks on a production belt in the car and micro-electronics industry), the construction industry (e.g. shearing sheep wool), security and mail delivery in buildings, unmanned cars, underwater vehicles and unmanned aeroplanes, and rescue operations (e.g. after earthquakes).
6 Robot Control Problems. Robot control problems can be divided into: manipulation, changing the position of objects using e.g. a robot arm; locomotion, changing the physical position of the robot by driving (or walking) around; and manipulation plus locomotion, doing both at the same time (e.g. in robot soccer).
7 Locomotion with legs. Most animals use legs for locomotion. Controlling legs is harder than controlling wheels, however. The advantage of using legs is that locomotion on rough surfaces (e.g. with many stones, or stairs) becomes possible. The Ambler robot (1992) has 6 legs, is almost 10 meters in size, and can climb obstacles up to 2 meters. The Ambler is a statically stable walker, like an AIBO robot; such robots can stand still, but are usually quite slow and consume a lot of energy.
8 Locomotion with wheels. Locomotion with wheels is the most practical option for most environments: wheels are easier to build and more efficient (stable, and faster). Robots with wheels have problems too. A car has a position (x, y) and an orientation φ; it can drive to every position and take every orientation, so there are three degrees of freedom. But we can only steer and drive, so there are only two actuators.
9 A non-holonomic robot has fewer degrees of freedom for control than total degrees of freedom. A holonomic robot has as many degrees of freedom for control as positional degrees of freedom.
10 Manipulation with robots. Manipulators are effectors which can move objects in the environment. Kinematics is the study of how movements of the actuators correspond to movements of parts of the robot in physical space. Most manipulators allow rotating or linear (translating) movements.
11 Control for Locomotion. The goal of locomotion is usually to find a minimal-cost path from one state in the environment to another. The difficulty is that the robot is often uncertain about its own position. Path-planning problems with uncertainty are often cast in the framework of partially observable Markov decision processes (POMDPs). Since even simplified instances of POMDPs are NP-hard to solve, heuristic algorithms are often used.
12 Control for Manipulation. The goal of manipulation is often to grasp an object and to put (or fix) it in another place. The advantage of manipulation is that usually the complete environment can be observed; the difficulty is that there are many degrees of freedom to be controlled. Inverse kinematics can be used for simple (e.g. empty) environments, but learning methods can work better when many obstacles are present.
13 Introduction to Reinforcement Learning. Supervised learning: learning from data which are all labelled with the desired outcome/action. Reinforcement learning: learning by trial and error; the agent interacts with the environment and may receive reward or punishment. Example: navigate a robot, with reward if the desired position is reached and punishment if the robot makes a collision. Example: play chess, checkers, backgammon, ..., with reward if the game is won and punishment if the game is lost.
14 Reinforcement Learning principles. Learn to control an agent by trying out actions and using the obtained feedback (rewards) to strengthen (reinforce) the agent's behavior. The agent interacts with the environment using its (virtual) sensors and effectors. The reward function determines which agent behavior is most desired. (Figure: the agent receives inputs and rewards from the environment and sends actions back to it.)
15 Some applications of Reinforcement Learning Game-playing (checkers, backgammon, chess) Elevator control Robot control Combinatorial optimisation Simulated robot soccer Network routing Traffic control
16 Convergence issues in Reinforcement Learning. Reinforcement learning algorithms with lookup tables (tabular RL) are proven to converge to an optimal policy after an infinite number of experiences. Reinforcement learning with function approximators has sometimes obtained excellent results (e.g. Tesauro's TD-Gammon). However, several studies have shown that RL with function approximators may diverge (to infinite parameter values).
17 Markov Decision Problems. A Markov decision problem (MDP) consists of: S, a finite set of states $\{s_1, s_2, \ldots, s_n\}$; A, a finite set of actions; $P(i, a, j)$, the probability of making a step to state j if action a is selected in state i; $R(i, a, j)$, the reward for making a transition from state i to state j by executing action a; and γ, the discount parameter for future rewards ($0 \leq \gamma \leq 1$).
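As a concrete illustration of this tuple, a tiny MDP can be written down directly as dictionaries. This is a hypothetical sketch: the state and action names and all numbers are invented for the example, not taken from the slides.

```python
# A toy MDP: P[(s, a)] maps successor states to probabilities,
# R[(s, a, s2)] gives the transition reward, gamma discounts the future.
P = {
    ("s1", "go"): {"s2": 0.9, "s1": 0.1},
    ("s2", "go"): {"s2": 1.0},
}
R = {("s1", "go", "s2"): 1.0, ("s1", "go", "s1"): 0.0, ("s2", "go", "s2"): 0.0}
gamma = 0.9

def expected_reward(s, a):
    """Expected one-step reward: sum over s' of P(s, a, s') * R(s, a, s')."""
    return sum(p * R[(s, a, s2)] for s2, p in P[(s, a)].items())

print(expected_reward("s1", "go"))  # 0.9
```

Note that each distribution P(s, a, .) must sum to one, which is easy to check in this representation.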
18 Dynamic Programming and Value Functions (1). The value of a state, $V^\pi(s)$, is defined as the expected cumulative discounted future reward starting in state s and following policy π: $V^\pi(s) = E(\sum_{i=0}^{\infty} \gamma^i r_i \mid s_0 = s, \pi)$. The optimal policy is the one which has the largest state-value in all states: $\pi^* = \arg\max_\pi V^\pi$.
19 Dynamic Programming and Value Functions (2). We also define the Q-value of a state-action pair as: $Q^\pi(s, a) = E(\sum_{i=0}^{\infty} \gamma^i r_i \mid s_0 = s, a_0 = a, \pi)$. The Bellman optimality equation relates a state-action value of the optimal Q-function to other optimal state-values: $Q^*(s, a) = \sum_{s'} P(s, a, s')(R(s, a, s') + \gamma V^*(s'))$.
20 Dynamic Programming and Value Functions (3). The Bellman equation has led to very efficient dynamic programming (DP) algorithms. Value iteration computes an optimal policy for a known Markov decision problem using: $Q_{k+1}(s, a) := \sum_{s'} P(s, a, s')(R(s, a, s') + \gamma V_k(s'))$, where $V_k(s) = \max_a Q_k(s, a)$. It can easily be shown that $\lim_{k \to \infty} Q_k = Q^*$.
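The value-iteration backup above can be sketched in a few lines of Python. The three-state deterministic chain below is an invented example (not from the slides): state 2 is an absorbing goal, and reaching it yields reward 1.

```python
# Value iteration: Q_{k+1}(s, a) = sum_s' P(s, a, s') (R(s, a, s') + gamma * V_k(s')).
gamma = 0.9
states = [0, 1, 2]        # state 2 is an absorbing goal state
actions = [0, 1]          # 0 = stay, 1 = advance

def step(s, a):
    """Deterministic model: returns (next state, reward)."""
    if s == 2 or a == 0:
        return s, 0.0
    s2 = s + 1
    return s2, (1.0 if s2 == 2 else 0.0)

Q = {(s, a): 0.0 for s in states for a in actions}
for k in range(100):
    V = {s: max(Q[(s, a)] for a in actions) for s in states}
    for s in states:
        for a in actions:
            s2, r = step(s, a)
            Q[(s, a)] = r + gamma * V[s2]

print(Q[(1, 1)], Q[(0, 1)])  # 1.0 0.9: advancing from 0 is worth gamma * 1
```

Since the chain has only three states, the backups reach the fixed point $Q^*$ after a handful of sweeps.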
21 Reinforcement Learning. Dynamic programming is not applicable when: the Markov decision problem is unknown; there are too many states (often caused by many state variables, leading to Bellman's curse of dimensionality); or there are continuous states or actions. For such cases, we can use reinforcement learning algorithms. RL algorithms learn from interacting with an environment and can be combined with function approximators.
22 RL Algorithms: Q-learning. A well-known RL algorithm is Q-learning (Watkins, 1989), which updates the Q-function after an experience $(s_t, a_t, r_t, s_{t+1})$ as follows: $Q(s_t, a_t) := Q(s_t, a_t) + \alpha(r_t + \gamma \max_a Q(s_{t+1}, a) - Q(s_t, a_t))$, where $0 < \alpha \leq 1$ is the learning rate. Q-learning is an off-policy RL algorithm, meaning that it learns about the optimal value function while following another, behavioural policy. Tabular Q-learning converges to the optimal policy after an infinite number of experiences.
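A minimal tabular Q-learning sketch on a hypothetical five-state corridor (the environment and all hyperparameters are invented for illustration): the agent starts at the left end, each step costs -1, and reaching the rightmost state ends the episode.

```python
import random

random.seed(0)
n, goal = 5, 4
Q = [[0.0, 0.0] for _ in range(n)]   # actions: 0 = left, 1 = right
alpha, gamma, eps = 0.5, 0.9, 0.1

for episode in range(500):
    s = 0
    while s != goal:
        # epsilon-greedy behavioural policy
        if random.random() < eps:
            a = random.randrange(2)
        else:
            a = max((0, 1), key=lambda b: Q[s][b])
        s2 = min(s + 1, goal) if a == 1 else max(s - 1, 0)
        r = 0.0 if s2 == goal else -1.0
        # off-policy update: bootstrap from the greedy action in s2
        Q[s][a] += alpha * (r + gamma * max(Q[s2]) - Q[s][a])
        s = s2

print([max((0, 1), key=lambda b: Q[s][b]) for s in range(goal)])  # greedy policy
```

After training, the greedy policy moves right from every state, even though the behavioural policy kept exploring.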
23 RL Algorithms: SARSA. SARSA (State-Action-Reward-State-Action) is an example of an on-policy RL algorithm: $Q(s_t, a_t) := Q(s_t, a_t) + \alpha(r_t + \gamma Q(s_{t+1}, a_{t+1}) - Q(s_t, a_t))$. Since SARSA is an on-policy algorithm, it learns the Q-function from the trajectories generated by the behavioural policy. To make SARSA converge, the exploration policy should be GLIE (Greedy in the Limit with Infinite Exploration); see (Singh et al., 2000).
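The only difference between the two algorithms is the bootstrap target, which can be made explicit in code. The Q table and numbers below are invented for illustration.

```python
# Q-learning bootstraps from the greedy action in s2 (off-policy);
# SARSA bootstraps from the action a2 actually taken (on-policy).
def q_learning_target(Q, r, s2, gamma):
    return r + gamma * max(Q[s2].values())

def sarsa_target(Q, r, s2, a2, gamma):
    return r + gamma * Q[s2][a2]

Q = {"s2": {"left": 0.0, "right": 1.0}}
print(q_learning_target(Q, 0.0, "s2", 0.9))      # 0.9 (uses the best action)
print(sarsa_target(Q, 0.0, "s2", "left", 0.9))   # 0.0 (uses the chosen action)
```

When the behavioural policy explores, the two targets differ, which is exactly why SARSA evaluates the behavioural policy while Q-learning estimates the optimal one.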
24 Reinforcement Learning with Function Approximators. To learn value functions for problems with many (or continuous) state variables, we have to combine reinforcement learning with function approximators. We concentrate on linear function approximators using a parameter vector θ. The value of a state-action pair is: $Q(s, a) = \sum_i \theta_{i,a} \phi_i(s)$, where the state vector $s_t$ received by the agent at time t is mapped onto a feature vector $\phi(s_t)$.
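In code, this is a dot product between a per-action weight vector and the shared feature vector. The weights and feature values below are invented for illustration.

```python
# Linear function approximation: Q(s, a) = sum_i theta[i, a] * phi_i(s).
theta = {"left": [0.5, -0.2], "right": [0.1, 0.8]}   # one weight vector per action

def q_value(theta, a, phi):
    return sum(t * f for t, f in zip(theta[a], phi))

phi = [1.0, 0.5]                       # feature vector phi(s) for some state s
print(q_value(theta, "right", phi))    # 0.1*1.0 + 0.8*0.5 = 0.5
print(q_value(theta, "left", phi))     # 0.5*1.0 - 0.2*0.5 = 0.4
```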
25 Linear Function Approximators. (Figure: the input (state) is mapped by a fixed method onto a feature vector, which is combined with learnable weights to produce the action values.)
26 Standard Q-learning with Linear FAs. When standard Q-learning is used, we can update the parameter vector as follows: $\theta_{i,a_t} := \theta_{i,a_t} + \alpha(r_t + \gamma \max_a Q(s_{t+1}, a) - Q(s_t, a_t))\phi_i(s_t)$. Q-learning using this update rule may diverge to infinite values.
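The rule above can be sketched as a single update function: compute the TD error once, then move every weight of the taken action along its feature. All numbers here are invented for illustration.

```python
def q(theta, a, phi):
    """Linear Q-value: sum_i theta[i, a] * phi_i(s)."""
    return sum(t * f for t, f in zip(theta[a], phi))

def q_learning_update(theta, phi_t, a_t, r_t, phi_next, alpha, gamma):
    # TD error: r + gamma * max_b Q(s', b) - Q(s, a)
    td = r_t + gamma * max(q(theta, b, phi_next) for b in theta) - q(theta, a_t, phi_t)
    for i, f in enumerate(phi_t):
        theta[a_t][i] += alpha * td * f
    return td

theta = {"a0": [0.0, 0.0], "a1": [0.0, 0.0]}
q_learning_update(theta, [1.0, 0.0], "a0", 1.0, [0.0, 1.0], 0.5, 0.9)
print(theta["a0"])   # [0.5, 0.0]: only the active feature's weight moved
```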
27 Standard SARSA with Linear FAs. When standard SARSA is used, we use the following update: $\theta_{i,a_t} := \theta_{i,a_t} + \alpha(r_t + \gamma Q(s_{t+1}, a_{t+1}) - Q(s_t, a_t))\phi_i(s_t)$. Gordon (2002) proved that with this update the parameters converge for a fixed policy, and stay in a bounded region for a changing policy. Perkins and Precup (2002) proved that the parameters converge if the policy improvement operator produces ɛ-soft policies and is Lipschitz continuous in the action values with a constant that is not too large.
28 Averaging RL with Linear FAs. In averaging RL, updates do not use the difference between two successor state-values, but between the successor state-value and the parameter's value itself. Reynolds (2002) has shown that averaging RL will not diverge if the feature vector for each state is normalised: the value functions remain in a bounded region. We assume normalised feature vectors, so we will use $\phi'_i(s)$ as the normalised feature vector computed as: $\phi'_i(s) = \phi_i(s) / \sum_j \phi_j(s)$.
29 Averaging Q-learning with Linear FAs. The averaging Q-learning rule looks as follows, for all i: $\theta_{i,a_t} := \theta_{i,a_t} + \alpha(r_t + \gamma \max_a Q(s_{t+1}, a) - \theta_{i,a_t})\phi'_i(s_t)$. Each θ parameter learns towards the desired value individually; the parameters do not cooperate in the learning updates. Under a fixed stationary distribution with interpolative function approximators, averaging Q-learning has been proved to converge (Szepesvari and Smart, 2004).
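A sketch of the averaging update with normalised features; note that the error term uses the parameter itself rather than the full Q-value, so each weight chases the target independently. The values are invented for illustration.

```python
def normalise(phi):
    """phi'_i(s) = phi_i(s) / sum_j phi_j(s)."""
    s = sum(phi)
    return [f / s for f in phi]

def averaging_update(theta_a, target, phi, alpha):
    phi_n = normalise(phi)
    for i, f in enumerate(phi_n):
        # averaging RL: error is (target - theta_i), not (target - Q(s, a))
        theta_a[i] += alpha * (target - theta_a[i]) * f
    return theta_a

theta_a = [0.0, 2.0]
averaging_update(theta_a, 1.0, [1.0, 1.0], 1.0)
print(theta_a)   # [0.5, 1.5]: each weight moved halfway towards the target
```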
30 Sample-based Dynamic Programming. In case we have a model of the environment, we can also use sample-based dynamic programming, as proposed by Boyan and Moore (1995) and Gordon (1995). Sample-based value iteration goes as follows. First we select a subset of states $S_0 \subseteq S$. Often $S_0$ is chosen arbitrarily, but in general $S_0$ should be large enough and spread over the state space. Then we use the states $s \in S_0$ for updating the value function using a model of the environment.
31 Averaging Q-value iteration with linear FAs. Averaging sample-based DP uses the following update, where T(.) is the backup operator: $T(\theta_{i,a}) = \frac{1}{\sum_{s \in S_0} \phi'_i(s)} \sum_{s \in S_0} \phi'_i(s) \sum_{s'} P(s, a, s')\left(R(s, a, s') + \gamma \max_b \sum_j \phi'_j(s')\,\theta_{j,b}\right)$. This DP update rule is obtained from the averaging Q-learning rule with linear function approximators (where we set the learning rate α to 1).
32 Deriving the Averaging Q-value iteration update. Remember that we had: $\theta_{i,a_t} := \theta_{i,a_t} + (r_t + \gamma \max_a Q(s_{t+1}, a) - \theta_{i,a_t})\phi'_i(s_t)$. We rewrite this as: $\theta_{i,a_t} := \theta_{i,a_t} + (r_t + \gamma \max_a Q(s_{t+1}, a))\phi'_i(s_t) - \theta_{i,a_t}\phi'_i(s_t)$, whose per-sample fixed point is $\theta_{i,a_t} = \frac{(r_t + \gamma \max_a Q(s_{t+1}, a))\phi'_i(s_t)}{\phi'_i(s_t)}$. The Q-value iteration backup operator is obtained by averaging over all states in the sweep.
33 Analysing the Averaging Q-value iteration update. This is averaging DP, because we take the weighted average of all targets irrespective of the other parameter values for the current state; parameter estimation is done in a non-cooperative way. Note that averaging RL has a problem. Consider two training examples, written as feature value → target: 0.5 → 1 and 1 → 2. The best value of the parameter would be 2, but averaging Q-value iteration will compute the value 5/3. Note that this would also be the fixed point if averaging Q-learning is used: (1 - 5/3)*0.5 + (2 - 5/3)*1 = 0.
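The two-example calculation can be checked directly: the averaging backup computes the feature-weighted average of the targets, and the net averaging-RL update is zero at that value.

```python
# Examples as (feature value, target): 0.5 -> 1 and 1 -> 2.
examples = [(0.5, 1.0), (1.0, 2.0)]

# Averaging DP: feature-weighted average of the targets.
theta = sum(f * t for f, t in examples) / sum(f for f, _ in examples)
print(theta)           # 1.666... = 5/3, not the best-fit value 2

# At theta = 5/3 the summed averaging-RL update is zero (a fixed point).
net_update = sum((t - theta) * f for f, t in examples)
print(net_update)      # 0.0
```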
34 Standard Q-value iteration with linear FAs (1). For standard value iteration with linear function approximators, the resulting algorithm looks a bit unconventional. First we introduce the value of a state-action pair when a particular parameter i is left out: $Q_i(s, a) = \sum_{j \neq i} \theta_{j,a} \phi'_j(s)$. Note that $Q_i(s, a) = 0$ when only feature i is active.
35 Standard Q-value iteration with linear FAs (2). The standard Q-value iteration algorithm, which is cooperative but may diverge, is: $T(\theta_{i,a}) = \frac{1}{\sum_{s \in S_0} \phi'_i(s)^2} \sum_{s \in S_0} \phi'_i(s) \sum_{s'} P(s, a, s')\left(R(s, a, s') + \gamma \max_b \sum_j \phi'_j(s')\,\theta_{j,b} - Q_i(s, a)\right)$.
36 Deriving the Standard Q-value iteration update. This rule is obtained by examining the standard Q-learning rule with linear function approximators (with α = 1). Remember that we had: $\theta_{i,a} := \theta_{i,a} + (r_t + \gamma \max_b Q(s_{t+1}, b) - Q(s_t, a))\phi'_i(s_t)$. Since $Q(s_t, a) = Q_i(s_t, a) + \theta_{i,a}\phi'_i(s_t)$, we can rewrite this as: $\theta_{i,a} := \theta_{i,a} + (r_t + \gamma \max_b Q(s_{t+1}, b) - Q_i(s_t, a))\phi'_i(s_t) - \theta_{i,a}\phi'_i(s_t)^2$, whose per-sample fixed point is $\theta_{i,a} = \frac{(r_t + \gamma \max_b Q(s_{t+1}, b) - Q_i(s_t, a))\phi'_i(s_t)}{\phi'_i(s_t)^2}$.
37 Analysing the Standard Q-value iteration update. In the case of a tabular representation with only one state active, the resulting algorithm is again the same as conventional Q-value iteration. If we examine the same examples, 0.5 → 1 and 1 → 2, we can see that standard DP would compute the value 2, which is correct. The problem with this standard value iteration algorithm with function approximators is that it may diverge, just as Q-learning may diverge.
38 Divergence of Q-learning with FAs. Off-policy standard RL methods such as online value iteration or standard Q-learning can diverge for a large number of function approximators, such as linear neural networks, locally weighted learning, and radial basis networks. We will show an example in which infinite parameters are obtained when online value iteration is used. Online value iteration uses a model of the environment, but only updates on visited states.
39 Example demonstrating divergence. The example is very simple. There is one state s with value 0.5 or 1, and φ(s) := s. The agent can select actions $a_t$ from {0.1, 0.2, 0.3, ..., 0.9, 1.0}. An absorbing state has been reached if s = 1; otherwise we set $s_{t+1} = 0.5$ if $a_t < 1$ and $s_{t+1} = 1$ if $a_t = 1$. The reward on all transitions is 0. The initial state is $s_0 = 0.5$.
40 Proof that divergence will be the result. The algorithm computes the following update if s = 0.5: $\theta := \theta + \alpha(\gamma\theta - 0.5\theta)0.5$. And the following update if s = 1: $\theta := \theta + \alpha(0 - \theta)$. If the agent often selects random actions, in many cases the agent will be in state s = 0.5. Suppose it stays on average h times in state s = 0.5 before making a step to s = 1. Then it will make the following average update: $\theta := \theta + \alpha((\gamma\theta - 0.5\theta)0.5h - \theta)$.
41 Starting from a positive initial value of θ, this will lead to ever-increasing values of θ if: $h > \frac{1}{0.5\gamma - 0.25}$, where γ must be larger than 0.5.
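The average update can be simulated directly to see the growth. The sketch below uses γ = 0.9 and h = 10, chosen (as an illustration) so that the divergence condition h > 1/(0.5·0.9 − 0.25) = 5 holds.

```python
# One parameter theta with phi(s) = s; repeatedly apply the average update
# theta := theta + alpha * ((gamma*theta - 0.5*theta) * 0.5 * h - theta).
gamma, alpha, h = 0.9, 0.1, 10
theta = 1.0
history = [theta]
for _ in range(200):
    theta = theta + alpha * ((gamma * theta - 0.5 * theta) * 0.5 * h - theta)
    history.append(theta)

print(history[-1] > history[0])   # True: theta grows without bound
```

Each application multiplies θ by 1 + α(0.5h(γ − 0.5) − 1) = 1.1 here, so θ increases geometrically, exactly as the condition predicts.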
42 Explaining divergence. In this example, for large enough γ and enough exploration, the parameter will always increase: if the estimated value of the current state grows, then the estimated value of the best next state grows at the same time. This may lead to divergence. If instead we use a Monte Carlo return, and do not use bootstrapping at all, we do not have this problem. Divergence is also prevented if we use the averaging RL update or on-policy RL.
43 Experiments. We executed a number of experiments to compare standard and averaging DP with CMACs as a linear function approximator. The environment is a maze of size 51 × 51 with one goal state. Each state can be a starting state, except for the goal state. There are deterministic single-step actions North, East, South, and West. The reward for every action is -1.
44 Experimental results. As function approximator we use CMACs in which one tiling encodes the x-position and the other tiling the y-position, so there are two tilings with 51 states each. We first examine the error of the computed value function for standard and averaging Q-value iteration in an empty maze.
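This two-tiling CMAC feature map is easy to sketch: each state activates exactly two binary features, one per tiling. The sketch below assumes a 51 × 51 grid, consistent with the two 51-state tilings mentioned on the slide; the helper name is invented.

```python
# CMAC-style features for the maze: tiling 1 encodes the x-position,
# tiling 2 the y-position, so each state activates exactly two features.
def cmac_features(x, y, size=51):
    phi = [0.0] * (2 * size)
    phi[x] = 1.0            # tiling 1: x-position
    phi[size + y] = 1.0     # tiling 2: y-position
    return phi

phi = cmac_features(3, 7)
print(sum(phi), len(phi))   # 2.0 102
```

Normalising such a vector (as averaging RL requires) simply gives each active feature the value 0.5.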
45 Experimental results. (Figures: value-function error of averaging DP and standard DP in the empty maze, plotted against the discount factor and against the number of iterations.)
46 Discussion empty maze (1). In the empty maze we observe the following. For γ = 0, both algorithms compute the optimal value function. For γ = 1, standard DP is able to compute the perfect value function, but averaging DP makes a large error. For γ values between 0.05 and 0.9, the value function obtained by averaging DP is slightly better than the one obtained with standard DP.
47 Discussion empty maze (2). We also studied the influence of initial parameter values. As expected, averaging DP always converged to the same value function. For standard DP, random initialisations between 0 and 100 and between 0 and 1000 led to convergence; if we initialised the parameters to values between 0 and 10000, the parameters always diverged.
48 Experiments on a maze containing a wall. We also examine the total number of steps from all initial states to the goal state for a maze with a wall. (Figure: total number of steps for averaging DP and standard DP in the wall maze, plotted against the discount factor.)
49 Other projects. We want to use averaging RL for solving path-planning problems in outer space involving gravity forces from planets. Such a problem features continuous inputs (position, velocity, direction), and crashes into planets must be avoided. We plan to solve it with averaging RL combined with multi-layer perceptrons.
50 Other projects. Project of Sander Maas for Philips: learn to control a robot arm. For each joint two muscles are used. These are so-called McKibben actuators, which work on air pressure. This resembles human muscles, but for several reasons the behavior of these muscles is highly non-linear.
51 Discussion. Of the two robot control problems, manipulation and locomotion, manipulation seems the most suitable for reinforcement learning. For locomotion, noise in the robot's actions and sensors forces the controller to deal with a probabilistic state description; RL approaches using recurrent neural networks or hidden Markov models can be useful here. We have extended averaging RL to feed-forward neural networks and plan to use this algorithm for robot control.
More informationGeneralized Inverse Reinforcement Learning
Generalized Inverse Reinforcement Learning James MacGlashan Cogitai, Inc. james@cogitai.com Michael L. Littman mlittman@cs.brown.edu Nakul Gopalan ngopalan@cs.brown.edu Amy Greenwald amy@cs.brown.edu Abstract
More informationWestminsterResearch
WestminsterResearch http://www.westminster.ac.uk/research/westminsterresearch Reinforcement learning in continuous state- and action-space Barry D. Nichols Faculty of Science and Technology This is an
More informationApplying the Episodic Natural Actor-Critic Architecture to Motor Primitive Learning
Applying the Episodic Natural Actor-Critic Architecture to Motor Primitive Learning Jan Peters 1, Stefan Schaal 1 University of Southern California, Los Angeles CA 90089, USA Abstract. In this paper, we
More informationCMPUT 412 Motion Control Wheeled robots. Csaba Szepesvári University of Alberta
CMPUT 412 Motion Control Wheeled robots Csaba Szepesvári University of Alberta 1 Motion Control (wheeled robots) Requirements Kinematic/dynamic model of the robot Model of the interaction between the wheel
More informationPascal De Beck-Courcelle. Master in Applied Science. Electrical and Computer Engineering
Study of Multiple Multiagent Reinforcement Learning Algorithms in Grid Games by Pascal De Beck-Courcelle A thesis submitted to the Faculty of Graduate and Postdoctoral Affairs in partial fulfillment of
More informationRobotics. CSPP Artificial Intelligence March 10, 2004
Robotics CSPP 56553 Artificial Intelligence March 10, 2004 Roadmap Robotics is AI-complete Integration of many AI techniques Classic AI Search in configuration space (Ultra) Modern AI Subsumption architecture
More informationApplying Neural Network Architecture for Inverse Kinematics Problem in Robotics
J. Software Engineering & Applications, 2010, 3: 230-239 doi:10.4236/jsea.2010.33028 Published Online March 2010 (http://www.scirp.org/journal/jsea) Applying Neural Network Architecture for Inverse Kinematics
More informationForward Search Value Iteration For POMDPs
Forward Search Value Iteration For POMDPs Guy Shani and Ronen I. Brafman and Solomon E. Shimony Department of Computer Science, Ben-Gurion University, Beer-Sheva, Israel Abstract Recent scaling up of POMDP
More informationPath Planning for a Robot Manipulator based on Probabilistic Roadmap and Reinforcement Learning
674 International Journal Jung-Jun of Control, Park, Automation, Ji-Hun Kim, and and Systems, Jae-Bok vol. Song 5, no. 6, pp. 674-680, December 2007 Path Planning for a Robot Manipulator based on Probabilistic
More informationInverse Kinematics. Given a desired position (p) & orientation (R) of the end-effector
Inverse Kinematics Given a desired position (p) & orientation (R) of the end-effector q ( q, q, q ) 1 2 n Find the joint variables which can bring the robot the desired configuration z y x 1 The Inverse
More informationReinforcement Learning-Based Path Planning for Autonomous Robots
Reinforcement Learning-Based Path Planning for Autonomous Robots Dennis Barrios Aranibar 1, Pablo Javier Alsina 1 1 Laboratório de Sistemas Inteligentes Departamento de Engenharia de Computação e Automação
More informationDeep Reinforcement Learning
Deep Reinforcement Learning 1 Outline 1. Overview of Reinforcement Learning 2. Policy Search 3. Policy Gradient and Gradient Estimators 4. Q-prop: Sample Efficient Policy Gradient and an Off-policy Critic
More informationA Fuzzy Reinforcement Learning for a Ball Interception Problem
A Fuzzy Reinforcement Learning for a Ball Interception Problem Tomoharu Nakashima, Masayo Udo, and Hisao Ishibuchi Department of Industrial Engineering, Osaka Prefecture University Gakuen-cho 1-1, Sakai,
More informationReinforcement Learning in Discrete and Continuous Domains Applied to Ship Trajectory Generation
POLISH MARITIME RESEARCH Special Issue S1 (74) 2012 Vol 19; pp. 31-36 10.2478/v10012-012-0020-8 Reinforcement Learning in Discrete and Continuous Domains Applied to Ship Trajectory Generation Andrzej Rak,
More informationICRA 2012 Tutorial on Reinforcement Learning I. Introduction
ICRA 2012 Tutorial on Reinforcement Learning I. Introduction Pieter Abbeel UC Berkeley Jan Peters TU Darmstadt Motivational Example: Helicopter Control Unstable Nonlinear Complicated dynamics Air flow
More informationValue Iteration. Reinforcement Learning: Introduction to Machine Learning. Matt Gormley Lecture 23 Apr. 10, 2019
10-601 Introduction to Machine Learning Machine Learning Department School of Computer Science Carnegie Mellon University Reinforcement Learning: Value Iteration Matt Gormley Lecture 23 Apr. 10, 2019 1
More informationMobile Robots Locomotion
Mobile Robots Locomotion Institute for Software Technology 1 Course Outline 1. Introduction to Mobile Robots 2. Locomotion 3. Sensors 4. Localization 5. Environment Modelling 6. Reactive Navigation 2 Today
More informationAdaptive Building of Decision Trees by Reinforcement Learning
Proceedings of the 7th WSEAS International Conference on Applied Informatics and Communications, Athens, Greece, August 24-26, 2007 34 Adaptive Building of Decision Trees by Reinforcement Learning MIRCEA
More informationREINFORCEMENT LEARNING: MDP APPLIED TO AUTONOMOUS NAVIGATION
REINFORCEMENT LEARNING: MDP APPLIED TO AUTONOMOUS NAVIGATION ABSTRACT Mark A. Mueller Georgia Institute of Technology, Computer Science, Atlanta, GA USA The problem of autonomous vehicle navigation between
More informationCS 687 Jana Kosecka. Reinforcement Learning Continuous State MDP s Value Function approximation
CS 687 Jana Kosecka Reinforcement Learning Continuous State MDP s Value Function approximation Markov Decision Process - Review Formal definition 4-tuple (S, A, T, R) Set of states S - finite Set of actions
More informationProbabilistic Double-Distance Algorithm of Search after Static or Moving Target by Autonomous Mobile Agent
2010 IEEE 26-th Convention of Electrical and Electronics Engineers in Israel Probabilistic Double-Distance Algorithm of Search after Static or Moving Target by Autonomous Mobile Agent Eugene Kagan Dept.
More informationReinforcement Learning and Optimal Control. ASU, CSE 691, Winter 2019
Reinforcement Learning and Optimal Control ASU, CSE 691, Winter 2019 Dimitri P. Bertsekas dimitrib@mit.edu Lecture 1 Bertsekas Reinforcement Learning 1 / 21 Outline 1 Introduction, History, General Concepts
More informationLocal Search Methods. CS 188: Artificial Intelligence Fall Announcements. Hill Climbing. Hill Climbing Diagram. Today
CS 188: Artificial Intelligence Fall 2006 Lecture 5: Robot Motion Planning 9/14/2006 Local Search Methods Queue-based algorithms keep fallback options (backtracking) Local search: improve what you have
More informationAnnouncements. CS 188: Artificial Intelligence Fall Robot motion planning! Today. Robotics Tasks. Mobile Robots
CS 188: Artificial Intelligence Fall 2007 Lecture 6: Robot Motion Planning 9/13/2007 Announcements Project 1 due (yesterday)! Project 2 (Pacman with ghosts) up in a few days Reminder: you are allowed to
More informationCS 188: Artificial Intelligence Fall Announcements
CS 188: Artificial Intelligence Fall 2007 Lecture 6: Robot Motion Planning 9/13/2007 Dan Klein UC Berkeley Many slides over the course adapted from either Stuart Russell or Andrew Moore Announcements Project
More informationIntroduction to Mobile Robotics Path Planning and Collision Avoidance
Introduction to Mobile Robotics Path Planning and Collision Avoidance Wolfram Burgard, Cyrill Stachniss, Maren Bennewitz, Giorgio Grisetti, Kai Arras 1 Motion Planning Latombe (1991): eminently necessary
More informationHeuristic Policy Iteration for Infinite-Horizon Decentralized POMDPs
Heuristic Policy Iteration for Infinite-Horizon Decentralized POMDPs Christopher Amato and Shlomo Zilberstein Department of Computer Science University of Massachusetts Amherst, MA 01003 USA Abstract Decentralized
More informationPer-decision Multi-step Temporal Difference Learning with Control Variates
Per-decision Multi-step Temporal Difference Learning with Control Variates Kristopher De Asis Department of Computing Science University of Alberta Edmonton, AB T6G 2E8 kldeasis@ualberta.ca Richard S.
More informationRobots are built to accomplish complex and difficult tasks that require highly non-linear motions.
Path and Trajectory specification Robots are built to accomplish complex and difficult tasks that require highly non-linear motions. Specifying the desired motion to achieve a specified goal is often a
More informationLocalization, Where am I?
5.1 Localization, Where am I?? position Position Update (Estimation?) Encoder Prediction of Position (e.g. odometry) YES matched observations Map data base predicted position Matching Odometry, Dead Reckoning
More informationAssignment 4: CS Machine Learning
Assignment 4: CS7641 - Machine Learning Saad Khan November 29, 2015 1 Introduction The purpose of this assignment is to apply some of the techniques learned from reinforcement learning to make decisions
More informationME 597/747 Autonomous Mobile Robots. Mid Term Exam. Duration: 2 hour Total Marks: 100
ME 597/747 Autonomous Mobile Robots Mid Term Exam Duration: 2 hour Total Marks: 100 Instructions: Read the exam carefully before starting. Equations are at the back, but they are NOT necessarily valid
More informationStuck in Traffic (SiT) Attacks
Stuck in Traffic (SiT) Attacks Mina Guirguis Texas State University Joint work with George Atia Traffic 2 Intelligent Transportation Systems V2X communication enable drivers to make better decisions: Avoiding
More informationHeuristic Search Value Iteration Trey Smith. Presenter: Guillermo Vázquez November 2007
Heuristic Search Value Iteration Trey Smith Presenter: Guillermo Vázquez November 2007 What is HSVI? Heuristic Search Value Iteration is an algorithm that approximates POMDP solutions. HSVI stores an upper
More informationA convergent Reinforcement Learning algorithm in the continuous case based on a Finite Difference method
A convergent Reinforcement Learning algorithm in the continuous case based on a Finite Difference method Remi Munos* CEMAGREF, LISC, Pare de Tourvoie, BP 121, 92185 Antony Cedex, FRANCE. Tel : (0)1 40
More informationDecision Making under Uncertainty
Decision Making under Uncertainty MDPs and POMDPs Mykel J. Kochenderfer 27 January 214 Recommended reference Markov Decision Processes in Artificial Intelligence edited by Sigaud and Buffet Surveys a broad
More informationLearning and Solving Partially Observable Markov Decision Processes
Ben-Gurion University of the Negev Department of Computer Science Learning and Solving Partially Observable Markov Decision Processes Dissertation submitted in partial fulfillment of the requirements for
More informationIncremental methods for computing bounds in partially observable Markov decision processes
Incremental methods for computing bounds in partially observable Markov decision processes Milos Hauskrecht MIT Laboratory for Computer Science, NE43-421 545 Technology Square Cambridge, MA 02139 milos@medg.lcs.mit.edu
More informationPlanning for Markov Decision Processes with Sparse Stochasticity
Planning for Markov Decision Processes with Sparse Stochasticity Maxim Likhachev Geoff Gordon Sebastian Thrun School of Computer Science School of Computer Science Dept. of Computer Science Carnegie Mellon
More informationThomas Bräunl EMBEDDED ROBOTICS. Mobile Robot Design and Applications with Embedded Systems. Second Edition. With 233 Figures and 24 Tables.
Thomas Bräunl EMBEDDED ROBOTICS Mobile Robot Design and Applications with Embedded Systems Second Edition With 233 Figures and 24 Tables Springer CONTENTS PART I: EMBEDDED SYSTEMS 1 Robots and Controllers
More informationLecture 13: Learning from Demonstration
CS 294-5 Algorithmic Human-Robot Interaction Fall 206 Lecture 3: Learning from Demonstration Scribes: Samee Ibraheem and Malayandi Palaniappan - Adapted from Notes by Avi Singh and Sammy Staszak 3. Introduction
More informationInstance-Based Action Models for Fast Action Planning
Instance-Based Action Models for Fast Action Planning Mazda Ahmadi and Peter Stone Department of Computer Sciences The University of Texas at Austin 1 University Station C0500, Austin, TX 78712-0233 Email:{mazda,pstone}@cs.utexas.edu
More informationReinforcement Learning on the Lego Mindstorms NXT Robot. Analysis and Implementation.
ESCUELA TÉCNICA SUPERIOR DE INGENIEROS INDUSTRIALES Departamento de Ingeniería de Sistemas y Automática Master Thesis Reinforcement Learning on the Lego Mindstorms NXT Robot. Analysis and Implementation.
More informationGradient Reinforcement Learning of POMDP Policy Graphs
1 Gradient Reinforcement Learning of POMDP Policy Graphs Douglas Aberdeen Research School of Information Science and Engineering Australian National University Jonathan Baxter WhizBang! Labs July 23, 2001
More informationVision-Based Reinforcement Learning using Approximate Policy Iteration
Vision-Based Reinforcement Learning using Approximate Policy Iteration Marwan R. Shaker, Shigang Yue and Tom Duckett Abstract A major issue for reinforcement learning (RL) applied to robotics is the time
More informationRobotics Tasks. CS 188: Artificial Intelligence Spring Manipulator Robots. Mobile Robots. Degrees of Freedom. Sensors and Effectors
CS 188: Artificial Intelligence Spring 2006 Lecture 5: Robot Motion Planning 1/31/2006 Dan Klein UC Berkeley Many slides from either Stuart Russell or Andrew Moore Motion planning (today) How to move from
More information