1 Reinforcement Learning for Robot Control. Marco Wiering, Intelligent Systems Group, Utrecht University

2 Introduction. Robots move in the physical environment to perform tasks. The environment is continuous and uncertain. Programming a robot controller is often difficult. Reinforcement learning methods are useful for learning controllers from trial and error. We will examine how reinforcement learning can be used for optimising robot controllers.

3 Contents of this presentation: Robot Control Problems; Dynamic Programming; Reinforcement Learning; Reinforcement Learning with Function Approximators; Experimental Results; Discussion.

4 Robotics. A robot is an active, moving artificial agent that lives in the physical environment. We will concentrate on autonomous robots: robots that make their own decisions using feedback from their sensors. The problem in designing robot controllers is that the real world is highly demanding: (1) the real world cannot be perfectly and completely perceived, (2) it is non-deterministic, (3) it is dynamic, and (4) it is continuous.

5 Robot Applications. Robots are used for many different tasks: manufacturing (repetitive tasks on a production belt in the car and micro-electronics industry); the construction industry (e.g. sheep shearing); security and post delivery in buildings; unmanned cars, underwater vehicles, and unmanned aeroplanes; rescue operations (e.g. after earthquakes).

6 Robot Control Problems. Robot control problems can be divided into: manipulation, changing the position of objects using e.g. a robot arm; locomotion, changing the physical position of the robot by driving (or walking) around; and manipulation and locomotion combined, doing both things at the same time (e.g. in robot soccer).

7 Locomotion with legs. Most animals use legs for locomotion. Controlling legs is harder than controlling wheels, however. The advantage of using legs is that locomotion on rough surfaces (e.g. with many stones, or stairs) becomes possible. The Ambler robot (1992) has 6 legs, is almost 10 meters in size, and can climb over obstacles up to 2 meters in size. The Ambler robot is a statically stable walker, like an AIBO robot. Such robots can stand still, but are usually quite slow and consume a lot of energy.

8 Locomotion with wheels. Locomotion with wheels is the most practical option for most environments. Wheels are easier to build and more efficient (more stable, and faster). Robots with wheels also have problems: a car's pose is described by its position (x, y) and orientation φ, and since a car can drive to every position and take every orientation there are three degrees of freedom. But we can only steer and drive, so there are only two actuators.

9 A non-holonomic robot has fewer degrees of freedom for control than total degrees of freedom. A holonomic robot has as many degrees of freedom for control as positional degrees of freedom.

10 Manipulation with robots. Manipulators are effectors which can move objects in the environment. Kinematics is the study of how movements of the actuators correspond to movements of parts of the robot in physical space. Most manipulators allow rotating or linear translating movements.

11 Control for Locomotion. The goal of locomotion is usually to find a minimal-cost path from one state in the environment to another state. The difficulty is that the robot is often uncertain about its own position. Solving path-planning problems with uncertainty is often cast in the framework of partially observable Markov decision processes (POMDPs). Since solving even simplified instances of POMDPs is NP-hard, heuristic algorithms are often used.

12 Control for Manipulation. The goal of manipulation is often to grasp an object and to put (or fix) it in another place. The advantage of manipulation is that usually the complete environment can be observed. The difficulty is that there are many degrees of freedom to be controlled. Inverse kinematics can be used for simple (e.g. empty) environments, but learning methods can work better if many obstacles are present.

13 Introduction to Reinforcement Learning. Supervised learning: learning from data which are all labelled with the desired outcome/action. Reinforcement learning: learning by trial and error; the agent interacts with the environment and may receive reward or punishment. Examples: navigating a robot (reward if the desired position is reached, punishment if the robot makes a collision); playing chess, checkers, backgammon, ... (reward if the game is won, punishment if the game is lost).

14 Reinforcement Learning principles. Learn to control an agent by trying out actions and using the obtained feedback (rewards) to strengthen (reinforce) the agent's behaviour. The agent interacts with the environment by using its (virtual) sensors and effectors. The reward function determines which agent behaviour is most desired. (Diagram: the agent sends actions to the environment and receives inputs and rewards in return.)

15 Some applications of Reinforcement Learning: game playing (checkers, backgammon, chess); elevator control; robot control; combinatorial optimisation; simulated robot soccer; network routing; traffic control.

16 Convergence issues in Reinforcement Learning. Reinforcement learning algorithms with lookup tables (tabular RL) have been proved to converge to an optimal policy after an infinite number of experiences. Reinforcement learning with function approximators has sometimes obtained excellent results (e.g. Tesauro's TD-Gammon). However, particular studies have shown that RL with function approximators may diverge (to infinite parameter values).

17 Markov Decision Problems. A Markov decision problem (MDP) consists of: S, a finite set of states {s_1, s_2, ..., s_n}; A, a finite set of actions; P(i, a, j), the probability of making a step to state j if action a is selected in state i; R(i, a, j), the reward for making a transition from state i to state j by executing action a; and γ, the discount parameter for future rewards (0 ≤ γ ≤ 1).
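As a concrete illustration, such an MDP can be written down directly as these five elements. The following is a minimal sketch in Python; the two-state problem, its transition probabilities, and its rewards are invented for illustration and do not come from the slides:

```python
# A tiny MDP sketch with invented numbers, purely for illustration.
states = ["s1", "s2"]
actions = ["stay", "move"]

# P[(i, a)] lists (j, probability) pairs; R[(i, a, j)] is the transition reward.
P = {
    ("s1", "stay"): [("s1", 1.0)],
    ("s1", "move"): [("s2", 0.9), ("s1", 0.1)],
    ("s2", "stay"): [("s2", 1.0)],
    ("s2", "move"): [("s1", 1.0)],
}
R = {
    ("s1", "stay", "s1"): 0.0,
    ("s1", "move", "s2"): 1.0,
    ("s1", "move", "s1"): 0.0,
    ("s2", "stay", "s2"): 0.0,
    ("s2", "move", "s1"): 0.0,
}
gamma = 0.9  # discount parameter, 0 <= gamma <= 1
```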

18 Dynamic Programming and Value Functions (1). The value of a state V^π(s) is defined as the expected cumulative discounted future reward when starting in state s and following policy π:
V^π(s) = E( Σ_{i=0}^∞ γ^i r_i | s_0 = s, π )
The optimal policy is the one which has the largest state value in all states:
π* = argmax_π V^π

19 Dynamic Programming and Value Functions (2). We also define the Q-value of a state-action pair as:
Q^π(s, a) = E( Σ_{i=0}^∞ γ^i r_i | s_0 = s, a_0 = a, π )
The Bellman optimality equation relates a state-action value of the optimal Q-function to other optimal state values:
Q*(s, a) = Σ_{s'} P(s, a, s') ( R(s, a, s') + γ V*(s') )

20 Dynamic Programming and Value Functions (3). The Bellman equation has led to very efficient dynamic programming (DP) algorithms. Value iteration computes an optimal policy for a known Markov decision problem using:
Q_{k+1}(s, a) := Σ_{s'} P(s, a, s') ( R(s, a, s') + γ V_k(s') )
where V_k(s) = max_a Q_k(s, a). It can easily be shown that lim_{k→∞} Q_k = Q*.
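A minimal sketch of this value-iteration loop, assuming the tabular MDP layout sketched above (states, actions, P, R, gamma); the fixed number of sweeps is a stand-in for a proper convergence test:

```python
def value_iteration(states, actions, P, R, gamma, sweeps=100):
    """Tabular Q-value iteration: Q_{k+1}(s,a) = sum_s' P(s,a,s') * (R(s,a,s') + gamma * V_k(s'))."""
    Q = {(s, a): 0.0 for s in states for a in actions}
    for _ in range(sweeps):
        V = {s: max(Q[(s, a)] for a in actions) for s in states}      # V_k(s) = max_a Q_k(s, a)
        Q = {(s, a): sum(p * (R[(s, a, s2)] + gamma * V[s2]) for s2, p in P[(s, a)])
             for s in states for a in actions}
    return Q
```

Running `value_iteration(states, actions, P, R, gamma)` on the small MDP above returns the converged Q-values, from which the greedy policy can be read off.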

21 Reinforcement Learning. Dynamic programming is not applicable when: the Markov decision problem is unknown; there are too many states (often caused by many state variables, leading to Bellman's curse of dimensionality); or there are continuous states or actions. For such cases, we can use reinforcement learning algorithms. RL algorithms learn from interacting with an environment and can be combined with function approximators.

22 RL Algorithms: Q-learning. A well-known RL algorithm is Q-learning (Watkins, 1989), which updates the Q-function after an experience (s_t, a_t, r_t, s_{t+1}) as follows:
Q(s_t, a_t) := Q(s_t, a_t) + α ( r_t + γ max_a Q(s_{t+1}, a) - Q(s_t, a_t) )
where 0 < α ≤ 1 is the learning rate. Q-learning is an off-policy RL algorithm, meaning that it learns about the optimal value function while following another behavioural policy. Tabular Q-learning converges to the optimal policy after an infinite number of experiences.
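A hedged sketch of this tabular update in Python; the defaultdict table and the ε-greedy behavioural policy are assumptions added for completeness, not specified on the slide:

```python
from collections import defaultdict
import random

Q = defaultdict(float)        # tabular Q-function, indexed by (state, action)
alpha, gamma = 0.1, 0.95      # learning rate and discount parameter (assumed values)

def q_learning_update(s, a, r, s_next, actions):
    # Off-policy target: max over next actions, independent of the action taken next.
    target = r + gamma * max(Q[(s_next, b)] for b in actions)
    Q[(s, a)] += alpha * (target - Q[(s, a)])

def epsilon_greedy(s, actions, epsilon=0.1):
    # A common behavioural policy; any sufficiently exploring policy would do.
    if random.random() < epsilon:
        return random.choice(actions)
    return max(actions, key=lambda b: Q[(s, b)])
```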

23 RL Algorithms: SARSA. SARSA (State-Action-Reward-State-Action) is an example of an on-policy RL algorithm:
Q(s_t, a_t) := Q(s_t, a_t) + α ( r_t + γ Q(s_{t+1}, a_{t+1}) - Q(s_t, a_t) )
Since SARSA is an on-policy algorithm, it learns the Q-function based on the trajectories generated by the behavioural policy. To make SARSA converge, the exploration policy should be GLIE (Greedy in the Limit with Infinite Exploration), see (Singh et al., 2000).
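Compared with the Q-learning sketch above, only the target changes: it uses the action a_{t+1} actually selected by the behavioural policy. A minimal sketch with the same assumed tabular layout:

```python
def sarsa_update(Q, s, a, r, s_next, a_next, alpha=0.1, gamma=0.95):
    # Q is a mapping from (state, action) to value, e.g. a defaultdict(float).
    # On-policy target: uses the next action actually chosen by the behavioural policy.
    target = r + gamma * Q[(s_next, a_next)]
    Q[(s, a)] += alpha * (target - Q[(s, a)])
```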

24 Reinforcement Learning with Function Approximators. To learn value functions for problems with many (or continuous) state variables, we have to combine reinforcement learning with function approximators. We concentrate on linear function approximators using a parameter vector θ. The value of a state-action pair is:
Q(s, a) = Σ_i θ_{i,a} φ_i(s)
where the state vector s_t received by the agent at time t is mapped onto a feature vector φ(s_t).
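A sketch of this linear architecture in Python: one weight vector per action, and the Q-value is the inner product of those weights with the feature vector. The NumPy layout and the sizes are assumptions for illustration:

```python
import numpy as np

n_features, n_actions = 8, 4
theta = np.zeros((n_actions, n_features))   # learnable weights theta[a, i]

def q_value(theta, phi, a):
    """Q(s, a) = sum_i theta[i, a] * phi_i(s), for a fixed feature mapping phi(s)."""
    return float(np.dot(theta[a], phi))
```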

25 Linear Function Approximators. The linear function approximator looks as follows: the input (state) is mapped by a fixed method to a feature vector, which is combined with the learnable weights to produce the action values.

26 Standard Q-learning with Linear FAs. When standard Q-learning is used, we can update the parameter vector as follows:
θ_{i,a_t} := θ_{i,a_t} + α ( r_t + γ max_a Q(s_{t+1}, a) - Q(s_t, a_t) ) φ_i(s_t)
Q-learning using this update rule may diverge to infinite values.
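A sketch of this gradient-style update for a single experience, reusing the assumed NumPy layout above (theta of shape n_actions x n_features, features computed by some fixed mapping):

```python
import numpy as np

def q_learning_linear_update(theta, phi_t, a_t, r_t, phi_next, alpha=0.01, gamma=0.95):
    q_next = np.dot(theta, phi_next)                        # Q(s_{t+1}, a) for all actions a
    td_error = r_t + gamma * np.max(q_next) - np.dot(theta[a_t], phi_t)
    theta[a_t] += alpha * td_error * phi_t                  # only the weights of the taken action move
```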

27 Standard SARSA with Linear FAs. When standard SARSA is used, we use the following update:
θ_{i,a_t} := θ_{i,a_t} + α ( r_t + γ Q(s_{t+1}, a_{t+1}) - Q(s_t, a_t) ) φ_i(s_t)
Gordon (2002) proved that with this update the parameters converge for a fixed policy, and stay within a bounded region for a changing policy. Perkins and Precup (2002) proved that the parameters will converge if the policy improvement operator produces ɛ-soft policies and is Lipschitz continuous in the action values with a constant that is not too large.

28 Averaging RL with Linear FAs. In averaging RL, updates do not use the difference between two successive state values, but the difference between the successor state value and the parameter's value itself. Reynolds (2002) has shown that averaging RL will not diverge if the feature vector for each state is normalised; the value functions will remain in some bounded region. We assume normalised feature vectors, so we will use φ'_i(s) to denote the normalised feature vector, computed as:
φ'_i(s) = φ_i(s) / Σ_j φ_j(s)

29 Averaging Q-learning with Linear FAs. The averaging Q-learning rule looks as follows, for all i:
θ_{i,a_t} := θ_{i,a_t} + α ( r_t + γ max_a Q(s_{t+1}, a) - θ_{i,a_t} ) φ'_i(s_t)
The θ parameters learn towards the desired value and do not cooperate in the learning updates. Under a fixed stationary distribution with interpolative function approximators, averaging Q-learning has been proved to converge (Szepesvari and Smart, 2004).
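A sketch of this averaging update, including the feature normalisation from the previous slide; each parameter θ_{i,a_t} moves toward the target independently of the other parameters, weighted by its own normalised feature (NumPy layout as before, parameter values assumed):

```python
import numpy as np

def averaging_q_update(theta, phi_t, a_t, r_t, phi_next, alpha=0.01, gamma=0.95):
    phi_n = phi_t / phi_t.sum()                      # normalised features phi'_i(s_t)
    phi_next_n = phi_next / phi_next.sum()
    target = r_t + gamma * np.max(np.dot(theta, phi_next_n))
    # Each parameter learns toward the target itself (no cooperation between parameters).
    theta[a_t] += alpha * (target - theta[a_t]) * phi_n
```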

30 Sample-based Dynamic Programming. In case we have a model of the environment, we can also use sample-based dynamic programming, as proposed by Boyan and Moore (1995) and Gordon (1995). Sample-based value iteration goes as follows. First we select a subset of states S_0 ⊆ S. Often S_0 is chosen arbitrarily, but in general S_0 should be large enough and spread over the state space. Then we use the states s ∈ S_0 for updating the value function using a model of the environment.

31 Averaging Q-value iteration with linear FAs. Averaging sample-based DP uses the following update, where T(.) is the backup operator:
T(θ_{i,a}) = ( 1 / Σ_{s∈S_0} φ'_i(s) ) Σ_{s∈S_0} φ'_i(s) Σ_{s'} P(s, a, s') ( R(s, a, s') + γ max_b Σ_j φ'_j(s') θ_{j,b} )
This DP update rule is obtained from the averaging Q-learning rule with linear function approximators (where we set the learning rate α to 1).
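A sketch of one sweep of this averaging backup over a sampled state set S0. The model interface (a `transitions(s, a)` callable yielding probability, reward, next-state triples, and a `phi(s)` callable returning the normalised feature vector) is an assumed signature for illustration:

```python
import numpy as np

def averaging_backup(theta, S0, actions, transitions, phi, gamma=0.95):
    """One application of T to every theta[a]; theta maps each action to its weight vector."""
    new_theta = {a: np.array(theta[a]) for a in actions}
    for a in actions:
        num = np.zeros_like(theta[a])   # sum_s phi'_i(s) * target(s, a), per feature i
        den = np.zeros_like(theta[a])   # sum_s phi'_i(s), per feature i
        for s in S0:
            f = phi(s)
            target = sum(p * (r + gamma * max(np.dot(theta[b], phi(s2)) for b in actions))
                         for p, r, s2 in transitions(s, a))
            num += f * target
            den += f
        # Features that never occur in S0 keep their old parameter value.
        new_theta[a] = np.where(den > 0, num / np.maximum(den, 1e-12), theta[a])
    return new_theta
```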

32 Deriving the Averaging Q-value iteration update. Remember that we had (with α = 1):
θ_{i,a_t} := θ_{i,a_t} + ( r_t + γ max_a Q(s_{t+1}, a) - θ_{i,a_t} ) φ'_i(s_t)
We rewrite this as:
θ_{i,a_t} := θ_{i,a_t} + ( r_t + γ max_a Q(s_{t+1}, a) ) φ'_i(s_t) - θ_{i,a_t} φ'_i(s_t)
θ_{i,a_t} := ( r_t + γ max_a Q(s_{t+1}, a) ) φ'_i(s_t) / φ'_i(s_t)
The Q-value iteration backup operator is obtained by averaging over all states in the sweep.

33 Analysing the Averaging Q-value iteration update. This is averaging DP, because we take the weighted average of all targets irrespective of other parameter values for the current state, so parameter estimation is done in a non-cooperative way. Note that averaging RL has a problem. Consider two examples: 0.5 → 1 and 1 → 2 (a state with feature value 0.5 and target 1, and a state with feature value 1 and target 2). The best value of the parameter would be 2, but averaging Q-value iteration will compute the value 5/3. Note that this would also be the fixed point if averaging Q-learning is used: (1 - 5/3) · 0.5 + (2 - 5/3) · 1 = 0.
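A tiny numeric check of this example under the stated assumptions (a single parameter θ, features 0.5 and 1, targets 1 and 2): the averaging fixed point is the feature-weighted mean of the targets, 5/3, and plugging 5/3 back in makes the weighted residual zero.

```python
# Two examples for a single parameter theta: feature 0.5 with target 1, feature 1.0 with target 2.
features, targets = [0.5, 1.0], [1.0, 2.0]

# Averaging fixed point: feature-weighted average of the targets.
theta_avg = sum(f * t for f, t in zip(features, targets)) / sum(features)
print(theta_avg)  # 1.666... = 5/3, not the exact-fit value 2

# Residual of the averaging update at the fixed point: sum_i (target_i - theta) * feature_i.
print(sum((t - theta_avg) * f for f, t in zip(features, targets)))  # 0.0
```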

34 Standard Q-value iteration with linear FAs (1). For standard value iteration with linear function approximators, the resulting algorithm will look a bit non-conventional. First of all we introduce the value of a state if a particular parameter i is not used:
Q_{-i}(s, a) = Σ_{j≠i} θ_{j,a} φ'_j(s)
Note that Q_{-i}(s, a) = 0 when only feature i is active.

35 Standard Q-value iteration with linear FAs (2). The standard Q-value iteration algorithm, which is cooperative but may diverge, is:
T(θ_{i,a}) = ( 1 / Σ_{s∈S_0} φ'_i(s)² ) Σ_{s∈S_0} φ'_i(s) Σ_{s'} P(s, a, s') ( R(s, a, s') + γ max_b Σ_j φ'_j(s') θ_{j,b} - Q_{-i}(s, a) )

36 Deriving the Standard Q-value iteration update. This rule is obtained by examining the standard Q-learning rule with linear function approximators (with α = 1). Remember that we had:
θ_{i,a} := θ_{i,a} + ( r_t + γ max_b Q(s_{t+1}, b) - Q(s_t, a) ) φ'_i(s_t)
We can rewrite this as:
θ_{i,a} += ( r_t + γ max_b Q(s_{t+1}, b) - Q_{-i}(s_t, a) ) φ'_i(s_t) - θ_{i,a} φ'_i(s_t)²
θ_{i,a} := ( r_t + γ max_b Q(s_{t+1}, b) - Q_{-i}(s_t, a) ) φ'_i(s_t) / φ'_i(s_t)²

37 Analysing the Standard Q-value iteration update. In the case of a tabular representation with only one state active, the resulting algorithm is again the same as conventional Q-value iteration. If we examine the same examples, 0.5 → 1 and 1 → 2, we can see that standard DP would compute the value 2, which is correct. The problem with this standard value-iteration algorithm with function approximators is that it may diverge, just as Q-learning may diverge.

38 Divergence of Q-learning with FAs. Off-policy standard RL methods such as online value iteration or standard Q-learning can diverge for a large number of function approximators, such as: linear neural networks, locally weighted learning, and radial basis networks. We will show an example in which infinite parameters are obtained when online value iteration is used. Online value iteration uses a model of the environment but only updates on visited states.

39 Example demonstrating divergence. The example is very simple. There is a single state variable s which takes the value 0.5 or 1, and φ(s) := s. The agent can select actions a_t from {0.1, 0.2, 0.3, ..., 0.9, 1.0}. An absorbing state has been reached if s = 1; otherwise we set s_{t+1} = 0.5 if a_t < 1 and s_{t+1} = 1 if a_t = 1. The reward on all transitions is 0. The initial state is s_0 = 0.5.

40 Proof that divergence will be the result. The algorithm computes the following update if s = 0.5:
θ := θ + α ( γθ - 0.5θ ) · 0.5
And the following update if s = 1:
θ := θ + α ( 0 - θ )
If the agent often selects random actions, in many cases the agent will be in state s = 0.5. Suppose it stays on average h times in state s = 0.5 before making a step to s = 1. Then it will make the following average update:
θ := θ + α ( (γθ - 0.5θ) · 0.5 · h - θ )

41 This will lead to ever-increasing values of θ (for positive initial values of θ) if:
h > 1 / (0.5γ - 0.25)
Here γ must be larger than 0.5.
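A small simulation of this example under assumed settings (γ = 0.9, α = 0.1, uniformly random action selection, so the agent spends on average about 10 steps in s = 0.5 before reaching s = 1, which exceeds the threshold h > 1/(0.5 · 0.9 - 0.25) = 5); with these numbers the parameter keeps growing, illustrating the divergence:

```python
import random

gamma, alpha = 0.9, 0.1                # assumed values; the argument only needs gamma > 0.5
actions = [k / 10 for k in range(1, 11)]
theta = 1.0                            # positive initial parameter value

for episode in range(200):
    s = 0.5
    while s < 1.0:
        a = random.choice(actions)     # fully random exploration
        s_next = 1.0 if a >= 1.0 else 0.5
        # Online value iteration update with Q(s, a) = theta * s and reward 0:
        # the best reachable next state is s' = 1 with value theta * 1.
        theta += alpha * (gamma * theta - theta * s) * s
        s = s_next
    theta += alpha * (0.0 - theta)     # update in the absorbing state s = 1

print(theta)                           # grows far beyond its initial value of 1.0
```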

42 Explaining divergence. In this example, for large enough γ and enough exploration, the parameter will always increase. If the estimated value of the current state grows, then the estimated value of the best next state grows at the same time. This may lead to divergence. If we instead use a Monte Carlo return, and do not use bootstrapping at all, we will not have this problem. Also, if we use the averaging RL update or on-policy RL, divergence is prevented.

43 Experiments. We executed a number of experiments to compare standard and averaging DP with CMACs as a linear function approximator. The environment is a 51 × 51 maze with one goal state. Each state can be a starting state, except for the goal state. There are deterministic single-step actions North, East, South, and West. The reward for every action is -1.

44 Experimental results. As function approximator we use CMACs in which one tiling encodes the x-position and the other tiling encodes the y-position, so there are two tilings with 51 states each. We first examine the error of the computed value function for standard and averaging Q-value iteration in an empty maze.
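A sketch of this two-tiling CMAC feature mapping for a 51 × 51 maze: the first tiling has one feature per x-position and the second one feature per y-position, so every state activates exactly two of the 102 features. The one-hot layout and 0-based coordinates are assumptions:

```python
import numpy as np

SIZE = 51  # the maze is SIZE x SIZE

def cmac_features(x, y):
    """Tiling 1: features 0..50 encode the x-position; tiling 2: features 51..101 encode the y-position."""
    phi = np.zeros(2 * SIZE)
    phi[x] = 1.0           # tiling 1
    phi[SIZE + y] = 1.0    # tiling 2
    return phi             # dividing by 2 gives the normalised phi' used by averaging RL
```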

45 Experimental results. (Two plots for the empty maze: the value-function error of averaging DP and standard DP, plotted against the discount factor and against the number of iterations.)

46 Discussion empty maze (1). In the empty maze we observe the following. For γ = 0, both algorithms compute the optimal value function. For γ = 1, standard DP is able to compute the perfect value function, but averaging DP makes a large error. For γ values between 0.05 and 0.9, the value function obtained by averaging DP is slightly better than the one obtained with standard DP.

47 Discussion empty maze (2). We also studied the influence of initial parameter values. As expected, averaging DP always converged to the same value function. For standard DP, random initialisations in the ranges 0 to 100 and 0 to 1000 led to convergence. For standard DP, if we initialised the parameters to values between 0 and 10000, the parameters always diverged.

48 Experiments on a maze containing a wall. We also examine the total number of steps from all initial states to the goal state for a maze with a wall. (Plot: the total number of steps for averaging DP and standard DP against the discount factor in the wall maze.)

49 Other projects. We want to use averaging RL for solving path-planning problems in outer space involving gravitational forces from planets. Such a problem features continuous inputs (position, velocity, direction), and crashes into planets must be avoided. We plan to solve it with averaging RL combined with multi-layer perceptrons.

50 Other projects. Project of Sander Maas for Philips: learn to control a robot arm. For each joint two muscles are used. These are so-called McKibben actuators, which are driven by air pressure. They resemble human muscles, but for various reasons the behaviour of these muscles is highly non-linear.

51 Discussion. Of the two robot control problems, manipulation and locomotion, manipulation seems most suitable for reinforcement learning. For locomotion, noise in the actions and sensors of the robot forces the controller to deal with a probabilistic state description. RL approaches using recurrent neural networks or hidden Markov models can be useful for locomotion. We have extended averaging RL to feed-forward neural networks and plan to use this algorithm for robot control.
