Planning and Control: Markov Decision Processes
1 CSE-571 AI-based Mobile Robotics Planning and Control: Markov Decision Processes
2 Planning. The agent must decide: what action next? Key dimensions of the problem: Static vs. Dynamic environment; Predictable vs. Unpredictable dynamics; Fully vs. Partially Observable state; Perfect vs. Noisy percepts; Deterministic vs. Stochastic actions; Discrete vs. Continuous outcomes; Full vs. Partial goal satisfaction.
3 Classical Planning. Environment: static, predictable, fully observable; percepts: perfect; actions: discrete, deterministic. What action next?
4 Stochastic Planning. Environment: static, unpredictable, fully observable; percepts: perfect; actions: discrete, stochastic. What action next?
5 Deterministic, Fully Observable
6 Stochastic, Fully Observable
7 Stochastic, Partially Observable
8 Markov Decision Process (MDP). S: a set of states; A: a set of actions; Pr(s'|s,a): transition model; C(s,a,s'): cost model; G: set of goals; s0: start state; γ: discount factor; R(s,a,s'): reward model.
9 Role of the Discount Factor (γ). It keeps the total reward/total cost finite, which is useful for infinite-horizon problems and sometimes for indefinite-horizon problems (if there are dead ends). Intuition (economics): money today is worth more than money tomorrow. Total reward: r1 + γ r2 + γ² r3 + … ; total cost: c1 + γ c2 + γ² c3 + …
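A one-line derivation (not on the slide) of why γ < 1 keeps the return finite, assuming per-step rewards are bounded by R_max:

```latex
% If |r_t| <= R_max for all t, the discounted return is bounded
% by a geometric series, hence finite for 0 <= gamma < 1.
\[
\left| \sum_{t=1}^{\infty} \gamma^{\,t-1} r_t \right|
\;\le\; \sum_{t=1}^{\infty} \gamma^{\,t-1} R_{\max}
\;=\; \frac{R_{\max}}{1-\gamma}, \qquad 0 \le \gamma < 1 .
\]
```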
10 Objective of a Fully Observable MDP. Find a policy π: S → A which optimises (discounted or undiscounted): minimises expected cost to reach a goal, maximises expected reward, or maximises expected (reward − cost); given a finite, infinite, or indefinite horizon; assuming full observability.
11 Examples of MDPs. Goal-directed, indefinite-horizon, cost-minimisation MDP: <S, A, Pr, C, G, s0>. Infinite-horizon, discounted reward-maximisation MDP: <S, A, Pr, R, γ>, with reward = Σ_t γ^t r_t. Goal-directed, finite-horizon, probability-maximisation MDP: <S, A, Pr, G, s0, T>.
12 Bellman Equations for MDP 1. For <S, A, Pr, C, G, s0>, define J*(s) (the optimal cost) as the minimum expected cost to reach a goal from state s. J* should satisfy: J*(s) = 0 for s ∈ G, and otherwise J*(s) = min_{a ∈ Ap(s)} Q*(s,a), where Q*(s,a) = Σ_{s'} Pr(s'|s,a) [ C(s,a,s') + J*(s') ].
13 Bellman Equations for MDP 2. For <S, A, Pr, R, s0, γ>, define V*(s) (the optimal value) as the maximum expected discounted reward from state s. V* should satisfy: V*(s) = max_{a ∈ Ap(s)} Σ_{s'} Pr(s'|s,a) [ R(s,a,s') + γ V*(s') ].
14 Bellman Backup. Given an estimate of the V* function (say V_n), a backup of V_n at state s calculates a new estimate V_{n+1}(s) = max_a Q_{n+1}(s,a), where Q_{n+1}(s,a) = Σ_{s'} Pr(s'|s,a) [ R(s,a,s') + γ V_n(s') ] is the value/cost of the strategy: execute action a in s, then execute π_n subsequently. The greedy action is π_n(s) = argmax_{a ∈ Ap(s)} Q_n(s,a).
15 Bellman Backup (example). [Figure: state s0 with actions a1, a2, a3 leading to successors s1 (V0 = 20), s2 (V0 = 2), s3 (V0 = 3); the backup computes Q1(s0,a1), Q1(s0,a2), Q1(s0,a3), sets V1(s0) = 25 = max_a Q1(s0,a), and the greedy action is a_greedy = a1.]
16 Value Iteration [Bellman 57]. Assign an arbitrary value V0 to each non-goal state. Repeat: for all states s, compute V_{n+1}(s) by a Bellman backup at s. Until max_s |V_{n+1}(s) − V_n(s)| < ε. The quantity |V_{n+1}(s) − V_n(s)| is the residual at s; stopping when the maximum residual drops below ε is called ε-convergence.
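The loop above as a minimal Python sketch; the two-state MDP at the bottom is made up, and the dict-based encodings of Pr and R are assumptions for this example only:

```python
def value_iteration(states, actions, P, R, gamma=0.9, eps=1e-6):
    """P[s][a] -> list of (s_next, prob); R[(s, a, s_next)] -> reward."""
    V = {s: 0.0 for s in states}              # arbitrary initial estimate V0
    while True:
        V_new = {}
        for s in states:
            # Bellman backup: V_{n+1}(s) = max_a sum_s' Pr(s'|s,a)[R + gamma*V_n(s')]
            V_new[s] = max(
                sum(p * (R[(s, a, s2)] + gamma * V[s2]) for s2, p in P[s][a])
                for a in actions)
        residual = max(abs(V_new[s] - V[s]) for s in states)
        V = V_new
        if residual < eps:                    # eps-convergence on the max residual
            return V

# Tiny made-up two-state example.
states, actions = ["s0", "s1"], ["stay", "go"]
P = {"s0": {"stay": [("s0", 1.0)], "go": [("s1", 0.8), ("s0", 0.2)]},
     "s1": {"stay": [("s1", 1.0)], "go": [("s0", 1.0)]}}
R = {(s, a, s2): 1.0 if s2 == "s1" else 0.0
     for s in states for a in actions for s2 in states}
print(value_iteration(states, actions, P, R))
```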
17 Complexity of Value Iteration. One iteration takes O(|A| |S|²) time. The number of iterations required is poly(|S|, |A|, 1/(1−γ)). Overall, the algorithm is polynomial in the size of the state space, and thus exponential in the number of state variables.
18 Policy Computation. The optimal policy is stationary and time-independent for infinite/indefinite horizon problems: it acts greedily with respect to Q*. Policy evaluation (computing V^π for a fixed policy π) is a system of linear equations in |S| variables.
19 Markov Decision Process (example). [Figure: a small example MDP over states s1…s5 with transition probabilities such as 0.7 and 0.2 on the edges and state rewards such as r=1, r=0, and r=−10.]
20 Value Function and Policy. [Figure: plots of the value residual and the policy residual over iterations.]
21 Changing the Search Space. Value iteration: search in value space, then compute the resulting policy. Policy iteration [Howard 60]: search in policy space, then compute the resulting value.
22 Policy Iteration [Howard 60]. Assign an arbitrary policy π0 to each state. Repeat: compute V_{n+1}, the evaluation of π_n; then for all states s compute π_{n+1}(s) = argmax_{a ∈ Ap(s)} Q_{n+1}(s,a). Until π_{n+1} = π_n. Advantage: it searches a finite (policy) space as opposed to an uncountably infinite (value) space, so convergence is faster. Modified policy iteration: exact policy evaluation is costly, O(n³), so approximate it by value iteration with the policy held fixed; all other properties follow!
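A compact policy-iteration sketch reusing the same illustrative MDP encoding as the value-iteration example above; policy evaluation solves the |S| linear equations (I − γ P_π) V = R_π exactly with NumPy, and the modified variant would replace np.linalg.solve with a few fixed-policy value-iteration sweeps:

```python
import numpy as np

def policy_iteration(states, actions, P, R, gamma=0.9):
    idx = {s: i for i, s in enumerate(states)}
    pi = {s: actions[0] for s in states}           # arbitrary initial policy pi_0
    while True:
        # Policy evaluation: solve (I - gamma * P_pi) V = R_pi.
        P_pi = np.zeros((len(states), len(states)))
        R_pi = np.zeros(len(states))
        for s in states:
            for s2, p in P[s][pi[s]]:
                P_pi[idx[s], idx[s2]] += p
                R_pi[idx[s]] += p * R[(s, pi[s], s2)]
        V = np.linalg.solve(np.eye(len(states)) - gamma * P_pi, R_pi)
        # Policy improvement: greedy action w.r.t. the evaluated value function.
        pi_new = {
            s: max(actions, key=lambda a: sum(
                p * (R[(s, a, s2)] + gamma * V[idx[s2]]) for s2, p in P[s][a]))
            for s in states}
        if pi_new == pi:                           # converged: pi_{n+1} = pi_n
            return pi, V
        pi = pi_new
```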
23 LP Formulation. Minimise Σ_{s ∈ S} V*(s) under the constraints: for every s and a, V*(s) ≥ R(s) + γ Σ_{s' ∈ S} Pr(s'|s,a) V*(s'). This is a big LP, so other tricks are used to solve it!
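A hedged sketch of this LP with scipy.optimize.linprog; rearranging each constraint into (γ P_a − I) V ≤ −R puts it in linprog's A_ub x ≤ b_ub form. The tiny two-state MDP here is a made-up illustration:

```python
import numpy as np
from scipy.optimize import linprog

# Made-up MDP: P[a][s][s'] transition matrices and per-state reward R[s].
P = np.array([[[1.0, 0.0], [0.0, 1.0]],        # action 0
              [[0.2, 0.8], [1.0, 0.0]]])       # action 1
R = np.array([0.0, 1.0])
gamma, n = 0.9, 2

# minimise sum_s V(s)  s.t.  V(s) >= R(s) + gamma * sum_s' P[a,s,s'] V(s'),
# i.e. (gamma * P[a] - I) V <= -R, one block of constraints per action.
A_ub = np.vstack([gamma * P[a] - np.eye(n) for a in range(P.shape[0])])
b_ub = np.tile(-R, P.shape[0])
res = linprog(c=np.ones(n), A_ub=A_ub, b_ub=b_ub, bounds=[(None, None)] * n)
print(res.x)    # the optimal value function V*
```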
24 Hybrid MDPs. Hybrid Markov decision process: Markov state = (n, x), where n is the discrete component (a set of fluents) and x is the continuous component. Bellman's equation: V^{t+1}_n(x) = max_{a ∈ A_n(x)} Σ_{n' ∈ N} Pr(n'|n,x,a) ∫_{x' ∈ X} Pr(x'|n,x,a,n') [ R_{n'}(x') + V^t_{n'}(x') ] dx'.
26 Convolutions: discrete-discrete; constant-discrete [Feng et al. 04]; constant-constant [Li & Littman 05].
27 Result of Convolutions. Convolving a probability density function (rows) with a value function (columns) yields:

    pdf \ value fn    discrete    constant    linear
    discrete          discrete    constant    linear
    constant          constant    linear      quadratic
    linear            linear      quadratic   cubic
28 Value Iteration for Motion Planning (assumes knowledge of the robot's location).
29 Frontier-based Exploration Every unknown location is a target point.
30 Manipulator Control Arm with two joints Configuration space
31 Manipulator Control Path State space Configuration space
33 Collision Avoidance via Planning. Potential field methods have local minima. Instead, perform efficient path planning in the local perceptual space, where path costs depend on length and closeness to obstacles. [Konolige, Gradient method]
34 Paths and Costs. A path is a list of points P = {p1, p2, …, pk}, where pk is the only point in the goal set. The cost of a path is separable into an intrinsic cost at each point plus an adjacency cost of moving from one point to the next: F(P) = Σ_i I(p_i) + Σ_i A(p_i, p_{i+1}). The adjacency cost is typically Euclidean distance; the intrinsic cost is typically occupancy or distance to the nearest obstacle.
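The cost decomposition F(P) = Σ_i I(p_i) + Σ_i A(p_i, p_{i+1}) in Python; the particular intrinsic cost (inverse clearance from a single made-up obstacle) and adjacency cost (Euclidean distance) are illustrative choices:

```python
import math

def path_cost(path, intrinsic, adjacency):
    """F(P) = sum_i I(p_i) + sum_i A(p_i, p_{i+1})."""
    return (sum(intrinsic(p) for p in path) +
            sum(adjacency(p, q) for p, q in zip(path, path[1:])))

# Illustrative costs: penalise closeness to one obstacle, step Euclideanly.
obstacle = (2.0, 2.0)
def intrinsic(p):
    d = math.dist(p, obstacle)
    return 1.0 / d if d > 0 else float("inf")
adjacency = math.dist

print(path_cost([(0, 0), (1, 0), (1, 1), (2, 1)], intrinsic, adjacency))
```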
35 Navigation Function. A navigation function assigns a potential-field value to every element in configuration space [Latombe 91]; the goal set is always downhill, so there are no local minima. The navigation function of a point is the cost of the minimal-cost path that starts at that point: N_k = min_{P_k} F(P_k), where P_k ranges over paths starting at point k.
36 Computation of the Navigation Function. Initialization: points in the goal set get cost 0, all other points get infinite cost, and the active list is the goal set. Repeat: take a point from the active list and update its neighbors; if a neighbor's cost changes, add that point to the active list. Until the active list is empty.
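A sketch of this wavefront computation on a 4-connected grid, using a priority queue so updates propagate in best-first order (a Dijkstra-style variant of the active-list loop above); the grid and cost values are illustrative:

```python
import heapq

def navigation_function(grid, goals):
    """grid[r][c] is the intrinsic cost of a cell (None = obstacle);
    goals is a set of (r, c) cells. Returns minimal path costs per cell."""
    N = {g: 0.0 for g in goals}
    active = [(0.0, g) for g in goals]      # active list, kept as a min-heap
    heapq.heapify(active)
    while active:
        cost, (r, c) = heapq.heappop(active)
        if cost > N.get((r, c), float("inf")):
            continue                         # stale entry, already improved
        for r2, c2 in ((r+1, c), (r-1, c), (r, c+1), (r, c-1)):
            if 0 <= r2 < len(grid) and 0 <= c2 < len(grid[0]) \
                    and grid[r2][c2] is not None:
                new_cost = cost + 1.0 + grid[r2][c2]   # adjacency + intrinsic
                if new_cost < N.get((r2, c2), float("inf")):
                    N[(r2, c2)] = new_cost             # cost changed:
                    heapq.heappush(active, (new_cost, (r2, c2)))  # re-activate
    return N

grid = [[0, 0, 0], [0, None, 0], [0, 0, 0]]   # None marks an obstacle
print(navigation_function(grid, {(0, 0)}))
```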
37 Challenges. Where do we get the state space from? Where do we get the model from? What happens when the world is slightly different? Where does the reward come from? Continuous state variables. Continuous action spaces.
38 How to solve larger problems? If the problem is deterministic, use Dijkstra's algorithm. If there are no back-edges, use backward Bellman updates, and prioritize Bellman updates to maximize information flow. If the initial state is known, use dynamic programming plus heuristic search (LAO*, RTDP and variants). Divide an MDP into sub-MDPs and solve the hierarchy. Aggregate states with similar values. Relational MDPs.
39 Approximations: n-step lookahead. n = 1 is greedy: π1(s) = argmax_a R(s,a). General n-step lookahead: πn(s) = argmax_a Σ_{s'} Pr(s'|s,a) [ R(s,a,s') + γ V_{n−1}(s') ], where V_{n−1} is the (n−1)-step lookahead value.
40 Approximation: Incremental approaches. Loop: build a deterministic relaxation → run a deterministic planner → obtain a plan → evaluate it by stochastic simulation → identify weaknesses → solve/merge, and repeat.
41 Approximations: Planning and Replanning. Build a deterministic relaxation → run a deterministic planner → obtain a plan → execute the action → send the state reached back to the planner, and replan.
42 CSE-571 AI-based Mobile Robotics. Planning and Control: (1) Reinforcement Learning, (2) Partially Observable Markov Decision Processes.
43 Reinforcement Learning. We still have an MDP and are still looking for a policy. The new twist: we don't know Pr and/or R, i.e. we don't know which states are good or what the actions do. We must actually try out actions and states to learn.
44 Model-based methods. Visit different states, perform different actions, and estimate Pr and R. Once the model is built, do planning using value iteration or other methods. Cons: requires huge amounts of data.
45 Model-free methods: TD learning. Directly learn the Q*(s,a) values: sample = R(s,a,s') + γ max_{a'} Q_n(s',a'), then nudge the old estimate towards the new sample: Q_{n+1}(s,a) ← (1−α) Q_n(s,a) + α · sample.
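The TD update above as a tabular Q-learning sketch; the environment interface (env.reset() and env.step(a) returning (next_state, reward, done)) is an assumed gym-style placeholder, not something defined on the slides:

```python
from collections import defaultdict
import random

def q_learning(env, actions, episodes=500, alpha=0.1, gamma=0.9, epsilon=0.1):
    Q = defaultdict(float)                  # Q[(s, a)], initialised to 0
    for _ in range(episodes):
        s, done = env.reset(), False
        while not done:
            # epsilon-greedy exploration (see the next slide)
            if random.random() < epsilon:
                a = random.choice(actions)
            else:
                a = max(actions, key=lambda a: Q[(s, a)])
            s2, r, done = env.step(a)       # assumed environment interface
            bootstrap = 0.0 if done else gamma * max(Q[(s2, a2)] for a2 in actions)
            sample = r + bootstrap
            # nudge: Q_{n+1}(s,a) <- (1-alpha) Q_n(s,a) + alpha * sample
            Q[(s, a)] = (1 - alpha) * Q[(s, a)] + alpha * sample
            s = s2
    return Q
```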
46 Properties. This converges to the optimal Q* if you explore enough and if you make the learning rate (α) small enough, but do not decrease it too quickly.
47 Exploration vs. Exploitation: ε-greedy. Each time step, flip a coin: with probability ε, act randomly; with probability 1−ε, take the current greedy action. Lower ε over time to increase exploitation as more learning has happened.
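A small ε-greedy selector with a decaying ε, sketched under the assumption that q is a dict mapping (state, action) to the current estimate; the exponential decay schedule is an arbitrary illustrative choice:

```python
import random

def make_epsilon_greedy(actions, q, eps0=1.0, decay=0.995, eps_min=0.05):
    state = {"eps": eps0}
    def select(s):
        # With prob eps act randomly; with prob 1-eps take the greedy action.
        if random.random() < state["eps"]:
            a = random.choice(actions)
        else:
            a = max(actions, key=lambda a: q[(s, a)])
        state["eps"] = max(eps_min, state["eps"] * decay)  # lower eps over time
        return a
    return select
```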
48 Q-learning Problems. There are too many states to visit during learning, and Q(s,a) is a BIG table; we want to generalize from a small set of training examples. Solutions: value function approximators, policy approximators, hierarchical reinforcement learning.
49 Task Hierarchy: MAXQ Decomposition [Dietterich 00]. [Task graph: Root has children Fetch and Deliver (children of a task are unordered); Fetch decomposes into Take and Navigate(loc); Deliver into Navigate(loc) and Give; Take into Extend-arm and Grab; Give into Release and Extend-arm; Navigate(loc) into Move s, Move w, Move e, Move n.]
50 MAXQ Decomposition. Augment the state s by adding the subtask i: [s,i]. Define C([s,i],j) as the reward received in i after j finishes. Then Q([s,Fetch],Navigate(prr)) = V([s,Navigate(prr)]) + C([s,Fetch],Navigate(prr)): the first term is the reward received while navigating, the second the reward received after navigation. Express V in terms of C, and learn C instead of learning Q.
51 MAXQ Decomposition (contd). State abstraction: finding irrelevant actions; finding funnel actions.
52 POMDPs: Recall example
53 Partially Observable Markov Decision Processes
54 POMDPs. In POMDPs we apply the very same idea as in MDPs. Since the state is not observable, the agent has to make its decisions based on the belief state, which is a posterior distribution over states. Let b be the belief of the agent about the current state. POMDPs compute a value function over belief space.
55 Problems. Each belief is a probability distribution; thus, each value in a POMDP is a function of an entire probability distribution. This is problematic, since probability distributions are continuous. Additionally, we have to deal with the huge complexity of belief spaces. For finite worlds with finite state, action, and measurement spaces and finite horizons, however, we can effectively represent the value functions by piecewise linear functions.
56 An Illustrative Example. [Figure: a two-state POMDP with states x1 and x2, terminal actions u1 and u2 with associated payoffs, a sensing action u3 that can cause a state transition, and measurements z1 and z2.]
57 The Parameters of the Example. The actions u1 and u2 are terminal actions. The action u3 is a sensing action that potentially leads to a state transition. The horizon is finite and γ = 1.
58 Payoff in POMDPs. In MDPs, the payoff (or return) depended on the state of the system. In POMDPs, however, the true state is not exactly known. Therefore, we compute the expected payoff by integrating over all states: r(b,u) = E_x[r(x,u)], i.e. for our two-state example r(b,u) = p1 r(x1,u) + p2 r(x2,u).
59 Payoffs in Our Example (1). If we are totally certain that we are in state x1 and execute action u1, we receive a reward of −100. If, on the other hand, we definitely know that we are in x2 and execute u1, the reward is +100. In between, it is the linear combination of the extreme values weighted by the probabilities: r(b,u1) = −100 p1 + 100 (1 − p1).
60 Payoffs in Our Example (2). [Figure: the expected payoffs r(b,u) plotted as linear functions of p1.]
61 The Resulting Policy for T=1. Given a finite POMDP with T=1, we would use V1(b) to determine the optimal policy. In our example, the optimal policy for T=1 picks, at each belief, the action whose expected payoff is maximal; this is the upper thick graph in the diagram.
62 Piecewise Linearity, Convexity. The resulting value function V1(b) is the maximum of the three payoff functions at each point; it is therefore piecewise linear and convex.
63 Pruning. If we carefully consider V1(b), we see that only the first two components contribute. The third component can therefore safely be pruned away from V1(b).
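For this two-state example every value component is a line over p1 = b(x1), so pruning can be sketched as keeping only lines that are maximal somewhere on [0, 1]; this grid test is a simple stand-in for exact LP-based pruning, and the u2 and u3 payoff numbers are assumptions consistent with the −100/+100 figures above:

```python
import numpy as np

def prune(lines, n_grid=1001):
    """lines: list of (slope, intercept) pairs, each a linear value component
    over p1 in [0, 1]. Keep only components that are maximal somewhere."""
    p = np.linspace(0.0, 1.0, n_grid)
    vals = np.array([[m * pi + c for pi in p] for m, c in lines])
    winners = set(vals.argmax(axis=0))       # which line is on top where
    return [lines[i] for i in sorted(winners)]

# Payoff lines r(b,u) = p1*r(x1,u) + (1-p1)*r(x2,u), in slope/intercept form.
lines = [(-200.0, 100.0),   # u1: -100*p1 + 100*(1-p1)
         (150.0, -50.0),    # u2 (assumed payoffs): 100*p1 - 50*(1-p1)
         (0.0, -1.0)]       # u3 (assumed payoff): constant -1, never maximal
print(prune(lines))         # the u3 component is pruned away
```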
64 Increasing the Time Horizon. Assume the robot can make an observation before deciding on an action. [Figure: V1(b).]
65 Increasing the Time Horizon. Assume the robot can make an observation before deciding on an action. Suppose the robot perceives z1, for which p(z1|x1) = 0.7 and p(z1|x2) = 0.3. Given the observation z1, we update the belief using Bayes rule: p'1 = p(z1|x1) p1 / p(z1) = 0.7 p1 / p(z1) and p'2 = 0.3 (1 − p1) / p(z1), where p(z1) = 0.7 p1 + 0.3 (1 − p1) = 0.4 p1 + 0.3.
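The Bayes update for the two-state example as a tiny function; the 0.7/0.3 likelihoods are the ones given on the slide:

```python
def belief_update(p1, likelihood_x1=0.7, likelihood_x2=0.3):
    """Posterior belief p'_1 = p(x1 | z1) after observing z1."""
    p_z = likelihood_x1 * p1 + likelihood_x2 * (1.0 - p1)  # p(z1) = 0.4*p1 + 0.3
    return likelihood_x1 * p1 / p_z

print(belief_update(0.5))   # 0.7*0.5 / (0.4*0.5 + 0.3) = 0.7
```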
66 Value Function. [Figure: V1(b) and the updated value V1(b|z1), plotted over the belief b and the updated belief b|z1.]
67 Increasing the Time Horizon. Assume the robot can make an observation before deciding on an action. Suppose the robot perceives z1, for which p(z1|x1) = 0.7 and p(z1|x2) = 0.3. Given the observation z1, we update the belief using Bayes rule. Thus V1(b|z1) is given by evaluating V1 at the updated belief b|z1.
68 Expected Value after Measuring. Since we do not know in advance what the next measurement will be, we have to compute the expected belief: V̄1(b) = E_z[V1(b|z)] = Σ_i p(z_i) V1(b|z_i) = Σ_i p(z_i) V1( p(z_i|x1) p1 / p(z_i) ).
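A sketch of this expectation, combining the belief update above with V1 represented as a max of linear components (the same assumed payoff lines as in the pruning sketch):

```python
def v1(p1, lines=((-200.0, 100.0), (150.0, -50.0))):
    """V1(b) as the max of the linear payoff components (assumed payoffs)."""
    return max(m * p1 + c for m, c in lines)

def expected_value_after_measuring(p1):
    # z1: p(z1|x1)=0.7, p(z1|x2)=0.3; z2 has the complementary likelihoods.
    total = 0.0
    for lx1, lx2 in ((0.7, 0.3), (0.3, 0.7)):
        p_z = lx1 * p1 + lx2 * (1.0 - p1)        # p(z_i)
        total += p_z * v1(lx1 * p1 / p_z)        # p(z_i) * V1(b|z_i)
    return total

print(expected_value_after_measuring(0.6))
```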
70 Resulting Value Function. The four possible combinations yield the following function, which can then be simplified and pruned.
71 Value Function. [Figure: p(z1) V1(b|z1) and p(z2) V1(b|z2) plotted over the belief b, together with their sum V̄1(b).]
72 State Transitions (Prediction). When the agent selects u3, its state potentially changes. When computing the value function, we have to take these potential state changes into account.
73 Resulting Value Function after Executing u3. Taking the state transitions into account, we finally obtain the value function V̄1(b|u3) shown on the next slide.
74 Value Function after Executing u3. [Figure: V̄1(b) and V̄1(b|u3).]
75 Value Function for T=2. Taking into account that the agent can either directly perform u1 or u2, or first u3 and then u1 or u2, we obtain V2(b) (after pruning).
76 Graphical Representation of V2(b). [Figure: a region where u1 is optimal, a region where u2 is optimal, and an unclear middle region where the outcome of measuring is important.]
77 Deep Horizons and Pruning. We have now completed a full backup in belief space. This process can be applied recursively. The value functions for T=10 and T=20 are shown on the next slide.
78 Deep Horizons and Pruning. [Figure: pruned value functions for T=10 and T=20.]
80 Why Pruning is Essential. Each update introduces additional linear components to V, and each measurement squares the number of linear components. Thus, an unpruned value function for T=20 includes more than 10^547,864 linear functions, and at T=30 we have 10^561,012,337 linear functions. The pruned value function at T=20, in comparison, contains only 12 linear components. This combinatorial explosion of linear components in the value function is the major reason why POMDPs are impractical for most applications.
81 POMDP Summary. POMDPs compute the optimal action in partially observable, stochastic domains. For finite-horizon problems, the resulting value functions are piecewise linear and convex. In each iteration the number of linear constraints grows exponentially. So far, POMDPs have only been applied successfully to very small state spaces with small numbers of possible observations and actions.
82 POMDP Approximations. Point-based value iteration; QMDPs; AMDPs.
83 Point-based Value Iteration. Maintains a set of example beliefs. Only considers constraints that maximize the value function for at least one of the examples.
84 Point-based Value Iteration. [Figure: value functions for T=30, comparing the exact value function with PBVI.]
85 Example Application. [Figure.]
86 Example Application. [Figure.]
87 QMDPs. QMDPs only consider state uncertainty in the first step; after that, the world is assumed to become fully observable.
88 QMDP value of executing action u in state x_i: Q(x_i, u) = r(x_i, u) + Σ_{j=1}^{N} V(x_j) p(x_j | u, x_i). The action is then selected via arg max_u Σ_{i=1}^{N} p_i Q(x_i, u).
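A sketch of QMDP action selection, assuming the underlying MDP value function V has already been computed (e.g. with value iteration); the array layout P[a, i, j] = p(x_j | u_a, x_i) is an assumption of this sketch:

```python
import numpy as np

def qmdp_action(belief, P, R, V):
    """belief: p_i over states; P[a, i, j] = p(x_j | u_a, x_i);
    R[i, a] = r(x_i, u_a); V[j] = MDP value of state x_j."""
    # Q[i, a] = r(x_i, u_a) + sum_j p(x_j | u_a, x_i) V(x_j)
    Q = R + np.einsum("aij,j->ia", P, V)
    # argmax_u sum_i p_i Q(x_i, u): belief-weighted Q value per action.
    return int(np.argmax(belief @ Q))
```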
89 Augmented MDPs. Augmentation adds an uncertainty component to the state space: b is mapped to the pair ( argmax_x b(x), H_b(x) ), where H_b(x) = −∫ b(x) log b(x) dx is the entropy of the belief. Planning is performed by an MDP in this augmented state space. The transition, observation and payoff models have to be learned.
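The augmentation as a small helper: collapse a discrete belief to its most likely state plus its entropy, using a discrete sum in place of the integral above:

```python
import math

def augment(belief):
    """Map a discrete belief {state: prob} to the augmented MDP state
    (most likely state, belief entropy H_b)."""
    ml_state = max(belief, key=belief.get)                 # argmax_x b(x)
    entropy = -sum(p * math.log(p) for p in belief.values() if p > 0)
    return ml_state, entropy

print(augment({"x1": 0.7, "x2": 0.3}))   # ('x1', 0.6108...)
```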