CSE151 Assignment 2 Markov Decision Processes in the Grid World

Grace Lin A484  Tom Maddock A55645

Abstract

Markov decision processes formalize sequential decision problems, defined by a transition model and a reward function, in uncertain environments. The environment we study is the grid world, which is analogous to solving for the best moves in a board game. Our objective is to take a grid world containing obstacles and terminal states and solve for the optimal policy of that world. In the optimal policy, each state that is not an obstacle or a terminal is assigned a direction, namely the direction that yields the best expected outcome from that state. To solve for the optimal policy, we use two algorithms, value iteration and policy iteration. The two algorithms produced similar results, but policy iteration ran much faster than value iteration.

1 Introduction

One of the most significant and popular problems in Artificial Intelligence is solving sequential decision problems. A sequential decision problem is a situation in which an agent's utility depends on a sequence of decisions the agent makes. We focus on sequential decision problems because they lay the groundwork for reinforcement learning. These problems are interesting because of the components that make up the problem and how each component affects the solution. The framework we focus on, a Markovian transition model combined with additive rewards, is called a Markov decision process (MDP). The model is of a grid world: an m-by-n world that contains obstacles, terminal states, and a start state, and in which every state has a reward value that may be negative or positive. The grid world also has a probabilistic move formulation that specifies the probability of moving in each of the four directions (North, South, East, and West) given that the agent has chosen to move in a particular direction.
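As a concrete illustration of such a move model, the sketch below computes the expected utility of attempting an action from a given state. It is only a sketch: the 0.8/0.1/0.1 probability split, the matrix representation of the grid, and the function name are illustrative assumptions, not the exact conventions of our next_state_utility.m.

    % Expected utility of attempting 'action' (1=N, 2=E, 3=S, 4=W) from cell (r, c).
    % Assumed move model: the agent goes in the intended direction with probability
    % 0.8 and slips to either perpendicular direction with probability 0.1 each;
    % moves that leave the grid or hit an obstacle keep the agent in place.
    % U is an m-by-n utility matrix; grid is an m-by-n matrix with 1 marking obstacles.
    function eu = expected_next_utility(U, grid, r, c, action)
        dirs = [-1 0; 0 1; 1 0; 0 -1];                               % row/column offsets for N, E, S, W
        idx  = [action, mod(action, 4) + 1, mod(action + 2, 4) + 1]; % intended direction plus the two perpendiculars
        prob = [0.8, 0.1, 0.1];
        eu = 0;
        for k = 1:3
            nr = r + dirs(idx(k), 1);
            nc = c + dirs(idx(k), 2);
            if nr < 1 || nr > size(U, 1) || nc < 1 || nc > size(U, 2) || grid(nr, nc) == 1
                nr = r;                                              % blocked or off-grid: stay put
                nc = c;
            end
            eu = eu + prob(k) * U(nr, nc);
        end
    end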

Given a grid world with obstacles, terminals, a start state, rewards, and a move model, we will find the optimal policy of the grid world: the policy that yields the highest expected utility. The result is a world in which each state is assigned the direction that is best to take when the agent is in that state, the best direction being the one that yields the highest expected utility. To solve for the optimal policy we use two different algorithms, value iteration and policy iteration. Each algorithm produces an optimal policy, and we compare the results of the two.

2 Methods

2.1 Matlab Code Overview

For this assignment, we wrote a handful of Matlab functions that find solutions to a given MDP. The two top-level functions, value_iteration.m and policy_iteration.m, are the core result of our work and produce, respectively, a utility function and an optimal policy for a given MDP. The display_agent_choice.m and policy_evaluation.m functions complement them by computing, respectively, an optimal policy given a utility function and a utility function given a policy. Our next_state_utility.m function supports these by finding the expected utility of the next state, given a current state, an action, and a utility function. Finally, the iterate_rewards.m, compute_performance.m, and create_figure.m functions provide convenient ways of investigating the performance of our solution algorithms and plotting the results.

2.2 Solution Algorithms

Our top-level functions implement the value iteration and policy iteration algorithms for solving MDPs. Our implementation of value iteration is relatively straightforward: it iteratively applies the Bellman update to a utility function until the resulting utilities are within a given error bound. For policy iteration, we chose to implement the policy evaluation step by solving a system of linear equations instead of using modified policy iteration. For MDPs of the size given in this assignment, we felt this was the preferable method of policy evaluation in terms of both speed and accuracy. Additionally, our policy iteration algorithm starts from a random initial policy rather than a fixed one, which it then iteratively refines until it becomes optimal.

2.3 Measuring Performance

We tested our solution algorithms by measuring their performance while they solved two MDPs. The first (MDP 1) was a grid world built from the data in rewards.txt and terminals.txt; the second (MDP 2) used data from newrewards.txt and newterminals.txt. To measure performance, we added provisions that let each algorithm output a history of all the utility functions computed across its iterations, which allows us to observe the evolution of the utilities as they converge on their final values. From this history we derived the maximum error and the policy loss at each iteration, so that we could measure how quickly and how accurately each algorithm found a solution to the given MDP and compare the two on that basis.
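For reference, the core Bellman-update loop described in Section 2.2 can be sketched as follows. This is an illustrative reimplementation rather than our submitted value_iteration.m; it assumes the expected_next_utility helper sketched in the Introduction, a reward matrix R, obstacle and terminal masks, a discount factor gamma, and the usual stopping test on the size of the largest update.

    % Minimal value iteration sketch. R is an m-by-n reward matrix, grid marks
    % obstacles (1 = obstacle), term marks terminal states (1 = terminal),
    % gamma is the discount factor, and epsilon is the allowed utility error.
    function U = value_iteration_sketch(R, grid, term, gamma, epsilon)
        U = zeros(size(R));
        while true
            Uold  = U;
            delta = 0;
            for r = 1:size(R, 1)
                for c = 1:size(R, 2)
                    if grid(r, c) == 1
                        continue;               % obstacles have no utility
                    elseif term(r, c) == 1
                        U(r, c) = R(r, c);      % terminals keep their reward
                    else
                        % Bellman update: reward plus discounted best expected
                        % utility over the four possible actions.
                        best = -Inf;
                        for a = 1:4
                            best = max(best, expected_next_utility(Uold, grid, r, c, a));
                        end
                        U(r, c) = R(r, c) + gamma * best;
                    end
                    delta = max(delta, abs(U(r, c) - Uold(r, c)));
                end
            end
            if delta < epsilon * (1 - gamma) / gamma    % utilities within the error bound
                break;
            end
        end
    end

With this stopping rule, the returned utilities are guaranteed to be within epsilon of the true utilities for any discount factor strictly between zero and one.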

3 Results

3.1 Investigating Value Iteration

In the first part of our assignment, we investigated the performance of our value iteration algorithm by having it solve both of the provided MDPs with various discount factors and watching the progress of the algorithm as it iterated. Figures 3.1 and 3.2 below show the algorithm's progress as it finds a solution for MDP 1 and MDP 2, respectively, with a discount factor of 0.9. In each figure, the utilities of the states (excluding the obstacle state) are shown in the left-hand graph, and the maximum error and policy loss of the utility function are shown in the right-hand graph. Each graph is plotted against the number of iterations of the algorithm.

[Figure 3.1: Value Iteration Performance for MDP 1. Left panel: state utility values for γ = 0.9; right panel: maximum error and policy loss, both plotted against iteration number.]

[Figure 3.2: Value Iteration Performance for MDP 2. Left panel: state utility values for γ = 0.9; right panel: maximum error and policy loss, both plotted against iteration number.]

We then used our value iteration algorithm, together with our display_agent_choice.m function, to produce the optimal policies for MDPs 1 and 2 as a function of the reward given for non-terminal states. Figures 3.3 and 3.4 below show the results of this experiment. Note that an X in the figures denotes that any action was optimal in that state for the given reward range.

[Figure 3.3: Optimal policies for MDP 1 over different ranges of the non-terminal reward R. Each panel shows the optimal action (N, S, E, or W) in every state, with O marking the obstacle, T the terminal states, and X states where any action is optimal.]

[Figure 3.4: Optimal policies for MDP 2 over different ranges of the non-terminal reward R, using the same notation as Figure 3.3.]

3.2 Investigating Policy Iteration

For the second part of the assignment, we investigated the performance of policy iteration by having it find solutions for MDPs 1 and 2 with various discount factors. Figures 3.5 and 3.6 below show the algorithm's progress while solving these MDPs with a discount factor of 0.9, plotted against the number of iterations taken.

[Figure 3.5: Policy Iteration Performance for MDP 1. Left panel: state utility values for γ = 0.9; right panel: maximum error and policy loss, both plotted against iteration number.]

[Figure 3.6: Policy Iteration Performance for MDP 2. Left panel: state utility values for γ = 0.9; right panel: maximum error and policy loss, both plotted against iteration number.]

4 Discussion

4.1 Value Iteration

We charted the performance of value iteration on the MDPs given in the provided text files in Figures 3.1 and 3.2 above, and noticed a couple of interesting things about the behavior of the algorithm. First, some states' utility values converged faster than others. The states that tended to converge more slowly were often those further towards the start of an optimal path. Since the Bellman equation makes the utility of each state depend on the utility of its best successor, a state with many future states ahead of it on an optimal path must wait for the Bellman updates from those states to propagate back down the path before its own utility settles, so this result is reasonable. Although our graphs do not show it, we also noticed that as the discount factor of the MDP increased, the number of iterations needed for the utility values to converge increased as well. This, too, is to be expected: with a high discount factor, the utility of each state is more heavily influenced by the utilities of the states that come after it, so more Bellman updates are needed to propagate those utilities through the world before the earlier states settle. In general, we can also see that the policy loss produced by value iteration approaches zero much faster than the maximum error, demonstrating that the utility values can often produce an optimal policy long before they are accurate enough to be considered correct. This, of course, is the motivation behind the policy iteration algorithm, which we cover in the next section. It is interesting to note, however, that the policy loss for value iteration is not always monotonically decreasing, as our accuracy plots show. This is presumably an effect of the non-uniform convergence of the utilities during value iteration, which can cause the policy at some iteration to be less optimal than the policy at the previous iteration.

In Figures 3.3 and 3.4, we indicated what the optimal policies for both MDPs would be, depending on the reward value for the non-terminal states. Most noticeable from these results is that some of our reward ranges in Figure 3.3 do not exactly match the reward ranges given in the book for the same grid world (Russell & Norvig 66). We assume this is because we do not know the exact discount factor with which the book's results were computed, and therefore cannot expect our results to be exactly the same. However, our policies for reward ranges in the same neighborhood do seem to match the book, so we feel our results are reasonable. That said, we found several reward ranges for MDP 1 where the optimal policy changes (6 more than in the book), and 8 reward ranges for MDP 2 that influence the optimal policy. It is interesting to note that, for both MDPs, when the reward for the non-terminal states was strictly greater than zero, our algorithm was indecisive about the optimal action in states with no terminal neighbors, because the utilities of neighboring states keep increasing and all directions become equally optimal. An optimal agent, then, would do its best to stay away from all terminal states and maximize its reward by moving around the world indefinitely. For MDP 2 in particular, it was also interesting to see that for R > .99 the optimal policy was to pass by the smaller positive reward state and instead try to reach the +4.7 reward state that lay beyond. Even more interesting, for R > -.3 the optimal policy was to avoid the smaller positive reward state at all costs by going West in (3,), hoping for the small chance that the agent would end up going South instead, towards the +4.7 state.

4.2 Policy Iteration

When we ran policy iteration on the same MDPs, we noticed that it behaved much like value iteration: in both algorithms, the utility values start out relatively inaccurate and are refined with each iteration until they converge. Unlike value iteration, however, the initial utilities in our policy iteration algorithm always start from random values, so the number of iterations needed for convergence is never fixed; this is, of course, a consequence of our choice of a random initial policy. Still, we saw that policy iteration generally converged much faster than value iteration. For example, comparing Figures 3.1 and 3.5, policy iteration converges roughly four times faster than value iteration: in Figure 3.5 all states have converged by the third iteration, while in Figure 3.1 value iteration needs many more iterations just to get close to the correct utility values. We also found that, for policy iteration, varying the discount factor does not appear to influence the number of iterations required to find the optimal policy, unlike for value iteration. This seems reasonable, first because our policy iteration algorithm starts with a random policy on its first iteration, and second because the convergence of policy iteration does not depend directly on Bellman updates propagating through the state space, as it does for value iteration.

5 Conclusion

In this assignment, Tom wrote the majority of the value iteration functions in Matlab, prepared the figures and graphs, and helped edit and proofread the report. He learned more about the relationship between utility functions and policy functions in an MDP, the limits of value and policy iteration, and how to create and save plots of 3-dimensional matrices of data in Matlab.
Grace wrote the majority of the policy iteration functions in Matlab, ran the algorithms on the given MDPs with various parameters, and wrote a good share of the report. She developed a better understanding of MDPs, value iteration, and policy iteration, and learned more about the Matlab environment, including how to work with matrices and solve systems of linear equations.
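To close, here is a minimal sketch of the policy evaluation step described in Section 2.2, which solves the linear system U = R + γ T U for a fixed policy directly rather than iteratively. The flat state indexing, the trans_prob helper, and the function name are illustrative assumptions and not the exact interface of our policy_evaluation.m.

    % Policy evaluation by direct linear solve (see Section 2.2). States are the
    % N non-obstacle cells numbered 1..N, policy(i) is the action chosen in state i,
    % R is the N-by-1 reward vector, gamma is the discount factor, and
    % trans_prob(i, a, j) gives the probability of reaching state j from state i
    % when attempting action a. For terminal states the helper is assumed to return
    % zero for every successor, so their utility reduces to their own reward.
    function U = policy_evaluation_sketch(policy, R, gamma, trans_prob, N)
        T = zeros(N, N);
        for i = 1:N
            for j = 1:N
                T(i, j) = trans_prob(i, policy(i), j);
            end
        end
        % U = R + gamma * T * U  rearranges to  (I - gamma * T) * U = R.
        U = (eye(N) - gamma * T) \ R(:);
    end

Because the transition matrix is (sub)stochastic, I - gamma*T is invertible for any discount factor strictly less than one, which is why a direct solve is both fast and exact for MDPs of the size used in this assignment.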
