CSE151 Assignment 2: Markov Decision Processes in the Grid World

Grace Lin, A484, gclin@ucsd.edu
Tom Maddock, A55645, tmaddock@ucsd.edu

Abstract

Markov decision processes exemplify sequential decision problems: problems, defined by a transition model and a reward function, that are situated in uncertain environments. The environment we study is the grid world, which is analogous to the problem of finding the best moves in a board game. Our objective is to take a grid world containing obstacles and terminal states and solve for the optimal policy of that world. In the optimal policy, every state that is not an obstacle or a terminal state is assigned a direction, namely the direction that leads to the best expected outcome from that state. To solve for the optimal policy, we use two algorithms: value iteration and policy iteration. When comparing the results of the two algorithms, we found that they produced similar policies; however, policy iteration ran much faster than value iteration.

1 Introduction

One of the most significant and popular problems in Artificial Intelligence is solving sequential decision problems. A sequential decision problem is a situation in which an agent's utility depends on a sequence of decisions that the agent makes. We focus on sequential decision problems because they lay the groundwork for reinforcement learning. They are also interesting because of the components that make up such a problem and the way each component affects the solution. The framework we use, a Markovian transition model with additive rewards, is called a Markov decision process (MDP). Our model is a grid world: an m-by-n world that contains obstacles, terminal states, and an initial state, and in which every state has a reward value that may be negative or positive. The grid world also has a probabilistic move model that specifies the probability of moving in each of the four directions (North, South, East, and West) given that the agent has chosen to move in a particular direction.
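To make this setting concrete, a grid world of this kind can be written down with just a few Matlab arrays. The sketch below is purely illustrative: the variable names and the specific reward, terminal, obstacle, and move-probability values are example choices for exposition, not the data actually supplied with the assignment.

    % Illustrative grid world encoding (example values only, not the
    % assignment's actual data files).
    rewards = -0.04 * ones(3, 4);      % reward for every ordinary state
    rewards(1, 4) =  1;                % a positive terminal state
    rewards(2, 4) = -1;                % a negative terminal state
    terminals = [1 4; 2 4];            % (row, col) coordinates of terminal states
    obstacles = [2 2];                 % (row, col) coordinates of obstacle states
    gamma = 0.9;                       % discount factor
    % Probabilistic move model: the agent moves in the chosen direction with
    % probability p_intended, and slips to either perpendicular direction
    % with probability p_side each (example values).
    p_intended = 0.8;
    p_side = 0.1;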

Given a grid world with obstacles, terminals, a start state, rewards, and a move model, we want to find the optimal policy of the grid world: the policy that yields the highest expected utility. The result is a world in which every state is labeled with a direction, the best direction to take when the agent is in that state, where "best" means the direction that yields the highest expected utility. To solve for the optimal policy we use two different algorithms, value iteration and policy iteration. Each algorithm produces an optimal policy. We then examine the results of the two algorithms and compare them with each other.

2 Methods

2.1 Matlab Code Overview

For this assignment, we wrote a handful of Matlab functions that find solutions to a given MDP. The two top-level functions, value_iteration.m and policy_iteration.m, are the core result of our work; they produce, respectively, a utility function and an optimal policy for a given MDP. The display_agent_choice.m and policy_evaluation.m functions complement them by computing, respectively, an optimal policy given a utility function, and a utility function given a policy. Our next_state_utility.m function supports these functions by computing the expected utility of the next state, given a current state, an action, and a utility function. Finally, the iterate_rewards.m, compute_performance.m, and create_figure.m functions provide convenient ways of investigating the performance of our solution algorithms and plotting the results.

2.2 Solution Algorithms

Our top-level functions, mentioned above, implement the value iteration and policy iteration algorithms for solving MDPs. Our implementation of value iteration is relatively straightforward: it iteratively applies the Bellman update to a utility function until the resulting utilities are within a given error bound. For policy iteration, we chose to implement the policy evaluation step by solving a system of linear equations instead of using modified policy iteration. We felt that, for the size of the MDPs given in this assignment, this was the preferable method for policy evaluation in terms of both speed and accuracy. Additionally, our policy iteration algorithm starts from a random initial policy rather than a fixed one, which it then iteratively refines until it becomes optimal.

2.3 Measuring Performance

For this assignment, we tested our solution algorithms by measuring their performance while they solved two MDPs. The first (MDP 1) was a grid world built from the data in rewards.txt and terminals.txt; the second (MDP 2) used the data from newrewards.txt and newterminals.txt. To measure the performance of these MDP solution algorithms, we added provisions that let them output a history of all the utility functions computed during their iterations, which allows us to observe the evolution of the utilities as they converge on their final values. From this history we derived the maximum error and the policy loss at each iteration. From these data we could measure how quickly and how accurately each algorithm found a solution to the given MDP, and compare the two algorithms on that basis.
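To make the description above concrete, the sketch below condenses the two core computations into simplified Matlab. It is not the actual contents of our value_iteration.m or policy_evaluation.m files: the function names, the argument layout (rewards as an n-by-1 vector R, per-action transition matrices P, discount gamma), and the particular stopping criterion shown are simplifications chosen for exposition, and each function would live in its own .m file.

    % Simplified value iteration: repeatedly apply the Bellman update
    %   U(s) <- R(s) + gamma * max_a sum_s' P(s'|s,a) U(s')
    % until the largest change in utility falls below a threshold
    % (assumes gamma < 1).
    function U = value_iteration_sketch(R, P, gamma, epsilon)
        % R: n-by-1 rewards, P: n-by-n-by-k with P(s, s', a) = P(s'|s,a)
        n = numel(R);                       % number of states
        k = size(P, 3);                     % number of actions
        U = zeros(n, 1);
        while true
            Q = zeros(n, k);
            for a = 1:k
                Q(:, a) = R + gamma * P(:, :, a) * U;   % value of taking action a
            end
            Unew = max(Q, [], 2);                       % Bellman backup
            delta = max(abs(Unew - U));
            U = Unew;
            if delta < epsilon * (1 - gamma) / gamma    % one common error bound
                break;
            end
        end
    end

    % Simplified policy evaluation by solving the linear system
    %   (I - gamma * T_pi) U = R,  where T_pi(s, s') = P(s' | s, pi(s)).
    function U = policy_evaluation_sketch(policy, R, P, gamma)
        n = numel(R);
        Tpi = zeros(n, n);
        for s = 1:n
            Tpi(s, :) = P(s, :, policy(s));   % transition row under the policy's action
        end
        U = (eye(n) - gamma * Tpi) \ R;       % exact utilities of the given policy
    end

Solving the linear system directly, as in the second sketch, is what Section 2.2 refers to: for MDPs of this size it is cheap and gives exact utilities for the evaluated policy.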

3 Results

3.1 Investigating Value Iteration

In the first part of our assignment, we investigated the performance of our value iteration algorithm by having it solve both of the provided MDPs with various discount factors and watching the progress of the algorithm as it iterated. Figures 3.1 and 3.2 below show the algorithm's progress as it finds a solution for MDP 1 and MDP 2, respectively, with a discount factor of 0.9. In each figure, the utilities of the states (excluding the obstacle state) are shown in the left-hand graph, and the maximum error and policy loss of the utility function are shown in the right-hand graph. Each graph is plotted against the number of iterations of the algorithm.

Figure 3.1: Value Iteration Performance for MDP 1 (left: state utilities for γ = 0.9; right: maximum error and policy loss; both versus iteration).

Figure 3.2: Value Iteration Performance for MDP 2 (left: state utilities for γ = 0.9; right: maximum error and policy loss; both versus iteration).

We then used our value iteration algorithm, in conjunction with our display_agent_choice.m function, to produce the optimal policies for MDPs 1 and 2 as a function of the reward given to non-terminal states. Figures 3.3 and 3.4 below show the results of this experiment. Note that an X in the figures denotes that any action was optimal in that state for the given reward range.
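For reference, producing a policy from a utility function, as display_agent_choice.m does for these figures, amounts to picking, in each non-terminal and non-obstacle state, the action with the highest expected next-state utility. The sketch below illustrates the idea only; the helper predicates (is_terminal, is_obstacle) and the exact signature of next_state_utility are assumptions, not our actual code.

    % Illustrative greedy policy extraction from a utility function U.
    actions = {'N', 'S', 'E', 'W'};
    policy = cell(size(U));
    for s = 1:numel(U)
        if is_terminal(s) || is_obstacle(s)       % assumed helper predicates
            continue;                             % terminals and obstacles get no action
        end
        best = -inf;
        for a = 1:numel(actions)
            % expected utility of the next state if we choose actions{a} in s
            eu = next_state_utility(s, actions{a}, U);
            if eu > best
                best = eu;
                policy{s} = actions{a};
            end
        end
    end

Ties between actions, where several directions achieve the same best expected utility, correspond to the X entries in Figures 3.3 and 3.4.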

Figure 3.3: Optimal policies for MDP 1 for different ranges of the non-terminal reward R, γ = 0.99999.

Figure 3.4: Optimal policies for MDP 2 for different ranges of the non-terminal reward R, γ = 0.99999.

3.2 Investigating Policy Iteration

For the second part of the assignment, we investigated the performance of policy iteration by having it find solutions for MDPs 1 and 2 with various discount factors. Figures 3.5 and 3.6 below show the algorithm's progress while solving these MDPs with a discount factor of 0.9, plotted against the number of iterations taken.

Figure 3.5: Policy Iteration Performance for MDP 1 (left: state utilities for γ = 0.9; right: maximum error and policy loss; both versus iteration).

Figure 3.6: Policy Iteration Performance for MDP 2 (left: state utilities for γ = 0.9; right: maximum error and policy loss; both versus iteration).

4 Discussion

4.1 Value Iteration

We charted the performance of value iteration, as it solves the MDPs given in the provided text files, in Figures 3.1 and 3.2 above, and we notice a couple of interesting things about the behavior of the algorithm. First, some of the states' utility values converged faster than others. The states that tended to converge more slowly were often the states closer to the start of an optimal path. Because the Bellman equation makes the utility of each state depend on the utility of its best next state, a state with many future states after it along an optimal path will take longer to settle: the Bellman updates from those states must propagate back down the path before they reach it. This result is therefore reasonable.

Although our graphs do not show it, we also noticed that as the discount factor for a given MDP increased, the number of iterations needed before the utility values converged increased as well. This result, too, is to be expected. With a high discount factor, the utility of each state is more heavily influenced by the utilities of the states that come after it, so more Bellman updates are needed to propagate the utilities of future states through the world before all the states preceding them can have their own utilities properly set.

In general, we can also see that the policy loss produced by the value iteration algorithm approaches zero much faster than the maximum error, demonstrating that the utility values are often able to produce an optimal policy long before the utilities themselves are close enough to be considered correct. This, of course, is the motivation behind the policy iteration algorithm, which we cover in the next section. However, it is interesting to note that the policy loss for value iteration is not always monotonically decreasing, as can be seen in the accuracy plots of Section 3.1. This is presumably an effect of the non-uniform convergence of the utilities during value iteration, which can cause the policy at some iteration to be less optimal than the policy at the previous iteration.
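Concretely, the maximum error and policy loss plotted above can be derived from the utility history described in Section 2.3. The sketch below shows one plausible way to do this; it assumes the final utilities in the history are treated as the converged values, and the helper names greedy_policy and evaluate_policy stand in for the policy-extraction and policy-evaluation steps described earlier rather than our exact code.

    % Illustrative computation of max error and policy loss from a utility
    % history (one column of utilities per iteration).
    Ustar = history(:, end);                     % treat final utilities as converged
    niters = size(history, 2);
    max_err = zeros(1, niters);
    policy_loss = zeros(1, niters);
    for i = 1:niters
        Ui = history(:, i);
        max_err(i) = max(abs(Ui - Ustar));       % worst-case utility error
        pol = greedy_policy(Ui);                 % assumed: greedy policy w.r.t. Ui
        Upol = evaluate_policy(pol);             % assumed: exact utilities of that policy
        policy_loss(i) = max(abs(Upol - Ustar)); % how much utility that policy gives up
    end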

In Figures 3.3 and 3.4, we indicated what the optimal policies for both MDPs would be, depending on the reward value for the non-terminal states. Most noticeable from these results is that some of our reward ranges in Figure 3.3 do not exactly match the reward ranges given in the book for the same grid world (Russell & Norvig 66). We assume this is because we do not know the exact discount factor with which the results in the book were computed, and therefore we cannot expect our results to be exactly the same. However, our policies for reward ranges in the same neighborhood do match the book's, so we feel our results are reasonable. That said, we found reward ranges for MDP 1 in which the optimal policy changes (six more than in the book), and eight reward ranges for MDP 2 that influence the optimal policy.

It is interesting to note that, for both MDPs, when the reward for the non-terminal states was strictly greater than zero, our algorithm was indecisive about the optimal action for the states that had no terminal neighbors, because the utilities of neighboring states would keep increasing in value, making all directions equally optimal. Apparently, then, an optimal agent would do its best to stay away from all terminal states and maximize its reward by moving around the world indefinitely. For MDP 2 in particular, it was also interesting to see that for R > .99 the optimal policy was to pass by the smaller positive reward state and instead try to reach the +4.7 reward state that lay beyond. Even more interesting, for R > -.3 the optimal policy was to avoid that positive reward state at all costs by going West in (3,), hoping for the chance that the agent would end up going South instead, towards the +4.7 state.

4.2 Policy Iteration

When we ran policy iteration on the same MDPs, we noticed that it behaved a lot like value iteration. In both algorithms, the utility values start out relatively inaccurate and are refined with each iteration until they converge. Unlike value iteration, however, the initial utility values in our policy iteration algorithm always start at random values (since the initial policy is random), so the number of iterations it takes to converge is never fixed. This is, of course, a consequence of our implementation. Still, we saw that policy iteration generally converged much faster than value iteration. For example, comparing Figures 3.1 and 3.5, policy iteration converges roughly four times faster than value iteration: in Figure 3.5, all states have converged by the third iteration, while in Figure 3.1, value iteration takes at least four times as many iterations even to get close to the correct utility values. We also found that, for policy iteration, varying the discount factor does not appear to influence the number of iterations required to find the optimal policy, unlike for value iteration. This result seems reasonable, first because our policy iteration algorithm starts with a random policy for the first iteration, and second because the convergence of policy iteration does not depend directly on Bellman updates propagating through the state space, as it does for value iteration.

5 Conclusion

In this assignment, Tom wrote the majority of the value iteration functions in Matlab, prepared the figures and graphs, and helped edit and proofread the report. He learned more about the relationship between utility functions and policy functions in an MDP, the limits of value and policy iteration, and how to create and save plots of 3-dimensional matrices of data in Matlab.
Grace wrote the majority of the policy iteration functions in Matlab, ran the algorithms on the given MDPs with various parameters, and wrote a good share of the report. She developed a better understanding of MDPs, value iteration, and policy iteration, and also learned more about the Matlab environment, including how to work with matrices and systems of linear equations.