CSE5 Assignment: Markov Decision Processes in the Grid World

Grace Lin A484 gclin@ucsd.edu
Tom Maddock A55645 tmaddock@ucsd.edu

Abstract

Markov decision processes exemplify sequential decision problems: problems defined by a transition model and a reward function, situated in an uncertain environment. The environment we observe is the grid world, which is analogous to the problem of solving for the best moves in a board game. Our objective is to take a grid world containing obstacles and terminal states and solve for the optimal policy in that world. In the optimal policy, each state that is not an obstacle or a terminal state is assigned a direction: the direction that leads to the best expected outcome from that state. To solve for the optimal policy, we use two algorithms, value iteration and policy iteration. The two algorithms produced similar results, but policy iteration ran much faster than value iteration.

1 Introduction

One of the most significant and popular aspects of Artificial Intelligence is solving sequential decision problems. We can view a sequential decision problem as a situation where an agent's utility depends on a sequence of decisions that the agent makes. We focus on sequential decision problems because they lay the groundwork for reinforcement learning. Sequential problems are intriguing because of the components that make up the problem and how each component affects the solution. The framework we focus on, a Markovian transition model with additive rewards, is called a Markov decision process (MDP). The model is of a grid world: an m-by-n world that contains obstacles, terminal states, and a starting initial state, and in which each state has a reward value that can be negative or positive.
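The report's code is written in Matlab; purely as an illustration of the grid-world description above, here is a minimal Python sketch. The class and field names are hypothetical, not the report's actual data structures.

```python
import numpy as np

# Hypothetical sketch of a grid-world MDP description; the field names
# are illustrative, not the report's actual Matlab data structures.
class GridWorld:
    def __init__(self, rewards, obstacles, terminals, start):
        self.rewards = np.asarray(rewards, dtype=float)  # m x n rewards
        self.obstacles = set(obstacles)   # cells the agent cannot enter
        self.terminals = set(terminals)   # cells that end an episode
        self.start = start                # initial state, as (row, col)

    def states(self):
        """All non-obstacle states of the world."""
        m, n = self.rewards.shape
        return [(r, c) for r in range(m) for c in range(n)
                if (r, c) not in self.obstacles]
```

A world is then just a reward matrix plus the sets of special cells, which is all the solution algorithms below need.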
The grid world also has a probabilistic move model, which specifies the probability of actually moving in each of the four directions (North, South, East, and West) given that the agent has chosen to go in a particular direction.
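The exact move probabilities come with the assignment data and are not restated here; as an illustration only, the sketch below assumes the classic slip model from Russell & Norvig (probability 0.8 of moving in the chosen direction, 0.1 for each perpendicular direction).

```python
# Hypothetical encoding of a stochastic grid-world move model.
# The 0.8 / 0.1 / 0.1 slip probabilities are an assumption; the
# assignment's actual move model may differ.
DIRS = {"N": (-1, 0), "S": (1, 0), "E": (0, 1), "W": (0, -1)}
PERP = {"N": ("E", "W"), "S": ("E", "W"), "E": ("N", "S"), "W": ("N", "S")}

def move_distribution(action, p_intended=0.8, p_slip=0.1):
    """Return {direction: probability} for a chosen action."""
    left, right = PERP[action]
    return {action: p_intended, left: p_slip, right: p_slip}
```

This distribution is what gives each state-action pair its transition probabilities in the MDP.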
Given a grid world with obstacles, terminals, a start state, rewards, and moves, we find the optimal policy of the grid world: the policy that yields the highest expected utility. The result is a world where each state is labeled with a direction, the best direction to take when the agent is at that state, where "best" means the direction that yields the highest expected utility. To solve for the optimal policy, we use two different algorithms: value iteration and policy iteration. Each algorithm produces an optimal policy. We observe the results of these two algorithms and compare them with each other.

2 Methods

2.1 Matlab Code Overview

For this assignment, we wrote a handful of Matlab functions that find solutions to a given MDP. The two top-level functions, value_iteration.m and policy_iteration.m, are the core result of our work, and produce, respectively, a utility function and an optimal policy for a given MDP. The display_agent_choice.m and policy_evaluation.m functions complement them by computing, respectively, an optimal policy given a utility function, and a utility function given a policy. Our next_state_utility.m function provides computational support to these functions by finding the expected utility of the next state, given a current state, an action, and a utility function. Finally, the iterate_rewards.m, compute_performance.m, and create_figure.m functions provide convenient methods for investigating the performance of our solution algorithms and plotting the results.

2.2 Solution Algorithms

Our top-level functions, mentioned above, implement the value iteration and policy iteration algorithms for solving MDPs. Our implementation of value iteration is relatively straightforward: it iteratively applies the Bellman update to a utility function until the resulting utilities are within a given error bound.
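The actual value_iteration.m is a Matlab function; the Python sketch below shows the same idea (repeated Bellman updates until the utilities are within an error bound). The data layout and names are assumptions for illustration.

```python
import numpy as np

def value_iteration(P, R, gamma, eps=1e-6):
    """Minimal sketch of value iteration (illustrative, not the
    report's Matlab implementation).

    P     -- dict mapping each action to an (n x n) transition matrix,
             where P[a][s, t] = probability of reaching state t from
             state s under action a
    R     -- length-n vector of state rewards
    gamma -- discount factor
    eps   -- error bound on the resulting utilities
    """
    U = np.zeros(len(R))
    while True:
        # Bellman update: U'(s) = R(s) + gamma * max_a sum_t P(t|s,a) U(t)
        U_new = R + gamma * np.max([P[a] @ U for a in P], axis=0)
        # Stop once the largest change guarantees utilities within eps
        if np.max(np.abs(U_new - U)) < eps * (1 - gamma) / gamma:
            return U_new
        U = U_new
```

The stopping test uses the standard contraction bound: if the largest single-step change falls below eps(1-gamma)/gamma, the utilities are within eps of their true values.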
For policy iteration, we chose to implement the policy evaluation step by solving a system of linear equations, instead of using modified policy iteration. We felt that, for the size of the MDPs given in this assignment, this was the preferred method for policy evaluation, in terms of speed as well as accuracy. Additionally, our policy iteration algorithm starts with a random initial policy, instead of a fixed one, which it then iteratively refines until it becomes optimal.

2.3 Measuring Performance

For this assignment, we tested our solution algorithms by measuring their performance while they solved two MDPs. The first of these (MDP 1) was a grid world built from the data in rewards.txt and terminals.txt. The second (MDP 2) used data from newrewards.txt and newterminals.txt. To measure the performance of these MDP solution algorithms, we added extra provisions that enable them to output a history of all the utility functions computed across their iterations. This allows us to observe the evolution of the utilities as they converge on their final values. From this history we then derived values for the maximum error and policy loss at each iteration. From this data we could measure how quickly and accurately each algorithm found a solution to the given MDP, and compare the algorithms on these results.
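The policy iteration approach described in Section 2.2 (exact policy evaluation by a linear solve, starting from a random policy) can be sketched as follows. Again this is an illustrative Python version with assumed names, not the report's Matlab code.

```python
import numpy as np

def policy_evaluation(policy, P, R, gamma):
    """Evaluate a fixed policy exactly by solving the linear system
    (I - gamma * P_pi) U = R, rather than by repeated Bellman backups."""
    n = len(R)
    P_pi = np.array([P[policy[s]][s] for s in range(n)])  # rows under pi
    return np.linalg.solve(np.eye(n) - gamma * P_pi, R)

def policy_iteration(P, R, gamma, seed=0):
    """Policy iteration starting from a random initial policy."""
    rng = np.random.default_rng(seed)
    actions = list(P)
    n = len(R)
    policy = rng.choice(actions, size=n)        # random initial policy
    while True:
        U = policy_evaluation(policy, P, R, gamma)
        # Greedy improvement: best one-step lookahead action per state
        new = np.array([max(actions, key=lambda a: P[a][s] @ U)
                        for s in range(n)])
        if np.array_equal(new, policy):
            return policy, U
        policy = new
```

Because each evaluation is exact, the loop terminates as soon as the greedy improvement step leaves the policy unchanged.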
3 Results

3.1 Investigating Value Iteration

In the first part of our assignment, we investigated the performance of our value iteration algorithm by having it solve both of the provided MDPs, with various discount factors, and watching the progress of the algorithm as it iterated. Figures 3.1 and 3.2 below show the algorithm's progress as it finds a solution for MDP 1 and MDP 2, respectively, with a discount factor of 0.9. In each figure, the utilities of the states (excluding the obstacle state) are shown in the left-hand graph, and the maximum error and policy loss of the utility function are shown in the right-hand graph. Each graph is plotted against the number of iterations of the algorithm.

[Figure 3.1: Value Iteration Performance for MDP 1. Left: state utilities per iteration for γ = 0.9; right: maximum error and policy loss per iteration.]

[Figure 3.2: Value Iteration Performance for MDP 2. Left: state utilities per iteration for γ = 0.9; right: maximum error and policy loss per iteration.]

We then used our value iteration algorithm, in conjunction with our display_agent_choice.m function, to produce the optimal policies for MDPs 1 and 2 as a function of the reward given for non-terminal states. Figures 3.3 and 3.4 on the following page show the results of this experiment. Note that an X in the figures denotes that any action was optimal in that state for the given reward range.
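The accuracy curves report maximum error and policy loss per iteration, derived from the utility history described in Section 2.3. A hedged Python sketch of that derivation, assuming the standard definitions (maximum error = ||U_i - U*||∞; policy loss = ||U^π_i - U*||∞, where π_i is the greedy policy for U_i):

```python
import numpy as np

def greedy_policy(U, P):
    """One-step-lookahead greedy policy for a utility function U."""
    actions = list(P)
    return [max(actions, key=lambda a: P[a][s] @ U) for s in range(len(U))]

def accuracy_history(history, P, R, gamma, U_star):
    """For each utility function U_i in an iteration history, return
    (maximum error, policy loss) measured against the optimal U_star."""
    n = len(R)
    results = []
    for U_i in history:
        err = np.max(np.abs(U_i - U_star))        # ||U_i - U*||_inf
        pi = greedy_policy(U_i, P)
        # Evaluate the greedy policy exactly: (I - gamma * P_pi) U = R
        P_pi = np.array([P[pi[s]][s] for s in range(n)])
        U_pi = np.linalg.solve(np.eye(n) - gamma * P_pi, R)
        results.append((err, np.max(np.abs(U_pi - U_star))))
    return results
```

Note that the policy loss can hit zero while the maximum error is still large, which is exactly the behavior the accuracy graphs exhibit.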
[Figure 3.3: Optimal policies for MDP 1 for different ranges of rewards, γ = 0.99999.]

[Figure 3.4: Optimal policies for MDP 2 for different ranges of rewards, γ = 0.99999.]

3.2 Investigating Policy Iteration

For the second part of the assignment, we investigated the performance of policy iteration by having it find solutions for MDPs 1 and 2, with various discount factors. Figures 3.5 and 3.6 below show the algorithm's progress while solving these MDPs with a discount factor of 0.9, plotted against the number of iterations taken.

[Figure 3.5: Policy Iteration Performance for MDP 1. Left: state utilities per iteration for γ = 0.9; right: maximum error and policy loss per iteration.]
[Figure 3.6: Policy Iteration Performance for MDP 2. Left: state utilities per iteration for γ = 0.9; right: maximum error and policy loss per iteration.]

4 Discussion

4.1 Value Iteration

We charted the performance of value iteration, as it solves the MDPs given in the provided text files, in Figures 3.1 and 3.2 above, and notice a couple of interesting things about the algorithm's behavior. First, some states' utility values converged faster than others. The states that tended to converge more slowly were often the states nearer the start of an optimal path. Since the Bellman equation makes the utility of each state depend on the utility of its best next state, a state with many future states after it on an optimal path should take longer to converge: the Bellman updates from those future states must propagate down the path before reaching the states near its beginning. Thus, this result is reasonable. Although our graphs do not show it, we also noticed that as the discount factor for the given MDP increased, the number of iterations it took for the utility values to converge increased as well. This result, too, is to be expected. With a high discount factor, the utility of each state is more heavily influenced by the utilities of the states that come after it. Thus, more Bellman updates are necessary to propagate the utilities of future states through the world before all the states that precede them can have their own utilities properly set. In general, we can also see that the policy loss produced by the value iteration algorithm approaches zero much faster than the maximum error, demonstrating that the utility values are often able to produce an optimal policy long before the values themselves are close enough to be considered correct.
This, of course, is the motivation behind the policy iteration algorithm, which we cover in the next section. However, it is interesting to note that the policy loss for value iteration is not always monotonically decreasing, as our accuracy graphs show. This is, presumably, an effect of the non-uniform convergence of the utilities during value iteration, which can cause the policy at some iteration of the algorithm to be less optimal than the policy at the previous iteration. In Figures 3.3 and 3.4, we indicated what the optimal policies for both of the MDPs would be, depending on the reward value for the non-terminal states. Most noticeable from these results is that some of our reward ranges in Figure 3.3 do not
exactly match the reward ranges given in the book for the same grid world (Russell & Norvig 66). We assume this is because we do not know the exact discount factor with which the book's results were computed, and therefore cannot assume that our results should be exactly the same. However, our policies for reward ranges in the same neighborhood do match the book's, so we feel our results are reasonable. That being said, we found more reward ranges for MDP 1 where the optimal policy changes (6 more than in the book), and 8 reward ranges for MDP 2 that influence the optimal policy. It is interesting to note that, for both MDPs, when the reward for the non-terminal states was strictly greater than zero, our algorithm was indecisive about the optimal action for the states that had no terminal neighbors, because the utilities of neighboring states would continue to increase in value, making all directions equally optimal. Apparently, then, an optimal agent would do its best to stay away from all terminal states, and maximize its reward by moving around the world indefinitely. For MDP 2, in particular, it was also interesting to see that for R > -0.99, the optimal policy was to pass by the +1 reward state and instead try to reach the +4.7 reward state that lay beyond. Even more interesting, for R > -0.3, the optimal policy was to avoid the +1 reward state at all costs by going West in (3,), and hope for the small chance that the agent would end up going South instead, towards the +4.7 state.

4.2 Policy Iteration

When we ran policy iteration on the same MDPs, we noticed that it behaved much like value iteration. In both algorithms, the utility values started out relatively inaccurate, and were then refined with each iteration until they converged.
Unlike value iteration, however, the initial utilities in our policy iteration algorithm always started at random values, so the number of iterations needed for convergence was never fixed. This is, of course, a consequence of our implementation. Still, we saw that policy iteration generally converged much faster than value iteration. For example, comparing Figures 3.1 and 3.5, policy iteration converges four times faster than value iteration: in Figure 3.5, all states have converged by the third iteration, while in Figure 3.1, value iteration takes many more iterations to even get close to the correct utility values. We also found that, for policy iteration, varying the discount factor does not appear to influence the number of iterations required to find the optimal policy, unlike for the value iteration algorithm. This result seems reasonable: first, our policy iteration algorithm starts with a random policy for the first iteration, and second, the convergence of policy iteration does not depend directly on Bellman updates propagating through the state space, as it does for value iteration.

5 Conclusion

In this assignment, Tom wrote the majority of the value iteration functions in Matlab, prepared the figures and graphs, and helped edit and proofread the report. He learned more about the relationship between utility functions and policy functions in an MDP, the limits of value and policy iteration, and how to create and save plots of 3-dimensional matrices of data in Matlab. Grace wrote the majority of the policy iteration functions in Matlab, ran the algorithms on the given MDPs with various parameters, and wrote a good share of the report. She developed a better understanding of MDPs, value iteration, and policy iteration, and learned more about the Matlab environment, including how to work with matrices and linear equations.