Polynomial Value Iteration Algorithms for Deterministic MDPs

Omid Madani
Department of Computing Science, University of Alberta, Edmonton, AL, Canada T6G 2E8

Abstract

Value iteration is a commonly used and empirically competitive method for solving many Markov decision process problems. However, it is known that value iteration has only pseudo-polynomial complexity in general. We establish a somewhat surprising polynomial bound for value iteration on deterministic Markov decision (DMDP) problems. We show that the basic value iteration procedure converges to the highest average reward cycle on a DMDP problem in $\Theta(n^2)$ iterations, or $\Theta(mn^2)$ total time, where $n$ denotes the number of states and $m$ the number of edges. We give two extensions of value iteration that solve the DMDP in $\Theta(mn)$ time. We explore the analysis of policy iteration algorithms and report on an empirical study of value iteration showing that its convergence is much faster on random sparse graphs.

1 Introduction

Markov decision processes offer a clean and rich framework for problems of control and decision making under uncertainty [bdh99, RN95]. Infinite-horizon fully observable MDP problems (MDPs) are classic optimization problems in this framework. Not only are the MDP problems significant on their own, but solutions to these problems are used repeatedly in solving problem variants such as stochastic games and partially observable MDPs [Sha53, Han98]. Preferred methods for solving MDP problems use dynamic programming techniques, and in particular often contain a so-called value iteration or policy iteration loop [Put94, Lit96, Han98, GKP0]. These methods converge to optimal solutions quickly in practice, but we know little about their asymptotic complexity. It is known, however, that algorithms based on value iteration have no better than a pseudo-polynomial run time (footnote 1) on MDP problems [Tse90, Lit96].

Footnote 1: An algorithm has pseudo-polynomial run-time complexity if it runs in time polynomial in the unary representation of the numbers in the input, but exponential in the binary representation.

In this paper, we analyze the basic value iteration procedure on the deterministic MDP problem under the average reward criterion, or the DMDP problem, and we establish several positive results. The DMDP problem is also known as the maximum (or minimum) mean cycle problem in a directed weighted graph [AMO93]. In solving DMDPs, we are often interested in finding a highest average weight cycle (an optimal cycle), or the average weight of such a cycle (the highest mean). A policy in this problem is simply a subgraph in which each vertex (state) has a single out-edge (action choice) leading by a directed path to an optimal cycle. Just as is the case for general MDPs, the DMDP has both direct and indirect applications, for example in solving network flow problems and in system-performance analysis [AMO93, DG98, CTCG 98].

We establish that in graphs with $n$ vertices and $m$ edges, basic value iteration converges to an optimal cycle in a DMDP in $O(n^2)$ iterations, irrespective of the initial assignment of values. This is somewhat unexpected considering that value iteration is generally pseudo-polynomial. We also show that the bound is tight by giving an example on which value iteration takes $\Omega(n^2)$ iterations. We note that while an optimal cycle is found in polynomial time, examples exist where value iteration still takes pseudo-polynomial time to converge to an optimal policy. This occurs because it may take many iterations before states that do not reside on an optimal cycle choose their optimal action (Section 3.1). Therefore, in a sense, value iteration is almost polynomial in finding optimal policies for DMDPs. Nevertheless, finding the optimal cycles is the main task in DMDPs, and we expect that value iteration converges much faster in practice. Our experiments on random graphs show that value iteration converges exponentially faster than the $n^2$ worst-case bound would suggest (see Section 6).

The insight from the analyses allows us to show that, by making small modifications to value iteration, optimal cycles (and policies) can be found in $\Theta(n)$ iterations. As each iteration takes $\Theta(m)$ time, this gives an algorithm that ties the only other algorithm with the same run time of $\Theta(mn)$ [Kar78, AMO93]. An algorithm described here has the additional interesting property that it is distributed: each vertex performs a simple local computation and need only communicate with its immediate neighbors via its edges, and by $\Theta(n)$ iterations all vertices will know the highest mean. With $\Theta(n)$ more iterations, the optimal cycle would also form. We remark that the algorithmic technique developed here also extends to give polynomial algorithms for more general problem classes where edges may have two parameters: a probability or (time) cost in addition to a reward [Mad02].

We also give a polynomial algorithm that is similar to the multi-chain policy iteration algorithm for the DMDP [Put94]. The polynomial bound proof for this algorithm is identical to that for simple value iteration, but we conjecture that the bound is not tight. We describe the similarity to policy iteration and the difficulty of analyzing policy iteration on DMDPs, and indicate promising ways of addressing the open problems.

We investigate the convergence of value iteration on random sparse graphs. The experiments suggest that value iteration converges to optimal cycles in an expected $O(\log n)$ many iterations, i.e., exponentially faster than what the worst-case bound indicates. These experiments provide valuable insights into why such algorithms have excellent performance in practice.

We begin the paper with problem definitions and notation in the next section. In Section 3, we describe our analysis of value iteration. Section 4 presents two modified algorithms with $\Theta(mn)$ time. Next, we give our results on policy iteration and, in Section 6, report on our experiments and discuss previous empirical studies of algorithms on DMDPs. Section 7 concludes with a summary and a discussion of open problems and future directions.
Throughout the paper, we have tried to describe the line of argument and have given proof sketches for the important steps. Complete proofs appear in the appendix.

2 Preliminaries

We give the graph-theoretic definition of the DMDP problem here to save space. Let $G = (V, E, r)$ be a directed graph with $n$ vertices and $m$ edges, where $r$ is a function from the edge set into the real numbers such that for an edge $e$ in $E$, $r(e)$ is the reward of $e$. Each vertex has at least one out-edge. A walk $w = e_1, \ldots, e_k$ is a progression of edges such that the end vertex of $e_i$ is the start vertex of $e_{i+1}$. A walk may have repeated edges. We call an edge $e = (u, v)$ (resp. walk) connecting vertex $u$ (start vertex $u$) to end-vertex $v$ a $u$-$v$ edge (resp. walk). For a walk $w$, let $|w|$ denote the number of edges in $w$, let $R(w) = \sum_{i=1}^{|w|} r(e_i)$ be its total reward, and let $\bar{R}(w) = R(w)/|w|$ be its average reward or mean. A cycle is a walk where the start vertex of $e_1$ is the same vertex as the end vertex of $e_k$, and no other vertex is repeated. Let $\lambda^* = \max_C \bar{R}(C)$, where $C$ ranges over all cycles in $G$. We call $\lambda^*$ the maximum mean, and in solving a DMDP problem we are interested in either computing $\lambda^*$ or finding an optimal cycle with mean $\lambda^*$ (the problems are equivalent).

The DMDP problem was shown solvable in $O(mn)$ time by Karp [Kar78]. Later, algorithms were given with run times of $O(mn \log n)$, and these algorithms are believed to be faster than (unmodified) Karp's algorithm in practice [YTO9], as Karp's algorithm takes $\Theta(mn)$ time irrespective of the underlying graph. To the best of our knowledge, all the algorithms with known $O(mn)$ run time prior to our work [HO93, DG98] are modifications of Karp's algorithm.

The basic value iteration process is shown in Fig. 1. Let $t = 0, 1, \ldots$ denote time points, where time $t$ is immediately before iteration $t+1$ of the algorithm. Let $x_v^{(t)}$ denote the value of vertex $v$ at time $t$; thus $x_v^{(0)}$ is its initial value. Then the value of a $u$-$v$ edge $e$ in iteration $t$, $t \ge 1$, is $r(e) + x_v^{(t-1)}$.
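To make the definition of the maximum mean concrete, here is a minimal brute-force sketch that enumerates every simple cycle of a tiny graph and returns the largest mean. It is only an illustration of the definition (its running time is exponential in $n$) and is not one of the algorithms discussed in the paper. The function name, the `(u, v, reward)` edge-list encoding and the three-vertex example graph are assumptions made for this illustration.

```python
from fractions import Fraction
from itertools import permutations

# Hypothetical example graph: (start vertex, end vertex, reward).
EDGES = [(0, 1, 5), (1, 0, 4), (1, 2, 10), (2, 2, 1), (2, 0, 0)]


def max_mean_cycle_bruteforce(n, edges):
    """Return the maximum cycle mean by enumerating all simple cycles."""
    # Keep only the best reward among parallel edges.
    best_reward = {}
    for u, v, r in edges:
        if (u, v) not in best_reward or r > best_reward[(u, v)]:
            best_reward[(u, v)] = r
    top = None
    for k in range(1, n + 1):
        for cyc in permutations(range(n), k):
            if cyc[0] != min(cyc):
                continue  # consider each cycle in one canonical rotation only
            pairs = list(zip(cyc, cyc[1:] + cyc[:1]))
            if all(p in best_reward for p in pairs):
                mean = Fraction(sum(best_reward[p] for p in pairs)) / k
                if top is None or mean > top:
                    top = mean
    return top
```

In the example graph the cycles are 0→1→0 (mean 9/2), the self-loop at 2 (mean 1) and 0→1→2→0 (mean 5), so the maximum mean is 5.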
Each vertex $u$ at each iteration $t$ performs the following computation to obtain its new value $x_u^{(t)}$:

$$x_u^{(t)} = \max_{u\text{-}v \text{ edge } e} \left( r(e) + x_v^{(t-1)} \right), \qquad t \ge 1. \tag{1}$$

We call a subgraph of $G$ with out-degree exactly one for each vertex a policy. Thus the choice of an out-edge for each vertex in each iteration of value iteration defines a policy, which we say value iteration visits (footnote 2). We say a vertex changes edges, or simply switches, at an iteration if the choice of edge according to eq. (1) changes from the previous iteration. The following assumption is important for the correctness of the algorithms: for any vertex, if the edge chosen in the previous iteration ties for the best value, then that edge is chosen again. In other words, value iteration is lazy in changing the policy from one iteration to another. Ties in the first iteration, or when the previous edge is not a highest valued edge, may be broken arbitrarily. Note that each iteration of value iteration takes $O(m)$ time.

Fig. 2 shows the first two iterations of value iteration where the vertices are assigned zero values initially. The sequence of values for vertex $u$ in the graph in Fig. 2a is: $\langle x_u^{(t)} \rangle = \langle x_u^{(0)}, x_u^{(1)}, x_u^{(2)}, \ldots \rangle$ = ¼ ½½ ½

Footnote 2: Interestingly, $\lambda^*$ has a related characterization as the maximum eigenvalue of the $n \times n$ matrix $A$ of edge weights under max-plus operations: multiplication and summation in $Ax$ translate to summation and taking the maximum, respectively [CTCG 98]. This in turn corresponds to the dynamic programming operation in eq. (1). Our result shows that for any value vector $x$, the vector $y = A^n x$ is special: performing value iteration on $y$ for at most $n^2$ iterations visits a policy containing an optimal cycle.
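The update in eq. (1), together with the lazy tie-breaking assumption, can be sketched as follows. This is a minimal illustration, not the author's implementation: the function name, the `(u, v, reward)` edge-list encoding and the example graph are assumptions, and ties in the first iteration are broken by taking the first listed edge.

```python
# Hypothetical example graph: (start vertex, end vertex, reward).
EDGES = [(0, 1, 5), (1, 0, 4), (1, 2, 10), (2, 2, 1), (2, 0, 0)]


def value_iteration(n, edges, steps, x0=None):
    """Run `steps` synchronous iterations of eq. (1).

    Returns (values, policy), where policy[u] is the end vertex of the
    out-edge chosen by u in the last iteration (None if steps == 0).
    """
    out = [[] for _ in range(n)]
    for u, v, r in edges:
        out[u].append((v, r))
    x = list(x0) if x0 is not None else [0] * n
    choice = [None] * n  # choice[u] indexes into out[u]
    for _ in range(steps):
        new_x, new_choice = [], []
        for u in range(n):
            # All values are computed from the previous value vector x.
            vals = [r + x[v] for v, r in out[u]]
            best = max(vals)
            if choice[u] is not None and vals[choice[u]] == best:
                c = choice[u]          # lazy: keep the previous edge on a tie
            else:
                c = vals.index(best)   # otherwise take the first best edge
            new_x.append(best)
            new_choice.append(c)
        x, choice = new_x, new_choice
    policy = [out[u][c][0] if c is not None else None
              for u, c in enumerate(choice)]
    return x, policy
```

On the example graph, whose optimal cycle is 0→1→2→0 with mean 5, the visited policy contains the cycle 0→1→0 after three iterations and settles on the optimal cycle from iteration 7 onward.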

1. Each vertex begins with a value
2. Repeat
3. Each vertex chooses its highest valued out-edge and obtains the value of that edge

Figure 1: Pseudocode for value iteration.

Figure 2: (a) A three-vertex DMDP graph shown with edge rewards. (b) and (c) The first two iterations of value iteration where vertices start with zero initial values; the visited policies and vertex values after each iteration are shown. (d), (e), and (f) The corresponding mean-zero parallel graph (Sect. 3), and the first two value iterations.

3 Polynomial Convergence of Value Iteration

Here, we first show that analyzing value iteration on a transformed DMDP graph where $\lambda^* = 0$, which we refer to as a mean-zero graph, suffices in showing polynomial convergence. Then we analyze value iteration on mean-zero graphs.

Consider value iteration as it proceeds on a graph $G = (V, E, r)$, where vertices are initialized with arbitrary values. The value of a $u$-$v$ walk $w$ at an iteration $t \ge |w|$ is the sum of its total reward and the value of end-vertex $v$ at $|w|$ time points ago: $R(w) + x_v^{(t-|w|)}$. For example, in Fig. 2b and 2c, the value of the $u$-$v$ walk composed of a single $u$-$v$ edge in iterations one and two is respectively and ½½ The value of the $u$-$z$ walk $w = (u, v), (v, z)$ in iteration two is also ½½. The next lemma relates the value of walks and the value of vertices. It can be shown by induction on time $t$ (or length of walks).

Lemma 3.1 $x_v^{(t)}$ is the maximum value over the values of all walks of length $t$ with start vertex $v$.

The history walk of a vertex $v$ at a given iteration is, informally, the walk formed by following the sequence of edge selections of value iteration starting at $v$ and going back in time. More precisely, at iteration $t$, if vertex $v$ has edge $e$ to $u$, then its history walk of one time step consists of edge $e$. Its history walk of $k > 1$ steps is $e$ concatenated with the history walk of length $k-1$ steps for vertex $u$ at iteration $t-1$. The history walk is necessarily a maximum valued walk by Lemma 3.1. In Fig. 2c, the history walk of length two for vertex $z$ in iteration two is $(z, v), (v, z)$ and for vertex $v$ it is $(v, u), (u, z)$.

Two graphs are parallels of one another if they have identical vertex and edge sets, but the reward function $r$ in one graph is a constant offset of the reward function $r'$ for the other, or $r'(e) = r(e) + \delta$ for some constant $\delta$. The average reward of a cycle is offset by $\delta$ as well, and therefore cycles keep their relative merits in the corresponding graphs, and in particular optimal cycles remain the same under such a transformation of edge rewards (footnote 3). Furthermore, value iteration behaves identically on parallel graphs, meaning that each vertex selects the same edge in either problem at any iteration (subject to ties), as a consequence of the next lemma:

Lemma 3.2 Consider value iteration started with equal initial values on a pair of parallel graphs $G$ and $G'$, with reward functions $r$ and $r'$ respectively, where $r'(e) = r(e) + \delta$ for some constant $\delta$. Then if $x_v^{(t)}$ is the value of vertex $v$ at time $t$ in graph $G$, $x_v^{(t)} + t\delta$ is the value of vertex $v$ in $G'$ at time $t$.

Proof (sketch). The two values of any walk of length $t$, in particular a history walk, in two parallel graphs differ by $t\delta$. We can then use Lemma 3.1.

Footnote 3: The same transformation is used in [Kar78].

Now, given a graph with maximum mean $\lambda^*$, the parallel graph with offset $-\lambda^*$ has maximum mean $0$ (Fig. 2d). As a consequence of (1) the optimal cycles being identical in parallel graphs, and (2) the identical behavior of value iteration on parallel graphs, the properties on the structure of history walks and policies that we show hold on the mean-zero graph in the next section (e.g. Lemma 3.6) also hold in the general case. Fig. 3 gives a high-level picture of the identical behavior of value iteration on parallel graphs and its rate of convergence to optimal cycles and policies.

Figure 3: Value iteration visits the same sequence of policies when it starts with identical initial values on parallel graphs. Some policies (but not vertex values) may repeat on the way to optimal policies. Convergence can take exponentially many iterations, but only $O(n^2)$ iterations for formation of optimal cycles. (Diagram labels: initial policy; $O(n^2)$ iterations until optimal cycles become permanent; pseudo-polynomial (exponential) worst case until one or more optimal policies are visited repeatedly.)

3.1 Value Iteration on Mean-Zero Graphs

The line of argument in this section is roughly as follows. We show that for any vertex, its sequence of values has a maximum and the maximum is reached in no more than $n$ iterations (Lemmas 3.3 and 3.4). For vertices in the mean-zero (optimal) cycle, once they obtain the maximum values (and a collection of other highest values to be defined), they can keep these values by choosing the edges of the optimal cycle, and we show that this is in fact what happens (Lemmas 3.5, 3.6, and 3.7).

In the mean-zero case, no cycle has a positive reward, from which the following lemma follows.

Lemma 3.3 In the mean-zero case, if a vertex $v$ has a history walk of length $k$ to itself at an iteration $t$, then $x_v^{(t)} \le x_v^{(t-k)}$, and $x_v^{(t)} < x_v^{(t-k)}$ if the cycle is suboptimal, i.e., has negative mean.

Now, consider the sequence of values of a vertex $v$, $\langle x_v^{(t)} \rangle = x_v^{(0)}, x_v^{(1)}, \ldots$, as value iteration proceeds on a mean-zero graph. The next lemma, which is central to our results, bounds the latest time an increase in the maximum value can occur over the sequence $\langle x_v^{(t)} \rangle$ and also over its subsequences. Take $p$ in the lemma as the period, or the interval at which we look at the values in the sequence $\langle x_v^{(t)} \rangle$. Consider the simplest case, $p = 1$ (i.e., look at all the values of the sequence). The lemma states that for any vertex, an increase in its maximum value over its values seen so far can occur only in the first $n$ iterations. In other words, (1) such a sequence of values has a maximum and (2) it first occurs in the first $n$ iterations. Similarly, with $p = 2$ we consider the even and odd subsequences. The lemma states that the two highest values, one over the even subsequence $\langle x_v^{(2t)} \rangle$ and the other over the odd subsequence $\langle x_v^{(2t+1)} \rangle$, appear in at most $2n$ iterations. Note that one of the two highest values is the highest value over the whole sequence ($p = 1$) and must appear in the first $n$ iterations. Similar results hold for higher periods $p$.
Lemma 3.4 For any $p \ge 1$ and any vertex $v$, at any time $l \ge p$, if the following ("dominance") property holds:

$$\forall k < l \text{ such that } k \equiv l \pmod{p}: \quad x_v^{(k)} < x_v^{(l)}, \tag{2}$$

then $l \le pn$.

Proof. Assume the dominance property holds for vertex $v$ at time $l$. Consider the subsequence $w'$ of vertices formed by examining the history walk $w$ of $v$ after every $p$ steps (for example, if $p = 2$ and the vertex sequence in history walk $w$ is $v, z, u, v, y$, then the vertices in $w'$ are $v, u, y$). Therefore, $l = |w| = p|w'|$. We can see that $w'$ cannot repeat a vertex due to the dominance property and Lemma 3.3. It follows that $|w'| \le n$, or $l \le pn$.

Lemma 3.4 has many consequences. For any integer $k \ge 1$ and any vertex $v$, let $x_v^{*i}$, $0 \le i < k$, denote the highest value vertex $v$ obtains in any iteration $t \equiv i \pmod{k}$. These values are well-defined (bounded) by Lemma 3.4. We call these values the $k$ highest values of $v$. For example, for the mean-zero parallel graph in Fig. 2d, vertex $v$ obtains the following value sequence ¼ ¼ ½ ¼ ¼ ½ ¼ (with zero initial vertex value assignment), and its two highest values are Ü ¼ ½ ¼, and Ü ½ ¼

Assume vertex $v$ is in a mean-zero (optimal) cycle $C$ with $k$ edges. Since $C$ has mean zero, it follows from Lemma 3.1 that once $v$ obtains a highest value from its $k$ highest values, it gets it back every $k$ iterations. Note that some of the $k$ highest values may be equal. Let $x_v^*$ be the highest value $v$ ever obtains. As another consequence of Lemma 3.4, $v$ obtains value $x_v^*$ in the first $n$ iterations and its $k$ highest values in no more than $kn$ iterations. We show, basically, that once the $k$ highest values reach the vertices of a mean-zero cycle $C$, the vertices of the cycle do not need to deviate from that cycle, i.e., they can always choose the edges of the cycle in every subsequent value iteration (Lemma 3.7).

The next two lemmas help us establish convergence. We observe that once a vertex $v$ in $C$ obtains its highest value $x_v^*$, its immediate neighbor in $C$ must obtain its highest value in the next iteration, and in general some vertex in $C$ obtains its highest value in every subsequent iteration. Lemma 3.5 is the generalization of this property to all $k$ highest values as defined above.
The first statement of Lemma 3.6 is a consequence of Lemma 3.3. The second statement then follows using, in addition, Lemmas 3.5 and 3.4.

Lemma 3.5 Assume vertex $u$ has an edge to $v$ in a mean-zero (i.e. optimal) cycle. Then whenever $v$ obtains its $i$th highest value $x_v^{*i}$, $u$ obtains its $(i+1)$st highest value $x_u^{*(i+1)}$ in the following iteration.

Lemma 3.6 The history walk of a vertex $v$ in an optimal cycle, whenever $v$ obtains its highest value, includes only optimal cycles. At any iteration $t \ge n$, the history walk of some vertex on an optimal cycle includes only optimal cycles.

Lemma 3.7 When all vertices in all optimal cycles have obtained their highest values, after at most $n^2$ iterations, some optimal cycle appears in all subsequent visited policies.

Proof. When the vertices of an optimal cycle find all their highest values, if some vertex chooses the out-edge in the cycle, it will not switch (change edge) again, due to our assumption of lazy policy change and Lemma 3.5. When all vertices of optimal cycles have found their highest values, whenever a vertex $v$ obtains $x_v^*$ its history walk can only contain mean-zero (in general optimal) cycles by Lemma 3.6. Consider the first such cycle in the walk. As its vertices have found their highest values, those vertices will not switch again.

Examples exist where some vertices may repeatedly change edges forever, but it can be shown that eventually all vertices will have some path to some optimal cycle in any visited policy. However, consider a vertex with a high-reward edge to a cycle of negative mean, and another edge to a mean-zero (optimal) cycle. It is not hard to see that eventually it will choose the edge to the mean-zero cycle, but this could take many iterations. This example can be formalized to show that the worst-case number of iterations of value iteration to reach optimal policies is pseudo-polynomial (see for example [ZP96]).

It is also not hard to give an example for which the time until all the highest values arrive at the vertices of an optimal cycle can be $\Theta(n^2)$. This would mean the time until an optimal cycle becomes fixed is $\Theta(n^2)$ in the worst case. Fig. 4 shows that even the first formation of an optimal cycle takes $\Theta(n^2)$ time in the worst case. In the given graph, vertex $s$ is initialized with zero, and all other vertices may be assigned any sufficiently small value.

Figure 4: An example graph where it takes $\Theta(n^2)$ iterations for the optimal cycle to form for the first time. All edges without a displayed reward have zero reward. The numbered vertices in the top row form the optimal cycle, which has $k$ vertices and zero mean. The remaining two rows have ½ vertices each. Vertex $s$ is initialized with zero and all others are initialized with a sufficiently small value.
The highest values that arrive at the vertices of the optimal cycle are ¼ ½ ½ For example, it takes iterations until ¼ first arrives at the cycle (at vertex ), and ½ ½ iterations until ½ arrives at the cycle, and in general Ì ½µ µ iterations until the th highest arrives. Until then, in each iteration at least one vertex of the optimal cycle chooses a suboptimal (downward) edge. Thus with vertices in the graph, it takes µ iterations for the optimal cycle to form. As a consequence of Lemmas 3.7, 3.4, 3.2, and the example graph given, we obtain the following theorem on value iteration on DMDPs.

Theorem 3.8 Value iteration converges to an optimal cycle in a DMDP problem in $\Theta(n^2)$ iterations.

We remark that a common variation of value iteration, referred to as Gauss-Seidel value iteration [Put94], does not necessarily converge to an optimal cycle. In this variation, the vertices are numbered, the vertex values are updated in order in each iteration, and the new value of a vertex is used as soon as it becomes available. Note that this is a natural implementation of value iteration on a sequential machine. But history walks in this case can be longer than the number of iterations, and Lemma 3.2 in particular breaks for Gauss-Seidel value iteration, i.e., Gauss-Seidel value iteration does not have identical behavior on parallel graphs. However, it can be shown that it converges to some cycle, and moreover the properties of value iteration on mean-zero graphs still hold for Gauss-Seidel.

4 Algorithms Based on Histories

We can still compute the optimal cycle using basic value iteration in $O(n)$ iterations even though convergence takes $\Theta(n^2)$ iterations. Lemma 3.6 shows us how. If we keep track of the edge chosen by each vertex in each iteration for the first $n$ iterations as value iteration progresses, we can reconstruct the cycles in the history walks, and some cycle must be optimal by Lemma 3.6.
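The idea of recording edge choices and then searching the history walks can be sketched as follows. This is a minimal illustration under stated assumptions rather than the paper's exact procedure: it runs exactly $n$ iterations of lazy value iteration, follows each vertex's history walk backward in time, splits the walk into simple cycles with a stack, and returns the best cycle mean seen. The function name, the `(u, v, reward)` edge encoding, the example graph and the use of exact (integer) rewards are assumptions.

```python
from fractions import Fraction

# Hypothetical example graph: (start vertex, end vertex, reward).
EDGES = [(0, 1, 5), (1, 0, 4), (1, 2, 10), (2, 2, 1), (2, 0, 0)]


def best_mean_from_history(n, edges, x0=None):
    """Record n iterations of edge choices, then search the history walks."""
    out = [[] for _ in range(n)]
    for u, v, r in edges:
        out[u].append((v, r))
    x = list(x0) if x0 is not None else [0] * n
    choice = [None] * n
    hist = []  # hist[t][u] = (end vertex, reward) chosen by u in iteration t+1
    for _ in range(n):
        new_x, new_choice = [], []
        for u in range(n):
            vals = [r + x[v] for v, r in out[u]]
            top = max(vals)
            keep = choice[u]
            c = keep if keep is not None and vals[keep] == top else vals.index(top)
            new_x.append(top)
            new_choice.append(c)
        x, choice = new_x, new_choice
        hist.append([out[u][c] for u, c in enumerate(choice)])

    best = None
    for start in range(n):
        # path[j] -> path[j+1] is an edge with reward rewards[j].
        path, rewards, pos = [start], [], {start: 0}
        u = start
        for t in range(n - 1, -1, -1):  # walk back in time
            v, r = hist[t][u]
            rewards.append(r)
            if v in pos:
                i = pos[v]
                cyc = rewards[i:]  # edges of the simple cycle just closed
                mean = Fraction(sum(cyc)) / len(cyc)
                if best is None or mean > best:
                    best = mean
                for w in path[i + 1:]:
                    del pos[w]
                del path[i + 1:]
                del rewards[i:]
            else:
                pos[v] = len(path)
                path.append(v)
            u = v
    return best
```

On the example graph the history walk of vertex 2 after three iterations is 2→0→1→2, which is exactly the optimal cycle of mean 5. The `hist` table is what makes this variant need quadratic space.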
Searching for the optimal cycle takes $n^2$ time, thus the algorithm takes $O(n^2 + mn) = O(mn)$ time, but unfortunately requires $\Theta(n^2)$ space. We next describe a variation that reduces the space back to linear. The algorithm has the desirable property that, just like value iteration, it has a distributed nature: each vertex performs a simple local computation until all vertices discover the optimal mean.

This algorithm also works in two phases, the first phase being simple value iteration for $n$ iterations. The second phase takes $n$ iterations as well, but each vertex performs an additional computation in addition to updating its value and edge choice. In each iteration after $n$, each vertex not only keeps track of its current value and chosen edge, but also updates parameters characterizing its super edge. Super edges summarize history walks, and may be viewed as packets sent along edges. Each super edge is either dropped, or is updated and passed along in each iteration. When a vertex discovers that a super edge was sent by itself, it computes the average value of the cycle corresponding to the super edge from the parameters of the super edge, and updates the current highest mean found so far. The highest mean found in the second phase is the optimal.

A super edge has three parameters $(v, l, r)$, where $v$ is the vertex it ends in, $l$ is the number of edges in the super edge, and $r$ is its total reward. At any iteration, such as the beginning of the second phase, the super edge of a vertex may be undefined. Vertex $u$ computes its super edge at iteration $t$ as follows. Assume $u$ chooses the $u$-$v$ edge $e$ in iteration $t$. In case $u \ne v$, if the super edge for $v$ of iteration $t-1$ is undefined, the super edge for $u$ is defined to be $(v, 1, r(e))$. If $v$ has super edge $(z, l, r)$ and $u \ne z$, the super edge for $u$ is $(z, l+1, r + r(e))$. Otherwise, when $u = v$ or $u = z$, vertex $u$ has obtained a cyclic super edge, and the mean is respectively $r(e)$ or $(r(e) + r)/(l+1)$. In this case, the running estimate is updated if necessary, and vertex $u$ marks its current super edge undefined.

As an example, if vertices were to begin keeping track of super edges starting from iteration 1 in Fig. 2a, then the super edges for vertex $u$ at iterations 1 and 2 would be respectively Þ ½ µ and Þ ½½µ If vertices began keeping track of super edges starting from iteration two, the super edges of vertex $u$ at iterations 2 and 3 would be ½ µ and Ù µ respectively, and at the end of iteration 3, vertex $u$ would update the highest mean found so far if necessary (in this case ), and mark its current super edge as undefined.

The algorithm takes $2n$ iterations and each iteration takes constant time per edge, thus the run time is $O(mn)$, with only $O(n)$ extra space. The correspondence between super edges and history walks, together with Lemma 3.6, establishes the correctness. A subtlety arises when there are multiple optimal cycles and ties in edge selections occur. In this case we assume a vertex chooses the edge whose end-vertex has a super edge with the lowest numbered vertex. At the beginning of the second phase, where no vertex has a super edge, we assume ties are broken based on the lower numbered end-vertex. We call this rule the lowest-index rule. The following lemmas establish the properties of super edges and lead to correctness of the algorithm:

Lemma 4.1 At any time point, the super edge of length $l$ for a vertex $v$, if any, corresponds to the history walk of length $l$ for vertex $v$ at that time.

Let $\hat{\lambda}$ be the highest mean over the averages of cyclic super edges discovered in the second phase.
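The super-edge bookkeeping described above can be sketched as follows. This is a hedged illustration, not the author's implementation: it assumes the optimal cycle is unique, so only the lazy tie-breaking of value iteration is implemented and the lowest-index rule is omitted. The function name, the `(u, v, reward)` edge encoding, the example graph and the use of exact rewards are assumptions.

```python
from fractions import Fraction

# Hypothetical example graph: (start vertex, end vertex, reward).
EDGES = [(0, 1, 5), (1, 0, 4), (1, 2, 10), (2, 2, 1), (2, 0, 0)]


def history_walk_mean(n, edges, x0=None):
    """Two phases of n iterations each; super edges are kept in phase two."""
    out = [[] for _ in range(n)]
    for u, v, r in edges:
        out[u].append((v, r))
    x = list(x0) if x0 is not None else [0] * n
    choice = [None] * n
    super_edge = [None] * n  # (end vertex z, length l, total reward r) or None
    best = None              # highest cyclic super-edge mean found so far
    for it in range(2 * n):
        new_x, new_choice, new_super = [], [], [None] * n
        for u in range(n):
            vals = [r + x[v] for v, r in out[u]]
            top = max(vals)
            keep = choice[u]
            c = keep if keep is not None and vals[keep] == top else vals.index(top)
            new_x.append(top)
            new_choice.append(c)
            if it < n:
                continue  # phase one: plain value iteration
            v, r_e = out[u][c]
            prev = super_edge[v]  # super edge of v from the previous iteration
            if u == v:
                mean = Fraction(r_e)                     # self-loop
            elif prev is None:
                new_super[u] = (v, 1, r_e)
                continue
            elif prev[0] != u:
                new_super[u] = (prev[0], prev[1] + 1, prev[2] + r_e)
                continue
            else:
                mean = Fraction(prev[2] + r_e) / (prev[1] + 1)
            # Cyclic super edge: update the estimate, leave it undefined.
            if best is None or mean > best:
                best = mean
        x, choice, super_edge = new_x, new_choice, new_super
    return best
```

Only one triple per vertex is stored, which is where the linear space bound comes from. On the example graph the cyclic super edges found in the second phase have means 9/2 and then 5, so the final estimate is the optimal mean 5.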
Lemma 4.2 can be shown by noting that a walk with the same start and end vertex, possibly with two or more cycles inside, where each cycle has mean no greater than $\lambda^*$, does not have a mean greater than $\lambda^*$.

Lemma 4.2 No cyclic super edge has average reward greater than the optimal mean value. Therefore $\hat{\lambda} \le \lambda^*$ throughout the algorithm.

That some cyclic super edge computed in the second phase has mean $\lambda^*$ is not hard to see when the optimal cycle is unique, as the highest value in the mean-zero graph is created and traces the optimal cycle after the first $n$ iterations. In the case of multiple optimal cycles, ties in edges may not be broken arbitrarily; otherwise we can give examples where no cyclic super edge corresponding to an optimal cycle is created in the second $n$ iterations. But the lowest-index rule prevents this. We expect that other, easier rules, for example each vertex breaking ties consistently locally, also give correct algorithms.

Figure 5: If vertices keep track of super edges at iteration 1, and vertices are initialized with 0 in the graph shown, the super edge corresponding to the optimal cycle will never form.

Lemma 4.3 Assume the lowest-index rule is used in breaking ties. Then some vertex obtains a cyclic super edge corresponding to an optimal cycle in the second $n$ iterations.

Correctness of the algorithm, which we shall refer to as the history-walk algorithm, follows.

Theorem 4.4 The history-walk algorithm takes $\Theta(mn)$ time and uses $\Theta(n)$ space in finding the optimal mean.

A natural question is whether vertices may begin keeping track of super edges before iteration $n$. The answer is negative in the worst case, at least as far as the algorithm just described is concerned, as cyclic super edges may not form in this case. In Fig. 5, with vertices initialized at $0$, vertex $u$ picks the edge to $v$ in the first iteration, and then the optimal cycle, in this case the self-cycle to itself, in every subsequent iteration.
However, if vertices start keeping track of super edges at iteration 1, the super edge of vertex $u$ will always end in $v$, and thus it will never correspond to a cyclic super edge. Note that vertices cannot examine the intermediate vertices of a super edge, as such information is not kept. However, in our experiments in Section 6 we describe an algorithm in which vertices start keeping track of super edges early (e.g. after a number of iterations depending on $\log n$) and repeatedly discard their super edges and start over until the optimal mean is discovered.

The technique of keeping super edges extends to higher problem classes where edges have two parameters: a time cost or probability in addition to a reward. For example, in the maximum reward-to-time cycle ratio problem, edges also have time costs, and the problem is finding the cycle with maximum ratio of the cycle's total reward to total time cost (this problem is known as the minimum cost-to-time ratio cycle problem [AMO93]). In this case the only difference for the parameters of a super edge would be that in place of total length, we would include total time cost. In MDP(2) problems [Mad02], edges have a reward and a transition probability, and instead of length, the overall transition probability of the walk would be used. The details of the algorithms and the analyses and time bounds are different and will be presented in another paper.

1. Begin with an arbitrary policy
2. Repeat until no new cycle is discovered
3. Update edge rewards, edge choices and vertex values
4. Apply value iteration until a new cycle is discovered or until $n$ iterations.

Figure 6: Generic phased policy iteration.

5 On Policy Iteration Algorithms

Consider the following change to value iteration, which we call augmented value iteration: at each iteration, the cycles in the visited policy are identified and the highest cycle mean is computed. Then each vertex $v$ gets a self-arc ($v$-$v$ edge) with the same reward (or mean), so that $v$ can choose the self-arc in subsequent iterations. That this algorithm finds the optimal mean in at most $O(n^2)$ iterations follows from the same convergence arguments used for value iteration, but the bound may not be tight. On the other hand, this algorithm is very similar to the so-called multi-chain policy iteration algorithm for average reward MDP problems [Put94, CTCG 98]. These algorithms can be viewed as working in phases, as shown in Fig. 6, where each phase begins by using the mean $\lambda$ of the most recently discovered cycle, then updates edge rewards and vertex values appropriately, and then begins a series of value iterations until another cycle is found. The algorithms differ in how they update values, edge choices, and edge rewards, but they all guarantee that the next cycle discovered will have a higher mean than the last.

In the augmented value iteration algorithm, vertex values are not changed; however, new self-arcs with the most recently found cycle mean are added. Alternatively, in augmented value iteration, we may subtract $\lambda$ from all edge rewards, but keep zero-reward self-arcs for each vertex: this changes neither the optimal cycle nor the behavior of value iteration, by Lemma 3.2.
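Augmented value iteration, as described above, can be sketched as follows. This is an illustrative reading of the description, not the author's code: after every iteration the cycles of the visited policy are located, the best mean found so far is stored in `lam`, and each vertex is offered a self-arc carrying that reward in later iterations. The names, the `(u, v, reward)` edge encoding, the example graph and the fixed iteration budget `steps` are assumptions.

```python
from fractions import Fraction

# Hypothetical example graph: (start vertex, end vertex, reward).
EDGES = [(0, 1, 5), (1, 0, 4), (1, 2, 10), (2, 2, 1), (2, 0, 0)]


def policy_cycle_means(chosen):
    """Means of all cycles in a policy; chosen[u] = (end vertex, reward)."""
    n = len(chosen)
    state = [0] * n  # 0 = unseen, 1 = on the current path, 2 = finished
    means = []
    for s in range(n):
        path, u = [], s
        while state[u] == 0:
            state[u] = 1
            path.append(u)
            u = chosen[u][0]
        if state[u] == 1:  # the walk closed a new cycle at u
            cyc = path[path.index(u):]
            means.append(sum(chosen[w][1] for w in cyc) / len(cyc))
        for w in path:
            state[w] = 2
    return means


def augmented_value_iteration(n, edges, steps, x0=None):
    """Value iteration in which every vertex gets a self-arc of reward lam."""
    out = [[] for _ in range(n)]
    for u, v, r in edges:
        out[u].append((v, Fraction(r)))
    x = [Fraction(a) for a in x0] if x0 is not None else [Fraction(0)] * n
    choice = [None] * n
    lam = None  # highest cycle mean seen in any visited policy
    for _ in range(steps):
        new_x, new_choice, chosen = [], [], []
        for u in range(n):
            # The self-arc, once it exists, is always the last arc of u.
            arcs = out[u] + ([(u, lam)] if lam is not None else [])
            vals = [r + x[v] for v, r in arcs]
            top = max(vals)
            keep = choice[u]
            if keep is not None and keep < len(vals) and vals[keep] == top:
                c = keep  # lazy tie-breaking
            else:
                c = vals.index(top)
            new_x.append(top)
            new_choice.append(c)
            chosen.append(arcs[c])
        x, choice = new_x, new_choice
        found = max(policy_cycle_means(chosen))
        lam = found if lam is None else max(lam, found)
    return lam
```

Since every cycle in a visited policy is either a real cycle or a self-arc carrying an earlier estimate, `lam` never exceeds the optimal mean. On the example graph the optimal cycle 0→1→2→0 already appears in the policy of the second iteration.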
In policy iteration, in addition to subtracting $\lambda$ from edge rewards, each vertex redirects itself to the cycle $C$ with mean $\lambda$; that is, the algorithm finds a policy so that all vertices have a path (footnote 4) to cycle $C$. Vertices are then reassigned values as follows: an arbitrary vertex $v$ in the cycle of the current policy is assigned 0, and all others get the total reward of their path to $v$.

Footnote 4: Without loss of generality we may assume the graph is strongly connected. Otherwise, the algorithm performs this for each component.

The reassignment of values to vertices in policy iteration appears to make analysis difficult. However, in work in progress we have shown that variants of augmented value iteration where vertices get reassigned zero values (or any value vector that remains constant across phases) before value iteration begins in each phase terminate within a polynomial number of phases and therefore run in polynomial time. Just as in policy iteration, in these algorithms vertex values can only increase during value iteration within a phase, and the cycles discovered improve from one phase to the next. The basic behavior seen from the results is that vertices behave almost identically, in terms of the edges they choose in corresponding iterations, from one phase to the next. The exception is that progressively more vertices have zero increase in value, and thus stop switching edge choices, with each subsequent phase. These results, however, make use of the fact that vertices begin with the same value vector (for example zero) in each phase, which simplifies comparison between phases and aids analysis. Relaxing such constraints and improving the bounds for these algorithms may provide fruitful insights on the path to establishing the efficiency of policy iteration algorithms.

6 Behavior on Random Graphs

The algorithms we have developed require at least a linear number of iterations. However, we suspected that on random graphs, relatively few iterations of value iteration would suffice in finding the maximum mean.
We explored these questions on random graphs where every vertex has two out-edges, the end-vertex of each edge is chosen uniformly at random from the remaining $n-1$ vertices, and the edge rewards are chosen uniformly at random in $[0,1]$. We tested on sparse graphs only, as the number of actions (edges) per vertex is usually small in MDP and many other problems: $m$ is often a linear function of $n$, which leads to problems on sparse graphs. The averages for graphs of size $n$ were obtained over samples of size $\max(500, n)$.

As Fig. 7a suggests, the expected optimal cycle length appears to grow logarithmically ($O(\log n)$), while the time to convergence appears super-logarithmic but bounded by a polylogarithmic function of $n$. (We expect that the relatively high average length of optimal cycles for size 25 graphs is due to the small size of the graphs. With larger graph sizes the limiting distribution of the random variable seems to kick in.) The distributions of the two random variables seem to decay exponentially.

One way to compute the optimal mean quickly is to test the current visited policy periodically and check whether the average reward of the cycle(s) in the policy is optimal. An efficient way to test this is to subtract the candidate mean from all edge rewards and use an efficient implementation of the Bellman-Ford shortest paths algorithm to detect the presence of positive cycles [AMO93, CLR92]. If there are no positive cycles, the candidate mean is the maximum. While the shortest path detection algorithm also has $O(mn)$ run time, empirically it is very efficient and may have linear expected time [KB81]. Fig. 7b verifies our expectation. It shows the run times for two algorithms that test periodically after $\log n$ initial value iterations. These tests were performed on a Pentium III/500 with 128 megabytes of RAM with small load. One algorithm uses super edges (find-in-history) and the other simply checks the cycles formed in the policies (find-in-policy). The run times of both algorithms are close to linear, as expected. The plots show that the find-in-policy version seems to perform better, probably due to its lower overhead and because find-in-history also has to wait a number of iterations until an optimal cycle is formed in a super edge.

Figure 7: (a) Averages of optimal cycle lengths and of the first iteration at which an optimal cycle is found (time to first formation), plotted against graph size. (b) A comparison of the run times, in seconds against graph size, of Karp's algorithm, find-in-history, and find-in-policy.

Figure 8: The fraction of vertices switching with each subsequent iteration of value iteration, with one curve per graph size (including n=200, n=400, and n=800).

It may be possible to speed up each iteration by keeping track of only those vertices that are likely to change. As shown in Fig. 8, the fraction of vertices that switch decreases significantly with subsequent iterations of value iteration, and it appears to be independent of the graph size (for random graphs). It would be interesting if the fraction decreased geometrically, as that would suggest that the cost of performing the value iterations over all the $O(\log n)$ expected iterations would be down to $O(n)$ from $O(n \log n)$. However, as the plot suggests, the decrease in the fraction slows. An interesting open question is whether there is an algorithm with $O(n)$ expected time on random sparse graphs.

Many DMDP algorithms including policy iteration, but not value iteration, are tested on several graph families in [DG98], and they conclude that policy iteration is the fastest. We expect that the algorithms given here will be very competitive empirically as well, due to their low overhead and small number of iterations. In particular, the augmented value iteration algorithm (Sec. 5) may have a lower overhead than policy iteration, as it need not compute path values for vertices in each phase: it simply continues from the current vertex values. In future work, we will further compare the performance of these algorithms on DMDPs and related problems.

7 Discussion

We noted that value iteration does not solve the problem of finding an optimal policy in polynomial time. Thus the DMDP problem may be considered a borderline problem on which value iteration is almost polynomial. On shortest path problems with no negative cycles, and on DMDPs where all cycles share a single vertex, value iteration finds an optimal policy in polynomial time. Higher up in the problem hierarchy, on general stochastic MDP problems and several subclasses where the degree of stochasticity is limited, it takes exponential time to converge to optimal cycles or policies.

Perhaps the closest problem to the DMDP is the discounted deterministic MDP problem. On these problems, as the discount factor approaches 1 an optimal cycle becomes the same as the highest average reward cycle [Put94], and with a small discount factor the problem is easy to approximate. Therefore an approximation property such as the following may hold: $O(n^2)$ runs of value iteration starting with any initial vector are sufficient to converge to approximately optimal cycles in discounted deterministic MDPs.

This work was motivated by the analysis of policy iteration on MDPs and is in line with developing a complete picture of the efficiency of value and policy iteration algorithms on MDPs. Studying simpler problem classes can lead to techniques of algorithm design and analysis applicable to more general problems, as well as giving a better understanding

of where new ideas are needed when such techniques fail to generalize. Our hope is that the gaps in our understanding of whether and why these algorithms are efficient on various problem classes, MDP problems and beyond, continue to be filled.

Acknowledgments

This work was supported in part by NSF grant IIS and was performed in large part during the PhD work of the author at the University of Washington. The author is indebted to his advisors Richard Anderson and Steve Hanks for their guidance and support throughout this research. Thanks to Ali Dasdan for valuable discussions and comments on an earlier version of the paper. Many thanks to Russ Greiner and the anonymous referees for suggestions for improving the presentation.

References

[AMO93] R. K. Ahuja, T. L. Magnanti, and J. B. Orlin. Network Flows: Theory, Algorithms, and Applications. Prentice Hall, Englewood Cliffs, NJ, 1993.
[BDH99] C. Boutilier, T. Dean, and S. Hanks. Decision theoretic planning: Structural assumptions and computational leverage. JAIR, 1999.
[CLR92] T. H. Cormen, C. E. Leiserson, and R. L. Rivest. Introduction to Algorithms. MIT Press and McGraw-Hill Book Company, 6th edition, 1992.
[CTCG+98] J. Cochet-Terrasson, G. Cohen, S. Gaubert, M. McGettrick, and J.-P. Quadrat. Numerical computation of spectral elements in max-plus algebra. In Proc. IFAC Conf. on Systems Structure and Control, 1998.
[DG98] A. Dasdan and R. K. Gupta. Faster maximum and minimum mean cycle algorithms for system performance analysis. IEEE Transactions on Computer Aided Design of Integrated Circuits and Systems, 17(10):889-899, 1998.
[GKP01] C. Guestrin, D. Koller, and R. Parr. Max-norm projections for factored MDPs. In AAAI, 2001.
[Han98] E. A. Hansen. Finite Memory Control of Partially Observable Systems. PhD thesis, U. of Mass., Amherst, 1998.
[HO93] M. Hartmann and J. Orlin. Finding minimum cost to time ratio cycles with small integral transit times. Networks, 23:567-574, 1993.
[Kar78] R. M. Karp. A characterization of the minimum cycle mean in a digraph. Discrete Mathematics, 23:309-311, 1978.
[KB81] R. M. Karp and J. B. Orlin. Parametric shortest paths algorithm with an application to cyclic staffing. Discrete Applied Mathematics, 3:37-45, 1981.
[Lit96] M. Littman. Algorithms for Sequential Decision Making. PhD thesis, Brown, 1996.
[Mad02] O. Madani. On policy iteration as a Newton's method and polynomial policy iteration algorithms. In Proc. of 18th National Conference on Artificial Intelligence. To appear.
[Put94] M. L. Puterman. Markov Decision Processes. Wiley Inter-science, 1994.
[RN95] S. Russell and P. Norvig. Artificial Intelligence: A Modern Approach. Prentice Hall, 1995.
[Sha53] L. S. Shapley. Stochastic games. Proceedings of the National Academy of Sciences USA, 39:1095-1100, 1953.
[Tse90] P. Tseng. Solving $H$-horizon stationary Markov decision process in time proportional to $\log(H)$. Operations Research Letters, 9(5), 1990.
[YTO91] N. E. Young, R. E. Tarjan, and J. B. Orlin. Faster parametric shortest paths and minimum balance algorithms. Networks, 21:205-221, 1991.
[ZP96] U. Zwick and M. Paterson. The complexity of mean payoff games on graphs. Theoretical Computer Science, 158(1-2), May 1996.

A Proofs

Proof of Lemma 3.1:

Lemma A.1 $x_v(t)$ is the maximum value over the values of all walks of length $t$ with start vertex $v$.

Proof. This is basically a statement of the optimality of dynamic programming in this setting. We have

$$x_v(t) = \max_{e=(v,u)} \big( r(e) + x_u(t-1) \big), \quad t \ge 1.$$

On the other hand, if we let $V_t(v)$ denote the highest value over values of walks of length $t$ with start vertex $v$, we have that

$$V_t(v) \ge \max_{e=(v,u)} \big( r(e) + V_{t-1}(u) \big), \quad t \ge 1,$$

where $V_0(v) = x_v(0)$, and an argument by induction shows that equality holds: $V_t(v) = \max_{e=(v,u)} \big( r(e) + V_{t-1}(u) \big)$. It follows that $V_t(v) = x_v(t)$.

Proof of Lemma 3.2:

Lemma A.2 Consider value iteration started with equal initial values on a pair of parallel graphs $G$ and $G'$, with reward functions $r$ and $r'$ respectively, where $r'(e) = r(e) - \delta$ for some constant $\delta$. Then if $x_v(t)$ is the value of vertex $v$ at time $t$ in graph $G$, $x_v(t) - t\delta$ is the value of vertex $v$ in $G'$ at time $t$.

Proof. Take any $u$-$v$ walk $w$ of length $t$. If walk $w$ gives value $x$ to $u$ on graph $G$, given value $y$ for $v$, then it gives value $x - t\delta$ to $u$ given value $y$ for $v$ on $G'$. It follows from Lemma 3.1 that if $v$ has value $x_v(t)$ in $G$ at time $t$ then its value in $G'$, $x'_v(t)$, is not less than $x_v(t) - t\delta$. Symmetry establishes that $x'_v(t) = x_v(t) - t\delta$.

To prove Lemma 3.3, let us define a loop and what we mean by removing a loop from a walk. Think of a walk as a sequence of vertices. A loop is just a walk with identical first and last vertices, so it may contain multiple cycles. Removing a loop from a walk simply removes the vertices of the loop from the walk while keeping one occurrence of the loop's end vertex, so that the result is again a walk.
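As an illustration of loop removal, the following Python sketch treats a walk as a list of vertices and removes a loop given by its two end positions. The vertex labels, the reward table, and the function names are hypothetical and chosen only for this example; the graph is picked so that its single cycle has nonpositive total reward, as in the mean-zero case.

```python
def remove_loop(walk, i, j):
    """Return the walk with the loop walk[i..j] removed.

    The loop must start and end at the same vertex (walk[i] == walk[j]);
    one occurrence of that vertex is kept so the result is still a walk.
    """
    if not 0 <= i < j < len(walk) or walk[i] != walk[j]:
        raise ValueError("walk[i..j] is not a loop")
    return walk[:i] + walk[j:]


def total_reward(walk, reward):
    """Sum the edge rewards along a walk given as a vertex sequence."""
    return sum(reward[(a, b)] for a, b in zip(walk, walk[1:]))


# Illustrative graph: its only cycle, b -> c -> b, has total reward -1.
reward = {("a", "b"): 1, ("b", "c"): 2, ("c", "b"): -3, ("b", "d"): 1}
walk = ["a", "b", "c", "b", "d"]
shorter = remove_loop(walk, 1, 3)  # drop the b-b loop
```

Here the original walk has total reward 1 and the shortened walk `["a", "b", "d"]` has total reward 2, so removing a loop made of nonpositive cycles did not lower the reward.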

Lemma A.3 In the mean-zero case, if a loop is removed from a walk, the resulting walk has equal or higher total reward. Therefore, if a vertex $v$ has a history walk of length $k$ to itself at an iteration $t$, then $x_v(t) \le x_v(t-k)$, and $x_v(t) < x_v(t-k)$ if the loop contains a sub-optimal cycle, i.e., a cycle with negative mean.

Proof. Removing a loop is equivalent to removing one or more cycles, and because cycles have nonpositive total reward, the whole loop has nonpositive total reward. Therefore, the value obtained by a vertex at the beginning of the walk does not decrease if a loop is removed from the walk. This implies that the value of the vertex must have been at least as large $k$ iterations ago, by Lemma 3.1.

We will next prove an extended version of Lemma 3.5. Recall that for a vertex $v$ and an integer $j$, $0 \le j \le k-1$, we may define $x^*_{v,j} = \max\{x_v(t) : t \ge 0,\ t \bmod k = j\}$.

Lemma A.4 Assume vertex $u$ has an edge to $v$ in a mean-zero (i.e. optimal) cycle $C$ of length $k$. Then whenever $v$ obtains its highest value $x^*_v$, $u$ obtains its highest value $x^*_u$ in the following iteration, and $u$ can select its edge to $v$ to obtain this value. More generally, whenever $v$ obtains $x^*_{v,j}$ in an iteration $t$, $u$ obtains its highest value $x^*_{u,j+1}$ (index taken modulo $k$) at iteration $t+1$, and $u$ can select its edge to $v$ to obtain this value.

Proof. Say $u$ has edge $e$ to $v$ in $C$ and $v$ has path $p$ to $u$ in cycle $C$, thus $R(p) + r(e) = 0$. Say $v$ obtains its highest value $x^*_v$ at an iteration $t$. If at some point $u$ obtains a value $x_u$ higher than $r(e) + x^*_v$, then $|p|$ iterations later (where $|p|$ is the length of $p$) $v$ obtains a value at least as high as $R(p) + x_u$, but $R(p) + x_u > R(p) + r(e) + x^*_v = x^*_v$, contradicting the assumption on $x^*_v$. Therefore the highest value $u$ ever obtains must be at most $r(e) + x^*_v$, and furthermore $u$ must obtain it at least whenever $v$ obtains its highest value $x^*_v$ in the previous iteration.

We can use a similar argument to show the generalization to the highest values $x^*_{u,j+1}$ and $x^*_{v,j}$ of vertices $u$ and $v$: whenever $v$ obtains the highest value $x^*_{v,j}$ at an iteration $t$, $u$ obtains a value that is at least $r(e) + x^*_{v,j}$ in the next iteration $t+1$. In fact the value obtained is exactly $r(e) + x^*_{v,j}$, as otherwise $k-1$ iterations later (iteration $t+k$), $v$ could obtain a value greater than $x^*_{v,j}$, contradicting the fact that $x^*_{v,j}$ is highest.

The following lemma is interesting but is not required for the paper. It makes a statement about the smallest period at which the highest values of a vertex in an optimal cycle repeat. We say the values of a vertex $v$ become periodic with period $p$ if there is a time point $N$ such that $x_v(t+p) = x_v(t)$ for all $t \ge N$.

Lemma A.5 The values of any vertex in any mean-zero cycle $C$ become periodic in at most $n$ iterations. Furthermore, the vertex values of the vertices in $C$ will become periodic, in at most $n$ iterations, with the same smallest period $p$, and $p$ divides $k$.

Proof. Consider the highest values $x^*_{v,j}$ of any vertex $v$ of $C$: they are obtained in $n$ iterations by Lemma 3.4, and by their definition we must have $x_v(t+k) = x_v(t)$ for $t \ge n$. It follows that $k$ is a period for each vertex in $C$, but not necessarily the smallest period. If $p_1$ and $p_2$ are two periods, their greatest common divisor $g$ can be seen to also be a period as follows: there must exist $m_1$ and $m_2$ such that $m_2 p_2 = m_1 p_1 + g$, thus we have for any $t$,

$$x_v(t) = x_v(t + m_2 p_2) = x_v(t + m_1 p_1 + g),$$

or, for all $t' = t + m_1 p_1$, $x_v(t') = x_v(t' + g)$. As the highest values do not change by definition, we must have that the values become periodic with their smallest period in at most $n$ iterations.

Proof of Lemma 3.6:

Lemma A.6 The history walk of a vertex in an optimal cycle, whenever the vertex obtains its highest value, includes only optimal cycles. At any iteration $t \ge n$, for value iteration on any graph (not just mean-zero), the history walk of some vertex on an optimal cycle includes optimal cycles only.

Proof. The first part follows from Lemma 3.3. The highest value of any vertex is obtained in at most $n$ iterations as a consequence of Lemma 3.4. Once a vertex in an optimal cycle finds its highest value, it follows from the previous Lemma 3.5 that in every subsequent iteration some vertex in the optimal cycle finds its highest value. As history walks have length at least $n$ after iteration $n$ and thus must contain cycles, it follows from the first part that the history walk of some vertex on the optimal cycle must contain at least one cycle, and only optimal cycles, after iteration $n$.

Proof of Lemma 4.1:

Lemma A.7 At any time point, the super edge of length $l$ for a vertex $v$, if any, corresponds to the history walk of length $l$ for vertex $v$ at that time.

Proof. This can be verified using the definition of history walks and super edges, and by induction on the length of super edges.

Proof of Lemma 4.2:

Lemma A.8 No cyclic super edge has average reward greater than the optimal mean value. Therefore the best mean found from cyclic super edges never exceeds the optimal mean throughout the algorithm.

Proof. The history walk corresponding to a cyclic super edge is a loop, i.e., its start and end vertex are the same, possibly with multiple cycles inside. Each cycle in the loop has average reward not greater than the optimal mean; thus the maximum average reward of a cycle in the walk is not greater than the optimal mean, and so the average reward of the whole super edge is at most the optimal mean. Furthermore, if there is any cycle with average reward less than the optimal mean, the average reward of the whole walk is less than the optimal mean. It follows that the average reward of the super edge equals the optimal mean if and only if all cycles of the loop are optimal cycles.

Proof of Lemma 4.3:

Lemma A.9 Assume the lowest-index rule is used in breaking ties. Then some vertex obtains a cyclic super edge corresponding to an optimal cycle in the second $n$ iterations.

Proof. The proof is only complicated for the case that there are multiple optimal cycles that share vertices. Without loss of generality, take the mean-zero case. Consider the set $S$ of vertices that are in some optimal cycle, and consider their highest value as it gets propagated along edges. We say $u \in S$ can propagate to $v \in S$ if there is a path from $v$ to $u$ such that the highest value can propagate from $u$ to $v$ on the path (in the reverse of the edge directions).
The can-propagate-to relation is transitive: if $u$ can propagate (the highest value) to $v$ and $v$ to $w$, then $u$ can propagate to $w$. Consider the transitive closure $T$ of the can-propagate-to relation. $T$ is composed of one or more fully connected components, where in each component each vertex can propagate to all vertices in the same component. Therefore each component contains all the vertices of at least one optimal cycle of the graph. Consider the partial ordering imposed by $T$ on the components, and take a component with no predecessors (no vertex outside the component has an edge into the component).

After iteration $n$, if a vertex in $S$ receives a highest value, it must have been propagated from another vertex in $S$ (i.e., a vertex in some optimal cycle). This can be verified by examining the history walks: at least one optimal cycle, and only optimal cycles, exist in the history walk of length $n$ or more whenever a vertex obtains its highest value. Thus vertices in the chosen component can obtain their highest values only from propagation of the highest value from vertices in the same component: assuming otherwise contradicts the assumption that the component has no predecessor components. The component has at least one vertex with the highest value in iteration $n$, as it has at least one optimal cycle. Consider the lowest numbered vertex $v$ among such highest valued vertices. Now consider any optimal cycle $C$ of smallest length $k$ among the optimal cycles containing $v$, and let $v_0 = v$ and let $v_i$, $i \ge 1$, denote the vertex that has a path of $i$ edges in cycle $C$ to $v$. It follows from the tie breaker rule and the properties of $v$ that $v_1$ will choose its edge in $C$ to $v$ in iteration $n+1$ and will form a super edge ending in $v$. In general, $v_i$ will have a history path of length $i$ to $v$ at the end of iteration $n+i$ (not necessarily corresponding to $C$, but with the same total reward and length), and will obtain a super edge ending in $v$ with total reward equal to that of $v_i$'s path to $v$ in $C$. Thus in iteration $n+k$, $v$ obtains a cyclic super edge from at least one neighboring vertex in the component, corresponding to an optimal cycle of shortest length containing $v$.
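The proofs above reason about history walks under the lowest-index tie-breaking rule. The following Python sketch shows one way to track them, under an assumption of ours: the history walk of a vertex is taken to be the sequence of edge choices that produced its current value, which is how the appendix uses the term. The function name and the edge-list representation are hypothetical, rewards are integers so that ties can be tested exactly, and every vertex is assumed to have an out-edge.

```python
def value_iteration_with_history(n, edges, iterations):
    """Run value iteration from zero values and record history walks.

    edges: list of (v, u, reward). Ties between equally good edges are
    broken toward the lowest-index successor (the lowest-index rule).
    Returns (values, walks), where walks[v] is a walk of `iterations`
    edges starting at v whose total reward equals values[v].
    """
    out = [[] for _ in range(n)]
    for v, u, r in edges:
        out[v].append((u, r))
    x = [0] * n
    walks = [[v] for v in range(n)]
    for _ in range(iterations):
        new_x, new_walks = [], []
        for v in range(n):
            best_val = max(r + x[u] for u, r in out[v])
            # Among the maximizing edges, pick the lowest-index successor.
            u, r = min((u, r) for u, r in out[v] if r + x[u] == best_val)
            new_x.append(best_val)
            new_walks.append([v] + walks[u])
        x, walks = new_x, new_walks
    return x, walks


values, walks = value_iteration_with_history(
    2, [(0, 0, 1), (0, 1, 0), (1, 0, 4)], 3
)
```

Once the number of iterations is at least $n$, each recorded walk has more than $n$ vertices and therefore repeats a vertex, which is the pigeonhole fact the proofs rely on when they say history walks must contain cycles.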

Polynomial Value Iteration Algorithms for Deterministic MDPs

UAI 2002, MADANI, p. 311. Polynomial Value Iteration Algorithms for Deterministic MDPs. Omid Madani, Department of Computing Science, University of Alberta, Edmonton, AB, Canada T6G 2E8, madani@cs.ualberta.ca. Abstract