Using Markov decision processes to optimise a non-linear functional of the final distribution, with manufacturing applications.


E.J. Collins
Department of Mathematics, University of Bristol, University Walk, Bristol BS8 1TW, UK.

Abstract. We consider manufacturing problems which can be modelled as finite horizon Markov decision processes for which the effective reward function is either a strictly concave or strictly convex functional of the distribution of the final state. Reward structures such as these often arise when penalty factors are incorporated into the usual expected reward objective function. For convex problems there is a Markov deterministic policy which is optimal, but for concave problems we usually have to consider the larger class of Markov randomised policies. In the natural formulation these problems cannot be solved directly by dynamic programming. We outline alternative iterative schemes for solution and show how they can be applied in a specific manufacturing example.

Keywords. Markov Decision Processes, Penalty, Non-linear reward

1 Introduction

1.1 Concave/convex effective rewards in manufacturing

Consider a manufacturing process where a number of items are processed independently. Each item can be classified into one of a finite number of states at each of a finite number of stages, and at each stage it is necessary for the manufacturer to choose some appropriate action which affects the progress of the item over the next stage. Finite horizon Markov decision processes (MDPs) are commonly used to model such stochastic optimisation problems because they capture the comparison of the uncertain benefits arising from different possible control strategies. In the standard formulation, the objective is to optimise the expected value of some function of the final state of the process (or equivalently some linear function of the distribution of the final state) together with the total rewards or costs incurred along the way.

Problems like these are often formulated with linear objective functions precisely because they are then easy to solve with simple standard techniques. However, there are situations where a more realistic assessment would take into account nonlinearities in the benefits which accrue to the manufacturer. We will consider problems where the different actions available at each stage are effectively neutral with respect to their immediate cost or reward (for instance, different actions may represent different settings on a machine, or using different blends of equally expensive raw materials) and where the objective is to optimise some non-linear function of the proportion of items in each of the different possible states at final time T. We will assume that the number of items to be produced is sufficiently large that we can equate the proportion of items in each state i at time T under a given policy π with the probability that an individual item is in state i under π at time T.

One situation where such non-linearities might arise is when a manufacturer wishes to maximise an expected final reward subject to various other considerations or constraints. If one incorporates the constraints as additional penalty terms in the usual expected reward objective function, this may lead to effective reward functions which are either strictly concave or strictly convex functionals of the distribution of the final state. For example, a policy might be judged both by its expected reward and by some measure of its associated risk. If we use the variance of the resulting rewards as a measure of the risk to the manufacturer, then this leads us to seek a policy which maximises the variance penalized reward, in which some given fixed multiple θ of the variance is incorporated as a penalty into the objective function. Mean-variance tradeoffs for processes with an infinite stream of rewards have been studied by several authors, including Filar et al (1989), Huang & Kallenberg (1994) and White (1994), using either average or discounted reward criteria.

To see that a finite horizon version gives rise to optimising a strictly convex functional of the distribution of the final state, we proceed as follows. Consider a Markov decision process with no continuation rewards, so the only reward is a terminal reward R at time T, where $R(i) = r_i$, $i \in E$. Let T denote the horizon and let $S_T$ denote the state at time T. The variance penalized reward $v^\pi$ associated with a policy π is given by

$$v^\pi = E^\pi[R(S_T)] - \theta \,\mathrm{Var}^\pi[R(S_T)] = \sum_i r_i x^\pi(i) - \theta\Big[\sum_i r_i^2 x^\pi(i) - \Big(\sum_i r_i x^\pi(i)\Big)^2\Big] = \phi(x^\pi),$$

where $x^\pi(i) = P^\pi(S_T = i)$, $i \in E$; where $x^\pi$ is the vector with components $x^\pi(i)$; and where

$$\phi(x) = \sum_i r_i x(i) - \theta \sum_i r_i^2 x(i) + \theta\Big(\sum_i r_i x(i)\Big)^2.$$

Without loss of generality (White (1994)) we can assume $r_i > 0$, $i \in E$, so that φ is a strictly convex function.
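To make the effective reward concrete, here is a minimal sketch in plain Python (ours, not the paper's; the values of x, r and θ below are purely illustrative) that evaluates the variance penalized reward φ(x) for a given final distribution.

```python
# Evaluate the variance penalized reward
#   phi(x) = sum_i r_i x(i) - theta * [ sum_i r_i^2 x(i) - (sum_i r_i x(i))^2 ]
# for a final distribution x, terminal rewards r and penalty multiplier theta.

def variance_penalised_reward(x, r, theta):
    mean = sum(ri * xi for ri, xi in zip(r, x))
    second_moment = sum(ri ** 2 * xi for ri, xi in zip(r, x))
    return mean - theta * (second_moment - mean ** 2)

x = [0.2, 0.5, 0.3]   # illustrative final distribution over E = {0, 1, 2}
r = [1.0, 2.0, 4.0]   # illustrative terminal rewards r_i > 0
theta = 0.5           # illustrative variance penalty multiplier
print(variance_penalised_reward(x, r, theta))   # mean 2.4, variance 1.24 -> 1.78
```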

Another example, this time with a concave effective reward, occurs when a manufacturer wishes to maximise the usual expected final reward but is also subject to some externally imposed penalty $c_i$ if the proportion of times the process produces an item with final state $S_T = i$ differs from some prescribed value $d_i$. This leads to maximising the strictly concave penalised reward function

$$\phi(x) = \sum_i r_i x(i) - \sum_i c_i \big(x(i) - d_i\big)^2.$$

The purpose of this paper is to alert those working in the manufacturing area to the fact that these non-linear problems can be solved using methods based on familiar techniques like dynamic programming (and simple nonlinear programming), to outline the methods and work through some simple examples, and to encourage the identification of further manufacturing problems for which such methods are appropriate.

1.2 MDP model

Our basic model will be that of a discrete Markov decision process (MDP) with finite horizon T, a finite set of states $E = \{0, 1, 2, \ldots, N\}$ and a finite action set A. If the current state is i and action a is chosen, then the next state is determined according to the transition probabilities $p_{ij}(a)$, $j \in E$. The time points $t = 0, 1, \ldots, T-1$ form decision epochs. Let $S_t$ denote the state at times $t = 0, 1, \ldots, T$ and assume throughout that the process starts with some fixed distribution q for the initial state $S_0$. The only reward is a final reward which (taking the penalty into account) is a strictly concave or convex function φ of the final distribution. We let x denote the distribution of the final state $S_T$ and let $x^\pi$ denote the final distribution under a policy π. The focus of this paper is on how to characterise an optimal final distribution $x^*$ which maximises φ(x) and how to characterise and compute a policy $\pi^*$ which achieves this final distribution $x^*$.

In manufacturing applications, one might want to classify items according to different characteristics at different stages, and the set of appropriate actions would likewise differ. Although we present the theory in a time and state homogeneous form, all the results apply with only notational changes to the case where the set of states $E_t$ and transition probabilities $p_{tij}(a)$ may also vary with time, and where the action set $A_{it}$ may vary with state and time. All that is required is the Markov property of the transition to the next state, given the current time, state and action.

1.3 Non-standard solutions

For standard finite horizon Markov decision processes, dynamic programming is the natural method of finding an optimal policy and computing the corresponding optimal reward. A linear final reward function φ can be expressed as the expected value of some function of the final state alone, and the problem then reduces to a standard MDP.

When φ is a non-linear function this is not possible, and the example in Section 2 below indicates that, in the natural formulation, the finite horizon value functions fail to satisfy the principle of optimality, so that dynamic programming on its own is not directly applicable as a method of solution. Similarly, for standard finite horizon Markov decision processes an optimal policy can always be found in the class of Markov deterministic policies. We shall see that this remains true when φ is convex, but that it no longer holds when φ is concave, and we need to explore more carefully what kind of policies are then optimal. In particular, we will find it useful to allow consideration of randomised policies and mixtures of policies (where we say μ is a mixture of the finite set of policies $\pi_1, \ldots, \pi_K$ with mixing probabilities $\lambda_1, \ldots, \lambda_K$ if an individual using μ makes a preliminary choice of a policy $\pi_k$ from the set $\{\pi_1, \ldots, \pi_K\}$ according to the respective probabilities $\lambda_1, \ldots, \lambda_K$ and then uses that policy $\pi_k$ throughout at decision epochs $t = 0, 1, \ldots, T-1$).

1.4 Related work

This paper follows the approach taken in Collins (1995) and Collins and McNamara (1995). A computational example of the application of these ideas can be found in McNamara et al (1995). The mathematical motivation and terminology used have elements in common with work in the area of probabilistic constraints and variance criteria in Markov decision processes (see White (1988) for a survey). Some of the results on the equivalence of different spaces of outcomes in Section 3 were originally developed in the context of probabilistic constraints and we will quote the relevant results from Derman (1970). Such problems also give rise to examples where only randomised optimal policies exist (Kallenberg (1983)). Sobel (1982) gives an example in the context of variance models for which the principle of optimality does not apply but does not develop any general method of solution.

1.5 Outline of remaining sections

Section 2 uses a small example to show the difficulties caused by uncritical application of dynamic programming to this type of problem. In Section 3 we describe the geometry of the space of outcomes (i.e. final distributions) and outline how this geometric description can be used to make qualitative statements about the optimal final distribution(s) and the corresponding optimal policies. Section 4 introduces the best response method for updating our current guess at an optimal policy and final distribution. Solution algorithms for concave (and, briefly, for convex) effective rewards are discussed in Section 5. Finally, in Section 6, we show how these ideas can be applied to a simple manufacturing example.

2 Example

Purely for motivational purposes, we consider a problem first described in Collins and McNamara (1995), with the following parameters: $E = \{0, 1\}$, $A = \{a, b\}$, $q = (\tfrac{1}{2}, \tfrac{1}{2})$,

$$p_{ij}(a) = \begin{pmatrix} 6/16 & 10/16 \\ 12/16 & 4/16 \end{pmatrix}, \qquad p_{ij}(b) = \begin{pmatrix} 1/16 & 15/16 \\ 14/16 & 2/16 \end{pmatrix},$$

$$\phi(x) = 1 - x(0)^2 - x(1)^2 \quad \text{(concave case)}.$$

Let T = 1 and consider the choice of action at the single decision epoch t = 0 (so a policy is just a single decision rule).

2.1 Optimality principle

Taking φ as the final reward function and using the usual optimality equation, we obtain the following table, which seems to indicate that the optimal decision rule is (a, a) (i.e. action a in state i = 0 and action a in state i = 1).

Current state   Action   Final distribution x   φ(x)
0               a        (6/16, 10/16)          120/256
0               b        (1/16, 15/16)           30/256
1               a        (12/16, 4/16)           96/256
1               b        (14/16, 2/16)           56/256

2.2 Best deterministic policy

However, the following table, showing the result of using the four possible deterministic decision rules, indicates that (b, b) is the best deterministic decision rule.

Decision rule   Final distribution x   φ(x)
(a, a)          (9/16, 7/16)           252/512
(a, b)          (10/16, 6/16)          240/512
(b, a)          (13/32, 19/32)         247/512
(b, b)          (15/32, 17/32)         255/512

The reason for the difference is as follows. The distributions in the first table are not actually final distributions, but conditional distributions, conditioned on the current state. The overall final distribution is a linear combination of these conditional distributions, but the reward from the overall final distribution is not the corresponding linear combination of the rewards from the individual conditional distributions, since φ is a non-linear function.
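As a check on the arithmetic, the following short Python sketch (our own code, not part of the original paper) recomputes the second table: the overall final distribution and φ(x) for each of the four deterministic decision rules.

```python
# Recompute the table of Section 2.2: overall final distribution and phi(x)
# for each deterministic decision rule in the two-state example.
q = [0.5, 0.5]                                   # initial distribution on E = {0, 1}
p = {"a": [[6/16, 10/16], [12/16, 4/16]],        # p_ij(a)
     "b": [[1/16, 15/16], [14/16, 2/16]]}        # p_ij(b)

def phi(x):                                      # concave reward of the example
    return 1 - x[0] ** 2 - x[1] ** 2

for rule in [("a", "a"), ("a", "b"), ("b", "a"), ("b", "b")]:
    # x(j) = sum_i q(i) * p_ij(rule[i]) : the final distribution under the rule
    x = [sum(q[i] * p[rule[i]][i][j] for i in range(2)) for j in range(2)]
    print(rule, [round(v, 4) for v in x], round(phi(x), 4))
# e.g. rule (b, b) gives (15/32, 17/32) and phi = 255/512, the best of the four
```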

2.3 Randomised policies and mixtures

In contrast to the usual case for standard finite horizon MDPs, there is no deterministic decision rule that does as well as the randomised decision rule $((\tfrac{1}{2}, \tfrac{1}{2}), (\tfrac{3}{4}, \tfrac{1}{4}))$, under which, in state 0, action a is taken with probability 1/2 and action b with probability 1/2, and, in state 1, action a is taken with probability 3/4 and action b with probability 1/4. This gives the best possible final distribution of (1/2, 1/2). Alternatively, the same final distribution can be achieved with the mixture that uses the deterministic policy corresponding to the decision rule (a, b) with probability $\lambda_1 = 0.2$ and uses the deterministic policy corresponding to the decision rule (b, b) with probability $\lambda_2 = 0.8$.

3 Optimal policies

In this section we describe the geometry of the space of outcomes (i.e. final distributions) and outline how this geometric description can be used to make qualitative statements about the optimal final distribution(s) and the corresponding optimal policies.

3.1 Geometry of the space of outcomes

Let X denote the space of outcomes, so X is the set of all final distributions x achievable under some general (possibly history-dependent and randomised) policy. We can think of each $x \in X$ as a point in the N-dimensional simplex $\Sigma_N \subset R^{N+1}$, where $\Sigma_N = \{x : 0 \le x(i) \le 1,\ i \in E;\ \sum_i x(i) = 1\}$. Important subsets of X are $X_{MR}$ (the set of all final distributions achievable under some Markov randomised policy) and the finite set $X_{MD}$ (containing all final distributions achievable under some Markov deterministic policy). The following theorem (Derman (1970), p. 91, Theorem 2) shows how X can be represented in terms of these subsets. The first part of the theorem is due to Derman and Strauch (1966) and the second part to Derman himself. The proof given for part (ii) is for an average cost formulation, but is easily adapted to the finite horizon case using the fact that, for standard finite horizon problems, there is a Markov deterministic policy which is optimal.

Theorem. (i) $X = X_{MR}$. (ii) $X = \text{convex hull}(X_{MD})$.

From the theorem we see that, in looking for an optimal policy, there is no loss in restricting ourselves to Markov randomised policies. We also see that X is a convex polytope, each of whose vertices is a point in the finite set $X_{MD}$ corresponding to the final distribution of some Markov deterministic policy. In our description we will assume, for ease of presentation, that these are the only points of $X_{MD}$ in the boundary of X.
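Linking the example of Section 2.3 to the theorem above: the distribution (1/2, 1/2) is not itself a point of X_MD, but it lies in the convex hull of X_MD. The short sketch below (again our own code) checks numerically that both the randomised decision rule and the stated mixture realise it.

```python
# Check that the randomised rule ((1/2,1/2),(3/4,1/4)) and the mixture of the
# deterministic rules (a,b) and (b,b) with weights 0.2 / 0.8 both give the
# final distribution (1/2, 1/2) in the example of Section 2.
q = [0.5, 0.5]
p = {"a": [[6/16, 10/16], [12/16, 4/16]],
     "b": [[1/16, 15/16], [14/16, 2/16]]}

alpha = [0.5, 0.75]          # probability of action a in states 0 and 1
x_rand = [sum(q[i] * (alpha[i] * p["a"][i][j] + (1 - alpha[i]) * p["b"][i][j])
              for i in range(2)) for j in range(2)]

def final_dist(rule):
    return [sum(q[i] * p[rule[i]][i][j] for i in range(2)) for j in range(2)]

x_ab, x_bb = final_dist(("a", "b")), final_dist(("b", "b"))
x_mix = [0.2 * x_ab[j] + 0.8 * x_bb[j] for j in range(2)]

print(x_rand, x_mix)         # both print [0.5, 0.5]
```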

3.2 Type/number of optimal policies

The following lemmas address the number of optimal policies and the type of policy that is optimal. They follow immediately from the characterisation of X above and standard results on the maximum of a strictly convex/concave function over a closed bounded convex set (e.g. Luenberger (1973), Section 6.4).

Lemma. If φ is strictly concave then φ achieves its maximum over X at a unique final distribution $x^*$. However, $x^*$ may either be a vertex of X (in which case it is achieved by a Markov deterministic policy) or $x^*$ may not be a vertex (in which case we must usually resort to a Markov randomised policy to achieve $x^*$).

Lemma. If φ is strictly convex then each maximising final distribution is a vertex of X and hence corresponds to a Markov deterministic policy. However, φ may achieve its maximum over X at more than one point.

4 The best response method

It can be extremely difficult to explicitly determine X directly from the parameters of the problem. Even if we knew X and found a maximising value $x^*$ directly, it is not easy to determine a policy $\pi^*$ that had $x^*$ as its final distribution. Our approach will therefore be to look for sensible and efficient ways of searching through points in X and identifying corresponding policies - ways that do not rely on an explicit representation of X but essentially start out by treating X as unknown.

4.1 Best response

Our basic tool will be a method we call the best response method (see Collins and McNamara (1995) for a fuller description). Given any point $x_0$ and corresponding policy $\pi_0$, this method allows us to easily identify a new updated best response point $\hat{x}_0$ and corresponding policy $\hat{\pi}_0$ which are candidates for improvements on $x_0$ and $\pi_0$.

Definition. Given a point $x_0 \in X$ (and a corresponding policy $\pi_0$) we say $\hat{\pi}_0$, with corresponding final distribution $\hat{x}_0 \equiv x^{\hat{\pi}_0}$, is a best response to $\pi_0$ if $\nabla\phi(x_0) \cdot x \le \nabla\phi(x_0) \cdot \hat{x}_0$ for all $x \in X$.

4.2 Computation

We cannot use dynamic programming directly to find a policy maximising φ. However, we can use it to find the policy $\hat{\pi}_0$ which provides the best response to a given initial point $x_0$. Define the real valued function $R_{x_0}$ on E by taking $R_{x_0}(i) = \nabla\phi(x_0)(i)$, where $\nabla\phi(x_0)(i)$ is the i-th component of $\nabla\phi(x_0)$. Then

$$\nabla\phi(x_0) \cdot x = \sum_i R_{x_0}(i)\, x(i) = E_x[R_{x_0}(S_T)],$$

where $S_T$ is the state at time T and $E_x$ denotes expectation conditioned on $S_T$ having distribution x. Maximising $\nabla\phi(x_0) \cdot x$ is then a standard MDP with the same state space E, action space A and transition probabilities $p_{ij}(a)$ as before, but now with an expected reward criterion, where the terminal reward function $R_{x_0}$ is a function of the final state alone. Dynamic programming back against this final reward results in a policy $\hat{\pi}_0$ corresponding to the best response, and we can then work forward using the known initial distribution q, the known transition probabilities $p_{ij}(a)$ and the known policy $\hat{\pi}_0$ to compute the distribution $\hat{x}_0 = x^{\hat{\pi}_0}$. Note that the problem specification for finding the best response depends through $R_{x_0}$ on the point $x_0$, and different points will generate different MDPs.

4.3 Geometric interpretation

Let φ be the effective reward function and consider the surface z = φ(x) defined on X (or on $\Sigma_N$). An optimal distribution $x^*$ corresponds to a maximum value of z. Now consider the tangent hyperplane to the surface at the point $x_0$. The equation of this new surface is

$$z = \psi_{x_0}(x) = \nabla\phi(x_0) \cdot (x - x_0) + \phi(x_0).$$

The function $\psi_{x_0}(x)$ is a linear function of x which provides a local approximation to φ(x) at the point $x_0$. Points with $\nabla\phi(x_0) \cdot x = \text{constant}$ form contours in $\Sigma_N$ of the surface $z = \psi_{x_0}(x)$, so $\hat{x}_0$ lies on the highest contour of $\psi_{x_0}$ which intersects X. Assuming $\nabla\phi(x_0) \ne 0$, the points at which this contour intersects X will be boundary points of X. Moreover, the policy $\hat{\pi}_0$ selected by dynamic programming will be a Markov deterministic policy, so the final distribution associated with it will be a point in $X_{MD}$. If the only point of intersection of the highest contour and X is a single vertex, then this point in $X_{MD}$ will be the point $\hat{x}_0$ selected as the best response. If the contour intersects X at more than one vertex, then any of them may be selected. The important properties of $\hat{x}_0$ are that it is a vertex of X, that it corresponds to the final distribution of some known Markov deterministic policy $\hat{\pi}_0$, and that all points of X lie in the half-space containing $x_0$ defined by the hyperplane $\nabla\phi(x_0) \cdot x = \nabla\phi(x_0) \cdot \hat{x}_0$.
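As a rough illustration of the computation in Section 4.2, the following sketch (our own code, written for the time and state homogeneous formulation of Section 1.2, with a user-supplied gradient function standing in for ∇φ) performs the backward dynamic programming step against R_{x_0} and the forward pass that recovers x̂_0.

```python
def best_response(q, P, T, grad_phi, x0):
    """Best response to x0 for an MDP with terminal reward only.

    q        : initial distribution, a list of length N+1
    P        : dict mapping each action a to an (N+1)x(N+1) transition matrix
    T        : horizon (decision epochs t = 0, ..., T-1)
    grad_phi : function returning the gradient vector of phi at a distribution
    Returns (policy, x_hat), where policy[t][i] is the chosen action and
    x_hat is the final distribution under that policy.
    """
    n = len(q)
    V = list(grad_phi(x0))                     # terminal reward R_x0(i) at time T
    policy = [[None] * n for _ in range(T)]
    # backward dynamic programming against the linearised terminal reward
    for t in range(T - 1, -1, -1):
        new_V = [0.0] * n
        for i in range(n):
            best_a, best_v = None, float("-inf")
            for a, Pa in P.items():
                v = sum(Pa[i][j] * V[j] for j in range(n))
                if v > best_v:
                    best_a, best_v = a, v
            policy[t][i], new_V[i] = best_a, best_v
        V = new_V
    # forward pass: final distribution x_hat under the greedy deterministic policy
    x = list(q)
    for t in range(T):
        x = [sum(x[i] * P[policy[t][i]][i][j] for i in range(n)) for j in range(n)]
    return policy, x
```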

5 Algorithms

5.1 Concave rewards

When φ is strictly concave on the closed bounded convex set X, standard results (e.g. Luenberger (1973)) show that $x^*$ maximises φ(x) over X if and only if $\nabla\phi(x^*) \cdot x \le \nabla\phi(x^*) \cdot x^*$ for all $x \in X$. This motivates the following basic version of a policy improvement type algorithm for computing an optimal policy and the corresponding optimal final distribution.

Algorithm.
1. Choose some initial policy $\pi_0$ and compute the corresponding point $x_0$.
2. Generate a sequence of policies and points $\pi_1, \pi_2, \pi_3, \ldots$ and $x_1, x_2, x_3, \ldots$ by taking $x_{n+1} = \hat{x}_n$ and $\pi_{n+1} = \hat{\pi}_n$, $n = 0, 1, \ldots$.
3. Stop if $x_{n+1} = x_n$. In this case $\nabla\phi(x_n) \cdot x_n = \nabla\phi(x_n) \cdot x_{n+1} \ge \nabla\phi(x_n) \cdot x$ for all $x \in X$ (by the construction of $x_{n+1}$), so that $x_n$ is the optimal final distribution and $\pi_n$ is a corresponding optimal policy.

Although it is intuitively attractive, the above algorithm as stated can run into problems with improvement and convergence (in particular, cycling and identifiability of optimal randomised policies). One way of dealing with these problems is motivated as follows. Let P be the convex hull of K linearly independent vertices $x_1, \ldots, x_K$ in X, let y be the point at which φ(x) achieves its maximum over $x \in P$, and let $\hat{y}$ denote the best response to y. Then $y = x^*$ if and only if $\hat{y}$ lies in the face of P generated by $x_1, \ldots, x_K$. For example, if the best response cycles between two vertices $x_1$ and $x_2$, then one can break out of the cycle by finding the point $y = \lambda x_1 + (1 - \lambda) x_2$ maximising φ(x) along the line segment joining $x_1$ and $x_2$ and using this as the starting point of the next iteration. If $\hat{y} = x_1$ or $x_2$ then, from above, $y = x^*$ and a mixture of deterministic policies achieving $x^*$ is given by using $\pi_1$ with probability λ and $\pi_2$ with probability 1 - λ.

More generally, Collins and McNamara (1995) show that it is possible to construct a modified algorithm incorporating the best response method, which produces a strict improvement at each iteration, which converges to optimality in a finite number of iterations and where the optimal policy can be identified even when it is a Markov randomised policy. The algorithm is computationally more demanding, but it does indicate at least one systematic way to proceed in cases where heuristic modifications may fail, in particular when $x^*$ is not a vertex so that the corresponding optimal policy is not deterministic.
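A minimal sketch of the basic iteration and of the cycle-breaking line search follows (our own code; best_response(x) is assumed to return a pair consisting of a policy and its final distribution, for example computed as in Section 4.2).

```python
def iterate_best_response(x0, pi0, best_response, max_iter=100):
    """Basic algorithm of Section 5.1: iterate the best response until it
    stops moving.  Exact equality mirrors the paper's stopping rule; in
    practice a tolerance would normally be used."""
    x, pi = x0, pi0
    for _ in range(max_iter):
        pi_new, x_new = best_response(x)
        if x_new == x:            # best response leaves x unchanged: x is optimal
            return pi, x
        pi, x = pi_new, x_new
    raise RuntimeError("no convergence; try the line search / modified algorithm")

def line_search(phi, x1, x2, grid=10000):
    """Grid search for lambda in [0, 1] maximising phi(lambda*x1 + (1-lambda)*x2),
    used to break a cycle between two vertices x1 and x2."""
    best_lam, best_val = 0.0, float("-inf")
    for k in range(grid + 1):
        lam = k / grid
        y = [lam * a + (1 - lam) * b for a, b in zip(x1, x2)]
        val = phi(y)
        if val > best_val:
            best_lam, best_val = lam, val
    return best_lam
```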

A full description of the modified algorithm, with proofs of convergence to the optimal solution, can be found in Collins and McNamara (1995), but the basis of the algorithm is as follows. Assume simple iteration of the best response algorithm has identified N + 1 linearly independent vertices $x_0, \ldots, x_N$ in X. Consider an (N + 1)-dimensional vector $\lambda = (\lambda_0, \ldots, \lambda_N)$ and for these fixed vertices set $g(\lambda) = \phi(\lambda_0 x_0 + \cdots + \lambda_N x_N)$. Find the values $\lambda_0^*, \ldots, \lambda_N^*$ solving the (relatively small dimensional) non-linear programming problem of maximising g(λ) subject to the constraints $\lambda_j \ge 0$, $j = 0, \ldots, N$, and $\sum_j \lambda_j = 1$, using one of the standard routines available or easily implemented methods such as those in Luenberger (1973). Let P denote the convex hull of $x_0, \ldots, x_N$. Then the maximum of φ over P is achieved at $y = \lambda_0^* x_0 + \cdots + \lambda_N^* x_N$. If all the $\lambda_j^*$ are strictly positive then $y = x^*$ and one can stop. If one or more of the $\lambda_j^*$ are zero then find the point $\hat{y}$ which is the best response to y. If $\hat{y}$ is not in P then use it to replace one of the vertices with a zero coefficient $\lambda_j^*$ and start again. If $\hat{y}$ is in P then again $y = x^*$ and one can stop.

Once one has identified $x^*$ (along with the final set of vertices $x_0, \ldots, x_N$, the corresponding policies $\pi_0, \ldots, \pi_N$ and the weights $\lambda_0^*, \ldots, \lambda_N^*$) one can construct an optimal mixture μ of Markov deterministic policies by using each $\pi_j$ with probability $\lambda_j^*$. Alternatively, if one wants the optimal policy in the form of a Markov randomised policy, one can use the known transition probabilities under each policy $\pi_j$ to calculate the quantities

$$\alpha_t(i, a) = \frac{\sum_j \lambda_j^* \, P^{\pi_j}(S_t = i,\ \text{action at } t = a)}{\sum_j \lambda_j^* \, P^{\pi_j}(S_t = i)}.$$

Let π be the Markov randomised policy that takes action a with probability $\alpha_t(i, a)$ if in state i at time t. Then it follows from Derman (1970), p. 91, Theorem 1, that π results in exactly the same final distribution as μ, and hence π gives a Markov randomised policy which is optimal. In Section 6 we will discuss the implementation of these two equivalent methods of optimal control in the context of a manufacturing example.
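The conversion from an optimal mixture to the equivalent Markov randomised policy can be sketched as follows (our own code and naming; policies[k][t][i] holds the action taken by π_k in state i at decision epoch t, lambdas[k] its mixing weight, and P[a] a time-homogeneous transition matrix for action a).

```python
def mixture_to_randomised(q, P, T, policies, lambdas):
    """Compute alpha_t(i, a) of Section 5.1 from a mixture of Markov
    deterministic policies (time-homogeneous transition matrices P[a])."""
    n, actions = len(q), list(P.keys())
    # occ[k][t][i] = P^{pi_k}(S_t = i) for decision epochs t = 0, ..., T-1
    occ = []
    for pol in policies:
        x, traj = list(q), [list(q)]
        for t in range(T - 1):
            x = [sum(x[i] * P[pol[t][i]][i][j] for i in range(n)) for j in range(n)]
            traj.append(list(x))
        occ.append(traj)
    alpha = [[{a: 0.0 for a in actions} for _ in range(n)] for _ in range(T)]
    for t in range(T):
        for i in range(n):
            denom = sum(lam * occ[k][t][i] for k, lam in enumerate(lambdas))
            for a in actions:
                num = sum(lam * occ[k][t][i]
                          for k, lam in enumerate(lambdas) if policies[k][t][i] == a)
                # if state i is unreachable at time t, any choice will do
                alpha[t][i][a] = num / denom if denom > 0 else 1.0 / len(actions)
    return alpha
```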

5.2 Convex rewards

The problem is intrinsically more difficult when φ is convex, since any or all of the vertices of X may provide an optimal final distribution. Collins (1995) shows how the best response method can be used as part of an algorithm which successively approximates X from within by a finite sequence of convex polytopes $X_1 \subseteq \ldots \subseteq X_M$, which have vertices in common with X. The steps in the algorithm can be briefly outlined as follows. Use the best response method to generate N + 1 linearly independent vertices of X which then define an initial polytope $X_1$. Generate a sequence of polytopes $X_1, X_2, \ldots$ by using the best response method at each stage to find a new vertex of X and taking the next polytope to be the convex hull of the vertices identified so far. Hence generate a sequence of points $x_1, x_2, \ldots$ with $\phi(x_1) \le \phi(x_2) \le \ldots$, by taking $x_m$ to be the vertex (with corresponding known Markov deterministic policy $\pi_m$) which maximises φ over the known vertices of $X_m$. Stop when there are no vertices of X exterior to the current polytope (say $X_M$). Then $X_M = X$ and $\phi(x_M) = \phi(x^*)$. Details can be found in Collins (1995).

6 Manufacturing Examples

Consider a manufacturing process where a large number of items are processed individually. Each item can be classified into one of N + 1 states at each of three stages (t = 1, 2, 3), together with an initial unprocessed stage (t = 0) and a final finished product stage (t = 4 = T). In general we will speak of these as quality states (zero is lowest, N is highest), but they might also represent cosmetic differences that did not affect the quality but had unequal demand. The states of incoming unprocessed items are independent, but the overall proportion of incoming items in each state is known. At each stage of the process, the operator in charge can either choose to let the process run (action 1) or can choose to intervene to make an adjustment to the process or to the item (action 2). If the process is allowed to run, the item has some probability of deteriorating by one or more quality levels during that stage and some (smaller) probability of improving; otherwise it stays at the same quality level. The probability of a jump by k levels decreases with k. The probability of a transition to a different state also decreases with time, to reflect the fact that changes are less likely as the process nears completion. If the operator intervenes, there is an appreciable probability that the quality improves by one level during that stage, and an appreciable probability that the intervention will spoil the process and the item will revert to quality level zero; otherwise the quality level stays the same. Again the probability of change after intervention decreases over the lifetime of the process.

The overall reward depends on the proportion of items in each quality level at the finished product stage. There is a reward for each item, which increases with quality and is substantially higher for items of level N. There is also a quadratic penalty depending on the proportion of items at each level, reflecting perhaps long term future lost sales due to changing perception of the quality of the product by current or potential buyers.

Note that for the modified algorithm the computational requirements for each iteration split into two parts: a non-linear programming part which depends only on the size of N (and is independent of T and A), and a dynamic programming part which scales linearly with T for fixed N and A. Although T is small in the example, the computations involved would be relatively unaffected if T was much larger.

6.1 The model

We can summarise the model below, in the notation of the section on MDP models in the introduction. The choice of parameter values and the specific reward function is designed to illustrate the possibilities of the approach rather than necessarily reflecting realistic values for a particular problem.

Time horizon: T = 4.
State space: $E = \{0, 1, \ldots, N\}$.
Action space: $A = \{1, 2\}$.
Initial distribution: q, where $q(i) = \frac{2(i+1)}{(N+1)(N+2)}$, $i = 0, \ldots, N$.
Transition probabilities: for $t = 0, \ldots, T-1$,
$p_{t,i,i-2}(1) = 0.1\rho_t$; $p_{t,i,i-1}(1) = 0.4\rho_t$; $p_{t,i,i+1}(1) = 0.2\rho_t$; $p_{t,i,i+2}(1) = 0.05\rho_t$; $p_{t,i,j}(1) = 0$ for $j \ne i-2, i-1, i, i+1, i+2$; $p_{t,i,i}(1) = 1 - \sum_{j \ne i} p_{t,i,j}(1)$;
$p_{t,i,0}(2) = 0.1\rho_t$; $p_{t,i,i+1}(2) = 0.3\rho_t$; $p_{t,i,j}(2) = 0$ for $j \ne 0, i, i+1$; $p_{t,i,i}(2) = 1 - \sum_{j \ne i} p_{t,i,j}(2)$;
where $\rho_t = T/(t + T)$ reflects the diminishing probability of change.
Final reward: $\phi(x) = \sum_i r_i x(i) - \sum_i c_i (x(i) - d_i)^2$, where $r_i = 9(i+1)/(N+1)$, $i = 0, \ldots, N-1$, and $r_N = 10$; where $c_i = 10 + (N - i)$, $i = 0, \ldots, N$; and where $d_i = 0$, $i = 0, \ldots, N-1$, and $d_N = 1$, so state N is the preferred final state.

In the following examples $x_0, x_1, \ldots$ will denote the sequence of final distributions generated by the best response algorithm; $f_k(i, t)$ will denote the action specified by a given Markov deterministic policy $\pi_k$ when an item is in state i at time t; and $\alpha_t(i, a)$ will denote the probability which a corresponding Markov randomised policy π assigns to taking action a when an item is in state i at time t.

6.2 Example 1: N = 4

To apply the simple best response algorithm, start with some arbitrary policy, say the policy $\pi_0$ of never intervening (so $f_0(i, t) = 1$ for all i and t). Use the known transition matrices under $\pi_0$ to compute $x_0 = (0.077, 0.137, 0.200, 0.263, 0.323)$, where $x_0(i)$ denotes the probability an item is in state i (i.e. the proportion of items in state i) at time T under $\pi_0$. Define the real valued function $R_{x_0}$ on E by taking $R_{x_0}(i) = \nabla\phi(x_0)(i)$, where here $\nabla\phi(x_0)(i) = r_i - 2c_i(x_0(i) - d_i)$. Use dynamic programming to compute a Markov deterministic policy $\pi_1 = \{f_1(i, t)\}$ which is optimal for a standard MDP problem with terminal reward function $R_{x_0}$. Now repeat the process starting with $\pi_1$, and so on. For this example we find that $x_1 = x_2 = (0.100, 0.133, 0.172, 0.233, 0.362)$, so the algorithm has converged, $x_1$ is the optimal final distribution and (the Markov deterministic policy) $\pi_1$ is an optimal policy, where $\pi_1$ is given below.

[Table: the optimal policy f_1(i, t) of Example 1; its entries are not recoverable from this transcription.]
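For readers who want to experiment, the sketch below (our own code, not the authors') sets up the model of Section 6.1 and computes the final distribution under the never-intervene policy used to start Example 1. We assume that transition probabilities which would fall outside E are simply dropped, since the paper does not spell out the boundary convention, and we make no claim that the output reproduces the quoted figures exactly.

```python
def build_model(N, T=4):
    """Construct q, time-dependent transition matrices P[a][t], and the
    reward parameters r, c, d of Section 6.1 (boundary handling assumed)."""
    E = range(N + 1)
    q = [2 * (i + 1) / ((N + 1) * (N + 2)) for i in E]
    P = {1: [], 2: []}
    for t in range(T):
        rho = T / (t + T)
        P1 = [[0.0] * (N + 1) for _ in E]
        P2 = [[0.0] * (N + 1) for _ in E]
        for i in E:
            for j, pr in [(i - 2, 0.1 * rho), (i - 1, 0.4 * rho),
                          (i + 1, 0.2 * rho), (i + 2, 0.05 * rho)]:
                if 0 <= j <= N:
                    P1[i][j] = pr                 # let the process run (action 1)
            P1[i][i] = 1 - sum(P1[i])
            if i > 0:
                P2[i][0] = 0.1 * rho              # intervention may spoil the item
            if i < N:
                P2[i][i + 1] = 0.3 * rho          # intervention may raise quality
            P2[i][i] = 1 - sum(P2[i])
        P[1].append(P1)
        P[2].append(P2)
    r = [9 * (i + 1) / (N + 1) for i in range(N)] + [10.0]
    c = [10 + (N - i) for i in E]
    d = [0.0] * N + [1.0]
    return q, P, r, c, d

def phi(x, r, c, d):
    return (sum(ri * xi for ri, xi in zip(r, x))
            - sum(ci * (xi - di) ** 2 for ci, xi, di in zip(c, x, d)))

# Final distribution under the never-intervene policy pi_0 (action 1 throughout),
# the starting point of Example 1; compare with the x_0 quoted there.
N, T = 4, 4
q, P, r, c, d = build_model(N, T)
x = list(q)
for t in range(T):
    x = [sum(x[i] * P[1][t][i][j] for i in range(N + 1)) for j in range(N + 1)]
print([round(v, 3) for v in x], round(phi(x, r, c, d), 3))
```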

6.3 Example 2: N = 5

Start with some arbitrary policy $\pi_0$, say again the policy of never intervening, and proceed as in Example 1 above. This time the best response sequence cycles between the two points $x_1$ and $x_2$. Using a simple line search, we find that the maximum value of φ(x) along the line $\lambda x_1 + (1 - \lambda) x_2$ occurs at $y = \lambda^* x_1 + (1 - \lambda^*) x_2$ for some $\lambda^* \in (0, 1)$ (the numerical value of $\lambda^*$ has not survived in this transcription). We define the real valued function $R_y$ as before and look for a policy that is optimal in a problem with terminal reward $R_y$. We find the policy $\pi_1$ is again optimal for this terminal reward, so there is no point in X with higher reward than y. The optimal final distribution is thus $y = (0.078, 0.091, 0.146, 0.163, 0.210, 0.312)$ and this can be achieved using a mixture of the Markov deterministic policies $\pi_1$ and $\pi_2$ given below. Under the mixture an initial choice of policy is made for each item ($\pi_1$ being chosen with probability $\lambda^*$ and $\pi_2$ being chosen with probability $1 - \lambda^*$) and that policy is then used throughout the processing of that item. Alternatively, the same optimal final distribution can be achieved using the corresponding Markov randomised policy.

[Tables: the policies f_1(i, t) and f_2(i, t) of Example 2; their entries are not recoverable from this transcription.]

6.4 Example 3: N = 6

Proceeding as before, the best response algorithm starts cycling at $x_3$. Using the modified algorithm over successive iterations, we find the maximum value of φ(x) over the convex hull of $x_0, \ldots, x_5$ occurs at $y = \sum_{j=0}^{5} \lambda_j^* x_j$, where $\lambda^* = (0, 0, 0, 0.135, 0.759, 0.106)$. Furthermore, the optimal policy against the terminal reward generated by y is $\pi_3$, so $\hat{y} = x_3$ is in the convex hull of $x_0, \ldots, x_5$.

Thus $y = (0.062, 0.074, 0.107, 0.138, 0.152, 0.192, 0.274)$ is the optimal final distribution, and it can be achieved by the mixture which uses the Markov deterministic policies $\pi_3$, $\pi_4$ and $\pi_5$ given below with respective probabilities $\lambda_3^* = 0.135$, $\lambda_4^* = 0.759$ and $\lambda_5^* = 0.106$.

[Tables: the policies f_3(i, t), f_4(i, t), f_5(i, t) and the randomised-policy probabilities α_t(i, 1) for Example 3; their entries are not recoverable from this transcription.]

Alternatively, the same optimal final distribution can be achieved using the corresponding Markov randomised policy π for which the quantities $\alpha_t(i, 1)$ are given above and $\alpha_t(i, 2) = 1 - \alpha_t(i, 1)$.

6.5 Implementation of the optimal policy

In Examples 2 and 3 there are two equivalent optimal controls: the optimal mixture μ (of Markov deterministic policies $\pi_1, \ldots, \pi_K$) and the optimal Markov randomised policy π (which, in state i at time t, takes action a with probability $\alpha_t(i, a)$). There are simple intuitive interpretations of how these might be implemented in a manufacturing context. If a single operator has responsibility for every stage (time) of the process, then it may be easiest to implement the optimal control in the form of μ by assuming that for each successive item the single operator chooses policy $\pi_j$ with probability $\lambda_j^*$ and then proceeds to use that Markov deterministic policy throughout the processing of the given item. Alternatively, when a different operator has responsibility for the action taken at each stage of the process and it is not convenient to attempt to co-ordinate the actions of each operator for a given item, one can implement the optimal Markov randomised policy π by allowing each operator at each stage t to act independently and use the randomised decision rule which takes action a with probability $\alpha_t(i, a)$ if the current item is in state i.
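Both modes of implementation can be sketched in a few lines (purely illustrative code of our own; step(state, t, action) is assumed to sample the next state from the stage-t transition probabilities, policies and lambdas describe the mixture μ, and alpha[t][i][a] the randomised policy).

```python
import random

def process_item_with_mixture(s0, T, policies, lambdas, step):
    """Single operator: pick one deterministic policy pi_k for the whole item."""
    k = random.choices(range(len(policies)), weights=lambdas)[0]
    s = s0
    for t in range(T):
        s = step(s, t, policies[k][t][s])      # follow pi_k throughout
    return s

def process_item_with_randomised_policy(s0, T, alpha, step, actions=(1, 2)):
    """One operator per stage: each stage randomises its action independently."""
    s = s0
    for t in range(T):
        a = random.choices(actions, weights=[alpha[t][s][b] for b in actions])[0]
        s = step(s, t, a)
    return s
```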

References

Collins, E.J. (1995) Finite horizon variance penalized Markov decision process. Department of Mathematics, University of Bristol, Report no. S. Submitted to OR Spektrum.

Collins, E.J. and McNamara, J.M. (1995) Finite-horizon dynamic optimisation when the terminal reward is a concave functional of the distribution of the final state. Department of Mathematics, University of Bristol, Report no. S. Submitted to Adv. Appl. Prob.

Derman, C. (1970) Finite State Markovian Decision Processes. Academic Press, New York.

Derman, C. and Strauch, R. (1966) A note on memoryless rules for controlling sequential control processes. Ann. Math. Statist. 37.

Filar, J.A., Kallenberg, L.C.M. and Lee, H.M. (1989) Variance-penalised Markov decision processes. Math. Oper. Res. 14.

Huang, Y. and Kallenberg, L.C.M. (1994) On finding optimal policies for Markov decision chains: a unifying framework for mean-variance tradeoffs. Math. Oper. Res. 19.

Kallenberg, L.C.M. (1983) Linear Programming and Finite Markov Control Problems. Mathematical Centre, Amsterdam.

Luenberger, D.G. (1973) Introduction to Linear and Nonlinear Programming. Addison Wesley, Reading.

McNamara, J.M., Webb, J.N. and Collins, E.J. (1995) Dynamic optimisation in fluctuating environments. Proc. Roy. Soc. B 261.

Sobel, M.J. (1982) The variance of discounted Markov decision processes. J. Appl. Prob. 19.

White, D.J. (1988) Mean, variance and probabilistic criteria in finite Markov decision processes: a review. J. Optimization Theory and Applic. 56.

White, D.J. (1994) A mathematical programming approach to a problem in variance penalised Markov decision processes. OR Spektrum 15.


More information

1 Linear Programming. 1.1 Optimizion problems and convex polytopes 1 LINEAR PROGRAMMING

1 Linear Programming. 1.1 Optimizion problems and convex polytopes 1 LINEAR PROGRAMMING 1 LINEAR PROGRAMMING 1 Linear Programming Now, we will talk a little bit about Linear Programming. We say that a problem is an instance of linear programming when it can be effectively expressed in the

More information

BCN Decision and Risk Analysis. Syed M. Ahmed, Ph.D.

BCN Decision and Risk Analysis. Syed M. Ahmed, Ph.D. Linear Programming Module Outline Introduction The Linear Programming Model Examples of Linear Programming Problems Developing Linear Programming Models Graphical Solution to LP Problems The Simplex Method

More information

Lecture 5: Properties of convex sets

Lecture 5: Properties of convex sets Lecture 5: Properties of convex sets Rajat Mittal IIT Kanpur This week we will see properties of convex sets. These properties make convex sets special and are the reason why convex optimization problems

More information

CS 450 Numerical Analysis. Chapter 7: Interpolation

CS 450 Numerical Analysis. Chapter 7: Interpolation Lecture slides based on the textbook Scientific Computing: An Introductory Survey by Michael T. Heath, copyright c 2018 by the Society for Industrial and Applied Mathematics. http://www.siam.org/books/cl80

More information

Institutionen för matematik, KTH.

Institutionen för matematik, KTH. Institutionen för matematik, KTH. Chapter 10 projective toric varieties and polytopes: definitions 10.1 Introduction Tori varieties are algebraic varieties related to the study of sparse polynomials.

More information

Graph Contraction. Graph Contraction CSE341T/CSE549T 10/20/2014. Lecture 14

Graph Contraction. Graph Contraction CSE341T/CSE549T 10/20/2014. Lecture 14 CSE341T/CSE549T 10/20/2014 Lecture 14 Graph Contraction Graph Contraction So far we have mostly talking about standard techniques for solving problems on graphs that were developed in the context of sequential

More information

Geometry. Every Simplicial Polytope with at Most d + 4 Vertices Is a Quotient of a Neighborly Polytope. U. H. Kortenkamp. 1.

Geometry. Every Simplicial Polytope with at Most d + 4 Vertices Is a Quotient of a Neighborly Polytope. U. H. Kortenkamp. 1. Discrete Comput Geom 18:455 462 (1997) Discrete & Computational Geometry 1997 Springer-Verlag New York Inc. Every Simplicial Polytope with at Most d + 4 Vertices Is a Quotient of a Neighborly Polytope

More information

Homogeneous coordinates, lines, screws and twists

Homogeneous coordinates, lines, screws and twists Homogeneous coordinates, lines, screws and twists In lecture 1 of module 2, a brief mention was made of homogeneous coordinates, lines in R 3, screws and twists to describe the general motion of a rigid

More information

arxiv: v1 [math.co] 25 Sep 2015

arxiv: v1 [math.co] 25 Sep 2015 A BASIS FOR SLICING BIRKHOFF POLYTOPES TREVOR GLYNN arxiv:1509.07597v1 [math.co] 25 Sep 2015 Abstract. We present a change of basis that may allow more efficient calculation of the volumes of Birkhoff

More information

EC 521 MATHEMATICAL METHODS FOR ECONOMICS. Lecture 2: Convex Sets

EC 521 MATHEMATICAL METHODS FOR ECONOMICS. Lecture 2: Convex Sets EC 51 MATHEMATICAL METHODS FOR ECONOMICS Lecture : Convex Sets Murat YILMAZ Boğaziçi University In this section, we focus on convex sets, separating hyperplane theorems and Farkas Lemma. And as an application

More information

Classification of Ehrhart quasi-polynomials of half-integral polygons

Classification of Ehrhart quasi-polynomials of half-integral polygons Classification of Ehrhart quasi-polynomials of half-integral polygons A thesis presented to the faculty of San Francisco State University In partial fulfilment of The Requirements for The Degree Master

More information

COMP331/557. Chapter 2: The Geometry of Linear Programming. (Bertsimas & Tsitsiklis, Chapter 2)

COMP331/557. Chapter 2: The Geometry of Linear Programming. (Bertsimas & Tsitsiklis, Chapter 2) COMP331/557 Chapter 2: The Geometry of Linear Programming (Bertsimas & Tsitsiklis, Chapter 2) 49 Polyhedra and Polytopes Definition 2.1. Let A 2 R m n and b 2 R m. a set {x 2 R n A x b} is called polyhedron

More information

Lecture 2. Topology of Sets in R n. August 27, 2008

Lecture 2. Topology of Sets in R n. August 27, 2008 Lecture 2 Topology of Sets in R n August 27, 2008 Outline Vectors, Matrices, Norms, Convergence Open and Closed Sets Special Sets: Subspace, Affine Set, Cone, Convex Set Special Convex Sets: Hyperplane,

More information

CSE151 Assignment 2 Markov Decision Processes in the Grid World

CSE151 Assignment 2 Markov Decision Processes in the Grid World CSE5 Assignment Markov Decision Processes in the Grid World Grace Lin A484 gclin@ucsd.edu Tom Maddock A55645 tmaddock@ucsd.edu Abstract Markov decision processes exemplify sequential problems, which are

More information

arxiv: v1 [math.co] 24 Aug 2009

arxiv: v1 [math.co] 24 Aug 2009 SMOOTH FANO POLYTOPES ARISING FROM FINITE PARTIALLY ORDERED SETS arxiv:0908.3404v1 [math.co] 24 Aug 2009 TAKAYUKI HIBI AND AKIHIRO HIGASHITANI Abstract. Gorenstein Fano polytopes arising from finite partially

More information

Inverse and Implicit functions

Inverse and Implicit functions CHAPTER 3 Inverse and Implicit functions. Inverse Functions and Coordinate Changes Let U R d be a domain. Theorem. (Inverse function theorem). If ϕ : U R d is differentiable at a and Dϕ a is invertible,

More information

Simulation. Lecture O1 Optimization: Linear Programming. Saeed Bastani April 2016

Simulation. Lecture O1 Optimization: Linear Programming. Saeed Bastani April 2016 Simulation Lecture O Optimization: Linear Programming Saeed Bastani April 06 Outline of the course Linear Programming ( lecture) Integer Programming ( lecture) Heuristics and Metaheursitics (3 lectures)

More information

Applied Integer Programming

Applied Integer Programming Applied Integer Programming D.S. Chen; R.G. Batson; Y. Dang Fahimeh 8.2 8.7 April 21, 2015 Context 8.2. Convex sets 8.3. Describing a bounded polyhedron 8.4. Describing unbounded polyhedron 8.5. Faces,

More information

Optimal Control of a Production-Inventory System with both Backorders and Lost Sales

Optimal Control of a Production-Inventory System with both Backorders and Lost Sales Optimal Control of a Production-Inventory System with both Backorders and Lost Sales Saif Benjaafar Mohsen ElHafsi 2 Tingliang Huang 3 Industrial & Systems Engineering, Department of Mechanical Engineering,

More information

2 Solution of Homework

2 Solution of Homework Math 3181 Name: Dr. Franz Rothe February 6, 2014 All3181\3181_spr14h2.tex Homework has to be turned in this handout. The homework can be done in groups up to three due February 11/12 2 Solution of Homework

More information

Recent Developments in Model-based Derivative-free Optimization

Recent Developments in Model-based Derivative-free Optimization Recent Developments in Model-based Derivative-free Optimization Seppo Pulkkinen April 23, 2010 Introduction Problem definition The problem we are considering is a nonlinear optimization problem with constraints:

More information

Markov Decision Processes and Reinforcement Learning

Markov Decision Processes and Reinforcement Learning Lecture 14 and Marco Chiarandini Department of Mathematics & Computer Science University of Southern Denmark Slides by Stuart Russell and Peter Norvig Course Overview Introduction Artificial Intelligence

More information