Dynamic Programming Approximations for a Stochastic Inventory Routing Problem
Anton J. Kleywegt, Vijay S. Nori, Martin W. P. Savelsbergh
School of Industrial and Systems Engineering, Georgia Institute of Technology, Atlanta, GA
August 28, 2002

Abstract

This work is motivated by the need to solve the inventory routing problem when implementing a business practice called vendor managed inventory replenishment (VMI). With VMI, vendors monitor their customers' inventories, and decide when and how much inventory should be replenished at each customer. The inventory routing problem attempts to coordinate inventory replenishment and transportation in such a way that the cost is minimized over the long run. We formulate a Markov decision process model of the stochastic inventory routing problem, and propose approximation methods to find good solutions with reasonable computational effort. We indicate how the proposed approach can be used for other Markov decision processes involving the control of multiple resources.

Supported by the National Science Foundation under grant DMI
Introduction

Recently the business practice called vendor managed inventory replenishment (VMI) has been adopted by many companies. VMI refers to the situation in which a vendor monitors the inventory levels at its customers and decides when and how much inventory to replenish at each customer. This contrasts with conventional inventory management, in which customers monitor their own inventory levels and place orders when they think that it is the appropriate time to reorder. VMI has several advantages over conventional inventory management. Vendors can usually obtain a more uniform utilization of production resources, which leads to reduced production and inventory holding costs. Similarly, vendors can often obtain a more uniform utilization of transportation resources, which in turn leads to reduced transportation costs. Furthermore, additional savings in transportation costs may be obtained by increasing the use of low-cost full-truckload shipments and decreasing the use of high-cost less-than-truckload shipments, and by using more efficient routes that coordinate the replenishment of customers close to each other. VMI also has advantages for customers. Service levels may increase, measured in terms of reliability of product availability, because vendors can use the information that they collect on the inventory levels at the customers to better anticipate future demand, and to proactively smooth peaks in demand. Also, customers do not have to devote as many resources to monitoring their inventory levels and placing orders, as long as the vendor is successful in earning and maintaining the trust of the customers. A first requirement for a successful implementation of VMI is that a vendor is able to obtain relevant and accurate information in a timely and efficient way.
One of the reasons for the increased popularity of VMI is the increased availability of affordable and reliable equipment to collect and transmit the necessary data between the customers and the vendor. However, access to the relevant information is only one requirement. A vendor should also be able to use the increased amount of information to make good decisions. This is not an easy task, as the decision problems involved are very hard. The objective of this work is to develop efficient methods to help the vendor make good decisions when implementing VMI. In many applications of VMI, the vendor manages a fleet of vehicles to transport the product to the customers. The objective of the vendor is to coordinate the inventory replenishment and transportation in such a way that the total cost is minimized over the long run. The problem of optimal coordination of inventory replenishment and transportation is called the inventory routing problem (IRP). In this paper, we study the problem of determining optimal policies for the variant of the IRP in which a single product is distributed from a single vendor to multiple customers. The demands at the customers are assumed to have probability distributions that are known to the vendor. The objective is to maximize the expected discounted value, incorporating sales revenues, production costs, transportation costs, inventory holding costs, and shortage penalties, over an infinite horizon.
Our work on this problem was motivated by our collaboration with a producer and distributor of air products. The company operates plants worldwide and produces a variety of air products, such as liquid nitrogen, oxygen, and argon. The company's bulk customers have their own storage tanks at their sites, which are replenished by tanker trucks under the supplier's control. Approximately 80% of the bulk customers participate in the company's VMI program. For the most part each customer and each vehicle is allocated to a specific plant, so that the overall problem decomposes according to individual plants. Also, to improve safety and reduce contamination, each vehicle and each storage tank at a customer is dedicated to a particular type of product. Hence the problem also decomposes according to type of product. (This assumption does not hold if the number of drivers is a tight constraint, and drivers can be allocated to deliver one of several different products.) Therefore, in this paper we consider an inventory routing problem with a single vendor, multiple customers, multiple vehicles, and a single type of product. The main contributions of the research reported in this paper are as follows: 1. In an earlier paper (Kleywegt et al., 2002), we formulated the inventory routing problem with direct deliveries, i.e., one delivery per trip, as a Markov decision process and proposed an approximate dynamic programming approach for its solution. In this paper, we extend both the formulation and the approach to handle multiple deliveries per trip. 2. We present a solution approach that uses decomposition and optimization to approximate the value function. Specifically, the overall problem is decomposed into smaller subproblems, each designed to have two properties: (1) it provides an accurate representation of a portion of the overall problem, and (2) it is relatively easy to solve.
In addition, an optimization problem is defined to combine the solutions of the subproblems, in such a way that the value of a given state of the process is approximated by the optimal value of the optimization problem. 3. Computational experiments demonstrate that our approach allows the construction of near-optimal policies for small instances, and policies that are better than those proposed in the literature for realistically sized instances (with approximately 20 customers). The sizes of the state spaces for these instances are orders of magnitude larger than those that can be handled with more traditional methods, such as the modified policy iteration algorithm. In Section 1 we define the stochastic inventory routing problem, point out the obstacles encountered when attempting to solve the problem, present an overview of the proposed solution method, and review related literature. In Section 2 we propose a method for approximating the dynamic programming value function. In Section 3 the day-to-day control of the IRP process using the dynamic programming value function approximation is discussed. In Section 4 we investigate a special case of the IRP. Computational
results are presented in Section 5, and Section 6 concludes with some remarks regarding the application of the approach to other stochastic control problems.

1 Problem Definition

A general description of the IRP is given in Section 1.1, after which a Markov decision process formulation is given in Section 1.2. Section 1.3 discusses the issues to be addressed when solving the IRP, and Section 1.4 presents an overview of the proposed solution method. Section 1.5 reviews some related literature.

1.1 Problem Description

A product is distributed from a vendor's facility to N customers, using a fleet of M homogeneous vehicles, each with known capacity C. The process is modeled in discrete time t = 0, 1, ..., and the discrete time periods are called days. Let the random variable U_{it} denote the demand of customer i at time t, and let U_t = (U_{1t}, ..., U_{Nt}) denote the vector of customer demands at time t. Customers' demands on different days are independent random vectors with a joint probability distribution F that does not change with time; that is, U_0, U_1, ... is an independent and identically distributed sequence, and F is the probability distribution of each U_t. The probability distribution F is known to the decision maker. (In many applications customers' demands on different days may not be independent; in such cases customers' demands on previous days may provide valuable data for the forecasting of customers' future demands. A refined model with a suitably expanded state space can be formulated to exploit such additional information. Such refinement is not addressed in this paper.) There is an upper bound C_i on the amount of product that can be in inventory at each customer i. This upper bound C_i can be due to limited storage capacity at customer i, as in the application that motivated this research.
In other applications of VMI, there is often a contractual upper bound C_i, agreed upon by customer i and the vendor, on the amount of inventory that may be at customer i at any point in time. One motivation for this contractual bound is to prevent the vendor from dumping too much product at the customer. The vendor can measure the inventory level X_{it} of each customer i at any time t. At each time t, the vendor makes a decision that controls the routing of vehicles and the replenishment of customers' inventories. Such decisions may have many aspects, some of which are important for the method developed in this paper, and others which are not. The aspects of daily decisions that are important for the method developed in this paper are the following: 1. which customers' inventories to replenish, 2. how much to deliver at each customer, and
3. how to combine customers into vehicle routes. On the other hand, the ideas developed in the paper are independent of the routing constraints that are imposed, and thus routing constraints are not explicitly spelled out in the formulation. Unless otherwise stated, we assume that each vehicle can perform at most one route per day. We also assume that the duration of the task assigned to each driver and vehicle is less than the length of a day, so that all M drivers and vehicles are available at the beginning of each day, when the tasks for that day are assigned. The expected value (revenues and costs) accumulated during a day depends on the inventory levels and decision of that day, and is known to the vendor. As in the case of the routing constraints, the ideas developed in the paper are independent of the exact composition of the costs of the daily decisions. Next we describe some typical types of costs for illustrative purposes. (These costs were also used in the numerical work.) The cost of a daily decision may include the travel costs c_{ij} on the arcs (i, j) of the distribution network that are traversed according to the decision. Travel costs may also depend on the amount of product transported along each arc. The cost of a daily decision may include the costs incurred at customers' sites, for example due to product losses during delivery. The cost of a daily decision may include revenue: if quantity d_i is delivered at customer i, the vendor earns a reward of r_i(d_i). The cost of a daily decision may include shortage penalties: because demand is uncertain, there is often a positive probability that a customer runs out of stock, and thus shortages cannot always be prevented. Shortages are discouraged with a penalty p_i(s_i) if the unsatisfied demand on day t at customer i is s_i. Unsatisfied demand is treated as lost demand, and is not backlogged.
The cost of a daily decision may include inventory holding cost: if the inventory at customer i is x_i at the beginning of the day, and quantity d_i is delivered at customer i, then an inventory holding cost of h_i(x_i + d_i) is incurred. The inventory holding cost can also be modeled as a function of some average amount of inventory at each customer during the time period. The role played by inventory holding cost depends on the application. In some cases, the vendor and customers belong to different organizations, and the customers own the inventory. In these cases, the vendor typically does not incur any inventory holding costs based on the inventory at the customers. This was the case in the application that motivated this work. In other cases, such as when the vendor and customers belong to the same organization, or when the vendor owns the inventory at the customers, the vendor does incur inventory holding costs based on the inventory at the customers. The objective is to choose a distribution policy that maximizes the expected discounted value (rewards minus costs) over an infinite time horizon.

1.2 Problem Formulation

In this section we formulate the IRP as a discrete time Markov decision process (MDP) with the following components:
1. The state x = (x_1, x_2, ..., x_N) represents the current amount of inventory at each customer. Thus the state space is X = [0, C_1] × [0, C_2] × ... × [0, C_N] if the quantity of product can vary continuously, or X = {0, 1, ..., C_1} × {0, 1, ..., C_2} × ... × {0, 1, ..., C_N} if the quantity of product varies in discrete units. Let X_{it} ∈ [0, C_i] (or X_{it} ∈ {0, 1, ..., C_i}) denote the random inventory level at customer i at time t. Let X_t = (X_{1t}, ..., X_{Nt}) ∈ X denote the state at time t. 2. For any state x, let A(x) denote the set of all feasible decisions when the process is in state x. A decision a ∈ A(x) made at time t when the process is in state x contains information about (1) which customers' inventories to replenish, (2) how much to deliver at each customer, and (3) how to combine customers into vehicle routes. A decision may contain more information, such as travel times and arrival and departure times at customers (relative to time windows); the three attributes of a decision mentioned above are the important attributes for our purposes. For any decision a, let d_i(a) denote the quantity of product that is delivered to customer i while executing decision a. The set A(x) is determined by various constraints, such as work load constraints, routing constraints, vehicle capacity constraints, and customer inventory constraints. As discussed in Section 1.1, constraints such as work load constraints and routing constraints do not affect the method described in this paper. The constraints explicitly addressed in this paper are the limited number M of vehicles that can be used each day, the limited quantity C (vehicle capacity) that can be delivered by each vehicle on a day, and the maximum inventory levels C_i that are allowed at any time at each customer i. The maximum inventory level constraints can be imposed in a variety of ways.
For example, if it is assumed that no product is used between the time that the inventory level x_i is measured at customer i and the time that the delivery of d_i(a) takes place, then the maximum inventory level constraints can be expressed as x_i + d_i(a) ≤ C_i for all i, all x ∈ X, and all a ∈ A(x). If product is used during this time period, it may be possible to deliver more. The exact way in which the constraint is applied does not affect the rest of the development. For simplicity we applied the constraint as stated above. Let the random variable A_t ∈ A(X_t) denote the decision chosen at time t. 3. In this formulation, the source of randomness is the random customer demands U_{it}. To simplify the exposition, assume that the deliveries at time t take place in time to satisfy the demand at time t. Then the amount of product used by customer i at time t is given by min{X_{it} + d_i(A_t), U_{it}}. Thus the shortage at customer i at time t is given by S_{it} = max{U_{it} − (X_{it} + d_i(A_t)), 0}, and the next inventory level at customer i at time t + 1 is given by X_{i,t+1} = max{X_{it} + d_i(A_t) − U_{it}, 0}. The known joint probability distribution F of customer demands U_t gives a known Markov transition function Q, according to which transitions occur. For any state x ∈ X, any decision a ∈ A(x), and any Borel subset B ⊆ X, let

U(x, a, B) ≡ { U ∈ R_+^N : ( max{x_1 + d_1(a) − U_1, 0}, ..., max{x_N + d_N(a) − U_N, 0} ) ∈ B }.
Then Q[B | x, a] ≡ F[U(x, a, B)]. In other words, for any state x ∈ X and any decision a ∈ A(x),

P[X_{t+1} ∈ B | X_t = x, A_t = a] = Q[B | x, a] = F[U(x, a, B)]

4. Let g(x, a) denote the expected single stage net reward if the process is in state x at time t, and decision a ∈ A(x) is implemented. To give a specific example in terms of the costs mentioned in Section 1.1, for any decision a and arc (i, j), let k_{ij}(a) denote the number of times that arc (i, j) is traversed by a vehicle while executing decision a. Then,

g(x, a) ≡ Σ_{i=1}^{N} r_i(d_i(a)) − Σ_{(i,j)} c_{ij} k_{ij}(a) − Σ_{i=1}^{N} h_i(x_i + d_i(a)) − Σ_{i=1}^{N} E_F[ p_i( max{U_{i0} − (x_i + d_i(a)), 0} ) ]

where E_F denotes expected value with respect to the probability distribution F of U_0. 5. The objective is to maximize the expected total discounted value over an infinite horizon. The decisions A_t are restricted such that A_t ∈ A(X_t) for each t, and A_t may depend only on the history (X_0, A_0, X_1, A_1, ..., X_t) of the process up to time t; i.e., when the decision maker decides on a decision at time t, the decision maker does not know what is going to happen in the future. Let Π denote the set of policies that depend only on the history of the process up to time t. Let α ∈ [0, 1) denote the discount factor. Let V(x) denote the optimal expected value given that the initial state is x, i.e.,

V(x) ≡ sup_{π ∈ Π} E^π[ Σ_{t=0}^{∞} α^t g(X_t, A_t) | X_0 = x ]    (1)

A stationary deterministic policy π prescribes a decision π(x) ∈ A(x) based on the information contained in the current state x of the process only. For any stationary deterministic policy π, and any state x ∈ X, the expected value V^π(x) is given by

V^π(x) ≡ E^π[ Σ_{t=0}^{∞} α^t g(X_t, π(X_t)) | X_0 = x ] = g(x, π(x)) + α ∫_X V^π(y) Q[dy | x, π(x)]

(The last equality is a standard result in dynamic programming; see for example Bertsekas and Shreve 1978.)
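To make the transition and reward dynamics concrete, the following is a minimal simulation sketch of one day of the process for a fixed decision. The demand distribution (independent Poissons) and the linear forms of r_i, p_i, and h_i are illustrative assumptions, not specifications from the paper; the travel cost Σ c_{ij} k_{ij}(a) is passed in as a precomputed number.

```python
import numpy as np

rng = np.random.default_rng(0)

# Illustrative parameters (assumptions for this sketch):
# linear revenue r_i(d) = r*d, penalty p_i(s) = p*s, holding h_i(z) = h*z.
N = 3                                     # number of customers
C_cust = np.array([10.0, 10.0, 10.0])     # inventory capacities C_i
r, p, h = 5.0, 20.0, 0.1

def sample_demand():
    """Sample one day's demand vector U_t from F (here: independent Poissons)."""
    return rng.poisson(lam=3.0, size=N).astype(float)

def step(x, d, travel_cost):
    """One transition of the IRP Markov decision process.

    x: current inventories X_t; d: delivered quantities d_i(a), with
    x + d <= C_i assumed feasible; travel_cost: sum of c_ij * k_ij(a).
    Returns (next state X_{t+1}, realized single-stage reward)."""
    assert np.all(x + d <= C_cust + 1e-9), "maximum inventory constraint violated"
    u = sample_demand()
    shortage = np.maximum(u - (x + d), 0.0)        # S_it
    x_next = np.maximum(x + d - u, 0.0)            # X_{i,t+1}; unmet demand is lost
    reward = (r * d.sum() - travel_cost
              - h * (x + d).sum() - p * shortage.sum())
    return x_next, reward

x = np.array([2.0, 0.0, 5.0])
d = np.array([8.0, 10.0, 0.0])
x_next, g_real = step(x, d, travel_cost=30.0)
```

Averaging the realized reward over many independent demand samples gives a Monte Carlo estimate of the expected single stage reward g(x, a); this is one way the high dimensional expectation over F can be estimated in practice.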
It follows from results in dynamic programming that, under conditions that are not very restrictive (e.g., g bounded and α < 1), to determine the optimal expected value in (1), it is sufficient to restrict attention to
the class Π_SD of stationary deterministic policies. It follows that for any state x ∈ X,

V(x) = sup_{π ∈ Π_SD} V^π(x) = sup_{a ∈ A(x)} { g(x, a) + α ∫_X V(y) Q[dy | x, a] }    (2)

A policy π is called optimal if V^π = V.

1.3 Solving the Markov Decision Process

Many algorithms have been proposed to solve Markov decision processes; for example, see the textbooks by Bertsekas (1995) and Puterman (1994). Solving a Markov decision process usually involves computing the optimal value function V and an optimal policy π by solving the optimality equation (2). This requires the following major computational tasks to be performed. 1. Computation of the optimal value function V. Because V appears on both the left hand side and the right hand side of (2), most algorithms for computing V involve the computation of successive approximations to V(x) for every x ∈ X. These algorithms are practical only if the state space X is small. For the IRP as formulated in Section 1.2, X may be uncountable. One may attempt to make the problem more tractable by discretizing the state space X and the transition probabilities Q. Even if one discretizes X and Q, the number of states grows exponentially in the number of customers. Thus, even for discretized X and Q, the number of states is far too large to compute V(x) for every x ∈ X if there are more than about four customers. 2. Estimation of the expected value (integral) in (2). For the IRP, this is a high dimensional integral, with the number of dimensions equal to the number N of customers, which can be several hundred. Conventional numerical integration methods are not practical for the computation of such high dimensional integrals. 3. The maximization problem on the right hand side of (2) has to be solved to determine the optimal decision for each state. In the case of the IRP, the optimization problem on the right hand side of (2) is very hard. For example, the vehicle routing problem (VRP), which is NP-hard, is a special case of that problem.
(Consider any instance of the VRP, with a given number of capacitated vehicles, a graph with costs on the arcs, and demand quantities at the nodes. For the IRP, let the vehicles and graph be the same as for the VRP, let the demand be deterministic with demand quantities as given for the VRP, let the current inventory level at each customer be zero, let the discount factor be zero, and let the penalties be sufficiently large such that an optimal solution for the optimization problem
on the right hand side of (2) has to satisfy the demand quantities at all the nodes. Then the instance of the VRP can be solved by solving the optimization problem on the right hand side of (2).) In Kleywegt et al. (2002) we developed approximation methods to perform the computational tasks mentioned above efficiently and to obtain good solutions for the inventory routing problem with direct deliveries (IRPDD). To extend the approach to the IRP in which multiple customers can be visited on a route, we develop in this paper new methods for the first and third computational tasks, that is, to compute, at least approximately, V, and to solve the maximization problem on the right hand side of (2). The second task was addressed in the way described in Kleywegt et al. (2002).

1.4 Overview of the Proposed Method

An outline of our approach is as follows. The first major step in solving the IRP is to construct an approximation ˆV to the optimal value function V. The approximation ˆV is constructed as follows. First, a decomposition of the IRP is developed. Subproblems are defined for specific subsets of customers. Each subproblem is also a Markov decision process. The subsets of customers do not necessarily partition the set of customers, but must cover the set of customers. The idea is to define each subproblem so that it gives an accurate representation of the overall process as experienced by the subset of customers. To do that, the parameters of each subproblem are determined by simulating the overall IRP process, and by constructing simulation estimates of subproblem parameters. Second, each subproblem is solved optimally. Third, for any given state x of the IRP process, the approximate value ˆV(x) is determined by choosing a collection of subsets of customers that partitions the set of customers. Then ˆV(x) is set equal to the sum of the optimal value functions of the subproblems corresponding to the chosen collection of subsets at states corresponding to x.
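The third step just described amounts to a small set-partitioning search over the candidate subsets. A brute-force sketch follows; the subsets and their values are hypothetical placeholders, whereas in the approach of this paper each value V_S would come from solving the corresponding subproblem MDP at the states corresponding to x.

```python
from itertools import combinations

customers = frozenset({0, 1, 2})

# Hypothetical optimal subproblem values V_S (placeholders for this sketch),
# evaluated at the states corresponding to the current state x.
subset_values = {
    frozenset({0}): 4.0, frozenset({1}): 3.0, frozenset({2}): 5.0,
    frozenset({0, 1}): 9.0, frozenset({1, 2}): 7.0, frozenset({0, 2}): 8.0,
}

def all_partitions(s):
    """Yield all partitions of frozenset s into blocks that appear in subset_values."""
    s = frozenset(s)
    if not s:
        yield []
        return
    first = min(s)
    rest = s - {first}
    for k in range(len(rest) + 1):
        for combo in combinations(rest, k):
            block = frozenset({first, *combo})
            if block not in subset_values:
                continue
            for tail in all_partitions(rest - set(combo)):
                yield [block] + tail

def v_hat():
    """Approximate value: best partition's sum of subproblem values."""
    return max(sum(subset_values[b] for b in part)
               for part in all_partitions(customers))

best = v_hat()   # here {0,1} + {2} gives 9.0 + 5.0 = 14.0
```

For more than a handful of customers this enumeration is impractical; the choice of the partition is instead formulated as an optimization problem, as described in Section 2.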
The collection of subsets of customers is chosen to maximize ˆV(x). Details of the construction of ˆV are given in Section 2. An outline of the value function approximation algorithm is given in Algorithm 1. Given ˆV, the IRP process is controlled as follows. Whenever the state of the process is x, a decision ˆπ(x) is chosen that solves

max_{a ∈ A(x)} { g(x, a) + α ∫_X ˆV(y) Q[dy | x, a] }    (3)

which is the right hand side of the optimality equation (2) with ˆV instead of V. A method for problem (3) is described in Section 3. Algorithm 1 already indicates that the development of the approximating function ˆV requires a lot of computational effort. The effort is required to determine appropriate parameters for the subproblems and to solve all the subproblems. This effort is required only once at the beginning of the control of the IRP process
Algorithm 1 Procedure for computing ˆV and ˆπ.
1. Start with an initial policy ˆπ_0. Set i = 0.
2. Simulate the IRP under policy ˆπ_0 to estimate the subproblem parameters.
3. Solve the subproblems.
4. ˆV is determined by the optimal value functions of the subproblems.
5. Policy ˆπ_1 is defined by equation (4).
6. Repeat steps 7 through 11 for a chosen number of iterations, or until a convergence test is satisfied.
7. Increment i = i + 1.
8. Simulate the IRP under policy ˆπ_i to update the estimates of the subproblem parameters.
9. With the updated estimates of the subproblem parameters, solve the updated subproblems.
10. ˆV is determined by the optimal value functions of the updated subproblems.
11. Policy ˆπ_{i+1} is given by equation (4).

(although, in practice, ˆV may have to be changed if the parameters of the MDP change), so that a substantial effort for this initial computational task seems to be acceptable. In contrast, once the approximating function ˆV has been constructed, only the daily problem (3) has to be solved at each stage of the IRP process, each time for a given value of the state x. Because the daily problem has to be solved many times, it is important that this computational task can be performed with relatively little effort.

1.5 Review of Related Literature

In this section we give a brief review of related literature on the inventory routing problem (Section 1.5.1) and on dynamic programming approximations (Section 1.5.2). The review is not comprehensive.

1.5.1 Inventory Routing Literature

A large variety of deterministic and stochastic models of inventory routing problems have been formulated, and a variety of heuristics and bounds have been produced. A classification of the inventory routing literature is given in Kleywegt et al. (2002). Bell et al. (1983) propose an integer program for the inventory routing problem at Air Products, a producer of products such as liquid nitrogen.
Dror, Ball, and Golden (1985), and Dror and Ball (1987) construct a solution for a short-term planning period based on identifying, for each customer, the optimal replenishment day t and the expected increase in cost if the customer is visited on a day t′ instead of t. An integer program is then solved that assigns customers to a vehicle and a day, or just a day, in a way that minimizes the sum of these costs plus the transportation costs. Dror and Levy (1986) use a similar method to construct a
weekly schedule, and then apply node and arc exchanges to reduce costs in the planning period. Trudeau and Dror (1992) apply similar ideas to the case in which inventories are observable only at delivery times. Bard et al. (1998) follow a rolling horizon approach to an inventory routing problem with satellite facilities where trucks can be refilled. To choose the customers to be visited during the next two weeks, they determine an optimal replenishment frequency for each customer, similar to the approach in Dror, Ball, and Golden (1985), and Dror and Ball (1987). Federgruen and Zipkin (1984) formulate an inventory routing problem quite similar to the one in Section 1.2, except that they focus on solving the myopic single-stage problem max_{a ∈ A(x)} g(x, a), which is a nonlinear integer program. Golden, Assad, and Dahl (1984) also propose a heuristic to solve the myopic single-stage problem max_{a ∈ A(x)} g(x, a), while maintaining an adequate inventory at all customers. Chien, Balakrishnan, and Wong (1989) also propose an integer programming based heuristic to solve the single-stage problem, but they attempt to find a solution that is less myopic than that of Federgruen and Zipkin (1984) and Golden, Assad, and Dahl (1984), by passing information from one day to the next. Anily and Federgruen (1990, 1991, 1993) analyze fixed partition policies for the inventory routing problem with constant deterministic demand rates and an unlimited number of vehicles. They also find lower and upper bounds on the minimum long-run average cost over all fixed partition policies, and propose a heuristic, called modified circular regional partitioning, to choose a fixed partition. Gallego and Simchi-Levi (1990) use an approach similar to that of Anily and Federgruen (1990) to evaluate the long-run effectiveness of direct deliveries (one customer on each route).
Bramel and Simchi-Levi (1995) also study fixed partition policies for the deterministic inventory routing problem with an unlimited number of vehicles. They propose a location based heuristic, based on the capacitated concentrator location problem (CCLP), to choose a fixed partition. The tour through each subset of customers is constructed while solving the CCLP, using a nearest insertion heuristic. Chan, Federgruen, and Simchi-Levi (1998) analyze zero-inventory ordering policies, in which a customer's inventory is replenished only when the customer's inventory has been depleted, and fixed partition policies, also for the deterministic inventory routing problem with an unlimited number of vehicles. They derive asymptotic worst-case bounds on the performance of the policies. They also propose a heuristic based on the CCLP, similar to that of Bramel and Simchi-Levi (1995), for determining a fixed partition of the set of customers. Gaur and Fisher (2002) consider a deterministic inventory routing problem with time varying demand. They propose a randomized heuristic to find a fixed partition policy with periodic deliveries. Their method was implemented for a supermarket chain. Burns et al. (1985) develop approximating equations for both a direct delivery policy and a policy in which vehicles visit multiple customers on a route. Minkoff (1993) also formulated the inventory routing problem as an MDP. He focused on the case with an unlimited number of vehicles. He proposed a decomposition heuristic to reduce the computational effort.
The heuristic solves a linear program to allocate joint transportation costs to individual customers, and then solves individual customer subproblems. The value functions of the subproblems are added to approximate the value function of the combined problem. Minkoff's work differs from ours in the following aspects: (1) we consider the case with a limited number of vehicles, (2) we define subproblems involving one or more customers, and the subproblems are defined differently, one reason being that the bound on the number of vehicles has to be addressed in our subproblems, and (3) we solve an optimization problem to combine the results of the subproblems. Webb and Larson (1995) propose a solution for the problem of determining the minimum fleet size for an inventory routing system. Their work is related to Larson's earlier work on fleet sizing and inventory routing (Larson, 1988). Bassok and Ernst (1995) consider the problem of delivering multiple products to customers on a fixed tour. The optimal policy for each product is characterized by a sequence of critical numbers, similar to an optimal policy found by Topkis (1968). Barnes-Schuster and Bassok (1997) study the cost effectiveness of a particular direct delivery policy for the inventory routing problem. Kleywegt et al. (2002) also consider the special case with direct deliveries. An MDP model of the inventory routing problem is formulated, and a dynamic programming approximation method is developed to find a policy. Herer and Roundy (1997) propose several heuristics to construct power-of-two policies for the inventory routing problem with constant deterministic demand rates and an unlimited number of vehicles, and they prove performance bounds for the heuristics. Viswanathan and Mathur (1997) propose an insertion heuristic to construct a power-of-two policy for the inventory routing problem with multiple products, constant deterministic demand rates, and an unlimited number of vehicles. Reiman et al.
(1999) perform a heavy traffic analysis for three types of policies for the inventory routing problem with a single vehicle. Çetinkaya and Lee (2000) study a problem in which the vendor accumulates customer orders over time intervals of length T, and then delivers the customer orders at the end of each time interval. Bertazzi et al. (2002) consider a deterministic inventory routing problem with a single capacitated vehicle. Each customer has a specified minimum and maximum inventory level. They propose a heuristic to determine the vehicle route at each discrete time point, while following an order-up-to policy, that is, each time a customer is visited the inventory at the customer is replenished to the specified maximum inventory level. They consider the impact of different objective functions. The inventory pickup and delivery problem is quite similar to the inventory routing problem. In the inventory pickup and delivery problem, there are multiple sources of a single product, multiple demand points, and multiple vehicles. The vehicles are scheduled to travel alternately between sources and demand points to replenish the inventory at the demand points. Christiansen and Nygreen (1998a, 1998b) present
a path flow formulation and column generation method for the inventory pickup and delivery problem with time windows (IPDPTW). Christiansen (1999) presents an arc flow formulation for the IPDPTW.

Dynamic Programming Approximation Literature

Dynamic programming, or Markov decision processes, is a versatile and widely used framework for modeling dynamic and stochastic optimal control problems. However, a major shortcoming is that for many interesting applications an optimal policy cannot be computed, because (1) the state space X is too big to compute and store the optimal value V*(x) and an optimal decision π*(x) for each state x; and/or (2) the expected value in (2), which is often a high-dimensional integral, cannot be computed exactly; and/or (3) the single-stage optimization problem on the right-hand side of (2) cannot be solved exactly. In this section we briefly mention some of the work that has been done to address the first issue, that is, how to attack problems with large state spaces. The second issue makes up a large part of the field of statistics, and the third issue makes up a large part of the field of optimization; these fields are not reviewed here. A natural approach for attacking MDPs with large state spaces, which is also the approach used in this paper, is to approximate the optimal value function V* with an approximating function V̂. It is shown in Section 2 that a good approximation V̂ of the optimal value function V* can be used to find a good policy π̂. Some of the early work on this approach is that of Bellman and Dreyfus (1959), who propose using Legendre polynomials inductively to approximate the optimal value function of a finite horizon MDP. Chang (1966), Bellman et al. (1963), and Schweitzer and Seidman (1985) also study the approximation of V* with polynomials, especially orthogonal polynomials such as Legendre and Chebyshev polynomials.
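As a minimal illustration of this line of work (a sketch, not the method of any particular paper cited here), a value function sampled at a set of scalar states can be fit with a low-degree Chebyshev polynomial by least squares; the sampled values below are synthetic stand-ins:

```python
import numpy as np

# Sketch: approximate a value function V, sampled on a grid of scalar states
# scaled to [-1, 1], by a degree-5 Chebyshev polynomial fit by least squares.
states = np.linspace(-1.0, 1.0, 50)
V = np.exp(states) / (1 + states**2)   # hypothetical sampled values of V

cheb = np.polynomial.chebyshev.Chebyshev.fit(states, V, deg=5)
V_hat = cheb(states)                   # the parameterized approximation V-hat

max_err = np.max(np.abs(V - V_hat))    # sup-norm error epsilon of the fit
```

For a smooth value function, the sup-norm error of such a fit decays rapidly with the polynomial degree, which is what makes orthogonal-polynomial bases attractive in this literature.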
Approximations using splines are suggested by Daniel (1976), and approximations using regression splines by Chen et al. (1999). Recently, a lot of work has been done on parameterized approximations. Some of this work was motivated by approaches proposed for reinforcement learning; Sutton and Barto (1998) give an overview. Tsitsiklis and Van Roy (1996), Van Roy and Tsitsiklis (1996), Bertsekas and Tsitsiklis (1996), and De Farias and Van Roy (2000) study the estimation of the parameters of these approximating functions for infinite horizon discounted MDPs, and Tsitsiklis and Van Roy (1999a) consider estimation for long-run average cost MDPs. Value function approximations are proposed for specific applications by Van Roy et al. (1997), Powell and Carvalho (1998), Tsitsiklis and Van Roy (1999b), Secomandi (2000), and Kleywegt et al. (2002). In many models the state space is uncountable and the transition and cost functions are too complex for closed form solutions to be obtained. Discretization methods and convergence results for such problems are discussed in Wong (1970a), Fox (1973), Bertsekas (1975), Kushner (1990), Chow and Tsitsiklis (1991), and Kushner and Dupuis (1992). Another natural approach for attacking a large-scale MDP is to decompose the MDP into smaller related
MDPs, which are easier to solve, and then to use the solutions of the smaller MDPs to obtain a good solution for the original MDP. Decomposition methods are discussed in Wong (1970b), Collins and Lew (1970), Collins (1970), Collins and Angel (1971), Courtois (1977), Courtois and Semal (1984), Stewart (1984), and Kleywegt et al. (2002). Some general state space reduction methods that include many of the methods mentioned above are analyzed in Whitt (1978, 1979a, 1979b), Hinderer (1976, 1978), Hinderer and Hübner (1977), and Haurie and L'Ecuyer (1986). Surveys are given in Morin (1978) and Rogers et al. (1991).

2 Value Function Approximation

The first major step in solving the IRP is the construction of an approximation V̂ to the optimal value function V*. A good approximating function V̂ can then be used to find a good policy π̂, in the sense described next. Suppose that ‖V* − V̂‖ < ε, that is, V̂ is an ε-approximation of V*. Also suppose that the stationary deterministic policy π̂ satisfies

  g(x, π̂(x)) + α ∑_{y∈X} V̂(y) Q[y | x, π̂(x)]  ≥  sup_{a∈A(x)} { g(x, a) + α ∑_{y∈X} V̂(y) Q[y | x, a] } − δ    (4)

for all x ∈ X, that is, decision π̂(x) is within δ of the optimal decision using approximating function V̂ on the right-hand side of the optimality equation (2). Then

  |V^π̂(x) − V*(x)| ≤ (2αε + δ)/(1 − α)

for all x ∈ X, that is, the value function V^π̂ of policy π̂ is within (2αε + δ)/(1 − α) of the optimal value function V*. This observation is the motivation for putting in the effort to construct a good approximating function V̂. This section describes the construction of V̂; the decisions referred to in this section are used only for the purpose of motivating the approximation V̂, and are not used to control the IRP process.
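The bound above can be checked numerically on a toy problem. The following sketch (all data synthetic) computes V* by value iteration on a small random MDP, perturbs it into an ε-approximation V̂, extracts the greedy (δ = 0) policy π̂, evaluates π̂ exactly, and verifies that its suboptimality stays within 2αε/(1 − α):

```python
import numpy as np

# Hypothetical toy MDP: 3 states, 2 actions, rewards g[x, a], transitions Q[y | x, a].
rng = np.random.default_rng(0)
nX, nA, alpha = 3, 2, 0.9
g = rng.uniform(0, 1, (nX, nA))                 # one-stage rewards
Q = rng.uniform(0, 1, (nX, nA, nX))
Q /= Q.sum(axis=2, keepdims=True)               # transition probabilities Q[y | x, a]

# Value iteration for the optimal value function V*.
V = np.zeros(nX)
for _ in range(2000):
    V = (g + alpha * Q @ V).max(axis=1)

# An epsilon-approximation V_hat of V*, and the greedy (delta = 0) policy for V_hat.
eps = 0.05
V_hat = V + rng.uniform(-eps, eps, nX)
pi_hat = (g + alpha * Q @ V_hat).argmax(axis=1)

# Exact policy evaluation: solve (I - alpha * Q_pi) V_pi = g_pi.
Q_pi = Q[np.arange(nX), pi_hat]
g_pi = g[np.arange(nX), pi_hat]
V_pi = np.linalg.solve(np.eye(nX) - alpha * Q_pi, g_pi)

# The suboptimality never exceeds (2 * alpha * eps + delta) / (1 - alpha), delta = 0 here.
bound = 2 * alpha * eps / (1 - alpha)
assert np.all(V - V_pi <= bound + 1e-6)
```

In practice the bound is loose; the greedy policy for a reasonable V̂ is usually much closer to optimal than (2αε + δ)/(1 − α) suggests, which is why constructing a good V̂ pays off.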
The decisions used to control the IRP process are described subsequently.

2.1 Subproblem Definition

To approximate the optimal value function V*, we decompose the IRP into subproblems, and then combine the subproblem results using another optimization problem, described in Section 2.2, to produce the approximating function V̂. Each subproblem is a Markov decision process involving a subset of customers. The subsets of customers do not necessarily partition the set of customers, but must cover the set of customers,
and it must be possible to form a partition with a subcollection of the subsets. The approach we followed was to define subproblems for each subset of customers that can be visited on a single vehicle route. Thus each single customer forms a subset, and in addition there are a variety of subsets with multiple customers. Hence, the cover and partition conditions referred to above are automatically satisfied. After the subsets of customers have been identified, a subproblem has to be defined (a model has to be constructed) for each subset. That involves determining appropriate parameters and parameter values for the MDP of each subset. An appealing idea is to choose the parameters and parameter values of each subproblem so that the subproblem represents the overall IRP process as experienced by the subset of customers. There are several obstacles in the way of implementing such an idea. First, the overall process depends on the policy controlling the process, and an optimal policy is not known. Second, even with a given policy for controlling the overall process, it is still hard to determine appropriate parameters and parameter values for each subproblem so that the combined subproblems give a good representation of the overall process. This section, including Subsections 2.1.1 and 2.1.2, is devoted to the modeling of the subproblems, that is, the determination of parameters and parameter values for each subproblem. It has the interesting feature that simulation is used in the process of constructing the subproblem models. Issues that have to be addressed are the following. 1. One question is how many vehicles are available for a given subproblem. This issue comes about because in the overall IRP process, several subsets compete for the M vehicles, and thus, at any given time, not all M vehicles will be available to any given subset.
Also, a vehicle may visit customers in the subset as well as customers not in the subset, and thus not all of a vehicle's capacity C may be available to the given subset. Thus, the availability of vehicles and vehicle capacity to subsets of customers (and therefore in subproblems) has to be modeled. 2. Transition probabilities have to be determined for the subproblems. The transition probabilities of the inventory levels are determined by the demand distribution F as before. In addition, for the subproblems we also address the transition probabilities of vehicle availability to the subset of customers. In the description of the subproblems, we sometimes refer to the overall process, and sometimes to the models of the individual subproblems; we attempt to keep the distinctions as well as the similarities clear. To simplify notation, the modeling of the subproblems is described for a two-customer subproblem; the models for the subproblems with one or more than two customers are similar. A two-customer subproblem for subset {i, j} is denoted by MDP_ij. The method presented in this section is for a discrete demand distribution F and a discrete state space X, which may come about naturally due to the nature of the product or because of discretization of the demand distribution and the state space. Let the support of F be denoted by U_1 × ··· × U_N, and let f_ij denote the (marginal) probability mass function
of the demand of customers i and j, that is, f_ij(u_i, u_j) ≡ F[U_1 × ··· × {u_i} × ··· × {u_j} × ··· × U_N] denotes the probability that the demand at customer i is u_i and the demand at customer j is u_j. Recall that the idea is to define each subproblem so that it gives an accurate representation of the overall process as experienced by the subset of customers. Clearly, the state of a subproblem has to include the inventory level at each of the customers in the subproblem. Furthermore, to capture information about the availability of vehicles for delivering to the customers in the subproblem, the state of a subproblem also includes a component with information about the vehicle availability to the subset of customers. To determine the possible values of the vehicle availability component v_ij of the state of subproblem MDP_ij, consider the different ways in which the customers i and j can be visited in the overall IRP process. For simplicity, we assume that each customer is visited at most once per day. Consequently, on any day, the subset of two customers can be visited by 0, 1, or 2 vehicles. Hence, in subproblem MDP_ij, at any point in time, either 0, 1, or 2 vehicles are available to the subset of two customers. The simplest case is the case with no vehicles available for delivering to customers i and j (denoted by v_ij = 0 in subproblem MDP_ij). When 1 or 2 vehicles are available to the subset of two customers, we also have to specify how much of those vehicles' capacities are available to the subset of customers, because those same vehicles may also make deliveries to customers other than i or j on a route. Consider the different ways in which one vehicle could deliver to i and/or j in the overall IRP process. There are the following six possibilities: 1. exclusive delivery to i, 2. exclusive delivery to j, 3. exclusive delivery to i and j (no deliveries to other customers), 4. fraction of vehicle capacity delivered to i and no delivery to j, 5. fraction of vehicle capacity delivered to j and no delivery to i, 6. fraction of vehicle capacity delivered to i and j plus delivery to other customers. The first three possibilities are represented by the same vehicle availability component in subproblem MDP_ij (denoted by v_ij = a), because in all three cases one vehicle is available exclusively for customers in the subproblem. The other possibilities are denoted by v_ij = b, c, d, respectively, in subproblem MDP_ij. Next consider the different ways in which two vehicles could deliver to i and j in the overall IRP process. There are the following four possibilities: 1. exclusive delivery to i and j (no deliveries to other customers), 2. exclusive delivery to i, fraction of vehicle capacity delivered to j,
3. exclusive delivery to j, fraction of vehicle capacity delivered to i, 4. fraction of vehicle capacity delivered to i and fraction of vehicle capacity delivered to j (with different vehicles visiting i and j, each also delivering to other customers). These possibilities are denoted by v_ij = e, f, g, h, respectively, in subproblem MDP_ij. Whenever a vehicle is available for delivering a fraction of its capacity to one or both of the customers in the subset, the model for subproblem MDP_ij also needs to specify what portion of the vehicle's capacity is available to the subset. For example, when the vehicle availability v_ij ∈ {b, c, d}, one vehicle with a fraction of the capacity C is available to the two-customer subset; when v_ij = h, two vehicles, each with a fraction of the capacity C, are available to the subset; and when v_ij ∈ {f, g}, two vehicles, one with capacity C and one with a fraction of the capacity C, are available to the subset. Each of the subproblem vehicle availabilities v_ij ∈ {b, g, h} corresponds to a situation in the overall IRP in which a vehicle visits i and a customer not in {i, j}, but the same vehicle does not visit j. The fractional capacity associated with the vehicle availabilities v_ij ∈ {b, g} is the same and is denoted by λ^i_ij ∈ [0, C]. Similarly, the fractional capacity associated with the vehicle availabilities v_ij ∈ {c, f} is the same and is denoted by λ^j_ij ∈ [0, C]. When the vehicle availability is v_ij = h, one vehicle with fractional capacity λ^i_ij and another vehicle with fractional capacity λ^j_ij are available to the subset. Finally, when the vehicle availability is v_ij = d, the fractional capacity available to the subset is denoted by λ^ij_ij ∈ [0, C]. Table 1 summarizes the vehicle availability values v_ij and associated available capacities for a two-customer subproblem MDP_ij. Note that for the subproblem, it is sufficient to know the (possibly fractional) capacities available to the subset.
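As an illustration, the mapping from availability values to the vehicle capacities they make available to the subset (summarized in Table 1) can be encoded as a small lookup; a sketch in Python, where C stands for the full vehicle capacity and lam_i, lam_j, lam_ij are hypothetical names for λ^i_ij, λ^j_ij, λ^ij_ij:

```python
def available_capacities(v, C, lam_i, lam_j, lam_ij):
    """Vehicle capacities offered to customer subset {i, j} when v_ij = v (Table 1)."""
    return {
        "0": [],                 # no vehicle available
        "a": [C],                # one vehicle, full capacity
        "b": [lam_i],            # one vehicle, fractional capacity for i only
        "c": [lam_j],            # one vehicle, fractional capacity for j only
        "d": [lam_ij],           # one vehicle, fractional capacity shared by i and j
        "e": [C, C],             # two vehicles, full capacity each
        "f": [C, lam_j],         # two vehicles, one full and one fractional
        "g": [lam_i, C],
        "h": [lam_i, lam_j],     # two vehicles, both fractional
    }[v]
```

For example, `available_capacities("h", 10, 4, 3, 6)` returns `[4, 3]`: two vehicles, each offering only its fractional capacity to the subset.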
The subproblem decision determines how the capacities will be used to serve customers i and j. A later subsection explains how simulation is used to choose appropriate values for these λ-parameters.

Table 1: Vehicle availability values v_ij and associated capacities for a two-customer subproblem MDP_ij.

  v_ij value   Vehicle capacities available to customer subset {i, j}
  0            None
  a            One vehicle with capacity C
  b            One vehicle with capacity λ^i_ij
  c            One vehicle with capacity λ^j_ij
  d            One vehicle with capacity λ^ij_ij
  e            Two vehicles, each with capacity C
  f            Two vehicles, one with capacity C, and one with capacity λ^j_ij
  g            Two vehicles, one with capacity λ^i_ij, and one with capacity C
  h            Two vehicles, one with capacity λ^i_ij, and one with capacity λ^j_ij

Each two-customer subproblem MDP_ij is a discrete-time Markov decision process, and is defined as follows.
1. The state space is X_ij = {0, 1, ..., C_i} × {0, 1, ..., C_j} × {0, a, b, c, d, e, f, g, h}. State (x_i, x_j, v_ij) denotes that the inventory levels at customers i and j are x_i and x_j, and the vehicle availability is v_ij. Let X_it ∈ {0, 1, ..., C_i} denote the random inventory level at customer i at time t, and let V_ijt denote the random vehicle availability at time t. 2. For any subproblem state (x_i, x_j, v_ij), let A_ij(x_i, x_j, v_ij) denote the set of feasible subproblem decisions when the subproblem process is in state (x_i, x_j, v_ij). A decision a_ij ∈ A_ij(x_i, x_j, v_ij) contains information about (1) which of customers i and j to replenish, (2) how much to deliver at each of customers i and j, and (3) how to combine customers i and j into vehicle routes. (For a two-customer subproblem, the routing aspect of the decision is easy.) Let d_i(a_ij) denote the quantity of product that is delivered to customer i while executing decision a_ij. The feasible decisions a_ij ∈ A_ij(x_i, x_j, v_ij) satisfy the following constraints when the subproblem state is (x_i, x_j, v_ij). When the vehicle availability is v_ij = 0, then no vehicles can be sent to customers i and j, and d_i(a_ij) = d_j(a_ij) = 0. When v_ij = a, then one vehicle can be sent to customers i and j, and d_i(a_ij) + d_j(a_ij) ≤ C, x_i + d_i(a_ij) ≤ C_i, and x_j + d_j(a_ij) ≤ C_j. When v_ij = b, then one vehicle can be sent to customer i, no vehicle can be sent to customer j, and d_i(a_ij) ≤ min{λ^i_ij, C_i − x_i}, and d_j(a_ij) = 0. Feasible decisions are determined similarly if v_ij = c. When v_ij = d, then one vehicle can be sent to customers i and j, and d_i(a_ij) + d_j(a_ij) ≤ λ^ij_ij, x_i + d_i(a_ij) ≤ C_i, and x_j + d_j(a_ij) ≤ C_j. When v_ij = e, then one vehicle can be sent to each of customers i and j, and d_i(a_ij) ≤ min{C, C_i − x_i}, and d_j(a_ij) ≤ min{C, C_j − x_j}. When v_ij = f, then one vehicle can be sent to each of customers i and j, and d_i(a_ij) ≤ min{C, C_i − x_i}, and d_j(a_ij) ≤ min{λ^j_ij, C_j − x_j}.
Feasible decisions are determined similarly if v_ij = g. Finally, when v_ij = h, then both i and j can be visited by a vehicle each, and d_i(a_ij) ≤ min{λ^i_ij, C_i − x_i}, and d_j(a_ij) ≤ min{λ^j_ij, C_j − x_j}. As for the overall IRP, let the random variable A_ijt ∈ A_ij(X_it, X_jt, V_ijt) denote the decision chosen at time t. 3. The transition probabilities of the subproblems have to incorporate the probability distribution of customer demands, as well as the probabilities of vehicle availabilities to the subset of customers. Because we assume that the probability distribution f_ij of customer demands is known, the transition probabilities of the inventory levels can be determined for the subproblems as for the overall IRP. In the overall IRP process, the probabilities of vehicle availabilities to a subset of customers depend on the policy used to control the process, and are not directly obtainable from the input data of the IRP. Thus, some additional effort is required to make the transition probabilities of vehicle availabilities in the subproblems representative of what happens in the overall IRP. The basic idea is described next, and more details are provided later. Consider any policy π ∈ Π for the IRP with unique stationary probability ν_π(x) for each x ∈ X. (Thus, as indicated in Algorithm 1, the formulation
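The delivery constraints in item 2 can be sketched as a hypothetical helper that enumerates every feasible delivery pair (d_i, d_j) for a given state, assuming integral (discretized) quantities; the routing part of the decision is omitted, and lam_i, lam_j, lam_ij again stand for λ^i_ij, λ^j_ij, λ^ij_ij:

```python
def feasible_deliveries(x_i, x_j, C_i, C_j, v, C, lam_i, lam_j, lam_ij):
    """Enumerate all feasible delivery pairs (d_i, d_j) in subproblem state
    (x_i, x_j, v), following the constraints listed for each availability value."""
    pairs = []
    for d_i in range(C_i - x_i + 1):             # cannot exceed storage room at i
        for d_j in range(C_j - x_j + 1):         # cannot exceed storage room at j
            ok = {
                "0": d_i == 0 and d_j == 0,      # no vehicle available
                "a": d_i + d_j <= C,             # one full vehicle serves both
                "b": d_i <= lam_i and d_j == 0,  # fractional capacity for i only
                "c": d_j <= lam_j and d_i == 0,  # fractional capacity for j only
                "d": d_i + d_j <= lam_ij,        # one vehicle, shared fractional capacity
                "e": d_i <= C and d_j <= C,      # one full vehicle for each customer
                "f": d_i <= C and d_j <= lam_j,
                "g": d_i <= lam_i and d_j <= C,
                "h": d_i <= lam_i and d_j <= lam_j,
            }[v]
            if ok:
                pairs.append((d_i, d_j))
    return pairs
```

For instance, with empty inventories, storage capacities C_i = C_j = 5, and v_ij = 0, the only feasible decision delivers nothing to either customer.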
More information1 Linear programming relaxation
Cornell University, Fall 2010 CS 6820: Algorithms Lecture notes: Primal-dual min-cost bipartite matching August 27 30 1 Linear programming relaxation Recall that in the bipartite minimum-cost perfect matching
More informationNext-Event Simulation
Next-Event Simulation Lawrence M. Leemis and Stephen K. Park, Discrete-Event Simulation - A First Course, Prentice Hall, 2006 Hui Chen Computer Science Virginia State University Petersburg, Virginia March
More informationBasis Functions. Volker Tresp Summer 2017
Basis Functions Volker Tresp Summer 2017 1 Nonlinear Mappings and Nonlinear Classifiers Regression: Linearity is often a good assumption when many inputs influence the output Some natural laws are (approximately)
More informationScheduling Algorithms to Minimize Session Delays
Scheduling Algorithms to Minimize Session Delays Nandita Dukkipati and David Gutierrez A Motivation I INTRODUCTION TCP flows constitute the majority of the traffic volume in the Internet today Most of
More informationDelay-minimal Transmission for Energy Constrained Wireless Communications
Delay-minimal Transmission for Energy Constrained Wireless Communications Jing Yang Sennur Ulukus Department of Electrical and Computer Engineering University of Maryland, College Park, M0742 yangjing@umd.edu
More informationA Path Decomposition Approach for Computing Blocking Probabilities in Wavelength-Routing Networks
IEEE/ACM TRANSACTIONS ON NETWORKING, VOL. 8, NO. 6, DECEMBER 2000 747 A Path Decomposition Approach for Computing Blocking Probabilities in Wavelength-Routing Networks Yuhong Zhu, George N. Rouskas, Member,
More informationRecursive column generation for the Tactical Berth Allocation Problem
Recursive column generation for the Tactical Berth Allocation Problem Ilaria Vacca 1 Matteo Salani 2 Michel Bierlaire 1 1 Transport and Mobility Laboratory, EPFL, Lausanne, Switzerland 2 IDSIA, Lugano,
More informationCSE151 Assignment 2 Markov Decision Processes in the Grid World
CSE5 Assignment Markov Decision Processes in the Grid World Grace Lin A484 gclin@ucsd.edu Tom Maddock A55645 tmaddock@ucsd.edu Abstract Markov decision processes exemplify sequential problems, which are
More informationMetaheuristic Optimization with Evolver, Genocop and OptQuest
Metaheuristic Optimization with Evolver, Genocop and OptQuest MANUEL LAGUNA Graduate School of Business Administration University of Colorado, Boulder, CO 80309-0419 Manuel.Laguna@Colorado.EDU Last revision:
More informationDecomposition of log-linear models
Graphical Models, Lecture 5, Michaelmas Term 2009 October 27, 2009 Generating class Dependence graph of log-linear model Conformal graphical models Factor graphs A density f factorizes w.r.t. A if there
More informationModule 1 Lecture Notes 2. Optimization Problem and Model Formulation
Optimization Methods: Introduction and Basic concepts 1 Module 1 Lecture Notes 2 Optimization Problem and Model Formulation Introduction In the previous lecture we studied the evolution of optimization
More information10703 Deep Reinforcement Learning and Control
10703 Deep Reinforcement Learning and Control Russ Salakhutdinov Machine Learning Department rsalakhu@cs.cmu.edu Policy Gradient I Used Materials Disclaimer: Much of the material and slides for this lecture
More informationRollout Algorithms for Discrete Optimization: A Survey
Rollout Algorithms for Discrete Optimization: A Survey by Dimitri P. Bertsekas Massachusetts Institute of Technology Cambridge, MA 02139 dimitrib@mit.edu August 2010 Abstract This chapter discusses rollout
More informationof optimization problems. In this chapter, it is explained that what network design
CHAPTER 2 Network Design Network design is one of the most important and most frequently encountered classes of optimization problems. In this chapter, it is explained that what network design is? The
More informationPlanning and Control: Markov Decision Processes
CSE-571 AI-based Mobile Robotics Planning and Control: Markov Decision Processes Planning Static vs. Dynamic Predictable vs. Unpredictable Fully vs. Partially Observable Perfect vs. Noisy Environment What
More informationIntroduction to Optimization Problems and Methods
Introduction to Optimization Problems and Methods wjch@umich.edu December 10, 2009 Outline 1 Linear Optimization Problem Simplex Method 2 3 Cutting Plane Method 4 Discrete Dynamic Programming Problem Simplex
More informationNetwork Topology Control and Routing under Interface Constraints by Link Evaluation
Network Topology Control and Routing under Interface Constraints by Link Evaluation Mehdi Kalantari Phone: 301 405 8841, Email: mehkalan@eng.umd.edu Abhishek Kashyap Phone: 301 405 8843, Email: kashyap@eng.umd.edu
More informationTopology and Topological Spaces
Topology and Topological Spaces Mathematical spaces such as vector spaces, normed vector spaces (Banach spaces), and metric spaces are generalizations of ideas that are familiar in R or in R n. For example,
More informationCHAPTER 8 DISCUSSIONS
153 CHAPTER 8 DISCUSSIONS This chapter discusses the developed models, methodologies to solve the developed models, performance of the developed methodologies and their inferences. 8.1 MULTI-PERIOD FIXED
More informationSTA 4273H: Statistical Machine Learning
STA 4273H: Statistical Machine Learning Russ Salakhutdinov Department of Statistics! rsalakhu@utstat.toronto.edu! http://www.utstat.utoronto.ca/~rsalakhu/ Sidney Smith Hall, Room 6002 Lecture 5 Inference
More informationSimulation-Based Approximate Policy Iteration with Generalized Logistic Functions
Simulation-Based Approximate Policy Iteration with Generalized Logistic Functions Journal: INFORMS Journal on Computing Manuscript ID: JOC-0--OA- Manuscript Type: Original Article Date Submitted by the
More informationCS261: A Second Course in Algorithms Lecture #16: The Traveling Salesman Problem
CS61: A Second Course in Algorithms Lecture #16: The Traveling Salesman Problem Tim Roughgarden February 5, 016 1 The Traveling Salesman Problem (TSP) In this lecture we study a famous computational problem,
More informationLecture 2 The k-means clustering problem
CSE 29: Unsupervised learning Spring 2008 Lecture 2 The -means clustering problem 2. The -means cost function Last time we saw the -center problem, in which the input is a set S of data points and the
More informationAn Improved Policy Iteratioll Algorithm for Partially Observable MDPs
An Improved Policy Iteratioll Algorithm for Partially Observable MDPs Eric A. Hansen Computer Science Department University of Massachusetts Amherst, MA 01003 hansen@cs.umass.edu Abstract A new policy
More information6. Lecture notes on matroid intersection
Massachusetts Institute of Technology 18.453: Combinatorial Optimization Michel X. Goemans May 2, 2017 6. Lecture notes on matroid intersection One nice feature about matroids is that a simple greedy algorithm
More informationUsing Markov decision processes to optimise a non-linear functional of the final distribution, with manufacturing applications.
Using Markov decision processes to optimise a non-linear functional of the final distribution, with manufacturing applications. E.J. Collins 1 1 Department of Mathematics, University of Bristol, University
More informationNP-Hardness. We start by defining types of problem, and then move on to defining the polynomial-time reductions.
CS 787: Advanced Algorithms NP-Hardness Instructor: Dieter van Melkebeek We review the concept of polynomial-time reductions, define various classes of problems including NP-complete, and show that 3-SAT
More informationSolving Large Aircraft Landing Problems on Multiple Runways by Applying a Constraint Programming Approach
Solving Large Aircraft Landing Problems on Multiple Runways by Applying a Constraint Programming Approach Amir Salehipour School of Mathematical and Physical Sciences, The University of Newcastle, Australia
More informationDiscrete Optimization. Lecture Notes 2
Discrete Optimization. Lecture Notes 2 Disjunctive Constraints Defining variables and formulating linear constraints can be straightforward or more sophisticated, depending on the problem structure. The
More informationRigidity, connectivity and graph decompositions
First Prev Next Last Rigidity, connectivity and graph decompositions Brigitte Servatius Herman Servatius Worcester Polytechnic Institute Page 1 of 100 First Prev Next Last Page 2 of 100 We say that a framework
More information9.5 Equivalence Relations
9.5 Equivalence Relations You know from your early study of fractions that each fraction has many equivalent forms. For example, 2, 2 4, 3 6, 2, 3 6, 5 30,... are all different ways to represent the same
More informationSurrogate Gradient Algorithm for Lagrangian Relaxation 1,2
Surrogate Gradient Algorithm for Lagrangian Relaxation 1,2 X. Zhao 3, P. B. Luh 4, and J. Wang 5 Communicated by W.B. Gong and D. D. Yao 1 This paper is dedicated to Professor Yu-Chi Ho for his 65th birthday.
More informationCore Membership Computation for Succinct Representations of Coalitional Games
Core Membership Computation for Succinct Representations of Coalitional Games Xi Alice Gao May 11, 2009 Abstract In this paper, I compare and contrast two formal results on the computational complexity
More informationAN APPROXIMATE INVENTORY MODEL BASED ON DIMENSIONAL ANALYSIS. Victoria University, Wellington, New Zealand
AN APPROXIMATE INVENTORY MODEL BASED ON DIMENSIONAL ANALYSIS by G. A. VIGNAUX and Sudha JAIN Victoria University, Wellington, New Zealand Published in Asia-Pacific Journal of Operational Research, Vol
More informationColumn Generation II : Application in Distribution Network Design
Column Generation II : Application in Distribution Network Design Teo Chung-Piaw (NUS) 27 Feb 2003, Singapore 1 Supply Chain Challenges 1.1 Introduction Network of facilities: procurement of materials,
More informationLinear Programming. Meaning of Linear Programming. Basic Terminology
Linear Programming Linear Programming (LP) is a versatile technique for assigning a fixed amount of resources among competing factors, in such a way that some objective is optimized and other defined conditions
More informationMathematical preliminaries and error analysis
Mathematical preliminaries and error analysis Tsung-Ming Huang Department of Mathematics National Taiwan Normal University, Taiwan August 28, 2011 Outline 1 Round-off errors and computer arithmetic IEEE
More informationApproximate Dynamic Programming for a Class of Long-Horizon Maritime Inventory Routing Problems
Approximate Dynamic Programming for a Class of Long-Horizon Maritime Inventory Routing Problems Dimitri J. Papageorgiou, Myun-Seok Cheon Corporate Strategic Research ExxonMobil Research and Engineering
More informationComp Online Algorithms
Comp 7720 - Online Algorithms Notes 4: Bin Packing Shahin Kamalli University of Manitoba - Fall 208 December, 208 Introduction Bin packing is one of the fundamental problems in theory of computer science.
More informationA NETWORK SIMPLEX ALGORITHM FOR SOLVING THE MINIMUM DISTRIBUTION COST PROBLEM. I-Lin Wang and Shiou-Jie Lin. (Communicated by Shu-Cherng Fang)
JOURNAL OF INDUSTRIAL AND doi:10.3934/jimo.2009.5.929 MANAGEMENT OPTIMIZATION Volume 5, Number 4, November 2009 pp. 929 950 A NETWORK SIMPLEX ALGORITHM FOR SOLVING THE MINIMUM DISTRIBUTION COST PROBLEM
More information/633 Introduction to Algorithms Lecturer: Michael Dinitz Topic: Approximation algorithms Date: 11/27/18
601.433/633 Introduction to Algorithms Lecturer: Michael Dinitz Topic: Approximation algorithms Date: 11/27/18 22.1 Introduction We spent the last two lectures proving that for certain problems, we can
More informationCost Optimization in the (S 1, S) Lost Sales Inventory Model with Multiple Demand Classes
Cost Optimization in the (S 1, S) Lost Sales Inventory Model with Multiple Demand Classes A.A. Kranenburg, G.J. van Houtum Department of Technology Management, Technische Universiteit Eindhoven, Eindhoven,
More information