Planning and Reinforcement Learning through Approximate Inference and Aggregate Simulation


Hao Cui
Department of Computer Science, Tufts University, Medford, MA 02155, USA

Roni Khardon
Department of Computer Science, Indiana University, Bloomington, IN, USA

Infer to Control: NIPS 2018 Workshop on Probabilistic Reinforcement Learning and Structured Control.

Abstract

The paper reviews the SOGBOFA algorithm for stochastic planning in high-dimensional state and action spaces and its connections to reinforcement learning. SOGBOFA builds an explicit computation graph capturing an approximation of the Q value for the current state and any action. The computation of this graph was recently shown to be equivalent to running belief propagation inference on the corresponding dynamic Bayesian network. This construction is combined with automatic differentiation over the graph to search for the best action in the current state, and the result is used for on-line planning. The algorithm therefore combines value function approximation through probabilistic inference with gradient search, both of which are closely related to algorithms in reinforcement learning. The key difference is that instead of the widely used idea of approximating Q values using function approximation, SOGBOFA uses a novel graph construction and approximate inference to compute Q values. We argue that these ideas have useful analogues for reinforcement learning.

1 Introduction

SOGBOFA is an algorithm for stochastic planning in high-dimensional state and action spaces. This paper gives an overview of the SOGBOFA algorithm and its connection to probabilistic inference, and then explores the connections to model based reinforcement learning and policy gradients.

SOGBOFA [2] extends the well-known rollout algorithm [12], which chooses its actions through simulation. The rollout algorithm uses a simulator to estimate the quality of each possible action for the current step by taking that action and then continuing the simulation with some fixed policy. This method is both model free and policy free, and it works well in some cases. Two limitations are the need for a good rollout policy and the need for many simulations. For the second point, note that one must enumerate all actions and estimate the value of each action separately and, in addition, each action requires many simulations to obtain an accurate estimate.

SOGBOFA is model based and uses the model to improve over this algorithm in two important ways. The first is that, instead of using concrete simulation of trajectories, the algorithm builds an explicit computation graph capturing an approximation of the corresponding value when rolling out the random policy. Therefore a single symbolic simulation suffices. The second is that, because the simulation is given as an explicit computation graph, one can use automatic differentiation and gradients to search for the best action, avoiding the action enumeration required by rollout.
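To make the contrast with aggregate simulation concrete, the following is a minimal sketch of the rollout scheme described above. It assumes a generative simulator simulate(state, action) returning a next state and reward, and a fixed rollout_policy; all names are illustrative and not from the paper.

```python
def rollout_q(simulate, rollout_policy, state, action, horizon, num_trials):
    """Monte Carlo estimate of Q(state, action): take `action` once, then
    follow `rollout_policy` for the remaining steps, averaging the return."""
    total = 0.0
    for _ in range(num_trials):
        s, a, ret = state, action, 0.0
        for _ in range(horizon):
            s, r = simulate(s, a)      # one stochastic transition
            ret += r
            a = rollout_policy(s)      # fixed policy for all future steps
        total += ret
    return total / num_trials

def rollout_action(simulate, rollout_policy, state, actions, horizon, num_trials):
    """Enumerate candidate actions and return the one with the best estimate."""
    return max(actions, key=lambda a: rollout_q(simulate, rollout_policy,
                                                state, a, horizon, num_trials))
```

Both limitations mentioned above are visible here: the quality of the estimate depends on rollout_policy, and the cost grows with the number of candidate actions times the number of trials per action.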

Several recent observations expose the connection to RL more clearly [3]. The first is that the computation graph of SOGBOFA calculates exactly the same solution as the one that would be computed by belief propagation (BP) on the corresponding inference problem [5], and that this allows for improvements through lifted inference. The second is that the rollout policy can be optimized to some extent in the same manner and simultaneously with action selection. This approach was shown to be competitive in large scale planning problems, and lifted-conformant SOGBOFA was the runner-up at the recent international probabilistic planning competition. In addition, the most recent version of SOGBOFA handles hard action constraints represented in first order logic, which is an important aspect of some recent RL applications. Finally, [5] developed a reduction that allows a variant of SOGBOFA to solve any marginal MAP inference problem and showed that this yields a state of the art MMAP solver. This might allow us to use the same approach to solve a wider range of decision problems, including POMDPs.

Unlike most methods in RL, SOGBOFA assumes that an explicit representation of the model is given as input. However, like recent RL methods (e.g., [11]), it embeds its model inside a simulation whose output is being optimized. In addition, estimation of Q values is a fundamental question for model-free RL methods such as policy gradient and actor-critic methods. SOGBOFA can provide an alternative way of constructing critics with more stable estimates. The interesting point is that it performs its computations in a manner that is different from current methods in RL, and it offers opportunities for improvement through the use of aggregate simulation. The paper reviews some aspects of SOGBOFA and provides further discussion of the similarities to work in RL and potential future work.

2 The SOGBOFA Algorithm

Stochastic planning can be formalized using Markov decision processes [8] in factored state and action spaces. In factored spaces [1] the state is specified by a set of variables and the number of states is exponential in the number of variables. Similarly, in factored action spaces an action is specified by a set of variables. We assume that all state and action variables are binary. Finite horizon planning can be captured using a dynamic Bayesian network (DBN) where state and action variables at each time step are represented explicitly and the CPTs of variables are given by the transition probabilities.

SOGBOFA works in the setup of on-line planning. In on-line planning we are given a fixed limited time t per step and cannot compute a policy in advance. Instead, given the current state, the algorithm must decide on the next action within time t. Then the action is performed, a transition and reward are observed, and the algorithm is presented with the next state. This process repeats and the long term performance of the algorithm is evaluated. SOGBOFA performs on-line planning by estimating the value of initial actions where a fixed rollout policy, typically a random policy, is used in future steps. The AROLLOUT algorithm [4] introduced the idea of algebraic simulation to estimate values but optimized over actions by enumeration. SOGBOFA then showed how algebraic rollouts can be computed symbolically and how the optimization can be done using automatic differentiation.

Graph Construction: The algorithm starts by translating from a high level language (e.g., RDDL [9]) to a DBN. AROLLOUT transforms the CPT of a node $x$ into a disjoint sum form. In particular, the CPT is represented in the form: if $(c_{11} \vee c_{12} \vee \ldots)$ then $p_1$; $\ldots$; if $(c_{n1} \vee c_{n2} \vee \ldots)$ then $p_n$, where $p_i$ is $p(x=1)$ and the $c_{ij}$ are conjunctions of parent values which are mutually exclusive and exhaustive.
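As an illustration, a CPT in this disjoint sum form can be stored as a list of (conjunction, probability) rows; the layout and names below are assumptions made for exposition, not the paper's data structures.

```python
# A row is (conjunction, p_i): "if the conjunction holds then p(x=1) = p_i".
# A conjunction is a list of (parent_name, polarity) literals; disjuncts
# c_i1, c_i2, ... that share the same p_i simply become separate rows here.
# Example CPT: p(x'=1) = 0.9 if (y and z), 0.2 if (y and not z), 0.2 if (not y);
# the three conjunctions are mutually exclusive and exhaustive.
x_cpt = [
    ([("y", True), ("z", True)],  0.9),
    ([("y", True), ("z", False)], 0.2),
    ([("y", False)],              0.2),
]
```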
It then performs a forward pass calculating $\hat{p}(x)$, an approximation of the true marginal $p(x)$, for any node $x$ in the graph. $\hat{p}(x)$ is calculated as a function of $\hat{p}(c_{ij})$, an estimate of the probability that $c_{ij}$ is true, which assumes the parents are independent. This is done using the following equations, where nodes are processed in the topological order of the graph (a small sketch of this forward pass is given below):

$$\hat{p}(x) = \sum_{ij} p(x \mid c_{ij})\,\hat{p}(c_{ij}) = \sum_{ij} p_i\,\hat{p}(c_{ij}) \qquad (1)$$

$$\hat{p}(c_{ij}) = \prod_{w_k \in c_{ij}} \hat{p}(w_k) \prod_{\neg w_k \in c_{ij}} \bigl(1 - \hat{p}(w_k)\bigr). \qquad (2)$$
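The two equations translate directly into code. The sketch below is illustrative (it assumes the list-of-rows CPT layout shown earlier and uses invented function names); it processes nodes in topological order and computes each approximate marginal from the marginals of its parents.

```python
def conj_prob(conj, marginals):
    """Equation (2): probability that conjunction `conj` holds, treating the
    parent variables as independent with the given marginals."""
    p = 1.0
    for var, positive in conj:
        p *= marginals[var] if positive else (1.0 - marginals[var])
    return p

def node_marginal(cpt, marginals):
    """Equation (1): approximate marginal of a node whose CPT is a list of
    (conjunction, p_i) rows that are mutually exclusive and exhaustive."""
    return sum(p_i * conj_prob(conj, marginals) for conj, p_i in cpt)

def forward_pass(cpts, marginals):
    """One layer of aggregate simulation: `cpts` maps node name -> CPT and is
    assumed to be ordered topologically; returns the updated marginals."""
    marginals = dict(marginals)
    for node, cpt in cpts.items():
        marginals[node] = node_marginal(cpt, marginals)
    return marginals
```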

[Figure 1: Example of SOGBOFA graph construction.]

The following example from [2] illustrates this construction. The problem has three state variables s(1), s(2) and s(3), and three action variables a(1), a(2) and a(3). In addition we have two intermediate variables cond1 and cond2 which are not part of the state. The transitions and reward are given by the following RDDL [9] expressions, where primed variants of variables represent the value of the variable after performing the action:

cond1 = Bernoulli(0.7)
cond2 = Bernoulli(0.5)
s'(1) = if (cond1) then ~a(3) else false
s'(2) = if (s(1)) then a(2) else false
s'(3) = if (cond2) then s(2) else false
reward = s(1) + s(2) + s(3)

AROLLOUT translates the RDDL code into algebraic expressions using standard transformations from a logical to a numerical representation. In our example this yields:

s'(1) = (1 - a(3)) * 0.7
s'(2) = s(1) * a(2)
s'(3) = s(2) * 0.5
r = s(1) + s(2) + s(3)

These expressions are used to calculate an approximation of marginal distributions over state and reward variables. The distribution at each time step is approximated using a product distribution over the state variables. To illustrate, assume that the state is s_0 = {s(1)=0, s(2)=1, s(3)=0}, which we take to be a product of marginals. At the first step AROLLOUT uses a concrete action, for example a_0 = {a(1)=1, a(2)=0, a(3)=0}. This gives the reward r_0 = 0 + 1 + 0 = 1 and state variables s_1 = {s(1)=(1-0)*0.7=0.7, s(2)=0*0=0, s(3)=1*0.5=0.5}. In future steps it calculates marginals for the action variables and uses them in a similar manner. For example, if a_1 = {a(1)=0.33, a(2)=0.33, a(3)=0.33} we get r_1 = 0.7 + 0 + 0.5 = 1.2 and s_2 = {s(1)=(1-0.33)*0.7, s(2)=0.7*0.33, s(3)=0*0.5}. Summing the rewards from all steps gives an estimate of the Q value for a_0. AROLLOUT randomly enumerates values for a_0 and then selects the one with the highest estimate.

The main observation in SOGBOFA is that instead of calculating concrete values, we can use the expressions to construct an explicit directed acyclic graph representing the computation steps, where the last node represents the expectation of the cumulative reward. SOGBOFA uses a symbolic representation for the first action and assumes that the rollout uses the random policy. In our example, if the action variables are mutually exclusive (such constraints are often imposed in high level domain descriptions), this gives marginals of a_1 = {a(1)=0.33, a(2)=0.33, a(3)=0.33} over these variables. The SOGBOFA graph for our example expanded to depth 1 is shown in Figure 1. The bottom layer represents the current state and action variables. Each node at the next level represents the expression that AROLLOUT would have calculated for that marginal. To expand the planning horizon we simply duplicate the second layer construction multiple times.
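The numbers in this example can be reproduced with a few lines of code; this is a small illustrative script using the algebraic expressions above, not the AROLLOUT implementation.

```python
def step(s, a):
    """Apply the example's algebraic transition and reward expressions once."""
    r = s["s1"] + s["s2"] + s["s3"]            # reward = s(1) + s(2) + s(3)
    s_next = {"s1": (1.0 - a["a3"]) * 0.7,     # s'(1) = (1 - a(3)) * 0.7
              "s2": s["s1"] * a["a2"],         # s'(2) = s(1) * a(2)
              "s3": s["s2"] * 0.5}             # s'(3) = s(2) * 0.5
    return r, s_next

s0 = {"s1": 0.0, "s2": 1.0, "s3": 0.0}         # current state
a0 = {"a1": 1.0, "a2": 0.0, "a3": 0.0}         # concrete first action
r0, s1 = step(s0, a0)                          # r0 = 1.0, s1 = {0.7, 0.0, 0.5}
a1 = {"a1": 0.33, "a2": 0.33, "a3": 0.33}      # random-policy marginals
r1, s2 = step(s1, a1)                          # r1 = 1.2, s2 = {0.469, 0.231, 0.0}
q_a0 = r0 + r1                                 # summed rewards: Q estimate for a0
print(q_a0)                                    # 2.2 at depth 2
```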

Now, given concrete marginal probabilities for the action variables at the first step, i.e., a_0, an evaluation of the graph captures the same calculation as AROLLOUT.

Calculating the gradient: The main observation is that once the graph is built, representing Q as a function of the action variables, we can calculate gradients using the method of automatic differentiation [6]. Our implementation uses the reverse accumulation method since it is more efficient, having linear complexity in the size of the DAG for the calculation of all gradients (a small illustrative sketch is given at the end of this section).

Lifted SOGBOFA Graph: As mentioned above, the computation graph was shown to calculate the same value as would be computed by belief propagation. Building on that, one can build a lifted graph based on ideas from lifted BP [10, 7]. In lifted BP, two nodes or factors send identical messages when all their input messages are identical and, in addition, the local factor is identical. With the computational structure of SOGBOFA this has a clear analogy: two nodes in SOGBOFA's graph are identical when their parents are identical and the local algebraic expressions capturing the local factors are identical. This suggests a straightforward implementation of Lifted SOGBOFA using dynamic programming, where the graph construction tracks nodes and avoids duplication. Moreover, by joining identical edges into grouped calculations (every plus node with k incoming identical edges is turned into a multiply node with constant multiplier k, and every multiply node with k incoming identical edges is turned into a power node with constant power k), the graph directly captures counting expressions that are often seen in lifted inference.

Optimizing the Rollout Policy: Note that, just like rollout, SOGBOFA does not calculate an explicit policy. It optimizes values for the current step and uses a random policy for trajectory actions. However, one can try to improve on this by optimizing the rollout policy as well. It turns out that optimizing a full-fledged policy is not realistic, due to the non-transferability of the policy across planning problems with different goals as well as the approximation inherent in aggregate simulation. However, one can optimize rollout actions to some extent. In particular, we can directly optimize the marginal probability for each action variable in future steps. This gives the conformant variant of SOGBOFA (following a similar name for straight-line plans in stochastic planning). This type of optimization is already supported by the construction described above, by not assigning marginal probabilities to future action variables and instead turning them into symbolic variables. Then automatic differentiation and gradient ascent for all variables can be performed at the same time cost. Lifted conformant SOGBOFA was shown to be very effective, especially in problems with large action spaces and a high degree of stochasticity, and was runner-up in the recent planning competition.
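To make the gradient-based search concrete, the following sketch writes the Q estimate for the running example as a function of the first-step action marginals and improves it by projected gradient ascent. The actual implementation uses reverse-mode automatic differentiation over the explicit graph; finite-difference gradients are used here only to keep the sketch self-contained, and the names, horizon, and step size are illustrative assumptions.

```python
def q_estimate(a_first, depth=3):
    """Aggregate-simulation Q estimate for the running example as a function of
    the first-step action marginals; later steps use the random policy (1/3 each)."""
    s = {"s1": 0.0, "s2": 1.0, "s3": 0.0}
    a, total = dict(a_first), 0.0
    for _ in range(depth):
        total += s["s1"] + s["s2"] + s["s3"]
        s = {"s1": (1.0 - a["a3"]) * 0.7,
             "s2": s["s1"] * a["a2"],
             "s3": s["s2"] * 0.5}
        a = {"a1": 1 / 3, "a2": 1 / 3, "a3": 1 / 3}   # rollout with the random policy
    return total

def ascent_step(a, lr=0.2, eps=1e-5):
    """One projected gradient-ascent step on the first-step action marginals."""
    base = q_estimate(a)
    grad = {}
    for k in a:
        bumped = dict(a)
        bumped[k] += eps
        grad[k] = (q_estimate(bumped) - base) / eps   # finite-difference gradient
    return {k: min(1.0, max(0.0, a[k] + lr * grad[k])) for k in a}

a = {"a1": 1 / 3, "a2": 1 / 3, "a3": 1 / 3}
for _ in range(50):
    a = ascent_step(a)
print(a, q_estimate(a))   # a(3) is driven toward 0, which increases the Q estimate
```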
3 Connection to Reinforcement Learning

While the scheme above assumes a given symbolic model, one can combine it with model learning to get a model based RL algorithm. The main requirement is that the representation of the learned model allows for an algebraic representation (or can be transformed into one) so that it can be included in the aggregate simulation. On the other hand, because the planner assumes independence in the graph construction, one can learn an independent model for each state variable and ignore correlations among them (a small sketch of such a per-variable model is given at the end of this section). This will give a novel model based RL algorithm that plans via approximate inference and first action gradients. This approach can also yield an interesting model based actor-critic algorithm where the critic is constructed through the graph based on the learned model. Note that it is not necessary to learn the model off-line; the model learning can be done incrementally, as is typical in RL.

Our algorithm also has aspects that are similar to policy gradients. Here the main distinction is that a policy is not explicitly constructed by the algorithm. This is partly due to its use in on-line planning and the domain and goal independent nature of planning (i.e., because policies do not transfer across different reward functions). However, allowing for long term learning, and maintaining a richer policy representation which can be reused over multiple steps, can yield a novel variant of policy gradients. Here too the simulation and gradients of policy gradients can be done through the graph construction. We note, however, that such a construction is non-trivial because aggregate simulations do not allow the policy to access the state directly. This discussion suggests that the techniques in SOGBOFA can be useful for new model based RL methods.
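As a hedged sketch of the per-variable model mentioned at the start of this section (a simple counting estimator with assumed known parent sets and invented names, not a design prescribed by the paper), each state variable gets its own conditional probability estimate that can be updated incrementally from observed transitions and later compiled into the algebraic form used by the planner.

```python
from collections import defaultdict

class IndependentTransitionModel:
    """One Bernoulli estimate p(x'=1 | parents) per state variable, ignoring
    correlations among next-state variables.  `parents` maps each state
    variable to the (assumed known) state/action variables it depends on."""

    def __init__(self, parents):
        self.parents = parents
        self.counts = defaultdict(lambda: [0.0, 0.0])  # (var, cfg) -> [n_false, n_true]

    def update(self, state, action, next_state):
        """Incremental update from a single observed transition."""
        sa = {**state, **action}
        for var, pa in self.parents.items():
            cfg = tuple(sa[p] for p in pa)
            self.counts[(var, cfg)][int(next_state[var])] += 1.0

    def prob(self, var, state, action):
        """Estimated p(var'=1 | parent configuration), with a Laplace prior."""
        sa = {**state, **action}
        cfg = tuple(sa[p] for p in self.parents[var])
        n_false, n_true = self.counts[(var, cfg)]
        return (n_true + 1.0) / (n_false + n_true + 2.0)
```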

Acknowledgments

This work was partly supported by NSF under grant IIS.

References

[1] Craig Boutilier, Thomas Dean, and Steve Hanks. Planning under uncertainty: Structural assumptions and computational leverage. In Proceedings of the Second European Workshop on Planning.
[2] Hao Cui and Roni Khardon. Online symbolic gradient-based optimization for factored action MDPs. In Proc. of the International Joint Conference on Artificial Intelligence.
[3] Hao Cui and Roni Khardon. Stochastic planning, lifted inference, and marginal MAP. In Workshop on Planning and Inference, held with AAAI.
[4] Hao Cui, Roni Khardon, Alan Fern, and Prasad Tadepalli. Factored MCTS for large scale stochastic planning. In Proc. of the AAAI Conference on Artificial Intelligence.
[5] Hao Cui, Radu Marinescu, and Roni Khardon. From stochastic planning to marginal MAP. In Advances in Neural Information Processing Systems.
[6] Andreas Griewank and Andrea Walther. Evaluating Derivatives: Principles and Techniques of Algorithmic Differentiation (2nd ed.). SIAM.
[7] Kristian Kersting, Babak Ahmadi, and Sriraam Natarajan. Counting belief propagation. In UAI.
[8] M. L. Puterman. Markov Decision Processes: Discrete Stochastic Dynamic Programming. Wiley.
[9] Scott Sanner. Relational Dynamic Influence Diagram Language (RDDL): Language Description. Unpublished manuscript, Australian National University.
[10] Parag Singla and Pedro M. Domingos. Lifted first-order belief propagation. In AAAI.
[11] Aviv Tamar, Yi Wu, Garrett Thomas, Sergey Levine, and Pieter Abbeel. Value iteration networks. In Advances in Neural Information Processing Systems.
[12] Gerald Tesauro and Gregory R. Galperin. On-line policy improvement using Monte-Carlo search. In Advances in Neural Information Processing Systems.
