Lifted Stochastic Planning, Belief Propagation and Marginal MAP


Hao Cui and Roni Khardon
Department of Computer Science, Tufts University, Medford, Massachusetts, USA

Abstract

It is well known that the problems of stochastic planning and probabilistic inference are closely related. This paper makes several contributions in this context for factored spaces where the complexity of solutions is challenging. First, we analyze the recently developed SOGBOFA heuristic, which performs stochastic planning by building an explicit computation graph capturing an approximate aggregate simulation of the dynamics. It is shown that the values computed by this algorithm are identical to the approximation provided by Belief Propagation (BP). Second, as a consequence of this observation, we show how ideas on lifted BP can be used to develop a lifted version of SOGBOFA. Unlike implementations of lifted BP, Lifted SOGBOFA has a very simple implementation as a dynamic programming version of the original graph construction. Third, we show that the idea of graph construction for aggregate simulation can be used to solve marginal MAP (MMAP) problems in Bayesian networks, where MAP variables are constrained to be at the roots of the network. This yields a novel algorithm for MMAP for this subclass. An experimental evaluation illustrates the advantage of Lifted SOGBOFA for planning.

Introduction

The connection between planning and inference is well known both for the logical setting and the probabilistic setting. Over the last decade multiple authors have introduced explicit reductions showing how stochastic planning can be solved using probabilistic inference (Domshlak and Hoffmann 2006; Toussaint and Storkey 2006; Furmston and Barber 2010; Lang and Toussaint 2010; Furmston and Barber 2011; Neumann 2011; Kober and Peters 2011; Sabbadin, Peyrard, and Forsell 2012; Liu and Ihler 2012; Cheng et al. 2013; Kiselev and Poupart 2014; Nitti, Belle, and Raedt 2015; van de Meent et al. 2016; Lee, Marinescu, and Dechter 2016; Nguyen, Kumar, and Lau 2017), with applications in robotics, scheduling and environmental problems. However, heuristic methods and search are still the best performing approaches for planning in large combinatorial state and action spaces (Kolobov et al. 2012; Keller and Helmert 2013; Cui and Khardon 2016). This paper makes several contributions in this context.

We analyze a recent heuristic algorithm that was shown to be very effective in practice for large action spaces. SOGBOFA (Cui and Khardon 2016) builds an approximate computation graph capturing marginals of state and reward variables under independence assumptions. It then uses gradient-based search to optimize the action choice. Our analysis shows that the value computed by SOGBOFA's computation graph is identical to the solution of Belief Propagation (BP) when conditioned on actions. Abstracting this suggests a novel approach to planning and marginal MAP problems, where approximations are often used: if we can compute a closed-form representation of the value (or marginal probability) computed by the chosen approximation, when conditioned on the actions (max variables), then we can perform the optimization through automatic differentiation (Griewank and Walther 2008). In the case of SOGBOFA we have a closed form for the result of BP for a sub-family of directed graphs.
This captures marginal MAP (MMAP) problems for Bayesian networks with max variables at the roots and evidence restricted to a single leaf node. Given this analysis it is natural to ask whether we can improve SOGBOFA further by taking advantage of improvements to BP, for example Lifted BP (Singla and Domingos 2008; Kersting, Ahmadi, and Natarajan 2009), where repeated messages are avoided in the computation. As we show, this can be done, and the form of the lifted algorithm is extremely simple. For the sub-family of directed graphs, lifted BP is achieved simply by using a dynamic programming implementation of the construction of the SOGBOFA graph. This is an important advantage because typical implementations of lifted inference are complex, whereas here the implementation comes almost for free. An experimental evaluation shows that the computation time of SOGBOFA is improved and that, under the same time constraints, Lifted SOGBOFA improves planning performance over SOGBOFA.

Finally, while SOGBOFA was developed for planning, it can be used to solve MMAP problems. The original construction for SOGBOFA assumes one final Q-value node which is being optimized. It is not hard to see that this can be adapted to solve any MMAP problem with decision variables at the roots of the Bayesian network. This provides a novel alternative to the mixed-BP algorithm of Liu and Ihler (2013) for MMAP, where our algorithm combines lifted BP conditioned on MAP nodes with gradient search.

Preliminaries

We assume familiarity with basic notions in Bayesian networks and stochastic planning. This section recalls some details of BP and SOGBOFA.

Belief Propagation in Bayesian Networks

For our results it is convenient to refer to the BP algorithm for directed graphs (Pearl 1989). A Bayesian network is given by a directed acyclic graph where each node $x$ is associated with a random variable and a corresponding conditional probability table (CPT) capturing $p(x \mid \mathrm{parents}(x))$. The joint probability of all variables is given by $\prod_i p(x_i \mid \mathrm{parents}(x_i))$. We restrict the presentation in this paper to Bayesian networks with binary variables, i.e., $x_i \in \{0, 1\}$.

BP calculates an approximation of the marginal probability of each node by sending messages among adjacent nodes. Assume first that the directed graph is a polytree (no underlying undirected cycles). The marginal of $x$, as calculated by BP, is denoted by $\mathrm{BEL}(x) = p(x \mid e)$, where $e$ is the total evidence in the graph. Let $\pi(x) = p(x \mid e^+)$ and $\lambda(x) = p(e^- \mid x)$, where $e^+$ is the evidence that can reach $x$ through its parents and $e^-$ is the evidence that can be reached from $x$ through its children. We use $\alpha$ to represent a normalization constant, for example, $\alpha(2, 3, 5) = (0.2, 0.3, 0.5)$, and use $\beta$ to denote some constant. For a polytree, $x$ separates its parents from its children and

$$\mathrm{BEL}(x) = \alpha\,\pi(x)\,\lambda(x) \qquad (1)$$

where

$$\lambda(x) = \prod_{j=1}^{n} \lambda_{z_j}(x) \qquad (2)$$

and

$$\pi(x) = \sum_{w} p(x \mid w) \prod_{k=1}^{n} \pi_x(w_k) \qquad (3)$$

incorporate evidence through children and parents respectively. In (3) the sum variable $w$ ranges over all assignments to the parents of $x$ and $w_k$ is the induced value of the $k$th parent. $\lambda_{z_j}(x)$ is the message that each child $z_j$ sends to its parent $x$, and $\pi_x(w_k)$ is the message that each parent $w_k$ sends to $x$. The messages are given by

$$\lambda_x(w_i) = \beta \sum_{x} \lambda(x) \sum_{w_k : k \neq i} p(x \mid w) \prod_{k \neq i} \pi_x(w_k) \qquad (4)$$

$$\pi_{z_j}(x) = \alpha \prod_{k \neq j} \lambda_{z_k}(x)\,\pi(x) \qquad (5)$$

where in (4) $w_i$ is fixed and the sum is over the values of the other parents $w_k$. Since the nodes are binary the messages have two values.

The algorithm is initialized as follows. If $x = 0$ is observed, we set both $\pi(x)$ and $\lambda(x)$ to $(0, 1)$, and if $x = 1$ is observed we set them to $(1, 0)$. Otherwise, if a node $x$ is a root (parent-less), we set $\pi(x)$ to be the prior probability of $x$. If $x$ is child-less, we set $\lambda(x)$ to be $(1, 1)$, i.e., an uninformative value. A node can send a message along an edge if all messages from its other edges have been received. If the graph is a tree (or polytree) then one pass on the graph from leaves to root and one pass from root to leaves provide $\mathrm{BEL}(x) = p(x)$ for any $x$ in the tree, where $p(x)$ is the true marginal probability.

The loopy BP algorithm applies the same updates even if the graph is not a polytree. In this case we initialize all messages to $(1, 1)$ and follow the same update rules for messages according to some schedule. The algorithm is not guaranteed to converge but it often does and it is known to perform well in many cases. The following property of BP is well known:

Lemma 1. If loopy BP is applied to a graph $G$ with no evidence, i.e., $\lambda(x) = (1, 1)$ for all $x$ at initialization, then for any order of message updates and at any time in the execution of loopy BP, for any node $x$, $\lambda(x) \propto (1, 1)$ and $\lambda_x(w)$ is a constant independent of $w$ for any parent $w$ of $x$.
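To make the no-evidence case concrete, the following minimal sketch (ours, not the paper's implementation; the function and data-structure names are illustrative) computes marginals by a single forward pass over the network in topological order, using the consequence of Lemma 1 that the $\lambda$ messages stay uninformative so that $\mathrm{BEL}(x) = \pi(x)$:

```python
import itertools

# `parents` maps node -> ordered list of parent names; `cpt` maps node ->
# dict from parent-assignment tuples (0/1 per parent) to p(node = 1 | assignment).

def forward_bp(nodes_in_topo_order, parents, cpt):
    bel = {}  # node -> approximate p(node = 1)
    for x in nodes_in_topo_order:
        ps = parents[x]
        if not ps:                       # root: the prior is stored under the empty tuple
            bel[x] = cpt[x][()]
            continue
        p1 = 0.0
        # Sum over all joint parent assignments, weighting each by the product
        # of parent beliefs (parents treated as independent, as in BP).
        for w in itertools.product([0, 1], repeat=len(ps)):
            weight = 1.0
            for wk, pk in zip(w, ps):
                weight *= bel[pk] if wk == 1 else 1.0 - bel[pk]
            p1 += cpt[x][w] * weight
        bel[x] = p1
    return bel

# Example: two independent roots a, b with a common child c.
parents = {"a": [], "b": [], "c": ["a", "b"]}
cpt = {"a": {(): 0.3}, "b": {(): 0.6},
       "c": {(0, 0): 0.1, (0, 1): 0.8, (1, 0): 0.8, (1, 1): 0.95}}
print(forward_bp(["a", "b", "c"], parents, cpt))  # exact here: the graph is a polytree
```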
AROLLOUT and SOGBOFA

Stochastic planning is typically captured as a Markov decision process (MDP) (Puterman 1994) with factored state and action spaces. An MDP is specified by $\{S, A, T, R, \gamma\}$, where $S$ is a finite state space, $A$ is a finite action space, $T(s, a, s') = p(s' \mid s, a)$ defines the transition probabilities, $R(s, a)$ is the immediate reward of taking action $a$ in state $s$, and $\gamma$ is the discount factor. In factored spaces (Boutilier, Dean, and Hanks 1995) the state space $S$ is specified by a set of binary variables $\{x_1, \ldots, x_l\}$ so that $|S| = 2^l$. Similarly, $A$ is specified by a set of binary variables $\{y_1, \ldots, y_m\}$ so that $|A| = 2^m$.

In off-line planning the task is to compute a policy that optimizes the long-term reward. More formally, any policy $\pi : S \to A$ induces a distribution over trajectories, and the task in planning is to optimize the expected discounted total reward $E[\sum_i \gamma^i R(s_i, \pi(s_i)) \mid \pi]$, where $s_i$ is the $i$th state visited by following $\pi$ (and $s_0 = s$). For finite horizon problems the sum goes up to the fixed horizon and we use $\gamma = 1$, i.e., discounting is not needed.

In on-line planning we are given a fixed limited time $t$ per step and cannot compute a policy in advance. Instead, given the current state, the algorithm must decide on the next action consuming at most $t$ time. Then the action is performed in the MDP, a transition and reward are observed, and the algorithm is presented with the next state. This process repeats and the long-term performance of the algorithm is evaluated. On-line planning has been the standard evaluation method in recent planning competitions.

The algorithms discussed in this paper perform on-line planning by estimating the value of every initial action, where a rollout policy, typically a random policy, is used in future steps. They differ in how they estimate the value of an initial action through a forward simulation and in how they search over possibilities for the initial action. The AROLLOUT algorithm (Cui et al. 2015) introduced the idea of algebraic simulation to estimate values but optimized over actions in a brute-force manner. The work of Cui and Khardon (2016) showed how rollouts can be computed symbolically and that the optimization can be done using automatic differentiation, leading to the SOGBOFA algorithm.
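For concreteness, here is a minimal sketch (ours; the simulator interface `step(state, action) -> (next_state, reward)` and all names are assumptions, not the AROLLOUT code) of the concrete-rollout baseline that these algorithms refine: each candidate initial action is scored by Monte Carlo rollouts under a random rollout policy, and the best-scoring action is executed.

```python
import random

def rollout_value(step, state, action, actions, horizon, n_rollouts=100):
    """Monte Carlo estimate of the value of taking `action` in `state`."""
    total = 0.0
    for _ in range(n_rollouts):
        s, r = step(state, action)            # fix the initial action
        ret = r
        for _ in range(horizon - 1):          # random rollout policy afterwards
            s, r = step(s, random.choice(actions))
            ret += r                          # finite horizon, gamma = 1
        total += ret
    return total / n_rollouts

def next_action(step, state, actions, horizon):
    # One on-line planning step: estimate every initial action, act greedily.
    return max(actions, key=lambda a: rollout_value(step, state, a, actions, horizon))
```

AROLLOUT replaces the inner sampling loop with a single algebraic (aggregate) simulation of marginals, and SOGBOFA replaces the enumeration over initial actions with gradient search.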

Finite horizon planning can be translated from a high-level language (e.g., RDDL (Sanner 2010)) to a dynamic Bayesian network (DBN). SOGBOFA transforms the CPT of a node $x$ into disjoint sum form. The CPT is split into $n$ disjoint portions $g_i$, each sharing the same value of $p(x \mid \mathrm{parents})$. Each portion is further split into mutually exclusive clauses $c_{i1}, \ldots, c_{ij}$. In other words, the CPT is represented in the form: if $(c_{11} \vee c_{12} \vee \ldots)$ then $p_1$; $\ldots$; if $(c_{n1} \vee c_{n2} \vee \ldots)$ then $p_n$, where the elements $c_{i_1 j_1}$ and $c_{i_2 j_2}$ are mutually exclusive and exhaustive.

AROLLOUT performs a forward pass calculating $\hat{p}(x)$, an approximation of the true marginal $p(x)$, for every node in the graph. $\hat{p}(x)$ is calculated as a function of $\hat{p}(c_{ij})$, an estimate of the probability that $c_{ij}$ is true:

$$\hat{p}(x) = \sum_{ij} p(x \mid c_{ij})\,\hat{p}(c_{ij}) = \sum_{ij} p_i\,\hat{p}(c_{ij}) \qquad (6)$$

where $\hat{p}(c_{ij})$ is calculated assuming the parents are independent:

$$\hat{p}(c_{ij}) = \prod_{w_k \in c_{ij}} \hat{p}(w_k) \prod_{\neg w_k \in c_{ij}} (1 - \hat{p}(w_k)). \qquad (7)$$

The main observation in SOGBOFA is that instead of using this process to calculate concrete values, we can use the expressions to construct an explicit data structure representing the computation graph, where the last node represents the accumulated reward. Once this is done, we can use automatic differentiation and gradient-based search to optimize the action variables. SOGBOFA includes several additional heuristics, including dynamic control of simulation depth, dynamic selection of gradient step size, maintaining domain constraints, and a balance between gradient search and random restarts. Most of this is orthogonal to the topic of this paper and we omit the details. We discuss the control of simulation depth with the experimental section, where it bears on the results presented.

AROLLOUT is Equivalent to BP

We first show that the computation of AROLLOUT can be rewritten as a sum over assignments.

Lemma 2. AROLLOUT's calculation in Eq (6) and (7) is equivalent to

$$\hat{p}(x) = \sum_{W} p(x \mid W) \prod_{k} \hat{p}(w_k)^{w_k} (1 - \hat{p}(w_k))^{1 - w_k} \qquad (8)$$

where $W$ ranges over the assignments to the parents of $x$ and $w_k \in \{0, 1\}$ is the value assigned to the $k$th parent.

Proof. The sum in (8) can be divided into disjoint sets of assignments according to the $c_{ij}$ they satisfy. Let $w_l$ be a parent of $x$ which is not in $c_{ij}$. Let $W(c_{ij})$ be the assignments to the parents of $x$ which satisfy $c_{ij}$, and $W_{\setminus w_l}(c_{ij})$ be the assignments to the parents of $x$ except for $w_l$ which satisfy $c_{ij}$. Since $w_l$ is not in $c_{ij}$, $W_{\setminus w_l}(c_{ij})$ is well defined. We have

$$\sum_{W(c_{ij})} p(x \mid W) \prod_{k} \hat{p}(w_k)^{w_k} (1 - \hat{p}(w_k))^{1 - w_k} \qquad (9)$$
$$= \sum_{W_{\setminus w_l}(c_{ij})} \big( p(x \mid W, w_l = T)\,\hat{p}(w_l) + p(x \mid W, w_l = F)\,(1 - \hat{p}(w_l)) \big) \prod_{k \neq l} \hat{p}(w_k)^{w_k} (1 - \hat{p}(w_k))^{1 - w_k}.$$

Since $w_l$ is not in $c_{ij}$, the assignment of $w_l$ does not affect the probability of $x$. So for $W \in W_{\setminus w_l}(c_{ij})$ we have $p(x \mid W, w_l = T) = p(x \mid W, w_l = F) = p(x \mid W)$ and therefore the above sum can be simplified to

$$\sum_{W_{\setminus w_l}(c_{ij})} p(x \mid W) \prod_{k \neq l} \hat{p}(w_k)^{w_k} (1 - \hat{p}(w_k))^{1 - w_k}.$$

Applying the same reasoning to all individual $w_l \notin c_{ij}$, we get that Eq (9) is equal to

$$\sum_{W \in w_p(c_{ij})} p(x \mid W) \prod_{k \in c_{ij}} \hat{p}(w_k)^{w_k} (1 - \hat{p}(w_k))^{1 - w_k}$$

where $w_p(c_{ij})$ is the set of assignments to the variables in $c_{ij}$ which satisfy $c_{ij}$. Any assignment to the $w_k \in c_{ij}$ which satisfies $c_{ij}$ must have $w_k = T$ when $w_k$ appears positively in $c_{ij}$ and $w_k = F$ when it appears negated, so the last expression simplifies to

$$\sum_{W \in w_p(c_{ij})} p(x \mid W) \prod_{w_k \in c_{ij}} \hat{p}(w_k) \prod_{\neg w_k \in c_{ij}} (1 - \hat{p}(w_k)).$$

Finally, because the $c_{ij}$ are mutually exclusive and exhaustive, the sum over the disjoint sets of assignments is identical to the sum of (6).

With Lemma 1 and Lemma 2 we can prove the equivalence.

Proposition 1. The marginals calculated by AROLLOUT are identical to the marginals calculated by BP on the DBN generated by the planning problem, conditioned on the initial state, initial action, and rollout policy, and with no evidence.

Proof.
By Lemma 1, $\lambda(x)$ and $\lambda_x(w_i)$ are proportional to $(1, 1)$ for all nodes. Therefore, backward messages do not affect the result of BP and we can argue inductively going forward from roots to leaves in the DBN. By Eq (1) and Lemma 1 we have

$$\mathrm{BEL}(x) = \alpha\,\pi(x)\,\lambda(x) = \pi(x) \qquad (10)$$

where the last equality is true because $\pi(x)$ is always normalized. Therefore from Eq (3) we have

$$\mathrm{BEL}(x) = \sum_{w} p(x \mid w) \prod_{k=1}^{n} \pi_x(w_k). \qquad (11)$$

Now, from Eq (5) and Lemma 1 we have

$$\pi_x(w_k) = \pi(w_k) = \mathrm{BEL}(w_k) \qquad (12)$$

and substituting (12) into (11) we get

$$\mathrm{BEL}(x) = \sum_{w} p(x \mid w) \prod_{k=1}^{n} \mathrm{BEL}(w_k). \qquad (13)$$

Inductively assuming $\mathrm{BEL}(w_k = T) = \hat{p}(w_k)$ and $\mathrm{BEL}(w_k = F) = 1 - \hat{p}(w_k)$, we can rewrite Eq (13) as

$$\mathrm{BEL}(x = T) = \hat{p}(x) = \sum_{W} p(x \mid W) \prod_{k} \hat{p}(w_k)^{w_k} (1 - \hat{p}(w_k))^{1 - w_k},$$

which is identical to Eq (8).

Lifted SOGBOFA

Given the equivalence of the previous section, we next show how SOGBOFA can be improved using ideas from lifted inference. Lifted BP aims to calculate exactly the same solutions as BP but to do so more quickly by avoiding identical messages. The idea was first introduced by Singla and Domingos (2008), who showed how to extract the equivalence classes of messages from a structured model, specifically a Markov logic network. Each type of message is then computed only once and propagated as a power of the original individual message, replacing the product of identical messages. The work of Kersting, Ahmadi, and Natarajan (2009) showed how repeated structure can be extracted from any ground Markov network by simulating only the process of BP without calculating the messages. The messages are then grouped and propagated similarly.

The details of Lifted BP were developed for undirected models but the same principles apply to directed models. Thanks to the property of BP exposed in Lemma 1 we know that in SOGBOFA graphs there are no backward messages, and we can focus on the structure of forward messages. Further, forward messages are the marginal distributions of the corresponding variables, and since we are focusing on binary variables each message can be captured with one number or expression for $p(x_i = 1)$. In Lifted BP, two nodes or factors send identical messages when all their input messages are identical and in addition the local factor is identical. With the forward structure of SOGBOFA this has a clear analogy: two nodes in the SOGBOFA graph are identical when their parents are identical and the local algebraic expressions capturing the local factors are identical. This suggests a straightforward implementation of Lifted SOGBOFA using dynamic programming.

Lifted SOGBOFA algorithm: We run the algorithm in exactly the same manner as SOGBOFA except for the construction of the computation graph. (1) When constructing a node in the explicit computation graph we check if an identical node, with the same parents and the same operation, has been constructed before. If so we return that node instead of constructing a new one. Otherwise we create a new node in the graph. (2) If we only use the idea above we may end up with multiple edges between the same two nodes. Such structures are simplified on the fly during construction: every plus node with $k$ incoming identical edges is turned into a multiply node with constant multiplier $k$, and every multiply node with $k$ incoming identical edges is turned into a power node with constant exponent $k$. This portion of the construction generates the counting expressions, often seen in lifted inference, automatically from the DBN structure.

It is important to identify two aspects of this algorithm and its relation to lifted BP. First, similar to Kersting, Ahmadi, and Natarajan (2009) and unlike Singla and Domingos (2008), our algorithm for constructing the lifted graph must first process the ground network, i.e., its complexity is linear in the size of the original ground computation graph.
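The dynamic-programming construction just described can be sketched as follows (ours, using an assumed expression-node representation rather than the released SOGBOFA code): a cache returns an existing node whenever the same operation is requested on the same children, and repeated children of $+$ and $*$ nodes are collapsed into $k \cdot c$ and $c^k$ counting expressions (here per repeated child, a slight generalization of the per-node rule above).

```python
class Node:
    """An expression-graph node: op is '+', '*', 'pow', 'const' or 'var'."""
    def __init__(self, op, children=(), value=None):
        self.op, self.children, self.value = op, tuple(children), value

_cache = {}  # (op, children, value) -> Node; this table is the dynamic program

def make(op, children=(), value=None):
    if op in ("+", "*"):
        counts = {}
        for c in children:                     # group identical incoming edges
            counts[c] = counts.get(c, 0) + 1
        children = []
        for c, k in counts.items():
            if k == 1:
                children.append(c)
            elif op == "+":                    # k identical summands -> k * c
                children.append(make("*", (make("const", value=k), c)))
            else:                              # k identical factors  -> c ** k
                children.append(make("pow", (c, make("const", value=k))))
        if len(children) == 1:
            return children[0]
        # Hash-consed children sort identically across structurally equal calls,
        # so sorting by id canonicalizes the commutative operations for caching.
        children = tuple(sorted(children, key=id))
    key = (op, tuple(children), value)
    if key not in _cache:                      # reuse an identical node if built before
        _cache[key] = Node(op, children, value)
    return _cache[key]

p = make("var", value="p(rain)")
q = make("+", (p, p, p))   # built once as the counting expression 3 * p(rain)
r = make("+", (p, p, p))   # cache hit: no new node is created
assert r is q and q.op == "*"
```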
It is interesting to explore constructions that go directly from the relational planning problem description to the lifted SOGBOFA graph, without considering each ground node, but we do not pursue this here. On its face the saving from this construction does not look significant because we only have one pass of messages in the computation of BP, so that spending linear time in advance is already expensive. This would be correct if we only ran BP once. However, because we evaluate the same lifted SOGBOFA graph and calculate gradients over it many times in the course of optimization, the effect of a compressed computation graph is significant.

The second difference is due to the inputs to the two algorithms. SOGBOFA uses algebraic expressions whereas lifted BP uses tabular CPTs. Therefore, SOGBOFA's computation is potentially more structured than lifted BP's, both because the input CPTs are given as algebraic expressions and because subexpressions are also represented as nodes in the graph. In other words, lifting can occur at the sub-CPT level as well as at the full-CPT level as in Lifted BP. Summarizing this discussion we have:

Observation 1. The computation graph of Lifted SOGBOFA is at least as compressed as the message structure in Lifted BP when run on the DBN generated by the planning problem, conditioned on the initial state, initial action, and rollout policy, and with no evidence.

Lifted SOGBOFA shrinks the computation graph and leads to faster estimates of action values. This can be used in two ways: given a fixed time constraint, we can aim for the same number of updates and deeper search, or, keeping the same depth, we can perform more updates or random restarts. Both of these have the potential to improve planning performance.

Experimental Validation

For the evaluation we need to discuss one additional aspect of SOGBOFA. The algorithm aims at on-line planning where one has to decide on the next action in a short amount of time, and then the process repeats. The algorithm unrolls the DBN given by a planning problem to a horizon $h$ and then performs the optimization as described. If $h$ is too deep the computation is slow and there is not enough time for the optimization. Therefore SOGBOFA dynamically decides on the depth in order to trade off accuracy with available computation time. A speedup in construction means that the algorithm will search to a greater depth. This potentially complicates the comparison of the original and lifted versions of the algorithm. For the experiments in this paper, we manually fix the depth for each problem to a reasonable setting, and then compare the results of the two algorithms. This evaluates the potential improvement from having more updates in the search process.

[Figure 1: Comparison of SOGBOFA and Lifted SOGBOFA on 20 problems in 3 domains (tamarisk, traffic, skill_teaching). Each panel plots average reward against instance index for Random, HandCode, Lifted SOGBOFA (Fixed Depth), and SOGBOFA (Fixed Depth).]

The work of Cui and Khardon (2016) evaluated the algorithms on problems of varying sizes from 6 planning domains, and the code and details are available at github.com/hcui01/sogbofa. The domains are taken from the International Probabilistic Planning Competition (IPPC) 2011 (sysadmin, traffic), IPPC 2014 (skill teaching, academic advising, tamarisk) and the RDDL distribution (elevators) (Sanner 2010). We follow that evaluation in using the original code and the same problems.

We first evaluate the contribution of lifting to SOGBOFA in terms of graph size and number of updates on problem 11 in each domain. Following exploratory experiments, we fixed the depth manually to a value that was expected to yield reasonable results for SOGBOFA; in particular, we pick the average depth value in a run of SOGBOFA with dynamic depth on problem 11 in each domain. The depth used, average graph size (over multiple steps), and average number of updates within 10 seconds are reported in Table 1. As can be seen, on 3 of the domains the lifted algorithm creates smaller graphs and performs more updates. On the other domains there is a small improvement but the differences are not significant. Therefore in the following experiments we focus on the 3 domains where the lifting compression looks promising.

[Table 1: Graph size and number-of-updates statistics for SOGBOFA and Lifted SOGBOFA on problem 11 of each domain (elevators, sysadmin, academic advising, tamarisk, traffic, skill teaching); rows report the fixed depth, the average graph size of each algorithm, and the average number of updates of each within 10 seconds.]

We next turn to evaluate whether the compression translates to better planning performance. For this we follow the experimental methodology from (Cui and Khardon 2016). We run all problems with horizon 40 and each problem instance is run 20 times. The results reported represent the average cumulative reward (no discounting) and standard deviations from these 20 runs. To control the run time needed for these experiments the algorithms are given 10 seconds per time step (compared to 60 seconds in (Cui and Khardon 2016)). The results are shown in Figure 1. For visual clarity, raw scores of cumulative reward are normalized relative to two fixed algorithms: the random policy is normalized as 0, a simple human-coded policy is normalized as 1, and the raw scores are linearly scaled relative to these.

As can be seen in the figure, lifting provides an improvement in performance across a range of problems. For skill teaching a significant improvement can be seen. For tamarisk and traffic, the improvement is more modest, and there are three ranges of problem sizes. The smaller and easier problems are already solved well by SOGBOFA and we do not get a significant improvement. The largest problems are too large for both algorithms (with 10 seconds per step), i.e., they do not get enough updates and their performance degrades. In the middle range of problems on tamarisk and traffic, lifting provides performance improvements. This analysis is corroborated by looking at graph sizes and the number of gradient updates from the same runs. Overall, the lifted algorithm does not degrade performance on any problem and improves performance in a large number of cases.
The range of problems where lifting provides an improvement will change with the setting of the time limit; we expect the gap to move to the right and to increase for large problems when more time is given.

Algebraic Solver for Marginal MAP

Marginal MAP is computationally hard, and state-of-the-art exact algorithms use branch and bound techniques (Lee et al. 2016). Approximation algorithms include mixed belief propagation (Liu and Ihler 2013), an extension of BP that directly addresses MMAP.

The SOGBOFA algorithm uses the DBN model to solve the planning problem, but the analogy to inference does not require the DBN structure. The main conditions that are required are that the input is a directed model, that the action nodes are roots of that model, and that there is a single value node at a leaf of the directed model. The corresponding requirements for MMAP are that the max nodes are at the roots of the model and that there is a single evidence node at a leaf of the graph. Given a MMAP problem with multiple evidence nodes, we can first disconnect the edges to children of evidence nodes by substituting the evidence node values directly. We can then connect all evidence leaves to a new leaf node (denoted L* below) which captures the logical conjunction requiring that all those nodes have their observed values; see (Mauá 2016).
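As a sketch of the merging step (ours; the names are illustrative, not from the paper's code): under the independence approximation used throughout, the algebraic expression for L* is just a product with one factor per evidence leaf.

```python
def conjoin_evidence(marginal, evidence):
    """marginal: node -> approximate p(node = 1), as produced by the forward
    aggregate simulation; evidence: leaf node -> observed value in {0, 1}.
    Returns p_hat(L* = 1), the approximate probability that every evidence
    leaf takes its observed value (leaves treated as independent)."""
    p_star = 1.0
    for v, val in evidence.items():
        p_star *= marginal[v] if val == 1 else 1.0 - marginal[v]
    return p_star
```

In the actual construction this product would be built as a node of the algebraic graph rather than evaluated numerically, so that gradients of $\hat{p}(L^* = 1)$ with respect to the root (MAP) marginals are available by automatic differentiation.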

This leads to the following algorithm.

Algebraic MMAP Algorithm: (1) Given a MMAP problem with decision nodes at the roots, perform the above translation to make it suitable for inference by SOGBOFA. (2) Generate a lifted algebraic graph from this MMAP problem, where MAP nodes are treated as action nodes and L* is the reward node of the planning problem. This graph satisfies the requirements for the SOGBOFA optimizer. (3) Optimize the decision nodes using the same procedure as in SOGBOFA, in particular, using gradient ascent on the prior marginals of the decision nodes, where gradients are computed using automatic differentiation and where gradient updates use an adaptive step size.

Preliminary experiments with this algorithm suggest that it is competitive with or better than mixed belief propagation (Liu and Ihler 2013) for this class of problems. We leave a more thorough experimental evaluation to future work. An interesting question is whether these ideas can be used to solve general MMAP problems. Mauá (2016) has shown that one can translate any MMAP problem into a MMAP (or maximum expected utility, MEU) problem with MAP variables at roots and a single evidence (or reward) variable. Unfortunately, while that construction is correct for exact inference, it does not preserve the advantage of the forward search in SOGBOFA. We leave an investigation into alternative constructions for future work.

Related Work

The connection between planning and inference is not new. In the context of influence diagrams, the connection between MEU and marginal MAP, shown to be equivalent by Mauá (2016), was pointed out by Cooper (1988). Starting from an AI planning formulation, Domshlak and Hoffmann (2006) have shown how the conformant planning problem can be solved using weighted model counting for CNF, and Lee, Marinescu, and Dechter (2016) have shown how the problem can be translated to MMAP and solved directly by MMAP algorithms.

Another major direction has been MDP formulations of factored state and action spaces, where symbolic versions of dynamic programming algorithms were developed (Hoey et al. 1999; Boutilier et al. 1995; Raghavan et al. 2012; 2013). It is easy to see that symbolic value iteration corresponds to alternating variable elimination, where we alternate between eliminating the state variables of a time slice and the action variables of the same time slice. These algorithms cleverly avoid an explicit representation of the potentially exponential size policy. Both exact and approximate algorithms have been developed. The main difficulty with this approach is the space requirement of representing value functions or policies, which can become prohibitive.

A different approach to MDPs has been through formulations as variational inference. Starting with (Dayan and Hinton 1997), several formulations define a reward-weighted posterior distribution over trajectories, where identifying the MAP over actions gives the optimal policy. Solutions include the expectation maximization (EM) or variational EM algorithms (Toussaint and Storkey 2006; Furmston and Barber 2010; Neumann 2011; Kober and Peters 2011), policy gradients (Kober and Peters 2011; van de Meent et al. 2016) and BP (Liu and Ihler 2012; Cheng et al. 2013).
As discussed above, one of the main differences between these approaches and ours is that they condition on the reward whereas our approach simply evaluates the marginal of the reward. While in principle these are equivalent, the computational properties of different approximation algorithms on the different graphs are not necessarily the same.

The work of Lang and Toussaint (2010) introduces an algorithm in the context of planning for robotics which, with the hindsight of this paper, looks close to AROLLOUT. Their algorithm performs action evaluation by forward simulation using the factored frontier algorithm, which is equivalent to BP. The details differ, however, because they sample concrete trajectories and do not take advantage of the symbolic aggregate simulation. Lifted SOGBOFA could potentially improve planning performance in that approach.

Conclusions

The paper identifies a connection between a successful heuristic for planning in large factored state and action spaces and belief propagation. The SOGBOFA heuristic performs its estimation symbolically and through that performs its search using gradients. This suggests a general scheme for approximation algorithms for MMAP: any approximation scheme whose value or objective can be represented using an explicit computation graph can be optimized directly with gradient search through automatic differentiation. Two immediate consequences of this connection are shown to derive new interesting algorithms. The first, taking ideas from inference to planning, is a lifted version of SOGBOFA which improves planning performance. The second, taking ideas from planning to inference, is a new algorithm that applies SOGBOFA inference to a subclass of MMAP problems. These connections expose the potential for improving algorithms further in both fields.

Acknowledgments

This work was partly supported by NSF under grant IIS. Some of the experiments in this paper were performed on the Tufts Linux Research Cluster supported by Tufts Technology Services.

References

Boutilier, C.; Dearden, R.; and Goldszmidt, M. 1995. Exploiting structure in policy construction. In Proc. of the International Joint Conference on Artificial Intelligence, volume 14.

Boutilier, C.; Dean, T.; and Hanks, S. 1995. Planning under uncertainty: Structural assumptions and computational leverage. In Proceedings of the Second European Workshop on Planning.

Cheng, Q.; Liu, Q.; Chen, F.; and Ihler, A. T. 2013. Variational planning for graph-based MDPs. In NIPS.

Cooper, G. F. 1988. A method for using belief networks as influence diagrams. In UAI.

Cui, H., and Khardon, R. 2016. Online symbolic gradient-based optimization for factored action MDPs. In Proc. of the International Joint Conference on Artificial Intelligence.

Cui, H.; Khardon, R.; Fern, A.; and Tadepalli, P. 2015. Factored MCTS for large scale stochastic planning. In Proc. of the AAAI Conference on Artificial Intelligence.

Dayan, P., and Hinton, G. E. 1997. Using expectation-maximization for reinforcement learning. Neural Computation 9(2).

Domshlak, C., and Hoffmann, J. 2006. Fast probabilistic planning through weighted model counting. In Proc. of the International Conference on Automated Planning and Scheduling.

Furmston, T., and Barber, D. 2010. Variational methods for reinforcement learning. In Proceedings of the International Conference on Artificial Intelligence and Statistics, AISTATS.

Furmston, T., and Barber, D. 2011. Efficient inference in Markov control problems. In Proceedings of the Twenty-Seventh Conference on Uncertainty in Artificial Intelligence (UAI).

Griewank, A., and Walther, A. 2008. Evaluating Derivatives: Principles and Techniques of Algorithmic Differentiation (2nd ed.). SIAM.

Hoey, J.; St-Aubin, R.; Hu, A.; and Boutilier, C. 1999. SPUDD: Stochastic planning using decision diagrams. In Proceedings of the Workshop on Uncertainty in Artificial Intelligence.

Keller, T., and Helmert, M. 2013. Trial-based heuristic tree search for finite horizon MDPs. In Proc. of the International Conference on Automated Planning and Scheduling.

Kersting, K.; Ahmadi, B.; and Natarajan, S. 2009. Counting belief propagation. In UAI.

Kiselev, I., and Poupart, P. 2014. A novel single-DBN generative model for optimizing POMDP controllers by probabilistic inference. In AAAI. AAAI Press.

Kober, J., and Peters, J. 2011. Policy search for motor primitives in robotics. Machine Learning 84(1-2).

Kolobov, A.; Dai, P.; Mausam; and Weld, D. S. 2012. Reverse iterative deepening for finite-horizon MDPs with large branching factors. In Proc. of the International Conference on Automated Planning and Scheduling.

Lang, T., and Toussaint, M. 2010. Planning with noisy probabilistic relational rules. Journal of Artificial Intelligence Research 39:1-49.

Lee, J.; Marinescu, R.; Dechter, R.; and Ihler, A. T. 2016. From exact to anytime solutions for marginal MAP. In AAAI. AAAI Press.

Lee, J.; Marinescu, R.; and Dechter, R. 2016. Applying search based probabilistic inference algorithms to probabilistic conformant planning: Preliminary results. In Proceedings of the International Symposium on Artificial Intelligence and Mathematics (ISAIM).

Liu, Q., and Ihler, A. T. 2012. Belief propagation for structured decision making. In Proceedings of the Conference on Uncertainty in Artificial Intelligence (UAI).

Liu, Q., and Ihler, A. T. 2013. Variational algorithms for marginal MAP. Journal of Machine Learning Research 14(1).

Mauá, D. D. 2016. Equivalences between maximum a posteriori inference in Bayesian networks and maximum expected utility computation in influence diagrams. International Journal of Approximate Reasoning 68.

Neumann, G. 2011. Variational inference for policy search in changing situations. In Proceedings of the 28th International Conference on Machine Learning, ICML.

Nguyen, D. T.; Kumar, A.; and Lau, H. C. 2017. Collective multiagent sequential decision making under uncertainty. In AAAI.

Nitti, D.; Belle, V.; and De Raedt, L. 2015. Planning in discrete and continuous Markov decision processes by probabilistic programming. In ECML/PKDD (2), volume 9285 of Lecture Notes in Computer Science. Springer.

Pearl, J. 1989. Probabilistic Reasoning in Intelligent Systems: Networks of Plausible Inference. Morgan Kaufmann.

Puterman, M. L. 1994. Markov Decision Processes: Discrete Stochastic Dynamic Programming. Wiley.

Raghavan, A.; Joshi, S.; Fern, A.; Tadepalli, P.; and Khardon, R. 2012. Planning in factored action spaces with symbolic dynamic programming. In Proc. of the AAAI Conference on Artificial Intelligence.

Raghavan, A.; Khardon, R.; Fern, A.; and Tadepalli, P. 2013. Symbolic opportunistic policy iteration for factored-action MDPs. In Advances in Neural Information Processing Systems.

Sabbadin, R.; Peyrard, N.; and Forsell, N. 2012. A framework and a mean-field algorithm for the local control of spatial processes. International Journal of Approximate Reasoning 53(1).

Sanner, S. 2010. Relational dynamic influence diagram language (RDDL): Language description. Unpublished manuscript, Australian National University.

Singla, P., and Domingos, P. M. 2008. Lifted first-order belief propagation. In AAAI.

Toussaint, M., and Storkey, A. 2006. Probabilistic inference for solving discrete and continuous state Markov decision processes. In ICML.

van de Meent, J.; Paige, B.; Tolpin, D.; and Wood, F. 2016. Black-box policy search with probabilistic programs. In Proceedings of the International Conference on Artificial Intelligence and Statistics, AISTATS.


More information

Belief propagation in a bucket-tree. Handouts, 275B Fall Rina Dechter. November 1, 2000

Belief propagation in a bucket-tree. Handouts, 275B Fall Rina Dechter. November 1, 2000 Belief propagation in a bucket-tree Handouts, 275B Fall-2000 Rina Dechter November 1, 2000 1 From bucket-elimination to tree-propagation The bucket-elimination algorithm, elim-bel, for belief updating

More information

Information Processing Letters

Information Processing Letters Information Processing Letters 112 (2012) 449 456 Contents lists available at SciVerse ScienceDirect Information Processing Letters www.elsevier.com/locate/ipl Recursive sum product algorithm for generalized

More information

Summary: A Tutorial on Learning With Bayesian Networks

Summary: A Tutorial on Learning With Bayesian Networks Summary: A Tutorial on Learning With Bayesian Networks Markus Kalisch May 5, 2006 We primarily summarize [4]. When we think that it is appropriate, we comment on additional facts and more recent developments.

More information

COS 513: Foundations of Probabilistic Modeling. Lecture 5

COS 513: Foundations of Probabilistic Modeling. Lecture 5 COS 513: Foundations of Probabilistic Modeling Young-suk Lee 1 Administrative Midterm report is due Oct. 29 th. Recitation is at 4:26pm in Friend 108. Lecture 5 R is a computer language for statistical

More information

Bayesian Classification Using Probabilistic Graphical Models

Bayesian Classification Using Probabilistic Graphical Models San Jose State University SJSU ScholarWorks Master's Projects Master's Theses and Graduate Research Spring 2014 Bayesian Classification Using Probabilistic Graphical Models Mehal Patel San Jose State University

More information

Joint Entity Resolution

Joint Entity Resolution Joint Entity Resolution Steven Euijong Whang, Hector Garcia-Molina Computer Science Department, Stanford University 353 Serra Mall, Stanford, CA 94305, USA {swhang, hector}@cs.stanford.edu No Institute

More information

Probabilistic Graphical Models

Probabilistic Graphical Models Probabilistic Graphical Models Lecture 17 EM CS/CNS/EE 155 Andreas Krause Announcements Project poster session on Thursday Dec 3, 4-6pm in Annenberg 2 nd floor atrium! Easels, poster boards and cookies

More information

D-Separation. b) the arrows meet head-to-head at the node, and neither the node, nor any of its descendants, are in the set C.

D-Separation. b) the arrows meet head-to-head at the node, and neither the node, nor any of its descendants, are in the set C. D-Separation Say: A, B, and C are non-intersecting subsets of nodes in a directed graph. A path from A to B is blocked by C if it contains a node such that either a) the arrows on the path meet either

More information

Approximate Discrete Probability Distribution Representation using a Multi-Resolution Binary Tree

Approximate Discrete Probability Distribution Representation using a Multi-Resolution Binary Tree Approximate Discrete Probability Distribution Representation using a Multi-Resolution Binary Tree David Bellot and Pierre Bessière GravirIMAG CNRS and INRIA Rhône-Alpes Zirst - 6 avenue de l Europe - Montbonnot

More information

PATTERN RECOGNITION AND MACHINE LEARNING CHAPTER 8: GRAPHICAL MODELS

PATTERN RECOGNITION AND MACHINE LEARNING CHAPTER 8: GRAPHICAL MODELS PATTERN RECOGNITION AND MACHINE LEARNING CHAPTER 8: GRAPHICAL MODELS Bayesian Networks Directed Acyclic Graph (DAG) Bayesian Networks General Factorization Bayesian Curve Fitting (1) Polynomial Bayesian

More information

Markov Decision Processes (MDPs) (cont.)

Markov Decision Processes (MDPs) (cont.) Markov Decision Processes (MDPs) (cont.) Machine Learning 070/578 Carlos Guestrin Carnegie Mellon University November 29 th, 2007 Markov Decision Process (MDP) Representation State space: Joint state x

More information

A Transformational Characterization of Markov Equivalence for Directed Maximal Ancestral Graphs

A Transformational Characterization of Markov Equivalence for Directed Maximal Ancestral Graphs A Transformational Characterization of Markov Equivalence for Directed Maximal Ancestral Graphs Jiji Zhang Philosophy Department Carnegie Mellon University Pittsburgh, PA 15213 jiji@andrew.cmu.edu Abstract

More information

Exam Topics. Search in Discrete State Spaces. What is intelligence? Adversarial Search. Which Algorithm? 6/1/2012

Exam Topics. Search in Discrete State Spaces. What is intelligence? Adversarial Search. Which Algorithm? 6/1/2012 Exam Topics Artificial Intelligence Recap & Expectation Maximization CSE 473 Dan Weld BFS, DFS, UCS, A* (tree and graph) Completeness and Optimality Heuristics: admissibility and consistency CSPs Constraint

More information

Using Domain-Configurable Search Control for Probabilistic Planning

Using Domain-Configurable Search Control for Probabilistic Planning Using Domain-Configurable Search Control for Probabilistic Planning Ugur Kuter and Dana Nau Department of Computer Science and Institute for Systems Research University of Maryland College Park, MD 20742,

More information

Binary Decision Diagrams

Binary Decision Diagrams Logic and roof Hilary 2016 James Worrell Binary Decision Diagrams A propositional formula is determined up to logical equivalence by its truth table. If the formula has n variables then its truth table

More information

Comparing Dropout Nets to Sum-Product Networks for Predicting Molecular Activity

Comparing Dropout Nets to Sum-Product Networks for Predicting Molecular Activity 000 001 002 003 004 005 006 007 008 009 010 011 012 013 014 015 016 017 018 019 020 021 022 023 024 025 026 027 028 029 030 031 032 033 034 035 036 037 038 039 040 041 042 043 044 045 046 047 048 049 050

More information

Tree-structured approximations by expectation propagation

Tree-structured approximations by expectation propagation Tree-structured approximations by expectation propagation Thomas Minka Department of Statistics Carnegie Mellon University Pittsburgh, PA 15213 USA minka@stat.cmu.edu Yuan Qi Media Laboratory Massachusetts

More information

Q-learning with linear function approximation

Q-learning with linear function approximation Q-learning with linear function approximation Francisco S. Melo and M. Isabel Ribeiro Institute for Systems and Robotics [fmelo,mir]@isr.ist.utl.pt Conference on Learning Theory, COLT 2007 June 14th, 2007

More information

A Course in Machine Learning

A Course in Machine Learning A Course in Machine Learning Hal Daumé III 13 UNSUPERVISED LEARNING If you have access to labeled training data, you know what to do. This is the supervised setting, in which you have a teacher telling

More information

Lecture 5: Exact inference

Lecture 5: Exact inference Lecture 5: Exact inference Queries Inference in chains Variable elimination Without evidence With evidence Complexity of variable elimination which has the highest probability: instantiation of all other

More information

LAO*, RLAO*, or BLAO*?

LAO*, RLAO*, or BLAO*? , R, or B? Peng Dai and Judy Goldsmith Computer Science Dept. University of Kentucky 773 Anderson Tower Lexington, KY 40506-0046 Abstract In 2003, Bhuma and Goldsmith introduced a bidirectional variant

More information

An Effective Upperbound on Treewidth Using Partial Fill-in of Separators

An Effective Upperbound on Treewidth Using Partial Fill-in of Separators An Effective Upperbound on Treewidth Using Partial Fill-in of Separators Boi Faltings Martin Charles Golumbic June 28, 2009 Abstract Partitioning a graph using graph separators, and particularly clique

More information

Partially Observable Markov Decision Processes. Silvia Cruciani João Carvalho

Partially Observable Markov Decision Processes. Silvia Cruciani João Carvalho Partially Observable Markov Decision Processes Silvia Cruciani João Carvalho MDP A reminder: is a set of states is a set of actions is the state transition function. is the probability of ending in state

More information

The Encoding Complexity of Network Coding

The Encoding Complexity of Network Coding The Encoding Complexity of Network Coding Michael Langberg Alexander Sprintson Jehoshua Bruck California Institute of Technology Email: mikel,spalex,bruck @caltech.edu Abstract In the multicast network

More information

Massachusetts Institute of Technology Department of Electrical Engineering and Computer Science Algorithms For Inference Fall 2014

Massachusetts Institute of Technology Department of Electrical Engineering and Computer Science Algorithms For Inference Fall 2014 Suggested Reading: Massachusetts Institute of Technology Department of Electrical Engineering and Computer Science 6.438 Algorithms For Inference Fall 2014 Probabilistic Modelling and Reasoning: The Junction

More information

VOILA: Efficient Feature-value Acquisition for Classification

VOILA: Efficient Feature-value Acquisition for Classification VOILA: Efficient Feature-value Acquisition for Classification Mustafa Bilgic and Lise Getoor University of Maryland College Park, MD {mbilgic,getoor}@cs.umd.edu Abstract We address the problem of efficient

More information

Module 1 Lecture Notes 2. Optimization Problem and Model Formulation

Module 1 Lecture Notes 2. Optimization Problem and Model Formulation Optimization Methods: Introduction and Basic concepts 1 Module 1 Lecture Notes 2 Optimization Problem and Model Formulation Introduction In the previous lecture we studied the evolution of optimization

More information

CS242: Probabilistic Graphical Models Lecture 2B: Loopy Belief Propagation & Junction Trees

CS242: Probabilistic Graphical Models Lecture 2B: Loopy Belief Propagation & Junction Trees CS242: Probabilistic Graphical Models Lecture 2B: Loopy Belief Propagation & Junction Trees Professor Erik Sudderth Brown University Computer Science September 22, 2016 Some figures and materials courtesy

More information

Stochastic Anytime Search for Bounding Marginal MAP

Stochastic Anytime Search for Bounding Marginal MAP Stochastic Anytime Search for Bounding Marginal MAP Radu Marinescu 1, Rina Dechter 2, Alexander Ihler 2 1 IBM Research 2 University of California, Irvine radu.marinescu@ie.ibm.com, dechter@ics.uci.edu,

More information

A Parallel Algorithm for Exact Structure Learning of Bayesian Networks

A Parallel Algorithm for Exact Structure Learning of Bayesian Networks A Parallel Algorithm for Exact Structure Learning of Bayesian Networks Olga Nikolova, Jaroslaw Zola, and Srinivas Aluru Department of Computer Engineering Iowa State University Ames, IA 0010 {olia,zola,aluru}@iastate.edu

More information

Multiple Constraint Satisfaction by Belief Propagation: An Example Using Sudoku

Multiple Constraint Satisfaction by Belief Propagation: An Example Using Sudoku Multiple Constraint Satisfaction by Belief Propagation: An Example Using Sudoku Todd K. Moon and Jacob H. Gunther Utah State University Abstract The popular Sudoku puzzle bears structural resemblance to

More information