Compiling relational Bayesian networks for exact inference


Mark Chavira (a,*), Adnan Darwiche (a), Manfred Jaeger (b)

(a) Computer Science Department, UCLA, Los Angeles, CA 90095, United States
(b) Institut for Datalogi, Aalborg Universitet, Fredrik Bajers Vej 7 E, DK-9220 Aalborg Ø, Denmark

International Journal of Approximate Reasoning 42 (2006). Available online 15 November 2005.

* Corresponding author. E-mail addresses: chavira@cs.ucla.edu (M. Chavira), darwiche@cs.ucla.edu (A. Darwiche), jaeger@cs.aau.dk (M. Jaeger).

Abstract

We describe in this paper a system for exact inference with relational Bayesian networks as defined in the publicly available Primula tool. The system is based on compiling propositional instances of relational Bayesian networks into arithmetic circuits and then performing online inference by evaluating and differentiating these circuits in time linear in their size. We report on experimental results showing successful compilation and efficient inference on relational Bayesian networks whose Primula-generated propositional instances have thousands of variables, and whose jointrees have clusters with hundreds of variables. © 2005 Elsevier Inc. All rights reserved.

Keywords: Exact inference; Relational models; Bayesian networks

1. Introduction

Relational probabilistic models extend Bayesian network models by representing objects, their attributes, and their relations with other objects. The standard approach for inference with a relational model is based on the generation of a propositional instance of the model in the form of a classical Bayesian network, and then applying classical algorithms, such as jointree [1], to compute answers to queries.

The propositional instance of a relational model includes one Boolean random variable for each ground relational atom. For example, if we have n domain objects o1,...,on and a binary relation R(·,·), we generate a propositional variable for each instance of the relation: R(o1,o1), R(o1,o2),...,R(on,on). The first task in making Bayesian networks over these random variables tractable for inference is to ensure that the size of the Bayesian network representation does not grow exponentially in the number n of domain objects (as can easily happen due to nodes whose in-degree grows as a function of n). This can often be achieved by decomposing nodes with high in-degree into suitable, sparsely connected sub-networks using a number of new, auxiliary nodes. This approach is systematically employed in the Primula system.

Even when a reasonably compact Bayesian network representation (i.e., polynomial in the number of objects) has been constructed for a propositional instance, this model will often be inaccessible to standard algorithms for exact inference, because its global structure does not lead to tractable jointrees. Even though the constructed networks may lack the global structure that would make them accessible to standard inference techniques, they may very well exhibit abundant local structure in the form of determinism. The objective of this paper is to describe a system for inference with propositional instances of relational models which can exploit this local structure, allowing us to reason very efficiently with some relational models whose propositional instances may look quite formidable at first. Specifically, we employ the approach proposed by [2] to compile propositional instances of relational models into arithmetic circuits, and then perform online inference by evaluating and differentiating the compiled circuits in time linear in their size. As our experimental results illustrate, this approach can efficiently handle some relational models whose Primula-generated propositional instances are quite massive.¹

We note here that the inference approach of [2] is applicable to any Bayesian network, but is especially effective on networks with local structure, including determinism. Hence, one of the main points of this paper is to illustrate the extent of local structure available in propositional instances of relational models, and the effectiveness of the approach proposed in [2] in exploiting this local structure.

This paper is structured as follows. We start in Section 2 with a review of relational models in general and the specific formalization used in this paper. We then discuss in Section 3 the Primula system, which implements this formalization together with a method for generating propositional instances in the form of Bayesian networks. Section 4 is then dedicated to our proposed approach for compiling relational models. We provide experimental results in Section 5, and finally close with some concluding remarks in Section 6.

2. Relational models

A Bayesian network is a compact representation of a probability distribution and has two parts: a directed acyclic graph and a set of conditional probability tables (CPTs). Each node in the graph represents a random variable, which we assume to be discrete in this paper. Each variable X has associated with it a CPT, which specifies the conditional probabilities Pr(x|u), where u is a configuration of the parents U of X in the network.
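A minimal sketch (ours, not from the paper) can tie the CPT definition to the in-degree concern raised above: a CPT is just a table of conditional probabilities, and its size grows exponentially in the number of parents, which is exactly why nodes whose in-degree grows with the number of domain objects blow up the network representation.

```python
from itertools import product

# A CPT as a mapping from (x, u) to Pr(X = x | U = u), built here for a
# binary variable X with k binary parents. The table has 2^(k+1) entries.
def uniform_cpt(k: int):
    return {(x, u): 0.5 for x in (True, False)
            for u in product((True, False), repeat=k)}

for k in (1, 5, 10):
    print(k, "parents ->", len(uniform_cpt(k)), "CPT entries")
```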
¹ Some may recall the technique of zero-compression, which can be used to exploit determinism in the jointree framework [3]. This technique, however, requires that one perform inference on the original jointree before it is zero-compressed, making almost all of our datasets inaccessible to this method. For a more detailed account of the relationship to jointree inference, the reader is referred to [4].

A Bayesian network over a set of variables specifies a unique probability distribution over these variables. Probabilistic queries with respect to a Bayesian network are to be interpreted as queries with respect to the probability table the network specifies. The main goal of algorithms for Bayesian networks is to answer such queries without having to construct the table explicitly, since the table's size is exponential in the number of network variables. Fig. 1 depicts a simple Bayesian network with two of its CPTs.

Relational or first-order probabilistic models extend the propositional modeling supported by Bayesian networks by allowing one to represent objects explicitly, and to define relations over these objects. Most of the early work on such generic models, which has been subsumed under the title knowledge-based model construction (see e.g. [5]), combines elements of logic programming with Bayesian networks. Today one can distinguish several distinct representation paradigms for relational and first-order models: (inductive) logic-programming based approaches [6-8], network fragments [9], frame-based representations [10,11], and probabilistic predicate logic formulas [12]. We review relational models with an example.

2.1. An example

Consider the well-known example depicted in Fig. 2(a), in which Holmes becomes alarmed if he receives a call from his neighbor Watson. Watson will likely call if an alarm has sounded at Holmes' residence, which is more likely if a burglary occurs. However, Watson is a prankster, so Holmes may receive a call even if the alarm does not sound. We can model this example with a Bayesian network as shown in Fig. 2(b). A query might be the probability that there is a burglary given that Holmes is alarmed.

We could also consider similar scenarios. Holmes might have multiple neighbors (only some of whom are pranksters) and become alarmed if any of them calls. There might be multiple individuals who can receive calls, each with distinct neighbors. Or it might be that individuals share neighbors and individuals who receive calls can also make them. For each of these scenarios, we can construct a distinct Bayesian network. Moreover, we can imagine needing to deal with many of these situations, and hence needing to construct many different networks. Each of the situations described represents a combination of various themes, such as the theme of an alarm compelling a neighbor to call or an individual becoming alarmed when some neighbor calls. Relational models address domains involving themes by separating the model construction process into two phases.

Fig. 1. A Bayesian net with two of its CPTs.

Fig. 2. (a) A simple alarm scenario, (b) the corresponding Bayesian network, and (c) a graph depicting the particulars of the situation, as opposed to what is common to all alarm situations.

We first describe a set of general rules that apply to all situations. For example, in the alarm domain described, we need four rules:

(1) At a given residence, burglary occurs with a fixed prior probability.
(2) A particular alarm sounds with probability 0.95 if a burglary occurs at the corresponding residence, and with probability 0.01 otherwise.
(3) If an alarm sounds at an individual's residence, then each of the individual's neighbors will call with probability 0.9; otherwise, if the neighbor is a prankster, then the neighbor will call with probability 0.05; otherwise, the neighbor will not call.
(4) An individual is alarmed if one or more neighbors call.

(A short sketch of how rules (3) and (4) translate into concrete probabilities appears at the end of this subsection.) We highlight here that whether an individual is alarmed depends on the number of the individual's neighbors, which makes this domain difficult to represent with a template-based language.

Once we have specified what is common to all situations, in order to specify a particular situation we need only specify a small amount of additional information. In the alarm example, that information consists of which individuals are involved (other than burglars), who are neighbors of whom, and who are pranksters. We specify a graph where nodes represent individuals, edges capture the neighbor relationship, and each node is marked if the corresponding individual is a prankster. Fig. 2(c) depicts the graph corresponding to the situation in Fig. 2(a).

One of the main advantages of using a relational model is that it describes a situation involving themes succinctly. This advantage often makes constructing a relational model much easier and less error-prone than constructing a Bayesian network. For example, it is not uncommon for a relational model with a dozen or so general rules to correspond to a Bayesian network that involves hundreds of thousands of CPT parameters. Another advantage is that much of the work performed in constructing a relational model can be directly re-used in describing variations of the model, whereas creating another Bayesian network can involve much more work.
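Here is the promised sketch of how rules (3) and (4) determine concrete probabilities for a ground pair of individuals; the function names are ours, and the numbers come from the rules above.

```python
# Rule (3): the probability that v calls w, given v's relation to w and
# whether the alarm at w's residence sounds.
def p_calls(v_is_neighbor_of_w: bool, v_is_prankster: bool, alarm_at_w: bool) -> float:
    if not v_is_neighbor_of_w:
        return 0.0        # only neighbors (prankster or not) ever call
    if alarm_at_w:
        return 0.9        # a neighbor calls with probability 0.9 when the alarm sounds
    if v_is_prankster:
        return 0.05       # a prankster may call even without an alarm
    return 0.0            # a non-prankster neighbor stays silent

# Rule (4) is deterministic: an individual is alarmed iff some neighbor calls.
def alarmed(neighbor_calls: list[bool]) -> bool:
    return any(neighbor_calls)
```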

2.2. Relational Bayesian networks

We use in this paper the language of relational Bayesian networks [12] to represent relational models, as implemented in the Primula system, available at ~jaeger/primula. The formal semantics of the language is based on random relational structure models (RRSMs), which we define next.

Definition 1. Given (1) a set of relational symbols S, called predefined relations; (2) a set of relational symbols R, called probabilistic relations; and (3) a finite set D, called the domain; we define an S_D-structure to be an interpretation of the relations S over domain D, that is, a function which maps every ground atom s(d) (s ∈ S, d ∈ D) to either true or false. We also define a random relational structure model (RRSM) as a partial function which takes an S_D-structure as input, and returns a probability distribution over all R_D-structures as output.

Intuitively, members of the domain D represent objects, and members of S and R represent relations that can hold on these objects. These relations can be unary, in which case they are called attributes. A user would typically define the relations in S (by providing an S_D-structure), and then use an RRSM to induce a probability distribution over the possible definitions of the relations in R (R_D-structures). We note here that S_D-structures correspond to skeleton structures in [11].

For the alarm example above, the set D of objects is the set of individuals. The set of predefined relations S contains a unary relation, prankster, in addition to a binary relation, neighbor. There are four probabilistic relations in R for this domain. The first is calls(v,w): whether v calls w in order to warn w that his alarm went off. We also have another probabilistic relation, alarmed(v): whether v has been alarmed (called by at least one neighbor). A third is the relation alarm(v): whether v's alarm went off. The last probabilistic relation is burglary(v): whether v's home has been burglarized. The RRSM is the set of four generic rules described previously. We now describe four RRSMs used in our experiments. These models have been implemented in Primula, which provides a syntax for specifying RRSMs.

2.2.1. Random blocks

This model describes the random placement of blocks (obstacles) on the locations of a map. The input structures consist of a particular gridmap and a set of blocks. This is represented using a set of predefined relations S = {location, block, leftof, belowof}, where location and block are attributes that partition the domain into the two types of objects, and leftof and belowof are binary relations that determine the spatial relationship among locations. Fig. 3 shows an input S_D-structure. One of the probabilistic relations in R for this model is the binary relation blocks(b,l), which represents the random placement of a block b on some location l. Another is connected(l1,l2) between pairs of locations, which describes whether, after placement of the blocks, there is an unblocked path between l1 and l2. A probabilistic query might be the probability that there is an unblocked path between two locations l1 and l2, given the observed locations of some blocks (but uncertainty about the placement of the remaining ones). We experiment with different versions of this relational model, blockmap-l-b, where l is the number of locations and b is the number of blocks.
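To make Definition 1 concrete, the following sketch represents an S_D-structure for the alarm domain as a function from ground atoms to truth values. The particular domain and relations match the example file discussed in Section 3.2; the Python representation itself is ours.

```python
# An S_D-structure for the alarm domain: an interpretation mapping every
# ground atom over the predefined relations S = {prankster, neighbor} to
# true or false.
D = ["Holmes", "Watson", "Gibbon"]
prankster = {"Gibbon"}
neighbor = {("Watson", "Holmes"), ("Gibbon", "Holmes"),
            ("Holmes", "Watson"), ("Holmes", "Gibbon")}

def s_structure(symbol: str, args: tuple) -> bool:
    """The S_D-structure as a function from ground atoms s(d) to {true, false}."""
    if symbol == "prankster":
        return args[0] in prankster
    if symbol == "neighbor":
        return args in neighbor
    raise KeyError(f"unknown predefined relation: {symbol}")

assert s_structure("prankster", ("Gibbon",))
assert not s_structure("neighbor", ("Watson", "Gibbon"))
```

An RRSM takes such a structure as input and returns a distribution over interpretations of the probabilistic relations R (here calls, alarmed, alarm, and burglary).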

Fig. 3. An input S_D-structure: blocks B1, B2 and locations 1-5 related by belowof and leftof.

2.2.2. Mastermind

In the game of Mastermind, Player 1 arranges a hidden sequence of colored pegs. Player 2 guesses the exact sequence of colors by arranging guessed sequences of colored pegs. To each guessed sequence, Player 1 responds by stating how many pegs in the guess match pegs in his hidden sequence both in color and position (white feedback), and how many pegs in the guess match pegs in the hidden sequence only in color (black feedback). Player 2 wins if he guesses the hidden sequence within a certain number of rounds. The game can be represented as an RRSM where the domain D consists of objects of types peg, color, and round, specified by corresponding unary relations in S, as well as binary relations peg-ord and round-ord in S that impose orders on the peg and round objects, respectively. The probabilistic relations R in the model represent the game configurations after a number of rounds: true-color(p,c) represents that c is the color of the hidden peg p; guessed-color(p,c,r) represents that in round r color c was placed in position p in the guess. Similarly, the arrangement of the feedback pegs can be encoded. A query might be the most probable color configuration of the hidden pegs, given the observed query and feedback pegs. We experiment with different versions of this model, mastermind-c-g-p, where c is the number of colors, g is the number of guesses, and p is the number of pegs.

2.2.3. Students and professors

This domain was used by [13] to investigate methods for approximate inference for relational models. We have two types of objects in this model, students and professors, and two corresponding attributes in the set S. Professors have two probabilistic attributes in R: fame (yes/no) and funding_level (high/low). Students have one probabilistic attribute in R: success (yes/no). Students and professors are related via the binary probabilistic relation advisor(s,p) in R. According to the model, students use the softmax rule, choosing advisor i with funding level y_i with probability e^{y_i} / Σ_k e^{y_k}. With the funding level discretized into the two categories high and low, this reduces to choosing any given rich professor with probability z_h/(K z_h + L z_l), and any given poor professor with probability z_l/(K z_h + L z_l), where K is the number of rich professors, L is the number of poor professors, and z_h and z_l are the exponentials of the funding levels of rich and poor professors, respectively. The probability of success of a student is defined conditional on the funding level. A query for this model can be the probabilities for a professor's funding level, given the success of his students. Inference in this model becomes hard very quickly with increasing numbers of professors and students in the domain [13]. We experiment with different versions of this relational model, students-p-s, where p is the number of professors and s is the number of students.
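The collapsed softmax above can be checked with a short computation; the funding levels and professor counts below are illustrative values, not taken from the experiments.

```python
from math import exp

# Advisor choice in the students-professors model: with funding discretized
# into high/low, the softmax collapses to z_h/(K*z_h + L*z_l) per rich
# professor and z_l/(K*z_h + L*z_l) per poor one.
y_high, y_low = 1.0, 0.0               # assumed (illustrative) funding levels
z_h, z_l = exp(y_high), exp(y_low)
K, L = 3, 5                            # assumed counts of rich and poor professors

p_rich = z_h / (K * z_h + L * z_l)
p_poor = z_l / (K * z_h + L * z_l)

# Sanity check: the K + L choice probabilities sum to one.
assert abs(K * p_rich + L * p_poor - 1.0) < 1e-12
print(f"per rich professor: {p_rich:.4f}, per poor professor: {p_poor:.4f}")
```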

2.2.4. Friends and smokers

This domain was introduced in [14]. It involves a number of individuals, with relations in R such as smokes(v), which indicates whether a person smokes, cancer(v), which indicates whether a person has cancer, and friends(u,v), which indicates who are friends of whom. There are no relations in S for this model. The probabilistic model over R is defined by assigning weights to logical constraints, such as friends(u,v) ∧ smokes(u) → smokes(v). A query for this model might be the probability that a person has cancer given information about others who have cancer. The Primula encoding of this model utilizes auxiliary probabilistic relations corresponding to the logical constraints. In ground instances of the model, these auxiliary variables manifest themselves as variables in the Bayesian network, on which evidence should be asserted to indicate that they are always true. We experiment with different versions of this relational model, fr&sm-n, where n is the number of people in the domain.

3. The Primula system

The RRSM is an abstract semantics for probabilistic relational models. For a practical system, one needs a specific syntax for specifying an RRSM. Primula allows users to encode RRSMs using the language of relational Bayesian networks [12], and outputs the distribution on R_D-structures in the form of a standard Bayesian network.

3.1. Specifying RRSMs using Primula

We now provide an example of specifying an RRSM using Primula. Consider again the alarm example from Section 2.1, and recall that for this example, the domain is the set of individuals, the set of predefined relations is S = {prankster(v), neighbor(v,w)}, and the set of probabilistic relations is R = {calls(v,w), alarm(v), alarmed(v), burglary(v)}. The probability of calls(v,w) is defined conditional on the predefined neighbor and prankster relations (it is 0 if v and w are not neighbors), and on the probabilistic relation alarm(v): whether the alarm of v went off. This RRSM is specified in Primula as given in Fig. 4, which provides the probability distribution on probabilistic relations using probability formulas. These formulas can be seen either as probabilistic analogues of predicate logic formulas, or as expressions in a functional programming language.

Fig. 4. Specifying an RRSM using Primula.

A probability formula defines both the dependency structure between ground probabilistic atoms (which depends on the predefined relations in the input structure), and the exact conditional probabilities, given the truth values of parent atoms.

The specification of the RRSM provides some intuition for why a logic-based approach might work well when applied to Primula-generated networks. In addition to certain numbers, we also see in this specification a number of logical constructs. For example, each of the occurrences of (x : y, z) is essentially an application of an if-then-else, and the noisy-or construct is essentially an existential quantification, which can be converted into a disjunction over a set of auxiliary variables. The utilization of these logical constructs is quite common in relational models.

3.2. From relational to propositional networks

To instantiate a generic relational model in Primula, one must provide a definition of an input S_D-structure. For the RRSM defined in Fig. 4, one must define the set of individuals in domain D, and then one must define which of these individuals are pranksters (by defining the attribute prankster), and who are neighbors of whom (by defining the relation neighbor). Primula provides a GUI for this purpose, but one can also supply a file-based definition of the domain and the corresponding S relations. Fig. 5 presents what one of these files might look like. This file defines the domain to be D = {Holmes, Watson, Gibbon} and specifies that Gibbon is a prankster, that Holmes is a neighbor of Watson and Gibbon, and that Watson and Gibbon are neighbors of Holmes.

Given the above inputs, the distribution over probabilistic relations can be represented, as described in Section 1, using a standard Bayesian network with a node for each ground probabilistic atom. Our example also illustrates how the in-degree of a node can grow as a function of the number of domain objects: the node alarmed(Holmes), for instance, depends on calls(w,Holmes) for all of Holmes's neighbors w (of which there might be arbitrarily many). The Primula system employs the general method described in [15] to decompose the dependency of a node on multiple parents. This method consists of an iterative algorithm that takes the probability formula defining the distribution of a node, decomposes it into its top-level subformulas by introducing one new auxiliary node for each of these subformulas, and defines the probability of the original node conditional only on the new auxiliary nodes. This method can be applied to any relational Bayesian network that only contains multi-linear combination functions (including noisy-or and mean), and yields a Bayesian network where the number of parents is bounded by three for all nodes (see the sketch following Fig. 5).

Even when one succeeds in constructing a standard Bayesian network of manageable representation size, inference in this network may be computationally very hard. It is a long-standing open problem in first-order and relational modeling whether one might not design inference techniques that avoid these complexities of inference in the ground propositional instances by performing inference directly on the level of the relational representation, perhaps employing techniques of first-order logical inference.

Fig. 5. Specifying an S_D-structure using Primula.
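The following is a minimal sketch in the spirit of (but not identical to) the decomposition method of [15]: it replaces a node whose value is the disjunction of n parents (the existential quantification behind noisy-or) with a tree of auxiliary nodes, each with at most two parents, so that no fan-in grows with the domain size. Node names, including the extra neighbor Mary, are illustrative only; Primula's actual algorithm operates on probability formulas and bounds fan-in by three.

```python
# Decompose an n-ary OR dependency into a tree of binary OR gates using
# auxiliary nodes.
def decompose_or(parents: list[str]) -> tuple[str, list[tuple[str, str, str]]]:
    """Return (root, gates), where each gate is (aux_node, left_parent, right_parent)."""
    gates, layer, counter = [], list(parents), 0
    while len(layer) > 1:
        nxt = []
        for i in range(0, len(layer) - 1, 2):
            aux = f"aux{counter}"; counter += 1
            gates.append((aux, layer[i], layer[i + 1]))
            nxt.append(aux)
        if len(layer) % 2:            # an odd node passes through to the next layer
            nxt.append(layer[-1])
        layer = nxt
    return layer[0], gates

root, gates = decompose_or(["calls(Watson,Holmes)", "calls(Gibbon,Holmes)",
                            "calls(Mary,Holmes)"])
print(root, gates)   # alarmed(Holmes) now depends only on the root auxiliary node
```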

Complexity results derived in [16] show that one cannot hope for a better worst-case performance with such inference techniques. This still leaves the possibility that they could often lead to substantial gains in practice. Recent work has described high-level inference techniques that aim at achieving such gains in average-case performance [17,18]. The potential advantage of this and similar techniques seems to be restricted, however, to relational models where individual model instances are given by relatively unstructured input structures, i.e., input structures containing large numbers of indistinguishable objects. The potential of high-level inference techniques lies in their ability to deal with such sets of objects without explicitly naming each object individually. However, in the type of relational models we are considering here, the input structures consist of mostly unique objects (in Section 2.2.1, for instance, the block objects are indistinguishable, but all location objects have unique properties defined by the belowof and leftof relations). We can identify an input structure with the complete ground propositional theory that defines it (for the structure of Fig. 3 this would be the theory block(b1) ∧ ¬location(b1) ∧ ··· ∧ leftof(2,3) ∧ ··· ∧ ¬belowof(5,5)), and, informally, characterize highly structured input structures as those for which this propositional theory admits no simple first-order abstraction. When a relational model instance is given by an input structure that cannot be succinctly encoded in an abstract, first-order style representation, chances are very small that probabilistic inference for this model instance can gain much efficiency by operating on a non-propositional level. It thus appears that, at least for a fairly large class of interesting models, more is to be gained by optimizing inference techniques for ground propositional models than by non-propositional inference techniques.

Table 1 depicts the relational models with which we experimented, together with the sizes of the corresponding propositional Bayesian networks generated by Primula. The table also reports the size of the largest cluster for the jointree we constructed for these networks. Obviously, most of these networks are inaccessible to mainstream, structure-based algorithms for exact inference. Yet, we will show later that all of these particular models can be handled efficiently using the compilation approach we propose in this paper.

4. Compiling relational models

We describe in this section the approach we use to perform exact inference on propositional instances of relational models, which is based on compiling Bayesian networks into arithmetic circuits [2]. Inference can then be performed using a simple two-pass procedure in which the circuit is evaluated and differentiated given evidence.

4.1. Bayesian networks as polynomials

The compilation approach we adopt is based on viewing each Bayesian network as a very large polynomial (a multi-linear function in particular), which may be compactly represented using an arithmetic circuit. The function itself contains two types of variables. For each value x of each variable X in the network, we have a variable λ_x called an evidence indicator. For each instantiation x,u of each variable X and its parents U in the network, we have a variable θ_{x|u} called a network parameter.

Table 1. Relational Bayesian networks, their corresponding propositional instances, and the sizes of their CNF encodings. For each instance of the four relational models (mastermind-c-g-p, students-p-s, blockmap-l-b, and fr&sm-n), the table reports the size of the Bayesian network (variables, CPT parameters, maximum jointree cluster), the CNF encoding (variables, clauses), the compiled arithmetic circuit (node count and edge count, with log base 2), the AC compile and online inference times, and the jointree inference time where feasible.

The multi-linear function has a term for each instantiation of the network variables, which is constructed by multiplying all evidence indicators and network parameters that are consistent with that instantiation. For example, the multi-linear function of the network in Fig. 1 has eight terms corresponding to the eight instantiations of the variables A, B, C:

f = λ_a λ_b λ_c θ_a θ_{b|a} θ_{c|a} + λ_a λ_b λ_{¬c} θ_a θ_{b|a} θ_{¬c|a} + ··· + λ_{¬a} λ_{¬b} λ_{¬c} θ_{¬a} θ_{¬b|¬a} θ_{¬c|¬a}.

Given this multi-linear function f, we can answer standard queries with respect to its corresponding Bayesian network by simply evaluating and differentiating this function; see [2] for details. The ability to compute answers to probabilistic queries directly from the derivatives of f is interesting semantically, but one must realize that the size of the function f is exponential in the number of network variables. Yet, one may be able to factor this function and represent it more compactly using an arithmetic circuit. An arithmetic circuit is a rooted DAG, in which each leaf represents a variable or constant and each internal node represents the product or sum of its children; see Fig. 6. If we can represent the network polynomial efficiently using an arithmetic circuit, then inference can be done in time linear in the size of such circuits, since the (first) partial derivatives of an arithmetic circuit can all be computed simultaneously in time linear in the circuit size [2].
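As an illustration of evaluating f under evidence, the following brute-force sketch enumerates all eight terms for a network with edges A → B and A → C; the CPT values are placeholders (Fig. 1's actual numbers are not reproduced here). A compiled arithmetic circuit computes the same value without enumerating terms.

```python
from itertools import product

# Placeholder CPTs for a network with edges A -> B and A -> C (assumed values).
theta_a = {True: 0.3, False: 0.7}                              # Pr(A)
theta_b = {(True, True): 0.8, (False, True): 0.2,              # Pr(B | A), keyed (b, a)
           (True, False): 0.5, (False, False): 0.5}
theta_c = {(True, True): 0.6, (False, True): 0.4,              # Pr(C | A), keyed (c, a)
           (True, False): 0.1, (False, False): 0.9}

def f(lam):
    """The network polynomial: one term per instantiation of A, B, C."""
    total = 0.0
    for a, b, c in product([True, False], repeat=3):
        indicators = lam[("A", a)] * lam[("B", b)] * lam[("C", c)]
        total += indicators * theta_a[a] * theta_b[(b, a)] * theta_c[(c, a)]
    return total

# Evidence e: B = true. Zero out indicators inconsistent with e, leave the rest at 1.
lam = {(v, x): 1.0 for v in "ABC" for x in (True, False)}
lam[("B", False)] = 0.0
print("Pr(B = true) =", f(lam))   # 0.3*0.8 + 0.7*0.5 = 0.59
```

Partial derivatives of f with respect to the indicators yield posterior marginals in the same way; the point of compilation is to obtain these values from a circuit that can be far smaller than the number of terms.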

4.2. Compiling the network polynomial into an arithmetic circuit

We now turn to the approach for compiling/factoring network polynomials into arithmetic circuits, which is based on reducing the factoring problem to one of logical reasoning [19]. This approach is based on three conceptual steps, as shown in Fig. 6. First, the network polynomial is encoded using a propositional theory. Next, the propositional theory is factored by converting it to a special logical form. Finally, an arithmetic circuit is extracted from the factored propositional theory.²

² A similar approach has been recently proposed in [20], which calls for encoding Bayesian networks into CNFs, and reducing probabilistic inference to weighted model counting on the generated CNFs. The approach is similar in two senses. First, the weighted model counting algorithm applied in [20] is powerful enough to factor the CNF as suggested by Step 2 below; see [21]. Second, the factored logical form we generate from the CNF in Step 2 is tractable enough to allow weighted model counting in time linear in the size of the form [22,23].

Fig. 6. Factoring multi-linear functions into arithmetic circuits.

Step 1: Encoding a multi-linear function using a propositional theory. The purpose of this step is to specify the network polynomial using a propositional theory. To illustrate how a multi-linear function can be specified using a propositional theory, consider the function f = ac + abc + c over real-valued variables a, b, c. The basic idea is to specify this multi-linear function using a propositional theory that has exactly three models, where each model encodes one of the terms in the function. Specifically, suppose we have the Boolean variables V_a, V_b, V_c. Then the propositional theory Δ_f = (V_a ∨ ¬V_b) ∧ V_c encodes the multi-linear function f as follows:

Model  V_a    V_b    V_c    Encoded term
σ1     true   false  true   ac
σ2     true   true   true   abc
σ3     false  false  true   c

That is, model σ encodes term t since σ(V_j) = true precisely when term t contains the real-valued variable j. This method of specifying network polynomials allows one to easily capture local structure; that is, to declare certain information about the values of polynomial variables. For example, if we know that parameter a = 0, then we can exclude all terms that contain a by conjoining ¬V_a with our encoding.
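The correspondence between models and terms in this example can be checked mechanically; the sketch below enumerates the models of Δ_f and reads off the encoded terms.

```python
from itertools import product

# Enumerate the models of the theory (V_a or not V_b) and V_c, and recover
# the term encoded by each model: the product of the variables set to true.
def models():
    for va, vb, vc in product([True, False], repeat=3):
        if (va or not vb) and vc:
            yield {"a": va, "b": vb, "c": vc}

terms = ["".join(v for v in "abc" if m[v]) for m in models()]
print(sorted(terms))   # ['abc', 'ac', 'c'] -- the three terms of f = ac + abc + c
```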

Step 2: Factoring the propositional encoding. If we view the conversion of a network polynomial into an arithmetic circuit as a factoring process, then the purpose of this second step is to accomplish a similar task, but at the logical level. Instead of starting with a polynomial (a set of terms), we start with a propositional theory (a set of models). And instead of building an arithmetic circuit, we build a Boolean circuit that satisfies certain properties. Specifically, the circuit must be in negation normal form (NNF): a rooted DAG where leaves are labeled with literals, and where internal nodes are labeled with conjunctions or disjunctions; see Fig. 6. The NNF must satisfy three properties: (1) conjuncts cannot share variables (decomposability), (2) disjuncts must be logically exclusive (determinism), and (3) disjuncts must be over the same variables (smoothness). The NNF in Fig. 6 satisfies the above properties, and encodes the multi-linear function shown in the same figure. In our experimental results, we use a second-generation compiler for converting CNFs to NNFs that are decomposable, deterministic and smooth (smooth d-DNNF) [24].

Step 3: Extracting an arithmetic circuit. The purpose of this last step is to extract an arithmetic circuit for the polynomial encoded by an NNF. If Δ_f is an NNF that encodes a network polynomial f, and if Δ_f is a smooth d-DNNF, then an arithmetic circuit for the polynomial f can be obtained easily. First, replace the and-nodes in Δ_f by multiplications; then replace the or-nodes by additions; and finally, replace each leaf node labeled with V_x by x and each leaf node labeled with ¬V_x by 1. The resulting arithmetic circuit is then guaranteed to correspond to the polynomial f [19]. Fig. 6 depicts an NNF and its corresponding arithmetic circuit. Note that the generated arithmetic circuit is no larger than the NNF. Hence, if we attempt to minimize the size of the NNF, we are also attempting to minimize the size of the generated arithmetic circuit.
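The Step 3 replacements can be sketched directly on a small NNF, represented here as nested tuples (our own encoding, built by hand for the running example f = ac + abc + c). Assuming the NNF is a smooth d-DNNF, replacing conjunctions by products, disjunctions by sums, and decoding leaves yields the polynomial's value.

```python
# Evaluate the arithmetic circuit extracted from an NNF given as nested tuples:
# ("and", ...), ("or", ...), or a leaf ("lit", variable, positive).
def evaluate(node, leaf_value):
    kind = node[0]
    if kind == "and":                 # and-nodes become multiplications
        result = 1.0
        for child in node[1:]:
            result *= evaluate(child, leaf_value)
        return result
    if kind == "or":                  # or-nodes become additions
        return sum(evaluate(child, leaf_value) for child in node[1:])
    _, var, positive = node           # leaves: V_x -> x, and (not V_x) -> 1
    return leaf_value(var, positive)

# A hand-built smooth d-DNNF equivalent to (V_a or not V_b) and V_c.
nnf = ("and",
       ("or",
        ("and", ("lit", "a", True), ("or", ("lit", "b", True), ("lit", "b", False))),
        ("and", ("lit", "a", False), ("lit", "b", False))),
       ("lit", "c", True))

values = {"a": 0.5, "b": 0.25, "c": 2.0}
poly = evaluate(nnf, lambda v, pos: values[v] if pos else 1.0)
print(poly)   # ac + abc + c = 1.0 + 0.25 + 2.0 = 3.25
```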

4.3. Encoding Primula's networks

The encoding step described above is semantic; that is, it describes the theory Δ_f which encodes a multi-linear function by describing its models. As mentioned earlier, the Primula system generates propositional instances of relational models in the form of classical Bayesian networks. We now turn to the question of how to syntactically represent in CNF the multi-linear function of a network so generated.

We start with the baseline encoding defined in [19], which applies to any Bayesian network. The CNF has one Boolean variable I_λ for each indicator variable λ, and one Boolean variable P_θ for each parameter variable θ. CNF clauses fall into three sets. First, for each network variable X with domain x1, x2, ..., xn, we have:

Indicator clauses: I_{λ_x1} ∨ I_{λ_x2} ∨ ··· ∨ I_{λ_xn}, and ¬I_{λ_xi} ∨ ¬I_{λ_xj} for i < j.

For example, variable B from Fig. 1 generates the following clauses:

I_{λ_b} ∨ I_{λ_¬b},  ¬I_{λ_b} ∨ ¬I_{λ_¬b}.   (1)

These clauses ensure that exactly one indicator variable for B appears in every term of the multi-linear function. The second two sets of clauses correspond to network parameters. In particular, for each parameter θ_{xn|x1,x2,...,xn-1}, we have:

IP clause: I_{λ_x1} ∧ I_{λ_x2} ∧ ··· ∧ I_{λ_xn} ⇒ P_{θ_{xn|x1,x2,...,xn-1}}
PI clauses: P_{θ_{xn|x1,x2,...,xn-1}} ⇒ I_{λ_xi} for each i.

For example, the parameter θ_{b|a} in Fig. 1 generates the following clauses:

I_{λ_a} ∧ I_{λ_b} ⇒ P_{θ_{b|a}},  P_{θ_{b|a}} ⇒ I_{λ_a},  P_{θ_{b|a}} ⇒ I_{λ_b}.   (2)

These clauses ensure that θ_{b|a} appears in a term iff λ_a and λ_b appear. The encoding as discussed does not capture information about parameter values (local structure). However, it is quite easy to encode information about determinism within this encoding. Consider again Fig. 1 and the parameter θ_{b|a} = 0, which generates the clauses in Eq. (2). Given that this parameter is known to be 0, all multi-linear terms that contain it must vanish. Therefore, we can suppress the generation of a Boolean variable for this parameter, and then replace the above clauses by the single clause ¬I_{λ_a} ∨ ¬I_{λ_b}. This clause has the effect of eliminating all CNF models which correspond to vanishing terms, namely those containing the parameter θ_{b|a}.

To this basic encoding we apply some optimizations:

- Primula-generated networks contain only binary variables. Therefore, instead of using one propositional variable for each evidence indicator λ_x, which would be needed in general, we use one propositional variable I_X for each Bayesian network variable X, where the positive literal I_X represents the indicator λ_x, and the negative literal ¬I_X represents the indicator λ_¬x. Not only does this cut the number of indicator variables in half, but it also removes the need for indicator clauses. For example, without the enhancement, variable B in Fig. 1 generates the Boolean variables I_{λ_b} and I_{λ_¬b} and the two clauses in Eq. (1). With the optimization, B generates only a single Boolean variable I_B and no clauses. This optimization requires a corresponding modification to the decoding step, as indicated below.

- Another enhancement results from the observation that the Boolean indicators and parameters corresponding to the same state of a network root variable are logically equivalent, making it possible to delete the parameter variables and the corresponding IP and PI clauses, which establish the equivalence. The Boolean indicator thus represents both an indicator and a parameter. For example, without the enhancement, the parameter θ_a in Fig. 1 generates one Boolean variable P_{θ_a} and two clauses, I_A ⇒ P_{θ_a} and P_{θ_a} ⇒ I_A. With the enhancement, the variable and clauses are omitted. This optimization also requires a corresponding modification to the decoding step, as indicated below.

- Variables and clauses generated by parameters equal to 1 are redundant and therefore omitted.

Applying these enhancements allows us to create the CNF as follows (see the sketch at the end of this subsection). For each network variable X, we create a propositional variable I_X. If X is not a root, then we perform three more steps. (1) For each network parameter θ_{x|u} not equal to 0 or 1, create a propositional variable P_{θ_{x|u}}. (2) For each parameter θ_{x|u1,u2,...,un} equal to 0, create the clause ¬L_{U1} ∨ ¬L_{U2} ∨ ··· ∨ ¬L_{Un} ∨ ¬L_X, where L_{Ui} is a literal over variable I_{Ui} whose sign is the same as u_i, and similarly for L_X with respect to x. (3) For each parameter θ_{x|u1,u2,...,un} not equal to 0 and not equal to 1, create the clauses L_{U1} ∧ L_{U2} ∧ ··· ∧ L_{Un} ∧ L_X ⇒ P_{θ_{x|u1,...,un}}, then P_{θ_{x|u1,...,un}} ⇒ L_{U1}, P_{θ_{x|u1,...,un}} ⇒ L_{U2}, ..., P_{θ_{x|u1,...,un}} ⇒ L_{Un}, and P_{θ_{x|u1,...,un}} ⇒ L_X, where L_{Ui} and L_X are as defined earlier.

As an example, the CPT for variable B in Fig. 1 generates the following clauses:

First CPT row (parameter equal to 0): ¬I_A ∨ ¬I_B.
Second CPT row (parameter equal to 1): no variables or clauses.
Third CPT row: ¬I_A ∧ I_B ⇒ P_{θ_{b|¬a}}, P_{θ_{b|¬a}} ⇒ ¬I_A, P_{θ_{b|¬a}} ⇒ I_B.
Fourth CPT row: ¬I_A ∧ ¬I_B ⇒ P_{θ_{¬b|¬a}}, P_{θ_{¬b|¬a}} ⇒ ¬I_A, P_{θ_{¬b|¬a}} ⇒ ¬I_B.

Because Primula generates networks with binary variables and nodes with at most three parents, this encoding leads to a CNF whose size is linear in the number of network variables. Table 1 depicts the sizes of the CNF encodings for the relational models with which we experimented.

The special encoding used above calls for a slightly different decoding scheme for transforming a smooth d-DNNF into an arithmetic circuit. Specifically, if X is not a root, then the literals I_X and ¬I_X are replaced with the evidence indicators λ_x and λ_¬x, respectively. If X is a root, then the literals I_X and ¬I_X are replaced with λ_x θ_x and λ_¬x θ_¬x, respectively. Moreover, the literals P_{θ_{x|u}} and ¬P_{θ_{x|u}} are replaced by θ_{x|u} and 1, respectively. Finally, conjunctions and disjunctions are replaced by multiplications and additions.

We close this section by pointing the reader to [25], which discusses more recent and sophisticated encodings that handle Bayesian networks with context-specific independence [26], multi-valued variables, large CPTs, and lesser amounts of determinism.
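The optimized encoding steps (1)-(3) can be sketched as follows; the clause representation (lists of (variable, sign) literals, with implications written out as disjunctions) and the parameter naming are ours.

```python
# Optimized CNF encoding for a binary-variable CPT. A literal is a pair
# (variable_name, sign); a clause is a list of literals read as a disjunction.
def encode_cpt(child, parents, rows):
    """rows maps (u, x) -> parameter value, where u is a tuple of parent states
    and x is the child state. Returns (parameter_variables, clauses)."""
    variables, clauses = [], []
    for (u, x), value in rows.items():
        if value == 1.0:
            continue                                      # redundant: nothing generated
        negated_row = [(p, not s) for p, s in zip(parents, u)] + [(child, not x)]
        if value == 0.0:
            clauses.append(negated_row)                   # step (2): kill vanishing terms
            continue
        param = f"P[{child}={x}|{u}]"                     # step (1): a parameter variable
        variables.append(param)
        clauses.append(negated_row + [(param, True)])     # step (3): the IP clause
        for lit in [(p, s) for p, s in zip(parents, u)] + [(child, x)]:
            clauses.append([(param, False), lit])         # step (3): the PI clauses
    return variables, clauses

# Variable B with parent A, matching the CPT example above (nonzero values assumed).
rows = {((True,), True): 0.0, ((True,), False): 1.0,
        ((False,), True): 0.3, ((False,), False): 0.7}
print(encode_cpt("B", ["A"], rows))
```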

5. Experimental results

We ran our experiments on a 1.6 GHz Pentium M with 2 GB of RAM, using a system that is available for download. Table 1 lists for each relational model a number of instances, and for each instance a number of measurements. First is the size and connectivity of the Bayesian network that Primula generated. Primula generates networks in formats acceptable by general-purpose tools such as Hugin and Netica, but exact inference in these tools cannot handle most of these networks. Next is the number of variables and clauses in the CNF encodings. Clauses have at most five literals, since the networks have at most three parents per node.

Table 1 shows additional findings. First, the table shows the size of the compiled arithmetic circuit in terms of both the number of nodes and the number of edges (count and log base 2). We also show the time it takes to evaluate and differentiate the circuit, averaged over 31 different randomly generated evidence sets. By evaluating and differentiating the circuit, one obtains marginals over all network families, in addition to other probabilities discussed in [2]. The main points to observe are the efficiency of online inference on compiled circuits and the size of these circuits compared to the size and connectivity of the Bayesian networks. Table 1 also shows the time for jointree propagation using the SamIam inference engine on instances whose cluster size was manageable. One can see the big difference between online inference using the compiled AC and using the corresponding jointrees. Table 1 finally shows the compile time to generate the arithmetic circuits. The compile times range from less than a minute to about 60 min for the largest model. Yet the time for online inference ranges from milliseconds to about 13 s for these models. This clearly shows the benefit of offline compilation in this case, whose time can be amortized over online queries.

Friends and smokers produces networks with particularly high connectivity. We mentioned previously that logical constraints in this model give rise to grounded Bayesian networks with evidence that applies to all queries. One might hope that classical pruning techniques, such as deleting leaf nodes that are not part of the query or evidence [27] and deleting edges exiting evidence nodes [28], might reduce the connectivity of these networks, making them accessible to classical inference algorithms. This possibility is not realized, though, since all of the evidence occurs on leaf nodes. However, we can use the method of [29] to place this evidence into the CNF encoding and compile with the evidence. In particular, if we know that network variable A corresponds to a logical constraint that must be true, then we simply add the unit clause I_A (asserting the indicator λ_a) to the CNF encoding. In fact, injecting these unit clauses into the CNF encoding prior to compilation has a critical effect on both compilation time and AC size, as most of these networks could not be compiled otherwise.
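In terms of the clause representation sketched in Section 4.3, injecting this evidence is a one-liner; I_A here stands for the Boolean variable of an always-true constraint node.

```python
# Append one unit clause per constraint variable known to be true, so the
# compiler can exploit this evidence during compilation (in the spirit of [29]).
def inject_evidence(clauses, always_true_variables):
    return clauses + [[(v, True)] for v in always_true_variables]
```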

6. Conclusion

We described in this paper an inference system for relational Bayesian networks as defined by Primula. The proposed inference approach is based on compiling propositional instances of these models into arithmetic circuits. The approach exploits determinism in relational models, allowing us to reason efficiently with some relational models whose Primula-generated propositional instances contain thousands of variables, and whose jointrees contain clusters with hundreds of variables. The described system appears to significantly expand the scale of Primula-based relational models that can be handled efficiently by exact inference algorithms. It is also equally applicable, and equally effective, on any Bayesian network that exhibits similar properties (e.g., determinism), regardless of whether it is synthesized from a relational model.

Acknowledgment

This work has been partially supported by NSF grant IIS and MURI grant N.

References

[1] F.V. Jensen, S. Lauritzen, K. Olesen, Bayesian updating in recursive graphical models by local computation, Computational Statistics Quarterly 4 (1990).
[2] A. Darwiche, A differential approach to inference in Bayesian networks, Journal of the ACM 50 (3) (2003).
[3] F. Jensen, S.K. Andersen, Approximations in Bayesian belief universes for knowledge based systems, in: Proceedings of the Conference on Uncertainty in Artificial Intelligence (UAI), Cambridge, MA, 1990.
[4] J. Park, A. Darwiche, A differential semantics for jointree algorithms, Artificial Intelligence 156 (2004).
[5] J.S. Breese, R.P. Goldman, M.P. Wellman, Introduction to the special section on knowledge-based construction of probabilistic decision models, IEEE Transactions on Systems, Man, and Cybernetics 24 (11) (1994).
[6] T. Sato, A statistical learning method for logic programs with distribution semantics, in: Proceedings of the International Conference on Logic Programming (ICLP), 1995.
[7] S. Muggleton, Stochastic logic programs, in: L. de Raedt (Ed.), Advances in Inductive Logic Programming, IOS Press, 1996.
[8] K. Kersting, L. de Raedt, Towards combining inductive logic programming and Bayesian networks, in: Proceedings of the International Conference on Inductive Logic Programming (ILP), Springer Lecture Notes in AI 2157.
[9] K.B. Laskey, S.M. Mahoney, Network fragments: representing knowledge for constructing probabilistic models, in: Proceedings of the Conference on Uncertainty in Artificial Intelligence (UAI), San Francisco, CA, 1997.
[10] D. Koller, A. Pfeffer, Probabilistic frame-based systems, in: Proceedings of the National Conference on Artificial Intelligence (AAAI), 1998.
[11] N. Friedman, L. Getoor, D. Koller, A. Pfeffer, Learning probabilistic relational models, in: Proceedings of the International Joint Conference on Artificial Intelligence (IJCAI).
[12] M. Jaeger, Relational Bayesian networks, in: D. Geiger, P.P. Shenoy (Eds.), Proceedings of the Conference on Uncertainty in Artificial Intelligence (UAI), Providence, USA, 1997.
[13] H. Pasula, S. Russell, Approximate inference for first-order probabilistic languages, in: Proceedings of the International Joint Conference on Artificial Intelligence (IJCAI), 2001.
[14] M. Richardson, P. Domingos, Markov logic networks, Machine Learning Journal, special issue on Statistical Relational Learning and Multi-Relational Data Mining, in press.

[15] M. Jaeger, Complex probabilistic modeling with recursive relational Bayesian networks, Annals of Mathematics and Artificial Intelligence 32 (2001).
[16] M. Jaeger, On the complexity of inference about probabilistic relational models, Artificial Intelligence 117 (2000).
[17] D. Poole, First-order probabilistic inference, in: Proceedings of the International Joint Conference on Artificial Intelligence (IJCAI).
[18] R. de Salvo Braz, E. Amir, D. Roth, Lifted first-order probabilistic inference, in: Proceedings of the International Joint Conference on Artificial Intelligence (IJCAI), 2005.
[19] A. Darwiche, A logical approach to factoring belief networks, in: Proceedings of the International Conference on Principles of Knowledge Representation and Reasoning (KR), 2002.
[20] T. Sang, P. Beame, H. Kautz, Solving Bayesian networks by weighted model counting, in: Proceedings of the National Conference on Artificial Intelligence (AAAI), vol. 1, AAAI Press, 2005.
[21] J. Huang, A. Darwiche, DPLL with a trace: from SAT to knowledge compilation, in: Proceedings of the International Joint Conference on Artificial Intelligence (IJCAI), 2005.
[22] A. Darwiche, P. Marquis, A knowledge compilation map, Journal of Artificial Intelligence Research 17 (2002).
[23] A. Darwiche, P. Marquis, Compiling propositional weighted bases, Artificial Intelligence 157 (1-2) (2004).
[24] A. Darwiche, New advances in compiling CNF to decomposable negation normal form, in: Proceedings of the European Conference on Artificial Intelligence (ECAI), 2004.
[25] M. Chavira, A. Darwiche, Compiling Bayesian networks with local structure, in: Proceedings of the International Joint Conference on Artificial Intelligence (IJCAI), 2005.
[26] C. Boutilier, N. Friedman, M. Goldszmidt, D. Koller, Context-specific independence in Bayesian networks, in: Proceedings of the Conference on Uncertainty in Artificial Intelligence (UAI), 1996.
[27] R.D. Shachter, Evaluating influence diagrams, Operations Research 34 (6) (1986).
[28] R.D. Shachter, Evidence absorption and propagation through evidence reversals, in: Proceedings of the Conference on Uncertainty in Artificial Intelligence (UAI), Elsevier Science Publishing Company, Inc., New York, NY.
[29] M. Chavira, D. Allen, A. Darwiche, Exploiting evidence in probabilistic inference, in: Proceedings of the Conference on Uncertainty in Artificial Intelligence (UAI), 2005.


PROBABILISTIC GRAPHICAL MODELS SPECIFIED BY PROBABILISTIC LOGIC PROGRAMS: SEMANTICS AND COMPLEXITY PROBABILISTIC GRAPHICAL MODELS SPECIFIED BY PROBABILISTIC LOGIC PROGRAMS: SEMANTICS AND COMPLEXITY Fabio G. Cozman and Denis D. Mauá Universidade de São Paulo, Brazil September 8, 2016 1 / 23 Many languages

More information

Ch9: Exact Inference: Variable Elimination. Shimi Salant, Barak Sternberg

Ch9: Exact Inference: Variable Elimination. Shimi Salant, Barak Sternberg Ch9: Exact Inference: Variable Elimination Shimi Salant Barak Sternberg Part 1 Reminder introduction (1/3) We saw two ways to represent (finite discrete) distributions via graphical data structures: Bayesian

More information

Structured Bayesian Networks: From Inference to Learning with Routes

Structured Bayesian Networks: From Inference to Learning with Routes Structured Bayesian Networks: From Inference to Learning with Routes Yujia Shen and Anchal Goyanka and Adnan Darwiche and Arthur Choi Computer Science Department University of California, Los Angeles {yujias,anchal,darwiche,aychoi}@csuclaedu

More information

A Framework for Securing Databases from Intrusion Threats

A Framework for Securing Databases from Intrusion Threats A Framework for Securing Databases from Intrusion Threats R. Prince Jeyaseelan James Department of Computer Applications, Valliammai Engineering College Affiliated to Anna University, Chennai, India Email:

More information

Multi Domain Logic and its Applications to SAT

Multi Domain Logic and its Applications to SAT Multi Domain Logic and its Applications to SAT Tudor Jebelean RISC Linz, Austria Tudor.Jebelean@risc.uni-linz.ac.at Gábor Kusper Eszterházy Károly College gkusper@aries.ektf.hu Abstract We describe a new

More information

Uncertain Data Models

Uncertain Data Models Uncertain Data Models Christoph Koch EPFL Dan Olteanu University of Oxford SYNOMYMS data models for incomplete information, probabilistic data models, representation systems DEFINITION An uncertain data

More information

Chapter 3: Propositional Languages

Chapter 3: Propositional Languages Chapter 3: Propositional Languages We define here a general notion of a propositional language. We show how to obtain, as specific cases, various languages for propositional classical logic and some non-classical

More information

COMP4418 Knowledge Representation and Reasoning

COMP4418 Knowledge Representation and Reasoning COMP4418 Knowledge Representation and Reasoning Week 3 Practical Reasoning David Rajaratnam Click to edit Present s Name Practical Reasoning - My Interests Cognitive Robotics. Connect high level cognition

More information

Data Analytics and Boolean Algebras

Data Analytics and Boolean Algebras Data Analytics and Boolean Algebras Hans van Thiel November 28, 2012 c Muitovar 2012 KvK Amsterdam 34350608 Passeerdersstraat 76 1016 XZ Amsterdam The Netherlands T: + 31 20 6247137 E: hthiel@muitovar.com

More information

A New Approach For Convert Multiply-Connected Trees in Bayesian networks

A New Approach For Convert Multiply-Connected Trees in Bayesian networks A New Approach For Convert Multiply-Connected Trees in Bayesian networks 1 Hussein Baloochian, Alireza khantimoory, 2 Saeed Balochian 1 Islamic Azad university branch of zanjan 2 Islamic Azad university

More information

Mini-Buckets: A General Scheme for Generating Approximations in Automated Reasoning

Mini-Buckets: A General Scheme for Generating Approximations in Automated Reasoning Mini-Buckets: A General Scheme for Generating Approximations in Automated Reasoning Rina Dechter* Department of Information and Computer Science University of California, Irvine dechter@ics. uci. edu Abstract

More information

Small Formulas for Large Programs: On-line Constraint Simplification In Scalable Static Analysis

Small Formulas for Large Programs: On-line Constraint Simplification In Scalable Static Analysis Small Formulas for Large Programs: On-line Constraint Simplification In Scalable Static Analysis Isil Dillig, Thomas Dillig, Alex Aiken Stanford University Scalability and Formula Size Many program analysis

More information

International Journal of Approximate Reasoning

International Journal of Approximate Reasoning International Journal of Approximate Reasoning 52 (2) 49 62 Contents lists available at ScienceDirect International Journal of Approximate Reasoning journal homepage: www.elsevier.com/locate/ijar Approximate

More information

Decision Procedures. An Algorithmic Point of View. Decision Procedures for Propositional Logic. D. Kroening O. Strichman.

Decision Procedures. An Algorithmic Point of View. Decision Procedures for Propositional Logic. D. Kroening O. Strichman. Decision Procedures An Algorithmic Point of View Decision Procedures for Propositional Logic D. Kroening O. Strichman ETH/Technion Version 1.0, 2007 Part I Decision Procedures for Propositional Logic Outline

More information

Discrete Optimization. Lecture Notes 2

Discrete Optimization. Lecture Notes 2 Discrete Optimization. Lecture Notes 2 Disjunctive Constraints Defining variables and formulating linear constraints can be straightforward or more sophisticated, depending on the problem structure. The

More information

ABSTRACT 1. INTRODUCTION

ABSTRACT 1. INTRODUCTION ABSTRACT A Framework for Multi-Agent Multimedia Indexing Bernard Merialdo Multimedia Communications Department Institut Eurecom BP 193, 06904 Sophia-Antipolis, France merialdo@eurecom.fr March 31st, 1995

More information

Some Hardness Proofs

Some Hardness Proofs Some Hardness Proofs Magnus Lie Hetland January 2011 This is a very brief overview of some well-known hard (NP Hard and NP complete) problems, and the main ideas behind their hardness proofs. The document

More information

PROPOSITIONAL LOGIC (2)

PROPOSITIONAL LOGIC (2) PROPOSITIONAL LOGIC (2) based on Huth & Ruan Logic in Computer Science: Modelling and Reasoning about Systems Cambridge University Press, 2004 Russell & Norvig Artificial Intelligence: A Modern Approach

More information

To prove something about all Boolean expressions, we will need the following induction principle: Axiom 7.1 (Induction over Boolean expressions):

To prove something about all Boolean expressions, we will need the following induction principle: Axiom 7.1 (Induction over Boolean expressions): CS 70 Discrete Mathematics for CS Fall 2003 Wagner Lecture 7 This lecture returns to the topic of propositional logic. Whereas in Lecture 1 we studied this topic as a way of understanding proper reasoning

More information

1 Definition of Reduction

1 Definition of Reduction 1 Definition of Reduction Problem A is reducible, or more technically Turing reducible, to problem B, denoted A B if there a main program M to solve problem A that lacks only a procedure to solve problem

More information

Hybrid Feature Selection for Modeling Intrusion Detection Systems

Hybrid Feature Selection for Modeling Intrusion Detection Systems Hybrid Feature Selection for Modeling Intrusion Detection Systems Srilatha Chebrolu, Ajith Abraham and Johnson P Thomas Department of Computer Science, Oklahoma State University, USA ajith.abraham@ieee.org,

More information

Multi-relational Decision Tree Induction

Multi-relational Decision Tree Induction Multi-relational Decision Tree Induction Arno J. Knobbe 1,2, Arno Siebes 2, Daniël van der Wallen 1 1 Syllogic B.V., Hoefseweg 1, 3821 AE, Amersfoort, The Netherlands, {a.knobbe, d.van.der.wallen}@syllogic.com

More information

Answer Sets and the Language of Answer Set Programming. Vladimir Lifschitz

Answer Sets and the Language of Answer Set Programming. Vladimir Lifschitz Answer Sets and the Language of Answer Set Programming Vladimir Lifschitz Answer set programming is a declarative programming paradigm based on the answer set semantics of logic programs. This introductory

More information

Exam Topics. Search in Discrete State Spaces. What is intelligence? Adversarial Search. Which Algorithm? 6/1/2012

Exam Topics. Search in Discrete State Spaces. What is intelligence? Adversarial Search. Which Algorithm? 6/1/2012 Exam Topics Artificial Intelligence Recap & Expectation Maximization CSE 473 Dan Weld BFS, DFS, UCS, A* (tree and graph) Completeness and Optimality Heuristics: admissibility and consistency CSPs Constraint

More information

Memory-Efficient Inference in Relational Domains

Memory-Efficient Inference in Relational Domains Memory-Efficient Inference in Relational Domains Parag Singla Pedro Domingos Department of Computer Science and Engineering University of Washington Seattle, WA 98195-2350, U.S.A. {parag, pedrod}@cs.washington.edu

More information

Clone: Solving Weighted Max-SAT in a Reduced Search Space

Clone: Solving Weighted Max-SAT in a Reduced Search Space Clone: Solving Weighted Max-SAT in a Reduced Search Space Knot Pipatsrisawat and Adnan Darwiche {thammakn,darwiche}@cs.ucla.edu Computer Science Department University of California, Los Angeles Abstract.

More information

USING QBF SOLVERS TO SOLVE GAMES AND PUZZLES. Zhihe Shen. Advisor: Howard Straubing

USING QBF SOLVERS TO SOLVE GAMES AND PUZZLES. Zhihe Shen. Advisor: Howard Straubing Boston College Computer Science Senior Thesis USING QBF SOLVERS TO SOLVE GAMES AND PUZZLES Zhihe Shen Advisor: Howard Straubing Abstract There are multiple types of games, such as board games and card

More information

Tractable Probabilistic Knowledge Bases with Existence Uncertainty

Tractable Probabilistic Knowledge Bases with Existence Uncertainty Tractable Probabilistic Knowledge Bases with Existence Uncertainty W. Austin Webb and Pedro Domingos {webb,pedrod@cs.washington.edu Department of Computer Science and Engineering, University of Washington,

More information

Loopy Belief Propagation

Loopy Belief Propagation Loopy Belief Propagation Research Exam Kristin Branson September 29, 2003 Loopy Belief Propagation p.1/73 Problem Formalization Reasoning about any real-world problem requires assumptions about the structure

More information

Normal Forms for Boolean Expressions

Normal Forms for Boolean Expressions Normal Forms for Boolean Expressions A NORMAL FORM defines a class expressions s.t. a. Satisfy certain structural properties b. Are usually universal: able to express every boolean function 1. Disjunctive

More information

Example: Map coloring

Example: Map coloring Today s s lecture Local Search Lecture 7: Search - 6 Heuristic Repair CSP and 3-SAT Solving CSPs using Systematic Search. Victor Lesser CMPSCI 683 Fall 2004 The relationship between problem structure and

More information

8.1 Polynomial-Time Reductions

8.1 Polynomial-Time Reductions 8.1 Polynomial-Time Reductions Classify Problems According to Computational Requirements Q. Which problems will we be able to solve in practice? A working definition. Those with polynomial-time algorithms.

More information

Pbmodels Software to Compute Stable Models by Pseudoboolean Solvers

Pbmodels Software to Compute Stable Models by Pseudoboolean Solvers Pbmodels Software to Compute Stable Models by Pseudoboolean Solvers Lengning Liu and Mirosław Truszczyński Department of Computer Science, University of Kentucky, Lexington, KY 40506-0046, USA Abstract.

More information

On Resolution Proofs for Combinational Equivalence Checking

On Resolution Proofs for Combinational Equivalence Checking On Resolution Proofs for Combinational Equivalence Checking Satrajit Chatterjee Alan Mishchenko Robert Brayton Department of EECS U. C. Berkeley {satrajit, alanmi, brayton}@eecs.berkeley.edu Andreas Kuehlmann

More information

Lecture 1 Contracts : Principles of Imperative Computation (Fall 2018) Frank Pfenning

Lecture 1 Contracts : Principles of Imperative Computation (Fall 2018) Frank Pfenning Lecture 1 Contracts 15-122: Principles of Imperative Computation (Fall 2018) Frank Pfenning In these notes we review contracts, which we use to collectively denote function contracts, loop invariants,

More information

A Brief Introduction to Bayesian Networks AIMA CIS 391 Intro to Artificial Intelligence

A Brief Introduction to Bayesian Networks AIMA CIS 391 Intro to Artificial Intelligence A Brief Introduction to Bayesian Networks AIMA 14.1-14.3 CIS 391 Intro to Artificial Intelligence (LDA slides from Lyle Ungar from slides by Jonathan Huang (jch1@cs.cmu.edu)) Bayesian networks A simple,

More information

A Roadmap to an Enhanced Graph Based Data mining Approach for Multi-Relational Data mining

A Roadmap to an Enhanced Graph Based Data mining Approach for Multi-Relational Data mining A Roadmap to an Enhanced Graph Based Data mining Approach for Multi-Relational Data mining D.Kavinya 1 Student, Department of CSE, K.S.Rangasamy College of Technology, Tiruchengode, Tamil Nadu, India 1

More information

Finite Model Generation for Isabelle/HOL Using a SAT Solver

Finite Model Generation for Isabelle/HOL Using a SAT Solver Finite Model Generation for / Using a SAT Solver Tjark Weber webertj@in.tum.de Technische Universität München Winterhütte, März 2004 Finite Model Generation for / p.1/21 is a generic proof assistant: Highly

More information

Chapter 8. NP-complete problems

Chapter 8. NP-complete problems Chapter 8. NP-complete problems Search problems E cient algorithms We have developed algorithms for I I I I I finding shortest paths in graphs, minimum spanning trees in graphs, matchings in bipartite

More information

Reals 1. Floating-point numbers and their properties. Pitfalls of numeric computation. Horner's method. Bisection. Newton's method.

Reals 1. Floating-point numbers and their properties. Pitfalls of numeric computation. Horner's method. Bisection. Newton's method. Reals 1 13 Reals Floating-point numbers and their properties. Pitfalls of numeric computation. Horner's method. Bisection. Newton's method. 13.1 Floating-point numbers Real numbers, those declared to be

More information

Choice Logic Programs and Nash Equilibria in Strategic Games

Choice Logic Programs and Nash Equilibria in Strategic Games Choice Logic Programs and Nash Equilibria in Strategic Games Marina De Vos and Dirk Vermeir Dept. of Computer Science Free University of Brussels, VUB Pleinlaan 2, Brussels 1050, Belgium Tel: +32 2 6293308

More information

10708 Graphical Models: Homework 2

10708 Graphical Models: Homework 2 10708 Graphical Models: Homework 2 Due October 15th, beginning of class October 1, 2008 Instructions: There are six questions on this assignment. Each question has the name of one of the TAs beside it,

More information

Validating Plans with Durative Actions via Integrating Boolean and Numerical Constraints

Validating Plans with Durative Actions via Integrating Boolean and Numerical Constraints Validating Plans with Durative Actions via Integrating Boolean and Numerical Constraints Roman Barták Charles University in Prague, Faculty of Mathematics and Physics Institute for Theoretical Computer

More information

Performing Incremental Bayesian Inference by Dynamic Model Counting

Performing Incremental Bayesian Inference by Dynamic Model Counting Performing Incremental Bayesian Inference by Dynamic Model Counting Wei Li and Peter van Beek and Pascal Poupart School of Computer Science University of Waterloo Waterloo, Ontario N2L 3G1, Canada {w22li,

More information

Declarative programming. Logic programming is a declarative style of programming.

Declarative programming. Logic programming is a declarative style of programming. Declarative programming Logic programming is a declarative style of programming. Declarative programming Logic programming is a declarative style of programming. The programmer says what they want to compute,

More information

Full CNF Encoding: The Counting Constraints Case

Full CNF Encoding: The Counting Constraints Case Full CNF Encoding: The Counting Constraints Case Olivier Bailleux 1 and Yacine Boufkhad 2 1 LERSIA, Université de Bourgogne Avenue Alain Savary, BP 47870 21078 Dijon Cedex olivier.bailleux@u-bourgogne.fr

More information

NP-Completeness of 3SAT, 1-IN-3SAT and MAX 2SAT

NP-Completeness of 3SAT, 1-IN-3SAT and MAX 2SAT NP-Completeness of 3SAT, 1-IN-3SAT and MAX 2SAT 3SAT The 3SAT problem is the following. INSTANCE : Given a boolean expression E in conjunctive normal form (CNF) that is the conjunction of clauses, each

More information

Approximate Discrete Probability Distribution Representation using a Multi-Resolution Binary Tree

Approximate Discrete Probability Distribution Representation using a Multi-Resolution Binary Tree Approximate Discrete Probability Distribution Representation using a Multi-Resolution Binary Tree David Bellot and Pierre Bessière GravirIMAG CNRS and INRIA Rhône-Alpes Zirst - 6 avenue de l Europe - Montbonnot

More information

Local Two-Level And-Inverter Graph Minimization without Blowup

Local Two-Level And-Inverter Graph Minimization without Blowup Local Two-Level And-Inverter Graph Minimization without Blowup Robert Brummayer and Armin Biere Institute for Formal Models and Verification Johannes Kepler University Linz, Austria {robert.brummayer,

More information

Part I Logic programming paradigm

Part I Logic programming paradigm Part I Logic programming paradigm 1 Logic programming and pure Prolog 1.1 Introduction 3 1.2 Syntax 4 1.3 The meaning of a program 7 1.4 Computing with equations 9 1.5 Prolog: the first steps 15 1.6 Two

More information

Discrete mathematics

Discrete mathematics Discrete mathematics Petr Kovář petr.kovar@vsb.cz VŠB Technical University of Ostrava DiM 470-2301/02, Winter term 2018/2019 About this file This file is meant to be a guideline for the lecturer. Many

More information

COMP 410 Lecture 1. Kyle Dewey

COMP 410 Lecture 1. Kyle Dewey COMP 410 Lecture 1 Kyle Dewey About Me I research automated testing techniques and their intersection with CS education My dissertation used logic programming extensively This is my second semester at

More information

Safe Stratified Datalog With Integer Order Does not Have Syntax

Safe Stratified Datalog With Integer Order Does not Have Syntax Safe Stratified Datalog With Integer Order Does not Have Syntax Alexei P. Stolboushkin Department of Mathematics UCLA Los Angeles, CA 90024-1555 aps@math.ucla.edu Michael A. Taitslin Department of Computer

More information

CS 188: Artificial Intelligence Spring Today

CS 188: Artificial Intelligence Spring Today CS 188: Artificial Intelligence Spring 2006 Lecture 7: CSPs II 2/7/2006 Dan Klein UC Berkeley Many slides from either Stuart Russell or Andrew Moore Today More CSPs Applications Tree Algorithms Cutset

More information

Bayesian Logic Networks (Extended Version )

Bayesian Logic Networks (Extended Version ) Bayesian Logic Networks (Extended Version ) Technical Report IAS-2009-03 Dominik Jain, Stefan Waldherr and Michael Beetz Intelligent Autonomous Systems Group, Technische Universität München Boltzmannstr.

More information

SAT solver of Howe & King as a logic program

SAT solver of Howe & King as a logic program SAT solver of Howe & King as a logic program W lodzimierz Drabent June 6, 2011 Howe and King [HK11b, HK11a] presented a SAT solver which is an elegant and concise Prolog program of 22 lines. It is not

More information

Today. CS 188: Artificial Intelligence Fall Example: Boolean Satisfiability. Reminder: CSPs. Example: 3-SAT. CSPs: Queries.

Today. CS 188: Artificial Intelligence Fall Example: Boolean Satisfiability. Reminder: CSPs. Example: 3-SAT. CSPs: Queries. CS 188: Artificial Intelligence Fall 2007 Lecture 5: CSPs II 9/11/2007 More CSPs Applications Tree Algorithms Cutset Conditioning Today Dan Klein UC Berkeley Many slides over the course adapted from either

More information

Lecture 1 Contracts. 1 A Mysterious Program : Principles of Imperative Computation (Spring 2018) Frank Pfenning

Lecture 1 Contracts. 1 A Mysterious Program : Principles of Imperative Computation (Spring 2018) Frank Pfenning Lecture 1 Contracts 15-122: Principles of Imperative Computation (Spring 2018) Frank Pfenning In these notes we review contracts, which we use to collectively denote function contracts, loop invariants,

More information

2386 IEEE TRANSACTIONS ON INFORMATION THEORY, VOL. 52, NO. 6, JUNE 2006

2386 IEEE TRANSACTIONS ON INFORMATION THEORY, VOL. 52, NO. 6, JUNE 2006 2386 IEEE TRANSACTIONS ON INFORMATION THEORY, VOL. 52, NO. 6, JUNE 2006 The Encoding Complexity of Network Coding Michael Langberg, Member, IEEE, Alexander Sprintson, Member, IEEE, and Jehoshua Bruck,

More information

Note that in this definition, n + m denotes the syntactic expression with three symbols n, +, and m, not to the number that is the sum of n and m.

Note that in this definition, n + m denotes the syntactic expression with three symbols n, +, and m, not to the number that is the sum of n and m. CS 6110 S18 Lecture 8 Structural Operational Semantics and IMP Today we introduce a very simple imperative language, IMP, along with two systems of rules for evaluation called small-step and big-step semantics.

More information

Massachusetts Institute of Technology Department of Electrical Engineering and Computer Science Algorithms for Inference Fall 2014

Massachusetts Institute of Technology Department of Electrical Engineering and Computer Science Algorithms for Inference Fall 2014 Massachusetts Institute of Technology Department of Electrical Engineering and Computer Science 6.438 Algorithms for Inference Fall 2014 1 Course Overview This course is about performing inference in complex

More information

5 The Control Structure Diagram (CSD)

5 The Control Structure Diagram (CSD) 5 The Control Structure Diagram (CSD) The Control Structure Diagram (CSD) is an algorithmic level diagram intended to improve the comprehensibility of source code by clearly depicting control constructs,

More information

SAT-CNF Is N P-complete

SAT-CNF Is N P-complete SAT-CNF Is N P-complete Rod Howell Kansas State University November 9, 2000 The purpose of this paper is to give a detailed presentation of an N P- completeness proof using the definition of N P given

More information

Probabilistic Graphical Models

Probabilistic Graphical Models Overview of Part Two Probabilistic Graphical Models Part Two: Inference and Learning Christopher M. Bishop Exact inference and the junction tree MCMC Variational methods and EM Example General variational

More information

The Satisfiability Problem [HMU06,Chp.10b] Satisfiability (SAT) Problem Cook s Theorem: An NP-Complete Problem Restricted SAT: CSAT, k-sat, 3SAT

The Satisfiability Problem [HMU06,Chp.10b] Satisfiability (SAT) Problem Cook s Theorem: An NP-Complete Problem Restricted SAT: CSAT, k-sat, 3SAT The Satisfiability Problem [HMU06,Chp.10b] Satisfiability (SAT) Problem Cook s Theorem: An NP-Complete Problem Restricted SAT: CSAT, k-sat, 3SAT 1 Satisfiability (SAT) Problem 2 Boolean Expressions Boolean,

More information

Learning Tractable Probabilistic Models Pedro Domingos

Learning Tractable Probabilistic Models Pedro Domingos Learning Tractable Probabilistic Models Pedro Domingos Dept. Computer Science & Eng. University of Washington 1 Outline Motivation Probabilistic models Standard tractable models The sum-product theorem

More information