
Unifying Cluster-Tree Decompositions for Automated Reasoning

Kalev Kask, Rina Dechter and Javier Larrosa
Department of Information and Computer Science
University of California, Irvine, CA
June 11, 2003

Abstract

The paper provides a unifying perspective of tree-decomposition algorithms appearing in various automated reasoning areas, such as join-tree clustering for constraint satisfaction and the clique-tree algorithm for probabilistic reasoning. Following the framework introduced by Shenoy [16], we then introduce a new algorithm, called bucket-tree elimination (BTE), that extends bucket elimination (BE) [5] to trees, and show that it is also an instance of tree-decomposition. Moreover, our analysis shows that the new extension, BTE, can provide a speed-up of $n$ over BE for important reasoning tasks. Finally, time-space tradeoffs are shown to be cast naturally within the tree-decomposition framework.

1 Introduction

The paper provides a unifying perspective of tree-decomposition algorithms appearing in various automated reasoning areas, bringing together approaches developed over the years in different communities. We show that many existing decomposition schemes, such as join-tree clustering, junction-tree decomposition, and hypertree decomposition, are instances of tree-decomposition.

By following the framework introduced by Shenoy [16], we then introduce a new algorithm, called bucket-tree elimination (BTE), that extends bucket elimination (BE) [5] to trees, and show that it is also an instance of tree-decomposition. The unifying framework provides clarity that is likely to encourage technology transfer.

Section 2 provides definitions and background concepts, while Section 3 introduces the concept of automated reasoning problems. In Section 4 we introduce the concept of legal cluster-tree decompositions and show that many tractable classes of automated reasoning problems fit within this framework. Section 5 introduces and analyzes the bucket-tree elimination algorithm. Section 6 reviews some existing decomposition methods and places them in the context of cluster-tree decomposition. Section 7 discusses the time-space tradeoff in cluster-tree decompositions, and Section 8 concludes.

2 Preliminaries

A reasoning problem is defined in terms of a set of variables taking values on finite domains and a set of functions defined over these variables. We denote variables or subsets of variables by uppercase letters (e.g. $X, Y, Z, S, R, \ldots$) and values of variables by lowercase letters (e.g. $x, y, z, s$). An assignment $(X_1 = x_1, \ldots, X_n = x_n)$ can be abbreviated as $x = (x_1, \ldots, x_n)$. For a subset of variables $S$, $D_S$ denotes the Cartesian product of the domains of the variables in $S$, and $x_S$ is the projection of $x = (x_1, \ldots, x_n)$ onto $S$. We denote functions by letters $f, g, h$, etc., and the scope (set of arguments) of a function $f$ by $scope(f)$.

Definition 2.1 Given a function $h$ defined over a subset of variables $S$, where $X \in S$, the functions $(\min_X h)$, $(\max_X h)$, and $(\sum_X h)$ are defined over $U = S - \{X\}$ as follows: for every $U = u$, and denoting by $(u, x)$ the extension of tuple $u$ by the assignment $X = x$, $(\min_X h)(u) = \min_x h(u, x)$, $(\max_X h)(u) = \max_x h(u, x)$, and $(\sum_X h)(u) = \sum_x h(u, x)$. Given a set of functions $h_1, \ldots, h_j$ defined over the subsets $S_1, \ldots, S_j$, the product function $\prod_j h_j$ and the sum function $\sum_j h_j$ are defined over $U = \cup_j S_j$: for every $U = u$, $(\prod_j h_j)(u) = \prod_j h_j(u_{S_j})$ and $(\sum_j h_j)(u) = \sum_j h_j(u_{S_j})$.
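To ground Definition 2.1, the following sketch (our illustration, not part of the paper) represents a tabular function as a Python dict keyed by value tuples and eliminates a variable with min, max or sum; the `Function` class and `eliminate` helper are hypothetical names of our own.

```python
import itertools

class Function:
    """A tabular function: an explicit table over the Cartesian
    product of the domains of the variables in its scope."""
    def __init__(self, scope, table):
        self.scope = tuple(scope)   # ordered variable names
        self.table = dict(table)    # maps value tuples to numbers

def eliminate(h, x, domains, op):
    """Compute (op_X h): remove variable x from h's scope by folding
    op (min, max or sum) over all values of x."""
    u_scope = tuple(v for v in h.scope if v != x)
    xi = h.scope.index(x)
    table = {}
    for u in itertools.product(*(domains[v] for v in u_scope)):
        vals = []
        for xv in domains[x]:
            full = list(u)
            full.insert(xi, xv)     # extend tuple u by assignment X = xv
            vals.append(h.table[tuple(full)])
        table[u] = op(vals)
    return Function(u_scope, table)

# (sum_B h)(a) = h(a,0) + h(a,1) for h(A,B) = A + 2B over domains {0,1}
domains = {"A": [0, 1], "B": [0, 1]}
h = Function(["A", "B"], {(a, b): a + 2*b for a in (0, 1) for b in (0, 1)})
g = eliminate(h, "B", domains, sum)
assert g.table == {(0,): 2, (1,): 4}
```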

Definition 2.2 (Graph concepts) A directed graph is a pair $G = (V, E)$, where $V = \{X_1, \ldots, X_n\}$ is a set of vertices and $E = \{(X_i, X_j) \mid X_i, X_j \in V\}$ is the set of edges. If $(X_i, X_j) \in E$, we say that $X_i$ points to $X_j$. The degree of a variable is the number of edges incident to it. For each variable $X_i$, $pa(X_i)$, or $pa_i$, is the set of variables pointing to $X_i$ in $G$, while the set of child vertices of $X_i$, denoted $ch(X_i)$, comprises the variables that $X_i$ points to. The family of $X_i$, $F_i$, includes $X_i$ and its parent variables. A directed graph is acyclic if it has no directed cycles. A poly-tree is an acyclic directed graph whose underlying undirected graph (ignoring the arrows) has no loops.

While a formal definition of an automated reasoning problem is given in Section 3, it is specified by a set of functions $F$ over a set of variables $X$. It can be represented by a directed (or undirected) graph with exactly one vertex per variable and an edge between two variables iff they participate in a common function.

Definition 2.3 (Primal graph, dual graph, hyper-graph) The primal graph of a reasoning problem has the variables as its vertices, and an arc connects any two variables that appear in the scope of the same function. The dual graph of a reasoning problem has a one-to-one mapping between its vertices and the functions of the reasoning problem; two vertices in the dual graph are connected if the corresponding functions share a variable. The hyper-graph of a problem has the variables as its vertices and the scopes of the functions as its hyperedges.

Definition 2.4 (Induced width) An ordered graph is a pair $(G, d)$ where $G$ is an undirected graph and $d = (X_1, \ldots, X_n)$ is an ordering of the vertices. The width of a vertex in an ordered graph is the number of its earlier neighbors. The width of an ordering $d$, $w(d)$, is the maximum width over all vertices. The induced width of an ordered graph, $w^*(d)$, is the width of the induced ordered graph obtained by processing the vertices recursively, from last to first; when vertex $X$ is processed, all its earlier neighbors are connected.
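Definition 2.4 translates directly into code. The following sketch (ours, not from the paper) computes the induced width along an ordering by processing vertices last to first and connecting their earlier neighbors.

```python
def induced_width(edges, order):
    """Induced width w*(d): process vertices last-to-first; when a
    vertex is processed, connect all of its earlier neighbors."""
    pos = {v: i for i, v in enumerate(order)}
    adj = {v: set() for v in order}
    for a, b in edges:
        adj[a].add(b)
        adj[b].add(a)
    width = 0
    for v in reversed(order):
        earlier = [u for u in adj[v] if pos[u] < pos[v]]
        width = max(width, len(earlier))
        for u in earlier:           # connect all earlier neighbors
            for w in earlier:
                if u != w:
                    adj[u].add(w)
    return width

# Moral graph of the network in Figure 2 along d = A,B,C,D,F,G:
edges = [("A","B"),("A","C"),("A","D"),("B","C"),("B","D"),
         ("B","F"),("C","F"),("F","G")]
print(induced_width(edges, ["A","B","C","D","F","G"]))  # 2
```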

3 Automated reasoning tasks

Our approach is general and applicable to probabilistic and deterministic networks. To facilitate a general exposition, we use a unified description of automated reasoning tasks.

Definition 3.1 An automated reasoning task $P$ is a six-tuple $P = \langle X, D, F, \bigotimes, \Downarrow, \{Z_1, \ldots, Z_t\} \rangle$ defined as follows:

1. $X = \{X_1, \ldots, X_n\}$ is a set of variables.
2. $D = \{D_1, \ldots, D_n\}$ is a collection of finite domains.
3. $Z = \{Z_1, \ldots, Z_t\}$ is a set of sets of variables of interest.
4. $F = \{f_1, \ldots, f_r\}$ is a set of functions.
5. $\bigotimes_i f_i \in \{\prod_i f_i, \sum_i f_i, \Join_i f_i\}$ is a combination operator.
6. $\Downarrow_Y f \in \{\max_{S-Y} f, \min_{S-Y} f, \pi_Y f, \sum_{S-Y} f\}$, where $S$ is the scope of function $f$ and $Y \subseteq X$, is a marginalization operator.

The problem is to compute $\Downarrow_{Z_1} \bigotimes_{i=1}^{r} f_i, \ldots, \Downarrow_{Z_t} \bigotimes_{i=1}^{r} f_i$.

For optimization tasks we have $t = 1$, $Z_1 = \emptyset$ and $S = X$. Often we also seek an assignment to all the variables that optimizes (maximizes or minimizes) the combined cost function; namely, we need to find $x = (x_1, \ldots, x_n)$ that optimizes $f(x) = \bigotimes_{i=1}^{r} f_i(x_{S_i})$.

We assume that functions are expressed in tabular form, having an entry for every combination of values from the domains of their variables. Therefore, the specification of such functions is exponential in their scope (the base of the exponent is the maximum domain size). Relations, or clauses, can be expressed as functions as well, by associating a value of 0 or 1 with each tuple, depending on whether or not the tuple is in the relation (or satisfies the clause).

The combination operator takes a set of functions and generates a new function. Note that $\Pi$ stands for a product when used as a combination operator, and $\pi$ for a projection when used as a marginalization operator. The operators are defined explicitly above as a list of possible specific operators; however, they can also be defined axiomatically, as we will elaborate later. We also call an automated reasoning problem a dependency model, or a graphical model, because it can be associated with a dependency graph as described in Section 2.
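Definition 3.1 can be validated by brute force: enumerate every full assignment, combine all function values, and fold the marginalization operator per projection onto $Z_i$. The sketch below is ours, reusing the hypothetical `Function` class from the Section 2 snippet; `combine` and `marginalize` are binary operators.

```python
import itertools

def brute_force(functions, domains, Z, combine, marginalize):
    """Compute (marginalize over X - Z of combine_i f_i) by full
    enumeration; exponential in n, intended only as a reference."""
    variables = sorted(domains)
    result = {}
    for full in itertools.product(*(domains[v] for v in variables)):
        asg = dict(zip(variables, full))
        val = None
        for f in functions:
            fv = f.table[tuple(asg[v] for v in f.scope)]
            val = fv if val is None else combine(val, fv)
        key = tuple(asg[v] for v in Z)
        result[key] = val if key not in result else marginalize(result[key], val)
    return result
```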

We next elaborate on the special cases of automated reasoning tasks defined over constraint networks and belief networks.

3.1 Constraint Networks

Constraint satisfaction is a framework for formulating real-world problems, such as scheduling, planning and boolean satisfiability, as a set of constraints between variables. For example, one approach to formulating a scheduling problem as a CSP is to create a variable for each resource and time slice. The values of a variable are the tasks that need to be scheduled: assigning a task to a particular variable (corresponding to a resource at some time slice) means that this resource starts executing the given task at the specified time. Various physical constraints (such as that a given job takes a certain amount of time to execute, or that a task can be executed at most once) can be modelled as constraints between variables. The constraint satisfaction task is to find an assignment of values to variables that does not violate any constraint, or else to conclude that the problem is inconsistent. Such problems are represented graphically by nodes corresponding to variables and edges corresponding to constraints between variables.

Definition 3.2 (Constraint Networks, Constraint Satisfaction Problems) A Constraint Network (CN) is defined by a triplet $(X, D, C)$, where $X = \{X_1, \ldots, X_n\}$ is a set of variables associated with a set of discrete-valued domains $D = \{D_1, \ldots, D_n\}$, and $C = \{C_1, \ldots, C_r\}$ is a set of constraints. Each constraint $C_i$ is a pair $(S_i, R_i)$, where $R_i \subseteq D_{S_i}$ is a relation defined on a subset of variables $S_i \subseteq X$ called the scope of $C_i$. The relation denotes all compatible tuples of $D_{S_i}$ allowed by the constraint. The primal graph of a constraint network is sometimes called a constraint graph. A solution is an assignment of values to the variables, $x = (x_1, \ldots, x_n)$, $x_i \in D_i$, such that each constraint is satisfied, namely $\forall C_i \in C$, $x_{S_i} \in R_i$. The Constraint Satisfaction Problem (CSP) is to determine whether a constraint network has a solution, and if so, to find one. A binary CSP is one where each constraint involves at most two variables, namely $|S_i| \leq 2$. Sometimes (for the Max-CSP problem) we express the relation $R_i$ as a cost function $C_i(X_{i_1} = x_{i_1}, \ldots, X_{i_k} = x_{i_k}) = 0$ if $(x_{i_1}, \ldots, x_{i_k}) \in R_i$, and 1 otherwise.

A constraint satisfaction problem is an automated reasoning task $P = \langle X, D, C, \Join, \pi, Z = \emptyset \rangle$, where $(X, D, C)$ is a constraint network, the combination operator is the join operator and the marginalization operator is the projection operator. Namely, the problem is to compute $\Downarrow_{\emptyset} \bigotimes_i f_i = \pi_{\emptyset}(\Join_i R_i)$.
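Using the cost-function encoding just mentioned (0 for allowed tuples, 1 otherwise), the brute-force evaluator from the previous snippet computes the minimum number of violated constraints when combination is summation and marginalization is minimization, anticipating the Max-CSP formulation of the next subsection. The helper below is our own illustration.

```python
import itertools

def cost_function(scope, relation, domains):
    """Max-CSP encoding of a constraint: 0 for allowed tuples, 1 otherwise."""
    table = {t: (0 if t in relation else 1)
             for t in itertools.product(*(domains[v] for v in scope))}
    return Function(scope, table)

# Three "not-equal" constraints forming a triangle over 2-valued domains
domains = {"X": [0, 1], "Y": [0, 1], "Z": [0, 1]}
neq = {(0, 1), (1, 0)}
F = [cost_function(s, neq, domains) for s in (("X","Y"), ("Y","Z"), ("X","Z"))]
best = brute_force(F, domains, Z=(), combine=lambda a, b: a + b,
                   marginalize=min)
print(best[()])  # 1: any 2-coloring of a triangle violates one constraint
```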

Many real-world problems are over-constrained and have no solution. In such cases it is desirable to find an assignment that satisfies a maximum number of constraints, called a Max-CSP assignment.

Definition 3.3 (Max-CSP) Given a constraint network, the Max-CSP task is the optimization version of constraint satisfaction: finding an assignment that satisfies a maximum number of constraints.

While, as its name suggests, Max-CSP is a maximization problem, it can equivalently be defined as a minimization problem: instead of maximizing the number of constraints that are satisfied, we minimize the number of constraints that are violated. Its set of functions $F$ is the set of cost functions assigning 0 to all allowed tuples and 1 to all non-allowed tuples. It can be formalized as a reasoning task $P = \langle X, D, F, \sum, \min, Z = \emptyset \rangle$, where $(X, D, F)$ is a constraint network, the combination operator is summation and the marginalization operator is minimization. Namely, the task is to find $\Downarrow_{\emptyset} \bigotimes_i f_i = \min_X \sum_i f_i$.

3.2 Belief Networks

Belief networks [13] provide a formalism for reasoning about partial beliefs under conditions of uncertainty. They are defined by a directed acyclic graph over nodes representing random variables of interest (e.g., the temperature of a device, the gender of a patient, a feature of an object, the occurrence of an event). The arcs signify the existence of direct causal influences between linked variables, quantified by conditional probabilities attached to each cluster of parents-child nodes in the network.

Definition 3.4 (Belief Networks) Given a set $X = \{X_1, \ldots, X_n\}$ of random variables over multi-valued domains $D = \{D_1, \ldots, D_n\}$, a belief network is a pair $(G, P)$ where $G$ is a directed acyclic graph over $X$ and $P = \{P_i\}$, where $P_i = P(X_i \mid pa(X_i))$ are the conditional probability matrices associated with each $X_i$. Given a subset of variables $S$, we write $P(s)$ for the probability $P(S = s)$, where $s \in D_S$. A belief network represents a probability distribution over $X$, $P(x_1, \ldots, x_n) = \prod_{i=1}^{n} P(x_i \mid x_{pa(X_i)})$. An evidence set $e$ is an instantiated subset of variables. The primal graph of a belief network is called a moral graph.

The moral graph can be obtained by connecting the parents of each node in $G$ and removing the arrows; equivalently, it connects any two variables appearing in the same family.

Definition 3.5 (Belief Updating) Given a belief network and evidence $e$, the belief updating task is to compute the posterior marginal probability of each variable $X_i$, conditioned on the evidence. Namely,
$$Bel(x_i) = \alpha \sum_{X - \{X_i\}} P(x_1, \ldots, x_n, e) = \alpha \sum_{X - \{X_i\}} \prod_{k=1}^{n} P(x_k, e \mid x_{pa_k})$$
where $\alpha$ is a normalization constant. When formulated as an automated reasoning task, the functions in $F$ denote conditional probability tables, and the scopes of these functions are determined by a directed acyclic graph (DAG): each function $f_i$ ranges over variable $X_i$ and its parents in the DAG. The combination operator is $\bigotimes_j = \prod_j$, the marginalization operator is $\Downarrow_Y = \sum_{X - Y}$, and $Z_i = \{X_i\}$. Namely, for each $x_i$, $\Downarrow_{x_i} \bigotimes_i f_i = \sum_{X - \{X_i\}} \prod_i f_i$.

Definition 3.6 (Most Probable Explanation) Given a belief network and evidence $e$, the Most Probable Explanation (MPE) task is to find a complete assignment which agrees with the available evidence and has the highest probability among all such assignments, namely, to find an assignment $(x_1^o, \ldots, x_n^o)$ such that
$$P(x_1^o, \ldots, x_n^o) = \max_{x_1, \ldots, x_n} P(x_1, \ldots, x_n, e) = \max_{x_1, \ldots, x_n} \prod_{k=1}^{n} P(x_k, e \mid x_{pa_k})$$
When MPE is formalized as an automated reasoning task, the combination operator is multiplication and the marginalization operator is maximization: the task is to find $\Downarrow_{\emptyset} \bigotimes_i f_i = \max_X \prod_i f_i$, where $X$ is the set of variables and the $f_i$ are the conditional probability tables. It also requires an optimizing assignment.
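A toy numeric check of Definitions 3.5 and 3.6 (ours, built on the hypothetical helpers above): for a two-variable network A → B, belief updating is sum-product and MPE is max-product.

```python
domains = {"A": [0, 1], "B": [0, 1]}
PA   = Function(("A",),     {(0,): 0.6, (1,): 0.4})
PBgA = Function(("A", "B"), {(0, 0): 0.9, (0, 1): 0.1,
                             (1, 0): 0.2, (1, 1): 0.8})

# Belief updating: sum-product with Z = {B}; alpha normalizes at the end.
bel = brute_force([PA, PBgA], domains, Z=("B",),
                  combine=lambda a, b: a * b, marginalize=lambda a, b: a + b)
alpha = 1.0 / sum(bel.values())
print({b: alpha * v for b, v in bel.items()})   # {(0,): 0.62, (1,): 0.38}

# MPE: same product combination, but marginalization is max over all of X.
mpe = brute_force([PA, PBgA], domains, Z=(),
                  combine=lambda a, b: a * b, marginalize=max)
print(mpe[()])   # 0.54 = P(A=0) * P(B=0 | A=0)
```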

4 Cluster-Tree Decomposition

Tree clustering schemes are popular both for constraint processing and for probabilistic reasoning. The most popular variants are the join-tree clustering algorithms, also called junction-trees. The schemes vary somewhat in their graph definitions as well as in the way tree-decompositions are processed [12, 7, 11, 9, 8, 15, 16]. They all involve a decomposition of a hypergraph into a hypertree. To allow a coherent discussion and extension of these methods, we introduce a unifying perspective: a (cluster-)tree-decomposition framework that borrows its notation from the recent hypertree decomposition proposal for constraint satisfaction presented in [8]. The exposition is declarative, separating the desired target output from its generative process.

Definition 4.1 Let $P = \langle X, D, F, \bigotimes, \Downarrow, \{Z_i\} \rangle$ be an automated reasoning problem. A tree-decomposition for $P$ is a triple $\langle T, \chi, \psi \rangle$, where $T = (V, E)$ is a tree and $\chi$ and $\psi$ are labelling functions which associate with each vertex $v \in V$ two sets, $\chi(v) \subseteq X$ and $\psi(v) \subseteq F$, that satisfy the following conditions:

1. For each function $f_i \in F$, there is exactly one vertex $v \in V$ such that $f_i \in \psi(v)$.
2. If $f_i \in \psi(v)$, then $scope(f_i) \subseteq \chi(v)$.
3. For each variable $x \in X$, the set $\{v \in V \mid x \in \chi(v)\}$ induces a connected subtree of $T$. This is also called the running intersection, or connectedness, property.
4. For each $Z_i$, $Z_i \subseteq \chi(v)$ for some $v \in V$.

When the combination operator is join, as in constraint satisfaction, condition 1 can be relaxed to require only that each function appear in at least one node, thus allowing multiple appearances of a function.

Definition 4.2 (tree-width, hyper-width, separator) The width (also called tree-width) of a tree-decomposition $\langle T, \chi, \psi \rangle$ is $\max_{v \in V} |\chi(v)|$, and its hyper-width is $\max_{v \in V} |\psi(v)|$. Given two adjacent vertices $u$ and $v$ of a tree-decomposition, the separator of $u$ and $v$ is defined as $sep(u, v) = \chi(u) \cap \chi(v)$.
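The four conditions of Definition 4.1 are mechanical to check. The verifier below is our own sketch: `tree` maps each vertex to the set of its neighbors, `chi` and `psi` give the labellings as sets, `functions` maps each function name to its scope, and `Zs` lists the sets of interest.

```python
def is_tree_decomposition(tree, chi, psi, functions, Zs):
    # Conditions 1 and 2: each function sits in exactly one vertex,
    # and that vertex's chi label covers the function's scope.
    for f, scope in functions.items():
        hosts = [v for v in tree if f in psi[v]]
        if len(hosts) != 1 or not set(scope) <= chi[hosts[0]]:
            return False
    # Condition 3 (connectedness): the vertices containing a variable x
    # must induce a connected subtree.
    for x in set().union(*chi.values()):
        vs = {v for v in tree if x in chi[v]}
        seen, stack = set(), [next(iter(vs))]
        while stack:
            v = stack.pop()
            if v not in seen:
                seen.add(v)
                stack.extend(w for w in tree[v] if w in vs and w not in seen)
        if seen != vs:
            return False
    # Condition 4: each Z_i is covered by some vertex.
    return all(any(set(Z) <= chi[v] for v in tree) for Z in Zs)
```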

Example 4.1 Consider a problem $P$ over variables $A, B, C, D, F, G$ with functions $F = \{f(A, B), f(A, C), f(B, C), f(B, F), f(C, F), f(A, B, D), f(F, G)\}$. Figure 2b gives its primal graph. Any of the trees in Figure 3 is a tree-decomposition for the problem, where the functions can be partitioned into clusters that contain their scopes. For example, Figure 3 (right) shows a cluster-tree with two nodes, labelled $\chi(1) = \{G, F\}$ and $\chi(2) = \{A, B, C, D, F\}$. Any function having $G$ as an argument must be placed in node 1, any function having $A$, $B$, $C$ or $D$ as an argument must be placed in node 2, while any function over $F$ alone can be placed in either node.

Algorithm cluster-tree elimination (CTE)
Input: A tree-decomposition $\langle T, \chi, \psi \rangle$ for a problem $P = \langle X, D, F, \bigotimes, \Downarrow, \{Z_1, \ldots, Z_t\} \rangle$.
Output: An augmented tree whose nodes are clusters containing the original functions as well as the messages received from neighbors; a solution computed from the augmented clusters.
1. Compute messages: For every edge $(u, v)$ in the cluster tree, do:
If vertex $u$ has received messages from all adjacent vertices other than $v$, then compute $m_{(u,v)}$, the message that vertex $u$ sends to vertex $v$:
$$m_{(u,v)} = \Downarrow_{sep(u,v)} \bigotimes_{f \in cluster(u),\, f \neq m_{(v,u)}} f$$
where $cluster(u) = \psi(u) \cup \{m_{(w,u)} \mid (w, u) \in E\}$.
Note: functions that do not contain elimination variables need not be processed, and can instead be passed directly to the receiving node.
2. Compute solution: For every $v \in T$ and every $Z_i \subseteq \chi(v)$, compute $\Downarrow_{Z_i} \bigotimes_{f \in cluster(v)} f$.

Figure 1: Algorithm Cluster-Tree Elimination (CTE)
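The following executable sketch of CTE is our own distillation under the assumptions of the earlier snippets (tabular `Function`s and binary `comb`/`marg` operators); it fires any edge whose source has already heard from all of its other neighbors, sending exactly the $2m$ messages of Figure 1.

```python
import functools
import itertools

def combine(fs, domains, op):
    """Combination: a function over the union of scopes whose value is
    the op-fold of the component functions on the projected tuples."""
    scope = tuple(dict.fromkeys(v for f in fs for v in f.scope))
    table = {}
    for t in itertools.product(*(domains[v] for v in scope)):
        asg = dict(zip(scope, t))
        table[t] = functools.reduce(
            op, (f.table[tuple(asg[v] for v in f.scope)] for f in fs))
    return Function(scope, table)

def marginalize_onto(f, keep, domains, op):
    """Eliminate every variable of f not in `keep`, one at a time."""
    for x in [v for v in f.scope if v not in keep]:
        f = eliminate(f, x, domains, lambda vals: functools.reduce(op, vals))
    return f

def cte(tree, chi, psi, domains, comb, marg):
    """One full round of CTE; assumes every cluster holds a function."""
    messages, pending = {}, [(u, v) for u in tree for v in tree[u]]
    while pending:
        for (u, v) in list(pending):
            others = [w for w in tree[u] if w != v]
            if all((w, u) in messages for w in others):
                cluster = list(psi[u]) + [messages[(w, u)] for w in others]
                sep = chi[u] & chi[v]
                messages[(u, v)] = marginalize_onto(
                    combine(cluster, domains, comb), sep, domains, marg)
                pending.remove((u, v))
    return messages
```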

[Figure: two panels, (a) and (b); nodes A (Season), B (Automated sprinkler), C (Rain), D (Manual watering), F (Wet), G (Slippery).]

Figure 2: (a) Belief network $P(g, f, d, c, b, a) = P(g \mid f) P(f \mid c, b) P(d \mid b, a) P(b \mid a) P(c \mid a) P(a)$, and (b) its moral graph.

Figure 3: From a bucket-tree (left) to a join-tree (middle) to a super-bucket-tree (right).

A tree-decomposition facilitates the solution of an automated reasoning task. Algorithm cluster-tree elimination (CTE) for processing a tree-decomposition is given in Figure 1. It works by having each vertex of the tree send a function to each of its neighbors; if the tree contains $m$ edges, a total of $2m$ messages are sent.

To compute the message for a neighbor $v$, node $u$ takes all the functions in $u$ and all the messages received by $u$ from its adjacent nodes other than $v$, combines them using the combination operator, and projects the combined function onto the separator of $u$ and $v$ using the marginalization operator. The projected function is then sent to $v$. Node activation can be asynchronous and convergence is guaranteed. If processing is performed from the leaves to the root and back, convergence is guaranteed after two passes, where only one message is sent on each edge in each direction. Once all nodes have received messages from all their neighbors, a solution to the problem can be generated from the output augmented tree (as described in the algorithm) in time linear in the output. For some tasks the whole output tree is used to compute the solution (e.g., computing an optimal tuple).

The CTE algorithm presented in Figure 1 can be further optimized. In general, when the message that a node $u$ sends to a node $v$ is computed, there are three kinds of variables: separator variables, variables that will be eliminated by marginalization, and instantiated (sometimes called observed, or evidence) variables. If a function in node $u$ does not contain any elimination variables, it does not need to be combined with the other functions; instead, it can be sent directly to node $v$.

4.1 Properties and Complexity

Theorem 4.2 (Correctness and Completeness) Assume that the combination operator $\bigotimes$ and the marginalization operator $\Downarrow_Y$ satisfy the following axioms (first formulated in [17, 16]):

1. Order of marginalization does not matter: $\Downarrow_{X - \{y\}} (\Downarrow_{X - \{z\}} f(X)) = \Downarrow_{X - \{z\}} (\Downarrow_{X - \{y\}} f(X))$
2. Commutativity: $f \bigotimes g = g \bigotimes f$
3. Associativity: $f \bigotimes (g \bigotimes h) = (f \bigotimes g) \bigotimes h$
4. Restricted distributivity: $\Downarrow_{X - \{z\}} [f(X - \{z\}) \bigotimes g(X)] = f(X - \{z\}) \bigotimes \Downarrow_{X - \{z\}} g(X)$

Then algorithm CTE is sound and complete.

A proof of this theorem follows from the work of Shenoy [17, 16]. However, for completeness and clarity, we provide an alternative proof.

Proof. By definition, solving an automated reasoning problem $P$ requires computing a function $F(Z_i) = \Downarrow_{Z_i} \bigotimes_{i=1}^{r} f_i$ for each $Z_i$. Using the four properties of the combination and marginalization operators, the claim can be proved by induction on the depth of the tree, as follows. Let $\langle T, \chi, \psi \rangle$ be a cluster-tree decomposition for $P$. By definition, there must be a node $v \in T$ such that $Z_i \subseteq \chi(v)$. We create a partial order on the nodes of $T$ by making $v$ the root of $T$. Let $T_u = (N_u, E_u)$ be the subtree of $T$ rooted at node $u$. We define $\chi(T_u) = \cup_{w \in N_u} \chi(w)$ and $\chi(T - T_u) = \cup_{w \in N - N_u} \chi(w)$.

We first rearrange the order in which functions are combined when $F(Z_i)$ is computed. Let $d(j)$, $j = 1, \ldots, |N|$, be an ordering of the nodes of the rooted tree $T$ such that a node appears before any of its children; the first node in the ordering is the root. Let $F_u = \bigotimes_{f \in \psi(u)} f$. We define
$$F'(Z_i) = \Downarrow_{Z_i} \bigotimes_{j=1}^{|N|} F_{d(j)}$$
Because of associativity and commutativity, we have $F'(Z_i) = F(Z_i)$.

We define $e(u) = \chi(u) - sep(u, w)$, where $w$ is the parent of $u$ in the rooted tree $T$; for the root node $v$, $e(v) = X - Z_i$. In other words, $e(u)$ is the set of variables that are eliminated when we go from $u$ to $w$. We define $e(T_u) = \cup_{w \in N_u} e(w)$; that is, $e(T_u)$ is the set of variables eliminated in the subtree rooted at $u$. Because of the connectedness property, it must be that $e(T_u) \cap \chi(T - T_u) = \emptyset$; in other words, the variables in $e(T_u)$ appear only in the subtree rooted at $u$.

Next, we rearrange the order in which marginalization is applied in $F'(Z_i)$. If $x \notin Z_i$ and $x \in e(d(k))$ for some $k$, then the marginalization eliminating $x$ can be applied to $\bigotimes_{j=k}^{|N|} F_{d(j)}$ instead of to $\bigotimes_{j=1}^{|N|} F_{d(j)}$. This is safe to do because, as shown above, if a variable $x$ belongs to $e(d(k))$, then it cannot be part of any $F_{d(j)}$, $j < k$. Let $ch(u)$ be the set of children of $u$ in the rooted tree $T$. If $ch(u) = \emptyset$ (node $u$ is a leaf node), we define $F'_u = \Downarrow_{X - e(u)} F_u$; otherwise we define $F'_u = \Downarrow_{X - e(u)} (F_u \bigotimes (\bigotimes_{w \in ch(u)} F'_w))$. If $v$ is the root of $T$, we define $F''(Z_i) = F'_v$.

Because of properties 1 and 4, we have $F''(Z_i) = F'(Z_i)$. However, $F''(Z_i)$ is exactly what the cluster-tree algorithm computes: the message that each node $u$ sends to its parent is $F'_u$. This concludes the proof.

Theorem 4.3 (Complexity of CTE) Let $N$ be the number of nodes in the tree-decomposition, $w$ its tree-width, $sep$ its maximum separator size, $r$ the number of input functions in $F$, and $deg$ the maximum degree in $T$. The time complexity of CTE is $O((r + N) \cdot deg \cdot \exp(w))$ and its space complexity is $O(N \cdot \exp(sep))$.

Proof. The time complexity of processing a node $u$ is $deg_u \cdot (|\psi(u)| + deg_u - 1) \cdot \exp(|\chi(u)|)$, where $deg_u$ is the degree of $u$, because node $u$ has to send out $deg_u$ messages, each being a combination of $(|\psi(u)| + deg_u - 1)$ functions and requiring the enumeration of $\exp(|\chi(u)|)$ combinations of values. The time complexity of CTE is therefore
$$\sum_u deg_u \cdot (|\psi(u)| + deg_u - 1) \cdot \exp(|\chi(u)|)$$
Let $deg$ be the maximum degree of a node in $T$. By bounding the first occurrence of $deg_u$ by $deg$ and $|\chi(u)|$ by the tree-width $w$, we get
$$deg \cdot \exp(w) \cdot \sum_u (|\psi(u)| + deg_u - 1)$$
Since $\sum_u |\psi(u)| = r$ and $\sum_u (deg_u - 1) = N - 2$, we can write this as $deg \cdot \exp(w) \cdot (r + N - 2) = O((r + N) \cdot deg \cdot \exp(w))$.

For each edge, CTE records two functions. Since the number of edges is bounded by $N$ and the size of each recorded function is bounded by $\exp(sep)$, the space complexity is $O(N \cdot \exp(sep))$.

If the cluster-tree is minimal (for any $u$ and $v$, $sep(u, v) \subsetneq \chi(u)$ and $sep(u, v) \subsetneq \chi(v)$), then the number of nodes $N$ is bounded by $n$. Assuming $r \geq n$, the time complexity of a minimal CTE is $O(deg \cdot r \cdot \exp(w))$.

4.2 Trading Space for Time

Algorithm CTE as presented in Figure 1 is inefficient in that many computations are repeated when a node is processed. By precomputing intermediate functions, we can reduce the time complexity of the algorithm. When node $u$ is processed, it contains two kinds of functions: original functions (there are $|\psi(u)|$ of these) and messages that $u$ received from its neighbors (there are $deg_u$ of these, one per neighbor). When node $u$ computes a message to be sent to an adjacent node $v$, it combines all the original functions in $\psi(u)$ with the $deg_u - 1$ messages received from its neighbors other than $v$, and marginalizes onto the separator between $u$ and $v$. Indexing the neighbors of $u$ from 1 to $deg_u$, we can define a set of intermediate functions:

1. Let $f_u = \bigotimes_{f \in \psi(u)} f$.
2. Let $m_{(i,j)} = \bigotimes_{k=i}^{j} m_{(k,u)}$, where $m_{(k,u)}$ is the message received from the $k$-th neighbor.

The message that $u$ sends to its $v$-th neighbor can then be defined as
$$m_{(u,v)} = \Downarrow_{sep(u,v)} (f_u \bigotimes m_{(1,v-1)} \bigotimes m_{(v+1,deg_u)})$$

In Figure 4 we present a new version of the CTE algorithm, called ICTE, that precomputes these intermediate functions for each node (see also the sketch below). The following theorem shows that ICTE is faster than CTE by a factor of $deg$. However, because ICTE needs to store the intermediate functions, its space complexity is exponential in the tree-width, and not in the separator size as is the case for CTE.
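The intermediate functions $m_{(1,j)}$ and $m_{(j,deg_u)}$ are exactly prefix and suffix combinations of the incoming messages: with them, every "all neighbors but one" combination costs $O(1)$ extra combinations instead of $O(deg_u)$. A scalar sketch of the idea (ours, with ordinary multiplication standing in for $\bigotimes$):

```python
def all_but_one(msgs):
    """For each position, the product of all other entries, via prefix
    and suffix tables: O(deg) multiplications instead of O(deg^2)."""
    n = len(msgs)
    prefix = [1] * (n + 1)          # prefix[j] = msgs[0] * ... * msgs[j-1]
    suffix = [1] * (n + 1)          # suffix[j] = msgs[j] * ... * msgs[n-1]
    for j in range(n):
        prefix[j + 1] = prefix[j] * msgs[j]
    for j in range(n - 1, -1, -1):
        suffix[j] = suffix[j + 1] * msgs[j]
    return [prefix[j] * suffix[j + 1] for j in range(n)]

print(all_but_one([2, 3, 5]))  # [15, 10, 6]
```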

Algorithm improved cluster-tree elimination (ICTE)
Input: A tree-decomposition $\langle T, \chi, \psi \rangle$ for a problem $P = \langle X, D, F, \bigotimes, \Downarrow, \{Z_1, \ldots, Z_t\} \rangle$.
Output: An augmented tree whose nodes are clusters containing the original functions as well as the messages received from neighbors; a solution computed from the augmented clusters.
1. Compute messages: For every edge $(u, v)$ in the cluster tree, do:
If vertex $u$ has received messages from all adjacent vertices other than $v$, then:
Compute $f_u = \bigotimes_{f \in \psi(u)} f$, if not yet computed.
For all $j$, $1 < j < deg_u$, compute $m_{(1,j)} = \bigotimes_{k=1}^{j} m_{(k,u)}$ and $m_{(j,deg_u)} = \bigotimes_{k=j}^{deg_u} m_{(k,u)}$, if not yet computed.
Compute $m_{(u,v)}$, the message that vertex $u$ sends to vertex $v$:
$$m_{(u,v)} = \Downarrow_{sep(u,v)} (f_u \bigotimes m_{(1,v-1)} \bigotimes m_{(v+1,deg_u)})$$
2. Compute solution: For every $v \in T$ and every $Z_i \subseteq \chi(v)$, compute $\Downarrow_{Z_i} \bigotimes_{f \in cluster(v)} f$.

Figure 4: Algorithm improved cluster-tree elimination (ICTE)

Theorem 4.4 (Complexity of ICTE) Let $N$ be the number of nodes in the tree-decomposition, $w$ its tree-width and $r$ the number of input functions in $F$. The time complexity of ICTE is $O((r + N) \cdot \exp(w))$ and its space complexity is $O(N \cdot \exp(w))$.

Proof: For each node $u$, ICTE first computes the intermediate functions $f_u$, $m_{(1,j)}$ and $m_{(j,deg_u)}$, and then the messages $m_{(u,v)}$ for each adjacent node $v$. Computing the intermediate functions takes time $O((|\psi(u)| + 2 deg_u) \cdot \exp(w))$. Once the intermediate functions are computed, the messages to all neighbors can be computed in time $O(3 deg_u \cdot \exp(w))$ ($deg_u$ neighbors and $O(3 \exp(w))$ per neighbor). Therefore the time complexity of processing node $u$ is $O((|\psi(u)| + 5 deg_u) \cdot \exp(w))$, and the time complexity of ICTE is
$$\sum_u O((|\psi(u)| + 5 deg_u) \cdot \exp(w))$$
Since $\sum_u |\psi(u)| = r$ and $\sum_u deg_u = 2(N - 1)$, the time complexity of ICTE is $O((r + N) \cdot \exp(w))$.

For each node $u$, we need to store $O(2 deg_u)$ intermediate functions of size $\exp(w)$. Summing over all nodes, the space complexity of storing all the intermediate functions is $O(N \cdot \exp(w))$. Also, for each edge, ICTE has to store two messages of size $\exp(sep)$. Since the total number of edges is $N - 1$, the space complexity of storing the messages is $O(N \cdot \exp(sep))$. However, since $sep \leq w$, the total space complexity of ICTE is $O(N \cdot \exp(w))$.

Remark: Shenoy [16] introduces binary join-trees to organize the computation more efficiently. For any cluster-tree, there exists a binary cluster-tree such that CTE has the same time and space complexity on the binary tree as ICTE has on the original tree. Thus, our ICTE algorithm can be viewed as a reformulation and rederivation of Shenoy's result without the actual construction of the binary tree. Our derivation also pinpoints the associated time-space complexity tradeoff.

5 Bucket-Tree Elimination

This section extends the bucket elimination scheme into a message-passing algorithm along a bucket-tree, and shows that the extended algorithm is an instance of the cluster-tree decomposition scheme.

5.1 Bucket Elimination

Bucket elimination (BE) is a unifying algorithmic framework for dynamic-programming algorithms applicable to probabilistic and deterministic reasoning [1, 5]. The input to a BE algorithm consists of a collection of functions or relations (e.g., clauses for propositional satisfiability, constraints, or conditional probability matrices for belief networks). Given a variable ordering, the algorithm partitions the functions into buckets, each associated with a single variable.

A function is placed in the bucket of its latest argument in the ordering. The algorithm processes each bucket, top-down, from the last variable to the first, by a variable elimination procedure that computes a new function using the combination and marginalization operators. The new function is placed in the closest lower bucket whose variable appears in the function's scope. When the solution of the problem requires a complete assignment (e.g., solving the Most Probable Explanation problem in Bayesian networks), a second, bottom-up phase assigns a value to each variable along the ordering, consulting the functions created during the top-down phase. For more information see [5]. For completeness we present the BE algorithm for general reasoning tasks in Figure 5 [5]. It is well known that the complexity of BE is exponential in the induced width of the problem's graph along the processed ordering; we provide a formal statement in Section 5.3.

Algorithm BE
Input: A problem description $P = \langle X, D, F, \bigotimes, \Downarrow, \emptyset \rangle$; an ordering $d$ of the variables.
Output: An assignment corresponding to an optimal solution.
1. Initialize: Partition the functions in $F$ into $bucket_1, \ldots, bucket_n$, where $bucket_i$ contains all functions whose highest variable is $X_i$.
2. Backward: For $p = n$ down to 1, do for the functions $h_1, h_2, \ldots, h_j$ in $bucket_p$ (original and intermediate), with scopes $S_1, \ldots, S_j$:
If variable $X_p$ is instantiated ($X_p = x_p$), assign $X_p = x_p$ in each $h_i$ and put each resulting function into its appropriate bucket.
Else, generate the function $h^p = \Downarrow_{U_p} \bigotimes_{i=1}^{j} h_i$, where $U_p = \cup_{i=1}^{j} S_i - \{X_p\}$, and add $h^p$ to the bucket of the largest-index variable in $U_p$.
3. Forward: Assign a value to each variable along the ordering $d$ such that the combination of the functions in each bucket is optimized.
4. Return the function computed in the bucket of the first variable, and the optimizing assignment.

Figure 5: Bucket Elimination Algorithm
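A runnable distillation of BE's backward phase (ours, reusing the hypothetical tabular-function helpers from the earlier snippets); it returns the value computed in the bucket of the first variable and omits the forward phase for brevity.

```python
def bucket_elimination(functions, domains, order, comb, marg):
    """Backward phase of BE: returns the value left in the bucket of
    the first variable (e.g. the MPE value for comb = *, marg = max)."""
    pos = {v: i for i, v in enumerate(order)}
    buckets = {v: [] for v in order}
    def place(f):                   # bucket of f's latest variable in d
        buckets[max(f.scope, key=pos.get)].append(f)
    for f in functions:
        place(f)
    result = None
    for xp in reversed(order):
        if not buckets[xp]:
            continue
        h = combine(buckets[xp], domains, comb)            # join the bucket
        h = marginalize_onto(h, set(h.scope) - {xp}, domains, marg)
        if h.scope:                 # send down to the closest lower bucket
            place(h)
        else:                       # constant reaching the first bucket
            result = h.table[()] if result is None else comb(result, h.table[()])
    return result

# MPE value of the A -> B toy network from the Section 3.2 snippet:
print(bucket_elimination([PA, PBgA], domains, ["A", "B"],
                         lambda a, b: a * b, max))  # 0.54
```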

5.2 Bucket-Tree Elimination

Definition 5.1 (singleton-optimality tasks) An automated reasoning problem $P = \langle X, D, F, \bigotimes, \Downarrow, \{Z_1, \ldots, Z_t\} \rangle$ is a singleton-optimality problem if $t = n$ and for all $i$, $Z_i = \{X_i\}$. In this case we write $Opt(X_i) = \Downarrow_{X_i} \bigotimes_{j=1}^{r} f_j$.

Some tasks, such as singleton-optimality tasks, require repeated execution of the BE algorithm; for example, when a belief for every variable in a Bayesian network is required. Another example is computing the optimal cost associated with each value of each variable, which is used to guide search algorithms [6]. In order to compute the belief of every variable, BE would have to be run $n$ times, each run initiated at a different variable. We next propose a more efficient alternative, extending bucket elimination into a bucket-tree elimination scheme, called BTE. The essence of this extension appeared earlier in the work of Shenoy [16]. Our idea is based on a recent result in the context of belief updating: it is known that BE can be viewed as message-passing from the leaves to the root along a bucket-tree [5]. A generalized elimination scheme was also developed by Cozman [3] in the context of probabilistic inference, where a second pass along the bucket-tree updates every bucket in the tree. Here this scheme is derived and analyzed in a more general setting: we present the idea for any automated reasoning task, show that BTE is an instance of tree-decomposition, and derive its correctness and complexity from this relationship.

Definition 5.2 (buckets) Let $P = \langle X, D, F, \bigotimes, \Downarrow, \{Z_i\} \rangle$ be an automated reasoning problem and $d = (X_1, \ldots, X_n)$ an ordering of its variables. Let $B_{X_1}, \ldots, B_{X_n}$ be a set of buckets, one for each variable. Each bucket $B_{X_i}$ contains those functions in $F$ whose latest variable in $d$ is $X_i$.

Definition 5.3 (bucket-tree) The bucket-tree of $P$ along an ordering $d$ has the buckets as its nodes; bucket $B_X$ is connected to bucket $B_Y$ (and points to $B_Y$) if the function generated in bucket $B_X$ by BE is placed in $B_Y$. The variables of $B_{X_i}$ are those appearing in the scopes of any of its new and old functions. In a bucket-tree, every node $B_X$ has one parent node $B_Y$ and possibly several child nodes $B_{Z_1}, \ldots, B_{Z_t}$. The structure of the bucket-tree can also be extracted from the induced ordered graph of $P$ along $d$, using the following equivalent definition.

Definition 5.4 (bucket-tree, graph-based) Let $G^*_d$ be the induced graph along $d$ of a reasoning problem $P$ whose primal graph is $G$. The nodes of the bucket-tree are the $n$ buckets. Node $B_X$ points to $B_Y$ (i.e., $B_Y$ is the parent of $B_X$) if $Y$ is the latest earlier neighbor of $X$ in $G^*_d$. The variables of bucket $B_X$ are $X$ and its earlier neighbors in the induced graph. If $B_Y$ is the parent of $B_X$ in the bucket-tree, then the separator of $X$ and $Y$ is the set of variables appearing in $B_X \cap B_Y$.

Example 5.1 Consider the Bayesian network defined over the DAG in Figure 2a. Figure 7a shows the initial buckets along the ordering $d = A, B, C, D, F, G$, and the $\lambda$ messages that will be passed by BE from top to bottom. Figure 7b displays the same computation as message-passing along its bucket-tree.

Theorem 5.2 A bucket-tree of a reasoning problem $P$ is a tree-decomposition of $P$.

Proof: We need to provide the mappings $\chi$ and $\psi$ and show that the tree-decomposition properties hold for a bucket-tree. The mappings $\chi$ and $\psi$ follow from the bucket-tree construction. The properties other than connectedness are straightforward, and connectedness follows from the graph-based bucket-tree construction.

Since the bucket-tree is a tree-decomposition, algorithm CTE is applicable. Indeed, as we will show, the correctness of the extension of BE to BTE, which adds a second, bottom-up message-passing phase, can be established by showing equivalence with CTE applied to the same bucket-tree. We describe the algorithm using two types of messages, $\lambda$s and $\pi$s, as is common in message propagation schemes. Algorithm bucket-tree elimination (BTE) is given in Figure 6. In the top-down phase, each bucket receives $\lambda$ messages from its children and sends a $\lambda$ message to its parent; this portion is equivalent to BE. In the bottom-up phase, each bucket receives a $\pi$ message from its parent and sends $\pi$ messages to each child.

Example 5.3 Figure 8 shows the complete execution of BTE along the linear order of buckets and along the bucket-tree. The $\pi$ and $\lambda$ messages are viewed as messages placed on the outgoing arcs.
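Definition 5.4 yields a direct construction: compute the induced graph and let each bucket point to the bucket of its latest earlier neighbor. A sketch of ours, mirroring the earlier `induced_width` snippet and reproducing the bucket-tree of Figure 7:

```python
def bucket_tree(edges, order):
    """Return parent[x] = latest earlier neighbor of x in the induced
    graph, i.e. the bucket-tree structure of Definition 5.4."""
    pos = {v: i for i, v in enumerate(order)}
    adj = {v: set() for v in order}
    for a, b in edges:
        adj[a].add(b)
        adj[b].add(a)
    parent = {}
    for v in reversed(order):
        earlier = [u for u in adj[v] if pos[u] < pos[v]]
        if earlier:
            parent[v] = max(earlier, key=pos.get)
        for u in earlier:           # fill-in: connect earlier neighbors
            for w in earlier:
                if u != w:
                    adj[u].add(w)
    return parent

# Moral graph of Figure 2 along d = A,B,C,D,F,G:
edges = [("A","B"),("A","C"),("A","D"),("B","C"),("B","D"),
         ("B","F"),("C","F"),("F","G")]
print(bucket_tree(edges, ["A","B","C","D","F","G"]))
# {'G': 'F', 'F': 'C', 'D': 'B', 'C': 'B', 'B': 'A'}
```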

Algorithm bucket-tree elimination (BTE)
Input: A problem $P = \langle X, D, F, \bigotimes, \Downarrow, \{\{X_1\}, \ldots, \{X_n\}\} \rangle$, ordering $d$.
Output: Augmented buckets containing the original functions and all the $\pi$ and $\lambda$ functions received from neighbors in the bucket-tree. A solution to $P$ computed from the augmented buckets.
0. Pre-processing: Place each function in the latest bucket, along $d$, that mentions a variable in its scope. Connect two buckets $B_x$ and $B_y$ if variable $y$ is the latest earlier neighbor of $x$ in the induced graph $G^*_d$.
1. Top-down phase: $\lambda$ messages (BE). For $i = n$ to 1, process bucket $B_{x_i}$: let $\lambda_1, \ldots, \lambda_j$ be all the functions in $B_{x_i}$ at the time $B_{x_i}$ is processed, including the original functions of $P$. The message $\lambda_{x_i}^{y}$ sent from $x_i$ to its parent $y$ is computed by
$$\lambda_{x_i}^{y}(sep(x_i, y)) = \Downarrow_{sep(x_i, y)} \bigotimes_{l=1}^{j} \lambda_l$$
where $sep(x_i, y)$ is the separator of $x_i$ and $y$.
2. Bottom-up phase: $\pi$ messages. For $i = 1$ to $n$, process bucket $B_{x_i}$: let $\lambda_1, \ldots, \lambda_j$ be all the functions in $B_{x_i}$ at the time $B_{x_i}$ is processed, including the original functions of $P$. $B_{x_i}$ takes the $\pi$ message $\pi_{y}^{x_i}$ received from its parent $y$ and computes a message $\pi_{x_i}^{z_j}$ for each child bucket $z_j$ by
$$\pi_{x_i}^{z_j}(sep(x_i, z_j)) = \Downarrow_{sep(x_i, z_j)} \Big( \pi_{y}^{x_i} \bigotimes \big( \bigotimes_{\lambda_l \neq \lambda_{z_j}^{x_i}} \lambda_l \big) \Big)$$
where the combination ranges over all functions in the bucket except the $\lambda$ message received from $z_j$ itself.
3. Compute optimal solution cost: In each augmented bucket compute $Opt(X_i) = \Downarrow_{x_i} \bigotimes_{f \in bucket_i} f$.

Figure 6: Algorithm Bucket-Tree Elimination (BTE)

[Figure 7: Execution of BE along the bucket-tree. (a) The buckets along $d = A, B, C, D, F, G$ (bucket G: $P(G \mid F)$; bucket F: $P(F \mid B, C)$; bucket D: $P(D \mid A, B)$; bucket C: $P(C \mid A)$; bucket B: $P(B \mid A)$; bucket A: $P(A)$), with the $\lambda$ messages $\lambda_G^F(F)$, $\lambda_F^C(B, C)$, $\lambda_D^B(A, B)$, $\lambda_C^B(A, B)$, $\lambda_B^A(A)$ passed from top to bottom. (b) The same computation as message-passing along the bucket-tree.]

Theorem 5.4 Algorithm BTE is correct and complete.

Proof: Since a bucket-tree is a tree-decomposition, and since it can be shown that CTE applied to a bucket-tree is equivalent to BTE, the correctness and completeness of BTE follow from the correctness and completeness of CTE.

5.3 Complexity

Clearly, the induced width $w^*$ along $d$ corresponds directly to the tree-width $w$ of the bucket-tree viewed as a tree-decomposition ($w = w^* + 1$). We next provide a refined complexity analysis of BE, followed by complexity analyses of BTE and IBTE.

Theorem 5.5 (Complexity of BE) Let $w^*$ be the induced width of $G$ along ordering $d$. The time complexity of BE is $O(r \cdot \exp(w^* + 1))$ and its space complexity is $O(n \cdot \exp(w^*))$.

Proof. During BE, each bucket sends a $\lambda$ message to its parent; since it computes a function defined on all the variables in the bucket other than the eliminated one, and the number of variables in a bucket is bounded by $w^* + 1$, the computed function has a domain of size exponential in $w^*$. The number of functions that need to be consulted for each tuple of the generated function is bounded by the number of original functions in the bucket, $r_{x_i}$, plus the number of messages received from its children, which is bounded by $deg_i$. The overall computation, summing over all buckets, is therefore bounded by
$$\sum_{x_i} (r_{x_i} + deg_i - 1) \cdot \exp(w^* + 1)$$
The total complexity can thus be bounded by $O((r + n) \cdot \exp(w^* + 1))$; assuming $r > n$, this becomes $O(r \cdot \exp(w^* + 1))$. The size of each $\lambda$ message is $O(\exp(w^*))$, and since the total number of $\lambda$ messages is $n - 1$, the total space complexity is $O(n \cdot \exp(w^*))$.

[Figure 8: Propagation of $\pi$s and $\lambda$s along the bucket-tree, showing for each bucket of Figure 7 both the top-down $\lambda$ messages and the bottom-up $\pi$ messages $\pi_A^B(A)$, $\pi_B^C(A, B)$, $\pi_B^D(A, B)$, $\pi_C^F(B, C)$, $\pi_F^G(F)$.]

Theorem 5.6 (Complexity of BTE) Let $w^*$ be the induced width of $G$ along ordering $d$. The time complexity of BTE is $O(r \cdot deg \cdot \exp(w^* + 1))$, where $deg$ is the maximum degree in the bucket-tree. The space complexity of BTE is $O(n \cdot \exp(w^*))$.

Proof: Since the number of buckets is $n$, and the induced width $w^*$ equals $w - 1$, where $w$ is the tree-width, from the analysis of CTE we can derive that the time complexity of BTE is $O((r + n) \cdot deg \cdot \exp(w^* + 1))$. Assuming $r > n$, we get the desired time bound. Since the size of each message is $\exp(sep)$, and since here $sep = w^*$, we get a space complexity of $O(n \cdot \exp(w^*))$.

We can apply the same idea of precomputing intermediate functions described in Section 4 to BTE, resulting in a new algorithm, IBTE. In this case we gain speed with only a minor compromise in space complexity.

Theorem 5.7 (Complexity of IBTE) Let $w^*$ be the induced width of $G$ along ordering $d$. The time complexity of IBTE is $O(r \cdot \exp(w^* + 1))$ and its space complexity is $O(n \cdot \exp(w^* + 1))$.

Proof: Follows from the complexity of ICTE and BTE.

Next we compare the complexity of BTE and IBTE against running BE $n$ times (n-BE). While BTE and n-BE have the same space complexity, the space needs of IBTE are larger by a factor of $k$, where $k$ is the domain size of a variable. In theory, the speedup expected from running BTE versus running n-BE is at most $n$. This may seem insignificant compared with the exponential complexity in $w^*$; in practice, however, it can be very significant, in particular when these computations are used as a procedure within more extensive search algorithms [10].

The actual speedup of BTE relative to n-BE may be smaller than $n$, however. We know that the complexity of n-BE is $O(n \cdot r \cdot \exp(w^* + 1))$, whereas the complexity of BTE is $O(deg \cdot r \cdot \exp(w^* + 1))$. These two bounds cannot be compared directly because we do not know how tight the n-BE bound is. However, there are classes of problems (e.g., $k$-trees) for which the complexity of n-BE is $\Theta(n \cdot r \cdot \exp(w^* + 1))$, and for which the maximum degree of a node in the bucket-tree is bounded by $w^*$. For these classes the speedup of BTE over n-BE is $\Omega(n / deg)$, hence also $\Omega(n / w^*)$. Similar considerations apply when comparing IBTE with n-BE: while the worst-case time complexity of IBTE is smaller than that of n-BE by a factor of $n$, we do not know how tight these bounds are, so the speedup of IBTE over n-BE may be less than $n$. Clearly, the speedup of IBTE over n-BE is never worse than the speedup of BTE over n-BE.

6 Comparing Tree-Decomposition Methods

Here we discuss the relationship between several known decomposition methods.

6.1 Join-Tree Clustering

In both the constraint satisfaction and Bayesian network communities, the most common clustering methods, called join-tree clustering [7, 11], are based on a triangulation algorithm which transforms the primal graph $G = (V, E)$ of a problem instance $P$ into a chordal graph $G'$. A graph is chordal if every cycle of length 4 or more has a chord. To transform a primal graph $G$ into a chordal graph $G'$, the triangulation algorithm processes $G$ along the reverse of an ordering $d$ and connects any two non-adjacent nodes that are connected through a node later in the ordering. A join-tree clustering is defined as a tree $T = (V, E)$, where $V$ is the set of maximal cliques of $G'$ and $E$ is a set of edges that form a tree between the cliques, satisfying the connectedness property [12]. The width of a join-tree clustering is the cardinality of its maximal clique, which coincides with the induced width (plus 1) along the order of triangulation. Subsequently, every function is placed in one clique containing its scope. It is easy to see that a join-tree satisfies the properties of tree-decomposition.

Proposition 1 Every join-tree clustering is a tree-decomposition.

Join-trees correspond to minimal tree-decompositions, where separators are always strict subsets of their adjacent clusters, thus excluding some decompositions that can be useful (see [8]). Moreover, they are cluster-minimal: no node and its variables can be partitioned further to yield a more refined tree-decomposition.

Example 6.1 Consider a problem having functions defined on all pairs of variables, so that its primal graph is complete. Clearly, the only possible join-tree has one node containing all the variables and all the functions. An alternative tree-decomposition has a node $C_1$ whose variables are $\{1, \ldots, n\}$ and whose functions are those defined over the pairs $\{(1, 2), (3, 4), \ldots, (i, i+1), (i+2, i+3), \ldots\}$; there is then a node $C_{i,j}$ for each other function not contained in $C_1$, and the tree connects $C_1$ with each other node. While this is a legitimate tree-decomposition, it is not a legitimate join-tree. This example is an instance of a hypertree decomposition, discussed next.
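Triangulation reuses the same fill-in computation as the induced graph: each vertex together with its earlier neighbors in the filled graph is a clique of the chordal graph, and the join-tree keeps the maximal ones. A sketch of ours, run on the moral graph of Figure 2:

```python
def join_tree_cliques(edges, order):
    """Fill in along reverse(order); return the maximal cliques of the
    resulting chordal graph (the candidate join-tree clusters)."""
    pos = {v: i for i, v in enumerate(order)}
    adj = {v: set() for v in order}
    for a, b in edges:
        adj[a].add(b)
        adj[b].add(a)
    cliques = []
    for v in reversed(order):
        earlier = {u for u in adj[v] if pos[u] < pos[v]}
        cliques.append(earlier | {v})
        for u in earlier:           # fill-in: connect earlier neighbors
            adj[u] |= earlier - {u}
    return [c for c in cliques if not any(c < d for d in cliques)]

edges = [("A","B"),("A","C"),("A","D"),("B","C"),("B","D"),
         ("B","F"),("C","F"),("F","G")]
print(join_tree_cliques(edges, ["A","B","C","D","F","G"]))
# four maximal cliques: {F,G}, {B,C,F}, {A,B,D}, {A,B,C}
```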

6.2 Hypertree Decomposition

Recently, Gottlob et al. [8] presented the notion of hypertree decompositions for constraint satisfaction, and showed that for CSPs the hyper-width parameter can capture tractable classes that are not captured by tree-width. The exposition of hypertree decomposition in [8], as is, is not an instance of tree-decomposition because it allows a function to label more than a single node in the tree. While this does not hurt the solution of constraint problems, it is not legal for automated reasoning problems in general. We therefore modify the definition of hypertree decomposition in [8] with this restriction, and show that the modified hypertree decomposition is an instance of tree-decomposition.

Definition 6.1 [8] A hypertree for a hypergraph $H$ is a triple $\langle T, \chi, \psi \rangle$, where $T = (N, E)$ is a rooted tree and $\chi$ and $\psi$ are labelling functions which associate with each node $p \in N$ two sets, $\chi(p) \subseteq scope(H)$ and $\psi(p) \subseteq edges(H)$. If $T' = (N', E')$ is a subtree of $T$, we define $\chi(T') = \cup_{v \in N'} \chi(v)$. We denote the set of vertices $N$ of $T$ by $vertices(T)$, and the root of $T$ by $root(T)$. Moreover, for any $p \in N$, $T_p$ denotes the subtree of $T$ rooted at $p$.

Definition 6.2 [8] A (restricted) hypertree decomposition of a hypergraph $H$ is a hypertree $\langle T, \chi, \psi \rangle$ for $H$ which satisfies the following conditions¹:

1. For each edge $h \in edges(H)$, there exists $p \in vertices(T)$ such that $h \in \psi(p)$ and $scope(h) \subseteq \chi(p)$ (we say that $p$ strongly covers $h$).
2. For each variable $x \in scope(H)$, the set $\{p \in vertices(T) \mid x \in \chi(p)\}$ induces a (connected) subtree of $T$.
3. For each $p \in vertices(T)$, $\chi(p) \subseteq scope(\psi(p))$.
4. For each $p \in vertices(T)$, $scope(\psi(p)) \cap \chi(T_p) \subseteq \chi(p)$.
5. (the restricting condition) For every $h \in H$ there is exactly one $p \in vertices(T)$ such that $h \in \psi(p)$.

¹ In [8] these decompositions are called complete.

Conditions 1-4 correspond to the (complete) hypertree decompositions of [8].

Definition 6.3 A hypertree decomposition of a reasoning problem $P$ is obtained from a hypertree decomposition of its hypergraph by replacing each hyperedge with the functions having that hyperedge as their scope.

Proposition 2 Any (restricted) hypertree decomposition of a reasoning problem $P$ is a tree-decomposition of $P$.

Notice that the opposite is not true: there are tree-decompositions that are not (restricted) hypertree decompositions, because hypertree decompositions require that the variables labelling a node be contained in the scope of its labelling functions. For example, consider a single $n$-ary function $f$. It can be mapped into a bucket-tree with $n$ nodes, where node $i$ contains all variables $\{1, 2, \ldots, i\}$ but no functions, while node $n$ contains all the variables and the input function. Both join-tree and hypertree decomposition allow only a single node that includes the function and all its variables.

7 Space-Time Tradeoff: Superbuckets

The main drawback of CTE is its memory needs: the space complexity of CTE is exponential in the largest separator size. In practice this may be prohibitive, and therefore time-space tradeoffs were introduced [4]. The idea is to trade space for time by merging adjacent nodes, thus reducing separator sizes while increasing width and hyper-width.

Proposition 3 If $T$ is a tree-decomposition, then any tree obtained by merging adjacent nodes in $T$ is a tree-decomposition.

Since a bucket-tree is a tree-decomposition, by merging adjacent buckets we get what we call a super-bucket-tree (SBT). This means that in the top-down phase of processing an SBT, several variables are eliminated at once. Note that one can always generate a join-tree from a bucket-tree by merging adjacent nodes. For an illustration see Figure 3.
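Merging two adjacent clusters is a one-liner on the labelled tree; the sketch below (ours, using the same dict-of-neighbor-sets representation as the CTE snippet) shows how the separator between the merged nodes disappears while the $\chi$ label of the merged node grows, which is exactly the tradeoff of Proposition 3.

```python
def merge_adjacent(tree, chi, psi, u, v):
    """Merge adjacent nodes u and v into u: chi/psi become unions and
    v's other neighbors are re-attached to u."""
    assert v in tree[u], "u and v must be adjacent"
    chi[u] |= chi.pop(v)
    psi[u] |= psi.pop(v)
    for w in tree.pop(v):           # rewire v's neighbors to u
        if w != u:
            tree[w].remove(v)
            tree[w].add(u)
            tree[u].add(w)
    tree[u].discard(v)
    return tree, chi, psi
```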

8 Related Work and Conclusions

By its nature, the work presented here is related to the past two decades of work on tree-decompositions for specific tasks, to which we referred throughout the paper. Unifying frameworks were also presented in [14, 2]. The work here puts all these schemes and formalisms together. The novelty of this work is that it provides a unifying framework for tree-decomposition that draws on notations and formalizations appearing in diverse communities, such as probabilistic reasoning, optimization, constraint satisfaction and graph theory. The correctness and complexity of the associated algorithms are analyzed. We believe that the current exposition adds clarity and will facilitate technology transfer.

References

[1] U. Bertele and F. Brioschi. Nonserial Dynamic Programming. Academic Press, 1972.

[2] S. Bistarelli, U. Montanari, and F. Rossi. Semiring-based constraint satisfaction and optimization. Journal of the ACM, 44(2):201-236, 1997.

[3] F. G. Cozman. Generalizing variable-elimination in Bayesian networks. In Workshop on Probabilistic Reasoning in Bayesian Networks at SBIA/Iberamia 2000, pages 21-26, 2000.

[4] R. Dechter. Topological parameters for time-space tradeoffs. In Uncertainty in Artificial Intelligence (UAI-96), 1996.

[5] R. Dechter. Bucket elimination: A unifying framework for reasoning. Artificial Intelligence, 113:41-85, 1999.

[6] R. Dechter, K. Kask, and J. Larrosa. A general scheme for multiple lower-bound computation in constraint optimization. In Principles and Practice of Constraint Programming (CP-2001), 2001.

[7] R. Dechter and J. Pearl. Tree clustering for constraint networks. Artificial Intelligence, 38:353-366, 1989.

[8] G. Gottlob, N. Leone, and F. Scarcello. A comparison of structural CSP decomposition methods. In IJCAI-99, 1999.

[9] F.V. Jensen, S.L. Lauritzen, and K.G. Olesen. Bayesian updating in causal probabilistic networks by local computation. Computational Statistics Quarterly, 4:269-282, 1990.

[10] K. Kask and R. Dechter. A general scheme for automatic generation of search heuristics from specification dependencies. Artificial Intelligence, 129:91-131, 2001.

[11] S.L. Lauritzen and D.J. Spiegelhalter. Local computation with probabilities on graphical structures and their application to expert systems. Journal of the Royal Statistical Society, Series B, 50(2):157-224, 1988.

[12] D. Maier. The Theory of Relational Databases. Computer Science Press, Rockville, MD, 1983.

[13] J. Pearl. Probabilistic Reasoning in Intelligent Systems. Morgan Kaufmann, 1988.

[14] P.P. Shenoy. Valuation-based systems for Bayesian decision analysis. Operations Research, 40:463-484, 1992.

[15] P.P. Shenoy. Binary join trees. In Uncertainty in Artificial Intelligence (UAI-96), 1996.

[16] P.P. Shenoy. Binary join trees for computing marginals in the Shenoy-Shafer architecture. International Journal of Approximate Reasoning, 17(2-3):239-263, 1997.

[17] P.P. Shenoy and G. Shafer. Axioms for probability and belief-function propagation. In R.D. Shachter, T.S. Levitt, J.F. Lemmer and L.N. Kanal, editors, Uncertainty in Artificial Intelligence 4, pages 169-198. North-Holland, Amsterdam, 1990.


Dynamic J ointrees. Figure 1: Belief networks and respective jointrees. 97 Dynamic J ointrees Adnan Darwiche Department of Mathematics American University of Beirut PO Box 11-236 Beirut, Lebanon darwiche@aub. edu.lb Abstract It is well known that one can ignore parts of a

More information

Computer Vision Group Prof. Daniel Cremers. 4. Probabilistic Graphical Models Directed Models

Computer Vision Group Prof. Daniel Cremers. 4. Probabilistic Graphical Models Directed Models Prof. Daniel Cremers 4. Probabilistic Graphical Models Directed Models The Bayes Filter (Rep.) (Bayes) (Markov) (Tot. prob.) (Markov) (Markov) 2 Graphical Representation (Rep.) We can describe the overall

More information

Unifying and extending hybrid tractable classes of CSPs

Unifying and extending hybrid tractable classes of CSPs Journal of Experimental & Theoretical Artificial Intelligence Vol. 00, No. 00, Month-Month 200x, 1 16 Unifying and extending hybrid tractable classes of CSPs Wady Naanaa Faculty of sciences, University

More information

Class2: Constraint Networks Rina Dechter

Class2: Constraint Networks Rina Dechter Algorithms for Reasoning with graphical models Class2: Constraint Networks Rina Dechter Dbook: chapter 2-3, Constraint book: chapters 2 and 4 Text Books Road Map Graphical models Constraint networks Model

More information

Probabilistic Graphical Models

Probabilistic Graphical Models Probabilistic Graphical Models Raquel Urtasun and Tamir Hazan TTI Chicago April 22, 2011 Raquel Urtasun and Tamir Hazan (TTI-C) Graphical Models April 22, 2011 1 / 22 If the graph is non-chordal, then

More information

Lecture 13: May 10, 2002

Lecture 13: May 10, 2002 EE96 Pat. Recog. II: Introduction to Graphical Models University of Washington Spring 00 Dept. of Electrical Engineering Lecture : May 0, 00 Lecturer: Jeff Bilmes Scribe: Arindam Mandal, David Palmer(000).

More information

Massachusetts Institute of Technology Department of Electrical Engineering and Computer Science Algorithms For Inference Fall 2014

Massachusetts Institute of Technology Department of Electrical Engineering and Computer Science Algorithms For Inference Fall 2014 Suggested Reading: Massachusetts Institute of Technology Department of Electrical Engineering and Computer Science 6.438 Algorithms For Inference Fall 2014 Probabilistic Modelling and Reasoning: The Junction

More information

An Extension of Complexity Bounds and Dynamic Heuristics for Tree-Decompositions of CSP

An Extension of Complexity Bounds and Dynamic Heuristics for Tree-Decompositions of CSP An Extension of Complexity Bounds and Dynamic Heuristics for Tree-Decompositions of CSP Philippe Jégou, Samba Ndojh Ndiaye, and Cyril Terrioux LSIS - UMR CNRS 6168 Université Paul Cézanne (Aix-Marseille

More information

A CSP Search Algorithm with Reduced Branching Factor

A CSP Search Algorithm with Reduced Branching Factor A CSP Search Algorithm with Reduced Branching Factor Igor Razgon and Amnon Meisels Department of Computer Science, Ben-Gurion University of the Negev, Beer-Sheva, 84-105, Israel {irazgon,am}@cs.bgu.ac.il

More information

Constraint Satisfaction Problems

Constraint Satisfaction Problems Constraint Satisfaction Problems Search and Lookahead Bernhard Nebel, Julien Hué, and Stefan Wölfl Albert-Ludwigs-Universität Freiburg June 4/6, 2012 Nebel, Hué and Wölfl (Universität Freiburg) Constraint

More information

Example: Bioinformatics. Soft Constraint Processing. From Optimal CSP to Soft CSP. Overview. From Optimal CSP to Soft CSP.

Example: Bioinformatics. Soft Constraint Processing. From Optimal CSP to Soft CSP. Overview. From Optimal CSP to Soft CSP. Example: Bioinformatics Soft Constraint Processing 16.412J/6.834J Cognitive Robotics Martin Sachenbacher (Using material from Thomas Schiex) RNA is single-strand molecule, composed of A,U,G,C Function

More information

Some Applications of Graph Bandwidth to Constraint Satisfaction Problems

Some Applications of Graph Bandwidth to Constraint Satisfaction Problems Some Applications of Graph Bandwidth to Constraint Satisfaction Problems Ramin Zabih Computer Science Department Stanford University Stanford, California 94305 Abstract Bandwidth is a fundamental concept

More information

!!" '!" Fall 2003 ICS 275A - Constraint Networks 2

!! '! Fall 2003 ICS 275A - Constraint Networks 2 chapter 10 1 !"!!" # $ %&!" '!" Fall 2003 ICS 275A - Constraint Networks 2 ( Complexity : O(exp(n)) Fall 2003 ICS 275A - Constraint Networks 3 * bucket i = O(exp( w )) DR time and space : O( n exp( w *

More information

Lecture 5: Exact inference. Queries. Complexity of inference. Queries (continued) Bayesian networks can answer questions about the underlying

Lecture 5: Exact inference. Queries. Complexity of inference. Queries (continued) Bayesian networks can answer questions about the underlying given that Maximum a posteriori (MAP query: given evidence 2 which has the highest probability: instantiation of all other variables in the network,, Most probable evidence (MPE: given evidence, find an

More information

PART 1 GRAPHICAL STRUCTURE

PART 1 GRAPHICAL STRUCTURE PART 1 GRAPHICAL STRUCTURE in this web service in this web service 1 Treewidth and Hypertree Width Georg Gottlob, Gianluigi Greco, Francesco Scarcello This chapter covers methods for identifying islands

More information

Constraint Networks. Constraint networks. Definition. Normalized. Constraint Networks. Deduction. Constraint. Networks and Graphs. Solving.

Constraint Networks. Constraint networks. Definition. Normalized. Constraint Networks. Deduction. Constraint. Networks and Graphs. Solving. 1 Satisfaction Problems Albert-Ludwigs-Universität Freiburg networks networks and Stefan Wölfl, Christian Becker-Asano, and Bernhard Nebel October 27, 2014 October 27, 2014 Wölfl, Nebel and Becker-Asano

More information

V,T C3: S,L,B T C4: A,L,T A,L C5: A,L,B A,B C6: C2: X,A A

V,T C3: S,L,B T C4: A,L,T A,L C5: A,L,B A,B C6: C2: X,A A Inference II Daphne Koller Stanford University CS228 Handout #13 In the previous chapter, we showed how efficient inference can be done in a BN using an algorithm called Variable Elimination, that sums

More information

Bayesian Networks, Winter Yoav Haimovitch & Ariel Raviv

Bayesian Networks, Winter Yoav Haimovitch & Ariel Raviv Bayesian Networks, Winter 2009-2010 Yoav Haimovitch & Ariel Raviv 1 Chordal Graph Warm up Theorem 7 Perfect Vertex Elimination Scheme Maximal cliques Tree Bibliography M.C.Golumbic Algorithmic Graph Theory

More information

On the Implication Problem for Probabilistic Conditional Independency

On the Implication Problem for Probabilistic Conditional Independency IEEE TRANSACTIONS ON SYSTEMS, MAN, AND CYBERNETICS PART A: SYSTEMS AND HUMANS, VOL. 30, NO. 6, NOVEMBER 2000 785 On the Implication Problem for Probabilistic Conditional Independency S. K. M. Wong, C.

More information

Treewidth and graph minors

Treewidth and graph minors Treewidth and graph minors Lectures 9 and 10, December 29, 2011, January 5, 2012 We shall touch upon the theory of Graph Minors by Robertson and Seymour. This theory gives a very general condition under

More information

Dynamic heuristics for branch and bound search on tree-decomposition of Weighted CSPs

Dynamic heuristics for branch and bound search on tree-decomposition of Weighted CSPs Dynamic heuristics for branch and bound search on tree-decomposition of Weighted CSPs Philippe Jégou, Samba Ndojh Ndiaye, and Cyril Terrioux LSIS - UMR CNRS 6168 Université Paul Cézanne (Aix-Marseille

More information

Constraint Solving by Composition

Constraint Solving by Composition Constraint Solving by Composition Student: Zhijun Zhang Supervisor: Susan L. Epstein The Graduate Center of the City University of New York, Computer Science Department 365 Fifth Avenue, New York, NY 10016-4309,

More information

Modeling and Reasoning with Bayesian Networks. Adnan Darwiche University of California Los Angeles, CA

Modeling and Reasoning with Bayesian Networks. Adnan Darwiche University of California Los Angeles, CA Modeling and Reasoning with Bayesian Networks Adnan Darwiche University of California Los Angeles, CA darwiche@cs.ucla.edu June 24, 2008 Contents Preface 1 1 Introduction 1 1.1 Automated Reasoning........................

More information

CS242: Probabilistic Graphical Models Lecture 2B: Loopy Belief Propagation & Junction Trees

CS242: Probabilistic Graphical Models Lecture 2B: Loopy Belief Propagation & Junction Trees CS242: Probabilistic Graphical Models Lecture 2B: Loopy Belief Propagation & Junction Trees Professor Erik Sudderth Brown University Computer Science September 22, 2016 Some figures and materials courtesy

More information

Notes on Minimum Cuts and Modular Functions

Notes on Minimum Cuts and Modular Functions Notes on Minimum Cuts and Modular Functions 1 Introduction The following are my notes on Cunningham s paper [1]. Given a submodular function f and a set S, submodular minimisation is the problem of finding

More information

Probabilistic Graphical Models

Probabilistic Graphical Models Probabilistic Graphical Models Lecture 8 Junction Trees CS/CNS/EE 155 Andreas Krause Announcements Homework 2 due next Wednesday (Nov 4) in class Start early!!! Project milestones due Monday (Nov 9) 4

More information

On the Time Complexity of Bucket. Javier Larrosa. January 23, Abstract. In this short note, we prove the time complexity of full-bucket and

On the Time Complexity of Bucket. Javier Larrosa. January 23, Abstract. In this short note, we prove the time complexity of full-bucket and On the Time Complexity of Bucket Elimination Algorithms Javier Larrosa Information and Computer Science University of California at Irvine, USA January 23, 2001 Abstract In this short note, we prove the

More information

Lecture 11: May 1, 2000

Lecture 11: May 1, 2000 / EE596 Pat. Recog. II: Introduction to Graphical Models Spring 2000 Lecturer: Jeff Bilmes Lecture 11: May 1, 2000 University of Washington Dept. of Electrical Engineering Scribe: David Palmer 11.1 Graph

More information

Probabilistic Graphical Models

Probabilistic Graphical Models Probabilistic Graphical Models Raquel Urtasun and Tamir Hazan TTI Chicago April 25, 2011 Raquel Urtasun and Tamir Hazan (TTI-C) Graphical Models April 25, 2011 1 / 17 Clique Trees Today we are going to

More information

Class 2 Constraint Networks

Class 2 Constraint Networks Class 2 Constraint Networks The constraint network model Inference. 1 Road Map (classes 2 3) The constraint network model Inference Search Hybrids of search and inference 2 Constraint Networks A Example:

More information

Definition of Graphs and Trees. Representation of Trees.

Definition of Graphs and Trees. Representation of Trees. Definition of Graphs and Trees. Representation of Trees. Chapter 6 Definition of graphs (I) A directed graph or digraph is a pair G = (V,E) s.t.: V is a finite set called the set of vertices of G. E V

More information

Part II. C. M. Bishop PATTERN RECOGNITION AND MACHINE LEARNING CHAPTER 8: GRAPHICAL MODELS

Part II. C. M. Bishop PATTERN RECOGNITION AND MACHINE LEARNING CHAPTER 8: GRAPHICAL MODELS Part II C. M. Bishop PATTERN RECOGNITION AND MACHINE LEARNING CHAPTER 8: GRAPHICAL MODELS Converting Directed to Undirected Graphs (1) Converting Directed to Undirected Graphs (2) Add extra links between

More information

STA 4273H: Statistical Machine Learning

STA 4273H: Statistical Machine Learning STA 4273H: Statistical Machine Learning Russ Salakhutdinov Department of Statistics! rsalakhu@utstat.toronto.edu! http://www.utstat.utoronto.ca/~rsalakhu/ Sidney Smith Hall, Room 6002 Lecture 5 Inference

More information

Bucket elimination: A unifying framework for reasoning

Bucket elimination: A unifying framework for reasoning Artificial Intelligence 113 (1999) 41 85 Bucket elimination: A unifying framework for reasoning Rina Dechter 1 Department of Computer and Information Science, University of California, Irvine, CA 92697-3425,

More information

Computer Vision Group Prof. Daniel Cremers. 4. Probabilistic Graphical Models Directed Models

Computer Vision Group Prof. Daniel Cremers. 4. Probabilistic Graphical Models Directed Models Prof. Daniel Cremers 4. Probabilistic Graphical Models Directed Models The Bayes Filter (Rep.) (Bayes) (Markov) (Tot. prob.) (Markov) (Markov) 2 Graphical Representation (Rep.) We can describe the overall

More information

1 Minimum Cut Problem

1 Minimum Cut Problem CS 6 Lecture 6 Min Cut and Karger s Algorithm Scribes: Peng Hui How, Virginia Williams (05) Date: November 7, 07 Anthony Kim (06), Mary Wootters (07) Adapted from Virginia Williams lecture notes Minimum

More information

Diagnosis using Bounded Search and Symbolic Inference

Diagnosis using Bounded Search and Symbolic Inference To appear in: International Workshop on Principles of Diagnosis, Pacific Grove, USA, 2005 Diagnosis using Bounded Search and Symbolic Inference Martin Sachenbacher and Brian C. Williams Massachusetts Institute

More information

Lecture 5: Exact inference

Lecture 5: Exact inference Lecture 5: Exact inference Queries Inference in chains Variable elimination Without evidence With evidence Complexity of variable elimination which has the highest probability: instantiation of all other

More information

D-Separation. b) the arrows meet head-to-head at the node, and neither the node, nor any of its descendants, are in the set C.

D-Separation. b) the arrows meet head-to-head at the node, and neither the node, nor any of its descendants, are in the set C. D-Separation Say: A, B, and C are non-intersecting subsets of nodes in a directed graph. A path from A to B is blocked by C if it contains a node such that either a) the arrows on the path meet either

More information

Recitation 4: Elimination algorithm, reconstituted graph, triangulation

Recitation 4: Elimination algorithm, reconstituted graph, triangulation Massachusetts Institute of Technology Department of Electrical Engineering and Computer Science 6.438 Algorithms For Inference Fall 2014 Recitation 4: Elimination algorithm, reconstituted graph, triangulation

More information

A Comparison of Lauritzen-Spiegelhalter, Hugin, and Shenoy-Shafer Architectures for Computing Marginals of Probability Distributions

A Comparison of Lauritzen-Spiegelhalter, Hugin, and Shenoy-Shafer Architectures for Computing Marginals of Probability Distributions Appeared in: G. F. Cooper & S. Moral (eds.), Uncertainty in Artificial Intelligence, Vol. 14, 1999, pp. 328--337, Morgan Kaufmann, San Francisco, CA. A Comparison of Lauritzen-Spiegelhalter, Hugin, and

More information

Results on benchmark circuits. Information & Computer Science Dept. The performance of clustering and conditioning methods

Results on benchmark circuits. Information & Computer Science Dept. The performance of clustering and conditioning methods An evaluation of structural parameters for probabilistic reasoning: Results on benchmark circuits Yousri El Fattah Information & Computer Science Dept. University of California Irvine, CA Rina Dechter

More information

Clustering Using Graph Connectivity

Clustering Using Graph Connectivity Clustering Using Graph Connectivity Patrick Williams June 3, 010 1 Introduction It is often desirable to group elements of a set into disjoint subsets, based on the similarity between the elements in the

More information

Approximate Discrete Probability Distribution Representation using a Multi-Resolution Binary Tree

Approximate Discrete Probability Distribution Representation using a Multi-Resolution Binary Tree Approximate Discrete Probability Distribution Representation using a Multi-Resolution Binary Tree David Bellot and Pierre Bessière GravirIMAG CNRS and INRIA Rhône-Alpes Zirst - 6 avenue de l Europe - Montbonnot

More information

On 2-Subcolourings of Chordal Graphs

On 2-Subcolourings of Chordal Graphs On 2-Subcolourings of Chordal Graphs Juraj Stacho School of Computing Science, Simon Fraser University 8888 University Drive, Burnaby, B.C., Canada V5A 1S6 jstacho@cs.sfu.ca Abstract. A 2-subcolouring

More information

Faster parameterized algorithms for Minimum Fill-In

Faster parameterized algorithms for Minimum Fill-In Faster parameterized algorithms for Minimum Fill-In Hans L. Bodlaender Pinar Heggernes Yngve Villanger Abstract We present two parameterized algorithms for the Minimum Fill-In problem, also known as Chordal

More information

A New Algorithm for Sampling CSP Solutions Uniformly at Random

A New Algorithm for Sampling CSP Solutions Uniformly at Random A New Algorithm for Sampling CSP Solutions Uniformly at Random Vibhav Gogate and Rina Dechter Donald Bren School of Information and Computer Science University of California, Irvine, CA 92697 {vgogate,dechter}@ics.uci.edu

More information

Integrating Probabilistic Reasoning with Constraint Satisfaction

Integrating Probabilistic Reasoning with Constraint Satisfaction Integrating Probabilistic Reasoning with Constraint Satisfaction IJCAI Tutorial #7 Instructor: Eric I. Hsu July 17, 2011 http://www.cs.toronto.edu/~eihsu/tutorial7 Getting Started Discursive Remarks. Organizational

More information

The Parameterized Complexity of the Rainbow Subgraph Problem. Falk Hüffner, Christian Komusiewicz *, Rolf Niedermeier and Martin Rötzschke

The Parameterized Complexity of the Rainbow Subgraph Problem. Falk Hüffner, Christian Komusiewicz *, Rolf Niedermeier and Martin Rötzschke Algorithms 2015, 8, 60-81; doi:10.3390/a8010060 OPEN ACCESS algorithms ISSN 1999-4893 www.mdpi.com/journal/algorithms Article The Parameterized Complexity of the Rainbow Subgraph Problem Falk Hüffner,

More information

Constraint Satisfaction Problems. Chapter 6

Constraint Satisfaction Problems. Chapter 6 Constraint Satisfaction Problems Chapter 6 Constraint Satisfaction Problems A constraint satisfaction problem consists of three components, X, D, and C: X is a set of variables, {X 1,..., X n }. D is a

More information

On Covering a Graph Optimally with Induced Subgraphs

On Covering a Graph Optimally with Induced Subgraphs On Covering a Graph Optimally with Induced Subgraphs Shripad Thite April 1, 006 Abstract We consider the problem of covering a graph with a given number of induced subgraphs so that the maximum number

More information

Constraint Satisfaction Problems

Constraint Satisfaction Problems Constraint Satisfaction Problems Constraint Optimization Bernhard Nebel, Julien Hué, and Stefan Wölfl Albert-Ludwigs-Universität Freiburg July 17, 2012 Nebel, Hué and Wölfl (Universität Freiburg) Constraint

More information

Faster parameterized algorithms for Minimum Fill-In

Faster parameterized algorithms for Minimum Fill-In Faster parameterized algorithms for Minimum Fill-In Hans L. Bodlaender Pinar Heggernes Yngve Villanger Technical Report UU-CS-2008-042 December 2008 Department of Information and Computing Sciences Utrecht

More information

Graph and Digraph Glossary

Graph and Digraph Glossary 1 of 15 31.1.2004 14:45 Graph and Digraph Glossary A B C D E F G H I-J K L M N O P-Q R S T U V W-Z Acyclic Graph A graph is acyclic if it contains no cycles. Adjacency Matrix A 0-1 square matrix whose

More information

COMP260 Spring 2014 Notes: February 4th

COMP260 Spring 2014 Notes: February 4th COMP260 Spring 2014 Notes: February 4th Andrew Winslow In these notes, all graphs are undirected. We consider matching, covering, and packing in bipartite graphs, general graphs, and hypergraphs. We also

More information

The Encoding Complexity of Network Coding

The Encoding Complexity of Network Coding The Encoding Complexity of Network Coding Michael Langberg Alexander Sprintson Jehoshua Bruck California Institute of Technology Email: mikel,spalex,bruck @caltech.edu Abstract In the multicast network

More information

Reducing redundancy in the hypertree decomposition scheme

Reducing redundancy in the hypertree decomposition scheme University of Wollongong Research Online Faculty of Informatics - Papers (Archive) Faculty of Engineering and Information Sciences 00 Reducing redundancy in the hypertree decomposition scheme Peter Harvey

More information

On the Relationships between Zero Forcing Numbers and Certain Graph Coverings

On the Relationships between Zero Forcing Numbers and Certain Graph Coverings On the Relationships between Zero Forcing Numbers and Certain Graph Coverings Fatemeh Alinaghipour Taklimi, Shaun Fallat 1,, Karen Meagher 2 Department of Mathematics and Statistics, University of Regina,

More information

NP-Completeness. Algorithms

NP-Completeness. Algorithms NP-Completeness Algorithms The NP-Completeness Theory Objective: Identify a class of problems that are hard to solve. Exponential time is hard. Polynomial time is easy. Why: Do not try to find efficient

More information

PLANAR GRAPH BIPARTIZATION IN LINEAR TIME

PLANAR GRAPH BIPARTIZATION IN LINEAR TIME PLANAR GRAPH BIPARTIZATION IN LINEAR TIME SAMUEL FIORINI, NADIA HARDY, BRUCE REED, AND ADRIAN VETTA Abstract. For each constant k, we present a linear time algorithm that, given a planar graph G, either

More information

OSU CS 536 Probabilistic Graphical Models. Loopy Belief Propagation and Clique Trees / Join Trees

OSU CS 536 Probabilistic Graphical Models. Loopy Belief Propagation and Clique Trees / Join Trees OSU CS 536 Probabilistic Graphical Models Loopy Belief Propagation and Clique Trees / Join Trees Slides from Kevin Murphy s Graphical Model Tutorial (with minor changes) Reading: Koller and Friedman Ch

More information

Topological Parameters for Time-Space Tradeo. Rina Dechter. Yousri El Fattah.

Topological Parameters for Time-Space Tradeo. Rina Dechter. Yousri El Fattah. Topological Parameters for Time-Space Tradeo Rina Dechter Information & Computer Science, University of California, Irvine, CA 91 dechter@ics.uci.edu Yousri El Fattah Rockwell Science Center, 109 Camino

More information

1 Inference for Boolean theories

1 Inference for Boolean theories Scribe notes on the class discussion on consistency methods for boolean theories, row convex constraints and linear inequalities (Section 8.3 to 8.6) Speaker: Eric Moss Scribe: Anagh Lal Corrector: Chen

More information

Consistency and Set Intersection

Consistency and Set Intersection Consistency and Set Intersection Yuanlin Zhang and Roland H.C. Yap National University of Singapore 3 Science Drive 2, Singapore {zhangyl,ryap}@comp.nus.edu.sg Abstract We propose a new framework to study

More information

Trees. 3. (Minimally Connected) G is connected and deleting any of its edges gives rise to a disconnected graph.

Trees. 3. (Minimally Connected) G is connected and deleting any of its edges gives rise to a disconnected graph. Trees 1 Introduction Trees are very special kind of (undirected) graphs. Formally speaking, a tree is a connected graph that is acyclic. 1 This definition has some drawbacks: given a graph it is not trivial

More information

On the Space-Time Trade-off in Solving Constraint Satisfaction Problems*

On the Space-Time Trade-off in Solving Constraint Satisfaction Problems* Appeared in Proc of the 14th Int l Joint Conf on Artificial Intelligence, 558-56, 1995 On the Space-Time Trade-off in Solving Constraint Satisfaction Problems* Roberto J Bayardo Jr and Daniel P Miranker

More information

Structural and Syntactic Pattern Recognition

Structural and Syntactic Pattern Recognition Structural and Syntactic Pattern Recognition Selim Aksoy Department of Computer Engineering Bilkent University saksoy@cs.bilkent.edu.tr CS 551, Fall 2017 CS 551, Fall 2017 c 2017, Selim Aksoy (Bilkent

More information

6. Lecture notes on matroid intersection

6. Lecture notes on matroid intersection Massachusetts Institute of Technology 18.453: Combinatorial Optimization Michel X. Goemans May 2, 2017 6. Lecture notes on matroid intersection One nice feature about matroids is that a simple greedy algorithm

More information

THREE LECTURES ON BASIC TOPOLOGY. 1. Basic notions.

THREE LECTURES ON BASIC TOPOLOGY. 1. Basic notions. THREE LECTURES ON BASIC TOPOLOGY PHILIP FOTH 1. Basic notions. Let X be a set. To make a topological space out of X, one must specify a collection T of subsets of X, which are said to be open subsets of

More information

CSE101: Design and Analysis of Algorithms. Ragesh Jaiswal, CSE, UCSD

CSE101: Design and Analysis of Algorithms. Ragesh Jaiswal, CSE, UCSD Recap. Growth rates: Arrange the following functions in ascending order of growth rate: n 2 log n n log n 2 log n n/ log n n n Introduction Algorithm: A step-by-step way of solving a problem. Design of

More information

Lecture 9 - Matrix Multiplication Equivalences and Spectral Graph Theory 1

Lecture 9 - Matrix Multiplication Equivalences and Spectral Graph Theory 1 CME 305: Discrete Mathematics and Algorithms Instructor: Professor Aaron Sidford (sidford@stanfordedu) February 6, 2018 Lecture 9 - Matrix Multiplication Equivalences and Spectral Graph Theory 1 In the

More information

Parameterized coloring problems on chordal graphs

Parameterized coloring problems on chordal graphs Parameterized coloring problems on chordal graphs Dániel Marx Department of Computer Science and Information Theory, Budapest University of Technology and Economics Budapest, H-1521, Hungary dmarx@cs.bme.hu

More information

Constraint Satisfaction Problems. Constraint Satisfaction Problems. Constraint networks. 1 Constraint Networks. 1 Constraint Networks

Constraint Satisfaction Problems. Constraint Satisfaction Problems. Constraint networks. 1 Constraint Networks. 1 Constraint Networks Constraint Satisfaction Problems May 7, 2012 Constraint Satisfaction Problems Bernhard Nebel, Julien Hué, and Stefan Wölfl Albert-Ludwigs-Universität Freiburg May 7, 2012 1 2 3 and Graphs 4 Solving Nebel,

More information