CS242: Probabilistic Graphical Models
Lecture 3: Factor Graphs & Variable Elimination
Instructor: Erik Sudderth, Brown University Computer Science
September 11, 2014
Some figures and materials courtesy David Barber, Bayesian Reasoning and Machine Learning
http://www.cs.ucl.ac.uk/staff/d.barber/brml/
Probabilistic Graphical Models
[Figure: the same model drawn three ways]
Directed Graphical Model (Bayesian Network): Markov properties and factorization
Undirected Graphical Model (Markov Random Field): Markov properties and factorization
Factor Graph (Hypergraph): a more detailed view of factorization, useful for algorithms
Undirected Graphical Models, Parameterization, & Factor Graphs
Factor Graphs
- $\mathcal{V}$: set of $N$ nodes or vertices, $\{1, 2, \ldots, N\}$
- $\mathcal{F}$: set of hyperedges (factors) linking subsets of nodes, $f \subseteq \mathcal{V}$
$$p(x) = \frac{1}{Z} \prod_{f \in \mathcal{F}} \psi_f(x_f), \qquad Z = \sum_x \prod_{f \in \mathcal{F}} \psi_f(x_f) > 0$$
- $\psi_f(x_f) \geq 0$: arbitrary non-negative potential function
- $Z$: normalization constant (partition function)
- In a hypergraph, the hyperedges link arbitrary subsets of nodes (not just pairs)
- Visualize by a bipartite graph, with square (usually black) nodes for hyperedges
- Motivation: factorization is key to computation
- Motivation: avoids ambiguities of undirected graphical models
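The definition above can be made concrete with a brute-force sketch: store each factor as a table, multiply the tables at every joint assignment, and sum to get $Z$. This is a minimal illustration under an assumed toy model (three binary variables, two made-up pairwise potentials), not an efficient algorithm.

```python
import itertools
import numpy as np

# A factor graph as a list of (variables, table) pairs.  Each table is an
# ndarray indexed by the states of that factor's variables.  The numbers
# here are a hypothetical toy model, chosen only for illustration.
factors = [
    ((0, 1), np.array([[2.0, 1.0], [1.0, 2.0]])),   # psi(x0, x1)
    ((1, 2), np.array([[1.0, 3.0], [3.0, 1.0]])),   # psi(x1, x2)
]
K, N = 2, 3  # states per variable, number of variables

def unnormalized(x):
    """Product of all factor potentials at the joint assignment x."""
    p = 1.0
    for vars_, table in factors:
        p *= table[tuple(x[v] for v in vars_)]
    return p

# Partition function: sum the factor product over all K**N assignments.
Z = sum(unnormalized(x) for x in itertools.product(range(K), repeat=N))
```

Dividing `unnormalized(x)` by `Z` gives the normalized probability of any assignment; the exponential cost of this enumeration is exactly what the elimination algorithms later in the lecture avoid.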
Factor Graphs & Factorization
For a given undirected graph, there exist distributions with equivalent Markov properties but different factorizations, and hence different inference/learning complexities.
Undirected graphical model: $p(x) = \frac{1}{Z} \prod_{f \in \mathcal{F}} \psi_f(x_f)$
Pairwise (edge) potentials: $p(x) = \frac{1}{Z} \prod_{(s,t) \in \mathcal{E}} \psi_{st}(x_s, x_t)$
Maximal clique potentials (an alternative factorization): $p(x) = \frac{1}{Z} \prod_{c \in \mathcal{C}} \psi_c(x_c)$
Recommendation: use undirected graphs for pairwise MRFs, and factor graphs for higher-order potentials.
Directed Graphs as Factor Graphs
Directed graphical model: $p(x) = \prod_{i=1}^N p(x_i \mid x_{\Gamma(i)})$, where $\Gamma(i)$ denotes the parents of node $i$.
Corresponding factor graph: $p(x) = \prod_{i=1}^N \psi_i(x_i, x_{\Gamma(i)})$
Associate one factor with each node, linking it to its parents and defined to equal the corresponding conditional distribution.
Information lost: directionality of the conditional distributions, and the fact that the global partition function is $Z = 1$.
Directed Graphs as Undirected Graphs
$$p(x) = \prod_{i=1}^N p(x_i \mid x_{\Gamma(i)}) = \prod_{i=1}^N \psi_i(x_i, x_{\Gamma(i)})$$
Retain all directed edges, but drop their directionality.
Create the moral graph by linking ("marrying") pairs of nodes with a common child.
This undirected graph has a clique for the conditional of each node given its parents.
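Moralization is simple enough to sketch directly. The parent lists below are my reading of the six-node example used later in this lecture (an assumption); its only married pair is $(x_2, x_5)$, the co-parents of $x_6$.

```python
from itertools import combinations

# Hypothetical parent lists for a 6-node directed graph (nodes 1..6);
# x6 has two parents, x2 and x5, which moralization must "marry".
parents = {1: [], 2: [1], 3: [1], 4: [2], 5: [3], 6: [2, 5]}

def moralize(parents):
    """Undirected moral graph: drop edge directions, then link all
    pairs of parents that share a common child."""
    edges = set()
    for child, pa in parents.items():
        for p in pa:                        # drop directionality
            edges.add(frozenset((p, child)))
        for a, b in combinations(pa, 2):    # marry co-parents
            edges.add(frozenset((a, b)))
    return edges

moral_edges = moralize(parents)
```

The result has the six original edges plus the new 2-5 edge, giving the clique {2, 5, 6} that covers the conditional $p(x_6 \mid x_2, x_5)$.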
Directed Factor Graphs?
A rarely used proposal by Brendan Frey (UAI 2003).
Chain graphs: another hybrid of directed & undirected graphical models (see readings).
Inference in Graphical Models: Loss Functions & Marginals
Inference with Two Variables
$$p(x, y) = p(x) \, p(y \mid x)$$
Table lookup: $p(y \mid x = \bar{x})$
Bayes rule: $p(x \mid y = \bar{y}) = \dfrac{p(\bar{y} \mid x) \, p(x)}{p(\bar{y})}$
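For discrete variables, both operations are just table manipulations. A minimal sketch, assuming a made-up $2 \times 2$ joint table:

```python
import numpy as np

# Joint distribution p(x, y), rows indexed by x and columns by y
# (hypothetical numbers for illustration).
p_xy = np.array([[0.10, 0.30],
                 [0.40, 0.20]])

p_x = p_xy.sum(axis=1)                  # marginal p(x)
p_y_given_x = p_xy / p_x[:, None]       # table lookup: p(y | x)

# Bayes rule: p(x | y = y_bar) proportional to p(y_bar | x) p(x).
y_bar = 0
post = p_y_given_x[:, y_bar] * p_x
post = post / post.sum()                # normalize by p(y_bar)
```

Note the numerator `p_y_given_x[:, y_bar] * p_x` is just the column `p_xy[:, y_bar]`; Bayes rule here amounts to selecting a column of the joint and renormalizing.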
Marginal & Conditional Distributions
$$p(x, y) = \sum_{z \in \mathcal{Z}} p(x, y, z) \qquad p(x) = \sum_{y \in \mathcal{Y}} p(x, y) \qquad p(x, y \mid z = \bar{z}) = \frac{p(x, y, \bar{z})}{p(\bar{z})}$$
Target: subset of variables (nodes) whose values are of interest.
Condition: whatever other variables (nodes) we have observed.
Marginalize: everything that is not a target or an observation.
This is a finite sum, but the number of terms may be exponentially large.
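With the joint stored as a multidimensional array, marginalization is a sum over axes and conditioning is a slice plus renormalization. A sketch with an assumed random three-variable joint:

```python
import numpy as np

rng = np.random.default_rng(0)
# Random joint p(x, y, z) over variables with 2, 3, and 4 states.
p = rng.random((2, 3, 4))
p /= p.sum()

p_xy = p.sum(axis=2)             # marginalize z:  p(x,y) = sum_z p(x,y,z)
p_x = p_xy.sum(axis=1)           # then y:         p(x)   = sum_y p(x,y)

z_bar = 1                        # conditioning: slice at z = z_bar, renormalize
p_xy_given_z = p[:, :, z_bar] / p[:, :, z_bar].sum()
```

The slide's warning applies here: for $N$ variables with $K$ states each, this array has $K^N$ entries, so the explicit-table approach only works for tiny models.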
Inference & Decision Theory
- $y \in \mathcal{Y}$: observed data, which can take values in any space
- $x \in \mathcal{X}$: unknown or hidden variables (assume discrete for now)
- $a \in \mathcal{A}$: action, here an estimate of the hidden variables
- $L(x, a)$: table giving the loss for all possible mistakes
The posterior expected loss of taking action $a$ is
$$\rho(a \mid y) = \mathbb{E}[L(x, a) \mid y] = \sum_{x \in \mathcal{X}} L(x, a) \, p(x \mid y)$$
The optimal Bayes decision rule is then
$$\delta(y) = \arg\min_{a \in \mathcal{A}} \rho(a \mid y)$$
Challenge: for graphical models, the decision space is exponentially large.
Decomposable Loss Functions
$$\rho(a \mid y) = \mathbb{E}[L(x, a) \mid y] = \sum_{x \in \mathcal{X}} L(x, a) \, p(x \mid y)$$
Many practical losses decompose into a term for each variable node:
$$L(x, a) = \sum_{i=1}^N L_i(x_i, a_i) \quad\Longrightarrow\quad \rho(a \mid y) = \sum_{i=1}^N \sum_{x_i} L_i(x_i, a_i) \, p(x_i \mid y)$$
Hamming loss, $L(x, a) = \sum_{i=1}^N \mathbb{I}(x_i \neq a_i)$: for how many nodes is our estimate incorrect?
Quadratic loss, $L(x, a) = \sum_{i=1}^N (x_i - a_i)^2$: expected squared distance, minimized by the posterior mean.
For any such loss, it is sufficient to find the posterior marginals $p(x_i \mid y)$.
Note that the MAP loss does not decompose: $L(x, a) = \mathbb{I}(x \neq a)$.
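Under Hamming loss the decomposition means the Bayes decision is made node by node: take the mode of each posterior marginal. A sketch with assumed made-up marginals for $N = 3$ binary nodes:

```python
import numpy as np

# Posterior marginals p(x_i | y), one row per node, one column per state
# (hypothetical numbers).  Hamming loss decomposes across nodes, so the
# optimal action picks each node's most probable state independently.
marginals = np.array([[0.9, 0.1],
                      [0.4, 0.6],
                      [0.2, 0.8]])

a_hat = marginals.argmax(axis=1)                      # per-node posterior mode
# Expected Hamming loss of a_hat: per node, the probability of being wrong.
expected_hamming = (1.0 - marginals.max(axis=1)).sum()
```

Here `a_hat` is [0, 1, 1] with expected Hamming loss 0.1 + 0.4 + 0.2 = 0.7; no joint optimization over the exponentially large action space is needed, which is exactly the point of the slide.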
Variable Elimination Algorithms for Inference in Factor Graphs
Example: Inference in a Directed Graph
N = 6 discrete variables, each taking one of K states:
$$p(x) = p(x_1) \, p(x_2 \mid x_1) \, p(x_3 \mid x_1) \, p(x_4 \mid x_2) \, p(x_5 \mid x_3) \, p(x_6 \mid x_2, x_5)$$
Example: Inference in a Directed Graph?
N = 6 discrete variables, each taking one of K states.
Goal: compute the marginal distribution of $x_1$, given observations of $x_4$ and $x_6$:
$$p(x_1 \mid x_4 = \bar{x}_4, x_6 = \bar{x}_6) \propto \sum_{x_2} \sum_{x_3} \sum_{x_5} p(x_1) \, p(x_2 \mid x_1) \, p(x_3 \mid x_1) \, p(\bar{x}_4 \mid x_2) \, p(x_5 \mid x_3) \, p(\bar{x}_6 \mid x_2, x_5)$$
Number of arithmetic operations to naively compute this sum: $O(N K^4)$ here, and in general exponential in the number of unobserved variables!
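The naive sum can be written down directly. The conditional structure below is my reading of this six-variable example (an assumption), with random conditional probability tables standing in for real parameters; the nested loops make the $K^4$ enumeration explicit.

```python
import itertools
import numpy as np

rng = np.random.default_rng(1)
K = 3  # states per variable

def cpt(*shape):
    """Random conditional probability table, normalized over the child
    variable (last axis)."""
    t = rng.random(shape)
    return t / t.sum(axis=-1, keepdims=True)

# Assumed structure: p(x1) p(x2|x1) p(x3|x1) p(x4|x2) p(x5|x3) p(x6|x2,x5).
p1, p21, p31, p42, p53 = cpt(K), cpt(K, K), cpt(K, K), cpt(K, K), cpt(K, K)
p625 = cpt(K, K, K)                 # p(x6 | x2, x5), indexed [x2, x5, x6]

x4_bar, x6_bar = 0, 2               # hypothetical observed values

# Naive marginal: for each x1, sum the full joint over all (x2, x3, x5).
post = np.zeros(K)
for x1 in range(K):
    for x2, x3, x5 in itertools.product(range(K), repeat=3):
        post[x1] += (p1[x1] * p21[x1, x2] * p31[x1, x3] *
                     p42[x2, x4_bar] * p53[x3, x5] * p625[x2, x5, x6_bar])
post /= post.sum()                  # p(x1 | x4_bar, x6_bar)
```

The loop touches $K^4$ joint configurations, each costing about $N$ multiplications; with 20 unobserved variables instead of 3 this approach would already be hopeless.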
Conversion to Factor Graph Form
$$p(x) = p(x_1) \, p(x_2 \mid x_1) \, p(x_3 \mid x_1) \, p(x_4 \mid x_2) \, p(x_5 \mid x_3) \, p(x_6 \mid x_2, x_5) = \psi_1(x_1) \, \psi_2(x_2, x_1) \, \psi_3(x_3, x_1) \, \psi_4(x_4, x_2) \, \psi_5(x_5, x_3) \, \psi_6(x_6, x_2, x_5)$$
Evidence factors: $m_4(x_2) = \psi_4(\bar{x}_4, x_2)$ and $m_6(x_2, x_5) = \psi_6(\bar{x}_6, x_2, x_5)$, so that
$$p(x_1, x_2, x_3, x_5 \mid \bar{x}_4, \bar{x}_6) \propto \psi_1(x_1) \, \psi_2(x_2, x_1) \, \psi_3(x_3, x_1) \, m_4(x_2) \, \psi_5(x_5, x_3) \, m_6(x_2, x_5)$$
1. Convert the directed graph to the corresponding factor graph. The algorithm is equally valid for factor graphs derived from undirected MRFs.
2. Replace observed variables by lower-order evidence factors.
Iterative Variable Elimination I
Eliminate $x_5$, at a cost of $O(K^3)$ operations (a matrix-matrix product):
$$m_5(x_2, x_3) = \sum_{x_5} \psi_5(x_5, x_3) \, m_6(x_2, x_5)$$
$$p(x_1, x_2, x_3 \mid \bar{x}_4, \bar{x}_6) \propto \sum_{x_5} \psi_1(x_1) \, \psi_2(x_2, x_1) \, \psi_3(x_3, x_1) \, m_4(x_2) \, \psi_5(x_5, x_3) \, m_6(x_2, x_5) = \psi_1(x_1) \, \psi_2(x_2, x_1) \, \psi_3(x_3, x_1) \, m_4(x_2) \, m_5(x_2, x_3)$$
3. Pick some unobserved variable for elimination. Replace all factors which depend on that variable by a single, equivalent factor linking the eliminated variable's neighbors. Repeat recursively.
Iterative Variable Elimination II
Eliminate $x_3$, at a cost of $O(K^3)$ operations (a matrix-matrix product):
$$m_3(x_1, x_2) = \sum_{x_3} \psi_3(x_3, x_1) \, m_5(x_2, x_3)$$
$$p(x_1, x_2 \mid \bar{x}_4, \bar{x}_6) \propto \sum_{x_3} \psi_1(x_1) \, \psi_2(x_2, x_1) \, \psi_3(x_3, x_1) \, m_4(x_2) \, m_5(x_2, x_3) = \psi_1(x_1) \, \psi_2(x_2, x_1) \, m_4(x_2) \, m_3(x_1, x_2)$$
3. Pick some unobserved variable for elimination. Replace all factors which depend on that variable by a single, equivalent factor linking the eliminated variable's neighbors. Repeat recursively.
Iterative Variable Elimination III
Eliminate $x_2$, at a cost of $O(K^2)$ operations (a matrix-vector product):
$$m_2(x_1) = \sum_{x_2} \psi_2(x_2, x_1) \, m_3(x_1, x_2) \, m_4(x_2)$$
$$p(x_1 \mid \bar{x}_4, \bar{x}_6) = Z_1^{-1} \, \psi_1(x_1) \, m_2(x_1), \qquad Z_1 = \sum_{x_1} \psi_1(x_1) \, m_2(x_1)$$
3. Pick some unobserved variable for elimination. Replace all factors which depend on that variable by a single, equivalent factor linking the eliminated variable's neighbors. Repeat recursively.
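The whole worked example fits in a few lines of NumPy, with each message a single `einsum`. As before, the model structure and random tables are assumptions standing in for the lecture's unspecified parameters; the comments map each line to a message in the derivation.

```python
import numpy as np

rng = np.random.default_rng(1)
K = 3

def cpt(*shape):
    """Random conditional probability table, normalized over the last axis."""
    t = rng.random(shape)
    return t / t.sum(axis=-1, keepdims=True)

# Assumed model: p(x1) p(x2|x1) p(x3|x1) p(x4|x2) p(x5|x3) p(x6|x2,x5),
# with psi_i equal to the conditional for node i.
p1, p21, p31, p42, p53 = cpt(K), cpt(K, K), cpt(K, K), cpt(K, K), cpt(K, K)
p625 = cpt(K, K, K)                         # indexed [x2, x5, x6]
x4_bar, x6_bar = 0, 2                       # hypothetical observations

m4 = p42[:, x4_bar]                         # m4(x2) = psi4(x4_bar, x2)
m6 = p625[:, :, x6_bar]                     # m6(x2, x5) = psi6(x6_bar, x2, x5)
m5 = np.einsum('cs,bs->bc', p53, m6)        # m5(x2,x3) = sum_x5 psi5(x5,x3) m6(x2,x5)
m3 = np.einsum('ac,bc->ab', p31, m5)        # m3(x1,x2) = sum_x3 psi3(x3,x1) m5(x2,x3)
m2 = np.einsum('ab,ab,b->a', p21, m3, m4)   # m2(x1) = sum_x2 psi2 m3 m4
post = p1 * m2
post /= post.sum()                          # p(x1 | x4_bar, x6_bar)
```

Each `einsum` costs at most $O(K^3)$, versus the $O(K^4)$ enumeration of the naive sum, and the results agree exactly.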
A General Graph Elimination Algorithm
Algebraic marginalization operations:
- Marginalize out the variable associated with the node being eliminated.
- Compute a new potential function (factor) involving all other variables linked to the just-marginalized variable.
Graph manipulation operations:
- Remove, or eliminate, a single node from the graph.
- Remove any factors linked to that node, and replace them by a single factor joining the neighbors of the just-removed node.
- Alternative view based on undirected graphs: add edges between all pairs of nodes neighboring the just-removed node, forming a clique.
The graph elimination algorithm:
- Convert the model to a factor graph, encoding observed nodes as evidence factors.
- Choose an elimination ordering (the query node must be last).
- Iterate: remove a node, and replace its old factors by a new message factor linking its neighbors.
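The algorithm above can be sketched generically: a factor is a (variables, table) pair, and one elimination step multiplies the factors touching a variable and sums it out. This is a minimal sketch (no evidence handling or ordering heuristics), with `eliminate` and `marginal` as hypothetical helper names.

```python
import string
import numpy as np

def eliminate(factors, var):
    """One step of variable elimination: replace all factors that depend
    on `var` by a single message factor over its neighbors."""
    touching = [f for f in factors if var in f[0]]
    rest = [f for f in factors if var not in f[0]]
    scope = sorted({v for vs, _ in touching for v in vs} - {var})
    # Assign an einsum letter to each variable, with `var` summed out
    # by omitting it from the output subscripts.
    letter = {v: string.ascii_letters[i] for i, v in enumerate(scope + [var])}
    spec = ','.join(''.join(letter[v] for v in vs) for vs, _ in touching)
    spec += '->' + ''.join(letter[v] for v in scope)
    message = np.einsum(spec, *[t for _, t in touching])
    return rest + [(tuple(scope), message)]

def marginal(factors, query, order):
    """Marginal of `query` after eliminating every variable in `order`
    (all non-query variables, query last by omission)."""
    for v in order:
        factors = eliminate(factors, v)
    result = np.ones(1)
    for _, t in factors:       # remaining factors involve only the query
        result = result * t
    return result / result.sum()
```

For example, on a three-variable chain with pairwise factors `((0,1), A)` and `((1,2), B)`, `marginal(..., query=0, order=[2, 1])` reproduces the hand-derived message passing of the previous slides.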
Elimination Algorithm Complexity
- Elimination cliques: the sets of neighbors of eliminated nodes.
- Marginalization cost: exponential in the number of variables in each elimination clique (dominated by the largest clique). Storage cost is similar.
- Treewidth of a graph: over all possible elimination orderings, the smallest possible maximum elimination clique size, minus one.
- NP-hard: finding the best elimination ordering for an arbitrary input graph (but heuristic approximation algorithms can be effective).
- Limitation: no shared computation when finding multiple marginals.
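One standard heuristic of the kind mentioned above (not specific to this lecture) is greedy min-fill: repeatedly eliminate the node whose removal would add the fewest fill edges among its neighbors, tracking the largest elimination clique along the way.

```python
import itertools

def min_fill_order(adj):
    """Greedy min-fill elimination ordering.  `adj` maps each node to its
    set of neighbors; returns (ordering, max elimination clique size - 1),
    the latter an upper bound on treewidth."""
    adj = {v: set(ns) for v, ns in adj.items()}   # work on a copy
    order, width = [], 0
    while adj:
        def fill(v):
            # Number of new edges needed to make v's neighbors a clique.
            return sum(1 for a, b in itertools.combinations(adj[v], 2)
                       if b not in adj[a])
        v = min(adj, key=fill)
        nbrs = adj[v]
        width = max(width, len(nbrs))
        for a, b in itertools.combinations(nbrs, 2):   # add fill edges
            adj[a].add(b); adj[b].add(a)
        for n in nbrs:                                  # remove v
            adj[n].discard(v)
        del adj[v]
        order.append(v)
    return order, width
```

On a 4-node chain this reports width 1, and on a 4-cycle width 2, matching the treewidths quoted on the next slide for tree-structured versus loopy examples.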
Importance of Elimination Order
[Figure: two example graphs, one with treewidth 1 and one with treewidth 2]