CS242: Probabilistic Graphical Models
Lecture 2B: Loopy Belief Propagation & Junction Trees
Professor Erik Sudderth, Brown University Computer Science, September 22, 2016
Some figures and materials courtesy David Barber, Bayesian Reasoning and Machine Learning
http://www.cs.ucl.ac.uk/staff/d.barber/brml/
CS242: Lecture 2B Outline
Ø Elimination Algorithms & Clique Trees
Ø Junction Tree Algorithm
Ø Loopy Belief Propagation
[Figure: elimination cliques and the resulting clique tree for a six-node example.]
Undirected Graphical Models
V: set of N nodes or vertices, {1, 2, ..., N}
x_s: random variable associated with node s
E: set of undirected edges (s, t) linking pairs of nodes
C: set of cliques (fully connected subsets) of nodes

Undirected Markov Random Field:
p(x) = (1/Z) ∏_{c∈C} ψ_c(x_c),   ψ_c(x_c) ≥ 0,   Z > 0
Z = Σ_x ∏_{c∈C} ψ_c(x_c)
ψ_c(x_c): arbitrary non-negative potential function
Z: normalization constant, or partition function, defined as a sum over all possible joint states
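The definitions above can be made concrete with a brute-force computation of Z for a tiny discrete MRF. This is a minimal sketch, not code from the lecture: the 5-node clique structure and the random potential tables are illustrative assumptions.

```python
# A minimal sketch: brute-force partition function of a small discrete MRF.
# The cliques and potential tables below are illustrative assumptions.
import itertools
import numpy as np

K = 2  # each variable takes K discrete states
cliques = [(0, 1, 2), (2, 3), (2, 4)]  # cliques over 5 variables
rng = np.random.default_rng(0)
# arbitrary non-negative potential tables, one per clique
psi = {c: rng.random((K,) * len(c)) for c in cliques}

def unnormalized(x):
    """Product of clique potentials evaluated at joint state x."""
    p = 1.0
    for c in cliques:
        p *= psi[c][tuple(x[s] for s in c)]
    return p

# Z sums the unnormalized product over all K**5 joint states
Z = sum(unnormalized(x) for x in itertools.product(range(K), repeat=5))
```

Dividing any unnormalized product by Z then yields a valid probability, which is exactly why this exponential-size sum motivates the inference algorithms that follow.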
Clique-Based Inference Algorithms
Given p(x) = (1/Z) ∏_{c∈C} ψ_c(x_c), for each clique c ∈ C define a variable z_c = {x_s : s ∈ c} which enumerates joint configurations of the dependent variables, and consider
p(z) ∝ ∏_{c∈C} ψ_c(z_c)
Does this define an equivalent joint distribution?
PROBLEM: We have defined multiple copies of the variables in the true model, but not enforced any relationships among them.
Clique-Based Inference Algorithms
For each clique c ∈ C, define a variable z_c = {x_s : s ∈ c} which enumerates joint configurations of the dependent variables. Add potentials enforcing consistency between all pairs of clique variables which share one of the original variables:
p(z) ∝ ∏_{c∈C} ψ_c(z_c) ∏_{d≠c} ψ_{cd}(z_c, z_d)
ψ_{cd}(z_c, z_d) = 1 if z_c(x_s) = z_d(x_s) for all s ∈ c ∩ d, and 0 otherwise.
PROBLEM: The graph may have a large number of pairwise consistency constraints, and inference will be difficult.
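The consistency indicator ψ_cd can be sketched directly in code. The dict-based representation of a clique state z_c (mapping original variable indices to values) is an assumption for illustration, not the lecture's notation:

```python
# A minimal sketch (representation assumed): a clique state z_c is a dict
# mapping original variable indices to their values; the pairwise potential
# is 1 exactly when the variables shared by both cliques agree.
def psi_cd(z_c, z_d):
    shared = set(z_c) & set(z_d)  # variables in c ∩ d
    return 1.0 if all(z_c[s] == z_d[s] for s in shared) else 0.0
```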
Clique-Based Inference Algorithms
For each clique c ∈ C, define a variable z_c = {x_s : s ∈ c} which enumerates joint configurations of the dependent variables. Add potentials enforcing consistency between some subset E(C) of pairs of cliques, taking advantage of the transitivity of equality: x_a = x_b, x_b = x_c ⟹ x_a = x_c.
p(z) ∝ ∏_{c∈C} ψ_c(z_c) ∏_{(c,d)∈E(C)} ψ_{cd}(z_c, z_d)
Question: How many edges are needed for global consistency? When can we build a tree-structured clique graph?
An Undirected View of Graph Elimination
Elimination Order: (6, 5, 4, 3, 2, 1)
Ø For the node being eliminated, marginalize over the values of the associated variable.
Ø Compute a new potential function involving all other variables which are neighbors of the just-eliminated node, and are thus related to it by some potential.
Ø If necessary, add edges to create a clique for the newly created potential.
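The marginalize-and-reconnect step above can be sketched for tables of discrete potentials. This is an illustrative sketch, not the lecture's six-node example: the factor representation and the 3-node chain below are assumptions.

```python
# A sketch of variable elimination over discrete potential tables.
# A factor is a pair (vars, table): a tuple of variable ids and a numpy array.
import numpy as np

def multiply(f1, f2):
    """Multiply two factors, broadcasting over the union of their variables."""
    v1, t1 = f1
    v2, t2 = f2
    vs = sorted(set(v1) | set(v2))
    def expand(v, t):
        v = list(v)
        perm = [v.index(x) for x in vs if x in v]       # axes in vs order
        shape = [t.shape[v.index(x)] if x in v else 1 for x in vs]
        return np.transpose(t, perm).reshape(shape)
    return vs, expand(v1, t1) * expand(v2, t2)

def eliminate(factors, var):
    """One elimination step: multiply all factors mentioning `var`, sum it out."""
    touched = [f for f in factors if var in f[0]]
    rest = [f for f in factors if var not in f[0]]
    vs, t = touched[0]
    for f in touched[1:]:
        vs, t = multiply((vs, t), f)
    vs = list(vs)
    new = ([x for x in vs if x != var], t.sum(axis=vs.index(var)))
    return rest + [new]

# unnormalized marginal of variable 0 on the chain 0 - 1 - 2 (toy potentials)
factors = [((0, 1), np.array([[1.0, 2.0], [3.0, 4.0]])),
           ((1, 2), np.array([[0.5, 1.0], [1.0, 0.25]]))]
for v in (2, 1):
    factors = eliminate(factors, v)
marg0 = factors[0][1]
```

The new factor produced by `eliminate` involves exactly the remaining neighbors of the eliminated variable, which is the undirected view of elimination described on this slide.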
An Undirected View of Graph Elimination
Elimination Order: (6, 5, 4, 3, 2, 1)
[Figure: the graphs remaining after the first elimination steps, with edges added among the neighbors of each eliminated node.]
An Undirected View of Graph Elimination
Elimination Order: (6, 5, 4, 3, 2, 1)
[Figure: the final elimination steps, ending with node 1 alone.]
Elimination Clique Tree
The clique tree contains the cliques (fully connected subsets) which are generated as elimination executes.
The separator sets contain the variables which are shared among each linked pair of cliques.
[Figure: elimination cliques for the six-node example, and the resulting clique tree with separator sets.]
CS242: Lecture 2B Outline
Ø Elimination Algorithms & Clique Trees
Ø Junction Tree Algorithm
Ø Loopy Belief Propagation
[Figure: junction tree over cliques {a,b,c}, {b,c,e}, {b,d}, {b,e,f} with separators {b,c}, {b}, {b,e}; COLLECT messages flow toward the root, DISTRIBUTE messages flow away from it.]
Some figures from Paskin (2003)
Marginal Inference Algorithms

           One Marginal                     All Marginals
Tree       elimination applied to           belief propagation, or
           leaves of the tree               sum-product algorithm
Graph      elimination algorithm            junction tree algorithm: belief propagation
                                            on a junction tree (a special clique tree)
Clique Trees and Junction Trees
This clique tree has the junction tree property: the clique nodes containing any variable from the original model form a connected subtree. We can then exactly represent the distribution while ignoring redundant constraints.
Note that not all clique trees are junction trees.
[Figure: a junction tree for the six-node example, and a clique tree that violates the junction tree property.]
Finding a Junction Tree
The Junction Tree Property: A clique tree is a junction tree if, for every pair of cliques V, W, all cliques on the (unique) path between V and W contain the variables shared by V and W.
Example: for the four-node cycle A-B, A-C, B-D, C-D with cliques {A,B}, {A,C}, {C,D}, {B,D}, a chain of these cliques is not a junction tree: B is shared by {A,B} and {B,D} but missing from the cliques between them.
Given a set of cliques, how can we efficiently find a clique tree with the junction tree (running intersection) property? How can we be sure that at least one junction tree exists?
Strategy: Augment the graph with additional edges.
Ø Cliques of the original graph are always subsets of cliques of the augmented graph, so the original distribution still factorizes appropriately.
Ø As cliques grow, we will eventually be able to construct a junction tree.
Question: Which undirected graphs have junction trees?
Junction Trees and Triangulation
The Junction Tree Property: A clique tree is a junction tree if, for every pair of cliques V, W, all cliques on the (unique) path between V and W contain the variables shared by V and W.
[Figure (Barber): (a) Markov network p(a,b,c,d) = (1/Z) ψ(a,b,c) ψ(b,c,d); (b) its clique graph representation {a,b,c} - {b,c} - {b,c,d}, where the cliques carry potentials ψ(a,b,c) and ψ(b,c,d) and the separator potential may be set to the normalization constant.]
A chord is an edge connecting two non-adjacent nodes in some cycle. A cycle is chordless if it contains no chords. A graph is triangulated (chordal) if it contains no chordless cycles of length 4 or more.
Theorem: The maximal cliques of a graph have a corresponding junction tree if and only if that undirected graph is triangulated.
Lemma: For a non-complete triangulated graph with at least 3 nodes, there is a decomposition of the nodes into disjoint sets A, B, S such that S separates A from B, and S is complete.
Ø Key induction argument in constructing a junction tree from a triangulation.
Ø Implies the existence of an elimination ordering which introduces no new edges.
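The triangulation condition can be checked constructively: a graph is chordal exactly when some elimination ordering adds no fill-in edges (a perfect elimination ordering). A small sketch, with the edge-list graph representation assumed for illustration:

```python
# A sketch: run graph elimination under a given ordering and collect the
# fill-in edges it adds. Zero fill-in for some ordering <=> chordal graph.
def fill_in(edges, order):
    """Return the set of fill-in edges added when eliminating nodes in `order`."""
    adj = {v: set() for v in order}
    for a, b in edges:
        adj[a].add(b)
        adj[b].add(a)
    added = set()
    for v in order:
        nbrs = list(adj[v])
        # connect all remaining neighbors of v into a clique
        for i in range(len(nbrs)):
            for j in range(i + 1, len(nbrs)):
                a, b = nbrs[i], nbrs[j]
                if b not in adj[a]:
                    adj[a].add(b)
                    adj[b].add(a)
                    added.add(frozenset((a, b)))
        for u in nbrs:  # remove v from the graph
            adj[u].discard(v)
        del adj[v]
    return added
```

On a chordless 4-cycle every ordering adds at least one chord, while adding that chord up front makes the same ordering fill-in free, matching the theorem above.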
Constructing a Junction Tree
Theorem: A clique tree is a junction tree if and only if it is a maximal spanning tree of the following weighted clique graph:
Ø Graph nodes: one for each maximal clique in the triangulation.
Ø Graph edges: link all pairs of cliques that share at least one variable.
Ø Edge weights: cardinality of the separator set (intersection) of the cliques.
Ø Computational complexity: at worst quadratic in the number of maximal cliques.
[Figure: cliques {A,B,D}, {B,C,D}, {C,D,E} linked by separators {B,D} and {C,D}.]

Junction Tree Algorithms for General-Purpose Inference
1. Triangulate the target undirected graphical model.
Ø Any elimination ordering generates a valid triangulation (linear in graph size).
Ø But finding the optimal triangulation is NP-hard.
2. Arrange the triangulated cliques into a junction tree (at worst quadratic in graph size).
3. Execute the sum-product algorithm on the junction tree (exponential in clique size).
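Step 2 can be sketched as a maximum-weight spanning tree over the maximal cliques, weighting each candidate edge by separator size. The Kruskal-style union-find helper is an implementation assumption; the example cliques are the ones from the Paskin example on the next slide.

```python
# A sketch: build a junction tree as a maximum-weight spanning tree over
# cliques, with edge weight = size of the separator (clique intersection).
def junction_tree(cliques):
    """Kruskal-style maximum-weight spanning tree over a list of clique sets."""
    edges = [(len(cliques[i] & cliques[j]), i, j)
             for i in range(len(cliques))
             for j in range(i + 1, len(cliques))
             if cliques[i] & cliques[j]]
    edges.sort(reverse=True)  # heaviest separators first
    parent = list(range(len(cliques)))
    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x
    tree = []
    for w, i, j in edges:
        ri, rj = find(i), find(j)
        if ri != rj:  # joining two components cannot create a cycle
            parent[ri] = rj
            tree.append((i, j, cliques[i] & cliques[j]))
    return tree

cliques = [frozenset("abc"), frozenset("bce"), frozenset("bd"), frozenset("bef")]
tree = junction_tree(cliques)
```

By the theorem above, any maximum-weight tree returned here satisfies the running intersection property.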
Example: Junction Tree Formation
1. Compute the elimination cliques (the order here is f, d, e, c, b, a).
2. Form the complete cluster graph over the maximal elimination cliques, and find a maximum-weight spanning tree.
[Figure (Paskin 2003): maximal cliques {a,b,c}, {b,c,e}, {b,d}, {b,e,f}; weighting candidate edges by separator size yields the junction tree {a,b,c} -{b,c}- {b,c,e} -{b,e}- {b,e,f}, with {b,d} attached through separator {b}.]
NOTE: For a given set of cliques, there may be multiple valid junction trees.
Example: Junction Tree Formation
[Figure: an undirected graph over variables a through l, shown before and after adding triangulation edges.]
Example: Junction Tree Formation
[Figure: the resulting junction tree, with cliques abf, bcfg, cdhi, dei, cfgj, chik, cjk, jkl and separators bf, cfg, chi, di, cj, ck, jk.]
Reminder: Pairwise Sum-Product BP
Set of neighbors of node t: Γ(t) = {s ∈ V | (s, t) ∈ E}
Message update:  m_ts(x_s) = Σ_{x_t} ψ_st(x_s, x_t) ∏_{u∈Γ(t)\s} m_ut(x_t)
Marginals:  p_t(x_t) ∝ ∏_{s∈Γ(t)} m_st(x_t)
K_t: number of discrete states for random variable x_t
p_t(x_t): marginal distribution of the K_t discrete states of random variable x_t
m_st(x_t): message from node s to node t, a vector of K_t non-negative numbers
m_ts(x_s): message from node t to node s, a vector of K_s non-negative numbers
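The pairwise updates above can be sketched on a small tree. This is an illustrative sketch, not the lecture's example: the 3-node chain, its potentials, and the synchronous update schedule are assumptions.

```python
# A sketch of pairwise sum-product BP on the chain 0 - 1 - 2 (toy potentials).
import numpy as np

K = 2
psi = {(0, 1): np.array([[1.0, 0.5], [0.5, 2.0]]),
       (1, 2): np.array([[2.0, 1.0], [1.0, 3.0]])}
nbrs = {0: [1], 1: [0, 2], 2: [1]}

def edge_pot(s, t):
    """psi_st(x_s, x_t) with x_s on axis 0."""
    return psi[(s, t)] if (s, t) in psi else psi[(t, s)].T

# initialize all messages to uniform, then iterate synchronous updates;
# msgs[(t, s)] is the message from node t to node s, a vector over x_s
msgs = {(s, t): np.ones(K) for s in nbrs for t in nbrs[s]}
for _ in range(5):  # more rounds than the tree diameter, so BP is exact
    new = {}
    for (t, s) in msgs:
        prod = np.ones(K)
        for u in nbrs[t]:
            if u != s:
                prod *= msgs[(u, t)]
        m = edge_pot(s, t) @ prod  # sum over x_t
        new[(t, s)] = m / m.sum()
    msgs = new

def marginal(t):
    b = np.ones(K)
    for s in nbrs[t]:
        b *= msgs[(s, t)]
    return b / b.sum()
```

On a tree, these messages converge after a number of rounds equal to the diameter, and the beliefs match the exact marginals.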
Sum-Product for Junction Trees
z_c = {x_s : s ∈ c}, c ∈ C
p(z) ∝ ∏_{c∈C} ψ_c(z_c) ∏_{(c,d)∈E(C)} ψ_{cd}(z_c, z_d)
ψ_{cd}(z_c, z_d) = 1 if z_c(x_s) = z_d(x_s) for all s ∈ c ∩ d, and 0 otherwise.
Original Undirected Graphical Model
Ø Nodes: discrete random variables x_s.
Ø Cliques: groups of variables over which the joint distribution factorizes into potentials.
Junction Tree
Ø Nodes: maximal cliques in some triangulation.
Ø Node variables: all joint configurations z_c of the variables in the corresponding clique.
Ø Node potentials: potentials from the original model.
Ø Edge potentials: indicator functions forcing consistency of shared variables (separator sets).
[Figure: junction tree for cliques {a,b,c}, {b,c,e}, {b,d}, {b,e,f}.]
Sum-Product for Junction Trees (Shafer-Shenoy)
Ø Express the algorithm via the original variables x_s.
Ø Messages depend on clique intersections (separators): S_ij = S_ji = C_i ∩ C_j.
Ø Efficient schedules compute each message once.
Message update:  μ_ji(x_{S_ji}) ∝ Σ_{x_{R_ji}} ψ_{C_j}(x_{C_j}) ∏_{k∈Γ(j)\i} μ_kj(x_{S_kj}),   where R_ji = C_j \ S_ji
Clique marginals:  p_j(x_{C_j}) ∝ ψ_{C_j}(x_{C_j}) ∏_{i∈Γ(j)} μ_ij(x_{S_ij})
[Figure: COLLECT messages flow toward the root and DISTRIBUTE messages flow away from it on the junction tree over {a,b,c}, {b,c,e}, {b,d}, {b,e,f}.]
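The core of one Shafer-Shenoy message is the marginalization over R_ji = C_j \ S_ji, the clique variables not in the separator. A minimal sketch; the clique labels and table contents are assumptions for illustration:

```python
# A sketch of the marginalization inside one Shafer-Shenoy message:
# sum the clique table over the variables NOT in the separator.
import numpy as np

C_j = ("a", "b", "c")           # variables of clique j (assumed)
S_ji = ("b", "c")               # separator with neighbor i (assumed)
psi_Cj = np.arange(8, dtype=float).reshape(2, 2, 2)  # table over (a, b, c)

# axes of the table corresponding to R_ji = C_j \ S_ji
axes = tuple(i for i, v in enumerate(C_j) if v not in S_ji)
mu_ji = psi_Cj.sum(axis=axes)   # unnormalized message over separator states (b, c)
```

Incoming messages from the other neighbors would be multiplied into the table before this sum; the cost of the sum is what makes the algorithm exponential in clique size.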
Sum-Product for Junction Trees (Shafer-Shenoy)
Ø Express the algorithm via the original variables x_s.
Ø Messages depend on clique intersections (separators).
Ø Efficient schedules compute each message once.
Storage & Computational Cost: O(Σ_j ∏_{s∈C_j} K_s), where x_s ∈ {1, ..., K_s}. Exponential in the sizes of the maximal cliques.
μ_ji(x_{S_ji}) ∝ Σ_{x_{R_ji}} ψ_{C_j}(x_{C_j}) ∏_{k∈Γ(j)\i} μ_kj(x_{S_kj})
p_j(x_{C_j}) ∝ ψ_{C_j}(x_{C_j}) ∏_{i∈Γ(j)} μ_ij(x_{S_ij})
Summary: Junction Tree Algorithm
p_j(x_{C_j}) ∝ ψ_{C_j}(x_{C_j}) ∏_{i∈Γ(j)} μ_ij(x_{S_ij}),   S_ij = S_ji = C_i ∩ C_j
[Figure: junction tree over {a,b,c}, {b,c,e}, {b,d}, {b,e,f} with COLLECT and DISTRIBUTE message schedules.]
Junction Tree Algorithms for General-Purpose Inference
1. If necessary, convert the graphical model to undirected form (linear in graph size).
2. Triangulate the target undirected graph.
Ø Any elimination ordering generates a valid triangulation (linear in graph size).
Ø Finding an optimal triangulation, with minimal cliques, is NP-hard.
3. Arrange the triangulated cliques into a junction tree (at worst quadratic in graph size).
4. Execute the sum-product algorithm on the junction tree (exponential in clique size).
CS242: Lecture 2B Outline Ø Elimination Algorithms & Clique Trees Ø Junction Tree Algorithm Ø Loopy Belief Propagation
Inference for Graphs with Cycles
For graphs with cycles, the dynamic programming BP derivation breaks.
Junction Tree Algorithm
Ø Cluster nodes to break cycles.
Ø Run BP on the tree of clusters.
Ø Exact, but may be intractable.
Loopy Belief Propagation
Ø Iterate local BP message updates on the cyclic graph.
Ø Hope the beliefs converge.
Ø Empirically, often very effective.
Coming later: justification as a variational method.
Sum-Product Belief Propagation
Set of neighbors of node s: Γ(s) = {f ∈ F | s ∈ f}
x_f: clique of variables linked by factor f.
Variable-to-factor messages:  m_sf(x_s) = ∏_{g∈Γ(s)\f} m_gs(x_s) ∝ p_s(x_s) / m_fs(x_s)
Factor-to-variable messages:  m_fs(x_s) = Σ_{x_{f\s}} ψ_f(x_f) ∏_{t∈f\s} m_tf(x_t)
Marginal distribution of each variable:  p_s(x_s) ∝ ∏_{f∈Γ(s)} m_fs(x_s)
Marginal distribution of each factor:  p_f(x_f) ∝ ψ_f(x_f) ∏_{s∈f} m_sf(x_s)
Loopy BP: Message updates can be iteratively computed on graphs with cycles. But the marginals are incorrect!
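The factor-graph updates above can be iterated on a graph with a cycle. A sketch under stated assumptions: the 3-variable cycle, its potentials, and the synchronous flooding schedule are illustrative, not the lecture's example.

```python
# A sketch of loopy BP on a small discrete factor graph with one cycle.
# Factors are (variable tuple, table) pairs over binary variables.
import itertools
import numpy as np

K = 2
factors = [((0, 1), np.array([[1.0, 2.0], [2.0, 1.0]])),
           ((1, 2), np.array([[1.0, 3.0], [3.0, 1.0]])),
           ((0, 2), np.array([[2.0, 1.0], [1.0, 2.0]]))]  # a 3-cycle
V = (0, 1, 2)

def run_bp(iters=50):
    # m_fs[(i, s)]: factor i -> variable s; m_sf[(s, i)]: variable s -> factor i
    m_fs = {(i, s): np.ones(K) for i, (vs, _) in enumerate(factors) for s in vs}
    m_sf = {(s, i): np.ones(K) for i, (vs, _) in enumerate(factors) for s in vs}
    for _ in range(iters):
        # variable -> factor: product of incoming factor messages, excluding f
        new_sf = {}
        for (s, i) in m_sf:
            prod = np.ones(K)
            for j, (vs, _) in enumerate(factors):
                if j != i and s in vs:
                    prod *= m_fs[(j, s)]
            new_sf[(s, i)] = prod / prod.sum()
        m_sf = new_sf
        # factor -> variable: sum out all of the factor's other variables
        new_fs = {}
        for (i, s) in m_fs:
            vs, table = factors[i]
            out = np.zeros(K)
            for x in itertools.product(range(K), repeat=len(vs)):
                val = table[x]
                for t, xt in zip(vs, x):
                    if t != s:
                        val *= m_sf[(t, i)][xt]
                out[x[vs.index(s)]] += val
            new_fs[(i, s)] = out / out.sum()
        m_fs = new_fs
    beliefs = {}
    for s in V:
        b = np.ones(K)
        for i, (vs, _) in enumerate(factors):
            if s in vs:
                b *= m_fs[(i, s)]
        beliefs[s] = b / b.sum()
    return beliefs
```

For this flip-symmetric example the exact marginals are uniform and loopy BP recovers them; on general loopy graphs the converged beliefs are only approximations of the true marginals.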
Low Density Parity Check (LDPC) Codes
Factor graph structure: parity check factors, evidence (observation) factors, data bits, and parity bits.
Ø Each variable node is binary, so x_s ∈ {0, 1}.
Ø Parity check factors equal 1 if the sum of the connected bits is even, and 0 if the sum is odd (invalid codewords are excluded).
Ø Unary evidence factors equal the probability that each bit is a 0 or 1, given the data. This assumes independent noise on each bit.
[Figure: loopy BP decoding after 0, 5, and 15 iterations.]
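The parity check factor semantics can be sketched as a codeword validity test. The two checks below are a toy assumption, not an actual LDPC code:

```python
# A toy sketch: a bit string is a valid codeword when every parity check
# factor sees an even number of 1s among its connected bits.
def parity_ok(bits, checks):
    """bits: list of 0/1 values; checks: list of index tuples, one per factor."""
    return all(sum(bits[i] for i in c) % 2 == 0 for c in checks)

checks = [(0, 1, 2), (1, 2, 3)]  # assumed toy parity checks
```

Running the loopy BP updates of the previous slide on this factor graph, with the parity factors taking values in {0, 1}, is exactly the classic iterative decoding algorithm for LDPC codes.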
Loopy Belief Propagation
A Revolution: Belief Propagation in Graphs With Cycles
Brendan J. Frey, Department of Computer Science, University of Toronto (http://www.cs.utoronto.ca/~frey)
David J. C. MacKay, Department of Physics, Cavendish Laboratory, Cambridge University (http://wol.ra.phy.cam.ac.uk/mackay)
NIPS 1997: https://papers.nips.cc/paper/1467-a-revolution-belief-propagation-in-graphs-with-cycles
UAI 1999: https://arxiv.org/abs/1301.6725