OSU CS 536 Probabilistic Graphical Models. Loopy Belief Propagation and Clique Trees / Join Trees. Slides from Kevin Murphy's Graphical Model Tutorial (with minor changes). Reading: Koller and Friedman Ch 10
Part I: Sum-Product Algorithm and (Loopy) Belief Propagation. (All you need to know is slide 17, which is covered in the excerpt from MacKay's book posted on the course webpage.)
What's wrong with VarElim? Often we want to query all hidden nodes. VarElim takes O(N^2 K^{w+1}) time to compute P(X_i | e) for all (hidden) nodes i. There exist message-passing algorithms that can do this in O(N K^{w+1}) time. Later, we will use these to do approximate inference in O(N K^2) time, independent of w. [Figure: HMM with hidden chain X_1, X_2, X_3 and observations Y_1, Y_2, Y_3] SP2-3
Repeated variable elimination leads to redundant calculations: it takes O(N^2 K^2) time to compute all N marginals. [Figure: HMM with hidden chain X_1, X_2, X_3 and observations Y_1, Y_2, Y_3] SP2-4
Forwards-backwards algorithm [Rabiner89, etc.]. The marginal at X_t combines a forwards prediction (based on Y_{1:t-1}), the local evidence Y_t, and a backwards prediction (based on Y_{t+1:n}). (Use dynamic programming to compute these.) SP2-5
Forwards algorithm (filtering). [Figure: recursive update of the forwards message at X_t from the evidence Y_{1:t-1} and the local evidence Y_t] SP2-6
Backwards algorithm. [Figure: recursive update of the backwards message at X_t from X_{t+1} and the evidence Y_{t+1:n}] SP2-7
Forwards-backwards algorithm. [Figure: forwards messages and backwards messages b_t passed along a chain X_1 ... X_24] The backwards messages are independent of the forwards messages. Combining the two passes takes O(N K^2) time to compute all N marginals, not O(N^2 K^2). SP2-8
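The two passes can be sketched for a discrete HMM as follows (a minimal sketch; the parameter names pi, A, B and the per-step normalization are assumptions, not from the slides):

```python
import numpy as np

def forwards_backwards(pi, A, B, obs):
    """Compute all posterior marginals P(X_t | y_{1:T}) for an HMM
    in O(T K^2) time via one forwards and one backwards pass.

    pi  : (K,)   initial state distribution
    A   : (K, K) transition matrix, A[i, j] = P(X_{t+1}=j | X_t=i)
    B   : (K, M) emission matrix,   B[i, m] = P(Y_t=m | X_t=i)
    obs : length-T sequence of observation indices
    """
    T, K = len(obs), len(pi)
    alpha = np.zeros((T, K))   # forwards messages (filtered, normalized)
    beta = np.ones((T, K))     # backwards messages

    # Forwards pass (filtering).
    alpha[0] = pi * B[:, obs[0]]
    alpha[0] /= alpha[0].sum()
    for t in range(1, T):
        alpha[t] = (alpha[t - 1] @ A) * B[:, obs[t]]
        alpha[t] /= alpha[t].sum()

    # Backwards pass; note it never reads the forwards messages.
    for t in range(T - 2, -1, -1):
        beta[t] = A @ (B[:, obs[t + 1]] * beta[t + 1])
        beta[t] /= beta[t].sum()

    # Combine the two independent passes: gamma[t] is proportional
    # to alpha[t] * beta[t].
    gamma = alpha * beta
    gamma /= gamma.sum(axis=1, keepdims=True)
    return gamma
```

Each pass is O(T K^2), so all T marginals cost the same order as a single filtering sweep.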
Belief propagation [Pearl88, Shafer90, Yedidia01, etc.]. The forwards-backwards algorithm can be generalized to apply to any tree-like graph (one with no loops). For now, we assume pairwise potentials. SP2-9
Absorbing messages. [Figure: node X_t absorbs messages from X_{t-1}, X_{t+1}, and Y_t] SP2-10
Sending messages. [Figure: node X_t sends messages to its neighbors X_{t-1} and X_{t+1}] SP2-11
Centralized protocol. Collect to the root (post-order), then distribute from the root (pre-order). [Figure: numbered message ordering on a tree rooted at R] Computes all N marginals in 2 passes over the graph. SP2-12
Distributed protocol. Computes all N marginals in O(N) parallel updates. SP2-13
Loopy belief propagation. Applying BP to graphs with loops (cycles) can give the wrong answer, because it overcounts evidence. [Figure: loopy graph over Cloudy, Sprinkler, Rain, WetGrass] In practice, it often works well (e.g., error-correcting codes). SP2-14
Why Loopy BP? We can compute exact answers by converting a loopy graph to a junction tree and running BP (see later). However, the resulting Jtree has nodes with O(K^{w+1}) states, so inference takes O(N K^{w+1}) time [w = clique size of the triangulated graph]. We can apply BP to the original graph in O(N K^C) time [C = clique size of the original graph]. To apply BP to a graph with non-pairwise potentials, it is simpler to use factor graphs. SP2-15
Factor graphs [Kschischang01]. A factor graph is a bipartite graph with variable nodes and factor nodes. [Figure: a Bayes net, a Markov net, and a pairwise Markov net over X1..X5, each converted to its factor graph] SP2-16
Loopy BP (see MacKay PDF). [Figure: factor graph with variables x, y, z and factors f_1..f_4] Dashed messages are products of same-color solid messages (and the factor): a variable-to-factor message q_{x→f} is the product of the messages arriving at x from its other factors (an empty product is 1, e.g. q_{z→f_3} = 1), and a factor-to-variable message r_{f→x} is the factor times the incoming q messages, summed over the factor's other variables. SP2-17
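The two message rules can be spelled out on a tiny tree-structured factor graph (a hypothetical example; the factor tables f1, f2, f3 are made up for illustration):

```python
import numpy as np

# Tree-structured factor graph: f1(x) -- x -- f2(x, y) -- y -- f3(y)
f1 = np.array([0.6, 0.4])             # unary factor on x
f3 = np.array([0.3, 0.7])             # unary factor on y
f2 = np.array([[0.9, 0.1],
               [0.2, 0.8]])           # pairwise factor f2[x, y]

# Variable-to-factor: product of messages from all *other* factors.
q_x_f2 = f1.copy()                    # x's only other factor is f1
q_y_f2 = f3.copy()                    # y's only other factor is f3

# Factor-to-variable: factor times incoming q messages, summed over
# the factor's other variables.
r_f2_x = f2 @ q_y_f2                  # sum_y f2[x, y] * q_{y->f2}[y]
r_f2_y = f2.T @ q_x_f2                # sum_x f2[x, y] * q_{x->f2}[x]

# Belief at a variable = product of all incoming factor messages.
p_x = f1 * r_f2_x
p_x /= p_x.sum()
p_y = f3 * r_f2_y
p_y /= p_y.sum()
```

On this tree the beliefs p_x and p_y match the exact marginals of the joint f1(x) f2(x,y) f3(y); on a loopy graph the same updates are iterated and the answers are only approximate.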
Sum-product vs max-product. Sum-product computes marginals; max-product computes max-marginals by replacing the sum with a max in the same message rule. Same algorithm on different semirings: (+, ×, 0, 1) and (max, ×, 0, 1). Shafer90, Bistarelli97, Goodman99, Aji00 SP2-18
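The semiring view means one implementation serves both algorithms. A sketch of a single factor-to-variable update parameterized by the combine operator (the function name and tables are illustrative assumptions):

```python
import numpy as np

def factor_to_var(factor, incoming, combine):
    """One factor-to-variable message for a pairwise factor.

    factor   : (K, K) potential f[x, y]
    incoming : (K,)   message q_{y->f}
    combine  : np.sum for sum-product (marginals),
               np.max for max-product (max-marginals)
    """
    return combine(factor * incoming[None, :], axis=1)

f = np.array([[0.9, 0.1],
              [0.2, 0.8]])
q = np.array([0.3, 0.7])
msg_sum = factor_to_var(f, q, np.sum)   # sum-product message
msg_max = factor_to_var(f, q, np.max)   # max-product message
```

Only the reduction operator changes; the multiply step and the message schedule are identical.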
Viterbi decoding. Compute the most probable explanation (MPE) of the observed data. [Figure: HMM with hidden X_1, X_2, X_3 and observed Y_1, Y_2, Y_3; example: pronunciation model for the word "Tomato"] SP2-19
Viterbi algorithm for HMMs. Run the max-product forwards algorithm, keeping track of the most probable predecessor of each state; then do a pointer traceback. Can produce an N-best list (the N most probable configurations) in O(N T K^2) time. Forney73, Nilsson01 SP2-20
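The max-forwards pass plus traceback can be sketched as follows (a minimal sketch; the variable names delta and psi follow Rabiner's convention and are not from the slides):

```python
import numpy as np

def viterbi(pi, A, B, obs):
    """Most probable state sequence (MPE) for an HMM in O(T K^2) time,
    via the max-product forwards pass plus pointer traceback."""
    T, K = len(obs), len(pi)
    delta = np.zeros((T, K))            # best score ending in each state
    psi = np.zeros((T, K), dtype=int)   # most probable predecessor
    delta[0] = pi * B[:, obs[0]]
    for t in range(1, T):
        scores = delta[t - 1][:, None] * A   # scores[i, j]: i -> j
        psi[t] = scores.argmax(axis=0)       # best predecessor of j
        delta[t] = scores.max(axis=0) * B[:, obs[t]]
    # Traceback from the best final state.
    path = [int(delta[T - 1].argmax())]
    for t in range(T - 1, 0, -1):
        path.append(int(psi[t][path[-1]]))
    return path[::-1]
```

In practice the recursion is usually run in log space to avoid underflow on long sequences; the structure is unchanged.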
Loopy Viterbi. Use max-product to compute/approximate the max-marginals. If there are no ties and the max-marginals are exact, then maximizing each node's max-marginal separately recovers the MPE. This method does not use traceback, so it can be used with distributed/loopy BP. We can break ties, and produce the N most probable configurations, by asserting that certain assignments are disallowed and rerunning. Yanover04 SP2-21
BP speedup tricks. Sometimes we can reduce the time to compute a message from O(K^2) to O(K): if ψ(x_i, x_j) = exp(−||f(x_i) − f(x_j)||^2), then sum-product takes O(K log K) time [exact, FFT] or O(K) time [approx], and max-product takes O(K) time [distance transform] Felzenszwalb03/04, Movellan04, deFreitas04. For general (discrete) potentials, we can dynamically add/delete states to reduce K Coughlan04. Sometimes we can speed up convergence by using a better message-passing schedule (e.g., along embedded spanning trees Wainwright01) or a multiscale method Felzenszwalb04. SP2-22
Part II: Junction Trees / Clique Trees (Not tested material)
Junction/join/clique trees. To perform exact inference in an arbitrary graph, convert it to a junction tree, and then perform belief propagation. A jtree is a tree whose nodes are sets, and which has the jtree property: all sets which contain any given variable form a connected graph (a variable cannot appear in 2 disjoint places). [Figure: moralize the sprinkler network C, S, R, W, then make the jtree] Maximal cliques = { {C,S,R}, {S,R,W} }; Separators = { {C,S,R} ∩ {S,R,W} = {S,R} } SP2-24
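The jtree property can be checked mechanically: for each variable, the cliques containing it must induce a connected subgraph of the tree. A sketch (the function name and the clique/edge representation are assumptions for illustration):

```python
def has_jtree_property(cliques, edges):
    """cliques: list of sets of variable names; edges: (i, j) pairs of
    clique indices. True iff every variable's cliques are connected."""
    for v in set().union(*cliques):
        nodes = [i for i, c in enumerate(cliques) if v in c]
        # Search within the subgraph induced by the cliques containing v;
        # every such clique must be reachable from the first one.
        seen, frontier = {nodes[0]}, [nodes[0]]
        while frontier:
            i = frontier.pop()
            for a, b in edges:
                for j in ((b,) if a == i else (a,) if b == i else ()):
                    if j in nodes and j not in seen:
                        seen.add(j)
                        frontier.append(j)
        if seen != set(nodes):
            return False
    return True

# Sprinkler example from this slide: two cliques joined by one edge.
cliques = [{'C', 'S', 'R'}, {'S', 'R', 'W'}]
edges = [(0, 1)]
```

Here every shared variable (S and R) lives in both cliques, which are adjacent, so the property holds.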
Making a junction tree [Jensen94]. Steps: moralize the DAG; triangulate (elimination order f,d,e,c,b,a); find the max cliques; build the junction graph with edge weights W_ij = |C_i ∩ C_j|; take a max spanning tree to get the jtree. [Figure: example over variables a..f with cliques {a,b,c}, {b,c,e}, {b,e,f}, {b,d}] SP2-25
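The last step (max spanning tree over the junction graph) can be sketched with Kruskal's algorithm on negated weights; the cliques below are the ones from this slide's example:

```python
from itertools import combinations

# Maximal cliques of the triangulated graph, from the slide's example.
cliques = [{'a', 'b', 'c'}, {'b', 'c', 'e'}, {'b', 'e', 'f'}, {'b', 'd'}]

# All candidate junction-graph edges, heaviest separator first,
# where the weight of edge (i, j) is W_ij = |C_i & C_j|.
cand = sorted(combinations(range(len(cliques)), 2),
              key=lambda e: -len(cliques[e[0]] & cliques[e[1]]))

# Kruskal: greedily add the heaviest edge that does not close a cycle,
# using a simple union-find over clique indices.
parent = list(range(len(cliques)))
def find(i):
    while parent[i] != i:
        i = parent[i]
    return i

tree = []
for i, j in cand:
    ri, rj = find(i), find(j)
    if ri != rj:
        parent[ri] = rj
        tree.append((i, j))
```

The result keeps the two weight-2 separators {b,c} and {b,e}, and attaches {b,d} through a weight-1 separator {b}, matching the jtree on the slide.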
Clique potentials. Each model clique potential gets assigned to one jtree clique potential. Each observed variable assigns a delta function to one jtree clique potential: if we observe W = w*, set E(w) = δ(w, w*), else E(w) = 1. [Figure: sprinkler jtree; square nodes are factors] SP2-26
Separator potentials. Separator potentials enforce consistency between neighboring cliques on their common variables. [Figure: sprinkler jtree; square nodes are factors] SP2-27
BP on a jtree. A jtree is an MRF with pairwise potentials: each (clique) node potential contains CPDs and local evidence, and each edge potential acts like a projection function. We do a forwards (collect) pass, then a backwards (distribute) pass. The result is the Hugin / Shafer-Shenoy algorithm. SP2-28
BP on a jtree (collect). Initial clique potentials contain CPDs and evidence. SP2-29
BP on a jtree (collect). The message from a clique to a separator marginalizes the belief (projects onto the intersection) [remove c]. SP2-30
BP on a jtree (collect). Separator potentials get the marginal belief from their parent clique. SP2-31
BP on a jtree (collect). The message from a separator to a clique expands the marginal [add w]. SP2-32
BP on a jtree (collect). The root clique has now seen all the evidence. SP2-33
BP on a jtree (distribute). SP2-34
BP on a jtree (distribute). Marginalize out w and exclude the old evidence (e_c, e_r). SP2-35
BP on a jtree (distribute). Combine upstream and downstream evidence. SP2-36
BP on a jtree (distribute). Add c and exclude the old evidence (e_c, e_r). SP2-37
BP on a jtree (distribute). Combine upstream and downstream evidence. SP2-38
Partial beliefs. [Figure: evidence on R is added at this stage] The beliefs/messages at intermediate stages (before finishing both passes) may not be meaningful, because a given clique may not have seen all the model potentials/evidence (and hence may not be normalizable). This can cause problems when messages may be lost (e.g., in sensor networks). One must reparameterize using the decomposable model to ensure meaningful partial beliefs. Paskin04 SP2-39
Hugin algorithm. Hugin = BP applied to a jtree using a serial protocol: collect, then distribute. [Figure: cliques C_i and C_j communicating via separator S_ij; square nodes are separators] SP2-40
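One full collect/distribute exchange between the two sprinkler cliques can be sketched with the Hugin update: the new separator is the sender's marginal, and the receiver is multiplied by the ratio of new to old separator. The random potentials are placeholders, not the actual sprinkler CPDs:

```python
import numpy as np

rng = np.random.default_rng(0)
phi_csr = rng.random((2, 2, 2))   # potential over {C,S,R}, axes c,s,r
phi_srw = rng.random((2, 2, 2))   # potential over {S,R,W}, axes s,r,w
phi_sep = np.ones((2, 2))         # separator potential over {S,R}
phi_csr0, phi_srw0 = phi_csr.copy(), phi_srw.copy()

# Collect: absorb {C,S,R} into {S,R,W}.
new_sep = phi_csr.sum(axis=0)                        # marginalize out C
phi_srw = phi_srw * (new_sep / phi_sep)[:, :, None]  # multiply by ratio
phi_sep = new_sep

# Distribute: absorb {S,R,W} back into {C,S,R}.
new_sep = phi_srw.sum(axis=2)                        # marginalize out W
phi_csr = phi_csr * (new_sep / phi_sep)[None, :, :]
phi_sep = new_sep

# After both passes, each clique potential equals the (unnormalized)
# marginal of the joint phi_csr0 * phi_srw0 on its own variables.
```

After the two passes the cliques and separator are calibrated: they agree on the {S,R} marginal, which is why a second sweep would change nothing.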