INFERENCE IN BAYESIAN NETWORKS


1 INFERENCE IN BAYESIAN NETWORKS Concha Bielza, Pedro Larrañaga Computational Intelligence Group Departamento de Inteligencia Artificial Universidad Politécnica de Madrid Master Universitario en Inteligencia Artificial

2 C.Bielza, P.Larrañaga -UPM- 2 Basic concepts. Inference in Bayesian networks. Types of queries. Exact inference: brute-force computation, variable elimination algorithm, message passing algorithm. Approximate inference: logic sampling, likelihood weighting, Markov chain Monte Carlo (MCMC).

3 C.Bielza, P.Larrañaga -UPM- 3 Types of queries. Queries: posterior probabilities. Given some evidence e (observations), compute the posterior probability of one or more target variables X, P(X | e); the answer is a vector of probabilities, one per state of X. Other names: probability propagation, belief updating or revision. [Figure: Burglary-Earthquake-Alarm network with News and WCalls nodes.]

4 C.Bielza, P.Larrañaga -UPM- 4 Types of queries. Semantically, for any kind of reasoning. Predictive or deductive reasoning (causal inference): predict effects from causes (disease to symptoms); the target variable is usually a descendant of the evidence. Diagnostic reasoning (diagnostic inference): diagnose the causes from observed effects (symptoms to disease); the target variable is usually an ancestor of the evidence. [Figures: Burglary-Earthquake-Alarm network queried in both directions.]

5 C.Bielza, P.Larrañaga -UPM- 5 Types of queries, for any kind of reasoning. Intercausal reasoning: between causes of a common effect. In the Burglary-Earthquake-Alarm network, B and E are independent of each other. Suppose that A=yes: this raises the probability of both possible causes B and E. Suppose then that B=yes: this explains the observed A, which in turn lowers the probability that E=yes. Two causes are initially independent; if the effect is known, the presence of one explanatory cause renders the alternative cause less likely (it is explained away).

6 C.Bielza, P.Larrañaga -UPM- 6 Types of queries, for any kind of reasoning. Bidirectional reasoning (mixed inference): combine 2 or more of the above, e.g. diagnostic and predictive reasoning, or diagnostic and intercausal reasoning. The arc direction between variables does not restrict the type of query to be asked: probabilistic inference can combine evidence from all parts of the network.

7 C.Bielza, P.Larrañaga -UPM- 7 Types of queries. More queries: joint and likelihood. Posterior joint: conditional probability of several variables, P(X_1,...,X_k | e). The size of the answer to this query is exponential in the number of variables in the joint. Likelihood of the evidence: the simplest query, i.e. the probability of the evidence, P(e).

8 C.Bielza, P.Larrañaga -UPM- 8 Types of queries. More queries: maximum a posteriori (MAP). Most likely configurations (abductive inference): the event that best explains the evidence. Total abduction: search over all the unobserved variables; in general this cannot be computed component-wise with max P(x_i | e). Partial abduction: search over a subset of the unobserved variables (the explanation set). One may also ask for the K most likely explanations. [Figures: Burglary-Earthquake-Alarm network with different sets of query nodes.]

9 C.Bielza, P.Larrañaga -UPM- 9 Types of queries. More queries: maximum a posteriori (MAP). MAP is equivalent to maximizing P(x, e) over the configurations x of the explanation set (the normalizing constant P(e) does not change the argmax). Use MAP for: Classification: find the most likely label, given the evidence. Explanation: what is the most likely scenario, given the evidence.

10 C.Bielza, P.Larrañaga -UPM- 10 Types of queries More queries: decision-making Optimal decisions (of maximum expected utility), with influence diagrams

11 [Pearl 88; Lauritzen & Spiegelhalter 88] Brute-force computation of P(X | e). First, consider P(X_i), without observed evidence e. Conceptually simple but computationally complex. For a BN with n variables, each with its CPT P(X_j | Pa(X_j)), the brute-force approach marginalizes the joint: P(X_i) = Σ over all the other variables of Π_j P(X_j | Pa(X_j)). But this amounts to computing the JPD, often very inefficient and even computationally intractable. CHALLENGE: without computing the JPD, exploit the factorization encoded by the BN and the distributive law (local computations). C.Bielza, P.Larrañaga -UPM- 11

12 C.Bielza, P.Larrañaga -UPM- 12 Easy inference cases: simple forward inference. Computing a prior requires simple forward propagation of probabilities: P(J) = Σ_{M,E} P(J | M, E) P(M, E) (marginalization) = Σ_{M,E} P(J | M) P(M | E) P(E) (chain rule and cond. indep.) = Σ_M P(J | M) Σ_E P(M | E) P(E) (distributive law). All terms used are CPTs in the BN; only ancestors of J are considered.
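As a minimal sketch (not from the slides), the same forward propagation can be coded directly for a chain E → M → J; all CPT numbers below are hypothetical illustrations:

```python
# Minimal sketch of the forward propagation above for a chain E -> M -> J.
# The CPT numbers are hypothetical, purely for illustration.
P_E = {1: 0.1, 0: 0.9}                      # P(E)
P_M_given_E = {(1, 1): 0.7, (1, 0): 0.2,    # P(M=m | E=e), keyed by (m, e)
               (0, 1): 0.3, (0, 0): 0.8}
P_J_given_M = {(1, 1): 0.9, (1, 0): 0.05,   # P(J=j | M=m), keyed by (j, m)
               (0, 1): 0.1, (0, 0): 0.95}

# Inner sum (distributive law): P(M=m) = sum_e P(M=m | E=e) P(E=e)
P_M = {m: sum(P_M_given_E[(m, e)] * P_E[e] for e in (0, 1)) for m in (0, 1)}

# Outer sum: P(J=j) = sum_m P(J=j | M=m) P(M=m)
P_J = {j: sum(P_J_given_M[(j, m)] * P_M[m] for m in (0, 1)) for j in (0, 1)}

print(P_J)  # the two values sum to 1
```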

13 C.Bielza, P.Larrañaga -UPM- 13 Easy inference cases: simple forward inference. The same idea applies when we have upstream evidence: P(J | E) = Σ_M P(J | M, E) P(M | E) = Σ_M P(J | M) P(M | E).

14 C.Bielza, P.Larrañaga -UPM- 14 Improving brute-force. Use the JPD factorization and the distributive law. With five binary variables, the brute-force table (the JPD) has 32 entries.

15 C.Bielza, P.Larrañaga -UPM- 15 Improving brute-force. Arrange the computations effectively, moving some additions inside: first over X_5 and X_3, then over X_4. The biggest table created has 8 entries (like the largest CPT in the BN).

16 C.Bielza, P.Larrañaga -UPM- 16 Improving brute-force. Iteratively: move all irrelevant terms outside of the innermost sum; perform the innermost sum, getting a new term; insert the new term into the product.

17 C.Bielza, P.Larrañaga -UPM- 17 Improving brute-force. Comparing both: (1) Brute-force approach: a table with 32 entries; 52 multiplications (combining the tables in a suitable way) and 30 additions (marginalizations: 16, 8, 4, 2). (2) Factorization & distributive law: one table with 8 entries and three with 4 entries; 14 multiplications and 14 additions (marginalizations).

18 C.Bielza, P.Larrañaga -UPM- 18 Complexity of exact inference in BNs. In BNs without loops (cycles in the underlying undirected graph), i.e. polytrees, inference is easy: you can arrange the additions so as not to create tables bigger than those included in the BN. The complexity of the previous method is exponential in the width (number of variables) of the biggest CPT used in the process. Otherwise, in general BNs, inference is NP-complete [Cooper 1990]. This does not mean we cannot solve inference; it implies that we cannot find a general procedure that works efficiently for all networks. The key to efficient inference lies in finding a good summation order (elimination/deletion order).

19 C.Bielza, P.Larrañaga -UPM- 19 Recall Cycle Loop

20 C.Bielza, P.Larrañaga -UPM- 20 Recall: Polytree = DAG without loops; there is only one path between any pair of nodes (= singly connected graph). Tree = each node has one parent, except the root node.

21 C.Bielza, P.Larrañaga -UPM- 21 Recall: types of directed graphs

22 C.Bielza, P.Larrañaga -UPM- 22 Variable elimination algorithm. Wanted: P(X_i). Input: a list with all the functions (CPTs) of the problem. Select an elimination order of all variables (except X_i). For each X_k in that order, if F is the set of functions in the list that involve X_k: delete F from the list; eliminate the variable X_k, i.e. combine (multiply) all the functions that contain this variable and marginalize out X_k, obtaining a new function f; add f to the list. Output: the combination (multiplication) of all the functions in the current list.
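A rough, generic Python sketch of this elimination loop, assuming binary variables and factors stored as (variables, table) pairs; the helper names (multiply, sum_out, variable_elimination) are ours, not part of the slides:

```python
from itertools import product

def multiply(f, g):
    """Combine two factors by pointwise multiplication over the union of their domains."""
    fv, ft = f
    gv, gt = g
    vars_ = tuple(dict.fromkeys(fv + gv))          # union of variables, preserving order
    table = {}
    for assig in product((0, 1), repeat=len(vars_)):
        point = dict(zip(vars_, assig))
        table[assig] = (ft[tuple(point[v] for v in fv)] *
                        gt[tuple(point[v] for v in gv)])
    return vars_, table

def sum_out(f, var):
    """Marginalize 'var' out of factor f."""
    fv, ft = f
    keep = tuple(v for v in fv if v != var)
    idx = fv.index(var)
    table = {}
    for assig, value in ft.items():
        reduced = tuple(x for i, x in enumerate(assig) if i != idx)
        table[reduced] = table.get(reduced, 0.0) + value
    return keep, table

def variable_elimination(factors, order):
    """Eliminate the variables in 'order'; return the combination of the remaining factors."""
    factors = list(factors)
    for var in order:
        involved = [f for f in factors if var in f[0]]   # the set F of the slide
        if not involved:
            continue
        rest = [f for f in factors if var not in f[0]]
        combined = involved[0]
        for f in involved[1:]:
            combined = multiply(combined, f)
        rest.append(sum_out(combined, var))              # the new function f
        factors = rest
    result = factors[0]
    for f in factors[1:]:
        result = multiply(result, f)
    return result
```

To obtain P(X_i) without evidence, pass all the CPTs of the BN as factors together with an elimination order over every variable except X_i.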

23 C.Bielza, P.Larrañaga -UPM- 23 Variable elimination algorithm Repeat the algorithm for each target variable

24 C.Bielza, P.Larrañaga -UPM- 24 Example with Asia network Visit to Asia (A) Smoking (S) Tuberculosis (T) Lung Cancer (L) Tub. or Lung Canc (E) Bronchitis (B) X-Ray (X) Dyspnea (D)

25 C.Bielza, P.Larrañaga -UPM- 25 Brute-force approach. Compute P(D) by brute force: P(d) = Σ_x Σ_b Σ_e Σ_l Σ_t Σ_s Σ_a P(a, s, t, l, e, b, x, d). Complexity is exponential in the size of the graph (number of variables × number of states for each variable).

26 C.Bielza, P.Larrañaga -UPM- 26 [VE steps on the Asia network.] Note: an intermediate factor is not necessarily a probability term.

27 C.Bielza, P.Larrañaga -UPM- 27 [VE steps on the Asia network, continued.]

28 Variable elimination algorithm. The biggest table has size 8. Local computations (due to moving the additions). The elimination ordering matters, but finding an optimal (minimum-cost) one is NP-hard [Arnborg et al. 87]; heuristics are used to find good sequences. Discard parts of the net that are irrelevant for the query: we can prune all variables that, given e, are c.i. of the target variable (Bayes-Ball algorithm, Shachter 98). Complexity is exponential in the maximum number of variables in the factors of the summation. C.Bielza, P.Larrañaga -UPM- 28

29 C.Bielza, P.Larrañaga -UPM- 29 Now, with evidence e. We have observed E = e. For each evidence variable E_i, identify the functions f in which it appears and restrict them to the observed value E_i = e_i.

30 C.Bielza, P.Larrañaga -UPM- 30 VE algorithm: dealing with evidence. [Asia network with nodes V, S, T, L, A, B, X, D.] Suppose we get evidence e (instantiation to an observed value): V = t, S = f, D = t. We want to compute P(L, V = t, S = f, D = t). The joint factorizes as P(V) P(S) P(T | V) P(L | S) P(B | S) P(A | T, L) P(X | A) P(D | A, B); after instantiating the evidence, it is a function of T, L, B, A, X only.

31 C.Bielza, P.Larrañaga -UPM- 31 VE algorithm: dealing with evidence. Since we know that V = t, we don't need to eliminate V. Instead, we can replace the factors P(V) and P(T | V) with f_{P(V)} = P(V = t) and f_{P(T|V)}(T) = P(T | V = t). These select the appropriate parts of the original factors given the evidence. Note that f_{P(V)} is a constant, and thus does not appear in the elimination of other variables.
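A small sketch of this restriction step, reusing the factor representation of the VE sketch above (the function name restrict is ours):

```python
def restrict(f, var, value):
    """Fix 'var' to 'value' in factor f; 'var' disappears from the factor's domain."""
    fv, ft = f
    if var not in fv:
        return f
    idx = fv.index(var)
    keep = tuple(v for v in fv if v != var)
    table = {tuple(x for i, x in enumerate(a) if i != idx): p
             for a, p in ft.items() if a[idx] == value}
    return keep, table

# E.g. restricting P(T | V) to V = 1 yields a factor over T only (the column P(T | V=1)),
# and restricting P(V) to V = 1 yields a constant factor with an empty domain.
```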

32 C.Bielza, P.Larrañaga -UPM- 32 VE algorithm: dealing with evidence. Initial factors, after setting the evidence: f_{P(v)} f_{P(s)} f_{P(t|v)}(T) f_{P(l|s)}(L) f_{P(b|s)}(B) P(A | T, L) P(X | A) f_{P(d|a,b)}(A, B). Eliminating X, we get: f_{P(v)} f_{P(s)} f_{P(t|v)}(T) f_{P(l|s)}(L) f_{P(b|s)}(B) P(A | T, L) f_X(A) f_{P(d|a,b)}(A, B), where f_X(A) = Σ_x P(x | A).

33 C.Bielza, P.Larrañaga -UPM- 33 VE algorithm: dealing with evidence. Eliminating T, we get: f_{P(v)} f_{P(s)} f_{P(l|s)}(L) f_{P(b|s)}(B) f_T(A, L) f_X(A) f_{P(d|a,b)}(A, B). Eliminating A, we get: f_{P(v)} f_{P(s)} f_{P(l|s)}(L) f_{P(b|s)}(B) f_A(L, B). Eliminating B, we get: f_{P(v)} f_{P(s)} f_{P(l|s)}(L) f_B(L).

34 C.Bielza, P.Larrañaga -UPM- 34 Message passing algorithm. Operates by passing messages among the nodes of the network; nodes act as processors that receive, compute and send information. These are called propagation algorithms. Clique tree propagation is based on the same principle as VE but with a sophisticated caching strategy that: enables computing the posterior probability distribution of all variables in twice the time it takes to compute that of one single variable; works in an intuitively appealing fashion, namely message propagation.

35 C.Bielza, P.Larrañaga -UPM- 35 Basic operations for a node. Ask-info(i,j): the target node i asks node j for information; it does so for all its neighbors j, and they do the same until there are no nodes left to ask. Send-message(j,i): each node sends a message to the node that asked it for information, until the target node is reached. A message is defined over the intersection S_ij of the domains of f_i and f_j; it is computed by combining f_j with the messages received from j's other neighbors and marginalizing onto S_ij. Finally, we calculate locally at each node i: the target node combines all the received information with its own function and marginalizes over the target variable.
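A sketch of the Send-message operation, again reusing the multiply and sum_out helpers of the VE sketch above (names and representation are ours):

```python
def send_message(f_j, incoming, separator):
    """Message from node j to node i: combine j's own function with the messages
    received from j's other neighbors, then marginalize onto the separator."""
    msg = f_j
    for m in incoming:                  # messages from every neighbor k != i
        msg = multiply(msg, m)
    for var in msg[0]:                  # sum out everything outside the separator
        if var not in separator:
            msg = sum_out(msg, var)
    return msg
```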

36 C.Bielza, P.Larrañaga -UPM- 36 CollectEvidence Procedure for X 2 Ask

37 C.Bielza, P.Larrañaga -UPM- 37 P(X 2 ) as a message passing algorithm?

38 C.Bielza, P.Larrañaga -UPM- 38 Correspondence between VE and the message passing algorithm: there is a direct correspondence between the messages and the intermediate factors created by VE. [Figure]

39 C.Bielza, P.Larrañaga -UPM- 39 Computing the probabilities P(X_i | e) of all (unobserved) variables at the same time. We could perform the previous process for each node, but many messages would be repeated! Instead, we can use 2 rounds of messages as follows: select a node as the root (or pivot); ask for or collect evidence from the leaves toward the root (messages in the downward direction), as in VE; distribute evidence from the root toward the leaves (messages in the upward direction); calculate the marginal distribution at each node by local computation, i.e. using its incoming messages. This algorithm never constructs tables larger than those in the BN.

40 C.Bielza, P.Larrañaga -UPM- 40 Message passing algorithm. First sweep: CollectEvidence toward the root node X_4. Second sweep: DistributeEvidence. [Figure]

41 C.Bielza, P.Larrañaga -UPM- 41 Networks with loops. If the net is not a polytree, the algorithm does not work: requests/messages go around a cycle indefinitely (information travels through 2 paths and is counted twice), and the independence assumptions applied in the algorithm cannot be used here (it no longer holds that any node separates the graph into 2 unconnected parts, i.e. polytrees). Alternatives?

42 C.Bielza, P.Larrañaga -UPM- 42 Alternative 1: conditioning method. Cut the multiple paths between nodes by instantiating some variables included in the loops; we will then have a polytree and its algorithms may be applied. Alternative 2: clustering methods. Group variables into an auxiliary, simpler representation and structure the clusters so that we finally have a polytree over this secondary structure, usually a clique tree or junction tree.

43 C.Bielza, P.Larrañaga -UPM- 43 Complexity. The complexity of propagation algorithms in polytrees is linear in the size (nodes + arcs) of the network [brute force is exponential]. In multiply-connected BNs it is an NP-complete problem (both alternatives have this complexity, and neither of them is uniformly better; they are complementary, and mixed algorithms exist).

44 C.Bielza, P.Larrañaga -UPM- 44 Alternative 1: conditioning method. Without loops, any node D can separate the graph into 2 unconnected parts: 1. its parents and the nodes to which it is connected passing through its parents; 2. its children and the nodes to which it is connected passing through its children. Both sets of nodes are c.i. given D. This idea is used in the message passing algorithm. [Figure]

45 C.Bielza, P.Larrañaga -UPM- 45 Alternative 1: conditioning method. With loops we cannot, but we can cut the loops: 1. Fix the (arbitrary) state of some nodes, called the cutset, e.g. {C} with C=c; the network becomes a polytree. 2. Absorb the evidence; the topology changes: the arc from C is removed and P(F | C=c, D) is placed at F. 3. Apply any polytree algorithm, computing as a polytree for each value c (and combining the results over the cutset values). Minimize the cutset size (heuristics; the problem is NP-complete).

46 C.Bielza, P.Larrañaga -UPM- 46 Alternative 1: conditioning method. Another example of a cutset: a loop A-B-C. We can cut the loop by taking {A} as the cutset. Option 1: replace the prior of B with P(B | A=a). Option 2: replace the prior of C with P(C | A=a). Only one arc is absorbed (otherwise the graph becomes unconnected). [Figure]

47 C.Bielza, P.Larrañaga -UPM- 47 Alternative 2: clustering methods [Lauritzen & Spiegelhalter 88]. The method implemented in the main BN software packages. Transform the BN into a probabilistically equivalent polytree by merging nodes, removing the multiple paths between two nodes. Example: metastatic cancer (M) is a possible cause of brain tumors (B) and an explanation for increased total serum calcium (S); in turn, either of these could explain a patient falling into a coma (C); severe headache (H) is also associated with brain tumors. Create a new node Z = (S, B) that combines S and B, with states {tt, ft, tf, ff}. Then P(Z | M) = P(S | M) P(B | M), since S and B are c.i. given M, and P(H | Z) = P(H | B), since H is c.i. of S given B.

48 C.Bielza, P.Larrañaga -UPM- 48 Alternative 2: clustering methods. Steps of the JUNCTION TREE CLUSTERING ALGORITHM: a COMPILATION phase transforms the BN into a (junction) tree (slow, much memory if the graph is dense, but done only once); belief updating is then fast. 1. Moralize the BN. 2. Triangulate the moral graph and obtain the cliques. 3. Create the junction tree and its separators. 4. Compute the new parameters. 5. Run the message passing algorithm.

49 C.Bielza, P.Larrañaga -UPM- 49 Alternative 2: clustering methods. 1. MORALIZE the BN: connect ('marry') all parents with a common child and remove the arrow directions to obtain the moral graph. [Figure: DAG over M, S, B, C, H and its moral graph, with the added edge S-B.] This keeps the dependencies that would otherwise be lost when transforming the DAG into an undirected graph (independence in the moral graph implies independence in the BN).
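A minimal sketch of moralization in Python, assuming the DAG is given as a dict mapping each node to its list of parents (representation and names are ours):

```python
from itertools import combinations

def moralize(parents):
    """Return the set of undirected edges of the moral graph."""
    edges = set()
    for child, pa in parents.items():
        for p in pa:                          # drop directions on parent-child links
            edges.add(frozenset((p, child)))
        for p, q in combinations(pa, 2):      # 'marry' parents with a common child
            edges.add(frozenset((p, q)))
    return edges

# Example (cancer network): M -> S, M -> B, S -> C, B -> C, B -> H
dag = {"M": [], "S": ["M"], "B": ["M"], "C": ["S", "B"], "H": ["B"]}
print(moralize(dag))   # contains the moral edge {S, B}
```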

50 C.Bielza, P.Larrañaga -UPM- 50 Alternative 2: clustering methods. 2. TRIANGULATE the moral graph (needed for having a junction tree): add edges so that every cycle of length > 3 contains a chord, an edge between 2 nonconsecutive nodes (i.e. there is a subcycle composed of exactly 3 of its nodes); this produces a triangulated or chordal graph, so that we do not create functions defined over non-joined groups of nodes. Not necessary here (the moral graph is already triangulated). Different triangulations produce different clusters (and different table sizes at the compound nodes). Finding an optimal triangulation is NP-complete, so heuristics are used: preserve the original topology as much as possible, add few edges.

51 C.Bielza, P.Larrañaga -UPM- 51 Alternative 2: clustering methods. 2. TRIANGULATE the moral graph: the added edges are called fill-ins, obtained by the fill-in process guided by a deletion sequence: before deleting a node X and all its edges, we add new edges to make the subgraph given by X and its neighbors complete. [Figure: moral graph, triangulation-via-elimination with ordering {1,2,3,4,5,6}, and the resulting triangulated graph.]
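A sketch of triangulation-via-elimination, assuming the (moral) graph is given as a dict of adjacency sets and a deletion sequence is supplied (names are ours):

```python
from itertools import combinations

def triangulate(adj, order):
    """Return the fill-in edges added while eliminating nodes in 'order'."""
    adj = {v: set(nbrs) for v, nbrs in adj.items()}   # work on a copy
    fill_ins = []
    for x in order:
        nbrs = list(adj[x])
        for u, v in combinations(nbrs, 2):            # complete the subgraph of x's neighbors
            if v not in adj[u]:
                adj[u].add(v)
                adj[v].add(u)
                fill_ins.append((u, v))
        for n in nbrs:                                # delete x and its edges
            adj[n].discard(x)
        del adj[x]
    return fill_ins

# The triangulated graph is the original graph plus the returned fill-in edges.
```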

52 C.Bielza, P.Larrañaga -UPM- 52 Alternative 2: clustering methods. Triangulating a graph does not mean dividing it into triangles. [Figure: a correctly triangulated graph vs. an incorrect 'division into triangles'; don't do the latter!]

53 C.Bielza, P.Larrañaga -UPM- 53 Alternative 2: clustering methods Not triangulated Triangulated

54 C.Bielza, P.Larrañaga -UPM- 54 Alternative 2: clustering methods. 2. Triangulate the moral graph and obtain the cliques: a clique is a maximal complete subgraph (all its nodes are pairwise linked and it is not a subset of another complete set). Identify them during the fill-in process (complete subgraphs that are maximal). In the examples, the cliques are {M,S,B}, {S,B,C}, {B,H} and, for the numbered graph, {1,2,3}, {2,3,4}, {3,4,5}, {4,5,6}.

55 C.Bielza, P.Larrañaga -UPM- 55 Alternative 2: clustering methods. 3. Create the JUNCTION TREE and its separators: the JT is an undirected tree that contains all the cliques as nodes. The JT must satisfy the following property: given two clique nodes X and Y, X ∩ Y must be contained in all the nodes on the path between X and Y. Separator: the intersection of adjacent clique nodes. [Figure: a hypergraph of cliques, a tree that is not a JT, and a valid JT.]

56 C.Bielza, P.Larrañaga -UPM- 56 Alternative 2: clustering methods. In the examples: {M,S,B} - [S,B] - {S,B,C} - [B] - {B,H}, and {1,2,3} - [2,3] - {2,3,4} - [3,4] - {3,4,5} - [4,5] - {4,5,6} (separators in brackets). Order the cliques and try to link them so as to create the biggest separators.
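A sketch of this linking step: the cliques of a triangulated graph can be joined into a junction tree by greedily picking links with the biggest separators (a maximum-weight spanning tree); the code and names below are ours:

```python
from itertools import combinations

def junction_tree(cliques):
    """cliques: list of frozensets. Returns a list of (i, j, separator) links."""
    candidates = sorted(((len(ci & cj), i, j)
                         for (i, ci), (j, cj) in combinations(enumerate(cliques), 2)),
                        reverse=True)                  # biggest separators first
    component = list(range(len(cliques)))              # simple union-find by relabeling
    links = []
    for weight, i, j in candidates:
        if component[i] != component[j] and weight > 0:
            links.append((i, j, cliques[i] & cliques[j]))
            old, new = component[j], component[i]
            component = [new if c == old else c for c in component]
    return links

cliques = [frozenset("MSB"), frozenset("SBC"), frozenset("BH")]
print(junction_tree(cliques))   # links {M,S,B}-{S,B,C} (sep {S,B}) and {S,B,C}-{B,H} (sep {B})
```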

57 C.Bielza, P.Larrañaga -UPM- 57 Alternative 2: clustering methods. 4. Compute the NEW PARAMETERS (new potentials): each potential (CPT) is attached to a clique node containing its domain. If a node is not attached to any function, attach the identity function to it. Whenever more than one potential is attached, the potential at the node is the product of all of them. Result: the product of all the node potentials on the junction tree is the product of all the CPTs in the original BN (same information, the JPD, in a different representation).

58 C.Bielza, P.Larrañaga -UPM- 58 Alternative 2: clustering methods. In the examples: cliques C_1 = {M,S,B}, C_2 = {S,B,C}, C_3 = {B,H} with separators {S,B} and {B}; and cliques C_1 = {1,2,3}, C_2 = {2,3,4}, C_3 = {3,4,5}, C_4 = {4,5,6} with separators {2,3}, {3,4}, {4,5}.

59 C.Bielza, P.Larrañaga -UPM- 59 Alternative 2: clustering methods Another example:

60 C.Bielza, P.Larrañaga -UPM- 60 Alternative 2: clustering methods. 5. MESSAGE passing algorithm over the JT. Applying the propagation algorithm over the JT gives the Shenoy-Shafer architecture: store 2 messages at each separator (one for each direction). Computing messages: the message from a clique to a neighbor is obtained by multiplying the clique potential by the messages from all its other neighbors and summing out the residual set (the clique variables not in the separator S_ij). After a full propagation (upward + downward) all the separators are full and each clique potential multiplied by its incoming messages gives the marginal over that clique; then marginalize to obtain the distributions of individual variables.

61 C.Bielza, P.Larrañaga -UPM- 61 Alternative 2: clustering methods. With evidence, as always, the potentials are first restricted to the observed values. Suppose A=y, X=y. [Figure]

62 C.Bielza, P.Larrañaga -UPM- 62 Alternative 2: clustering methods. If there is only one query variable Q, find a clique C_Q that contains Q and use it as the pivot in inference. E.g.: compute P(L | A=y, X=y). [Figure: possible pivot cliques.]

63 C.Bielza, P.Larrañaga -UPM- 63 Alternative 2: clustering methods. Message passing from the leaves to the pivot (Shenoy-Shafer): 1. Collect evidence. [Figure: messages toward the pivot; the answer is read off at the pivot.]

64 C.Bielza, P.Larrañaga -UPM- 64 Alternative 2: clustering methods. Message passing from the pivot to the leaves (Shenoy-Shafer): 2. Distribute evidence. [Figure: the distributed messages; 'not f5', 'not f1' indicate factors excluded from the corresponding messages.] Complexity is exponential in the maximum clique size.

65 C.Bielza, P.Larrañaga -UPM- 65 Alternative 2: clustering methods. Summary: DAG → moral graph → triangulated graph → identifying cliques → junction tree → message passing.

66 C.Bielza, P.Larrañaga -UPM- 66 Approximate inference. Why? Because exact inference is intractable (NP-complete): with large (40+ variables) and densely connected BNs, the cliques associated with the junction tree algorithm or the intermediate factors in the VE algorithm grow in size, generating an exponential blowup in the number of computations performed. Both deterministic methods and stochastic simulation are used to find approximate answers.

67 C.Bielza, P.Larrañaga -UPM- 67 Approximate inference. Deterministic algorithms simplify the model: eliminate arcs that encode almost independent nodes (weak dependences measured using the Kullback-Leibler divergence) [Engelen 97]; eliminate nodes that are far away from the target node (localized partial evaluation algorithm) [Draper 95]; replace low probabilities by zeros [Jensen and Andersen 90]; reduce the cardinality of the CPTs (state space abstraction) [Wellman & Liu 94]; use alternative representations of the CPTs that join similar probabilities: rules [Poole 98] or probability trees [Cano et al 03].

68 C.Bielza, P.Larrañaga -UPM- 68 Approximate inference. Stochastic simulation: uses the network to generate a large number of cases (full instantiations) from the network distribution. P(X_i | e) is estimated from these cases by counting observed frequencies in the samples; by the Law of Large Numbers, the estimate converges to the exact probability as more cases are generated. Approximate propagation in BNs within an arbitrary tolerance or accuracy is itself an NP-complete problem. In practice, if e is not too unlikely, convergence is quick.

69 C.Bielza, P.Larrañaga -UPM- 69 Approximate inference. Probabilistic logic sampling [Henrion 88]. Given an ancestral ordering of the nodes (parents before children), sample each variable X once its parents have been sampled (i.e. from the root nodes down to the leaves), using its conditional probability given the known values of its parents. When all the nodes have been visited, we have a case, an instantiation of all the nodes in the BN. A forward sampling algorithm. Repeat and use the observed frequencies to estimate P(X_i | e).
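A minimal sketch of probabilistic logic sampling for binary nodes, assuming the network is given as an ancestral node ordering, a parents dict and a CPT giving P(node = 1 | parent values); all data structures and names are ours:

```python
import random

def logic_sample(nodes, parents, cpt):
    """Generate one full instantiation, sampling parents before children."""
    case = {}
    for x in nodes:                                    # ancestral ordering
        pa_values = tuple(case[p] for p in parents[x])
        p1 = cpt[x][pa_values]                         # P(x = 1 | pa(x) = pa_values)
        case[x] = 1 if random.random() < p1 else 0
    return case

def estimate(nodes, parents, cpt, target, evidence, n=10000):
    """Estimate P(target = 1 | evidence) by keeping only evidence-consistent samples."""
    kept = hits = 0
    for _ in range(n):
        case = logic_sample(nodes, parents, cpt)
        if all(case[v] == val for v, val in evidence.items()):
            kept += 1
            hits += case[target]
    return hits / kept if kept else float("nan")
```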

70 C.Bielza, P.Larrañaga -UPM- 70 Approximate inference. Probabilistic logic sampling. Suppose we obtain the following samples of (X_1,...,X_6): (0,1,1,1,1,1), (0,1,0,1,1,1), (1,0,0,1,1,1), (0,0,1,1,1,0), (1,1,1,1,0,0). Then the probabilities are estimated by the observed relative frequencies (e.g. P(X_1=1) ≈ 2/5). With evidence, e.g. X_2=1, we discard the third and fourth samples (which have X_2=0) and repeat until having a sample of size 5 as desired, e.g.: (0,1,1,1,1,1), (0,1,0,1,1,1), (1,1,0,0,1,1), (1,1,1,1,1,0), (1,1,1,1,0,0).

71 C.Bielza, P.Larrañaga -UPM- 71 Approximate inference. Probabilistic logic sampling. It works because there is a general simulation scheme with the following idea to simulate from (X_1,...,X_r): if each factor of the chain-rule factorization is simple to sample from, then for i = 1 to r, generate x_i ~ X_i | x_1,...,x_{i-1}, and return (x_1,...,x_r).

72 C.Bielza, P.Larrañaga -UPM- 72 Approximate inference. Likelihood weighting [Fung & Chang 90; Shachter & Peot 90]. PLS is easily generalized to more than one query node. When approximating P(X_i | e), it rejects all the samples not consistent with e. Problem: if e is unlikely, most of the cases are discarded (they do not contribute to the frequency counts), which is inefficient. Example: if we observe X_2=1 and P(X_2=1)=0.0064, we need about 10,000 trials to get 64 valid samples; obtaining a significant number of samples becomes intractable. Likelihood weighting avoids so many rejections of PLS.

73 C.Bielza, P.Larrañaga -UPM- 73 Approximate inference. Likelihood weighting: don't sample the evidence variables E; fix their values E = e. Sample the rest as in PLS. Instead of adding 1 to the run count, the CPTs of the evidence nodes are used to determine how likely that evidence combination is: for a sample i, assign a weight w_i given by the likelihood of the evidence given its parents, w_i = Π_j P(e_j | pa(E_j)). In PLS, by contrast, w_i = 1 for samples consistent with e and w_i = 0 otherwise.
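A sketch of likelihood weighting in the same representation as the logic sampling sketch above; evidence nodes are fixed and each sample gets the weight just described (names are ours):

```python
import random

def likelihood_weighting(nodes, parents, cpt, target, evidence, n=10000):
    """Estimate P(target = 1 | evidence) with weighted samples."""
    num = den = 0.0
    for _ in range(n):
        case, w = {}, 1.0
        for x in nodes:                                    # ancestral ordering
            pa_values = tuple(case[p] for p in parents[x])
            p1 = cpt[x][pa_values]                         # P(x = 1 | pa(x))
            if x in evidence:
                case[x] = evidence[x]                      # don't sample: fix E = e
                w *= p1 if evidence[x] == 1 else 1.0 - p1  # weight by its likelihood
            else:
                case[x] = 1 if random.random() < p1 else 0
        num += w * case[target]
        den += w
    return num / den if den else float("nan")
```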

74 C.Bielza, P.Larrañaga -UPM- 74 Approximate inference. Likelihood weighting: example. Query P(C=n | B=y, E=y), with evidence e = {B=y, E=y}. [Figure: network over A, B, C, D, E with CPT entries P(A=y)=0.2, P(B=y|A=n)=0.4, P(C=y|A=n)=0.7, P(D=y|B=y)=0.7, P(E=y|C=n,B=y)=0.8.] Sampled cases and weights: (A=n,B=y,C=n,D=y,E=y) with w_1 = 0.4 * 0.8 = 0.32; (A=n,B=y,C=y,D=n,E=y) with w_2 = 0.88; (A=y,B=y,C=y,D=y,E=y) with w_3 = 0.80.

75 C.Bielza, P.Larrañaga -UPM- 75 Approximate inference. Markov chain Monte Carlo (MCMC): basics. Designed for cases in which sampling from a distribution π(θ) is not easy, i.e. with MCMC we simulate draws from complex probability distributions. General description: select a Markov chain on the state space with stationary distribution π(θ); start at a point θ^0 and generate θ^1,...,θ^n from the chain until convergence; eliminate an initial transient θ^1,...,θ^k and use θ^{k+1},...,θ^n as an approximate sample from π(θ). Two issues: how to design a Markov chain with stationary distribution π (the Metropolis-Hastings algorithm and its special cases; we only see the Gibbs sampler), and how to judge the convergence of the Markov chain (a number of criteria exist).

76 C.Bielza, P.Larrañaga -UPM- 76 Approximate inference. MCMC: Gibbs sampler.

77 C.Bielza, P.Larrañaga -UPM- 77 Approximate inference. MCMC: Gibbs sampler. By sampling from the full conditional probabilities, we obtain samples from the JPD. The chain moves from θ^i to θ^{i+1} one coordinate at a time (or one group of coordinates at a time, which gives less correlation among parameters). [Figure: bivariate example θ = (θ_1, θ_2), with coordinate-wise moves starting at θ^0.]

78 C.Bielza, P.Larrañaga -UPM- 78 Approximate inference. MCMC in BNs. In BNs, Gibbs sampling means, for each X_i not in E, sampling from P(x_i | all the other variables) = P(x_i | Markov blanket of X_i), which is proportional to P(x_i | pa(X_i)) Π_{Y in children(X_i)} P(y | pa(Y)) (theorem [Pearl 97]): only its Markov blanket is involved. Example (cancer network M, S, B, C, H): a patient with severe headache and not in a coma; query P(B=b | H=h, C=¬c).
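A sketch of Gibbs sampling over binary nodes using only Markov blanket terms, in the same representation as the earlier sampling sketches plus a children dict; names, the burn-in length and the number of draws are ours:

```python
import random

def gibbs(nodes, parents, children, cpt, evidence, target, burn_in=500, draws=1000):
    """Estimate P(target = 1 | evidence) by Gibbs sampling over the unobserved nodes."""
    def p_value(x, value, state):
        pa = tuple(state[p] for p in parents[x])
        p1 = cpt[x][pa]                                    # P(x = 1 | pa(x))
        return p1 if value == 1 else 1.0 - p1

    state = {x: evidence.get(x, random.randint(0, 1)) for x in nodes}
    count = 0
    for it in range(burn_in + draws):
        for x in nodes:
            if x in evidence:
                continue                                   # only visit unobserved nodes
            score = []
            for v in (0, 1):
                state[x] = v
                s = p_value(x, v, state)                   # P(x = v | pa(x))
                for y in children[x]:                      # times prod_y P(y | pa(y))
                    s *= p_value(y, state[y], state)
                score.append(s)
            p1 = score[1] / (score[0] + score[1])          # normalize over the 2 values
            state[x] = 1 if random.random() < p1 else 0
        if it >= burn_in:
            count += state[target]
    return count / draws
```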

79 C.Bielza, P.Larrañaga -UPM- 79 Approximate inference. Markov chain Monte Carlo (MCMC). Analytically, the exact posterior P can be computed for this small network, for comparison. Gibbs sampling over the M, S, B, C, H network: only visit the unobserved nodes; the normalizing constants are only computed once.

80 C.Bielza, P.Larrañaga -UPM- 80 Approximate inference. Markov chain Monte Carlo (MCMC). E.g., one cycle of the sampler would be: [detailed in the figure].

81 C.Bielza, P.Larrañaga -UPM- 81 Approximate inference. Markov chain Monte Carlo (MCMC). The estimate is ≈ 0.032: after 500 iterations, accumulate 1000 values.

82 C.Bielza, P.Larrañaga -UPM- 82 Approximate inference. Assessing approximate inference algorithms: measure the quality of different approximations (compare algorithms). Kullback-Leibler divergence between the true distribution P and the estimated distribution P' of a node with states i: KL(P, P') = Σ_i P(i) log(P(i)/P'(i)); KL = 0 if P = P'. For several query nodes, X and Y, and evidence Z, we should use KL(P(X,Y | Z), P'(X,Y | Z)).
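A small sketch of this measure (our code):

```python
from math import log

def kl(p, p_hat):
    """KL divergence between a true distribution p and an estimate p_hat (lists of probabilities)."""
    return sum(pi * log(pi / qi) for pi, qi in zip(p, p_hat) if pi > 0)

print(kl([0.2, 0.8], [0.25, 0.75]))   # small positive value; 0 only if both distributions match
```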

83 Software C.Bielza, P.Larrañaga -UPM- 83

84 Software C.Bielza, P.Larrañaga -UPM- 84

85 C.Bielza, P.Larrañaga -UPM- 85 Software genie.sis.pitt.edu

86 Software C.Bielza, P.Larrañaga -UPM- 86

87 C.Bielza, P.Larrañaga -UPM- 87 Software http.cs.berkeley.edu/~murphyk/

88 C.Bielza, P.Larrañaga -UPM- 88 Software leo.ugr.es/elvira

89 Examples C.Bielza, P.Larrañaga -UPM- 89

90 C.Bielza, P.Larrañaga -UPM- 90 Examples Increase if S=yes

91 Examples C.Bielza, P.Larrañaga -UPM- 91

92 C.Bielza, P.Larrañaga -UPM- 92 Examples Increase

93 C.Bielza, P.Larrañaga -UPM- 93 Examples Increase

94 Examples C.Bielza, P.Larrañaga -UPM- 94

95 Examples Increase Increase C.Bielza, P.Larrañaga -UPM- 95

96 C.Bielza, P.Larrañaga -UPM- 96 Texts and readings: general. T. Verma, J. Pearl (1990) Causal networks: Semantics and expressiveness, UAI-4. S. Lauritzen, D. Spiegelhalter (1988) Local computations with probabilities on graphical structures and their application to expert systems, J. of the Royal Stat. Soc., Series B.

97 C.Bielza, P.Larrañaga -UPM- 97 Texts and readings: deterministic algorithms for approx inference

98 C.Bielza, P.Larrañaga -UPM- 98 Texts and readings: stochastic simulation for approx inference. M. Henrion (1988) Propagating uncertainty in BNs by logic sampling, UAI-2. R. Fung, K. Chang (1990) Weighing and integrating evidence for stochastic simulation in Bayesian networks, UAI-5. R. Shachter, M. Peot (1990) Simulation approaches to general probabilistic inference on belief networks, UAI-5. D. Gamerman (1997) Markov Chain Monte Carlo, Chapman & Hall. G. Casella, E. George (1992) Explaining the Gibbs sampler, The Amer. Statistician 46. M. K. Cowles, B. P. Carlin (1996) MCMC convergence diagnostics: A comparative review, J. of the Amer. Statist. Assoc. 91. G. O. Roberts, A. F. M. Smith (1994) Simple conditions for the convergence of the Gibbs sampler and M-H algorithms, Stochastic Processes and their Applications 49.

99 C.Bielza, P.Larrañaga -UPM- 99 Possible projects/readings. 1. Canonical models for the CPTs: noisy-OR models. S. Srinivas (1993) A generalization of the noisy OR model, UAI-93. F. J. Díez (1993) Parameter adjustment in Bayes networks. The generalized noisy OR-gate, UAI-93; also in Neapolitan's book. 2. Context-specific independence: X and Y are c.i. given Z in context C=c if P(X | Y, Z, C=c) = P(X | Z, C=c). C. Boutilier, N. Friedman, M. Goldszmidt, D. Koller (1996) Context-specific independence in Bayesian networks, UAI-96. 3. Modeling tricks: parent divorcing, time-stamped models, expert disagreements, interventions; Section 2.3 in Jensen's book.

100 C.Bielza, P.Larrañaga -UPM- 100 Possible projects/readings. 4. Abductive inference. J. A. Gámez (2004) Abductive inference in Bayesian networks: A review. In Gámez, J.A., Moral, S., Salmerón, A., eds.: Advances in Bayesian Networks, Springer. 5. Partial abduction. L. M. de Campos, J. A. Gámez, S. Moral (2002) Partial abductive inference in Bayesian belief networks: An evolutionary computation approach by using problem-specific genetic operators, IEEE Trans. Evolutionary Computation 6(2). R. Marinescu, R. Dechter (2009) AND/OR branch-and-bound search for combinatorial optimization in graphical models, Artificial Intelligence 173.

101 Possible projects/readings. 6. Approximate inference. L. Hernández, S. Moral, A. Salmerón (1998) A Monte Carlo algorithm for probabilistic propagation in belief networks based on importance sampling and stratified simulation techniques, Int. J. of Approx. Reasoning 18. C. Yuan, M. Druzdzel (2005) Importance sampling algorithms for Bayesian networks: Principles and performance, Mathematical and Computer Modeling 43. S. Moral, A. Salmerón (2005) Dynamic importance sampling in BNs based on probability trees, International Journal of Approximate Reasoning 38(3). A. Cano, M. Gómez, S. Moral, C. Pérez-Ariza (2009) Recursive probability trees for Bayesian networks, Proceedings XIII CAEPIA, 1-10 (decomposition of potentials). T. Heskes, O. Zoeter (2002) Expectation propagation for approximate inference in dynamic BNs, Proc. 18th Conf. UAI-02. A. Cano, M. Gómez, S. Moral (2011) Approximate inference in Bayesian networks using binary probability trees, International Journal of Approximate Reasoning 52. C.Bielza, P.Larrañaga -UPM- 101

102 C.Bielza, P.Larrañaga -UPM- 102 Possible projects/readings. 7. Hugin architecture: the potentials in the cliques are changed dynamically and there is a division operation in the separators. Lauritzen and Spiegelhalter (1988). F. Jensen, S. Lauritzen, K. Olesen (1990) Bayesian updating in causal probabilistic networks by local computations, Computational Statistics Quarterly 4. 8. Lazy propagation: dissolves the differences between Shenoy-Shafer and Hugin propagation. A. Madsen, F. Jensen (1999) Lazy evaluation of symmetric Bayesian decision problems, UAI-99.

103 C.Bielza, P.Larrañaga -UPM- 103 Possible projects/readings. 9. More on graph theory: properties of c.i., equivalence between graphs, lists of c.i. statements and factorizations of the JPD. Chapters 5 (5.3, 5.4, 5.6) and 6 of Castillo et al.'s book; Chapter 5 of Jensen's book. 10. Inference in hybrid networks (discrete & continuous variables). T. Heskes, O. Zoeter (2003) Generalized belief propagation for approximate inference in hybrid Bayesian networks, Proc. 9th Int. Workshop on AI and Statistics. R. Rumí, A. Salmerón (2007) Approximate probability propagation with mixtures of truncated exponentials, Int. J. Approx. Reas. 45.



More information

Review I" CMPSCI 383 December 6, 2011!

Review I CMPSCI 383 December 6, 2011! Review I" CMPSCI 383 December 6, 2011! 1 General Information about the Final" Closed book closed notes! Includes midterm material too! But expect more emphasis on later material! 2 What you should know!

More information

BAYESIAN NETWORKS STRUCTURE LEARNING

BAYESIAN NETWORKS STRUCTURE LEARNING BAYESIAN NETWORKS STRUCTURE LEARNING Xiannian Fan Uncertainty Reasoning Lab (URL) Department of Computer Science Queens College/City University of New York http://url.cs.qc.cuny.edu 1/52 Overview : Bayesian

More information

Graphical Models and Markov Blankets

Graphical Models and Markov Blankets Stephan Stahlschmidt Ladislaus von Bortkiewicz Chair of Statistics C.A.S.E. Center for Applied Statistics and Economics Humboldt-Universität zu Berlin Motivation 1-1 Why Graphical Models? Illustration

More information

Exam Topics. Search in Discrete State Spaces. What is intelligence? Adversarial Search. Which Algorithm? 6/1/2012

Exam Topics. Search in Discrete State Spaces. What is intelligence? Adversarial Search. Which Algorithm? 6/1/2012 Exam Topics Artificial Intelligence Recap & Expectation Maximization CSE 473 Dan Weld BFS, DFS, UCS, A* (tree and graph) Completeness and Optimality Heuristics: admissibility and consistency CSPs Constraint

More information

Search Algorithms for Solving Queries on Graphical Models & the Importance of Pseudo-trees in their Complexity.

Search Algorithms for Solving Queries on Graphical Models & the Importance of Pseudo-trees in their Complexity. Search Algorithms for Solving Queries on Graphical Models & the Importance of Pseudo-trees in their Complexity. University of California, Irvine CS199: Individual Study with Rina Dechter Héctor Otero Mediero

More information

Graphical Probability Models for Inference and Decision Making

Graphical Probability Models for Inference and Decision Making Graphical Probability Models for Inference and Decision Making Unit 4: Inference in Graphical Models The Junction Tree Algorithm Instructor: Kathryn Blackmond Laskey Unit 4 (v2) - 1 - Learning Objectives

More information

These notes present some properties of chordal graphs, a set of undirected graphs that are important for undirected graphical models.

These notes present some properties of chordal graphs, a set of undirected graphs that are important for undirected graphical models. Undirected Graphical Models: Chordal Graphs, Decomposable Graphs, Junction Trees, and Factorizations Peter Bartlett. October 2003. These notes present some properties of chordal graphs, a set of undirected

More information

The Basics of Graphical Models

The Basics of Graphical Models The Basics of Graphical Models David M. Blei Columbia University September 30, 2016 1 Introduction (These notes follow Chapter 2 of An Introduction to Probabilistic Graphical Models by Michael Jordan.

More information

Dynamic Bayesian network (DBN)

Dynamic Bayesian network (DBN) Readings: K&F: 18.1, 18.2, 18.3, 18.4 ynamic Bayesian Networks Beyond 10708 Graphical Models 10708 Carlos Guestrin Carnegie Mellon University ecember 1 st, 2006 1 ynamic Bayesian network (BN) HMM defined

More information

Joint Entity Resolution

Joint Entity Resolution Joint Entity Resolution Steven Euijong Whang, Hector Garcia-Molina Computer Science Department, Stanford University 353 Serra Mall, Stanford, CA 94305, USA {swhang, hector}@cs.stanford.edu No Institute

More information

18 October, 2013 MVA ENS Cachan. Lecture 6: Introduction to graphical models Iasonas Kokkinos

18 October, 2013 MVA ENS Cachan. Lecture 6: Introduction to graphical models Iasonas Kokkinos Machine Learning for Computer Vision 1 18 October, 2013 MVA ENS Cachan Lecture 6: Introduction to graphical models Iasonas Kokkinos Iasonas.kokkinos@ecp.fr Center for Visual Computing Ecole Centrale Paris

More information

A Fast Learning Algorithm for Deep Belief Nets

A Fast Learning Algorithm for Deep Belief Nets A Fast Learning Algorithm for Deep Belief Nets Geoffrey E. Hinton, Simon Osindero Department of Computer Science University of Toronto, Toronto, Canada Yee-Whye Teh Department of Computer Science National

More information

Part I: Sum Product Algorithm and (Loopy) Belief Propagation. What s wrong with VarElim. Forwards algorithm (filtering) Forwards-backwards algorithm

Part I: Sum Product Algorithm and (Loopy) Belief Propagation. What s wrong with VarElim. Forwards algorithm (filtering) Forwards-backwards algorithm OU 56 Probabilistic Graphical Models Loopy Belief Propagation and lique Trees / Join Trees lides from Kevin Murphy s Graphical Model Tutorial (with minor changes) eading: Koller and Friedman h 0 Part I:

More information

Ch9: Exact Inference: Variable Elimination. Shimi Salant, Barak Sternberg

Ch9: Exact Inference: Variable Elimination. Shimi Salant, Barak Sternberg Ch9: Exact Inference: Variable Elimination Shimi Salant Barak Sternberg Part 1 Reminder introduction (1/3) We saw two ways to represent (finite discrete) distributions via graphical data structures: Bayesian

More information

A Factor Tree Inference Algorithm for Bayesian Networks and its Applications

A Factor Tree Inference Algorithm for Bayesian Networks and its Applications A Factor Tree Infere Algorithm for Bayesian Nets and its Applications Wenhui Liao, Weihong Zhang and Qiang Ji Department of Electrical, Computer and System Engineering Rensselaer Polytechnic Institute,

More information

Distributed Multi-agent Probabilistic Reasoning With Bayesian Networks

Distributed Multi-agent Probabilistic Reasoning With Bayesian Networks Distributed Multi-agent Probabilistic Reasoning With Bayesian Networks Yang Xiang Department of Computer Science, University of Regina Regina, Saskatchewan, Canada S4S 0A2, yxiang@cs.uregina.ca Abstract.

More information

Introduction to Graphical Models

Introduction to Graphical Models Robert Collins CSE586 Introduction to Graphical Models Readings in Prince textbook: Chapters 10 and 11 but mainly only on directed graphs at this time Credits: Several slides are from: Review: Probability

More information

Parameter Control of Genetic Algorithms by Learning and Simulation of Bayesian Networks

Parameter Control of Genetic Algorithms by Learning and Simulation of Bayesian Networks Submitted Soft Computing Parameter Control of Genetic Algorithms by Learning and Simulation of Bayesian Networks C. Bielza,*, J.A. Fernández del Pozo, P. Larrañaga Universidad Politécnica de Madrid, Departamento

More information

Graphical Models. David M. Blei Columbia University. September 17, 2014

Graphical Models. David M. Blei Columbia University. September 17, 2014 Graphical Models David M. Blei Columbia University September 17, 2014 These lecture notes follow the ideas in Chapter 2 of An Introduction to Probabilistic Graphical Models by Michael Jordan. In addition,

More information

Markov Logic: Representation

Markov Logic: Representation Markov Logic: Representation Overview Statistical relational learning Markov logic Basic inference Basic learning Statistical Relational Learning Goals: Combine (subsets of) logic and probability into

More information