Graphical Models Part 1-2 (Reading Notes)
Wednesday, August 3, 2011, 2:35 PM

Notes on Chapter 8 "Graphical Models" of the book Pattern Recognition and Machine Learning (PRML) by Chris Bishop. This part (1-2) covers Sections 8.1 and 8.2, which treat Bayesian networks (directed graphs) and conditional independence.

8.1 Bayesian Networks

A directed acyclic graph describes how a joint distribution over variables decomposes into factors. In general, for a graph with K nodes (the key equation, 8.5):

p(x) = prod_k p(x_k | pa_k)

where pa_k denotes the set of parents of x_k.

* Generative model: an observed variable is generated by latent (hidden) variables.

Discrete variables: a first-order chain of M discrete nodes. The discrete model becomes a Bayesian model by introducing a Dirichlet prior over the parameters of the conditional distributions.
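The chain factorization can be sketched numerically. A minimal sketch for a first-order chain of four binary variables with one shared conditional table (all numbers and the helper `joint_prob` are made up for illustration, not from the book):

```python
from itertools import product
import numpy as np

# First-order chain x1 -> x2 -> x3 -> x4 of binary variables:
# p(x) = p(x1) * prod_{i>1} p(x_i | x_{i-1}), one shared conditional table.
p1 = np.array([0.6, 0.4])              # p(x1), made-up numbers
cond = np.array([[0.7, 0.3],            # p(x_i | x_{i-1} = 0)
                 [0.2, 0.8]])           # p(x_i | x_{i-1} = 1), shared across i

def joint_prob(states):
    """Probability of one configuration (x1, ..., xM) under the chain."""
    p = p1[states[0]]
    for prev, cur in zip(states, states[1:]):
        p *= cond[prev, cur]
    return p

# The factorization defines a valid distribution: all 2^4 terms sum to 1.
total = sum(joint_prob(s) for s in product([0, 1], repeat=4))
print(total)  # -> 1.0 (up to floating point)
```

Note how sharing `cond` across positions is exactly the parameter tying the notes mention: the chain needs one 2x2 table instead of one table per link.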
And with a single shared parameter vector u among all the conditional distributions p(x_i | x_{i-1}).

8.2 Conditional Independence

e.g. a three-variable case: p(a | b, c) = p(a | c), or equivalently:

p(a, b | c) = p(a | b, c) p(b | c) = p(a | c) p(b | c)

We say that a is conditionally independent of b given c, denoted a ⊥⊥ b | c.

Three example graphs:

1. tail-to-tail

Figure 8.15: a graph over the 3 variables a, b and c, with c tail-to-tail on the path from a to b. The joint distribution of the 3 variables is:

p(a, b, c) = p(a | c) p(b | c) p(c)

To see whether a and b are independent, we sum the distribution over c:

p(a, b) = sum_c p(a | c) p(b | c) p(c)

In general, this does not lead to the product p(a) p(b). Thus a and b are NOT independent (they are dependent): a ⊥̸⊥ b | ∅.
Figure 8.16: same as 8.15 but now conditioned on c:

p(a, b | c) = p(a, b, c) / p(c) = p(a | c) p(b | c)

which shows that a is conditionally independent of b given c: a ⊥⊥ b | c.

2. head-to-tail

The joint distribution of a, b and c is:

p(a, b, c) = p(a) p(c | a) p(b | c)

To get the joint distribution of a and b, we sum over c:

p(a, b) = p(a) sum_c p(c | a) p(b | c)

which in general does not factorize into p(a) p(b), and so a ⊥̸⊥ b | ∅.

If we condition on node c, using Bayes' theorem together with the joint distribution of the three variables, we get:

p(a, b | c) = p(a, b, c) / p(c) = p(a) p(c | a) p(b | c) / p(c) = p(a | c) p(b | c)

so a and b are conditionally independent given c, i.e. a ⊥⊥ b | c.
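The tail-to-tail behavior can be checked numerically. A minimal sketch with made-up binary tables for c -> a, c -> b (the numbers and variable names are hypothetical, not from the book):

```python
import numpy as np

# Tail-to-tail graph: c -> a, c -> b, with made-up binary tables.
pc = np.array([0.5, 0.5])                   # p(c)
pa_c = np.array([[0.9, 0.1], [0.2, 0.8]])   # p(a | c), rows indexed by c
pb_c = np.array([[0.7, 0.3], [0.4, 0.6]])   # p(b | c), rows indexed by c

# joint[a, b, c] = p(a | c) p(b | c) p(c)
joint = np.einsum('ca,cb,c->abc', pa_c, pb_c, pc)

# Marginalizing over c does NOT give p(a) p(b): a and b are dependent.
pab = joint.sum(axis=2)
pa = pab.sum(axis=1)
pb = pab.sum(axis=0)
marginally_indep = np.allclose(pab, np.outer(pa, pb))
print(marginally_indep)  # False

# Conditioning on c restores independence: p(a, b | c) = p(a | c) p(b | c).
pab_given_c = joint / pc                    # broadcasts over the last axis c
cond_indep = all(np.allclose(pab_given_c[:, :, c], np.outer(pa_c[c], pb_c[c]))
                 for c in range(2))
print(cond_indep)  # True
```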
3. head-to-head

This is a trickier case; its behavior is the opposite of the previous two. The joint distribution of a, b and c is:

p(a, b, c) = p(a) p(b) p(c | a, b)

When c is not observed, we marginalize both sides over c to get the joint distribution of a and b:

p(a, b) = p(a) p(b)

i.e. a and b are independent when no variable is observed: a ⊥⊥ b | ∅.

Suppose we condition on c. The conditional distribution of a and b is:

p(a, b | c) = p(a) p(b) p(c | a, b) / p(c)

which in general does not factorize into the product p(a) p(b), and so a ⊥̸⊥ b | c.

Note the third example (head-to-head) has the opposite behavior from the first two. When c is unobserved, it blocks the path, and a and b are independent. However, conditioning on c "unblocks" the path and makes a and b dependent.

A more subtle point with this case: a head-to-head path becomes unblocked if either the node itself, or any of its descendants, is observed.

Summary: a tail-to-tail node or a head-to-tail node leaves a path unblocked unless it is observed, in which case it blocks the path. By contrast, a head-to-head node blocks a path if it is unobserved, but once the node and/or at least one of its descendants is observed, the path becomes unblocked.
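The head-to-head case is the classic "explaining away" effect, and it too can be checked numerically. A minimal sketch for a -> c <- b with made-up tables (names and numbers are hypothetical):

```python
import numpy as np

# Head-to-head graph: a -> c <- b, made-up binary tables.
pa = np.array([0.5, 0.5])                 # p(a)
pb = np.array([0.5, 0.5])                 # p(b)
pc1_ab = np.array([[0.1, 0.9],            # p(c = 1 | a, b): c is likely 1
                   [0.9, 0.99]])          # when either parent is 1

joint = np.zeros((2, 2, 2))               # joint[a, b, c] = p(a) p(b) p(c | a, b)
for a in range(2):
    for b in range(2):
        joint[a, b, 1] = pa[a] * pb[b] * pc1_ab[a, b]
        joint[a, b, 0] = pa[a] * pb[b] * (1 - pc1_ab[a, b])

# With c unobserved, the marginal factorizes: a and b are independent.
pab = joint.sum(axis=2)
indep_unobserved = np.allclose(pab, np.outer(pa, pb))
print(indep_unobserved)  # True

# Observing c = 1 makes a and b dependent ("explaining away").
pab_c1 = joint[:, :, 1] / joint[:, :, 1].sum()
pa_c1 = pab_c1.sum(axis=1)
pb_c1 = pab_c1.sum(axis=0)
dep_given_c = not np.allclose(pab_c1, np.outer(pa_c1, pb_c1))
print(dep_given_c)  # True
```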
D-separation

The d-separation property of a directed graph: consider a general directed graph in which A, B and C are arbitrary non-intersecting sets of nodes. We wish to ascertain whether a particular conditional independence statement A ⊥⊥ B | C is implied by a given directed acyclic graph. To do so, we consider all possible paths from any node in A to any node in B. Any such path is said to be blocked if it includes a node such that either:

(a) the arrows on the path meet either head-to-tail or tail-to-tail at the node, and the node is in the set C, or
(b) the arrows meet head-to-head at the node, and neither the node, nor any of its descendants, is in the set C.

If all paths are blocked, A is said to be d-separated from B by C, and the joint distribution over all the variables in the graph will satisfy A ⊥⊥ B | C.

Illustration of d-separation (Figure 8.22): in (a) the path from a to b is blocked by neither f nor e; in (b) the path from a to b is blocked by both f and e.

What a directed graph expresses:
(a) A particular directed graph represents a specific decomposition of a joint probability distribution into a product of conditional probabilities.
(b) The graph also expresses a set of conditional independence statements obtained through the d-separation criterion.

The d-separation theorem is really an expression of the equivalence of these two properties. In order to make this clear, it is helpful to think of a directed graph as a filter.

Markov Blanket
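The two blocking rules above translate directly into code. A minimal sketch of a d-separation checker for a single source and target node (the function `d_separated`, the parent-dict graph format, and the example graph mirroring the structure of Figure 8.22 are all my own illustration, not PRML code):

```python
from itertools import chain

def d_separated(parents, x, y, observed):
    """True if every undirected path from x to y is blocked given `observed`.

    `parents` maps each node to a list of its parents; `observed` is a set.
    """
    children = {n: [] for n in parents}
    for n, ps in parents.items():
        for p in ps:
            children[p].append(n)

    def descendants(n):
        out, stack = set(), list(children[n])
        while stack:
            m = stack.pop()
            if m not in out:
                out.add(m)
                stack.extend(children[m])
        return out

    def paths(cur, target, visited):
        # Enumerate all undirected simple paths (fine for tiny graphs).
        if cur == target:
            yield [cur]
            return
        for nxt in chain(parents[cur], children[cur]):
            if nxt not in visited:
                for rest in paths(nxt, target, visited | {nxt}):
                    yield [cur] + rest

    def blocked(path):
        for i in range(1, len(path) - 1):
            prev, node, nxt = path[i - 1], path[i], path[i + 1]
            if prev in parents[node] and nxt in parents[node]:
                # head-to-head: blocks unless node or a descendant is observed
                if node not in observed and not (descendants(node) & observed):
                    return True   # rule (b)
            elif node in observed:
                return True       # rule (a): head-to-tail or tail-to-tail
        return False

    return all(blocked(p) for p in paths(x, y, {x}))

# Graph with the structure of Figure 8.22: a -> e <- f, f -> b, e -> c.
parents = {'a': [], 'f': [], 'e': ['a', 'f'], 'b': ['f'], 'c': ['e']}
print(d_separated(parents, 'a', 'b', {'c'}))  # False: e's descendant c is observed
print(d_separated(parents, 'a', 'b', {'f'}))  # True: f and e both block
```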
Consider a joint distribution p(x_1, ..., x_D) represented by a directed graph having D nodes, and consider the conditional distribution of a particular node x_i conditioned on all of the remaining variables x_{j≠i}. Using the factorization property (8.5), we can express this conditional distribution in the form:

p(x_i | x_{j≠i}) = p(x_1, ..., x_D) / ∫ p(x_1, ..., x_D) dx_i = prod_k p(x_k | pa_k) / ∫ prod_k p(x_k | pa_k) dx_i

Any factor p(x_k | pa_k) that does not have any functional dependence on x_i can be taken outside the integral over x_i, and will therefore cancel between numerator and denominator. The only factors that remain are the conditional distribution p(x_i | pa_i) for node x_i itself, together with the conditional distributions for any nodes x_k such that x_i is in the conditioning set of p(x_k | pa_k), i.e. such that x_i is a parent of x_k.

The Markov blanket of a node x_i comprises its parents, its children, and its co-parents (the other parents of its children). We can think of the Markov blanket of a node as the minimal set of nodes that isolates that node from the rest of the graph.
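The three pieces of the blanket (parents, children, co-parents) fall straight out of a parent-list representation of the graph. A minimal sketch (the helper `markov_blanket` and the example graph are hypothetical, for illustration only):

```python
def markov_blanket(parents, node):
    """Parents, children, and co-parents of `node` in a DAG.

    `parents` maps each node to a list of its parents.
    """
    children = [n for n, ps in parents.items() if node in ps]
    co_parents = {p for c in children for p in parents[c]} - {node}
    return set(parents[node]) | set(children) | co_parents

# Example: a -> c <- b, c -> d. The blanket of a is its child c plus
# co-parent b; d is outside the blanket.
parents = {'a': [], 'b': [], 'c': ['a', 'b'], 'd': ['c']}
print(markov_blanket(parents, 'a'))
```

Note that b enters the blanket only because of the explaining-away effect: given c, the nodes a and b become dependent, so isolating a requires conditioning on b as well.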