Machine Learning Lecture 16
Course Outline
- Fundamentals (2 weeks): Bayes Decision Theory; Probability Density Estimation
- Discriminative Approaches (5 weeks): Linear Discriminant Functions; Statistical Learning Theory & SVMs; Ensemble Methods & Boosting; Decision Trees & Randomized Trees
- Generative Models (4 weeks): Bayesian Networks; Markov Random Fields; Exact Inference; Undirected Graphical Models & Inference
Bastian Leibe, RWTH Aachen. Many slides adapted from B. Schiele, S. Roth, Z. Ghahramani.

Topics of This Lecture
- Recap: Directed Graphical Models (Bayesian Networks): factorization properties; conditional independence; Bayes ball algorithm
- Undirected Graphical Models (Markov Random Fields): conditional independence; factorization; example application: image restoration; converting directed into undirected graphs
- Exact Inference in Graphical Models: marginalization for undirected graphs; inference on a chain; inference on a tree; message passing formalism

Recap: Graphical Models
Two basic kinds of graphical models: directed graphical models (Bayesian Networks) and undirected graphical models (Markov Random Fields).
Key components: nodes represent random variables; edges are directed or undirected. The value of a random variable may be known (observed) or unknown.
Slide credit: Bernt Schiele

Recap: Directed Graphical Models
Chains of nodes: knowledge about a is expressed by the prior probability p(a); dependencies are expressed through conditional probabilities p(b|a), p(c|b). Joint distribution of all three variables:
p(a, b, c) = p(c|a, b) p(a, b) = p(c|b) p(b|a) p(a)
Convergent connections: here the value of c depends on both variables a and b. This is modeled with the conditional probability p(c|a, b). Therefore, the joint probability of all three variables is given as:
p(a, b, c) = p(c|a, b) p(a, b) = p(c|a, b) p(a) p(b)
Slide credit: Bernt Schiele, Stefan Roth
Recap: Factorization of the Joint Probability
Computing the joint probability:
p(x_1, ..., x_7) = p(x_1) p(x_2) p(x_3) p(x_4|x_1, x_2, x_3) p(x_5|x_1, x_3) p(x_6|x_4) p(x_7|x_4, x_5)
General factorization: we can directly read off the factorization of the joint from the network structure! It is the edges that are missing in the graph that are important: they encode the simplifying assumptions we make.
Image source: C. M. Bishop, 2006

Recap: Factorized Representation
Reduction of complexity: the joint probability of n binary variables, represented by brute force, requires O(2^n) terms. The factorized form obtained from the graphical model only requires O(n 2^k) terms, where k is the maximum number of parents of a node.
Slide credit: Bernt Schiele, Stefan Roth

Recap: Conditional Independence
X is conditionally independent of Y given V. Often, we are interested in conditional independence between sets of variables.
Three cases:
- Divergent ("Tail-to-Tail"): conditional independence when c is observed. Special case: marginal independence.
- Chain ("Head-to-Tail"): conditional independence when c is observed.
- Convergent ("Head-to-Head"): conditional independence when neither c, nor any of its descendants, are observed.
Image source: C. M. Bishop, 2006

Recap: D-Separation
Definition: Let A, B, and C be non-intersecting subsets of nodes in a directed graph. A path from A to B is blocked if it contains a node such that either
- the arrows on the path meet either head-to-tail or tail-to-tail at the node, and the node is in the set C, or
- the arrows meet head-to-head at the node, and neither the node, nor any of its descendants, are in the set C.
If all paths from A to B are blocked, A is said to be d-separated from B by C. If A is d-separated from B by C, the joint distribution over all variables in the graph satisfies A ⊥ B | C. Read: A is conditionally independent of B given C.
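The complexity reduction can be made concrete with a short sketch (not from the lecture; the parent sets follow the seven-variable example above, all variables binary):

```python
# Parent sets of the example network
# p(x1)p(x2)p(x3)p(x4|x1,x2,x3)p(x5|x1,x3)p(x6|x4)p(x7|x4,x5).
# A brute-force joint table needs 2^n entries; the factorized form needs,
# per node, one entry per configuration of its parents (binary child).
parents = {
    "x1": [], "x2": [], "x3": [],
    "x4": ["x1", "x2", "x3"],
    "x5": ["x1", "x3"],
    "x6": ["x4"],
    "x7": ["x4", "x5"],
}

n = len(parents)
brute_force_terms = 2 ** n                             # full joint table
factorized_terms = sum(2 ** len(p) for p in parents.values())

print(brute_force_terms)   # 128
print(factorized_terms)    # 3 + 8 + 4 + 2 + 4 = 21
```

With k = 3 (the maximum number of parents), the O(n 2^k) bound from the slide gives 56; the actual count here is 21, versus 128 by brute force.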
Slide adapted from Chris Bishop

Intuitive View: The Bayes Ball Algorithm
Game: can you get a ball from X to Y without being blocked by V? Depending on its direction and the previous node, the ball can
- pass through (from parent to all children, from child to all parents),
- bounce back (from any parent/child to all parents/children),
- be blocked.
R. D. Shachter, Bayes-Ball: The Rational Pastime (for Determining Irrelevance and Requisite Information in Belief Networks and Influence Diagrams), UAI'98, 1998.
Slide adapted from Zoubin Ghahramani
The Bayes Ball Algorithm
Game rules:
- An unobserved node (W ∉ V) passes through balls from parents, but also bounces back balls from children.
- An observed node (W ∈ V) bounces back balls from parents, but blocks balls from children.
The Bayes ball algorithm determines those nodes that are d-separated from the query node.
Image source: R. Shachter, 1998

Example: Bayes Ball
(A sequence of slides applies these rules one at a time, asking in each step: which nodes are d-separated from the query node, given the observed nodes?)
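As an illustration (my own sketch, not Shachter's original pseudocode), the two game rules can be turned into a breadth-first search over (node, direction) states; `parents` maps each node to its parent list, and the returned set contains the nodes d-separated from the query given the observed set:

```python
from collections import deque

def d_separated(parents, query, observed):
    """Nodes d-separated from `query` given `observed`, via Bayes ball."""
    children = {n: [] for n in parents}
    for n, ps in parents.items():
        for p in ps:
            children[p].append(n)

    # 'up' = ball arrived from a child, 'down' = ball arrived from a parent.
    visited, reached = set(), set()
    queue = deque([(query, "up")])        # start with a virtual ball from a child
    while queue:
        node, direction = queue.popleft()
        if (node, direction) in visited:
            continue
        visited.add((node, direction))
        if node not in observed:
            reached.add(node)
        if direction == "up" and node not in observed:
            for p in parents[node]:       # pass through to parents
                queue.append((p, "up"))
            for c in children[node]:      # bounce back to children
                queue.append((c, "down"))
        elif direction == "down":
            if node in observed:
                for p in parents[node]:   # observed node bounces to parents
                    queue.append((p, "up"))
            else:
                for c in children[node]:  # unobserved node passes to children
                    queue.append((c, "down"))
    return set(parents) - reached - set(observed) - {query}
```

For the v-structure a → c ← b this gives `d_separated(..., 'a', set()) == {'b'}` (marginal independence) but an empty set once c is observed (explaining away).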
Example: Bayes Ball (conclusion)
Applying the rules, the query node is d-separated from the remaining unreached nodes given the observed nodes.

The Markov Blanket
Markov blanket of a node x_i: the minimal set of nodes that isolates x_i from the rest of the graph. This comprises the set of parents, children, and co-parents of x_i. This is what we have to watch out for!
Image source: C. M. Bishop, 2006

Undirected Graphical Models
Undirected graphical models ("Markov Random Fields") are given by an undirected graph. Conditional independence is easier to read off for MRFs: without arrows, there is only one type of neighbor, and the Markov blanket is simpler (just the direct neighbors).
Conditional independence for undirected graphs: if every path from any node in set A to any node in set B passes through at least one node in set C, then A ⊥ B | C.
Image source: C. M. Bishop, 2006

Factorization in MRFs
Factorization is more complicated in MRFs than in BNs. Important concepts:
- Clique: a subset of the nodes such that there exists a link between all pairs of nodes in the subset.
- Maximal clique: the biggest possible such clique in a given graph.
Image source: C. M. Bishop, 2006
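The undirected separation criterion is just graph reachability with the conditioning set removed; a minimal sketch (names are illustrative):

```python
from collections import deque

def separated(neighbors, A, B, C):
    """A ⊥ B | C in an undirected graph: every path from A to B must pass
    through C. `neighbors` maps each node to its adjacent nodes."""
    blocked = set(C)
    frontier = deque(set(A) - blocked)
    visited = set(frontier)
    while frontier:
        node = frontier.popleft()
        if node in B:
            return False                  # found a path that avoids C
        for nb in neighbors[node]:
            if nb not in visited and nb not in blocked:
                visited.add(nb)
                frontier.append(nb)
    return True
```

On the path graph a - b - c, observing {b} separates {a} from {c}; with nothing observed, it does not.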
Factorization in MRFs (cont'd)
Joint distribution: written as a product of potential functions ψ_C over the maximal cliques in the graph:
p(x) = (1/Z) Π_C ψ_C(x_C)
The normalization constant Z is called the partition function:
Z = Σ_x Π_C ψ_C(x_C)
Remarks: BNs are automatically normalized, but for MRFs we have to explicitly perform the normalization. The presence of the normalization constant is a major limitation: evaluation of Z involves summing over O(K^M) terms for M nodes (with K states each).

Role of the potential functions
General interpretation: there is no restriction to potential functions that have a specific probabilistic interpretation as marginals or conditional distributions. It is convenient to express them as exponential functions ("Boltzmann distribution"):
ψ_C(x_C) = exp{-E(x_C)}
with an energy function E. Why is this convenient? The joint distribution is the product of the potentials, so the joint energy is the sum of the clique energies: we can take the log and simply work with sums.

Comparison: Directed vs. Undirected Graphs
Directed graphs (Bayesian networks): better at expressing causal relationships. Interpretation of a link a → b: conditional probability p(b|a). Factorization is simple (and the result is automatically normalized), but conditional independence is more complicated.
Undirected graphs (Markov Random Fields): better at representing soft constraints between variables. Interpretation of an undirected link between a and b: there is some relationship between a and b. Factorization is complicated (and the result needs normalization), but conditional independence is simple.
Slide adapted from Chris Bishop

Converting Directed to Undirected Graphs
Simple case: chain. We can directly replace the directed links by undirected ones.
More difficult case: multiple parents. The parents must become fully connected, since there is no conditional independence between them! We need to introduce additional links ("marry the parents"). This process is called moralization; it results in the moral graph.
General procedure to convert directed → undirected:
1. Add undirected links to marry the parents of each node.
2. Drop the arrows on the original links → moral graph.
3. Find the maximal cliques for each node and initialize all clique potentials to 1.
4. Take each conditional distribution factor of the original directed graph and multiply it into one clique potential. (E.g., we need a clique over x_1, ..., x_4 to represent the factor p(x_4|x_1, x_2, x_3).)
Restriction: conditional independence properties are often lost! Moralization results in additional connections and larger cliques.
Slide adapted from Chris Bishop; Image source: C. M. Bishop, 2006
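Steps 1 and 2 of the procedure (marrying the parents, dropping the arrows) can be sketched as follows; `parents` maps each node to its parent list, and the result is the undirected adjacency of the moral graph:

```python
from itertools import combinations

def moralize(parents):
    """Marry the parents of each node, then drop the arrows.
    Returns the undirected adjacency (node -> set of neighbors)."""
    und = {n: set() for n in parents}
    for child, ps in parents.items():
        for p in ps:                      # drop arrows on the original links
            und[child].add(p)
            und[p].add(child)
        for u, v in combinations(ps, 2):  # marry the parents pairwise
            und[u].add(v)
            und[v].add(u)
    return und
```

For the seven-variable example, this adds links x1-x2, x1-x3, x2-x3 (parents of x4) and x4-x5 (parents of x7), exactly the extra connections that enlarge the cliques.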
Example: Graph Conversion
Step 1) Marrying the parents.
Step 2) Dropping the arrows.
p(x_1, ..., x_7) = p(x_1) p(x_2) p(x_3) p(x_4|x_1, x_2, x_3) p(x_5|x_1, x_3) p(x_6|x_4) p(x_7|x_4, x_5)
Step 3) Finding the maximal cliques for each node: ψ_1(x_1, x_2, x_3, x_4), ψ_2(x_1, x_3, x_4, x_5), ψ_3(x_4, x_5, x_7), ψ_4(x_4, x_6).
Step 4) Assigning the probabilities to clique potentials:
ψ_1(x_1, x_2, x_3, x_4) = p(x_4|x_1, x_2, x_3) p(x_2) p(x_3)
ψ_2(x_1, x_3, x_4, x_5) = p(x_5|x_1, x_3) p(x_1)
ψ_3(x_4, x_5, x_7) = p(x_7|x_4, x_5)
ψ_4(x_4, x_6) = p(x_6|x_4)
Since each conditional factor is multiplied into exactly one clique potential, the product of the potentials equals the original joint and Z = 1.

Comparison of Expressive Power
Both types of graphs have unique configurations: there are independence structures that no directed graph can represent exactly (these and only these independencies), and others that no undirected graph can represent exactly.
Slide adapted from Chris Bishop; Image source: C. M. Bishop, 2006
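The Step 4 assignment can be checked numerically: with each conditional factor multiplied into one clique potential, the product of the potentials reproduces the directed joint with Z = 1. A sketch with hypothetical random CPTs for binary variables:

```python
import itertools
import random

random.seed(0)

# Hypothetical random CPTs: cpt[node][parent_values] = p(node=1 | parents)
def make_cpt(n_parents):
    return {pv: random.random()
            for pv in itertools.product((0, 1), repeat=n_parents)}

cpt = {"x1": make_cpt(0), "x2": make_cpt(0), "x3": make_cpt(0),
       "x4": make_cpt(3), "x5": make_cpt(2), "x6": make_cpt(1), "x7": make_cpt(2)}
parents = {"x1": (), "x2": (), "x3": (), "x4": ("x1", "x2", "x3"),
           "x5": ("x1", "x3"), "x6": ("x4",), "x7": ("x4", "x5")}

def p_cond(node, val, assign):
    p1 = cpt[node][tuple(assign[p] for p in parents[node])]
    return p1 if val == 1 else 1.0 - p1

names = ["x1", "x2", "x3", "x4", "x5", "x6", "x7"]
for values in itertools.product((0, 1), repeat=7):
    a = dict(zip(names, values))
    directed = 1.0
    for n in names:
        directed *= p_cond(n, a[n], a)
    # Clique potentials exactly as assigned on the slide (Z = 1):
    psi1 = p_cond("x4", a["x4"], a) * p_cond("x2", a["x2"], a) * p_cond("x3", a["x3"], a)
    psi2 = p_cond("x5", a["x5"], a) * p_cond("x1", a["x1"], a)
    psi3 = p_cond("x7", a["x7"], a)
    psi4 = p_cond("x6", a["x6"], a)
    assert abs(directed - psi1 * psi2 * psi3 * psi4) < 1e-12
```

The assertion holds for all 128 assignments because the potentials are just a regrouping of the same seven factors.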
Inference in Graphical Models
Inference, general definition: evaluate the probability distribution over some set of variables, given the values of another set of variables (= observations).
Example 1 (chain a → b → c): goal: compute the marginals
p(a) = Σ_{b,c} p(a) p(b|a) p(c|b)
p(b) = Σ_{a,c} p(a) p(b|a) p(c|b)
Example 2 (b is observed to be b_0):
p(a|b = b_0) ∝ Σ_c p(a) p(b = b_0|a) p(c|b = b_0) = p(a) p(b = b_0|a)
p(c|b = b_0) ∝ Σ_a p(a) p(b = b_0|a) p(c|b = b_0) ∝ p(c|b = b_0)
Slide adapted from Stefan Roth, Bernt Schiele

Example: given the joint
p(A, B, C, D, E) = p(A) p(B) p(C|A, B) p(D|B, C) p(E|C, D),
how can we compute p(A|C = c)? Idea:
p(A|C = c) = p(A, C = c) / p(C = c)
Slide credit: Zoubin Ghahramani

Computing p(A|C = c): assume each variable is binary.
Naive approach:
p(A, C = c) = Σ_{B,D,E} p(A, B, C = c, D, E)   (two values for each of B, D, E, evaluated for both values of A: 2^4 = 16 operations)
p(C = c) = Σ_A p(A, C = c)   (2 operations)
p(A|C = c) = p(A, C = c) / p(C = c)   (2 operations)
Total: 16 + 2 + 2 = 20 operations.
More efficient method for p(A|C = c):
p(A, C = c) = Σ_{B,D,E} p(A) p(B) p(C = c|A, B) p(D|B, C = c) p(E|C = c, D)
            = Σ_B p(A) p(B) p(C = c|A, B) Σ_D p(D|B, C = c) Σ_E p(E|C = c, D)
            = Σ_B p(A) p(B) p(C = c|A, B)   (the inner sums over E and D each equal 1; 4 operations)
The rest stays the same. Total: 4 + 2 + 2 = 8 operations.
Strategy: use the conditional independence in the graph to perform efficient inference. For singly connected graphs, this yields exponential gains in efficiency!
Slide credit: Zoubin Ghahramani

Computing Marginals
How do we apply graphical models? Given some observed variables, we want to compute distributions of the unobserved variables; in particular, we want to compute marginal distributions, for example p(x_4). How can we compute marginals? Classical technique: the sum-product algorithm by Judea Pearl. In the context of (loopy) undirected models, this is also called (loopy) belief propagation [Weiss, 1997]. Basic idea: message passing.
Slide credit: Bernt Schiele, Stefan Roth

Inference on a Chain
Chain graph with joint probability
p(x) = (1/Z) ψ_{1,2}(x_1, x_2) ψ_{2,3}(x_2, x_3) ... ψ_{N-1,N}(x_{N-1}, x_N)
Marginalization:
p(x_n) = Σ_{x_1} ... Σ_{x_{n-1}} Σ_{x_{n+1}} ... Σ_{x_N} p(x)
Slide adapted from Chris Bishop; Image source: C. M. Bishop, 2006
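The 20-versus-8-operation argument can be verified numerically; a sketch with hypothetical random binary CPTs (the names pA, pB, pC, pD, pE are illustrative):

```python
import itertools
import random

random.seed(1)

pA = {1: 0.3, 0: 0.7}
pB = {1: 0.6, 0: 0.4}
def rand_cpt(n):  # table of p(child=1 | parent configuration)
    return {k: random.random() for k in itertools.product((0, 1), repeat=n)}
pC, pD, pE = rand_cpt(2), rand_cpt(2), rand_cpt(2)
def bern(p1, v):
    return p1 if v == 1 else 1.0 - p1

c = 1
# Naive: sum the full joint over B, D, E for each value of A
naive = {}
for A in (0, 1):
    naive[A] = sum(pA[A] * pB[B] * bern(pC[(A, B)], c)
                   * bern(pD[(B, c)], D) * bern(pE[(c, D)], E)
                   for B, D, E in itertools.product((0, 1), repeat=3))

# Efficient: the inner sums over D and E are each 1, leaving only the sum over B
eff = {A: sum(pA[A] * pB[B] * bern(pC[(A, B)], c) for B in (0, 1))
       for A in (0, 1)}

assert all(abs(naive[A] - eff[A]) < 1e-12 for A in (0, 1))
Z = eff[0] + eff[1]                            # p(C = c)
posterior = {A: eff[A] / Z for A in (0, 1)}    # p(A | C = c)
```

Both orderings give identical values for p(A, C = c); only the number of multiply-adds differs.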
Inference on a Chain (cont'd)
Idea: split the computation into two parts ("messages"):
p(x_n) = (1/Z) μ_α(x_n) μ_β(x_n)
We can define the messages recursively:
μ_α(x_n) = Σ_{x_{n-1}} ψ_{n-1,n}(x_{n-1}, x_n) μ_α(x_{n-1}),   with μ_α(x_1) = 1
μ_β(x_n) = Σ_{x_{n+1}} ψ_{n,n+1}(x_n, x_{n+1}) μ_β(x_{n+1}),   with μ_β(x_N) = 1
until we reach the leaf nodes. Interpretation: we pass messages from the two ends towards the query node x_n.
Slide adapted from Chris Bishop; Image source: C. M. Bishop, 2006

Summary: Inference on a Chain
To compute local marginals:
- Compute and store all forward messages μ_α(x_n).
- Compute and store all backward messages μ_β(x_n).
- Compute p(x_m) = (1/Z) μ_α(x_m) μ_β(x_m) at any node x_m, for all variables required.
- We still need the normalization constant Z. This can be easily obtained from the marginals: Z = Σ_{x_m} μ_α(x_m) μ_β(x_m).
Inference through message passing: we have thus seen a first message passing algorithm. How can we generalize this?
Slide adapted from Chris Bishop

Inference on Trees
Let's next assume a tree graph. Example: we are given the following joint distribution:
p(A, B, C, D, E) = (1/Z) f_1(A, B) f_2(B, D) f_3(C, D) f_4(D, E)
Strategy: assume we want to know the marginal p(E). Marginalize out all other variables by summing over them, then rearrange terms:
p(E) = Σ_D Σ_C Σ_B Σ_A p(A, B, C, D, E)
     = (1/Z) Σ_D f_4(D, E) (Σ_C f_3(C, D)) (Σ_B f_2(B, D) (Σ_A f_1(A, B)))
Slide credit: Bernt Schiele, Stefan Roth
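A minimal sketch of the forward/backward recursions on a chain, checked against brute-force marginalization (the pairwise potentials are random; NumPy is assumed):

```python
import itertools
import numpy as np

rng = np.random.default_rng(0)
K, N = 3, 5                           # K states per node, N nodes in the chain
psi = [rng.random((K, K)) for _ in range(N - 1)]   # potentials psi_{n,n+1}

mu_a = [np.ones(K)]                   # mu_alpha(x_1) = 1
for n in range(N - 1):                # mu_alpha(x_{n+1}) = sum_{x_n} psi mu_alpha
    mu_a.append(psi[n].T @ mu_a[-1])
mu_b = [np.ones(K)]                   # mu_beta(x_N) = 1
for n in reversed(range(N - 1)):      # mu_beta(x_n) = sum_{x_{n+1}} psi mu_beta
    mu_b.append(psi[n] @ mu_b[-1])
mu_b = mu_b[::-1]

m = 2                                 # query node
unnorm = mu_a[m] * mu_b[m]
Z = unnorm.sum()                      # partition function from any node's marginal
p_m = unnorm / Z

# Brute-force check over all K^N joint states
brute = np.zeros(K)
for states in itertools.product(range(K), repeat=N):
    w = 1.0
    for n in range(N - 1):
        w *= psi[n][states[n], states[n + 1]]
    brute[states[m]] += w
brute /= brute.sum()
assert np.allclose(p_m, brute)
```

Message passing costs O(N K^2) instead of the O(K^N) of the brute-force sum.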
Marginalization with Messages
Use messages to express the marginalization:
m_{A→B}(B) = Σ_A f_1(A, B)
m_{C→D}(D) = Σ_C f_3(C, D)
m_{B→D}(D) = Σ_B f_2(B, D) m_{A→B}(B)
m_{D→E}(E) = Σ_D f_4(D, E) m_{B→D}(D) m_{C→D}(D)
Substituting step by step:
p(E) = (1/Z) Σ_D f_4(D, E) (Σ_C f_3(C, D)) (Σ_B f_2(B, D) (Σ_A f_1(A, B)))
     = (1/Z) Σ_D f_4(D, E) (Σ_C f_3(C, D)) (Σ_B f_2(B, D) m_{A→B}(B))
     = (1/Z) Σ_D f_4(D, E) (Σ_C f_3(C, D)) m_{B→D}(D)
     = (1/Z) Σ_D f_4(D, E) m_{B→D}(D) m_{C→D}(D)
     = (1/Z) m_{D→E}(E)
Slide credit: Bernt Schiele, Stefan Roth

Inference on Trees: How Can We Generalize?
We can generalize this to all tree graphs (undirected trees, directed trees, polytrees):
- Root the tree at the variable that we want to compute the marginal of.
- Start computing messages at the leaves.
- Compute the messages for all nodes for which all incoming messages have already been computed.
- Repeat until we reach the root.
If we want to compute the marginals for all possible nodes (roots), we can reuse some of the messages. The computational expense is linear in the number of nodes.
Next lecture: formalize the message-passing idea (sum-product algorithm); a common representation of the above (factor graphs); dealing with loopy graph structures (junction tree algorithm).
Slide credit: Bernt Schiele, Stefan Roth; Image source: C. M. Bishop, 2006
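Assuming the tree structure f_1(A, B), f_2(B, D), f_3(C, D), f_4(D, E) used in the example (an assumption of this reconstruction), the four messages can be computed with arrays and checked against brute force:

```python
import itertools
import numpy as np

rng = np.random.default_rng(2)
K = 2
# Random binary factors for the tree A-B, B-D, C-D, D-E
f1, f2, f3, f4 = (rng.random((K, K)) for _ in range(4))

# Messages toward the root E:
m_AB = f1.sum(axis=0)                            # m_{A->B}(B) = sum_A f1(A,B)
m_CD = f3.sum(axis=0)                            # m_{C->D}(D) = sum_C f3(C,D)
m_BD = (f2 * m_AB[:, None]).sum(axis=0)          # sum_B f2(B,D) m_{A->B}(B)
m_DE = (f4 * (m_BD * m_CD)[:, None]).sum(axis=0) # sum_D f4(D,E) m_BD(D) m_CD(D)
p_E = m_DE / m_DE.sum()                          # normalizing recovers 1/Z

# Brute-force check over all 2^5 joint states
brute = np.zeros(K)
for A, B, C, D, E in itertools.product(range(K), repeat=5):
    brute[E] += f1[A, B] * f2[B, D] * f3[C, D] * f4[D, E]
brute /= brute.sum()
assert np.allclose(p_E, brute)
```

Each message is computed once per edge, which is what makes the overall cost linear in the number of nodes.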
References and Further Reading
A thorough introduction to Graphical Models in general and Bayesian Networks in particular can be found in Chapter 8 of Bishop's book: Christopher M. Bishop, Pattern Recognition and Machine Learning, Springer, 2006.
CS242: Probabilistic Graphical Models Lecture 2B: Loopy Belief Propagation & Junction Trees Professor Erik Sudderth Brown University Computer Science September 22, 2016 Some figures and materials courtesy
More informationECE521 Lecture 21 HMM cont. Message Passing Algorithms
ECE521 Lecture 21 HMM cont Message Passing Algorithms Outline Hidden Markov models Numerical example of figuring out marginal of the observed sequence Numerical example of figuring out the most probable
More informationRecall from last time. Lecture 4: Wrap-up of Bayes net representation. Markov networks. Markov blanket. Isolating a node
Recall from last time Lecture 4: Wrap-up of Bayes net representation. Markov networks Markov blanket, moral graph Independence maps and perfect maps Undirected graphical models (Markov networks) A Bayes
More information2. Graphical Models. Undirected graphical models. Factor graphs. Bayesian networks. Conversion between graphical models. Graphical Models 2-1
Graphical Models 2-1 2. Graphical Models Undirected graphical models Factor graphs Bayesian networks Conversion between graphical models Graphical Models 2-2 Graphical models There are three families of
More informationGraphical models are a lot like a circuit diagram they are written down to visualize and better understand a problem.
Machine Learning (ML, F16) Lecture#15 (Tuesday, Nov. 1st) Lecturer: Byron Boots Graphical Models 1 Graphical Models Often, one is interested in representing a joint distribution P over a set of n random
More information1 : Introduction to GM and Directed GMs: Bayesian Networks. 3 Multivariate Distributions and Graphical Models
10-708: Probabilistic Graphical Models, Spring 2015 1 : Introduction to GM and Directed GMs: Bayesian Networks Lecturer: Eric P. Xing Scribes: Wenbo Liu, Venkata Krishna Pillutla 1 Overview This lecture
More informationECE521: Week 11, Lecture March 2017: HMM learning/inference. With thanks to Russ Salakhutdinov
ECE521: Week 11, Lecture 20 27 March 2017: HMM learning/inference With thanks to Russ Salakhutdinov Examples of other perspectives Murphy 17.4 End of Russell & Norvig 15.2 (Artificial Intelligence: A Modern
More informationLecture 5: Exact inference
Lecture 5: Exact inference Queries Inference in chains Variable elimination Without evidence With evidence Complexity of variable elimination which has the highest probability: instantiation of all other
More informationGraph Theory. Probabilistic Graphical Models. L. Enrique Sucar, INAOE. Definitions. Types of Graphs. Trajectories and Circuits.
Theory Probabilistic ical Models L. Enrique Sucar, INAOE and (INAOE) 1 / 32 Outline and 1 2 3 4 5 6 7 8 and 9 (INAOE) 2 / 32 A graph provides a compact way to represent binary relations between a set of
More informationStatistical Techniques in Robotics (STR, S15) Lecture#06 (Wednesday, January 28)
Statistical Techniques in Robotics (STR, S15) Lecture#06 (Wednesday, January 28) Lecturer: Byron Boots Graphical Models 1 Graphical Models Often one is interested in representing a joint distribution P
More informationCOS 513: Foundations of Probabilistic Modeling. Lecture 5
COS 513: Foundations of Probabilistic Modeling Young-suk Lee 1 Administrative Midterm report is due Oct. 29 th. Recitation is at 4:26pm in Friend 108. Lecture 5 R is a computer language for statistical
More informationMotivation: Shortcomings of Hidden Markov Model. Ko, Youngjoong. Solution: Maximum Entropy Markov Model (MEMM)
Motivation: Shortcomings of Hidden Markov Model Maximum Entropy Markov Models and Conditional Random Fields Ko, Youngjoong Dept. of Computer Engineering, Dong-A University Intelligent System Laboratory,
More information4 Factor graphs and Comparing Graphical Model Types
Massachusetts Institute of Technology Department of Electrical Engineering and Computer Science 6.438 Algorithms for Inference Fall 2014 4 Factor graphs and Comparing Graphical Model Types We now introduce
More informationMachine Learning
Machine Learning 10-601 Tom M. Mitchell Machine Learning Department Carnegie Mellon University April 1, 2019 Today: Inference in graphical models Learning graphical models Readings: Bishop chapter 8 Bayesian
More informationDynamic Bayesian network (DBN)
Readings: K&F: 18.1, 18.2, 18.3, 18.4 ynamic Bayesian Networks Beyond 10708 Graphical Models 10708 Carlos Guestrin Carnegie Mellon University ecember 1 st, 2006 1 ynamic Bayesian network (BN) HMM defined
More informationMachine Learning
Machine Learning 10-601 Tom M. Mitchell Machine Learning Department Carnegie Mellon University February 25, 2015 Today: Graphical models Bayes Nets: Inference Learning EM Readings: Bishop chapter 8 Mitchell
More informationECE521 Lecture 18 Graphical Models Hidden Markov Models
ECE521 Lecture 18 Graphical Models Hidden Markov Models Outline Graphical models Conditional independence Conditional independence after marginalization Sequence models hidden Markov models 2 Graphical
More informationProbabilistic Graphical Models
Probabilistic Graphical Models Raquel Urtasun and Tamir Hazan TTI Chicago April 25, 2011 Raquel Urtasun and Tamir Hazan (TTI-C) Graphical Models April 25, 2011 1 / 17 Clique Trees Today we are going to
More informationGraphical Models Reconstruction
Graphical Models Reconstruction Graph Theory Course Project Firoozeh Sepehr April 27 th 2016 Firoozeh Sepehr Graphical Models Reconstruction 1/50 Outline 1 Overview 2 History and Background 3 Graphical
More informationGraphical Models. Dmitrij Lagutin, T Machine Learning: Basic Principles
Graphical Models Dmitrij Lagutin, dlagutin@cc.hut.fi T-61.6020 - Machine Learning: Basic Principles 12.3.2007 Contents Introduction to graphical models Bayesian networks Conditional independence Markov
More informationBayesian Networks Inference
Bayesian Networks Inference Machine Learning 10701/15781 Carlos Guestrin Carnegie Mellon University November 5 th, 2007 2005-2007 Carlos Guestrin 1 General probabilistic inference Flu Allergy Query: Sinus
More informationReading. Graph Terminology. Graphs. What are graphs? Varieties. Motivation for Graphs. Reading. CSE 373 Data Structures. Graphs are composed of.
Reading Graph Reading Goodrich and Tamassia, hapter 1 S 7 ata Structures What are graphs? Graphs Yes, this is a graph. Graphs are composed of Nodes (vertices) dges (arcs) node ut we are interested in a
More informationCSCE 478/878 Lecture 6: Bayesian Learning and Graphical Models. Stephen Scott. Introduction. Outline. Bayes Theorem. Formulas
ian ian ian Might have reasons (domain information) to favor some hypotheses/predictions over others a priori ian methods work with probabilities, and have two main roles: Optimal Naïve Nets (Adapted from
More informationDirected Graphical Models
Copyright c 2008 2010 John Lafferty, Han Liu, and Larry Wasserman Do Not Distribute Chapter 18 Directed Graphical Models Graphs give a powerful way of representing independence relations and computing
More informationDATA MINING LECTURE 10B. Classification k-nearest neighbor classifier Naïve Bayes Logistic Regression Support Vector Machines
DATA MINING LECTURE 10B Classification k-nearest neighbor classifier Naïve Bayes Logistic Regression Support Vector Machines NEAREST NEIGHBOR CLASSIFICATION 10 10 Illustrating Classification Task Tid Attrib1
More informationFactor Graphs and Inference
Factor Graph and Inerence Sargur Srihari rihari@cedar.bualo.edu 1 Topic 1. Factor Graph 1. Factor in probability ditribution. Deriving them rom graphical model. Eact Inerence Algorithm or Tree graph 1.
More informationProbabilistic Graphical Models
Overview of Part Two Probabilistic Graphical Models Part Two: Inference and Learning Christopher M. Bishop Exact inference and the junction tree MCMC Variational methods and EM Example General variational
More informationBelief propagation in a bucket-tree. Handouts, 275B Fall Rina Dechter. November 1, 2000
Belief propagation in a bucket-tree Handouts, 275B Fall-2000 Rina Dechter November 1, 2000 1 From bucket-elimination to tree-propagation The bucket-elimination algorithm, elim-bel, for belief updating
More informationMassachusetts Institute of Technology Department of Electrical Engineering and Computer Science Algorithms For Inference Fall 2014
Suggested Reading: Massachusetts Institute of Technology Department of Electrical Engineering and Computer Science 6.438 Algorithms For Inference Fall 2014 Probabilistic Modelling and Reasoning: The Junction
More informationProbabilistic Graphical Models
School of Computer Science Probabilistic Graphical Models Theory of Variational Inference: Inner and Outer Approximation Eric Xing Lecture 14, February 29, 2016 Reading: W & J Book Chapters Eric Xing @
More informationDesign and Analysis of Algorithms
S 101, Winter 2018 esign and nalysis of lgorithms Lecture 3: onnected omponents lass URL: http://vlsicad.ucsd.edu/courses/cse101-w18/ nd of Lecture 2: s, Topological Order irected cyclic raph.g., Vertices
More informationNote. Out of town Thursday afternoon. Willing to meet before 1pm, me if you want to meet then so I know to be in my office
raphs and Trees Note Out of town Thursday afternoon Willing to meet before pm, email me if you want to meet then so I know to be in my office few extra remarks about recursion If you can write it recursively
More informationSequence Labeling: The Problem
Sequence Labeling: The Problem Given a sequence (in NLP, words), assign appropriate labels to each word. For example, POS tagging: DT NN VBD IN DT NN. The cat sat on the mat. 36 part-of-speech tags used
More informationRecitation 4: Elimination algorithm, reconstituted graph, triangulation
Massachusetts Institute of Technology Department of Electrical Engineering and Computer Science 6.438 Algorithms For Inference Fall 2014 Recitation 4: Elimination algorithm, reconstituted graph, triangulation
More informationExpectation Propagation
Expectation Propagation Erik Sudderth 6.975 Week 11 Presentation November 20, 2002 Introduction Goal: Efficiently approximate intractable distributions Features of Expectation Propagation (EP): Deterministic,
More informationCS535 Big Data Fall 2017 Colorado State University 9/5/2017. Week 3 - A. FAQs. This material is built based on,
S535 ig ata Fall 217 olorado State University 9/5/217 Week 3-9/5/217 S535 ig ata - Fall 217 Week 3--1 S535 IG T FQs Programming ssignment 1 We will discuss link analysis in week3 Installation/configuration
More informationCS 188: Artificial Intelligence
CS 188: Artificial Intelligence Bayes Nets: Inference Instructors: Dan Klein and Pieter Abbeel --- University of California, Berkeley [These slides were created by Dan Klein and Pieter Abbeel for CS188
More information2. Graphical Models. Undirected pairwise graphical models. Factor graphs. Bayesian networks. Conversion between graphical models. Graphical Models 2-1
Graphical Models 2-1 2. Graphical Models Undirected pairwise graphical models Factor graphs Bayesian networks Conversion between graphical models Graphical Models 2-2 Graphical models Families of graphical
More information