Machine Learning Lecture 16

Undirected Graphical Models & Inference
28.06.2016
Bastian Leibe, RWTH Aachen
http://www.vision.rwth-aachen.de/
leibe@vision.rwth-aachen.de
Many slides adapted from B. Schiele, S. Roth, Z. Ghahramani

Course Outline
- Fundamentals (2 weeks): Bayes Decision Theory, Probability Density Estimation
- Discriminative Approaches (5 weeks): Linear Discriminant Functions, Statistical Learning Theory & SVMs, Ensemble Methods & Boosting, Decision Trees & Randomized Trees
- Generative Models (4 weeks): Bayesian Networks, Markov Random Fields, Exact Inference

Topics of This Lecture
- Recap: Directed Graphical Models (Bayesian Networks): factorization properties, conditional independence, Bayes Ball algorithm
- Undirected Graphical Models (Markov Random Fields): conditional independence, factorization, example application: image restoration, converting directed into undirected graphs
- Exact Inference in Graphical Models: marginalization for undirected graphs, inference on a chain, inference on a tree, message passing formalism

Recap: Graphical Models
Two basic kinds of graphical models: directed graphical models (Bayesian Networks) and undirected graphical models (Markov Random Fields).
Key components: nodes (random variables) and edges (directed or undirected). The value of a random variable may be known (observed) or unknown.

Recap: Directed Graphical Models
Chains of nodes (a -> b -> c): knowledge about a is expressed by the prior probability p(a); dependencies are expressed through the conditional probabilities p(b|a) and p(c|b). The joint distribution of all three variables is
p(a, b, c) = p(c|a, b) p(a, b) = p(c|b) p(b|a) p(a).
Convergent connections (a -> c <- b): here the value of c depends on both variables a and b. This is modeled with the conditional probability p(c|a, b). Therefore, the joint probability of all three variables is given as
p(a, b, c) = p(c|a, b) p(a, b) = p(c|a, b) p(a) p(b).
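To make the two factorizations concrete, here is a minimal numpy sketch (not from the lecture; the numbers are made up) that builds both joints for binary a, b, c and checks that no extra normalization is needed:

```python
import numpy as np

# Chain a -> b -> c:  p(a, b, c) = p(c|b) p(b|a) p(a)
p_a = np.array([0.6, 0.4])                      # p(a)
p_b_given_a = np.array([[0.9, 0.1],             # rows index a, columns index b
                        [0.3, 0.7]])
p_c_given_b = np.array([[0.8, 0.2],             # rows index b, columns index c
                        [0.5, 0.5]])
joint_chain = (p_a[:, None, None]
               * p_b_given_a[:, :, None]
               * p_c_given_b[None, :, :])        # shape (a, b, c)
print(round(joint_chain.sum(), 10))              # 1.0 -- already normalized

# Convergent connection a -> c <- b:  p(a, b, c) = p(c|a, b) p(a) p(b)
p_b = np.array([0.5, 0.5])
p_c_given_ab = np.array([[[0.95, 0.05], [0.40, 0.60]],
                         [[0.70, 0.30], [0.10, 0.90]]])   # p(c|a,b), last axis sums to 1
joint_conv = p_a[:, None, None] * p_b[None, :, None] * p_c_given_ab
print(round(joint_conv.sum(), 10))               # 1.0
```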

Recap: Factorization of the Joint Probability
Computing the joint probability for the example network:
p(x_1, ..., x_7) = p(x_1) p(x_2) p(x_3) p(x_4|x_1, x_2, x_3) p(x_5|x_1, x_3) p(x_6|x_4) p(x_7|x_4, x_5)
General factorization: p(x) = ∏_k p(x_k | pa_k), where pa_k denotes the parents of x_k.
We can directly read off the factorization of the joint from the network structure. It is the edges that are missing in the graph that are important: they encode the simplifying assumptions we make.

Recap: Factorized Representation
Reduction of complexity: representing the joint probability of n binary variables by brute force requires O(2^n) terms. The factorized form obtained from the graphical model only requires O(n·2^k) terms, where k is the maximum number of parents of a node.

Recap: Conditional Independence
Definition: X is conditionally independent of Y given V if p(X|Y, V) = p(X|V). Also: p(X, Y|V) = p(X|V) p(Y|V). Special case: marginal independence (V empty). Often, we are interested in conditional independence between sets of variables.
Three cases:
- Divergent ("tail-to-tail"): conditional independence when c is observed.
- Chain ("head-to-tail"): conditional independence when c is observed.
- Convergent ("head-to-head"): conditional independence when neither c nor any of its descendants are observed.

Recap: D-Separation
Definition: Let A, B, and C be non-intersecting subsets of nodes in a directed graph. A path from A to B is blocked if it contains a node such that either
- the arrows on the path meet either head-to-tail or tail-to-tail at the node, and the node is in the set C, or
- the arrows meet head-to-head at the node, and neither the node, nor any of its descendants, are in the set C.
If all paths from A to B are blocked, A is said to be d-separated from B by C. If A is d-separated from B by C, the joint distribution over all variables in the graph satisfies A ⊥ B | C. Read: A is conditionally independent of B given C.

Intuitive View: The Bayes Ball Algorithm
Game: can you get a ball from X to Y without being blocked by V? Depending on its direction and the previous node, the ball can
- pass through (from parent to all children, from child to all parents),
- bounce back (from any parent/child to all parents/children), or
- be blocked.
R. D. Shachter, Bayes-Ball: The Rational Pastime (for Determining Irrelevance and Requisite Information in Belief Networks and Influence Diagrams), UAI'98, 1998.
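As a small illustration of reading the factorization off the graph, the following sketch stores one conditional table per node of the example network, keyed by the states of its parents (an assumed data layout, not from the slides); it needs O(n·2^k) stored numbers rather than the 2^n entries of the full joint:

```python
import itertools
import random

# Each binary node stores p(x_k = 1 | parent states), indexed by the tuple of
# parent values.  This is O(n * 2^k) numbers instead of a 2^n joint table.
parents = {                       # the example network p(x1)...p(x7|x4,x5)
    'x1': (), 'x2': (), 'x3': (),
    'x4': ('x1', 'x2', 'x3'), 'x5': ('x1', 'x3'),
    'x6': ('x4',), 'x7': ('x4', 'x5'),
}
random.seed(0)
cpt = {v: {cfg: random.random()
           for cfg in itertools.product([0, 1], repeat=len(pa))}
       for v, pa in parents.items()}          # p(v = 1 | parent configuration)

def joint(x):
    """p(x1,...,x7) = prod_k p(x_k | pa_k), read off directly from the graph."""
    prob = 1.0
    for v, pa in parents.items():
        p1 = cpt[v][tuple(x[u] for u in pa)]
        prob *= p1 if x[v] == 1 else 1.0 - p1
    return prob

total = sum(joint(dict(zip(parents, vals)))
            for vals in itertools.product([0, 1], repeat=len(parents)))
print(round(total, 10))    # 1.0 -- the factorized form is automatically normalized
```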

The Bayes Ball Algorithm
Game rules:
- An unobserved node (W ∉ V) passes through balls from parents, but also bounces back balls from children.
- An observed node (W ∈ V) bounces back balls from parents, but blocks balls from children.
The Bayes Ball algorithm determines those nodes that are d-separated from the query node.

Example: Bayes Ball
A sequence of example slides applies these rules step by step to one graph, each asking: which nodes are d-separated from the query node given the observed nodes? (The example graphs and intermediate ball traces are figures from R. Shachter, 1998, and are not reproduced in this transcription.)
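For reference, a compact way to test d-separation in code is the ancestral-moral-graph criterion (restrict to ancestors, moralize, delete the observed set, test reachability), which yields the same answers as the Bayes ball game. The sketch below assumes the graph is given as a parent-list dictionary; it is not the bookkeeping from Shachter's paper:

```python
import itertools

def d_separated(parents, A, B, C):
    """True iff A is d-separated from B given C in the DAG given by `parents`."""
    # 1) Restrict to A, B, C and their ancestors.
    relevant, stack = set(), list(A | B | C)
    while stack:
        v = stack.pop()
        if v not in relevant:
            relevant.add(v)
            stack.extend(parents[v])
    # 2) Moralize: marry co-parents, then drop the edge directions.
    adj = {v: set() for v in relevant}
    for v in relevant:
        pas = [p for p in parents[v] if p in relevant]
        for p in pas:
            adj[v].add(p); adj[p].add(v)
        for p, q in itertools.combinations(pas, 2):
            adj[p].add(q); adj[q].add(p)
    # 3) Delete the observed set C and test undirected reachability from A to B.
    seen, stack = set(), [a for a in A if a not in C]
    while stack:
        v = stack.pop()
        if v in B:
            return False                      # unblocked path found
        if v in seen or v in C:
            continue
        seen.add(v)
        stack.extend(adj[v] - C)
    return True

# Toy convergent connection a -> c <- b:
parents = {'a': (), 'b': (), 'c': ('a', 'b')}
print(d_separated(parents, {'a'}, {'b'}, set()))   # True: marginally independent
print(d_separated(parents, {'a'}, {'b'}, {'c'}))   # False: observing c makes a, b dependent
```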

Example: Bayes Ball (continued)
The final example slide states the result: the query node is d-separated from the remaining nodes given the two observed nodes. (The concrete node labels refer to the example figure, which is not reproduced here.)

The Markov Blanket
Markov blanket of a node x_i: the minimal set of nodes that isolates x_i from the rest of the graph. This comprises the set of parents, children, and co-parents of x_i. This is what we have to watch out for!

Undirected Graphical Models (Markov Random Fields)
Undirected graphical models ("Markov Random Fields") are given by an undirected graph. Conditional independence is easier to read off for MRFs: without arrows, there is only one type of neighbor, which also gives a simpler Markov blanket (just the direct neighbors of the node).
Conditional independence for undirected graphs: if every path from any node in set A to set B passes through at least one node in set C, then A ⊥ B | C.

Factorization in MRFs
Factorization is more complicated in MRFs than in BNs. Important concept: maximal cliques.
- Clique: a subset of the nodes such that there exists a link between all pairs of nodes in the subset.
- Maximal clique: the biggest possible such clique in a given graph, i.e. a clique that cannot be extended by adding another node.
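A short sketch (again assuming a parent-list representation, not from the slides) that extracts both kinds of Markov blanket and shows why the co-parents appear in the directed case:

```python
def markov_blanket_directed(parents, x):
    """Parents, children, and co-parents of x in a Bayesian network."""
    children = {v for v, pa in parents.items() if x in pa}
    co_parents = {p for c in children for p in parents[c]} - {x}
    return set(parents[x]) | children | co_parents

def markov_blanket_undirected(neighbours, x):
    """In an MRF the Markov blanket is simply the set of direct neighbours."""
    return set(neighbours[x])

parents = {'x1': (), 'x2': (), 'x3': (),
           'x4': ('x1', 'x2', 'x3'), 'x5': ('x1', 'x3'),
           'x6': ('x4',), 'x7': ('x4', 'x5')}
print(sorted(markov_blanket_directed(parents, 'x4')))
# ['x1', 'x2', 'x3', 'x5', 'x6', 'x7'] -- x5 enters as a co-parent via x7
```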

Factorization in MRFs
Joint distribution: written as a product of potential functions over the maximal cliques C in the graph:
p(x) = (1/Z) ∏_C ψ_C(x_C)
The normalization constant Z is called the partition function:
Z = Σ_x ∏_C ψ_C(x_C)
Remarks: BNs are automatically normalized, but for MRFs we have to perform the normalization explicitly. The presence of the normalization constant is a major limitation: evaluating Z involves summing over O(K^M) terms for M nodes with K states each.

Role of the potential functions
General interpretation: there is no restriction to potential functions that have a specific probabilistic interpretation as marginals or conditional distributions. It is convenient to express them as exponential functions ("Boltzmann distribution")
ψ_C(x_C) = exp{-E(x_C)}
with an energy function E. Why is this convenient? The joint distribution is the product of the potentials, i.e. the sum of the energies: we can take the log and simply work with sums.

Comparison: Directed vs. Undirected Graphs
- Directed graphs (Bayesian networks): better at expressing causal relationships. Interpretation of a link a -> b: the conditional probability p(b|a). Factorization is simple (and the result is automatically normalized). Conditional independence is more complicated.
- Undirected graphs (Markov Random Fields): better at representing soft constraints between variables. Interpretation of a link a - b: there is some relationship between a and b. Factorization is complicated (and the result needs normalization). Conditional independence is simple.

Converting Directed to Undirected Graphs
Simple case, a chain: we can directly replace the directed links by undirected ones.
More difficult case, multiple parents: we need a clique of x_1, ..., x_4 to represent the factor p(x_4|x_1, x_2, x_3), so we have to introduce additional links ("marry the parents"). This process is called moralization; it results in the moral graph, in which the parents are fully connected, so their conditional independence is no longer expressed.
General procedure to convert directed to undirected:
1. Add undirected links to marry the parents of each node.
2. Drop the arrows on the original links (moral graph).
3. Find the maximal cliques for each node and initialize all clique potentials to 1.
4. Take each conditional distribution factor of the original directed graph and multiply it into one clique potential.
Restriction: conditional independence properties are often lost, since moralization results in additional connections and larger cliques.
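As a minimal illustration of the product-of-potentials form, the sketch below uses made-up pairwise Boltzmann potentials on a four-node loop (not the lecture's image-restoration example) and computes Z by brute force, which already takes K^M = 2^4 terms here:

```python
import itertools
import numpy as np

# A tiny MRF over M = 4 binary nodes x1..x4 with pairwise potentials
# psi_C(x_C) = exp(-E_C(x_C)) on the maximal cliques of a four-node loop.
rng = np.random.default_rng(0)
cliques = [('x1', 'x2'), ('x2', 'x3'), ('x3', 'x4'), ('x4', 'x1')]
energy = {c: rng.normal(size=(2, 2)) for c in cliques}      # energy functions E_C
psi = {c: np.exp(-energy[c]) for c in cliques}              # potentials

variables = ['x1', 'x2', 'x3', 'x4']

def unnormalized(x):
    """prod_C psi_C(x_C) for one joint assignment x."""
    return np.prod([psi[(u, v)][x[u], x[v]] for (u, v) in cliques])

# The partition function: a sum over K^M = 2^4 joint states -- this brute-force
# sum is exactly the expensive part for larger graphs.
Z = sum(unnormalized(dict(zip(variables, vals)))
        for vals in itertools.product([0, 1], repeat=len(variables)))

def p(x):
    return unnormalized(x) / Z            # p(x) = (1/Z) prod_C psi_C(x_C)

print(Z, p({'x1': 0, 'x2': 1, 'x3': 0, 'x4': 1}))
```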

Example: Graph Conversion
For the example network with joint
p(x_1, ..., x_7) = p(x_1) p(x_2) p(x_3) p(x_4|x_1, x_2, x_3) p(x_5|x_1, x_3) p(x_6|x_4) p(x_7|x_4, x_5):
Step 1) Marrying the parents.
Step 2) Dropping the arrows.
Step 3) Finding the maximal cliques for each node: ψ_1(x_1, x_2, x_3, x_4), ψ_2(x_1, x_3, x_4, x_5), ψ_3(x_4, x_5, x_7), ψ_4(x_4, x_6).
Step 4) Assigning the probabilities to the clique potentials:
ψ_1(x_1, x_2, x_3, x_4) = 1 · p(x_4|x_1, x_2, x_3) p(x_2) p(x_3)
ψ_2(x_1, x_3, x_4, x_5) = 1 · p(x_5|x_1, x_3) p(x_1)
ψ_3(x_4, x_5, x_7) = 1 · p(x_7|x_4, x_5)
ψ_4(x_4, x_6) = 1 · p(x_6|x_4)
The product of the four clique potentials reproduces the original joint, so here Z = 1.

Comparison of Expressive Power
Both types of graphs have unique configurations: there are directed graphs (e.g. the head-to-head connection) for which no undirected graph can represent these and only these independencies, and undirected graphs (e.g. a loop of four nodes) for which no directed graph can represent these and only these independencies.
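Steps 1 and 2 of the conversion can be written down directly; the sketch below (parent-list representation assumed) builds the moral graph of the example network, while steps 3 and 4 correspond to the clique potentials assigned above:

```python
import itertools

def moralize(parents):
    """Steps 1-2 of the conversion: marry the parents of each node, then drop
    the arrows.  Returns an undirected adjacency dict (the moral graph)."""
    adj = {v: set() for v in parents}
    for child, pas in parents.items():
        for p in pas:                                 # keep original links, undirected
            adj[child].add(p); adj[p].add(child)
        for p, q in itertools.combinations(pas, 2):   # marry the parents
            adj[p].add(q); adj[q].add(p)
    return adj

parents = {'x1': (), 'x2': (), 'x3': (),
           'x4': ('x1', 'x2', 'x3'), 'x5': ('x1', 'x3'),
           'x6': ('x4',), 'x7': ('x4', 'x5')}
moral = moralize(parents)
print(sorted(moral['x1']))   # ['x2', 'x3', 'x4', 'x5'] -- x2 and x3 were married
                             # into the clique {x1,x2,x3,x4} needed for p(x4|x1,x2,x3)
```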

Inference in Graphical Models
Inference, general definition: evaluate the probability distribution over some set of variables, given the values of another set of variables (the observations).
Example 1, chain a -> b -> c. Goal: compute the marginals
p(a) = Σ_{b,c} p(a) p(b|a) p(c|b)
p(b) = Σ_{a,c} p(a) p(b|a) p(c|b)
Example 2, with observation b = b_0:
p(a|b = b_0) = (1/p(b = b_0)) Σ_c p(a) p(b = b_0|a) p(c|b = b_0) = p(a) p(b = b_0|a) / p(b = b_0)
p(c|b = b_0) = (1/p(b = b_0)) Σ_a p(a) p(b = b_0|a) p(c|b = b_0) = p(c|b = b_0)

Example: consider five variables with joint
p(A, B, C, D, E) = p(A) p(B) p(C|A, B) p(D|B, C) p(E|C, D).
How can we compute p(A|C = c)? Idea:
p(A|C = c) = p(A, C = c) / p(C = c)

Computing p(A|C = c)
Assume each variable is binary.
Naive approach:
p(A, C = c) = Σ_{B,D,E} p(A, B, C = c, D, E)   (two possible values for each of A, B, D, E: 2^4 terms, 16 operations)
p(C = c) = Σ_A p(A, C = c)   (2 operations)
p(A|C = c) = p(A, C = c) / p(C = c)   (2 operations)
Total: 16 + 2 + 2 = 20 operations.
More efficient method:
p(A, C = c) = Σ_{B,D,E} p(A) p(B) p(C = c|A, B) p(D|B, C = c) p(E|C = c, D)
            = Σ_B p(A) p(B) p(C = c|A, B) [Σ_D p(D|B, C = c)] [Σ_E p(E|C = c, D)]
            = Σ_B p(A) p(B) p(C = c|A, B)   (the inner sums equal 1; 4 operations)
The rest stays the same, so the total is 4 + 2 + 2 = 8 operations.
Strategy: use the conditional independence in the graph to perform efficient inference. For singly connected graphs, this yields exponential gains in efficiency!

Computing Marginals
How do we apply graphical models? Given some observed variables, we want to compute distributions of the unobserved variables. In particular, we want to compute marginal distributions, for example p(x_4).
How can we compute marginals? The classical technique is the sum-product algorithm by Judea Pearl. In the context of (loopy) undirected models, this is also called (loopy) belief propagation [Weiss, 1997]. Basic idea: message passing.

Inference on a Chain
Chain graph with joint probability
p(x) = (1/Z) ψ_{1,2}(x_1, x_2) ψ_{2,3}(x_2, x_3) ... ψ_{N-1,N}(x_{N-1}, x_N).
Marginalization:
p(x_n) = Σ_{x_1} ... Σ_{x_{n-1}} Σ_{x_{n+1}} ... Σ_{x_N} p(x).
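The operation-count argument can be checked numerically. The sketch below uses made-up binary conditional tables for the same network structure and query; the sums over D and E drop out because they are sums of conditional distributions:

```python
import numpy as np

# Made-up CPTs for p(A,B,C,D,E) = p(A) p(B) p(C|A,B) p(D|B,C) p(E|C,D), query p(A | C=c).
rng = np.random.default_rng(1)
def table(*parent_shape):
    t = rng.random(parent_shape + (2,))          # random CPT, last axis sums to 1
    return t / t.sum(-1, keepdims=True)

pA, pB = table(), table()        # p(A), p(B)
pC_AB = table(2, 2)              # axes (A, B, C)
pD_BC = table(2, 2)              # axes (B, C, D)
pE_CD = table(2, 2)              # axes (C, D, E)
c = 1                            # observed value of C

# Naive: sum the full joint over B, D, E  (2^4 = 16 terms).
joint = (pA[:, None, None, None, None] * pB[None, :, None, None, None]
         * pC_AB[:, :, :, None, None] * pD_BC[None, :, :, :, None]
         * pE_CD[None, None, :, :, :])           # axes (A, B, C, D, E)
pA_Cc_naive = joint[:, :, c, :, :].sum(axis=(1, 2, 3))

# Efficient: sum_D p(D|B,C=c) = 1 and sum_E p(E|C=c,D) = 1, so
# p(A, C=c) = sum_B p(A) p(B) p(C=c|A,B)  -- 4 terms instead of 16.
pA_Cc = (pA[:, None] * pB[None, :] * pC_AB[:, :, c]).sum(axis=1)

print(np.allclose(pA_Cc_naive, pA_Cc))           # True
print(pA_Cc / pA_Cc.sum())                       # p(A | C = c)
```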

Inference on a Chain
Idea: split the computation into two parts ("messages"):
p(x_n) = (1/Z) μ_α(x_n) μ_β(x_n),
where μ_α(x_n) collects the potentials to the left of x_n and μ_β(x_n) those to the right.
We can define the messages recursively:
μ_α(x_n) = Σ_{x_{n-1}} ψ_{n-1,n}(x_{n-1}, x_n) μ_α(x_{n-1})
μ_β(x_n) = Σ_{x_{n+1}} ψ_{n,n+1}(x_n, x_{n+1}) μ_β(x_{n+1})
until we reach the leaf nodes, where
μ_α(x_2) = Σ_{x_1} ψ_{1,2}(x_1, x_2)   and   μ_β(x_{N-1}) = Σ_{x_N} ψ_{N-1,N}(x_{N-1}, x_N).
Interpretation: we pass messages from the two ends towards the query node x_n. We still need the normalization constant Z; it can be easily obtained from the marginals:
Z = Σ_{x_n} μ_α(x_n) μ_β(x_n).

Summary: Inference on a Chain
To compute local marginals:
- Compute and store all forward messages μ_α(x_n).
- Compute and store all backward messages μ_β(x_n).
- Compute Z at any node x_m.
- Compute p(x_n) = (1/Z) μ_α(x_n) μ_β(x_n) for all variables required.
Inference through message passing: we have thus seen a first message passing algorithm. How can we generalize this?

Inference on Trees
Let's next assume a tree graph. Example: we are given the following joint distribution over five variables (the factors correspond to the tree edges A-B, B-D, C-D, D-E):
p(A, B, C, D, E) = (1/Z) f_1(A, B) f_2(B, D) f_3(C, D) f_4(D, E)
Assume we want to know the marginal p(E).
Strategy: marginalize out all other variables by summing over them, then rearrange the terms:
p(E) = Σ_D Σ_C Σ_B Σ_A p(A, B, C, D, E)
     = (1/Z) Σ_D f_4(D, E) (Σ_C f_3(C, D)) (Σ_B f_2(B, D) (Σ_A f_1(A, B)))
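Here is a minimal sketch of the chain recursion with made-up pairwise potentials (N, K, and the potentials are arbitrary choices), verified against brute-force marginalization of the full joint:

```python
import numpy as np

# Message passing on a chain MRF with N nodes of K states each and made-up
# pairwise potentials psi[i] between x_i and x_{i+1} (0-indexed).
rng = np.random.default_rng(2)
N, K = 6, 3
psi = [rng.random((K, K)) for _ in range(N - 1)]

alpha = [np.ones(K)]                          # forward messages mu_alpha
for i in range(N - 1):
    alpha.append(psi[i].T @ alpha[-1])        # sum_{x_i} psi_i(x_i, x_{i+1}) mu_alpha(x_i)
beta = [np.ones(K)]                           # backward messages mu_beta
for i in reversed(range(N - 1)):
    beta.insert(0, psi[i] @ beta[0])          # sum_{x_{i+1}} psi_i(x_i, x_{i+1}) mu_beta(x_{i+1})

def marginal(n):
    unnorm = alpha[n] * beta[n]               # p(x_n) = (1/Z) mu_alpha(x_n) mu_beta(x_n)
    return unnorm / unnorm.sum()              # the normalizer of this product is Z

# Check against brute-force marginalization of the full joint.
joint = np.ones([K] * N)
for i in range(N - 1):
    shape = [1] * N
    shape[i] = shape[i + 1] = K
    joint = joint * psi[i].reshape(shape)
joint /= joint.sum()
n = 3
print(np.allclose(marginal(n), joint.sum(axis=tuple(j for j in range(N) if j != n))))  # True
```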

Marginalization with Messages
Use messages to express the marginalization:
m_{A→B}(B) = Σ_A f_1(A, B)
m_{C→D}(D) = Σ_C f_3(C, D)
m_{B→D}(D) = Σ_B f_2(B, D) m_{A→B}(B)
m_{D→E}(E) = Σ_D f_4(D, E) m_{B→D}(D) m_{C→D}(D)
Substituting them step by step:
p(E) = (1/Z) Σ_D f_4(D, E) (Σ_C f_3(C, D)) (Σ_B f_2(B, D) (Σ_A f_1(A, B)))
     = (1/Z) Σ_D f_4(D, E) (Σ_C f_3(C, D)) (Σ_B f_2(B, D) m_{A→B}(B))
     = (1/Z) Σ_D f_4(D, E) (Σ_C f_3(C, D)) m_{B→D}(D)
     = (1/Z) Σ_D f_4(D, E) m_{C→D}(D) m_{B→D}(D)
     = (1/Z) m_{D→E}(E)

Inference on Trees
We can generalize this to all tree graphs:
- Root the tree at the variable whose marginal we want to compute.
- Start computing messages at the leaves.
- Compute the messages for all nodes for which all incoming messages have already been computed.
- Repeat until we reach the root.
If we want to compute the marginals for all possible nodes (roots), we can reuse some of the messages. The computational expense is linear in the number of nodes.

Trees: How Can We Generalize?
The same idea applies to undirected trees, directed trees, and polytrees.
Next lecture:
- Formalize the message-passing idea: the sum-product algorithm.
- A common representation of the above: factor graphs.
- Dealing with loopy graph structures: the junction tree algorithm.
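The message schedule above can be run directly. The sketch below uses made-up binary factors on the same tree (edge labels as above) and checks the result against brute-force summation:

```python
import numpy as np

# Example tree: edges A-B (f1), B-D (f2), C-D (f3), D-E (f4), binary variables.
rng = np.random.default_rng(3)
f1, f2, f3, f4 = (rng.random((2, 2)) for _ in range(4))   # f1(A,B), f2(B,D), f3(C,D), f4(D,E)

m_A_B = f1.sum(axis=0)                    # m_{A->B}(B) = sum_A f1(A,B)
m_C_D = f3.sum(axis=0)                    # m_{C->D}(D) = sum_C f3(C,D)
m_B_D = f2.T @ m_A_B                      # m_{B->D}(D) = sum_B f2(B,D) m_{A->B}(B)
m_D_E = f4.T @ (m_B_D * m_C_D)            # m_{D->E}(E) = sum_D f4(D,E) m_{B->D}(D) m_{C->D}(D)
p_E = m_D_E / m_D_E.sum()                 # p(E) = (1/Z) m_{D->E}(E)

# Brute force: p(A,B,C,D,E) proportional to f1(A,B) f2(B,D) f3(C,D) f4(D,E)
joint = np.einsum('ab,bd,cd,de->abcde', f1, f2, f3, f4)
print(np.allclose(p_E, joint.sum(axis=(0, 1, 2, 3)) / joint.sum()))   # True
```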

References and Further Reading
A thorough introduction to Graphical Models in general and Bayesian Networks in particular can be found in Chapter 8 of Bishop's book:
Christopher M. Bishop, Pattern Recognition and Machine Learning, Springer, 2006.