
D-Separation. Let A, B, and C be non-intersecting subsets of nodes in a directed graph. A path from A to B is blocked by C if it contains a node such that either
a) the arrows on the path meet head-to-tail or tail-to-tail at the node, and the node is in the set C, or
b) the arrows meet head-to-head at the node, and neither the node nor any of its descendants is in the set C.
If all paths from A to B are blocked, A is said to be d-separated from B by C. Notation: A ⊥ B | C.

D-Separation (continued). The definition is repeated with one addition: d-separation is a property of graphs, not of probability distributions.
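
To make the blocking rules concrete, here is a minimal sketch (my own code, not from the lecture) that checks them on a small hypothetical DAG given as a dict mapping each node to its list of parents:

```python
# Hypothetical example DAG: a -> c <- b, c -> d
parents = {"a": [], "b": [], "c": ["a", "b"], "d": ["c"]}

def children(node):
    return [n for n, ps in parents.items() if node in ps]

def descendants(node):
    out, stack = set(), [node]
    while stack:
        for ch in children(stack.pop()):
            if ch not in out:
                out.add(ch)
                stack.append(ch)
    return out

def path_blocked(path, C):
    """Apply rules a) and b) to every interior node of an undirected path."""
    for prev, node, nxt in zip(path, path[1:], path[2:]):
        head_to_head = prev in parents[node] and nxt in parents[node]
        if head_to_head:
            # rule b): blocked if neither the node nor any of its descendants is in C
            if node not in C and not (descendants(node) & C):
                return True
        elif node in C:
            # rule a): head-to-tail or tail-to-tail node that is in C
            return True
    return False

def simple_paths(a, b, path=()):
    """Enumerate simple paths from a to b, ignoring edge directions."""
    path = path + (a,)
    if a == b:
        yield path
        return
    for nb in set(parents[a]) | set(children(a)):
        if nb not in path:
            yield from simple_paths(nb, b, path)

def d_separated(a, b, C):
    return all(path_blocked(p, C) for p in simple_paths(a, b))

print(d_separated("a", "b", set()))    # True: the head-to-head node c blocks the path
print(d_separated("a", "b", {"d"}))    # False: conditioning on c's descendant d unblocks it
```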

D-Separation: Example. In the first case we condition on a descendant of the head-to-head node e, so e does not block the path from a to b. In the second case we condition on f, a tail-to-tail node on the only path from a to b, so f blocks the path.

I-Map. Definition 4.1: A graph G is called an I-map for a distribution p if every d-separation in G corresponds to a conditional independence relation satisfied by p, i.e. whenever A is d-separated from B by C in G, then A ⊥ B | C holds in p. Example: the fully connected graph is an I-map for any distribution, as there are no d-separations in that graph.

D-Map. Definition 4.2: A graph G is called a D-map for a distribution p if for every conditional independence relation satisfied by p there is a corresponding d-separation in G, i.e. whenever A ⊥ B | C holds in p, then A is d-separated from B by C in G. Example: the graph without any edges is a D-map for any distribution, as all pairs of disjoint subsets of nodes are d-separated in that graph.

Perfect Map. Definition 4.3: A graph G is called a perfect map for a distribution p if it is both a D-map and an I-map of p. A perfect map uniquely defines a probability distribution.

The Markov Blanket. Consider the distribution of a node x_i conditioned on all other nodes. For a directed model,
p(x_i | x_{j≠i}) = p(x_1, ..., x_D) / ∫ p(x_1, ..., x_D) dx_i = ∏_k p(x_k | pa_k) / ∫ ∏_k p(x_k | pa_k) dx_i.
Factors independent of x_i cancel between numerator and denominator; what remains depends only on the parents, children and co-parents of x_i. These nodes form the Markov blanket of x_i.
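
As a quick illustration, a small helper of my own (not from the slides) that reads off the Markov blanket of a node in a directed model:

```python
# Markov blanket in a directed model: parents, children and co-parents of the node.
def markov_blanket(node, parents):
    children = [n for n, ps in parents.items() if node in ps]
    co_parents = {p for c in children for p in parents[c]} - {node}
    return set(parents[node]) | set(children) | co_parents

dag = {"a": [], "b": [], "c": ["a", "b"], "d": ["c"]}   # hypothetical DAG: a -> c <- b, c -> d
print(markov_blanket("a", dag))                          # {'c', 'b'}: child c and co-parent b
```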

Recap: Directed Graphical Models. Directed graphical models can be used to represent probability distributions. This is useful for doing inference and for generating samples from the distribution efficiently.

Recap: D-Separation. D-separation is a property of graphs that can be determined easily. An I-map assigns to every d-separation a conditional independence relation; a D-map assigns to every conditional independence relation a d-separation. Every Bayes net determines a unique probability distribution.

In-depth: The Head-to-Head Node. Example: a = battery charged (0 or 1), b = fuel tank full (0 or 1), c = fuel gauge reads full (0 or 1), with p(a=1) = 0.9, p(b=1) = 0.9 and

a  b  p(c=1 | a, b)
1  1  0.8
1  0  0.2
0  1  0.2
0  0  0.1

We can compute p(¬c | ¬b) = 0.81 and p(¬c) = 0.315, and similarly obtain p(¬b | ¬c) ≈ 0.257 but p(¬b | ¬c, ¬a) ≈ 0.111: observing the flat battery (¬a) explains the empty gauge reading (¬c) away.
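
The numbers above can be verified by brute-force summation over the joint distribution; the following sketch (my own check, not the slide's code) reproduces them:

```python
# Brute-force check of the fuel-gauge example; a, b, c are binary variables.
p_a = {1: 0.9, 0: 0.1}                                        # p(a)
p_b = {1: 0.9, 0: 0.1}                                        # p(b)
p_c1 = {(1, 1): 0.8, (1, 0): 0.2, (0, 1): 0.2, (0, 0): 0.1}   # p(c=1 | a, b)

def joint(a, b, c):
    pc = p_c1[(a, b)] if c == 1 else 1.0 - p_c1[(a, b)]
    return p_a[a] * p_b[b] * pc

p_nc = sum(joint(a, b, 0) for a in (0, 1) for b in (0, 1))           # p(~c)          = 0.315
p_nb_nc = sum(joint(a, 0, 0) for a in (0, 1)) / p_nc                 # p(~b | ~c)    ~= 0.257
p_nb_nc_na = joint(0, 0, 0) / sum(joint(0, b, 0) for b in (0, 1))    # p(~b | ~c,~a) ~= 0.111
print(round(p_nc, 3), round(p_nb_nc, 3), round(p_nb_nc_na, 3))
```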

Recap: D-Separation (figure).

Directed vs. Undirected Graphs. Using d-separation we can identify conditional independencies in directed graphical models, but: Is there a simpler, more intuitive way to express conditional independence in a graph? Can we find a representation for cases where an ordering of the random variables is inappropriate (e.g. the pixels in a camera image)? Yes, we can: by removing the directions of the edges we obtain an undirected graphical model, also known as a Markov Random Field.

Example: Camera Image. For the pixels x_i of an image, edge directions are counter-intuitive, and in a directed model the Markov blanket of x_i is not just the direct neighbours.

Markov Random Fields: Markov Blanket. All paths from A to B go through C, i.e. C blocks all paths. We only need to condition on the direct neighbours of x to obtain conditional independence, because these already block every path from x to any other node.

Factorization of MRFs. Any two nodes x_i and x_j that are not connected in an MRF are conditionally independent given all other nodes:
p(x_i, x_j | x∖{i,j}) = p(x_i | x∖{i,j}) p(x_j | x∖{i,j}),
where x∖{i,j} denotes all nodes except x_i and x_j. In turn, each factor of the joint distribution may only contain nodes that are connected. This motivates the consideration of cliques in the graph: a clique is a fully connected subgraph; a maximal clique cannot be extended by another node without losing full connectivity.

Factorization of MRFs. In general, a Markov Random Field factorizes as
p(x) = (1/Z) ∏_C Φ_C(x_C),   (4.1)
where C runs over the set of all (maximal) cliques, Φ_C is a positive function of the clique's nodes x_C, called the clique potential, and Z = Σ_x ∏_C Φ_C(x_C) is the partition function. Theorem (Hammersley/Clifford): any undirected model with associated clique potentials Φ_C is a perfect map for the probability distribution defined by Equation (4.1). As a consequence, all probability distributions that can be factorized as in (4.1) can be represented as an MRF.
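
To make Equation (4.1) concrete, here is a brute-force sketch on a toy binary MRF of my own choosing (a three-node chain with one pairwise potential per maximal clique); it only illustrates how Z normalizes the product of clique potentials:

```python
from itertools import product

# Hypothetical 3-node chain x1 - x2 - x3 with maximal cliques {x1, x2} and {x2, x3}.
def psi(xi, xj):
    return 2.0 if xi == xj else 1.0          # a positive potential favouring equal labels

def unnormalized(x):
    x1, x2, x3 = x
    return psi(x1, x2) * psi(x2, x3)         # product over the (maximal) cliques

Z = sum(unnormalized(x) for x in product((0, 1), repeat=3))       # partition function
p = {x: unnormalized(x) / Z for x in product((0, 1), repeat=3)}   # Eq. (4.1)
print(Z, sum(p.values()))                    # the normalized values sum to 1
```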

Converting Directed to Undirected Graphs (1). For a directed chain, each conditional distribution maps directly onto a pairwise clique potential of the undirected chain; in this case Z = 1.

Converting Directed to Undirected Graphs (2). Example: x_1, x_2 and x_3 are all parents of x_4, so p(x) = p(x_1) p(x_2) p(x_3) p(x_4 | x_1, x_2, x_3). In general, conditional distributions in the directed graph are mapped to cliques in the undirected graph. However, the parents are not conditionally independent given the head-to-head node. Therefore, we connect all parents of head-to-head nodes with each other (moralization).

Converting Directed to Undirected Graphs (2, continued). After moralization, the same example is represented by a single four-node clique: p(x) = (1/Z) ψ(x_1, x_2, x_3, x_4). Problem: this process can remove conditional independence relations, which is inefficient. In general, there is no one-to-one mapping between the distributions represented by directed and by undirected graphs.
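
The moralization step itself fits in a few lines; the helper below is my own illustration (graph given as a dict from node to parents), not code from the lecture:

```python
from itertools import combinations

def moralize(parents):
    """Connect all parents of every node with each other, then drop edge directions."""
    edges = set()
    for child, ps in parents.items():
        for p in ps:                                  # keep the original edges, undirected
            edges.add(frozenset((p, child)))
        for p, q in combinations(ps, 2):              # "marry" the parents
            edges.add(frozenset((p, q)))
    return edges

# Head-to-head example from above: x1, x2, x3 are all parents of x4.
moral = moralize({"x1": [], "x2": [], "x3": [], "x4": ["x1", "x2", "x3"]})
print(sorted(tuple(sorted(e)) for e in moral))        # one fully connected four-node clique
```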

Representability. As for DAGs, we can define an I-map, a D-map and a perfect map for MRFs. The set of all distributions for which a DAG exists that is a perfect map is different from the corresponding set for MRFs: within the set of all distributions, the distributions with a DAG as perfect map and those with an MRF as perfect map form two distinct, partially overlapping subsets.

Directed vs. Undirected Graphs. Neither of the two example distributions can be represented in the other framework (directed/undirected) with all of its conditional independence relations preserved.

Using Graphical Models. We can use a graphical model to do inference: some nodes in the graph are observed, and for others we want to find the posterior distribution. Also, computing the local marginal distribution p(x_n) at any node x_n can be done using inference. Question: how can inference be done with a graphical model? We will see that, by exploiting conditional independencies, we can do inference efficiently.

Inference on a Chain. The joint probability of an undirected chain is given by
p(x) = (1/Z) ψ_{1,2}(x_1, x_2) ψ_{2,3}(x_2, x_3) ⋯ ψ_{N-1,N}(x_{N-1}, x_N).
The marginal at x_3 is
p(x_3) = Σ_{x_1} Σ_{x_2} Σ_{x_4} ⋯ Σ_{x_N} p(x),
and in the general case with N nodes, each taking K states, the marginal p(x_n) is obtained by summing the joint over all variables except x_n.

Inference on a Chain. Evaluated naively, this would mean on the order of K^N computations! A more efficient way is obtained by rearranging the sums, pushing each summation as far inside the product as possible:
p(x_n) = (1/Z) [Σ_{x_{n-1}} ψ_{n-1,n}(x_{n-1}, x_n) ⋯ [Σ_{x_1} ψ_{1,2}(x_1, x_2)] ⋯ ] [Σ_{x_{n+1}} ψ_{n,n+1}(x_n, x_{n+1}) ⋯ [Σ_{x_N} ψ_{N-1,N}(x_{N-1}, x_N)] ⋯ ].
Each bracketed term is a vector of size K.

Inference on a Chain. In general, we have p(x_n) = (1/Z) μ_α(x_n) μ_β(x_n), where μ_α(x_n) is the message passed forward along the chain and μ_β(x_n) is the message passed backward.

Inference on a Chain. The messages μ_α and μ_β can be computed recursively:
μ_α(x_n) = Σ_{x_{n-1}} ψ_{n-1,n}(x_{n-1}, x_n) μ_α(x_{n-1}),
μ_β(x_n) = Σ_{x_{n+1}} ψ_{n,n+1}(x_n, x_{n+1}) μ_β(x_{n+1}).
Computation of μ_α starts at the first node, and computation of μ_β starts at the last node.

Inference on a Chain. The first values of μ_α and μ_β are
μ_α(x_2) = Σ_{x_1} ψ_{1,2}(x_1, x_2) and μ_β(x_{N-1}) = Σ_{x_N} ψ_{N-1,N}(x_{N-1}, x_N).
The partition function can be computed at any node: Z = Σ_{x_n} μ_α(x_n) μ_β(x_n). Overall, we need O(N K²) operations to compute the marginal.

Inference on a Chain. To compute all local marginals:
1. Compute and store all forward messages μ_α(x_n).
2. Compute and store all backward messages μ_β(x_n).
3. Compute Z once at a node x_m: Z = Σ_{x_m} μ_α(x_m) μ_β(x_m).
4. Compute p(x_n) = (1/Z) μ_α(x_n) μ_β(x_n) for all variables required.
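
The whole procedure fits in a few lines of NumPy; the sketch below uses random pairwise potentials of my own choosing (no unary terms, as in the chain above) and checks that every local marginal sums to one:

```python
import numpy as np

N, K = 5, 3                                              # N nodes, K states each
rng = np.random.default_rng(0)
psi = [rng.random((K, K)) + 0.1 for _ in range(N - 1)]   # psi[n][x_n, x_{n+1}] > 0

mu_alpha = [np.ones(K) for _ in range(N)]                # forward messages
mu_beta = [np.ones(K) for _ in range(N)]                 # backward messages
for n in range(1, N):                                    # mu_a(x_n) = sum_{x_{n-1}} psi * mu_a(x_{n-1})
    mu_alpha[n] = psi[n - 1].T @ mu_alpha[n - 1]
for n in range(N - 2, -1, -1):                           # mu_b(x_n) = sum_{x_{n+1}} psi * mu_b(x_{n+1})
    mu_beta[n] = psi[n] @ mu_beta[n + 1]

Z = float(np.sum(mu_alpha[2] * mu_beta[2]))              # Z can be read off at any node
marginals = [mu_alpha[n] * mu_beta[n] / Z for n in range(N)]
print(np.allclose([m.sum() for m in marginals], 1.0))    # True: each marginal is normalized
```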

More General Graphs. The message-passing algorithm can be extended to more general graphs: undirected trees, directed trees and polytrees. It is then known as the sum-product algorithm; belief propagation is a special case of it.

Factor Graphs. The sum-product algorithm can be used to do inference on undirected and directed graphs. A representation that generalizes directed and undirected models is the factor graph. Example: the directed graph for f(x_1, x_2, x_3) = p(x_1) p(x_2) p(x_3 | x_1, x_2) and the corresponding factor graph.

Factor Graphs. The same conversion applies to undirected graphs: the figure shows an undirected graph and a corresponding factor graph.

Factor Graphs. Factor graphs can contain multiple factors for the same set of nodes, are more general than undirected graphs, and are bipartite, i.e. they consist of two kinds of nodes (variable nodes and factor nodes) and all edges connect nodes of different kinds.

Factor Graphs. Directed trees convert to tree-structured factor graphs, and the same holds for undirected trees. Directed polytrees also convert to tree-structured factor graphs. Moreover, local cycles in a directed graph can be removed by converting it to a factor graph.

Sum-Product Inference in General Graphical Models.
1. Convert the graph (directed or undirected) into a factor graph (assumed to contain no cycles).
2. If the goal is to marginalize at node x, consider x as the root node.
3. Initialize the recursion at the leaf nodes: μ_{x→f}(x) = 1 for a leaf variable node, μ_{f→x}(x) = f(x) for a leaf factor node.
4. Propagate messages from the leaves to x.
5. Propagate messages from x back to the leaves.
6. Obtain the marginal at every node by multiplying all incoming messages.
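
As a compact illustration of steps 2-6, the recursion below runs sum-product on a tiny tree-structured factor graph for f(x_1, x_2, x_3) = p(x_1) p(x_2) p(x_3 | x_1, x_2); the factor tables (with p(x_3 | x_1, x_2) deliberately flat) and all names are my own toy choices, not the lecture's values:

```python
from itertools import product

# Factors: name -> (scope of variables, table mapping binary assignments to values).
factors = {
    "fa": (("x1",), {(0,): 0.4, (1,): 0.6}),                                  # p(x1)
    "fb": (("x2",), {(0,): 0.7, (1,): 0.3}),                                  # p(x2)
    "fc": (("x1", "x2", "x3"), {a: 0.5 for a in product((0, 1), repeat=3)}),  # flat p(x3 | x1, x2)
}
var_neighbors = {"x1": ["fa", "fc"], "x2": ["fb", "fc"], "x3": ["fc"]}

def msg_var_to_fac(var, fac):
    out = {0: 1.0, 1: 1.0}                      # leaf variable nodes send the constant message 1
    for f in var_neighbors[var]:
        if f != fac:
            m = msg_fac_to_var(f, var)
            out = {v: out[v] * m[v] for v in (0, 1)}
    return out

def msg_fac_to_var(fac, var):
    scope, table = factors[fac]
    incoming = {v: msg_var_to_fac(v, fac) for v in scope if v != var}
    out = {0: 0.0, 1: 0.0}
    for assignment in product((0, 1), repeat=len(scope)):
        a = dict(zip(scope, assignment))
        weight = table[assignment]
        for v in scope:
            if v != var:
                weight *= incoming[v][a[v]]     # multiply incoming messages, then sum out
        out[a[var]] += weight
    return out

def marginal(var):
    out = {0: 1.0, 1: 1.0}                      # product of all incoming factor messages
    for f in var_neighbors[var]:
        m = msg_fac_to_var(f, var)
        out = {v: out[v] * m[v] for v in (0, 1)}
    Z = out[0] + out[1]
    return {v: out[v] / Z for v in (0, 1)}

print(marginal("x3"))                           # {0: 0.5, 1: 0.5} for the flat conditional
```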

Other Inference Algorithms. Max-sum algorithm: used to maximize the joint probability of all variables (no marginalization). Junction tree algorithm: exact inference for general graphs (even with loops). Loopy belief propagation: approximate inference on general graphs (more efficient). A special kind of undirected graphical model: conditional random fields (e.g. for classification).

Conditional Random Fields. Another kind of undirected graphical model is known as the Conditional Random Field (CRF). CRFs are used for classification, where labels are represented as discrete random variables y and features as continuous random variables x. A CRF represents the conditional probability p(y | x, w), where w are parameters learned from training data. CRFs are discriminative models, whereas MRFs are generative.

Conditional Random Fields. Derivation of the formula for CRFs: in the training phase, we compute parameters w that maximize the posterior
w* = argmax_w p(w | x*, y*) ∝ p(y* | x*, w) p(w),
where (x*, y*) is the training data and p(w) is a Gaussian prior. In the inference phase we maximize p(y | x, w) with respect to the labels y.

Conditional Random Fields. Typical example: the observed variables x_{i,j} are the intensity values of the pixels in an image, and the hidden variables y_{i,j} are object labels. Note: the definition of x_{i,j} and y_{i,j} differs from the one in C. M. Bishop (p. 389)!

CRF Training. We minimize the negative log-posterior
-log p(w | x*, y*) = -log p(y* | x*, w) - log p(w) + const.
Computing the likelihood is intractable, as we would have to compute the partition function for every value of w. We can approximate the likelihood using the pseudo-likelihood
p(y | x, w) ≈ ∏_i p(y_i | MB(y_i), x, w),
where the Markov blanket MB(y_i) is given by C_i, the set of all cliques containing y_i.

Pseudo Likelihood. The pseudo-likelihood is computed only on the Markov blanket of y_i and its corresponding feature nodes.

Potential Functions. The only requirement for the potential functions is that they are positive. We achieve that with the exponential form
Φ_C(x_C, y_C) = exp(w · f(x_C, y_C)),
where f is a compatibility function that is large if the labels y_C fit well to the features x_C. This is called the log-linear model. The function f can be, for example, a local classifier.
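
A minimal sketch of such a log-linear clique potential (the feature function f, the weights w, and the pairwise clique are hypothetical choices of mine, not the lecture's definition):

```python
import numpy as np

def f(x_c, y_c):
    """Toy compatibility features for a pairwise label clique (y_i, y_j) with pixel value x_c."""
    y_i, y_j = y_c
    return np.array([float(y_i == y_j),             # smoothness: neighbouring labels agree
                     float(y_i == (x_c > 0.5))])    # local evidence: label matches thresholded pixel

def phi(x_c, y_c, w):
    return np.exp(w @ f(x_c, y_c))                  # strictly positive by construction

w = np.array([1.0, 2.0])
print(phi(0.8, (1, 1), w), phi(0.8, (0, 1), w))     # the compatible labelling gets the larger potential
```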

CRF Training and Inference. Training: using the pseudo-likelihood, training is efficient. We have to minimize the sum of the negative log-pseudo-likelihood and the term arising from the Gaussian prior on w. This is a convex function that can be minimized using gradient descent. Inference: only approximate, e.g. using loopy belief propagation.

Summary. Undirected models (a.k.a. Markov random fields) provide an intuitive representation of conditional independence. An MRF is defined as a factorization over clique potentials and is normalized globally. Directed and undirected models have different representational power (there is no simple containment). Inference on undirected Markov chains is efficient using message passing. Factor graphs are more general; exact inference on them can be done efficiently using the sum-product algorithm.