FMA901F: Machine Learning Lecture 6: Graphical Models. Cristian Sminchisescu


Graphical Models Provide a simple way to visualize the structure of a probabilistic model and can be used to design and motivate new models. Insights into the properties of the model, including conditional independence properties, can be obtained by inspection of the graph. Complex computations, required to perform inference and learning in sophisticated models, can be expressed in terms of graphical manipulations, in which underlying mathematical expressions are carried along implicitly.

Choice of Graphical Models In a probabilistic graphical model, each node represents a random variable (or a group of variables), and the links express probabilistic relationships between these variables. The graph captures the way in which the joint distribution over all of the random variables can be decomposed into a product of factors, each depending only on a subset of the variables. In Bayesian networks, also known as directed graphical models, the links of the graph have a particular directionality indicated by arrows. The other major class of graphical models are Markov random fields, also known as undirected graphical models, in which the links do not carry arrows and have no directional significance. Directed graphs are useful for expressing causal relationships between random variables, whereas undirected graphs are better suited to expressing soft constraints between random variables. For the purposes of solving inference problems, it is often convenient to convert both directed and undirected graphs into a different representation called a factor graph.

Bayesian Networks Directed Acyclic Graph (DAG), with parent and child nodes. Consider the decomposition p(a, b, c) = p(c | a, b) p(b | a) p(a). The l.h.s. is symmetric w.r.t. a, b, c and holds for any choice of joint distribution; the r.h.s. is not symmetric w.r.t. a, b, c. Slides adapted from the textbook (Bishop).

Representation For a given choice of K, we can again represent this as a directed graph having K nodes, one for each conditional distribution on the right-hand side, with each node having incoming links from all lower-numbered nodes. We say that this graph is fully connected because there is a link between every pair of nodes. It is, however, the absence of links in the graph that conveys interesting information about the properties of the class of distributions that the graph represents.

Bayesian Networks General Factorization: p(x) = ∏_{k=1}^{K} p(x_k | pa_k), where pa_k denotes the set of parents of x_k.

Factorization The equation expresses the factorization properties of the joint distribution for a directed graphical model. This is general: we can equally associate sets of variables and vector-valued variables with the nodes of a graph. It is easy to show that the representation on the right-hand side of this factorization is always correctly normalized, provided the individual conditional distributions are normalized. Restriction: there must be no directed cycles (directed acyclic graphs, DAGs). This is equivalent to the statement that there exists an ordering of the nodes such that there are no links that go from any node to any lower-numbered node.
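
To make the bookkeeping concrete, here is a minimal sketch (not from the slides) that evaluates p(x) = ∏_k p(x_k | pa_k) for a small hypothetical discrete network a → c ← b; the variable names and probability tables are made up for illustration.

```python
# Minimal sketch: evaluating p(x) = prod_k p(x_k | pa_k) for a small
# discrete Bayesian network. The network a -> c <- b is hypothetical.
cpds = {
    # node: (parents, table mapping parent-value tuple -> {value: prob})
    "a": ((), {(): {0: 0.6, 1: 0.4}}),
    "b": ((), {(): {0: 0.7, 1: 0.3}}),
    "c": (("a", "b"), {
        (0, 0): {0: 0.9, 1: 0.1},
        (0, 1): {0: 0.5, 1: 0.5},
        (1, 0): {0: 0.4, 1: 0.6},
        (1, 1): {0: 0.1, 1: 0.9},
    }),
}

def joint_probability(assignment, cpds):
    """Product of conditionals p(x_k | pa_k), one factor per node."""
    prob = 1.0
    for node, (parents, table) in cpds.items():
        parent_values = tuple(assignment[p] for p in parents)
        prob *= table[parent_values][assignment[node]]
    return prob

print(joint_probability({"a": 1, "b": 0, "c": 1}, cpds))  # 0.4 * 0.7 * 0.6
```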

Bayesian Curve Fitting (1) Polynomial

Bayesian Curve Fitting (2) Plate

Bayesian Curve Fitting (3) Input variables and explicit hyperparameters

Bayesian Curve Fitting Learning Condition on data

Bayesian Curve Fitting Prediction Predictive distribution: p(t̂ | x̂, x, t) ∝ ∫ p(t̂, t, w | x̂, x) dw, where p(t̂, t, w | x̂, x) = [∏_n p(t_n | x_n, w)] p(w) p(t̂ | x̂, w).

Generative Models Causal process for generating images

Ancestral Sampling
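
A minimal sketch of ancestral sampling, assuming a small hypothetical discrete network: visit the nodes in an order in which parents precede children, and sample each variable from its conditional distribution given the already-sampled parent values.

```python
import random

# Hypothetical two-node network a -> b, used only to illustrate the idea.
cpds = {
    "a": ((), {(): {0: 0.6, 1: 0.4}}),
    "b": (("a",), {(0,): {0: 0.8, 1: 0.2}, (1,): {0: 0.3, 1: 0.7}}),
}
topological_order = ["a", "b"]  # parents always precede children

def ancestral_sample(cpds, order):
    """Sample each node from p(x_k | pa_k) in topological order."""
    sample = {}
    for node in order:
        parents, table = cpds[node]
        dist = table[tuple(sample[p] for p in parents)]
        values, probs = zip(*dist.items())
        sample[node] = random.choices(values, weights=probs)[0]
    return sample

print(ancestral_sample(cpds, topological_order))
```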

Causal, Generative Models The graphical model captures the causal process by which the observed data was generated. For this reason, such models are often called generative models. By contrast, the polynomial regression model is not generative because there is no probability distribution associated with the input variable x, and so it is not possible to generate synthetic data points from this model. We could make it generative by introducing a suitable prior distribution p(x), at the expense of a more complex model.

Discrete Variables (1) General joint distribution over two K-state variables: K^2 − 1 parameters. Independent joint distribution: 2(K − 1) parameters.

Discrete Variables (2) General joint distribution over M K-state variables: K^M − 1 parameters. M-node Markov chain: K − 1 + (M − 1) K(K − 1) parameters.
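
(As a worked example of these counts: with K = 2 states and M = 3 variables, the general joint distribution needs 2^3 − 1 = 7 parameters, while the Markov chain needs 1 + 2 · 2 · 1 = 5.)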

Discrete Variables: Bayesian Parameters (1)

Discrete Variables: Bayesian Parameters (2) Shared prior

Parameterized Conditional Distributions If x_1, ..., x_M are discrete, K-state variables, in general p(y = 1 | x_1, ..., x_M) has O(K^M) parameters. The parameterized form requires only M + 1 parameters.

Linear Gaussian Models (Directed Graph) Each node is Gaussian; its mean is a linear function of its parents. The same construction extends to vector-valued Gaussian nodes.
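
A small sketch of this construction for an assumed scalar two-node chain x1 → x2, with made-up coefficients: each node is drawn from a Gaussian whose mean is a linear function of its parent.

```python
import random

def sample_linear_gaussian(n=5):
    """Each node is Gaussian with mean linear in its parent (chain x1 -> x2)."""
    samples = []
    for _ in range(n):
        x1 = random.gauss(mu=0.0, sigma=1.0)             # root node
        x2 = random.gauss(mu=2.0 * x1 + 1.0, sigma=0.5)  # mean linear in x1
        samples.append((x1, x2))
    return samples

print(sample_linear_gaussian())
```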

Conditional Independence Suppose that a is conditionally independent of b given c: p(a | b, c) = p(a | c). Equivalently, p(a, b | c) = p(a | b, c) p(b | c) = p(a | c) p(b | c). Notation: a ⊥ b | c.

Verifying Conditional Independence Conditional independence properties play an important role in using probabilistic models for pattern recognition, simplifying both the structure of a model and the computations needed to perform inference and learning under that model. If we are given an expression for the joint distribution over a set of variables in terms of a product of conditional distributions, then we could test whether any potential conditional independence property holds by repeated application of the sum and product rules of probability. Such an approach would be very time consuming. An important feature of graphical models is that conditional independence properties of the joint distribution can be read directly from the graph without the need for any analytical manipulations. The general framework for achieving this is called d-separation.

Conditional Independence: Example 1 Node c is said to be tail-to-tail with respect to the path from node a to node b. Make a distinction: in this case the conditional independence property does not hold in general. It may, however, hold for a particular distribution by virtue of the specific numerical values associated with the various conditional probabilities, but it does not follow in general from the structure of the graph.

Conditional Independence: Example 1

Conditional Independence: Example 2 Node c is said to be head-to-tail with respect to the path from node a to node b.

Conditional Independence: Example 2

Conditional Independence: Example 3 Node c is head-to-head with respect to the path from a to b. Note: this is the opposite of Example 1, with c unobserved.

Conditional Independence: Example 3 There is one more subtlety associated with this third example that we need to consider. We say that node y is a descendant of node x if there is a path from x to y in which each step of the path follows the directions of the arrows. Then it can be shown that a head-to-head path will become unblocked if either the node, or any of its descendants, is observed.

Summary A tail-to-tail node or a head-to-tail node leaves a path unblocked unless it is observed, in which case it blocks the path. By contrast, a head-to-head node blocks a path if it is unobserved, but once the node, and/or at least one of its descendants, is observed the path becomes unblocked.

Am I out of fuel? B = Battery (0 = flat, 1 = fully charged), F = Fuel Tank (0 = empty, 1 = full), G = Fuel Gauge Reading (0 = empty, 1 = full).

Am I out of fuel? The probability of an empty tank is increased by observing G = 0.

Am I out of fuel? The probability of an empty tank is reduced by additionally observing B = 0. This is referred to as explaining away.

D-separation A, B, and C are non-intersecting subsets of nodes in a directed graph. A path from A to B is blocked if it contains a node such that either a) the arrows on the path meet either head-to-tail or tail-to-tail at the node, and the node is in the set C, or b) the arrows meet head-to-head at the node, and neither the node, nor any of its descendants, is in the set C. If all paths from A to B are blocked, A is said to be d-separated from B by C. If A is d-separated from B by C, the joint distribution over all variables in the graph satisfies A ⊥ B | C.

D-separation: Example

Bayes Ball Algorithm To check if x_A ⊥ x_C | x_B, we need to check whether every variable in A is d-separated from every variable in C conditioned on all variables in B. Otherwise said, given that all nodes in x_B are clamped, when we change nodes in x_A, can we change any of the nodes in x_C? The Bayes-Ball algorithm is such a d-separation test. We shade all nodes in x_B, place balls at each node in x_A (or x_C), let them bounce around according to some rules, and then ask whether any of the balls reach any of the nodes in x_C (or x_A). So we need to know what happens when a ball arrives at a node Y on its way from X to Z.

Bayes-Ball Rules

Bayes-Ball Boundary Rules We also need the boundary conditions. Explaining away case: if y or any of its descendants is shaded, the ball passes through. Balls can travel opposite to edge directions.
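
The rules above can be implemented as a reachability-style d-separation test. The sketch below is one such implementation in the spirit of the Bayes-ball rules (the dict-of-parent-lists encoding and the helper names are my own, not from the slides): balls spread from the source nodes, passing through tail-to-tail and head-to-tail nodes only when they are unshaded, and through head-to-head nodes only when the node or one of its descendants is shaded. It is illustrated on the fuel-gauge network B → G ← F used earlier.

```python
from collections import deque

def reachable(parents, sources, observed):
    """Nodes reachable from `sources` via active paths given `observed`
    (Bayes-ball-style d-separation reachability; parents maps node -> parent list)."""
    children = {n: [] for n in parents}
    for n, ps in parents.items():
        for p in ps:
            children[p].append(n)

    # Phase 1: observed nodes and their ancestors (needed for the head-to-head rule).
    anc, stack = set(), list(observed)
    while stack:
        y = stack.pop()
        if y not in anc:
            anc.add(y)
            stack.extend(parents[y])

    # Phase 2: traverse (node, direction) pairs; "up" = the ball arrived from a child.
    visited, reached = set(), set()
    queue = deque((s, "up") for s in sources)
    while queue:
        y, d = queue.popleft()
        if (y, d) in visited:
            continue
        visited.add((y, d))
        if y not in observed:
            reached.add(y)
        if d == "up" and y not in observed:
            queue.extend((p, "up") for p in parents[y])
            queue.extend((c, "down") for c in children[y])
        elif d == "down":
            if y not in observed:
                queue.extend((c, "down") for c in children[y])
            if y in anc:  # head-to-head node with an observed descendant: unblocked
                queue.extend((p, "up") for p in parents[y])
    return reached

def d_separated(parents, A, B, C):
    """True if every node in A is d-separated from every node in B given C."""
    return not (reachable(parents, A, set(C)) & set(B))

# Fuel example: B -> G <- F
parents = {"B": [], "F": [], "G": ["B", "F"]}
print(d_separated(parents, {"B"}, {"F"}, set()))  # True: head-to-head node blocks
print(d_separated(parents, {"B"}, {"F"}, {"G"}))  # False: observing G unblocks
```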

D-separation: I.I.D. Data

Directed Graphs as Distribution Filters We can view a graphical model (in this case a directed graph) as a filter in which a probability distribution p(x) is allowed through the filter if, and only if, it satisfies the directed factorization property. The set of all possible probability distributions p(x) that pass through the filter is denoted DF. We can alternatively use the graph to filter distributions according to whether they respect all of the conditional independencies implied by the d-separation properties of the graph. The d-separation theorem says that it is the same set of distributions DF that will be allowed through this second kind of filter.

Directed Graphs as Distribution Filters At one extreme we have a fully connected graph that exhibits no conditional independence properties at all, and which can represent any possible joint distribution over the given variables; the set DF will contain all possible distributions p(x). At the other extreme, we have the fully disconnected graph. This corresponds to joint distributions which factorize into the product of the marginal distributions over the variables comprising the nodes of the graph. For any given graph, the set of distributions DF will include any distributions that have additional independence properties beyond those described by the graph. For instance, a fully factorized distribution will always be passed through the filter implied by any graph over the corresponding set of variables.

The Markov Blanket Factors independent of x_i cancel between numerator and denominator.
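
For a directed graph, the factors that survive this cancellation involve the parents of x_i, its children, and the children's other parents (the co-parents). A minimal sketch of collecting this set, reusing the assumed parent-list encoding from the d-separation sketch:

```python
def markov_blanket(parents, node):
    """Parents, children, and co-parents of `node` in a directed graph."""
    children = [n for n, ps in parents.items() if node in ps]
    co_parents = {p for c in children for p in parents[c] if p != node}
    return set(parents[node]) | set(children) | co_parents

parents = {"B": [], "F": [], "G": ["B", "F"]}
print(markov_blanket(parents, "B"))  # {'G', 'F'}: child G and co-parent F
```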

Markov Random Fields A Markov random field, also known as a Markov network or an undirected graphical model, has a set of nodes, each of which corresponds to a variable or group of variables, as well as a set of links, each of which connects a pair of nodes. The links are undirected, that is, they do not carry arrows. By removing directionality from the links of the graph, the asymmetry between parent and child nodes is removed, and so the subtleties associated with head-to-head nodes (explaining away) no longer arise.

Conditional Independence A vertex cutset is a subset of nodes whose removal breaks the graph into two or more pieces. We consider all possible paths that connect nodes in set A to nodes in set B. If all such paths pass through one or more nodes in set C, then all such paths are blocked and so the conditional independence property holds. However, if there is at least one such path that is not blocked, then the property does not necessarily hold; more precisely, there will exist at least some distributions corresponding to the graph that do not satisfy this conditional independence relation. Note that this is exactly the same as the d-separation criterion except that there is no explaining away phenomenon. Testing for conditional independence in undirected graphs is therefore simpler than in directed graphs. An alternative way to view the conditional independence test is to remove all nodes in set C from the graph together with any links that connect to those nodes. We then ask if there exists a path that connects any node in A to any node in B. If there are no such paths, then the conditional independence property must hold.
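
The "remove C and look for a remaining path" test described above is straightforward to code. A minimal sketch, assuming an adjacency-list encoding of the undirected graph and a made-up four-node chain as the example:

```python
from collections import deque

def separated(adj, A, B, C):
    """True if removing the nodes in C disconnects every node in A from every
    node in B (graph-separation test for conditional independence)."""
    blocked, targets = set(C), set(B)
    queue, visited = deque(A), set(A)
    while queue:
        node = queue.popleft()
        if node in targets:
            return False
        for nb in adj[node]:
            if nb not in visited and nb not in blocked:
                visited.add(nb)
                queue.append(nb)
    return True

# Hypothetical chain a - c - d - b
adj = {"a": ["c"], "c": ["a", "d"], "d": ["c", "b"], "b": ["d"]}
print(separated(adj, {"a"}, {"b"}, {"c"}))  # True: every a-b path passes through c
print(separated(adj, {"a"}, {"b"}, set()))  # False
```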

Markov Blanket

Cliques and Maximal Cliques (figure: a clique and a maximal clique in an undirected graph)

Cliques and Maximal Cliques This leads us to consider a graphical concept called a clique, which is defined as a subset of the nodes in a graph such that there exists a link between all pairs of nodes in the subset. In other words, the set of nodes in a clique is fully connected. We can therefore define the factors in the decomposition of the joint distribution to be functions of the variables in the cliques. In fact, we can consider functions of the maximal cliques, without loss of generality, because other cliques must be subsets of maximal cliques.

Joint Distribution p(x) = (1/Z) ∏_C ψ_C(x_C), where ψ_C(x_C) ≥ 0 is the potential over clique C and Z = Σ_x ∏_C ψ_C(x_C) is the normalization coefficient; note: for M discrete K-state variables, Z contains K^M terms. Potentials can be expressed as energies via the Boltzmann distribution, ψ_C(x_C) = exp(−E(x_C)). The normalization constant is mostly necessary for learning. Note that we do not restrict the choice of potential functions to those that have a specific probabilistic interpretation as marginal or conditional distributions. This is in contrast to directed graphs, in which each factor represents the conditional distribution of the corresponding variable, conditioned on the state of its parents. However, when the undirected graph is constructed by starting with a directed graph, the potential functions may indeed have such an interpretation.
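
A minimal sketch of this product-of-potentials form with Boltzmann-style potentials ψ_C(x_C) = exp(−E(x_C)); the chain graph, the energy function, and the brute-force computation of Z (which enumerates all K^M configurations, as noted above) are illustrative assumptions.

```python
import itertools
import math

# Hypothetical pairwise MRF on a 3-node chain x1 - x2 - x3 with binary states.
variables = ["x1", "x2", "x3"]
cliques = [("x1", "x2"), ("x2", "x3")]

def clique_energy(values):
    """Simple pairwise energy: low energy (high potential) when the values agree."""
    return 0.0 if values[0] == values[1] else 1.0

def unnormalized(assignment):
    """prod_C psi_C(x_C) with psi_C = exp(-E_C)."""
    return math.exp(-sum(clique_energy(tuple(assignment[v] for v in c))
                         for c in cliques))

# Brute-force partition function Z: sums K^M = 2^3 terms here.
Z = sum(unnormalized(dict(zip(variables, states)))
        for states in itertools.product([0, 1], repeat=len(variables)))

print(unnormalized({"x1": 1, "x2": 1, "x3": 1}) / Z)  # p(x1=1, x2=1, x3=1)
```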

The Hammersley-Clifford Theorem (I) In plain words: consider the set of all possible distributions defined over a fixed set of variables corresponding to the nodes of a particular undirected graph. We can define UI to be the set of such distributions that are consistent with the set of conditional independence statements that can be read from the graph using graph separation. Similarly, we can define UF to be the set of such distributions that can be expressed as a factorization with respect to the maximal cliques of the graph. The Hammersley-Clifford theorem (Clifford, 1990) states that the sets UI and UF are identical.

The Hammersley-Clifford Theorem (II)

Illustration: Image De-Noising (1) Original Image, Noisy Image

Illustration: Image De-Noising (2)

Illustration: Image De-Noising (3) Noisy Image, Restored Image (ICM)

Illustration: Image De-Noising (4) Restored Image (ICM), Restored Image (Graph cuts)
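
A compact sketch of iterated conditional modes (ICM) for this binary denoising model, using the standard Ising-style energy E(x, y) = h Σ_i x_i − β Σ_{i,j} x_i x_j − η Σ_i x_i y_i from Bishop's treatment, with x_i, y_i ∈ {−1, +1}; the tiny synthetic image and the coefficient values are made up for illustration.

```python
def icm_denoise(y, beta=1.0, eta=2.0, h=0.0, sweeps=5):
    """Iterated conditional modes for the Ising denoising energy
    E(x, y) = h*sum_i x_i - beta*sum_{ij} x_i x_j - eta*sum_i x_i y_i.
    y is the noisy image as a list of lists with entries in {-1, +1}."""
    rows, cols = len(y), len(y[0])
    x = [row[:] for row in y]  # initialize x with the noisy observation
    for _ in range(sweeps):
        for i in range(rows):
            for j in range(cols):
                neighbors = sum(x[a][b]
                                for a, b in ((i - 1, j), (i + 1, j), (i, j - 1), (i, j + 1))
                                if 0 <= a < rows and 0 <= b < cols)

                # Local energy as a function of the candidate state s; keep the minimizer.
                def local_energy(s):
                    return h * s - beta * s * neighbors - eta * s * y[i][j]

                x[i][j] = min((+1, -1), key=local_energy)
    return x

# Tiny synthetic example: an all-(+1) image with two flipped pixels.
noisy = [[1, 1, 1, 1], [1, -1, 1, 1], [1, 1, -1, 1], [1, 1, 1, 1]]
print(icm_denoise(noisy))
```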

Converting Directed to Undirected Graphs (1)

How can we do this in general? We wish to convert any distribution specified by a factorization over a directed graph into one specified by a factorization over an undirected graph. This can be achieved if the clique potentials of the undirected graph are given by the conditional distributions of the directed graph. In order for this to be valid, we must ensure that the set of variables that appears in each of the conditional distributions is a member of at least one clique of the undirected graph. For nodes on the directed graph having just one parent, this is achieved simply by replacing the directed link with an undirected link. However, for nodes in the directed graph having more than one parent, this is not sufficient. These are the nodes that have the head-to-head paths encountered in our discussion of conditional independence.

Converting Directed to Undirected Graphs (2) Additional links

Moralization (I)

Moralization (II) We saw that in going from a directed to an undirected representation we had to discard some conditional independence properties from the graph. We could always trivially convert any distribution over a directed graph into one over an undirected graph by simply using a fully connected undirected graph. This would, however, discard all conditional independence properties and so would be vacuous. The process of moralization adds the fewest extra links and so retains the maximum number of independence properties.
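
A minimal sketch of the moralization step itself, "marrying" the parents of each node and then dropping the edge directions; the parent-list encoding is the same assumption as in the earlier sketches.

```python
from itertools import combinations

def moralize(parents):
    """Moral (undirected) graph of a DAG: connect all pairs of parents of each
    node, then drop the directions of all edges. Returns an adjacency dict."""
    adj = {n: set() for n in parents}
    for child, ps in parents.items():
        for p in ps:                      # drop directions: parent - child links
            adj[child].add(p)
            adj[p].add(child)
        for u, v in combinations(ps, 2):  # marry the parents
            adj[u].add(v)
            adj[v].add(u)
    return adj

parents = {"B": [], "F": [], "G": ["B", "F"]}
print(moralize(parents))  # B and F become linked because they share the child G
```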

D-map and I-map A graph is said to be a D-map (for "dependency map") of a distribution if every conditional independence statement satisfied by the distribution is reflected in the graph. Thus a completely disconnected graph (no links) will be a trivial D-map for any distribution. Alternatively, we can consider a specific distribution and ask which graphs have the appropriate conditional independence properties. If every conditional independence statement implied by a graph is satisfied by a specific distribution, then the graph is said to be an I-map (for "independence map") of that distribution. A fully connected graph will be a trivial I-map for any distribution. If it is the case that every conditional independence property of the distribution is reflected in the graph, and vice versa, then the graph is said to be a perfect map for that distribution. A perfect map is therefore both an I-map and a D-map.

Directed vs. Undirected Graphs (1)

Directed vs. Undirected Graphs (2) Cannot be expressed using an undirected graph over the same 3 variables. Cannot be expressed using a directed graph over the same 4 variables.

Inference We will study the problem of inference in graphical models, in which some of the nodes in a graph are clamped to observed values, and we wish to compute the posterior distributions of one or more subsets of other nodes. We will be able to exploit the graphical structure both to find efficient algorithms for inference, and to make the structure of those algorithms transparent. Specifically, we shall see that many algorithms can be expressed in terms of the propagation of local messages around the graph.

Inference in Graphical Models

Inference on a Chain

Inference on a Chain Because each summation effectively removes a variable from the distribution, this can be viewed as the removal of a node from the graph.

Inference on a Chain

Inference on a Chain

Inference on a Chain To compute local marginals: Compute and store all forward messages, μ_α(x_n). Compute and store all backward messages, μ_β(x_n). Compute Z at any node x_n. Compute p(x_n) = μ_α(x_n) μ_β(x_n) / Z for all variables required.
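
A minimal sketch of this forward/backward scheme on a short chain of discrete variables; the pairwise potentials and the number of states are made-up values for illustration.

```python
K = 2  # number of states per variable
# Hypothetical pairwise potentials: psi[n][(i, j)] = psi_{n,n+1}(x_n = i, x_{n+1} = j)
psi = [
    {(i, j): (2.0 if i == j else 1.0) for i in range(K) for j in range(K)},  # x1 - x2
    {(i, j): (3.0 if i == j else 1.0) for i in range(K) for j in range(K)},  # x2 - x3
]
N = len(psi) + 1  # number of nodes in the chain

# Forward messages: mu_alpha(x_1) = 1, mu_alpha(x_n) = sum_{x_{n-1}} psi * mu_alpha(x_{n-1})
mu_alpha = [[1.0] * K]
for n in range(len(psi)):
    mu_alpha.append([sum(psi[n][(i, j)] * mu_alpha[n][i] for i in range(K))
                     for j in range(K)])

# Backward messages: mu_beta(x_N) = 1, mu_beta(x_n) = sum_{x_{n+1}} psi * mu_beta(x_{n+1})
mu_beta = [[1.0] * K for _ in range(N)]
for n in reversed(range(len(psi))):
    mu_beta[n] = [sum(psi[n][(i, j)] * mu_beta[n + 1][j] for j in range(K))
                  for i in range(K)]

# Local marginals p(x_n) = mu_alpha(x_n) * mu_beta(x_n) / Z; Z is the same at every node.
Z = sum(a * b for a, b in zip(mu_alpha[0], mu_beta[0]))
marginals = [[a * b / Z for a, b in zip(mu_alpha[n], mu_beta[n])] for n in range(N)]
print(marginals)
```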

Readings Bishop, Ch. 8 (up to 8.4.2)