Lecture 3: Conditional Independence - Undirected

CS598: Graphical Models, Fall 2016
Lecturer: Sanmi Koyejo
Scribes: Nate Bowman and Erin Carrier, Aug. 30, 2016

1 Review of the Bayes-Ball Algorithm

Recall the primary rules of the Bayes-Ball algorithm:

- Chain: if the middle node is known (given), the ball is blocked. Otherwise, the ball passes through.
- Tree with one parent node and two children: if the parent is known, the ball is blocked. Otherwise, the ball passes through.
- Two parent nodes pointing to one child node: if the child node is given, the ball passes through. Otherwise, it is blocked.

If the ball passes through, the nodes are conditionally dependent. Otherwise, they are conditionally independent.
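As a quick illustration, here is a minimal Python sketch (not from the lecture) encoding just these three three-node rules. The structure labels "chain", "fork", and "collider" and the function name are illustrative choices; this is not the full Bayes-Ball message-passing algorithm.

```python
# Minimal sketch of the three canonical Bayes-Ball rules for a triple (a, m, c),
# where m is the middle node. `structure` names the local pattern and
# `m_observed` says whether the middle node is in the conditioning set.

def ball_passes(structure: str, m_observed: bool) -> bool:
    """Return True if the ball passes through the middle node m."""
    if structure == "chain":        # a -> m -> c (or c -> m -> a)
        return not m_observed       # blocked exactly when m is given
    if structure == "fork":         # a <- m -> c (common parent)
        return not m_observed       # blocked exactly when m is given
    if structure == "collider":     # a -> m <- c (common child)
        return m_observed           # passes only when m (the child) is given
    raise ValueError(f"unknown structure: {structure}")

# Example: in a chain x -> y -> z, observing y blocks the ball,
# so x and z are conditionally independent given y.
assert not ball_passes("chain", m_observed=True)
# In a collider x -> y <- z, observing y lets the ball pass,
# so x and z become conditionally dependent given y.
assert ball_passes("collider", m_observed=True)
```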

2 Undirected Graphical Models

The main focus of this lecture is undirected graphical models, also known as Markov random fields (MRFs). Undirected graphical models have a variety of applications, including image processing, natural language processing, and networking. For instance, in image processing, Figure 1 could represent an image, with the nodes representing pixels (or blocks of pixels). One use of such a model is to distinguish the foreground from the background, with the foreground denoted by the circled nodes in Figure 1. It is important to note that undirected graphical models encode correlation, not causation; directed graphical models are better suited for encoding causation.

Figure 1: Example graph from an image-processing application illustrating foreground vs. background as graph separation.

2.1 Conditional Independence in Undirected Graphical Models

The fundamental concept covered in this lecture is conditional independence in undirected graphical models. Consider an undirected graph G = (V, E), where V is the set of nodes and E is the set of undirected edges. Each node i is associated with a random variable X_i. Let A, B, and C be three disjoint sets of nodes and let X_A, X_B, and X_C be the corresponding sets of random variables. Then X_A is conditionally independent of X_C given X_B if the nodes in A are separated from the nodes in C by the nodes in B: every path between a node in A and a node in C must pass through one or more nodes in B. Equivalently, if we remove the nodes in B and the edges connected to them, the graph splits into disconnected components, with A and C in different components. For instance, in Figure 2, X_A is conditionally independent of X_C given X_B.

Figure 2: Illustration of graph separation.
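The separation criterion is easy to check algorithmically: delete the nodes in B and test whether any node of C is still reachable from A. Below is a small Python sketch of this test (an illustration, not code from the lecture); the adjacency-dictionary representation and the function name are assumptions made for the example.

```python
from collections import deque

def separated(adj, A, B, C):
    """Return True if B separates A from C in the undirected graph `adj`."""
    A, B, C = set(A), set(B), set(C)
    visited = set(A)
    queue = deque(A)
    while queue:                         # BFS from A, never entering B
        node = queue.popleft()
        if node in C:
            return False                 # found a path from A to C that avoids B
        for nbr in adj.get(node, ()):
            if nbr not in visited and nbr not in B:
                visited.add(nbr)
                queue.append(nbr)
    return True

# Example: a chain 1 - 2 - 3; node 2 separates node 1 from node 3.
adj = {1: [2], 2: [1, 3], 3: [2]}
print(separated(adj, A={1}, B={2}, C={3}))   # True
print(separated(adj, A={1}, B=set(), C={3})) # False
```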

2.2 Equivalence of Undirected and Directed Graphs

Now that we have established the idea of conditional independence in undirected graphs and in directed graphs (previous lectures), a natural question is whether a given directed graph can be reduced to an equivalent undirected graph (or vice versa). For instance, consider the directed graph shown in Figure 3a. This graph encodes the assertions that x and z are marginally independent, but not independent given y. Figure 3b shows an undirected graph that might seem to encode the same set of independence assertions; however, it fails to encode the marginal independence of x and z. It turns out that no undirected graph can exactly represent the set of conditional independence assertions given by the directed graph in Figure 3a. In general, neither undirected nor directed graphical models are reducible to the other. However, there is a class of graphs that can be represented by both, known as chordal graphs.

Figure 3: (a) Example of a directed graph. (b) One attempt at an equivalent undirected graph.

An interesting difference between directed and undirected graphs is that the only way to have marginal independence in an undirected graph is for the graph to be disconnected. In a connected undirected graph, no node is marginally independent of the others.

3 Undirected Graphical Model Families

An undirected graph provides a set of conditional independence assertions via graph separation. The undirected graphical model family is the set of distributions that satisfy all of the assertions given by the graph. We would like to be able to represent the joint distributions of random variables that satisfy all of the conditional independence assertions implied by an undirected graphical model.

3.1 Local Parametrization

To compactly represent a joint distribution, we want a local parametrization of a given undirected graphical model. We had such a parametrization in the case of directed graphs, where the local conditional probabilities of nodes given their parents provided the parametrization. However, this interpretation of local, meaning a node and its parents, is not available in undirected graphical models, in which a node has no parents.

One obvious choice for the meaning of local in an undirected graphical model is a node and its neighbors, where we would give the conditional probability of a node given its neighbors. This type of local parametrization, though, does not ensure consistency of conditional probabilities across nodes, which means that the product of these conditional probabilities will not necessarily be a valid joint distribution. Defining conditional probabilities for each node given its neighbors is thus not a good way to locally parametrize the distribution represented by an undirected graphical model.

To find a more convenient local parametrization, we first establish a different meaning of local. To do so, we need the notion of a clique. A clique is a set of nodes in which each node has an edge connecting it to every other node. For example, in Figure 4, the nodes A, B, and C form a clique, but nodes B, C, and D do not because there is no edge between B and D. A maximal clique is a clique to which no node can be added while keeping it a clique. In Figure 4, nodes C and D form a maximal clique, as do nodes A, B, and C, whereas nodes A and B by themselves are not a maximal clique because node C can be added and the nodes still form a clique. From here on, local will refer to a maximal clique. It should be noted that while maximal cliques are convenient as we develop the theory, the problem of finding maximal cliques is nontrivial in general.

Figure 4: Example graph demonstrating maximal cliques.
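For concreteness, here is a short Bron-Kerbosch sketch (an illustration, not part of the lecture) that enumerates the maximal cliques of a small graph shaped like the one described for Figure 4. The adjacency dictionary and function name are illustrative, and the worst-case running time is exponential, consistent with the remark above that finding maximal cliques is nontrivial.

```python
def maximal_cliques(adj):
    """Yield maximal cliques (as sets) via Bron-Kerbosch without pivoting."""
    def expand(R, P, X):
        if not P and not X:
            yield set(R)        # R cannot be extended: it is a maximal clique
            return
        for v in list(P):
            yield from expand(R | {v}, P & adj[v], X & adj[v])
            P = P - {v}         # v has been handled; exclude it from later branches
            X = X | {v}
    yield from expand(set(), set(adj), set())

# Graph mirroring the Figure 4 description: edges A-B, A-C, B-C, C-D.
adj = {
    "A": {"B", "C"},
    "B": {"A", "C"},
    "C": {"A", "B", "D"},
    "D": {"C"},
}
print(list(maximal_cliques(adj)))  # [{'A', 'B', 'C'}, {'C', 'D'}] in some order
```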

3.2 Potential Functions

Now that we have settled on a meaning of local, we will look for functions that are local but can be chosen independently and still lead to a valid joint distribution. Note that if the random variables X_A and X_C are conditionally independent given X_B, then

\[
P(X_A \mid X_B, X_C) = P(X_A \mid X_B), \qquad
P(X_A, X_B, X_C) = P(X_A \mid X_B)\, P(X_B, X_C).
\]

We see that the joint distribution P(X_A, X_B, X_C) can be factored as

\[
P(X_A, X_B, X_C) = f(X_A, X_B)\, g(X_B, X_C)
\]

for some functions f and g. Using this idea, we factor the joint distribution as a product of functions, each of which depends only on the variables within a maximal clique. So, if there is no edge between variables X_i and X_j, for example, then the joint distribution cannot have a factor such as ψ(X_i, X_j, X_k), because the three variables do not form a clique.

We will refer to the functions in the factorization as potential functions. If C is the set of indices of a maximal clique in an undirected graph G, then the potential function ψ_{X_C}(x_C) is a function that depends only on the values x_C of the maximal clique variables X_C. It turns out that we need only loose restrictions on the potential functions, namely that they are non-negative and real-valued; aside from this, they can be arbitrary. Let M denote the set of all maximal cliques. We then define the probability of an outcome x as

\[
p(x) = \frac{1}{Z} \prod_{C \in M} \psi_{X_C}(x_C),
\]

where Z is the normalization constant, given by

\[
Z = \sum_x \prod_{C \in M} \psi_{X_C}(x_C).
\]

These potential functions, being arbitrary, do not have a probabilistic interpretation. However, they have a quasi-probabilistic interpretation: a configuration of the random variables with a large potential-function value is favored over a configuration with a smaller value. Because of this relationship, one can think of potential functions as pre-probabilistic.
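As a toy illustration of this factorization (not from the lecture), the following sketch defines arbitrary non-negative potentials over the two maximal cliques of the Figure-4-style graph, computes the normalization constant Z by brute-force enumeration, and evaluates p(x). All potential values are made up for the example.

```python
import itertools

# Binary variables (A, B, C, D) on a graph with maximal cliques {A, B, C} and {C, D}.

def psi_abc(a, b, c):
    return 2.0 if a == b == c else 1.0   # favors agreement within the clique

def psi_cd(c, d):
    return 3.0 if c == d else 0.5        # favors agreement between C and D

def unnormalized(x):
    a, b, c, d = x
    return psi_abc(a, b, c) * psi_cd(c, d)

# Brute-force normalization constant: sum over all joint configurations.
configs = list(itertools.product([0, 1], repeat=4))
Z = sum(unnormalized(x) for x in configs)

def p(x):
    return unnormalized(x) / Z

print(Z)                           # normalization constant
print(p((0, 0, 0, 0)))             # probability of the all-zeros configuration
print(sum(p(x) for x in configs))  # sanity check: approximately 1.0
```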

3.3 Exponentiated Potential Functions

To ensure that a potential function is non-negative, it is often convenient to write it as the exponential of an arbitrary real-valued function, that is,

\[
\psi_{X_C}(x_C) = \exp\{-H_C(x_C)\},
\]

where we can either define H_C and exponentiate, or define ψ directly. In this case, the probability is

\[
p(x) = \frac{1}{Z} \prod_{C \in M} \exp\{-H_C(x_C)\}
     = \frac{1}{Z} \exp\Big\{-\sum_{C \in M} H_C(x_C)\Big\}.
\]

Thus, using the functions H_C and exponentiating also lets us transform the product into a more convenient summation. The two representations, defining ψ directly or defining H_C, are equivalent, so using the H functions is a matter of convenience and does not add any ability to represent different potential functions. As an aside, the representation using H_C instead of ψ comes up often in physics, where it is known as the Boltzmann distribution.
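Continuing the toy example above (again an illustration, not lecture code), the same model can be written in energy form by taking H_C = -log ψ_C, which reproduces the product-of-potentials probabilities.

```python
import itertools
import math

# Energy form of the earlier sketch: p(x) is proportional to exp(-sum of clique energies).

def H_abc(a, b, c):
    return -math.log(2.0) if a == b == c else 0.0    # -log of psi_abc above

def H_cd(c, d):
    return -math.log(3.0) if c == d else -math.log(0.5)

def energy(x):
    a, b, c, d = x
    return H_abc(a, b, c) + H_cd(c, d)

configs = list(itertools.product([0, 1], repeat=4))
Z = sum(math.exp(-energy(x)) for x in configs)

def p(x):
    return math.exp(-energy(x)) / Z

print(p((0, 0, 0, 0)))  # matches the product-of-potentials version above
```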

3.4 The Hammersley-Clifford Theorem

The use of potential functions over maximal cliques is justified by the Hammersley-Clifford theorem. The theorem states that this convenient representation, namely expressing a distribution as a product of potential functions over maximal cliques, is equivalent (for strictly positive distributions) to satisfying the conditional independence statements specified by graph separation. In other words, the factorized representation allows us to represent the distributions specified by graph separation on our undirected graphs.

4 Reduced Parametrizations

It is often useful to add structure that is not specified in the graphical model in order to further simplify complicated joint distributions. Instead of imposing constraints based only on conditional independence, we can impose an algebraic structure on the conditional probabilities and potential functions. For example, consider the graphical model in Figure 5. Since node Y has 50 parents, the conditional probability table at Y would have 2^50 entries. Since this is unacceptably large, we instead make an additional assumption about P(Y | X) that is not specified in the graph. For example, we could specify that it is a function of a linear combination of the variables X_i, i.e.

\[
p(y = 1 \mid x) = f(\theta_1 x_1 + \theta_2 x_2 + \cdots + \theta_m x_m).
\]

This reduces the parametrization to just the 50 parameters θ_i. We lose some representational power, but the model becomes feasible to use.

Figure 5: An example demonstrating why we might need a reduced parametrization.
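As a sketch of this reduced parametrization, the snippet below assumes f is the logistic sigmoid (a common choice, though the lecture leaves f unspecified) and uses made-up parameters θ_i for m = 50 binary parents.

```python
import math
import random

# Instead of a 2^50-entry table for P(Y | X), keep only m = 50 parameters theta_i.
m = 50
random.seed(0)
theta = [random.uniform(-1.0, 1.0) for _ in range(m)]   # illustrative parameters

def sigmoid(t):
    return 1.0 / (1.0 + math.exp(-t))

def p_y_given_x(x, theta):
    """P(Y = 1 | X = x) = f(theta_1 x_1 + ... + theta_m x_m), with f the sigmoid."""
    t = sum(th * xi for th, xi in zip(theta, x))
    return sigmoid(t)

x = [random.randint(0, 1) for _ in range(m)]  # one configuration of the 50 parents
print(p_y_given_x(x, theta))
```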