CS242: Probabilistic Graphical Models Lecture 3: Factor Graphs & Variable Elimination


CS242: Probabilistic Graphical Models
Lecture 3: Factor Graphs & Variable Elimination
Instructor: Erik Sudderth, Brown University Computer Science
September 11, 2014
Some figures and materials courtesy of David Barber, Bayesian Reasoning and Machine Learning, http://www.cs.ucl.ac.uk/staff/d.barber/brml/

Probabilistic Graphical Models
(Figure: three views of the same model over variables $x_1, \ldots, x_5$.)
Directed Graphical Model (Bayesian Network): Markov properties, factorization.
Factor Graph (Hypergraph): a more detailed view of factorization, useful for algorithms.
Undirected Graphical Model (Markov Random Field): Markov properties, factorization.

Undirected Graphical Models, Parameterization, & Factor Graphs

Factor Graphs
$\mathcal{V} = \{1, 2, \ldots, N\}$: set of N nodes or vertices.
$\mathcal{F}$: set of hyperedges linking subsets of nodes $f \subseteq \mathcal{V}$.
$p(x) = \frac{1}{Z} \prod_{f \in \mathcal{F}} \psi_f(x_f)$, where $\psi_f(x_f) \ge 0$ is an arbitrary non-negative potential function, and $Z = \sum_x \prod_{f \in \mathcal{F}} \psi_f(x_f) > 0$ is the normalization constant (partition function).
In a hypergraph, the hyperedges link arbitrary subsets of nodes (not just pairs). Visualize by a bipartite graph, with square (usually black) nodes for hyperedges.
Motivation: factorization is key to computation.
Motivation: avoid ambiguities of undirected graphical models.
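To make the notation concrete, here is a minimal sketch (not from the lecture; the Factor class and example tables are made up) of a discrete factor graph, with the partition function Z computed by brute-force enumeration:

import itertools
import numpy as np

class Factor:
    def __init__(self, variables, table):
        self.variables = tuple(variables)   # indices of the nodes this hyperedge links
        self.table = np.asarray(table)      # one axis per linked variable, entries >= 0

def unnormalized_p(x, factors):
    """Product of all potentials psi_f(x_f) at the full configuration x."""
    val = 1.0
    for f in factors:
        val *= f.table[tuple(x[v] for v in f.variables)]
    return val

def partition_function(factors, cardinalities):
    """Z = sum over all joint configurations (exponential; tiny models only)."""
    return sum(unnormalized_p(x, factors)
               for x in itertools.product(*[range(k) for k in cardinalities]))

# Example: three binary variables, two factors psi_a(x0, x1) and psi_b(x1, x2).
factors = [Factor([0, 1], [[2.0, 1.0], [1.0, 3.0]]),
           Factor([1, 2], [[1.0, 2.0], [2.0, 1.0]])]
Z = partition_function(factors, [2, 2, 2])

Brute-force enumeration is only viable for tiny models; the rest of the lecture is about exploiting the factorization to avoid it.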

Factor Graphs & Factorization
For a given undirected graph, there exist distributions with equivalent Markov properties but different factorizations, and hence different inference/learning complexities.
Pairwise (edge) potentials: $p(x) = \frac{1}{Z} \prod_{(s,t) \in \mathcal{E}} \psi_{st}(x_s, x_t)$
Maximal clique potentials: $p(x) = \frac{1}{Z} \prod_{c \in \mathcal{C}} \psi_c(x_c)$
An alternative factor-graph factorization: $p(x) = \frac{1}{Z} \prod_{f \in \mathcal{F}} \psi_f(x_f)$
Recommendation: use undirected graphs for pairwise MRFs, and factor graphs for higher-order potentials.

Directed Graphs as Factor Graphs
Directed graphical model: $p(x) = \prod_{i=1}^N p(x_i \mid x_{\Gamma(i)})$, where $\Gamma(i)$ denotes the parents of node $i$.
Corresponding factor graph: $p(x) = \prod_{i=1}^N \psi_i(x_i, x_{\Gamma(i)})$.
Associate one factor with each node, linking it to its parents, and defined to equal the corresponding conditional distribution.
Information lost: the directionality of the conditional distributions, and the fact that the global partition function is $Z = 1$.
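As a small illustrative sketch (the variable names and tables are assumptions, not from the slides), the conversion amounts to storing each conditional probability table as a factor over a node and its parents:

import numpy as np

# CPTs for a tiny chain x0 -> x1: p(x0) and p(x1 | x0).
p_x0 = np.array([0.6, 0.4])                      # shape (2,)
p_x1_given_x0 = np.array([[0.7, 0.3],            # rows indexed by x0
                          [0.2, 0.8]])           # columns indexed by x1

# Each factor: (variable indices, table). Directionality is not recorded,
# which is exactly the information the slide says is lost.
factors = [((0,), p_x0),
           ((0, 1), p_x1_given_x0)]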

Directed Graphs as Undirected Graphs
$p(x) = \prod_{i=1}^N p(x_i \mid x_{\Gamma(i)}) = \prod_{i=1}^N \psi_i(x_i, x_{\Gamma(i)})$
Retain all directed edges, but drop directionality.
Create the moral graph by linking ("marrying") pairs of nodes with a common child.
This undirected graph has a clique for the conditional of each node given its parents.
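A possible implementation sketch of moralization (the parents dictionary format is an assumption), applied to the six-node example used later in the lecture:

from itertools import combinations

def moral_graph(parents):
    """Undirected moral graph: keep every directed edge and marry co-parents."""
    edges = set()
    for child, pa in parents.items():
        for p in pa:
            edges.add(frozenset((p, child)))      # drop directionality
        for p, q in combinations(pa, 2):
            edges.add(frozenset((p, q)))          # marry parents sharing this child
    return edges

# Six-node example from the later slides: x6 has parents x2 and x5.
parents = {1: [], 2: [1], 3: [1], 4: [2], 5: [3], 6: [2, 5]}
print(moral_graph(parents))   # includes the moralizing edge {2, 5}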

Directed Factor Graphs? A rarely-used proposal by Brendan Frey, UAI 2003.
Chain Graphs: another hybrid of directed & undirected graphical models (see readings).

Inference in Graphical Models: Loss Functions & Marginals

Inference with Two Variables
Joint distribution: $p(x, y) = p(x)\, p(y \mid x)$
Table lookup: $p(y \mid x = \bar{x})$
Bayes' rule: $p(x \mid y = \bar{y}) = \dfrac{p(\bar{y} \mid x)\, p(x)}{p(\bar{y})}$
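A small numeric sketch (the probabilities are made up) of these two operations as table manipulations:

import numpy as np

p_x = np.array([0.3, 0.7])
p_y_given_x = np.array([[0.9, 0.1],    # row x = 0
                        [0.4, 0.6]])   # row x = 1
p_xy = p_x[:, None] * p_y_given_x      # joint table p(x, y)

x_bar = 1
lookup = p_y_given_x[x_bar]            # p(y | x = x_bar): just read a row

y_bar = 0
posterior = p_xy[:, y_bar] / p_xy[:, y_bar].sum()   # p(x | y = y_bar) via Bayes' rule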

Marginal & Conditional Distributions
$p(x, y) = \sum_{z \in \mathcal{Z}} p(x, y, z)$, $\quad p(x) = \sum_{y \in \mathcal{Y}} p(x, y)$, $\quad p(x, y \mid Z = \bar{z}) = \dfrac{p(x, y, \bar{z})}{p(\bar{z})}$
Target: subset of variables (nodes) whose values are of interest.
Condition: whatever other variables (nodes) we have observed.
Marginalize: everything that's not a target or observation.
These are finite sums, but the number of terms may be exponentially large.

Inference & Decision Theory
$y \in \mathcal{Y}$: observed data, can take values in any space.
$x \in \mathcal{X}$: unknown or hidden variables (assume discrete for now).
$a \in \mathcal{A} = \mathcal{X}$: the action is to estimate the hidden variables.
$L(x, a)$: table giving the loss for all possible mistakes.
The posterior expected loss of taking action $a$ is $\rho(a \mid y) = \mathbb{E}[L(x, a) \mid y] = \sum_{x \in \mathcal{X}} L(x, a)\, p(x \mid y)$.
The optimal Bayes decision rule is then $\delta(y) = \arg\min_{a \in \mathcal{X}} \rho(a \mid y)$.
Challenge: for graphical models, the decision space is exponentially large.

Decomposable Loss Functions
$\rho(a \mid y) = \mathbb{E}[L(x, a) \mid y] = \sum_{x \in \mathcal{X}} L(x, a)\, p(x \mid y)$
Many practical losses decompose into a term for each variable node:
$L(x, a) = \sum_{i=1}^N L_i(x_i, a_i)$, so that $\rho(a \mid y) = \sum_{i=1}^N \sum_{x_i} L_i(x_i, a_i)\, p(x_i \mid y)$.
Hamming loss: $L(x, a) = \sum_{i=1}^N \mathbb{I}(x_i \neq a_i)$. For how many nodes is our estimate incorrect?
Quadratic loss: $L(x, a) = \sum_{i=1}^N (x_i - a_i)^2$. Expected squared distance, minimized by the posterior mean.
For any such loss, it is sufficient to find the posterior marginals $p(x_i \mid y)$.
Note that the MAP (0-1) loss does not decompose: $L(x, a) = \mathbb{I}(x \neq a)$.
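For instance, a short sketch (with made-up marginals) of how decomposable losses turn the global decision into independent per-node decisions:

import numpy as np

marginals = [np.array([0.2, 0.8]),     # p(x_1 | y)
             np.array([0.55, 0.45]),   # p(x_2 | y)
             np.array([0.1, 0.9])]     # p(x_3 | y)

# Hamming loss: the optimal action at each node is the marginal mode.
a_hamming = [int(np.argmax(m)) for m in marginals]
# Quadratic loss: the optimal action at each node is the posterior mean.
a_quadratic = [float(np.arange(len(m)) @ m) for m in marginals]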

Variable Elimination Algorithms for Inference in Factor Graphs

Example: Inference in a Directed Graph
N = 6 discrete variables, each taking one of K states.
$p(x) = p(x_1)\, p(x_2 \mid x_1)\, p(x_3 \mid x_1)\, p(x_4 \mid x_2)\, p(x_5 \mid x_3)\, p(x_6 \mid x_2, x_5)$

Example: Inference in a Directed Graph
N = 6 discrete variables, each taking one of K states.
Goal: compute the marginal distribution of $x_1$, given observations of $x_4$ and $x_6$.
$p(x) = p(x_1)\, p(x_2 \mid x_1)\, p(x_3 \mid x_1)\, p(x_4 \mid x_2)\, p(x_5 \mid x_3)\, p(x_6 \mid x_2, x_5)$
$p(x_1 \mid x_4 = \bar{x}_4, x_6 = \bar{x}_6) \propto \sum_{x_2} \sum_{x_3} \sum_{x_5} p(x_1)\, p(x_2 \mid x_1)\, p(x_3 \mid x_1)\, p(\bar{x}_4 \mid x_2)\, p(x_5 \mid x_3)\, p(\bar{x}_6 \mid x_2, x_5)$
Number of arithmetic operations to naively compute this sum: exponential in the number of unobserved variables, $O(NK^4)$ here.
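A brute-force sketch of this computation (the conditional tables below are random placeholders, not data from the lecture):

import numpy as np

K = 3
rng = np.random.default_rng(0)
def cpt(*shape):                      # random conditional table, normalized over the last axis
    t = rng.random(shape)
    return t / t.sum(axis=-1, keepdims=True)

# p1[x1], p2[x1,x2], p3[x1,x3], p4[x2,x4], p5[x3,x5], p6[x2,x5,x6]
p1, p2, p3, p4, p5, p6 = cpt(K), cpt(K, K), cpt(K, K), cpt(K, K), cpt(K, K), cpt(K, K, K)
x4_bar, x6_bar = 0, 2

marg = np.zeros(K)
for x1 in range(K):
    for x2 in range(K):
        for x3 in range(K):
            for x5 in range(K):
                marg[x1] += (p1[x1] * p2[x1, x2] * p3[x1, x3] *
                             p4[x2, x4_bar] * p5[x3, x5] * p6[x2, x5, x6_bar])
marg /= marg.sum()                    # p(x_1 | x_4, x_6), at O(N K^4) cost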

Conversion to Factor Graph Form
$p(x) = p(x_1)\, p(x_2 \mid x_1)\, p(x_3 \mid x_1)\, p(x_4 \mid x_2)\, p(x_5 \mid x_3)\, p(x_6 \mid x_2, x_5)$
$p(x) = \psi_1(x_1)\, \psi_2(x_1, x_2)\, \psi_3(x_3, x_1)\, \psi_4(x_4, x_2)\, \psi_5(x_5, x_3)\, \psi_6(x_6, x_2, x_5)$
Evidence factors: $m_4(x_2) = \psi_4(\bar{x}_4, x_2)$, $\quad m_6(x_2, x_5) = \psi_6(\bar{x}_6, x_2, x_5)$
$p(x_1, x_2, x_3, x_5 \mid \bar{x}_4, \bar{x}_6) \propto \psi_1(x_1)\, \psi_2(x_1, x_2)\, \psi_3(x_3, x_1)\, m_4(x_2)\, \psi_5(x_5, x_3)\, m_6(x_2, x_5)$
1. Convert the directed graph to the corresponding factor graph. The algorithm is equally valid for factor graphs derived from undirected MRFs.
2. Replace observed variables by lower-order evidence factors.
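Continuing the placeholder tables from the sketch above, step 2 is just indexing: observing $x_4$ and $x_6$ slices $\psi_4$ and $\psi_6$ down to evidence factors over their remaining arguments.

m4 = p4[:, x4_bar]            # m_4(x_2) = psi_4(x_4 = x4_bar, x_2)
m6 = p6[:, :, x6_bar]         # m_6(x_2, x_5) = psi_6(x_6 = x6_bar, x_2, x_5)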

Iterative Variable Elimination I
$m_5(x_2, x_3) = \sum_{x_5} \psi_5(x_5, x_3)\, m_6(x_2, x_5)$: $O(K^3)$ operations (matrix-matrix product).
$p(x_1, x_2, x_3, x_5 \mid \bar{x}_4, \bar{x}_6) \propto \psi_1(x_1)\, \psi_2(x_1, x_2)\, \psi_3(x_3, x_1)\, m_4(x_2)\, \psi_5(x_5, x_3)\, m_6(x_2, x_5)$
$p(x_1, x_2, x_3 \mid \bar{x}_4, \bar{x}_6) \propto \sum_{x_5} \psi_1(x_1)\, \psi_2(x_1, x_2)\, \psi_3(x_3, x_1)\, m_4(x_2)\, \psi_5(x_5, x_3)\, m_6(x_2, x_5) = \psi_1(x_1)\, \psi_2(x_1, x_2)\, \psi_3(x_3, x_1)\, m_4(x_2)\, m_5(x_2, x_3)$
3. Pick some unobserved variable for elimination. Replace all factors which depend on that variable by a single, equivalent factor linking the eliminated variable's neighbors. Repeat recursively.

Iterative Variable Elimination II
$m_3(x_1, x_2) = \sum_{x_3} \psi_3(x_3, x_1)\, m_5(x_2, x_3)$: $O(K^3)$ operations (matrix-matrix product).
$p(x_1, x_2, x_3 \mid \bar{x}_4, \bar{x}_6) \propto \psi_1(x_1)\, \psi_2(x_1, x_2)\, \psi_3(x_3, x_1)\, m_4(x_2)\, m_5(x_2, x_3)$
$p(x_1, x_2 \mid \bar{x}_4, \bar{x}_6) \propto \sum_{x_3} \psi_1(x_1)\, \psi_2(x_1, x_2)\, \psi_3(x_3, x_1)\, m_4(x_2)\, m_5(x_2, x_3) = \psi_1(x_1)\, \psi_2(x_1, x_2)\, m_4(x_2)\, m_3(x_1, x_2)$
3. Pick some unobserved variable for elimination. Replace all factors which depend on that variable by a single, equivalent factor linking the eliminated variable's neighbors. Repeat recursively.

Iterative Variable Elimination III
$m_2(x_1) = \sum_{x_2} \psi_2(x_1, x_2)\, m_3(x_1, x_2)\, m_4(x_2)$: $O(K^2)$ operations (matrix-vector product).
$p(x_1, x_2 \mid \bar{x}_4, \bar{x}_6) \propto \psi_1(x_1)\, \psi_2(x_1, x_2)\, m_4(x_2)\, m_3(x_1, x_2)$
$p(x_1 \mid \bar{x}_4, \bar{x}_6) \propto \sum_{x_2} \psi_1(x_1)\, \psi_2(x_1, x_2)\, m_4(x_2)\, m_3(x_1, x_2)$
$p(x_1 \mid \bar{x}_4, \bar{x}_6) = Z_1^{-1}\, \psi_1(x_1)\, m_2(x_1)$, where $Z_1 = \sum_{x_1} \psi_1(x_1)\, m_2(x_1)$
3. Pick some unobserved variable for elimination. Replace all factors which depend on that variable by a single, equivalent factor linking the eliminated variable's neighbors. Repeat recursively.
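Continuing the same placeholder sketch, the three elimination steps above each reduce to a single sum-product contraction (here written with numpy.einsum, using indices a = x1, b = x2, c = x3, f = x5):

m5 = np.einsum('cf,bf->bc', p5, m6)        # m_5(x_2, x_3) = sum_{x_5} psi_5(x_5, x_3) m_6(x_2, x_5)
m3 = np.einsum('ac,bc->ab', p3, m5)        # m_3(x_1, x_2) = sum_{x_3} psi_3(x_3, x_1) m_5(x_2, x_3)
m2 = np.einsum('ab,ab,b->a', p2, m3, m4)   # m_2(x_1) = sum_{x_2} psi_2(x_1, x_2) m_3(x_1, x_2) m_4(x_2)
p_x1 = p1 * m2
p_x1 /= p_x1.sum()                         # agrees with the brute-force marginal above

The total cost is $O(K^3) + O(K^3) + O(K^2)$, compared with $O(NK^4)$ for the naive sum.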

A General Graph Elimination Algorithm
Algebraic marginalization operations:
Marginalize out the variable associated with the node being eliminated.
Compute a new potential function (factor) involving all other variables which are linked to the just-marginalized variable.
Graph manipulation operations:
Remove, or eliminate, a single node from the graph.
Remove any factors linked to that node, and replace them by a single factor joining the neighbors of the just-removed node.
Alternative view based on undirected graphs: add edges between all pairs of nodes neighboring the just-removed node, forming a clique.
A graph elimination algorithm:
Convert the model to a factor graph, encoding observed nodes as evidence factors.
Choose an elimination ordering (the query node must be last).
Iterate: remove a node, replacing its old factors by a new message factor over its neighbors.
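The recipe above can be written as a short generic routine. The sketch below (the factor representation and helper names are assumptions, not lecture code) multiplies all factors touching the chosen node, sums that node out, and substitutes the resulting message factor:

import numpy as np

def multiply(f1, f2):
    """Pointwise product of two factors given as (variables, table) pairs."""
    v1, t1 = f1
    v2, t2 = f2
    vars_out = list(dict.fromkeys(v1 + v2))                 # order-preserving union
    letters = 'abcdefghijklmnopqrstuvwxyz'
    s1 = ''.join(letters[vars_out.index(v)] for v in v1)
    s2 = ''.join(letters[vars_out.index(v)] for v in v2)
    out = letters[:len(vars_out)]
    return tuple(vars_out), np.einsum(f'{s1},{s2}->{out}', t1, t2)

def eliminate(factors, order):
    """Sum out the variables in `order`; return the product of the remaining factors."""
    factors = [(tuple(v), np.asarray(t)) for v, t in factors]
    for node in order:
        touching = [f for f in factors if node in f[0]]
        rest = [f for f in factors if node not in f[0]]
        if not touching:
            continue
        prod = touching[0]
        for f in touching[1:]:
            prod = multiply(prod, f)
        v, t = prod
        axis = v.index(node)
        message = (v[:axis] + v[axis + 1:], t.sum(axis=axis))  # new factor on the neighbors
        factors = rest + [message]
    result = factors[0]
    for f in factors[1:]:
        result = multiply(result, f)
    return result

Applied to the six-node example with the ordering (x5, x3, x2), this reproduces the messages m_5, m_3, m_2 from the previous slides, up to normalization of the final query factor.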

Elimination Algorithm Complexity
Elimination cliques: the sets of neighbors of eliminated nodes.
Marginalization cost: exponential in the number of variables in each elimination clique (dominated by the largest clique). Similar storage cost.
Treewidth of a graph: over all possible elimination orderings, the smallest possible maximum elimination-clique size, minus one.
NP-hard: finding the best elimination ordering for an arbitrary input graph (but heuristic approximation algorithms can be effective).
Limitation: lack of shared computation when finding multiple marginals.
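A sketch of how elimination cliques grow with the chosen ordering (the graph representation is an assumption); the star-graph example previews the next slide, where a good ordering keeps cliques small and a bad one does not:

from itertools import combinations

def elimination_clique_sizes(edges, order):
    """Sizes of the elimination cliques induced by an ordering on an undirected graph."""
    adj = {}
    for u, v in edges:
        adj.setdefault(u, set()).add(v)
        adj.setdefault(v, set()).add(u)
    sizes = []
    for node in order:
        nbrs = adj.pop(node, set())
        sizes.append(len(nbrs) + 1)                 # clique = node plus its current neighbors
        for u, v in combinations(nbrs, 2):          # fill-in edges among the neighbors
            adj[u].add(v)
            adj[v].add(u)
        for u in nbrs:
            adj[u].discard(node)
    return sizes

# Star graph with hub 0: eliminating leaves first keeps cliques of size 2 (treewidth 1),
# while eliminating the hub first creates a clique over all of its leaves.
star = [(0, i) for i in range(1, 5)]
print(elimination_clique_sizes(star, [1, 2, 3, 4, 0]))   # [2, 2, 2, 2, 1]
print(elimination_clique_sizes(star, [0, 1, 2, 3, 4]))   # [5, 4, 3, 2, 1]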

Importance of Elimination Order
(Figure: two example graphs, one with treewidth 1 and one with treewidth 2.)