Cheng Soon Ong & Christian Walder, Research Group and College of Engineering and Computer Science, Canberra, February to June 2018.

Course outline: Overview; Introduction; Linear Algebra; Probability; Linear Regression 1; Linear Regression 2; Linear Classification 1; Linear Classification 2; Kernel Methods; Sparse Kernel Methods; Mixture Models and EM 1; Mixture Models and EM 2; Neural Networks 1; Neural Networks 2; Principal Component Analysis; Autoencoders; Graphical Models 1; Graphical Models 2; Graphical Models 3; Sampling; Sequential Data 1; Sequential Data 2.

(Many figures from C. M. Bishop, "Pattern Recognition and Machine Learning".)

Part XIII: Probabilistic Graphical Models 1

Motivation via a picture: image segmentation. Clustering the grey-value representation of an image loses the neighbourhood information; we need to use the structure of the image.

Motivation via an example (1/4): why is the grass wet? Introduce four Boolean variables, C(loudy), S(prinkler), R(ain) and W(etGrass), each taking values in {F(alse), T(rue)}. [Graph: Cloudy → Sprinkler, Cloudy → Rain, Sprinkler → WetGrass, Rain → WetGrass.]

Motivation via an example (2/4): model the conditional probabilities.

p(C = F)  p(C = T)
  0.2       0.8

C   p(S = F)  p(S = T)
F     0.5       0.5
T     0.9       0.1

C   p(R = F)  p(R = T)
F     0.8       0.2
T     0.2       0.8

S  R   p(W = F)  p(W = T)
F  F     1.0       0.0
T  F     0.1       0.9
F  T     0.1       0.9
T  T     0.01      0.99

Motivation via an example (3/4): if everything depends on everything, we need one table entry per joint configuration,

C S R W   p(C, S, R, W)
F F F F   ...
F F F T   ...
...
T T T F   ...
T T T T   ...

and the product rule gives
p(W, S, R, C) = p(W | S, R, C) p(S, R, C)
             = p(W | S, R, C) p(S | R, C) p(R, C)
             = p(W | S, R, C) p(S | R, C) p(R | C) p(C).

Motivation via an example (4/4): marginalising over S, R and C in the fully general factorisation gives
p(W) = Σ_{S,R,C} p(W | S, R, C) p(S | R, C) p(R | C) p(C),
whereas with the graph structure (Cloudy → Sprinkler, Cloudy → Rain, Sprinkler → WetGrass, Rain → WetGrass) the sum over C can be pushed inside:
p(W) = Σ_{S,R} p(W | S, R) Σ_C p(S | C) p(R | C) p(C).
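As a concrete check of the two expressions above, here is a minimal Python sketch; the CPT values are taken from the tables two slides back, while the dictionary layout and variable names are my own choices.

```python
import itertools

# CPTs of the sprinkler network, encoded as plain dictionaries (an assumption
# of this sketch, not a prescribed representation).
p_C = {False: 0.2, True: 0.8}
p_S_given_C = {False: {False: 0.5, True: 0.5},      # p(S | C)
               True:  {False: 0.9, True: 0.1}}
p_R_given_C = {False: {False: 0.8, True: 0.2},      # p(R | C)
               True:  {False: 0.2, True: 0.8}}
p_W_given_SR = {(False, False): {False: 1.0,  True: 0.0},    # p(W | S, R)
                (True,  False): {False: 0.1,  True: 0.9},
                (False, True):  {False: 0.1,  True: 0.9},
                (True,  True):  {False: 0.01, True: 0.99}}

def joint(c, s, r, w):
    """p(C, S, R, W) from the factorisation implied by the graph."""
    return p_C[c] * p_S_given_C[c][s] * p_R_given_C[c][r] * p_W_given_SR[(s, r)][w]

# Brute force: sum the full joint over C, S and R.
p_W_brute = sum(joint(c, s, r, True)
                for c, s, r in itertools.product([False, True], repeat=3))

# Structured: push the sum over C inside, as in the second expression above.
p_W_structured = sum(
    p_W_given_SR[(s, r)][True]
    * sum(p_S_given_C[c][s] * p_R_given_C[c][r] * p_C[c] for c in [False, True])
    for s, r in itertools.product([False, True], repeat=2))

print(p_W_brute, p_W_structured)   # identical values for p(W = True)
```

Both sums give the same number; the point of the graph is that the second one touches far fewer terms when the variables have many states.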

Motivation via the distributive law. Two key observations when dealing with probabilities:
1. The distributive law can save operations: a(b + c) = ab + ac, where the left-hand side takes 2 operations and the right-hand side 3.
2. If some probabilities do not depend on all random variables, we may be able to factor them out. For example, assume p(x_1, x_3 | x_2) = p(x_1 | x_2) p(x_3 | x_2). Then, using Σ_{x_3} p(x_3 | x_2) = 1,
p(x_1) = Σ_{x_2,x_3} p(x_1, x_2, x_3) = Σ_{x_2,x_3} p(x_1, x_3 | x_2) p(x_2) = Σ_{x_2,x_3} p(x_1 | x_2) p(x_3 | x_2) p(x_2) = Σ_{x_2} p(x_1 | x_2) p(x_2),
which reduces the cost from O(|X_1| |X_2| |X_3|) to O(|X_1| |X_2|).
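A small numeric sanity check of observation 2, with arbitrary table sizes and random (but properly normalised) conditional tables; the variable names are illustrative only.

```python
import numpy as np

rng = np.random.default_rng(0)
K1, K2, K3 = 4, 5, 6

p_x2 = rng.random(K2); p_x2 /= p_x2.sum()                                          # p(x2)
p_x1_given_x2 = rng.random((K1, K2)); p_x1_given_x2 /= p_x1_given_x2.sum(axis=0)   # p(x1 | x2)
p_x3_given_x2 = rng.random((K3, K2)); p_x3_given_x2 /= p_x3_given_x2.sum(axis=0)   # p(x3 | x2)

# Naive: build the full joint p(x1, x2, x3) and sum over x2, x3 -- O(K1*K2*K3) work.
joint = p_x1_given_x2[:, :, None] * p_x3_given_x2.T[None, :, :] * p_x2[None, :, None]
p_x1_naive = joint.sum(axis=(1, 2))

# Factored: the sum over x3 equals 1, so only the O(K1*K2) sum over x2 remains.
p_x1_factored = p_x1_given_x2 @ p_x2

print(np.allclose(p_x1_naive, p_x1_factored))   # True
```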

Motivation via complexity reduction: how do we deal with a more complex expression, such as
p(x_1) p(x_2) p(x_3) p(x_4 | x_1, x_2, x_3) p(x_5 | x_1, x_3) p(x_6 | x_4) p(x_7 | x_4, x_5)?

Motivation via complexity reduction: graphical models answer this question. [Graph over x_1, ..., x_7 encoding the factorisation above: x_1, x_2, x_3 → x_4; x_1, x_3 → x_5; x_4 → x_6; x_4, x_5 → x_7.]

Probabilistic Graphical Models: Aims. Graphical models allow us to visualise the structure of a probabilistic model, to replace complex computations with formulas by manipulations of graphs, to obtain insights into model properties by inspection, and to develop and motivate new models.

Probabilistic Graphical Models: Main Flavours. A graph consists of (1) nodes (vertices), each representing a random variable, and (2) edges (links, arcs; directed or undirected), each representing a probabilistic relationship. Directed graph: a Bayesian Network (also called a Directed Graphical Model), expressing causal relationships between variables. Undirected graph: a Markov Random Field, expressing soft constraints between variables. Factor graph: convenient for solving inference problems (derived from Bayesian Networks or Markov Random Fields). [Figures: a directed graph over x_1, ..., x_7; a Markov Random Field over nodes x_i, y_i; a factor graph with factors f_a, f_b, f_c, f_d.]

An arbitrary joint distribution: p(a, b, c) = p(c | a, b) p(a, b) = p(c | a, b) p(b | a) p(a).

An arbitrary joint distribution, p(a, b, c) = p(c | a, b) p(b | a) p(a), as a graph: (1) draw a node for each random variable and its associated conditional distribution; (2) draw an edge from each random variable to every variable whose conditional distribution is conditioned on it. [Graph: a → b, a → c, b → c.] Note that we have chosen a particular ordering of the variables!

An arbitrary joint distribution: the general case for K variables is
p(x_1, ..., x_K) = p(x_K | x_1, ..., x_{K-1}) ... p(x_2 | x_1) p(x_1).
The graph of this distribution is fully connected. (Prove it.) What happens if we deal with a distribution represented by a graph which is not fully connected? It can no longer be the most general distribution; the absence of edges carries important information.

Joint factorisation → graph: consider
p(x_1) p(x_2) p(x_3) p(x_4 | x_1, x_2, x_3) p(x_5 | x_1, x_3) p(x_6 | x_4) p(x_7 | x_4, x_5).

Joint factorisation → graph: for
p(x_1) p(x_2) p(x_3) p(x_4 | x_1, x_2, x_3) p(x_5 | x_1, x_3) p(x_6 | x_4) p(x_7 | x_4, x_5),
(1) draw a node for each random variable and its associated conditional distribution; (2) draw an edge from each random variable to every variable whose conditional distribution is conditioned on it.

Joint factorisation → graph: applying these two steps to the factorisation yields the graph over x_1, ..., x_7 with edges x_1, x_2, x_3 → x_4; x_1, x_3 → x_5; x_4 → x_6; x_4, x_5 → x_7.

Graph → joint factorisation: given the graph over x_1, ..., x_7 above, can we get the expression back from the graph?

Graph → joint factorisation: can we get the expression from the graph? (1) Write a product of probability distributions, one for each random variable (the counterpart of drawing a node for each conditional distribution); (2) add all random variables associated with parent nodes to the list of conditioning variables (the counterpart of drawing an edge from each variable to every variable conditioned on it). This recovers
p(x_1) p(x_2) p(x_3) p(x_4 | x_1, x_2, x_3) p(x_5 | x_1, x_3) p(x_6 | x_4) p(x_7 | x_4, x_5).
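A tiny sketch of this read-off in Python: given each node's parents (the dict-of-parents representation is my own choice), emit one factor per node.

```python
# Parents of each node in the seven-variable example graph.
parents = {
    'x1': [], 'x2': [], 'x3': [],
    'x4': ['x1', 'x2', 'x3'],
    'x5': ['x1', 'x3'],
    'x6': ['x4'],
    'x7': ['x4', 'x5'],
}

# One conditional distribution per node, conditioned on its parents.
factors = [f"p({v} | {', '.join(ps)})" if ps else f"p({v})"
           for v, ps in parents.items()]
print(' '.join(factors))
# p(x1) p(x2) p(x3) p(x4 | x1, x2, x3) p(x5 | x1, x3) p(x6 | x4) p(x7 | x4, x5)
```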

Joint factorisation ↔ graph. The joint distribution defined by a graph is given by the product, over all nodes of the graph, of a conditional distribution for each node conditioned on the variables corresponding to its parents:
p(x) = ∏_{k=1}^{K} p(x_k | pa(x_k)),
where pa(x_k) denotes the set of parents of x_k and x = (x_1, ..., x_K). The absence of edges (compared to the fully connected case) is the point.

Joint factorisation ↔ graph. Restriction: the graph must be a directed acyclic graph (DAG), i.e. there are no closed paths in the graph when moving along the directed edges. Equivalently, there exists an ordering of the nodes such that no edge goes from a node to a lower-numbered node. [Graph: the seven-variable example above.] Extension: a node can also hold a set of variables, or a vector.
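The equivalent ordering condition suggests a simple test; the sketch below (Kahn's algorithm, with a dict-of-parents representation as an assumed convention) returns such an ordering and fails exactly when the graph contains a directed cycle.

```python
def topological_order(parents):
    """Return an ordering in which every edge goes from an earlier to a later node."""
    remaining = {v: set(ps) for v, ps in parents.items()}
    order = []
    while remaining:
        ready = [v for v, ps in remaining.items() if not ps]   # nodes with no unplaced parents
        if not ready:
            raise ValueError("not a DAG: the graph contains a directed cycle")
        for v in ready:
            order.append(v)
            del remaining[v]
        for ps in remaining.values():
            ps.difference_update(ready)
    return order

parents = {'x1': [], 'x2': [], 'x3': [], 'x4': ['x1', 'x2', 'x3'],
           'x5': ['x1', 'x3'], 'x6': ['x4'], 'x7': ['x4', 'x5']}
print(topological_order(parents))   # e.g. ['x1', 'x2', 'x3', 'x4', 'x5', 'x6', 'x7']
```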

Normalisation. Given p(x) = ∏_{k=1}^{K} p(x_k | pa(x_k)), is p(x) normalised, i.e. does Σ_x p(x) = 1 hold?

Normalisation (continued). Yes. As the graph is a DAG, there always exists a node with no outgoing edges, say x_i. Then
Σ_x p(x) = Σ_{x_1,...,x_{i-1},x_{i+1},...,x_K} [ ∏_{k ≠ i} p(x_k | pa(x_k)) ] Σ_{x_i} p(x_i | pa(x_i)),
and the last factor equals 1 because
Σ_{x_i} p(x_i | pa(x_i)) = Σ_{x_i} p(x_i, pa(x_i)) / p(pa(x_i)) = p(pa(x_i)) / p(pa(x_i)) = 1.
Repeat until no node is left.
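For a brute-force check on a toy example, the sketch below evaluates the product ∏_k p(x_k | pa(x_k)) for binary variables and sums it over all assignments; the graph, table values and dictionary layout are illustrative assumptions.

```python
import itertools

parents = {'x1': [], 'x2': ['x1'], 'x3': ['x1', 'x2']}

# cpt[v][parent_values] = p(v = 1 | parents); any numbers in [0, 1] would do.
cpt = {
    'x1': {(): 0.3},
    'x2': {(0,): 0.9, (1,): 0.4},
    'x3': {(0, 0): 0.5, (0, 1): 0.2, (1, 0): 0.7, (1, 1): 0.1},
}

def joint(assignment):
    """Evaluate prod_k p(x_k | pa(x_k)) for an assignment like {'x1': 0, 'x2': 1, 'x3': 1}."""
    prob = 1.0
    for v, ps in parents.items():
        p_one = cpt[v][tuple(assignment[p] for p in ps)]
        prob *= p_one if assignment[v] == 1 else 1.0 - p_one
    return prob

total = sum(joint(dict(zip(parents, values)))
            for values in itertools.product([0, 1], repeat=len(parents)))
print(total)   # 1.0 (up to floating point), as the argument on the slide predicts
```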

Bayesian polynomial regression: observed inputs x, observed targets t, noise variance σ², and hyperparameter α controlling the prior over the weights w. Focusing on t and w only,
p(t, w) = p(w) ∏_{n=1}^{N} p(t_n | w).
[Graph: w → t_1, ..., w → t_N.]

Bayesian polynomial regression (continued): the same model,
p(t, w) = p(w) ∏_{n=1}^{N} p(t_n | w),
drawn compactly using a plate: a box over t_n labelled N, representing the N repeated nodes t_1, ..., t_N. [Graphs: left, w with explicit children t_1, ..., t_N; right, w with a plate over t_n.]

Including the parameters in the graphical model:
p(t, w | x, α, σ²) = p(w | α) ∏_{n=1}^{N} p(t_n | w, x_n, σ²).
Random variables are drawn as open circles, deterministic variables as smaller solid circles. [Graph: plate over n containing x_n and t_n, with α, σ² and w outside the plate.]
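To make the model concrete, here is a minimal generative sketch assuming a polynomial basis, a zero-mean isotropic Gaussian prior on w and Gaussian observation noise; the particular values of α, σ², N and the polynomial degree are arbitrary choices, not taken from the slides.

```python
import numpy as np

rng = np.random.default_rng(0)
N, degree = 20, 3
alpha, sigma2 = 2.0, 0.04

x = rng.uniform(-1.0, 1.0, size=N)                   # deterministic inputs x_n
Phi = np.vander(x, degree + 1, increasing=True)      # polynomial features phi(x_n)

w = rng.normal(0.0, alpha ** -0.5, size=degree + 1)  # w ~ N(0, alpha^{-1} I)
t = Phi @ w + rng.normal(0.0, np.sqrt(sigma2), N)    # t_n ~ N(w^T phi(x_n), sigma^2)

print(w)       # one draw of the weights from the prior
print(t[:5])   # the first few sampled targets
```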

Observed and unobserved random variables: observed random variables, e.g. t; unobserved random variables, e.g. w (also called latent or hidden random variables). Observed random variables are shaded in the graphical model. [Graph: as before, with the t_n node shaded.]

Prediction: given a new data point x̂, we want to predict t̂:
p(t̂, t, w | x, x̂, α, σ²) = [ ∏_{n=1}^{N} p(t_n | x_n, w, σ²) ] p(w | α) p(t̂ | x̂, w, σ²).
[Graph: the polynomial regression model extended with the nodes x̂ and t̂ for prediction.]

Any p(x) may be written
p(x) = ∏_{k=1}^{K} p(x_k | x_1, x_2, ..., x_{k-1}) = ∏_{k=1}^{K} p(x_k | pa(x_k)).
What does it mean if pa(x_k) ⊊ {x_1, x_2, ..., x_{k-1}} in the above equation? A sparse graphical model: p(x) is no longer fully general, and we have made an assumption, e.g. p(W | C, S, R) = p(W | S, R), i.e. conditional independence.

Definition (Conditional Independence). If for three random variables a, b and c the following holds,
p(a | b, c) = p(a | c),
then a is conditionally independent of b given c. Notation: a ⊥ b | c. The above equation must hold for all possible values of c. Consequence:
p(a, b | c) = p(a | b, c) p(b | c) = p(a | c) p(b | c).
Conditional independence simplifies both the structure of a model and the computations needed to perform inference and learning. NB: conditional dependence is a related but different concept which we are not concerned with in this course.

Rules for Conditional Independence.
Symmetry: X ⊥ Y | Z ⇒ Y ⊥ X | Z.
Decomposition: Y, W ⊥ X | Z ⇒ Y ⊥ X | Z and W ⊥ X | Z.
Weak Union: X ⊥ Y, W | Z ⇒ X ⊥ Y | Z, W.
Contraction: X ⊥ W | Z, Y and X ⊥ Y | Z ⇒ X ⊥ W, Y | Z.
Intersection: X ⊥ Y | Z, W and X ⊥ W | Z, Y ⇒ X ⊥ Y, W | Z.
Note: Intersection is only valid for p(x), p(y), p(z), p(w) > 0.

Conditional independence: tail-to-tail (TT) (1/3). Can we work with the graphical model directly? Check the simplest examples, containing only three nodes. The first example has joint distribution
p(a, b, c) = p(a | c) p(b | c) p(c).
Marginalising both sides over c,
p(a, b) = Σ_c p(a | c) p(b | c) p(c),
which in general is not equal to p(a) p(b). So a ⊥ b | ∅ does not hold (where ∅ is the empty set). [Graph: c → a, c → b.]

Conditional independence: TT (2/3). Now condition on c:
p(a, b | c) = p(a, b, c) / p(c) = p(a | c) p(b | c).
Therefore a ⊥ b | c. [Graph: c → a, c → b, with c observed.]

Conditional independence: TT (3/3). Graphical interpretation: in both graphical models there is a path from a to b. The node c is called tail-to-tail (TT) with respect to this path because c is connected to the tails of the arrows on the path. The presence of the unobserved TT-node c (left) renders a dependent on b (and b dependent on a). Conditioning on c (right) blocks the path from a to b and causes a and b to become conditionally independent given c. [Figures: left, c unobserved, not a ⊥ b; right, c observed, a ⊥ b | c.]
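A quick numeric illustration of the tail-to-tail case; the tables below are made up purely for this check.

```python
import numpy as np

p_c = np.array([0.3, 0.7])                       # p(c)
p_a_given_c = np.array([[0.9, 0.2],              # p(a | c), rows index a, columns index c
                        [0.1, 0.8]])
p_b_given_c = np.array([[0.6, 0.1],              # p(b | c), rows index b, columns index c
                        [0.4, 0.9]])

# Joint p(a, b, c) = p(a | c) p(b | c) p(c), axes ordered (a, b, c).
joint = p_a_given_c[:, None, :] * p_b_given_c[None, :, :] * p_c[None, None, :]

p_ab = joint.sum(axis=2)                         # p(a, b)
p_a, p_b = p_ab.sum(axis=1), p_ab.sum(axis=0)
print(np.allclose(p_ab, np.outer(p_a, p_b)))     # False: a and b are marginally dependent

p_ab_given_c = joint / joint.sum(axis=(0, 1), keepdims=True)   # p(a, b | c)
p_a_c = p_ab_given_c.sum(axis=1, keepdims=True)                # p(a | c)
p_b_c = p_ab_given_c.sum(axis=0, keepdims=True)                # p(b | c)
print(np.allclose(p_ab_given_c, p_a_c * p_b_c))                # True: a ⊥ b | c
```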

Conditional independence: head-to-tail (HT) (1/3). Second example:
p(a, b, c) = p(a) p(c | a) p(b | c).
Marginalise over c to test for independence:
p(a, b) = p(a) Σ_c p(c | a) p(b | c) = p(a) p(b | a),
which in general is not equal to p(a) p(b), so a ⊥ b | ∅ does not hold. [Graph: a → c → b.]

Conditional independence: HT (2/3). Now condition on c:
p(a, b | c) = p(a, b, c) / p(c) = p(a) p(c | a) p(b | c) / p(c) = p(a | c) p(b | c),
where we used Bayes' theorem, p(c | a) = p(a | c) p(c) / p(a). Therefore a ⊥ b | c. [Graph: a → c → b, with c observed.]

Conditional independence: HT (3/3). Graphical interpretation. [Figures: left, a → c → b with c unobserved, not a ⊥ b; right, the same chain with c observed, a ⊥ b | c.]
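The same kind of numeric check for the head-to-tail chain a → c → b, again with arbitrary illustrative tables.

```python
import numpy as np

p_a = np.array([0.4, 0.6])
p_c_given_a = np.array([[0.7, 0.2],   # p(c | a), rows index c, columns index a
                        [0.3, 0.8]])
p_b_given_c = np.array([[0.5, 0.1],   # p(b | c), rows index b, columns index c
                        [0.5, 0.9]])

# Joint p(a, c, b) = p(a) p(c | a) p(b | c), with axes ordered (a, c, b).
joint = p_a[:, None, None] * p_c_given_a.T[:, :, None] * p_b_given_c.T[None, :, :]

p_ab = joint.sum(axis=1)                                                 # p(a, b)
print(np.allclose(p_ab, np.outer(p_ab.sum(axis=1), p_ab.sum(axis=0))))   # False: a, b dependent

p_ab_given_c = joint / joint.sum(axis=(0, 2), keepdims=True)             # p(a, b | c)
p_a_c = p_ab_given_c.sum(axis=2, keepdims=True)                          # p(a | c)
p_b_c = p_ab_given_c.sum(axis=0, keepdims=True)                          # p(b | c)
print(np.allclose(p_ab_given_c, p_a_c * p_b_c))                          # True: a ⊥ b | c
```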

Conditional independence: head-to-head (HH) (1/4). Third example (a little more subtle):
p(a, b, c) = p(a) p(b) p(c | a, b).
Marginalise over c to test for independence:
p(a, b) = Σ_c p(a) p(b) p(c | a, b) = p(a) p(b) Σ_c p(c | a, b) = p(a) p(b).
So a and b are independent if NO variable is observed: a ⊥ b | ∅. [Graph: a → c ← b.]

Conditional independence: HH (2/4). Now condition on c:
p(a, b | c) = p(a, b, c) / p(c) = p(a) p(b) p(c | a, b) / p(c),
which in general is not equal to p(a | c) p(b | c). So a ⊥ b | c does not hold. [Graph: a → c ← b, with c observed.]

Conditional independence: HH (3/4). Graphical interpretation: in both graphical models there is a path from a to b. The node c is called head-to-head (HH) with respect to this path because c is connected to the heads of the arrows on the path. The presence of the unobserved HH-node c (left) makes a independent of b (and b independent of a): the unobserved c blocks the path from a to b. [Figures: left, a → c ← b with c unobserved, a ⊥ b; right, c observed, not a ⊥ b | c.]

Conditional independence: HH (4/4). Graphical interpretation: conditioning on c unblocks the path from a to b and renders a conditionally dependent on b given c. Some more terminology: node y is a descendant of node x if there is a path from x to y in which each step follows the directions of the arrows. A HH-path becomes unblocked if either the HH node itself or any of its descendants is observed. [Figures: a → c ← b with c observed (not a ⊥ b | c), and a larger example where observing a descendant of the HH node unblocks the path.]
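A numeric illustration of the head-to-head ("explaining away") case, with made-up tables for a → c ← b.

```python
import numpy as np

p_a = np.array([0.5, 0.5])
p_b = np.array([0.3, 0.7])
p_c_given_ab = np.zeros((2, 2, 2))               # p(c | a, b), axes (a, b, c)
p_c_given_ab[..., 1] = np.array([[0.1, 0.6],
                                 [0.7, 0.99]])
p_c_given_ab[..., 0] = 1.0 - p_c_given_ab[..., 1]

joint = p_a[:, None, None] * p_b[None, :, None] * p_c_given_ab   # p(a, b, c)

p_ab = joint.sum(axis=2)
print(np.allclose(p_ab, np.outer(p_a, p_b)))     # True: a ⊥ b with nothing observed

p_ab_given_c = joint / joint.sum(axis=(0, 1), keepdims=True)     # p(a, b | c)
p_a_c = p_ab_given_c.sum(axis=1, keepdims=True)                  # p(a | c)
p_b_c = p_ab_given_c.sum(axis=0, keepdims=True)                  # p(b | c)
print(np.allclose(p_ab_given_c, p_a_c * p_b_c))  # False: observing c couples a and b
```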

General graphs. Conditional independence (and dependence) has been established for all configurations of three variables: (HH, HT, TT) × (observed, unobserved), plus the subtle case of a HH junction with an observed descendant. We can generalise these results to arbitrary Bayesian Networks. Roughly: the graph connectivity implies conditional independence for those sets of nodes which are directionally separated.

D-Separation. Definition (Blocked Path): a blocked path is a path which contains an observed TT- or HT-node, or an unobserved HH-node whose descendants are all unobserved. [Figures: the three-node blocking configurations, c → a, c → b with c observed; a → c → b with c observed; a → c ← b with c and its descendants unobserved.]

D-Separation. Consider a general directed graph in which A, B and C are arbitrary non-intersecting sets of nodes. (There may be other nodes in the graph which are not contained in the union of A, B and C.) Consider all possible paths from any node in A to any node in B. Any such path is blocked if it includes a node such that either (i) the node is HT or TT with respect to the path and is in the set C, or (ii) the node is HH, and neither the node nor any of its descendants is in the set C. If all paths are blocked, then A is d-separated from B by C, and the joint distribution over all the variables in the graph satisfies A ⊥ B | C. (Note: d-separation stands for directional separation.)
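This criterion can be checked by brute force on small graphs. The sketch below assumes a dict-of-parents representation and simply enumerates all undirected paths, which is fine for toy examples but is not an efficient algorithm. The usage lines apply it to the five-node graph of the next two slides (edges a → e, f → e, f → b, e → c, as reconstructed from the figures).

```python
def descendants(node, children):
    """All nodes reachable from `node` by following arrows forward."""
    seen, stack = set(), [node]
    while stack:
        for ch in children.get(stack.pop(), []):
            if ch not in seen:
                seen.add(ch)
                stack.append(ch)
    return seen

def all_paths(x, y, neighbours, path=None):
    """All simple paths between x and y in the underlying undirected graph."""
    path = [x] if path is None else path + [x]
    if x == y:
        yield path
        return
    for n in neighbours[x]:
        if n not in path:
            yield from all_paths(n, y, neighbours, path)

def d_separated(A, B, C, parents):
    children = {v: [] for v in parents}
    for v, ps in parents.items():
        for p in ps:
            children[p].append(v)
    neighbours = {v: set(parents[v]) | set(children[v]) for v in parents}
    C = set(C)
    for a in A:
        for b in B:
            for path in all_paths(a, b, neighbours):
                blocked = False
                for prev, node, nxt in zip(path, path[1:], path[2:]):
                    head_to_head = prev in parents[node] and nxt in parents[node]
                    if head_to_head:
                        # A HH node blocks unless it or one of its descendants is observed.
                        if node not in C and not (descendants(node, children) & C):
                            blocked = True
                    elif node in C:          # a TT or HT node blocks when observed
                        blocked = True
                if not blocked:
                    return False             # found an unblocked path
    return True

# Example graph from the following slides: a -> e, f -> e, f -> b, e -> c.
parents = {'a': [], 'f': [], 'e': ['a', 'f'], 'b': ['f'], 'c': ['e']}
print(d_separated({'a'}, {'b'}, {'c'}, parents))   # False: observing c unblocks the HH node e
print(d_separated({'a'}, {'b'}, {'f'}, parents))   # True:  a ⊥ b | f
```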

D-separation example. The path from a to b is not blocked by f, because f is a TT-node with respect to the path and is unobserved. The path is also not blocked by e, because although e is a HH-node and is itself unobserved, one of its descendants (node c) is observed. Hence a ⊥ b | c does not follow from this graph. [Graph: a → e, f → e, f → b, e → c, with c observed.]

D-separation, another example. The path from a to b is blocked by f, because f is a TT-node and is observed; therefore a ⊥ b | f. Furthermore, the path from a to b is also blocked by e, because e is a HH-node and neither it nor its descendants are observed; this too gives a ⊥ b | f. [Graph: as before, with f observed instead of c.]

Conditional Independence ⇔ Factorisation (1/4).
Theorem (Factorisation ⇒ Conditional Independence). If a probability distribution factorises according to a directed acyclic graph, and if A, B and C are disjoint subsets of nodes such that A is d-separated from B by C in the graph, then the distribution satisfies A ⊥ B | C.
Theorem (Conditional Independence ⇒ Factorisation). If a probability distribution satisfies the conditional independence statements implied by d-separation over a particular directed graph, then it also factorises according to the graph.

Conditional Independence ⇔ Factorisation (2/4). Why is this relevant? Conditional independence statements are usually what a domain expert knows about the problem at hand, but what is needed for computation is a model p(x). The Conditional Independence ⇒ Factorisation theorem provides p(x) from such statements: one can build a global model for computation from local conditional independence statements.

Conditional Independence ⇔ Factorisation (3/4). Given a set of conditional independence statements, adding another statement will in general produce further statements. Each statement can be read as a d-separation property of a DAG; however, there are sets of conditional independence statements which cannot all be satisfied by any single Bayesian Network.

Conditional Independence ⇔ Factorisation (4/4). The broader picture: a directed graphical model can be viewed as a filter which accepts probability distributions p(x) and only lets through those that satisfy the factorisation property. The set of all distributions p(x) which pass through the filter is denoted DF. Alternatively, only those distributions p(x) pass through the filter (graph) which respect the conditional independencies implied by the d-separation properties of the graph. The d-separation theorem says that the resulting set DF is the same in both cases. [Figure: distributions p(x) passing through the graph as a filter into the set DF.]