Cheng Soon Ong & Christian Walder. Canberra, February to June 2018
1 Cheng Soon Ong & Christian Walder
Research Group and College of Engineering and Computer Science
Canberra, February to June 2018
Outline: Overview; Introduction; Linear Algebra; Probability; Linear Regression 1; Linear Regression 2; Linear Classification 1; Linear Classification 2; Kernel Methods; Sparse Kernel Methods; Mixture Models and EM 1; Mixture Models and EM 2; Neural Networks 1; Neural Networks 2; Principal Component Analysis; Autoencoders; Graphical Models 1; Graphical Models 2; Graphical Models 3; Sampling; Sequential Data 1; Sequential Data 2.
(Many figures from C. M. Bishop, "Pattern Recognition and Machine Learning")
2 Part XIII: Probabilistic Graphical Models 1
3 Motivation via a Picture: Image Segmentation
Clustering the grey-value representation of an image loses all neighbourhood information. We need to use the structure of the image.
4 Motivation via an Example (1/4)
Why is the grass wet?
Introduce four Boolean variables, C(loudy), S(prinkler), R(ain), W(etGrass), each taking values in {F(alse), T(rue)}.
[Graph: Cloudy → Sprinkler, Cloudy → Rain; Sprinkler, Rain → WetGrass.]
5 Motivation via an Example (2/4)
Model the conditional probabilities with one table per node: p(c); p(s | c) for each value of C; p(r | c) for each value of C; p(w | s, r) for each combination of S and R.
[Graph: Cloudy → Sprinkler, Cloudy → Rain; Sprinkler, Rain → WetGrass, annotated with the four tables.]
6 Motivation via an Example (3/4)
If everything depends on everything, the joint table p(C, S, R, W) needs an entry for each of the 2^4 configurations. By the product rule,
p(W, S, R, C) = p(W | S, R, C) p(S, R, C)
             = p(W | S, R, C) p(S | R, C) p(R, C)
             = p(W | S, R, C) p(S | R, C) p(R | C) p(C).
7 Motivation via an Example (4/4)
Fully general model:
p(w) = Σ_{S,R,C} p(w | S, R, C) p(S | R, C) p(R | C) p(C).
Using the graph structure (Sprinkler and Rain depend only on Cloudy; WetGrass depends only on Sprinkler and Rain):
p(w) = Σ_{S,R} p(w | S, R) Σ_C p(S | C) p(R | C) p(C).
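The two summation orders can be checked against each other numerically. A minimal Python sketch; the CPT values below are made-up illustrations, not the values from the slides:

```python
import itertools

# Wet-grass network, Boolean variables; CPT numbers are made up for illustration.
p_c = 0.5                                    # p(C = T)
p_s_c = {True: 0.1, False: 0.5}              # p(S = T | C)
p_r_c = {True: 0.8, False: 0.2}              # p(R = T | C)
p_w_sr = {(True, True): 0.99, (True, False): 0.90,
          (False, True): 0.90, (False, False): 0.00}   # p(W = T | S, R)

def pr(val, p_true):
    """Probability that a Boolean variable equals `val`, given p(T) = p_true."""
    return p_true if val else 1.0 - p_true

# Flat sum over S, R, C of p(w | S, R) p(S | C) p(R | C) p(C).
naive = sum(pr(True, p_w_sr[s, r]) * pr(s, p_s_c[c]) * pr(r, p_r_c[c]) * pr(c, p_c)
            for s, r, c in itertools.product((True, False), repeat=3))

# Structured sum: eliminate C first, as in the second expression on the slide.
inner = {(s, r): sum(pr(s, p_s_c[c]) * pr(r, p_r_c[c]) * pr(c, p_c)
                     for c in (True, False))
         for s, r in itertools.product((True, False), repeat=2)}
factored = sum(pr(True, p_w_sr[s, r]) * inner[s, r] for s, r in inner)

assert abs(naive - factored) < 1e-12
```

Both orders yield the same p(W = T); only the number of terms touched differs.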
8 Motivation via the Distributive Law
Two key observations when dealing with probabilities:
1 The distributive law can save operations: a(b + c) [2 operations] = ab + ac [3 operations].
2 If some probabilities do not depend on all random variables, we may be able to factor them out. For example, assume p(x_1, x_3 | x_2) = p(x_1 | x_2) p(x_3 | x_2). Then, using Σ_{x_3} p(x_3 | x_2) = 1,
p(x_1) = Σ_{x_2, x_3} p(x_1, x_2, x_3)
       = Σ_{x_2, x_3} p(x_1, x_3 | x_2) p(x_2)
       = Σ_{x_2, x_3} p(x_1 | x_2) p(x_3 | x_2) p(x_2)
       = Σ_{x_2} p(x_1 | x_2) p(x_2),
reducing the cost from O(|X_1| |X_2| |X_3|) to O(|X_1| |X_2|).
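The factorisation argument can be instantiated directly. The sketch below uses arbitrary domain sizes and probabilities (my own choices, not from the slides) and counts the terms each evaluation order touches:

```python
from itertools import product

# Domains: binary x1, illustrative sizes |X2| = 10, |X3| = 20.
X1, X2, X3 = range(2), range(10), range(20)

p2 = [1.0 / len(X2)] * len(X2)                   # p(x2), uniform
p1g2 = [[0.3, 0.7] for _ in X2]                  # p(x1 | x2), same row for each x2
p3g2 = [[1.0 / len(X3)] * len(X3) for _ in X2]   # p(x3 | x2), uniform

# Naive marginal: sum over x2 AND x3 -> O(|X1||X2||X3|) terms.
naive_terms, naive = 0, [0.0, 0.0]
for x1, x2, x3 in product(X1, X2, X3):
    naive[x1] += p1g2[x2][x1] * p3g2[x2][x3] * p2[x2]
    naive_terms += 1

# Factored marginal: sum_x3 p(x3 | x2) = 1 drops out -> O(|X1||X2|) terms.
factored_terms, factored = 0, [0.0, 0.0]
for x1, x2 in product(X1, X2):
    factored[x1] += p1g2[x2][x1] * p2[x2]
    factored_terms += 1

assert naive_terms == 2 * 10 * 20 and factored_terms == 2 * 10
assert all(abs(a - b) < 1e-9 for a, b in zip(naive, factored))
```

Same marginal, 400 terms versus 20.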
9-10 Motivation via Complexity Reduction
How to deal with a more complex expression?
p(x_1) p(x_2) p(x_3) p(x_4 | x_1, x_2, x_3) p(x_5 | x_1, x_3) p(x_6 | x_4) p(x_7 | x_4, x_5)
Answer: graphical models.
[Graph over x_1, ..., x_7 corresponding to this factorisation.]
11 Probabilistic Graphical Models: Aims
Graphical models
visualise the structure of a probabilistic model,
replace complex computations with formulas by manipulations on graphs,
give insights into model properties by inspection,
help develop and motivate new models.
12 Probabilistic Graphical Models: Main Flavours
A graph consists of:
1 Nodes (vertices): each represents a random variable.
2 Edges (links, arcs; directed or undirected): each represents a probabilistic relationship.
Directed graph: Bayesian Network (also called Directed Graphical Model), expressing causal relationships between variables.
Undirected graph: Markov Random Field, expressing soft constraints between variables.
Factor graph: convenient for solving inference problems (derived from Bayesian Networks or Markov Random Fields).
[Figures: a directed graph over x_1, ..., x_7; a factor graph over x_1, x_2, x_3 with factors f_a, f_b, f_c, f_d; a Markov Random Field over nodes x_i, y_i.]
13-14 Bayesian Networks: Arbitrary Joint Distribution
p(a, b, c) = p(c | a, b) p(a, b) = p(c | a, b) p(b | a) p(a)
1 Draw a node for each conditional distribution associated with a random variable.
2 Draw an edge from each node to every node whose conditional distribution is conditioned on its variable.
[Graph: a → b, a → c, b → c.]
Note that we have chosen a particular ordering of the variables!
15 Bayesian Networks: Arbitrary Joint Distribution
General case for K variables:
p(x_1, ..., x_K) = p(x_K | x_1, ..., x_{K-1}) ... p(x_2 | x_1) p(x_1)
The graph of this distribution is fully connected. (Prove it.)
What happens if we deal with a distribution represented by a graph which is not fully connected? It can no longer be the most general distribution: the absence of edges carries important information.
16-18 Joint Factorisation → Graph
p(x_1) p(x_2) p(x_3) p(x_4 | x_1, x_2, x_3) p(x_5 | x_1, x_3) p(x_6 | x_4) p(x_7 | x_4, x_5)
1 Draw a node for each conditional distribution associated with a random variable.
2 Draw an edge from each node to every node whose conditional distribution is conditioned on its variable.
[Resulting graph: x_1, x_2, x_3 → x_4; x_1, x_3 → x_5; x_4 → x_6; x_4, x_5 → x_7.]
19-20 Graph → Joint Factorisation
Can we get the expression back from the graph?
1 Write a product of probability distributions, one for each random variable (node).
2 Add all random variables associated with parent nodes to the list of conditioning variables.
Result: p(x_1) p(x_2) p(x_3) p(x_4 | x_1, x_2, x_3) p(x_5 | x_1, x_3) p(x_6 | x_4) p(x_7 | x_4, x_5)
21 Graph → Joint Factorisation
The joint distribution defined by a graph is given by the product, over all nodes of the graph, of a conditional distribution for each node conditioned on the variables corresponding to the parents of that node:
p(x) = ∏_{k=1}^{K} p(x_k | pa(x_k)),
where pa(x_k) denotes the set of parents of x_k and x = (x_1, ..., x_K).
The absence of edges (compared to the fully connected case) is the point.
22 Graph → Joint Factorisation
Restriction: the graph must be a directed acyclic graph (DAG): there are no closed paths in the graph when moving along the directed edges. Equivalently, there exists an ordering of the nodes such that no edge goes from any node to a lower-numbered node.
[Graph: the seven-variable example again; it satisfies this restriction.]
Extension: a node can also hold a set of variables, or a vector.
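The equivalent ordering condition suggests a simple DAG check: repeatedly emit nodes all of whose parents have already been emitted. A sketch; the helper name and the cyclic counterexample are my own, with the parent sets of the seven-variable example:

```python
def topological_order(parents):
    """Return an ordering in which every node appears after its parents,
    or None if the directed graph has a cycle (i.e. it is not a DAG)."""
    order, placed = [], set()
    pending = dict(parents)
    while pending:
        ready = [n for n, ps in pending.items() if all(p in placed for p in ps)]
        if not ready:
            return None          # every remaining node waits on another: a cycle
        for n in sorted(ready):
            order.append(n)
            placed.add(n)
            del pending[n]
    return order

# Parent sets of the seven-variable example graph: it is a DAG.
dag = {1: [], 2: [], 3: [], 4: [1, 2, 3], 5: [1, 3], 6: [4], 7: [4, 5]}
assert topological_order(dag) == [1, 2, 3, 4, 5, 6, 7]

# Adding an edge x7 -> x1 would create a closed directed path.
cyclic = dict(dag)
cyclic[1] = [7]
assert topological_order(cyclic) is None
```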
23-24 Bayesian Networks: Normalisation
Given p(x) = ∏_{k=1}^{K} p(x_k | pa(x_k)), is p(x) normalised, i.e. Σ_x p(x) = 1?
As the graph is a DAG, there always exists a node with no outgoing edges, say x_i. Then
Σ_x p(x) = Σ_{x_1,...,x_{i-1},x_{i+1},...,x_K} [ ∏_{k≠i} p(x_k | pa(x_k)) ] Σ_{x_i} p(x_i | pa(x_i)),
and the innermost sum equals 1 because
Σ_{x_i} p(x_i | pa(x_i)) = Σ_{x_i} p(x_i, pa(x_i)) / p(pa(x_i)) = p(pa(x_i)) / p(pa(x_i)) = 1.
Repeat until no node is left.
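The argument can be confirmed numerically: for any locally normalised CPTs on the seven-variable example graph (random values, binary variables, my own construction), the product sums to exactly one:

```python
import itertools
import random

random.seed(3)

# Parent sets of the seven-variable example graph; all variables binary.
parents = {1: (), 2: (), 3: (), 4: (1, 2, 3), 5: (1, 3), 6: (4,), 7: (4, 5)}

def random_cpt(n_parents):
    """One normalised row p(x_k | pa) for every parent configuration."""
    cpt = {}
    for pa in itertools.product((0, 1), repeat=n_parents):
        p1 = random.random()
        cpt[pa] = (1.0 - p1, p1)
    return cpt

cpts = {k: random_cpt(len(ps)) for k, ps in parents.items()}

def joint(x):
    """p(x) = prod_k p(x_k | pa(x_k)); x maps variable index -> value."""
    prod = 1.0
    for k, ps in parents.items():
        prod *= cpts[k][tuple(x[p] for p in ps)][x[k]]
    return prod

# Summing the product of locally normalised CPTs over all 2^7 states gives 1.
total = sum(joint(dict(zip(parents, vals)))
            for vals in itertools.product((0, 1), repeat=7))
assert abs(total - 1.0) < 1e-9
```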
25-26 Bayesian Polynomial Regression
Bayesian polynomial regression: observed inputs x, observed targets t, noise variance σ², hyperparameter α controlling the prior over w.
Focusing on t and w only:
p(t, w) = p(w) ∏_{n=1}^{N} p(t_n | w)
[Graph: w → t_1, ..., w → t_N; drawn compactly as w → t_n with t_n inside a plate labelled N.]
27 Bayesian Polynomial Regression
Include the parameters in the graphical model:
p(t, w | x, α, σ²) = p(w | α) ∏_{n=1}^{N} p(t_n | w, x_n, σ²)
Random variables: open circles. Deterministic variables: smaller solid circles.
[Graph: x_n and σ² → t_n inside a plate labelled N; α → w → t_n.]
28 Bayesian Polynomial Regression
Random variables are either observed, e.g. t, or unobserved, e.g. w (also called latent or hidden random variables).
Observed random variables are shaded in the graphical model.
[Graph: as before, with the t_n nodes shaded.]
29 Bayesian Polynomial Regression: Prediction
New data point x̂; we want to predict t̂:
p(t̂, t, w | x̂, x, α, σ²) = [ ∏_{n=1}^{N} p(t_n | x_n, w, σ²) ] p(w | α) p(t̂ | x̂, w, σ²)
[Graph: the polynomial regression model including the prediction nodes x̂ and t̂.]
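One direct use of this directed factorisation is ancestral sampling: draw w from its prior, then each t_n given w. The degree, α, σ and inputs below are illustrative choices, not values from the slides:

```python
import math
import random

random.seed(1)

# Ancestral sampling from the polynomial regression model: follow the arrows.
M, alpha, sigma = 3, 2.0, 0.3     # illustrative degree and hyperparameters

# w ~ p(w | alpha) = N(0, alpha^{-1} I)
w = [random.gauss(0.0, 1.0 / math.sqrt(alpha)) for _ in range(M + 1)]

def poly(w, x):
    """Evaluate the polynomial sum_j w_j x^j."""
    return sum(wj * x ** j for j, wj in enumerate(w))

xs = [i / 10 for i in range(11)]                    # deterministic inputs x_n
ts = [random.gauss(poly(w, x), sigma) for x in xs]  # t_n ~ p(t_n | w, x_n, sigma^2)

assert len(ts) == len(xs)
```

Predicting t̂ at a new x̂ would condition on the sampled t instead of sampling them.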
30 Conditional Independence
Any p(x) may be written
p(x) = ∏_{k=1}^{K} p(x_k | x_1, x_2, ..., x_{k-1}) = ∏_{k=1}^{K} p(x_k | pa(x_k)).
What does it mean if pa(x_k) ⊊ {x_1, x_2, ..., x_{k-1}} in the above equation?
A sparse graphical model: p(x) is no longer fully general. An assumption has been made, e.g. p(w | C, S, R) = p(w | S, R): conditional independence.
31 Conditional Independence
Definition (Conditional Independence)
If for three random variables a, b, and c the following holds:
p(a | b, c) = p(a | c),
then a is conditionally independent of b given c. Notation: a ⊥ b | c.
The above equation must hold for all possible values of c.
Consequence: p(a, b | c) = p(a | b, c) p(b | c) = p(a | c) p(b | c).
Conditional independence simplifies both the structure of the model and the computations needed to perform inference and learning.
NB: conditional dependence is a related but different concept which we are not concerned with in this course.
32 Rules for Conditional Independence
Symmetry: X ⊥ Y | Z ⇒ Y ⊥ X | Z
Decomposition: Y, W ⊥ X | Z ⇒ Y ⊥ X | Z and W ⊥ X | Z
Weak Union: X ⊥ Y, W | Z ⇒ X ⊥ Y | Z, W
Contraction: X ⊥ W | Z, Y and X ⊥ Y | Z ⇒ X ⊥ W, Y | Z
Intersection: X ⊥ Y | Z, W and X ⊥ W | Z, Y ⇒ X ⊥ Y, W | Z
Note: Intersection is only valid for strictly positive distributions, p(x), p(y), p(z), p(w) > 0.
33 Conditional Independence: TT (1/3)
Can we work with the graphical model directly? Check the simplest examples, containing only three nodes.
The first example has joint distribution p(a, b, c) = p(a | c) p(b | c) p(c).
Marginalise both sides over c:
p(a, b) = Σ_c p(a | c) p(b | c) p(c) ≠ p(a) p(b) in general.
So a ⊥ b | ∅ does not hold (where ∅ is the empty set).
[Graph: c → a, c → b.]
34 Conditional Independence: TT (2/3)
Now condition on c:
p(a, b | c) = p(a, b, c) / p(c) = p(a | c) p(b | c).
Therefore a ⊥ b | c.
[Graph: c → a, c → b, with c observed.]
35 Conditional Independence: TT (3/3)
Graphical interpretation: in both graphical models there is a path from a to b. The node c is called tail-to-tail (TT) with respect to this path because c is connected to the tails of the arrows in the path.
The presence of the unobserved TT-node c in the path renders a dependent on b (and b dependent on a). Conditioning on c blocks the path from a to b and causes a and b to become conditionally independent given c.
[Figures: left, c unobserved: a ⊥ b does not hold; right, c observed: a ⊥ b | c.]
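A small numerical check of the TT case, with made-up binary tables:

```python
# Tail-to-tail example: p(a, b, c) = p(a|c) p(b|c) p(c); numbers are made up.
pc = [0.5, 0.5]                       # p(c)
pa_c = [[0.9, 0.1], [0.2, 0.8]]       # p(a | c), rows indexed by c
pb_c = [[0.7, 0.3], [0.4, 0.6]]       # p(b | c)

def p(a, b, c):
    return pa_c[c][a] * pb_c[c][b] * pc[c]

# Marginally (c unobserved), a and b are dependent here:
pab00 = sum(p(0, 0, c) for c in (0, 1))                 # p(a=0, b=0)
pa0 = sum(p(0, b, c) for b in (0, 1) for c in (0, 1))   # p(a=0)
pb0 = sum(p(a, 0, c) for a in (0, 1) for c in (0, 1))   # p(b=0)
assert abs(pab00 - pa0 * pb0) > 1e-3                    # a ⊥ b does NOT hold

# Conditioning on c blocks the path: p(a, b | c) = p(a|c) p(b|c) exactly.
for c in (0, 1):
    for a in (0, 1):
        for b in (0, 1):
            assert abs(p(a, b, c) / pc[c] - pa_c[c][a] * pb_c[c][b]) < 1e-12
```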
36 Conditional Independence: HT (1/3)
Second example: p(a, b, c) = p(a) p(c | a) p(b | c).
Marginalise over c to test for independence:
p(a, b) = p(a) Σ_c p(c | a) p(b | c) = p(a) p(b | a) ≠ p(a) p(b) in general.
So a ⊥ b | ∅ does not hold.
[Graph: a → c → b.]
37 Conditional Independence: HT (2/3)
Now condition on c:
p(a, b | c) = p(a, b, c) / p(c) = p(a) p(c | a) p(b | c) / p(c) = p(a | c) p(b | c),
where we used Bayes' theorem, p(c | a) = p(a | c) p(c) / p(a).
Therefore a ⊥ b | c.
38 Conditional Independence: HT (3/3)
Graphical interpretation: the node c is head-to-tail (HT) with respect to the path from a to b. Observing c blocks the path.
[Figures: left, a → c → b with c unobserved: a ⊥ b does not hold; right, c observed: a ⊥ b | c.]
39 Conditional Independence: HH (1/4)
Third example (a little more subtle): p(a, b, c) = p(a) p(b) p(c | a, b).
Marginalise over c to test for independence:
p(a, b) = Σ_c p(a) p(b) p(c | a, b) = p(a) p(b) Σ_c p(c | a, b) = p(a) p(b).
a and b are independent if NO variable is observed: a ⊥ b | ∅.
[Graph: a → c ← b.]
40 Conditional Independence: HH (2/4)
Now condition on c:
p(a, b | c) = p(a, b, c) / p(c) = p(a) p(b) p(c | a, b) / p(c) ≠ p(a | c) p(b | c) in general.
So a ⊥ b | c does not hold.
41 Conditional Independence: HH (3/4)
Graphical interpretation: in both graphical models there is a path from a to b. The node c is called head-to-head (HH) with respect to this path because c is connected to the heads of the arrows in the path.
The presence of the unobserved HH-node c in the path makes a independent of b (and b independent of a): the unobserved c blocks the path from a to b.
[Figures: left, c unobserved: a ⊥ b; right, c observed: a ⊥ b | c does not hold.]
42 Conditional Independence: HH (4/4)
Graphical interpretation: conditioning on c unblocks the path from a to b, and renders a conditionally dependent on b given c.
Some more terminology: node y is a descendant of node x if there is a path from x to y in which each step follows the direction of the arrows.
An HH-path becomes unblocked if either the HH-node itself, or any of its descendants, is observed.
[Figures: a → c ← b with c observed (a ⊥ b | c does not hold), and a larger graph where observing a descendant of the HH-node likewise unblocks the path.]
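And the corresponding check for the HH case, the "explaining away" effect; the numbers are my own, chosen so that c behaves like an effect with two independent causes:

```python
# Head-to-head example: p(a, b, c) = p(a) p(b) p(c | a, b); numbers are made up.
pa, pb = [0.5, 0.5], [0.5, 0.5]
pc1 = [[0.1, 0.9], [0.9, 0.99]]        # p(c=1 | a, b), indexed [a][b]

def p(a, b, c):
    p_c = pc1[a][b] if c == 1 else 1.0 - pc1[a][b]
    return pa[a] * pb[b] * p_c

# With c unobserved, marginalising gives p(a, b) = p(a) p(b) exactly.
for a in (0, 1):
    for b in (0, 1):
        assert abs(sum(p(a, b, c) for c in (0, 1)) - pa[a] * pb[b]) < 1e-12

# Observing c unblocks the path: given c = 1, a and b become dependent.
pc_1 = sum(p(a, b, 1) for a in (0, 1) for b in (0, 1))   # p(c=1)
pa1 = sum(p(1, b, 1) for b in (0, 1)) / pc_1             # p(a=1 | c=1)
pb1 = sum(p(a, 1, 1) for a in (0, 1)) / pc_1             # p(b=1 | c=1)
assert abs(p(1, 1, 1) / pc_1 - pa1 * pb1) > 1e-3
```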
43 Conditional Independence: General Graphs
Conditional independence has now been established for all configurations of three variables: (HH, HT, TT) × (observed, unobserved), plus the subtle case of an HH junction with an observed descendant.
We can generalise these results to arbitrary Bayesian Networks. Roughly: the graph connectivity implies conditional independence for those sets of nodes which are directionally separated.
44 D-Separation
Definition (Blocked Path)
A blocked path is a path which contains an observed TT- or HT-node, or an unobserved HH-node whose descendants are all unobserved.
[Figures: the three canonical three-node graphs (TT, HT, HH).]
45 D-Separation
Consider a general directed graph in which A, B, and C are arbitrary non-intersecting sets of nodes. (There may be other nodes in the graph which are not contained in the union of A, B, and C.)
Consider all possible paths from any node in A to any node in B. Any such path is blocked if it includes a node such that either
1 the node is HT or TT with respect to the path, and the node is in set C, or
2 the node is HH with respect to the path, and neither the node nor any of its descendants is in set C.
If all paths are blocked, then A is d-separated from B by C, and the joint distribution over all the variables in the graph will satisfy A ⊥ B | C.
(Note: d-separation stands for directional separation.)
46 D-Separation: Example
The path from a to b is not blocked by f, because f is a TT-node and unobserved.
The path from a to b is not blocked by e, because e is an HH-node and, although unobserved itself, one of its descendants (node c) is observed.
Hence the path is unblocked, and a ⊥ b | c does not follow from this graph.
[Graph: a → e, f → e, f → b, e → c; c observed.]
47 D-Separation: Another Example
The path from a to b is blocked by f, because f is a TT-node and observed. Therefore a ⊥ b | f.
Furthermore, the path from a to b is also blocked by e, because e is an HH-node and neither it nor its descendants are observed. This again gives a ⊥ b | f.
[Same graph, now with f observed instead of c.]
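The blocked-path rules can be coded directly by enumerating paths in the undirected skeleton, which is enough for small graphs. This sketch assumes the example graph has edges a→e, f→e, f→b and e→c, consistent with the conclusions stated on the two example slides:

```python
from itertools import chain

# Example graph from the d-separation slides (assumed edge set).
edges = {('a', 'e'), ('f', 'e'), ('f', 'b'), ('e', 'c')}
nodes = set(chain.from_iterable(edges))

children = {n: {v for (u, v) in edges if u == n} for n in nodes}
neigh = {n: {v for (u, v) in edges if u == n} | {u for (u, v) in edges if v == n}
         for n in nodes}

def descendants(n):
    out, stack = set(), [n]
    while stack:
        for c in children[stack.pop()]:
            if c not in out:
                out.add(c)
                stack.append(c)
    return out

def paths(a, b, seen=()):
    """All simple paths from a to b in the undirected skeleton."""
    if a == b:
        yield seen + (b,)
        return
    for n in neigh[a]:
        if n not in seen:
            yield from paths(n, b, seen + (a,))

def blocked(path, C):
    for i in range(1, len(path) - 1):
        prev, n, nxt = path[i - 1], path[i], path[i + 1]
        head_prev = (prev, n) in edges    # arrow points into n from prev
        head_next = (nxt, n) in edges     # arrow points into n from nxt
        if head_prev and head_next:       # head-to-head node
            if n not in C and not (descendants(n) & C):
                return True
        elif n in C:                      # observed TT- or HT-node
            return True
    return False

def d_separated(a, b, C):
    return all(blocked(p, C) for p in paths(a, b))

assert not d_separated('a', 'b', {'c'})   # the path a-e-f-b stays unblocked
assert d_separated('a', 'b', {'f'})       # observing f blocks it: a ⊥ b | f
```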
48 Conditional Independence ⇔ Factorisation (1/4)
Theorem (Factorisation ⇒ Conditional Independence)
If a probability distribution factorises according to a directed acyclic graph, and if A, B and C are disjoint subsets of nodes such that A is d-separated from B by C in the graph, then the distribution satisfies A ⊥ B | C.
Theorem (Conditional Independence ⇒ Factorisation)
If a probability distribution satisfies the conditional independence statements implied by d-separation over a particular directed graph, then it also factorises according to the graph.
49 Conditional Independence ⇔ Factorisation (2/4)
Why is this equivalence relevant?
Conditional independence statements are usually what a domain expert knows about the problem at hand; what is needed is a model p(x) for computation.
The factorisation theorem provides p(x) from the conditional independence statements: one can build a global model for computation from local conditional independence statements.
50 Conditional Independence ⇔ Factorisation (3/4)
Given a set of conditional independence statements, adding another statement will in general produce further statements.
Conditional independence statements can be read as d-separation in a DAG. However, there are sets of conditional independence statements which cannot be satisfied by any Bayesian Network.
51 Conditional Independence ⇔ Factorisation (4/4)
The broader picture: a directed graphical model can be viewed as a filter which accepts a probability distribution p(x) only if it satisfies the factorisation property. The set of all distributions p(x) that pass through the filter is denoted DF.
Alternatively, only those distributions p(x) pass through the filter (graph) which respect the conditional independencies implied by the d-separation properties of the graph.
The d-separation theorem says that the resulting set DF is the same in both cases.
[Figure: distributions p(x) passing through a filter into the set DF.]
More informationCOUNTING AND PROBABILITY
CHAPTER 9 COUNTING AND PROBABILITY Copyright Cengage Learning. All rights reserved. SECTION 9.3 Counting Elements of Disjoint Sets: The Addition Rule Copyright Cengage Learning. All rights reserved. Counting
More informationBayesian Classification Using Probabilistic Graphical Models
San Jose State University SJSU ScholarWorks Master's Projects Master's Theses and Graduate Research Spring 2014 Bayesian Classification Using Probabilistic Graphical Models Mehal Patel San Jose State University
More informationIntroduction to Machine Learning CMU-10701
Introduction to Machine Learning CMU-10701 Clustering and EM Barnabás Póczos & Aarti Singh Contents Clustering K-means Mixture of Gaussians Expectation Maximization Variational Methods 2 Clustering 3 K-
More information2. Graphical Models. Undirected graphical models. Factor graphs. Bayesian networks. Conversion between graphical models. Graphical Models 2-1
Graphical Models 2-1 2. Graphical Models Undirected graphical models Factor graphs Bayesian networks Conversion between graphical models Graphical Models 2-2 Graphical models There are three families of
More informationThese notes present some properties of chordal graphs, a set of undirected graphs that are important for undirected graphical models.
Undirected Graphical Models: Chordal Graphs, Decomposable Graphs, Junction Trees, and Factorizations Peter Bartlett. October 2003. These notes present some properties of chordal graphs, a set of undirected
More informationRNNs as Directed Graphical Models
RNNs as Directed Graphical Models Sargur Srihari srihari@buffalo.edu This is part of lecture slides on Deep Learning: http://www.cedar.buffalo.edu/~srihari/cse676 1 10. Topics in Sequence Modeling Overview
More informationCSC Discrete Math I, Spring Sets
CSC 125 - Discrete Math I, Spring 2017 Sets Sets A set is well-defined, unordered collection of objects The objects in a set are called the elements, or members, of the set A set is said to contain its
More informationSection 6.3: Further Rules for Counting Sets
Section 6.3: Further Rules for Counting Sets Often when we are considering the probability of an event, that event is itself a union of other events. For example, suppose there is a horse race with three
More informationA Brief Introduction to Bayesian Networks. adapted from slides by Mitch Marcus
A Brief Introduction to Bayesian Networks adapted from slides by Mitch Marcus Bayesian Networks A simple, graphical notation for conditional independence assertions and hence for compact specification
More informationProbabilistic Graphical Models
Overview of Part Two Probabilistic Graphical Models Part Two: Inference and Learning Christopher M. Bishop Exact inference and the junction tree MCMC Variational methods and EM Example General variational
More informationMachine Learning A WS15/16 1sst KU Version: January 11, b) [1 P] For the probability distribution P (A, B, C, D) with the factorization
Machine Learning A 708.064 WS15/16 1sst KU Version: January 11, 2016 Exercises Problems marked with * are optional. 1 Conditional Independence I [3 P] a) [1 P] For the probability distribution P (A, B,
More information10708 Graphical Models: Homework 2
10708 Graphical Models: Homework 2 Due October 15th, beginning of class October 1, 2008 Instructions: There are six questions on this assignment. Each question has the name of one of the TAs beside it,
More informationGraphical Models. Pradeep Ravikumar Department of Computer Science The University of Texas at Austin
Graphical Models Pradeep Ravikumar Department of Computer Science The University of Texas at Austin Useful References Graphical models, exponential families, and variational inference. M. J. Wainwright
More informationLecture 9: Undirected Graphical Models Machine Learning
Lecture 9: Undirected Graphical Models Machine Learning Andrew Rosenberg March 5, 2010 1/1 Today Graphical Models Probabilities in Undirected Graphs 2/1 Undirected Graphs What if we allow undirected graphs?
More informationUsing a Model of Human Cognition of Causality to Orient Arcs in Structural Learning
Using a Model of Human Cognition of Causality to Orient Arcs in Structural Learning A dissertation submitted in partial fulfillment of the requirements for the degree of Doctor of Philosophy at George
More informationCAUSAL NETWORKS: SEMANTICS AND EXPRESSIVENESS*
Uncertainty in Artificial Intelligence 4 R.D. Shachter, T.S. Levitt, L.N. Kanal, J.F. Lemmer (Editors) Elsevier Science Publishers B.V. (North-Holland), 1990 69 CAUSAL NETWORKS: SEMANTICS AND EXPRESSIVENESS*
More informationMachine Learning. Chao Lan
Machine Learning Chao Lan Machine Learning Prediction Models Regression Model - linear regression (least square, ridge regression, Lasso) Classification Model - naive Bayes, logistic regression, Gaussian
More informationBioinformatics - Lecture 07
Bioinformatics - Lecture 07 Bioinformatics Clusters and networks Martin Saturka http://www.bioplexity.org/lectures/ EBI version 0.4 Creative Commons Attribution-Share Alike 2.5 License Learning on profiles
More informationMachine Learning Lecture 16
ourse Outline Machine Learning Lecture 16 undamentals (2 weeks) ayes ecision Theory Probability ensity stimation Undirected raphical Models & Inference 28.06.2016 iscriminative pproaches (5 weeks) Linear
More informationMachine Learning. B. Unsupervised Learning B.1 Cluster Analysis. Lars Schmidt-Thieme
Machine Learning B. Unsupervised Learning B.1 Cluster Analysis Lars Schmidt-Thieme Information Systems and Machine Learning Lab (ISMLL) Institute for Computer Science University of Hildesheim, Germany
More informationTransitivity and Triads
1 / 32 Tom A.B. Snijders University of Oxford May 14, 2012 2 / 32 Outline 1 Local Structure Transitivity 2 3 / 32 Local Structure in Social Networks From the standpoint of structural individualism, one
More informationSampling PCA, enhancing recovered missing values in large scale matrices. Luis Gabriel De Alba Rivera 80555S
Sampling PCA, enhancing recovered missing values in large scale matrices. Luis Gabriel De Alba Rivera 80555S May 2, 2009 Introduction Human preferences (the quality tags we put on things) are language
More informationJoint Entity Resolution
Joint Entity Resolution Steven Euijong Whang, Hector Garcia-Molina Computer Science Department, Stanford University 353 Serra Mall, Stanford, CA 94305, USA {swhang, hector}@cs.stanford.edu No Institute
More informationTHREE LECTURES ON BASIC TOPOLOGY. 1. Basic notions.
THREE LECTURES ON BASIC TOPOLOGY PHILIP FOTH 1. Basic notions. Let X be a set. To make a topological space out of X, one must specify a collection T of subsets of X, which are said to be open subsets of
More informationWarm-up as you walk in
arm-up as you walk in Given these N=10 observations of the world: hat is the approximate value for P c a, +b? A. 1/10 B. 5/10. 1/4 D. 1/5 E. I m not sure a, b, +c +a, b, +c a, b, +c a, +b, +c +a, b, +c
More informationLearning Tractable Probabilistic Models Pedro Domingos
Learning Tractable Probabilistic Models Pedro Domingos Dept. Computer Science & Eng. University of Washington 1 Outline Motivation Probabilistic models Standard tractable models The sum-product theorem
More information7. Boosting and Bagging Bagging
Group Prof. Daniel Cremers 7. Boosting and Bagging Bagging Bagging So far: Boosting as an ensemble learning method, i.e.: a combination of (weak) learners A different way to combine classifiers is known
More informationProbabilistic Learning Classification using Naïve Bayes
Probabilistic Learning Classification using Naïve Bayes Weather forecasts are usually provided in terms such as 70 percent chance of rain. These forecasts are known as probabilities of precipitation reports.
More information2.2 Set Operations. Introduction DEFINITION 1. EXAMPLE 1 The union of the sets {1, 3, 5} and {1, 2, 3} is the set {1, 2, 3, 5}; that is, EXAMPLE 2
2.2 Set Operations 127 2.2 Set Operations Introduction Two, or more, sets can be combined in many different ways. For instance, starting with the set of mathematics majors at your school and the set of
More informationBayesian Networks. A Bayesian network is a directed acyclic graph that represents causal relationships between random variables. Earthquake.
Bayes Nets Independence With joint probability distributions we can compute many useful things, but working with joint PD's is often intractable. The naïve Bayes' approach represents one (boneheaded?)
More informationSequence Labeling: The Problem
Sequence Labeling: The Problem Given a sequence (in NLP, words), assign appropriate labels to each word. For example, POS tagging: DT NN VBD IN DT NN. The cat sat on the mat. 36 part-of-speech tags used
More informationCausal Networks: Semantics and Expressiveness*
Causal Networks: Semantics and Expressiveness* Thomas Verina & Judea Pearl Cognitive Systems Laboratory, Computer Science Department University of California Los Angeles, CA 90024
More informationAI Programming CS S-15 Probability Theory
AI Programming CS662-2013S-15 Probability Theory David Galles Department of Computer Science University of San Francisco 15-0: Uncertainty In many interesting agent environments, uncertainty plays a central
More information2. Graphical Models. Undirected pairwise graphical models. Factor graphs. Bayesian networks. Conversion between graphical models. Graphical Models 2-1
Graphical Models 2-1 2. Graphical Models Undirected pairwise graphical models Factor graphs Bayesian networks Conversion between graphical models Graphical Models 2-2 Graphical models Families of graphical
More information