Introduction to Probabilistic Graphical Models

1 Introduction to Probabilistic Graphical Models Tomi Silander School of Computing National University of Singapore June 13, 2011

2 What is a probabilistic graphical model? The formalism of probabilistic graphical models provides a unifying framework for capturing complex dependencies among random variables, and building large-scale multivariate statistical models. Martin J. Wainwright and Michael I. Jordan in Graphical Models, Exponential Families, and Variational Inference (2008).

3 Outline Motivation Representation Inference Learning Learning parameters Learning structure Misc

4 Why probabilities? As an output they allow decision making. As an input they solve the qualification problem. They can be learnt from data. Principled way to combine data with expert knowledge. Combining evidence from different sources.

5 Decision making Scenario: a patient comes to the doctor with severe respiratory distress. All kinds of tests including X-ray and blood tests are taken and fed to a classifier that comes up with the diagnosis: the patient most probably has pneumonia. ... the patient has pneumonia with 65% probability. ... the patient has pneumonia with 65% and AIDS with 30% probability (a 100-fold risk). Very often you want probabilities as a result. A 50%/50% result does not necessarily mean that you do not know what to do!

6 Qualification problem Example by Russell & Norvig (Artificial Intelligence): an agent needs to drive someone to the airport, 20 km from home, to catch a plane. Plan A90: leave home 90 minutes before departure. Driving within speed limits will get us there in time provided that: the car does not break down, the car has enough gas, I do not get into an accident, there is no accident on a bridge, the plane does not leave early, ... You'll get there in time with 98% probability.

7 Joint probability distributions [The slide shows a table P(H,B,L,F,C) = Θ listing all 32 value combinations of H, B, L, F, C with their probabilities.] Let us assume five binary variables H, B, L, F, and C with values H ∈ {h1, h2}, B ∈ {b1, b2}, .... A joint probability distribution gives a probability for any combination of values, e.g. P(h2, b2, l1, f1, c2). The sum of the probabilities of all combinations is 1.

8 Joint probability distributions [The same 32-row joint probability table P(H,B,L,F,C) = Θ as on the previous slide.] With the joint probability distribution you can calculate any probability, e.g. P(h1 ∨ b2 | ¬f1 ∨ (f1 ∧ c1)) = P((h1 ∨ b2) ∧ (¬f1 ∨ (f1 ∧ c1))) / P(¬f1 ∨ (f1 ∧ c1)).
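
As a concrete sketch of what "any probability" means here: with the full joint table in hand, every marginal or conditional probability is just a ratio of sums over table entries. The table values below are made up, since the slide's actual numbers are not reproduced in this transcription.

```python
# A minimal sketch of using a joint probability table, with made-up probabilities
# (the slide's actual table values are not reproduced here).
import itertools, random

vals = {v: [v.lower() + '1', v.lower() + '2'] for v in 'HBLFC'}
random.seed(0)
weights = [random.random() for _ in range(32)]
total = sum(weights)
joint = {combo: w / total                      # entries sum to 1
         for combo, w in zip(itertools.product(*vals.values()), weights)}

def prob(event):
    """P(event), where event is a predicate over a full assignment (h, b, l, f, c)."""
    return sum(p for combo, p in joint.items() if event(*combo))

# Any marginal or conditional probability is a ratio of such sums, e.g. P(h1 | f1):
p_h1_given_f1 = prob(lambda h, b, l, f, c: h == 'h1' and f == 'f1') / \
                prob(lambda h, b, l, f, c: f == 'f1')
print(p_h1_given_f1)
```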

9 But it is exponential in size Tedious to specify (to say the least). Inference of simple things like P(h1) takes exponential time. Learning it from data needs an exponential amount of data. Well, maybe one could try some kind of locally weighted learning solution to this, but... And what about continuous distributions?

10 So we forget about joint probability distributions? Wrong! We try to make them scalable. Graphical models do just that.

11 Summary Probabilistic models offer a convenient way to represent knowledge as a joint probability distribution. Joint probability distributions support inferences whose outputs are probabilities useful for decision making. But using joint probability distribution tables does not scale. Probabilistic graphical models (PGMs) make them scalable. And PGMs can also be learnt from data.

12 Outline Motivation Representation Inference Learning Learning parameters Learning structure Misc

13 Let us see how independence helps [The slide shows a 16-row joint probability table P(A,B,C,D) over the binary variables A, B, C, D.] The joint probability table (16 parameters) on the left was actually produced as a product of four simple probability tables: P(A) = (0.2, 0.8), P(B) = (0.6, 0.4), P(C) = (0.3, 0.7), P(D) = (0.9, 0.1).

14 Global independence model (GI) [The same 16-row joint probability table P(A,B,C,D) as on the previous slide.] P(a1, b2, c1, d2) = P(a1)P(b2)P(c1)P(d2) = 0.2 · 0.4 · 0.3 · 0.1 = 0.0024. So we see that sometimes joint probability tables can be expressed compactly: 4 vs. 15 independent parameters. A little bit more calculation is needed to get the joint probabilities out, though.
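
A minimal sketch of the GI model as code, using the four marginals from the slide: any joint entry is recovered as a product of four stored numbers instead of being looked up in a 16-row table.

```python
# Sketch: the global independence (GI) model stores only the four marginals
# from the slide and recovers any joint entry as their product.
P_A = {'a1': 0.2, 'a2': 0.8}
P_B = {'b1': 0.6, 'b2': 0.4}
P_C = {'c1': 0.3, 'c2': 0.7}
P_D = {'d1': 0.9, 'd2': 0.1}

def joint(a, b, c, d):
    # P(a, b, c, d) = P(a) P(b) P(c) P(d) under full independence
    return P_A[a] * P_B[b] * P_C[c] * P_D[d]

print(joint('a1', 'b2', 'c1', 'd2'))  # 0.2 * 0.4 * 0.3 * 0.1 = 0.0024
```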

15 Factorizing joint probability P(B | A) = P(A, B) / P(A), i.e. P(A, B) = P(A)P(B | A). A P(A): a1 0.3, a2 0.7. A B P(B | A): a1 b1 0.2, a1 b2 0.8, a2 b1 0.7, a2 b2 0.3. The product table A B P(A, B): a1 b1 0.06, a1 b2 0.24, a2 b1 0.49, a2 b2 0.21. We did not save anything yet, but..

16 So when can we save space? Chain rule: P(A, B, C, D) = P(A)P(B | A)P(C | A, B)P(D | A, B, C). Now, if it so happens, for example, that P(D | A, B, C) = P(D | A), then we do save. That is conditional independence: D is independent of B and C given A, or D ⊥ {B, C} | A. 1+2+4+2 = 9 vs. 1+2+4+8 = 15 parameters (for binary variables).

17 Such independences can also be expressed graphically P(A, B, C, D) = P(A)P(B | A)P(C | A, B)P(D | A, B, C): a complete graph with 3+2+1 = 6 edges. P(A)P(B | A)P(C | A, B)P(D | A): a graph with 1+2+1 = 4 edges.

18 Causal relations yield such independences Bayesian network structure: P(H, B, L, F, C) = P(H)P(B | H)P(L | H)P(F | B, L)P(C | L).

19 Graphs encode (in)dependences Parents, children, ancestors, descendants. Variable A is dependent on B given a set Z if there is a path from A to B such that each node V on the path is of type: 1. a chain (→ V →, ← V ←) or a fork (← V →) AND V ∉ Z, or 2. a collider (→ V ←) AND (V ∈ Z or ∃W ∈ desc(V) : W ∈ Z). Otherwise A ⊥ B | Z. For the network above: C ⊥ B | L, but not C ⊥ B | {H, F}.
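
The path-based criterion above can be checked mechanically. The sketch below uses the equivalent moralized-ancestral-graph test (restrict to ancestors, marry co-parents, drop directions, remove Z, test connectivity) rather than enumerating paths; the network structure is taken from the factorization on slide 18, and the two examples on this slide are reproduced at the end.

```python
# A minimal d-separation check via the moralized ancestral graph criterion,
# for the example network H -> B, H -> L, B -> F, L -> F, L -> C.

def ancestors(parents, nodes):
    """All ancestors of the given nodes, including the nodes themselves."""
    result, stack = set(nodes), list(nodes)
    while stack:
        for p in parents.get(stack.pop(), []):
            if p not in result:
                result.add(p)
                stack.append(p)
    return result

def d_separated(parents, a, b, z):
    """True if a and b are d-separated given the set z."""
    keep = ancestors(parents, {a, b} | set(z))
    # Moralize: connect co-parents, drop edge directions.
    undirected = {v: set() for v in keep}
    for child in keep:
        ps = [p for p in parents.get(child, []) if p in keep]
        for p in ps:
            undirected[p].add(child); undirected[child].add(p)
        for i in range(len(ps)):
            for j in range(i + 1, len(ps)):
                undirected[ps[i]].add(ps[j]); undirected[ps[j]].add(ps[i])
    # Remove conditioning nodes and test connectivity between a and b.
    blocked = set(z)
    stack, seen = [a], {a}
    while stack:
        v = stack.pop()
        if v == b:
            return False
        for w in undirected[v] - blocked - seen:
            seen.add(w); stack.append(w)
    return True

parents = {'B': ['H'], 'L': ['H'], 'F': ['B', 'L'], 'C': ['L'], 'H': []}
print(d_separated(parents, 'C', 'B', {'L'}))       # True:  C is independent of B given L
print(d_separated(parents, 'C', 'B', {'H', 'F'}))  # False: conditioning on F opens B -> F <- L
```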

20 Causal models yield such joint probabilities Bayesian network (BN): P(h2, b2, l1, f1, c2) = P(h2)P(b2 | h2)P(l1 | h2)P(f1 | b2, l1)P(c2 | l1). [The slide again shows the resulting 32-row joint table P(H,B,L,F,C) = Θ.]

21 Graphs and probability distributions Can this joint probability distribution be represented with this graph? I.e., are there parameters for this structure that produce the joint probability distribution on the right? [The slide shows a candidate graph next to the 32-row joint table P(H,B,L,F,C) = Θ.]

22 Graphs host probability distributions What joint probability distributions can be generated by parametrizing this network? [The slide shows the network next to a 32-row table P(H,B,L,F,C) = Θ with all probabilities left as question marks.]

23 The misconception story Alice, Bob, Charles and Debbie are asked to do homework in pairs. For some reason only the pairs (Alice, Bob), (Bob, Charles), (Charles, Debbie), and (Debbie, Alice) ever meet. Now (A)lice and (C)harles cannot stand each other, and (B)ob and (D)ebbie just broke up. Anyway, the prof had again one of those weird lectures, leaving some suspicion in the students' minds about whether statement S is true (0) or not (1).

24 The misconception graph This network encodes the independences A ⊥ C | {B, D} and B ⊥ D | {A, C}. No Bayesian network can encode these (and only these) independences. You can try all 4-node Bayesian networks.

25 Parametrization of undirected graphs φ(A, B): a0 b0 30, a0 b1 5, a1 b0 1, a1 b1 10. φ(B, C): b0 c0 100, b0 c1 1, b1 c0 1, b1 c1 100. φ(C, D): c0 d0 1, c0 d1 100, c1 d0 100, c1 d1 1. φ(D, A): d0 a0 100, d0 a1 1, d1 a0 1, d1 a1 100. Alice and Bob generally agree, but more about S being true than false. Charles and Debbie just disagree, no matter whether S is true or not. [The slide also shows the resulting 16-row table of factor products φ and the normalized joint P(A,B,C,D).]
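
A small sketch of how the four factors above define a joint distribution: multiply the factor values for each full assignment and normalize by the partition function Z. The factor values are exactly the ones on the slide; the resulting table is the one whose numbers were lost in this transcription.

```python
# Sketch: turning the four pairwise factors from the slide into a joint
# distribution over (A, B, C, D) by multiplying them and normalizing.
import itertools

phi_AB = {('a0','b0'): 30, ('a0','b1'): 5, ('a1','b0'): 1, ('a1','b1'): 10}
phi_BC = {('b0','c0'): 100, ('b0','c1'): 1, ('b1','c0'): 1, ('b1','c1'): 100}
phi_CD = {('c0','d0'): 1, ('c0','d1'): 100, ('c1','d0'): 100, ('c1','d1'): 1}
phi_DA = {('d0','a0'): 100, ('d0','a1'): 1, ('d1','a0'): 1, ('d1','a1'): 100}

def unnormalized(a, b, c, d):
    return phi_AB[a, b] * phi_BC[b, c] * phi_CD[c, d] * phi_DA[d, a]

states = list(itertools.product(['a0','a1'], ['b0','b1'], ['c0','c1'], ['d0','d1']))
Z = sum(unnormalized(*s) for s in states)          # partition function
P = {s: unnormalized(*s) / Z for s in states}      # the joint distribution
print(P['a0', 'b0', 'c0', 'd1'])                   # e.g. one entry of the table
```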

26 Hierarchical Bayesian Models Plate notation for graphical models. Repetitive parts are boxed. The number of repetitions is marked in the lower corner. Observed nodes (in the data) are shaded. Like in this LDA model: there are M edges from α to θ_1, θ_2, ..., θ_M, and N edges from each θ_i to z_1, z_2, ..., z_N.

27 LDA plate expanded and M can be one million. and N tens of thousands.

28 Summary Sometimes a joint probability distribution can be composed from smaller parts. Causal scenarios produce such joint distributions. These joint distributions can be represented as Bayesian networks. Also non-causal dependencies may lead to saving space. Not all independence relations can be neatly presented as graphical models, for example context-specific independencies.

29 Why is the book so thick? Because there are many kinds of graphical models: directed, undirected, mixtures of these, factor graphs.

30 Outline Motivation Representation Inference Learning Learning parameters Learning structure Misc

31 Some notation first P(X = x | Y = y) is a probability: a real number between 0 and 1. X and Y may be vectors of random variables. P(X = x | Y = y) is often abbreviated as P(x | y).

32 Some notation first P(X = x | Y = y) is a probability: a real number between 0 and 1. X and Y may be vectors of random variables. P(X = x | Y = y) is often abbreviated as P(x | y). P(X | Y = y) (or P(X | y)) is a distribution: a set (vector) of real numbers P(X | y) = (P(x_1 | y), P(x_2 | y), ..., P(x_n | y)).

33 Some notation first P(X = x | Y = y) is a probability: a real number between 0 and 1. X and Y may be vectors of random variables. P(X = x | Y = y) is often abbreviated as P(x | y). P(X | Y = y) (or P(X | y)) is a distribution: a set (vector) of real numbers P(X | y) = (P(x_1 | y), P(x_2 | y), ..., P(x_n | y)). P(X | Y) is a set of distributions: P(X | Y) = (P(X | y_1), P(X | y_2), ..., P(X | y_m)).

34 Some notation first P(X = x | Y = y) is a probability: a real number between 0 and 1. X and Y may be vectors of random variables. P(X = x | Y = y) is often abbreviated as P(x | y). P(X | Y = y) (or P(X | y)) is a distribution: a set (vector) of real numbers P(X | y) = (P(x_1 | y), P(x_2 | y), ..., P(x_n | y)). P(X | Y) is a set of distributions: P(X | Y) = (P(X | y_1), P(X | y_2), ..., P(X | y_m)). We may also need P(x | Y), which is not a distribution, but P(x | Y) = (P(x | y_1), P(x | y_2), ..., P(x | y_m)).

35 Inference means calculating conditional distributions P(X | y) = P(X, y)/P(y) = P(X, y)/Σ_x P(x, y) ∝ P(X, y). Notice the (de)marginalization P(y) = Σ_x P(x, y). If there is no conditioning y, we condition with the empty evidence ∅ and write P(X) = P(X | ∅), where P(∅) = 1. Unconditional distributions are called marginal distributions. Notice that we restricted the inferences we attempt to make. Statisticians use the word inference for what we often call learning.

36 And why did we want to do inference? If we can estimate the probabilities of unknown things in light of given evidence, we can make principled decisions. Diagnosing what caused the problems and what else might be wrong. Recognizing the objects in a picture. Converting speech to text. Much of probabilistic modeling is about how to make probabilistic inference (in big models) efficiently.

37 Inference with the full table does not scale Problems stem from the exponential size of the table. Even simple things take long to compute in a big table. So could structure help here too, like it did with representation? Answer: it could (i.e. sometimes it will), like if the distribution can be presented with the GI model: P(A | B) = P(A)P(B)/P(B) = P(A), and these numbers are available in the GI model.

38 For some things structure clearly helps Some things are easily available: like P(H) and P(F | B). But some not so clearly: like P(C) and P(B | F).

39 How about a chain? In principle everything depends on everything, so it might turn out ugly. But for simple marginals it seems to help: P(X) = (0.4, 0.6).

40 How about a chain? In principle everything depends on everything, so it might turn out ugly. But for simple marginals it seems to help: P(X) = (0.4, 0.6). P(Y) = Σ_x P(Y, x) = Σ_x P(Y | x)P(x) = (0.84, 0.16).

41 How about a chain? In principle everything depends on everything, so it might turn out ugly. But for simple marginals it seems to help: P(X) = (0.4, 0.6). P(Y) = Σ_x P(Y, x) = Σ_x P(Y | x)P(x) = (0.84, 0.16). P(Z) = Σ_y P(Z | y)P(y) = (0.652, 0.348).

42 How about a chain? In principle everything depends on everything, so it might turn out ugly. But for simple marginals it seems to help: P(X) = (0.4, 0.6). P(Y) = Σ_x P(Y, x) = Σ_x P(Y | x)P(x) = (0.84, 0.16). P(Z) = Σ_y P(Z | y)P(y) = (0.652, 0.348). P(W) = Σ_z P(W | z)P(z) = (0.5348, 0.4652).
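
A sketch of this forward computation in code. The conditional tables for Y | X, Z | Y and W | Z are not given in the transcription; the values below are assumptions chosen so that the computed marginals match the numbers on the slides.

```python
# Sketch of the forward pass down the chain X -> Y -> Z -> W. The conditional
# tables below are assumptions chosen so that the computed marginals match the
# numbers on the slides (P(X) = (0.4, 0.6), P(Y) = (0.84, 0.16), ...).
P_X = [0.4, 0.6]
P_Y_given_X = [[0.9, 0.1],   # P(Y | x1)
               [0.8, 0.2]]   # P(Y | x2)
P_Z_given_Y = [[0.7, 0.3],
               [0.4, 0.6]]
P_W_given_Z = [[0.5, 0.5],
               [0.6, 0.4]]

def marginalize(prior, cpt):
    """P(child) = sum_parent P(child | parent) P(parent)."""
    return [sum(cpt[p][c] * prior[p] for p in range(len(prior)))
            for c in range(len(cpt[0]))]

P_Y = marginalize(P_X, P_Y_given_X)   # [0.84, 0.16]
P_Z = marginalize(P_Y, P_Z_given_Y)   # [0.652, 0.348]
P_W = marginalize(P_Z, P_W_given_Z)   # [0.5348, 0.4652]
print(P_Y, P_Z, P_W)
```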

43 But multiply connected networks do not work Calculating marginals also works for trees and even for singly connected networks (at most one path between any pair of nodes), but.. it does not work for multiply connected graphs. Assuming C has parents A and B (A → C ← B): P(C) = Σ_{a,b} P(a, b, C) = Σ_{a,b} P(C | a, b)P(a, b). P(a, b) is not available, so let us give up for now.

44 Conditioning down the chain Let's try the old strategy:

45 Conditioning down the chain Let's try the old strategy: P(Y | x1) = (0.9, 0.1).

46 Conditioning down the chain Let's try the old strategy: P(Y | x1) = (0.9, 0.1). P(Z | x1) = Σ_y P(Z, y | x1) = Σ_y P(Z | y, x1)P(y | x1) = Σ_y P(Z | y)P(y | x1) = (0.67, 0.33).

47 Conditioning down the chain Let's try the old strategy: P(Y | x1) = (0.9, 0.1). P(Z | x1) = Σ_y P(Z, y | x1) = Σ_y P(Z | y, x1)P(y | x1) = Σ_y P(Z | y)P(y | x1) = (0.67, 0.33). P(W | x1) = Σ_z P(W | z)P(z | x1) = (0.533, 0.467).

48 Conditioning up the chain Let's try the old strategy: P(w1) can be computed quickly since we know how to get marginals. Then we could try to get P(Z | w1), P(Y | w1), and P(X | w1). But, notice: since we know how to calculate the marginal probabilities P(A) and P(B), we can always calculate P(A | B) from P(B | A), since P(A | B) = P(B | A)P(A)/P(B).

49 Conditioning up the chain So let us first calculate the marginals and then try, in order, P(w1 | Z), P(w1 | Y), and P(w1 | X), and then we can invert them. P(w1 | Z) = (0.5, 0.6).

50 Conditioning up the chain So let us first calculate the marginals and then try, in order, P(w1 | Z), P(w1 | Y), and P(w1 | X), and then we can invert them. P(w1 | Z) = (0.5, 0.6). P(w1 | Y) = Σ_z P(w1, z | Y) = Σ_z P(w1 | z)P(z | Y).

51 Conditioning up the chain So let us first calculate the marginals and then try, in order, P(w1 | Z), P(w1 | Y), and P(w1 | X), and then we can invert them. P(w1 | Z) = (0.5, 0.6). P(w1 | Y) = Σ_z P(w1, z | Y) = Σ_z P(w1 | z)P(z | Y). P(w1 | X) = Σ_y P(w1 | y)P(y | X). So we can compute P(Z | w1), P(Y | w1), and P(X | w1).
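
A sketch of the upward pass and the Bayes inversion, reusing the same assumed conditional tables as in the forward-pass sketch above (re-declared here so the snippet is self-contained and runnable on its own).

```python
# Sketch: passing evidence w1 up the chain and inverting with Bayes' rule,
# using the same assumed tables as in the forward-pass sketch above.
P_X = [0.4, 0.6]
P_Y_given_X = [[0.9, 0.1], [0.8, 0.2]]
P_Z_given_Y = [[0.7, 0.3], [0.4, 0.6]]
P_W_given_Z = [[0.5, 0.5], [0.6, 0.4]]

def marginalize(prior, cpt):
    return [sum(cpt[p][c] * prior[p] for p in range(len(prior)))
            for c in range(len(cpt[0]))]

# Upward messages: P(w1 | Z), P(w1 | Y), P(w1 | X).
P_w1_given_Z = [P_W_given_Z[z][0] for z in range(2)]                  # (0.5, 0.6)
P_w1_given_Y = [sum(P_w1_given_Z[z] * P_Z_given_Y[y][z] for z in range(2)) for y in range(2)]
P_w1_given_X = [sum(P_w1_given_Y[y] * P_Y_given_X[x][y] for y in range(2)) for x in range(2)]

# Invert with Bayes' rule: P(X | w1) = P(w1 | X) P(X) / P(w1).
P_w1 = marginalize(marginalize(marginalize(P_X, P_Y_given_X), P_Z_given_Y), P_W_given_Z)[0]
P_X_given_w1 = [P_w1_given_X[x] * P_X[x] / P_w1 for x in range(2)]
print(P_X_given_w1)
```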

52 Conditioning up 'n' down the chain P(Y | w, a) = P(Y, w | a)/P(w | a) ∝ P(Y | a)P(w | Y). So P(Y | w, a) is a product of terms we calculated by conditioning from up and from down. And then you just need to show that it works for singly connected networks too, and that it does not work for multiply connected networks.

53 The essence of the general algorithm for singly connected networks Assume a courier service in a singly connected network between agents A to F, with asymmetric transfer costs. Sending from A down to C costs $6, and from C up to A costs $3. The receiver pays. If some agents have money, how much would agent X get if it were all transferred to her?

54 DP in a polytree Assume B has $100, E has $200, and the rest have nothing. How much would C get if the money were transferred to him? $-6 from A, $97 from B and $188 via D = $279. How much would D get if the money were transferred to her? But now we notice that the earlier calculations can be reused. One message up and one down is enough to compute the money for all.

55 How about the general graph? NP-hard, but clever tricks get you pretty far. Cutset conditioning: try to find a variable set C that breaks the network into singly connected components; do inference separately for each value of C; combine the results of the inferences (by weighting them appropriately). Joint tree propagation: you merge variables into vector-valued variables (cliques) so that you can build a singly connected graph from them, but then one variable can be part of many cliques and you need the so-called joint-tree propagation algorithm to ensure that the clique probabilities are consistent with each other.

56 Approximate inference So how about giving up on calculating exact probabilities and trying to get good approximate probabilities? This is the only way with complex models anyway. Idea for P(x | e) (direct sampling): generate vectors from a BN, count how many times x and e happen, and take P(x | e) ≈ N_{x,e} / N_e, where x and e can be complicated things. Works for any network. So can we generate vectors from a BN?

57 Direct sampling 1. P(Cloudy) = (0.5, 0.5) → Cloudy = true.

58 Direct sampling 1. P(Cloudy) = (0.5, 0.5) → Cloudy = true. 2. P(S | C = t) = (0.1, 0.9) → Sprinkler = false.

59 Direct sampling 1. P(Cloudy) = (0.5, 0.5) → Cloudy = true. 2. P(S | C = t) = (0.1, 0.9) → Sprinkler = false. 3. P(R | C = t) = (0.8, 0.2) → Rain = true.

60 Direct sampling 1. P(Cloudy) = (0.5, 0.5) → Cloudy = true. 2. P(S | C = t) = (0.1, 0.9) → Sprinkler = false. 3. P(R | C = t) = (0.8, 0.2) → Rain = true. 4. P(W | S = f, R = t) = (0.9, 0.1) → WetGrass = true.

61 Direct sampling 1. P(Cloudy) = (0.5, 0.5) → Cloudy = true. 2. P(S | C = t) = (0.1, 0.9) → Sprinkler = false. 3. P(R | C = t) = (0.8, 0.2) → Rain = true. 4. P(W | S = f, R = t) = (0.9, 0.1) → WetGrass = true. P(C, S, R, W) = P(t, f, t, t) = 0.5 · 0.9 · 0.8 · 0.9 = 0.324.
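
A sketch of direct (ancestral) sampling for this network, also using the samples to estimate a conditional probability by counting, as proposed on slide 56. Only some CPT entries appear on the slides; the remaining entries below follow the standard Russell & Norvig sprinkler example and should be read as assumptions.

```python
# Sketch of direct (ancestral) sampling from the sprinkler network. Only some
# CPT entries appear on the slides; the remaining ones below follow the standard
# Russell & Norvig example and should be read as assumptions.
import random
random.seed(0)

P_cloudy = 0.5
P_sprinkler = {True: 0.1, False: 0.5}               # P(S = t | Cloudy)
P_rain = {True: 0.8, False: 0.2}                    # P(R = t | Cloudy)
P_wet = {(True, True): 0.99, (True, False): 0.9,    # P(W = t | Sprinkler, Rain)
         (False, True): 0.9, (False, False): 0.0}

def sample():
    c = random.random() < P_cloudy
    s = random.random() < P_sprinkler[c]
    r = random.random() < P_rain[c]
    w = random.random() < P_wet[s, r]
    return c, s, r, w

# Estimate P(Rain = t | WetGrass = t) by counting, as in the direct-sampling idea.
samples = [sample() for _ in range(100_000)]
n_w = sum(1 for c, s, r, w in samples if w)
n_rw = sum(1 for c, s, r, w in samples if r and w)
print(n_rw / n_w)
```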

62 Problems with direct sampling The evidence e in P(x | e) can be really rare, so we may need to generate very many vectors before we get even one with e. The next attempt is to only generate vectors in which e is present. And when that fails we do Gibbs sampling, and that's another long story.

63 Summary Inference is important. Inference is (NP) hard and kind of complicated. In many models we need to resort to approximate inference.

64 Why is the book so thick? Many different representations. Inference has dimensions too. Exact inference vs. approximate inference. Multiple methods in both categories. Actually there are other inference tasks as well.

65 Outline Motivation Representation Inference Learning Learning parameters Learning structure Misc

66 Learning parameters Much of statistics does this - estimation. Also common in ML when building models for specific purpose. Naive Bayes classifier / Finite mixture model Hidden Markov Model (Kalman filter) Undirected grid (Markov Random Field)

67 Some popular models Naive Bayes: a common classifier, used in spam filtering. HMM: a temporal model, used in speech recognition. Pairwise MRF: an undirected model used in spatial models and images.

68 Data [The slide shows a small tabular data set with columns X_1, X_2, ....] Traditionally tabular, with X_i having discrete values (0, 1, ..., r_i − 1). Continuous values are sometimes discretized to achieve this. Lately there is more interest in more versatile data formats: documents of different lengths, images of different sizes, relational data.

69 Full tables hard, but GI easy [The same data table as on the previous slide.] The full table needs at least ∏_i r_i data vectors for observing each possible data vector once. The GI model needs at least max_i r_i data vectors. And estimating the parameter θ_ik that determines the probability P(X_i = x_k | θ_ik) = θ_ik is simply based on counting the relative frequency of the value x_ik. For example, the relative frequency of the event X_3 = 2 is 2/12. Indeed, the maximum likelihood estimate is simply the relative frequency: θ̂_32 = 2/12.

70 Bayesian learning of parameters In Bayesian learning you do not learn a parameter value; you assign a probability (density) to each possible value of each parameter. After observing (X_3 = 2) 2 out of 12 times you might, for example, say that θ_32 is three times more likely to equal 2/12 than some other particular value. In general, the probability of θ_ik is based partly on the relative frequency N_ik/N, but we first assume a priori that we have seen the value x_ik α_ik times and then state that P(θ_ik | N_ik, α_ik) ∝ θ_ik^(N_ik + α_ik − 1).

71 Dirichlet distribution P(θ_ik | N_ik, α_ik) ∝ θ_ik^(N_ik + α_ik − 1) is called a Dirichlet distribution with parameters N_i + α_i. The most probable parameter value is θ̂_ik = (N_ik + α_ik − 1) / (Σ_k' (N_ik' + α_ik') − K). So if we imagine we have seen 1 of each value beforehand, the most probable parameter value coincides with the ML estimate (relative frequency). Indeed, Dir(1, 1, 1, 1, 1) is a uniform distribution. Is it desirable that the most probable parameter value coincides with the relative frequency? The expected parameter value is θ̄_ik = (N_ik + α_ik) / Σ_k' (N_ik' + α_ik').
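
A small numerical sketch of these two estimates. The counts below are hypothetical (chosen so that X_3 = 2 is observed 2 times out of 12, with three possible values assumed); with α_ik = 1 the posterior mode reproduces the relative frequencies, while the posterior mean pulls them slightly toward uniform.

```python
# Sketch: mode and mean of the Dirichlet posterior for one variable, as in the
# slide's example where X_3 = 2 was observed 2 times out of 12 (three values assumed).
counts = [6, 2, 4]            # N_3k: hypothetical counts summing to 12
alpha  = [1, 1, 1]            # prior "imagined" counts, Dir(1, 1, 1) = uniform prior

K = len(counts)
posterior = [n + a for n, a in zip(counts, alpha)]

mode = [(p - 1) / (sum(posterior) - K) for p in posterior]   # most probable value
mean = [p / sum(posterior) for p in posterior]               # expected value

print(mode)   # with alpha = 1 the mode equals the relative frequencies, e.g. 2/12
print(mean)   # the mean pulls estimates slightly toward uniform
```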

72 BNs not much more difficult The parameters θ_ijk in a Bayesian network G determine the conditional probabilities P(X_i = x_k | Π_i = π_j, G) = θ_ijk, where Π_i is the vector of random variables consisting of the parents of X_i, and π_j is the j-th value combination of those parents (enumerated somehow). To estimate θ_ijk we can again just count relative frequencies of the event (X_i = x_k, Π_i = π_j). The maximum likelihood parameter is simply this relative frequency: θ̂_ijk = N_ijk / N_ij, where N_ij counts how many times Π_i has the value π_j.
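
A sketch of this counting for one child variable with a single parent, on a tiny hypothetical data set; the variable names and data values are made up for illustration.

```python
# Sketch: maximum likelihood parameters of a BN are just relative frequencies
# N_ijk / N_ij, here for one child X with a single parent Pa on hypothetical data.
from collections import Counter

data = [('p1', 'x1'), ('p1', 'x2'), ('p1', 'x1'), ('p2', 'x2'),
        ('p2', 'x2'), ('p1', 'x1'), ('p2', 'x1'), ('p2', 'x2')]

N_ij  = Counter(pa for pa, x in data)          # counts of each parent configuration
N_ijk = Counter(data)                          # joint counts (parent config, child value)

theta = {(pa, x): N_ijk[pa, x] / N_ij[pa] for pa, x in N_ijk}
print(theta)   # e.g. P(x1 | p1) = 3/4, P(x2 | p2) = 3/4
```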

73 Zero support problem The ML parameter is θ̂_ijk = N_ijk / N_ij, but what if N_ij = 0? Then the ML parameters are not unique: you can have any θ_ijk as long as Σ_k θ_ijk = 1. Having N_ij = 0 is common. Bayesians often use expected parameter values θ̄_ijk (with prior counts α_ijk = 1). Thus learning parameters for BNs is almost trivial: just counting the events is needed.

74 How about undirected models? Parameter learning for undirected models is much more difficult.

75 How about missing data? [The slide shows a data table with missing entries marked by question marks.] Full Bayesian handling runs into trouble. Usually one searches for maximum likelihood or maximum a posteriori parameters: the family of EM algorithms, or different kinds of imputation mechanisms that reduce the situation to many complete-data cases. Another chapter of the book. Latent data is a kind of missing data.

76 Summary Parameter learning is a common task for fixed structures. Parameter learning for BNs from complete data easy. Parameter learning for undirected graphs hard. Parameter learning from missing data involved.

77 Why is the book so thick? Many different representations Inference Parameter learning ML vs. Bayesian Complete vs. missing data

78 Outline Motivation Representation Inference Learning Learning parameters Learning structure Misc

79 Learning Structure Valuable data mining tool Score based learning: Space of candidate structures, Scoring function to measure goodness of a structure, Search algorithm to find a good structure. Also independence test based learning methods are used. There are results that link these two methods.

80 Data in, structure out: an example

81 How many possible BN structures? [The slide shows a table of n versus the number of Bayesian network structures with n nodes.] b(n) = 1 if n = 0, and b(n) = Σ_{k=1}^{n} (−1)^{k+1} (n choose k) 2^{k(n−k)} b(n−k) if n > 0.
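
The recursion can be evaluated directly; a short sketch (the function name b mirrors the slide's notation):

```python
# Sketch: the recursion for the number of Bayesian network structures
# (labelled DAGs) on n nodes, as given on the slide.
from functools import lru_cache
from math import comb

@lru_cache(maxsize=None)
def b(n):
    if n == 0:
        return 1
    return sum((-1) ** (k + 1) * comb(n, k) * 2 ** (k * (n - k)) * b(n - k)
               for k in range(1, n + 1))

for n in range(1, 6):
    print(n, b(n))   # 1, 3, 25, 543, 29281 -- super-exponential growth
```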

82 Marginal likelihood A Bayesian would choose the most probable structure using P(G | D) = P(D | G)P(G)/P(D) ∝ P(D | G)P(G). P(G) is often assumed uniform, so the score is the marginal likelihood P(D | G) = ∏_{j=1}^{N} P(d_j | d_1, ..., d_{j−1}, G). Thus the structure with the best prediction record for the data gets selected.

83 Marginal likelihood formula P(D | G) = ∫ P(D | θ, G)P(θ | G) dθ. For complete data, the marginal likelihood has a closed form: P(D | G, α) = ∏_{i=1}^{n} ∏_{j=1}^{q_i} [ Γ(Σ_{k=1}^{r_i} α_ijk) / Γ(Σ_{k=1}^{r_i} (α_ijk + N_ijk)) · ∏_{k=1}^{r_i} Γ(α_ijk + N_ijk) / Γ(α_ijk) ]. Notice that the score decomposes by the structure: log P(D | G) = Σ_{i=1}^{n} log P(D_i | D_{Π_i}, G).
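
A sketch of evaluating one family's contribution to this closed form in log space (to avoid overflow), with hypothetical counts N_ijk and uniform prior counts α_ijk = 1:

```python
# Sketch: the closed-form (log) marginal likelihood of one family X_i with
# parent configurations j = 1..q_i, evaluated with hypothetical counts N_ijk
# and prior counts alpha_ijk = 1.
from math import lgamma

N     = [[6, 2], [1, 5]]     # N_ijk: rows = parent configs, cols = values of X_i
alpha = [[1, 1], [1, 1]]     # alpha_ijk

def family_log_marginal_likelihood(N, alpha):
    total = 0.0
    for N_j, a_j in zip(N, alpha):
        total += lgamma(sum(a_j)) - lgamma(sum(a_j) + sum(N_j))
        total += sum(lgamma(a + n) - lgamma(a) for a, n in zip(a_j, N_j))
    return total

# log P(D | G) decomposes into a sum of such family terms over i = 1..n.
print(family_log_marginal_likelihood(N, alpha))
```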

84 Other popular criteria Penalized maximum likelihood scores: S(G, D) = log P(D | θ̂(D, G)) − penalty. Akaike Information Criterion (AIC): penalty = Σ_{i=1}^{n} Σ_{j=1}^{q_i} (r_i − 1), i.e., the number of free parameters in the model. Bayesian Information Criterion (BIC): penalty = (number of free parameters) / 2 · log N. AIC and BIC also decompose by structure, but NML, with penalty = log Σ_D P(D | θ̂(D, G)), does not decompose (and is not popular).

85 Equivalence of structures A two-node BN A → B allows exactly the same joint distributions as the BN B → A. So should they be equally good for all data sets D? In general, any two BNs with the same skeleton and the same v-structures are equivalent in the sense that they allow the same joint distributions. Trivia: there are about 4 times more Bayesian network structures than their equivalence classes. All the scores mentioned give equal scores to equivalent structures. Marginal likelihood requires α to be set carefully though (Buntine 91).

86 Searching for the structure NP-hard, but the best tree(s) can be found in quadratic time. Local search: 1. start with a best tree. 2. add, delete or reverse a random arc (beware of cycles). 3. if no improvement, do not accept the change. 4. back to step 2 or 1. Decomposability helps, because a local change means that only one or two of the terms in the score need to be re-evaluated. It also helps enough that it is possible to find the best network for about 30 variables for sure.

87 Summary Structure learning for BNs is NP-hard, thus heuristic search is usually used. Many different scores, which are decomposable, allowing efficient score evaluation during local search. Marginal likelihood can be interpreted as sequential prediction.

88 Why is the book so thick? Many different representations Inference (exact and approximate) Parameter learning (complete and missing data) Structure learning Many criteria. Heuristic search is an endless topic.

89 Outline Motivation Representation Inference Learning Learning parameters Learning structure Misc

90 Similarity? Two big ML themes: graphical models, kernel based methods. But there are connections: Many graphical models try to capture similarity. Finite mixture models for clustering. Topic models for documents. Let us look at Fisher kernels for Bayesian networks

91 Hamming distance example [The slide shows a target bit vector t and two candidate bit vectors x and y.] Is sim(x, t) > sim(y, t)? But what if I tell you that the first four bits are always the same? Dependency can be taken into account using a graphical model.

92 Fisher kernel The graphical model G and parameters θ define the probability P(d | θ, G) of a data vector d. Vectors x and y are similar if changing the parameters θ a little bit changes the probabilities of x and y in a similar way. We compare the gradient vectors U_x = ∇_θ log P(x | θ) and U_y = ∇_θ log P(y | θ) by taking the inner product K(x, y) = U_x^T I^{−1} U_y, where I is the Fisher information matrix. As an approximation, I is often dropped.

93 Fisher kernel for BNs For BNs the Fisher kernel (with I) can be computed relatively easily. It decomposes as K(x, y) = Σ_i K_i(x, y), where K_i(x, y) = 0 if pa_i(x) ≠ pa_i(y); K_i(x, y) = −1/P(pa_i(x)) if pa_i(x) = pa_i(y) and x_i ≠ y_i; K_i(x, y) = (1 − P(x_i | pa_i(x))) / P(x_i, pa_i(x)) if pa_i(x) = pa_i(y) and x_i = y_i. The terms P(pa_i(x)) and P(x_i, pa_i(x)) are available in the joint tree.
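
A tentative sketch of one term K_i(x, y), following the piecewise formula as reconstructed above (the minus sign in the mismatch case is part of that reconstruction); the probabilities passed in are hypothetical and would in practice come from inference in the network.

```python
# Sketch: one per-variable term K_i(x, y) of the Fisher kernel, following the
# reconstructed piecewise formula. p_pa = P(pa_i(x)) and p_cond = P(x_i | pa_i(x))
# would come from (joint-tree) inference in the Bayesian network.
def K_i(x_i, y_i, pa_x, pa_y, p_pa, p_cond):
    if pa_x != pa_y:
        return 0.0
    if x_i != y_i:
        return -1.0 / p_pa
    return (1.0 - p_cond) / (p_cond * p_pa)     # = (1 - P(x_i | pa)) / P(x_i, pa)

# Example call with hypothetical probabilities:
print(K_i('x1', 'x1', 'pa1', 'pa1', p_pa=0.25, p_cond=0.4))
```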

94 Summary The formalism of probabilistic graphical models provides a unifying framework for capturing complex dependencies among random variables, and building large-scale multivariate statistical models. They do so by allowing structured representations, facilitating efficient inference, and being amenable to learning from data.

95 Thank You!
