Introduction to Probabilistic Graphical Models

1 Introduction to Probabilistic Graphical Models Tomi Silander School of Computing National University of Singapore June 13, 2011

2 What is a probabilistic graphical model? The formalism of probabilistic graphical models provides a unifying framework for capturing complex dependencies among random variables, and building large-scale multivariate statistical models. Martin J. Wainwright and Michael I. Jordan in Graphical Models, Exponential Families, and Variational Inference (2008).

3 Outline Motivation Representation Inference Learning Learning parameters Learning structure Misc

4 Why probabilities? As an output they allow decision making. As an input they solve the qualification problem. They can be learnt from data. Principled way to combine data with expert knowledge. Combining evidence from different sources.

5 Decision making Scenario: a patient comes to the doctor with severe respiratory distress. All kinds of tests including X-ray and blood tests are taken and fed to a classifier that comes up with the diagnosis: the patient most probably has pneumonia. ... the patient has pneumonia with 65% probability. ... the patient has pneumonia with 65% and AIDS with 30% probability (a 100-fold risk). Very often you want probabilities as a result. A 50%/50% result does not necessarily mean that you do not know what to do!

6 Qualification problem Example by Russell & Norvig (Artificial Intelligence): an agent needs to drive someone to the airport, 20 km from home, to catch a plane. Plan A90: leave home 90 minutes before departure. Driving within speed limits will get us there in time provided that: the car does not break down, the car has enough gas, I do not get into an accident, there is no accident on a bridge, the plane does not leave early, ... You'll get there in time with 98% probability.

7 Joint probability distributions [The slide shows a table P(H,B,L,F,C) = Θ listing all 32 value combinations of H, B, L, F, C with their probabilities.] Let us assume five binary variables H, B, L, F, and C with values H ∈ {h1, h2}, B ∈ {b1, b2}, .... A joint probability distribution gives a probability for any combination of values, e.g. P(h2, b2, l1, f1, c2). The sum of the probabilities of all combinations is 1.

8 Joint probability distributions [The same 32-row joint probability table P(H,B,L,F,C) = Θ as on the previous slide.] With the joint probability distribution you can calculate any probability, e.g. P(h1 ∨ b2 | ¬f1 ∨ (f1 ∧ c1)) = P((h1 ∨ b2) ∧ (¬f1 ∨ (f1 ∧ c1))) / P(¬f1 ∨ (f1 ∧ c1)).
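
As a concrete sketch of what "any probability" means here: with the full joint table in hand, every marginal or conditional probability is just a ratio of sums over table entries. The table values below are made up, since the slide's actual numbers are not reproduced in this transcription.

```python
# A minimal sketch of using a joint probability table, with made-up probabilities
# (the slide's actual table values are not reproduced here).
import itertools, random

vals = {v: [v.lower() + '1', v.lower() + '2'] for v in 'HBLFC'}
random.seed(0)
weights = [random.random() for _ in range(32)]
total = sum(weights)
joint = {combo: w / total                      # entries sum to 1
         for combo, w in zip(itertools.product(*vals.values()), weights)}

def prob(event):
    """P(event), where event is a predicate over a full assignment (h, b, l, f, c)."""
    return sum(p for combo, p in joint.items() if event(*combo))

# Any marginal or conditional probability is a ratio of such sums, e.g. P(h1 | f1):
p_h1_given_f1 = prob(lambda h, b, l, f, c: h == 'h1' and f == 'f1') / \
                prob(lambda h, b, l, f, c: f == 'f1')
print(p_h1_given_f1)
```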

9 But it is exponential in size Tedious to specify (to say the least). Inference of simple things like P(h1) takes exponential time. Learning it from data needs an exponential amount of data. Well, maybe one could try some kind of locally weighted learning solution to this, but... And what about continuous distributions?

10 So we forget about joint probability distributions? Wrong! We try to make them scalable. Graphical models do just that.

11 Summary Probabilistic models offer a convenient way to represent knowledge as a joint probability distribution. Joint probability distributions support inferences whose outputs are probabilities useful for decision making. But using joint probability distribution tables does not scale. Probabilistic graphical models (PGMs) make them scalable. And PGMs can also be learnt from data.

12 Outline Motivation Representation Inference Learning Learning parameters Learning structure Misc

13 Let us see how independence helps [The slide shows a 16-row joint probability table P(A,B,C,D) over the binary variables A, B, C, D.] The joint probability table (16 parameters) on the left was actually produced as a product of four simple probability tables: P(A) = (0.2, 0.8), P(B) = (0.6, 0.4), P(C) = (0.3, 0.7), P(D) = (0.9, 0.1).

14 Global independence model (GI) [The same 16-row joint probability table P(A,B,C,D) as on the previous slide.] P(a1, b2, c1, d2) = P(a1)P(b2)P(c1)P(d2) = 0.2 · 0.4 · 0.3 · 0.1 = 0.0024. So we see that sometimes joint probability tables can be expressed compactly: 4 vs. 15 independent parameters. A little bit more calculation is needed to get the joint probabilities out, though.
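
A minimal sketch of the GI model as code, using the four marginals from the slide: any joint entry is recovered as a product of four stored numbers instead of being looked up in a 16-row table.

```python
# Sketch: the global independence (GI) model stores only the four marginals
# from the slide and recovers any joint entry as their product.
P_A = {'a1': 0.2, 'a2': 0.8}
P_B = {'b1': 0.6, 'b2': 0.4}
P_C = {'c1': 0.3, 'c2': 0.7}
P_D = {'d1': 0.9, 'd2': 0.1}

def joint(a, b, c, d):
    # P(a, b, c, d) = P(a) P(b) P(c) P(d) under full independence
    return P_A[a] * P_B[b] * P_C[c] * P_D[d]

print(joint('a1', 'b2', 'c1', 'd2'))  # 0.2 * 0.4 * 0.3 * 0.1 = 0.0024
```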

15 Factorizing joint probability P(B | A) = P(A, B) / P(A), i.e. P(A, B) = P(A)P(B | A). A P(A): a1 0.3, a2 0.7. A B P(B | A): a1 b1 0.2, a1 b2 0.8, a2 b1 0.7, a2 b2 0.3. The product table A B P(A, B): a1 b1 0.06, a1 b2 0.24, a2 b1 0.49, a2 b2 0.21. We did not save anything yet, but..

16 So when can we save space? Chain rule: P(A, B, C, D) = P(A)P(B | A)P(C | A, B)P(D | A, B, C). Now, if it so happens, for example, that P(D | A, B, C) = P(D | A), then we do save. That is conditional independence: D is independent of B and C given A, or D ⊥ {B, C} | A. 1+2+4+2 = 9 vs. 1+2+4+8 = 15 parameters (for binary variables).

17 Such independences can also be expressed graphically P(A, B, C, D) = P(A)P(B | A)P(C | A, B)P(D | A, B, C): a complete graph with 3+2+1 = 6 edges. P(A)P(B | A)P(C | A, B)P(D | A): a graph with 1+2+1 = 4 edges.

18 Causal relations yield such independences Bayesian network structure: P(H, B, L, F, C) = P(H)P(B | H)P(L | H)P(F | B, L)P(C | L).

19 Graphs encode (in)dependences Parents, children, ancestors, descendants. Variable A is dependent on B given a set Z if there is a path from A to B such that each node V on the path is of type: 1. a chain (→ V →, ← V ←) or a fork (← V →) AND V ∉ Z, or 2. a collider (→ V ←) AND (V ∈ Z or ∃W ∈ desc(V) : W ∈ Z). Otherwise A ⊥ B | Z. For the network above: C ⊥ B | L, but not C ⊥ B | {H, F}.
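
The path-based criterion above can be checked mechanically. The sketch below uses the equivalent moralized-ancestral-graph test (restrict to ancestors, marry co-parents, drop directions, remove Z, test connectivity) rather than enumerating paths; the network structure is taken from the factorization on slide 18, and the two examples on this slide are reproduced at the end.

```python
# A minimal d-separation check via the moralized ancestral graph criterion,
# for the example network H -> B, H -> L, B -> F, L -> F, L -> C.

def ancestors(parents, nodes):
    """All ancestors of the given nodes, including the nodes themselves."""
    result, stack = set(nodes), list(nodes)
    while stack:
        for p in parents.get(stack.pop(), []):
            if p not in result:
                result.add(p)
                stack.append(p)
    return result

def d_separated(parents, a, b, z):
    """True if a and b are d-separated given the set z."""
    keep = ancestors(parents, {a, b} | set(z))
    # Moralize: connect co-parents, drop edge directions.
    undirected = {v: set() for v in keep}
    for child in keep:
        ps = [p for p in parents.get(child, []) if p in keep]
        for p in ps:
            undirected[p].add(child); undirected[child].add(p)
        for i in range(len(ps)):
            for j in range(i + 1, len(ps)):
                undirected[ps[i]].add(ps[j]); undirected[ps[j]].add(ps[i])
    # Remove conditioning nodes and test connectivity between a and b.
    blocked = set(z)
    stack, seen = [a], {a}
    while stack:
        v = stack.pop()
        if v == b:
            return False
        for w in undirected[v] - blocked - seen:
            seen.add(w); stack.append(w)
    return True

parents = {'B': ['H'], 'L': ['H'], 'F': ['B', 'L'], 'C': ['L'], 'H': []}
print(d_separated(parents, 'C', 'B', {'L'}))       # True:  C is independent of B given L
print(d_separated(parents, 'C', 'B', {'H', 'F'}))  # False: conditioning on F opens B -> F <- L
```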

20 Causal models yield such joint probabilities Bayesian network (BN): P(h2, b2, l1, f1, c2) = P(h2)P(b2 | h2)P(l1 | h2)P(f1 | b2, l1)P(c2 | l1). [The slide again shows the resulting 32-row joint table P(H,B,L,F,C) = Θ.]

21 Graphs and probability distributions Can this joint probability distribution be represented with this graph? I.e., are there parameters for this structure that produce the joint probability distribution on the right? [The slide shows a candidate graph next to the 32-row joint table P(H,B,L,F,C) = Θ.]

22 Graphs host probability distributions What joint probability distributions can be generated by parametrizing this network? [The slide shows the network next to a 32-row table P(H,B,L,F,C) = Θ with all probabilities left as question marks.]

23 The misconception story Alice, Bob, Charles and Debbie are asked to do homework in pairs. For some reason only the pairs (Alice, Bob), (Bob, Charles), (Charles, Debbie), and (Debbie, Alice) ever meet. Now (A)lice and (C)harles cannot stand each other, and (B)ob and (D)ebbie just broke up. Anyway, the prof had again one of those weird lectures, leaving some suspicion in the students' minds about whether statement S is true (0) or not (1).

24 The misconception graph This network encodes the independences A ⊥ C | {B, D} and B ⊥ D | {A, C}. No Bayesian network can encode these (and only these) independences. You can try all 4-node Bayesian networks.

25 Parametrization of undirected graphs φ(A, B): a0 b0 30, a0 b1 5, a1 b0 1, a1 b1 10. φ(B, C): b0 c0 100, b0 c1 1, b1 c0 1, b1 c1 100. φ(C, D): c0 d0 1, c0 d1 100, c1 d0 100, c1 d1 1. φ(D, A): d0 a0 100, d0 a1 1, d1 a0 1, d1 a1 100. Alice and Bob generally agree, but more about S being true than false. Charles and Debbie just disagree, no matter whether S is true or not. [The slide also shows the resulting 16-row table of factor products φ and the normalized joint P(A,B,C,D).]
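
A small sketch of how the four factors above define a joint distribution: multiply the factor values for each full assignment and normalize by the partition function Z. The factor values are exactly the ones on the slide; the resulting table is the one whose numbers were lost in this transcription.

```python
# Sketch: turning the four pairwise factors from the slide into a joint
# distribution over (A, B, C, D) by multiplying them and normalizing.
import itertools

phi_AB = {('a0','b0'): 30, ('a0','b1'): 5, ('a1','b0'): 1, ('a1','b1'): 10}
phi_BC = {('b0','c0'): 100, ('b0','c1'): 1, ('b1','c0'): 1, ('b1','c1'): 100}
phi_CD = {('c0','d0'): 1, ('c0','d1'): 100, ('c1','d0'): 100, ('c1','d1'): 1}
phi_DA = {('d0','a0'): 100, ('d0','a1'): 1, ('d1','a0'): 1, ('d1','a1'): 100}

def unnormalized(a, b, c, d):
    return phi_AB[a, b] * phi_BC[b, c] * phi_CD[c, d] * phi_DA[d, a]

states = list(itertools.product(['a0','a1'], ['b0','b1'], ['c0','c1'], ['d0','d1']))
Z = sum(unnormalized(*s) for s in states)          # partition function
P = {s: unnormalized(*s) / Z for s in states}      # the joint distribution
print(P['a0', 'b0', 'c0', 'd1'])                   # e.g. one entry of the table
```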

26 Hierarchical Bayesian Models Plate notation for graphical models. Repetitive parts are boxed. The number of repetitions is marked in the lower corner. Observed nodes (in the data) are shaded. Like in this LDA model: there are M edges from α to θ_1, θ_2, ..., θ_M, and N edges from each θ_i to z_1, z_2, ..., z_N.

27 LDA plate expanded and M can be one million. and N tens of thousands.

28 Summary Sometimes a joint probability distribution can be composed from smaller parts. Causal scenarios produce such joint distributions. These joint distributions can be represented as Bayesian networks. Also non-causal dependencies may lead to saving space. Not all independence relations can be neatly presented as graphical models, for example context-specific independencies.

29 Why is the book so thick? Because there are many kinds of graphical models: directed, undirected, mixtures of these, factor graphs.

30 Outline Motivation Representation Inference Learning Learning parameters Learning structure Misc

31 Some notation first P(X = x | Y = y) is a probability: a real number between 0 and 1. X and Y may be vectors of random variables. P(X = x | Y = y) is often abbreviated as P(x | y).

32 Some notation first P(X = x | Y = y) is a probability: a real number between 0 and 1. X and Y may be vectors of random variables. P(X = x | Y = y) is often abbreviated as P(x | y). P(X | Y = y) (or P(X | y)) is a distribution: a set (vector) of real numbers P(X | y) = (P(x_1 | y), P(x_2 | y), ..., P(x_n | y)).

33 Some notation first P(X = x | Y = y) is a probability: a real number between 0 and 1. X and Y may be vectors of random variables. P(X = x | Y = y) is often abbreviated as P(x | y). P(X | Y = y) (or P(X | y)) is a distribution: a set (vector) of real numbers P(X | y) = (P(x_1 | y), P(x_2 | y), ..., P(x_n | y)). P(X | Y) is a set of distributions: P(X | Y) = (P(X | y_1), P(X | y_2), ..., P(X | y_m)).

34 Some notation first P(X = x | Y = y) is a probability: a real number between 0 and 1. X and Y may be vectors of random variables. P(X = x | Y = y) is often abbreviated as P(x | y). P(X | Y = y) (or P(X | y)) is a distribution: a set (vector) of real numbers P(X | y) = (P(x_1 | y), P(x_2 | y), ..., P(x_n | y)). P(X | Y) is a set of distributions: P(X | Y) = (P(X | y_1), P(X | y_2), ..., P(X | y_m)). We may also need P(x | Y), which is not a distribution, but P(x | Y) = (P(x | y_1), P(x | y_2), ..., P(x | y_m)).

35 Inference means calculating conditional distributions P(X | y) = P(X, y)/P(y) = P(X, y)/Σ_x P(x, y) ∝ P(X, y). Notice the (de)marginalization P(y) = Σ_x P(x, y). If there is no conditioning y, we condition with the empty evidence ∅ and write P(X) = P(X | ∅), where P(∅) = 1. Unconditional distributions are called marginal distributions. Notice that we restricted the inferences we attempt to make. Statisticians use the word inference for what we often call learning.

36 And why did we want to do inference? If we can estimate the probabilities of unknown things in light of given evidence, we can make principled decisions. Diagnosing what caused the problems and what else might be wrong. Recognizing the objects in a picture. Converting speech to text. Much of probabilistic modeling is about how to make probabilistic inference (in big models) efficiently.

37 Inference with the full table does not scale Problems stem from the exponential size of the table. Even simple things take long to compute in a big table. So could structure help here too, like it did with representation? Answer: it could (i.e. sometimes it will), like if the distribution can be presented with the GI model: P(A | B) = P(A)P(B)/P(B) = P(A), and these numbers are available in the GI model.

38 For some things structure clearly helps Some things are easily available: like P(H) and P(F | B). But some not so clearly: like P(C) and P(B | F).

39 How about a chain? In principle everything depends on everything, so it might turn out ugly. But for simple marginals it seems to help: P(X) = (0.4, 0.6).

40 How about a chain? In principle everything depends on everything, so it might turn out ugly. But for simple marginals it seems to help: P(X) = (0.4, 0.6). P(Y) = Σ_x P(Y, x) = Σ_x P(Y | x)P(x) = (0.84, 0.16).

41 How about a chain? In principle everything depends on everything, so it might turn out ugly. But for simple marginals it seems to help: P(X) = (0.4, 0.6). P(Y) = Σ_x P(Y, x) = Σ_x P(Y | x)P(x) = (0.84, 0.16). P(Z) = Σ_y P(Z | y)P(y) = (0.652, 0.348).

42 How about a chain? In principle everything depends on everything, so it might turn out ugly. But for simple marginals it seems to help: P(X) = (0.4, 0.6). P(Y) = Σ_x P(Y, x) = Σ_x P(Y | x)P(x) = (0.84, 0.16). P(Z) = Σ_y P(Z | y)P(y) = (0.652, 0.348). P(W) = Σ_z P(W | z)P(z) = (0.5348, 0.4652).
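
A sketch of this forward computation in code. The conditional tables for Y | X, Z | Y and W | Z are not given in the transcription; the values below are assumptions chosen so that the computed marginals match the numbers on the slides.

```python
# Sketch of the forward pass down the chain X -> Y -> Z -> W. The conditional
# tables below are assumptions chosen so that the computed marginals match the
# numbers on the slides (P(X) = (0.4, 0.6), P(Y) = (0.84, 0.16), ...).
P_X = [0.4, 0.6]
P_Y_given_X = [[0.9, 0.1],   # P(Y | x1)
               [0.8, 0.2]]   # P(Y | x2)
P_Z_given_Y = [[0.7, 0.3],
               [0.4, 0.6]]
P_W_given_Z = [[0.5, 0.5],
               [0.6, 0.4]]

def marginalize(prior, cpt):
    """P(child) = sum_parent P(child | parent) P(parent)."""
    return [sum(cpt[p][c] * prior[p] for p in range(len(prior)))
            for c in range(len(cpt[0]))]

P_Y = marginalize(P_X, P_Y_given_X)   # [0.84, 0.16]
P_Z = marginalize(P_Y, P_Z_given_Y)   # [0.652, 0.348]
P_W = marginalize(P_Z, P_W_given_Z)   # [0.5348, 0.4652]
print(P_Y, P_Z, P_W)
```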

43 But multiply connected networks do not work Calculating marginals also works for trees and even for singly connected networks (at most one path between any pair of nodes), but.. it does not work for multiply connected graphs. Assuming C has parents A and B (A → C ← B): P(C) = Σ_{a,b} P(a, b, C) = Σ_{a,b} P(C | a, b)P(a, b). P(a, b) is not available, so let us give up for now.

44 Conditioning down the chain Let's try the old strategy:

45 Conditioning down the chain Let's try the old strategy: P(Y | x1) = (0.9, 0.1).

46 Conditioning down the chain Let's try the old strategy: P(Y | x1) = (0.9, 0.1). P(Z | x1) = Σ_y P(Z, y | x1) = Σ_y P(Z | y, x1)P(y | x1) = Σ_y P(Z | y)P(y | x1) = (0.67, 0.33).

47 Conditioning down the chain Let's try the old strategy: P(Y | x1) = (0.9, 0.1). P(Z | x1) = Σ_y P(Z, y | x1) = Σ_y P(Z | y, x1)P(y | x1) = Σ_y P(Z | y)P(y | x1) = (0.67, 0.33). P(W | x1) = Σ_z P(W | z)P(z | x1) = (0.533, 0.467).

48 Conditioning up the chain Let's try the old strategy: P(w1) can be computed quickly since we know how to get marginals. Then we could try to get P(Z | w1), P(Y | w1), and P(X | w1). But, notice: since we know how to calculate the marginal probabilities P(A) and P(B), we can always calculate P(A | B) from P(B | A), since P(A | B) = P(B | A)P(A)/P(B).

49 Conditioning up the chain So let us first calculate the marginals and then try, in order, P(w1 | Z), P(w1 | Y), and P(w1 | X), and then we can invert them. P(w1 | Z) = (0.5, 0.6).

50 Conditioning up the chain So let us first calculate the marginals and then try, in order, P(w1 | Z), P(w1 | Y), and P(w1 | X), and then we can invert them. P(w1 | Z) = (0.5, 0.6). P(w1 | Y) = Σ_z P(w1, z | Y) = Σ_z P(w1 | z)P(z | Y).

51 Conditioning up the chain So let us first calculate the marginals and then try, in order, P(w1 | Z), P(w1 | Y), and P(w1 | X), and then we can invert them. P(w1 | Z) = (0.5, 0.6). P(w1 | Y) = Σ_z P(w1, z | Y) = Σ_z P(w1 | z)P(z | Y). P(w1 | X) = Σ_y P(w1 | y)P(y | X). So we can compute P(Z | w1), P(Y | w1), and P(X | w1).
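
A sketch of the upward pass and the Bayes inversion, reusing the same assumed conditional tables as in the forward-pass sketch above (re-declared here so the snippet is self-contained and runnable on its own).

```python
# Sketch: passing evidence w1 up the chain and inverting with Bayes' rule,
# using the same assumed tables as in the forward-pass sketch above.
P_X = [0.4, 0.6]
P_Y_given_X = [[0.9, 0.1], [0.8, 0.2]]
P_Z_given_Y = [[0.7, 0.3], [0.4, 0.6]]
P_W_given_Z = [[0.5, 0.5], [0.6, 0.4]]

def marginalize(prior, cpt):
    return [sum(cpt[p][c] * prior[p] for p in range(len(prior)))
            for c in range(len(cpt[0]))]

# Upward messages: P(w1 | Z), P(w1 | Y), P(w1 | X).
P_w1_given_Z = [P_W_given_Z[z][0] for z in range(2)]                  # (0.5, 0.6)
P_w1_given_Y = [sum(P_w1_given_Z[z] * P_Z_given_Y[y][z] for z in range(2)) for y in range(2)]
P_w1_given_X = [sum(P_w1_given_Y[y] * P_Y_given_X[x][y] for y in range(2)) for x in range(2)]

# Invert with Bayes' rule: P(X | w1) = P(w1 | X) P(X) / P(w1).
P_w1 = marginalize(marginalize(marginalize(P_X, P_Y_given_X), P_Z_given_Y), P_W_given_Z)[0]
P_X_given_w1 = [P_w1_given_X[x] * P_X[x] / P_w1 for x in range(2)]
print(P_X_given_w1)
```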

52 Conditioning up 'n' down the chain P(Y | w, a) = P(Y, w | a)/P(w | a) ∝ P(Y | a)P(w | Y). So P(Y | w, a) is a product of terms we calculated by conditioning from up and from down. And then you just need to show that it works for singly connected networks too, and that it does not work for multiply connected networks.

53 The essence of the general algorithm for singly connected networks Assume a courier service in a singly connected network between agents A to F, with asymmetric transfer costs. Sending from A down to C costs $6, and from C up to A costs $3. The receiver pays. If some agents have money, how much would agent X get if it were all transferred to her?

54 DP in a polytree Assume B has $100, E has $200, and the rest have nothing. How much would C get if the money were transferred to him? $-6 from A, $97 from B and $188 via D = $279. How much would D get if the money were transferred to her? But now we notice that the earlier calculations can be reused. One message up and one down is enough to compute the money for all.

55 How about the general graph? NP-hard, but clever tricks get you pretty far. Cutset conditioning: try to find a variable set C that breaks the network into singly connected components; do inference separately for each value of C; combine the results of the inferences (by weighting them appropriately). Joint tree propagation: you merge variables into vector-valued variables (cliques) so that you can build a singly connected graph from them, but then one variable can be part of many cliques and you need the so-called joint-tree propagation algorithm to ensure that the clique probabilities are consistent with each other.

56 Approximate inference So how about giving up on calculating exact probabilities and trying to get good approximate probabilities? This is the only way with complex models anyway. Idea for P(x | e) (direct sampling): generate vectors from a BN, count how many times x and e happen, and take P(x | e) ≈ N_{x,e} / N_e, where x and e can be complicated things. Works for any network. So can we generate vectors from a BN?

57 Direct sampling 1. P(Cloudy) = (0.5, 0.5) → Cloudy = true.

58 Direct sampling 1. P(Cloudy) = (0.5, 0.5) → Cloudy = true. 2. P(S | C = t) = (0.1, 0.9) → Sprinkler = false.

59 Direct sampling 1. P(Cloudy) = (0.5, 0.5) → Cloudy = true. 2. P(S | C = t) = (0.1, 0.9) → Sprinkler = false. 3. P(R | C = t) = (0.8, 0.2) → Rain = true.

60 Direct sampling 1. P(Cloudy) = (0.5, 0.5) → Cloudy = true. 2. P(S | C = t) = (0.1, 0.9) → Sprinkler = false. 3. P(R | C = t) = (0.8, 0.2) → Rain = true. 4. P(W | S = f, R = t) = (0.9, 0.1) → WetGrass = true.

61 Direct sampling 1. P(Cloudy) = (0.5, 0.5) → Cloudy = true. 2. P(S | C = t) = (0.1, 0.9) → Sprinkler = false. 3. P(R | C = t) = (0.8, 0.2) → Rain = true. 4. P(W | S = f, R = t) = (0.9, 0.1) → WetGrass = true. P(C, S, R, W) = P(t, f, t, t) = 0.5 · 0.9 · 0.8 · 0.9 = 0.324.
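
A sketch of direct (ancestral) sampling for this network, also using the samples to estimate a conditional probability by counting, as proposed on slide 56. Only some CPT entries appear on the slides; the remaining entries below follow the standard Russell & Norvig sprinkler example and should be read as assumptions.

```python
# Sketch of direct (ancestral) sampling from the sprinkler network. Only some
# CPT entries appear on the slides; the remaining ones below follow the standard
# Russell & Norvig example and should be read as assumptions.
import random
random.seed(0)

P_cloudy = 0.5
P_sprinkler = {True: 0.1, False: 0.5}               # P(S = t | Cloudy)
P_rain = {True: 0.8, False: 0.2}                    # P(R = t | Cloudy)
P_wet = {(True, True): 0.99, (True, False): 0.9,    # P(W = t | Sprinkler, Rain)
         (False, True): 0.9, (False, False): 0.0}

def sample():
    c = random.random() < P_cloudy
    s = random.random() < P_sprinkler[c]
    r = random.random() < P_rain[c]
    w = random.random() < P_wet[s, r]
    return c, s, r, w

# Estimate P(Rain = t | WetGrass = t) by counting, as in the direct-sampling idea.
samples = [sample() for _ in range(100_000)]
n_w = sum(1 for c, s, r, w in samples if w)
n_rw = sum(1 for c, s, r, w in samples if r and w)
print(n_rw / n_w)
```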

62 Problems with direct sampling The evidence e in P(x | e) can be really rare, so we may need to generate very many vectors before we get even one with e. The next attempt is to only generate vectors in which e is present. And when that fails we do Gibbs sampling, and that's another long story.

63 Summary Inference is important. Inference is (NP) hard and kind of complicated. In many models we need to resort to approximate inference.

64 Why is the book so thick? Many different representations. Inference has dimensions too. Exact inference vs. approximate inference. Multiple methods in both categories. Actually there are other inference tasks as well.

65 Outline Motivation Representation Inference Learning Learning parameters Learning structure Misc

66 Learning parameters Much of statistics does this - estimation. Also common in ML when building models for specific purpose. Naive Bayes classifier / Finite mixture model Hidden Markov Model (Kalman filter) Undirected grid (Markov Random Field)

67 Some popular models Naive Bayes: a common classifier, used in spam filtering. HMM: a temporal model, used in speech recognition. Pairwise MRF: an undirected model used in spatial models and images.

68 Data [The slide shows a small tabular data set with columns X_1, X_2, ....] Traditionally tabular, with X_i having discrete values (0, 1, ..., r_i − 1). Continuous values are sometimes discretized to achieve this. Lately there is more interest in more versatile data formats: documents of different lengths, images of different sizes, relational data.

69 Full tables hard, but GI easy [The same data table as on the previous slide.] The full table needs at least ∏_i r_i data vectors for observing each possible data vector once. The GI model needs at least max_i r_i data vectors. And estimating the parameter θ_ik that determines the probability P(X_i = x_k | θ_ik) = θ_ik is simply based on counting the relative frequency of the value x_ik. For example, the relative frequency of the event X_3 = 2 is 2/12. Indeed, the maximum likelihood estimate is simply the relative frequency: θ̂_32 = 2/12.

70 Bayesian learning of parameters In Bayesian learning you do not learn a parameter value; you assign a probability (density) to each possible value of each parameter. After observing (X_3 = 2) 2 out of 12 times you might, for example, say that θ_32 is three times more likely to equal 2/12 than some other particular value. In general, the probability of θ_ik is based partly on the relative frequency N_ik/N, but we first assume a priori that we have seen the value x_ik α_ik times and then state that P(θ_ik | N_ik, α_ik) ∝ θ_ik^(N_ik + α_ik − 1).

71 Dirichlet distribution P(θ_ik | N_ik, α_ik) ∝ θ_ik^(N_ik + α_ik − 1) is called a Dirichlet distribution with parameters N_i + α_i. The most probable parameter value is θ̂_ik = (N_ik + α_ik − 1) / (Σ_k' (N_ik' + α_ik') − K). So if we imagine we have seen 1 of each value beforehand, the most probable parameter value coincides with the ML estimate (relative frequency). Indeed, Dir(1, 1, 1, 1, 1) is a uniform distribution. Is it desirable that the most probable parameter value coincides with the relative frequency? The expected parameter value is θ̄_ik = (N_ik + α_ik) / Σ_k' (N_ik' + α_ik').
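
A small numerical sketch of these two estimates. The counts below are hypothetical (chosen so that X_3 = 2 is observed 2 times out of 12, with three possible values assumed); with α_ik = 1 the posterior mode reproduces the relative frequencies, while the posterior mean pulls them slightly toward uniform.

```python
# Sketch: mode and mean of the Dirichlet posterior for one variable, as in the
# slide's example where X_3 = 2 was observed 2 times out of 12 (three values assumed).
counts = [6, 2, 4]            # N_3k: hypothetical counts summing to 12
alpha  = [1, 1, 1]            # prior "imagined" counts, Dir(1, 1, 1) = uniform prior

K = len(counts)
posterior = [n + a for n, a in zip(counts, alpha)]

mode = [(p - 1) / (sum(posterior) - K) for p in posterior]   # most probable value
mean = [p / sum(posterior) for p in posterior]               # expected value

print(mode)   # with alpha = 1 the mode equals the relative frequencies, e.g. 2/12
print(mean)   # the mean pulls estimates slightly toward uniform
```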

72 BNs not much more difficult The parameters θ_ijk in a Bayesian network G determine the conditional probabilities P(X_i = x_k | Π_i = π_j, G) = θ_ijk, where Π_i is the vector of random variables consisting of the parents of X_i, and π_j is the j-th value combination of those parents (enumerated somehow). To estimate θ_ijk we can again just count relative frequencies of the event (X_i = x_k, Π_i = π_j). The maximum likelihood parameter is simply this relative frequency: θ̂_ijk = N_ijk / N_ij, where N_ij counts how many times Π_i has the value π_j.
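
A sketch of this counting for one child variable with a single parent, on a tiny hypothetical data set; the variable names and data values are made up for illustration.

```python
# Sketch: maximum likelihood parameters of a BN are just relative frequencies
# N_ijk / N_ij, here for one child X with a single parent Pa on hypothetical data.
from collections import Counter

data = [('p1', 'x1'), ('p1', 'x2'), ('p1', 'x1'), ('p2', 'x2'),
        ('p2', 'x2'), ('p1', 'x1'), ('p2', 'x1'), ('p2', 'x2')]

N_ij  = Counter(pa for pa, x in data)          # counts of each parent configuration
N_ijk = Counter(data)                          # joint counts (parent config, child value)

theta = {(pa, x): N_ijk[pa, x] / N_ij[pa] for pa, x in N_ijk}
print(theta)   # e.g. P(x1 | p1) = 3/4, P(x2 | p2) = 3/4
```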

73 Zero support problem The ML parameter is θ̂_ijk = N_ijk / N_ij, but what if N_ij = 0? Then the ML parameters are not unique: you can have any θ_ijk as long as Σ_k θ_ijk = 1. Having N_ij = 0 is common. Bayesians often use expected parameter values θ̄_ijk (with prior counts α_ijk = 1). Thus learning parameters for BNs is almost trivial: just counting the events is needed.

74 How about undirected models? Parameter learning for undirected models is much more difficult.

75 How about missing data? [The slide shows a data table with missing entries marked by question marks.] Full Bayesian handling runs into trouble. Usually one searches for maximum likelihood or maximum a posteriori parameters: the family of EM algorithms, or different kinds of imputation mechanisms that reduce the situation to many complete-data cases. Another chapter of the book. Latent data is a kind of missing data.

76 Summary Parameter learning is a common task for fixed structures. Parameter learning for BNs from complete data easy. Parameter learning for undirected graphs hard. Parameter learning from missing data involved.

77 Why is the book so thick? Many different representations Inference Parameter learning ML vs. Bayesian Complete vs. missing data

78 Outline Motivation Representation Inference Learning Learning parameters Learning structure Misc

79 Learning Structure Valuable data mining tool Score based learning: Space of candidate structures, Scoring function to measure goodness of a structure, Search algorithm to find a good structure. Also independence test based learning methods are used. There are results that link these two methods.

80 Data in, structure out: an example

81 How many possible BN structures? [The slide shows a table of n versus the number of Bayesian network structures with n nodes.] b(n) = 1 if n = 0, and b(n) = Σ_{k=1}^{n} (−1)^{k+1} (n choose k) 2^{k(n−k)} b(n−k) if n > 0.
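
The recursion can be evaluated directly; a short sketch (the function name b mirrors the slide's notation):

```python
# Sketch: the recursion for the number of Bayesian network structures
# (labelled DAGs) on n nodes, as given on the slide.
from functools import lru_cache
from math import comb

@lru_cache(maxsize=None)
def b(n):
    if n == 0:
        return 1
    return sum((-1) ** (k + 1) * comb(n, k) * 2 ** (k * (n - k)) * b(n - k)
               for k in range(1, n + 1))

for n in range(1, 6):
    print(n, b(n))   # 1, 3, 25, 543, 29281 -- super-exponential growth
```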

82 Marginal likelihood A Bayesian would choose the most probable structure using P(G | D) = P(D | G)P(G)/P(D) ∝ P(D | G)P(G). P(G) is often assumed uniform, so the score is the marginal likelihood P(D | G) = ∏_{j=1}^{N} P(d_j | d_1, ..., d_{j−1}, G). Thus the structure with the best prediction record for the data gets selected.

83 Marginal likelihood formula P(D | G) = ∫ P(D | θ, G)P(θ | G) dθ. For complete data, the marginal likelihood has a closed form: P(D | G, α) = ∏_{i=1}^{n} ∏_{j=1}^{q_i} [ Γ(Σ_{k=1}^{r_i} α_ijk) / Γ(Σ_{k=1}^{r_i} (α_ijk + N_ijk)) · ∏_{k=1}^{r_i} Γ(α_ijk + N_ijk) / Γ(α_ijk) ]. Notice that the score decomposes by the structure: log P(D | G) = Σ_{i=1}^{n} log P(D_i | D_{Π_i}, G).
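
A sketch of evaluating one family's contribution to this closed form in log space (to avoid overflow), with hypothetical counts N_ijk and uniform prior counts α_ijk = 1:

```python
# Sketch: the closed-form (log) marginal likelihood of one family X_i with
# parent configurations j = 1..q_i, evaluated with hypothetical counts N_ijk
# and prior counts alpha_ijk = 1.
from math import lgamma

N     = [[6, 2], [1, 5]]     # N_ijk: rows = parent configs, cols = values of X_i
alpha = [[1, 1], [1, 1]]     # alpha_ijk

def family_log_marginal_likelihood(N, alpha):
    total = 0.0
    for N_j, a_j in zip(N, alpha):
        total += lgamma(sum(a_j)) - lgamma(sum(a_j) + sum(N_j))
        total += sum(lgamma(a + n) - lgamma(a) for a, n in zip(a_j, N_j))
    return total

# log P(D | G) decomposes into a sum of such family terms over i = 1..n.
print(family_log_marginal_likelihood(N, alpha))
```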

84 Other popular criteria Penalized maximum likelihood scores: S(G, D) = log P(D | θ̂(D, G)) − penalty. Akaike Information Criterion (AIC): penalty = Σ_{i=1}^{n} Σ_{j=1}^{q_i} (r_i − 1), i.e., the number of free parameters in the model. Bayesian Information Criterion (BIC): penalty = (number of free parameters) / 2 · log N. AIC and BIC also decompose by structure, but NML, with penalty = log Σ_D P(D | θ̂(D, G)), does not decompose (and is not popular).

85 Equivalence of structures A two-node BN A → B allows exactly the same joint distributions as the BN B → A. So should they be equally good for all data sets D? In general, any two BNs with the same skeleton and the same v-structures are equivalent in the sense that they allow the same joint distributions. Trivia: there are about 4 times more Bayesian network structures than their equivalence classes. All the scores mentioned give equal scores to equivalent structures. Marginal likelihood requires α to be set carefully though (Buntine 91).

86 Searching for the structure NP-hard, but the best tree(s) can be found in quadratic time. Local search: 1. start with a best tree. 2. add, delete or reverse a random arc (beware of cycles). 3. if no improvement, do not accept the change. 4. back to step 2 or 1. Decomposability helps, because a local change means that only one or two of the terms in the score need to be re-evaluated. It also helps enough that it is possible to find the best network for about 30 variables for sure.

87 Summary Structure learning for BNs is NP-hard, thus heuristic search is usually used. Many different scores, which are decomposable, allowing efficient score evaluation during local search. Marginal likelihood can be interpreted as sequential prediction.

88 Why is the book so thick? Many different representations Inference (exact and approximate) Parameter learning (complete and missing data) Structure learning Many criteria. Heuristic search is an endless topic.

89 Outline Motivation Representation Inference Learning Learning parameters Learning structure Misc

90 Similarity? Two big ML themes: graphical models, kernel based methods. But there are connections: Many graphical models try to capture similarity. Finite mixture models for clustering. Topic models for documents. Let us look at Fisher kernels for Bayesian networks

91 Hamming distance example [The slide shows a target bit vector t and two candidate bit vectors x and y.] Is sim(x, t) > sim(y, t)? But what if I tell you that the first four bits are always the same? Dependency can be taken into account using a graphical model.

92 Fisher kernel The graphical model G and parameters θ define the probability P(d | θ, G) of a data vector d. Vectors x and y are similar if changing the parameters θ a little bit changes the probabilities of x and y in a similar way. We compare the gradient vectors U_x = ∇_θ log P(x | θ) and U_y = ∇_θ log P(y | θ) by taking the inner product K(x, y) = U_x^T I^{−1} U_y, where I is the Fisher information matrix. As an approximation, I is often dropped.

93 Fisher kernel for BNs For BNs the Fisher kernel (with I) can be computed relatively easily. It decomposes as K(x, y) = Σ_i K_i(x, y), where K_i(x, y) = 0 if pa_i(x) ≠ pa_i(y); K_i(x, y) = −1/P(pa_i(x)) if pa_i(x) = pa_i(y) and x_i ≠ y_i; K_i(x, y) = (1 − P(x_i | pa_i(x))) / P(x_i, pa_i(x)) if pa_i(x) = pa_i(y) and x_i = y_i. The terms P(pa_i(x)) and P(x_i, pa_i(x)) are available in the joint tree.
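
A tentative sketch of one term K_i(x, y), following the piecewise formula as reconstructed above (the minus sign in the mismatch case is part of that reconstruction); the probabilities passed in are hypothetical and would in practice come from inference in the network.

```python
# Sketch: one per-variable term K_i(x, y) of the Fisher kernel, following the
# reconstructed piecewise formula. p_pa = P(pa_i(x)) and p_cond = P(x_i | pa_i(x))
# would come from (joint-tree) inference in the Bayesian network.
def K_i(x_i, y_i, pa_x, pa_y, p_pa, p_cond):
    if pa_x != pa_y:
        return 0.0
    if x_i != y_i:
        return -1.0 / p_pa
    return (1.0 - p_cond) / (p_cond * p_pa)     # = (1 - P(x_i | pa)) / P(x_i, pa)

# Example call with hypothetical probabilities:
print(K_i('x1', 'x1', 'pa1', 'pa1', p_pa=0.25, p_cond=0.4))
```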

94 Summary The formalism of probabilistic graphical models provides a unifying framework for capturing complex dependencies among random variables, and building large-scale multivariate statistical models. They do so by allowing structured representations, facilitating efficient inference, and being amenable to learning from data.

95 Thank You!
