ECE 6504: Advanced Topics in Machine Learning Probabilistic Graphical Models and Large-Scale Learning



1 ECE 6504: Advanced Topics in Machine Learning
Probabilistic Graphical Models and Large-Scale Learning
Topics:
- Bayes Nets
  - (Finish) Parameter Learning
  - Structure Learning
Readings: KF 18.1, 18.3; Barber 9.5, 10.4
Dhruv Batra, Virginia Tech

2 Administrativia
- HW1 out
  - Due in 2 weeks: Feb 17, Feb 19, 11:59pm
  - Please, please, please, please start early
  - Implementation: TAN, structure + parameter learning
- Please post questions on the Scholar Forum.
(C) Dhruv Batra

3 Recap of Last Time

4 Learning Bayes nets

                     Fully observable data   Missing data
  Known structure    Very easy               Somewhat easy (EM)
  Unknown structure  Hard                    Very very hard

Data $x^{(1)}, \dots, x^{(m)}$; learn structure and parameters (CPTs $P(X_i \mid Pa_{X_i})$)
Slide Credit: Carlos Guestrin

5 Learning the CPTs
Data: $x^{(1)}, \dots, x^{(m)}$
For each discrete variable $X_i$:
$$\hat{P}_{MLE}(X_i = a \mid Pa_{X_i} = b) = \frac{\mathrm{Count}(X_i = a, Pa_{X_i} = b)}{\mathrm{Count}(Pa_{X_i} = b)}$$
Slide Credit: Carlos Guestrin
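The counting estimate above can be sketched in a few lines. This is a minimal illustration, not code from the course: the function name `mle_cpt`, the dict-per-sample data layout, and the toy data are all assumptions made for the example.

```python
from collections import Counter

def mle_cpt(data, child, parents):
    """Estimate P(child | parents) by counting (hypothetical helper).

    data: list of dicts mapping variable name -> value, fully observed.
    Returns {(parent_values, child_value): probability}.
    """
    joint = Counter()   # Count(X_i = a, Pa_Xi = b)
    parent = Counter()  # Count(Pa_Xi = b)
    for row in data:
        b = tuple(row[p] for p in parents)
        joint[(b, row[child])] += 1
        parent[b] += 1
    return {(b, a): n / parent[b] for (b, a), n in joint.items()}

# Toy data, made up for illustration only.
data = [
    {"Flu": 1, "Allergy": 0, "Sinus": 1},
    {"Flu": 1, "Allergy": 0, "Sinus": 1},
    {"Flu": 1, "Allergy": 0, "Sinus": 0},
    {"Flu": 0, "Allergy": 0, "Sinus": 0},
]
cpt = mle_cpt(data, "Sinus", ["Flu", "Allergy"])
```

Parent configurations that never occur in the data get no entry at all, which is one way to see why many parents "shatter" the dataset: each configuration is estimated from only the rows that match it.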

6 Plan for today
- (Finish) BN Parameter Learning
  - Parameter Sharing
  - Plate notation
- (Start) BN Structure Learning
  - Log-likelihood score
  - Decomposability
  - Information never hurts

7 Meta BN
- Explicitly showing parameters as variables
- Example on board
  - One variable $X$; parameter $\theta_X$
  - Two variables $X, Y$; parameters $\theta_X$, $\theta_{Y \mid X}$

8 Global parameter independence
- Global parameter independence:
  - All CPT parameters are independent
  - Prior over parameters is a product of priors over CPTs
[Figure: Bayes net over Flu, Allergy, Sinus, Headache, Nose]
- Proposition: For fully observable data $D$, if the prior satisfies global parameter independence, then so does the posterior:
$$P(\theta \mid D) = \prod_i P(\theta_{X_i \mid Pa_{X_i}} \mid D)$$

9 Parameter Sharing
What if $X_1, \dots, X_n$ are $n$ random variables for coin tosses of the same coin?

10 Naïve Bayes vs Bag-of-Words
- What's the difference?
- Parameter sharing!

11 Text classification
- Classify e-mails: Y = {Spam, NotSpam}
- What about the features X?
  - $X_i$ represents the $i$-th word in the document; $i = 1$ to doc-length
  - $X_i$ takes values in the vocabulary (10,000 words, etc.)

12 Bag of Words
- Position in document doesn't matter: $P(X_i = x_i \mid Y = y) = P(X_k = x_i \mid Y = y)$
  - Order of words on the page ignored
  - Parameter sharing
Example text: "When the lecture is over, remember to wake up the person sitting next to you in the lecture room."
Slide Credit: Carlos Guestrin

13 Bag of Words
- Position in document doesn't matter: $P(X_i = x_i \mid Y = y) = P(X_k = x_i \mid Y = y)$
  - Order of words on the page ignored
  - Parameter sharing
Example text as a bag: in is lecture lecture next over person remember room sitting the the the to to up wake when you
Slide Credit: Carlos Guestrin
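A small sketch of why word order drops out: once every position shares the same word distribution, only the word counts matter. The tokenizer regex and the helper name `bag_of_words` are assumptions for illustration; the sentence is the one on the slide.

```python
from collections import Counter
import re

text = ("When the lecture is over, remember to wake up the person "
        "sitting next to you in the lecture room.")

def bag_of_words(doc):
    """Lowercase, drop punctuation, and count words (hypothetical helper)."""
    return Counter(re.findall(r"[a-z']+", doc.lower()))

bag = bag_of_words(text)

# Reversing the word order changes the document but not its bag.
reordered = " ".join(reversed(text.split()))
same_bag = bag_of_words(reordered) == bag

# Sorting the bag's elements reproduces the alphabetized list on the slide.
as_sorted_words = " ".join(sorted(bag.elements()))
```

Under the bag-of-words assumption, any classifier built on these counts assigns the original and the reordered document the same likelihood.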

14 HMMs semantics: Details
- Hidden states: $X_1, \dots, X_5 \in \{a, \dots, z\}$
- Observations: $O_1, \dots, O_5$ [Figure: observation images]
- Just 3 distributions: $P(X_1)$, $P(X_t \mid X_{t-1})$, $P(O_t \mid X_t)$
Slide Credit: Carlos Guestrin

15 N-grams
Learnt from Darwin's On the Origin of Species
[Figure: unigram and bigram frequencies over the characters _ a b c ... z]
Image Credit: Kevin Murphy

16 Plate Notation
- $X_1, \dots, X_n$ are $n$ random variables for coin tosses of the same coin
- Plate denotes replication

17 Plate Notation
[Figure: plate diagram with variables $Y$ and $X_j$; plate label $D$]
Plates denote replication of random variables

18 Hierarchical Bayesian Models
Why stop with a single prior?

19 BN: Parameter Learning: What you need to know
- Parameter Learning
  - MLE
    - Decomposes; results in counting procedure
    - Will shatter dataset if too many parents
  - Bayesian Estimation
    - Conjugate priors
    - Priors = regularization (also viewed as smoothing)
  - Hierarchical priors
- Plate notation
- Shared parameters
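To make "priors = smoothing" concrete, here is a minimal sketch for a single discrete variable with a symmetric Dirichlet prior, where the posterior mean is (count + alpha) / (N + K * alpha). The function name `smoothed_estimate` and the coin-toss data are assumptions for illustration.

```python
from collections import Counter

def smoothed_estimate(observations, values, alpha=1.0):
    """Posterior-mean estimate under a symmetric Dirichlet(alpha) prior.

    alpha = 0 recovers the MLE (plain relative frequencies);
    alpha > 0 adds alpha pseudo-counts to every value.
    """
    counts = Counter(observations)
    total = len(observations) + alpha * len(values)
    return {v: (counts[v] + alpha) / total for v in values}

tosses = ["H", "H", "H"]  # made-up data: three heads, no tails
mle = smoothed_estimate(tosses, ["H", "T"], alpha=0.0)
smoothed = smoothed_estimate(tosses, ["H", "T"], alpha=1.0)
```

The MLE assigns tails probability zero after three heads, while the smoothed estimate keeps some mass on the unseen outcome. The same pseudo-count idea applies per parent configuration in a CPT.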

20 Learning Bayes nets

                     Fully observable data   Missing data
  Known structure    Very easy               Somewhat easy (EM)
  Unknown structure  Hard                    Very very hard

Data $x^{(1)}, \dots, x^{(m)}$; learn structure and parameters (CPTs $P(X_i \mid Pa_{X_i})$)
Slide Credit: Carlos Guestrin

21 Goals of Structure Learning
- Prediction
  - Care about a good structure because presumably it will lead to good predictions
- Discovery
  - I want to understand some system
Data $x^{(1)}, \dots, x^{(m)}$; learn structure and parameters (CPTs $P(X_i \mid Pa_{X_i})$)

22 Types of Errors
- Truth: [Figure: Bayes net over Flu, Allergy, Sinus, Headache, Nose]
- Recovered: [Figure: two candidate Bayes nets over the same variables]

23 Learning the structure of a BN
Data: $\langle x_1^{(1)}, \dots, x_n^{(1)} \rangle, \dots, \langle x_1^{(m)}, \dots, x_n^{(m)} \rangle$; learn structure and parameters
- Constraint-based approach
  - Test conditional independencies in data
  - Find an I-map
- Score-based approach
  - Finding a structure and parameters is a density estimation task
  - Evaluate model as we evaluated parameters
    - Maximum likelihood
    - Bayesian
    - etc.
[Figure: Bayes net over Flu, Allergy, Sinus, Headache, Nose]
Slide Credit: Carlos Guestrin

24 Score-based approach
Data: $\langle x_1^{(1)}, \dots, x_n^{(1)} \rangle, \dots, \langle x_1^{(m)}, \dots, x_n^{(m)} \rangle$
For each possible structure over Flu, Allergy, Sinus, Headache, Nose: learn parameters, then score the structure.
[Figure: three candidate structures with scores -52, -60, and -500]
Slide Credit: Carlos Guestrin

25 How many graphs?
- N vertices.
- How many (undirected) graphs?
- How many (undirected) trees?
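The slide leaves both questions open. As a sketch using standard counting results for labeled vertices (not taken from the slides): each of the N(N-1)/2 possible edges is either present or absent, giving 2^(N(N-1)/2) undirected graphs, and Cayley's formula gives N^(N-2) labeled trees. The function names are hypothetical.

```python
from math import comb

def num_undirected_graphs(n):
    """Each of the C(n, 2) possible edges is either in or out."""
    return 2 ** comb(n, 2)

def num_labeled_trees(n):
    """Cayley's formula n^(n-2); one tree for n = 1 or n = 2."""
    return 1 if n <= 2 else n ** (n - 2)

for n in (3, 5, 10):
    print(n, num_undirected_graphs(n), num_labeled_trees(n))
```

Both counts grow far too fast for exhaustive scoring, but trees are the special case where an efficient exact search exists.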

26 What's a good score?
$\mathrm{Score}(G) = \text{log-likelihood}(G : D, \theta_{MLE})$

27 Information-theoretic interpretation of Maximum Likelihood Score
- Consider a two-node graph
- Derived on board

28 Information-theoretic interpretation of Maximum Likelihood Score
For a general graph G
[Figure: Bayes net over Flu, Allergy, Sinus, Headache, Nose]
Slide Credit: Carlos Guestrin

29 Information-theoretic interpretation of Maximum Likelihood Score
[Figure: Bayes net over Flu, Allergy, Sinus, Headache, Nose]
Implications:
- Intuitive: higher mutual info → higher score
- Decomposes over families in BN (node and its parents)
- Same score for I-equivalent structures!
- Information never hurts!
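These implications can be checked numerically. The sketch below assumes the standard identity (see KF 18.3) that the maximum-likelihood score equals $m \sum_i \hat{I}(X_i; Pa_{X_i}) - m \sum_i \hat{H}(X_i)$, and compares it with a direct log-likelihood computation on made-up data. All function names and the data are assumptions for illustration.

```python
from collections import Counter
from math import log

def entropy(data, cols):
    """Empirical entropy H(cols) in nats."""
    m = len(data)
    counts = Counter(tuple(row[c] for c in cols) for row in data)
    return -sum(n / m * log(n / m) for n in counts.values())

def mutual_info(data, child, parents):
    """Empirical I(child; parents) = H(child) + H(parents) - H(child, parents)."""
    if not parents:
        return 0.0
    return (entropy(data, [child]) + entropy(data, parents)
            - entropy(data, [child] + parents))

def log_likelihood(data, graph):
    """Log-likelihood at the MLE parameters; graph maps node -> parent list."""
    total = 0.0
    for child, parents in graph.items():
        joint = Counter((tuple(r[p] for p in parents), r[child]) for r in data)
        par = Counter(tuple(r[p] for p in parents) for r in data)
        # Each family contributes sum of Count(a, b) * log P_MLE(a | b).
        total += sum(n * log(n / par[b]) for (b, _), n in joint.items())
    return total

def info_score(data, graph):
    """m * sum of I(X_i; Pa_Xi) minus m * sum of H(X_i)."""
    m = len(data)
    return (m * sum(mutual_info(data, x, pa) for x, pa in graph.items())
            - m * sum(entropy(data, [x]) for x in graph))

# Toy data, made up for illustration only.
data = [{"A": 0, "B": 0}, {"A": 0, "B": 0}, {"A": 1, "B": 1},
        {"A": 1, "B": 0}, {"A": 1, "B": 1}, {"A": 0, "B": 1}]
empty = {"A": [], "B": []}
a_to_b = {"A": [], "B": ["A"]}
b_to_a = {"B": [], "A": ["B"]}
```

On this data the two computations agree, the edge never lowers the score relative to the empty graph, and the two I-equivalent orientations score the same. Because extra parents never hurt the likelihood score, maximizing it without a constraint or penalty favors the densest graph.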

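30 Chow-Liu tree learning algorithm 1
- For each pair of variables $X_i, X_j$
  - Compute empirical distribution: $\hat{P}(x_i, x_j) = \frac{\mathrm{Count}(x_i, x_j)}{m}$
  - Compute mutual information: $\hat{I}(X_i, X_j) = \sum_{x_i, x_j} \hat{P}(x_i, x_j) \log \frac{\hat{P}(x_i, x_j)}{\hat{P}(x_i)\hat{P}(x_j)}$
- Define a graph
  - Nodes $X_1, \dots, X_n$
  - Edge $(i, j)$ gets weight $\hat{I}(X_i, X_j)$
Slide Credit: Carlos Guestrin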

31 Chow-Liu tree learning algorithm 2
- Optimal tree BN
  - Compute maximum weight spanning tree
  - Directions in BN: pick any node as root, and direct edges away from root
    - Breadth-first search defines directions
Slide Credit: Carlos Guestrin
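The two Chow-Liu slides can be sketched end to end as follows. This is a minimal illustration under stated assumptions: samples are tuples of discrete values, the spanning tree is built with Kruskal's algorithm (the slides do not name a particular spanning-tree method), and the names `mutual_information` and `chow_liu` are hypothetical.

```python
from collections import Counter, deque
from itertools import combinations
from math import log

def mutual_information(data, i, j):
    """Empirical mutual information between columns i and j, in nats."""
    m = len(data)
    n_ij = Counter((row[i], row[j]) for row in data)
    n_i = Counter(row[i] for row in data)
    n_j = Counter(row[j] for row in data)
    return sum(n / m * log(n * m / (n_i[a] * n_j[b]))
               for (a, b), n in n_ij.items())

def chow_liu(data, root=0):
    """Return directed (parent, child) edges of a Chow-Liu tree."""
    n = len(data[0])
    # Step 1: weight every pair of variables by empirical mutual information.
    edges = sorted(((mutual_information(data, i, j), i, j)
                    for i, j in combinations(range(n), 2)), reverse=True)
    # Step 2: maximum-weight spanning tree (Kruskal with union-find).
    comp = list(range(n))
    def find(x):
        while comp[x] != x:
            comp[x] = comp[comp[x]]
            x = comp[x]
        return x
    adj = {v: [] for v in range(n)}
    for _, i, j in edges:
        ri, rj = find(i), find(j)
        if ri != rj:
            comp[ri] = rj
            adj[i].append(j)
            adj[j].append(i)
    # Step 3: direct edges away from the root with breadth-first search.
    directed, seen, queue = [], {root}, deque([root])
    while queue:
        u = queue.popleft()
        for v in adj[u]:
            if v not in seen:
                seen.add(v)
                directed.append((u, v))
                queue.append(v)
    return directed

# Toy data, made up: column 1 copies column 0; column 2 is weakly related.
data = [(0, 0, 0), (0, 0, 1), (1, 1, 0), (1, 1, 1), (0, 0, 0), (1, 1, 1)]
tree = chow_liu(data)
```

Any choice of root gives a tree with the same likelihood score, since the resulting directed trees are I-equivalent; only the edge directions change.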

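32 Can we extend Chow-Liu?
- Tree augmented naïve Bayes (TAN) [Friedman et al. 97]
  - Naïve Bayes model overcounts, because correlation between features is not considered
  - Same as Chow-Liu, but score edges with the conditional mutual information given the class $C$:
$$\hat{I}(X_i, X_j \mid C) = \sum_{c, x_i, x_j} \hat{P}(c, x_i, x_j) \log \frac{\hat{P}(x_i, x_j \mid c)}{\hat{P}(x_i \mid c)\hat{P}(x_j \mid c)}$$
Slide Credit: Carlos Guestrin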
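A minimal sketch of the TAN edge weight, the empirical conditional mutual information between two features given the class, as described by Friedman et al. The function name, the tuple data layout, and both toy datasets are assumptions for illustration.

```python
from collections import Counter
from math import log

def conditional_mutual_information(data, i, j, c):
    """Empirical I(X_i; X_j | C) in nats; rows are indexable by column."""
    m = len(data)
    n_ijc = Counter((row[i], row[j], row[c]) for row in data)
    n_ic = Counter((row[i], row[c]) for row in data)
    n_jc = Counter((row[j], row[c]) for row in data)
    n_c = Counter(row[c] for row in data)
    # P(a, b | k) / (P(a | k) P(b | k)) simplifies to n * n_k / (n_ak * n_bk).
    return sum(n / m * log(n * n_c[k] / (n_ic[(a, k)] * n_jc[(b, k)]))
               for (a, b, k), n in n_ijc.items())

# Columns are (X0, X1, C). Made-up data for two contrasting cases.
explained_by_class = [(0, 0, 0), (0, 0, 0), (1, 1, 1), (1, 1, 1)]
correlated_within_class = [(0, 0, 0), (1, 1, 0), (0, 0, 1), (1, 1, 1)]

w_explained = conditional_mutual_information(explained_by_class, 0, 1, 2)
w_correlated = conditional_mutual_information(correlated_within_class, 0, 1, 2)
```

In the first dataset the features agree only because both follow the class, so the weight is zero and naïve Bayes loses nothing. In the second they remain dependent within each class, which is the correlation a TAN edge is meant to capture.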


Learning BN tutorial: ftp://ftp.research.microsoft.com/pub/tr/tr-95-06.pdf
TAN paper: http://www.cs.huji.ac.il/~nir/abstracts/frgg1.html