Today: Logistic Regression (Maximum Entropy Formulation), Decision Trees Redux (Now Using Information Theory), Graphical Models

1 Today: Logistic Regression (Maximum Entropy formulation); Decision Trees Redux (now using Information Theory); Graphical Models (representing conditional dependence graphically)

2 Logistic Regression Optimization. Take the gradient in terms of w:
$E(w) = -\ln p(\mathbf{t} \mid w) = -\sum_{n=0}^{N-1} \{ t_n \ln y_n + (1 - t_n) \ln(1 - y_n) \}$
where $y_n = p(C_0 \mid x_n) = \sigma(a_n)$ and $a_n = w^T x_n$.
$\nabla_w E = \sum_{n=0}^{N-1} \frac{\partial E}{\partial y_n} \frac{\partial y_n}{\partial a_n} \frac{\partial a_n}{\partial w}$

3 Optimization. We know the gradient of the error function, but how do we find its minimum (equivalently, the maximum of the likelihood)?
$\nabla_w E = \sum_{n=0}^{N-1} (y_n - t_n) x_n$
Setting this to zero is nontrivial, so we turn to numerical approximation.
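
To make the numerical route concrete, here is a minimal sketch of batch gradient descent on this error (the learning rate eta, iteration count, and data shapes are illustrative assumptions, not from the slides):

```python
import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

def fit_logistic(X, t, eta=0.1, iters=1000):
    """Batch gradient descent on the cross-entropy error.

    X: (N, D) design matrix; t: (N,) binary targets.
    """
    w = np.zeros(X.shape[1])
    for _ in range(iters):
        y = sigmoid(X @ w)        # y_n = sigma(w^T x_n)
        grad = X.T @ (y - t)      # sum_n (y_n - t_n) x_n
        w -= eta * grad / len(t)  # step against the gradient
    return w
```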

4 Entropy
$H(x) = -\sum_x p(x) \log_2 p(x)$
A measure of uncertainty, or a measure of information. High uncertainty equals high entropy. Rare events are more informative than common events.
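
As a quick sketch, the entropy of a discrete distribution takes only a few lines (the example probabilities are mine):

```python
import math

def entropy(p):
    """H(x) = -sum_x p(x) log2 p(x); 0 log 0 is taken as 0."""
    return -sum(pi * math.log2(pi) for pi in p if pi > 0)

print(entropy([0.5, 0.5]))  # 1.0 bit: uniform, maximal for two outcomes
print(entropy([0.9, 0.1]))  # ~0.47 bits: less uncertainty overall
```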

5 Examples of Entropy. Uniform distributions have higher entropy.

6 Maximum Entropy. Logistic Regression is also known as Maximum Entropy. Entropy is concave, so the optimization is well behaved and converges. Constrain this optimization to enforce good classification. Increase the likelihood of the data while keeping the distribution of weights as even as possible. Include as many useful features as possible.

7 Maximum Entropy with Constraints. [Figure: constrained maximum-entropy example from the Klein and Manning tutorial]

8 Optimization formulation. If we let the weights represent likelihoods of values for each feature:
$\max_w H(w; x, t) = -\sum_i w_i \log_2 w_i$
s.t. $w^T x_i = t_i$ for each feature $i$, and $\|w\|_2 = 1$.

9 Solving the MaxEnt formulation. Convex optimization with a concave objective function and linear constraints: Lagrange multipliers. This is a dual representation of the maximum likelihood estimation of Logistic Regression.
$L(w, \lambda) = -\sum_i w_i \log_2 w_i + \sum_{i=1}^{N} \lambda_i (w^T x_i - t_i) + \lambda_0 (\|w\|_2 - 1)$

10 Decision Trees. Nested if-statements for classification. Each decision tree node contains a feature and a split point. Challenges: determine which feature and split point to use; determine which branches are worth including at all (pruning).

11 Decision Trees. [Figure: an example tree splitting first on color (blue / brown / green), then on height and weight thresholds such as h < 66 and w < 140, with leaves labeled m / f]

12 Ranking Branches. Last time, we used classification accuracy to measure the value of a branch. [Figure: a 6M / 6F node split on height < 68 into 1M / 5F and 5M / 1F.] 50% accuracy before the branch; 83.3% accuracy after the branch; 33.3% accuracy improvement.

13 Ranking Branches. Instead, measure the decrease in entropy of the class distribution following the split. [Figure: the same split: 6M / 6F on height < 68 into 1M / 5F and 5M / 1F.] H(x) = 1 bit before the branch; H(x) ≈ 0.65 bits after the branch (weighted over the two children).

14 InfoGain Criterion. Calculate the decrease in entropy across a split point; this represents the amount of information contained in the split. It is relatively indifferent to the position in the decision tree, and it is more applicable to N-way classification. Accuracy reflects only the mode of the distribution; entropy can be reduced while leaving the mode unaffected.
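
A sketch of the information-gain computation on the height split from the previous slides; the helper names entropy_counts and info_gain are hypothetical, and the class counts come from the 6M / 6F example:

```python
import math

def entropy_counts(counts):
    """Entropy (bits) of a class distribution given raw counts."""
    n = sum(counts)
    return -sum(c / n * math.log2(c / n) for c in counts if c > 0)

def info_gain(parent, children):
    """Decrease in entropy from splitting parent into children."""
    n = sum(parent)
    weighted = sum(sum(c) / n * entropy_counts(c) for c in children)
    return entropy_counts(parent) - weighted

# 6M/6F split by height < 68 into 1M/5F and 5M/1F
print(info_gain([6, 6], [[1, 5], [5, 1]]))  # ~0.35 bits
```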

15 Graphical Models and Conditional Independence. More generally about probabilities, but used in classification and clustering. Both Linear Regression and Logistic Regression use probabilistic models. Graphical models allow us to structure and visualize probabilistic models and the relationships between variables.

16 (Joint) Probability Tables. Represent multinomial joint probabilities between K variables as K-dimensional tables:
$p(x) = p(\text{flu?}, \text{achiness?}, \text{headache?}, \ldots, \text{temperature?})$
Assuming D binary variables, how big is this table? What if we had multinomials with M entries?

17 Probability Models. What if the variables are independent?
$p(x) = p(\text{flu?}, \text{achiness?}, \text{headache?}, \ldots, \text{temperature?})$
If x and y are independent: $p(x, y) = p(x) p(y)$. The original distribution can be factored:
$p(x) = p(\text{flu?}) \, p(\text{achiness?}) \, p(\text{headache?}) \cdots p(\text{temperature?})$
How big is this table, if each variable is binary?
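
A back-of-the-envelope check of the two sizes (the variable counts below are my own illustration): the full joint over D binary variables needs $2^D$ entries, while the fully factored model needs only one parameter per variable.

```python
D = 20          # e.g. 20 binary symptom variables
M = 4           # entries per variable in the multinomial case

print(2 ** D)   # full joint table: 1,048,576 entries
print(D)        # factored (independent) binary model: 20 parameters
print(M ** D)   # multinomial joint: M**D cells
print(D * M)    # factored multinomial: D tables of M entries
```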

18 Conditional Independence. Independence assumptions are convenient (Naïve Bayes), but rarely true. More often, some groups of variables are dependent while others are independent. Still others are conditionally independent.

19 Conditional Independence. If two variables are conditionally independent:
$p(x, z \mid y) = p(x \mid y) \, p(z \mid y)$, even though in general $p(x, z) \neq p(x) p(z)$.
E.g. y = flu?, x = achiness?, z = headache?: $x \perp z \mid y$.

20 Factorization of a joint. Assume $x \perp z \mid y$. How do you factorize $p(x, y, z)$?
$p(x, y, z) = p(x, z \mid y) \, p(y) = p(x \mid y) \, p(z \mid y) \, p(y)$

21 Factorization of a joint. What if there is no conditional independence? How do you factorize $p(x, y, z)$?
$p(x, y, z) = p(x, z \mid y) \, p(y) = p(x \mid y, z) \, p(z \mid y) \, p(y)$

22 Structure of Graphical Models. Graphical models allow us to represent dependence relationships between variables visually. Graphical models are directed acyclic graphs (DAGs). Nodes: random variables. Edges: dependence relationships. No edge: independent variables. The direction of an edge indicates a parent-child relationship (parent: source / trigger; child: destination / response).

23 Example Graphical Models. [Figure: two unconnected nodes x, y for $p(x, y) = p(x) p(y)$; and an edge y → x for $p(x, y) = p(x \mid y) p(y)$] The parents of a node i are denoted $\pi_i$. Factorization of the joint in a graphical model:
$p(x_0, \ldots, x_{n-1}) = \prod_{i=0}^{n-1} p(x_i \mid \pi_i)$

24 Basic Graphical Models. Independent variables; observations. [Figure: independent nodes x, y, z; then the same three nodes with y shaded.] When we observe a variable (fix its value from data), we color the node grey. Observing a variable allows us to condition on it, e.g. $p(x, z \mid y)$. Given an observation, we can generate pdfs for the other variables.

25 Example Graphical Models. X = cloudy? Y = raining? Z = wet ground? Markov chain: x → y → z.
$p(x, y, z) = \prod_{n \in \{x, y, z\}} p(n \mid \pi_n) = p(x) \, p(y \mid x) \, p(z \mid y)$
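
A small ancestral-sampling sketch of this chain; the CPT numbers are made up for illustration:

```python
import random

p_cloudy = 0.4                          # p(x)
p_rain_given = {True: 0.7, False: 0.1}  # p(y | x)
p_wet_given = {True: 0.9, False: 0.2}   # p(z | y)

def sample_chain():
    """Draw (cloudy, raining, wet) via p(x) p(y|x) p(z|y)."""
    x = random.random() < p_cloudy
    y = random.random() < p_rain_given[x]
    z = random.random() < p_wet_given[y]
    return x, y, z

print(sample_chain())
```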

26 Example Graphical Models. Markov chain: x → y → z.
$p(x, y, z) = \prod_{n \in \{x, y, z\}} p(n \mid \pi_n) = p(x) \, p(y \mid x) \, p(z \mid y)$
Are x and z conditionally independent given y? I.e., does $p(x, z \mid y) = p(x \mid y) \, p(z \mid y)$ hold?

27 Example Graphical Models. Markov chain: x → y → z, $p(x, y, z) = \prod_{n \in \{x, y, z\}} p(n \mid \pi_n) = p(x) \, p(y \mid x) \, p(z \mid y)$.
$p(x, z \mid y) = p(x \mid z, y) \, p(z \mid y)$
$p(x \mid z, y) = \frac{p(x, y, z)}{p(y, z)} = \frac{p(x) \, p(y \mid x) \, p(z \mid y)}{p(y) \, p(z \mid y)} = \frac{p(x) \, p(y \mid x)}{p(y)} = \frac{p(x, y)}{p(y)} = p(x \mid y)$
Therefore $p(x, z \mid y) = p(x \mid y) \, p(z \mid y)$: $x \perp z \mid y$.

28 One Trigger, Two Responses. X = achiness? Y = flu? Z = fever? [Figure: x ← y → z]
$p(x, y, z) = \prod_{n \in \{x, y, z\}} p(n \mid \pi_n) = p(x \mid y) \, p(y) \, p(z \mid y)$

29 Example Graphical Models. [Figure: x ← y → z] $p(x, y, z) = \prod_{n \in \{x, y, z\}} p(n \mid \pi_n) = p(x \mid y) \, p(y) \, p(z \mid y)$. Are x and z conditionally independent given y? I.e., does $p(x, z \mid y) = p(x \mid y) \, p(z \mid y)$ hold?

30 Example Graphical Models. [Figure: x ← y → z] $p(x, y, z) = \prod_{n \in \{x, y, z\}} p(n \mid \pi_n) = p(x \mid y) \, p(y) \, p(z \mid y)$.
$p(x, z \mid y) = p(x \mid z, y) \, p(z \mid y)$
$p(x \mid z, y) = \frac{p(x, y, z)}{p(y, z)} = \frac{p(x \mid y) \, p(y) \, p(z \mid y)}{p(y) \, p(z \mid y)} = p(x \mid y)$
Therefore $p(x, z \mid y) = p(x \mid y) \, p(z \mid y)$: $x \perp z \mid y$.

31 Two Triggers, One Response. X = rain? Y = wet sidewalk? Z = spilled coffee? [Figure: x → y ← z]
$p(x, y, z) = \prod_{n \in \{x, y, z\}} p(n \mid \pi_n) = p(x) \, p(y \mid x, z) \, p(z)$

32 Example Graphical Models. [Figure: x → y ← z] $p(x, y, z) = \prod_{n \in \{x, y, z\}} p(n \mid \pi_n) = p(x) \, p(y \mid x, z) \, p(z)$. Are x and z conditionally independent given y? I.e., does $p(x, z \mid y) = p(x \mid y) \, p(z \mid y)$ hold?

33 Example Graphical Models. [Figure: x → y ← z] $p(x, y, z) = \prod_{n \in \{x, y, z\}} p(n \mid \pi_n) = p(x) \, p(y \mid x, z) \, p(z)$.
$p(x, z \mid y) = p(x \mid z, y) \, p(z \mid y)$
$p(x \mid z, y) = \frac{p(x, y, z)}{p(y, z)} = \frac{p(x) \, p(y \mid x, z) \, p(z)}{p(y \mid z) \, p(z)} = \frac{p(x) \, p(y \mid x, z)}{p(y \mid z)} \neq p(x \mid y)$ in general.
So $p(x, z \mid y) \neq p(x \mid y) \, p(z \mid y)$: x is not conditionally independent of z given y.

34 Factorization. [Figure: DAG with edges x0 → x1, x0 → x2, x1 → x3, x2 → x4, x1 → x5, x4 → x5]
$p(x_0, x_1, x_2, x_3, x_4, x_5) = ?$

35 Factorization. [Figure: the same DAG]
$p(x_0, x_1, x_2, x_3, x_4, x_5) = p(x_0) \, p(x_1 \mid x_0) \, p(x_2 \mid x_0) \, p(x_3 \mid x_1) \, p(x_4 \mid x_2) \, p(x_5 \mid x_1, x_4)$
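
A sketch of evaluating this factorization from conditional probability tables, assuming all six variables are binary; the CPT values below are hypothetical placeholders:

```python
# Each table stores p(x_i = 1 | parents); the numbers are made up.
p_x0 = 0.5
p_x1 = {0: 0.3, 1: 0.8}             # p(x1=1 | x0)
p_x2 = {0: 0.4, 1: 0.6}             # p(x2=1 | x0)
p_x3 = {0: 0.2, 1: 0.7}             # p(x3=1 | x1)
p_x4 = {0: 0.1, 1: 0.9}             # p(x4=1 | x2)
p_x5 = {(0, 0): 0.1, (0, 1): 0.5,
        (1, 0): 0.4, (1, 1): 0.95}  # p(x5=1 | x1, x4)

def bern(p, v):
    """p(x = v) for a Bernoulli with success probability p."""
    return p if v == 1 else 1 - p

def joint(x0, x1, x2, x3, x4, x5):
    """p(x0) p(x1|x0) p(x2|x0) p(x3|x1) p(x4|x2) p(x5|x1,x4)"""
    return (bern(p_x0, x0) * bern(p_x1[x0], x1) * bern(p_x2[x0], x2) *
            bern(p_x3[x1], x3) * bern(p_x4[x2], x4) *
            bern(p_x5[(x1, x4)], x5))

print(joint(1, 1, 0, 1, 0, 0))
```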

36 How large are the probability tables?
$p(x_0, x_1, x_2, x_3, x_4, x_5) = p(x_0) \, p(x_1 \mid x_0) \, p(x_2 \mid x_0) \, p(x_3 \mid x_1) \, p(x_4 \mid x_2) \, p(x_5 \mid x_1, x_4)$

37 Model Parameters as Nodes. Treating model parameters as random variables, we can include them in a graphical model. Multivariate Bernoulli: [Figure: µ0 → x0, µ1 → x1, µ2 → x2]

38 Model Parameters as Nodes. Treating model parameters as random variables, we can include them in a graphical model. Multinomial: [Figure: a single µ → x0, x1, x2]

39 Naïve Bayes Classification. [Figure: y → x0, x1, x2] The observed variables $x_i$ are independent given the class variable y. The distribution can be optimized using maximum likelihood on each variable separately, and it can easily combine various types of distributions.
$p(y \mid x_0, x_1, x_2) \propto p(x_0, x_1, x_2 \mid y) \, p(y) \propto p(x_0 \mid y) \, p(x_1 \mid y) \, p(x_2 \mid y) \, p(y)$
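
A minimal Naïve Bayes sketch over binary features, matching the factorization above (maximum likelihood on each variable separately; the data and names are illustrative):

```python
import numpy as np

def train_nb(X, y):
    """ML estimates: p(y) and p(x_j = 1 | y) for each feature j."""
    classes = np.unique(y)
    prior = {c: np.mean(y == c) for c in classes}
    cond = {c: X[y == c].mean(axis=0) for c in classes}
    return prior, cond

def predict_nb(prior, cond, x):
    """argmax_y p(y) * prod_j p(x_j | y)"""
    def score(c):
        p = cond[c]
        return prior[c] * np.prod(np.where(x == 1, p, 1 - p))
    return max(prior, key=score)

X = np.array([[1, 1, 0], [1, 0, 0], [0, 1, 1], [0, 0, 1]])
y = np.array([1, 1, 0, 0])
prior, cond = train_nb(X, y)
print(predict_nb(prior, cond, np.array([1, 1, 0])))  # -> 1
```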

40 Graphical Models. Graphical representation of dependency relationships: directed acyclic graphs, with nodes as random variables and edges defining dependency relations. What can we do with graphical models? Learn parameters to fit data; understand independence relationships between variables; perform inference (marginals and conditionals); compute likelihoods for classification.

41 Plate Notation. [Figure: y with children x0, x1, ..., xn; equivalently, y → x_i drawn inside a plate labeled n] To indicate a repeated variable, draw a plate around it.

42 Completely Observed Graphical Model. Observations for every node:

Flu  Fever  Sinus  Ache  Swell  Head
Y    L      Y      Y     Y      N
N    M      N      N     N      N
Y    H      N      N     Y      Y
Y    M      Y      N     N      Y

Simplest (least general) graph: assume each variable is independent. [Figure: six unconnected nodes Fl, Fe, Si, Ac, Sw, He]

43 Completely Observed Graphical Model. Observations for every node (the same table as above). Second simplest graph: assume complete dependence. [Figure: the six nodes Fl, Fe, Si, Ac, Sw, He, fully connected]

44 Maximum Likelihood. [Figure: the six-node DAG over x0, ..., x5] Each node has a conditional probability table, θ. Given the tables, we can construct the pdf:
$p(x \mid \theta) = \prod_{i=0}^{M-1} p(x_i \mid \pi_i, \theta_i)$
Use maximum likelihood to find the best settings of θ.

45 Maximum Likelihood.
$\theta^* = \operatorname{argmax}_\theta \ln p(X \mid \theta) = \operatorname{argmax}_\theta \sum_{n=0}^{N-1} \ln p(X_n \mid \theta) = \operatorname{argmax}_\theta \sum_{n=0}^{N-1} \ln \prod_{i=0}^{M-1} p(x_{i,n} \mid \theta_i) = \operatorname{argmax}_\theta \sum_{n=0}^{N-1} \sum_{i=0}^{M-1} \ln p(x_{i,n} \mid \theta_i)$

46 Count Functions. Count the number of times something appears in the data:
$\delta(x_n, x_m) = 1$ if $x_n = x_m$, $0$ otherwise
$m(x_i) = \sum_{n=0}^{N-1} \delta(x_i, x_{i,n})$
$m(X) = \sum_{n=0}^{N-1} \delta(X, X_n)$
Counts marginalize: $m(x_1) = \sum_{x_2} m(x_1, x_2) = \sum_{x_2} \sum_{x_3} m(x_1, x_2, x_3) = \ldots$

47 Maximum Likelihood.
$\ell(\theta) = \sum_{n=0}^{N-1} \ln p(X_n \mid \theta) = \sum_{n=0}^{N-1} \sum_X \delta(X_n, X) \ln p(X \mid \theta) = \sum_X m(X) \ln p(X \mid \theta)$
$= \sum_X m(X) \ln \prod_{i=0}^{M-1} p(x_i \mid \pi_i, \theta_i) = \sum_{i=0}^{M-1} \sum_X m(X) \ln p(x_i \mid \pi_i, \theta_i)$
$= \sum_{i=0}^{M-1} \sum_{x_i, \pi_i} \Big[ \sum_{X \setminus x_i \setminus \pi_i} m(X) \Big] \ln p(x_i \mid \pi_i, \theta_i) = \sum_{i=0}^{M-1} \sum_{x_i, \pi_i} m(x_i, \pi_i) \ln p(x_i \mid \pi_i, \theta_i)$
Define a function $\theta(x_i, \pi_i) = p(x_i \mid \pi_i, \theta_i)$ with constraint $\sum_{x_i} \theta(x_i, \pi_i) = 1$.

48 Maximum Likelihood. Use Lagrange multipliers:
$\ell(\theta) = \sum_{i=0}^{M-1} \sum_{x_i, \pi_i} m(x_i, \pi_i) \ln \theta(x_i, \pi_i) - \sum_{i=0}^{M-1} \sum_{\pi_i} \lambda_{\pi_i} \Big( \sum_{x_i} \theta(x_i, \pi_i) - 1 \Big)$
$\frac{\partial \ell(\theta)}{\partial \theta(x_i, \pi_i)} = \frac{m(x_i, \pi_i)}{\theta(x_i, \pi_i)} - \lambda_{\pi_i} = 0 \quad\Rightarrow\quad \theta(x_i, \pi_i) = \frac{m(x_i, \pi_i)}{\lambda_{\pi_i}}$
Applying the constraint: $\sum_{x_i} \frac{m(x_i, \pi_i)}{\lambda_{\pi_i}} = 1$, so $\lambda_{\pi_i} = \sum_{x_i} m(x_i, \pi_i) = m(\pi_i)$.
$\theta(x_i, \pi_i) = \frac{m(x_i, \pi_i)}{m(\pi_i)}$: counts!
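
A sketch of this closed-form result, normalized counts, using the Flu and Fever columns from the completely observed table on slide 42 and assuming (for illustration) that Flu is the parent of Fever:

```python
from collections import Counter

# (flu, fever) observations from the table: Y/N flu, L/M/H fever.
data = [("Y", "L"), ("N", "M"), ("Y", "H"), ("Y", "M")]

m_joint = Counter(data)                     # m(x_i, pi_i)
m_parent = Counter(flu for flu, _ in data)  # m(pi_i)

def theta(fever, flu):
    """theta(x_i, pi_i) = m(x_i, pi_i) / m(pi_i)"""
    return m_joint[(flu, fever)] / m_parent[flu]

print(theta("M", "Y"))  # 1/3: estimate of p(fever=M | flu=Y)
```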

49 Maximum A Posteriori Training. Bayesians would never do that; the thetas need a prior. With pseudo-counts α:
$\theta(x_i, \pi_i) = \frac{m(x_i, \pi_i) + \alpha}{m(\pi_i) + \sum_{x_i} \alpha}$
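
Continuing the counts sketch above, the MAP estimate just adds pseudo-counts; the Dirichlet-style α is an assumption about the prior's form:

```python
def theta_map(fever, flu, alpha=1.0, n_fever_values=3):
    """(m(x_i, pi_i) + alpha) / (m(pi_i) + alpha * |x_i|)"""
    return ((m_joint[(flu, fever)] + alpha) /
            (m_parent[flu] + alpha * n_fever_values))

print(theta_map("L", "N"))  # nonzero even though (N, L) was never observed
```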

50 Conditional Dependence Test. We can check conditional independence in a graphical model. Is achiness (x3) independent of the flu (x0) given fever (x1)? Is achiness (x3) independent of sinus infection (x2) given fever (x1)?
$p(x) = p(x_0) \, p(x_1 \mid x_0) \, p(x_2 \mid x_0) \, p(x_3 \mid x_1) \, p(x_4 \mid x_2) \, p(x_5 \mid x_1, x_4)$
$p(x_3 \mid x_0, x_1, x_2) = \frac{p(x_0, x_1, x_2, x_3)}{p(x_0, x_1, x_2)} = \frac{p(x_0) \, p(x_1 \mid x_0) \, p(x_2 \mid x_0) \, p(x_3 \mid x_1)}{p(x_0) \, p(x_1 \mid x_0) \, p(x_2 \mid x_0)} = p(x_3 \mid x_1)$
So $x_3 \perp x_0, x_2 \mid x_1$.

51 D-Separation and Bayes Ball. [Figure: the six-node DAG over x0, ..., x5] Intuition: nodes are separated, or blocked, by sets of nodes. E.g., nodes x1 and x2 block the path from x0 to x5, so x0 is conditionally independent of x5 given x1 and x2.

52 Bayes Ball Algorithm. To test $x_a \perp x_b \mid x_c$: shade the nodes in $x_c$; place a ball at each node in $x_a$; bounce the balls around the graph according to a set of rules; if no ball reaches $x_b$, then conditional independence holds.

53 Ten Rules of the Bayes Ball Theorem. [Figure: the ten bounce/block rule diagrams]

54 Bayes Ball Example. Is $x_0 \perp x_4 \mid x_2$? [Figure: the six-node DAG over x0, ..., x5]

55 Bayes Ball Example. Is $x_0 \perp x_5 \mid x_1, x_2$? [Figure: the six-node DAG over x0, ..., x5]

56 Undirected Graphs. What if we allow undirected graphs? What do they correspond to? Not cause/effect or trigger/response, but general dependence. Example: image pixels, where each pixel is a Bernoulli variable, $p(x_{11}, \ldots, x_{1M}, \ldots, x_{M1}, \ldots, x_{MM})$. Bright pixels have bright neighbors. No parents, just probabilities. Grid models like this are called Markov Random Fields.

57 Undirected Graphs. Undirected separability is easy. [Figure: nodes A, B, C, D, with C between A and B] To check conditional independence of A and B given C, check the graph reachability of A and B without going through nodes in C.
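
A sketch of that reachability check as a breadth-first search over an undirected adjacency list (the example graph is made up):

```python
from collections import deque

def separated(adj, a, b, given):
    """True if every path from a to b passes through a node in `given`."""
    blocked, frontier, seen = set(given), deque([a]), {a}
    while frontier:
        node = frontier.popleft()
        if node == b:
            return False  # found an unblocked path
        for nbr in adj[node]:
            if nbr not in seen and nbr not in blocked:
                seen.add(nbr)
                frontier.append(nbr)
    return True

adj = {"A": ["C"], "B": ["C", "D"], "C": ["A", "B"], "D": ["B"]}
print(separated(adj, "A", "B", {"C"}))  # True: C blocks the only path
```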

58 Next Time. More fun with Graphical Models. Read Chapter 8.1,
