Probabilistic Graphical Models

Homework 4
Due Apr 27, 12:00 noon

Submission: Homework is due on the due date at 12:00 noon. Please see the course website for the policy on late submission. You must submit two complete and identical copies of your homework: a hard copy and a soft copy.

Hard Copy Submission: You must submit a hard copy of your homework solution to Mallory Deptola (GHC 8001). Include each problem on a separate sheet of paper, with your name and Andrew ID at the top. For programming questions, the hard copy must contain both the report and the source code.

Soft Copy Submission: You must also email a complete soft copy of your homework solution to pgm.asst.2016@gmail.com. The email must be sent by the deadline. Include a PDF version of the report that is identical to the hard copy submission. For programming questions, the soft copy must also include the complete source code of your implementation. Remember to include a small README file and a script that will help us execute your code.

Rules:

1. We recommend that you typeset your homework using appropriate software such as LaTeX. If you are writing by hand, please make sure your homework is cleanly written up and legible; the TAs will not invest undue effort to decrypt bad handwriting. If you handwrite the homework, you must still submit a PDF scan for the soft copy submission.

2. You are allowed to collaborate on the homework, but you should write up your own solution and code. Please indicate your collaborators in your submission.

3. There is only one exception to the hard copy submission: if you are out of town, you must email instructor@cs.cmu.edu to notify us that you will be submitting only a soft copy. This notification must be sent by the deadline.

4. Please staple your homework.

1 MCMC vs. Variational Inference (Zhiting - 25pts)

Markov Chain Monte Carlo (MCMC) and Variational Inference (VI) are the two most popular techniques for approximate Bayesian inference. In this problem we compare these two methods. The following list shows a set of properties:

A. Fast to converge
B. Hard to assess convergence
C. Amenable to parallelization
D. Asymptotically exact
E. Deterministic optimization
F. Biased
G. Easier to work with the Chinese restaurant process representation of the Dirichlet process
H. Easier to work with the stick-breaking process representation of the Dirichlet process
I. For LDA, more memory efficient

From the above, select all properties that apply to MCMC and to VI, respectively, and explain each choice in 1-2 sentences. Note that recent advances on MCMC and VI have been made to overcome their respective shortcomings, and even to combine the two methods to obtain the best of both worlds.

MCMC:

VI:

2 Kernel Embedding (Zhiting - 30pts)

Consider a reproducing kernel Hilbert space (RKHS) $\mathcal{F}$ on $\mathcal{X}$ with kernel $k$, which is a Hilbert space of functions $f : \mathcal{X} \to \mathbb{R}$. Denote the feature map $\phi(x) := k(x, \cdot)$. Assume another RKHS $\mathcal{G}$ on $\mathcal{Y}$ with kernel $l$ and feature map $\psi(y) := l(y, \cdot)$.

1. True or False, and explain in 1-2 sentences:

(a) For $x, x' \in \mathcal{X}$, $k(x, x') = \langle \phi(x), \phi(x') \rangle_{\mathcal{F}}$.
(b) $k(x, x')$ can be seen as a similarity measure between points $x, x' \in \mathcal{X}$ which is linear in both the feature space and the original space.
(c) The cross-covariance operator $C_{XY}$ is an element of the space $\mathcal{G} \otimes \mathcal{F}$.
(d) For a conditional distribution $P(Y \mid X)$, its RKHS embedding is $\mu_{Y \mid x} := \mathbb{E}_{Y \mid x}[\psi(Y) \mid x] = C_{Y \mid X}\,\phi(x)$.
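Not part of the assignment: for intuition about the conditional embedding in part 1(d), below is a minimal numpy sketch of the standard regularized empirical estimator $\hat{\mu}_{Y \mid x} = \sum_i \beta_i(x)\,\psi(y_i)$ with $\beta(x) = (K + n\lambda I)^{-1} k_x$, where $K_{ij} = k(x_i, x_j)$ and $(k_x)_i = k(x_i, x)$. The Gaussian kernel, the regularizer $\lambda$, and the toy data are illustrative assumptions, not part of the problem.

```python
# Sketch: empirical conditional mean embedding mu_{Y|x} = C_{Y|X} phi(x),
# estimated from samples {(x_i, y_i)} with ridge regularization.
import numpy as np

def rbf(a, b, gamma=1.0):
    """Gaussian RBF kernel matrix between 1-D arrays a and b."""
    return np.exp(-gamma * (a[:, None] - b[None, :]) ** 2)

rng = np.random.default_rng(0)
n = 200
x = rng.uniform(-3, 3, n)
y = np.sin(x) + 0.1 * rng.standard_normal(n)      # Y depends on X

K = rbf(x, x)                                     # Gram matrix K_ij = k(x_i, x_j)
lam = 1e-3
x_star = 1.0
k_x = rbf(x, np.array([x_star]))[:, 0]            # (k_x)_i = k(x_i, x*)
beta = np.linalg.solve(K + n * lam * np.eye(n), k_x)

# Pairing mu_hat_{Y|x*} with a linear output kernel l(y, y') = y y'
# recovers the conditional mean E[Y | x*]:
print("estimated E[Y | x*=1]:", beta @ y, "  truth:", np.sin(1.0))
```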

2. According to the law of total expectation, we have $\mu_X = \mathbb{E}_{XY}[\phi(X)] = \mathbb{E}_Y \mathbb{E}_{X \mid Y}[\phi(X) \mid Y]$. Prove:

(a) The sum rule in RKHS: $\mu_X = C_{X \mid Y}\,\mu_Y$.
(b) The chain rule in RKHS: $\mu_{XY} = C_{X \mid Y}\,\mu_Y^{\otimes}$ and, symmetrically, $\mu_{YX} = C_{Y \mid X}\,\mu_X^{\otimes}$, where $\mu_Y^{\otimes} = \mathbb{E}_Y[\psi(Y) \otimes \psi(Y)]$ and $\mu_X^{\otimes} = \mathbb{E}_X[\phi(X) \otimes \phi(X)]$.

3. Using the notation on p. 35 of the Lecture 20 slides, with training examples $\{X, \mathbf{y}\} = \{(x_1, y_1), \ldots, (x_n, y_n)\}$, the predictive mean of Gaussian process regression at test points $X_*$ is $K(X_*, X)\,(K(X, X) + \sigma^2 I)^{-1}\,\mathbf{y}$. Briefly explain (in 1-2 sentences) the relation between the predictive mean and the conditional embedding.
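Not part of the assignment: a minimal numpy sketch of the predictive-mean formula from part 3, $K(X_*, X)(K(X, X) + \sigma^2 I)^{-1}\mathbf{y}$. The RBF kernel, the noise level, and the toy data are illustrative assumptions.

```python
# Sketch: GP regression predictive mean K(X_*, X) (K(X, X) + sigma^2 I)^{-1} y.
import numpy as np

def rbf(a, b, gamma=0.5):
    """RBF kernel matrix between 1-D arrays a and b."""
    return np.exp(-gamma * (a[:, None] - b[None, :]) ** 2)

rng = np.random.default_rng(1)
X = rng.uniform(-4, 4, 50)                        # training inputs
y = np.cos(X) + 0.1 * rng.standard_normal(50)     # noisy training targets
sigma2 = 0.01                                     # observation noise variance

X_star = np.linspace(-4, 4, 5)                    # test inputs
alpha = np.linalg.solve(rbf(X, X) + sigma2 * np.eye(len(X)), y)
mean_star = rbf(X_star, X) @ alpha                # predictive mean at X_star

print(np.round(mean_star, 2))
print(np.round(np.cos(X_star), 2))                # ground truth for comparison
```

Note the algebraic resemblance between these weights and the conditional-embedding weights in the earlier sketch; making that connection precise is what part 3 asks for.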

3 Spectral Learning for Latent Variable Models (Calvin - 25pts)

Suppose we roll a $K$-sided die, then use the outcome $Z$ to select one of $K$ categorical distributions, each over $D$ categories. Then, we roll $N$ iid $D$-sided dice according to the selected categorical distribution. For this question, let us assume that $N \geq 3$.

$Z \sim \text{Categorical}(\theta)$
$X^{(1)}, \ldots, X^{(N)} \mid Z \sim \text{Categorical}(\mu^{(Z)})$

Note that we represent each of the $X^{(n)}$ as a $D$-dimensional one-hot vector. So if the $n$th $D$-sided die comes up with the $i$th category, then $X^{(n)}_i = 1$, while the other elements of $X^{(n)}$ are set to 0. Furthermore, $\theta$ is a $K$-dimensional vector on the probability simplex, and each $\mu^{(k)}$ is a $D$-dimensional vector on the probability simplex, for all $k \in \{1, \ldots, K\}$.

1. How many independent parameters are there, in terms of $K$, $D$, and $N$?

2. How many independent equations $f$ are there of the form $f(\theta, \mu) = \mathbb{E}[X^{(n)}_i]$? (Note that $\mathbb{E}[X^{(n)}_1]$ and $\mathbb{E}[X^{(n)}_2]$ count as independent equations; however, we are not counting $\mathbb{E}[X^{(n_1)}_1]$ and $\mathbb{E}[X^{(n_2)}_1]$ as independent.)

3. How many independent functions $g$ are there of the form $g(\theta, \mu) = \mathbb{E}[X^{(n_1)}_i X^{(n_2)}_j]$, $n_1 \neq n_2$?

4. How many independent functions $h$ are there of the form $h(\theta, \mu) = \mathbb{E}[X^{(n_1)}_i X^{(n_2)}_j X^{(n_3)}_k]$, $n_1 \neq n_2 \neq n_3$?

5. Combining the first-, second-, and third-order functions above, how many independent equations do we have in terms of $K$, $D$, and $N$?

6. What inequality in terms of $K$ and $D$ will guarantee that we have enough equations to estimate the parameters using only first-, second-, and third-order moments? (Hint: this is sometimes called the blessing of dimensionality!)

7. The above inequality does not depend on $N$, under our assumption that $N \geq 3$. Explain in a sentence or two why we still benefit from larger $N$, even though $N \geq 3$ is sufficient to have an estimate of all the parameters.
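Not part of the assignment: for intuition about the moment conditions in questions 2-4, this numpy sketch samples from the model and compares empirical first- and second-order moments with their closed forms $\mathbb{E}[X^{(n)}] = \sum_k \theta_k \mu^{(k)}$ and $\mathbb{E}[X^{(n_1)} (X^{(n_2)})^{\top}] = \sum_k \theta_k \mu^{(k)} (\mu^{(k)})^{\top}$, which follow from conditional independence given $Z$. The particular $K$, $D$, $\theta$, $\mu$, and sample size are illustrative assumptions.

```python
# Sketch: Monte Carlo check of the first- and second-order moments of the
# latent-die model: Z ~ Cat(theta), X^{(1..N)} | Z iid ~ Cat(mu^{(Z)}).
import numpy as np

rng = np.random.default_rng(0)
K, D, N, S = 3, 5, 3, 20_000             # S = number of Monte Carlo samples

theta = rng.dirichlet(np.ones(K))        # mixing weights over the K dice
mu = rng.dirichlet(np.ones(D), size=K)   # row k is the k-th categorical

Z = rng.choice(K, size=S, p=theta)       # latent die for each sample
X = np.stack([rng.multinomial(1, mu[z], size=N) for z in Z])  # (S, N, D) one-hot

m1_hat = X[:, 0].mean(axis=0)                          # estimates E[X^{(1)}]
m2_hat = np.einsum('si,sj->ij', X[:, 0], X[:, 1]) / S  # estimates E[X^{(1)} X^{(2)T}]

m1 = theta @ mu                          # sum_k theta_k mu^{(k)}
m2 = mu.T @ np.diag(theta) @ mu          # sum_k theta_k mu^{(k)} mu^{(k)T}
print("first-order error: ", np.abs(m1_hat - m1).max())
print("second-order error:", np.abs(m2_hat - m2).max())
```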

4 Parameter Learning in BN and MRF (Yuntian - 10pts)

1. (3pts) CRF Parameter Learning. Consider the process of gradient-ascent training for a CRF log-linear model with k features, given a data set D with M instances. Assume for simplicity that the cost of computing a single feature over a single instance in our data set is constant, as is the cost of computing the expected value of each feature once we have a marginal over the variables in its scope. Also assume that we can compute each required marginal in constant time after we have a calibrated clique tree. (Clique tree calibration is a smart way of reusing messages in the message-passing algorithm for calculating marginals on a graphical model; all you need to know is that once clique tree calibration finishes, each required marginal can be computed in constant time.) Assume that we use clique tree calibration to compute the expected sufficient statistics in this model, and that the cost of running clique tree calibration is c. Assume that we need r iterations for the gradient process to converge. (We are using a batch algorithm, so each iteration means using all the data to calculate the gradients once.) What is the cost of this procedure?

A. O(r(Mc + k))
B. O(r(Mk + c))
C. O(M(k + rc))
D. O(Mk + r(c + k))

2. (3pts) MRF Parameter Learning. Consider the process of gradient-ascent training for a log-linear model with k features, given a data set D with M instances. Assume for simplicity that the cost of computing a single feature over a single instance in our data set is constant, as is the cost of computing the expected value of each feature once we compute a marginal over the variables in its scope. Also assume that we can compute each required marginal in constant time after we have a calibrated clique tree. Assume that we use clique tree calibration to compute the expected sufficient statistics in this model and that the cost of doing this is c. Also, assume that we need r iterations for the gradient process to converge. What is the cost of this procedure?

A. O(Mk + rck)
B. O(Mk + rc)
C. O(M(k + rc))
D. O(Mk + r(c + k))

3. (3pts) Parameter Learning in MNs vs. BNs. Compared to learning parameters in Bayesian networks, learning in Markov networks is generally (choose all that apply):

A. less difficult, as we do not need to account for the directed nature of factors as we do in a Bayes net.
B. more difficult, because factors in MNs need not sum to 1.
C. equally difficult, though MN inference will be better by a constant-factor difference in computation time, as we do not need to worry about directionality.
D. more difficult, because we cannot use parallel optimization of subparts of our likelihood as we often can in BN learning.

5 CRF & HMM Comparison (Yuntian - 10pts)

1. (6pts) Explain the difference between Conditional Random Fields and Hidden Markov Models with respect to the following factors. Please give only a one-line explanation for each.

A. Type of model: generative or discriminative
B. Objective function optimized
C. Whether a normalization constant is required

2. (4pts) Consider the following scenarios. Choose an appropriate model (CRF or HMM) for each task and briefly justify your selection.

A. Image segmentation. Given an image, we want to segment the image into semantically meaningful parts, e.g., grass, road, water, and sky. We already have some labelled images with hand-crafted features. The features may correlate with each other.

B. POS tagging. Given a sentence, we want to assign Part-Of-Speech tags to the individual words. We have both labelled sentences and unlabelled sentences.
