Quantitative Biology II

1 Quantitative Biology II
Lecture 3: Markov Chain Monte Carlo
March 9, 2015

2 Plan for Today
- Introduction to Sampling
- Introduction to MCMC
- Metropolis Algorithm
- Metropolis-Hastings Algorithm
- Gibbs Sampling
- Monitoring Convergence
- Examples

3 Sampling Motivation
- So far we have focused on models for which exact inference is possible
- In general, this will not be true, e.g., for models with non-Gaussian continuous distributions or large clique sizes
- There are two main options available in such cases:
  - Approximate inference
  - Sampling methods
- Today: sampling (Monte Carlo) methods; Mickey: approximate inference

4 Sampling
- Suppose exact inference is impossible for a pdf $p(x)$, but samples $x^{(1)}, x^{(2)}, \ldots, x^{(N)}$ can be drawn
- Many properties of interest can be estimated if N is sufficiently large, e.g.,
  $E[f(x)] \approx \frac{1}{N} \sum_{n=1}^{N} f(x^{(n)})$
- Note that the samples need not be independent, but if they are not, then N must be larger

5 Toy Example
- Radius: r = 1
- Side: s = 2
- Number of darts: N
- Number in circle: k
- E[k/N] = π/4
- π ≈ 4k/N
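The dart-throwing estimate can be sketched in a few lines of Python. This is a minimal illustration, not code from the lecture: the function name `estimate_pi` and the fixed seed are assumptions, and the square is taken to be [-1, 1] x [-1, 1] so that the inscribed circle has radius r = 1.

```python
import random

def estimate_pi(n_darts, seed=0):
    """Estimate pi by throwing darts uniformly at the square [-1, 1] x [-1, 1]."""
    rng = random.Random(seed)
    k = 0  # number of darts that land inside the unit circle
    for _ in range(n_darts):
        x = rng.uniform(-1.0, 1.0)
        y = rng.uniform(-1.0, 1.0)
        if x * x + y * y <= 1.0:
            k += 1
    return 4.0 * k / n_darts  # pi is approximately 4k/N

for n in (100, 10_000, 100_000):
    print(n, estimate_pi(n))
```

The error shrinks roughly as 1/sqrt(N), so each extra digit of accuracy costs about 100 times more darts.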

6 Monte Carlo History
- We have estimated the area of the circle by Monte Carlo integration
- Monte Carlo methods were pioneered by mathematicians and statistical physicists during and after the Manhattan Project (esp. Stan Ulam, Nicholas Metropolis, John von Neumann)
- Interest in sampling theory dates back to the early days of probability theory, but putting it to work required electronic computers

7 Simple Sampling
- Uniform distribution: generate a pseudorandom number between 0 and a large M (e.g., RAND_MAX), then divide by M
- More complex distributions:
  - Inversion method
  - Rejection sampling
  - Importance sampling
  - Sampling-importance-resampling

8 Inversion Method
- Example: [figure]
- If $h(y)$ is a CDF, and y is a random variate from the desired distribution, then $x = h(y)$ is uniformly distributed: $x \sim U(0, 1)$
- Thus, a uniform random variate $x'$ can be converted to a random variate $y'$ by inverting h: $y' = h^{-1}(x')$
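As a concrete instance of the inversion method, the sketch below draws exponential variates. The choice of the exponential distribution is an assumption made here for illustration (the slide's own example did not survive): its CDF is h(y) = 1 - exp(-rate * y), so the inverse is y = -ln(1 - x) / rate. The function name `sample_exponential` is hypothetical.

```python
import math
import random

def sample_exponential(rate, rng):
    """Draw one Exponential(rate) variate by inverting its CDF h(y) = 1 - exp(-rate * y)."""
    x = rng.random()                   # x' ~ U(0, 1), in [0, 1)
    return -math.log(1.0 - x) / rate   # y' = h^{-1}(x')

rng = random.Random(0)
draws = [sample_exponential(2.0, rng) for _ in range(100_000)]
print(sum(draws) / len(draws))  # should be close to the true mean 1 / rate = 0.5
```

The method needs a closed-form (or cheaply computable) inverse CDF, which is why the other schemes on the list are needed for more complex distributions.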

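9 Rejection Sampling
Algorithm:
- Sample $x_0$ from $q(x)$
- Sample $u \sim U(0, 1)$
- Accept $x_0$ if $u < \frac{p(x_0)}{M q(x_0)}$, where M is a constant chosen so that $M q(x) \ge p(x)$ for all x
- Otherwise reject $x_0$ and continue sampling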
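A minimal sketch of the rejection loop follows. Everything specific in it is an assumption for illustration: the target is an unnormalized Beta(2, 5) shape on [0, 1], the proposal q is U(0, 1), and `M` names the envelope constant satisfying M q(x) >= p(x).

```python
import random

def p_tilde(x):
    """Unnormalized target on [0, 1]: a Beta(2, 5) shape (assumed for illustration)."""
    return x * (1.0 - x) ** 4

# Envelope constant: p_tilde peaks at x = 0.2 with value 0.2 * 0.8**4 = 0.08192,
# so M * q(x) = 0.082 bounds it everywhere when q is U(0, 1).
M = 0.082

def rejection_sample(n, rng):
    """Return n accepted samples and the observed acceptance rate."""
    samples = []
    proposals = 0
    while len(samples) < n:
        x0 = rng.random()                  # sample x0 from q(x) = U(0, 1), density 1
        u = rng.random()                   # u ~ U(0, 1)
        proposals += 1
        if u < p_tilde(x0) / (M * 1.0):    # accept if u < p(x0) / (M q(x0))
            samples.append(x0)
    return samples, n / proposals

samples, rate = rejection_sample(20_000, random.Random(0))
print(sum(samples) / len(samples), rate)   # mean should be near 2/7
```

A tighter envelope raises the acceptance rate, which is the motivation for adapting the envelope as points are drawn.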

10 Adaptive Rejection Sampling
- Suppose x is univariate and $p(x)$ is log-concave
- Given a set of points $\{x_1, \ldots, x_n\}$, define a piecewise linear envelope function for $\ln p(x)$
- Drawing from the envelope function is straightforward, because it has a piecewise exponential form
- Initialize with a grid of points. As new points are drawn, they can be added to the set, improving the envelope function

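11 Importance Sampling
- Suppose we seek the expectation $E[f(X)]$
- We can't sample from $p(x)$, but we can evaluate the density
- Suppose, in addition, we can sample from a simpler $q(x)$
- Importance sampling follows from:
  $E[f(X)] = \int f(x)\, p(x)\, dx = \int f(x)\, \frac{p(x)}{q(x)}\, q(x)\, dx \approx \frac{1}{N} \sum_{n=1}^{N} \frac{p(x^{(n)})}{q(x^{(n)})} f(x^{(n)})$, with $x^{(n)} \sim q(x)$
- More generally, for unnormalized distributions $\tilde{p}$ and $\tilde{q}$,
  $E[f(X)] \approx \sum_{n=1}^{N} w_n f(x^{(n)})$
  where
  $w_n = \frac{\tilde{p}(x^{(n)}) / \tilde{q}(x^{(n)})}{\sum_{m=1}^{N} \tilde{p}(x^{(m)}) / \tilde{q}(x^{(m)})}$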
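The unnormalized (self-normalized) form can be sketched as follows. The target and proposal are assumptions chosen for illustration: the target is a standard normal known only up to a constant, and the proposal is a wider normal N(0, 2^2), also used unnormalized. The function name `importance_estimate` is hypothetical.

```python
import math
import random

def importance_estimate(f, n, rng):
    """Self-normalized importance sampling estimate of E_p[f(X)].

    Target (unnormalized): p~(x) = exp(-x^2 / 2), i.e. a standard normal.
    Proposal: q = N(0, 2^2), used unnormalized as q~(x) = exp(-x^2 / 8).
    """
    num = 0.0
    den = 0.0
    for _ in range(n):
        x = rng.gauss(0.0, 2.0)                               # x ~ q
        r = math.exp(-x * x / 2.0) / math.exp(-x * x / 8.0)   # p~(x) / q~(x)
        num += r * f(x)
        den += r
    return num / den   # equals sum_n w_n f(x_n) with normalized weights w_n

print(importance_estimate(lambda x: x * x, 100_000, random.Random(0)))  # E[X^2] = 1
```

The proposal is deliberately wider than the target so that the ratio p~/q~ stays bounded; a proposal with lighter tails than the target can give weights with very high variance.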

12 Sampling-Importance-Resampling
- The same idea can be incorporated into a sampling scheme
- Start by drawing N points from $q(x)$ and computing weights, similar to those above
- Now draw M points with probabilities given by these weights
- As N approaches infinity, the resampling distribution approaches $p(x)$
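A short sketch of the two-stage scheme, under the same assumed setup as a typical importance sampling example: target N(0, 1) known up to a constant, proposal N(0, 2^2). The function name `sir` is hypothetical; `random.Random.choices` does the weighted resampling with replacement and normalizes the weights internally.

```python
import math
import random

def sir(n, m, rng):
    """Sampling-importance-resampling with proposal N(0, 2^2) and target N(0, 1)."""
    points = [rng.gauss(0.0, 2.0) for _ in range(n)]           # stage 1: N draws from q
    # Unnormalized weights p~/q~ = exp(-x^2/2) / exp(-x^2/8) = exp(-3 x^2 / 8)
    weights = [math.exp(-3.0 * x * x / 8.0) for x in points]
    return rng.choices(points, weights=weights, k=m)           # stage 2: M weighted draws

resampled = sir(50_000, 20_000, random.Random(0))
mean = sum(resampled) / len(resampled)
var = sum((x - mean) ** 2 for x in resampled) / len(resampled)
print(mean, var)   # should be near 0 and 1 for the assumed target
```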

13 MCMC
- The basic idea of MCMC is to sample variables (or subsets of variables) conditional on previous samples
- Typically, these conditional distributions are easier to work with than the full joint distribution
- Successive samples will be correlated. The samples form a Markov chain whose state space equals the support of the joint distribution
- MCMC is designed such that the long-term average (stationary) distribution of the chain equals the desired distribution
- Basic approach: collect many samples, try to show convergence

14 Notation
- As with EM, assume some variables are observed and denote them x
- Assume other variables are latent and denote them z
- The observed variables will be held fixed throughout the procedure, while the latent variables will be sampled
- The state space of the Markov chain therefore equals the space of possible values of z, and its stationary distribution is $p(z \mid x)$
- Key problem: what should $p(z^{(t+1)} \mid z^{(t)}, x)$ be?

15 Illustration of MCMC
- The transition probabilities must be designed so that the stationary distribution is $p(z \mid x)$.
- After a suitable burn-in period, samples drawn from each $p(z^{(t)} \mid z^{(t-1)}, x)$ will be representative of $p(z \mid x)$.
- However, they will not be independent samples.

16 Bivariate Normal Example
- Suppose x is a set of n points on the two-dimensional plane
- These points are assumed to be drawn independently from a bivariate normal distribution with unknown mean μ
- The goal is to infer the distribution of μ given x (the posterior)
- A (diffuse) normal prior is assumed for μ

17 Bivariate Normal, cont.
- In this case, we can derive an exact closed-form solution for the posterior distribution, but suppose we wish to use MCMC instead
- Here z is the mean μ, and the state space of the Markov chain is the set of points on the two-dimensional plane. The observed variable x is fixed at the given set of points
- Transitions can be thought of as moves from one point on the plane to another, and a sequence of samples will trace a 2d trajectory
- Over the long term, points from this trajectory will represent the posterior $p(\mu \mid x)$

18 Illustration [figure]

19 How Does MCMC Work?
- How can we set the transition probabilities such that the equilibrium distribution is the posterior, without knowing what the posterior is?

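20 Marginals for a Markov Chain
- Let $z = (z^{(1)}, z^{(2)}, \ldots, z^{(N)})$ be a (first-order) Markov chain, with $z^{(t)} \in S$ for $t \in \{1, \ldots, N\}$. For simplicity, assume S is a finite set.
- Let $\pi^{(t)}$ be the marginal distribution of $z^{(t)}$:
  $\pi^{(t)}(i) = P(z^{(t)} = i)$
- Thus,
  $\pi^{(t+1)}(j) = \sum_{i \in S} \pi^{(t)}(i)\, A_{ij}$, where $A_{ij} = P(z^{(t+1)} = j \mid z^{(t)} = i)$
  or, in matrix notation (with $\pi^{(t)}$ a row vector),
  $\pi^{(t+1)} = \pi^{(t)} A$
- Given an initial distribution $\pi^{(0)}$, $\pi^{(t)}$ is given by:
  $\pi^{(t)} = \pi^{(0)} A^t$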

21 Stationary Distribution
- We say the chain is invariant, or stationary, when $\pi^{(t)} = \pi^{(t+1)} = \pi^*$, i.e.,
  $\pi^* = \pi^* A$, where A is the transition matrix and $\pi^*$ is a row vector
- A Markov chain may have more than one stationary distribution. For example, every distribution is invariant when A = I
- If the Markov chain is ergodic, however, then it will always converge to a single stationary distribution:
  $\lim_{t \to \infty} \pi^{(0)} A^t = \pi^*$ for any initial distribution $\pi^{(0)}$
- This distribution is given by the left eigenvector of A corresponding to its largest eigenvalue, which equals 1
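Convergence to a single stationary distribution can be checked numerically on a small chain. The sketch below assumes a row-vector convention, with `A[i][j]` the probability of moving from state i to state j; the three-state matrix is a made-up example, and repeated application of the marginal update (power iteration) stands in for an explicit eigenvector computation.

```python
def step(pi, A):
    """One step of the marginal update: pi_next[j] = sum_i pi[i] * A[i][j]."""
    n = len(A)
    return [sum(pi[i] * A[i][j] for i in range(n)) for j in range(n)]

def stationary(A, pi0, n_steps=500):
    """Apply the update n_steps times, starting from the initial distribution pi0."""
    pi = list(pi0)
    for _ in range(n_steps):
        pi = step(pi, A)
    return pi

# Hypothetical ergodic 3-state chain (each row sums to 1).
A = [[0.9, 0.1, 0.0],
     [0.2, 0.7, 0.1],
     [0.0, 0.3, 0.7]]

print(stationary(A, [1.0, 0.0, 0.0]))
print(stationary(A, [0.0, 0.0, 1.0]))
```

Both starting points end up at approximately (0.6, 0.3, 0.1), which can be verified by hand to satisfy pi = pi A for this matrix.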

22 Ergodicity
- To be ergodic, the chain must be:
  - Irreducible: there must be a positive probability of reaching any state from any other
  - Aperiodic: it must not cycle through states deterministically
  - Non-transient: it must always be able to return to a state after visiting it
- In designing transition distributions for MCMC, irreducibility is typically the critical property
- Ergodicity is automatic if the transitions to all states have nonzero probability

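23 Reversibility
- A Markov chain is said to be reversible with respect to a distribution $\pi^*$ if:
  $\pi^*(i)\, A_{ij} = \pi^*(j)\, A_{ji}$ for all $i, j \in S$, where $A_{ij}$ is the probability of a transition from state i to state j
- Reversibility with respect to a distribution $\pi^*$ is sufficient to make $\pi^*$ invariant:
  $\sum_i \pi^*(i)\, A_{ij} = \sum_i \pi^*(j)\, A_{ji} = \pi^*(j) \sum_i A_{ji} = \pi^*(j)$
- Thus, if a Markov chain is constructed to be ergodic and reversible with respect to some $\pi^*$, then it will converge to $\pi^*$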

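24 Metropolis Algorithm
- Suppose transitions are proposed from a symmetric distribution $q(z^{(t)} \mid z^{(t-1)})$, i.e., such that $q(z^{(t)} = a \mid z^{(t-1)} = b) = q(z^{(t)} = b \mid z^{(t-1)} = a)$
- Now suppose proposals are accepted with probability (implicitly conditioning on x):
  $a(z^{(t-1)}, z^{(t)}) = \min\left(1, \frac{p(z^{(t)})}{p(z^{(t-1)})}\right)$
- Thus:
  $p(z)\, q(z' \mid z)\, a(z, z') = \min\big(p(z)\, q(z' \mid z),\ p(z')\, q(z' \mid z)\big) = \min\big(p(z)\, q(z \mid z'),\ p(z')\, q(z \mid z')\big) = p(z')\, q(z \mid z')\, a(z', z)$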

25 Implications
- This simple procedure guarantees reversibility of the Markov chain with respect to the posterior $p(z)$ simply by evaluating ratios of densities
- Furthermore, ratios of posterior densities can be computed as ratios of complete data densities:
  $\frac{p(z' \mid x)}{p(z \mid x)} = \frac{p(z', x) / p(x)}{p(z, x) / p(x)} = \frac{p(z', x)}{p(z, x)}$
- As discussed, reversibility with respect to $p(z)$ implies that $p(z)$ is a stationary distribution of the Markov chain
- If the Markov chain is also ergodic, then $p(z)$ is the unique stationary distribution of the Markov chain

26 Logistics
- The proposal distribution has to be designed to guarantee ergodicity
- The chain will not reach stationarity immediately; a burn-in period is required. Suppose it consists of B steps
- Suppose S samples are collected following the B burn-in steps
- A sample can be collected on each iteration, but successive samples may be highly correlated, resulting in an effective sample size << S. It may be more efficient to retain every kth sample

27 Metropolis Algorithm
initialize with z^(0) s.t. p(z^(0) | x) > 0
t ← 1
repeat
    sample z^(t) from q(z^(t) | z^(t-1), x)
    compute: a(z^(t-1), z^(t)) = min(1, p(z^(t) | x) / p(z^(t-1) | x))
    draw u from U(0,1)
    if (u > a(z^(t-1), z^(t))) z^(t) ← z^(t-1)   /* reject proposal */
    if (t > B and t mod k = 0) retain sample z^(t)
    t ← t + 1
until enough samples (t = B + Sk)
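The loop above translates fairly directly into Python. The sketch below is one possible rendering for a scalar z, not the lecture's code: the function name `metropolis`, the Gaussian random-walk proposal, and the standard normal target used in the demo are all assumptions. Densities are handled on the log scale, which is a common precaution against underflow.

```python
import math
import random

def metropolis(log_p, z0, propose_sd, B, S, k, rng):
    """Random-walk Metropolis for a scalar z.

    log_p: log of the (possibly unnormalized) target density.
    B, S, k: burn-in length, number of retained samples, thinning interval.
    """
    z = z0
    retained = []
    for t in range(1, B + S * k + 1):
        z_new = rng.gauss(z, propose_sd)             # symmetric proposal q(z' | z)
        log_a = min(0.0, log_p(z_new) - log_p(z))    # log of min(1, p(z') / p(z))
        u = rng.random()
        if u < math.exp(log_a):                      # accept; otherwise keep the old z
            z = z_new
        if t > B and t % k == 0:                     # after burn-in, keep every kth state
            retained.append(z)
    return retained

# Demo target (assumed): unnormalized standard normal, log p(z) = -z^2 / 2.
samples = metropolis(lambda z: -0.5 * z * z, 0.0, 2.4, 1000, 5000, 5, random.Random(0))
mean = sum(samples) / len(samples)
print(len(samples), mean, sum((s - mean) ** 2 for s in samples) / len(samples))
```

Only the difference `log_p(z_new) - log_p(z)` enters the acceptance step, so any normalizing constant in the target cancels.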

28 Recall: Bivariate Normal
- Suppose x is a set of n points on the two-dimensional plane
- These points are assumed to be drawn independently from a bivariate normal distribution with unknown mean μ
- The goal is to infer the distribution of μ given x (the posterior), assuming a fixed covariance equal to the identity matrix I
- A (diffuse) normal prior is assumed for μ

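29 Bivariate Normal, cont.
- As a symmetric proposal distribution for moves on the 2d plane, assume a simple Gaussian random walk:
  $q(\mu^{(t)} \mid \mu^{(t-1)}) = N(\mu^{(t)} \mid \mu^{(t-1)}, \sigma^2 I)$
- The acceptance probabilities will be:
  $a(\mu^{(t-1)}, \mu^{(t)}) = \min\left(1, \frac{p(x \mid \mu^{(t)})\, p(\mu^{(t)})}{p(x \mid \mu^{(t-1)})\, p(\mu^{(t-1)})}\right)$
- The variance $\sigma^2$ determines the average step size, and can be used as a tuning parameter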

30 Illustration
- Small $\sigma^2$: small steps, high acceptance rate
- Large $\sigma^2$: big steps, low acceptance rate
- Minimizing the correlation between successive samples, hence minimizing the number of samples needed, requires a tradeoff
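The effect of the step size can be seen in a small experiment on the bivariate normal example. This is a sketch under stated assumptions: the data are synthetic, the likelihood covariance is the identity, and the diffuse prior is taken to be N(0, 100 I), a value chosen here because the slide's prior parameters were not recovered. The names `log_posterior` and `metropolis_2d` are hypothetical.

```python
import math
import random

def log_posterior(mu, data, prior_var=100.0):
    """Unnormalized log p(mu | x): N(x_i | mu, I) likelihood, N(0, prior_var * I) prior (assumed)."""
    ll = -0.5 * sum((x - mu[0]) ** 2 + (y - mu[1]) ** 2 for x, y in data)
    lp = -0.5 * (mu[0] ** 2 + mu[1] ** 2) / prior_var
    return ll + lp

def metropolis_2d(data, sigma, n_steps, burn_in, rng):
    """Gaussian random-walk Metropolis on the plane; returns kept samples and acceptance rate."""
    mu = (0.0, 0.0)
    lp_mu = log_posterior(mu, data)
    kept, accepted = [], 0
    for t in range(1, n_steps + 1):
        prop = (rng.gauss(mu[0], sigma), rng.gauss(mu[1], sigma))   # proposal N(mu, sigma^2 I)
        lp_prop = log_posterior(prop, data)
        # Accept with probability min(1, ratio): compare log u with the log ratio.
        if math.log(1.0 - rng.random()) < lp_prop - lp_mu:
            mu, lp_mu = prop, lp_prop
            accepted += 1
        if t > burn_in:
            kept.append(mu)
    return kept, accepted / n_steps

rng = random.Random(0)
data = [(rng.gauss(1.0, 1.0), rng.gauss(-1.0, 1.0)) for _ in range(50)]   # synthetic x
for sigma in (0.02, 0.3, 3.0):
    kept, rate = metropolis_2d(data, sigma, 6000, 1000, rng)
    print(sigma, round(rate, 3))
```

The acceptance rate should fall as sigma grows. Neither extreme mixes well: tiny steps are almost always accepted but barely move, while huge steps are almost always rejected.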

31 Remarks
- Notice that probabilities (densities) are always computed from fully observed variables; no integration is necessary
- Furthermore, only ratios of densities are needed. As a result, unnormalized distributions can be used.
- The key design parameter is the proposal distribution. It must ensure that the chain is ergodic, keep the acceptance rate high, and facilitate mixing (low correlation of successive samples)
- There is a tradeoff between bold and cautious proposals in optimizing mixing

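32 Asymmetric Proposals
- The requirement of a symmetric proposal distribution is easily circumvented
- An additional term in the acceptance probability corrects for any asymmetry:
  $a(z^{(t-1)}, z^{(t)}) = \min\left(1, \frac{p(z^{(t)})\, q(z^{(t-1)} \mid z^{(t)})}{p(z^{(t-1)})\, q(z^{(t)} \mid z^{(t-1)})}\right)$
- Now:
  $p(z)\, q(z' \mid z)\, a(z, z') = \min\big(p(z)\, q(z' \mid z),\ p(z')\, q(z \mid z')\big) = p(z')\, q(z \mid z')\, a(z', z)$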

33 Metropolis-Hastings
initialize with z^(0) s.t. p(z^(0) | x) > 0
t ← 1
repeat
    sample z^(t) from q(z^(t) | z^(t-1), x)
    compute: a(z^(t-1), z^(t)) = min(1, [p(z^(t) | x) q(z^(t-1) | z^(t), x)] / [p(z^(t-1) | x) q(z^(t) | z^(t-1), x)])
    draw u from U(0,1)
    if (u > a(z^(t-1), z^(t))) z^(t) ← z^(t-1)   /* reject proposal */
    if (t > B and t mod k = 0) retain sample z^(t)
    t ← t + 1
until enough samples (t = B + Sk)
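A small example with a genuinely asymmetric proposal may help make the correction term concrete. The sketch below is an illustration under assumptions, not the lecture's example: the target is an unnormalized Gamma(shape 3, rate 1) density on x > 0, and the proposal is a multiplicative log-normal random walk, for which the ratio q(x | x') / q(x' | x) works out to x' / x. Thinning is omitted for brevity.

```python
import math
import random

def log_target(x):
    """Unnormalized log density of Gamma(shape=3, rate=1) on x > 0 (assumed example target)."""
    return 2.0 * math.log(x) - x

def metropolis_hastings(n_steps, burn_in, s, rng):
    """Metropolis-Hastings with a log-normal random-walk proposal of log-scale sd s."""
    x = 1.0
    kept = []
    for t in range(1, n_steps + 1):
        x_new = x * math.exp(rng.gauss(0.0, s))    # proposal stays positive, asymmetric in x
        # Hastings correction: q(x | x_new) / q(x_new | x) = x_new / x for this proposal.
        log_a = (log_target(x_new) - log_target(x)) + (math.log(x_new) - math.log(x))
        if math.log(1.0 - rng.random()) < log_a:   # accept with probability min(1, a)
            x = x_new
        if t > burn_in:
            kept.append(x)
    return kept

kept = metropolis_hastings(30_000, 2_000, 1.0, random.Random(0))
print(sum(kept) / len(kept))   # should be near the Gamma(3, 1) mean of 3
```

Dropping the `math.log(x_new) - math.log(x)` term would make the chain target a different distribution, which is an easy way to see that the correction matters.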

34 More Remarks
- MCMC is enormously versatile: a sampler can easily be constructed for almost any model
- It is also flexible: not only can the posterior be approximated, but so can any function of the posterior
- The critical issue is convergence. How long does the chain have to run? How can we be sure it has converged? Even if it has, have enough samples been drawn?
- Bottom line: hard problems are still hard, but MCMC with clever proposal distributions can help

35 Proposing Subsets
- If z has high dimension, it may be hard to find a proposal distribution that will result in a sufficiently high acceptance rate
- A possible solution is to partition the variables into W subsets, and to sample individual subsets conditional on the others
- On each step t, consider a subset $z_i$ (chosen randomly or by round robin) and propose a new value from:
  $q_i(z_i^{(t)} \mid z^{(t-1)}, x)$

36 Illustration [figure]

37 Gibbs Sampling
- Gibbs sampling is the special case in which the proposal distribution is defined by the exact conditional distribution:
  $q_i(z_i^{(t)} \mid z^{(t-1)}, x) = p(z_i^{(t)} \mid z_{-i}^{(t-1)}, x)$
- This proposal distribution guarantees a perfect acceptance rate: every proposal is accepted

38 Simple Example
- Suppose three latent variables, $z_1$, $z_2$, $z_3$
- Gibbs sampling will sample each in turn conditional on the other two (and on x), using the exact conditionals:
  $z_1^{(t)} \sim p(z_1 \mid z_2^{(t-1)}, z_3^{(t-1)}, x)$
  $z_2^{(t+1)} \sim p(z_2 \mid z_1^{(t)}, z_3^{(t)}, x)$
  $z_3^{(t+2)} \sim p(z_3 \mid z_1^{(t+1)}, z_2^{(t+1)}, x)$
- It can either cycle through them in order, or visit them randomly (provided each is visited with sufficiently high probability)

39 Gibbs Sampling Algorithm
initialize with z^(0) s.t. p(z^(0) | x) > 0
t ← 1
repeat
    for i ← 1 to W
        sample z_i^(t) from p(z_i^(t) | z_-i^(t-1), x)
        z_-i^(t) ← z_-i^(t-1)
        if (t > B and t mod k = 0) retain sample z^(t)
        t ← t + 1
    end for
until enough samples (t = B + Sk)
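A minimal sketch of the same loop for W = 2 subsets follows. The target is an assumption chosen because its conditionals are known exactly: a zero-mean, unit-variance bivariate normal with correlation rho, for which z1 given z2 is N(rho * z2, 1 - rho^2) and symmetrically for z2. For simplicity the sketch retains one sample per full sweep rather than counting every sub-step as its own t.

```python
import math
import random

def gibbs_bivariate_normal(rho, n_sweeps, burn_in, rng):
    """Gibbs sampler for a zero-mean, unit-variance bivariate normal with correlation rho."""
    z1, z2 = 0.0, 0.0
    cond_sd = math.sqrt(1.0 - rho * rho)     # sd of each exact conditional
    kept = []
    for sweep in range(1, n_sweeps + 1):
        z1 = rng.gauss(rho * z2, cond_sd)    # z1 ~ p(z1 | z2), always accepted
        z2 = rng.gauss(rho * z1, cond_sd)    # z2 ~ p(z2 | z1), using the new z1
        if sweep > burn_in:
            kept.append((z1, z2))
    return kept

kept = gibbs_bivariate_normal(0.8, 21_000, 1_000, random.Random(0))
n = len(kept)
m1 = sum(a for a, _ in kept) / n
m2 = sum(b for _, b in kept) / n
cov = sum((a - m1) * (b - m2) for a, b in kept) / n
v1 = sum((a - m1) ** 2 for a, _ in kept) / n
v2 = sum((b - m2) ** 2 for _, b in kept) / n
print(cov / math.sqrt(v1 * v2))   # sample correlation, should be near rho = 0.8
```

As rho approaches 1 the conditionals become very narrow, each update moves only a short distance, and successive samples become strongly correlated.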

40 Another Way to See It
- It can be shown more directly that Gibbs sampling must produce the desired stationary distribution
- Suppose the Markov chain has reached a point at which $z^{(t)} \sim p(z \mid x)$. Note that $p(z \mid x) = p(z_{-i} \mid x)\, p(z_i \mid z_{-i}, x)$
- Each Gibbs step holds $z_{-i}^{(t)}$ fixed and draws $z_i^{(t+1)}$ from the exact conditional; thus $z^{(t+1)} \sim p(z \mid x)$
- It is also easy to show directly that the chain is reversible with respect to $p(z \mid x)$

41 Ergodicity
- For the posterior to be a unique equilibrium distribution, the chain must also be ergodic (as usual)
- If all conditional distributions are nonzero everywhere, then ergodicity must hold
- Otherwise, it must be proven explicitly

42 Bivariate Normal Gibbs [figure]

43 Gaussian Mixtures
- Gibbs sampling allows the Gaussian mixtures problem to be addressed in a fully Bayesian way:
  - Assign cluster means a (Gaussian) prior
  - Mean sampling: for each cluster, sample a new mean based on the prior and the currently assigned data points
  - Assignment sampling: sample a new cluster assignment for each data point given the current cluster means
- Upon termination, summarize groupings from samples of the joint posterior
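The two alternating steps can be sketched for a one-dimensional mixture. Several simplifications here are assumptions, not part of the slides: cluster variances are fixed at 1, mixing weights are equal, each mean has an independent N(0, 100) prior, and the sampled means are sorted before being stored as a crude guard against label switching. The function name `gibbs_mixture` is hypothetical.

```python
import math
import random

def gibbs_mixture(data, n_clusters, n_sweeps, burn_in, rng, prior_var=100.0):
    """Gibbs sampler for a 1-D Gaussian mixture with unit variances and equal weights."""
    means = rng.sample(data, n_clusters)       # start from distinct data points
    kept = []
    for sweep in range(1, n_sweeps + 1):
        # Assignment sampling: p(c_i = k | x_i, means) is proportional to N(x_i | mean_k, 1).
        assign = []
        for x in data:
            logw = [-0.5 * (x - m) ** 2 for m in means]
            top = max(logw)                    # subtract the max to avoid underflow
            w = [math.exp(l - top) for l in logw]
            assign.append(rng.choices(range(n_clusters), weights=w, k=1)[0])
        # Mean sampling: conjugate normal update from the prior and the assigned points.
        for c in range(n_clusters):
            pts = [x for x, a in zip(data, assign) if a == c]
            precision = 1.0 / prior_var + len(pts)
            means[c] = rng.gauss(sum(pts) / precision, math.sqrt(1.0 / precision))
        if sweep > burn_in:
            kept.append(sorted(means))
    return kept

rng = random.Random(0)
data = [rng.gauss(-3.0, 1.0) for _ in range(100)] + [rng.gauss(3.0, 1.0) for _ in range(100)]
kept = gibbs_mixture(data, 2, 600, 100, rng)
print([sum(k[j] for k in kept) / len(kept) for j in range(2)])   # posterior mean of each cluster mean
```

Each data point receives a hard assignment on every sweep, but the assignment is redrawn stochastically, so uncertainty about membership shows up as variation across sweeps.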


45 Comparison with EM
- Both EM and Gibbs alternate between setting variables and setting parameters
- EM avoids hard assignments, instead using expectations
- Gibbs makes hard assignments but does so stochastically
- EM maximizes parameters based on expectations of random variables; Gibbs does not distinguish between parameters and random variables
- Gibbs can be seen as a stochastic hill-climbing algorithm. It may do better than EM at avoiding local maxima

46 Assessing Convergence
- Simplest approach: plot the complete log likelihood and visually assess stationarity
- This method usually gives a good guess at an appropriate burn-in length B
- It can be applied to the log likelihood or to estimated scalars
- It is a good idea to start multiple chains and see whether they end up behaving the same
- More rigorously, one can run multiple chains and compare within-chain and between-chain variances

47 Visual Inspection [figure]

48 Another Example [figure]

49 Monitoring Scalar Estimands
- Run J parallel chains, initializing from an overdispersed distribution. Collect n samples from each.
- Compute within-chain (W) and between-chain (B) variances for scalar samples
- Monitor convergence via scale reduction,
  $\hat{R} = \sqrt{\frac{\widehat{\mathrm{var}}^{+}}{W}}$, where $\widehat{\mathrm{var}}^{+} = \frac{n-1}{n} W + \frac{1}{n} B$
Gelman et al., Bayesian Data Analysis, 1995
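The scale reduction statistic is straightforward to compute from stored chains. The sketch below follows the usual textbook definitions of B and W for J chains of equal length n; the function name `scale_reduction` and the synthetic chains in the demo are assumptions for illustration.

```python
import math
import random

def scale_reduction(chains):
    """Potential scale reduction for J chains of n scalar samples each."""
    J = len(chains)
    n = len(chains[0])
    chain_means = [sum(c) / n for c in chains]
    grand_mean = sum(chain_means) / J
    # Between-chain variance B and within-chain variance W.
    B = n / (J - 1) * sum((m - grand_mean) ** 2 for m in chain_means)
    W = sum(sum((x - m) ** 2 for x in c) / (n - 1) for c, m in zip(chains, chain_means)) / J
    var_plus = (n - 1) / n * W + B / n     # pooled estimate of the marginal variance
    return math.sqrt(var_plus / W)

rng = random.Random(0)
agree = [[rng.gauss(0.0, 1.0) for _ in range(1000)] for _ in range(4)]
differ = [[rng.gauss(3.0 * j, 1.0) for _ in range(1000)] for j in range(4)]
print(scale_reduction(agree), scale_reduction(differ))
```

Values near 1 suggest the chains are sampling the same distribution; values well above 1, as for the chains with different means, indicate that they have not yet mixed.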

50 Sampling Motifs
- initialize
- extract counts, sample from Dirichlet
- compute posteriors, sample positions

51 Sampling Alignments
[Figure: sequences VLSPADK and HLAESK]

52 Sampling Alignments
VLSPAD-K
HL--AESK
[Figure: sequences VLSPADK and HLAESK]

53 Sampling Alignments
VLSPAD-K
HL--AESK

VL--SPADK
HLAES---K
[Figure: sequences VLSPADK and HLAESK]

54 Sampling Alignments
VLSPAD-K
HL--AESK

VL--SPADK
HLAES---K

-VLSPADK
H-LAES-K
[Figure: sequences VLSPADK and HLAESK]

55 Measuring Confidence [figure]
Lunter et al., Genome Res, 2008

56 That's All
- Bishop has a good introduction to sampling and MCMC
- Sampling alignments is covered in Durbin et al.
- Gelman et al. is a good reference on applied Bayesian analysis
- Thanks for listening!