
Quantitative Biology II
Lecture 3: Markov Chain Monte Carlo
March 9, 2015

Slide 2: Plan for Today
- Introduction to Sampling
- Introduction to MCMC
- Metropolis Algorithm
- Metropolis-Hastings Algorithm
- Gibbs Sampling
- Monitoring Convergence
- Examples

Slide 3: Sampling Motivation
- So far we have focused on models for which exact inference is possible.
- In general, this will not be true, e.g., for models with non-Gaussian continuous distributions or large clique sizes.
- There are two main options available in such cases: approximate inference and sampling methods.
- Today: sampling (Monte Carlo) methods; Mickey will cover approximate inference.

Slide 4: Sampling
- Suppose exact inference is impossible for a pdf p(x), but samples x^(1), x^(2), ..., x^(N) can be drawn.
- Many properties of interest can be estimated if N is sufficiently large, e.g., expectations: E[f(X)] ≈ (1/N) Σ_n f(x^(n)).
- Note that the samples need not be independent, but if they are not, N must be larger.

Slide 5: Toy Example
- Circle radius: r = 1; square side: s = 2
- Number of darts: N; number landing in circle: k
- E[k/N] = π/4, so π ≈ 4k/N
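The dart-throwing estimate is easy to reproduce in code. A minimal Python sketch (the function name and dart count are mine, not from the slides):

import random

def estimate_pi(n_darts, seed=0):
    """Estimate pi by throwing darts uniformly at the square [-1, 1]^2
    and counting the fraction that land inside the unit circle."""
    rng = random.Random(seed)
    k = 0
    for _ in range(n_darts):
        x, y = rng.uniform(-1, 1), rng.uniform(-1, 1)
        if x * x + y * y <= 1:
            k += 1
    return 4 * k / n_darts

print(estimate_pi(1_000_000))  # roughly 3.14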

Slide 6: Monte Carlo History
- We have just estimated the area of the circle by Monte Carlo integration.
- Monte Carlo methods were pioneered by mathematicians and statistical physicists during and after the Manhattan Project (esp. Stan Ulam, Nicholas Metropolis, John von Neumann).
- Interest in sampling theory dates back to the early days of probability theory, but putting it to work required electronic computers.

Slide 7: Simple Sampling
- Uniform distribution: generate a pseudorandom integer between 0 and a large M (e.g., RAND_MAX), then divide by M.
- More complex distributions:
  - Inversion method
  - Rejection sampling
  - Importance sampling
  - Sampling-importance-resampling

Slide 8: Inversion Method
- If h(y) is a CDF and y is a random variate from the desired distribution, then x = h(y) is uniformly distributed: x ~ U(0, 1).
- Thus, a uniform random variate x′ can be converted to a random variate y′ by inverting h: y′ = h⁻¹(x′).
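The slide's worked example was an image and did not survive extraction; as a standard stand-in, here is the exponential case, where inverting the CDF h(y) = 1 − e^(−λy) gives y = −ln(1 − x)/λ:

import math, random

def sample_exponential(lam, rng=random.Random(0)):
    """Draw from Exponential(lam) by inverting its CDF h(y) = 1 - exp(-lam*y)."""
    x = rng.random()                   # x ~ U(0, 1)
    return -math.log(1.0 - x) / lam   # y = h^{-1}(x)

samples = [sample_exponential(2.0) for _ in range(100_000)]
print(sum(samples) / len(samples))    # should be near 1/lam = 0.5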

Slide 9: Rejection Sampling
Algorithm:
- Sample x0 from a proposal q(x), chosen with a constant k such that k q(x) ≥ p(x) for all x.
- Sample u ~ U(0, 1).
- Accept x0 if u < p(x0) / (k q(x0)).
- Otherwise reject x0 and continue sampling.
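A minimal sketch of this loop (the target, proposal, and envelope constant are my choices, not the slide's): sampling from p(x) = 6x(1−x) on [0, 1], a Beta(2, 2) density whose maximum is 1.5, using a uniform proposal:

import random

def rejection_sample(n, rng=random.Random(1)):
    """Draw n samples from p(x) = 6x(1-x) on [0,1] using a U(0,1)
    proposal q(x) = 1 and envelope constant k = 1.5, so k*q(x) >= p(x)."""
    k = 1.5
    out = []
    while len(out) < n:
        x0 = rng.random()          # x0 ~ q
        u = rng.random()           # u ~ U(0, 1)
        if u < 6 * x0 * (1 - x0) / (k * 1.0):   # accept w.p. p(x0)/(k q(x0))
            out.append(x0)
    return out

print(sum(rejection_sample(50_000)) / 50_000)   # mean of Beta(2,2) is 0.5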

Slide 10: Adaptive Rejection Sampling
- Suppose x is univariate and ln p(x) is concave (p is log-concave).
- Given a set of points {x1, ..., xn}, define a piecewise-linear envelope function for ln p(x).
- Drawing from the envelope function is straightforward: it has piecewise exponential form.
- Initialize with a grid of points. As new points are drawn, they can be added to the set, improving the envelope function.

Slide 11: Importance Sampling
- Suppose we seek f̄ = E[f(X)]. We can't sample from p(x), but we can evaluate the density.
- Suppose, in addition, we can sample from a simpler q(x). Importance sampling follows from:
  E[f(X)] = ∫ f(x) p(x) dx = ∫ f(x) (p(x)/q(x)) q(x) dx ≈ (1/N) Σ_n w(x^(n)) f(x^(n)), where w(x) = p(x)/q(x).
- More generally, for unnormalized distributions p̃ and q̃,
  f̂ = Σ_n w_n f(x^(n)) / Σ_n w_n, where w_n = p̃(x^(n)) / q̃(x^(n)).
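A small sketch of the normalized case (target, proposal, and test function are my choices): estimating E[X²] = 1 under p = N(0, 1) by sampling from a wider q = N(0, 2):

import math, random

def normal_pdf(x, mu, sigma):
    return math.exp(-0.5 * ((x - mu) / sigma) ** 2) / (sigma * math.sqrt(2 * math.pi))

def importance_estimate(n, rng=random.Random(2)):
    """Estimate E[X^2] under p = N(0,1) by drawing from q = N(0,2)
    and reweighting each draw by w = p(x)/q(x)."""
    total = 0.0
    for _ in range(n):
        x = rng.gauss(0, 2)    # x ~ q
        w = normal_pdf(x, 0, 1) / normal_pdf(x, 0, 2)
        total += w * x * x
    return total / n

print(importance_estimate(200_000))   # should be near 1.0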

Slide 12: Sampling-Importance-Resampling
- The same idea can be incorporated into a sampling scheme.
- Start by drawing N points from q(x) and computing weights like those above.
- Now draw M points (with replacement) from these N, with probabilities given by the weights.
- As N approaches infinity, the resampling distribution approaches p(x).
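A generic sketch of the two-stage scheme; the function names and the normal target/proposal in the usage line are hypothetical placeholders, and the densities need only be known up to a constant:

import math, random

def sir(n_proposals, m_keep, target_pdf, proposal_sampler, proposal_pdf,
        rng=random.Random(3)):
    """Sampling-importance-resampling: draw N points from the proposal,
    weight them by target/proposal, then resample M of them (with
    replacement) in proportion to the weights."""
    xs = [proposal_sampler(rng) for _ in range(n_proposals)]
    ws = [target_pdf(x) / proposal_pdf(x) for x in xs]
    return rng.choices(xs, weights=ws, k=m_keep)

# Example: resample an (unnormalized) N(0,1) target from a wider N(0,2) proposal.
target = lambda x: math.exp(-0.5 * x * x)
prop_pdf = lambda x: math.exp(-0.125 * x * x)
draws = sir(10_000, 1_000, target, lambda r: r.gauss(0, 2), prop_pdf)
print(sum(draws) / len(draws))   # near 0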

Slide 13: MCMC
- The basic idea of MCMC is to sample variables (or subsets of variables) conditional on previous samples.
- Typically, these conditional distributions are easier to work with than the full joint distribution.
- Successive samples will be correlated. The samples form a Markov chain whose state space equals the support of the joint distribution.
- MCMC is designed so that the long-term average (stationary) distribution of the chain equals the desired distribution.
- Basic approach: collect many samples and try to show convergence.

Slide 14: Notation
- As with EM, assume some variables are observed and denote them x.
- Assume other variables are latent and denote them z.
- The observed variables will be held fixed throughout the procedure, while the latent variables will be sampled.
- The state space of the Markov chain therefore equals the space of possible values of z, and its stationary distribution is p(z | x).
- Key problem: what should p(z^(t+1) | z^(t), x) be?

Slide 15: Illustration of MCMC
- The transition probabilities must be designed so that the stationary distribution is p(z | x).
- After a suitable burn-in period, samples drawn from each p(z^(t) | z^(t−1), x) will be representative of p(z | x).
- However, they will not be independent samples.

Slide 16: Bivariate Normal Example
- Suppose x is a set of n points on the two-dimensional plane.
- These points are assumed to be drawn independently from a bivariate normal distribution with unknown mean μ.
- The goal is to infer the distribution of μ given x (the posterior).
- A (diffuse) normal prior on μ is assumed.

Slide 17: Bivariate Normal, cont.
- In this case we can derive an exact closed-form solution for the posterior distribution, but suppose we wish to use MCMC instead.
- Here z is the mean μ, and the state space of the Markov chain is the set of points on the two-dimensional plane. The observed variable x is fixed at the given set of points.
- Transitions can be thought of as moves from one point on the plane to another, and a sequence of samples will trace a 2D trajectory.
- Over the long term, points from this trajectory will represent the posterior p(μ | x).

Slide 18: Illustration (figure)

Slide 19: How Does MCMC Work?
How can we set the transition probabilities so that the equilibrium distribution is the posterior, without knowing what the posterior is?

Slide 20: Marginals for a Markov Chain
- Let z = (z^(1), z^(2), ..., z^(N)) be a (first-order) Markov chain, with z^(t) ∈ S for t ∈ {1, ..., N}. For simplicity, assume S is a finite set.
- Let π^(t) be the marginal distribution of z^(t): π^(t)(s) = P(z^(t) = s).
- Thus, π^(t+1)(s′) = Σ_s π^(t)(s) A(s, s′), where A(s, s′) = P(z^(t+1) = s′ | z^(t) = s), or, in matrix notation (with π a row vector), π^(t+1) = π^(t) A.
- Given an initial distribution π^(0), π^(t) is given by π^(t) = π^(0) Aᵗ.

Slide 21: Stationary Distribution
- We say the chain is invariant, or stationary, when π^(t) = π^(t+1) = π*, i.e., π* = π* A.
- A Markov chain may have more than one stationary distribution. For example, every distribution is invariant when A = I.
- If the Markov chain is ergodic, however, then it will always converge to a single stationary distribution: π^(t) → π* as t → ∞, regardless of π^(0).
- This distribution is given by the (left) eigenvector of A corresponding to the largest eigenvalue (which is 1).
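A small sketch showing both views on a toy chain (assuming NumPy; the 3-state matrix is my example): iterating π^(t+1) = π^(t) A converges to the same π* as the leading left eigenvector of A.

import numpy as np

# A toy 3-state transition matrix; rows sum to 1, chain is ergodic.
A = np.array([[0.5, 0.4, 0.1],
              [0.2, 0.6, 0.2],
              [0.1, 0.3, 0.6]])

# View 1: iterate pi <- pi A from an arbitrary start.
pi = np.array([1.0, 0.0, 0.0])
for _ in range(200):
    pi = pi @ A

# View 2: leading left eigenvector of A (right eigenvector of A.T).
vals, vecs = np.linalg.eig(A.T)
v = np.real(vecs[:, np.argmax(np.real(vals))])
v = v / v.sum()    # normalize to a probability vector

print(pi)   # both print the same stationary distribution
print(v)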

Slide 22: Ergodicity
- To be ergodic, the chain must be:
  - Irreducible: there must be positive probability of reaching any state from any other.
  - Aperiodic: it must not cycle through states deterministically.
  - Non-transient: it must always be able to return to a state after visiting it.
- In designing transition distributions for MCMC, irreducibility is typically the critical property.
- Ergodicity is automatic if the transitions to all states have nonzero probability.

Slide 23: Reversibility
- A Markov chain is said to be reversible with respect to a distribution π* if it satisfies detailed balance: π*(s) A(s, s′) = π*(s′) A(s′, s) for all s, s′.
- Reversibility with respect to π* is sufficient to make π* invariant: Σ_s π*(s) A(s, s′) = Σ_s π*(s′) A(s′, s) = π*(s′).
- Thus, if a Markov chain is constructed to be ergodic and reversible with respect to some π*, then it will converge to π*.

Slide 24: Metropolis Algorithm
- Suppose transitions are proposed from a symmetric distribution q(z^(t) | z^(t−1)), i.e., such that q(z^(t)=a | z^(t−1)=b) = q(z^(t)=b | z^(t−1)=a).
- Now suppose proposals are accepted with probability (implicitly conditioning on x):
  a(z, z′) = min(1, p(z′) / p(z)).
- Thus detailed balance holds with respect to p(z):
  p(z) q(z′ | z) a(z, z′) = min(p(z), p(z′)) q(z′ | z) = p(z′) q(z | z′) a(z′, z).

Slide 25: Implications
- This simple procedure guarantees reversibility of the Markov chain with respect to the posterior p(z | x) simply by evaluating ratios of densities.
- Furthermore, ratios of posterior densities can be computed as ratios of complete-data densities: p(z′ | x) / p(z | x) = p(z′, x) / p(z, x).
- As discussed, reversibility with respect to p(z | x) implies that p(z | x) is a stationary distribution of the Markov chain.
- If the Markov chain is also ergodic, then p(z | x) is the unique stationary distribution of the chain.

Slide 26: Logistics
- The proposal distribution has to be designed to guarantee ergodicity.
- The chain will not reach stationarity immediately; a burn-in period is required. Suppose it consists of B steps.
- Suppose S samples are collected following the B burn-in steps.
- A sample can be collected on each iteration, but successive samples may be highly correlated, resulting in an effective sample size much smaller than S. It may be more efficient to retain every kth sample.

Slide 27: Metropolis Algorithm

initialize with z^(0) s.t. p(z^(0) | x) > 0
t ← 1
repeat
    sample z^(t) from q(z^(t) | z^(t−1), x)
    compute a(z^(t−1), z^(t)) = min(1, p(z^(t), x) / p(z^(t−1), x))
    draw u from U(0, 1)
    if (u > a(z^(t−1), z^(t))) z^(t) ← z^(t−1)   /* reject proposal */
    if (t > B and t mod k = 0) retain sample z^(t)
    t ← t + 1
until enough samples (t = B + Sk)

Slide 28: Recall: Bivariate Normal
- Suppose x is a set of n points on the two-dimensional plane.
- These points are assumed to be drawn independently from a bivariate normal distribution with unknown mean μ (assume fixed covariance I).
- The goal is to infer the distribution of μ given x (the posterior).
- A (diffuse) normal prior on μ is assumed.

Slide 29: Bivariate Normal, cont.
- As a symmetric proposal distribution for moves on the 2D plane, assume a simple Gaussian random walk: μ′ ~ N(μ^(t−1), σ²I).
- The acceptance probabilities will be a(μ, μ′) = min(1, [p(μ′) Π_i N(x_i | μ′, I)] / [p(μ) Π_i N(x_i | μ, I)]).
- The variance σ² determines the average step size and can be used as a tuning parameter.
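A runnable sketch of this sampler (assuming NumPy; the N(0, 100 I) prior, the data, and the tuning values are my stand-ins for the slide's unspecified choices). It follows the Slide 27 template, working in log space for numerical stability:

import numpy as np

def log_post(mu, x, prior_var=100.0):
    """Unnormalized log posterior: N(0, prior_var*I) prior times N(mu, I) likelihood."""
    log_prior = -0.5 * mu @ mu / prior_var
    log_lik = -0.5 * np.sum((x - mu) ** 2)
    return log_prior + log_lik

def metropolis(x, n_iter=20_000, burn_in=2_000, thin=10, sigma=0.5, seed=0):
    rng = np.random.default_rng(seed)
    mu = np.zeros(2)    # z^(0)
    samples = []
    for t in range(1, n_iter + 1):
        prop = mu + sigma * rng.standard_normal(2)    # symmetric random walk
        log_a = log_post(prop, x) - log_post(mu, x)   # log acceptance ratio
        if np.log(rng.uniform()) < log_a:             # accept w.p. min(1, ratio)
            mu = prop
        if t > burn_in and t % thin == 0:             # burn-in B, keep every kth
            samples.append(mu.copy())
    return np.array(samples)

x = np.random.default_rng(1).normal([2.0, -1.0], 1.0, size=(50, 2))
print(metropolis(x).mean(axis=0))   # posterior mean, near the sample mean of x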

Slide 30: Illustration
- Small σ²: small steps, high acceptance rate.
- Large σ²: big steps, low acceptance rate.
- Minimizing the correlation between successive samples, and hence the number of samples needed, requires a tradeoff.

Slide 31: Remarks
- Notice that probabilities (densities) are always computed from fully observed variables; no integration is necessary.
- Furthermore, only ratios of densities are needed. As a result, unnormalized distributions can be used.
- The key design parameter is the proposal distribution. It must ensure that the chain is ergodic, keep the acceptance rate high, and facilitate mixing (low correlation of successive samples).
- There is a tradeoff between bold and cautious proposals in optimizing mixing.

Slide 32: Asymmetric Proposals
- The requirement of a symmetric proposal distribution is easily circumvented.
- An additional term in the acceptance probability corrects for any asymmetry:
  a(z, z′) = min(1, [p(z′) q(z | z′)] / [p(z) q(z′ | z)]).
- Detailed balance again follows:
  p(z) q(z′ | z) a(z, z′) = min(p(z) q(z′ | z), p(z′) q(z | z′)) = p(z′) q(z | z′) a(z′, z).

Slide 33: Metropolis-Hastings

initialize with z^(0) s.t. p(z^(0) | x) > 0
t ← 1
repeat
    sample z^(t) from q(z^(t) | z^(t−1), x)
    compute a(z^(t−1), z^(t)) = min(1, [p(z^(t), x) q(z^(t−1) | z^(t), x)] / [p(z^(t−1), x) q(z^(t) | z^(t−1), x)])
    draw u from U(0, 1)
    if (u > a(z^(t−1), z^(t))) z^(t) ← z^(t−1)   /* reject proposal */
    if (t > B and t mod k = 0) retain sample z^(t)
    t ← t + 1
until enough samples (t = B + Sk)
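A sketch of the Hastings correction in action (my example, not the slides'): sampling a positive scalar from a Gamma(3, 1) target using an asymmetric multiplicative log-normal proposal, for which the correction q(z | z′)/q(z′ | z) reduces to z′/z:

import math, random

def log_gamma_pdf(z, shape=3.0):
    """Unnormalized log density of Gamma(shape, 1)."""
    return (shape - 1) * math.log(z) - z

def mh_lognormal(n_iter=50_000, step=0.5, rng=random.Random(4)):
    z, samples = 1.0, []
    for _ in range(n_iter):
        zp = z * math.exp(step * rng.gauss(0, 1))   # asymmetric proposal
        # Hastings ratio for this proposal: q(z|z')/q(z'|z) = z'/z.
        log_a = log_gamma_pdf(zp) - log_gamma_pdf(z) + math.log(zp / z)
        if math.log(rng.random()) < log_a:
            z = zp
        samples.append(z)
    return samples

s = mh_lognormal()
print(sum(s) / len(s))   # should approach the Gamma(3,1) mean, 3.0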

Slide 34: More Remarks
- MCMC is enormously versatile: a sampler can easily be constructed for almost any model.
- It is also flexible: not only can the posterior be approximated, but so can any function of the posterior.
- The critical issue is convergence. How long does the chain have to run? How can we be sure it has converged? Even if it has, have enough samples been drawn?
- Bottom line: hard problems are still hard, but MCMC with clever proposal distributions can help.

Slide 35: Proposing Subsets
- If z has high dimension, it may be hard to find a proposal distribution that will result in a sufficiently high acceptance rate.
- A possible solution is to partition the variables into W subsets, and to sample individual subsets conditional on the others.
- On each step t, consider a subset z_i (chosen randomly or by round robin) and propose a new value from q(z_i^(t) | z_i^(t−1), z_-i^(t−1), x), leaving the remaining variables z_-i unchanged.

Slide 36: Illustration (figure)

Slide 37: Gibbs Sampling
- Gibbs sampling is the special case in which the proposal distribution is the exact conditional distribution: q(z_i^(t) | z^(t−1), x) = p(z_i^(t) | z_-i^(t−1), x).
- This proposal distribution guarantees a perfect acceptance rate (every proposal is accepted).

Slide 38: Simple Example
- Suppose three latent variables z1, z2, z3.
- Gibbs sampling will sample each in turn, conditional on the other two (and on x), using the exact conditionals (unsampled variables carry over from one step to the next):
  z1^(t) ~ p(z1 | z2^(t−1), z3^(t−1), x)
  z2^(t+1) ~ p(z2 | z1^(t), z3^(t), x)
  z3^(t+2) ~ p(z3 | z1^(t+1), z2^(t+1), x)
- It can either cycle through them in order, or visit them randomly (provided each is visited with sufficiently high probability).

Slide 39: Gibbs Sampling Algorithm

initialize with z^(0) s.t. p(z^(0) | x) > 0
t ← 1
repeat
    for i ← 1 to W
        sample z_i^(t) from p(z_i^(t) | z_-i^(t−1), x)
        z_-i^(t) ← z_-i^(t−1)
        if (t > B and t mod k = 0) retain sample z^(t)
        t ← t + 1
    end for
until enough samples (t = B + Sk)
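A minimal sketch of the two-variable case (my example, anticipating Slide 42): Gibbs sampling from a bivariate normal with correlation rho, where each exact conditional is itself normal:

import random

def gibbs_bivariate_normal(n_iter=50_000, rho=0.8, rng=random.Random(5)):
    """Gibbs sampler for (z1, z2) ~ N(0, [[1, rho], [rho, 1]]).
    Each conditional is N(rho * other, 1 - rho**2)."""
    z1 = z2 = 0.0
    sd = (1 - rho ** 2) ** 0.5
    samples = []
    for _ in range(n_iter):
        z1 = rng.gauss(rho * z2, sd)   # draw z1 | z2
        z2 = rng.gauss(rho * z1, sd)   # draw z2 | z1
        samples.append((z1, z2))
    return samples

s = gibbs_bivariate_normal()
print(sum(a * b for a, b in s) / len(s))   # estimates E[z1*z2] = rho = 0.8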

Slide 40: Another Way to See It
- It can be shown more directly that Gibbs sampling must produce the desired stationary distribution.
- Suppose the Markov chain has reached a point at which z^(t) ~ p(z | x). Note that p(z | x) = p(z_-i | x) p(z_i | z_-i, x).
- Each Gibbs step holds z_-i^(t) fixed and draws z_i^(t+1) from the exact conditional; thus z^(t+1) ~ p(z | x).
- It is also easy to show directly that the chain is reversible with respect to p(z | x).

Slide 41: Ergodicity
- For the posterior to be the unique equilibrium distribution, the chain must also be ergodic (as usual).
- If all conditional distributions are nonzero everywhere, then ergodicity must hold.
- Otherwise, it must be proven explicitly.

Slide 42: Bivariate Normal Gibbs (figure)

Slide 43: Gaussian Mixtures
- Gibbs sampling allows the Gaussian mixture problem to be addressed in a fully Bayesian way (see the sketch below):
  - Assign cluster means a (Gaussian) prior.
  - Mean sampling: for each cluster, sample a new mean based on the prior and the currently assigned data points.
  - Assignment sampling: sample a new cluster assignment for each data point given the current cluster means.
- Upon termination, summarize groupings from samples of the joint posterior.
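A compact sketch of this scheme (assuming NumPy; 1-D data, two clusters, unit observation variance, and N(0, 10²) priors on the means are all my stand-ins for the unstated slide details):

import numpy as np

def gibbs_gmm(x, n_iter=2_000, k=2, prior_var=100.0, seed=0):
    """Alternate between sampling cluster assignments given the means and
    sampling each cluster mean given its assigned points (unit obs. variance)."""
    rng = np.random.default_rng(seed)
    mu = rng.standard_normal(k)
    for _ in range(n_iter):
        # Assignment sampling: p(c_i = j) proportional to N(x_i | mu_j, 1).
        logp = -0.5 * (x[:, None] - mu[None, :]) ** 2
        p = np.exp(logp - logp.max(axis=1, keepdims=True))
        p /= p.sum(axis=1, keepdims=True)
        c = np.array([rng.choice(k, p=row) for row in p])
        # Mean sampling: conjugate normal posterior for each cluster mean.
        for j in range(k):
            pts = x[c == j]
            prec = 1.0 / prior_var + len(pts)   # posterior precision
            m = pts.sum() / prec                # posterior mean (prior mean 0)
            mu[j] = rng.normal(m, prec ** -0.5)
    return mu, c

x = np.concatenate([np.random.default_rng(1).normal(-3, 1, 100),
                    np.random.default_rng(2).normal(3, 1, 100)])
mu, c = gibbs_gmm(x)
print(np.sort(mu))   # should be near [-3, 3]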

Slide 44: (figure)

Slide 45: Comparison with EM
- Both EM and Gibbs alternate between setting variables and setting parameters.
- EM avoids hard assignments, instead using expectations.
- Gibbs makes hard assignments, but does so stochastically.
- EM maximizes parameters based on expectations of random variables; Gibbs does not distinguish between parameters and random variables.
- Gibbs can be seen as a stochastic hill-climbing algorithm. It may do better than EM at avoiding local maxima.

Slide 46: Assessing Convergence
- Simplest approach: plot the complete-data log likelihood and visually assess stationarity.
- Using this method, one can usually make a good guess at an appropriate burn-in length B.
- This can be applied to the log likelihood or to estimated scalars.
- It is a good idea to start multiple chains and see whether they end up behaving the same.
- More rigorously, one can run multiple chains and compare within-chain and between-chain variances.

Slide 47: Visual Inspection (figure)

Slide 48: Another Example (figure)

Slide 49: Monitoring Scalar Estimands
- Run J parallel chains, initializing from an overdispersed distribution. Collect n samples from each.
- Compute the within-chain (W) and between-chain (B) variances for the scalar samples.
- Monitor convergence via the estimated potential scale reduction factor R̂, which compares a weighted combination of W and B against W (Gelman et al., Bayesian Data Analysis, 1995).
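A sketch of the standard Gelman-Rubin computation (assuming NumPy; this is the common textbook weighting, which may differ cosmetically from the slide's dropped formula):

import numpy as np

def gelman_rubin(chains):
    """Potential scale reduction R-hat for a (J, n) array of scalar samples,
    one row per chain."""
    chains = np.asarray(chains, dtype=float)
    j, n = chains.shape
    chain_means = chains.mean(axis=1)
    B = n * chain_means.var(ddof=1)           # between-chain variance
    W = chains.var(axis=1, ddof=1).mean()     # mean within-chain variance
    var_plus = (n - 1) / n * W + B / n        # pooled variance estimate
    return np.sqrt(var_plus / W)              # ~1 at convergence

# Two well-mixed chains from the same distribution: R-hat near 1.
rng = np.random.default_rng(0)
print(gelman_rubin(rng.normal(0, 1, size=(2, 5000))))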

Slide 50: Sampling Motifs (figure: a cycle of Gibbs steps)
- initialize
- extract counts, sample from Dirichlet
- compute posteriors, sample positions

Slide 51: Sampling Alignments (figure: the unaligned sequences VLSPADK and HLAESK)

Slide 52: Sampling Alignments (figure): a first sampled alignment:
VLSPAD-K
HL--AESK

Slide 53: Sampling Alignments (figure): a second sampled alignment:
VLSPAD-K
HL--AESK

VL--SPADK
HLAES---K

Slide 54: Sampling Alignments (figure): a third sampled alignment:
VLSPAD-K
HL--AESK

VL--SPADK
HLAES---K

-VLSPADK
H-LAES-K

Slide 55: Measuring Confidence (figure; Lunter et al., Genome Res, 2008)

Slide 56: That's All
- Bishop has a good introduction to sampling and MCMC.
- Sampling alignments is covered in Durbin et al.
- Gelman et al. is a good reference on applied Bayesian analysis.
- Thanks for listening!