The Multi-Stage Gibbs Sampler: Data Augmentation (Dutch Example)
Rebecca C. Steorts
Bayesian Methods and Modern Statistics: STA 360/601, Module 8

Example: Data augmentation / auxiliary variables
A commonly used technique for designing MCMC samplers is data augmentation, also known as the method of auxiliary variables. Introduce one or more variables Z that depend on the distribution of the existing variables in such a way that the resulting conditional distributions, with Z included, are easier to sample from and/or yield better mixing. The Z's are latent/hidden variables, introduced solely to simplify or improve the sampler.

Idea: Create the Z's and throw them away at the end!
Suppose we want to sample from p(x, y), but p(x | y) and/or p(y | x) are complicated. Choose p(z | x, y) such that p(x | y, z), p(y | x, z), and p(z | x, y) are all easy to sample from. Then construct a Gibbs sampler that updates all three variables, drawing (X, Y, Z) from p(x, y, z). Discarding the Z's leaves samples (X, Y) from p(x, y), as sketched below.
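
Below is a minimal Python sketch of this recipe (not from the slides). The three sample_* arguments are hypothetical placeholders for whatever conditional samplers a particular model provides; the point is only the structure: update everything, keep (X, Y), discard Z.

def gibbs_with_augmentation(x0, y0, z0, sample_x, sample_y, sample_z, n_iter=5000):
    # Generic data-augmentation Gibbs sampler: cycle through the three
    # conditionals, then return only the (X, Y) draws.
    x, y, z = x0, y0, z0
    draws = []
    for _ in range(n_iter):
        x = sample_x(y, z)    # X ~ p(x | y, z)
        y = sample_y(x, z)    # Y ~ p(y | x, z)
        z = sample_z(x, y)    # Z ~ p(z | x, y)
        draws.append((x, y))  # keep only (X, Y); the Z draws are thrown away
    return draws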

Dutch Example
Consider a data set on the heights of 695 Dutch women and 562 Dutch men. Suppose we have the list of heights, but we don't know which data points are from women and which are from men.

From Figure 1, can we still infer the distribution of female heights and the distribution of male heights? Surprisingly, the answer is yes!

Figure 1: Heights of Dutch women and men, combined.

What's the magic trick? The reason is that this is a two-component mixture of Normals, and there is an (essentially) unique set of mixture parameters corresponding to any such distribution. We'll get to the details soon. Be patient!

Constructing a Gibbs sampler
To construct a Gibbs sampler for this situation, it is common to introduce an auxiliary variable Z_i for each data point, indicating which mixture component it is drawn from. In this example, Z_i indicates whether subject i is female or male. This results in a Gibbs sampler that is easy to derive and implement.

Two-component mixture model
Let's assume that both mixture components (female and male) have the same precision, say λ, and that λ is fixed and known. The usual two-component Normal mixture model is then

X_1, \ldots, X_n \mid \mu, \pi \;\overset{\text{iid}}{\sim}\; F(\mu, \pi)    (1)
\mu_0, \mu_1 \;\overset{\text{iid}}{\sim}\; \mathcal{N}(m, l^{-1}), \quad \mu := (\mu_0, \mu_1)    (2)
\pi \sim \text{Beta}(a, b)    (3)

where F(\mu, \pi) is the distribution with p.d.f.

f(x \mid \mu, \pi) = (1 - \pi)\, \mathcal{N}(x \mid \mu_0, \lambda^{-1}) + \pi\, \mathcal{N}(x \mid \mu_1, \lambda^{-1}).
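
Since the actual Dutch height measurements are not reproduced here, the Python snippet below generates a synthetic stand-in from this mixture model. The component means (168 cm and 181 cm) are assumed purely for illustration; σ = 8 cm matches the factory settings given later.

import numpy as np

rng = np.random.default_rng(0)

# Synthetic stand-in for the Dutch height data (the actual measurements are
# not given in the slides).  Component means are assumed for illustration.
sigma = 8.0                                                 # lambda = 1 / sigma^2
heights = np.concatenate([rng.normal(168.0, sigma, 695),    # "women"
                          rng.normal(181.0, sigma, 562)])   # "men"
rng.shuffle(heights)                                        # we only observe the pooled list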

The likelihood is

p(x_{1:n} \mid \mu, \pi) = \prod_{i=1}^n f(x_i \mid \mu, \pi) = \prod_{i=1}^n \Big[ (1 - \pi)\, \mathcal{N}(x_i \mid \mu_0, \lambda^{-1}) + \pi\, \mathcal{N}(x_i \mid \mu_1, \lambda^{-1}) \Big],

which is a complicated function of µ and π, making the posterior difficult to sample from directly.
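
For concreteness, here is a short Python sketch of this mixture log-likelihood (a hypothetical helper, not from the slides); every factor in the product mixes over both components, which is what makes direct posterior sampling awkward.

import numpy as np
from scipy.stats import norm

def mixture_loglik(x, mu0, mu1, pi, lam):
    # Log-likelihood of the two-component Normal mixture with shared precision lam.
    sd = 1.0 / np.sqrt(lam)
    dens = (1.0 - pi) * norm.pdf(x, mu0, sd) + pi * norm.pdf(x, mu1, sd)
    return np.sum(np.log(dens))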

Latent allocation variables to the rescue!
Define an equivalent model that includes latent allocation variables Z_1, \ldots, Z_n. These indicate which mixture component each data point comes from; that is, Z_i indicates whether subject i is female or male.

X_i \sim \mathcal{N}(\mu_{Z_i}, \lambda^{-1}) \text{ independently for } i = 1, \ldots, n    (4)
Z_1, \ldots, Z_n \mid \mu, \pi \;\overset{\text{iid}}{\sim}\; \text{Bernoulli}(\pi)    (5)
\mu_0, \mu_1 \;\overset{\text{iid}}{\sim}\; \mathcal{N}(m, l^{-1}), \quad \mu = (\mu_0, \mu_1)    (6)
\pi \sim \text{Beta}(a, b)    (7)

Recall:

X_i \sim \mathcal{N}(\mu_{Z_i}, \lambda^{-1}) \text{ independently for } i = 1, \ldots, n
Z_1, \ldots, Z_n \mid \mu, \pi \;\overset{\text{iid}}{\sim}\; \text{Bernoulli}(\pi)
\mu_0, \mu_1 \;\overset{\text{iid}}{\sim}\; \mathcal{N}(m, l^{-1}), \quad \mu = (\mu_0, \mu_1)
\pi \sim \text{Beta}(a, b)

This is equivalent to the model above, since

p(x_i \mid \mu, \pi) = p(x_i \mid Z_i = 0, \mu, \pi)\, p(Z_i = 0 \mid \mu, \pi) + p(x_i \mid Z_i = 1, \mu, \pi)\, p(Z_i = 1 \mid \mu, \pi)
= (1 - \pi)\, \mathcal{N}(x_i \mid \mu_0, \lambda^{-1}) + \pi\, \mathcal{N}(x_i \mid \mu_1, \lambda^{-1})
= f(x_i \mid \mu, \pi),

and thus it induces the same distribution on (x_{1:n}, \mu, \pi). The latent-variable model is considerably easier to work with, particularly for Gibbs sampling.
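
As a quick sanity check (not from the slides), the sketch below simulates X through the latent allocations and compares the resulting histogram with the marginal density f(x | µ, π). The parameter values are assumed for illustration only.

import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(1)
mu0, mu1, pi = 168.0, 181.0, 0.45     # assumed values, for illustration only
sd = 8.0                              # sd = 1 / sqrt(lambda)

# Simulate through the latent allocations Z_i ...
z = rng.binomial(1, pi, size=200_000)
x = rng.normal(np.where(z == 1, mu1, mu0), sd)

# ... and compare the histogram with the marginal mixture density f(x | mu, pi).
edges = np.linspace(130.0, 220.0, 61)
centers = 0.5 * (edges[:-1] + edges[1:])
hist, _ = np.histogram(x, bins=edges, density=True)
f = (1.0 - pi) * norm.pdf(centers, mu0, sd) + pi * norm.pdf(centers, mu1, sd)
print(np.max(np.abs(hist - f)))       # small: both constructions give the same law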

Full conditionals
Derive the full conditionals below as an exercise. Denote x = x_{1:n} and z = z_{1:n}.

Find (\pi \mid \mu, z, x)
Find (\mu \mid \pi, z, x)
Find (z \mid \mu, \pi, x)

As usual, each iteration of Gibbs sampling proceeds by sampling from each of these conditional distributions in turn. The derivations appear later in this module, so you can check your results after completing the exercise.

Full conditionals derived
(\pi \mid \cdots): Given z, π is conditionally independent of everything else, so this reduces to a Beta-Bernoulli model, and we have

p(\pi \mid \mu, z, x) = p(\pi \mid z) = \text{Beta}(\pi \mid a + n_1,\, b + n_0),

where n_k = \sum_{i=1}^n 1(z_i = k) for k \in \{0, 1\}.
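
A one-function Python sketch of this update (a hypothetical helper; z is assumed to be a NumPy array of 0/1 labels):

import numpy as np

def sample_pi(rng, z, a=1.0, b=1.0):
    # Draw pi | z ~ Beta(a + n1, b + n0), where nk = #{i : z_i = k}.
    n1 = int(np.sum(z))
    n0 = len(z) - n1
    return rng.beta(a + n1, b + n0)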

(\mu \mid \cdots): Given z, we know which component each data point comes from. Conditionally on z, the model is just two independent Normal-Normal models, as we have seen before:

\mu_0 \mid \mu_1, x, z, \pi \sim \mathcal{N}(M_0, L_0^{-1})
\mu_1 \mid \mu_0, x, z, \pi \sim \mathcal{N}(M_1, L_1^{-1}),

where, for k \in \{0, 1\},

n_k = \sum_{i=1}^n 1(z_i = k), \qquad L_k = l + n_k \lambda, \qquad M_k = \frac{l m + \lambda \sum_{i : z_i = k} x_i}{l + n_k \lambda}.
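
A corresponding sketch for the µ update (a hypothetical helper; x and z are NumPy arrays, and the defaults m = 175 and l = 1/15² anticipate the factory settings below):

import numpy as np

def sample_mu(rng, x, z, lam, m=175.0, l=1.0 / 15.0**2):
    # Draw (mu_0, mu_1) | x, z from the two independent Normal full conditionals.
    mu = np.empty(2)
    for k in (0, 1):
        xk = x[z == k]
        L_k = l + len(xk) * lam                 # posterior precision
        M_k = (l * m + lam * xk.sum()) / L_k    # posterior mean
        mu[k] = rng.normal(M_k, 1.0 / np.sqrt(L_k))
    return mu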

(z \mid \cdots):

p(z \mid \mu, \pi, x) \;\propto_z\; p(x, z, \pi, \mu) \;\propto_z\; p(x \mid z, \mu)\, p(z \mid \pi)
= \prod_{i=1}^n \mathcal{N}(x_i \mid \mu_{z_i}, \lambda^{-1})\, \text{Bernoulli}(z_i \mid \pi)
= \prod_{i=1}^n \big( \pi\, \mathcal{N}(x_i \mid \mu_1, \lambda^{-1}) \big)^{z_i} \big( (1 - \pi)\, \mathcal{N}(x_i \mid \mu_0, \lambda^{-1}) \big)^{1 - z_i}
= \prod_{i=1}^n \alpha_{i,1}^{z_i}\, \alpha_{i,0}^{1 - z_i}
\;\propto_z\; \prod_{i=1}^n \text{Bernoulli}\big(z_i \mid \alpha_{i,1} / (\alpha_{i,0} + \alpha_{i,1})\big),

where

\alpha_{i,0} = (1 - \pi)\, \mathcal{N}(x_i \mid \mu_0, \lambda^{-1}), \qquad \alpha_{i,1} = \pi\, \mathcal{N}(x_i \mid \mu_1, \lambda^{-1}).
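
And a sketch of the z update (a hypothetical helper; mu is a length-2 array of component means):

import numpy as np
from scipy.stats import norm

def sample_z(rng, x, mu, pi, lam):
    # Draw z_i | x_i, mu, pi ~ Bernoulli(alpha_{i,1} / (alpha_{i,0} + alpha_{i,1})).
    sd = 1.0 / np.sqrt(lam)
    a0 = (1.0 - pi) * norm.pdf(x, mu[0], sd)
    a1 = pi * norm.pdf(x, mu[1], sd)
    return rng.binomial(1, a1 / (a0 + a1))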

Figure 2: Results from one run of the mixture example. (a) Traceplots of the component means, µ_0 and µ_1. (b) Traceplot of the mixture weight, π (the prior probability that a subject comes from component 1).

Figure 3: Results from one run of the mixture example. (a) Histograms of the heights of the subjects assigned to each component, according to z_1, \ldots, z_n, in a typical sample.

One way of visualizing the allocation/assignment variables z_1, \ldots, z_n is to make histograms of the heights of the subjects assigned to each component, as in the sketch below.
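
A self-contained matplotlib sketch of this kind of plot; the heights and the allocation vector here are synthetic stand-ins, not output from an actual chain.

import numpy as np
import matplotlib.pyplot as plt

# Illustration only: synthetic heights and a crude allocation vector stand in
# for one sampled z_{1:n} from the chain.
rng = np.random.default_rng(2)
x = np.concatenate([rng.normal(168.0, 8.0, 695), rng.normal(181.0, 8.0, 562)])
z = (x > 174.0).astype(int)          # stand-in for a sampled allocation

for k in (0, 1):
    plt.hist(x[z == k], bins=30, alpha=0.6, label=f"component {k}")
plt.xlabel("height (cm)")
plt.ylabel("count")
plt.legend()
plt.show()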

My Factory Settings!
λ = 1/σ², where σ = 8 cm (≈ 3.1 inches) is the standard deviation of the subject heights within each component.
a = 1, b = 1 (Beta parameters, equivalent to a prior sample size of 1 for each component).
m = 175 cm (≈ 68.9 inches), the mean of the prior on the component means.
l = 1/s², where s = 15 cm (≈ 6 inches) is the standard deviation of the prior on the component means.

My Factory Settings!
We initialize the sampler at:
π = 1/2 (equal probability for each component)
z_1, \ldots, z_n sampled i.i.d. from Bernoulli(1/2) (initial assignments to components chosen uniformly at random)
µ_0 = µ_1 = m (component means initialized to the mean of their prior)
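
Putting the pieces together, here is a compact, self-contained Python sketch of the whole Gibbs sampler using these factory settings and this initialization. The data are a synthetic stand-in for the Dutch heights (generating means assumed for illustration only), so the output will not reproduce the figures exactly.

import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(0)

# Synthetic stand-in for the Dutch heights; the means used to generate the
# fake data (168 cm, 181 cm) are assumed and never seen by the sampler.
x = np.concatenate([rng.normal(168.0, 8.0, 695), rng.normal(181.0, 8.0, 562)])
n = len(x)

# Factory settings from the previous slides.
lam = 1.0 / 8.0**2            # within-component precision (sigma = 8 cm)
a, b = 1.0, 1.0               # Beta prior on pi
m, l = 175.0, 1.0 / 15.0**2   # prior on the component means
sd = 1.0 / np.sqrt(lam)

# Initialization as described above.
pi = 0.5
z = rng.binomial(1, 0.5, size=n)
mu = np.array([m, m])

n_iter = 2000
mu_samples = np.empty((n_iter, 2))
pi_samples = np.empty(n_iter)

for t in range(n_iter):
    # (pi | z): Beta(a + n1, b + n0)
    n1 = int(z.sum())
    pi = rng.beta(a + n1, b + (n - n1))
    # (mu | z, x): two independent Normal-Normal updates
    for k in (0, 1):
        xk = x[z == k]
        L_k = l + len(xk) * lam
        M_k = (l * m + lam * xk.sum()) / L_k
        mu[k] = rng.normal(M_k, 1.0 / np.sqrt(L_k))
    # (z | mu, pi, x): independent Bernoulli draws
    a1 = pi * norm.pdf(x, mu[1], sd)
    a0 = (1.0 - pi) * norm.pdf(x, mu[0], sd)
    z = rng.binomial(1, a1 / (a0 + a1))
    mu_samples[t] = mu
    pi_samples[t] = pi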

Figure 4: Results from another run of the mixture example. (a) Traceplots of the component means, µ_0 and µ_1. (b) Traceplot of the mixture weight, π (the prior probability that a subject comes from component 1).

Figure 5: Results from another run of the mixture example. (a) Histograms of the heights of the subjects assigned to each component, according to z_1, \ldots, z_n, in a typical sample.

Caution: watch out for modes
This example illustrates a serious issue that can go wrong with MCMC (although, fortunately, in this case the results are still valid if interpreted correctly). Why are females assigned to component 0 and males to component 1? Why not the other way around? In fact, the model is symmetric with respect to the two components, and thus the posterior is also symmetric. If we run the sampler multiple times (starting from the same initial values), sometimes it will settle on females as 0 and males as 1, and sometimes on females as 1 and males as 0 (see Figure 5). Roughly speaking, the posterior has two modes. If the sampler were behaving properly, it would move back and forth between these two modes. But it doesn't: it gets stuck in one mode and stays there.

This is a very common problem with mixture models. Fortunately, however, in the case of mixture models the results are still valid if we interpret them correctly. Specifically, our inferences will be valid as long as we only consider quantities that are invariant with respect to permutations of the components (e.g., functions that are symmetric in the component labels). One such summary is sketched below.
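
For example, the minimum and maximum of (µ_0, µ_1) in each draw do not depend on which component is labeled 0 and which is labeled 1, so their posterior summaries are safe. A small sketch (assuming mu_samples from a run like the one above):

import numpy as np

def invariant_mean_summary(mu_samples):
    # Posterior summaries of the component means that are invariant to
    # relabeling the components.  mu_samples has shape (n_draws, 2).
    lo = mu_samples.min(axis=1)    # smaller component mean in each draw
    hi = mu_samples.max(axis=1)    # larger component mean in each draw
    return lo.mean(), hi.mean()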