MCMC Methods for data modeling

Similar documents
Markov Chain Monte Carlo (part 1)

1 Methods for Posterior Simulation

Overview. Monte Carlo Methods. Statistics & Bayesian Inference Lecture 3. Situation At End Of Last Week

Markov chain Monte Carlo methods

Quantitative Biology II!

Bayesian Statistics Group 8th March Slice samplers. (A very brief introduction) The basic idea

Short-Cut MCMC: An Alternative to Adaptation

Approximate Bayesian Computation. Alireza Shafaei - April 2016

Clustering Relational Data using the Infinite Relational Model

A Short History of Markov Chain Monte Carlo

Monte Carlo for Spatial Models

Nested Sampling: Introduction and Implementation

Image analysis. Computer Vision and Classification Image Segmentation. 7 Image analysis

Bayesian Estimation for Skew Normal Distributions Using Data Augmentation

An Efficient Model Selection for Gaussian Mixture Model in a Bayesian Framework

MCMC Diagnostics. Yingbo Li MATH Clemson University. Yingbo Li (Clemson) MCMC Diagnostics MATH / 24

Clustering. Mihaela van der Schaar. January 27, Department of Engineering Science University of Oxford

An Introduction to Markov Chain Monte Carlo

ADAPTIVE METROPOLIS-HASTINGS SAMPLING, OR MONTE CARLO KERNEL ESTIMATION

Approximate Bayesian Computation using Auxiliary Models

Particle Filters for Visual Tracking

Discussion on Bayesian Model Selection and Parameter Estimation in Extragalactic Astronomy by Martin Weinberg

Linear Modeling with Bayesian Statistics

Expectation-Maximization Methods in Population Analysis. Robert J. Bauer, Ph.D. ICON plc.

Statistical techniques for data analysis in Cosmology

Bayesian Modelling with JAGS and R

Statistical Matching using Fractional Imputation

Chapter 1. Introduction

A guided walk Metropolis algorithm

Convexization in Markov Chain Monte Carlo

A Nonparametric Bayesian Approach to Detecting Spatial Activation Patterns in fmri Data

Level-set MCMC Curve Sampling and Geometric Conditional Simulation

Issues in MCMC use for Bayesian model fitting. Practical Considerations for WinBUGS Users

University of Cambridge Engineering Part IIB Paper 4F10: Statistical Pattern Processing Handout 11: Non-Parametric Techniques.

Tutorial using BEAST v2.4.1 Troubleshooting David A. Rasmussen

University of Cambridge Engineering Part IIB Paper 4F10: Statistical Pattern Processing Handout 11: Non-Parametric Techniques

A GENERAL GIBBS SAMPLING ALGORITHM FOR ANALYZING LINEAR MODELS USING THE SAS SYSTEM

From Bayesian Analysis of Item Response Theory Models Using SAS. Full book available for purchase here.

Hierarchical Bayesian Modeling with Ensemble MCMC. Eric B. Ford (Penn State) Bayesian Computing for Astronomical Data Analysis June 12, 2014

Stochastic Simulation: Algorithms and Analysis

A Bayesian approach to artificial neural network model selection

A noninformative Bayesian approach to small area estimation

Parallel Multivariate Slice Sampling

Slice sampler algorithm for generalized Pareto distribution

MRF-based Algorithms for Segmentation of SAR Images

MSA101/MVE Lecture 5

BART STAT8810, Fall 2017

Sampling informative/complex a priori probability distributions using Gibbs sampling assisted by sequential simulation

MCMC for doubly-intractable distributions

Graphical Models, Bayesian Method, Sampling, and Variational Inference

Computer Vision Group Prof. Daniel Cremers. 4. Probabilistic Graphical Models Directed Models

Package atmcmc. February 19, 2015

Computer vision: models, learning and inference. Chapter 10 Graphical Models

Reddit Recommendation System Daniel Poon, Yu Wu, David (Qifan) Zhang CS229, Stanford University December 11 th, 2011

CS281 Section 9: Graph Models and Practical MCMC

10.4 Linear interpolation method Newton s method

Bayesian data analysis using R

Recap: The E-M algorithm. Biostatistics 615/815 Lecture 22: Gibbs Sampling. Recap - Local minimization methods

Semiparametric Mixed Effecs with Hierarchical DP Mixture

Outline. Bayesian Data Analysis Hierarchical models. Rat tumor data. Errandum: exercise GCSR 3.11

GiRaF: a toolbox for Gibbs Random Fields analysis

Bayesian Spatiotemporal Modeling with Hierarchical Spatial Priors for fmri

Exam Issued: May 29, 2017, 13:00 Hand in: May 29, 2017, 16:00

RJaCGH, a package for analysis of

Clustering K-means. Machine Learning CSEP546 Carlos Guestrin University of Washington February 18, Carlos Guestrin

Package dirmcmc. R topics documented: February 27, 2017

Introduction to Machine Learning CMU-10701

Machine Learning A W 1sst KU. b) [1 P] Give an example for a probability distributions P (A, B, C) that disproves

Analysis of Incomplete Multivariate Data

Modified Metropolis-Hastings algorithm with delayed rejection

Probabilistic Graphical Models

PIANOS requirements specifications

Computer Vision Group Prof. Daniel Cremers. 4. Probabilistic Graphical Models Directed Models

Simulating from the Polya posterior by Glen Meeden, March 06

Markov Random Fields and Gibbs Sampling for Image Denoising

Theoretical Concepts of Machine Learning

Debugging MCMC Code. Charles J. Geyer. April 15, 2017

Missing Data and Imputation

Collective classification in network data

Monte Carlo Techniques for Bayesian Statistical Inference A comparative review

Probabilistic Graphical Models

Introduction to Bayesian Analysis in Stata

R package mcll for Monte Carlo local likelihood estimation

Clustering K-means. Machine Learning CSEP546 Carlos Guestrin University of Washington February 18, Carlos Guestrin

Package binomlogit. February 19, 2015

AMCMC: An R interface for adaptive MCMC

Multi-Modal Metropolis Nested Sampling For Inspiralling Binaries

Parallel Adaptive Importance Sampling. Cotter, Colin and Cotter, Simon and Russell, Paul. MIMS EPrint:

Laboratorio di Problemi Inversi Esercitazione 4: metodi Bayesiani e importance sampling

BeviMed Guide. Daniel Greene

Bayesian model selection and diagnostics

Bayesian Computation: Overview and Methods for Low-Dimensional Models

Variability in Annual Temperature Profiles

A problem - too many features. TDA 231 Dimension Reduction: PCA. Features. Making new features

Inference. Inference: calculating some useful quantity from a joint probability distribution Examples: Posterior probability: Most likely explanation:

CSCI 599 Class Presenta/on. Zach Levine. Markov Chain Monte Carlo (MCMC) HMM Parameter Es/mates

Monte Carlo Methods and Statistical Computing: My Personal E

Network Traffic Measurements and Analysis

GAMES Webinar: Rendering Tutorial 2. Monte Carlo Methods. Shuang Zhao

Data Mining. CS57300 Purdue University. Bruno Ribeiro. February 1st, 2018

Transcription:

MCMC Methods for data modeling Kenneth Scerri Department of Automatic Control and Systems Engineering

Introduction 1. Symposium on Data Modelling 2. Outline: a. Definition and uses of MCMC b. MCMC algorithms i) Metropolis - Hastings (MH) algorithm ii) Gibbs Sampler iii)their variants

How does this fit? Three problems solved by data modeling 1. Regression 2. Classification 3. Density Estimation These are methods that estimate densities. These densities can then be utilized for both regression and classification We will be using y = y = N w i x i + e i=1 N θ i x i + e i=1

Definition Markov Chain Monte Carlo (MCMC) techniques are methods for sampling from probability distributions using Markov chains MCMC methods are used in data modelling for bayesian inference and numerical integration

Monte Carlo Methods Monte Carlo techniques are sampling methods Direct simulation: Let x be a random variable with distribution p(x) ; then the expectation E[x] is given by: E[x] = which can be approximated by drawing from p(x) and then evaluating Z x R E[x] 1 n xp(x)dx n x i i=1 n samples

Monte Carlo Integration Example 1. Mean of a lognormal distribution by direct simulation

Monte Carlo Integration

Markov Chains Consider the sequence of random variables {x 0,x 1,x 2,...}, sampled from the distribution p(x t+1 x t ), then each next sample x t+1 depends only on the current state x t and does not depend on the further history {x 0,x 1,...,x t 1 }. Such a sequence is call a Markov chain. Thus MCMC techniques aim to construct cleverly sampled chains which (after a burn in period) draw samples which are progressively more likely realizations of the distribution of interest; the target distribution.

Bayesian Inference Where can these methods be applied? Consider the model y = θx + e where e N(0,σ 2 ) This model can be rewritten as y θ,σ 2,x N(θx,σ 2 ) For the input set X = {x 1,x 2,...,x n } and the corresponding observations set Y = {y 1,y 2,...,y n }, Bayesian inference aims to estimate the posterior distribution of the parameter θ via Bayes Theorem.

Bayesian Inference Bayes Theoram: p(θ Y ) = p(y θ)p(θ) p(y ) where p(θ Y ) is the posterior distribution of the parameter of interest θ p(y θ) is the likelihood function p(θ) is the chosen prior distribution of θ Inference is usually performed ignoring the normalizing constant p(y ), thus utilizing p(θ Y ) p(y θ)p(θ)

MH Algorithm Some history: The Metropolis algorithm was first proposed in Metropolis et al. (1953) It was then generalized by Hastings in Hastings (1970) Made into mainstream statistics and engineering via the articles Gelfand and Smith (1990) and Gelfand et al. (1990) which presented the Gibbs sampler as used in Geman and Geman (1984) All other MCMC methods can be considered as special cases of the MH algorithm

Metropolis Algorithm The Metropolis algorithm is a random walk that uses an acceptance/rejection rule to converge to the target distribution. Its algorithm is: 1. Draw a starting sample θ 0 from a starting distribution p 0 (θ) 2. For i = 1, 2,... a. Sample a proposal θ from a proposal distribution q(θ θ i 1 ) b. Calculate the ratio r = p(θ y) p(θ i 1 y) c. Set θ i = { θ θ i 1 with probability otherwise min(r, 1)

Metropolis Algorithm Target θ 0

Metropolis Algorithm Target Proposal θ 0

Metropolis Algorithm Target Proposal θ 1 θ 0

Metropolis Algorithm Target Proposal θ 2 θ 1 θ 0

Metropolis Algorithm Target Proposal θ 2 θ 3 θ 1 θ 0

Metropolis Algorithm Example 2. Consider the ordinary linear regression model where: y θ,σ 2,x N(θx,σ 2 I) y is the outcome variable variables x = (x 1,x 2,...,x k ) θ = (θ 1,θ 2,...,θ k ) is a vector for explanatory is the parameter vector σ is the independent observation error standard deviation

Metropolis Algorithm For a sample of n idd observations (y 1,...,y n ), and a uniform noninformative prior for θ, the conditional posterior distribution for θ is given by θ σ 2,y N(ˆθ,V σ 2 ) where: ˆθ = (X T X) 1 X T y V = (X T X) 1 X is an n k matrix of explanatory variables

Metropolis Algorithm

Metropolis Algorithm

Metropolis Algorithm

Metropolis Algorithm

Notes: Metropolis Algorithm Proposal distribution must be symmetric Algorithm requires the ability the calculate the ratio r, for all (θ,θ ) A proposed sample that increases the posterior probability is always accepted Some samples that decrease the posterior probability are also accepted with probability r Proven in the limit as conditions i under certain

MH Algorithm Hastings generalized the basic Metropolis algorithm in 2 ways: 1. The proposal density need no longer be symmetric 2. consequently the ratio r, changes to r = p(θ y)/q(θ θ i 1 ) p(θ i 1 y)/q(θ i 1 θ ) Having asymmetric proposal distributions may be useful to increase speed of convergance

MH Algorithm Consideration on the choice of proposal distribution: Easy to sample Jumps go a reasonable distance in the parameter space Jumps are not rejected too frequently 2 basic ideas are most widely used: 1. normal jumps centred around the previous chain position 2. proposal distributions very close to target distribution; resulting in high acceptance ratios

Single Component MH The proposal acceptance/rejection ratio of MH and Metropolis falls as the dimension of the posterior distribution increases (especially for highly dependent parameters) resulting in slow moving chains and long simulations Simulation times can be improved by using the single component MH algorithm. Instead of updating the whole of θ together, θ is divided in components (or blocks) of different dimensions with each component updated separately. A special case of this algorithm is the Gibbs sampler

Gibbs Sampler For the Gibbs sampler the parameter vector into d components, thus θ = (θ 1,...,θ d ) θ is divided Each iteration of the sampler cycles through the d components of θ drawing each subset condition on the value of the others At each iteration the order of sampling each subset is randomly changed At each iteration, where θ i j is sampled from θ i 1 j = (θ i 1,...,θ i j 1,θ i 1 j+1,...,θi 1 d ) θ i j p(θ j θ i 1 j,y) Thus θ j is sampled based on the latest values of each component of θ

Gibbs Sampler Example 3. Consider the model with an observable vector y of d components where y θ,σ N(θ,Σ) µ Σ is a vector of means is the variance matrix Assuming a normal prior distribution for θ, given by θ N(θ 0,Σ 0 )

Gibbs Sampler The marginal conditional distribution required for the Gibbs sampler is given by where θ (1) θ (2),y N ( θ (1) n + β 1 2 (θ (2) θ (2) n ),Σ 1 2 ) ) Σ 1 n = Σ 1 0 + nσ 1 θ n = (Σ 1 0 + nσ 1 ) 1 (Σ 1 0 θ 0 + nσ 1 ȳ) β 1 2 = Σ (12) n (Σ (22) n ) 1 Σ 1 2 = Σ n (11) Σ (12) n (Σ (22) n ) 1 Σ (21) n

Gibbs Sampler No correlation between parameters

Gibbs Sampler No correlation between parameters

Gibbs Sampler Correlated parameters

Gibbs Sampler Correlated parameters

Gibbs Sampler Algorithm requires the ability to directly sample from p(θ j θ i 1 j,y), which is very often the case for many widely used models. This makes the Gibbs sampler a widely used technique. Gibbs sampler is the simplest of MCMC algorithms and should be used if sampling from the conditional posterior is possible Improving the Gibbs sampler when slow mixing: 1. reparameterize - by linear transformations 2. parameter expansion and auxiliary variables 3. better blocking

Inference Initial samples are severely effected by initial conditions and are thus not representative of the target distribution Only samples obtains once the chain has converged are used to summarize the posterior distribution and compute quantiles How to assess convergence? Run multiple simulations with starting points spread throughout the parameter space Compare the distribution from each simulation to the mixed results

Recommendation The Gibbs sampler and MH algorithms work well for a wide range of problems if 1. approximate independence withing the target distribution can be obtained for the Gibbs sampler 2. and the proposal distribution can be tuned to an acceptance of 20 to 45% for the MH algorithm Try use simplest algorithm first and if convergence is slow try improving by reparameterization, choosing different component blocks and tuning. If convergence problems remain more advanced algorithms should be used

MH Algorithm Variations Adaptive MH algorithm: 1. Start with fixed algorithm with normal proposal density N(θ θ t 1,c 2 Σ), where Σ is the variance of the target density and c = 2.4/ d with d being the posterior dimension 2. After some simulations update the proposal rule to be proportional to the estimated posterior covariance 3. Increase or decrease the scale of the proposal distribution to approach the optimal value of 0.44 (for one dimension) or 0.23 (for multidimensional jumps)

Hamiltonian Monte Carlo Gibbs and MH methods are hampered by their random walk behaviours especially for high dimensions Adding auxiliary variables can help the chain move more rapidly though the target distribution Hamiltonian methods add a momentum term φ j to each component θ j of the target space. Both are updated together using MH methods φ j gives the expected distance and direction of jump of based on the last few jumps. It favour successive jump in the same direction, allowing the simulation to move rapidly through the space of θ j θ j

Langevin Rule The symmetric jumping rules of the Metropolis algorithm cause low acceptance ratios by jumping in low probability areas The Langevin algorithm changes the jumping rule of the MH algorithm to favour jumps in the direction of the maximum gradient of the target density, thus moving the chains towards the high density regions of the distribution The proposal density depends on the location of the current sample and this is not symmetric. The MH algorithm is thus required

Slice Sampler Slice samplers adapt to distribution being sampled by sampling uniformly for the region under the density function plot. A Markov chain is constructed for a univariate distribution by: 1. Sample a random initial value θ 0 2. At each iteration i a. Sample uniformly (in the vertical slice) for the auxiliary variable uniformly in the region µ i [0, p(θ i 1 y)] b. Sample uniformly (in the horizontal slice) for θ i

Slice Sampler Target Vertical Slice Horizontal Slice Vertical Sample θ 0

Slice Sampler Target Horizontal Sample θ 0 θ 1

Slice Sampler Target Vertical Sample θ 0 θ 1 θ 2

Slice Sampler Target Vertical Sample θ 0

Slice Sampler Target Vertical Sample θ 0 θ 1

Slice Sampler Notes: The slice sampler can be used for multivariate distributions by sampling one-dimensional conditional distributions in a Gibbs structure As opposed to the Gibbs sampler it does not require to sampled for the marginal conditional distribution but needs to be able to evaluate them The slice sampler can deal with distributions exhibiting multiple modes if the modes are not separated by large regions of low probability

Simulated Tempering Sampling from multimodal distributions is problematic since chains get stuck for long in the closest mode of the distribution. Simulated tempering offers a solution in this case. It works with K + 1 chains each with a different target distribution. For the multimodal target distribution p(θ y) a common choice for the K + 1 distribution is p(θ y) 1/T k where the temperature parameter T k > 0 and k = 0,1,...,K. k = 0 results in the original density and large T k values produces less peaked modes (thus allowing the chain to move among all modes)

Simulated Tempering Algorithm is as follows: 1. Draw a starting sample θ 0 from a starting distribution p 0 (θ) 2. For each iteration t a. θ t+1 is sampled using the Markov chain with target distribution p(θ y) 1/T k b. A jump from the current sampler k to sampler j is proposed with probability J s, j. This move is then accepted with probability min(1, r) where r = c jp(θ t+1 y) 1/T j J j,k c k p(θ t+1 y) 1/T k Jk, j

Simulated Tempering Notes: The constant c k is set such that the algorithm spends equal time in each sampler Only samples from p(θ y) are then used for inference The maximum temperature must be set such that the algorithm has a significant number of less peaky distributions but not too high resulting in poor acceptance ratios Other auxiliary variable methods exist for dealing with multimodel distributions

Reversible Jump Sampling Reversible jump sampling addresses the problem of sampling from a parameter space whose dimension can change from one iteration to the next Such scenarios are common when a Markov chain is used to choose between a number of plausible models In such cases the sampled parameter space consists of both the traditional model parameters and an indication of the current model (θ,u) Let M k donate the candidate model with k = 1,...,K and let θ k donate the model parameters of model k with dimension d k

Reversible Jump Sampling θ (1) θ (2) θ (3) k = 3

Reversible Jump Sampling The algorithm is as follows: 1. Sampled an initial state (k 0,θ 0 ) 2. For each iteration a. propose a new model M k with probability J k.k b. generate an auxiliary variable u from the proposal density J(u k,k,θ k ) c. determine the proposal density model parameters (θ k,u ) = g k,k (θ k,u) d. accept the new model with probability min(r, 1)

Reversible Jump Sampling Notes: (θ k,u ) = g k,k (θ k,u) is a one-to-one deterministic mapping function between all model The resulting posterior draws provide inference about both the model and its parameters

Conclusions There is no magic in these algorithms. Great care must be taken to assess convergence Always run multiple iterations of the sampler with varying initial samples and ensure convergence of each chain to the same distribution Tuning is always required Implement the simpler algorithms first and then (if convergence issues arise) adopt more elaborate methods These are offline methods

for new researchers 27th March 2007 2007 SYMPOSIUM ON DATA MODELLING Are you a data modeller? Do you use data to generate a mathematical description of a system? Do you use data driven models to investigate the learning process? Do you create models to mimic a system for tasks such as classification, filtering or prediction? If any of this describes your work then this symposium is for you! The 2007 symposium on data modelling is a one day event aimed at postgraduate and postdoctoral researchers. It aims to bring together new researchers in data modelling from various departments in the University of She!eld. Any new researcher on models, methods and applications of data driven modelling is welcomed. We are looking for one page abstracts of recent work or work in progress for peer review for both oral and poster presentations." The deadline for submission is 1 February 2007." Registration will be FREE for presenters and will be open in the new year. datamodelling.group.shef.ac.uk/symposium