Monte Carlo Techniques for Bayesian Statistical Inference A comparative review

BY Y. FAN, D. WANG

School of Mathematics and Statistics, University of New South Wales, Sydney 2052, AUSTRALIA

18th January, 2007

ABSTRACT

In this article, we summarise Monte Carlo simulation methods commonly used in Bayesian statistical computing. We describe each algorithm and provide R code for its implementation via a simple 2-dimensional example. We compare the relative merits of these methods qualitatively, by considering their general user-friendliness, and numerically, in terms of mean squared error and computational time. We conclude with some general guidelines and recommendations.

Some keywords: Monte Carlo; Markov Chain Monte Carlo; Simulation; Rejection Sampling; Importance Sampling; Gibbs Sampler; Metropolis-Hastings; Adaptive Rejection; Slice Sampler; Sequential Monte Carlo.

1 Introduction

As more complex data have become available, Bayesian statistical models have become more sophisticated, making analytical calculations tedious, or simply intractable, in many cases. There is an increasing reliance on simulation-based methods for such inferences. Suppose we have a distribution $f(\theta)$ for a vector of real-valued parameters $\theta \in \Theta$, and we are interested in summary statistics for $\theta$ such as the mean, the mode or the quantiles. When $f$ is complex, inference for $\theta$ is often carried out using Monte Carlo simulations.

In this article, we summarise some of the most commonly used Monte Carlo methods: importance sampling, rejection sampling, sequential Monte Carlo, the Gibbs sampler, the adaptive rejection sampler, the slice sampler and the Metropolis-Hastings sampler. We first present these algorithms and provide simple R code to demonstrate how each is implemented for a simple example (with the exception of adaptive rejection sampling, where we make use of the WinBUGS software).

We then compare how well each method performs in terms of general applicability and user-friendliness. Quantitative measures, namely the mean squared error of the estimates and the computational time, are also recorded via the example. Throughout the paper, we base our comparisons on a bivariate Normal distribution with varying degrees of correlation as the benchmark distribution. While this benchmark distribution is relatively simple, it makes it conceptually easy for the reader to grasp the effects of the different samplers on the various statistics we monitor.

In Section 2, we first set out some notation and background for the examples. In Sections 3 and 4, we present the above algorithms, separating them into direct and iterative algorithms respectively. We compare the results of our simulation study in Section 5, and conclude with some recommendations in Section 6.

2 Notation and Benchmark Distribution

Let $\theta = (\theta_1, \ldots, \theta_d) \in \Theta$ be the $d$-dimensional parameter about which we wish to make inference, and let $f(\theta)$ denote the distribution of $\theta$. We wish to simulate $N$ samples of $\theta$ from $f(\theta)$, which we refer to as the target distribution. As a running example, taking $d = 2$, a bivariate Normal distribution with varying correlation coefficient is used to benchmark the comparisons. Here we give details of the target distribution and the notation used throughout the paper.

Let

\[ \begin{pmatrix} \theta_1 \\ \theta_2 \end{pmatrix} \sim N(\mu, \Sigma), \quad \text{where} \quad \mu = \begin{pmatrix} 0 \\ 0 \end{pmatrix}, \quad \Sigma = \begin{pmatrix} 1 & \rho \\ \rho & 1 \end{pmatrix}, \]

and $\rho$ is the correlation between the parameters $\theta_1$ and $\theta_2$. The joint distribution of $\theta_1, \theta_2$ is

\[ f(\theta_1, \theta_2) = \frac{1}{2\pi\sqrt{1-\rho^2}} \exp\left\{ -\frac{1}{2(1-\rho^2)} \left( \theta_1^2 + \theta_2^2 - 2\rho\theta_1\theta_2 \right) \right\}. \tag{1} \]

Hence the conditional distribution is $\theta_1 \mid \theta_2 \sim N(\rho\theta_2, 1-\rho^2)$ and, similarly, $\theta_2 \mid \theta_1 \sim N(\rho\theta_1, 1-\rho^2)$.

In some of the applications that follow, it is easier to transform the parameter space onto the unit square. We transform $\theta_1 \in (-\infty, \infty)$ and $\theta_2 \in (-\infty, \infty)$ to $u \in (0, 1)$ and $v \in (0, 1)$ via the logistic functions (other transformations are also possible)

\[ u = \frac{1}{1 + \exp(-\theta_1)}, \qquad v = \frac{1}{1 + \exp(-\theta_2)}. \]

The transformed function equivalent to Equation 1 then becomes

\[ f_s(u, v) = \frac{1}{2\pi\sqrt{1-\rho^2}} \exp\left\{ -\frac{s^2 + t^2 - 2\rho s t}{2(1-\rho^2)} \right\} \frac{1}{u v (1-u)(1-v)}, \tag{2} \]

where $s = \log\left(\frac{u}{1-u}\right)$ and $t = \log\left(\frac{v}{1-v}\right)$.

Finally, we stress that we do not claim any optimality for the use of this particular example for benchmarking. It is chosen for its simplicity and intuitiveness. Multimodality, correlations between parameters and the dimensionality of the target distribution all play vital roles in determining how each method performs. We make our comparisons in this paper under the assumption that the user has minimal knowledge about the distribution from which they wish to sample.
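The R listings in later sections call two helper functions, f(theta1, theta2, rho) for Equation 1 and f.s(theta1, theta2, rho) for Equation 2, without listing them. A minimal sketch of how they might be written is given here; only the function names and argument order come from the paper, and the bodies are an assumption based directly on the two equations.

```r
# Equation 1: density of the standard bivariate Normal with correlation rho.
# (Body is an illustrative assumption; only the name f() appears in the paper.)
f <- function(theta1, theta2, rho) {
  q <- theta1^2 + theta2^2 - 2 * rho * theta1 * theta2
  exp(-q / (2 * (1 - rho^2))) / (2 * pi * sqrt(1 - rho^2))
}

# Equation 2: the same density after the logistic transformation onto the
# unit square, including the Jacobian term 1 / (u v (1 - u)(1 - v)).
f.s <- function(u, v, rho) {
  s  <- log(u / (1 - u))
  tt <- log(v / (1 - v))
  f(s, tt, rho) / (u * v * (1 - u) * (1 - v))
}

# The centre of the unit square maps to theta = (0, 0); the Jacobian there is 16.
centre.ratio <- f.s(0.5, 0.5, 0.3) / f(0, 0, 0.3)
```

Both functions are vectorised, so they can be applied to whole vectors of samples at once.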

3 Direct Simulation Methods

In this section, we present three of the most popular direct simulation methods: importance sampling, rejection sampling and sequential Monte Carlo sampling. Common to all three methods is that samples are not obtained iteratively (that is, they do not depend on previous samples), and all samples obtained from these methods are used for statistical inference.

3.1 Importance Sampling

We begin by introducing the classical Monte Carlo algorithm, see Wasserman (2004) and Robert and Casella (2004), for approximating the integral associated with $E(h(\theta))$ for some function $h$. The classical Monte Carlo algorithm draws $N$ samples $\theta^{(i)}$, $i = 1, \ldots, N$, uniformly over $\Theta$ and approximates the integral by the sample mean of $h(\theta^{(i)}) f(\theta^{(i)})$, scaled by the volume of $\Theta$. Importance sampling extends this by drawing samples from a trial distribution $g$. The algorithm is more efficient when $g$ is close to $f$. Importance sampling produces weighted samples, with weights given by the ratio $f/g$. One can either work directly with the weighted samples, or resample with respect to the weights to obtain a set of unweighted samples. We give the algorithm below.

Importance Sampling algorithm: (IS)

1. Draw $N$ samples $\theta^{(1)}, \ldots, \theta^{(N)}$ from $g(\theta)$.
2. Evaluate the weights $f(\theta^{(i)}) / g(\theta^{(i)})$, for $i = 1, \ldots, N$.

Example: Since the trial distribution $g$ has to be chosen to have the same support as the target distribution $f$, we transform $\theta_1$ and $\theta_2$ onto the unit square and work with the transformed function $f_s$ of Equation 2, taking $g$ to be uniform over the unit square. Note that better trial distributions may be found, although the process is non-trivial. We provide the R code for the importance sampler below.

    is.bvn <- function(N, rho){
      # draw from the uniform trial distribution on the unit square
      theta1 <- runif(N); theta2 <- runif(N)
      # importance weights f.s / g, with g = 1
      weights <- f.s(theta1, theta2, rho)
      # resample with respect to the weights to obtain unweighted samples
      ind <- sample(c(1:N), replace = TRUE, prob = weights)
      # transform back to the original parameter space
      return(list(theta1 = log(theta1[ind] / (1 - theta1[ind])),
                  theta2 = log(theta2[ind] / (1 - theta2[ind]))))
    }

The function f.s(theta1,theta2,rho) computes the transformed function as in Equation 2.
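As noted above, one can also work with the weighted samples directly instead of resampling. The following sketch illustrates that option by estimating $E(h(\theta))$ with self-normalised weights. The function name is.weighted, the choice $h(\theta) = \theta_1\theta_2$ and the effective sample size diagnostic are illustrative assumptions rather than part of the original study, and f.s is redefined here only so that the block runs on its own.

```r
set.seed(1)

# Equation 2 on the unit square (assumed implementation of the paper's f.s).
f.s <- function(u, v, rho) {
  s  <- log(u / (1 - u))
  tt <- log(v / (1 - v))
  q  <- s^2 + tt^2 - 2 * rho * s * tt
  exp(-q / (2 * (1 - rho^2))) / (2 * pi * sqrt(1 - rho^2) * u * v * (1 - u) * (1 - v))
}

# Hypothetical helper: self-normalised importance sampling estimate of E(h(theta)).
is.weighted <- function(N, rho, h) {
  u <- runif(N); v <- runif(N)        # uniform trial distribution, g = 1
  w <- f.s(u, v, rho)
  w <- w / sum(w)                     # normalise the weights to sum to one
  theta1 <- log(u / (1 - u))          # back to the original parameter space
  theta2 <- log(v / (1 - v))
  list(estimate = sum(w * h(theta1, theta2)),
       ess = 1 / sum(w^2))            # effective sample size of the weighted sample
}

# For the benchmark distribution, E(theta1 * theta2) equals rho.
out <- is.weighted(20000, 0.5, function(t1, t2) t1 * t2)
```

An effective sample size well below N would suggest that the trial distribution is a poor match for the target.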

3.2 Rejection Sampling

For a trial density $g$ and a constant $M$ such that $f(\theta) \le M g(\theta)$, the rejection sampling algorithm, Ripley (1987), is given by

Rejection Sampling algorithm: (RS)

1. Draw $N$ samples $\theta^{(1)}, \ldots, \theta^{(N)}$ from $g(\theta)$.
2. Draw $N$ samples $u^{(1)}, \ldots, u^{(N)}$ from Unif(0, 1).
3. Accept $\theta^{(i)}$ if $u^{(i)} \le \dfrac{f(\theta^{(i)})}{M g(\theta^{(i)})}$, for $i = 1, \ldots, N$.

Unlike the importance sampling algorithm, rejection sampling produces independent and identically distributed samples from $f$. However, it requires the constant $M$, which can be difficult to find in high-dimensional problems.

Example: Again, we use the transformed distribution of Equation 2, and take $g$ to be the bivariate $\mathrm{Unif}(0, 1)^2$ distribution. The constant $M$ is then the value of $f_s$ at $(u, v) = (0.5, 0.5)$, which corresponds to $\theta = (0, 0)$, the mode of the distribution. The acceptance rate in this example was around 20%. We give R code for the rejection sampler below.

    rs.bvn <- function(N, rho){
      # draw candidates from the uniform trial distribution
      theta1 <- runif(N); theta2 <- runif(N)
      # bounding constant: f.s evaluated at the mode
      M <- f.s(0.5, 0.5, rho)
      # accept each candidate with probability f.s / M
      ind <- runif(N) < (f.s(theta1, theta2, rho) / M)
      return(list(theta1 = log(theta1[ind] / (1 - theta1[ind])),
                  theta2 = log(theta2[ind] / (1 - theta2[ind]))))
    }

3.3 Sequential Monte Carlo

The sequential Monte Carlo (SMC) sampler can be viewed as an extension of importance sampling that allows intermediary steps and propagates moves within each distribution. Crucially, SMC does not require an initial distribution with the same support as the target distribution. This can be considered a major advantage, particularly for high-dimensional problems. Furthermore, SMC is able to deal with far more complex problems by allowing iterative corrections to the initial samples. As in importance sampling, SMC produces weighted samples. We give the algorithm below.

Sequential Monte Carlo algorithm: (SMC)

1. Draw $N$ samples $\theta_0^{(1)}, \ldots, \theta_0^{(N)}$ from the initial distribution $f_0(\theta)$.
2. Initialise weights $w_0^{(i)} = 1$, $i = 1, \ldots, N$. Set $t = 1$.
3. For $i = 1, \ldots, N$,
   (a) Move samples $\theta_{t-1}^{(i)}$ according to the forward transition kernel $K_t(\theta_{t-1}^{(i)}, \theta_t^{(i)})$.
   (b) For some arbitrary backward transition kernel $L_{t-1}(\theta_t^{(i)}, \theta_{t-1}^{(i)})$, set weights
   \[ w_t^{(i)} = w_{t-1}^{(i)} \, \frac{f_t(\theta_t^{(i)}) \, L_{t-1}(\theta_t^{(i)}, \theta_{t-1}^{(i)})}{f_{t-1}(\theta_{t-1}^{(i)}) \, K_t(\theta_{t-1}^{(i)}, \theta_t^{(i)})}. \]
4. If $\left[ \sum_{i=1}^{N} (w_t^{(i)})^2 \right]^{-1} < N/2$, with the weights normalised to sum to one, resample with replacement the samples $\{\theta_t^{(i)}\}$ with weights $\{w_t^{(i)}\}$ to obtain new samples $\{\theta_t^{(i)}\}$, and set weights $\{w_t^{(i)} = 1/N\}$.
5. Increment $t = t + 1$; if $t < T$, return to Step 3.

Here the initial distribution $f_0$ can be any distribution from which we can sample directly. The $f_t$ can be viewed as intermediary distributions bridging between the initial distribution and the final distribution $f_T$ from which samples are required. The number of intermediary distributions $T$, as well as the forward and backward transition kernels, are arbitrary, but consecutive distributions $f_i$, $f_{i+1}$ should be close. Del Moral et al. (2006) provide details on how to make these choices.

Example: Unlike importance and rejection sampling, sequential Monte Carlo does not require the initial trial distribution to have the same support as the target distribution. However, in this example, for ease of comparison, we choose the initial distribution to be the same as the $g$ used in the previous examples, and let $f_0$ be uniform over the unit square. Again we work with the transformed function $f_s$. We set $T = 11$, and $f_t = f_0^{1-\epsilon} f_s^{\epsilon}$, $\epsilon = 0, 0.1, \ldots, 1$. We choose the forward transition kernel $K_t(\theta_{t-1}, \theta_t)$ to be a Beta random walk, $\mathrm{Beta}\left( \frac{1000\,\theta_{t-1}}{1-\theta_{t-1}}, 1000 \right)$, which has mean $\theta_{t-1}$, and let the backward kernel be $L_{t-1}(\theta_t, \theta_{t-1}) = \mathrm{Beta}\left( \frac{1000\,\theta_t}{1-\theta_t}, 1000 \right)$. R code for the SMC sampler is given below; the R function f.t(theta1,theta2,rho,eps) computes the function $f_t$ above.

    smc.bvn <- function(N, rho, T = 11){
      theta1 <- runif(N); theta2 <- runif(N)
      weights <- rep(1, N); eps <- seq(0, 1, length = T)
      # eps holds T values, so there are T - 1 moves between consecutive distributions
      for (i in c(1:(T - 1))){
        # forward move: Beta random walk centred at the current values
        alpha1 <- theta1 * 1000 / (1 - theta1); alpha2 <- theta2 * 1000 / (1 - theta2)
        y1 <- rbeta(N, alpha1, 1000); y2 <- rbeta(N, alpha2, 1000)
        alphay1 <- y1 * 1000 / (1 - y1); alphay2 <- y2 * 1000 / (1 - y2)
        # incremental weight: target ratio times backward kernel over forward kernel
        ratio1 <- f.t(y1, y2, rho, eps[i + 1]) / f.t(theta1, theta2, rho, eps[i])
        ratio2 <- dbeta(theta1, alphay1, 1000) * dbeta(theta2, alphay2, 1000)
        ratio3 <- dbeta(y1, alpha1, 1000) * dbeta(y2, alpha2, 1000)
        weights <- weights * ratio1 * ratio2 / ratio3
        theta1 <- y1; theta2 <- y2
        # effective sample size, computed from the normalised weights
        w <- weights / sum(weights)
        if ((1 / sum(w^2)) < N / 2){
          ind <- sample(c(1:N), replace = TRUE, prob = weights)
          theta1 <- theta1[ind]; theta2 <- theta2[ind]
          weights <- rep(1, N)
        }
      }
      # the samples are weighted, so the normalised weights are returned with them
      return(list(theta1 = log(theta1 / (1 - theta1)),
                  theta2 = log(theta2 / (1 - theta2)),
                  weights = weights / sum(weights)))
    }

4 Iterative Simulation Methods

In this section, Markov chain Monte Carlo (MCMC) methods are presented. Unlike direct simulation methods, these methods rely on the construction of a Markov chain. When the Markov chain is started at an arbitrary starting point, standard MCMC theory guarantees that the chain will converge to the correct distribution; see Gilks et al. (1996) for more details. One crucial difference between the iterative methods and the direct simulation methods is that iterative methods produce serially correlated samples. It is vitally important that the initial portion of the MCMC sample be discarded (usually termed burn-in). The determination of the length of burn-in, and of the total length of the Markov chain, is collectively known as convergence diagnostics. Cowles and Carlin (1996) give a comparative review of the various methods available in the literature for the assessment of convergence.

4.1 Gibbs Sampling

The Gibbs sampler is a Markov chain sampler that starts at an arbitrary initial state. The chain is then iteratively updated for some specified $N$ iterations. At every iteration, it cycles through each of the $d$ components of the parameter $\theta = (\theta_1, \ldots, \theta_d)$ in turn. Each parameter is updated to a new sample according to its distribution conditioned on the current values of all other parameters. Casella and George (1992) provide an easy-to-read explanation of how the Gibbs sampler works. Here we give the Gibbs sampling algorithm for sampling $\theta = (\theta_1, \ldots, \theta_d)$.

Gibbs Sampling algorithm: (Gibbs)

1. Initialise $\theta_1^{(1)}, \ldots, \theta_d^{(1)}$, set $i = 1$.
2. For $j = 1, \ldots, d$,
   (a) Sample $\theta_1^{(i+1)}$ from the conditional distribution $f(\theta_1 \mid \theta_2^{(i)}, \ldots, \theta_d^{(i)})$.
   (b) Sample $\theta_2^{(i+1)}$ from the conditional distribution $f(\theta_2 \mid \theta_1^{(i+1)}, \theta_3^{(i)}, \ldots, \theta_d^{(i)})$.
   (c) $\vdots$
   (d) Sample $\theta_d^{(i+1)}$ from the conditional distribution $f(\theta_d \mid \theta_1^{(i+1)}, \ldots, \theta_{d-1}^{(i+1)})$.
3. Increment $i = i + 1$; if $i < N$, return to Step 2.

The Gibbs sampler depends on the availability of the conditional distributions, from which direct sampling must be possible.

Example: Here we choose an arbitrary initial value. The full conditional distributions for $\theta_1$ and $\theta_2$ are available and can be sampled from directly. That is, $\theta_1^{(i)} \mid \theta_2^{(i-1)} \sim N(\rho\theta_2^{(i-1)}, 1-\rho^2)$ and $\theta_2^{(i)} \mid \theta_1^{(i)} \sim N(\rho\theta_1^{(i)}, 1-\rho^2)$. R code for the Gibbs sampler is provided below.

    gibbs.bvn <- function(N, rho, start){
      theta1 <- rep(NA, N); theta2 <- rep(NA, N)
      theta1[1] <- start[1]; theta2[1] <- start[2]
      for (i in c(2:N)){
        # simulate from the conditional distributions
        theta1[i] <- rnorm(1, rho * theta2[i - 1], sqrt(1 - (rho * rho)))
        theta2[i] <- rnorm(1, rho * theta1[i], sqrt(1 - (rho * rho)))
      }
      return(list(theta1 = theta1, theta2 = theta2))
    }

4.2 Adaptive Rejection Sampling

Gilks and Wild (1992) introduced adaptive rejection sampling for log-concave densities. The algorithm proceeds as in the Gibbs sampler, cycling through each of the $d$ univariate parameters in turn and sampling from the conditional densities. Whereas the Gibbs sampler requires these conditional densities to be standard distributions from which sampling is easy, the adaptive rejection sampling method works for any log-concave conditional density. Specifically, the difference between the two algorithms is in Step 2 of the Gibbs sampler. We describe the adaptive rejection sampler for updating the $j$th parameter $\theta_j$ in Step 2 of the Gibbs algorithm.

For some initial abscissae containing $K$ points $T_K = \{x_k, k = 1, \ldots, K\}$, $x_1 \le x_2 \le \ldots \le x_K$, over the parameter space of $\theta_j$, let

\[ h(\theta_j) = \ln f(\theta_j \mid \theta_1^{(i+1)}, \ldots, \theta_{j-1}^{(i+1)}, \theta_{j+1}^{(i)}, \ldots, \theta_d^{(i)}), \]

and

\[ z_k = \frac{h(x_{k+1}) - h(x_k) - x_{k+1} h'(x_{k+1}) + x_k h'(x_k)}{h'(x_k) - h'(x_{k+1})}, \quad k = 1, \ldots, K-1, \]

\[ u(\theta) = h(x_k) + (\theta - x_k) h'(x_k), \quad \theta \in [z_{k-1}, z_k], \quad k = 1, \ldots, K, \]

\[ l(\theta) = \frac{(x_{k+1} - \theta) h(x_k) + (\theta - x_k) h(x_{k+1})}{x_{k+1} - x_k}, \quad \theta \in [x_k, x_{k+1}], \quad k = 1, \ldots, K-1, \]

\[ s(\theta) = \frac{\exp u(\theta)}{\int \exp u(\theta') \, d\theta'}, \quad \theta \in [z_{k-1}, z_k], \quad k = 1, \ldots, K. \]

Here $z_0$ and $z_K$ are the lower and upper bounds on the support of $\theta_j$ respectively, and $l(\theta) = -\infty$ for $\theta < x_1$ or $\theta > x_K$. The algorithm below samples from the conditional density $f(\theta_j \mid \theta_1^{(i+1)}, \ldots, \theta_{j-1}^{(i+1)}, \theta_{j+1}^{(i)}, \ldots, \theta_d^{(i)})$.

Adaptive Rejection algorithm: (ARS)

1. Initialise the $K$ abscissae $T_K = \{x_k, k = 1, \ldots, K\}$.
2. Sample $y$ from $s(\theta)$ and sample $w$ from Unif(0, 1). If $w \le \exp\{l(y) - u(y)\}$, set $\theta_j^{(i+1)} = y$. Otherwise go to Step 3.
3. If $w \le \exp\{h(y) - u(y)\}$, set $\theta_j^{(i+1)} = y$. Otherwise go to Step 4.
4. Set $T_{K+1} = T_K \cup \{y\}$, $K = K + 1$ and go to Step 2.

Example: We implemented the example in WinBUGS, which requires specific coding that is generally based on statistical model specifications. Since our example here is artificial and not model based, we needed to use some of the tricks from Spiegelhalter et al. (2003) to sample from Equation 1. We do not include this code here.

4.3 Slice Sampling

The slice sampler generates a random sample from a given distribution by using an auxiliary variable. We give the algorithm for the slice sampler based on the single-variable slice sampler. As with the ARS and Gibbs samplers, each parameter is updated in turn. We give the algorithm for updating the $j$th parameter in Step 2 of the Gibbs algorithm.

Slice Sampling algorithm: (SLI)

1. Sample $u$ from $\mathrm{Unif}(0, f(\theta_j^{(i)} \mid \theta_1^{(i+1)}, \ldots, \theta_{j-1}^{(i+1)}, \theta_{j+1}^{(i)}, \ldots, \theta_d^{(i)}))$.
2. Sample $\theta_j^{(i+1)}$ uniformly from the set $A = \{\theta_j : f(\theta_j \mid \theta_1^{(i+1)}, \ldots, \theta_{j-1}^{(i+1)}, \theta_{j+1}^{(i)}, \ldots, \theta_d^{(i)}) > u\}$.

The algorithm given here is the simplest form of the slice sampler. Multivariate updates using the slice sampler are also possible, but these are far more complex; see Neal (2003) for further details.

Example: We assume here that we cannot sample easily from the conditional distributions of $\theta_1$ and $\theta_2$. R code for the slice sampler is given below.

    sli.bvn <- function(N, rho, start){
      theta1 <- rep(NA, N); theta2 <- rep(NA, N)
      theta1[1] <- start[1]; theta2[1] <- start[2]
      for (i in c(2:N)){
        # update theta1: draw the auxiliary height, then sample uniformly on the slice
        u <- runif(1, 0, f(theta1[i - 1], theta2[i - 1], rho))
        x1 <- left(theta2[i - 1], u, rho)
        x2 <- right(theta2[i - 1], u, rho)
        theta1[i] <- runif(1, x1, x2)
        # update theta2 in the same way, given the new theta1
        u <- runif(1, 0, f(theta1[i], theta2[i - 1], rho))
        y1 <- left(theta1[i], u, rho)
        y2 <- right(theta1[i], u, rho)
        theta2[i] <- runif(1, y1, y2)
      }
      return(list(theta1 = theta1, theta2 = theta2))
    }

Here the function f(theta1,theta2,rho) computes the density from Equation 1, and the functions left(theta,u,rho) and right(theta,u,rho) return the left and right limits of the interval $A$. Solving $f > u$ in Equation 1 for the parameter being updated, with the other parameter fixed at $\theta$, gives these limits as

\[ \rho\theta \pm \sqrt{(1-\rho^2)\left[ -2\log\left( 2\pi\sqrt{1-\rho^2}\, u \right) - \theta^2 \right]}. \]

4.4 Metropolis-Hastings Sampling

Metropolis-Hastings algorithms rely on the construction of a reversible Markov chain. At each iteration of the chain, a candidate sample is proposed from an arbitrary candidate generating function $Q$; this sample is then either accepted or rejected according to an acceptance ratio. Chib and Greenberg (1995) provide an expository article on the algorithm. The general Metropolis-Hastings algorithm is given below:

Metropolis-Hastings Algorithm: (MH)

1. Initialise $\theta^{(1)}$, set $i = 1$.
2. Generate $y$ from $Q(\theta^{(i)}, \cdot)$ and $U$ from Unif(0, 1).
3. Let $\theta^{(i+1)} = y$ if $U \le \min\left(1, \dfrac{f(y) Q(y, \theta^{(i)})}{f(\theta^{(i)}) Q(\theta^{(i)}, y)}\right)$, otherwise let $\theta^{(i+1)} = \theta^{(i)}$.
4. Increment $i = i + 1$; if $i < N$, return to Step 2.

The quantity $\min\left(1, \frac{f(y) Q(y, \theta^{(i)})}{f(\theta^{(i)}) Q(\theta^{(i)}, y)}\right)$ is commonly referred to as the acceptance probability. $Q(\theta^{(i)}, y)$ is an arbitrary candidate generating function giving the probability of the new point $y$ given the current point $\theta^{(i)}$.

Note that the algorithm given above updates the $d$-dimensional parameter vector simultaneously. However, it is also possible to update smaller blocks of size $1 \le s < d$ and cycle through each block in turn, in the manner of the Gibbs sampler. When $s = 1$ for all blocks, and the candidate generating function is the full conditional distribution, the Metropolis-Hastings sampler is equivalent to the Gibbs sampler. A combination of Gibbs and Metropolis-Hastings moves is called the hybrid sampler; we do not give further details here but refer the reader to Gilks et al. (1996) for further reading.

Here we give two of the most popular MH samplers in detail: the random walk Metropolis-Hastings sampler (RW-MH) and the independence sampler (IND-MH). RW-MH takes $Q(\theta^{(i)}, y) = g(|y - \theta^{(i)}|)$ as the candidate generating function, where $Q$ is symmetric with $Q(\theta^{(i)}, y) = Q(y, \theta^{(i)})$; hence the corresponding acceptance probability is given by $\min\left(1, \frac{f(y)}{f(\theta^{(i)})}\right)$. The IND-MH takes as candidate generating function $Q(\theta^{(i)}, y) = g(y)$, where $g$ is independent of the current state of the chain. Note here that the acceptance probability

\[ \min\left(1, \frac{f(y) g(\theta^{(i)})}{f(\theta^{(i)}) g(y)}\right) = \min\left(1, \frac{f(y)/g(y)}{f(\theta^{(i)})/g(\theta^{(i)})}\right) \]

is based on the ratio of the weights used in importance sampling.

Example: For the RW-MH we let the proposal distribution $Q$ be a bivariate Normal distribution. We take the mean to be the value of the current iteration $\theta^{(i)}$, and we tune the covariance matrix so that we obtain approximately the acceptance rate discussed by Roberts and Rosenthal (2001), who give more details on this acceptance rate calculation. Here, the sampler works with respect to Equation 1. The R code below for the RW-MH requires the mvtnorm package.

    library(mvtnorm)

    rw.bvn <- function(N, rho, start){
      theta1 <- rep(NA, N); theta2 <- rep(NA, N)
      theta1[1] <- start[1]; theta2[1] <- start[2]
      Id <- matrix(c(1, 0, 0, 1), ncol = 2, byrow = TRUE)
      sigma <- matrix(c(1, rho, rho, 1), ncol = 2, byrow = TRUE)
      for (i in c(2:N)){
        # propose from a bivariate Normal centred at the current state
        prop <- rmvnorm(1, c(theta1[i - 1], theta2[i - 1]), 5.6 * Id)
        # acceptance ratio f(y) / f(theta) for a symmetric proposal
        accept <- dmvnorm(prop, c(0, 0), sigma) /
          dmvnorm(c(theta1[i - 1], theta2[i - 1]), c(0, 0), sigma)
        if (runif(1) < accept){
          theta1[i] <- prop[1]; theta2[i] <- prop[2]
        } else {
          theta1[i] <- theta1[i - 1]; theta2[i] <- theta2[i - 1]
        }
      }
      return(list(theta1 = theta1, theta2 = theta2))
    }

For the independence sampler, we take the candidate distribution $Q$ to be Unif(0, 1) in each coordinate. Again, the proposal distribution needs to have the same support as the target distribution, otherwise parts of the target distribution will never be visited by the chain. Thus we choose the Uniform distribution and again use the transformed bivariate Normal distribution, so that this algorithm is broadly comparable with the direct simulation methods. We omit the R code for the IND-MH here, as it differs from the RW-MH only in the candidate generating function and the consequent calculation of the acceptance probability.

5 Results of Comparisons

We used the same example throughout this paper. For the direct simulation methods, we transformed our bivariate Normal distribution onto the unit square and used a Uniform distribution over the unit square as the trial/initial distribution. We note that this is not the optimal choice of trial distribution, but a convenient one. We kept user-specified choices as similar as possible across all our algorithms to facilitate comparison. For the direct simulation methods, we drew N = 1,000 samples each. For the iterative methods, we chose independent random starting points, discarded the first 500 samples as burn-in and used the remaining 1,000 samples for inference. We calculated the mean squared errors of the quantile estimators for all samplers using 5 replications; the results are shown in Table 1.
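The MSE calculation described above can be sketched as follows, using the Gibbs sampler as the example. This is an illustration under stated assumptions rather than the code used in the study: the helper name mse.quantiles, the standard Normal starting points and the particular set of quantiles are hypothetical choices, and the true quantiles come from qnorm because both marginals of the benchmark distribution are N(0, 1).

```r
set.seed(42)

# Gibbs sampler for the benchmark bivariate Normal, as in Section 4.1.
gibbs.bvn <- function(N, rho, start) {
  theta1 <- rep(NA, N); theta2 <- rep(NA, N)
  theta1[1] <- start[1]; theta2[1] <- start[2]
  for (i in 2:N) {
    theta1[i] <- rnorm(1, rho * theta2[i - 1], sqrt(1 - rho^2))
    theta2[i] <- rnorm(1, rho * theta1[i], sqrt(1 - rho^2))
  }
  list(theta1 = theta1, theta2 = theta2)
}

# Hypothetical helper: MSE of quantile estimators, summed over theta1 and theta2.
mse.quantiles <- function(sampler, rho, probs = c(0.025, 0.25, 0.5, 0.75),
                          n.rep = 5, burn = 500, keep = 1000) {
  truth <- qnorm(probs)                       # true marginal quantiles
  sq.err <- replicate(n.rep, {
    out <- sampler(burn + keep, rho, rnorm(2))  # random start (an assumption)
    idx <- (burn + 1):(burn + keep)             # discard the burn-in
    e1 <- quantile(out$theta1[idx], probs) - truth
    e2 <- quantile(out$theta2[idx], probs) - truth
    e1^2 + e2^2
  })
  rowMeans(sq.err)                            # average over the replications
}

mse <- mse.quantiles(gibbs.bvn, rho = 0.5)
```

Any sampler with the same (N, rho, start) signature could be passed in place of gibbs.bvn, which keeps the comparison across iterative methods on an equal footing.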
We restrict our comparisons to be within direct and iterative simulations separately. The direct simulation methods were not optimised, whereas the iterative simulations were in a sense optimised (with the exception of the IND-MH, which is comparable to the direct simulation methods). This may explain the apparently smaller MSE values for the iterative methods.

For the direct simulation methods, the MSE performances are similar for all three methods for the quantiles that are close to the mode of the distribution, i.e., $\theta_{0.25}$, $\theta_{0.5}$, $\theta_{0.75}$. Some differences are apparent for the tail quantiles ($\theta_{0.025}$ and the corresponding upper-tail quantile). All samplers deteriorate as the correlation $\rho$ between the parameters increases. IS appears to have a small advantage over the other two methods, particularly for small $\rho$, i.e., when the two parameters are nearly independent. We note that SMC has many possibilities for improvement and is potentially a better method, particularly in high dimensions. However, such improvement is beyond the capabilities of the general user.

For the iterative simulations, it is clear that the Gibbs type algorithms (GIBBS, ARS and SLI) gave better MSEs than the MH (RW and IND) methods, particularly in the tails. In results not shown here, for $\rho = 0.99$, the RW-MH algorithm appears to converge faster than the Gibbs type algorithms. Hence block updating of highly correlated parameters may be preferable here. Perhaps more interestingly, the slice sampler appeared to converge faster than the Gibbs sampler. In an informal comparison, it took about 5,500 iterations of the Gibbs sampler to converge at $\rho = 0.99$, compared with about 3,500 for the single-variable slice sampler. Roberts and Rosenthal (1999) give some theoretical support for why this may occur for a two-dimensional parameter space.

Table 2 lists the main features of each method, together with an estimate of their computational cost in terms of running time. Clearly, in terms of computational time, IS and RS are much faster than SMC. For the MCMC methods, both the RW-MH and SLI samplers require large computational time, with Gibbs the fastest. We note that ARS was run in WinBUGS, not R, and is hence more efficient.

6 Discussions and recommendations

Although IS is by far the easiest algorithm to use, RS produces i.i.d. samples. We recommend the use of rejection sampling (RS) when the dimension of the target distribution is small, i.e., one or two dimensions, and a good enveloping function $g$ and constant $M$ can easily be found. Transforming the target distribution onto a bounded space, as we did in our example, may sometimes help the search for $g$. In higher dimensions, SMC and MCMC will be preferable. Although the application of SMC requires more tuning than MCMC, it does not require the additional computation of convergence assessment.

Of the MCMC methods, we recommend using a combination of the Gibbs sampler, when the full conditional distributions can easily be sampled from, and the single-variable slice sampler. When the parameters are known to be highly correlated, we recommend updating these parameters in a block using RW-MH, tuned to the acceptance rate discussed by Roberts and Rosenthal (2001). Finally, if the user is familiar with the WinBUGS language, statistical models can be fitted using ARS; however, any flexibility to block update highly correlated parameters is lost. We note that the slice sampler is not restricted to updating one parameter at a time. Multivariate slice samplers can also be used instead of the RW-MH for block updating, but the algorithm is far more complex. However, the slice sampler does not require user-specific tuning, and has the potential to be made into generic software. Given its good properties, which are supported by our simulation study, this would be highly recommended.

Acknowledgements

The authors wish to thank the Faculty of Science and the School of Mathematics and Statistics at UNSW. The second author was supported by an FRG (UNSW) grant.

References

Casella, G. and E. I. George (1992). Explaining the Gibbs sampler. The American Statistician 46.

Chib, S. and E. Greenberg (1995). Understanding the Metropolis-Hastings algorithm. The American Statistician 49.

Cowles, M. K. and B. P. Carlin (1996). Markov chain Monte Carlo convergence diagnostics: A comparative review. Journal of the American Statistical Association 91.

Del Moral, P., A. Doucet, and A. Jasra (2006). Sequential Monte Carlo samplers. Journal of the Royal Statistical Society: Series B 68.

Gilks, W. R., S. Richardson, and D. J. Spiegelhalter (1996). Markov Chain Monte Carlo in Practice. Chapman and Hall.

Gilks, W. R. and P. Wild (1992). Adaptive rejection sampling for Gibbs sampling. Applied Statistics 41.

Neal, R. M. (2003). Slice sampling. Annals of Statistics 31.

Ripley, B. D. (1987). Stochastic Simulation. John Wiley and Sons.

Robert, C. P. and G. Casella (2004). Monte Carlo Statistical Methods (2nd ed.). Springer Verlag.

Roberts, G. O. and J. S. Rosenthal (1999). Convergence of slice sampler Markov chains. Journal of the Royal Statistical Society: Series B 61.

Roberts, G. O. and J. S. Rosenthal (2001). Optimal scaling for various Metropolis-Hastings algorithms. Statistical Science 16.

Spiegelhalter, D. J., A. Thomas, N. G. Best, and D. Lunn (2003). WinBUGS version 1.4 User Manual. MRC Biostatistics Unit, Cambridge.

Wasserman, L. (2004). All of Statistics - A Concise Course in Statistical Inference. Springer Verlag.

Table 1: Comparisons of MSE (sum over $\theta_1$ and $\theta_2$) for direct and iterative simulation methods. For varying $\rho$, the table reports the MSE of five quantile estimators (among them $\theta_{0.25}$, $\theta_{0.5}$ and $\theta_{0.75}$) for IS, RS and SMC (direct simulation) and for GIBBS, ARS, SLI, RW-MH and IND-MH (iterative simulation). [Numerical entries are not recoverable from the source.]

| Method | Time (s) | Advantages | Disadvantages |
|--------|----------|------------|---------------|
| IS | 0.04 | produces weighted samples | requires an enveloping function g |
| RS | 0.04 | produces i.i.d. samples | requires an enveloping function g; calculation of M |
| SMC | 1.51 | suitable for high dimensions; produces weighted samples | requires tuning; consecutive distributions should be close |
| GIBBS | 0.84 | easy implementation | need to sample directly from cond. distr.; no block updating |
| RW-MH | (missing) | easy implementation; block update | requires tuning |
| IND-MH | 3.00 | easy to implement; block update | requires a good proposal distribution |
| ARS | 5.00 | implemented in WinBUGS | log-concave densities only; no block updating |
| SLI | (missing) | no tuning is required; block update | implementation is more complicated |

Table 2: Summary table for direct and iterative simulation methods. Process time in seconds is the amount of real time elapsed for computing 10,000 iterations on a Pentium machine (the clock speed in GHz is missing from the source).
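Running times of the kind reported in Table 2 can be measured in R with system.time, whose "elapsed" component is the real time taken. The sketch below times 10,000 iterations of the Gibbs sampler from Section 4.1. The value of rho and the starting point are arbitrary choices made for illustration, and the timing obtained will depend on the machine, so it should not be expected to match the table.

```r
# Gibbs sampler for the benchmark bivariate Normal, as in Section 4.1.
gibbs.bvn <- function(N, rho, start) {
  theta1 <- rep(NA, N); theta2 <- rep(NA, N)
  theta1[1] <- start[1]; theta2[1] <- start[2]
  for (i in 2:N) {
    theta1[i] <- rnorm(1, rho * theta2[i - 1], sqrt(1 - rho^2))
    theta2[i] <- rnorm(1, rho * theta1[i], sqrt(1 - rho^2))
  }
  list(theta1 = theta1, theta2 = theta2)
}

# Time 10,000 iterations; rho = 0.5 and start = (0, 0) are illustrative choices.
timing <- system.time(out <- gibbs.bvn(10000, 0.5, c(0, 0)))

# Real (wall-clock) time in seconds.
elapsed <- timing[["elapsed"]]
```

Repeating the measurement several times and averaging would give a more stable figure than a single run.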
