Monte Carlo for Spatial Models


1 Monte Carlo for Spatial Models. Murali Haran, Department of Statistics, Penn State University. Penn State Computational Science Lectures, April 2007.

2 Spatial Models Lots of scientific questions involve analyzing data that are spatially dependent: Data points close together are more closely related (dependent) than data further away. Examples: Concentrations of PM2.5 (air pollutants) across the U.S. Disease rates by county. Abundance of plant/animal species across Pennsylvania. Space does not always mean physical distance. Similar ideas are applicable to many other research areas: Machine learning: two objects may be close in feature space. Approximations to computationally expensive computer models.

3 Geostatistical Data: Examples Wheat flowering dates in North Dakota (below): Courtesy Plant Pathology, PSU and North Dakota State. Another example: concentrations of PM2.5 (pollutants) across the U.S.

4 Areal and Lattice Data: Examples Minnesota breast cancer by county: observed and expected counts. Courtesy MN Cancer Surveillance System, Dept. of Health. Other examples: pixel values from remote sensing, e.g. PA forest cover.

5 Spatial Models: General Ideas We will focus on geostatistical data (a process observed at points). We want a spatial process over a region. The simplest way to define a joint distribution is to assign a joint normal distribution to a finite set of points in that region. The set of points = locations of the observations and locations where we want to predict. Note: we want this to be true (a joint normal distribution) for any finite set of points in the region. This is therefore an infinite-dimensional distribution, called a Gaussian process.

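6 Spatial Models: General Ideas (Contd.) Consider a joint distribution for 3 locations, a multivariate normal:

$$\begin{pmatrix} z_1 \\ z_2 \\ z_3 \end{pmatrix} \sim N\left( \begin{pmatrix} \mu \\ \mu \\ \mu \end{pmatrix}, \begin{pmatrix} \sigma_{11} & \sigma_{12} & \sigma_{13} \\ \sigma_{21} & \sigma_{22} & \sigma_{23} \\ \sigma_{31} & \sigma_{32} & \sigma_{33} \end{pmatrix} \right).$$

We want the dependence (characterized by the covariance matrix) to be related to the distance between the locations. One possibility: $\Sigma_{ij} = \psi \exp(-\|s_i - s_j\|/\phi)$, where $s_i$ is the location of the $i$th observation. So if the distance between the $i$th and $j$th locations is large, $\Sigma_{ij}$ will be small. For example: if $\psi = 2$, $\phi = 0.4$, location 1 = (0,0), location 2 = (1,2), then the distance is $\sqrt{5}$ and $\sigma_{12} = 2 \exp(-\sqrt{5}/0.4)$.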
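The worked example above can be checked numerically. The following is a minimal sketch in plain Python; the function name `exp_cov` and the third location are assumptions made for illustration and are not part of the slides.

```python
import math

def exp_cov(locations, psi, phi):
    """Exponential covariance: Sigma[i][j] = psi * exp(-||s_i - s_j|| / phi)."""
    n = len(locations)
    return [[psi * math.exp(-math.dist(locations[i], locations[j]) / phi)
             for j in range(n)]
            for i in range(n)]

# Locations 1 and 2 are from the slide; location 3 is a made-up extra point.
locations = [(0.0, 0.0), (1.0, 2.0), (3.0, 1.0)]
Sigma = exp_cov(locations, psi=2.0, phi=0.4)

# Covariance between locations 1 and 2: 2 * exp(-sqrt(5) / 0.4)
sigma_12 = Sigma[0][1]
print(sigma_12)
```

With these values the distance between locations 1 and 2 is $\sqrt{5} \approx 2.24$, so $\sigma_{12}$ is roughly 0.0075, far below the variance $\psi = 2$ on the diagonal.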

7 Inference for Spatial Models A linear Gaussian process model with exponential covariance: Let $Z(s)$ be a spatial process at location $s$ and $X(s)$ be some covariate at location $s$. For example: $Z(s)$ = pollutant concentration at location $s$, $X(s)$ = elevation at location $s$. Simple model: $Z(s) = \beta X(s) + \epsilon(s)$, where $\beta$ is the regression coefficient and $\epsilon(s)$ is the error. If $\epsilon = (\epsilon(s_1), \ldots, \epsilon(s_N))^T$, then $\epsilon \sim N(0, \Sigma)$ where $\Sigma_{ij} = \psi \exp(-\|s_i - s_j\|/\phi)$. We have a statistical model connecting data $(Z, X)$ to parameters $\Theta = (\beta, \psi, \phi)$. The inferred (estimated) $\Theta$ can be used to predict $Z(s^*)$ at any location $s^*$ in the region.

8 Inference for Spatial Models (contd.) The likelihood, $L(Z; \Theta)$, connects observations ($Z$) to parameters ($\Theta$). It is proportional to the multivariate normal density $N(\beta X, \Sigma)$ for our example. Frequentist approach: maximize $L(Z; \Theta)$ w.r.t. $\Theta$ to obtain $\hat{\Theta}$, the maximum likelihood estimate. Bayesian approach: treat $\Theta$ as random variable(s) and specify a prior distribution $f(\Theta)$. Inference is based on the posterior distribution,

$$\pi(\Theta \mid Z) = \frac{L(Z \mid \Theta) f(\Theta)}{\int L(Z \mid \Theta) f(\Theta)\, d\Theta} \propto L(Z \mid \Theta) f(\Theta).$$

(This is just an application of Bayes rule.)
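As an illustration of the likelihood for this model, here is a minimal sketch that evaluates the log of the $N(\beta X, \Sigma)$ density using a Cholesky factorization. It is written in plain Python for self-containment; the helper names (`exp_cov`, `cholesky`, `log_likelihood`) and the toy data are assumptions, and a real analysis would normally use a linear algebra library.

```python
import math

def exp_cov(locs, psi, phi):
    """Exponential covariance matrix, Sigma_ij = psi * exp(-||s_i - s_j|| / phi)."""
    n = len(locs)
    return [[psi * math.exp(-math.dist(locs[i], locs[j]) / phi)
             for j in range(n)] for i in range(n)]

def cholesky(A):
    """Lower-triangular L with L L^T = A (A symmetric positive definite)."""
    n = len(A)
    L = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1):
            s = A[i][j] - sum(L[i][k] * L[j][k] for k in range(j))
            L[i][j] = math.sqrt(s) if i == j else s / L[j][j]
    return L

def log_likelihood(z, x, locs, beta, psi, phi):
    """Log density of Z ~ N(beta * X, Sigma) with exponential covariance."""
    n = len(z)
    L = cholesky(exp_cov(locs, psi, phi))
    resid = [z[i] - beta * x[i] for i in range(n)]
    # Forward substitution: solve L y = resid, so that y.y = resid^T Sigma^{-1} resid
    y = []
    for i in range(n):
        y.append((resid[i] - sum(L[i][k] * y[k] for k in range(i))) / L[i][i])
    log_det = 2.0 * sum(math.log(L[i][i]) for i in range(n))
    quad = sum(v * v for v in y)
    return -0.5 * (n * math.log(2.0 * math.pi) + log_det + quad)

# Toy data (made up for illustration only)
locs = [(0.0, 0.0), (1.0, 2.0), (3.0, 1.0)]
x = [1.0, 2.0, 0.5]
z = [0.4, 1.1, 0.3]
print(log_likelihood(z, x, locs, beta=0.5, psi=2.0, phi=0.4))
```

A frequentist fit would maximize this function over $(\beta, \psi, \phi)$; a Bayesian fit would add the log prior and sample from the result.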

9 Bayesian Inference for Spatial Models $Z(s^*)$ (the estimate at a new location $s^*$) is inferred from the posterior predictive distribution,

$$\pi(Z(s^*) \mid Z) = \int \pi(Z(s^*), \Theta \mid Z)\, d\Theta = \int \pi(Z(s^*) \mid \Theta, Z)\, \pi(\Theta \mid Z)\, d\Theta,$$

which is obtained via the posterior distribution of the parameters $\Theta$ (given the observations $Z$). Note that this approach automatically propagates the variability associated with our inference about the parameters $\Theta$. If we are less sure about $\Theta$, that uncertainty is reflected in our estimate of $Z(s^*)$.


10 Inference for Spatial Models: Example Figure: North Dakota flowering dates, 2005. Red: before July 6; orange: July 6 to July 12; brown: July 13 to July 18; yellow: after July 18. Left: flowering survey data. Right: posterior mean, $E_\pi(Z(s^*) \mid Z)$, based on the distribution $\pi(Z(s^*) \mid Z)$ at unobserved locations $s^*$.

11 Monte Carlo for Inference All inference for the model is based on the posterior ($\pi$). For example, $E_\pi(\phi \mid Z)$, the posterior expectation of the parameter $\phi$. In general we are interested in expectations of the form $E_\pi g = \int g(x) \pi(x)\, dx$. The integral is too hard, so we use sample-based inference. We simulate $X_1, \ldots, X_N$ from the distribution $\pi$ and use the sample average $\sum_{i=1}^{N} g(X_i)/N$. In principle, if we have enough samples (large enough $N$), we can answer any question of interest. Example: What is the probability that $\phi > 0.8$? Answer: Count the proportion of times the sampled $\phi > 0.8$.
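The "count the proportion" recipe can be sketched in a few lines. Because no real posterior samples are available here, the draws below come from an Exponential(1) distribution purely as a stand-in (an assumption); with real output one would pass the sampled values of $\phi$ instead.

```python
import random

def proportion_above(samples, threshold):
    """Monte Carlo estimate of P(X > threshold): the fraction of draws above it."""
    return sum(1 for s in samples if s > threshold) / len(samples)

rng = random.Random(0)

# Stand-in "posterior" draws for phi (hypothetical; Exponential(1) chosen for illustration)
phi_samples = [rng.expovariate(1.0) for _ in range(100_000)]

prob = proportion_above(phi_samples, 0.8)             # estimates P(phi > 0.8)
posterior_mean = sum(phi_samples) / len(phi_samples)  # estimates E(phi)
print(prob, posterior_mean)
```

For this stand-in distribution the exact answers are $e^{-0.8} \approx 0.449$ and 1, which gives a quick sanity check on the estimates.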

12 Monte Carlo with iid samples Assume $X_1, \ldots, X_N \overset{iid}{\sim} \pi$. Strong Law holds: If $E_\pi |g| < \infty$ then $\bar{g}_N = \sum_{i=1}^{N} g(X_i)/N \to E_\pi g$ as $N \to \infty$. Central Limit Theorem: If $E_\pi g^2 < \infty$ we have $\sqrt{N}(\bar{g}_N - E_\pi g) \to N(0, \sigma^2)$. It is easy to estimate $\sigma^2$ using the sample variance ($\hat{\sigma}^2$). Estimate the accuracy of our estimate by $\hat{\sigma}^2/N$ (the estimated variance of $\bar{g}_N$). $N$ is large enough when $\hat{\sigma}^2/N$ is small enough. Note that the $X_i$ can be multidimensional (accuracy is unaffected by the dimension of the problem here).
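A minimal sketch of i.i.d. Monte Carlo with an accuracy estimate follows. The target (standard normal) and the function $g(x) = x^2$ are assumptions chosen because the true answer, $E_\pi g = 1$, is known; `mc_estimate` is a hypothetical helper name.

```python
import math
import random

def mc_estimate(g, sampler, n):
    """Return the sample average of g and its standard error sqrt(sigma_hat^2 / n)."""
    values = [g(sampler()) for _ in range(n)]
    mean = sum(values) / n
    var = sum((v - mean) ** 2 for v in values) / (n - 1)  # sample variance
    return mean, math.sqrt(var / n)

rng = random.Random(1)
est, se = mc_estimate(lambda x: x * x, lambda: rng.gauss(0.0, 1.0), 50_000)
print(est, se)

# A simple "N is large enough" check against a user-chosen tolerance (assumed value)
tolerance = 0.01
print("accurate enough:", se < tolerance)
```

The standard error here is the square root of the quantity $\hat{\sigma}^2/N$ on the slide, which puts it on the same scale as the estimate itself.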

13 Markov chain Monte Carlo Life is simple with i.i.d. Monte Carlo. It is generally very difficult to draw i.i.d. samples from $\pi$, especially for high-dimensional, complicated distributions. It is not considered an option for spatial models in general. More general approach: Metropolis-Hastings algorithm. Start with an initial value $X_1$. For $i = 2$ to $N$: Propose a value $X^*$ for $X_i$ based on $X_{i-1}$. Set $X_i = X^*$ with M-H probability depending on $X_{i-1}$, $X^*$, the proposal distribution and the target distribution ($\pi$); otherwise set $X_i = X_{i-1}$. The Markov chain $X_1, \ldots, X_N$ has stationary distribution $\pi$ (roughly: for large values of $N$, $X_N$ is approximately distributed according to $\pi$).
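The recipe above can be sketched with a random-walk Metropolis sampler, a special case of Metropolis-Hastings in which the proposal is symmetric so the proposal densities cancel in the acceptance probability. The standard normal target, the step size, and the function names are assumptions for illustration only.

```python
import math
import random

def log_target(x):
    """Log of the target density up to a constant (standard normal stand-in)."""
    return -0.5 * x * x

def random_walk_metropolis(log_target, x_init, n, step, rng):
    """Generate a chain X_1, ..., X_n using Gaussian random-walk proposals."""
    chain = [x_init]
    for _ in range(1, n):
        current = chain[-1]
        proposal = current + rng.gauss(0.0, step)  # propose X* based on X_{i-1}
        log_ratio = log_target(proposal) - log_target(current)
        # Accept with probability min(1, pi(X*) / pi(X_{i-1}))
        if rng.random() < math.exp(min(0.0, log_ratio)):
            chain.append(proposal)
        else:
            chain.append(current)  # rejected: repeat the previous value
    return chain

rng = random.Random(2)
chain = random_walk_metropolis(log_target, x_init=0.0, n=50_000, step=2.4, rng=rng)

chain_mean = sum(chain) / len(chain)
chain_var = sum((v - chain_mean) ** 2 for v in chain) / (len(chain) - 1)
print(chain_mean, chain_var)
```

Only ratios of the target density are needed, which is why the normalizing constant of a posterior never has to be computed.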

14 Markov chain Monte Carlo (contd.) Use $X_1, \ldots, X_N$ as before to estimate $E_\pi g$. Strong Law holds if $E_\pi(|g|) < \infty$: $\bar{g}_N = \sum_{i=1}^{N} g(X_i)/N \to E_\pi g$ as $N \to \infty$. We also need technical conditions on the Markov chain, but these typically hold by construction. It appears as if we have the same situation as in the i.i.d. case.

15 Markov chain Monte Carlo: Complications 1. The Central Limit Theorem may not hold. Hard to know when it does. 2. The $X_i$ are not i.i.d., so it is hard to estimate the variance and hence hard to rigorously assess the accuracy of our estimates. 3. We do not know how long to run our Markov chain. (How do we determine $N$ for our sample $X_1, \ldots, X_N$?) 4. Not clear how to construct an efficient chain/algorithm for each new model and data set (Metropolis-Hastings only provides a general recipe). Implication: Every time a user wants to fit a complex (e.g. spatial) model, they need to spend a lot of time tuning the algorithm. Also, there are no guarantees about the accuracy of the estimates.

16 Some approaches I have considered 1. Exact/perfect sampling: avoid all MCMC issues by constructing samplers that produce i.i.d. draws. 2. Fast mixing algorithms using heavy-tailed approximations, transformations etc. Some samplers with known (good) theoretical properties. Others that appear to work well in practice (based on empirical studies). 3. Monte Carlo standard errors: Consistent and easy-to-use estimator for assessing standard errors. When standard errors are below a threshold, stop the MCMC sampler. Note: These approaches are not mutually exclusive.
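The Monte Carlo standard error idea in item 3 can be sketched with batch means: split the chain into batches, average each batch, and use the spread of the batch averages to estimate the standard error. The batch size $\lfloor \sqrt{N} \rfloor$ is a common choice and is assumed here; the AR(1) chain is a made-up stand-in for MCMC output, and this sketch is not the R code referenced in the slides.

```python
import math
import random

def batch_means_se(chain):
    """Batch means estimate of the standard error of the chain average."""
    n = len(chain)
    b = math.isqrt(n)        # batch size (assumed choice: floor(sqrt(n)))
    a = n // b               # number of batches
    used = chain[n - a * b:]  # drop the earliest leftover values, if any
    overall = sum(used) / (a * b)
    batch_avgs = [sum(used[k * b:(k + 1) * b]) / b for k in range(a)]
    var_hat = b * sum((m - overall) ** 2 for m in batch_avgs) / (a - 1)
    return math.sqrt(var_hat / (a * b))

# Stand-in for MCMC output: an AR(1) chain with positive autocorrelation
rng = random.Random(3)
chain = [0.0]
for _ in range(39_999):
    chain.append(0.5 * chain[-1] + rng.gauss(0.0, 1.0))

mean = sum(chain) / len(chain)
naive_se = math.sqrt(sum((v - mean) ** 2 for v in chain) / (len(chain) - 1) / len(chain))
bm_se = batch_means_se(chain)
print(naive_se, bm_se)

# Stopping rule sketch: keep sampling until bm_se falls below a chosen threshold
threshold = 0.02  # assumed tolerance
print("stop sampling:", bm_se < threshold)
```

For a positively correlated chain such as this one, the naive i.i.d. formula understates the error, which is why the batch means value comes out larger.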

17 Toy example: Normal density [Figure: two panels. Left: density plotted against x. Right: a region in the (x, y) plane.] Left (approximation-based): Green line is the heavy-tailed approximation (t-density), red line is the target (normal density). Use the heavy-tailed approximation as the proposal. Right (transformation-based): The bounded 2D region corresponds to a transformation of the 1D normal (y/x follows the normal density). Simulate in the transformed space.
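The approximation-based panel can be sketched as an independence Metropolis-Hastings sampler: every proposal is drawn from the heavy-tailed t-density, regardless of the current state, and accepted according to the ratio of importance weights (target divided by proposal). The choice of 3 degrees of freedom and all function names are assumptions for illustration.

```python
import math
import random

NU = 3  # degrees of freedom of the t proposal (assumed value)

def log_target(x):
    """Standard normal log density, up to a constant."""
    return -0.5 * x * x

def log_proposal(x):
    """Student-t log density with NU degrees of freedom, up to a constant."""
    return -0.5 * (NU + 1) * math.log1p(x * x / NU)

def draw_t(rng):
    """Draw from t_NU as a standard normal over sqrt(chi-square_NU / NU)."""
    chi2 = rng.gammavariate(NU / 2.0, 2.0)
    return rng.gauss(0.0, 1.0) / math.sqrt(chi2 / NU)

def independence_sampler(n, rng):
    """Independence M-H: proposals do not depend on the current state."""
    x = 0.0
    chain, accepted = [], 0
    for _ in range(n):
        y = draw_t(rng)
        # Log ratio of importance weights w = target / proposal
        log_ratio = (log_target(y) - log_proposal(y)) - (log_target(x) - log_proposal(x))
        if rng.random() < math.exp(min(0.0, log_ratio)):
            x = y
            accepted += 1
        chain.append(x)
    return chain, accepted / n

chain, accept_rate = independence_sampler(20_000, random.Random(4))
chain_mean = sum(chain) / len(chain)
chain_var = sum((v - chain_mean) ** 2 for v in chain) / (len(chain) - 1)
print(accept_rate, chain_mean, chain_var)
```

Because the t proposal has heavier tails than the normal target, the weight ratio stays bounded, which is the property that makes a proposal of this kind attractive.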

18 For spatial models The linear spatial model, where inference is based on $\pi(Z(s^*) \mid Z) = \int \pi(Z(s^*), \Theta \mid Z)\, d\Theta$, allows us to deal with the simulation in stages: 1. Simulate $\Theta \sim \pi(\Theta \mid Z)$. 2. Simulate $Z(s^*) \sim \pi(Z(s^*) \mid \Theta, Z)$, which is just a draw from a multivariate normal. With this approach: Even though inference may be for a high-dimensional distribution, only step 1 is problematic. Step 1 involves few dimensions (typically around 4, though it could have more if there are many predictors). If we construct a very good heavy-tailed approximation to propose values for MCMC, or simulate on an appropriate transformed space, we will have an algorithm with good properties.
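For reference, the normal distribution in step 2 can be written out for a single new location using the standard conditional-normal formula, under the linear model with exponential covariance described earlier. The vector $c$ is notation introduced here for illustration and does not appear in the slides.

```latex
% c collects the covariances between the new location s^* and the observed locations
c_i = \psi \exp\!\left(-\|s^* - s_i\|/\phi\right), \qquad i = 1, \ldots, N,

Z(s^*) \mid \Theta, Z \;\sim\; N\!\Big( \beta X(s^*) + c^{T} \Sigma^{-1} (Z - \beta X),\;\; \psi - c^{T} \Sigma^{-1} c \Big).
```

Given a draw of $\Theta$ from step 1, sampling $Z(s^*)$ therefore only requires solving a linear system in $\Sigma$, which is where Cholesky factorizations are typically used.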

19 Spatial generalized linear models What if the data are non-Gaussian (e.g. 0-1 or count data)? Hierarchical modeling is not a problem. For example: Stage 1: $Z(s) \mid \mu(s) \sim \text{Poisson}(E(s) \exp(\mu(s)))$. Stage 2: Now model $\mu(s) \mid \Theta$ as a Gaussian process. Stage 3: Priors for $\Theta$. This specification destroys our ability to simulate easily: 1. We now need to simulate $\Theta, \mu(s_1), \ldots, \mu(s_n) \sim \pi(\Theta, \mu(s_1), \ldots, \mu(s_n) \mid Z)$. 2. Simulate $Z(s^*) \sim \pi(Z(s^*) \mid \Theta, \mu(s_1), \ldots, \mu(s_n), Z)$. Step 1 is now more complicated, and the dimension of the distribution is at least the number of observations. One approach: Approximate this model by a linear hierarchical model. Use samples from the approximation to obtain draws from the above distribution.

20 Some lessons learned Exact sampling: ideal situation but very hard to construct. Algorithms end up being very specialized. Fast mixing algorithms: best case when the algorithm has good theoretical properties and works well in practice (one does not imply the other.) Putting it all together: Using a good estimate of standard errors for an algorithm with good theoretical properties: Efficient algorithm. Easy, accurate assessment of standard errors. A simple rule for stopping the algorithm based on desired accuracy.

21 Some lessons learned (contd.) If the sampler gets stuck in one area of the sample space (e.g. if $\pi$ has well-separated modes), even estimates of standard error can be misleading. Hence, first requirement: a good sampler. It is very important to use as much information as possible about $\pi$ when constructing a sampler. Exploit the structure of the model/distribution. For example, spatial models have a lot of structure. Utilize matrix algorithms (e.g. sparse matrix inversions, Cholesky factorizations etc.) whenever possible. Be aware of possible multimodalities (obvious in some scenarios, not so obvious in others). Always attempt to quantify the accuracy of your estimates.

22 Some references Accurate standard errors and a stopping rule for MCMC: Flegal, J.M., Haran, M. and Jones, G.L., Markov chain Monte Carlo: Can We Trust the Third Decimal Place? R code for estimating errors easily via consistent batch means: mharan/batchmeans.r Experiments with block updating of parameters: Haran, M., Hodges, J.S. and Carlin, B.P. (2003), Accelerating computation in Markov random field models for spatial data via structured MCMC, J. Comp. Graph. Stat. Exact sampling using a fast mixing MCMC algorithm: Haran, M. and Tierney, L., Perfect sampling for a Markov random field model.
