Approximate Bayesian Computation methods and their applications for hierarchical statistical models. University College London, 2015

Nicholas Ray

1 Approximate Bayesian Computation methods and their applications for hierarchical statistical models University College London, 2015

2 Contents
1. Introduction
2. ABC methods
3. Hierarchical models
4. Application for ovarian cancer detection
5. Conclusion

3 Introduction
The likelihood function plays an important role in statistical inference problems.
For complex models, the computational cost of evaluating its analytical formula is very high.
Methods that provide statistical inference while bypassing evaluation of the likelihood function have therefore become very popular.

4 ABC methods
ABC methods provide ways of evaluating posterior distributions when the likelihood function is analytically or computationally intractable. These methods replace the calculation of the likelihood with a comparison between the observed and simulated data.
Let $\theta$ be a parameter vector to be estimated. Given the prior distribution $\pi(\theta)$, the goal is to approximate the posterior distribution $\pi(\theta \mid x) \propto f(x \mid \theta)\,\pi(\theta)$, where $f(x \mid \theta)$ is the likelihood.

5 Generic form of ABC methods
1. Sample a candidate parameter vector $\theta^*$ from some proposal distribution $\pi(\theta)$.
2. Simulate a dataset $x^*$ from the model described by a conditional probability distribution $f(x \mid \theta^*)$.
3. Compare the simulated dataset, $x^*$, with the experimental data, $x_0$, using a distance function, $d$, and tolerance $\varepsilon$; if $d(x_0, x^*) \le \varepsilon$, accept $\theta^*$.
The tolerance $\varepsilon \ge 0$ is the desired level of agreement between $x_0$ and $x^*$.

6 Most popular ABC algorithms
ABC rejection algorithm
ABC MCMC algorithm (Markov Chain Monte Carlo)
ABC SMC algorithm (Sequential Monte Carlo)

7 ABC rejection method
1. Sample $\theta^*$ from $\pi(\theta)$.
2. Simulate a dataset $x^*$ from $f(x \mid \theta^*)$.
3. If $d(x_0, x^*) \le \varepsilon$, accept $\theta^*$, otherwise reject.
4. Return to step 1.
Disadvantage: if the prior distribution is very different from the posterior, the acceptance rate will be low.
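The rejection steps above can be sketched in R. This is a minimal illustration under assumptions that are not in the slides: a toy model in which the data are normal with unknown mean and known standard deviation 1, a Uniform(-10, 10) prior, and the absolute difference of sample means as the distance d. The helper name abc_rejection and all numeric settings are hypothetical.

```r
set.seed(1)

# Observed data x0: assumed toy example, normal with true mean 3 and sd 1
x0 <- rnorm(50, mean = 3, sd = 1)

# Hypothetical helper: ABC rejection for the mean of a normal model
abc_rejection <- function(x0, n_accept, eps) {
  accepted <- numeric(0)
  n_tried <- 0
  while (length(accepted) < n_accept) {
    theta_star <- runif(1, -10, 10)                 # step 1: sample from the prior
    x_star <- rnorm(length(x0), theta_star, 1)      # step 2: simulate a dataset
    n_tried <- n_tried + 1
    if (abs(mean(x_star) - mean(x0)) <= eps) {      # step 3: accept if d <= eps
      accepted <- c(accepted, theta_star)
    }
  }
  list(theta = accepted, acceptance_rate = n_accept / n_tried)
}

res <- abc_rejection(x0, n_accept = 200, eps = 0.2)
mean(res$theta)        # approximate posterior mean of theta
res$acceptance_rate    # small, because the prior is much wider than the posterior
```

The low acceptance rate in this sketch illustrates the disadvantage noted on the slide: most candidates drawn from a wide prior are rejected.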

9 Markov chain Monte Carlo (WIKIPEDIA) In mathematics, more specifically in statistics, Markov chain Monte Carlo (MCMC) methods are a class of algorithms for sampling from a probability distribution based on constructing a Markov chain that has the desired distribution as its equilibrium distribution. The state of the chain after a number of steps is then used as a sample of the desired distribution. The quality of the sample improves as a function of the number of steps.

10 ABC MCMC
1. Metropolis-Hastings Algorithm
2. Random-walk-Metropolis-Hastings Algorithm
3. Gibbs Sampling Algorithm
4. Metropolis within Gibbs Algorithm

11 Metropolis-Hastings Algorithm
Let $q(y \mid x)$ be an arbitrary, friendly distribution (one we know how to sample from), called the proposal. Choose $X_0$ arbitrarily. Suppose we have generated $X_0, X_1, \dots, X_i$. To generate $X_{i+1}$ do the following:
(1) Generate a proposal or candidate value $Y \sim q(y \mid X_i)$.
(2) Evaluate $r \equiv r(X_i, Y)$, where $r(x, y) = \min\left\{ \frac{f(y)}{f(x)} \frac{q(x \mid y)}{q(y \mid x)},\ 1 \right\}$.
(3) Set $X_{i+1} = Y$ with probability $r$, and $X_{i+1} = X_i$ with probability $1 - r$.

12 Remarks on the Metropolis-Hastings Algorithm
A simple way to execute step (3) is to generate $U \sim \mathrm{Uniform}(0, 1)$. If $U < r$, set $X_{i+1} = Y$; otherwise set $X_{i+1} = X_i$.
A common choice for $q(y \mid x)$ is $N(x, b^2)$ for some $b > 0$. In this case the proposal density $q$ is symmetric, $q(y \mid x) = q(x \mid y)$, and $r(x, y) = \min\left\{ \frac{f(y)}{f(x)},\ 1 \right\}$.

13 Metropolis-Hastings Algorithm. Example 1
Let's simulate a Markov chain whose stationary distribution is the Cauchy distribution, $f(x) = \frac{1}{\pi} \frac{1}{1 + x^2}$.
Let's take $N(x, b^2)$ as the proposal distribution. Then $r(x, y) = \min\left\{ \frac{f(y)}{f(x)},\ 1 \right\} = \min\left\{ \frac{1 + x^2}{1 + y^2},\ 1 \right\}$.
Let's choose $b = 1$ and $N = 10{,}000$, the length of the chain.

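14 Example 1. Code in R
N=10000                                   # length of the chain
b=1                                       # standard deviation of the proposal
x_values=rep(0,N)                         # storage for the chain
x_axis=seq(-7,7,by=0.1)
x_old=0
x_new=x_old
for (i in 1:N) {
  y=rnorm(1,x_old,b)                      # candidate from N(x_old, b^2)
  r=min((1+x_old^2)/(1+y^2),1)            # acceptance probability f(y)/f(x)
  p=runif(1)
  if (p<r) (x_new=y) else (x_new=x_old)   # accept with probability r
  x_values[i]=x_new
  x_old=x_new
}
x_cauchy=dcauchy(x_axis)                  # exact Cauchy density for comparison
plot(x_axis,x_cauchy,type="p",col="black")
points(density(x_values),type="l",col="red",lwd=3)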

16 Gibbs Sampling
Gibbs sampling is the easiest MCMC algorithm to use for high-dimensional problems, because it turns one high-dimensional problem into several one-dimensional problems. A hierarchical model is one example of a high-dimensional problem.

17 Hierarchical model. Example 1
Posterior distribution on $(\theta, \sigma^2)$ associated with the joint model
$X_i \sim N(\theta, \sigma^2)$, $i = 1, \dots, n$,
$\theta \sim N(\theta_0, \tau^2)$, $\sigma^2 \sim IG(a, b)$,
with $\theta_0$, $\tau^2$, $a$, $b$ specified.

18 Gibbs Sampling algorithm
Suppose that $(X, Y)$ has density $f_{X,Y}(x, y)$. Suppose that it is possible to simulate from the conditional distributions $f_{X \mid Y}(x \mid y)$ and $f_{Y \mid X}(y \mid x)$. Let $(X_0, Y_0)$ be starting values. Assume we have drawn $(X_0, Y_0), \dots, (X_n, Y_n)$. Then the Gibbs sampling algorithm for getting $(X_{n+1}, Y_{n+1})$ is:
$X_{n+1} \sim f_{X \mid Y}(x \mid Y_n)$
$Y_{n+1} \sim f_{Y \mid X}(y \mid X_{n+1})$
repeat
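The two-step update above can be sketched in R on a target whose conditionals are known in closed form. The choice of target is an assumption made for illustration and is not from the slides: a standard bivariate normal with correlation rho, for which X | Y = y ~ N(rho*y, 1 - rho^2) and Y | X = x ~ N(rho*x, 1 - rho^2).

```r
set.seed(42)

rho <- 0.8                 # assumed correlation of the toy target
n_iter <- 5000             # number of Gibbs iterations
x <- y <- numeric(n_iter)
x[1] <- 0; y[1] <- 0       # starting values (X_0, Y_0)
s <- sqrt(1 - rho^2)       # conditional standard deviation

for (n in 1:(n_iter - 1)) {
  x[n + 1] <- rnorm(1, mean = rho * y[n], sd = s)       # X_{n+1} ~ f(x | Y_n)
  y[n + 1] <- rnorm(1, mean = rho * x[n + 1], sd = s)   # Y_{n+1} ~ f(y | X_{n+1})
}

keep <- 1001:n_iter        # discard the first 1000 draws as burn-in
sample_cor <- cor(x[keep], y[keep])
sample_cor                 # should be close to rho
```

Each update draws only from a one-dimensional distribution, which is the point made on the Gibbs Sampling slide.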

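19 Posteriors for Example 1
$X_i \sim N(\theta, \sigma^2)$, $i = 1, \dots, n$, $\theta \sim N(\theta_0, \tau^2)$, $\sigma^2 \sim IG(a, b)$, with $\theta_0$, $\tau^2$, $a$, $b$ specified.
$\theta \mid x, \sigma^2 \sim N\left( \frac{\sigma^2}{\sigma^2 + n\tau^2}\,\theta_0 + \frac{n\tau^2}{\sigma^2 + n\tau^2}\,\bar{x},\ \frac{\sigma^2 \tau^2}{\sigma^2 + n\tau^2} \right)$
$\sigma^2 \mid x, \theta \sim IG\left( \frac{n}{2} + a,\ \frac{1}{2} \sum_i (x_i - \theta)^2 + b \right)$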

20 Example 1. Code in R
x=rnorm(1000,10,2)                        # simulated data
n=length(x)
a=3; b=3                                  # inverse-gamma hyperparameters
tau2=10
theta0=5
Nsim=5000
xbar=mean(x)
sh1=(n/2)+a                               # shape of the conditional of sigma2
sigma2=theta=rep(0,Nsim)                  #init arrays
sigma2[1]=1/rgamma(1,shape=a,rate=b)      #init chains
B=sigma2[1]/(sigma2[1]+n*tau2)
theta[1]=rnorm(1,mean=B*theta0+(1-B)*xbar,sd=sqrt(tau2*B))
for (i in 2:Nsim){
  B=sigma2[i-1]/(sigma2[i-1]+n*tau2)      # weight on the prior mean theta0
  theta[i]=rnorm(1,mean=B*theta0+(1-B)*xbar,sd=sqrt(tau2*B))
  ra1=(1/2)*(sum((x-theta[i])^2))+b       # rate of the conditional of sigma2
  sigma2[i]=1/rgamma(1,shape=sh1,rate=ra1)
}
mean(theta[3000:5000])                    # posterior means after burn-in
mean(sigma2[3000:5000])

21 Conjugate priors
In Bayesian probability theory, if the posterior distribution is in the same family as the prior distribution, then the prior and posterior are called conjugate distributions, and the prior is called a conjugate prior.
$P(\theta \mid D) = \frac{P(\theta)\,P(D \mid \theta)}{\int P(\theta)\,P(D \mid \theta)\,d\theta}$

22 Conjugate priors. Example
Let's consider the normal distribution, $x \sim N(\mu, \sigma^2)$. For $x$ normally distributed with fixed variance $\sigma^2$, the conjugate prior is also normally distributed. For prior $\mu \sim N(\mu_0, \sigma_0^2)$, the posterior will be in the form:
$\mu \mid x, \sigma^2 \sim N(\hat{\mu}_0, \hat{\sigma}_0^2)$,
$\hat{\mu}_0 = \frac{\sigma_0^2}{\sigma_0^2 + \sigma^2}\,x + \frac{\sigma^2}{\sigma^2 + \sigma_0^2}\,\mu_0$,
$\hat{\sigma}_0^2 = \frac{\sigma^2 \sigma_0^2}{\sigma^2 + \sigma_0^2}$
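The closed-form update above can be checked numerically. The sketch below, with arbitrary illustrative values for the observation and hyperparameters (assumptions, not values from the slides), compares the formulas with a brute-force posterior computed on a grid as prior times likelihood.

```r
# Assumed illustrative values
x <- 2.5          # single observation
sigma2 <- 4       # known variance of x
mu0 <- 0          # prior mean
sigma02 <- 1      # prior variance

# Closed-form conjugate update
mu_hat <- sigma02 / (sigma02 + sigma2) * x + sigma2 / (sigma2 + sigma02) * mu0
sigma2_hat <- sigma2 * sigma02 / (sigma2 + sigma02)

# Brute-force check: posterior proportional to likelihood times prior on a grid
mu_grid <- seq(-10, 10, by = 0.001)
post <- dnorm(x, mean = mu_grid, sd = sqrt(sigma2)) *
        dnorm(mu_grid, mean = mu0, sd = sqrt(sigma02))
post <- post / sum(post)                         # normalize to sum to 1
grid_mean <- sum(mu_grid * post)
grid_var <- sum((mu_grid - grid_mean)^2 * post)

c(mu_hat, grid_mean)         # the two means should agree
c(sigma2_hat, grid_var)      # the two variances should agree
```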

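23 Conjugate priors. Example
Let's consider the normal distribution, $x \sim N(\mu, \sigma^2)$. For $x$ normally distributed with fixed mean $\mu$, the conjugate prior for $\sigma^2$ is the inverse-gamma distribution. For prior $\sigma^2 \sim IG(\alpha, \beta)$:
$P(x \mid \mu, \sigma^2) = \frac{1}{\sigma \sqrt{2\pi}} \exp\left( -\frac{(x - \mu)^2}{2\sigma^2} \right) \propto (\sigma^2)^{-1/2} \exp\left( -\frac{\frac{1}{2}(x - \mu)^2}{\sigma^2} \right)$
$P(\sigma^2) = IG(\alpha, \beta) = \frac{\beta^\alpha}{\Gamma(\alpha)} (\sigma^2)^{-\alpha - 1} \exp\left( -\frac{\beta}{\sigma^2} \right)$
$P(\sigma^2 \mid x, \mu) \propto (\sigma^2)^{-(\alpha + 1/2) - 1} \exp\left( -\frac{\beta + \frac{1}{2}(x - \mu)^2}{\sigma^2} \right)$
$\hat{\alpha} = \alpha + \frac{1}{2}$, $\hat{\beta} = \beta + \frac{1}{2}(x - \mu)^2$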

24 ABC SMC
A number of sampled parameter values (particles) $\{\theta^{(1)}, \dots, \theta^{(N)}\}$, sampled from the prior distribution $\pi(\theta)$, are propagated through a sequence of intermediate distributions $\pi(\theta \mid d(x_0, x^*) \le \varepsilon_i)$, $i = 1, \dots, T - 1$, until they represent a sample from the target distribution $\pi(\theta \mid d(x_0, x^*) \le \varepsilon_T)$.
The tolerances are chosen such that $\varepsilon_1 > \dots > \varepsilon_T \ge 0$, which means the distributions gradually evolve towards the target posterior.
For sufficiently large numbers of particles, this approach avoids the problem of getting stuck in areas of low probability (as in ABC MCMC).

25 ABC SMC Algorithm
S1. Initialize $\varepsilon_1, \dots, \varepsilon_T$. Set the population indicator $t = 0$.
S2.0 Set the particle indicator $i = 1$.
S2.1 If $t = 0$, sample $\theta^{**}$ independently from $\pi(\theta)$. Else, sample $\theta^*$ from the previous population $\{\theta_{t-1}^{(i)}\}$ with weights $w_{t-1}$ and perturb the particle to obtain $\theta^{**} \sim K_t(\theta \mid \theta^*)$, where $K_t$ is a perturbation kernel.
If $\pi(\theta^{**}) = 0$, return to S2.1.
Simulate a candidate dataset $x^* \sim f(x \mid \theta^{**})$.
If $d(x^*, x_0) > \varepsilon_t$, return to S2.1.

26 ABC SMC Algorithm
S2.2 Set $\theta_t^{(i)} = \theta^{**}$ and calculate the weight for particle $\theta_t^{(i)}$:
$w_t^{(i)} = \begin{cases} 1, & \text{if } t = 0, \\ \dfrac{\pi(\theta_t^{(i)})}{\sum_{j=1}^{N} w_{t-1}^{(j)} K_t(\theta_{t-1}^{(j)}, \theta_t^{(i)})}, & \text{if } t > 0. \end{cases}$
If $i < N$, set $i = i + 1$, go to S2.1.
S3 Normalize the weights. If $t < T$, set $t = t + 1$, go to S2.0.
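Steps S1 to S3 can be sketched in R for a toy problem. Everything specific here is an assumption made for illustration and is not taken from the slides: the normal model with unknown mean, the Uniform(-10, 10) prior, the Gaussian random-walk perturbation kernel with standard deviation kern_sd, the tolerance schedule, and the distance based on sample means. Because R vectors are 1-indexed, the first population is indexed 1 here rather than 0.

```r
set.seed(7)

x0 <- rnorm(50, mean = 3, sd = 1)          # assumed observed data
eps <- c(2, 1, 0.5, 0.2)                   # S1: decreasing tolerances
N <- 200                                   # number of particles
kern_sd <- 0.5                             # sd of the assumed Gaussian kernel K_t

prior_d <- function(th) dunif(th, -10, 10)             # prior density pi(theta)
dist_fn <- function(xs) abs(mean(xs) - mean(x0))       # distance d(x*, x0)

theta_prev <- NULL
w_prev <- NULL
for (pop in seq_along(eps)) {
  theta <- numeric(N)
  w <- numeric(N)
  for (i in 1:N) {                                     # S2.0: loop over particles
    repeat {                                           # S2.1: propose until accepted
      if (pop == 1) {
        th2 <- runif(1, -10, 10)                       # sample from the prior
      } else {
        th1 <- sample(theta_prev, 1, prob = w_prev)    # resample previous population
        th2 <- rnorm(1, mean = th1, sd = kern_sd)      # perturb with the kernel
      }
      if (prior_d(th2) == 0) next                      # outside prior support
      xs <- rnorm(length(x0), mean = th2, sd = 1)      # simulate candidate dataset
      if (dist_fn(xs) <= eps[pop]) break               # accept if within tolerance
    }
    theta[i] <- th2                                    # S2.2: store particle and weight
    w[i] <- if (pop == 1) 1 else
      prior_d(th2) / sum(w_prev * dnorm(th2, mean = theta_prev, sd = kern_sd))
  }
  w <- w / sum(w)                                      # S3: normalize the weights
  theta_prev <- theta
  w_prev <- w
}

post_mean <- sum(w_prev * theta_prev)                  # weighted posterior mean
post_mean
```

The early populations use loose tolerances, so proposals are accepted often; later populations start from particles already near the posterior, which keeps the acceptance rate workable at the final tolerance.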

27 Ovarian Cancer case study. CA125

28 Risk calculation

29 Change-point hierarchical model for CA125
Controls: $Y_{ij} \mid t_{ij} \sim N(\theta_i, \sigma^2)$
Cases:
$Y_{ij} \mid t_{ij}, \{I_i = 0\} \sim N(\theta_i, \sigma^2)$
$Y_{ij} \mid t_{ij}, \{I_i = 1\} \sim N(\theta_i + \gamma_i (t_{ij} - \tau_i)^+, \sigma^2)$
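To make the change-point structure concrete, the sketch below simulates one flat trajectory and one trajectory with a change point from the model above. The helper simulate_marker and all parameter values are hypothetical and chosen only for illustration; they are not estimates from the case study.

```r
set.seed(3)

# Hypothetical helper: simulate marker values Y_ij at times t for one subject.
# has_change plays the role of the indicator I_i; tau is the change point.
simulate_marker <- function(t, theta, sigma, gamma = 0, tau = 0, has_change = FALSE) {
  rise <- if (has_change) gamma * pmax(t - tau, 0) else 0   # gamma * (t - tau)^+
  rnorm(length(t), mean = theta + rise, sd = sigma)
}

t_ij <- 0:9                                                 # assumed observation times
control <- simulate_marker(t_ij, theta = 2.5, sigma = 0.2)
case <- simulate_marker(t_ij, theta = 2.5, sigma = 0.2,
                        gamma = 0.8, tau = 5, has_change = TRUE)

round(control, 2)   # fluctuates around the baseline theta
round(case, 2)      # flat until tau, then rises linearly with slope gamma
```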

32 Conditional distributions

34 Conclusion
1. ABC methods have a great impact on parameter estimation.
2. Many applied problems can be reduced to a hierarchical model.
3. The Gibbs sampling algorithm is the most useful one for dealing with hierarchical models.

35 Literature
1. Steven J. Skates, Donna K. Pauler, Ian J. Jacobs. Screening Based on the Risk of Cancer Calculation from Bayesian Hierarchical Changepoint and Mixture Models of Longitudinal Markers. Journal of the American Statistical Association, vol. 96 (2001).
2. Wasserman L. All of Statistics. A Concise Course in Statistical Inference, Springer.
3. Tina Toni, David Welch, Natalja Strelkowa, Andreas Ipsen, Michael P.H. Stumpf. Approximate Bayesian computation scheme for parameter inference and model selection in dynamical systems. Journal of the Royal Society Interface, 6 (2009).
4. Christian P. Robert, George Casella. Introducing Monte Carlo Methods with R, Springer, 2009.

36 Data processing with caret package in R
Data preprocessing
Data splitting
Data processing
Model comparison

37 Data preprocessing
preProcess
Standardizing
Transformation
Imputing

38 Data preprocessing. Example
library(caret)
data(BloodBrain)                          # contains array bbbDescr
bbbDescr=bbbDescr[,-3]
preProc <- preProcess(bbbDescr, method = c("center", "scale"))
data <- predict(preProc, bbbDescr)        # centered and scaled predictors
mean(bbbDescr[,1])
mean(data[,1])                            # approximately 0 after centering
var(data[,1])                             # 1 after scaling
mean(bbbDescr[,2])
mean(data[,2])
var(data[,2])

39 Data splitting
createDataPartition  # training/test partition
createResample       # bootstrap samples
createFolds          # split the data into k groups
createTimeSlices     # used for time series data

40 Data splitting. Example
library(caret)
data(BloodBrain)                          # contains array bbbDescr
bbbDescr=bbbDescr[,-3]
train_part <- createDataPartition(y=bbbDescr[,1], p=0.75, list=FALSE)
training <- bbbDescr[train_part,]         # about 75% of the rows
testing <- bbbDescr[-train_part,]         # the remaining rows
dim(bbbDescr)
dim(training)
dim(testing)

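41 Data processing. Resampling
Resampling methods for train, set with trainControl(method = ...) and passed to train through the trControl argument:
boot        # bootstrapping
boot632     # bootstrapping with adjustment
cv          # cross validation
repeatedcv  # repeated cross validation
LOOCV       # leave one out cross validation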
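To show what the cv option does conceptually, here is a minimal base-R sketch of k-fold cross validation. It is an illustration of the idea, not caret's implementation; the built-in mtcars data, the linear model, and the choice of k = 5 are assumptions made for the example.

```r
set.seed(107)

k <- 5                                                  # number of folds
folds <- sample(rep(1:k, length.out = nrow(mtcars)))    # random fold label per row
rmse <- numeric(k)

for (f in 1:k) {
  train_set <- mtcars[folds != f, ]                     # fit on k - 1 folds
  test_set <- mtcars[folds == f, ]                      # evaluate on the held-out fold
  fit <- lm(mpg ~ wt + hp, data = train_set)
  pred <- predict(fit, newdata = test_set)
  rmse[f] <- sqrt(mean((test_set$mpg - pred)^2))        # held-out RMSE
}

rmse          # one error estimate per fold
mean(rmse)    # cross-validated RMSE
```

Repeated cross validation runs this procedure several times with different random fold assignments and averages the results.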

42 Data processing. Example
library(caret)
library(mlbench)
data(Sonar)
set.seed(107)
inTrain <- createDataPartition(y = Sonar$Class, p = .75, list = FALSE)
training <- Sonar[ inTrain,]
testing <- Sonar[-inTrain,]
# k-nearest neighbours on centered and scaled predictors
plsFit <- train(Class ~ ., data = training, method = "knn", preProc = c("center", "scale"))
plsClasses <- predict(plsFit, newdata = testing)
plsClasses

43 Model comparison. Metric options
confusionMatrix
Continuous outcomes:
RMSE      # root mean squared error
Rsquared  # R^2 from regression models
Categorical outcomes:
Accuracy  # fraction of correct classes
Kappa     # measure of concordance
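The two categorical metrics can be computed by hand from a confusion table, which may help when reading the output of confusionMatrix. The labels below are made-up illustrative data, and the calculation is a plain base-R sketch of the standard Accuracy and Cohen's Kappa definitions rather than caret's own code.

```r
# Made-up observed and predicted class labels (assumed for illustration)
obs  <- factor(c("M", "M", "R", "R", "M", "R", "M", "R", "M", "R"), levels = c("M", "R"))
pred <- factor(c("M", "R", "R", "R", "M", "M", "M", "R", "M", "R"), levels = c("M", "R"))

tab <- table(Prediction = pred, Reference = obs)        # confusion table

accuracy <- sum(diag(tab)) / sum(tab)                   # fraction of correct classes
# Agreement expected by chance from the row and column totals
expected <- sum(rowSums(tab) * colSums(tab)) / sum(tab)^2
kappa_stat <- (accuracy - expected) / (1 - expected)    # Cohen's Kappa

tab
accuracy      # 0.8
kappa_stat    # 0.6
```

Kappa is lower than Accuracy here because half of the agreement would be expected by chance with two balanced classes.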

44 Model comparison. Example
names(getModelInfo())                     # list the available model types
# k-nearest neighbours
plsFit <- train(Class ~ ., data = training, method = "knn", preProc = c("center", "scale"))
plsClasses <- predict(plsFit, newdata = testing)
confusionMatrix(data = plsClasses, testing$Class)
# partial least squares
plsFit <- train(Class ~ ., data = training, method = "pls", preProc = c("center", "scale"))
plsClasses <- predict(plsFit, newdata = testing)
confusionMatrix(data = plsClasses, testing$Class)
# conditional inference random forest
plsFit <- train(Class ~ ., data = training, method = "cforest", preProc = c("center", "scale"))
plsClasses <- predict(plsFit, newdata = testing)
confusionMatrix(data = plsClasses, testing$Class)

45 Literature 1. Max Kuhn. A Short Introduction to the caret Package (2014). 2. Model training and tuning:

46 Questions
