Case Study IV: Bayesian clustering of Alzheimer patients


1 Case Study IV: Bayesian clustering of Alzheimer patients
Mike Wiper and Conchi Ausín
Department of Statistics, Universidad Carlos III de Madrid
Advanced Statistics and Data Mining Summer School, 2nd - 6th July, 2018

2 Objective
We illustrate how to use the EM algorithm, Gibbs sampling and the Variational Bayes approximation for clustering Alzheimer patients. We would like to divide the patients into subgroups according to the symptoms they present.

3 Alzheimer data
This data set is included in the BayesLCA R package.
rm(list=ls())
library(BayesLCA)
data("Alzheimer")
The data set contains information about the presence or absence of six symptoms displayed by 240 patients diagnosed with early onset Alzheimer's disease, recorded at the Mercer's Institute of St. James's Hospital in Dublin.
attach(Alzheimer)
par(mfrow=c(2,3))
plot(Hallucination)
plot(Activity)
plot(Aggression)
plot(Agitation)
plot(Diurnal)
plot(Affective)
par(mfrow = c(1, 1))
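Before fitting any model, it can be useful to look at the marginal frequency of each symptom and at how often each symptom pattern occurs. A minimal sketch, assuming the columns of Alzheimer are coded as 0/1 indicators:
# Proportion of patients presenting each of the six symptoms
colMeans(Alzheimer)
# Counts of the most frequent observed symptom patterns
head(sort(table(apply(Alzheimer, 1, paste, collapse = "")), decreasing = TRUE))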

4 Latent class analysis
We wish to obtain K groups of patients according to their symptoms. Thus, for each observation, x = (x_1, ..., x_M), we may assume a K-component mixture of multivariate binary variables with probability distribution:
Pr(x | K, w, θ) = \sum_{k=1}^{K} w_k \prod_{m=1}^{M} θ_{km}^{x_m} (1 - θ_{km})^{(1 - x_m)}
where w = {w_k} and θ = {θ_{km}} for k = 1, ..., K and m = 1, ..., M. We assume the following prior distributions,
w ~ Dirichlet(δ_1, ..., δ_K)
θ_{km} ~ Beta(α_{km}, β_{km})
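As an illustration of this mixture density, here is a minimal R sketch (for exposition only; the function name and parameter values are invented, not part of BayesLCA) that evaluates Pr(x | K, w, θ) for a single binary vector x:
dbinmix <- function(x, w, theta) {
  # x: binary vector of length M; w: vector of K mixture weights;
  # theta: K x M matrix of item probabilities.
  comp <- apply(theta, 1, function(th) prod(th^x * (1 - th)^(1 - x)))
  sum(w * comp)
}
# Example with K = 2 components and M = 3 symptoms
w <- c(0.6, 0.4)
theta <- rbind(c(0.8, 0.2, 0.5), c(0.1, 0.7, 0.3))
dbinmix(c(1, 0, 1), w, theta)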

5 Latent class analysis
Let x = {x_1, ..., x_N} be the sample of N = 240 patients. Each observation x_i = (x_{i1}, ..., x_{iM}) is a vector of M = 6 binary variables representing the presence or absence of each symptom. We assume that there are K = 3 groups of patients and that the prior probability of belonging to group k is w_k. We also assume that, within each group, the symptoms follow independent Bernoulli distributions,
Pr(x_{im} | θ_{km}) = θ_{km}^{x_{im}} (1 - θ_{km})^{(1 - x_{im})}, for m = 1, ..., M.

6 Latent class analysis
We may define a set of latent variables, z = {z_1, ..., z_N}, indicating the group of each patient. The prior probability that the i-th patient belongs to group k is:
Pr(z_i = k | w, θ) = w_k
and, given that the patient is in group k,
Pr(x_i | z_i = k, w, θ) = \prod_{m=1}^{M} θ_{km}^{x_{im}} (1 - θ_{km})^{(1 - x_{im})}
Then, the complete-data likelihood function is
f(x, z | w, θ) = \prod_{i=1}^{N} \prod_{k=1}^{K} [ w_k \prod_{m=1}^{M} θ_{km}^{x_{im}} (1 - θ_{km})^{(1 - x_{im})} ]^{I(z_i = k)}

7 Latent class analysis
Since we are assuming the following prior distributions,
f(w | δ_1, ..., δ_K) ∝ \prod_{k=1}^{K} w_k^{(δ_k - 1)}
f(θ_{km} | α_{km}, β_{km}) ∝ θ_{km}^{(α_{km} - 1)} (1 - θ_{km})^{(β_{km} - 1)}
we can obtain the log-posterior (up to an additive constant),
log f(w, θ | x, z) = \sum_{k=1}^{K} \sum_{i=1}^{N} I(z_i = k) [ log w_k + \sum_{m=1}^{M} log { θ_{km}^{x_{im}} (1 - θ_{km})^{(1 - x_{im})} } ] + \sum_{k=1}^{K} (δ_k - 1) log w_k + \sum_{k=1}^{K} \sum_{m=1}^{M} log { θ_{km}^{(α_{km} - 1)} (1 - θ_{km})^{(β_{km} - 1)} }
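A minimal R sketch of this log-posterior (the function and its arguments are written here for illustration and are not part of BayesLCA); it can be used to check the EM and Gibbs updates below, and assumes x is an N x M binary matrix, z a vector of group labels in 1:K, theta a K x M matrix, and alpha, beta K x M matrices (or scalars) of prior parameters:
log_posterior <- function(x, z, w, theta, delta, alpha, beta) {
  # Complete-data log-likelihood
  ll <- 0
  for (k in seq_along(w)) {
    idx <- which(z == k)
    if (length(idx) > 0) {
      xk <- x[idx, , drop = FALSE]
      ll <- ll + length(idx) * log(w[k]) +
        sum(xk %*% log(theta[k, ]) + (1 - xk) %*% log(1 - theta[k, ]))
    }
  }
  # Log-prior (up to additive constants)
  lp <- sum((delta - 1) * log(w)) +
    sum((alpha - 1) * log(theta) + (beta - 1) * log(1 - theta))
  ll + lp
}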

8 EM algorithm
E step: Calculate E_{z | x, w^{(t)}, θ^{(t)}} [ log f(w, θ | x, z) ], which depends on:
E[ I(z_i = k) | x_i, w^{(t)}, θ^{(t)} ] = Pr( z_i = k | x_i, w^{(t)}, θ^{(t)} )
M step: Maximize the previous expectation:
(w^{(t+1)}, θ^{(t+1)}) = arg max_{w, θ} E_{z | x, w^{(t)}, θ^{(t)}} [ log f(w, θ | x, z) ]

9 EM algorithm
Repeat for t = 0, 1, 2, ... until convergence:
E step:
z_{ik}^{(t+1)} = w_k^{(t)} \prod_{m=1}^{M} Pr(x_{im} | θ_{km}^{(t)}) / \sum_{s=1}^{K} w_s^{(t)} \prod_{m=1}^{M} Pr(x_{im} | θ_{sm}^{(t)})
M step:
w_k^{(t+1)} = ( δ_k + \sum_{i=1}^{N} z_{ik}^{(t+1)} - 1 ) / ( \sum_{s=1}^{K} δ_s + N - K )
θ_{km}^{(t+1)} = ( α_{km} + \sum_{i=1}^{N} z_{ik}^{(t+1)} x_{im} - 1 ) / ( α_{km} + β_{km} + \sum_{i=1}^{N} z_{ik}^{(t+1)} - 2 )
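A minimal R sketch of one iteration of these updates (a hypothetical helper written for illustration, independent of the blca.em implementation), assuming x is an N x M binary matrix, w a length-K vector, theta a K x M matrix, and delta, alpha, beta the prior parameters:
em_step <- function(x, w, theta, delta, alpha, beta) {
  N <- nrow(x); K <- length(w)
  # E step: posterior membership probabilities z[i, k]
  logp <- x %*% t(log(theta)) + (1 - x) %*% t(log(1 - theta))   # N x K
  logp <- sweep(logp, 2, log(w), "+")
  z <- exp(logp - apply(logp, 1, max))
  z <- z / rowSums(z)
  # M step: MAP updates of w and theta
  nk <- colSums(z)
  w_new <- (delta + nk - 1) / (sum(delta) + N - K)
  theta_new <- (alpha + t(z) %*% x - 1) / (alpha + beta + nk - 2)
  list(w = w_new, theta = theta_new, z = z)
}
Iterating em_step from several random starting values and keeping the run with the highest log-posterior mimics what blca.em does with its restarts argument.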

10 EM algorithm for Alzheimer data
We apply the EM algorithm to our Alzheimer data, assuming three groups of patients.
fit.em=blca(Alzheimer, 3, method = "em")
An important difficulty with the EM algorithm is that it may converge to a local maximum or a saddle point. Thus, the algorithm is run from a number of different starting values (5, by default), and the parameter estimates from the run which achieved the highest log-posterior are returned. From only five starts, the algorithm obtains three distinct local maxima of the log-posterior, so it seems sensible to run the algorithm more times.
fit.em=blca.em(Alzheimer, 3, restarts=20)
The algorithm provides MAP estimates of the model parameters:
print(fit.em)

11 EM algorithm for Alzheimer data
Complete information about the prior specification, the EM performance, the log-posterior and the AIC and BIC results is obtained with:
summary(fit.em)
Note that AIC and BIC can be used to select the number of patient groups. The MAP estimates of the class probabilities are:
fit.em$classprob
and the MAP estimates of the item probabilities, conditional on class membership, are:
fit.em$itemprob
These estimates can be visualised with the following plot:
par(mfrow=c(1,1))
plot(fit.em)

12 EM algorithm for Alzheimer data
We may try different prior assumptions:
fit.em=blca.em(Alzheimer, 3, restarts=20, alpha=2, beta=2)
print(fit.em)
plot(fit.em)
fit.em=blca.em(Alzheimer, 3, restarts=20, alpha=0.001, beta=0.001)
print(fit.em)
plot(fit.em)
We may wish to approximate the whole posterior distribution of the model parameters rather than obtain only their MAP values. One possibility is to use Gibbs sampling.

13 Gibbs sampling
Gibbs sampling is an MCMC method that can be used when the full conditional posterior distributions are known. In order to obtain a sample from the joint posterior distribution, f(w, θ, z | x), we sample iteratively from the conditional posterior distributions:
1. Sample θ^{(t+1)} ~ f(θ | x, w^{(t)}, z^{(t)})
2. Sample w^{(t+1)} ~ f(w | x, θ^{(t+1)}, z^{(t)})
3. Sample z^{(t+1)} ~ Pr(z | x, θ^{(t+1)}, w^{(t+1)})

14 Gibbs sampling
The conditional posterior distributions are given by:
1. For m = 1, ..., M and k = 1, ..., K,
f(θ_{km} | x, w, z) ∝ θ_{km}^{\sum_{i=1}^{N} I(z_i = k) x_{im} + α_{km} - 1} (1 - θ_{km})^{\sum_{i=1}^{N} I(z_i = k)(1 - x_{im}) + β_{km} - 1}
which is a Beta distribution.
2. And,
f(w | x, θ, z) ∝ \prod_{k=1}^{K} w_k^{\sum_{i=1}^{N} I(z_i = k) + δ_k - 1}
which is a Dirichlet distribution.
3. Finally, for i = 1, ..., N,
Pr(z_i = k | x_i, w, θ) = w_k \prod_{m=1}^{M} Pr(x_{im} | θ_{km}) / \sum_{s=1}^{K} w_s \prod_{m=1}^{M} Pr(x_{im} | θ_{sm})
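A minimal R sketch of one sweep of this Gibbs sampler (a hypothetical helper written for illustration, independent of the blca.gibbs implementation), assuming x is an N x M binary matrix, z a vector of current group labels in 1:K, and scalar prior parameters delta, alpha, beta; the Dirichlet draw is built from independent Gamma variates:
gibbs_step <- function(x, z, delta, alpha, beta, K) {
  M <- ncol(x)
  # 1. Sample theta[k, m] from its Beta full conditional
  theta <- matrix(NA, K, M)
  for (k in 1:K) {
    xk <- x[z == k, , drop = FALSE]
    theta[k, ] <- rbeta(M, alpha + colSums(xk), beta + nrow(xk) - colSums(xk))
  }
  # 2. Sample w from its Dirichlet full conditional
  g <- rgamma(K, shape = delta + tabulate(z, nbins = K))
  w <- g / sum(g)
  # 3. Sample each z_i from its multinomial full conditional
  logp <- sweep(x %*% t(log(theta)) + (1 - x) %*% t(log(1 - theta)), 2, log(w), "+")
  p <- exp(logp - apply(logp, 1, max))
  z_new <- apply(p, 1, function(pr) sample(K, 1, prob = pr))
  list(theta = theta, w = w, z = z_new)
}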

15 Gibbs sampling algorithm for Alzheimer data
We now apply the Gibbs sampling algorithm to the Alzheimer data, initially using three groups:
out=blca(Alzheimer, 3, method = "gibbs")
print(out)
plot(out)
We may also look at the plots of the density estimates for the model parameters. For the item probabilities, conditional on class membership:
par(mfrow = c(3,2))
plot(out, which=3)
And for the class probabilities:
par(mfrow = c(1,1))
plot(out, which=4)

16 Prior sensitivity
We may try different prior assumptions:
out.prior2=blca(Alzheimer, 3, method = "gibbs", alpha=2, beta=2)
print(out.prior2)
plot(out.prior2)
out.prior3=blca(Alzheimer, 3, method = "gibbs", alpha=0.001, beta=0.001)
print(out.prior3)
plot(out.prior3)

17 Model selection
We may also try different values for the number of patient groups:
out.size1=blca(Alzheimer, 1, method = "gibbs")
print(out.size1)
plot(out.size1)
out.size2=blca(Alzheimer, 2, method = "gibbs")
print(out.size2)
plot(out.size2)
We can use the DIC criterion (which will be studied in Chapter 5) to select the mixture size:
out.size1$DIC
out.size2$DIC
out$DIC
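As a usage sketch, this comparison can be wrapped in a loop over candidate numbers of groups; the name of the stored DIC element is assumed here, so check names(out) in your version of BayesLCA:
# Fit models with 1 to 4 groups and collect the DIC of each
fits <- lapply(1:4, function(g) blca(Alzheimer, g, method = "gibbs"))
sapply(fits, function(f) f$DIC)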

18 Gibbs sampling algorithm for Alzheimer data
In all cases, we have run the Gibbs sampler with its default settings: a burn-in of 100 iterations and a thinning rate of 1. Convergence diagnostics should always be carried out.
par(mfrow = c(4, 2))
plot(out, which = 5)
We may observe that the MCMC performance is not very good: the chain seems to have converged, but it is not mixing well. This can also be seen with convergence diagnostic methods such as raftery.diag, available in the coda package, which is automatically loaded by the BayesLCA package.
raftery.diag(as.mcmc(out))
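Other standard coda diagnostics can be applied to the same converted chain; for example, effective sample sizes and autocorrelations give a quick picture of how poor the mixing is (a sketch, relying on the as.mcmc conversion used above):
library(coda)
samples <- as.mcmc(out)
effectiveSize(samples)   # effective number of independent draws per parameter
autocorr.diag(samples)   # autocorrelations at the default lags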

19 Gibbs sampling algorithm for Alzheimer data
The output of the convergence diagnostic suggests that the sampler converges quickly (the burn-in values are low) but is not mixing satisfactorily (note the high dependence factor of many parameters). A Gibbs sampler with better tuned settings can then be run:
out2=blca(Alzheimer, 3, method = "gibbs", burn.in = 150, thin = 1/10, iter = 50000)
plot(out2, which = 5)

20 Gibbs sampling algorithm for Alzheimer data
One point worth mentioning is that the blca.gibbs function includes, by default, a relabelling method to reduce the label switching problem. This is a well-known problem in mixture models that arises from the lack of identifiability of the component labels. Without relabelling, the label switching problem can be observed in the trace plots:
fit.gs=blca(Alzheimer, 3, method = "gibbs", relabel=FALSE)
plot(fit.gs, which = 5)

21 Variational Bayes
The idea is to approximate the posterior distribution f(w, θ, z | x) with a variational distribution q(w, θ, z) which assumes independence among blocks of parameters:
q(w, θ, z) = q_1(w | γ) q_2(θ | ζ) q_3(z | φ)
where (γ, ζ, φ) are the variational parameters. The VB approach looks for the distributions q_j that minimize the Kullback-Leibler divergence between the variational approximation and the posterior. In mixture models, it can be shown that the form of each q_j is the same as that of the corresponding conditional posterior distribution. Then,
w | γ ~ Dirichlet(γ_1, ..., γ_K)
θ_{km} | ζ ~ Beta(ζ_{km1}, ζ_{km2})
z_i | φ ~ Multinomial(1; φ_{i1}, ..., φ_{iK})
The variational parameters are updated iteratively until the KL divergence is minimized.
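A minimal R sketch of the resulting coordinate-ascent updates for this model (a generic illustration, not the blca.vb code), assuming x is an N x M binary matrix, gamma a length-K vector, zeta1 and zeta2 K x M matrices, and scalar prior parameters delta, alpha, beta:
vb_step <- function(x, gamma, zeta1, zeta2, delta, alpha, beta) {
  # Update q(z): expected log weights and expected log item probabilities
  Elogw  <- digamma(gamma) - digamma(sum(gamma))       # length K
  Elogt  <- digamma(zeta1) - digamma(zeta1 + zeta2)    # K x M
  Elog1t <- digamma(zeta2) - digamma(zeta1 + zeta2)    # K x M
  logphi <- sweep(x %*% t(Elogt) + (1 - x) %*% t(Elog1t), 2, Elogw, "+")
  phi <- exp(logphi - apply(logphi, 1, max))
  phi <- phi / rowSums(phi)                            # N x K responsibilities
  # Update q(w) and q(theta)
  gamma <- delta + colSums(phi)
  zeta1 <- alpha + t(phi) %*% x
  zeta2 <- beta + t(phi) %*% (1 - x)
  list(gamma = gamma, zeta1 = zeta1, zeta2 = zeta2, phi = phi)
}
The updates are repeated until the change in the variational parameters (or in the lower bound on the marginal likelihood) is negligible.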

22 Variational Bayes
We now apply the VB algorithm to the Alzheimer data:
fit.vb=blca(Alzheimer, 3, method = "vb")
print(fit.vb)
Observe that the Variational Bayes method is much faster than Gibbs sampling, and it also provides posterior standard deviation estimates:
fit.vb$itemprob
fit.vb$classprob
fit.vb$itemprob.sd
fit.vb$classprob.sd
The estimates are close to those obtained with the Gibbs sampling algorithm:
fit.gs$itemprob
fit.vb$itemprob
fit.gs$classprob
fit.vb$classprob

23 Variational Bayes
However, Gibbs sampling provides a better approximation of the posterior distributions. Observe that the posterior standard deviation estimates from the Gibbs sampler are larger than those obtained with the VB method:
fit.gs$itemprob.sd
fit.gs$classprob.sd
fit.vb$itemprob.sd
fit.vb$classprob.sd

24 Variational Bayes
We may also observe these differences in the plots of the density estimates for the model parameters. For the item probabilities, conditional on class membership:
par(mfrow = c(3,2))
plot(fit.gs, which=3)
plot(fit.vb, which=3)
And for the class probabilities:
par(mfrow = c(1,1))
plot(fit.gs, which=4)
plot(fit.vb, which=4)

25 Variational Bayes
One method for determining an appropriate number of classes to fit to the Alzheimer data is to deliberately over-fit the model and then consider only the classes whose posterior mean class probability is non-negligible:
fit.vb=blca(Alzheimer, 10, method = "vb")
fit.vb$classprob
This suggests that a 2-class fit is best suited to the variational Bayes approximation.
plot(fit.vb, which = 5)
The multiple jumps in the lower bound indicate where components have emptied out.

26 Summary
We have implemented a Bayesian approach for clustering Alzheimer patients according to their symptoms. A mixture of K multivariate binary distributions has been considered for modelling the observed data. We have implemented three different computational Bayesian methods to estimate finite mixture models: EM, MCMC and VB. The VB approximation provides a very fast procedure to estimate the posterior distribution of the model parameters. However, it is well known that the independence between parameters that is enforced in VB approximations results in underestimated posterior variances. A better approximation (although usually more time consuming) is provided by MCMC methods and, in particular, by Gibbs sampling. Standard MCMC methods can be extremely time consuming for big data sets.
