BART STAT8810, Fall 2017
1 BART STAT8810, Fall 2017 M.T. Pratola November 1, 2017
2 Today BART: Bayesian Additive Regression Trees
3 BART: Bayesian Additive Regression Trees

The additive model generalizes the single-tree regression model:

$$Y(x) = f(x) + \epsilon, \quad \epsilon \sim N(0, \sigma^2)$$

where

$$f(x) = \sum_{j=1}^{m} g(x; T_j, M_j).$$

We viewed each tree as representing a map $g(x; T_j, M_j): \mathbb{R}^p \to \mathbb{R}$. We can get a richer class of models by considering the sum of many such maps. We will see that each individual function $g(x; T_j, M_j)$ is constrained to be a simplistic function that explains only a small portion of the response variability: the so-called sum-of-weak-learners assumption.
4 BART: Bayesian Additive Regression Trees

(Figure 1: BART uses a sum of many simple trees.)
5 BART: Bayesian Additive Regression Trees

There is, of course, nothing new about a GAM-like formulation. However, one advantage of the tree-based approach is the tree's natural ability to capture interactions, possibly of a high-dimensional form, or to not capture such behavior when it is not present. That is, in the tree-based approach we are learning the form of the predictor functions themselves rather than assuming a fixed class of bases with a particular form.
6 BART Model

The data is modeled as

$$Y(x) \mid \{(T_j, M_j)\}_{j=1}^{m}, \sigma^2 \sim N\left(\sum_{j=1}^{m} g(x; T_j, M_j),\; \sigma^2\right)$$

giving our likelihood,

$$L(\{(T_j, M_j)\}_{j=1}^{m}, \sigma^2 \mid Y) = \frac{1}{(2\pi\sigma^2)^{n/2}} \exp\left(-\frac{1}{2\sigma^2}\sum_{i=1}^{n}\left(y(x_i) - \sum_{j=1}^{m} g(x_i; T_j, M_j)\right)^2\right)$$

where $Y = (y(x_1), \ldots, y(x_n))$.
7 BART Model

The default number of trees is $m = 200$, which seems to work well in many problems. Increasing $m$ allows one to have a model with greater fidelity to more complex responses.

The interpretation is as follows. We can view $g(x; T_j, M_j)$ as a function that assigns a terminal-node scalar $\mu_{jb}$ to a given input $x$. And so the expected response $E[Y(x) \mid \{(T_j, M_j)\}_{j=1}^{m}]$ is simply the sum of the $\mu_{jb}$'s assigned to $x$ by the trees $(T_j, M_j)$.
8 BART Priors

Similar to our single-tree model, the BART prior factors as

$$\pi(\{(T_j, M_j)\}_{j=1}^{m}, \sigma^2) = \pi(\sigma^2) \prod_{j=1}^{m} \pi(M_j \mid T_j)\,\pi(T_j)$$

where for each $j$, $\pi(M_j \mid T_j) = \prod_{b=1}^{B_j} \pi(\mu_{jb})$, and $B_j = |M_j|$ is the number of terminal nodes in tree $T_j$.
9 BART Priors: $\pi(T_j)$

The interior of the tree $T_j$ is made up of split rules $\{(v_{ji}, c_{ji})\}$ with discrete uniform priors, as in our single-tree model:

$$\pi(v_j) = \prod_i \pi(v_{ji}) \quad \text{and} \quad \pi(c_j) = \prod_i \pi(c_{ji} \mid v_{ji}, T_j \setminus v_{ji}).$$

And the tree is regularized via the depth-penalizing prior from before as well,

$$\pi(\eta_{ji} \text{ splits}) = \alpha\,(1 + d(\eta_{ji}, \eta_{j1}))^{-\beta}$$

where $d(\eta_{ji}, \eta_{j1})$ is the depth from node $\eta_{ji}$ to the root node $\eta_{j1}$ in tree $T_j$. The default prior is the same as in our single-tree model: $\alpha = 0.95$, $\beta = 2$.
10 BART Priors: $\pi(\mu_{ji} \mid T_j)$

The prior on the scalar terminal-node parameters is the conjugate Normal, $\mu_{ji} \sim N(\mu_\mu, \sigma_\mu^2)$. Note that this implies, a priori, that the induced prior on $E[Y(x) \mid \{(T_j, M_j)\}_{j=1}^{m}]$ is $N(m\mu_\mu, m\sigma_\mu^2)$.

In practice, the data is usually centered to have mean zero and scaled so that $(y_{\min}, y_{\max}) = (-0.5, 0.5)$, and the prior used is $\mu_{ji} \sim N(0, \sigma_\mu^2)$.
11 BART Priors: $\pi(\mu_{ji} \mid T_j)$

The induced prior on $E[Y(x) \mid \{(T_j, M_j)\}_{j=1}^{m}]$ is then $N(0, m\sigma_\mu^2)$. The strategy to calibrate this prior is the same as in the single-tree model: choose a value $k$ such that

$$k\sqrt{m}\,\sigma_\mu = 0.5,$$

which implies that the prior is

$$\mu_{ji} \sim N\left(0, \left(\frac{0.5}{k\sqrt{m}}\right)^2\right).$$

As in the single-tree model, the recommended default value is $k = 2$.
12 BART Priors: $\pi(\mu_{ji} \mid T_j)$

(Figure: prior density of $\mu_{ji}$ with $m = 200$, $k = 2$.)
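A minimal sketch (not from the slides) of reproducing this figure in R, using the calibration $\sigma_\mu = 0.5/(k\sqrt{m})$ from the previous slide:

# Sketch: plot the calibrated terminal-node prior N(0, (0.5/(k*sqrt(m)))^2)
m <- 200; k <- 2
sigma.mu <- 0.5 / (k * sqrt(m))
curve(dnorm(x, mean = 0, sd = sigma.mu), from = -0.1, to = 0.1,
      xlab = expression(mu[ji]), ylab = "Density",
      main = "Prior with m=200, k=2")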
13 BART Priors: $\pi(\mu_{ji} \mid T_j)$

This means we induce further shrinkage of the $\mu_{ji}$'s towards zero by increasing $k$ or by increasing $m$. However, for $m$ fixed, increasing $k$ implies more of the response variability ends up in $\sigma^2$, while for $k$ fixed, increasing $m$ implies more of the response variability ends up in $f(x)$. Besides the default choice of $k = 2$, one might try tuning this prior hyperparameter using cross-validation.
14 BART Priors: $\pi(\sigma^2)$

The variance prior is again the scaled inverse chi-squared, $\sigma^2 \sim \chi^{-2}(\nu, \tau^2)$, and it is calibrated similarly to the single-tree model. The degrees of freedom $\nu$ are selected to get an appropriate shape; typical values are between 3 and 10, with $\nu = 3$ being the default.
15 BART Priors: $\pi(\sigma^2)$

The scale parameter $\tau^2$ is selected in the following way. Provide an initial estimate of the standard deviation of your data, $\hat\sigma$ (typically the sample standard deviation). Provide an upper quantile $q$, with $q = 0.90$ being the default. Then $\tau^2$ is selected so that, a priori, $P(\sigma < \hat\sigma) = q$.

The idea is that our data is unlikely to be all noise, so a conservative approach is to set up the prior so that it is very unlikely to estimate the variance to be greater than the sample variance of our data. The smaller $\nu$, the more concentrated on small $\sigma$ the prior becomes.
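To make the calibration concrete, note (a sketch of our own, not from the slides) that if $\sigma^2 \sim \chi^{-2}(\nu, \tau^2)$ then $\nu\tau^2/\sigma^2 \sim \chi^2_\nu$, so the constraint $P(\sigma < \hat\sigma) = q$ can be solved for $\tau^2$ with a single chi-squared quantile:

# Sketch: calibrate tau^2 so that P(sigma < sigma.hat) = q a priori.
# Uses nu*tau^2/sigma^2 ~ chi^2_nu when sigma^2 ~ scaled-inv-chi^2(nu, tau^2):
# P(sigma^2 < sigma.hat^2) = P(chi^2_nu > nu*tau^2/sigma.hat^2) = q
calibrate.tau2 <- function(sigma.hat, nu = 3, q = 0.90) {
  sigma.hat^2 * qchisq(1 - q, df = nu) / nu
}
tau2 <- calibrate.tau2(sigma.hat = 1)
# Monte Carlo check: sigma^2 = nu*tau^2 / chi^2_nu draws
sig2 <- 3 * tau2 / rchisq(1e5, df = 3)
mean(sqrt(sig2) < 1)  # approximately 0.90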
16 Sampling BART's Posterior

The posterior is

$$\pi(\{(T_j, M_j)\}_{j=1}^{m}, \sigma^2 \mid Y) \propto L(\{(T_j, M_j)\}_{j=1}^{m}, \sigma^2 \mid Y)\,\pi(\sigma^2)\prod_{j=1}^{m}\pi(M_j \mid T_j)\,\pi(T_j).$$
17 Sampling BART's Posterior

First, note the following:

$$L(\{(T_j, M_j)\}_{j=1}^{m}, \sigma^2 \mid Y) = \frac{1}{(2\pi\sigma^2)^{n/2}}\exp\left(-\frac{1}{2\sigma^2}\sum_{i=1}^{n}\left(y(x_i) - \sum_{j=1}^{m} g(x_i; T_j, M_j)\right)^2\right)$$
$$= \frac{1}{(2\pi\sigma^2)^{n/2}}\exp\left(-\frac{1}{2\sigma^2}\sum_{i=1}^{n}\left(r_j(x_i) - g(x_i; T_j, M_j)\right)^2\right)$$

where $r_j(x_i) = y(x_i) - \sum_{k \neq j} g(x_i; T_k, M_k)$.
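In code, if the current fits of all $m$ trees are kept in an $n \times m$ matrix, these partial residuals are a one-liner. A sketch with hypothetical toy values (G, y, and j are not from the slides):

# Sketch: partial residuals r_j, where G[i,j] holds the current fit
# g(x_i; T_j, M_j) of tree j at observation i
n <- 5; m <- 3
G <- matrix(rnorm(n * m), n, m)  # hypothetical current tree fits
y <- rnorm(n)                    # hypothetical observations
j <- 2
rj <- y - rowSums(G[, -j, drop = FALSE])  # r_j(x_i), i = 1,...,n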
18 Sampling BART's Posterior

Our MCMC algorithm will perform the following steps:

1. For $j = 1, \ldots, m$:
   1.1 Draw $T_j \mid \sigma^2, R_j$ where $R_j = (r_j(x_1), \ldots, r_j(x_n))$ (Metropolis-Hastings step via a proposal distribution).
   1.2 Draw $M_j \mid T_j, \sigma^2, R_j$ (Gibbs step using the conjugate prior).
2. Draw $\sigma^2 \mid \{(T_j, M_j)\}_{j=1}^{m}, Y$ (Gibbs step using the conjugate prior).

So once we have the $R_j$'s, the algorithm proceeds similarly to the single-tree algorithm.
19 Draw $\sigma^2 \mid \{(T_j, M_j)\}_{j=1}^{m}, Y$

We have

$$\pi(\sigma^2 \mid \nu, \tau^2) = \frac{(\nu\tau^2/2)^{\nu/2}}{\Gamma(\nu/2)}\,\frac{1}{\sigma^{\nu+2}}\exp\left(-\frac{\nu\tau^2}{2\sigma^2}\right) \propto \frac{1}{\sigma^{\nu+2}}\exp\left(-\frac{\nu\tau^2}{2\sigma^2}\right).$$

So,

$$\pi(\sigma^2 \mid \{(T_j, M_j)\}_{j=1}^{m}, Y) \propto \frac{1}{\sigma^n}\exp\left(-\frac{1}{2\sigma^2}\sum_{i=1}^{n}(y(x_i) - f(x_i))^2\right)\cdot\frac{1}{\sigma^{\nu+2}}\exp\left(-\frac{\nu\tau^2}{2\sigma^2}\right)$$
$$= \frac{1}{\sigma^{(\nu+n)+2}}\exp\left(-\frac{(\nu+n)}{2\sigma^2}\left(\frac{\nu\tau^2 + ns^2}{\nu+n}\right)\right)$$

where $s^2 = \frac{1}{n}\sum_{i=1}^{n}(y_i - f(x_i))^2$ and $f(x_i) = \sum_{j=1}^{m} g(x_i; T_j, M_j)$.
20 Draw $\sigma^2 \mid \{(T_j, M_j)\}_{j=1}^{m}, Y$

And we recognize $\frac{1}{\sigma^{(\nu+n)+2}}\exp\left(-\frac{(\nu+n)}{2\sigma^2}\left(\frac{\nu\tau^2 + ns^2}{\nu+n}\right)\right)$ as the kernel of a scaled inverse chi-squared distribution, so

$$\sigma^2 \mid \{(T_j, M_j)\}_{j=1}^{m}, Y \sim \chi^{-2}\left(\nu + n,\; \frac{\nu\tau^2 + ns^2}{\nu + n}\right).$$

So we know how to perform the Gibbs step for $\sigma^2$.
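A sketch of this Gibbs step in R (ours, not from the slides), using the standard trick that a scaled-inverse-chi-squared draw is $\nu'\tau'^2$ divided by a $\chi^2_{\nu'}$ draw; fhat denotes a hypothetical vector of the current fitted values $f(x_i)$:

# Sketch: Gibbs draw of sigma^2 given the current sum-of-trees fit fhat
draw.sigma2 <- function(y, fhat, nu, tau2) {
  n <- length(y)
  s2 <- mean((y - fhat)^2)
  (nu * tau2 + n * s2) / rchisq(1, df = nu + n)  # scaled-inv-chi^2 draw
}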
21 Draw $M_j \mid T_j, \sigma^2, R_j$

Suppose there are $B$ terminal nodes in tree $T_j$, $\eta^b_{j1}, \ldots, \eta^b_{jB}$. Using the same factorization as in the single-tree case:

$$L(\sigma^2, T_j, M_j \mid R_j) \propto \exp\left(-\frac{1}{2\sigma^2}\sum_{i=1}^{n}(r_j(x_i) - g(x_i; T_j, M_j))^2\right)$$
$$= \exp\left(-\frac{1}{2\sigma^2}\sum_{k=1}^{B}\sum_{i: r_j(x_i) \in \eta^b_k}(r_j(x_i) - \mu_{jk})^2\right) = \prod_{k=1}^{B}\exp\left(-\frac{1}{2\sigma^2}\sum_{i: r_j(x_i) \in \eta^b_k}(r_j(x_i) - \mu_{jk})^2\right)$$

where $n_k$ is the number of observations mapping to terminal node $\eta^b_{jk}$ and $\sum_k n_k = n$.
22 Draw $M_j \mid T_j, \sigma^2, R_j$

In other words, conditional on $T_j$ and $R_j$, the scalar terminal-node parameters are independent. So we can simply write down the full conditional for each $\mu_{jk}$ and draw them sequentially using Gibbs steps.
23 Draw $\mu_{jk} \mid T_j, \sigma^2, R_j$

Assuming mean-centered observations, our prior is $\pi(\mu_{jk} \mid T_j) = N(0, \sigma_\mu^2)$. Based on our Normal-Normal conjugacy results, the full conditional is

$$\mu_{jk} \mid \sigma^2, T_j, R_j \sim N\left(\left(\frac{n_k}{\sigma^2} + \frac{1}{\sigma_\mu^2}\right)^{-1}\frac{n_k \bar r_{jk}}{\sigma^2},\; \left(\frac{n_k}{\sigma^2} + \frac{1}{\sigma_\mu^2}\right)^{-1}\right)$$

where $\bar r_{jk} = \frac{1}{n_k}\sum_{i: r_j(x_i) \in \eta^b_k} r_j(x_i)$.
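A sketch of this Gibbs step (ours, not from the slides); rj.node is a hypothetical vector holding the partial residuals that map to terminal node $\eta^b_k$:

# Sketch: Gibbs draw of a terminal-node mean mu_jk under the N(0, sigma.mu2) prior
draw.mu <- function(rj.node, sigma2, sigma.mu2) {
  nk <- length(rj.node)
  prec <- nk / sigma2 + 1 / sigma.mu2  # full-conditional precision
  rnorm(1, mean = (nk * mean(rj.node) / sigma2) / prec, sd = sqrt(1 / prec))
}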
24 Draw $T_j \mid \sigma^2, R_j$

(Figure 2: Tree moves.)
25 Marginal Likelihood

We will again need to marginalize our likelihood over the $\mu$ parameters. Marginalizing the portion of the likelihood associated with terminal node $\eta^b_k$, we have

$$L(\eta^b_{jk} \mid \sigma^2, R_j) = \int_{\mu_{jk}} L(\eta^b_{jk} \mid \mu_{jk}, \sigma^2, R_j)\,\pi(\mu_{jk})\,d\mu_{jk}.$$
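The slide stops at the integral; for completeness, carrying out this standard Normal-Normal integral (our own working, consistent with the single-tree case) gives, up to factors not depending on the tree,

$$L(\eta^b_{jk} \mid \sigma^2, R_j) \propto \left(\frac{\sigma^2}{\sigma^2 + n_k\sigma_\mu^2}\right)^{1/2}\exp\left(-\frac{1}{2\sigma^2}\sum_{i: r_j(x_i)\in\eta^b_k} r_j(x_i)^2\right)\exp\left(\frac{\sigma_\mu^2\, n_k^2\, \bar r_{jk}^2}{2\sigma^2\,(\sigma^2 + n_k\sigma_\mu^2)}\right),$$

which is the quantity the birth/death Metropolis-Hastings ratio below compares across trees.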
26 Birth Proposal

1. Randomly select a terminal node $k \in \{1, \ldots, B_j\}$ with probability $\frac{1}{B_j}$, where $B_j = |M_j|$.
2. Introduce a new rule $v_{jk} \sim \pi_v(v_{jk})$ and cutpoint $c_{jk} \sim \pi_c(c_{jk})$, where $\pi_v, \pi_c$ are typically discrete Uniform on the available variables and cutpoints.
3. Calculate
$$\alpha = \min\left\{1,\; \frac{\pi(T_j^* \mid \sigma^2, R_j)\,q(T_j \mid T_j^*)}{\pi(T_j \mid \sigma^2, R_j)\,q(T_j^* \mid T_j)}\right\}$$
4. Generate $u \sim \text{Uniform}(0, 1)$. If $u < \alpha$ then accept $T_j^*$, otherwise reject.

As mentioned for the single-tree model, death proposals work similarly. A generic accept step is sketched below.
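A minimal sketch of the accept/reject step on the log scale (ours, not from the slides; the four arguments are hypothetical quantities the caller computes from the marginal likelihood, tree prior, and proposal densities):

# Sketch: generic Metropolis-Hastings accept step for a tree proposal T*.
# logpost.new/logpost.old: log pi(T* | sigma^2, R_j), log pi(T | sigma^2, R_j);
# logq.fwd = log q(T* | T); logq.rev = log q(T | T*).
mh.accept <- function(logpost.new, logpost.old, logq.fwd, logq.rev) {
  log.alpha <- min(0, (logpost.new + logq.rev) - (logpost.old + logq.fwd))
  log(runif(1)) < log.alpha
}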
27 MCMC Algorithm

Let's recap our sampling algorithm.

1. For $j = 1, \ldots, m$:
   1.1 Draw $T_j \mid \sigma^2, R_j$: with probability $\pi_b$ do a birth proposal, otherwise a death proposal. More complex moves are possible, such as changing the variables/cutpoints of an existing tree.
   1.2 Draw $M_j \mid T_j, \sigma^2, R_j$: for $k = 1, \ldots, B_j$, perform our Gibbs steps by drawing
$$\mu_{jk} \mid \sigma^2, T_j, R_j \sim N\left(\left(\frac{n_k}{\sigma^2} + \frac{1}{\sigma_\mu^2}\right)^{-1}\frac{n_k \bar r_{jk}}{\sigma^2},\; \left(\frac{n_k}{\sigma^2} + \frac{1}{\sigma_\mu^2}\right)^{-1}\right)$$
2. Draw $\sigma^2 \mid \{(T_j, M_j)\}_{j=1}^{m}, Y$: perform our Gibbs step by drawing
$$\sigma^2 \mid \{(T_j, M_j)\}_{j=1}^{m}, Y \sim \chi^{-2}\left(\nu + n,\; \frac{\nu\tau^2 + ns^2}{\nu + n}\right)$$

A toy end-to-end sketch of this backfitting structure follows.
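To make the backfitting bookkeeping concrete, here is a deliberately degenerate but runnable sketch (ours, not the BART implementation): every "tree" is held fixed at a root-only stump, so step 1.1 disappears and each sweep reduces to the two Gibbs draws above. It reuses the calibrate.tau2, draw.mu, and draw.sigma2 sketches from the earlier slides.

# Degenerate backfitting sketch: m root-only stumps, so g(x; T_j, M_j) = mu_j
set.seed(1)
n <- 50; m <- 5
y <- rnorm(n, mean = 0, sd = 0.5)        # toy mean-centered data
sigma.mu2 <- (0.5 / (2 * sqrt(m)))^2     # k = 2 calibration
nu <- 3; tau2 <- calibrate.tau2(sd(y), nu, 0.90)
mu <- rep(0, m); sigma2 <- var(y)
for (iter in 1:2000) {
  for (j in 1:m) {
    rj <- y - (sum(mu) - mu[j])          # partial residuals r_j(x_i)
    mu[j] <- draw.mu(rj, sigma2, sigma.mu2)  # step 1.2 (one node per stump)
  }
  sigma2 <- draw.sigma2(y, rep(sum(mu), n), nu, tau2)  # step 2
}
c(fit = sum(mu), sigma = sqrt(sigma2))   # final posterior draw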
28 Example

source("dace.sim.r")
# Generate response:
set.seed(88)
n=5; k=1; rhotrue=0.2; lambdatrue=1
design=as.matrix(runif(n))
l1=list(m1=outer(design[,1],design[,1],"-"))
l.dez=list(l1=l1)
R=rhogeodacecormat(l.dez,c(rhotrue))$R
L=t(chol(R))
u=rnorm(nrow(R))
z=L%*%u
# Our observed data:
y=as.vector(z)
29 Example

library(BayesTree)
preds=matrix(seq(0,1,length=100),ncol=1)
# Variance prior
shat=sd(y)
nu=3
q=0.90
# Mean prior
k=2
# Tree prior
m=1
alpha=0.95
beta=2
nc=100
# MCMC settings
N=1000
burn=1000
30 Example

fit=bart(design,y,preds,sigest=shat,sigdf=nu,sigquant=q,
  k=k,power=beta,base=alpha,ntree=m,numcut=nc,
  ndpost=N,nskip=burn)

## Running BART with numeric y
## number of trees: 1
## number of training observations: 5
## (remaining console output, echoing the prior settings, elided)
31 Example

plot(design,y,pch=20,col="red",cex=2,xlim=c(0,1),
  ylim=c(-2.3,3.7),xlab="x",
  main="Predicted mean response +/- 2s.d.")
for(i in 1:nrow(fit$yhat.test))
  lines(preds,fit$yhat.test[i,],col="grey",lwd=0.25)
mean=apply(fit$yhat.test,2,mean)
sd=apply(fit$yhat.test,2,sd)
lines(preds,mean-1.96*sd,lwd=0.75,col="black")
lines(preds,mean+1.96*sd,lwd=0.75,col="black")
lines(preds,mean,lwd=2,col="blue")
points(design,y,pch=20,col="red")
32 Example

(Figure: predicted mean response +/- 2 s.d. versus x, for the m = 1 fit.)
33 Example

# Try m=10 trees
m=10
fit=bart(design,y,preds,sigest=shat,sigdf=nu,sigquant=q,
  k=k,power=beta,base=alpha,ntree=m,numcut=nc,
  ndpost=N,nskip=burn)

## Running BART with numeric y
## number of trees: 10
## (remaining console output elided)
34 Example

(Figure: predicted mean response +/- 2 s.d. versus x, for the m = 10 fit.)
35 Example

# Try m=20 trees
m=20
fit=bart(design,y,preds,sigest=shat,sigdf=nu,sigquant=q,
  k=k,power=beta,base=alpha,ntree=m,numcut=nc,
  ndpost=N,nskip=burn)

## Running BART with numeric y
## number of trees: 20
## (remaining console output elided)
36 Example

(Figure: predicted mean response +/- 2 s.d. versus x, for the m = 20 fit.)
37 Example

# Try m=100 trees
m=100
fit=bart(design,y,preds,sigest=shat,sigdf=nu,sigquant=q,
  k=k,power=beta,base=alpha,ntree=m,numcut=nc,
  ndpost=N,nskip=burn)

## Running BART with numeric y
## number of trees: 100
## (remaining console output elided)
38 Example

(Figure: predicted mean response +/- 2 s.d. versus x, for the m = 100 fit.)
39 Example

# Try m=200 trees, the recommended default
m=200
fit=bart(design,y,preds,sigest=shat,sigdf=nu,sigquant=q,
  k=k,power=beta,base=alpha,ntree=m,numcut=nc,
  ndpost=N,nskip=burn)

## Running BART with numeric y
## number of trees: 200
## (remaining console output elided)
40 Example

(Figure: predicted mean response +/- 2 s.d. versus x, for the m = 200 fit.)
41 Example

# Try m=200 trees, the recommended default
m=200
# And k=1
k=1
fit=bart(design,y,preds,sigest=shat,sigdf=nu,sigquant=q,
  k=k,power=beta,base=alpha,ntree=m,numcut=nc,
  ndpost=N,nskip=burn)

## Running BART with numeric y
## number of trees: 200
## (remaining console output elided)
42 Example

(Figure: predicted mean response +/- 2 s.d. versus x, for m = 200, k = 1.)
43 Example

# Try m=200 trees, the recommended default
m=200
# And k=1
k=1
# And nu=3, q=.99
nu=3
q=0.99
fit=bart(design,y,preds,sigest=shat,sigdf=nu,sigquant=q,
  k=k,power=beta,base=alpha,ntree=m,numcut=nc,
  ndpost=N,nskip=burn)

## Running BART with numeric y
## number of trees: 200
## (remaining console output elided)
44 Example

(Figure: predicted mean response +/- 2 s.d. versus x, for m = 200, k = 1, nu = 3, q = 0.99.)
45 Example

# Try m=200 trees, the recommended default
m=200
# And k=1
k=1
# And nu=2, q=.99
nu=2
q=0.99
fit=bart(design,y,preds,sigest=shat,sigdf=nu,sigquant=q,
  k=k,power=beta,base=alpha,ntree=m,numcut=nc,
  ndpost=N,nskip=burn)

## Running BART with numeric y
## number of trees: 200
## (remaining console output elided)
46 Example

(Figure: predicted mean response +/- 2 s.d. versus x, for m = 200, k = 1, nu = 2, q = 0.99.)
47 Example

# Try m=200 trees, the recommended default
m=200
# And k=1
k=1
# And nu=1, q=.99
nu=1
q=0.99
fit=bart(design,y,preds,sigest=shat,sigdf=nu,sigquant=q,
  k=k,power=beta,base=alpha,ntree=m,numcut=nc,
  ndpost=N,nskip=burn)

## Running BART with numeric y
## number of trees: 200
## (remaining console output elided)
48 Example

(Figure: predicted mean response +/- 2 s.d. versus x, for m = 200, k = 1, nu = 1, q = 0.99.)
49 Example

# Try m=200 trees, the recommended default
m=200
# And k=1
k=1
# And nu=1, q=.99
nu=1
q=0.99
# And numcuts=1000
nc=1000
fit=bart(design,y,preds,sigest=shat,sigdf=nu,sigquant=q,
  k=k,power=beta,base=alpha,ntree=m,numcut=nc,
  ndpost=N,nskip=burn)

## Running BART with numeric y
## (remaining console output elided)
50 Example

(Figure: predicted mean response +/- 2 s.d. versus x, for m = 200, k = 1, nu = 1, q = 0.99, numcut = 1000.)
51 Example

library(rgl)
load("co2plume.dat")
plot3d(co2plume)
rgl.snapshot("co2a.png")
52 Example

(Figure: 3D scatterplot of the co2plume data, saved as co2a.png.)
53 Example

y=co2plume$co2
x=co2plume[,1:2]
preds=as.data.frame(expand.grid(seq(0,1,length=20),
  seq(0,1,length=20)))
colnames(preds)=colnames(x)
shat=sd(y)
# Try m=200 trees, the recommended default
m=200
# And k=1
k=1
# And nu=1, q=.99
nu=1
q=0.99
# And numcuts=1000
nc=1000
fit=bart(x,y,preds,sigest=shat,sigdf=nu,sigquant=q,
  k=k,power=beta,base=alpha,ntree=m,numcut=nc,
  ndpost=N,nskip=burn)
54 Example

plot(fit$sigma,type='l',xlab="iteration",
  ylab=expression(sigma))
55 Example

(Figure: trace plot of σ versus iteration.)
56 Example

ym=fit$yhat.test.mean
ysd=apply(fit$yhat.test,2,sd)
persp3d(x=seq(0,1,length=20),y=seq(0,1,length=20),z=matrix(ym,20,20),
  col="grey",xlab="stack_inerts",ylab="time",zlab="co2")
plot3d(co2plume,add=TRUE)
rgl.snapshot("co2b.png")
57 Example

(Figure: posterior mean surface with the co2plume data overlaid, saved as co2b.png.)
58 Example

persp3d(x=seq(0,1,length=20),y=seq(0,1,length=20),z=matrix(ym,20,20),
  col="grey",xlab="stack_inerts",ylab="time",zlab="co2")
persp3d(x=seq(0,1,length=20),y=seq(0,1,length=20),
  z=matrix(ym+2*ysd,20,20),col="green",add=TRUE)
persp3d(x=seq(0,1,length=20),y=seq(0,1,length=20),
  z=matrix(ym-2*ysd,20,20),col="green",add=TRUE)
plot3d(co2plume,add=TRUE)
rgl.snapshot("co2c.png")
59 Example

(Figure: posterior mean surface with +/- 2 s.d. surfaces and the co2plume data overlaid, saved as co2c.png.)
More information