The BGLR (Bayesian Generalized Linear Regression) R Package. Gustavo de los Campos, Amit Pataki & Paulino Pérez. (August 2013)

Biostatistics Department
Bayesian Generalized Linear Regression (BGLR)

Contact: gdeloscampos@gmail.com

Contents

1. Introduction
2. Structure of the software
3. Running BGLR
   3.1. Loading the BGLR package
   3.2. Fitting a fixed effects model to a continuous outcome
   3.3. Fitting a fixed effects model to a binary outcome
   3.4. Fitting a fixed effects model to a right-censored outcome
   3.5. Fitting marker effects as random
   3.6. Extracting estimates of marker effects and predictions
   3.7. Predicting un-observed outcomes using BGLR

1. Introduction

The BLR (Bayesian Linear Regression, http://cran.r-project.org/web/packages/BLR/index.html) package of R (http://cran.r-project.org) implements several types of Bayesian regression models, including fixed effects, the Bayesian Lasso (BL, Park and Casella 2008) and Bayesian Ridge Regression. BLR can only handle continuous outcomes. We have produced a modified (beta) version of BLR (BGLR = Bayesian Generalized Linear Regression) that extends BLR by allowing regressions for binary and censored outcomes. Most of the inputs, processes and outputs are as in BLR. Here we focus on describing the changes in inputs, internal processes and outputs introduced to handle binary and censored outcomes. Users who are not familiar with BLR are strongly encouraged to first read the BLR user's manual and Pérez et al. (2010). Future developments will be released first on the R-forge webpage https://r-forge.r-project.org/projects/bglr/ and subsequently as R packages.

Censored outcomes. In BGLR censored outcomes are dealt with as a missing data problem. BGLR handles three types of censoring: left, right and interval. For an interval-censored data point the information available is a_i < y_i < b_i, where a_i and b_i are known lower and upper bounds and y_i is the actual phenotype, which for censored data points is un-observed. Right censoring occurs when b_i is unknown; therefore, the only information available is a_i < y_i. In a time-to-event setting this means that we know that the time to event exceeded the time at censoring, given by a_i. Left censoring occurs when a_i is unknown; therefore, the only information available is y_i < b_i. In BGLR censored outcomes are then specified with three vectors, y = {y_i}, a = {a_i} and b = {b_i}. The configuration of the triplet {a_i, y_i, b_i} for un-censored, right-censored, left-censored and interval-censored data points is described in the table below.

Table 1. Configuration of the triplet (a, y, b) for each type of data point.

                      a       y      b
  Un-censored         NA      y_i    NA
  Right censored      a_i     NA     Inf
  Left censored       -Inf    NA     b_i
  Interval censored   a_i     NA     b_i

Relative to BLR, the only modification to the Gibbs sampler required for handling censored data points consists of sampling, at each iteration, the censored phenotypes from the corresponding fully conditional densities, which in BGLR are truncated normal densities.

Binary outcomes are modeled using the threshold model, or probit link. Here, the probability of success is p(y_i = 1) = Φ(η_i), where Φ(·) is the standard normal cumulative distribution function (the inverse of the probit link) and η_i is a linear predictor, which can include any of the fixed or random effects handled by BGLR. In order to run a regression for binary outcomes, the response must be coded with 0s (failure) and 1s (success), and the argument response_type should be set to 'ordinal' (further details are given in the examples provided below).

2. Structure of the software

The program is provided as an R package that can be downloaded from http://r-forge.r-project.org/R/?group_id=1525. The package includes several datasets. Here we describe the wheat dataset, which has been used in several publications.

The wheat dataset comprises phenotypic (Y, 4 traits), marker (X, 1,279 markers) and pedigree (A, a matrix containing kinship coefficients derived from pedigree) information for 599 lines of wheat. The data can be loaded within R by typing library(BGLR) and then data(wheat). Further details about this data can be found in Crossa et al. (2010).

3. Running BGLR

In this section we introduce examples that illustrate the use of the BGLR package for regressions using molecular markers and other covariates.

3.1. Loading the BGLR package

Box 1 provides the code required to load BGLR.

Box 1. Loading BGLR

    setwd(tempdir())   # set working directory
    library(BGLR)

3.2. Fitting a fixed effects model to a continuous outcome

In the following example we illustrate how to fit a fixed effects linear model to a continuous outcome using BGLR (see the call to BGLR() in Box 2). The first block of code loads the package and the wheat dataset, which contains phenotypic and genotypic information for 599 pure lines of wheat; this dataset is also available with the BLR package (de los Campos and Pérez 2010). Phenotypes are then simulated from the first four markers. The prior assigned to the residual variance is defined through DF and S. Details about the priors used in BGLR and on how to choose hyper-parameters are explained in Pérez et al. (2010). The linear model is then fitted using BGLR(). The argument y is used to provide phenotypes; for continuous outcomes this must be a numeric vector. The argument ETA is a list with the predictors, here a single set whose effects are considered as fixed. In addition to phenotypes and predictors, we indicate the number of iterations of the Gibbs sampler (6000) and the number that we want to discard as burn-in (1000 in the example). For comparison we include code that fits the same linear model via ordinary least squares using the lm() function. Results from both BGLR and lm are displayed in Figure 1; the code used to produce this figure is given at the end of Box 2.

Box 2. Fitting a fixed effects model to a continuous outcome

    rm(list=ls())
    setwd(tempdir())

    # loads BGLR & data
    library(BGLR)
    data(wheat)
    X <- wheat.X

    # simulation of data
    X <- X[,1:4]
    N <- nrow(X)
    b <- c(-2,2,-1,1)
    error <- rnorm(N)
    y <- as.vector(X%*%b + error)

    # fits model using BGLR
    DF <- 5
    S <- var(y)/2*(DF-2)
    ETA <- list(list(X=X,model='FIXED'))
    fm <- BGLR(y=y,ETA=ETA,nIter=6000,burnIn=1000,df0=DF,S0=S)

    # fits the same model using lm()
    fm2 <- lm(y~X)

    # compares results from BGLR() & lm()
    plot(fm$ETA[[1]]$b~fm2$coeff[-1],pch=19,col=2,cex=1.5,
         xlab="lm()", ylab="BGLR()"); abline(a=0,b=1,lty=2)
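The scale chosen for the residual variance prior in the box above can be read as a statement about the expected residual variance. The sketch below assumes a scaled-inverse chi-square parameterization in which a draw can be written as S divided by a chi-square variable with DF degrees of freedom, so that the prior mean is S/(DF-2); under that assumption, S = var(y)/2*(DF-2) centers the prior at half of the phenotypic variance. The simulated y and the Monte Carlo draws are illustrative and do not call BGLR.

```r
# Illustrative phenotype; any numeric vector would do.
set.seed(1)
y <- rnorm(599, mean = 0, sd = 2)

DF <- 5
S  <- var(y) / 2 * (DF - 2)

# Assumed parameterization: sigma2 = S / chisq(DF), with mean S / (DF - 2).
prior_mean <- S / (DF - 2)

# Monte Carlo draws from the assumed prior (heavy right tail for small DF).
draws <- S / rchisq(1e5, df = DF)

cat("implied prior mean:", prior_mean, "\n")
cat("half of var(y):    ", var(y) / 2, "\n")
cat("Monte Carlo mean:  ", mean(draws), "\n")
```

With DF this small the prior is weakly informative, so the data should dominate the posterior of the residual variance in samples of this size.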

Figure 1. Estimated effects in a linear model for a continuous outcome (BGLR() estimates on the vertical axis vs lm() estimates on the horizontal axis).

3.3. Fitting a fixed effects model to a binary outcome

We now turn to an example involving a binary outcome. Using the same simulation as in Box 2, we generate a binary outcome by dichotomizing the simulated phenotype (see the definition of yBin in Box 3). The model is then fitted using BGLR(). For comparison, we also fit the model using the glm() function of R. In BGLR we set the argument response_type='ordinal' to indicate that the response is binary. Note that for binary outcomes there is no residual variance parameter; therefore, for this example there is no need to provide a prior for it. Estimates of effects derived using BGLR and glm are given in Figure 2.

Box 3. Fitting a fixed effects model to a binary outcome

    rm(list=ls())
    setwd(tempdir())

    # loads BGLR & data
    library(BGLR)
    data(wheat)
    X <- wheat.X

    # simulation of data
    X <- X[,1:4]
    N <- nrow(X)
    b <- c(-2,2,-1,1)
    error <- rnorm(N)
    y <- as.vector(X%*%b + error)
    yBin <- ifelse(y>0,1,0)

    # fits models
    ETA <- list(list(X=X,model='FIXED'))
    fm <- BGLR(y=yBin,response_type='ordinal',ETA=ETA,
               nIter=6000,burnIn=1000)
    fm2 <- glm(yBin~X,family=binomial(link='probit'))

    # compares results from BGLR() & glm()
    plot(fm$ETA[[1]]$b~fm2$coeff[-1],pch=19,col=2,cex=1.5,
         xlab="glm()", ylab="BGLR()"); abline(a=0,b=1,lty=2)

Figure 2. Estimated effects in a fixed effects model for a binary outcome (BGLR() estimates on the vertical axis vs glm() estimates on the horizontal axis).
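The threshold (probit) model used for binary outcomes can be illustrated without BGLR. The sketch below uses a made-up covariate and coefficient: a latent variable equal to the linear predictor plus standard normal noise is dichotomized at zero, so that the probability of success is pnorm(eta). A probit glm() is fitted only as a check on the simulation.

```r
set.seed(2)
n   <- 5000
x   <- rnorm(n)            # hypothetical covariate
eta <- 0.8 * x             # hypothetical linear predictor

# Threshold model: success when the latent variable exceeds zero.
latent <- eta + rnorm(n)
yBin   <- ifelse(latent > 0, 1, 0)

# Under this model P(yBin = 1 | eta) = pnorm(eta).
p_success <- pnorm(eta)

# Compare the average model probability with the observed success rate.
cat("mean pnorm(eta):", mean(p_success), "\n")
cat("observed rate:  ", mean(yBin), "\n")

# A probit glm() should recover a slope close to the simulated one.
fit <- glm(yBin ~ x, family = binomial(link = "probit"))
print(coef(fit))
```

Because the latent noise has unit variance by construction, the scale of the effects is fixed, which is why no residual variance prior is needed for binary responses.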

3.4. Fitting a fixed effects model to a right-censored outcome

We now illustrate how to use BGLR to fit a model to a right-censored outcome. The code is given in Box 4. The beginning of the code is as in the examples introduced in Boxes 2 and 3. We then generate 200 right-censored data points; these are defined using the conventions explained in Table 1. Subsequently, we fit the model using BGLR(). Relative to un-censored outcomes (see the example in Box 2), the only difference here is that the response is specified via three vectors (y, a, b), which are defined using the conventions explained in Table 1. For comparison we fit the same model using the survreg() function of the survival package. Figure 3 gives estimates of effects derived from survreg() and BGLR().

Figure 3. Estimated effects in a fixed effects model for a right-censored outcome (BGLR() estimates on the vertical axis vs survreg() estimates on the horizontal axis).

Box 4. Fitting a fixed effects model to a censored outcome

    rm(list=ls())
    setwd(tempdir())

    # loading libraries
    library(BGLR)
    library(survival)

    # loading data included in BGLR
    data(wheat)

    # simulation of data
    X <- wheat.X[,1:4]
    N <- nrow(X)
    b <- c(-2,2,-1,1)
    error <- rnorm(N)
    y <- as.vector(X%*%b + error)

    # right-censoring: y becomes NA, a holds the lower bound, b is Inf
    cen <- sample(1:N,size=200)
    yCen <- y
    yCen[cen] <- NA
    a <- rep(NA,N)
    b <- rep(NA,N)     # b is reused here for the upper bounds
    a[cen] <- y[cen]-runif(min=0,max=1,n=200)
    b[cen] <- Inf

    # fits the model using BGLR
    DF <- 5
    S <- var(y)/2*(DF-2)
    ETA <- list(list(X=X,model='FIXED'))
    fm <- BGLR(y=yCen,a=a,b=b,ETA=ETA,nIter=6000,burnIn=1000,
               df0=DF,S0=S)

    # fits the model using survreg
    event <- ifelse(is.na(yCen),0,1)
    time <- ifelse(is.na(yCen),a,yCen)
    surv.object <- Surv(time=time,event=event,type='right')
    fm2 <- survreg(surv.object~X, dist="gaussian")

    plot(fm$ETA[[1]]$b~fm2$coeff[-1],pch=19,col=2,cex=1.5,
         xlab="survreg()", ylab="BGLR()")
    abline(a=0,b=1,lty=2)
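As noted in the introduction, BGLR handles censored records by sampling them from truncated normal densities within the Gibbs sampler. The function below is a minimal sketch of one such draw using the inverse-CDF method in base R. The helper rtrunc_norm is hypothetical and is not BGLR's internal routine; the mean and standard deviation stand in for the current linear predictor and residual standard deviation.

```r
# Draw from N(mean, sd^2) truncated to (a, b) by inverting the normal CDF.
# rtrunc_norm is a hypothetical helper written for illustration only.
rtrunc_norm <- function(n, mean, sd, a = -Inf, b = Inf) {
  pa <- pnorm(a, mean, sd)        # probability mass below the lower bound
  pb <- pnorm(b, mean, sd)        # probability mass below the upper bound
  u  <- runif(n, min = pa, max = pb)
  qnorm(u, mean, sd)
}

set.seed(3)
# Right-censored record: only a < y is known, so b = Inf.
right_draws <- rtrunc_norm(1000, mean = 0, sd = 1, a = 0.5)
# Interval-censored record: a < y < b.
interval_draws <- rtrunc_norm(1000, mean = 0, sd = 1, a = -1, b = 0.25)

print(range(right_draws))
print(range(interval_draws))
```

The inverse-CDF approach is simple but can lose accuracy when the bounds lie far in the tails of the distribution; it is shown here only to make the missing-data view of censoring concrete.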

3.5. Fitting marker effects as random

We now turn to the problem of using BGLR for fitting a Whole-Genome Regression (WGR) model to continuous, binary or censored outcomes. In these models the number of predictors typically exceeds the number of phenotypes; therefore, shrinkage estimation procedures are commonly used. BGLR offers several shrinkage (Bayesian) estimation methods, for example Bayesian Ridge Regression (BRR) and the Bayesian Lasso (BL, Park and Casella 2008). Here we illustrate how to fit models for a continuous, a binary and a censored outcome using the BL. For the BL we need to provide a prior for the regularization parameter (λ), which controls the extent of shrinkage of estimates of effects. A discussion of how to choose these hyper-parameters based on prior information about trait heritability and on the number of markers involved is given in Pérez et al. (2010).

In the example given in Box 5 we fit the BL using the 1,279 markers available for the 599 wheat lines in the wheat dataset. The first lines load BGLR and the wheat dataset. We then extract one of the four phenotypes, which is used as the continuous response, and the genotypes. Subsequently we generate a right-censored outcome by censoring 200 out of the 599 records; these lines prepare the triplets (y, a, b) needed to specify the censored outcome in BGLR. Finally, we generate a binary outcome. The last lines of the box fit the three models. As the number of markers included in the model increases, the number of iterations required for convergence also increases. In the example of Box 5, and only for illustration purposes, we use 12,000 iterations; however, convergence with large p may require running much longer chains. Box 6 gives code that illustrates how to extract estimates of marker effects and predictions from the fitted models.

Box 5. Fitting a Whole Genome Regression Using the Bayesian LASSO for continuous, censored and binary outcomes

    rm(list=ls())
    setwd(tempdir())

    # loading libraries
    library(BGLR)
    library(survival)
    data(wheat)

    # extracts phenotypes
    # continuous
    y <- wheat.Y[,1]

    # extracts genotypes
    X <- wheat.X
    n <- length(y)

    # censored
    cen <- sample(1:n,size=200)
    yCen <- y
    yCen[cen] <- NA ; a <- rep(NA,n) ; b <- rep(NA,n)
    a[cen] <- y[cen]-runif(min=0,max=1,n=200)
    b[cen] <- Inf

    # binary
    yBin <- ifelse(y>0,1,0)

    # prior
    DF <- 5
    S <- var(y)/2*(DF-2)

    # models
    ETA <- list(list(X=X,model='BL',lambda=25,type='gamma',
                     rate=1e-4,shape=0.55))
    fm1 <- BGLR(y=y,ETA=ETA,nIter=12000,burnIn=2000,
                df0=DF,S0=S)
    fm2 <- BGLR(y=yCen,a=a,b=b,ETA=ETA,nIter=12000,burnIn=2000,
                df0=DF,S0=S)
    fm3 <- BGLR(y=yBin,response_type='ordinal',ETA=ETA,
                nIter=12000,burnIn=2000)
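When predictors outnumber records, least squares is not unique and some form of shrinkage is needed. The sketch below uses a closed-form ridge estimator in base R as a simple stand-in for that idea. It is not the Bayesian LASSO sampler used by BGLR, and the simulated marker matrix, effects and penalty values are assumptions made for illustration.

```r
set.seed(4)
n <- 50; p <- 200                               # more predictors than records
X <- matrix(rbinom(n * p, size = 1, prob = 0.5), n, p)
X <- scale(X, center = TRUE, scale = FALSE)     # center marker codes
beta <- c(rep(1, 5), rep(0, p - 5))             # hypothetical true effects
y <- as.vector(X %*% beta + rnorm(n))
y <- y - mean(y)

# Ridge estimator: (X'X + lambda * I)^{-1} X'y, defined even when p > n.
ridge <- function(X, y, lambda) {
  solve(crossprod(X) + diag(lambda, ncol(X)), crossprod(X, y))
}

# Larger penalties pull the whole vector of estimates toward zero.
lambdas <- c(1, 10, 100)
sizes <- sapply(lambdas, function(l) sqrt(sum(ridge(X, y, l)^2)))
print(data.frame(lambda = lambdas, norm_of_estimates = sizes))
```

In the Bayesian LASSO the regularization parameter plays a comparable role, except that it is assigned a prior and shrinkage is marker-specific rather than uniform.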

3.6. Extracting estimates of marker effects and predictions

Box 6 illustrates how to extract the estimated posterior means and posterior standard deviations of marker effects (the b and SD.b components of fm1$ETA[[1]]) and the posterior means of the linear predictor (e.g., fm1$yHat). For continuous and censored outcomes the posterior mean of the linear predictor constitutes an estimate of the conditional expectation. For binary outcomes BGLR uses the probit link; therefore an estimate of the expected value of the response, or probability of success, can be obtained by evaluating the standard normal cumulative distribution function at the posterior mean of the linear predictor (see the call to pnorm() in the last plot of Box 6).

Box 6. Extracting and Displaying Estimates of Marker Effects and Predictions

    ## Volcano plot (posterior SD vs estimated effects)
    plot(fm1$ETA[[1]]$SD.b~fm1$ETA[[1]]$b,col=2,
         main='Volcano Plot (continuous outcome)',
         xlab='Estimated Effect',ylab='Est. Posterior SD')

    ## Estimated effects, continuous versus censored
    plot(fm1$ETA[[1]]$b~fm2$ETA[[1]]$b,col=2,
         main='Estimated Effects',
         xlab='censored', ylab='continuous')

    ## Predictions: continuous versus censored outcome
    plot(fm1$yHat~fm2$yHat,col=2,
         main='Predictions',
         xlab='censored', ylab='continuous')

    ## Estimated effects, continuous versus binary
    plot(fm1$ETA[[1]]$b~fm3$ETA[[1]]$b,col=2,
         main='Estimated Effects',
         xlab='binary', ylab='continuous')

    ## Predictions: continuous versus binary outcome
    plot(fm1$yHat~pnorm(fm3$yHat),col=2,
         main='Predictions',
         xlab='binary (probability)', ylab='continuous')

3.7. Predicting un-observed outcomes using BGLR

We close this note by illustrating how to use BGLR for the prediction of yet-to-be-observed phenotypes. In principle there are at least two ways of carrying out this task. One possibility is to partition the data (both predictors and response) into a training and a validation dataset; the training dataset is provided to BGLR to derive parameter estimates, which can then be used to predict observations in the validation dataset. An alternative is to provide the whole data to BGLR with the response values of the observations in the validation set replaced with missing values. BGLR will return predictions for these data points as well, and such predictions can be used to assess the ability of the model to predict un-observed phenotypes. In the case of continuous and binary outcomes this is done simply by setting the entries of y corresponding to the validation dataset equal to NA (see example below); for censored outcomes, the triplets corresponding to the validation set need to be set to (a_i = -Inf, y_i = NA, b_i = Inf) so that they are completely un-informative.

Prediction of binary outcomes. The example in Box 7 illustrates how to derive predictions for a validation dataset in the case of a binary outcome. The first part of the code loads libraries and the wheat dataset, defines the prior and sets the predictors; these lines are essentially as in our previous examples. We then generate a validation set by setting 100 randomly chosen entries of the response to missing values, and fit the model. The last two lines illustrate how to calculate the mean-squared prediction error and the area under the curve.
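The second strategy described above, masking validation responses with NA and scoring the predictions, can be sketched in base R. Here a probit glm() on simulated data stands in for BGLR: glm() drops records with a missing response when fitting, so predictions for them are requested explicitly. The helper auc_rank is hypothetical and computes the area under the curve from ranks instead of using a dedicated package.

```r
set.seed(5)
n <- 599
x <- rnorm(n)                                   # hypothetical predictor
yBin <- ifelse(0.7 * x + rnorm(n) > 0, 1, 0)

# Mask a validation set by setting its responses to NA.
tst <- sample(1:n, size = 100, replace = FALSE)
yNA <- yBin
yNA[tst] <- NA

# Stand-in model: glm() ignores the NA records when fitting.
dat <- data.frame(yNA = yNA, x = x)
fit <- glm(yNA ~ x, family = binomial(link = "probit"), data = dat)
eta_tst <- predict(fit, newdata = dat[tst, ], type = "link")

# Mean-squared prediction error on the probability scale.
mse <- mean((yBin[tst] - pnorm(eta_tst))^2)

# Rank-based AUC (Mann-Whitney statistic); hypothetical helper.
auc_rank <- function(response, predictor) {
  r  <- rank(predictor)
  n1 <- sum(response == 1)
  n0 <- sum(response == 0)
  (sum(r[response == 1]) - n1 * (n1 + 1) / 2) / (n1 * n0)
}

cat("MSE:", mse, " AUC:", auc_rank(yBin[tst], eta_tst), "\n")
```

The AUC depends only on the ranking of the predictions, so it can be computed on the linear predictor directly, whereas the mean-squared error requires mapping the predictor to a probability with pnorm().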

Box 7. Predicting a binary outcome in a validation set using the Bayesian LASSO

    rm(list=ls())
    setwd(tempdir())

    # loading libraries
    library(BGLR)
    library(pROC)
    data(wheat)

    # extracts phenotypes
    # continuous
    y <- wheat.Y[,1]
    X <- wheat.X
    # binary
    yBin <- ifelse(y>0,1,0)

    ETA <- list(list(X=X,model='BL',lambda=25,type='gamma',
                     rate=1e-4,shape=0.55))

    # generates testing dataset
    tst <- sample(1:599,size=100,replace=FALSE)
    yNA <- yBin
    yNA[tst] <- NA

    fm <- BGLR(y=yNA,response_type='ordinal',ETA=ETA,
               nIter=12000,burnIn=2000)

    mean((yBin[tst]-pnorm(fm$yHat[tst]))^2)   # mean-squared error
    auc(response=yBin[tst],predictor=fm$yHat[tst])

Prediction of censored outcomes. The example in Box 8 illustrates how to derive predictions for a validation dataset in the case of a censored outcome. The first part of the code loads libraries and the dataset and defines the prior; these lines are essentially as in our previous examples. We then generate a validation set using 100 lines randomly chosen among the un-censored observations. Note that in order for these phenotypes to be un-informative we need to set the triplets of the lines in the validation dataset to (a_i = -Inf, y_i = NA, b_i = Inf). The model is then fitted, and prediction accuracy is quantified in the last line as the correlation between predictions and observed phenotypes in the validation set.

Box 8. Predicting a censored outcome in a validation set using the Bayesian LASSO

    rm(list=ls())
    setwd(tempdir())

    # loading libraries
    library(BGLR)
    library(survival)
    data(wheat)

    # extracts phenotypes
    # continuous
    y <- wheat.Y[,1]

    # extracts genotypes
    X <- wheat.X
    n <- length(y)

    # censored
    cen <- sample(1:n,size=200)
    yCen <- y
    yCen[cen] <- NA ; a <- rep(NA,n) ; b <- rep(NA,n)
    a[cen] <- y[cen]-runif(min=0,max=1,n=200)
    b[cen] <- Inf

    # sets prior and predictors
    DF <- 5
    S <- var(y)/2*(DF-2)
    ETA <- list(list(X=X,model='BL',lambda=25,type='gamma',
                     rate=1e-4,shape=0.55))

    # generates testing dataset among the un-censored records
    tst <- sample(which(!is.na(yCen)),size=100,replace=FALSE)
    yNA <- yCen ; yNA[tst] <- NA
    aNA <- a ; aNA[tst] <- -Inf
    bNA <- b ; bNA[tst] <- Inf

    # model: fitted with the masked triplets
    fm <- BGLR(y=yNA,a=aNA,b=bNA,ETA=ETA,nIter=12000,burnIn=2000,
               df0=DF,S0=S)

    cor(fm$yHat[tst],yCen[tst])
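The masking rule for censored validation records can be collected in a small helper. The function below, mask_triplet, is hypothetical and only manipulates the three vectors, following the triplet conventions given in the introduction; it sets the selected records to (a = -Inf, y = NA, b = Inf) so that they carry no information.

```r
# Hypothetical helper: make the records in `tst` un-informative.
mask_triplet <- function(y, a, b, tst) {
  y[tst] <- NA
  a[tst] <- -Inf
  b[tst] <- Inf
  list(y = y, a = a, b = b)
}

# Toy data: record 1 un-censored, 2 right-censored, 3 left-censored,
# 4 interval-censored, 5 un-censored.
y <- c(1.2, NA,  NA,   NA,  0.4)
a <- c(NA,  0.5, -Inf, 0.1, NA)
b <- c(NA,  Inf, 2.0,  0.9, NA)

# Use the two un-censored records as the validation set.
masked <- mask_triplet(y, a, b, tst = c(1, 5))
print(as.data.frame(masked))
```

Keeping the original y untouched and passing only the masked copies to the model makes it harder to fit with the unmasked vectors by mistake, and leaves the observed values available for scoring.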

Acknowledgments. Financial support from the NIH P30 Administrative supplement (UAB-Nutrition Obesity Research Center) and NIH grants R01GM101219-01 and R01GM099992-01A1 is gratefully acknowledged.

References

Crossa, J., G. de los Campos, P. Pérez, D. Gianola, J. Burgueño, J. L. Araus, D. Makumbi, et al. 2010. Prediction of Genetic Values of Quantitative Traits in Plant Breeding Using Pedigree and Molecular Markers. Genetics 186 (2): 713-724.

de los Campos, G., and P. Pérez. 2010. BLR: Bayesian Linear Regression. R package. http://cran.r-project.org/web/packages/BLR/index.html.

Park, T., and G. Casella. 2008. The Bayesian Lasso. Journal of the American Statistical Association 103 (482): 681-686.

Pérez, P., G. de los Campos, J. Crossa, and D. Gianola. 2010. Genomic-Enabled Prediction Based on Molecular Markers and Pedigree Using the Bayesian Linear Regression Package in R. The Plant Genome Journal 3 (2): 106-116. doi:10.3835/plantgenome2010.04.0005.