Package reglogit. November 19, 2017

Type Package
Title Simulation-Based Regularized Logistic Regression
Version 1.2-5
Date 2017-11-17
Author Robert B. Gramacy <rbg@vt.edu>
Maintainer Robert B. Gramacy <rbg@vt.edu>
Depends R (>= 2.14.0), methods, mvtnorm, boot, Matrix
Suggests plgp
Description Regularized (polychotomous) logistic regression by Gibbs sampling. The package implements subtly different MCMC schemes with varying efficiency depending on the data type (binary v. binomial, say) and the desired estimator (regularized maximum likelihood, or Bayesian maximum a posteriori/posterior mean, etc.) through a unified interface.
License LGPL
URL http://bobby.gramacy.com/r_packages/reglogit
LazyLoad yes
NeedsCompilation yes
Repository CRAN
Date/Publication 2017-11-19 18:22:50 UTC

R topics documented:
    reglogit-package
    pima
    predict.reglogit
    reglogit
    Index

reglogit-package    Simulation-based Regularized Logistic Regression

Description

Regularized (polychotomous) logistic regression by Gibbs sampling. The package implements subtly different MCMC schemes with varying efficiency depending on the data type (binary v. binomial, say) and the desired estimator (regularized maximum likelihood, or Bayesian maximum a posteriori/posterior mean, etc.) through a unified interface.

Details

Package:  reglogit
Type:     Package
Version:  1.0
Date:     2011-08-05
License:  LGPL
LazyLoad: yes

See the documentation for the reglogit function.

Author(s)

Robert B. Gramacy <rbg@vt.edu>
Maintainer: Robert B. Gramacy <rbg@vt.edu>

References

R.B. Gramacy, N.G. Polson. Simulation-based regularized logistic regression. (2012) Bayesian Analysis, 7(3), p567-590; arXiv:1005.3430; http://arxiv.org/abs/1005.3430

See Also

reglogit, blasso and regress

Examples

## see the help file for the reglogit function
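For orientation, here is a minimal end-to-end sketch (not part of the original manual; the chain length T = 1000 and the in-sample prediction are illustrative choices) using the pima data documented below:

## a minimal quick-start sketch (illustrative settings)
library(reglogit)
data(pima)
X <- as.matrix(pima[, -9])    # eight diagnostic predictors
y <- as.numeric(pima[, 9])    # 0/1 diabetes label
out <- reglogit(T = 1000, y, X)   # Gibbs sampler, default lasso-type prior
p <- predict(out, XX = X)         # posterior predictive at the training inputs
mean(p$c == y)                    # in-sample hit rate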

pima    Pima Indian Data

Description

A population of women who were at least 21 years old, of Pima Indian heritage and living near Phoenix, Arizona, was tested for diabetes according to World Health Organization criteria. The data were collected by the US National Institute of Diabetes and Digestive and Kidney Diseases.

Usage

data(pima)

Format

A data frame with 768 observations on the following 9 variables.

npreg  number of pregnancies
glu    plasma glucose concentration in an oral glucose tolerance test
bp     diastolic blood pressure (mm Hg)
skin   triceps skin fold thickness (mm)
serum  2-hour serum insulin (mu U/ml)
bmi    body mass index (weight in kg/(height in m)^2)
ped    diabetes pedigree function
age    age in years
y      classification label: 1 for diabetic

Source

Smith, J. W., Everhart, J. E., Dickson, W. C., Knowler, W. C. and Johannes, R. S. (1988) Using the ADAP learning algorithm to forecast the onset of diabetes mellitus. In Proceedings of the Symposium on Computer Applications in Medical Care (Washington, 1988), ed. R. A. Greenes, pp. 261-265. Los Alamitos, CA: IEEE Computer Society Press.

Ripley, B.D. (1996) Pattern Recognition and Neural Networks. Cambridge: Cambridge University Press.

UCI Machine Learning Repository http://archive.ics.uci.edu/ml/datasets/pima+indians+diabetes

Examples

data(pima)
## see reglogit documentation for an example using this data
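Before modeling, a quick base-R look at the class balance and a few predictor ranges can be helpful; this brief sketch is not part of the original examples:

data(pima)
dim(pima)       # 768 rows, 9 columns
table(pima$y)   # class balance: 0 = non-diabetic, 1 = diabetic
summary(pima[, c("glu", "bmi", "age")])   # ranges of a few predictors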

predict.reglogit    Prediction for regularized (polychotomous) logistic regression models

Description

Sampling from the posterior predictive distribution of a regularized (multinomial) logistic regression fit, including entropy information for variability assessment.

Usage

## S3 method for class 'reglogit'
predict(object, XX, burnin = round(0.1 * nrow(object$beta)), ...)
## S3 method for class 'regmlogit'
predict(object, XX, burnin = round(0.1 * dim(object$beta)[1]), ...)

Arguments

object  a "reglogit"-class or "regmlogit"-class object, depending on whether binary or polychotomous methods were used for fitting
XX      a matrix of predictive locations where ncol(XX) == ncol(object$X)
burnin  a positive integer scalar indicating the number of samples of object$beta to discard as burn-in; the default is 10% of the number of samples
...     for compatibility with the generic predict method; not used

Details

Applies the logit transformation (reglogit) or multinomial logit (regmlogit) to convert samples of the linear predictor at XX into samples from a predictive posterior probability distribution. The raw probabilities, averages (posterior means), entropies, and posterior mean classes (arg-max of the average probabilities) are returned.

Value

The output is a list with components explained below. For predict.regmlogit everything (except entropy) is expanded by one dimension into an array or matrix as appropriate.

p    a nrow(XX) x (T - burnin) matrix of probabilities (of class 1) from the posterior predictive distribution
mp   a vector of average probabilities calculated over the rows of p
c    class labels formed by rounding (or arg max for predict.regmlogit) the values in mp
ent  the posterior mean entropy given the probabilities in mp

Author(s)

Robert B. Gramacy <rbg@vt.edu>
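For example, to discard more than the default 10% burn-in, the argument can be set explicitly; a one-line sketch (the object names and the value 200 are hypothetical):

## assumes a previously fitted "reglogit" object (out) and a predictive
## matrix XX with ncol(XX) == ncol(out$X); the value 200 is arbitrary
p <- predict(out, XX, burnin = 200)   # discard the first 200 saved samples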

References

R.B. Gramacy, N.G. Polson. Simulation-based regularized logistic regression. (2012) Bayesian Analysis, 7(3), p567-590; arXiv:1005.3430; http://arxiv.org/abs/1005.3430

C. Holmes, K. Held (2006). Bayesian Auxiliary Variable Models for Binary and Multinomial Regression. Bayesian Analysis, 1(1), p145-168.

See Also

reglogit and regmlogit

Examples

## see reglogit for a full example of binary classification complete with
## sampling from the posterior predictive distribution.
## the example here is for polychotomous classification and prediction

## Not run:
library(plgp)
x <- seq(-2, 2, length=40)
X <- expand.grid(x, x)
C <- exp2d.C(X)
xx <- seq(-2, 2, length=100)
XX <- expand.grid(xx, xx)
CC <- exp2d.C(XX)

## build cubically-expanded design matrix (with interactions)
Xe <- cbind(X, X[,1]^2, X[,2]^2, X[,1]*X[,2],
            X[,1]^3, X[,2]^3, X[,1]^2*X[,2], X[,2]^2*X[,1],
            (X[,1]*X[,2])^2)

## perform MCMC
T <- 1000
out <- regmlogit(T, C, Xe, nu=6, normalize=TRUE)

## create predictive (cubically-expanded) design matrix
XX <- as.matrix(XX)
XXe <- cbind(XX, XX[,1]^2, XX[,2]^2, XX[,1]*XX[,2],
             XX[,1]^3, XX[,2]^3, XX[,1]^2*XX[,2], XX[,2]^2*XX[,1],
             (XX[,1]*XX[,2])^2)

## predict class labels
p <- predict(out, XXe)

## make an image of the predictive surface
cols <- c(gray(0.85), gray(0.625), gray(0.4))
par(mfrow=c(1,3))
image(xx, xx, matrix(CC, ncol=length(xx)), col=cols, main="truth")
image(xx, xx, matrix(p$c, ncol=length(xx)), col=cols, main="predicted")
image(xx, xx, matrix(p$ent, ncol=length(xx)), col=heat.colors(128),
      main="entropy")

## End(Not run)
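The ent component can be used to flag where predictions are least certain. A short follow-on sketch, assuming the polychotomous example above has just been run (the choice of ten locations is arbitrary):

## Not run:
## rank predictive locations by posterior mean entropy, most uncertain first
unsure <- order(p$ent, decreasing = TRUE)[1:10]
XX[unsure, ]   # grid coordinates of the ten most ambiguous locations
## End(Not run)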

reglogit    Gibbs sampling for regularized logistic regression

Description

Regularized (multinomial) logistic regression by Gibbs sampling, implementing subtly different MCMC schemes with varying efficiency depending on the data type (binary v. binomial, say) and the desired estimator (regularized maximum likelihood, or Bayesian maximum a posteriori/posterior mean, etc.) through a unified interface.

Usage

reglogit(T, y, X, N = NULL, flatten = FALSE, sigma = 1, nu = 1,
  kappa = 1, icept = TRUE, normalize = TRUE, zzero = TRUE,
  powerprior = TRUE, kmax = 442, bstart = NULL, lt = NULL,
  nup = list(a = 2, b = 0.1), save.latents = FALSE, verb = 100)
regmlogit(T, y, X, flatten = FALSE, sigma = 1, nu = 1, kappa = 1,
  icept = TRUE, normalize = TRUE, zzero = TRUE, powerprior = TRUE,
  kmax = 442, bstart = NULL, lt = NULL, nup = list(a = 2, b = 0.1),
  save.latents = FALSE, verb = 100)

Arguments

T        a positive integer scalar specifying the number of MCMC rounds
y        reglogit requires logical classification labels for Bernoulli data, or counts for Binomial data; for the latter, N must also be specified. regmlogit requires positive integer class labels in 1:C, where C is the number of classes
X        a design matrix of predictors; can be a typical (dense) matrix or a sparse Matrix object. When the design matrix is sparse (and is stored sparsely), this can produce a ~3x-faster execution via a more efficient update for the beta parameter. But when it is not sparse (but is stored sparsely) the execution could be much slower
N        an optional integer vector of total numbers of replicate trials for each X-y pair, i.e., for Binomial data instead of Bernoulli
flatten  a scalar logical that is only specified for Binomial data. It indicates if preprocessing code should flatten the Binomial likelihood into a Bernoulli likelihood
sigma    weights on the regression coefficients in the lasso penalty. The default of 1 is sensible when normalize = TRUE since then the estimator for beta is equivariant under rescaling
nu       a non-negative scalar indicating the initial value of the penalty parameter

kappa        a positive scalar specifying the multiplicity; kappa = 1 provides samples from the Bayesian posterior distribution. Larger values of kappa facilitate a simulated annealing approach to obtaining a regularized point estimator
icept        a scalar logical indicating if an (implicit) intercept should be included in the model
normalize    a scalar logical which, if TRUE, causes each variable to be standardized to have unit L2-norm; otherwise it is left alone
zzero        a scalar logical indicating if the latent z variables should be sampled, and therefore whether the cdf representation (zzero = FALSE) or pdf representation (otherwise) should be used
powerprior   a scalar logical indicating if the prior should be powered up with the multiplicity parameter kappa as well as the likelihood
kmax         a positive integer indicating the number replacing infinity in the sum for the mixing density in the generative expression for lambda
bstart       an optional vector of length p = ncol(X) specifying initial values for the regression coefficients beta. Otherwise standard normal deviates are used
lt           an optional vector of length n = nrow(X) of initial values for the lambda latent variables. Otherwise a vector of ones is used
nup          prior parameters list(a, b) for the inverse Gamma distribution prior for nu, or NULL, which causes nu to be fixed
save.latents a scalar logical indicating whether or not a trace of the latent z, lambda and omega values should be saved for each iteration. Leave save.latents = FALSE (the default) for very large X in order to reduce memory swapping on low-RAM machines
verb         a positive integer indicating the number of MCMC rounds after which a progress statement is printed. Giving verb = 0 causes no statements to be printed

Details

These are the main functions in the package. They support an omnibus framework for simulation-based regularized logistic regression. The default arguments invoke a Gibbs sampling algorithm to sample from the posterior distribution of a logistic regression model with lasso-type (double-exponential) priors. See the paper by Gramacy & Polson (2012) for details. Both cdf and pdf implementations are provided, which use slightly different latent variable representations, resulting in slightly different Gibbs samplers. These methods extend the un-regularized methods of Holmes & Held (2006).

The kappa parameter facilitates simulated annealing (SA) implementations in order to help find the MAP and other point estimators. The actual SA algorithm is not provided in the package. However, it is easy to string calls to this function together, using the outputs from one call as inputs to another, in order to establish a SA schedule for increasing kappa values; see the sketch after the Examples below.

The regmlogit function is a wrapper around the Gibbs sampler inside reglogit, invoking C-1 linked chains for C classes, extending the polychotomous regression scheme outlined by Holmes & Held (2006). For an example with regmlogit, see predict.regmlogit.

Value

The output is a list object of type "reglogit" or "regmlogit" containing a subset of the following fields; for "regmlogit" everything is expanded by one dimension into an array or matrix as appropriate.

X       the input design matrix, possibly adjusted by normalization or intercept
y       the input response variable
beta    a matrix of T sampled regression coefficients on the original input scale
z       if zzero = FALSE, a matrix of latent variables for the hierarchical cdf representation of the likelihood
lambda  a matrix of latent variables for the hierarchical (cdf or pdf) representation of the likelihood
lpost   a vector of log posterior probabilities of the parameters
map     the list containing the maximum a posteriori parameters; out$map$beta is on the original scale of the data
kappa   the input multiplicity parameter
omega   a matrix of latent variables for the regularization prior

Author(s)

Robert B. Gramacy <rbg@vt.edu>

References

R.B. Gramacy, N.G. Polson. Simulation-based regularized logistic regression. (2012) Bayesian Analysis, 7(3), p567-590; arXiv:1005.3430; http://arxiv.org/abs/1005.3430

C. Holmes, K. Held (2006). Bayesian Auxiliary Variable Models for Binary and Multinomial Regression. Bayesian Analysis, 1(1), p145-168.

See Also

predict.reglogit, predict.regmlogit, blasso and regress

Examples

## load in the pima indian data
data(pima)
X <- as.matrix(pima[,-9])
y <- as.numeric(pima[,9])

## pre-normalize to match the comparison in the paper
one <- rep(1, nrow(X))
normx <- sqrt(drop(one %*% (X^2)))
X <- scale(X, FALSE, normx)

## compare to the GLM fit
fit.logit <- glm(y~X, family=binomial(link="logit"))

bstart <- fit.logit$coef

## do the Gibbs sampling
T <- 300 ## set low for CRAN checks; increase to >= 1000 for better results
out6 <- reglogit(T, y, X, nu=6, nup=NULL, bstart=bstart, normalize=FALSE)

## plot the posterior distribution of the coefficients
burnin <- (1:(T/10))
boxplot(out6$beta[-burnin,], main="nu=6, kappa=1", ylab="posterior",
        xlab="coefficients", bty="n",
        names=c("mu", paste("b", 1:8, sep="")))
abline(h=0, lty=2)

## add in GLM fit and MAP with legend
points(bstart, col=2, pch=17)
points(out6$map$beta, pch=19, col=3)
legend("topright", c("mle", "MAP"), col=2:3, pch=c(17,19))

## simple prediction
p6 <- predict(out6, XX=X)
## hit rate
mean(p6$c == y)

##
## for a polychotomous example, with prediction,
## see ?predict.regmlogit
##

## Not run:
## now with kappa=10
out10 <- reglogit(T, y, X, kappa=10, nu=6, nup=NULL, bstart=bstart,
                  normalize=FALSE)

## plot the posterior distribution of the coefficients
par(mfrow=c(1,2))
boxplot(out6$beta[-burnin,], main="nu=6, kappa=1", ylab="posterior",
        xlab="coefficients", bty="n",
        names=c("mu", paste("b", 1:8, sep="")))
abline(h=0, lty=2)
points(bstart, col=2, pch=17)
points(out6$map$beta, pch=19, col=3)
legend("topright", c("mle", "MAP"), col=2:3, pch=c(17,19))
boxplot(out10$beta[-burnin,], main="nu=6, kappa=10", ylab="posterior",
        xlab="coefficients", bty="n",
        names=c("mu", paste("b", 1:8, sep="")))
abline(h=0, lty=2)

## add in GLM fit and MAP with legend
points(bstart, col=2, pch=17)
points(out10$map$beta, pch=19, col=3)
legend("topright", c("mle", "MAP"), col=2:3, pch=c(17,19))

## End(Not run)

##
## now some binomial data
##

## Not run:
## synthetic data generation
library(boot)
N <- rep(20, 100)
beta <- c(2, -3, 2, -4, 0, 0, 0, 0, 0)
X <- matrix(runif(length(N)*length(beta)), ncol=length(beta))
eta <- drop(1 + X %*% beta)
p <- inv.logit(eta)
y <- rbinom(length(N), N, p)

## run the Gibbs sampler for the logit -- uses the fast Binomial
## version; for a comparison, try flatten=FALSE
out <- reglogit(T, y, X, N)

## plot the posterior distribution of the coefficients
boxplot(out$beta[-burnin,], main="binomial data", ylab="posterior",
        xlab="coefficients", bty="n",
        names=c("mu", paste("b", 1:ncol(X), sep="")))
abline(h=0, lty=2)

## add in GLM fit, the MAP fit, the truth, and a legend
fit.logit <- glm(y/N~X, family=binomial(link="logit"), weights=N)
points(fit.logit$coef, col=2, pch=17)
points(c(1, beta), col=4, pch=16)
points(out$map$beta, pch=19, col=3)
legend("topright", c("mle", "MAP", "truth"), col=2:4, pch=c(17,19,16))

## also try specifying a larger kappa value to pin down the MAP

## End(Not run)
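As noted in the Details section, a simulated annealing schedule can be built by stringing reglogit calls together, warm-starting each run at the previous MAP. The following sketch is not part of the original manual; the kappa ladder and chain length are illustrative, and it assumes y and X as prepared in the Examples above:

## Not run:
## SA-style schedule: increase kappa, warm-starting at the previous MAP
kappas <- c(1, 10, 100)     # illustrative ladder; tune for your problem
bstart <- NULL              # first chain initialized with normal deviates
for(k in kappas) {
  out <- reglogit(T = 1000, y, X, kappa = k, nu = 6, nup = NULL,
                  bstart = bstart, normalize = FALSE, verb = 0)
  bstart <- out$map$beta    # this run's MAP seeds the next, colder run
}
## bstart now holds a sharpened regularized point estimate (approximate MAP)
## End(Not run)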

Index

Topic classif
    predict.reglogit, reglogit
Topic datasets
    pima
Topic methods
    reglogit
Topic models
    predict.reglogit
Topic package
    reglogit-package

blasso
Matrix
pima
predict
predict.reglogit
predict.regmlogit
predict.regmlogit (predict.reglogit)
reglogit
reglogit-package
regmlogit
regmlogit (reglogit)
regress