Multinomial Logit Models with R


> rm(list=ls()); options(scipen=999)  # To avoid scientific notation
> # install.packages("mlogit", dependencies=TRUE)  # Only need to do this once
> library(mlogit)  # Load the package every time
Loading required package: Formula
Loading required package: maxLik
Loading required package: miscTools

Please cite the 'maxLik' package as:
Henningsen, Arne and Toomet, Ott (2011). maxLik: A package for maximum likelihood
estimation in R. Computational Statistics 26(3), 443-458. DOI 10.1007/s00180-010-0217-1.

If you have questions, suggestions, or comments regarding the 'maxLik' package, please
use a forum or 'tracker' at maxLik's R-Forge site:
https://r-forge.r-project.org/projects/maxlik/

> math = read.table("http://www.utstat.utoronto.ca/~brunner/data/legal/mathcat.data.txt")
> head(math)
  hsgpa hsengl hscalc   course passed outcome
1  78.0     80    Yes Mainstrm     No  Failed
2  66.0     75    Yes Mainstrm    Yes  Passed
3  80.2     70    Yes Mainstrm    Yes  Passed
4  81.7     67    Yes Mainstrm    Yes  Passed
5  86.8     80    Yes Mainstrm    Yes  Passed
6  76.7     75    Yes Mainstrm    Yes  Passed

The explanatory variables can be characteristics of the individual case (individual-specific) or of the alternative (alternative-specific) -- an alternative being a value of the response variable. The mlogit function requires its own special type of data frame, and there are two data formats: "wide" and "long." In the long format, each row is one alternative, so if there are 3 response categories, each case gets 3 rows (the data frame gets really long). In the wide format, each row is one observation. The mlogit.data function converts ordinary data frames to the type required by mlogit. I can only make the wide format work.

> # Try a simple logistic regression.
> # First make a data frame with just what we need.
> math0 = math[,c(1,5)]; head(math0)
  hsgpa passed
1  78.0     No
2  66.0    Yes
3  80.2    Yes
4  81.7    Yes
5  86.8    Yes
6  76.7    Yes
> # Make an mlogit data frame in long format
> wide0 = mlogit.data(math0, shape="wide", choice="passed")
> head(wide0)
      hsgpa passed chid alt
1.No   78.0   TRUE    1  No
1.Yes  78.0  FALSE    1 Yes
2.No   66.0  FALSE    2  No
2.Yes  66.0   TRUE    2 Yes
3.No   80.2  FALSE    3  No
3.Yes  80.2   TRUE    3 Yes

The model description (formula) is more complex than for glm, because the models are more complex. There is the mFormula function. It provides for individual-specific variables (the kind we use) and two kinds of alternative-specific variables. The user can give three parts, separated by vertical bars. The first and third parts are alternative-specific. If we stick to individual-specific variables, we can leave off the last part, like this:

> simple0 = mlogit(passed ~ 0 | hsgpa, data=wide0)

> simple0 = mlogit(passed ~ 0 | hsgpa, data=wide0); summary(simple0)

mlogit(formula = passed ~ 0 | hsgpa, data = wide0, method = "nr",
    print.level = 0)

Frequencies of alternatives:
     No     Yes
0.40102 0.59898

nr method
5 iterations, 0h:0m:0s
g'(-H)^-1g = 0.000119
successive function values within tolerance limits

Coefficients :
                   Estimate Std. Error t-value            Pr(>|t|)
Yes:(intercept) -15.210112   1.998398 -7.6112 0.00000000000002709 ***
Yes:hsgpa         0.197734   0.025486  7.7587 0.00000000000000866 ***

Log-Likelihood: -221.72
McFadden R^2: 0.16436
Likelihood ratio test : chisq = 87.221 (p.value = < 0.000000000000000222)

> # Compare ordinary logistic regression
> summary(glm(passed ~ hsgpa, family=binomial, data=math))

glm(formula = passed ~ hsgpa, family = binomial, data = math)

Deviance Residuals:
    Min      1Q  Median      3Q     Max
-2.5152 -1.0209  0.4435  0.9321  2.1302

Coefficients:
             Estimate Std. Error z value            Pr(>|z|)
(Intercept) -15.21013    1.99832  -7.611 0.00000000000002710 ***
hsgpa         0.19773    0.02548   7.759 0.00000000000000856 ***

(Dispersion parameter for binomial family taken to be 1)

    Null deviance: 530.66 on 393 degrees of freedom
Residual deviance: 443.43 on 392 degrees of freedom
AIC: 447.43

Number of Fisher Scoring iterations: 4
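As a quick arithmetic check, using only the numbers printed above: for a binary outcome, mlogit's likelihood ratio chi-square should equal the drop in deviance from the glm null model. A base-R sketch:

```r
# Sanity check: the deviance drop in the glm output above should reproduce
# mlogit's likelihood ratio chi-square of 87.221 (up to rounding).
null_dev  <- 530.66   # Null deviance from summary(glm(...))
resid_dev <- 443.43   # Residual deviance
G2 <- null_dev - resid_dev
round(G2, 2)   # [1] 87.23
```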

> anova(glm(passed ~ hsgpa, family=binomial, data=math))
Analysis of Deviance Table

Model: binomial, link: logit

Response: passed

Terms added sequentially (first to last)

      Df Deviance Resid. Df Resid. Dev
NULL                    393     530.66
hsgpa  1   87.221       392     443.43

> # Compare G^2 = 87.221 from mlogit
> # Excellent. Now try simple regression with a 3-category outcome.
> # I think I have to make an mlogit data frame with just the vars I want.
> # First try to make the reference category of outcome Failed.
> # Setting contrasts had no effect. Change the alphabetical order.
> outcome = as.character(math$outcome)
> for(j in 1:length(outcome))
+     {if(outcome[j]=='Disappeared') outcome[j]='Gone'}
> math$outcome = factor(outcome); table(outcome)
outcome
 Failed   Gone Passed
     61     97    236
> math1 = math[,c(1,6)]  # All rows, cols 1 and 6
> wide1 = mlogit.data(math1, shape="wide", choice="outcome")
> head(wide1)
         hsgpa outcome chid    alt
1.Failed    78    TRUE    1 Failed
1.Gone      78   FALSE    1   Gone
1.Passed    78   FALSE    1 Passed
2.Failed    66   FALSE    2 Failed
2.Gone      66   FALSE    2   Gone
2.Passed    66    TRUE    2 Passed
> head(math)
  hsgpa hsengl hscalc   course passed outcome
1  78.0     80    Yes Mainstrm     No  Failed
2  66.0     75    Yes Mainstrm    Yes  Passed
3  80.2     70    Yes Mainstrm    Yes  Passed
4  81.7     67    Yes Mainstrm    Yes  Passed
5  86.8     80    Yes Mainstrm    Yes  Passed
6  76.7     75    Yes Mainstrm    Yes  Passed
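Incidentally, the relabeling loop above can be written without an explicit for(); this is a base-R sketch of the same idea, shown on a toy vector so it runs standalone (it assumes the raw level is spelled 'Disappeared', as in the data):

```r
# Vectorized version of the relabeling loop: rename 'Disappeared' to 'Gone'
# so that 'Failed' comes first alphabetically and becomes the reference category.
outcome <- c("Disappeared", "Failed", "Passed", "Disappeared")  # toy example
outcome[outcome == "Disappeared"] <- "Gone"   # no explicit loop needed
table(factor(outcome))
```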

> simple1 = mlogit(outcome ~ 0 | hsgpa, data=wide1)
> summary(simple1)

mlogit(formula = outcome ~ 0 | hsgpa, data = wide1, method = "nr",
    print.level = 0)

Frequencies of alternatives:
 Failed    Gone  Passed
0.15482 0.24619 0.59898

nr method
5 iterations, 0h:0m:0s
g'(-H)^-1g = 1.09E-05
successive function values within tolerance limits

Coefficients :
                     Estimate Std. Error t-value  Pr(>|t|)
Gone:(intercept)     1.904226   2.744979  0.6937    0.4879
Passed:(intercept) -13.393056   2.570453 -5.2104 1.884e-07 ***
Gone:hsgpa          -0.018816   0.035775 -0.5260    0.5989
Passed:hsgpa         0.186437   0.033018  5.6465 1.637e-08 ***

Log-Likelihood: -326.96
McFadden R^2: 0.11801
Likelihood ratio test : chisq = 87.497 (p.value = < 2.22e-16)

> # Estimate probabilities for a student with HSGPA = 90
> betahat1 = simple1$coefficients; betahat1
  Gone:(intercept) Passed:(intercept)         Gone:hsgpa       Passed:hsgpa
        1.90422575       -13.39305637        -0.01881621         0.18643711
attr(,"fixed")
  Gone:(intercept) Passed:(intercept)         Gone:hsgpa       Passed:hsgpa
             FALSE              FALSE              FALSE              FALSE
> gpa = 90
> L1 = betahat1[1] + betahat1[3]*gpa  # Gone
> L2 = betahat1[2] + betahat1[4]*gpa  # Passed
> denom = 1 + exp(L1) + exp(L2)
> pihat1 = exp(L1)/denom  # Gone
> pihat2 = exp(L2)/denom  # Passed
> pihat3 = 1/denom        # Failed
> rbind(pihat1, pihat2, pihat3)
       Gone:(intercept)
pihat1       0.03883621
pihat2       0.92970789
pihat3       0.03145590
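The hand calculation above generalizes to a small helper function. This is only a base-R sketch (mlogit_probs is not part of the mlogit package); the estimates are copied from betahat1, with Failed as the reference category:

```r
# Fitted multinomial logit probabilities for one value of hsgpa, base R only.
# 'intercepts' and 'slopes' are in the order (Gone, Passed); the reference
# category (Failed) has both parameters fixed at zero.
mlogit_probs <- function(intercepts, slopes, x) {
  eta   <- intercepts + slopes * x   # linear predictors, non-reference categories
  denom <- 1 + sum(exp(eta))
  c(exp(eta) / denom, 1 / denom)     # last entry is the reference category
}

p <- mlogit_probs(intercepts = c(1.90422575, -13.39305637),
                  slopes     = c(-0.01881621, 0.18643711),
                  x = 90)
round(p, 5)   # Gone, Passed, Failed: matches pihat1, pihat2, pihat3 above
```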

> # More interesting full model.
> # Make the mlogit data frame, without passed.
> head(math[,c(1:4,6)])
  hsgpa hsengl hscalc   course outcome
1  78.0     80    Yes Mainstrm  Failed
2  66.0     75    Yes Mainstrm  Passed
3  80.2     70    Yes Mainstrm  Passed
4  81.7     67    Yes Mainstrm  Passed
5  86.8     80    Yes Mainstrm  Passed
6  76.7     75    Yes Mainstrm  Passed
> wide = mlogit.data(math[,c(1:4,6)], shape="wide", choice="outcome")
> # Make Mainstream the reference category for course, with nice labels.
> contrasts(wide$course) = contr.treatment(3, base=3)
> colnames(contrasts(wide$course)) = c("Catch-up","Elite")
> contrasts(wide$course)
         Catch-up Elite
Catch-up        1     0
Elite           0     1
Mainstrm        0     0
> # Note that setting the contrasts in math does not work.
> # mlogit.data converts them back to the default!
> fullmod = mlogit(outcome ~ 0 | hsgpa+hsengl+hscalc+course, data=wide)
> summary(fullmod)

mlogit(formula = outcome ~ 0 | hsgpa + hsengl + hscalc + course,
    data = wide, method = "nr", print.level = 0)

Frequencies of alternatives:
 Failed    Gone  Passed
0.15482 0.24619 0.59898

nr method
5 iterations, 0h:0m:0s
g'(-H)^-1g = 0.000216
successive function values within tolerance limits

Coefficients :
                         Estimate Std. Error t-value      Pr(>|t|)
Gone:(intercept)        1.8899724  2.8959975  0.6526       0.51400
Passed:(intercept)    -13.6325290  2.7562888 -4.9460 0.00000075765 ***
Gone:hsgpa             -0.0079779  0.0413277 -0.1930       0.84693
Passed:hsgpa            0.2157706  0.0382179  5.6458 0.00000001644 ***
Gone:hsengl            -0.0067241  0.0251049 -0.2678       0.78882
Passed:hsengl          -0.0399811  0.0228733 -1.7479       0.08047 .
Gone:hscalcYes         -0.3902775  0.6742796 -0.5788       0.56272
Passed:hscalcYes        1.0009683  0.8215247  1.2184       0.22306
Gone:courseCatch-up     0.6834686  0.5560854  1.2291       0.21905
Passed:courseCatch-up  -0.4086564  0.6339142 -0.6447       0.51915
Gone:courseElite       -1.3831859  0.8700722 -1.5897       0.11189
Passed:courseElite      0.1946275  0.5664312  0.3436       0.73114

Log-Likelihood: -312.26
McFadden R^2: 0.15766
Likelihood ratio test : chisq = 116.89 (p.value = < 0.000000000000000222)
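One way to read the slopes: exponentiating a coefficient gives the multiplier on the relative probability of that category versus the reference category. For example, taking Passed:hsgpa = 0.2157706 from the output (a base-R arithmetic sketch):

```r
# Each additional HSGPA point multiplies the estimated relative probability
# of Passed (versus the reference category, Failed) by exp(beta).
beta_passed_hsgpa <- 0.2157706   # Passed:hsgpa from summary(fullmod)
exp(beta_passed_hsgpa)           # about 1.24: roughly a 24% increase per point
```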


> # Test Course controlling for HS variables
> nocourse = mlogit(outcome ~ 0 | hsgpa+hsengl+hscalc, data=wide)
> anova(nocourse, fullmod)
Error in UseMethod("anova") :
  no applicable method for 'anova' applied to an object of class "mlogit"
> # Oh well.
> G2 = -2 * as.numeric(nocourse$logLik - fullmod$logLik); G2
[1] 11.86122
> pval = 1-pchisq(G2, df=4)  # Two betas for each dummy variable.
> pval
[1] 0.01841369
> # Let's keep course and hsgpa. Do we need hsengl and hscalc?
> coursegpa = mlogit(outcome ~ 0 | hsgpa+course, data=wide)
> G2 = -2 * as.numeric(coursegpa$logLik - fullmod$logLik); G2
[1] 8.457276
> pval = 1-pchisq(G2, df=4)  # df=4 again
> pval
[1] 0.07619288

Conclusion: Let's keep just course and hsgpa.

> summary(coursegpa)

mlogit(formula = outcome ~ 0 | hsgpa + course, data = wide, method = "nr",
    print.level = 0)

Frequencies of alternatives:
 Failed    Gone  Passed
0.15482 0.24619 0.59898

nr method
5 iterations, 0h:0m:0s
g'(-H)^-1g = 0.00016
successive function values within tolerance limits

Coefficients :
                        Estimate Std. Error t-value      Pr(>|t|)
Gone:(intercept)        1.363464   2.818403  0.4838       0.62855
Passed:(intercept)    -13.264833   2.611242 -5.0799 0.00000037765 ***
Gone:hsgpa             -0.012635   0.036570 -0.3455       0.72971
Passed:hsgpa            0.184952   0.033426  5.5332 0.00000003144 ***
Gone:courseCatch-up     0.838789   0.501793  1.6716       0.09461 .
Passed:courseCatch-up  -0.707533   0.590061 -1.1991       0.23049
Gone:courseElite       -1.322791   0.857624 -1.5424       0.12298
Passed:courseElite      0.357558   0.549697  0.6505       0.51539

Log-Likelihood: -316.49
McFadden R^2: 0.14625
Likelihood ratio test : chisq = 108.43 (p.value = < 0.000000000000000222)
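Since anova() has no method for mlogit objects, the hand computation above can be packaged as a small helper. This is a sketch, not part of the mlogit package; it assumes both fits respond to logLik(), which holds for glm objects (and, as far as I know, for mlogit objects too). It is demonstrated on nested glm fits with synthetic data so it runs standalone:

```r
# Likelihood ratio test for two nested fits, replacing the missing
# anova() method for class "mlogit".
lr_test <- function(reduced, full, df) {
  G2 <- as.numeric(2 * (logLik(full) - logLik(reduced)))
  c(G2 = G2, df = df, p.value = pchisq(G2, df = df, lower.tail = FALSE))
}

# Standalone demo on nested glm fits (synthetic data):
set.seed(1)
x <- rnorm(100)
y <- rbinom(100, 1, plogis(0.5 * x))
full    <- glm(y ~ x, family = binomial)
reduced <- glm(y ~ 1, family = binomial)
lr_test(reduced, full, df = 1)
# For the models above it would be called as: lr_test(nocourse, fullmod, df = 4)
```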