Stat 5100 Handout #14.a SAS: Logistic Regression

Similar documents
Stat 5100 Handout #12.f SAS: Time Series Case Study (Unit 7)

Stat 5100 Handout #6 SAS: Linear Regression Remedial Measures

Stat 500 lab notes c Philip M. Dixon, Week 10: Autocorrelated errors

Stat 5100 Handout #11.a SAS: Variations on Ordinary Least Squares

Stat 5100 Handout #19 SAS: Influential Observations and Outliers

run ld50 /* Plot the onserved proportions and the fitted curve */ DATA SETR1 SET SETR1 PROB=X1/(X1+X2) /* Use this to create graphs in Windows */ gopt

A SAS Macro for Covariate Specification in Linear, Logistic, or Survival Regression

SAS Workshop. Introduction to SAS Programming. Iowa State University DAY 2 SESSION IV

CH9.Generalized Additive Model

Introduction to Statistical Analyses in SAS

BUSINESS ANALYTICS. 96 HOURS Practical Learning. DexLab Certified. Training Module. Gurgaon (Head Office)

Baruch College STA Senem Acet Coskun

Stat 5100 Handout #15 SAS: Alternative Predictor Variable Types

Data Management - 50%

Repeated Measures Part 4: Blood Flow data

An introduction to SPSS

SPSS QM II. SPSS Manual Quantitative methods II (7.5hp) SHORT INSTRUCTIONS BE CAREFUL

Outline. Topic 16 - Other Remedies. Ridge Regression. Ridge Regression. Ridge Regression. Robust Regression. Regression Trees. Piecewise Linear Model

Generalized Additive Models

CDAA No. 4 - Part Two - Multiple Regression - Initial Data Screening

Centering and Interactions: The Training Data

Applied Regression Modeling: A Business Approach

THIS IS NOT REPRESNTATIVE OF CURRENT CLASS MATERIAL. STOR 455 Midterm 1 September 28, 2010

UTILIZING DATA FROM VARIOUS DATA PARTNERS IN A DISTRIBUTED MANNER

Biostat Methods STAT 5820/6910 Handout #9 Meta-Analysis Examples

STAT 5200 Handout #25. R-Square & Design Matrix in Mixed Models

Didacticiel - Études de cas

/* Parametric models: AFT modeling */ /* Data described in Chapter 3 of P. Allison, "Survival Analysis Using the SAS System." */

SYS 6021 Linear Statistical Models

Information Criteria Methods in SAS for Multiple Linear Regression Models

ST Lab 1 - The basics of SAS

Applied Regression Modeling: A Business Approach

book 2014/5/6 15:21 page v #3 List of figures List of tables Preface to the second edition Preface to the first edition

A New Method of Using Polytomous Independent Variables with Many Levels for the Binary Outcome of Big Data Analysis

Resources for statistical assistance. Quantitative covariates and regression analysis. Methods for predicting continuous outcomes.

Box-Cox Transformation for Simple Linear Regression

Brief Guide on Using SPSS 10.0

STAT 5200 Handout #21. Split Split Plot Example (Ch. 16)

STAT 2607 REVIEW PROBLEMS Word problems must be answered in words of the problem.

Paper SDA-11. Logistic regression will be used for estimation of net error for the 2010 Census as outlined in Griffin (2005).

Unit 5 Logistic Regression Practice Problems

Lecture 24: Generalized Additive Models Stat 704: Data Analysis I, Fall 2010

Lab 07: Multiple Linear Regression: Variable Selection

The SAS RELRISK9 Macro

Intermediate SAS: Statistics

Week 4: Simple Linear Regression III

ST512. Fall Quarter, Exam 1. Directions: Answer questions as directed. Please show work. For true/false questions, circle either true or false.

StatCalc User Manual. Version 9 for Mac and Windows. Copyright 2018, AcaStat Software. All rights Reserved.

Week 5: Multiple Linear Regression II

PharmaSUG China. model to include all potential prognostic factors and exploratory variables, 2) select covariates which are significant at

Detecting and Circumventing Collinearity or Ill-Conditioning Problems

Chapter 6: Linear Model Selection and Regularization

MINITAB 17 BASICS REFERENCE GUIDE

Lab #9: ANOVA and TUKEY tests

Chapter 1 Changes and Enhancements to SAS/STAT Software in Versions 7 and 8

Package samplesizelogisticcasecontrol

CHAPTER 7 EXAMPLES: MIXTURE MODELING WITH CROSS- SECTIONAL DATA

Problem set for Week 7 Linear models: Linear regression, multiple linear regression, ANOVA, ANCOVA

Multiple imputation using chained equations: Issues and guidance for practice

Statistics and Data Analysis. Common Pitfalls in SAS Statistical Analysis Macros in a Mass Production Environment

SAS Workshop. Introduction to SAS Programming. Iowa State University DAY 3 SESSION I

The linear mixed model: modeling hierarchical and longitudinal data

Purposeful Selection of Variables in Logistic Regression: Macro and Simulation Results

Logistic Model Selection with SAS PROC s LOGISTIC, HPLOGISTIC, HPGENSELECT

Minitab detailed

USING MACROS TO CREATE PARAMETER DRIVEN PROCEDURES THAT SUMMARIZE AND PRESENT STATISTICAL OUTPUT IN TABULAR FORM

Fathom Dynamic Data TM Version 2 Specifications

SAS/STAT 14.3 User s Guide The SURVEYFREQ Procedure

CHAPTER 1 INTRODUCTION

THE UNIVERSITY OF BRITISH COLUMBIA FORESTRY 430 and 533. Time: 50 minutes 40 Marks FRST Marks FRST 533 (extra questions)

Mixed Effects Models. Biljana Jonoska Stojkova Applied Statistics and Data Science Group (ASDa) Department of Statistics, UBC.

Package GWAF. March 12, 2015

RSM Split-Plot Designs & Diagnostics Solve Real-World Problems

NCSS Statistical Software

SASEG 9B Regression Assumptions

Modelling Proportions and Count Data

Bivariate (Simple) Regression Analysis

Regression Analysis and Linear Regression Models

Performing Cluster Bootstrapped Regressions in R

Modelling Proportions and Count Data

Quantitative - One Population

Example Using Missing Data 1

Using Multivariate Adaptive Regression Splines (MARS ) to enhance Generalised Linear Models. Inna Kolyshkina PriceWaterhouseCoopers

Correctly Compute Complex Samples Statistics

SAS Macros for Binning Predictors with a Binary Target

Exploratory model analysis

Zero-Inflated Poisson Regression

SD10 A SAS MACRO FOR PERFORMING BACKWARD SELECTION IN PROC SURVEYREG

SAS/STAT 13.1 User s Guide. The SURVEYFREQ Procedure

SOCY7706: Longitudinal Data Analysis Instructor: Natasha Sarkisian. Panel Data Analysis: Fixed Effects Models

Lecture 25: Review I

Section 2.3: Simple Linear Regression: Predictions and Inference

Table of Contents (As covered from textbook)

STAT:5201 Applied Statistic II

Modeling Categorical Outcomes via SAS GLIMMIX and STATA MEOLOGIT/MLOGIT (data, syntax, and output available for SAS and STATA electronically)

From logistic to binomial & Poisson models

Correctly Compute Complex Samples Statistics

Stat 302 Statistical Software and Its Applications SAS: Data I/O

CSC 328/428 Summer Session I 2002 Data Analysis for the Experimenter FINAL EXAM

Package GWRM. R topics documented: July 31, Type Package

Transcription:

Stat 5100 Handout #14.a SAS: Logistic Regression Example: (Text Table 14.3) Individuals were randomly sampled within two sectors of a city, and checked for presence of disease (here, spread by mosquitoes). Subjects age (in years), socioeconomic status (low, medium, high), and city sector are to be used to predict the probability of contracting the disease. /* Define options */ ods html image_dpi=300 style=journal; /* Input data -- see Table 14.3 in text Case = subject ID Age = years SES_mid = indicator of middle socioeconomic status SES_low = indicator of low socioeconomic status (upper is reference level for socioeconomic status) Sector = indicator of sector 2 in city (sector 1 is reference level) Disease = indicator of disease presence */ filename myurl url "http://www.stat.usu.edu/~jrstevens/stat5100/data/ch14ta03.txt"; data outbreak; infile myurl delimiter = '09'x; input Case Age SES_mid SES_low Sector Disease; Observation = _n_; /* Run logistic regression, checking for lack of fit */ proc logistic data=outbreak; model Disease(event = '1') = Age SES_mid SES_low Sector / clparm=wald alpha=.05 lackfit; SES: test SES_mid=SES_low=0; output out=a1out prob=phat; title1 'Logistic Regression'; Logistic Regression Probability modeled is Disease=1. Model Convergence Status Convergence criterion (GCONV=1E-8) satisfied. 1

Model Fit Statistics Criterion Intercept Only Intercept and Covariates AIC 124.318 111.054 SC 126.903 123.979-2 Log L 122.318 101.054 Testing Global Null Hypothesis: BETA=0 Test Chi-Square DF Pr > ChiSq Likelihood Ratio 21.2635 4 0.0003 Score 20.4067 4 0.0004 Wald 16.6437 4 0.0023 Analysis of Maximum Likelihood Estimates Parameter DF Estimate Standard Error Wald Chi-Square Pr > ChiSq Intercept 1-2.3127 0.6426 12.9545 0.0003 Age 1 0.0297 0.0135 4.8535 0.0276 SES_mid 1 0.4088 0.5990 0.4657 0.4950 SES_low 1-0.3051 0.6041 0.2551 0.6135 Sector 1 1.5746 0.5016 9.8543 0.0017 Odds Ratio Estimates Effect Point Estimate 95% Wald Confidence Limits Age 1.030 1.003 1.058 SES_mid 1.505 0.465 4.868 SES_low 0.737 0.226 2.408 Sector 4.829 1.807 12.907 2

Association of Predicted Probabilities and Observed Responses Percent Concordant 77.5 Somers' D 0.554 Percent Discordant 22.1 Gamma 0.556 Percent Tied 0.3 Tau-a 0.242 Pairs 2077 c 0.777 Parameter Estimates and Wald Confidence Intervals Parameter Estimate 95% Confidence Limits Intercept -2.3127-3.5721-1.0533 Age 0.0297 0.00328 0.0562 SES_mid 0.4088-0.7653 1.5828 SES_low -0.3051-1.4891 0.8789 Sector 1.5746 0.5915 2.5578 Linear Hypotheses Testing Results Label Wald Chi-Square DF Pr > ChiSq SES 1.2053 2 0.5474 3

Partition for the Hosmer and Lemeshow Test Group Total Disease = 1 Disease = 0 Observed Expected Observed Expected 1 10 0 0.79 10 9.21 2 10 1 1.02 9 8.98 3 11 2 1.51 9 9.49 4 10 1 1.78 9 8.22 5 10 3 2.34 7 7.66 6 10 4 3.09 6 6.91 7 10 7 3.91 3 6.09 8 11 3 5.51 8 5.49 9 10 5 6.32 5 3.68 10 6 5 4.75 1 1.25 Hosmer and Lemeshow Goodness-of-Fit Test Chi-Square DF Pr > ChiSq 9.1871 8 0.3268 4

/* Make a 'Conditional Effect' plot to compare predicted disease probabilities for sector 1 (Sector=0) vs sector 2 (Sector=1) at low socioeconomic status (SES_mid=0, SES_low=1), as a function of Age */ data new; set outbreak; p1 = 1/(1+exp(-(-2.3127+0.0297*Age+0.4088*0-0.3051*1+1.5746*0))); p2 = 1/(1+exp(-(-2.3127+0.0297*Age+0.4088*0-0.3051*1+1.5746*1))); label p1 = 'Sector 1' p2 = 'Sector 2'; proc sort data=new; by Age; proc sgplot data=new; series y=p1 x=age / lineattrs=(pattern=solid); series y=p2 x=age / lineattrs=(pattern=dash); xaxis label='age'; yaxis label='predicted Probability'; title1 'Conditional Effect Plot at Low SES'; 5

/* Check for multicollinearity */ proc reg data=outbreak; model Disease = Age SES_mid SES_low Sector / vif collin; title1 'Collinearity Check'; Collinearity Check Parameter Estimates Variable DF Parameter Estimate Standard Error t Value Pr > t Variance Inflation Intercept 1 0.04699 0.10470 0.45 0.6546 0 Age 1 0.00555 0.00238 2.33 0.0218 1.05242 SES_mid 1 0.07595 0.11139 0.68 0.4970 1.24616 SES_low 1-0.04150 0.10323-0.40 0.6886 1.34514 Sector 1 0.31702 0.09180 3.45 0.0008 1.09668 Collinearity Diagnostics Number Eigenvalue Condition Index Proportion of Variation Intercept Age SES_mid SES_low Sector 1 2.91249 1.00000 0.01791 0.02993 0.02390 0.01836 0.03753 2 1.03987 1.67357 0.00177 0.00043477 0.23886 0.23672 0.02873 3 0.56543 2.26957 0.00369 0.01213 0.41619 0.08529 0.46221 4 0.36812 2.81280 0.00172 0.50905 0.06301 0.18584 0.37255 5 0.11410 5.05233 0.97491 0.44845 0.25805 0.47378 0.09897 6

/* Variable selection */ /* - here, backward elimination; may also consider selection=stepwise */ proc logistic data=outbreak; model Disease(event = '1') = Age SES_mid SES_low Sector / selection=backward slstay=0.10; title1 'Backward Elimination'; Backward Elimination Step 0. The following effects were entered: Intercept Age SES_mid SES_low Sector Step 1. Effect SES_low is removed: Step 2. Effect SES_mid is removed: Probability modeled is Disease=1. Backward Elimination Procedure Note: No (additional) effects met the 0.1 significance level for removal from the model. Summary of Backward Elimination Step Effect Removed DF Number In Wald Chi-Square Pr > ChiSq 1 SES_low 1 3 0.2551 0.6135 2 SES_mid 1 2 0.9590 0.3274 Analysis of Maximum Likelihood Estimates Parameter DF Estimate Standard Error Wald Chi-Square Pr > ChiSq Intercept 1-2.3350 0.5111 20.8713 <.0001 Age 1 0.0293 0.0132 4.9455 0.0262 Sector 1 1.6734 0.4873 11.7906 0.0006 7

/* Variable selection */ /* - here, display the 'best' two models containing between 1 and 2 predictors */ proc logistic data=outbreak; model Disease(event = '1') = Age SES_mid SES_low Sector / selection=score best=2 start=1 stop=2; title1 'Variable Selection: best by score'; Variable Selection: best by score Probability modeled is Disease=1. Regression Models Selected by Score Criterion Number of Variables Score Chi-Square Variables Included in Model 1 14.7805 Sector 1 7.5802 Age 2 19.5250 Age Sector 2 15.7058 SES_low Sector /* Check for outliers using the half-normal probability plot with simulated envelope -- note that this macro can be slow for large sample sizes */ /* Two [unused] ways to access simulated envelope macro: filename macrourl url "http://www.stat.usu.edu/jrstevens/stat5100/simenv.macro.sas"; %include macrourl; filename macrourl "C:\[filepath]\simEnv.macro.sas"; %include macrourl; */ 8

/* How you can access shortcut in SAS Web Editor: Don't worry about understanding this one long line of code; just copy and paste from www.stat.usu.edu/jrstevens/stat5100/simenv.sas */ %macro simenv(dataset, response, predictors, N); proc... /* Call simenv macro; arguments: dataset = name of dataset containing data response = name of response variable in dataset, coded 0/1 predictors = name(s) of predictor variable(s) (if multiple, separated by spaces) N = number of observations (sample size) */ %simenv(dataset = outbreak, response = disease, predictors = Age SES_mid SES_low Sector, N=98); 9

/* Check for influential observations */ proc logistic data=outbreak plots(only)=(phat influence dpc); model Disease(event = '1') = Age SES_mid SES_low Sector; title1 'Check for Influential Observations'; Check for Influential Observations 10

proc logistic data=outbreak plots(only label)=(phat influence dpc); model Disease(event = '1') = Age SES_mid SES_low Sector; 11

/* Look at suspect observation */ proc print data=outbreak; where Case = 48; var Case Age SES_mid SES_low Sector Disease; title1 'Suspect point'; Suspect point Obs Case Age SES_mid SES_low Sector Disease 48 48 65 0 1 1 0 12