Fall 2000    STA 216    September 7, 2000

Binary Regression in S-Plus

1 Getting Started in UNIX

Create a class working directory and a .data directory for S-Plus 5.0. If you have used S-Plus 3.x before, you must create a new .data directory for S-Plus 5.0, because the two versions store data and objects in incompatible ways.

    okeeffe% mkdir sta216
    okeeffe% cd sta216
    okeeffe% Splus5 CHAPTER
    okeeffe% ls -a
    ./  ../  .data/

To start S-Plus 5.0, enter Splus5 -e (the -e option allows you to edit your commands on the command line)

-or-

I prefer to run S-Plus under emacs. The advantage of running under emacs is that you can easily edit your history, commands, and functions, and create scripts to automate procedures. Start emacs in your class directory. Then type M-x (M is the Meta or Esc key) followed by S+5. You will be prompted for the starting directory (edit it or hit return), and the buffer will show

    S-PLUS : Copyright (c) 1988, 1999 MathSoft, Inc.
    S : Copyright Lucent Technologies, Inc.
    Version 5.1 Release 1 for DEC alpha, Digital UNIX (OSF/1) V4.0 : 1999
    Working data will be in .data
    >

If you are using the Windows version in the clusters, go to the Start button on the task bar, then select Programs. S-Plus 2000 should be under the Statistics programs. Under the Windows menu at the top, select Command Window to open the window where commands will be issued. While many of the functions we will use are available from the menus, I will cover the command version so that the same syntax can be used on both the PC and Unix platforms.

2 Reading in Data

The following data comprise temperature readings (in degrees F) and indicators of O-ring failure of the space shuttle for 24 launches prior to the Challenger disaster in 1986.

    temp  failure
      53     1
      56     1
      57     1
      63     0
      66     0
      67     0
      80     0
      81     0

S-Plus stores data in objects called dataframes. To read the datafile orings.dat into a dataframe, use the command read.table:
    > orings <- read.table("orings.dat", header=T)

The option header=T is used when the first line of the file contains the column or variable names. (In the Windows version you can use the Import option under the File menu. Hint: for text files, you should rename them with the ending .asc for ASCII rather than .dat.)

To refer to a variable in a dataframe, you can use matrix notation, i.e. the first temperature observation is orings[1,1] and the entire vector is orings[,1]. Dataframes in S-Plus are also lists, so you can refer to columns by name, e.g. orings$temp. If you wish to refer to the variables by name without using the dataframe name, you may attach the dataframe:

    > attach(orings)

To create a scatter plot of the data with a title, enter:

    > plot(temp, failure, xlab="Temperature", ylab="Failure Indicator")
    > title("O-ring Failures")

[Figure: scatter plot titled "O-ring Failures"; x-axis: Temperature, y-axis: Failure Indicator]

How should we model failures as a function of temperature? Do failures depend on temperature? What is the failure probability at 31°F?
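The three ways of referring to a variable described above can be tried without the data file. The sketch below is only an illustration: orings.demo is a hypothetical stand-in built from the eight rows printed in this handout, not the full 24-launch file orings.dat.

```r
# Hypothetical stand-in for orings.dat: only the eight rows printed in the handout
orings.demo <- data.frame(
  temp    = c(53, 56, 57, 63, 66, 67, 80, 81),
  failure = c(1, 1, 1, 0, 0, 0, 0, 0)
)

first.temp <- orings.demo[1, 1]   # matrix notation: row 1, column 1
temp.col   <- orings.demo[, 1]    # matrix notation: the whole first column
temp.named <- orings.demo$temp    # list notation: the column by name

first.temp
temp.col
temp.named
```

The matrix form and the list form return the same vector; the named form is usually easier to read in longer scripts.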
3 Models

Random component: Each observation $Y_i$ has a Bernoulli distribution with probability of failure $\pi_i$, $i = 1, \ldots, 24$ (independent?)

Systematic component: linear predictor $\eta_i = \beta_0 + \beta_1 \, \mathrm{temp}_i$. Quadratic temperature term?

Link between $\pi$ and $\eta$. Which link?

  identity: $\pi = \eta$
  canonical or logit: $\mathrm{logit}(\pi) = \theta = \eta$
  probit: $\Phi^{-1}(\pi) = \eta$
  Student-t or other inverse cdf: $F^{-1}(\pi) = \eta$
  complementary log-log: $\log(-\log(1 - \pi)) = \eta$

3.1 Estimation

To fit a GLM in S-Plus, we will use the function glm. To fit a logit model with temperature plus an intercept as the linear predictor, use:

    > oring.logit <- glm(failure ~ temp, family=binomial(link=logit), data=orings)

The form of the linear predictor is determined by the model expression, the first argument. By default an intercept is included. The output is a glm object, assigned to oring.logit. To summarize the output, use the function summary:

    > summary(oring.logit)

    Call: glm(formula = failure ~ temp, family = binomial(link=logit), data = orings)

    Deviance Residuals:
           Min         1Q    Median        3Q      Max
     -1.212493 -0.8252676 -0.470546 0.5907502 2.051237

    Coefficients:
                     Value Std. Error   t value
    (Intercept) 10.8753321 5.69793801  1.908643
           temp -0.1713202 0.08336339 -2.055102

    (Dispersion Parameter for Binomial family taken to be 1 )

        Null Deviance: 28.97459 on 23 degrees of freedom
    Residual Deviance: 23.03045 on 22 degrees of freedom

    Number of Fisher Scoring Iterations: 4

    Correlation of Coefficients:
         (Intercept)
    temp -0.9958713
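As a way of reading the logit summary above, the sketch below recomputes two quantities from the printed numbers alone. The "t value" column is the estimate divided by its standard error. The comparison of the drop in deviance with a chi-square distribution is an assumption added here for illustration (it is not stated in the handout), and all variable names are hypothetical.

```r
# Numbers copied from the summary(oring.logit) output above
value   <- c(10.8753321, -0.1713202)   # (Intercept), temp
std.err <- c(5.69793801, 0.08336339)

# The "t value" column is the estimate divided by its standard error
t.value <- value / std.err
t.value

# Drop in deviance when temp is added to the intercept-only model
dev.drop <- 28.97459 - 23.03045
df.drop  <- 23 - 22

# Assumption: refer the drop to a chi-square with df.drop degrees of freedom
p.value <- 1 - pchisq(dev.drop, df.drop)
p.value
```

With only 24 binary observations the chi-square approximation is rough, so the resulting p-value is best read as an informal guide to the question of whether failures depend on temperature.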
For the complementary log-log (cloglog) link:

    > oring.cloglog <- glm(failure ~ temp, family=binomial(link=cloglog), data=orings)
    > summary(oring.cloglog)

    Call: glm(formula = failure ~ temp, family = binomial(link = cloglog), data = orings)

    Deviance Residuals:
           Min         1Q    Median        3Q      Max
     -1.215259 -0.7975805 -0.468455 0.3467605 2.062026

    Coefficients:
                     Value Std. Error   t value
    (Intercept)  8.9361729 3.72824308  2.396886
           temp -0.1466572 0.05662127 -2.590144

    (Dispersion Parameter for Binomial family taken to be 1 )

        Null Deviance: 28.97459 on 23 degrees of freedom
    Residual Deviance: 22.4359 on 22 degrees of freedom

    Number of Fisher Scoring Iterations: 5

    Correlation of Coefficients:
         (Intercept)
    temp -0.9940828

The probit model, whose fitted curve is plotted below, is fit in the same way:

    > oring.probit <- glm(failure ~ temp, family=binomial(link=probit), data=orings)

To obtain estimates of the probabilities, use the predict function (see help for its other options):

    > predict(oring.logit, type="response")
    > predict(oring.cloglog, type="response")

    # to add to the graph
    > lines(temp, predict(oring.logit, type="response"), lwd=2, lty=1)
    > lines(temp, predict(oring.cloglog, type="response"), lwd=2, lty=2)
    > lines(temp, predict(oring.probit, type="response"), lwd=2, lty=3)
    > legend(70, .8, c("logit", "cloglog", "probit"), lty=c(1,2,3))
[Figure: scatter plot titled "O-ring Failures" with the fitted logit, cloglog, and probit curves and a legend; x-axis: Temperature, y-axis: Failure Indicator]
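The fitted probabilities can also be computed by hand from the reported coefficients, which gives one way to approach the question about 31°F. The sketch below applies the inverse logit and inverse complementary log-log transformations to the linear predictor. The function and variable names are hypothetical, the temperatures are chosen only for illustration, and 31°F lies far below the observed temperatures, so that value is an extrapolation.

```r
# Coefficients copied from the two summary() outputs in this handout
logit.coef   <- c(10.8753321, -0.1713202)   # (Intercept), temp
cloglog.coef <- c(8.9361729, -0.1466572)    # (Intercept), temp

# Inverse links: map the linear predictor eta back to a probability
inv.logit   <- function(eta) 1 / (1 + exp(-eta))
inv.cloglog <- function(eta) 1 - exp(-exp(eta))

# Illustrative temperatures; 31 is outside the range of the data
temps <- c(31, 53, 67, 81)

p.logit   <- inv.logit(logit.coef[1] + logit.coef[2] * temps)
p.cloglog <- inv.cloglog(cloglog.coef[1] + cloglog.coef[2] * temps)

cbind(temps, p.logit, p.cloglog)
```

Both links give estimated failure probabilities that fall as temperature rises, and both put the estimate at 31°F very close to 1. How much weight to give that figure depends on how far one trusts the linear predictor outside the observed range.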