Data-Analysis Exercise Fitting and Extending the Discrete-Time Survival Analysis Model (ALDA, Chapters 11 & 12, pp )

Similar documents
Creating a data file and entering data

Genetic Analysis. Page 1

CHAPTER 5. BASIC STEPS FOR MODEL DEVELOPMENT

Show how the LG-Syntax can be generated from a GUI model. Modify the LG-Equations to specify a different LC regression model

Lab #3: Probability, Simulations, Distributions:

STATA 13 INTRODUCTION

Creating a Do File in Stata

Brief Guide on Using SPSS 10.0

SAS Training Spring 2006

The Power and Sample Size Application

ST Lab 1 - The basics of SAS

Introduction to SAS. Cristina Murray-Krezan Research Assistant Professor of Internal Medicine Biostatistician, CTSC

Basic concepts and terms

HILDA PROJECT TECHNICAL PAPER SERIES No. 2/08, February 2008

An Introduction to Stata Part I: Data Management

Psychology 282 Lecture #21 Outline Categorical IVs in MLR: Effects Coding and Contrast Coding

Using HLM for Presenting Meta Analysis Results. R, C, Gardner Department of Psychology

Correctly Compute Complex Samples Statistics

Analysis of Complex Survey Data with SAS

IPUMS Training and Development: Requesting Data

You will learn: The structure of the Stata interface How to open files in Stata How to modify variable and value labels How to manipulate variables

COMP 110 Project 1 Programming Project Warm-Up Exercise

REALCOM-IMPUTE: multiple imputation using MLwin. Modified September Harvey Goldstein, Centre for Multilevel Modelling, University of Bristol

DSCI 325: Handout 10 Summarizing Numerical and Categorical Data in SAS Spring 2017

An introduction to SPSS

ebook Writing Workshop Practical 1: A First Logfile-Style ebook

Modeling Categorical Outcomes via SAS GLIMMIX and STATA MEOLOGIT/MLOGIT (data, syntax, and output available for SAS and STATA electronically)

PRI Workshop Introduction to AMOS

Introduction to Stata First Session. I- Launching and Exiting Stata Launching Stata Exiting Stata..

Simply Accounting Intelligence Tips and Tricks Booklet Vol. 1

Tutorial #3A: Using Latent GOLD 5.0 with Holdout Records. Overview. The Data

Introduction to Stata Toy Program #1 Basic Descriptives

Algebraically Speaking Chalkdust Algebra 1 Fall Semester

PHPM 672/677 Lab #2: Variables & Conditionals Due date: Submit by 11:59pm Monday 2/5 with Assignment 2

Also, for all analyses, two other files are produced upon program completion.

STATS PAD USER MANUAL

Tutorial #1: Using Latent GOLD choice to Estimate Discrete Choice Models

User Services Spring 2008 OBJECTIVES Introduction Getting Help Instructors

The MAXQDA Stats Data Editor

STAT 7000: Experimental Statistics I

Correctly Compute Complex Samples Statistics

Chapter 17: INTERNATIONAL DATA PRODUCTS

1 Introduction to Using Excel Spreadsheets

I Launching and Exiting Stata. Stata will ask you if you would like to check for updates. Update now or later, your choice.

7 Fractions. Number Sense and Numeration Measurement Geometry and Spatial Sense Patterning and Algebra Data Management and Probability

Math 227 EXCEL / MEGASTAT Guide

SAS/STAT 13.1 User s Guide. The Power and Sample Size Application

DETAILED CONTENTS. About the Editor About the Contributors PART I. GUIDE 1

STAT 3304/5304 Introduction to Statistical Computing. Introduction to SAS

JMP Book Descriptions

Chapter 2 The SAS Environment

Tests of difference for two sample designs. Bivariate and multiple regression. Analysis of covariance and multivariate analysis of variance

Linear Methods for Regression and Shrinkage Methods

Introduction to Mplus

An Introduction to Stata Part II: Data Analysis

Introduction to SAS. I. Understanding the basics In this section, we introduce a few basic but very helpful commands.

Lesson 1: Creating and formatting an Answers analysis

Chapter 7. Joining Maps to Other Datasets in QGIS

Package epitab. July 4, 2018

Lab #1: Introduction to Basic SAS Operations

Introduction to CS databases and statistics in Excel Jacek Wiślicki, Laurent Babout,

Hierarchical Generalized Linear Models

A. Using the data provided above, calculate the sampling variance and standard error for S for each week s data.

A Guided Tour Through the SAS Windowing Environment Casey Cantrell, Clarion Consulting, Los Angeles, CA

CHAPTER 7 EXAMPLES: MIXTURE MODELING WITH CROSS- SECTIONAL DATA

SAS Structural Equation Modeling 1.3 for JMP

Spatial Patterns Point Pattern Analysis Geographic Patterns in Areal Data

StatCalc User Manual. Version 9 for Mac and Windows. Copyright 2018, AcaStat Software. All rights Reserved.

MAS 90/200 Intelligence Tips and Tricks Booklet Vol. 1

Applied Regression Modeling: A Business Approach

Stata v 12 Illustration. First Session

STAT:5400 Computing in Statistics

Preparing for Data Analysis

Example Using Missing Data 1

STA 570 Spring Lecture 5 Tuesday, Feb 1

Table of Contents. Tutorial The Basics Prerequisites Concepts... 1 Information... 1 Learning Objectives... 2

International Graduate School of Genetic and Molecular Epidemiology (GAME) Computing Notes and Introduction to Stata

Department of Economics Spring 2018 University of California Economics 154 Professor Martha Olney Stata Lesson Thursday February 15, 2018

tabulate varname [aw=weightvar]

Survey of Math: Excel Spreadsheet Guide (for Excel 2016) Page 1 of 9

Opening a Data File in SPSS. Defining Variables in SPSS

Surviving SPSS.

Preparing for Data Analysis

20 - Analyzing Focus Groups

EXAMPLE 3: MATCHING DATA FROM RESPONDENTS AT 2 OR MORE WAVES (LONG FORMAT)

INTRODUCTION to. Program in Statistics and Methodology (PRISM) Daniel Blake & Benjamin Jones January 15, 2010

Introduction to Mixed Models: Multivariate Regression

Bioinformatics? Reads, assembly, annotation, comparative genomics and a bit of phylogeny.

Stata version 13. First Session. January I- Launching and Exiting Stata Launching Stata Exiting Stata..

An Introduction to Stata Exercise 1

EXAMPLE 3: MATCHING DATA FROM RESPONDENTS AT 2 OR MORE WAVES (LONG FORMAT)

Using R for Analyzing Delay Discounting Choice Data. analysis of discounting choice data requires the use of tools that allow for repeated measures

A QUICK INTRODUCTION TO STATA

7.4 Tutorial #4: Profiling LC Segments Using the CHAID Option

MaxDiff Designer v2. (Updated 7 March 06) Sawtooth Software, Inc. Sequim, WA

SPSS INSTRUCTION CHAPTER 9

Epidemiology Principles of Biostatistics Chapter 3. Introduction to SAS. John Koval

Frequently Asked Questions Updated 2006 (TRIM version 3.51) PREPARING DATA & RUNNING TRIM

Blimp User s Guide. Version 1.0. Brian T. Keller. Craig K. Enders.

CLAREMONT MCKENNA COLLEGE. Fletcher Jones Student Peer to Peer Technology Training Program. Basic Statistics using Stata

Transcription:

Applied Longitudinal Data Analysis Page 1 Data-Analysis Exercise Fitting and Extending the Discrete-Time Survival Analysis Model (ALDA, Chapters 11 & 12, pp. 357-467) Purpose of the Exercise This data-analytic exercise supplements our presentation on specifying, fitting and extending the discrete-time survival analysis. The immediate purpose for the exercise is to open the door to your implementation of discrete-time survival analyses with the SAS statistical package, by asking you to program and execute a series of simple statistical analyses that replicate, and extend, the core material contained in ALDA and featured in our presentation. General Resources for Conducting the Exercise While we provide self-contained instructions below for conducting the data-analysis exercise, we recommend that you keep the following resources close at hand: ALDA itself, and the handouts from our presentation. The UCLA Academic Technology Services website that supports ALDA, at http://www.ats.ucla.edu/stat/examples/alda/. This site contains downloadable versions of the datasets used in ALDA, specimen computer programs (in HLM, MLwiN, S+, SAS, SPSS, Stata) and sample computer output for all exhibits in the book, using each of the software programs. Datasets The dataset used in this exercise has been formatted by UCLA Academic Technology Services and is available for download, both as text and system files from their website. For our current purposes, the dataset has already been downloaded onto your workstation and stored in a directory on the hard drive. You will need to know the address of this directory in order to conduct the analyses below. Data-Analysis Exercise Please proceed sequentially: 1. All of the tasks in this exercise use the data on grade at first heterosexual intercourse in ALDA Table 11.1 on p. 360. Please make sure that you have this exhibit close at hand. 2. The files that contain these data have the prefix firstsex. They appears in the data directory several times with different suffixes denoting different versions of the file -- as raw text, in person-level and person-period format, and as system files appropriate for analyses with the different software packages.

Applied Longitudinal Data Analysis Page 2 a. First, examine the raw text person-level dataset, firstsex.txt, by opening Windows Explorer and clicking on the filename. Windows Notepad should then display it. Make sure that the first two lines of the dataset match the following: "ID","TIME","CENSOR","PT","PAS" 1,9,0,0,1.978866614548087 b. Notice that this data-file contains comma-delimited text, with a first line that provides the names of the variables it contains in quotes: ID, TIME, CENSOR, PT, PAS. By consulting ALDA, match these variable names with the variables description on p. 358 ff. Make sure that you can identify the values of each of the variables for the first case in the dataset, and that you know what each represents. c. Now, in a similar fashion, examine the raw text person-period dataset of the same data it is called firstsex_pp.txt. Check that the first four lines of the dataset are: "ID","PERIOD","EVENT","D7","D8","D9","D10","D11","D12","PT","PAS" 1,7,0,1,0,0,0,0,0,0,1.978866614548087 1,8,0,0,1,0,0,0,0,0,1.978866614548087 1,9,1,0,0,1,0,0,0,0,1.978866614548087 d. Notice the switch in variable names between the person-level and person-period datasets specifically, the appearance in the latter of the time dummies, D7 thru D12, which accommodate the different grades in which the adolescent boy could first experience heterosexual intercourse. Again, make sure that you can line up the names of the variables with their values for the first case across the three periods of his event history and that you understand why they take on the values that they do. 3. Boot up PC-SAS on your workstation, by clicking on the appropriate program icon or filename. a. You will see the SAS opening screen that usually, by default, contains the SAS Explorer, Editor, Log and Output windows (the Output window may be hidden behind other windows). These windows can be opened by clicking on the labeled tabs in the lower toolbar or by using the VIEW menu.. i. You will use the Editor window to compose your program code. ii. Your output will appear in the Output window. iii. Your job log will appear in the Log window. b. If there is any content currently present in the Editor, Log or Output windows, clear it by going to that window and selecting Edit>Clear All on the upper tool bar, or on the menu that pops up when you click the right mouse button.

Applied Longitudinal Data Analysis Page 3 c. Although SAS can accept data entry in all kinds of formats, we will be using data in the form of SAS system files in this data-analysis exercise. This has the advantage of streamlining the entry of the data along with the corresponding variable names, etc. 4. Now, begin your discrete-time survival analyses with SAS. a. Read the firstsex person-period system file (firstsex_pp.sas7bdat) into SAS and conduct the contingency table analyses outlined below to estimate the sample frequencies and proportions in ALDA Table 11.1, on p. 360, by typing the following code into the SAS Editor: proc freq data= C:/????/firstsex_pp.sas7bdat ; tables pt*period*event/nopercent nocol; b. Where you must replace the fully-qualified filename within the quotes on the first line of the program to identify the home of the firstsex person-period SAS system file on your own computer. i. In SAS, proc calls on a particular procedure for execution, in this case the freq procedure to carry out cross-tabulations and compute frequencies, proportions and the like. ii. After the proc sentence, there are usually additional sentences. In this example, within the freq procedure, we are asking for tables of PT by PERIOD by EVENT. Notice how the order in which you list these variables has lead to the particular structure of the output. iii. After the slash (/) on the second line of the program, you can include options. In this case, we have chosen options that eliminate unnecessary statistics from the output ( nopercent nocol ) to reduce clutter. (Try rerunning the program without these options (and the slash), and check out what happens). iv. It doesn t matter whether you use upper or lower case text in your program. It doesn t matter where you insert spaces, nor how many. But, you must end each sentence with a semicolon omitting the semicolon is the most common error in SAS programming. c. To execute your program, click on the small running man icon on the upper tool bar in SAS or select Run>Submit. i. In the unlikely event that you have not made any programming errors (!), then output from your analyses should appear in the SAS Output window. You can move among the windows by using the tabs in the lower toolbar. ii. If you have made errors, you will find diagnostic messages in the Log window that may help you to fix the error (or maybe not!).

Applied Longitudinal Data Analysis Page 4 1. Edit your program, in the Editor window, and re-run. 2. If that doesn t work, clear all and start again. 3. If that doesn t work, scream loudly and put your foot through the screen. d. Once you have successfully obtained output (?), print it out for your records and for the computations below. 5. Examine your output, and identify the no parenting transition sub-table (PT=0) a. Within this sub=table, identify: i. The size of the risk set in each period (last column on the right, labeled total ). ii. The number of events in the period (the penultimate column, labeled EVENT=1 ). b. Find the row corresponding to the 10 th period, and divide the number of events that have occurred by the size of the risk set for this period. The result is the (conditional) hazard probability for the 10 th period, and can be checked against the percentage value in the output and the sample hazard estimate in the first panel of Table 11.1, also for the tenth period. c. Use the cumulating formula for survival probability in Equation (10.4) on p. 335 of ALDA to estimate the sample survival probability in: (a) period 7 (remember that everybody begins by surviving in grade 6), and (b) period 8. Check your values against the appropriate entry in Table 11.1. 6. Now, fit your first discrete-time hazard model to the firstsex data. This first model will replicate Model A in Table 11.3, on page 386 of ALDA. a. Begin, by clearing all from the Editor and the Log, and deleting the output files from the SAS Explorer window if you wish. Then, type the following new program into the SAS Editor: proc logistic data="c:/????/firstsex_pp.sas7bdat" descending; model event=d7-d12 / noint; i. You could also simply add the new code to the end of the previous code, if you want, inserting it immediately before the command if you do this, then both the cross-tabulation and the logistic regression analysis will execute sequentially when you run the new program. ii. In SAS, logistic regression analyses are conducted using proc logistic in the person-period dataset. The hypothesized DTSA model is specified in the model statement, with the outcome variable ( EVENT ) listed on the left of the = sign and the predictor variables (D7 thru D12) listed on the

Applied Longitudinal Data Analysis Page 5 right of the = sign. The noint option in the model sentence eliminates the regular intercept, in favor of the multiple intercepts associated with the complete set of time dummies. The descending option in the proc sentence defines the value 1 as event occurrence for the EVENT variable. (PROC LOGISTIC treats its outcome as categorical, not numerical, and so it doesn t know until you tell it, that 1 is higher than 0. Try rerunning the program without the descending option, and check out what happens to the logistic regression coefficients!) b. Execute the program in the usual way and debug until you successfully obtain output. c. Print a copy of the output for your records and, on the output, identify and compare the following with entries in Table 11.3 on p. 386: i. The parameter estimates associated with time dummies, D7 thru D12. Notice that their anti-logged versions are listed as Odds Ratio Estimates beneath the estimate listing. ii. The standard errors associated with time dummies, D7 thru D12. iii. The deviance statistic associated with the model fit. It is listed as the entry for -2 Log L under Model Fit Statistics, With Covariates. iv. The AIC statistic. It is listed under Model Fit Statstics, With Covariates. d. Using Equation (11.15) on p. 387, transform the coefficient associated with the D12 predictor to obtain a fitted hazard probability and compare it to the value contained in the last row of Table 11.4 on p. 388. 7. Now, refit the discrete-time hazard model with the PT and PAS predictors included, in order to replicate Model D in Table 11.3 (p. 386). a. Begin, by clearing the appropriate SAS windows and editing the existing program in the SAS Editor, to read: proc logistic data="c:/????/firstsex_pp.sas7bdat" descending; model event=d7-d12 pt pas / noint; i. In SAS, multiple predictors like PT and PAS can be added to a model statement by listing them out on the right hand side of the = sign, separated by one or more spaces. b. Execute the program in the usual way. c. Print a copy of the output for your records and on it, identify and compare the following with entries in Table 11.3 on p. 386, under Model D: i. The parameter estimates associated with the PT and PAS predictors. ii. The deviance statistic associated with the model fit.

Applied Longitudinal Data Analysis Page 6 d. Compare your odds ratio estimate obtained for the PT predictor with the value given in text on p. 389, beneath Equation (11.16). Notice that your value differs from the text value (1.94 vs. 2.40) because the text version is obtained in Model B, which estimates the impact of PT in the absence of a control for PAS. Your estimate controls for the main effect of PAS. 8. In Chapter 12 of ALDA, we describe how to replace the completely general specification of TIME in the hazard model with more parsimonious polynomial specifications and for comparing the fits of these models in order to determine which specification is most effective. This approach can also be adopted with the firstsex example, even though it is probably not necessary, given that TIME contains a small number of discrete timeperiods. a. Begin, by clearing the appropriate SAS windows and editing the existing program in the SAS Editor, to read: data sex; set "c:/????/firstsex_pp.sas7bdat"; one=1; t1=period-7; t2=t1*t1; proc logistic data=sex descending; model event = one pt pas / noint; proc logistic data=sex descending; model event = one t1 pt pas / noint; proc logistic data=sex descending; model event = one t1 t2 pt pas / noint; i. In this program, you are asked to implement a different strategy for entering the data into the analysis, using a data step to create a temporary SAS dataset that exists only within this particular run. We have called this temporary dataset, sex. The firstsex_pp system datafile is read into the temporary dataset by the set command on the second line. ii. We have adopted this new approach to data entry so that you can easily create new polynomial representations of time from the PERIOD variable already present in the dataset. The new representations are: (a) the variable one, which is constant and equal to 1 in every record (and will serve as the new intercept in the polynomial models, see Table 12.1, p. 411), (b) the variable t1, which represents linear time (re-centered on grade 7), and (c) the variable t2, which represents quadratic time. iii. The program also contains multiple proc logistic paragraphs, each of which requests that a different discrete-time hazard model be fit. Notice that we ask each of these implementations of the procedure to fit the required model using the temporary dataset sex and not the original dataset, in order to make use of the newly minted polynomial representations of time.

Applied Longitudinal Data Analysis Page 7 b. Execute the program in the usual way. c. Print a copy of the output for your records and, on the output, identify the deviance and AIC statistics associated with each model fit. Compare these fits to the deviance statistic for the completely general model (containing both the PT and PAS predictors) obtained earlier. Notice, as argued in our presentation, that the deviance statistics for the constant model and the completely general model bracket the deviance statistics for all the polynomial specifications. d. Using the strategies of fit comparison described in our presentation and book, determine which polynomial representation best represents the shape of the discrete-time hazard function. e. Use the odds ratio estimates on the output to interpret the main of effects of the PT and PAS predictors, in your selected model. 9. In Chapter 12 of ALDA, we also describe how to test the adequacy of the proportional odds assumption made in many discrete-time hazard models, by including interactions between TIME and substantive predictors. This approach can also be adopted with the firstsex example. a. You will test for the failure of the proportionality assumption with regards the PT predictor in a model that uses a linear specification of TIME, investigated above. b. Begin by editing your existing program in the SAS Editor to: i. Create an interaction between the PT predictor and linear time, t1, in the data step, give it a variable name of your own choosing. ii. Fit a discrete-time hazard model that contains the predictors ONE, linear time, PT, and the new interaction. c. Execute the program in the usual way. d. Print a copy of the output for your records and, on the output, identify the statistic that confirms or disconfirms the adequacy of the proportionality assumption. What is your conclusion? 10. If you have any time left, take a look at the programs and output for ALDA Chapters 11 and 12 on the UCLA support site or work on getting your own data in shape for your subsequent analyses on Friday morning.