Fundamental Data Manipulation Techniques

The Analysis of Longitudinal Data

Introduction

This document describes the process of organizing longitudinal data from the HRS for the purposes of statistical analysis. We start by considering three small problems that illustrate the key components of longitudinal data manipulation. In particular, we consider the process of selecting a subset of variables and records from a file, linking records of related files together, and grouping records with common fields. In the next three sections, we use these fundamental data operations as building blocks to solve more complex analysis problems. We start by describing and solving a time series analysis problem: have deaths been decreasing over the past two years? Next, we consider an analysis of a problem involving two distinct types of events: is there a correlation between child survival (a mortality event in the HRS) and the out-migration of the child's father (an out-migration event in the HRS)? Finally, we consider an analysis question that involves a "time-to-event" dependent variable with censored cases (censored cases are those for which the event has not yet occurred): what are the determinants of migration?

This document assumes that the reader is familiar with the basic operations of using FoxPro. If you are not comfortable with FoxPro, the book FoxPro Step-by-Step, published by Microsoft Press, provides a good tutorial. We are grateful for the assistance of the Navrongo Health Research Centre (NHRC), Ghana, in providing a small subset of data to help illustrate these analysis techniques. The data set is small enough to be manageable for instructional purposes and large enough to generate interesting cases in the course of an analysis.

Fundamental Data Manipulation Techniques

The examples in this section illustrate key operations in the manipulation of longitudinal data. In the first example, we select a subset of records and fields from a single file. We then load that file into the SPSS statistical analysis package and perform a cross-tabulation of the data. In the second example, we illustrate how to merge records from two different files. In the last example, we group together collections of records in a single file and compute basic attributes about the groups.

Extracting Subsets of Variables and Records

We consider the problem of gender and mortality in this example. We will extract all the members who have an exit_type of "DTH" (death) and an exit_date greater than December 31, 1992. For these records, we are interested in the variables sex and age. We then would like to do a simple cross-tabulation in SPSS of sex by age group (0-5, 5-18, > 18).

RQBE (Relational Query By Example)

One of the easiest ways to extract subsets of data from a file is to use the RQBE command in FoxPro. This is a utility that helps in the construction of commands to merge and extract data from files. If you are not familiar with the RQBE command, you should review the FoxPro tutorial or the FoxPro Step-by-Step book. Every RQBE generates an "SQL SELECT" command (you can click on See SQL to see it); the RQBE is just an easy-to-use method of constructing SQL commands. In this document, we will usually present the SQL command as a specification of the corresponding RQBE. This will prove useful in later sections, when we find that some SQL commands cannot be specified in the RQBE. The RQBE to extract all members who died later than December 31, 1992 generates an SQL command along the following lines:
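(A sketch: the second field is the age expression typed into the RQBE expression window; computing age as days alive divided by 365.25 is an assumption.)

 SELECT member.sex, (member.exit_date - member.birth_date) / 365.25;
  FROM member;
  WHERE member.exit_type = "DTH";
  AND member.exit_date > {12/31/92};
  INTO TABLE stepf1_1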

Let's consider the parts of this command. The command starts with the SELECT keyword, which could be confused with the FoxPro command to select (or focus on) a file, but because the keyword is followed by variable names, FoxPro is able to recognize this as an "SQL SELECT." The list of variables following the SELECT are the variables and expressions entered in the "fields" dialog box of the RQBE. The second entry is typed in at the expression window. The FROM keyword is followed by a list of the files that will be used; in this case, only the member database file. Following the WHERE keyword is a list of all the conditions for selecting a record; in this case, we are selecting all members who died (exit_type = "DTH") after 12/31/92. When working with multiple files, the WHERE clause can also be used to specify how records in different files are to be matched up. Finally, the INTO keyword indicates where to place the output.

Guidelines for Setting Up Working Files

Before going any further, guidelines are needed for setting up data files for analysis. One of the most important things to keep in mind is that a number of "temporary" working files get created along the way to a statistical analysis. These files have a way of becoming permanent and, in the end, cluttering up disk drives. A clogged disk drive can make almost any system crash, so you want to keep a careful eye on your space utilization. In these examples, we put all our working files in a subdirectory called output. We do not put them in the main DBFS directory holding the HRS data files. So in the above example, the output goes to the current directory; for this document, the current directory will be \analysis (use the SET DEFAULT TO \analysis command to make this your default directory). The member database is located in the \hrs\dbfs directory.

Constructing the correct RQBE (SQL SELECT) is usually an iterative process. While developing the appropriate RQBE, we typically send the results to a browse window and compare our expectations with the results of the command. Eliminating the INTO clause from the above command will send the output to a browse window. Only when we get the right output do we send the results to a database file. If you save the RQBE command, a file is created with an extension of .qpr. We saved the above command in a file called stepf1_1.qpr. We can perform any SQL SELECT command from the command window. For example, our extraction of records from the member file can be performed by typing DO stepf1_1.qpr in the command window.

Note: when using the RQBE to reference files in different directories, sometimes it is unable to find the designated file. For instance, if we ran the above SQL from the command window with no files open, the command would be unable to find the member file; we need to tell it to look in the \hrs\dbfs directory. The command would also place the results (stepf1_1.dbf) in the current directory. You can change the stepf1_1.qpr file (using MODIFY COMMAND in the command window) to reference explicit directories (for example, by writing FROM \hrs\dbfs\member and INTO TABLE \analysis\stepf1_1 in place of the unqualified names).

Modifying the Working File

We have not completed the construction of the working file for statistical analysis. We would like to rename the second field of the database (currently named exp_2 from the SQL command) and add another field to represent the age group. We make these changes by making the stepf1_1 file active (click on it or open it in the view window) and then choosing the Database menu option, followed by the Setup submenu option.
Click on the modify button and change the name of exp_2 to age and add a new variable to represent the age_group (numeric, length 1).
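The new variable can then be filled in from the command window; a sketch (assuming the 0-5, 5-18, and over-18 bands of the planned cross-tabulation, with boundary ages assigned to the younger band):

 SELECT stepf1_1
 REPLACE ALL age_group WITH 1 FOR age <= 5
 REPLACE ALL age_group WITH 2 FOR age > 5 AND age <= 18
 REPLACE ALL age_group WITH 3 FOR age > 18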

Typing commands like these in the command window sets the new variable age_group to appropriate values. This completes the construction of our database file; we are now ready to do the cross-tabulation in SPSS.

Loading the Database File into SPSS for Windows

We describe the process of loading data into SPSS for Windows. Since SPSS supports the loading of FoxPro files (.dbf files), the process of loading our work files is relatively straightforward. Start SPSS, choose the File menu option, followed by the Open suboption, followed by the Data suboption. A dialog box appears; change the type of file we are loading from a *.sav file to a *.dbf file. Now locate and select the working file you have created (in our case stepf1_1.dbf). The database file will be loaded and presented in a spreadsheet-like format. At this point we can choose the cross-tabulation routine by choosing the Statistics menu option, followed by the Summarize suboption and then the Crosstabs option. The results from this analysis follow:

[SPSS output: cross-tabulation of SEX (F, M) by AGE_GROUP with row and column totals, followed by Pearson and likelihood-ratio chi-square statistics; 2 of 6 cells (33.3%) have an expected frequency of less than 5.]

Linking Related Records in Different Files

To illustrate the second basic data manipulation technique we ask: "Have men or women migrated away from the compound more often in 1993 and 1994?" Answering this question requires linking data records in the migration file (to get the migration events) with data records in the member file (to get the gender of the individual). The primary purpose of this example is to illustrate how to merge records from different files. To start, we need to extract a subset of records and fields from the migration file. In particular, we need all the out-migrations (indicated by a type field of "EXT") during 1993 and 1994. The SQL command to extract the permanent ID and the date of migration for this subset of migration events is:
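(A sketch: the file and field names migrate, type, and date are assumptions; check them against the actual migration file.)

 SELECT migrate.perm_id, migrate.date;
  FROM migrate;
  WHERE migrate.type = "EXT";
  AND migrate.date >= {01/01/93};
  AND migrate.date <= {12/31/94};
  INTO TABLE stepf2_1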

Now we need to link this newly created data file (stepf2_1.dbf) with the member file. We could do this in a couple of ways. We will explain both, since different circumstances warrant the use of one technique over the other. The first technique continues the use of the RQBE, but in this case it is used to join two files together and extract a subset of fields. The second technique relates records in the two files with a key expression, and then commands are written in the command window to merge fields from the two files. The RQBE to implement the first technique is:
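A sketch of the SQL that such an RQBE generates:

 SELECT stepf2_1.perm_id, stepf2_1.date, member.sex;
  FROM stepf2_1, member;
  WHERE stepf2_1.perm_id = member.perm_id;
  INTO TABLE stepf2_2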

In this RQBE, we are creating a new record for every matching record between the stepf2_1 file and the member file. Records are matched based on their permanent ID. The results are sent to the stepf2_2 file.

The second technique to link files together uses the SET RELATION command to link records in two files. Both files first need to be opened. Use the view window (choose the Window menu option, followed by the View suboption) to open them if they are not open already. Using the view window, specify the index order for the member file to be based on the permanent ID (click on the member file, followed by Setup) and then make the stepf2_1 database active (click on it). Now, to relate stepf2_1 to member, click on the Relation button, then click on the member file. You will be asked for the field (or expression) in the stepf2_1 file that is to be used to link to the index expression in the member file; choose the perm_id field. The view window is the easiest way to specify the relations between files; alternatively, you could type this in at the command window (make sure both files are open, using the view window):

Now each out-migration of a member (a record in the stepf2_1 file) is linked to background information about that member (a record in the member file). We first create a new field in stepf2_1 to hold the gender of the member who migrated. Then we copy that information from the member record to the stepf2_1 record with the following command:

How do you decide when to use one technique over the other? In general, it is easier to use the RQBE than the SET RELATION, but it is not always possible. The RQBE will generate an output record for any and all matches between records in one file and another. If there isn't a matching record, then no output record will be generated. Now consider a problem in which we want to update a list of members to record the last out-migration before 01/01/1993. In this case, not every member record will be matched to a corresponding migration record (not all members migrate out), so the RQBE will generate a list of only those members who migrated before 01/01/1993. This is not exactly what we want. It is possible, using the SET RELATION command, to identify and update only those records with a matching migration record. We will see how to do this in later sections.

At this point, the data can be loaded into SPSS to get a frequency distribution of the number of records with sex = "M" and the number of records with sex = "F". We choose not to do this because the procedure is fairly straightforward and because there is an easier way to do it in FoxPro with the Group option in the RQBE window. This technique is the subject of our next subsection.

Grouping Records with Common Values

The last of the fundamental data operations that we consider involves the grouping of records that share a common value in a field or a set of fields. This technique is easy to do and quite useful in a number of circumstances. In this section, we determine the number of children each woman has.

The RQBE (or corresponding SQL) has an option that can designate a field or a collection of fields to be grouped. Basic statistics, such as the number of elements, maximum, minimum, or average of a variable, can be computed on the elements of the group. For example, grouping records in the member file by region and counting the number of elements in each group would provide a count of members by region. Or, you could group records in the stepf2_2 file (see the previous subsection) by the field sex to determine the number of males and the number of females who migrated in 1993 and 1994. In this particular example, we want to determine the number of children a woman has. Also, as a way of explaining some of the features of grouping, we determine the birth date of the youngest child. The SQL command to do this is:
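(A sketch: the field name mother_id and the output table name children are assumptions.)

 SELECT member.mother_id, COUNT(*), MAX(member.birth_date);
  FROM member;
  GROUP BY member.mother_id;
  INTO TABLE children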

In this command, the mother ID is used to group records together. The command computes the number of elements that share the same mother ID value and determines the maximum birth date of all members in the group. Now suppose I wanted to compute a frequency distribution of the number of children. Do you know how to compute this? Try it.

In the next three sections, we consider problems that require us to apply a number of the fundamental data manipulation techniques.

Time Series Analysis

Time series analysis refers to any longitudinal study in which time is the unit of analysis and trends or events in time are variables of interest. Time series analysis can be one of three types:

Temporal analysis involves describing a trend over time. For example, has the migration rate increased over time? Has mortality declined with time?

Discontinuity analysis represents a simple extension beyond description, to the interpretation of the impact of some event. Has the trend in mortality changed after immunization was introduced? Did fertility decline after a family planning program was launched? In such cases time series data include some indicator of a disturbance in time.

Time series regression analysis involves interpreting a set of several time series in which the timing of disturbances varies by area, but the processes under observation are otherwise comparable. For example, an immunization program may be introduced in an area in phases. The question that arises is: do areas where immunization is introduced earlier have more precipitous declines than areas where children are immunized later?

From the data management standpoint, all three types of analysis have the same requirements: events or rates must be aggregated and arrayed over time. Time series analysis involves problems in which time is a correlate. For example, has the migration rate increased over time? Or has the number of deaths been decreasing? In this section, we describe the steps to put together the data to answer the question, "Has the number of deaths been decreasing in the time period from the beginning of 1993 until the end of 1994?"

An Overview of the Data Consolidation Steps

We describe two different ways of answering the above question. First, we develop a workfile to analyze the number of deaths by month. The second way involves the calculation of a rate of mortality (the number of deaths normalized by the population). In this case, we have to determine the population at the beginning of each month. Actually, for this dataset, the population does not vary that much from month to month, so the normalization is not necessary, but it is instructive for the purposes of an example.

The data file that we need to construct for the analysis has the following format: a record for each month of 1993 and 1994. Each record consists of variables for:
o Time period (1...24)
o Date at the beginning of the time period (month)
o Number of deaths in this time period
o Population at the beginning of the time period (for the second stage)

The steps needed to construct this file for stage 1 include:
1. Define a working file (time_per) that, by the time we are finished, will be passed to the statistical analysis package. In particular, it will have 24 records (one for each month of 1993 and 1994) and four variables to hold the time period, date, mortality, and population.
2. Construct a working file (mortmont) of 1993 and 1994 mortality events by month.
3. Link these two files (time_per, mortmont) by month/year and add the mortality information in the mortmont file to the time_per file.
4. The time_per file is ready for the statistics package.

In stage two of the analysis, we normalize the mortality events by the size of the population. To do this, we must determine the population at the beginning of each month. When we are finished, we will have defined a function that returns the population on a particular day. That function is made up of the following steps:
5. Construct a working file (mem) that consists of the variables perm_id, entry_date, and exit_date for all records from the member file.
6. Adjust for migrations of members during the period of interest. For example, if we had an out-migration before the date of interest and then an in-migration after the date of interest, then we would not want to include that individual in a count of the population (notice that in this case, the exit type and exit date of the member are currently blank, although at one time, when the member was out-migrated, they were not blank).
7. Let d_date designate the date at which we want to determine the population. Delete all records with an entry_date > d_date or with an exit_date < d_date.
8. Count the number of remaining records.

Detailed Description of the Data Consolidation Steps

Each of the above steps is now described in more detail.

Step 1: Use the CREATE command to define a database file with the following structure:

 Structure for table: c:\analysis\time_per.dbf
 Number of data records: 24
 Date of last update: 04/06/95
 Field  Field Name   Type     Width  Dec  Index
     1  PERIOD       Numeric      2
     2  DATE         Date         8
     3  POPULATION   Numeric      8
     4  MORTALITY    Numeric      4
 ** Total **                     23

We use the APPEND command to add 24 records, each with a date corresponding to the beginning of a month. The first few records of the file follow:

 Record#  PERIOD  DATE      POPULATION  MORTALITY
       1       1  01/01/93
       2       2  02/01/93
       3       3  03/01/93
       4       4  04/01/93

Step 2: Use FoxPro's RQBE to select and group the mortality records. The RQBE results in an SQL statement, which you can type in and run from the command window, type into a file and execute with a DO, or specify interactively with the RQBE. The SQL statement to select the 1993 and 1994 mortality events and group them by month is as follows:
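(A sketch: it assumes deaths are drawn from the member file via exit_type and exit_date; the original may draw on a separate mortality file.)

 SELECT YEAR(member.exit_date), MONTH(member.exit_date), COUNT(*);
  FROM member;
  WHERE member.exit_type = "DTH";
  AND member.exit_date >= {01/01/93};
  AND member.exit_date <= {12/31/94};
  GROUP BY 1, 2;
  INTO TABLE mortmont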

The above SQL statement groups by (1,2), which refers to the first and second expressions in the select statement: the year and the month of the date of death.

Step 3: Link the file in Step 1 with the file in Step 2 with the following statements:

 ** exp_1 refers to the first expression in the RQBE, ...

The expression exp_1*100 + exp_2 is necessary in order to guarantee unique combinations of dates. If we tried to just add exp_1 and exp_2, then the year 1994, month 1 would have the same value as the year 1993, month 2. Once the relation is set, it is easy to replace the mortality field in time_per from mortmont:

Step 4: Load the time_per file into SPSS by starting SPSS and using the File, Open, Data commands. Then change the type of data file from the default (*.sav) to *.dbf. Change to the directory containing the time_per file and choose that file. The results from a simple linear regression of mortality events by time period follow:

[SPSS output: multiple regression with dependent variable MORTALITY and predictor PERIOD, reporting R, R-square, the analysis-of-variance table, and the coefficients for PERIOD and the constant.]

Obviously there is a fairly weak correlation between deaths and time period. We now consider how to normalize the number of mortality events by the size of the population. As mentioned before, this population is not changing that much, so it is not necessary to do this, but the process of determining the population at a particular time period is instructive and potentially useful in other studies.

Step 5: Construct a working file (mem) that consists of the variables perm_id, entry_date, and exit_date for all records from the member file. The SQL command to do this is as follows:

Step 6: Let d_date designate the date at which we want to determine the population. Adjust for migrations of members before d_date. For example, if we had an out-migration before d_date and then an in-migration after d_date, then we would not want to include that individual in a count of the population (notice that in this case, the exit type and exit date of the member are currently blank, although at one time, at d_date, they were not blank). The SQL command to extract all out-migrants before d_date:

The command to get all the in-migrants is similar:

Our goal is to change the exit status of any members who were gone during some period that includes d_date. If that member entered after d_date, then we want to include that member in the population count. First we link the files together (all the linking can be done at the command window or by using the view window):
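A sketch of the linking (the table names outmig and inmig for the two migration extracts are assumptions):

 SELECT outmig
 INDEX ON perm_id TAG perm_id
 SELECT inmig
 INDEX ON perm_id TAG perm_id
 SELECT mem
 SET RELATION TO perm_id INTO outmig
 SET RELATION TO perm_id INTO inmig ADDITIVE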

Now we replace the exit date if we have an earlier one from the migrate file (but remember, we selected only those less than d_date). If an in-migration comes after the out-migration, but before d_date, then we want to update the exit_date:

Step 7: Delete all records with an entry_date > d_date or with an exit_date < d_date.

Step 8: Count the number of remaining records. The variable m.pop_ddate contains the total population at time d_date.

It would get very tedious to run through steps 5 through 8 for all 24 time periods. There is a better way. Steps 5 through 8 can be collected into a file (censddte.prg). With a few changes, we can set the file up so that it can be treated as a function. In particular, we add a PARAMETERS statement at the top and a RETURN statement at the bottom. The function is listed in Appendix A at the end of this document. To call the function, we need to set a path to the directory that contains our source code; this tells FoxPro to look in this directory to find the source files (in our case, the directory is c:\hrs\prog). Use the Files option of the view window to set the path. The file named time_per should already be open and should not be linked to any files:
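A sketch of the calling loop (it assumes censddte takes the date as its one parameter and returns the population count for that day):

 SELECT time_per
 SCAN
    m.d = time_per.date            && beginning-of-month date for this record
    m.pop = censddte(m.d)          && assumed: returns the population on day m.d
    SELECT time_per                && reselect in case the function changed work areas
    REPLACE population WITH m.pop
 ENDSCAN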

This code will load the calculated population into the field population of the time_per file. It is now ready to load into SPSS (we can calculate the value mortality/population in SPSS).

Concomitant Event Analysis

Concomitant event analysis looks for a correlation between two different types of events. For example, we can ask whether there is a correlation between child survival (a mortality event in the NDSS) and the out-migration of the child's father (an out-migration event in the NDSS).

An Overview of the Data Consolidation Steps

In this section, we describe how to put together the data to answer two different forms of the above question. First, we look to establish a correlation between the survival of the child in 1994 and whether the father was absent at any time during 1994. Second, we extend the analysis to look for a correlation between child survival and the number of days the father was absent. While the second problem is not a concomitant event analysis problem, we include it because it illustrates some data extraction techniques and it is not that different from the first problem.

The data file that we need to construct for the first stage of the analysis has the following format: a record for each child born between 12/31/1988 and 01/01/1995. Each record consists of variables for:
o Child ID
o Survival status of the child (1 = child died in 1994, 0 otherwise)
o Father ID
o Migration status of the father (1 = father migrated in 1994, 0 otherwise)

The steps needed to construct this file for stage 1 include:
1. Construct a working file (child94) from the member file which consists of all children born between 12/31/1988 and 01/01/1995. Select the fields perm_id, exit_type, exit_date, and father_id. Add fields for the migration status of the father (0,1) and the survival status of the child (0,1).
2. Use the child94 file to form another working file (father94) of all the fathers of these children: group by the father_id in child94 to form this file.
3. Call a function (which we provide and which is listed in the appendix with comments) for each father to determine whether the father had out-migrated at any time in 1994 (including if he left before 1994).
4. Link the father94 file to child94 to insert the migration status of the father. Set the child survival field for each child based on the exit status of the child.
5. Use the child94 file in a statistical analysis program to compute a cross-tabulation of child survival and father out-migration status.

The stage 2 analysis changes step 3 in the above list to determine the number of days of migration in the 1994 time interval: instead of calling a function to determine whether a father migrated, we call another function to determine how long he was away from home. Afterward, link the data in a manner similar to step 4. Use the child94 file in a statistical analysis program to compute a logistic regression of child survival on the number of out-migration days.

Detailed Description of the Data Consolidation Steps

We now expand on the outline of data consolidation steps.

Step 1: We construct a working file (child94) from the member file, which consists of all children born between 12/31/1988 and 01/01/1995. Select the fields perm_id, exit_type, exit_date, and father_id. This is done with the following RQBE (because of the complicated and/or conditions, it is easier to type this command into a file and run it):
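A sketch of the core of this command (the actual selection has more elaborate and/or conditions):

 SELECT member.perm_id, member.exit_type, member.exit_date, member.father_id;
  FROM member;
  WHERE member.birth_date > {12/31/88};
  AND member.birth_date < {01/01/95};
  INTO TABLE child94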

We add fields for the migration status of the father (0,1) and the survival status of the child (0,1) using the database Setup/Modify options (from the FoxPro menus) or the MODIFY STRUCTURE command (typed in at the command window).

Step 2: Use the child94 file to form another working file (father94) of all the fathers of these children; group by the father_id in child94 to form this file. This is done with the following RQBE. We add a field for the migration status of the father (0,1) using the database Setup/Modify options (from the FoxPro menus) or the MODIFY STRUCTURE command (typed in at the command window).

Step 3: Call a function (which we provide and which is listed in the appendix) for each father to determine whether the father out-migrated at any time in 1994 (including if he left before 1994). The name of the function is MigrAway. We can replace the mig_stat field of the father94 file with the following command:
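(A sketch: the parameter order of MigrAway is an assumption; here it is taken to return 1 if the father was away at some point in 1994 and 0 otherwise.)

 SELECT father94
 REPLACE ALL mig_stat WITH MigrAway(father_id, {01/01/94}, {01/01/95})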

The MigrAway function determines the latest out-migration date that is less than (in this case) {01/01/95}, and it also determines the latest in-migration date that is less than (in this case) {01/01/94}. If the latest in-migration event has a date greater than the latest out-migration, you can conclude that the member was present during 1994; otherwise, he was not.

Step 4: Link the father94 file to child94 to insert the migration status of the father. Set the child survival field for each child based on the exit status of the child. We type the following commands in the command window (the default directory should be \analysis):
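A sketch of these commands (the index tag name is arbitrary; death_stat and father_mig are the fields added in step 1):

 SELECT father94
 INDEX ON father_id TAG father_id
 SELECT child94
 SET RELATION TO father_id INTO father94
 REPLACE ALL father_mig WITH father94.mig_stat
 REPLACE ALL death_stat WITH 1 FOR exit_type = "DTH" AND YEAR(exit_date) = 1994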

Step 5: We load the child94 database into SPSS (File, Open, Data, *.dbf...) and then compute a cross-tabulation of the death_stat variable with the father_mig variable. For our particular subset of data, we find no correlation between child survival events and the father's migration events.

For the purpose of exposition (and not because the data warrant further investigation), we extend the analysis to look for a correlation between child survival and the number of days the father was absent. To do this we change step 3 in the above description to:

Step 6: Call a function (which we provide and which is listed in the appendix) for each father to determine how many days the father was out-migrated at any time in 1994 (including if he left before 1994). The name of the function is MigrAway. We can replace the migr_stat field of the father94 file with the following command:
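(A sketch with the same assumed parameter order as before; here the function returns a number of days rather than a 0/1 status.)

 SELECT father94
 REPLACE ALL migr_stat WITH MigrAway(father_id, {01/01/94}, {01/01/95})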

14 "left censoring." "Left censoring" is associated with special statistical problems that we will ignore for the moment. Let us consider the case of "right censored" data, or "censored" data for short. Censoring represents a potential bias because observation durations are terminated artificially by events that have nothing to do with the attrition process that is under study. A person entering a survival study may be alive at the end of the study. Such a person has been observed until the end of the study, and should be included in the analysis, but we know nothing about the person s survival after the study. Excluding such cases would remove individuals with long survival histories, biasing results. Treating such cases as deaths would obviously bias results by spuriously elevating risks. Every censored study has examples of potential biases that could arise from the censoring process. The conventional method for analyzing censored data is termed the "life table." This procedure makes the simple assumption that censoring is independent of the attrition process. Whatever observation arises from such cases is used in calculating populations at risk until the time of censoring. The survival process is analyzed in small discrete time periods, with simple assumptions made about the temporal distribution of risk within discrete intervals. Survival probabilities are calculated for each interval, so that cases that are censored at some point of time can be used in the denominators of rates for time segments prior to the point of censoring. Discrete probabilities computed in this fashion can then be accumulated multiplicatively to show the implication of a series of probabilities for the overall survival process. The conventional term for the discrete survival probability is "q(x)," where x denotes some point in time. Tables of "q" and corresponding cumulative survival probabilities represent a life table. The problem with life tables is that they are tables aggregate data on processes that have underlying covariates. The logic of regression analysis is extremely useful in explaining the covariates of some process, but tabular data are aggregated, and statistical procedures based on likelihood estimation and least squares regression require individual level observations. In a classic paper written by D.R. Cox in 1972, the notion of "Hazard Regression" was proposed. Cox noted that attrition can be defined by "hazard functions." In the Cox approach, all attrition processes are captured by one of three types of relationships between "q" and time, and the role of covariates in increasing or decreasing the attrition process. Cox provides us with procedures for combining the statistical tools of regression with the advantages of life tables for dealing with censoring. To understand the concept of "hazard modeling" or Cox regression, it is helpful to review the concept of attrition that Cox has utilized in his approach."hazard functions" are simple equations for representing the relationship of q with time. As time progresses, the underlying "hazard" can remain constant, which is rare, or it can change. A constant "hazard" is the simplest representation of survival data: No hazard function is needed in a regression, because the intercept at time x=0 defines the entire attrition process. In cumulative terms, this constant defines an exponential rate of attrition. 
Note that the units of analysis are the discrete time points for each individual. The role of z is to shift the level of the attrition process. The level does not change with time, however. The logit is the appropriate function for discrete data (0 = surviving; 1 = dying). When the pace of attrition changes with time, Cox regression simply substitutes an equation for the intercept:
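(Again a reconstruction, with a(x) an intercept that varies with time:)

\[ \operatorname{logit} q(z, x) = a(x) + b z \]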

Depending upon the data and the problem, a(x) can assume any form whatsoever: increasing, decreasing, or remaining constant over time. This is a "proportional hazard" model. The coefficient b defines the extent to which the covariate z elevates or reduces attrition, relative to the function a(x). The conditional hazard q(z,x) never crosses the underlying hazard, because the effect of z is independent of time. The most complicated attrition process is one in which the effect of z covaries with time. A logit model for this is:
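(A reconstruction consistent with the description below, where the z-by-time interaction carries the coefficient c:)

\[ \operatorname{logit} q(z, x) = a(x) + b z + c \, (z \times x) \]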

In this process, b defines a "main effect" for the role of z, and c defines the extent to which this effect is modified as time progresses.

From a data management standpoint, these three models have different requirements. In one case, it is sufficient to fit a model to tabular data, since adding the time dimension contributes no information. For the second model, data must be arrayed by individual, with time allowing for the estimation of the hazard. In the third case, the data must register values of z for each discrete time point in the period of observation.

The models shown have important data management implications. First, data prepared for "Cox modeling" should be informed by the attrition process that is being studied. Data prepared for the first model may not have sufficient detail for estimating model 2; data arrayed for model 2, in turn, may be inadequate for estimating model 3. Second, a model 3 data set can always be used for the more parsimonious attrition processes. That is, if a regression model is estimated for model 3, and the term c is found to be insignificant, then a model 2 regression is sufficient and feasible to estimate with the data set used for the more complex specification. Third, it is important to have data management procedures that record events for the attrition process, censoring (so that observations can be defined), and covariates of interest. How the data matrix is designed depends upon which type of regression is employed.

An Overview of the Data Consolidation Steps

Preparing the data for an analysis of the question "What are the determinants of migration?" requires a number of steps. This section identifies these steps, and later sections cover the steps in more detail. First, we need to frame the analysis in more concrete terms. This includes identifying the independent variables that may influence migration and specifying the time period over which we will look at migration. For this analysis, the variables age, gender, and size of the household (compound) at the beginning of 1994 are tested as possible determinants. We will only consider migrations in 1994.

The data file that we need to construct for the analysis has the following format: a record for each member who was present during some portion of 1994. Each record consists of variables for:
o Age
o Gender
o Family size
o Migration out date
o Number of days present during 1994
o Censored status: did a migration occur? (yes, no)

The specific steps needed to construct this file include:
1. Construct a working file (mem94) that consists of individuals who were present at some time during 1994. Select the perm_id, region, family_num, member_num, birth_date, and sex fields. Add fields to this database file for the family size, the migration out date, the number of days present in 1994, and the censored status.
2. Construct another working file (pop_time) using the procedure pop_time. This procedure constructs a file (pop_time) that gives the population (and their current household) at a time specified by the user. In this case we are interested in the population on the 1st of January 1994.
3. Group the pop_time workfile by the current household ID to produce a working file (step3_3) that contains the number of members present in each household on the 1st of January 1994. After constructing the step3_3 working file, link this file (based on family ID) to the pop_time working file to give the number of individuals in a household on January 1, 1994. Finally, link the pop_time file to the mem94 file (based on permanent ID) to insert the count of the family members.
4. Construct another working file (step3_4) of all individuals who left a household during 1994. Link the step3_4 working file to the mem94 working file (based on permanent ID) to add the migration out day (if there was one).
5. Call the PDO function to determine the person days of observation of each individual. Store the result of this calculation in the appropriate record of the mem94 file.
6. Set the censored status variable (whether an individual migrated (1) or not (0)). This variable is used by the Cox regression procedure. Also calculate the age of each member at the beginning of the time interval.
7. Load the mem94 working file into SPSS and call the Cox regression procedure.

Detailed Description of the Data Consolidation Steps

We now expand on the outline of data consolidation steps.

Step 1: Construct a working file (mem94) that consists of individuals who were present at some time during 1994. Select the perm_id, region, family_num, member_num, birth_date, and sex fields. We can construct this file with the following SQL statement:

Now we need to add fields to this database file for the family size, migration out date, number of days present in 1994, and the censored status. We add fields using the database Setup/Modify options (from the FoxPro menus) or the MODIFY STRUCTURE command (typed in at the command window).

Step 2: Construct another working file (pop_time) using the procedure pop_time. This procedure constructs a file that gives the population (and their current household) at a time specified by the user. In this case we are interested in the population on the 1st of January 1994. In the command window we can type the following command to call this procedure:
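Perhaps something like (assuming pop_time takes the date of interest as its parameter):

 DO pop_time WITH {01/01/94}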

The pop_time procedure constructs a database file called membrsid, which represents the list of all members present in the study on the 1st of January 1994; for each member, the family (compound) is given in the region and family_num fields.

Step 3: Group the membrsid workfile by the current household ID to produce a working file (step3_3) that contains the number of members present in each household on the 1st of January 1994:

 SELECT membrsid.region, membrsid.family_num, COUNT(*);
  FROM membrsid;
  GROUP BY membrsid.region, membrsid.family_num;
  INTO TABLE step3_3.dbf

After constructing the step3_3 working file, we need to add the count of family members to mem94 in two steps: first from step3_3 to membrsid, and then from membrsid to mem94. We do it this way because the current household of the member may be different at the end of 1994 (mem94) than at the beginning of 1994 (membrsid). We need to add a field fam_count to the membrsid working file. Then, to link the step3_3 file (based on family ID) to the membrsid working file to give the number of individuals in a household on January 1, 1994, the following commands are entered at the command window:

Finally, link the membrsid file to the mem94 file (based on permanent ID) to insert the count of the family members:

Step 4: Construct another working file (step3_4) of all individuals who left a household during 1994. The following SQL command will construct the step3_4 working file:

Link the step3_4 working file to the mem94 working file (based on permanent ID) to add the migration out day (if there was one).

Step 5: Call the PDO function to determine the person days of observation of each individual. The PDO function needs three parameters: the permanent ID of the individual for which to compute the PDO, the begin date of observation, and the end date of observation:
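A sketch of the calling loop (num_days stands for whatever the days-present field added in step 1 is actually called; the parameter order and end date are assumptions):

 SELECT mem94
 SCAN
    m.id = mem94.perm_id
    m.days = PDO(m.id, {01/01/94}, {12/31/94})
    SELECT mem94                   && reselect in case the function changed work areas
    REPLACE num_days WITH m.days
 ENDSCAN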

The PDO procedure is fairly slow, as it must consider a number of factors: births, deaths, migrations, and family visit dates. A listing of the PDO function is given in the appendix.

Step 6: Set the censored status variable (whether an individual migrated (1) or not (0)). This variable is used by the Cox regression procedure. Also calculate the age of each member at the beginning of the time interval:
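A sketch of these commands (censored, mig_date, and age stand for the fields added earlier; age is computed in years as of 01/01/94):

 SELECT mem94
 REPLACE ALL censored WITH IIF(EMPTY(mig_date), 0, 1)
 REPLACE ALL age WITH ({01/01/94} - birth_date) / 365.25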

Step 7: We load the mem94 database into SPSS (File, Open, Data, *.dbf...) and then compute the Cox regression with PDO as the time variable, censored as the variable that indicates the end of "monitoring," and age, sex, and family_cnt as covariates. The results from this analysis follow:

[SPSS output: Cox regression with dependent variable PDO; 95.5% of cases are censored. The model enters AGE, FAM_COUNT, and SEX and reports the -2 log likelihood, overall chi-square statistics, the coefficients (B, S.E., Wald, df, Sig, R, Exp(B)) for each covariate, and the covariate means.]

We can also request a graph from the Cox regression procedure (this graph was exported to a .tif file in SPSS and then loaded into Microsoft Word):

Summary

These examples illustrate the process of extracting and merging longitudinal household data. In the time series example, one of the key parts of solving the problem was to develop a function that determines the size of the population at a particular point in time. This function was reused and revised for the next two problems; in particular, we built the functions MigrAway and Pop_time from the censddte function. We expect that some parts of new problems can be dealt with by making small changes to the functions we have included in the appendix. In the second example, we used selection techniques to determine a subset of the children and then used grouping techniques to determine the population of fathers. From this, we were able to analyze the relationship between the survival of children and the migration status of the father. In the last example, we dealt with censored data by using the Cox regression model.

Appendix A: Censddte.prg (Population at a Particular Date)

Appendix B: MigrAway.prg (Migration Away Days)

Appendix C: Pop_time.prg (Population at a Particular Date)

Appendix D: PDO_OPEN.PRG

Appendix E: PDO.PRG (Person Days of Observation)



Further processing of estimation results: Basic programming with matrices

Further processing of estimation results: Basic programming with matrices The Stata Journal (2005) 5, Number 1, pp. 83 91 Further processing of estimation results: Basic programming with matrices Ian Watson ACIRRT, University of Sydney i.watson@econ.usyd.edu.au Abstract. Rather

More information

Data Analysis and Solver Plugins for KSpread USER S MANUAL. Tomasz Maliszewski

Data Analysis and Solver Plugins for KSpread USER S MANUAL. Tomasz Maliszewski Data Analysis and Solver Plugins for KSpread USER S MANUAL Tomasz Maliszewski tmaliszewski@wp.pl Table of Content CHAPTER 1: INTRODUCTION... 3 1.1. ABOUT DATA ANALYSIS PLUGIN... 3 1.3. ABOUT SOLVER PLUGIN...

More information

Analysis of Complex Survey Data with SAS

Analysis of Complex Survey Data with SAS ABSTRACT Analysis of Complex Survey Data with SAS Christine R. Wells, Ph.D., UCLA, Los Angeles, CA The differences between data collected via a complex sampling design and data collected via other methods

More information

Mr. Kongmany Chaleunvong. GFMER - WHO - UNFPA - LAO PDR Training Course in Reproductive Health Research Vientiane, 22 October 2009

Mr. Kongmany Chaleunvong. GFMER - WHO - UNFPA - LAO PDR Training Course in Reproductive Health Research Vientiane, 22 October 2009 Mr. Kongmany Chaleunvong GFMER - WHO - UNFPA - LAO PDR Training Course in Reproductive Health Research Vientiane, 22 October 2009 1 Object of the Course Introduction to SPSS The basics of managing data

More information

Laboratory for Two-Way ANOVA: Interactions

Laboratory for Two-Way ANOVA: Interactions Laboratory for Two-Way ANOVA: Interactions For the last lab, we focused on the basics of the Two-Way ANOVA. That is, you learned how to compute a Brown-Forsythe analysis for a Two-Way ANOVA, as well as

More information

Control Invitation

Control Invitation Online Appendices Appendix A. Invitation Emails Control Invitation Email Subject: Reviewer Invitation from JPubE You are invited to review the above-mentioned manuscript for publication in the. The manuscript's

More information

How to Use a Statistical Package

How to Use a Statistical Package E App-Bachman-45191.qxd 1/31/2007 3:32 PM Page E-1 A P P E N D I X E How to Use a Statistical Package WITH THE ASSISTANCE OF LISA M. GILMAN AND WITH CONTRIBUTIONS BY JOAN SAXTON WEBER Computers and statistical

More information

BIOL 417: Biostatistics Laboratory #3 Tuesday, February 8, 2011 (snow day February 1) INTRODUCTION TO MYSTAT

BIOL 417: Biostatistics Laboratory #3 Tuesday, February 8, 2011 (snow day February 1) INTRODUCTION TO MYSTAT BIOL 417: Biostatistics Laboratory #3 Tuesday, February 8, 2011 (snow day February 1) INTRODUCTION TO MYSTAT Go to the course Blackboard site and download Laboratory 3 MYSTAT Intro.xls open this file in

More information

Ronald H. Heck 1 EDEP 606 (F2015): Multivariate Methods rev. November 16, 2015 The University of Hawai i at Mānoa

Ronald H. Heck 1 EDEP 606 (F2015): Multivariate Methods rev. November 16, 2015 The University of Hawai i at Mānoa Ronald H. Heck 1 In this handout, we will address a number of issues regarding missing data. It is often the case that the weakest point of a study is the quality of the data that can be brought to bear

More information

Research Methods for Business and Management. Session 8a- Analyzing Quantitative Data- using SPSS 16 Andre Samuel

Research Methods for Business and Management. Session 8a- Analyzing Quantitative Data- using SPSS 16 Andre Samuel Research Methods for Business and Management Session 8a- Analyzing Quantitative Data- using SPSS 16 Andre Samuel A Simple Example- Gym Purpose of Questionnaire- to determine the participants involvement

More information

Regression. Dr. G. Bharadwaja Kumar VIT Chennai

Regression. Dr. G. Bharadwaja Kumar VIT Chennai Regression Dr. G. Bharadwaja Kumar VIT Chennai Introduction Statistical models normally specify how one set of variables, called dependent variables, functionally depend on another set of variables, called

More information

Experiment 1 CH Fall 2004 INTRODUCTION TO SPREADSHEETS

Experiment 1 CH Fall 2004 INTRODUCTION TO SPREADSHEETS Experiment 1 CH 222 - Fall 2004 INTRODUCTION TO SPREADSHEETS Introduction Spreadsheets are valuable tools utilized in a variety of fields. They can be used for tasks as simple as adding or subtracting

More information

AcaStat User Manual. Version 10 for Mac and Windows. Copyright 2018, AcaStat Software. All rights Reserved.

AcaStat User Manual. Version 10 for Mac and Windows. Copyright 2018, AcaStat Software. All rights Reserved. AcaStat User Manual Version 10 for Mac and Windows Copyright 2018, AcaStat Software. All rights Reserved. http://www.acastat.com Table of Contents NEW IN VERSION 10... 6 INTRODUCTION... 7 GETTING HELP...

More information

CHAPTER 7 EXAMPLES: MIXTURE MODELING WITH CROSS- SECTIONAL DATA

CHAPTER 7 EXAMPLES: MIXTURE MODELING WITH CROSS- SECTIONAL DATA Examples: Mixture Modeling With Cross-Sectional Data CHAPTER 7 EXAMPLES: MIXTURE MODELING WITH CROSS- SECTIONAL DATA Mixture modeling refers to modeling with categorical latent variables that represent

More information

Multiple Regression White paper

Multiple Regression White paper +44 (0) 333 666 7366 Multiple Regression White paper A tool to determine the impact in analysing the effectiveness of advertising spend. Multiple Regression In order to establish if the advertising mechanisms

More information

An introduction to plotting data

An introduction to plotting data An introduction to plotting data Eric D. Black California Institute of Technology February 25, 2014 1 Introduction Plotting data is one of the essential skills every scientist must have. We use it on a

More information

Splines and penalized regression

Splines and penalized regression Splines and penalized regression November 23 Introduction We are discussing ways to estimate the regression function f, where E(y x) = f(x) One approach is of course to assume that f has a certain shape,

More information

STATS PAD USER MANUAL

STATS PAD USER MANUAL STATS PAD USER MANUAL For Version 2.0 Manual Version 2.0 1 Table of Contents Basic Navigation! 3 Settings! 7 Entering Data! 7 Sharing Data! 8 Managing Files! 10 Running Tests! 11 Interpreting Output! 11

More information

Excel 2007/2010. Don t be afraid of PivotTables. Prepared by: Tina Purtee Information Technology (818)

Excel 2007/2010. Don t be afraid of PivotTables. Prepared by: Tina Purtee Information Technology (818) Information Technology MS Office 2007/10 Users Guide Excel 2007/2010 Don t be afraid of PivotTables Prepared by: Tina Purtee Information Technology (818) 677-2090 tpurtee@csun.edu [ DON T BE AFRAID OF

More information

Demographic and Health Survey. Entry Guidelines DHS 6. ICF Macro Calverton, Maryland. DHS Data Processing Manual

Demographic and Health Survey. Entry Guidelines DHS 6. ICF Macro Calverton, Maryland. DHS Data Processing Manual Demographic and Health Survey Entry Guidelines DHS 6 ICF Macro Calverton, Maryland DHS Data Processing Manual DATA ENTRY GUIDELINES This guide explains the responsibilities of a data entry operator for

More information

SAS (Statistical Analysis Software/System)

SAS (Statistical Analysis Software/System) SAS (Statistical Analysis Software/System) SAS Adv. Analytics or Predictive Modelling:- Class Room: Training Fee & Duration : 30K & 3 Months Online Training Fee & Duration : 33K & 3 Months Learning SAS:

More information

Data analysis using Microsoft Excel

Data analysis using Microsoft Excel Introduction to Statistics Statistics may be defined as the science of collection, organization presentation analysis and interpretation of numerical data from the logical analysis. 1.Collection of Data

More information

Linear and Quadratic Least Squares

Linear and Quadratic Least Squares Linear and Quadratic Least Squares Prepared by Stephanie Quintal, graduate student Dept. of Mathematical Sciences, UMass Lowell in collaboration with Marvin Stick Dept. of Mathematical Sciences, UMass

More information

A Comparison of Modeling Scales in Flexible Parametric Models. Noori Akhtar-Danesh, PhD McMaster University

A Comparison of Modeling Scales in Flexible Parametric Models. Noori Akhtar-Danesh, PhD McMaster University A Comparison of Modeling Scales in Flexible Parametric Models Noori Akhtar-Danesh, PhD McMaster University Hamilton, Canada daneshn@mcmaster.ca Outline Backgroundg A review of splines Flexible parametric

More information

PRI Workshop Introduction to AMOS

PRI Workshop Introduction to AMOS PRI Workshop Introduction to AMOS Krissy Zeiser Pennsylvania State University klz24@pop.psu.edu 2-pm /3/2008 Setting up the Dataset Missing values should be recoded in another program (preferably with

More information

Introduction to Computer Science and Business

Introduction to Computer Science and Business Introduction to Computer Science and Business This is the second portion of the Database Design and Programming with SQL course. In this portion, students implement their database design by creating a

More information

MPhil computer package lesson: getting started with Eviews

MPhil computer package lesson: getting started with Eviews MPhil computer package lesson: getting started with Eviews Ryoko Ito (ri239@cam.ac.uk, itoryoko@gmail.com, www.itoryoko.com ) 1. Creating an Eviews workfile 1.1. Download Wage data.xlsx from my homepage:

More information

Using the Health Indicators database to help students research Canadian health issues

Using the Health Indicators database to help students research Canadian health issues Assignment Using the Health Indicators database to help students research Canadian health issues Joel Yan, Statistics Canada, joel.yan@statcan.ca, 1-800-465-1222 With input from Brenda Wannell, Health

More information

How to Use the Cancer-Rates.Info/NJ

How to Use the Cancer-Rates.Info/NJ How to Use the Cancer-Rates.Info/NJ Web- Based Incidence and Mortality Mapping and Inquiry Tool to Obtain Statewide and County Cancer Statistics for New Jersey Cancer Incidence and Mortality Inquiry System

More information

Using SPSS with The Fundamentals of Political Science Research

Using SPSS with The Fundamentals of Political Science Research Using SPSS with The Fundamentals of Political Science Research Paul M. Kellstedt and Guy D. Whitten Department of Political Science Texas A&M University c Paul M. Kellstedt and Guy D. Whitten 2009 Contents

More information

Introduction to Mplus

Introduction to Mplus Introduction to Mplus May 12, 2010 SPONSORED BY: Research Data Centre Population and Life Course Studies PLCS Interdisciplinary Development Initiative Piotr Wilk piotr.wilk@schulich.uwo.ca OVERVIEW Mplus

More information

SIDM3: Combining and restructuring datasets; creating summary data across repeated measures or across groups

SIDM3: Combining and restructuring datasets; creating summary data across repeated measures or across groups SIDM3: Combining and restructuring datasets; creating summary data across repeated measures or across groups You might find that your data is in a very different structure to that needed for analysis.

More information

LAB 1 INSTRUCTIONS DESCRIBING AND DISPLAYING DATA

LAB 1 INSTRUCTIONS DESCRIBING AND DISPLAYING DATA LAB 1 INSTRUCTIONS DESCRIBING AND DISPLAYING DATA This lab will assist you in learning how to summarize and display categorical and quantitative data in StatCrunch. In particular, you will learn how to

More information

For our example, we will look at the following factors and factor levels.

For our example, we will look at the following factors and factor levels. In order to review the calculations that are used to generate the Analysis of Variance, we will use the statapult example. By adjusting various settings on the statapult, you are able to throw the ball

More information

Lecture 1: Statistical Reasoning 2. Lecture 1. Simple Regression, An Overview, and Simple Linear Regression

Lecture 1: Statistical Reasoning 2. Lecture 1. Simple Regression, An Overview, and Simple Linear Regression Lecture Simple Regression, An Overview, and Simple Linear Regression Learning Objectives In this set of lectures we will develop a framework for simple linear, logistic, and Cox Proportional Hazards Regression

More information

METAPOPULATION DYNAMICS

METAPOPULATION DYNAMICS 16 METAPOPULATION DYNAMICS Objectives Determine how extinction and colonization parameters influence metapopulation dynamics. Determine how the number of patches in a system affects the probability of

More information

How to Use a Statistical Package

How to Use a Statistical Package APPENDIX F How to Use a Statistical Package With the assistance of Lisa M. Gilman and Jeffrey Xavier and with contributions by Joan Saxton Weber Computers and statistical software such as the Statistical

More information

STATISTICAL TECHNIQUES. Interpreting Basic Statistical Values

STATISTICAL TECHNIQUES. Interpreting Basic Statistical Values STATISTICAL TECHNIQUES Interpreting Basic Statistical Values INTERPRETING BASIC STATISTICAL VALUES Sample representative How would one represent the average or typical piece of information from a given

More information

Epidemiological analysis PhD-course in epidemiology

Epidemiological analysis PhD-course in epidemiology Epidemiological analysis PhD-course in epidemiology Lau Caspar Thygesen Associate professor, PhD 9. oktober 2012 Multivariate tables Agenda today Age standardization Missing data 1 2 3 4 Age standardization

More information

Epidemiological analysis PhD-course in epidemiology. Lau Caspar Thygesen Associate professor, PhD 25 th February 2014

Epidemiological analysis PhD-course in epidemiology. Lau Caspar Thygesen Associate professor, PhD 25 th February 2014 Epidemiological analysis PhD-course in epidemiology Lau Caspar Thygesen Associate professor, PhD 25 th February 2014 Age standardization Incidence and prevalence are strongly agedependent Risks rising

More information

Multiple Imputation for Missing Data. Benjamin Cooper, MPH Public Health Data & Training Center Institute for Public Health

Multiple Imputation for Missing Data. Benjamin Cooper, MPH Public Health Data & Training Center Institute for Public Health Multiple Imputation for Missing Data Benjamin Cooper, MPH Public Health Data & Training Center Institute for Public Health Outline Missing data mechanisms What is Multiple Imputation? Software Options

More information

BUSINESS ANALYTICS. 96 HOURS Practical Learning. DexLab Certified. Training Module. Gurgaon (Head Office)

BUSINESS ANALYTICS. 96 HOURS Practical Learning. DexLab Certified. Training Module. Gurgaon (Head Office) SAS (Base & Advanced) Analytics & Predictive Modeling Tableau BI 96 HOURS Practical Learning WEEKDAY & WEEKEND BATCHES CLASSROOM & LIVE ONLINE DexLab Certified BUSINESS ANALYTICS Training Module Gurgaon

More information

Missing Data. SPIDA 2012 Part 6 Mixed Models with R:

Missing Data. SPIDA 2012 Part 6 Mixed Models with R: The best solution to the missing data problem is not to have any. Stef van Buuren, developer of mice SPIDA 2012 Part 6 Mixed Models with R: Missing Data Georges Monette 1 May 2012 Email: georges@yorku.ca

More information

Enterprise Miner Tutorial Notes 2 1

Enterprise Miner Tutorial Notes 2 1 Enterprise Miner Tutorial Notes 2 1 ECT7110 E-Commerce Data Mining Techniques Tutorial 2 How to Join Table in Enterprise Miner e.g. we need to join the following two tables: Join1 Join 2 ID Name Gender

More information

Your Name: Section: INTRODUCTION TO STATISTICAL REASONING Computer Lab #4 Scatterplots and Regression

Your Name: Section: INTRODUCTION TO STATISTICAL REASONING Computer Lab #4 Scatterplots and Regression Your Name: Section: 36-201 INTRODUCTION TO STATISTICAL REASONING Computer Lab #4 Scatterplots and Regression Objectives: 1. To learn how to interpret scatterplots. Specifically you will investigate, using

More information

SPSS QM II. SPSS Manual Quantitative methods II (7.5hp) SHORT INSTRUCTIONS BE CAREFUL

SPSS QM II. SPSS Manual Quantitative methods II (7.5hp) SHORT INSTRUCTIONS BE CAREFUL SPSS QM II SHORT INSTRUCTIONS This presentation contains only relatively short instructions on how to perform some statistical analyses in SPSS. Details around a certain function/analysis method not covered

More information

Sample Exam. Advanced Test Automation - Engineer

Sample Exam. Advanced Test Automation - Engineer Sample Exam Advanced Test Automation - Engineer Questions ASTQB Created - 2018 American Software Testing Qualifications Board Copyright Notice This document may be copied in its entirety, or extracts made,

More information

The basic arrangement of numeric data is called an ARRAY. Array is the derived data from fundamental data Example :- To store marks of 50 student

The basic arrangement of numeric data is called an ARRAY. Array is the derived data from fundamental data Example :- To store marks of 50 student Organizing data Learning Outcome 1. make an array 2. divide the array into class intervals 3. describe the characteristics of a table 4. construct a frequency distribution table 5. constructing a composite

More information

Chapter 18 Outputting Data

Chapter 18 Outputting Data Chapter 18: Outputting Data 231 Chapter 18 Outputting Data The main purpose of most business applications is to collect data and produce information. The most common way of returning the information is

More information

Tips and Guidance for Analyzing Data. Executive Summary

Tips and Guidance for Analyzing Data. Executive Summary Tips and Guidance for Analyzing Data Executive Summary This document has information and suggestions about three things: 1) how to quickly do a preliminary analysis of time-series data; 2) key things to

More information

Using Machine Learning to Optimize Storage Systems

Using Machine Learning to Optimize Storage Systems Using Machine Learning to Optimize Storage Systems Dr. Kiran Gunnam 1 Outline 1. Overview 2. Building Flash Models using Logistic Regression. 3. Storage Object classification 4. Storage Allocation recommendation

More information

Geostatistics 2D GMS 7.0 TUTORIALS. 1 Introduction. 1.1 Contents

Geostatistics 2D GMS 7.0 TUTORIALS. 1 Introduction. 1.1 Contents GMS 7.0 TUTORIALS 1 Introduction Two-dimensional geostatistics (interpolation) can be performed in GMS using the 2D Scatter Point module. The module is used to interpolate from sets of 2D scatter points

More information

Appendix II: STATA Preliminary

Appendix II: STATA Preliminary Appendix II: STATA Preliminary STATA is a statistical software package that offers a large number of statistical and econometric estimation procedures. With STATA we can easily manage data and apply standard

More information

Topology and Topological Spaces

Topology and Topological Spaces Topology and Topological Spaces Mathematical spaces such as vector spaces, normed vector spaces (Banach spaces), and metric spaces are generalizations of ideas that are familiar in R or in R n. For example,

More information

SAP InfiniteInsight 7.0

SAP InfiniteInsight 7.0 End User Documentation Document Version: 1.0-2014-11 SAP InfiniteInsight 7.0 Data Toolkit User Guide CUSTOMER Table of Contents 1 About this Document... 3 2 Common Steps... 4 2.1 Selecting a Data Set...

More information