Fundamental Data Manipulation Techniques

The Analysis of Longitudinal Data

Introduction

This document describes the process of organizing longitudinal data from the HRS for the purposes of statistical analysis. We start by considering three small problems that illustrate the key components of longitudinal data manipulation. In particular, we consider the process of selecting a subset of variables and records from a file, linking records of related files together, and grouping records with common fields. In the next three sections, we use these fundamental data operations as building blocks to solve more complex analysis problems. We start by describing and solving a time series analysis problem: have deaths been decreasing over the past two years? Next, we consider an analysis of a problem involving two distinct types of events: is there a correlation between child survival (a mortality event in the HRS) and the out-migration of the child's father (an out-migration event in the HRS)? Finally, we consider an analysis question that involves a "time-to-event" dependent variable with censored cases (censored cases are those for which the event has not yet occurred): what are the determinants of migration?

This document assumes that the reader is familiar with the basic operations of using FoxPro. If you are not comfortable with FoxPro, the book FoxPro Step-by-Step, published by Microsoft Press, provides a good tutorial. We are grateful for the assistance of the Navrongo Health Research Centre (NHRC), Ghana, in providing a small subset of data to help illustrate these analysis techniques. The data set is small enough to be manageable for instructional purposes and large enough to generate interesting cases in the course of an analysis.

Fundamental Data Manipulation Techniques

The examples in this section illustrate key operations in the manipulation of longitudinal data. In the first example, we select a subset of records and fields from a single file. We then load that file into the SPSS statistical analysis package and perform a cross-tabulation of the data. In the second example, we illustrate how to merge records from two different files. In the last example, we group together collections of records in a single file and compute basic attributes about the groups.

Extracting Subsets of Variables and Records

We consider the problem of gender and mortality in this example. We will extract all the members who have an exit_type of "DTH" (death) and an exit_date greater than December 31, 1992. For these records, we are interested in the variables sex and age. We then would like to do a simple cross-tabulation in SPSS of sex by age group (0-5, 5-18, > 18).

RQBE (Relational Query By Example)

One of the easiest ways to extract subsets of data from a file is to use the RQBE command in FoxPro. This is a utility that helps in the construction of commands to merge and extract data from files. If you are not familiar with the RQBE command, you should review the FoxPro tutorial or the FoxPro Step-by-Step book. Every RQBE generates an "SQL SELECT" command (you can click on See SQL to see it); the RQBE is just an easy-to-use method of constructing SQL commands. In this document, we will usually present the SQL command as a specification of the corresponding RQBE. This will prove useful in later sections, when we find that some SQL commands cannot be specified in the RQBE. The RQBE to extract all members who died later than December 31, 1992 generates an SQL command along the following lines:
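(A sketch: the second field is the age expression typed into the RQBE expression window; computing age as days alive divided by 365.25 is an assumption.)

 SELECT member.sex, (member.exit_date - member.birth_date) / 365.25;
  FROM member;
  WHERE member.exit_type = "DTH";
  AND member.exit_date > {12/31/92};
  INTO TABLE stepf1_1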

Let's consider the parts of this command. The command starts with the SELECT keyword, which could be confused with the FoxPro command to select (or focus on) a file, but because the keyword is followed by variable names, FoxPro is able to recognize this as an "SQL SELECT." The list of variables following the SELECT are the variables and expressions entered in the "fields" dialog box of the RQBE. The second entry is typed in at the expression window. The FROM keyword is followed by a list of the files that will be used; in this case, only the member database file. Following the WHERE keyword is a list of all the conditions for selecting a record; in this case, we are selecting all members who died (exit_type = "DTH") after 12/31/92. When working with multiple files, the WHERE clause can also be used to specify how records in different files are to be matched up. Finally, the INTO keyword indicates where to place the output.

Guidelines for Setting Up Working Files

Before going any further, guidelines are needed for setting up data files for analysis. One of the most important things to keep in mind is that a number of "temporary" working files get created along the way to a statistical analysis. These files have a way of becoming permanent and, in the end, cluttering up disk drives. A clogged disk drive can make almost any system crash, so you want to keep a careful eye on your space utilization. In these examples, we put all our working files in a subdirectory called output. We do not put them in the main DBFS directory holding the HRS data files. So in the above example, the output goes to the current directory; for this document, the current directory will be \analysis (use the SET DEFAULT TO \analysis command to make this your default directory). The member database is located in the \hrs\dbfs directory.

Constructing the correct RQBE (SQL SELECT) is usually an iterative process. While developing the appropriate RQBE, we typically send the results to a browse window and compare our expectations with the results of the command. Eliminating the INTO clause from the above command will send the output to a browse window. Only when we get the right output do we send the results to a database file. If you save the RQBE command, a file is created with an extension of .qpr. We saved the above command in a file called stepf1_1.qpr. We can perform any SQL SELECT command from the command window. For example, our extraction of records from the member file can be performed by typing DO stepf1_1.qpr in the command window.

Note: when using the RQBE to reference files in different directories, sometimes it is unable to find the designated file. For instance, if we ran the above SQL from the command window with no files open, the command would be unable to find the member file; we need to tell it to look in the \hrs\dbfs directory. The command would also place the results (stepf1_1.dbf) in the current directory. You can change the stepf1_1.qpr file (using MODIFY COMMAND in the command window) to reference explicit directories (for example, by writing FROM \hrs\dbfs\member and INTO TABLE \analysis\stepf1_1 in place of the unqualified names).

Modifying the Working File

We have not completed the construction of the working file for statistical analysis. We would like to rename the second field of the database (currently named exp_2 from the SQL command) and add another field to represent the age group. We make these changes by making the stepf1_1 file active (click on it or open it in the view window) and then choosing the Database menu option, followed by the Setup submenu option.
Click on the modify button and change the name of exp_2 to age and add a new variable to represent the age_group (numeric, length 1).
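The new variable can then be filled in from the command window; a sketch (assuming the 0-5, 5-18, and over-18 bands of the planned cross-tabulation, with boundary ages assigned to the younger band):

 SELECT stepf1_1
 REPLACE ALL age_group WITH 1 FOR age <= 5
 REPLACE ALL age_group WITH 2 FOR age > 5 AND age <= 18
 REPLACE ALL age_group WITH 3 FOR age > 18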

Typing commands like these in the command window sets the new variable age_group to appropriate values. This completes the construction of our database file; we are now ready to do the cross-tabulation in SPSS.

Loading the Database File into SPSS for Windows

We describe the process of loading data into SPSS for Windows. Since SPSS supports the loading of FoxPro files (.dbf files), the process of loading our work files is relatively straightforward. Start SPSS, choose the File menu option, followed by the Open suboption, followed by the Data suboption. A dialog box appears; change the type of file we are loading from a *.sav file to a *.dbf file. Now locate and select the working file you have created (in our case stepf1_1.dbf). The database file will be loaded and presented in a spreadsheet-like format. At this point we can choose the cross-tabulation routine by choosing the Statistics menu option, followed by the Summarize suboption and then the Crosstabs option. The results from this analysis follow:

[SPSS output: cross-tabulation of SEX (F, M) by AGE_GROUP with row and column totals, followed by Pearson and likelihood-ratio chi-square statistics; 2 of 6 cells (33.3%) have an expected frequency of less than 5.]

Linking Related Records in Different Files

To illustrate the second basic data manipulation technique we ask: "Have men or women migrated away from the compound more often in 1993 and 1994?" Answering this question requires linking data records in the migration file (to get the migration events) with data records in the member file (to get the gender of the individual). The primary purpose of this example is to illustrate how to merge records from different files. To start, we need to extract a subset of records and fields from the migration file. In particular, we need all the out-migrations (indicated by a type field of "EXT") during 1993 and 1994. The SQL command to extract the permanent ID and the date of migration for this subset of migration events is:
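(A sketch: the file and field names migrate, type, and date are assumptions; check them against the actual migration file.)

 SELECT migrate.perm_id, migrate.date;
  FROM migrate;
  WHERE migrate.type = "EXT";
  AND migrate.date >= {01/01/93};
  AND migrate.date <= {12/31/94};
  INTO TABLE stepf2_1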

Now we need to link this newly created data file (stepf2_1.dbf) with the member file. We could do this in a couple of ways. We will explain both, since different circumstances warrant the use of one technique over the other. The first technique continues the use of the RQBE, but in this case it is used to join two files together and extract a subset of fields. The second technique relates records in the two files with a key expression, and then commands are written in the command window to merge fields from the two files. The RQBE to implement the first technique is:
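A sketch of the SQL that such an RQBE generates:

 SELECT stepf2_1.perm_id, stepf2_1.date, member.sex;
  FROM stepf2_1, member;
  WHERE stepf2_1.perm_id = member.perm_id;
  INTO TABLE stepf2_2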

In this RQBE, we are creating a new record for every matching record between the stepf2_1 file and the member file. Records are matched based on their permanent ID. The results are sent to the stepf2_2 file.

The second technique to link files together uses the SET RELATION command to link records in two files. Both files first need to be opened. Use the view window (choose the Window menu option, followed by the View suboption) to open them if they are not open already. Using the view window, specify the index order for the member file to be based on the permanent ID (click on the member file, followed by Setup) and then make the stepf2_1 database active (click on it). Now, to relate stepf2_1 to member, click on the Relation button, then click on the member file. You will be asked for the field (or expression) in the stepf2_1 file that is to be used to link to the index expression in the member file; choose the perm_id field. The view window is the easiest way to specify the relations between files; alternatively, you could type this in at the command window (make sure both files are open, using the view window):

Now each out-migration of a member (a record in the stepf2_1 file) is linked to background information about that member (a record in the member file). We first create a new field in stepf2_1 to hold the gender of the member who migrated. Then we copy that information from the member record to the stepf2_1 record with the following command:

How do you decide when to use one technique over the other? In general, it is easier to use the RQBE than the SET RELATION, but it is not always possible. The RQBE will generate an output record for any and all matches between records in one file and another. If there isn't a matching record, then no output record will be generated. Now consider a problem in which we want to update a list of members to record the last out-migration before 01/01/1993. In this case, not every member record will be matched to a corresponding migration record (not all members migrate out), so the RQBE will generate a list of only those members who migrated before 01/01/1993. This is not exactly what we want. It is possible, using the SET RELATION command, to identify and update only those records with a matching migration record. We will see how to do this in later sections.

At this point, the data can be loaded into SPSS to get a frequency distribution of the number of records with sex = "M" and the number of records with sex = "F". We choose not to do this because the procedure is fairly straightforward and because there is an easier way to do it in FoxPro with the Group option in the RQBE window. This technique is the subject of our next subsection.

Grouping Records with Common Values

The last of the fundamental data operations that we consider involves the grouping of records that share a common value in a field or a set of fields. This technique is easy to do and quite useful in a number of circumstances. In this section, we determine the number of children each woman has.

The RQBE (or corresponding SQL) has an option that can designate a field or a collection of fields to be grouped. Basic statistics, such as the number of elements, maximum, minimum, or average of a variable, can be computed on the elements of the group. For example, grouping records in the member file by region and counting the number of elements in each group would provide a count of members by region. Or, you could group records in the stepf2_2 file (see the previous subsection) by the field sex to determine the number of males and the number of females who migrated in 1993 and 1994. In this particular example, we want to determine the number of children a woman has. Also, as a way of explaining some of the features of grouping, we determine the birth date of the youngest child. The SQL command to do this is:
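(A sketch: the field name mother_id and the output table name children are assumptions.)

 SELECT member.mother_id, COUNT(*), MAX(member.birth_date);
  FROM member;
  GROUP BY member.mother_id;
  INTO TABLE children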

In this command, the mother ID is used to group records together. The command computes the number of elements that share the same mother ID value and determines the maximum birth date of all members in the group. Now suppose I wanted to compute a frequency distribution of the number of children. Do you know how to compute this? Try it.

In the next three sections, we consider problems that require us to apply a number of the fundamental data manipulation techniques.

Time Series Analysis

Time series analysis refers to any longitudinal study in which time is the unit of analysis and trends or events in time are variables of interest. Time series analysis can be one of three types:

Temporal analysis involves describing a trend over time. For example, has the migration rate increased over time? Has mortality declined with time?

Discontinuity analysis represents a simple extension beyond description, to the interpretation of the impact of some event. Has the trend in mortality changed after immunization was introduced? Did fertility decline after a family planning program was launched? In such cases time series data include some indicator of a disturbance in time.

Time series regression analysis involves interpreting a set of several time series in which the timing of disturbances varies by area, but the processes under observation are otherwise comparable. For example, an immunization program may be introduced in an area in phases. The question that arises is: do areas where immunization is introduced earlier have more precipitous declines than areas where children are immunized later?

From the data management standpoint, all three types of analysis have the same requirements: events or rates must be aggregated and arrayed over time. Time series analysis involves problems in which time is a correlate. For example, has the migration rate increased over time? Or has the number of deaths been decreasing? In this section, we describe the steps to put together the data to answer the question, "Has the number of deaths been decreasing in the time period from the beginning of 1993 until the end of 1994?"

An Overview of the Data Consolidation Steps

We describe two different ways of answering the above question. First, we develop a workfile to analyze the number of deaths by month. The second way involves the calculation of a rate of mortality (the number of deaths normalized by the population). In this case, we have to determine the population at the beginning of each month. Actually, for this dataset, the population does not vary that much from month to month, so the normalization is not necessary, but it is instructive for the purposes of an example.

The data file that we need to construct for the analysis has the following format: a record for each month of 1993 and 1994. Each record consists of variables for:
o Time period (1...24)
o Date at the beginning of the time period (month)
o Number of deaths in this time period
o Population at the beginning of the time period (for the second stage)

The steps needed to construct this file for stage 1 include:
1. Define a working file (time_per) that, by the time we are finished, will be passed to the statistical analysis package. In particular, it will have 24 records (one for each month of 1993 and 1994) and four variables to hold the time period, date, mortality, and population.
2. Construct a working file (mortmont) of 1993 and 1994 mortality events by month.
3. Link these two files (time_per, mortmont) by month/year and add the mortality information in the mortmont file to the time_per file.
4. The time_per file is ready for the statistics package.

In stage two of the analysis, we normalize the mortality events by the size of the population. To do this, we must determine the population at the beginning of each month. When we are finished, we will have defined a function that returns the population on a particular day. That function is made up of the following steps:
5. Construct a working file (mem) that consists of the variables perm_id, entry_date, and exit_date for all records from the member file.
6. Adjust for migrations of members during the period of interest. For example, if we had an out-migration before the date of interest and then an in-migration after the date of interest, then we would not want to include that individual in a count of the population (notice that in this case, the exit type and exit date of the member are currently blank, although at one time, when the member was out-migrated, they were not blank).
7. Let d_date designate the date at which we want to determine the population. Delete all records with an entry_date > d_date or with an exit_date < d_date.
8. Count the number of remaining records.

Detailed Description of the Data Consolidation Steps

Each of the above steps is now described in more detail.

Step 1: Use the CREATE command to define a database file with the following structure:

 Structure for table: c:\analysis\time_per.dbf
 Number of data records: 24
 Date of last update: 04/06/95
 Field  Field Name   Type     Width  Dec  Index
     1  PERIOD       Numeric      2
     2  DATE         Date         8
     3  POPULATION   Numeric      8
     4  MORTALITY    Numeric      4
 ** Total **                     23

We use the APPEND command to add 24 records, each with a date corresponding to the beginning of a month. The first few records of the file follow:

 Record#  PERIOD  DATE      POPULATION  MORTALITY
       1       1  01/01/93
       2       2  02/01/93
       3       3  03/01/93
       4       4  04/01/93

Step 2: Use FoxPro's RQBE to select and group the mortality records. The RQBE results in an SQL statement, which you can type in and run from the command window, type into a file and execute with a DO, or specify interactively with the RQBE. The SQL statement to select the 1993 and 1994 mortality events and group them by month is as follows:
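(A sketch: it assumes deaths are drawn from the member file via exit_type and exit_date; the original may draw on a separate mortality file.)

 SELECT YEAR(member.exit_date), MONTH(member.exit_date), COUNT(*);
  FROM member;
  WHERE member.exit_type = "DTH";
  AND member.exit_date >= {01/01/93};
  AND member.exit_date <= {12/31/94};
  GROUP BY 1, 2;
  INTO TABLE mortmont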

The above SQL statement groups by (1,2), which refers to the first and second expressions in the select statement: the year and the month of the date of death.

Step 3: Link the file in Step 1 with the file in Step 2 with the following statements:

 ** exp_1 refers to the first expression in the RQBE, ...

The expression exp_1*100 + exp_2 is necessary in order to guarantee unique combinations of dates. If we tried to just add exp_1 and exp_2, then the year 1994, month 1 would have the same value as the year 1993, month 2. Once the relation is set, it is easy to replace the mortality field in time_per from mortmont:

Step 4: Load the time_per file into SPSS by starting SPSS and using the File, Open, Data commands. Then change the type of data file from the default (*.sav) to *.dbf. Change to the directory containing the time_per file and choose that file. The results from a simple linear regression of mortality events by time period follow:

[SPSS output: multiple regression with dependent variable MORTALITY and predictor PERIOD, reporting R, R-square, the analysis-of-variance table, and the coefficients for PERIOD and the constant.]

Obviously there is a fairly weak correlation between deaths and time period. We now consider how to normalize the number of mortality events by the size of the population. As mentioned before, this population is not changing that much, so it is not necessary to do this, but the process of determining the population at a particular time period is instructive and potentially useful in other studies.

Step 5: Construct a working file (mem) that consists of the variables perm_id, entry_date, and exit_date for all records from the member file. The SQL command to do this is as follows:

Step 6: Let d_date designate the date at which we want to determine the population. Adjust for migrations of members before d_date. For example, if we had an out-migration before d_date and then an in-migration after d_date, then we would not want to include that individual in a count of the population (notice that in this case, the exit type and exit date of the member are currently blank, although at one time, at d_date, they were not blank). The SQL command to extract all out-migrants before d_date:

The command to get all the in-migrants is similar:

Our goal is to change the exit status of any members who were gone during some period that includes d_date. If that member entered after d_date, then we want to include that member in the population count. First we link the files together (all the linking can be done at the command window or by using the view window):
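A sketch of the linking (the table names outmig and inmig for the two migration extracts are assumptions):

 SELECT outmig
 INDEX ON perm_id TAG perm_id
 SELECT inmig
 INDEX ON perm_id TAG perm_id
 SELECT mem
 SET RELATION TO perm_id INTO outmig
 SET RELATION TO perm_id INTO inmig ADDITIVE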

Now we replace the exit date if we have an earlier one from the migrate file (but remember, we selected only those less than d_date). If an in-migration comes after the out-migration, but before d_date, then we want to update the exit_date:

Step 7: Delete all records with an entry_date > d_date or with an exit_date < d_date.

Step 8: Count the number of remaining records. The variable m.pop_ddate contains the total population at time d_date.

It would get very tedious to run through steps 5 through 8 for all 24 time periods. There is a better way. Steps 5 through 8 can be collected into a file (censddte.prg). With a few changes, we can set the file up so that it can be treated as a function. In particular, we add a PARAMETERS statement at the top and a RETURN statement at the bottom. The function is listed in Appendix A at the end of this document. To call the function, we need to set a path to the directory that contains our source code; this tells FoxPro to look in this directory to find the source files (in our case, the directory is c:\hrs\prog). Use the Files option of the view window to set the path. The file named time_per should already be open and should not be linked to any files:
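A sketch of the calling loop (it assumes censddte takes the date as its one parameter and returns the population count for that day):

 SELECT time_per
 SCAN
    m.d = time_per.date            && beginning-of-month date for this record
    m.pop = censddte(m.d)          && assumed: returns the population on day m.d
    SELECT time_per                && reselect in case the function changed work areas
    REPLACE population WITH m.pop
 ENDSCAN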

This code will load the calculated population into the field population of the time_per file. It is now ready to load into SPSS (we can calculate the value mortality/population in SPSS).

Concomitant Event Analysis

Concomitant event analysis looks for a correlation between two different types of events. For example, we can ask whether there is a correlation between child survival (a mortality event in the NDSS) and the out-migration of the child's father (an out-migration event in the NDSS).

An Overview of the Data Consolidation Steps

In this section, we describe how to put together the data to answer two different forms of the above question. First, we look to establish a correlation between the survival of the child in 1994 and whether the father was absent at any time during 1994. Second, we extend the analysis to look for a correlation between child survival and the number of days the father was absent. While the second problem is not a concomitant event analysis problem, we include it because it illustrates some data extraction techniques and it is not that different from the first problem.

The data file that we need to construct for the first stage of the analysis has the following format: a record for each child born between 12/31/1988 and 01/01/1995. Each record consists of variables for:
o Child ID
o Survival status of the child (1 = child died in 1994, 0 otherwise)
o Father ID
o Migration status of the father (1 = father migrated in 1994, 0 otherwise)

The steps needed to construct this file for stage 1 include:
1. Construct a working file (child94) from the member file which consists of all children born between 12/31/1988 and 01/01/1995. Select the fields perm_id, exit_type, exit_date, and father_id. Add fields for the migration status of the father (0,1) and the survival status of the child (0,1).
2. Use the child94 file to form another working file (father94) of all the fathers of these children: group by the father_id in child94 to form this file.
3. Call a function (which we provide and which is listed in the appendix with comments) for each father to determine whether the father had out-migrated at any time in 1994 (including if he left before 1994).
4. Link the father94 file to child94 to insert the migration status of the father. Set the child survival field for each child based on the exit status of the child.
5. Use the child94 file in a statistical analysis program to compute a cross-tabulation of child survival and father out-migration status.

The stage 2 analysis changes step 3 in the above list to determine the number of days of migration in the 1994 time interval: instead of calling a function to determine whether a father migrated, we call another function to determine how long he was away from home. Afterward, link the data in a manner similar to step 4. Use the child94 file in a statistical analysis program to compute a logistic regression of child survival on the number of out-migration days.

Detailed Description of the Data Consolidation Steps

We now expand on the outline of data consolidation steps.

Step 1: We construct a working file (child94) from the member file, which consists of all children born between 12/31/1988 and 01/01/1995. Select the fields perm_id, exit_type, exit_date, and father_id. This is done with the following RQBE (because of the complicated and/or conditions, it is easier to type this command into a file and run it):
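A sketch of the core of this command (the actual selection has more elaborate and/or conditions):

 SELECT member.perm_id, member.exit_type, member.exit_date, member.father_id;
  FROM member;
  WHERE member.birth_date > {12/31/88};
  AND member.birth_date < {01/01/95};
  INTO TABLE child94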

We add fields for the migration status of the father (0,1) and the survival status of the child (0,1) using the database Setup/Modify options (from the FoxPro menus) or the MODIFY STRUCTURE command (typed in at the command window).

Step 2: Use the child94 file to form another working file (father94) of all the fathers of these children; group by the father_id in child94 to form this file. This is done with the following RQBE. We add a field for the migration status of the father (0,1) using the database Setup/Modify options (from the FoxPro menus) or the MODIFY STRUCTURE command (typed in at the command window).

Step 3: Call a function (which we provide and which is listed in the appendix) for each father to determine whether the father out-migrated at any time in 1994 (including if he left before 1994). The name of the function is MigrAway. We can replace the mig_stat field of the father94 file with the following command:
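(A sketch: the parameter order of MigrAway is an assumption; here it is taken to return 1 if the father was away at some point in 1994 and 0 otherwise.)

 SELECT father94
 REPLACE ALL mig_stat WITH MigrAway(father_id, {01/01/94}, {01/01/95})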

The MigrAway function determines the latest out-migration date that is less than (in this case) {01/01/95}, and it also determines the latest in-migration date that is less than (in this case) {01/01/94}. If the latest in-migration event has a date greater than the latest out-migration, you can conclude that the member was present during 1994; otherwise, he was not.

Step 4: Link the father94 file to child94 to insert the migration status of the father. Set the child survival field for each child based on the exit status of the child. We type the following commands in the command window (the default directory should be \analysis):
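A sketch of these commands (the index tag name is arbitrary; death_stat and father_mig are the fields added in step 1):

 SELECT father94
 INDEX ON father_id TAG father_id
 SELECT child94
 SET RELATION TO father_id INTO father94
 REPLACE ALL father_mig WITH father94.mig_stat
 REPLACE ALL death_stat WITH 1 FOR exit_type = "DTH" AND YEAR(exit_date) = 1994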

Step 5: We load the child94 database into SPSS (File, Open, Data, *.dbf...) and then compute a cross-tabulation of the death_stat variable with the father_mig variable. For our particular subset of data, we find no correlation between child survival events and the father's migration events.

For the purpose of exposition (and not because the data warrant further investigation), we extend the analysis to look for a correlation between child survival and the number of days the father was absent. To do this we change step 3 in the above description to:

Step 6: Call a function (which we provide and which is listed in the appendix) for each father to determine how many days the father was out-migrated at any time in 1994 (including if he left before 1994). The name of the function is MigrAway. We can replace the migr_stat field of the father94 file with the following command:
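(A sketch with the same assumed parameter order as before; here the function returns a number of days rather than a 0/1 status.)

 SELECT father94
 REPLACE ALL migr_stat WITH MigrAway(father_id, {01/01/94}, {01/01/95})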

14 "left censoring." "Left censoring" is associated with special statistical problems that we will ignore for the moment. Let us consider the case of "right censored" data, or "censored" data for short. Censoring represents a potential bias because observation durations are terminated artificially by events that have nothing to do with the attrition process that is under study. A person entering a survival study may be alive at the end of the study. Such a person has been observed until the end of the study, and should be included in the analysis, but we know nothing about the person s survival after the study. Excluding such cases would remove individuals with long survival histories, biasing results. Treating such cases as deaths would obviously bias results by spuriously elevating risks. Every censored study has examples of potential biases that could arise from the censoring process. The conventional method for analyzing censored data is termed the "life table." This procedure makes the simple assumption that censoring is independent of the attrition process. Whatever observation arises from such cases is used in calculating populations at risk until the time of censoring. The survival process is analyzed in small discrete time periods, with simple assumptions made about the temporal distribution of risk within discrete intervals. Survival probabilities are calculated for each interval, so that cases that are censored at some point of time can be used in the denominators of rates for time segments prior to the point of censoring. Discrete probabilities computed in this fashion can then be accumulated multiplicatively to show the implication of a series of probabilities for the overall survival process. The conventional term for the discrete survival probability is "q(x)," where x denotes some point in time. Tables of "q" and corresponding cumulative survival probabilities represent a life table. The problem with life tables is that they are tables aggregate data on processes that have underlying covariates. The logic of regression analysis is extremely useful in explaining the covariates of some process, but tabular data are aggregated, and statistical procedures based on likelihood estimation and least squares regression require individual level observations. In a classic paper written by D.R. Cox in 1972, the notion of "Hazard Regression" was proposed. Cox noted that attrition can be defined by "hazard functions." In the Cox approach, all attrition processes are captured by one of three types of relationships between "q" and time, and the role of covariates in increasing or decreasing the attrition process. Cox provides us with procedures for combining the statistical tools of regression with the advantages of life tables for dealing with censoring. To understand the concept of "hazard modeling" or Cox regression, it is helpful to review the concept of attrition that Cox has utilized in his approach."hazard functions" are simple equations for representing the relationship of q with time. As time progresses, the underlying "hazard" can remain constant, which is rare, or it can change. A constant "hazard" is the simplest representation of survival data: No hazard function is needed in a regression, because the intercept at time x=0 defines the entire attrition process. In cumulative terms, this constant defines an exponential rate of attrition. 
Note that the units of analysis are the discrete time points for each individual. The role of z is to shift the level of the attrition process. The level does not change with time, however. The logit is the appropriate function for discrete data (0 = surviving; 1 = dying). When the pace of attrition changes with time, Cox regression simply substitutes an equation for the intercept:
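(Again a reconstruction, with a(x) an intercept that varies with time:)

\[ \operatorname{logit} q(z, x) = a(x) + b z \]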

Depending upon the data and the problem, a(x) can assume any form whatsoever: increasing, decreasing, or remaining constant over time. This is a "proportional hazard" model. The coefficient b defines the extent to which the covariate z elevates or reduces attrition, relative to the function a(x). The conditional hazard q(z,x) never crosses the underlying hazard, because the effect of z is independent of time. The most complicated attrition process is one in which the effect of z covaries with time. A logit model for this is:
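(A reconstruction consistent with the description below, where the z-by-time interaction carries the coefficient c:)

\[ \operatorname{logit} q(z, x) = a(x) + b z + c \, (z \times x) \]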

In this process, b defines a "main effect" for the role of z, and c defines the extent to which this effect is modified as time progresses.

From a data management standpoint, these three models have different requirements. In one case, it is sufficient to fit a model to tabular data, since adding the time dimension contributes no information. For the second model, data must be arrayed by individual, with time allowing for the estimation of the hazard. In the third case, the data must register values of z for each discrete time point in the period of observation.

The models shown have important data management implications. First, data prepared for "Cox modeling" should be informed by the attrition process that is being studied. Data prepared for the first model may not have sufficient detail for estimating model 2; data arrayed for model 2, in turn, may be inadequate for estimating model 3. Second, a model 3 data set can always be used for the more parsimonious attrition processes. That is, if a regression model is estimated for model 3, and the term c is found to be insignificant, then a model 2 regression is sufficient and feasible to estimate with the data set used for the more complex specification. Third, it is important to have data management procedures that record events for the attrition process, censoring (so that observations can be defined), and covariates of interest. How the data matrix is designed depends upon which type of regression is employed.

An Overview of the Data Consolidation Steps

Preparing the data for an analysis of the question "What are the determinants of migration?" requires a number of steps. This section identifies these steps, and later sections cover the steps in more detail. First, we need to frame the analysis in more concrete terms. This includes identifying the independent variables that may influence migration and specifying the time period over which we will look at migration. For this analysis, the variables age, gender, and size of the household (compound) at the beginning of 1994 are tested as possible determinants. We will only consider migrations in 1994.

The data file that we need to construct for the analysis has the following format: a record for each member who was present during some portion of 1994. Each record consists of variables for:
o Age
o Gender
o Family size
o Migration out date
o Number of days present during 1994
o Censored status: did a migration occur? (yes, no)

The specific steps needed to construct this file include:
1. Construct a working file (mem94) that consists of individuals who were present at some time during 1994. Select the perm_id, region, family_num, member_num, birth_date, and sex fields. Add fields to this database file for the family size, the migration out date, the number of days present in 1994, and the censored status.
2. Construct another working file (pop_time) using the procedure pop_time. This procedure constructs a file (pop_time) that gives the population (and their current household) at a time specified by the user. In this case we are interested in the population on the 1st of January 1994.
3. Group the pop_time workfile by the current household ID to produce a working file (step3_3) that contains the number of members present in each household on the 1st of January 1994. After constructing the step3_3 working file, link this file (based on family ID) to the pop_time working file to give the number of individuals in a household on January 1, 1994. Finally, link the pop_time file to the mem94 file (based on permanent ID) to insert the count of the family members.
4. Construct another working file (step3_4) of all individuals who left a household during 1994. Link the step3_4 working file to the mem94 working file (based on permanent ID) to add the migration out day (if there was one).
5. Call the PDO function to determine the person days of observation of each individual. Store the result of this calculation in the appropriate record of the mem94 file.
6. Set the censored status variable (whether an individual migrated (1) or not (0)). This variable is used by the Cox regression procedure. Also calculate the age of each member at the beginning of the time interval.
7. Load the mem94 working file into SPSS and call the Cox regression procedure.

Detailed Description of the Data Consolidation Steps

We now expand on the outline of data consolidation steps.

Step 1: Construct a working file (mem94) that consists of individuals who were present at some time during 1994. Select the perm_id, region, family_num, member_num, birth_date, and sex fields. We can construct this file with the following SQL statement:

Now we need to add fields to this database file for the family size, migration out date, number of days present in 1994, and the censored status. We add fields using the database Setup/Modify options (from the FoxPro menus) or the MODIFY STRUCTURE command (typed in at the command window).

Step 2: Construct another working file (pop_time) using the procedure pop_time. This procedure constructs a file that gives the population (and their current household) at a time specified by the user. In this case we are interested in the population on the 1st of January 1994. In the command window we can type the following command to call this procedure:
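Perhaps something like (assuming pop_time takes the date of interest as its parameter):

 DO pop_time WITH {01/01/94}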

The pop_time procedure constructs a database file called membrsid, which represents the list of all members present in the study on the 1st of January 1994; for each member, the family (compound) is given in the region and family_num fields.

Step 3: Group the membrsid workfile by the current household ID to produce a working file (step3_3) that contains the number of members present in each household on the 1st of January 1994:

 SELECT membrsid.region, membrsid.family_num, COUNT(*);
  FROM membrsid;
  GROUP BY membrsid.region, membrsid.family_num;
  INTO TABLE step3_3.dbf

After constructing the step3_3 working file, we need to add the count of family members to mem94 in two steps: first from step3_3 to membrsid, and then from membrsid to mem94. We do it this way because the current household of the member may be different at the end of 1994 (mem94) than at the beginning of 1994 (membrsid). We need to add a field fam_count to the membrsid working file. Then, to link the step3_3 file (based on family ID) to the membrsid working file to give the number of individuals in a household on January 1, 1994, the following commands are entered at the command window:

Finally, link the membrsid file to the mem94 file (based on permanent ID) to insert the count of the family members:

Step 4: Construct another working file (step3_4) of all individuals who left a household during 1994. The following SQL command will construct the step3_4 working file:

Link the step3_4 working file to the mem94 working file (based on permanent ID) to add the migration out day (if there was one).

Step 5: Call the PDO function to determine the person days of observation of each individual. The PDO function needs three parameters: the permanent ID of the individual for which to compute the PDO, the begin date of observation, and the end date of observation:
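A sketch of the calling loop (num_days stands for whatever the days-present field added in step 1 is actually called; the parameter order and end date are assumptions):

 SELECT mem94
 SCAN
    m.id = mem94.perm_id
    m.days = PDO(m.id, {01/01/94}, {12/31/94})
    SELECT mem94                   && reselect in case the function changed work areas
    REPLACE num_days WITH m.days
 ENDSCAN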

The PDO procedure is fairly slow, as it must consider a number of factors: births, deaths, migrations, and family visit dates. A listing of the PDO function is given in the appendix.

Step 6: Set the censored status variable (whether an individual migrated (1) or not (0)). This variable is used by the Cox regression procedure. Also calculate the age of each member at the beginning of the time interval:
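A sketch of these commands (censored, mig_date, and age stand for the fields added earlier; age is computed in years as of 01/01/94):

 SELECT mem94
 REPLACE ALL censored WITH IIF(EMPTY(mig_date), 0, 1)
 REPLACE ALL age WITH ({01/01/94} - birth_date) / 365.25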

Step 7: We load the mem94 database into SPSS (File, Open, Data, *.dbf...) and then compute the Cox regression with PDO as the time variable, censored as the variable that indicates the end of "monitoring," and age, sex, and family_cnt as covariates. The results from this analysis follow:

[SPSS output: Cox regression with dependent variable PDO; 95.5% of cases are censored. The model enters AGE, FAM_COUNT, and SEX and reports the -2 log likelihood, overall chi-square statistics, the coefficients (B, S.E., Wald, df, Sig, R, Exp(B)) for each covariate, and the covariate means.]

We can also request a graph from the Cox regression procedure (this graph was exported to a .tif file in SPSS and then loaded into Microsoft Word):

Summary

These examples illustrate the process of extracting and merging longitudinal household data. In the time series example, one of the key parts of solving the problem was to develop a function that determines the size of the population at a particular point in time. This function was reused and revised for the next two problems; in particular, we built the functions MigrAway and Pop_time from the censddte function. We expect that some parts of new problems can be dealt with by making small changes to the functions we have included in the appendix. In the second example, we used selection techniques to determine a subset of the children and then used grouping techniques to determine the population of fathers. From this, we were able to analyze the relationship between the survival of children and the migration status of the father. In the last example, we dealt with censored data by using the Cox regression model.

Appendix A: Censddte.prg (Population at a Particular Date)

Appendix B: MigrAway.prg (Migration Away Days)

Appendix C: Pop_time.prg (Population at a Particular Date)

Appendix D: PDO_OPEN.PRG

Appendix E: PDO.PRG (Person Days of Observation)



Further processing of estimation results: Basic programming with matrices

Further processing of estimation results: Basic programming with matrices The Stata Journal (2005) 5, Number 1, pp. 83 91 Further processing of estimation results: Basic programming with matrices Ian Watson ACIRRT, University of Sydney i.watson@econ.usyd.edu.au Abstract. Rather

More information

Data Analysis and Solver Plugins for KSpread USER S MANUAL. Tomasz Maliszewski

Data Analysis and Solver Plugins for KSpread USER S MANUAL. Tomasz Maliszewski Data Analysis and Solver Plugins for KSpread USER S MANUAL Tomasz Maliszewski tmaliszewski@wp.pl Table of Content CHAPTER 1: INTRODUCTION... 3 1.1. ABOUT DATA ANALYSIS PLUGIN... 3 1.3. ABOUT SOLVER PLUGIN...

More information

Analysis of Complex Survey Data with SAS

Analysis of Complex Survey Data with SAS ABSTRACT Analysis of Complex Survey Data with SAS Christine R. Wells, Ph.D., UCLA, Los Angeles, CA The differences between data collected via a complex sampling design and data collected via other methods

More information

Mr. Kongmany Chaleunvong. GFMER - WHO - UNFPA - LAO PDR Training Course in Reproductive Health Research Vientiane, 22 October 2009

Mr. Kongmany Chaleunvong. GFMER - WHO - UNFPA - LAO PDR Training Course in Reproductive Health Research Vientiane, 22 October 2009 Mr. Kongmany Chaleunvong GFMER - WHO - UNFPA - LAO PDR Training Course in Reproductive Health Research Vientiane, 22 October 2009 1 Object of the Course Introduction to SPSS The basics of managing data

More information

Laboratory for Two-Way ANOVA: Interactions

Laboratory for Two-Way ANOVA: Interactions Laboratory for Two-Way ANOVA: Interactions For the last lab, we focused on the basics of the Two-Way ANOVA. That is, you learned how to compute a Brown-Forsythe analysis for a Two-Way ANOVA, as well as

More information

Control Invitation

Control Invitation Online Appendices Appendix A. Invitation Emails Control Invitation Email Subject: Reviewer Invitation from JPubE You are invited to review the above-mentioned manuscript for publication in the. The manuscript's

More information

How to Use a Statistical Package

How to Use a Statistical Package E App-Bachman-45191.qxd 1/31/2007 3:32 PM Page E-1 A P P E N D I X E How to Use a Statistical Package WITH THE ASSISTANCE OF LISA M. GILMAN AND WITH CONTRIBUTIONS BY JOAN SAXTON WEBER Computers and statistical

More information

BIOL 417: Biostatistics Laboratory #3 Tuesday, February 8, 2011 (snow day February 1) INTRODUCTION TO MYSTAT

BIOL 417: Biostatistics Laboratory #3 Tuesday, February 8, 2011 (snow day February 1) INTRODUCTION TO MYSTAT BIOL 417: Biostatistics Laboratory #3 Tuesday, February 8, 2011 (snow day February 1) INTRODUCTION TO MYSTAT Go to the course Blackboard site and download Laboratory 3 MYSTAT Intro.xls open this file in

More information

Ronald H. Heck 1 EDEP 606 (F2015): Multivariate Methods rev. November 16, 2015 The University of Hawai i at Mānoa

Ronald H. Heck 1 EDEP 606 (F2015): Multivariate Methods rev. November 16, 2015 The University of Hawai i at Mānoa Ronald H. Heck 1 In this handout, we will address a number of issues regarding missing data. It is often the case that the weakest point of a study is the quality of the data that can be brought to bear

More information

Research Methods for Business and Management. Session 8a- Analyzing Quantitative Data- using SPSS 16 Andre Samuel

Research Methods for Business and Management. Session 8a- Analyzing Quantitative Data- using SPSS 16 Andre Samuel Research Methods for Business and Management Session 8a- Analyzing Quantitative Data- using SPSS 16 Andre Samuel A Simple Example- Gym Purpose of Questionnaire- to determine the participants involvement

More information

Regression. Dr. G. Bharadwaja Kumar VIT Chennai

Regression. Dr. G. Bharadwaja Kumar VIT Chennai Regression Dr. G. Bharadwaja Kumar VIT Chennai Introduction Statistical models normally specify how one set of variables, called dependent variables, functionally depend on another set of variables, called

More information

Experiment 1 CH Fall 2004 INTRODUCTION TO SPREADSHEETS

Experiment 1 CH Fall 2004 INTRODUCTION TO SPREADSHEETS Experiment 1 CH 222 - Fall 2004 INTRODUCTION TO SPREADSHEETS Introduction Spreadsheets are valuable tools utilized in a variety of fields. They can be used for tasks as simple as adding or subtracting

More information

AcaStat User Manual. Version 10 for Mac and Windows. Copyright 2018, AcaStat Software. All rights Reserved.

AcaStat User Manual. Version 10 for Mac and Windows. Copyright 2018, AcaStat Software. All rights Reserved. AcaStat User Manual Version 10 for Mac and Windows Copyright 2018, AcaStat Software. All rights Reserved. http://www.acastat.com Table of Contents NEW IN VERSION 10... 6 INTRODUCTION... 7 GETTING HELP...

More information

CHAPTER 7 EXAMPLES: MIXTURE MODELING WITH CROSS- SECTIONAL DATA

CHAPTER 7 EXAMPLES: MIXTURE MODELING WITH CROSS- SECTIONAL DATA Examples: Mixture Modeling With Cross-Sectional Data CHAPTER 7 EXAMPLES: MIXTURE MODELING WITH CROSS- SECTIONAL DATA Mixture modeling refers to modeling with categorical latent variables that represent

More information

Multiple Regression White paper

Multiple Regression White paper +44 (0) 333 666 7366 Multiple Regression White paper A tool to determine the impact in analysing the effectiveness of advertising spend. Multiple Regression In order to establish if the advertising mechanisms

More information

An introduction to plotting data

An introduction to plotting data An introduction to plotting data Eric D. Black California Institute of Technology February 25, 2014 1 Introduction Plotting data is one of the essential skills every scientist must have. We use it on a

More information

Splines and penalized regression

Splines and penalized regression Splines and penalized regression November 23 Introduction We are discussing ways to estimate the regression function f, where E(y x) = f(x) One approach is of course to assume that f has a certain shape,

More information

STATS PAD USER MANUAL

STATS PAD USER MANUAL STATS PAD USER MANUAL For Version 2.0 Manual Version 2.0 1 Table of Contents Basic Navigation! 3 Settings! 7 Entering Data! 7 Sharing Data! 8 Managing Files! 10 Running Tests! 11 Interpreting Output! 11

More information

Excel 2007/2010. Don t be afraid of PivotTables. Prepared by: Tina Purtee Information Technology (818)

Excel 2007/2010. Don t be afraid of PivotTables. Prepared by: Tina Purtee Information Technology (818) Information Technology MS Office 2007/10 Users Guide Excel 2007/2010 Don t be afraid of PivotTables Prepared by: Tina Purtee Information Technology (818) 677-2090 tpurtee@csun.edu [ DON T BE AFRAID OF

More information

Demographic and Health Survey. Entry Guidelines DHS 6. ICF Macro Calverton, Maryland. DHS Data Processing Manual

Demographic and Health Survey. Entry Guidelines DHS 6. ICF Macro Calverton, Maryland. DHS Data Processing Manual Demographic and Health Survey Entry Guidelines DHS 6 ICF Macro Calverton, Maryland DHS Data Processing Manual DATA ENTRY GUIDELINES This guide explains the responsibilities of a data entry operator for

More information

SAS (Statistical Analysis Software/System)

SAS (Statistical Analysis Software/System) SAS (Statistical Analysis Software/System) SAS Adv. Analytics or Predictive Modelling:- Class Room: Training Fee & Duration : 30K & 3 Months Online Training Fee & Duration : 33K & 3 Months Learning SAS:

More information

Data analysis using Microsoft Excel

Data analysis using Microsoft Excel Introduction to Statistics Statistics may be defined as the science of collection, organization presentation analysis and interpretation of numerical data from the logical analysis. 1.Collection of Data

More information

Linear and Quadratic Least Squares

Linear and Quadratic Least Squares Linear and Quadratic Least Squares Prepared by Stephanie Quintal, graduate student Dept. of Mathematical Sciences, UMass Lowell in collaboration with Marvin Stick Dept. of Mathematical Sciences, UMass

More information

A Comparison of Modeling Scales in Flexible Parametric Models. Noori Akhtar-Danesh, PhD McMaster University

A Comparison of Modeling Scales in Flexible Parametric Models. Noori Akhtar-Danesh, PhD McMaster University A Comparison of Modeling Scales in Flexible Parametric Models Noori Akhtar-Danesh, PhD McMaster University Hamilton, Canada daneshn@mcmaster.ca Outline Backgroundg A review of splines Flexible parametric

More information

PRI Workshop Introduction to AMOS

PRI Workshop Introduction to AMOS PRI Workshop Introduction to AMOS Krissy Zeiser Pennsylvania State University klz24@pop.psu.edu 2-pm /3/2008 Setting up the Dataset Missing values should be recoded in another program (preferably with

More information

Introduction to Computer Science and Business

Introduction to Computer Science and Business Introduction to Computer Science and Business This is the second portion of the Database Design and Programming with SQL course. In this portion, students implement their database design by creating a

More information

MPhil computer package lesson: getting started with Eviews

MPhil computer package lesson: getting started with Eviews MPhil computer package lesson: getting started with Eviews Ryoko Ito (ri239@cam.ac.uk, itoryoko@gmail.com, www.itoryoko.com ) 1. Creating an Eviews workfile 1.1. Download Wage data.xlsx from my homepage:

More information

Using the Health Indicators database to help students research Canadian health issues

Using the Health Indicators database to help students research Canadian health issues Assignment Using the Health Indicators database to help students research Canadian health issues Joel Yan, Statistics Canada, joel.yan@statcan.ca, 1-800-465-1222 With input from Brenda Wannell, Health

More information

How to Use the Cancer-Rates.Info/NJ

How to Use the Cancer-Rates.Info/NJ How to Use the Cancer-Rates.Info/NJ Web- Based Incidence and Mortality Mapping and Inquiry Tool to Obtain Statewide and County Cancer Statistics for New Jersey Cancer Incidence and Mortality Inquiry System

More information

Using SPSS with The Fundamentals of Political Science Research

Using SPSS with The Fundamentals of Political Science Research Using SPSS with The Fundamentals of Political Science Research Paul M. Kellstedt and Guy D. Whitten Department of Political Science Texas A&M University c Paul M. Kellstedt and Guy D. Whitten 2009 Contents

More information

Introduction to Mplus

Introduction to Mplus Introduction to Mplus May 12, 2010 SPONSORED BY: Research Data Centre Population and Life Course Studies PLCS Interdisciplinary Development Initiative Piotr Wilk piotr.wilk@schulich.uwo.ca OVERVIEW Mplus

More information

SIDM3: Combining and restructuring datasets; creating summary data across repeated measures or across groups

SIDM3: Combining and restructuring datasets; creating summary data across repeated measures or across groups SIDM3: Combining and restructuring datasets; creating summary data across repeated measures or across groups You might find that your data is in a very different structure to that needed for analysis.

More information

LAB 1 INSTRUCTIONS DESCRIBING AND DISPLAYING DATA

LAB 1 INSTRUCTIONS DESCRIBING AND DISPLAYING DATA LAB 1 INSTRUCTIONS DESCRIBING AND DISPLAYING DATA This lab will assist you in learning how to summarize and display categorical and quantitative data in StatCrunch. In particular, you will learn how to

More information

For our example, we will look at the following factors and factor levels.

For our example, we will look at the following factors and factor levels. In order to review the calculations that are used to generate the Analysis of Variance, we will use the statapult example. By adjusting various settings on the statapult, you are able to throw the ball

More information

Lecture 1: Statistical Reasoning 2. Lecture 1. Simple Regression, An Overview, and Simple Linear Regression

Lecture 1: Statistical Reasoning 2. Lecture 1. Simple Regression, An Overview, and Simple Linear Regression Lecture Simple Regression, An Overview, and Simple Linear Regression Learning Objectives In this set of lectures we will develop a framework for simple linear, logistic, and Cox Proportional Hazards Regression

More information

METAPOPULATION DYNAMICS

METAPOPULATION DYNAMICS 16 METAPOPULATION DYNAMICS Objectives Determine how extinction and colonization parameters influence metapopulation dynamics. Determine how the number of patches in a system affects the probability of

More information

How to Use a Statistical Package

How to Use a Statistical Package APPENDIX F How to Use a Statistical Package With the assistance of Lisa M. Gilman and Jeffrey Xavier and with contributions by Joan Saxton Weber Computers and statistical software such as the Statistical

More information

STATISTICAL TECHNIQUES. Interpreting Basic Statistical Values

STATISTICAL TECHNIQUES. Interpreting Basic Statistical Values STATISTICAL TECHNIQUES Interpreting Basic Statistical Values INTERPRETING BASIC STATISTICAL VALUES Sample representative How would one represent the average or typical piece of information from a given

More information

Epidemiological analysis PhD-course in epidemiology

Epidemiological analysis PhD-course in epidemiology Epidemiological analysis PhD-course in epidemiology Lau Caspar Thygesen Associate professor, PhD 9. oktober 2012 Multivariate tables Agenda today Age standardization Missing data 1 2 3 4 Age standardization

More information

Epidemiological analysis PhD-course in epidemiology. Lau Caspar Thygesen Associate professor, PhD 25 th February 2014

Epidemiological analysis PhD-course in epidemiology. Lau Caspar Thygesen Associate professor, PhD 25 th February 2014 Epidemiological analysis PhD-course in epidemiology Lau Caspar Thygesen Associate professor, PhD 25 th February 2014 Age standardization Incidence and prevalence are strongly agedependent Risks rising

More information

Multiple Imputation for Missing Data. Benjamin Cooper, MPH Public Health Data & Training Center Institute for Public Health

Multiple Imputation for Missing Data. Benjamin Cooper, MPH Public Health Data & Training Center Institute for Public Health Multiple Imputation for Missing Data Benjamin Cooper, MPH Public Health Data & Training Center Institute for Public Health Outline Missing data mechanisms What is Multiple Imputation? Software Options

More information

BUSINESS ANALYTICS. 96 HOURS Practical Learning. DexLab Certified. Training Module. Gurgaon (Head Office)

BUSINESS ANALYTICS. 96 HOURS Practical Learning. DexLab Certified. Training Module. Gurgaon (Head Office) SAS (Base & Advanced) Analytics & Predictive Modeling Tableau BI 96 HOURS Practical Learning WEEKDAY & WEEKEND BATCHES CLASSROOM & LIVE ONLINE DexLab Certified BUSINESS ANALYTICS Training Module Gurgaon

More information

Missing Data. SPIDA 2012 Part 6 Mixed Models with R:

Missing Data. SPIDA 2012 Part 6 Mixed Models with R: The best solution to the missing data problem is not to have any. Stef van Buuren, developer of mice SPIDA 2012 Part 6 Mixed Models with R: Missing Data Georges Monette 1 May 2012 Email: georges@yorku.ca

More information

Enterprise Miner Tutorial Notes 2 1

Enterprise Miner Tutorial Notes 2 1 Enterprise Miner Tutorial Notes 2 1 ECT7110 E-Commerce Data Mining Techniques Tutorial 2 How to Join Table in Enterprise Miner e.g. we need to join the following two tables: Join1 Join 2 ID Name Gender

More information

Your Name: Section: INTRODUCTION TO STATISTICAL REASONING Computer Lab #4 Scatterplots and Regression

Your Name: Section: INTRODUCTION TO STATISTICAL REASONING Computer Lab #4 Scatterplots and Regression Your Name: Section: 36-201 INTRODUCTION TO STATISTICAL REASONING Computer Lab #4 Scatterplots and Regression Objectives: 1. To learn how to interpret scatterplots. Specifically you will investigate, using

More information

SPSS QM II. SPSS Manual Quantitative methods II (7.5hp) SHORT INSTRUCTIONS BE CAREFUL

SPSS QM II. SPSS Manual Quantitative methods II (7.5hp) SHORT INSTRUCTIONS BE CAREFUL SPSS QM II SHORT INSTRUCTIONS This presentation contains only relatively short instructions on how to perform some statistical analyses in SPSS. Details around a certain function/analysis method not covered

More information

Sample Exam. Advanced Test Automation - Engineer

Sample Exam. Advanced Test Automation - Engineer Sample Exam Advanced Test Automation - Engineer Questions ASTQB Created - 2018 American Software Testing Qualifications Board Copyright Notice This document may be copied in its entirety, or extracts made,

More information

The basic arrangement of numeric data is called an ARRAY. Array is the derived data from fundamental data Example :- To store marks of 50 student

The basic arrangement of numeric data is called an ARRAY. Array is the derived data from fundamental data Example :- To store marks of 50 student Organizing data Learning Outcome 1. make an array 2. divide the array into class intervals 3. describe the characteristics of a table 4. construct a frequency distribution table 5. constructing a composite

More information

Chapter 18 Outputting Data

Chapter 18 Outputting Data Chapter 18: Outputting Data 231 Chapter 18 Outputting Data The main purpose of most business applications is to collect data and produce information. The most common way of returning the information is

More information

Tips and Guidance for Analyzing Data. Executive Summary

Tips and Guidance for Analyzing Data. Executive Summary Tips and Guidance for Analyzing Data Executive Summary This document has information and suggestions about three things: 1) how to quickly do a preliminary analysis of time-series data; 2) key things to

More information

Using Machine Learning to Optimize Storage Systems

Using Machine Learning to Optimize Storage Systems Using Machine Learning to Optimize Storage Systems Dr. Kiran Gunnam 1 Outline 1. Overview 2. Building Flash Models using Logistic Regression. 3. Storage Object classification 4. Storage Allocation recommendation

More information

Geostatistics 2D GMS 7.0 TUTORIALS. 1 Introduction. 1.1 Contents

Geostatistics 2D GMS 7.0 TUTORIALS. 1 Introduction. 1.1 Contents GMS 7.0 TUTORIALS 1 Introduction Two-dimensional geostatistics (interpolation) can be performed in GMS using the 2D Scatter Point module. The module is used to interpolate from sets of 2D scatter points

More information

Appendix II: STATA Preliminary

Appendix II: STATA Preliminary Appendix II: STATA Preliminary STATA is a statistical software package that offers a large number of statistical and econometric estimation procedures. With STATA we can easily manage data and apply standard

More information

Topology and Topological Spaces

Topology and Topological Spaces Topology and Topological Spaces Mathematical spaces such as vector spaces, normed vector spaces (Banach spaces), and metric spaces are generalizations of ideas that are familiar in R or in R n. For example,

More information

SAP InfiniteInsight 7.0

SAP InfiniteInsight 7.0 End User Documentation Document Version: 1.0-2014-11 SAP InfiniteInsight 7.0 Data Toolkit User Guide CUSTOMER Table of Contents 1 About this Document... 3 2 Common Steps... 4 2.1 Selecting a Data Set...

More information