Analysis of Complex Survey Data with SAS

Christine R. Wells, Ph.D., UCLA, Los Angeles, CA

ABSTRACT

The differences between data collected via a complex sampling design and data collected via other methods need to be taken into account when analyzing complex survey data. The elements of the data unique to complex survey data are defined and discussed. Examples of procedures for descriptive statistics and graphs are given for continuous and categorical variables. The analysis of domains, sometimes called subpopulations, is discussed, followed by examples of ordinary least squares regression and logistic regression.

INTRODUCTION

The most important part of analyzing complex survey data is correctly specifying the elements of the sampling plan. These elements indicate how the data collection process differed from a simple random sample. The math used to calculate the point estimates and standard errors differs between data collected via a simple random sample and data collected via a complex survey. There are numerous methods by which complex survey data can be collected, and the correct math depends on how the data were collected. Because of this, it is important to read the documentation for the data carefully.

When data are collected via a simple random sample, all elements of the population have an equal probability of being selected into the sample. This assumption is built into the math that underlies most of the procedures in SAS. With complex survey data, the analyst explicitly acknowledges that the data were not collected via a simple random sample.

The most common way in which complex survey data differ from data collected via a simple random sample is that the elements of the population do not have an equal probability of being selected into the sample. With complex survey data, a variable is included in the data set that gives the inverse of the probability of selection for each observation. This is called the probability weight.
Often, corrections and adjustments are made to this weight, and the result is called a sampling weight. This variable is incorporated into the calculation of all weighted point estimates (e.g., means, frequencies, regression coefficients). The sum of the sampling weights should give a reasonable estimate of the number of elements in the population.

The population may be stratified before data collection begins. This means that the population is divided into groups such that each element of the population belongs to one and only one stratum. For example, the population of the United States may be stratified by location (such as states), or by demographic characteristics such as gender or age. Many categorical variables may be combined to create the strata. If the stratification variables are related to the outcome variable, the stratification will reduce the standard errors. However, for most analyses with public-use survey data sets, the stratification may decrease or increase the standard errors.

Another element common to complex survey data sets that influences the calculation of the standard errors is clustering. In practice, it is very difficult to obtain a simple random sample. Instead, cluster sampling is used. In cluster sampling, large units are selected first. In single-stage cluster sampling, all of the elements within each selected cluster are included in the sample. In multiple-stage cluster sampling, large clusters are sampled from the population, and then smaller clusters are sampled from each large cluster that has been selected into the sample. This process continues until elements are selected. For example, metropolitan statistical areas (MSAs) may be sampled, and then city blocks, and then households, and then a person. The effect of cluster sampling is usually to increase the standard errors.

In summary, the sampling weight affects the calculation of the point estimate, and the stratification and the clustering affect the calculation of the standard error.
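
The role of the sampling weight in point estimation can be written compactly. The notation below is not from the paper and is introduced only for illustration: s is the set of sampled observations, w_i is the sampling weight and y_i is the value of the analysis variable for observation i. These are the standard weighted estimators of the population size and the population mean.

```latex
\hat{N} = \sum_{i \in s} w_i ,
\qquad
\bar{y}_w = \frac{\sum_{i \in s} w_i \, y_i}{\sum_{i \in s} w_i}
```

When every observation has the same weight, the weights cancel and the weighted mean reduces to the ordinary sample mean, which is the simple random sample case.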
Why be concerned with the standard error? The reason is that most test statistics are computed as the point estimate divided by its standard error.

For most public-use data sets, a stratification variable and a cluster variable are included in the data set. The names of these variables can be determined by reading the documentation that comes with the data set. The cluster variable is sometimes called the primary sampling unit (PSU). The primary sampling unit refers to the first level of sampling. For example, if MSAs are sampled, and then blocks within the selected MSAs, and then households on the selected blocks, the primary sampling units are the MSAs. Usually, there are two or more clusters in each stratum. If there is only one cluster in a stratum, SAS will put a note in the log file.

Another way to correct the standard errors for the sampling plan is to use replicate weights, which are a series of variables that are included with the survey data set. Replicate weights are not discussed further here.
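
The three design elements described above are passed to a SAS survey procedure through the WEIGHT, STRATA and CLUSTER statements. The following is a minimal sketch, not the author's code. The data set name nhanes2012 and the variable ridageyr appear in the paper's own examples; the design variable names wtint2yr, sdmvstra and sdmvpsu are assumptions based on the naming used in NHANES public-use files, and the appropriate weight depends on which variables are analyzed. The names should always be checked against the documentation for the data at hand.

```sas
/* Sketch only: the design variable names below are assumed, not taken from the paper. */
proc surveymeans data = nhanes2012;
  weight  wtint2yr;   /* sampling weight (assumed name) */
  strata  sdmvstra;   /* stratification variable (assumed name) */
  cluster sdmvpsu;    /* primary sampling unit (assumed name) */
  var ridageyr;       /* analysis variable used elsewhere in the paper */
run;
```

The same WEIGHT, STRATA and CLUSTER statements are available in the other survey procedures discussed in this paper (PROC SURVEYFREQ, PROC SURVEYREG and PROC SURVEYLOGISTIC).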

INTRODUCTION TO THE EXAMPLE DATA SET

For the following examples, the National Health and Nutrition Examination Survey (NHANES) data will be used. Variables from the demographics, body measurement, and diet behavior and nutrition data sets will be used. For variables that used missing data codes (such as 7, 8 and 9), those values have been converted to missing for the purposes of these analyses.

DESCRIPTIVE STATISTICS WITH CONTINUOUS VARIABLES

PROC SURVEYMEANS is the SAS procedure that is most often used to calculate descriptive statistics for continuous variables. A wide variety of descriptive statistics can be produced, including means, medians and percentiles. Graphs that correctly account for the sampling weight, such as histograms and boxplots, can also be produced. As of SAS/STAT 14.2, weighted correlations cannot be calculated. Some examples are given below.

In the first example, only the basic specification will be used.

In the next example, some options are included on the PROC SURVEYMEANS statement. These options request that the minimum value, mean, maximum value and the range be included in the output.

    proc surveymeans data = nhanes2012 min mean max range;

In the next example, graphs are requested. The graphs will include a histogram and a boxplot.

    ods graphics on;
    proc surveymeans data = nhanes2012 plots = all;
    ods graphics off;

DESCRIPTIVE STATISTICS WITH CATEGORICAL VARIABLES

PROC SURVEYFREQ is the SAS procedure that is most often used to calculate descriptive statistics for categorical variables. One- and two-way tabulations can be produced. Several different types of chi-squared tests can be calculated for two-way cross-tabulations. Graphs, including kappa, mosaic, odds ratio, relative risk, risk difference, weighted frequency and weighted kappa plots, can be produced. Some examples are given below.

The first example shows the basic syntax.

    proc surveyfreq data = nhanes2012;
    tables female;

In the next example, a cross-tabulation will be requested. A FORMAT statement will also be included. This is useful for making the output more easily interpretable.

    proc surveyfreq data = nhanes2012;
    tables female*cbq600r;
    format female fm. cbq600r cb.;

In the next example, the expected values, row and column percentages, and different versions of the chi-square test will be requested.

    proc surveyfreq data = nhanes2012;
    tables female*cbq600r;
    format female fm. cbq600r cb.;

In the following example, a weighted frequency plot will be requested.

    ods graphics on;
    proc surveyfreq data = nhanes2012;
    tables dmdeduc2 / plots = wtfreqplot;
    format dmdeduc2 ed.;
    ods graphics off;

ANALYSIS OF DOMAINS

Sometimes, an analysis is to include only some members of the population but not others. For example, perhaps the analysis should include only women, or only women over age 50. Such analyses are called domain, or subpopulation, analyses. If the data were not weighted, the analyst could use a WHERE statement to include only the desired observations. If the analysis were to be done for women and men separately, a BY statement could be used. However, the math that is used with a WHERE statement or a BY statement produces standard errors whose interpretation is different from the one that most analysts are seeking. Because of this, most of the survey procedures have a DOMAIN statement. One or more categorical variables can be specified on the DOMAIN statement, and SAS will run the analysis for every combination of the variables listed. There is no DOMAIN statement in PROC SURVEYFREQ. Instead, the variables that would have been specified on the DOMAIN statement can be added to the TABLES statement.

How do the standard errors calculated with a BY statement (or a WHERE statement) differ from those calculated with a DOMAIN statement? Take a simple example with a single binary variable on a BY statement. With the BY statement, the data set is broken into two groups. Both the point estimate and its standard error are calculated using only the observations in that group. In contrast, with a DOMAIN statement, the point estimate is calculated using only the observations that are part of the domain, but all of the data are used in the calculation of the standard error. This method of calculating the standard error allows the results of the analysis to be generalized to all elements of the subpopulation.

SAS produces output for every combination of the variables listed on the DOMAIN statement, and this can sometimes mean a great deal of output. Binary domain variables can be created in a data step. Typically, such a variable would be coded 1 for all observations that are to be included in the domain, and 0 otherwise. No observation should have a missing value.

The following examples will start with the simplest situation, in which only one variable is specified on the DOMAIN statement, and progress to more complicated, but potentially more useful, uses.

    domain female;
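
The data step mentioned above for creating a binary domain variable might look like the following sketch, which builds a domain for women over age 50 (the example given at the start of this section). The variables female, ridageyr and htfeet appear in the paper's examples. The coding female = 1 for women, the new names nhanes_dom and women50, and the design variable names are assumptions made for illustration.

```sas
/* Sketch only: the coding of female, the new names, and the design */
/* variable names are assumed, not taken from the paper.            */
data nhanes_dom;
  set nhanes2012;
  /* 1 = in the domain (women over 50), 0 = everyone else.          */
  /* A missing female or ridageyr fails the comparison, so those    */
  /* observations are coded 0 and no observation is left missing.   */
  if female = 1 and ridageyr > 50 then women50 = 1;
  else women50 = 0;
run;

proc surveymeans data = nhanes_dom;
  weight  wtint2yr;   /* assumed name */
  strata  sdmvstra;   /* assumed name */
  cluster sdmvpsu;    /* assumed name */
  domain  women50;    /* estimates are reported for women50 = 0 and women50 = 1 */
  var htfeet;
run;
```

Because the full data set is passed to the procedure, all observations contribute to the standard error, which is the behavior described above for the DOMAIN statement. Whether observations with unknown sex or age should be coded 0 is a judgment call that depends on the analysis.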

In the next example, a FORMAT statement will be used.

    domain female;

In the next example, two categorical variables will be given on the DOMAIN statement, but those variables will not be crossed.

    domain female adult;
    format female fm. adult ad.;

The variables on the DOMAIN statement are crossed in the next example, and a FORMAT statement is used.

    domain female*adult;
    format female fm. adult ad.;

The difference between means from two domains can be tested. There are at least two ways to do this using PROC SURVEYREG. One method uses a CONTRAST statement, and the other uses an LSMEANS statement. Notice that on the MODEL statement, the NOINT option has been used so that both means are estimated, rather than the intercept (which is the mean of the reference group) and the difference between the means. PROC SURVEYMEANS and PROC SURVEYREG calculate the variance estimates differently. The VADJUST = NONE option is used to get the variance estimates that are given by PROC SURVEYMEANS. The SOLUTION option is used to request the point estimates in the output. PROC SURVEYMEANS is run first only to show the mean height for each gender.

    domain female;

    proc surveyreg data = nhanes2012;
    class female;
    model htfeet = female / noint solution vadjust = none;
    contrast 'comparing males and females' female 1 -1;

    proc surveyreg data = nhanes2012;
    class female;
    model htfeet = female / noint solution vadjust = none;
    lsmeans female / diff;

ORDINARY LEAST SQUARES REGRESSION

Once the elements of the sampling plan have been taken into account, ordinary least squares (OLS) regression is very much like OLS regression with unweighted data. PROC SURVEYREG is used to run OLS regression with complex survey data. As before, if only part of the data is to be included in the analysis, a DOMAIN statement can be used. There is a CLASS statement in PROC SURVEYREG. How reference categories are specified depends on the presence or absence of a FORMAT statement.

    proc surveyreg data = nhanes2012;
    class female (reference = 'male');
    model htfeet = female ridageyr / solution;

LOGISTIC REGRESSION

PROC SURVEYLOGISTIC can be used to conduct binary, ordinal and nominal (i.e., multinomial) logistic regression analyses. There is a CLASS statement in PROC SURVEYLOGISTIC. How reference categories are specified depends on the presence or absence of a FORMAT statement.

    proc surveylogistic data = nhanes2012;
    class dmdeduc2 (reference = '3') female (reference = 'male') / param = ref;
    model cbq600r (desc) = dmdeduc2 female ridageyr;

WEIGHTED MULTILEVEL MODELS

Starting with SAS/STAT 13.1, weighted multilevel models can be run in SAS, and PROC GLIMMIX must be used to run them; both linear and nonlinear models can be specified. Running a weighted linear multilevel model is more complicated than running a weighted OLS regression. First, sampling weights at each level of the multilevel model must be specified. For example, for a two-level model, sampling weights must be specified at level 1 and level 2. This is because the level 1 and level 2 sampling weights enter into the equation for the pseudo-likelihood at different places. Additionally, consideration should be given to the scaling of the level 1 sampling weights. There are a few different choices for scaling methods.
If rescaling needs to be done, it must be done in a data step before running PROC GLIMMIX. It is assumed that the highest level of the multilevel model corresponds to the primary sampling units of the sampling plan. For example, if MSAs are the PSUs, then MSAs should be the level 2 units in a two-level multilevel model. The METHOD = QUADRATURE option is used to request the weighted likelihood, which is called a pseudo-likelihood. The EMPIRICAL = CLASSICAL option is used to request sandwich variance estimators.
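
The rescaling step described above, done before the model is fitted, might be sketched as follows. The paper does not say which scaling method to use; the two shown here are commonly described choices and are included only as an illustration: scaling the level 1 weights so that they sum to the cluster sample size, and scaling them so that they sum to the effective cluster sample size. The names wishing, level1wt and level2id come from the paper's PROC GLIMMIX example; wtsums, wishing_sorted, wishing_scaled, n_j, sum_w, sum_w2, level1wt_size and level1wt_eff are hypothetical names.

```sas
/* Sketch only: the choice of scaling method and all new names are assumptions. */

/* Per-cluster count, sum of weights, and sum of squared weights. */
proc means data = wishing noprint nway;
  class level2id;
  var level1wt;
  output out = wtsums (keep = level2id n_j sum_w sum_w2)
         n = n_j sum = sum_w uss = sum_w2;
run;

proc sort data = wishing out = wishing_sorted;
  by level2id;
run;

/* Attach the cluster totals to each observation and rescale. */
data wishing_scaled;
  merge wishing_sorted wtsums;
  by level2id;
  /* Scaled weights sum to the cluster sample size n_j. */
  level1wt_size = level1wt * n_j / sum_w;
  /* Scaled weights sum to the effective cluster sample size. */
  level1wt_eff  = level1wt * sum_w / sum_w2;
run;
```

To use a rescaled weight, the chosen variable would be named on the OBSWEIGHT = option in place of level1wt, with wishing_scaled as the input data set.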

    proc glimmix data = wishing method = quadrature(qpoints=10) empirical = classical;
    model dv = IV1 IV2 IV3 / obsweight = level1wt solution;
    random intercept / subject = level2id weight = level2wt;

CONCLUSION

Analyzing data collected via a complex sampling design is different from analyzing data collected via a simple random sample. However, there are also many similarities. Special attention should be given to the elements of the sampling plan to ensure that they are properly incorporated into the analysis. The analysis of subgroups of observations should be done with a DOMAIN statement. Many types of weighted regressions are possible, including weighted linear and nonlinear multilevel models.

CONTACT INFORMATION

Your comments and questions are valued and encouraged. Contact the author at:

Christine Wells, Ph.D.
UCLA
5308 Math Sciences Box
Los Angeles, CA
crwells@ucla.edu

SAS and all other SAS Institute Inc. product or service names are registered trademarks or trademarks of SAS Institute Inc. in the USA and other countries. ® indicates USA registration.

Other brand and product names are trademarks of their respective companies.
