Using PROC SQL to Calculate FIRSTOBS David C. Tabano, Kaiser Permanente, Denver, CO

ABSTRACT

The power of SAS programming can at times be greatly improved by using PROC SQL statements to format and manipulate data, and there is considerable literature comparing the strengths of PROC SQL with those of the DATA step. This paper expands on such comparisons by reviewing the use of the DATA step and PROC SQL to select the first observation(s) of a particular incident within a specified date range across repeated measures in datasets.

INTRODUCTION

Data processing and manipulation are important steps in any query process. Building datasets that effectively describe data and prepare it for more robust analysis requires many steps, and programmers are always looking for more efficient ways to perform such tasks. SAS has several powerful tools for such manipulation, so many that the choices can be overwhelming.

The SAS DATA step is often the baseline of any dataset creation or manipulation, as it is both flexible and efficient in processing. The DATA step allows for considerable functionality, particularly in iterating through the observations of a dataset based on defined conditional statements.

In addition to the DATA step, SAS has incorporated Structured Query Language (SQL) into Base SAS, Versions 6 and above. SQL has grown across multiple software platforms as the standard in data manipulation, particularly in processing queries on data stored within relational databases. In SAS, the SQL procedure (very similar to the SQL language found in other data software platforms) acts as both a substitute for and a complement to the standard DATA step. PROC SQL has many advantages over the traditional DATA step, both in coding efficiency and in producing descriptive statistics.
One of the main advantages of PROC SQL is the ability to merge, join and rearrange variables in a dataset without needing to first sort the dataset, a considerable time-saver for programmers working with very large datasets.

REVIEW OF PROC SQL VERSUS DATA STEP

This paper assumes familiarity with PROC SQL and the DATA step in SAS, but it is important to reiterate a few key points in comparing the two. Many of the references listed at the end of this paper provide more detailed comparisons of PROC SQL and the DATA step.

SQL is a programming language designed to query data in relational databases. PROC SQL was implemented in SAS to give programmers a tool for working with relational databases in a similar fashion. PROC SQL can create SAS datasets from relational data sources, effectively bridging the gap between complex relational databases and traditional flat-file datasets. The basic concepts of PROC SQL parallel those of the DATA step, but its vocabulary is based on the elements of a relational database. The following table lists the differences in terminology between the SAS and SQL programming languages (see footnote 1):

    SAS             SQL
    Datasets        Tables
    Observations    Rows
    Variables       Columns
    Merge           Join
    Extract         Query

1 Dickstein, Craig and Pass, Ray, "DATA Step vs. PROC SQL: What's A Neophyte To Do?", Proceedings from the Twenty-Ninth Annual SAS Users Group International Conference, paper 296.

Broadly, PROC SQL can perform many of the same tasks as several other SAS procedures. One of the advantages of PROC SQL is the relative uniformity of the code used to generate different results. For example, creating and viewing a table in SAS normally requires at least two steps: a DATA step to create the dataset and a PRINT procedure to view the observations and variables.

    /* Create the dataset, then print it */
    DATA TEST_A;
      SET WORK.DATASOURCE;
    RUN;

    PROC PRINT DATA=TEST_A;
    RUN;

Using PROC SQL, the same result can be generated in one step and without the need to create the dataset (or table, in this case).

    PROC SQL;
      SELECT *
      FROM WORK.DATASOURCE;
    QUIT;

Similarly, merging two datasets using the DATA step requires a preliminary sort to ensure both datasets are ordered in the same way for the MERGE statement to operate. The first MERGE example below is a simple merge between two tables based on a single shared element (here the variable VAR); the second performs a conditional merge of dataset TEST_B to TEST_A, keeping all observations from TEST_A and joining observations from TEST_B only where the values of VAR match.

    /* Simple match-merge on VAR */
    PROC SORT DATA=TEST_A; BY VAR; RUN;
    PROC SORT DATA=TEST_B; BY VAR; RUN;

    DATA TEST_C;
      MERGE TEST_A TEST_B;
      BY VAR;
    RUN;

    /* Conditional merge: keep every observation from TEST_A */
    PROC SORT DATA=TEST_A; BY VAR; RUN;
    PROC SORT DATA=TEST_B; BY VAR; RUN;

    DATA TEST_C;
      MERGE TEST_A (IN=A) TEST_B (IN=B);
      BY VAR;
      IF (A=1);
    RUN;

PROC SQL does not require different tables to be sorted the same way (or in any particular way) prior to a merge. If both sources contain the same column in the same format, the two tables can be joined in one step. The first query below is a simple join between the two tables based on a single shared element; the second performs a LEFT OUTER JOIN in which all the data from TEST_A are retained and only the data matching on VAR are added from TEST_B.

    /* Simple (inner) join on VAR */
    PROC SQL;
      CREATE TABLE TEST_C AS
      SELECT *
      FROM TEST_A A, TEST_B B
      WHERE A.VAR = B.VAR;
    QUIT;

    /* Left outer join: keep every row from TEST_A */
    PROC SQL;
      CREATE TABLE TEST_C AS
      SELECT *
      FROM TEST_A A
      LEFT OUTER JOIN TEST_B B
        ON B.VAR = A.VAR;
    QUIT;

Depending on the objective of the programmer, the structure of the data source(s), and the desired query results, using PROC SQL and the DATA step together can afford a considerable amount of flexibility and efficiency in manipulating data across various sources.

PROC SQL does have certain limitations relative to the DATA step. For example, PROC SQL cannot perform iterative processing on its own.
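To make the merge and join comparison above concrete, the following is a minimal sketch that runs both approaches on two small tables and compares the results. The tables, their contents, and the output names (A_SORTED, B_SORTED, MERGED_DS, MERGED_SQL) are hypothetical and are not part of the original example; the columns are listed explicitly in the query so that VAR is not selected twice.

```sas
/* Hypothetical toy tables, deliberately unsorted */
data test_a;
  input var $ x;
  datalines;
c 3
a 1
b 2
;
run;

data test_b;
  input var $ y;
  datalines;
b 20
d 40
a 10
;
run;

/* DATA step route: sort both inputs by VAR, then match-merge */
proc sort data=test_a out=a_sorted; by var; run;
proc sort data=test_b out=b_sorted; by var; run;

data merged_ds;
  merge a_sorted (in=in_a) b_sorted;
  by var;
  if in_a;   /* keep every observation that came from TEST_A */
run;

/* PROC SQL route: no prior sort is needed for the join itself */
proc sql;
  create table merged_sql as
  select a.var, a.x, b.y
  from test_a a
       left join test_b b
       on a.var = b.var
  order by a.var;
quit;

/* Check that both routes produced the same table */
proc compare base=merged_ds compare=merged_sql;
run;
```

Both outputs should hold one row per value of VAR found in TEST_A, with Y missing where TEST_B has no match.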
In PROC SQL, datasets are processed relationally, based on columns (variables) in the dataset, whereas the DATA step works with the observations within the dataset itself on a row-by-row basis. For analysis requiring the processing of repeated measures, the DATA step can accomplish iterative processing of the dataset with one of several options (FIRST., LAST., RETAIN, OUTPUT, etc.). For PROC SQL, the answer is more nuanced than a straightforward option, but it is not impossible with a proper understanding of joins, nested queries and grouping of tables. The next sections compare the use of the DATA step and PROC SQL for processing repeated measures in a dataset.

REPEATED MEASURES AND DATA

Repeated measures in data are part of categorical variables with multiple observations based on other criteria. Cross-sectional and longitudinal data are two examples of repeated measures. One of the most common ways repeated measures are displayed in longitudinal data is across categorical variables involving multivariate time

series data, or data containing different events with multiple overlapping characteristics. Consider medical data, where a group of patients visit their provider over a period of several years. A database of longitudinal data could be maintained, containing all of the patients' vital statistics (see footnote 2). A second dataset contains demographic information on the patient, such as birth date, gender, etc. The following SAS code and output give an example of a dataset with repeated measures of body mass index (BMI) and age over time.

    /* Build dataset by merging datasets with variables of interest */
    data vital (keep=patid measure_dt birth_date age bmi gender);
      retain patid age gender measure_dt bmi;   /* sets the variable order */
      merge vitals demographics;
      by patid;
      age = round((measure_dt - birth_date) / 365.25, 1);
    run;

    /* Print the first ten observations in the dataset */
    proc print data=vital (firstobs=1 obs=10);
    run;

LOG

    NOTE: There were 9833 observations read from the data set WORK.VITALS.
    NOTE: There were 1000 observations read from the data set WORK.DEMOGRAPHICS.
    NOTE: The data set WORK.VITAL has 9833 observations and 6 variables.
    NOTE: DATA statement used (Total process time):
    NOTE: There were 10 observations read from the data set WORK.VITAL.
    NOTE: PROCEDURE PRINT used (Total process time):

OUTPUT

    [PROC PRINT listing of the first ten observations of WORK.VITAL; columns: Obs, patid, age, gender, measure_dt, bmi, birth_date]

As the output above shows, patients are identified in the dataset by PATID (patient ID), MEASURE_DT (date of vitals measurement), and the BMI measurement corresponding to the MEASURE_DT. Patient identifiers (PATID, AGE, BIRTH_DATE, and GENDER) repeat across the multiple vitals records for each patient.

COHORT DESIGN

Programmers working with datasets structured with repeated measures often need to subset these datasets based on particular criteria.
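For readers who want to try the techniques in this paper without comparable source data, the repeated-measures structure described above can be mimicked with a small sample. The sketch below is only an illustration: every patient ID, date and BMI value is invented, and only the table and variable names follow the paper.

```sas
/* Hypothetical sample data: all IDs, dates and BMI values are invented */
data vitals;
  input patid measure_dt :mmddyy10. bmi;
  format measure_dt mmddyy10.;
  datalines;
101 03/15/2004 33.1
101 02/10/2005 35.4
101 09/01/2005 36.0
102 06/20/2005 29.8
102 01/05/2006 35.2
103 11/30/2006 41.7
;
run;

data demographics;
  input patid birth_date :mmddyy10. gender $;
  format birth_date mmddyy10.;
  datalines;
101 07/04/1950 M
102 12/25/1962 F
103 03/01/1971 F
;
run;

/* Both inputs are already in PATID order, so no PROC SORT is needed here */
data vital;
  merge vitals demographics;
  by patid;
  age = round((measure_dt - birth_date) / 365.25, 1);
run;

/* Each patient's demographic values repeat on every vitals record */
proc print data=vital;
run;
```

Because DEMOGRAPHICS has one row per patient and VITALS has several, the merge carries the birth date and gender onto each of that patient's measurement rows.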
In epidemiological research, building cohorts of patient populations based on the first instance of a condition within a time period is a typical first step before more sophisticated analysis can take place. Suppose a researcher wanted to build a cohort of patients whose first recorded BMI >= 35 occurred between January 1, 2005 and December 31, 2006. Using the DATA step, this can be accomplished in three simple steps. First, a dataset containing the cohort criteria is built; then it is sorted by the identifier desired for the first instance of the cohort criteria (in this case PATID); and finally the first PATID/BMI record within each BY group is taken using the FIRST.PATID variable. The SAS code below annotates the steps.

2 Dates and patient IDs have been offset to ensure sample data presented in this paper contains no PHI and is completely de-identified.

    /* Step 1 - Define cohort criteria: date range and BMI threshold */
    data step1;
      set vital;
      if '01jan05'd le measure_dt le '31dec06'd and bmi ge 35;
    run;

    /* Step 2 - Sort by PATID and MEASURE_DT to order rows for FIRST. in Step 3 */
    proc sort data=step1 out=step2;
      by patid measure_dt;
    run;

    /* Step 3 - Identify first instance of cohort criteria */
    data bmi_firstobs_data;
      set step2;
      by patid;
      if first.patid;
    run;

    proc sort data=bmi_firstobs_data;
      by measure_dt patid;
    run;

    proc print data=bmi_firstobs_data (firstobs=1 obs=10);
    run;

LOG

    NOTE: There were 9833 observations read from the data set WORK.VITAL.
    NOTE: The data set WORK.STEP1 has 540 observations and 6 variables.
    NOTE: DATA statement used (Total process time):
    NOTE: There were 540 observations read from the data set WORK.STEP1.
    NOTE: The data set WORK.STEP2 has 540 observations and 6 variables.
    NOTE: PROCEDURE SORT used (Total process time):
    NOTE: There were 540 observations read from the data set WORK.STEP2.
    NOTE: The data set WORK.BMI_FIRSTOBS_DATA has 106 observations and 6 variables.
    NOTE: DATA statement used (Total process time):
    NOTE: There were 106 observations read from the data set WORK.BMI_FIRSTOBS_DATA.
    NOTE: The data set WORK.BMI_FIRSTOBS_DATA has 106 observations and 6 variables.
    NOTE: PROCEDURE SORT used (Total process time):
    NOTE: There were 10 observations read from the data set WORK.BMI_FIRSTOBS_DATA.

    NOTE: PROCEDURE PRINT used (Total process time):

OUTPUT

    [PROC PRINT listing of the first ten observations of WORK.BMI_FIRSTOBS_DATA; columns: Obs, patid, age, gender, measure_dt, bmi, birth_date]

The dataset BMI_FIRSTOBS_DATA contains the subset of the original vitals measures in the dataset VITAL that fit the BMI and measurement date criteria defined above.

Using PROC SQL to build the same cohort requires a little more intuition but can save time in both processing speed and CPU resources. Consider the original VITALS and DEMOGRAPHICS tables and the steps above to merge and subset the two data sources. Using several nested queries, or queries-within-queries, PROC SQL can accomplish the same cohort construction in a single PROC SQL statement.

The code begins by identifying the final structure of the cohort table, namely the columns (variables) desired, grouping, ordering, etc. Using the SELECT DISTINCT syntax (which removes duplicate rows, comparable to the NODUPKEY option of PROC SORT applied to the selected columns), the query limits results to unique PATID/AGE/GENDER/MEASURE_DT/BMI combinations. The FROM clause would normally reference a dataset where the source data is located; in this case it begins the first nested query (N1), which opens with a new SELECT DISTINCT and its own column selections. While the table structure in N1 reflects the same structure as the table being created a layer above (i.e. the same columns are being pulled, albeit in a different order), the column names are preceded by a table alias. These aliases reference tables and tell PROC SQL which table contains which column of interest.

The second FROM clause leads into the second nested query (N2), which is where the cohort is actually defined in the WHERE clause (BMI >= 35, MEASURE_DT between January 1, 2005 and December 31, 2006).
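The inner query N2 just described can be tried on its own before it is nested. The following is a minimal sketch on a small hypothetical VITALS table; all IDs, dates and BMI values are invented, and the output table name FIRST_DT is an assumption used only for this illustration.

```sas
/* Hypothetical sample data: all IDs, dates and BMI values are invented */
data vitals;
  input patid measure_dt :mmddyy10. bmi;
  format measure_dt mmddyy10.;
  datalines;
101 03/15/2004 33.1
101 02/10/2005 35.4
101 09/01/2005 36.0
102 06/20/2005 29.8
102 01/05/2006 35.2
103 11/30/2006 41.7
;
run;

/* Earliest qualifying measurement date per patient */
proc sql;
  create table first_dt as
  select patid,
         min(measure_dt) as measure_dt format=mmddyy10.
  from vitals
  where bmi >= 35                                          /* BMI criterion */
    and measure_dt between '01jan2005'd and '31dec2006'd   /* observation period */
  group by patid;
quit;

proc print data=first_dt;
run;
```

With this sample, patient 101 has two qualifying records and the query keeps the earlier date (02/10/2005); patients 102 and 103 each have one qualifying record.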
N2 selects only the columns PATID and MIN(MEASURE_DT), the latter being the first (numerically lowest) measure date that satisfies the query criteria. N2 contains both a SELECT DISTINCT and a GROUP BY PATID clause. It is the GROUP BY clause that applies MIN(VITALS.MEASURE_DT) within each PATID; with it in place, SELECT DISTINCT is redundant, though harmless, because the grouped query already returns one row per patient. Without GROUP BY PATID, the MIN(VITALS.MEASURE_DT) would be applied globally to all rows meeting the WHERE criteria, generating the same minimum measure date for the entire population instead of locally within each patient's observed BMI records in the observation period.

N2 is closed with a table alias A and then rejoined to the VITALS table to obtain the BMI record associated with the MIN(MEASURE_DT) and BMI selection criteria in N2. By joining N2 to VITALS on both matching PATID and MEASURE_DT, N1 is guaranteed to reflect the correct minimum measure date and associated BMI reading. N1 is also joined to the DEMOGRAPHICS table to pull in the desired demographics information. N1 is then closed, and the initial query runs, generating the summary table BMI_FIRSTOBS_SQL. The final table populates the desired fields with the data obtained in N1, which in turn was generated from the data obtained in N2.

Finally, BMI_FIRSTOBS_SQL is ordered by PATID and MEASURE_DT to match the repeated measures structure of BMI_FIRSTOBS_DATA. A PROC COMPARE is performed on the two datasets to verify that they are identical.

    /* Nested PROC SQL */
    proc sql;
      create table bmi_firstobs_sql as
      select distinct patid,
             round((measure_dt - birth_date) / 365.25, 1) as age,
             gender,
             measure_dt,
             bmi
      from
        /* Begin first nested query N1 */
        (select distinct a.patid,
                a.measure_dt,
                demo.birth_date,
                vitals.bmi as bmi,
                demo.gender
         from
           /* Begin second nested query N2 */
           (select distinct vitals.patid,
                   min(vitals.measure_dt) as measure_dt format=mmddyy10.
            from vitals vitals
            where vitals.bmi >= 35                                        /* BMI cohort criteria */
              and vitals.measure_dt between '01jan05'd and '31dec06'd     /* Observation period criteria */
            group by vitals.patid) a
           /* End N2 */
           /* Rejoin A to VITALS to ensure retained MEASURE_DT matches MEASURE_DT from table A */
           left outer join vitals vitals        /* VITALS table alias */
             on a.patid = vitals.patid
            and a.measure_dt = vitals.measure_dt
           /* Join second nested query containing data from VITALS to the DEMOGRAPHICS table */
           left outer join demographics demo    /* DEMOGRAPHICS table alias */
             on demo.patid = a.patid)
        /* End N1 */
      order by patid, measure_dt;
    quit;

LOG

    NOTE: Table WORK.BMI_FIRSTOBS_SQL created, with 106 rows and 5 columns.
    NOTE: PROCEDURE SQL used (Total process time):
          real time 0.07 seconds
          cpu time  0.03 seconds

    /* Verify */
    /* Sort bmi_firstobs_data to match bmi_firstobs_sql */
    proc sort data=bmi_firstobs_data;
      by patid measure_dt;
    run;

    proc compare base=bmi_firstobs_data compare=bmi_firstobs_sql;
    run;

LOG

    NOTE: There were 106 observations read from the data set WORK.BMI_FIRSTOBS_DATA.
    NOTE: The data set WORK.BMI_FIRSTOBS_DATA has 106 observations and 6 variables.
    NOTE: PROCEDURE SORT used (Total process time):
    NOTE: There were 106 observations read from the data set WORK.BMI_FIRSTOBS_DATA.
    NOTE: There were 106 observations read from the data set WORK.BMI_FIRSTOBS_SQL.

    NOTE: PROCEDURE COMPARE used (Total process time):

OUTPUT

    The SAS System
    The COMPARE Procedure
    Comparison of WORK.BMI_FIRSTOBS_DATA with WORK.BMI_FIRSTOBS_SQL
    (Method=EXACT)

    Data Set Summary
    Dataset                   Created            Modified           NVar   NObs
    WORK.BMI_FIRSTOBS_DATA    03AUG11:15:48:54   03AUG11:15:48:
    WORK.BMI_FIRSTOBS_SQL     03AUG11:15:46:08   03AUG11:15:46:

    Variables Summary
    Number of Variables in Common: 5.
    Number of Variables in WORK.BMI_FIRSTOBS_DATA but not in WORK.BMI_FIRSTOBS_SQL: 1.

    Observation Summary
    Observation    Base    Compare
    First Obs         1          1
    Last Obs

    Number of Observations in Common: 106.
    Total Number of Observations Read from WORK.BMI_FIRSTOBS_DATA: 106.
    Total Number of Observations Read from WORK.BMI_FIRSTOBS_SQL: 106.
    Number of Observations with Some Compared Variables Unequal: 0.
    Number of Observations with All Compared Variables Equal: 106.
    NOTE: No unequal values were found. All values compared are exactly equal.

AGGREGATION

The power and efficiency of nested PROC SQL statements can be shown when generating simple summary datasets with aggregate results. Borrowing from the previous example, a programmer can simplify output datasets into a crosstab of counts of patients by year, age and BMI groupings. Using PROC FORMAT to establish BMI and age groups for aggregation, both the DATA step and PROC SQL can generate the same summarized datasets of aggregate counts of patients. The code below lists the steps to generate output using both methods.

First, the formats AGEGRP_ and BMIGRP_ roll up the AGE and BMI values into grouped ranges (10-unit increments in the case of BMI). The dataset STEP5 aggregates individual measure dates into years using the YEAR function. Finally, a PROC FREQ generates the table FREQSUMMARY with a patient count crosstab of YEAR by AGEGRP by BMIGRP.
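The DATA step route to an aggregate table, and its PROC SQL counterpart, can be sketched on a handful of rows. The example below is a simplified illustration, not the paper's code: the COHORT table and its values are invented, only a BMI format is used (with hypothetical cut points), and the output names FREQSUMMARY_DEMO and SQL_SUMMARY_DEMO are assumptions.

```sas
/* Hypothetical BMI grouping format; cut points are illustrative only */
proc format;
  value bmigrp_
    .          = '00: NULL'
    low -< 30  = '01: BMI under 30'
    30 -< 40   = '02: BMI 30-39'
    40 - high  = '03: BMI 40+';
run;

/* Hypothetical cohort: one first-qualifying record per patient */
data cohort;
  input patid measure_dt :mmddyy10. bmi;
  format measure_dt mmddyy10.;
  datalines;
101 02/10/2005 35.4
102 01/05/2006 35.2
103 11/30/2006 41.7
104 05/05/2006 38.9
;
run;

/* DATA step route: derive the grouping variables, then count with PROC FREQ */
data step5_demo;
  set cohort;
  year   = year(measure_dt);
  bmigrp = put(bmi, bmigrp_.);
run;

proc freq data=step5_demo noprint;
  tables year*bmigrp / list out=freqsummary_demo (drop=percent);
run;

/* PROC SQL route: derive, group and count in a single query */
proc sql;
  create table sql_summary_demo as
  select year(measure_dt)      as year,
         put(bmi, bmigrp_.)    as bmigrp,
         count(distinct patid) as count
  from cohort
  group by year, bmigrp
  order by year, bmigrp;
quit;

proc compare base=freqsummary_demo compare=sql_summary_demo;
run;
```

Both tables should contain one row per YEAR and BMIGRP combination present in the data, with the patient count for that cell.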
While the DATA step requires an additional three steps in coding to generate these summary results, PROC SQL creates the same dataset in one statement using the same nested query with only minor adjustments to the output dataset structure. The dataset BMI_SUMMARY_SQL is created using the same nested query structure from the previous example and, like the DATA step that creates STEP5, modifies the AGE and BMI columns using the AGEGRP_ and BMIGRP_ formats and MEASURE_DT using the YEAR function. Instead of using SELECT DISTINCT, PATID is aggregated to counts using COUNT(DISTINCT PATID) in PROC SQL. Finally, BMI_SUMMARY_SQL is aggregated using the GROUP BY clause below, generating a crosstab of patient counts by the new columns YEAR, AGEGRP and BMIGRP.

    /* Aggregation */
    /* Formats */
    proc format;
      /* Age Groups */
      value agegrp_
        .         = '00: NULL '
        low -< 15 = '01: 1-17 yrs old'
        15 -< 25  = '02: yrs old'
        25 -< 35  = '03: yrs old'
        35 -< 50  = '04: yrs old'
        50 -< 65  = '05: yrs old'
        65 - high = '06: 65+ yrs old'
      ;
      /* BMI Groups */
      value bmigrp_
        .         = '00: NULL '
        low -< 20 = '01: BMI under 20'
        20 -< 30  = '02: BMI 20-29'
        30 -< 40  = '03: BMI 30-39'
        40 -< 50  = '04: BMI 40-49'
        50 - high = '05: BMI 50+'
      ;
    run;

    /* Step 4 - Sort data for aggregation */
    proc sort data=bmi_firstobs_data out=step4;
      by patid;
    run;

    /* Step 5 - Aggregate data */
    data step5 (keep=year patid agegrp bmigrp);
      set step4;
      by patid;
      if first.patid;
      agegrp = put(age, agegrp_.);
      bmigrp = put(bmi, bmigrp_.);
      year   = year(measure_dt);   /* reduce dates to year */
    run;

    /* Step 6 - Summarize data */
    proc freq data=step5;
      tables year*agegrp*bmigrp / list noprint nopercent out=freqsummary;
    run;

    /* Nested PROC SQL Summary */
    proc sql;
      create table bmi_summary_sql as
      select year(measure_dt) as year,
             put(round((measure_dt - birth_date) / 365.25, 1), agegrp_.) as agegrp,   /* Format for aggregation */
             put(bmi, bmigrp_.) as bmigrp,                                             /* Format for aggregation */
             count(distinct patid) as count                                            /* Count patients for summarization */
      from
        (select distinct a.patid,
                a.measure_dt,
                demo.birth_date,
                vitals.bmi as bmi,
                demo.gender
         from
           (select distinct vitals.patid,
                   min(vitals.measure_dt) as measure_dt format=mmddyy10.
            from vitals vitals
            where vitals.bmi >= 35
              and vitals.measure_dt between '01jan05'd and '31dec06'd
            group by vitals.patid) a
           left outer join demographics demo
             on demo.patid = a.patid
           left outer join vitals vitals
             on a.patid = vitals.patid
            and a.measure_dt = vitals.measure_dt)
      group by year, agegrp, bmigrp   /* Aggregate patient counts by GROUP BY */
      order by year, agegrp, bmigrp;
    quit;

    /* Validate results */
    proc compare base=freqsummary compare=bmi_summary_sql;
    run;

LOG

    NOTE: Format AGEGRP_ has been output.
    NOTE: Format BMIGRP_ has been output.
    NOTE: PROCEDURE FORMAT used (Total process time):
    NOTE: Input data set is already sorted; it has been copied to the output data set.
    NOTE: There were 106 observations read from the data set WORK.BMI_FIRSTOBS_DATA.
    NOTE: The data set WORK.STEP4 has 106 observations and 6 variables.
    NOTE: PROCEDURE SORT used (Total process time):
    NOTE: There were 106 observations read from the data set WORK.STEP4.
    NOTE: The data set WORK.STEP5 has 106 observations and 4 variables.
    NOTE: DATA statement used (Total process time):
          real time 0.04 seconds
          cpu time  0.03 seconds
    NOTE: There were 106 observations read from the data set WORK.STEP5.
    NOTE: The data set WORK.FREQSUMMARY has 21 observations and 5 variables.
    NOTE: PROCEDURE FREQ used (Total process time): 0.03 seconds
    NOTE: Table WORK.BMI_SUMMARY_SQL created, with 21 rows and 4 columns.
    NOTE: PROCEDURE SQL used (Total process time): 0.10 seconds
    378  proc compare base=freqsummary compare=bmi_summary_sql;
    379
    NOTE: There were 21 observations read from the data set WORK.FREQSUMMARY.
    NOTE: There were 21 observations read from the data set WORK.BMI_SUMMARY_SQL.

    NOTE: PROCEDURE COMPARE used (Total process time):

OUTPUT

    The SAS System
    The COMPARE Procedure
    Comparison of WORK.FREQSUMMARY with WORK.BMI_SUMMARY_SQL
    (Method=EXACT)

    Data Set Summary
    Dataset                 Created            Modified           NVar   NObs
    WORK.FREQSUMMARY        03AUG11:17:35:47   03AUG11:17:35:
    WORK.BMI_SUMMARY_SQL    03AUG11:17:35:58   03AUG11:17:35:

    Variables Summary
    Number of Variables in Common: 4.
    Number of Variables in WORK.FREQSUMMARY but not in WORK.BMI_SUMMARY_SQL: 1.
    Number of Variables with Differing Attributes: 1.

    Listing of Common Variables with Differing Attributes
    Variable   Dataset                 Type   Length   Label
    COUNT      WORK.FREQSUMMARY        Num    8        Frequency Count
               WORK.BMI_SUMMARY_SQL    Num    8

    Observation Summary
    Observation    Base    Compare
    First Obs         1          1
    Last Obs

    Number of Observations in Common: 21.
    Total Number of Observations Read from WORK.FREQSUMMARY: 21.
    Total Number of Observations Read from WORK.BMI_SUMMARY_SQL: 21.
    Number of Observations with Some Compared Variables Unequal: 0.
    Number of Observations with All Compared Variables Equal: 21.
    NOTE: No unequal values were found. All values compared are exactly equal.

CONCLUSION

PROC SQL has considerable advantages in coding efficiency over the DATA step, particularly when subsetting data on repeated measures. While the intuition behind using PROC SQL to reproduce FIRST. processing isn't as apparent as it is with the traditional DATA step, its concise presentation in both coding structure and minimal output can help expedite program generation and distribution across users. Creating aggregate counts and summaries from tables generated in PROC SQL adds to the efficiency of this robust procedure. While the traditional DATA step can in many instances be more efficient in terms of query speed and table generation when calculating incidence/prevalence, the PROC SQL method presented in this paper has coding advantages with respect to data organization and transference to other SQL-based software platforms.

REFERENCES

Dickstein, Craig and Pass, Ray, "DATA Step vs. PROC SQL: What's A Neophyte To Do?", Proceedings from the Twenty-Ninth Annual SAS Users Group International Conference, paper 296.

Fain, Terry and Gareleck, Cyndie (254-30), "Using SAS to Process Repeated Measures Data", Proceedings of the Thirtieth Annual SAS Users Group International Conference, paper 254.

Matthews, Joann, "PROC SQL Versus The DATA Step", Northeastern SAS Users Group Conference, 2006.

SAS Institute Inc. (1990), Base SAS 9.2 SAS Procedure Guide, Third Edition, Cary, NC: SAS Institute Inc.

Severino, Richard, "PROC FREQ: It's More Than Counts", Proceedings of the Twenty-Fifth Annual SAS Users Group International Conference, paper 69, 2000.

CONTACT INFORMATION

Your comments and questions are valued and encouraged. Contact the author at:

    David Tabano
    Kaiser Permanente
    2550 S. Parker Rd., Suite 300
    Aurora, CO
    (303) (work)
    (303) (fax)
    david.c.tabano@kp.org

SAS and all other SAS Institute Inc. product or service names are registered trademarks or trademarks of SAS Institute Inc. in the USA and other countries. ® indicates USA registration. Other brand and product names are trademarks of their respective companies.
