Using PROC SQL to Calculate FIRSTOBS David C. Tabano, Kaiser Permanente, Denver, CO

Similar documents
The Power of Combining Data with the PROC SQL

PROC SQL vs. DATA Step Processing. T Winand, Customer Success Technical Team

How to Incorporate Old SAS Data into a New DATA Step, or What is S-M-U?

PROC SQL VS. DATA STEP PROCESSING JEFF SIMPSON SAS CUSTOMER LOYALTY

Get SAS sy with PROC SQL Amie Bissonett, Pharmanet/i3, Minneapolis, MN

%MAKE_IT_COUNT: An Example Macro for Dynamic Table Programming Britney Gilbert, Juniper Tree Consulting, Porter, Oklahoma

Chapter 6: Modifying and Combining Data Sets

Cleaning Duplicate Observations on a Chessboard of Missing Values Mayrita Vitvitska, ClinOps, LLC, San Francisco, CA

Essentials of PDV: Directing the Aim to Understanding the DATA Step! Arthur Xuejun Li, City of Hope National Medical Center, Duarte, CA

Mapping Clinical Data to a Standard Structure: A Table Driven Approach

Querying Data with Transact SQL

Ditch the Data Memo: Using Macro Variables and Outer Union Corresponding in PROC SQL to Create Data Set Summary Tables Andrea Shane MDRC, Oakland, CA

50 WAYS TO MERGE YOUR DATA INSTALLMENT 1 Kristie Schuster, LabOne, Inc., Lenexa, Kansas Lori Sipe, LabOne, Inc., Lenexa, Kansas

Beginning Tutorials. PROC FSEDIT NEW=newfilename LIKE=oldfilename; Fig. 4 - Specifying a WHERE Clause in FSEDIT. Data Editing

Hypothesis Testing: An SQL Analogy

Programming Gems that are worth learning SQL for! Pamela L. Reading, Rho, Inc., Chapel Hill, NC

A Breeze through SAS options to Enter a Zero-filled row Kajal Tahiliani, ICON Clinical Research, Warrington, PA

Programming Beyond the Basics. Find() the power of Hash - How, Why and When to use the SAS Hash Object John Blackwell

A Quick and Gentle Introduction to PROC SQL

Sorting big datasets. Do we really need it? Daniil Shliakhov, Experis Clinical, Kharkiv, Ukraine

There s No Such Thing as Normal Clinical Trials Data, or Is There? Daphne Ewing, Octagon Research Solutions, Inc., Wayne, PA

Using Templates Created by the SAS/STAT Procedures

A Simple Framework for Sequentially Processing Hierarchical Data Sets for Large Surveys

Using Proc Freq for Manageable Data Summarization

MANAGING DATA(BASES) USING SQL (NON-PROCEDURAL SQL, X401.9)

9 Ways to Join Two Datasets David Franklin, Independent Consultant, New Hampshire, USA

Advanced Visualization using TIBCO Spotfire and SAS

Anatomy of a Merge Gone Wrong James Lew, Compu-Stat Consulting, Scarborough, ON, Canada Joshua Horstman, Nested Loop Consulting, Indianapolis, IN, USA

Modular processing for Prep-to-research analysis- interfacing XML, SAS and Microsoft Excel David C. Tabano, Kaiser Permanente, Denver, CO

Correcting for natural time lag bias in non-participants in pre-post intervention evaluation studies

Clinical Data Visualization using TIBCO Spotfire and SAS

The Power of PROC SQL Techniques and SAS Dictionary Tables in Handling Data

Personally Identifiable Information Secured Transformation

Files Arriving at an Inconvenient Time? Let SAS Process Your Files with FILEEXIST While You Sleep

Macro to compute best transform variable for the model

Using PROC SQL to Generate Shift Tables More Efficiently

Mastering Data Summarization with PROC SQL

Tales from the Help Desk 6: Solutions to Common SAS Tasks

If You Need These OBS and These VARS, Then Drop IF, and Keep WHERE Jay Iyengar, Data Systems Consultants LLC

Paper PS05_05 Using SAS to Process Repeated Measures Data Terry Fain, RAND Corporation Cyndie Gareleck, RAND Corporation

To conceptualize the process, the table below shows the highly correlated covariates in descending order of their R statistic.

Beginning Tutorials. Paper 53-27

Using PROC PLAN for Randomization Assignments

Report Writing, SAS/GRAPH Creation, and Output Verification using SAS/ASSIST Matthew J. Becker, ST TPROBE, inc., Ann Arbor, MI

Paper DB2 table. For a simple read of a table, SQL and DATA step operate with similar efficiency.

Going Under the Hood: How Does the Macro Processor Really Work?

10 The First Steps 4 Chapter 2

Statistics, Data Analysis & Econometrics

CC13 An Automatic Process to Compare Files. Simon Lin, Merck & Co., Inc., Rahway, NJ Huei-Ling Chen, Merck & Co., Inc., Rahway, NJ

Merging Data Eight Different Ways

Introduction / Overview

Guide Users along Information Pathways and Surf through the Data

A SAS/AF Application for Linking Demographic & Laboratory Data For Participants in Clinical & Epidemiologic Research Studies

PREREQUISITES FOR EXAMPLES

Chaining Logic in One Data Step Libing Shi, Ginny Rego Blue Cross Blue Shield of Massachusetts, Boston, MA

SAS Macro Technique for Embedding and Using Metadata in Web Pages. DataCeutics, Inc., Pottstown, PA

SAS Macro Dynamics - From Simple Basics to Powerful Invocations Rick Andrews, Office of the Actuary, CMS, Baltimore, MD

Uncommon Techniques for Common Variables

SAS Scalable Performance Data Server 4.3

SAS Application to Automate a Comprehensive Review of DEFINE and All of its Components

Are you Still Afraid of Using Arrays? Let s Explore their Advantages

MTA Database Administrator Fundamentals Course

Now That You Have Your Data in Hadoop, How Are You Staging Your Analytical Base Tables?

Merge Processing and Alternate Table Lookup Techniques Prepared by

The Dataset Diet How to transform short and fat into long and thin

INTRODUCTION TO PROC SQL JEFF SIMPSON SYSTEMS ENGINEER

Cell Suppression In SAS Visual Analytics: A Primer

A Practical Introduction to SAS Data Integration Studio

CFB: A Programming Pattern for Creating Change from Baseline Datasets Lei Zhang, Celgene Corporation, Summit, NJ

Understanding and Applying the Logic of the DOW-Loop

Mobile MOUSe MTA DATABASE ADMINISTRATOR FUNDAMENTALS ONLINE COURSE OUTLINE

WHAT ARE SASHELP VIEWS?

A Format to Make the _TYPE_ Field of PROC MEANS Easier to Interpret Matt Pettis, Thomson West, Eagan, MN

Obtaining the Patient Most Recent Time-stamped Measurements

Base and Advance SAS

Countdown of the Top 10 Ways to Merge Data David Franklin, Independent Consultant, Litchfield, NH

Macros for Two-Sample Hypothesis Tests Jinson J. Erinjeri, D.K. Shifflet and Associates Ltd., McLean, VA

PROGRAMMING ROLLING REGRESSIONS IN SAS MICHAEL D. BOLDIN, UNIVERSITY OF PENNSYLVANIA, PHILADELPHIA, PA

PharmaSUG Paper SP04

Tired of CALL EXECUTE? Try DOSUBL

Unlock SAS Code Automation with the Power of Macros

T-SQL Training: T-SQL for SQL Server for Developers

Efficiently Join a SAS Data Set with External Database Tables

ABSTRACT INTRODUCTION MACRO. Paper RF

Ten Great Reasons to Learn SAS Software's SQL Procedure

Arthur L. Carpenter California Occidental Consultants, Oceanside, California

Choosing the Right Technique to Merge Large Data Sets Efficiently Qingfeng Liang, Community Care Behavioral Health Organization, Pittsburgh, PA

Let Hash SUMINC Count For You Joseph Hinson, Accenture Life Sciences, Berwyn, PA, USA

How to Incorporate Old SAS Data into a New DATA Step, or What is S-M-U?

SAS System Powers Web Measurement Solution at U S WEST

Contents of SAS Programming Techniques

David Beam, Systems Seminar Consultants, Inc., Madison, WI

USING PROC SQL EFFECTIVELY WITH SAS DATA SETS JIM DEFOOR LOCKHEED FORT WORTH COMPANY

Simplifying Your %DO Loop with CALL EXECUTE Arthur Li, City of Hope National Medical Center, Duarte, CA

SAS seminar. The little SAS book Chapters 3 & 4. April 15, Åsa Klint. By LD Delwiche and SJ Slaughter. 3.1 Creating and Redefining variables

SAS Macro Dynamics: from Simple Basics to Powerful Invocations Rick Andrews, Office of Research, Development, and Information, Baltimore, MD

DSCI 325: Handout 10 Summarizing Numerical and Categorical Data in SAS Spring 2017

Keh-Dong Shiang, Department of Biostatistics & Department of Diabetes, City of Hope National Medical Center, Duarte, CA

Get Going with PROC SQL Richard Severino, Convergence CT, Honolulu, HI

1. Join with PROC SQL a left join that will retain target records having no lookup match. 2. Data Step Merge of the target and lookup files.

Transcription:

Using PROC SQL to Calculate FIRSTOBS David C. Tabano, Kaiser Permanente, Denver, CO ABSTRACT The power of SAS programming can at times be greatly improved using PROC SQL statements for formatting and manipulating data, and there is considerable literature available on comparing the strengths of using PROC SQL over the DATA step. This paper expands on such programming comparisons by reviewing the use of the DATA step and PROC SQL to select first observation(s) of a particular incident with a specified date range across repeated measures in datasets. INTRODUCTION Data processing and manipulation are important steps in any query process. Building datasets that effectively describe data and prepare it for more robust analysis require many steps, and programmers are always looking for more efficient ways to perform such tasks. SAS has several power tools for such manipulation, so many so that the choices can be overwhelming. The SAS DATA step is often the baseline of any dataset creation or manipulation as it is robust in flexibility and processing efficiency. The DATA step allows for considerable functionality, particularly in iterative processing within a dataset and through observations based on defined conditional statements. In addition to the DATA step, SAS has incorporated Structured Query Language (SQL) into Base SAS, Versions 6 and above. SQL has grown across multiple software platforms as the standard in data manipulation, particularly in processing queries on data stored within relational databases. In SAS, the SQL procedure (very similar to SQL language found in other data software platforms) acts as both a substitute and a compliment to the standard DATA step. PROC SQL has many advantages over the traditional DATA step, both in coding efficiency and descriptive statistics. One of the main advantages of PROC SQL is the ability to merge, join and rearrange variables in a dataset without needing to first sort the dataset, a considerable time-saver for programmers working with very large datasets. REVIEW OF PROC SQL VERSUS DATA STEP This paper assumes a familiar background with PROC SQL and DATA step procedures in SAS, but it is important to reiterate a few key points in comparing the two. Many of the references listed at the end of this paper provide more detailed comparisons of PROC SQL and the DATA step. SQL is a programming language designed to query data in relational databases. PROC SQL was implemented in SAS to give programmers a tool for working with relational databases in a similar fashion. PROC SQL can create SAS datasets from relational data sources, effectively bridging the gap between complex relational databases and traditional dataset flat files. The basic syntax of PROC SQL is similar to the DATA step syntax but based on the elements of a relational database. The following table lists the differences in syntax between SAS and SQL programming languages: SAS SQL Datasets Table Observations Rows Variables Columns Merge Join Extract Query 1 Broadly, PROC SQL can perform many of the same tasks as several other SAS procedures. One of the advantages to PROC SQL is the relative uniformity of the code with respect to generating different results. For example, creating and viewing a table in SAS normally requires at least two steps- a DATA step to create the dataset and a print procedure to view the observations and variables. DATA TEST_A; SET WORK.DATASOURCE; PROC PRINT DATA=TEST_A; 1 Dickstein, Craig and Pass, Ray, DATA Step vs. PROC SQL: What s A Neophyte To Do?, Proceedings from the Twenty-Ninth Annual SAS Users Group International Conference, paper 296, 2004 1

Using PROC SQL, the same result can be generated with one statement and without the need to create the dataset (or table, in this case). PROC SQL; SELECT * FROM WORK.DATASOURCE; QUIT; Similarly, merging data between two datasets using the DATA step necessitate a preliminary sort procedure to ensure both datasets are organized in the same way for the MERGE statement to operate. The MERGE statement to the left is an example of a simple merge between two tables based on a single similar element; the merge to the right performs a conditional merge of dataset TEST_B to TEST_A, pulling in all observations from TEST_A only joining the observations in TEST_B to TEST_A where both variables match. PROC SORT DATA=TEST_A; PROC SORT DATA=TEST_B; DATA TEST_C; MERGE TEST_A TEST_B; PROC SORT DATA=TEST_A; PROC SORT DATA=TEST_B; DATA TEST_C; MERGE TEST_A(IN=A) TEST_B (IN=B); IF (A=1); PROC SQL does not require different tables to be sorted the same way (or in any particular way) prior to a merge. If both sources contain the same column in the same format, a merger of the two tables can be performed in one step. The code to the left reflects a simple join between the two tables based on a single similar element; the code to the right performs conditional OUTER JOIN where all the data from TEST_A are retained and only the data matching the VAR element is added from TEST_B. PROC SQL; PROC SQL; CREATE TABLE TEST_C AS CREATE TABLE TEST_C AS SELECT * SELECT * FROM TEST_A A, TEST_B B FROM TEST_A A WHERE A.VAR = B.VAR; LEFT OUTER JOIN TEST_B B ON B.VAR = A.VAR; QUIT; QUIT; Depending on the objective of the programmer, the structure of the data source(s), and the desired query results, using both PROC SQL and the DATA step together can afford a considerable amount of flexibility and efficiency in manipulating data across various sources. PROC SQL does contain certain limitations from the DATA step. For example, PROC SQL cannot perform iterative processing on its own. In PROC SQL, datasets are processed relationally, based on columns (variables) in the dataset, whereas the DATA step works with the observations within the dataset itself on a row-by-row basis. For analysis requiring processing repeated measures of data, programming using the DATA step can be accomplish iterative processing of the dataset with one of several options (FIRST., LAST., RETAIN, OUTPUT, etc.). For PROC SQL, the answer is more nuanced than a straightforward option command, but not impossible with proper understanding of joins, nested queries and grouping of tables. The next sections compare the use the DATA step and PROC SQL for processing repeated measures in a dataset. REPEATED MEASURES AND DATA Repeated measures in data are part of categorical variables with multiple observations based on other criteria. Cross-sectional and longitudinal data are two examples of repeated measures. One of the most common ways repeated measures are displaying in longitudinal data is across categorical variables involving multivariate time 2

series data, or data containing different events with multiple overlapping characteristics. Consider medical data, where a group of patients whose visits to their provider occur over a period of several years. A database of longitudinal data could be maintained, containing all of the patients vital statistics 2. A second dataset contains demographic information on the patient, such as birth date, gender, etc. The following SAS code and output give an example of a dataset with repeated measures of Body-Mass-index (BMI) and age over time. /* Build Dataset by merging datasets with variables of interest */ data vital (keep= patid measure_dt birth_date age bmi gender); retain patid age gender measure_dt bmi; merge vitals demographics; age = round((measure_dt-birth_date)/365.25,1); by patid; NOTE: There were 9833 observations read from the data set WORK.VITALS. NOTE: There were 1000 observations read from the data set WORK.DEMOGRAPHICS. NOTE: The data set WORK.VITAL has 9833 observations and 6 variables. NOTE: DATA statement used (Total process time): proc print data=vital (firstobs=1 obs=10); /* Prints out the first ten observations in dataset */ LOG NOTE: There were 10 observations read from the data set WORK.VITAL. NOTE: PROCEDURE PRINT used (Total process time): OUTPUT The SAS System Obs patid age gender measure_dt bmi birth_date 1 100287662 52 M 11/28/2006 25.2 06/11/1954 2 100396209 59 M 10/12/2011 25.2 02/20/1953 3 100396209 58 M 01/15/2011 26.1 02/20/1953 4 100396209 58 M 12/18/2010 26.3 02/20/1953 5 100396209 57 M 08/04/2010 26.0 02/20/1953 6 100396209 57 M 07/29/2010 25.9 02/20/1953 7 100396209 56 M 06/26/2009 27.2 02/20/1953 8 100396209 55 M 04/02/2008 26.8 02/20/1953 9 100396209 54 M 04/06/2007 26.6 02/20/1953 10 100396209 53 M 04/21/2006 26.0 02/20/1953 As the output above shows, patients are outlined in the dataset by PATID (patient ID), MEASURE_DT (date of vitals measurement), and BMI measurement corresponding with the MEASURE_DT. Patient identifiers (PATID, AGE, BIRTH_DATE, and GENDER) repeat across the multiple vitals records for each patient. COHORT DESIGN Programmers working with datasets structured with repeated measures often need to subset these datasets based on particular criteria. In epidemiological research, building cohorts of patient populations based on the first instance 2 Dates and patient IDs have been offset to ensure sample data presented in this paper contains no PHI and is completely de-identified. 3

of a condition within a time period is typical first step before more sophisticated analysis can take place. Suppose a researcher wanted to build a cohort of patients who s first recorded BMI >= 35 occurred between January 1, 2005 and December 31, 2006. Using the DATA step, this can be accomplished in three simple steps. First, a dataset containing the cohort criteria is built, then it is sorted by the identifier desired for the first instance of cohort criteria (in this case PATID), and finally the first PATID/BMI record within the 2005-2006 observation is taken using the FIRST.PATID option. The SAS below annotates the steps. /* Step 1- Define cohort criteria- date range, BMI threshold, age threshold */ DATA step1; set vital; if '01jan05'd le measure_dt le '31dec06'd and bmi ge 35; /* Step 2- Sort Data by PATID and MEASURE_DT to order rows for FIRST. option in Step3 */ proc sort data=step1 out=step2; by patid measure_dt; /* Step 3- Identify first instance of cohort criteria */ data bmi_firstobs_data; set step2; by patid; if first.patid; proc sort data=bmi_firstobs_data; by measure_dt patid; proc print data=bmi_firstobs_data (firstobs=1 obs=10); LOG NOTE: There were 9833 observations read from the data set WORK.VITAL. NOTE: The data set WORK.STEP1 has 540 observations and 6 variables. NOTE: DATA statement used (Total process time): NOTE: There were 540 observations read from the data set WORK.STEP1. NOTE: The data set WORK.STEP2 has 540 observations and 6 variables. NOTE: PROCEDURE SORT used (Total process time): NOTE: There were 540 observations read from the data set WORK.STEP2. NOTE: The data set WORK.BMI_FIRSTOBS_DATA has 106 observations and 6 variables. NOTE: DATA statement used (Total process time): NOTE: There were 106 observations read from the data set WORK.BMI_FIRSTOBS_DATA. NOTE: The data set WORK.BMI_FIRSTOBS_DATA has 106 observations and 6 variables. NOTE: PROCEDURE SORT used (Total process time): NOTE: There were 10 observations read from the data set WORK.BMI_FIRSTOBS_DATA. 4

NOTE: PROCEDURE PRINT used (Total process time): OUTPUT The SAS System Obs patid age gender measure_dt bmi birth_date 1 188318065 77 F 01/03/2005 40.6 11/28/1927 2 366274576 66 F 01/04/2005 44.5 12/13/1938 3 124896603 85 M 01/06/2005 36.8 11/24/1919 4 230174462 64 F 01/06/2005 37.1 05/21/1941 5 126911638 42 F 01/07/2005 49.6 06/15/1963 6 366814101 27 F 01/10/2005 36.0 11/23/1977 7 469822039 73 F 01/13/2005 35.4 09/08/1931 8 186327123 55 F 01/15/2005 39.1 10/22/1949 9 314816680 38 M 01/16/2005 36.8 06/25/1967 10 172611809 69 F 01/18/2005 40.1 02/08/1936 The dataset BMI_FIRSTOBS_DATA contains a subset of the original vitals measures defined in the dataset VITAL who fit the BMI and measurement date criteria defined above. Using PROC SQL to build the same cohort requires a little more intuition but can save time in both processing speed and CPU resources. Consider original VITALS and DEMOGRAPHICS tables and the steps above to merge and subset the two data sources. Using several nested queries, or queries-within-queries, PROC SQL can accomplish the same cohort construction in a single PROC SQL statement. The code begins identifying the final structure of the cohort table, namely the columns (variables) desired, grouping, ordering, etc. Using the SELECT DISTINCT function syntax (a unique row identifier in PROC SQL similar to the NODUPKEY option in the DATA step), the query limits results to retain only unique PATID/AGE/GENDER/MEASURE_DT/BMI combinations. The FROM statement would normally reference a dataset where the source data is located; in this case it begins the first nested query (N1), beginning with a new SELECT DISTINCT statement and column selections. While the table structure in N1 reflects the same structure as the table being created a layer above (i.e. the same columns are being pulled, albeit in a different order), the column names are proceeded with a table alias. These aliases reference tables and are placeholders for the PROC SQL statement to identify which table contains which column of interest. The second FROM statement leads into the second nested query (N2), which is where the cohort is actually defined under the WHERE statement (BMI >= 35, MEASURE_DT between January 1, 2005 and December 31, 2006). N2 only selects the columns PATID and MIN(MEASURE_DT), which reflects the first or numerically lowest measure date defined by the query. N2 also contains both a SELECT DISTINCT and GROUP BY statement, which under normal circumstances would be redundant as both statements reference PATID and would thus perform the same grouping/de-duping of the data. However, using both statements together ensures that the MIN(VITALS.MEASURE_DT) statement is applied to each PATID. Without both the SELECT DISTINCT PATID and GROUP BY PATID statements, the MIN(VITALS.MEASURE_DT) would be applied globally to the entire dataset, generating the same minimum measure date for the entire population instead of locally within each patients observed BMI records in the observation period. N2 is closed with a table alias A and then rejoined to the VITALS table to obtain the BMI record associated with the MIN(MEASURE_DT) and BMI selection criteria in N2. By joining N2 to VITALS on both matching PATID and MEASURE_DT, N1 is validated to reflect the correct minimum measure date and associated BMI reading. N1 is also joined to the DEMOGRAPHICS table to pull in the desired demographics information. N1 is then closed, and the initial query runs, generating the summary table BMI_FIRSTOBS_SQL. The final table populates the desired fields with the data obtained in N1, which in turn was generated from the data obtained in N2. Finally, BMI_FIRSTOBS_SQL is ordered by PATID and MEASURE_DT to match the repeated measures structure of BMI_FIRSTOBS_DATA. A PROC COMPARE is performed on the two datasets to verify their identical nature. /* Nested PROC SQL */ proc sql; create table bmi_firstobs_sql as 5

select distinct patid, round((measure_dt-birth_date)/365.25,1) as age, gender, measure_dt, bmi from /* Begin first nested query N1 */ (select distinct a.patid, A.measure_dt, demo.birth_date, vitals.bmi as bmi, demo.gender from /* Begin second nested query N2 */ (select distinct vitals.patid, min(vitals.measure_dt) as measure_dt format=mmddyy10. from vitals vitals where vitals.bmi >= 35 /* BMI Cohort Criteria */ and vitals.measure_dt between '01jan05'd and '31dec06'd /* Observation Period Criteria */ group by vitals.patid) A /* End N2 */ /* Rejoin A to VITALS to ensure retained MEASURE_DT matches MEASURE_DT from table A */ left outer join vitals vitals /* VITALS table alias */ on A.patid = vitals.patid and A.measure_dt = vitals.measure_dt /* Join second nested query containing data from the VITALS to DEMOGRAPHICS table */ left outer join demographics demo /* DEMOGRAPHICS table alias */ on demo.patid = A.patid) /* End N1 */ order by patid, measure_dt; quit; LOG NOTE: Table WORK.BMI_FIRSTOBS_SQL created, with 106 rows and 5 columns. NOTE: PROCEDURE SQL used (Total process time): 0.07 seconds 0.03 seconds /* Verify */ /* Sort bmi_firstobs_data to match bmi_firstobs_sql */ proc sort data=bmi_firstobs_data; by patid measure_dt ; proc compare base=bmi_firstobs_data compare=bmi_firstobs_sql; LOG NOTE: There were 106 observations read from the data set WORK.BMI_FIRSTOBS_DATA. NOTE: The data set WORK.BMI_FIRSTOBS_DATA has 106 observations and 6 variables. NOTE: PROCEDURE SORT used (Total process time): NOTE: There were 106 observations read from the data set WORK.BMI_FIRSTOBS_DATA. NOTE: There were 106 observations read from the data set WORK.BMI_FIRSTOBS_SQL. 6

NOTE: PROCEDURE COMPARE used (Total process time): OUTPUT The SAS System The COMPARE Procedure Comparison of WORK.BMI_FIRSTOBS_DATA with WORK.BMI_FIRSTOBS_SQL (Method=EXACT) Data Set Summary Dataset Created Modified NVar NObs WORK.BMI_FIRSTOBS_DATA 03AUG11:15:48:54 03AUG11:15:48:54 6 106 WORK.BMI_FIRSTOBS_SQL 03AUG11:15:46:08 03AUG11:15:46:08 5 106 Variables Summary Number of Variables in Common: 5. Number of Variables in WORK.BMI_FIRSTOBS_DATA but not in WORK.BMI_FIRSTOBS_SQL: 1. Observation Summary Observation Base Compare First Obs 1 1 Last Obs 106 106 Number of Observations in Common: 106. Total Number of Observations Read from WORK.BMI_FIRSTOBS_DATA: 106. Total Number of Observations Read from WORK.BMI_FIRSTOBS_SQL: 106. Number of Observations with Some Compared Variables Unequal: 0. Number of Observations with All Compared Variables Equal: 106. NOTE: No unequal values were found. All values compared are exactly equal. AGGREGATION The power and efficiency of nested PROC SQL statements can be shown when generating simple, summary datasets with aggregate results. Borrowing from the previous example, a programmer can simplify output datasets into a crosstab of counts of patients by year, age and BMI groupings. Using PROC FORMAT to establish BMI and age groups for aggregation, both the DATA step and PROC SQL can generate the same summarized datasets of aggregate counts of patients. The code below lists the steps to generate output using both methods. First, the format AGEGRP_ and BMIGRP_ roll up the AGE and BMI variable records into 10-unit increments, respectively. The dataset STEP5 aggregates individual measure dates into years using the YEAR function. Finally, a PROC FREQ generates the table FREQSUMMARY with a patient count crosstab of YEAR-by-AGEGRP-BMI. While the DATA step requires an additional three steps in coding to generate these summary results, PROC SQL creates the same dataset in one statement using the same nested query with only minor adjustments to the output dataset structure. The dataset BMI_SUMMARY_SQL is created using the same nested query structure from the previous example, and like the DATA step to create STEP5, modifies the AGE and BMI columns using the AGEGRP_ and BMIGRP_ formats and MEASURE_DT using the YEAR function. Instead of a SELECT DISTINCT syntax, PATID is aggregated to counts using the COUNT DISTINCT function in PROC SQL. Finally, 7

BMI_SUMMARY_SQL is aggregated using the GROUP BY statement below, generating a crosstab of patient counts by the new columns YEAR, AGEGRP and BMIGRP. /* Aggregation */ /* Formats */ proc format; /* Age Groups */ value agegrp_.= '00: NULL ' low -< 15 = '01: 1-17 yrs old' 15 -< 25 = '02: 18-24 yrs old' 25 -< 35 = '03: 25-34 yrs old' 35 -< 50 = '04: 35-49 yrs old' 50 -< 65 = '05: 50-64 yrs old' 65 - high = '06: 65+ yrs old' ; /* BMI Groups */ value bmigrp_.= '00: NULL ' low -< 20 = '01: BMI under 20' 20 -< 30 = '02: BMI 20-29' 30 -< 40 = '03: BMI 30-39' 40 -< 50 = '04: BMI 40-49' 50 - high = '05: BMI 50+' ; quit; /* Step 4- Sort data for aggregation */ proc sort data=bmi_firstobs_data out=step4; by patid; /* Step 5- Aggregate data */ DATA step5 (keep=year patid agegrp bmigrp); set step4; by patid; if first.patid; agegrp=put(age,agegrp_.); bmigrp=put(bmi, bmigrp_.) ; year= year(measure_dt); *need to format dates to year; /* Step 6- Summarize data */ proc freq data=step5; tables year*agegrp*bmigrp /list noprint nopercent out = freqsummary; /* Nested PROC SQL Summary */ proc sql; create table bmi_summary_sql as select year(measure_dt) as year, put(round((measure_dt-birth_date)/365.25,1), agegrp_.) as agegrp /* Format for aggregation */, put(bmi, bmigrp_.) as bmigrp /* Format for aggregation */, count(distinct patid) as count /* Count patients for summarization */ from (select distinct a.patid, a.measure_dt, demo.birth_date, vitals.bmi as bmi, demo.gender from (select distinct vitals.patid 8

, min(vitals.measure_dt) as measure_dt format=mmddyy10. from vitals vitals where vitals.bmi >= 35 and vitals.measure_dt between '01jan05'd and '31dec06'd group by vitals.patid) a left outer join demographics demo on demo.patid = a.patid left outer join vitals vitals on a.patid = vitals.patid and a.measure_dt = vitals.measure_dt) group by year, agegrp, bmigrp /* Aggregate patient counts by GROUP BY statement */ order by year, agegrp, bmigrp ; quit; /* Validate results */ proc compare base=freqsummary compare=bmi_summary_sql; LOG NOTE: Format AGEGRP_ has been output. NOTE: Format BMIGRP_ has been output. NOTE: PROCEDURE FORMAT used (Total process time): NOTE: Input data set is already sorted; it has been copied to the output data set. NOTE: There were 106 observations read from the data set WORK.BMI_FIRSTOBS_DATA. NOTE: The data set WORK.STEP4 has 106 observations and 6 variables. NOTE: PROCEDURE SORT used (Total process time): NOTE: There were 106 observations read from the data set WORK.STEP4. NOTE: The data set WORK.STEP5 has 106 observations and 4 variables. NOTE: DATA statement used (Total process time): 0.04 seconds 0.03 seconds NOTE: There were 106 observations read from the data set WORK.STEP5. NOTE: The data set WORK.FREQSUMMARY has 21 observations and 5 variables. NOTE: PROCEDURE FREQ used (Total process time): 0.03 seconds NOTE: Table WORK.BMI_SUMMARY_SQL created, with 21 rows and 4 columns. NOTE: PROCEDURE SQL used (Total process time): 0.10 seconds 378 proc compare base=freqsummary compare=bmi_summary_sql; 379 NOTE: There were 21 observations read from the data set WORK.FREQSUMMARY. NOTE: There were 21 observations read from the data set WORK.BMI_SUMMARY_SQL. 9

NOTE: PROCEDURE COMPARE used (Total process time): OUTPUT The SAS System The COMPARE Procedure Comparison of WORK.FREQSUMMARY with WORK.BMI_SUMMARY_SQL (Method=EXACT) Data Set Summary Dataset Created Modified NVar NObs WORK.FREQSUMMARY 03AUG11:17:35:47 03AUG11:17:35:47 5 21 WORK.BMI_SUMMARY_SQL 03AUG11:17:35:58 03AUG11:17:35:58 4 21 Variables Summary Number of Variables in Common: 4. Number of Variables in WORK.FREQSUMMARY but not in WORK.BMI_SUMMARY_SQL: 1. Number of Variables with Differing Attributes: 1. Listing of Common Variables with Differing Attributes Variable Dataset Type Length Label COUNT WORK.FREQSUMMARY Num 8 Frequency Count WORK.BMI_SUMMARY_SQL Num 8 Observation Summary Observation Base Compare First Obs 1 1 Last Obs 21 21 Number of Observations in Common: 21. Total Number of Observations Read from WORK.FREQSUMMARY: 21. Total Number of Observations Read from WORK.BMI_SUMMARY_SQL: 21. Number of Observations with Some Compared Variables Unequal: 0. Number of Observations with All Compared Variables Equal: 21. NOTE: No unequal values were found. All values compared are exactly equal. CONCLUSION PROC SQL has considerable advantages in coding efficiency over the DATA step, particularly when sub setting data on repeated measures. While the intuition behind using PROC SQL to generate FIRST. procedures isn t as apparent as through the traditional DATA step, it s concise presentation in both coding structure and minimal output can help expedite program generation and distribution across users. Creating aggregate counts and summaries off of tables generated in PROC SQL add to the efficiency available of this robust procedure. While the traditional DATA step can be more efficient in terms of query speed and table generation for generating incidence/prevalence in many instances, the PROC SQL method presented in this paper has coding advantages with respect to data organization and transference to other SQL-based software platforms. 10

REFERENCES Dickstein, Craig and Pass, Ray, DATA Step vs. PROC SQL: What s A Neophyte To Do?, Proceedings from the Twenty-Ninth Annual SAS Users Group International Conference, paper 296, 2004. Fain, Terry and Gareleck, Cyndie (254-30), Using SAS to Process Repeated Measures Data, Proceedings of the Thirtieth Annual SAS Users Group International Conference, paper 254, 2005. Matthews, Joann, PROC SQL Versus The DATA Step, Northeastern SAS Users Group Conference, 2006 SAS Institute Inc. (1990), Base SAS 9.2 SAS Procedure Guide, Third Edition, Cary, NC: SAS Institute Inc. Severino, Richard, PROC FREQ: It s More Than Counts, Proceedings of the twenty-fifth Annual SAS Users Group International Conference, paper69, 2000 CONTACT INFORMATION Your comments and questions are valued and encouraged. Contact the author at: David Tabano Kaiser Permanente 2550 S. Parker Rd., Suite 300 Aurora, CO 80014 (303) 614-1348 (work) (303) 636-1305 (fax) david.c.tabano@kp.org SAS and all other SAS Institute Inc. product or service names are registered trademarks or trademarks of SAS Institute Inc. in the USA and other countries. indicates USA registration. Other brand and product names are trademarks of their respective companies. 11