Anatomy of a Merge Gone Wrong James Lew, Compu-Stat Consulting, Scarborough, ON, Canada Joshua Horstman, Nested Loop Consulting, Indianapolis, IN, USA

PharmaSUG 2013 - Paper TF22

ABSTRACT

The merge is one of the SAS programmer's most commonly used tools. However, it can be fraught with pitfalls for the unwary user. In this paper, we look under the hood of the DATA step and examine how the program data vector works. We see what's really happening when datasets are merged and how to avoid subtle problems.

INTRODUCTION

Anyone who has spent much time programming with SAS has likely needed to combine data from multiple datasets into a single dataset. This is most commonly done with the MERGE statement within a DATA step. While the merge seems like a relatively simple and straightforward process, there are many traps waiting to snare the unsuspecting programmer. In a seminal pair of papers, Foley (1997, 1998) catalogs some 28 potential traps related to merging. These range from rather mundane oversights, such as omitting the BY statement, to more esoteric matters relating to the inner workings of SAS. The latter are far more interesting, and this paper will focus on one such trap: the automatic retain.

A SIMPLE EXAMPLE

THE DATA

Consider the following example. We have two SAS datasets containing data collected in a medical research study. The DEMOG dataset contains demographic information such as the patient's age and weight. This information is recorded only once at the beginning of the study, so there is only one record per patient. The VITALS dataset contains vital signs measurements such as heart rate. These measurements are recorded at each study visit, so there can be multiple records per patient. Both datasets include a patient identification number, which serves as the key for combining them. In DEMOG, this number uniquely identifies a record. The VITALS dataset also includes a visit number, and the combination of the patient identification number and the visit number uniquely identifies a particular record.

DEMOG Dataset

PATIENT_ID  AGE  WEIGHT
1           42   185
2           55   170
3           30   160

VITALS Dataset

PATIENT_ID  VISIT  HEART_RATE
1           1      60
1           2      58
2           1      74
2           2      72
2           3      69
3           1      71

THE MERGE

We wish to merge these two datasets. We also wish to convert the patient's weight from pounds to kilograms. We write the following SAS code:

    data alldata;
      merge demog vitals;
      by patient_id;
      weight = weight / 2.2;
    run;
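For readers who want to run the example, the two input datasets described above could be created along these lines. This is a minimal sketch: the use of DATALINES and list input is our assumption, not something the paper specifies, and the values are copied from the DEMOG and VITALS listings.

```sas
/* DEMOG: one record per patient (values taken from the listing above) */
data demog;
  input patient_id age weight;
  datalines;
1 42 185
2 55 170
3 30 160
;
run;

/* VITALS: one record per patient per visit */
data vitals;
  input patient_id visit heart_rate;
  datalines;
1 1 60
1 2 58
2 1 74
2 2 72
2 3 69
3 1 71
;
run;
```

Both datasets are entered already sorted by PATIENT_ID, so no PROC SORT is needed before a match-merge on that variable.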

As expected, the dataset resulting from the merge contains 5 variables and 6 records. However, a careful inspection of the WEIGHT variable reveals a serious error. Notice that the value of WEIGHT changes for each record, even within the same patient. This is clearly not the desired result.

ALLDATA Dataset

PATIENT_ID  AGE  WEIGHT   VISIT  HEART_RATE
1           42   84.0909  1      60
1           42   38.2231  2      58
2           55   77.2727  1      74
2           55   35.1240  2      72
2           55   15.9654  3      69
3           30   72.7273  1      71

In order to explain these strange results, we need to take a look under the hood of the DATA step and discuss the program data vector.

PROGRAM DATA VECTOR

The program data vector (PDV) is a temporary location in memory that SAS uses during the normal processing of a DATA step. The structure of the PDV is determined during DATA step compilation. At execution time, data read in by statements such as SET, MERGE, and INPUT are placed into the appropriate locations in the PDV. DATA step statements that manipulate the values of dataset variables are actually interacting with the PDV. When it is time for an output record to be written, the contents of the PDV are copied to the output dataset. We'll walk through this process step by step using our example from above.

COMPILATION

When you submit a DATA step to SAS, it is processed in two distinct phases: compilation and execution. During the compilation phase, the program data vector is constructed. As SAS compiles our example code above, the first statement that affects construction of the PDV is the MERGE statement. The first dataset listed on the MERGE statement is DEMOG, which includes three variables: PATIENT_ID, AGE, and WEIGHT. All three are included in the PDV using the same attributes (length, format, label, etc.) present in the input dataset. The next dataset listed is VITALS, which includes three variables: PATIENT_ID, VISIT, and HEART_RATE. Since PATIENT_ID is already on the PDV, only the latter two are added.

Upon the completion of DATA step compilation, the following PDV structure is in place. Note that no actual values have been written to the PDV yet, so every variable is shown as missing (a period). Values are written during the execution phase.

PATIENT_ID  AGE  WEIGHT  VISIT  HEART_RATE
.           .    .       .      .
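One way to look at this compile-time structure directly is sketched below. It assumes the DEMOG and VITALS datasets exist in the WORK library, and it uses a SET statement guarded by IF 0 (a common idiom, substituted here for the MERGE) purely so that the variables are added to the PDV without any record being read.

```sas
data _null_;
  /* IF 0 is never true, so no record is read at execution time, */
  /* but the compiler still adds the variables to the PDV.       */
  if 0 then set demog vitals;
  put _all_;   /* write the current PDV contents to the log */
  stop;        /* end the step after a single pass */
run;
```

The log should list each of the five variables with a missing value, in the order in which they were added to the PDV, along with the automatic variables _ERROR_ and _N_.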

EXECUTION

As we discuss the execution of the DATA step, it is important to remember that a DATA step is essentially a loop. The statements in the DATA step are executed repeatedly until certain conditions are met that cause execution to terminate. One such condition is a SET or MERGE statement that runs out of new records to read from all of the input datasets listed within the statement. In the meantime, the contents of the PDV are written to the specified output dataset each time execution returns to the top of the DATA step (unless you override this behavior using statements such as OUTPUT).

As our example code begins, the first statement to execute is the MERGE statement. Since the DEMOG dataset is listed first, the first record from DEMOG is read into the PDV. Next, the first record from the VITALS dataset is read. Since both datasets contain the PATIENT_ID variable, the value from VITALS overwrites what had been previously read from DEMOG. Fortunately, since PATIENT_ID is a BY variable, it has the same value on both datasets. Once the MERGE statement has executed for the first time, the PDV looks like this:

PATIENT_ID  AGE  WEIGHT  VISIT  HEART_RATE
1           42   185     1      60

The next statement to execute is our weight conversion. This statement reads the value of WEIGHT from the PDV, divides it by 2.2, and then writes the result back to the PDV. After this statement executes, we have the following PDV:

PATIENT_ID  AGE  WEIGHT   VISIT  HEART_RATE
1           42   84.0909  1      60

We have now reached the bottom of the DATA step. Execution returns to the top and the current contents of the PDV are written to the ALLDATA dataset. So far, everything is proceeding exactly as expected.

AUTOMATIC RETAIN

There is a common misconception that the values in the PDV are reset to missing when execution returns to the top of the DATA step. This is only true for variables which are assigned values by an INPUT or assignment statement (unless overridden by a RETAIN statement). For variables read with a SET, MERGE, MODIFY, or UPDATE statement, the values are automatically retained from one iteration of the DATA step to the next.

In our example, all of the variables on the PDV were read with a MERGE statement, so all values are retained. When the second iteration of the DATA step begins, the PDV looks just like it did when the first iteration ended. Next, the MERGE statement executes again. Since the DEMOG dataset does not contain any more records for the current BY group (PATIENT_ID = 1), nothing is read from DEMOG. There is still one record for the current BY group in the VITALS dataset, so the values from that record are copied to the PDV. Since nothing was read from DEMOG, the existing values of AGE and WEIGHT survive. The PDV now has the following state:

PATIENT_ID  AGE  WEIGHT   VISIT  HEART_RATE
1           42   84.0909  2      58

Now we come once again to the weight conversion statement. The current value of WEIGHT (84.0909) is read from the PDV and divided by 2.2, and the result (38.2231) is written back to the PDV. Having reached the end of the DATA step, the contents of the PDV are written out as the second record of the output dataset.

At last we have uncovered the source of our problem. The value of WEIGHT is read only once for each BY group, while the weight conversion statement executes once for each iteration of the DATA step. WEIGHT is therefore divided by 2.2 again on every iteration until the end of the BY group is reached.

CORRECTING THE CODE

Now that we understand what is causing this unexpected behavior, what can we do about it? The safest and most conservative option is to limit all merges to the required statements and perform additional processing in a separate DATA step.

    data step1;
      merge demog vitals;
      by patient_id;
    run;

    data step2;
      set step1;
      weight = weight / 2.2;
    run;

However, it is not always necessary to take such drastic action. This merge can be made to perform as expected within a single DATA step by simply renaming one of the input variables as follows:

    data alldata(drop=weight_lbs);
      merge demog(rename=(weight=weight_lbs)) vitals;
      by patient_id;
      weight = weight_lbs / 2.2;
    run;

As shown below, this modified code produces the output dataset we were expecting. Since WEIGHT_LBS is retained but not modified, each record within a given BY group will have the same value of WEIGHT.

ALLDATA Dataset - Corrected

PATIENT_ID  AGE  WEIGHT   VISIT  HEART_RATE
1           42   84.0909  1      60
1           42   84.0909  2      58
2           55   77.2727  1      74
2           55   77.2727  2      72
2           55   77.2727  3      69
3           30   72.7273  1      71
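The walkthrough of the original, uncorrected merge can be checked by tracing the PDV in the log. The sketch below is only an illustration: the PUTLOG statements and their label text are our additions, and the step writes to _NULL_ so that no dataset is created.

```sas
data _null_;
  merge demog vitals;
  by patient_id;
  /* WEIGHT as it stands immediately after the MERGE statement executes */
  putlog 'After MERGE:      ' _n_= patient_id= visit= weight=;
  weight = weight / 2.2;
  /* WEIGHT after the conversion has been written back to the PDV */
  putlog 'After assignment: ' _n_= patient_id= visit= weight=;
run;
```

If the automatic retain works as described, then on the second and later records of a BY group the value of WEIGHT reported after the MERGE should equal the already-converted value from the previous iteration rather than the original weight in pounds.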

CONCLUSION

Merging datasets is one of the most basic and common functions performed in SAS. However, the underlying procedure is more complex than it might first appear. It is therefore imperative for an effective SAS programmer to be equipped with a thorough understanding of the program data vector in order to avoid mistakes like the one discussed in this paper. See Johnson (2012) for a comprehensive treatment of the program data vector and Virgile (2000) for additional discussion of the PDV specifically as it relates to merging.

Even the most skilled programmer can sometimes overlook subtle traps like this. Thus, it is advisable to habitually practice certain programming techniques to defend against these errors. As we have discussed, the surest way to steer clear of many problems in a merge is to avoid adding statements beyond those required for the merge: the DATA statement, the MERGE statement, the BY statement, possibly a subsetting IF statement, and of course the RUN statement.

Some may find this practice overly cumbersome. At the very least, one should refrain from modifying the values of variables in a dataset being merged in the same DATA step unless one can be absolutely certain that no dataset will ever contain more than one record within a BY group. Creating new variables can generally be done safely since the values of new variables are not automatically retained across iterations of the DATA step.

REFERENCES

Foley, Malachy J. "Advanced MATCH-MERGING: Techniques, Tricks, and Traps." Proceedings of the Twenty-Second Annual SAS Users Group International. Cary, NC: SAS Institute Inc., 1997. Paper 39.

Foley, Malachy J. "MATCH-MERGING: 20 Some Traps and How to Avoid Them." Proceedings of the Twenty-Third Annual SAS Users Group International. Cary, NC: SAS Institute Inc., 1998. Paper 47.

Johnson, Jim. "The Use and Abuse of the Program Data Vector." Proceedings of the SAS Global Forum 2012 Conference. Cary, NC: SAS Institute Inc., 2012. Paper 255.

Virgile, Bob. "How MERGE Really Works." Proceedings of the Pharmaceutical Industry SAS Users Group 2000 Annual Conference. Chapel Hill, NC: PharmaSUG, 2000. Paper DM12.

CONTACT INFORMATION

Your comments and questions are valued and encouraged. Contact the authors at:

James Lew
Compu-Stat Consulting
James.Lew@Rogers.com

Joshua M. Horstman
Nested Loop Consulting
jmhorstman@gmail.com

SAS and all other SAS Institute Inc. product or service names are registered trademarks or trademarks of SAS Institute Inc. in the USA and other countries. ® indicates USA registration. Other brand and product names are trademarks of their respective companies.