STOP MERGING AND START COMBINING
Robert S. Nicol, U.S. Quality Algorithms

There are many ways to combine data within the SAS system. Probably the most widely used method is the merge statement. While the merge is very powerful, it is often misunderstood and overused by the novice SAS programmer. This paper illustrates some of the common data combination methods and provides guidelines for choosing the most effective technique. The data combination routines covered by this paper are formats, the merge statement, the update statement, the set statement, proc sql, proc append, and proc datasets. This paper is intended to be used as an ongoing rapid reference and as such is designed to get you quickly to the relevant facts and syntax.

The best way to ensure the results of any project is to start with a plan. This axiom also holds true in the design of a program. An integral part of planning a program is having a thorough understanding of the available data. Armed with your knowledge of the data and your operating environment, the methods used to combine the data should be evident.

This paper reviews the processes that can be applied under three distinct sets of conditions:

- When you need to combine entire observations.
- When one file is a 'master', that is, you need to end up with only those original observations.
- When the data drive the resultant observations.

PRIMARY DECISION FACTORS

How are the observations related to one another?

- Should the observations remain intact?
- Are they related ordinally?
- Is one file a 'master' with fields added or modified based upon other files?
- If the files are related by 'keys', which observations are to survive?

How do the fields in the source files relate to the resultant file?

- Are all of the fields in all of the files required?
- Does the same field appear in more than one file?

Other considerations:

- If the files are related by 'keys', are the files sorted or indexed by those keys?
- How much memory is available to you?
- How often does this combination have to be performed?
- How large are the files?
- How often does the data change?
- The code must be maintainable.

The methods to follow are presented in increasing order of preference. However, the conditions at your site may dictate a different order. Additionally, the methods presented here will combine all observations and all variables. Should you need to modify the observations, you may be able to use data set options (where, obs, firstobs) or sub-setting statements (if, where, delete). The variables in the output data set may be controlled by data set options (drop, keep, rename) or statements (drop, keep, rename).
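As a minimal sketch of the data set options and sub-setting statements mentioned above, the step below subsets observations and controls variables while a data set is being read. The data set work.visits and its variables (id, site, score, comments) are hypothetical and exist only for this illustration.

```sas
/* Hypothetical input: five observations, four variables. */
data work.visits;
   input id site $ score comments $;
   datalines;
1 A 10 ok
2 B 20 ok
3 A 30 late
4 A 40 ok
5 B 50 ok
;
run;

/* Data set options on the input data set:                       */
/*   where=  subsets observations as they are read,              */
/*   drop=   leaves COMMENTS out of the step entirely,           */
/*   rename= changes SCORE to RESULT on the way in.              */
data work.site_a;
   set work.visits(where=(site = 'A')
                   drop=comments
                   rename=(score = result));
   /* Sub-setting IF statement: applied after the observation is read. */
   if result >= 30;
run;
```

With this made-up input, work.site_a keeps the two site A observations whose score is at least 30 (id 3 and id 4), with the variables id, site and result. The same options can be attached to any data set named in the set, merge, or update statements discussed in the rest of the paper.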

COMBINING ENTIRE OBSERVATIONS

INTERLEAVING FILES

Interleaving is the process of combining entire data sets into one resultant data set. The order of the observations is controlled via a "by" variable. The total number of observations in the final data set is equal to the total observations in the contributing files.

set statement

advantage: The command is safe and straightforward.
drawback: All files must be presorted or indexed.

    /* the by statement is what makes this an interleave */
    data inter;
       set one two;
       by key;
    run;

NOTE: The merge command can also be used to interleave data sets. However, if 'by variable' matches are found, the resultant file will be in error.

CONCATENATION

Concatenation is the process of combining entire data sets into one resultant data set. All of the observations in one data set are followed by all of the observations in the other data set. The order of the observations is controlled by the order of the data sets in the concatenation command. The total number of observations in the final data set is equal to the total observations in the contributing files.

proc append

advantage: As the 'base' data set is not rewritten, the computer resource requirements are significantly reduced.
drawback: No other processing can be performed. The base data set must be available for modification. If the step abends, then the base data set may be damaged.

    proc append force base=one data=two;
    run;

set statement

advantage: As the command is part of a data step, further processing can be accomplished without starting a new data or proc step.
drawback: All observations are read in and then written out, which requires CPU usage and workspace.

    data concat;
       set one two;
    run;

proc datasets

advantage: Since the 'base' data set is not rewritten, computer resource requirements are significantly reduced.
drawback: Only other 'proc datasets' processes can be performed. The base data set must be available for modification. If the proc abends, the base data set may be damaged.

    proc datasets library=work force;
       append base=one data=two;
    quit;

THE OBSERVATIONS IN ONE FILE ARE THE ONLY REQUIRED OBSERVATIONS

The observations in the 'base' data set are to be the only observations in the resultant file. The fields in each observation are a combination of the fields in the source files.

proc sql

advantage: The files do not have to be presorted.
drawback: The proc requires large amounts of work space. The syntax becomes tiring if the files have a large number of fields. The secondary files must be pre-selected to have unique keys.

    proc sql;
       create table oneto1 as
       /* observations in one that have a match in two */
       select one.key, two.field_a
          from one, two
          where one.key = two.key
       union
       /* observations in one with no match in two */
       select key, field_a
          from one
          where key not in (select key from two)
       order by key;
    quit;

merge statement

advantage: The statement handles large numbers of fields and files with simple syntax. As the command is part of a data step, further processing can be accomplished without starting a new data or proc step.
drawback: All files must be presorted or indexed. The secondary files must be sorted with the 'nodupkeys' option.

NOTE: If the same variable occurs in more than one data set, then the resultant value will come from the rightmost data set in the merge statement.

    data manyto1;
       merge one(in=in_one) two;
       by key;
       if in_one;      /* keep only the observations from the base file */
    run;

update statement

advantage: Allows multiple 'transaction' records per 'master'. As the command is part of a data step, further processing can be accomplished without starting a new data or proc step.
drawback: Fields cannot be added to the master file. All files must be presorted or indexed.

    data manyto1;
       update one(in=in_one) two;
       by key;         /* update requires a by statement */
       if in_one;
    run;

formats

advantage: As the format(s) is created from the transaction file, the primary file does not have to be sorted or indexed. It may be possible not to modify the base file at all, but merely apply the formats at time of output. Because the format is applied within a data step, further processing can be accomplished without starting a new data or proc step. If the transaction file does not change often, consider saving the format in a permanent file.
drawback: If many fields are to be added, the overall complexity of the code and the CPU resources will become burdensome.

NOTE: Consider writing a macro to create formats from data sets, thus reducing the coding to create formats.

    /* build a control data set for proc format from the transaction file */
    data cntltwo(keep=fmtname type start label);
       set two end=eof;
       format start $5.;
       fmtname = 'two_fmt';
       type = 'c';
       start = key;
       label = field_a;
       output;
       if eof then do;
          start = 'other';
          label = ' ';
          output;
       end;
    run;

    proc format cntlin=cntltwo;
    run;

    /* look up field_a for each key in one by applying the format */
    data withform;
       set one;
       from_2 = put(key, $two_fmt.);
    run;

RESULTANT OBSERVATIONS ARE DEPENDENT UPON THE KEYS

In some situations you may need to let the data do the talking. That is, the number of records in the final data set will be a direct result of the number of matching keys in the source files.

One-to-Many

This is typified by combining one record from a reference file with many observations from a detail file.

proc sql

advantage: Files do not have to be presorted.
drawback: Requires large amounts of work space. The syntax becomes tiring if the files have a large number of fields.

    proc sql;
       create table manyto1 as
       select one.key, one.source1, two.field_a, two.source2
          from one, two
          where one.key = two.key
       union
       select key, source1, field_a
          from one
          where key not in (select key from two)
       order by key;
    quit;

merge statement

advantage: Many fields and files can be handled with simple syntax. As the command is part of a data step, further processing can be accomplished without starting a new data or proc step.
drawback: All files must be presorted or indexed.

    data manyto1;
       merge one(in=in_one) two(in=in_two);
       by key;
       if in_one and in_two;   /* keep only keys found in both files */
    run;

Many-to-Many

Many-to-many merges occur when you combine files that each have multiple occurrences of the key values.

Caution: In many years of experience in different industries, I have noted that most many-to-many merges are caused by invalid data (or a lack of understanding of the data) rather than a true need to perform a many-to-many combination.

proc sql

advantage: It works. Files do not have to be presorted.
drawback: Requires large amounts of work space. The syntax becomes lengthy if the files have many fields.

    proc sql;
       create table manytom as
       select two.*, three.source3, three.field_a
          from two(rename=(field_a=fld_a2)), three
          where two.key = three.key
          order by key, fld_a2;
    quit;
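To make the behavior of the SQL many-to-many combination concrete, here is a small sketch on invented data. The data sets work.m2m_two and work.m2m_three, their contents, and the output name work.m2m_result are hypothetical and are not taken from the paper; only the join pattern follows the proc sql approach described above.

```sas
/* Hypothetical detail file: key K1 occurs twice. */
data work.m2m_two;
   input key $ field_a $;
   datalines;
K1 a1
K1 a2
K2 a3
;
run;

/* Hypothetical second file: key K1 occurs three times. */
data work.m2m_three;
   input key $ source3 $;
   datalines;
K1 s1
K1 s2
K1 s3
K3 s4
;
run;

/* The SQL join pairs every K1 row in one file with every K1 row */
/* in the other. K2 and K3 have no partner and are not returned. */
proc sql;
   create table work.m2m_result as
   select t.key, t.field_a, h.source3
      from work.m2m_two as t, work.m2m_three as h
      where t.key = h.key
      order by t.key, t.field_a, h.source3;
quit;
```

With this input the result holds six observations, one for each of the 2 x 3 pairings of K1. Checking the observation count of the result against the product of the per-key counts in the source files is a quick way to confirm that a many-to-many combination did what was intended.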

DO NOT USE MERGE FOR MANY TO MANY

The merge statement uses the data from the rightmost data set that is still contributing data. Although the system only flags many-to-many merges as a warning, this will usually create erroneous data.

Roll your own

You may write your own 'many-to-many combiner'. This can be accomplished by performing a series of one-to-many merges, or by using indexes and pointers, or by...

POINTS TO REMEMBER

While programming SAS there is no substitute for reading the log and testing your logic. If you start off knowing your data (repeats of key values, etc.) and the result that you need (only the observations in the base data set, etc.), then choosing the data combination technique should be straightforward. If a method does not feel right, it may not be. Test it. Then compare the results and the resource usage.

CONTACT INFORMATION

Robert S. Nicol
telephone: 215-654-5813 (day), 610-944-0884 (evening)
fax: 215-654-6007