Quicker Than Merge? Kirby Cossey, Texas State Auditor's Office, Austin, Texas

Paper 076-29

ABSTRACT

How many times do you need to extract a few records from an extremely large dataset?

INTRODUCTION

In a typical audit, our office receives data with which we have no experience. We spend a lot of time understanding the data, and the many different questions that arise require a myriad of unrelated queries. In this case we had 11 monthly files ranging from 7,172,501 to 10,198,223 observations, with 97 fields and a record length of 499. The compressed SAS files ranged in size from 2.3 GB to 3.2 GB. Several fields were indexed, including PCN.

A smaller file containing 11,182 unique PCN values was developed for a particular set of queries. This file was then used to extract all related records from the 11 extremely large files. The smaller file also has the PCN variable indexed and was sorted by the same field.

THE CHALLENGE

The challenge is to retrieve this set of records in a reasonable time and to develop a reasonable methodology for future extractions based on other fields. This work was being done on a mainframe but was moved to a SAS server connected to a Storage Area Network (SAN) with 232 GB. In all, 358,512 records were extracted from the 89,632,044 records in the 11 files.

THIS PAPER COMPARES FOUR TECHNIQUES:

1. traditional merge
2. using SAS to write the WHERE statement for you
3. the KEY= option on a SET statement
4. SQL join
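As a rough illustration of how a small key file like the one described in the introduction might be prepared, the sketch below removes duplicate PCN values, sorts the result, and indexes it. The data set names work.Candidates and work.UniquePCNs and the sample values are hypothetical; the paper does not show how its own key file was built.

```sas
/* Hypothetical candidate list; the duplicate PCN is deliberate */
data work.Candidates;
  input PCN :$9.;
  datalines;
DDHVBVVBD
BQQXVVXTD
DDHVBVVBD
;
run;

/* Sort by PCN and keep one row per PCN value */
proc sort data=work.Candidates out=work.UniquePCNs nodupkey;
  by PCN;
run;

/* Add a simple unique index on PCN (a simple index takes the variable's name) */
proc datasets library=work nolist;
  modify UniquePCNs;
  index create PCN / unique;
quit;
```

Sorting first and indexing afterwards matters here, because rewriting a data set with PROC SORT does not carry an existing index along.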

TRADITIONAL MERGE

    %MACRO MONTH(MONTH1,MONTH2);
    DATA &Month2;
      Merge E2.All&Month2(In=A) E.UniquePCNs(In=B);
      By PCN;
      If A & B;   /* keep only PCNs present in both files */
    Run;
    %MEND MONTH;

    %MONTH(Oct,Oct00);
    %MONTH(Nov,Nov00);
    %MONTH(Dec,Dec00);
    %MONTH(Jan,Jan01);
    %MONTH(Feb,Feb01);
    %MONTH(Mar,Mar01);
    %MONTH(Apr,Apr01);
    %MONTH(May,May01);
    %MONTH(Jun,Jun01);

The bolded code shows what is unique to this technique.

The reason this took hours is that the large dataset was not sorted by the indexed field. Keep this in mind when designing your database. We were exploring many facets of this data (new to all of us), so re-sorting for each idea was not practical. On the idle SAS server it took about 23 minutes to sort each of the 11 indexed data sets by an indexed field, and about 4 minutes to build an index on one of the datasets. With the large dataset both indexed and sorted, this technique took about 11 seconds.

TRADITIONAL MERGE ON THE MAINFRAME

Each month took from 2:25:22.48 to 3:48:41.54 in real time, for a total of over 33 hours. The CPU time varied from 5:02.63 to 6:58.38, for a total of over 64 minutes of CPU time.

TRADITIONAL MERGE ON THE SAS SERVER

With nothing else running on the server, each month took from 52:16.60 to 1:56:32.04 in real time, for a total of over 16 hours. The CPU time varied from 3:11.60 to 4:00.48, for a total of over 57 minutes of CPU time.
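A match-merge reads both inputs in BY-variable order, which is why the physical order of the large file matters so much for this technique. The following is a minimal sketch on toy data, assuming hypothetical WORK data sets in place of the paper's E2 and E libraries; the Amount variable and all values are invented for illustration.

```sas
/* Toy stand-in for one large monthly file, deliberately unsorted */
data work.AllOct00;
  input PCN :$9. Amount;
  datalines;
DDHVBVVBD 10
BQQXVVXTD 20
ZZZZZZZZZ 30
BQQXVVXTD 40
;
run;

/* Toy stand-in for the small key file, already in PCN order */
data work.UniquePCNs;
  input PCN :$9.;
  datalines;
BQQXVVXTD
DDHVBVVBD
;
run;

/* Put the large file into PCN order before merging */
proc sort data=work.AllOct00;
  by PCN;
run;

/* Keep only the rows whose PCN appears in both files */
data work.Oct00;
  merge work.AllOct00(in=A) work.UniquePCNs(in=B);
  by PCN;
  if A and B;
run;
```

On this toy data the result should hold the three rows for BQQXVVXTD and DDHVBVVBD, and the ZZZZZZZZZ row is dropped. On files of the size described in the paper, the PROC SORT step is the expensive part.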

SAS WRITES THE WHERE STATEMENT

    Filename List 'E:\PProc LT 9\List.txt';

    DATA _Null_;
      File List;
      Set E.UniquePCNs END=EOF;
      By PCN;
      If _N_=1 Then Put "Where PCN IN (";
      If EOF Then Do;
        Put @1 "'" @2 PCN @11 "'";   /* last value: no trailing comma */
        Put ");";
      End;
      Else Put @1 "'" @2 PCN @11 "',";
    Run;

    %MACRO Month(Month1,Month2);
    DATA &Month2;
      Set Data.All&Month2;
      %INCLUDE List;
    Run;
    ...

The bolded code shows what is unique to this technique. Note the use of the %INCLUDE. The WHERE statement is written to a text file in the first DATA step and then used in the second DATA step.

Partial Log Listing

    27 MPRINT(MONTH): DATA Sep00;
    MPRINT(MONTH): Set Data.Sep00;
    MPRINT(MONTH): Where PCN IN ('BQQXVVXTD', 'BQVFBDQHD', 'DDHVBVVBD','QZVFBTKDH' );
    INFO: Index PCN not used. Sorting into index order may help.
    NOTE: The data set WORK.SEP00 has 32268 observations and 97 variables.
    NOTE: DATA statement used:
          real time 6:37.60
          user cpu time 54.10 seconds
          system cpu time 7.87 seconds
          Memory 802k
    MPRINT(MONTH): Run;
    MPRINT(MONTH): Proc Sort;
    MPRINT(MONTH): By PCN;

SAS WRITES THE WHERE STATEMENT ON THE MAINFRAME

Each month took from 5:57.56 to 8:21.72 in real time, for a total of over 74 minutes. The CPU time varied from 48.87 to 69.18 seconds, for a total of over 10 minutes of CPU time.

SAS WRITES THE WHERE STATEMENT ON THE SAS SERVER

With nothing else running on the server, each month took from 42.17 to 5:49.98 in real time, for a total of over 39 minutes. The CPU time varied from 0.86 to 69.87 seconds, for a total of over 9 minutes of CPU time.
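The same idea can be tried end to end on toy data. The sketch below is an assumption-laden variant of the technique, not the author's code: it writes the generated WHERE statement to a temporary fileref instead of the E:\ path, uses list-style PUT with a pointer adjustment instead of fixed columns, and reads hypothetical WORK data sets with invented values.

```sas
/* Toy stand-ins for the large monthly file and the small key file */
data work.AllOct00;
  input PCN :$9. Amount;
  datalines;
DDHVBVVBD 10
BQQXVVXTD 20
ZZZZZZZZZ 30
BQQXVVXTD 40
;
run;

data work.UniquePCNs;
  input PCN :$9.;
  datalines;
BQQXVVXTD
DDHVBVVBD
;
run;

/* Temporary file that will hold the generated WHERE statement */
filename List temp;

/* Step 1: write  where PCN in ('...', '...');  one value per line */
data _null_;
  file List;
  set work.UniquePCNs end=EOF;
  if _N_ = 1 then put "where PCN in (";
  /* +(-1) backs up over the blank that list output adds after PCN */
  if EOF then put "'" PCN +(-1) "');";
  else        put "'" PCN +(-1) "',";
run;

/* Step 2: pull the generated statement into the extraction step */
data work.Oct00;
  set work.AllOct00;
  %include List;
run;
```

Because the WHERE statement is plain text, the large file needs neither sorting nor an index for this to work, although the generated list grows with the number of key values.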

KEY= SET OPTION

This complicated use of the KEY= method will return multiple records from the master dataset.

    %MACRO MONTH(MONTH1,MONTH2);
    DATA &Month2;
      set E.UniquePCNs;
      do until(_iorc_=%sysrc(_dsenom));
        set E2.Data&Month2 Key=PCN;
        select(_iorc_);
          when (%sysrc(_sok)) do;      /* match found: keep the record */
            output;
          end;
          when (%sysrc(_dsenom)) do;   /* no more matches for this PCN */
            _error_=0;
            DELETE;
          end;
          otherwise;
        end;
      end;
    Run;
    %MEND MONTH;

The bolded code is the only code that changes. The first three bolded sections are data set names. The last bolded section is the name of the indexed field being used.

Return Codes

    _iorc_  = the automatic "input/output return code" variable; a value of 0 means the I/O operation succeeded (a matching observation was found), and a nonzero value means it did not
    %sysrc  = autocall macro that returns the numeric value of a mnemonic return code
    _dsenom = no matching observation was found in the master dataset
    _sok    = the I/O operation was successful

KEY= SET OPTION ON THE MAINFRAME

Each month took from 1:28.41 to 2:17.51 in real time, for a total of over 19 minutes. The CPU time varied from 1.30 to 1.85 seconds, for a total of over 16 seconds of CPU time.

KEY= SET OPTION ON THE SERVER

With nothing else running on the server, each month took from 0:15.73 to 0:25.23 in real time, for a total of over 3 minutes. The CPU time varied from 0:00.78 to 0:11.30 seconds, for a total of over 11 seconds of CPU time.
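To see the KEY= lookup loop in isolation, here is a small self-contained sketch. The WORK data sets, the Amount variable, and the values are hypothetical, and the OTHERWISE branch that stops on an unexpected return code is an addition for safety that the paper's code does not have.

```sas
/* Toy master file with a simple index on PCN; PCN values repeat */
data work.AllOct00(index=(PCN));
  input PCN :$9. Amount;
  datalines;
DDHVBVVBD 10
BQQXVVXTD 20
ZZZZZZZZZ 30
BQQXVVXTD 40
;
run;

/* Toy key file: one row per PCN to look up */
data work.UniquePCNs;
  input PCN :$9.;
  datalines;
BQQXVVXTD
DDHVBVVBD
;
run;

data work.Oct00;
  set work.UniquePCNs;                      /* supplies the PCN to search for */
  do until (_iorc_ = %sysrc(_dsenom));
    set work.AllOct00 key=PCN;              /* next master row with this PCN  */
    select (_iorc_);
      when (%sysrc(_sok))    output;        /* match found: keep the row      */
      when (%sysrc(_dsenom)) _error_ = 0;   /* no more matches: clear error   */
      otherwise do;                         /* unexpected code: stop the step */
        put "Unexpected return code: " _iorc_=;
        stop;
      end;
    end;
  end;
run;
```

Only the index entries for the requested PCN values are visited, which fits the paper's observation that this method does best when few records are needed. The loop relies on the key file holding each PCN once, as in the paper; consecutive duplicate keys in the lookup file would need extra handling.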

SQL

    %MACRO MONTH(MONTH1,MONTH2);
    PROC SQL;
      CREATE TABLE &Month2 AS
      SELECT S.*
      FROM DATA.UniquePCNs as BD,
           DATA&MONTH1..DataEdit&Month2 as S
      WHERE BD.pcn = S.pcn;
    QUIT;
    %MEND MONTH;

SQL ON THE MAINFRAME

Each month took from 0:25.08 to 2:29.05 in real time, for a total of over 19 minutes. The CPU time varied from 0:01.67 to 0:02.23 seconds, for a total of over 20 seconds of CPU time.

SQL ON THE SERVER

With nothing else running on the server, each month took from 0:27.06 to 0:56.18 in real time, for a total of over 8 minutes. The CPU time varied from 0:00.85 to 0:01.97 seconds, for a total of over 17 seconds of CPU time.

SQL ON THE SERVER (FILES NOT SORTED OR INDEXED)

Using files that were neither sorted nor indexed increased the total real time from over 8 minutes to over 43 minutes. With nothing else running on the server, each month took from 0:19.48 to 6:59.51 in real time. The CPU time varied from 0:13.19 to 0:24.36 seconds, for a total of over 3 minutes of CPU time.
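The join can be tried on toy data without any preparation of the inputs. This sketch assumes hypothetical WORK tables with invented values, left unsorted and unindexed on purpose, and keeps the paper's BD and S aliases.

```sas
/* Toy master table, unsorted and without an index */
data work.AllOct00;
  input PCN :$9. Amount;
  datalines;
DDHVBVVBD 10
BQQXVVXTD 20
ZZZZZZZZZ 30
BQQXVVXTD 40
;
run;

/* Toy key table */
data work.UniquePCNs;
  input PCN :$9.;
  datalines;
DDHVBVVBD
BQQXVVXTD
;
run;

/* Inner join: keep every master row whose PCN is in the key table */
proc sql;
  create table work.Oct00 as
  select S.*
  from work.UniquePCNs as BD,
       work.AllOct00   as S
  where BD.PCN = S.PCN;
quit;
```

PROC SQL chooses its own join strategy, so the same statement runs whether or not the tables are sorted or indexed; as the timings above show, only the elapsed time changes.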

CONCLUSION

On the mainframe we went from 33 hours using the traditional merge, to 74 minutes using SAS to write the WHERE statement, to 19 minutes using the KEY= option on the SET statement, to 19 minutes using an SQL join.

On the SAS server we went from 11 hours using the traditional merge, to 39 minutes using SAS to write the WHERE statement, to 3 minutes using the KEY= option on the SET statement, to 8 minutes using an SQL join.

If you are familiar with SQL, it gives the most flexibility when datasets are not sorted or indexed. The KEY= method works fastest when the fewest records are needed, but it is more complicated and needs the data to be indexed.

ACKNOWLEDGMENTS

Thanks to Toby Dunn (tdunn@oakhilltech.com) for the error definitions.
Also see SAS OnlineDoc 9 at http://v9doc.sas.com/sasdoc/
Also see the archives of SAS-L@LISTSERV.UGA.EDU at http://listserv.uga.edu/archives/sas-l.html
Thanks to Olin Davis (odavis@sao.state.tx.us) for all his help.
Thanks to Janice at SAS Technical Support for her help.

CONTACT INFORMATION

Kirby Cossey
Information System Team
Texas State Auditor's Office
P.O. Box 12067
Austin, TX 78711-2067
Work Phone: (512) 936-9739
Fax: (512) 936-9400
E-Mail: kcossey@sao.state.tx.us

SAS and all other SAS Institute Inc. product or service names are registered trademarks or trademarks of SAS Institute Inc. in the USA and other countries. ® indicates USA registration.