Sorting big datasets. Do we really need it? Daniil Shliakhov, Experis Clinical, Kharkiv, Ukraine

Similar documents
Deriving Rows in CDISC ADaM BDS Datasets

Creating an ADaM Data Set for Correlation Analyses

Not Just Merge - Complex Derivation Made Easy by Hash Object

PharmaSUG 2013 CC26 Automating the Labeling of X- Axis Sanjiv Ramalingam, Vertex Pharmaceuticals, Inc., Cambridge, MA

Are you Still Afraid of Using Arrays? Let s Explore their Advantages

Using PROC SQL to Generate Shift Tables More Efficiently

PharmaSUG China Paper 70

Creating output datasets using SQL (Structured Query Language) only Andrii Stakhniv, Experis Clinical, Ukraine

Quicker Than Merge? Kirby Cossey, Texas State Auditor s Office, Austin, Texas

An Efficient Solution to Efficacy ADaM Design and Implementation

A SAS Macro Utility to Modify and Validate RTF Outputs for Regional Analyses Jagan Mohan Achi, PPD, Austin, TX Joshua N. Winters, PPD, Rochester, NY

%MAKE_IT_COUNT: An Example Macro for Dynamic Table Programming Britney Gilbert, Juniper Tree Consulting, Porter, Oklahoma

An Introduction to SAS Hash Programming Techniques Kirk Paul Lafler, Software Intelligence Corporation, Spring Valley, California

An Efficient Method to Create Titles for Multiple Clinical Reports Using Proc Format within A Do Loop Youying Yu, PharmaNet/i3, West Chester, Ohio

Validating Analysis Data Set without Double Programming - An Alternative Way to Validate the Analysis Data Set

Conversion of CDISC specifications to CDISC data specifications driven SAS programming for CDISC data mapping

PharmaSUG Paper TT11

Create a Format from a SAS Data Set Ruth Marisol Rivera, i3 Statprobe, Mexico City, Mexico

PharmaSUG Paper BB01

Amie Bissonett, inventiv Health Clinical, Minneapolis, MN

Make the Most Out of Your Data Set Specification Thea Arianna Valerio, PPD, Manila, Philippines

An Introduction to Visit Window Challenges and Solutions

PharmaSUG Paper DS24

PharmaSUG DS05

Choosing the Right Technique to Merge Large Data Sets Efficiently Qingfeng Liang, Community Care Behavioral Health Organization, Pittsburgh, PA

Quick and Efficient Way to Check the Transferred Data Divyaja Padamati, Eliassen Group Inc., North Carolina.

Making a List, Checking it Twice (Part 1): Techniques for Specifying and Validating Analysis Datasets

Data Edit-checks Integration using ODS Tagset Niraj J. Pandya, Element Technologies Inc., NJ Vinodh Paida, Impressive Systems Inc.

Let Hash SUMINC Count For You Joseph Hinson, Accenture Life Sciences, Berwyn, PA, USA

A Tool to Compare Different Data Transfers Jun Wang, FMD K&L, Inc., Nanjing, China

Keeping Track of Database Changes During Database Lock

PharmaSUG China Paper 059

Hash Objects for Everyone

A Hands-on Introduction to SAS DATA Step Hash Programming Techniques

Interactive Programming Using Task in SAS Studio

Reproducibly Random Values William Garner, Gilead Sciences, Inc., Foster City, CA Ting Bai, Gilead Sciences, Inc., Foster City, CA

Applying ADaM Principles in Developing a Response Analysis Dataset

The Benefits of Traceability Beyond Just From SDTM to ADaM in CDISC Standards Maggie Ci Jiang, Teva Pharmaceuticals, Great Valley, PA

PharmaSUG Paper AD06

Countdown of the Top 10 Ways to Merge Data David Franklin, Independent Consultant, Litchfield, NH

Clip Extreme Values for a More Readable Box Plot Mary Rose Sibayan, PPD, Manila, Philippines Thea Arianna Valerio, PPD, Manila, Philippines

A SAS Macro to Generate Caterpillar Plots. Guochen Song, i3 Statprobe, Cary, NC

Using PROC SQL to Calculate FIRSTOBS David C. Tabano, Kaiser Permanente, Denver, CO

Clinical Data Visualization using TIBCO Spotfire and SAS

PharmaSUG Paper DS06 Designing and Tuning ADaM Datasets. Songhui ZHU, K&L Consulting Services, Fort Washington, PA

Working with Composite Endpoints: Constructing Analysis Data Pushpa Saranadasa, Merck & Co., Inc., Upper Gwynedd, PA

Programming Gems that are worth learning SQL for! Pamela L. Reading, Rho, Inc., Chapel Hill, NC

A Breeze through SAS options to Enter a Zero-filled row Kajal Tahiliani, ICON Clinical Research, Warrington, PA

PharmaSUG Paper PO10

One-Step Change from Baseline Calculations

Multiple Graphical and Tabular Reports on One Page, Multiple Ways to Do It Niraj J Pandya, CT, USA

Merging Data Eight Different Ways

USING SAS HASH OBJECTS TO CUT DOWN PROCESSING TIME Girish Narayandas, Optum, Eden Prairie, MN

From SAP to BDS: The Nuts and Bolts Nancy Brucken, i3 Statprobe, Ann Arbor, MI Paul Slagle, United BioSource Corp., Ann Arbor, MI

Reducing SAS Dataset Merges with Data Driven Formats

PharmaSUG Paper TT10 Creating a Customized Graph for Adverse Event Incidence and Duration Sanjiv Ramalingam, Octagon Research Solutions Inc.

PharmaSUG2014 Paper DS09

SAS Macro Dynamics: from Simple Basics to Powerful Invocations Rick Andrews, Office of Research, Development, and Information, Baltimore, MD

ODS TAGSETS - a Powerful Reporting Method

ABSTRACT INTRODUCTION MACRO. Paper RF

Streamline Table Lookup by Embedding HASH in FCMP Qing Liu, Eli Lilly & Company, Shanghai, China

A Taste of SDTM in Real Time

Traceability in the ADaM Standard Ed Lombardi, SynteractHCR, Inc., Carlsbad, CA

Paper CT-16 Manage Hierarchical or Associated Data with the RETAIN Statement Alan R. Mann, Independent Consultant, Harpers Ferry, WV

Basic SAS Hash Programming Techniques Applied in Our Daily Work in Clinical Trials Data Analysis

Quality Control of Clinical Data Listings with Proc Compare

PharmaSUG Paper PO22

Guide Users along Information Pathways and Surf through the Data

Tweaking your tables: Suppressing superfluous subtotals in PROC TABULATE

Programming Beyond the Basics. Find() the power of Hash - How, Why and When to use the SAS Hash Object John Blackwell

Statistics and Data Analysis. Common Pitfalls in SAS Statistical Analysis Macros in a Mass Production Environment

9 Ways to Join Two Datasets David Franklin, Independent Consultant, New Hampshire, USA

Comparison of different ways using table lookups on huge tables

The Power of Combining Data with the PROC SQL

PharmaSUG China Mina Chen, Roche (China) Holding Ltd.

A SAS Solution to Create a Weekly Format Susan Bakken, Aimia, Plymouth, MN

Using GSUBMIT command to customize the interface in SAS Xin Wang, Fountain Medical Technology Co., ltd, Nanjing, China

Automation of SDTM Programming in Oncology Disease Response Domain Yiwen Wang, Yu Cheng, Ju Chen Eli Lilly and Company, China

SAS Application to Automate a Comprehensive Review of DEFINE and All of its Components

Journey to the center of the earth Deep understanding of SAS language processing mechanism Di Chen, SAS Beijing R&D, Beijing, China

PharmaSUG Paper CC02

SAS Macro Dynamics - From Simple Basics to Powerful Invocations Rick Andrews, Office of the Actuary, CMS, Baltimore, MD

THE DATA DETECTIVE HINTS AND TIPS FOR INDEPENDENT PROGRAMMING QC. PhUSE Bethan Thomas DATE PRESENTED BY

A SAS Macro to Create Validation Summary of Dataset Report

PharmaSUG Paper IB11

%ANYTL: A Versatile Table/Listing Macro

PharmaSUG Paper DS-24. Family of PARAM***: PARAM, PARAMCD, PARAMN, PARCATy(N), PARAMTYP

USING HASH TABLES FOR AE SEARCH STRATEGIES Vinodita Bongarala, Liz Thomas Seattle Genetics, Inc., Bothell, WA

IF there is a Better Way than IF-THEN

The Dataset Diet How to transform short and fat into long and thin

50 WAYS TO MERGE YOUR DATA INSTALLMENT 1 Kristie Schuster, LabOne, Inc., Lenexa, Kansas Lori Sipe, LabOne, Inc., Lenexa, Kansas

WHAT ARE SASHELP VIEWS?

SAS Macros of Performing Look-Ahead and Look-Back Reads

The Proc Transpose Cookbook

Advanced Visualization using TIBCO Spotfire and SAS

SAS Programming Techniques for Manipulating Metadata on the Database Level Chris Speck, PAREXEL International, Durham, NC

Better Metadata Through SAS II: %SYSFUNC, PROC DATASETS, and Dictionary Tables

How Macro Design and Program Structure Impacts GPP (Good Programming Practice) in TLF Coding

Hypothesis Testing: An SQL Analogy

29 Shades of Missing

Transcription:

PharmaSUG 2015 - Paper QT21 Sorting big datasets. Do we really need it? Daniil Shliakhov, Experis Clinical, Kharkiv, Ukraine ABSTRACT Very often working with big data causes difficulties for SAS programmers. As we all know, large datasets can be manipulated and analyzed using SAS, but these SAS programs may take many hours to run. Often, timelines are short and delivery dates cannot be moved, so decreased running time is critical. One of the most time-consuming tasks for the programmer is sorting large datasets. This paper describes different techniques of quick sorting like using indexes or even how to avoid sorting at all by using hash-objects. The main goal of this paper is to find the best methods for sorting large datasets within SAS depending on the different types of data structure (for example, vertical or horizontal) and different number of variables. To identify the efficiency of different techniques of quick sorting, this paper also provides comparison between the sorting process with and without those techniques. The comparison will show how much resource time-resource is consumed sorting different types of large SAS datasets. INTRODUCTION You can ask why the sorting of datasets is an issue? This is not an issue when we are talking about small datasets. But when you need to sort the large dataset (more than 1 million records with a lot of key variables) it may cause some problems. And the main problem is running time. When you work with large datasets, it may take minutes or even hours to run. Let s image the situation when you need to create ADLB dataset for pooling study. During creating this dataset you may need to sort the work datasets a few times: to merge the dataset with subject-level dataset (ADSL), to create baseline flags, to create any derived records or create the analysis flags. In cases where you have few million records in your dataset, it really may take a lot of time to run the program. This paper describes different techniques of sorting SAS datasets. The techniques described in this paper are: Using SAS SORT procedure; Using Indexes; Using Hash-objects; At the end of the paper, I ll compare the time results of different techniques and we will try to find the best one. All provided examples in this paper are based on ADLB dataset with 1 million records that has been created for a pooling study. To run the SAS codes provided in this paper, SAS 9.2 for Microsoft Windows has been used. PROC SORT The sort procedure is probably one of the most commonly used SAS procedures. I don t remember any SAS programs without at least one call of PROC SORT. Let s see how the PROC SORT works: proc sort data=<name_of_your_dataset> <some_options>; by <list_of_variables> There is nothing new if you have used SAS for a long time. The SORT procedure is the easiest procedure in SAS and it s more than fine to use this PROC SORT to sort the small datasets. But when you need to sort a very big dataset a few times within one program or even in a few programs, you will probably need to use indexes or even hash objects. Let s see how indexes and hash-objects work. INDEXES What is an index? The SAS documentation says that an index is an optional file that you can create for a SAS data file in order to provide direct access to specific observations. There are three main ways to create an index. Indexes can be created using PROC DATASETS, PROC SQL or in the DATA step. 1

CREATING INDEXES USING PROC DATASETS One of the ways to create an index is to use PROC DATASETS. Within this SAS procedure you can modify a dataset in a specific library. Please see the simple code that creates a few indexes for ADLB dataset: proc datasets library=analdata; modify adlb; index create studyid; index create usubjid; index create lbcat; index create paramcd; index create avisitn; index create avisit; Provided code above creates 6 simple indexes for study number (studyid), patient number (usubjid), name of category of laboratory test (lbcat), short name of laboratory test (paramcd), numeric patient analysis visit (avisitn), and patient analysis visit (avisitn). Let s see how much time it took to create 6 indexes using PROC DATASETS. Error! Reference source not found. shows the part of log file with time: modify adlb; index create studyid; NOTE: Simple index studyid has been defined. index create usubjid; NOTE: Simple index USUBJID has been defined. index create lbcat; NOTE: Simple index lbcat has been defined. index create paramcd; NOTE: Simple index paramcd has been defined. index create avisitn; NOTE: Simple index avisitn has been defined. index create avisit; NOTE: Simple index avisit has been defined. NOTE: MODIFY was successful for WORK.ADLB.DATA. 185 NOTE: PROCEDURE DATASETS used (Total process time): 5.45 seconds 8.17 seconds Output 1. Result of using PROC DATASETS to create the INDEXES CREATING INDEXES USING PROC SQL The SQL procedure can also be used to create indexes for your SAS dataset: proc sql noprint; create index studyid on analdata.adlb; create index usubjid on analdata.adlb; create index lbcat on analdata.adlb; create index paramcd on analdata.adlb; create index avisitn on analdata.adlb; create index avisit on analdata.adlb; Let s run this PROC step to see the time difference between PROC DATASETS and PROC SQL. proc sql noprint; create index studyid on work.adlb; 2

NOTE: Simple index studyid has been defined. create index usubjid on work.adlb; NOTE: Simple index usubjid has been defined. create index lbcat on work.adlb; NOTE: Simple index lbcat has been defined. create index paramcd on work.adlb; NOTE: Simple index paramcd has been defined. create index avisitn on work.adlb; NOTE: Simple index avisitn has been defined. create index avisit on work.adlb; NOTE: Simple index avisit has been defined. NOTE: PROCEDURE SQL used (Total process time): 6.53 seconds 7.95 seconds Output 2. Result of using PROC SQL to create the INDEXES CREATING INDEXES USING DATA STEP And the last way of creating indexes is using the INDEX dataset option in the DATA statement. Let s see how it works: (index=(studyid usubjid lbcat paramcd avisitn avisit)); Let s run this data step to see how much time it will take to create the indexes using DATA step: (index=(studyid usubjid lbcat paramcd avisitn avisit)); NOTE: There were 1149600 observations read from the data set ANALDATA.ADLB. NOTE: The data set ANALDATA.ADLB has 1149600 observations and 30 variables. NOTE: DATA statement used (Total process time): 54.71 seconds 12.96 seconds Output 3. Result of using DATA step to create the INDEXES 3

In the examples above simple indexes were created. Simple means that new index created from a single SAS variables. If you need to create an index from two or more variables, this index is called a composite index. In our example it would be better to create 3 composite indexes instead of 6: the first two variables describe a patient (studyid and usubjid), the next two variables describe a laboratory test (lbcat and paramcd), and the last two variables identify an analysis visit (avisitn and avisit). To create composite indexes you can use any of the ways described above. Based on time comparison I would suggest to create indexes within the data step, because it took a lot of time compared to others and we can see the difference. Below you can see the example of creating composite indexes using data step: (index=(patient=(studyid usubjid) lb_test=(lbcat paramcd) anal_visit=(avisitn avisit))); After running this PROC step you can see that it took less time than creating simple indexes. (index=(patient=(studyid usubjid) lb_test=(lbcat paramcd) anal_visit=(avisitn avisit))); NOTE: There were 1149600 observations read from the data set WORK.ADLB. NOTE: The data set WORK.ADLB has 1149600 observations and 30 variables. NOTE: DATA statement used (Total process time): 23.52 seconds 8.14 seconds Output 4. Result of using PROC SQL to create the composite INDEXES Now let s look at the difference between PROC SORT and INDEXES. The comparisons provided below: PROC SORT Simple INDEXES Composite INDEXES PROC DATASETS PROC SQL DATA STEP DATA STEP Actual time 52:20 5:45 6:53 54:71 23:52 CPU time 8:97 8:17 7:95 12:96 8:14 Table 1. Comparison of time usage between PROC SORT and INDEXES HASH-OBJECTS Compared to indexes that can be used in any data or procedure steps and even in different programs, hash-objects can be used only in the datastep, and at the end of the datastep the hash object and all its contents disappear. So, what is a hash-object? Hash-objects is a data structure that contains an array of items that are used to map key values (e.g., patient numbers) to their associated values (e.g., patient laboratory result). The hash-object has quite a complicated syntax with a lot of options. What you need to do is to create, define, fill and then access the hash-object. To create a hash-object you need to use a declare statement in your data step: declare HASH Hash_Name; Next step is to define the key and other variables: 4

Hash_Name.DefineKey( Key var ); Hash_Name.DefineData( Var1, Var2, Var3,, Varn ); Hash_Name.DefineDone(); When the hash-object is created and defined, it needs to be filled with data. The following syntax should be used: Hash_Name.add(); Let s see the whole datastep that will sort our dataset ADLB using hash-objects by STUDYID, USUBJID, LBCAT, PARAMCD, AVISITN, AVISIT: data _null_; if 0 then if _n_ = 1 then do; declare Hash HashSort(ordered:'a'); HashSort.DefineKey( STUDYID, USUBJID, LBCAT, PARAMCD, AVISITN, AVISIT ); HashSort.DefineData( STUDYID, USUBJID, LBCAT, PARAMCD, AVISITN, AVISIT, AVAL, CHG, TRT01A ); HashSort.DefineDone(); end; HashSort.add(); if eof then HashSort.output(dataset: sorted_adlb ); Let s run this data step to see how much time it will take to sort the dataset ADLB: data _null_; if 0 then set work.adlb; if _n_ = 1 then do; declare Hash HashSort(ordered:'a'); HashSort.DefineKey('STUDYID', 'USUBJID', 'LBCAT', 'PARAMCD', 'AVISITN', 'AVISIT'); HashSort.DefineData('STUDYID', 'USUBJID', 'LBCAT','PARAMCD', 'AVISITN', 'AVISIT'); HashSort.DefineDone(); end; set work.adlb end=eof; HashSort.add(); if eof then HashSort.output(dataset:'sorted_adlb'); NOTE: The data set WORK.SORTED_ADLB has 1149600 observations and 6 variables. NOTE: There were 1149600 observations read from the data set WORK.ADLB. NOTE: DATA statement used (Total process time): 3.92 seconds 3.16 seconds Output 5. Result of using hash-objects to sort the ADLB dataset CONCLUSION SAS has a lot of ways to sort the SAS datasets. One of the commonly used is SAS s SORT procedure but it s fine to use it only when you are working with datasets with small number of observations and key variables. In cases where you need to sort large datasets with a lot of key variables, it is definitely better to use indexes of hash-objects. Let's see the comparison of the different methods one more time: 5

PROC SORT Simple INDEXES Composite PROC PROC DATA INDEXES DATASETS SQL STEP DATA STEP Actual time 52:20 5:45 6:53 54:71 23:52 3:92 CPU time 8:97 8:17 7:95 12:96 8:14 3:16 Table 2. Comparison of time usage between PROC SORT, INDEXES and Hash-Objects As you can see from the table 2, the quickest way of sorting large datasets is a hash-object. REFERENCES Hash Objects Michael A. Raithel. The Basics of Using SAS Indexes. Proceedings of the 30th Annual SAS Users Group International Conference. Rockville, MD Gregg P. Snell. Think FAST! Use Memory Tables (Hashing) for Faster Merging. Proceedings of the 31th Annual SAS Users Group International Conference. Shawnee, KS SAS OnlineDoc 9.2. Available at http://support.sas.com/documentation/92 ACKNOWLEDGMENTS I would like to thank Donnelle M LaDouceur, Peter Lord, Chad Melson and Irina Kotenko for the review of the paper and all support during preparing this paper. CONTACT INFORMATION Your comments and questions are valued and encouraged. Contact the author at: Name: Daniil Shliakhov Enterprise: Experis Clinical Address: 43/2 Gagarin av. City, State ZIP: Ukraine, Kharkiv, 61001 Work Phone: +38 057 755 7020 ext. 2402 Fax: +38 057 728 30 35 E-mail: daniil.shlyakhov@intego-group.com SAS and all other SAS Institute Inc. product or service names are registered trademarks or trademarks of SAS Institute Inc. in the USA and other countries. indicates USA registration. Other brand and product names are trademarks of their respective companies. 6