IF there is a Better Way than IF-THEN

PharmaSUG 2018 - Paper QT-17

IF there is a Better Way than IF-THEN

Bob Tian, Anni Weng, KMK Consulting Inc.

ABSTRACT

In this paper, the authors compare different methods for implementing piecewise constant functions (step functions) in SAS. The authors use a simulation to measure the efficiency of each method in terms of CPU resource usage.

INTRODUCTION

Many SAS programmers routinely face the task of implementing piecewise functions, such as classifying BMI scores or grouping blood sugar levels. Adopting IF-THEN logic seems intuitive and effortless, but as real-world data grow in size and the assignments grow in complexity, is it still the most efficient method? Is there a better way to achieve such tasks in the SAS environment?

While a few papers discuss different methods of implementing piecewise classification [1] [2], little can be found that evaluates the efficiency of the different approaches. To find the best approach, we evaluate the performance of two different IF-THEN implementations, an implicit IF logic, nested IFN/IFC functions, two different WHEN implementations, and a bespoke format method, in terms of computation time.

A large simulated dataset was generated to approximate real-world data. Each method was applied to the dataset for hundreds of cycles under similar CPU load, and the average statistics are used for the comparison. The authors were genuinely surprised by their findings and would like to share the results with the readers.

INPUT DATA

To simulate real-world data, we randomly generated two input datasets from beta distributions on the range 0 to 100. We chose a bell-shaped distribution (α = 2, β = 3) as well as a bimodal distribution (α = 0.5, β = 0.5). The histograms of the two input datasets are shown below in Figure 1 and Figure 2.

Figure 1. Histogram Plot of Input Distribution 1, Beta(2, 3)

Figure 2. Histogram Plot of Input Distribution 2, Beta(0.5, 0.5)
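The input data described above could be simulated roughly as follows. This is only a sketch under stated assumptions: the paper does not give its generation code, sample size, or seed, so the dataset names `indata_bell` and `indata_bimodal`, the macro variable `nobs`, and the seed value are all hypothetical. The variable name `x` matches the code shown later in the paper.

```sas
/* Sketch only: dataset names, seed, and row count are assumptions. */
%let nobs = 1000000;   /* hypothetical size; the paper does not state its sample size */

data indata_bell indata_bimodal;
    call streaminit(20180501);              /* arbitrary seed, for reproducibility */
    do i = 1 to &nobs;
        x = 100 * rand('BETA', 2, 3);       /* bell-shaped, scaled to the range 0-100 */
        output indata_bell;
        x = 100 * rand('BETA', 0.5, 0.5);   /* bimodal, scaled to the range 0-100 */
        output indata_bimodal;
    end;
    drop i;
run;
```

Increasing `nobs` until a single classification step takes about 30 seconds would mimic the sizing rule the authors describe.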

These two datasets closely mirror much of the naturally occurring data the authors deal with. We chose a sample size large enough for each calculation to take approximately 30 seconds.

METHODS

We used two different piecewise functions to compare the performance of the methods. One is a simple binary classification; the other has 6 mutually exclusive categories. These two classifications resemble some of the most common classifications the authors implement on a daily basis.

1. Function 1, 2 categories

\[
y = \begin{cases}
0, & x < 50 \\
1, & x \ge 50
\end{cases}
\]

2. Function 2, 6 categories

\[
y = \begin{cases}
1, & x < 10 \\
2, & 10 \le x < 20 \\
3, & 20 \le x < 50 \\
4, & 50 \le x < 80 \\
5, & 80 \le x < 90 \\
6, & x \ge 90
\end{cases}
\]

A total of 7 different methods were tested. We illustrate them using the second classification.

METHOD 1, A SIMPLE IF-THEN LOGIC

This is the most straightforward logic, and most likely one of the first programs anyone learns to write:

    DATA Method_1;
        SET indata;   /* input dataset, as named in Method 6 */
        /* independent IF statements: every condition is tested for every row */
        IF x < 10 THEN y = 1;
        IF 10 <= x < 20 THEN y = 2;
        IF 20 <= x < 50 THEN y = 3;
        IF 50 <= x < 80 THEN y = 4;
        IF 80 <= x < 90 THEN y = 5;
        IF 90 <= x THEN y = 6;
    RUN;

METHOD 2, IF-THEN-ELSE LOGIC

Textbooks teach us that using ELSE IF for mutually exclusive conditions is faster than using IF-THEN alone. We now test this claim:

    DATA Method_2;
        SET indata;
        /* the chain stops at the first condition that is true */
        IF x < 10 THEN y = 1;
        ELSE IF 10 <= x < 20 THEN y = 2;
        ELSE IF 20 <= x < 50 THEN y = 3;
        ELSE IF 50 <= x < 80 THEN y = 4;
        ELSE IF 80 <= x < 90 THEN y = 5;
        ELSE y = 6;
    RUN;

METHOD 3, AN IMPLICIT IF LOGIC

We can also use an arithmetic expression to achieve the same IF logic implicitly. In fact, this is the author's favorite method, as the code is very neat:

    DATA Method_3;
        SET indata;
        /* each comparison evaluates to 1 (true) or 0 (false) */
        y = 1 * (x < 10)
          + 2 * (10 <= x < 20)
          + 3 * (20 <= x < 50)
          + 4 * (50 <= x < 80)
          + 5 * (80 <= x < 90)
          + 6 * (90 <= x);
    RUN;

METHOD 4, NESTED IFN/IFC FUNCTIONS

Nested IFN/IFC functions can also be used. However, they can sometimes be difficult to read:

    DATA Method_4;
        SET indata;
        y = ifn(x < 10, 1,
            ifn(10 <= x < 20, 2,
            ifn(20 <= x < 50, 3,
            ifn(50 <= x < 80, 4,
            ifn(80 <= x < 90, 5, 6)))));
    RUN;

METHOD 5, SELECT-WHEN

The SELECT-WHEN method is another favorite of the author's:

    DATA Method_5;
        SET indata;
        SELECT;
            when (x < 10) y = 1;
            when (10 <= x < 20) y = 2;
            when (20 <= x < 50) y = 3;
            when (50 <= x < 80) y = 4;
            when (80 <= x < 90) y = 5;
            when (90 <= x) y = 6;
            otherwise y = .;
        END;
    RUN;
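Since Methods 1 to 5 implement the same function, their outputs should be identical before any timing comparison is meaningful. The paper does not describe such a check, so the following is only a suggested sketch. It assumes `Method_1` and `Method_5` have both been created from the same input dataset, in the same row order.

```sas
/* Sketch only: compare the y values of two methods, row by row. */
proc compare base=Method_1 compare=Method_5 noprint;
    var y;
run;

/* SYSINFO must be read immediately after the PROC COMPARE step. */
%put NOTE: PROC COMPARE return code (0 means no differences found): &sysinfo;
```

The same step can be repeated for any other pair of DATA step methods by changing the `compare=` dataset.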

METHOD 6, CASE-WHEN IN PROC SQL

Similar logic can be implemented in PROC SQL:

    PROC SQL;
        CREATE table Method_6 as
        SELECT *,
               case
                   when (x < 10) then 1
                   when (10 <= x < 20) then 2
                   when (20 <= x < 50) then 3
                   when (50 <= x < 80) then 4
                   when (80 <= x < 90) then 5
                   when (90 <= x) then 6
                   else .
               end as y
        from indata;
    QUIT;

METHOD 7, USING A BESPOKE FORMAT

Using a customized format results in very neat code:

    PROC FORMAT;
        VALUE pwfmt
            low - <10 = 1
            10 - <20 = 2
            20 - <50 = 3
            50 - <80 = 4
            80 - <90 = 5
            90 - high = 6
            other = .;
    RUN;

    DATA Method_7;
        SET indata;
        /* PUT returns a character value, so y is character here */
        y = put(x, pwfmt.);
    RUN;

To minimize the variation due to the global CPU load and memory usage of our SAS server, we packaged each method into a macro and executed the macros sequentially in a loop. This loop was then executed 100 times on each input dataset. For the format method, the format was created only once. We used the method described by McCartney and Hu [3] to extract the run time information from the log. We discuss the results in the following section.

RESULTS DISCUSSION

The results were not quite what the authors had expected. For the bell-shaped distribution (α = 2, β = 3), the results are summarized in the following two tables.

Method     Note           Mean (s)   Std_Dev
Method 1   IF THEN        28.71      0.45
Method 2   IF ELSE THEN   28.41      0.41
Method 3   Implicit       30.16      0.59
Method 4   Ifn()          28.12      0.53
Method 5   SELECT WHEN    28.25      0.50
Method 6   SQL CASE       38.97      0.58
Method 7   Format         36.66      0.51

Table 1. Results on Beta(2, 3) distribution with 2 categories

Method     Note           Mean (s)   Std_Dev
Method 1   IF THEN        30.28      0.60
Method 2   IF ELSE THEN   29.43      0.60
Method 3   Implicit       36.74      0.79
Method 4   Ifn()          39.21      0.87
Method 5   SELECT WHEN    31.55      0.68
Method 6   SQL CASE       42.14      0.85
Method 7   Format         38.15      0.48

Table 2. Results on Beta(2, 3) distribution with 6 categories

The first thing we did not anticipate was that the format method did not perform very well. Using a customized format is a very efficient way to merge two datasets, and the code is very elegant. It can also save a lot of code when the same categorization has to be applied multiple times. However, it simply lags behind the other methods in terms of performance in this case.

Despite being a very neat-looking method, the author's favorite (Method 3) did not perform well either. When the classification is coded as an arithmetic operation with implicit IF operations, all cases have to be evaluated before a final value can be assigned. This hinders performance, especially when there are a large number of categories. The same reasoning holds for Method 4.

The biggest surprise is that Method 1 and Method 2 are on par with the other methods in terms of performance. When the complexity of the categorization increases, they even outperform the others. Even more surprisingly, using mutually exclusive ELSE IF logic does not improve the calculation speed by a large margin. This is probably due to some internal code optimization at the compilation step.

Lastly, it should not be a surprise that the SQL method does not perform as well as any of the DATA step methods.
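To make the point about Method 3 concrete, the short DATA step below evaluates the implicit expression for a single value. It is an illustrative sketch and not part of the original benchmark; the value x = 35 is chosen arbitrarily. Each parenthesized comparison yields 1 or 0, so all six terms are computed and summed even though only the third one contributes. An IF-THEN-ELSE chain would stop after its third test for the same input.

```sas
/* Sketch only: evaluate the Method 3 expression for one arbitrary value. */
data _null_;
    x = 35;
    /* 1*0 + 2*0 + 3*1 + 4*0 + 5*0 + 6*0 = 3 */
    y = 1 * (x < 10)
      + 2 * (10 <= x < 20)
      + 3 * (20 <= x < 50)
      + 4 * (50 <= x < 80)
      + 5 * (80 <= x < 90)
      + 6 * (90 <= x);
    put x= y=;   /* writes x=35 y=3 to the log */
run;
```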
Similar results were observed for the bimodal distribution (α = 0.5, β = 0.5), and are presented in Table 3 and Table 4.

Method     Note           Mean (s)   Std_Dev
Method 1   IF THEN        28.31      0.48
Method 2   IF ELSE THEN   28.44      0.50
Method 3   Implicit       28.58      0.53
Method 4   Ifn()          28.38      0.48
Method 5   SELECT WHEN    28.48      0.46
Method 6   SQL CASE       39.84      0.51
Method 7   Format         37.14      0.62

Table 3. Results on Beta(0.5, 0.5) distribution with 2 categories

Method     Note           Mean (s)   Std_Dev
Method 1   IF THEN        31.29      0.79
Method 2   IF ELSE THEN   30.53      0.66
Method 3   Implicit       36.02      0.77
Method 4   Ifn()          38.49      0.86
Method 5   SELECT WHEN    32.65      0.83
Method 6   SQL CASE       42.74      0.76
Method 7   Format         39.49      0.79

Table 4. Results on Beta(0.5, 0.5) distribution with 6 categories

CONCLUSION

In doing this experiment, the authors have learned that SAS has a highly optimized compiler. When dealing with a categorization problem, it is hard to beat the good old IF-THEN logic, although other methods might appear to be more elegant.
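Readers who wish to repeat the comparison on their own system could start from a timing loop such as the one below. This is a minimal sketch and not the authors' harness: it measures elapsed wall-clock time with the DATETIME function instead of extracting run times from the log as described in [3]. The macro name `time_method2`, its parameters, and the repetition count are hypothetical.

```sas
/* Sketch only: macro name, parameters, and repetition count are assumptions. */
%macro time_method2(indata=indata, reps=3);
    %local i start elapsed;
    %do i = 1 %to &reps;
        %let start = %sysfunc(datetime());   /* wall-clock start, in seconds */

        data Method_2;
            set &indata;
            if x < 10 then y = 1;
            else if 10 <= x < 20 then y = 2;
            else if 20 <= x < 50 then y = 3;
            else if 50 <= x < 80 then y = 4;
            else if 80 <= x < 90 then y = 5;
            else y = 6;
        run;

        /* %SYSEVALF is needed because the difference is not an integer */
        %let elapsed = %sysevalf(%sysfunc(datetime()) - &start);
        %put NOTE: Method 2, run &i took &elapsed seconds (wall clock).;
    %end;
%mend time_method2;

%time_method2(indata=indata, reps=3)
```

Wall-clock time includes I/O and waiting, so it will not match the CPU times reported in the SAS log. The other methods can be timed the same way by swapping the step inside the loop.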

REFERENCES

1. Horstman, Joshua M., Beyond IF THEN ELSE: Techniques for Conditional Execution of SAS Code, Proceedings of PharmaSUG 2016, Paper TT16, http://support.sas.com/resources/papers/proceedings17/0326-2017.pdf

2. Billings, Thomas E., IFC and IFN Functions: Alternatives to Simple DATA Step IF-THEN-ELSE, SELECT-END Code and PROC SQL CASE Statements, Proceedings of Western Users of SAS Software 2012, https://www.lexjansen.com/wuss/2012/28.pdf

3. McCartney, Jeff; Hu, Raymond, SAS Shorts: Valuable Tips for Everyday Programming, Proceedings of NorthEast SAS Users Group 2011, https://www.lexjansen.com/nesug/nesug01/at/at1013.pdf

ACKNOWLEDGMENTS

The authors would like to thank Huanxue Zhou of KMK Consulting Inc. for valuable suggestions and discussion of this paper. Any mistakes herein are the responsibility of the author.

CONTACT INFORMATION

Your comments and questions are valued and encouraged. Contact the author at:

Anni Weng
KMK Consulting Inc.
anni.weng@kmkconsultinginc.com

SAS and all other SAS Institute Inc. product or service names are registered trademarks or trademarks of SAS Institute Inc. in the USA and other countries. ® indicates USA registration.

Other brand and product names are trademarks of their respective companies.