Oh Quartile, Where Art Thou? David Franklin, TheProgrammersCabin.com, Litchfield, NH

Similar documents
9 Ways to Join Two Datasets David Franklin, Independent Consultant, New Hampshire, USA

Countdown of the Top 10 Ways to Merge Data David Franklin, Independent Consultant, Litchfield, NH

Merging Data Eight Different Ways

Two useful macros to nudge SAS to serve you

PharmaSUG Paper TT11

Making an RTF file Out of a Text File, With SAS Paper CC13

SAS 9 Programming Enhancements Marje Fecht, Prowerk Consulting Ltd Mississauga, Ontario, Canada

Easy CSR In-Text Table Automation, Oh My

Create a Format from a SAS Data Set Ruth Marisol Rivera, i3 Statprobe, Mexico City, Mexico

A Cross-national Comparison Using Stacked Data

A Tool to Compare Different Data Transfers Jun Wang, FMD K&L, Inc., Nanjing, China

PharmaSUG China Paper 059

Amie Bissonett, inventiv Health Clinical, Minneapolis, MN

A Macro for Systematic Treatment of Special Values in Weight of Evidence Variable Transformation Chaoxian Cai, Automated Financial Systems, Exton, PA

The Proc Transpose Cookbook

Correcting for natural time lag bias in non-participants in pre-post intervention evaluation studies

Ranking Between the Lines

Creating Macro Calls using Proc Freq

Sorting big datasets. Do we really need it? Daniil Shliakhov, Experis Clinical, Kharkiv, Ukraine

David Franklin Independent SAS Consultant TheProgramersCabin.com

A SAS Macro for Producing Benchmarks for Interpreting School Effect Sizes

CREATING THE DISTRIBUTION ANALYSIS

PharmaSUG Paper CC02

PharmaSUG 2013 CC26 Automating the Labeling of X- Axis Sanjiv Ramalingam, Vertex Pharmaceuticals, Inc., Cambridge, MA

Cleaning Duplicate Observations on a Chessboard of Missing Values Mayrita Vitvitska, ClinOps, LLC, San Francisco, CA

Let s Get FREQy with our Statistics: Data-Driven Approach to Determining Appropriate Test Statistic

A SAS Macro Utility to Modify and Validate RTF Outputs for Regional Analyses Jagan Mohan Achi, PPD, Austin, TX Joshua N. Winters, PPD, Rochester, NY

1. To condense data in a single value. 2. To facilitate comparisons between data.

PharmaSUG Paper AD06

Taming the Box Plot. Sanjiv Ramalingam, Octagon Research Solutions, Inc., Wayne, PA

Chasing the log file while running the SAS program

PharmaSUG Paper TT10 Creating a Customized Graph for Adverse Event Incidence and Duration Sanjiv Ramalingam, Octagon Research Solutions Inc.

Quick Data Definitions Using SQL, REPORT and PRINT Procedures Bradford J. Danner, PharmaNet/i3, Tennessee

Easing into Data Exploration, Reporting, and Analytics Using SAS Enterprise Guide

CMISS the SAS Function You May Have Been MISSING Mira Shapiro, Analytic Designers LLC, Bethesda, MD

Handling Numeric Representation SAS Errors Caused by Simple Floating-Point Arithmetic Computation Fuad J. Foty, U.S. Census Bureau, Washington, DC

Using SAS to Control and Automate a Multi SAS Program Process Patrick Halpin, dunnhumby USA, Cincinnati, OH

Give me EVERYTHING! A macro to combine the CONTENTS procedure output and formats. Lynn Mullins, PPD, Cincinnati, Ohio

SAS Macros of Performing Look-Ahead and Look-Back Reads

Ditch the Data Memo: Using Macro Variables and Outer Union Corresponding in PROC SQL to Create Data Set Summary Tables Andrea Shane MDRC, Oakland, CA

SAS Macro Dynamics: from Simple Basics to Powerful Invocations Rick Andrews, Office of Research, Development, and Information, Baltimore, MD

Regaining Some Control Over ODS RTF Pagination When Using Proc Report Gary E. Moore, Moore Computing Services, Inc., Little Rock, Arkansas

Paper A Simplified and Efficient Way to Map Variable Attributes of a Clinical Data Warehouse

5b. Descriptive Statistics - Part II

Greenspace: A Macro to Improve a SAS Data Set Footprint

PharmaSUG China Paper 70

A Macro to Create Program Inventory for Analysis Data Reviewer s Guide Xianhua (Allen) Zeng, PAREXEL International, Shanghai, China

CHAPTER 1. Introduction. Statistics: Statistics is the science of collecting, organizing, analyzing, presenting and interpreting data.

Missing Pages Report. David Gray, PPD, Austin, TX Zhuo Chen, PPD, Austin, TX

Using SAS Macros to Extract P-values from PROC FREQ

PharmaSUG Paper PO12

A SAS Macro to Create Validation Summary of Dataset Report

WHAT ARE SASHELP VIEWS?

Macro Method to use Google Maps and SAS to Geocode a Location by Name or Address

Write SAS Code to Generate Another SAS Program A Dynamic Way to Get Your Data into SAS

Breaking up (Axes) Isn t Hard to Do: An Updated Macro for Choosing Axis Breaks

Clinical Data Visualization using TIBCO Spotfire and SAS

Leveraging SAS Visualization Technologies to Increase the Global Competency of the U.S. Workforce

Simplifying the Sample Design Process with PROC PMENU

Using the CLP Procedure to solve the agent-district assignment problem

Arthur L. Carpenter California Occidental Consultants, Oceanside, California

Paper An Automated Reporting Macro to Create Cell Index An Enhanced Revisit. Shi-Tao Yeh, GlaxoSmithKline, King of Prussia, PA

Measures of Position

3. Data Analysis and Statistics

Importing CSV Data to All Character Variables Arthur L. Carpenter California Occidental Consultants, Anchorage, AK

Data Edit-checks Integration using ODS Tagset Niraj J. Pandya, Element Technologies Inc., NJ Vinodh Paida, Impressive Systems Inc.

Automating Comparison of Multiple Datasets Sandeep Kottam, Remx IT, King of Prussia, PA

Taming a Spreadsheet Importation Monster

%Addval: A SAS Macro Which Completes the Cartesian Product of Dataset Observations for All Values of a Selected Set of Variables

SQL, HASH Tables, FORMAT and KEY= More Than One Way to Merge Two Datasets

SAS automation techniques specification driven programming for Lab CTC grade derivation and more

An Efficient Tool for Clinical Data Check

PharmaSUG Paper PO10

Paper CC16. William E Benjamin Jr, Owl Computer Consultancy LLC, Phoenix, AZ

TLF Management Tools: SAS programs to help in managing large number of TLFs. Eduard Joseph Siquioco, PPD, Manila, Philippines

What s New in Oracle Crystal Ball? What s New in Version Browse to:

KEYWORDS Metadata, macro language, CALL EXECUTE, %NRSTR, %TSLIT

Keeping Track of Database Changes During Database Lock

Out of Control! A SAS Macro to Recalculate QC Statistics

Paper S Data Presentation 101: An Analyst s Perspective

Paper DB2 table. For a simple read of a table, SQL and DATA step operate with similar efficiency.

Using SAS/SCL to Create Flexible Programs... A Super-Sized Macro Ellen Michaliszyn, College of American Pathologists, Northfield, IL

Chapter 3. Descriptive Measures. Slide 3-2. Copyright 2012, 2008, 2005 Pearson Education, Inc.

Creating an ADaM Data Set for Correlation Analyses

HAVE YOU EVER WISHED THAT YOU DO NOT NEED TO TYPE OR CHANGE REPORT NUMBERS AND TITLES IN YOUR SAS PROGRAMS?

EXST SAS Lab Lab #8: More data step and t-tests

Automated Checking Of Multiple Files Kathyayini Tappeta, Percept Pharma Services, Bridgewater, NJ

Get SAS sy with PROC SQL Amie Bissonett, Pharmanet/i3, Minneapolis, MN

LST in Comparison Sanket Kale, Parexel International Inc., Durham, NC Sajin Johnny, Parexel International Inc., Durham, NC

Run your reports through that last loop to standardize the presentation attributes

An Efficient Method to Create Titles for Multiple Clinical Reports Using Proc Format within A Do Loop Youying Yu, PharmaNet/i3, West Chester, Ohio

ODS/RTF Pagination Revisit

Using GSUBMIT command to customize the interface in SAS Xin Wang, Fountain Medical Technology Co., ltd, Nanjing, China

Going Under the Hood: How Does the Macro Processor Really Work?

Cover the Basics, Tool for structuring data checking with SAS Ole Zester, Novo Nordisk, Denmark

BI-09 Using Enterprise Guide Effectively Tom Miron, Systems Seminar Consultants, Madison, WI

SAS Viewer giving way to Universal Viewer Steve Wright, Quintiles, RTP, NC

3N Validation to Validate PROC COMPARE Output

Producing Summary Tables in SAS Enterprise Guide

Chapter 6: Modifying and Combining Data Sets

Transcription:

PharmaSUG 2013 Paper SP08 Oh Quartile, Where Art Thou? David Franklin, TheProgrammersCabin.com, Litchfield, NH ABSTRACT "Why is my first quartile number different from yours?" It was this question that led to a study of methods that can be used for calculating the first and third quartile for a set of data. SAS provides five methods for calculation of the quartile but other methods exist and are used in other common software packages. In this paper ten methods are discussed from the point of view of a SAS programmer and a macro presented that will calculate first and third quartiles for the five methods not provided in the Base SAS package. INTRODUCTION It was a late afternoon discussion that the subject of how a quartile is calculated came up. A colleague was checking some quartile calculations using a well known spreadsheet package and different results were appearing. So who was right? Had a bug been found in the way SAS calculated quartiles? Rerunning the program selecting the other quartile calculation options did not produce the same answer as the spreadsheet did. So what was going on? The school book definition of Quartile is any of the three values which divide the sorted data set into four equal parts, so that each part represents 1/4th of the sample the first quartile is 25% of the data denoted usually by Q1, the second quartile is 50% of the data denoted usually by Q2 (also called the Median), and the third quartile is 75% of the data denoted usually by Q3. Lets look at the following example: For the data 1, 2, 3, 4, 5, 6, 7 and 8 the following results are calculated for the 25 th percentile: SAS Method 5 (default) = 2.5 SAS Method 4 = 2.25 Excel = 2.75 Unlike the median that has a standard calculation method, there is no one standard for the calculation of the quartile. Base SAS provides five methods but there are others used by different software packages. This paper presents ten definitions that are commonly used, including the five that SAS uses. The discussion will be from the perspective of a SAS programmer so no comment will be made on the merits or demerits of each definition or which definition is best for calculating the statistic. THE SAS METHODS There are five methods that SAS uses to calculate the quartile, each of which can be called using various procedure options including the QNTLDEF=value option the MEANS procedure and PCTLDEF=value in the UNIVARIATE procedure. Interestingly SAS uses definition 5 as default so in keeping with SAS the definitions will be presented in reverse order. Definition 5, Empirical Distribution, Averaging y = (x j x j+1 )/2 if g=0 or y = x j+1 if g>0 Definition 4, Weighted Average at X(n+1) where (n+1)/4=j+g for the LQ and 3(n+1)/4=j+g for the UQ 1

Definition 3, Empirical Distribution Function y = x j if g=0, or y = x j+1 if g>0 Definition 2, Closest Observation If g=0.5 and j is even then y=xj, else if g=0.5 and j is odd then +1, else x j where (n/4)+0.5=j+g for LQ, (3n/4)+0.5=j+g for UQ Definition 1, Weighted Average at Xn Only methods 1 and 4 use interpolation to calculate the quartile value. Using the data 1, 2, 3, 4, 5, 6, 7 and 8 the following first and third quartiles are computed: Method 5: Q1=2.5 Q3=6.5 Method 4: Q1=2.25 Q3=6.75 Method 3: Q1=2 Q3=6 Method 2: Q1=2 Q3=6 Method 1: Q1=2 Q3=6 THE CLASSIC METHOD Developed by Tukey, its aim is to find the quartiles of a set of data with little or no calculation. The procedure for finding the quartile is: 1. find the median 2. if an odd number of observations include the sample median value and then find the median of the subset, else if an even number of observations exclude the sample median value then find the median of the subset. Putting this into a formula looks something like: LQ: if n is odd, (n+3)/4=j+g; else (n+2)/4=j+g UQ: if n is odd, (3n+1)/4=j+g; else (3n+2)/4=j+g Using the data 1,2,3,4,5,6,7,8, the sample median is 4.5; the Lower Quartile is from the set {1,2,3,4} so the value is 2.5; and the Upper Quartile is from the set {5,6,7,8} so the value is 6.5. There is an adaptation to the Tukey method presented by Moore and McCabe that does not include sample median in the quartile calculation in both cases where there is an even or odd number of observations. Putting this into a formula looks something like: LQ: if n is odd, (n+1)/4=j+g; else (n+2)/4=j+g UQ: if n is odd, (3n+3)/4=j+g; else (3n+2)/4=j+g OTHER METHODS Three methods will be discussed here the first two involve some sort of interpolation while the third does not. There are other methods but these are rarely used. 2

The first is a method that very few packages use the only one found with documentation relating to this method is 'R': where (n+2)/4=j+g for LQ and (3n+2)/4=j+g for UQ A paper written by Hyndman an Fan identified this method as the only one that met all of the six requirements they were considering for the calculation of a quartile. However this method is one of the least used methods in commercial software. Some packages, including Excel, after version 2002, and S Plus, use yet another calculation presented by Freund and Perles: where (n+3)/4=j+g for LQ and (3n+1)/4=j+g for UQ Prior to Excel version 2002, Excel used the method: +g(x j +1 x j ) where ((n 1)/4)+1=j+g for LQ and (3(n 1)/4)+1=j+g for UQ There is a third method called presented by Mendenhall and Sincich that is a variation on the Closest Observation method (Definition 2 in SAS): LQ: if g<0.5 then, else +1 UQ: if g<=0.5 then, else +1 where (n+1)/4=j+g for the LQ and (3n+2)/4=j+g for the UQ THE MACRO The five methods that SAS does not use for the Quartile calculation are calculated in the macro listed in the Appendix. The methods used in the macro are: Tukey Moore and McCabe Hyndman and Fan Freund and Perles Mendenhall and Sincich CONCLUSION Going back to the original question at the start of the paper, Why is my first quartile number different from yours?, it can be seen that there are different calculation methods for the quartile all definitions have their place and the selection of which definition to use does depend on the knowledge and experience of a statistician. For the question itself, SAS Definition 5 was being used for the calculation and the check was being done using Excel they do not have similar definitions so the results were different. The moral of this story is that you should know how your software calculates a statistic before blindly reporting the result. REFERENCES Hyndman, R. J. and Fan, Y. Sample Quantiles in Statistical Packages. Amer. Stat. 50, 361 365, 1996. 3

CONTACT INFORMATION Your comments and questions are valued and encouraged. Contact the author at: Name: David Franklin Enterprise: TheProgrammersCabin.com E mail: dfranklin@theprogrammerscabin.com Web: http://www.theprogrammerscabin.com Twitter: ThePgmrsCabin SAS and all other SAS Institute Inc. product or service names are registered trademarks or trademarks of SAS Institute Inc. in the USA and other countries. indicates USA registration. Other brand and product names are trademarks of their respective companies. Appendix Listing of SAS Macro to Calculated Quartiles not Provided by SAS %macro mkquart(dsin=, /*Dataset where data to be analyzed is stored*/ var=, /*Variable to be analyzed*/ dsout= /*Dataset with results*/ ); *Find number of non missing values; data _null_; retain _k 0; set &dsin end=eof; if ^missing(&var) then _k+1; if eof then call symput('xobs',compress(put(_k,8.))); run; *Load dataset into an array; data &dsout; attrib TK_LQ length=8 label='tukey, First Quartile' TK_UQ length=8 label='tukey, Third Quartile' MM_LQ length=8 label='moore and McCabe, First Quartile' MM_UQ length=8 label='moore and McCabe, Third Quartile' HF_LQ length=8 label='hyndman and Fan, First Quartile' HF_UQ length=8 label='hyndman and Fan, Third Quartile' FP_LQ length=8 label='freund and Perles, First Quartile' FP_UQ length=8 label='freund and Perles, Third Quartile' MS_LQ length=8 label='mendenhall and Sincich, First Quartile' MS_UQ length=8 label='mendenhall and Sincich, Third Quartile'; set &dsin (where=(^missing(&var))) end=eof; array xx{&xobs} _temporary_; keep TK_LQ TK_UQ MM_LQ MM_UQ HF_LQ HF_UQ FP_LQ FP_UQ MS_LQ MS_UQ; xx{_n_}=&var; if eof then do; **Tukey; if mod(dim(xx),2)=1 then do; TK_LQpos=(dim(xx)+3)/4; TK_UQpos=(3*dim(xx)+1)/4; else do; TK_LQpos=(dim(xx)+2)/4; TK_UQpos=(3*dim(xx)+2)/4; 4

TK_LQ=((1 (TK_LQpos floor(tk_lqpos)))*xx{floor(tk_lqpos)}) + ((TK_LQpos floor(tk_lqpos))*xx{floor(tk_lqpos)+1}); TK_UQ=((1 (TK_UQpos floor(tk_uqpos)))*xx{floor(tk_uqpos)}) + ((TK_UQpos floor(tk_uqpos))*xx{floor(tk_uqpos)+1}); **Moore and McCabe; if mod(x,2)=1 then do; MM_LQpos=(dim(xx)+1)/4; MM_UQpos=(3*dim(xx)+3)/4; else do; MM_LQpos=(dim(xx)+2)/4; MM_UQpos=(3*dim(xx)+2)/4; MM_LQ=((1 (MM_LQpos floor(mm_lqpos)))*xx{floor(mm_lqpos)}) + ((MM_LQpos floor(mm_lqpos))*xx{floor(mm_lqpos)+1}); MM_UQ=((1 (MM_UQpos floor(mm_uqpos)))*xx{floor(mm_uqpos)}) + ((MM_UQpos floor(mm_uqpos))*xx{floor(mm_uqpos)+1}); **Hyndman and Fan; HF_LQpos=(dim(xx)+2)/4; HF_LQ=((1 (HF_LQpos floor(hf_lqpos)))*xx{floor(hf_lqpos)}) + ((HF_LQpos floor(hf_lqpos))*xx{floor(hf_lqpos)+1}); HF_UQpos=((3*dim(xx))+2)/4; HF_UQ=((1 (HF_UQpos floor(hf_uqpos)))*xx{floor(hf_uqpos)}) + ((HF_UQpos floor(hf_uqpos))*xx{floor(hf_uqpos)+1}); **Freund and Perles; FP_LQpos=(dim(xx)+3)/4; FP_LQ=((1 (FP_LQpos floor(fp_lqpos)))*xx{floor(fp_lqpos)}) + ((FP_LQpos floor(fp_lqpos))*xx{floor(fp_lqpos)+1}); FP_UQpos=((3*dim(xx))+1)/4; FP_UQ=((1 (FP_UQpos floor(fp_uqpos)))*xx{floor(fp_uqpos)}) + ((FP_UQpos floor(fp_uqpos))*xx{floor(fp_uqpos)+1}); **Mendenhall and Sincich; MS_LQpos=(dim(xx)+1)/4; if (MS_LQpos floor(ms_lqpos))<0.5 then MS_LQ=xx{floor(MS_LQpos)}; else MS_LQ=xx{floor(MS_LQpos)+1}; MS_UQpos=((3*dim(xx))+2)/4; if (MS_UQpos floor(ms_uqpos))<=0.5 then MS_UQ=xx{floor(MS_UQpos)}; else MS_UQ=xx{floor(MS_UQpos)+1}; run; %mend mkquart; 5