Clip Extreme Values for a More Readable Box Plot Mary Rose Sibayan, PPD, Manila, Philippines Thea Arianna Valerio, PPD, Manila, Philippines

Similar documents
Taming the Box Plot. Sanjiv Ramalingam, Octagon Research Solutions, Inc., Wayne, PA

NAME: DIRECTIONS FOR THE ROUGH DRAFT OF THE BOX-AND WHISKER PLOT

STA Module 2B Organizing Data and Comparing Distributions (Part II)

STA Learning Objectives. Learning Objectives (cont.) Module 2B Organizing Data and Comparing Distributions (Part II)

DAY 52 BOX-AND-WHISKER

STA Rev. F Learning Objectives. Learning Objectives (Cont.) Module 3 Descriptive Measures

Make the Most Out of Your Data Set Specification Thea Arianna Valerio, PPD, Manila, Philippines

CHAPTER-13. Mining Class Comparisons: Discrimination between DifferentClasses: 13.4 Class Description: Presentation of Both Characterization and

Measures of Position. 1. Determine which student did better

3.3 The Five-Number Summary Boxplots

Boxplots. Lecture 17 Section Robb T. Koether. Hampden-Sydney College. Wed, Feb 10, 2010

Understanding and Comparing Distributions. Chapter 4

Chapter 3. Descriptive Measures. Slide 3-2. Copyright 2012, 2008, 2005 Pearson Education, Inc.

STP 226 ELEMENTARY STATISTICS NOTES PART 2 - DESCRIPTIVE STATISTICS CHAPTER 3 DESCRIPTIVE MEASURES

STA 570 Spring Lecture 5 Tuesday, Feb 1

Measures of Position

It s Not All Relative: SAS/Graph Annotate Coordinate Systems

Math 167 Pre-Statistics. Chapter 4 Summarizing Data Numerically Section 3 Boxplots

Chapter 5. Understanding and Comparing Distributions. Copyright 2012, 2008, 2005 Pearson Education, Inc.

Lecture Notes 3: Data summarization

Math 120 Introduction to Statistics Mr. Toner s Lecture Notes 3.1 Measures of Central Tendency

CHAPTER 3: Data Description

Box and Whisker Plot Review A Five Number Summary. October 16, Box and Whisker Lesson.notebook. Oct 14 5:21 PM. Oct 14 5:21 PM.

Chapter 6: DESCRIPTIVE STATISTICS

Sorting big datasets. Do we really need it? Daniil Shliakhov, Experis Clinical, Kharkiv, Ukraine

Chapter 1. Looking at Data-Distribution

Descriptive Statistics: Box Plot

Chapter 5. Understanding and Comparing Distributions. Copyright 2010, 2007, 2004 Pearson Education, Inc.

Basic Commands. Consider the data set: {15, 22, 32, 31, 52, 41, 11}

PharmaSUG 2013 CC26 Automating the Labeling of X- Axis Sanjiv Ramalingam, Vertex Pharmaceuticals, Inc., Cambridge, MA

Name Date Types of Graphs and Creating Graphs Notes

Reproducibly Random Values William Garner, Gilead Sciences, Inc., Foster City, CA Ting Bai, Gilead Sciences, Inc., Foster City, CA

CREATING THE DISTRIBUTION ANALYSIS

Unit 7 Statistics. AFM Mrs. Valentine. 7.1 Samples and Surveys

15 Wyner Statistics Fall 2013

Chapter 3: Data Description - Part 3. Homework: Exercises 1-21 odd, odd, odd, 107, 109, 118, 119, 120, odd

PharmaSUG Paper TT10 Creating a Customized Graph for Adverse Event Incidence and Duration Sanjiv Ramalingam, Octagon Research Solutions Inc.

Univariate Statistics Summary

Chapter 2. Descriptive Statistics: Organizing, Displaying and Summarizing Data

Run your reports through that last loop to standardize the presentation attributes

The main issue is that the mean and standard deviations are not accurate and should not be used in the analysis. Then what statistics should we use?

1.3 Graphical Summaries of Data

Box Plots. OpenStax College

Homework Packet Week #3

MATH& 146 Lesson 10. Section 1.6 Graphing Numerical Data

UNIVERSITY OF BAHRAIN COLLEGE OF APPLIED STUDIES STATA231 LAB 4. Creating A boxplot Or whisker diagram

Pre-Calculus Multiple Choice Questions - Chapter S2

Numerical Summaries of Data Section 14.3

Section 6.3: Measures of Position

An Efficient Tool for Clinical Data Check

PharmaSUG Paper PO10

Automate Clinical Trial Data Issue Checking and Tracking

Paper S Data Presentation 101: An Analyst s Perspective

Developing Graphical Standards: A Collaborative, Cross-Functional Approach Mayur Uttarwar, Seattle Genetics, Inc., Bothell, WA

How individual data points are positioned within a data set.

TMTH 3360 NOTES ON COMMON GRAPHS AND CHARTS

CHAPTER 2 DESCRIPTIVE STATISTICS

10.4 Measures of Central Tendency and Variation

10.4 Measures of Central Tendency and Variation

CHAPTER 1. Introduction. Statistics: Statistics is the science of collecting, organizing, analyzing, presenting and interpreting data.

PharmaSUG China

Chapter 3 - Displaying and Summarizing Quantitative Data

Creating Forest Plots Using SAS/GRAPH and the Annotate Facility

NCSS Statistical Software

Coders' Corner. Paper ABSTRACT GLOBAL STATEMENTS INTRODUCTION

No. of blue jelly beans No. of bags

To calculate the arithmetic mean, sum all the values and divide by n (equivalently, multiple 1/n): 1 n. = 29 years.

Table of Contents (As covered from textbook)

Averages and Variation

When Simpler is Better Visualizing Laboratory Data Using SG Procedures Wei Cheng, Isis Pharmaceuticals, Inc., Carlsbad, CA

Clinical Data Visualization using TIBCO Spotfire and SAS

AP Statistics Prerequisite Packet

2.1: Frequency Distributions and Their Graphs

Prepare a stem-and-leaf graph for the following data. In your final display, you should arrange the leaves for each stem in increasing order.

Section 9: One Variable Statistics

Chapter 2 Describing, Exploring, and Comparing Data

CHAPTER 2: DESCRIPTIVE STATISTICS Lecture Notes for Introductory Statistics 1. Daphne Skipper, Augusta University (2016)

Journey to the center of the earth Deep understanding of SAS language processing mechanism Di Chen, SAS Beijing R&D, Beijing, China

PharmaSUG Paper TT11

SUGI 29 Posters. Paper A Group Scatter Plot with Clustering Xiaoli Hu, Wyeth Consumer Healthcare., Madison, NJ

Descriptive Statistics

Name Geometry Intro to Stats. Find the mean, median, and mode of the data set. 1. 1,6,3,9,6,8,4,4,4. Mean = Median = Mode = 2.

Data Annotations in Clinical Trial Graphs Sudhir Singh, i3 Statprobe, Cary, NC

PharmaSUG China 2018 Paper AD-62

A Picture is worth 3000 words!! 3D Visualization using SAS Suhas R. Sanjee, Novartis Institutes for Biomedical Research, INC.

Day 4 Percentiles and Box and Whisker.notebook. April 20, 2018

MAT 155. Z score. August 31, S3.4o3 Measures of Relative Standing and Boxplots

Chapter 5: The beast of bias

Ranking Between the Lines

Breaking up (Axes) Isn t Hard to Do: An Updated Macro for Choosing Axis Breaks

Measures of Central Tendency:

A SAS Macro to Generate Caterpillar Plots. Guochen Song, i3 Statprobe, Cary, NC

REVIEW OF 6 TH GRADE

Tips and Tricks in Creating Graphs Using PROC GPLOT

Data Mining By IK Unit 4. Unit 4

2.1 Objectives. Math Chapter 2. Chapter 2. Variable. Categorical Variable EXPLORING DATA WITH GRAPHS AND NUMERICAL SUMMARIES

Learning Log Title: CHAPTER 8: STATISTICS AND MULTIPLICATION EQUATIONS. Date: Lesson: Chapter 8: Statistics and Multiplication Equations

+ Statistical Methods in

Creating a Box-and-Whisker Graph in Excel: Step One: Step Two:

Collect Data, Photos, and GPS Coordinates with a Handheld Windows Tablet ARM Tablet Data Collector, Licensed as TDCx Add-In

Transcription:

ABSTRACT PharmaSUG China 2016 - Paper 72 Clip Extreme Values for a More Readable Box Plot Mary Rose Sibayan, PPD, Manila, Philippines Thea Arianna Valerio, PPD, Manila, Philippines The BOXPLOT procedure creates box-and-whiskers plots of measurement that displays the mean, quartiles, minimum, maximum and outlier values for one or several groups. When handling interim or "dirty" data, there can be numerous outliers that may affect the readability of the plots. This issue can result into compressed plots and wide y- axis ranges, from which no clear conclusions can be drawn unless summary statistic values are presented. The CLIPFACTOR option has the ability of clipping extreme values to produce readable and useful plots without affecting the summary statistic values. This paper aims to explore the CLIPFACTOR option of PROC BOXPLOT and present a macro that will compute for the optimal factor value based on the data. In addition, the macro also produces a data set which contains the extreme observations that were clipped from the plots. INTRODUCTION In clinical trial analysis, box plots are commonly used to visualize and compare variabilities and summary statistics within and between one or more treatment groups. These plots offer a quick and simple way to assess the shape and distribution of the data. The degree of dispersion, skewness and even outliers can easily be estimated just by looking at the plots. However, when dealing with interim or dirty data, extreme values are often observed. Due to these extreme values (or otherwise known as outliers), the outputs often have wide axis ranges leading to overly compressed box plots. This defeats the practical use of box plots as summary statistics may not be easily estimated and the distribution of the data may be hard to distinguish in a compressed plot. This may prohibit comparison of results across different groups and may limit drawing sound conclusions from the plot. Using CLIPFACTOR=factor option in PROC BOXPLOT is an excellent way to clip extreme values in the plot without actually affecting the summary statistics derived from the data. Outliers falling outside the clipping range derived from the specified factor are removed from the box plot. In effect, axis ranges can be adjusted to give more emphasis on the main body of the boxplot where most of the summary statistics can be found. Question is, how to derive the appropriate factor value for a given data? What are the customizations or clipping options that can be used to further improve the plot? How can we make use of the clipped values? COMPRESSED PLOTS As an illustration, below is one example of a compressed boxplot due to extreme values. Figure 1: HGB Concentration Boxplot

As observed from figure 1, the plots are too compressed and the ranges are too wide leading to a difficulty in interpreting the results. Since the plots look similar with each other, comparing between groups would not be preferable. Even the summary statistics cannot be estimated just by looking at the plots. These are the problems that the CLIPFACTOR option resolves. CLIPFACTOR AND ITS RELATED OPTIONS These extreme outliers can be clipped from the plot to produce a more readable and useful output without changing the actual summary statistics. The factor value specified in the CLIPFACTOR option will determine the extent to which extreme values are clipped without affecting the main body of boxplot. Note that the valid factor is any value greater than 1. The minimum and maximum range for clipping values is defined below: y max = Q1 + (Q3 Q1 ) factor y min = Q3 (Q3 Q1 ) factor Observations outside of derived y max and y min values will be removed from the plot. The algorithm of deriving the first quartile Q1 may vary, depending on whether summary statistics are taken per different groups or across all records. If summary statistics are to be taken per different groups, then Q1 will be derived by variable level first. The minimum of Q1 across all of the by variable levels will then be set as Q1. If summary statistics are to be taken across all records, Q1 is derived as the first quartile across all observations. The same algorithm was also followed in deriving the third quartile Q3, only that the maximum of all Q3 among all variable levels was set as Q3 when summary statistics are to be derived per by variable level. The formula for y max was mainly used in deriving the formula for the optimal CLIPFACTOR and this is discussed further in the next sections. For summary statistics with by variables specified (say TRT01A and AVISITN are the grouping variables), the codes below will be used in deriving Q1 and Q3 : proc means data=dummydata noprint; by trt01a avisitn; var aval; output out=summ_ q1(aval)=q1 q3(aval)=q3 mean(aval)=mean; proc means data=summ_ noprint; var q1 q3; output out=fsumm max(q3)=q3 min(q1)=q1 ; Aside from the CLIPFACTOR option, there are other clipping options available in PROC BOXPLOT to help further improve the plot. Here is the list of commonly used clipping options and their use: The CLIPLEGEND= option allows the user to indicate the label for the legend that specifies the number of boxes clipped. The label must not be more than 16 characters and must be enclosed in quotes. The CLIPLEGPOS= option specifies the position for the legend. Possible values may be TOP or BOTTOM. The CLIPSUBCHAR= option specifies the character which will be replaced with the number of boxes clipped in the label of legend. The CLIPSYMBOL= option allows the user to specify the symbol to mark plots with clipped values. The CLIPSYMBOLHT= option specifies the height of the symbol marker for outliers. If not specified, then the default height specified with the H= option in the symbol statement would be used. SAMPLE OF CLIPPED OUTPUTS Below are samples of codes applying the CLIPFACTOR option in the BOXPLOT procedure, followed by the resulting plot: Figure 2 is a clipped version of Figure 1, applying the different clipping options in PROC BOXPLOT. title1 'Figure 2: Clipped HGB Concentration Boxplot (Clipped version of Figure 1)'; proc boxplot data=dummy(where=(trt01a="treatment A")) ; format avisitn visit.; plot aval*avisitn/ clipfactor=1.7 cliplegend='boxes clipped=#' clipsubchar='#' cliplegpos=bottom clipsymbol=square noframe idsymbol=circle boxstyle=schematic cboxfill=white vaxis=axis1 haxis=axis2 outbox=boxval1 ; 2

Figure 2: Clipped HGB Concentration Boxplot (Clipped Version of Figure 1) In contrast to Figure 1, the boxplots are now more readable after using the CLIPFACTOR option. Since the extreme values were removed, the axis range has been adjusted to focus on the main body of the boxplots (which includes the median, quartiles, etc). This time, comparison among the different groups can be easily made because the difference among the summary statistics is more evident. The squares indicate which among the boxplots were clipped and also denote whether clipped values are extremely high or low value (depending on where the squares are placed in the axis). In this figure, since the squares are at the top of the plot, it means that extremely high values were clipped. The legend gives information on how many boxes were clipped. For this plot, outliers were removed from 6 boxes or the 6 visits. It can be seen how the figure improved with the use of the CLIPFACTOR option but how should the appropriate factor value be determined? %CLIPFACTOR MACRO The CLIPFACTOR=factor option is an easy and effective way of removing extreme values from the boxplots. However, estimating the optimal factor value is not always easy. This issue led to the development of the %CLIPFACTOR macro which utilizes the same data and determines the optimal factor value. This macro uses the same formula in finding the maximum (y max ) and minimum (y min ) points as described earlier. The first step is to compute for Q1 and Q3 using the same algorithm mentioned earlier. Then the formula commonly used in finding the upper whiskers for boxplots will be used. upwhisk = Q3 + (Q3 Q1 ) 1.5 where (Q3 Q1 ) is known as the interquartile range (IQR) (Stapel, 2010) and the value 1.5 determines how wide the clipping range will be. 3

However for this macro, the user is given the option to specify the extent of the clipping range which is stored in the cutoff macro parameter. If the cutoff value is not specified, the macro will set the value to either 1.5 or 3, depending on which cutoff value produces more outliers. Then the upwhisk value will be substituted in y max giving us the initial formula used to find the factor is: The codes below will be used to derive the factor value: factor = (upwhisk Q1 )/(Q3 Q1 ) **Calculate the interquartile range, lower and upper whiskers/fences**; data whisk; set fsumm; iqr = (q3-q1); lowwhisk=q1-(&cutoff.*iqr); upwhisk=q3+(&cutoff.*iqr); **compute for the clipfactor using formula with Q1**; proc sql noprint; select max(upwhisk) into :maxwhis from whisk; select min(lowwhisk) into :minwhis from whisk; **get minimum value that is above the lower whiskers**; select min(aval) into :minbs from dummydata where aval > &minwhis.; **get maximum value that is lower than upper whiskers**; select max(aval) into :maxbs from dummydata where aval < &maxwhis.; quit; select round((&maxbs.-q1)/(q3- from fsumm; Q1),.1) into :factor Since the upwhisk value is determined from the Q3 value, numerous extremely low values will not be considered and this will still result to a wide axis range. Therefore, a second formula is introduced to account for the extremely low values by using Q1. The factor will then be re-derived using the formula for y min. The second formula used is: factor = (Q3 y min )/(Q3 Q1 ) The codes for deriving the factor value using the second formula is almost the same as the codes above except for the final section: proc sql noprint; select round((q3-&minbs.)/(q3-q1),.1) into :factor from fsumm; quit; The %CLIPFACTOR macro also considers reference ranges when supplied. The user can set the refl macro parameter to Y and specify the low and high reference ranges. These ranges will be considered in the computation of factor value so that the reference values will still be displayed in the axis. The specified reference ranges will be checked if it is within the computed maximum and minimum whiskers (maxbs and minbs). If the low reference range is lower than the computed minbs, then minbs. is set as the low reference range. If the high reference range is higher than the computed maxbs, then maxbs is set as the high reference range. 4

The optimal factor value derived is then stored in the macro variable factor, which will then be plugged in the CLIPFACTOR= option of the PROC BOXPLOT. ADDITIONAL FEATURES OF THE %CLIPFACTOR MACRO In addition to its main function of obtaining the factor value, the %CLIPFACTOR macro also outputs the details pertaining to the clipped values. A data set (clipped_values.sas7bdat) will be produced which contains information on the extreme values removed from the plot. Image 1 is an example of clipped_values data set where the list of clipped values and its details are contained. Image 1: Clipped Values Data Set Another data set containing the remaining values after the removal of outliers (input data set s name concatenated with _ ) will also be produced. To be able to make use of the clipped values, below are some suggestions on how the clipped_values data set can be used in the report: Annotate the range of clipped values in the plot Figure 3: Clipped HGB Concentration Boxplot (Clipped Version of Figure 1 with Annotation) Although the extreme values were removed from the plot, general information on these values can still be displayed. In this example, the ranges of clipped values are annotated to give an idea on how far the removed points are from maximum axis value. 5

Add details pertaining to the outlier in the footnote Figure 4: Clipped Albumin Boxplot If there are very few extreme outliers (say less than 5 outliers), the clipped_values data set can be used to extract pertinent information and present them in the footnotes. Adding these details in the footnote will somehow give emphasis on these values. Use for data cleaning process The list of extreme values contained in the clipped_values data set can be used to query the database for confirmation and resolution when needed. CONCLUSION Producing boxplots with severe outliers in the data usually results to compressed plots which are unreadable and almost not useful. Interpretation and comparison of these plots can be improved by clipping these extreme values using the CLIPFACTOR=factor option in PROC BOXPLOT. The %CLIPFACTOR macro proves to be helpful in determining the optimal factor value appropriate for the given data. This macro also provides information on the clipped values for other relevant use. This way, even if the outliers are removed from the plot, the values are not completely ignored. REFERENCES Gemperli, Armin. 2009. Zooming in on graphics. Proceedings of the Pharmaceutical Users Software Exchange 2009. Lex Jansen s Homepage. Available at http://www.lexjansen.com/phuse/2009/cs/cs04.pdf. Accessed 28 July 2016. SAS 9.22 Online product documentation. Available at https://support.sas.com/documentation/cdl/en/statug/63347/html/default/viewer.htm#titlepage.htm. Accessed 28 July 2016 Stapel, Elizabeth. "Box-and-Whisker Plots: Interquartile Ranges and Outliers." Purplemath. Available from http://www.purplemath.com/modules/boxwhisk3.htm. Accessed 28 July 2016 6

ACKNOWLEDGEMENT The authors would like to thank Rick Edwards, Jason Ralph Isturis and PPD Biostatistics and Programming Manila team who provided valuable inputs to improve this paper. CONTACT INFORMATION Your comments and questions are valued and encouraged. You may contact the authors at: Mary Rose Alpha Sibayan PPD Taguig City, Philippines E-mail: MaryRoseAlpha.Sibayan@ppdi.com Thea Arianna Valerio PPD Taguig City, Philippines E-mail: Thea.Valerio@ppdi.com SAS and all other SAS Institute Inc. product or service names are registered trademarks or trademarks of SAS Institute Inc. in the USA and other countries. indicates USA registration. Other brand and product names are trademarks of their respective companies. 7