Macros for Two-Sample Hypothesis Tests Jinson J. Erinjeri, D.K. Shifflet and Associates Ltd., McLean, VA

Similar documents
Cleaning Duplicate Observations on a Chessboard of Missing Values Mayrita Vitvitska, ClinOps, LLC, San Francisco, CA

An Efficient Method to Create Titles for Multiple Clinical Reports Using Proc Format within A Do Loop Youying Yu, PharmaNet/i3, West Chester, Ohio

A Macro that can Search and Replace String in your SAS Programs

A Side of Hash for You To Dig Into

How to Keep Multiple Formats in One Variable after Transpose Mindy Wang

Automating Preliminary Data Cleaning in SAS

Using PROC SQL to Calculate FIRSTOBS David C. Tabano, Kaiser Permanente, Denver, CO

Data Quality Review for Missing Values and Outliers

A Few Quick and Efficient Ways to Compare Data

Using Templates Created by the SAS/STAT Procedures

A SAS Macro for Producing Benchmarks for Interpreting School Effect Sizes

INTRODUCTION TO SAS HOW SAS WORKS READING RAW DATA INTO SAS

Using PROC REPORT to Cross-Tabulate Multiple Response Items Patrick Thornton, SRI International, Menlo Park, CA

Data Edit-checks Integration using ODS Tagset Niraj J. Pandya, Element Technologies Inc., NJ Vinodh Paida, Impressive Systems Inc.

Using SAS/SCL to Create Flexible Programs... A Super-Sized Macro Ellen Michaliszyn, College of American Pathologists, Northfield, IL

Greenspace: A Macro to Improve a SAS Data Set Footprint

CC13 An Automatic Process to Compare Files. Simon Lin, Merck & Co., Inc., Rahway, NJ Huei-Ling Chen, Merck & Co., Inc., Rahway, NJ

Keeping Track of Database Changes During Database Lock

PharmaSUG Paper TT11

A Macro to Create Program Inventory for Analysis Data Reviewer s Guide Xianhua (Allen) Zeng, PAREXEL International, Shanghai, China

186 Statistics, Data Analysis and Modeling. Proceedings of MWSUG '95

So Much Data, So Little Time: Splitting Datasets For More Efficient Run Times and Meeting FDA Submission Guidelines

Using SAS Macros to Extract P-values from PROC FREQ

Out of Control! A SAS Macro to Recalculate QC Statistics

Run your reports through that last loop to standardize the presentation attributes

Virtual Accessing of a SAS Data Set Using OPEN, FETCH, and CLOSE Functions with %SYSFUNC and %DO Loops

Useful Tips When Deploying SAS Code in a Production Environment

Two useful macros to nudge SAS to serve you

Paper S Data Presentation 101: An Analyst s Perspective

Submitting SAS Code On The Side

%MAKE_IT_COUNT: An Example Macro for Dynamic Table Programming Britney Gilbert, Juniper Tree Consulting, Porter, Oklahoma

An Algorithm to Compute Exact Power of an Unordered RxC Contingency Table

Creating output datasets using SQL (Structured Query Language) only Andrii Stakhniv, Experis Clinical, Ukraine

Chapter 6: Modifying and Combining Data Sets

Cover the Basics, Tool for structuring data checking with SAS Ole Zester, Novo Nordisk, Denmark

BY S NOTSORTED OPTION Karuna Samudral, Octagon Research Solutions, Inc., Wayne, PA Gregory M. Giddings, Centocor R&D Inc.

3N Validation to Validate PROC COMPARE Output

The Power of PROC SQL Techniques and SAS Dictionary Tables in Handling Data

Statistics, Data Analysis & Econometrics

Tasks Menu Reference. Introduction. Data Management APPENDIX 1

An Easy Route to a Missing Data Report with ODS+PROC FREQ+A Data Step Mike Zdeb, FSL, University at Albany School of Public Health, Rensselaer, NY

SAS Macros CORR_P and TANGO: Interval Estimation for the Difference Between Correlated Proportions in Dependent Samples

A Macro to Keep Titles and Footnotes in One Place

footnote1 height=8pt j=l "(Rev. &sysdate)" j=c "{\b\ Page}{\field{\*\fldinst {\b\i PAGE}}}";

Hypothesis Testing: An SQL Analogy

Macros to Report Missing Data: An HTML Data Collection Guide Patrick Thornton, University of California San Francisco, SF, California

Poster Frequencies of a Multiple Mention Question

SAS Macro Technique for Embedding and Using Metadata in Web Pages. DataCeutics, Inc., Pottstown, PA

PhUSE US Connect 2018 Paper CT06 A Macro Tool to Find and/or Split Variable Text String Greater Than 200 Characters for Regulatory Submission Datasets

Quick and Efficient Way to Check the Transferred Data Divyaja Padamati, Eliassen Group Inc., North Carolina.

Automating Comparison of Multiple Datasets Sandeep Kottam, Remx IT, King of Prussia, PA

PharmaSUG Paper AD06

Want to Do a Better Job? - Select Appropriate Statistical Analysis in Healthcare Research

= %sysfunc(dequote(&ds_in)); = %sysfunc(dequote(&var));

Clinical Data Visualization using TIBCO Spotfire and SAS

Facilitate Statistical Analysis with Automatic Collapsing of Small Size Strata

BUSINESS ANALYTICS. 96 HOURS Practical Learning. DexLab Certified. Training Module. Gurgaon (Head Office)

Validation Summary using SYSINFO

Internet/Intranet, the Web & SAS

Statistics and Data Analysis. Common Pitfalls in SAS Statistical Analysis Macros in a Mass Production Environment

EXST SAS Lab Lab #6: More DATA STEP tasks

%MISSING: A SAS Macro to Report Missing Value Percentages for a Multi-Year Multi-File Information System

WEB MATERIAL. eappendix 1: SAS code for simulation

Using PROC SQL to Generate Shift Tables More Efficiently

Building Sequential Programs for a Routine Task with Five SAS Techniques

The Dataset Diet How to transform short and fat into long and thin

The Proc Transpose Cookbook

STEP 1 - /*******************************/ /* Manipulate the data files */ /*******************************/ <<SAS DATA statements>>

Correcting for natural time lag bias in non-participants in pre-post intervention evaluation studies

Getting it Done with PROC TABULATE

SAS Online Training: Course contents: Agenda:

Simplifying the Sample Design Process with PROC PMENU

An Efficient Tool for Clinical Data Check

PharmaSUG 2013 CC26 Automating the Labeling of X- Axis Sanjiv Ramalingam, Vertex Pharmaceuticals, Inc., Cambridge, MA

Missing Pages Report. David Gray, PPD, Austin, TX Zhuo Chen, PPD, Austin, TX

CREATING THE DISTRIBUTION ANALYSIS

New Vs. Old Under the Hood with Procs CONTENTS and COMPARE Patricia Hettinger, SAS Professional, Oakbrook Terrace, IL

Windows Application Using.NET and SAS to Produce Custom Rater Reliability Reports Sailesh Vezzu, Educational Testing Service, Princeton, NJ

ABSTRACT DATA CLARIFCIATION FORM TRACKING ORACLE TABLE INTRODUCTION REVIEW QUALITY CHECKS

Frequentist and Bayesian Interim Analysis in Clinical Trials: Group Sequential Testing and Posterior Predictive Probability Monitoring Using SAS

Essential ODS Techniques for Creating Reports in PDF Patrick Thornton, SRI International, Menlo Park, CA

%Addval: A SAS Macro Which Completes the Cartesian Product of Dataset Observations for All Values of a Selected Set of Variables

Don t Forget About SMALL Data

Data Quality Control: Using High Performance Binning to Prevent Information Loss

A Macro Application on Confidence Intervals for Binominal Proportion

WHERE YOUR MIND REALLY IS

PharmaSUG Paper IB11

PharmaSUG Paper SP07

What Do You Mean My CSV Doesn t Match My SAS Dataset?

ABSTRACT INTRODUCTION TRICK 1: CHOOSE THE BEST METHOD TO CREATE MACRO VARIABLES

How to Incorporate Old SAS Data into a New DATA Step, or What is S-M-U?

ABSTRACT INTRODUCTION WORK FLOW AND PROGRAM SETUP

Quick Data Definitions Using SQL, REPORT and PRINT Procedures Bradford J. Danner, PharmaNet/i3, Tennessee

Stat 302 Statistical Software and Its Applications SAS: Data I/O

Ranking Between the Lines

Paper CC06. a seed number for the random number generator, a prime number is recommended

Tracking Dataset Dependencies in Clinical Trials Reporting

Stat 302 Statistical Software and Its Applications SAS: Data I/O & Descriptive Statistics

Creating Macro Calls using Proc Freq

Paper CC16. William E Benjamin Jr, Owl Computer Consultancy LLC, Phoenix, AZ

Transcription:

Paper CC-20 Macros for Two-Sample Hypothesis Tests Jinson J. Erinjeri, D.K. Shifflet and Associates Ltd., McLean, VA ABSTRACT Statistical Hypothesis Testing is performed to determine whether enough statistical evidence exists to conclude that a hypothesis about a parameter is supported by the data. This paper deals with macro codes for Two-Sample hypothesis testing for means and proportions which are the commonly used statistical tests across all industries. The application was mainly developed to determine whether significant differences (mean/proportion) existed between important travel variables from year to year in the raw data. In this paper, generalized macros for Two-Sample hypothesis tests are presented. INTRODUCTION Statistical Hypothesis testing is a quantitative procedure for making inferences about a population by detecting whether enough statistical evidence exists to conclude that a hypothesis about a parameter is supported by the data. PROC TTEST and PROC SEQDESIGN are some of the few options where one can carry out two sample tests. However, if one wants to suppress all the information these outputs generate and make it simple, say yes/no to the end user - it is best to develop a macro. The main advantage will be to perform multiple two-sample tests across subjects in one run. In this paper, the mentioned advantages were incorporated to develop macros for hypothesis testing of two-sample (independent) tests, mainly for: 1. Test of Two Means 2. Test of Two Proportions The t-distribution and normal distribution was used for test of means and proportions respectively. It is important to note that once the macro code is deciphered, it is easy to incorporate changes based on requirements. The developed macro is best explained with the help of an example. In this case, the travel related data is used to illustrate the macros. The snapshot of travel data is shown in Figure1 with each record showing travel related information by a respondent. Obs nmmonth resp sdays fstate s_wt tmonth ttran spurp ifus tpdays tpers Myear4 Year4 1 2 591 1 RI 0.275 1 4 12 1 1.025 0.549 2007 2007 2 2 639 0 GA 9.564 1 5 13 1 51.142 57.38 2007 2007 3 2 1790 3 FL 1.199 1 1 2 1 4.879 1.199 2007 2007 4 2 1790 1 MI 2.048 1 1 4 1 3.805 2.048 2007 2007 5 2 5053 5 FL 0.997 1 4 8 1 6.34 0.997 2007 2007 6 2 5053 5 IN 1.323 1 4 8 1 8.354 1.323 2007 2007 7 2 5053 5 IN 1.323 1 4 8 1 8.354 1.323 2007 2007 8 2 5053 5 IN 1.323 1 4 8 1 8.354 1.323 2007 2007 9 2 5107 0 AZ 10.80 1 4 1 1 25.156 32.41 2007 2007 Figure 1. Snapshot of Travel Data Collected TEST OF TWO MEANS In most statistical testing, the standard deviation of population is unknown and therefore we have used t-distribution for the test of two means. For the above data in Figure1, the hypothesis is to find if there are significant differences between average length of stay (variable =sdays) at each of the destination states (variable=fstate) between years (variable=year4) in the raw data. The advantage of the macro is that we can do multiple hypotheses testing for as many subjects. In the example above we can test the difference between lengths of stay between various years for all states. The macro identifies all the possibilities and produces output as an excel file. The comments presented in the code below gives the description of the macro with partial output presented in Figure 2. %macro mean2test(dataname=,v2cb=,v2cw=,v2fall=,los=) /*Storing the categorical information of the variables (names and the number of them)*/ proc sql noprint 1

select distinct (&v2cb.) into : v2cb_nam separated by ' ' select count(distinct (&v2cb.)) into : n_v2cb select distinct(&v2fall.) into : s2test_nam separated by ' ' select count(distinct(&v2fall.)) into : n_v2fall quit /*Testing the mandatory requirements of variables for two-sample mean test*/ data _null_ set &dataname. (firstobs=1 obs=1) v2cw_type=vtype(&v2cw.) if v2cw_type='c' then do put "The variable &v2cw. considered for the two sample test is not numeric." abort else if &n_v2cb. in ('0','1') then do put "The variable &v2cb. should have atleast two categories. Currently it has %left(&n_v2cb.)." abort /* splitting the data for the variable under test (variable=v2cb)*/ data %do i=1 %to &n_v2cb. &v2cb.&i. set &dataname. %do j=1 %to &n_v2cb. if &v2cb.=%scan(&v2cb_nam.,&j.) then output &v2cb.&j. /* obtaining the frequency for the variables under test(v2cb) and total sample size by the subjects (variable=v2fall)*/ %do k=1 %to &n_v2cb. data _null_ set &v2cb.&k. call symputx('temps',year4) proc means data=&v2cb.&k. nway class &v2cb. &v2fall. var &v2cw. output out=data&k.(drop= _freq type_) mean=m&temps. std=std&temps. n=tot&temps. /*combining all the data sets obtained from the above Proc means output*/ data combine1 merge %do l=1 %to &n_v2cb. 2

data&l by &v2fall. /* determining all possible combinations taken two at a time for the variable under test (variable=v2cb)*/ data trial array v v1-v2 array temp(&n_v2cb.) _temporary_(&v2cb_nam.) do i=1 to %eval(&n_v2cb.-1) do j=2 to &n_v2cb. v{1}=temp{i} v{2}=temp{j} output data trial1 set trial if i-j>=0 then delete intvars=cat(v1,v2) drop i j /*v1 v2*/ /*Storing the all the possible combinations*/ proc sql select count(*) into :nobs select intvars into :coms1-:coms%left(&nobs.) select v1 into :a1-:a%left(&nobs.) select v2 into :b1-:b%left(&nobs.) quit data final set combine1 %do o=1 %to &nobs. /*calculating all the required stats to perform two sample t-test on means*/ /*los is Level of Significance*/ sp&&coms&o.=sqrt((((tot&&a&o.-1)*std&&a&o.*std&&a&o.) + ((tot&&b&o.- 1)*std&&b&o.*std&&b&o.))/(tot&&a&o.+ tot&&b&o. -2)) se&&coms&o.=sp&&coms&o.*(sqrt(1/tot&&a&o.+ 1/tot&&b&o.)) teststat&&coms&o.=(m&&a&o.-m&&b&o.)/(se&&coms&o.) pval&&coms&o.=2*(1-probt(abs(teststat&&coms&o.),(tot&&a&o.+ tot&&b&o. - 2))) /* constructing confidence intervals*/ &&coms&o.=(m&&a&o.-m&&b&o.)-(se&&coms&o.*tinv(1-(&los./2),tot&&a&o.+ tot&&b&o. -2)) &&coms&o.=(m&&a&o.-m&&b&o.)+(se&&coms&o.*tinv(1-(&los./2),tot&&a&o.+ tot&&b&o. -2)) /*checking for significance with yes/no*/ if (&&coms&o.=. or &&coms&o.=.) then significant&&coms&o.="mis" else if &&coms&o.<=0 and &&coms&o.>=0 then significant&&coms&o.='no' 3

else if ((&&coms&o.>0 and &&coms&o.>0) or (&&coms&o.<0 and &&coms&o.<0)) then significant&&coms&o.='yes' /*Output the results to excel*/ ods html file="test2mean.xls" rs=none style=minimal proc print data=final noobs var &v2fall. significant: : : ods html close %mend mean2test %mean2test(dataname=rates,v2cb=year4,v2cw=sdays,v2fall=fstate,los=0.05) fstate Significant significant Significant 20072008 20072009 20082009 20072008 20072009 20082009 20072008 20072009 20082009 AK yes no yes 3.50668-0.62712-4.22413 4.47505 0.96106-3.42366 AL no no no -0.2091-0.08616-0.08825 0.23715 0.37415 0.34819 AR yes no no -0.50749-0.30993-0.05686-0.0301 0.15478 0.43931 AZ yes no no -0.52331-0.29885-0.01483-0.02873 0.20455 0.47257 CA yes no yes 1.16971-0.0725-1.26716 1.28508 0.10244-1.1577 Figure 2. Partial Output of Two Sample Test of Means TEST OF TWO PROPORTIONS The test for two proportions macro is illustrated with the same example presented in Figure1. Let s assume that the hypothesis to be tested is that the share of travelers transportation-mode to each state with respect to last year has remained the same. To perform multiple two-sample proportions tests across many subjects, it is best to develop a macro. The macro is presented below with partial output shown in Figure 3. %macro prop2test(dataname=,v2cb=,v2cw=,v2fall=,los=) /*Storing the categorical information of the variables (names and the number of them)*/ proc sql noprint select distinct (&v2cb.) into : v2cb_nam separated by ' ' select count(distinct (&v2cb.)) into : n_v2cb select distinct (&v2cw.) into : v2cw_nam separated by ' ' select count(distinct (&v2cw.)) into : n_v2cw select distinct(&v2fall.) into : s2test_nam separated by ' ' select count(distinct(&v2fall.)) into : n_v2fall quit /*Testing the mandatory requirements of variables for the two-sample proportion test*/ data _null_ set &dataname. (firstobs=1 obs=1) if &n_v2cw. in ('0','1') then do put "The variable &v2cw. should have atleast two categories. Currently it has %left(&n_v2cw.)." 4

abort else if &n_v2cb. in ('0','1') then do put "The variable &v2cb. should have atleast two categories. Currently it has %left(&n_v2cb.)." abort /* splitting the data for the variable under test (variable=v2cb)*/ data %do i=1 %to &n_v2cb. &v2cb.&i. set &dataname. %do j=1 %to &n_v2cb. if &v2cb.=%scan(&v2cb_nam.,&j.) then output &v2cb.&j. /* obtaining the frequency for each category of variable v2cb and total sample size by variable v2fall */ %do k=1 %to &n_v2cb. proc sort data=&v2cb.&k.by &v2cb. &v2fall. proc freq data=&v2cb.&k. table fstate/out=freq&k. by &v2cb. /* obtaining the frequency for each category of variable v2cb and sample size by variable v2cw */ proc freq data=&v2cb.&k. table &v2cw.*fstate/out=freqa&k. by &v2cb. data freqa&k. set freqa&k. /*combining and rearranging all the above data sets*/ data combine1 set %do l=1 %to &n_v2cb. freq&l drop percent data combine2 set %do m=1 %to &n_v2cb. freqa&m drop percent proc sort data=combine1by &v2fall. proc transpose data=combine1 prefix=n out=combine1t by &v2fall. var count id &v2cb. proc sort data=combine2by &v2fall. &v2cw. 5

proc transpose data=combine2 prefix=ct out=combine2t by &v2fall. &v2cw. var count id &v2cb. proc sort data=combine1tby &v2fall. proc sort data=combine2tby &v2fall. data inorder merge combine1t combine2t by &v2fall. /* determinig all possible combinations taken two at a time for variable v2cb*/ data trial array v v1-v2 array temp(&n_v2cb.) _temporary_(&v2cb_nam.) do i=1 to %eval(&n_v2cb.-1) do j=2 to &n_v2cb. v{1}=temp{i} v{2}=temp{j} output data trial1 set trial if i-j>=0 then delete intvars=cat(v1,v2) drop i j /*v1 v2*/ proc sql select count(*) into :nobs select intvars into :coms1-:coms%left(&nobs.) select v1 into :a1-:a%left(&nobs.) select v2 into :b1-:b%left(&nobs.) quit /*calculating all the required stats to perform two sample z-test on proportions,los is level of significance*/ data final set inorder %do o=1 %to &n_v2cb. phat%scan(&v2cb_nam.,&o)=ct%scan(&v2cb_nam.,&o.)/n%scan(&v2cb_nam.,&o.) %do p=1 %to &nobs. diff&&coms&p.=phat&&a&p.-phat&&b&p. phat&&coms&p. = (ct&&a&p. + ct&&b&p.)/(n&&a&p. + n&&b&p.) sigdiff&&coms&p. = SQRT(phat&&coms&p.*(1- phat&&coms&p.)*(1/ct&&a&p. + 1/ct&&b&p.)) TestStat&&coms&p. = (phat&&a&p.-phat&&b&p.)/sigdiff&&coms&p. pval&&coms&p. = 2*MIN ((1-Probnorm(Teststat&&coms&p. )), Probnorm(Teststat&&coms&p.)) &&coms&p.= diff&&coms&p. - (sigdiff&&coms&p.*probit(1- (&los./2))) 6

&&coms&p.= diff&&coms&p. + (sigdiff&&coms&p.*probit(1- (&los./2))) if (&&coms&p.=. or &&coms&p.=.) then significant&&coms&p.="mis" else if &&coms&p.<=0 and &&coms&p.>=0 then significant&&coms&p.='no' else if ((&&coms&p.>0 and &&coms&p.>0) or (&&coms&p.<0 and &&coms&p.<0)) then significant&&coms&p.='yes' /* output to excel*/ ods html file="test2prop.xls" rs=none style=minimal proc print data=final noobs var &v2fall. &v2cw. significant: : : ods html close %mend prop2test %prop2test(dataname=rates,v2cb=year4,v2cw=ttran,v2fall=fstate,los=0.05) fstate ttran Significant 20072008 Significant 20072009 Significant 20082009 20072008 20072009 20082009 20072008 20072009 20082009 AL 1 no no no -0.07093-0.05213-0.04242 0.05559 0.07754 0.08316 AL 2 no mis mis -0.06579.. 0.06632.. AL 3 no no no -0.0683-0.06517-0.06211 0.06468 0.0703 0.07086 AL 4 no yes yes -0.01774 0.04361 0.02979 0.04742 0.11468 0.09882 AL 5 no yes yes -0.06385-0.17805-0.17319 0.06419-0.03533-0.04052 AL 1 no no no -0.07093-0.05213-0.04242 0.05559 0.07754 0.08316 Figure 3. Partial Output of Two Sample Test of Proportions CONCLUSION The macros developed in this paper can be used to perform multiple-two-sample tests across many subjects thereby saving time. Also, the macros can be tailored to specific needs with minimal changes in the code. The changes can be related to rounding numbers, performing one tailed test, incorporating weights, considering equality of variances and so forth. ACKNOWLEDGMENTS The author would like to thank Nandini Nadkarni for her valuable input while reviewing this paper. CONTACT INFORMATION Your comments and questions are valued and encouraged. Contact the author at: Jinson J. Erinjeri D.K. Shifflet and Associates Ltd. 1750 Old Meadow Rd., Suite, 620 Mclean, VA 22102 Work Phone: 703-536-0924 Fax: 703-536-0580 E-mail: jerinjeri@dksa.com Web: www.dksa.com SAS and all other SAS Institute Inc. product or service names are registered trademarks or trademarks of SAS Institute Inc. in the USA and other countries. indicates USA registration. Other brand and product names are trademarks of their respective companies. 7