PharmaSUG Paper AD06

Similar documents
ABSTRACT INTRODUCTION WORK FLOW AND PROGRAM SETUP

A SAS Macro Utility to Modify and Validate RTF Outputs for Regional Analyses Jagan Mohan Achi, PPD, Austin, TX Joshua N. Winters, PPD, Rochester, NY

An Efficient Method to Create Titles for Multiple Clinical Reports Using Proc Format within A Do Loop Youying Yu, PharmaNet/i3, West Chester, Ohio

Sorting big datasets. Do we really need it? Daniil Shliakhov, Experis Clinical, Kharkiv, Ukraine

Comparison of different ways using table lookups on huge tables

Clinical Data Visualization using TIBCO Spotfire and SAS

PharmaSUG 2013 CC26 Automating the Labeling of X- Axis Sanjiv Ramalingam, Vertex Pharmaceuticals, Inc., Cambridge, MA

Paper A Simplified and Efficient Way to Map Variable Attributes of a Clinical Data Warehouse

Advanced Visualization using TIBCO Spotfire and SAS

ABSTRACT INTRODUCTION MACRO. Paper RF

Matt Downs and Heidi Christ-Schmidt Statistics Collaborative, Inc., Washington, D.C.

PharmaSUG Paper PO12

%MAKE_IT_COUNT: An Example Macro for Dynamic Table Programming Britney Gilbert, Juniper Tree Consulting, Porter, Oklahoma

Data Quality Review for Missing Values and Outliers

A Simple Framework for Sequentially Processing Hierarchical Data Sets for Large Surveys

Want to Do a Better Job? - Select Appropriate Statistical Analysis in Healthcare Research

Using SAS Macros to Extract P-values from PROC FREQ

A Macro to Create Program Inventory for Analysis Data Reviewer s Guide Xianhua (Allen) Zeng, PAREXEL International, Shanghai, China

Automate Clinical Trial Data Issue Checking and Tracking

Useful Tips When Deploying SAS Code in a Production Environment

Missing Pages Report. David Gray, PPD, Austin, TX Zhuo Chen, PPD, Austin, TX

Making the most of SAS Jobs in LSAF

The Power of PROC SQL Techniques and SAS Dictionary Tables in Handling Data

Quick Data Definitions Using SQL, REPORT and PRINT Procedures Bradford J. Danner, PharmaNet/i3, Tennessee

Amie Bissonett, inventiv Health Clinical, Minneapolis, MN

Data Edit-checks Integration using ODS Tagset Niraj J. Pandya, Element Technologies Inc., NJ Vinodh Paida, Impressive Systems Inc.

Using PROC PLAN for Randomization Assignments

Correcting for natural time lag bias in non-participants in pre-post intervention evaluation studies

Facilitate Statistical Analysis with Automatic Collapsing of Small Size Strata

SAS ENTERPRISE GUIDE USER INTERFACE

Choosing the Right Technique to Merge Large Data Sets Efficiently Qingfeng Liang, Community Care Behavioral Health Organization, Pittsburgh, PA

The Path To Treatment Pathways Tracee Vinson-Sorrentino, IMS Health, Plymouth Meeting, PA

Quick and Efficient Way to Check the Transferred Data Divyaja Padamati, Eliassen Group Inc., North Carolina.

Keeping Track of Database Changes During Database Lock

Paper CC16. William E Benjamin Jr, Owl Computer Consultancy LLC, Phoenix, AZ

Interactive Programming Using Task in SAS Studio

Not Just Merge - Complex Derivation Made Easy by Hash Object

1 Files to download. 3 Macro to list the highest and lowest N data values. 2 Reading in the example data file

Document and Enhance Your SAS Code, Data Sets, and Catalogs with SAS Functions, Macros, and SAS Metadata. Louise S. Hadden. Abt Associates Inc.

PharmaSUG China Paper 70

PharmaSUG China 2018 Paper AD-62

Create a Format from a SAS Data Set Ruth Marisol Rivera, i3 Statprobe, Mexico City, Mexico

Checking for Duplicates Wendi L. Wright

SAS 9 Programming Enhancements Marje Fecht, Prowerk Consulting Ltd Mississauga, Ontario, Canada

Reproducibly Random Values William Garner, Gilead Sciences, Inc., Foster City, CA Ting Bai, Gilead Sciences, Inc., Foster City, CA

Paper CC06. a seed number for the random number generator, a prime number is recommended

Hypothesis Testing: An SQL Analogy

PharmaSUG China Mina Chen, Roche (China) Holding Ltd.

PharmaSUG Paper TT11

SQL, HASH Tables, FORMAT and KEY= More Than One Way to Merge Two Datasets

CC13 An Automatic Process to Compare Files. Simon Lin, Merck & Co., Inc., Rahway, NJ Huei-Ling Chen, Merck & Co., Inc., Rahway, NJ

Programmatic Automation of Categorizing and Listing Specific Clinical Terms

Copy That! Using SAS to Create Directories and Duplicate Files

Countdown of the Top 10 Ways to Merge Data David Franklin, Independent Consultant, Litchfield, NH

PharmaSUG Paper PO10

Submitting SAS Code On The Side

PharmaSUG China. Systematically Reordering Axis Major Tick Values in SAS Graph Brian Shen, PPDI, ShangHai

Quicker Than Merge? Kirby Cossey, Texas State Auditor s Office, Austin, Texas

Extending the Scope of Custom Transformations

Best Practice for Creation and Maintenance of a SAS Infrastructure

A Simple Time Series Macro Scott Hanson, SVP Risk Management, Bank of America, Calabasas, CA

How to Keep Multiple Formats in One Variable after Transpose Mindy Wang

PhUse Practical Uses of the DOW Loop in Pharmaceutical Programming Richard Read Allen, Peak Statistical Services, Evergreen, CO, USA

Tracking Dataset Dependencies in Clinical Trials Reporting

Practical Uses of the DOW Loop Richard Read Allen, Peak Statistical Services, Evergreen, CO

PharmaSUG Paper CC11

One Project, Two Teams: The Unblind Leading the Blind

What Do You Mean My CSV Doesn t Match My SAS Dataset?

ABSTRACT INTRODUCTION TRICK 1: CHOOSE THE BEST METHOD TO CREATE MACRO VARIABLES

Plot Your Custom Regions on SAS Visual Analytics Geo Maps

Speed Dating: Looping Through a Table Using Dates

%ANYTL: A Versatile Table/Listing Macro

186 Statistics, Data Analysis and Modeling. Proceedings of MWSUG '95

How to use UNIX commands in SAS code to read SAS logs

Simulation of Imputation Effects Under Different Assumptions. Danny Rithy

PhUSE US Connect 2018 Paper CT06 A Macro Tool to Find and/or Split Variable Text String Greater Than 200 Characters for Regulatory Submission Datasets

An Efficient Tool for Clinical Data Check

9 Ways to Join Two Datasets David Franklin, Independent Consultant, New Hampshire, USA

Using GSUBMIT command to customize the interface in SAS Xin Wang, Fountain Medical Technology Co., ltd, Nanjing, China

Summarizing Impossibly Large SAS Data Sets For the Data Warehouse Server Using Horizontal Summarization

To conceptualize the process, the table below shows the highly correlated covariates in descending order of their R statistic.

EXAMPLE 3: MATCHING DATA FROM RESPONDENTS AT 2 OR MORE WAVES (LONG FORMAT)

A Macro that can Search and Replace String in your SAS Programs

Out of Control! A SAS Macro to Recalculate QC Statistics

KEYWORDS Metadata, macro language, CALL EXECUTE, %NRSTR, %TSLIT

PASS Sample Size Software. Randomization Lists

Cover the Basics, Tool for structuring data checking with SAS Ole Zester, Novo Nordisk, Denmark

2 = Disagree 3 = Neutral 4 = Agree 5 = Strongly Agree. Disagree

PharmaSUG Paper SP04

Give me EVERYTHING! A macro to combine the CONTENTS procedure output and formats. Lynn Mullins, PPD, Cincinnati, Ohio

Using the CLP Procedure to solve the agent-district assignment problem

SAS Application Development Using Windows RAD Software for Front End

50 WAYS TO MERGE YOUR DATA INSTALLMENT 1 Kristie Schuster, LabOne, Inc., Lenexa, Kansas Lori Sipe, LabOne, Inc., Lenexa, Kansas

Using a Fillable PDF together with SAS for Questionnaire Data Donald Evans, US Department of the Treasury

Merging Data Eight Different Ways

So Much Data, So Little Time: Splitting Datasets For More Efficient Run Times and Meeting FDA Submission Guidelines

The Output Bundle: A Solution for a Fully Documented Program Run

CMISS the SAS Function You May Have Been MISSING Mira Shapiro, Analytic Designers LLC, Bethesda, MD

Let Hash SUMINC Count For You Joseph Hinson, Accenture Life Sciences, Berwyn, PA, USA

The Power of Combining Data with the PROC SQL

Transcription:

PharmaSUG 2012 - Paper AD06 A SAS Tool to Allocate and Randomize Samples to Illumina Microarray Chips Huanying Qin, Baylor Institute of Immunology Research, Dallas, TX Greg Stanek, STEEEP Analytics, Baylor Health Care System, Dallas, TX Derek Blankenship, Quantitative Sciences, Baylor Health Care System, Dallas, TX ABSTRACT DNA/RNA Microarray has become a common tool for identifying differentially expressed genes under different experimental conditions. Several microarray platforms exist and differ by probe implementation and target-labeling strategies. When it comes to sample allocation, Illumina BeadChip is most unique because each chip can hold 12 samples, while for other platforms, each chip holds one sample or a combination of 2 paired samples. This adds complexity to sample randomization because each chip serves as a block and often the number of samples for each study group is not balanced, ie. each group does not have the same number of samples. The SAS PROC PLAN procedure is able to aid in the design and randomization of sample allocation for balanced studies but is limited for unbalanced studies. This paper describes a SAS macro tool we developed to implement an optimizing algorithm to allocate and randomize samples to Illumina chips for unbalanced study designs. INTRODUCTION It is important to randomly assign samples to different treatment groups in order to reduce the likelihood of bias due to factors other than treatment. SAS PROC PLAN or PROC SURVEYSELECT can be used for balanced design or when a fixed allocation ratio exists. To randomly allocate samples of different groups to fill in spaces has similar implication as assigning samples to treatment groups. However, the filling in space problem can be more complicated in real situation because often the sample size in each group is not the same and there is no fixed allocation ratio, especially when the block factor is involved. Microarray experiments involve loading samples to microarray chips where DNA probes are embedded. These chips then go through hybridization step followed by a stream of work. Most microarray platforms use one chip for one sample or a mixture of two samples. However, Illumina BeadChip Human V4 is very unique in its capacity to hold 12 samples on one chip, which has great advantage in cost-effectiveness. But this also presents great challenge in sample allocation, because each chip serves as a block. We try to avoid scenarios where samples from the same study group are all allocated to same chips and want to maximize the number of study groups represented on each chip. For most studies the sample size of each group is not divisible by the number of chips needed. In the past, our collaborators used EXCEL to manually randomize the samples by trial and error until a desired sample allocation was obtained. The process can be very time consuming without consistently applying an algorithm. The SAS tool in this paper uses the macro language and PROC SQL to implement a fixed algorithm to optimize the randomization. This greatly simplifies the sample allocation problem for researchers. PROGRAM FLOW Two major steps are involved in the use of the tool: 1. Calculate total number of chips needed and fill in all spaces. This requires determining the number of chips for each study group and the number of samples from each group for each chip. 2. Randomize chips and sample locations within each chip. 1. Calculating Total Number of Chips We developed an algorithm using linear programming to solve for 5 parameters: 1. Total number of chips needed. 2. Chips and each sample group need to be divided into 2 subgroups so that one subgroup of samples can be evenly assigned to the first group of chips and the other subgroup can be evenly assigned to the remaining chip group. 1 1

: With some algebra we obtain the following equations that are integrated in the code to capture the unknown values: 1, 1 As an example, Table 1 contains a microarray study data that will process 185 samples for 4 groups, which requires 16 Illumina chips (ceiling[185/12]) since each chip holds 12 samples. Table 2 contains the calculated chips for each site necessary based upon the algorithm for the assignment and randomization within each chip. Group Sample Site Sample Count 1 Seattle control 39 2 Seattle patient 38 3 Texas control 44 4 Texas patient 64 Table 1. Study example before calculating chips Sample Chip Sample Count/ Group Sample Site Count Count Chip Count Chip1 Size1 Chip2 Size2 1 Seattle control 39 16 2.4375 7 3 9 2 2 Seattle patient 38 16 2.375 6 3 10 2 3 Texas control 44 16 2.75 12 3 4 2 4 Texas patient 64 16 4 16 4 0 0 5 NULL 7 16 0.4375 7 1 9 0 Table 2. Table sample2 after calculating chips Table 2 illustrates the 4 th group is an ideal since it returns an integer value i.e., its size 64 is divisible by the number of chips (16) and we can easily assign 4 samples from this group to each chip. The other 3 groups do not return an integer value and their sizes vary as well, which requires the additional group 5 of 7 null spaces to be defined to balance out the allocation of samples on the chips for randomization. Given the samples from group 2 to 5 can t be evenly assigned to each chip, we need to determine how to divide those samples so that they will be randomly allocated to those chip spaces as much as possible. The code is relatively simple where it uses Proc SQL to generate macro variables to be passed thru to the algorithm to calculate the chip and size variables described in Table 2. It starts out by reading in the the data from Excel and restricting to the Site and generating the count and Group variables as shown in Table 1. libname Illumina "\\directory\path\samples for randomization.xls" stringdates=yes; create table sample as select status as Site label='sample Site', count(*) as Sample Label='Sample Count' from Illumina."Sheet1$"n (keep=status) group by 1; libname Illumina clear; data sample; set sample; Label GRPS='Group'; GRPS=_n_; 2

The first portion of the Macro (%Sample_Chips) utilizes Proc SQL to generate macro variables to calculate the size and chip variables in the next step. In the macro variable development we use the calculated function within SQL to sequentially build variables, the ceil function to capture the largest value for the chips and then assign them to macro variables. The second portion of Macro Sample_Chips generates the variables for size and chips as described previously. The second portion calculates the size and chips for both an unbalanced sample (our example) and a balanced sample. %macro sample_chips; %global init_samp k chip tot_samp samp_diff; proc sql noprint; select sum(sample) ceil(calculated init_samp/12) calculated chip*12 calculated tot_samp-calculated init_samp strip(put(count(distinct grps),8.)) as init_samp, as chip, as tot_samp, as samp_diff, as init_grps into: init_samp, :chip, :tot_samp, :samp_diff, :k from sample; %put &init_samp &chip &tot_samp &samp_diff &k; %if %eval(&samp_diff) >0 and %eval(&samp_diff) < 12 %then %do; data tmp; x=1; create table sample2 as select grps as Group, Site, Sample label='sample Count', &chip as Chip, sample/&chip as grp_ratio label='sample to Chip Ratio', &chip*(1-ceil(sample/&chip))+sample as Chip1, ceil(sample/&chip) as Size1, &chip-calculated chip1 as Chip2, case when calculated chip1<&chip then ceil(sample/&chip)-1 else 0 end as Size2 from sample union select &k+1 as grps as Group, 'NULL' as Site, &samp_diff as Sample label='sample Count', &chip as Chip, &samp_diff/&chip as grp_ratio label='sample to Chip Ratio', &chip*(1-ceil(&samp_diff/&chip))+&samp_diff as Chip1, ceil(&samp_diff/&chip) as Size1, &chip-calculated chip1 as Chip2, case when calculated chip1<&chip then ceil(&samp_diff/&chip)-1 else 0 end as Size2 from tmp; proc datasets library=work nolist; delete tmp; %end; Although it might be rare, there are situations where the samples are all balanced and therefore the following code will deal with the balanced case. %else %if %eval(&samp_diff)=0 %then %do; create table sample2 as select grps as Group, Site, sample label='sample Count', 3

&chip as Chip, sample/&chip as grp_ratio label='sample to Chip Ratio', &chip*(1-ceil(sample/&chip))+sample as Chip1, ceil(samp/&chip) as Size1, &chip-calculated chip1 as Chip2, case when calculated chip1<&chip then ceil(sample/&chip)-1 else 0 end as Size2 from sample; %end; %mend sample_chips; %sample_chips; The table sample2 (See Table 2) is the key for sample allocation and provides detailed information on how to divide the chips and samples. As an example, for the first group, each of the first 7 chips should have 3 samples and the rest of 9 chips should have 2 samples on each. However, guided by these parameters, there is multiple ways to allocate all samples to fill in the spaces of all chips. Further work is needed to automate the process for selecting the best way. In this paper, we randomly picked one combination (Table 3) to implement and a simple Proc SQL step was used to create a dataset according to Table 3. Number of samples by group Chip Group1 Group2 Group3 Group4 Group5 Total 1 4 3 2 2 1 12 2 4 3 2 2 1 12 3 4 2 2 3 1 12 4 4 2 2 3 1 12 5 4 2 3 2 1 12 6 4 2 3 2 1 12 7 4 2 3 2 1 12 8 4 2 3 3 0 12 9 4 2 3 3 0 12 10 4 2 3 3 0 12 11 4 2 3 3 0 12 12 4 2 3 3 0 12 13 4 3 3 2 0 12 14 4 3 3 2 0 12 15 4 3 3 2 0 12 16 4 3 3 2 0 12 Total 64 38 44 39 7 192 Table 3. The blueprint for initial allocation of samples to fill in the chips data assignment(drop=i j); do i=1 to &chip.; chip=i; do j=1 to 12; location=j; output; end; end; create table assignment as select chip, location, 4

case when location <=4 and 1<=chip<=16 then 4 when location=5 and 1<=chip<=7 then 5 when location=5 and 8<=chip<=16 then 3 when location in (6, 7) and 1<=chip<=16 then 3 when location=8 and 5<=chip<=7 then 3 when location=8 and (1<=chip<=4 or 8<=chip<=16) then 2 when location=9 and 1<=chip<=16 then 2 when location=10 and chip in (1,2,5,6,7,13,14,15,16) then 2 when location=10 and chip in (3,4,8,9,10,11,12) then 1 when 10<location and 1<=chip<=16 then 1 end as GRP from assignment order by chip; 2. Randomize Chips And Sample Locations Within Each Chip After the first step, all samples of the same group will be clustered with each other, and the next step will be to randomize the chip ID and sample location within each chip. We use random number generator functions to accomplish this purpose. In the codes below, the data set step1 comes from merging between the dataset assignment and the original sample list from the researcher and it has variables for sample ID, sample group status, chip ID and chip location according to information in Table 3. First, samples are randomized within each chip using the rand function followed by a few data manipulation steps. proc sort data=step1; by chip location grp; data step2; set step1; random_num=rand('uniform'); proc sort data=step2; by chip random_num; proc rank data=step2 out=step3(drop=location rename=(location_new=location)); by chip; var random_num; ranks location_new; label location =' '; data step4; do i=1 to &chip.; chip=i; random_chip=rand('uniform'); output; end; create table step5 as select a.*, b.random_chip from step3 a left join step4 b on a.chip=b.chip order by random_chip; It may not be required to randomize the chips every time, but it is a good practice especially when chips from different batches are used for one experiment. data step6 (drop=chip rename=(new_chip=chip)); set step5; by random_chip; if first.random_chip then new_chip+1; 5

create table final_sample as select chip, location, grp, random_chip, random_num as random_order from step6; The final_sample list is ready for researchers to implement in the microarray experiment. CONCLUSION With the use of SAS macro language and Proc SQL, this tool implements an algorithm to randomize samples to the Illumina chips for microarray experiments. This makes it possible to avoid tedious manual work of trial and error in EXCEL. As a result, it greatly saves researchers time and allows consistency due to the randomization algorithm. In addition, it can be applied to other types of experiments besides microarrays as well. ACKNOWLEDGEMENTS We would like to thank Esperanza Anguiano and Mamta Sharma from Institute of Immunology Research at Baylor Health Care System for providing this problem for which we developed the tool. CONTACT INFORMATION Your comments and questions are valued and encouraged. Contact the author at: Name: Huanying Qin Enterprise: Baylor Institute of Immunology Research Address: 3310 Live Oak, Suite 400, Dallas, Tx, 75204 Work Phone: 214-820-9064 E-mail: huanyinq@baylorhealth.edu SAS and all other SAS Institute Inc. product or service names are registered trademarks or trademarks of SAS Institute Inc. in the USA and other countries. indicates USA registration. Other brand and product names are trademarks of their respective companies. 6