CREATING A SUMMARY TABLE OF NORMALIZED (Z) SCORES


Walter W. Owen
The Biostatistics Center
The George Washington University

ABSTRACT

Data from the behavioral sciences are often analyzed by normalizing the scores for individuals in experimental subgroups to a reference population. Normalized scores, called Z-scores, may then be used to compare performance relative to the reference group either across the experimental subgroups or among different variables. Summary procedures allow group statistics to be output to SAS data sets. These data sets may be reshaped using the MATRIX and TRANSPOSE procedures before being brought together via SET and MERGE statements. The result is a compact table of normalized scores with SAS variable labels identifying the tests presented.

Addition of the reference values to the table allows the reader to extrapolate information about the experimental subgroup means and to compare the reference group to other populations reported in the literature. Further, the percentage of each subgroup having absolute Z-scores greater than an arbitrary cutoff could be added, yielding an even better definition of the experimental subgroups. For example, absolute scores greater than 1.64 indicate that an individual is performing at a level different from 90% of a normally distributed reference group.

INTRODUCTION

The use of normalized (Z) scores is widespread in the behavioral sciences. The process of normalization involves the transformation of data from experimental subgroups using the performance of a standard, or control, group as the initial point of reference. The resultant Z-scores allow researchers a common ground on which to compare a wide variety of tests that may be scored on different scales or are essentially objective in nature.

The formulas for computing individual Z-scores based on reference populations and samples are listed below.

\[ \text{Population:} \quad Z = \frac{X_i - \mu_{\mathrm{ref}}}{\sigma_{\mathrm{ref}}} \]

\[ \text{Sample:} \quad Z = \frac{X_i - \bar{X}_{\mathrm{ref}}}{s_{\mathrm{ref}}} \]

Where:
    Z                                   = Normalized Score
    X_i                                 = Individual Raw Score
    \mu_{ref}, \bar{X}_{ref}            = Reference Mean Scores
    \sigma_{ref}, s_{ref}               = Reference Standard Deviation

The resultant Z values are unitless scores indicating the number of standard deviations by which the corresponding raw score lies above or below the mean of the reference distribution. The reference group is characterized by a Z-score mean of zero and standard deviation of one. If an individual has an absolute Z-score of 1.64 or greater, he is performing at a level different from 90% of a normally distributed reference population. Similarly, an absolute Z-score greater than 1.96 puts the individual outside the range of 95% of that reference group.

When reporting the scores of several different subgroups for a battery of tests, it is desirable to present the results in tabular form. SAS provides several paths by which to create such a table. This paper will focus on gathering the information and the usefulness of different table layouts rather than elaborate methods for putting the information on paper. Accordingly, PROC PRINT, with LINESIZE options, was used to output the tables shown.
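The two cutoffs quoted above can be checked against the standard normal distribution. The following is a minimal sketch and not part of the original program segments; it assumes a SAS release that provides the PROBNORM function, and the variable names CUTOFF and INSIDE are hypothetical.

```sas
/* Sketch: share of a standard normal distribution within +/- CUTOFF. */
/* CUTOFF and INSIDE are illustrative names, not from the paper.      */
DATA _NULL_;
  DO CUTOFF = 1.64, 1.96;
    /* PROBNORM returns the standard normal CDF, so the central */
    /* two-sided probability is 2*Phi(cutoff) - 1.              */
    INSIDE = 2*PROBNORM(CUTOFF) - 1;
    PUT CUTOFF= 4.2 INSIDE= 6.4;
  END;
RUN;
```

The log should show values of about 0.899 for 1.64 and 0.950 for 1.96, which is the sense in which an absolute Z-score beyond those cutoffs falls outside roughly 90% or 95% of a normally distributed reference group.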

METHODS

Several requirements for an incoming data set should be established before elaborating on other methodology. Though the techniques described below work equally well for any number of subgroups, the groups must be classified by a single variable (perhaps GROUP) that identifies the experimental groupings as well as the reference group, in a mutually exclusive format. The appropriate PROC FORMAT statement should include values for all groups and labels suitable as SAS variable names (i.e., eight characters or fewer with no spaces). This format should be permanently assigned to the GROUPing variable when creating the groups.

Global macro variables should be established to give the number of variables (called by &N in the programming segments that follow) and the number of groups used (called by &G, not including the reference group), thus allowing much of the remaining programming to be generalized to accept variations in these values (see Program Segment 1). Also, a macro (called by %VARS) listing the actual variables to be normalized is fundamental if the program is to be easily adapted for various purposes.

The number of observations per group is output from PROC FREQuency, TRANSPOSEd into a single observation, and saved for later use (see Program Segment 2). It is convenient to create a permanent length of 40 for the variable labels at this point. This will allow any label up to 40 characters to be printed in the final table without worrying about truncation in a subsequent MERGE statement. PROC PRINT will adjust spacing if no label requires this much space.

The next step in the process is to create two data sets, one for the reference group and the other for all of the subgroup data. The mean and standard deviation for each variable in the reference group are output as a single observation using PROC MEANS. A copy of this data set is reshaped by PROC MATRIX and output for use in the final tables as reference parameters (see Program Segment 3). PROC SUMMARY could also be used, but requires a separate PRINT statement to look at the data. For smaller data sets, PROC MEANS is preferred even if it is slightly less efficient.

Appending the reference statistics to each observation of the subgroup data set allows the calculation of individual Z-scores (see Program Segment 4). The scores should replace the original raw values, thus retaining the variable labels for future use. A word of caution: the variables must be able to accommodate the decimal portion of the newly created Z-score.

A series of counting variables may be created to record which Z-scores are outside a desired range (perhaps 1.64 or 1.96 as described previously). If a value of 100 is used in these counting variables to mark a score as deviant and a value of zero if it is not, the mean of the values will automatically yield the percentage of individuals outside the specified range.

PROC MEANS (or SUMMARY) is used again, this time BY the GROUPing variable, to output a data set of mean Z-scores with an observation for each subgroup (see Program Segment 5). PROC TRANSPOSE, using GROUP as the ID variable, will produce data ready for the final table. The same general process is used to prepare the percentages of outliers for the table (see Program Segment 6).

Data manipulation is completed by match MERGEing the reference statistics with the Z-score and percentage means for each subgroup. The group sizes may now be SET with the information collected for the variables in the previous step (see Program Segment 7).

THE TABLES

Now that all of the necessary information is together in a single data set having one observation for each variable plus one observation containing the group sizes, the tables may be PRINTed. The SPLIT option for labeling columns of PROC PRINT should be used to give better definition to the table. The simplest output (see Table 1) gives only the mean Z-scores for each subgroup.
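The 0/100 counting device described under METHODS can be illustrated in isolation. The sketch below is a simplified stand-in for Program Segment 6 and not the paper's code; the data set names FLAGS and PCT, the variable names, and every data value are made up for illustration, and the cutoff of 1.64 is written in directly instead of through &CUT.

```sas
/* Sketch: the mean of a 0/100 flag is the percentage of flagged scores. */
/* All names and data values below are hypothetical.                     */
DATA FLAGS;
  INPUT GROUP Z;
  IF Z = . THEN CNT = .;                  /* missing scores stay out of the mean */
  ELSE IF ABS(Z) GT 1.64 THEN CNT = 100;  /* score lies outside the cutoff       */
  ELSE CNT = 0;                           /* score lies inside the cutoff        */
  DATALINES;
1 -2.10
1 0.30
1 1.70
1 -0.50
2 0.10
2 .
2 1.90
2 0.20
;
RUN;

/* One output observation per group; PCT holds the mean of the flag. */
PROC MEANS DATA=FLAGS NOPRINT NWAY;
  CLASS GROUP;
  VAR CNT;
  OUTPUT OUT=PCT MEAN=PCT;
RUN;

PROC PRINT DATA=PCT;
RUN;
```

With these values, group 1 has two of four scores beyond the cutoff and group 2 has one of three non-missing scores beyond it, so the percentages are based only on subjects who actually have a score.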
Addition of the reference group means and standard deviations (see Table 2) shows where the values are centered and allows the reader to determine the means of the experimental subgroups. This is done by multiplying the reference standard deviation by the subgroup mean Z-score and then adding this value to the reference mean.

The final bit of information to add is the percentage of each subgroup which lies outside of the specified range. These values are based on the number of subjects in each group who actually took the test and can enhance the information already listed by indicating the possible skewness of the subgroups. See Program Segment 8 for the PROC PRINT used to produce Table 3. The statements to produce Tables 1 and 2 are comparable.

SUMMARY

The use of the global macro variables G and N, defining the number of subgroups and variables respectively, allows flexibility in the programming. Simply by varying the value of N, along with the appropriate modifications in the VARS macro containing the list of variables used, the table may reflect different subsets of test items. The format of any of these tables may, of course, be changed to reflect the desired number of significant digits. If measurement units for the variables are needed, they should be included in the SAS variable labels. Units apply only for the reference group, as Z-scores and percentages are unitless values.

The SAS macro language offers some intriguing possibilities for the ambitious programmer. If further generalizations were added, a procedure-style macro could be set up with defining parameters to cover many of the requirements set forth for the incoming data mentioned earlier in this paper. It has proven to be a formidable challenge to put group sizes into macro variables for use in labeling the output, but SAS capabilities should make this possible. There is also the possibility of using PUT statements to print the tables, although more information is generally needed to allow for varying column lengths, particularly for the variable labels.

SAS is the registered trademark of SAS Institute, Inc., Cary, NC, USA.

Address Correspondence To:
Walter W. Owen
The Biostatistics Center
7979 Old Georgetown Road, Suite 500
Bethesda, MD 20814

Table 1

VARIABLE                      GROUP 1    GROUP 2
DESCRIPTION                    MEAN Z     MEAN Z

N=                             125.00     212.00
SAS LABEL FOR VARIABLE 1        -0.49      -0.42
SAS LABEL FOR VARIABLE 2        -0.49      -0.54
SAS LABEL FOR VARIABLE 3         0.57       0.59
SAS LABEL FOR VARIABLE 4         0.52       0.45
SAS LABEL FOR VARIABLE 5         0.72       0.80

Table 2

NORMALIZED TO REFERENCE PARAMETERS

VARIABLE                    REFERENCE  REFERENCE    GROUP 1    GROUP 2
DESCRIPTION                      MEAN    STD DEV     MEAN Z     MEAN Z

N=                              85.00                125.00     212.00
SAS LABEL FOR VARIABLE 1       107.38      12.66      -0.49      -0.42
SAS LABEL FOR VARIABLE 2        61.54      14.84      -0.49      -0.54
SAS LABEL FOR VARIABLE 3         3.05       3.29       0.57       0.59
SAS LABEL FOR VARIABLE 4         0.35       0.07       0.52       0.45
SAS LABEL FOR VARIABLE 5         0.75       0.49       0.72       0.80

Table 3

NORMALIZED TO REFERENCE PARAMETERS
PERCENTAGE OF GROUP WITH |Z| > 1.64 SHOWN

VARIABLE                    REFERENCE  REFERENCE    GROUP 1            GROUP 2
DESCRIPTION                      MEAN    STD DEV     MEAN Z   PCT 1     MEAN Z   PCT 2

N=                              85.00                125.00             212.00
SAS LABEL FOR VARIABLE 1       107.38      12.66      -0.49      15      -0.42      16
SAS LABEL FOR VARIABLE 2        61.54      14.84      -0.49      18      -0.54      19
SAS LABEL FOR VARIABLE 3         3.05       3.29       0.57      12       0.59      15
SAS LABEL FOR VARIABLE 4         0.35       0.07       0.52      14       0.45      16
SAS LABEL FOR VARIABLE 5         0.75       0.49       0.72      19       0.80      25
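As a worked check of the recovery rule described under THE TABLES (subgroup mean equals reference mean plus mean Z-score times reference standard deviation), the Table 2 entries for Variable 1 give the approximate subgroup means below. The symbols \bar{X}_{1} and \bar{X}_{2} are introduced here only for illustration, and the results are approximate because the tabled Z means are rounded to two decimals.

```latex
\begin{align*}
\bar{X}_{1} &\approx \bar{X}_{\mathrm{ref}} + \bar{Z}_{1}\, s_{\mathrm{ref}}
             = 107.38 + (-0.49)(12.66) = 107.38 - 6.2034 \approx 101.18 \\
\bar{X}_{2} &\approx \bar{X}_{\mathrm{ref}} + \bar{Z}_{2}\, s_{\mathrm{ref}}
             = 107.38 + (-0.42)(12.66) = 107.38 - 5.3172 \approx 102.06
\end{align*}
```

Both subgroups therefore average roughly five to six raw-score points below the reference group on this variable, which is the kind of information a reader can extract once the reference parameters are printed alongside the Z-scores.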

PROGRAMMING SEGMENTS

Program Segment 1
Macro Definitions

    Call     Meaning                        Assignment
    &G       number of groups               %LET G = 2;
    &N       number of variables            %LET N = 5;
    &CUT     Z score cutoff                 %LET CUT = 1.64;
    %VARS    list of raw score variables    %MACRO VARS; variable list %MEND VARS;

Program Segment 2
Obtaining Group N's

    PROC FREQ;
      TABLES GROUP / OUT=FREQSET NOPRINT;
    /* One observation holding the count for every group */
    PROC TRANSPOSE DATA=FREQSET OUT=FREQSET;
      ID GROUP;
      VAR COUNT;
    DATA FREQSET;
      LENGTH VARLABEL $40;
      SET FREQSET (RENAME=(_NAME_=VARNAME REFGRP=REFMEAN));
      VARLABEL='N=';

Program Segment 3
Obtaining Reference Statistics

    PROC MEANS DATA=REFGRPS NOPRINT;
      VAR %VARS;   /* added so the statistics follow the %VARS order */
      OUTPUT OUT=REFMEANS MEAN=MEAN1-MEAN&N STD=STD1-STD&N;
    /* Reshape the single observation into one row per variable */
    PROC MATRIX;
      FETCH X DATA=REFMEANS;
      Y = SHAPE(X,&N);
      Z = Y';
      OUTPUT Z OUT=REFSET(RENAME=(COL1=REFMEAN COL2=REFSTD));

Program Segment 4
Calculate Individual Z-scores

    DATA ZSCORES;
      IF _N_=1 THEN SET REFMEANS;
      SET GROUPS;
      ARRAY Z (N) %VARS;
      ARRAY V (N) %VARS;
      ARRAY M (N) MEAN1-MEAN&N;
      ARRAY S (N) STD1-STD&N;
      /* Z and V name the same variables, so each raw score is replaced */
      DO OVER S;
        IF S NE 0 THEN DO;
          Z = (V-M)/S;
        END;
        ELSE Z = .;
      END;

Program Segment 5
Calculate Mean Z-scores

    PROC SORT DATA=ZSCORES;
      BY GROUP;
    PROC MEANS DATA=ZSCORES NOPRINT;
      BY GROUP;
      VAR %VARS;   /* added so the means follow the %VARS order */
      OUTPUT OUT=ZMEANS MEAN=%VARS;
    PROC TRANSPOSE DATA=ZMEANS OUT=ZMEANS;
      ID GROUP;

Program Segment 6
Obtain the Percentage of Deviate Z-scores

    DATA PCTZ;
      SET ZSCORES;
      ARRAY Z (N) %VARS;
      ARRAY CNT (N) CNT1-CNT&N;
      /* 100 marks a score beyond the cutoff, 0 a score within it */
      DO OVER Z;
        IF ABS(Z) GT &CUT THEN CNT=100;
        ELSE IF Z NE . THEN CNT=0;
      END;
    PROC SORT DATA=PCTZ;
      BY GROUP;
    PROC MEANS DATA=PCTZ NOPRINT;
      BY GROUP;
      VAR CNT1-CNT&N;
      OUTPUT OUT=PCTZ MEAN=%VARS;
    PROC TRANSPOSE DATA=PCTZ OUT=PCTZ PREFIX=PCT;

Program Segment 7
Combine and Concatenate Data

    DATA COMBINE;
      MERGE ZMEANS PCTZ REFSET (DROP=ROW);
      RENAME _NAME_=VARNAME _LABEL_=VARLABEL;
    DATA FINAL;
      SET FREQSET COMBINE;
      LABEL REFMEAN ='REFERENCE* Mean'
            REFSTD  ='REFERENCE* Std Dev'
            VARNAME ='VARIABLE'
            VARLABEL='VARIABLE*DESCRIPTION'
            GROUP1  ='GROUP 1*Mean Z'
            GROUP2  ='GROUP 2*Mean Z'
            PCT1    ='PCT 1'
            PCT2    ='PCT 2';

Program Segment 8
Printing Table 3

    PROC PRINT SPLIT='*';
      ID VARLABEL;
      VAR REFMEAN REFSTD GROUP1 PCT1 GROUP2 PCT2;
      FORMAT REFMEAN REFSTD GROUP1-GROUP&G 8.2 PCT1-PCT&G 3.0;
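The SUMMARY notes that putting group sizes into macro variables proved difficult. In current SAS releases one possible approach, sketched here and not taken from the paper, is to pass the PROC FREQ counts through CALL SYMPUTX. The sketch assumes a numeric GROUP variable coded 1, 2, ...; the macro variable names NGRP1 and NGRP2, the SCORE variable, and the data values are hypothetical.

```sas
/* Sketch: store each group's size in a macro variable for later labels. */
/* GROUPS here is a made-up stand-in for the subgroup data set.          */
DATA GROUPS;
  INPUT GROUP SCORE;
  DATALINES;
1 101
1 98
1 110
2 95
2 104
;
RUN;

/* Count observations per group, as in Program Segment 2. */
PROC FREQ DATA=GROUPS NOPRINT;
  TABLES GROUP / OUT=FREQSET;
RUN;

DATA _NULL_;
  SET FREQSET;
  /* Builds the names NGRP1, NGRP2, ... from the group value */
  /* and stores the matching COUNT as the macro value.       */
  CALL SYMPUTX(CATS('NGRP', GROUP), COUNT);
RUN;

%PUT Group 1 N=&NGRP1, Group 2 N=&NGRP2;
```

The resulting macro variables could then be referenced inside double-quoted labels, for example a column label containing (N=&NGRP1), so that group sizes appear in the column headings instead of in a separate N= row.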