CREATING A SUMMARY TABLE OF NORMALIZED (Z) SCORES

Walter W. Owen
The Biostatistics Center
The George Washington University

ABSTRACT

Data from the behavioral sciences are often analyzed by normalizing the scores for individuals in experimental subgroups to a reference population. Normalized scores, called Z-scores, may then be used to compare performance relative to the reference group either across the experimental subgroups or among different variables.

Summary procedures allow group statistics to be output to SAS data sets. These data sets may be reshaped using the MATRIX and TRANSPOSE procedures before being brought together via SET and MERGE statements. The result is a compact table of normalized scores with SAS variable labels identifying the tests presented.

Addition of the reference values to the table allows the reader to extrapolate information about the experimental subgroup means and to compare the reference group to other populations reported in the literature. Further, the percentage of each subgroup having absolute Z-scores greater than an arbitrary cutoff could be added, yielding an even better definition of the experimental subgroups. For example, absolute scores greater than 1.64 indicate that an individual is performing at a level different from 90% of a normally distributed reference group.

INTRODUCTION

The use of normalized (Z) scores is widespread in the behavioral sciences. The process of normalization involves the transformation of data from experimental subgroups using the performance of a standard, or control, group as the initial point of reference. The resultant Z-scores allow researchers a common ground on which to compare a wide variety of tests that may be scored on different scales or are essentially objective in nature.

The formulas for computing individual Z-scores based on reference populations and samples are listed below.

Population:  \[ Z = \frac{X_i - \mu_{ref}}{\sigma_{ref}} \]

Sample:      \[ Z = \frac{X_i - \bar{X}_{ref}}{s_{ref}} \]

Where:
   $Z$                            = Normalized Score
   $X_i$                          = Individual Raw Score
   $\mu_{ref}$, $\bar{X}_{ref}$   = Reference Mean Score
   $\sigma_{ref}$, $s_{ref}$      = Reference Standard Deviation

The resultant Z values are unitless scores indicating the number of standard deviations by which the corresponding raw score lies above or below the mean of the reference distribution. The reference group is characterized by a Z-score mean of zero and standard deviation of one. If an individual has an absolute Z-score of 1.64 or greater, that individual is performing at a level different from 90% of a normally distributed reference population. Similarly, an absolute Z-score greater than 1.96 puts the individual outside the range of 95% of that reference group.

When reporting the scores of several different subgroups for a battery of tests, it is desirable to present the results in tabular form. SAS provides several paths by which to create such a table. This paper will focus on gathering the information and the usefulness of different table layouts rather than elaborate methods for putting the information on paper. Accordingly, PROC PRINT, with LINESIZE options, was used to output the tables shown.
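The 1.64 and 1.96 cutoffs quoted above can be checked against the standard normal distribution. The following is a minimal sketch using the PROBNORM function; the variable names INSIDE90 and INSIDE95 are illustrative only and do not appear in the programs of this paper.

```sas
data _null_;
  /* Proportion of a standard normal population with |Z| below each cutoff. */
  /* INSIDE90 and INSIDE95 are hypothetical names chosen for this sketch.   */
  inside90 = 2*probnorm(1.64) - 1;
  inside95 = 2*probnorm(1.96) - 1;
  put inside90= inside95=;
run;
```

The two values written to the log should be close to 0.90 and 0.95. The 90% cutoff is often quoted as 1.645, of which 1.64 is a rounding.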
METHODS

Several requirements for an incoming data set should be established before elaborating on other methodology. Though the techniques described below work equally well for any number of subgroups, the groups must be classified by a single variable (perhaps GROUP) that identifies the experimental groupings as well as the reference group, in a mutually exclusive format. The appropriate PROC FORMAT statement should include values for all groups and labels suitable as SAS variable names (i.e., eight characters or fewer with no spaces). This format should be permanently assigned to the GROUPing variable when creating the groups.

Global macro variables should be established to give the number of variables (called by &N in the programming segments that follow) and the number of groups used (called by &G, not including the reference group), thus allowing much of the remaining programming to be generalized to accept variations in these values (see Program Segment 1). Also, a macro (called by %VARS) listing the actual variables to be normalized is fundamental if the program is to be easily adapted for various purposes.

The number of observations per group is output from PROC FREQ, TRANSPOSEd into a single observation, and saved for later use (see Program Segment 2). It is convenient to create a permanent length of 40 for the variable labels at this point. This will allow any label up to 40 characters to be printed in the final table without worrying about truncation in a subsequent MERGE statement. PROC PRINT will adjust spacing if no label requires this much space.

The next step in the process is to create two data sets, one for the reference group and the other for all of the subgroup data. The mean and standard deviation for each variable in the reference group are output as a single observation using PROC MEANS. A copy of this data set is reshaped by PROC MATRIX and output for use in the final tables as reference parameters (see Program Segment 3).
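PROC MATRIX is not available in current SAS releases. As one possible alternative to the reshaping step, the sketch below lets PROC MEANS write its default statistics and uses PROC TRANSPOSE to obtain one observation per variable. It assumes, for illustration only, that the reference data set REFGRPS holds raw scores named TEST1-TEST5 and that REFSTATS is a free work data set name; neither assumption comes from the original program.

```sas
/* With no statistic keywords, OUTPUT writes one observation per    */
/* statistic, identified by _STAT_ (N, MIN, MAX, MEAN, STD).        */
proc means data=refgrps noprint;
  var test1-test5;                 /* hypothetical raw score variables */
  output out=refstats;
run;

/* Each test variable becomes a row; each statistic becomes a column. */
proc transpose data=refstats
               out=refset(drop=n min max
                          rename=(mean=refmean std=refstd));
  id _stat_;
  var test1-test5;
run;
```

Unlike the PROC MATRIX output, a REFSET built this way carries _NAME_ (and _LABEL_ when the variables are labeled), so it could be matched to the other pieces by variable name rather than by observation order.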
PROC SUMMARY could also be used to obtain the reference statistics, but it requires a separate PROC PRINT step to look at the data. For smaller data sets, PROC MEANS is preferred even if it is slightly less efficient.

Appending the reference statistics to each observation of the subgroup data set allows the calculation of individual Z-scores (see Program Segment 4). The scores should replace the original raw values, thus retaining the variable labels for future use. A word of caution: the variables must be able to accommodate the decimal portion of the newly created Z-score.

A series of counting variables may be created to record which Z-scores are outside a desired range (perhaps 1.64 or 1.96 as described previously). If a value of 100 is used in these counting variables to mark a score as deviant and a value of zero if it is not, the mean of the values will automatically yield the percentage of individuals outside the specified range.

PROC MEANS (or SUMMARY) is used again, this time BY the GROUPing variable, to output a data set of mean Z-scores with an observation for each subgroup (see Program Segment 5). PROC TRANSPOSE, using GROUP as the ID variable, will produce data ready for the final table. The same general process is used to prepare the percentages of outliers for the table (see Program Segment 6).

Data manipulation is completed by match MERGEing the reference statistics with the Z-score and percentage means for each subgroup. The group sizes may now be SET with the information collected for the variables in the previous step (see Program Segment 7).

THE TABLES

Now that all of the necessary information is together in a single data set having one observation for each variable plus one observation containing the group sizes, the tables may be PRINTed. The SPLIT option for labeling columns of PROC PRINT should be used to give better definition to the table.

The simplest output (see Table 1) gives only the mean Z-scores for each subgroup.
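The steps leading to this simplest table can be sketched in current SAS syntax with explicitly subscripted arrays. This is only an illustration under stated assumptions: the raw scores are taken to be named TEST1-TEST5 (hypothetical names), REFMEANS is the one-observation data set of reference statistics written by a current PROC MEANS, GROUPS holds the subgroup data, and ZTABLE is an illustrative output name.

```sas
%let n = 5;                          /* number of test variables */

data zscores;
  if _n_ = 1 then set refmeans;      /* MEAN1-MEAN&n, STD1-STD&n are retained */
  set groups;
  array v{&n} test1-test&n;          /* hypothetical raw score variables */
  array m{&n} mean1-mean&n;
  array s{&n} std1-std&n;
  do i = 1 to &n;
    /* Overwrite the raw score so the variable label is kept. */
    if s{i} > 0 then v{i} = (v{i} - m{i}) / s{i};
    else v{i} = .;
  end;
  /* _TYPE_ and _FREQ_ are assumed present in REFMEANS. */
  drop i mean1-mean&n std1-std&n _type_ _freq_;
run;

proc sort data=zscores;
  by group;
run;

/* One observation of mean Z-scores per subgroup. */
proc means data=zscores noprint;
  by group;
  var test1-test&n;
  output out=zmeans(drop=_type_ _freq_) mean=;
run;

/* One row per test, one column per subgroup (named from the GROUP format). */
proc transpose data=zmeans out=ztable;
  id group;
  var test1-test&n;
run;
```

Because the ID statement builds column names from the formatted values of GROUP, the requirement that the format labels be valid SAS variable names still applies.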
Addition of the reference group means and standard deviations (see Table 2) defines where the values are centered and allows the reader to determine the means of the experimental subgroups. This is done by multiplying the reference standard deviation by the subgroup mean Z-score and then adding this value to the reference mean.

The final bit of information to add is the percentage of each subgroup which lies outside of the specified range. These values are based on the number of subjects in each group who actually took the test and can enhance the information already listed by indicating the possible skewness of the subgroups.

See Program Segment 8 for the PROC PRINT used to produce Table 3. The statements to produce Tables 1 and 2 are comparable.

SUMMARY

The use of the global macro variables G and N, defining the number of subgroups and variables respectively, allows flexibility in the programming. Simply by varying the value of N, along with the appropriate modifications in the VARS macro containing the list of variables used, the table may reflect different subsets of test items.

The format of any of these tables may, of course, be changed to reflect the desired number of significant digits. If measurement units for the variables are needed, they should be included in the SAS variable labels. Units apply only to the reference group, as Z-scores and percentages are unitless values.

The SAS macro language offers some intriguing possibilities for the ambitious programmer. If further generalizations were added, a procedure-style macro could be set up with defining parameters to cover many of the requirements set forth for the incoming data mentioned earlier in this paper. It has proven to be a formidable challenge to put group sizes into macro variables for use in labeling the output, but SAS capabilities should make this possible. Also, there is the possibility of using PUT statements to print the tables, although more information is generally needed to allow for varying column lengths, particularly for the variable labels.

SAS is the registered trademark of SAS Institute, Inc., Cary, NC, USA.

Address Correspondence To:
Walter W. Owen
The Biostatistics Center
7979 Old Georgetown Road, Suite 500
Bethesda, MD 20814
Table 1

VARIABLE                     GROUP 1   GROUP 2
DESCRIPTION                   MEAN Z    MEAN Z

N=                            125.00    212.00
SAS LABEL FOR VARIABLE 1       -0.49     -0.42
SAS LABEL FOR VARIABLE 2       -0.49     -0.54
SAS LABEL FOR VARIABLE 3        0.57      0.59
SAS LABEL FOR VARIABLE 4        0.52      0.45
SAS LABEL FOR VARIABLE 5        0.72      0.80

Table 2
NORMALIZED TO REFERENCE PARAMETERS

VARIABLE                   REFERENCE  REFERENCE    GROUP 1    GROUP 2
DESCRIPTION                     MEAN    STD DEV     MEAN Z     MEAN Z

N=                             85.00                125.00     212.00
SAS LABEL FOR VARIABLE 1      107.38      12.66      -0.49      -0.42
SAS LABEL FOR VARIABLE 2       61.54      14.84      -0.49      -0.54
SAS LABEL FOR VARIABLE 3        3.05       3.29       0.57       0.59
SAS LABEL FOR VARIABLE 4        0.35       0.07       0.52       0.45
SAS LABEL FOR VARIABLE 5        0.75       0.49       0.72       0.80

Table 3
NORMALIZED TO REFERENCE PARAMETERS
PERCENTAGE OF GROUP WITH |Z| > 1.64 SHOWN

VARIABLE                   REFERENCE  REFERENCE    GROUP 1   PCT    GROUP 2   PCT
DESCRIPTION                     MEAN    STD DEV     MEAN Z     1     MEAN Z     2

N=                             85.00                125.00           212.00
SAS LABEL FOR VARIABLE 1      107.38      12.66      -0.49    15      -0.42    16
SAS LABEL FOR VARIABLE 2       61.54      14.84      -0.49    18      -0.54    19
SAS LABEL FOR VARIABLE 3        3.05       3.29       0.57    12       0.59    15
SAS LABEL FOR VARIABLE 4        0.35       0.07       0.52    14       0.45    16
SAS LABEL FOR VARIABLE 5        0.75       0.49       0.72    19       0.80    25
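As a worked illustration of recovering a subgroup mean from the reference parameters, take Variable 1 for Group 1 in Table 2. The symbol $\bar{X}_{grp}$ for the subgroup raw-score mean and $\bar{Z}_{grp}$ for the subgroup mean Z-score are notation introduced here, not taken from the paper, and the result is approximate because the tabled mean Z-score is rounded to two decimals.

```latex
\[
\bar{X}_{grp} = \bar{X}_{ref} + \bar{Z}_{grp}\, s_{ref}
             = 107.38 + (-0.49)(12.66)
             = 107.38 - 6.2034
             \approx 101.18
\]
```

The same arithmetic applies to any other cell of Table 2 or Table 3, provided the mean Z-score and the reference statistics refer to the same variable.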
PROGRAMMING SEGMENTS

Program Segment 1
Macro Definitions

   Call     Meaning                        Assignment
   &G       number of groups               %LET G = 2;
   &N       number of variables            %LET N = 5;
   &CUT     Z score cutoff                 %LET CUT = 1.64;
   %VARS    list of raw score variables    %MACRO VARS; variable list %MEND VARS;

Program Segment 2
Obtaining Group N's

   PROC FREQ;
      TABLES GROUP / OUT=FREQSET NOPRINT;
   * ONE OBSERVATION HOLDING THE COUNT FOR EACH GROUP;
   PROC TRANSPOSE DATA=FREQSET OUT=FREQSET;
      ID GROUP;
      VAR COUNT;
   DATA FREQSET;
      LENGTH VARLABEL $40;
      SET FREQSET (RENAME=(_NAME_=VARNAME REFGRP=REFMEAN));
      VARLABEL='N=';

Program Segment 3
Obtaining Reference Statistics

   PROC MEANS DATA=REFGRPS NOPRINT;
      VAR %VARS;    * ADDED SO ONLY THE TEST VARIABLES ARE SUMMARIZED;
      OUTPUT OUT=REFMEANS MEAN=MEAN1-MEAN&N STD=STD1-STD&N;
   * RESHAPE THE SINGLE OBSERVATION INTO ONE ROW PER VARIABLE;
   PROC MATRIX;
      FETCH X DATA=REFMEANS;
      Y = SHAPE(X,&N);
      Z = Y';
      OUTPUT Z OUT=REFSET(RENAME=(COL1=REFMEAN COL2=REFSTD));

Program Segment 4
Calculate Individual Z-scores

   DATA ZSCORES;
      IF _N_=1 THEN SET REFMEANS;
      SET GROUPS;
      * Z AND V NAME THE SAME VARIABLES SO SCORES REPLACE RAW VALUES;
      ARRAY Z (H) %VARS;
      ARRAY V (H) %VARS;
      ARRAY M (H) MEAN1-MEAN&N;
      ARRAY S (H) STD1-STD&N;
      DO OVER S;
         IF S NE 0 THEN DO;
            Z = (V-M)/S;
         END;
         ELSE Z = .;
      END;

Program Segment 5
Calculate Mean Z-scores

   PROC SORT DATA=ZSCORES;
      BY GROUP;
   PROC MEANS DATA=ZSCORES NOPRINT;
      BY GROUP;
      VAR %VARS;    * ADDED SO THE OUTPUT NAMES MATCH THE TEST VARIABLES;
      OUTPUT OUT=ZMEANS MEAN=%VARS;
   PROC TRANSPOSE DATA=ZMEANS OUT=ZMEANS;
      ID GROUP;
      VAR %VARS;

Program Segment 6
Obtain the Percentage of Deviate Z-scores

   DATA PCTZ;
      SET ZSCORES;
      ARRAY Z (H) %VARS;
      ARRAY CNT (H) CNT1-CNT&N;
      * 100 MARKS A DEVIANT SCORE, 0 A SCORE WITHIN RANGE;
      DO OVER Z;
         IF ABS(Z) GT &CUT THEN CNT=100;
         ELSE IF Z NE . THEN CNT=0;
      END;
   PROC SORT DATA=PCTZ;
      BY GROUP;
   PROC MEANS DATA=PCTZ NOPRINT;
      BY GROUP;
      VAR CNT1-CNT&N;
      OUTPUT OUT=PCTZ MEAN=%VARS;
   PROC TRANSPOSE DATA=PCTZ OUT=PCTZ PREFIX=PCT;
      VAR %VARS;

Program Segment 7
Combine and Concatenate Data

   DATA COMBINE;
      MERGE ZMEANS PCTZ REFSET (DROP=ROW);
      RENAME _NAME_=VARNAME _LABEL_=VARLABEL;
   DATA FINAL;
      SET FREQSET COMBINE;
      LABEL REFMEAN =REFERENCE*Mean
            REFSTD  =REFERENCE*Std Dev
            VARNAME =VARIABLE
            VARLABEL=VARIABLE*DESCRIPTION
            GROUP1  =GROUP 1*Mean Z
            GROUP2  =GROUP 2*Mean Z
            PCT1    =PCT 1
            PCT2    =PCT 2;

Program Segment 8
Printing Table 3

   PROC PRINT SPLIT='*';
      ID VARLABEL;
      VAR REFMEAN REFSTD GROUP1 PCT1 GROUP2 PCT2;
      FORMAT REFMEAN REFSTD GROUP1-GROUP&G 8.2 PCT1-PCT&G 3.0;
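The counting-variable device of Program Segment 6 can also be written with explicitly subscripted arrays and a CLASS statement, which avoids the separate sort. The sketch below is an illustration under assumptions rather than the author's program: TEST1-TEST5 are hypothetical variable names assumed to hold Z-scores already, and PCTMEANS is an illustrative output data set name.

```sas
%let n   = 5;        /* number of test variables */
%let cut = 1.64;     /* Z-score cutoff           */

data pctz;
  set zscores;
  array z{&n}   test1-test&n;     /* hypothetical names; Z-scores assumed */
  array cnt{&n} cnt1-cnt&n;
  do i = 1 to &n;
    /* Leave the flag missing when the test was not taken. */
    if missing(z{i}) then cnt{i} = .;
    /* The comparison evaluates to 1 or 0, giving 100 or 0. */
    else cnt{i} = 100 * (abs(z{i}) > &cut);
  end;
  drop i;
run;

/* NWAY keeps only the per-group observations; no prior sort is needed. */
proc means data=pctz noprint nway;
  class group;
  var cnt1-cnt&n;
  output out=pctmeans(drop=_type_ _freq_) mean=;
run;
```

Because missing flags are excluded from the mean, each percentage is based only on the subjects who actually took the test, as described under THE TABLES.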