Getting it Done with PROC TABULATE
Michael J. Williams, ICON Clinical Research, San Francisco, CA

ABSTRACT
The task of displaying statistical summaries of different types of variables in a single table is quite familiar to many SAS users. There are many ways to go about this. PROC FREQ tends to be a favorite for counts and percentages of categorical variables, while PROC MEANS/SUMMARY and PROC UNIVARIATE tend to be preferred for summary statistics of continuous variables. In addition, PROC REPORT has many desirable customizations for displaying data sets. PROC TABULATE combines the computational functionality of FREQ, MEANS/SUMMARY, and UNIVARIATE with the customization abilities of PROC REPORT. In this paper, we give an overview of PROC TABULATE syntax, and then discuss stylistic customizations, calculating percentages, dealing with missing values, and creating and processing PROC TABULATE output data sets.

INTRODUCTION
The SAS procedure PROC TABULATE is very useful for summarizing data. One of its most interesting features is the flexibility it offers in designing the displayed table: row structure, column structure, stacking or concatenating sub-tables, and formatting and labeling variable names and variable values. This paper discusses various aspects of PROC TABULATE including basic syntax, percentages, handling of missing values, and methods for creating output SAS data sets.

We use the data set SASHELP.CARS to produce examples. First, we read in this data set as the temporary data set CARS, and we focus on the following variables:

    MPG_City      Miles per gallon in the city
    MPG_Highway   Miles per gallon on the highway
    Type          Type of car: Hybrid, SUV, Sedan, Sports, Truck, Wagon
    Origin        Country of origin: Asia, Europe, USA

In the CARS data set, none of these four variables has a missing value. This is, of course, the ideal case for data analysis.
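The read-in step described above can be sketched as follows. This is a minimal sketch rather than the author's exact code; the PROC MEANS and PROC FREQ steps are an added assumption, included only as one way to confirm that the four variables of interest have no missing values.

```sas
/* Copy SASHELP.CARS into the temporary (WORK) data set CARS. */
data cars;
  set sashelp.cars;
run;

/* Optional check (not from the paper): N and NMISS for the analysis variables. */
proc means data=cars n nmiss;
  var MPG_City MPG_Highway;
run;

/* Optional check (not from the paper): the MISSING option would list a */
/* separate level for missing values of Type or Origin, if any existed. */
proc freq data=cars;
  tables Type Origin / missing;
run;
```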
In a later section, we will modify the CARS data set to include missing values; this will force us to use more caution when computing summary statistics for several variables in a single PROC TABULATE step.

BASIC SYNTAX
The basic statements to include in PROC TABULATE are given below.

    PROC TABULATE <option(s)> <STYLE=style-override(s)>;
      CLASS variable(s) </option(s)> <STYLE=style-override(s)>;
      VAR analysis-variable(s) </option(s)> <STYLE=style-override(s)>;
      TABLE <<page-expression,> row-expression,> column-expression </ table-option(s)>;
    RUN;

The PROC TABULATE statement typically includes the data set name. The CLASS statement lists the categorical variables (for computing frequencies, frequency percentages, and others), and the VAR statement lists the analysis variables (for computing sums, percentages of sums, means, standard deviations, and others). The TABLE statement allows you to design a table. Multiple CLASS, VAR, and TABLE statements are allowed.

BASIC ONE-WAY FREQUENCY
Let's start by making a simple table for Type. We include a CLASS statement for Type, and the TABLE statement simply mentions Type to produce a one-way frequency. See Output 1.

    proc tabulate data=cars;
      class Type;
      table Type;
    run;
Output 1. Table for Type.

CONTINUOUS VARIABLE
Let's make a table for MPG_City (with summary statistics N and Mean) by Type. Since MPG_City is an analysis variable, we include it in a VAR statement. MPG_City is designated as the row variable (left of the comma), and Type is designated as the column variable (right of the comma). See Output 2.

    proc tabulate data=cars;
      class Type;
      var MPG_City;
      table MPG_City*(n mean), Type;
    run;

Output 2. Table for MPG_City by Type.

CLASS (CATEGORICAL) VARIABLE
Now we make a frequency table for Origin by Type. We will discuss percentages in a moment. The variables Type and Origin are included in a CLASS statement. In the TABLE statement, we specify that the N statistic is nested within Origin, so that N appears in each row. This overrides the default seen in our one-way frequency of Type, where N appears in each column. See Output 3.

    proc tabulate data=cars;
      class Type Origin;
      table Origin*n, Type;
    run;

Output 3. Table for Origin by Type.
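To make the row-versus-column placement of N concrete, the sketch below runs both layouts in one step. It assumes the CARS data set created earlier; the second TABLE statement is added here for comparison and is not one of the paper's numbered outputs. The cell counts are the same in both tables, and only the position of the N heading changes.

```sas
proc tabulate data=cars;
  class Type Origin;
  /* N nested in the row dimension: an N row heading under each Origin value */
  table Origin*n, Type;
  /* N nested in the column dimension: an N column heading under each Type value */
  table Origin, Type*n;
run;
```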
In PROC TABULATE, there are several standard percentages. For now, we will settle for discussing three types: PCTN, ROWPCTN, and COLPCTN. The FREQ procedure also calculates these types of percentages. As you might expect, PCTN is the percentage of the overall total, while ROWPCTN and COLPCTN are percentages of the row total and the column total, respectively. It will be useful to include the following statement in a PROC FORMAT step.

    picture pctf (round) other='009.9%';

The PCTF format rounds percentages to one decimal place (of a percentage point) and displays the percent sign (%) in the output. We use the syntax ='label' and *f=format after a variable or statistic to assign a label and a format, respectively. See Output 4.

    proc tabulate data=cars;
      class Type Origin;
      table Origin*(n (pctn colpctn pctn<origin>='pctn<origin>' rowpctn)*f=pctf.)
            all='column Total',
            Type all='row Total';
    run;

Output 4. Table for Origin by Type with basic percentages.

Notice that PctN is computed with a denominator of 428, while PctN<Origin> is the same as ColPctN; this is an example of specifying a denominator. In this case, PctN<Origin> tells SAS that the denominator is the total count of observations with non-missing Origin values in that column. This is precisely the same thing as ColPctN. From now on, we will consider only column percentages for class variables.
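Since PROC FREQ reports the same three percentages, a crosstabulation offers a quick cross-check of the PROC TABULATE results. This is a sketch assuming the CARS data set created earlier; with Origin as the row variable and Type as the column variable, the Percent, Row Pct, and Col Pct entries in each cell correspond to PCTN, ROWPCTN, and COLPCTN.

```sas
proc freq data=cars;
  /* Each cell lists Frequency, Percent, Row Pct, and Col Pct. */
  tables Origin*Type;
run;
```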
COUNTS/PERCENTAGES FOR AN INDICATOR VARIABLE
Suppose we simplify the Origin variable in order to keep track of whether or not a car is from the USA. So we introduce a binary variable Car_USA = (Origin eq 'USA'). It turns out that it is also useful to define an indicator InData that always has the value 1. See Output 5.

    data cars;
      set cars;
      Car_USA = (Origin eq 'USA');
      InData = 1;
      format Car_USA ynf.;
    run;

The YNF format must be defined before this DATA step runs. For formatting Car_USA, we included the following statement in PROC FORMAT.

    /* format for numerical Yes/No: 1 = Yes, 0 = No */
    value ynf 1 = 'Yes'
              0 = 'No';

    proc tabulate data=cars;
      class Type;
      class Car_USA / descending;
      table Car_USA*(n*f=comma9.0 colpctn*f=pctf.), all='overall' Type;
    run;

Output 5. Table for Car_USA by Type.

In order to count only the Yes values and display the same percentages, a different approach is needed. We put Car_USA in the VAR statement instead of the CLASS statement, and we replace N and COLPCTN with SUM and COLPCTSUM<INDATA>, respectively. Because InData is always 1, its sum in each column is the number of observations in that column, which is the denominator we want. See Output 6.

    proc tabulate data=cars;
      class Type;
      var Car_USA InData;
      table Car_USA*(sum*f=comma9.0 colpctsum<indata>*f=pctf.), all='overall' Type;
    run;

Output 6. Table for (Car_USA=1) by Type.
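An alternative that avoids the InData denominator is sketched below. It is not the paper's method: it relies on the fact that the mean of a 0/1 variable is the proportion of 1s, and it swaps in the standard PERCENT8.1 format in place of PCTF because the mean is a proportion rather than a percentage. It assumes Car_USA was created as in the DATA step above.

```sas
proc tabulate data=cars;
  class Type;
  var Car_USA;
  /* SUM counts the 1s; MEAN of a 0/1 variable is the proportion of 1s.    */
  /* PERCENT8.1 multiplies the proportion by 100 and appends a percent sign. */
  table Car_USA*(sum*f=comma9.0 mean*f=percent8.1),
        all='overall' Type;
run;
```

The percentages should agree with those from COLPCTSUM<INDATA>, since both divide the count of USA cars by the number of observations in the column.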
STACKING TABLES
Let's stack the tables for MPG_City and Origin in a single table. We simply combine code from the previous examples into one PROC TABULATE step with a single TABLE statement. The part of the TABLE statement that defines the rows is just the row code for MPG_City and the row code for Origin, separated by a space. Recall that the variables we use from CARS have no missing values, so the results are what you would expect. In a later section, we will discuss how missing values affect the output in a stacked table. See Output 7.

    proc tabulate data=cars;
      class Type Origin;
      var MPG_City;
      table MPG_City*(n mean*f=9.1) Origin*(n colpctn*f=pctf.), all='overall' Type;
    run;

Output 7. Table for MPG_City and Origin by Type.

DEALING WITH MISSING VALUES
Let's impose missing values in three different ways. Keep in mind that, for the variables used here, the original data set CARS has no missing values at all.

    data cars_mpg_city;
      set cars;
      row+1;
      if (mod(row,25) eq 0) then call missing(mpg_city);
    run;

In the CARS_MPG_CITY data set, we set MPG_City to missing for rows 25, 50, 75, ..., 425.

    data cars_origin;
      set cars;
      row+1;
      if (mod(row,25) eq 1) then call missing(origin);
    run;

In the CARS_ORIGIN data set, we set Origin to missing for rows 1, 26, 51, 76, ..., 426.
    data cars_type;
      set cars;
      row+1;
      if (mod(row,25) eq 2) then call missing(type);
    run;

In the CARS_TYPE data set, we set Type to missing for rows 2, 27, 52, 77, ..., 427.

THE EFFECT OF MISSING TYPE

    proc tabulate data=cars_type;
      class Type Origin;
      var MPG_City;
      table all='all Cars' MPG_City*(n mean*f=9.1) Origin*(n colpctn*f=pctf.),
            all='overall' Type;
    run;

Output 8. Table for MPG_City and Origin by Type where some observations have missing Type.

In this case, all observations with missing Type are simply excluded from the analysis. See Output 8. For simplicity, in the remainder of this discussion of missing values, we will keep only the Overall column.

THE EFFECT OF MISSING MPG_CITY

    proc tabulate data=cars_mpg_city;
      class Origin;
      var MPG_City MPG_Highway;
      table all='all Cars' MPG_City*(n mean*f=9.1) MPG_Highway*(n mean*f=9.1)
            Origin*(n colpctn*f=pctf.),
            all='overall';
    run;
Output 9. Table for MPG_City and Origin where some observations have missing MPG_City.

In this example, only the N for MPG_City is less than 428. In summary, missing values of the analysis variable MPG_City have no effect on the counts for the other variables. See Output 9.

THE EFFECT OF MISSING ORIGIN

    proc tabulate data=cars_origin;
      class Origin;
      var MPG_City;
      table all='all Cars' MPG_City*(n mean*f=9.1) Origin*(n colpctn*f=pctf.),
            all='overall';
    run;

Output 10. Table for MPG_City and Origin where some observations have missing Origin.
In this example, missing values for Origin affect the N values for the entire table. When an observation has a missing value for the class variable Origin, that observation is excluded from the analysis of all variables in PROC TABULATE. See Output 10.

THE MISSING OPTION IN PROC TABULATE

    proc tabulate data=cars_origin missing;
      class Origin;
      var MPG_City;
      table all='all Cars' MPG_City*(n mean*f=9.1) Origin*(n colpctn*f=pctf.),
            all='overall';
    run;

Output 11. Table for MPG_City and Origin where some observations have missing Origin. The MISSING option was used in the PROC TABULATE statement. Percentages use 428 as denominator.

The MISSING option allows all observations to be included in the analysis. A new category appears for Origin that accounts for missing Origin values. Note that the percentages for the categories of Origin have a denominator of 428. See Output 11. What if we want the denominator to be only the 410 non-missing values?

EXCLUDING MISSINGS IN THE PERCENTAGES
In this situation, we need an indicator variable for non-missing Origin. The total number of observations with non-missing Origin will be the denominator for percentages. This denominator is the same as the sum of the indicator.
    data cars_origin2;
      set cars_origin;
      Origin_n = ^missing(origin);
      label Origin_n = "Indicator for non-missing Origin";
    run;

    proc tabulate data=cars_origin2 missing;
      class Origin;
      var Origin_n;
      table all='all Cars' Origin*(Origin_n=' '*(n colpctsum<origin_n>*f=pctf.)),
            all='overall' / row=float;
    run;

In the code, Origin_n is nested in Origin. The label for Origin_n is blank since we have Origin_n=' '. We don't see additional cells in the table because of the ROW=FLOAT option, which removes the row heading space left by blank labels, so the extraneous blank headings are not visible. See Output 12.

Output 12. Table for MPG_City and Origin where some observations have missing Origin. The MISSING option was used in the PROC TABULATE statement. Percentages use 410 as denominator.

We could label ColPctSum as "percent of non-missing".

CREATING AN OUTPUT SAS DATA SET

THE EASY WAY: USE THE OUT= OPTION OR AN ODS OUTPUT STATEMENT
The OUT= option in the PROC TABULATE statement makes it easy to create an output data set. The data set that you obtain will not look like the displayed PROC TABULATE output, but the computed content is in the data set. The data set can be further processed according to a programmer's needs. The two PROC TABULATE snippets below produce exactly the same data set.
    proc tabulate data=cars out=pt_output(drop=_page_ _table_);
      class Type Origin;
      table Origin*(n colpctn), all Type=' ';
    run;

    proc tabulate data=cars;
      class Type Origin;
      table Origin*(n colpctn), all Type=' ';
      ods output Table=pt_output(drop=_page_ _table_);
    run;

Note in the code that Type is the first class variable from CARS, and Origin is the second class variable from CARS. The variable _TYPE_ (in PT_OUTPUT) is an ordered pair of indicators, one for each class variable. The same notation is used implicitly in the variable names PctN_00 and PctN_10 (in PT_OUTPUT), where the ordered pair indicates whether or not each of the class variables from CARS is used in the denominator definition. See Output 13.

Output 13. The data set PT_OUTPUT.
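As one illustration of further processing, the sketch below collapses the two percentage variables into a single column. It rests on assumptions that should be checked against Output 13: that the count is stored in a variable named N, that each observation carries only one non-missing value among PctN_00 and PctN_10, and that Type is blank on the observations belonging to the overall column. The names PT_LONG, ColPct, and Column are hypothetical.

```sas
data pt_long;
  length Column $12;
  set pt_output;
  /* Assumption: only one of the two percentage variables is non-missing */
  /* on any given observation, so COALESCE picks the one that is present. */
  ColPct = coalesce(PctN_10, PctN_00);
  /* Assumption: Type is blank on observations for the overall (ALL) column. */
  Column = coalescec(Type, 'Overall');
  keep Origin Column N ColPct;
run;

proc sort data=pt_long;
  by Origin Column;
run;

proc print data=pt_long noobs;
run;
```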
AN ALTERNATIVE WAY: EXPORT TO EXCEL, THEN IMPORT THE RESULT AS A SAS DATA SET
It might be more desirable to make an output data set that matches the displayed PROC TABULATE output. First, we make Excel output by using ODS EXCEL statements; this requires SAS version 9.4.

    ods excel file="c:\table.xlsx" style=minimal;
    proc tabulate data=cars;
      class Type / descending;
      class Origin / order=freq;
      var MPG_City;
      table MPG_City*(n mean*f=9.1) Origin*(n colpctn*f=pctf.),
            all='overall' Type=' ' / nocellmerge misstext='X';
      keylabel ColPctN='Percent';
    run;
    ods excel close;

In the CLASS statement for Type, the DESCENDING option has PROC TABULATE present the columns in descending order, which in this case is reverse alphabetical order. In the CLASS statement for Origin, the ORDER=FREQ option has PROC TABULATE present the categories of Origin by descending frequency of the Overall column; if we add the ASCENDING option, then the categories of Origin are presented by ascending frequency. See Output 14.

Output 14. The Excel file produced by PROC TABULATE along with ODS EXCEL code.

After creating the Excel document, we can import it back into SAS as a data set. Let's call the data set TABLE_01. See Output 15.

    proc import datafile="c:\table.xlsx" out=table_01 dbms=excel replace;
      getnames=no;
      scantext=yes;
      mixed=yes;
    run;
Output 15. The data set TABLE_01 created by PROC IMPORT.

The PROC IMPORT statement is standard for importing Excel files. It is easier to work with the data set if we set GETNAMES=NO; the variable names are then simply F1-F9. The SCANTEXT=YES statement has SAS scan each text column so that the character variable is long enough for its longest value, and the MIXED=YES statement has SAS read a column that mixes numbers and text as a character variable. Fortunately, our PROC TABULATE step computed all of the numbers that we need, and it formatted all of these numbers and missing values. So it is fine for all of the variables to be character. Since there is at least one character string in each column, the variables F1-F9 are all character variables.

Now we fix indents, insert appropriate text for missing values, and set up labels for variables. The following DATA step takes care of this. The LABEL statement is omitted. See Output 16.

    data table_02(drop=row);
      length f1 $12;
      /* set TABLE_01 and add ROW counter */
      set table_01;
      row+1;
      /* replace the MISSTEXT placeholder with '0' or '0.0%' (case-insensitive) */
      array allvars{*} _character_;
      do i=1 to dim(allvars);
        if (f2 eq 'N') and (upcase(allvars{i}) eq 'X') then allvars{i} = '0';
        else if (f2 eq 'Percent') and (upcase(allvars{i}) eq 'X') then allvars{i} = '0.0%';
      end;
      drop i;
      /* restore indents for Origin categories */
      if (6 le row le 10) then f1 = cat(' ',f1);
      if (row eq 1) then delete;
      format f1 $12.;
    run;

Output 16. The data set TABLE_02.

With a data set in hand, we can run PROC REPORT and customize the appearance of our table. See Output 17.
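The PROC REPORT step itself is not shown in the paper; a minimal sketch is given below. It assumes that F1-F9 in TABLE_02 hold the characteristic, the statistic, and the seven car-type columns in the order shown in Output 16, and that any remaining header rows have already been removed. The column headings are hypothetical labels chosen for illustration.

```sas
proc report data=table_02 nowd;
  /* Assumption: F3-F9 are the Overall column followed by the six Type columns. */
  column f1 f2 ('Car Type' f3 f4 f5 f6 f7 f8 f9);
  /* ASIS=ON keeps the leading blanks that indent the Origin categories. */
  define f1 / display 'Characteristic' style(column)=[asis=on];
  define f2 / display 'Statistic';
  define f3 / display 'Overall' right;
  define f4 / display 'Wagon'   right;
  define f5 / display 'Truck'   right;
  define f6 / display 'Sports'  right;
  define f7 / display 'Sedan'   right;
  define f8 / display 'SUV'     right;
  define f9 / display 'Hybrid'  right;
run;
```

Because every variable in TABLE_02 is character, DISPLAY usage is enough here; the values are printed as they were formatted by PROC TABULATE.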
                                                   Car Type
    Characteristic  Statistic   Overall   Wagon   Truck  Sports   Sedan     SUV  Hybrid
    MPG (City)      N               428      30      24      49     262      60       3
                    Mean           20.1    21.1    16.5    18.4    21.1    16.1    55.0
    Origin
      Asia          N               158      11       8      17      94      25       3
                    Percent       36.9%   36.7%   33.3%   34.7%   35.9%   41.7%  100.0%
      USA           N               147       7      16       9      90      25       0
                    Percent       34.3%   23.3%   66.7%   18.4%   34.4%   41.7%    0.0%
      Europe        N               123      12       0      23      78      10       0
                    Percent       28.7%   40.0%    0.0%   46.9%   29.8%   16.7%    0.0%

Output 17. PROC REPORT output of the data set TABLE_02.

We could do a few more DATA steps to combine some statistics into the same cell, and run PROC REPORT on the new data set. See Output 18.

                                                   Car Type
    Characteristic  Statistic  Overall     Wagon       Truck       Sports      Sedan       SUV         Hybrid
    MPG (City)      N, Mean    428, 20.1   30, 21.1    24, 16.5    49, 18.4    262, 21.1   60, 16.1    3, 55.0
    Origin
      Asia          N, %       158, 36.9%  11, 36.7%   8, 33.3%    17, 34.7%   94, 35.9%   25, 41.7%   3, 100.0%
      USA           N, %       147, 34.3%  7, 23.3%    16, 66.7%   9, 18.4%    90, 34.4%   25, 41.7%   0, 0.0%
      Europe        N, %       123, 28.7%  12, 40.0%   0, 0.0%     23, 46.9%   78, 29.8%   10, 16.7%   0, 0.0%

Output 18. PROC REPORT output of a DATA-step-modified version of TABLE_02.

To achieve this, we split the data set TABLE_02 into two data sets. The data set TABLE_03 contains all of the rows of TABLE_02 except the rows for which Statistic equals Mean or Percent. The data set TABLE_03X contains all of the rows of TABLE_02 for which Statistic equals Mean or Percent. Then we introduce a counter called ROW to line up the rows that we want to merge, merge the two data sets by ROW, and use the CATX function with the delimiter ', ' to combine values into an "N, Mean" value or an "N, %" value.

CONCLUSION
As we have shown, PROC TABULATE is useful for computing descriptive statistics and displaying those statistics in a variety of ways. Although percentages and missing values may seem hard to deal with at times, there are sensible ways to work with the PROC TABULATE code to achieve your results.

RECOMMENDED READING
Carpenter, Art. 2011. "PROC TABULATE: Doing More." Proceedings of SAS Global 2011 Conference.
Cary, NC: SAS Institute Inc.
McLawhorn, Kathryn. 2013. "Tips for Generating Percentages Using the SAS TABULATE Procedure." Proceedings of SAS Global 2013 Conference. Cary, NC: SAS Institute Inc.

CONTACT INFORMATION
Your comments and questions are valued and encouraged. Contact the author at:

    Name:            Michael J. Williams
    Enterprise:      ICON Clinical Research
    Address:         456 Montgomery Street, Suite 2200
    City, State ZIP: San Francisco, CA 94104
    E-mail:          michael.williams@iconplc.com
    Web:             ICONplc.com

SAS and all other SAS Institute Inc. product or service names are registered trademarks or trademarks of SAS Institute Inc. in the USA and other countries. ® indicates USA registration.

Other brand and product names are trademarks of their respective companies.