Getting it Done with PROC TABULATE


Michael J. Williams, ICON Clinical Research, San Francisco, CA

ABSTRACT

The task of displaying statistical summaries of different types of variables in a single table is familiar to many SAS users, and there are many ways to go about it. PROC FREQ tends to be a favorite for counts and percentages of categorical variables, while PROC MEANS/SUMMARY and PROC UNIVARIATE tend to be preferred for summary statistics of continuous variables. In addition, PROC REPORT has many desirable customizations for displaying datasets. PROC TABULATE combines the computational functionality of FREQ, MEANS/SUMMARY, and UNIVARIATE with the customization abilities of PROC REPORT. In this paper, we give an overview of PROC TABULATE syntax, and then discuss stylistic customizations, calculating percentages, dealing with missing values, and creating and processing PROC TABULATE output data sets.

INTRODUCTION

The SAS procedure PROC TABULATE is very useful for summarizing data. One of its most interesting features is the flexibility it offers in designing the structure of the displayed table: row structure, column structure, stacking or concatenating sub-tables, and formatting/labeling variable names and variable values. This paper discusses various aspects of PROC TABULATE including basic syntax, percentages, handling of missing values, and methods for creating output SAS datasets.

We use the data set SASHELP.CARS to produce examples. First, we read in this data set as the temporary dataset CARS, and we focus on the following variables:

    MPG_City      miles per gallon in the city
    MPG_Highway   miles per gallon on the highway
    Type          type of car: Hybrid, SUV, Sedan, Sports, Truck, Wagon
    Origin        country of origin: Asia, Europe, USA

None of these four variables has missing values in CARS. This is, of course, the ideal case for data analysis.
In a later section, we will modify the CARS dataset to include missing values; this will force us to use more caution when computing summary statistics for several variables in a single PROC TABULATE step.

BASIC SYNTAX

The basic statements to include in PROC TABULATE are given below.

    PROC TABULATE <option(s)> <STYLE=style-override(s)>;
        CLASS variable(s) </option(s)> <STYLE=style-override(s)>;
        VAR analysis-variable(s) </option(s)> <STYLE=style-override(s)>;
        TABLE <<page-expression,> row-expression,> column-expression </ table-option(s)>;
    RUN;

The PROC TABULATE statement typically includes the data set name. The CLASS statement lists the categorical variables (for computing frequencies, frequency percentages, and others), and the VAR statement lists the analysis variables (for computing sum, percentage of sum, mean, standard deviation, and others). The TABLE statement defines the layout of the table. Multiple CLASS, VAR, and TABLE statements are allowed.

BASIC ONE-WAY FREQUENCY

Let's start by making a simple table for Type. We include a CLASS statement for Type, and the TABLE statement simply mentions Type to produce a one-way frequency. See Output 1.

    proc tabulate data=cars;
        class Type;
        table Type;
    run;

Output 1. Table for Type.

CONTINUOUS VARIABLE

Let's make a table for MPG_City (with summary statistics N and Mean) by Type. Since MPG_City is an analysis variable, we include it in a VAR statement. MPG_City is designated as the row variable (left of the comma), and Type is designated as the column variable (right of the comma). See Output 2.

    proc tabulate data=cars;
        class Type;
        var MPG_City;
        table MPG_City*(n mean), Type;
    run;

Output 2. Table for MPG_City by Type.

CLASS (CATEGORICAL) VARIABLE

Now we make a frequency table for Origin by Type. We will discuss percentages in a moment. The variables Type and Origin are included in a CLASS statement. In the TABLE statement, we nest the N statistic within Origin, so that N appears in each row. This overrides the default behavior seen in our one-way frequency of Type, where N appears in each column. See Output 3.

    proc tabulate data=cars;
        class Type Origin;
        table Origin*n, Type;
    run;

Output 3. Table for Origin by Type.
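The text mentions reading SASHELP.CARS into the temporary data set CARS but does not show that step. A minimal sketch, assuming CARS is a plain, unmodified copy of SASHELP.CARS, followed by the Origin-by-Type frequency from this section:

```sas
/* Assumption: CARS is an unmodified copy of SASHELP.CARS. */
data cars;
    set sashelp.cars;
run;

/* Origin is the row dimension and Type is the column dimension. */
/* Nesting N under Origin places the N heading in each row.      */
proc tabulate data=cars;
    class Type Origin;
    table Origin*n, Type;
run;
```

Moving N to the other side of the comma (for example `table Origin, Type*n;`) would instead place the N heading under each Type column.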

In PROC TABULATE, there are several standard percentages. For now, we discuss three of them: PCTN, ROWPCTN, and COLPCTN. The FREQ procedure also calculates these types of percentages. PCTN is a percentage of the overall total, while ROWPCTN and COLPCTN are percentages of the row total and the column total respectively. It will be useful to include the following statement in a PROC FORMAT step.

    picture pctf (round) other='009.9%';

The PCTF format rounds percentages to one decimal place (of a percentage point) and displays the percent sign % in the output. We use the syntax ='label' and *f=format after a variable or statistic to assign a label and a format respectively. See Output 4.

    proc tabulate data=cars;
        class Type Origin;
        table Origin*(n (pctn colpctn pctn<Origin>='PctN<Origin>' rowpctn)*f=pctf.) all='Column Total',
              Type all='Row Total';
    run;

Output 4. Table for Origin by Type with basic percentages.

Notice that PctN is computed with a denominator of 428, while PctN<Origin> is the same as ColPctN; this is an example of specifying a denominator. In this case, PctN<Origin> tells SAS that the denominator is the total count of observations with non-missing Origin values in that column, which is precisely the definition of ColPctN. From now on, we will consider only column percentages for class variables.
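Since the text notes that PROC FREQ computes the same kinds of percentages, one way to sanity-check Output 4 is to run the same two-way layout through PROC FREQ. This is only a cross-check sketch; reading SASHELP.CARS directly (rather than the CARS copy) is an assumption made to keep it self-contained.

```sas
/* Cross-check sketch: same layout as the TABLE statement,       */
/* with Origin as the rows and Type as the columns.              */
proc freq data=sashelp.cars;
    tables Origin*Type;
run;
```

By default each cell of the PROC FREQ crosstabulation lists Frequency, Percent, Row Pct, and Col Pct, which correspond to N, PCTN, ROWPCTN, and COLPCTN in the PROC TABULATE step.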

COUNTS/PERCENTAGES FOR AN INDICATOR VARIABLE

Suppose we simplify the Origin variable in order to keep track of whether or not a car is from the USA. We introduce a binary variable Car_USA = (Origin eq 'USA'). It turns out that it is also useful to define an indicator InData that always has the value 1. See Output 5.

    data cars;
        set cars;
        Car_USA = (Origin eq 'USA');
        InData=1;
        format Car_USA ynf.;
    run;

To format Car_USA, we included the following statement in PROC FORMAT.

    /* format for numerical Yes/No: 1= Yes, 0= No */
    value ynf 1 = 'Yes'
              0 = 'No';

    proc tabulate data=cars;
        class Type;
        class Car_USA / descending;
        table Car_USA*(n*f=comma9.0 colpctn*f=pctf.), all='Overall' Type;
    run;

Output 5. Table for Car_USA by Type.

In order to count only the Yes values and display the same percentages, a different approach is needed. We put Car_USA in the VAR statement instead of the CLASS statement, and we replace N and COLPCTN with SUM and COLPCTSUM<InData> respectively. Because Car_USA is 0 or 1, its sum is the number of Yes values, and because InData is always 1, its sum is the number of observations in the column. See Output 6.

    proc tabulate data=cars;
        class Type;
        var Car_USA InData;
        table Car_USA*(sum*f=comma9.0 colpctsum<InData>*f=pctf.), all='Overall' Type;
    run;

Output 6. Table for (Car_USA=1) by Type.
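The ratio behind COLPCTSUM<InData> can be spelled out by hand. The following is a rough cross-check rather than part of the original method: it rebuilds the two indicators from SASHELP.CARS and computes, for each Type, the sum of Car_USA divided by the sum of InData. The names cars_check, n_usa, n_all, and pct_usa are hypothetical.

```sas
/* Assumption: indicators are rebuilt from SASHELP.CARS so the sketch stands alone. */
data cars_check;
    set sashelp.cars;
    Car_USA = (Origin eq 'USA');   /* 1 for USA cars, 0 otherwise          */
    InData  = 1;                   /* constant 1, so its sum is a row count */
run;

/* Per-Type numerator, denominator, and their ratio. */
proc sql;
    select Type,
           sum(Car_USA) as n_usa format=comma9.0,
           sum(InData)  as n_all format=comma9.0,
           calculated n_usa / calculated n_all as pct_usa format=percent8.1
    from cars_check
    group by Type;
quit;
```

The pct_usa values should agree with the percentages in the Type columns of Output 6, up to the rounding applied by the display formats.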

STACKING TABLES

Let's stack the tables for MPG_City and Origin in a single table. We simply combine code from the previous examples into one PROC TABULATE step with a single TABLE statement. The row part of the TABLE statement is just the row code for MPG_City and the row code for Origin, separated by a space. Recall that the variables we use from CARS have no missing values, so the results are what you would expect. In a later section, we will discuss how missing values affect the output of a stacked table. See Output 7.

    proc tabulate data=cars;
        class Type Origin;
        var MPG_City;
        table MPG_City*(n mean*f=9.1) Origin*(n colpctn*f=pctf.), all='Overall' Type;
    run;

Output 7. Table for MPG_City and Origin by Type.

DEALING WITH MISSING VALUES

Let's impose missing values in 3 different ways. Keep in mind that the variables we use from the original data set CARS have no missing values.

    data cars_mpg_city;
        set cars;
        row+1;
        if (mod(row,25) eq 0) then call missing(mpg_city);
    run;

In the CARS_MPG_CITY dataset, we set MPG_City to missing for rows 25, 50, 75, ..., 425.

    data cars_origin;
        set cars;
        row+1;
        if (mod(row,25) eq 1) then call missing(origin);
    run;

In the CARS_ORIGIN dataset, we set Origin to missing for rows 1, 26, 51, 76, ..., 426.

    data cars_type;
        set cars;
        row+1;
        if (mod(row,25) eq 2) then call missing(type);
    run;

In the CARS_TYPE dataset, we set Type to missing for rows 2, 27, 52, 77, ..., 427.

THE EFFECT OF MISSING TYPE

    proc tabulate data=cars_type;
        class Type Origin;
        var MPG_City;
        table all='All Cars' MPG_City*(n mean*f=9.1) Origin*(n colpctn*f=pctf.), all='Overall' Type;
    run;

Output 8. Table for MPG_City and Origin by Type where some observations have missing Type.

In this case, all observations with missing Type are simply excluded from the analysis. See Output 8. For simplicity, in the remainder of this discussion of missing values, we will keep only the Overall column.

THE EFFECT OF MISSING MPG_CITY

    proc tabulate data=cars_mpg_city;
        class Origin;
        var MPG_City MPG_Highway;
        table all='All Cars' MPG_City*(n mean*f=9.1) MPG_Highway*(n mean*f=9.1) Origin*(n colpctn*f=pctf.), all='Overall';
    run;

Output 9. Table for MPG_City and Origin where some observations have missing MPG_City.

In this example, only the N for MPG_City is less than 428. In summary, missing values for the analysis variable MPG_City have no effect on the counts for the other variables. See Output 9.

THE EFFECT OF MISSING ORIGIN

    proc tabulate data=cars_origin;
        class Origin;
        var MPG_City;
        table all='All Cars' MPG_City*(n mean*f=9.1) Origin*(n colpctn*f=pctf.), all='Overall';
    run;

Output 10. Table for MPG_City and Origin where some observations have missing Origin.

In this example, missing values for Origin affect the N values for the entire table. When an observation has a missing value for the class variable Origin, that observation is excluded from the analysis of all variables in PROC TABULATE. See Output 10.

THE MISSING OPTION IN PROC TABULATE

    proc tabulate data=cars_origin missing;
        class Origin;
        var MPG_City;
        table all='All Cars' MPG_City*(n mean*f=9.1) Origin*(n colpctn*f=pctf.), all='Overall';
    run;

Output 11. Table for MPG_City and Origin where some observations have missing Origin. The MISSING option was used in the PROC TABULATE statement. Percentages use 428 as the denominator.

The MISSING option allows all observations to be included in the analysis. A new category appears for Origin that accounts for missing Origin values. Note that the percentages for the categories of Origin have a denominator of 428. See Output 11. What if we want the denominator to be only the 410 non-missing values?

EXCLUDING MISSINGS IN THE PERCENTAGES

In this situation, we need an indicator variable for non-missing Origin. The total number of observations with non-missing Origin will be the denominator for percentages. This denominator is the same as the sum of the indicator.
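As a quick check of the two candidate denominators, the counts can be computed directly. This sketch rebuilds the data set with missing Origin values from SASHELP.CARS (an assumption made for self-containment) under the hypothetical name cars_origin_check; the column aliases are also hypothetical.

```sas
/* Rebuild the data set with missing Origin, as described in the text. */
data cars_origin_check;
    set sashelp.cars;
    row + 1;
    if mod(row, 25) eq 1 then call missing(Origin);
run;

/* COUNT(*) counts all rows; COUNT(Origin) counts non-missing Origin only. */
proc sql;
    select count(*)                 as n_all,
           count(Origin)            as n_nonmissing,
           count(*) - count(Origin) as n_missing
    from cars_origin_check;
quit;
```

If the setup matches the paper, n_all and n_nonmissing should correspond to the 428 and 410 quoted above.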

    data cars_origin2;
        set cars_origin;
        Origin_n = ^missing(origin);
        label Origin_n = "Indicator for non-missing Origin";
    run;

    proc tabulate data=cars_origin2 missing;
        class Origin;
        var Origin_n;
        table all='All Cars' Origin*(Origin_n=' '*(n colpctsum<Origin_n>*f=pctf.)), all='Overall' / row=float;
    run;

In the code, Origin_n is nested in Origin. The label for Origin_n is blank since we specify Origin_n=' '. We don't see additional heading cells in the table because of the ROW=FLOAT option. With ROW=FLOAT, the row heading space is divided among the nonblank headings only, so the blank heading for Origin_n takes up no space and our extraneous blank cells are not visible. See Output 12.

Output 12. Table for MPG_City and Origin where some observations have missing Origin. The MISSING option was used in the PROC TABULATE statement. Percentages use 410 as the denominator. We could label ColPctSum as percent of non-missing.

CREATING AN OUTPUT SAS DATA SET

THE EASY WAY: USE THE OUT= OPTION OR AN ODS OUTPUT STATEMENT

The OUT= option in the PROC TABULATE statement makes it easy to create an output dataset. The data set that you obtain will not look like the displayed PROC TABULATE output, but the computed content is in the data set. The data set can be further processed according to a programmer's needs. The two PROC TABULATE snippets below produce exactly the same data set.

    proc tabulate data=cars out=pt_output(drop=_page_ _table_);
        class Type Origin;
        table Origin*(n colpctn), all Type=' ';
    run;

    proc tabulate data=cars;
        class Type Origin;
        table Origin*(n colpctn), all Type=' ';
        ods output table=pt_output(drop=_page_ _table_);
    run;

Note in the code that Type is the first class variable from CARS, and Origin is the second class variable from CARS. The variable _TYPE_ (in PT_OUTPUT) is an ordered pair of indicators for the class variables. Also, _TYPE_ is implicitly used in the variable names PctN_00 and PctN_10 (in PT_OUTPUT). There, the ordered pair of indicators shows whether or not each of the class variables from CARS is used in the denominator definition. See Output 13.

Output 13. The data set PT_OUTPUT.
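Because the output data set does not look like the displayed table, it helps to list its variables and a few rows before processing it further. A minimal sketch, assuming SASHELP.CARS as input and using the hypothetical data set name pt_output_check:

```sas
/* Create the output data set with OUT= (all automatic variables kept). */
proc tabulate data=sashelp.cars out=pt_output_check;
    class Type Origin;
    table Origin*(n colpctn), all Type=' ';
run;

/* List the variables in creation order, including _TYPE_, _PAGE_, _TABLE_. */
proc contents data=pt_output_check varnum;
run;

/* Look at the first few rows to see how _TYPE_ and the PctN_ columns are filled. */
proc print data=pt_output_check(obs=10);
run;
```

Comparing the rows where Type is missing with the rows where it is populated shows which _TYPE_ value belongs to the Overall column and which to the individual Type columns.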

AN ALTERNATIVE WAY: EXPORT TO EXCEL, THEN IMPORT THE RESULT AS SAS DATA SET

It might be more desirable to make an output data set that matches the displayed PROC TABULATE output. First, we create Excel output by using ODS EXCEL statements; this requires SAS version 9.4.

    ods excel file="c:\table.xlsx" style=minimal;
    proc tabulate data=cars;
        class Type / descending;
        class Origin / order=freq;
        var MPG_City;
        table MPG_City*(n mean*f=9.1) Origin*(n colpctn*f=pctf.), all='Overall' Type=' ' / nocellmerge misstext='X';
        keylabel ColPctN='Percent';
    run;
    ods excel close;

In the CLASS statement for Type, the DESCENDING option has PROC TABULATE present the columns in descending order, which in this case is reverse alphabetical order. In the CLASS statement for Origin, the ORDER=FREQ option has PROC TABULATE present the categories of Origin by descending frequency of the Overall column; if we add the ASCENDING option, the categories of Origin are presented by ascending frequency. See Output 14.

Output 14. The Excel file produced by PROC TABULATE along with ODS EXCEL code.

After creating the Excel document, we can import it back into SAS as a dataset. Let's call the dataset Table_01. See Output 15.

    proc import datafile="c:\table.xlsx" out=table_01 dbms=excel replace;
        getnames=no;
        scantext=yes;
        mixed=yes;
    run;

Output 15. The data set TABLE_01 created by PROC IMPORT.

The PROC IMPORT statement is standard for importing Excel files. It is easier to work with the data set if we set GETNAMES=NO; the variable names will simply be F1-F9. The SCANTEXT=YES statement is useful for having SAS determine whether each variable should be character or numeric. Fortunately, our PROC TABULATE step computed all of the numbers that we need, and it formatted all of these numbers and missing values, so it is fine for all of the variables to be character. SAS scans all of the data, and since there is at least one character string in each column, SAS determines that the variables F1-F9 will be character variables.

Now we fix indents, insert appropriate text for missing values, and set up labels for variables. The following DATA step takes care of this. The LABEL statement is omitted. See Output 16.

    data table_02(drop=row);
        length f1 $12;
        /* set TABLE_01 and add ROW counter */
        set table_01;
        row+1;
        /* replace 'X' with '0' or '0.0%' */
        array allvars{*} _CHARACTER_;
        do i=1 to dim(allvars);
            if (f2 eq 'N') and (allvars{i} eq 'X') then allvars{i} = '0';
            else if (f2 eq 'Percent') and (allvars{i} eq 'X') then allvars{i} = '0.0%';
        end;
        drop i;
        /* restore indents for Origin categories */
        if (6 le row le 10) then f1 = cat(' ',f1);
        if (row eq 1) then delete;
        format f1 $12.;
    run;

Output 16. The data set TABLE_02.

With a data set in hand, we can run PROC REPORT and customize the appearance of our table. See Output 17.
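The paper does not show its PROC REPORT code. The following is only a rough sketch of what such a step might look like, run against a hypothetical stand-in data set (table_02_demo) typed in from a few rows of Output 17. The column roles (F1 row label, F2 statistic, F3 Overall, F4-F9 car types), the blank Origin heading row, and the 'Car Type' spanning header over F4-F9 are all assumptions.

```sas
/* Hypothetical stand-in for TABLE_02: a few rows typed in from Output 17. */
/* All columns are character, as in the imported data set.                 */
data table_02_demo;
    infile datalines dsd truncover;
    length f1 $12 f2 $8 f3-f9 $8;
    input f1 f2 f3 f4 f5 f6 f7 f8 f9;
    datalines;
MPG (City),N,428,30,24,49,262,60,3
,Mean,20.1,21.1,16.5,18.4,21.1,16.1,55.0
Origin,,,,,,,,
Asia,N,158,11,8,17,94,25,3
,Percent,36.9%,36.7%,33.3%,34.7%,35.9%,41.7%,100.0%
;
run;

/* DISPLAY usage prints each row as-is; the quoted text in the COLUMN */
/* statement creates a header spanning the six car-type columns.      */
proc report data=table_02_demo nowd;
    column f1 f2 f3 ('Car Type' f4 f5 f6 f7 f8 f9);
    define f1 / display 'Characteristic';
    define f2 / display 'Statistic';
    define f3 / display 'Overall';
    define f4 / display 'Wagon';
    define f5 / display 'Truck';
    define f6 / display 'Sports';
    define f7 / display 'Sedan';
    define f8 / display 'SUV';
    define f9 / display 'Hybrid';
run;
```

Using DISPLAY for every column keeps PROC REPORT from reordering or summarizing rows, which matters here because the row order carries the table structure.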

                                    Car Type
Characteristic  Statistic  Overall  Wagon    Truck    Sports   Sedan    SUV      Hybrid
MPG (City)      N          428      30       24       49       262      60       3
                Mean       20.1     21.1     16.5     18.4     21.1     16.1     55.0
Origin
  Asia          N          158      11       8        17       94       25       3
                Percent    36.9%    36.7%    33.3%    34.7%    35.9%    41.7%    100.0%
  USA           N          147      7        16       9        90       25       0
                Percent    34.3%    23.3%    66.7%    18.4%    34.4%    41.7%    0.0%
  Europe        N          123      12       0        23       78       10       0
                Percent    28.7%    40.0%    0.0%     46.9%    29.8%    16.7%    0.0%

Output 17. PROC REPORT output of the data set TABLE_02.

We could do a few more DATA steps to combine some statistics into the same cell, and run PROC REPORT on the new data set. See Output 18.

                                       Car Type
Characteristic  Statistic  Overall     Wagon       Truck       Sports      Sedan       SUV         Hybrid
MPG (City)      N, Mean    428, 20.1   30, 21.1    24, 16.5    49, 18.4    262, 21.1   60, 16.1    3, 55.0
Origin
  Asia          N, %       158, 36.9%  11, 36.7%   8, 33.3%    17, 34.7%   94, 35.9%   25, 41.7%   3, 100.0%
  USA           N, %       147, 34.3%  7, 23.3%    16, 66.7%   9, 18.4%    90, 34.4%   25, 41.7%   0, 0.0%
  Europe        N, %       123, 28.7%  12, 40.0%   0, 0.0%     23, 46.9%   78, 29.8%   10, 16.7%   0, 0.0%

Output 18. PROC REPORT output of a DATA-step-modified version of TABLE_02.

To achieve this, we split the data set TABLE_02 into two data sets. The data set TABLE_03 contains all of the rows of TABLE_02 except the rows for which Statistic equals Mean or Percent. The data set TABLE_03X contains all of the rows of TABLE_02 for which Statistic equals Mean or Percent. Then we introduce a counter called ROW to line up the rows that we want to merge, merge the data sets by ROW, and use the CATX function with a comma as the delimiter to combine values into an "N, Mean" value or an "N, %" value.

CONCLUSION

As we have shown, PROC TABULATE is useful for computing descriptive statistics and displaying those statistics in a variety of ways. Although percentages and missing values may seem hard to deal with at times, there are sensible ways to work with the PROC TABULATE code to achieve your results.

RECOMMENDED READING

Carpenter, Art. 2011. PROC TABULATE: Doing More. Proceedings of SAS Global 2011 Conference. Cary, NC: SAS Institute Inc.

McLawhorn, Kathryn. 2013. Tips for Generating Percentages Using the SAS TABULATE Procedure. Proceedings of SAS Global 2013 Conference. Cary, NC: SAS Institute Inc.

CONTACT INFORMATION

Your comments and questions are valued and encouraged. Contact the author at:

    Name: Michael J. Williams
    Enterprise: ICON Clinical Research
    Address: 456 Montgomery Street, Suite 2200
    City, State ZIP: San Francisco, CA 94104
    E-mail: michael.williams@iconplc.com
    Web: ICONplc.com

SAS and all other SAS Institute Inc. product or service names are registered trademarks or trademarks of SAS Institute Inc. in the USA and other countries. ® indicates USA registration. Other brand and product names are trademarks of their respective companies.