DSCI 325: Handout 10 Summarizing Numerical and Categorical Data in SAS Spring 2017

Similar documents
DSCI 325: Handout 9 Sorting and Options for Printing Data in SAS Spring 2017

Lab #3: Probability, Simulations, Distributions:

Introducing a Colorful Proc Tabulate Ben Cochran, The Bedford Group, Raleigh, NC

Handling missing values in Analysis

DSCI 325: Handout 2 Getting Data into SAS Spring 2017

1. Basic Steps for Data Analysis Data Editor. 2.4.To create a new SPSS file

Using SAS to Analyze CYP-C Data: Introduction to Procedures. Overview

DSCI 325: Handout 15 Introduction to SAS Macro Programming Spring 2017

Getting it Done with PROC TABULATE

Biostat Methods STAT 5820/6910 Handout #4: Chi-square, Fisher s, and McNemar s Tests

This code and the crash data set can be found on the course web page.

SAS seminar. The little SAS book Chapters 3 & 4. April 15, Åsa Klint. By LD Delwiche and SJ Slaughter. 3.1 Creating and Redefining variables

work.test temp.test sasuser.test test

Using SAS Macros to Extract P-values from PROC FREQ

Applied Multivariate Analysis

Exam Questions A00-281

DSCI 325: Handout 3 Creating and Redefining Variable in SAS Spring 2017

Writing Reports with the

Uncommon Techniques for Common Variables

STAT:5400 Computing in Statistics

Basic Concepts #6: Introduction to Report Writing

Introduction to StatKey Getting Data Into StatKey

Setting the Percentage in PROC TABULATE

Introduction to Stata - Session 2

DSCI 325: Handout 1 Introduction to SAS Programs Spring 2017

PHPM 672/677 Lab #2: Variables & Conditionals Due date: Submit by 11:59pm Monday 2/5 with Assignment 2

Introduction to SAS Procedures SAS Basics III. Susan J. Slaughter, Avocet Solutions

A Side of Hash for You To Dig Into

It s Proc Tabulate Jim, but not as we know it!

SAS Training Spring 2006

Introduction to SAS Procedures SAS Basics III. Susan J. Slaughter, Avocet Solutions

Econ Stata Tutorial I: Reading, Organizing and Describing Data. Sanjaya DeSilva

Introduction to SAS. Cristina Murray-Krezan Research Assistant Professor of Internal Medicine Biostatistician, CTSC

Using Proc Freq for Manageable Data Summarization

Frequency Tables. Chapter 500. Introduction. Frequency Tables. Types of Categorical Variables. Data Structure. Missing Values

Reading data in SAS and Descriptive Statistics

STATA 13 INTRODUCTION

Generating Customized Analytical Reports from SAS Procedure Output Brinda Bhaskar and Kennan Murray, RTI International

Introductory Guide to SAS:

B/ Use data set ADMITS to find the most common day of the week for admission. (HINT: Use a function or format.)

STA 570 Spring Lecture 5 Tuesday, Feb 1

Data-Analysis Exercise Fitting and Extending the Discrete-Time Survival Analysis Model (ALDA, Chapters 11 & 12, pp )

TYPES OF VARIABLES, STRUCTURE OF DATASETS, AND BASIC STATA LAYOUT

Cell means coding and effect coding

22S:166. Checking Values of Numeric Variables

EXST SAS Lab Lab #6: More DATA STEP tasks

Going Beyond Proc Tabulate Jim Edgington, LabOne, Inc., Lenexa, KS Carole Lindblade, LabOne, Inc., Lenexa, KS

Using PROC TABULATE and ODS Style Options to Make Really Great Tables Wendi L. Wright, CTB / McGraw-Hill, Harrisburg, PA

Exploring and Understanding Data Using R.

Producing Summary Tables in SAS Enterprise Guide

Choosing the Right Procedure

Dr. Barbara Morgan Quantitative Methods

Introduction to SAS Mike Zdeb ( , #61

Contents of SAS Programming Techniques

First steps in SPSS. Figure 1

Data Mining. SPSS Clementine k-means Algorithm. Spring 2010 Instructor: Dr. Masoud Yaghini. Clementine

GETTING STARTED. A Step-by-Step Guide to Using MarketSight

STAT 7000: Experimental Statistics I

Checking the Math Data

Stat 302 Statistical Software and Its Applications SAS: Data I/O

I. Quality Assessment: Frequency Tables of onlineavailable and traininglevel

Choosing the Right Procedure

A Short Guide to Stata 10 for Windows

Research Methods for Business and Management. Session 8a- Analyzing Quantitative Data- using SPSS 16 Andre Samuel

After opening Stata for the first time: set scheme s1mono, permanently

Intermediate SAS: Statistics

Using PROC SQL to Calculate FIRSTOBS David C. Tabano, Kaiser Permanente, Denver, CO

Subject. Creating a diagram. Dataset. Importing the data file. Descriptive statistics with TANAGRA.

EXAMPLES OF DATA LISTINGS AND CLINICAL SUMMARY TABLES USING PROC REPORT'S BATCH LANGUAGE

Basic Commands. Consider the data set: {15, 22, 32, 31, 52, 41, 11}

SPSS Instructions and Guidelines PSCI 2300 Intro to Political Science Research Dr. Paul Hensel Last updated 10 March 2018

DSCI 325 Practice Midterm Questions Spring In SAS, a statement must end

WORKING WITH PIVOT TABLES

Lecture 1 Getting Started with SAS

DSCI 325: Handout 4 If-Then Statements in SAS Spring 2017

Lab #1: Introduction to Basic SAS Operations

Checking for Duplicates Wendi L. Wright

PIVOT TABLES IN MICROSOFT EXCEL 2016

PDQ-Notes. Reynolds Farley. PDQ-Note 3 Displaying Your Results in the Expert Query Window

Introduction to STATA 6.0 ECONOMICS 626

Table of Contents (As covered from textbook)

Get into the Groove with %SYSFUNC: Generalizing SAS Macros with Conditionally Executed Code

INTRODUCTION TO SAS HOW SAS WORKS READING RAW DATA INTO SAS

SAS 101. Based on Learning SAS by Example: A Programmer s Guide Chapter 21, 22, & 23. By Tasha Chapman, Oregon Health Authority

Intermediate SAS: Working with Data

New Perspectives on Microsoft Excel Module 5: Working with Excel Tables, PivotTables, and PivotCharts

Data Should Not be a Four Letter Word Microsoft Excel QUICK TOUR

SPSS TRAINING SPSS VIEWS

The subject of this chapter is the pivot table, the name given to a special

Microsoft Access XP (2002) - Advanced Queries

From Manual to Automatic with Overdrive - Using SAS to Automate Report Generation Faron Kincheloe, Baylor University, Waco, TX

Using PROC SQL to Generate Shift Tables More Efficiently

Introduction to SAS. I. Understanding the basics In this section, we introduce a few basic but very helpful commands.

Tutorial 5: Working with Excel Tables, PivotTables, and PivotCharts. Microsoft Excel 2013 Enhanced

MATH 117 Statistical Methods for Management I Chapter Two

Data Management Project Using Software to Carry Out Data Analysis Tasks

A Quick Guide to Stata 8 for Windows

Math 214 Introductory Statistics Summer Class Notes Sections 3.2, : 1-21 odd 3.3: 7-13, Measures of Central Tendency

SAS Online Training: Course contents: Agenda:

Excel 2007/2010. Don t be afraid of PivotTables. Prepared by: Tina Purtee Information Technology (818)

Transcription:

DSCI 325: Handout 10 Summarizing Numerical and Categorical Data in SAS Spring 2017 USING PROC MEANS The routine PROC MEANS can be used to obtain limited summaries for numerical variables (e.g., the mean, standard deviation, etc.). Consider the CourseEnrollments.txt data set. DATA CourseEnrollments; INFILE '/folders/myfolders/courseenrollments.txt'; INPUT Name $18. +1 Subject $4. CourseNum StartDate MMDDYY10. Section Credits Enrollment; PROC PRINT DATA=CourseEnrollments; A portion of the output is shown below: The following code can be used to obtain basic summary statistics for the numeric variables Credits and Enrollment. PROC MEANS DATA=CourseEnrollments; VAR Credits Enrollment; The following output is produced by PROC MEANS by default: 1

Similar to other routines, you can use a BY statement within PROC MEANS. Recall that the data set will first need to be sorted by the variable you are using in the BY statement. PROC FORMAT; VALUE Semester 18141 = 'Fall 2009' 18277 = 'Spring 2010'; PROC SORT DATA=CourseEnrollments; BY StartDate; PROC MEANS DATA=CourseEnrollments; VAR Credits Enrollment; BY StartDate; FORMAT StartDate Semester.; SAS also has the ability to save the output from PROC MEANS into a secondary dataset. This can be done using the OUTPUT statement, as is shown below. The new data set will be named CEMeans, and this data set will contain two new variables: AvgCredits and AvgEnrollment. PROC MEANS DATA=CourseEnrollments; VAR Credits Enrollment; OUTPUT OUT=CEMeans MEAN(Credits Enrollment) = AvgCredits AvgEnrollment; 2

The following PROC PRINT statement shows the contents of the CEMeans dataset. PROC PRINT DATA=CEMeans; Note that you can prevent SAS from printing the output to the Results Viewer window by using the NOPRINT option in the PROC MEANS statement. PROC MEANS DATA=CourseEnrollments NOPRINT; VAR Credits Enrollment; OUTPUT OUT=CEMeans MEAN(Credits Enrollment) = AvgCredits AvgEnrollment; At times, SAS will automatically create variables and place them in data sets for you. For example, suppose we want to obtain the average credits and average enrollment by StartDate. When SAS creates the CEMeans dataset, it automatically puts a variable in this dataset to identify the StartDate. PROC MEANS DATA=CourseEnrollments NOPRINT; VAR Credits Enrollment; BY StartDate; OUTPUT OUT=CEMeans MEAN(Credits Enrollment) = AvgCredits AvgEnrollment; PROC PRINT DATA=CEMeans; Note: SAS automatically creates a variable in the CEMeans dataset called StartDate so that each average can be easily identified. 3

Using our previously specified formatting for StartDate (from Handout 9), we can print StartDate, the average credits, and the average enrollments in a simple, easy-to-read data set. PROC FORMAT; VALUE Semester 18141 = 'Fall' 18277 = 'Spring'; PROC PRINT DATA=CEMeans; VAR StartDate AvgCredits AvgEnrollment; Format StartDate Semester.; Finally, note that there are several additional summaries you can obtain using PROC MEANS. For example, the following programming statements would ask PROC MEANS to compute only the mean, median, standard deviation, and range of the variables listed in the VAR statement. PROC MEANS DATA=CourseEnrollments MEAN MEDIAN STDDEV RANGE; VAR Credits Enrollment; 4

USING PROC FREQ PROC MEANS is used to summarize numerical data; PROC FREQ is used to obtain summaries for categorical data, instead. For this example, consider the CarAccidents.csv data set (a portion of which is shown below). Read this dataset into your personal library if you have not already done so. Then, print the data with a PROC PRINT statement similar to the following. PROC PRINT DATA=Hooks.Car_Accidents; A partial listing of the CarAccidents data set is shown below. Note: SAS did not like the spaces in the variable names and automatically placed a _ between words in the column titles. Therefore, when you refer to these variables in your program, you must include the underscore (e.g., Seat_Belt is the name of the Seat Belt variable). 5

Consider the following summary for Gender (which is a categorical variable) obtained using the PROC FREQ routine. The TABLE command is used to specify the variables to be summarized. PROC FREQ DATA=Hooks.CarAccidents; TABLE Gender; The SAS output from the PROC FREQ procedure is shown below. Two variables can be summarized simultaneously by using a * between the variables to create a contingency table (i.e., a cross-tab table). PROC FREQ DATA=Hooks.CarAccidents; TABLE Gender*Cell_Phone_Involved; 6

There exist many options that can be specified within PROC FREQ. For example, the following program tells SAS to include only the counts in the report; that is, it removes the overall percent, row percent, and column percent from each cell of the table. PROC FREQ DATA=Hooks.CarAccidents; TABLE Gender*Cell_Phone_Involved / NOPERCENT NOROW NOCOL; Including LIST in the options causes SAS to produce a list of counts instead of a table. PROC FREQ DATA=Hooks.CarAccidents; TABLE Gender*Cell_Phone_Involved / NOPERCENT NOROW NOCOL LIST; Many other options exist in PROC FREQ (e.g., you can use PROC FREQ to carry out chi-square tests and several other exact tests). You can read more about these in the SAS help documentation. 7

USING PROC TABULATE The PROC TABULATE routine is a popular substitute for PROC MEANS, PROC FREQ, and even PROC PRINT because the output produced can be manipulated to have a better appearance. PROC TABULATE is a sophisticated routine, and entire manuals have been written on this this procedure. It is based in part on the Table Producing Language, a complex language developed by the U.S. Department of Labor. Consider a side-by-side comparison of the PROC TABULATE code/output and the PROC FREQ code/output. PROC FREQ DATA=Hooks.CarAccidents; TABLE Gender; PROC TABULATE DATA=Hooks.CarAccidents; CLASS Gender; TABLE Gender; Note one major difference in the code between PROC FREQ and PROC TABULATE: the PROC TABULATE routine requires a CLASS statement to specify any categorical variables you d like to summarize. Note that if you are working with numerical variables, you must use the VAR statement in PROC TABULATE to specify the numerical variables you d like to summarize. Consider the following programming statements, which ask SAS to display both counts and percentages. The N*() and PCTN*() options are specified in the TABLE statement. CLASS Gender; TABLE N*(Gender) PCTN*(Gender); ` ` 8

The following code creates one big table that shows the counts and percentages for both Gender and Cell_Phone_Involved. Notice the counts are computed individually (i.e., a contingency table is not created here). TABLE N*(Gender Cell_Phone_Involved) PCTN*(Gender Cell_Phone_Involved); To obtain a contingency table, you should include a * between the variable names. Also, using two table statements will create a table of counts and a separate table of percentages in each category. TABLE N*(Gender*Cell_Phone_Involved); TABLE PCTN*(Gender*Cell_Phone_Involved); 9

In all of the above examples involving PROC TABULATE, the variables specified in the TABLE statement appeared in columns of the resulting table. This procedure also allows for more dimensions in your table. Consider the following program and output. TABLE Gender, Cell_Phone_Involved; Note that the variable that appears before the comma in the TABLE statement is placed in rows, while the variable that appears after the comma is placed in columns. You could also request that SAS display the percentages in each group instead of the counts. TABLE PCTN*Gender, Cell_Phone_Involved; 10

Note that the percentages reported so far are percentages of the total number of observations that fall in each cell. Instead, we may be more interested in calculating row percentages, as shown below. TABLE ROWPCTN*Gender, Cell_Phone_Involved; Finally, note that if you wanted to change the header for the Cell Phone Involved variable and also remove the RowPctN label, you could modify your code as follows: TABLE ROWPCTN=' '*Gender, Cell_Phone_Involved = 'Cell Phone Involved'; To remove the empty box left behind after removing the RowPctN label, you could use the ROW=FLOAT option as follows: TABLE ROWPCTN=' '*Gender, Cell_Phone_Involved = 'Cell Phone Involved' / ROW=FLOAT; 11

Finally, consider an example using PROC TABULATE to create summaries for numeric variables: PROC TABULATE DATA=CourseEnrollments; CLASS CourseNum; VAR Enrollment; TABLE CourseNum, Enrollment*(SUM N MIN MAX MEAN); 12