Data Manipulation with SQL Mara Werner, HHS/OIG, Chicago, IL

Paper TS05-2011

Abstract

SQL was developed to pull together information from several different data tables. Use this to your advantage as you organize the results of your analysis. Using SQL, you can make your data more accessible to a new audience by explaining numerical data in words. Then, organize your results by creating a results summary table. As an added bonus, the results summary table acts as an index that can help you find sections of your SAS code later.

Introduction

This paper goes over two ways to make your analysis more approachable using SQL. The first uses the SQL CASE-WHEN statement to translate numerical data into words. This will make it easier for someone looking at your data for the first time to scan through it and find the message. The second uses the SQL UNION statement to make a master table with the results of all your analysis. This will help to organize your code and give you easy access to your important results.

To demonstrate these techniques I'll use fictional data from site visits to Medicare providers. The data includes variables indicating what site reviewers found at each location: did they find the provider they were looking for, or another business? If it was the correct provider, was the office open? How much did Medicare pay to providers that weren't open during the site visit?

Summarize Categorical Data with Words

Some data are easy to understand even without much background. Most people can scan a price list and understand the meaning. Other data, like categorical data, may be harder to scan and understand because the numbers are really standing in for something else. A common example is gender (male=1, female=0). In the site visit data, we have a variable called found. If the site reviewers found the provider, then found=1, otherwise found=0. Similarly, if the provider's office was open, then open=1, otherwise open=0.
The variables different_business and no_business describe what was at the location if not the provider.

Source Table:

Location  Found  Open  Different_business  No_business
1         1      1     0                   0
2         0      0     1                   0

Without knowing the project or the data well, a table like this may be difficult to interpret. Having a data dictionary would help, but explaining the takeaway in words will make it much easier. The table below has a summary column that helps readers scan the data and absorb the important information quickly.

Location  Found  Open  Different_business  No_business  Summary
1         1      1     0                   0            We found this provider. The office was open.
2         0      0     1                   0            We found a different business at this location.

When site reviewers visited location 1 they found the provider there (found=1) and open (open=1). The columns for different_business and no_business aren't represented in the summary for this site visit because they are only used to describe a location where the provider was not found.

When site reviewers visited location 2, they did not find the provider (found=0). Therefore, it doesn't matter whether or not the office was open. It may be interesting to know what site reviewers did find at that address. In this case, they found a different business (different_business=1).
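To experiment with the examples in this paper, the two-row source table described above could be built with a short DATA step. This is only a sketch: the table name site_visits is an assumption, since the paper does not name its source table, and the two rows are the fictional ones shown in the source table above.

```sas
/* Sketch only: site_visits is an assumed table name.            */
/* The rows are the two fictional site visits from the paper.    */
data site_visits;
  input location found open different_business no_business;
  datalines;
1 1 1 0 0
2 0 0 1 0
;
run;

/* Print the table to confirm it matches the source table. */
proc print data=site_visits noobs;
run;
```

With a table like this in the WORK library, a PROC SQL query that selects from site_visits can be run against it and the summary text checked row by row.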

Creating Text with Conditional Logic

Creating text for the summary column is as easy as putting quotation marks around the text. For example, to create a new variable that always holds the same string of text, put the text in quotation marks as a new variable in the select statement.

    /* site_visits is a placeholder name; the paper does not name its source table */
    select location, found,
           "We found this provider. " as summary
      from site_visits;

Location  Found  Summary
1         1      We found this provider.
2         0      We found this provider.

However, we only found the first provider, not the second. To address this, we can use conditional logic. If found=1 then we found this provider, otherwise we did not find this provider. In SQL this is achieved with a CASE-WHEN statement.

    select location, found,
           case when found = 1 then "We found this provider. "
                else "We did not find this provider. "
           end as summary
      from site_visits;

Location  Found  Summary
1         1      We found this provider.
2         0      We did not find this provider.

Conditional Logic with Multiple Variables

Depending on the situation, you may need more complicated logic to provide a meaningful summary. In the site visit example, if the provider wasn't found, there is information on what site reviewers did find at that location. Any time found=0, we can categorize what was there instead of the provider. When we found the provider, we can check whether the office was open.

To write this into code, we can string together multiple CASE-WHEN statements. The first one will generate text based on what site reviewers found at the location: the provider, another business, or no business. If site reviewers found the provider, we would like to say "We found this provider." If they didn't find the provider but did find a business, then "We found a different business at this location." Similarly, if there was no business at that location, the summary can say, "We did not find a business at this location."

    /* site_visits is a placeholder name; the paper does not name its source table */
    select *,
           case when found=1 then "We found this provider. "
                when found=0 and different_business=1 then "We found a different business at this location. "
                when found=0 and no_business=1 then "We did not find a business at this location. "
                else ''
           end as Summary
      from site_visits;

Location  Found  Open  Different_business  No_business  Summary
1         1      1     0                   0            We found this provider.
2         0      0     1                   0            We found a different business at this location.

So, you have generated text that either indicates that we found the place or describes what we found instead. Those are all mutually exclusive categories. Now we'll build on that. If found=1 we still want the text "We found this provider." but in addition, we'd like to say whether or not the office was open. If the place was not found, it doesn't matter whether the office was open or not, so we insert blank text instead of words.

In SAS, you can generate blank text by using a set of quotation marks with a space between them (' '). To string together multiple statements, you can use the train tracks symbol (||). Keep in mind, if you are stringing together multiple chunks of text, it helps to put a period and a space after each sentence.

    select *,
           case when found=1 then "We found this provider. "
                when found=0 and different_business=1 then "We found a different business at this location. "
                when found=0 and no_business=1 then "We did not find a business at this location. "
                else ''
           end
           ||
           case when found=1 and open=1 then "The office was open. "
                when found=1 and open=0 then "The office wasn't open. "
                else ''
           end as Summary
      from site_visits;
    quit;

Location  Found  Open  Different_business  No_business  Summary
1         1      1     0                   0            We found this provider. The office was open.
2         0      0     1                   0            We found a different business at this location.

Organizing Summary Analysis

Another way that SQL can help you make your analysis more approachable to others, and to your future self, is to create a data table combining all the important numbers into a results summary table.
For the site visits, we might have the total number of providers that were not at the correct location and the total number of providers that were not open. We might also break out the total that were not found into those where there was another business at the location and those where there was no business at the location.

Results Summary:

Number  Description
10      Total not found
50      Total not open
8       Different business
2       No business

To create the results summary table, the first step is to write code for your analysis. I use Proc SQL rather than some of the other procedures in Base SAS to calculate my results. This allows me to have results in the form of data tables rather than text in the output window. I can then use SQL to query those results and include them in the results summary table.

The following code calculates the number of providers that were not found (N1) and then the number that were not open (N2).

    /* site_visits is a placeholder name; the paper does not name its source table */
    create table N1 as
      select count(*) as num_not_found
        from site_visits
       where found=0;

    /* where open=0 is an assumed filter; the original condition is not shown */
    create table N2 as
      select count(*) as num_not_open
        from site_visits
       where open=0;

This produces the following two tables:

N1                 N2
Num_not_found      Num_not_open
10                 50

Two tables aren't so hard to keep track of, but when it gets to be 25 or 50 of them, you may find yourself (as I often do) scrolling through pages of code trying to find the place where you created some particular number so you can check it, update it, or use it. You can save yourself this grief by creating a results summary table. Then you will have everything you need in one place.

The following two options show different ways to achieve the same goal: collecting your important results into a single data table and producing descriptive labels for each number.

Option 1: Proc SQL + Proc Transpose

Once you have your summary analysis, you can use a combination of Proc SQL and Proc Transpose to create the results summary table (allresults). For this to work, each of the source tables should be a summary with only one row per table.

    create table allresults as
      select a.num_not_found, b.num_not_open
        from n1 as a, n2 as b;

allresults
Num_not_found  Num_not_open
10             50

Then transpose the allresults table to make it easier to read.

    /* The NAME= value is missing in the source; description is an assumed name */
    proc transpose data=allresults out=allresults2 name=description prefix=number;
    run;

allresults2
Description    Number1
Num_not_found  10
Num_not_open   50

Option 2: Proc SQL UNION statement

You could also create the allresults table at the same time as you write your analysis. This method uses the Proc SQL UNION statement to join together your summary analysis. Instead of using a descriptive variable name, call each new result something generic, like number, and then include the description in a separate variable. Create each new select statement with the same variables in the same order. In the code below, the results (the count or summary, etc.) are in the first variable. The second variable is text describing the result ('Total not found').

    /* site_visits is a placeholder name; where open=0 is an assumed filter */
    create table N3 as
      select count(*) as number, 'Total not found' as description
        from site_visits
       where found=0
      union
      select count(*) as number, 'Total not open' as description
        from site_visits
       where open=0;

N3
Number  Description
10      Total not found
50      Total not open

To make this more useful, you can add another column that groups the data in some meaningful way. For example, you could add a column indicating your source data, or one that groups the numbers into the different sections of a report.

Formats in the Results Summary

You may have results that make more sense when they are formatted. For example, dollar signs, percent signs, and commas for large numbers can all make your data easier to read. In SAS, all the rows in a column must have the same format. To use multiple formats, put the data into different columns. For example, the number of providers that were found can be in one column formatted as a number, and the amount that Medicare paid to these facilities can be in another column formatted as currency. In the following sections, I'll give two options for using formats in the results summary table.

Format option 1

The first is to create one table for numbers (N3), another for dollars (N4), and then join them together using a SET statement (allresults).

    /* site_visits is a placeholder name; where open=0 is an assumed filter */
    create table N4 as
      select avg(dollars2010) as dollars,
             'Average dollars- not open' as description
        from site_visits
       where open=0;

N4
Dollars   Description
$75,000   Average dollars- not open

    data allresults;
      length description $50;   /* set before SET so longer descriptions are not truncated */
      format number 10. dollars dollar15.;
      set N3 N4;
    run;

Allresults:
Number  Dollars   Description
10      .         Total not found
50      .         Total not open
.       $75,000   Average dollars- not open

Format option 2

The second is to carry both columns the whole way down. Since number and dollars are both numeric columns, instead of using quotation marks to generate a blank, use a period.

    create table allresults2 as
      select count(*) as number, . as dollars, 'Total not found' as description
        from site_visits
       where found=0
      union
      select count(*) as number, . as dollars, 'Total not open' as description
        from site_visits
       where open=0
      union
      select . as number, avg(dollars2010) as dollars, 'Average dollars- not open' as description
        from site_visits
       where open=0;

Allresults2:
Number  Dollars   Description
10      .         Total not found
50      .         Total not open
.       $75,000   Average dollars- not open
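The earlier suggestion of adding a grouping column can be combined with the second format option. The following is a minimal sketch under stated assumptions, not the author's code: the table name site_visits, the output name allresults3, the section column and its two labels, and the open=0 filters are all hypothetical. The dollars2010 variable is taken from the N4 example.

```sas
/* Sketch only. Assumed names: site_visits, allresults3, section. */
/* The where open=0 filters are also assumptions.                 */
proc sql;
  create table allresults3 as
  select 'Site visit findings' as section length=30,
         count(*)              as number  format=comma10.,
         .                     as dollars format=dollar15.,
         'Total not found'     as description length=50
    from site_visits
   where found=0
  union
  select 'Site visit findings' as section,
         count(*)              as number,
         .                     as dollars,
         'Total not open'      as description
    from site_visits
   where open=0
  union
  select 'Medicare payments'   as section,
         .                     as number,
         avg(dollars2010)      as dollars,
         'Average dollars- not open' as description
    from site_visits
   where open=0;
quit;
```

Setting the lengths and formats on the first select is a design choice: UNION matches columns by position and takes the column names from the first query, so declaring the attributes there keeps them in one place. The section column can then be used in a WHERE clause or ORDER BY to pull out the numbers for one part of a report.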

Benefits of Creating an All-Results Table

Creating a table like the all-results tables in the examples above can be coding-intensive. But in my experience, the time up front pays off later. You'll have all of the important results of your analysis together in one place. If you need to alter any of the code or update your numbers with new data, you can find it easily using the text from the description column. When you search for that text in your code, it will take you right to the place where that number was created.

Conclusion

Data analysis is often for the purpose of discovery, but also for sharing a message. Using SQL and SAS, you can make the message easier to understand by creating summary text, and you can make your summary results easier to find and the code easier to understand by organizing your results into an all-results summary table.

Contact Information

Your comments and questions are valued and encouraged. Contact the author at:

    Mara Werner
    Department of Health and Human Services, Office of Inspector General
    233 N. Michigan Ave, Suite 1390
    Chicago, IL 60601
    312.353.9872
    mara.werner@oig.hhs.gov
    www.oig.hhs.gov

SAS and all other SAS Institute Inc. product or service names are registered trademarks or trademarks of SAS Institute Inc. in the USA and other countries. ® indicates USA registration. Other brand and product names are trademarks of their respective companies.