... ) city (city, cntyid, area, pop,.. )

Similar documents
Paper DB2 table. For a simple read of a table, SQL and DATA step operate with similar efficiency.

%MISSING: A SAS Macro to Report Missing Value Percentages for a Multi-Year Multi-File Information System

Developing Data-Driven SAS Programs Using Proc Contents

INTRODUCTION TO PROC SQL JEFF SIMPSON SYSTEMS ENGINEER

Taming a Spreadsheet Importation Monster

ABSTRACT INTRODUCTION TRICK 1: CHOOSE THE BEST METHOD TO CREATE MACRO VARIABLES

Correcting for natural time lag bias in non-participants in pre-post intervention evaluation studies

SAS Macro Design Issues Ian Whitlock, Westat

Using an Array as an If-Switch Nazik Elgaddal and Ed Heaton, Westat, Rockville, MD

How to Create Data-Driven Lists

Data Quality Review for Missing Values and Outliers

A Side of Hash for You To Dig Into

A Quick and Gentle Introduction to PROC SQL

10 The First Steps 4 Chapter 2

Get SAS sy with PROC SQL Amie Bissonett, Pharmanet/i3, Minneapolis, MN

Contents of SAS Programming Techniques

An SQL Tutorial Some Random Tips

Hash Objects Why Bother? Barb Crowther SAS Technical Training Specialist. Copyright 2008, SAS Institute Inc. All rights reserved.

Efficient Processing of Long Lists of Variable Names

Tips & Tricks. With lots of help from other SUG and SUGI presenters. SAS HUG Meeting, November 18, 2010

ABSTRACT INTRODUCTION MACRO. Paper RF

Uncommon Techniques for Common Variables

PROGRAMMING ROLLING REGRESSIONS IN SAS MICHAEL D. BOLDIN, UNIVERSITY OF PENNSYLVANIA, PHILADELPHIA, PA

David Beam, Systems Seminar Consultants, Inc., Madison, WI

BASICS BEFORE STARTING SAS DATAWAREHOSING Concepts What is ETL ETL Concepts What is OLAP SAS. What is SAS History of SAS Modules available SAS

Best Practice for Creation and Maintenance of a SAS Infrastructure

SAS Macro Dynamics - From Simple Basics to Powerful Invocations Rick Andrews, Office of the Actuary, CMS, Baltimore, MD

Submitting SAS Code On The Side

Data Edit-checks Integration using ODS Tagset Niraj J. Pandya, Element Technologies Inc., NJ Vinodh Paida, Impressive Systems Inc.

From Manual to Automatic with Overdrive - Using SAS to Automate Report Generation Faron Kincheloe, Baylor University, Waco, TX

Using PROC SQL to Calculate FIRSTOBS David C. Tabano, Kaiser Permanente, Denver, CO

SAS Macro Dynamics: from Simple Basics to Powerful Invocations Rick Andrews, Office of Research, Development, and Information, Baltimore, MD

Programming Gems that are worth learning SQL for! Pamela L. Reading, Rho, Inc., Chapel Hill, NC

An Easy Route to a Missing Data Report with ODS+PROC FREQ+A Data Step Mike Zdeb, FSL, University at Albany School of Public Health, Rensselaer, NY

PharmaSUG Paper TT11

Summarizing Impossibly Large SAS Data Sets For the Data Warehouse Server Using Horizontal Summarization

The Power of PROC SQL Techniques and SAS Dictionary Tables in Handling Data

9 Ways to Join Two Datasets David Franklin, Independent Consultant, New Hampshire, USA

Paper HOW-06. Tricia Aanderud, And Data Inc, Raleigh, NC

Dictionary.coumns is your friend while appending or moving data

Using PROC SQL to Generate Shift Tables More Efficiently

Using Templates Created by the SAS/STAT Procedures

A Better Perspective of SASHELP Views

Validation Summary using SYSINFO

USING PROC SQL EFFECTIVELY WITH SAS DATA SETS JIM DEFOOR LOCKHEED FORT WORTH COMPANY

Are Your SAS Programs Running You? Marje Fecht, Prowerk Consulting, Cape Coral, FL Larry Stewart, SAS Institute Inc., Cary, NC

A Simple Framework for Sequentially Processing Hierarchical Data Sets for Large Surveys

An Animated Guide: Proc Transpose

Top-Down Programming with SAS Macros Edward Heaton, Westat, Rockville, MD

WHAT ARE SASHELP VIEWS?

SAS System Powers Web Measurement Solution at U S WEST

Checking for Duplicates Wendi L. Wright

So Much Data, So Little Time: Splitting Datasets For More Efficient Run Times and Meeting FDA Submission Guidelines

SAS Talk: LAG Lead. The SESUG Informant Volume 9, Issue 2

Ditch the Data Memo: Using Macro Variables and Outer Union Corresponding in PROC SQL to Create Data Set Summary Tables Andrea Shane MDRC, Oakland, CA

Minimum spanning trees

Unlock SAS Code Automation with the Power of Macros

Useful Tips When Deploying SAS Code in a Production Environment

Quick Data Definitions Using SQL, REPORT and PRINT Procedures Bradford J. Danner, PharmaNet/i3, Tennessee

Are you Still Afraid of Using Arrays? Let s Explore their Advantages

Tweaking your tables: Suppressing superfluous subtotals in PROC TABULATE

Andrew H. Karp Sierra Information Services, Inc. San Francisco, California USA

Cover the Basics, Tool for structuring data checking with SAS Ole Zester, Novo Nordisk, Denmark

Adjusting for daylight saving times. PhUSE Frankfurt, 06Nov2018, Paper CT14 Guido Wendland

Merge Processing and Alternate Table Lookup Techniques Prepared by

CS264: Homework #1. Due by midnight on Thursday, January 19, 2017

Get Going with PROC SQL Richard Severino, Convergence CT, Honolulu, HI

SAS Programming Techniques for Manipulating Metadata on the Database Level Chris Speck, PAREXEL International, Durham, NC

Definition: A data structure is a way of organizing data in a computer so that it can be used efficiently.

How to Keep Multiple Formats in One Variable after Transpose Mindy Wang

Merging Data Eight Different Ways

Automating the Production of Formatted Item Frequencies using Survey Metadata

Cleaning Duplicate Observations on a Chessboard of Missing Values Mayrita Vitvitska, ClinOps, LLC, San Francisco, CA

LECTURE NOTES OF ALGORITHMS: DESIGN TECHNIQUES AND ANALYSIS

Stat Wk 3. Stat 342 Notes. Week 3, Page 1 / 71

A Format to Make the _TYPE_ Field of PROC MEANS Easier to Interpret Matt Pettis, Thomson West, Eagan, MN

An Application of PROC NLP to Survey Sample Weighting

SAS/Warehouse Administrator Usage and Enhancements Terry Lewis, SAS Institute Inc., Cary, NC

Format-o-matic: Using Formats To Merge Data From Multiple Sources

SAS Scalable Performance Data Server 4.3

Planting Your Rows: Using SAS Formats to Make the Generation of Zero- Filled Rows in Tables Less Thorny

Introduction to PROC SQL

How Managers and Executives Can Leverage SAS Enterprise Guide

Tales from the Help Desk 6: Solutions to Common SAS Tasks

USING DATA TO SET MACRO PARAMETERS

Macros from Beginning to Mend A Simple and Practical Approach to the SAS Macro Facility

Advanced PROC REPORT: Getting Your Tables Connected Using Links

Arthur L. Carpenter California Occidental Consultants, Oceanside, California

CSC 261/461 Database Systems Lecture 19

Why Hash? Glen Becker, USAA

Posters. Workarounds for SASWare Ballot Items Jack Hamilton, First Health, West Sacramento, California USA. Paper

A SAS Macro for Producing Benchmarks for Interpreting School Effect Sizes

Operators Agree LabLite PC = Data Security

PROC SQL vs. DATA Step Processing. T Winand, Customer Success Technical Team

A Tutorial on the SAS Macro Language

PhUSE US Connect 2018 Paper CT06 A Macro Tool to Find and/or Split Variable Text String Greater Than 200 Characters for Regulatory Submission Datasets

Countdown of the Top 10 Ways to Merge Data David Franklin, Independent Consultant, Litchfield, NH

David Ghan SAS Education

SAS Online Training: Course contents: Agenda:

Troy Martin Hughes ABSTRACT INTRODUCTION

Transcription:

PaperP829 PROC SQl - Is it a Required Tool for Good SAS Programming? Ian Whitlock, Westat Abstract No one SAS tool can be the answer to all problems. However, it should be hard to consider a SAS programmer well versed in SAS, who does not use DATA steps. SOL should be classified with the DATA step rather than procedures because it is really a programming language in itself. In the past many good SAS programmers have resisted learning PROC SOL on the basis that it is a database tool and that they can get along without it. It is time (or past time) to reconsider the question and change that belief. SOL has dramatically changed the nature of what good SAS macro code looks like. It can simplify and standardize a number of common SAS programming patterns involving combinations of the DATA step, PROC SUMMARY, PROC SORT, and PROC PRINT. This tutorial will focus on problem examples with code where PROC SOL has a distinct advantage in terms of code simplicity over use of the more traditional SAS tools mentioned above. Introduction The typical SAS programmer needs PROC SOL because: 1. It is superb at accessing data stored in multiple data sets at different levels. 2. It can easily produce a Cartesian produd. 3. It can perform matching where the condition of a match is not equality. 4. It is good at summarization. 5. With the introduction of 6.11, it can make arrays of macro variables or do away with the need for these arrays by assigning a whole column of values to one macro variable. 6. Macro - SOL interadion enhances both macro and Sal. In addition to the direct values listed above, one should not underestimate the value of SOL training in teaching one data organization. An example of how SQL can teach data organization is given at the end of the summarizing section. Matching multiple data sets at different levels PROC SOL provides a powerful tool when extracting data from different data sets at several different levels. It not only provides simpler code, it provides a new way of looking at these problems. Suppose we have data at the state, county and city level stored in three data sets. State (state, region,... ) county (cntyid, state, cnty, area,... ) city (city, cntyid, area, pop,.. ) Prepare a report of all cities in the midwest with populations over 100,000 with the ratio of the city area to the enclosing county area. A SAS procedural solution demands that we decide whether to start with states or cities, specify all sorts needed for the various DATA step merges, and specify those merges in detail ending up with a PROC PRINT. In contrast, SQL asks the fundamental questions: 1. What are the data sets? 2. What are the subsetting conditions? 3. What are the linking conditions? 4. What columns should appear? proc sql i select st.state, cn.cnty, ct.city, ct. area I cn. area as arearato from state as st, county as cn, city as ct where ct.pop > 100000 and st. region = 'MW' and st.state = cn.state and cn.cntyid = ct.cntyid 919

TUFORIALS quit ; In all the remaining example code the PROC statement and the QUIT will be omitted. Cartesian Product Cartesian product matches are far more common than one-to-one matches, but the MERGE statement assumes one-to-one within BY-groups. To find a Cartesian product match, let's look at a codebook example. I have three data sets: specs ( variable, format ) freq ( variable, format, value, count ) fmts format, value, label The report might look something like this. first format variable using fmtlname create table report as select coalesce (sfm.variable, fq.variable) as variable, coalesce (sfm.format, fq.format) as format, coalesce (sfm.value, fq.value) as value, sfm.label, coalesce ( fq.count, 0 ) as count from sfm full join freq as fq on sfm.variable = fq.variable and sfm.format = fq.format and sfm.value = fq.value order by variable, format, value value label count 1 first 500 2 sec 0 3 rem 300 second variable using format etc. fmt2name Note that in two SQL statements we have done a lot of the work toward creating a codebook. If one could produce SPECS, FREQ, and FMTS easily, then one could produce a codebook for any properly formatted SAS data set. The FMTS file is trivially produced with the FMTLlB option of PROC FORMAT. The FREQ file requires some macro code. We will postpone discussion of the SPECS file to a later section. Before tackling the problem let's look at the code for joining SPECS and FMTS. The problem involves a Cartesian product because a format may be associated with more than one variable in SPECS, and formats typically have more than one value. Thus FORMAT does not determine a single record in either data set; hence a merge by FORMAT will not work. Note that accomplishing this sort of combining records in a DATA step involves using sophisticated SAS techniques when SQL is not used. create view sfm as select s.variable, s.format, fm.value, fm.label from specs as s, fmts as fm where s.format = fm.format It is easy to see how to produce the report with a DATA _NULL_ step when the right information is in a data set ( VARIABLE, FORMAT, VALUE, LABEL, and COUNT ). Here is the SQL code to produce the file. Fuzzy Matching Fuzzy matching comes in two varieties. In date (or time) line matches one file holds a specific date (or time) and one wants the corresponding record which holds a range of dates (or time). For SAS dates DATE, BEGDATE, and ENDDATE the WHERE clause might be where date is between begdate and enddate For efficiency reasons it is important to add an equicondition whenever possible. In date (or time) matches one often has an 10 that must also match, hence the equijoin condition becomes where a.id b.id and date is between begdate and enddate In the other kind of fuzzy matching one cannot trust the identifying variables. Suppose we want to match on social security numbers, SSN, but expect transposition errors 920

and single digit mutations. Now the WHERE clause might be where sum ( substr(a.ssn,l,l) = substr(b.ssn,l,l), substr(a.ssn,2,1) = substr(b.ssn,2,1), substr(a.ssn,9,1) = substr(b.ssn,9,1) ) >= 7 To make this an equi-join we might add and substr(a.zip,1,3) substr(b.zip,1,3) or some other relatively safe blocking variable. Summarizing One PROC Sal step can do the job of a PROC SUMMARY followed by merging of the results with the original data.. For example, suppose we have a weighted student sample including many different schools. We want the percentage weight of each student in a school. Then we might have: select school, student, wght, 100*wght/sum(wght) as pctwt from studsamp group by school In this case one gets a message that summary data was remerged with the original data, but that is precisely what we wanted. Now suppose we want to look at all the students from any school which has some student contributing more than 20% of the weight. The code might be select stu.* from studsamp as stu, ( select distinct school from studsamp group by school having wght/sum(wght) >.2 as want where stu. school = want. school order by stu. school, stu.wght desc The technique is important because there are many times one wants to view every one in a group if anyone in the group has some property. Sal provides a natural idiom for producing the report. Knowing Sal should make one more sensitive to bad patterns of storing data. For example, a common question on SAS-l is how to array data. Given the data 10 DATE COUNT 1 5jun1993 50 1 160ct1993 25 1 21dec1993 8 2 14may1990 16 2 27ian1991 3 how do you produce one record per 10 with as many date and count fields as needed, say 10, OATE1 - OATE32 and COUNT1 - COUNT32? Another common question is how to work with the arrayed data. For example, how can you compute the rate of decrease in count per month and per year for each 10. The answer is a trivial Sal problem, when the data are stored as they were originally given. select id, (max(count)-min(count»/ intck('month',min(date),max(date» as decpmon, calculated decpmon * 12 as decpyr from origdata group by id After arraying it becomes a harder problem. Perhaps if SAS programmers learned Sal, and how to solve problems without arrays, then they would also learn the advantages of storing data in a non-arrayed form. With Sal training, one comes to realize the importance of putting the information into the data instead of the variable names. Of course, this also means that the usefulness of Sal is highly dependent on how well the data are stored but it would be wrong to conclude that one might as weli avoid learning Sal because of bad data management practices. Macro Lists Via PROC SQL PROC Sal's ability to assign a whole column of values to a macro variable has drastically changed how one writes macro code. Consider the splitting problem. Given a data set All with a variable SPLIT naming a member, split All by the variable SPLIT. Before version 6.11 one had to use CAll SYMPUT to create an array of data set names and values and then write a monster SELECT statement. The whole thing had to be in a macro in order 921

to repetitively process the array. Now one might view it as a problem to produce two lists memname 'MYDATA' 1. The names of data sets 2. WHEN I OUTPUT statements for a SELECT block The first case is easily handled by select distinct 'lib.' I Isplit into :datalist separated by,, from all ; The second is more of the same, only harder. select distinct 'when (' I I split I I ') output lib.' I I split into :whenlist separated by from all '.,, Now the code to produce the split is trivial and need not even be housed in a macro. data &datalist set all ; select ( split &whenlist otherwise end run ; In the section on the Cartesian product, we postponed discussion of the data set SPECS. It could be generated from one of the "dictionary" files documented in the Technical Report P-222. Suppose we are interested in making a codebook for the data set LlB.MYDATA, then the following code could generate SPECS. create specs as select name as variable, case when format=" and type='char' then Schar when format=" and type= 'num' then best else format end as format from dictionary. columns where libname = 'LIB' and To prepare for doing the frequencies needed to make the data set FREO we could use the array form of generating variables from a column. select variable, format, into :varl - :fmtl - from specs %let nvar = &sqlobs var9999 fmt9999 The frequency data sets can then be generated in a PROC FORMAT with the macro code proc freq data = lib.mydata %do i = 1 %to &nvar ; table &&var&i /out=&&var&i format &&var&i &&fmt&i... ; %end run ; We still have not combined the frequency data sets into one data set, but that task can be left to a competent macro programmer, even one who doesn't know Sal (assuming that that is not a contradiction in terms). Macro - SQL Interaction The previous section showed how the making of lists has had a dramatic effect on the way one codes macro problems involving lists. Now we consider a more complex interaction between PROC Sal and macro, where macro code is used to write the Sal code in a loop and the whole problem is much easier, precisely because it is Sal code. Suppose we have a data set, W, containing the variables NAME and GROUP. NAME GROUP A 1 B 1 B 2 C 2 D 2 D 3 E 3 F 4 G 4 G 5 H 5 922

min (liimingroup) as liimingroup from t%eval(liii-l) group by liiname create table tliii as select liiname, liigroup, min (liimingroup) as liimingroup from t&i group by liigroup /* are we finished? */ reset noprint ; select wl.&name from t%eval(&i-l) as wi tliii as w2 where wl.liiname=w2.liiname and Iiigroup=w2.liigroup and wl.liimingroup A= w2.&mingroup run Conclusion I have pointed out six areas where SQl code excels. My conclusion is that a good SAS programmer can no longer ignore PROC SQl and remain good. The author can be contacted by mail at: or bye-mail at: Westat Inc. 1650 Research Boulevard Rockville, MD 20850-3129 whitloi1 @westat.com SAS and all other SAS Institute Inc. product or service names are registered trademarks or trademarks of SAS Institute Inc. in the USA and other countries. eel indicates USA registration. %let done = %eval ( not liisqlobs ) ; reset print ; drop table t%eval(liii-l); %end ;/*end iterative loop*/ %if not liidone %then %put WARNING (GROUPIT) : Process stopped by condition MAX=&max; %else %do create table liiout as select liiname I liigroup, liimingroup from tliii order by &name drop table %end t&i quit ; %mend groupit %groupit data = w, name = namel, group = groupl,out = w2 ) proc print data = w2 ; 924

We want to collapse groups to the lowest level. For example, since A and B belong to group 1, and Band C belong to group 2, then all members of group 2 are part of group 1 because the groups have the common member B. Once this is seen one can add group 3 to the new group 1 because of the common member D. Thus group 1 covers A, B, C, 0 and E. Similarly F, G, and H ultimately belong to group 4. More formally, two groups are in the same chain if there is a sequence of groups containing the given groups such that each consecutive pair of groups contains a common name. Using this definition the data set consists of disjoint chains. The problem is to write a program identifying each chain by the minimum group number in the chain. The intuitive argument given in the previous paragraph uses two kinds of minimization. 1. Find the minimum group (call it MINGROUP) for all names having the same value (e.g. NAME = 'B' has MINGROUP = 1). 2. Find the minimum of all MINGROUP values for all names in a common group (e.g. GROUP = 2 has MINGROUP = 1). PROC SQl is very suitable to both types of minimization. In the first case we might have create table t as select name, group, min (group) as mingroup from dataset group by name ; In the second case we might have create table t as select name, group, min (mingroup) as mingroup from t group by group ; These two operations must be repeated over and over until no new minimums are found, since each new extension of a group may mean further collapsing. To express the iteration of this code to an arbitrary level, we need a macro %DO-Ioop. This time we will present the complete macro, %GROUPIT. For generality, we make parameters to name the input and output data sets, and the variables represented by NAME, GROUP, and MINGROUP. The parameter MAX is added to insure that the macro does not execute for an excessively long time. (Since the algorithm does converge one could do away with this parameter or set it to the number of observations.) %macro groupit ( data=&syslast,/*input data */ out=_data_, /*output data*/ name=name, /* name var */ group=group, /* group var */ mingroup=mingroup, /* minimum var*/ max=20 /*limit *iterations*/ /* ------------------------------- minimize group on name and then group repeat until max iterations or done ------------------------------- */ %local i done proc sql /* ------------------------- initial set up - get first minimums, start numbered sequence of data sets ------------------------- */ create table to as select &name, min (&group) as &mingroup from &data group by &name create table to as select &name, min (&mingroup) as &mingroup from to group by &group /* ---------------------------- iterate until done or too many iterations ------------------------- */ %do %until (&done or &i > &max) ; %let i = %eval (&i + 1) create table select &name t&i as 923