A Quick and Gentle Introduction to PROC SQL


Paper B2B 9

A Quick and Gentle Introduction to PROC SQL

Shane Rosanbalm, Rho, Inc.
Sam Gillett, Rho, Inc.

ABSTRACT

If you are afraid of SQL, it is most likely because you haven't been properly introduced. Maybe you've heard of SQL but never quite had the time to explore. Or, maybe you were introduced to SQL, but it was at the hands of a database administrator turned SAS programmer who liked to write un-indented queries that were 60 lines long and combined 15 different datasets (egads!).

The goal of this paper is to provide an introduction to PROC SQL that is both quick and gentle. We will not discuss joins, subqueries, cases, calculated, feedback, return codes, or anything else that might dampen your initial momentum. Don't get me wrong, these are all wonderful topics. But they are better suited to a paper in the Hands-on Workshop or Beyond the Basics sections. Our focus will be limited to introductory topics such as select, from, into, where, group by, and order by.

Once these basic topics have been introduced, we will use them to demonstrate some simple yet powerful applications of SQL. The applications will focus on situations in which the use of SQL will result in code that is both more efficient and more concise than what can be achieved when limiting oneself to the DATA step and other non-SQL procedures.

ORIENTATION

In this paper we will first do a short sprint through the fundamental concepts of SQL. After this short sprint we will regroup, pulling this initial bolus of information together in a more consolidated form, as well as elaborating on some of the finer points associated with these fundamental concepts. With these fundamentals in hand, we will then move on to some slightly more interesting features of SQL.

THE FUNDAMENTALS

Here begins a short sprint through the fundamental concepts of SQL. We begin with the simplest possible application of SQL and then layer on additional fundamentals one at a time.

SELECT FROM

We begin by comparing PROC SQL to PROC PRINT.
Here is a straightforward application of PROC PRINT.

    proc print data=sashelp.cars;
       var model;
    run;

The corresponding SQL code looks like:

    proc sql;
       select model
       from sashelp.cars
       ;
    quit;

The output from both the PROC PRINT and PROC SQL would look like:

    Model
    -----------------------------------------
    MDX
    RSX Type S 2dr
    TSX 4dr
    TL 4dr
    3.5 RL 4dr
    3.5 RL w/navigation 4dr

Takeaway Message: The keywords are different in SQL: DATA= is replaced by FROM and VAR is replaced by SELECT.

COMMAS

Here is another straightforward PROC PRINT example, in which several variables are printed at once.

    proc print data=sashelp.cars;
       var model msrp mpg_highway;
    run;

The corresponding SQL code utilizes commas and looks like:

    proc sql;
       select model, msrp, mpg_highway
       from sashelp.cars
       ;
    quit;

                                                   MPG
    Model                             MSRP   (Highway)
    --------------------------------------------------
    Matrix XR                      $16,695          36
    Touareg V6                     $35,515          20
    Golf GLS 4dr                   $18,715          31
    GTI 1.8T 2dr hatch             $19,825          31
    Jetta GLS TDI 4dr              $21,055          46
    New Beetle GLS 1.8T 2dr        $21,055          31

Takeaway Message: Use commas to delimit a list of variables in SQL.

WHERE

Subsetting in SQL is very similar to typical SAS syntax. The keyword is WHERE.

    proc sql;
       select model, msrp, mpg_highway
       from sashelp.cars
       where make = 'Cadillac'
       ;
    quit;

ORDER BY

In SQL you are able to sort while you print. The keyword is ORDER BY.

    proc sql;
       select model, msrp, mpg_highway
       from sashelp.cars
       where make = 'Cadillac'
       order by msrp
       ;
    quit;

                                                   MPG
    Model                             MSRP   (Highway)
    --------------------------------------------------
    CTS VVT 4dr                    $30,835          25
    Deville 4dr                    $45,445          26
    SRX V8                         $46,995          21
    Seville SLS 4dr                $47,955          26
    Deville DTS 4dr                $50,595          26

Vs. Traditional Methods: Similar output might be generated by a PROC SORT followed by a PROC PRINT. The main benefit of the SQL is less typing.
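For comparison, the traditional two-step approach mentioned above might be sketched as follows. This is only an illustration of the idea, not code from the paper: the output dataset name WORK.CADILLACS and the NOOBS option are assumptions made for the example.

```sas
/* Step 1: subset and sort. WORK.CADILLACS is a hypothetical */
/* scratch dataset introduced only for this illustration.    */
proc sort data=sashelp.cars out=work.cadillacs;
   where make = 'Cadillac';
   by msrp;
run;

/* Step 2: print the same three columns the query selects.   */
/* NOOBS suppresses the observation-number column.           */
proc print data=work.cadillacs noobs;
   var model msrp mpg_highway;
run;
```

Two steps and an intermediate dataset are needed here, whereas the ORDER BY query does the subsetting, sorting, and printing in a single statement.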

STATISTICS

In the DATA step, functions return one result per observation (e.g., x=max(var1,var2,var3)). In SQL, certain functions are enhanced to have the added ability to perform calculations across records. Functions capable of producing across-record calculations include MAX, MEAN, and COUNT.

The following application of the MAX function returns the MSRP of the most expensive automobile in the dataset.

    proc sql;
       select max(msrp)
       from sashelp.cars
       ;
    quit;

    --------
      192465

The following application of the MEAN function returns the average MSRP of all automobiles in the dataset.

    proc sql;
       select mean(msrp)
       from sashelp.cars
       ;
    quit;

    --------
    32774.86

The following application of the COUNT function returns the number of automobiles in the dataset.

    proc sql;
       select count(*)
       from sashelp.cars
       ;
    quit;

    --------
         428

Potential Gotcha: To count records with COUNT, use an asterisk within the parentheses rather than a variable name. (COUNT(var) is also accepted, but it counts only the nonmissing values of var.) Further explanation of this particular syntactic quirk is beyond the scope of a quick and gentle paper.

Takeaway Message: SQL is much more than just an alternative to PROC PRINT. It also contains some of the functionality of PROC FREQ and PROC MEANS.

GROUP BY

The above statistics can be calculated separately within subgroups. The keyword is GROUP BY.

The following application of the MEAN function returns the average MSRP of the models produced by each manufacturer.

    proc sql;
       select make, mean(msrp)
       from sashelp.cars
       group by make
       ;
    quit;

    Make
    -----------------------
    Acura          42938.57
    Audi           43307.89
    BMW            43285.25
    Buick          30537.78
    Cadillac       50474.38

Potential Gotcha: You have to specify the subgroup variable twice: once in the SELECT clause and a second time in the GROUP BY clause.

The following application of the COUNT function returns the number of models produced by each manufacturer.

    proc sql;
       select make, count(*)
       from sashelp.cars
       group by make
       ;
    quit;

    Make
    -----------------------
    Acura                 7
    Audi                 19
    BMW                  20
    Buick                 9
    Cadillac              8

Takeaway Message: SQL is more than just a different way to print.

FUNDAMENTALS CONSOLIDATED AND ELABORATED UPON

Thus far we have introduced the following SQL keywords:

    SELECT (variables)
    FROM (input datasets)
    WHERE (subsetting)
    GROUP BY (subgroups)
    ORDER BY (sorting)

Not all keywords are required in a given SELECT statement, but keywords that are used must be specified in the order shown above. A useful mnemonic for remembering this order is "So few workers go home on time."

A SELECT statement is referred to as a query. The parts of a SELECT statement are referred to as clauses (e.g., the WHERE clause, the ORDER BY clause, etc.). We combine clauses to form a statement or a query.

The statement PROC SQL is used prior to a SELECT statement. The statement QUIT is used following a SELECT statement. Note that it is a QUIT statement and not a RUN statement.

    proc sql;
       select ... ;
    quit;

Multiple SELECT statements are allowed within a single PROC SQL.

    proc sql;
       select ... ;
       select ... ;
    quit;
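As a concrete illustration of the clause order and of several SELECT statements sharing one PROC SQL step, here is a small sketch. The particular queries are invented for this example; the only assumption beyond the paper is that SASHELP.CARS contains a character variable TYPE with the value 'SUV'.

```sas
proc sql;
   /* First query: a dataset-level count. */
   select count(*)
   from sashelp.cars
   ;
   /* Second query: clauses appear in the required order */
   /* SELECT, FROM, WHERE, GROUP BY, ORDER BY.           */
   select make, max(msrp)
   from sashelp.cars
   where type = 'SUV'
   group by make
   order by make
   ;
quit;
```

Each SELECT statement ends with its own semicolon and produces its own block of output; a single QUIT closes the procedure.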

Various summary statistics can be calculated, including but not limited to:

    MAX(var)
    MEAN(var)
    COUNT(*)

Functions like MAX and MEAN can be used at the observation level by specifying multiple parameters (e.g., x=max(var1,var2,var3)), or they can be used at the dataset level by specifying a single parameter (e.g., max(var1)).

Summary statistics can be calculated within subgroups (i.e., by using GROUP BY) or at the dataset level (i.e., by omitting the GROUP BY).

Beauty being in the eye of the beholder, we meekly offer the following recommendations related to query readability: (a) keep each clause on a separate line, (b) indent keywords the same amount, and (c) put the semicolon on its own line.

    proc sql;
       select ...
       from ...
       where ...
       group by ...
       order by ...
       ;
    quit;

BUILDING ON THE FUNDAMENTALS

The fundamentals in hand, we now build on them.

DISTINCT

The DISTINCT keyword allows us to see all of the unique values of a given variable.

    proc sql;
       select distinct make
       from sashelp.cars
       ;
    quit;

    Make
    --------------
    Acura
    Audi
    BMW
    Buick
    Cadillac

Application: Use this technique when you want to see all unique values for a variable, but you aren't concerned with counting the number of observations corresponding to each unique value.

Vs. Traditional Methods: Similar output might be generated by a NODUPKEY SORT followed by a PROC PRINT. The main benefit of the SQL is less typing. PROC FREQ could also be used, but it carries a little additional overhead because it will be calculating counts as well.

SQLOBS

The automatic macro variable SQLOBS is updated after every SQL query. This macro variable indicates the number of records returned by the most recent query.

    proc sql;
       select distinct make
       from sashelp.cars
       ;
    quit;
    %PUT &SQLOBS.;

Log:

    107  %PUT &SQLOBS.;
    38

Vs. Traditional Methods: Similar output might be generated by a NODUPKEY SORT followed by a CALL SYMPUT on the END= record within a DATA _NULL_. The main benefit of the SQL is again less typing.

Derivations

Derivations can be performed in the SELECT clause.

    proc sql;
       select make, model, msrp*0.85
       from sashelp.cars
       ;
    quit;

    Make     Model
    -----------------------------------------------------------------
    Volvo    C70 HPT convertible 2dr                         36180.25
    Volvo    S80 T6 4dr                                       38428.5
    Volvo    V40                                             22214.75
    Volvo    XC70                                            29873.25

Vs. Traditional Methods: Similar output might be generated by a DATA step followed by a PROC PRINT. Again with the less typing thing.

Naming Columns

You might have noticed by now that the results of derivations and summary functions do not automatically come endowed with column headers. The keyword AS is used to assign names to derived columns. Note that the new variable name comes after the derivation.

    proc sql;
       select make, model, msrp*0.85 as wholesale
       from sashelp.cars
       ;
    quit;

    Make     Model                                          wholesale
    ------------------------------------------------------------------
    Acura    MDX                                             31403.25
    Acura    RSX Type S 2dr                                     20247
    Acura    TSX 4dr                                          22941.5
    Acura    TL 4dr                                          28215.75
    Acura    3.5 RL 4dr                                      37191.75

Takeaway Message: Compared to the DATA step syntax for assigning values to new variables, we replace = with AS, and we move the new variable name from the left side to the right side.

CREATE TABLE

Results from queries can be saved into datasets. The keyword is CREATE TABLE.

    proc sql;
       create table means_by_make as
       select make, mean(msrp) as mean_msrp
       from sashelp.cars
       group by make
       ;
    quit;

Log:

    NOTE: Table WORK.MEANS_BY_MAKE created, with 38 rows and 2 columns.

Potential Gotcha: The keyword AS is required prior to SELECT.

Vs. Traditional Methods: Similar output might be generated by a PROC SORT (by MAKE) followed by a PROC MEANS and ODS OUTPUT. The main benefit of the SQL is that you do not have to pre-sort your dataset.

ASTERISK

In order to select all variables in a dataset, use an asterisk.

    proc sql;
       create table caddies as
       select *
       from sashelp.cars
       where make = 'Cadillac'
       ;
    quit;

Log:

    NOTE: Table WORK.CADDIES created, with 8 rows and 15 columns.

Vs. Traditional Methods: Similar output might be generated by a DATA step. It's a tie.

INTO

Results from queries can be saved into macro variables. The keyword is INTO.

The following application returns the number of models for the make Cadillac.

    proc sql;
       select count(*)
       into :NMODELS
       from sashelp.cars
       where make = 'Cadillac'
       ;
    quit;
    %PUT NMODELS = [&NMODELS.];

Log:

    121  %PUT NMODELS = [&NMODELS.];
    NMODELS = [       8]

Potential Gotcha: The macro variable name must be preceded by a colon.

Potential Gotcha: SAS will sometimes pad macro variables with leading or trailing spaces. When sending results to the log with a %PUT statement, enclose the macro variable reference in brackets so that padding is more easily detected.
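Once padding has been detected, it can also be removed. The following is a minimal sketch of two common ways to do so, neither of which comes from the paper. The first re-assigns the macro variable to itself with %LET, which discards leading and trailing blanks. The second uses the TRIMMED keyword of the INTO clause, which assumes SAS 9.3 or later (newer than the 9.2 guide the paper references).

```sas
/* Option 1: strip the padding after the fact with %LET. */
proc sql noprint;
   select count(*)
   into :NMODELS
   from sashelp.cars
   where make = 'Cadillac'
   ;
quit;
%LET NMODELS = &NMODELS.;
%PUT NMODELS = [&NMODELS.];

/* Option 2 (assumes SAS 9.3 or later): ask INTO to trim. */
proc sql noprint;
   select count(*)
   into :NMODELS trimmed
   from sashelp.cars
   where make = 'Cadillac'
   ;
quit;
%PUT NMODELS = [&NMODELS.];
```

In both cases the brackets in the %PUT statement should now hug the value with no blanks on either side.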

Vs. Traditional Methods: Similar output might be generated by a DATA step with WHERE= and END= on the SET statement and a CALL SYMPUT. The main benefit of the SQL is readability.

Sets of Macro Variables

INTO allows for the creation of a set of macro variables with a dash.

    proc sql;
       select distinct make
       into :MAKE1 - :MAKE1000
       from sashelp.cars
       ;
    quit;
    %PUT MAKE1 = [&MAKE1.];
    %PUT MAKE2 = [&MAKE2.];

Log:

    180  %PUT MAKE1 = [&MAKE1.];
    MAKE1 = [Acura]
    181  %PUT MAKE2 = [&MAKE2.];
    MAKE2 = [Audi]

Potential Gotcha: The fact that we have to write down a specific number after the dash (i.e., :MAKE1000) is very unsatisfying. Alas, a number must be specified. As a safety precaution, always specify a number that is a couple of orders of magnitude larger than what you might reasonably expect to find in your data.

NOPRINT

Results from queries can be suppressed from the output window. The keyword is NOPRINT.

    proc sql noprint;
       select count(*)
       into :NMODELS
       from sashelp.cars
       ;
    quit;

Application: Use this technique when you want to know how many of something you have, but you don't necessarily need to see the count in the output window.

MORE CONSOLIDATION AND ELABORATION

In BUILDING ON THE FUNDAMENTALS we introduced the following:

    DISTINCT (removing duplicates)
    SQLOBS (number of observations returned by the query)
    Derivations (in SELECT)
    Naming Columns (with AS)
    CREATE TABLE (creating datasets)
    ASTERISK (selecting all variables in a dataset)
    INTO (creating one macro variable)
    Sets of Macro Variables (creating a set of macro variables)
    NOPRINT (suppressing results in the output window)

CREATE TABLE does not send results to the output window, but to a dataset. Therefore NOPRINT is unnecessary when CREATE TABLE is used.
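Two of the items above, sets of macro variables and SQLOBS, combine naturally: after a query that fills a set of macro variables, SQLOBS tells you how many of them were actually populated. The sketch below is an illustration rather than code from the paper; the macro variable name NMAKES is hypothetical.

```sas
proc sql noprint;
   select distinct make
   into :MAKE1 - :MAKE1000
   from sashelp.cars
   ;
   /* Capture the row count right away, before another query */
   /* overwrites SQLOBS. NMAKES is a name made up here.      */
   %LET NMAKES = &SQLOBS.;
quit;

%PUT NMAKES = [&NMAKES.];
/* The double ampersand delays resolution so that the last   */
/* populated member of the set is looked up by its number.   */
%PUT Last make = [&&MAKE&NMAKES.];
```

Knowing the count makes it straightforward to loop over MAKE1 through MAKE&NMAKES. later, regardless of how generous the upper bound after the dash was.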

Multiple variables can be specified following DISTINCT, GROUP BY, or ORDER BY.

    proc sql;
       select make, model, msrp
       from sashelp.cars
       order by make, model
       ;
    quit;

Because the value of the SQLOBS macro variable is automatically updated after every query, you should immediately save the result, lest the desired value be overwritten by your next query.

    proc sql;
       select ... ;
       %LET N1 = &SQLOBS.;
       select ... ;
       %LET N2 = &SQLOBS.;
    quit;

CONCLUSION

PROC SQL is an extremely powerful procedure that combines parts and pieces of the DATA step, PROC PRINT, PROC SORT, PROC FREQ, PROC MEANS, and the Macro Language. If a SAS user is not properly introduced to PROC SQL it can easily intimidate. We hope that the preceding introduction was quick enough so as not to bore you and gentle enough so as not to scare you. We leave you with the following summarization of all of the topics that have been introduced in this paper.

    proc sql <noprint>;
       create table table-name as
       select <*> <distinct> <var <as newvar>> <, var <as newvar>>
       into :mac
       from table-name
       where expression
       group by var <, var>
       order by var <, var>
       ;
       %LET N = &SQLOBS.;
    quit;

REFERENCES

SAS Institute Inc., 2009. SAS 9.2 SQL Procedure User's Guide. Cary, NC: SAS Institute Inc.

RECOMMENDED READING

Gusinow, Rosalind, "Introduction to PROC SQL," Proceedings of the 25th Annual SAS Users Group International Conference.

DeFoor, Jimmy, "Proc SQL: A Primer for SAS Programmers," Proceedings of the 31st Annual SAS Users Group International Conference.

CONTACT INFORMATION

Your comments and questions are valued and encouraged. Contact the author at:

    Shane Rosanbalm
    Rho, Inc
    6330 Quadrangle Dr., Ste. 500
    Chapel Hill, NC 27517
    919.595.6273
    E-mail: shane_rosanbalm@rhoworld.com
    Web: www.rhoworld.com

SAS and all other SAS Institute Inc. product or service names are registered trademarks or trademarks of SAS Institute Inc. in the USA and other countries. ® indicates USA registration. Other brand and product names are trademarks of their respective companies.