A Hands-on Tour Inside the World of PROC SQL Kirk Paul Lafler, Software Intelligence Corporation, Spring Valley, California

Abstract

Structured Query Language (PROC SQL) is a database language found in the Base SAS software. It enables access to data stored in SAS data sets or tables using a powerful assortment of statements, clauses, options, functions, and other language features. This hands-on workshop presents core concepts of this powerful language as well as its many applications, and is intended for SAS users who desire an overview of the capabilities of this exciting procedure. Attendees will explore the construction of SQL queries, ordering and grouping data, the application of case logic for data reclassification, the creation and use of views, and the construction of simple inner and outer joins.

Introduction

The SQL procedure is a wonderful tool for querying and subsetting data; restructuring or regrouping data by specifying case expressions; constructing and using virtual tables known as views; and joining two or more tables (up to 32) to explore data relationships. Occasionally, a problem comes along where the SQL procedure is either better suited or easier to use than other more conventional DATA and/or PROC step methods. As each situation presents itself, PROC SQL should be examined to see if its use is warranted for the task at hand. Many SAS users prefer the power and coding simplicity of the SQL procedure in performing data analysis and routine everyday tasks rather than using other DATA or PROC step approaches and methods. Whatever the nature of the requirements or SQL usage, this paper is intended to present an overview of some of the most exciting features found in PROC SQL.

Example Tables

The examples used throughout this paper utilize a database of tables. A relational database is a collection of tables. Each table contains one or more columns and one or more rows of data.
The example database consists of three tables: MOVIES, ACTORS, and RENTAL_INFO. Each table appears below.

MOVIES

TITLE                        LENGTH  CATEGORY              YEAR  STUDIO              RATING
Brave Heart                  177     Action Adventure      1995  Paramount Pictures  R
Casablanca                   103     Drama                 1942  MGM / UA            PG
Christmas Vacation           97      Comedy                1989  Warner Brothers     PG-13
Coming to America            116     Comedy                1988  Paramount Pictures  R
Dracula                      130     Horror                1993  Columbia TriStar    R
Dressed to Kill              105     Drama Mysteries       1980  Filmways Pictures   R
Forrest Gump                 142     Drama                 1994  Paramount Pictures  PG-13
Ghost                        127     Drama Romance         1990  Paramount Pictures  PG-13
Jaws                         125     Action Adventure      1975  Universal Studios   PG
Jurassic Park                127     Action                1993  Universal Studios   PG-13
Lethal Weapon                110     Action Cops & Robber  1987  Warner Brothers     R
Michael                      106     Drama                 1997  Warner Brothers     PG-13
National Lampoon's Vacation  98      Comedy                1983  Warner Brothers     PG-13
Poltergeist                  115     Horror                1982  MGM / UA            PG
Rocky                        120     Action Adventure      1976  MGM / UA            PG
Scarface                     170     Action Cops & Robber  1983  Universal Studios   R
Silence of the Lambs         118     Drama Suspense        1991  Orion               R
Star Wars                    124     Action Sci-Fi         1977  Lucas Film Ltd      PG
The Hunt for Red October     135     Action Adventure      1989  Paramount Pictures  PG
The Terminator               108     Action Sci-Fi         1984  Live Entertainment  R
The Wizard of Oz             101     Adventure             1939  MGM / UA            G
Titanic                      194     Drama Romance         1997  Paramount Pictures  PG-13
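For readers who want to run the queries in this paper, the MOVIES table could be built along the following lines. This is only a sketch: the column types and lengths, and the choice to create the table in the WORK library, are assumptions rather than details given in the paper, and only four of the rows are shown.

```sas
/* Sketch only: column lengths and the WORK library are assumptions. */
proc sql;
   create table movies
      (title    char(30),
       length   num,
       category char(20),
       year     num,
       studio   char(25),
       rating   char(5));

   /* A few sample rows; the remaining rows follow the same pattern. */
   insert into movies
      values ('Brave Heart',        177, 'Action Adventure', 1995, 'Paramount Pictures', 'R')
      values ('Casablanca',         103, 'Drama',            1942, 'MGM / UA',           'PG')
      values ('Christmas Vacation',  97, 'Comedy',           1989, 'Warner Brothers',    'PG-13')
      values ('The Wizard of Oz',   101, 'Adventure',        1939, 'MGM / UA',           'G');
quit;
```

The other example tables can be created the same way, with one VALUES clause per row.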

ACTORS

TITLE                        ACTOR_LEADING          ACTOR_SUPPORTING
Brave Heart                  Mel Gibson             Sophie Marceau
Christmas Vacation           Chevy Chase            Beverly D'Angelo
Coming to America            Eddie Murphy           Arsenio Hall
Forrest Gump                 Tom Hanks              Sally Field
Ghost                        Patrick Swayze         Demi Moore
Lethal Weapon                Mel Gibson             Danny Glover
Michael                      John Travolta          Andie MacDowell
National Lampoon's Vacation  Chevy Chase            Beverly D'Angelo
Rocky                        Sylvester Stallone     Talia Shire
Silence of the Lambs         Anthony Hopkins        Jodie Foster
The Hunt for Red October     Sean Connery           Alec Baldwin
The Terminator               Arnold Schwarzenegger  Michael Biehn
Titanic                      Leonardo DiCaprio      Kate Winslet

RENTAL_INFO

CUST_NO  RENTAL_DATE  TITLE
5        11/29/2006   Christmas Vacation
10       05/18/2007   Star Wars
3        11/28/2006   Christmas Vacation
10       04/04/2007   Coming to America
3        06/20/2007   Forrest Gump
5        10/15/2006   Ghost
10       05/07/2006   Jurassic Park
3        06/20/2007   National Lampoon's Vacation
10       07/10/2007   Star Wars
5        08/01/2007   The Wizard of Oz
3        09/15/2006   The Wizard of Oz

Constructing SQL Queries to Retrieve and Subset Data

PROC SQL provides simple, but powerful, retrieval and subsetting capabilities including the retrieval and display of all rows and columns in a table, removal of rows with duplicate values, using wildcard characters to search for partially known information, and integrating ODS to create enhanced and more esthetically pleasing output.

Retrieving and Displaying All Columns in a Table with a Wildcard Character (*)

SQL can retrieve and display all columns in a table by specifying the wildcard character * in a SELECT statement. The wildcard character * instructs the SQL processor to retrieve and display each column in the order in which it was defined in the underlying table. The following example illustrates the use of the wildcard character to display all columns and rows in the underlying MOVIES table.
   PROC SQL;
     SELECT *
       FROM MOVIES;

SQL Output

Title                        Length  Category              Year  Studio              Rating
Brave Heart                  177     Action Adventure      1995  Paramount Pictures  R
Casablanca                   103     Drama                 1942  MGM / UA            PG
Christmas Vacation           97      Comedy                1989  Warner Brothers     PG-13
Coming to America            116     Comedy                1988  Paramount Pictures  R
Dracula                      130     Horror                1993  Columbia TriStar    R
Dressed to Kill              105     Drama Mysteries       1980  Filmways Pictures   R
Forrest Gump                 142     Drama                 1994  Paramount Pictures  PG-13
Ghost                        127     Drama Romance         1990  Paramount Pictures  PG-13
Jaws                         125     Action Adventure      1975  Universal Studios   PG
Jurassic Park                127     Action                1993  Universal Pictures  PG-13
Lethal Weapon                110     Action Cops & Robber  1987  Warner Brothers     R
Michael                      106     Drama                 1997  Warner Brothers     PG-13
National Lampoon's Vacation  98      Comedy                1983  Warner Brothers     PG-13
Poltergeist                  115     Horror                1982  MGM / UA            PG
Rocky                        120     Action Adventure      1976  MGM / UA            PG
Scarface                     170     Action Cops & Robber  1983  Universal Studios   R
Silence of the Lambs         118     Drama Suspense        1991  Orion               R
Star Wars                    124     Action Sci-Fi         1977  Lucas Film Ltd      PG
The Hunt for Red October     135     Action Adventure      1989  Paramount Pictures  PG
The Terminator               108     Action Sci-Fi         1984  Live Entertainment  R
The Wizard of Oz             101     Adventure             1939  MGM / UA            G
Titanic                      194     Drama Romance         1997  Paramount Pictures  PG-13

Retrieving and Displaying Specific Columns in a Table

SQL can retrieve and display specific columns from a table when they are coded in a SELECT statement. As illustrated in the next example, the SELECT statement displays the movie titles and ratings from the MOVIES table.

   PROC SQL;
     SELECT title, rating
       FROM MOVIES;

SQL Output

Title                        Rating
Brave Heart                  R
Casablanca                   PG
Christmas Vacation           PG-13
Coming to America            R
Dracula                      R
Dressed to Kill              R
Forrest Gump                 PG-13
Ghost                        PG-13
Jaws                         PG
Jurassic Park                PG-13
Lethal Weapon                R
Michael                      PG-13
National Lampoon's Vacation  PG-13
Poltergeist                  PG
Rocky                        PG
Scarface                     R
Silence of the Lambs         R
Star Wars                    PG
The Hunt for Red October     PG
The Terminator               R
The Wizard of Oz             G
Titanic                      PG-13

Ordering the Results of a Query with an ORDER BY Clause

The results of an SQL query can be ordered with an ORDER BY clause. Query results can be ordered in ascending (default) order, or in descending order when the DESC keyword is specified following the column name. The following example selects movie titles and ratings from the MOVIES table and displays the results in ascending order of rating.

   PROC SQL;
     SELECT title, rating
       FROM MOVIES
         ORDER BY rating;

Output from the ORDER BY Clause

Title                        Rating
The Wizard of Oz             G
The Hunt for Red October     PG
Star Wars                    PG
Poltergeist                  PG
Jaws                         PG
Rocky                        PG
Casablanca                   PG
Forrest Gump                 PG-13
Christmas Vacation           PG-13
Michael                      PG-13
National Lampoon's Vacation  PG-13
Jurassic Park                PG-13
Titanic                      PG-13
Ghost                        PG-13
Dressed to Kill              R
Lethal Weapon                R
Dracula                      R
The Terminator               R
Brave Heart                  R
Coming to America            R
Silence of the Lambs         R
Scarface                     R

Removing Duplicate Rows with the DISTINCT Keyword

When the same value or row appears multiple times in a table, SQL can be instructed to remove the duplicates by specifying the DISTINCT keyword in a SELECT statement. As illustrated in the following example, the DISTINCT keyword is specified for the RATING column to remove all duplicate rating values from the query results.

   PROC SQL;
     SELECT DISTINCT rating
       FROM MOVIES;

The resulting output from specifying the DISTINCT keyword shows the four distinct rating values and appears below.

Output from the DISTINCT Keyword

Rating
G
PG
PG-13
R

Using Wildcard Characters with the LIKE Operator for Searching

When it is necessary to search for specific rows of data, but only part of the data you are searching for is known, SQL provides the ability to use wildcard characters as part of the search argument. Say you wanted to search for all movies that were classified as an "Action" type of movie. By specifying a query using the wildcard character percent sign (%) in a WHERE clause with the LIKE operator, the query results consist of all rows containing the word "ACTION", as illustrated in the following query.

   PROC SQL;
     SELECT title, category
       FROM MOVIES
         WHERE UPCASE(category) LIKE '%ACTION%';

The resulting output from specifying the wildcard character % with the LIKE operator appears below.

Output from the Wildcard Search

Title                        Category
Brave Heart                  Action Adventure
Jaws                         Action Adventure
Jurassic Park                Action
Lethal Weapon                Action Cops & Robber
Rocky                        Action Adventure
Scarface                     Action Cops & Robber
Star Wars                    Action Sci-Fi
The Hunt for Red October     Action Adventure
The Terminator               Action Sci-Fi

Phonetic Matching (Sounds-Like Operator =*)

A technique for finding names that sound alike or have spelling variations is available in the SQL procedure. This frequently used technique is referred to as phonetic matching and is performed using the Soundex algorithm. In Joe Celko's book, SQL for Smarties: Advanced SQL Programming, he traced the origins of the Soundex algorithm to the developers Margaret O'Dell and Robert C. Russell in 1918. Although not technically a function, the sounds-like operator searches and selects character data based on two expressions: the search value and the matched value. Anyone who has looked for a last name in a local telephone directory is quickly reminded of the possible phonetic variations. To illustrate how the sounds-like operator works, let's search each movie category for the phonetic variation of "Dramma Suspence" which, by the way, is spelled incorrectly. To help find as many (and hopefully all) possible spelling variations, the sounds-like operator is used to select similar sounding values, including spelling variations.
The following example selects all movies where the movie category sounds like "Dramma Suspence".

   PROC SQL;
     SELECT title, category, rating
       FROM MOVIES
         WHERE category =* 'Dramma Suspence';

The resulting output from specifying the sounds-like operator (=*) appears below.

Output from the Sounds-Like Operator

Title                        Category        Rating
Silence of the Lambs         Drama Suspense  R

Case Logic

In the SQL procedure, a case expression provides a way of conditionally selecting result values from each row in a table (or view). Similar to an IF-THEN construct, a case expression uses a WHEN-THEN clause to conditionally process some but not all the rows in a table. An optional ELSE expression can be specified to handle an alternative action should none of the expression(s) identified in the WHEN condition(s) be satisfied. A case expression must be a valid SQL expression and conform to syntax rules similar to DATA step SELECT-WHEN statements. Even though this topic is best explained by example, let's take a quick look at the syntax.

   CASE <column-name>
     WHEN when-condition THEN result-expression
     <WHEN when-condition THEN result-expression>
     <ELSE result-expression>
   END

A column-name can optionally be specified as part of the CASE expression. If present, it is automatically made available to each when-condition. When it is not specified, the column-name must be coded in each when-condition.

Let's examine how a case expression works. If a when-condition is satisfied by a row in a table (or view), then it is considered true and the result-expression following the THEN keyword is processed. The remaining WHEN conditions in the CASE expression are skipped. If a when-condition is false, the next when-condition is evaluated. SQL evaluates each when-condition until a true condition is found; in the event all when-conditions are false, it executes the ELSE expression and assigns its value to the CASE expression's result. A missing value is assigned to a CASE expression when an ELSE expression is not specified and each when-condition is false.

In the next example, let's see how a case expression actually works. Suppose a value of "Exciting", "Fun", "Scary", or " " (blank) is desired for each of the movies.
Using the movie's category (CATEGORY) column, a CASE expression is constructed to assign one of the desired values in a unique column called Type for each row of data. A value of "Exciting" is assigned to all Adventure movies, "Fun" for Comedies, "Scary" for Suspense movies, and blank for all other movies. Because each when-condition tests for equality, only rows whose category is exactly "Adventure", "Comedy", or "Suspense" receive a value; a category such as "Action Adventure" or "Drama Suspense" falls through to the ELSE expression. A column heading of Type is assigned to the new derived output column using the AS keyword.

   PROC SQL;
     SELECT TITLE,
            RATING,
            CASE
              WHEN CATEGORY = 'Adventure' THEN 'Exciting'
              WHEN CATEGORY = 'Comedy'    THEN 'Fun'
              WHEN CATEGORY = 'Suspense'  THEN 'Scary'
              ELSE ' '
            END AS Type
       FROM MOVIES;

Output from CASE Expression

Title                        Rating  Type
Brave Heart                  R
Casablanca                   PG
Christmas Vacation           PG-13   Fun
Coming to America            R       Fun
Dracula                      R
Dressed to Kill              R
Forrest Gump                 PG-13
Ghost                        PG-13
Jaws                         PG
Jurassic Park                PG-13
Lethal Weapon                R
Michael                      PG-13
National Lampoon's Vacation  PG-13   Fun
Poltergeist                  PG
Rocky                        PG
Scarface                     R
Silence of the Lambs         R
Star Wars                    PG
The Hunt for Red October     PG
The Terminator               R
The Wizard of Oz             G       Exciting
Titanic                      PG-13

In another example, suppose we wanted to determine the audience level (e.g., general or adult audiences) for each movie. By using the RATING column we can assign a descriptive value to isolate adult audience levels from other types with a simple CASE expression, as follows.

   PROC SQL;
     SELECT TITLE,
            RATING,
            CASE RATING
              WHEN 'R' THEN 'Adult Audience'
              ELSE 'Other'
            END AS Audience_Level
       FROM MOVIES;

The resulting output depicting the audience level assigned to each movie appears below.

Output Depicting the Audience Level

Title                        Rating  Audience_Level
Brave Heart                  R       Adult Audience
Casablanca                   PG      Other
Christmas Vacation           PG-13   Other
Coming to America            R       Adult Audience
Dracula                      R       Adult Audience
Dressed to Kill              R       Adult Audience
Forrest Gump                 PG-13   Other
Ghost                        PG-13   Other
Jaws                         PG      Other
Jurassic Park                PG-13   Other
Lethal Weapon                R       Adult Audience
Michael                      PG-13   Other
National Lampoon's Vacation  PG-13   Other
Poltergeist                  PG      Other
Rocky                        PG      Other
Scarface                     R       Adult Audience
Silence of the Lambs         R       Adult Audience
Star Wars                    PG      Other
The Hunt for Red October     PG      Other
The Terminator               R       Adult Audience
The Wizard of Oz             G       Other
Titanic                      PG-13   Other

Creating and Using Views

Views are classified as virtual tables. There are many reasons for constructing and using views. A few of the more common reasons are presented below.

Minimizing, or perhaps eliminating, the need to know the table or tables' underlying structure

Often a great degree of knowledge is required to correctly identify and construct the particular table interactions that are necessary to satisfy a requirement. When this prerequisite knowledge is not present, a view becomes a very attractive alternative. Once a view is constructed, users can simply execute it. This results in improved data integrity and control, since the user of the view does not need to possess as much knowledge about the data structure of the underlying table(s) being used.

Reducing the amount of typing for longer requests

Often, a query will involve many lines of instruction combined with logical and comparison operators. When this occurs, there are any number of places where a typographical error or inadvertent use of a comparison operator may present an incorrect picture of your data. The construction of a view is advantageous in these circumstances, since a simple call to a view virtually eliminates the problems resulting from a lot of typing.

Hiding SQL language syntax and processing complexities from users

When users are unfamiliar with the SQL language, the construction techniques of views, or processing complexities related to table operations, they only need to execute the desired view using simple calls. This simplifies the process and enables users to perform simple to complex operations with custom-built views.
Providing security to sensitive parts of a table

Security measures can be realized by designing and constructing views that designate which pieces of a table's information are available for viewing. Since data should always be protected from unauthorized use, views can provide some level of protection (also consider and use security measures at the operating system level).

Controlling change / customization independence

Occasionally, table and/or process changes may be necessary. When this happens, it is advantageous to make it as painless for users as possible. When properly designed and constructed, a view can absorb changes to the underlying data without the slightest hint or impact to users, with the one exception that results and/or output may appear differently. Consequently, views can be made to maintain a greater level of change independence.

Types of Views

Views can be typed or categorized according to their purpose and construction method. Joe Celko, author of SQL for Smarties and a number of other SQL books, describes views this way, "Views can be classified by the type of SELECT statement they use and the purpose they are meant to serve." To classify views in the SAS System environment, one must also look at how the SELECT statement is constructed. The following classifications are useful when describing a view's capabilities.

Single-Table Views

Views constructed from a single table are often used to control or limit what is accessible from that table. These views generally limit which columns, rows, or both are viewed.

Ordered Views

Views constructed with an ORDER BY clause arrange one or more rows of data in some desired way.

Grouped Views

Views constructed with a GROUP BY clause divide a table into sets for conducting data analysis. Grouped views are more often than not used in conjunction with aggregate functions (see aggregate views below).

Distinct Views

Views constructed with the DISTINCT keyword tell the SAS System how to handle duplicate rows in a table.

Aggregate Views

Views constructed using aggregate and statistical functions tell the SAS System what rows in a table you want summary values for.

Joined-Table Views

Views constructed from a join on two or more tables use a connecting column to match or compare values. Consequently, data can be retrieved and manipulated to assist in data analysis.

Nested Views

Views can be constructed from other views, although extreme care should be taken to build views from base tables.

Creating Views

When creating a view, its name must be unique and follow SAS naming conventions. Also, a view cannot reference itself since it does not exist yet. The next example illustrates the process of creating an SQL view. In the first step, no output is produced since the view must first be created. Once the view has been created, the second step executes the view, G_MOVIES, by rendering the view's instructions to produce the desired output results.

   PROC SQL;
     CREATE VIEW G_MOVIES AS
       SELECT title, category, rating
         FROM MOVIES
           WHERE rating = 'G';

     SELECT *
       FROM G_MOVIES
         ORDER BY title;

SAS Log

   1    PROC SQL;
   2    CREATE VIEW G_MOVIES AS
   3    SELECT title, category, rating
   4    FROM mydata.movies
   5    WHERE rating = 'G';
   NOTE: SQL view WORK.G_MOVIES has been defined.
   6
   7    SELECT *
   8    FROM G_MOVIES
   9    ORDER BY title;
   10
   NOTE: PROCEDURE SQL used (Total process time):
         real time           0.00 seconds
         cpu time            0.01 seconds

SAS Output

Title                        Category   Rating
The Wizard of Oz             Adventure  G

Exploring Indexes

Indexes can be used to improve the access speed to desired table rows. Rather than physically sorting a table (as performed by the ORDER BY clause or the BY statement in PROC SORT), an index is designed to set up a logical data arrangement without the need to physically sort it. This has the advantage of reducing CPU and memory requirements. It also reduces data access time when using WHERE clause processing.

Understanding Indexes

What exactly is an index? An index consists of one or more columns in a table to uniquely identify each row of data within the table. Operating as a SAS object containing the values in one or more columns in a table, an index is comprised of one or more columns and may be defined as numeric, character, or a combination of both. Although there is no rule that says a table must have an index, when present, indexes are most frequently used to make information retrieval using a WHERE clause more efficient.

To help determine when an index is necessary, it is important to look at existing data as well as the way the base table(s) will be used. It is also critical to know what queries will be used and how they will access columns of data. There are times when the column(s) making up an index are obvious and other times when they are not. When determining whether an index provides any processing value, some very important rules should be kept in mind. An index should permit the greatest flexibility so every column in a table can be accessed and displayed. Indexes should also be assigned to discriminating column(s) only, since query results will benefit most when this is the case.

When an index is specified on one or more tables, a join process may actually be boosted. The PROC SQL processor may use an index when certain conditions permit its use.
Here are a few things to keep in mind before creating an index:

- If the table is small, sequential processing may be just as fast as, or faster than, processing with an index
- If the page count as displayed in the CONTENTS procedure is less than 3 pages, avoid creating an index
- Do not create more indexes than you absolutely need
- If the data subset for the index is not small, sequential access may be more efficient than using the index
- If the percentage of matches is approximately 15% or less, then an index should be used
- The costs associated with an index can outweigh its performance value, because an index is updated each time the rows in a table are added, deleted, or modified.

Sample code appears below illustrating the syntax to create simple and composite indexes with the CREATE INDEX statement in the SQL procedure.

Creating a Simple Index

A simple index is specifically defined for a single column in a table and must have the same name as the column. Suppose you had to create an index consisting of movie rating (RATING) in the MOVIES table. Once created, the index becomes a separate object located in the SAS library.

   PROC SQL;
     CREATE INDEX RATING ON MOVIES(RATING);

SAS Log Results

   PROC SQL;
     CREATE INDEX RATING ON MOVIES(RATING);
   NOTE: Simple index RATING has been defined.

Creating a Composite Index

A composite index is defined for two or more columns in a table and must have a unique name that is different from the columns assigned to the index. Suppose you were to create a composite index consisting of the movie category (CATEGORY) and movie rating (RATING) from the MOVIES table. Once the composite index is created, the index consisting of the two table columns becomes a separate object located in the SAS library.

   PROC SQL;
     CREATE INDEX CAT_RATING ON MOVIES(CATEGORY, RATING);

SAS Log Results

   PROC SQL;
     CREATE INDEX CAT_RATING ON MOVIES(CATEGORY, RATING);
   NOTE: Composite index CAT_RATING has been defined.

Index Limitations

Indexes can be very useful, but they do have limitations. As data in a table is inserted, modified, or deleted, an index must also be updated to address any and all changes. This automatic feature requires additional CPU resources to process any changes to a table. Also, as a separate structure in its own right, an index can consume considerable storage space. As a consequence, care should be exercised not to create too many indexes but to assign indexes to only those discriminating variables in a table.

Because of the aforementioned drawbacks, indexes should only be created on tables where query search time needs to be optimized. Any unnecessary indexes may force the SAS System to expend resources needlessly updating and reorganizing after insert, delete, and update operations are performed. And even worse, the PROC SQL optimizer may accidentally use an index when it should not.

Optimizing WHERE Clause Processing with Indexes

A WHERE clause defines the logical conditions that control which rows a SELECT statement will select, a DELETE statement will delete, or an UPDATE statement will update. This powerful, but optional, clause permits SAS users to test and evaluate conditions as true or false.
From a programming perspective, the evaluation of a condition determines which of the alternate paths a program will follow. Conditional logic in PROC SQL is frequently implemented in a WHERE clause to reference constants and relationships among columns and values.

To get the best possible performance from programs containing SQL procedure code, an index and WHERE clause can be used together. Using a WHERE clause restricts processing in a table to a subset of selected rows. When an index exists, the SQL processor determines whether to take advantage of it during WHERE clause processing. Although the SQL processor determines whether using an index will ultimately benefit performance, when it does the result can be an improvement in processing speeds.
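One way to see whether the SQL processor chose an index is to raise the log message level before running the query. The sketch below assumes the simple index RATING on MOVIES has already been created, as in the earlier CREATE INDEX example; whether the index is actually selected depends on the data and on the optimizer, so the log may or may not report index usage.

```sas
/* Ask SAS to write INFO messages, including index usage, to the log. */
options msglevel=i;

proc sql;
   /* Assumes the simple index RATING exists on MOVIES. */
   select title, rating
      from movies
      where rating = 'G';
quit;

/* Restore the default message level. */
options msglevel=n;
```

When an index is used, the log generally contains an INFO line naming the index selected for WHERE clause optimization; when it is not, the query simply runs with sequential access.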

PROC SQL Joins

A join of two or more tables provides a means of gathering and manipulating data in a single SELECT statement. A "JOIN" statement does not exist in the SQL language. Two or more tables are joined by listing the table names in the FROM clause of a SELECT statement and specifying the connecting condition in a WHERE clause. A comma separates each table specified in an inner join. Joins are specified on a minimum of two tables at a time, where a column from each table is used for the purpose of connecting the two tables. Connecting columns should have "like" values and the same datatype attributes since the join's success is dependent on these values.

What Happens During a Join?

Joins are processed in two distinct phases. The first phase determines whether a FROM clause references two or more tables. When it does, a virtual table, known as the Cartesian product, is created, resulting in each row in the first table being combined with each row in the second table, and so forth. The Cartesian product (internal virtual table) can be extremely large since it represents all possible combinations of rows and columns in the joined tables.

The second phase of every join process, if present, is specified in the WHERE clause. When a WHERE clause is specified, the number of rows in the Cartesian product is reduced to satisfy this expression. This data subsetting process generally results in a more manageable end product containing meaningful data.

Creating a Cartesian Product

When a WHERE clause is omitted, all possible combinations of rows from each table are produced. This form of join is known as the Cartesian Product. Say, for example, you join two tables with the first table consisting of 10 rows and the second table of 5 rows. The result of joining these two tables would consist of 50 rows. Very rarely is there a need to perform a join operation in SQL where a WHERE clause is not specified. The primary importance of this form of join is to illustrate a base for all joins.
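The row-count arithmetic can be checked against the example tables. The following sketch assumes the MOVIES and ACTORS tables are available under those names in the current library; the column aliases are hypothetical names chosen for illustration. It counts the rows in each table and then the rows in their unrestricted join. With the 22-row MOVIES table and 13-row ACTORS table listed earlier, the third query should return 22 x 13 = 286.

```sas
proc sql;
   /* Rows in each table on its own. */
   select count(*) as movies_rows  from movies;
   select count(*) as actors_rows  from actors;

   /* Rows in the Cartesian product: no WHERE clause restricts the join. */
   select count(*) as product_rows from movies, actors;
quit;
```

SAS typically also writes a note to the log stating that the query involves a Cartesian product join, which is a useful warning sign when a WHERE clause has been left out by mistake.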
As illustrated in the following diagram, the two tables are combined without a corresponding WHERE clause. Consequently, no connection between common columns exists.

    MOVIES        ACTORS
    TITLE         TITLE
    LENGTH        ACTOR_LEADING
    CATEGORY      ACTOR_SUPPORTING
    YEAR
    STUDIO
    RATING

The result of a Cartesian product is the combination of all rows and columns, and customarily represents a large table. The next example illustrates a Cartesian product join using a SELECT query without a WHERE clause.

    PROC SQL;
      SELECT *
        FROM MOVIES, ACTORS;
    QUIT;

Joining Two Tables with a Where Clause

Joining two tables together is a relatively easy process in SQL. To illustrate how a join works, a two-table join is linked using the movie title (TITLE) in the following diagram.

    MOVIES             ACTORS
    TITLE  <------->   TITLE
    LENGTH             ACTOR_LEADING
    CATEGORY           ACTOR_SUPPORTING
    YEAR
    STUDIO
    RATING

The next SQL join query references a join on two tables with TITLE specified as the connecting column. In this example, tables MOVIES and ACTORS are used. Each table has a common column, TITLE, which is used to connect rows from each table when the values of TITLE are equal, as specified in the WHERE clause. A WHERE clause restricts which rows of data will be included in the resulting join.

    PROC SQL;
      SELECT MOVIES.TITLE, MOVIES.RATING, ACTORS.ACTOR_LEADING
        FROM MOVIES, ACTORS
          WHERE MOVIES.TITLE = ACTORS.TITLE;
    QUIT;

The resulting output from the two-way join appears below.

Output from the Two-way Equi-Join

    Title                         Rating  Actor_Leading
    Brave Heart                           Mel Gibson
    Christmas Vacation            PG-13   Chevy Chase
    Coming to America                     Eddie Murphy
    Forrest Gump                  PG-13   Tom Hanks
    Ghost                         PG-13   Patrick Swayze
    Lethal Weapon                         Mel Gibson
    Michael                       PG-13   John Travolta
    National Lampoon's Vacation   PG-13   Chevy Chase
    Rocky                                 Sylvester Stallone
    Silence of the Lambs                  Anthony Hopkins
    The Hunt for Red October              Sean Connery
    The Terminator                        Arnold Schwarzenegge
    Titanic                       PG-13   Leonardo DiCaprio

Joins and Table Aliases

Table aliases provide a "short-cut" way to reference one or more tables in a join operation. One or more aliases are specified so columns can be selected with a minimal number of keystrokes. To illustrate how table aliases work in a join, a two-table join is linked in the following diagram.

    MOVIES M           ACTORS A
    TITLE  <------->   TITLE
    LENGTH             ACTOR_LEADING
    CATEGORY           ACTOR_SUPPORTING
    YEAR
    STUDIO
    RATING

The following SQL code illustrates a join on two tables with TITLE specified as the connecting column. The table aliases are assigned in the FROM clause and are used to qualify column names in the SELECT list and in the WHERE clause.

    PROC SQL;
      SELECT M.TITLE, M.CATEGORY, M.RATING, A.ACTOR_LEADING
        FROM MOVIES M, ACTORS A
          WHERE M.TITLE = A.TITLE;
    QUIT;

The resulting output from the two-way join appears below.

Output from the Two-way Join with Table Alias

    Title                         Category              Rating  Actor_Leading
    Brave Heart                   Action Adventure              Mel Gibson
    Christmas Vacation            Comedy                PG-13   Chevy Chase
    Coming to America             Comedy                        Eddie Murphy
    Forrest Gump                  Drama                 PG-13   Tom Hanks
    Ghost                         Drama Romance         PG-13   Patrick Swayze
    Lethal Weapon                 Action Cops & Robber          Mel Gibson
    Michael                       Drama                 PG-13   John Travolta
    National Lampoon's Vacation   Comedy                PG-13   Chevy Chase
    Rocky                         Action Adventure              Sylvester Stallone
    Silence of the Lambs          Drama Suspense                Anthony Hopkins
    The Hunt for Red October      Action Adventure              Sean Connery
    The Terminator                Action Sci-Fi                 Arnold Schwarzenegge
    Titanic                       Drama Romance         PG-13   Leonardo DiCaprio

Joining Three Tables

In an earlier example, you saw where customer information was combined with the movies they rented. You may also want to display the leading actor of each movie along with the other information. To do this, you will need to extract information from three different tables: MOVIES, ACTORS, and RENTAL_INFO. A join with three tables follows the same rules as a two-table join. Each table needs to be listed in the FROM clause, with appropriate restrictions specified in the WHERE clause. To illustrate how a three-table join works, the following diagram should be visualized.

    MOVIES        ACTORS              RENTAL_INFO
    TITLE         TITLE               TITLE
    LENGTH        ACTOR_LEADING       CUST_NO
    CATEGORY      ACTOR_SUPPORTING    RENTAL_DATE
    RATING

The next PROC SQL example references a join using three tables with TITLE specified as the connecting column for the MOVIES, ACTORS, and RENTAL_INFO tables.

    PROC SQL;
      SELECT M.TITLE, M.CATEGORY, M.RATING, A.ACTOR_LEADING, R.RENTAL_DATE
        FROM MOVIES M, ACTORS A, RENTAL_INFO R
          WHERE M.TITLE = A.TITLE AND
                M.TITLE = R.TITLE;
    QUIT;

Output from Joining Three Tables

    Title                         Category              Rating  Actor_Leading         Rental_date
    Christmas Vacation            Comedy                PG-13   Chevy Chase           11/29/2006
    Christmas Vacation            Comedy                PG-13   Chevy Chase           11/28/2006
    Coming to America             Comedy                        Eddie Murphy          04/04/2007
    Forrest Gump                  Drama                 PG-13   Tom Hanks             06/20/2007
    Ghost                         Drama Romance         PG-13   Patrick Swayze        10/15/2006
    National Lampoon's Vacation   Comedy                PG-13   Chevy Chase           06/20/2007

Introduction to Outer Joins

Generally a join is a process of relating rows in one table with rows in another. But occasionally, you may want to include rows from one or both tables that have no related rows. This concept is referred to as row preservation and is a significant feature offered by the outer join construct.

There are operational and syntax differences between inner (natural) and outer joins. First, the maximum number of tables that can be specified in an outer join is two (the maximum number of tables that can be specified in an inner join is 32). Like an inner join, an outer join relates rows in both tables. But this is where the similarities end, because the result table also includes rows with no related rows from one or both of the tables. This special handling of matched and unmatched rows of data is what differentiates an outer join from an inner join.

An outer join can accomplish a variety of tasks that would require a great deal of effort using other methods. This is not to say that a process similar to an outer join cannot be programmed; it would probably just require more work. Let's take a look at a few tasks that are possible with outer joins:

    - List all customer accounts with rentals during the month, including customer accounts with no purchase activity.
    - Compute the number of rentals placed by each customer, including customers who have not rented.
    - Identify movie renters who rented a movie last month, and those who did not.

Another obvious difference between an outer and inner join is the way the syntax is constructed. Outer joins use keywords such as LEFT JOIN, RIGHT JOIN, and FULL JOIN, and have the WHERE clause replaced with an ON clause. These distinctions help distinguish outer joins from inner joins.

Finally, specifying a left or right outer join is a matter of choice. Simply put, the only difference between a left and right join is the order of the tables they use to relate rows of data. As such, the two types of outer joins can be used interchangeably, and the choice between them is one of convenience.
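The task of counting rentals for each customer, including customers who have not rented, can be sketched with a left outer join, and the same query rewritten as a right outer join shows why the two forms are interchangeable. This is an illustration under stated assumptions: the WORK.CUSTOMERS table, its NAME column, and all sample rows are hypothetical; only the TITLE and CUST_NO columns of RENTAL_INFO come from the diagrams in this paper.

```sas
/* Hypothetical sample data (table contents are assumptions) */
proc sql;
  create table work.customers (cust_no num, name char(20));
  insert into work.customers
    values (1, 'Customer A')
    values (2, 'Customer B')
    values (3, 'Customer C');

  create table work.rental_info (title char(30), cust_no num);
  insert into work.rental_info
    values ('Ghost',              1)
    values ('Forrest Gump',       1)
    values ('Christmas Vacation', 3);

  /* LEFT JOIN preserves every customer. COUNT(R.TITLE) counts only
     nonmissing titles, so a customer with no rentals gets 0 */
  select c.cust_no, c.name, count(r.title) as rentals
    from work.customers c
         left join work.rental_info r
         on c.cust_no = r.cust_no
    group by c.cust_no, c.name
    order by c.cust_no;

  /* Same result: the table order is swapped and RIGHT JOIN is used */
  select c.cust_no, c.name, count(r.title) as rentals
    from work.rental_info r
         right join work.customers c
         on c.cust_no = r.cust_no
    group by c.cust_no, c.name
    order by c.cust_no;
quit;
```

Both queries list all three customers, with 2 rentals for customer 1, 0 for customer 2, and 1 for customer 3. An inner join on the same condition would drop customer 2 entirely, which is the row preservation difference described above.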

Exploring Outer Joins

Outer joins process data relationships from two tables differently than inner joins. In this section a different type of join, known as an outer join, will be illustrated. The following code example illustrates a left outer join to identify and match movie titles from the MOVIES and ACTORS tables. The resulting output contains all rows for which the SQL expression, referenced in the ON clause, matches both tables, and all rows from the left table (MOVIES) that did not match any row in the right (ACTORS) table. Essentially, the rows from the left table are preserved and captured exactly as they are stored in the table itself, regardless of whether a match exists.

    PROC SQL;
      SELECT movies.title, actor_leading, rating
        FROM MOVIES
             LEFT JOIN
             ACTORS
          ON movies.title = actors.title;
    QUIT;

Output from LEFT Outer Join

    Title                         Actor_Leading         Rating
    Brave Heart                   Mel Gibson
    Casablanca
    Christmas Vacation            Chevy Chase           PG-13
    Coming to America             Eddie Murphy
    Dracula
    Dressed to Kill
    Forrest Gump                  Tom Hanks             PG-13
    Ghost                         Patrick Swayze        PG-13
    Jaws
    Jurassic Park                                       PG-13
    Lethal Weapon                 Mel Gibson
    Michael                       John Travolta         PG-13
    National Lampoon's Vacation   Chevy Chase           PG-13
    Poltergeist
    Rocky                         Sylvester Stallone
    Scarface
    Silence of the Lambs          Anthony Hopkins
    Star Wars
    The Hunt for Red October      Sean Connery
    The Terminator                Arnold Schwarzenegge
    The Wizard of Oz                                    G
    Titanic                       Leonardo DiCaprio     PG-13

The next example illustrates the result of using a right outer join to identify and match movie titles from the MOVIES and ACTORS tables. The resulting output contains all rows for which the SQL expression, referenced in the ON clause, matches both tables, and all rows from the right table (ACTORS) that do not match any row in the left (MOVIES) table.

    PROC SQL;
      SELECT movies.title, actor_leading, rating
        FROM MOVIES
             RIGHT JOIN
             ACTORS
          ON movies.title = actors.title;
    QUIT;

Output from RIGHT Outer Join

    Title                         Actor_Leading         Rating
    Brave Heart                   Mel Gibson
    Christmas Vacation            Chevy Chase           PG-13
    Coming to America             Eddie Murphy
    Forrest Gump                  Tom Hanks             PG-13
    Ghost                         Patrick Swayze        PG-13
    Lethal Weapon                 Mel Gibson
    Michael                       John Travolta         PG-13
    National Lampoon's Vacation   Chevy Chase           PG-13
    Rocky                         Sylvester Stallone
    Silence of the Lambs          Anthony Hopkins
    The Hunt for Red October      Sean Connery
    The Terminator                Arnold Schwarzenegge
    Titanic                       Leonardo DiCaprio     PG-13

Conclusion

The SQL procedure is a wonderful tool for SAS users to explore and use in a variety of application situations. This paper presented a few of the most exciting features found in PROC SQL. You are encouraged to explore PROC SQL's powerful capabilities as they relate to querying and subsetting data; restructuring data by constructing case expressions; constructing and using views; and joining two or more tables to explore data relationships.

Trademarks

SAS and all other SAS Institute Inc. product or service names are registered trademarks or trademarks of SAS Institute Inc. in the USA and other countries. ® indicates USA registration.

Acknowledgments

I would like to thank Dalia Kahane and John Cohen, NESUG 2007 Section Co-Chairs of Hands-on Workshops, for accepting my abstract and paper, as well as Christianna Williams and Rick Mitchell, NESUG 2007 Conference Co-Chairs, and the NESUG Leadership for their support of a great Conference.

Author Information

Kirk Paul Lafler is consultant and founder of Software Intelligence Corporation and has been programming in SAS since 1979. As a SAS Certified Professional and SAS Institute Alliance Member (1996-2002), Kirk provides IT consulting services and training to SAS users around the world. As the author of four books including PROC SQL: Beyond the Basics Using SAS (SAS Institute, 2004), Kirk has written more than two hundred peer-reviewed papers and articles that have appeared in professional journals and SAS User Group proceedings. He has also been an invited speaker at more than two hundred SAS International, regional, local, and special-interest user group conferences and meetings throughout North America. His popular SAS Tips column, "Kirk's Korner of Quick and Simple Tips", appears regularly in several SAS User Group newsletters and Web sites, and his fun-filled SASword Puzzles is featured in SAScommunity.org.

Comments and suggestions can be sent to:

    Kirk Paul Lafler
    Software Intelligence Corporation
    World Headquarters
    P.O. Box 1390
    Spring Valley, California 91979-1390
    E-mail: KirkLafler@cs.com