An SQL Tutorial: Some Random Tips
  Presented by Jens Dahl Mikkelsen, SAS Institute A/S
  Author: Paul Kent, SAS Institute Inc., Cary, NC
Short Stories
  Towards a Better UNION
  Outer Joins. More than two too.
  Logical Expressions
  Case Free SoRtiNg
  Data about Data
A Better UNION?
A Better UNION?
  Can SQL do this?
    DATA A;
      SET B C;
    RUN;
A Better UNION?
  Can SQL do this? Sure:
    create table A as
      select * from B
      union
      select * from C;
A Better UNION?
  Not quite! SQL set operators have strict mathematical semantics.
  UNION is formed on a column-by-column basis
    Columns are matched by position, not by name
  UNION removes duplicate rows from the result
    This is an expensive operation
A Better UNION?
  SQL can do this...
    create table A as
      select * from B
      union ALL CORRESPONDING
      select * from C;
A Better UNION?
  SQL
    UNION ALL CORRESPONDING avoids the sort
    Query is still interpreted
  DATA step
    Doesn't need a sort to begin with
    Program is compiled into native machine code
  Which means...
A Better UNION?
  DATA step: 1   SQL: 0
  Huh? Isn't this an SQL talk?
<not> A Better UNION?
  Choose your battles wisely. Do not abandon those DATA step skills.
  You might still choose SQL if:
    UNION is part of a larger query
    You expect to port the program to a non-SAS environment
    Performance is not your only metric
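
As a minimal sketch of the name-matching behaviour discussed above, assume two small hypothetical tables B and C that hold the same columns in a different order (the column names and values are invented for illustration). A positional UNION would try to pair a character column with a numeric one, whereas UNION ALL CORRESPONDING lines the columns up by name, as the DATA step does.

```sas
/* Hypothetical sample tables: same columns, different column order */
data b;
   input name $ age;
   datalines;
anna 31
bo 45
;
run;

data c;
   input age name $;
   datalines;
52 cy
31 anna
;
run;

proc sql;
   /* Columns are matched by name; ALL keeps the duplicate "anna 31" row */
   create table a_sql as
      select * from b
      union all corresponding
      select * from c;
quit;

/* DATA step version of the same concatenation */
data a_ds;
   set b c;
run;
```

Note that UNION ALL CORRESPONDING keeps only the columns common to both tables, while SET keeps every column found in either one. When the column lists differ, PROC SQL's OUTER UNION CORR is the closer match to SET.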
Outer Joins More than Two Too
Outer Join vs. inner join
  [Sample tables: Data A and Data B]
Outer Join vs. inner join
  Inner Join
    select *
      from a, b
      where a.key = b.key;
Outer Join vs. inner join
  Left Join
    select *
      from a left join b
        on a.key = b.key;
Outer Join vs. inner join
  Full Join
    select *
      from a full join b
        on a.key = b.key;
Outer Joins
  Most people get their SQL joins *wrong*
  Non-matched records are dropped
    Information is lost from reports
  Duplicate matches seem to double up
    Totals are unpredictable
Outer Joins
  Consider these example datasets:
    PEOPLE
    PAYROLL
    INVEST (investments)
  All linked by a common PERSON
Outer Joins
    select *
      from people as peo, payroll as pay, invest as inv
      where peo.person = pay.person
        and peo.person = inv.person;
Outer Joins
  This is usually *wrong*
  Even if all people are recorded in PEOPLE
    Some may not get paid
    Some may not have investments
  SQL default is to drop records with no match
    The WHERE clause is not true for them
    This combination of rows is not considered interesting
Outer Joins
    select *
      from people as peo
           left join payroll as pay on peo.person = pay.person
           left join invest as inv on peo.person = inv.person;
Outer Joins
  Better...
  This query retains people who are
    not paid
    not invested
  SQL provides missing values for the columns that have no match
Outer Joins
  {LEFT | RIGHT | FULL} JOIN is SQL syntax that supplements the comma (,) used in a FROM clause.
  ON {expression} is used instead of a WHERE clause
    The result set may still contain records for which the ON clause is not TRUE
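
The difference between the comma join and the left joins can be tried out with a minimal sketch. The three tables below are hypothetical sample data (the SALARY and FUND columns and all values are invented for illustration): chris has no payroll record and linda has no investment record.

```sas
/* Hypothetical sample data: chris is unpaid, linda has no investments */
data people;
   input person $;
   datalines;
paul
linda
chris
;
run;

data payroll;
   input person $ salary;
   datalines;
paul 100
linda 200
;
run;

data invest;
   input person $ fund $;
   datalines;
paul bonds
chris stocks
;
run;

proc sql;
   /* Inner join: only paul has a match in all three tables */
   select peo.person, pay.salary, inv.fund
      from people as peo, payroll as pay, invest as inv
      where peo.person = pay.person
        and peo.person = inv.person;

   /* Left joins: all three people are kept;            */
   /* salary or fund is missing where there is no match */
   select peo.person, pay.salary, inv.fund
      from people as peo
           left join payroll as pay on peo.person = pay.person
           left join invest as inv on peo.person = inv.person;
quit;
```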
Have we got it right yet?
  SELECT * has problems
  It includes the join key person three times
    people.person
    payroll.person
    invest.person
  How to choose the correct one?
Outer Joins
    select peo.*,
           pay.var1, pay.var2, ...
           inv.var1, inv.var2, ...
      from ...
Outer Joins
  Are your data perfect?
  This example assumed:
    PEOPLE is a complete set
    No payroll records exist without a corresponding people record
    No investment records exist without a corresponding people record
Outer Joins
  You may know the data are perfect
    RDBMS integrity constraints
    Application controls
    Your boss told you so
  But what if it ain't so?
Outer Joins
    select COALESCE(peo.person, pay.person, inv.person) as person,
           peo.var1, peo.var2, ...
           pay.var1, pay.var2, ...
           inv.var1, inv.var2, ...
      from ...
Outer Joins
  COALESCE returns its first non-missing argument
    It correctly selects the person key even if the corresponding record is not from the people dataset
  Is it correct yet?
    What if we have a payroll record and an investment record for PAUL, but no people record?
Outer Joins
  The FROM clause needs fixing too.
  LEFT JOIN is only appropriate in situations where you are 100% confident in the master-detail relationship.
  FULL JOIN can handle the uncertainty of data coming from either table and not the other
    Full joins are much harder to optimise. Indexes are not useful.
Outer Joins
    from people as peo
         full join payroll as pay on peo.person = pay.person
         full join invest as inv on peo.person = inv.person;
Is it correct yet?
  No
  What about PAUL, who has a payroll record as well as an investment record, but no people record...
Outer Joins
    select COALESCE(peo.person, pay.person, inv.person) as person,
           peo.var1, peo.var2, ...
           pay.var1, pay.var2, ...
           inv.var1, inv.var2, ...
      from people as peo
           full join payroll as pay on peo.person = pay.person
           full join invest as inv on COALESCE(peo.person, pay.person) = inv.person;
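
The PAUL case can be reproduced with a minimal sketch. The tables below are hypothetical sample data (the SALARY and FUND columns and all values are invented for illustration): paul has a payroll record and an investment record but no people record. The first query joins INVEST on peo.person alone; the second coalesces the key in the ON clause.

```sas
/* Hypothetical sample data: paul is missing from PEOPLE */
data people;
   input person $;
   datalines;
linda
chris
;
run;

data payroll;
   input person $ salary;
   datalines;
paul 100
linda 200
;
run;

data invest;
   input person $ fund $;
   datalines;
paul bonds
chris stocks
;
run;

proc sql;
   /* Key not coalesced in the second ON clause:                  */
   /* paul's payroll row and investment row stay on separate rows */
   select coalesce(peo.person, pay.person, inv.person) as person,
          pay.salary, inv.fund
      from people as peo
           full join payroll as pay on peo.person = pay.person
           full join invest as inv on peo.person = inv.person;

   /* Key coalesced in the second ON clause: one row per person */
   select coalesce(peo.person, pay.person, inv.person) as person,
          pay.salary, inv.fund
      from people as peo
           full join payroll as pay on peo.person = pay.person
           full join invest as inv
              on coalesce(peo.person, pay.person) = inv.person;
quit;
```

With this data the first query should list paul twice (once with a salary, once with a fund) and the second should list him once with both.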
Outer Joins - SQL 3
    select *
      from people
           natural full join payroll
           natural full join invest;
Outer Joins - SQL 3
  Fixes all these problems
    Choosing the key once only in the select clause
    Coalescing the key properly
      in the select clause
      in the on clause
Logical Expressions
  This is only ONEs and ZEROs
Logical Expressions
  Boolean expressions in SAS *must* evaluate to one of {0,1}
  You may be able to exploit this:
    to construct a score for a matching scheme
    to tally the number of records for which the expression was true
Logical Expressions as Join Criteria
  A Match is a Catch if n or more of these conditions are true:
    Age is within one year of matching
    First name matches
    Last name matches
    Initial matches
Logical Expressions as Join Criteria
    select *
      from A, B
      where (abs(a.dob - b.dob) < 365)
          + (a.lastname = b.lastname)
          + (a.initial = b.initial)
          + (a.fname = b.fname) >= 2;
Logical Expressions as Join Criteria
  Careful!
  These joins are evaluated as internal Cartesian products.
  It is very expensive to evaluate each possible combination of rows from the contributing tables.
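
A minimal sketch of the scoring idea, using two hypothetical tables A and B (all names and dates are invented for illustration). It keeps the score as a column so that the strength of each match is visible, and filters on it with the PROC SQL keyword CALCULATED. Every row of A is still compared with every row of B, so the cost warning above applies.

```sas
/* Hypothetical sample data with near-miss spellings and dates */
data a;
   input fname $ initial $ lastname $ dob :date9.;
   format dob date9.;
   datalines;
Paul J Kent 01JAN1960
Linda M Smith 15JUN1972
;
run;

data b;
   input fname $ initial $ lastname $ dob :date9.;
   format dob date9.;
   datalines;
Paul K Kent 20MAR1960
Lynda M Smyth 15JUN1972
Chris A Jones 02FEB1985
;
run;

proc sql;
   /* Each comparison yields 0 or 1, so the sum counts the conditions met */
   select a.fname, a.lastname,
          b.fname as b_fname, b.lastname as b_lastname,
          (abs(a.dob - b.dob) < 365)
        + (a.lastname = b.lastname)
        + (a.initial = b.initial)
        + (a.fname = b.fname) as score
      from a, b
      where calculated score >= 2;
quit;
```

With this data, Paul Kent should pair with Paul Kent on three conditions (date, last name, first name) and Linda Smith with Lynda Smyth on two (date, initial).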
Logical Expressions as Counters
  Exploit the identity that the SUM of a logical expression is equivalent to the number of rows for which that expression was true.
    False contributes 0 to the sum.
    True contributes 1.
Logical Expressions as Counters
    data pets;
      input person $ cats dogs @@;
      cards;
    paul 5 1  linda 0 2  chris 0 2
    pat 0 0  kelsey 1 0  thais 0 4
    ;
    run;
Logical Expressions as Counters
    select person,
           cats > 0 as cat_own,
           cats > 2 as cat_love,
           dogs > 0 as dog_own,
           dogs > 2 as dog_love
      from pets;
Logical Expressions as Counters
    PERSON    CAT_OWN  CAT_LOVE  DOG_OWN  DOG_LOVE
    ----------------------------------------------
    paul            1         1        1         0
    linda           0         0        1         0
    chris           0         0        1         0
    pat             0         0        0         0
    kelsey          1         0        0         0
    thais           0         0        1         1
Logical Expressions as Counters
    select sum(cats > 0) as cat_own,
           sum(cats > 2) as cat_love,
           sum(dogs > 0) as dog_own,
           sum(dogs > 2) as dog_love
      from pets;
Logical Expressions as Counters
    CAT_OWN  CAT_LOVE  DOG_OWN  DOG_LOVE
    ------------------------------------
          2         1        4         1
Logical Expressions as Counters
  Useful? Perhaps...
Case Free Sorting
Case Free Sorting
  Tech Support gets requests for
    SORT while ignoring case
    SORT by formatted values
    SORT a special number to the beginning/end
Case Free Sorting
  PROC SQL allows:
    An expression most anywhere you could have a variable.
    A subquery most anywhere you could have an expression.
  This is ANSI SQL2, but some RDBMS do not implement it yet.
Case Free Sorting
    ORDER BY upcase(name)
    ORDER BY put(variable, format.)
    ORDER BY CASE WHEN ACCOUNT = 999 THEN . ELSE ACCOUNT END
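
A minimal sketch that puts two of these ORDER BY expressions into complete queries. The ACCTS table and its values are hypothetical, invented for illustration. The second query relies on SAS sorting a missing value before any number, so account 999 moves to the beginning.

```sas
/* Hypothetical sample data: mixed-case names, one special account 999 */
data accts;
   input name $ account;
   datalines;
smith 200
Adams 999
baker 100
Clark 300
;
run;

proc sql;
   /* Case-free sort: Adams, baker, Clark, smith */
   select name, account
      from accts
      order by upcase(name);

   /* 999 is mapped to missing for sorting only, so it comes first: */
   /* 999, 100, 200, 300                                            */
   select name, account
      from accts
      order by case when account = 999 then . else account end;
quit;
```

PROC SQL writes a note to the log when an ORDER BY item does not appear in the SELECT clause; the queries still run.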
Data about Data
  How to use DICTIONARY.TABLES to write programs that respond to the contents of libraries dynamically
  Suppose you have a SAS library with airline data
    you want to display column info for tables having the string "flight" in the member label
    you want listings of those tables
Data about Data
  Get a list of available tables from DICTIONARY.TABLES
  Get column info from DICTIONARY.COLUMNS
  Use some sneaky macro and SQL tricks!
Data about Data
  Get a list of available tables
    reset noprint;
    select quote(memname) into :members separated by ','
      from dictionary.tables
      where libname = 'AIRLINE'
        and upcase(memlabel) contains 'FLIGHT';
Data about Data
  Get the column information
    reset print flow=15 20;
    select memname, name, label, type, length, format, idxusage
      from dictionary.columns
      where libname = 'AIRLINE'
        and memname in (&members);
Data about Data
    Member    Column                                Column  Column  Column   Column Index
    Name      Name      Column Label                Type    Length  Format   Type
    --------------------------------------------------------------------------------------
    DELAY     FLIGHT    Flight number               char         3           COMPOSITE
    DELAY     DATE      Departure date              num          8  DATE7.   COMPOSITE
    DELAY     DELAY     Delay in minutes            num          8  5.
    FLINFO    FLIGHT    Flight Number               char         3  $3.      SIMPLE
    FLINFO    ORIG      Origin                      char         3  $3.
    FLINFO    DEST      Destination                 char         3  $3.
    FLINFO    MILES     Distance in Nautic Miles    num          8  5.
    MARCH     FLIGHT    Flight number               char         3  $3.      COMPOSITE
    MARCH     DATE      Departure date              num          8  DATE7.   COMPOSITE
    MARCH     DEPART    Departure (local time)      num          8  TIME8.
    MARCH     MAIL      Weight of mail (kg)         num          8  5.
    MARCH     FREIGHT   Weight of freight (kg)      num          8  5.
    MARCH     BOARDED   No. of boarded passengers   num          8  5.
    MARCH     TRANSFER  No. of transfer passengers  num          8  5.
    MARCH     NONREV    No. of non-revenue pass.    num          8  5.
    MARCH     DEPLANE   No. of disembarked pass.    num          8  5.
    MARCH     CAPACITY  Max. no of pass. in plane   num          8  5.
    SCHEDULE  FLIGHT    Flight number               char         3  $3.      COMPOSITE
    SCHEDULE  DATE      Date                        num          8  DATE7.   COMPOSITE
    SCHEDULE  IDNUM     Id of crew member           char         4  $4.      COMPOSITE
Data about Data
  Get the available table names into macro variables
    reset noprint;
    select memname into :mem1 thru :mem99
      from dictionary.tables
      where libname = 'AIRLINE'
        and upcase(memlabel) contains 'FLIGHT';
    %let n_mems = &sqlobs;
Data about Data
  Make listings of those tables (first 20 obs.)
    %macro prt_mems;
      reset print outobs=20 number;
      %do i = 1 %to &n_mems;
        title "Listing of first 20 rows of AIRLINE.&&mem&i";
        select * from airline.&&mem&i;
      %end;
      title;
    %mend;
    %prt_mems;
Data about Data
  Partial listing
    Listing of first 20 rows of AIRLINE.DELAY

           Flight  Departure  Delay in
     Row   number  date        minutes
    -----------------------------------
       1   182     01MAR94           0
       2   114     01MAR94           8
       3   202     01MAR94          -5
       4   219     01MAR94          18
       5   439     01MAR94          -4
       6   387     01MAR94          -2
       7   290     01MAR94          -8
       8   523     01MAR94           4
       9   982     01MAR94           0
      10   622     01MAR94          -5
      11   821     01MAR94          16
      12   872     01MAR94           3
      13   416     01MAR94           4
      14   132     01MAR94          14
      15   829     01MAR94          -6
      16   183     01MAR94          -8
      17   271     01MAR94           5
      18   921     01MAR94          -5
      19   302     01MAR94          -2
      20   431     01MAR94          13
That's All Folks!
  SAS and SAS/ACCESS are registered trademarks of SAS Institute Inc., Cary, NC, USA. Other brand names are trademarks or registered trademarks of their respective holders.