Surfing the SAS cache to improve optimisation Michael Thompson Department of Employment / Quantam Solutions
Background
- Did my first basic SAS course in 1989. Didn't get it at all.
- Actively avoided SAS programming; had very capable team members to do it for me.
- Went consulting in 1999. At my first interview I was asked a lot of SAS questions, which I could answer because I had had good SAS people working for me. I wasn't asked if I could program in SAS!
- Only realised, starting on day one, that SAS was required. Had to learn really quickly via the SAS tech support web page.
- Discovered quite quickly that SAS PROC SQL was both quick to program and ran fast. Yes, I am an unashamed SQL zealot!
Let's make this interactive. Don't be shy!
- Better ideas
- Observations
- Insights
There might be 500-1000 years of SAS experience in the room!
What is going on that affects us
- Budget cuts drive the push for more efficiency: we must do more with less.
- Demands for more responsiveness to support the business of our organisations.
- Government must follow private enterprise, understand at a finer level the attributes and needs of the public we serve, and be able to quickly measure the effectiveness of both new and old policy.
What is going on that affects us
- Big Data (sure, it is one of the latest buzzwords, however...)
- The data is growing massively and will not stop. Maybe in the next few years more data will be created than in the last 40,000 years.
- With so much data around, an important skill will be the ability to discern which data to ignore.
- But also the creativity to identify surprising new ways to use new data to serve our organisations.
What is going on that affects us
- The advent of the micro-policy is coming: policies benefiting small numbers of people, developed and implemented quickly, and evaluated quickly. If they fail, make sure they fail fast.
- To facilitate concepts like micro-policies, it is up to us to ensure our IT areas have the right data available and are even more flexible and responsive, and that our output is trusted (both in perception and in reality).
What is going on that affects us
- Trend to open public data to scrutiny. Obama's second administration directed government (taking account of privacy and national security) to publish government data and open it to scrutiny.
- "Given enough eyeballs, all insights are shallow": a slight twist on Linus's Law (after the Linux creator), i.e. crowdsourced research.
- This may mean our data processes need to be more robust, and we may have an increased workload: we may more often need to verify insights identified in our data by others.
What used to be most important
For SAS programmers (creativity aside), writing code so it executed quickly:
- Getting IF tests in the right order, reducing the number of tests executed
- Using ELSE to avoid executing un-needed IF tests
- Subsetting with WHERE statements after SETting data
- Keeping the number of fields, and their lengths, down to a minimum
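The IF-ordering and ELSE advice above can be sketched as follows. This is purely illustrative: the input table and the BENTYPE values are hypothetical, and the assumption is that 'NSA' is the most common value.

```sas
/* Illustrative sketch: test the most common value first and chain
   with ELSE IF, so later tests are skipped once a match is found */
data flagged;
  set work.benhist;                    /* hypothetical input table */
  if bentype = 'NSA' then grp = 1;     /* assumed most common value */
  else if bentype = 'PPS' then grp = 2;
  else grp = 3;                        /* everything else, no further tests */
run;
```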
Now things have changed
- Moore's Law has continued to work
- CPU speeds have increased massively
- Disk storage is getting larger and cheaper per GB
Moore's Law: transistor counts on integrated circuits double every 2 years. (Ref: Wikipedia, "Moore's Law")
Why change the way I do things now? Computers will just get faster and help me keep up!
- Unfortunately, the speed at which data can be read, written and transmitted has not kept up.
- This means that as our data volumes increase, even though processors are getting faster and storage is getting bigger and cheaper, read/write speeds are not keeping pace.
- It doesn't matter how fast our CPUs are or how efficiently we write our code if the bottleneck is reading data from, and writing data to, storage.
As a SAS programmer, what things can we do?
- Write our programs or processes more quickly: this helps no matter what the data volumes are.
- Where our data volumes are high, engineer processes which optimise/minimise IO.
As a SAS programmer, what things can we do? (to speed up joining data from 2 or more tables)
I personally believe that, at the moment (and this may change as we make more and more use of solid-state memory), the best thing we can do to speed up our SAS is to understand how the cache operates and work with it. When SAS reads data from storage it doesn't just read 1 record: it reads many into cache.
As a SAS programmer, what things can we do? (to speed up joining data from 2 or more tables)
The #1 thing we can do to optimise the cache is to ensure the tables we are joining are sorted by the key variable we are joining on, and preferably indexed by that variable as well. This means that when we start to read two tables into memory, as we join the first records from each table we have also just read the data for thousands of subsequent matches.
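A minimal sketch of preparing two tables this way, assuming the library, table and key names used elsewhere in this talk are representative:

```sas
/* Sort both tables by the join key... */
proc sort data=redlast.benhist;  by ssr; run;
proc sort data=redlast.customer; by ssr; run;

/* ...then index them on the same key */
proc datasets library=redlast nolist;
  modify benhist;  index create ssr;
  modify customer; index create ssr;
quit;
```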
As a SAS programmer, what things can we do? (to speed up joining data from 2 or more tables)
This utilisation of the data flowing through cache can be further enhanced by ensuring the tables are compressed (preferably using SPDE binary compression). If you really need to make the match even faster, ensure that only the fields needed in the output are in the input files, allowing even more records to be read in each bite of the cache.
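One way to sketch both ideas together; the SPDE path and the field list here are illustrative assumptions, not from the talk:

```sas
/* SPDE library with binary compression (path is illustrative) */
libname spd spde '/data/spde' compress=binary;

/* Copy across only the fields the join will need, so each
   cache read pulls in more records */
data spd.benhist_slim;
  set redlast.benhist (keep=ssr start end bentype);
run;
```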
Caveats
Many of these advantages can be lost if more than 2 tables are joined at the same time. When joining 3 tables, SAS internally joins 2 tables, writes the output to WORK, and then joins the WORK table to the 3rd. Unfortunately WORK files are not SPDE compressed, so the reads and writes to WORK are slow.
Workaround: only do this with great care. Some SAS processes will fail if you follow this idea, e.g. SAS/GRAPH.

/* The following code gets SAS to utilise SPDE compression
   by default for work files - be careful!!! */
options obs=max compress=binary;
%let path=%sysfunc(pathname(work));
libname s spde "&path";
run;
options user=s;
Questions
As a SAS programmer, what things can we do?
#1 Learn SQL
- Data steps describe the path to solving a problem; SQL semantically describes the answer and delivers it.
- When joining tables (datasets), SQL almost always outperforms the SAS merge.
- Short SQL programs can often replace processes with hundreds of lines of code.
- SQL can avoid complicated data steps containing RETAIN and SET ... BY statements.
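As a sketch of that last point, a single PROC SQL step like the following (table and variable names are illustrative) can replace a sort plus a RETAIN/FIRST.-style data step for per-key summaries:

```sas
/* Per-customer summary in one declarative step:
   no sort, RETAIN or BY-group logic needed */
proc sql;
  create table spells_per_cust as
  select ssr,
         count(*)   as n_spells,
         min(start) as first_start
  from redlast.benhist
  group by ssr;
quit;
```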
Results of testing file types

                      Regular SAS dataset                          SPDE
                 No order,  Random order,  Ordered &    Random order,  Ordered &
                 no index   indexed        indexed      indexed        indexed
Merge  Elapsed   2m17s      14m34s         1m21s        1h15m54s       1m05s
       User CPU  16s        2m08s          23s          1h25m13s       31s
       Sys CPU   4s         2m47s          6s           5m13s          6s
SQL    Elapsed   1m23s      50s            41s          31s            24s
       User CPU  14s        25s            16s          27s            20s
       Sys CPU   8s         6s             6s           2s             1s

A join between a 10 million row and a 30 million row table. Also shows the benefit of the SPDE format for SAS tables (datasets). Note the worst performing SQL join was only just slower than the best merge.
Joining/merging data
There are many ways to join or merge data and get the correct result. All are not equal; when choosing a strategy, what matters is:
- Speed of coding
- Speed of execution (run-time/processor time)
- Maintenance
- Size of data
Joining/merging data
What makes a difference...
- File sizes
- Sort order
- Indexing
- Technique (merge/SQL join/sort/other)
- Will the code be run regularly, or one-off?
- Will SPDE help?
Join types
Traditional join options:
- SQL join
- Merge
Joining/merging data

proc sql;
  create table CALD_Profile as
  select a.ssr, start, a.bentype, b.COB, end
  from REDlast.benhist as a
  join redlast.customer as b
    on a.ssr eq b.ssr
  where a.end eq .;
quit;

NOTE: Table WORK.CALD_PROFILE created, with 4662787 rows and 5 columns.
NOTE: PROCEDURE SQL used (Total process time):
      real time           19.93 seconds
      user cpu time       20.26 seconds
      system cpu time      1.82 seconds

data CALD_Profile_MG (keep=ssr start bentype COB end);
  merge REDlast.benhist (in=a)
        redlast.customer;
  by ssr;
  if end eq . and a;
run;

NOTE: There were 30752526 observations read from the data set REDLAST.BENHIST.
NOTE: There were 11543751 observations read from the data set REDLAST.CUSTOMER.
NOTE: The data set WORK.CALD_PROFILE_MG has 4662787 observations and 5 variables.
NOTE: DATA statement used (Total process time):
      real time           50.01 seconds
      user cpu time       43.86 seconds
      system cpu time     24.24 seconds
As a SAS programmer, what things can we do?
#2 Look after the IO (when data volumes are high)
- Understand how you can make the cache work for you
- The order of your data can matter a lot
- Learn how SPDE format data can help
Demo: joining/merging data. Sort order matters!
SPDE makes a big difference to IO/footprint
- High-density data can actually increase in size by ~25% when regular SAS compression is applied.
- SPDE binary compression can cut the size of the same dataset in half: faster to write, faster to read!
- Remember: SPDE files CANNOT BE MOVED outside SAS.

NOTE: Compressing data set JKN.RAND_ORDER_INDEXED_MG increased size by 27.59 percent. Compressed is 47549 pages; un-compressed would require 37266 pages.
NOTE: MODIFY was successful for JKS.BENHIST_RAND_ORDER_INDSPD.DATA.
NOTE: Compressing data set JKS.BENHIST_RAND_ORDER_INDSPD decreased size by 71.36 percent.
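To see the two behaviours side by side, a sketch like the following (table names and the SPDE path are illustrative assumptions) produces "Compressing data set ..." NOTEs of the kind quoted above in the log:

```sas
/* Regular (RLE) compression: can grow high-density data */
data jkn.benhist_rle (compress=yes);
  set jkn.benhist;
run;

/* SPDE binary compression: typically shrinks the same data */
libname jks spde '/data/spde' compress=binary;
data jks.benhist_bin;
  set jkn.benhist;
run;
/* Compare the size-change percentages reported in the log NOTEs */
```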