Getting the Most from Hash Objects Bharath Gowda
Getting the most from Hash objects Techniques covered are: SQL join Data step merge using BASE engine Data step merge using SPDE merge Index Key lookup Hash object lookup Simple join, iteration and sorting with hash
Test setup Only using local SAS datasets. Datasets (including SPDE) reside in the local work library directory. Storing data locally nullifies the network delays and I/O delays. The environment has a super fast solid state hard disk which decreases I/O delay. Same join logic is implemented over all the techniques.
PROC SQL Requires no sorting prior to joins Multi way joins can be performed. SQL uses internal utility tables for joins Proc sql; create table acute_pred as select a.*,b.nwau_sas from pred_cost as a inner join speciality as b on a.yearid = b.yearid and a.facility_identifier= b.facility_identifier ; quit; Rows in pred_cost 13,160,905 Rows in speciality 16,170,805 Rows in acute_pred (inner join) real time user cpu time 12,001,616 0:20:28.79 0:14:15.31 Memory 254512.00k
Data step merge It requires a sort first by the key variables in all the input datasets. Proc sort data=pred_cost out= srt_pred_cost; by yearid facility_identifier; Run; Proc sort speciality out= srt_spec (keep= yearid facility_identifier nwau_sas); by yearid facility_identifier; Run; Data acute_pred; Merge srt_pred_cost(in=a) srt_spec(in=b); by yearid facility_identifier; If a and b; Run; Rows in pred_cost 13,160,905 Rows in speciality 16,170,805 Rows in acute_pred (inner join) real time user cpu time (avg combined sorting and merging) 12,001,616 0:30:28.79 0:21:15.31 Memory(combined avg) 162393.59k
SPDE merge SPDE stands for Scalable Performance data engine SPDE combines software and hardware capabilities. libname workspde spde "%sysfunc(pathname(work))" temp=yes; proc copy in=work out= workspde; run; select pred_cost specialty; data workspde.acute_pred; merge workspde.hiepred_acutecost_1718(in=a) workspde.nwau16 If a and b; run; (in=b keep= nwau_sas yearid facility_identifier); by yearid facility_identifier; Rows in pred_cost 13,160,905 Rows in speciality 16,170,805 Rows in acute_pred (simple join) real time user cpu time 12,001,616 0:16:15.31 0:12:28.79 Memory 1261040.70k
Index key lookup Main dataset is not sorted. Multiple set statements with key= option. _IORC_ is one of the key automatic variable which needs attention. Proc datasets lib=work nolist; Modify speciality; Index create myindex=(yearid facility_identifier); run; Data acute_pred; set pred_cost; Set speciality key=myindex ; If _IORC_ =0 ; run; Rows in pred_cost 13,160,905 Rows in speciality 16,170,805 Rows in acute_pred (inner join) real time user cpu time 12,001,616 0:12:15.31 0:06:28.79 Memory 142279.71k
HASH Objects One of the fastest approaches for the look up activities. In memory computation makes this technique the fastest. Dataset size is a key factor for memory consumption. Pre sorting is not required. Hash object needs to be defined and instantiated. Lookups are performed with the find() function. Data acute_pred; length nwau_sas 8.; If _n_=1 then do; declare hash hn16(dataset: speciality,duplicate: e ); hn16.definekey( yearid, facility_identifier ); hn16.definedata( nwau_sas ); hn16.definedone(); end; set pred_cost; Rc=hn16.find(key :yearid, key:facility_identifier); If rc=0; Run;
HASH Objects Simple inner joins and left joins can be achieved by keeping a check on the return code variable. Rows in pred_cost 13,160,905 Rows in speciality 16,170,805 if rc=0; Rows in acute_pred (inner join) 12,001,616 real time user cpu time 0:08:28.79 0:03:15.31 Memory 1465437.28k
HASH Iterators Hash iterator objects need to be defined and instantiated. Hash objects are assigned to hash iterator objects. Data top bottom; length nwau_sas 8.; If _n_=1 then do; declare hash hn16(dataset: speciality, ordered: descending ); hn16.definekey( yearid, facility_identifier ); hn16.definedata(all: Y ); hn16.definedone(); Declare hiter iter_ex( hn16 ); End; /*hn16.output(dataset: sort_dec )*/
HASH Iterators First() and next() functions to iterate top to bottom and last() and prev() functions to iterate bottom to top Can be used to get the top and bottom records. Declare hiter iter_ex( hn16 ); End; Iter_ex.First(); do I = 1 to 10; output top; Iter_ex.Next(); end; Iter_ex.last(); do I = 1 to 10; output bottom; Iter_ex.prev(); end; Run;
Avoid or use? Where should you apply these techniques? Type Proc SQL Datastep Merge Index lookup (key=) SPDE Merge Hash lookup Hash Iterators Small to medium sized tables (< 10,000 rows) Huge table (> 10 million rows) Unsorted data(> 10 million rows) Datasets with multiple indexes Less memory hungry techniques
Getting the most from Hash objects Bharath Gowda, SAS analyst Independent Contractor 0468304568 bharathg1307@gmail.com