BAKING OR COOKING: LEARN THE DIFFERENCE BETWEEN SAS ARRAYS AND HASH OBJECTS. WISCONSIN ILLINOIS SAS USERS GROUP, MILWAUKEE, 29 JUNE 2016. CHARU SHANKAR, SAS INSTITUTE INC. Copyright 2013, SAS Institute Inc. All rights reserved.
About Charu
Teaching experience: computer languages, English language, business communication skills.
SAS experience: SAS programming, SQL language, DS2, Hadoop, Business Intelligence classes.
Hobbies: writing (SAS training blog), fitness (yoga instructor), singing, working with children, cooking blog and catering.
Some fun facts: 9 years at SAS Canada, July 4th. Born and raised in India with sunshine 365 days a year; never dreamed I would live in Canada with cold 365 days of the year. This is my 4th time presenting at WIILSU. I find coding and teaching to be creative processes, like singing and cooking. I like helping others. If you are looking for work, join the LinkedIn group I've created: 21-day free SAS challenge.
Thanks to LeRoy and all of you for continuing to find value in my teaching and inviting me back, and to my manager at SAS Canada and all the SAS folks who make this happen.
Baking or Cooking: learn the difference between SAS arrays and hash objects. 1. Bake - Using Arrays 2. Cook - Using Hash Objects 3. Compare and contrast
Is this oven baked or cooked on the stovetop? [Four food photos; answers in order: cooked, cooked, baked, cooked.]
Overview of Arrays (Review) An array is similar to a row of numbered buckets (1, 2, 3, 4, and so on). SAS puts a value in a bucket based on the bucket number. A value is retrieved from a bucket based on the bucket number.
Defining Arrays (Review) An array is a temporary grouping of SAS variables that are arranged in a particular order and identified by an array name. The following tasks can be accomplished using an array:
  - performing repetitive calculations on a group of variables
  - creating many variables with the same attributes
  - restructuring data
  - performing a table lookup with one or more numeric factors
An array exists only for the duration of the current DATA step.
Using One-Dimensional Arrays (Review) To use an array, declare the array by using an ARRAY statement. General form of the one-dimensional ARRAY statement:
   ARRAY array-name {number-of-elements} <$> <length> <list-of-variables> <(initial-values)>;
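As a minimal sketch of the ARRAY statement in use, the step below groups four existing variables under one array name and creates four new variables with a repetitive calculation. The data set work.sales, its variables, and the 10% adjustment are hypothetical, made up only for illustration.

```sas
/* Hypothetical data: four quarterly figures per region */
data work.sales;
   input region $ q1 q2 q3 q4;
   datalines;
East 100 200 300 400
West 150 250 350 450
;
run;

data work.sales_adj;
   set work.sales;
   /* Group the existing variables q1-q4 under the array name Q */
   array Q{4} q1-q4;
   /* Create four new numeric variables, adj1-adj4 */
   array Adj{4} adj1-adj4;
   /* Same calculation repeated for each element */
   do i = 1 to dim(Q);
      Adj{i} = Q{i} * 1.1;
   end;
   drop i;
run;
```

Using DIM(Q) as the loop bound avoids repeating the element count in the DO statement.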
Objectives
  - Load an array from a SAS data set.
  - Use the array as a lookup table to match records.
The chef wants to create a food theme with an international menu for the summer. You have been asked to provide food names with their country of origin (match country codes to country names). Country has country names, codes, and food names. CodeFoodDesc has the food name, description, and country code. Using the Country data set as input to the array, we want to populate the new data set by using the array as a lookup table.
Business Scenario (WIILSU) The data set CodeFoodDesc contains the food name, description, and country code. [Partial listing of CodeFoodDesc]
Business Scenario To help the chef, you need to combine CodeFoodDesc with the Country data set, which contains country names. [Partial listing of Country]
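Using a One-Dimensional Array
   data baking(keep=code country food description);
      /* &num and &maxnum are macro variables defined elsewhere (not shown) */
      array C{&num} $255 _temporary_ (&maxnum*' ');
      /* SET overwrites the loop index CODE with each observation's code, */
      /* so each country name lands in the element numbered by its code   */
      if _n_=1 then do code=1 to &maxnum;
         set bakecook.country;
         C{code}=country;
      end;
      /* array values are retained automatically: temporary arrays always, */
      /* other arrays when initialized in the ARRAY statement              */
      set foods.'codefooddesc$'n;
      country=C{code};
   run;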
Execution (partial PDV) The slides step through the program above, showing the array elements as PDV variables C1-C860. With _TEMPORARY_ the elements are held in memory outside the PDV, but the steps are the same.
  1. Compilation: the PDV holds C1-C860, Code, Country, Food, Description, and _N_. The array elements are initialized to blank, and _N_=1.
  2. _N_=1, first pass of the DO loop: SET bakecook.country reads the first observation (Code=32, Country=Argentina).
  3. C{code}=country stores Argentina in C32.
  4. The loop repeats until the array is fully loaded, for example C36=Australia and C860=Uzbekistan. After the last read, Code=860 and Country=Uzbekistan.
  5. SET foods.'codefooddesc$'n reads the first food observation: Code=32, Food=Asado, Description=Cuts of meat..
  6. Country=C{code} looks up C32 and assigns Country=Argentina.
  7. Implicit OUTPUT writes the kept variables to baking; implicit RETURN starts the next iteration.
  8. Continue until EOF. For the last observation shown: Code=860, Country=Uzbekistan, Food=Plov, Description=A grain such as..
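The program above assumes that &num and &maxnum already exist. The sketch below shows one way the array size could be derived from the lookup data, with the array loaded by an end-of-file flag (DO UNTIL with END=) instead of the iterative DO loop. This is a swapped-in variation, not the code from the slides, and the miniature tables work.country and work.foods are hypothetical stand-ins for bakecook.country and the food sheet.

```sas
/* Hypothetical miniature versions of the two tables */
data work.country;
   input code country :$20.;
   datalines;
32 Argentina
36 Australia
860 Uzbekistan
;
run;

data work.foods;
   input code food :$20.;
   datalines;
32 Asado
860 Plov
;
run;

/* Derive the array size from the largest country code */
proc sql noprint;
   select max(code) into :maxnum from work.country;
quit;
%let maxnum = &maxnum;   /* removes leading blanks from the value */

data work.baking(keep=code country food);
   array C{&maxnum} $20 _temporary_;
   /* Load the lookup array once, reading until the end of work.country */
   if _n_ = 1 then do until (last);
      set work.country end=last;
      C{code} = country;
   end;
   /* Look up each food's country by its code */
   set work.foods;
   country = C{code};
run;
```

A code in work.foods that is missing or larger than &maxnum would cause an array subscript error, so this sketch assumes every food code is a valid subscript.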
Resulting Data
   proc print data=baking noobs;
   run;
PROC PRINT Output: Using One-Dimensional Arrays
   Obs  Employee_ID  Year_Hired       Salary      Average    Salary_Dif
     1       120101        2003  $163,040.00   $35,082.50   $127,957.50
     2       120102        1989  $108,255.00   $88,588.75    $19,666.25
     3       120103        1974   $87,975.00   $39,243.61    $48,731.39
     4       120104        1981   $46,230.00   $36,436.67     $9,793.33
     5       120105        1999   $27,110.00   $36,533.75    $-9,423.75
     6       120106        1974   $26,960.00   $39,243.61   $-12,283.61
     7       120107        1974   $30,475.00   $39,243.61    $-8,768.61
     8       120108        2006   $27,660.00   $27,883.71      $-223.71
Review of Arrays
Array:
  - The subscript value(s) must be numeric.
  - One data value can be associated with the subscript value(s).
  - An array uses less memory than other in-memory lookup techniques.
  - The size of the array is determined at compilation time.
  - Subscript values must be consecutive integers.
  - An array selects values by direct access based on the subscript value.
  - Arrays can only be used in the DATA step.
Baking or Cooking: learn the difference between SAS arrays and hash objects. 1. Bake - Using Arrays 2. Cook - Using Hash Objects 3. Compare and contrast
Poll: Have you used hash objects in SAS or other computer languages? Yes / No
DATA Step Hash Objects The DATA step hash object has the following attributes:
  - provides in-memory data storage and retrieval
  - has a data component and a key component
  - uses the key for quick data retrieval
  - can store multiple data items per key
  - does not require the data to be sorted
  - is sized dynamically
The hash object is a good choice for lookups using unordered data that can fit into memory.
Overview of a Hash Object (Review) A hash object is similar to rows of buckets that are identified by the value of a key. [Diagram: each row has a Key bucket followed by one or more Data buckets.] SAS puts value(s) in the data bucket(s) based on the value(s) in the key bucket. Value(s) are retrieved from the data bucket(s) based on the value(s) in the key bucket.
DATA Step Hash Objects The hash object resembles a table with rows and columns. The columns have the following characteristics:
  - can be numeric or character
  - can be loaded from hardcoded values
  - can be loaded from a SAS data set
  - exist for the duration of the DATA step
  - can be output to a SAS data set
DATA Step Hash Objects
The key component has the following attributes:
  - can consist of numeric and character values
  - maps key values to data rows
  - must be unique in releases before SAS 9.2
  - can be composite
The data component has the following attributes:
  - can contain multiple data values per key value
  - can consist of numeric and character values
Data components and key components are DATA step variables.
Using Hash Objects The DATA step hash object has these characteristics:
  - is created with a DECLARE statement
  - has attributes and methods
  - is manipulated with object dot syntax
An attribute is a property. A method is a function.
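To make the terms concrete, here is a small sketch that declares a hash object, calls a few methods with dot syntax, and reads one attribute. The hardcoded key and data values are taken from the examples in this talk; loading them with the ADD method rather than from a data set is an illustrative choice, not the approach used in the slides that follow.

```sas
data _null_;
   length code 8 country $20;
   call missing(code, country);     /* avoids "uninitialized" notes */

   /* DECLARE creates the hash object C */
   declare hash C();
   rc = C.definekey('code');        /* methods are called with dot syntax */
   rc = C.definedata('country');
   rc = C.definedone();

   /* ADD stores one key and its data item per call */
   rc = C.add(key: 32, data: 'Argentina');
   rc = C.add(key: 36, data: 'Australia');

   /* NUM_ITEMS is an attribute: no parentheses, returns the item count */
   n = C.num_items;
   put 'Items in hash: ' n;

   /* FIND returns 0 when the key exists and fills COUNTRY */
   rc = C.find(key: 36);
   if rc = 0 then put 'Code 36 maps to ' country;
run;
```

The methods return 0 on success, which is why the return code from FIND can be tested directly.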
Objectives
  - Load a hash object from a SAS data set.
  - Use a hash object method to match records.
The chef wants to create a food theme with an international menu for the summer. You have been asked to provide food names with their country of origin (match country codes to country names). Country has country names, codes, and food names. CodeFoodDesc has the food name, description, and country code. Using the Country data set as input to the hash object, we want to populate the new data set by using the hash object as a lookup table.
Loading Data from a SAS Data Set into a Hash Object
   data cooking;
      drop rc;
      length country $255;
      if _n_=1 then do;
         /* declare the hash object and load it from bakecook.country */
         dcl hash C(dataset:'bakecook.country');
         C.definekey('code');
         C.definedata('country');
         C.definedone();
         call missing(country);
      end;
      set foods.'codefooddesc$'n;
      /* look up the current code; rc=0 when the key is found */
      rc=C.find();
   run;
Execution (partial PDV) The slides step through the program above:
  1. Compilation: the PDV holds Country, Food, Description, Code, rc, and _N_; _N_=1.
  2. _N_=1 is true, so the DO group declares the hash object C, loads it from bakecook.country, and CALL MISSING sets Country to missing.
  3. SET foods.'codefooddesc$'n reads the first observation: Food=Asado, Description=Cuts of meat.., Code=32.
  4. C.find() searches the hash object for Code=32. The key is found, so rc=0 and Country=Argentina.
  5. Implicit OUTPUT; implicit RETURN.
  6. Continue until EOF.
Results
   title 'International food theme for summer with country origins of foods';
   proc print data=cooking;
   run;
[Partial PROC PRINT output]
Efficiency The program created the variable rc and then dropped it. How can you avoid creating the variable so that you do not have to drop it?
  a. Use a WHERE statement or a WHERE= data set option.
  b. Use a KEEP= or DROP= data set option in foods.'codefooddesc$'n.
  c. Test the result of the FIND method in the subsetting IF statement.
  d. Use a KEEP or DROP statement.
Multiple Choice Poll Correct Answer: c. Test the result of the FIND method in the subsetting IF statement.
Not Creating rc The program created the variable rc and then dropped it. Testing the FIND method directly in a subsetting IF statement avoids creating rc, so the DROP statement is no longer needed:
   data cooking;
      length country $255;
      if _n_=1 then do;
         dcl hash C(dataset:'bakecook.country');
         C.definekey('code');
         C.definedata('country');
         C.definedone();
         call missing(country);
      end;
      set foods.'codefooddesc$'n;
      /* subsetting IF: only rows whose code is found are written */
      if C.find()=0;
   run;
Quiz How do you know the length of the character variable Country?
Quiz Correct Answer How do you know the length of the character variable Country? You use PROC CONTENTS, PROC DATASETS, or the Explorer window to view the descriptor portion of bakecook.country.
Defining PDV Variables Dynamically Instead of the LENGTH statement, you can use an IF-THEN statement.
   data cooking;
      if _N_=1 then do;
         /* compiled but never executed: adds code and Country to the PDV */
         if 0 then set bakecook.country(keep=code Country);
         declare hash C(dataset:'bakecook.country');
         C.definekey('code');
         C.definedata('country');
         C.definedone();
      end;
      set foods.'codefooddesc$'n;
      if C.find()=0;
   run;
Because the IF condition is always false, the SET statement is compiled but never executed. The PDV includes all the kept variables from bakecook.country, with the lengths stored in that data set.
Using Data Set Options In SAS 9.2, you can use SAS data set options to limit the amount of data loaded into a hash object.
   data mostfoods;
      if _N_=1 then do;
         if 0 then set bakecook.country(keep=code Country);
         /* &ccode is a macro variable defined elsewhere (not shown); */
         /* double quotation marks let it resolve inside the string   */
         declare hash C(dataset:"bakecook.country(where=(code=&ccode))");
         C.definekey('code');
         C.definedata('country');
         C.definedone();
      end;
      set foods.'codefooddesc$'n;
      if C.find()=0;
   run;
Advantages and Disadvantages of Hash Objects
Advantages:
  - use of character and numeric keys
  - use of composite keys
  - faster lookup than formats or merges/joins
  - ability to be loaded from a SAS data set
  - fine level of control (flexibility)
Disadvantages:
  - memory requirements: the amount of memory available to your SAS session determines how big your hash object can be.
Reducing the number of observations and restricting the data items loaded into the hash object to only those that the program needs is one way to conserve memory. While it may seem counterintuitive, it can be more efficient to load your larger data set into the hash object, especially if it is your lookup data set. Reading the smaller data set sequentially and looking up information in a large hash object is likely to process more quickly than reading the larger data set sequentially and looking up information for each of its observations in a small hash object.
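One way to get a rough feel for the memory a hash object needs is to multiply the ITEM_SIZE and NUM_ITEMS attributes. The sketch below does this for a hypothetical three-row table, work.country, that stands in for bakecook.country. The result is only an approximation, since ITEM_SIZE does not include the object's initial overhead.

```sas
/* Hypothetical miniature lookup table */
data work.country;
   input code country :$20.;
   datalines;
32 Argentina
36 Australia
860 Uzbekistan
;
run;

data _null_;
   /* compiled but never executed: defines code and country in the PDV */
   if 0 then set work.country;
   declare hash C(dataset: 'work.country');
   C.definekey('code');
   C.definedata('country');
   C.definedone();

   size  = C.item_size;     /* approximate bytes per item (key + data) */
   n     = C.num_items;     /* number of items loaded */
   bytes = size * n;
   put 'Approximate hash size in bytes: ' bytes;
   stop;                    /* no input is read, so end the step explicitly */
run;
```

Keeping fewer variables in DEFINEDATA or adding a WHERE= option to the DATASET argument should reduce this estimate.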
Comparing Arrays and Hash Objects: Flexibility
Array:
  - The subscript value(s) must be numeric.
  - One data value can be associated with the subscript value(s).
  - An array uses less memory than a hash object.
  - The size of the array is determined at compilation time. When you define an array with the ARRAY statement, you must specify the number of elements. If the number of elements changes the next time you run the DATA step, you must update the ARRAY statement, or maintain additional code, such as macro programs, that updates it for you.
  - Subscript values must be consecutive integers.
  - An array selects values by direct access based on the subscript value.
  - Arrays can only be used in the DATA step.
Hash object:
  - The keys can be character, numeric, or both.
  - Multiple data items can be associated with the key value.
  - A hash object uses more memory than an array.
  - The size of the hash object is determined at execution time. SAS allocates memory dynamically as it is needed, so you do not have to determine the size of your hash object, even if you have many more observations to load the next time you use it.
  - The keys do not have to be consecutive or sorted.
  - A hash object uses a hash function for the lookup process.
  - Hash objects can only be used in the DATA step.
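Two of the differences listed above, character keys and multiple data items per key, can be seen in a short sketch. The tables work.country_iso and work.foods_iso, the two-letter codes, and the continent column are hypothetical additions for illustration; they are not part of the data used in this talk.

```sas
/* Hypothetical lookup table keyed by a character code */
data work.country_iso;
   infile datalines dsd;
   input iso :$2. country :$20. continent :$20.;
   datalines;
AR,Argentina,South America
AU,Australia,Oceania
UZ,Uzbekistan,Asia
;
run;

data work.foods_iso;
   input iso :$2. food :$20.;
   datalines;
AR Asado
UZ Plov
;
run;

data work.cooking_iso;
   if _n_ = 1 then do;
      /* compiled but never executed: defines iso, country, continent */
      if 0 then set work.country_iso;
      declare hash C(dataset: 'work.country_iso');
      C.definekey('iso');                     /* character key */
      C.definedata('country', 'continent');   /* two data items per key */
      C.definedone();
   end;
   set work.foods_iso;
   /* keep only foods whose code is found in the hash object */
   if C.find() = 0;
run;
```

An array could not be subscripted by these character codes directly, and it would hold only one value per subscript.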
Cool resources: Value of arrays; Better hashing; Arrays in compile time; Reading Excel using the XLSX engine; Different ways of combining SAS tables; Step by step explanation of SAS arrays; I cut my processing time by 90% by using hash objects
Questions and Comments
Contact Thanks for your time. Charu Shankar, Senior Technical Training Specialist, SAS Institute Inc. EMAIL Charu.Shankar@sas.com SAS BLOG http://blogs.sas.com/content/sastraining/author/charushankar/ TWITTER CharuYogaCan LINKEDIN https://ca.linkedin.com/in/charushankar