Administration & Support


An Animated Guide: Speed Merges: Resource Use by Common Non-Parallel Procedures
Russ Lavery, Contractor for ASG, Inc.

ABSTRACT

This paper compares how different SAS table lookup (Figure 1) techniques use resources. It will be useful for programmers who are starting to study programming efficiency or who are experiencing efficiency problems with current programs. Understanding the resource use of basic processes gives programmers a logical structure for selecting the table lookup technique most likely to solve their efficiency problem.

Advanced table lookup techniques all share one characteristic: they perform a Sorted By Merge (or something close to one) without sorting. The advanced techniques (Format, _IORC_, hashing, SQL, tagsort, modify, etc.) make different demands on I/O, CPU and disk space. Understanding those demands lets a programmer select the best technique for his/her particular situation.

Several years of NESUG/SUGI proceedings papers, including many on efficiency topics, are available online. These papers can be downloaded and read for an in-depth discussion of the different table lookup techniques mentioned here. This paper is an overview. It hopes to help readers decide which table lookup techniques will best help them, and it suggests a web site that can be used to search for, and download, other useful proceedings papers.

INTRODUCTION

[Figure 1: Concerned with FAST table lookup. Table lookup is our old friend the If statement: If zip= then state= PA ; Else if zip= then state= NJ ; Else state=?? ; Table lookup is often done by accessing another file (by merge, format, IORC, hashing etc.). Table lookup is also subsetting. Diagram: a Small File (Key, var1, var2, etc.) and a Large File (Key, var1, var2, etc.) feed a DATA STEP, often a By Merge, that produces some logical result.]

[Figure 2: Efficiency trade-offs. Efficiency means using resources gently. Working for efficiency has many dimensions and involves trade-offs among CPU, memory utilization, I/O utilization, disk space, initial programming effort and maintenance programming effort. Typical complaints: "takes too long" and "file is too big".]

A table lookup is the use of a key, or id, variable in a small file to look up values in a large file. It often takes the form of subsetting the large file, as shown in Figure 1. The task is so common that efficient coding is valuable. The goal of this paper is to support good decision making when selecting an efficient table lookup technique. The paper discusses the details of common SAS processes and focuses on CPU usage and disk space requirements.

Efficiency is a multidimensional concept and usually involves making trade-offs (Figure 2). Programs that use little RAM often have excessive I/O. Programs that run quickly often have high CPU requirements. Knowing how processes use resources helps programmers make those trade-offs. Generally, people study efficiency because a job takes too long to run or write, or because it creates a file that is too big for their system. Generally, we should use the most constrained resource gently.

The processes/techniques discussed in this paper are:
- Basic sequential SAS read of a data set
- Sorting: 1) General 2) SAS sort 3) Host sort 4) Asserted Sort 5) Ordered by
- The binary search process
- Formats
- Indexes: 1) General 2) _IORC_ 3) Tagsort
- Key indexing, bitmapping and hashing

The paper contains a figure ranking the techniques and suggests a web site for identifying useful online articles.

THE BASIC SAS READ

When SAS executes a set or merge statement, it reads the named file from top to bottom. This top-to-bottom read (Figure 3) is a fast access technique for SAS. A top-to-bottom read is part of data step processing and is also how many Procs access data. Total run time is the time for a single read multiplied by the number of observations, so total run time can still be too high: SAS performs a fast operation (a sequential read) but performs one for every observation.

[Figure 3: Your SAS sequential read. Example code: Data new; set old; tax=sales*.06; Reading a SAS file sequentially (first obs to last) is a very fast process for SAS. SAS performs a read for each observation, but each individual read is fast.]

[Figure 4: Your SAS Sort. Disk contents while sorting a big file: Work.Small, Work.Big, Misc. File 1, Misc. File 2, a copy of Work.Big and a sorting file. Sorting creates a sorting file that is approximately 2 1/2 times the size of the source file. Example code: Proc sort data=small noequals sortsize=max; by state zip; Proc sort data=big noequals sortsize=max; by state zip;]

SORTING AND THE SORTED BY MERGE

Sorting is a common SAS technique and a required step for the Sorted By Merge table lookup. A Sorted By Merge is easy to program but is very CPU and disk intensive (Figure 4). When SAS sorts a file (say work.small) it creates a temp copy of the file AND a sorting copy that is about 3 1/2 times the size of the original file. Observations are sorted in the sorting copy and, if the sort succeeds, the sorted observations are written back to the temp file and then to work.small. After work.small has been overwritten, the sorting and temp files are automatically deleted and the disk/memory space is freed. It is the large size of the temp files (3 1/2 times the size of the original file) that sometimes causes Proc Sort to fill up a disk and fail. The sortsize=max and noequals options should be used as sort options whenever possible.
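As a baseline for the techniques that follow, a Sorted By Merge used as a table lookup might look like the sketch below. The data set names (work.small, work.big, work.matched) and the key variables (state, zip) are assumptions borrowed from Figure 4, not a listing from the paper.

```sas
/* Sketch only: work.small holds the lookup keys, work.big the detail rows. */
/* Both files must be sorted by the same key before the merge.              */
proc sort data=work.small noequals;
  by state zip;
run;

proc sort data=work.big noequals;
  by state zip;
run;

/* Keep only the rows of work.big whose key also appears in work.small. */
data work.matched;
  merge work.small (in=in_small)
        work.big   (in=in_big);
  by state zip;
  if in_small and in_big;
run;
```

Both Proc Sort steps pay the CPU and disk costs described above, which is why the rest of the paper looks for ways to get the same result without sorting the large file.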
If data sets are small (several thousand obs), the Sorted By Merge is recommended because it is so easy to program and its hardware demands (CPU and disk) are easily met by modern equipment. With small files, the preliminary sorting and the top-to-bottom read in the data step take so little total run time that there is little incentive to explore advanced table lookup techniques.

With large files, programmers can run into time and disk space problems, and Sorted By Merges should be avoided. Sorting is a resource intensive operation that takes a lot of time and a lot of disk space. Sorting large files has crashed systems.

SAS has done major reprogramming on Proc Sort in V9.0 / V9.1. The new Proc Sort automatically multithreads and accesses multiple CPUs when they are available. Results of some simple performance tests are available in a paper by Jacobson and Lavery in this NESUG.

Resource demand for Sorted By Merge: CPU: HIGH DISKSPACE: HIGH I/O: HIGH

THE SAS ASSERTED SORT

The fastest way to sort data is to have the data supplier deliver it in sorted order and simply tell SAS to treat the data as sorted. This is the Asserted Sort. It takes zero time and is shown in Figure 5. Companies sometimes struggle to sort data that is already sorted, or that could be requested already sorted from suppliers.

The sortedby= data set option can be applied to a data set named on the data statement or on the set statement. The box in the lower left of Figure 5 shows what Proc Contents would report about data set one in Figure 5. Proc Contents says the data is sorted but reports that SAS has not validated the sort (Validated: NO). If the data had been sorted by a Proc Sort, the Validated characteristic would be YES. SAS will create, and use, the first-dot and last-dot variables on data that is asserted to be sorted.

Resource demand for Asserted Sort: CPU: NONE DISKSPACE: NONE I/O: NONE

[Figure 5: SAS Asserted Sort. If you know your data is sorted, you can assert that it is sorted and SAS will believe you. You can assert when you create or use the data set. Great time saver!! No work done!! Proc Contents shows: Sortedby: zip Validated: NO. Example code: data one /*(sortedby=zip)*/; infile datalines; input zip $char5.; datalines; ; data two; set one (sortedby=zip); by zip; if first.zip and last.zip; proc print data=two;]

[Figure 6: Select SAS vs. Host Sort. Sometimes the operating system has a very fast sort. You can specify the SAS sort, the host OS sort or the best sort. Conventional wisdom says the SAS sort is good for small files (test and set your own cutpoint). Options sortcp= xx ; Options sortpgm=sas; Options sortpgm=host; Options sortpgm=best; Watch out!!! There can be differences between SAS and host sort logic.]

HOST SORT VS SAS SORT

The word on the street is that the SAS V8.2 Proc Sort is very good for small/moderate data sets but is a bit slow if the data set is large. Unix and some mainframes have sorting routines that can be faster than SAS Proc Sort. SAS can be instructed to use a custom/host operating system sorting routine rather than its own. Options are shown in Figure 6. A programmer can instruct SAS to use the SAS Proc Sort, a host sort, or the best sort method (the host sort if the file is above a certain size).

One danger of using a host sort is that it can use a different sorting sequence from SAS. Files would then be sorted one way if they are smaller than the cutpoint (and therefore sorted by the SAS sort) and a different way if they are larger than the cutpoint (and therefore sorted by the host sort). Resource utilization varies with hosts.

SAS for MS Windows, before version 9, cannot support a host sort. It accepts the commands to pass the sorting task to a host sort package, without issuing notes or warnings, but does not act on them.

Resource demand for Host Sort: CPU: ?? DISKSPACE: ?? I/O: ??
THE BINARY SEARCH PROCESS

The binary search process underlies several SAS procedures. While it seems very simple, it has been shown mathematically to be the optimal search method under conditions common to many SAS programming tasks. Binary searches are used by formats and indexes. A binary search searches an ordered file by repeatedly dividing it in half. Figure 7 shows the binary search process applied to a format. The format records are ordered in the format catalog and SAS finds a format entry by repeatedly applying simple rules. In Figure 7 we check whether subject 10 was assigned to the test or the control treatment.

[Figure 7: Formats use binary searches. SAS uses binary searches to look within formats and indexes. Format file (subject, treatment): 001 test, 002 control, 003 test, 004 control, 005 control, 006 control, 007 test, 008 control, 009 test, 010 control, 011 test, 012 test, 013 control, 014 test, 015 control, 016 test, 017 control. Searching a file by halves, looking for subject 10: 1) see if the subject number is in the exact middle of the file; 2) if it is not, decide whether it is above or below the middle of the file, which defines a range; 3) see if the subject number is in the exact middle of the range defined above; 4) repeat until success. Syntax: TorC= PUT(PAT_ID,InSml.);]

[Figure 8: Proc Format syntax and use. The Proc Format below creates THREE formats.
  proc format;
    value skill LOW -< 1="BAD # LOW" 1="SAS" 2="Java" 3,5,6="Microsoft" 6<-HIGH="BAD # HIGH";
    value $Gen_Age "M"="A.M." "F"="A.F." "C","B","A"="Non-Adult" other="error";
    value $WHR_IN /*CHAR RANGES*/ low-<"00000"="bad zip" "19000"-"19099"="PHILA" "19100"-"19400"="pa" "80000"-"89999"="JERSEY" other="UNKNOWN";
  PROC PRINT DATA=EX_2; FORMAT GENDER $Gen_Age. JOB SKILL. ; RUN;
Data set Ex_2 (NAME GENDER JOB): Bob M 1; Russ M 2; Sue F 2; AJ M 1; Dot F 2.
Output, printed using the formats (NAME GENDER JOB): Bob A.M. SAS; Russ A.M. Java; Sue A.F. Java; AJ A.M. SAS; Dot A.F. Java.]
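The halving rule listed for Figure 7 can be sketched in a DATA step. This is only an illustration of the idea, not the code SAS runs internally; the temporary array, the variable names and the midpoint rule (round down) are assumptions, so the subjects probed may differ slightly from the ones named in the paper.

```sas
data _null_;
  /* Hypothetical ordered keys 1..17, mirroring the subjects in Figure 7. */
  array keys{17} _temporary_ (1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17);
  target = 10;          /* the subject we are looking for        */
  lo = 1;               /* first position still under search     */
  hi = dim(keys);       /* last position still under search      */
  found = 0;            /* becomes the position of the match     */
  do while (lo <= hi and not found);
    mid = floor((lo + hi) / 2);      /* middle of the current range */
    probes + 1;                      /* count the comparisons made  */
    if keys{mid} = target then found = mid;
    else if keys{mid} < target then lo = mid + 1;   /* keep the later half   */
    else hi = mid - 1;                              /* keep the earlier half */
  end;
  put found= probes=;
run;
```

Each pass discards half of the remaining range, so 17 entries need at most five comparisons, and doubling the number of entries adds only one more.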

In the walk-through that follows, "above" and "below" refer to position in the file as listed in Figure 7, so "below" means later in the file. SAS picks the middle observation in the file (subject 9) and asks if that is the subject it was seeking. If it is the desired subject, the search stops. It is not, so SAS asks if the desired subject sorts above, or below, the current one. It is below. SAS then divides the file in half and only considers subjects from 9 to 17. It picks a subject in the middle of the new range (14) and asks if that is the desired subject. It is not. SAS asks if the desired subject is above, or below, the current one. It is above. SAS then divides the range in half and only considers subjects from 9 to 14. It picks a subject in the middle of the new range (11) and asks if that is the desired subject. It is not. SAS asks if the desired subject is above, or below, the current subject. It is above. SAS then divides the range in half and only considers subjects from 9 to 10. It picks a subject in the new range (10) and asks if that is the desired subject. Since it is, the process stops.

Figure 7 shows this binary search in a format file. Information is transferred from the format to the PDV by a put (or input) function.

FORMATS

SAS formats are created with Proc Format and automatically convert data values to formatted values when data is displayed. They take time to create and take disk space after they are created. They can be re-used, so the cost of creation needs to be paid only once, but no automatic maintenance is performed on formats.

Figure 8 shows the normal creation, and typical use, of character and numeric formats. The large box on the left contains the syntax for a Proc Format and a Proc Print that applies those formats. A data set is in the upper right hand corner and the output of the Proc Print, applying the formats to that data set, is in the lower right.
Formats are very useful SAS tools for saving disk space and time. Generally, formats determine how data is displayed. Formats can save disk space because the data in the SAS data set is stored as the original value (in Figure 8 the original values are 1, 2, M, F) and then displayed in a longer form. In Figure 8, the savings would be substantial if the zips were stored as zips (19101) and then expanded to much longer values like Main Postal Distribution Center, Philadelphia, Pennsylvania.

A format table lookup does not require sorting of either data set. The large data set is read from top to bottom and the format is applied to each observation. A large portion of the speed of the format table lookup comes from the fact that it is usually a memory resident technique and avoids disk access. Theoretically there is no limit on the number of levels in a SAS format. However, when the format table lookup executes, the whole format must fit in RAM or performance suffers. Some operating systems cannot page formats and will crash if the format is larger than the available RAM. Others will page large formats between disk and RAM, which allows the job to complete but increases run time.

SAS code that would select, from a very large file, the patients that had been assigned to be controls (imagine the file in Figure 7 is the format insml) is shown below. This is a Format Table Lookup.

  Data subset;
    Set very_big_file;
    If put(pat_id,insml.)="control";
  Run;

Formats are stored in a catalog (permanent or work) and take RAM/disk space (Figure 9). When formats are created from an input file (see the right hand box in Figure 9), the data from the input file is summarized in a file (often called cntlin). Cntlin is used as input to a Proc Format. This cntlin file must be manually removed from work to release space.

While formats are often used for table lookup, they were not designed with this task in mind. As a result, formats perform extra processing that is not required for the simple task of table lookup.
The Format Table Lookup technique is best for a table lookup, or a merge, when only one variable in the small file must be brought through the merge. Remember, the PDV is created by the set statement that accesses the larger file and contains that file's variables (plus SAS automatic variables). Several variables from the small file can be brought through a merge using a Format Table Lookup, but it is not often done: the additional steps needed to concatenate the variables into one label variable, and then substring the variables back out after the merge, make this twist on the basic technique unattractive.

When the Format Table Lookup fails to run, or there is a need to bring several variables from the small file through the merge, a programmer often tries an _IORC_ Merge. An _IORC_ Merge is based on a SAS index and is not a RAM resident technique. It is generally slower than a Format Table Lookup, but often faster than a Sorted By Merge.

Resource demand for Format Table Lookup: CPU: HIGH DISKSPACE: Low I/O: Moderate
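The "concatenate, then substring" twist mentioned above can be sketched as follows. Everything named here is hypothetical: a small file work.small with a unique numeric key pat_id and two character variables, trt (assumed length 7) and site (assumed length 3), a large file work.big that has pat_id but not trt or site, and a format called PatInfo.

```sas
/* Build a cntlin data set whose label packs two variables at fixed widths. */
data work.cntlin (keep=fmtname type start label hlo);
  retain fmtname "PatInfo" type "n";
  length label $11 hlo $1;
  set work.small (rename=(pat_id=start)) end=last;
  label = put(trt, $7.) || "|" || put(site, $3.);  /* columns 1-7 and 9-11 */
  output;
  if last then do;          /* catch-all row for keys not in work.small */
    hlo   = "O";
    start = .;
    label = "NOMATCH";
    output;
  end;
run;

proc format cntlin=work.cntlin;
run;

/* Read the large file once and unpack the label for matching keys. */
data work.result;
  set work.big;
  length packed $11 trt $7 site $3;
  packed = put(pat_id, PatInfo.);
  if packed ne "NOMATCH";
  trt  = substr(packed, 1, 7);
  site = substr(packed, 9, 3);
  drop packed;
run;
```

The packing and unpacking steps are the extra work that, as the text notes, usually pushes programmers toward an _IORC_ merge or a hash lookup when more than one variable is needed.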

[Figure 9: Your SAS Format. Disk contents: Work.Small, Work.Big, Misc. File 1, Misc. File 2, the cntlin file and the format. Formats require creation of a cntlin file and storage of the format in the catalog. Example code:
  DATA N_SML_TOO;
    SET BIG;
    IF PUT(PAT_ID,InSml.)="YES";
  RUN;
  PROC PRINT; RUN;

  data CNTLIN (keep= fmtname start label type hlo);
    retain FMTNAME "InSml" TYPE "n" LABEL "YES";
    set small (rename=(pat_id=start)) end=last;
    output;
    if last=1 then do;
      Hlo="O"; label="other"; start=.;
      output;
    end;
  proc format cntlin=cntlin;]

[Figure 10: Your SAS Index. Disk contents: Work.Small, Work.Big, Misc. File 1, Misc. File 2, a small file index, a big file index and a copy of the big file. Indexes are from 10% to 50% of the size of the source file. Indexes take time to create! Individual reads are slow! Indexes are good: 1) if they can be reused and 2) if you only want 5% to 10% of the big file. Example code:
  data small(index=(state)); set small;
  Proc sql; create index state on work.small(state); quit;
  Proc datasets lib=work; Modify Large; index create zip; quit;]

INDEXES: THE BASIS OF SEVERAL MERGE TECHNIQUES

Programmers can use several methods to create an index. Indexes take time to create and take disk space after creation. An index can be re-used, so the cost of creation needs to be paid only once. However, indexes are automatically maintained: as observations are added and dropped, resources are used in maintenance.

Generally, indexes take more time to retrieve data than formats because index use involves more steps than the use of a format, and some of the additional steps are slow. As a first step in data access, an index goes through the same binary search as a format. However, a successful find does not return the desired information (Figure 7) but rather a location on the hard drive where the information can be found (Figure 11). Reading the information involves a slow (mechanical) process. The location is passed to the disk controller with an instruction to find that page of data on the hard drive. The disk spins, the disk arm moves and that page is read.
The page of information is passed back to the CPU. The page contains several observations and the CPU parses the information to select the correct observation and variable. Any mechanical process is slow and is to be avoided if possible.

For table lookup, the index is typically built on the larger file. The smaller file is processed from top to bottom and the index is accessed for every observation in the smaller file. Indexes are attractive if they return a small fraction (less than 5%) of the large file. If an index table lookup returns 20% or more of the large file, there are usually faster techniques to employ.

THE _IORC_ MERGE

The _IORC_ merge does not require sorting data sets but does require an index on the larger data set (Figures 12 & 13). Typically, the smaller file is read top-to-bottom and an indexed lookup is attempted for every observation. Figure 12 shows that the first set statement has executed and loaded the variables from that data set (Day_1) into the PDV. The second set statement (with the key= option) uses the value of name, on the PDV, to do an indexed read of the data set UpDt. In Figure 12 the lookup is successful and variables from UpDt are copied to the PDV. A successful index lookup results in a 0 being written to _IORC_ on the PDV.
[Figure 11: Reminder: indexes use binary searches. The index file holds subject, page and row. Once the subject is found in the index, accessing the information is a multi-step process: the index returns the location of a page and row of data on the disk drive, the controller is told to find that page, the read head moves over the spinning drive, the page/block of data is read back to the CPU, and the data is parsed to find the proper observation and field. An INDEX lookup involves more operations than a format lookup.]

[Figure 12: A successful _IORC_ lookup. Data set Day_1 has one row per name (Bob, Eric, Sue, Fred, Mark, Walt, KL, Wayne, Sally) with the variables Sh_Sp and T_Elb. Data set UpDt (name, Run): Bob Y, Russ N, Sue N, AJ N, Fred Y, Glenn N, KL Y. The data vector holds Name, Run, Sh_Sp, T_Elb, _N_, _ERROR_ and _IORC_; after the lookup for Bob it holds Bob Y S N, and the output file receives Bob Y S N. Syntax:
  data new3;
    set Day_1;
    set UpDt key=name/unique;
    array setmiss(*) $ ShinSpl--T_Elb;
    if _iorc_ then do;
      _error_=0;
      _IORC_=0;
      do i= 1 to dim(setmiss);
        setmiss(i)="";
      end;
    end;]

An unsuccessful attempt to do a table lookup is shown in Figure 13. When the second set statement fails to find a match in the index lookup, it writes a non-zero value to _IORC_. The non-zero value of _IORC_ causes the do group to execute and resets variables in the PDV to missing. Note: variables from both data sets are on the PDV and in the output data set.
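The same key= lookup is often written with the %SYSRC autocall macro so that the two expected return codes are tested by name rather than as "zero or not zero". The sketch below is an assumption-laden variant, not the paper's code: it assumes work.UpDt has no index on name yet and has unique names, and that Run is the variable supplied by UpDt.

```sas
/* One-time cost: build a unique index on the lookup key. */
proc datasets lib=work nolist;
  modify UpDt;
  index create name / unique;
quit;

data work.new3;
  set work.Day_1;                      /* sequential read of the driving file */
  set work.UpDt key=name / unique;     /* indexed read using name on the PDV  */
  select (_iorc_);
    when (%sysrc(_sok)) ;              /* match found: keep the UpDt values   */
    when (%sysrc(_dsenom)) do;         /* no match in the index               */
      _error_ = 0;                     /* suppress the lookup error flag      */
      call missing(Run);               /* clear the value left from the last  */
    end;                               /* successful lookup                   */
    otherwise do;
      put "ERROR: unexpected _IORC_ value " _iorc_=;
      stop;
    end;
  end;
run;
```

Resetting the looked-up variable on a failed read matters because variables read by a set statement are retained, so a miss would otherwise silently carry forward the previous match.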

[Figure 13: An unsuccessful _IORC_ lookup. Same data sets and syntax as Figure 12. The lookup for Russ finds no match; the data vector and output file show Russ N, and the output file now holds Bob Y S N and Russ N.]

[Figure 14: Your SAS Tagsort. Disk contents: Work.Small, Work.Big, Misc. File 1, Misc. File 2 and a SORTED secondary file holding state, zip and a POINTER. Using a tagsort involves searching in the sorted secondary file and an indexed lookup using the pointer. Tagsorts create a secondary file containing the sort keys and a pointer, and a sort file for the secondary file. Example code: proc sort data=big tagsort; By state zip; Proc Contents shows: Sortedby: zip Validated: YES]

Resource demand for IORC Table Lookup: CPU: HIGH DISKSPACE: Low I/O: HIGH

THE TAGSORT

A tagsort (Figure 14) involves the creation of a secondary file containing the key and a pointer to the location of the observation on the disk (like an index). Tagsorts are easy to program and can be used to support a Sorted By Merge without sorting the whole file. The tagsort process is similar to an automated IORC merge in that the basic process is an index lookup. It has all the problems associated with index usage if the small file has more than 5% of the number of obs in the large file. This can be a very slow process and is not often used today. It reads the source file top to bottom and creates a temp file. It then sorts the temp file. Finally it accesses the main file again with a random access method. While slow, it does avoid sorting the main file and can have some value if the problem is lack of disk space.

Resource demand for Tagsort Table Lookup: CPU: HIGH DISKSPACE: Low I/O: HIGH

KEY INDEXING, BITMAPPING AND MANUAL HASHING

Key indexing, bitmapping and manual hashing are only available inside the data step.
All desired information from the small file is loaded into a SAS array. The larger file is read from top to bottom and SAS performs a lookup for each observation in the large file. This process is illustrated, at a high level, in Figure 15. In key indexing, bitmapping and manual hashing, a mathematical function reads the key variable from the PDV as the large file is processed. SAS uses the value of the key variable to calculate the number of the array cell that holds the desired information. Retrieving information from the array is quick.

Typically, only one variable is carried through a hash merge. More can be carried through, at the expense of substringing the information that had been stored in the array. This is not often done, since the SAS substring function is not particularly fast.

Key indexing, bitmapping and hashing have been described in a series of articles written by Dr. Paul Dorfman. These are very fast table lookup techniques because they load the whole small data set into standard SAS arrays, and those arrays must exist in RAM. These very helpful techniques can be difficult to code and are generally a challenge for the maintenance programmer who inherits code that uses them.

Resource demand for Key Indexing, etc. Table Lookup: CPU: HIGH DISKSPACE: VERY LOW I/O: LOW

V9 HASHING

In response to the challenge of manually coding hashing algorithms, SAS has created an easy-to-use hashing applet (the hash object) in V9. It can be called from within the data step and creates something like a very smart array. Hashing is production in V9 (and higher) and preliminary speed tests show it to be a fast technique. To get maximum speed from the technique, the whole small file should be memory-resident. If an OS does not page memory objects, SAS may crash if the hash object does not fit in RAM. V9 Hashing has a new file structure and searches it using a two-part algorithm.
The file structure/algorithm is one reason that hashing is generally faster than formats for table lookup. An additional factor is that the hashing algorithm was designed to do table lookup and, unlike formats, performs few unneeded operations.
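A minimal sketch of a V9 hash object used as a table lookup is shown below. The data set and variable names are assumptions for illustration: work.small holds pat_id and treatment, and work.very_big_file holds pat_id but not treatment.

```sas
data work.subset;
  if _n_ = 1 then do;
    /* Never executed; it only defines pat_id and treatment in the PDV. */
    if 0 then set work.small (keep=pat_id treatment);
    /* hashexp: 8 requests 2**8 = 256 top-level buckets. */
    declare hash h(dataset: "work.small", hashexp: 8);
    h.defineKey("pat_id");       /* lookup key                         */
    h.defineData("treatment");   /* value carried through the lookup   */
    h.defineDone();              /* loads work.small into memory       */
  end;
  set work.very_big_file;        /* sequential read of the large file  */
  rc = h.find();                 /* 0 means the key was found          */
  if rc = 0;                     /* keep only the matching rows        */
  drop rc;
run;
```

Neither file is sorted or indexed here. The only up-front cost is loading the small file into memory once, on the first iteration of the data step, and additional variables can be carried through by listing them in defineData.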

[Figure 15: Key indexing, bitmapping & hashing. Diagram: keys from the Small File (Key, one variable, max value) are loaded into an array of keys in RAM inside the DATA STEP; the Large File (Key, var1, var2, etc.) is read and selected observations go to the Result File. Notes: no sorting; the small file is AUTOMATICALLY de-duped; high memory usage; FAST, if SAS array characteristics (max value in cell) do not limit the technique. You can code your own, or SAS V9 will have SAS-coded hashing. As we process the Large File, quick access to the values in the array lets us determine if we want the obs from the Large File in the result file.]

[Figure 16: Hashing uses two searches. Left panel (Format): a format file listing subjects 001 to 032, each labeled test or control, searched for subject 7. Right panel (Hashing): the same subjects stored in buckets 1 to 8, with a tree of further subjects below each bucket. Hashing divides the file into buckets and a tree structure below the buckets. The hash function gets you to a bucket, the method searches down the tree, and Find puts the information on the PDV.]

A comparison of a format table lookup and a V9 hashing lookup is shown in Figure 16. The left hand box illustrates how a format will search for subject 7. The right hand side illustrates a search for subject 7 in a hash object.

LOADING A V9 HASH OBJECT

It is common to think of a hash object as a smart array: an array that can grow as you add information to it. There is a top level of the hash object (in Figure 16, buckets 1 to 8) that is very similar to an array.
The programmer specifies the number of cells in the top level of the hash object, as s/he does for an array, and the number of cells in the top level is an important characteristic of the hash object. The smaller file, which can be of any size, is loaded into the hash object. As observations from the smaller file are processed, a mathematical (hashing) function uses the value of a key variable to calculate the proper top-level cell for that observation. If two observations hash to the same cell, a tree-like structure grows below the top-level cell and the data is stored in the tree under that cell. In a normal SAS array, the original data would be overwritten and lost.

Several different values of the key variable can hash to the same top-level cell, and all identical values of the key will hash to the same top-level cell. So, if the key for the hash object is state, all residents of California will be hashed to the same top-level cell. As can be seen in Figure 16, trees can be of different depths. Because hash objects can grow, any number of observations can fit in any hash object, regardless of the number of cells in the top level (sizing the top level is a topic for another paper). Cells store the value of the key and any values being carried through the table lookup.

While many variables can easily be loaded into a V9 hash object and carried through the table lookup, hash objects perform best when they fit in RAM. If the hash object cannot fit in RAM, and the OS cannot page memory objects, the program might crash.

ACCESSING A V9 HASH OBJECT

After the small file has been loaded into the hash object, the large file is read top-to-bottom. A hash table lookup is performed for every observation in the large file using the following process. The value of the key variable (in V9, having multiple key variables is easy) is read from the PDV and passed to the hashing function. The hashing function returns the number of a top-level cell in the hash object.
If a matching observation (matched on the value of the key) has been loaded from the small file into the hash object, that value will be in the top-level cell or in the tree under it. SAS examines the top-level cell. If the top-level cell does not contain the proper value of the key variable(s), SAS searches down the tree using an efficient method. If the proper value of the key is found, the information stored in the hash object is returned to the PDV. If the end of the tree is reached without finding the proper value of the key variable, SAS concludes that this value of the key was not in the small file and returns a failure to match (a number NE 0) to a variable on the PDV.

Resource demand for V9 Hashing Table Lookup: CPU: HIGH DISKSPACE: VERY LOW I/O: LOW

SQL

SQL is a powerful tool for table lookups, and it is very difficult to classify how it uses resources. In SQL, the programmer specifies what s/he wants but not how to get it. Every time it runs, SQL invokes a powerful and complex subroutine called the SQL optimizer that examines system conditions (file sizes, buffer size, the existence of indexes on variables in the where clause, SQL syntax) and determines how the query should be run. The optimizer has a variety of techniques to use in executing code, and anything more than a cursory description is outside the scope of this paper.
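One way to see which technique the optimizer picked for a given run is the _METHOD option on the PROC SQL statement, which writes the chosen execution plan to the log. The sketch below assumes hypothetical tables work.very_big_file and work.small that share a pat_id column, with treatment stored in work.small.

```sas
/* _METHOD prints the join method the optimizer selected to the SAS log. */
proc sql _method;
  create table work.matched as
  select b.*, s.treatment
  from work.very_big_file as b
       inner join work.small as s
       on b.pat_id = s.pat_id;
quit;
```

In the log, codes such as sqxjm (sort-merge join), sqxjhsh (hash join), sqxjndx (index join) and sqxjsl (step-loop join) identify the method. Rerunning the same query after the files have grown is a simple way to notice when the optimizer changes its choice.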

The optimizer, in a merge, can automatically choose to employ one of the following: a Cartesian product, a sorted-by merge, a SQL version of a hash merge, an IORC merge or, finally, a step-loop merge. Surprisingly, the same query code, run repeatedly over the course of a project with data being refreshed and source files growing, can be executed in one manner for one refresh and in another manner after file sizes have changed.

The strength of SQL is its ease of use and the time it saves on the programming task. Complex operations can easily be coded in SQL. Conventional wisdom is that if space and run time are the problems forcing a programmer to explore efficiency, SQL will generally not be the solution. In published tests SQL has generally lost speed tests against Sorted By Merges, but the tests did not seem to consider, and did not document, the decisions made by the optimizer.

Much development work has gone into SQL optimizer logic that automatically minimizes the space requirements of SQL, and generally the results are very good. That said, apparently small differences in syntax or environment can put the optimizer under a severe handicap and produce an execution plan that is good, but not maximally efficient. Within the last few years, proceedings papers were published suggesting that SQL can be a very fast and small technique if the query is written in a way that does not handicap the optimizer.

The optimizer is a subtle, brilliant and very useful subroutine and is the subject of a series of papers whose titles start with the phrase The Optimizer Project. The author suggests these papers, which can be found online, as starting points for study of this topic.
Additionally recommended are papers written by Paul Kent or Lewis Church (both of SAS) as well as papers by the review committee to the SQL Optimizer Project (Paul Dorfman, Sigurd Hermansen, Kirk Lafler, and Paul Sherman).

Resource demand for SQL (currently): CPU: HIGH DISKSPACE: MED-HIGH I/O: MED-HIGH

[Figure 17: Tools for table lookup. Our list of tools for table lookup, in the order shown. By Merges & SQL Joins: tag sorted by merge; indexed by merge; SQL joins; SAS sorted by merge; host system sorted by merge; asserted to be sorted by merge. Then: update; _IORC_ merge as table lookup; formats as table lookup; SAS V9 hashing; custom coded hashing; custom coded bitmapping; custom coded key indexing. Axis labels: fast, memory intensive.]

CONCLUSION

Figure 17 presents a rough guide to the use of these techniques. It is hoped that the information provided in this paper will help point a reader towards the technique that best solves her/his efficiency problem.

REFERENCES

Several years of proceedings are available online and can be downloaded for free. A good place for searching NESUG and SUGI proceedings as well as SAS_L is

For a general understanding of how things work in SAS:
Aster & Seidelman, Professional Programming Secrets, McGraw Hill
Virgile, Efficiency: Improving the Performance of your SAS Applications, SAS Institute
SAS course notes 58032: Optimizing s

CONTACT INFORMATION

Your comments and questions are valued and encouraged. Contact the author at:
Russell Lavery
9 Station Ave. Apt 1, Ardmore, PA 19003
russ.lavery@verizon.net
Contractor for ASG, Inc.

SAS is a registered trademark of SAS Institute, Inc., in the USA and other countries. ® indicates US registration.

An Animated Guide : Speed Merges: resource use by common procedures Russell Lavery, Contractor, Ardmore, PA

An Animated Guide : Speed Merges: resource use by common procedures Russell Lavery, Contractor, Ardmore, PA An Animated Guide : Speed Merges: resource use by common procedures Russell Lavery, Contractor, Ardmore, PA ABSTRACT This paper is a comparison of how resources are used by different SAS table lookup (Figure

More information

Paper TT17 An Animated Guide : Speed Merges with Key Merging and the _IORC_ Variable Russ Lavery Contractor for Numeric resources, Inc.

Paper TT17 An Animated Guide : Speed Merges with Key Merging and the _IORC_ Variable Russ Lavery Contractor for Numeric resources, Inc. Paper TT7 An Animated Guide : Speed Merges with Key Merging and the _IORC_ Variable Russ Lavery Contractor for Numeric resources, Inc. ABSTRACT The key mege (A.K.A. _IORC_ merge) is an efficiency technique.

More information

Merge Processing and Alternate Table Lookup Techniques Prepared by

Merge Processing and Alternate Table Lookup Techniques Prepared by Merge Processing and Alternate Table Lookup Techniques Prepared by The syntax for data step merging is as follows: International SAS Training and Consulting This assumes that the incoming data sets are

More information

1. Join with PROC SQL a left join that will retain target records having no lookup match. 2. Data Step Merge of the target and lookup files.

1. Join with PROC SQL a left join that will retain target records having no lookup match. 2. Data Step Merge of the target and lookup files. Abstract PaperA03-2007 Table Lookups...You Want Performance? Rob Rohrbough, Rohrbough Systems Design, Inc. Presented to the Midwest SAS Users Group Monday, October 29, 2007 Paper Number A3 Over the years

More information

Gary L. Katsanis, Blue Cross and Blue Shield of the Rochester Area, Rochester, NY

Gary L. Katsanis, Blue Cross and Blue Shield of the Rochester Area, Rochester, NY Table Lookups in the SAS Data Step Gary L. Katsanis, Blue Cross and Blue Shield of the Rochester Area, Rochester, NY Introduction - What is a Table Lookup? You have a sales file with one observation for

More information

An Annotated Guide: The New 9.1, Free & Fast SPDE Data Engine Russ Lavery, Ardmore PA, Independent Contractor Ian Whitlock, Kennett Square PA

An Annotated Guide: The New 9.1, Free & Fast SPDE Data Engine Russ Lavery, Ardmore PA, Independent Contractor Ian Whitlock, Kennett Square PA An Annotated Guide: The New 9.1, Free & Fast SPDE Data Engine Russ Lavery, Ardmore PA, Independent Contractor Ian Whitlock, Kennett Square PA ABSTRACT SAS has been working hard to decrease clock time to

More information

capabilities and their overheads are therefore different.

capabilities and their overheads are therefore different. Applications Development 3 Access DB2 Tables Using Keylist Extraction Berwick Chan, Kaiser Permanente, Oakland, Calif Raymond Wan, Raymond Wan Associate Inc., Oakland, Calif Introduction The performance

More information

Table Lookups: From IF-THEN to Key-Indexing

Table Lookups: From IF-THEN to Key-Indexing Table Lookups: From IF-THEN to Key-Indexing Arthur L. Carpenter, California Occidental Consultants ABSTRACT One of the more commonly needed operations within SAS programming is to determine the value of

More information

Table Lookups: Getting Started With Proc Format

Table Lookups: Getting Started With Proc Format Table Lookups: Getting Started With Proc Format John Cohen, AstraZeneca LP, Wilmington, DE ABSTRACT Table lookups are among the coolest tricks you can add to your SAS toolkit. Unfortunately, these techniques

More information

Chapter 12: Indexing and Hashing. Basic Concepts

Chapter 12: Indexing and Hashing. Basic Concepts Chapter 12: Indexing and Hashing! Basic Concepts! Ordered Indices! B+-Tree Index Files! B-Tree Index Files! Static Hashing! Dynamic Hashing! Comparison of Ordered Indexing and Hashing! Index Definition

More information

Chapter 12: Indexing and Hashing

Chapter 12: Indexing and Hashing Chapter 12: Indexing and Hashing Basic Concepts Ordered Indices B+-Tree Index Files B-Tree Index Files Static Hashing Dynamic Hashing Comparison of Ordered Indexing and Hashing Index Definition in SQL

More information

Hash Objects for Everyone

Hash Objects for Everyone SESUG 2015 Paper BB-83 Hash Objects for Everyone Jack Hall, OptumInsight ABSTRACT The introduction of Hash Objects into the SAS toolbag gives programmers a powerful way to improve performance, especially

More information

File Structures and Indexing

File Structures and Indexing File Structures and Indexing CPS352: Database Systems Simon Miner Gordon College Last Revised: 10/11/12 Agenda Check-in Database File Structures Indexing Database Design Tips Check-in Database File Structures

More information

Paper CC16. William E Benjamin Jr, Owl Computer Consultancy LLC, Phoenix, AZ

Paper CC16. William E Benjamin Jr, Owl Computer Consultancy LLC, Phoenix, AZ Paper CC16 Smoke and Mirrors!!! Come See How the _INFILE_ Automatic Variable and SHAREBUFFERS Infile Option Can Speed Up Your Flat File Text-Processing Throughput Speed William E Benjamin Jr, Owl Computer

More information

An Animated Guide: Proc Transpose

An Animated Guide: Proc Transpose ABSTRACT An Animated Guide: Proc Transpose Russell Lavery, Independent Consultant If one can think about a SAS data set as being made up of columns and rows one can say Proc Transpose flips the columns

More information

If You Need These OBS and These VARS, Then Drop IF, and Keep WHERE Jay Iyengar, Data Systems Consultants LLC

If You Need These OBS and These VARS, Then Drop IF, and Keep WHERE Jay Iyengar, Data Systems Consultants LLC Paper 2417-2018 If You Need These OBS and These VARS, Then Drop IF, and Keep WHERE Jay Iyengar, Data Systems Consultants LLC ABSTRACT Reading data effectively in the DATA step requires knowing the implications

More information

Choosing the Right Technique to Merge Large Data Sets Efficiently Qingfeng Liang, Community Care Behavioral Health Organization, Pittsburgh, PA

Choosing the Right Technique to Merge Large Data Sets Efficiently Qingfeng Liang, Community Care Behavioral Health Organization, Pittsburgh, PA Choosing the Right Technique to Merge Large Data Sets Efficiently Qingfeng Liang, Community Care Behavioral Health Organization, Pittsburgh, PA ABSTRACT This paper outlines different SAS merging techniques

More information

Something to think about. Problems. Purpose. Vocabulary. Query Evaluation Techniques for large DB. Part 1. Fact:

Something to think about. Problems. Purpose. Vocabulary. Query Evaluation Techniques for large DB. Part 1. Fact: Query Evaluation Techniques for large DB Part 1 Fact: While data base management systems are standard tools in business data processing they are slowly being introduced to all the other emerging data base

More information

(Refer Slide Time: 01:25)

(Refer Slide Time: 01:25) Computer Architecture Prof. Anshul Kumar Department of Computer Science and Engineering Indian Institute of Technology, Delhi Lecture - 32 Memory Hierarchy: Virtual Memory (contd.) We have discussed virtual

More information

Kathleen Durant PhD Northeastern University CS Indexes

Kathleen Durant PhD Northeastern University CS Indexes Kathleen Durant PhD Northeastern University CS 3200 Indexes Outline for the day Index definition Types of indexes B+ trees ISAM Hash index Choosing indexed fields Indexes in InnoDB 2 Indexes A typical

More information

Lecture 12. Lecture 12: The IO Model & External Sorting

Lecture 12. Lecture 12: The IO Model & External Sorting Lecture 12 Lecture 12: The IO Model & External Sorting Lecture 12 Today s Lecture 1. The Buffer 2. External Merge Sort 2 Lecture 12 > Section 1 1. The Buffer 3 Lecture 12 > Section 1 Transition to Mechanisms

More information

Scalable Access to SAS Data Billy Clifford, SAS Institute Inc., Austin, TX

Scalable Access to SAS Data Billy Clifford, SAS Institute Inc., Austin, TX Scalable Access to SAS Data Billy Clifford, SAS Institute Inc., Austin, TX ABSTRACT Symmetric multiprocessor (SMP) computers can increase performance by reducing the time required to analyze large volumes

More information

File System Interface and Implementation

File System Interface and Implementation Unit 8 Structure 8.1 Introduction Objectives 8.2 Concept of a File Attributes of a File Operations on Files Types of Files Structure of File 8.3 File Access Methods Sequential Access Direct Access Indexed

More information

Andrew H. Karp Sierra Information Services, Inc. San Francisco, California USA

Andrew H. Karp Sierra Information Services, Inc. San Francisco, California USA Indexing and Compressing SAS Data Sets: How, Why, and Why Not Andrew H. Karp Sierra Information Services, Inc. San Francisco, California USA Many users of SAS System software, especially those working

More information

CSCI S-Q Lecture #12 7/29/98 Data Structures and I/O

CSCI S-Q Lecture #12 7/29/98 Data Structures and I/O CSCI S-Q Lecture #12 7/29/98 Data Structures and I/O Introduction The WRITE and READ ADT Operations Case Studies: Arrays Strings Binary Trees Binary Search Trees Unordered Search Trees Page 1 Introduction

More information

Advanced Database Systems

Advanced Database Systems Lecture IV Query Processing Kyumars Sheykh Esmaili Basic Steps in Query Processing 2 Query Optimization Many equivalent execution plans Choosing the best one Based on Heuristics, Cost Will be discussed

More information

Stephen M. Beatrous, SAS Institute Inc., Cary, NC John T. Stokes, SAS Institute Inc., Austin, TX

Stephen M. Beatrous, SAS Institute Inc., Cary, NC John T. Stokes, SAS Institute Inc., Austin, TX 1/0 Performance Improvements in Release 6.07 of the SAS System under MVS, ems, and VMS' Stephen M. Beatrous, SAS Institute Inc., Cary, NC John T. Stokes, SAS Institute Inc., Austin, TX INTRODUCTION The

More information

Using Data Set Options in PROC SQL Kenneth W. Borowiak Howard M. Proskin & Associates, Inc., Rochester, NY

Using Data Set Options in PROC SQL Kenneth W. Borowiak Howard M. Proskin & Associates, Inc., Rochester, NY Using Data Set Options in PROC SQL Kenneth W. Borowiak Howard M. Proskin & Associates, Inc., Rochester, NY ABSTRACT Data set options are an often over-looked feature when querying and manipulating SAS

More information

Comparison of different ways using table lookups on huge tables

Comparison of different ways using table lookups on huge tables PhUSE 007 Paper CS0 Comparison of different ways using table lookups on huge tables Ralf Minkenberg, Boehringer Ingelheim Pharma GmbH & Co. KG, Ingelheim, Germany ABSTRACT In many application areas the

More information

Hash-Based Indexes. Chapter 11

Hash-Based Indexes. Chapter 11 Hash-Based Indexes Chapter 11 1 Introduction : Hash-based Indexes Best for equality selections. Cannot support range searches. Static and dynamic hashing techniques exist: Trade-offs similar to ISAM vs.

More information

Teradata. This was compiled in order to describe Teradata and provide a brief overview of common capabilities and queries.

Teradata. This was compiled in order to describe Teradata and provide a brief overview of common capabilities and queries. Teradata This was compiled in order to describe Teradata and provide a brief overview of common capabilities and queries. What is it? Teradata is a powerful Big Data tool that can be used in order to quickly

More information

16 Sharing Main Memory Segmentation and Paging

16 Sharing Main Memory Segmentation and Paging Operating Systems 64 16 Sharing Main Memory Segmentation and Paging Readings for this topic: Anderson/Dahlin Chapter 8 9; Siberschatz/Galvin Chapter 8 9 Simple uniprogramming with a single segment per

More information

Paper Haven't I Seen You Before? An Application of DATA Step HASH for Efficient Complex Event Associations. John Schmitz, Luminare Data LLC

Paper Haven't I Seen You Before? An Application of DATA Step HASH for Efficient Complex Event Associations. John Schmitz, Luminare Data LLC Paper 1331-2017 Haven't I Seen You Before? An Application of DATA Step HASH for Efficient Complex Event Associations ABSTRACT John Schmitz, Luminare Data LLC Data processing can sometimes require complex

More information

using and Understanding Formats

using and Understanding Formats using and Understanding SAS@ Formats Howard Levine, DynaMark, Inc. Oblectives The purpose of this paper is to enable you to use SAS formats to perform the following tasks more effectively: Improving the

More information

An Introduction to Compressing Data Sets J. Meimei Ma, Quintiles

An Introduction to Compressing Data Sets J. Meimei Ma, Quintiles An Introduction to Compressing Data Sets J. Meimei Ma, Quintiles r:, INTRODUCTION This tutorial introduces compressed data sets. The SAS system compression algorithm is described along with basic syntax.

More information

CS24: INTRODUCTION TO COMPUTING SYSTEMS. Spring 2014 Lecture 14

CS24: INTRODUCTION TO COMPUTING SYSTEMS. Spring 2014 Lecture 14 CS24: INTRODUCTION TO COMPUTING SYSTEMS Spring 2014 Lecture 14 LAST TIME! Examined several memory technologies: SRAM volatile memory cells built from transistors! Fast to use, larger memory cells (6+ transistors

More information

CS 31: Intro to Systems Caching. Kevin Webb Swarthmore College March 24, 2015

CS 31: Intro to Systems Caching. Kevin Webb Swarthmore College March 24, 2015 CS 3: Intro to Systems Caching Kevin Webb Swarthmore College March 24, 205 Reading Quiz Abstraction Goal Reality: There is no one type of memory to rule them all! Abstraction: hide the complex/undesirable

More information

A Simple Framework for Sequentially Processing Hierarchical Data Sets for Large Surveys

A Simple Framework for Sequentially Processing Hierarchical Data Sets for Large Surveys A Simple Framework for Sequentially Processing Hierarchical Data Sets for Large Surveys Richard L. Downs, Jr. and Pura A. Peréz U.S. Bureau of the Census, Washington, D.C. ABSTRACT This paper explains

More information

Ext3/4 file systems. Don Porter CSE 506

Ext3/4 file systems. Don Porter CSE 506 Ext3/4 file systems Don Porter CSE 506 Logical Diagram Binary Formats Memory Allocators System Calls Threads User Today s Lecture Kernel RCU File System Networking Sync Memory Management Device Drivers

More information

OPERATING SYSTEMS. After A.S.Tanenbaum, Modern Operating Systems 3rd edition Uses content with permission from Assoc. Prof. Florin Fortis, PhD

OPERATING SYSTEMS. After A.S.Tanenbaum, Modern Operating Systems 3rd edition Uses content with permission from Assoc. Prof. Florin Fortis, PhD OPERATING SYSTEMS #8 After A.S.Tanenbaum, Modern Operating Systems 3rd edition Uses content with permission from Assoc. Prof. Florin Fortis, PhD MEMORY MANAGEMENT MEMORY MANAGEMENT The memory is one of

More information

PROC FORMAT: USE OF THE CNTLIN OPTION FOR EFFICIENT PROGRAMMING

PROC FORMAT: USE OF THE CNTLIN OPTION FOR EFFICIENT PROGRAMMING PROC FORMAT: USE OF THE CNTLIN OPTION FOR EFFICIENT PROGRAMMING Karuna Nerurkar and Andrea Robertson, GMIS Inc. ABSTRACT Proc Format can be a useful tool for improving programming efficiency. This paper

More information

CSC 261/461 Database Systems Lecture 17. Fall 2017

CSC 261/461 Database Systems Lecture 17. Fall 2017 CSC 261/461 Database Systems Lecture 17 Fall 2017 Announcement Quiz 6 Due: Tonight at 11:59 pm Project 1 Milepost 3 Due: Nov 10 Project 2 Part 2 (Optional) Due: Nov 15 The IO Model & External Sorting Today

More information

BEYOND FORMAT BASICS 1

BEYOND FORMAT BASICS 1 BEYOND FORMAT BASICS 1 CNTLIN DATA SETS...LABELING VALUES OF VARIABLE One common use of a format in SAS is to assign labels to values of a variable. The rules for creating a format with PROC FORMAT are

More information

Parallelizing Windows Operating System Services Job Flows

Parallelizing Windows Operating System Services Job Flows ABSTRACT SESUG Paper PSA-126-2017 Parallelizing Windows Operating System Services Job Flows David Kratz, D-Wise Technologies Inc. SAS Job flows created by Windows operating system services have a problem:

More information

Chapter 11: Indexing and Hashing

Chapter 11: Indexing and Hashing Chapter 11: Indexing and Hashing Basic Concepts Ordered Indices B + -Tree Index Files B-Tree Index Files Static Hashing Dynamic Hashing Comparison of Ordered Indexing and Hashing Index Definition in SQL

More information

AVL 4 4 PDV DECLARE 7 _NEW_

AVL 4 4 PDV DECLARE 7 _NEW_ Glossary Program Control... 2 SAS Variable... 2 Program Data Vector (PDV)... 2 SAS Expression... 2 Data Type... 3 Scalar... 3 Non-Scalar... 3 Big O Notation... 3 Hash Table... 3 Hash Algorithm... 4 Hash

More information

What did we talk about last time? Finished hunters and prey Class variables Constants Class constants Started Big Oh notation

What did we talk about last time? Finished hunters and prey Class variables Constants Class constants Started Big Oh notation Week 12 - Friday What did we talk about last time? Finished hunters and prey Class variables Constants Class constants Started Big Oh notation Here is some code that sorts an array in ascending order

More information

Indexing. Week 14, Spring Edited by M. Naci Akkøk, , Contains slides from 8-9. April 2002 by Hector Garcia-Molina, Vera Goebel

Indexing. Week 14, Spring Edited by M. Naci Akkøk, , Contains slides from 8-9. April 2002 by Hector Garcia-Molina, Vera Goebel Indexing Week 14, Spring 2005 Edited by M. Naci Akkøk, 5.3.2004, 3.3.2005 Contains slides from 8-9. April 2002 by Hector Garcia-Molina, Vera Goebel Overview Conventional indexes B-trees Hashing schemes

More information

Performance Considerations

Performance Considerations 149 CHAPTER 6 Performance Considerations Hardware Considerations 149 Windows Features that Optimize Performance 150 Under Windows NT 150 Under Windows NT Server Enterprise Edition 4.0 151 Processing SAS

More information

CHAPTER 6 Memory. CMPS375 Class Notes (Chap06) Page 1 / 20 Dr. Kuo-pao Yang

CHAPTER 6 Memory. CMPS375 Class Notes (Chap06) Page 1 / 20 Dr. Kuo-pao Yang CHAPTER 6 Memory 6.1 Memory 341 6.2 Types of Memory 341 6.3 The Memory Hierarchy 343 6.3.1 Locality of Reference 346 6.4 Cache Memory 347 6.4.1 Cache Mapping Schemes 349 6.4.2 Replacement Policies 365

More information

Updating Data Using the MODIFY Statement and the KEY= Option

Updating Data Using the MODIFY Statement and the KEY= Option Updating Data Using the MODIFY Statement and the KEY= Option Denise J. Moorman and Deanna Warner Denise J. Moorman is a technical support analyst at SAS Institute. Her area of expertise is base SAS software.

More information

Lecture 12. Lecture 12: Access Methods

Lecture 12. Lecture 12: Access Methods Lecture 12 Lecture 12: Access Methods Lecture 12 If you don t find it in the index, look very carefully through the entire catalog - Sears, Roebuck and Co., Consumers Guide, 1897 2 Lecture 12 > Section

More information

Chapter 6 Memory 11/3/2015. Chapter 6 Objectives. 6.2 Types of Memory. 6.1 Introduction

Chapter 6 Memory 11/3/2015. Chapter 6 Objectives. 6.2 Types of Memory. 6.1 Introduction Chapter 6 Objectives Chapter 6 Memory Master the concepts of hierarchical memory organization. Understand how each level of memory contributes to system performance, and how the performance is measured.

More information

Cheat sheet: Data Processing Optimization - for Pharma Analysts & Statisticians

Cheat sheet: Data Processing Optimization - for Pharma Analysts & Statisticians Cheat sheet: Data Processing Optimization - for Pharma Analysts & Statisticians ABSTRACT Karthik Chidambaram, Senior Program Director, Data Strategy, Genentech, CA This paper will provide tips and techniques

More information

Why Hash? Glen Becker, USAA

Why Hash? Glen Becker, USAA Why Hash? Glen Becker, USAA Abstract: What can I do with the new Hash object in SAS 9? Instead of focusing on How to use this new technology, this paper answers Why would I want to? It presents the Big

More information

Heap Management. Heap Allocation

Heap Management. Heap Allocation Heap Management Heap Allocation A very flexible storage allocation mechanism is heap allocation. Any number of data objects can be allocated and freed in a memory pool, called a heap. Heap allocation is

More information

Chapter 3 - Memory Management

Chapter 3 - Memory Management Chapter 3 - Memory Management Luis Tarrataca luis.tarrataca@gmail.com CEFET-RJ L. Tarrataca Chapter 3 - Memory Management 1 / 222 1 A Memory Abstraction: Address Spaces The Notion of an Address Space Swapping

More information

How Oracle Essbase Aggregate Storage Option. And How to. Dan Pressman

How Oracle Essbase Aggregate Storage Option. And How to. Dan Pressman How Oracle Essbase Aggregate Storage Option And How to Dan Pressman San Francisco, CA October 1, 2012 Assumption, Basis and a Caveat Assumption: Basic understanding of ASO cubes Basis: My chapter How ASO

More information

Format-o-matic: Using Formats To Merge Data From Multiple Sources

Format-o-matic: Using Formats To Merge Data From Multiple Sources SESUG Paper 134-2017 Format-o-matic: Using Formats To Merge Data From Multiple Sources Marcus Maher, Ipsos Public Affairs; Joe Matise, NORC at the University of Chicago ABSTRACT User-defined formats are

More information

50 WAYS TO MERGE YOUR DATA INSTALLMENT 1 Kristie Schuster, LabOne, Inc., Lenexa, Kansas Lori Sipe, LabOne, Inc., Lenexa, Kansas

50 WAYS TO MERGE YOUR DATA INSTALLMENT 1 Kristie Schuster, LabOne, Inc., Lenexa, Kansas Lori Sipe, LabOne, Inc., Lenexa, Kansas Paper 103-26 50 WAYS TO MERGE YOUR DATA INSTALLMENT 1 Kristie Schuster, LabOne, Inc., Lenexa, Kansas Lori Sipe, LabOne, Inc., Lenexa, Kansas ABSTRACT When you need to join together two datasets, how do

More information

Lecture 13. Lecture 13: B+ Tree

Lecture 13. Lecture 13: B+ Tree Lecture 13 Lecture 13: B+ Tree Lecture 13 Announcements 1. Project Part 2 extension till Friday 2. Project Part 3: B+ Tree coming out Friday 3. Poll for Nov 22nd 4. Exam Pickup: If you have questions,

More information

CS 360 Programming Languages Interpreters

CS 360 Programming Languages Interpreters CS 360 Programming Languages Interpreters Implementing PLs Most of the course is learning fundamental concepts for using and understanding PLs. Syntax vs. semantics vs. idioms. Powerful constructs like

More information

Administration Naive DBMS CMPT 454 Topics. John Edgar 2

Administration Naive DBMS CMPT 454 Topics. John Edgar 2 Administration Naive DBMS CMPT 454 Topics John Edgar 2 http://www.cs.sfu.ca/coursecentral/454/johnwill/ John Edgar 4 Assignments 25% Midterm exam in class 20% Final exam 55% John Edgar 5 A database stores

More information

15 Sharing Main Memory Segmentation and Paging

15 Sharing Main Memory Segmentation and Paging Operating Systems 58 15 Sharing Main Memory Segmentation and Paging Readings for this topic: Anderson/Dahlin Chapter 8 9; Siberschatz/Galvin Chapter 8 9 Simple uniprogramming with a single segment per

More information

Reducing SAS Dataset Merges with Data Driven Formats

Reducing SAS Dataset Merges with Data Driven Formats Paper CT01 Reducing SAS Dataset Merges with Data Driven Formats Paul Grimsey, Roche Products Ltd, Welwyn Garden City, UK ABSTRACT Merging different data sources is necessary in the creation of analysis

More information

Preview. Memory Management

Preview. Memory Management Preview Memory Management With Mono-Process With Multi-Processes Multi-process with Fixed Partitions Modeling Multiprogramming Swapping Memory Management with Bitmaps Memory Management with Free-List Virtual

More information

PharmaSUG Paper BB01

PharmaSUG Paper BB01 PharmaSUG 2014 - Paper BB01 Indexing: A powerful technique for improving efficiency Arun Raj Vidhyadharan, inventiv Health, Somerset, NJ Sunil Mohan Jairath, inventiv Health, Somerset, NJ ABSTRACT The

More information

T.I.P.S. (Techniques and Information for Programming in SAS )

T.I.P.S. (Techniques and Information for Programming in SAS ) Paper PO-088 T.I.P.S. (Techniques and Information for Programming in SAS ) Kathy Harkins, Carolyn Maass, Mary Anne Rutkowski Merck Research Laboratories, Upper Gwynedd, PA ABSTRACT: This paper provides

More information

CBS For Windows CDROM Backup System Quick Start Guide Installation Preparation:

CBS For Windows CDROM Backup System Quick Start Guide Installation Preparation: CBS For Windows CDROM Backup System Quick Start Guide Installation If you have your CBS CD Writer Backup system on CD, simply insert the CD. It will automatically start and install the software. If you

More information

Chapter 17. Disk Storage, Basic File Structures, and Hashing. Records. Blocking

Chapter 17. Disk Storage, Basic File Structures, and Hashing. Records. Blocking Chapter 17 Disk Storage, Basic File Structures, and Hashing Records Fixed and variable length records Records contain fields which have values of a particular type (e.g., amount, date, time, age) Fields

More information

Block Device Scheduling. Don Porter CSE 506

Block Device Scheduling. Don Porter CSE 506 Block Device Scheduling Don Porter CSE 506 Logical Diagram Binary Formats Memory Allocators System Calls Threads User Kernel RCU File System Networking Sync Memory Management Device Drivers CPU Scheduler

More information

Block Device Scheduling

Block Device Scheduling Logical Diagram Block Device Scheduling Don Porter CSE 506 Binary Formats RCU Memory Management File System Memory Allocators System Calls Device Drivers Interrupts Net Networking Threads Sync User Kernel

More information

Chapter 12: Query Processing. Chapter 12: Query Processing

Chapter 12: Query Processing. Chapter 12: Query Processing Chapter 12: Query Processing Database System Concepts, 6 th Ed. See www.db-book.com for conditions on re-use Chapter 12: Query Processing Overview Measures of Query Cost Selection Operation Sorting Join

More information

How TokuDB Fractal TreeTM. Indexes Work. Bradley C. Kuszmaul. MySQL UC 2010 How Fractal Trees Work 1

How TokuDB Fractal TreeTM. Indexes Work. Bradley C. Kuszmaul. MySQL UC 2010 How Fractal Trees Work 1 MySQL UC 2010 How Fractal Trees Work 1 How TokuDB Fractal TreeTM Indexes Work Bradley C. Kuszmaul MySQL UC 2010 How Fractal Trees Work 2 More Information You can download this talk and others at http://tokutek.com/technology

More information

Database Systems CSE 414

Database Systems CSE 414 Database Systems CSE 414 Lecture 15-16: Basics of Data Storage and Indexes (Ch. 8.3-4, 14.1-1.7, & skim 14.2-3) 1 Announcements Midterm on Monday, November 6th, in class Allow 1 page of notes (both sides,

More information

Programming Gems that are worth learning SQL for! Pamela L. Reading, Rho, Inc., Chapel Hill, NC

Programming Gems that are worth learning SQL for! Pamela L. Reading, Rho, Inc., Chapel Hill, NC Paper CC-05 Programming Gems that are worth learning SQL for! Pamela L. Reading, Rho, Inc., Chapel Hill, NC ABSTRACT For many SAS users, learning SQL syntax appears to be a significant effort with a low

More information

PROC SQL vs. DATA Step Processing. T Winand, Customer Success Technical Team

PROC SQL vs. DATA Step Processing. T Winand, Customer Success Technical Team PROC SQL vs. DATA Step Processing T Winand, Customer Success Technical Team Copyright 2012, SAS Institute Inc. All rights reserved. Agenda PROC SQL VS. DATA STEP PROCESSING Comparison of DATA Step and

More information

Unlock SAS Code Automation with the Power of Macros

Unlock SAS Code Automation with the Power of Macros SESUG 2015 ABSTRACT Paper AD-87 Unlock SAS Code Automation with the Power of Macros William Gui Zupko II, Federal Law Enforcement Training Centers SAS code, like any computer programming code, seems to

More information

Chapter 12: Query Processing

Chapter 12: Query Processing Chapter 12: Query Processing Overview Catalog Information for Cost Estimation $ Measures of Query Cost Selection Operation Sorting Join Operation Other Operations Evaluation of Expressions Transformation

More information

Optimizing System Performance

Optimizing System Performance 243 CHAPTER 19 Optimizing System Performance Definitions 243 Collecting and Interpreting Performance Statistics 244 Using the FULLSTIMER and STIMER System Options 244 Interpreting FULLSTIMER and STIMER

More information

Hash-Based Indexing 1

Hash-Based Indexing 1 Hash-Based Indexing 1 Tree Indexing Summary Static and dynamic data structures ISAM and B+ trees Speed up both range and equality searches B+ trees very widely used in practice ISAM trees can be useful

More information

Query Processing & Optimization

Query Processing & Optimization Query Processing & Optimization 1 Roadmap of This Lecture Overview of query processing Measures of Query Cost Selection Operation Sorting Join Operation Other Operations Evaluation of Expressions Introduction

More information

Simple Rules to Remember When Working with Indexes

Simple Rules to Remember When Working with Indexes Simple Rules to Remember When Working with Indexes Kirk Paul Lafler, Software Intelligence Corporation, Spring Valley, CA Abstract SAS users are always interested in learning techniques related to improving

More information

General Tips for Working with Large SAS datasets and Oracle tables

General Tips for Working with Large SAS datasets and Oracle tables General Tips for Working with Large SAS datasets and Oracle tables 1) Avoid duplicating Oracle tables as SAS datasets only keep the rows and columns needed for your analysis. Use keep/drop/where directly

More information

Operating Systems Unit 6. Memory Management

Operating Systems Unit 6. Memory Management Unit 6 Memory Management Structure 6.1 Introduction Objectives 6.2 Logical versus Physical Address Space 6.3 Swapping 6.4 Contiguous Allocation Single partition Allocation Multiple Partition Allocation

More information

APPENDIX 3 Tuning Tips for Applications That Use SAS/SHARE Software

APPENDIX 3 Tuning Tips for Applications That Use SAS/SHARE Software 177 APPENDIX 3 Tuning Tips for Applications That Use SAS/SHARE Software Authors 178 Abstract 178 Overview 178 The SAS Data Library Model 179 How Data Flows When You Use SAS Files 179 SAS Data Files 179

More information

CMPSCI 105: Lecture #12 Searching, Sorting, Joins, and Indexing PART #1: SEARCHING AND SORTING. Linear Search. Binary Search.

CMPSCI 105: Lecture #12 Searching, Sorting, Joins, and Indexing PART #1: SEARCHING AND SORTING. Linear Search. Binary Search. CMPSCI 105: Lecture #12 Searching, Sorting, Joins, and Indexing PART #1: SEARCHING AND SORTING Linear Search Binary Search Items can be in any order, Have to examine first record, then second record, then

More information

Streamline Table Lookup by Embedding HASH in FCMP Qing Liu, Eli Lilly & Company, Shanghai, China

Streamline Table Lookup by Embedding HASH in FCMP Qing Liu, Eli Lilly & Company, Shanghai, China ABSTRACT PharmaSUG China 2017 - Paper 19 Streamline Table Lookup by Embedding HASH in FCMP Qing Liu, Eli Lilly & Company, Shanghai, China SAS provides many methods to perform a table lookup like Merge

More information

PROC FORMAT. CMS SAS User Group Conference October 31, 2007 Dan Waldo

PROC FORMAT. CMS SAS User Group Conference October 31, 2007 Dan Waldo PROC FORMAT CMS SAS User Group Conference October 31, 2007 Dan Waldo 1 Today s topic: Three uses of formats 1. To improve the user-friendliness of printed results 2. To group like data values without affecting

More information

Base and Advance SAS

Base and Advance SAS Base and Advance SAS BASE SAS INTRODUCTION An Overview of the SAS System SAS Tasks Output produced by the SAS System SAS Tools (SAS Program - Data step and Proc step) A sample SAS program Exploring SAS
