STOP MERGING AND START COMBINING

Robert S. Nicol
U.S. Quality Algorithms

There are many ways to combine data within the SAS system. Probably the most widely used method is the merge. While the merge is very powerful, it is often misunderstood and overused by the novice SAS programmer. This paper illustrates some of the common data combination methods and provides guidelines for choosing the most effective technique. The data combination routines covered by this paper are formats, the merge statement, the update statement, the set statement, proc SQL, proc append, and proc datasets. This paper is intended to be used as an ongoing rapid reference and as such is designed to get you quickly to the relevant facts and syntax.

The best way to insure the results of any project is to start with a plan. This axiom also holds true in the design of a program. An integral part of planning a program is having a thorough understanding of the available data. Armed with your knowledge of the data and your operating environment, the methods used to combine the data should be evident. This paper reviews the processes that can be applied under three distinct sets of conditions:

When you need to combine entire observations.
When one file is a 'master', that is, you need to end up with only those original observations.
When the data drive the resultant observations.

PRIMARY DECISION FACTORS

How are the observations related to one another? Should the observations remain intact? Are they related ordinally? Is one file a 'master' with fields added or modified based upon other files? If the files are related by 'keys', which observations are to survive? How do the fields in the source files relate to the resultant file? Are all of the fields in all of the files required? Does the same field appear in more than one file?

Other considerations: If the files are related by 'keys', are the files sorted or indexed by those keys? How much memory is available to you? How often does this combination have to be performed? How large are the files? How often does the data change? The code must be maintainable.

The methods to follow are presented in increasing order of preference. However, the conditions at your site may dictate a different order. Additionally, the methods presented here will combine all observations and all variables. Should you need to modify the observations, you may be able to use data set options (where, obs, firstobs) or sub-setting statements (if, where, delete). The variables in the output data set may be controlled by data set options (drop, keep, rename) or statements (drop, keep, rename).
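As a sketch of the options and statements above (the data set and variable names are hypothetical):

```sas
/* Hypothetical sketch: data set options and statements control
   which observations and variables reach the output data set.  */
data subset;
   set one(keep=key region sales where=(region='EAST'))
       two(obs=100 rename=(amt=sales));
   if sales > 0;       /* sub-setting if statement */
   drop region;        /* drop statement          */
run;
```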
COMBINING ENTIRE OBSERVATIONS

INTERLEAVING FILES

Interleaving is the process of combining entire data sets into one resultant data set. The order of the observations is controlled via a "by" variable. The total number of observations in the final data set is equal to the total observations in the contributing files.

set statement
advantage: The command is safe and straightforward.
drawback: All files must be presorted or indexed.

data inter;
   set one two;
   by key;
run;

NOTE: The merge command can also be used to interleave data sets. However, if 'by variable' matches are found, the resultant file will be in error: matching observations are paired into a single observation (with overlapping values taken from the rightmost data set) instead of being stacked.

CONCATENATION

Concatenation is the process of combining entire data sets into one resultant data set. All of the observations in one data set are followed by all of the observations in the other data set. The order of the observations is controlled by the order of the data sets in the concatenation command. The total number of observations in the final data set is equal to the total observations in the contributing files.

proc append
advantage: As the 'base' data set is not rewritten, the computer resource requirements are significantly reduced.
drawback: No other processing can be performed. The base data set must be available for modification. If the step abends, the base data set may be damaged.

proc append base=one data=two force;
run;

set statement
advantage: As the command is part of a data step, further processing can be accomplished without starting a new data or proc step.
drawback: All observations are read in and then written out, which requires CPU usage and workspace.

data concat;
   set one two;
run;

proc datasets
advantage: Since the 'base' data set is not rewritten, computer resource requirements are significantly reduced.
drawback: Only other 'proc datasets' processes can be performed. The base data set must be available for modification. If the proc abends, the base data set may be damaged.
proc datasets library=work;
   append base=one data=two force;
run;
quit;

THE OBSERVATIONS IN ONE FILE ARE THE ONLY REQUIRED OBSERVATIONS

The observations in the 'base' data set are to be the only observations in the resultant file. The fields in each observation are a combination of the fields in the source files.
proc sql
advantage: The files do not have to be presorted.
drawback: The proc requires large amounts of work space. The syntax becomes tiring if the files have a large number of fields. The secondary files must be pre-selected to have unique keys.

proc sql;
   create table oneto1 as
   select one.key, two.field_a
   from one, two
   where one.key = two.key
   union
   select key, '' as field_a     /* field_a assumed character */
   from one
   where key not in (select key from two)
   order by key;
quit;

merge statement
advantage: The statement handles large numbers of fields and files with simple syntax. As the command is part of a data step, further processing can be accomplished without starting a new data or proc step.
drawback: All files must be presorted or indexed. The secondary files must be sorted with the 'nodupkey' option.

NOTE: If the same variable occurs in more than one data set, then the resultant value will come from the rightmost data set in the merge statement.

data manyto1;
   merge one(in=in_one) two;
   by key;
   if in_one;
run;

update statement
advantage: Allows multiple 'transaction' records per 'master'. As the command is part of a data step, further processing can be accomplished without starting a new data or proc step.
drawback: Fields cannot be added to the master file. All files must be presorted or indexed.

data manyto1;
   update one(in=in_one) two;
   by key;
   if in_one;
run;

formats
advantage: As the format(s) is created from the transaction file, the primary file does not have to be sorted or indexed. It may be possible not to modify the base file at all, but merely apply the formats at time of output. Because the format is applied within a data step, further processing can be accomplished without starting a new data or proc step. If the transaction file does not change often, consider saving the format(s) permanently.
drawback: If many fields are to be added, the overall complexity of the code and the CPU resources required will become burdensome.

NOTE: Consider writing a macro to create formats from data sets, thus reducing the coding required to create formats.

data cntltwo(keep=fmtname type start label hlo);
   set two end=eof;
   format start $5.;
   fmtname='two_fmt';
   type='c';
   hlo=' ';
   start=key;
   label=field_a;
   output;
   if eof then do;      /* map all other keys to blank */
      hlo='O';          /* 'O' = OTHER                 */
      label=' ';
      output;
   end;
run;

proc format cntlin=cntltwo;
run;

data withform;
   set one;
   from_2 = put(key, $two_fmt.);
run;

RESULTANT OBSERVATIONS ARE DEPENDENT UPON THE KEYS

In some situations you may need to let the data do the talking. That is, the number of records in the final data set will be a direct result of the number of matching keys in the source files.

One-to-Many

The one-to-many combination is typified by combining one record from a reference file with many observations from a detail file.

proc sql
advantage: Files do not have to be presorted.
drawback: Requires large amounts of work space. The syntax becomes tiring if the files have a large number of fields.

proc sql;
   create table manyto1 as
   select one.key, one.source1, two.field_a, two.source2
   from one, two
   where one.key=two.key
   union
   select key, source1,
          '' as field_a, '' as source2   /* assumed character */
   from one
   where key not in (select key from two)
   order by key;
quit;

merge statement
advantage: Many fields and files can be handled with simple syntax. As the command is part of a data step, further processing can be accomplished without starting a new data or proc step.
drawback: All files must be presorted or indexed.

data manyto1;
   merge one(in=in_one) two(in=in_two);
   by key;
   if in_one and in_two;
run;

Many-to-Many

Many-to-many merges occur when you combine files that each have multiple occurrences of the key values.

Caution: In many years of experience in different industries, I have noted that most many-to-many merges are caused by invalid data (or a lack of understanding of the data) rather than a true need to perform a many-to-many combination.

proc sql
advantage: It works. Files do not have to be presorted.
drawback: Requires large amounts of work space. The syntax becomes lengthy if the files have many fields.

proc sql;
   create table manytom as
   select two.*, three.source3, three.field_a
   from two(rename=(field_a=fld_a2)), three
   where two.key=three.key
   order by key, fld_a2;
quit;
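As a small worked sketch of the many-to-many behaviour (hypothetical two-row inputs): each key=1 row in the first file joins with each key=1 row in the second, so the SQL join yields 2 x 2 = 4 observations.

```sas
/* Hypothetical sketch: a many-to-many join multiplies matching keys. */
data two;
   key=1; field_a='a'; output;
   key=1; field_a='b'; output;
run;

data three;
   key=1; source3='x'; output;
   key=1; source3='y'; output;
run;

proc sql;
   create table manytom as
   select two.*, three.source3
   from two, three
   where two.key = three.key;   /* 2 x 2 = 4 observations */
quit;
```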
DO NOT USE MERGE FOR MANY TO MANY

The merge statement uses the data from the rightmost data set that is still contributing data. Although the system only flags many-to-many merges with a note in the log, this will usually create erroneous data.

Roll your own

You may write your own 'many-to-many combiner'. This can be accomplished by performing a series of one-to-many merges, by using indexes and pointers, or by...

POINTS TO REMEMBER

While programming in SAS there is no substitute for reading the log and testing your logic. If you start off knowing your data (repeats of key values, etc.) and the result that you need (only the observations in the base data set, etc.), then choosing the data combination technique should be straightforward. If a method does not feel right, it may not be. Test it. Then compare the results and the resource usage.

CONTACT INFORMATION

Robert S. Nicol
telephone: 215-654-5813 (day)
           610-944-0884 (evening)
fax: 215-654-6007