A SAS Programmer's Guide to Project- and Program-Level Quality Control Paul Gorrell, Social & Scientific Systems, Inc., Silver Spring, MD


ABSTRACT

The focus of this paper is how to save time and money (and reduce frustration!) by (1) including intelligent and informative input and output checks in SAS programs, (2) generating quality-control information for program modules and projects, and (3) using SAS to create program and project documentation. At both the project and program level, the focus will be on quality control as data analysis. This approach is uniquely suited to using the numerous SAS procedures and reporting mechanisms (including programmer-specified LOG messages) to provide the programmer and project manager with the necessary information for verifying the correctness of program output and deliverables.

INTRODUCTION

The SAS programmer is in the fortunate position of having a variety of excellent data-analysis tools at hand. As stated in the abstract, this paper will focus on quality control as a form of data analysis, and SAS is a wonderful data-analysis tool. This means that, as a SAS programmer, you don't need to develop a separate set of skills in order to incorporate good quality-control checks into program- and project-level work. In this paper I'll first develop a general approach to quality control at the program level, and then apply this approach at the project level. The main difference is that, at the program level, the emphasis is on identifying patterns and drawing conclusions about the data, whereas at the project level, the emphasis is on metadata. The metadata used here will be dictionary tables and views that are generated automatically by the SAS system. This is a wealth of information that is easily made available using SAS programming techniques that most SAS programmers have already mastered.
A recurring theme of this paper is that you, as a SAS programmer, already have the skills to run important quality-control checks, and to contribute valuable information at the project level. I'm going to distinguish between three types of quality control [QC]:

(1) a. INPUT QC
    b. PROGRAM QC
    c. OUTPUT QC

Input QC is the process of learning about the input data to your program. It's important to distinguish between the properties the data is supposed to have and the properties it actually has. It is also important for the SAS programmer to have a good general sense of what the data means and how it will be used. This knowledge allows the programmer to integrate more-intelligent QC measures into the programs than would be possible by blindly following specs.

Program QC refers to writing programs so that quality-control information is available both from intermediate stages of the program and in the final output. For example, merges are crucial areas of a program requiring QC. I will give examples where the merge DATA step can be written to show important information about the merge.

Output QC is what most programmers are thinking about when QC is mentioned. That is, Output QC is looking at the output data sets to make sure that they have the intended properties. There are various aspects of this. There's the basic level of making sure that 2+2=4. There are also relationships between variables that are determined by external factors, e.g. IF MONTH = 'SEPT' THEN (1 <= DAY <= 30). And there are variable values and relationships to check that are specific to the particular data set being input or generated, e.g. number of children in the family was recorded only if they were living at home. I will discuss a number of specific SAS programming examples, but the goal of this paper is to emphasize as much as possible the general principles which lead to more-effective quality control.
Throughout I will focus on a common situation for a SAS programmer: you have an input SAS data set (or sets) and you have specs to modify this data set in some way and output a new data set. Of course there are numerous other situations (e.g. reading in flat files, outputting to Excel spreadsheets, etc.) that require QC. I hope that I am able to communicate my basic approach in such a way that the programmer can readily apply it to situations not covered here.

One thread that will run through this discussion is what we might call content-based programming. By that I mean that it is well worth the time for a SAS programmer to learn as much as possible about the project the programs are a part of, the type of people who will be working with the data, and the general goals which motivate the particular specs. A lot of time and money can be saved if the programmer is in a position to catch content-based problems before they are delivered to the client. For example, I work a lot with health-care survey data and it is fairly easy to figure out that males should not have valid responses to pregnancy questions. It is less obvious that some questions (e.g. "Have you fully recovered?") are only asked of people with certain types of medical conditions. An added benefit to content-based programming is that it makes the work more interesting. The more interested I am, the more alert and engaged I am with the data and the programs. So it's efficient in the long run for the programmer to take some time before ever writing a line of code to find out as much as possible about the data, and to plan ahead with QC in mind.

PROGRAM-LEVEL

An individual program may be more or less complex. For purposes of quality control, a long, complex program should be divided into sections or modules, with the approach to Input QC and Output QC applied to those sections rather than just to the full program. The idea is to keep things clear and limit the search space for error detection and debugging when problems occur.

INPUT QC

A lot of old clichés are exactly right when it comes to Input QC ("A stitch in time saves nine", "Penny wise but pound foolish", etc.). I'll add two more:

(2) a. Don't overlook the obvious.
    b. Verify what you're told.

The word verify is going to come up quite often in this paper because a lot of basic QC comes down to verifying that what you think is true, or have been told is true, actually is true.
One of the great enemies of good QC is time pressure, and this can start right at the beginning by not taking the time to take a good look at the input data. SAS provides a wonderful tool for initially examining input data sets: the CONTENTS procedure. It is absolutely essential to run a PROC CONTENTS to see, at least, the following:

(3) a. Number of Variables
    b. Number of Observations
    c. Sorted?
    d. Engine (V9?)
    e. Variable Name
    f. Variable Type (Numeric/Character?)
    g. Variable Length
    h. Variable Labels
    i. Variable Formats

Initially it's important to know if this general information agrees with the documentation you have about the file. If it doesn't, then you need to straighten this out before going any further (perhaps you have the wrong file, one that's similar to the right file). If the general information (e.g. number of variables and observations) appears to be as it should be, you can look at specifics such as variable names and data types. Whether a variable is numeric or character is not always predictable from its name. An ID variable that consists only of numbers might be either numeric or character. A flag variable whose values are only 0 or 1 might be numeric or character. It's obviously important when you start writing code to know what type of variables you have.
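Nothing elaborate is needed for this first look; a plain call prints all of the items in (3). A minimal sketch (the data set name DSN_IN is a placeholder):

```sas
/* First look at an input data set; DSN_IN is a placeholder name */
PROC CONTENTS DATA= DSN_IN;
RUN;
```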

When working with numeric data it's important to know the variables' length. The default is 8 bytes, but this is often reduced to save disk space. If you are going to be modifying the values of numeric data, it's especially important to know variable length. Here's a simple program for taking a first look.

(4) PROC CONTENTS DATA= DSN_IN
       OUT= CHECK1 (KEEP= NAME TYPE LENGTH LABEL);
    PROC FREQ DATA= CHECK1;
       TABLES LENGTH;
       WHERE TYPE = 1;

The output data set CHECK1 will have 4 variables: NAME (= the names of the variables on the DSN_IN data set); TYPE (1 if numeric, 2 if character); LENGTH; and LABEL. The PROC FREQ outputs a table showing the frequency of numeric variables (WHERE TYPE = 1) by LENGTH. There are two reasons to care about numeric length: (i) if it is less than 8 bytes and you will be increasing any values, you need to make sure that the length specification is sufficient for the updated values; (ii) if you are concerned about saving disk space, you may want to consider reducing numeric length for some variables. For more-detailed discussion of this topic, see my paper on SAS numeric data in these Proceedings.

The situation is similar for character data. Suppose you have a character variable RESPONSE that has a LENGTH of 3 on the input data set and 3 values: "YES", "NO" and "DK" (where "DK" indicates "Don't Know"). You're asked to edit this variable as in (5):

(5) IF RESPONSE = 'DK' THEN RESPONSE = 'MAYBE';

Because LENGTH is a variable attribute that, by default, is inherited by one data set from another, the LENGTH of 3 will cause the new value to be "MAY" instead of "MAYBE". This is one of many situations where a little time spent reading before typing can save a lot of time later ("Anyone remember which program we edited that RESPONSE variable in?"). It's important to become familiar with variable labels because they are often important sources of information about the intended content and use of the variables.
Sometimes labels offer abbreviated explanations of particular values ("1 IF NUMERIC, 2 IF CHARACTER"). If your editing adds to, or changes, these values, you will need to modify the label as well. Even the best specs aren't perfect, and the programmer who has taken (or been given) the time to understand the context of the data and the specs can often spot additional modifications that may be needed. Often these take the form, "If you change X then you should also change Y." This is another potential time saver and one that is often appreciated by the client writing the specs.

Formats are also important for a number of reasons. If numeric variables are associated with formats, this may affect the operation of particular PROCs. It's good to verify that associated formats should be kept, and to communicate any potential implications. If formats indicate value ranges, it's good to verify that these ranges are meaningful for the data you have. For example, if the format values are [-1 = '-1'], [0 = '0'] and [1-100 = '1-100'] but the actual range of positive values on the input data set is 40-60, you should check whether or not this format range should be changed. For this type of case, if you only ran a formatted FREQ, you would never see any problem because the range of actual values is within the format range. The output would only show a certain number of records with each of the format values.
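Returning to the RESPONSE example in (5): one way to avoid the truncation, assuming the edited value really should be "MAYBE", is to declare a wider LENGTH before the SET statement so the inherited attribute is overridden. A minimal sketch (data set names are placeholders):

```sas
DATA DSN_OUT;
   LENGTH RESPONSE $ 5;          /* widen before SET so 'MAYBE' fits */
   SET DSN_IN;
   IF RESPONSE = 'DK' THEN RESPONSE = 'MAYBE';
RUN;
```

Because the LENGTH statement appears before the SET statement, it wins over the 3-byte length inherited from DSN_IN.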

Another important piece of information to know about an input data set is which variable(s) uniquely identify observations, i.e. the key variable (often, but not always, this is the SORTED BY variable). If you are told that PERSON_ID uniquely identifies rows, then you should run the following check:

(6) PROC SORT DATA= DSN_IN OUT= CHECK2 NODUPKEY;
       BY PERSON_ID;

If PERSON_ID really is the key variable, then the LOG will show that 0 observations have been deleted.

"Don't overlook the obvious" takes many forms. For example, if you have MONTH and DAY variables, you could include conditionals such as (7) as part of a DATA step.

(7) IF (MONTH = 'SEPT') AND NOT (1 <= DAY <= 30) THEN PUT ID= MONTH= DAY= ;

You can also use format ranges to test for valid and invalid values for particular variables. Part of the importance of knowing what properties the data SHOULD have concerns how various types of non-response values are coded. For example, what is the value for a MONTH variable if this information is unknown? For many surveys, responses such as Don't Know or refusals are coded as specific minus values (-8, -7, etc.) or out-of-range values (e.g. 99). Knowing what these are is essential before any code is written that will edit or re-code the input data. See Ron Cody's book on data cleaning for many useful approaches to checking the values of input variables.

Here's an example of the "verify what you are told" theme: suppose that you have a person-level data set (key= PERSON_ID) and each person is a member of a family, identified by FAMILY_ID. Both variables are character, with PERSON_ID having LENGTH 4 and FAMILY_ID LENGTH 2. PERSON_ID is composed of FAMILY_ID plus a two-character sequence number. The data set would look like (8).

(8) FAMILY_ID   PERSON_ID
    01          0101
    01          0102
    02          0201

One generalization is that the first two characters of PERSON_ID = FAMILY_ID.
You can verify this as follows:

(9) IF FAMILY_ID NE SUBSTR(PERSON_ID,1,2) THEN DO;
       PUT '***** ID ERROR *****' ;
       PUT FAMILY_ID= PERSON_ID= ;
    END;

This will print a message to the LOG for each record where FAMILY_ID is not the first two characters of PERSON_ID. If you suspect you have a lot of bad data and don't want to fill up your LOG, you can initially set a flag variable (e.g. ID_ERR) that is assigned a value of 1 if the conditional evaluates as true. Then you can run a FREQ on ID_ERR or use it to create an output data set.
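The flag-variable version of the check in (9) might look like the following sketch (data set names are placeholders):

```sas
DATA CHECK (KEEP= FAMILY_ID PERSON_ID ID_ERR);
   SET DSN_IN;
   /* 1 if the first two characters of PERSON_ID don't match FAMILY_ID */
   ID_ERR = (FAMILY_ID NE SUBSTR(PERSON_ID,1,2));
RUN;

PROC FREQ DATA= CHECK;
   TABLES ID_ERR;                /* expect every record to show ID_ERR=0 */
RUN;
```

The FREQ gives a one-line summary of how widespread the problem is, and a WHERE ID_ERR=1 subset of CHECK isolates the bad records for inspection.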

This type of approach is much better than dumping 50 or so observations to 'check' on variable values or the relationships between variables. I will discuss this further in the section on Output QC, but it is relevant here as well. Printing a small number of records is an amazingly common practice among SAS programmers. It is often requested by clients. It is useful for getting a kind of 'at-a-glance' initial sense of the data. It may even make you aware of a problem that would never have occurred to you to check on. But it is never a sufficient QC check, for the simple reason that, if you print 50 records, there may be a problem with records 51 through 100,000. Here's where 'QC as data analysis' comes in most clearly. Good QC, like good data analysis, allows you to draw conclusions about the data. From a QC perspective, you want to be able to say, with confidence, that the first two characters of each PERSON_ID on DSN_IN = FAMILY_ID. The only way to put yourself in a position to say that is to examine all the records. The only reasonable way to do this is with a DATA or PROC step. I recently worked with an input file with 4 different ID variables. The file has 323,536 records. On each record, a substring of each of the ID variables is supposed to have the same 3-digit person identifier. This turned out to be true for all but one record. But it was important to catch that one error and correct it.

PROGRAM QC

The point of Program QC is to write DATA steps (e.g. merges) and include PROCs in such a way that you are outputting information that is necessary for evaluating the correctness of the program and its output. There are a couple of SAS options that I would recommend either as system or program options. The first is ERRORABEND. This will cause SAS to abend for errors where it would otherwise (i.e. if NOERRORABEND is set) write a LOG ERROR note, set OBS=0 and go into syntax-checking mode.
I rarely find syntax checking after the initial error very useful, and it is much easier for me to focus on correcting the one error that caused SAS to abend. The LOG of any running program should be free of ERROR messages, even those associated with errors that "don't matter." Chapter 2 of Michele Burlew's book on debugging SAS programs has good information on interpreting and correcting Note, Warning and Error messages in SAS LOGs. Another option I would recommend, not available in V6, is MERGENOBY=ERROR. If you fail to include a BY statement when merging data sets, SAS will write the following message to the LOG.

(10) ERROR: No BY statement was specified for a MERGE statement.

This is a very useful option which I have in my AUTOEXEC file. I never intentionally merge data sets without a BY statement (even one-to-one merges) so =ERROR works best for me. You can also set MERGENOBY=WARN to get a warning rather than an error message. In (11) I illustrate a merge DATA step where temporary variables containing information about the merge are written to the LOG.

(11) DATA THREE;
        MERGE ONE (IN= A) TWO (IN= B) END= ITSOVER;
        BY ID;
        IF A AND B THEN MTCH+1;
        ELSE IF A THEN JUSTA+1;
        ELSE IF B THEN JUSTB+1;
        IF ITSOVER THEN PUT MTCH= JUSTA= JUSTB= ;
        DROP MTCH JUSTA JUSTB;

END= creates a temporary variable indicating the end of the file with the largest number of observations. Here this variable is named ITSOVER and the LOG message is only written when the last observation is processed. At this point the value of the variable MTCH equals the number of observations where ONE and TWO match BY ID. The value of JUSTA is the number of non-matches in ONE. The value of JUSTB is the number of non-matches in TWO. Although it's often the case that the number of OBS in the output data set indicates the number of matches, if a subsequent problem is noticed, then knowing the number of non-matches can be important. When possible I try to have enough information to be able to make predictions about the outcome of a merge. Is it a one-to-one merge where all records should match? Is it a one-to-many match? I can't always predict what the exact values of MTCH, JUSTA and JUSTB should be, but I can usually make some predictions, even if it's simply that there should be very few non-matches.

Let's take a different type of example, one where Input and Program QC will help you write the correct code without trial and error. Suppose you have the spec in (12).

(12) Create a variable AGECAT. If a person is under 18 then AGECAT = CHILD. If the person is 18 or older, then AGECAT = ADULT.

Without having looked at the data, it might be tempting to code this as:

(13) IF AGE < 18 THEN AGECAT = 'CHILD';
     ELSE AGECAT = 'ADULT';

But as part of Input QC, you noticed that there were both missing values and zero values. The conditional in (13) will code both of these values as 'CHILD.' This may be what is intended for 0, but certainly not for missing values. Another way to catch this type of problem is to always include a QC crosstab after re-coding. This crosstab should have the basic form BEFORE*AFTER. Of course the trick to such a crosstab is how to format the BEFORE variable.
For something like AGE (with perhaps 100 different values) it wouldn't be too annoying to leave it unformatted. But that isn't often a realistic option. Here you could run a PROC to get the MIN and MAX values (perhaps using a WHERE AGE GT 0 clause) and then format AGE as MISSING, 0, 1-17 and 18-95 (assuming that 95 was the maximum age on the data set). Then you could modify (13) as in (14). Of course, once you know to ask, it may be a simple check to find out if 0 values should be coded as 'CHILD.' If so, then (14) can be simplified to include 0 in the CHILD range.

(14) IF (1 <= AGE <= 17) THEN AGECAT = 'CHILD';
     ELSE IF (18 <= AGE <= 95) THEN AGECAT = 'ADULT';
     ELSE IF AGE = 0 THEN AGECAT = 'ZERO VALUE';
     ELSE IF AGE = . THEN AGECAT = 'MISSING';
     ELSE AGECAT = '?????';

A useful generalization (which I owe to Ed Heaton) is to avoid open-ended conditionals such as IF AGE < 18. If you require that each end be specified, it forces you to learn more about the data so that you know which values to use for the lower and upper bounds. The final line in (14) points to another part of good programming: conditionals should always exhaust the logical possibilities. The final ELSE gives you control over the unexpected. You will have a clear indicator ("?????") that there are AGE values you haven't accounted for.

Often the best you can do for quality control is to test that two different methods lead to the same result. SAS is great for this because there's usually more than one way to do any particular operation. Consider an example where you have a person-level data set with three expenditure variables: OOP (for out of pocket); PRV (for private); and PUB (for public). The spec is to create a person-level total expenditure variable TOTEXP. Here's a way of building good QC into the DATA step (assume there's a person-level ID variable).

(15) DATA TWO;
        SET ONE END= ITSOVER;
        TOTEXP = (OOP+PRV+PUB);
        SUMTOTXP+TOTEXP;
        TOTCHK = SUM(OOP,PRV,PUB);
        SUMTOTCK+TOTCHK;
        SUMOOP+OOP;
        SUMPRV+PRV;
        SUMPUB+PUB;
        IF TOTEXP NE TOTCHK THEN PUT ID= TOTEXP= TOTCHK= ;
        IF ITSOVER THEN DO;
           IF SUMTOTXP NE SUMTOTCK THEN PUT ID= SUMTOTXP= SUMTOTCK= ;
           ELSE PUT 'SUM TOTALS OK';
           IF SUMTOTXP NE SUM(SUMOOP,SUMPRV,SUMPUB)
              THEN PUT ID= SUMTOTXP= SUMOOP= SUMPRV= SUMPUB= ;
           ELSE PUT 'SUMTOTXP OK' ;
        END;

On the record (i.e. person) level, this DATA step creates a total expenditure variable TOTEXP. As a check on this we create another total expenditure variable TOTCHK with the SUM function. They should be equal, and we test for this on each record (IF TOTEXP NE TOTCHK...). As an added check, we sum these totals and their components (OOP, PRV and PUB) for the full data set and then put in conditionals to make sure that the sums of the totals equal the totals of the sums (i.e. everything adds up the way it should). Clearly there is more that could be built in here, and other ways (e.g. using summary PROCs) to check on these variables. But, even though this DATA step is more complicated than it would be without the QC, it is easier than coming back later and trying to figure out why something's not adding up. The extra assignment statements and conditionals in this DATA step put you in a position where you can say, with confidence, that for each record the TOTEXP variable accurately sums the OOP, PRV and PUB variables. Also, if program LOGs are part of what you deliver to the client, you now have specific program output that you can point to when backing up your assertions about the programs and the data.

Another common area of SAS programming is creating a new file or variable that's at a different level (e.g. person, family, etc.) than the input file. As an example, assume that you have a person-level file with family and person ID variables, plus an age variable. The spec is in (16).
(16) Create a family-level variable FAM_AGE that has a value of 1 if at least one member of the family is 65 or older. Otherwise FAM_AGE = 0.

A common way to do this in SAS is shown in (17). I haven't included the QC parts of the merge because (17) is illustrating common practice.

(17) PROC SORT DATA= ONE;
        BY FAMILY_ID;
     DATA TWO (KEEP= FAMILY_ID FAM_AGE);
        SET ONE;
        BY FAMILY_ID;
        RETAIN FAM_AGE;
        IF FIRST.FAMILY_ID THEN FAM_AGE = 0;
        IF AGE >= 65 THEN FAM_AGE = 1;
        IF LAST.FAMILY_ID;
     PROC SORT DATA= TWO;
        BY FAMILY_ID;
     DATA THREE;
        MERGE ONE TWO;
        BY FAMILY_ID;

In the section on Output QC I will discuss running a crosstab to test the output of this program. But, as we saw with the example using expenditure variables, we can build in a check by creating the FAM_AGE variable another way, as in (18), which assumes that data set ONE is sorted by FAMILY_ID.

(18) DATA CHKFAM;
        SET ONE (IN= A) ONE;
        BY FAMILY_ID;
        RETAIN FAM_AGE;
        IF A THEN DO;
           IF FIRST.FAMILY_ID THEN FAM_AGE = 0;
           IF AGE >= 65 THEN FAM_AGE = 1;
        END;
        ELSE OUTPUT;

In this DATA step the data set ONE is read in twice BY FAMILY_ID. That is, a family group of records is read in from ONE (IN= A), and the FAM_AGE variable is created based on the AGE variable. This variable is RETAINed, and its value on the last record for this family 'spills over' into the records for this same family when ONE is read in again BY FAMILY_ID. The trick here is that this second pass through the family group doesn't trigger the THEN FAM_AGE = 0 conditional, because the value of FAMILY_ID doesn't change, so it's only records on the IN= A data set where FIRST.FAMILY_ID = 1. If you want to see this concretely, create a small data set of 3 or 4 families and place (19) just before the RUN statement.

(19) PUT FAMILY_ID= AGE= FAM_AGE= FIRST.FAMILY_ID= LAST.FAMILY_ID= ;

The DATA step in (18) isn't directly generating any QC output. But it is creating an output data set that will allow OUTPUT QC to yield some results, as we'll see in the next section.
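To run that concrete experiment, a small test version of ONE might look like the following sketch (the values are purely illustrative); read it with (18) plus the PUT statement in (19) to watch FAM_AGE spill over:

```sas
/* Small test file for tracing (18)/(19); already sorted BY FAMILY_ID */
DATA ONE;
   INPUT FAMILY_ID $ AGE;
   DATALINES;
01 70
01 12
02 30
02 41
03 66
03 68
;
RUN;
```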

The general structure, somewhat simplified, of many SAS programs is to input a data set, create a series of temporary data sets, and output a permanent data set. We can illustrate this in (20).

(20) I → T1 → ... → Tn → O

It's a good rule of thumb to generate QC output (either as LOG messages or PROC output) for each temporary data set the program generates. That way, if you have to work your way back from some problematic output to its source, you have a series of snapshots to guide the search.

OUTPUT QC

Let's begin this section with the last example from the previous section. We've created a family-level variable FAM_AGE on a person-level file. How do we check the correctness of our code? The first question is: Given the specs, what SHOULD be true? It SHOULD be the case that all records on the file where AGE is 65 or greater have a FAM_AGE value of 1. That is, if you're 65 or older you must be in a family that has a member who is 65 or older. So, if we run a crosstab FAM_AGE*AGE (where AGE is formatted as 'UNDER 65' and '65 OR OLDER'), we should find that for all records where AGE = '65 OR OLDER', FAM_AGE = 1. But what can we say about the records where AGE is less than 65? Not much directly. This is where the DATA step in (18) comes in. We now have two data sets that should be identical, even though the family-level variable FAM_AGE was created in two different ways. This is a time to use PROC COMPARE. We can use it in its simplest form, as in (21).

(21) PROC COMPARE BASE= THREE COMPARE= CHKFAM;

The PROC COMPARE output should say that there are no observations with variables of unequal value. Note that this is also an added check on the merge in (17). That is, data set THREE should be equal to ONE plus the new FAM_AGE variable. Data set CHKFAM, created in (18), also should be identical to ONE plus the FAM_AGE variable.
So, if (17) and (18) produce identical output, and the crosstab shows that all persons 65 or older are in families where FAM_AGE = 1, then this seems like a pretty strong indication that we're outputting the intended result. Note that it's not proof. We've used the same conditionals in (17) and (18) to create the variables, so if we're repeating an error in our logic that isn't shown by the crosstab, it may still slip through. This points to another aspect of trying to exhaust the possibilities, not just when writing conditionals, but also when testing output. I have seen programs where, having recognized that we can only predict the FAM_AGE values for those 65 and older, programmers will run a PROC FREQ for FAM_AGE and include the clause WHERE AGE >= 65. This is risky because then we can't see the numbers for those under 65. It might be that the crosstab shows that FAM_AGE = 1 for ALL records on the file. This is almost certainly a programming error (if it's true, why create the variable?). The WHERE clause restricts the information that is output, creating the possibility of missing a large problem. This brings us back to content-based programming. What is the population on the data set? Is it families with at least one person receiving Medicare? A random sample of families in the US? Obviously this will affect our expectations. When creating variables at a certain level, it's useful to run FREQs at that level, even if not required by the specs. So run a FREQ for FAM_AGE on the family-level file TWO created in (17). Is the output reasonable given the population?

One last point with this example. If you wanted to directly check that the output file THREE is identical to ONE except for the addition of the FAM_AGE variable, you can compare them by simply dropping the new variable, as in (22).

(22) PROC COMPARE BASE= THREE (DROP= FAM_AGE) COMPARE= ONE;

This allows you to check that there haven't been any 'inadvertent' changes along the way.
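The FAM_AGE*AGE crosstab discussed above might be coded as in the following sketch (the format name AGE65F is an arbitrary choice):

```sas
PROC FORMAT;
   VALUE AGE65F  LOW -< 65 = 'UNDER 65'
                 65 - HIGH = '65 OR OLDER';
RUN;

PROC FREQ DATA= THREE;
   /* every record in the '65 OR OLDER' row should fall in the FAM_AGE=1 column */
   TABLES AGE*FAM_AGE / MISSING;
   FORMAT AGE AGE65F.;
RUN;
```

The MISSING option keeps any missing AGE values visible in the table rather than silently excluding them.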
As mentioned in the section on PROGRAM QC, a typical program usually generates a number of temporary data sets before a permanent data set is

created. In addition to including QC LOG messages and PROC output for the temporary data sets, it is important to repeat any relevant QC measures with the final output. Not all will be possible, because the output file might be different in character (e.g. a person-level rather than a family-level file) from some of the temporary data sets, but what is possible should be done. After all, it's the final output that's important. One all-too-common way to check the final output is to print 50 OBS or so of a large file to check that everything looks the way it should. A lot of programmers do this, and a lot of clients request it, because it gives a concrete sense of how the variable values look. But it is not a sufficient QC measure. Sometimes there's a series of record-level edits that really are just a list of individual changes to a file. In this case PROC output, which is usually designed to capture patterns in the data, is of little use. Here's one case where the temptation to dump 50 OBS is hard to resist. If we take the 'data-analysis' perspective, one problem with dumping 50 OBS is that you usually see the following:

(23) PROC PRINT DATA= THREE (OBS= 50);

Here the OBS= 50 tells SAS to print the first 50 observations in the data set. Usually there is no particular reason to print the first 50 observations; it's just convenient. If there really is no alternative to examining a small number of observations to check a large number of record-level edits (i.e. there is no pattern to be tested), I would recommend, since it's now very easy in SAS, taking a simple random sample [SRS], as shown in (24). This is really just a form of data-entry checking.

(24) PROC SURVEYSELECT DATA= THREE
        METHOD= SRS
        SAMPSIZE= 50
        OUT= CHECK3 ;
     PROC PRINT DATA= CHECK3;

You can change SAMPSIZE to be whatever number of records you would normally dump, and then print the output.
This method allows you to avoid basing conclusions on variable values that may be particular to the first 50 or so observations in a data set. It's not perfect, but it's better than the alternative. In many ways OUTPUT QC mirrors INPUT QC in that you want to have a very detailed picture of both input and output data sets. In the INPUT QC section I stressed the essential role of PROC CONTENTS. This is equally important with OUTPUT QC. The list of data set properties in (3), repeated here as (25), should form a minimal output checklist.

(25) a. Number of Variables
     b. Number of Observations
     c. Sorted?
     d. Engine (V8?)
     e. Variable Name
     f. Variable Type (Numeric/Character?)
     g. Variable Length
     h. Variable Labels
     i. Variable Formats

Are the number of variables and observations correct? Should the data set be sorted? Do all the variables have labels and/or formats? I would also add to this list a question about compression. SAS V8 offers good options for compression that are worth considering (COMPRESS= CHAR or BINARY). It is also useful to use the POSITION or VARNUM option to clearly see the position of the variables on the data set. This is often a good way to pick up any stray variables (e.g. I or X from a DO loop) that may have been accidentally left on the data set. These variables will tend to be positioned near the end.
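Listing the variables in creation order rather than alphabetically takes a single option; a minimal sketch (the data set name is a placeholder):

```sas
/* VARNUM lists variables in position order; stray DO-loop
   variables (I, X, ...) tend to show up at the end */
PROC CONTENTS DATA= DSN_OUT VARNUM;
RUN;
```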

The basic point of the approach to quality control discussed here is that it takes very little effort with SAS to make a lot of useful information about your programs and program output available. Before starting to program you want to have as much information about the input data as possible. Intelligent QC demands more than just knowledge of the formal properties of the data; it also requires information about the content of the information, i.e. what the data is about. Once you're underway you want to make available information about the intermediate output of your programs through LOG messages and snapshots of work files used as input to permanent data sets. Work files are, by definition, unavailable after the SAS session, so problem-solving by going back to the programs that created them can require re-running those programs just to re-create the work files so they can be examined. A few PROC FREQs and PROC COMPAREs along the way can save a lot of time down the road. In the next section we take this idea of making useful quality-control data available to the project level. The emphasis will be on what you, as a SAS programmer using your core skills, can do to provide that information in a useful and potentially time- and money-saving way.

PROJECT-LEVEL QUALITY CONTROL

When we looked at Input QC we pointed out the relevance of the cliché, Don't overlook the obvious. The context there was that implicit assumptions we might make about the data often come back to haunt us days or weeks later. At the project level we can look at this from another perspective.

(26) Make important information easily accessible so it's less likely to be overlooked.

In this section of the paper we are going to focus on SAS-provided metadata (i.e. data about SAS files such as data sets and catalogs) in the form of dictionary tables and views.
The idea here is not to provide you with specific checks on specific output, because the particulars of what is needed will vary from project to project. Even the SAS techniques illustrated here will change over time, so the emphasis will be on the approach. As a SAS programmer, with a minimal expansion of what you are already doing with your programs and your data, you can generate important project-level information that is invaluable for quality control and project documentation. Before discussing the metadata available in SAS-provided files such as dictionary tables, consider the discussion of the PROC COMPARE example at the program level and its applicability to the project level. In (20) we illustrated the simplified structure of a program as an input/output device with a succession of intermediate states. At the program level we're concerned with properties of the input, as well as intermediate and final output. At the project level the same concerns exist, except that program output itself can be seen as a series of intermediate output states. From this perspective, PROC COMPARE can be used to provide information about the relationship of variables on the initial input files to the output files. The KEEP and DROP options, as was shown in (22), can be used to isolate variables that should be unchanged from initial input to final output. If the program flow for a project is divided into modules or sections, then input/output comparisons can be made for these sub-project domains as well. In a similar way, the SAS programmer can generate, or make use of, information about the data sets that have been created during the course of the project. Again, making use of this metadata does not involve learning a new set of skills or a new application. As a SAS programmer you already have the necessary expertise (although expanding your PROC SQL knowledge wouldn't hurt).
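The input-to-output comparison just described might be sketched as follows. The libraries, data sets, and the KEEP list of variables expected to pass through unchanged are all hypothetical:

```sas
/* Variables that should be identical on initial input and final output.
   Assumes both data sets are sorted by PERSID. */
PROC COMPARE BASE=INLIB.PERSONS (KEEP=PERSID AGE SEX)
             COMPARE=OUTLIB.FINAL (KEEP=PERSID AGE SEX)
             LISTALL;
   ID PERSID;
RUN;
```

Any differences reported here flag variables that were changed somewhere between input and deliverable, which delimits the search space considerably.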
DICTIONARY.COLUMNS is one of many dictionary tables that the SAS system generates and stores without explicit programmer intervention. As Dilorio and Abolafia (2004) point out, despite their usefulness, and the wealth of information they provide, dictionary tables and views still remain on the periphery of most SAS programmers' practical knowledge. But putting them to use does not require a whole new set of skills, as the following examples illustrate (see also Michael Davis' 2001 SUGI paper on dictionary views). As one example of their usefulness, consider the code in (27).

(27)

PROC SQL;
   CREATE TABLE PROJCONT AS
   SELECT NAME AS VAR_NAME,
          TYPE AS VAR_TYPE,
          MEMNAME AS FILENAME
   FROM DICTIONARY.COLUMNS
   WHERE LIBNAME = "SASHELP";
QUIT;

This PROC SQL step extracts three variables from DICTIONARY.COLUMNS: NAME (renamed to VAR_NAME), TYPE (renamed to VAR_TYPE), and MEMNAME (renamed to FILENAME). These variables provide information about all the members in the SASHELP directory. With a LIBNAME statement you can use whatever LIBREF you want. So now you have a list of all members in this directory, with the filename, the variable name, and the variable type. Once you have the data set PROJCONT, you can manipulate and output it just like you would any SAS data set. You might want to export it to Excel. Here's an easy way to do that.

(28)

ODS HTML FILE='C:\PROJ INFO\FILEINFO\HELPLIST.XLS';
PROC PRINT DATA=PROJCONT NOOBS;
   VAR FILENAME VAR_NAME VAR_TYPE;
RUN;
ODS HTML CLOSE;

Of course there are numerous ways to export SAS data sets to Excel (e.g. the ExcelXP tagset), but here I'm only illustrating the ease with which SAS not only provides useful information about the data sets your project has generated, but also the ease with which you can make this available to others on the project. Building on this, and continuing to use SAS skills that are part of your everyday work, you can get data-set-level information from DICTIONARY.TABLES. Consider the example in (29), a slightly modified example from Dilorio and Abolafia's 2004 SUGI paper.

(29)

PROC SQL;
   CREATE TABLE PROJLIST AS
   SELECT CRDATE AS FILEDATE,
          MEMNAME AS FILENAME,
          (NPAGE*16384/1024) AS SIZE_KB
   FROM DICTIONARY.TABLES
   WHERE LIBNAME = "SASHELP";
QUIT;

The CRDATE variable is the creation date (there's also MODATE for the date modified). SIZE_KB uses Dilorio and Abolafia's technique for creating a handy file-size variable.
Once you've created this data set, you can sort both it and the data set created in (27) by FILENAME, and then merge them into a data set that combines data-set and variable information. If you sort by the creation date you can create a data set and variable chronology that is useful not only for quality control, but also for project documentation. Once you start this process it builds on itself, and soon you're wondering how you worked on projects without this kind of handy metadata that you've created for yourself and others on the project. As mentioned above, you can export this information to Excel or other applications. Often these data sets can become quite large, so finding handy ways to access subsets of the data on the fly can be a big help when you're in a hurry and just need to know which data set first had variable X or Y. I've found that, when exporting to Excel, the AUTOFILTER option in the ExcelXP tagset provides a great way of automatically creating a spreadsheet where I can pick out variables or other data set properties of interest. See any of Eric Gebhart's papers on this (e.g. his SUGI 30 paper listed in the references).
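The sort-and-merge step just described can be sketched using the PROJCONT and PROJLIST data sets created in (27) and (29); PROJMETA is my own name for the combined file:

```sas
PROC SORT DATA=PROJCONT; BY FILENAME; RUN;
PROC SORT DATA=PROJLIST; BY FILENAME; RUN;

/* Combine data-set-level and variable-level metadata */
DATA PROJMETA;
   MERGE PROJLIST PROJCONT;
   BY FILENAME;
RUN;

/* Sort by creation date for a data set and variable chronology */
PROC SORT DATA=PROJMETA;
   BY FILEDATE FILENAME VAR_NAME;
RUN;
```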

The DICTIONARY.COLUMNS table is useful for looking at information across data sets. Let's say you wanted to see which variable names appear in more than one data set. You could modify (27) as in (30).

(30)

PROC SQL;
   CREATE TABLE VDUPLIST AS
   SELECT NAME AS VAR_NAME,
          TYPE AS VAR_TYPE,
          MEMNAME AS FILENAME,
          LENGTH
   FROM DICTIONARY.COLUMNS
   WHERE LIBNAME = "SASHELP"
   GROUP BY NAME
   HAVING COUNT(NAME) > 1;
QUIT;

The HAVING clause picks out variables appearing in more than one data set. The inclusion of the variable length and type allows you to check for consistency across data sets. Again, you can export this information to other applications or use SAS to create reports. The point here is that, using the programming knowledge you are already applying at the program level, you can provide yourself and others with valuable information at the project level. Obviously, the more information you have at your fingertips, the easier it's going to be to use that information to ensure the accuracy and consistency of the data your project is generating. Once you start to get a sense of how easy it is to use your SAS programming skills to generate quality-control and documentation information, you can quickly build on this knowledge to fine-tune what's most useful for your project. Continuing to use the SASHELP files as an illustration, we can adapt another example from Dilorio and Abolafia (2004) to show this type of fine-tuning.

(31)

PROC SQL;
   CREATE TABLE VDUPLIST AS
   SELECT NAME AS VAR_NAME,
          TYPE AS VAR_TYPE,
          MEMNAME AS FILENAME,
          LENGTH
   FROM DICTIONARY.COLUMNS
   WHERE LIBNAME = "SASHELP"
   GROUP BY NAME
   HAVING COUNT(DISTINCT LENGTH) > 1
       OR COUNT(DISTINCT TYPE) > 1;
QUIT;

If you run this for the SASHELP data sets you will see output such as that shown in (32) below (only partial output is given). The HAVING clause picks out variables with different length or type properties on different data sets.
Clearly, if this were actual output for your project, you'd want to investigate situations where the same variable name has a different data type and/or length on different data sets. If you had created a variable chronology file using the methods described above, you could quickly gather additional information to try to figure out what's going on with these variables. Once you've identified unexpected differences among variables on different data sets, you need to investigate programs where these variables were used in merges, conditionals, and other program areas where variable properties come into play. But even if using DICTIONARY.COLUMNS, or other metadata easily obtained with SAS (such as output data sets from PROC CONTENTS), does not reveal any quality-control issues that need to be addressed, you are in a better position to document your project activities and to demonstrate the accuracy of your output and deliverables. Often output data can look odd or wrong but is actually accurate, given the input data. If you've built in good quality-control checks, you're in a good position to efficiently answer client (or project manager) questions about the accuracy of the data.

(32)

VAR_NAME  FILENAME  VAR_TYPE  LENGTH
CLASS     BVHELP    char      20
CLASS     MRRGSTRY  char      35
COUNTRY   PRDSAL3   char      10
COUNTRY   MDV       char      14
COUNTRY   PRDSAL2   char      10
COUNTRY   PRDSALE   char      10
COUNTY    PRDSAL3   char      20
COUNTY    EISILCO   num        6
COUNTY    TGRMAPC   num        6
COUNTY    PRDSAL2   char      20
COUNTY    ZIPCODE   num        8
DATE      CITIDAY   num        7
DATE      RENT      num        8
DATE      ROCKPIT   num        8
DATE      USECON    num        5

Metadata is data, and SAS programmers are experts at making data available for analysis, especially when SAS provides the data in a form that we already know how to manipulate and put to use. Project information is also data, and SAS programmers are in an excellent position to make essential contributions at the project level.

CONCLUSION

The basic principle of QC is to know your data. Good programming requires that you know your input data, that you know the relevant properties of intermediate data sets, and that you know the details of your permanent output. An important element of this is distinguishing between the values that variables actually have and the values that they should have. As much as possible, this should be done for all records on all the files you are working with. SAS programmers are fortunate to have a variety of data-analysis tools at their disposal. The minimal added cost of including QC PROCs or conditional LOG messages in SAS programs is usually more than offset by the benefit of preventing the generation of unintended output, or of being able to efficiently demonstrate the correctness of odd-looking output. In the section on INPUT QC, I said that time pressure is one of the great enemies of good QC. It's clear that in the real world you can't always run every check that you'd like. But there's an awful lot that can be done fairly efficiently. In the same way that there are patterns in data we want to capture with our programs, there are patterns to QC that often recur across the projects we work on.
As with input data before beginning a program, it's worthwhile at the beginning of a project to anticipate QC needs and build them into program specs. Often this will allow a number of QC steps to be 'macrotized', saving project resources by making those steps applicable to a number of different programs. In general, the net cost in time for good QC is minimal, especially given the approach outlined here, where the existing skills of most SAS programmers are used. Many of us know from painful experience the feeling that comes when someone asks, right after the final data delivery, "Did you know that those expenditure values we summed had -1 values for missing data?" Also, because a major part of quality control is making data available, good QC goes hand-in-hand with good data management and documentation. Consider the example where an output table was generated to show variables with different length and/or type on different data sets. This type of information is as much a part of data management and documentation as it is of good quality control.
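A recurring QC step such as the duplicate-variable check in (30) is a natural candidate for 'macrotizing' so that it can be pointed at any project library. This is only a sketch; the macro name and parameters are my own:

```sas
%MACRO VARCHECK(LIB=, OUT=VDUPLIST);
   /* List variables appearing in more than one data set in &LIB */
   PROC SQL;
      CREATE TABLE &OUT AS
      SELECT NAME AS VAR_NAME,
             TYPE AS VAR_TYPE,
             MEMNAME AS FILENAME,
             LENGTH
      FROM DICTIONARY.COLUMNS
      WHERE LIBNAME = "&LIB"
      GROUP BY NAME
      HAVING COUNT(NAME) > 1;
   QUIT;
%MEND VARCHECK;

/* One call per project library in the program specs */
%VARCHECK(LIB=SASHELP);
```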

It can be, as we all know, a frustrating experience to try to figure out why some output data looks wrong, doesn't add up, or doesn't make sense. It can be even more frustrating to have your client call and ask why something doesn't add up. What is needed in those situations is information that delimits the search space for problem solving and reduces the time and frustration involved. Good QC usually means catching problems before you get too far from the source of the problem. This means less time spent problem solving and more time spent making our customers happy. Always a good idea! Good QC goes hand in hand with good project documentation. A significant part of good project documentation is generating metadata in a useful format. Using our SAS skills to do this saves time and effort in those final project stages. This makes our project managers happy. Always a good idea!

REFERENCES

Burlew, Michele M. (2001) Debugging SAS Programs: A Handbook of Tools and Techniques, SAS Institute, Inc. (Cary, NC, USA).
Cody, Ron (1999) Cody's Data Cleaning Techniques Using SAS Software, SAS Institute, Inc. (Cary, NC, USA).
Davis, Michael (2001) "You Could Look It Up: An Introduction to SASHELP Dictionary Views," Proceedings of SUGI 26.
Dilorio, Frank and Abolafia, Jeff (2004) "Dictionary Tables and Views: Essential Tools for Serious Applications," Proceedings of SUGI 29.
Gebhart, Eric (2005) "ODS Tagsets and Excel: An Excellent Combination," Proceedings of SUGI 30.
Gorrell, Paul (2003) "Quality Control With SAS Numeric Data," Proceedings of NESUG.
Gorrell, Paul (2002) "Numeric Data In SAS: Guidelines for Storage and Display," Proceedings of NESUG.
Lafler, Kirk Paul (2005) "Exploring DICTIONARY Tables and Views," Proceedings of SUGI 30.

ACKNOWLEDGMENTS

SAS and all other SAS Institute Inc. product or service names are registered trademarks or trademarks of SAS Institute Inc. in the USA and other countries. ® indicates USA registration.
Other brand and product names are registered trademarks or trademarks of their respective companies. Excel is a registered trademark of the Microsoft Corporation. I would like to thank my colleagues at Social & Scientific Systems, as well as members of the audience at prior NESUG conferences, who have given me invaluable feedback on my SAS programs and earlier papers on quality control. Thanks as well to David Chapman for comments on a draft of this paper.

CONTACT INFORMATION

Paul Gorrell
Social & Scientific Systems, Inc.
Georgia Avenue, 12th Floor
Silver Spring, MD
pgorrell@s-3.com
Telephone: (Office) (Main)
FAX:


More information

Software Testing Prof. Meenakshi D Souza Department of Computer Science and Engineering International Institute of Information Technology, Bangalore

Software Testing Prof. Meenakshi D Souza Department of Computer Science and Engineering International Institute of Information Technology, Bangalore Software Testing Prof. Meenakshi D Souza Department of Computer Science and Engineering International Institute of Information Technology, Bangalore Lecture 04 Software Test Automation: JUnit as an example

More information

Introduction to Programming

Introduction to Programming CHAPTER 1 Introduction to Programming Begin at the beginning, and go on till you come to the end: then stop. This method of telling a story is as good today as it was when the King of Hearts prescribed

More information

Q&A Session for Connect with Remedy - CMDB Best Practices Coffee Break

Q&A Session for Connect with Remedy - CMDB Best Practices Coffee Break Q&A Session for Connect with Remedy - CMDB Best Practices Coffee Break Date: Thursday, March 05, 2015 Q: When going to Asset Management Console and making an update on there, does that go to a sandbox

More information

Client Code - the code that uses the classes under discussion. Coupling - code in one module depends on code in another module

Client Code - the code that uses the classes under discussion. Coupling - code in one module depends on code in another module Basic Class Design Goal of OOP: Reduce complexity of software development by keeping details, and especially changes to details, from spreading throughout the entire program. Actually, the same goal as

More information

Hi everyone. Starting this week I'm going to make a couple tweaks to how section is run. The first thing is that I'm going to go over all the slides

Hi everyone. Starting this week I'm going to make a couple tweaks to how section is run. The first thing is that I'm going to go over all the slides Hi everyone. Starting this week I'm going to make a couple tweaks to how section is run. The first thing is that I'm going to go over all the slides for both problems first, and let you guys code them

More information

Linked Lists. What is a Linked List?

Linked Lists. What is a Linked List? Linked Lists Along with arrays, linked lists form the basis for pretty much every other data stucture out there. This makes learning and understand linked lists very important. They are also usually the

More information

how its done in about the five most common SQL implementations.

how its done in about the five most common SQL implementations. SQL PDF Database management. It may sound daunting, but it doesn't have to be, even if you've never programmed before. SQL: Visual QuickStart Guide isn't an exhaustive guide to SQL written for aspiring

More information

MITOCW watch?v=w_-sx4vr53m

MITOCW watch?v=w_-sx4vr53m MITOCW watch?v=w_-sx4vr53m The following content is provided under a Creative Commons license. Your support will help MIT OpenCourseWare continue to offer high-quality educational resources for free. To

More information

Hi everyone. I hope everyone had a good Fourth of July. Today we're going to be covering graph search. Now, whenever we bring up graph algorithms, we

Hi everyone. I hope everyone had a good Fourth of July. Today we're going to be covering graph search. Now, whenever we bring up graph algorithms, we Hi everyone. I hope everyone had a good Fourth of July. Today we're going to be covering graph search. Now, whenever we bring up graph algorithms, we have to talk about the way in which we represent the

More information

Stat Wk 3. Stat 342 Notes. Week 3, Page 1 / 71

Stat Wk 3. Stat 342 Notes. Week 3, Page 1 / 71 Stat 342 - Wk 3 What is SQL Proc SQL 'Select' command and 'from' clause 'group by' clause 'order by' clause 'where' clause 'create table' command 'inner join' (as time permits) Stat 342 Notes. Week 3,

More information

Skill 1: Multiplying Polynomials

Skill 1: Multiplying Polynomials CS103 Spring 2018 Mathematical Prerequisites Although CS103 is primarily a math class, this course does not require any higher math as a prerequisite. The most advanced level of mathematics you'll need

More information

MITOCW watch?v=hverxup4cfg

MITOCW watch?v=hverxup4cfg MITOCW watch?v=hverxup4cfg PROFESSOR: We've briefly looked at graph isomorphism in the context of digraphs. And it comes up in even more fundamental way really for simple graphs where the definition is

More information

SIMPLE PROGRAMMING. The 10 Minute Guide to Bitwise Operators

SIMPLE PROGRAMMING. The 10 Minute Guide to Bitwise Operators Simple Programming SIMPLE PROGRAMMING The 10 Minute Guide to Bitwise Operators (Cause you've got 10 minutes until your interview starts and you know you should probably know this, right?) Twitter: Web:

More information

Principles of Algorithm Design

Principles of Algorithm Design Principles of Algorithm Design When you are trying to design an algorithm or a data structure, it s often hard to see how to accomplish the task. The following techniques can often be useful: 1. Experiment

More information

1 Getting used to Python

1 Getting used to Python 1 Getting used to Python We assume you know how to program in some language, but are new to Python. We'll use Java as an informal running comparative example. Here are what we think are the most important

More information

Statistics, Data Analysis & Econometrics

Statistics, Data Analysis & Econometrics ST009 PROC MI as the Basis for a Macro for the Study of Patterns of Missing Data Carl E. Pierchala, National Highway Traffic Safety Administration, Washington ABSTRACT The study of missing data patterns

More information

Macros I Use Every Day (And You Can, Too!)

Macros I Use Every Day (And You Can, Too!) Paper 2500-2018 Macros I Use Every Day (And You Can, Too!) Joe DeShon ABSTRACT SAS macros are a powerful tool which can be used in all stages of SAS program development. Like most programmers, I have collected

More information

Lab 4. Recall, from last lab the commands Table[], ListPlot[], DiscretePlot[], and Limit[]. Go ahead and review them now you'll be using them soon.

Lab 4. Recall, from last lab the commands Table[], ListPlot[], DiscretePlot[], and Limit[]. Go ahead and review them now you'll be using them soon. Calc II Page 1 Lab 4 Wednesday, February 19, 2014 5:01 PM Recall, from last lab the commands Table[], ListPlot[], DiscretePlot[], and Limit[]. Go ahead and review them now you'll be using them soon. Use

More information

Clean & Speed Up Windows with AWO

Clean & Speed Up Windows with AWO Clean & Speed Up Windows with AWO C 400 / 1 Manage Windows with this Powerful Collection of System Tools Every version of Windows comes with at least a few programs for managing different aspects of your

More information

Intro. Scheme Basics. scm> 5 5. scm>

Intro. Scheme Basics. scm> 5 5. scm> Intro Let s take some time to talk about LISP. It stands for LISt Processing a way of coding using only lists! It sounds pretty radical, and it is. There are lots of cool things to know about LISP; if

More information

Debugging 101 Peter Knapp U.S. Department of Commerce

Debugging 101 Peter Knapp U.S. Department of Commerce Debugging 101 Peter Knapp U.S. Department of Commerce Overview The aim of this paper is to show a beginning user of SAS how to debug SAS programs. New users often review their logs only for syntax errors

More information

MITOCW watch?v=yarwp7tntl4

MITOCW watch?v=yarwp7tntl4 MITOCW watch?v=yarwp7tntl4 The following content is provided under a Creative Commons license. Your support will help MIT OpenCourseWare continue to offer high quality, educational resources for free.

More information

How to Create Data-Driven Lists

How to Create Data-Driven Lists Paper 9540-2016 How to Create Data-Driven Lists Kate Burnett-Isaacs, Statistics Canada ABSTRACT As SAS programmers we often want our code or program logic to be driven by the data at hand, rather than

More information

Q &A on Entity Relationship Diagrams. What is the Point? 1 Q&A

Q &A on Entity Relationship Diagrams. What is the Point? 1 Q&A 1 Q&A Q &A on Entity Relationship Diagrams The objective of this lecture is to show you how to construct an Entity Relationship (ER) Diagram. We demonstrate these concepts through an example. To break

More information

CFMG Training Modules Classified Ad Strategy Module

CFMG Training Modules Classified Ad Strategy Module CFMG Training Modules Classified Ad Strategy Module In This Module: 1. Introduction 2. Preliminary Set Up Create the Sequential Letterset for our Ad Strategy Set Up Your Vanity Responder Create Your Stealth

More information

MITOCW watch?v=se4p7ivcune

MITOCW watch?v=se4p7ivcune MITOCW watch?v=se4p7ivcune The following content is provided under a Creative Commons license. Your support will help MIT OpenCourseWare continue to offer high quality educational resources for free. To

More information

A Simple Framework for Sequentially Processing Hierarchical Data Sets for Large Surveys

A Simple Framework for Sequentially Processing Hierarchical Data Sets for Large Surveys A Simple Framework for Sequentially Processing Hierarchical Data Sets for Large Surveys Richard L. Downs, Jr. and Pura A. Peréz U.S. Bureau of the Census, Washington, D.C. ABSTRACT This paper explains

More information

Validating And Updating Your Data Using SAS Formats Peter Welbrock, Britannia Consulting, Inc., MA

Validating And Updating Your Data Using SAS Formats Peter Welbrock, Britannia Consulting, Inc., MA Validating And Updating Your Data Using SAS Formats Peter Welbrock, Britannia Consulting, Inc., MA Overview In whatever way you use SAS software, at some point you will have to deal with data. It is unavoidable.

More information

Porsche 91 1GT D m o d e ling tutorial - by Nim

Porsche 91 1GT D m o d e ling tutorial - by Nim orsche 911GT 3D modeling tutorial - by Nimish In this tutorial you will learn to model a car using Spline modeling method. This method is not very much famous as it requires considerable amount of skill

More information

Paper S Data Presentation 101: An Analyst s Perspective

Paper S Data Presentation 101: An Analyst s Perspective Paper S1-12-2013 Data Presentation 101: An Analyst s Perspective Deanna Chyn, University of Michigan, Ann Arbor, MI Anca Tilea, University of Michigan, Ann Arbor, MI ABSTRACT You are done with the tedious

More information

/633 Introduction to Algorithms Lecturer: Michael Dinitz Topic: Sorting lower bound and Linear-time sorting Date: 9/19/17

/633 Introduction to Algorithms Lecturer: Michael Dinitz Topic: Sorting lower bound and Linear-time sorting Date: 9/19/17 601.433/633 Introduction to Algorithms Lecturer: Michael Dinitz Topic: Sorting lower bound and Linear-time sorting Date: 9/19/17 5.1 Introduction You should all know a few ways of sorting in O(n log n)

More information

AHHHHHHH!!!! NOT TESTING! Anything but testing! Beat me, whip me, send me to Detroit, but don t make me write tests!

AHHHHHHH!!!! NOT TESTING! Anything but testing! Beat me, whip me, send me to Detroit, but don t make me write tests! NAME DESCRIPTION Test::Tutorial - A tutorial about writing really basic tests AHHHHHHH!!!! NOT TESTING! Anything but testing! Beat me, whip me, send me to Detroit, but don t make me write tests! *sob*

More information

How to Create a Killer Resources Page (That's Crazy Profitable)

How to Create a Killer Resources Page (That's Crazy Profitable) How to Create a Killer Resources Page (That's Crazy Profitable) There is a single page on your website that, if used properly, can be amazingly profitable. And the best part is that a little effort goes

More information

HELPLINE. Dilwyn Jones

HELPLINE. Dilwyn Jones HELPLINE Dilwyn Jones Remember that you can send me your Helpline queries by email to helpline@quanta.org.uk, or by letter to the address inside the front cover. While we do our best to help, we obviously

More information

The following content is provided under a Creative Commons license. Your support

The following content is provided under a Creative Commons license. Your support MITOCW Lecture 9 The following content is provided under a Creative Commons license. Your support will help MIT OpenCourseWare continue to offer high-quality educational resources for free. To make a donation

More information

Smart formatting for better compatibility between OpenOffice.org and Microsoft Office

Smart formatting for better compatibility between OpenOffice.org and Microsoft Office Smart formatting for better compatibility between OpenOffice.org and Microsoft Office I'm going to talk about the backbreaking labor of helping someone move and a seemingly unrelated topic, OpenOffice.org

More information

The first thing we ll need is some numbers. I m going to use the set of times and drug concentration levels in a patient s bloodstream given below.

The first thing we ll need is some numbers. I m going to use the set of times and drug concentration levels in a patient s bloodstream given below. Graphing in Excel featuring Excel 2007 1 A spreadsheet can be a powerful tool for analyzing and graphing data, but it works completely differently from the graphing calculator that you re used to. If you

More information

PROC SUMMARY AND PROC FORMAT: A WINNING COMBINATION

PROC SUMMARY AND PROC FORMAT: A WINNING COMBINATION PROC SUMMARY AND PROC FORMAT: A WINNING COMBINATION Alan Dickson - Consultant Introduction: Is this scenario at all familiar to you? Your users want a report with multiple levels of subtotals and grand

More information

ChatScript Finaling a Bot Manual Bruce Wilcox, Revision 12/31/13 cs3.81

ChatScript Finaling a Bot Manual Bruce Wilcox, Revision 12/31/13 cs3.81 ChatScript Finaling a Bot Manual Bruce Wilcox, gowilcox@gmail.com Revision 12/31/13 cs3.81 OK. You've written a bot. It sort of seems to work. Now, before releasing it, you should polish it. There are

More information

WHAT ARE SASHELP VIEWS?

WHAT ARE SASHELP VIEWS? Paper PN13 There and Back Again: Navigating between a SASHELP View and the Real World Anita Rocha, Center for Studies in Demography and Ecology University of Washington, Seattle, WA ABSTRACT A real strength

More information

Using Metadata Queries To Build Row-Level Audit Reports in SAS Visual Analytics

Using Metadata Queries To Build Row-Level Audit Reports in SAS Visual Analytics SAS6660-2016 Using Metadata Queries To Build Row-Level Audit Reports in SAS Visual Analytics ABSTRACT Brandon Kirk and Jason Shoffner, SAS Institute Inc., Cary, NC Sensitive data requires elevated security

More information