A SAS Programmer's Guide to Project- and Program-Level Quality Control Paul Gorrell, Social & Scientific Systems, Inc., Silver Spring, MD


ABSTRACT

The focus of this paper is how to save time and money (and reduce frustration!) by (1) including intelligent and informative input and output checks in SAS programs, (2) generating quality-control information for program modules and projects, and (3) using SAS to create program and project documentation. At both the project and program level, the focus will be on quality control as data analysis. This approach is uniquely suited to using the numerous SAS procedures and reporting mechanisms (including programmer-specified LOG messages) to provide the programmer and project manager with the necessary information for verifying the correctness of program output and deliverables.

INTRODUCTION

The SAS programmer is in the fortunate position of having a variety of excellent data-analysis tools at hand. As stated in the abstract, this paper will focus on quality control as a form of data analysis, and SAS is a wonderful data-analysis tool. This means that, as a SAS programmer, you don't need to develop a separate set of skills in order to incorporate good quality-control checks into program- and project-level work. In this paper I'll first develop a general approach to quality control at the program level, and then apply this approach at the project level. The main difference is that, at the program level, the emphasis is on identifying patterns and drawing conclusions about the data, whereas at the project level, the emphasis is on metadata. The metadata used here will be dictionary tables and views that are generated automatically by the SAS system. This is a wealth of information that is easily made available using SAS programming techniques that most SAS programmers have already mastered.
A recurring theme of this paper is that you, as a SAS programmer, already have the skills to run important quality-control checks, and to contribute valuable information at the project level. I'm going to distinguish between three types of quality control [QC]:

(1) a. INPUT QC
    b. PROGRAM QC
    c. OUTPUT QC

Input QC is the process of learning about the input data to your program. It's important to distinguish between the properties the data is supposed to have and the properties it actually has. It is also important for the SAS programmer to have a good general sense of what the data means and how it will be used. This knowledge allows the programmer to integrate more-intelligent QC measures into the programs than would be possible by blindly following specs.

Program QC refers to writing programs so that quality-control information is available both from intermediate stages of the program and in the final output. For example, merges are crucial areas of a program requiring QC. I will give examples where the merge DATA step can be written to show important information about the merge.

Output QC is what most programmers are thinking about when QC is mentioned. That is, Output QC is looking at the output data sets to make sure that they have the intended properties. There are various aspects of this. There's the basic level of making sure that 2+2=4. There are also relationships between variables that are determined by external factors, e.g. IF MONTH = 'SEPT' THEN (1 <= DAY <= 30). And there are variable values and relationships to check that are specific to the particular data set being input or generated, e.g. number of children in the family was recorded only if they were living at home. I will discuss a number of specific SAS programming examples, but the goal of this paper is to emphasize as much as possible the general principles which lead to more-effective quality control.
Throughout I will focus on a common situation for a SAS programmer: you have an input SAS data set (or sets) and you have specs to modify this data set in some way and output a new data set. Of course there are numerous other situations (e.g. reading in flat files, outputting to Excel spreadsheets, etc.) that require QC. I hope that I am able to communicate my basic approach in such a way that the programmer can readily apply it to situations not covered here.

One thread that will run through this discussion is what we might call content-based programming. By that I mean that it is well worth the time for a SAS programmer to learn as much as possible about the project the programs are a part of, the type of people who will be working with the data, and the general goals which motivate the particular specs. A lot of time and money can be saved if the programmer is in a position to catch content-based problems before they are delivered to the client. For example, I work a lot with health-care survey data and it is fairly easy to figure out that males should not have valid responses to pregnancy questions. It is less obvious that some questions (e.g. "Have you fully recovered?") are only asked of people with certain types of medical conditions. An added benefit to content-based programming is that it makes the work more interesting. The more interested I am, the more alert and engaged I am with the data and the programs. So it's efficient in the long run for the programmer to take some time before ever writing a line of code to find out as much as possible about the data, and to plan ahead with QC in mind.

PROGRAM-LEVEL

An individual program may be more or less complex. For purposes of quality control, a long, complex program should be divided into sections or modules, with the approach to Input QC and Output QC applied to those sections rather than just to the full program. The idea is to keep things clear and limit the search space for error detection and debugging when problems occur.

INPUT QC

A lot of old clichés are exactly right when it comes to Input QC ("A stitch in time saves nine", "Penny wise but pound foolish", etc.). I'll add two more:

(2) a. Don't overlook the obvious.
    b. Verify what you're told.

The word verify is going to come up quite often in this paper because a lot of basic QC comes down to verifying that what you think is true, or have been told is true, actually is true.
One of the great enemies of good QC is time pressure, and this can start right at the beginning by not taking the time to take a good look at the input data. SAS provides a wonderful tool for initially examining input data sets: the CONTENTS procedure. It is absolutely essential to run a PROC CONTENTS to see, at least, the following:

(3) a. Number of Variables
    b. Number of Observations
    c. Sorted?
    d. Engine (V9?)
    e. Variable Name
    f. Variable Type (Numeric/Character?)
    g. Variable Length
    h. Variable Labels
    i. Variable Formats

Initially it's important to know if this general information agrees with the documentation you have about the file. If it doesn't, then you need to straighten this out before going any further (perhaps you have the wrong file, one that's similar to the right file). If the general information (e.g. number of variables and observations) appears to be as it should be, you can look at specifics such as variable names and data types. Whether a variable is numeric or character is not always predictable from its name. An ID variable that consists only of numbers might be either numeric or character. A flag variable whose values are only 0 or 1 might be numeric or character. It's obviously important when you start writing code to know what type of variables you have.
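Nothing elaborate is needed for this first look; a plain call prints all of the items in (3). A minimal sketch (the data set name DSN_IN is a placeholder):

```sas
/* First look at an input data set; DSN_IN is a placeholder name */
PROC CONTENTS DATA= DSN_IN;
RUN;
```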

When working with numeric data it's important to know the variables' length. The default is 8 bytes, but this is often reduced to save disk space. If you are going to be modifying the values of numeric data, it's especially important to know variable length. Here's a simple program for taking a first look.

(4) PROC CONTENTS DATA= DSN_IN
       OUT= CHECK1 (KEEP= NAME TYPE LENGTH LABEL);
    PROC FREQ DATA= CHECK1;
       TABLES LENGTH;
       WHERE TYPE = 1;

The output data set CHECK1 will have 4 variables: NAME (= the names of the variables on the DSN_IN data set); TYPE (1 if numeric, 2 if character); LENGTH; and LABEL. The PROC FREQ outputs a table showing the frequency of numeric variables (WHERE TYPE = 1) by LENGTH. There are two reasons to care about numeric length: (i) if it is less than 8 bytes and you will be increasing any values, you need to make sure that the length specification is sufficient for the updated values; (ii) if you are concerned about saving disk space, you may want to consider reducing numeric length for some variables. For more-detailed discussion of this topic, see my paper on SAS numeric data in these Proceedings.

The situation is similar for character data. Suppose you have a character variable RESPONSE that has a LENGTH of 3 on the input data set and 3 values: "YES", "NO" and "DK" (where "DK" indicates "Don't Know"). You're asked to edit this variable as in (5):

(5) IF RESPONSE = 'DK' THEN RESPONSE = 'MAYBE';

Because LENGTH is a variable attribute that, by default, is inherited by one data set from another, the LENGTH of 3 will cause the new value to be "MAY" instead of "MAYBE". This is one of many situations where a little time spent reading before typing can save a lot of time later ("Anyone remember which program we edited that RESPONSE variable in?"). It's important to become familiar with variable labels because they are often important sources of information about the intended content and use of the variables.
Sometimes labels offer abbreviated explanations of particular values ("1 IF NUMERIC, 2 IF CHARACTER"). If your editing adds to, or changes, these values, you will need to modify the label as well. Even the best specs aren't perfect, and the programmer who has taken (or been given) the time to understand the context of the data and the specs can often spot additional modifications that may be needed. Often these take the form, "If you change X then you should also change Y." This is another potential time saver and one that is often appreciated by the client writing the specs.

Formats are also important for a number of reasons. If numeric variables are associated with formats, this may affect the operation of particular PROCs. It's good to verify that associated formats should be kept, and to communicate any potential implications. If formats indicate value ranges, it's good to verify that these ranges are meaningful for the data you have. For example, if the format values are [-1 = '-1'], [0 = '0'] and [1-100 = '1-100'] but the actual range of positive values on the input data set is 40-60, you should check whether or not this format range should be changed. For this type of case, if you only ran a formatted FREQ, you would never see any problem because the range of actual values is within the format range. The output would only show a certain number of records with each of the format values.
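Returning to the RESPONSE example in (5): one way to avoid the truncation, assuming the edited value really should be "MAYBE", is to declare a wider LENGTH before the SET statement so the inherited attribute is overridden. A minimal sketch (data set names are placeholders):

```sas
DATA DSN_OUT;
   LENGTH RESPONSE $ 5;          /* widen before SET so 'MAYBE' fits */
   SET DSN_IN;
   IF RESPONSE = 'DK' THEN RESPONSE = 'MAYBE';
RUN;
```

Because the LENGTH statement appears before the SET statement, it wins over the 3-byte length inherited from DSN_IN.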

Another important piece of information to know about an input data set is which variable(s) uniquely identify observations, i.e. the key variable (often, but not always, this is the SORTED BY variable). If you are told that PERSON_ID uniquely identifies rows, then you should run the following check:

(6) PROC SORT DATA= DSN_IN OUT= CHECK2 NODUPKEY;
       BY PERSON_ID;

If PERSON_ID really is the key variable, then the LOG will show that 0 observations have been deleted.

"Don't overlook the obvious" takes many forms. For example, if you have MONTH and DAY variables, you could include conditionals such as (7) as part of a DATA step.

(7) IF (MONTH = 'SEPT') AND NOT (1 <= DAY <= 30) THEN PUT ID= MONTH= DAY= ;

You can also use format ranges to test for valid and invalid values for particular variables. Part of the importance of knowing what properties the data SHOULD have concerns how various types of non-response values are coded. For example, what is the value for a MONTH variable if this information is unknown? For many surveys, responses such as Don't Know or refusals are coded as specific minus values (-8, -7, etc.) or out-of-range values (e.g. 99). Knowing what these are is essential before any code is written that will edit or re-code the input data. See Ron Cody's book on data cleaning for many useful approaches to checking the values of input variables.

Here's an example of the "verify what you are told" theme: suppose that you have a person-level data set (key= PERSON_ID) and each person is a member of a family, identified by FAMILY_ID. Both variables are character, with PERSON_ID having LENGTH 4 and FAMILY_ID LENGTH 2. PERSON_ID is composed of FAMILY_ID plus a two-character sequence number. The data set would look like (8).

(8) FAMILY_ID   PERSON_ID
    01          0101
    01          0102
    02          0201

One generalization is that the first two characters of PERSON_ID = FAMILY_ID.
You can verify this as follows:

(9) IF FAMILY_ID NE SUBSTR(PERSON_ID,1,2) THEN DO;
       PUT '***** ID ERROR *****' ;
       PUT FAMILY_ID= PERSON_ID= ;
    END;

This will print a message to the LOG for each record where FAMILY_ID is not the first two characters of PERSON_ID. If you suspect you have a lot of bad data and don't want to fill up your LOG, you can initially set a flag variable (e.g. ID_ERR) that is assigned a value of 1 if the conditional evaluates as true. Then you can run a FREQ on ID_ERR or use it to create an output data set.
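The flag-variable version of the check in (9) might look like the following sketch (data set names are placeholders):

```sas
DATA CHECK (KEEP= FAMILY_ID PERSON_ID ID_ERR);
   SET DSN_IN;
   /* 1 if the first two characters of PERSON_ID don't match FAMILY_ID */
   ID_ERR = (FAMILY_ID NE SUBSTR(PERSON_ID,1,2));
RUN;

PROC FREQ DATA= CHECK;
   TABLES ID_ERR;                /* expect every record to show ID_ERR=0 */
RUN;
```

The FREQ gives a one-line summary of how widespread the problem is, and a WHERE ID_ERR=1 subset of CHECK isolates the bad records for inspection.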

This type of approach is much better than dumping 50 or so observations to 'check' on variable values or the relationships between variables. I will discuss this further in the section on Output QC, but it is relevant here as well. Printing a small number of records is an amazingly common practice among SAS programmers. It is often requested by clients. It is useful for getting a kind of 'at-a-glance' initial sense of the data. It may even make you aware of a problem that would never have occurred to you to check on. But it is never a sufficient QC check, for the simple reason that, if you print 50 records, there may be a problem with records 51 through 100,000. Here's where 'QC as data analysis' comes in most clearly. Good QC, like good data analysis, allows you to draw conclusions about the data. From a QC perspective, you want to be able to say, with confidence, that the first two characters of each PERSON_ID on DSN_IN = FAMILY_ID. The only way to put yourself in a position to say that is to examine all the records. The only reasonable way to do this is with a DATA or PROC step. I recently worked with an input file with 4 different ID variables. The file has 323,536 records. On each record, a substring of each of the ID variables is supposed to have the same 3-digit person identifier. This turned out to be true for all but one record. But it was important to catch that one error and correct it.

PROGRAM QC

The point of Program QC is to write DATA steps (e.g. merges) and include PROCs in such a way that you are outputting information that is necessary for evaluating the correctness of the program and its output. There are a couple of SAS options that I would recommend either as system or program options. The first is ERRORABEND. This will cause SAS to abend for errors where it would otherwise (i.e. if NOERRORABEND is set) write a LOG ERROR note, set OBS=0 and go into syntax-checking mode.
I rarely find syntax checking after the initial error very useful, and it is much easier for me to focus on correcting the one error that caused SAS to abend. The LOG of any running program should be free of ERROR messages, even those associated with errors that "don't matter." Chapter 2 of Michele Burlew's book on debugging SAS programs has good information on interpreting and correcting Note, Warning and Error messages in SAS LOGs. Another option I would recommend, not available in V6, is MERGENOBY=ERROR. If you fail to include a BY statement when merging data sets, SAS will write the following message to the LOG.

(10) ERROR: No BY statement was specified for a MERGE statement.

This is a very useful option which I have in my AUTOEXEC file. I never intentionally merge data sets without a BY statement (even one-to-one merges) so =ERROR works best for me. You can also set MERGENOBY=WARN to get a warning rather than an error message. In (11) I illustrate a merge DATA step where temporary variables containing information about the merge are written to the LOG.

(11) DATA THREE;
        MERGE ONE (IN= A) TWO (IN= B) END= ITSOVER;
        BY ID;
        IF A AND B THEN MTCH+1;
        ELSE IF A THEN JUSTA+1;
        ELSE IF B THEN JUSTB+1;
        IF ITSOVER THEN PUT MTCH= JUSTA= JUSTB= ;
        DROP MTCH JUSTA JUSTB;

END= creates a temporary variable indicating the end of the file with the largest number of observations. Here this variable is named ITSOVER and the LOG message is only written when the last observation is processed. At this point the value of the variable MTCH equals the number of observations where ONE and TWO match BY ID. The value of JUSTA is the number of non-matches in ONE. The value of JUSTB is the number of non-matches in TWO. Although it's often the case that the number of OBS in the output data set indicates the number of matches, if a subsequent problem is noticed, then knowing the number of non-matches can be important. When possible I try to have enough information to be able to make predictions about the outcome of a merge. Is it a one-to-one merge where all records should match? Is it a one-to-many match? I can't always predict what the exact values of MTCH, JUSTA and JUSTB should be, but I can usually make some predictions, even if it's simply that there should be very few non-matches.

Let's take a different type of example, one where Input and Program QC will help you write the correct code without trial and error. Suppose you have the spec in (12).

(12) Create a variable AGECAT. If a person is under 18 then AGECAT = CHILD. If the person is 18 or older, then AGECAT = ADULT.

Without having looked at the data, it might be tempting to code this as:

(13) IF AGE < 18 THEN AGECAT = 'CHILD';
     ELSE AGECAT = 'ADULT';

But as part of Input QC, you noticed that there were both missing values and zero values. The conditional in (13) will code both of these values as 'CHILD.' This may be what is intended for 0, but certainly not for missing values. Another way to catch this type of problem is to always include a QC crosstab after re-coding. This crosstab should have the basic form BEFORE*AFTER. Of course the trick to such a crosstab is how to format the BEFORE variable.
For something like AGE (with perhaps 100 different values) it wouldn't be too annoying to leave it unformatted. But that isn't often a realistic option. Here you could run a PROC to get the MIN and MAX values (perhaps using a WHERE AGE GT 0 clause) and then format AGE as MISSING, 0, 1-17 and 18-95 (assuming that 95 was the maximum age on the data set). Then you could modify (13) as in (14). Of course, once you know to ask, it may be a simple check to find out if 0 values should be coded as 'CHILD.' If so, then (14) can be simplified to include 0 in the CHILD range.

(14) IF (1 <= AGE <= 17) THEN AGECAT = 'CHILD';
     ELSE IF (18 <= AGE <= 95) THEN AGECAT = 'ADULT';
     ELSE IF AGE = 0 THEN AGECAT = 'ZERO VALUE';
     ELSE IF AGE = . THEN AGECAT = 'MISSING';
     ELSE AGECAT = '?????';

A useful generalization (which I owe to Ed Heaton) is to avoid open-ended conditionals such as IF AGE < 18. If you require that each end be specified, it forces you to learn more about the data so that you know which values to use for the lower and upper bounds. The final line in (14) points to another part of good programming: conditionals should always exhaust the logical possibilities. The final ELSE gives you control over the unexpected. You will have a clear indicator ("?????") that there are AGE values you haven't accounted for.

Often the best you can do for quality control is to test that two different methods lead to the same result. SAS is great for this because there's usually more than one way to do any particular operation. Consider an example where you have a person-level data set with three expenditure variables: OOP (for out of pocket); PRV (for private); and PUB (for public). The spec is to create a person-level total expenditure variable TOTEXP. Here's a way of building good QC into the DATA step (assume there's a person-level ID variable).

(15) DATA TWO;
        SET ONE END= ITSOVER;
        TOTEXP = (OOP+PRV+PUB);
        SUMTOTXP+TOTEXP;
        TOTCHK = SUM(OOP,PRV,PUB);
        SUMTOTCK+TOTCHK;
        SUMOOP+OOP;
        SUMPRV+PRV;
        SUMPUB+PUB;
        IF TOTEXP NE TOTCHK THEN PUT ID= TOTEXP= TOTCHK= ;
        IF ITSOVER THEN DO;
           IF SUMTOTXP NE SUMTOTCK THEN PUT ID= SUMTOTXP= SUMTOTCK= ;
           ELSE PUT 'SUM TOTALS OK';
           IF SUMTOTXP NE SUM(SUMOOP,SUMPRV,SUMPUB)
              THEN PUT ID= SUMTOTXP= SUMOOP= SUMPRV= SUMPUB= ;
           ELSE PUT 'SUMTOTXP OK' ;
        END;

On the record (i.e. person) level, this DATA step creates a total expenditure variable TOTEXP. As a check on this we create another total expenditure variable TOTCHK with the SUM function. They should be equal, and we test for this on each record (IF TOTEXP NE TOTCHK...). As an added check, we sum these totals and their components (OOP, PRV and PUB) for the full data set and then put in conditionals to make sure that the sums of the totals equal the totals of the sums (i.e. everything adds up the way it should). Clearly there is more that could be built in here, and other ways (e.g. using summary PROCs) to check on these variables. But, even though this DATA step is more complicated than it would be without the QC, it is easier than coming back later and trying to figure out why something's not adding up. The extra assignment statements and conditionals in this DATA step put you in a position where you can say, with confidence, that for each record the TOTEXP variable accurately sums the OOP, PRV and PUB variables. Also, if program LOGs are part of what you deliver to the client, you now have specific program output that you can point to when backing up your assertions about the programs and the data.

Another common area of SAS programming is creating a new file or variable that's at a different level (e.g. person, family, etc.) than the input file. As an example, assume that you have a person-level file with family and person ID variables, plus an age variable. The spec is in (16).
(16) Create a family-level variable FAM_AGE that has a value of 1 if at least one member of the family is 65 or older. Otherwise FAM_AGE = 0.

A common way to do this in SAS is shown in (17). I haven't included the QC parts of the merge because (17) is illustrating common practice.

(17) PROC SORT DATA= ONE;
        BY FAMILY_ID;
     DATA TWO (KEEP= FAMILY_ID FAM_AGE);
        SET ONE;
        BY FAMILY_ID;
        RETAIN FAM_AGE;
        IF FIRST.FAMILY_ID THEN FAM_AGE = 0;
        IF AGE >= 65 THEN FAM_AGE = 1;
        IF LAST.FAMILY_ID;
     PROC SORT DATA= TWO;
        BY FAMILY_ID;
     DATA THREE;
        MERGE ONE TWO;
        BY FAMILY_ID;

In the section on Output QC I will discuss running a crosstab to test the output of this program. But, as we saw with the example using expenditure variables, we can build in a check by creating the FAM_AGE variable another way, as in (18), which assumes that data set ONE is sorted by FAMILY_ID.

(18) DATA CHKFAM;
        SET ONE (IN= A) ONE;
        BY FAMILY_ID;
        RETAIN FAM_AGE;
        IF A THEN DO;
           IF FIRST.FAMILY_ID THEN FAM_AGE = 0;
           IF AGE >= 65 THEN FAM_AGE = 1;
        END;
        ELSE OUTPUT;

In this DATA step the data set ONE is read in twice BY FAMILY_ID. That is, a family group of records is read in from ONE (IN= A), and the FAM_AGE variable is created based on the AGE variable. This variable is RETAINed, and its value on the last record for this family 'spills over' into the records for this same family when ONE is read in again BY FAMILY_ID. The trick here is that this second pass through the family group doesn't trigger the THEN FAM_AGE = 0 conditional, because the value of FAMILY_ID doesn't change, so it's only records on the IN= A data set where FIRST.FAMILY_ID = 1. If you want to see this concretely, create a small data set of 3 or 4 families and place (19) just before the RUN statement.

(19) PUT FAMILY_ID= AGE= FAM_AGE= FIRST.FAMILY_ID= LAST.FAMILY_ID= ;

The DATA step in (18) isn't directly generating any QC output. But it is creating an output data set that will allow OUTPUT QC to yield some results, as we'll see in the next section.
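To run that concrete experiment, a small test version of ONE might look like the following sketch (the values are purely illustrative); read it with (18) plus the PUT statement in (19) to watch FAM_AGE spill over:

```sas
/* Small test file for tracing (18)/(19); already sorted BY FAMILY_ID */
DATA ONE;
   INPUT FAMILY_ID $ AGE;
   DATALINES;
01 70
01 12
02 30
02 41
03 66
03 68
;
RUN;
```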

The general structure, somewhat simplified, of many SAS programs is to input a data set, create a series of temporary data sets, and output a permanent data set. We can illustrate this in (20).

(20) I → T1 → ... → Tn → O

It's a good rule of thumb to generate QC output (either as LOG messages or PROC output) for each temporary data set the program generates. That way, if you have to work your way back from some problematic output to its source, you have a series of snapshots to guide the search.

OUTPUT QC

Let's begin this section with the last example from the previous section. We've created a family-level variable FAM_AGE on a person-level file. How do we check the correctness of our code? The first question is: Given the specs, what SHOULD be true? It SHOULD be the case that all records on the file where AGE is 65 or greater have a FAM_AGE value of 1. That is, if you're 65 or older you must be in a family that has a member who is 65 or older. So, if we run a crosstab FAM_AGE*AGE (where AGE is formatted as 'UNDER 65' and '65 OR OLDER'), we should find that for all records where AGE = '65 OR OLDER', FAM_AGE = 1. But what can we say about the records where AGE is less than 65? Not much directly. This is where the DATA step in (18) comes in. We now have two data sets that should be identical, even though the family-level variable FAM_AGE was created in two different ways. This is a time to use PROC COMPARE. We can use it in its simplest form, as in (21).

(21) PROC COMPARE BASE= THREE COMPARE= CHKFAM;

The PROC COMPARE output should say that there are no observations with variables of unequal value. Note that this is also an added check on the merge in (17). That is, data set THREE should be equal to ONE plus the new FAM_AGE variable. Data set CHKFAM, created in (18), also should be identical to ONE plus the FAM_AGE variable.
So, if (17) and (18) produce identical output, and the crosstab shows that all persons 65 or older are in families where FAM_AGE = 1, then this seems like a pretty strong indication that we're outputting the intended result. Note that it's not proof. We've used the same conditionals in (17) and (18) to create the variables, so if we're repeating an error in our logic that isn't shown by the crosstab, it may still slip through. This points to another aspect of trying to exhaust the possibilities, not just when writing conditionals, but also when testing output. I have seen programs where, having recognized that we can only predict the FAM_AGE values for those 65 and older, programmers will run a PROC FREQ for FAM_AGE and include the clause WHERE AGE >= 65. This is risky because then we can't see the numbers for those under 65. It might be that the crosstab shows that FAM_AGE = 1 for ALL records on the file. This is almost certainly a programming error (if it's true, why create the variable?). The WHERE clause restricts the information that is output, creating the possibility of missing a large problem. This brings us back to content-based programming. What is the population on the data set? Is it families with at least one person receiving Medicare? A random sample of families in the US? Obviously this will affect our expectations. When creating variables at a certain level, it's useful to run FREQs at that level, even if not required by the specs. So run a FREQ for FAM_AGE on the family-level file TWO created in (17). Is the output reasonable given the population?

One last point with this example. If you wanted to directly check that the output file THREE is identical to ONE except for the addition of the FAM_AGE variable, you can compare them by simply dropping the new variable, as in (22).

(22) PROC COMPARE BASE= THREE (DROP= FAM_AGE) COMPARE= ONE;

This allows you to check that there haven't been any 'inadvertent' changes along the way.
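The FAM_AGE*AGE crosstab discussed above might be coded as in the following sketch (the format name AGE65F is an arbitrary choice):

```sas
PROC FORMAT;
   VALUE AGE65F  LOW -< 65 = 'UNDER 65'
                 65 - HIGH = '65 OR OLDER';
RUN;

PROC FREQ DATA= THREE;
   /* every record in the '65 OR OLDER' row should fall in the FAM_AGE=1 column */
   TABLES AGE*FAM_AGE / MISSING;
   FORMAT AGE AGE65F.;
RUN;
```

The MISSING option keeps any missing AGE values visible in the table rather than silently excluding them.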
As mentioned in the section on PROGRAM QC, a typical program usually generates a number of temporary data sets before a permanent data set is

created. In addition to including QC LOG messages and PROC output for the temporary data sets, it is important to repeat any relevant QC measures with the final output. Not all will be possible, because the output file might be different in character (e.g. a person-level rather than a family-level file) from some of the temporary data sets, but what is possible should be done. After all, it's the final output that's important. One all-too-common way to check the final output is to print 50 OBS or so of a large file to check that everything looks the way it should. A lot of programmers do this, and a lot of clients request it, because it gives a concrete sense of how the variable values look. But it is not a sufficient QC measure. Sometimes there's a series of record-level edits that really are just a list of individual changes to a file. In this case PROC output, which is usually designed to capture patterns in the data, is of little use. Here's one case where the temptation to dump 50 OBS is hard to resist. If we take the 'data-analysis' perspective, one problem with dumping 50 OBS is that you usually see the following:

(23) PROC PRINT DATA= THREE (OBS= 50);

Here the OBS= 50 tells SAS to print the first 50 observations in the data set. Usually there is no particular reason to print the first 50 observations; it's just convenient. If there really is no alternative to examining a small number of observations to check a large number of record-level edits (i.e. there is no pattern to be tested), I would recommend, since it's now very easy in SAS, taking a simple random sample [SRS], as shown in (24). This is really just a form of data-entry checking.

(24) PROC SURVEYSELECT DATA= THREE
        METHOD= SRS
        SAMPSIZE= 50
        OUT= CHECK3 ;
     PROC PRINT DATA= CHECK3;

You can change SAMPSIZE to be whatever number of records you would normally dump, and then print the output.
This method allows you to avoid basing conclusions on variable values that may be particular to the first 50 or so observations in a data set. It's not perfect, but it's better than the alternative. In many ways OUTPUT QC mirrors INPUT QC in that you want to have a very detailed picture of both input and output data sets. In the INPUT QC section I stressed the essential role of PROC CONTENTS. This is equally important with OUTPUT QC. The list of data set properties in (3), repeated here as (25), should form a minimal output checklist.

(25) a. Number of Variables
     b. Number of Observations
     c. Sorted?
     d. Engine (V8?)
     e. Variable Name
     f. Variable Type (Numeric/Character?)
     g. Variable Length
     h. Variable Labels
     i. Variable Formats

Are the number of variables and observations correct? Should the data set be sorted? Do all the variables have labels and/or formats? I would also add to this list a question about compression. SAS V8 offers good options for compression that are worth considering (COMPRESS= CHAR or BINARY). It is also useful to use the POSITION or VARNUM option to clearly see the position of the variables on the data set. This is often a good way to pick up any stray variables (e.g. I or X from a DO loop) that may have been accidentally left on the data set. These variables will tend to be positioned near the end.
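Listing the variables in creation order rather than alphabetically takes a single option; a minimal sketch (the data set name is a placeholder):

```sas
/* VARNUM lists variables in position order; stray DO-loop
   variables (I, X, ...) tend to show up at the end */
PROC CONTENTS DATA= DSN_OUT VARNUM;
RUN;
```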

The basic point of the approach to quality control discussed here is that it takes very little effort with SAS to make a lot of useful information about your programs and program output available. Before starting to program you want to have as much information about the input data as possible. Intelligent QC demands more than just knowledge of the formal properties of the data; it also requires information about the content of the information, i.e. what the data is about. Once you're underway you want to make available information about the intermediate output of your programs through LOG messages and snapshots of work files used as input to permanent data sets. Work files are, by definition, unavailable after the SAS session, so problem-solving by going back to the programs that created them can require re-running those programs just to re-create the work files so they can be examined. A few PROC FREQs and PROC COMPAREs along the way can save a lot of time down the road. In the next section we take this idea of making useful quality-control data available to the project level. The emphasis will be on what you, as a SAS programmer using your core skills, can do to provide that information in a useful and potentially time- and money-saving way.

PROJECT-LEVEL QUALITY CONTROL

When we looked at Input QC we pointed out the relevance of the cliché, Don't overlook the obvious. The context there was that implicit assumptions we might make about the data often come back to haunt us days or weeks later. At the project level we can look at this from another perspective.

(26) Make important information easily accessible so it's less likely to be overlooked.

In this section of the paper we are going to focus on SAS-provided metadata (i.e. data about SAS files such as data sets and catalogs) in the form of dictionary tables and views.
The idea here is not to provide you with specific checks on specific output, because the particulars of what is needed will vary from project to project. Even the SAS techniques illustrated here will change over time, so the emphasis will be on the approach. As a SAS programmer, with a minimal expansion of what you are already doing with your programs and your data, you can generate important project-level information that is invaluable for quality control and project documentation. Before discussing the metadata available in SAS-provided files such as dictionary tables, consider the discussion of the PROC COMPARE example at the program level and its applicability to the project level. In (20) we illustrated the simplified structure of a program as an input/output device with a succession of intermediate states. At the program level we're concerned with properties of the input, as well as intermediate and final output. At the project level the same concerns exist, except that program output itself can be seen as a series of intermediate output states. From this perspective, PROC COMPARE can be used to provide information about the relationship of variables on the initial input files to the output files. The KEEP and DROP options, as was shown in (22), can be used to isolate variables that should be unchanged from initial input to final output. If the program flow for a project is divided into modules or sections, then input/output comparisons can be made for these sub-project domains as well. In a similar way, the SAS programmer can generate, or make use of, information about the data sets that have been created during the course of the project. Again, making use of this metadata does not involve learning a new set of skills or a new application. As a SAS programmer you already have the necessary expertise (although expanding your PROC SQL knowledge wouldn't hurt).
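The input-to-output comparison just described might be sketched as follows. The libraries, data sets, and the KEEP list of variables expected to pass through unchanged are all hypothetical:

```sas
/* Variables that should be identical on initial input and final output.
   Assumes both data sets are sorted by PERSID. */
PROC COMPARE BASE=INLIB.PERSONS (KEEP=PERSID AGE SEX)
             COMPARE=OUTLIB.FINAL (KEEP=PERSID AGE SEX)
             LISTALL;
   ID PERSID;
RUN;
```

Any differences reported here flag variables that were changed somewhere between input and deliverable, which delimits the search space considerably.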
DICTIONARY.COLUMNS is one of many dictionary tables that the SAS system generates and stores without explicit programmer intervention. As Dilorio and Abolafia (2004) point out, despite their usefulness, and the wealth of information they provide, dictionary tables and views still remain on the periphery of most SAS programmers' practical knowledge. But putting them to use does not require a whole new set of skills, as the following examples illustrate (see also Michael Davis' 2001 SUGI paper on dictionary views). As one example of their usefulness, consider the code in (27).

(27)

PROC SQL;
   CREATE TABLE PROJCONT AS
   SELECT NAME AS VAR_NAME,
          TYPE AS VAR_TYPE,
          MEMNAME AS FILENAME
   FROM DICTIONARY.COLUMNS
   WHERE LIBNAME = "SASHELP";
QUIT;

This PROC SQL step extracts three variables from DICTIONARY.COLUMNS: NAME (renamed to VAR_NAME), TYPE (renamed to VAR_TYPE), and MEMNAME (renamed to FILENAME). These variables provide information about all the members in the SASHELP directory. With a LIBNAME statement you can use whatever LIBREF you want. So now you have a list of all members in this directory, with the filename, the variable name, and the variable type. Once you have the data set PROJCONT, you can manipulate and output it just like you would any SAS data set. You might want to export it to Excel. Here's an easy way to do that.

(28)

ODS HTML FILE='C:\PROJ INFO\FILEINFO\HELPLIST.XLS';
PROC PRINT DATA=PROJCONT NOOBS;
   VAR FILENAME VAR_NAME VAR_TYPE;
RUN;
ODS HTML CLOSE;

Of course there are numerous ways to export SAS data sets to Excel (e.g. the ExcelXP tagset), but here I'm only illustrating the ease with which SAS not only provides useful information about the data sets your project has generated, but also the ease with which you can make this available to others on the project. Building on this, and continuing to use SAS skills that are part of your everyday work, you can get data-set-level information from DICTIONARY.TABLES. Consider the example in (29), a slightly modified example from Dilorio and Abolafia's 2004 SUGI paper.

(29)

PROC SQL;
   CREATE TABLE PROJLIST AS
   SELECT CRDATE AS FILEDATE,
          MEMNAME AS FILENAME,
          (NPAGE*16384/1024) AS SIZE_KB
   FROM DICTIONARY.TABLES
   WHERE LIBNAME = "SASHELP";
QUIT;

The CRDATE variable is the creation date (there's also MODATE for the date modified). SIZE_KB uses Dilorio and Abolafia's technique for creating a handy file-size variable.
Once you've created this data set, you can sort both it and the data set created in (27) by FILENAME, and then merge them into a data set that combines data-set and variable information. If you sort by the creation date you can create a data set and variable chronology that is useful not only for quality control, but also for project documentation. Once you start this process it builds on itself, and soon you're wondering how you worked on projects without this kind of handy metadata that you've created for yourself and others on the project. As mentioned above, you can export this information to Excel or other applications. Often these data sets can become quite large, so finding handy ways to access subsets of the data on the fly can be a big help when you're in a hurry and just need to know which data set first had variable X or Y. I've found that, when exporting to Excel, the AUTOFILTER option in the ExcelXP tagset provides a great way of automatically creating a spreadsheet where I can pick out variables or other data set properties of interest. See any of Eric Gebhart's papers on this (e.g. his SUGI 30 paper listed in the references).
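The sort-and-merge step just described can be sketched using the PROJCONT and PROJLIST data sets created in (27) and (29); PROJMETA is my own name for the combined file:

```sas
PROC SORT DATA=PROJCONT; BY FILENAME; RUN;
PROC SORT DATA=PROJLIST; BY FILENAME; RUN;

/* Combine data-set-level and variable-level metadata */
DATA PROJMETA;
   MERGE PROJLIST PROJCONT;
   BY FILENAME;
RUN;

/* Sort by creation date for a data set and variable chronology */
PROC SORT DATA=PROJMETA;
   BY FILEDATE FILENAME VAR_NAME;
RUN;
```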

The DICTIONARY.COLUMNS table is useful for looking at information across data sets. Let's say you wanted to see which variable names appear in more than one data set. You could modify (27) as in (30).

(30)

PROC SQL;
   CREATE TABLE VDUPLIST AS
   SELECT NAME AS VAR_NAME,
          TYPE AS VAR_TYPE,
          MEMNAME AS FILENAME,
          LENGTH
   FROM DICTIONARY.COLUMNS
   WHERE LIBNAME = "SASHELP"
   GROUP BY NAME
   HAVING COUNT(NAME) > 1;
QUIT;

The HAVING clause picks out variables appearing in more than one data set. The inclusion of the variable length and type allows you to check for consistency across data sets. Again, you can export this information to other applications or use SAS to create reports. The point here is that, using the programming knowledge you are already applying at the program level, you can provide yourself and others with valuable information at the project level. Obviously, the more information you have at your fingertips, the easier it's going to be to use that information to ensure the accuracy and consistency of the data your project is generating. Once you start to get a sense of how easy it is to use your SAS programming skills to generate quality-control and documentation information, you can quickly build on this knowledge to fine-tune what's most useful for your project. Continuing to use the SASHELP files as an illustration, we can adapt another example from Dilorio and Abolafia (2004) to show this type of fine-tuning.

(31)

PROC SQL;
   CREATE TABLE VDUPLIST AS
   SELECT NAME AS VAR_NAME,
          TYPE AS VAR_TYPE,
          MEMNAME AS FILENAME,
          LENGTH
   FROM DICTIONARY.COLUMNS
   WHERE LIBNAME = "SASHELP"
   GROUP BY NAME
   HAVING COUNT(DISTINCT LENGTH) > 1
       OR COUNT(DISTINCT TYPE) > 1;
QUIT;

If you run this for the SASHELP data sets you will see output such as that shown in (32) below (only partial output is given). The HAVING clause picks out variables with different length or type properties on different data sets.
Clearly, if this were actual output for your project, you'd want to investigate situations where the same variable name has a different data type and/or length on different data sets. If you had created a variable chronology file using the methods described above, you could quickly gather additional information to try to figure out what's going on with these variables. Once you've identified unexpected differences among variables on different data sets, you need to investigate programs where these variables were used in merges, conditionals, and other program areas where variable properties come into play. But even if using DICTIONARY.COLUMNS, or other metadata easily obtained with SAS (such as output data sets from PROC CONTENTS), does not reveal any quality-control issues that need to be addressed, you are in a better position to document your project activities and to demonstrate the accuracy of your output and deliverables. Often output data can look odd or wrong but is actually accurate, given the input data. If you've built in good quality-control checks, you're in a good position to efficiently answer client (or project manager) questions about the accuracy of the data.

(32)

VAR_NAME  FILENAME  VAR_TYPE  LENGTH
CLASS     BVHELP    char      20
CLASS     MRRGSTRY  char      35
COUNTRY   PRDSAL3   char      10
COUNTRY   MDV       char      14
COUNTRY   PRDSAL2   char      10
COUNTRY   PRDSALE   char      10
COUNTY    PRDSAL3   char      20
COUNTY    EISILCO   num        6
COUNTY    TGRMAPC   num        6
COUNTY    PRDSAL2   char      20
COUNTY    ZIPCODE   num        8
DATE      CITIDAY   num        7
DATE      RENT      num        8
DATE      ROCKPIT   num        8
DATE      USECON    num        5

Metadata is data, and SAS programmers are experts at making data available for analysis, especially when SAS provides the data in a form that we already know how to manipulate and put to use. Project information is also data, and SAS programmers are in an excellent position to make essential contributions at the project level.

CONCLUSION

The basic principle of QC is to know your data. Good programming requires that you know your input data, that you know the relevant properties of intermediate data sets, and that you know the details of your permanent output. An important element of this is distinguishing between the values that variables actually have and the values that they should have. As much as possible, this should be done for all records on all the files you are working with. SAS programmers are fortunate to have a variety of data-analysis tools at their disposal. The minimal added cost of including QC PROCs or conditional LOG messages in SAS programs is usually more than offset by the benefit of preventing the generation of unintended output, or of being able to efficiently demonstrate the correctness of odd-looking output. In the section on INPUT QC, I said that time pressure is one of the great enemies of good QC. It's clear that in the real world you can't always run every check that you'd like. But there's an awful lot that can be done fairly efficiently. In the same way that there are patterns in data we want to capture with our programs, there are patterns to QC that often recur across the projects we work on.
As with input data before beginning a program, it's worthwhile at the beginning of a project to anticipate QC needs and build them into program specs. Often this will allow a number of QC steps to be 'macrotized', saving project resources by making those steps applicable to a number of different programs. In general, the net cost in time for good QC is minimal, especially given the approach outlined here, where the existing skills of most SAS programmers are used. Many of us know from painful experience the feeling that comes when someone asks, right after the final data delivery, "Did you know that those expenditure values we summed had -1 values for missing data?" Also, because a major part of quality control is making data available, good QC goes hand-in-hand with good data management and documentation. Consider the example where an output table was generated to show variables with different length and/or type on different data sets. This type of information is as much a part of data management and documentation as it is of good quality control.
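A recurring QC step such as the duplicate-variable check in (30) is a natural candidate for 'macrotizing' so that it can be pointed at any project library. This is only a sketch; the macro name and parameters are my own:

```sas
%MACRO VARCHECK(LIB=, OUT=VDUPLIST);
   /* List variables appearing in more than one data set in &LIB */
   PROC SQL;
      CREATE TABLE &OUT AS
      SELECT NAME AS VAR_NAME,
             TYPE AS VAR_TYPE,
             MEMNAME AS FILENAME,
             LENGTH
      FROM DICTIONARY.COLUMNS
      WHERE LIBNAME = "&LIB"
      GROUP BY NAME
      HAVING COUNT(NAME) > 1;
   QUIT;
%MEND VARCHECK;

/* One call per project library in the program specs */
%VARCHECK(LIB=SASHELP);
```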

It can be, as we all know, a frustrating experience to try to figure out why some output data looks wrong, doesn't add up, or doesn't make sense. It can be even more frustrating to have your client call and ask why something doesn't add up. What is needed in those situations is information that delimits the search space for problem solving and reduces the time and frustration involved. Good QC usually means catching problems before you get too far from the source of the problem. This means less time spent problem solving and more time spent making our customers happy. Always a good idea! Good QC goes hand in hand with good project documentation. A significant part of good project documentation is generating metadata in a useful format. Using our SAS skills to do this saves time and effort in those final project stages. This makes our project managers happy. Always a good idea!

REFERENCES

Burlew, Michele M. (2001) Debugging SAS Programs: A Handbook of Tools and Techniques, SAS Institute, Inc. (Cary, NC, USA).
Cody, Ron (1999) Cody's Data Cleaning Techniques Using SAS Software, SAS Institute, Inc. (Cary, NC, USA).
Davis, Michael (2001) "You Could Look It Up: An Introduction to SASHELP Dictionary Views," Proceedings of SUGI 26.
Dilorio, Frank and Abolafia, Jeff (2004) "Dictionary Tables and Views: Essential Tools for Serious Applications," Proceedings of SUGI 29.
Gebhart, Eric (2005) "ODS Tagsets and Excel: An Excellent Combination," Proceedings of SUGI 30.
Gorrell, Paul (2003) "Quality Control With SAS Numeric Data," Proceedings of NESUG.
Gorrell, Paul (2002) "Numeric Data In SAS: Guidelines for Storage and Display," Proceedings of NESUG.
Lafler, Kirk Paul (2005) "Exploring DICTIONARY Tables and Views," Proceedings of SUGI 30.

ACKNOWLEDGMENTS

SAS and all other SAS Institute Inc. product or service names are registered trademarks or trademarks of SAS Institute Inc. in the USA and other countries. ® indicates USA registration.
Other brand and product names are registered trademarks or trademarks of their respective companies. Excel is a registered trademark of the Microsoft Corporation. I would like to thank my colleagues at Social & Scientific Systems, as well as members of the audience at prior NESUG conferences, who have given me invaluable feedback on my SAS programs and earlier papers on quality control. Thanks as well to David Chapman for comments on a draft of this paper.

CONTACT INFORMATION

Paul Gorrell
Social & Scientific Systems, Inc.
Georgia Avenue, 12th Floor
Silver Spring, MD
pgorrell@s-3.com
Telephone: (Office) (Main)
FAX:


More information

Software Testing Prof. Meenakshi D Souza Department of Computer Science and Engineering International Institute of Information Technology, Bangalore

Software Testing Prof. Meenakshi D Souza Department of Computer Science and Engineering International Institute of Information Technology, Bangalore Software Testing Prof. Meenakshi D Souza Department of Computer Science and Engineering International Institute of Information Technology, Bangalore Lecture 04 Software Test Automation: JUnit as an example

More information

Introduction to Programming

Introduction to Programming CHAPTER 1 Introduction to Programming Begin at the beginning, and go on till you come to the end: then stop. This method of telling a story is as good today as it was when the King of Hearts prescribed

More information

Q&A Session for Connect with Remedy - CMDB Best Practices Coffee Break

Q&A Session for Connect with Remedy - CMDB Best Practices Coffee Break Q&A Session for Connect with Remedy - CMDB Best Practices Coffee Break Date: Thursday, March 05, 2015 Q: When going to Asset Management Console and making an update on there, does that go to a sandbox

More information

Client Code - the code that uses the classes under discussion. Coupling - code in one module depends on code in another module

Client Code - the code that uses the classes under discussion. Coupling - code in one module depends on code in another module Basic Class Design Goal of OOP: Reduce complexity of software development by keeping details, and especially changes to details, from spreading throughout the entire program. Actually, the same goal as

More information

Hi everyone. Starting this week I'm going to make a couple tweaks to how section is run. The first thing is that I'm going to go over all the slides

Hi everyone. Starting this week I'm going to make a couple tweaks to how section is run. The first thing is that I'm going to go over all the slides Hi everyone. Starting this week I'm going to make a couple tweaks to how section is run. The first thing is that I'm going to go over all the slides for both problems first, and let you guys code them

More information

Linked Lists. What is a Linked List?

Linked Lists. What is a Linked List? Linked Lists Along with arrays, linked lists form the basis for pretty much every other data stucture out there. This makes learning and understand linked lists very important. They are also usually the

More information

how its done in about the five most common SQL implementations.

how its done in about the five most common SQL implementations. SQL PDF Database management. It may sound daunting, but it doesn't have to be, even if you've never programmed before. SQL: Visual QuickStart Guide isn't an exhaustive guide to SQL written for aspiring

More information

MITOCW watch?v=w_-sx4vr53m

MITOCW watch?v=w_-sx4vr53m MITOCW watch?v=w_-sx4vr53m The following content is provided under a Creative Commons license. Your support will help MIT OpenCourseWare continue to offer high-quality educational resources for free. To

More information

Hi everyone. I hope everyone had a good Fourth of July. Today we're going to be covering graph search. Now, whenever we bring up graph algorithms, we

Hi everyone. I hope everyone had a good Fourth of July. Today we're going to be covering graph search. Now, whenever we bring up graph algorithms, we Hi everyone. I hope everyone had a good Fourth of July. Today we're going to be covering graph search. Now, whenever we bring up graph algorithms, we have to talk about the way in which we represent the

More information

Stat Wk 3. Stat 342 Notes. Week 3, Page 1 / 71

Stat Wk 3. Stat 342 Notes. Week 3, Page 1 / 71 Stat 342 - Wk 3 What is SQL Proc SQL 'Select' command and 'from' clause 'group by' clause 'order by' clause 'where' clause 'create table' command 'inner join' (as time permits) Stat 342 Notes. Week 3,

More information

Skill 1: Multiplying Polynomials

Skill 1: Multiplying Polynomials CS103 Spring 2018 Mathematical Prerequisites Although CS103 is primarily a math class, this course does not require any higher math as a prerequisite. The most advanced level of mathematics you'll need

More information

MITOCW watch?v=hverxup4cfg

MITOCW watch?v=hverxup4cfg MITOCW watch?v=hverxup4cfg PROFESSOR: We've briefly looked at graph isomorphism in the context of digraphs. And it comes up in even more fundamental way really for simple graphs where the definition is

More information

SIMPLE PROGRAMMING. The 10 Minute Guide to Bitwise Operators

SIMPLE PROGRAMMING. The 10 Minute Guide to Bitwise Operators Simple Programming SIMPLE PROGRAMMING The 10 Minute Guide to Bitwise Operators (Cause you've got 10 minutes until your interview starts and you know you should probably know this, right?) Twitter: Web:

More information

Principles of Algorithm Design

Principles of Algorithm Design Principles of Algorithm Design When you are trying to design an algorithm or a data structure, it s often hard to see how to accomplish the task. The following techniques can often be useful: 1. Experiment

More information

1 Getting used to Python

1 Getting used to Python 1 Getting used to Python We assume you know how to program in some language, but are new to Python. We'll use Java as an informal running comparative example. Here are what we think are the most important

More information

Statistics, Data Analysis & Econometrics

Statistics, Data Analysis & Econometrics ST009 PROC MI as the Basis for a Macro for the Study of Patterns of Missing Data Carl E. Pierchala, National Highway Traffic Safety Administration, Washington ABSTRACT The study of missing data patterns

More information

Macros I Use Every Day (And You Can, Too!)

Macros I Use Every Day (And You Can, Too!) Paper 2500-2018 Macros I Use Every Day (And You Can, Too!) Joe DeShon ABSTRACT SAS macros are a powerful tool which can be used in all stages of SAS program development. Like most programmers, I have collected

More information

Lab 4. Recall, from last lab the commands Table[], ListPlot[], DiscretePlot[], and Limit[]. Go ahead and review them now you'll be using them soon.

Lab 4. Recall, from last lab the commands Table[], ListPlot[], DiscretePlot[], and Limit[]. Go ahead and review them now you'll be using them soon. Calc II Page 1 Lab 4 Wednesday, February 19, 2014 5:01 PM Recall, from last lab the commands Table[], ListPlot[], DiscretePlot[], and Limit[]. Go ahead and review them now you'll be using them soon. Use

More information

Clean & Speed Up Windows with AWO

Clean & Speed Up Windows with AWO Clean & Speed Up Windows with AWO C 400 / 1 Manage Windows with this Powerful Collection of System Tools Every version of Windows comes with at least a few programs for managing different aspects of your

More information

Intro. Scheme Basics. scm> 5 5. scm>

Intro. Scheme Basics. scm> 5 5. scm> Intro Let s take some time to talk about LISP. It stands for LISt Processing a way of coding using only lists! It sounds pretty radical, and it is. There are lots of cool things to know about LISP; if

More information

Debugging 101 Peter Knapp U.S. Department of Commerce

Debugging 101 Peter Knapp U.S. Department of Commerce Debugging 101 Peter Knapp U.S. Department of Commerce Overview The aim of this paper is to show a beginning user of SAS how to debug SAS programs. New users often review their logs only for syntax errors

More information

MITOCW watch?v=yarwp7tntl4

MITOCW watch?v=yarwp7tntl4 MITOCW watch?v=yarwp7tntl4 The following content is provided under a Creative Commons license. Your support will help MIT OpenCourseWare continue to offer high quality, educational resources for free.

More information

How to Create Data-Driven Lists

How to Create Data-Driven Lists Paper 9540-2016 How to Create Data-Driven Lists Kate Burnett-Isaacs, Statistics Canada ABSTRACT As SAS programmers we often want our code or program logic to be driven by the data at hand, rather than

More information

Q &A on Entity Relationship Diagrams. What is the Point? 1 Q&A

Q &A on Entity Relationship Diagrams. What is the Point? 1 Q&A 1 Q&A Q &A on Entity Relationship Diagrams The objective of this lecture is to show you how to construct an Entity Relationship (ER) Diagram. We demonstrate these concepts through an example. To break

More information

CFMG Training Modules Classified Ad Strategy Module

CFMG Training Modules Classified Ad Strategy Module CFMG Training Modules Classified Ad Strategy Module In This Module: 1. Introduction 2. Preliminary Set Up Create the Sequential Letterset for our Ad Strategy Set Up Your Vanity Responder Create Your Stealth

More information

MITOCW watch?v=se4p7ivcune

MITOCW watch?v=se4p7ivcune MITOCW watch?v=se4p7ivcune The following content is provided under a Creative Commons license. Your support will help MIT OpenCourseWare continue to offer high quality educational resources for free. To

More information

A Simple Framework for Sequentially Processing Hierarchical Data Sets for Large Surveys

A Simple Framework for Sequentially Processing Hierarchical Data Sets for Large Surveys A Simple Framework for Sequentially Processing Hierarchical Data Sets for Large Surveys Richard L. Downs, Jr. and Pura A. Peréz U.S. Bureau of the Census, Washington, D.C. ABSTRACT This paper explains

More information

Validating And Updating Your Data Using SAS Formats Peter Welbrock, Britannia Consulting, Inc., MA

Validating And Updating Your Data Using SAS Formats Peter Welbrock, Britannia Consulting, Inc., MA Validating And Updating Your Data Using SAS Formats Peter Welbrock, Britannia Consulting, Inc., MA Overview In whatever way you use SAS software, at some point you will have to deal with data. It is unavoidable.

More information

Porsche 91 1GT D m o d e ling tutorial - by Nim

Porsche 91 1GT D m o d e ling tutorial - by Nim orsche 911GT 3D modeling tutorial - by Nimish In this tutorial you will learn to model a car using Spline modeling method. This method is not very much famous as it requires considerable amount of skill

More information

Paper S Data Presentation 101: An Analyst s Perspective

Paper S Data Presentation 101: An Analyst s Perspective Paper S1-12-2013 Data Presentation 101: An Analyst s Perspective Deanna Chyn, University of Michigan, Ann Arbor, MI Anca Tilea, University of Michigan, Ann Arbor, MI ABSTRACT You are done with the tedious

More information

/633 Introduction to Algorithms Lecturer: Michael Dinitz Topic: Sorting lower bound and Linear-time sorting Date: 9/19/17

/633 Introduction to Algorithms Lecturer: Michael Dinitz Topic: Sorting lower bound and Linear-time sorting Date: 9/19/17 601.433/633 Introduction to Algorithms Lecturer: Michael Dinitz Topic: Sorting lower bound and Linear-time sorting Date: 9/19/17 5.1 Introduction You should all know a few ways of sorting in O(n log n)

More information

AHHHHHHH!!!! NOT TESTING! Anything but testing! Beat me, whip me, send me to Detroit, but don t make me write tests!

AHHHHHHH!!!! NOT TESTING! Anything but testing! Beat me, whip me, send me to Detroit, but don t make me write tests! NAME DESCRIPTION Test::Tutorial - A tutorial about writing really basic tests AHHHHHHH!!!! NOT TESTING! Anything but testing! Beat me, whip me, send me to Detroit, but don t make me write tests! *sob*

More information

How to Create a Killer Resources Page (That's Crazy Profitable)

How to Create a Killer Resources Page (That's Crazy Profitable) How to Create a Killer Resources Page (That's Crazy Profitable) There is a single page on your website that, if used properly, can be amazingly profitable. And the best part is that a little effort goes

More information

HELPLINE. Dilwyn Jones

HELPLINE. Dilwyn Jones HELPLINE Dilwyn Jones Remember that you can send me your Helpline queries by email to helpline@quanta.org.uk, or by letter to the address inside the front cover. While we do our best to help, we obviously

More information

The following content is provided under a Creative Commons license. Your support

The following content is provided under a Creative Commons license. Your support MITOCW Lecture 9 The following content is provided under a Creative Commons license. Your support will help MIT OpenCourseWare continue to offer high-quality educational resources for free. To make a donation

More information

Smart formatting for better compatibility between OpenOffice.org and Microsoft Office

Smart formatting for better compatibility between OpenOffice.org and Microsoft Office Smart formatting for better compatibility between OpenOffice.org and Microsoft Office I'm going to talk about the backbreaking labor of helping someone move and a seemingly unrelated topic, OpenOffice.org

More information

The first thing we ll need is some numbers. I m going to use the set of times and drug concentration levels in a patient s bloodstream given below.

The first thing we ll need is some numbers. I m going to use the set of times and drug concentration levels in a patient s bloodstream given below. Graphing in Excel featuring Excel 2007 1 A spreadsheet can be a powerful tool for analyzing and graphing data, but it works completely differently from the graphing calculator that you re used to. If you

More information

PROC SUMMARY AND PROC FORMAT: A WINNING COMBINATION

PROC SUMMARY AND PROC FORMAT: A WINNING COMBINATION PROC SUMMARY AND PROC FORMAT: A WINNING COMBINATION Alan Dickson - Consultant Introduction: Is this scenario at all familiar to you? Your users want a report with multiple levels of subtotals and grand

More information

ChatScript Finaling a Bot Manual Bruce Wilcox, Revision 12/31/13 cs3.81

ChatScript Finaling a Bot Manual Bruce Wilcox, Revision 12/31/13 cs3.81 ChatScript Finaling a Bot Manual Bruce Wilcox, gowilcox@gmail.com Revision 12/31/13 cs3.81 OK. You've written a bot. It sort of seems to work. Now, before releasing it, you should polish it. There are

More information

WHAT ARE SASHELP VIEWS?

WHAT ARE SASHELP VIEWS? Paper PN13 There and Back Again: Navigating between a SASHELP View and the Real World Anita Rocha, Center for Studies in Demography and Ecology University of Washington, Seattle, WA ABSTRACT A real strength

More information

Using Metadata Queries To Build Row-Level Audit Reports in SAS Visual Analytics

Using Metadata Queries To Build Row-Level Audit Reports in SAS Visual Analytics SAS6660-2016 Using Metadata Queries To Build Row-Level Audit Reports in SAS Visual Analytics ABSTRACT Brandon Kirk and Jason Shoffner, SAS Institute Inc., Cary, NC Sensitive data requires elevated security

More information