What just happened? A visual tool for highlighting differences between two data sets

Steve Cavill, NSW Bureau of Crime Statistics and Research, Sydney, Australia

ABSTRACT

Base SAS includes a great utility for comparing two data sets: PROC COMPARE. The output, though, can be hard to read because the differences between values are listed separately for each variable, so it is hard to see the differences across all variables for the same observation. This talk presents a macro that compares two SAS data sets and displays the differences in Excel. The PROC COMPARE OUT= option creates an output data set containing all the differences. This data set is then processed with PROC REPORT, using ODS EXCEL and colour highlighting, to show the differences in an Excel workbook, making them easy to see.

INTRODUCTION

A common requirement of software development is regression testing, that is, checking that the changes in your output data are as intended. One simple method is to compare your output data sets before and after a software change.

The NSW Bureau of Crime Statistics and Research (BOCSAR) creates output data sets on a monthly basis for analyzing crime trends in New South Wales, Australia. These data are continually being refined, which requires constant changes to the SAS code used to create the data. An important check in this process is that the code and data changes have not introduced errors into the data collection. PROC COMPARE is one of the tools used to validate the data by comparing before and after copies of the data.

PROC COMPARE is a powerful tool for comparing two SAS data sets. Its output is variable-centric, that is, it shows all the changes grouped by variable. So if you have changed a small number of variables across a large number of observations, the standard PROC COMPARE output is easy to read. However, it isn't particularly helpful when many variables change across a set of observations.

PROC COMPARE, like many SAS reporting procedures, can send its output to a SAS data set. It's then not all that complicated to take that output and refine it into a report that makes your changes easier to see.

The full code for this illustration is in the Appendix. It uses sample SAS data sets that are part of your SAS installation.

PROC COMPARE STANDARD OUTPUT

This paper uses a standard SAS sample data set to illustrate how you can refine the output of PROC COMPARE. SASHELP.BASEBALL contains some simple baseball statistics. I have made some changes to this data set to illustrate the benefit of changing the way PROC COMPARE output appears. Here is the program to introduce deliberate (though somewhat random) changes to the data:

   proc sort data=sashelp.baseball out=before;
      by name;
   run;

   data after;
      set before;
      drop c:;
      array nums _numeric_;
      if _n_ in (5,71) then do i=1 to dim(nums);
         nums(i)=nums(i) - _n_;
      end;
      if _n_ in (6,88) then do i=1 to dim(nums) by 3;
         nums(i)=nums(i) - _n_;
      end;
      if _n_ in (74,100) then team='Boston';
   run;
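The paper does not list the call that produced the standard output. Judging from the header of that output (Method=EXACT, with Name as the ID variable), it was presumably close to the sketch below; the exact options are an assumption. The data preparation is repeated only so the block runs on its own. The &sysinfo check at the end is an optional extra that is not part of the paper, and compare_rc is a name chosen here for illustration: PROC COMPARE sets the automatic macro variable SYSINFO to 0 when it finds no differences, which can be handy in a regression test.

```sas
/* Rebuild the BEFORE and AFTER data sets used in this section */
proc sort data=sashelp.baseball out=before;
   by name;
run;

data after;
   set before;
   drop c:;
   array nums _numeric_;
   if _n_ in (5,71) then do i=1 to dim(nums);
      nums(i)=nums(i) - _n_;
   end;
   if _n_ in (6,88) then do i=1 to dim(nums) by 3;
      nums(i)=nums(i) - _n_;
   end;
   if _n_ in (74,100) then team='Boston';
run;

/* Plain comparison, matching observations on Name (assumed call) */
proc compare base=before compare=after;
   id name;
run;

/* SYSINFO has to be captured immediately after the PROC COMPARE step */
%let compare_rc=&sysinfo;
%put NOTE: PROC COMPARE return code is &compare_rc (0 means no differences);
```

A non-zero value only says that something differs; the listing, or the OUT= data set used later in the paper, is still needed to see what.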

And here is the standard output from PROC COMPARE:

   The COMPARE Procedure
   Comparison of WORK.BEFORE with WORK.AFTER
   (Method=EXACT)

   Data Set Summary

   Dataset      Created           Modified          NVar  NObs  Label
   WORK.BEFORE  25JUL16:06:46:24  25JUL16:06:46:24    24   322  1986 Baseball Data
   WORK.AFTER   25JUL16:06:46:24  25JUL16:06:46:24    19   322

   Variables Summary

   Number of Variables in Common: 18.
   Number of Variables in WORK.BEFORE but not in WORK.AFTER: 6.
   Number of Variables in WORK.AFTER but not in WORK.BEFORE: 1.
   Number of ID Variables: 1.

   Observation Summary

   Observation     Base  Compare  ID
   First Obs          1        1  Name=Aldrete, Mike
   First Unequal      5        5  Name=Armas, Tony
   Last Unequal     100      100  Name=Franco, Julio
   Last Obs         322      322  Name=Yount, Robin

   Number of Observations in Common: 322.
   Total Number of Observations Read from WORK.BEFORE: 322.
   Total Number of Observations Read from WORK.AFTER: 322.
   Number of Observations with Some Compared Variables Unequal: 6.
   Number of Observations with All Compared Variables Equal: 316.

   Values Comparison Summary

   Number of Variables Compared with All Observations Equal: 4.
   Number of Variables Compared with Some Observations Unequal: 13.
   Total Number of Values which Compare Unequal: 32.
   Maximum Difference: 88.

   Variables with Unequal Values

   Variable   Type  Len  Label                       Ndif  MaxDif
   Team       CHAR   14  Team at the End of 1986        2
   natbat     NUM     8  Times at Bat in 1986           4  88.000
   nhits      NUM     8  Hits in 1986                   2  71.000
   nhome      NUM     8  Home Runs in 1986              2  71.000
   nruns      NUM     8  Runs in 1986                   4  88.000
   nrbi       NUM     8  RBIs in 1986                   2  71.000
   nbb        NUM     8  Walks in 1986                  2  71.000
   YrMajor    NUM     8  Years in the Major Leagues     4  88.000
   nouts      NUM     8  Put Outs in 1986               2  71.000
   nassts     NUM     8  Assists in 1986                2  71.000
   nerror     NUM     8  Errors in 1986                 4  88.000
   Salary     NUM     8  1987 Salary in $ Thousands     1  71.000
   logsalary  NUM     8  Log Salary                     1  71.000

   Value Comparison Results for Variables

   Team at the End of 1986
                   Base Value  Compare Value
   Name            Team        Team
   Dawson, Andre   Montreal    Boston
   Franco, Julio   Cleveland   Boston

   Times at Bat in 1986
                        Base    Compare
   Name               natbat     natbat      Diff.     % Diff
   Armas, Tony      425.0000   420.0000    -5.0000    -1.1765
   Ashby, Alan      315.0000   309.0000    -6.0000    -1.9048
   Davis, Glenn     574.0000   503.0000   -71.0000   -12.3693
   Easler, Mike     490.0000   402.0000   -88.0000   -17.9592

   Hits in 1986
                        Base    Compare
   Name                nhits      nhits      Diff.     % Diff
   Armas, Tony      112.0000   107.0000    -5.0000    -4.4643
   Davis, Glenn     152.0000    81.0000   -71.0000   -46.7105

   Home Runs in 1986
                        Base    Compare
   Name                nhome      nhome      Diff.     % Diff
   Armas, Tony       11.0000     6.0000    -5.0000   -45.4545
   Davis, Glenn      31.0000   -40.0000   -71.0000  -229.0323

   Runs in 1986
                        Base    Compare
   Name                nruns      nruns      Diff.     % Diff
   Armas, Tony       40.0000    35.0000    -5.0000   -12.5000
   Ashby, Alan       24.0000    18.0000    -6.0000   -25.0000
   Davis, Glenn      91.0000    20.0000   -71.0000   -78.0220
   Easler, Mike      64.0000   -24.0000   -88.0000  -137.5000

   RBIs in 1986
                        Base    Compare
   Name                 nrbi       nrbi      Diff.     % Diff
   Armas, Tony       58.0000    53.0000    -5.0000    -8.6207
   Davis, Glenn     101.0000    30.0000   -71.0000   -70.2970

   Walks in 1986
                        Base    Compare
   Name                  nbb        nbb      Diff.     % Diff
   Armas, Tony       24.0000    19.0000    -5.0000   -20.8333
   Davis, Glenn      64.0000    -7.0000   -71.0000  -110.9375

   Years in the Major Leagues
                        Base    Compare
   Name              YrMajor    YrMajor      Diff.     % Diff
   Armas, Tony       11.0000     6.0000    -5.0000   -45.4545
   Ashby, Alan       14.0000     8.0000    -6.0000   -42.8571
   Davis, Glenn       3.0000   -68.0000   -71.0000      -2367
   Easler, Mike      13.0000   -75.0000   -88.0000  -676.9231

   Put Outs in 1986
                        Base    Compare
   Name                nouts      nouts      Diff.     % Diff
   Armas, Tony      247.0000   242.0000    -5.0000    -2.0243
   Davis, Glenn         1253       1182   -71.0000    -5.6664

   Assists in 1986
                        Base    Compare
   Name               nassts     nassts      Diff.     % Diff
   Armas, Tony        4.0000    -1.0000    -5.0000  -125.0000
   Davis, Glenn     111.0000    40.0000   -71.0000   -63.9640

   Errors in 1986
                        Base    Compare
   Name               nerror     nerror      Diff.     % Diff
   Armas, Tony        8.0000     3.0000    -5.0000   -62.5000
   Ashby, Alan       10.0000     4.0000    -6.0000   -60.0000
   Davis, Glenn      11.0000   -60.0000   -71.0000  -645.4545
   Easler, Mike            0   -88.0000   -88.0000          .

   1987 Salary in $ Thousands
                        Base    Compare
   Name               Salary     Salary      Diff.     % Diff
   Davis, Glenn     215.0000   144.0000   -71.0000   -33.0233

   Log Salary
                        Base    Compare
   Name            logsalary  logsalary      Diff.     % Diff
   Davis, Glenn       5.3706   -65.6294   -71.0000      -1322

Output 1. Standard output from PROC COMPARE

Here is the output from the CompareHighlight macro, which transforms the standard output from PROC COMPARE.

Output 2. Output from CompareHighlight macro

COMPAREHIGHLIGHT MACRO

The code to produce the highlighted output above (Output 2) is deceptively simple:

   %comparehighlight(base=before,compare=after,id=name,xlfile=/folders/myfolders/baseball.xlsx)

HOW DOES IT WORK?

To show how the macro works I have created a very simple data set:

   data one;
      retain i 1;
      drop i;
      length id $3 num1-num3 8;
      input id num1 num2 num3;
      cards;
   aaa 1 2 3
   bbb 4 5 6
   ccc 7 8 9
   ;;;;

   data two;
      set one;
      if id='bbb' then num1=num1+1;
      if id='ccc' then num3=num3+1;
   run;

The standard PROC COMPARE output for the two data sets looks like this (focusing just on the variable comparison part of the output). You can see that num1 was changed on row bbb and num3 was changed on row ccc:

   proc compare base=one compare=two;
      id id;
   run;

   The COMPARE Procedure
   Comparison of WORK.ONE with WORK.TWO
   (Method=EXACT)

   Data Set Summary

   Dataset   Created           Modified          NVar  NObs
   WORK.ONE  25JUL16:07:04:01  25JUL16:07:04:01     4     3
   WORK.TWO  25JUL16:07:04:01  25JUL16:07:04:01     4     3

   Variables with Unequal Values

   Variable  Type  Len  Ndif  MaxDif
   num1      NUM     8     1   1.000
   num3      NUM     8     1   1.000

   Value Comparison Results for Variables

              Base   Compare
   id         num1      num1     Diff.    % Diff
   bbb      4.0000    5.0000    1.0000   25.0000

              Base   Compare
   id         num3      num3     Diff.    % Diff
   ccc      9.0000   10.0000    1.0000   11.1111

Output 3: Standard PROC COMPARE output for simple data set

Here is the output from %CompareHighlight:

   options mprint;
   %comparehighlight(base=one,compare=two,id=id,xlfile=/folders/myfolders/simple1.xlsx)

Note that the standard output is actually in Excel; Output 3 is an HTML rendering of that Excel workbook for display purposes. Each of the 4 mini tables is a tab in one Excel workbook, as illustrated below (Output 4).

Output 4: Excel workbook from %CompareHighlight

Output 3: Tables of output from %CompareHighlight

STEP 1: CREATE AN OUTPUT DATA SET FROM PROC COMPARE

   proc compare base=one compare=two out=xtemp
                outbase outcomp outdiff outpercent outnoequal;
      id id;
   run;

   outbase     outputs the observations in the BASE data set that have differences, including those that appear ONLY in the BASE data set
   outcomp     outputs the observations in the COMPARE data set that have differences, including those that appear ONLY in the COMPARE data set
   outdiff     outputs the difference (compare value minus base value) for every row that has differences
   outpercent  outputs the percentage change for each difference
   outnoequal  restricts output to observations that have differences
   id          is used to match observations; otherwise PROC COMPARE matches observations one to one by position

   Obs  _TYPE_   _OBS_  id   num1  num2     num3
     1  BASE         2  bbb     4     5   6.0000
     2  COMPARE      2  bbb     5     5   6.0000
     3  DIF          2  bbb     1     E        E
     4  PERCENT      2  bbb    25     E        E
     5  BASE         3  ccc     7     8   9.0000
     6  COMPARE      3  ccc     7     8  10.0000
     7  DIF          3  ccc     E     E   1.0000
     8  PERCENT      3  ccc     E     E  11.1111

Output 5: Output data set from PROC COMPARE (data set work.xtemp)

STEP 2: MAKE REQUIRED CHANGES TO THE OUTPUT DATA SET

The output data set needs to be manipulated slightly before we can process it with PROC REPORT.

2A Change "equal", which is the special missing value .E, to 0

You can see in Output 5 that variables that are equal contain an E, which is the special missing value .E. This will cause us headaches later in PROC REPORT, so we change it to zero.

2B Rename all the difference variables to contain format info for PROC REPORT

We will see how this is used later in PROC REPORT in step 3B.
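Steps 2A and 2B can be sketched as follows. This is a minimal illustration rather than the macro's own code: the macro does the renaming through the author's %renamevariables utility, which is not listed in this paper, so PROC SQL with DICTIONARY.COLUMNS and PROC DATASETS are swapped in here. The macro variable name renames is an assumption made for the example, and the first three steps simply recreate the small ONE and TWO data sets and the XTEMP output so the block runs on its own.

```sas
/* Recreate the simple example data and the PROC COMPARE output data set */
data one;
   input id $ num1 num2 num3;
   cards;
aaa 1 2 3
bbb 4 5 6
ccc 7 8 9
;
run;

data two;
   set one;
   if id='bbb' then num1=num1+1;
   if id='ccc' then num3=num3+1;
run;

proc compare base=one compare=two out=xtemp noprint
             outbase outcomp outdiff outpercent outnoequal;
   id id;
run;

/* 2A: replace the special missing value .E ("equal") with zero */
data diffs;
   set xtemp;
   array nums(*) num1-num3;
   do i=1 to dim(nums);
      if nums(i)=.e then nums(i)=0;
   end;
   drop i;
run;

/* 2B: build "num1=format_num1 num2=format_num2 ..." from the dictionary */
proc sql noprint;
   select catx('=', name, cats('format_', name))
      into :renames separated by ' '
      from dictionary.columns
      where libname='WORK' and memname='DIFFS'
        and upcase(name) not in ('_TYPE_','_OBS_','ID');
quit;

/* Apply the renames in place, without rewriting the data */
proc datasets lib=work nolist;
   modify diffs;
   rename &renames;
quit;
```

Generating the rename list from the dictionary keeps the step independent of how many variables are compared, which is the same reason the macro delegates the job to a utility.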

Below you can see the E values have been changed to zero and the comparison variables have been prefixed with format_:

   Obs  _TYPE_   _OBS_  id   format_num1  format_num2  format_num3
     1  BASE         2  bbb            4            5       6.0000
     2  COMPARE      2  bbb            5            5       6.0000
     3  DIF          2  bbb            1            0       0.0000
     4  PERCENT      2  bbb           25            0       0.0000
     5  BASE         3  ccc            7            8       9.0000
     6  COMPARE      3  ccc            7            8      10.0000
     7  DIF          3  ccc            0            0       1.0000
     8  PERCENT      3  ccc            0            0      11.1111

Output 6: Revised output data set from PROC COMPARE (data set work.diffs)

STEP 3: PROCESS EACH TYPE OF OUTPUT ROW TO CREATE AN EXCEL SHEET

Step 3A Merge each type of output row with the DIF row to create 4 different data sets

The code below is repeated for each type of compare row (BASE, COMPARE, DIF and PERCENT) to create 4 data sets to process with PROC REPORT: diffs_report_base, diffs_report_compare, diffs_report_dif and diffs_report_percent. You can see that num1, num2 and num3 contain the original values and format_num1, format_num2 and format_num3 contain the differences (from the DIF rows in work.diffs, which were renamed earlier):

   data diffs_report_base;
      merge xtemp(where=(_type_="BASE") in=infirst)
            diffs(where=(_type_='DIF') in=insecond);
      by id;
      if infirst and insecond;
   run;

   Obs  _TYPE_  _OBS_  id   num1  num2  num3  format_num1  format_num2  format_num3
     1  DIF         2  bbb     4     5     6            1            0            0
     2  DIF         3  ccc     7     8     9            0            0            1

Output 7: Merged data set for input to PROC REPORT (data set work.diffs_report_base)

Step 3B Process each data set with PROC REPORT to highlight the differences

First, open the Excel workbook (this is only done once at the beginning of the process):

   ods listing close;
   ods excel file="/folders/myfolders/simple1.xlsx";

Now we can use PROC REPORT to process each of the 4 data sets to create a highlighted report. We define num1, num2 and num3 as display columns and format_num1, format_num2 and format_num3 as noprint columns. We use a formula based on format_xxx to highlight column xxx if the value in format_xxx is not zero. This is why we changed the special missing value .E to zero, as the missing value caused problems in the formula.

Name the sheet:

   ods excel options(sheet_name="_base");

Then create the highlighted report:

   proc report data=diffs_report_base nowd;
      columns id format_num1 format_num2 format_num3 num1 num2 num3;
      define id/display;
      define format_num1/noprint;
      define format_num2/noprint;
      define format_num3/noprint;
      define num1/display;
      define num2/display;
      define num3/display;

Now we use the value in the format_xxx variables to apply formatting to cells which have differences:

      compute num1;
         if format_num1.sum ne 0 then
            call define(_col_,'style',"style={foreground=red background=yellow}");
      endcomp;
      compute num2;
         if format_num2.sum ne 0 then
            call define(_col_,'style',"style={foreground=red background=yellow}");
      endcomp;
      compute num3;
         if format_num3.sum ne 0 then
            call define(_col_,'style',"style={foreground=red background=yellow}");
      endcomp;
   run;

Then repeat the PROC REPORT step for each of the 4 difference data sets.

HANDLING OBSERVATIONS THAT ONLY APPEAR IN ONE DATASET

We need an extra step at step 2 to handle data sets that have observations in only one of the base or compare data sets. For example, if we delete one of the observations in the TWO data set:

   data two;
      set two;
      if id='bbb' then delete;
   run;

   %comparehighlight(base=one,compare=two,id=id,xlfile=/folders/myfolders/simple2.xlsx)

PROC COMPARE shows us the row is missing:

   Comparison Results for Observations

   Observation 2 in WORK.ONE not found in WORK.TWO: id=bbb.

But the output data set only has the BASE row for that observation (id=bbb); the COMPARE, DIF and PERCENT rows don't exist. You can see that row 5 is the BASE observation for the bbb row. As that row doesn't exist in the compare data set, the COMPARE, DIF and PERCENT rows aren't created. This is because we use the outnoequal option on the PROC COMPARE statement.

   Obs  _TYPE_   _OBS_  id   num1  num2     num3
     1  BASE         1  aaa     1     2   3.0000
     2  COMPARE      1  aaa     1     2   3.0000
     3  DIF          1  aaa     0     0   0.0000
     4  PERCENT      1  aaa     0     0   0.0000
     5  BASE         2  bbb     4     5   6.0000
     6  BASE         3  ccc     7     8   9.0000
     7  COMPARE      2  ccc     7     8  10.0000
     8  DIF          2  ccc     0     0   1.0000
     9  PERCENT      2  ccc     0     0  11.1111

Output 8: PROC COMPARE output data set where there are non-matched rows

We could turn off the outnoequal option, but that can create large output data sets, as it then creates 4 output observations for each input observation. We want the output report to highlight only the differences, particularly if the input data sets have very many rows.

2C Expand output data set to have a BASE, COMPARE, DIF and PERCENT row for every difference row

So instead we just use a simple data step to create the missing rows. The new rows are given upper-case _TYPE_ values so that they match the values PROC COMPARE itself writes:

   data diffs;
      drop i;
      set xtemp;
      by id;
      /* step 2a above */
      array nums(*) _numeric_;
      do i=1 to dim(nums);
         if nums(i)=.e then nums(i)=0;
      end;
      if not last.id then output;
      if last.id then select (_type_);
         when ('PERCENT') output;
         when ('COMPARE') do;
            output;
            call missing (num1,num2,num3);
            _type_='BASE'; output;
            _type_='DIF'; output;
            _type_='PERCENT'; output;
         end;
         when ('BASE') do;
            output;
            call missing (num1,num2,num3);
            _type_='COMPARE'; output;
            _type_='DIF'; output;
            _type_='PERCENT'; output;
         end;
      end;
   run;
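One way to confirm that step 2C has done its job is to check that every id now carries all four row types. The sketch below is not part of the macro. It uses a small stand-in for work.diffs in which id ccc is deliberately left without its PERCENT row, and the table name incomplete_ids and column name n_types are assumptions made for the example.

```sas
/* Stand-in for work.diffs: id ccc is deliberately missing its PERCENT row */
data diffs;
   length _type_ $8 id $3;
   input _type_ id;
   cards;
BASE    bbb
COMPARE bbb
DIF     bbb
PERCENT bbb
BASE    ccc
COMPARE ccc
DIF     ccc
;
run;

/* Keep only the ids that do not have all four row types */
proc sql;
   create table incomplete_ids as
   select id, count(distinct upcase(_type_)) as n_types
      from diffs
      group by id
      having calculated n_types ne 4;
quit;

/* Any rows printed here point to an id the expansion step missed */
proc print data=incomplete_ids;
run;
```

With real data the id column would be replaced by the full list of ID variables, and an empty incomplete_ids table means the merges in step 3A will find a DIF row for every observation.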

AN EXAMPLE APPLICATION

At BOCSAR we use the CompareHighlight macro to check changes in our data and processes from month to month. In addition to the obvious comparisons of individual data sets, an interesting application of the CompareHighlight macro is comparing the contents of monthly data loads to check that the volume of change is as expected.

This is the CompareHighlight output from comparing the sashelp.vmember of the output libraries from one month to the next (just the percentage tab). It's easy to see at a glance that the tables have more observations, which is expected, but the filesize is smaller, which is perhaps unexpected. The obslen is also shorter. Checking the DIF tab (not shown) indicates there are 15 fewer variables in the new data sets, which is worth investigating.

   memname                      nobs     obslen       nvar   filesize
   DISPOSAL                0.1234127  -10.05199  -20.83333  -8.732908
   DISPOSALOFFENCE         0.2365206  -6.643757  -13.27434  -5.241922
   DISPOSALOFFENCEPENALTY  0.2033322  -6.173497  -11.62791  -4.869967
   DISPOSALPENALTY         0.1926014  -8.957529  -16.85393   -6.83416
   OFFENCE                 0.2365201          E          E  0.2612768
   PENALTY                 0.1926011          E          E  0.2111189

CONCLUSION

The output of the CompareHighlight macro makes it easier to read the output from PROC COMPARE, particularly when there are differences in many variables on the same observation. The macro combines the powerful comparison capability of PROC COMPARE with the dynamic formatting flexibility of PROC REPORT and the simplicity of ODS to create Excel output. One of the great things about working with SAS is the ability to combine the strengths of various components to solve just about any problem!

CONTACT INFORMATION

Your comments and questions are valued and encouraged. Contact the author at:

   Steve Cavill
   NSW Bureau of Crime Statistics and Research
   20 Lee Street
   Sydney, NSW, 2000
   Steve.cavill@infoclarity.com.au
   www.bocsar.nsw.gov.au
   www.skillfactor.com.au

SAS Community Page: This and other presentations can be found at my SAS Community presentations page
http://www.sascommunity.org/wiki/presentations:stevecavill_papers_and_presentations

SAS and all other SAS Institute Inc. product or service names are registered trademarks or trademarks of SAS Institute Inc. in the USA and other countries. ® indicates USA registration.

Other brand and product names are trademarks of their respective companies.

APPENDIX

Source code to reproduce the examples in this paper. The sample data set is a standard data set that is part of your SAS installation.

The presentation and associated source code are also available at sascommunity.org:
http://www.sascommunity.org/wiki/what_just_happened%3f_a_visual_tool_for_highlighting_differences_between_two_data_sets

The macro also calls a number of utility macros that can be accessed at
http://www.sascommunity.org/wiki/file:cavillmacros.zip

COMPAREHIGHLIGHT MACRO

   %macro CompareHighlight(base=,compare=,id=,var=,xlfile=,sheetname=,outdiffds=);
   /*BeginWikiEntry
   Take the output dataset from proc compare and create an excel sheet highlighting the differences
   assumes the excelxp destination is already open (to facilitate multiple sheets in the one workbook)
   == Parameters: ==
   - base - base dataset to compare
   - compare - dataset to compare to base dataset
   - id - the id list used by proc compare (to avoid highlighting the id variables)
   - var - (optional) list of vars to compare
   - xlfile - the excel workbook (in .xml format) to create containing the comparison
   - sheetname - (optional) the name to give the sheet in the excel workbook
   - outdiffds - (optional) dataset containing the output from proc compare
   == example: ==
   {{{
   %comparehighlight(base=hc.hc14q2slw,compare=hc.hc14q2fst,id=personid procno offcount pencount,xlfile=x.xml)
   }}}
   == Notes ==
   ds= parameter no longer supported [[br]]
   was the output dataset created by proc compare out= outdiff outbase outcomp outnoequal
   EndWikiEntry*/

   /* to do ---------
      highlight keys on a row which is exclusive to base or compare data set
   */
   %local i idvarcount colcount colname coldataname colformatname usage datavars formatvars;
   %let highlight={foreground=red background=yellow};

   /* no id supplied: match on row number through a pair of views */
   %if %length(&id)=0 %then %do;
      %let id=comparehighlightrowcounter;
      data &base._view/view=&base._view;
         set &base;
         comparehighlightrowcounter=_n_;
      run;
      data &compare._view/view=&compare._view;
         set &compare;
         comparehighlightrowcounter=_n_;
      run;
      %let base=&base._view;
      %let compare=&compare._view;
   %end;

   proc compare base=&base compare=&compare out=xtemp
                outdiff outbase outcomp outnoequal outpercent;
      id &id;
      %if %length(&var)>0 %then var &var;;
   run;

   %let datavars=%listtablevars(ds=xtemp,
        where=upcase(name) not in ('_TYPE_' '_OBS_' %upcase(%quotelst(&id))));
   %let colcount=%countw(&id);

   /* make sure all 4 output rows exist for every comparison */
   data diffs;
      drop i;
      set xtemp; /* need id vars to join */
      by &id;
      array nums(*) _numeric_;
      do i=1 to dim(nums);
         if nums(i)=.e then nums(i)=0; /* proc report struggles with missing */
      end;
      if not last.%scan(&id,&colcount) then output;
      if last.%scan(&id,&colcount) then select (_type_);
         when ('PERCENT') output; /* no action required */
         when ('COMPARE') do;
            output;
            call missing (%seplist(&datavars,delim=%str(,)));
            _type_='BASE'; output;
            _type_='DIF'; output;
            _type_='PERCENT'; output;
         end;
         when ('BASE') do;
            output;
            call missing (%seplist(&datavars,delim=%str(,)));
            _type_='COMPARE'; output;
            _type_='DIF'; output;
            _type_='PERCENT'; output;
         end;
      end;
   run;

   %renamevariables(ds=diffs,prefix=format_,except=_type_ _obs_ &id);
   %let formatvars=%listtablevars(ds=diffs,
        where=upcase(name) not in ('_TYPE_' '_OBS_' %upcase(%quotelst(&id))));

   %macro colourit(type=);
      /* _TYPE_ values written by proc compare are upper case */
      data diffs_report_&type;
         merge xtemp(where=(_type_="%upcase(&type)") in=infirst)
               diffs(where=(_type_='DIF') in=insecond);
         by &id;
         if infirst and insecond;
      run;
      %delabel(ds=diffs_report_&type)

      ods excel options(sheet_name="&sheetname._&type");
      proc report data=diffs_report_&type nowd;
         columns &id &formatvars &datavars;
         /* id columns */
         %let colcount=%sysfunc(countw(&id));
         %do i=1 %to &colcount;
            %let colname=%scan(&id,&i);
            define &colname/display /*&usage*/;
         %end;
         /* difference columns drive the highlighting but are not printed */
         %let colcount=%sysfunc(countw(&formatvars));
         %do i=1 %to &colcount;
            %let colname=%scan(&formatvars,&i);
            define &colname/noprint;
         %end;
         /* data columns */
         %let colcount=%sysfunc(countw(&datavars));
         %do i=1 %to &colcount;
            %let colname=%scan(&datavars,&i);
            %if %vartype(diffs_report_&type,&colname)=c %then %let usage=display;
            %else %let usage=analysis;
            define &colname/display /*&usage*/;
         %end;
         /* one compute block per data column */
         %let colcount=%sysfunc(countw(&formatvars));
         %do i=1 %to &colcount;
            %let coldataname=%substr(%scan(&formatvars,&i),8);
            %let colformatname=format_&coldataname;
            compute &coldataname;
               %if %vartype(diffs_report_&type,&colformatname)=c %then %do;
                  /* proc compare marks differing characters with X */
                  if index(&colformatname,'X')>0 or &colformatname=' '
               %end;
               %if %vartype(diffs_report_&type,&colformatname)=n %then %do;
                  if &colformatname..sum ne 0
               %end;
               then call define(_col_,'style',"style=&highlight");
            endcomp;
         %end;
      run;
   %mend;

   ods listing close;
   ods excel file="&xlfile" style=minimal;
   %colourit(type=base)
   %colourit(type=compare)
   %colourit(type=dif)
   %colourit(type=percent)
   ods excel close;
   ods listing;

   %if &outdiffds ne %then %do;
      data &outdiffds;
         set diffs;
      run;
   %end;
   %mend;

SAMPLE CODE FOR THE PAPER

   /* example of output highlighting changes, compared to proc compare output */
   proc sort data=sashelp.baseball out=before;
      by name;
   run;

   data after;
      set before;
      drop c:;
      array nums _numeric_;
      if _n_ in (5,71) then do i=1 to dim(nums);
         nums(i)=nums(i) - _n_;
      end;
      if _n_ in (6,88) then do i=1 to dim(nums) by 3;
         nums(i)=nums(i) - _n_;
      end;
      if _n_ in (74,100) then team='Boston';
   run;

   options ls=90 ps=999;
   options nomlogic nomprint;
   %comparehighlight(base=before,compare=after,id=name,xlfile=/folders/myfolders/baseball.xlsx)

   /* simple example to show how proc report works */
   data one;
      retain i 1;
      drop i;
      length id $3 num1-num3 8;
      input id num1 num2 num3;
      cards;
   aaa 1 2 3
   bbb 4 5 6
   ccc 7 8 9
   ;;;;

   data two;
      set one;
      if id='bbb' then num1=num1+1;
      if id='ccc' then num3=num3+1;
   run;

   options mprint;
   %comparehighlight(base=one,compare=two,id=id,xlfile=/folders/myfolders/simple1.xlsx)

   /* handle "incomplete" output from proc compare */
   data two;
      set two;
      if id='bbb' then delete;
   run;

   %comparehighlight(base=one,compare=two,id=id,xlfile=/folders/myfolders/simple2.xlsx)