Paper PO31

The Power of PROC SQL Techniques and SAS Dictionary Tables in Handling Data

MaryAnne DePesquo Hope, Health Services Advisory Group, Phoenix, Arizona
Fen Fen Li, Health Services Advisory Group, Phoenix, Arizona

ABSTRACT

This paper demonstrates the combined power of PROC SQL and SAS Dictionary Tables to assist in the data management of multi-year health care survey data. The survey data, collected yearly, usually require some modifications to fields and file names to adjust for year-to-year changes in survey administration. As in all programming aspects of a project, it is essential that the programming techniques are efficient and adaptable in the data handling processes. Structured Query Language (SQL) is a powerful database language that can be used to access SAS Dictionary Tables, which contain information about data files open in a SAS session. Examples presented in this paper demonstrate techniques of applying PROC SQL to the SAS Dictionary Tables, SASHELP.VCOLUMN and SASHELP.VTABLE. These techniques easily and quickly address survey data management tasks including renaming variables, label creation, conversion of variable characteristics, automating file lists and file comparisons.

INTRODUCTION

Health Services Advisory Group Inc. (HSAG) is Arizona's largest health care quality review organization. HSAG is currently working on a number of large-scale survey projects, including the Medicare Health Outcomes Survey (HOS). The Medicare HOS measures the physical and mental health status of Medicare beneficiaries in managed care settings. The Medicare HOS, sponsored by the Centers for Medicare & Medicaid Services (CMS), is administered annually to a randomly selected sample of Medicare Advantage (MA) Plan members from each applicable Medicare contract market area in the United States. A random sample of 1,000 individuals is selected at baseline from each MA Plan and then resurveyed in two years.

Challenges exist when multi-year data contain a large number of variables that change from year to year, or when it is necessary to compare a large number of data files at different group levels. SQL extract techniques have been implemented to more easily handle changing requirements and characteristics of the data. These techniques are used instead of hard coding key values, cutting and pasting sections of code, or using the conventional DATA and PROC step methods that may require many program steps and lengthy lines of code. The following examples show how SQL is used to extract and modify valuable data set information that is available in the SASHELP.VCOLUMN table and the SASHELP.VTABLE table. The examples describe and demonstrate the coding techniques used to rename variables, create labels, convert variable types, generate file lists, and compare files.

SQL

PROC SQL is the SAS implementation of Structured Query Language, a database language that incorporates features that can simplify and consolidate coding requirements. Using these features results in fewer program steps and shorter lines of code when compared to the conventional DATA step and PROC step techniques. Some of the common uses of PROC SQL include joining tables, extracting data, grouping and ordering data, creating and modifying tables, subsetting data, and creating macro variables.

SASHELP DICTIONARY TABLES

SAS dictionary tables are created automatically when a SAS session is started and are updated automatically throughout the SAS session. These resources are metadata tables (data about data) that provide a wealth of information about the current data files in the SAS session. The SASHELP library contains views of these tables, and each view name is prefixed with a V. The view tables describe components of SAS data files such as columns, formats, indexes, macros, and tables. The column view table (SASHELP.VCOLUMN) and the table view table (SASHELP.VTABLE) are the specific focus of this paper.
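As a quick illustration of what these views expose, the following minimal sketch queries both views for a single data set. It assumes the SASHELP.CLASS sample data set that ships with SAS is available; any other library and member name could be substituted. The libname and memname values are stored in uppercase in these views, so the literals in the WHERE clause are uppercase as well.

```sas
proc sql;
  /* Variable-level metadata for one data set */
  select name, type, length, label
    from sashelp.vcolumn
    where libname='SASHELP' and memname='CLASS';

  /* File-level metadata for the same data set */
  select memname, nobs, nvar, crdate
    from sashelp.vtable
    where libname='SASHELP' and memname='CLASS';
quit;
```

Inside PROC SQL the same information can also be read from DICTIONARY.COLUMNS and DICTIONARY.TABLES. The SASHELP views have the advantage of being usable in DATA steps and other procedures as well.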

SASHELP VCOLUMN TABLE

The SASHELP.VCOLUMN table includes data set information at the variable level. Below are some examples of variables contained in the column view table [description (variable name)]:

    name (name)
    type (type)
    length (length)
    label (label)
    format (format)
    informat (informat)
    position (npos)
    order number in table (varnum)

SASHELP VTABLE TABLE

The SASHELP.VTABLE table includes the data set information at the file level. Below is a list of some of the frequently used VTABLE variables [description (variable name)]:

    library name (libname)
    file name (memname)
    file type (memtype)
    number of observations (nobs)
    file label (memlabel)
    number of variables (nvar)
    file creation date (crdate)
    file modification date (modate)

EXAMPLES USING PROC SQL AND DICTIONARY TABLES

RENAMING VARIABLES

Survey data frequently have a large number of variables, and often there is a need to rename variables in the data set in order to merge data or to modify the variable names for input to generic programs. The name field in the SASHELP.VCOLUMN table is used to extract only the required variables from the data file. Example 1 uses PROC SQL and the SASHELP.VCOLUMN table to demonstrate the selection of variables to be renamed.

Example 1: The first step is to execute PROC SQL to extract from the SASHELP.VCOLUMN table the selected variables of the stored SAS data set named HDATA (libname is PLAN). The WHERE clause is used to select all the numeric variables in the data set with the exception of the V1PATID variable. A macro variable called mnlist is created that contains a string comprised of the rename assignments. The string contains the original variable name, an equal sign, and the new variable name, with each rename assignment delimited by a blank space. The SUBSTR function strips off the two-character prefix of each variable name, and the _MN suffix is then concatenated to the result.
The SEPARATED BY clause creates a space delimiter between the rename assignments, and the COMPRESS function removes any blanks from the original variable name. The TRIM function, applied after LEFT, left justifies the new variable name and removes its trailing spaces. Because libname and memname values are stored in uppercase in the dictionary tables, the literals in the WHERE clause are written in uppercase.

    proc sql noprint;
      select compress(name) || '=' || trim(left(substr(name,3))) || '_' || 'MN'
        into :mnlist separated by ' '
        from sashelp.vcolumn
        where libname='PLAN' and memname='HDATA'
          and type='num' and name not in ('V1PATID');
    quit;

The second step in the process is to run a data step using the macro variable &mnlist. The macro variable &mnlist provides the renaming code for the RENAME= data set option in the DATA statement.

    data renmfile (rename=(&mnlist));
      set plan.hdata;
    run;

The log from the renmfile data step below shows the resolution of the macro variable mnlist.

    41 data renmfile
    41 (rename=(&mnlist));
    SYMBOLGEN: Macro variable MNLIST resolves to V1HTH=HTH_MN V1HTHN=HTHN_MN V1VIG=VIG_MN
    V1MOD=MOD_MN V1LFT=LFT_MN V1CLMB=CLMB_MN
    V1CLMBN=CLMBN_MN V1BND=BND_MN V1WLK=WLK_MN V1WLKB=WLKB_MN
    42 set plan.hdata;

LABEL CREATION

One of the required data management tasks is to create Comma Separated Value (CSV) files from the SAS survey data and to create a label row for the variables in the CSV text file. Typing the labels directly into the CSV file is time consuming and prone to error. Example 2 illustrates the use of PROC SQL to quickly and accurately access the labels stored in the SAS data set using the SASHELP.VCOLUMN table, then transposing the captured labels and creating an Excel file that includes the label row.

Example 2: PROC SQL is used to create a table named outds that contains each variable name (name) and the corresponding label (label). These fields are extracted from the SASHELP.VCOLUMN table for the data set named HDATA. The selection is limited to numeric variables with a length of 8 and excludes the variable V1PATID. The libname and memname literals are written in uppercase because that is how the dictionary tables store them.

    proc sql;
      create table outds as
        select name, label
        from sashelp.vcolumn
        where libname="PLAN" and memname="HDATA"
          and type='num' and length=8
          and name not in ('V1PATID');
    quit;

Table Outds: Results of SQL Extraction

    Name       Label
    V1VAR08    First Variable Label 08
    V1VAR18    Second Variable Label 18
    V1VAR28    Third Variable Label 28

To produce the CSV file with the variable name and its corresponding label, the outds table is used as the input data set in the TRANSPOSE procedure. The name field values and the label field values are transposed into two rows, one for each field.

    proc transpose data=outds out=tr_outds (drop=_name_ _label_);
      var name label;
    run;

Table Tr_outds: Results of the Proc Transpose

         Col1                       Col2                        Col3
    1    V1VAR08                    V1VAR18                     V1VAR28
    2    First Variable Label 08    Second Variable Label 18    Third Variable Label 28

Next, PROC EXPORT is used to export the table tr_outds into the CSV text file addlabels.csv. This label Excel file is then concatenated to the Excel data file.
Another method would be to combine the two SAS data sets, tr_outds and HDATA, in a SET statement before exporting to Excel format.

    proc export data=tr_outds
      outfile="C:\addlabels.csv"
      dbms=csv replace;
    run;

VARIABLE TYPE CONVERSION

Changing a large number of numeric variables into character variables, and vice versa, is a common process in health care survey data management. As shown in Example 3, numeric variables can be converted to character variables efficiently using the SASHELP.VCOLUMN table and PROC SQL.
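For orientation, the basic conversion of a single variable in each direction can be sketched as follows. The data set and variable names (demo, score, score_c, score_n) and the formats are hypothetical and chosen only for illustration; they do not come from the survey data described in this paper.

```sas
/* Hypothetical one-variable illustration of type conversion */
data demo;
  score = 12.5;                        /* numeric source variable */
  length score_c $12;
  score_c = left(put(score, 12.2));    /* numeric to character with PUT */
  score_n = input(score_c, 12.);       /* character to numeric with INPUT */
run;

proc contents data=demo;               /* lists the resulting variable types */
run;
```

The automated approach in Example 3 applies the same PUT-based conversion to a whole list of variables at once.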

Example 3: The first step is to create macro variables, using the WHERE clause to select from the SASHELP.VCOLUMN table the variables of the data set called HDATA. The LIKE operator with the %8 pattern identifies variables whose names end in 8. All the variables satisfying this criterion are included in the macro variables named chr1 and _chr1. The former contains a list of the selected variables separated by a space, and the latter contains a list of the same variables prefixed with an underscore (_).

    proc sql noprint;
      select compress(name), "_" || compress(name)
        into :chr1 separated by ' ',
             :_chr1 separated by ' '
        from sashelp.vcolumn
        where libname="PLAN" and memname="HDATA"
          and name like '%8';
    quit;

Below is the result of using a %put to see the values in the two macro variables.

    137 %put &chr1 &_chr1;
    V1VAR08 V1VAR18 V1VAR28 _V1VAR08 _V1VAR18 _V1VAR28

The data step vartype uses the macro variables in the ARRAY statements, along with the PUT function, to convert the numeric variables into character variables. The TRIM and LEFT functions remove trailing spaces from the converted value and left justify it.

    data vartype (drop=&chr1 k);
      length &_chr1 $12;
      set plan.hdata;
      array n_vars{3} &chr1;
      array c_vars{3} &_chr1;
      do k=1 to 3;
        c_vars{k}=left(trim(put(n_vars{k},12.9)));
      end;
    run;

Variable Type Conversion

    Before Conversion       After Conversion
    Variable    Type        Variable     Type
    V1VAR08     Num         _V1VAR08     Chr
    V1VAR18     Num         _V1VAR18     Chr
    V1VAR28     Num         _V1VAR28     Chr

FILE LIST GENERATION FOR MERGING

Each year more than 150 health care survey data files are distributed to health care plans nationwide. To ensure that the correct number of data files is generated for distribution, validation is required to match the electronic data files against a list of appropriate plans. Manually checking each electronic data file name against the plan list is feasible but labor intensive and prone to error.
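A simple first check, before matching individual names, is to count the data sets in the library and compare that count with the expected number of plans. The sketch below assumes the files sit in a library assigned with the libref PLANDATA, the library name used in Example 4; the macro variable name nfiles is hypothetical.

```sas
proc sql noprint;
  /* Count SAS data sets (not views or catalogs) in the library */
  select count(*) into :nfiles
    from sashelp.vtable
    where libname='PLANDATA' and memtype='DATA';
quit;

/* Reassigning the value removes the leading blanks left by INTO */
%let nfiles = &nfiles;
%put PLANDATA contains &nfiles data sets.;
```

A count that matches the plan list does not prove that the right files are present, which is why a name-by-name comparison is still needed.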
The code in Example 4 demonstrates the use of PROC SQL combined with the SASHELP.VTABLE table to automate the validation process.

Example 4: PROC SQL is run to access the list of SAS data file names that the SASHELP.VTABLE table holds for the library called PLANDATA. A table named filelist is created that contains this list of data file names. The libname literal is written in uppercase because that is how the dictionary tables store it. The filelist table generated by PROC SQL from the SASHELP.VTABLE table is shown after the code.

    proc sql;
      create table filelist as
        select memname
        from sashelp.vtable
        where libname="PLANDATA";

    Filelist Table (B)      Master Plan Table (A)
    MEMNAME                 PLANID
    AL_DATA                 AL_DATA
    AZ_DATA                 AZ_DATA
    CA_DATA                 CA_DATA
    CO_DATA                 CO_DATA
                            CT_DATA

The filelist table created in the previous code is compared to an existing SAS table, planlist. The following code uses a PROC SQL left join to merge the master table (planlist) with the previously created table filelist. The two fields used for the match are planid, which is in the planlist (A) table, and memname, which is in the filelist (B) table. The left join specifies that all the observations in table A and only matching observations from table B are included in the result. The WHERE= data set option then keeps only the rows in which memname is blank, so the resulting table, validation, contains every plan in the master table whose data file is not in the filelist table. The ORDER BY clause arranges the data file list alphabetically.

      title 'Electronic Data File List Checking';
      create table validation (where=(memname=' ')) as
        select A.*, B.*
        from plan.planlist A left join filelist B
          on (A.planid=B.memname)
        order by planid;
    quit;

The result of the match is below.

    Validation Table
    PLANID
    CT_DATA

FILE LIST GENERATION FOR AUTOMATIC FILE COMPARISON

Another of HSAG's tasks is to create text files for data distribution. In order to verify the accuracy of each text file generated, the data are re-imported from the text file back into SAS and then compared to the original source SAS data file. (Note: the imported SAS data file and the SAS source data file have identical file names.) Because a large number of text files must be generated for each health plan, it is challenging to compare many pairs of data sets. Example 5 presents the code that has been developed to automate the data comparison process.

Example 5: The source data files are stored in a library with the libref SOURCE, and the imported SAS data files are stored in a library with the libref IMPORT. First, PROC SQL is used to extract, in alphabetic order, the file names that the SASHELP.VTABLE table holds for the SOURCE library.
This step creates a table, sourcelist, that contains a master list of the data set names.

    proc sql;
      create table sourcelist as
        select memname
        from sashelp.vtable
        where libname="SOURCE"
        order by memname;
    quit;

Next, using the data set sourcelist, the data step newfile stores the observation number (_N_) in the variable rank and executes a CALL SYMPUT that assigns rank to the macro variable datafile. After the last observation is processed, datafile holds the number of data sets, which is the upper bound of the %DO loop. The CALL SYMPUT within the loop captures the data file name (memname) whose rank equals the loop index and stores the name in the macro variable dataname. The loop is processed once for each data set name: the data set in each library (SOURCE and IMPORT) is sorted by a key variable, and the pair of data sets is then compared using the COMPARE procedure. The result of PROC COMPARE is the validation that the content of the imported file matches the source file 100%. The RUN statements ensure that each data step executes, and its macro variable is assigned, before the macro processor reaches the statements that use it.

    %macro autocmp;
      data newfile;
        set sourcelist;
        rank = _n_;
        /* width 8 so that counts above 99 are not truncated */
        call symput("datafile", put(rank, 8.));
      run;

      %do x = 1 %to &datafile;
        data _null_;
          set newfile;
          if rank=&x;
          call symput("dataname", trim(memname));
        run;

        proc sort data=source.&dataname; by V1PATID; run;
        proc sort data=import.&dataname; by V1PATID; run;

        proc compare base=source.&dataname compare=import.&dataname;
          id V1PATID;
        run;
      %end;
    %mend autocmp;

    %autocmp;

CONCLUSION

The SAS Dictionary Tables provide direct access to valuable information about SAS data sets available in a SAS session. Using PROC SQL with these tables offers a comprehensive and powerful method to reduce the coding time necessary to accomplish data handling and validation tasks. Applying the SQL techniques demonstrated in this paper can automate processes for easier and more efficient programming.

REFERENCES

SAS Institute Inc. 2000. SAS SQL Procedure User's Guide, Version 8. Cary, NC: SAS Institute Inc.
SAS Institute Inc. 2003. SAS OnlineDoc 9.1. Cary, NC: SAS Institute Inc.
SAS Technical Support. SN-009581. Cary, NC: SAS Institute Inc.

SPECIAL ACKNOWLEDGEMENTS

The authors would like to acknowledge the Medicare Health Outcomes Survey team at HSAG for review of this paper.

CONTACT INFORMATION

Your comments and questions are valued and encouraged.

    MaryAnne DePesquo Hope
    Health Services Advisory Group, Inc.
    1600 E. Northern Ave., Suite 100
    Phoenix, AZ 85020
    Work Phone: 602-745-6312
    Fax: 602-241-0757
    mhope@hsag.com

SAS and all other SAS Institute Inc. product or service names are registered trademarks or trademarks of SAS Institute Inc. in the USA and other countries. ® indicates USA registration. Other brand and product names are trademarks of their respective companies.