Topics:
    Data step
    Subsetting
    Concatenation and Merging

Reference:
    Little SAS Book - Chapter 5, Section 3.6 and 2.2
    Online documentation

Exercise I

LAB EXERCISE

The following is a lab exercise to give you experience combining SAS data sets. The data files, nmes, employee1-employee4, data1-data3, wide, long2, lab3longtowide.sas, and lab3widetolong.sas, are located on the website on the LAB page under class3:
http://www.biostat.jhsph.edu/bstcourse/bio632/default.htm

Download the self-extracting file class3.exe from the website.
Extract the contents to the d:\temp\sasclass folder. Create the folder if it does not exist.
Start the SAS program.

If you are taking the class for credit (either pass/fail or graded), please read the italicized instructions (the HAND IN paragraphs) at the end of each section. You will need to print out sections of the SAS log and output windows and the answers to some of the questions at the end of the lab. Please do not print all of the logs and output windows. Please label each section clearly and put your name at the top of the pages. Use a TITLE statement.

The data are stored in the SAS file nmes. All missing data values are coded as -9. The variables included in the data file are:

Variable Name   Description
Age             Age of subject
Gender          1 = Male
                0 = Female
Race            1 = African American
                0 = Other
Smoke           1 = Current
                2 = Former
                3 = Never
                -9 = unknown
LC              1 = Lung Cancer or Laryngeal Cancer or COPD
                0 otherwise
CHD             1 = Coronary Heart Disease
                0 otherwise
BMI             Body mass index with two decimal places
                -9 = unknown
Expend          Subject's total self-reported medical expenditures
Marital         1 = Married
                2 = Widowed or Divorced or Separated
                3 = Never Married
Educ            1 = 1 year of college or more
                2 = Completed High School
                3 = Less than High School
                -9 = unknown

A. Write a Data step that will do the following:

1. Using IF/THEN statements, recode the missing data (coded as -9) to . (the SAS missing value) for the variables smoke, bmi, and educ.
2. Create a new variable called LogBMI using the SAS function LOG.
3. Create a categorical Age variable that breaks age into the following categories: 40-55, 56-65, >65 (Note: no subjects in the data set are less than 40 years old).
4. Check that your code works correctly by printing out the first 30 observations of the resulting data set. Use a data set option in a PROC step.

HAND IN: Print out the output from #4 ONLY to hand in. Label this section Lab3 Exercise I QA.4
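The techniques named in Part A can be sketched on a small made-up data set rather than on nmes. This is only an illustration: the data set names toy and toy2, the values, and the variables x, logx, and agecat are assumptions invented for this example.

```sas
* Hypothetical toy data in which -9 marks a missing value;
data toy;
   input age x;
   datalines;
42 25.5
58 -9
70 31.2
;
run;

data toy2;
   set toy;
   * Recode the -9 code to the SAS numeric missing value;
   if x = -9 then x = .;
   * LOG returns the natural logarithm and gives missing for missing input;
   logx = log(x);
   * Build an age category from two cut points;
   if age <= 55 then agecat = 1;
   else if age <= 65 then agecat = 2;
   else agecat = 3;
run;

* The OBS= data set option limits how many observations are printed;
proc print data=toy2 (obs=2);
   title 'First two observations of toy2';
run;
```

The recode is placed before the LOG call so that the -9 code is never passed to the function. SAS writes a note to the log when an operation produces missing values from missing input, which is expected here for the second observation.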
B. Create the following subsets:

1. Create a file nmes1 that contains only males who are <= 65 years old.
2. Create two files in the same data step that contain males and females separately.

HAND IN: Print out the SAS log from part B 1 and 2. Label this section Lab3 Exercise I QB.

Exercise II

A. Concatenation and Merging

1. DATA1 and DATA2 are SAS data sets, described below, that contain disease and follow-up information on a group of patients. The maximum number of disease codes (ICD-9 codes) is 6. We want to create a new file, DATA1_2, by combining these two files. Both files contain the variables described below. Type the following program into the ENHANCED EDITOR window and submit it to create one file with the data derived from these two files. Check the SAS log and answer the questions.

    Libname mylib 'd:\temp\sasclass';
    Data data1_2;
       Set ;
    Run;

How many observations are in DATA1?
How many observations are in DATA1_2? How many variables?

Variable   Description                         Type
ID         Patient ID                          Numeric
DX1        Diagnosis 1                         Character
DX2        Diagnosis 2                         Character
DX3        Diagnosis 3                         Character
DX4        Diagnosis 4                         Character
DX5        Diagnosis 5                         Character
DX6        Diagnosis 6                         Character
Sex        0 = female, 1 = male                Numeric
Yearc      Year of last contact                Numeric
Yob        Year of birth                       Numeric
Cvd        Cardiovascular disease              Numeric
           0 = no, 1 = yes
Smoker     0 = no, 1 = yes                     Numeric
Chol       Cholesterol (mg/dl)                 Numeric

2. We have additional patient information to add to the DATA1_2 file created in 1. DATA3 contains additional information, described below, for the patients in the DATA1_2 file. Create a new SAS data set (ALLDATA) by match-merging the data in DATA1_2 with the data in DATA3 using a key variable (id).

This is a description of the data in DATA3:

Variable   Type      Description
ID         numeric   id
SBP        numeric   systolic blood pressure (mmHg)
DBP        numeric   diastolic blood pressure (mmHg)
NO_CIG     numeric   number of cigarettes per day
                     0 = none
                     1 = 1-10
                     2 = 11-19
                     3 = 20-39
                     4 = 40 or more
BMI        numeric   body mass index (kg/m2) 18-21

Remember that we need to sort both files by ID before merging (using PROC SORT).

    Proc Sort data= ;
       by id;
    Proc Sort data= ;
       by id;
    Data mylib.alldata;
       merge ;
    Proc print data=mylib.alldata;
    Run;

Check the SAS log for errors. Although you may not have any errors, there is a major problem with the merge program. The program did not match-merge the data because the BY statement was missing. Instead, the files were matched sequentially (first record with first record, second with second, and so on), and data from different patients were combined into one record.

How many observations are in the ALLDATA file?
Compare the values of the ICD-9 codes for the first five records of the ALLDATA file to the first five records of DATA1_2. Notice the problems with the matching.
Now return to the program editor window, add the BY statement to the DATA step, and rerun.

How many observations are in the ALLDATA file?
Compare the first five records to the records in DATA1_2.

3. We are going to use the data set option (in= ) to determine which records did not match. Return to the program in the Enhanced Editor and add the following instructions to the DATA step. Remember that the IN= variable for each file equals 1 for every record to which that file contributes.

    Data mylib.alldata;
       merge (in=count) (in=count2);
       by id;
       If count=0 then put id= count=;
       If count2=0 then put id= count2=;
    Proc print data=mylib.alldata;
       Title 'With By statement';
    Run;

Review the log window.
How many records from the DATA1_2 file did not have a match in DATA3?
How many records from the DATA3 file did not have a match in DATA1_2?

4. Suppose you want the ALLDATA file to include only those records that matched. You can use the count and count2 variables in the DATA step to exclude the non-matches using IF-THEN clauses. Add the appropriate statement(s) to the program and run. Check the SAS log for errors.

HAND IN: Print out the SAS log from this final DATA step and the answer to the following question. Label this section Lab3 Exercise II Part A Q4.

How many observations are in the ALLDATA file?

NOTE: The SAS system has an option to prevent accidental merging without a BY statement. Look at the MERGENOBY system option in HELP for further details.
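The match-merge pattern used in Part A (sort, MERGE with BY, and IN= flags) can be sketched on toy data as follows. The data set names left, right, and both, the flag names inleft and inright, and all values are hypothetical and chosen only for this illustration.

```sas
* Hypothetical toy data sets sharing the key variable id;
data left;
   input id a $;
   datalines;
1 x
2 y
4 z
;
run;

data right;
   input id b;
   datalines;
1 10
3 30
4 40
;
run;

* Both inputs must be sorted by the key before a match-merge;
proc sort data=left;
   by id;
run;
proc sort data=right;
   by id;
run;

data both;
   merge left (in=inleft) right (in=inright);
   by id;
   * Each IN= flag is 1 when its data set contributed to the current observation;
   if inleft = 0 then put 'Not in LEFT: ' id=;
   if inright = 0 then put 'Not in RIGHT: ' id=;
   * Subsetting IF keeps only the ids found in both inputs;
   if inleft and inright;
run;
```

With these values, id 2 is reported as missing from right, id 3 as missing from left, and both ends up with the two matched ids, 1 and 4. The IN= variables are temporary and are not written to the output data set. An IF-THEN with a DELETE statement is an equivalent way to drop the non-matches.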
B. Concatenation and Merging

The following files contain employee information. Use the SET and MERGE statements to combine them.

1. Create a combined SAS data set named employee1_2 (temporary or permanent, you choose) by concatenating the employee1 and employee2 files (SAS data sets). The data sets contain the following variables:

Variable   Description
SSN        Social Security number (XXXXXXXXX)
Name       employee name: lastname, first name
Hire       hire date (date variable)
Salary     annual salary
Phone      office telephone number, in the form XXX-XXXX

Add a LABEL statement to the DATA step to label the name, hire, and phone variables with the descriptions given above. Add a PROC CONTENTS step to list the contents of employee1_2. Review the LOG and OUTPUT windows.

How many records are in the employee1_2 SAS data set?

2. Employee3 contains additional employees that we need to add to the file created in 1. Combine this file with the employee1_2 SAS data set created in 1 and name the new SAS data set employee123. DO NOT INCLUDE the variable Name in the employee123 file (use the DROP or KEEP data set option). The employee3 file includes the following variables:

Variable   Description
SSN        Social Security number (XXXXXXXXX)
Name       employee name, in the form lastname, first name
Gender     gender: F = female, M = male
Hire       hire date (date variable)
Salary     annual salary

Notice that employee3 does not contain the phone variable but does include the gender variable.

HAND IN: Print the OUTPUT window (from #2 only) containing the listing of employee123. Make sure that you put EMPLOYEE 123 file as the title at the top of the listing. Include the answers to the following 3 questions in your report. Label this section Lab3 Exercise II Part B Q2.

1. How many observations?
2. What is the value of gender for SSN=244967839?
3. What is the office telephone number for SSN=933476520?

3. Add the following data from the employee4 file to the records of the employee123 file created in 2. Employee4 contains additional information on the employees in the employee123 file. SSN is the key variable.

Variable   Description
SSN        Social Security number (XXXXXXXXX)
Left       date left the company (date variable); blank if still an employee
Phone      home phone number, in the form XXX-XXXX

First, run PROC CONTENTS on the employee4 file. Notice the label for the phone variable: it is the home phone number. The variable phone in the employee123 file is the office telephone number. We want to merge the employee4 SAS data set with the employee123 SAS data set created in 2, BUT we want to keep both the home and office phone numbers. Remember that SAS will retain only one of the variables because they have the same name
(Hint: use a data set option on the MERGE statement). Match-merge using SSN as the key variable and create a new SAS data set, employee_total. Print out the file using PROC PRINT.

HAND IN: Print the LOG and OUTPUT windows (from #3) containing the results from the program creating employee_total, and the answers to the following questions. Please label this part of the report Lab3 Exercise II Part B Q3.

1. How many records are in the employee123 and employee4 files?
2. How many records and variables are in the employee_total file?
3. List the SSNs of the records that do not match. Use the IN= data set option to identify the records that do not match and list them in the LOG window.
4. How many variables does the file employee_total have?

4. Modify the DATA step that creates employee_total to use the IN= data set option to include only those observations that exist in both files. There will be 14 observations in employee_total.

HAND IN: Print the SAS LOG (from #4) creating the new employee_total. Label this section Lab3 Exercise II Part B Q4.
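One way to keep two same-named variables through a match-merge is the RENAME= data set option, sketched here on toy data. The data set names office, home, and phones, the new variable names office_phone and home_phone, and all values are assumptions made up for this illustration and are not taken from the employee files.

```sas
* Hypothetical toy data in which both data sets have a variable named phone;
data office;
   input id phone $;
   datalines;
1 555-0101
2 555-0102
;
run;

data home;
   input id phone $;
   datalines;
1 555-0201
2 555-0202
;
run;

proc sort data=office;
   by id;
run;
proc sort data=home;
   by id;
run;

data phones;
   * RENAME= changes each variable name as the data set is read;
   * so the two phone values arrive under distinct names;
   merge office (rename=(phone=office_phone))
         home   (rename=(phone=home_phone));
   by id;
run;

proc print data=phones;
   title 'Both phone numbers kept';
run;
```

Without the RENAME= options, the merged data set would have a single phone variable, and for matching ids its value would come from the data set listed last on the MERGE statement.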