Topics:
  Data step
  Subsetting
  Concatenation and Merging

Reference:
  Little SAS Book - Chapter 5, Sections 3.6 and 2.2
  Online documentation

Exercise I

LAB EXERCISE

The following lab exercise gives you experience combining SAS data sets. The data files (nmes, employee1-employee4, data1-data3) are located on the website on the LAB page under class3: http://www.biostat.jhsph.edu/bstcourse/bio632/default.htm. Download the files from LAB on the website to your folder.

If you are taking the class for credit (either pass/fail or graded), please read the italicized instructions at the end of each section. Please save the logs, output sections, and the answers to the specified questions in one Word document and e-mail it to the class e-mail address, sas@jhsph.edu. Please do not send all of the logs and output windows. Label each section clearly and put your name and LAB3 in the subject line of the e-mail. Use TITLE statements.

Start the SAS Program

Write a SAS program to create the following files from the PREC2 SAS data set created in LAB2.

1. Create a temporary file that contains only records with known values of both systolic and diastolic pressure (msbp and mdbp).
2. Create another file that contains only males whose age in 1998 was less than 75 years. Do not include the variables wgt and hgt in this data set.
3. Create two files in the same DATA step that contain males and females separately.

Save the SAS log from these 3 DATA steps and send it in the exercise e-mail. Label this section Lab3 Exercise 1.
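The three subsetting techniques asked for above (a subsetting IF, the DROP= data set option, and several output data sets from one DATA step) can be sketched on a small made-up data set. This is only an illustration under assumptions: the data set toy stands in for PREC2, and the variable names sex (assumed coding 1 = male, 0 = female) and age98 are hypothetical. Check PROC CONTENTS on your own PREC2 file for the actual names and codes.

```sas
/* Toy data standing in for PREC2. The names sex and age98 and the   */
/* coding 1 = male, 0 = female are assumptions for illustration.     */
data toy;
    input id sex age98 msbp mdbp wgt hgt;
    datalines;
1 1 60 120 80 170 68
2 0 72 . 85 140 63
3 1 80 135 . 180 70
4 0 55 118 76 130 64
;
run;

/* 1. Subsetting IF: keep only records where both pressures are known */
data known_bp;
    set toy;
    if msbp ne . and mdbp ne .;
run;

/* 2. Males under 75; DROP= leaves wgt and hgt out of the new data set */
data males_lt75 (drop=wgt hgt);
    set toy;
    if sex = 1 and age98 < 75;
run;

/* 3. Two output data sets written by a single DATA step */
data males females;
    set toy;
    if sex = 1 then output males;
    else if sex = 0 then output females;
run;
```

One point to keep in mind: SAS treats a missing numeric value as smaller than any number, so a test such as age98 < 75 is also true when age98 is missing. If your data can have missing ages, add a condition that excludes them.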
Exercise II

A. Concatenation and Merging

1. DATA1 and DATA2 are two SAS data sets, described below, that contain disease and follow-up information on a group of patients. The maximum number of disease codes (ICD-9 codes) is 6. We want to create a new file, DATA1_2, by combining these two files. Both files contain the variables described below. Type the following program into the ENHANCED EDITOR window and submit it to create one file with the data from these two files. Check the SAS log and answer the questions.

    Libname mylib 'insert your folder name';
    Data data1_2;
      Set ;
    Run;

How many observations are in DATA1? How many observations are in DATA1_2? How many variables?

Variable  Description                               Type
ID        Patient ID                                Numeric
DX1       Diagnosis 1                               Character
DX2       Diagnosis 2                               Character
DX3       Diagnosis 3                               Character
DX4       Diagnosis 4                               Character
DX5       Diagnosis 5                               Character
DX6       Diagnosis 6                               Character
Sex       0 = female, 1 = male                      Numeric
Yearc     Year of last contact                      Numeric
Yob       Year of birth                             Numeric
Cvd       Cardiovascular disease: 0 = no, 1 = yes   Numeric
Smoker    0 = no, 1 = yes                           Numeric
Chol      Cholesterol (mg/dl)                       Numeric
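To see what concatenation with the SET statement does before running it on the class files, here is a minimal sketch on two tiny made-up data sets. The data set names first, second, and both, and the values in them, are hypothetical and are not the lab data.

```sas
/* Two small data sets with the same variables but different patients */
data first;
    input id dx1 $ sex;
    datalines;
1 4019 0
2 25000 1
;
run;

data second;
    input id dx1 $ sex;
    datalines;
3 41401 1
4 4280 0
5 496 0
;
run;

/* Naming both data sets on SET stacks them: all records from first, */
/* followed by all records from second                               */
data both;
    set first second;
run;

/* List the variables and the observation count of the result */
proc contents data=both;
run;
```

With this toy input the log should report that BOTH has 5 observations (2 + 3) and 3 variables, which is the pattern to look for when you answer the questions about DATA1_2.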
2. We have additional patient information to add to the DATA1_2 file created in 1. DATA3 contains the additional information described below for the patients in the DATA1_2 file. Because we are adding data to existing records, use a MERGE step to combine these files.

This is a description of the data in DATA3:

Variable   Type     Description
ID         numeric  id
SBP        numeric  systolic blood pressure (mmHg)
DBP        numeric  diastolic blood pressure (mmHg)
NO_CIG     numeric  number of cigarettes per day: 0=none, 1=1-10, 2=11-19, 3=20-39, 4=40 or more
BMI 18-21  numeric  body mass index (kg/m^2)

Remember that we need to sort both files by ID before merging (using PROC SORT).

    Proc Sort data= ; by id;
    Proc Sort data= ; by id;
    Data mylib.alldata;
      merge ;
    Proc print data=mylib.alldata;
    Run;

Check the SAS log for errors. Although you may not have any errors, there is a major problem with the merge program: it did not match-merge the data because the BY statement was missing. Instead, the files were merged sequentially (record by record, in order), and data from different patients were combined into one record. How many observations are in the ALLDATA file? Compare the ICD-9 code values in the first five records of the ALLDATA file with the first five records of DATA1_2. Notice the problems with the matching.

Now return to the program editor window. Create a new SAS data set (ALLDATA) by match-merging the data in DATA1_2 with the data in DATA3 using a key variable (id). To do this, add a BY statement to the DATA step and rerun. How many observations are in the ALLDATA file? Compare the first five records to the records in DATA1_2.

3. We are going to use the data set option (IN=) to determine which records did not match. Return to the program in the Enhanced Editor and add the following instructions to the DATA step. Remember that the IN= variable for a file equals one whenever that file contributes to the current record, and zero otherwise.
    Data mylib.alldata;
      merge  (in=count)  (in=count2);
      by id;
      If count=0 then put id= count=;
      If count2=0 then put id= count2=;
    Proc print data=mylib.alldata;
      Title 'With By statement';
    Run;

Review the log window. How many records from the DATA1_2 file did not have a match in DATA3? How many records from the DATA3 file did not have a match in DATA1_2?

4. Suppose you want the ALLDATA file to include only those records that matched. You can use the count and count2 variables in the DATA step to exclude the non-matches using IF-THEN statements. Add the appropriate statement(s) to the program and run it. Check the SAS log for errors.

SAVE the SAS log from this final DATA step and the answer to the following question in the exercise e-mail. Label this section Lab3 Exercise II Part A.

How many observations are in the ALLDATA file?

NOTE: The SAS system has an option to prevent accidental merging without a BY statement. Look at the MERGENOBY system option in HELP for further details.

B. Concatenation and Merging

The following files contain employee information. Use the SET and MERGE statements to combine them.

1. Create a combined SAS data set named employee1_2 (temporary or permanent, you choose) by concatenating the employee1 and employee2 files (SAS data sets). The data sets contain different individuals with the following variables:

Variable  Description
SSN       Social Security number (XXXXXXXXX)
Name      employee name, in the form: lastname, first name
Hire      hire date (date variable)
Salary    annual salary
Phone     office telephone number, in the form XXX-XXXX

Add a LABEL statement to the DATA step to label the name, hire, and phone variables with the descriptions given above. Add a PROC CONTENTS step to list the contents of employee1_2. Review the LOG and OUTPUT windows. How many records are in the employee1_2 SAS data set?

2. Employee3 contains additional employees that we need to add to the file created in 1. Combine this file with the employee1_2 SAS data set created in section B.1 and name the new SAS data set employee123. DO NOT INCLUDE the variable name in the employee123 file (use the DROP= or KEEP= data set option). The employee3 file includes the following variables:

Variable  Description
SSN       Social Security number (XXXXXXXXX)
Name      employee name, in the form: lastname, first name
Gender    gender: F=female, M=male
Hire      hire date (date variable)
Salary    annual salary

Notice that employee3 does not contain the phone variable but does include the gender variable. Use PROC PRINT to print the new data set employee123. It should contain all of the records in employee1, employee2, and employee3.

SAVE the OUTPUT window (from #2 only) containing the listing of employee123. Make sure that you put EMPLOYEE 123 file as the title at the top of the listing. Include the listing and the answers to the following 3 questions in your exercise e-mail. Label this section Lab3 Exercise II Part B.1.

    1. How many observations are in employee123?
    2. What is the value of gender for SSN=244967839?
    3. What is the office telephone number for SSN=933476520?

3. Add the following data from the employee4 file to the records from the employee123 file created in 2. Employee4 contains additional information on the same employees in the employee123 file. SSN is the key variable to use to match the records.

Variable  Description
SSN       Social Security number
Left      date left the company (date variable); blank if still an employee
Phone     home phone number, in the form (XXX-XXXX)

First, run PROC CONTENTS on the employee4 file. Notice the label for the phone variable: it is the home phone number. The variable phone in the employee123 file is the office telephone number. We want to merge the employee4 SAS data set with the employee123 SAS data set created in 2, BUT we want to keep both the home and office phone numbers. Remember that SAS will retain only one of the variables because they have the same name (Hint: use a data set option on the MERGE statement). Match-merge using SSN as the key variable and create a new SAS data set employee_total. Print the file using PROC PRINT.

SAVE the LOG and OUTPUT windows (from #3) containing the results from the program creating employee_total and the answers to the following questions. Please label this part of the report as Lab 3 Exercise II Part B.2 and include it in your exercise e-mail.

    1. How many records are in the employee123 and employee4 files?
    2. How many records and variables are in the employee_total file?
    3. List the SSNs of the records that do not match. Use the IN= data set option to identify the records that do not match and list them in the LOG window.
    4. How many variables does the file employee_total have?

4. Modify the DATA step that creates employee_total to use the IN= data set option to include only those observations that exist in both files (employee123 and employee4). There will be 14 observations in employee_total.

Save the SAS LOG (from #4) creating the new employee_total. Label this section as Lab 3 Exercise II Part B.3 and include it in your exercise e-mail.
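The match-merge ideas used in Part B (sorting by the key, a BY statement, the IN= option, and a data set option to keep two same-named variables apart) can be tried out first on made-up data. The sketch below is only an illustration under assumptions: the data sets staff and extra, the variables in_staff, in_extra, and home_phone, and all values are hypothetical, and RENAME= is one possible data set option for the hint, not necessarily the one intended by the exercise.

```sas
/* Two small data sets that share the key ssn and both have a        */
/* variable called phone (made-up values, not the lab data)          */
data staff;
    input ssn phone $;
    datalines;
111 555-0101
222 555-0102
333 555-0103
;
run;

data extra;
    input ssn phone $;
    datalines;
222 555-0202
333 555-0203
444 555-0204
;
run;

/* Both inputs must be sorted by the key before a match-merge */
proc sort data=staff; by ssn; run;
proc sort data=extra; by ssn; run;

data matched;
    /* RENAME= gives the second phone its own name so that both survive */
    merge staff (in=in_staff)
          extra (in=in_extra rename=(phone=home_phone));
    by ssn;
    /* Write the non-matching keys to the log */
    if in_staff = 0 then put 'Not in STAFF: ' ssn=;
    if in_extra = 0 then put 'Not in EXTRA: ' ssn=;
    /* Subsetting IF: keep only records present in both inputs */
    if in_staff and in_extra;
run;

proc print data=matched;
    title 'Matched records only';
run;
```

The IN= variables exist only while the DATA step runs and are not written to the output data set, so they do not add to its variable count. In this toy example only ssn 222 and 333 appear in both inputs, so MATCHED keeps those two records, each with phone and home_phone.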