ST 512 - Lab 1 - The basics of SAS

What is SAS?

SAS is a statistical software system with its own programming language. For the most part, SAS works through procedures, called procs. For instance, to do a correlation analysis there is proc corr. Today we will start with the basics: the SAS interface, reading in data, and running a few procedures.

You can download SAS for your personal computer; see http://sas.ncsu.edu/. (It requires a Windows edition above Home, though.)

The SAS Interface

There are four main windows in the SAS environment:

The Program Editor Window
This is where you spend most of your time working in SAS, writing your program in the editing window. Note: SAS is not case sensitive.

The Log Window
Once you execute your program, SAS will report back to you in this window. Error messages, notes about your dataset, and warning messages will appear here. Don't underestimate the importance of this window: remember to look here each and every time you execute a SAS program.

The Output Window
The output requested in your SAS program will appear in this window. Remember that output may be generated even when errors are present in your program!

The Results Window
The output is listed by section here. Click on an item and you are taken to that place in the output. The Explorer tab allows you to keep track of your libraries and their contents.
Note: every SAS statement ends with a semicolon!

First Lines and Reading in Data

Now we are ready to read in some data. There are a few options to read in data:

1. Copy and paste the data into the Program Editor Window with the correct code before and after.
2. Use the SAS import wizard.
3. Use SAS commands that call a file.

Let's get started!

1. Copy and Paste Method:

Go to the WolfWare page, Lab section. Here you will find a file called soilwater.dat. Open the link and copy and paste the data into the program editor after your options statement. To create data in SAS we use the data statement (the resulting block of code is called a data step).

    data name;
      input variable1 variable2 ...;
      datalines;
    #  #  ...
    #  #  ...
    ...
    #  #  ...
    ;
    run;

Note: if one of our variables were non-numeric (e.g. had values A, B, etc.), we would need to put a $ after the variable name in the input statement (input variable1 $ variable2 ...; declares variable1 to be a character variable).

Create the data step to read in the soil water data. Don't paste in the column names! Once you have the code, highlight the part you want to run (include the options statement the first time you run something). Now you can simply click the running man button at the top of the page (or use the SAS menus).

To check that the data were read in correctly, as should always be done after submitting code, first check the log for errors. Then we can print the data out to see it. To do this we use a procedure called proc print.

    proc print data=name;
    run;

If you ever get confused about a procedure's syntax, you can search the web for "sas proc <procedure name> help". The first link should take you to SAS's very nice online documentation system. (Try it.)

Highlight this section of code and run it to see your output!
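As a small illustration of the $ rule described above, here is a sketch of a data step that reads one character variable and two numeric variables. The data set name (example), the variable names (site, reading), and all of the values are made up for illustration; they are not the soil water data.

```sas
/* Hypothetical example data: names and values are made up */
data example;
  input site $ depth reading;   /* the $ makes site a character variable */
  datalines;
A 10 0.31
B 20 0.28
C 30 0.22
;
run;

/* Print the data to confirm it was read in as intended */
proc print data=example;
run;

/* List each variable with its type (Char or Num) */
proc contents data=example;
run;
```

If the $ were left off, SAS would try to read A, B, and C as numbers; the log would then report invalid data for site and the values would be stored as missing, which is one more reason to check the log after every run.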
2. Import Wizard Method:

SAS has an import wizard that can read in many standard types of data files. First, go to the class website and download the mother.xls file (make sure you know where it is being saved!). Now, go to File > Import Data. The import wizard will pop up.

- Choose a standard source from the drop-down menu. Mother.xls is an Excel 95 file (older versions of SAS cannot read .xlsx files).
- Hit Next and browse to the location of the file.
- Hit Next, then type in the name of the dataset you want to create (e.g. mother).
- Hit Next. SAS will ask if you want to save the commands for importing the file. Hit Browse and find the folder where you would like to save the file, type in the file name, and hit Save.
- Finally, hit Finish.

Check your log to see if there are any errors. Print the data out to check that it was read in correctly.

3. SAS Commands:

You can also read data in using a few commands in SAS. Find the file containing the commands that the import wizard saved and open it. Copy and paste the code into the program editor. In the future you can use these commands to read in the data rather than using the import wizard. You may need to change the file path, however.

You can set the default file path in SAS as follows: go to Tools > Options > Change Current Folder. From here you can select the default folder for SAS to look in. Choose the folder containing the file mother.xls. Once you do this, you can remove any directory names, e.g.

    DATAFILE="J:\ST 512\Labs\Mother.xls"

can be replaced by

    DATAFILE="Mother.xls"

There are other ways to import data in SAS, such as the infile statement. If interested, search the SAS help pages.
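For orientation, the commands saved by the import wizard typically look something like the sketch below. This is not the wizard's exact output: the DBMS= value it writes depends on your SAS installation and may differ from the xls shown here. The infile step at the end is likewise only a sketch; the header row and the column order (depth, then soil) in soilwater.dat are assumptions.

```sas
/* Sketch of a proc import step; assumes Mother.xls is in the current folder */
proc import datafile="Mother.xls"
            out=mother        /* name of the SAS data set to create */
            dbms=xls          /* engine for .xls files; your wizard may write a different value */
            replace;          /* overwrite mother if it already exists */
  getnames=yes;               /* take variable names from the first row */
run;

proc print data=mother;
run;

/* Sketch of reading a text file with infile */
data soilwater;
  infile "soilwater.dat" firstobs=2;  /* firstobs=2 skips a header row (assumption) */
  input depth soil;                   /* column order is an assumption */
run;
```

Whichever form you use, check the log afterwards: it reports how many observations and variables were read.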
The Corr and Reg Procedures

Description of the Soil Water data set: The data set contains the measured soil water content (in cm³/cm³) of 16 soil samples at four depths (in cm).

Description of the Mother data set: Weight gain of the mother during pregnancy is known to be a critical factor in determining the birth weight of the infant. Some data collected in a study of the relationship between average weight gain and mother's age are given in the file mother.xls.

Some questions we may want to answer from these types of data sets:

1. Is there an association between the two variables?
2. If so, does that association appear to be linear?
3. Can we conduct a statistical test to determine whether this relationship is statistically significant?
4. Can we fit a linear regression line to these data?
5. How can we use that line to predict future observations?

Let's go through the soil water data together, then you can attempt the mother data set on your own.

1. To answer the first two questions, we can invoke the corr procedure.

    proc corr data=soilwater;
      var depth soil;
    run;

Run this code and inspect the output. To get the tests we will cover in class (and some nice plots), add in the following:

    ods graphics on;
    proc corr data=soilwater plots=matrix fisher(biasadj=no);
      var depth soil;
    run;
    ods graphics off;

Run this code and inspect the plots. Use the output to answer the first 3 questions above.

There are many other options for tests that can be performed using proc corr. Check out http://support.sas.com/documentation/cdl/en/procstat/63104/html/default/viewer.htm#procstat_corr_sect004.htm for more information.
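As background for reading the proc corr output, the standard formulas behind the correlation test and the Fisher option are sketched below. The notation is an assumption rather than the course's own: r is the sample correlation, ρ the population correlation, n the number of observations, and both variables are assumed to be approximately bivariate normal. The presentation in class may differ in detail.

```latex
% Test of H_0 : \rho = 0 (the p-value printed under each correlation)
t = \frac{r\sqrt{n-2}}{\sqrt{1-r^{2}}} \sim t_{n-2} \quad \text{under } H_0

% Fisher's z transformation (no bias adjustment, as with biasadj=no)
z = \tanh^{-1}(r) = \tfrac{1}{2}\ln\!\left(\frac{1+r}{1-r}\right),
\qquad
z \;\approx\; N\!\left(\tanh^{-1}(\rho),\; \frac{1}{n-3}\right)

% Approximate (1-\alpha) confidence interval for \rho
\left[\, \tanh\!\left(z - \frac{z_{\alpha/2}}{\sqrt{n-3}}\right),\;
         \tanh\!\left(z + \frac{z_{\alpha/2}}{\sqrt{n-3}}\right) \right]
```

The interval is built on the z scale, where the sampling distribution is close to normal, and then mapped back to the correlation scale with tanh.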
Let us look into fitting a regression line to these data. Which variable would we consider our response (dependent variable), and which our predictor (independent variable)? We can use proc reg to fit a regression line (we could also use proc glm or proc mixed, which will be discussed later in the course). The basic code to invoke the reg procedure is:

    proc reg data=soilwater;
      model soil=depth;
    run;

Run this code and inspect the output. What hypotheses are being tested by each p-value you see?

To see a scatterplot with a regression line, residual diagnostics, predicted values, and confidence intervals for our parameter estimates, we can run the following:

    ods graphics on;
    proc reg data=soilwater;
      model soil=depth / r clb;
    run;
    ods graphics off;

Inspect the output and plots. Does the line appear to fit the data on the scatterplot well? Do the residuals appear to have constant variance? What do our confidence intervals tell us about our parameter estimates?

Adding / r to the model statement gives us the predicted value at every value of the independent variable that appears in our data set. How can we get other predicted values (e.g. for a depth of 12.5 cm or 49 cm)? We can plug those depths into the fitted equation ourselves, or have SAS compute the predictions for us. To do the latter, we can trick SAS into giving us a predicted value by appending observations with missing responses onto our data set. SAS treats a period (.) as a missing value, so we can run the following code:
    data newdepths;
      input depth soil;
      datalines;
    12.5 .
    49 .
    ;
    run;

    proc datasets;
      append base=soilwater data=newdepths;
    run;
    quit;

Run this code, check the log to make sure everything worked, and use proc print to print out the new soilwater data set. Now if we run the same proc reg code with / r, we will get predicted values at depths of 12.5 and 49. (Note: we can also get confidence intervals (C.I.s) and prediction intervals (P.I.s) for these values, which we will talk about at a later time.)

Try to answer the following questions about the mother data set on your own:

1. What is the sample correlation between weight gain and age?
2. Is the sample correlation significantly different from zero?
3. Which variable would we consider the response and which the independent variable? Why?
4. Fit a regression line to the data. Is the slope significantly different from 0?
5. Do the data appear to satisfy the assumption of constant variance?
6. Predict the value of weight gain for someone who is 20 years old and someone who is 35 years old.
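For reference, the missing-value prediction trick used earlier in this lab can also be written so that the original data set is left untouched. The sketch below is one possible variant, not the lab's required method. The data set names (toy, toscore, toy_plus, preds), the variable name predicted, and all of the toy values are made up for illustration and are not the soil water measurements.

```sas
/* Hypothetical toy data, NOT the lab's soil water values */
data toy;
  input depth soil;
  datalines;
10 0.30
20 0.27
30 0.25
40 0.21
;
run;

/* Rows to predict: predictor filled in, response missing (.) */
data toscore;
  input depth soil;
  datalines;
12.5 .
49 .
;
run;

/* Stack the two data sets into a new one; toy itself is not modified */
data toy_plus;
  set toy toscore;
run;

proc reg data=toy_plus;
  model soil = depth;
  /* Rows with a missing response are left out of the fit,
     but still receive a predicted value in the output data set */
  output out=preds p=predicted;
run;
quit;

/* Show only the rows that were added for prediction */
proc print data=preds;
  where soil = .;
run;
```

One reason to prefer stacking with a set statement is that proc datasets append changes the base data set itself, so running the append step twice adds the new rows twice.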