April 4, SAS General Introduction

PP 105 Spring 01-02
April 4, 2002
SAS General Introduction
TA: Kanda Naknoi
kanda@stanford.edu

Stanford University provides UNIX computing resources for its academic community on the Leland Systems, which can be accessed through the Stanford University Network (SUNet). This document provides a basic overview of the SAS System data analysis package that is running on the Leland Systems.

You must have a Leland account to use SAS. To open a Leland account you will need a SUNet ID; see Introduction to the Leland Systems. For information on using UNIX (including very useful information on viewing and copying files, aborting a program, and checking to see if the printers are working), see the UNIX Command Summary reference card or the document Getting Started in UNIX. All of these documents are available at the Consulting Desk on the second floor of Sweet Hall and at http://www-leland.stanford.edu/group/dcg/docs/alphadocs.html.

Currently, SAS runs on most of the Leland Systems such as Elaine and Tree. The workstations for use are located on the second floor of Sweet Hall. You can run your SAS programs in batch mode; to do so, you must also be able to use a UNIX text editor. (The EMACS Reference Card, also available on the second floor of Sweet Hall, is very useful in learning the fundamentals of EMACS, a popular text editor.) In addition, you can run programs interactively, or under the X Window System. However, the SAS language commands and syntax are the same across all user interfaces, and are consistent with previous versions.

Note that while SAS is not case sensitive, the UNIX operating system is. In the sample program command lines in this document, bold letters represent SAS or UNIX keywords, which should not be changed.

This document is also available online. At the UNIX prompt, type:

    elaine5> lelanddocs

Running SAS in Batch Mode

You can execute SAS command files from the UNIX prompt. This is called batch mode. To use batch mode, store the SAS commands in a text file using a UNIX text editor such as EMACS, and then submit them to SAS with the following command:

    elaine5> sas filename

For example:

    elaine5> sas census.sas

This command will process all the commands in the file called census.sas and normally create two new files. The first file created, census.log, contains an annotated version of your SAS program, including error messages and other important messages regarding the execution of your program. The second, census.lst, contains the SAS output, which lists the results produced by the SAS program. You can view both files using the EMACS text editor, or with the UNIX command more.

If no .lst file was created, that may mean that SAS came across errors that stopped the processing. In that case, check the .log file to see what errors need to be corrected. It is always a good idea to check your .log file to make sure that the program ran correctly. It is suggested that you give all your SAS program files a common extension, such as .sas.

An Introduction to SAS Statements and Syntax

A SAS program is constructed with SAS statements. A SAS statement is a string of SAS keywords, SAS names, and special characters and operators ending in a semicolon. A statement asks SAS to perform an operation or gives SAS information. Some examples are provided below:

    INPUT X 15;
    DATA ONE;

Most SAS statements begin with a keyword that identifies what kind of a statement it is. The keyword in the first example is INPUT; it identifies an INPUT statement. The kinds of names that can appear in SAS statements include the names of variables, SAS data sets, formats, procedures, options, macros, and file references, among others. In the first example X is a variable, and in the second example ONE is a data set. Every SAS statement must end with a semicolon, which is one type of special character. Examples of other special characters and operators include the dollar sign $, the equals sign =, and the addition sign +. For more information on the components of SAS statements, see the manual SAS Language Guide.
Here are general rules for writing SAS statements:

Begin all SAS statements with an identifying keyword and end them with a semicolon.

SAS statements are free-format. That is, they can begin and end anywhere on a line, as long as they end with a semicolon. One statement can continue over several lines, and several statements can occupy the same line.

You may use as many blank spaces or lines as you want to separate fields or to separate sets of statements.

Use comments and blank lines to set off logical parts of your program. You can include comments anywhere in the program. This is an example of a comment:

    /* comments are enclosed in these symbols */
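As a small illustration of these rules, the sketch below lays out statements in two different ways. The data set name one and the variable x echo the earlier examples; the two data values, the DATALINES block, and the closing RUN statement are assumptions added only so that the fragment can run on its own.

```sas
/* Several statements can share one line. */
DATA one; INPUT x;
DATALINES;
12
345
;

/* One statement can continue over several lines. */
PROC PRINT
     DATA = one;
RUN;
```

Both layouts are read the same way by SAS, because each statement is delimited by its semicolon rather than by the end of a line.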

SAS Data and Proc Steps: Building Blocks of a SAS Program

A SAS program is comprised of SAS steps, which in turn are made up of SAS statements. There are two kinds of steps: DATA steps and PROC (procedure) steps. These steps are the building blocks of all SAS programs. Generally, DATA steps read unprocessed or raw data and organize them into a SAS data set, and PROC steps process these data sets. A SAS program can consist of a DATA step or a PROC step, or both. Within a program, DATA and PROC steps can appear in any order and with any frequency.

DATA Steps: Each DATA step includes statements asking SAS to create one or more new SAS data sets and programming statements that perform the manipulations necessary to build these data sets. The DATA step begins with a DATA statement and can include any number of program statements. The DATA step must be used whenever any transformation of variables is needed. The DATA step is described in more detail in Basic Data Management in SAS later in this document.

PROC Steps: Each PROC step asks SAS to execute a procedure that is defined as part of the SAS language, usually with a SAS data set as input. Additional statements used in the PROC step give the program more information about the results that you want. Note that while some additional statements are necessary for the proper execution of a procedure, other additional statements may be optional. Different manuals list the statements available with each PROC step, but many of them can be found in the SAS Procedures Guide. A PROC step always starts with a PROC statement. The following are two examples:

    PROC CONTENTS;

    PROC MEANS;
    VAR AGE INCOME;

Since a data set was not specified in the previous two examples, SAS will process the last data set mentioned in the program. Therefore, it is a good habit to name the SAS data set you want the procedure to analyze.
To name the SAS data set follow the example:

    PROC PRINT DATA = datasetname;

PROC statements have a wide variety of uses within SAS. Most notably, all statistical analysis routines in SAS are accessed through PROC statements.

Basic Data Management in SAS

Reading in Raw Data

In this section we refer to two kinds of data files: raw data files and SAS data sets. Raw data files are numbers or characters which can be entered and/or viewed using a text editor. SAS data sets, also known as system files, cannot be viewed using an editor (i.e., they are binary files, rather than text files).

When bringing raw data into SAS, use a DATA step to read the data, as in the example below. This process creates a SAS data set containing the compiled version of the raw data and any computed or recoded variables defined in the DATA step.

The SAS System creates two types of data sets: temporary and permanent. A temporary SAS data set exists only for the duration of the current SAS session. Therefore, data stored in a temporary SAS data set cannot be retrieved for use in later SAS sessions. See the following section Creating a Permanent SAS Data Set for information about permanent SAS data sets.

To create a temporary SAS data set from a raw data file, follow this example:

    FILENAME fileref 'path/filename';
    DATA sasname;
    INFILE fileref;
    INPUT variable names;

The FILENAME statement indicates the location and the name of the UNIX file to be read by the SAS program. The fileref is a 'nickname' by which the file is referenced inside the SAS program. The fileref must be eight or fewer characters and must begin with a letter. The filename is the actual name of the UNIX file that holds your data, represented by the fileref. You must specify the filename in the FILENAME line. The filename should be preceded by the path, which tells SAS which directory or subdirectory the raw data file is stored in.

With the DATA step, you specify the input format, recoding, and computation of new variables. The keyword DATA signifies the beginning of the DATA step whereas sasname is the 'nickname' by which you can subsequently refer to the data set you are creating in this DATA step. The INFILE statement uses the previously defined fileref to indicate which raw data file is to be read in, and INPUT specifies the names of the variables to be read in.

There are three main forms of INPUT statements:

The LIST input is the simplest form of input statement.
It assumes that the variables are recorded in the same order for each case (observation), but not necessarily in the same column locations. Values are separated by blanks or commas, and there may be several cases on the same row. For example, if there are five variables specified, SAS assumes that a new case begins after each group of five values, regardless of carriage returns in the raw data. Missing values must be represented by a place holder such as a period.

The COLUMN input is used when the raw data file has the variables in the same column location for every case (observation). When using column input, you list in the input statement the variable names and identify the location of the corresponding data fields in the data lines by specifying the column positions. You can use column input to skip fields when reading in data, and fields can be read in any order. No place holder is required for missing data.

The FORMATTED input is used when the data requires special instructions to be read correctly. For example, dates or numeric data containing commas should be read using formatted input. There are many formats in SAS, and they are described in detail in the manual SAS Language Guide.

The following example uses LIST input. Suppose you want to use the raw data file called census.data (located in a subdirectory called USinfo), which contains information from a U.S. survey, as input data for a SAS program. If you chose the name rawdata for the fileref and the name usa for the SAS data set, the corresponding DATA step would be:

    FILENAME rawdata '~/Country/USinfo/census.data';
    DATA usa;
    INFILE rawdata;
    INPUT NAME $ SEX $ ID AGE INCOME TEST1 TEST2;

This example reads in the raw data as list input from a file named census.data in the subdirectory USinfo of the directory Country, and creates a temporary SAS data set named usa. If no path name is specified, it will be assumed that the file is located in the current directory. No matter what directory you are in, you can use ~/ to indicate your home directory.

SAS can handle two kinds of variables: numeric and character. A numeric variable is a variable whose values are numbers. A character variable may contain alphabetic and special characters, as well as numbers. When reading in a character variable, a $ must follow the variable name. In the previous example, the variables name and sex are character variables.

Creating a Permanent SAS Data Set

If you are going to use the same data set a few times, it is usually worth your time to create a permanent SAS data set (also known as a system file) for the data. A permanent SAS data set exists after the end of the current SAS session and can, therefore, be retrieved for use in future programs or sessions. A permanent SAS data set contains the compiled version of the raw data file, as well as any computed or recoded variables.
Using permanent SAS data sets makes for quicker, more efficient computer processing than does reading in raw data for each program. The SAS System identifies permanent SAS data sets using names that consist of two parts separated by a period. The first part is called the first-level name, or libref; it identifies the SAS library where the data set is stored. In UNIX, a SAS Library is a directory. The second part is called the second-level name or sasfn; it identifies the specific SAS data set. Both the libref and the sasfn can consist of one to eight characters.

The LIBNAME statement is used to associate a libref with the name of the directory where you intend to store the permanent SAS data set. The syntax of the DATA step to create a permanent SAS data set is:

    FILENAME fileref 'path/file';
    LIBNAME libref 'path';
    DATA libref.sasfn;
    INFILE fileref;
    INPUT variable names;

In the following example, using the same raw data file described previously, you create a permanent SAS data set named survey.ssd01 in the subdirectory USinfo of the directory Country. Note that the extension ssd01 is attached to all permanent SAS data sets.

    FILENAME rawdata '~/Country/USinfo/census.data';
    LIBNAME usa '~/Country/USinfo';
    DATA usa.survey;
    INFILE rawdata;
    INPUT NAME $ SEX $ ID AGE INCOME TEST1 TEST2;

The permanent SAS data set is now in a file named survey.ssd01, which is in your USinfo subdirectory. If you wish to save the SAS system file in your current directory, you can replace the path in the LIBNAME statement with the notation '.'. In the following example '.' has replaced '~/Country/USinfo' in the LIBNAME statement.

    LIBNAME usa '.';

Using a Permanent SAS Data Set

Once a permanent SAS data set is created, use the LIBNAME statement in conjunction with the DATA = libref.sasfn option in the PROC step. The following is a program that produces descriptive statistics, using the permanent SAS data set which was created in the previous section.

    PROC MEANS DATA = usa.survey;
    VAR age income;

You can use permanent SAS data sets in SAS procedures in just the same way as you can use temporary data sets.
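Put together as a complete batch program for a later session, this might look like the sketch below. It assumes that usa.survey was already created in ~/Country/USinfo as described in the previous section; the RUN statement is an addition that marks the end of the step, and the comments are for illustration only.

```sas
/* Point the libref usa at the directory that holds survey.ssd01.
   Assumes the data set was created in an earlier session. */
LIBNAME usa '~/Country/USinfo';

/* Descriptive statistics for two numeric variables. */
PROC MEANS DATA = usa.survey;
   VAR age income;
RUN;
```

Because the libref exists only for the current session, the LIBNAME statement has to be repeated in every program that refers to usa.survey.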

Modifying a SAS Data Set

Once a permanent SAS data set is created, use the LIBNAME statement in conjunction with the SET statement to modify an existing SAS data set. Note that the SET statement can be used only for SAS data sets; in contrast, the INFILE statement used above can be used only with raw data sets. Following is a program that reads in the permanent SAS data set which was created earlier, and calculates a new variable called test3.

    DATA newvar;
    SET usa.survey;
    test3 = test1 + test2;

The data set newvar is now a temporary SAS data set. If you want to make it into a permanent file that will hold all the variables in usa.survey as well as the newly created variable test3, you must give it a two-level name, such as usa.newvar. The name usa.newvar implies that the data set newvar will be stored in the directory referenced by usa, that is '~/Country/USinfo'.

Saving the Output Data of a SAS Procedure

In some cases, you may want to save the results of a procedure analysis into a SAS data set for further analysis. For example, when running a regression you may later want to plot the residuals of the observations in the regression. The following example saves residuals and predicted values from a regression. The same general form of the OUTPUT statement can apply to almost any procedure. All the variables in the original data set are included in the new data set, along with variables created in the OUTPUT statement. To see the specific variables that can be saved for each procedure, check the manual for that procedure. As mentioned earlier, if you want to create a permanent SAS data set you must specify a two-level name in the OUTPUT statement.

    PROC REG DATA = usa.survey;
    MODEL z = x1 x2;
    OUTPUT OUT = res RESIDUAL = zresid PREDICTED = zhat;

This program creates a temporary output data set named res. In addition to the variables in the permanent data set survey.ssd01, res contains the variables zhat, whose values are the predicted values of the dependent variable z, and zresid, whose values are the residual values of z.
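The residual plot mentioned above could be produced along the following lines. This is only a sketch: it keeps the placeholder variables z, x1, and x2 from the regression example, uses the hypothetical two-level name usa.res to make the output data set permanent, and assumes the libref usa points at ~/Country/USinfo as before.

```sas
LIBNAME usa '~/Country/USinfo';

/* Save residuals and predicted values in a permanent data set.
   usa.res is a hypothetical name chosen for this illustration. */
PROC REG DATA = usa.survey;
   MODEL z = x1 x2;
   OUTPUT OUT = usa.res RESIDUAL = zresid PREDICTED = zhat;
RUN;

/* Plot residuals (vertical axis) against predicted values (horizontal axis). */
PROC PLOT DATA = usa.res;
   PLOT zresid * zhat;
RUN;
QUIT;
```

With the one-level name res instead of usa.res, the same plot works only within the program that ran the regression, since a temporary data set does not survive the session.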

Examining and Sorting SAS Data Sets

All of the following procedures are described in detail in the manual SAS Procedures Guide.

The CONTENTS Procedure

The CONTENTS procedure can be used to generate general information about a data set. In the following example, PROC CONTENTS will produce a list of the names, positions, formats, and labels for all variables in the survey.ssd01 data set, as well as the date the data set was created.

    PROC CONTENTS DATA = usa.survey;

The PRINT Procedure

The PRINT procedure lists the values of some or all variables contained in a SAS data set. The PRINT procedure can be used to check that the data set you have just created actually contains the right variables and observations. You can produce customized reports with PRINT procedure options and statements. The structure of the PRINT procedure is:

    LIBNAME libref 'path';
    PROC PRINT DATA = libref.sasfn;

The above syntax will display all of the variables in the data set. If you only wish to display specific variables, you can add the VAR statement. In the following example, PROC PRINT displays the variables in the order listed in the VAR statement. In other words, the variables sex and id will be displayed, in that order, from the survey.ssd01 SAS data set.

    PROC PRINT DATA = usa.survey;
    VAR sex id;

Note that the PRINT procedure does NOT send any output to a printer.

The SORT Procedure

The primary function of the SORT procedure is to sort a SAS data set based on the values of a specific variable or variables. The SORT procedure is also necessary for certain SAS procedures that require the data to be sorted before they can be analyzed. For example, the BY statement in many SAS procedures will run a separate analysis for each specified value of a variable. However, BY group processing requires the data to be sorted on the variable of interest.

PROC SORT rearranges the observations in the data set according to the values of the variables in the BY statement. If more than one variable is specified, PROC SORT first sorts the data according to the values of the first variable, then sorts each resulting group according to the second variable, and so on for all successive variables. PROC SORT has the following structure:

    LIBNAME libref 'path';
    PROC SORT DATA = libref.sasfn;
    BY variable names;

This program sorts the data in the survey.ssd01 data set by the value of the variable id:

    PROC SORT DATA = usa.survey;
    BY id;

SAS Options

There are many different options that can be specified at the beginning of a SAS program. Among the most common are LINESIZE and MEMSIZE. SAS often generates output that is too wide to fit on 8.5"x11" paper. One solution is to insert the following statement at the beginning of your program:

    OPTIONS LINESIZE = 80;

By default, SAS uses 32 megabytes of memory, which is sufficient in most cases. However, if your .log file tells you that it ran out of memory, you should use the MEMSIZE option, which has the following form:

    OPTIONS MEMSIZE = nm;

where "n" is the memory you wish to use in megabytes.

Moving SAS Files Between Different Operating Systems

The following section assumes you know how to use FTP (File Transfer Protocol). If you do not, and need to move a SAS data set between different operating systems, contact the Sweet Hall Consulting Desk (725-2101 or consult@leland).

Occasionally, you may need to move a SAS data set from one operating system to another. For example, you may receive data from a location that uses an operating system other than UNIX, or you may want to move your files to or from SAS on a PC or Macintosh to UNIX. Since the form of SAS data sets is specific to the operating system under which the files have been created, moving data sets from one machine to another requires some extra steps. Note that if you are moving a data set from one UNIX account to another, you don't need to export and import; just use binary FTP.

To move data via FTP you must first convert the data set to a portable file by running the short SAS export program shown below. Portable files are versions of data sets that can be imported into SAS under all operating systems. Second, use FTP to move the portable file to the destination computer. You must use binary FTP to move the file. Third, import the portable file into a standard data set on the destination computer by running the short SAS import program shown below. Once you have done all of this, you can erase the portable file (from both accounts). Do not erase the portable file before checking that your data have been imported correctly (for example, by using the PRINT or CONTENTS procedures).

In the example below, a file called survey, which is in the sasfiles subdirectory, will be moved by writing it into a portable file called expfile.exp that can be transferred by FTP and then imported. The PROC COPY procedure creates the portable version, expfile.exp, of the SAS data set survey.

1. Write a SAS program containing the following commands:

    LIBNAME mylib '~/Stat/sasfiles';
    LIBNAME tranfile XPORT 'expfile.exp';
    PROC COPY IN = mylib OUT = tranfile;
    SELECT survey;

If you type ls at the UNIX prompt you will see that you now have a new file called expfile.exp.

Note: You may move a directory that consists of several data sets at once. In the example above, if you delete the line SELECT survey, you will move a subdirectory called sasfiles by writing it into a portable file called expfile.exp. Using this procedure, expfile.exp contains a portable version of all of the data sets in the sasfiles subdirectory.

2. FTP the file expfile.exp. Remember to use binary FTP.

3. On the destination computer, create and run the following import program, which saves the data from the portable file back into all the standard data sets from sasfiles, regardless of the number of files you exported previously.

    LIBNAME tranfile XPORT 'expfile.exp';
    LIBNAME newlib '~/USinfo/new';
    PROC COPY IN = tranfile OUT = newlib;
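Before erasing the portable file, the imported data can be checked with the CONTENTS and PRINT procedures, for instance as in the sketch below. It assumes the import program above has already been run on the destination computer, so that the directory ~/USinfo/new holds a data set named survey; the limit of 10 observations is an arbitrary choice for illustration.

```sas
/* Assumes the import step has already written its data sets
   into ~/USinfo/new on the destination computer. */
LIBNAME newlib '~/USinfo/new';

/* List every data set in the library; NODS suppresses the
   per-variable detail for each one. */
PROC CONTENTS DATA = newlib._ALL_ NODS;
RUN;

/* Print the first 10 observations of survey as a spot check. */
PROC PRINT DATA = newlib.survey (OBS = 10);
RUN;
```

If the listing shows the expected data sets and the printed values look right, the portable file can be removed from both accounts.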

Remarks

1. This handout is reproduced from the following link: http://www.stanford.edu/group/consult-stat/sas/sas.leland.html

2. Another useful link: http://www.stanford.edu/group/consult-stat/sas/sas.index.html