STAT 3304/5304 Introduction to Statistical Computing
Introduction to SAS
What is SAS?
SAS (originally an acronym for Statistical Analysis System; it no longer stands for anything) is a program designed to analyze large sets of numerical and character data. It is pronounced "sass", not spelled out as three letters. SAS was developed in the early 1970s at North Carolina State University. In 1976, SAS Institute Inc., a privately held corporation, was formed. SAS grew in popularity and capability and was widely used in academic groups.
What is SAS?
SAS can be used without knowing much about programming, but it is also a sophisticated language with which much more can be done. SAS was first developed as a programming language for statisticians and data analysts. It was originally intended for the management and analysis of agricultural field experiments.
What is SAS?
SAS has grown into the world's largest privately held software company. It is now located in Cary, North Carolina, and does business worldwide: in Asia Pacific, Latin America, Europe, the Middle East, and Africa. SAS also has a good employee retention rate of 96%. It is a family-oriented company and is friendly to working women.
What is SAS?
SAS is now one of the most widely used statistical software packages. Continual product line expansion and diversification of clientele have resulted in SAS products being used at over 40,000 customer sites in 50 countries. There are 3.5 million users of SAS products. Part of the reason for the continual growth is that SAS Institute works with end users to improve its products. It offers solutions for data warehousing, data mining, data visualization, and applications development.
What is SAS?
The SAS System is an applications system that can be used as
  a statistical package,
  a database management system, and
  a high-level programming language.
An applications system is software that gives you the tools you need to make your data useful and meaningful. To be useful, an applications system should
  give you total control of your data,
  facilitate applications that run in more than one computing environment, and
  accommodate the varying skill levels of potential users.
What is SAS?
SAS runs on a variety of platforms and is portable across computing environments. A computing environment is determined by the HARDWARE and the host OPERATING SYSTEM running on it. SAS can be used on IBM mainframes, UNIX-based machines, and personal computers running Windows.
Portability means that SAS applications
  function the same,
  look the same, and
  produce the same results.
You can develop SAS applications in one environment and run them in other environments without rewriting the programs.
Modes for Running SAS
SAS can be run in a variety of styles, or modes, depending on the operating system it is running on. The modes most often used include:
  Batch mode: the user writes whole SAS programs, saves them in a file, and then runs SAS from a command-line prompt.
  Interactive line mode: the user enters commands line by line in response to prompts issued by the SAS System.
Modes for Running SAS
  Interactive window mode (SAS Display Manager System): the user interacts with SAS through windows, using pull-down menus, dialog boxes, and icons. This is the version used on Windows and Macintosh.
  SAS Enterprise Guide: SAS Enterprise Guide software runs only under Windows. It can write SAS code for you through its extensive menu system.
How does SAS work?
With any body of data, you must perform four basic tasks to make it useful and meaningful.
  ACCESS: First, you access the data through the SAS System.
  MANAGE: Update, rearrange, combine, edit, or subset the data before analyzing it.
  ANALYZE: Analyses range from simple descriptive statistics to more advanced or specialized analyses for econometrics and forecasting, statistical design, computer performance evaluation, and operations research.
  PRESENT: Presentation capabilities range from simple lists and tables to multidimensional plots to elaborate full-color graphics, both on paper and on your display.
How does SAS work?
A SAS program is a sequence of statements executed in order. A statement gives information or instructions to SAS and must be appropriately placed in the program. SAS is very lenient about the format of its input: statements can be broken up across lines, multiple statements can appear on a single line, and blank spaces and lines can be added to make the program more readable. The most effective strategy for learning SAS is to concentrate on the details of the DATA step, and to learn the details of each procedure as you need them.
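As a small illustration of this free format, the sketch below creates a data set and then prints it twice, once with one statement per line and once with the statements run together and split across lines. The data set name, variable names, and values are made up for the example.

```sas
/* One statement per line, indented for readability */
data scores;
   input id score;
   datalines;
1 90
2 85
;
run;

proc print data=scores;
   var id score;
run;

/* The same PROC step with several statements on one line
   and one statement split across two lines */
proc print data=scores; var id
   score; run;
```

Both PROC PRINT steps should produce the same listing, since SAS uses the semicolons, not the line breaks, to separate statements.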
SAS Windows
There are five basic SAS windows: the Results and Explorer windows, and three programming windows: Editor, Log, and Output. There are also many other SAS windows that you may use for tasks such as getting help, changing SAS system options, and customizing your SAS session.
  Results: The Results window is like a table of contents for your Output window; the results tree lists each part of your results in outline form.
  Explorer: The Explorer window gives you easy access to your SAS files and libraries.
SAS Windows
  Editor: The Editor window provides a text editor that you can use to type in, edit, and submit SAS programs, as well as to edit other text files such as raw data files.
  Log: The Log window contains notes about your SAS session. After you submit a SAS program, any notes, errors, or warnings associated with it appear in the Log window, along with the program statements themselves.
  Output: If your program generates any printable results, they appear in the Output window.
SAS Windows
In Windows operating environments, the default editor is the Enhanced Editor. The Enhanced Editor is syntax sensitive and color codes your programs, making it easier to read them and find mistakes.
  Green: Comments
  Dark Blue: Keywords in major SAS commands
  Blue: Keywords that have special meaning as SAS commands
  Yellow Highlight: Data
  Red: Statements that SAS does not understand
The Enhanced Editor also allows you to collapse and expand the various steps in your program. For other operating environments, the default editor is the Program Editor, whose features vary with the version of SAS and the operating environment.
General Syntax and Rules
SAS statements may be in upper or lower case and may begin in any column. SAS statements always end with a semicolon (;). SAS statements may also extend across lines, and more than one SAS statement may appear on a single line. SAS variable names must be 32 characters or fewer, constructed of letters, digits, and the underscore character. The first character must be an English letter (A, B, C, ..., Z) or an underscore (_). Subsequent characters can be letters, numeric digits (0, 1, ..., 9), or underscores. Characters such as dashes and spaces are not allowed.
General Syntax and Rules
It's a good idea not to start variable names with an underscore, because special system variables are named that way. Data set names follow rules similar to those for variable names, but they have a different name space. There are virtually no reserved keywords in SAS; it's very good at figuring things out from context. SAS is not case sensitive, except inside quoted strings. Missing values are handled consistently in SAS and are represented by a period (.). Each statement in SAS must end in a semicolon (;).
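The naming, case, and missing-value rules can be seen together in a minimal sketch like the one below. The data set name, variable names, and values are hypothetical and chosen only to illustrate the rules.

```sas
data naming_demo;
   /* Valid names: letters, digits, and underscores,
      starting with a letter */
   input subject_id Age2 height_cm;
   /* Names are not case sensitive: AGE2 and Age2
      refer to the same variable */
   age_next_year = AGE2 + 1;
   /* In the data lines below, a period marks a missing value */
   datalines;
1 25 170
2 . 165
3 41 .
;
run;

proc print data=naming_demo;
run;
```

For the second observation, Age2 is missing, so age_next_year is missing as well; SAS writes a note about this to the log rather than stopping.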
General Syntax and Rules
To make your programs more understandable, you can insert comments into them. Comments are usually used to annotate the program, making it easier for someone to read your program and understand what you have done and why. It doesn't matter what you put in your comments; SAS will not look at it. There are two styles of comments you can use. One starts with an asterisk (*) and ends with a semicolon (;). The other starts with a slash asterisk (/*) and ends with an asterisk slash (*/).
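A short sketch of both comment styles follows. The data set and variable names are made up for illustration.

```sas
* This is a statement-style comment that ends at the next semicolon;

/* This is a block-style comment.
   It can span several lines and may contain semicolons; like this. */

data comment_demo;
   x = 1;     /* a block comment can follow a statement on the same line */
   * y = 2;
run;
```

Because the line `* y = 2;` starts with an asterisk, SAS treats it as a comment, so the resulting data set contains only the variable x. This is a common way to switch a statement off temporarily.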
Getting Help
  The bulk of SAS documentation is available online at http://support.sas.com/documentation/onlinedoc/
  A catalog of printed documentation available from SAS can be found at http://support.sas.com/publishing/
  Online help: Type help in the SAS display manager input window.
  Sample programs are distributed with SAS on all platforms.
  SAS Institute Home Page: http://www.sas.com
  SAS Institute Technical Support: http://support.sas.com/resources/
Getting Help
  Searchable index to SAS-L, the SAS mailing list: http://www.listserv.uga.edu/archives/sas-l.html
  Michael Friendly's Guide to SAS Resources on the Internet: http://www.math.yorku.ca/scs/statresource.html#sas
  Brian Yandell's Introduction to SAS: http://www.stat.wisc.edu/~yandell/software/sas/intro.html
Two Parts of a SAS Program
There are two main components to most SAS programs:
  DATA steps: create SAS data sets; read in, manipulate, and edit data.
  PROC steps: process SAS data sets (creating reports, graphs, editing data, sorting data, etc.) and can also create data sets.
A typical program starts with a DATA step to create a SAS data set and then passes the data to a PROC step for processing. For example: raw data and/or a pre-existing SAS data set are read into a SAS DATA step and turned into a SAS data set, which is then altered or analyzed by a PROC step, and the results are displayed in a report.
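The DATA-step-then-PROC-step pattern can be sketched as follows. The data set name, variable names, and values are invented for the example, and PROC MEANS stands in for whatever procedure the analysis calls for.

```sas
/* DATA step: read raw data and create the SAS data set STUDENTS */
data students;
   input name $ score;   /* the $ marks NAME as a character variable */
   datalines;
Ann 88
Bob 75
Cara 93
;
run;

/* PROC step: process the data set and report the results */
proc means data=students;
   var score;
run;
```

Each step ends with a RUN statement, and the PROC step names the data set it works on with the DATA= option.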
DATA steps: Getting data into a SAS data set
There are three ways of getting data into a SAS data set.
1. Include the data in the SAS command stream. The data are like a card deck placed into the stream of SAS commands. Use an INPUT statement to list the variables and a CARDS statement immediately before the data to be read in. A line containing only a semicolon marks the end of the data.
Example:
    DATA CARDSIN;
      INPUT IDNUM SEX AGE;
      CARDS;
    1 1 25
    2 2 33
    4 1 55
    ;
    RUN;
DATA steps: Getting data into a SAS data set
2. Read the data in from a disk file. Use the INFILE statement to name the file that contains the data, then use the INPUT statement to list the variables. The file name is enclosed in quotation marks.
Example:
    DATA DISKIN;
      INFILE 'RAWDATA.DAT';
      INPUT IDNUM SEX AGE;
    RUN;
DATA steps: Getting data into a SAS data set
3. Create a new data set from an existing SAS data set. Here, the SET statement is used to name the existing SAS data set.
Example: the following creates two new SAS data sets from an existing SAS data set:
    DATA FATHERS MOTHERS;
      SET DISKIN;
      IF SEX=1 THEN OUTPUT FATHERS;
      ELSE OUTPUT MOTHERS;
    RUN;
PROC steps: Data Management
  PROC SORT: Sorts a data set by one or more variables. For example, PROC SORT; BY ID; RUN; will sort the data set by the values of the variable ID.
  PROC CONTENTS: Displays the contents of the data set.
  PROC DATASETS: Manages SAS data set libraries.
  PROC RANK: Rank orders one or more variables.
  PROC STANDARDIZE: Rescales variables to a specified mean and/or standard deviation.
PROC steps: Data Management
  PROC SCORE: Generates linear scores for certain procedures like factor analysis and discriminant analysis.
  PROC TRANSPOSE: Transposes a data set.
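A minimal sketch of three of the data management procedures listed above is given below. The data set PATIENTS and its values are made up, and the output data set names are arbitrary choices for the example.

```sas
data patients;
   input id sex age;
   datalines;
3 1 55
1 1 25
2 2 33
;
run;

/* Sort by ID, writing the result to a new data set */
proc sort data=patients out=patients_sorted;
   by id;
run;

/* List the variable names, types, and lengths */
proc contents data=patients_sorted;
run;

/* Rank AGE, storing the ranks in the new variable AGE_RANK */
proc rank data=patients_sorted out=patients_ranked;
   var age;
   ranks age_rank;
run;
```

Using OUT= keeps the original data set unchanged; without it, PROC SORT replaces the input data set with the sorted version.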
PROC steps: Descriptive Statistics
  PROC FREQ: Simple frequencies and contingency tables for categorical variables.
  PROC MEANS: Number of observations, mean, standard deviation, and minimum and maximum values for continuous variables.
  PROC UNIVARIATE: More detailed descriptive statistics for continuous variables.
  PROC TABULATE: Produces tables of frequencies and/or descriptive statistics.
PROC steps: Descriptive Statistics
  PROC SUMMARY: Descriptive statistics broken down by groups; particularly useful for generating a data set of descriptive statistics for input into other procedures.
  PROC CORR: Parametric and nonparametric correlations.
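The following sketch shows how a few of these descriptive procedures might be called on a small data set. The SURVEY data set, its variables, and its values are hypothetical.

```sas
data survey;
   input group $ sex $ height weight;
   datalines;
A F 160 55
A M 175 72
B F 165 60
B M 180 80
B M 170 68
;
run;

/* Contingency table of GROUP by SEX */
proc freq data=survey;
   tables group*sex;
run;

/* N, mean, standard deviation, minimum, and maximum */
proc means data=survey n mean std min max;
   var height weight;
run;

/* Pearson (parametric) and Spearman (nonparametric) correlations */
proc corr data=survey pearson spearman;
   var height weight;
run;
```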
PROC steps: Regression
  PROC REG: General purpose linear regression and multivariate regression.
  PROC GLM: General linear models, including regression, analysis of variance/covariance, and multivariate analysis of variance/covariance.
  PROC RSQUARE: All-possible-subsets regression.
  PROC RSREG: Quadratic response surface regression.
  PROC LOGISTIC: Logistic regression.
  PROC PROBIT: Probit regression.
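As an illustration, a simple linear regression with PROC REG might look like the sketch below. The data set GROWTH and its values are invented for the example.

```sas
data growth;
   input x y;
   datalines;
1 2.1
2 3.9
3 6.2
4 7.8
5 10.1
;
run;

/* Fit the straight-line model y = b0 + b1*x */
proc reg data=growth;
   model y = x;
run;
quit;
```

PROC REG is an interactive procedure that stays active after RUN, so the QUIT statement is used here to end it.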
PROC steps: ANOVA, Graphics
Analysis of Variance
  PROC ANOVA: Analysis of variance for orthogonal data.
  PROC GLM: General linear models, including regression, analysis of variance, and multivariate analysis of variance.
  PROC NESTED: Nested analysis of variance.
  PROC VARCOMP: Variance components.
Low Resolution Graphics
  PROC CHART: Pie, bar, and star charts.
  PROC PLOT: Two-dimensional plots.
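A one-way analysis of variance with PROC ANOVA could be sketched as follows. The data set YIELDS and its values are made up; the design is balanced (three observations per treatment), which is the setting PROC ANOVA is intended for.

```sas
data yields;
   input treatment $ yield;
   datalines;
A 20
A 22
A 19
B 25
B 27
B 26
C 30
C 29
C 31
;
run;

/* One-way ANOVA of YIELD by TREATMENT */
proc anova data=yields;
   class treatment;
   model yield = treatment;
run;
quit;
```

For unbalanced data, the same CLASS and MODEL statements can be used with PROC GLM instead.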
PROC steps: Multivariate Analysis
Discriminant Analysis
  PROC DISCRIM: General purpose parametric and nonparametric discriminant analysis.
  PROC CANDISC: Canonical discriminant analysis.
Principal Components and Factor Analysis
  PROC PRINCOMP: Principal components.
  PROC FACTOR: Factor analysis.
PROC steps: Multivariate Analysis
Cluster Analysis
  PROC CLUSTER: Clustering observations.
  PROC FASTCLUS: Disjoint clustering for large data sets.
  PROC VARCLUS: Clustering variables.
Survival Analysis
  PROC LIFETEST: Nonparametric survival estimates and life tables.
  PROC LIFEREG: Parametric survival analysis.
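As one example from the multivariate procedures, a principal components analysis with PROC PRINCOMP might be set up like the sketch below. The data set MEASURES, its three variables, and the values are hypothetical.

```sas
data measures;
   input x1 x2 x3;
   datalines;
2.5 2.4 1.2
0.5 0.7 0.3
2.2 2.9 1.1
1.9 2.2 0.9
3.1 3.0 1.6
2.3 2.7 1.0
;
run;

/* Principal components of X1, X2, and X3 */
proc princomp data=measures;
   var x1 x2 x3;
run;
```

By default PROC PRINCOMP works from the correlation matrix, so the variables are effectively standardized before the components are computed.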