SAS Lecture 4: Procedures
Aidan McDermott, April 18, 2006

SAS Programs
A SAS program is written in an imperative language consisting of statements. Each statement ends in a semicolon. Programs consist of (at least) four types of statement, grouped together in blocks:
- global statements
- data step statements
- procedure statements
- macro statements

Outline
- SAS formats: proc format
- Procedures for descriptive data analysis: the freq, means, and univariate procedures
- Procedures for statistical analysis: the ttest and reg procedures

What will the output from this program look like? How many variables will be in the dataset example, and what will be the length and type of each variable? What will the variable package look like?

SAS Formats
It is sometimes useful to store data in one way and display it in another. For example, dates can be stored as integers but displayed in human-readable format. A SAS format changes the way the data stored in a variable is displayed. There are two types of format:
- Internal formats (SAS already knows about these)
- User-defined formats (you define these yourself)

Internal SAS formats
A format statement tells SAS to use a format with one or more variables.
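As a minimal sketch of the idea that a date is stored as an integer but displayed in readable form, the following uses the internal format date9. The dataset name visits and the variable visit_date are hypothetical names chosen for illustration; they do not come from the lecture.

```sas
/* Sketch only: "visits" and "visit_date" are made-up names. */
data visits;
    visit_date = '18APR2006'd;   /* stored as days since 01JAN1960, i.e. 16909 */
run;

/* Without a format, proc print shows the stored integer. */
proc print data=visits;
    title "Unformatted: the stored integer";
run;

/* The format statement changes the display only; the stored value is unchanged. */
proc print data=visits;
    format visit_date date9.;
    title "Formatted with date9.";
run;

title;   /* clear the title */
```

Because the format statement here sits inside proc print, it applies only to that procedure call.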
Permanent formats
- A format statement added to a data step permanently connects the name of a format to a variable.
- The format name is stored in the dataset header.
- A format statement begins with the format keyword and ends with a semicolon (it is a SAS statement, after all).

User-defined formats
You can define your own format using the format procedure. Like all procedures, it begins with the keyword proc and ends with the run statement.

proc format syntax:
    proc format <options>;
        value formatname
            range1 = "formatted value1"
            ...
            rangen = "formatted valuen";
    run;

The value statement is used to define a format.

User-defined formats
- proc format defines the format yesno.
- The format statement applies the format to the variable death.

User-defined formats
1. Define the format using proc format.
2. Tell SAS to use the format with a specific variable by using the format statement, as before.

User-defined formats: Example
    proc format;
        value gen 1 = "male"
                  2 = "female";
        value age 10-29 = "10-29"
                  30-39 = "30-39"
                  40-49 = "40-49"
                  50-75 = "50-75";
        value $dpt "A" = "Dept A."
                   "B" = "Dept B.";
    run;

Defines three formats: gen, age, and $dpt. Format $dpt is a character format, suitable for character variables.
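The yesno format and the variable death are named on the slides, but the code that goes with them is not reproduced in this text. A minimal sketch of what such a program might look like follows. The 0/1 coding, the dataset name patients, and the data values are assumptions made for illustration.

```sas
/* Sketch only: the 0/1 coding and the toy data are assumptions. */
proc format;
    value yesno 0 = "No"
                1 = "Yes";
run;

data patients;
    input id death;
    format death yesno.;   /* in a data step: permanently attached to death */
    datalines;
1 0
2 1
3 0
;
run;

proc print data=patients;   /* death is displayed as No / Yes */
run;
```

Since the format name is stored with the dataset, the yesno format must be defined again (or stored in a format library) in any later session that uses patients.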
Format ranges
You can specify a range of values to be formatted in a given way:
    proc format;
        value age 10-29 = "10-29"
                  30-39 = "30-39"
                  40-49 = "40-49"
                  50-75 = "50-75";
    run;
- These ranges are inclusive.
- You can use formats as look-up tables to categorize a variable.

Specifying format ranges
    low      lowest value (excludes missing)
    high     highest value
    other    all other values not listed (including missing values)

    value1 -  value2    means [value1, value2]
    value1 -< value2    means [value1, value2)
    value1 <- value2    means (value1, value2]

The put function allows you to capture the formatted value in another variable.

Format names
- must be 8 or fewer characters long (the SAS 8 limit; SAS 9 and later allow longer names)
- cannot end with a number
- character formats begin with a $
- cannot use a SAS internal format name
- refer to a format in a format statement by using the name followed by a period

Descriptive statistics
- Exploratory data analysis is very important from many perspectives.
- In SAS there are three procedures used routinely:

    Procedure     Description
    freq          tables for categorical data
    univariate    descriptive statistics for numeric data
    means         descriptive statistics for numeric data

proc freq
- produces frequency counts and crosstabulation tables
- computes tests and measures of association

Syntax:
    proc freq <options>;
        tables requests / <options>;
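Returning to the format ranges and the put function described earlier in this section, the sketch below uses a format as a look-up table and stores the formatted value in a new character variable. The format name agegrp, the cut points, the dataset people, and the data values are all hypothetical.

```sas
/* Sketch only: agegrp, its cut points, and the toy data are made up. */
proc format;
    value agegrp low -< 30  = "under 30"      /* [lowest, 30) */
                 30  -< 50  = "30-49"         /* [30, 50)     */
                 50  - high = "50 and over"   /* [50, highest] */
                 other      = "missing";      /* catches missing values */
run;

data people;
    input id age;
    length age_group $ 12;
    age_group = put(age, agegrp.);   /* formatted value captured as text */
    datalines;
1 29.5
2 30
3 50
4 .
;
run;

proc print data=people;
run;
```

With these ranges, 29.5 falls in "under 30", 30 in "30-49", 50 in "50 and over", and the missing age is picked up by other.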
proc freq
    proc freq data=mydata;
        tables gender race chd;
        tables gender * chd / chisq relrisk;
    run;
- data=mydata is an option.
- chisq and relrisk are requests for statistics.

Example
The dataset sample has a variable gender. We would like to know what proportion of the sample data are male and what proportion are female.
    proc freq data=sample;
        tables gender;
    run;

Example data: NMES
The National Medical Expenditure Survey (1987). Examine smoking and gender.
    libname mylib "c:\sas2006\lecture4";

    proc format;
        value smoke 0 = "never"
                    1 = "current"
                    2 = "former";
        value gen 0 = "female"
                  1 = "male";
    run;
This makes two formats, smoke and gen, for the smoking and gender variables.

Example data: NMES
    proc freq data=mylib.nmes;
        tables male*smoke / chisq;
        format male gen. smoke smoke.;
    run;
Here mylib is a libname (folder) and nmes is the dataset.
proc univariate
- produces simple descriptive statistics
- use the plot option on the proc statement to add:
    - stem-and-leaf plot
    - box plot
    - normal probability plot (QQ plot)
    - side-by-side box plots for by-variable groups

Syntax:
    proc univariate <options>;
        var varlist / <options>;

proc univariate
    proc univariate data=mylib.nmes plot;
        title "Univariate Output for Age";
        var lastage;
    run;
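The side-by-side box plots for by-variable groups need a by statement, and by-group processing expects the data to be sorted by the by variable first. A minimal sketch follows, assuming the mylib libname and the gen format from the NMES example are already defined. The dataset name nmes_sorted is hypothetical.

```sas
/* Sketch only: assumes libname mylib and format gen. already exist. */
proc sort data=mylib.nmes out=nmes_sorted;   /* nmes_sorted is a made-up name */
    by male;
run;

proc univariate data=nmes_sorted plot;
    by male;            /* one set of statistics and plots per value of male */
    var lastage;
    format male gen.;
    title "Univariate Output for Age, by Sex";
run;

title;   /* clear the title */
```

How the plots are drawn depends on the SAS release and on whether ODS Graphics is in use, so the appearance may differ from the output shown in the lecture.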
proc means
- similar to univariate
- no plots
- nicer output, particularly for more than one variable

Syntax:
    proc means <options>;
        class varlist;
        var varlist / <options>;
        by varlist;
        output out=outdata <options>;

proc means
options:
    data=dataset
    statistic    default is: n mean std min max
                 others are: nmiss range median clm
    noprint      suppress printing of output
statements:
    class        statistics produced for each combination of class variables
    by           statistics produced for each combination of by variables
    output       produce an output dataset which contains the statistics

proc means
    proc means data=mylib.nmes noprint n mean std stderr range nmiss;
        class male;
        var lastage;
        output out=results n=nage mean=mage std=sage;
        format male gen.;
    run;

    proc print data=results;
    run;

In a number of previous Phase I and II studies of male, non-insulin-dependent diabetic (NIDD) patients conducted by a drug company, the mean body mass index (BMI) was found to be 28.4. An investigator has 17 male NIDD patients enrolled in a new study and wants to know if the BMI from this sample is consistent with previous findings.

    Patient Number    1      2      3      4      5      6      7      8
    Height (cm)       178    170    191    179    182    177    184    182
    Weight (kg)       101.7  97.1   114.2  101.9  93.1   108.1  85.0   89.1
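One way to approach the BMI question is a one-sample t-test against the historical mean of 28.4. The sketch below is not the lecture's own code. It uses only the eight patients listed above (the study has 17), computes BMI as weight in kg divided by the square of height in metres, and relies on the h0= option of proc ttest. The dataset and variable names are hypothetical.

```sas
/* Sketch only: 8 of the 17 patients; names are made up for illustration. */
data bmi;
    input patient height_cm weight_kg;
    bmi = weight_kg / (height_cm / 100)**2;   /* kg per square metre */
    datalines;
1 178 101.7
2 170 97.1
3 191 114.2
4 179 101.9
5 182 93.1
6 177 108.1
7 184 85.0
8 182 89.1
;
run;

proc ttest data=bmi h0=28.4;   /* null hypothesis: mean BMI = 28.4 */
    var bmi;
run;
```

With the full sample the same two steps apply; only the datalines change.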
proc ttest
Syntax:
    proc ttest <options>;
        var varlist;
        paired pairlist;
        by varlist;
        class varlist;

Two-sample paired t-test
A new compound, ABC-123, is being developed for long-term treatment of patients with chronic asthma. Asthmatic patients were enrolled in a double-blind study and randomized to receive daily oral doses of ABC-123 or a placebo for 6 weeks. The primary measurement of interest is the resting FEV1 (forced expiratory volume during the first second of expiration), which is measured before and at the end of the 6-week treatment period. Does administration of ABC-123 appear to have any effect on FEV1?

    Patient Number            1      2      3      4      5
    ABC-123                   Yes    Yes    Yes    Yes    Yes
    Baseline FEV1 (liters)    1.35   3.22   2.79   2.45   1.84
    Week 6 FEV1 (liters)      N/A    3.55   3.15   2.30   2.37
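A minimal sketch of the paired comparison, using only the five ABC-123 patients shown in the table above, might look like this. The dataset name fev and the variable names are hypothetical, and the N/A value is entered as a SAS missing value.

```sas
/* Sketch only: the five rows visible above; names are made up. */
data fev;
    input patient baseline week6;
    datalines;
1 1.35 .
2 3.22 3.55
3 2.79 3.15
4 2.45 2.30
5 1.84 2.37
;
run;

proc ttest data=fev;
    paired week6*baseline;   /* tests whether mean(week6 - baseline) is 0 */
run;
```

Patient 1 has no week 6 measurement and so contributes no difference. This looks only at the change within the ABC-123 patients listed here; comparing that change with the placebo group would need the placebo data and a class statement.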
Modeling with SAS
- examine relationships between variables
- estimate parameters and their standard errors
- calculate predicted values
- evaluate the fit or lack of fit of a model
- test hypotheses

The linear model

$$y = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + \cdots + \beta_k x_k + \varepsilon, \qquad \varepsilon \sim N(0, \sigma^2)$$

Here $y$ is the outcome and $x_1, \ldots, x_k$ make up the design. For example:

$$\text{Weight} = \beta_0 + \beta_1\,\text{Height} + \beta_2\,\text{Age} + \varepsilon$$

Note: the outcome variable must be continuous and normal, given the independent variables.
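For reference, the same model can be written in matrix form. This is a standard restatement rather than something taken from the slides, and it assumes the design matrix $X$ (a column of ones followed by the columns $x_1, \ldots, x_k$) has full column rank.

$$\mathbf{y} = X\boldsymbol{\beta} + \boldsymbol{\varepsilon}, \qquad \boldsymbol{\beta} = (\beta_0, \beta_1, \ldots, \beta_k)^\top, \qquad \boldsymbol{\varepsilon} \sim N(\mathbf{0}, \sigma^2 I)$$

The least squares estimate minimizes the residual sum of squares:

$$\hat{\boldsymbol{\beta}} = \arg\min_{\boldsymbol{\beta}} \, (\mathbf{y} - X\boldsymbol{\beta})^\top (\mathbf{y} - X\boldsymbol{\beta}) = (X^\top X)^{-1} X^\top \mathbf{y}$$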
The linear model with proc reg
- estimates parameters by least squares
- produces diagnostics to test model fit (e.g. scatter plots)
- tests hypotheses

    proc reg data=mydata;
        model weight = height age;
    run;

proc reg
Syntax:
    proc reg <options>;
        model response = effects </options>;
        plot yvariable*xvariable = symbol;
        by varlist;
        output <OUT=SAS data set> <output statistic list>;

proc reg
proc reg statement syntax:
    data = SAS data set name      input data set
    outest = SAS data set name    creates data set with parameter estimates
    simple                        prints simple statistics

proc reg: the model statement
    model response = <effects> </options>;
- required
- variables must be numeric
- many options
- can specify more than one model statement

    model weight = height age;
    model weight = height age / p clm cli;

proc reg: the plot statement
    plot yvariable*xvariable <=symbol> </options>;
- produces scatter plots, with yvariable on the vertical axis and xvariable on the horizontal axis
- can specify several plots
- optional symbol to mark points
- yvariable and xvariable can be variables specified in model statements or statistics available in the output statement

    plot weight * age / pred;
    plot r. * p. / vref = 0;

proc reg: some statistics available for plotting
    P.       predicted values
    R.       residuals
    L95.     lower 95% CI bound for individual prediction
    U95.     upper 95% CI bound for individual prediction
    L95M.    lower 95% CI bound for mean of dependent variable
    U95M.    upper 95% CI bound for mean of dependent variable

    plot weight * age / pred;
    plot r. * p. / vref = 0;
    plot (weight p. l95. u95.) * age / overlay;
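To try the model statement and its options without the lecture's mydata dataset, one possibility is the sashelp.class sample dataset that ships with SAS, which happens to contain weight, height, and age. Using it here is an assumption made for illustration, as are the output dataset name pvals_class and the variable names pred and resid.

```sas
/* Sketch only: sashelp.class stands in for the lecture's mydata. */
proc reg data=sashelp.class;
    model weight = height age / p clm cli;      /* predicted values and 95% limits */
    output out=pvals_class p=pred r=resid;      /* keep predictions and residuals */
run;
quit;

proc print data=pvals_class;
    var name weight height age pred resid;
run;
```

The p, clm, and cli options add predicted values, confidence limits for the mean, and limits for an individual prediction to the printed output.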
proc reg: the output statement
    output <OUT=SAS data set> keyword=names;
- creates a SAS data set
- all original variables are included
- keyword=names specifies the statistics to include

    output out=pvals p=pred r=resid;

NMES variables of interest:
    totalexp    total medical expenditure ($)
    chd5        indicator of CHD
    lastage     age at last interview
    male        sex of participant

proc reg example
Here:
1. model: estimate parameters etc.
2. plot: make three plots
3. output: make an output dataset regout

The run statement
Many people assume that the run statement ends a procedure such as proc reg, because when SAS encounters a run statement it executes any outstanding instructions in the program buffer. But run may or may not end the procedure.

    proc reg data=lecture4.nmes;
        model totalexp = chd5 lastage male;
        model totalexp = chd5 lastage;
        plot r.*chd5;
    quit;   /* ends the procedure */
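The three-step example (model, plot, output) is described above but its code is not reproduced in this text. The sketch below is one guess at its shape, followed by a second model submitted after run to show that the procedure is still active. The choice of plots and the names pred and resid are assumptions, and the mylib libname is assumed to be defined as in the proc freq example.

```sas
/* Sketch only: the three plots and the output variable names are guesses. */
proc reg data=mylib.nmes;
    /* 1. model: estimate parameters */
    model totalexp = chd5 lastage male;
    /* 2. plot: three plots */
    plot r.*p. / vref=0;
    plot r.*lastage;
    plot totalexp*lastage;
    /* 3. output: dataset regout */
    output out=regout p=pred r=resid;
run;      /* executes the statements so far; proc reg is still running */

    /* a second model, submitted to the same proc reg step */
    model totalexp = chd5 lastage;
run;
quit;     /* ends the procedure */
```

In newer SAS releases the plot statement may behave differently when ODS Graphics is enabled, so the plots may not match those in the lecture.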