PLS205 Lab 1
January 12, 2012

Laboratory Topics 1 & 2

Welcome, introduction, logistics, and organizational matters
Introduction to SAS
    Writing and running programs; saving results; checking for errors
    Different ways to input/import data
    Proc Means, Proc Univariate (testing for normality)
Introduction to SAS Enterprise Guide
    Inputting/importing data
    Saving output
    Modifying data
Hypothesis testing using Enterprise Guide
    t-test
Power calculations using Proc Power
Hypothesis testing using SAS editor
    t-test
    Proc Print, Proc Sort
Nifty SAS Program: Critical values generator
Niftier webpage
APPENDIX: Data input examples

Logistics and Organizational Matters

1. Homework is due at the beginning of lab, with 10 points off for every day it's late. If you don't submit it by the time the homework key is posted (usually 24 hours later), you will receive a zero.
2. Print the lab handouts before coming to lab; they will be posted on the class website each week by Wednesday night at the latest.
3. To log on to the lab computers, you need a UCD user ID and password.
4. Bring a diskette/flashdrive to lab to copy examples from the class directory (G:\PLS205\*.*).
5. This is a demanding class, so make use of all your resources: office hours, lab handouts, homework keys, each other (the 205 Buddy System).

Introduction to SAS (your new best friend?)

To open SAS Version 9.3:
    START > All Programs > Class Software > SAS > SAS 9.3 (English)

The SAS Display Manager

There are three basic windows, listed in the order you should view them:
1) The Program Editor window: Where you tell SAS what to do.
2) The Log window: Where SAS tells you what it did and (usually) what you did wrong.
3) The Output window: Where you find the results of your analysis (i.e. the good stuff).
Example 1    From ST&D p. 29    [Lab1ex1.sas]

    Data BirdCount;        * Creates a new data set called "BirdCount";
    Input Field Birds;     * Tells SAS the names of variables;
    Cards;                 * A throwback to the old days;
    1 210
    2 221
    3 218
    4 228
    5 220
    6 227
    7 223
    8 224
    9 192
    ;                      * SEMICOLON! SEMICOLON! SEMICOLON!;
    Proc Means mean var std stderr cv Data = BirdCount;
      Var Birds;           * Generate these requested statistics for the variable "Birds" in the dataset "BirdCount";
    Run;
    Quit;

Output

    Analysis Variable : Birds

                                                          Coeff of
           Mean       Variance       Std Dev    Std Error  Variation
    -----------------------------------------------------------------
    218.1111111    124.3611111    11.1517313    3.7172438  5.1128671
    -----------------------------------------------------------------

Things to Learn

1. Run (submit) a SAS program with a simple click on the running man icon.
2. Move between windows to scan for red-type errors (Log) and then view results (Output).
3. Clear the Log and Output windows with a simple click on the blank page icon.
4. Save the program to disk. From the Program Editor window: File > Save as.
5. Save the output to disk. From the Output window: File > Save as.
6. Set the line size for output to 76 characters (the perfect fit for 10 point Courier font on a page with 1" margins):
   Tools > Options > System > Log and procedure output control > SAS log and procedure output > double-click Linesize

Example 2    From ST&D pg. 30    [Lab1ex2.sas]

    Data Barley;
    Input Extract @@;      * @@ holds the line so SAS keeps reading values until the end of the line;
    Cards;
    77.7 76.0 76.9 74.6 74.7 76.5 74.2
    75.4 76.0 76.0 73.9 77.4 76.6 77.3
    ;
    Proc Univariate normal plot Data = Barley;
      Var Extract;         * Test for normality and generate plots for the variable Extract in the dataset Barley;
    Run;
    Quit;
Comments on the code

1. Use @@ in the Input statement when a data row holds more values than there are input variables.
2. The word "plot" in Proc Univariate is an example of an option. Its function is to generate several graphical displays of the data, including a stem-and-leaf display, a boxplot, and a normal probability plot (a.k.a. quantile-quantile or Q-Q plot) [see ST&D for interpretation of these displays: pages 30-32, 566-567].
3. The word "normal" in Proc Univariate is another option. Its function is to carry out tests for normality. In this class, we will be using the Shapiro-Wilk test for normality.

Output

    Variable: Extract

                            Moments
    N                         14    Sum Weights               14
    Mean              75.9428571    Sum Observations      1063.2
    Std Deviation      1.2270755    Variance          1.50571429
    Skewness          -0.2898702    Kurtosis          -1.0921714
    Uncorrected SS      80762.02    Corrected SS      19.5742857
    Coeff Variation   1.61578791    Std Error Mean    0.32794972

                 Basic Statistical Measures
        Location                   Variability
    Mean     75.94286     Std Deviation           1.22708
    Median   76.00000     Variance                1.50571
    Mode     76.00000     Range                   3.80000
                          Interquartile Range     2.20000

               Tests for Location: Mu0=0
    Test            -Statistic-     -----p Value------
    Student's t     t  231.5686     Pr > |t|    <.0001
    Sign            M         7     Pr >= |M|   0.0001
    Signed Rank     S      52.5     Pr >= |S|   0.0001

                    Tests for Normality
    Test                   --Statistic---     -----p Value------
    Shapiro-Wilk           W     0.945784     Pr < W      0.4974
    Kolmogorov-Smirnov     D     0.161429     Pr > D     >0.1500
    Cramer-von Mises       W-Sq  0.046718     Pr > W-Sq  >0.2500
    Anderson-Darling       A-Sq  0.297241     Pr > A-Sq  >0.2500
    Quantiles (Definition 5)
    Quantile       Estimate
    100% Max           77.7
    99%                77.7
    95%                77.7
    90%                77.4
    75% Q3             76.9
    50% Median         76.0
    25% Q1             74.7
    10%                74.2
    5%                 73.9
    1%                 73.9
    0% Min             73.9

            Extreme Observations
    ----Lowest----        ----Highest---
    Value      Obs        Value      Obs
     73.9       11         76.6       13
     74.2        7         76.9        3
     74.6        4         77.3       14
     74.7        5         77.4       12
     75.4        8         77.7        1

    Stem  Leaf    #
      77  7       1
      77  34      2
      76  569     3
      76  000     3
      75
      75  4       1
      74  67      2
      74  2       1
      73  9       1

    [Boxplot and Normal Probability Plot: character-graphics displays, not reproduced here. The probability plot's vertical axis runs from 73.75 to 77.75 and its horizontal axis from -2 to +2 normal scores.]

NOTE: The Shapiro-Wilk W statistic measures the linear correlation between the data and their normal scores. The closer W is to 1, the more closely the data follow a normal distribution. Normality is rejected when W is sufficiently smaller than one, that is, when the value Pr < W is less than 0.05. In this example, p = 0.4974 > 0.05, so we fail to reject the null hypothesis of normality: the data are consistent with a normal distribution.
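The Proc Univariate listing is long, and in this class only the Shapiro-Wilk line is needed for the normality decision. As a minimal sketch, an ODS Select statement placed before the procedure limits the printed output to the "Tests for Normality" table. The Barley data from Example 2 are re-entered here only so that the block runs on its own.

```sas
Data Barley;
  Input Extract @@;
Cards;
77.7 76.0 76.9 74.6 74.7 76.5 74.2
75.4 76.0 76.0 73.9 77.4 76.6 77.3
;
* Print only the Tests for Normality table from the next procedure;
ODS Select TestsForNormality;
Proc Univariate normal Data = Barley;
  Var Extract;
Run;
Quit;
```

The selection applies only to the procedure step that follows it, so later procedures print their full output again.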
Introduction to SAS Enterprise Guide

1. To open Enterprise Guide:
   Start Menu > SAS > Enterprise Guide 4.3

2. There are several ways to input data into Enterprise Guide:
   a. Import data that was run before in the SAS session: File > Open > Program > Lab1ex3
   b. Type it directly into the Enterprise Guide spreadsheet.
   c. Import an Excel file: File > Import Data (select Microsoft Excel Spreadsheet in the Files of Type menu) > Select the file you want to open.
   d. Import a delimited text file: Same as above but select Delimited File in the Files of Type menu.

3. To analyze the data and check for normality:
   a. Push the Run button
   b. Click on the Output Data tab
   c. Analyze > Capability > Q-Q Plot
   d. Choose "Extract" for analysis
   e. Distribution > Normal

4. To save or export the output:
   a. The easiest way is simply to copy the output or graph and paste it into your Word file: Edit > Copy Graph (or Copy to Program Editor if it's text; once in the program editor, highlight what you wish to export and copy and paste as normal).

5. To modify and add data:
   a. Edit > Mode (if Mode is not active, be sure all your Log and Output windows in SAS are clean and the data sheet is saved) > Edit
   b. Add one extreme value (Edit > Add rows) and observe the effects on the normality test. You can also add a row by placing the mouse at the beginning of a row > right click > Add rows
   c. To insert a computed column, click the Calculator icon.
   d. To add, delete, or duplicate columns, right-click the top of the column.
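The exercise in step 5b (add one extreme value and watch the normality test) can also be done in the SAS editor. The sketch below re-enters the Barley data with one extra observation; the value 90.0 is a made-up outlier chosen for illustration and is not part of the ST&D data.

```sas
* The last value (90.0) is a hypothetical outlier added for illustration;
Data BarleyOutlier;
  Input Extract @@;
Cards;
77.7 76.0 76.9 74.6 74.7 76.5 74.2
75.4 76.0 76.0 73.9 77.4 76.6 77.3
90.0
;
Proc Univariate normal Data = BarleyOutlier;
  Var Extract;
Run;
Quit;
```

Compare the Shapiro-Wilk p-value from this run with the 0.4974 obtained from the original 14 observations.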
Hypothesis Testing Using Enterprise Guide

In Enterprise Guide, with Barley data loaded:
    Analyze > ANOVA > t-tests > One Sample t-test for a Mean > choose Extract as the variable > H0: 78 > Confidence Level: select 95%

Output

     N       Mean    Std Dev    Std Err    Minimum    Maximum
    14    75.9429     1.2271     0.3279    73.9000    77.7000

       Mean       95% CL Mean       Std Dev     95% CL Std Dev
    75.9429    75.2344   76.6513     1.2271    0.8896   1.9769

    DF    t Value    Pr > |t|
    13      -6.27      <.0001

Confidence Interval

In Enterprise Guide, with Barley data loaded:
    Describe > Summary Statistics > under Data choose Extract as the variable > under Statistics--Additional, select Confidence limits of the mean

    Analysis Variable : Extract

                                                          Lower 95%      Upper 95%
          Mean      Std Dev      Minimum      Maximum   N   CL for Mean    CL for Mean
    75.9428571    1.2270755   73.9000000   77.7000000  14    75.2343648     76.6513494

Things to Notice

1. The t-test is highly significant (p < 0.0001), so we reject H0.
2. The 95% confidence interval of the mean is [75.23, 76.65]. See that the value 78 is far above the upper limit of this confidence interval. That is why the test is highly significant. In your ample spare time, try repeating the exercise using 75.234 (the lower extreme of the confidence interval) as the Null Mean. What p-value do you expect from the t-test?

Power calculation using SAS PROC POWER

One-sample power test. What is the power of a test to detect a difference between the observed mean of 75.94 and hypothesized (null) means of 78, 77, and 75.94 (the same value)?

    proc power;
      onesamplemeans
        mean     = 75.94
        ntotal   = 14
        stddev   = 1.23
        nullmean = 75.94 77 78
        alpha    = 0.05
        power    = .;
    run;

    The POWER Procedure
    One-sample t Test for Mean

    Fixed Scenario Elements
    Distribution               Normal
    Method                      Exact
    Alpha                        0.05
    Mean                        75.94
    Standard Deviation           1.23
    Total Sample Size              14
    Number of Sides                 2

    Computed Power
              Null
    Index     Mean     Power
        1     75.9     0.050
        2     77.0     0.846
        3     78.0     >.999

    (Index 1: the null mean equals the observed mean, so the two distributions coincide. One curve is on top of the other!)

Things to Notice

1. The period (.) after "power =" indicates that you are requesting the power.
2. The onesamplemeans statement is one line of code up to the semicolon (;). It is split across multiple lines to make it easier to read.
3. The power to detect a difference from a null mean of 77 is 0.846, and the power increases to almost 1 when the null mean is 78. The minimum value of the power equals alpha, which occurs when the null mean is the same as the observed mean. You generally want a power of at least 0.80 (80%). Notice that the 95% confidence interval of the mean, [75.23, 76.65], excludes both 77 and 78. The value 78 is far above the upper limit of this confidence interval, which is why the test is highly significant.

Proc Power can also be used to estimate the number of samples required to obtain a certain power:

    proc power;
      onesamplemeans
        mean     = 75.94
        ntotal   = .
        stddev   = 1.23
        nullmean = 77
        alpha    = 0.05
        power    = 0.80 0.90 0.95 0.99 0.846 0.845;
    run;
    The POWER Procedure
    One-sample t Test for Mean

    Fixed Scenario Elements
    Distribution               Normal
    Method                      Exact
    Null Mean                      77
    Alpha                        0.05
    Mean                        75.94
    Standard Deviation           1.23
    Number of Sides                 2

    Computed N Total
             Nominal    Actual        N
    Index      Power     Power    Total
        1      0.800     0.814       13
        2      0.900     0.915       17
        3      0.950     0.955       20
        4      0.990     0.991       27
        5      0.846     0.873       15
        6      0.845     0.846       14

When the required sample size is not a whole number, SAS rounds it up to the next integer, which guarantees at least the requested power.

Two-sample power test. What is the power of a test to detect a difference between two samples with the following means and variances?

                Mean    Variance    N
    Sample 1      90          13    6
    Sample 2      95          19    6

    Mean difference = 5
    Pooled s = SQRT((13 + 19)/2) = 4    (not the same as the average of the standard deviations)

    proc power;
      twosamplemeans
        test      = diff
        meandiff  = 5
        stddev    = 4
        npergroup = 6 10 20
        power     = .;
    run;

    The POWER Procedure
    Two-sample t Test for Mean Difference

    Fixed Scenario Elements
    Distribution               Normal
    Method                      Exact
    Mean Difference                 5
    Standard Deviation              4
    Number of Sides                 2
    Null Difference                 0
    Alpha                        0.05
    Computed Power
              N Per
    Index     Group     Power
        1         6     0.498
        2        10     0.753
        3        20     0.971

Hypothesis Testing Using SAS

To use Proc Univariate to do a t-test (e.g. testing if μ = xx), we must create:

    new variable = old variable - expected

In the following example, we will test the hypothesis that μ = 78 by creating a new variable Test78 = Extract - 78.0. We will then perform a t-test for the new variable against the hypothesis μ = 0 (see similar example ST&D pg. 96-97).

Example 3    [Lab1ex3.sas]

    Data Barley;
    Input Extract @@;
    Test78 = Extract - 78.0;   * Here is that new variable;
    Cards;
    77.7 76.0 76.9 74.6 74.7 76.5 74.2
    75.4 76.0 76.0 73.9 77.4 76.6 77.3
    ;
    Proc Print;                * Proc Print displays the inputted data, a nice check;
      Title 'Hypothesis mean = 78.0';
    Proc Univariate;
      Var Test78;              * Indicates we want to use the new variable Test78;
    Proc GChart;               * Proc GChart creates fancy charts in new windows;
      Hbar Test78;             * Hbar = horizontal bar. Could be vbar, pie, etc;
    Run;
    Quit;

Output

[Note: In your work, you would accompany this output with a line of interpretation.]

    Variable: Test78

               Tests for Location: Mu0=0
    Test            -Statistic-     -----p Value------
    Student's t     t  -6.27274     Pr > |t|    <.0001
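The same test can be run without creating a new variable, and the t statistic can be checked by hand. The sketch below is an alternative to the handout's Proc Univariate approach, not a replacement for it: Proc TTest accepts the null mean through its H0= option, and a short Data step recomputes t = (mean - 78) / (standard error) from the summary statistics printed in the Example 2 output. The data set name TCheck and its variable names are made up for this illustration.

```sas
Data Barley;
  Input Extract @@;
Cards;
77.7 76.0 76.9 74.6 74.7 76.5 74.2
75.4 76.0 76.0 73.9 77.4 76.6 77.3
;
* One-sample t-test of H0: mu = 78 on the original variable;
Proc TTest Data = Barley H0 = 78;
  Var Extract;
Run;

* Hand check using the mean and standard error from the Proc Univariate output;
Data TCheck;
  Mean   = 75.9428571;
  StdErr = 0.32794972;
  DF     = 13;
  T      = (Mean - 78) / StdErr;           * Should be close to -6.27;
  P      = 2 * (1 - PROBT(ABS(T), DF));    * Two-tailed p-value;
Proc Print Data = TCheck;
Run;
Quit;
```

PROBT returns the area to the left of its argument, so the two-tailed p-value doubles the area to the right of |t|.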
Example 4    [Lab1ex4.sas]

This next example illustrates the use of Proc Sort, Proc Print, and Proc Means:

    Data Grades;
    Input StudentNo GradUG $ HWGrade Midterm Final;   * $ indicates a non-numeric class variable;
    FinalGrade = 0.25*HWGrade + 0.35*Midterm + 0.40*Final;
    Cards;
    13 G 92 84 89
    9 G 85 65 80
    47 G 90 81 92
    21 UG 82 73 86
    60 G 94 96 98
    4 UG 89 82 90
    ;
    Proc Sort;                 * Orders the data by the variable named below;
      By StudentNo;
    Proc Print;                * Displays the inputted data in whatever order you wish;
      Title 'Roster in order of Student Number';
      ID StudentNo;
      Var HWGrade Midterm Final FinalGrade;
    Proc Means n mean std var stderr maxdec=1;   * MaxDec limits all numbers to 1 decimal place;
      Title 'Descriptive statistics';
      Var HWGrade Midterm Final FinalGrade;
    Proc Sort;
      By GradUG;               * Sorting is needed because of the Proc Means below;
    Proc Means n mean std var stderr maxdec=1;
      Title 'Descriptive statistics by student level';
      Var HWGrade Midterm Final FinalGrade;
      By GradUG;               * Without Proc Sort above, this would confuse SAS;
    Proc Plot;
      Plot Final*FinalGrade;   * Generates plot of Final (y) vs. FinalGrade (x);
    Run;
    Quit;

Note: If you add a title to one Proc statement but not to the others, all the Proc outputs will have the same label. In fact, titles will carry over to future programs! To avoid confusion, you should label everything, especially as your programs become more complicated and the output more profuse.
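Two small variations on Example 4 may be useful; both are sketches rather than part of the original program. First, Proc Means accepts a Class statement, which groups the output without requiring the data to be sorted beforehand. Second, a Title statement with no text clears the current titles, so they do not carry over to later output.

```sas
Data Grades;
  Input StudentNo GradUG $ HWGrade Midterm Final;
  FinalGrade = 0.25*HWGrade + 0.35*Midterm + 0.40*Final;
Cards;
13 G 92 84 89
9 G 85 65 80
47 G 90 81 92
21 UG 82 73 86
60 G 94 96 98
4 UG 89 82 90
;
Proc Means n mean std maxdec=1 Data = Grades;
  Title 'Descriptive statistics by student level';
  Class GradUG;      * Class groups the output and needs no prior Proc Sort;
  Var HWGrade Midterm Final FinalGrade;
Run;
Title;               * A Title statement with no text clears all titles;
Quit;
```

The By statement used in Example 4 prints a separate table for each group, while Class puts the groups in a single table; which is easier to read depends on how many variables and groups you have.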
Nifty SAS Program    [SASCritValues.sas]

Tables of critical values rarely contain the exact values you are looking for. Here's a way to use SAS to find critical values and p-values with precision:

    Data ValueFinder;
    TITLE 'CRITICAL VALUES';
    * The functions below find the critical value for a specified probability 'p';
    * where 'p' is the proportion of the area to the **LEFT** of the critical value;
    * [e.g. 0.975 will be the 'p' for a 5% two-tailed test];
    Nvalue   = PROBIT (0.975);         * This is Z;
    Tvalue   = TINV (0.975, 20);       * This is t (p, df);
    Chivalue = CINV (0.975, 20);       * This is chi-square (p, df);
    Fvalue   = FINV (0.975, 20, 4);    * This is F (p, NUM df, DEN df);
    TITLE 'PROBABILITY';
    * These functions return the probability that an observation is < x;
    Nprob   = PROBNORM (1.96);         * Z;
    Tprob   = PROBT (2.086, 20);       * t;
    Chiprob = PROBCHI (34.2, 20);      * chi-square;
    Fprob   = PROBF (8.56, 20, 4);     * F;
    Proc Print;
    Run;
    Quit;

Very, very handy; but if you use this, please be aware of what SAS is telling you, namely that it is the areas to the LEFT of the critical values that are being considered. Double-check your results with a table until you get the hang of it.

Niftier Website

There are a lot of free critical values calculators available online as well. Feel free to use them, but be sure you understand how they work. The best way to do this is by checking some test values against the tables in the book (or on the class webpage). A good site:

    http://www.graphpad.com/quickcalcs/distmenu.cfm

Caution: Be aware of what these calculators are telling you, namely whether it is the areas to the LEFT or RIGHT of the critical values that are being considered. Double-check your results with a table until you get the hang of it.

APPENDIX: Data Input Examples

Students lose a shocking number of points on homeworks and exams due to incorrect data input (i.e. careless typographical errors).
Very rarely should you ever have to input data number-by-number because almost all the datasets will be provided to you already typed into Word documents. The challenge you have is to structure your data input routine in SAS such that it will read correctly whatever you cut-and-paste into your code. The "Do-End loops" illustrated below may look complicated, but it is worth your time to understand how they work, especially as our data sets become bigger and bigger.

Example dataset 1: 5 treatments with 5 replications each
    A 3.08 5.51 5.07 4.41 3.85
    B 3.30 3.19 4.29 1.87 1.32
    C 5.73 5.18 5.06 3.96 3.74
    D 1.87 3.30 2.64 3.08 3.85
    E 2.25 4.78 3.13 2.91 2.58

Possible SAS data entry code:

    Data Example1;
    Input Treatment $ @@;
    Do Replication = 1 to 5;
      Input Response @@;
      Output;
    End;
    Cards;
    A 3.08 5.51 5.07 4.41 3.85
    B 3.30 3.19 4.29 1.87 1.32
    C 5.73 5.18 5.06 3.96 3.74
    D 1.87 3.30 2.64 3.08 3.85
    E 2.25 4.78 3.13 2.91 2.58
    ;

If this is scary, you can also paste the above table into Excel and manipulate it (again, by cutting and pasting and transposing, not by retyping numbers) to give you something like this:

    A 3.08
    A 5.51
    A 5.07
    A 4.41
    A 3.85
    B 3.3
    B 3.19
    B 4.29
    B 1.87
    B 1.32
    C 5.73
    C 5.18
    C 5.06
    C 3.96
    C 3.74
    D 1.87
    D 3.3
    D 2.64
    D 3.08
    D 3.85
    E 2.25
    E 4.78
    E 3.13
    E 2.91
    E 2.58
Once you are here, the SAS code is straightforward:

    Data Example1;
    Input Treatment $ Response;
    Cards;
    A 3.08
    A 5.51
    ...
    E 2.91
    E 2.58
    ;

The two approaches are equivalent, but as the data sets become bigger, the Excel manipulations needed for the second approach will become more and more cumbersome.

Example data set 2: Combinations of treatments with 10 replications each

    Trt1   Trt2
    A      A      131 109 133 136 142 126 150 142 167 145
    A      B      136 103  78 154 132 122 114 107 120 127
    A      C      101 142 164 144 113 149 139 121 162 154
    B      A       68 114 101 120 113 134 147  97  85 114
    B      B      149 132 134 111 136 103 103  92 124  80
    B      C      125 106 125  89 120  71 100 125 132 108

Possible SAS data entry code:

    Data Example2;
    Do Trt1 = 1 to 2;
      Do Trt2 = 1 to 3;
        Do Rep = 1 to 10;
          Input Response @@;
          Output;
        End;
      End;
    End;
    Cards;
    131 109 133 136 142 126 150 142 167 145
    136 103 78 154 132 122 114 107 120 127
    101 142 164 144 113 149 139 121 162 154
    68 114 101 120 113 134 147 97 85 114
    149 132 134 111 136 103 103 92 124 80
    125 106 125 89 120 71 100 125 132 108
    ;

Here we've set up the input routine in such a way that we could just cut-and-paste the data table into SAS. No chance for typographical errors.
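Cut-and-paste removes typing errors, but it is still worth confirming that the loops assigned the values to the cells you intended. One possible check, sketched below, is to count the observations in each Trt1 by Trt2 cell with Proc Freq and to print the first few rows for comparison against the table. The program assumes the loop order shown above, with Trt1 as the outermost loop.

```sas
Data Example2;
  Do Trt1 = 1 to 2;
    Do Trt2 = 1 to 3;
      Do Rep = 1 to 10;
        Input Response @@;
        Output;
      End;
    End;
  End;
Cards;
131 109 133 136 142 126 150 142 167 145
136 103 78 154 132 122 114 107 120 127
101 142 164 144 113 149 139 121 162 154
68 114 101 120 113 134 147 97 85 114
149 132 134 111 136 103 103 92 124 80
125 106 125 89 120 71 100 125 132 108
;
* Each of the six cells should show a count of 10;
Proc Freq Data = Example2;
  Tables Trt1*Trt2 / norow nocol nopercent;
Run;
* Compare the first rows against the data table;
Proc Print Data = Example2 (obs = 12);
Run;
Quit;
```

If a value is missing or an extra one has crept in, the cell counts will still be 10 but the Log will report that SAS ran out of data or left values unread, so check the Log as well.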
Example data set 3: Each data point identified by four classification variables

                    C1               C2               C3               C4
                D1   D2   D3     D1   D2   D3     D1   D2   D3     D1   D2   D3
    A1   B1    121  121  116    107  104  110    119  116  108     92  101  121
         B2    123  131  125    113  138  119    118  116  118    113  107  123
         B3    107  160  160    129  114  107    119  114  107    131  103   86
         B4    123  119  127    129  131  100    121   99  111    105   92  108
    A2   B1    123  118  138    151  104  127    108  118  108    136  116  114
         B2    129  131  140    157  127  133    119  121   99    123  100   95
         B3    131  129  131    136  143  121    131  131  108    131  127  110
         B4    131  131  129    151  131  118    118  114  119    125  127   90

Possible SAS data entry code:

    Data Example3;
    Do ClassA = 1 to 2;
      Do ClassB = 1 to 4;
        Do ClassC = 1 to 4;
          Do ClassD = 1 to 3;
            Input Response @@;
            Output;
          End;
        End;
      End;
    End;
    Cards;
    121 121 116 107 104 110 119 116 108 92 101 121
    123 131 125 113 138 119 118 116 118 113 107 123
    107 160 160 129 114 107 119 114 107 131 103 86
    123 119 127 129 131 100 121 99 111 105 92 108
    123 118 138 151 104 127 108 118 108 136 116 114
    129 131 140 157 127 133 119 121 99 123 100 95
    131 129 131 136 143 121 131 131 108 131 127 110
    131 131 129 151 131 118 118 114 119 125 127 90
    ;

Voila! Without the Do-End loops, the same dataset would be five times as large because you would have to input the individual classification address for each and every data point (e.g. A2, B3, C2, D1). Again, this may seem unnecessary to you now; but please take the time to learn it. And if you have any questions, just ask.
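For comparison, here is a sketch of what the input would look like without Do-End loops, where every data point carries its full classification address. Only the first four of the 96 observations are shown. The data set name Example3Long is made up for this illustration, and the assignment of values to cells assumes the table is read left to right, starting with row B1 of block A1.

```sas
Data Example3Long;
  Input ClassA $ ClassB $ ClassC $ ClassD $ Response;
Cards;
A1 B1 C1 D1 121
A1 B1 C1 D2 121
A1 B1 C1 D3 116
A1 B1 C2 D1 107
;
Proc Print Data = Example3Long;
Run;
Quit;
```

Each line holds five fields instead of one, which is where the "five times as large" comes from.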