2) familiarize you with a variety of comparative statistics biologists use to evaluate results of experiments;


Daniela McCoy
5 years ago

Biology 164 Laboratory
Using Comparative Statistics in Biology

A. Goals of Exercise

"Statistics" is a mathematical tool for analyzing and making generalizations about a population from a number of individual observations. Statistics provides a shorthand way of describing and comparing all sorts of scientific data, so that we can confidently draw conclusions based on quantitative information. In using comparative statistics we are 1) trying to evaluate goodness of fit between observed and expected numbers (e.g., Chi-square test), or 2) trying to determine if there is a statistically significant difference between experimental groups (e.g., t-test, Wilcoxon rank-sum test, etc.).

The purpose of this exercise is to 1) introduce you to some of the basic concepts of probability, statistics, and hypothesis testing; 2) familiarize you with a variety of comparative statistics biologists use to evaluate results of experiments; 3) provide an opportunity for you to use the statistical software package, Stata; and 4) use your statistical knowledge to evaluate the results of an experiment to determine the effects of light intensity on three growth characteristics of Brassica rapa.

B. Introduction to Comparative Statistics

Samples and Populations

A sample consists of a set of observations drawn from a larger set, the population, about which information is desired. For example, a campus survey might not include every student at Colby, but to be more manageable would most likely involve a sample of the total student body chosen at random, say, a few hundred students, depending upon resources available for conducting the survey. The object of statistics is to make inferences about a population based on information gathered from a sample.

Variability and Experimental Design

The design of an experiment is essentially a sequence of steps, written prior to the start of the experiment, that helps ensure that adequate and appropriate data will be gathered to provide an objective and repeatable analysis of the question at hand. Several experimental designs may be possible, and the best design will result in the greatest amount of useful information for the amount of time, money, and other limitations involved. With careful use of controls, a good experimental design allows variation resulting from a biological effect to be distinguished from background natural chance variation. Statistics is the mathematical tool used to make the distinction between chance variation and the variation that the study was designed to investigate.

Replication and Randomness

One of the things to consider when designing an experiment is the variation associated with the observations. Each observation within a sample will be different from every other observation to some degree. Therefore, it is logical to have replicates of each observation, and of the experimental treatments that will be applied. In other words, samples must be large enough to adequately account for the variability among observations within each sample, and there should be more than one sample to which a treatment is applied to account for the natural variability among samples from the same population. These concepts are generally addressed by including replication in the experimental design.

Another issue that needs to be addressed when designing an experiment is that of randomness. Human nature will lend some degree of bias to the selection of observations that are included in a sample unless steps are taken to ensure that observational units are selected at random from the original
Comparative Statistics in Biology Page 1

population. The idea is to have as broad a scope of inference as possible by selecting a sample that most accurately reflects the amount of variability found in the original population. Obviously, sample size is also important: we wouldn't want to base conclusions on only a handful of observations. But one must also be careful to be unbiased in sampling, i.e., not sampling from only the most conspicuous plants or the most outgoing individuals.

Hypothesis Testing

A hypothesis is a statement of belief, which may or may not be true, about a population. The test of a hypothesis is a comparison using objectively collected facts. If these facts are in agreement with the stated belief, the hypothesis is accepted. If these facts do not agree with the statement of belief, the hypothesis is rejected. The hypothesis that is tested in a statistical analysis is called the null (no difference) hypothesis, and its opposite is called the alternative hypothesis. These two hypotheses cover all the possible outcomes of an experiment, and since they are opposites of one another, only one of them will be supported by the statistical test performed.

All statistical analyses are designed to calculate a probability, the p-value: the probability of obtaining results at least as extreme as those observed if the null hypothesis were true. If the p-value is greater than 0.05 (5%), we conclude that the variation seen in the study can be attributed to chance natural variability. By convention the null hypothesis is rejected only if the p-value is equal to or less than 0.05. When the null hypothesis has been rejected, the alternative hypothesis is accepted, meaning that there is a statistically significant difference in the values being compared.

An important point to note about using statistics for hypothesis testing is that a statistical analysis is never able to prove or disprove a hypothesis in the mathematical sense.
The reason for this is that all statistical analyses are based on only a portion (i.e., a sample) of the population, not the entire population itself. It is important to remember that statistics are only used to support or refute hypotheses, not prove them. For example, if one wanted to prove or disprove that brown was the most common color of M&M's, it would not be sufficient to statistically analyze a sample represented by a number of packages purchased at your local store. What would be required to prove such a fact? Ask your instructor if the answer is not obvious. Thinking that statistics actually prove something is one of the greatest misconceptions that students have about using statistics!!

Types of Variables

In experimental design, the experimenter manipulates one of the variables of a study to determine whether the variable affects the outcome of the study. The variable so designated is known as the independent variable. The other measured variables in the study are considered to be the dependent variables. For example, in an experiment to determine how differing brands of fertilizer affect the yield of corn, the brand of fertilizer represents the independent variable while the corn yield that is measured in the study represents the dependent variable.

In statistics, variables may be either qualitative (also called discrete or grouping) or quantitative (also called continuous). When variables are classified according to some attribute or character they possess, rather than by a numerical measurement, they are considered to be qualitative because their values fall into one or more discrete categories. Some authors refer to qualitative variables as nominal variables, since they often represent names of things. Qualitative variables correspond to the independent variable of an experiment. In contrast, a continuous variable is one that can, in theory, take on any value that can be expressed numerically.
Quantitative variables correspond to the dependent variable.

Frequency Distributions

For many biological populations a typical shape for a frequency distribution of the values of a continuous variable is that of a bell-shaped curve, or normal distribution. In fact, a great many continuous variables are normally distributed, and if the shape of the frequency distribution is unknown, it is often assumed to be "normal", especially if we have only a small sample from which to estimate the true population frequency distribution. The characteristic features of a normal distribution are a central peak in the frequency distribution and symmetric variation on either side of the peak. When a population has a normal distribution, inferences about the population can be made using certain parameters of the sample as a mathematical guide.

Parametric and Non-Parametric Statistics

Two statistical parameters that are commonly calculated for a sample are 1) the mean (x̄), a measure of central tendency, and 2) the standard deviation (SD), a measure of the dispersion of the sample on either side of the mean. When the mean and standard deviation have been calculated for a sample, they provide an estimate of the shape of the frequency distribution for the population. For a normal distribution, x̄ ± 1 SD includes approximately 68% of the population, x̄ ± 2 SD includes approximately 95% of the population, and x̄ ± 3 SD includes approximately 99.7% of the population.

Another dispersion parameter commonly measured for a normal distribution is the standard error of the mean, often simply called the standard error (SE). The standard error is calculated by dividing the sample standard deviation by the square root of the sample size, N. The standard error is useful for predicting the value of the population mean in relation to the sample mean. There is approximately a 68% probability that the population mean lies within 1 SE of the sample mean, a 95% probability that it lies within 2 SE of the sample mean, and a 99.7% probability that it lies within 3 SE of the sample mean.

All parametric statistical tests require that the population have a normal distribution, and that the distribution is reasonably symmetrical. If it is suspected that a population does not meet the criteria of a normal distribution, sample parameters like mean and standard deviation are useless for making inferences about the population. A different approach, known as non-parametric statistics, is used that does not assume a normal distribution for the population. Non-parametrics employ alternative approaches such as comparing sequentially ranked observations made from different groups.
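
As a minimal sketch of the mean, standard deviation, and standard error calculations described above, the following Python snippet works through them on a small sample. The function name `describe` and the plant heights are invented for illustration; they are not part of the lab dataset, and Stata reports the same quantities through its own menus.

```python
import math
import statistics

def describe(sample):
    """Return (mean, SD, SE) for a list of numbers."""
    n = len(sample)
    mean = statistics.mean(sample)
    sd = statistics.stdev(sample)   # sample SD (uses the N - 1 denominator)
    se = sd / math.sqrt(n)          # SE = SD / square root of N
    return mean, sd, se

# Hypothetical plant heights (cm), invented for illustration only
heights = [10.0, 12.0, 14.0, 16.0, 18.0]
mean, sd, se = describe(heights)
print(f"mean = {mean:.2f}, SD = {sd:.2f}, SE = {se:.2f}")
```

Note that the SE shrinks as the sample size grows, even when the SD stays the same, which is why larger samples give a tighter estimate of the population mean.
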
In general, for each parametric statistical test there is a corresponding non-parametric equivalent that can be used when the distribution of the population is unknown or is expected to have a distribution other than a normal distribution. Very commonly count data or proportion data (e.g., decimal values falling between 0 and 1) do not follow a normal distribution, and hence should be analyzed using non-parametrics.

C. Guide to Using Stata

Moving Datasets from Excel to Stata

Stata is a powerful software package, but its user interface can be a bit intimidating to first-time users, especially regarding data entry. Fortunately, data files created in Excel can be easily copied and pasted into the Stata data editor. Thus, in Biology lab settings, we generally create and manage data files using Excel, and then transfer the data to Stata when we need to perform statistical analyses.

Sorting Data in Excel

It is occasionally useful to sort data. Sorting is particularly useful when the data file contains measurements from two or more experimental treatments and you would like to generate summary descriptive statistics from each of the experimental treatments separately. To sort data, locate the Sort command under the Data menu. Excel allows you to sort using any of the variables as sorting keys. For example, if you choose your experimental treatment variable as the sorting key, the data will be sorted such that all the measurements of a given experimental treatment will be grouped together. Nested sorting can be accomplished by selecting more than one sorting key. Nested sorting is useful if you want to sort data by experimental treatment, having the measurements within a treatment ranked from lowest to highest values. To perform this type of nested sort, you would select the experimental treatment variable as the first key, and the measured variable as the second key.

Preparing Data for Use in Stata

Stata is particular with regard to how your data are organized.
If you remember the earlier section on Types of Variables, you should realize that each experimental observation consists of a measured variable (the dependent variable) and a corresponding independent variable (the variable manipulated by the experimenter). For instance, if you have records of fish body length for two different ponds, one column would be designated for the body length measurements (the dependent variable), while the other column would be designated for the ponds from which the fish were obtained, Johnson and Messalonskee (the independent variable).
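
The two-column layout described above (one column of measurements, one column naming the group) can be sketched in Python as follows. This assumes the measurements were first recorded with one column per pond; the function name `wide_to_long` and the fish lengths are hypothetical and chosen only to illustrate the rearrangement, which in this lab would normally be done by hand in Excel.

```python
def wide_to_long(columns):
    """Turn {group_name: [values]} into a list of (value, group_name) rows."""
    rows = []
    for group, values in columns.items():
        for value in values:
            rows.append((value, group))
    return rows

# Hypothetical fish lengths (cm); the numbers are invented for illustration
wide = {
    "Johnson": [12.1, 13.4, 11.8],
    "Messalonskee": [14.0, 15.2],
}

long_rows = wide_to_long(wide)

# Print one observation per row: dependent variable, then independent variable
print("length\tpond")
for length, pond in long_rows:
    print(f"{length}\t{pond}")
```

Each printed row is one observation, so the groups do not need to contain the same number of measurements.
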

It is generally easier to get the data into the correct format in Excel before moving it to Stata. Using the above fish example, a researcher may have found it easiest to record the fish length measurements using one column for Johnson Pond and one column for Messalonskee Pond. Before moving to Stata, however, it is best to rearrange the data so that there is a single column for all the fish lengths, with a second corresponding column that indicates from which pond each measurement came.

Stata Windows

When you launch Stata you should see something similar to the following. Note that Stata has four basic windows:

1. The Stata Results window will display all the commands issued to Stata, as well as all the output resulting from these commands (with the exception of graphs).
2. The Stata Command window is used to type commands directly into Stata. Most of what you will be doing will be done through the drop-down menus. One key command, drop _all, can be used to clear the data editor spreadsheet.
3. The Review window keeps a running history of all the commands that you have issued to Stata during your session. To recall a command, simply click on it in this window.
4. The Variables window shows you all the variables that you currently have in your dataset.

If you should ever lose a window, click on Window and choose the window name that you want to appear. There are other windows in Stata, but the only other one you're likely to encounter is the Graph window. This window pops up anytime you generate a graph. You can right-click on a graph for printing and copying options. Graphs can be copied in Stata and pasted into any Microsoft Office application (such as Word).

Oftentimes you want to keep a running history of all your commands and the output from those commands. You can do this by using a log file. To begin a log file click on the button and choose where you would like to save the file (use the .log extension rather than the .smcl extension).
This file is a text file and can be opened in Notepad or any word processing package.


Copying an Excel Spreadsheet into Stata

Reading data from an Excel spreadsheet is straightforward. Make sure that the variables are in the columns with each variable name at the top of each column. As mentioned above, make sure that for each dependent variable column there is a corresponding independent variable column. Otherwise, Stata will not be able to use your data!! Highlight the data including the first row of variable names and copy to the clipboard. Open the Data Editor in Stata by clicking on the button. You will see an empty spreadsheet (if not, then you already have a Stata file open). Click on the upper-left cell and select paste. You can click on the x in the upper-left corner to exit. The data will be preserved in temporary memory, but will not be saved until you choose File > Save As from the menu and enter the name under which you wish to save it. You cannot run any statistical tests unless you close the Data Editor window first.

Entering Data in Stata by Hand (SKIP TO "THE EXAMPLE DATASET" IF YOU'RE USING EXCEL DATAFILES)

Open the data editor in Stata by clicking on the button. Click on the upper-left cell of the empty spreadsheet and begin entering the data by putting each variable in its own column (each row is a separate observation). Stata names the variables var1, var2, var3, etc. by default. To give the variables more meaningful names, double-click on the column you want to change and the Variable Properties box will appear. You can also enter a longer, more descriptive label for the variable in the box. Note that Stata also decides what format the variable should be in. This can take on many different forms and it is usually best to let it default.

Changing Data Values Once They Are Entered

If you want to change any numbers that are in your data set, open the editor and click on the cell you want to change. Simply enter the new number and press enter. When you close the editor window, Stata will ask if you want to preserve the changes.
If you click accept changes, your change(s) will be applied. At this point, only the data in the local memory have been changed. If you want to save the data set to the disk then you must select File > Save As. If you click discard changes, no changes will be made.

Labeling Categorical Variables

If you have one or more categorical variables (variables that take only a finite number of distinct categories) you may want to label the values in them. For example, if your dataset includes a variable called treetype that can take on two values (1 for deciduous and 2 for coniferous), then you will see a series of 1s and 2s in the data set. To give these numbers meaningful labels you can select Data > Labels > Label values > Define or modify label values. Click on Define to define a new label and give the label a name. A new box will pop up asking for you to give the numerical value of one of the categories. It will also ask you to give the label you wish to assign to that numerical value. Keep entering labels until you have exhausted all the categories, then click cancel. To actually assign the labels you just created to a variable, select Data > Labels > Label values > Assign value labels to variable. Select the label name you just created and the variable you wish to apply them to and hit enter.

The Example Dataset

For this exercise you will use the Brassica Light Study.xls dataset. Open the data file and examine the data. These data were generated by Intro Bio students who investigated the effects of growing plants for three weeks at two different light levels. After the growth period, the students measured plant height, plant weight, and the number of petiole trichomes produced on the first true leaf of each plant. Note that the first column corresponds to the initials of students entering the data, the second column corresponds to the two light levels used in the study, third column = plant height, fourth column = plant weight, and fifth column = # of petiole trichomes.

Exercise Question #1: Which are the independent and dependent variables for this study?

Examining the Frequency Distribution of the Data

In order to decide whether parametric or non-parametric statistics are appropriate for a comparative statistical analysis, the frequency distribution of the sample must be examined. Copy the data from Brassica Light Study.xls into Stata's Data Editor, preserve it, then close the Data Editor. From the Graphics menu select Histogram. The following dialog box will appear:

Note that the histogram dialog window consists of an upper tab bar, corresponding to a number of sub-windows. Note how the Main sub-window is highlighted. We'll start with the Main window and move to other sub-windows as necessary. In the Variable box, select the particular variable for which you want to generate a frequency distribution. The variable chosen must be a dependent variable, and you can only choose one variable at a time. From the Y axis box, select the Frequency button. Take a look at the Bins box for a moment. Bins refer to the size of each interval of values on the X axis of the frequency distribution. Stata uses a default of 10 bins, and determines the size of each interval (bin) by dividing the range of values in the distribution by 10.
We will go with 10 bins for now, but note that if you ever want to change the size and number of bins, this is where to do it. If we generated the frequency histogram without setting further criteria, the distribution would include all of the measurements of the study for the particular variable chosen. However, we would like to see separate frequency distributions for each of the conditions of the independent variable. Thus we need to set further criteria, and for that we need to use the By sub-window.
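
The binning idea above (divide the range of values by the number of bins, then count the observations falling in each interval, separately for each group) can be sketched in Python. This is only an illustration under stated assumptions: the function names and the data are hypothetical, each group is binned over its own range, and Stata's exact handling of bin boundaries may differ.

```python
def bin_counts(values, n_bins=10):
    """Count values in n_bins equal-width intervals spanning min..max."""
    low, high = min(values), max(values)
    if high == low:
        raise ValueError("all values are identical; cannot form bins")
    width = (high - low) / n_bins   # bin size = range / number of bins
    counts = [0] * n_bins
    for v in values:
        # The maximum value is placed in the last bin rather than a new one
        index = min(int((v - low) / width), n_bins - 1)
        counts[index] += 1
    return width, counts

def histograms_by_group(rows, n_bins=10):
    """Group (value, group) rows, then bin each group separately."""
    groups = {}
    for value, group in rows:
        groups.setdefault(group, []).append(value)
    return {g: bin_counts(vals, n_bins) for g, vals in groups.items()}

# Hypothetical measurements at two light levels, invented for illustration
rows = [(2.0, "low"), (4.0, "low"), (12.0, "low"),
        (1.0, "high"), (3.0, "high"), (11.0, "high")]

result = histograms_by_group(rows)
for group, (width, counts) in result.items():
    print(group, "bin width:", width, "counts:", counts)
```
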


Select the By sub-window, and the following dialog box will appear:

Click the Draw subgraphs for unique values of variables box, and for Variables select the data column that contains the independent variable. We will not make any changes to other sub-windows, but take a look at the Y axis and X axis sub-windows. These are the sub-windows in which you can control the scale and other properties of the axes. When finished click the OK button, and a histogram of the frequency distribution for the dependent variable chosen will be displayed. If you would like to save the histogram for inserting into a Word document, choose Save As... and use the .png format. The saved histogram can then be inserted as a picture from file into your Word document.

Do either of the frequency distributions shown in the histogram resemble a normal distribution? The distribution does not have to be a perfect bell-shaped curve, but there should be some suggestion of central tendency, and some suggestion of symmetry of the tails. If one can make a reasonable case upon examination that the population is normally distributed, parametric tests should be used because of the greater statistical power they offer. Otherwise non-parametrics must be used.

Go back to the Graphics menu and create histograms for the other two dependent variables in the study. The only changes you need to make are in the Variable box of the Main sub-window, since all previous selections are retained.

Exercise Question #2: For which of the three growth characters measured can you make a reasonable case for a normal distribution? Do any of the growth characters exhibit an obvious non-normal distribution?

Exercise Question #3: Can the same comparative statistical test be used to analyze all three of the growth characters? Explain your reasoning.


Parametric Tests

t-test: Comparing Two Sample Groups

The unpaired t-test is a test of the null hypothesis (i.e., no difference between means) used to compare one sample mean to another sample mean. The "t" refers to the "t-distribution," which is a specific normal-type distribution. We allow for three possibilities with an unpaired t-test. The test is two-tailed if one sample mean could theoretically be smaller or larger than the other sample mean to which it is being compared. In a two-tailed t-test we test for no difference between sample means (the null hypothesis). We use a one-tailed test if we are only concerned about the sample mean deviating in one direction. Thus, with one-tailed t-tests two possibilities are tested for: sample mean A is larger than sample mean B, and sample mean A is smaller than sample mean B.

To begin a t-test go to the Statistics menu, and select Summaries, tables & tests, then select Classical tests of hypotheses, and then Two group mean comparison test. You should see the following dialog box:

Select a dependent variable for the Variable Name box, and the independent variable (sometimes referred to as the "grouping variable") for the Group variable name box. To get more accurate results check the boxes for Unequal variances and Welch's approximation, then click OK. A table similar to the one below will appear in the Results window.

Note that the first line (in white text) indicates the names of the dependent and independent variables selected for the t-test, while the table contains values for the statistical parameters of central tendency and dispersion. The middle p-value at the bottom of the window is for a two-tailed t-test, while the other p-values are for the corresponding one-tailed t-tests. To interpret the meaning of the central p-value, refer to the Hypothesis Testing section in Part B.
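
To make the unequal-variances t-test above more concrete, here is a minimal Python sketch of the t statistic and its approximate degrees of freedom. It uses the Welch-Satterthwaite approximation for the degrees of freedom; Stata's Welch option uses a slightly different formula, so its output may not match exactly. The function name `welch_t` and the data are hypothetical, and the sketch does not compute the p-value, which Stata reports for you.

```python
import math
import statistics

def welch_t(a, b):
    """Return (t, df) for an unpaired t-test that allows unequal variances."""
    mean_a, mean_b = statistics.mean(a), statistics.mean(b)
    # Squared standard error of each group mean: variance / sample size
    se_a2 = statistics.variance(a) / len(a)
    se_b2 = statistics.variance(b) / len(b)
    t = (mean_a - mean_b) / math.sqrt(se_a2 + se_b2)
    # Welch-Satterthwaite approximation to the degrees of freedom
    df = (se_a2 + se_b2) ** 2 / (
        se_a2 ** 2 / (len(a) - 1) + se_b2 ** 2 / (len(b) - 1)
    )
    return t, df

# Hypothetical plant heights (cm) at two light levels, invented for illustration
low_light = [10.0, 12.0, 14.0, 16.0, 18.0]
high_light = [20.0, 22.0, 24.0, 26.0, 28.0]

t, df = welch_t(low_light, high_light)
print(f"t = {t:.2f}, df = {df:.1f}")
```

A large absolute value of t relative to the t-distribution with df degrees of freedom corresponds to a small two-tailed p-value.
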

Nonparametric Tests

Nonparametrics test hypotheses about data for which the underlying distribution cannot be assumed to be normal. Keep in mind when summarizing such data that you must steer clear of statistical parameters such as mean and standard error since those statistics are meaningless for a population lacking a normal distribution. For such data it is customary to report the Minimum and Maximum values observed along with the Median, the value that lies midway in the distribution (see below for details on how to obtain these descriptive statistics in Stata).

The use of nonparametric tests in Stata is very straightforward and similar to performing parametric tests. However, one important difference with nonparametric tests is that they only provide the p-value of the analysis and do not provide any summary descriptive statistics (as the parametric tests do).

The Wilcoxon rank-sum test (also known as the Mann-Whitney test) is the nonparametric equivalent of the unpaired t-test. To begin this nonparametric test go to the Statistics menu, and select Summaries, tables & tests, then select Nonparametric tests of hypotheses, and then Wilcoxon rank-sum test. You should see the following dialog box:

Select a dependent variable for the Variable box, and an independent variable for the Grouping variable box, then click OK. In the table that appears in the Results window, Prob is the p-value for the test of the null hypothesis. To interpret the meaning of the p-value, refer to the Hypothesis Testing section in Part B.

Obtaining a Summary Table of Descriptive Statistics

To get a quick, convenient summary of descriptive stats for your dataset in tabular form, go to the Statistics menu, and select Summaries, tables & tests, then select Tables, and then Table of summary statistics (tabstat). You should get a dialog box that looks like the following:

For the Variables box you can select as many dependent variables as you like.
Check the Group statistics by variable box, then select the independent variable in the pull-down menu. In the Statistics to display portion of the window you can choose up to eight different statistics to display in the table. For this example please choose the following four (you need to first click each box in order to activate each pull-down menu): Count, Minimum, Maximum, and Median. Click OK to complete your selection.
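
The two nonparametric steps described above, ranking the pooled observations for the rank-sum test and summarizing each group by count, minimum, maximum, and median, can be sketched in Python. The function names and the data are hypothetical; the sketch stops at the rank sum and does not compute the p-value that Stata reports.

```python
import statistics

def rank_sum(a, b):
    """Sum of the ranks of group a in the pooled sample (ties share the average rank)."""
    pooled = sorted(a + b)
    rank_of = {}
    i = 0
    while i < len(pooled):
        j = i
        while j < len(pooled) and pooled[j] == pooled[i]:
            j += 1
        # Positions i+1 .. j hold the same value; assign their mean rank
        rank_of[pooled[i]] = (i + 1 + j) / 2
        i = j
    return sum(rank_of[v] for v in a)

def summarize(values):
    """Descriptive statistics suited to data that may not be normally distributed."""
    return {
        "count": len(values),
        "min": min(values),
        "max": max(values),
        "median": statistics.median(values),
    }

# Hypothetical trichome counts at two light levels, invented for illustration
low_light = [3, 5, 5, 9]
high_light = [1, 5, 7, 12]

w_low = rank_sum(low_light, high_light)
# Under the null hypothesis the expected rank sum is n_a * (N + 1) / 2
expected = len(low_light) * (len(low_light) + len(high_light) + 1) / 2
print("rank sum:", w_low, "expected under null:", expected)
print("low light:", summarize(low_light))
print("high light:", summarize(high_light))
```

A rank sum far from its expected value suggests that one group tends to have larger values than the other.
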


You should get a table that looks like the following:

Note that the header line of the Results window indicates the choice of summary statistics you have made and the names of the variables used in the table. These are also indicated in the table itself, though the variable names have unfortunately been severely truncated. Nevertheless, such a table can be very handy, and remember you can include up to eight different summary statistics in the table!

You now have the tools to answer the final question of this exercise:

Exercise Question #4: How did light intensity affect the three growth characters measured? Were statistically significant differences seen between different light intensities for all the growth characters? Explain briefly.

D. Homework Assignment: Summarizing the Results of the Brassica Light Study

Write a Results section that summarizes the findings of this study. Use the questions listed above as a guide, and support your answers using the results from the appropriate statistical test. The Results section should include a textual summary along with properly labeled summary tables or figures where appropriate. The reporting of statistical results and organization of tables should adhere to the guidelines described in the Biology 164 Style Guide for Papers and Lab Reports. You may work on the assignment individually or with your lab partner. If working with a lab partner, submit one assignment with both of your names on it. The completed assignment is due at the beginning of lab next week.
