Lecture Notes 3: Data summarization

Highlights:
Average
Median
Quartiles
5-number summary (and relation to boxplots)
Outliers
Range & IQR
Variance and standard deviation
Determining shape using mean & median

Some important characteristics of a data set
Location: Where is the data set located along a number line? Where is its center?
Spread: How dispersed (i.e. spread out) is the data?
Outliers: Are there any unusual values in the data set?
Shape: What is the shape of the distribution of values in the data set?

Location Statistics: Mean, Median & Quartiles
In these notes, we will look at some common descriptive statistics that are useful for summarizing a data set. Recall that a statistic is any number calculated from a set of data.
The most succinct way to describe the location of a data set is to identify its center. Two statistics are commonly used to describe the center: the mean and the median.

Sample average
The sample average (a.k.a. mean) is the sum of the data divided by the sample size. We denote the mean by $\bar{x}$, read "x bar".
The sample size is the number of observations in the sample, and is denoted $n$.
The sum of all the observations in a sample is denoted by $\sum x_i$.
So, our formula for the sample mean is
$\bar{x} = \frac{\sum x_i}{n}$

Sample Average Example
Suppose we are interested in the average undulation rate (in Hz) of a paradise tree snake, which undulates after jumping from a tree in order to glide away. We take a sample of n = 8 snakes and somehow measure the rates at which they undulate as they propel themselves from a source. The eight observed rates are
0.9, 1.4, 1.2, 1.2, 1.3, 2.0, 1.4, 1.6

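Sample Average Example
So, for this sample, we can compute:
$\bar{x} = \frac{\sum x_i}{n} = \frac{0.9 + 1.4 + 1.2 + 1.2 + 1.3 + 2.0 + 1.4 + 1.6}{8} = \frac{11.0}{8} = 1.375$ Hz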
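The same calculation can be sketched in a few lines of Python. This is only an illustration of the formula for the sample mean; the variable names (`rates`, `total`, `mean`) are chosen here and are not part of the notes.

```python
# Undulation rates (Hz) for the n = 8 snakes in the example.
rates = [0.9, 1.4, 1.2, 1.2, 1.3, 2.0, 1.4, 1.6]

n = len(rates)        # sample size
total = sum(rates)    # sum of all observations
mean = total / n      # sample mean: sum divided by sample size

print(f"n = {n}, sum = {total:.1f}, mean = {mean:.3f}")
```

The values are rounded when printed because floating-point sums can carry tiny rounding errors.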

Median
If you put data in order from the smallest to the largest values, the number in the middle is called the median.
The median separates the bottom 50% of the data from the top 50% of the data.
If the sample size is odd, the median will be a value in your sample.
If the sample size is even, the median will be halfway between the middle two numbers in your sample.

Computing the median
1) Order the data set, smallest to largest.
2) Compute the rank of the median using Rank = (n + 1)/2. The rank tells you which ordered observation will be the median.
3) If Rank is an integer, go right to that position in the sorted data set. Otherwise, compute the average of the two surrounding observations.
For instance, if Rank = 5, then the median is the 5th ordered observation. If Rank = 5.5, then the median is the average of the 5th and 6th ordered observations.

Computing the Median
The data set below is already ordered. There are 19 observations.
49 69 70 70 73 78 81 81 96 96 105 110 116 116 117 121 137 142 151
Find the rank of the median using (n + 1)/2: Rank = (19 + 1)/2 = 10.
Now go to this observation by counting from the start of the data set to the rank of the median: the 10th ordered observation is 96, so the median is 96.
You can verify that this is the median by making sure that there are the same number of observations above it as there are below it (here, 9 on each side).

Computing the Median
The data set below is already ordered. There are 20 observations.
49 69 70 70 73 78 81 81 96 96 105 110 116 116 117 121 137 142 151 175
Find the rank of the median using (n + 1)/2: Rank = (20 + 1)/2 = 10.5.
In this case, the rank is between two integers, so the median will be the average of the 10th and 11th ordered observations: (96 + 105)/2 = 100.5.
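As a rough sketch, the rank rule described above can be written as a small Python function. The function name `median_by_rank` is made up for this illustration; the two data sets are the 19- and 20-observation examples from these notes.

```python
def median_by_rank(values):
    """Median via the rank rule: Rank = (n + 1) / 2 on the ordered data."""
    if not values:
        raise ValueError("need at least one observation")
    ordered = sorted(values)              # step 1: order the data
    rank = (len(ordered) + 1) / 2         # step 2: rank of the median
    if rank == int(rank):                 # step 3a: integer rank
        return ordered[int(rank) - 1]     # ranks start at 1, list indexes at 0
    below = ordered[int(rank) - 1]        # step 3b: average the two
    above = ordered[int(rank)]            # surrounding observations
    return (below + above) / 2


data19 = [49, 69, 70, 70, 73, 78, 81, 81, 96, 96,
          105, 110, 116, 116, 117, 121, 137, 142, 151]
data20 = data19 + [175]

print(median_by_rank(data19))  # 96
print(median_by_rank(data20))  # 100.5
```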

Location Statistics: Quartiles
The median breaks the data set into two halves.
Quartiles break the data set into 4 quarters.
The lower quartile, Q1, is the median of all the data below the overall median.
The upper quartile, Q3, is the median of all the data above the overall median.

Computing Quartiles
49 69 70 70 73 78 81 81 96 96 105 110 116 116 117 121 137 142 151 175
Here, there are 10 observations below the median (100.5). We can find their median, Q1, in the usual manner: Rank = (10 + 1)/2 = 5.5, so Q1 is the average of the 5th and 6th ordered observations, (73 + 78)/2 = 75.5.
Q1 separates the lower 25% from the upper 75% of the data.

Computing Quartiles
49 69 70 70 73 78 81 81 96 96 105 110 116 116 117 121 137 142 151 175
Likewise, there are 10 observations above the median. We can use the same rank we used to find Q1 (5.5), but start counting from the first observation above the overall median: Q3 is the average of 117 and 121, so Q3 = 119.
Q3 separates the lower 75% from the top 25% of the data.
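The "median of each half" method can be sketched as follows. The helper names `median_sorted` and `quartiles` are illustrative only, and the halves are split by position in the ordered list, which is an assumption about how ties at the median are handled. Statistical software often uses interpolation rules instead, so its quartiles may differ slightly from these.

```python
def median_sorted(ordered):
    """Median of a list that is already sorted."""
    mid = len(ordered) // 2
    if len(ordered) % 2 == 1:
        return ordered[mid]
    return (ordered[mid - 1] + ordered[mid]) / 2


def quartiles(values):
    """Return (Q1, median, Q3) using the median-of-halves method."""
    ordered = sorted(values)
    n = len(ordered)
    if n < 2:
        raise ValueError("need at least two observations")
    half = n // 2
    lower = ordered[:half]       # observations below the overall median
    upper = ordered[n - half:]   # observations above it (median skipped if n is odd)
    return median_sorted(lower), median_sorted(ordered), median_sorted(upper)


data = [49, 69, 70, 70, 73, 78, 81, 81, 96, 96,
        105, 110, 116, 116, 117, 121, 137, 142, 151, 175]
print(quartiles(data))  # (75.5, 100.5, 119.0)
```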

Computing Quartiles
A brief aside: when the sample size is odd, it will not be the case that *exactly* 50% of the data is below the median or that *exactly* 50% is above it. This is because the median itself is not counted as being in either the upper or lower half of the data set.
For reasonably large data sets, we may say things like "50% of the data is above the median" and "25% of the data is below Q1", even though in some cases these are approximations.

Computing Quartiles
Note that for relatively small data sets, you may be able to eyeball the data to find the median, Q1, and Q3, rather than using rank.
For instance, it is not challenging to find the median and quartiles for the snake undulation rate data set of size n = 8 from before. Simply order the numbers 0.9, 1.4, 1.2, 1.2, 1.3, 2.0, 1.4, 1.6 from smallest to largest, and you can quickly see where the median and quartiles lie:
0.9 1.2 1.2 1.3 1.4 1.4 1.6 2.0
Median = (1.3 + 1.4)/2 = 1.35, Q1 = (1.2 + 1.2)/2 = 1.2, Q3 = (1.4 + 1.6)/2 = 1.5

Location Statistics: Extremes
We are also often interested in the extremes of a data set. These extreme values are referred to as the minimum and the maximum.
"Extreme" in this context doesn't necessarily mean really big or really small. It just means the biggest or the smallest.

The 5-number summary
The 5-number summary can be used to summarize a data set. It consists of the minimum, Q1, median, Q3, and maximum.
These are all measures of location.
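A minimal sketch of a 5-number summary in Python, applied to the snake undulation data from earlier in these notes. It assumes the median-of-halves definition of Q1 and Q3 used in these notes; `five_number_summary` and `median_sorted` are names invented for this example.

```python
def median_sorted(ordered):
    """Median of a list that is already sorted."""
    mid = len(ordered) // 2
    if len(ordered) % 2 == 1:
        return ordered[mid]
    return (ordered[mid - 1] + ordered[mid]) / 2


def five_number_summary(values):
    """Minimum, Q1, median, Q3 and maximum of a data set."""
    ordered = sorted(values)
    n = len(ordered)
    if n < 2:
        raise ValueError("need at least two observations")
    half = n // 2
    return {
        "min": ordered[0],
        "Q1": median_sorted(ordered[:half]),      # median of the lower half
        "median": median_sorted(ordered),
        "Q3": median_sorted(ordered[n - half:]),  # median of the upper half
        "max": ordered[-1],
    }


rates = [0.9, 1.4, 1.2, 1.2, 1.3, 2.0, 1.4, 1.6]
summary = five_number_summary(rates)
for name, value in summary.items():
    print(f"{name}: {value:.2f}")
```

For this data set the summary works out to 0.9, 1.2, 1.35, 1.5 and 2.0.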

Boxplots and the 5-number summary
Boxplots graphically illustrate the 5 values in a 5-number summary.
Sometimes boxplots are called box and whisker plots.
[Figure: boxplot of height (female), axis ticks at 60, 65, 70, 75]

Boxplots and the 5-number summary
Boxplots can be displayed horizontally or vertically.
The dark line inside the box is the median.
The edges of the box are Q1 and Q3.
The whiskers extend to either the min and max, or to the furthest non-outliers.

Boxplots and the 5-number summary
Outliers are represented as dots on a boxplot.
Note: about 50% of the data is inside the box, about 25% is below the box, and about 25% is above the box.

Outliers
Outliers are data points that are located far away from where the majority of the data lie.
There is not universal agreement on what the standard should be for classifying an observation as an outlier. It is to some extent subjective.
Data analysis software packages will have internal standards by which they decide which values should be considered outlying.
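To make the idea of an "internal standard" concrete, here is a sketch of one widely used convention: flag any value more than 1.5 × IQR below Q1 or above Q3. This rule is an assumption brought in for illustration, not a standard these notes prescribe, and the multiplier `k` and the function names are hypothetical. Quartiles are computed with the median-of-halves method from these notes.

```python
def median_sorted(ordered):
    """Median of a list that is already sorted."""
    mid = len(ordered) // 2
    if len(ordered) % 2 == 1:
        return ordered[mid]
    return (ordered[mid - 1] + ordered[mid]) / 2


def iqr_outliers(values, k=1.5):
    """Values lying more than k * IQR outside the box (Q1 to Q3)."""
    ordered = sorted(values)
    n = len(ordered)
    half = n // 2
    q1 = median_sorted(ordered[:half])
    q3 = median_sorted(ordered[n - half:])
    iqr = q3 - q1
    low_fence = q1 - k * iqr     # anything below this is flagged
    high_fence = q3 + k * iqr    # anything above this is flagged
    return [x for x in ordered if x < low_fence or x > high_fence]


rates = [0.9, 1.4, 1.2, 1.2, 1.3, 2.0, 1.4, 1.6]
print(iqr_outliers(rates))        # [2.0]
print(iqr_outliers(rates, k=3))   # []
```

With k = 1.5 the fences for the snake data are 0.75 and 1.95, so the rate of 2.0 Hz is flagged. A stricter multiplier flags nothing, which shows how much the answer depends on the chosen standard.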

Outliers
It's usually a good idea to look more closely at an outlier to see if it is real or if it is a mistake.
The outlier might be an improperly entered data value. Data entry is a tedious process and sometimes people make mistakes.
The outlier might be in different units than the rest of the data. For instance, in the questionnaires from the first day of class, a few students gave their heights in centimeters rather than inches. If these heights had not been converted, then our class data set would have shown students over 12 feet tall.

Outliers
Outliers are often real, accurate pieces of data that are simply unusual. For instance, most people work 35-40 hours per week. However, a very small number work 70-80 hours a week.
It is sometimes tempting to remove outliers from a data set, but we must first find out whether the outlier is a legitimate observation or a mistake.

Dispersion (Spread)
Here is a good piece of advice:
"Do not cross a river if it is, on average, 4 feet deep." (Nassim Taleb, The Black Swan)
Why is this good advice? What additional information would we need before we decide if crossing the river is a good idea?

Dispersion (Spread)
Information about location (average or median) is not enough to adequately summarize a data set.
Sometimes the average describes something that doesn't exist. For example, the average human being has one ovary and one testicle.
Information about how your data is dispersed is also useful, and is essential in inferential statistics.
We don't just want to know where the center of our data lies; we also want to know how spread out the data is!

The Range
The range is the easiest measure of dispersion to compute. It is the difference between the maximum value and the minimum value.
One problem with using the range is that it doesn't tell you whether most of the data is spread out through the whole range, or if the maximum and minimum values are outliers.

The IQR
The inter-quartile range, IQR = Q3 - Q1, is not affected by extreme values since it is calculated using values that lie close to the center of the data set.
We will not use either the range or the IQR when we move on to inferential statistics. But they are still useful as descriptive statistics.
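The difference in sensitivity between the range and the IQR can be illustrated with a short sketch. It uses the 20-observation data set from the quartile examples, plus a modified copy in which the maximum is replaced by a made-up extreme value (1750, purely hypothetical). The helper names are invented for this example.

```python
def median_sorted(ordered):
    """Median of a list that is already sorted."""
    mid = len(ordered) // 2
    if len(ordered) % 2 == 1:
        return ordered[mid]
    return (ordered[mid - 1] + ordered[mid]) / 2


def data_range(values):
    """Range: maximum minus minimum."""
    return max(values) - min(values)


def iqr(values):
    """Inter-quartile range Q3 - Q1, using the median-of-halves quartiles."""
    ordered = sorted(values)
    n = len(ordered)
    half = n // 2
    return median_sorted(ordered[n - half:]) - median_sorted(ordered[:half])


data = [49, 69, 70, 70, 73, 78, 81, 81, 96, 96,
        105, 110, 116, 116, 117, 121, 137, 142, 151, 175]
extreme = data[:-1] + [1750]   # hypothetical: the maximum replaced by an extreme value

print(data_range(data), iqr(data))        # 126 43.5
print(data_range(extreme), iqr(extreme))  # 1701 43.5
```

The range jumps from 126 to 1701, while the IQR stays at 43.5.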

Variance
The variance is another measure of dispersion. It is closely related to the standard deviation, which we will consider shortly.
Unlike the range or IQR, the variance statistic is computed using all of the data values in a data set. It is sensitive to outliers, but the effects of extreme values are diluted if there are a large number of observations.

Sum of Squared Deviations
To compute the variance of a data set we first need a statistic called the sum of squared deviations. This is often abbreviated as SS, for "sum of squares".
To get the squared deviation for a single observation, subtract the mean from this observation, and then square the result. Do this for all observations and sum the results. This gives us the sum of squared deviations. Mathematically,
$SS = \sum (x_i - \bar{x})^2$

Sum of Squared Deviations
Example: find the sum of squared deviations (SS) for our snake undulation rate data set:
0.9 1.4 1.2 1.2 1.3 2.0 1.4 1.6
With $\bar{x} = 1.375$,
$SS = \sum (x_i - \bar{x})^2 = 0.735$
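A short sketch of the SS calculation for these eight values, following the verbal recipe (subtract the mean, square, then sum). Variable names are illustrative only.

```python
rates = [0.9, 1.4, 1.2, 1.2, 1.3, 2.0, 1.4, 1.6]

mean = sum(rates) / len(rates)

# Squared deviation of each observation from the mean.
squared_deviations = [(x - mean) ** 2 for x in rates]

# Sum of squared deviations (SS).
ss = sum(squared_deviations)

for x, d in zip(rates, squared_deviations):
    print(f"x = {x:.1f}  (x - mean)^2 = {d:.6f}")
print(f"SS = {ss:.3f}")
```

The largest single contribution comes from the observation 2.0, which lies furthest from the mean of 1.375.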

Sample Variance
The sample variance is denoted by the symbol $s^2$. Mathematically,
$s^2 = \frac{\sum (x_i - \bar{x})^2}{n - 1} = \frac{SS}{n - 1}$
The English interpretation of a variance is: roughly the average squared distance that a group of n points lies from the mean of the group.
This is not a very intuitive concept, though it is very often used in mathematical computations.

Sample Standard Deviation
The sample standard deviation is simply the square root of the sample variance. It is denoted by the letter $s$.
Continuing with our example (SS = 0.735, n = 8), we have:
$s = \sqrt{s^2} = \sqrt{\frac{SS}{n - 1}} = \sqrt{\frac{0.735}{7}} = \sqrt{0.105} \approx 0.324$
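As a sketch, the variance and standard deviation for the snake data can be computed from the definitions and cross-checked against Python's standard `statistics` module, whose `variance` and `stdev` functions also divide by n - 1. Variable names are illustrative.

```python
import math
import statistics

rates = [0.9, 1.4, 1.2, 1.2, 1.3, 2.0, 1.4, 1.6]

n = len(rates)
mean = sum(rates) / n
ss = sum((x - mean) ** 2 for x in rates)   # sum of squared deviations

variance = ss / (n - 1)          # sample variance s^2
std_dev = math.sqrt(variance)    # sample standard deviation s

# Cross-check against the standard library (both use the n - 1 divisor).
assert math.isclose(variance, statistics.variance(rates))
assert math.isclose(std_dev, statistics.stdev(rates))

print(f"s^2 = {variance:.3f}, s = {std_dev:.3f}")
```

Note that `statistics.pvariance` and `statistics.pstdev` divide by n instead, so they would give slightly smaller values for the same data.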

Interpret the Standard Deviation
The standard deviation can be thought of roughly as an average distance that a group of points lies from the group mean.
A large standard deviation tells you that your data is highly dispersed, or spread out.
In inferential statistics, a large standard deviation signifies high levels of uncertainty regarding statistical inferences.
Note that what counts as "large" or "small" depends on the magnitude of the data itself.

Shapes of Distributions
You don't need a histogram to determine the shape of a distribution. In fact, all you need are the values for the mean and the median of your data set.
[Figure: histogram of Grades (x-axis: Grades, y-axis: Frequency); Median = 92, Mean = 86]

Shapes of Distributions
What is the shape of this distribution?
Note that the mean is 86, and the median is 92.
[Figure: histogram; Median = 92, Mean = 86]

Shapes of Distributions
What is the shape of this distribution?
Note that the mean is 2.6, and the median is 0.6.
[Figure: histogram; Median = 0.6, Mean = 2.6]

Shapes of Distributions
What is the shape of this distribution?
Note that the mean is 102, and the median is 102.
[Figure: histogram; Median = 102, Mean = 102]

Mean, Median, & Shape
If the mean is greater than the median, then the distribution is skewed to the right.
If the mean is less than the median, then the distribution is skewed to the left.
If the mean and median are (approximately) equal, then the distribution is (approximately) symmetric.
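The rule of thumb above can be sketched as a small function. The name `classify_shape` and the tolerance parameter `tol` (how close counts as "approximately equal") are hypothetical choices for this illustration; the notes do not specify a threshold. The three calls use the mean and median pairs from the preceding histogram slides.

```python
def classify_shape(mean, median, tol=0.0):
    """Guess the shape of a distribution from its mean and median.

    tol is a hypothetical cutoff: differences no larger than tol
    are treated as "approximately equal".
    """
    diff = mean - median
    if abs(diff) <= tol:
        return "approximately symmetric"
    if diff > 0:
        return "skewed right"   # mean pulled above the median
    return "skewed left"        # mean pulled below the median


print(classify_shape(86, 92))     # skewed left
print(classify_shape(2.6, 0.6))   # skewed right
print(classify_shape(102, 102))   # approximately symmetric
```

In practice a sensible `tol` depends on the scale and spread of the data, so it is left to the caller here.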

Conclusion
A statistic is any number calculated from a set of data.
Descriptive statistics are numbers that are used to describe important features of a data set.
The mean and median are very commonly used statistics which refer to location.
The standard deviation is a very commonly used statistic which refers to dispersion.
In the next set of notes, we will look at probability and the normal distribution, which will lay the groundwork for understanding inferential statistics.