Chapter 6: DESCRIPTIVE STATISTICS

Topics covered (Sections 6-1 to 6-5, and 6-7):
Random Sampling
Numerical Summaries
Stem-n-Leaf Plots
Histograms and Box Plots
Time Sequence Plots
Normal Probability Plots

Random Sampling

Gathering data on all individuals in a large population is usually not realistic (though the census attempts this every 10 years). But we can get information about a population by looking at a subset of it. To estimate population parameters (such as the population mean µ), we collect data on a subset of the full population.

[Figure: diagram labeled Sample and Population.]

Often, this subset is chosen as a simple random sample of the population, which means the observations were taken completely at random and each individual had the same chance of being chosen.

What do we do with the data once we collect it? We can summarize it in a useful manner. One option is to report a statistic from the data.

Statistic
A statistic is a summary value calculated from a sample of observations. Usually, a statistic is an estimator of some population parameter.

Suppose we collect n observations in a sample, $x_1, x_2, \ldots, x_n$, from a particular population.

Statistic (calculated from the data)                                      Estimates the population parameter (unknown)
Sample mean: $\bar{x} = \frac{\sum_{i=1}^{n} x_i}{n}$                       Population mean: $\mu$
Sample variance: $s^2 = \frac{\sum_{i=1}^{n} (x_i - \bar{x})^2}{n-1}$       Population variance: $\sigma^2$

We discussed this general concept earlier: we infer something about the population from a sample. This is called statistical inference.

[Figure: diagram labeled Sample and Population.]

Population parameters are denoted with Greek letters.

Statistic                                                Estimates this...
Sample mean: $\bar{x}$                                   Population mean: $\mu$
Sample variance: $s^2$                                   Population variance: $\sigma^2$
Sample std. deviation: $s$                               Population std. deviation: $\sigma$
Sample intercept: $b_0$ or $\hat{\beta}_0$               Population intercept: $\beta_0$
Sample slope: $b_1$ or $\hat{\beta}_1$                   Population slope: $\beta_1$

Numerical Summaries (Section 6-1)

The sample mean and the sample variance are numerical summaries of the sample data. The sample standard deviation is the square root of the sample variance.

The full (larger) population of interest may be an actual physical population, but it could also be a conceptual population if the population doesn't physically exist, as with all components that will be manufactured and sold.

As we saw earlier, the sample variance $s^2$ essentially describes the average squared distance of an observation from the sample mean.

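There are n = 8 observations in the sample below. The deviations from the sample mean, $x_i - \bar{x}$, are shown below:

[Figure: the eight observations and their deviations from the sample mean.]

Sample variance: $s^2 = \frac{\sum_{i=1}^{n} (x_i - \bar{x})^2}{n-1}$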

Computation of $s^2$

Original formula and alternatives:

$$s^2 = \frac{\sum_{i=1}^{n} (x_i - \bar{x})^2}{n-1} = \frac{\sum_{i=1}^{n} x_i^2 - \frac{\left(\sum_{i=1}^{n} x_i\right)^2}{n}}{n-1} = \frac{\sum_{i=1}^{n} x_i^2 - n\bar{x}^2}{n-1}$$

Note that the divisor for the sample variance is $n - 1$. We subtract 1 from the sample size because we had to estimate $\mu$ with $\bar{x}$ in order to compute the sample variance.
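As a quick check that the three forms of the sample variance agree, here is a minimal Python sketch. The eight values in `sample` are hypothetical (they are not the slide's data), and the function names are made up for illustration.

```python
def variance_definition(xs):
    """Sum of squared deviations from the mean, divided by n - 1."""
    n = len(xs)
    xbar = sum(xs) / n
    return sum((x - xbar) ** 2 for x in xs) / (n - 1)


def variance_sum_shortcut(xs):
    """Uses sum(x^2) - (sum(x))^2 / n in the numerator."""
    n = len(xs)
    return (sum(x * x for x in xs) - sum(xs) ** 2 / n) / (n - 1)


def variance_mean_shortcut(xs):
    """Uses sum(x^2) - n * xbar^2 in the numerator."""
    n = len(xs)
    xbar = sum(xs) / n
    return (sum(x * x for x in xs) - n * xbar ** 2) / (n - 1)


# Hypothetical sample of n = 8 observations (an assumption for illustration).
sample = [2, 4, 4, 5, 7, 9, 10, 11]

v1 = variance_definition(sample)
v2 = variance_sum_shortcut(sample)
v3 = variance_mean_shortcut(sample)
print(v1, v2, v3)
```

The two shortcut forms are convenient by hand, but on a computer they subtract two large, nearly equal numbers, so the definition form is usually the safer choice when the values are large relative to their spread.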

We're interested in how the observations are dispersed around $\mu$, but we only have information on how the observations are dispersed around $\bar{x}$. If we didn't make this adjustment, our estimate of $\sigma^2$ (i.e., our $s^2$ value) would consistently be too small. We also say that $s^2$ is based on $n - 1$ degrees of freedom. We'll discuss this more later.

Another measure of sample spread is the sample range.

Sample Range
If the n observations in a sample are denoted by $x_1, x_2, \ldots, x_n$, the sample range is
$$r = \max(x_i) - \min(x_i)$$
This is a single value, not 2 individual values.
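Returning to the divisor $n - 1$: the claim that dividing by $n$ gives estimates that are consistently too small can be checked with a small simulation. The sketch below assumes a normal population with mean 10 and variance 4; those values, the seed, and the sample size are arbitrary choices for illustration.

```python
import random


def sum_sq_dev(sample):
    """Sum of squared deviations from the sample mean."""
    xbar = sum(sample) / len(sample)
    return sum((x - xbar) ** 2 for x in sample)


random.seed(1)        # fixed seed so the run is repeatable
SIGMA2 = 4.0          # assumed true population variance
N, REPS = 5, 20000    # small samples, many repetitions

total_div_n = 0.0
total_div_n_minus_1 = 0.0
for _ in range(REPS):
    sample = [random.gauss(10.0, SIGMA2 ** 0.5) for _ in range(N)]
    ss = sum_sq_dev(sample)
    total_div_n += ss / N                 # divisor n
    total_div_n_minus_1 += ss / (N - 1)   # divisor n - 1

avg_div_n = total_div_n / REPS
avg_div_n_minus_1 = total_div_n_minus_1 / REPS
print("average with divisor n:    ", round(avg_div_n, 3))
print("average with divisor n - 1:", round(avg_div_n_minus_1, 3))
```

Averaged over many samples, the divisor-$n$ version should settle noticeably below the true variance, while the divisor-$(n-1)$ version should settle close to it.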

Stem-n-Leaf Diagrams (Section 6-2)

The mean and variance are quantities that give us information on the center and spread of the data, respectively. These are important summaries of a distribution. But many distributions can have the same mean and variance, and yet be different distributions.

We can use graphical displays to consider the whole distribution of the data.

Consider the following set of n = 80 data points, which are compressive strengths in pounds per square inch of 80 specimens of a new aluminum-lithium alloy undergoing evaluation.

  105  97 245 163 207 134 218 199 160 196
  221 154 228 131 180 178 157 151 175 201
  183 153 174 154 190  76 101 142 149 200
  186 174 199 115 193 167 171 163  87 176
  121 120 181 160 194 184 165 145 160 150
  181 168 158 208 133 135 172 171 237 170
  180 167 176 158 156 229 158 148 150 118
  143 141 110 133 123 146 169 158 135 149

For this data, $\bar{x} = 162.66$ and $s^2 = 1140.63$. These give a measure of center and spread.

We can look at a stem-n-leaf diagram to get a feel for the full distribution of the data.

  The decimal point is 1 digit(s) to the right of the |

   7 | 6
   8 | 7
   9 | 7
  10 | 15
  11 | 058
  12 | 013
  13 | 133455
  14 | 12356899
  15 | 001344678888
  16 | 0003357789
  17 | 0112445668
  18 | 0011346
  19 | 034699
  20 | 0178
  21 | 8
  22 | 189
  23 | 7
  24 | 5

The minimum value is 76: 7 is the stem, and 6 is the leaf.
The maximum value is 245: 24 is the stem, and 5 is the leaf.

The legend tells us where the decimal point is. This stem-n-leaf diagram suggests the distribution can be described as bell-shaped and unimodal (i.e., it has one peak).

Steps for making a Stem-n-Leaf Diagram

1. Separate each observation into a stem, consisting of all but the final (rightmost) digit, and a leaf, the final digit. Stems may have as many digits as needed, but each leaf contains only a single digit.
2. Write the stems in a vertical column with the smallest at the top, and draw a vertical line at the right of this column.
3. Write each leaf in the row to the right of its stem, in increasing order out from the stem.
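The three steps above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function name `stem_and_leaf` is made up, and it assumes non-negative integer data where the leaf is the units digit. The data are the 80 compressive strengths.

```python
def stem_and_leaf(values):
    """Return (stem, leaves) rows, smallest stem first.

    Assumes non-negative integers; the leaf is the final digit and the
    stem is everything before it.
    """
    leaves_by_stem = {}
    for v in sorted(values):            # sorting puts leaves in increasing order
        stem, leaf = divmod(v, 10)      # step 1: split off the final digit
        leaves_by_stem.setdefault(stem, []).append(str(leaf))
    first, last = min(leaves_by_stem), max(leaves_by_stem)
    # step 2: every stem from smallest to largest, even if it has no leaves
    return [(s, "".join(leaves_by_stem.get(s, []))) for s in range(first, last + 1)]


strengths = [
    105, 97, 245, 163, 207, 134, 218, 199, 160, 196,
    221, 154, 228, 131, 180, 178, 157, 151, 175, 201,
    183, 153, 174, 154, 190, 76, 101, 142, 149, 200,
    186, 174, 199, 115, 193, 167, 171, 163, 87, 176,
    121, 120, 181, 160, 194, 184, 165, 145, 160, 150,
    181, 168, 158, 208, 133, 135, 172, 171, 237, 170,
    180, 167, 176, 158, 156, 229, 158, 148, 150, 118,
    143, 141, 110, 133, 123, 146, 169, 158, 135, 149,
]

rows = stem_and_leaf(strengths)
for stem, leaves in rows:               # step 3: leaves to the right of each stem
    print(f"{stem:>3} | {leaves}")
```

Printing the rows should reproduce the diagram shown for the compressive strength data.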

If there are too many values for each stem, you can also make a split stem-n-leaf diagram by splitting the values for each stem across more than one row (for example, leaves 0-4 on one row and leaves 5-9 on the next).

Mode, Quartiles, and Percentiles

Once we've ordered the data as in the stem-n-leaf diagram, we can easily pull out some other useful data features. Consider the following stem-n-leaf diagram:

  The decimal point is 1 digit(s) to the right of the |

  6 | 134
  6 | 5568
  7 | 0113
  7 | 57

We see that n = 13, the min is 61, and the max is 77.

Median
This is the value that 50% of the data fall below and 50% fall above. The median is 68 for this data set. If n is odd, an actual data point is the median.

If n is even, the median falls between the 2 data points at the middle (use the average of these two data points). The median is a measure of central tendency, and is denoted by $\tilde{x}$.

Mode
This is the most frequently occurring data point. There are two modes in this data set, 65 and 71. We would call this distribution bimodal (i.e., it has 2 peaks).

Quartiles
The quartiles are the positions that break the data into 4 parts, each containing 25% of the data: the first quartile ($q_1$), the second quartile ($q_2$), also called the median, and the third quartile ($q_3$).

This data set has $q_1 = 64.5$, $q_2 = 68$, and $q_3 = 72$.

There are a number of ways to find positions that break the data into the 25% proportions, since the data is discrete. But here's one option:

$q_1$ is the interpolated value between the data points at ordered positions $\lfloor \frac{n+1}{4} \rfloor$ and $\lceil \frac{n+1}{4} \rceil$.
(These are the symbols for rounded-down and rounded-up, respectively.)

$q_3$ is the interpolated value between the data points at ordered positions $\lfloor \frac{3(n+1)}{4} \rfloor$ and $\lceil \frac{3(n+1)}{4} \rceil$.

The interquartile range (IQR) is equal to $q_3 - q_1$ and is a measure of variability. It is the spread of the middle 50% of the data. The IQR is less sensitive to extremes than the ordinary sample range.

The IQR for the example data set is
$$\text{IQR} = q_3 - q_1 = 72 - 64.5 = 7.5$$

Percentiles
The 100k-th percentile is a data value such that approximately 100k% of the observations are at or below this value and approximately 100(1 - k)% of them are above it (for 0 < k < 1).
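The quartile rule described above can be sketched in Python as follows. One assumption is made here: "interpolated" is read as linear interpolation at the 1-based position $k(n+1)/4$, which gives the midpoint when the position ends in .5. The function name `quartile` is made up for this sketch, and the data are the 13 values from the stem-n-leaf diagram.

```python
import math


def quartile(sorted_data, k):
    """k-th quartile (k = 1, 2, 3), interpolating at position k(n+1)/4.

    Positions are 1-based. Assumes n >= 3 so the position falls
    between 1 and n.
    """
    n = len(sorted_data)
    pos = k * (n + 1) / 4
    lo, hi = math.floor(pos), math.ceil(pos)   # rounded-down and rounded-up positions
    frac = pos - lo
    low_value = sorted_data[lo - 1]            # convert to 0-based indexing
    high_value = sorted_data[hi - 1]
    return low_value + frac * (high_value - low_value)


data = [61, 63, 64, 65, 65, 66, 68, 70, 71, 71, 73, 75, 77]

q1 = quartile(data, 1)
q2 = quartile(data, 2)
q3 = quartile(data, 3)
iqr = q3 - q1
print(q1, q2, q3, iqr)
```

Other software may use a different quartile rule and give slightly different values on the same data, which is the point made above about there being several options.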

Example: Mean and Median

A manufacturer of electronic components is interested in determining the lifetime of a certain type of battery. A sample, in hours of life, is as follows:

123, 116, 122, 110, 175, 126, 125, 111, 118, 117

a) Find the sample mean and median.
b) What feature in this data set is responsible for the substantial difference between the mean and median?
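One way to explore this example is with Python's standard `statistics` module. The sketch below computes the two summaries and then, as an illustrative extra step that is not part of the original exercise, recomputes them with the largest observation left out.

```python
import statistics

lifetimes = [123, 116, 122, 110, 175, 126, 125, 111, 118, 117]

mean_life = statistics.mean(lifetimes)
median_life = statistics.median(lifetimes)
print("mean:", mean_life, "median:", median_life)

# Illustrative extra step: drop the largest value and recompute.
without_largest = sorted(lifetimes)[:-1]
mean_trimmed = statistics.mean(without_largest)
median_trimmed = statistics.median(without_largest)
print("without largest -> mean:", round(mean_trimmed, 2), "median:", median_trimmed)
```

Comparing the two printed lines shows how much each summary moves when a single extreme observation is removed.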

Frequency Distributions and Histograms (Section 6-3)

A frequency distribution is a table that divides a set of data into a suitable number of classes (categories), showing also the number of items belonging to each class.

Consider the following stem-n-leaf diagram for humidity readings rounded to the nearest percent.

  Stem | Leaf
     1 | 2 5 7
     2 | 1 1 3 4 5 7 8 9
     3 | 2 4 4 7 9
     4 | 2 4 8
     5 | 3

We might group these data into the following frequency distribution:

  Class      Class      Frequency   Relative       Cumulative
  interval   midpoint   f           frequency      relative frequency
  10-19      14.5       3           3/20 = 0.15    0.15
  20-29      24.5       8           8/20 = 0.40    0.55
  30-39      34.5       5           5/20 = 0.25    0.80
  40-49      44.5       3           3/20 = 0.15    0.95
  50-59      54.5       1           1/20 = 0.05    1.00

There were 5 bins (also called cells or intervals) for this frequency table.
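The frequency table above can be rebuilt from the raw readings with a short Python sketch. The function name `frequency_table` and its parameters are hypothetical, and the sketch assumes integer-valued data so that a class such as 10-19 includes both endpoints.

```python
def frequency_table(values, start, width, n_classes):
    """Rows of (interval, midpoint, count, relative freq, cumulative relative freq).

    Assumes integer-valued data, so each class covers
    start + c*width through start + c*width + width - 1 inclusive.
    """
    n = len(values)
    rows = []
    running = 0
    for c in range(n_classes):
        low = start + c * width
        high = low + width - 1
        count = sum(1 for v in values if low <= v <= high)
        running += count                      # cumulative count so far
        rows.append((f"{low}-{high}", (low + high) / 2, count, count / n, running / n))
    return rows


# Humidity readings read off the stem-n-leaf diagram.
humidity = [12, 15, 17, 21, 21, 23, 24, 25, 27, 28, 29,
            32, 34, 34, 37, 39, 42, 44, 48, 53]

table = frequency_table(humidity, start=10, width=10, n_classes=5)
for interval, midpoint, count, rel, cum in table:
    print(f"{interval:>6}  {midpoint:>5}  {count:>2}  {rel:.2f}  {cum:.2f}")
```

Changing `width` and `n_classes` is a quick way to see how the choice of bins changes the table.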

The histogram is a visual display of a frequency distribution.

Example: Recall the n = 80 compressive strengths from earlier (listed in Section 6-2). Using 10 bins, we can create the frequency distribution...

  Class      Class      Frequency   Relative          Cumulative
  interval   midpoint   f           frequency         relative frequency
  61-80       70.5       1           1/80 = 0.0125    0.0125
  81-100      90.5       2           2/80 = 0.0250    0.0375
  101-120    110.5       6           6/80 = 0.0750    0.1125
  121-140    130.5       8           8/80 = 0.1000    0.2125
  141-160    150.5      23          23/80 = 0.2875    0.5000
  161-180    170.5      19          19/80 = 0.2375    0.7375
  181-200    190.5      12          12/80 = 0.1500    0.8875
  201-220    210.5       4           4/80 = 0.0500    0.9375
  221-240    230.5       4           4/80 = 0.0500    0.9875
  241-260    250.5       1           1/80 = 0.0125    1.0000

The histogram for this frequency table...

[Figure: Histogram of data. x-axis: data; y-axis: Frequency.]

We can see this is a unimodal distribution with a bell shape.

NOTE: The bin widths can alter the shape of a histogram. For instance, if I only chose 3 bins...

[Figure: Histogram of data with 3 bins. x-axis: data; y-axis: Frequency.]

This is not as informative. In general, you don't want too many or too few observations in each bin (relative to n), and you can play around with bin size for the best scenario.

We summarize data in a histogram (by lumping a lot of individual observations together in a cell), so we lose some information. But this loss is usually small compared to the information gained in the visual, and the ease of interpretation gained in the graph.

Some possible descriptions of histograms:
Symmetric
Skewed (asymmetric, long tail to one side)
  Right tail stretched out: positive skew
  Left tail stretched out: negative skew
Unimodal (one peak)
Bimodal (two peaks)
Bell-shaped
Uniformly distributed (flat)

Symmetric
If the distribution is symmetric, mean = median.

Right-skewed
If the distribution is right-skewed, typically mean > median.

Left-skewed
If the distribution is left-skewed, typically mean < median.

[Figure: three example distributions labeled Left-skewed, Symmetric, and Right-skewed.]

The histogram of the sample data at the bottom of the slide gives us a feel for the population from which the sample was drawn. The top plot is of the conceptual population from which the sample was drawn.

[Figure: top, the conceptual population; bottom, histogram of the sample data.]

Box Plots (Section 6-4)

Boxplots are another graphical tool for visualizing data. They utilize the quartiles to give us a feel for the data distribution.

Values forming the box (shows the middle 50% of the data): $q_1$, $q_2$, $q_3$ (left, middle, right)
$1.5 \times \text{IQR}$: largest possible whisker length (as a distance from $q_1$ or $q_3$)
Outliers: values out past the whiskers (below $q_1 - 1.5 \times \text{IQR}$ or above $q_3 + 1.5 \times \text{IQR}$), seen at either tail

Whiskers will end on an actual data point.
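To make the box plot ingredients concrete, here is a minimal Python sketch that computes them for the ten battery lifetimes from the mean and median example. It assumes the $(n+1)/4$ quartile rule with linear interpolation described earlier; the helper names `quartile` and `boxplot_summary` are made up for illustration, and plotting software may use a different quartile rule.

```python
import math


def quartile(sorted_data, k):
    """k-th quartile by linear interpolation at 1-based position k(n+1)/4."""
    n = len(sorted_data)
    pos = k * (n + 1) / 4
    lo, hi = math.floor(pos), math.ceil(pos)
    frac = pos - lo
    return sorted_data[lo - 1] + frac * (sorted_data[hi - 1] - sorted_data[lo - 1])


def boxplot_summary(values):
    """Quartiles, fences, whisker ends, and outliers for a box plot."""
    data = sorted(values)
    q1, q2, q3 = (quartile(data, k) for k in (1, 2, 3))
    iqr = q3 - q1
    lower_fence = q1 - 1.5 * iqr
    upper_fence = q3 + 1.5 * iqr
    inside = [x for x in data if lower_fence <= x <= upper_fence]
    return {
        "q1": q1, "q2": q2, "q3": q3, "iqr": iqr,
        "lower_fence": lower_fence, "upper_fence": upper_fence,
        # whiskers end on the most extreme data points inside the fences
        "whisker_low": min(inside), "whisker_high": max(inside),
        "outliers": [x for x in data if x < lower_fence or x > upper_fence],
    }


lifetimes = [123, 116, 122, 110, 175, 126, 125, 111, 118, 117]
summary = boxplot_summary(lifetimes)
for name, value in summary.items():
    print(name, value)
```

Note that the whiskers stop at actual observations, not at the fences themselves, and that any value beyond a fence is reported separately as an outlier.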

Comparative boxplots

Data on the age at which a Best Oscar is won, from 1970 to 2012.
Variables: Age and Gender
Compare spreads. Compare centers.

[Figure: side-by-side boxplots of Age of Oscar Winner by Gender (female, male) on the left; overlay of data points on the right.]

MLB Annual Salaries for 2016 (by position)

[Figure: boxplots of salary by position, with a white X marking each mean.]

The white X represents the mean for each distribution (which is not shown on a traditional boxplot). These are all right-skewed distributions: the mean is larger than the median. There are many pitchers in the league (P, SP, RP), and the Starting Pitcher (SP) position contains the highest-paid players.

Time Sequence Plots (Section 6-5)

When data is collected over time, it can be informative to plot the data in sequence. A time sequence plot can show trends and cycles.

The compressive strength data we previously looked at has a time component to it. Consider the set of n = 80 data points we saw earlier (listed in Section 6-2).

We didn't consider the time component previously, but we can look at it as a time sequence plot...

[Figure: Compressive strengths with time component included.]

Quality control charts
To improve productivity.
To prevent defects.
To provide information about the process.

Probability Plots (Section 6-7)

Let's return to the stem-n-leaf diagram for the compressive strength data.

  The decimal point is 1 digit(s) to the right of the |

   7 | 6
   8 | 7
   9 | 7
  10 | 15
  11 | 058
  12 | 013
  13 | 133455
  14 | 12356899
  15 | 001344678888
  16 | 0003357789
  17 | 0112445668
  18 | 0011346
  19 | 034699
  20 | 0178
  21 | 8
  22 | 189
  23 | 7
  24 | 5

It looks normally distributed, but is it?

Having the correct general shape is a start, but there are specific probabilities that coincide with the normal distribution. For example...

[Figure: two density curves plotted against x from -4 to 4, labeled "normal" and "not quite normal".]

For the red ("not quite normal") probability distribution, less than 95% is between -2 and 2 because there is more left in the tails than in the normal distribution. Scaling the distribution won't get you the normal distribution.
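For reference, the specific probabilities that the standard normal distribution assigns to symmetric intervals around 0 can be computed with Python's standard library. This is a small sketch; the helper name `prob_within` is made up for illustration.

```python
from statistics import NormalDist

Z = NormalDist()   # standard normal: mean 0, standard deviation 1


def prob_within(k):
    """P(-k < Z < k) for a standard normal random variable Z."""
    return Z.cdf(k) - Z.cdf(-k)


for k in (1, 2, 3):
    print(f"P(-{k} < Z < {k}) = {prob_within(k):.4f}")
```

A distribution with heavier tails than the normal puts less probability than this between -2 and 2, which is the comparison being made above.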

The previous example shows a distribution that is nearly normal, which will often be close enough to the normal for our specific needs. But, in general, we want to be able to detect non-normality, that is, when a distribution is not normal. We can use a Normal Probability Plot for this goal.

I'd like to spend more time with normal probability plots, but due to time constraints, I just want you to know two main things...

1. We use a normal probability plot to check for normality.
2. What the normal probability plot looks like when the data is normally distributed (and when it is not).

A normal probability plot plots your observed ordered data points against the values that would be expected from a truly normal distribution.

If the data were generated from a normal distribution, the data points in the normal probability plot will fall approximately on a straight diagonal line.
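The coordinates behind a normal probability plot can be sketched in Python as follows. One assumption is made explicit: the j-th ordered observation is paired with the standard normal quantile at probability $(j - 0.5)/n$, which is one common convention; software packages may use slightly different plotting positions. The function name `normal_plot_points` is made up, and the ten battery lifetimes are reused as example data.

```python
from statistics import NormalDist


def normal_plot_points(values):
    """Pairs (theoretical normal quantile, ordered observation).

    Assumes plotting positions (j - 0.5) / n for j = 1, ..., n.
    """
    data = sorted(values)
    n = len(data)
    std_normal = NormalDist()
    points = []
    for j, x in enumerate(data, start=1):
        p = (j - 0.5) / n                       # assumed plotting position
        points.append((std_normal.inv_cdf(p), x))
    return points


lifetimes = [123, 116, 122, 110, 175, 126, 125, 111, 118, 117]
points = normal_plot_points(lifetimes)
for z, x in points:
    print(f"{z:7.3f}  {x}")
```

Plotting the second column against the first gives the normal probability plot; roughly straight-line behavior supports normality, while a point far off the line stands out.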

Things to look for in your normal probability plot that suggest non-normality...

S shapes: light tails compared to normal, or heavy tails compared to normal
J shape: right skew

All these are signs of non-normality.

NOTE: The diagonal line below IS NOT A BEST FIT LINE to the data. It is simply a reference line for your eye. In R statistical software, the line is drawn by simply connecting the two (x, y) points determined by the values at the 25th and 75th percentiles.
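A reference line of this kind can be sketched in Python by joining the point (25th percentile of the standard normal, 25th percentile of the data) to the corresponding 75th percentile point. This is an illustration, not R's actual code: the sample quartiles here come from `statistics.quantiles` with `method="inclusive"`, which is assumed to be close to what R uses by default, and the function name `reference_line` is made up.

```python
from statistics import NormalDist, quantiles


def reference_line(values):
    """Slope and intercept of the line through the two quartile points."""
    q1, _, q3 = quantiles(values, n=4, method="inclusive")   # sample quartiles
    std_normal = NormalDist()
    z1 = std_normal.inv_cdf(0.25)                            # theoretical quartiles
    z3 = std_normal.inv_cdf(0.75)
    slope = (q3 - q1) / (z3 - z1)
    intercept = q1 - slope * z1
    return slope, intercept


lifetimes = [123, 116, 122, 110, 175, 126, 125, 111, 118, 117]
slope, intercept = reference_line(lifetimes)
print("slope:", round(slope, 4), "intercept:", round(intercept, 4))
```

Because only the two quartile points determine the line, extreme observations in the tails do not pull it toward them, which is why it differs from a best fit line.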

This normal probability plot has issues because of the points at the bottom left. Normality is questionable.

[Figure: Normal Q-Q Plot. x-axis: Theoretical Quantiles; y-axis: Sample Quantiles. The reference line (in blue) connects the values at the 25th and 75th percentiles.]

Sometimes we can use a transformation of the data to improve the normality (but you'll be working on the transformed scale after that). Below, a log transformation helped, but didn't quite get us to normality.

[Figure: two normal probability plots, "NPP plot - original scale" and "NPP plot - log scale". x-axis: Theoretical Quantiles; y-axis: Sample Quantiles.]

This one looks pretty good. Not perfect, but reasonable to assume approximate normality.

[Figure: Normal Q-Q Plot. x-axis: Theoretical Quantiles; y-axis: Sample Quantiles.]