Chapter 6: DESCRIPTIVE STATISTICS


Topics: Random Sampling; Numerical Summaries; Stem-n-Leaf Plots; Histograms and Box Plots; Time Sequence Plots; Normal Probability Plots (Sections 6-1 to 6-5, and 6-7)

Random Sampling

Gathering data on all individuals in a large population is usually not realistic (though the census attempts this every 10 years). But we can learn about a population by looking at a subset of it. To estimate the population parameters (such as the population mean µ), we collect data on a subset of the full population.

[Figure: diagram labeled "Sample" and "Population"]

Often, this subset is chosen as a simple random sample of the population, which means the observations are taken entirely at random and each individual has the same chance of being chosen.

What do we do with the data once we collect it? We can summarize it in a useful manner. One option is to report a statistic from the data.

Statistic: A statistic is a summary value calculated from a sample of observations. Usually, a statistic is an estimator of some population parameter.
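As a minimal sketch of simple random sampling, the following uses Python's standard library. The population of 1000 ID numbers, the sample size of 25, and the seed are all made up for illustration.

```python
import random

# Hypothetical population: ID numbers 1..1000 (made up for illustration).
population = list(range(1, 1001))

# Fixed seed so the sketch is reproducible; any seed would do.
rng = random.Random(6)

# random.sample draws without replacement, so every individual has the
# same chance of being chosen and nobody is picked twice.
sample = rng.sample(population, k=25)

print(sorted(sample))
```

Any statistic (a mean, a variance, and so on) would then be computed from `sample` rather than from the full population.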

Suppose we collect n observations in a sample, $x_1, x_2, \ldots, x_n$, from a particular population.

Statistic (calculated from the data)                                     Estimates the population parameter (unknown)
Sample mean: $\bar{x} = \frac{\sum_{i=1}^{n} x_i}{n}$                    Population mean: $\mu$
Sample variance: $s^2 = \frac{\sum_{i=1}^{n} (x_i - \bar{x})^2}{n-1}$    Population variance: $\sigma^2$
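The two statistics above can be computed directly, as in this minimal Python sketch. The eight data values are made up for illustration and do not come from the course.

```python
import statistics

# Hypothetical sample (illustrative values only).
x = [2, 4, 4, 4, 5, 5, 7, 9]
n = len(x)

# Sample mean: sum of the observations divided by n.
xbar = sum(x) / n

# Sample variance: sum of squared deviations from xbar, divided by n - 1.
s2 = sum((xi - xbar) ** 2 for xi in x) / (n - 1)

print(xbar, s2)

# The standard library uses the same n - 1 divisor in statistics.variance.
print(statistics.mean(x), statistics.variance(x))
```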

We discussed this general concept earlier: we infer something about the population from a sample. This is called statistical inference.

[Figure: diagram labeled "Sample" and "Population"]

Population parameters are denoted by Greek letters.

Statistic                                      Estimates this population parameter
Sample mean: $\bar{x}$                         Population mean: $\mu$
Sample variance: $s^2$                         Population variance: $\sigma^2$
Sample std. deviation: $s$                     Population std. deviation: $\sigma$
Sample intercept: $b_0$ or $\hat{\beta}_0$     Population intercept: $\beta_0$
Sample slope: $b_1$ or $\hat{\beta}_1$         Population slope: $\beta_1$

Numerical Summaries (Section 6-1)

The sample mean and the sample variance are numerical summaries of the sample data. The sample standard deviation is the square root of the sample variance.

The full (larger) population of interest may be an actual physical population, but it could also be a conceptual population if the population doesn't physically exist, as with all components that will be manufactured and sold.

As we saw earlier, the sample variance $s^2$ essentially describes the average squared distance of an observation from the sample mean.

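There are n = 8 observations in the sample below, shown with their deviations from the sample mean, $x_i - \bar{x}$:

[Figure: the n = 8 observations and their deviations from the sample mean]

Sample variance: $s^2 = \frac{\sum_{i=1}^{n} (x_i - \bar{x})^2}{n-1}$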

Computation of $s^2$

Original formula and alternatives:

$$s^2 = \frac{\sum_{i=1}^{n} (x_i - \bar{x})^2}{n-1} = \frac{\sum_{i=1}^{n} x_i^2 - \frac{\left(\sum_{i=1}^{n} x_i\right)^2}{n}}{n-1} = \frac{\sum_{i=1}^{n} x_i^2 - n\bar{x}^2}{n-1}$$

Note that the divisor for the sample variance is $n - 1$. We subtract 1 from the sample size because we had to estimate µ with $\bar{x}$ in order to compute the sample variance.
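The alternative forms agree because the numerator can be expanded using $\sum_{i=1}^{n} x_i = n\bar{x}$. A short derivation, written out as a sketch:

```latex
\begin{align*}
\sum_{i=1}^{n} (x_i - \bar{x})^2
  &= \sum_{i=1}^{n} x_i^2 - 2\bar{x}\sum_{i=1}^{n} x_i + n\bar{x}^2 \\
  &= \sum_{i=1}^{n} x_i^2 - 2n\bar{x}^2 + n\bar{x}^2
   = \sum_{i=1}^{n} x_i^2 - n\bar{x}^2 \\
  &= \sum_{i=1}^{n} x_i^2 - \frac{\left(\sum_{i=1}^{n} x_i\right)^2}{n}
\end{align*}
```

Dividing each line by $n - 1$ gives the three expressions for $s^2$.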

We're interested in how the observations are dispersed around µ, but we only have information on how they are dispersed around $\bar{x}$. If we didn't make this adjustment, our estimate of $\sigma^2$ (i.e., our $s^2$ value) would consistently be too small.

We also say that $s^2$ is based on $n - 1$ degrees of freedom. We'll discuss this more later.

Another measure of sample spread is the sample range.

Sample Range: If the n observations in a sample are denoted by $x_1, x_2, \ldots, x_n$, the sample range is

$$r = \max(x_i) - \min(x_i)$$

This is a single value, not 2 individual values.
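The earlier claim that dividing by n would make the variance estimate consistently too small can be checked exactly on a tiny example. This is only a sketch: the three-value population is made up, and the samples are assumed to be drawn with replacement so that all of them are equally likely.

```python
from fractions import Fraction
from itertools import product

# Hypothetical tiny population, chosen so every possible sample can be listed.
population = [1, 2, 3]
N = len(population)
mu = Fraction(sum(population), N)
sigma2 = sum((x - mu) ** 2 for x in population) / N  # population variance

def sum_sq_dev(sample):
    """Sum of squared deviations from the sample mean."""
    xbar = Fraction(sum(sample), len(sample))
    return sum((x - xbar) ** 2 for x in sample)

n = 2
# All equally likely samples of size n, drawn with replacement.
samples = list(product(population, repeat=n))

# Average of the variance estimates over every possible sample.
avg_div_n_minus_1 = sum(sum_sq_dev(s) / (n - 1) for s in samples) / len(samples)
avg_div_n = sum(sum_sq_dev(s) / n for s in samples) / len(samples)

print(sigma2, avg_div_n_minus_1, avg_div_n)
```

Averaged over all nine samples, the $n - 1$ divisor recovers the population variance, while the divisor n gives a smaller value.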

Stem-n-leaf diagrams (Section 6-2)

The mean and variance are quantities that give us information on the center and spread of the data, respectively. These are important summaries of a distribution. But many distributions can have the same mean and variance, and yet be different distributions.

We can use graphical displays to consider the whole distribution of the data.

Consider the following set of n = 80 data points, which are compressive strengths in pounds per square inch of 80 specimens of a new aluminum-lithium alloy undergoing evaluation.

105   97  245  163  207  134  218  199  160  196
221  154  228  131  180  178  157  151  175  201
183  153  174  154  190   76  101  142  149  200
186  174  199  115  193  167  171  163   87  176
121  120  181  160  194  184  165  145  160  150
181  168  158  208  133  135  172  171  237  170
180  167  176  158  156  229  158  148  150  118
143  141  110  133  123  146  169  158  135  149

For this data, $\bar{x} = 162.66$ and $s^2 = 1140.63$. These give a measure of center and spread.

We can look at a stem-n-leaf diagram to get a feel for the full distribution of the data.

  The decimal point is 1 digit(s) to the right of the |

   7 | 6
   8 | 7
   9 | 7
  10 | 15
  11 | 058
  12 | 013
  13 | 133455
  14 | 12356899
  15 | 001344678888
  16 | 0003357789
  17 | 0112445668
  18 | 0011346
  19 | 034699
  20 | 0178
  21 | 8
  22 | 189
  23 | 7
  24 | 5

The minimum value is 76: 7 is the stem and 6 is the leaf.
The maximum value is 245: 24 is the stem and 5 is the leaf.

The legend tells us where the decimal point is. This stem-n-leaf diagram suggests the distribution can be described as bell-shaped and unimodal (i.e., it has one peak).

Steps for making a Stem-n-Leaf Diagram

1. Separate each observation into a stem, consisting of all but the final (rightmost) digit, and a leaf, the final digit. Stems may have as many digits as needed, but each leaf contains only a single digit.
2. Write the stems in a vertical column with the smallest at the top, and draw a vertical line at the right of this column.
3. Write each leaf in the row to the right of its stem, in increasing order out from the stem.
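The three steps above can be sketched in Python as follows. The function name `stem_and_leaf` is made up for this illustration, and the sketch assumes non-negative whole numbers; the data are the 80 compressive strengths listed earlier.

```python
from collections import defaultdict

def stem_and_leaf(values):
    """Map each stem to its leaves (as a string), for non-negative integers."""
    rows = defaultdict(list)
    for v in sorted(values):          # sorted, so leaves come out in increasing order
        stem, leaf = divmod(v, 10)    # stem = all but the final digit, leaf = final digit
        rows[stem].append(str(leaf))
    # List every stem from smallest to largest, including any with no leaves.
    return {s: "".join(rows.get(s, [])) for s in range(min(rows), max(rows) + 1)}

strengths = [
    105, 97, 245, 163, 207, 134, 218, 199, 160, 196,
    221, 154, 228, 131, 180, 178, 157, 151, 175, 201,
    183, 153, 174, 154, 190, 76, 101, 142, 149, 200,
    186, 174, 199, 115, 193, 167, 171, 163, 87, 176,
    121, 120, 181, 160, 194, 184, 165, 145, 160, 150,
    181, 168, 158, 208, 133, 135, 172, 171, 237, 170,
    180, 167, 176, 158, 156, 229, 158, 148, 150, 118,
    143, 141, 110, 133, 123, 146, 169, 158, 135, 149,
]

diagram = stem_and_leaf(strengths)
for stem, leaves in diagram.items():
    print(f"{stem:>3} | {leaves}")
```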

If there are too many values for each stem, you can make a split stem-n-leaf diagram instead, in which each stem is split across more than one row.

Mode, Quartiles, and Percentiles

Once we've ordered the data as in the stem-n-leaf diagram, we can easily pull out some other useful data features. Consider the following stem-n-leaf diagram:

  The decimal point is 1 digit(s) to the right of the |

  6 | 134
  6 | 5568
  7 | 0113
  7 | 57

We see that n = 13, the min is 61, and the max is 77.

Median: This is the value at which 50% of the observations fall below and 50% fall above. The median is 68 for this data set. If n is odd, an actual data point is the median.

If n is even, the median falls between the 2 data points at the middle (use the average of these two data points). The median is a measure of central tendency, and is denoted by $\tilde{x}$.

Mode: This is the most frequently occurring data point. There are two modes in this data set, 65 and 71. We would call this distribution bimodal (i.e., it has 2 peaks).
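As a quick check of the median and modes quoted for the 13-observation data set, here is a minimal sketch using Python's standard `statistics` module (`multimode` needs Python 3.8 or later).

```python
import statistics

# The 13 observations read off the stem-n-leaf diagram.
data = [61, 63, 64, 65, 65, 66, 68, 70, 71, 71, 73, 75, 77]

median = statistics.median(data)      # middle value, since n = 13 is odd
modes = statistics.multimode(data)    # every value tied for most frequent

print(median, modes)
```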

Quartiles: The positions that break the data into 4 parts, each containing 25% of the data, are the quartiles: the first quartile ($q_1$), the second quartile ($q_2$), also called the median, and the third quartile ($q_3$).

This data set has

$q_1 = 64.5$, $q_2 = 68$, $q_3 = 72$

Because the data is discrete, there are a number of ways to find positions that break the data into the 25% proportions. Here's one option:

$q_1$ is the interpolated value between the data points at ordered positions $\lfloor \frac{n+1}{4} \rfloor$ and $\lceil \frac{n+1}{4} \rceil$.

(These are the symbols for rounded-down and rounded-up, respectively.)

$q_3$ is the interpolated value between the data points at ordered positions $\lfloor \frac{3(n+1)}{4} \rfloor$ and $\lceil \frac{3(n+1)}{4} \rceil$.

The interquartile range (IQR) is equal to $q_3 - q_1$ and is a measure of variability. It is the spread of the middle 50% of the data. The IQR is less sensitive to extremes than the ordinary sample range.

The IQR for the example data set is $\text{IQR} = q_3 - q_1 = 72 - 64.5 = 7.5$.

Percentiles: The 100kth percentile is a data value such that approximately 100k% of the observations are at or below this value and approximately 100(1 - k)% of them are above it (for 0 < k < 1).
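The quartile rule above can be sketched in Python. The helper name `quartile` is made up, and linear interpolation between the two neighbouring data points is an assumption (the notes only say "interpolated"); the sketch also assumes the positions fall between 1 and n.

```python
import math

def quartile(sorted_data, which):
    """Quartile at ordered position which*(n+1)/4 (positions count from 1).

    which is 1 for q1 or 3 for q3.
    """
    n = len(sorted_data)
    pos = which * (n + 1) / 4
    lo, hi = math.floor(pos), math.ceil(pos)
    frac = pos - lo
    # Assumption: interpolate linearly between the two neighbouring points.
    return sorted_data[lo - 1] + frac * (sorted_data[hi - 1] - sorted_data[lo - 1])

data = sorted([61, 63, 64, 65, 65, 66, 68, 70, 71, 71, 73, 75, 77])

q1 = quartile(data, 1)
q3 = quartile(data, 3)
iqr = q3 - q1

print(q1, q3, iqr)
```

For this data set the positions are 3.5 and 10.5, so each quartile is the midpoint of two adjacent observations.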

Example: Mean and Median

A manufacturer of electronic components is interested in determining the lifetime of a certain type of battery. A sample, in hours of life, is as follows:

123, 116, 122, 110, 175, 126, 125, 111, 118, 117

a) Find the sample mean and median.
b) What feature in this data set is responsible for the substantial difference between the mean and median?
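One way to explore this example is sketched below in Python. Dropping the largest observation is not part of the exercise; it is done here only to see how much a single value moves each summary.

```python
import statistics

lifetimes = [123, 116, 122, 110, 175, 126, 125, 111, 118, 117]  # hours

mean_all = statistics.mean(lifetimes)
median_all = statistics.median(lifetimes)   # n = 10 is even: average of the middle two

# Drop the single largest value (175) to see how much it moves each summary.
trimmed = sorted(lifetimes)[:-1]
mean_trimmed = statistics.mean(trimmed)
median_trimmed = statistics.median(trimmed)

print(mean_all, median_all)
print(mean_trimmed, median_trimmed)
```

The mean shifts by several hours when 175 is removed, while the median barely changes, which points at that one unusually large lifetime.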

Frequency Distributions and Histograms (Section 6-3)

A frequency distribution is a table that divides a set of data into a suitable number of classes (categories), showing also the number of items belonging to each class.

Consider the following stem-n-leaf diagram for humidity readings rounded to the nearest percent.

Stem | Leaf
   1 | 2 5 7
   2 | 1 1 3 4 5 7 8 9
   3 | 2 4 4 7 9
   4 | 2 4 8
   5 | 3

We might group these data into the following frequency distribution:

Class      Class      Frequency   Relative        Cumulative
interval   midpoint   f           frequency       relative frequency
10-19      14.5       3           3/20 = 0.15     0.15
20-29      24.5       8           8/20 = 0.40     0.55
30-39      34.5       5           5/20 = 0.25     0.80
40-49      44.5       3           3/20 = 0.15     0.95
50-59      54.5       1           1/20 = 0.05     1.00

There were 5 bins (also called cells or intervals) for this frequency table.
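The frequency table can be rebuilt from the humidity readings with a short Python sketch. The variable names are made up for illustration; exact fractions are used for the running total so the cumulative column does not pick up rounding error.

```python
from fractions import Fraction

# Humidity readings read off the stem-n-leaf diagram.
humidity = [12, 15, 17, 21, 21, 23, 24, 25, 27, 28, 29,
            32, 34, 34, 37, 39, 42, 44, 48, 53]
classes = [(10, 19), (20, 29), (30, 39), (40, 49), (50, 59)]

n = len(humidity)
rows = []
cumulative = Fraction(0)
for lo, hi in classes:
    f = sum(lo <= x <= hi for x in humidity)   # frequency of this class
    rel = Fraction(f, n)                       # relative frequency
    cumulative += rel                          # cumulative relative frequency
    rows.append((f"{lo}-{hi}", (lo + hi) / 2, f, float(rel), float(cumulative)))

for row in rows:
    print(row)
```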

The histogram is a visual display of a frequency distribution.

Example: Recall the n = 80 compressive strengths listed earlier. Using 10 bins, we can create the following frequency distribution:

Class      Class      Frequency   Relative          Cumulative
interval   midpoint   f           frequency         relative frequency
61-80       70.5       1           1/80 = 0.0125    0.0125
81-100      90.5       2           2/80 = 0.0250    0.0375
101-120    110.5       6           6/80 = 0.0750    0.1125
121-140    130.5       8           8/80 = 0.1000    0.2125
141-160    150.5      23          23/80 = 0.2875    0.5000
161-180    170.5      19          19/80 = 0.2375    0.7375
181-200    190.5      12          12/80 = 0.1500    0.8875
201-220    210.5       4           4/80 = 0.0500    0.9375
221-240    230.5       4           4/80 = 0.0500    0.9875
241-260    250.5       1           1/80 = 0.0125    1.0000

The histogram for this frequency table:

[Figure: "Histogram of data"; x-axis: data, y-axis: Frequency]

We can see this is a unimodal distribution with a bell shape.

NOTE: The bin widths can alter the shape of a histogram. For instance, if I only chose 3 bins...

[Figure: "Histogram of data" with 3 bins; x-axis: data, y-axis: Frequency]

This is not as informative. In general, you don't want too many or too few observations in each bin (relative to n), and you can play around with the bin size to find the best display.

We summarize data in a histogram (by lumping a lot of individual observations together in a cell), so we lose some information. But this loss is usually small compared to the information gained in the visual, and the ease of interpretation gained in the graph.

Some possible descriptions of histograms:

Symmetric
Skewed (asymmetric, long tail to one side)
    Right tail stretched out: positive skew
    Left tail stretched out: negative skew
Unimodal (one peak)
Bimodal (two peaks)
Bell-shaped
Uniformly distributed (flat)

Symmetric: If the distribution is symmetric, mean = median.
Right-skewed: If the distribution is right-skewed, mean > median.
Left-skewed: If the distribution is left-skewed, mean < median.

[Figure: three distributions labeled "Left-skewed", "Symmetric", and "Right-skewed"]

[Figure: the conceptual population (top) and a histogram of the sample data (bottom)]

The histogram of the sample data at the bottom of the slide gives us a feel for the population from which the sample was drawn. The top plot is of the conceptual population from which the sample was drawn.

Box Plots (Section 6-4)

Boxplots are another graphical tool for visualizing data. They use the quartiles to give us a feel for the data distribution.

Values forming the box (shows the middle 50% of the data): $q_1$, $q_2$, $q_3$ (left, middle, right)
$1.5 \times \text{IQR}$: the largest possible whisker length (as a distance from $q_1$ or $q_3$)
Outliers: values out past the whiskers (past $q_1 - 1.5 \times \text{IQR}$ or past $q_3 + 1.5 \times \text{IQR}$), seen at either tail

Whiskers will end on an actual data point.
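The boxplot ingredients can be computed by hand, as in this Python sketch applied to the battery lifetimes from the earlier mean-and-median example. It reuses the $(n+1)/4$ position rule for quartiles; the helper name `quartile` and the use of linear interpolation are assumptions made for illustration.

```python
import math

def quartile(sorted_data, which):
    """Quartile at ordered position which*(n+1)/4 (positions count from 1)."""
    n = len(sorted_data)
    pos = which * (n + 1) / 4
    lo, hi = math.floor(pos), math.ceil(pos)
    # Assumption: interpolate linearly between the two neighbouring points.
    return sorted_data[lo - 1] + (pos - lo) * (sorted_data[hi - 1] - sorted_data[lo - 1])

lifetimes = sorted([123, 116, 122, 110, 175, 126, 125, 111, 118, 117])

q1, q3 = quartile(lifetimes, 1), quartile(lifetimes, 3)
iqr = q3 - q1

# Fences: the farthest a whisker is allowed to reach.
lower_fence = q1 - 1.5 * iqr
upper_fence = q3 + 1.5 * iqr

# Outliers lie beyond the fences; whiskers stop at the most extreme
# data points that are still inside them.
outliers = [x for x in lifetimes if x < lower_fence or x > upper_fence]
inside = [x for x in lifetimes if lower_fence <= x <= upper_fence]
whisker_low, whisker_high = min(inside), max(inside)

print(q1, q3, iqr, lower_fence, upper_fence)
print(outliers, whisker_low, whisker_high)
```

Under these assumptions the lifetime of 175 hours falls beyond the upper fence, so it would be drawn as an individual outlier rather than covered by a whisker.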


Comparative boxplots

Data on the age at which a Best Oscar was won, from 1970 to 2012.
Variables: Age and Gender

Compare spreads. Compare centers.

[Figure: "Age of Oscar Winner" by Gender (female, male); side-by-side boxplots on the left, overlay of data points on the right]

MLB Annual Salaries for 2016 (by position)

[Figure: boxplots of MLB annual salaries for 2016, by position]

The white X represents the mean for each distribution (which is not shown on a traditional boxplot). These are all right-skewed distributions: the mean is larger than the median.

There are many pitchers in the league (P, SP, RP), and the Starting Pitcher (SP) position contains the highest-paid players.

Time Sequence Plots (Section 6-5)

When data is collected over time, it can be informative to plot the data in sequence. A time sequence plot can show trends and cycles.

The compressive strength data we previously looked at has a time component to it. Consider again the set of n = 80 data points listed earlier.

We didn't consider the time component previously, but we can look at it as a time sequence plot.

[Figure: Compressive strengths with time component included]

Quality control charts

To improve productivity.
To prevent defects.
To provide information about the process.

Probability Plots (Section 6-7)

Let's return to the stem-n-leaf diagram for the compressive strength data.

  The decimal point is 1 digit(s) to the right of the |

   7 | 6
   8 | 7
   9 | 7
  10 | 15
  11 | 058
  12 | 013
  13 | 133455
  14 | 12356899
  15 | 001344678888
  16 | 0003357789
  17 | 0112445668
  18 | 0011346
  19 | 034699
  20 | 0178
  21 | 8
  22 | 189
  23 | 7
  24 | 5

It looks normally distributed, but is it?

Having the correct general shape is a start, but there are specific probabilities that coincide with the normal distribution. For example...

[Figure: two density curves labeled "normal" and "not quite normal"; x-axis: x, y-axis: y]

For the red probability distribution, less than 95% is between -2 and 2 because there is more left in the tails than in the normal distribution. Scaling the distribution won't get you the normal distribution.
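For reference, the share of a standard normal distribution that lies between -2 and 2 can be computed with Python's standard library (`statistics.NormalDist`, available from Python 3.8). This is only a sketch of the benchmark that a "not quite normal" distribution would be compared against.

```python
from statistics import NormalDist

z = NormalDist()  # standard normal: mean 0, standard deviation 1

# Probability that a standard normal value lies between -2 and 2.
p_within_2 = z.cdf(2) - z.cdf(-2)

print(round(p_within_2, 4))
```

A distribution with heavier tails than the normal would put less probability than this in the same interval.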

The previous example shows a distribution that is nearly normal, which will often be close enough to the normal for our specific needs. But, in general, we want to be able to detect non-normality, that is, when a distribution is not normal. We can use a Normal Probability Plot for this goal.

I'd like to spend more time with normal probability plots, but due to time constraints, I just want you to know two main things:

1. We use a normal probability plot to check for normality.
2. What the normal probability plot looks like when the data is normally distributed (and when it is not).

A normal probability plot plots your observed, ordered data points against the values that would be expected from a truly normal distribution. If the data were generated from a normal distribution, the points in the normal probability plot will fall approximately on a straight diagonal line.
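The coordinates of such a plot can be sketched in Python. The function name `normal_plot_points` is made up, and the plotting positions $(i - 0.5)/n$ are an assumption; statistical software may use slightly different positions. The data are the 13 observations from the quartile example.

```python
from statistics import NormalDist

def normal_plot_points(data):
    """Pair each ordered observation with a standard normal quantile."""
    ys = sorted(data)
    n = len(ys)
    std_normal = NormalDist()
    # Assumption: plotting positions (i - 0.5) / n for i = 1..n.
    xs = [std_normal.inv_cdf((i - 0.5) / n) for i in range(1, n + 1)]
    return list(zip(xs, ys))

data = [61, 63, 64, 65, 65, 66, 68, 70, 71, 71, 73, 75, 77]

points = normal_plot_points(data)
for theoretical, observed in points:
    print(f"{theoretical:7.3f}  {observed}")
```

Plotting the first column on the horizontal axis and the second on the vertical axis gives the normal probability plot; roughly straight points are consistent with normality.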

Things to look for in your normal probability plot that suggest non-normality:

S shapes
J shape

[Figure panels: light tails compared to normal; heavy tails compared to normal; right skew]

All these are signs of non-normality.

NOTE: The diagonal line below IS NOT A BEST FIT LINE to the data. It is simply a reference line for your eye. In R statistical software, the line is drawn by simply connecting the two (x, y) points determined by the values at the 25th and 75th percentiles.
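A reference line of this kind can be sketched in Python by joining the quartile points of the sample and of the standard normal distribution. This is an illustration, not R's actual code: the "inclusive" quartile rule from Python's `statistics.quantiles` is an assumption, and other quartile conventions would move the line slightly. The data are the 13 observations from the quartile example.

```python
from statistics import NormalDist, quantiles

data = [61, 63, 64, 65, 65, 66, 68, 70, 71, 71, 73, 75, 77]

# Sample quartiles (assumption: the "inclusive" interpolation rule).
y25, _, y75 = quantiles(data, n=4, method="inclusive")

# Matching quartiles of the standard normal distribution.
z = NormalDist()
x25, x75 = z.inv_cdf(0.25), z.inv_cdf(0.75)

# Line through (x25, y25) and (x75, y75).
slope = (y75 - y25) / (x75 - x25)
intercept = y25 - slope * x25

print(slope, intercept)
```

Because the line is pinned to only two points, it is not a least-squares fit and is not pulled around by points in the tails.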

This Normal Probability Plot has issues because of the points at the bottom left. Normality is questionable.

[Figure: "Normal Q-Q Plot"; x-axis: Theoretical Quantiles, y-axis: Sample Quantiles. The reference line (in blue) connects the values at the 25th and 75th percentiles.]

Sometimes we can use a transformation of the data to improve the normality (but you'll be working on the transformed scale after that). Below, a log-transformation helped, but didn't quite get us to normality.

[Figure: two normal probability plots, "NPP plot - original scale" and "NPP plot - log scale"; x-axis: Theoretical Quantiles, y-axis: Sample Quantiles]

This one looks pretty good. Not perfect, but it is reasonable to assume approximate normality.

[Figure: "Normal Q-Q Plot"; x-axis: Theoretical Quantiles, y-axis: Sample Quantiles]