CHAPTER 3: Data Description

Similar documents
Math 120 Introduction to Statistics Mr. Toner s Lecture Notes 3.1 Measures of Central Tendency

Chapter 3 - Displaying and Summarizing Quantitative Data

Chapter 2 Describing, Exploring, and Comparing Data

Averages and Variation

STP 226 ELEMENTARY STATISTICS NOTES PART 2 - DESCRIPTIVE STATISTICS CHAPTER 3 DESCRIPTIVE MEASURES

STA Rev. F Learning Objectives. Learning Objectives (Cont.) Module 3 Descriptive Measures

Prepare a stem-and-leaf graph for the following data. In your final display, you should arrange the leaves for each stem in increasing order.

STA Module 2B Organizing Data and Comparing Distributions (Part II)

STA Learning Objectives. Learning Objectives (cont.) Module 2B Organizing Data and Comparing Distributions (Part II)

15 Wyner Statistics Fall 2013

Chapter 6: DESCRIPTIVE STATISTICS

Chapter 2. Descriptive Statistics: Organizing, Displaying and Summarizing Data

CHAPTER 2: SAMPLING AND DATA

Math 214 Introductory Statistics Summer Class Notes Sections 3.2, : 1-21 odd 3.3: 7-13, Measures of Central Tendency

Chapter 1. Looking at Data-Distribution

No. of blue jelly beans No. of bags

Frequency Distributions

STA 570 Spring Lecture 5 Tuesday, Feb 1

CHAPTER 1. Introduction. Statistics: Statistics is the science of collecting, organizing, analyzing, presenting and interpreting data.

10.4 Measures of Central Tendency and Variation

10.4 Measures of Central Tendency and Variation

Basic Statistical Terms and Definitions

Chapter 3. Descriptive Measures. Slide 3-2. Copyright 2012, 2008, 2005 Pearson Education, Inc.

Measures of Dispersion

Measures of Central Tendency

STA Module 4 The Normal Distribution

STA /25/12. Module 4 The Normal Distribution. Learning Objectives. Let s Look at Some Examples of Normal Curves

Lecture Notes 3: Data summarization

+ Statistical Methods in

To calculate the arithmetic mean, sum all the values and divide by n (equivalently, multiple 1/n): 1 n. = 29 years.

Measures of Central Tendency. A measure of central tendency is a value used to represent the typical or average value in a data set.

IT 403 Practice Problems (1-2) Answers

Descriptive Statistics

Chpt 3. Data Description. 3-2 Measures of Central Tendency /40

MAT 142 College Mathematics. Module ST. Statistics. Terri Miller revised July 14, 2015

CHAPTER 2 DESCRIPTIVE STATISTICS

Chapter 3 Analyzing Normal Quantitative Data

How individual data points are positioned within a data set.

The first few questions on this worksheet will deal with measures of central tendency. These data types tell us where the center of the data set lies.

Measures of Position

2.1 Objectives. Math Chapter 2. Chapter 2. Variable. Categorical Variable EXPLORING DATA WITH GRAPHS AND NUMERICAL SUMMARIES

Chapter 3: Data Description - Part 3. Homework: Exercises 1-21 odd, odd, odd, 107, 109, 118, 119, 120, odd

Table of Contents (As covered from textbook)

Chapter 3: Describing, Exploring & Comparing Data

Data can be in the form of numbers, words, measurements, observations or even just descriptions of things.

Vocabulary. 5-number summary Rule. Area principle. Bar chart. Boxplot. Categorical data condition. Categorical variable.

Chapter 5. Understanding and Comparing Distributions. Copyright 2012, 2008, 2005 Pearson Education, Inc.

AND NUMERICAL SUMMARIES. Chapter 2

Understanding and Comparing Distributions. Chapter 4

3.2-Measures of Center

2.1: Frequency Distributions and Their Graphs

MATH& 146 Lesson 10. Section 1.6 Graphing Numerical Data

1.3 Graphical Summaries of Data

CHAPTER 2: DESCRIPTIVE STATISTICS Lecture Notes for Introductory Statistics 1. Daphne Skipper, Augusta University (2016)

DAY 52 BOX-AND-WHISKER

CHAPTER-13. Mining Class Comparisons: Discrimination between DifferentClasses: 13.4 Class Description: Presentation of Both Characterization and

Lecture 3 Questions that we should be able to answer by the end of this lecture:

Univariate Statistics Summary

Chapter 5. Understanding and Comparing Distributions. Copyright 2010, 2007, 2004 Pearson Education, Inc.

Unit 7 Statistics. AFM Mrs. Valentine. 7.1 Samples and Surveys

AP Statistics Prerequisite Packet

Pre-Calculus Multiple Choice Questions - Chapter S2

Further Maths Notes. Common Mistakes. Read the bold words in the exam! Always check data entry. Write equations in terms of variables

UNIT 1A EXPLORING UNIVARIATE DATA

appstats6.notebook September 27, 2016

Mean,Median, Mode Teacher Twins 2015

Lecture 3 Questions that we should be able to answer by the end of this lecture:

Exploratory Data Analysis

Chapter2 Description of samples and populations. 2.1 Introduction.

MULTIPLE CHOICE. Choose the one alternative that best completes the statement or answers the question.

Section 6.3: Measures of Position

SCHOOL OF BUSINESS, ECONOMICS AND MANAGEMENT BBA240 STATISTICS/ QUANTITATIVE METHODS FOR BUSINESS AND ECONOMICS

MATH& 146 Lesson 8. Section 1.6 Averages and Variation

Name Date Types of Graphs and Creating Graphs Notes

The main issue is that the mean and standard deviations are not accurate and should not be used in the analysis. Then what statistics should we use?

Chapter 2 Modeling Distributions of Data

Chapter 2: Descriptive Statistics

Acquisition Description Exploration Examination Understanding what data is collected. Characterizing properties of data.

Things you ll know (or know better to watch out for!) when you leave in December: 1. What you can and cannot infer from graphs.

Lecture 3: Chapter 3

Math 167 Pre-Statistics. Chapter 4 Summarizing Data Numerically Section 3 Boxplots

8: Statistics. Populations and Samples. Histograms and Frequency Polygons. Page 1 of 10

Section 9: One Variable Statistics

Measures of Dispersion

Create a bar graph that displays the data from the frequency table in Example 1. See the examples on p Does our graph look different?

a. divided by the. 1) Always round!! a) Even if class width comes out to a, go up one.

AP Statistics. Study Guide

MATH 1070 Introductory Statistics Lecture notes Descriptive Statistics and Graphical Representation

Slide Copyright 2005 Pearson Education, Inc. SEVENTH EDITION and EXPANDED SEVENTH EDITION. Chapter 13. Statistics Sampling Techniques

TMTH 3360 NOTES ON COMMON GRAPHS AND CHARTS

6th Grade Vocabulary Mathematics Unit 2

Measures of Central Tendency:

Numerical Summaries of Data Section 14.3

Date Lesson TOPIC HOMEWORK. Displaying Data WS 6.1. Measures of Central Tendency WS 6.2. Common Distributions WS 6.6. Outliers WS 6.

Chapter 5snow year.notebook March 15, 2018

Chapter 5: The standard deviation as a ruler and the normal model p131

Section 1.2. Displaying Quantitative Data with Graphs. Mrs. Daniel AP Stats 8/22/2013. Dotplots. How to Make a Dotplot. Mrs. Daniel AP Statistics

Statistical Methods. Instructor: Lingsong Zhang. Any questions, ask me during the office hour, or me, I will answer promptly.

Density Curve (p52) Density curve is a curve that - is always on or above the horizontal axis.

Descriptive Statistics: Box Plot

Transcription:

CHAPTER 3: Data Description You ve tabulated and made pretty pictures. Now what numbers do you use to summarize your data? Ch3: Data Description Santorico Page 68

You ll find a link on our website to a data set with various measures for housing in the suburbs of Boston. It comes from a paper titled: Hedonic prices and the demand for clean air. I ve given histograms here for a few of the variables. What are some characteristics of the distributions that you might want to describe? Do you think these measures might do a better job for some of these variables versus others? Why or why not? Ch3: Data Description Santorico Page 69

What s being described? Parameter a characteristic or measure obtained using the data values from a specific population. Statistic a characteristic or measure obtained using the data values from a sample. Ch3: Data Description Santorico Page 70

Rules and Notation: Let x represent the variable for which we have sample data. Let n represent the number of observations in the sample. (the sample size). Let N represent the number of observations in the population. x represents the sum of all the data values of x. x 2 is the sum of the data values after squaring them. x 2 x 2. General Rounding Rule: When computations are done in the calculation, rounding should not be done until the final answer is calculated! Rounding Rule of Thumb for Calculations from Raw Data: The final answer should be rounded to one more decimal place than that of the original data. You will see that this will be true for the mean, variance and standard deviation. Ch3: Data Description Santorico Page 71

Section 3-1: Measures of Central Tendency Measure Description Statistic and Parameter Notes and Insights the sum of the data values The sample mean is denoted by The mean should be rounded to one more decimal place than occurs divided by the total number of x in the raw data. and calculated using the values x The mean is the balance point of the data. formula: x n When the data is skewed the mean is pulled in the direction of the longer tail. Mean The population mean is denoted The mean is used in computing other statistics such as variance and standard deviation. by and is found with the Median the middle number of the data set when they are ordered from smallest to largest the value that occurs most often in a data set formula: x N Arrange the data in order. If n is odd, the median is the middle number. If n is even, the median is the mean of the middle two numbers We use the symbol MD for median. The mean is highly affected by outliers and may not be an appropriate statistic to use when an outlier is present. The median is robust against outliers (less affected by them). The median is used when one must find the center value of a data set This is where the peaks occur in a histogram. Unimodal when a data set has only one mode Mode Bimodal when a data set has 2 modes Multimodal when a data set has more than 2 modes No Mode when no data values occurs more than once The mode is used when the most typical case is desired. Ch3: Data Description Santorico Page 72

Measures of Central Tendency continued Measure Description Statistic and Parameter Notes and Insights the sum of the minimum and Denoted by the symbol MR. Midrange maximum values in the data set, divided by 2 Weighted Mean found by multiplying each value by its corresponding weight and dividing the sum of the products by the sum of the weights. min max MR 2 w1 x1 w2 x2... wnx x w1 w2... wn w, w,..., w n are the x, x,..., n where 1 2 weights and 1 2 x are the 1 st, 2 nd,, nth data values. n It is a rough estimate of the middle since it gives the midpoint of the dataset. An outlier (a really large or really small value) can have a dramatic effect on the midrange value. Ch3: Data Description Santorico Page 73

GROUP WORK: (use appropriate notation) Find the Mean, Median and Midrange of the daily vehicle pass charge for five U.S. National Parks. The costs are $25, $15, $15, $20, and $25. Find the Mean, Median and Midrange of the numbers of waterline breaks per month in the last two winter seasons for the city of Brownsville, Minnesota: 2, 3, 6, 8, 4, 1. Find the midrange. Ch3: Data Description Santorico Page 74

Find the mode of the following data sets: Set 1: 12, 8, 14, 15, 11, 10, 5, 14 Set 2: 1, 2, 3, 4 Set 3: 1, 2, 3, 4, 1, 2, 3 Set 4: 18.0, 14.0, 34.5, 10, 11.3, 10, 12.4, 10 Ch3: Data Description Santorico Page 75

Find the weighted mean for the grade point using the number of credits for a class as the weight: Course Credits (weights) Grade point(x) English 3 C (2 points) Calculus 4 A (4 points) Yoga 2 A (4 points) Physics 4 B (3 points) Ch3: Data Description Santorico Page 76

Now consider which of these measures would be good representations of central tendency for the 3 variables from the Boston housing data set. Per Capita Crime Rate By Town Average Number Of Rooms Per Dwelling Pupil-Teacher Ratio By Town x 3.613524 6.2846 18.46 MD= 0.256510 6.2085 19.05 MR= 44.491260 6.1705 17.30 Mode= 0.01501, 14.3337 (each occurring twice) 5.713, 6.127, 6.167, 6.229, 6.405, 6.417 (all occurring 3 times) 20.2 (occurring 140 times; the next count closest occurred 34 times) Notice how the statistics compare to each other for each variable, e.g., mean, median and midrange are all close to each other for the room variable. Why? Why is this not the case for the other variables? Ch3: Data Description Santorico Page 77

Location of Mean, Median, and Mode on Distribution Shapes Ch3: Data Description Santorico Page 78

Section 3-2: Measures of Variation (Spread) To describe a distribution well, we need more than just the measures of center. We also need to know how the data is spread out or how it varies. Example: Bowling scores John 185 135 200 185 250 155 Jarrod 182 185 188 185 180 190 Who is the better bowler and how do you know? Note: The mean score for both bowlers is 185. Ch3: Data Description Santorico Page 79

Ways to Measure Spread: Measure Description Sample Population Notes Range The difference between the largest and Denoted by R. R = high value low value smallest observations. Variance The average of the squares of the distance The sample variance is an estimate of the population variance calculated from a sample. Population variance is denoted by 2 It is commonly used in statistics because each value is from the mean It is denoted by s 2 it has nice theoretical properties.. Standard deviation the typical deviation from the sample mean The formula to calculate the sample variance is or ( x x) 2 s n 1 s 2 n x2 x n(n 1) 2. 2 The square root of the sample variance It is denoted by s. The formula to calculate the sample standard deviation is: The formula for the population variance is (x ) 2 2 N the square root of the population variance The symbol for the population standard deviation is. In practice, we don t know the population values or parameters, so we cannot calculate 2 or. We end up calculating the variance and standard deviation of a sample. Be careful to notice the difference of n-1 (sample) and n (population) in the denominator. The greater the spread of the data, the larger the value of s. s=0 only when all observations take the same value. OR s 2 x x. ( ) 2 s n 1 s n x2 x n(n 1) 2 The formula for the population standard deviation is 2 (x ) 2 N. s can be influenced by outliers because outliers influence the mean and because outliers have large deviations from the mean Ch3: Data Description Santorico Page 80

Steps for Calculating Sample Variance and Standard Deviation x 1. Calculate the sample mean. 2. Calculate the deviation from the mean for every data value (data value mean). 3. Square all the values from #2 and find the sum. 4. Divide the sum in #3 by n 1. This calculation produces the sample variance. 5. Take the square root of #4. This number produces the sample standard deviation. The same (general) procedure applies for finding the population variance and standard deviation except we use the population mean and divide by N instead of n 1. Ch3: Data Description Santorico Page 81

Group work: Compute the range, sample variance and sample standard deviation for Jarrod s bowling scores. If this was our whole population, what will differ? For the sake of comparison, 2 John s statistics are: x 185, s 1570, s 39.6. Ch3: Data Description Santorico Page 82

And to check your answer, John s statistics are: 2 x 185, s 13.6, s 3.7 Ch3: Data Description Santorico Page 83

Uses of the Variance and Standard Deviation 1. To determine the spread of data. The larger the variance or standard deviation, the greater the data are dispersed. 2. Makes it easy to compare the dispersion of two or more data sets to decide which is more spread out. 3. To determine the consistency of a variable. E.g., the variation of nuts and bolts in manufacturing must be small. 4. Frequently used in inferential statistics, as we will see later in the book. 5. Empirical rule Ch3: Data Description Santorico Page 84

Empirical Rule Works ONLY for symmetric, unimodal curves (bell-shaped curves): Approximately 68% of data falls within 1 standard deviation of the mean, x s Approximately 95% of data falls within 2 standard deviations of the mean, x 2s Approximately 99.7% of data falls within 3 standard deviations of the mean, x 3s Ch3: Data Description Santorico Page 85

Example: Mothers Heights An article in 1903 published the heights of 1052 mothers. The sample mean was 62.484 inches and the standard deviation was 2.390 inches. Note the summary table below regarding the actual percentages and the empirical rule. Ch3: Data Description Santorico Page 86

Section 3-3: Measure of Position (some of this section we need for use in Section 3-4) Quartiles values that divide the distribution into four groups, separated by Q1, Q2 (median), and Q3. Q1 is the 25 th percentile. Q2 is the 50 th percentile (the median). Q3 is the 75 th percentile. Interquartile Range (IQR) the difference between Q1 and Q3. This is the range of the middle 50% of the data. IQR Q3 Q1 Ch3: Data Description Santorico Page 87

Finding the Quartiles: 1. Arrange the data from smallest to largest. 2. Find the median data value. This is the value for Q2. 3. Find the median of the data values that fall below Q2. This is the value for Q1. (Don t include the median in these values if the number of observations is odd.) 4. Find the median of the data values that are greater than Q2. This is the value for Q3. (Don t include the median in these values if the number of observations is odd.) Ch3: Data Description Santorico Page 88

Example: Find Q1, Q2, and Q3 for the data set 15, 13, 6, 5, 12, 50, 22, 18. Additionally, what is the IQR? Q2 (Median) is (13+15)/2 = 14 Order data: 5, 6, 12, 13, 15, 18, 22, 50 Q1 is (6+12)/2 =9 Q3 is (18+22)/2=20 IQR = Q3 Q1 = 20 9 = 11 Ch3: Data Description Santorico Page 89

Outlier an extremely large or small data value when compared to the rest of the data values. One way to identify outliers is by defining an outlier to be any data value that has value more than 1.5 times the IQR from Q1 or Q3. i.e., a data value is an outlier if it is smaller than Q1 1.5(IQR) or larger than Q3 1.5(IQR). Steps to Identify Outliers: 1. Arrange the data in order and find Q1, Q3, and the IQR. 2. Calculate Q1 1.5(IQR) and Q3 + 1.5(IQR). 3. Find any data values smaller than the Q1 1.5(IQR) or larger than Q3 + 1.5(IQR). Ch3: Data Description Santorico Page 90

Example: Find the outliers of the data set 15, 13, 6, 5, 12, 50, 22, 18. Note: This is the same data set as the previous example. Step 1: Q1 = 9, Q3 = 20, IQR = 11 Step 2: Q1 1.5(IQR) = 9 1.5(11) = -7.5 Q3 + 1.5(IQR) = 20 + 1.5(11) = 36.5 Step 3: The data value, 50, is considered an outlier. Ch3: Data Description Santorico Page 91

Section 3-4: Exploratory Data Analysis Exploratory Data Analysis (EDA) - Examining data to find out what information can be discovered about the data such as the center and the spread. Uses robust statistics such as the median and interquartile range. Five number summary the minimum, Q1, median (Q2), Q3, and the maximum of a data set. Ch3: Data Description Santorico Page 92

Boxplot a graphical display of the five-number summary (and potentially outliers) using a box and whiskers. Outliers are indicated by * on the boxplot. Steps for constructing a boxplot: 1. Order the data and calculate the five-number summary. 2. Determine if there are any outliers. 3. Draw and scale an axis (either horizontal or vertical) that includes both the minimum and the maximum. 4. Draw a box extending from Q1 to Q3. 5. Draw a bar across the box at the value of the median. 6. Draw bars extending from away from the box extending to the most extreme values that are not outliers. 7. Draw stars for all of the outliers. Note: There are many slight variations of boxplots. The book s version of a boxplot doesn t mark outliers, just the max and min with bars. The book calls our boxplot a modified boxplot. Ch3: Data Description Santorico Page 93

Example: Create a boxplot for the data set 15, 13, 6, 5, 12, 50, 22, 18. Note: This is the same data set as the previous example. Min = 5 Q1 = 9 Q2 = 14 Q3 = 20 Max = 50 Outlier at 50 Outlier Largest point not an outlier Q3 Q2 Q1 Smallest point that is not an outlier Ch3: Data Description Santorico Page 94

Example: Draw a boxplot for the following data set: 2, 5, 5, 7, 7, 8, 9, 10, 10, 10, 10, 14, 17, 20 Q 1 =7, Median=9.5, Q 3 =10 Ch3: Data Description Santorico Page 95

In general, a box plot gives us the following information: If the median is near the center of the box and the bars are about the same length, the distribution is approximately symmetric. If the median is lower than the middle of the box and the bar going in the positive direction is larger than the one in the negative direction, then the distribution is positively-skewed (right-skewed). If the median is higher than the middle of the box and the bar going in the negative direction is larger than the one in the positive direction, then the distribution is negatively-skewed (left-skewed). Ch3: Data Description Santorico Page 96

Symmetric Distribution Positivelyskewed (Rightskewed) Negativelyskewed (Leftskewed) Ch3: Data Description Santorico Page 97