Lecture Notes 3: Data summarization

Highlights: average, median, quartiles, 5-number summary (and relation to boxplots), outliers, range and IQR, variance and standard deviation, determining shape using mean and median.

Some important characteristics of a data set
Location: Where is the data set located along a number line? Where is its center?
Spread: How dispersed (i.e. spread out) is the data?
Outliers: Are there any unusual values in the data set?
Shape: What is the shape of the distribution of values in the data set?

Location Statistics: Mean, Median & Quartiles
In these notes, we will look at some common descriptive statistics that are useful for summarizing a data set. Recall that a statistic is any number calculated from a set of data. The most succinct way to describe the location of a data set is to identify its center. Two statistics are commonly used to describe the center: the mean and the median.

Sample average
The sample average (a.k.a. the mean) is the sum of the data divided by the sample size. We denote the mean by $\bar{x}$, read "x bar". The sample size is the number of observations in the sample, and is denoted $n$. The sum of all the observations in a sample is denoted by $\sum x_i$. So, our formula for the sample mean is
$$\bar{x} = \frac{\sum x_i}{n}$$

Sample Average Example
Suppose we are interested in the average undulation rate (in Hz) of a paradise tree snake, which undulates after jumping from a tree in order to glide away. We take a sample of n = 8 snakes and somehow measure the rates at which they undulate as they propel themselves from a source. The eight observed rates are 0.9, 1.4, 1.2, 1.2, 1.3, 2.0, 1.4, 1.6.

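Sample Average Example
So, for this sample, we can compute:
$$\bar{x} = \frac{\sum x_i}{n} = \frac{0.9 + 1.4 + 1.2 + 1.2 + 1.3 + 2.0 + 1.4 + 1.6}{8} = \frac{11.0}{8} = 1.375 \text{ Hz}$$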
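The same computation can be sketched in a few lines of Python. This is only an illustration of the formula; the variable names are made up for the example.

```python
# Undulation rates (Hz) for the n = 8 snakes in the example.
rates = [0.9, 1.4, 1.2, 1.2, 1.3, 2.0, 1.4, 1.6]

n = len(rates)        # sample size
total = sum(rates)    # sum of all the observations
mean = total / n      # sample mean, "x bar"

print(f"n = {n}, sum = {total:.1f}, mean = {mean:.3f} Hz")
```

The printed mean should agree with the hand calculation, 11.0 / 8 = 1.375 Hz.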

Median
If you put data in order from the smallest to the largest values, the number in the middle is called the median. The median separates the bottom 50% of the data from the top 50% of the data. If the sample size is odd, the median will be a value in your sample. If the sample size is even, the median will be the average of the middle two numbers in your sample.

Computing the median
1) Order the data set, smallest to largest.
2) Compute the rank of the median using Rank = (n + 1)/2. The rank tells you which ordered observation will be the median.
3) If Rank is an integer value, go right to it in the sorted data set. Otherwise, compute the average of the two surrounding observations.
For instance, if rank = 5, then the median is the 5th ordered observation. If rank = 5.5, then the median is the average of the 5th and 6th ordered observations.

Computing the Median
The data set below is already ordered. There are 19 observations.
49 69 70 70 73 78 81 81 96 96 105 110 116 116 117 121 137 142 151
Find the rank of the median using (n + 1)/2 = (19 + 1)/2 = 10. Now go to this observation by counting from the start of the data set to the rank of the median: the 10th ordered observation is 96, so the median is 96. You can verify that this is the median by making sure that there are the same number of observations above it as there are below it (here, nine on each side).

Computing the Median
The data set below is already ordered. There are 20 observations.
49 69 70 70 73 78 81 81 96 96 105 110 116 116 117 121 137 142 151 175
Find the rank of the median using (n + 1)/2 = (20 + 1)/2 = 10.5. In this case, the rank is between two integers, so the median will be the average of the 10th and 11th ordered observations: (96 + 105)/2 = 100.5.
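The rank procedure can be written as a short Python function. This is a minimal sketch of the steps described in these notes; the function name `median_by_rank` is made up for the illustration.

```python
def median_by_rank(data):
    """Median via the rank rule: Rank = (n + 1) / 2 on the ordered data."""
    ordered = sorted(data)              # step 1: order smallest to largest
    n = len(ordered)
    if n == 0:
        raise ValueError("median of an empty data set is undefined")
    rank = (n + 1) / 2                  # step 2: 1-based rank, may end in .5
    lower = int(rank)                   # rank rounded down
    if rank == lower:                   # step 3a: integer rank, go right to it
        return ordered[lower - 1]
    # step 3b: average the two surrounding observations
    return (ordered[lower - 1] + ordered[lower]) / 2


data19 = [49, 69, 70, 70, 73, 78, 81, 81, 96, 96,
          105, 110, 116, 116, 117, 121, 137, 142, 151]
data20 = data19 + [175]

print(median_by_rank(data19))   # rank is 10
print(median_by_rank(data20))   # rank is 10.5
```

For the two data sets above, the function should return 96 and 100.5 respectively.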

Location Statistics: Quartiles
The median breaks the data set into two halves. Quartiles break the data set into 4 quarters. The lower quartile, Q1, is the median of all the data below the overall median. The upper quartile, Q3, is the median of all the data above the overall median.

Computing Quartiles
49 69 70 70 73 78 81 81 96 96 105 110 116 116 117 121 137 142 151 175
Here, there are 10 observations below the median. We can find their median, Q1, in the usual manner: the rank is (10 + 1)/2 = 5.5, so Q1 is the average of the 5th and 6th ordered observations, (73 + 78)/2 = 75.5. Q1 separates the lower 25% from the upper 75% of the data.

Computing Quartiles
49 69 70 70 73 78 81 81 96 96 105 110 116 116 117 121 137 142 151 175
Likewise, there are 10 observations above the median. We can use the same rank we used to find Q1 (5.5), but start counting from the first observation above the overall median: Q3 is the average of 117 and 121, which is 119. Q3 separates the lower 75% from the top 25% of the data.
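A minimal Python sketch of this "median of each half" method follows. The function name `quartiles` is made up for the illustration, and the sketch assumes the convention from these notes, where the overall median is left out of both halves when n is odd. Statistical software may use other quartile conventions and give slightly different values.

```python
from statistics import median


def quartiles(data):
    """Return (Q1, median, Q3) using the median-of-halves method."""
    ordered = sorted(data)
    n = len(ordered)
    if n < 2:
        raise ValueError("need at least two observations")
    half = n // 2
    lower = ordered[:half]        # observations below the overall median
    upper = ordered[n - half:]    # observations above the overall median
    return median(lower), median(ordered), median(upper)


data20 = [49, 69, 70, 70, 73, 78, 81, 81, 96, 96,
          105, 110, 116, 116, 117, 121, 137, 142, 151, 175]

q1, med, q3 = quartiles(data20)
print(q1, med, q3)
```

For the 20 observations above, this should give Q1 = 75.5, median = 100.5, and Q3 = 119.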

Computing Quartiles
A brief aside: when the sample size is odd, it will not be the case that *exactly* 50% of the data is below the median or that *exactly* 50% is above it. This is because the median itself is not counted as being in either the upper or lower half of the data set. For reasonably large data sets, we may say things like "50% of the data is above the median" and "25% of the data is below Q1", even though in some cases these are approximations.

Computing Quartiles
Note that for relatively small data sets, you may be able to eyeball the data to find the median, Q1, and Q3, rather than using rank. For instance, it is not challenging to find the median and quartiles for the snake undulation rate data set of size n = 8 from before. Simply order the numbers 0.9, 1.4, 1.2, 1.2, 1.3, 2.0, 1.4, 1.6 from smallest to largest, and you can quickly see where the median and quartiles lie:
0.9 1.2 1.2 1.3 1.4 1.4 1.6 2.0
The median is (1.3 + 1.4)/2 = 1.35, Q1 is (1.2 + 1.2)/2 = 1.2, and Q3 is (1.4 + 1.6)/2 = 1.5.

Location Statistics: Extremes
We are also often interested in the extremes of a data set. These extreme values are referred to as the minimum and the maximum. "Extreme" in this context doesn't necessarily mean really big or really small. It just means the biggest or the smallest.

The 5-number summary
The 5-number summary can be used to summarize a data set. It consists of the minimum, Q1, the median, Q3, and the maximum. These are all measures of location.
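As an illustration, the 5-number summary for the snake undulation data can be assembled in Python. This is a sketch that assumes the median-of-halves convention for quartiles used in these notes; the dictionary keys are just labels chosen for the example.

```python
from statistics import median

# Snake undulation rates (Hz), sorted smallest to largest.
rates = sorted([0.9, 1.4, 1.2, 1.2, 1.3, 2.0, 1.4, 1.6])
n = len(rates)
half = n // 2

summary = {
    "min": rates[0],
    "Q1": median(rates[:half]),         # median of the lower half
    "median": median(rates),
    "Q3": median(rates[n - half:]),     # median of the upper half
    "max": rates[-1],
}

for name, value in summary.items():
    print(f"{name:>6}: {value:.2f}")
```

These five values are exactly what a boxplot of the same data would draw.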

Boxplots and the 5-number summary
Boxplots graphically illustrate the 5 values in a 5-number summary. Sometimes boxplots are called box-and-whisker plots.
[Figure: boxplot of height (female), axis from 60 to 75]

Boxplots and the 5-number summary
Boxplots can be displayed horizontally or vertically. The dark line inside the box is the median. The edges of the box are Q1 and Q3. The whiskers extend to either the min and max, or to the furthest non-outliers.

Boxplots and the 5-number summary
Outliers are represented as dots on a boxplot. Note: 50% of the data is inside the box, 25% is below the box, and 25% is above the box.

Outliers
Outliers are data points that are located far away from where the majority of the data lie. There is not universal agreement on what the standard should be for classifying an observation as an outlier. It is to some extent subjective. Data analysis software packages will have internal standards by which they decide which values should be considered outlying.
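To make "internal standards" concrete, here is a sketch of one widely used convention, which is an assumption here and not something these notes prescribe: flag any value more than 1.5 × IQR below Q1 or above Q3, where IQR = Q3 - Q1. The multiplier `K` and the variable names are illustrative, and the quartiles use the median-of-halves method from these notes.

```python
from statistics import median

data = [49, 69, 70, 70, 73, 78, 81, 81, 96, 96,
        105, 110, 116, 116, 117, 121, 137, 142, 151, 175]

ordered = sorted(data)
n = len(ordered)
half = n // 2
q1 = median(ordered[:half])         # lower quartile
q3 = median(ordered[n - half:])     # upper quartile
iqr = q3 - q1

K = 1.5  # assumed multiplier; other choices are possible
lower_fence = q1 - K * iqr
upper_fence = q3 + K * iqr

outliers = [x for x in ordered if x < lower_fence or x > upper_fence]
print(lower_fence, upper_fence, outliers)
```

Under this particular rule, none of the 20 observations is flagged: the fences are 10.25 and 184.25, and every value lies between them. A different standard could give a different answer.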

Outliers
It's usually a good idea to look more closely at an outlier to see if it is real or if it is a mistake. The outlier might be an improperly entered data value. Data entry is a tedious process and sometimes people make mistakes. The outlier might be in different units than the rest of the data. For instance, in the questionnaires from the first day of class, a few students gave their heights in centimeters rather than inches. If these heights had not been converted, then our class dataset would have shown students over 12 feet tall.

Outliers
Outliers are often real, accurate pieces of data that are simply unusual. For instance, most people work 35-40 hours per week. However, a very small number work 70-80 hours a week. It is sometimes tempting to remove outliers from a data set, but we must first find out whether the outlier is a legitimate observation or a mistake.

Dispersion (Spread)
Here is a good piece of advice: "Do not cross a river if it is, on average, 4 feet deep." -Nassim Taleb, The Black Swan
Why is this good advice? What additional information would we need before we decide if crossing the river is a good idea?

Dispersion (Spread)
Information about location (average or median) is not enough to adequately summarize a data set. Sometimes the "average" individual doesn't exist. For example, the average human being has one ovary and one testicle. Information about how your data is dispersed is also useful, and is essential in inferential statistics. We don't just want to know where the center of our data lies; we also want to know how spread out the data is!

The Range
The range is the easiest measure of dispersion to compute. It is the difference between the maximum value and the minimum value. One problem with using the range is that it doesn't tell you whether most of the data is spread out through the whole range, or if the maximum and minimum values are outliers.

The IQR
The inter-quartile range (IQR = Q3 - Q1) is not affected by extreme values, since it is calculated using values that lie close to the center of the data set. We will not use either the range or the IQR when we move on to inferential statistics. But they are still useful as descriptive statistics.
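Both measures are one subtraction each. The sketch below computes them for the 20-observation data set used earlier in these notes, again assuming the median-of-halves convention for Q1 and Q3; the variable names are illustrative.

```python
from statistics import median

data = [49, 69, 70, 70, 73, 78, 81, 81, 96, 96,
        105, 110, 116, 116, 117, 121, 137, 142, 151, 175]

ordered = sorted(data)
n = len(ordered)
half = n // 2

data_range = ordered[-1] - ordered[0]   # maximum minus minimum
q1 = median(ordered[:half])             # median of the lower half
q3 = median(ordered[n - half:])         # median of the upper half
iqr = q3 - q1                           # spread of the middle 50%

print(f"range = {data_range}, IQR = {iqr}")
```

Here the range is 175 - 49 = 126 and the IQR is 119 - 75.5 = 43.5. Only the range depends directly on the two most extreme observations.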

Variance
The variance is another measure of dispersion. It is closely related to the standard deviation, which we will consider shortly. Unlike the range or IQR, the variance statistic is computed using all of the data values in a data set. It is sensitive to outliers, but the effects of extreme values are diluted if there are a large number of observations.

Sum of Squared Deviations
To compute the variance of a data set, we first need a statistic called the sum of squared deviations. This is often abbreviated as SS, for "sum of squares". To get the squared deviation for a single observation, subtract the mean from this observation, and then square the result. Do this for all observations and sum the results. This gives us the sum of squared deviations. Mathematically,
$$SS = \sum (x_i - \bar{x})^2$$

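Sum of Squared Deviations
Example: find the sum of squared deviations (SS) for our snake undulation rate data set:
0.9 1.4 1.2 1.2 1.3 2.0 1.4 1.6
With $\bar{x} = 1.375$,
$$SS = \sum (x_i - \bar{x})^2 = (0.9 - 1.375)^2 + (1.4 - 1.375)^2 + \dots + (1.6 - 1.375)^2 = 0.735$$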
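The step-by-step recipe (subtract the mean, square, then sum) can be sketched in Python for the eight undulation rates. The variable names are made up for the illustration.

```python
rates = [0.9, 1.4, 1.2, 1.2, 1.3, 2.0, 1.4, 1.6]

mean = sum(rates) / len(rates)                    # sample mean, x bar

# Squared deviation of each observation from the mean.
squared_devs = [(x - mean) ** 2 for x in rates]

ss = sum(squared_devs)                            # sum of squared deviations

for x, d in zip(rates, squared_devs):
    print(f"x = {x:.1f}   (x - mean)^2 = {d:.6f}")
print(f"SS = {ss:.3f}")
```

Printing the individual squared deviations shows how much each observation contributes: the rate of 2.0 Hz, the furthest from the mean, accounts for more than half of SS.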

Sample Variance
The sample variance is denoted by the symbol $s^2$. Mathematically,
$$s^2 = \frac{SS}{n - 1} = \frac{\sum (x_i - \bar{x})^2}{n - 1}$$
In plain English, the variance is roughly the average squared distance that a group of n points lies from the mean of the group. This is not a very intuitive concept, though it is very often used in mathematical computations.

Sample Standard Deviation
The sample standard deviation is simply the square root of the sample variance. It is denoted by the letter $s$. Continuing with our example, we have:
$$s = \sqrt{s^2} = \sqrt{\frac{SS}{n - 1}} = \sqrt{\frac{0.735}{7}} = \sqrt{0.105} \approx 0.324 \text{ Hz}$$
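As a check on the arithmetic, the sketch below computes the sample variance and standard deviation of the undulation rates two ways: by hand from SS, and with `variance` and `stdev` from Python's standard `statistics` module, both of which divide by n - 1. The variable names are illustrative.

```python
from math import sqrt
from statistics import stdev, variance

rates = [0.9, 1.4, 1.2, 1.2, 1.3, 2.0, 1.4, 1.6]
n = len(rates)

# By hand: SS divided by (n - 1), then the square root.
mean = sum(rates) / n
ss = sum((x - mean) ** 2 for x in rates)
s2_by_hand = ss / (n - 1)
s_by_hand = sqrt(s2_by_hand)

# Library versions (sample variance and sample standard deviation).
s2_lib = variance(rates)
s_lib = stdev(rates)

print(f"s^2 = {s2_by_hand:.3f} (by hand), {s2_lib:.3f} (library)")
print(f"s   = {s_by_hand:.3f} (by hand), {s_lib:.3f} (library)")
```

Both routes should agree: a variance of about 0.105 and a standard deviation of about 0.324 Hz.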

Interpret the Standard Deviation
The standard deviation can be thought of roughly as an average distance that a group of points lies from the group mean. A large standard deviation tells you that your data is highly dispersed, or spread out. In inferential statistics, a large standard deviation signifies high levels of uncertainty regarding statistical inferences. Note that what counts as "large" or "small" depends on the magnitude of the data itself.

Shapes of Distributions
You don't always need a histogram to get a rough sense of the shape of a distribution. Comparing the values of the mean and the median of your data set is often enough.
[Figure: histogram of Grades (x-axis, 30 to 110) against Frequency (y-axis, 0 to 9); median = 92, mean = 86]

Shapes of Distributions
What is the shape of this distribution? Note that the mean is 86, and the median is 92.
[Figure: histogram, x-axis from 30 to 110; median = 92, mean = 86]

Shapes of Distributions
What is the shape of this distribution? Note that the mean is 2.6, and the median is 0.6.
[Figure: histogram, x-axis from 0 to 14; median = 0.6, mean = 2.6]

Shapes of Distributions
What is the shape of this distribution? Note that the mean is 102, and the median is 102.
[Figure: histogram, x-axis from 0 to 180; median = 102, mean = 102]

Mean, Median, & Shape
If the mean is greater than the median, then the distribution is skewed to the right.
If the mean is less than the median, then the distribution is skewed to the left.
If the mean and median are (approximately) equal, then the distribution is (approximately) symmetric.
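These three rules translate directly into a small Python function. This is a rough sketch: the function name `describe_shape` and the `tolerance` parameter (how close the mean and median must be to count as "approximately equal") are assumptions made for the illustration, and a sensible tolerance depends on the scale of the data.

```python
def describe_shape(mean, median, tolerance):
    """Rule-of-thumb shape from the mean and median.

    `tolerance` is a hypothetical cutoff for "approximately equal".
    """
    if abs(mean - median) <= tolerance:
        return "approximately symmetric"
    if mean > median:
        return "skewed to the right"
    return "skewed to the left"


# The three (mean, median) pairs from the histograms in these notes,
# with an assumed tolerance of 1.0 for each.
print(describe_shape(86, 92, tolerance=1.0))
print(describe_shape(2.6, 0.6, tolerance=1.0))
print(describe_shape(102, 102, tolerance=1.0))
```

With the assumed tolerance, the grades example (mean 86, median 92) comes out skewed to the left, the second example (mean 2.6, median 0.6) skewed to the right, and the third (mean and median both 102) approximately symmetric.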

Conclusion
A statistic is any number calculated from a set of data. Descriptive statistics are numbers that are used to describe important features of a data set. The mean and median are very commonly used statistics which refer to location. The standard deviation is a very commonly used statistic which refers to dispersion. In the next set of notes, we will look at probability and the normal distribution, which will lay the groundwork for understanding inferential statistics.