STA 570 Spring Lecture 5 Tuesday, Feb 1

Similar documents
Chapter 3 - Displaying and Summarizing Quantitative Data

Chapter 2 Describing, Exploring, and Comparing Data

Chapter 1. Looking at Data-Distribution

STP 226 ELEMENTARY STATISTICS NOTES PART 2 - DESCRIPTIVE STATISTICS CHAPTER 3 DESCRIPTIVE MEASURES

STA Module 2B Organizing Data and Comparing Distributions (Part II)

STA Learning Objectives. Learning Objectives (cont.) Module 2B Organizing Data and Comparing Distributions (Part II)

Further Maths Notes. Common Mistakes. Read the bold words in the exam! Always check data entry. Write equations in terms of variables

Math 120 Introduction to Statistics Mr. Toner s Lecture Notes 3.1 Measures of Central Tendency

Univariate Statistics Summary

M7D1.a: Formulate questions and collect data from a census of at least 30 objects and from samples of varying sizes.

STA Rev. F Learning Objectives. Learning Objectives (Cont.) Module 3 Descriptive Measures

Vocabulary. 5-number summary Rule. Area principle. Bar chart. Boxplot. Categorical data condition. Categorical variable.

Data can be in the form of numbers, words, measurements, observations or even just descriptions of things.

CHAPTER 3: Data Description

Prepare a stem-and-leaf graph for the following data. In your final display, you should arrange the leaves for each stem in increasing order.

Table of Contents (As covered from textbook)

Things you ll know (or know better to watch out for!) when you leave in December: 1. What you can and cannot infer from graphs.

UNIT 1A EXPLORING UNIVARIATE DATA

Chapter 6: DESCRIPTIVE STATISTICS

2.1 Objectives. Math Chapter 2. Chapter 2. Variable. Categorical Variable EXPLORING DATA WITH GRAPHS AND NUMERICAL SUMMARIES

15 Wyner Statistics Fall 2013

Averages and Variation

Lecture 6: Chapter 6 Summary

Understanding and Comparing Distributions. Chapter 4

3. Data Analysis and Statistics

Statistical Methods. Instructor: Lingsong Zhang. Any questions, ask me during the office hour, or me, I will answer promptly.

AND NUMERICAL SUMMARIES. Chapter 2

CHAPTER 1. Introduction. Statistics: Statistics is the science of collecting, organizing, analyzing, presenting and interpreting data.

Chapter 2. Descriptive Statistics: Organizing, Displaying and Summarizing Data

Unit I Supplement OpenIntro Statistics 3rd ed., Ch. 1

SLStats.notebook. January 12, Statistics:

Measures of Central Tendency

CHAPTER 2 DESCRIPTIVE STATISTICS

2.1: Frequency Distributions and Their Graphs

Chapter 3: Describing, Exploring & Comparing Data

Measures of Central Tendency. A measure of central tendency is a value used to represent the typical or average value in a data set.

Acquisition Description Exploration Examination Understanding what data is collected. Characterizing properties of data.

Name Date Types of Graphs and Creating Graphs Notes

Math 167 Pre-Statistics. Chapter 4 Summarizing Data Numerically Section 3 Boxplots

10.4 Measures of Central Tendency and Variation

10.4 Measures of Central Tendency and Variation

CHAPTER 2: SAMPLING AND DATA

AP Statistics Summer Assignment:

Algebra 1, 4th 4.5 weeks

Lecture Notes 3: Data summarization

Boxplots. Lecture 17 Section Robb T. Koether. Hampden-Sydney College. Wed, Feb 10, 2010

LAB 1 INSTRUCTIONS DESCRIBING AND DISPLAYING DATA

The first few questions on this worksheet will deal with measures of central tendency. These data types tell us where the center of the data set lies.

STANDARDS OF LEARNING CONTENT REVIEW NOTES ALGEBRA I. 4 th Nine Weeks,

Chapter2 Description of samples and populations. 2.1 Introduction.

Frequency Distributions

UNIT 1: NUMBER LINES, INTERVALS, AND SETS

CHAPTER 2: DESCRIPTIVE STATISTICS Lecture Notes for Introductory Statistics 1. Daphne Skipper, Augusta University (2016)

NAME: DIRECTIONS FOR THE ROUGH DRAFT OF THE BOX-AND WHISKER PLOT

Regression III: Advanced Methods

STANDARDS OF LEARNING CONTENT REVIEW NOTES. ALGEBRA I Part II. 3 rd Nine Weeks,

Chapter 3. Descriptive Measures. Slide 3-2. Copyright 2012, 2008, 2005 Pearson Education, Inc.

Exploratory Data Analysis

MATH NATION SECTION 9 H.M.H. RESOURCES

Basic Statistical Terms and Definitions

MATH& 146 Lesson 8. Section 1.6 Averages and Variation

a. divided by the. 1) Always round!! a) Even if class width comes out to a, go up one.

Measures of Central Tendency:

Chapter 5. Understanding and Comparing Distributions. Copyright 2010, 2007, 2004 Pearson Education, Inc.

Middle Years Data Analysis Display Methods

DAY 52 BOX-AND-WHISKER

Chapter 5. Understanding and Comparing Distributions. Copyright 2012, 2008, 2005 Pearson Education, Inc.

To calculate the arithmetic mean, sum all the values and divide by n (equivalently, multiple 1/n): 1 n. = 29 years.

1 Overview of Statistics; Essential Vocabulary

MATH 1070 Introductory Statistics Lecture notes Descriptive Statistics and Graphical Representation

Learning Log Title: CHAPTER 7: PROPORTIONS AND PERCENTS. Date: Lesson: Chapter 7: Proportions and Percents

Learner Expectations UNIT 1: GRAPICAL AND NUMERIC REPRESENTATIONS OF DATA. Sept. Fathom Lab: Distributions and Best Methods of Display

Pre-Calculus Multiple Choice Questions - Chapter S2

Chpt 3. Data Description. 3-2 Measures of Central Tendency /40

LESSON 3: CENTRAL TENDENCY

Measures of Dispersion

No. of blue jelly beans No. of bags

ALGEBRA II A CURRICULUM OUTLINE

CHAPTER-13. Mining Class Comparisons: Discrimination between DifferentClasses: 13.4 Class Description: Presentation of Both Characterization and

Section 9: One Variable Statistics

Interactive Math Glossary Terms and Definitions

+ Statistical Methods in

Chapter 2 Modeling Distributions of Data

Data Preprocessing. S1 Teknik Informatika Fakultas Teknologi Informasi Universitas Kristen Maranatha

What are we working with? Data Abstractions. Week 4 Lecture A IAT 814 Lyn Bartram

AP Statistics Prerequisite Packet

The main issue is that the mean and standard deviations are not accurate and should not be used in the analysis. Then what statistics should we use?

Lecture Slides. Elementary Statistics Twelfth Edition. by Mario F. Triola. and the Triola Statistics Series. Section 2.1- #

Name: Date: Period: Chapter 2. Section 1: Describing Location in a Distribution

Descriptive Statistics

The basic arrangement of numeric data is called an ARRAY. Array is the derived data from fundamental data Example :- To store marks of 50 student

TMTH 3360 NOTES ON COMMON GRAPHS AND CHARTS

Week 4: Describing data and estimation

Descriptive Statistics, Standard Deviation and Standard Error

Math 227 EXCEL / MEGASTAT Guide

Overview. Frequency Distributions. Chapter 2 Summarizing & Graphing Data. Descriptive Statistics. Inferential Statistics. Frequency Distribution

Mean,Median, Mode Teacher Twins 2015

IT 403 Practice Problems (1-2) Answers

1.3 Graphical Summaries of Data

Math 214 Introductory Statistics Summer Class Notes Sections 3.2, : 1-21 odd 3.3: 7-13, Measures of Central Tendency

Transcription:

STA 570 Spring 2011 Lecture 5 Tuesday, Feb 1 Descriptive Statistics Summarizing Univariate Data o Standard Deviation, Empirical Rule, IQR o Boxplots Summarizing Bivariate Data o Contingency Tables o Row Proportions, Relative Risk, Odds Ratio Homework 4: Due next week in lab. 1

Standard Deviation Interpretation: Empirical Rule If the histogram of the data is approximately symmetric and bell-shaped, then About 68% of the data are within one standard deviation from the mean About 95% of the data are within two standard deviations from the mean About 99.7% of the data are within three standard deviations from the mean STA 570 - Spring 2011 Lecture 5 2 More precisely: 68.27% 95.45% 99.73%

Yet Another Example Number of people you have known personally who have committed suicide in the last 12 months Response Frequency Percentage 0 1344 88.8 1 133 8.8 2 25 1.7 3 11 0.7 4 1 0.1 Mean = 0.15 Standard Deviation = 0.46 Are 68% of the observations between 0.31 and 0.61? STA 570 - Spring 2011 Lecture 5 3

Standard Deviation (s) vs. Interquartile Range (IQR) Standard Deviation is affected by outliers Interquartile Range is not affected by outliers NB: The Range is most affected by outliers Whenever you use the Median instead of the Mean, you should also use the IQR instead of the Standard Deviation STA 570 - Spring 2011 Lecture5 4

Example Data Sets One Variable Statistical Calculator Modify the data sets and see how mean and median, as well as standard deviation and interquartile range change Look at the histograms and stem-and-leaf plots does the empirical rule apply? Make yourself familiar with the standard deviation Interpreting the standard deviation takes experience STA 570 - Spring 2011 Lecture 5 5

Another Graphical Technique: Boxplots A boxplot is intended to be a SIMPLE plot which allows you to quickly see all the features of the distribution. Basically a graphical version of the five number summary There are marks on a boxplot to show the median, Q 1, Q 3 (together these allow you to see the interquartile range). There are whiskers on a boxplot to show the tails of the distribution, with circles indicating outliers. 6

Constructing a boxplot Draw an axis which covers the entire range of your data. Draw a narrow box extending from Q 1 to Q 3. At the location of the median, draw a line through the box. Define inner fences as Q 1 and Q 3 +. On each side, find the most extreme point within the fences (note the fences are NOT drawn on the plot) 7

Boxplots, continued On each side of the box, draw a line from the end of the box to the most extreme points you found in the previous step. These lines are called whiskers. Define outer fences (again, NOT drawn) to be Q 1 3 IQR and Q 3 + 3 IQR. 8

Boxplots, continued Mark any points appearing between the inner and outer fences with an open circle. These are usually referred to as mild outliers. Mark any points beyond the outer fences with closed circles or asterisks. These are typically referred to as extreme outliers. 9

Boxplot, continued These rules are the most common, but different people and different computer software may use different symbols, and may have different definitions of mild and extreme outliers. 10

Using Boxplots Central location may be visualized using the median, on the plot. Remember the box also contains the middle 50% of the data. Some programs will put a separate mark for the mean of the data. Spread may be visualized with the IQR, the length of the box. You can also see the range of the data. The standard deviation CANNOT be found with a boxplot. 11

Step 1 for boxplot The Box Box extends from Q1 to Q3, with a line for the median. Thus, you can immediately see the median (central location) and the IQR (spread). Note the box contains 50% of the data Q3 Median Q1 12

Step 2 for boxplot The fences Construct the fences. These are NOT in the final product. They are just used to make decisions on outliers. Inner fences are 1.5 IQR from the box, outer fences are 3.0 IQR from the box. IQR Q3 Median Q1 13

Step 2 for boxplot Inner Fences Construct the fences. These are NOT in the final product. They are just used to make decisions on outliers. Inner fences are 1.5 IQR from the box, outer fences are 3.0 IQR from the box. IQR Inner fences 14

Step 2 for boxplot Outer fences Construct the fences. These are NOT in the final product. They are just used to make decisions on outliers. Inner fences are 1.5 IQR from the box, outer fences are 3.0 IQR from the box. IQR Outer Fences 15

Step 3 for boxplot Whiskers The whiskers extend from the box to the point closest to, but still inside, the inner fence. Remember, the whiskers end at a data point, not the inner fences. IQR Whiskers 16

Step 4 for boxplot Mild outliers Mild outliers for a boxplot are defined to be points located between the inner and outer fences. They are denoted by open circles. IQR Mild outliers 17

Step 5 for boxplot Extreme outliers Extreme outliers for a boxplot are defined to be points located beyond the outer fences They are denoted (here) by filled circles. IQR Extreme outliers 18

Remember, the fences are not actually drawn. You can see the four features of distributions easily with a boxplot. Outliers, for example, are explicitly drawn. Final boxplot 19

Using Boxplots Central location is shown through the median (some boxplots will indicate the mean by a + symbol). 20

Using Boxplots Spread is shown through the IQR (you cannot get the standard deviation from a boxplot). You can also see the range of the data, but remember the range is often not that useful. 21

Using Boxplots Shape can be seen through the box and the whiskers. If one side of the box and the corresponding whisker are longer, then the data is skewed in that direction (here left skewed) Note: The axis should be labeled 22

Using boxplots Sometime the box leans one way and the whiskers the other. Then you can t tell that much about shape from the boxplot. This happens most often in small datasets, where there isn t much information about shape in the entire dataset anyway. In general, symmetric gets the benefit of the doubt, so a slight lean isn t enough to conclude skewness. 23

Using boxplots Outliers are drawn explicitly on the plot You don t have to take the definitions of mild and extreme as absolute truth, but it is a sometimes useful convention. 24

In SAS Code proc univariate data=[name] plot; var [variable]; run; Calculates Mean, Median, Quartiles, Minimum, Maximum Range, Interquartile Range, Variance, Standard Deviation and much more Creates Box Plot, Stem and Leaf Plot 25

Side by side boxplots When comparing multiple groups of people (or anything else), boxplots provide a handy method for comparison. By placing the boxplots side by side, you can immediately see similarities and differences in central location, spread, and shape. 26

1970 Draft Lottery months on x axis, draft number on y axis. 27

Review: Summarizing Univariate Data Numerically Center of the data Mean: Arithmetic average (Interval) Median: Midpoint of the observations when they are arranged in increasing order (Interval, Ordinal) Mode: Most frequent value (Interval, Ordinal, Nominal) Dispersion of the data Variance, Standard deviation Interquartile range Range Skewness 28

Sample and Population Measures of Variation Range: maximum - minimum Variance (sample / population): Standard Deviation (sample / population): Interquartile Range: Q3-Q1 29

Sample Variance The variance of n observations is the sum of the squared deviations, divided by n-1. 30

Sample Variance Step By Step 1. Calculate the mean 2. For each observation, calculate the deviation 3. For each observation, calculate the squared deviation 4. Add up all the squared deviations 5. Divide the result by (n-1) (To get the standard deviation, take the square root of the result) 31

Sample Standard Deviation The sample standard deviation s is the positive square root of the sample variance Generally, interpreting the standard deviation takes experience. The empirical rule helps sometimes. 32

Use of the Standard Deviation Empirical rule: If the histogram of the data is approximately symmetric and bell-shaped, then About 68% of the data are within one standard deviation from the mean About 95% of the data are within two standard deviations from the mean About 99.7% of the data are within three standard deviations from the mean Chebysheff s inequality: The proportion of observations that lie within k standard deviations of the mean is at least 1-1/k 2 for k>1 33

Stem and Leaf Plot, Box Plot, Empirical Rule Example: Highway Gas Mileage for 24 Cars Stem Leaf # Boxplot 5 0 1 4 4 00 2 3 5678 4 +-----+ 3 01112 5 *--+--* 2 557889 6 2 0134 4 +-----+ 1 7 1 1 3 1 ----+----+----+----+ Multiply Stem.Leaf by 10**+1 Mean=29.625 Standard Deviation=8.261 Q1=24.5 Median=29.5 Q3=35.5 34

Application of the Empirical Rule Gas Mileage Data Mean = 29.6 Standard Deviation = 8.3 68% of the data ( observations) are supposed to be between and How many observations are actually within one standard deviation from the mean? 95% of the data ( observations) are supposed to be between and How many observations are actually within two standard deviations from the mean? 35

Comparison to Chebycheff s Inequality Gas Mileage Data Mean = 29.6, Standard Deviation = 8.3 Within 2 standard deviation of the mean are at least 1-1/4=3/4=75% of the observations (empirical rule: about 95% when the data is approximately bell-shaped) Within 3 standard deviations are at least 1-1/9=8/9=88.9% Within 4 standard deviations are at least 1-1/16=15/16=93.75% 36

Chebycheff s Inequality Always Works Number of people you have known personally who have committed suicide in the last 12 months Response Frequency Percentage Cumulative Percentage 0 1344 88.8 88.8 1 133 8.8 97.6 2 25 1.7 99.2 3 11 0.7 99.9 4 1 0.1 100.0 Mean = 0.15, Standard Deviation = 0.46 At least 75% of the observations are within 2 standard deviations from the means, that is, between 0.77 and 1.07. At least 88.9% are between -1.23 and 1.53 At least 93.75% are between -1.7 and 2.0 37

There is also: Coefficient of Variation Standardized measure of variation Idea: A standard deviation of 10 may indicate great variability or small variability, depending on the magnitude of the observations in the data set CV = Ratio of standard deviation divided by mean Population and sample version 38

So far: Summarizing Univariate Data Next: Summarizing Bivariate Data Two categorical variables Contingency Table Row/Column Relative Frequencies Relative Risk, Odds Ratio Two quantitative variables Scatter Diagram Regression Line Correlation Coefficient, Coefficient of Determination, Slope and Intercept of Regression Line 39

Quiz! Take a piece of paper and write your name and section number on top of it Please write your answers to the questions legibly. When you are done, quietly leave your seat, turn in the paper, and quietly leave the room Question: 40