Week 4: Describing data and estimation


Goals

- Investigate sampling error; see that larger samples have less sampling error.
- Visualize confidence intervals.
- Calculate basic summary statistics using R.
- Calculate confidence intervals for the mean with R.

Learning the Tools

Missing data

Sometimes we do not have all variables measured on all individuals in the data set. When this happens, we need a placeholder in our data files so that R knows that the data are missing. The standard way of doing this in R is to put NA (without quotes) in the location where the data would have gone. NA is short for "not available".

For example, in the Titanic data set, we do not know the age of several passengers. Let's look at it. Load the Titanic data set:

> titanicdata <- read.csv("data/titanic.csv")

Have R print out the age variable, which you can do simply by typing its name:

> titanicdata$age

If you look through the results, you will see that most individuals have numbers in this list, but some have NA. These NAs are the people for whom we do not have age information.

By the way, the titanic.csv file simply has nothing in the places where there is missing data. When R loaded it, it replaced the empty spots with NA automatically.

Measures of location

mean()

This week we want to use R to give some basic descriptive statistics for numerical data. We have already seen in Week 1 how to calculate the mean of a vector of data using mean(). Unfortunately, if there are missing data we need to tell R how to deal with them. A (somewhat annoying) quirk of R is that if we try to take the mean of a list of numbers that includes missing data, we get NA for the result!

> mean(titanicdata$age)
[1] NA

na.rm = TRUE

To get the mean of all the numbers that we do have, we have to add an option to the mean() function. This option is na.rm = TRUE.

> mean(titanicdata$age, na.rm = TRUE)
[1] 31.19418

This tells R to remove ("rm") the NAs before taking the mean. It turns out that the mean age of the passengers for whom we have information was about 31.2.

na.rm = TRUE can be added to many functions in R, including median(), as we shall see next.

median()

The median of a series of numbers is the middle number: half of the numbers in the list are greater than the median and half are below it. It can be calculated in R by using median().

> median(titanicdata$age, na.rm = TRUE)
[1] 30

summary()

A handy function that will return both the mean and median at the same time (along with other information such as the first and third quartiles) is summary().

> summary(titanicdata$age)
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max.    NA's
 0.1667 21.0000 30.0000 31.1900 41.0000 71.0000     680
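As a small illustration of how missing values behave, the sketch below uses a short made-up vector (the name ages and its values are assumptions for illustration, not taken from titanic.csv). It shows a single NA turning the mean into NA, how the missing entries can be counted, and that na.rm = TRUE gives the same answer as dropping the NAs by hand.

```r
# Hypothetical ages, not from titanic.csv; NA marks a missing value.
ages <- c(22, 38, NA, 35, 54, NA, 2, 27)

# Without na.rm, one NA is enough to make the result NA.
mean_with_na <- mean(ages)

# Count the missing values, then summarize the values we do have.
n_missing  <- sum(is.na(ages))
mean_age   <- mean(ages, na.rm = TRUE)
median_age <- median(ages, na.rm = TRUE)

# Equivalent by hand: drop the NAs first, then take the mean.
mean_by_hand <- mean(ages[!is.na(ages)])

print(c(missing = n_missing, mean = mean_age, median = median_age))
```

The count from sum(is.na(...)) is the same quantity that summary() reports in its NA's column.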

From left to right, the summary() output gives us the smallest (minimum) value in the list ("Min."), the first quartile ("1st Qu."), the median, the mean, the third quartile ("3rd Qu."), the largest (maximum) value ("Max."), and finally, the number of individuals with missing values ("NA's").

The first quartile is the value in the data that is larger than a quarter of the data points. The third quartile is larger than three quarters of the data. These are also called the 25th percentile and the 75th percentile, respectively. (You may remember these from boxplots, where the top and bottom of the box mark the 75th and 25th percentiles, respectively.)

Measures of variability

R can also calculate measures of the variability of a sample. In this section we'll learn how to calculate the variance, standard deviation, coefficient of variation and interquartile range of a set of data.

var()

To calculate the variance of a list of numbers, use var().

> var(titanicdata$age, na.rm = TRUE)
[1] 217.4895

Note that var(), as well as sd() below, has the same need for na.rm = TRUE when analyzing data that include missing values.

sd()

The standard deviation can be calculated by sd().

> sd(titanicdata$age, na.rm = TRUE)
[1] 14.74753

Of course, the standard deviation is the same as the square root of the variance.

> sqrt(var(titanicdata$age, na.rm = TRUE))
[1] 14.74753

Coefficient of variation

Surprisingly, there is no standard function in R to calculate the coefficient of variation. You can do this yourself, though, directly from the definition (100 times the standard deviation divided by the mean):

> 100 * sd(titanicdata$age, na.rm = TRUE) / mean(titanicdata$age, na.rm = TRUE)
[1] 47.27653

IQR()

The interquartile range (or IQR) is the difference between the third quartile and the first quartile; in other words, the range covered by the middle half of the data. It can be calculated easily with IQR().

> IQR(titanicdata$age, na.rm = TRUE)
[1] 20

Note that this is the same value we could calculate from the results of summary() above. The third quartile is 41 and the first quartile is 21, so the difference is 41 - 21 = 20.

Confidence intervals of the mean

The confidence interval for an estimate tells us a range of values that is likely to contain the true value of the parameter. For example, in 95% of random samples the 95% confidence interval of the mean will contain the true value of the mean.

R does not have a simple built-in function to calculate only the confidence interval of the mean, but the function that calculates t-tests will give us this information. (See chapters 11 and 12 in the text or Weeks 7 and 8 of these labs for more on t-tests.)

The function t.test() has many results in its output. By adding $conf.int to this function we get back only the confidence interval for the mean. By default it gives us the 95% confidence interval.

> t.test(titanicdata$age)$conf.int
[1] 30.04312 32.34524
attr(,"conf.level")
[1] 0.95

As the result above shows, the 95% confidence interval of the mean of age in the titanicdata data set is from about 30.0 to 32.3. R also tells us that it used a 95% confidence level for its calculation. (The confidence interval is not so useful in this case, because we actually have information for nearly all the individuals on the Titanic.)
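For readers who want to see where such an interval comes from, here is a minimal sketch that rebuilds a 95% confidence interval from the usual t-based formula (mean plus or minus a critical t value times the standard error) and compares it with what t.test() reports. The vector x and its values are made up for illustration and are not from the lab's data files.

```r
# Hypothetical measurements, not from the lab's data files.
x <- c(12.1, 9.8, 11.4, 10.6, 13.0, 10.2, 11.9, 9.5)

n         <- length(x)
std_error <- sd(x) / sqrt(n)        # standard error of the mean
t_crit    <- qt(0.975, df = n - 1)  # two-sided 95% critical value of t

# Interval by hand: mean minus and plus the margin of error.
ci_by_hand <- mean(x) + c(-1, 1) * t_crit * std_error

# Interval from t.test(); as.numeric() drops the conf.level attribute.
ci_t_test <- as.numeric(t.test(x)$conf.int)

print(ci_by_hand)
print(ci_t_test)
```

The two printed intervals should agree. Note also that t.test() drops missing values on its own, which is why the call on titanicdata$age above worked without na.rm = TRUE.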

To calculate confidence intervals with a different level of confidence, we can add the option conf.level to the t.test() function. For example, for a 99% confidence interval we can use the following.

> t.test(titanicdata$age, conf.level = 0.99)$conf.int
[1] 29.67976 32.70861
attr(,"conf.level")
[1] 0.99
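The claim that a 95% confidence interval contains the true mean in about 95% of random samples can be checked with a small simulation. The sketch below is only an illustration under stated assumptions: the population is taken to be normal with a known mean of 50 and standard deviation of 10, and the seed, sample size and number of repeated samples are arbitrary choices.

```r
set.seed(1)  # arbitrary seed so that the run is reproducible

true_mean   <- 50    # assumed population mean
n_samples   <- 1000  # number of repeated random samples
sample_size <- 25    # individuals per sample

# For each sample, record whether its 95% interval contains true_mean.
covers <- replicate(n_samples, {
  x  <- rnorm(sample_size, mean = true_mean, sd = 10)
  ci <- t.test(x, conf.level = 0.95)$conf.int
  ci[1] <= true_mean && true_mean <= ci[2]
})

coverage <- mean(covers)  # proportion of intervals that contain true_mean
print(coverage)
```

The printed proportion should be close to 0.95, though not exactly equal to it because only a finite number of samples is drawn. Changing conf.level in the call to t.test() changes the proportion accordingly.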


Activities

Distribution of sample means

Go to the web and open the page at http://www.zoology.ubc.ca/~whitlock/kingfisher/samplingnormal.htm. This page contains some interactive visualizations that let you play around with sampling to see the distribution of sample means. Click the button that says "Tutorial" near the bottom of the page and follow along with the instructions.

Confidence intervals

Go back to the web, and open the page at http://www.zoology.ubc.ca/~whitlock/kingfisher/cimeans.htm. This applet draws confidence intervals for the mean of a known population. Click "Tutorial" again, and follow along.

Questions

1. The data file "students.csv" should include data that your classmates collected on themselves during the first week of lab, in chapter 1. Open that file in R.

   a. Use R to calculate the mean height of all students in the class.

   b. Plot the distribution of heights in the class. Describe the shape of the distribution. Is it symmetric or skewed? Is it unimodal or bimodal?

      Key point: A distribution is skewed if it is asymmetric. A distribution is skewed right if there is a long tail to the right, and skewed left if there is a long tail to the left.

   c. Are there any large outliers that look as though a student used the wrong units for their height measurement? (I.e., are there any that are more plausibly a height given in inches rather than the requested centimeters?) If so, and if this is not likely to be an accurate description of an individual in your class, use filter() from the package dplyr to create a new data set without those rows.

   d. Use sd() to calculate the standard deviation of height.

2. The file "caffeine.csv" contains data on the amount of caffeine in a 16 oz. cup of coffee obtained from various vendors. For context, doses of caffeine over 25 mg are enough to increase anxiety in some people, and doses over 300 to 360 mg are enough to significantly increase heart rate in most people. A can of Red Bull contains 80 mg of caffeine.

   a. What is the mean amount of caffeine in 16 oz. coffees?

   b. What is the 95% confidence interval for the mean?

   c. Plot the frequency distribution of caffeine levels for these data in a histogram. Is the amount of caffeine in a cup of coffee relatively consistent from one vendor to another? What is the standard deviation of caffeine level? What is the coefficient of variation?

   d. The file "caffeinestarbucks.csv" has data on six 16 oz. cups of Breakfast Blend coffee sampled on six different days from a Starbucks location. Calculate the mean (and the 95% confidence interval for the mean) for these data. Compare these results to the data taken on the broader sample of vendors in the first file. Describe the difference.

3. A confidence interval is a range of values that is likely to contain the true value of a parameter. Consider the "caffeine.csv" data again.

   a. Calculate the 99% confidence interval for the mean caffeine level.

   b. Compare this 99% confidence interval to the 95% confidence interval you calculated in question 2b. Which confidence interval is wider (i.e., spans a broader range)? Why should this one be wider?

   c. Let's compare the quantiles of the distribution of caffeine to this confidence interval. Approximately 95% of the data values should fall between the 2.5% and 97.5% quantiles of the distribution of caffeine. (Explain why this is true.) We can use R to calculate the 2.5% and 97.5% quantiles with a command like the following. (Replace datavector with the name of the vector of your caffeine data.)

      > quantile(datavector, c(0.025, 0.975), na.rm = TRUE)

      Are these the same as the boundaries of the 95% confidence interval? If not, why not? Which should bound a smaller region, the quantiles or the confidence interval of the mean?

4. Return to the class data set "students.csv". Find the mean value of "number of siblings". Add one to this to find the mean number of children per family in the class.

   a. The mean number of offspring per family twenty years ago was about 2. Is the value for this class similar, greater, or smaller? If different, think of reasons for the difference.

   b. Are the families represented in this class systematically different from the population at large? Is there a potential sampling bias?

   c. Consider the way in which the data were collected. How many families with zero children are represented? Why? What effect does this have on the estimated mean family size of all couples?

5. Return to the data on countries of the world, in "countries.csv", and the data from the class in "students.csv". Plot the distributions for ecological_footprint_2000, cell_phone_subscriptions_per_100_people_2012, and life_expectancy_at_birth_female.

   a. For each variable, plot a histogram of the distribution. Is the variable skewed? If so, in which direction?

   b. For each variable, calculate the mean and median. Are they similar? Match the difference in mean and median to the direction of skew on the histogram. Do you see a pattern?