Things you ll know (or know better to watch out for!) when you leave in December: 1. What you can and cannot infer from graphs.

Similar documents
Chapter 2 Describing, Exploring, and Comparing Data

STA 570 Spring Lecture 5 Tuesday, Feb 1

STP 226 ELEMENTARY STATISTICS NOTES PART 2 - DESCRIPTIVE STATISTICS CHAPTER 3 DESCRIPTIVE MEASURES

Data can be in the form of numbers, words, measurements, observations or even just descriptions of things.

Chapter 1. Looking at Data-Distribution

2.1 Objectives. Math Chapter 2. Chapter 2. Variable. Categorical Variable EXPLORING DATA WITH GRAPHS AND NUMERICAL SUMMARIES

Table of Contents (As covered from textbook)

AND NUMERICAL SUMMARIES. Chapter 2

Averages and Variation

Chapter 6: DESCRIPTIVE STATISTICS

Chapter 2. Descriptive Statistics: Organizing, Displaying and Summarizing Data

Prepare a stem-and-leaf graph for the following data. In your final display, you should arrange the leaves for each stem in increasing order.

Statistical Methods. Instructor: Lingsong Zhang. Any questions, ask me during the office hour, or me, I will answer promptly.

CHAPTER 1. Introduction. Statistics: Statistics is the science of collecting, organizing, analyzing, presenting and interpreting data.

STA Rev. F Learning Objectives. Learning Objectives (Cont.) Module 3 Descriptive Measures

Vocabulary. 5-number summary Rule. Area principle. Bar chart. Boxplot. Categorical data condition. Categorical variable.

DAY 52 BOX-AND-WHISKER

Measures of Central Tendency. A measure of central tendency is a value used to represent the typical or average value in a data set.

STA Module 2B Organizing Data and Comparing Distributions (Part II)

STA Learning Objectives. Learning Objectives (cont.) Module 2B Organizing Data and Comparing Distributions (Part II)

Chapter 3 - Displaying and Summarizing Quantitative Data

UNIT 1A EXPLORING UNIVARIATE DATA

Measures of Central Tendency

Lecture Notes 3: Data summarization

LESSON 3: CENTRAL TENDENCY

10.4 Measures of Central Tendency and Variation

10.4 Measures of Central Tendency and Variation

Math 120 Introduction to Statistics Mr. Toner s Lecture Notes 3.1 Measures of Central Tendency

+ Statistical Methods in

Frequency Distributions

CHAPTER 3: Data Description

CHAPTER 2: SAMPLING AND DATA

To calculate the arithmetic mean, sum all the values and divide by n (equivalently, multiple 1/n): 1 n. = 29 years.

Unit 7 Statistics. AFM Mrs. Valentine. 7.1 Samples and Surveys

Unit I Supplement OpenIntro Statistics 3rd ed., Ch. 1

Acquisition Description Exploration Examination Understanding what data is collected. Characterizing properties of data.

15 Wyner Statistics Fall 2013

Math 167 Pre-Statistics. Chapter 4 Summarizing Data Numerically Section 3 Boxplots

Univariate Statistics Summary

2.1: Frequency Distributions and Their Graphs

Let s take a closer look at the standard deviation.

Exploratory Data Analysis

Learning Log Title: CHAPTER 7: PROPORTIONS AND PERCENTS. Date: Lesson: Chapter 7: Proportions and Percents

Descriptive Statistics

Chapter 2: Frequency Distributions

Chapter 3: Describing, Exploring & Comparing Data

Measures of Position

B. Graphing Representation of Data

Name: Date: Period: Chapter 2. Section 1: Describing Location in a Distribution

MATH& 146 Lesson 8. Section 1.6 Averages and Variation

a. divided by the. 1) Always round!! a) Even if class width comes out to a, go up one.

MATH 1070 Introductory Statistics Lecture notes Descriptive Statistics and Graphical Representation

Numerical Summaries of Data Section 14.3

CHAPTER-13. Mining Class Comparisons: Discrimination between DifferentClasses: 13.4 Class Description: Presentation of Both Characterization and

Part I, Chapters 4 & 5. Data Tables and Data Analysis Statistics and Figures

Overview. Frequency Distributions. Chapter 2 Summarizing & Graphing Data. Descriptive Statistics. Inferential Statistics. Frequency Distribution

Measures of Dispersion

1.2. Pictorial and Tabular Methods in Descriptive Statistics

ECLT 5810 Data Preprocessing. Prof. Wai Lam

Chapter 2 Modeling Distributions of Data

Section 6.3: Measures of Position

Lecture Slides. Elementary Statistics Twelfth Edition. by Mario F. Triola. and the Triola Statistics Series. Section 2.1- #

MATH& 146 Lesson 10. Section 1.6 Graphing Numerical Data

CHAPTER 2 DESCRIPTIVE STATISTICS

Lecture 6: Chapter 6 Summary

Chapter2 Description of samples and populations. 2.1 Introduction.

Math 214 Introductory Statistics Summer Class Notes Sections 3.2, : 1-21 odd 3.3: 7-13, Measures of Central Tendency

Downloaded from

MAT 110 WORKSHOP. Updated Fall 2018

STA Module 4 The Normal Distribution

STA /25/12. Module 4 The Normal Distribution. Learning Objectives. Let s Look at Some Examples of Normal Curves

TMTH 3360 NOTES ON COMMON GRAPHS AND CHARTS

Parents Names Mom Cell/Work # Dad Cell/Work # Parent List the Math Courses you have taken and the grade you received 1 st 2 nd 3 rd 4th

Data Preprocessing. S1 Teknik Informatika Fakultas Teknologi Informasi Universitas Kristen Maranatha

Section 9: One Variable Statistics

Week 4: Describing data and estimation

Lecture 3 Questions that we should be able to answer by the end of this lecture:

+ Statistical Methods in

Day 4 Percentiles and Box and Whisker.notebook. April 20, 2018

AP Statistics Summer Assignment:

MATH NATION SECTION 9 H.M.H. RESOURCES

Slide Copyright 2005 Pearson Education, Inc. SEVENTH EDITION and EXPANDED SEVENTH EDITION. Chapter 13. Statistics Sampling Techniques

Chapter 3: Data Description - Part 3. Homework: Exercises 1-21 odd, odd, odd, 107, 109, 118, 119, 120, odd

Chapter 2 - Graphical Summaries of Data

Name Date Types of Graphs and Creating Graphs Notes

Descriptive Statistics

Density Curve (p52) Density curve is a curve that - is always on or above the horizontal axis.

LAB 1 INSTRUCTIONS DESCRIBING AND DISPLAYING DATA

Summarising Data. Mark Lunt 09/10/2018. Arthritis Research UK Epidemiology Unit University of Manchester

Maths Revision Worksheet: Algebra I Week 1 Revision 5 Problems per night

At the end of the chapter, you will learn to: Present data in textual form. Construct different types of table and graphs

Lecture 3 Questions that we should be able to answer by the end of this lecture:

Chapter 2: Understanding Data Distributions with Tables and Graphs

Lecture Slides. Elementary Statistics Tenth Edition. by Mario F. Triola. and the Triola Statistics Series. Slide 1

Chpt 3. Data Description. 3-2 Measures of Central Tendency /40

Chapter Two: Descriptive Methods 1/50

Chapter 2. Frequency distribution. Summarizing and Graphing Data

BIO 360: Vertebrate Physiology Lab 9: Graphing in Excel. Lab 9: Graphing: how, why, when, and what does it mean? Due 3/26

Section 2-2 Frequency Distributions. Copyright 2010, 2007, 2004 Pearson Education, Inc

Chapter 3. Descriptive Measures. Slide 3-2. Copyright 2012, 2008, 2005 Pearson Education, Inc.

Transcription:

1 2 Things you ll know (or know better to watch out for!) when you leave in December: 1. What you can and cannot infer from graphs. 2. How to construct (in your head!) and interpret confidence intervals. 3. How to conduct tests on population parameters within a population and between/across populations. These parameters include means, variances, odds ratios, and others. 4. How and when to carry out a linear regression analysis and how to interpret the results. 3 4 mediamatters.org claimed that this was a misleading graph: In presenting the results of a CNN/USA Today/Gallup poll, CNN.com used a visually distorted graph that falsely conveyed the impression that Democrats far outnumber Republicans and Independents in thinking the Florida state court was right to order Terri Schiavo s feeding tube removed. According to the poll, when asked if they agree[d] with the court s decision to have the feeding tube removed, 62 percent of Democratic respondents agreed, compared to 54 percent of Republicans, and 54 percent of Independents. But these results were displayed along a very narrow scale of 10 percentage points, and thus appeared to show a large gap between Democrats and Republicans/Independents: Presented in this manner, the graph suggests that the gap between the two groups is overwhelming, rather than only 8 percentage points, within the poll s margin of error of +/- 7 percentage points.

When constructing graphs, do the following, and hope (or maybe insist) that others do the same! 1. Clearly label x and y axes. 2. Use relevant scales. 3. Indicate exactly which subset of the data are being represented and how. Were continuous measurements changed to binary or integer for plotting purposes? Were other transformations done? 4. BE VERY CAUTIOUS about drawing general conclusions from graphs that do not involve error bars (or some way of representing error). More on this later in the semester. 5 Data can be classified into a few basic types. Nominal Data: Numeric values that represent classes or categories - the categories are not ordered. Magnitude of numerical value is not important. Ordinal Data: Numeric values that represent classes or categories - the categories are ordered. Magnitude of numerical value is not important. Ranked Data*: Numeric values that represent the order of ranked observations. By assigning ranks, information about the magnitude of the values and their differences is lost; however, ranks still retain much useful information. *Convention is to rank from lowest to highest and then assign numeric values to each ranked observation starting with the lowest. Pagano and Gavreau (page 10) have this switched around. 6 Discrete Data: Numbers that represent measurable quantities (as opposed to just labels). Magnitude is important. Discrete data takes on specified values that differ by fixed amounts; intermediate values are not possible. Continuous Data: Numbers that represent measurable quantities taking on any value in some range. Again, magnitude is important. The difference between any two values can be arbitrarily small. Responses to questions in the Breast Cancer Consortium Questionnaire fall into these categories. Examples are below. Nominal Data: Questions 1, 2, 5, 16... Ordinal Data: Question 4, 9, 17, (disregard never and not sure)... Discrete Data: How many sisters have been diagnosed with breast cancer? (Question 7b rephrased) Continuous Data: Questions 14 (weight could be measured in arbitrarily small units - can take on any value in some range - discreteness of measurements only limited by the measuring device) 7 Absolute and Relative Frequencies of Mammograms for 5,447,140 mammograms recorded in the Breast Cancer Surveillance Consortium (BCSC) study from 1996-2004 (inclusive). (http://breastscreening.cancer.gov/statistics) Race Number of Mammograms RF (%) White 3,785,762 69.5 Black 283,251 5.2 Hispanic 397,641 7.3 Asian 266,910 4.9 American Indian 59,919 1.1 Other 653,657 12.0 8 For nominal (shown here) and ordinal data, a frequency distribution consists of the set of classes or categories along with the numerical counts in each. A relative frequency distribution shows the proportion of counts that fall into each class or category. A relative frequency (RF) value for any category is obtained by dividing the number of observations in that category by the total number of observations. This can be reported as a percentage (as shown) by multiplying the resulting fraction by 100. This table is a bit misleading since we can t tell which populations (if any) are under (or over) represented. This data does not consider the proportions of each race in the general population.

9 10 Number of mammograms taken in 1999 grouped by patient s age Age Number of Mammograms RF (%) CRF (%) 18-29 2,000 0.4 0.4 30-39 27,000 5.7 6.1 40-49 141,000 29.6 35.7 50-59 135,000 28.3 64.0 60-69 87,000 18.2 82.2 70-79 65,000 13.6 95.8 80-89 19,000 4.0 99.8 90-over 1,000 0.2 100.0 Graphical Summaries Percentage of Mammograms by Race and Ethnicity. This pie chart shows the racial distribution of 5,447,140 mammograms recorded by the Breast Cancer Surveillance Consortium from 1996-2004 (inclusive). Again, we can t tell which populations (if any) are under (or over) represented. This data does not consider the proportions of each race in the general population. For discrete or continuous data, we must break down the range of values into a series of distinct non-overlapping intervals. If there are too many intervals, not much of a summary is obtained; if there are too few, information can be lost. Although it is not necessary (and is not done in this BCSC study), intervals are often constructed so that they have equal widths. The cumulative relative frequency (CRF) for an interval is the proportion of the total number of observations that have a value less than or equal to the upper limit of the interval. This too can be expressed as a percentage. In the table above, we see that 35.7% of mammograms are performed on women at or under the age of 49. Bar chart showing number of mammograms in each age group for years 1996-2004. 11 Rat Mean Arterial Pressure Histograms are used to display a frequency distribution for discrete or continuous data. If relative frequencies (proportions) are displayed, the histogram is often called a probability histogram. In this case, the heights sum to one. Again, note that we must break down the range of values into a series of distinct non-overlapping intervals. If there are too many intervals, not much of a summary is obtained; if there are too few, information can be lost. 12 BN Rat MAP () 100 120 140 160 0 2000 4000 6000 8000 10000 12000 14000 time 100 120 140 160 Number Values <= (x) in 1000 0 2 4 6 8 10 12 14 90 95 100 105 110 115 120 125 130 135 140 145 150 155 160 165

13 14 Numerical Summaries of Data Arithmetic Mean: Sum of data values divided by the total number of values. Median: Value which separates data into two halves: half the data values are greater than the median, half are smaller than the median. The median is less sensitive to outliers than the mean. Mode: Value (or set of values) that occurs most frequently. More precise definitions will be given soon. 0 1000 2000 3000 4000 Mean = 112.4 Median = 112.0 Mode =110 90 110 130 90 110 130 15 16 Numerical Summaries of Data 0 1000 2000 3000 4000 5000 6000 100 140 Mean = 118.9 Median = 113.0 Mode = 110 100 140 Let n data values be denoted by x 1, x 2,..., x n Mean: Sum of data values divided by the total number of values. x = 1 n n x i i=1 Mode: Value (or set of values) that occurs most frequently. Median: Value which separates data into two halves: half the data values are greater than the median, half are smaller than the median. The median is less sensitive to outliers than the mean. Quantiles or percentiles: The n th Quantile (percentile) is the smallest value which is greater than or equal to n percent of the data. For example, the 95 th percentile is the value that is greater than or equal to 95% of the data and less than or equal to the remaining 5 %. Quartiles: The 25 th and 75 th quantiles are called quartiles. Range: Difference between the largest and the smallest data values. Interquartile Range: Difference between the 75 th and 25 th percentiles. Consequently, it contains the middle 50% of the observations.

17 18 A bit of detail on Quantiles Precise Definition of a Quantile Intuitively, the p th quantile (percentile) is the smallest value V p such that p percent of the sample points are less than or equal to V p. The median, being the 50 th percentile, is a special case of a quantile. Quartiles are also special cases of quantiles. Note that this is not a precise definition. For example, if you have a data set with n = 20 values, what would the median be? For a data set of size n, the p th quantile is defined by 1. The (k + 1) th largest sample point if np 100 is not an integer. Here k is the largest integer less than np 100. 2. The average of the ( np 100 )th and ( np 100 + 1)th largest observations if np 100 is an integer....more Numerical Summaries of Data The (sample) variance of the data set is defined by s 2 = 1 n 1 n i=1 (x i x) 2 The rationale for using (n-1) in the denominator (as opposed to n) will be given in a few weeks. 19 If the 40 th and 60 th percentiles lie an equal distance from the midpoint, and the same is true for the 30 th and 70 th, the 20 th and 80 th, and all other pairs of percentiles that sum to 100, the data are symmteric. A symmetric distribution has the same shape on each side of the 50 th percentile. Shown below are histograms of MAP (left) data and simulated (right) data. 20 The coefficient of variation is the standard deviation divided by the mean. CV = s x It is often multiplied by 100 % to give a %. 90 110 130 sim.data

Summarizing the Distribution The mean and standard deviation of a data set can be used to summarize characteristics of the entire distribution. 21 Vertical lines are drawn at x ± 1 σ for the MAP (left) and simulated (right) data. 69 % of the observations fall between the lines for the MAP data; 84 % for the simulated data! The empirical rule does not work if the data are not symmetric and unimodal. 22 Empirical Rule: If the data are symmetric and unimodal, then approximately 67 % of the observations lie within the interval x±1 σ; approximately 95 % of the observations lie within x ± 2 σ. There is a more precise version of this that we will see later this semester. 69 % of obs. in range 84 % of obs. in range 100 140 sim.data 23 24 Chebychev s Inequality Chebychev s Inequality will work even when the data are not symmetric or unimodal. So, for k = 2, Chebychev s inequality tells us that at least 1 1 2 2 = 3 4 of the values lie within 2 standard deviations of the mean. Chebychev s Inequality: For any number k that is greater than 1, at least 1 ( 1 ) 2 k lie within k standard deviations of their mean. 96 % of obs. in range 97 % of obs. in range 100 140 sim.data

25 26 Box Plots Generating a Box Plot A boxplot is used for discrete or continuous data. The lower bound of the box is the 25th quartile of the data; the upper end is the 75th quartile. The 50th percentile (median) is indicated. Horizontal lines are drawn at the most extreme values outside the box that are not more than 1.5 * Interquartile Range beyond either of the bounding quartiles. To generate a box plot, Plot the box. Upper bound is 75th quantile (75Q), lower bound is 25th quantile (25Q). Draw a line at the median. Note that 25Q and 75Q are NOT standard notation. Calculate the Inter Quartile Range (75Q-25Q). Draw horizontal lines at most extreme points closest to without going outside 75Q + 1.5*IQR and 25Q - 1.5*IQR. Draw in remaining points that fall outside the horizontal lines. 27 28 Rat Mean Arterial Pressure (MAP) Data A box plot for the MAP pressure data shown earlier is given below. Mean = 112.4 Median = 112.0 Mode = 110 100 110 120 130 140 Int. Qrt. Range = 115-109=6 1.5*IQR = 9 3rd Qrt + 1.5*IQR=124 3rd quartile = 115 2nd quartile (median) = 112 1st quartile = 109 1st Qrt - 1.5*IQR=100 90 100 110 120 130 140

Note that the horizontal lines need not be the same distance from the box. Again, the horizontal lines are drawn at the most extreme values outside the box that are NOT more than 1.5 * Interquartile Range beyond either of the bounding quartiles. Mean = 0.306 Median = -0.077 Modes (-0.4, -0.2, 0.0) 29 Histograms and Boxplots of MAP data (upper) and Simulated data (lower) 0 500 1500 100 120 140 90 110 130 30 sim.data 31 32 Simulated data from previous slide -1.892-1.649-1.328-0.913-0.839-0.724-0.442-0.415-0.361-0.300-0.282-0.224-0.177-0.173-0.145-0.078-0.049-0.012 0.006 0.124 0.159 0.266 0.406 0.467 0.905 1.072 1.146 1.858 2.478 2.602 8.000 A boxplot function will calculate quartiles. It might interpolate between values or impose the restriction that the quartiles be one of the data values. For now, let s impose the restriction that the quantiles be one of the data values. Find the median and the 25 th and 75 th quantiles. Where would the horizonal lines for the box plot be drawn? (hint: 25Q-1.5*IQR=-1.738 and 75Q+1.5*IQR=1.791). Int. Qrt. Range = 0.467+0.415=0.882 1.5*IQR = 1.323 3rd Qrt + 1.5*IQR=1.791 y=1.146 3rd quartile = 0.4368 2nd quartile (median) = -0.0777 1st quartile = -0.3879 y=-1.649 1st Qrt - 1.5*IQR=-1.738

33 (From Stat 541 exam): Three data sets were generated. The box plots and histograms of each data set are shown. Match the box plot to the histogram generated by the same data. (a) (b) (c) 10 (a) 20 25 30 (d) 10 10 (b) 20 25 30 (e) 10 10 (c) 20 25 30 (f ) 10