Chapter 2. Description of samples and populations


2.1 Introduction

Statistics is the science of analyzing data. The information collected (data) is recorded in terms of variables: characteristics of a subject that can be assigned a numerical value or a nonnumerical category. The data themselves and their transformed forms are also called statistics.

Types of variables:

1. Categorical variable: records the category a subject belongs to, like Blood Type (O, A, B, AB) or Gender (Female, Male). Usually the categories do not have a meaningful order. Some categorical data are ordinal, meaning a natural order exists, for example response to a treatment: none, partial, complete.

2. Quantitative (numeric) variable: records an amount of something or a count of something. It can be continuous, with values on a continuous scale (weight of a newborn, cholesterol content in a blood specimen), or discrete, where the values can be listed and are often integers (number of eggs in a nest, number of bacteria in a petri dish).

The distinction between discrete and continuous variables is not rigid, since we often round measurements to the nearest integer.

Sample = collection of persons or things on which we measure one or more variables. Sometimes the same word is used in a different context (for example, a sample of blood taken from a subject). To avoid confusion we will say a specimen of blood in that case.

Some other vocabulary and notation:

Example. Twenty students reported their gender, blood type and weight to a researcher. The students are the observational units. The variables are Gender and Blood Type (both categorical) and Weight (numerical). The sample size is n = 20.

We will use capital letters like X and Y for the names of variables and lower case letters (x or y) for particular observations. For example, we may use Y = weight of a student and $y_1 = 150$ lb for the weight of one such student (John).

2.2 Frequency distributions
Once data are collected, it helps to summarize them in tables and/or graphs in order to make sense of them. We will use some example data sets to examine different ways data can be displayed.

Ex1: Sample of Blood Type for 21 people:

A O A AB O B AB A O A O AB O A O B A AB A O A

We can summarize these data with a frequency and relative frequency table.

Frequency = count in a particular class
Relative frequency = frequency / n
% frequency = relative frequency × 100%
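The three definitions above can be sketched in a few lines of Python. This is only an illustration using the Ex1 data; the function name `frequency_table` is a hypothetical helper, not something from the notes.

```python
from collections import Counter

# Ex1 data: blood types of 21 people
blood_types = "A O A AB O B AB A O A O AB O A O B A AB A O A".split()


def frequency_table(values):
    """Map each class to (frequency, relative frequency)."""
    n = len(values)
    counts = Counter(values)
    return {cls: (f, f / n) for cls, f in sorted(counts.items())}


table = frequency_table(blood_types)
for cls, (f, rel) in table.items():
    # class, frequency, relative frequency, % frequency
    print(f"{cls:<3}{f:>3}{rel:>8.4f}{100 * rel:>7.1f}%")
```

Sorting the classes alphabetically is a presentation choice only; for a categorical variable the order of the classes carries no meaning.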

Frequency table results for Blood Type:

Blood Type   Frequency   Relative Frequency
A            8           0.3810
AB           4           0.1905
B            2           0.0952
O            7           0.3333

Notice that all frequencies add up to n = 21 and all relative frequencies add up to 1 (or 100%). A graphical display of these data is a bar chart. Notice that the classes do not have to be placed in any particular order.

Ex2: US Solid Waste Weight (Pie Chart)

Material                    Weight (million tons)   Percent of Total
Food Scraps                 25.9                    11.2%
Glass                       12.8                    (missing)
Metals                      18.0                    (missing)
Paper, Paperboard           86.7                    37.4%
Plastics                    24.7                    10.7%
Rubber, Leather, Textiles   15.8                    6.8%
Wood                        12.7                    5.5%
Yard Trimmings              27.7                    11.9%
Other                       (missing)               3.2%
Totals                      231.9                   100%

The missing weight is 7.6 million tons (Other), and the missing percentages are 5.5% (Glass) and 7.8% (Metals). To find the central angle of each slice of the pie chart, multiply 360° by the relative frequency. For example, the Paper, Paperboard slice spans 0.374 × 360° ≈ 134.6°.

Ex3: 40 couples, number of children in each family:

3 3 3 1 4 3 0 0 2 0 4 2 4 3 2 2 3 2 5 1 1 0 1 1 2 1 0 0 1 2 1 1 0 3 2 1 2 1 2 3

These data can be grouped using single values as classes, since there are relatively few different data values. Our classes, in order, are 0, 1, 2, 3, 4, 5, and the frequencies are computed exactly as in Ex1.

Frequency table results for Number of children:

Number of children   Frequency   Relative Frequency
0                    7           0.175
1                    11          0.275
2                    10          0.250
3                    8           0.200
4                    3           0.075
5                    1           0.025

The graphical display of such data is called a histogram: a bar is raised over each class, with the class value at the middle of its bar. Another way to display such data is a dotplot. You place a dot over each data value. If values are repeated, you stack the dots, equally spaced, above those values.

A grouped frequency distribution is appropriate for a data set with many different values, as in the following example.

Ex4: Age of onset of diabetes (35 people):

48 41 57 83 41 55 59 61 38 48 79 75 77 7 54 23 47 56 79 68 61 64 45 53 82 68 38 70 10 60 83 76 21 65 47

If we decide to start at 0 and use groups of width 10, we get the classes [0,10), [10,20), [20,30) and so on. Read these as interval notation: the left endpoint is included and the right endpoint is excluded. A histogram for these data can also be obtained, with a bar raised over each class. The vertical axis can represent either frequency or relative frequency.

We can also obtain a quick histogram, called a stem-and-leaf diagram (or stemplot). Each data point is divided into a stem and a leaf, all possible stems are placed vertically, and the leaves are added to them in order. Our stemplot is given below; notice that the leaves are ordered.

0 | 7
1 | 0
2 | 1 3
3 | 8 8
4 | 1 1 5 7 7 8 8
5 | 3 4 5 6 7 9
6 | 0 1 1 4 5 8 8
7 | 0 5 6 7 9 9
8 | 2 3 3

stems: tens, leaves: ones
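A stemplot like the one for Ex4 can be generated mechanically. The sketch below assumes non-negative integer data with the ones digit as the leaf; the function name `stemplot` is a hypothetical helper introduced for illustration.

```python
from collections import defaultdict

# Ex4 data: age of onset of diabetes for 35 people
ages = [48, 41, 57, 83, 41, 55, 59, 61, 38, 48, 79, 75, 77, 7, 54, 23, 47, 56,
        79, 68, 61, 64, 45, 53, 82, 68, 38, 70, 10, 60, 83, 76, 21, 65, 47]


def stemplot(data):
    """Return one text row per stem (tens) with its leaves (ones) in increasing order."""
    leaves = defaultdict(list)
    for value in sorted(data):          # sorting first keeps the leaves ordered
        stem, leaf = divmod(value, 10)  # stem = all but the last digit, leaf = last digit
        leaves[stem].append(leaf)
    rows = []
    # Walk over every stem between the smallest and the largest, including empty ones
    for stem in range(min(leaves), max(leaves) + 1):
        row = f"{stem} | " + " ".join(str(leaf) for leaf in leaves[stem])
        rows.append(row.rstrip())
    return rows


rows = stemplot(ages)
print("\n".join(rows))
```

Listing empty stems as well keeps the vertical scale honest, so gaps in the data stay visible in the display.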

How to make a stemplot:

1. Separate each observation into a stem, consisting of all but the final (rightmost) digit, and a leaf, the final digit. A stem can have one, two, or more digits; a leaf has exactly one digit.
2. Write the stems in a vertical column with the smallest at the top, and draw a vertical line to the right of this column.
3. Write each leaf in the row to the right of its stem, in increasing order out from the stem.

Ex5: Radish growth (mm in 3 days). A: in the dark. B: 12 hours of light / 12 hours of dark.

A: 15 20 22 20 29 37 11 35 15 30 8 25 33 10
B: 10 11 15 15 20 4 22 21 10 25 27 20 9 20

Side-by-side stemplots (with two lines per stem) let us compare both sets. In both, stems are tens and leaves are ones.

      A | stem | B
        |  0   | 4
      8 |  0   | 9
    1 0 |  1   | 0 0 1
    5 5 |  1   | 5 5
  2 0 0 |  2   | 0 0 0 1 2
    9 5 |  2   | 5 7
    3 0 |  3   |
    7 5 |  3   |

Stemplot with two lines per stem: the number of stems can be doubled by splitting each stem in two, one line for leaves 0 to 4 and the other for leaves 5 to 9.

Interpreting areas of the histogram: the area of each bar of the histogram is proportional to the corresponding frequency. In Ex4 the area between 10 and 30 (2 bars) equals 3/35 ≈ 8.6% of the total area of the histogram. We can also draw a histogram using a density scale, where the height of each bar is

$$ \text{density} = \frac{f/n}{\text{class width}} $$

Then the total area of the histogram is 1 (or 100%).
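The area interpretation can be checked numerically on the Ex4 data. The following is a minimal sketch assuming classes of width 10 starting at 0, as in the notes; the variable names are illustrative.

```python
# Ex4 data: age of onset of diabetes for 35 people
ages = [48, 41, 57, 83, 41, 55, 59, 61, 38, 48, 79, 75, 77, 7, 54, 23, 47, 56,
        79, 68, 61, 64, 45, 53, 82, 68, 38, 70, 10, 60, 83, 76, 21, 65, 47]

width = 10
n = len(ages)

# Frequency of each class [lower, lower + width), keyed by its lower endpoint
freq = {}
for age in ages:
    lower = (age // width) * width
    freq[lower] = freq.get(lower, 0) + 1

# Density scale: bar height = relative frequency / class width
density = {lower: f / n / width for lower, f in freq.items()}

# Area of a bar = height * width, so the areas add up to 1
total_area = sum(height * width for height in density.values())

# Share of the total area lying between 10 and 30 (two bars)
share_10_30 = (density[10] + density[20]) * width

for lower in sorted(freq):
    print(f"[{lower},{lower + width})  f={freq[lower]}  density={density[lower]:.4f}")
print(f"total area = {total_area:.4f}, share between 10 and 30 = {share_10_30:.3f}")
```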

Ex6: Iron intake, in milligrams, during a 24-hour period for a sample of 30 females under the age of 51:

15.0 18.1 14.4 14.6 10.9 18.1 18.2 18.3 15.0 16.0 12.6 16.6 20.7 19.8 11.6 12.8 15.6 11.0 15.3 9.4 19.5 18.3 14.5 16.6 11.5 16.4 12.5 14.6 11.9 12.5

In this example we may select groups of width 2, namely [9,11), [11,13), [13,15) and so on. We get 6 classes, an appropriate number for a data set of 30 observations.

Ex7: Weight data (in pounds) in an Intro. Stats class:

100, 105, 111, 115, 118, 118, 119, 120, 125, 125, 128, 128, 129, 130, 133, 135, 135, 138, 138, 140, 140, 145, 146, 150, 155, 158, 160, 162, 164, 165, 167, 171, 175, 178, 180, 180, 182, 185, 185, 187, 189, 190, 190, 193, 194, 195, 200, 205, 210, 215, 230, 270

In the histogram of these data we can clearly observe two prominent peaks: the data are bimodal.
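The grouping used for the iron intake data in Ex6 can be sketched as follows. The helper `grouped_frequencies` and its parameters `start` and `width` are hypothetical names; the class boundaries (start at 9, width 2) are the ones chosen in the notes.

```python
import math

# Ex6 data: iron intake (mg) over 24 hours for 30 females
iron = [15.0, 18.1, 14.4, 14.6, 10.9, 18.1, 18.2, 18.3, 15.0, 16.0,
        12.6, 16.6, 20.7, 19.8, 11.6, 12.8, 15.6, 11.0, 15.3, 9.4,
        19.5, 18.3, 14.5, 16.6, 11.5, 16.4, 12.5, 14.6, 11.9, 12.5]


def grouped_frequencies(data, start, width):
    """Count observations in classes [start, start+width), [start+width, start+2*width), ..."""
    n_classes = math.floor((max(data) - start) / width) + 1
    counts = [0] * n_classes
    for y in data:
        # The left endpoint belongs to the class, the right endpoint does not
        counts[math.floor((y - start) / width)] += 1
    return counts


counts = grouped_frequencies(iron, start=9, width=2)
for k, f in enumerate(counts):
    lower = 9 + 2 * k
    print(f"[{lower},{lower + 2})  {f}")
```

The same function could be applied to the Ex7 weights with a wider class to look for the two peaks, although the notes do not specify a class width for that example.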

Describing the distribution of sample data: Modality, Shapes, Symmetry, and Skewness.

Modality:
Unimodal - has one peak, e.g. bell-shaped, triangular, reverse J-shaped, J-shaped, right skewed, left skewed
Bimodal - has two peaks (technically, all peaks should be the same height, but this is rarely so in practice)
Multimodal - has 3 or more peaks

Symmetry and Skewness:
Symmetric distribution - can be divided into 2 parts that are mirror images of each other. We do not have to be exact in identifying symmetry. E.g. bell-shaped, triangular, uniform.
Non-symmetric distribution - reverse J-shaped, J-shaped, right skewed, left skewed

The distribution of population data is called the population distribution, or the distribution of the variable. The distribution of sample data is a sample distribution. The distribution of a random sample from a population approximates the population distribution, and larger samples give better approximations.

[Figure: shapes of distributions: right skewed, left skewed, symmetric]

2.3 Descriptive Measures of Center

Let Y be our variable, numerical.

$\tilde{y}$ = Median = middle of the ordered data. The position (location) of the median is $\frac{n+1}{2}$, where n = sample size.

Ex: Weight gain in pounds for 6 young lambs: 1 2 10 11 13 19. Position: 0.5(6+1) = 3.5 (the median is between observations #3 and #4), so $\tilde{y} = (10+11)/2 = 10.5$ lb.

If we add one more observation, 10 lb, the data become 1 2 10 10 11 13 19. Position: 0.5(7+1) = 4 (the median is observation #4), so $\tilde{y} = 10$ lb.

The median is a robust (resistant) measure of center: it is relatively unaffected by changes in a small portion of the data.

$\bar{y}$ = Mean (arithmetic mean):

$$ \bar{y} = \frac{1}{n}\sum_{i=1}^{n} y_i $$

where the $y_i$ are the observations in the sample. In our example $\bar{y} = 56/6 \approx 9.33$ lb.

The differences between each data point and the mean, $(y_i - \bar{y})$, are called deviations from the mean. Their sum is zero for any data set:

$$ \sum_{i=1}^{n} (y_i - \bar{y}) = 0 $$

In our example the sum of all deviations is −8.33 + (−7.33) + 0.67 + 1.67 + 3.67 + 9.67 ≈ 0 (exactly 0 before rounding).

The mean can be visualized as the point of balance of a weightless seesaw with the data points (like children) sitting on it. Unlike the median, the mean is not robust: it is influenced by any change in the data, and very much by extremes. If the data have some extreme values, then the median is a better measure of center for those data.
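The lamb example can be reproduced with Python's standard `statistics` module. This is a small sketch; the list `extreme`, which replaces the largest gain by 190, is a made-up value used only to illustrate robustness and is not part of the notes.

```python
from statistics import mean, median

# Weight gain (lb) for 6 young lambs
gains = [1, 2, 10, 11, 13, 19]

y_tilde = median(gains)          # middle of the ordered data
y_bar = mean(gains)              # sum of the observations divided by n
deviations = [y - y_bar for y in gains]

print(f"median = {y_tilde}, mean = {y_bar:.2f}")
print(f"sum of deviations = {sum(deviations):.10f}")

# Adding one more observation of 10 lb moves the median to observation #4
print(f"median with extra lamb = {median(gains + [10])}")

# Hypothetical extreme value: the median stays put, the mean moves a lot
extreme = [1, 2, 10, 11, 13, 190]
print(f"median = {median(extreme)}, mean = {mean(extreme):.2f}")
```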

Mean vs Median

Right skewed distribution: Mean > Median
Left skewed distribution: Mean < Median
Symmetric distribution: Mean = Median

2.4 Boxplots

Single variable data may be summarized by 5 numbers: Minimum, Maximum, Median and 2 Quartiles, together referred to as the five-number summary. These values are also used to make a boxplot. The lower quartile, denoted by $Q_1$, is the median of the lower half of the data; the upper quartile, denoted by $Q_3$, is the median of the upper half of the data.

Ex1: Systolic blood pressure (in mmHg) of 7 adult males: 151 124 132 170 146 124 113

We order the data first: 113 124 124 132 146 151 170

Min = 113, Max = 170, Median = 132, $Q_1$ = 124, $Q_3$ = 151 (the median is excluded when we compute the quartiles).

A boxplot connects all 5 numbers; the box represents the middle half of the data.

[Figure: boxplot of the blood pressure data on an axis from 110 to 170]

Another measure we can compute is the Interquartile Range, IQR = $Q_3 - Q_1$. This measure gives the spread of the middle half of the data values. We can use it to find unusual data points (outliers). The procedure is as follows:

Compute lower fence = $Q_1 - 1.5 \cdot \text{IQR}$ and upper fence = $Q_3 + 1.5 \cdot \text{IQR}$. An outlier is a data point that falls outside of the fences.

In our example: IQR = 151 − 124 = 27, 1.5(IQR) = 1.5 × 27 = 40.5, lower fence = 124 − 40.5 = 83.5, upper fence = 151 + 40.5 = 191.5. All observations are within the fences, so there are no outliers in our data set.

Ex2: Radish growth (in mm) in the light: 4 5 5 7 7 8 9 10 10 10 10 14 20 21

Min = 4, Max = 21, $Q_1$ = 7, Median = (9+10)/2 = 9.5, $Q_3$ = 10, IQR = 3, lower fence = 2.5, upper fence = 14.5, so 20 and 21 are outliers. A modified boxplot exposes the outliers by marking them individually.

[Figure: modified boxplot of the radish data on an axis from 5 to 25, with the two outliers marked by asterisks]

2.5 Relationship between variables

This section discusses various ways to compare two or more variables. Some methods include:

a) Two-way frequency and relative frequency tables, to examine the relationship between two categorical variables. They are useful to determine whether the variables are associated or not.
b) Scatter plots for numerical variables, to decide if there is a linear trend present, so that we can fit a regression line to the data.
c) Side-by-side boxplots, dotplots and stemplots, to observe if there are differences between two or more treatments.

2.6 Measures of dispersion (variability)

Range = Maximum − Minimum gives the overall spread of the data. It is easy to calculate, but very sensitive to extreme data values.

IQR, as stated before, gives the range of the middle half of the data and is a robust measure, not sensitive to extreme data values.
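The range, quartiles, IQR and fences can be computed with a short script. The sketch below follows the convention used in the notes (quartiles are medians of the lower and upper halves, with the median itself excluded when n is odd); other software may use a different quartile rule and give slightly different values. The function names are hypothetical.

```python
from statistics import median


def five_number_summary(data):
    """Return (min, Q1, median, Q3, max) using medians of the two halves."""
    ys = sorted(data)
    n = len(ys)
    half = n // 2
    lower = ys[:half]
    # When n is odd the middle observation is left out of both halves
    upper = ys[half + 1:] if n % 2 else ys[half:]
    return ys[0], median(lower), median(ys), median(upper), ys[-1]


def find_outliers(data):
    """Return (outliers, (lower fence, upper fence)) using the 1.5 * IQR rule."""
    _, q1, _, q3, _ = five_number_summary(data)
    iqr = q3 - q1
    lower_fence, upper_fence = q1 - 1.5 * iqr, q3 + 1.5 * iqr
    flagged = [y for y in data if y < lower_fence or y > upper_fence]
    return flagged, (lower_fence, upper_fence)


pressure = [151, 124, 132, 170, 146, 124, 113]                 # Ex1, section 2.4
radishes = [4, 5, 5, 7, 7, 8, 9, 10, 10, 10, 10, 14, 20, 21]   # Ex2, section 2.4

for name, data in [("pressure", pressure), ("radishes", radishes)]:
    summary = five_number_summary(data)
    flagged, fences = find_outliers(data)
    data_range = summary[4] - summary[0]
    print(name, summary, "range:", data_range, "fences:", fences, "outliers:", flagged)
```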

Sample standard deviation:

$$ s = \sqrt{\frac{\sum_{i=1}^{n} (y_i - \bar{y})^2}{n-1}} $$

The formula averages the squared deviations from the mean, dividing by n − 1 rather than n. The square root is taken at the end, so the units of s are the same as the units of the data. $s \ge 0$, and $s = 0$ if all data points are the same. $s^2$ is the sample variance. We will abbreviate standard deviation as SD; s will be used in the formulas.

Ex: Experiment on chrysanthemums. A botanist measured stem elongation in 7 days (in mm): 76, 72, 65, 70, 82.

n = 5, $\bar{y} = 365/5 = 73$. The deviations from the mean are 3, −1, −8, −3, 9, and the squared deviations are 9, 1, 64, 9, 81.

$$ s = \sqrt{\frac{9 + 1 + 64 + 9 + 81}{4}} = \sqrt{\frac{164}{4}} = \sqrt{41} \approx 6.40 \text{ mm} $$

The variance is $s^2 = 41$ mm².

s gives the typical distance of the observations from the mean; a larger s means more variability. Like the mean, s is influenced by extreme data values (it is not a robust measure).

n − 1 = degrees of freedom of s. As an intuitive justification for using (n − 1) rather than n, consider n = 1: the variability of 1 observation cannot be computed, because one data point gives no information about variability.

The coefficient of variation is s expressed as a percentage of the mean:

$$ \text{coefficient of variation} = \frac{s}{\bar{y}} \cdot 100\% $$

It has no units and can be used to compare data sets with different units, for example:

Ex: Weight and height are measured for girls at age 2. Which of the two measures has greater variability?

Weight: mean = 12.6 kg, SD = 1.4 kg
Height: mean = 86.6 cm, SD = 2.9 cm

The coefficient of variation is 11.1% for weight and 3.3% for height. We conclude that weight is more variable: its SD is a much larger percentage of its mean than is the case for height.
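The chrysanthemum and weight/height calculations can be retraced step by step. This is a sketch that spells out the formula rather than calling a library routine; `coefficient_of_variation` is a hypothetical helper name.

```python
from math import sqrt

# Stem elongation (mm) of chrysanthemums over 7 days
elongation = [76, 72, 65, 70, 82]

n = len(elongation)
y_bar = sum(elongation) / n
squared_deviations = [(y - y_bar) ** 2 for y in elongation]

variance = sum(squared_deviations) / (n - 1)   # divide by the degrees of freedom, n - 1
s = sqrt(variance)                             # back to the units of the data (mm)

print(f"mean = {y_bar}, variance = {variance} mm^2, s = {s:.2f} mm")


def coefficient_of_variation(sd, mean):
    """SD expressed as a percentage of the mean (unit-free)."""
    return 100 * sd / mean


cv_weight = coefficient_of_variation(1.4, 12.6)   # girls at age 2, kg
cv_height = coefficient_of_variation(2.9, 86.6)   # girls at age 2, cm
print(f"CV weight = {cv_weight:.1f}%, CV height = {cv_height:.1f}%")
```

The standard library function `statistics.stdev` uses the same n − 1 divisor and could be used instead of the explicit formula.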

Typical Percentages: The Empirical Rule

For a nice distribution (fairly symmetric, unimodal, no very long or very short tails) we expect to find:

about 68% of all data points within the interval $(\bar{y} - SD,\ \bar{y} + SD)$
about 95% of all data points within the interval $(\bar{y} - 2SD,\ \bar{y} + 2SD)$
more than 99% of all data points within the interval $(\bar{y} - 3SD,\ \bar{y} + 3SD)$

2.7 Effect of Transformation of Variables

Sometimes when we work with a data set it is convenient to transform our variable(s). For example, we may want to change units, or turn very small numbers that appear in scientific notation into something easier to use by multiplying the original data by 10,000.

A linear transformation is the simplest one. Let Y be the original variable with mean $\bar{y}$ and SD s. Then $Y' = aY + b$ is its linear transformation, and the mean and SD of Y' are $\bar{y}'$ and $s'$ respectively. This type of transformation does not change the essential shape of the distribution of Y: the histogram of the transformed variable can be made identical to the original histogram by suitable scaling of the horizontal axis.

How does a linear transformation affect the mean and SD? Only the mean (but not s) is affected by the additive part of the transformation (adding a positive or negative constant b to Y), but both the mean and the SD are affected by multiplying Y by a positive or negative constant a:

$$ \bar{y}' = a\bar{y} + b \quad \text{and} \quad s' = |a|\, s $$

Ex: Suppose Y = summer temperature in some American city in 2013 in °F, with $\bar{y} = 79.6$ °F and s = 12.7 °F. If we would like to change Y to °C, the transformation is as follows:

$$ Y' = (Y - 32)\cdot\frac{5}{9} = \frac{5}{9}Y - \frac{5}{9}\cdot 32 $$

so the new mean is $\bar{y}' = \frac{5}{9}(79.6) - \frac{5}{9}(32) = 26.44$ °C and the new SD is $s' = \frac{5}{9}(12.7) = 7.06$ °C.

Nonlinear transformations, like $Y' = \sqrt{Y}$, $Y' = \log Y$, $Y' = 1/Y$ and $Y' = Y^2$, can affect data in complex ways, and they do change the essential shape of the frequency distribution. If the distribution is right skewed, for example, and we wish to make it more symmetric, we can apply the square root transformation to pull in the right-hand tail and push out the left-hand tail. The logarithmic transformation delivers an even more drastic change in that regard (check out the histograms given at the end of this section).
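The rules for the mean and SD under a linear transformation can be checked numerically. The sketch below reuses the chrysanthemum data from section 2.6 as test data; the pairs (a, b) are arbitrary illustrative choices (a change of units from mm to cm, and a negative multiplier), and `transformed_summary` is a hypothetical helper.

```python
from statistics import mean, stdev


def transformed_summary(y_bar, s, a, b):
    """Mean and SD of Y' = a*Y + b, given the mean and SD of Y."""
    return a * y_bar + b, abs(a) * s


# Temperature example from the notes: Fahrenheit to Celsius, Y' = (5/9)Y - (5/9)*32
mean_c, sd_c = transformed_summary(79.6, 12.7, a=5 / 9, b=-5 / 9 * 32)
print(f"mean = {mean_c:.2f} C, SD = {sd_c:.2f} C")

# Numerical check on raw data (stem elongation in mm, from section 2.6)
elongation = [76, 72, 65, 70, 82]
checks = []
for a, b in [(0.1, 0.0), (-2.0, 5.0)]:      # illustrative constants only
    transformed = [a * y + b for y in elongation]
    predicted = transformed_summary(mean(elongation), stdev(elongation), a, b)
    actual = (mean(transformed), stdev(transformed))
    checks.append((predicted, actual))
    print(f"a={a}, b={b}: predicted {predicted}, actual {actual}")
```

The second pair shows why the absolute value is needed: a negative multiplier flips the data around but cannot make the SD negative.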

2.8 Statistical Inference

Statistical inference is the process of drawing conclusions about the population based on the observations in the sample. We can, for example, estimate the percentage of all people in England with blood type A as 44% (the sample proportion of people with that blood type). The sample must be a random sample from the entire population, so that it is representative of that population.

44% is a statistic (the sample proportion $\hat{p} = \frac{y}{n}$, "p hat") that estimates a parameter of the population (the population proportion p). There are also other statistics we can use to estimate a population proportion, namely $\tilde{p} = \frac{y+2}{n+4}$, "p tilde". In each case y = number of people in the sample that have blood type A and n = sample size. We will discuss these estimates in later chapters.

Other parameters of the population that we often estimate from samples are:

the population mean, μ, estimated by the sample mean, $\bar{y}$;
the population SD, σ, estimated by the sample SD, s.
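The two estimators of a proportion can be compared on a small sample. The notes do not give y and n behind the 44% figure, so this sketch uses the 21 blood types from Ex1 in section 2.2 instead (y = 8 people with type A); the function names are hypothetical.

```python
def p_hat(y, n):
    """Sample proportion: y successes out of n observations."""
    return y / n


def p_tilde(y, n):
    """Adjusted estimate: add 2 successes and 4 observations before dividing."""
    return (y + 2) / (n + 4)


# Ex1 (section 2.2): 8 of the 21 people sampled have blood type A
y, n = 8, 21
estimate_hat = p_hat(y, n)
estimate_tilde = p_tilde(y, n)
print(f"p hat = {estimate_hat:.3f}, p tilde = {estimate_tilde:.3f}")
```

For a sample this small the two estimates differ visibly; the adjustment matters less and less as n grows.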