Chapter 2. Description of samples and populations

2.1 Introduction

Statistics is the science of analyzing data. The information collected (the data) is recorded in terms of variables: characteristics of a subject that can be assigned a numerical value or a nonnumerical category. The data themselves and quantities computed from them are also called statistics.

Types of variables:

1. Categorical variable: records the category a subject belongs to, such as Blood Type (O, A, B, AB) or Gender (Female, Male). Usually the categories have no meaningful order. Some categorical data are ordinal, meaning a natural order exists, for example response to a treatment: none, partial, complete.

2. Quantitative (numeric) variable: records an amount or a count of something. It can be continuous, with values on a continuous scale (weight of a newborn, cholesterol content in a blood specimen), or discrete, where the possible values can be listed and are often integers (number of eggs in a nest, number of bacteria in a petri dish). The distinction between discrete and continuous variables is not rigid, since we often round measurements to the nearest integer.

Sample = collection of persons or things on which we measure one or more variables. The same word is sometimes used in a different context (for example, a sample of blood taken from a subject). To avoid confusion we will say a specimen of blood in that case.

Some other vocabulary and notation:

Example. Twenty students reported their gender, blood type and weight to a researcher. The students are the observational units. The variables are Gender and Blood Type (both categorical) and Weight (numerical). The sample size is n = 20.

We will use capital letters such as X and Y for the names of variables and lowercase letters (x or y) for particular observations. For example, we may use Y = weight of a student and $y_1 = 150$ lb for the weight of one such student (John).

2.2 Frequency distributions
When data are collected, it helps to summarize them in tables and/or graphs in order to make sense of them. We will use some example data sets to examine different ways data can be displayed.

Ex1: Sample of Blood Type for 21 people:

A O A AB O B AB A O A O AB O A O B A AB A O A

We can summarize it using a frequency and relative frequency table.

Frequency = count in a particular class
Relative frequency = frequency / n
% frequency = relative frequency * 100%
Frequency table results for Blood Type:

Blood Type   Frequency   Relative Frequency
A            8           0.3809524
AB           4           0.1904762
B            2           0.0952381
O            7           0.3333333

Notice that all frequencies add up to n = 21 and all relative frequencies add up to 1 (or 100%).

A graphical display of such data is a bar chart. Notice that the classes do not have to be placed in any particular order.

Ex2: US Solid Waste Weight (Pie Chart)

Material                    Weight (million tons)   Percent of Total
Food Scraps                 25.9                    11.2%
Glass                       12.8                    ?
Metals                      18                      ?
Paper, Paperboard           86.7                    37.4%
Plastics                    24.7                    10.7%
Rubber, Leather, Textiles   15.8                    6.8%
Wood                        12.7                    5.5%
Yard Trimmings              27.7                    11.9%
Other                       ?                       3.2%
Totals                      231.9                   100%
The missing frequency is 231.9 - 224.3 = 7.6 million tons (Other), and the missing relative frequencies are 12.8/231.9 = 5.5% (Glass) and 18/231.9 = 7.8% (Metals). To find the central angle of each slice of the pie chart, multiply 360 degrees by the relative frequency. For example, the Paper, Paperboard slice spans about 0.374 * 360 = 134.6 degrees.

Ex3: Number of children in each of 40 families (couples):

3 3 3 1 4 3 0 0 2 0 4 2 4 3 2 2 3 2 5 1
1 0 1 1 2 1 0 0 1 2 1 1 0 3 2 1 2 1 2 3

These data can be grouped using single-value classes, since there are relatively few different data values. Our classes, in order, are 0, 1, 2, 3, 4, 5, and the frequencies are computed exactly as in Ex1.

Frequency table results for Number of children:

Number of children   Frequency   Relative Frequency
0                    7           0.175
1                    11          0.275
2                    10          0.250
3                    8           0.200
4                    3           0.075
5                    1           0.025
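The frequency and relative frequency columns can be reproduced with a few lines of code. The following is a minimal Python sketch, assuming the Ex3 data are typed in as a list; the function name `frequency_table` is our own choice for illustration and does not come from the notes.

```python
from collections import Counter

def frequency_table(data):
    """Return (value, frequency, relative frequency) rows, sorted by value."""
    n = len(data)
    counts = Counter(data)
    return [(value, counts[value], counts[value] / n) for value in sorted(counts)]

# Ex3: number of children in each of 40 families
children = [3, 3, 3, 1, 4, 3, 0, 0, 2, 0, 4, 2, 4, 3, 2, 2, 3, 2, 5, 1,
            1, 0, 1, 1, 2, 1, 0, 0, 1, 2, 1, 1, 0, 3, 2, 1, 2, 1, 2, 3]

table = frequency_table(children)
for value, freq, rel in table:
    print(f"{value:>2} {freq:>3} {rel:.3f}")
```

The same function works for categorical data such as the blood types in Ex1, because it only counts how often each distinct value occurs.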
A graphical display of such data is called a histogram: a bar is raised over each class, with the class value at the middle of the bar. Another way to display such data is a dotplot: place a dot over each data value, and if a value is repeated, stack the dots, equally spaced, above it.

A grouped frequency distribution is appropriate for a data set with many different values, as in the following example.

Ex4: Age at onset of diabetes (35 people):

48 41 57 83 41 55 59 61 38 48 79 75 77 7 54 23 47 56
79 68 61 64 45 53 82 68 38 70 10 60 83 76 21 65 47

If we decide to start at 0 and use groups of width 10, we get the classes [0,10), [10,20), [20,30) and so on. Read the notation as interval notation: [10,20) includes 10 but not 20. A histogram for these data can also be obtained, with a bar raised over each class. The vertical axis can represent either frequency or relative frequency.

We can also obtain a quick histogram, called a stem-and-leaf diagram (or stemplot). Each data point is divided into a stem and a leaf; all possible stems are listed vertically and the leaves are added to them in order. Our stemplot is given below; notice that the leaves are ordered.

0 | 7
1 | 0
2 | 1 3
3 | 8 8
4 | 1 1 5 7 7 8 8
5 | 3 4 5 6 7 9
6 | 0 1 1 4 5 8 8
7 | 0 5 6 7 9 9
8 | 2 3 3

stems: tens, leaves: ones
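The stemplot for the Ex4 ages can be rebuilt programmatically. Below is a minimal Python sketch, assuming non-negative integer data with the tens as stems and the ones as leaves; the function name `stemplot` is hypothetical and chosen only for this illustration.

```python
from collections import defaultdict

def stemplot(data):
    """Build stemplot rows for non-negative integers (stem = all but the last digit)."""
    leaves = defaultdict(list)
    for value in sorted(data):          # sorting first keeps the leaves ordered
        stem, leaf = divmod(value, 10)
        leaves[stem].append(leaf)
    rows = []
    # List every stem from the smallest to the largest, even one with no leaves.
    for stem in range(min(leaves), max(leaves) + 1):
        rows.append(f"{stem} | " + " ".join(str(leaf) for leaf in leaves[stem]))
    return rows

# Ex4: age at onset of diabetes (35 people)
ages = [48, 41, 57, 83, 41, 55, 59, 61, 38, 48, 79, 75, 77, 7, 54, 23, 47, 56,
        79, 68, 61, 64, 45, 53, 82, 68, 38, 70, 10, 60, 83, 76, 21, 65, 47]

rows = stemplot(ages)
print("\n".join(rows))
```

Listing every stem, including empty ones, matters because gaps in the data would otherwise disappear from the picture.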
How to make a stemplot:

1. Separate each observation into a stem, consisting of all but the final (rightmost) digit (so it can have 1, 2, or more digits), and a leaf, the final digit (always a single digit).
2. Write the stems in a vertical column with the smallest at the top, and draw a vertical line at the right of this column.
3. Write each leaf in the row to the right of its stem, in increasing order out from the stem.

Ex5: Radish growth (mm in 3 days) under two conditions: A (in the dark) and B (12 hours of light / 12 hours of dark).

A: 15 20 22 20 29 37 11 35 15 30 8 25 33 10
B: 10 11 15 15 20 4 22 21 10 25 27 20 9 20

Side-by-side stemplots (with two lines per stem) let us compare both sets. In both, stems are tens and leaves are ones; the leaves for A are written to the left of the stems and the leaves for B to the right.

      A |   | B
        | 0 | 4
      8 | 0 | 9
    1 0 | 1 | 0 0 1
    5 5 | 1 | 5 5
  2 0 0 | 2 | 0 0 0 1 2
    9 5 | 2 | 5 7
    3 0 | 3 |
    7 5 | 3 |

Stemplot with two lines per stem: the number of stems can be doubled by splitting each stem in two, one with leaves 0 to 4 and the other with leaves 5 to 9.

Interpreting areas of the histogram: the area of each bar of the histogram is proportional to the corresponding frequency. In Ex4 the area between 10 and 30 (2 bars) equals 3/35 ≈ 8.6% of the total area of the histogram. We can also draw a histogram using a density scale, where the height of each bar is $\frac{f/n}{\text{class width}}$ (f = class frequency, n = sample size); then the total area of the histogram is 1, or 100%.
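The area interpretation can be checked numerically on the Ex4 ages. The sketch below is a minimal Python illustration, assuming classes of width 10 starting at 0 and that every observation falls inside the nine classes; the function name `grouped_frequencies` is ours.

```python
def grouped_frequencies(data, start, width, n_classes):
    """Count observations in the classes [start, start+width), [start+width, start+2*width), ..."""
    counts = [0] * n_classes
    for value in data:
        index = int((value - start) // width)   # class that contains this value
        counts[index] += 1
    return counts

# Ex4: age at onset of diabetes (35 people)
ages = [48, 41, 57, 83, 41, 55, 59, 61, 38, 48, 79, 75, 77, 7, 54, 23, 47, 56,
        79, 68, 61, 64, 45, 53, 82, 68, 38, 70, 10, 60, 83, 76, 21, 65, 47]

width = 10
n = len(ages)
counts = grouped_frequencies(ages, start=0, width=width, n_classes=9)

# Density-scale bar heights: relative frequency divided by class width.
density = [c / n / width for c in counts]

total_area = sum(height * width for height in density)      # should be 1
area_10_to_30 = (density[1] + density[2]) * width            # bars [10,20) and [20,30)

print(counts)
print(f"total area = {total_area:.3f}, area between 10 and 30 = {area_10_to_30:.3f}")
```

On the density scale the area of a bar equals the relative frequency of its class, which is why the areas of all bars sum to 1.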
Ex6: Iron intake, in milligrams, during a 24-hour period for a sample of 30 females under the age of 51:

15.0 18.1 14.4 14.6 10.9 18.1 18.2 18.3 15.0 16.0
12.6 16.6 20.7 19.8 11.6 12.8 15.6 11.0 15.3 9.4
19.5 18.3 14.5 16.6 11.5 16.4 12.5 14.6 11.9 12.5

In this example we may select groups of width 2, namely [9,11), [11,13), [13,15) and so on. We get 6 classes, an appropriate number for a data set of 30 observations.

Ex7: Weight data (in pounds) in an Intro. Stats class:

100, 105, 111, 115, 118, 118, 119, 120, 125, 125, 128, 128, 129, 130, 133, 135, 135, 138, 138, 140, 140, 145, 146, 150, 155, 158, 160, 162, 164, 165, 167, 171, 175, 178, 180, 180, 182, 185, 185, 187, 189, 190, 190, 193, 194, 195, 200, 205, 210, 215, 230, 270

We can clearly observe two prominent peaks: the data are bimodal.
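The choice of classes for the iron intake data in Ex6 can be sketched in Python as follows. This is only an illustration under the assumptions stated in the notes (classes of width 2 starting at 9, each class closed on the left and open on the right); the function name `make_classes` is hypothetical.

```python
import math

def make_classes(data, start, width):
    """Return (lower, upper, frequency) for left-closed classes [lower, upper)."""
    # Number of classes needed so that the largest value is covered.
    n_classes = math.floor((max(data) - start) / width) + 1
    classes = []
    for k in range(n_classes):
        lower = start + k * width
        upper = lower + width
        freq = sum(1 for value in data if lower <= value < upper)
        classes.append((lower, upper, freq))
    return classes

# Ex6: iron intake (mg) for 30 females under the age of 51
iron = [15.0, 18.1, 14.4, 14.6, 10.9, 18.1, 18.2, 18.3, 15.0, 16.0,
        12.6, 16.6, 20.7, 19.8, 11.6, 12.8, 15.6, 11.0, 15.3, 9.4,
        19.5, 18.3, 14.5, 16.6, 11.5, 16.4, 12.5, 14.6, 11.9, 12.5]

classes = make_classes(iron, start=9, width=2)
for lower, upper, freq in classes:
    print(f"[{lower},{upper}): {freq}")
```

Note that a value sitting exactly on a boundary, such as 11.0 or 15.0, is counted in the class that starts at that boundary.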
Describing the distribution of sample data: Modality, Shapes, Symmetry, and Skewness

Modality:
Unimodal - has one peak, e.g. bell-shaped, triangular, reverse J-shaped, J-shaped, right skewed, left skewed.
Bimodal - has two peaks (technically the peaks should be the same height, but in practice they need not be).
Multimodal - has 3 or more peaks.

Symmetry and Skewness:
Symmetric distribution - can be divided into 2 parts that are mirror images of each other. The symmetry does not have to be exact. E.g. bell-shaped, triangular, uniform.
Non-symmetric distribution - reverse J-shaped, J-shaped, right skewed, left skewed.

The distribution of population data is called the population distribution, or the distribution of the variable. The distribution of sample data is a sample distribution. The distribution of a random sample from a population approximates the population distribution, and larger samples tend to give a better approximation.

Shapes of Distributions: [Figure: right skewed distribution, left skewed distribution, symmetric distribution]
2.3 Descriptive Measures of Center

Let Y be our (numerical) variable.

Median: $\tilde{y}$ = middle of the ordered data. The position (location) of the median is $\frac{n+1}{2}$, where n = sample size.

Ex: Weight gain in pounds for 6 young lambs: 1 2 10 11 13 19. Here 0.5(6+1) = 3.5 (the median is between observations #3 and #4), so $\tilde{y}$ = (10+11)/2 = 10.5 lb.

If we add one more observation, 10 lb, the data become 1 2 10 10 11 13 19. Now 0.5(7+1) = 4 (the median is observation #4), so $\tilde{y}$ = 10 lb.

The median is a robust (resistant) measure of center: it is relatively unaffected by changes in a small portion of the data.

Mean (arithmetic mean): $\bar{y} = \frac{\sum_{i=1}^{n} y_i}{n}$, where the $y_i$ are the observations in the sample. In our example (the original 6 lambs) $\bar{y}$ = 56/6 ≈ 9.33 lb.

The differences between each data point and the mean, $(y_i - \bar{y})$, are called deviations from the mean. Their sum is zero for any data set: $\sum_{i=1}^{n} (y_i - \bar{y}) = 0$. In our example the sum of all deviations is -8.33 + (-7.33) + 0.67 + 1.67 + 3.67 + 9.67 ≈ 0 (exactly 0 before rounding).

The mean can be visualized as the balance point of a weightless seesaw with the data points (like children) sitting on it. Unlike the median, the mean is not robust: it is influenced by any change in the data, and strongly by extreme values. If the data have some extreme values, the median is a better measure of center for those data.
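The lamb example can be checked with a short Python sketch. The functions below are a minimal illustration written for these notes (Python's standard `statistics` module offers equivalent `median` and `mean` functions); the value 190 used at the end is a made-up extreme value, introduced only to show the difference in robustness.

```python
def median(data):
    """Median of the ordered data, using the position rule (n + 1) / 2."""
    ordered = sorted(data)
    n = len(ordered)
    middle = n // 2
    if n % 2 == 1:
        return ordered[middle]                            # a single middle observation
    return (ordered[middle - 1] + ordered[middle]) / 2    # average the two middle ones

def mean(data):
    return sum(data) / len(data)

lambs = [1, 2, 10, 11, 13, 19]          # weight gain in lb
deviations = [y - mean(lambs) for y in lambs]

print(median(lambs), median(lambs + [10]))
print(round(mean(lambs), 2), round(sum(deviations), 10))

# Hypothetical change: replace the largest value 19 by 190.
extreme = [1, 2, 10, 11, 13, 190]
print(median(extreme), round(mean(extreme), 2))
```

The median of the altered data is unchanged, while the mean moves far away from most of the observations, which is what "robust" means here.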
Mean vs Median

Right skewed distribution: Mean > Median
Left skewed distribution: Mean < Median
Symmetric distribution: Mean = Median

2.4 Boxplots

Single-variable data may be summarized by 5 numbers, the minimum, the maximum, the median and the 2 quartiles, referred to as the five-number summary. These values are also used to make a boxplot. The lower quartile, denoted by $Q_1$, is the median of the lower half of the data; the upper quartile, denoted by $Q_3$, is the median of the upper half of the data.

Ex1: Systolic blood pressure (in mmHg) of 7 adult males: 151 124 132 170 146 124 113

We order the data first: 113 124 124 132 146 151 170

Min = 113, Max = 170, Median = 132, $Q_1$ = 124, $Q_3$ = 151 (the median is excluded when we compute the quartiles).

The boxplot connects all 5 numbers; the box, drawn from $Q_1$ to $Q_3$, represents the middle half of the data.

[Figure: boxplot of the blood pressure data on an axis from 110 to 170]

Another measure we can compute is the interquartile range, IQR = $Q_3 - Q_1$. It gives the spread of the middle half of the data values. We can use it to find unusual data points (outliers). The procedure is as follows:
Compute the lower fence = $Q_1 - 1.5 \cdot \text{IQR}$ and the upper fence = $Q_3 + 1.5 \cdot \text{IQR}$. An outlier is a data point that falls outside the fences.

In our example: IQR = 151 - 124 = 27, 1.5(IQR) = 1.5 * 27 = 40.5, lower fence = 124 - 40.5 = 83.5, upper fence = 151 + 40.5 = 191.5. All observations are within the fences, so there are no outliers in our data set.

Ex2: Radish growth (in mm) in the light: 4 5 5 7 7 8 9 10 10 10 10 14 20 21

Min = 4, Max = 21, $Q_1$ = 7, Median = (9+10)/2 = 9.5, $Q_3$ = 10, IQR = 3, lower fence = 7 - 4.5 = 2.5, upper fence = 10 + 4.5 = 14.5, so 20 and 21 are outliers. A modified boxplot exposes outliers by plotting them as separate points (*).

[Figure: modified boxplot of the radish data on an axis from 5 to 25, with the two outliers marked by *]

2.5 Relationship between variables

This section discusses ways to compare two or more variables. Some methods include:

a) Two-way frequency and relative frequency tables, to examine the relationship between two categorical variables. They are useful for determining whether the variables are associated.
b) Scatter plots for numerical variables, to decide whether a linear trend is present, so that we can fit a regression line to the data.
c) Side-by-side boxplots, dotplots and stemplots, which are useful for observing whether there are differences between two or more treatments.

2.6 Measures of dispersion (variability)

Range = Maximum - Minimum gives the overall spread of the data. It is easy to calculate, but very sensitive to extreme data values.

IQR, as stated before, gives the range of the middle half of the data and is a robust measure, not sensitive to extreme data values.
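The five-number summary, the IQR and the fence rule from section 2.4 can be sketched in Python as follows. This is a minimal illustration that follows the convention of these notes (quartiles are medians of the lower and upper halves, with the overall median excluded when n is odd); statistical software often uses other quartile conventions and may give slightly different values. The function names are our own.

```python
def median(ordered):
    """Median of an already sorted list."""
    n = len(ordered)
    mid = n // 2
    return ordered[mid] if n % 2 else (ordered[mid - 1] + ordered[mid]) / 2

def five_number_summary(data):
    """Return (min, Q1, median, Q3, max) using medians of the two halves."""
    ordered = sorted(data)
    n = len(ordered)
    half = n // 2
    lower = ordered[:half]
    upper = ordered[half + 1:] if n % 2 else ordered[half:]   # skip the median if n is odd
    return ordered[0], median(lower), median(ordered), median(upper), ordered[-1]

def fences_and_outliers(data):
    """Return ((lower fence, upper fence), list of outliers)."""
    _, q1, _, q3, _ = five_number_summary(data)
    iqr = q3 - q1
    low, high = q1 - 1.5 * iqr, q3 + 1.5 * iqr
    return (low, high), [y for y in data if y < low or y > high]

blood_pressure = [151, 124, 132, 170, 146, 124, 113]
radishes = [4, 5, 5, 7, 7, 8, 9, 10, 10, 10, 10, 14, 20, 21]

print(five_number_summary(blood_pressure), fences_and_outliers(blood_pressure))
print(five_number_summary(radishes), fences_and_outliers(radishes))
```

The range of the radish data (21 - 4 = 17) is driven by the two outliers, while the IQR (3) is not, which illustrates why the IQR is called robust.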
Sample standard deviation: $s = \sqrt{\frac{\sum_{i=1}^{n} (y_i - \bar{y})^2}{n-1}}$. It averages the squared deviations from the mean (dividing by n - 1); the square root is taken at the end, so the units of s are the same as the units of the data. $s \ge 0$, and s = 0 if all data points are the same. $s^2$ is the sample variance. We will abbreviate standard deviation as SD; s will be used in formulas.

Ex: In an experiment on chrysanthemums, a botanist measured stem elongation in 7 days (in mm): 76, 72, 65, 70, 82.

n = 5, $\bar{y}$ = 365/5 = 73. The deviations from the mean are 3, -1, -8, -3, 9 and the squared deviations are 9, 1, 64, 9, 81, so

$s = \sqrt{(9 + 1 + 64 + 9 + 81)/4} = \sqrt{164/4} = \sqrt{41} \approx 6.40$ mm, and the variance is $s^2 = 41$ mm$^2$.

s gives a typical distance of the observations from the mean; larger s means more variability. Like the mean, s is influenced by extreme data values (it is not a robust measure).

n - 1 = degrees of freedom of s. As an intuitive justification for using n - 1 rather than n, consider n = 1: a single data point gives no information about variability, and indeed s cannot be computed (the formula gives 0/0).

The coefficient of variation is s expressed as a percentage of the mean: coefficient of variation = $\frac{s}{\bar{y}} \cdot 100\%$. It has no units and can be used to compare data sets with different units, for example:

Ex: Weight and height are measured for girls at age 2. Which of the two measures has greater variability?

Weight: mean = 12.6 kg, SD = 1.4 kg
Height: mean = 86.6 cm, SD = 2.9 cm

Coefficient of variation: 1.4/12.6 = 11.1% for weight and 2.9/86.6 = 3.3% for height. We conclude that weight is more variable: its SD is a much larger percentage of the mean than for height.
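The chrysanthemum and coefficient-of-variation calculations above can be reproduced with a minimal Python sketch. The function names are our own; Python's standard `statistics.stdev` computes the same sample SD with the n - 1 divisor.

```python
import math

def sample_sd(data):
    """Sample standard deviation with n - 1 in the denominator."""
    n = len(data)
    y_bar = sum(data) / n
    sum_of_squares = sum((y - y_bar) ** 2 for y in data)
    return math.sqrt(sum_of_squares / (n - 1))

def coefficient_of_variation(mean, sd):
    """SD expressed as a percentage of the mean (unit free)."""
    return sd / mean * 100

stems = [76, 72, 65, 70, 82]            # stem elongation in mm
s = sample_sd(stems)
print(f"s = {s:.2f} mm, variance = {s ** 2:.1f} mm^2")

cv_weight = coefficient_of_variation(12.6, 1.4)   # kg
cv_height = coefficient_of_variation(86.6, 2.9)   # cm
print(f"CV weight = {cv_weight:.1f}%, CV height = {cv_height:.1f}%")
```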
Typical Percentages: The Empirical Rule

For a nicely shaped distribution (fairly symmetric, unimodal, with no very long or very short tails) we expect to find:

about 68% of all data points within the interval $(\bar{y} - SD, \bar{y} + SD)$
about 95% of all data points within the interval $(\bar{y} - 2SD, \bar{y} + 2SD)$
more than 99% of all data points within the interval $(\bar{y} - 3SD, \bar{y} + 3SD)$

2.7 Effect of Transformation of Variables

Sometimes when we work with a data set it is convenient to transform our variable(s). For example, we may want to change units, or to turn very small numbers that appear in scientific notation into something easier to use by multiplying the original data by 10,000.

A linear transformation is the simplest one. Let Y be the original variable with mean $\bar{y}$ and SD s; then $Y' = aY + b$ is its linear transformation, with mean $\bar{y}'$ and SD $s'$. This type of transformation does not change the essential shape of the distribution of Y: the histogram of the transformed variable can be made identical to the original histogram by suitable scaling of the horizontal axis.

How does a linear transformation affect the mean and SD? Only the mean (not s) is affected by the additive part (adding a positive or negative constant b to Y), but both the mean and the SD are affected by multiplying Y by a positive or negative constant a:

$\bar{y}' = a\bar{y} + b$ and $s' = |a| s$

Ex: Suppose Y = summer temperature in some American city in 2013 in °F, with $\bar{y}$ = 79.6 °F and s = 12.7 °F. To change Y to °C, the transformation is

$Y' = (Y - 32) \cdot \frac{5}{9} = \frac{5}{9} Y - \frac{5}{9} \cdot 32$,

so the new mean is $\bar{y}' = \frac{5}{9} \cdot 79.6 - \frac{5}{9} \cdot 32 = 26.44$ °C and the new SD is $s' = \frac{5}{9} \cdot 12.7 = 7.06$ °C.

Nonlinear transformations, such as $Y' = \sqrt{Y}$, $Y' = \log Y$, $Y' = 1/Y$ and $Y' = Y^2$, can affect data in complex ways, and they do change the essential shape of the frequency distribution. If the distribution is right skewed, for example, and we wish to make it more symmetric, we can apply a square root transformation to pull in the right-hand tail and push out the left-hand tail.
A logarithmic transformation produces an even more drastic change in that regard (see the histograms given at the end of this section).
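The rules for linear transformations stated earlier in this section, $\bar{y}' = a\bar{y} + b$ and $s' = |a| s$, can be checked numerically. The Python sketch below is only an illustration: the five temperatures are made-up values (the notes give only the summary statistics 79.6 °F and 12.7 °F), and the helper names are our own.

```python
import math

def mean(data):
    return sum(data) / len(data)

def sample_sd(data):
    y_bar = mean(data)
    return math.sqrt(sum((y - y_bar) ** 2 for y in data) / (len(data) - 1))

# Hypothetical temperatures in degrees Fahrenheit (not from the notes).
temps_f = [68.0, 75.5, 79.0, 84.5, 91.0]

# Fahrenheit to Celsius as a linear transformation Y' = a*Y + b.
a, b = 5 / 9, -32 * 5 / 9
temps_c = [a * y + b for y in temps_f]

# Transform the raw data, then summarize ...
mean_c, sd_c = mean(temps_c), sample_sd(temps_c)
# ... or summarize first, then apply the rules.
mean_rule = a * mean(temps_f) + b
sd_rule = abs(a) * sample_sd(temps_f)
print(round(mean_c, 4), round(mean_rule, 4), round(sd_c, 4), round(sd_rule, 4))

# The summary statistics from the example in the notes.
print(round(a * 79.6 + b, 2), round(abs(a) * 12.7, 2))
```

The absolute value matters only when a is negative: multiplying by a negative constant flips the data around, but the SD, being a distance, stays non-negative.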
2.8 Statistical Inference

Statistical inference is the process of drawing conclusions about the population based on the observations in the sample. We can, for example, estimate the percentage of all people in England with blood type A as 44% (the sample proportion of people with that blood type). The sample must be considered a random sample from the entire population; it must be representative of that population.

44% is a statistic (the sample proportion $\hat{p} = \frac{y}{n}$, "p hat") that estimates a parameter of the population (the population proportion p). There are also other statistics we can use to estimate a population proportion, namely $\tilde{p} = \frac{y+2}{n+4}$ ("p tilde"). In each case y = number of people in the sample that have blood type A and n = sample size. We will discuss these estimates in later chapters.

Other parameters of the population that we often estimate from samples are:
population mean, μ, estimated by the sample mean, $\bar{y}$;
population SD, σ, estimated by the sample SD, s.
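The two estimates of a population proportion can be compared on a small sample. The Python sketch below is an illustration only: it reuses the blood type sample of Ex1 in section 2.2 (8 of 21 people with type A), since the notes do not give the counts behind the 44% figure, and the function names are our own.

```python
def p_hat(y, n):
    """Sample proportion y / n."""
    return y / n

def p_tilde(y, n):
    """Adjusted estimate (y + 2) / (n + 4)."""
    return (y + 2) / (n + 4)

# Ex1 of section 2.2: 8 of the 21 sampled people have blood type A.
y, n = 8, 21
print(f"p hat = {p_hat(y, n):.3f}, p tilde = {p_tilde(y, n):.3f}")
```

For this sample the two estimates are close (about 0.381 and 0.400); the adjustment adds 2 to the count and 4 to the sample size, so its effect shrinks as n grows.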