Chapter 3 Descriptive Measures

Measures of Center (Central Tendency)

These measures tell us where the center of our data is, that is, where the most typical value of a data set lies.

Mode: the value that occurs most frequently in the data set.
Obtain the frequency of each value.
1. If the greatest frequency is 1, then there is no mode.
2. If the greatest frequency is 2 or greater, then any value with that greatest frequency is a mode of the data set.
Example: 2, 3, 3, 3, 4, 4, 5. Mode = 3.

Median: divides the bottom 50% of the data from the top 50%.
Arrange the data in increasing order.
1. If the number of observations is odd, then the median is the observation exactly in the middle.
2. If the number of observations is even, then the median is the mean of the two middle observations.
For $n$ observations, the position of the median is the $\left(\frac{n+1}{2}\right)$th position in the ordered distribution.

Example: Weight gain in pounds for 6 young lambs: 1 2 10 11 13 19.
Position = (6+1)/2 = 3.5 (the median is between observations #3 and #4), so Median = (10+11)/2 = 10.5 lb.
If we add one more observation, 10 lb, the data become 1 2 10 10 11 13 19.
Position = (7+1)/2 = 4 (the median is observation #4), so Median = 10 lb.
The median is a robust (resistant) measure of center: it is relatively unaffected by changes in a small portion of the data.

Mean: the sum of the observations divided by the number of observations.
Mean (arithmetic mean) = $\bar{x} = \frac{1}{n}\sum_{i=1}^{n} x_i$, where the $x_i$ are the observations in the sample.
In our example, $\bar{x} = 56/6 \approx 9.33$ lb.
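The three rules above can be sketched in a few lines of Python. This is only an illustration of the definitions as stated in these notes; the function names (modes, median, mean) are chosen here for the example and are not part of the course material.

```python
from collections import Counter

def modes(data):
    """Return all modes, or an empty list if every value occurs once."""
    counts = Counter(data)
    top = max(counts.values())
    if top == 1:                      # greatest frequency is 1: no mode
        return []
    return sorted(v for v, c in counts.items() if c == top)

def median(data):
    """Middle value for odd n, mean of the two middle values for even n."""
    xs = sorted(data)                 # arrange in increasing order
    n = len(xs)
    mid = n // 2
    if n % 2 == 1:
        return xs[mid]
    return (xs[mid - 1] + xs[mid]) / 2

def mean(data):
    """Sum of the observations divided by the number of observations."""
    return sum(data) / len(data)

lambs = [1, 2, 10, 11, 13, 19]        # lamb weight gains (lb) from the notes
print(modes([2, 3, 3, 3, 4, 4, 5]))   # [3]
print(median(lambs))                  # 10.5
print(median(lambs + [10]))           # 10
print(round(mean(lambs), 2))          # 9.33
```

Returning a list from modes reflects rule 2, which allows more than one value to share the greatest frequency.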
The differences between each data point and the mean, $(x_i - \bar{x})$, are called deviations from the mean, and their sum is zero for any data set: $\sum_{i=1}^{n}(x_i - \bar{x}) = 0$.
In our example, the sum of all deviations = (-8.33) + (-7.33) + 0.67 + 1.67 + 3.67 + 9.67 = 0 (up to rounding).
The mean can be visualized as the balance point of a weightless seesaw with the data points (like children) sitting on it.
Unlike the median, the mean is not robust: it is influenced by any change in the data, and very much by extreme values. If the data have some extreme values, then the median is a better measure of center for those data.

Mean vs Median
Right-skewed distribution: Mean > Median
Left-skewed distribution: Mean < Median
Symmetric distribution: Mean = Median

Measures of Dispersion (Variability)

Range = Maximum - Minimum. It gives the overall spread of the data and is easy to calculate, but it is very sensitive to extreme data values.

Sample Standard Deviation
DEFINITION: $s = \sqrt{\dfrac{\sum_{i=1}^{n}(x_i - \bar{x})^2}{n-1}}$
s is based on the average of the squared deviations from the mean (with $n-1$ in the denominator). The square root is taken at the end, so the units of s are the same as the units of the data.
Properties:
$s \ge 0$, and s = 0 if all data points are the same.
s has the same units as your data.
A larger s indicates more variability.
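As a rough check of the statements above, the following sketch computes the deviations from the mean for the lamb data and applies the defining formula for s. The helper names (deviations, sample_sd) are hypothetical and used only for this illustration.

```python
import math

def deviations(data):
    """Deviations from the mean, x_i - x_bar."""
    xbar = sum(data) / len(data)
    return [x - xbar for x in data]

def sample_sd(data):
    """Sample SD from the defining formula (denominator n - 1)."""
    n = len(data)
    if n < 2:
        # n - 1 would be zero: one observation carries no spread information
        raise ValueError("need at least two observations")
    return math.sqrt(sum(d * d for d in deviations(data)) / (n - 1))

lambs = [1, 2, 10, 11, 13, 19]         # lamb weight gains (lb)
devs = deviations(lambs)
print([round(d, 2) for d in devs])     # [-8.33, -7.33, 0.67, 1.67, 3.67, 9.67]
print(abs(sum(devs)) < 1e-9)           # True: deviations sum to zero
print(sample_sd([5, 5, 5]))            # 0.0: identical data points
```

The sum of the deviations is compared against a small tolerance rather than exactly zero, because floating-point arithmetic can leave a tiny residual.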
$s^2$ is the sample variance. We will abbreviate standard deviation as SD; s will be used in the formulas.

Example: In an experiment on chrysanthemums, a botanist measured stem elongation in 7 days (in mm): 76, 72, 65, 70, 82.
$n = 5$, $\bar{x} = 365/5 = 73$

x_i      x_i - x̄    (x_i - x̄)^2
76         3           9
72        -1           1
65        -8          64
70        -3           9
82         9          81
total      0         164

$s = \sqrt{\dfrac{164}{4}} \approx 6.40$ mm, variance $s^2 = 41$ mm$^2$

s gives the typical distance of the observations from the mean; a larger s means more variability. Similar to the mean, s is also influenced by extreme data values (it is not a robust measure).
$n-1$ = degrees of freedom of s. As an intuitive justification for why we divide by $(n-1)$ and not $n$, consider $n = 1$: the variability of a single observation cannot be computed, because one data point gives no information about variability.

Sample standard deviation
COMPUTATIONAL FORMULA: $s = \sqrt{\dfrac{\sum_{i=1}^{n} x_i^2 - \frac{1}{n}\left(\sum_{i=1}^{n} x_i\right)^2}{n-1}}$

x_i      x_i^2
76       5776
72       5184
65       4225
70       4900
82       6724
total    365      26809    (sum of x_i = 365, sum of x_i^2 = 26809)
$s = \sqrt{\dfrac{26809 - \frac{(365)^2}{5}}{4}} = \sqrt{\dfrac{26809 - 26645}{4}} = \sqrt{\dfrac{164}{4}} \approx 6.40$ mm

The more variation there is in a data set, the larger its standard deviation. Similar to the mean, the standard deviation is not robust: it is influenced by any change in the data, and very much by extreme values.

Three Standard Deviations Rule: Almost all of the observations in any data set lie within three (3) standard deviations to either side of the mean.

More Precise Rules for Any Data Set (optional)
Chebyshev's rule: at least about 89% of the observations in any data set lie within three standard deviations to either side of the mean.
Chebyshev's rule (more precisely): for any data set and any number $k > 1$, at least $100\left(1 - \frac{1}{k^2}\right)\%$ of the observations lie within k standard deviations to either side of the mean. For $k = 3$ this gives $100(1 - 1/9)\% \approx 88.9\%$.
If the distribution is approximately bell-shaped, the Empirical Rule implies that about 99.7% of the observations lie within three standard deviations to either side of the mean. We will talk about bell-shaped distributions later.

Typical Percentages: The Empirical Rule
For a nice distribution (fairly symmetric, unimodal, no very long or very short tails) we expect to find:
about 68% of all data points within the interval $(\bar{x} - \text{SD},\ \bar{x} + \text{SD})$
about 95% of all data points within the interval $(\bar{x} - 2\,\text{SD},\ \bar{x} + 2\,\text{SD})$
more than 99% of all data points within the interval $(\bar{x} - 3\,\text{SD},\ \bar{x} + 3\,\text{SD})$

Effect of Transformation of Variables

Sometimes when we work with a data set it is convenient to transform our variable(s). For example, we may want to change units, or transform very small numbers that appear in scientific notation into something easier to use by multiplying the original data by 10,000.
A linear transformation is the simplest one. Let X be the original variable with mean $\bar{x}$ and SD = s. Then $X' = aX + b$ is its linear transformation, and the mean and SD of $X'$ are $\bar{x}'$ and $s'$ respectively. This type of transformation does not change the essential shape of the distribution of X: the histogram of the transformed variable can be made identical to the original histogram by suitable scaling of the horizontal axis.
How Does a Linear Transformation Affect the Mean and SD?
Only the mean (not s) is affected by an additive transformation (adding a positive or negative constant b to X), but both the mean and the SD are affected by multiplying X by a positive or negative constant a:
$\bar{x}' = a\bar{x} + b$ and $s' = |a|\,s$

Example: Suppose X = summer temperature in some American city in 2013 in °F, with $\bar{x} = 79.6$ °F and $s = 12.7$ °F. To change X to °C, the transformation is
$X' = (X - 32)\cdot\frac{5}{9} = \frac{5}{9}X - \frac{5}{9}\cdot 32$,
so the new mean is $\bar{x}' = \frac{5}{9}(79.6) - \frac{5}{9}(32) \approx 26.44$ °C and the new SD is $s' = \frac{5}{9}(12.7) \approx 7.06$ °C.

Nonlinear transformations, such as $X' = \sqrt{X}$, $X' = \log X$, $X' = \frac{1}{X}$, $X' = X^2$, can affect data in complex ways, and they do change the essential shape of the frequency distribution. If the distribution is right skewed, for example, and we wish to make it more symmetric, we can apply a square root transformation to pull in the right-hand tail and push out the left-hand tail. A logarithmic transformation will deliver an even more drastic change in that regard (check out the histograms given at the end of this section).

The Five-Number Summary; Boxplots

Median, Percentiles, Deciles, Quartiles, and Interquartile Range are all resistant measures.

Percentiles divide the distribution into 100 equal parts ($P_1, P_2, \ldots, P_{99}$).
$P_1$ divides the bottom 1% of the data from the top 99%.
$P_2$ divides the bottom 2% of the data from the top 98%.
Etc. The median is the 50th percentile.

Deciles divide the distribution into 10 equal parts ($D_1, D_2, \ldots, D_9$).
$D_1$ divides the bottom 10% of the data from the top 90%.
$D_2$ divides the bottom 20% of the data from the top 80%.
Etc. The median is $D_5$.

Quartiles divide the distribution into 4 equal parts ($Q_1, Q_2, Q_3$).
$Q_1$ divides the bottom 25% of the data from the top 75%.
$Q_2$ divides the bottom 50% of the data from the top 50%.
$Q_3$ divides the bottom 75% of the data from the top 25%.
The median is $Q_2$.

To Find the Quartiles
Arrange the data in increasing order.
1. $Q_1$ is the median of the part of the data set that lies at or below the median of the entire data set.
2. $Q_2$ is the median of the entire data set.
3. $Q_3$ is the median of the part of the data set that lies at or above the median of the entire data set.

Examples:
1. $n = 7$ (odd). Data: 3, 4, 5, 6, 12, 13, 14
$Q_1 = (4+5)/2 = 4.5$, $Q_2 = 6$, $Q_3 = (12+13)/2 = 12.5$
When n is odd, your calculator and your book have slightly different ways to calculate quartiles.
Your book: to compute the quartiles, the median is included in both the lower and the upper part of the data.
Your calculator: to compute the quartiles, the median is excluded from the computations, so you will get somewhat different values: $Q_1 = 4$, $Q_2 = 6$, $Q_3 = 13$.
2. $n = 10$ (even). Data: 1, 3, 4, 5, 6, 12, 13, 14, 15, 18
$Q_1 = 4$, $Q_2 = (6+12)/2 = 9$, $Q_3 = 14$

Interquartile Range (IQR): the difference between the first and third quartiles.
$\text{IQR} = Q_3 - Q_1$
The IQR gives (approximately) the range of the middle 50% of the observations.

The five-number summary of a data set consists of the minimum, the maximum, and the quartiles, written in increasing order: Min, $Q_1$, $Q_2$, $Q_3$, Max.

Outliers: observations well outside of the overall pattern of the data.
LL = Lower limit = $Q_1 - 1.5\,(\text{IQR})$
UL = Upper limit = $Q_3 + 1.5\,(\text{IQR})$
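The two quartile conventions described above (median included, as in the book, or excluded, as on the calculator) can be compared with a short sketch. The function names and the include_median flag are assumptions made for this illustration; statistical software may use yet other quartile conventions.

```python
from statistics import median

def quartiles(data, include_median=True):
    """Return (Q1, Q2, Q3) as medians of the lower and upper parts.

    include_median=True  : book convention (odd n: median in both parts)
    include_median=False : calculator convention (odd n: median in neither)
    For even n the two conventions coincide.
    """
    xs = sorted(data)
    n = len(xs)
    half = n // 2
    if n % 2 == 1 and include_median:
        lower, upper = xs[:half + 1], xs[half:]
    else:
        lower, upper = xs[:half], xs[n - half:]
    return median(lower), median(xs), median(upper)

def limits(q1, q3):
    """Lower and upper limits: Q1 - 1.5*IQR and Q3 + 1.5*IQR."""
    iqr = q3 - q1
    return q1 - 1.5 * iqr, q3 + 1.5 * iqr

data = [3, 4, 5, 6, 12, 13, 14]                 # Example 1 (n = 7)
print(quartiles(data))                          # (4.5, 6, 12.5)
print(quartiles(data, include_median=False))    # (4, 6, 13)
print(limits(4.5, 12.5))                        # (-7.5, 24.5)
```

With the book's quartiles for Example 1, IQR = 12.5 - 4.5 = 8, so the limits are 4.5 - 12 = -7.5 and 12.5 + 12 = 24.5.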
Potential outliers are observations outside of the lower and upper limits.

Boxplot (Box-and-Whisker Diagram) and the Modified Boxplot

To construct a boxplot:
1. Determine the five-number summary (Min, $Q_1$, $Q_2$, $Q_3$, Max).
2. Draw a horizontal axis on which the numbers obtained in step 1 can be located. Above this axis, mark the quartiles and the minimum and maximum with vertical lines.
3. Connect the quartiles to each other to make a box, and then connect the box to the minimum and maximum with lines.

The following is the boxplot for Example 1 above (n = 7):
[Figure: boxplot with marked values 3, 4.5, 6, 12.5, 14]

To construct a modified boxplot:
1. Determine the quartiles.
2. Determine the potential outliers and the adjacent values.
3. Draw a horizontal axis on which the numbers obtained in steps 1 and 2 can be located. Above this axis, mark the quartiles and the adjacent values with vertical lines.
4. Connect the quartiles to each other to make a box, and then connect the box to the adjacent values, that is, the most extreme observations that still lie within the lower and upper limits.
5. Plot each potential outlier with an asterisk.
The two lines stretching out on both sides of the box are the whiskers.

Example: The data represent the systolic blood pressure (in mmHg) of 7 adult males: 151 124 132 170 146 124 113
We order the data first: 113 124 124 132 146 151 170
Min = 113, Max = 170, Median = 132, $Q_1 = 124$, $Q_3 = 151$ (the median is excluded when we compute the quartiles).
The boxplot connects all 5 numbers in the following way; the box represents the middle half of the data.
[Figure: boxplot of the blood pressure data on an axis from 110 to 170 mmHg]

Are there any outliers? In our example: IQR = 151 - 124 = 27 and 1.5(IQR) = 1.5 × 27 = 40.5, so lower limit = 124 - 40.5 = 83.5 and upper limit = 151 + 40.5 = 191.5. All observations are within the limits, so there are no outliers in our data set.

Example: Radish growth (in mm) in the light: 4 5 5 7 7 8 9 10 10 10 10 14 20 21
Min = 4, Max = 21, $Q_1 = 7$, Median = (9+10)/2 = 9.5, $Q_3 = 10$
IQR = 3, lower limit = 2.5, upper limit = 14.5, so 20 and 21 are outliers.
The modified boxplot exposes the outliers.
[Figure: modified boxplot of the radish data on an axis from 5 to 25 mm, with the two outliers marked by asterisks]

Descriptive Measures for Populations; Use of Samples

Statistical inference is the process of drawing conclusions about the population based on the observations in the sample.

Notation:
              Size    Mean    SD
Sample        n       x̄       s
Population    N       μ       σ
Parameter: a descriptive measure for a population. Examples: $\mu$, $\sigma$
Statistic: a descriptive measure for a sample. Examples: $\bar{x}$, s
The sample mean, $\bar{x}$, is used to estimate the population mean, $\mu$.
The sample SD, s, is used to estimate the population SD, $\sigma$.

Population Mean $\mu$ (Mean of a Variable): computed in the same manner as a sample mean.
For a variable X, the mean of all possible observations for the entire population is called the population mean or mean of the variable X. It is denoted by $\mu_x$ or, when no confusion will arise, simply by $\mu$. For a finite population, we have
$\mu = \dfrac{\sum x}{N}$
where N is the population size.

Population Standard Deviation $\sigma$ (Standard Deviation of the Variable)
For a variable X, the standard deviation of all possible observations for the entire population is called the population standard deviation or standard deviation of the variable X. It is denoted by $\sigma_x$ or, when no confusion will arise, simply by $\sigma$. For a finite population, we have
$\sigma = \sqrt{\dfrac{\sum (x - \mu)^2}{N}} = \sqrt{\dfrac{\sum x^2}{N} - \mu^2}$
where N is the population size.
Population variance: $\sigma^2$

Standardized Variable
For a variable X, the variable
$z = \dfrac{x - \mu}{\sigma}$
is called the standard score or z-score. z is also called the standardized version of x, or the standardized variable corresponding to the variable x.
The value of the z-score tells us how many standard deviations above or below the mean a particular value of x is.
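The two expressions for the population standard deviation given above should agree. A minimal sketch, using a small hypothetical population chosen here only so that the numbers come out evenly (it is not data from the notes):

```python
import math

def pop_mean(values):
    """Population mean: sum of all values divided by N."""
    return sum(values) / len(values)

def pop_sd_definition(values):
    """Population SD from the defining formula, sqrt(sum((x - mu)^2) / N)."""
    mu = pop_mean(values)
    return math.sqrt(sum((x - mu) ** 2 for x in values) / len(values))

def pop_sd_shortcut(values):
    """Population SD from the shortcut formula, sqrt(sum(x^2) / N - mu^2)."""
    mu = pop_mean(values)
    return math.sqrt(sum(x * x for x in values) / len(values) - mu ** 2)

population = [2, 4, 4, 4, 5, 5, 7, 9]   # hypothetical finite population, N = 8
print(pop_mean(population))             # 5.0
print(pop_sd_definition(population))    # 2.0
print(pop_sd_shortcut(population))      # 2.0
```

Note the denominator N here, in contrast to n - 1 for the sample standard deviation s. For data with a very large mean relative to the spread, the shortcut form can lose precision in floating-point arithmetic, so the defining form is usually the safer choice in code.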
Properties of z-scores:
z < 0 if x is below the mean, z > 0 if x is above the mean, and z = 0 if x is equal to the mean.
z-scores have mean = 0 and SD = 1: $\mu_z = \dfrac{\sum z}{N} = 0$, $\sigma_z = \sqrt{\dfrac{\sum (z - \mu_z)^2}{N}} = 1$
z-scores have no units.
Most of the z-scores are between -3 and 3 (Three Standard Deviations Rule).

Example: Final test scores in all Mat119 classes last semester have mean $\mu = 72$ and SD $\sigma = 10$.
a) Jane scored 86 points. Find and interpret her z-score.
$z = \dfrac{x - \mu}{\sigma}$, so z = (86 - 72)/10 = 1.4. Her test grade is 1.4 standard deviations above the average.
b) Jack's z-score was -1.0. What was his test score?
$x = \mu + z\sigma$, so x = 72 - 1.0(10) = 62.
c) True or false? Very few final test scores are below 42 points or above 102 points.
True. Most of the scores are within 3 SDs of the mean, in the interval (42, 102), so very few are outside of that range.
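The worked example above can be reproduced with two small helper functions. This is a sketch only: the names z_score and from_z are hypothetical, and the eight-value population used to check the mean-0, SD-1 property is made up for illustration.

```python
from statistics import fmean, pstdev

def z_score(x, mu, sigma):
    """Number of standard deviations that x lies above (+) or below (-) mu."""
    return (x - mu) / sigma

def from_z(z, mu, sigma):
    """Recover the original value from its z-score: x = mu + z * sigma."""
    return mu + z * sigma

MU, SIGMA = 72, 10                      # final test scores from the example
print(z_score(86, MU, SIGMA))           # 1.4  (Jane)
print(from_z(-1.0, MU, SIGMA))          # 62.0 (Jack)
print(from_z(-3, MU, SIGMA), from_z(3, MU, SIGMA))   # 42 102

# Hypothetical population: its z-scores should have mean 0 and SD 1
# (up to floating-point rounding).
population = [2, 4, 4, 4, 5, 5, 7, 9]
mu, sigma = fmean(population), pstdev(population)
zs = [z_score(x, mu, sigma) for x in population]
print(fmean(zs), pstdev(zs))
```

Here pstdev is the population standard deviation (denominator N), which matches the definition of $\sigma$ used for z-scores in these notes.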