MAT 155: Describing, Exploring, and Comparing Data Page 1 of NotesCh2-3.doc

Notes for Chapter 2 Summarizing and Graphing Data
Chapter 3 Describing, Exploring, and Comparing Data
Frequency Distributions, Graphic Representation, Measures of Center, Variation, & Standing

In these chapters, we will study (1) visual representation of data, (2) measures of center and variation, and (3) relative standing and exploratory analysis. These three areas will include (1) frequency distributions, relative frequency distributions, cumulative frequency distributions, histograms, frequency polygons, stem-and-leaf plots, and scatter plots; (2) arithmetic mean, median, mode, midrange, weighted mean, range, standard deviation, coefficient of variation, empirical rule, and Chebyshev's Theorem; and (3) z-scores, quartiles, percentiles, outliers, and box plots (5-number summary).

Visual Representation of Data

A frequency distribution is one convenient way to represent a large amount of data in a small amount of space using two columns: (1) categories or classes and (2) frequency. There are some general guidelines that we should use when constructing a frequency distribution.

First, determine the number of classes, k, by using the "2 to the k" rule. Find the smallest integer k so that $2^k > n$, where n is the total number of observations or data values. For example, if n = 50 data values, we would find k = 6 classes. [$2^4 = 16$, $2^5 = 32$, and $2^6 = 64$] Of course, we have some freedom that allows us to choose a number of classes different from the k-value when actually constructing the frequency distribution. We may choose a different k-value to make the distribution more appealing.

NOTE: Classes should be mutually exclusive and collectively exhaustive. This ensures that each data value fits into only one class, and every value belongs to a class. Also, we should try to have at least 5 and not more than 15 classes. Thus, we will try to satisfy the inequality $5 \le k \le 15$. We should avoid, if possible, open-ended classes.

Second, determine the class interval or class width. Two guidelines that may be used to determine the class interval, i, are

(1) $i = \frac{\text{largest data value} - \text{smallest data value}}{\text{number of classes}}$

(2) $i = \frac{\text{largest data value} - \text{smallest data value}}{1 + 3.322(\log n)}$

Suppose the smallest and largest values of the 50 values from above are 12 and 88, respectively.

By (1), $i = \frac{88 - 12}{6} \approx 12.667$, and by (2), $i = \frac{88 - 12}{1 + 3.322(\log 50)} \approx 11.439$.

Again, we have some freedom to choose the class width (interval) to be a whole number if we wish. Depending on our choice for i, we may have to change the number of classes from 6.

NOTE: The class intervals should be equal.
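As a rough check of the two guidelines above, the following Python sketch computes the number of classes and both class-interval estimates for n = 50 values ranging from 12 to 88. The function names are illustrative only. The strict inequality 2^k > n and the base-10 logarithm are assumptions about the exact form of the rules.

```python
import math


def number_of_classes(n):
    """Smallest integer k with 2**k > n (assumed form of the '2 to the k' rule)."""
    k = 1
    while 2 ** k <= n:
        k += 1
    return k


def class_width_by_range(smallest, largest, k):
    """Guideline (1): range divided by the number of classes."""
    return (largest - smallest) / k


def class_width_by_log(smallest, largest, n):
    """Guideline (2): range divided by 1 + 3.322 * log10(n)."""
    return (largest - smallest) / (1 + 3.322 * math.log10(n))


k = number_of_classes(50)
print(k, round(class_width_by_range(12, 88, k), 3),
      round(class_width_by_log(12, 88, 50), 3))
```

The printed values (6, 12.667, 11.439) agree with the hand calculations. In practice the width would still be rounded to a convenient whole number such as 12 or 15.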

We will set up our classes so that the lower limit (left value) of the class is included in that class, and the upper limit (right value) of the class is not included in that class. Returning to the 50 data values ranging from 12 to 88, let us set up the classes. If we choose i = 12 and start the first class with a lower limit of 12, we would need 7 classes in order to include the largest value of 88. If we choose i = 15 and start with 10 as the lower limit of the first class, we would need only 6 classes to include the value of 88.

NOTE: Some people recommend that the lower limit of the first class be a whole number multiple of the smallest data value. However, this is not essential, and we will use that only when it is convenient.

Based on the information presented above, we may choose either of the class setups below.

Table A                   Table B
Classes: k = 7, i = 12    Classes: k = 6, i = 15
12-24                     10-25
24-36                     25-40
36-48                     40-55
48-60                     55-70
60-72                     70-85
72-84                     85-100
84-96

Once we set up the classes, we count and record the number of values in each class. In Table A, we record, in the frequency column, the number of values so that 12 ≤ value < 24, 24 ≤ value < 36, etc.

Table C. 50 Data Values
57 5 1 1 74 43 70 5 78 61 88 6 3 4 0 87 79 17 39 78 13 16 69 81 73 4 73 75 19 46 48 4 19 64 41 4 81 54 0 73 16 40 70 85 7 37 64 17 46

Using the guidelines, Table A, and the data in Table C above, we get the frequency distribution in Table D below.

Table D. Frequency Distribution
Classes         Frequency
12-24           13
24-36            5
36-48            9
48-60            3
60-72            6
72-84           11
84-96            3
Sum of freq.    n = 50

The relative frequency distribution is constructed from the frequency distribution by dividing each frequency by the sum of the frequencies. For example, 13/50 = 0.26, 5/50 = 0.10, etc. Table E below is the relative frequency distribution constructed from Table D.

Table E. Relative Frequency Distribution
Classes    Frequency    Relative Frequency
12-24      13           0.26
24-36       5           0.10
36-48       9           0.18
48-60       3           0.06
60-72       6           0.12
72-84      11           0.22
84-96       3           0.06
Total      50           1.00

From Table E, we see that about 26% and 22% of the data values are in the intervals [12, 24) and [72, 84), respectively.

In addition to the relative frequency distribution, we will discuss the less than cumulative frequency distribution (LCF). The LCF (Table F) shows the accumulated frequency that is less than the upper limit value in the respective class.

Table F. Less Than Cumulative Frequency Distribution
Classes    Frequency    Less than Cumulative Frequency (<cf)
12-24      13           13
24-36       5           18
36-48       9           27
48-60       3           30
60-72       6           36
72-84      11           47
84-96       3           50
Total      50           ---

We see that 13 values are smaller than 24. The 13 in the first class plus 5 in the second class give 18 values less than 36. [The rest of the values in the column <cf are obtained similarly: 18 + 9 = 27, 27 + 3 = 30, 30 + 6 = 36, 36 + 11 = 47, and 47 + 3 = 50.]

The histogram is constructed by using the class limits on the horizontal axis and the frequencies on the vertical axis. The histogram below on the left was constructed using Statdisk; on the right by using Excel.
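The relative and cumulative columns described above can be reproduced with a short Python sketch. It starts from the Table D frequencies (13, 5, 9, 3, 6, 11, 3). The helper names are illustrative, and the five values passed to `tally` are hypothetical, chosen only to show how the "lower limit included, upper limit excluded" rule assigns values to classes.

```python
from itertools import accumulate


def tally(values, start, width, k):
    """Count values into k classes [lower, upper) beginning at start."""
    counts = [0] * k
    for v in values:
        index = int((v - start) // width)
        if not 0 <= index < k:
            raise ValueError(f"{v} falls outside the classes")
        counts[index] += 1
    return counts


def relative_frequencies(freqs):
    """Divide each frequency by the sum of the frequencies."""
    total = sum(freqs)
    return [f / total for f in freqs]


def cumulative_frequencies(freqs):
    """Running total of the frequencies (the <cf column)."""
    return list(accumulate(freqs))


# Frequencies from Table D (classes 12-24 through 84-96).
table_d = [13, 5, 9, 3, 6, 11, 3]
print([round(r, 2) for r in relative_frequencies(table_d)])
print(cumulative_frequencies(table_d))

# Hypothetical values: 24 goes in the second class, not the first.
print(tally([12, 23, 24, 35.9, 88], start=12, width=12, k=7))
```

The first two printed lists match the relative frequency and <cf columns of Tables E and F.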

[Figure: Histogram of the frequency distribution. Vertical axis: Frequency (0, 5, 10, 15); horizontal axis: Classes (12-24, 24-36, 36-48, 48-60, 60-72, 72-84, 84-96).]

The stem-and-leaf plot is a good representation for raw data. All values are shown in a concise form, as shown by the Minitab output of the following data: 42, 45, 51, 61, 69, 76, 78, 78, 72, 62, 51, and 44.

Table H.
Current worksheet: Cities.mtw
Character Stem-and-Leaf Display
Stem-and-leaf of Atlanta  N = 12
Leaf Unit = 1.0
   3   4  245
   5   5  11
  (3)  6  129
   4   7  2688

Interval is 10. Stems increase by 10: 40, 50, 60, 70.

The stem-and-leaf indicates that there are 12 data values, and each leaf represents 1 unit. As we read the first row, we see there are 3 values in the 40s. These are 42, 44, and 45. There are 2 (5 - 3) values in the 50s: 51 and 51. There are (3) values in the 60s: 61, 62, and 69. Finally, there are 4 values in the 70s: 72, 76, 78, and 78. The first column accumulates from the top down until we reach (3). [Don't be concerned about the meaning of this value.] Then the accumulation starts at the bottom and works upward. Adding the (3), the number above it, and the number below it gives us n = 12. [5 + 3 + 4 = 12]

Measures of Center and Variation

We will first discuss the population mean and the sample mean. When talking about the population and a sample, we refer to a parameter and a statistic, respectively. Notations for the population and sample means are µ (mu) and $\bar{x}$ (x-bar), respectively. Notice in the formulas that N (upper case) represents the total number of observations in the population, and n (lower case) represents the total number of observations in the sample.

Arithmetic Mean for Populations and Samples

Type of Data    Population                                  Sample
Raw             $\mu = \frac{\sum x}{N}$                    $\bar{x} = \frac{\sum x}{n}$
Grouped         $\mu = \frac{\sum (f \cdot x)}{\sum f}$     $\bar{x} = \frac{\sum (f \cdot x)}{\sum f}$

For grouped data, x is the class midpoint and f is the class frequency.

The arithmetic mean (1) is calculated for interval-level and ratio-level data, (2) includes all data values, (3) is unique for a set of data, (4) is useful in comparing two or more groups of data, and (5) is affected by extremely large or extremely small values.

The median is a measure of center that requires little or no calculation for raw data. To find the median for raw data, we use the following procedure.
(1) Order the data from smallest value to largest value or vice versa.
(2) If the number of data values is odd, choose the value in the middle so that the same number of values are to the left as are to the right of the middle value.
(3) If the number of data values is even, choose the two values in the middle so that the same number of values are to the left as are to the right of the two middle values.
(4) Calculate the average of those two values.

To find the median for grouped data, we use the following procedure.
(1) In the frequency distribution, form the less than cumulative frequency (<CF) column.
(2) Find one-half the sum of the frequencies, n/2.
(3) Find the largest number in the <CF column that is not larger than n/2.
(4) Circle the row (Class, frequency, <CF) below the number in Step 3. This row contains the median.
(5) Subtract the number found in Step 3 (CF) from n/2, divide by the frequency (f) circled in Step 4, and multiply by the class interval (i).
(6) Add the answer from Step 5 to the lower class limit (L) circled in Step 4. This represents the median for the grouped data.

The following formula summarizes the six-step procedure given above.

Median for grouped data: $\text{Median} = L + \frac{\frac{n}{2} - CF}{f}(i)$

The mode is a measure of center that identifies the data value that appears most frequently. There will be no mode if all data values appear the same number of times. There will be more than one mode if two or more data values appear with the same frequency and more frequently than the other data value(s). To find the mode for raw data, simply find the value(s) that appear most frequently. To find the mode for grouped data, find the midpoint(s) of the class(es) that has (have) the largest frequency. The class containing the mode is called the modal class.

The midrange is midway between the largest value and the smallest value of the data.

$\text{midrange} = \frac{\text{largest} + \text{smallest}}{2}$
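The procedures for the mean, median, mode, and midrange can be sketched in Python as follows. This is a minimal illustration rather than a prescribed method. The raw values are the twelve values from the stem-and-leaf example (42, 44, 45, 51, 51, 61, 62, 69, 72, 76, 78, 78), the grouped median uses the Table D frequencies with lower limits 12, 24, ..., 84 and i = 12, and all function names are made up for the example.

```python
from collections import Counter


def mean(values):
    return sum(values) / len(values)


def median(values):
    """Middle value of the ordered data, or the average of the two middle values."""
    ordered = sorted(values)
    mid = len(ordered) // 2
    if len(ordered) % 2 == 1:
        return ordered[mid]
    return (ordered[mid - 1] + ordered[mid]) / 2


def modes(values):
    """Most frequent value(s); empty list when every value appears equally often."""
    counts = Counter(values)
    top = max(counts.values())
    if len(counts) > 1 and all(c == top for c in counts.values()):
        return []
    return sorted(v for v, c in counts.items() if c == top)


def midrange(values):
    return (max(values) + min(values)) / 2


def grouped_median(lower_limits, freqs, width):
    """Median = L + ((n/2 - CF) / f) * i, following the six-step procedure."""
    half = sum(freqs) / 2
    cf = 0  # cumulative frequency of the classes before the current one
    for lower, f in zip(lower_limits, freqs):
        if cf + f > half:  # first class whose <CF exceeds n/2 holds the median
            return lower + (half - cf) / f * width
        cf += f
    raise ValueError("no median class found")


values = [42, 45, 51, 61, 69, 76, 78, 78, 72, 62, 51, 44]
print(mean(values), median(values), modes(values), midrange(values))
print(round(grouped_median([12, 24, 36, 48, 60, 72, 84],
                           [13, 5, 9, 3, 6, 11, 3], 12), 3))
```

For the grouped data, n/2 = 25 and the largest <CF value not larger than 25 is 18, so the median class is 36-48 and the median is 36 + (7/9)(12), about 45.333.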

The weighted mean may be calculated by using the following three-step procedure: (1) multiply each value by a weight for that value, (2) sum those products, and (3) divide that sum by the sum of the weights. The following formula expresses the above procedure:

Weighted Mean: $\bar{x}_w = \frac{w_1 x_1 + w_2 x_2 + \cdots + w_n x_n}{w_1 + w_2 + \cdots + w_n}$

where $w_1, w_2, \ldots, w_n$ represent the weights and $x_1, x_2, \ldots, x_n$ represent the data values.

Skewness tells us something about the shape of a frequency distribution. A symmetric distribution is one whose graph is symmetric with respect to a vertical line that passes through the mean, median, and mode. If a distribution is skewed to the right, the graph is elongated (or stretched) to the right side. If a distribution is skewed to the left, the graph is elongated (or stretched) to the left side. Remember that extremely large values will pull the mean to the right, thus skewing the graph (distribution) to the right. Similarly, extremely small values will pull the mean to the left, thus skewing the graph (distribution) to the left. To calculate the coefficient of skewness by hand, we use Pearson's index (coefficient) of skewness formula:

$I = \frac{3(\text{mean} - \text{median})}{s}$

where s is the standard deviation.

We will discuss variation (dispersion) for two reasons. First, variation (dispersion) can be used to indicate the presence or absence of reliability. Second, variation (dispersion) can be used to compare the spread of two or more distributions.

One measure of variation (dispersion) is the range. The range is the difference between the largest and smallest data values. The calculation of the range is the simplest of the measures of variation (dispersion). A disadvantage of using the range is that it involves only two of the data values.

$\text{Range} = \text{Largest Value} - \text{Smallest Value}$ (D1)

We calculate the variance of data so that we can find the standard deviation. For population data, the variance is the arithmetic mean of the squared deviations from the mean. For sample data, divide the sum of the squared deviations by n - 1.
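Before turning to the variance procedure, the weighted mean and Pearson's index of skewness described above can be illustrated with a brief Python sketch. The scores and weights passed to `weighted_mean` are hypothetical. The skewness example reuses the twelve stem-and-leaf values and assumes the sample standard deviation (as computed by the standard library's `statistics.stdev`) in the denominator.

```python
import statistics


def weighted_mean(values, weights):
    """Sum of weight * value divided by the sum of the weights."""
    if len(values) != len(weights):
        raise ValueError("values and weights must have the same length")
    return sum(w * x for w, x in zip(weights, values)) / sum(weights)


def pearson_skewness(values):
    """Pearson's index: 3 * (mean - median) / s, with s the sample standard deviation."""
    s = statistics.stdev(values)
    return 3 * (statistics.mean(values) - statistics.median(values)) / s


# Hypothetical scores of 70, 80, 90 weighted 1, 2, 3.
print(round(weighted_mean([70, 80, 90], [1, 2, 3]), 3))

values = [42, 45, 51, 61, 69, 76, 78, 78, 72, 62, 51, 44]
print(round(pearson_skewness(values), 3))
```

A negative index, as with these twelve values (mean 60.75, median 61.5), suggests slight skewness to the left; an index of 0 corresponds to a mean equal to the median.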
We may use the following procedure to calculate the variance for ungrouped data.
(1) Calculate the arithmetic mean.
(2) Find the difference between each data value and the mean.
(3) Square each of the differences found in Step 2.
(4) Sum the squares from Step 3.
(5) If the data is from a population, divide the sum in Step 4 by N, the total number of data values.
(6) If the data is from a sample, divide the sum in Step 4 by n - 1, where n is the total number of data values.

The above steps are summarized in the formulas below. In the sample calculation, the denominator of n - 1 is used instead of n to help correct for the error created by the smaller number of data values in the sample compared to the population. The table below shows the Conceptual Formulas and Calculation Formulas (for raw or ungrouped data) used to find the variance of data.

The standard deviation can be used to compare the dispersion of two or more populations or samples. Also, if the data values are measured in the same units and the means are close together, a small standard deviation may be used to indicate that the mean is a reliable measure of central tendency. For population data, the standard deviation is the square root of the population variance. For sample data, the standard deviation is the square root of the sample variance.

We may use the following procedure to calculate the standard deviation for ungrouped data.
(1) Calculate the variance.
(2) Find the square root of the variance from Step 1.

The above steps are summarized in the formulas below.

Formulas to Calculate the Variance of Raw Data

               Population                                                        Sample
Conceptual     $\sigma^2 = \frac{\sum (x - \mu)^2}{N}$ (D3)                      $s^2 = \frac{\sum (x - \bar{x})^2}{n - 1}$ (D4)
Calculation    $\sigma^2 = \frac{\sum x^2 - \frac{(\sum x)^2}{N}}{N}$ (D5)       $s^2 = \frac{n \sum x^2 - (\sum x)^2}{n(n - 1)}$ (D6)

Formulas to Calculate the Variance and Standard Deviation of Grouped Data

                      Population                                                                        Sample
Variance              $\sigma^2 = \frac{\sum (f \cdot x^2) - \frac{(\sum f \cdot x)^2}{N}}{N}$ (D7)      $s^2 = \frac{n \sum (f \cdot x^2) - (\sum f \cdot x)^2}{n(n - 1)}$ (D8)
Standard Deviation    $\sigma = \sqrt{\sigma^2}$ (D9)                                                    $s = \sqrt{s^2}$ (D10)

In the grouped formulas, x is the class midpoint, f is the class frequency, and N (or n) is the sum of the frequencies.

For grouped data, the range is the difference between the upper limit of the largest class (interval) and the lower limit of the smallest class (interval).

$\text{Range} = \text{Upper Limit of Largest Interval} - \text{Lower Limit of Smallest Interval}$ (D1G)

Relative Dispersion. If the units of measure are different or the means are not close together, the standard deviation cannot be used to compare the dispersion of data sets. Therefore, we use the coefficient of variation, which measures the dispersion relative to the mean by dividing the standard deviation by the mean and multiplying by 100 to form a percent. The coefficient of variation is calculated by using the following formula:

$CV = \frac{s}{\bar{x}}(100\%)$ (D12)

Empirical Rule. The Empirical Rule applies only to distributions that are symmetrical and bell-shaped. For such distributions, the Empirical Rule states that about 68% of the data values are within plus or minus one standard deviation of the mean; about 95% are within plus or minus two standard deviations of the mean; and about 99.7% are within plus or minus three standard deviations of the mean.
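As an illustration of the variance, standard deviation, and coefficient of variation for raw data, the Python sketch below applies the conceptual and calculation forms to the twelve values from the stem-and-leaf example. The function names are illustrative, and the coefficient of variation is computed here from the sample standard deviation, which is an assumption about which version is intended.

```python
import math


def population_variance(values):
    """Conceptual form: mean of the squared deviations from the mean."""
    mu = sum(values) / len(values)
    return sum((x - mu) ** 2 for x in values) / len(values)


def sample_variance(values):
    """Conceptual form: sum of squared deviations divided by n - 1."""
    n = len(values)
    x_bar = sum(values) / n
    return sum((x - x_bar) ** 2 for x in values) / (n - 1)


def sample_variance_calc(values):
    """Calculation form: (n * sum(x^2) - (sum x)^2) / (n * (n - 1))."""
    n = len(values)
    sum_x = sum(values)
    sum_x2 = sum(x * x for x in values)
    return (n * sum_x2 - sum_x ** 2) / (n * (n - 1))


def coefficient_of_variation(values):
    """Sample standard deviation as a percent of the mean."""
    x_bar = sum(values) / len(values)
    return math.sqrt(sample_variance(values)) / x_bar * 100


values = [42, 45, 51, 61, 69, 76, 78, 78, 72, 62, 51, 44]
print(round(population_variance(values), 4))
print(round(sample_variance(values), 4), round(sample_variance_calc(values), 4))
print(round(math.sqrt(sample_variance(values)), 3))
print(round(coefficient_of_variation(values), 1))
```

The conceptual and calculation forms give the same sample variance, which is the point of having both: the calculation form avoids computing each deviation from the mean.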

Chebyshev's Theorem allows us to determine the minimum proportion of data values within a specific number (larger than one) of standard deviations of the mean for any set of data values. This minimum proportion is calculated by using the formula

$1 - \frac{1}{k^2}$ (D11)

where k > 1 is the number of standard deviations on either side of the mean. Using this formula, we see that at least 75% of the data values are between two standard deviations below the mean and two standard deviations above the mean. Similarly, there would be at least 88.9% within three standard deviations of the mean. There would be at least 55.6% within 1.5 standard deviations of the mean:

k = 1.5 yields $1 - \frac{1}{1.5^2} = 1 - \frac{4}{9} = \frac{5}{9} \approx 0.556 = 55.6\%$

The z-score, or standard score, is the number of standard deviations that x is from the mean.

z-score for sample: $z = \frac{x - \bar{x}}{s}$
z-score for population: $z = \frac{x - \mu}{\sigma}$

Quartiles, Deciles, Percentiles. Earlier we discussed measures of center. One of those measures was the median. We found the median to be the middle value of ungrouped data, and we used a formula to find the median for grouped data. Now we will calculate quartiles, deciles, and percentiles as measures of dispersion. For ungrouped data, the following formula may be used to find the location, L, of the kth percentile, $P_k$:

$L = \frac{k}{100} \cdot n$ (D13)

If L is a whole number, $P_k$ is midway between the Lth value and the (L+1)st value of the sorted data. If L is not a whole number, $P_k$ is the next value above the Lth position. To find the location of the first quartile, simply find the location of the 25th percentile; to find the location of the second decile, simply find the location of the 20th percentile.

$\text{Percentile of value } x = \frac{\text{number of values less than } x}{\text{total number of values}} \cdot 100$

Box Plot. A box plot is a graphical display of five values: the smallest and largest data values, the median, and the first and third quartiles. To draw a box plot,
(1) identify the smallest and largest data values,
(2) calculate the first, second, and third quartiles,
(3) draw a rectangle with the first quartile at the left end, the third quartile at the right end, and the second quartile (median) as a vertical line segment in the rectangle,
(4) draw line segments from the left end to the smallest value and from the right end to the largest value.

As an example, consider the following: the smallest value is 50, the first quartile is 70, the second quartile (median) is 90, the third quartile is 115, and the largest value is 150. The box plot representing these values is shown below.

[Figure: Box plot with scale marks at 50, 70, 90, 115, and 150.]

(Copyrighted by Claude S. Moore 2004-2008)
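The measures of relative standing above can be tried out with a small Python sketch. It is only an illustration: the function names are made up, `percentile` assumes a whole-number k strictly between 0 and 100 so that the location L = (k/100)n can be checked exactly with integer arithmetic, and the data are the twelve values from the stem-and-leaf example.

```python
def chebyshev_minimum(k):
    """Minimum proportion of values within k standard deviations (k > 1)."""
    if k <= 1:
        raise ValueError("k must be larger than 1")
    return 1 - 1 / k ** 2


def z_score(x, mean, sd):
    """Number of standard deviations that x lies from the mean."""
    return (x - mean) / sd


def percentile(values, k):
    """k-th percentile using the location rule L = (k / 100) * n."""
    if not 0 < k < 100:
        raise ValueError("k must be between 0 and 100")
    ordered = sorted(values)
    whole, remainder = divmod(k * len(ordered), 100)
    if remainder == 0:
        # L is whole: midway between the L-th and (L+1)-st sorted values
        return (ordered[whole - 1] + ordered[whole]) / 2
    # L is not whole: take the next value above the L-th position
    return ordered[whole]


def percentile_of_value(values, x):
    """Percent of the values that are less than x."""
    return sum(1 for v in values if v < x) / len(values) * 100


def five_number_summary(values):
    """Smallest value, Q1, median, Q3, largest value (the box plot values)."""
    return (min(values), percentile(values, 25), percentile(values, 50),
            percentile(values, 75), max(values))


values = [42, 45, 51, 61, 69, 76, 78, 78, 72, 62, 51, 44]
print([round(chebyshev_minimum(k), 3) for k in (1.5, 2, 3)])
print(five_number_summary(values))
print(percentile(values, 20), round(percentile_of_value(values, 61), 1))
```

With n = 12, the 25th percentile has L = 3, a whole number, so the first quartile is midway between the 3rd and 4th sorted values (45 and 51), giving 48. The 20th percentile has L = 2.4, so it is the 3rd sorted value, 45.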