MAT 142 College Mathematics. Module ST. Statistics. Terri Miller revised July 14, PDF Free Download

MAT 142 College Mathematics Statistics Module ST Terri Miller revised July 14, 2015

2 Statistics Data Organization and Visualization Basic Terms. A population is the set of all objects under study, a sample is any subset of a population, and a data point is an element of a set of data. Example 1. Population, sample, data point Population: all ASU students Sample: 1000 randomly selected ASU students data: 10, 15, 13, 25, 22, 53, 47 data point: 13 data point: 53 The frequency is the number of times a particular data point occurs in the set of data. A frequency distribution is a table that list each data point and its frequency. The relative frequency is the frequency of a data point expressed as a percentage of the total number of data points. Example 2. Frequency, relative frequency, frequency distribution data: 1, 3, 6, 4, 5, 6, 3, 4, 6, 3, 6 frequency of the data point 1 is 1 frequency of the data point 6 is 4 the relative frequency of the data point 6 is (4/11) 100% 36.35% the frequency distribution for this set of data is: (where x is a data point and f is the frequency for that point) x f 1 1 3 3 4 4 2 2 5 1 Data is often described as ungrouped or grouped. Ungrouped data is data given as individual data points. Grouped data is data given in intervals. Example 3. Ungrouped data without a frequency distribution. 1, 3, 6, 4, 5, 6, 3, 4, 6, 3, 6 Example 4. Ungrouped data with a frequency distribution.

MAT 142 - Module ST 3 Example 5. Grouped data. Number of television sets Frequency 0 2 1 13 2 18 3 0 4 10 5 2 Total 45 Organizing Data. Exam score Frequency 90-99 7 80-89 5 70-79 15 60-69 4 50-59 5 40-49 0 30-39 1 Total 37 Given a set of data, it is helpful to organize it. This is usually done by creating a frequency distribution. Example 6. Ungrouped data. Given the following set of data, we would like to create a frequency distribution. 1 5 7 8 2 3 7 2 8 7 To do this we will count up the data by making a tally (a tick mark in the tally column for each occurrence of the data point. As before, we will designate the data points by x. x tally 1 2 3 4 5 6 7 8 Now we add a column for the frequency, this will simply be the number of tick marks for each data point. We will also total the number of data points. As we have done previously, we will represent the frequency with f.

4 Statistics x tally f 1 1 2 2 3 1 4 0 5 1 6 0 7 3 8 2 Total 10 This is a frequency distribution for the data given. We could also include a column for the relative frequency as part of the frequency distribution. We will use rel f to indicate the relative frequency. x tally f rel f 1 1 1 100% = 10% 10 2 2 2 100% = 20% 10 3 1 10% 4 0 0% 5 1 10% 6 0 0% 3 7 3 100% = 30% 10 8 2 20% Totals 10 100% For our next example, we will use the data to create groups (or categories) for the data and then make a frequency distribution. Example 7. Grouped Data. Given the following set of data, we want to organize the data into groups. We have decided that we want to have 5 intervals. 26 18 21 34 18 38 22 27 22 30 25 25 38 29 20 24 28 32 33 18 Since we want to group the data, we will need to find out the size of each interval. To do this we must first identify the highest and the lowest data point. In our data the highest data point is 38 and the lowest is 18. Since we want 5 intervals, we make the computation highest lowest 38 18 = = 20 number of intervals 5 5 = 4 Since we need to include all points, we always take the next highest integer from that which was computed to get the length of our interval. Since we computed 4, the length of our intervals will be 5. Now we set up the first interval lowest x < lowest + 5 which results in 18 x < 23.

MAT 142 - Module ST 5 Our next interval is obtained by adding 5 to each end of the first one: 18 + 5 x < 23 + 5 which results in 23 x < 28. We continue in this manner to get all of our intervals: 18 x < 23 23 x < 28 28 x < 33 33 x < 38 38 x < 43. Now we are ready to tally the data and make the frequency distribution. Be careful to make sure that a data point that is the same number as the end of the interval is placed in the correct interval. This means that the data point 33 is counted in the interval 33 x < 38 and NOT in the interval 28 x < 33. x tally f rel f 7 18 x < 23 7 100% = 35% 20 5 23 x < 28 5 100% = 25% 20 4 28 x < 33 4 100% = 20% 20 2 33 x < 38 2 100% = 10% 20 38 x < 43 2 10% Totals 20 100% Histogram. Now that we have the data organized, we want a way to display the data. One such display is a histogram which is a bar chart that shows how the data are distributed among each data point (ungrouped) or in each interval (grouped) Example 8. Histogram for ungrouped data. Given the following frequency distribution: The histogram would look as follows: Number of television sets Frequency 0 2 1 13 2 18 3 0 4 10 5 2

6 Statistics "' ( & $ " ' ( & $ " ' " # $ % & Example 9. Histogram example for grouped data. We will use the data from example 7. The histogram would look as follows.

Mode. MAT 142 - Module ST 7 Measures of Central Tendency The mode is the data point which occurs most frequently. It is possible to have more than one mode, if there are two modes the data is said to be bimodal. It is also possible for a set of data to not have any mode, this situation occurs if the number of modes gets to be too large. It it not really possible to define too large but one should exercise good judgement. A reasonable, though very generous, rule of thumb is that if the number of data points accounted for in the list of modes is half or more of the data points, then there is no mode. Note: if the data is given as a list of data points, it is often easiest to find the mode by creating a frequency distribution. This is certainly the most organized method for finding it. In our examples we will use frequency distributions. Example 10. A data set with a single mode. Consider the data from example 8: Number of television sets Frequency 0 2 1 13 2 18 3 0 4 10 5 2 You can see from the table that the data point which occurs most frequently is 2 as it has a frequency of 18. So the mode is 2. Example 11. A data set with two modes. Consider the data: Number of hours of television Frequency 0 1 0.5 4 1 8 1.5 9 2 13 2.5 10 3 11 3.5 13 4 5 4.5 3 You can see from the table that the data points 2 and 3.5 both occur with the highest frequency of 13. So the modes are 2 and 3.5.

8 Statistics Example 12. A data set with no mode. Consider the data: Age Frequency 18 12 19 5 20 3 21 9 22 1 23 8 24 12 25 12 26 5 27 3 Total 71 You can see from the table that the data points 18, 24 and 25 all occur with the highest frequency of 12. Since this would account for 36 of the 71 data points, this would qualify as too large a number of data points taken accounted for. In this case, we would say there is no mode. Median. The median is the data point in the middle when all of the data points are arranged in order (high to low or low to high). To find where it is, we take into account the number of data points. If the number of data points is odd, divide the number of data points by 2 and then round up to the next integer; the resulting integer is the location of the median. If the number of data points is even, there are two middle values. We take the number of data points and divide by 2, this integer is the first of the two middles, the next one is also a middle. Now we average these two middle values to get the median. Example 13. An odd number of data points with no frequency distribution. 3, 4.5, 7, 8.5, 9, 10, 15 There are 7 data points and 7/2=3.5 so the median is the 4th number, 8.5. Example 14. an odd number of data points with a frequency distribution. Age Frequency 18 12 19 5 20 3 21 9 22 2 Total 31 There are 31 data points and 31/2=15.5 so the median is the 16th number. Start counting, 18 occurs 12 times, then 19 occurs 5 times getting us up to entry 17 (12+5); so the 16th entry must be a 19. This data set has a median of 19.

MAT 142 - Module ST 9 Example 15. An even number of data points with no frequency distribution. 3, 4.5, 7, 8.5, 9, 10, 15, 15.5 There are 8 data points and 7/2=3.5 so the median is the 4th number, 8.5. Mean. The mean is the average of the data points, it is denoted x. There are three types of data for which we would like to compute the mean, ungrouped of frequency 1, ungrouped with a frequency distribution, and grouped. Starting with the first type, ungrouped of frequency 1, is when data is given to you as a list and it is not organized into a frequency distribution. When this happens, we compute the average as we have always done, add up all of the data points and divide by the number of data points. To write a formula for this, we use the capital greek letter sigma, Σx. This just means to add up all of the data points. We will use n to represent the number of data points. mean: x = Σx n This corresponds to the left hand column of your calculator instructions. Example 16. Given the ungrouped data list below: 10 15 13 25 22 53 47 We would enter the data into the calculator following the instructions in the left hand column, the result is x = 26.4285714. When have a frequency distribution for the data, we have to take the average as before but remember that the frequency gives the number of times that the data point occurs. mean: x = Σ(fx) n This corresponds to the right hand column of your calculator instructions. Example 17. Given the frequency distribution for ungrouped data below: Number of television sets Frequency 0 2 1 13 2 18 3 0 4 10 5 2 We would enter the data into our calculators following the directions in the right hand column. The result is x = 2.2.

10 Statistics Our final type of data is grouped data. This requires a computation before we can begin. Since we cannot enter the entire interval as a data point, we use a representative for each interval, x i. This representative is the midpoint of the interval, to find the midpoint of an interval you add the two endpoints and divide by 2. These are the numbers that you use as data points for computing the mean. Example 18. Mean for Grouped data. We will use the data from example 7 again: x f 18 x < 23 7 23 x < 28 5 28 x < 33 4 33 x < 38 2 38 x < 43 2 Total 20 Start by calculating the representative for each interval. 23 + 18 = 41 2 2 = 20.5 Since this is the midpoint of the first interval and the intervals have length 5, we find the rest by adding 5 to this one. x x i f 18 x < 23 20.5 7 23 x < 28 25.5 5 28 x < 33 30.5 4 33 x < 38 35.5 2 38 x < 43 40.5 2 Totals 20 Now enter the data as directed using the right hand column of the calculator directions. You use x i as the data point (list 1) and the frequency as usual in list 2. The result should be x = 27.25.

MAT 142 - Module ST 11 Measures of Dispersion In the last section we looked at measure of central tendency. These let us know what the middle of a set of data look like. We are not going to look at measures of dispersion. These will tell us know spread out a set of data is. One measure of spread is the range. The range is the difference between the highest data value and the lowest data value. We are going to be using the following data for the next few examples. Below is a list of the heights, in inches, of the 2015/16 Phoenix Suns roster players. 84 73 78 85 77 75 85 82 75 82 76 78 80 Example 19. What is the range of the heights of the players on the current Phoenix Suns roster? Solution: The tallest player is 85 inches and the shortest player is 73 inches. This means that the range of the heights is 85 73 = 12. The standard deviation is a number that describes the amount of variation or dispersion of a set of data values. The standard deviation is the square root of the variance. The standard deviation is the square root of the sum of the square of the differences between each data value and the mean divide by one less than the number of data values. S x = Σ(x x) 2 (n 1) Example 20. Find the standard deviation of the heights of the players on the Phoenix Suns 2015/16 roster. How many of the players fall within one standard deviation of the mean? Solution: We first must find the mean of the player heights. We do this by adding all the heights together and dividing by the number of players. (Note: Round the mean to two decimal places and any squared values to four decimal places.) x = 84 + 73 + 78 + 85 + 77 + 75 + 85 + 82 + 75 + 82 + 76 + 78 + 80 13 = 79.23 Now that we have the mean, we can calculate the standard deviation. This is most easily done by creating a table.

12 Statistics Data Value x x (x x) 2 84 84 79.23 = 4.77 (4.77) 2 = 22.7529 73 73 79.23 = 6.23 ( 6.23) 2 = 38.8129 78 78 79.23 = 1.23 ( 1.23) 2 = 1.5129 85 85 79.23 = 5.77 5.77 2 = 33.2929 77 77 79.23 = 2.23 ( 2.23) 2 = 4.9729 75 75 79.23 = 4.23 ( 4.23) 2 = 17.8929 85 85 79.23 = 5.77 5.77 2 = 33.2929 82 82 79.23 = 2.77 2.77 2 = 7.6729 75 75 79.23 = 4.23 ( 4.23) 2 = 17.8929 82 82 79.23 = 2.77 2.77 2 = 7.6729 76 76 79.23 = 3.23 ( 3.23) 2 = 10.4329 78 78 79.23 = 1.23 ( 1.23) 2 = 1.5129 80 80 79.23 = 0.77 0.77 2 = 0.5929 Σ (x x) 0 Σ (x x) 2 = 198.3077 We now need to take the sum of the square of the differences between the data values and the mean, divide that number by 12 (13 1) and then take the square root of that result. This will give us that the standard deviation of the heights of the players is 198.3077 S x = = 4.07 12 Now to answer the question of how many players fall within one standard deviation of the mean. This is asking us to first find an interval whose left end is the mean minus one standard deviation (79.23 4.07 = 75.16) and whose right end is the mean plus one standard deviation (79.23 + 4.07 = 83.3). We now need to count how many players have a height between 75.16 inches and 83.3 inches. There are seven players whose heights (78, 77, 82, 82, 76, 78, 80) are within one standard deviation of the mean.

MAT 142 - Module ST 13 Normal Distribution Many types of data are normally distributed. This is often known as the bell curve. In normally distributed data, the mean, median and mode are equal. Any data which is normally distributed can transformed to a standard normal distribution. The standard normal distribution has a mean, median, and mode of 0 (µ = 0) and a standard deviation of 1 (σ = 1). Normally distributed data can be summarized by a rule often referred to the 68-95-99.7 rule. This rule says that approximately 68% of the data fall within one standard deviation of the mean; 95% of the data fall within two standard deviations of the mean; and 99.7% of the data fall within three standard deviations of the mean.

14 Statistics Standard Normal Distribution 68-95-99.7 Rule A z-score is a statistical measure that tells us how many standard deviations above or below the mean a particular data value is in a normal distribution. A positive z-score indicates that the data value is above the mean. A negative z-score indicates that the data value is below the mean. A z-score of 0 indicates that the data value is equal to the mean. z = x µ σ Using a z-score, we can use what we know about the standard normal distribution to determine the area under the standard normal curve that falls below or above a particular data value. There are table that will give us this information or allow us to calculate it. We have provided one such z-table as part of the formula sheet. Our z table gives the area under the normal curve to the left of a particular z-value. Example 21. Human pregnancies are normally distributed with a mean of 266 days and a standard deviation of 16 days. What is the probability that a pregnancy will last fewer than 270 days? Solution: Let s start by looking at a picture for this problem. area below 270 days 270

MAT 142 - Module ST 15 In order to answer this question, we will need to determine how many standard deviations away from from the mean 270 days is. This is what we get when we calculate the z-score. 270 266 z = =.25 16 We now need to look this z-score up in the z-table. We go down the left column of the positive side of the table until we get to 0.2. We then go across the.2 row until we get to the 0.05 column. Z-table z 0.0 0.01 0.02 0.03 0.04 0.05 0.06 0.07 0.08 0.09 0.0 0.5000 0.5040 0.5080 0.5120 0.5160 0.5199 0.5239 0.5279 0.5319 0.5359 0.1 0.5398 0.5438 0.5478 0.5517 0.5557 0.5596 0.5636 0.5675 0.5714 0.5753 0.2 0.5793 0.5832 0.5871 0.5910 0.5948 0.5987 0.6026 0.6064 0.6103 0.6141 0.3 0.6179 0.6217 0.6255 0.6293 0.6331 0.6368 0.6406 0..6443 0.6480 0.6517 0.4 0.6554 0.6591 0.6628 0.6664 0.6700 0.6736 0.6772 0.6808 0.6844 0.6879 0.5 0.6915 0.6950 0.6985 0.7019 0.7054 0.7088 0.7123 0.7157 0.7190 0.7224 0.6 0.7257 0.7291 0.7324 0.7357 0.7389 0.7422 0.7454 0.7486 0.7517 0.7549 This number, 0.7 0.5948, 0.7580 is 0.7611 the probability 0.7642 0.7673 that 0.7704 a pregnancy 0.7734 will 0.7764 last0.7794 fewer 0.7823 that 270 0.7852 days. 0.8 0.7881 0.7910 0.7939 0.7967 0.7995 0.8023 0.8051 0.8078 0.8106 0.8133 Example 22. Human pregnancies are normally distributed with a mean of 266 days and a 0.9 0.8159 0.8186 0.8212 0.8238 0.8264 0.8289 0.8315 0.8340 0.8365 0.8389 standard deviation of 16 days. What percent of pregnancies will last more than 280 days? 1.0 0.8413 0.8438 0.8461 0.8485 0.8508 0.8531 0.8554 0.8577 0.8599 0.8621 Solution: Let s start by looking at a picture for this problem. 1.1 0.8643 0.8665 0.8686 0.8708 0.8729 0.8749 0.8770 0.8790 0.8810 0.8830 1.2 0.8849 0.8869 0.8888 0.8907 0.8925 0.8944 0.8962 0.8980 0.8997 0.9015 1.3 0.9032 0.9049 0.9066 0.9082 0.9099 0.9115 0.9131 0.9147 0.9162 0.9177 1.4 0.9192 0.9207 0.9222 0.9236 0.9251 0.9265 0.9279 0.9292 0.9306 0.9319 1.5 0.9332 0.9345 0.9357 0.9370 0.9382 0.9394 0.9406 0.9418 0.9429 0.9441 1.6 0.9452 0.9463 0.9474 0.9484 0.9495 0.9505 0.9515 0.9525 0.9535 0.9545 1.7 0.9554 0.9564 0.9573 0.9582 0.9591 0.9599 0.9608 0.9616 0.9625 0.9633 1.8 0.9641 0.9649 0.9656 0.9664 0.9671 0.9678 0.9686 0.9693 0.9699 0.9706 1.9 0.9713 0.9719 0.9726 0.9732 0.9738 0.9744 0.9750 0.9756 0.9761 0.9767 2.0 0.9772 0.9778 0.9783 0.9788 0.9793 0.9798 0.9803 0.9808 0.9812 0.9817 2.1 0.9821 0.9826 0.9830 0.9834 0.9838 0.9842 0.9846 0.9850 0.9854 0.9857 2.2 0.9861 0.9864 0.9868 0.9871 00.9875 0.9878 0.9881 0.9884 0.9887 0.9890 2.3 0.9893 0.9896 0.9898 0.9901 0.9904 0.9906 0.9909 0.9911 0.9913 0.9916 2.4 0.9918 0.9920 0.9922 0.9925 0.9927 0.9929 0.9931 0.9932 0.9934 0.9936 2.5 0.9938 0.9940 0.9941 0.9943 0.9945 0.9946 0.9948 0.9949 0.9951 0.9952 2.6 0.9953 0.9955 0.9956 0.9957 0.9959 0.9960 0.9961 0.9962 0.9963 0.9964 2.7 0.9965 0.9966 0.9967 0.9968 0.9969 0.9970 0.9971 0.9972 0.9973 0.9974 280 area above 280 days 2.8 0.9974 0.9975 0.9976 0.9977 0.9977 0.9978 0.9979 0.9979 0.9980 0.9981 2.9 0.9981 0.9982 0.9982 0.9983 0.9984 0.9984 0.9985 0.9985 0.9986 0.9986 3.0 0.9987 0.9987 0.9987 0.9988 0.9988 0.9989 0.9989 0.9989 0.9990 0.9990 3.1 0.9990 0.9991 0.9991 0.9991 0.9992 0.9992 0.9992 0.9992 0.9993 0.9993 In order to answer this question, we will need to determine how many standard deviations 3.2 0.9993 0.9993 0.9994 0.9994 0.9994 0.9994 0.9994 0.9995 0.9995 0.9995 away from from the mean 280 days is. This is what we get when we calculate the z-score. 3.3 0.9995 0.9995 0.9995 0.9996 0.9996 0.9996 0.9996 0.9996 0.9996 0.9997 3.4 0.9997 0.9997 0.9997 0.9997 280 0.9997 266 0.9997 0.9997 0.9997 0.9997 0.9998 z = =.875 16

16 Statistics We now need to look this z-score up in the z-table. Since our z-table only has z-score to two decimal places, we will need to look up 0.88. We go down the left column of the positive side of the table until we get to 0.8. We then go across the.8 row until we get to the 0.08 column. This number, 0.8106, is the probability that a pregnancy will last fewer that 280 days. We are asked about more that 280 days, thus we need to subtract this number from 1. Thus the probability that a pregnancy will last more than 280 days it 1 0.8106 =.1894. We need to change this probability to a percent by multiplying it by 100 to find that 18.94% of pregnancies will last more than 280 days. Example 23. Human pregnancies are normally distributed with a mean of 266 days and a standard deviation of 16 days. What is the probability that a pregnancy will last between 250 and 275 days? Solution: Let s start by looking at a picture for this problem. area below 250 days area below 275 days In order to answer this question, we will need to determine how many standard deviations away from from the mean 250 days is. This is what we get when we calculate the z-score. 250 266 z = = 1 16 We go down the left column of the negative side of the table until we get to 1.0. We then go across the 1.0 row until we get to the 0.00 column. This gives us 0.1587. We now need to determine how many standard deviations away from from the mean 250 days is. This is what we get when we calculate the z-score. 275 266 z = = 0.5625 16 We now need to look this z-score up in the z-table. Since our z-table only has z-score to two decimal places, we will need to look up 0.56. We go down the left column of the positive side of the table until we get to 0.5. We then go across the 0.5 row until we get to the 0.06 column. This gives us 0.7123. 275

MAT 142 - Module ST 17 In order to find the probability that a pregnancy lasts between 250 and 275 days, we need to subtract the probability of less than 250 day from the probability of less than 275 days. This will give us 0.7123 0.1587 = 0.5536 is the probability of a pregnancy lasting between 250 and 275 days. In the previous three examples, we first found a z-score and then looked that up in the z-table to get the information that we needed. We can also use a z-table in essentially a way that is backward from that. Example 24. All incoming freshmen at a major university are required to take a mathematics placement exam. The scores are normally distributed with a mean of 420 and a standard deviation of 45. If a student score less than a certain score, he or she will have to take a review course. Find the cutoff score at which 25% of the students would have to take the review course. Solution: For this problem we want to find a test score so that 25% of the students score below that score. The picture below illustrates what we are looking for. 25% 420 The 25% indicates the value within the z-table that we need to look for. We would need to change the 25% to a decimal of 0.25. This is the number within the table that we need to find. Once we find the value, we need to determine which z-score.

18 Statistics z 0.09 0.08 0.07 0.06 0.05 0.04 0.03 0.02 0.01 0.0-1.0 0.1379 0.1401 0.1423 0.1446 0.1469 0.1492 0.1515 0.1539 0.1562 0.1587-0.9 0.1611 0.1635 0.1660 0.1685 0.1711 0.1736 0.1762 0.1788 0.1814 0.1841-0.8 0.1867 0.1894 0.1922 0.1949 0.1977 0.2005 0.2033 0.2061 0.2090 0.2119-0.7 0.2148 0.2177 0.2206 0.2236 0.2266 0.2296 0.2327 0.2358 0.2389 0.2420-0.6 0.2451 0.2483 0.2514 0.2546 0.2578 0.2611 0.2643 0.2676 0.2709 0.2743-0.5 0.2776 0.2810 0.2843 0.2877 0.2912 0.2946 0.2981 0.3015 0.3050 0.3085-0.4 0.3121 0.3156 0.3192 0.3228 0.3264 0.3300 0.3336 0.3372 0.3409 0.3446-0.3 0.3483 0.3520 0.3557 0.3594 0.3632 0.3669 0.3707 0.3745 0.3783 0.3821-0.2 0.3829 0.3897 0.3936 0.3974 0.4013 0.4052 0.4090 0.4129 0.4168 0.4207-0.1 0.4247 0.4286 0.4325 0.4364 0.4404 0.4443 0.4483 0.4522 0.4562 0.4602-0.0 0.4641 0.4681 0.4721 0.4761 0.4801 0.4840 0.4880 0.4920 0.4960 0.5000 When we look at the partial table above, we see that there is not exact entry within the table for 0.2500. The two values that are closest to 0.2500. We want to pick the value that is closest. We see that 0.2500 0.2483 = 0.0017 and 0.2514 0.2500 = 0.0014. Thus, 0.2514 is the closest to.2500. We now follow the row to the left and the column up to find the z-score associated with the probability 0.2514. This score is z = 0.67 We now have the z-score. We need to figure out which test score will create this z-score. To do this, we plug what we know into the formula z = x µ σ and solve for x. 0.67 = x 420 45 30.15 = x 420 389.85 = x Since score are whole numbers, we will round this number to 390. Thus a score of 390 on the exam will be the cutoff score at which 25% of the students would have to take the review course.

MAT 142 College Mathematics. Module ST. Statistics. Terri Miller revised July 14, 2015