Chapter 2

2.1 Descriptive Statistics

A stem-and-leaf graph, also called a stemplot, allows for a nice overview of quantitative data without losing information on individual observations. It can be a good way to display a small data set. Each value is divided into a stem and a leaf as in the following example, where the rightmost digit is the leaf, and the remaining digits form the stem.

This is how we construct a stem-and-leaf graph:

Data: 43 58 41 65 49 52 58 60 49

Divide up stem and leaf: 4|3 5|8 4|1 6|5 4|9 5|2 5|8 6|0 4|9

Stem-and-leaf plot:

Stem | Leaves
4    | 1 3 9 9
5    | 2 8 8
6    | 0 5

List the data represented in the following display.

7    | 5 9
8    | 0 2 6 7 7
9    | 1 7 8
10   | 2 6

Prepare a stem-and-leaf graph for the following data. In your final display, you should arrange the leaves for each stem in increasing order.

94 103 79 99 114 89 81 86 81 93 100 96 75 90 88 107 132 95
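The construction described above (split each value into a stem and a leaf, then sort the leaves for each stem) can be sketched in a few lines of Python. This is only an illustrative sketch: the function name `stem_and_leaf` is made up for this example, it assumes non-negative whole numbers, and it lists every stem between the smallest and largest even when a stem has no leaves, which is a common convention rather than something stated in the notes.

```python
from collections import defaultdict

def stem_and_leaf(values):
    """Group whole numbers by stem (all digits but the last) and sort the leaves."""
    stems = defaultdict(list)
    for v in values:
        stem, leaf = divmod(v, 10)  # the rightmost digit is the leaf
        stems[stem].append(leaf)
    lowest, highest = min(stems), max(stems)
    # Assumption: keep empty stems so gaps in the data stay visible.
    return {s: sorted(stems.get(s, [])) for s in range(lowest, highest + 1)}

# Worked example data from the notes.
data = [43, 58, 41, 65, 49, 52, 58, 60, 49]
plot = stem_and_leaf(data)
for stem, leaves in plot.items():
    print(f"{stem} | {' '.join(str(leaf) for leaf in leaves)}")
```

Run on the example data, this prints the same three rows as the stem-and-leaf plot shown in the worked example.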

An outlier, also called an extreme value, is an observation that lies far away from the rest of the values. Stem-and-leaf graphs make it easy to find outliers. Were there any outliers in the example you just did?

Prepare a stem-and-leaf graph for the following data. Determine if there are any outliers in the data set.

6.3 2.9 4.4 5.5 5.1 4.9 5.0 4.8 4.4 5.2

When two data sets have values similar enough that the same stems can be used, their shapes can be compared with a back-to-back stem-and-leaf plot. In a back-to-back stem-and-leaf plot, the stems go down the middle. The leaves for one of the data sets go off to the right, and the leaves for the other go off to the left.

Consider the following course averages from an English class and a History class. The classes can be compared with a back-to-back stem-and-leaf plot. What can you conclude about the averages of these two classes?

History Class leaves (left side, top to bottom): 9, 71, 311, 853220, 872, 0
Stems (middle, top to bottom): 4, 5, 6, 7, 8, 9, 10
English Class leaves (right side, top to bottom): 5, 7, 2689, 244679, 135, 1

A line graph presents data values on the horizontal axis and the frequency on the vertical axis. Frequency points are plotted and connected using line segments.

In a survey, 40 people were asked how many times they visited a store before making a major purchase. The results are shown in the table below and in the corresponding line graph.

Number of times in store | Frequency
1 | 4
2 | 10
3 | 16
4 | 6
5 | 4

2.2 Histograms and Frequency Polygons

Since quantitative data doesn't have natural categories, we divide the data into classes. The classes are intervals of equal width that cover all the values in the data set.

A Frequency Distribution for Quantitative data lists all the classes and the number of values that belong to each class. Data presented in the form of a frequency distribution are called grouped data. Note that grouped data is easier to read, but that we have lost some information by grouping.

Make sure that classes do not overlap, so that each of the original values belongs to exactly one class. Try to use the same width for all classes, although it is sometimes impossible to avoid open-ended intervals, such as "65 years or older". Make sure to include all classes, even those with a frequency of zero.

$\text{Class width} = \text{Lower limit of next class} - \text{Lower limit of a class}$

$\text{Class midpoint} = \dfrac{\text{Lower limit of a class} + \text{Lower limit of next class}}{2}$

Randomly Selected Pennies

Weights of Pennies in grams | Frequency
2.40-2.49 | 18
2.50-2.59 | 59
2.60-2.69 | 0
2.70-2.79 | 0
2.80-2.89 | 0
2.90-2.99 | 2
3.00-3.09 | 25
3.10-3.19 | 8

How many classes does the frequency distribution have? What is the class width? Is it clear where a value belongs? How big was the sample of pennies? Could we have omitted the classes with no frequencies, or do those classes tell us something important?
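As a small sketch of the class rules above, the following Python code computes the class width and class midpoints from the lower class limits, and counts how many values fall in each class. The class limits and observations are hypothetical (they are not taken from the notes), and the code assumes equal-width classes and that every value lies inside the listed classes.

```python
# Hypothetical lower class limits; each class runs from its lower limit up to,
# but not including, the next lower limit, so the classes cannot overlap.
lower_limits = [10, 20, 30, 40]

# Class width = lower limit of next class - lower limit of a class.
class_width = lower_limits[1] - lower_limits[0]

# Class midpoint = (lower limit of a class + lower limit of next class) / 2.
midpoints = [(lower + (lower + class_width)) / 2 for lower in lower_limits]

values = [12, 15, 21, 44, 47, 48, 19]  # hypothetical observations

def class_index(value):
    """Index of the one class that contains value (assumes value is in range)."""
    return (value - lower_limits[0]) // class_width

# Start every class at zero so that empty classes are still listed.
frequencies = [0] * len(lower_limits)
for v in values:
    frequencies[class_index(v)] += 1

for lower, mid, f in zip(lower_limits, midpoints, frequencies):
    print(f"{lower}-{lower + class_width - 1}  midpoint {mid}  frequency {f}")
```

Note that the class 30-39 is kept in the output even though its frequency is zero, in line with the advice to include all classes.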

Constructing a Frequency Distribution:

1. Decide on the number of classes (should be between 5 and 15). The bigger the data set, the more classes you should choose. It is better to choose too many classes than too few, but you still want some of the classes to have large frequencies. (One way to help you choose the number of classes is Sturges' formula: # of classes = 1 + 3.3 log n, where n is the number of observations.)

2. Calculate the approximate class width. Round to a convenient number.

$\text{Class width} = \dfrac{\text{Largest value} - \text{Smallest value}}{\text{Number of classes that you want}}$

3. Starting point: Begin by choosing a lower limit of the first class, which should be a convenient number less than or equal to your smallest value.

4. Using the lower limit of the first class and the class width, proceed to list all the classes.

Lower limit of one class + Class width = Lower limit for the next class

5. Count the number of observations in each class (possibly by tallying) and construct a frequency distribution.

Histograms

A histogram is a graph in which classes are marked on the horizontal axis and the frequencies, relative frequencies, or percentages are marked on the vertical axis. The frequencies, relative frequencies, or percentages are represented by the heights of the bars. In a histogram, the bars are drawn adjacent to each other (that is, there is no space between them, as there is in a bar graph).

[Figures: Frequency Histograms vs. Relative Frequency Histograms or Percentage Histograms]
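The five construction steps can be traced in code. The sketch below uses a hypothetical data set of 20 values (not from the notes), takes "log" in Sturges' formula to be the base-10 logarithm, rounds the approximate class width up to a whole number as its "convenient number", and picks 10 by hand as the starting point. These choices are assumptions for illustration; other convenient choices would be equally valid.

```python
import math

# Hypothetical data set (not from the notes).
data = [12, 15, 17, 21, 22, 22, 25, 28, 30, 31,
        33, 34, 36, 38, 40, 41, 43, 45, 47, 49]

# Step 1: number of classes suggested by Sturges' formula (base-10 log assumed).
n = len(data)
num_classes = round(1 + 3.3 * math.log10(n))

# Step 2: approximate class width, rounded up to a convenient whole number.
approx_width = (max(data) - min(data)) / num_classes
width = math.ceil(approx_width)

# Step 3: a convenient starting point at or below the smallest value (chosen by hand).
start = 10

# Step 4: lower limit of one class + class width = lower limit of the next class.
lower_limits = []
lower = start
while lower <= max(data):
    lower_limits.append(lower)
    lower += width

# Step 5: count the observations in each class.
frequencies = [sum(lo <= x < lo + width for x in data) for lo in lower_limits]

for lo, f in zip(lower_limits, frequencies):
    print(f"{lo}-{lo + width - 1}: {f}")
```

Dividing each frequency by `n` would give the relative frequencies used in a relative frequency histogram.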

A polygon is a graph formed by joining the midpoints of the tops of successive bars in a histogram with straight lines.

[Figures: Histogram with a superimposed polygon; Polygon by itself]

An extra class is often created at each end, both of which have zero frequency, as is seen in the polygons above.

Practice: The following data give the numbers of computer keyboards assembled at the Twentieth Century Electronics Company for a sample of 25 days.

45 52 48 41 56 46 44 42 48 53 51 53 51 48 46 43 52 50 54 47 44 47 50 49 52

a. On a separate piece of paper, make the frequency distribution table for these data.
b. Calculate the relative frequencies for all classes.
c. Construct a histogram for the relative frequency distribution.
d. Construct a polygon for the relative frequency distribution.

A time-series plot may be used when the data consist of values of a variable measured at different points in time. In a time-series plot, the horizontal axis represents time, and the vertical axis represents the value of the variable we are measuring. The values of the variable are plotted at each of the times, then the points are connected with straight lines.

A dotplot consists of a graph in which each data value is plotted as a point (or dot) along a scale of values. Dots that represent equal values are stacked. (Dotplot graphs are not shown in the textbook.)

Dotplots are helpful in getting a good overview of the data, including finding clusters of data, as well as outliers. Clusters are areas where the values are more concentrated.

Dotplots are also useful for comparing two or more data sets: create a dotplot for each data set on the same scale, and then place the dotplots on top of each other. We call these stacked dotplots.

The example below is a dotplot display of the ages of actresses at the time they won Academy Award Oscars. (Example borrowed from the Triola Stats book.)

Ex. Make a dotplot of ages of actors, and make a stacked dotplot to compare the two data sets.

2.3 Measures of the Location of the Data

Percentiles divide sorted data into 100 equal parts. The kth percentile is the value in a data set that has about k% of the data values smaller than it and about (100 - k)% of the data values greater than it.

When my daughter Linnéa was born, the doctor told me her length was in the 98.8th percentile for newborn girls. What does that mean?

To find the approximate value of the kth percentile:

1. Sort the data in increasing order.

2. Calculate $i = \dfrac{k}{100}(n + 1)$, where n is the number of data values.

3. If i is a whole number, the kth percentile is the value in the ith position in the ordered set of data. If i is NOT a whole number, the kth percentile is the average of the two data values in the whole-number positions closest to i (the positions just below and just above i).

$\text{Percentile of a specific value in a data set} = \dfrac{x + 0.5y}{n}(100)$

where x = number of data values less than the data value for which you want to find the percentile
where y = number of data values equal to the data value for which you want to find the percentile

Round the result to the nearest integer.

(a) Given the following data set

15 9 12 11 7 6 9 10 14 3 6 5

Calculate the approximate value of the 80th percentile.

(b) Find the percentile rank of 7.
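The percentile procedure and the percentile-rank formula can be sketched as two short functions. This is an illustrative sketch only: the function names and the data set are hypothetical, and the code simply rejects values of k whose position i falls outside the data, a case the notes do not discuss.

```python
def percentile_value(data, k):
    """Approximate kth percentile using the position i = (k / 100) * (n + 1)."""
    ordered = sorted(data)
    n = len(ordered)
    i = k * (n + 1) / 100
    if i < 1 or i > n:
        raise ValueError("position i falls outside the data for this k")
    if i == int(i):
        return ordered[int(i) - 1]      # positions are counted from 1
    below = int(i)                       # whole-number position just below i
    above = below + 1                    # whole-number position just above i
    return (ordered[below - 1] + ordered[above - 1]) / 2

def percentile_rank(data, value):
    """Percentile of a specific value: (x + 0.5y) / n * 100, rounded."""
    x = sum(1 for d in data if d < value)    # values less than `value`
    y = sum(1 for d in data if d == value)   # values equal to `value`
    return round((x + 0.5 * y) / len(data) * 100)

# Hypothetical data set (not the exercise data from the notes).
scores = [18, 2, 10, 6, 14, 4, 16, 8, 12]

print(percentile_value(scores, 50))   # i = 5, so the 5th ordered value
print(percentile_value(scores, 25))   # i = 2.5, so average of 2nd and 3rd values
print(percentile_rank(scores, 10))
```

With n = 9, the 50th percentile sits at position i = 5 (a whole number), while the 25th percentile sits at i = 2.5 and is therefore the average of the 2nd and 3rd ordered values.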

The median is the middle value when the data values are arranged in increasing or decreasing order. If you have an even number of data values, the median is the mean of the two middle values (that is, add the two middle values and divide by two).

The position of the middle value, when the data are arranged in increasing or decreasing order, can be calculated by the formula $\dfrac{n + 1}{2}$. If you don't get an integer answer, the median is the mean of the data values in the two integer positions closest to that number.

Find the median in the previous data set.

Quartiles, denoted $Q_1$, $Q_2$, and $Q_3$, divide sorted data into four equal parts.
$Q_1$ is the 25th percentile.
$Q_2$ is the 50th percentile (median).
$Q_3$ is the 75th percentile.

There are two ways to find the quartiles. You can find the corresponding percentiles. Another way is to start by finding the median of the whole data set ($Q_2$), and then find the median of the lower half of the data ($Q_1$) and the median of the upper half of the data ($Q_3$).

Find the quartiles for the following data set.

5 4 5 1 8 11 6 5 4 9 3 4 16 3

The interquartile range, denoted IQR, is a number that indicates the spread of the middle half of the data.

$\text{IQR} = \text{Interquartile Range} = Q_3 - Q_1$

The IQR method allows us to determine which values are outliers. Outliers are data values that are below $Q_1 - 1.5 \times \text{IQR}$ (the lower outlier boundary) or above $Q_3 + 1.5 \times \text{IQR}$ (the upper outlier boundary).

Outliers are data values that are significantly different from the rest of the data values. They could be due to an error, so they might need some further investigation.

Find the IQR for the above data set. Determine if there are any outliers.
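The "median of each half" method for quartiles and the IQR outlier boundaries can be sketched as follows. The data set and function name are hypothetical. One detail is an assumption: when the number of values is odd, this sketch leaves the overall median out of both halves, a choice the notes do not spell out.

```python
from statistics import median

def quartiles(data):
    """Q1, Q2, Q3 as the medians of the lower half, the whole set, and the upper half."""
    ordered = sorted(data)
    n = len(ordered)
    half = n // 2
    lower_half = ordered[:half]
    # Assumption: for odd n, the middle value belongs to neither half.
    upper_half = ordered[half:] if n % 2 == 0 else ordered[half + 1:]
    return median(lower_half), median(ordered), median(upper_half)

# Hypothetical data set (not the exercise data from the notes).
values = [1, 3, 4, 5, 5, 6, 7, 8, 9, 30]

q1, q2, q3 = quartiles(values)
iqr = q3 - q1
lower_boundary = q1 - 1.5 * iqr
upper_boundary = q3 + 1.5 * iqr
outliers = [v for v in values if v < lower_boundary or v > upper_boundary]

print("Q1, Q2, Q3:", q1, q2, q3)
print("IQR:", iqr)
print("Outlier boundaries:", lower_boundary, upper_boundary)
print("Outliers:", outliers)
```

Here the value 30 lies above the upper outlier boundary, so it is flagged as an outlier, while every other value falls between the two boundaries.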

2.4 Box Plots

A box plot, also called a box-and-whisker plot, is a graphic presentation of data using five values: the minimum value, the maximum value, and the three quartiles. We also use the IQR and outliers.

$\text{Lower outlier boundary} = Q_1 - 1.5 \times \text{IQR}$

$\text{Upper outlier boundary} = Q_3 + 1.5 \times \text{IQR}$

How to draw a box-and-whisker plot:

- Draw a scaled number line, such that all numbers in the data set are covered.
- Draw a box above or below the number line, such that its left side is at $Q_1$ and its right side is at $Q_3$. Draw a vertical line at $Q_2$ also.
- Draw whiskers (horizontal lines) joining the box to the smallest and the largest values that lie within the two outlier boundaries.
- Plot any outliers (values outside of the outlier boundaries).

Note: There are versions of box plots where the whiskers are drawn all the way out to the minimum and maximum values even when they are outliers. Our textbook uses this method, but I want you to use the first way described.

The time (in minutes) that a student spent in the laundromat in a week, for 15 randomly selected weeks, is as follows:

72 62 84 73 107 81 93 72 135 77 85 67 90 83 112

Prepare a box-and-whisker plot.
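To make the whisker rule concrete, the sketch below finds where the whiskers end and which points are plotted as outliers, given quartiles that have already been computed. The function name, the data, and the quartile values passed in are hypothetical and chosen only for illustration.

```python
def whiskers_and_outliers(data, q1, q3):
    """Return (left whisker end, right whisker end, outliers) for a box plot."""
    iqr = q3 - q1
    lower_boundary = q1 - 1.5 * iqr
    upper_boundary = q3 + 1.5 * iqr
    inside = [x for x in data if lower_boundary <= x <= upper_boundary]
    outliers = sorted(x for x in data if x < lower_boundary or x > upper_boundary)
    # Whiskers stop at the most extreme values that are NOT outliers.
    return min(inside), max(inside), outliers

# Hypothetical data with quartiles Q1 = 4 and Q3 = 8 worked out beforehand.
values = [1, 3, 4, 5, 5, 6, 7, 8, 9, 30]
left_end, right_end, outliers = whiskers_and_outliers(values, q1=4, q3=8)

print("Whiskers run from", left_end, "to", right_end)
print("Plot separately as outliers:", outliers)
```

In this example the right whisker stops at 9 rather than at the maximum value 30, because 30 lies beyond the upper outlier boundary and is plotted as a separate point.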

A boxplot can help us better see the distribution of the data, such as the center, spread, skewness, and outliers. We can also use boxplots to visually compare two or more data sets by placing them right above each other.

2.5 Measures of the Center of the Data

The (arithmetic) mean, often referred to as the average, is the sum of all values divided by the total number of values. The mean calculated for population data is denoted by µ (mu). The mean calculated for sample data is denoted by $\bar{x}$ (x-bar).

$\mu = \dfrac{\sum x}{N}$

$\bar{x} = \dfrac{\sum x}{n}$

where N denotes the population size and n denotes the sample size. We will round the mean to one more decimal place than the data.

Ex. Cost of houses in a certain area:

499,000 629,000 4,900,000 715,000 899,000 649,000 989,000 629,000 598,000 759,000 899,000 1,219,000

Find the mean of the cost of houses in this area:

Are there any outliers? Do you think the mean is a good measurement?

Find the median of the cost of houses in the example above:

Do you think the median is a good measurement?

A parameter is a numerical measurement based on a population, such as a population mean. A statistic is a numerical measurement based on a sample, such as a sample mean.

A statistic/parameter is outlier sensitive if its value is affected by extreme values (outliers) in the data set.

Which is more outlier sensitive, the median or the mean?
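A quick way to explore outlier sensitivity is to compute the mean and the median of a data set with and without an extreme value. The data below are hypothetical (deliberately not the house prices from the example), and the sketch uses the `mean` and `median` functions from Python's standard `statistics` module.

```python
from statistics import mean, median

# Hypothetical data: four typical values and one extreme value.
with_outlier = [10, 12, 13, 15, 100]
without_outlier = [10, 12, 13, 15]

# Mean = sum of all values / number of values.
mean_with = mean(with_outlier)
mean_without = mean(without_outlier)

# Median = middle value of the sorted data (or the mean of the two middle values).
median_with = median(with_outlier)
median_without = median(without_outlier)

print("Mean:  ", mean_without, "->", mean_with)
print("Median:", median_without, "->", median_with)
```

Comparing how far each measure moves when the extreme value is added gives a concrete basis for answering the question about outlier sensitivity.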

The mode of a data set is the value(s) that occurs most frequently.

If there are two values that occur with the same greatest frequency, the data set is bimodal. If there are more than two values that occur with the same greatest frequency, the data set is multimodal. If no value is repeated, there is no mode.

Find the mode of the cost of houses in the example above:

Do you think the mode is a good measurement?

Is the mode outlier sensitive?

Approximate Mean for Grouped Data

Although it can be beneficial to group data, we do lose information on individual data values. However, we can approximate the mean of grouped data by assuming that the values in each class are equal to the class midpoint.

$\text{midpoint} = m = \dfrac{\text{lower boundary} + \text{upper boundary}}{2}$

$\text{mean} = \dfrac{\text{data sum}}{\text{number of data values}} = \dfrac{\sum fm}{n} \text{ or } \dfrac{\sum fm}{N}$

The following data give the frequency distribution of the test scores of all the students in a class. Find the mean of these test scores.

Test Scores | Frequency
90-100 | 7
80-89 | 10
70-79 | 12
60-69 | 5
50-59 | 2
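The grouped-mean approximation can be sketched with a small frequency table. The classes and frequencies below are hypothetical (not the test-score table from the notes); each class is stored as (lower boundary, upper boundary, frequency).

```python
# Hypothetical frequency distribution: (lower boundary, upper boundary, frequency).
classes = [
    (0, 9, 2),
    (10, 19, 5),
    (20, 29, 3),
]

# Midpoint m = (lower boundary + upper boundary) / 2 for each class.
midpoints = [(lower + upper) / 2 for lower, upper, _ in classes]
frequencies = [f for _, _, f in classes]

# Approximate mean = sum of f * m divided by the number of data values.
total_count = sum(frequencies)
sum_fm = sum(f * m for f, m in zip(frequencies, midpoints))
grouped_mean = sum_fm / total_count

print("Midpoints:", midpoints)
print("Sum of f*m:", sum_fm)
print("Approximate mean:", grouped_mean)
```

The result is only an approximation of the true mean, because every value in a class is treated as if it were equal to the class midpoint.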

2.6 Skewness and the Mean, Median, and Mode

Shapes of Histograms

As the number of classes is increased, the polygon eventually becomes a smooth curve.

A high point of a histogram is referred to as a mode. A histogram is unimodal if it has only one mode, and bimodal if it has two distinct modes.

[Figures: Symmetric, unimodal, bell shaped; Symmetric, bimodal; Skewed right; Skewed left; Uniform or rectangular]

Relationships Among the Mean, Median, and Mode

2.7 Measures of the Spread of the Data

In some data sets the observations are close together, while in others they are more spread out. In addition to measures of the center, it's often important to measure the spread of the data.

One measurement of variation that is easy to calculate is the range.

Range = (Largest value) - (Smallest value)

What are some disadvantages of this measurement?

We would like to find the spread/variation of all the data, not only the distance between the minimum and maximum values. The measurements Variance and Standard Deviation are based on the deviations of all of the data values from the mean.

Given the following data: 13, 14, 24, 24, 25, 26

Find the mean of the data:

Find the variance and standard deviation of the data:

Value of x | 13 | 14 | 24 | 24 | 25 | 26 | Total
Deviation from mean, x - µ | | | | | | |

The Population Variance is given by $\sigma^2 = \dfrac{\sum (x - \mu)^2}{N}$

The Population Standard Deviation is given by $\sigma = \sqrt{\dfrac{\sum (x - \mu)^2}{N}}$

The Sample Variance is given by $s^2 = \dfrac{\sum (x - \bar{x})^2}{n - 1}$

The Sample Standard Deviation is given by $s = \sqrt{\dfrac{\sum (x - \bar{x})^2}{n - 1}}$

Note: Do NOT use any other formulas for the variance and standard deviation that you might find.

- The standard deviation is the measure of variation of all values from the mean.
- The variance is the measure of variation equal to the square of the standard deviation.
- The value of the standard deviation is usually positive, and in rare cases it could also equal zero. Describe how the data set would look if the standard deviation is zero:
- The unit of the standard deviation is the same as the unit of the original data.
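The following sketch computes the variance and standard deviation directly from the deviations from the mean, assuming the usual definitional formulas: the population variance divides the sum of squared deviations by N, and the sample variance divides it by n - 1. The data set is hypothetical and the function names are made up for this example.

```python
import math

def population_variance(data):
    """Sum of squared deviations from the mean, divided by N."""
    mu = sum(data) / len(data)
    return sum((x - mu) ** 2 for x in data) / len(data)

def sample_variance(data):
    """Sum of squared deviations from the mean, divided by n - 1."""
    x_bar = sum(data) / len(data)
    return sum((x - x_bar) ** 2 for x in data) / (len(data) - 1)

# Hypothetical data set (not the one from the notes); its mean is 5.
data = [2, 4, 4, 4, 5, 5, 7, 9]

pop_var = population_variance(data)
pop_sd = math.sqrt(pop_var)      # standard deviation = square root of the variance
samp_var = sample_variance(data)
samp_sd = math.sqrt(samp_var)

print("Population variance and SD:", pop_var, pop_sd)
print("Sample variance and SD:", round(samp_var, 3), round(samp_sd, 3))
```

For the same data, the sample variance is always a little larger than the population variance, because the same sum of squared deviations is divided by n - 1 instead of N.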

- It is unusual that data fall more than... standard deviations from the mean.

Is the standard deviation outlier sensitive? Explain.

- A general round-off rule for variation: Carry one more decimal place than is present in the original set of data. Round only the final answer and not values in the middle of a calculation (if it is necessary to round off in the middle of a calculation, you must keep at least twice as many decimals as your final answer will have).

Variance and Standard Deviation for Grouped Data

When finding the variance or standard deviation for grouped data, we will assume that all data values in a class are equal to the class midpoint.

The following data give the frequency distribution of the test scores of all the students in a class. Find the standard deviation of these test scores.

Unless it is clear that a data set is from a population (it will usually use the word ALL), we will assume it is a sample.

Test Scores | f | m | m - µ | (m - µ)² | (m - µ)² f
90-100 | 7 | | | |
80-89 | 10 | | | |
70-79 | 12 | | | |
60-69 | 5 | | | |
50-59 | 2 | | | |
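The grouped-data calculation follows the columns of the table: midpoint m, deviation from the mean, squared deviation, and squared deviation times frequency. The sketch below uses a hypothetical frequency distribution (not the test scores) and treats it as a population, so it divides by N; for a sample, the divisor would be n - 1 instead.

```python
import math

# Hypothetical frequency distribution: (lower boundary, upper boundary, frequency).
classes = [
    (0, 9, 2),
    (10, 19, 5),
    (20, 29, 3),
]

# Column m: class midpoints; column f: frequencies.
m = [(lower + upper) / 2 for lower, upper, _ in classes]
f = [freq for _, _, freq in classes]
N = sum(f)

# Approximate mean: sum of f * m divided by N.
mu = sum(fi * mi for fi, mi in zip(f, m)) / N

# Columns (m - mu), (m - mu)^2, and (m - mu)^2 * f.
deviations = [mi - mu for mi in m]
squared = [d ** 2 for d in deviations]
weighted = [s * fi for s, fi in zip(squared, f)]

# Population variance and standard deviation for the grouped data.
variance = sum(weighted) / N
std_dev = math.sqrt(variance)

print("Mean:", mu)
print("(m - mu)^2 * f column:", weighted)
print("Variance:", variance, "Standard deviation:", std_dev)
```

Each list in the code corresponds to one column of the worksheet table, so the printed values can be checked against a hand calculation row by row.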

Comparing Values from Different Data Sets

Who is taller, a man 73 inches tall or a woman 68 inches tall? The obvious answer is that the man is taller. However, men are taller than women on average. Let's ask the question this way: Who is taller relative to their gender, a man 73 inches tall or a woman 68 inches tall?

The z-score of an individual data value tells how many standard deviations that value is from its population mean.

Let x be a value from a population with mean μ and standard deviation σ. The z-score for x is

$z = \dfrac{x - \mu}{\sigma}$

Practice

1. A National Center for Health Statistics study states that the mean height for adult men in the U.S. is μ = 69.4 inches, with a standard deviation of σ = 3.1 inches. The mean height for adult women is μ = 63.8 inches, with a standard deviation of σ = 2.8 inches. Who is taller relative to their gender, a man 73 inches tall, or a woman 68 inches tall?

2. Eric proudly tells his brother Bruce that he got 94 points on his last math exam, which had an average of 73 points and a standard deviation of 9. Bruce says that he did even better on his math exam, on which he got 96 points, and this exam had an average of 79 points and a standard deviation of 7. Who did better on his exam relative to their class scores?

3. Suppose Eric's classmate got a z-score of -1.7 on his math exam. What was his exam score?
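A z-score calculation, and the reverse calculation from a z-score back to a data value, can be sketched as below. It assumes the standard definition z = (x - μ) / σ; the function names and the numbers are hypothetical and are not the practice problems from the notes.

```python
def z_score(x, mu, sigma):
    """Number of standard deviations that x lies from the mean mu."""
    return (x - mu) / sigma

def value_from_z(z, mu, sigma):
    """Solve z = (x - mu) / sigma for x."""
    return mu + z * sigma

# Hypothetical example: a score of 85 on a test with mean 70 and SD 10.
z = z_score(85, mu=70, sigma=10)

# Hypothetical example: the score that is 2 standard deviations below the mean.
x = value_from_z(-2, mu=70, sigma=10)

print("z-score of 85:", z)
print("Score with z = -2:", x)
```

Because z-scores are measured in standard deviations rather than in the original units, values from two different data sets can be compared by comparing their z-scores.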

Other Uses of Standard Deviation

Note: These are NOT in the textbook but you are held responsible for knowing them.

Empirical Rule

For a bell-shaped distribution
- about 68% of all values fall within 1 standard deviation of the mean
- about 95% of all values fall within 2 standard deviations of the mean
- almost all values (about 99.7%) fall within 3 standard deviations of the mean

(a) The prices of all college textbooks follow a bell-shaped distribution with a mean of $105 and a standard deviation of $20. Using the empirical rule, find the interval that contains the prices of about 99.7% of college textbooks.

(b) Using the empirical rule, find the percentage of all college textbooks with their prices between $85 and $125.

(c) Using the empirical rule, find the percentage of all college textbooks with their prices between $65 and $145.
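The three intervals of the empirical rule are simply the mean plus or minus 1, 2, and 3 standard deviations. The sketch below builds them for a hypothetical bell-shaped distribution with mean 50 and standard deviation 5 (not the textbook-price example); the function name is made up for illustration.

```python
def empirical_rule_intervals(mean, sd):
    """Map each approximate percentage to the interval mean +/- k standard deviations."""
    percentages = {1: 68, 2: 95, 3: 99.7}  # k standard deviations -> about this % of values
    return {pct: (mean - k * sd, mean + k * sd) for k, pct in percentages.items()}

# Hypothetical bell-shaped distribution.
intervals = empirical_rule_intervals(mean=50, sd=5)

for pct, (low, high) in intervals.items():
    print(f"About {pct}% of values fall between {low} and {high}")
```

To go the other way (from an interval to a percentage), work out how many standard deviations each endpoint lies from the mean and look up the matching line of the rule.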

Chebyshev's Theorem

At least $1 - \dfrac{1}{k^2}$ of the data values lie within k standard deviations of the mean, for any k > 1.

Note that the distribution does NOT have to be bell-shaped in order to use this theorem.

Use the above formula for k = 3, and interpret the result.

Use the above formula for k = 1.5, and interpret the result.

Suppose the average credit card debt for households is $9,500 with a standard deviation of $2,600.

(a) Using Chebyshev's theorem, find at least what percentage of current credit card debts for all households are between $3,000 and $16,000.

(b) Using Chebyshev's theorem, find the interval that contains credit card debts of at least 89% of all the households.

A Rough Estimation of the Standard Deviation

Most values fall within 2 standard deviations of the mean. Values that fall outside of this interval would be considered unusual.

Almost all values fall within 3 standard deviations of the mean, so we can say that

$SD \approx \dfrac{\text{range}}{6}$ (unless we have extreme outliers).

You would only use the above formula if you are asked to find a rough estimation of the standard deviation. If you are ever asked to just find the standard deviation, rather than a rough estimate, then don't use this formula.

The Wechsler Adult Intelligence Scale involves an IQ test designed so that the mean score is 100 and the standard deviation is 15. Use the Rough Estimation of the Standard Deviation to find the minimum and maximum "usual" IQ scores. Then determine whether an IQ score of 135 would be considered "unusual."
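Chebyshev's theorem can be applied in two steps: find how many standard deviations k the interval endpoints lie from the mean, then evaluate 1 - 1/k². The sketch below uses hypothetical numbers (mean 50, standard deviation 5) with k = 2, a value the exercises above do not ask about. The function names are made up, and the helper assumes the interval is symmetric about the mean.

```python
def chebyshev_bound(k):
    """Minimum proportion of values within k standard deviations of the mean (k > 1)."""
    if k <= 1:
        raise ValueError("Chebyshev's theorem requires k > 1")
    return 1 - 1 / k ** 2

def k_for_interval(low, high, mean, sd):
    """Number of standard deviations from the mean to the ends of a symmetric interval."""
    if (mean - low) != (high - mean):
        raise ValueError("interval must be symmetric about the mean")
    return (high - mean) / sd

# Hypothetical example: mean 50, standard deviation 5, interval from 40 to 60.
k = k_for_interval(40, 60, mean=50, sd=5)
proportion = chebyshev_bound(k)

print(f"k = {k}, so at least {proportion:.0%} of the values lie between 40 and 60")
```

For a bell-shaped distribution the empirical rule gives about 95% within 2 standard deviations; Chebyshev's guarantee of at least 75% is weaker because it holds for any distribution shape.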