Chapter 5 Understanding and Comparing Distributions
The Big Picture We can answer much more interesting questions about variables when we compare distributions for different groups. Below is a histogram of the Average Wind Speed for every day in 1989. Slide 5-3
The Big Picture (cont.) The distribution is unimodal and skewed to the right. The high value may be an outlier Median daily wind speed is about 1.90 mph and the IQR is reported to be 1.78 mph. Can we say more? Slide 5-4
The Five-Number Summary The five-number summary of a distribution reports its median, quartiles, and extremes (maximum and minimum). Example: The fivenumber summary for the daily wind speed is: Max 8.67 Q3 2.93 Median 1.90 Q1 1.15 Min 0.20 Slide 5-5
Daily Wind Speed: Making Boxplots A boxplot is a graphical display of the fivenumber summary. Boxplots are useful when comparing groups. Boxplots are particularly good at pointing out outliers. Slide 5-6
Constructing Boxplots 1. Draw a single vertical axis spanning the range of the data. Draw short horizontal lines at the lower and upper quartiles and at the median. Then connect them with vertical lines to form a box. Slide 5-7
Constructing Boxplots (cont.) 2. Put up fences around the main part of the data. The upper fence is 1.5 IQRs above the upper quartile. Upper fence = Q3 + 1.5(IQR) The lower fence is 1.5 IQRs below the lower quartile. Lower fence = Q1-1.5(IQR) Note: the fences only help with constructing the boxplot and should not appear in the final display. Slide 5-8
Constructing Boxplots (cont.) 3. Use the fences to make whiskers. Draw lines from the ends of the box up and down to the most extreme data values found within the fences. If a data value falls outside one of the fences, we do not connect it with a whisker. Slide 5-9
Constructing Boxplots (cont.) 4. Add the outliers by displaying any data values beyond the fences with special symbols. We often use a different symbol for far outliers that are farther than 3 IQRs from the quartiles. Slide 5-10
Wind Speed: Making Boxplots (cont.) Compare the histogram and boxplot for daily wind speeds: How does each display represent the distribution? Slide 5-11
Comparing Groups It is almost always more interesting to compare groups. With histograms, note the shapes, centers, and spreads of the two distributions. What does this graphical display tell you? Slide 5-12
Comparing Groups (cont.) Boxplots offer an ideal balance of information and simplicity, hiding the details while displaying the overall summary information. We often plot them side by side for groups or categories we wish to compare. What do these boxplots tell you? Slide 5-13
Example (p. 96 #11) Slide 5-14
Example (p. 96 #11) Slide 5-15
Example (p. 98 #20 and 21) See book Slide 5-16
Homework p. 95 #5, 7, 12, 15, 22, 29, 36 due on Wednesday Read p. 87 end Slide 5-17
What About Outliers? If there are any clear outliers and you are reporting the mean and standard deviation, report them with the outliers present and with the outliers removed. The differences may be quite revealing. Note: The median and IQR are not likely to be affected by the outliers. Slide 5-18
Variable Timeplots: Order, Please! For some data sets, we are interested in how the data behave over time. In these cases, we construct timeplots of the data. Timeplots show data measurements in time order. Timeplots help us to detect trends over time. You cannot always extrapolate from the data. (Don t predict the future) It is probably okay to predict a less windy June next year compared to a more windy November. It would not be okay to predict another storm on November 21 st (outlier). Time Slide 5-19
*Re-expressing Skewed Data to Improve Symmetry When the data are skewed it can be hard to summarize them simply with a center and spread, and hard to decide whether the most extreme values are outliers or just part of a stretched out tail. How can we say anything useful about such data? Slide 5-20
*Re-expressing Skewed Data to Improve Symmetry (cont.) Slide 5-21
*Re-expressing Skewed Data to Improve Symmetry (cont.) Re-expression changes the measuring scale for the variable by altering the distance between the values. All of the lines below represent the numbers 1 to 10, on a decimal, logarithmic, and squared scale. On our familiar decimal measuring scale, the distance between numbers is the same for all numbers. On a logarithmic scale, the distance between the numbers decreases as the numbers get larger On a square scale, the distance between the numbers decreases as the values get smaller. All of the dots represent the same sequence of values from 1 to 10 on different measuring scales. Slide 5-22
What Can Go Wrong? (cont.) Avoid inconsistent scales, either within the display or when comparing two displays. Label clearly so a reader knows what the plot displays. Good intentions, bad plot: Slide 5-23
What Can Go Wrong? (cont.) Beware of outliers Be careful when comparing groups that have very different spreads. Consider these side-by-side boxplots of cotinine levels: Re-express... Slide 5-24
What have we learned? We ve learned the value of comparing data groups and looking for patterns among groups and over time. We ve seen that boxplots are very effective for comparing groups graphically. We ve experienced the value of identifying and investigating outliers. We ve graphed data that has been measured over time against a time axis and looked for longterm trends both by eye and with a data smoother. Slide 5-25
Example (p. 101 #21, 35, 37, 39) See book Slide 5-26
Homework p. 95 #5, 7, 12, 15, 22, 29, 36 Slide 5-27
Classwork Investigative Task HW: Read p. 104-107 Slide 5-28