Stat 300: Intro to Probability & Statistics
Textbook: Introduction to Statistical Investigations
Name:

Chapter P: Preliminaries
Section P.2: Exploring Data

Example 1: Think About It! What will it look like?

Consider the following variables:¹

A. Point values of letters in the board game Scrabble
B. Prices of properties on the Monopoly game board
C. Jersey numbers of San Francisco 49ers football players in 2017
D. Weights of rowers on the 2016 U.S. men's Olympic team
E. Blood pressure measurements for a sample of healthy adults
F. Quiz percentages for a class of statistics students (quizzes were quite straightforward for most students)
G. Annual snowfall amounts for a sample of cities around the U.S.

a) Identify each of the variables above as categorical or quantitative.

b) Matching. The following dotplots display the distributions of these variables, but the variables are not shown in the same order as they are listed. Moreover, the scales have been intentionally left off the axes! For each dotplot, try to identify the variable displayed (by letter, from the previous list). Also, provide a brief explanation of your reasoning in each case. [Note: You might make different matches than other students; be prepared to justify your choices.]

One of the goals of this matching game is to illustrate that you can anticipate what the distribution of a set of data might look like by considering the context of the data.

[Dotplots 1 through 7]

¹ Rossman, A., & Chance, B., Workshop Statistics: Discovery with Data, 4th ed. Wiley & Sons, 2011.
P2: Exploring Data

In this section, you will learn how to explore a data set by carrying out an Exploratory Data Analysis (EDA). Exploring a data set helps you interpret the data and make informed decisions. Often you will need to look at several different exploratory features to get a good feel for your data. Do not limit yourself to considering only one graph or calculating only one statistic.

Exploratory Data Analysis (EDA) Guidelines

Shape (Graph): What does the data distribution look like?
o Possible shapes: Symmetric (mound shape), Skewed Left, Skewed Right, Uniform, Bimodal
o Graphs to consider: Dotplot, Histogram, Boxplot
Center: Where is the distribution centered? What is a typical or representative value?
o Different ways to measure the center: Mean, Median, Mode, Midrange
Spread or Variability: How far does the data spread from the center?
o Different ways to measure the spread: Standard Deviation, Range, Interquartile Range (IQR)
Outliers: Are there any unusual observations that deviate from the overall pattern of the distribution?

Example 2: Below you are given the heights (in feet) for a random sample of dwarf mandarin trees from your local Green Acres nursery. Use this data set to conduct an Exploratory Data Analysis (EDA). We will first create the graphs and calculate the values by hand; then we will use StatCrunch to expedite the process.

Tree Heights (in feet)
1   8   6.5   5.5   4   1.5   3   2   2.5   4   7.5   3.5

Shape
The shape of a data set describes the overall picture or distribution.
→ Five main shapes or distributions: Symmetric (bell or mounded), Uniform, Skewed Left, Skewed Right, Bimodal (or symmetric with two mounds).
→ Warning: You may have to consult several different graphs to get a good idea of the shape.

Dotplot
One dot represents:
SC Dotplot: Enter your data into the first column. Graph > Dotplot. Under Column, click on your variable.
Label:
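
A rough way to see what a dotplot encodes before drawing one by hand is to print it as text. The sketch below is only an illustration in Python, not a StatCrunch feature; the function name `text_dotplot` is made up for this example. It prints one row per distinct tree height, with one `*` per observation.

```python
from collections import Counter

# Tree heights (in feet) from Example 2.
heights = [1, 8, 6.5, 5.5, 4, 1.5, 3, 2, 2.5, 4, 7.5, 3.5]


def text_dotplot(values):
    """Return one line per distinct value (smallest first), one '*' per observation."""
    counts = Counter(values)
    return [f"{value:>4}: {'*' * counts[value]}" for value in sorted(counts)]


dotplot_lines = text_dotplot(heights)
print("\n".join(dotplot_lines))
```

A real dotplot places the values along a horizontal axis and stacks the dots vertically; the text version is rotated, but each symbol still stands for one tree.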
Histogram
Each bar represents:
SC Histogram: Graph > Histogram. Under Column, click on your variable. Under Type, choose Frequency (how many) or Relative Frequency (proportion or %).

Boxplot
Each box or whisker represents:
Label:
SC Boxplot: Graph > Boxplot. Under Column, click on your variable. Under Other Options, click "Use fences to identify outliers" and "Draw boxes horizontally".

Center
The center of a data set is a typical or representative data value.
→ Three main ways to measure the center: Mean, Median, Mode
SC Summary Statistics: Enter your data into the first column. Stat > Summary Stats > Columns. Under Statistics, use Control + Click to select the mean, median, and mode of the data set.

Definitions:
Mode:
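
The three measures of center can be checked against a hand calculation with a few lines of Python. This is a minimal sketch using Python's standard `statistics` module rather than StatCrunch; the variable names are chosen only for this illustration.

```python
import statistics

# Tree heights (in feet) from Example 2.
heights = [1, 8, 6.5, 5.5, 4, 1.5, 3, 2, 2.5, 4, 7.5, 3.5]

# Mean: the sum of the values divided by how many values there are.
mean_height = statistics.mean(heights)

# Median: the middle of the sorted data. With 12 values, this is the
# average of the 6th and 7th values in sorted order.
median_height = statistics.median(heights)

# Mode: the value that occurs most often.
mode_height = statistics.mode(heights)

print(f"mean   = {mean_height:.3f} ft")
print(f"median = {median_height} ft")
print(f"mode   = {mode_height} ft")
```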
Spread
The spread of a distribution describes the variability of a data set. Are the values close together? Are they far apart? Do they cluster in one area? How far do the data range?
→ Five main ways to measure the spread: Min, Max, Range, Standard Deviation, Interquartile Range
SC Summary Statistics: Enter your data into the first column. Stat > Summary Stats > Columns. Under Statistics, use Control + Click to select the std. dev., min, max, Q1, Q3, and IQR.

Definitions:
Min:
Max:
Range:
Standard Deviation:
Quartiles:
Interquartile Range:

Outliers
Outliers are data values that deviate markedly from the overall pattern of the other data values. Many data sets do not have any outliers.
→ Outliers are any values beyond the upper or lower fence:
ABOVE the upper fence: Q3 + 1.5 * IQR
BELOW the lower fence: Q1 - 1.5 * IQR
Find the outliers by creating the boxplot and asking StatCrunch to mark the outliers using fences.
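
The spread measures and the fence rule can be sketched in Python for the tree heights from Example 2. Two assumptions are worth flagging: the standard deviation here is the sample standard deviation (dividing by n - 1), and the quartiles are taken as the medians of the lower and upper halves of the sorted data. Software packages, possibly including StatCrunch, may use a slightly different quartile convention, so Q1 and Q3 can differ a little from what is shown on screen.

```python
import statistics

# Tree heights (in feet) from Example 2.
heights = [1, 8, 6.5, 5.5, 4, 1.5, 3, 2, 2.5, 4, 7.5, 3.5]
ordered = sorted(heights)
n = len(ordered)

data_min, data_max = ordered[0], ordered[-1]
data_range = data_max - data_min

# Sample standard deviation (divides by n - 1); an assumed convention.
std_dev = statistics.stdev(heights)

# Quartiles as medians of the lower and upper halves (an assumed convention).
# For an odd number of values, the overall median is left out of both halves.
half = n // 2
q1 = statistics.median(ordered[:half])
q3 = statistics.median(ordered[-half:])
iqr = q3 - q1

# Fences: values below the lower fence or above the upper fence are outliers.
lower_fence = q1 - 1.5 * iqr
upper_fence = q3 + 1.5 * iqr
outliers = [x for x in ordered if x < lower_fence or x > upper_fence]

print(f"min = {data_min}, max = {data_max}, range = {data_range}")
print(f"std. dev. = {std_dev:.3f}")
print(f"Q1 = {q1}, Q3 = {q3}, IQR = {iqr}")
print(f"fences = ({lower_fence}, {upper_fence}), outliers = {outliers}")
```

Under these conventions the fences fall well outside the smallest and largest tree heights, so no height is flagged as an outlier.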
Example 3: Dataset: Old Faithful²

In the reading for Section P.2, you read about a statistical investigation in which park rangers at Yellowstone National Park were trying to predict how much time a person usually has to wait to see an eruption of the geyser Old Faithful. In order to predict the next eruption time, researchers collected data on 222 eruptions of Old Faithful over several days in August 1978. The times between eruptions for all 222 observations can be found in the dataset labeled Old Faithful in the StatCrunch group. Use this data set to conduct an Exploratory Data Analysis by completing the following:

a) Describe the shape of the distribution. Use StatCrunch to create the dotplot, histogram, and boxplot for the times between eruptions.
b) Find the measures of center of the distribution. Use StatCrunch to find the mean, median, and mode for the times between eruptions.
c) Find the measures of spread or variability of the distribution. Use StatCrunch to find the min, max, range, standard deviation, and interquartile range (IQR) for the times between eruptions.
d) Does this data set contain any outliers? Use the boxplot feature in StatCrunch to determine whether or not this data set contains any outliers.

Example 4: Which is the better measure of center: the mean or the median?

The center of a data set is meant to represent a typical data value. Below you are given the sales prices for seven homes that recently sold in the greater Sacramento area.

Home prices in dollars:
$300,000   $285,000   $400,000   $385,000   $410,000   $325,000   $2,500,000

a) Use StatCrunch to find the mean and median of this data set.
b) Which value(s) best represent a typical value from the data set above: the mean, the median, or both? Explain your answer.
c) Describe the shape of the distribution. Are there any outliers? (Use StatCrunch.)
d) Change the last value from $2,500,000 to $250,000 and repeat steps a through c above.

² Tintle, N. et al., Introduction to Statistical Investigations, 1st ed. Wiley & Sons, 2016.
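
The comparison in Example 4 can also be run outside StatCrunch. The following Python sketch computes the mean and median of the seven home prices, then repeats the calculation after the last price is changed to $250,000. The variable names are chosen for this illustration only.

```python
import statistics

# Home sale prices (in dollars) from Example 4.
prices = [300_000, 285_000, 400_000, 385_000, 410_000, 325_000, 2_500_000]

# Part d: the same list with the last value changed to $250,000.
prices_changed = prices[:-1] + [250_000]

mean_original = statistics.mean(prices)
median_original = statistics.median(prices)
mean_changed = statistics.mean(prices_changed)
median_changed = statistics.median(prices_changed)

print(f"original: mean = ${mean_original:,.0f}, median = ${median_original:,.0f}")
print(f"changed:  mean = ${mean_changed:,.0f}, median = ${median_changed:,.0f}")
```

Comparing the two lines of output shows how much each measure moves when a single extreme value changes, which is the point of parts b and d.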
Example 5: For each of the data sets below, use StatCrunch to help you create a dotplot and find the mean, median, standard deviation, and outliers. Record your answers in the table provided.

Data Set #1: 4, 5, 5, 6, 6, 6, 7, 7, 7, 7, 8, 8, 8, 9, 9, 10
Dotplot:
Spread / Variability Std. Dev.:

Data Set #2: 6, 6, 6, 6, 6, 7, 7, 7, 7, 7, 7, 8, 8, 8, 8, 8
Dotplot:
Spread / Variability Std. Dev.:

Data Set #3: 4, 4, 4, 4, 4, 5, 5, 5, 9, 9, 9, 10, 10, 10, 10, 10
Dotplot:
Spread / Variability Std. Dev.:

Think About It / Write About It
The standard deviation can roughly be interpreted as the typical distance between the data values and the mean. Compare the standard deviations for each of these three data sets. Explain how the standard deviation is related to the shape of the distribution.
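
As a check on the StatCrunch output for Example 5, the means and standard deviations of the three data sets can be computed in Python. This sketch assumes the sample standard deviation (dividing by n - 1), which is what `statistics.stdev` computes; the dictionary names are chosen for this illustration.

```python
import statistics

# The three data sets from Example 5.
data_sets = {
    "Data Set #1": [4, 5, 5, 6, 6, 6, 7, 7, 7, 7, 8, 8, 8, 9, 9, 10],
    "Data Set #2": [6, 6, 6, 6, 6, 7, 7, 7, 7, 7, 7, 8, 8, 8, 8, 8],
    "Data Set #3": [4, 4, 4, 4, 4, 5, 5, 5, 9, 9, 9, 10, 10, 10, 10, 10],
}

# Mean of each data set.
means = {name: statistics.mean(values) for name, values in data_sets.items()}

# Sample standard deviation (divides by n - 1); an assumed convention.
std_devs = {name: statistics.stdev(values) for name, values in data_sets.items()}

for name in data_sets:
    print(f"{name}: mean = {means[name]}, std. dev. = {std_devs[name]:.3f}")
```

All three data sets share the same mean, so any difference in the standard deviations comes from how the values are arranged around that mean.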