UNIT 1A EXPLORING UNIVARIATE DATA


A.P. STATISTICS
E. Villarreal
Lincoln HS Math Department

UNIT 1A EXPLORING UNIVARIATE DATA

LESSON 1: TYPES OF DATA

Here is a list of important terms that we must understand as we begin our study of statistics and probability:

Variable: Any characteristic whose value may change from one individual or object to another.
Examples: X = LHS student scores on most recent SAT Math I test
Y = Fall math course of LHS seniors

Data: The result from making observations either on a single variable or simultaneously on two or more variables.
Examples: Possible values of X: {542, 705, 568, 388, 674}
Possible values of Y: {Statistics, AP Statistics, Statistics, Algebra 2}

Numerical (Quantitative) Variable: the resulting observations are numerical in nature, and can be placed in a specific ordered sequence.
Example: The variable X described above.

Categorical (Qualitative) Variable: the resulting observations fall into categories or groups.
Example: The variable Y described above.

Discrete Numerical Data: when all the possible observations from a numerical variable correspond to isolated points on a number line. The data values are countable.
Example: S = Shoe size of LHS male students
Possible values of S: {6, 6½, 7, 7½, 8, 8½}

Continuous Numerical Data: when all the possible observations from a numerical variable form an entire interval on the number line. A data value can take on any value in an interval.
Example: F = Foot length of LHS male students (cm)
Possible values of F: 20 cm < F < 120 cm or (20 cm, 120 cm)

Distribution: The set of observations (typically numerical in nature) taken from a population, or from a sample or subset of the population. A distribution is univariate if the observations are on a single attribute, and a distribution is bivariate if the observations are on two attributes, resulting in a pair of numbers.

Page 1A - 1

Practice 1-1
For each of the following variables, determine if it is categorical or numerical. If it is numerical, classify it as discrete or continuous, and try to estimate the range of possible values of the variable.
1. Length of a pencil
2. Type of pen
3. Number of pens in a box
4. Waist size of pants
5. Color of pants
6. Number of pockets on pants
7. Subject of book
8. Number of pages in book
9. Area of cover of a book

Practice 1-2
Describe a variable (with population), and give an example of a data set for that variable, for each of the following:
(a) Categorical Variable
(b) Numerical Variable
(c) Discrete Numerical Data
(d) Continuous Numerical Data

LESSON 2: BASIC GRAPHICAL DISPLAYS (DOT PLOTS AND STEM PLOTS)

A dotplot and a stemplot (or stem-and-leaf plot) are both effective ways to graphically display a relatively small numerical data set.

In a dotplot, you create an appropriate number line and use a dot to represent each data point.

Practice 2-1
A random sample of 18 6-oz bags of peanut M&M's were chosen, opened up and weighed. The weights of the contents, in ounces, were as follows:

Create a dot plot for this data.

Page 1A - 2

In a stem-and-leaf plot, you create a vertical line, where the stem is the first part of the number, and goes on the left of the line. The leaf is the last part of the number, and is listed on the right side of the line. Make sure each stemplot is supported with a key on how to read it.

Practice 2-2
Create a stemplot for the data in Practice 2-1.

If a stemplot has too few stems, the data can sometimes appear too compacted and it can be difficult to get a sense of the shape of the distribution. In these cases, you can create a split stemplot, where you divide each stem into two (or any other appropriate number of) rows in order to spread out the data and get a better visual sense of the distribution.

Practice 2-3
Create a split stemplot for the data in Practice 2-1. How is this graph different from the one in Practice 2-2?

LESSON 3: DESCRIBING DISTRIBUTIONS OF UNIVARIATE DATA

Describing a Distribution: The 4 Key Features of a Distribution are: Shape, Center, Spread and Unusual Values

Words used to describe the shape of a distribution:
a) Uniform (Rectangular)
b) Symmetric
c) Single-peaked (unimodal)
d) Double-peaked (bimodal)
e) Skewed left (negatively skewed)
f) Skewed right (positively skewed)

DISCUSSION 3-1: Draw a simple sketch that describes each of the shapes listed above.

There is no specific definition of unusual values, but here are some things to consider:

Def: Outliers: data values that fall out of the pattern of the rest of the distribution (very high or very low). We will soon learn a simple formula to determine if a data point is considered an outlier or not.
Clusters: isolated groups of values.
Gaps: large spaces between values.

Page 1A - 3

We will learn more specific ways to describe center and spread soon. In the meantime, you can:
1. estimate the center with the balancing point of the distribution, and
2. describe the spread as the interval in which the majority of the data points lie.

Practice 3-1
Describe the distribution of the peanut M&M data described in Practice 2-1 from page 1A-2. Make sure to answer in context.

Practice 3-2
A random sample of 30 male LHS students were chosen and weighed. Their weights, in pounds, were as follows:
{97, 102, 105, 114, 117, 122, 125, 128, 130, 132, 135, 137, 138, 139, 141, 144, 147, 148, 148, 154, 157, 159, 162, 166, 171, 173, 189, 191, 195, 225}
Create a dotplot or stemplot for this data. Then, describe the distribution of weights, in context.

LESSON 4: DESCRIBING THE CENTER OF A DISTRIBUTION (MEAN AND MEDIAN)

In the last few topics, we discussed some basic graphical methods to view a data set and the components needed to verbally describe that data set. In this lesson, we will learn more precise, numerical methods to describe the center of a data set.

Def: The population mean = μ = the average of all the values in the entire population.

Since we rarely study the entire population, we estimate the population mean (μ) with the sample mean ($\bar{x}$).

Def: The sample mean is the average of all the values in a sample from a population.

$\bar{x} = \frac{x_1 + x_2 + \cdots + x_n}{n} = \frac{\sum x_i}{n}$, where n = sample size

It is a convention in statistics to use Greek letters to denote population characteristics (or population parameters).

DISCUSSION 4-1
Why have we previously discussed the mean as the balancing point of a distribution? Consider the set {1, 2, 3, 6, 8}.

Page 1A - 4
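One way to explore the "balancing point" idea from Discussion 4-1 is to check how the deviations from the mean behave. The short Python sketch below does this for the set {1, 2, 3, 6, 8}; the variable names are illustrative and not part of the course materials.

```python
# Data set from Discussion 4-1
data = [1, 2, 3, 6, 8]

# Sample mean: sum of the values divided by the sample size n
n = len(data)
mean = sum(data) / n

# Deviation of each value from the mean
# (negative = to the left of the mean, positive = to the right)
deviations = [x - mean for x in data]

print("mean:", mean)
print("deviations:", deviations)
print("sum of deviations:", sum(deviations))
```

The deviations to the left of the mean and those to the right cancel, which is what it means for the mean to "balance" the distribution.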

Def: The median is the middle element of a data set. It is the value that separates the lower half of the data set from the upper half of the data set.

To find which value is the median, you must put the data in numerical order and calculate (n + 1)/2. The median will be the ((n + 1)/2)th data value in the ordered list.

Ex. 9 data elements: (9 + 1)/2 = 5, so the 5th element is the median.
36 data elements: (36 + 1)/2 = 18.5, so the average of the 18th and 19th elements is the median.

Practice 4-1
(a) Find the mean and median of the following set: {12, 20, 19, 8, 17, 23, 255, 12}
(b) It turns out that the 255 was recorded incorrectly and should have been recorded as 25. What effect will this have on the mean? On the median?

Def: A resistant measure is a measurement that is not strongly affected by outliers. The median is a resistant measure, but the mean is not: an outlier like 255 has no effect on the median here, but a big effect on the mean.

DISCUSSION 4-2
Why are medians sometimes reported instead of means, like when the news reports that the median home price is going up? Why don't they report the average price?

Def: A trimmed mean is a resistant measure which tries to avoid the influence of outliers by eliminating values from the low and high end of the distribution. For example, to find a 10% trimmed mean, you eliminate the lowest 10% of the data values and the highest 10%. The mean will then be computed using the remaining 80% of the data.

Practice 4-2
(a) Find the sample mean for the LHS male student weight data from Practice 3-2 on page 1A-4. If you were to take a 10% trimmed mean for this set, would you expect the trimmed mean to be greater than, or less than, the mean? Explain.
(b) Find the 10% trimmed mean for this data set. Were you correct?
(c) What happens as the trimming percentage gets closer to 50%?

Page 1A - 5
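The definitions on this page can be checked with a few lines of Python. This is a minimal sketch, not the course's official method: the function names are made up for illustration, and `trimmed_mean` assumes that the number of values dropped from each end is rounded down when the percentage does not give a whole number.

```python
def mean(values):
    """Sample mean: sum of the values divided by the sample size."""
    return sum(values) / len(values)


def median(values):
    """Median via the (n + 1)/2 position rule."""
    ordered = sorted(values)
    n = len(ordered)
    position = (n + 1) / 2                  # 1-based position in the ordered list
    if position.is_integer():
        return ordered[int(position) - 1]
    lower = ordered[int(position) - 1]      # e.g. the 18th value when position is 18.5
    upper = ordered[int(position)]          # e.g. the 19th value
    return (lower + upper) / 2


def trimmed_mean(values, trim_percent):
    """Mean after dropping trim_percent % of the values from each end.

    Assumption: the count dropped from each end is rounded down.
    """
    ordered = sorted(values)
    k = len(ordered) * trim_percent // 100
    kept = ordered[k:len(ordered) - k]
    return sum(kept) / len(kept)


# Practice 4-1: the set as recorded, and with 255 corrected to 25
recorded = [12, 20, 19, 8, 17, 23, 255, 12]
corrected = [12, 20, 19, 8, 17, 23, 25, 12]
print(mean(recorded), median(recorded))
print(mean(corrected), median(corrected))

# Practice 3-2 weights (pounds)
weights = [97, 102, 105, 114, 117, 122, 125, 128, 130, 132,
           135, 137, 138, 139, 141, 144, 147, 148, 148, 154,
           157, 159, 162, 166, 171, 173, 189, 191, 195, 225]
print(mean(weights), trimmed_mean(weights, 10))
```

Comparing the two printed lines for Practice 4-1 shows how differently the mean and the median react to a single extreme value.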

When the data is grouped into classes, use the midpoint of each class to estimate the mean.

Practice 4-3
Find the mean and median for the 58 SAT scores identified in the table below.

SAT score     Frequency
200-<300
300-<400
400-<500
500-<600
600-<700
700-<800      8
Total         58

LESSON 5: DESCRIBING VARIABILITY OF A DISTRIBUTION (STD. DEVIATION AND IQR)

Knowing the shape and center of a data set gives us some understanding of its distribution, but we do not know the whole story until we investigate the spread, or variability.

DISCUSSION 5-1
What do we mean by spread or variability? Can I get a visual?

The range is a simple way to describe the variability of a data set. The range is the difference between the maximum and minimum value: R = max - min.

Note: The range is one number, not two!
Note: Although it is simple to calculate, the range has a big weakness: it is not resistant to outliers.

Practice 5-1
A random sample of 8 male 8th graders takes part in a survey about allowances. One of the questions on the survey asks them to state how much money (to the nearest dollar) they have on their person at the time of the survey. Their responses to this question were {$4, $5, $10, $3, $6, $12, $36, $5}.

(a) Find the range of this data set.

Instead of using only 2 data points, better measures of spread consider the entire data set. Two of these measures are the standard deviation and variance.

The deviation of an observation is its distance from the mean. The standard deviation (denoted s) of a sample data set is a number that represents a typical deviation from the mean (denoted $\bar{x}$) for the sample data set.

(b) Create a dotplot of the data. Estimate the standard deviation of the data set visually.

Page 1A - 6

(c) Find the mean of the data set. Then use it to calculate the deviation values ($x - \bar{x}$) for each of the data points.

Standard Deviation of a Sample Data Set

$s = \sqrt{\frac{\sum (x - \bar{x})^2}{n - 1}}$

(d) Find the standard deviation of the data set. Interpret this value in context.

The variance of a sample data set is the square of the sample standard deviation.

(e) Find the variance of the data set.
(f) Use your graphing calculator to verify your results.

DISCUSSION 5-2
Is the standard deviation a resistant measure? Can it have a value of zero? Can it be negative?

Note the following notations, remembering our practice to use Greek letters for population parameters:
σ = the population standard deviation
σ² = the population variance

Assume all data sets are samples from larger populations, unless told otherwise.

Another common and useful measure of spread or variability is the Inter-Quartile Range (IQR). It is defined as the range of the middle 50% of the data. Mathematically this is interpreted as the difference between the first quartile and the third quartile. That is, IQR = Q3 - Q1.

The first quartile, or Q1, is simply the median of the lower 50% of the data set. The third quartile, or Q3, is the median of the upper 50% of the data set.

Note: The median is the second quartile, although it is rarely referred to that way.

Practice 5-2
Consider the male LHS student data described in Practice 3-2 on page 1A-4. Find the range and interquartile range for the data.

DISCUSSION 5-3
If we were to change the 225 to 525, would the IQR change? Would the range change? Which is a resistant measure?

Page 1A - 7
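As a companion to part (f) of Practice 5-1, the sample standard deviation formula can be traced step by step in Python instead of on a graphing calculator. This is only a sketch using the allowance responses listed in Practice 5-1; the variable names are illustrative.

```python
import math

# Allowance survey responses from Practice 5-1 (dollars)
amounts = [4, 5, 10, 3, 6, 12, 36, 5]
n = len(amounts)

# Range: one number, maximum minus minimum
data_range = max(amounts) - min(amounts)

# Mean and deviation of each observation from the mean
mean = sum(amounts) / n
deviations = [x - mean for x in amounts]

# Sample variance: sum of squared deviations divided by (n - 1)
variance = sum(d ** 2 for d in deviations) / (n - 1)

# Sample standard deviation: square root of the sample variance
s = math.sqrt(variance)

print("range:", data_range)
print("mean:", mean)
print("variance:", round(variance, 3))
print("standard deviation:", round(s, 3))
```

Rerunning the sketch with the $36 response removed is a quick way to explore Discussion 5-2's question about whether the standard deviation is resistant.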

DISCUSSION 5-4
If there were an odd number of points, so that the median is one of the observations in the set, is that observation considered part of the upper or lower half of the data set?
Example: What are Q1 and Q3 of the set {1, 2, 3, 4, 5, 6, 7, 8, 9}?

LESSON 6: MORE ON MEDIANS AND QUARTILES: OUTLIERS AND BOXPLOTS

The Five-Number Summary of a data set is the minimum, Q1, median, Q3, maximum.

A boxplot, another graphical method to visually display a data distribution, uses the five-number summary at its foundation. Some of you may know this graph as a box and whisker plot.

DISCUSSION 6-1
What does a basic boxplot look like? How do you find the IQR from a boxplot? Shouldn't a boxplot show outliers, if possible?

Determining Whether a Data Set has Outliers

To determine if there are outliers we place boundaries (fences) around the main part of the data.

lower fence = Q1 - 1.5(IQR)
upper fence = Q3 + 1.5(IQR)

Any observations that are outside of these fences are considered outliers and are shown with asterisks. If there are outliers, the whisker extends to the most extreme observation that is not an outlier.

Practice 6-1
(a) Does the male weight data from Practice 3-2 on page 1A-4 contain any outliers?
(b) Sketch a modified boxplot for this data.

DISCUSSION 6-2
How do you make a modified boxplot on a TI-83?

DISCUSSION 6-3
Can you describe the shape of a distribution by looking at its boxplot?

To summarize: Boxplots are nice because we can quickly identify the center (median) and spread (range and IQR). We can identify symmetry or skewness in the distribution, but a boxplot tells us nothing about the frequencies of the data values in the distribution.

Note: The book may use the terms mild outlier and extreme outlier. For our class, the distinction isn't important. Just call them all outliers. Always mark outliers!

Page 1A - 8
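The fence rule can be sketched in Python as follows. The helper names are hypothetical, and the quartile code makes one explicit assumption about Discussion 5-4: when n is odd, the median is left out of both halves. Other conventions exist and can give slightly different quartiles.

```python
def median(values):
    """Middle value of the ordered data (average of the two middle values if n is even)."""
    ordered = sorted(values)
    n = len(ordered)
    mid = n // 2
    if n % 2 == 1:
        return ordered[mid]
    return (ordered[mid - 1] + ordered[mid]) / 2


def five_number_summary(values):
    """Return (min, Q1, median, Q3, max).

    Assumption: for odd n, the median is excluded from both halves.
    """
    ordered = sorted(values)
    n = len(ordered)
    half = n // 2
    lower_half = ordered[:half]
    upper_half = ordered[n - half:]
    return (ordered[0], median(lower_half), median(ordered),
            median(upper_half), ordered[-1])


def find_outliers(values):
    """Observations outside the fences Q1 - 1.5(IQR) and Q3 + 1.5(IQR)."""
    _, q1, _, q3, _ = five_number_summary(values)
    iqr = q3 - q1
    lower_fence = q1 - 1.5 * iqr
    upper_fence = q3 + 1.5 * iqr
    return [x for x in values if x < lower_fence or x > upper_fence]


# Practice 3-2 weights (pounds)
weights = [97, 102, 105, 114, 117, 122, 125, 128, 130, 132,
           135, 137, 138, 139, 141, 144, 147, 148, 148, 154,
           157, 159, 162, 166, 171, 173, 189, 191, 195, 225]

print(five_number_summary(weights))
print(find_outliers(weights))
```

In a modified boxplot, any value returned by `find_outliers` would be drawn as an asterisk, and the whisker on that side would stop at the most extreme remaining observation.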

LESSON 7: RELATIVE FREQUENCY HISTOGRAMS

Stem-and-leaf plots and dotplots are very good for displaying small data sets. However, when there are a large number of observations, frequency distributions and histograms are a better choice.

Frequency Distributions and Histograms for Discrete Numerical Data:

Practice
Recent college grads were surveyed and were asked how many courses they took their last semester before graduating. The results are shown in the frequency table below.

# of Courses     Frequency

(a) Sketch a histogram representing this data.

Notes:
When each bar corresponds to only one value, center each bar above its corresponding value.
Make sure you label axes and scales. The vertical axis should always start at 0.
The bars in a histogram should touch (unlike bar charts for categorical data, which we will see later).

(b) A relative frequency histogram looks exactly the same as a regular histogram, except that it will have relative frequency (percent of the total) rather than frequency (number of observations) on the vertical axis. Create a relative frequency table and histogram for this data.

Frequency Distributions and Histograms for Continuous Numerical Data:

Consider the data (on the next page) on total annual rainfall (in inches) reported on The Los Angeles Almanac Web site for the years 1970 to

Page 1A - 9

Season     Total Rainfall (in.)

Since this data is continuous, there are no natural categories in which to place the data. In this case, we will define our own categories, called classes.

There is no perfect way to create classes, but classes should always be the same length and never overlap or leave any gaps. As a rule of thumb, the width of the intervals should approximately equal the square root of the range (the maximum value minus the minimum value).

What if an observation falls exactly on a boundary? As a convention, we will put boundary values into the upper class. For example, suppose you decided on class lengths of 8 inches starting at 0. Then, the boundaries would be 0, 8, 16, etc. In other words, the class from 0-8 is defined as 0 ≤ x < 8 or 0-<8, and an observation of 8 would fall into the 8-16 category.

Note: This approach results in a relative frequency histogram for continuous data.
Note: There are methods for creating histograms with unequal classes, but we are skipping them. If a frequency distribution has non-bounded classes, such as 12 or more, a histogram cannot be made.

Practice
1. Pick appropriate class intervals and make a frequency/relative frequency chart for the rainfall data.
2. Sketch a relative frequency histogram for the data.
3. Write a few short sentences commenting on the rainfall distribution. Remember the four key concepts in describing a distribution: shape, center, spread and unusual values.

Page 1A - 10
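The boundary convention (a value on a boundary goes into the upper class) is easy to get wrong by hand, so here is a small Python sketch of it. Because the rainfall table is not reproduced here, the sketch uses the Practice 3-2 weights instead; the starting point of 80 and the class width of 20 are illustrative assumptions, and `class_index` is a hypothetical helper name.

```python
def class_index(value, start, width):
    """Index of the class containing value, for classes
    start-<start+width, start+width-<start+2*width, ...

    Floor division sends a value that sits exactly on a boundary
    into the upper class.
    """
    return int((value - start) // width)


# Practice 3-2 weights (pounds), used here only as sample data
weights = [97, 102, 105, 114, 117, 122, 125, 128, 130, 132,
           135, 137, 138, 139, 141, 144, 147, 148, 148, 154,
           157, 159, 162, 166, 171, 173, 189, 191, 195, 225]

start, width = 80, 20          # illustrative choices, not prescribed values
num_classes = class_index(max(weights), start, width) + 1

# Tally the frequency of each class
counts = [0] * num_classes
for w in weights:
    counts[class_index(w, start, width)] += 1

# Print a frequency / relative frequency chart
total = len(weights)
for i, count in enumerate(counts):
    low = start + i * width
    print(f"{low}-<{low + width}: {count}  ({count / total:.3f})")
```

With these choices an observation of exactly 100 would be counted in the 100-<120 class, not the 80-<100 class.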

LESSON 8: CUMULATIVE RELATIVE FREQUENCY GRAPHS

Instead of wanting to know what percent of the data falls into a particular class, we often want to know what percent falls below a certain value. To make this possible, we will compute the cumulative relative frequency for each class, which is the sum of the relative frequency of that class and all the classes below it.

Practice 8-1
Consider the following frequency distribution of 200 Algebra 1 final exam scores. Complete the cumulative relative frequency table and use it to respond to the questions that follow:

Exam Score     frequency     relative freq.     cumulative r.f.
0-<10
10-<20
20-<30
30-<40
40-<50
50-<60
60-<70
70-<80
80-<90
90-<100        7
total          200

(a) What proportion of students scored less than 30? less than 90?
(b) What proportion of students scored at least 40? at least 70?
(c) What proportion scored exactly 40? exactly 73?
(d) What proportion scored at least 50 but less than 70?

Def: The graph of a cumulative relative frequency distribution is called a cumulative relative frequency plot or an ogive (term not in book).

(e) Create a cumulative relative frequency plot. Then, use the graph to answer the following questions.
1. About what proportion of students scored less than 45?
2. What score separates the lower half of the scores from the upper half of scores? (This score is called the median of the data set.)
3. What scores represent the boundaries that contain the middle 50% of Algebra 1 exam scores? (These two numbers are called the quartiles of the data set. The lower number is called the first quartile, and the upper number is called the third quartile.)
4. The interquartile range is the difference between the first and third quartiles of the data set. It is often abbreviated IQR. What is the IQR for this set of data scores?

Page 1A - 11
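To illustrate how a cumulative relative frequency column is built, the sketch below uses the Practice 3-2 weights rather than the exam scores, since the exam frequencies are not reproduced here. The classes (width 20, starting at 80) are an illustrative assumption, and all names are made up for the example.

```python
from itertools import accumulate

# Practice 3-2 weights (pounds), used here only as sample data
weights = [97, 102, 105, 114, 117, 122, 125, 128, 130, 132,
           135, 137, 138, 139, 141, 144, 147, 148, 148, 154,
           157, 159, 162, 166, 171, 173, 189, 191, 195, 225]

start, width, num_classes = 80, 20, 8   # illustrative classes 80-<100, ..., 220-<240

# Frequency of each class (boundary values go into the upper class)
counts = [0] * num_classes
for w in weights:
    counts[(w - start) // width] += 1

total = sum(counts)

# Relative frequency: class count divided by the total
relative = [c / total for c in counts]

# Cumulative relative frequency: running total of counts, divided by the total
cumulative_counts = list(accumulate(counts))
cumulative = [c / total for c in cumulative_counts]

print("class      freq  rel.f.  cum.r.f.")
for i in range(num_classes):
    low = start + i * width
    print(f"{low}-<{low + width:<5} {counts[i]:>4}  {relative[i]:.3f}   {cumulative[i]:.3f}")
```

Each cumulative value answers a "less than" question at the upper boundary of its class; for example, the entry for 120-<140 is the proportion of weights below 140. Plotting these values against the upper class boundaries gives the cumulative relative frequency plot.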

Def: the Pth percentile of a distribution is the value in the distribution such that P percent of the observations lie at that level or below.

5. Estimate the 15th percentile score for this distribution.
6. What is the 70th percentile score for this distribution?

LESSON 9: INTERPRETING AND COMPARING CENTER AND VARIABILITY

A very important characteristic of a data value in a data set is its position in relation to the mean and the standard deviation of the distribution. As such, a very important question for statisticians when working with distributions is: "What proportion or percentage of the distribution's data values fall within k standard deviation(s) of the mean?"

A very conservative approach to answering this question would be to use a rule called Chebychev's Rule, which states that for any distribution, the proportion of observations that are within k standard deviations of the mean is at least $1 - \frac{1}{k^2}$. Though Chebychev's Rule is rather conservative, it does work for all distributions.

DISCUSSION 9-1
What do you mean, using Chebychev's Rule is a conservative approach?

If we know, however, that a distribution is unimodal and symmetric (and many of our most useful distributions are), we can be more exact with these estimations. Most unimodal and symmetric distributions can be well approximated with a normal curve (also called a bell curve).

The 68-95-99.7 Rule (also called the Empirical Rule) states that if the data set can be well approximated by a normal curve, then
approximately 68% of the observations will be within 1 standard deviation of the mean,
approximately 95% of the observations will be within 2 standard deviations of the mean, and
approximately 99.7% of the observations will be within 3 standard deviations of the mean.

DISCUSSION 9-2
Can I get a visual of the Empirical Rule?

Practice 9-1
The SAT Math I scores of all high school students are approximately normal with a mean score of 580 and a standard deviation of 70. Sketch the normal curve and answer the following questions:
(a) Approximately 95% of students scored between which 2 values?
(b) What proportion of students scored between 580 and 650?
(c) Between 510 and 710?
(d) Between 440 and 510?
(e) Between 600 and 700?

Page 1A - 12
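To compare the two rules side by side, the following Python sketch tabulates Chebychev's lower bound next to the Empirical Rule percentages and computes the interval "mean ± k standard deviations" for the SAT distribution in Practice 9-1. The function and variable names are illustrative only.

```python
def chebyshev_lower_bound(k):
    """Minimum proportion of observations within k standard deviations
    of the mean, for any distribution: 1 - 1/k^2 (useful for k > 1)."""
    return 1 - 1 / k ** 2


def interval(mean, sd, k):
    """Endpoints of the interval mean - k*sd to mean + k*sd."""
    return (mean - k * sd, mean + k * sd)


# Approximate proportions from the 68-95-99.7 (Empirical) Rule,
# valid only when a normal curve fits the data well
EMPIRICAL = {1: 0.68, 2: 0.95, 3: 0.997}

# SAT Math I distribution from Practice 9-1
sat_mean, sat_sd = 580, 70

for k in (1, 2, 3):
    low, high = interval(sat_mean, sat_sd, k)
    print(f"k = {k}: {low} to {high}, "
          f"Chebychev at least {chebyshev_lower_bound(k):.3f}, "
          f"Empirical Rule about {EMPIRICAL[k]}")
```

For every k the Chebychev figure is smaller than the Empirical Rule figure, which is the sense in which Chebychev's Rule is "conservative": it promises less, but it holds for every distribution.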

Practice 9-2
Suppose that the distribution of incomes for professional baseball players has a mean of 2.6 million dollars with a standard deviation of 3.1 million dollars. What proportion of players is within one standard deviation of the mean salary amount?

COMPARING DATA FROM TWO DIFFERENT DISTRIBUTIONS

There are two different types of comparisons that we must be able to make and describe: (1) comparing individual data values from two different distributions with one another; and (2) comparing two distributions with one another.

To make comparisons between data points from different distributions possible, we consider where each data point falls within its own distribution. When we are using the standard deviation as a ruler to measure how far an observation is above or below the mean, we are using a standardized score or a z-score.

$z = \frac{\text{observed value} - \text{mean value}}{\text{standard deviation}} = \frac{x - \bar{x}}{s}$

Practice 9-3
Suppose that a professional soccer team has the money to sign one additional player and they are considering adding either a goalie or a forward. The goalie has a 90% save percentage and the forward averages 1.2 goals a game. In this league, the average goalie saves 86% of shots with a standard deviation of 5%, while the average forward scores 0.9 goals per game with a standard deviation of 0.2. Who is the better player?

To compare two different distributions to one another, we should always compare their shapes, centers, and spreads in context. In our descriptions of the comparisons, we must always be sure to use comparative language ("is equal to" or "is greater than" or "is smaller than"). It is not enough just to report the shape and the center and spread values of the two distributions individually. We must describe how they compare to one another.

Practice 9-4
In an experiment investigating the effectiveness of two anti-depression drugs, 40 subjects were assigned to take Drug A, and 40 subjects to take Drug B. After a specified amount of time, each subject was asked to rank, on a scale from 1 to 10, how happy they felt. The following histograms display the Happiness scores for the subjects in each of the two test groups, respectively.

[Histograms: Drug A, Drug B]

Page 1A - 13
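The z-score formula from this page can be applied to the two soccer players in Practice 9-3 with a short Python sketch. The function name `z_score` is illustrative.

```python
def z_score(observed, mean, sd):
    """Number of standard deviations the observed value lies
    above (positive) or below (negative) the mean."""
    return (observed - mean) / sd


# Practice 9-3: each player is compared with others at the same position
goalie_z = z_score(observed=90, mean=86, sd=5)        # save percentage
forward_z = z_score(observed=1.2, mean=0.9, sd=0.2)   # goals per game

print("goalie z-score:", round(goalie_z, 2))
print("forward z-score:", round(forward_z, 2))
```

Because each z-score is measured in standard deviations within the player's own distribution, the two values can be compared directly even though save percentage and goals per game use different units.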

(a) Construct comparative boxplots for these two distributions.
(b) Compare the two distributions in context.

LESSON 10: SHIFTING AND RESCALING DATA

Suppose that I took a random sample of 9 geometry students and recorded their scores on a recent quiz. Their scores are as follows: {10, 15, 18, 18, 20, 20, 22, 22, 24}.

Complete the dot plot for this distribution and list all the summary statistics for this data set (mean, standard deviation, median, quartiles, IQR, range).

Now, suppose that I was feeling especially generous and added 25 points to each score. Plot the new distribution and recalculate the summary statistics. Which ones changed? Which did not? Did the shape change?

Page 1A - 14

Now, go back to the original data and assume that the quiz was out of 25 points. To convert these scores to percents, we could multiply each of them by 4 (since 4 × 25 = 100). Sketch the new distribution (same scale!) and recalculate the summary statistics. Which ones changed? Which did not? Did the shape change?

Conclusion No. 1: When we add (or subtract) a constant from every member of a data set, the measures of position (min, max, mean, median, quartiles) change by the same amount we added or subtracted, but the measures of spread (range, IQR, standard deviation) remain unchanged. When we add (or subtract) the same number to every number in a data set, we are shifting the data set.

Conclusion No. 2: When we multiply (or divide) every member of a data set by a positive constant, BOTH the measures of position (min, max, mean, median, quartiles) AND the measures of spread (range, IQR, standard deviation) are multiplied (or divided) by that same constant. When we multiply (or divide) each member of a data set by the same constant, we are rescaling the data set.

Page 1A - 15
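Both conclusions can be checked numerically on the quiz scores from this lesson. The sketch below uses Python's standard `statistics` module; the `summarize` helper is a hypothetical name, and it reports only a subset of the summary statistics (mean, median, standard deviation, range).

```python
import statistics


def summarize(values):
    """A few summary statistics: two measures of position, two of spread."""
    return {
        "mean": statistics.mean(values),
        "median": statistics.median(values),
        "stdev": statistics.stdev(values),      # sample standard deviation
        "range": max(values) - min(values),
    }


# Quiz scores from Lesson 10
scores = [10, 15, 18, 18, 20, 20, 22, 22, 24]

shifted = [x + 25 for x in scores]     # add 25 points to every score
rescaled = [x * 4 for x in scores]     # convert scores out of 25 to percents

original_stats = summarize(scores)
shifted_stats = summarize(shifted)
rescaled_stats = summarize(rescaled)

for label, stats in (("original", original_stats),
                     ("shifted", shifted_stats),
                     ("rescaled", rescaled_stats)):
    print(label, {name: round(value, 3) for name, value in stats.items()})
```

In the printed output, shifting moves the mean and median by 25 while leaving the standard deviation and range alone, and rescaling multiplies all four values by 4.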
