AP Statistics Summer Assignment:


Benedict Byrd
5 years ago

Read the following and use the information to help answer your summer assignment questions. You will be responsible for knowing all of the information contained in this reading.

Variables

In statistics, a variable has two defining characteristics:

A variable is an attribute that describes a person, place, thing, or idea.
The value of the variable can "vary" from one entity to another.

For example, a person's hair color is a potential variable, which could have the value of "blond" for one person and "brunette" for another.

Categorical vs. Quantitative Variables

Variables can be classified as categorical (aka, qualitative) or quantitative (aka, numerical).

Categorical. Categorical variables take on values that are names or labels. The color of a ball (e.g., red, green, blue) or the breed of a dog (e.g., collie, shepherd, terrier) would be examples of categorical variables.

Quantitative. Quantitative variables are numerical. They represent a measurable quantity. For example, when we speak of the population of a city, we are talking about the number of people in the city - a measurable attribute of the city. Therefore, population would be a quantitative variable.

In algebraic equations, quantitative variables are represented by symbols (e.g., x, y, or z).

Discrete vs. Continuous Variables

Quantitative variables can be further classified as discrete or continuous. If a variable can take on any value between its minimum value and its maximum value, it is called a continuous variable; otherwise, it is called a discrete variable.

Some examples will clarify the difference between discrete and continuous variables.

Suppose the fire department mandates that all fire fighters must weigh between 150 and 250 pounds. The weight of a fire fighter would be an example of a continuous variable, since a fire fighter's weight could take on any value between 150 and 250 pounds.

Suppose we flip a coin and count the number of heads. The number of heads could be any integer value between 0 and plus infinity. However, it could not be just any number between 0 and plus infinity. We could not, for example, get 2.3 heads. Therefore, the number of heads must be a discrete variable.

Univariate vs. Bivariate Data

Statistical data is often classified according to the number of variables being studied.

Univariate data. When we conduct a study that looks at only one variable, we say that we are working with univariate data. Suppose, for example, that we conducted a survey to estimate the average weight of high school students. Since we are only working with one variable (weight), we would be working with univariate data.

Bivariate data. When we conduct a study that examines the relationship between two variables, we are working with bivariate data. Suppose we conducted a study to see if there were a relationship between the height and weight of high school students. Since we are working with two variables (height and weight), we would be working with bivariate data.

Measures of Central Tendency

Statisticians use summary measures to describe patterns of data. Measures of central tendency refer to the summary measures used to describe the most "typical" value in a set of values.

The Mean and the Median

The two most common measures of central tendency are the median and the mean, which can be illustrated with an example. Suppose we draw a sample of five women and measure their weights. They weigh 100 pounds, 100 pounds, 130 pounds, 140 pounds, and 150 pounds.

To find the median, we arrange the observations in order from smallest to largest value. If there is an odd number of observations, the median is the middle value. If there is an even number of observations, the median is the average of the two middle values. Thus, in the sample of five women, the median value would be 130 pounds, since 130 pounds is the middle weight.

The mean of a sample or a population is computed by adding all of the observations and dividing by the number of observations. Returning to the example of the five women, the mean weight would equal (100 + 100 + 130 + 140 + 150)/5 = 620/5 = 124 pounds.

In the general case, the mean can be calculated using one of the following equations:

Population mean: $\mu = \frac{\sum X}{N}$
Sample mean: $\bar{x} = \frac{\sum x}{n}$

where $\sum X$ is the sum of all the population observations, $N$ is the number of population observations, $\sum x$ is the sum of all the sample observations, and $n$ is the number of sample observations.

When statisticians talk about the mean of a population, they use the Greek letter $\mu$ to refer to the mean score. When they talk about the mean of a sample, statisticians use the symbol $\bar{x}$ to refer to the mean score.
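As a quick check of the arithmetic above, here is a minimal Python sketch that computes the mean and median of the five weights using the standard library's statistics module. The sixth weight (160 pounds) is a hypothetical value, added only to show how the median is found when the number of observations is even.

```python
from statistics import mean, median

# Weights (in pounds) of the five women from the example above.
weights = [100, 100, 130, 140, 150]

sample_mean = mean(weights)      # sum of the observations / number of observations
sample_median = median(weights)  # middle value of the sorted observations

# With an even number of observations, median() averages the two middle values.
# The extra weight of 160 is hypothetical and is not part of the original example.
even_median = median(weights + [160])

print(sample_mean, sample_median, even_median)
```

The first two values should match the 124 pounds and 130 pounds worked out by hand.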

The Mean vs. the Median

As measures of central tendency, the mean and the median each have advantages and disadvantages. Some pros and cons of each measure are summarized below.

The median may be a better indicator of the most typical value if a set of scores has an outlier. An outlier is an extreme value that differs greatly from other values. The median is a resistant measure of central tendency because it is not affected by extreme values.

However, when the sample size is large and does not include outliers, the mean score usually provides a better measure of central tendency.

To illustrate these points, consider the following example. Suppose we examine a sample of 10 households to estimate the typical family income. Nine of the households have incomes between $20,000 and $100,000; but the tenth household has an annual income of $1,000,000,000. That tenth household is an outlier. If we choose a measure to estimate the income of a typical household, the mean will greatly over-estimate the income of a typical family (because of the outlier), while the median will not.

Effect of Changing Units

Sometimes, researchers change units (minutes to hours, feet to meters, etc.). Here is how measures of central tendency are affected when we change units.

If you add a constant to every value, the mean and median increase by the same constant. For example, suppose you have a set of scores with a mean equal to 5 and a median equal to 6. If you add 10 to every score, the new mean will be 5 + 10 = 15; and the new median will be 6 + 10 = 16.

Suppose you multiply every value by a constant. Then, the mean and the median will also be multiplied by that constant. For example, assume that a set of scores has a mean of 5 and a median of 6. If you multiply each of these scores by 10, the new mean will be 5 * 10 = 50; and the new median will be 6 * 10 = 60.

Measures of Position

Statisticians often talk about the position of a value, relative to other values in a set of observations. The most common measures of position are quartiles, percentiles, and standard scores (aka, z-scores).

Percentiles

Assume that the elements in a data set are rank ordered from the smallest to the largest. The values that divide a rank-ordered set of elements into 100 equal parts are called percentiles.

An element having a percentile rank of $P_i$ would have a greater value than i percent of all the elements in the set. Thus, the observation at the 50th percentile would be denoted $P_{50}$, and it would be greater than 50 percent of the observations in the set. An observation at the 50th percentile would correspond to the median value in the set.

A percentile can also be defined as a measure that tells us what percent of the total frequency scored at or below that measure. A percentile rank is the percentage of scores that fall at or below a given score.

Quartiles

Quartiles divide a rank-ordered data set into four equal parts. The values that divide each part are called the first, second, and third quartiles; and they are denoted by Q1, Q2, and Q3, respectively.

Note the relationship between quartiles and percentiles. Q1 corresponds to P25, Q2 corresponds to P50, Q3 corresponds to P75. Q2 is the median value in the set.

Measures of Dispersion

While knowing the mean value for a set of data may give us some information about the set itself, many varying sets can have the same mean value. To determine how the sets are different, we need more information. Another way of examining single variable data is to look at how the data is spread out, or dispersed about the mean.

We will discuss 3 ways of examining the dispersion of data. The smaller the values from these methods, the more consistent the data.

1. Range: The simplest of our methods for measuring dispersion is range. Range is the difference between the largest value and the smallest value in the data set. While being simple to compute, the range is often unreliable as a measure of dispersion since it is based on only two values in the set. A range of 50 tells us very little about how the values are dispersed. Are the values all clustered to one end with the low value (12) or the high value (62) being an outlier? Or are the values more evenly dispersed among the range?

Before discussing our next methods, let's establish some vocabulary:

Population form: The population form is used when the data being analyzed includes the entire set of possible data. When using this form, divide by N, the number of values in the data set.

Sample form: The sample form is used when the data is a random sample taken from the entire set of data. When using this form, divide by n - 1. (It can be shown that dividing by n - 1 makes $s^2$ for the sample a better estimate of $\sigma^2$ for the population from which the sample was taken.)

The population form should be used unless you know a random sample is being analyzed.


2. Variance: To find the variance, subtract the mean, $\bar{X}$, from each of the values $X_i$ in the data set, square the result, add all of these squares, and divide by the number of values in the data set.

Population Variance: $\sigma_X^2 = \frac{1}{N}\sum_{i=1}^{N}(X_i - \mu)^2$

Sample Variance: $s_X^2 = \frac{1}{n-1}\sum_{i=1}^{n}(X_i - \bar{X})^2$

3. Standard Deviation: Standard deviation is the square root of the variance. The formulas are:

Population Std Deviation: $\sigma_X = \sqrt{\frac{1}{N}\sum_{i=1}^{N}(X_i - \mu)^2}$

Sample Std Deviation: $s_X = \sqrt{\frac{1}{n-1}\sum_{i=1}^{n}(X_i - \bar{X})^2}$

Warning!!! Be sure you know where to find "population" forms versus "sample" forms on the calculator.

Examples: Find, to the nearest tenth, the standard deviation and variance of the distribution:

[Table: Score and Frequency; values not preserved]

Solution: Grab your graphing calculator. Enter the data and frequencies in lists. Choose 1-Var Stats and enter as grouped data.

Population standard deviation is [value not preserved]. Population variance is [value not preserved].
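The three measures of dispersion can be sketched in a few lines of Python. This is only an illustration, not the calculator procedure from the example: it reuses the five weights from the mean and median example earlier in this reading, and the variable names are chosen here for readability.

```python
from math import sqrt
from statistics import pstdev, pvariance, stdev, variance

# Five weights (in pounds) from the mean and median example earlier in the reading.
weights = [100, 100, 130, 140, 150]

# 1. Range: largest value minus smallest value.
data_range = max(weights) - min(weights)

# 2. Variance: average of the squared deviations from the mean.
n = len(weights)
mean_weight = sum(weights) / n
squared_deviations = [(x - mean_weight) ** 2 for x in weights]
population_variance = sum(squared_deviations) / n    # population form: divide by N
sample_variance = sum(squared_deviations) / (n - 1)  # sample form: divide by n - 1

# 3. Standard deviation: square root of the variance.
population_sd = sqrt(population_variance)
sample_sd = sqrt(sample_variance)

# The statistics module provides both forms directly, which is a handy cross-check.
print(data_range)
print(population_variance, pvariance(weights))
print(sample_variance, variance(weights))
print(round(population_sd, 1), round(pstdev(weights), 1))
print(round(sample_sd, 1), round(stdev(weights), 1))
```

Note that the sample form always gives the larger result, because the same sum of squares is divided by a smaller number.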

Patterns in Data

Graphical displays are useful for seeing patterns in data. Patterns in data are commonly described in terms of: center, unusual features, shape, and spread.

Center: Graphically, the center of a distribution is located at the median of the distribution. This is the point in a graphic display where about half of the observations are on either side. In the histogram below, the height of each column indicates the frequency of observations. Here, the observations are centered over 4.

Unusual Features: Sometimes, statisticians refer to unusual features in a set of data. The two most common unusual features are gaps and outliers.

Gaps. Gaps refer to areas of a distribution where there are no observations. The first figure below has a gap; there are no observations in the middle of the distribution.

Outliers. Sometimes, distributions are characterized by extreme values that differ greatly from the other observations. These extreme values are called outliers. As a "rule of thumb", an extreme value is often considered to be an outlier if it is at least 1.5 interquartile ranges below the first quartile (Q1), or at least 1.5 interquartile ranges above the third quartile (Q3).

Spread: The spread of a distribution refers to the variability of the data. If the observations cover a wide range, the spread is larger. If the observations are clustered around a single value, the spread is smaller.

Shape: The shape of a distribution is described by the following characteristics.

Symmetry. When it is graphed, a symmetric distribution can be divided at the center so that each half is a mirror image of the other.

Number of peaks. Distributions can have few or many peaks. Distributions with one clear peak are called unimodal, and distributions with two clear peaks are called bimodal. When a symmetric distribution has a single peak at the center, it is referred to as bell-shaped.

Skewness. When they are displayed graphically, some distributions have many more observations on one side of the graph than the other. Distributions with most of their observations on the left (toward lower values) are said to be skewed right; and distributions with most of their observations on the right (toward higher values) are said to be skewed left.

Uniform. When the observations in a set of data are equally spread across the range of the distribution, the distribution is called a uniform distribution. A uniform distribution has no clear peaks.

Dotplots

Dotplot Overview

As you might guess, a dotplot is made up of dots plotted on a graph. Here is how to interpret a dotplot.

Each dot can represent a single observation from a set of data, or a specified number of observations from a set of data.

The dots are stacked in a column over a value, so that the height of the column represents the relative or absolute frequency of observations at that value.

Compared to other types of graphical displays, dotplots are used most often to plot frequency counts within a small number of values, usually with small sets of data.
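To make the dotplot idea concrete, here is a small Python sketch that draws a text dotplot. It is an illustration only: the function name text_dotplot and the data (number of pets per household) are hypothetical, and the plot is turned on its side, so each value gets a row of dots rather than a column.

```python
from collections import Counter

def text_dotplot(values):
    """Return a sideways dotplot as text: one row per value, one dot per observation."""
    counts = Counter(values)
    lines = []
    # Walk every integer from the minimum to the maximum so that gaps stay visible.
    for value in range(min(values), max(values) + 1):
        lines.append(f"{value:>3} | " + "." * counts[value])
    return "\n".join(lines)

# Hypothetical data: number of pets in each of eight households.
pets = [0, 1, 1, 2, 2, 2, 3, 5]
print(text_dotplot(pets))
```

In the printed plot, the row for 2 is the longest (the peak), the empty row for 4 is a gap, and the lone dot at 5 sits apart from the rest of the data.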


Dotplot Example (figure not preserved)

Bar Charts and Histograms

Bar charts and histograms are used to compare the sizes of different groups.

Bar Charts

A bar chart is made up of columns plotted on a graph. Here is how to read a bar chart.

The columns are positioned over a label that represents a categorical variable.
The height of the column indicates the size of the group defined by the column label.

The bar chart below shows average per capita income for the four "New" states - New Jersey, New York, New Hampshire, and New Mexico.

Histograms

Like a bar chart, a histogram is made up of columns plotted on a graph. Usually, there is no space between adjacent columns. Here is how to read a histogram.

The columns are positioned over a label that represents a quantitative variable.
The column label can be a single value or a range of values.
The height of the column indicates the size of the group defined by the column label.

The histogram below shows per capita income for five age groups.

The Difference Between Bar Charts and Histograms

Here is the main difference between bar charts and histograms. With bar charts, each column represents a group defined by a categorical variable; and with histograms, each column represents a group defined by a quantitative variable.

One implication of this distinction: it is always appropriate to talk about the skewness of a histogram; that is, the tendency of the observations to fall more on the low end or the high end of the X axis.

With bar charts, however, the X axis does not have a low end or a high end, because the labels on the X axis are categorical - not quantitative. As a result, it is less appropriate to comment on the skewness of a bar chart.

Stemplots (aka, Stem and Leaf Plots)

Although a histogram shows how observations are distributed across groups, it does not show the exact values of individual observations. A different kind of graphical display, called a stemplot or a stem and leaf plot, does show exact values of individual observations.

Stemplots

A stemplot is used to display quantitative data, generally from small data sets (50 or fewer observations). The stemplot below shows IQ scores for 30 sixth graders.

In a stemplot, the entries on the left are called stems; and the entries on the right are called leaves. In the example above, the stems are tens (80 and 90) and hundreds (100 through 140). However, they could be other units - millions, thousands, ones, tenths, etc.

In the example above, the stems and leaves are explicitly labeled for educational purposes. In the real world, however, stemplots usually do not include explicit labels for the stems and leaves.

Some stemplots include a key to help the user interpret the display correctly. The key in the stemplot above indicates that a stem of 11 with a leaf of 7 represents an IQ score of 117.

Looking at the example above, you should be able to quickly describe the distribution of IQ scores. Most of the scores are clustered between 90 and 109, with the center falling in the neighborhood of 100. The scores range from a low of 81 (two students have an IQ of 81) to a high of 151. The high score of 151 might be classified as an outlier.
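The way a stemplot splits each value into a stem and a leaf can be sketched in Python. This is a minimal illustration that assumes whole-number data with the ones digit as the leaf; the function name stemplot is hypothetical, and the ten scores are made-up values (not the 30 scores from the original figure), chosen to include the 81, 117, and 151 mentioned above.

```python
from collections import defaultdict

def stemplot(values):
    """Return a stemplot as text, with the ones digit as the leaf and the rest as the stem."""
    leaves = defaultdict(list)
    for value in sorted(values):
        stem, leaf = divmod(value, 10)  # e.g. 117 -> stem 11, leaf 7
        leaves[stem].append(leaf)
    lines = []
    # Include stems that have no leaves so that gaps in the data remain visible.
    for stem in range(min(leaves), max(leaves) + 1):
        row = " ".join(str(leaf) for leaf in leaves.get(stem, []))
        lines.append(f"{stem:>2} | {row}".rstrip())
    return "\n".join(lines)

# Hypothetical IQ scores, for illustration only.
scores = [81, 81, 94, 97, 100, 102, 105, 108, 117, 151]
print(stemplot(scores))
```

The empty stems between 11 and 15 show up as blank rows, which is what makes a score like 151 stand out as a possible outlier.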

Boxplots (aka, Box and Whisker Plots)

A boxplot, sometimes called a box and whisker plot, is a type of graph used to display patterns of quantitative data.

Boxplot Basics

A boxplot splits the data set into quartiles. The body of the boxplot consists of a "box" (hence, the name), which goes from the first quartile (Q1) to the third quartile (Q3).

Within the box, a vertical line is drawn at Q2, the median of the data set. Two horizontal lines, called whiskers, extend from the front and back of the box. The front whisker goes from Q1 to the smallest non-outlier in the data set, and the back whisker goes from Q3 to the largest non-outlier.

If the data set includes one or more outliers, they are plotted separately as points on the chart. In the boxplot above, two outliers precede the first whisker; and three outliers follow the second whisker.

How to Interpret a Boxplot

Here is how to read a boxplot. The median is indicated by the vertical line that runs down the center of the box. In the boxplot above, the median is about 400.

Additionally, boxplots display two common measures of the variability or spread in a data set.

Range. If you are interested in the spread of all the data, it is represented on a boxplot by the horizontal distance between the smallest value and the largest value, including any outliers. In the boxplot above, data values range from about -700 (the smallest outlier) to 1700 (the largest outlier), so the range is about 2400. If you ignore outliers, the range is illustrated by the distance between the opposite ends of the whiskers - about 1000 in the boxplot above.

Interquartile range (IQR). The middle half of a data set falls within the interquartile range. In a boxplot, the interquartile range is represented by the width of the box (Q3 minus Q1). In the chart above, the interquartile range is equal to 600 minus 300 or about 300.

And finally, boxplots often provide information about the shape of a data set. The examples below show some common patterns.

Each of the above boxplots illustrates a different skewness pattern. If most of the observations are concentrated on the low end of the scale, the distribution is skewed right; and vice versa. If a distribution is symmetric, the observations will be evenly split at the median, as shown above in the middle figure.


Cumulative Frequency Plots

A cumulative frequency plot is a way to display cumulative information graphically. It shows the number, percentage, or proportion of observations in a data set that are less than or equal to particular values.

Frequency vs. Cumulative Frequency

In a data set, the cumulative frequency for a value x is the total number of scores that are less than or equal to x. The charts below illustrate the difference between frequency and cumulative frequency. Both charts show scores for a test administered to 300 students.

In the chart on the left, column height shows frequency - the number of students in each test score grouping. For example, about 30 students received a test score between 51 and 60.

In the chart on the right, column height shows cumulative frequency - the number of students up to and including each test score. The chart on the right is a cumulative frequency chart. It shows that 30 students received a test score of at most 50; 60 students received a score of at most 60; 120 students received a score of at most 70; and so on.

Absolute vs. Relative Frequency

Frequency counts can be measured in terms of absolute numbers or relative numbers (e.g., proportions or percentages). The cumulative relative frequency chart expresses the counts in terms of percentages rather than absolute numbers.

Note that the columns in the chart have the same shape, whether the Y axis is labeled with actual frequency counts or with percentages. If we had used proportions instead of percentages, the shape would remain the same.
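The step from frequencies to cumulative and relative cumulative frequencies is just a running total, as the following Python sketch shows. The first three frequencies reproduce the counts quoted above (30, 60, and 120 students at or below scores of 50, 60, and 70). The remaining frequencies are assumptions made up for this illustration so that the total is 300 students; they are not read from the original charts.

```python
from itertools import accumulate

# Upper bound of each test score group and the number of students in the group.
# Only the first three counts come from the text; the rest are hypothetical.
upper_bounds = [50, 60, 70, 80, 90, 100]
frequencies = [30, 30, 60, 90, 60, 30]

# Cumulative frequency: running total of students at or below each upper bound.
cumulative = list(accumulate(frequencies))

# Cumulative relative frequency: the same counts expressed as percentages of the total.
total = cumulative[-1]
cumulative_percent = [100 * count / total for count in cumulative]

for bound, count, percent in zip(upper_bounds, cumulative, cumulative_percent):
    print(f"score <= {bound:>3}: {count:>3} students ({percent:.0f}%)")
```

Because each percentage is the corresponding count divided by the same total, the two cumulative charts necessarily have the same shape.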

Discrete vs. Continuous Variables

Each of the previous cumulative charts has used a discrete variable on the X axis (i.e., the horizontal axis). The chart above duplicates the previous cumulative charts, except that it uses a continuous variable for the test scores on the X axis.

Let's work through an example to understand how to read this cumulative frequency plot. Specifically, let's find the median. Follow the grid line to the right from the Y axis at 50%. This line intersects the curve over the X axis at a test score of about 73. This means that half of the students received a test score of at most 73, and half received a test score of at least 73. Thus, the median is 73.

You can use the same process to find the cumulative percentage associated with any other test score. For example, what percentage of students received a test score of 64 or less? From the graph, you can see that about 25% of students received a score of 64 or less.

Comparing Distributions

Common graphical displays (e.g., dotplots, boxplots, stemplots, bar charts) can be effective tools for comparing data from two or more populations.

How to Compare Distributions

When you compare two or more data sets, focus on four features:

Center. Graphically, the center of a distribution is the point where about half of the observations are on either side.

Spread. The spread of a distribution refers to the variability of the data. If the observations cover a wide range, the spread is larger. If the observations are clustered around a single value, the spread is smaller.

Shape. The shape of a distribution is described by symmetry, skewness, number of peaks, etc.

Unusual features. Unusual features refer to gaps (areas of the distribution where there are no observations) and outliers.

The remainder of this lesson shows how to interpret various graphs in terms of center, spread, shape, and unusual features. This is a skill that will probably be tested on the Advanced Placement (AP) Statistics Exam.

Back-to-Back Stemplots

Back-to-back stemplots are another graphic option for comparing data from two populations. The center of a back-to-back stemplot consists of a column of stems, with a vertical line on each side. Leaves representing one data set extend from the right, and leaves representing the other data set extend from the left.

The back-to-back stemplot on the right shows the amount of cash (in dollars) carried by a random sample of teenage boys and girls. The boys carried more cash than the girls - a median of $42 for the boys versus $26 for the girls. Both distributions were roughly bell-shaped, although there was more variation among the boys. And finally, there were neither gaps nor outliers in either group.

Parallel Boxplots

With parallel boxplots (aka, side-by-side boxplots), data from two distributions are displayed on the same chart, using the same measurement scale.

The boxplot above summarizes results from a medical study. The treatment group received an experimental drug to relieve cold symptoms, and the control group received a placebo. The boxplot shows the number of days each group continued to report symptoms.

Neither distribution has unusual features, such as gaps or outliers. Both distributions are skewed to the right, although the skew is more prominent in the treatment group. Patient response was slightly less variable in the treatment group than in the control group. In the treatment group, cold symptoms lasted 1 to 14 days (range = 13) versus 3 to 17 days (range = 14) for the control group. The median recovery time is more telling - about 5 days for the treatment group versus about 9 days for the control group. It appears that the drug had a positive effect on patient recovery.


Double Bar Charts

A double bar chart is similar to a regular bar chart, except that it provides two pieces of information for each category rather than just one. Often, the charts are color-coded with a different colored bar representing each piece of information.

Above, a double bar chart shows customer satisfaction ratings for different cars, broken out by gender. The blue rows represent males; the red rows, females.

Both groups prefer the Japanese cars to the American cars, with Honda receiving the highest ratings and Ford receiving the lowest ratings. Moreover, both genders agree on the rank order in which the cars are rated. As a group, the men seem to be tougher raters; they gave lower ratings to each car than the women gave.

Statistical Graphs Summary

Data recorded in experiments or surveys is displayed by a statistical graph. The choice of graph is determined by the type and breadth of the data, the audience it is directed to, and the questions being asked. Each type of graph has its advantages and disadvantages. Consult the table below when choosing a graph.

Dotplot
A dotplot can be used as an initial record of discrete data values. The range determines a number line which is then plotted with X's for each data value.
Advantages: Quick analysis of data; shows range, minimum & maximum, gaps & clusters, and outliers easily; exact values retained.
Disadvantages: Not as visually appealing; best for under 50 data values; needs small range of data.


Pie chart
A pie chart displays data as a percentage of the whole. Each pie section should have a label and percentage. A total data number should be included.
Advantages: Visually appealing; shows percent of total for each category.
Disadvantages: No exact numerical data; hard to compare 2 data sets; "Other" category can be a problem; total unknown unless specified; best for 3 to 7 categories; use only with discrete data.

Histogram
A histogram displays continuous data in ordered columns. Categories are of continuous measure such as time, inches, temperature, etc.
Advantages: Visually strong; can compare to normal curve; usually vertical axis is a frequency count of items falling into each category.
Disadvantages: Cannot read exact values because data is grouped into categories; more difficult to compare two data sets; use only with continuous data.

Bar graph
A bar graph displays discrete data in separate columns. A double bar graph can be used to compare two data sets. Categories are considered unordered and can be rearranged alphabetically, by size, etc.
Advantages: Visually strong; can easily compare two or three data sets.
Disadvantages: Graph categories can be reordered to emphasize certain effects; use only with discrete data.

Line graph
A line graph plots continuous data as points and then joins them with a line. Multiple data sets can be graphed together, but a key must be used.
Advantages: Can compare multiple continuous data sets easily; interim data can be inferred from graph line.
Disadvantages: Use only with continuous data.

Scatterplot
A scatterplot displays the relationship between two factors of the experiment. A trend line is used to determine positive, negative, or no correlation.
Advantages: Shows a trend in the data relationship; retains exact data values and sample size; shows minimum/maximum and outliers.
Disadvantages: Hard to visualize results in large data sets; flat trend line gives inconclusive results; data on both axes should be continuous.

Stem and Leaf Plot
Stem and leaf plots record data values in rows, and can easily be made into a histogram. Large data sets can be accommodated by splitting stems.
Advantages: Concise representation of data; shows range, minimum & maximum, gaps & clusters, and outliers easily; can handle extremely large data sets.
Disadvantages: Not visually appealing; does not easily indicate measures of centrality for large data sets.

Box plot
A boxplot is a concise graph showing the five-number summary. Multiple boxplots can be drawn side by side to compare more than one data set. More about boxplots below.
Advantages: Shows 5-number summary and outliers; easily compares two or more data sets; handles extremely large data sets easily.
Disadvantages: Not as visually appealing as other graphs; exact values not retained.

Box Plots

A box plot is a graph that is useful for very large data sets that are too unwieldy for a stem and leaf or line plot. A box plot summarizes the data to only five numbers: the median, upper and lower quartiles, and minimum and maximum values. It provides a quick visual summary that easily shows center, spread, range and any outliers.

When we want to compare two or more sets of data, we make side-by-side boxplots. This statistical graph is very efficient in comparing center and spread of two or more data sets. We can immediately visualize the ranges, medians, and "shapes" of each data set.

5-Number Summary

The median is found by listing the data values in increasing order, and finding the center value. If there is an even number of data values, find the average of the two center values. This number forms the interior line of the box.

The lower quartile (Q1) is found by considering only the bottom half of the data, below the median. Find the median, or middle value, of this part of the data. The lower quartile number forms the bottom line of the box.

The upper quartile (Q3) is the median of the upper half of the data, above the median. The upper quartile number forms the top line of the box. Connect these three lines to form the sides of the box.

The minimum and maximum values can be read right off the list of data values. If there are no outliers, these numbers form the ends of the whiskers of the graph, and are connected to the upper and lower quartile lines.

Outliers

Sometimes a data set will have one or more outliers. An outlier can be detected by finding the value of 1.5*(IQR), then subtracting this number from the lower quartile, and adding it to the upper quartile. This is the maximum range of the whiskers of the graph, a theoretical "fence" of the range data. Any data values falling outside this "fence" are considered outliers. They are labeled on the graph with an asterisk. There can be outliers above or below, or multiple outliers in a data set.
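The 5-number summary and the 1.5*(IQR) fence described above can be sketched in Python as follows. This is a minimal illustration under stated assumptions: the function names are hypothetical, the ten data values are made up for the example, and the quartiles follow the halving method from this reading (the median of each half, leaving out the middle value when the count is odd). Calculators and software sometimes use slightly different quartile conventions.

```python
from statistics import median

def five_number_summary(values):
    """Return (minimum, Q1, median, Q3, maximum) using the halving method."""
    data = sorted(values)
    half = len(data) // 2
    lower_half = data[:half]              # values below the median position
    upper_half = data[len(data) - half:]  # values above the median position
    return data[0], median(lower_half), median(data), median(upper_half), data[-1]

def find_outliers(values):
    """Return the values lying outside the 1.5 * IQR fences."""
    _, q1, _, q3, _ = five_number_summary(values)
    fence = 1.5 * (q3 - q1)
    low_fence, high_fence = q1 - fence, q3 + fence
    return [v for v in values if v < low_fence or v > high_fence]

# Hypothetical data set with a range of 50 (low value 12, high value 62).
data = [12, 15, 18, 20, 22, 24, 25, 27, 30, 62]

print(five_number_summary(data))
print(find_outliers(data))
```

Here Q1 = 18 and Q3 = 27, so the IQR is 9 and the fences sit at 4.5 and 40.5. The value 62 falls outside the upper fence and would be marked as an outlier, while the upper whisker would stop at 30.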
