Scatterplot: The Bridge from Correlation to Regression
We have already seen how a histogram is a useful technique for graphing the distribution of one variable. Here is the histogram depicting the distribution of the Age Category (agecat) variable in the voter.xlsx file we have used in class. Now let's look at histograms for the rates of the violent crimes of forcible rape and robbery in the United States in 2012.
These histograms show us the distribution of each of the 2 crime rates individually. We can see that the distribution of forcible rape rates is positively skewed; the distribution of robbery rates is also positively skewed, although less so than that of forcible rape. But these histograms do not allow us to see how the 2 violent crime rates may be related to each other. We don't know what happens to the robbery rate when the forcible rape rate increases or decreases. Because it is important to know how (or if) 2 or more variables are related to each other, we need to go beyond simply describing their distributions in samples or populations. We need techniques that capture, in numbers and in graphs, how variables are related to each other. We have previously covered one such technique: correlation.
Simple Linear Regression
With correlation, we have seen that the strength and direction of association between two variables can be summarized in one number: the correlation coefficient. (See the slide show "Correlation" to refresh your understanding of this procedure.) Often, we would like to go beyond such description to make predictions about what the values of one variable are likely to be if we know the values of one or more related variables. Regression procedures allow us to make such predictions.
Let us clarify what we mean by predictions. We do not mean some psychic ability to foresee some future event based on readings of an individual's palms or his/her corona. Nor do we mean an ability to forecast changes in weather on the basis of someone's aching bones or some other physical ailment or quirk. We definitely do not refer to a person's intuition, feelings, or some other supernatural capability. Rather, our use of the term prediction means deriving acceptably accurate estimates of the value of one variable on the basis of its known relationship with one or more other variables.
Many sets of variables have linear relationships, which can be graphed using the standard X and Y coordinates of a two-dimensional grid. To refresh your memory of two-dimensional graphing, the next slide reviews the basic structure of such graphs.
Values of Y increase as we move up the Y axis from the origin to the top of the axis. Values of X increase as we move across the X axis from the origin, left to right.
The graph used in regression is called a scatterplot. A scatterplot has a horizontal axis, designated by the letter X, and a vertical axis, designated by the letter Y. Each point in the scatterplot is represented by its X value (that is, its value on the horizontal axis) and its corresponding Y value (its value on the vertical axis). For example, the numerical pair (1, 2) indicates that the X value is 1 and the Y value is 2. On the next slide, we will see how this point is plotted on a scatterplot.
This is a very simple example of a scatterplot. Note the location of the point above the value 1.00 on the x (horizontal) axis and at the point 2.00 on the y (vertical) axis. Now let us add a second point to this graph. This point will have coordinates x = 3 and y = 5. The scatterplot is on the next slide.
Here is the scatterplot of these 2 points, (1,2) and (3,5), and the straight line that connects them. Remember that 2 points determine a straight line and, as with all straight lines, this line can be described with the formula y = mx + b, where: y is the value of the y variable; m is the slope of the line; x is the value of the x variable; and b is the y-intercept, that is, the value of y at the point where the line crosses the y (vertical) axis, which is where x = 0.00. Before discussing the line in more detail, let us add a third point to the scatterplot. This point will have coordinates x = 1 and y = 4.
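As a quick check of the formula y = mx + b, the slope and y-intercept of the line through the two plotted points can be worked out directly. The short Python sketch below is only an illustration; the function name line_through is a hypothetical helper chosen for this example and has nothing to do with SPSS.

```python
def line_through(p1, p2):
    """Return (m, b) for the straight line y = m*x + b through two points."""
    (x1, y1), (x2, y2) = p1, p2
    if x1 == x2:
        # A vertical line cannot be written in the form y = mx + b.
        raise ValueError("the two points must have different x values")
    m = (y2 - y1) / (x2 - x1)  # slope: change in y divided by change in x
    b = y1 - m * x1            # y-intercept: the value of y when x = 0
    return m, b


# The two points from the slide: (1, 2) and (3, 5).
m, b = line_through((1, 2), (3, 5))
print(f"y = {m}x + {b}")
```

For the points (1, 2) and (3, 5), the slope is (5 - 2) / (3 - 1) = 1.5 and the y-intercept is 2 - 1.5 * 1 = 0.5, so the connecting line is y = 1.5x + 0.5.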
It should be obvious that this third point (1,4) will not lie on the same straight line as the other 2 points, (1,2) and (3,5). However, there is a line which can be drawn through this graph which will come close to all 3 points. The next slide illustrates this line.
This line appears to pass through only 1 of the points on this graph. However, of all the lines that we could draw through this graph, this is the one line that comes closest to all 3 points, in the sense that it minimizes the sum of the squared vertical distances between the points and the line. This line is called the line of least squares, or least squares line, and it will be very handy for making predictions about the dependent variable when we know its relationship with the independent variable. For now, let us consider the correlation between forcible rape and robbery rates.
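To make the idea of a least squares line concrete, the sketch below computes it for the 3 points used on these slides with the standard textbook formulas (slope = sum of cross-products divided by the sum of squares of x; intercept = mean of y minus slope times mean of x). It is a minimal illustration in Python, not the procedure SPSS runs internally, and the function name least_squares_line is a hypothetical choice for this example.

```python
def least_squares_line(points):
    """Return (slope, intercept) of the least squares line for (x, y) pairs."""
    n = len(points)
    mean_x = sum(x for x, _ in points) / n
    mean_y = sum(y for _, y in points) / n
    # Sum of cross-products of the deviations from the means.
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in points)
    # Sum of squared deviations of x from its mean.
    sxx = sum((x - mean_x) ** 2 for x, _ in points)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept


# The three points from the slides.
points = [(1, 2), (3, 5), (1, 4)]
slope, intercept = least_squares_line(points)
print(f"y = {slope:.2f}x + {intercept:.2f}")
```

For these 3 points the result is y = 1.00x + 2.00. That line passes through (3,5) and runs midway between (1,2) and (1,4), which is consistent with a line that touches only 1 of the 3 points.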
Here is an abbreviated correlation matrix showing the correlation between forcible rape and robbery rates for the United States in 2012 (compiled and computed in the FBI Uniform Crime Reports). The correlation coefficient of -.255 indicates a weak negative relationship between the 2 variables. In words, as the rate of robbery increases, the rate of forcible rape decreases slightly. Now, let's look at the graph (the scatterplot) for the correlation between these variables.
In this scatterplot, robbery rates are presented as the independent variable along the x (horizontal) axis and forcible rape rates are plotted as the dependent variable on the y (vertical) axis. It is not immediately evident if the least squares line passes through any of the points in the graph, but it is the one line that comes closest to all of the points. [The equation for the line is presented above the scatterplot; we will return to this equation in a subsequent slide.] Note that the line has a negative slope: as robbery rates increase along the x axis, forcible rape rates decrease on the y axis. On the next slide, we present the correlation matrix for these variables along with this scatterplot.
The correlation matrix shows a correlation coefficient of -.255; the scatterplot shows a least squares line with a negative slope. The sign of the slope always matches the sign of the correlation coefficient, so the scatterplot for a positive correlation will have a least squares line with a positive slope. Now let's take a closer look at the equation above the scatterplot.
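The link between the sign of the correlation coefficient and the sign of the slope can be seen from their formulas: both have the same numerator (the sum of cross-products of the deviations from the means), and both denominators are always positive. The Python sketch below checks this on the 3 illustrative points from the earlier slides, (1,2), (3,5) and (1,4), rather than on the crime data, which are not reproduced here. The function name correlation_and_slope is a hypothetical helper for this example.

```python
from math import sqrt


def correlation_and_slope(points):
    """Return (r, slope) for a list of (x, y) pairs."""
    n = len(points)
    mean_x = sum(x for x, _ in points) / n
    mean_y = sum(y for _, y in points) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in points)  # shared numerator
    sxx = sum((x - mean_x) ** 2 for x, _ in points)            # always positive
    syy = sum((y - mean_y) ** 2 for _, y in points)            # always positive
    r = sxy / sqrt(sxx * syy)  # Pearson correlation coefficient
    slope = sxy / sxx          # slope of the least squares line
    return r, slope


r, slope = correlation_and_slope([(1, 2), (3, 5), (1, 4)])
print(f"r = {r:.3f}, slope = {slope:.3f}")
```

Here both values are positive (r is about 0.756 and the slope is 1). With the crime data on these slides, both are negative: r = -.255 and the line slopes downward.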
Recall the equation for a straight line: y = mx + b. In research terms: 1) y is the value of the dependent variable; 2) m is the slope of the least squares line; 3) x is the value of the independent variable; 4) b is the y-intercept. SPSS presents the linear equation in a slightly different order, effectively y = b + mx. In other words, SPSS gives the y-intercept first, followed by the product of the slope and the value of the x (independent) variable. Here is the equation for this least squares line, with each term labeled underneath:
Forcible rape = 36.80 + (-0.07 * robbery)
Dependent variable = y-intercept + (slope * independent variable)
We've almost come full circle. The equation for the least squares line through a scatterplot depicting a linear relationship between two variables allows a researcher to predict values of the dependent variable from values of the independent variable. Let's see how this is done on the next slide.
Suppose we want to estimate the forcible rape rate given that we know the robbery rate. Further, suppose we know that a state's robbery rate is 100.00 per 100,000 inhabitants. With our equation, Forcible rape = 36.80 + (-0.07 * robbery), we only have to plug in 1 number (100.00). Solving the equation:
Forcible rape = 36.80 + (-0.07 * 100.00)
Forcible rape = 36.80 - 7.00
Forcible rape = 29.80
With these data and this linear relationship between these variables, we would predict that a state with a robbery rate of 100.00 per 100,000 inhabitants will have a forcible rape rate of 29.80 per 100,000 inhabitants. Now suppose that a state's robbery rate is 175.00 per 100,000 inhabitants. Solving this equation:
Forcible rape = 36.80 + (-0.07 * 175.00)
Forcible rape = 36.80 - 12.25
Forcible rape = 24.55
With this presentation, we can see how we use the technique of linear regression to predict values of one variable knowing values of an associated variable. This presentation is also intended to reinforce the importance of identifying INDEPENDENT and DEPENDENT variables. Remember: the INDEPENDENT variable is plotted along the horizontal (X) axis of a graph, while the DEPENDENT variable is plotted up and down the vertical (Y) axis of the graph.
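The same plug-in arithmetic can be written as a small function, which is convenient when predictions are needed for many robbery rates. This is a minimal Python sketch using the rounded intercept and slope from the slide; the constant and function names are hypothetical and chosen only for this illustration.

```python
INTERCEPT = 36.80  # y-intercept (b) from the equation on the slide
SLOPE = -0.07      # slope (m) from the equation on the slide


def predict_forcible_rape_rate(robbery_rate):
    """Predict the forcible rape rate (per 100,000) from the robbery rate."""
    return INTERCEPT + SLOPE * robbery_rate


# The two robbery rates worked through on the slide.
for rate in (100.00, 175.00):
    predicted = predict_forcible_rape_rate(rate)
    print(f"robbery = {rate:.2f} -> predicted forcible rape = {predicted:.2f}")
```

The two printed predictions agree with the hand calculations above: 29.80 and 24.55 per 100,000 inhabitants.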
The preceding slides showing the scatterplot for the relationship between the variables were prepared using an earlier version of SPSS. Newer versions of IBM SPSS Statistics produce the scatterplot, but may not show the least squares line through the data points on the graph, nor the equation for the line. However, we can still depict a scatterplot, and we can generate the linear equation using the Analyze > Regression > Linear command sequence. The following slides will demonstrate how to do these procedures.
First, here is the scatterplot for the relationship between robbery rates and forcible rape rates; the graph is constructed without the least squares line. In the next few slides, we will demonstrate the SPSS command sequence which generates this scatterplot. For now, note that the data points appear to be scattered loosely around a line with a negative slope. Let us see how this graph was generated.
Here is the command sequence to generate a scatterplot using Legacy Dialogs.
On this screen, we choose "Simple Scatter", then click the Define button.
Move the independent variable (in this illustration, "robbery") to the X-axis field; move the dependent variable (here, "Forcible rape") to the Y-axis field. Click Titles.
Enter a title for the scatterplot; graphs are typically labeled "Figure x." Click Continue.
Returning to the Simple Scatterplot screen, click OK.
Here is the simple scatterplot for the association between robbery rates (the independent variable) and forcible rape rates (the dependent variable). Now let's generate the equation for the least squares line through this scatterplot.
Here is the command sequence to begin the procedure.
Move the independent variable (robbery) to the Independent(s) field; move the dependent variable (Forcible rape) to the Dependent field. [This command sequence can also be used for multiple regression, in which more than one variable can be entered in the Independent(s) field; for our illustration, we will do a simple regression of one variable on a second variable.]
Most of the other buttons on this screen can be ignored since we are only doing a simple regression. Click OK.
Here is the Regression output; from this screen, we can generate the equation for the least squares line. Recall the linear equation: y = mx + b.
Of the output on the previous slide, the part we need to generate the linear equation is the table of Coefficients at the bottom of the screen. In this table: 1) The cell in the column headed B and the row headed (Constant) contains the y-intercept, or b in the equation y = mx + b; in this case, b = 36.805. 2) The cell in the same column in the row headed robbery contains the slope of the least squares line; in this case, m = -.069. The equation for the least squares line through this scatterplot is: y = -.069x + 36.805. Rounded to 2 decimal places, these are the same values (36.80 and -0.07) that appeared in the equation above the earlier scatterplot.
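Because the Coefficients table reports one more decimal place than the equation used in the earlier worked examples, predictions made from the two versions differ slightly. The Python sketch below compares them for a robbery rate of 100.00 per 100,000 inhabitants; it is only an illustration of the effect of rounding, and the function name predict is a hypothetical helper.

```python
def predict(intercept, slope, x):
    """Return the predicted y from the linear equation y = intercept + slope * x."""
    return intercept + slope * x


robbery_rate = 100.00

# Coefficients rounded to 2 decimal places, as in the earlier equation.
rounded = predict(36.80, -0.07, robbery_rate)

# Coefficients as read from the B column of the Coefficients table.
from_table = predict(36.805, -0.069, robbery_rate)

print(f"rounded coefficients: {rounded:.3f}")
print(f"table coefficients:   {from_table:.3f}")
print(f"difference:           {from_table - rounded:.3f}")
```

With the table values, the prediction is 36.805 - 6.9 = 29.905, compared with 29.80 from the rounded equation. The gap is small here, but it grows with the robbery rate, so it is worth carrying the extra decimal places when making predictions.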