The Correlation Coefficient r
Besides looking at the scatter plot and seeing that a line seems reasonable, how can you tell if the line is a good predictor?
Use the correlation coefficient as another indicator (besides the scatter plot) of the strength of the relationship between x and y.
The correlation coefficient, r, developed by Karl Pearson in the early 1900s, is a numerical measure of the strength and direction of the linear association between the independent variable x and the dependent variable y.
What the VALUE of r tells us:
• The value of r is always between –1 and +1: –1 ≤ r ≤ 1.
• The magnitude (absolute value) of r indicates the strength of the linear relationship between x and y. Values of r close to –1 or to +1 indicate a stronger linear relationship between x and y.
• If r = 0, there is no linear correlation between x and y. A nonlinear relationship may still exist, since r measures only linear association.
• If r = 1, there is a perfect positive correlation. If r = –1, there is a perfect negative correlation. In both these cases, all of the original data points lie on a straight line. Of course, in the real world, this will not generally happen.
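The properties above can be checked numerically. The following is a minimal sketch, assuming the standard sample formula for Pearson's r (the sum of products of deviations divided by the square root of the product of the sums of squared deviations). The function name pearson_r and the three small data sets are illustrative and not taken from the text.

```python
import math

def pearson_r(xs, ys):
    """Sample correlation coefficient r for paired data (illustrative helper)."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need two equal-length samples with at least 2 points")
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Sums of products and squares of deviations from the means
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    if sxx == 0 or syy == 0:
        raise ValueError("r is undefined when x or y is constant")
    return sxy / math.sqrt(sxx * syy)

x = [1, 2, 3, 4, 5]
r_up = pearson_r(x, [3, 5, 7, 9, 11])    # points on the line y = 2x + 1
r_down = pearson_r(x, [9, 7, 5, 3, 1])   # points on the line y = -2x + 11
r_curve = pearson_r(x, [4, 1, 0, 1, 4])  # points on the curve y = (x - 3)^2

print(r_up, r_down, r_curve)
```

In this made-up data, the two sets that lie exactly on a line give r = 1 and r = –1, while the symmetric curve gives r = 0 even though y is completely determined by x. This illustrates that r measures linear association only.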
What the SIGN of r tells us:
• A positive value of r means that when x increases, y tends to increase, and when x decreases, y tends to decrease.
• A negative value of r means that when x increases, y tends to decrease, and when x decreases, y tends to increase.
• The sign of r is the same as the sign of the slope, b, of the best-fit line.
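To see why the sign of r matches the sign of the slope, note that both share the same numerator, the sum of products of deviations, and both have positive denominators. The sketch below assumes the usual least-squares slope formula and uses hypothetical data (hours studied, exam score, and points missed). The helper name slope_and_r is illustrative.

```python
import math

def slope_and_r(xs, ys):
    """Return (b, r): least-squares slope and correlation coefficient."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    b = sxy / sxx                      # slope of the best-fit line
    r = sxy / math.sqrt(sxx * syy)     # same numerator, positive denominator
    return b, r

# Hypothetical data, chosen only for illustration
hours = [1, 2, 3, 4, 5]
score = [52, 60, 61, 70, 75]           # tends to rise with hours
missed = [48, 40, 39, 30, 25]          # tends to fall with hours

b_pos, r_pos = slope_and_r(hours, score)
b_neg, r_neg = slope_and_r(hours, missed)

print(f"score:  b = {b_pos:.2f}, r = {r_pos:.3f}")
print(f"missed: b = {b_neg:.2f}, r = {r_neg:.3f}")
```

Because the numerator is shared, b and r are either both positive, both negative, or both zero.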
The Coefficient of Determination
The quantity r² is called the coefficient of determination. It is the square of the correlation coefficient, but it is usually stated as a percent rather than in decimal form. It has an interpretation in the context of the data:
• r², when expressed as a percent, represents the percent of variation in the dependent (predicted) variable y that can be explained by variation in the independent (explanatory) variable x using the regression (best-fit) line.
• 1 – r², when expressed as a percent, represents the percent of the variation in y that is NOT explained by variation in x using the regression line. This can be seen as the scattering of the observed data points about the regression line.
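The split into explained and unexplained variation can be verified directly. The following sketch assumes a least-squares line fitted to hypothetical data. It compares the squared correlation coefficient with the unexplained share, computed as the sum of squared residuals (SSE) divided by the total sum of squared deviations of y from its mean (SST). The variable names and the data are illustrative.

```python
import math

# Hypothetical data, chosen only for illustration
x = [1, 2, 3, 4, 5]
y = [52, 60, 61, 70, 75]

n = len(x)
mean_x = sum(x) / n
mean_y = sum(y) / n
sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
sxx = sum((xi - mean_x) ** 2 for xi in x)
syy = sum((yi - mean_y) ** 2 for yi in y)

# Least-squares line y-hat = a + b*x
b = sxy / sxx
a = mean_y - b * mean_x
predicted = [a + b * xi for xi in x]

r = sxy / math.sqrt(sxx * syy)
r_squared = r ** 2                      # explained share of variation in y

sse = sum((yi - pi) ** 2 for yi, pi in zip(y, predicted))  # scatter about the line
sst = syy                               # total variation in y about its mean
unexplained = sse / sst                 # equals 1 - r_squared for this fit

print(f"explained:   {100 * r_squared:.1f}%")
print(f"unexplained: {100 * unexplained:.1f}%")
```

For a least-squares line, the two printed percentages add up to 100%, which matches the interpretation of 1 – r squared as the scatter of the observed points about the regression line.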