To understand regression, we first need to understand how to measure the relationship between two quantitative variables. The correlation coefficient r is one such measure. It is defined by
r = sum( ((x - xbar)/s_x) * ((y - ybar)/s_y) ) / (n-1)
where xbar and s_x are the sample mean and standard deviation of the x variable and ybar and s_y are the sample mean and standard deviation of the y variable.
Notice that each individual measurement is standardized to its z-score: subtract the mean and divide by the standard deviation. The correlation coefficient r is almost the average of the products of the z-scores for the x and y measurements of each data point ("almost" because the sum is divided by n-1 rather than n).
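The formula for r can be checked directly on a small data set. The following Python sketch uses made-up numbers (they are not from these notes), and the function name correlation is only an illustrative choice.

```python
from statistics import mean, stdev

def correlation(xs, ys):
    """Sample correlation r: sum of products of z-scores, divided by n - 1."""
    n = len(xs)
    xbar, ybar = mean(xs), mean(ys)
    s_x, s_y = stdev(xs), stdev(ys)  # sample standard deviations (n - 1 divisor)
    return sum(((x - xbar) / s_x) * ((y - ybar) / s_y)
               for x, y in zip(xs, ys)) / (n - 1)

# Made-up data, for illustration only.
xs = [1, 2, 3, 4, 5]
ys = [2, 4, 5, 4, 5]
r = correlation(xs, ys)
print(round(r, 3))
```

Each term in the sum is the product of one point's two z-scores, exactly as in the formula above.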
This implies a couple of things. First, because z-scores are unitless, the correlation coefficient is unitless as well. If one variable is length, it does not matter if length is measured in inches, feet, centimeters, or kilometers - r would not change. Second, terms will be positive when either both z-scores are positive or both z-scores are negative. Terms will be negative when one z-score is positive and the other is negative. The sign of the correlation coefficient therefore gives the direction of the association between the two variables, and its magnitude measures the strength of the linear association.
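Both consequences can be seen numerically. In this sketch the lengths and weights are made-up values, and correlation is a hypothetical helper that implements the formula for r given earlier.

```python
from statistics import mean, stdev

def correlation(xs, ys):
    """Sample correlation r from z-scores (n - 1 divisor)."""
    xbar, ybar = mean(xs), mean(ys)
    s_x, s_y = stdev(xs), stdev(ys)
    return sum(((x - xbar) / s_x) * ((y - ybar) / s_y)
               for x, y in zip(xs, ys)) / (len(xs) - 1)

length_in = [10.0, 12.5, 15.0, 18.0, 21.5]   # made-up lengths in inches
weight = [3.1, 3.9, 5.2, 6.0, 7.4]           # made-up second variable
length_cm = [2.54 * v for v in length_in]    # same lengths in centimeters

r_in = correlation(length_in, weight)
r_cm = correlation(length_cm, weight)        # changing units leaves r unchanged
r_neg = correlation(length_in, [-w for w in weight])  # reversing direction flips the sign

print(r_in, r_cm, r_neg)
```

The first two values agree up to floating-point rounding, and the third has the same magnitude with the opposite sign.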
The last point requires some elaboration. A correlation coefficient near 0 does not by itself imply that two variables are unrelated. It could be the case that there is a strong nonlinear relationship between the variables. Furthermore, a correlation coefficient near -1 or 1 does not, by itself, imply that a linear relationship is the most appropriate description. A nonlinear curve could be a better description of the relationship.
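Two small constructed examples illustrate both cautions. The data below are artificial (y is exactly x squared in both cases), and correlation is a hypothetical helper implementing the formula for r.

```python
from statistics import mean, stdev

def correlation(xs, ys):
    """Sample correlation r from z-scores (n - 1 divisor)."""
    xbar, ybar = mean(xs), mean(ys)
    s_x, s_y = stdev(xs), stdev(ys)
    return sum(((x - xbar) / s_x) * ((y - ybar) / s_y)
               for x, y in zip(xs, ys)) / (len(xs) - 1)

# Case 1: y is completely determined by x, yet r is essentially 0,
# because the x values are symmetric around 0.
xs_sym = list(range(-3, 4))
r_sym = correlation(xs_sym, [x ** 2 for x in xs_sym])

# Case 2: the same curve over x = 1..10 gives r close to 1,
# even though the relationship is not a straight line.
xs_pos = list(range(1, 11))
r_pos = correlation(xs_pos, [x ** 2 for x in xs_pos])

print(r_sym, r_pos)
```

In both cases a plot of the data, not the value of r alone, reveals the curve.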
The value of r^2 is often used as a summary statistic as well. It may be interpreted as the proportion of the variability in y that is explained by the regression. When r=-1 or r=1, r^2=1 and all of the variability in y is explained by the linear relationship with x. In other words, there is no variability around the regression line. A large r^2 value implies that the data is more tightly clustered around a line with some nonzero slope than around the horizontal line at y = ybar.
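The "proportion explained" interpretation can be verified by comparing squared deviations around the regression line with squared deviations around y = ybar. This sketch uses made-up data and the slope and intercept formulas b1 = r * (s_y / s_x) and b0 = ybar - b1 * xbar from these notes. The names sse and sst are illustrative.

```python
from statistics import mean, stdev

# Made-up data, for illustration only.
xs = [1, 2, 3, 4, 5]
ys = [2, 4, 5, 4, 5]
n = len(xs)
xbar, ybar = mean(xs), mean(ys)
s_x, s_y = stdev(xs), stdev(ys)
r = sum(((x - xbar) / s_x) * ((y - ybar) / s_y)
        for x, y in zip(xs, ys)) / (n - 1)

b1 = r * s_y / s_x        # least squares slope
b0 = ybar - b1 * xbar     # least squares intercept

# Squared variability left over around the regression line.
sse = sum((y - (b0 + b1 * x)) ** 2 for x, y in zip(xs, ys))
# Squared variability around the horizontal line y = ybar.
sst = sum((y - ybar) ** 2 for y in ys)

explained = 1 - sse / sst
print(r ** 2, explained)
```

The two printed values agree up to rounding: the fraction of variability removed by using the line instead of ybar is r^2.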
The least squares line is
yhat = b0 + b1 x
where b0 is the intercept and b1 is the slope. The slope and intercept are determined by the means and standard deviations of x and y and the correlation coefficient:
b1 = r * (s_y / s_x)
b0 = ybar - b1 * xbar
Suppose x is z standard deviations from its mean, so that x = xbar + z * s_x. Then
(predicted y) = b0 + b1 * x = (ybar - b1 * xbar) + b1 * (xbar + z * s_x) = ybar + b1 * z * s_x = ybar + (r * s_y / s_x) * z * s_x = ybar + r * z * s_y
Notice this says that if x is z standard deviations from its mean, the predicted y will be rz standard deviations from its mean.
In particular, the predicted value when x is xbar is ybar.
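The derivation can be checked with numbers. The summary statistics below are hypothetical round values chosen for easy arithmetic, and line_from_summary is an illustrative name, not something defined in these notes.

```python
def line_from_summary(xbar, s_x, ybar, s_y, r):
    """Return (b0, b1) of the least squares line from five summary statistics."""
    b1 = r * s_y / s_x
    b0 = ybar - b1 * xbar
    return b0, b1

# Hypothetical summary statistics.
xbar, s_x = 50.0, 10.0
ybar, s_y = 100.0, 20.0
r = 0.6

b0, b1 = line_from_summary(xbar, s_x, ybar, s_y, r)

z = 2.0                                   # x is 2 standard deviations above its mean
pred_from_line = b0 + b1 * (xbar + z * s_x)
pred_from_z = ybar + r * z * s_y          # r*z standard deviations above ybar
pred_at_mean = b0 + b1 * xbar             # the z = 0 case

print(pred_from_line, pred_from_z, pred_at_mean)
```

Because |r| is at most 1, the predicted y is never more standard deviations from its mean than x is from its own.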
A residual is the difference between an observed y value and the value predicted by the regression line. Systematic patterns in a residual plot indicate nonlinearity. For example, if the residuals tend to be positive for low x values, negative for middle x values, and positive again for high x values, this indicates that the data is scattered around a curve that is concave up.
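The concave-up pattern is easy to reproduce. This sketch fits a line to artificial data lying exactly on the curve y = x^2. The slope is computed from sums of deviations, which is algebraically the same as r * (s_y / s_x).

```python
from statistics import mean

# Artificial data on a concave-up curve.
xs = [0, 1, 2, 3, 4, 5, 6]
ys = [x ** 2 for x in xs]

xbar, ybar = mean(xs), mean(ys)
# Least squares slope and intercept.
b1 = (sum((x - xbar) * (y - ybar) for x, y in zip(xs, ys))
      / sum((x - xbar) ** 2 for x in xs))
b0 = ybar - b1 * xbar

# Residual = observed y minus predicted y.
residuals = [y - (b0 + b1 * x) for x, y in zip(xs, ys)]
for x, e in zip(xs, residuals):
    print(x, e)
```

The residuals are positive at both ends of the x range and negative in the middle, which is the pattern described above.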
When the size of the residuals tends to increase with the size of the fitted values, this indicates that the variance of y is not constant but changes with the explanatory variable. A common solution is to transform the response variable. Perhaps log(y) has a linear relationship with x with variance that is more nearly constant as x changes.
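As a rough illustration, the sketch below builds artificial data that grow exponentially with scatter proportional to the size of y, then fits a line to y and to log(y). The data, the growth rate 1.5, and the helper name fit_line are all assumptions made for this example.

```python
import math
from statistics import mean

def fit_line(xs, ys):
    """Return least squares intercept, slope, and correlation for (xs, ys)."""
    xbar, ybar = mean(xs), mean(ys)
    sxx = sum((x - xbar) ** 2 for x in xs)
    syy = sum((y - ybar) ** 2 for y in ys)
    sxy = sum((x - xbar) * (y - ybar) for x, y in zip(xs, ys))
    b1 = sxy / sxx
    return ybar - b1 * xbar, b1, sxy / math.sqrt(sxx * syy)

xs = list(range(10))
# Exponential growth; each point is alternately a factor of 1.2 high or low,
# so the scatter is proportional to the size of y.
ys = [2 * 1.5 ** x * (1.2 if x % 2 == 0 else 1 / 1.2) for x in xs]
log_ys = [math.log(y) for y in ys]

b0_raw, b1_raw, r_raw = fit_line(xs, ys)
b0_log, b1_log, r_log = fit_line(xs, log_ys)

# Absolute residuals on each scale.
raw_sizes = [abs(y - (b0_raw + b1_raw * x)) for x, y in zip(xs, ys)]
log_sizes = [abs(y - (b0_log + b1_log * x)) for x, y in zip(xs, log_ys)]

print("raw:", [round(s, 2) for s in raw_sizes])
print("log:", [round(s, 2) for s in log_sizes])
```

On the log scale the residuals are of similar size across the whole range of x, and the fitted slope is close to log(1.5). On the raw scale the residual sizes vary much more, reflecting both the curvature and the growing scatter.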
In class, we examined my son Riley's growth over time. This data is here.
A plot of the data shows that fitting a single straight line to all of the data is unwarranted. For the first year and a half of his life, his height increases rapidly. At some point between eighteen months and two years, there is a noticeable change in his growth rate. But from age 2 to age 8, the data appears to be fairly linear.
If we restrict attention to ages 24 months and older, we have this summary data. x is Riley's age in months, y is his height in inches, and n is the number of recorded measurements.
n=16, xbar = 61.6 months, sx = 21.9 months, ybar = 45.7 inches, sy = 5.5 inches, and r = 0.999. If we plug in these values, we find the least squares regression line:
b1 = r * sy / sx = (0.999)(5.5)/(21.9) = 0.25 inches per month.
b0 = ybar - b1 * xbar = 45.7 - 0.25*61.6 = 30.3 inches (using the rounded slope).
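The same arithmetic in Python, using the summary statistics given above. The notes round the slope to 0.25 before computing the intercept. The sketch shows that version and, for comparison, the intercept from the unrounded slope, which differs slightly. Variable names are illustrative.

```python
# Summary statistics for Riley's data (ages 24 months and older), from the notes.
n = 16
xbar, s_x = 61.6, 21.9    # age in months
ybar, s_y = 45.7, 5.5     # height in inches
r = 0.999

b1 = r * s_y / s_x                 # slope, inches per month
b1_rounded = round(b1, 2)          # the notes carry the slope to two decimals

b0_rounded_slope = ybar - b1_rounded * xbar   # intercept as computed in the notes
b0_full = ybar - b1 * xbar                    # intercept without intermediate rounding

print(b1, b1_rounded)
print(b0_rounded_slope, b0_full)
```

The small gap between the two intercepts is a rounding effect only. It does not change the interpretation of the line.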
The slope is generally the more important quantity. It has units (y units per x unit). In this case, we can say that Riley has been growing about a quarter inch per month, or about three inches per year.
The intercept can be interpreted as the predicted value when x = 0. This interpretation is only valid when 0 is in the range of the measured data. In this problem, 0 is a meaningful x value (the time of birth), but it lies far below the ages used to fit the line (24 months and older), and the prediction is ridiculous - no babies are over thirty inches long at birth. The use of a regression line well outside its range of applicability is called extrapolation.
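One way to guard against extrapolation is to check a requested x value against the range of the data before trusting the prediction. The sketch below uses the fitted line from these notes. The range of 24 to 96 months is an assumption based on the description "from age 2 to age 8", since the exact largest recorded age is not given here, and predict_height is a hypothetical helper.

```python
B0, B1 = 30.3, 0.25       # intercept (inches) and slope (inches per month) from the notes
X_MIN, X_MAX = 24, 96     # assumed range of ages (months) used to fit the line

def predict_height(age_months):
    """Return (predicted height in inches, True if this is an extrapolation)."""
    height = B0 + B1 * age_months
    is_extrapolation = not (X_MIN <= age_months <= X_MAX)
    return height, is_extrapolation

for age in (60, 0):
    height, extrapolated = predict_height(age)
    note = "extrapolation, do not trust" if extrapolated else "within the data range"
    print(f"age {age} months: {height:.1f} inches ({note})")
```

The prediction at 60 months is flagged as inside the range, while the prediction at birth is flagged as an extrapolation.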
You are responsible for the concepts presented above. You should know how to find a least squares regression line from the five summary statistics. You should be able to make a prediction with a regression line and to know when the prediction is invalid because of extrapolation.
Bret Larget, larget@mathcs.duq.edu