We will be using this data to illustrate several concepts in multiple regression.
FEV = b0 + b1(height) + b2(age)
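A fit like this could be obtained in S-PLUS or R with commands along these lines (the data frame name fev.data and the response variable name fev are assumptions; ht and age are the names that appear in the output):

    fit1 <- lm(fev ~ ht + age, data = fev.data)  # regress FEV on height and age
    summary(fit1)                                # coefficient table as shown below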
Here is the output.
Coefficients:
               Value Std. Error  t value Pr(>|t|)
(Intercept) -4.6105      0.2243 -20.5576   0.0000
         ht   0.1097     0.0047  23.2628   0.0000
        age   0.0543     0.0091   5.9609   0.0000

Residual standard error: 0.4197 on 651 degrees of freedom
Multiple R-Squared: 0.7664

Look at a plot of the residuals versus fitted values.
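A plot like this could be produced with commands such as the following, using the hypothetical fit1 object from the sketch above:

    plot(fitted(fit1), resid(fit1))  # residuals against fitted values
    abline(h = 0)                    # horizontal reference line at zero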
Notice that there are patterns in the residual plot.
- There is a curve. This indicates that the relationship is nonlinear, so a straight-line fit in these variables is not adequate.
- The spread increases as the fitted values increase. This indicates heteroscedasticity (nonconstant spread).
Fit #2
Now try transforming the response variable by taking logarithms. This often helps when the spread increases with the fitted values (and sometimes gets rid of nonlinearity problems as well).
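Taking logarithms amounts to fitting log(FEV) = b0 + b1(height) + b2(age). A command along these lines would do it, again assuming the names fev and fev.data:

    fit2 <- lm(log(fev) ~ ht + age, data = fev.data)  # log-transformed response
    summary(fit2)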
Here is a summary of the fit.

Coefficients:
               Value Std. Error  t value Pr(>|t|)
(Intercept) -1.9711      0.0783 -25.1639   0.0000
         ht   0.0440     0.0016  26.7059   0.0000
        age   0.0198     0.0032   6.2305   0.0000

Residual standard error: 0.1466 on 651 degrees of freedom
Multiple R-Squared: 0.8071

Also, look at the residual plot.
Notice that there are no obvious patterns. The residuals have similar spread for all fitted values and there are no trends.
Fit #3
We could also see if we could do better by adding a quadratic term for age.
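One way to add the quadratic term, keeping the assumed names from before:

    fit3 <- lm(log(fev) ~ ht + age + I(age^2), data = fev.data)  # I() protects age^2 inside the formula
    summary(fit3)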
Here is a summary of the fit.

Coefficients:
               Value Std. Error  t value Pr(>|t|)
(Intercept) -1.9809      0.0801 -24.7360   0.0000
         ht   0.0435     0.0018  23.6722   0.0000
        age   0.0271     0.0128   2.1239   0.0341
   I(age^2)  -0.0003     0.0005  -0.5915   0.5544

Residual standard error: 0.1467 on 650 degrees of freedom
Multiple R-Squared: 0.8072

Notice that the p-value for the quadratic term is very large. This extra term did not add much to the quality of the fit.
Fit #4
We could also see if we could do better by adding an interaction term between age and height.
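The interaction term can be added with the : operator, with the same assumed names:

    fit4 <- lm(log(fev) ~ ht + age + ht:age, data = fev.data)  # ht:age is the height-by-age interaction
    summary(fit4)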
Here is a summary of the fit.

Coefficients:
               Value Std. Error  t value Pr(>|t|)
(Intercept) -1.9666      0.1878 -10.4733   0.0000
         ht   0.0439     0.0032  13.6514   0.0000
        age   0.0193     0.0207   0.9322   0.3516
     ht:age   0.0000     0.0003   0.0267   0.9787

Residual standard error: 0.1467 on 650 degrees of freedom
Multiple R-Squared: 0.8071

Notice that the interaction term does not add much of value.
Our best model was Fit #2.
Fit #5
We can also add categorical variables to the multiple regression. The variable sex can be represented by a numerical variable with one value for male and another for female. A similar thing holds for smoking status.
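If smoke and sex are already stored as numerical variables with two values each, a command along these lines would fit the model (names other than ht, age, smoke, and sex are still assumptions):

    fit5 <- lm(log(fev) ~ ht + age + smoke + sex, data = fev.data)
    summary(fit5)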
Here is a summary of the fit.

Coefficients:
               Value Std. Error  t value Pr(>|t|)
(Intercept) -1.9524      0.0807 -24.1811   0.0000
         ht   0.0428     0.0017  25.4893   0.0000
        age   0.0234     0.0033   6.9845   0.0000
      smoke  -0.0230     0.0105  -2.2031   0.0279
        sex   0.0147     0.0059   2.5020   0.0126

Residual standard error: 0.1455 on 649 degrees of freedom
Multiple R-Squared: 0.8106

All of these variables appear to be important because they have small p-values.
Last modified: April 5, 2001
Bret Larget, larget@mathcs.duq.edu