Math 225
Introduction to Biostatistics
Linear Regression
Assignment 10
Prerequisites
This lab assumes that you already know how to:
- Log in, find the course Web page, and run S-PLUS
- Use the Commands Window to execute commands
- Load data sets
Technical Objectives
This lab will teach you to:
- construct a regression line by hand from summary statistics
- use S-PLUS to plot bivariate data with a scatterplot
- use S-PLUS to find a least-squares line
- use S-PLUS to fit a regression line
- use S-PLUS to make residual plots
Conceptual Objectives
In this lab you should begin to understand how to:
- interpret residual plots
to determine whether fitting a linear model is justified
- test whether the true regression slope is 0
- find and correctly interpret
a confidence interval for the true regression slope
Correlation
The correlation coefficient, r,
is a measure of the strength of the linear relationship
between two continuous variables.
A more intuitive formula for r
than those that appear in the textbook is as follows,
where xbar and s_x are the mean and sample standard deviation of the x variable
and ybar and s_y are the mean and standard deviation of the y variable.
r = sum( ((x - xbar)/s_x) * ((y - ybar)/s_y) ) / (n-1)
Notice that each individual measurement is standardized to its z-score;
subtract the mean and divide by the standard deviation.
The correlation coefficient r is almost the average of the product of the z-scores
for the x and y measurements of each data point.
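If you want to check this formula numerically in the Commands Window, a minimal sketch is below.
It assumes two numeric vectors x and y of equal length;
these names are placeholders, not part of any course data set.
> zx <- (x - mean(x)) / sqrt(var(x))    # z-scores for the x measurements
> zy <- (y - mean(y)) / sqrt(var(y))    # z-scores for the y measurements
> sum(zx * zy) / (length(x) - 1)        # r computed from the formula above
> cor(x, y)                             # built-in correlation; should agree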
Association
We say that two variables x and y are positively associated
if when x is large relative to its mean, y also tends to be large,
and when x is small relative to its mean, y also tends to be small.
Two variables are negatively associated
when big x tends to go with small y and vice versa.
A scatter plot of positively associated variables
will have many points in the upper right and lower left.
A scatter plot of negatively associated variables
will have many points in the lower right and upper left.
Data that is positively associated will have a positive correlation coefficient
and data that is negatively associated will have a negative correlation coefficient.
This follows because products of corresponding z-scores will tend to be positive
for positively associated data (+ times + or - times -)
and negative for negatively associated data (+ times - or - times +).
Correlation coefficient facts
- -1 <= r <= 1
- r = -1 if and only if the data lie exactly on a line with a negative slope.
- r = 1 if and only if the data lie exactly on a line with a positive slope.
- The correlation coefficient measures only the strength of the linear relationship.
The last point requires some elaboration.
A correlation coefficient near 0 does not by itself imply that two variables are unrelated.
It could be the case that there is a strong nonlinear relationship between the variables.
Furthermore, a correlation coefficient near -1 or 1 does not, by itself,
imply that a linear relationship is the most appropriate description.
A nonlinear curve could be a better description of the relationship.
The value of r^2 is often used as a summary statistic as well.
It may be interpreted as the proportion of the variability
in y that is explained by the regression.
When r=-1, or r=1,
r^2=1 and all of the variability in y is explained by x.
In other words, there is no variability around the regression line.
A large r^2 value implies that the data is more tightly clustered around
a line with some nonzero slope than around the horizontal line y = ybar.
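A rough numerical check of this interpretation can also be done in the Commands Window.
The sketch below again assumes placeholder vectors x and y;
it compares r^2 with one minus the ratio of the residual variability to the total variability in y.
> fit <- lm(y ~ x)                                   # least squares fit
> 1 - sum(residuals(fit)^2) / sum((y - mean(y))^2)   # proportion of variability explained
> cor(x, y)^2                                        # should agree with the line above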
Residuals
We will draw the response variable (dependent variable)
on the y axis and the explanatory variable (independent variable)
on the x axis.
For any line drawn through the points,
the vertical distance from a point to the line is called a residual.
Residuals are positive for points above the line
and negative for points below the line.
The Criterion of Least Squares
Lines which are good descriptions of the relationship between two variables
will tend to have small residuals
whereas lines that give poor fits have some large residuals (in absolute value).
One particular criterion for choosing a ``best line''
is to make the sum of all of the squared residuals as small as possible.
This line is called the least squares line.
Notice that because residuals are measured vertically,
it matters quite a bit which variable is designated x and which is y.
The least squares line is
yhat = a + b x
where a is the intercept and b is the slope.
The slope and intercept are determined by the mean and standard deviation of x and y
and the correlation coefficient.
b = r * (s_y / s_x)
a = ybar - b * xbar
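For example, the slope and intercept can be computed by hand from the summary statistics
and compared with the line S-PLUS finds.
The sketch below again uses placeholder vectors x and y.
> b <- cor(x, y) * sqrt(var(y)) / sqrt(var(x))   # slope: r times s_y over s_x
> a <- mean(y) - b * mean(x)                     # intercept
> coef(lm(y ~ x))                                # least squares line from S-PLUS; should match a and b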
Simple Linear Regression
Simple linear regression uses the least squares line
to predict the value of y for each x.
You may think of the line as representing the average y value for all individuals
in a population with a given x value.
The regression line is an estimate of the ``true'' line
based on sample data.
Note that the method of least squares will always provide a line,
even when a nonlinear curve would be more appropriate.
The correlation coefficient alone cannot be used as a measure
of the appropriateness of a linear fit.
It is necessary to plot the data to ascertain whether a linear fit is appropriate.
Regression Diagnostics
It is often the case that the relationship between two quantitative variables
should not be summarized with a straight line.
It may be the case that some nonlinear relationship is better.
In addition, the methods of statistical inference in a regression framework
assume that the variance is constant for different values of the explanatory x variable.
Often, a plot of the data itself makes it clear
when a nonlinear fit is more appropriate,
or when nonconstant variance is a potential problem.
However, a plot of residuals versus the fitted values (or the original x values)
makes it easier to see these potential problems.
A curved pattern in a residual plot indicates nonlinearity.
For example, if the residuals tend to be positive for low x values,
negative for middle x values,
and positive again for high x values,
this indicates that the data is scattered around a curve
that is concave up.
When the size of the residuals tends to increase with the size of the fitted values,
this indicates that the variance is related to the explanatory variable.
A common solution is to transform the response variable.
Perhaps log(y) has a linear relationship with x, with variance
that is more nearly constant as x changes.
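A residual plot of this kind can be made directly from the Commands Window;
a minimal sketch, again with placeholder vectors x and y, is below.
(Later in this lab the same plot is produced through the menus.)
> fit <- lm(y ~ x)                       # least squares fit
> plot(fitted(fit), residuals(fit))      # residuals versus fitted values
> abline(h=0)                            # horizontal reference line at zero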
Forced expiratory volume (FEV) is a measure of lung function.
An individual takes a full breath and then blows as much air as possible
into the measuring instrument.
Larger measures are better, other factors being equal.
FEV may be affected by several variables.
For this lab, we will examine the effect of height on FEV
from a sample of children of various ages.
Better analyses would use additional information,
such as the ages, sex, and smoking status of the children.
- Load in the FEV data set.
In the data set, FEV (fev) is measured in liters
and height (ht) is measured in inches.
Attach the data
by typing
> attach(fevdata)
in the Commands Window.
- Make a plot with ht on the x axis and fev on the y axis.
> plot(ht,fev)
Does it look like a linear relationship is adequate,
or is a nonlinear relationship better?
(There is a noticeable curve, and the spread of the points around the line increases
as the height increases.)
- When the spread increases as the x variable increases,
taking logarithms of the response variable often corrects this.
Make a plot of log(fev) versus ht.
> plot(ht,log(fev))
Does it look like a linear relationship is adequate?
(Yes! The relationship looks fairly linear
and the spread around the line is similar as the height increases.)
- Use S-PLUS to find the least squares regression line
that predicts log of FEV from height.
- Use your mouse to select Statistics:Regression:Linear....
- In the Formula box type
log(fev) ~ ht
This means ``log(fev) is modeled as a function of height''.
An intercept is included by default.
- Click on the Plots tab.
- Click on the two plots Residuals versus Fitted Values
and Response versus Fitted Values.
- Click on OK.
- Read the Report Window and look at the graphs.
On the residual plot,
you should see that the points are spread around the line y=0
without pattern
and that the size of the spread does not change drastically with x.
The plot on page 2 is a scatter plot of the response values
versus the fitted values.
This plot is identical to the original scatter plot of the data
except that the horizontal axis has been rescaled and relabeled
and the regression line is added to the plot.
In the Report Window,
there will be a table labeled "Coefficients" with the fitted parameter values.
The column headed "Value" has the slope and intercept of the regression line.
These are statistics that can be used to estimate the "true"
population slope and intercept
assuming that the data is a random sample from the population.
The column headed "Std. Error" has the estimated standard errors
of the estimated coefficients.
These could be used to construct confidence intervals
of the unknown "true" parameter values.
The appropriate t multiplier has n-2 degrees of freedom,
where n is the number of points used in the regression,
and would be found with the command qt(0.975,n-2)
for a 95% confidence interval, for example.
The column headed "t value" is the t statistic
of the hypothesis test that tests if the true parameter value is 0.
The column headed "Pr(>|t|)" is the two-sided p-value
of the hypothesis test.
Note that these inferences only make sense if a linear fit is appropriate
and if the variance does not depend too strongly on the value of x.
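The same fit and the confidence interval calculation can also be sketched from the Commands Window.
This assumes the FEV data are still attached so that fev and ht are available;
the names fev.fit, ctab, b, se, and n are scratch names chosen here.
> fev.fit <- lm(log(fev) ~ ht)           # same model as the dialog fits
> summary(fev.fit)                       # coefficient table, as in the Report Window
> ctab <- summary(fev.fit)$coef          # matrix of estimates and standard errors
> b <- ctab["ht", 1]                     # estimated slope (Value column)
> se <- ctab["ht", 2]                    # its standard error (Std. Error column)
> n <- length(ht)                        # number of children in the sample
> b + c(-1, 1) * qt(0.975, n - 2) * se   # 95% confidence interval for the true slope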
Homework Assignment
Load in the data from Exercise 10-4
from your textbook.
This data represents systolic blood pressure
of thirty women with ages ranging from 40 to 85.
Is there a relationship between age and blood pressure in middle aged
and older women?
- Use S-PLUS to fit a simple linear regression model of the form
(systolic blood pressure) = (intercept) + (slope) * (age)
where (intercept) and (slope) are numbers.
What are the units of measurement of the intercept and slope?
- Find a 95% confidence interval of the slope of the regression line
in the previous problem.
(Use the estimate and SE from the Report Window.
Use the function qt(0.975,df) to find the correct multiplier
where df is equal to sample size - 2.)
- Formally test the hypothesis that beta1 = 0
versus the alternative that beta1
is not 0 where beta1 is the population slope.
(Hint: Look at the t-value and the p-value in the Report Window.)
Last modified: December 1, 2000
Bret Larget,
larget@mathcs.duq.edu