Math 225
Introduction to Biostatistics
Linear Regression
Assignment 10
Prerequisites
This lab assumes that you already know how to:
- Log in, find the course Web page, and run S-PLUS
- Use the Commands Window to execute commands
- Load data sets
Technical Objectives
This lab will teach you to:
- construct a regression line by hand from summary statistics
- use S-PLUS to plot bivariate data with a scatterplot
- use S-PLUS to find a least-squares line
- use S-PLUS to fit a regression line
- use S-PLUS to make residual plots
Conceptual Objectives
In this lab you should begin to understand how to:
- interpret residual plots
to determine whether fitting a linear model is justified
- test whether the true regression slope is 0
- find and correctly interpret
a confidence interval for the true regression slope
Correlation
The correlation coefficient, r,
is a measure of the strength of the linear relationship
between two continuous variables.
A more intuitive formula for r
than those that appear in the textbook is as follows,
where xbar and s_x are the mean and sample standard deviation of the x variable
and ybar and s_y are the mean and standard deviation of the y variable.
r = sum( ((x - xbar)/s_x) * ((y - ybar)/s_y) ) / (n-1)
Notice that each individual measurement is standardized to its z-score;
subtract the mean and divide by the standard deviation.
The correlation coefficient r is almost the average of the product of the z-scores
for the x and y measurements of each data point.
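If you want to check this formula numerically in the Commands Window, a minimal sketch is below.
It assumes two numeric vectors x and y of equal length;
these names are placeholders, not part of any course data set.
> zx <- (x - mean(x)) / sqrt(var(x))    # z-scores for the x measurements
> zy <- (y - mean(y)) / sqrt(var(y))    # z-scores for the y measurements
> sum(zx * zy) / (length(x) - 1)        # r computed from the formula above
> cor(x, y)                             # built-in correlation; should agree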
Association
We say that two variables x and y are positively associated
if when x is large relative to its mean, y also tends to be large,
and when x is small relative to its mean, y also tends to be small.
Two variables are negatively associated
when big x tends to go with small y and vice versa.
A scatter plot of positively associated variables
will have many points in the upper right and lower left.
A scatter plot of negatively associated variables
will have many points in the lower right and upper left.
Data that is positively associated will have a positive correlation coefficient
and data that is negatively associated will have a negative correlation coefficient.
This follows because products of corresponding z-scores will tend to be positive
for positively associated data (+ times + or - times -)
and negative for negatively associated data (+ times - or - times +).
Correlation coefficient facts
- -1 <= r <= 1
- r = -1 if and only if the data lie exactly on a line with a negative slope.
- r = 1 if and only if the data lie exactly on a line with a positive slope.
- The correlation coefficient measures only the strength of the linear relationship.
The last point requires some elaboration.
A correlation coefficient near 0 does not by itself imply that two variables are unrelated.
It could be the case that there is a strong nonlinear relationship between the variables.
Furthermore, a correlation coefficient near -1 or 1 does not, by itself,
imply that a linear relationship is the most appropriate description.
A nonlinear curve could be a better description of the relationship.
The value of r^2 is often used as a summary statistic as well.
It may be interpreted as the proportion of the variability
in y that is explained by the regression.
When r=-1, or r=1,
r^2=1 and all of the variability in y is explained by x.
In other words, there is no variability around the regression line.
A large r^2 value implies that the data is more tightly clustered around
a line with some nonzero slope than around the horizontal line y = ybar.
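A rough numerical check of this interpretation can also be done in the Commands Window.
The sketch below again assumes placeholder vectors x and y;
it compares r^2 with one minus the ratio of the residual variability to the total variability in y.
> fit <- lm(y ~ x)                                   # least squares fit
> 1 - sum(residuals(fit)^2) / sum((y - mean(y))^2)   # proportion of variability explained
> cor(x, y)^2                                        # should agree with the line above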
Residuals
We will draw the response variable (dependent variable)
on the y axis and the explanatory variable (independent variable)
on the x axis.
For any line drawn through the points,
the vertical distance from a point to the line is called a residual.
Residuals are positive for points above the line
and negative for points below the line.
The Criterion of Least Squares
Lines which are good descriptions of the relationship between two variables
will tend to have small residuals
whereas lines that give poor fits have some large residuals (in absolute value).
One particular criterion for choosing a ``best line''
is to make the sum of all of the squared residuals as small as possible.
This line is called the least squares line.
Notice that because residuals are measured vertically,
it matters quite a bit which variable is designated x and which is y.
The least squares line is
yhat = a + b x
where a is the intercept and b is the slope.
The slope and intercept are determined by the mean and standard deviation of x and y
and the correlation coefficient.
b = r * (s_y / s_x)
a = ybar - b * xbar
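For example, the slope and intercept can be computed by hand from the summary statistics
and compared with the line S-PLUS finds.
The sketch below again uses placeholder vectors x and y.
> b <- cor(x, y) * sqrt(var(y)) / sqrt(var(x))   # slope: r times s_y over s_x
> a <- mean(y) - b * mean(x)                     # intercept
> coef(lm(y ~ x))                                # least squares line from S-PLUS; should match a and b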
Simple Linear Regression
Simple linear regression uses the least squares line
to predict the value of y for each x.
You may think of the line as representing the average y value for all individuals
in a population with a given x value.
The regression line is an estimate of the ``true'' line
based on sample data.
Note that the method of least squares will always provide a line,
even when a nonlinear curve would be more appropriate.
The correlation coefficient alone cannot be used as a measure
of the appropriateness of a linear fit.
It is necessary to plot the data to ascertain whether a linear fit is appropriate.
Regression Diagnostics
It is often the case that the relationship between two quantitative variables
should not be summarized with a straight line.
It may be the case that some nonlinear relationship is better.
In addition, the methods of statistical inference in a regression framework
assume that the variance is constant for different values of the explanatory x variable.
Often, a plot of the data itself makes it clear
when a nonlinear fit is more appropriate,
or when nonconstant variance is a potential problem.
However, a plot of residuals versus the fitted values (or the original x values)
makes it easier to see these potential problems.
A curved pattern in a residual plot indicates nonlinearity.
For example, if the residuals tend to be positive for low x values,
negative for middle x values,
and positive again for high x values,
this indicates that the data is scattered around a curve
that is concave up.
When the size of the residuals tends to increase with the size of the fitted values,
this indicates that the variance is related to the explanatory variable.
A common solution is to transform the response variable.
Perhaps log(y) has a linear relationship with x, with variance
that is more nearly constant as x changes.
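A residual plot of this kind can be made directly from the Commands Window;
a minimal sketch, again with placeholder vectors x and y, is below.
(Later in this lab the same plot is produced through the menus.)
> fit <- lm(y ~ x)                       # least squares fit
> plot(fitted(fit), residuals(fit))      # residuals versus fitted values
> abline(h=0)                            # horizontal reference line at zero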
Forced expiratory volume (FEV) is a measure of lung function.
An individual takes a full breath and then blows as much air as possible
into the measuring instrument.
Larger measures are better, other factors being equal.
FEV may be affected by several variables.
For this lab, we will examine the effect of height on FEV
from a sample of children of various ages.
Better analyses would use additional information,
such as the ages, sex, and smoking status of the children.
- Load in the FEV data set.
In the data set, FEV (fev) is measured in liters
and height (ht) is measured in inches.
Attach the data
by typing
> attach(fevdata)
in the Commands Window.
- Make a plot with ht on the x axis and fev on the y axis.
> plot(ht,fev)
Does it look like a linear relationship is adequate,
or is a nonlinear relationship better?
(There is a noticeable curve, and the spread of the points around the line increases
as the height increases.)
- When the spread increases as the x variable increases,
taking logarithms of the response variable often corrects this.
Make a plot of log(fev) versus ht.
> plot(ht,log(fev))
Does it look like a linear relationship is adequate?
(Yes! The relationship looks fairly linear
and the spread around the line is similar as the height increases.)
- Use S-PLUS to find the least squares regression line
that predicts log of FEV from height.
- Use your mouse to select Statistics:Regression:Linear....
- In the Formula box type
log(fev) ~ ht
This means ``log(fev) is modeled as a function of height''.
An intercept is included by default.
- Click on the Plots tab.
- Click on the two plots Residuals versus Fitted Values
and Response versus Fitted Values.
- Click on OK.
- Read the Report Window and look at the graphs.
On the residual plot,
you should see that the points are spread around the line y=0
without pattern
and that the size of the spread does not change drastically with x.
The plot on page 2 is a scatter plot of the response values
versus the fitted values.
This plot is identical to the original scatter plot of the data
except that the horizontal axis has been rescaled and relabeled
and the regression line is added to the plot.
In the Report Window,
there will be a table labeled "Coefficients" with the fitted parameter values.
The column headed "Value" has the slope and intercept of the regression line.
These are statistics that can be used to estimate the "true"
population slope and intercept
assuming that the data is a random sample from the population.
The column headed "Std. Error" has the estimated standard errors
of the estimated coefficients.
These could be used to construct confidence intervals
of the unknown "true" parameter values.
The appropriate t multiplier has n-2 degrees of freedom,
where n is the number of points used in the regression,
and would be found with the command qt(0.975,n-2)
for a 95% confidence interval, for example.
The column headed "t value" is the t statistic
of the hypothesis test that tests if the true parameter value is 0.
The column headed "Pr(>|t|)" is the two-sided p-value
of the hypothesis test.
Note that these inferences only make sense if a linear fit is appropriate
and if the variance does not depend too strongly on the value of x.
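The same fit and the confidence interval calculation can also be sketched from the Commands Window.
This assumes the FEV data are still attached so that fev and ht are available;
the names fev.fit, ctab, b, se, and n are scratch names chosen here.
> fev.fit <- lm(log(fev) ~ ht)           # same model as the dialog fits
> summary(fev.fit)                       # coefficient table, as in the Report Window
> ctab <- summary(fev.fit)$coef          # matrix of estimates and standard errors
> b <- ctab["ht", 1]                     # estimated slope (Value column)
> se <- ctab["ht", 2]                    # its standard error (Std. Error column)
> n <- length(ht)                        # number of children in the sample
> b + c(-1, 1) * qt(0.975, n - 2) * se   # 95% confidence interval for the true slope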
Homework Assignment
Load in the data from Exercise 10-4
from your textbook.
This data represents systolic blood pressure
of thirty women with ages ranging from 40 to 85.
Is there a relationship between age and blood pressure in middle aged
and older women?
- Use S-PLUS to fit a simple linear regression model of the form
(systolic blood pressure) = (intercept) + (slope) * (age)
where (intercept) and (slope) are numbers.
What are the units of measurement of the intercept and slope?
- Find a 95% confidence interval of the slope of the regression line
in the previous problem.
(Use the estimate and SE from the Report Window.
Use the function qt(0.975,df) to find the correct multiplier
where df is equal to sample size - 2.)
- Formally test the hypothesis that beta1 = 0
versus the alternative that beta1
is not 0 where beta1 is the population slope.
(Hint: Look at the t-value and the p-value in the Report Window.)
Last modified: December 1, 2000
Bret Larget,
larget@mathcs.duq.edu