Math 225
Introduction to Biostatistics
Analysis of Variance
Prerequisites
This lab assumes that you already know how to:
- Log in, find the course Web page, and run S-PLUS
- Use the Commands Window to execute commands
- Load data sets
Technical Objectives
This lab will teach you to:
- use S-PLUS for analysis of variance
- complete an ANOVA table by hand
- interpret an ANOVA table
- count degrees of freedom
- find p-values from an F table
Conceptual Objectives
In this lab you should begin to understand:
- when a one-way ANOVA is appropriate
- the assumptions behind an ANOVA
Analysis of Variance
Analysis of variance
is a general statistical method
that is appropriate if there is a continuous response variable
and one or more categorical explanatory variables.
In this lab we are concerned with a single explanatory variable
(called a factor) with at least three possible values (called levels).
In a one-way analysis of variance,
we may think of the data as consisting of g independent samples
(possibly of different sizes) from g populations.
In a previous chapter, you studied the case where g was two,
and you could test whether two population means were equal
with a t test.
This chapter generalizes this to test if three or more population means
are equal.
The basic structure of the test is similar to the case of two populations:
Calculate a test statistic and compare this statistic to its sampling
distribution under the null hypothesis that all population means are equal.
If you observe a test statistic that is very different from what you would
expect to see by chance (indicated both with a small p-value
and a test statistic in the critical region),
there is evidence against the null hypothesis.
With three or more populations, there is a different test statistic
and you compare its value to an F distribution instead of a t distribution.
An ANOVA table
(ANOVA is short for ANalysis Of VAriance)
is little more than a step-by-step procedure to calculate
the test statistic.
If the null hypothesis were true,
you would expect the sample means to be close to one another
because they all would be unbiased estimates of the same population mean.
Because of chance variation,
the sample means will probably not be exactly equal.
The basic idea is that if the variability among sample means
is greater than what you would expect due to chance,
you have evidence that the population means are not all equal.
The test statistic compares the variation among the sample means
to the variation within the samples.
If this statistic is large, this indicates that the sample means
are farther apart than expected due to chance
considering the within-sample variability and the sample sizes.
The p-value is the area to the right of the test statistic
under an F distribution with the appropriate degrees of freedom.
Assumptions
Analysis of variance makes these assumptions:
- Populations are normal.
- The populations have equal variances.
- The samples are independent.
ANOVA is robust to lack of normality in the populations
if the sample sizes are large or if the populations are not strongly skewed.
The only concern is small skewed samples.
The assumption of equal variances is equivalent to an assumption
made when comparing two population means with independent samples.
ANOVA is robust to lack of equal variance (heteroskedasticity)
as long as the sample standard deviations are about
the same order of magnitude.
If the ratio of the largest sample sd to the smallest is more than ten or so,
you need to do something more advanced to account for the heteroskedasticity.
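A quick way to check this rule of thumb is to compute the ratio of the largest to the smallest sample standard deviation. The lab itself uses S-PLUS; the sketch below is an equivalent check in Python, with made-up data for illustration:

```python
import numpy as np

# Made-up samples for illustration; substitute your own groups.
samples = [np.array([12.0, 15.0, 14.0, 11.0]),
           np.array([18.0, 20.0, 17.0, 19.0]),
           np.array([13.0, 14.0, 16.0, 12.0])]

sds = [x.std(ddof=1) for x in samples]   # sample standard deviations
ratio = max(sds) / min(sds)
print(ratio)  # well under ten here, so equal variance is not a concern
```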
The last assumption says that ANOVA is not appropriate for paired or matched
samples, for example.
ANOVA Table
Your book describes analysis of variance in Chapter 9.
Page 238 summarizes the formulae you need to do a calculation
without software.
Instead of the textbook formulae,
you may find it easier to understand these equivalent formulae.
SS_Among = sum n_i * (xbar_i - grand_mean)^2
SS_Within = sum (n_i - 1) * (s_i)^2
df_Among = (number of groups) - 1
df_Within = sum (n_i - 1) = (total number of measurements) - (number of groups)
MS_Among = SS_Among / df_Among
MS_Within = SS_Within / df_Within
F = MS_Among / MS_Within
The p-value is the area to the right of the test statistic
under an F distribution.
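The formulae above can be carried out step by step in any language. The lab uses S-PLUS, but here is an equivalent sketch in Python with made-up data (three hypothetical groups of four measurements each), cross-checked against scipy's built-in one-way ANOVA:

```python
import numpy as np
from scipy import stats

# Hypothetical data: three groups, four measurements each.
groups = [np.array([12.0, 15.0, 14.0, 11.0]),
          np.array([18.0, 20.0, 17.0, 19.0]),
          np.array([13.0, 14.0, 16.0, 12.0])]

g = len(groups)                            # number of groups
N = sum(len(x) for x in groups)            # total number of measurements
grand_mean = np.concatenate(groups).mean()

# SS_Among = sum n_i * (xbar_i - grand_mean)^2
ss_among = sum(len(x) * (x.mean() - grand_mean) ** 2 for x in groups)
# SS_Within = sum (n_i - 1) * s_i^2, with s_i the sample sd (ddof=1)
ss_within = sum((len(x) - 1) * x.var(ddof=1) for x in groups)

df_among, df_within = g - 1, N - g
ms_among, ms_within = ss_among / df_among, ss_within / df_within
F = ms_among / ms_within
p = stats.f.sf(F, df_among, df_within)     # area to the right of F

print(F, p)
```

Comparing the hand-built F and p-value with `scipy.stats.f_oneway(*groups)` is a useful sanity check that the table was filled in correctly.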
Multiple Comparisons
When you can reject the hypothesis that all population means
are the same, you often wish to identify which population means
are different.
This can involve multiple pairwise comparisons.
There are several methods to construct simultaneous confidence intervals.
Two of these are
the Scheffe and Bonferroni methods.
They are discussed on pages 243 and 245 respectively.
Each of these methods depends on the assumptions of normality,
equal variance, and independence.
The Scheffe method allows for any number of comparisons
of any contrasts and is very conservative.
The Bonferroni method is only valid if the comparisons
are specified prior to examining the data.
For the purposes of this course, we only consider comparisons between all sample means.
The basic structure of the two methods is the same.
The simultaneous confidence intervals are of this form:
(difference in sample means) +/- (multiplier)(standard error)
where the standard error has the form
(pooled estimate of sigma)*sqrt(1/(sample size 1) + 1/(sample size 2))
For the Scheffe method, the multiplier comes from an F distribution,
specifically
multiplier = sqrt( (g-1)*F(1-alpha) )
where F(1-alpha) is the point that cuts off the upper right tail area
of alpha from an F distribution with (g-1) and (N-g) degrees of freedom.
For the Bonferroni method,
the multiplier is the value t so that the area between -t and t
is 1 - alpha/k from a t distribution with (N-g) degrees of freedom
where k is the number of comparisons.
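The two multipliers can be computed directly from the F and t distributions. Below is a hedged sketch in Python (the lab itself uses S-PLUS), with hypothetical values g = 3 groups, N = 15 total observations, alpha = 0.05, and k equal to the number of pairwise comparisons:

```python
import math
from scipy import stats

g, N = 3, 15          # hypothetical: 3 groups, 15 total observations
alpha = 0.05
k = g * (g - 1) // 2  # number of pairwise comparisons among all means

# Scheffe: multiplier = sqrt((g-1) * F(1-alpha)) with (g-1, N-g) df
f_crit = stats.f.ppf(1 - alpha, g - 1, N - g)
scheffe = math.sqrt((g - 1) * f_crit)

# Bonferroni: t so that the central area is 1 - alpha/k, (N-g) df
bonferroni = stats.t.ppf(1 - alpha / (2 * k), N - g)

print(scheffe, bonferroni)
```

With only pairwise comparisons, the two multipliers come out quite close in this example; Scheffe's advantage is that it covers arbitrary contrasts, at the price of being conservative.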
- Load in the data from exercise 9-1.
- Use S-PLUS to do an analysis of variance
with the Scheffe method for multiple comparisons.
- Select Statistics:ANOVA:Fixed Effects...
- On the Model tab,
click on hours as the dependent variable
and treatment as the independent variable.
- For the Results tab,
click on Means (along with the defaults).
- For the Plot tab,
click on Residuals versus Fit
and change Number of Extreme Points To Identify from 3 to 0.
- On the Compare tab,
select temperature for Levels Of,
click on Plot Intervals,
and select Scheffe for Method.
- Click on the Apply Button.
- Close the warning message box that appears.
Look at the output in the Report Window,
the residual plots,
and the graphical display of the confidence intervals.
- Change the method for multiple comparisons to Bonferroni
and click on OK,
and examine the output.
If time permits, you can complete the ANOVA table from the formulae
using a hand calculator.
You may wish to use S-PLUS to find sample means and variances.
- Use S-PLUS to find the mean and sample variance from each sample
as well as the mean and sample variance of the data treated as one large sample.
- Fill in a one-way ANOVA table as on page 238 for your data.
Source SS df MS F
Among
Within
Total
- Find the exact p-value by calculating the area to the right of your F test statistic.
Something like the example below will do the trick.
(You will get a different test statistic.)
> 1-pf(22.42,2,12)
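If you want to verify the same tail area outside S-PLUS, scipy's F distribution gives an equivalent computation in Python:

```python
from scipy import stats

# Equivalent of the S-PLUS command 1 - pf(22.42, 2, 12):
# the upper-tail area of an F(2, 12) distribution at 22.42.
p = stats.f.sf(22.42, 2, 12)
print(p)
```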
Homework Assignment
Use S-PLUS to do exercise 9-2 on page 264.
Here is the data.
- Construct side-by-side boxplots of count versus mouse.
In this informal analysis,
does it appear that the count depends on the mouse?
Do the boxplots show skewed or fairly symmetric distributions?
- Use S-PLUS to carry out a one-way ANOVA for this data
using count as the response.
Complete the table.
What is the F statistic and the p-value?
- If you have a very small p-value,
is it reasonable to conclude that the effects of the treatment
are not the same for each mouse?
Last modified: November 27 2000
Bret Larget,
larget@mathcs.duq.edu