Math 225
Introduction to Biostatistics
Statistical Inferences for One Population
Prerequisites
This lab assumes that you already know how to:
- Log in, find the course Web page, and run S-PLUS
- Use the Commands Window to execute commands
- Load data sets
Technical Objectives
This lab will teach you to:
- Calculate probabilities and quantiles associated with the t distribution.
- Estimate a population mean with a confidence interval using S-PLUS.
- Test a hypothesis about a single population mean using S-PLUS.
Conceptual Objectives
In this lab you should begin to understand:
- how to properly interpret confidence intervals
- how to interpret hypothesis test results
- what a p-value is
- when inference based on the t distribution is appropriate
Confidence intervals
You have learned to standardize normal random variables
to compare the standardized value (z-score) to a standard normal curve.
z = (x-mu)/sigma
You also know that the same idea holds for a sample mean, xbar.
In this case,
you must use SE(xbar) = sigma/sqrt(n)
instead of sigma in the denominator.
z = (xbar-mu)/(sigma/sqrt(n))
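As a quick numerical check of the formula above, here is a minimal sketch in S-PLUS syntax. The values of mu, sigma, n, and xbar are made up for illustration and do not come from any data set used in this course.

```r
# Hypothetical values, chosen only to illustrate the formula
mu    <- 100   # assumed population mean
sigma <- 15    # assumed population standard deviation
n     <- 25    # sample size
xbar  <- 106   # observed sample mean

se <- sigma / sqrt(n)      # standard error of xbar: 15/5 = 3
z  <- (xbar - mu) / se     # standardized sample mean: 6/3 = 2
z
pnorm(z)                   # area to the left of z under the standard normal curve
```

Note that dividing by sigma instead of SE(xbar) would give z = 0.4, which badly understates how unusual this sample mean is.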
Consider estimating the mean body temperature
of a population.
You could take a random sample of people and find the body temperature of each.
The sample mean would be your best single guess
for a value of the unknown population mean body temperature.
However, because you know that your estimate may not be exactly correct,
you should include with your single best estimate an assessment
of how accurate the estimate is likely to be.
A confidence interval does just this.
If you wish to be 95% confident that your estimate is within some distance
of the true population mean,
you create an interval centered at xbar extending a distance of 1.96 SE(xbar)
in each direction.
This is justified because the middle 95% of the standard normal curve lies within
1.96 standard deviations of the mean, and the central limit theorem says that the sampling distribution of xbar
will be approximately normal (for sufficiently large sample sizes).
A 95% confidence interval for mu:
xbar ± 1.96 sigma / sqrt(n)
A confidence interval with a different level of confidence (90% and 99% are typical choices)
would use a multiplier other than 1.96 (1.645 or 2.576 for 90% or 99% respectively)
determined by the cutpoints of the corresponding center area under the standard normal curve.
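The multipliers quoted above can be recovered with qnorm. The sketch below also builds the three intervals for a made-up example; the values of xbar, sigma, and n are hypothetical and serve only to show the arithmetic.

```r
# A central area of 'level' leaves (1 - level)/2 in each tail,
# so the multiplier is the quantile at 1 - (1 - level)/2
level <- c(0.90, 0.95, 0.99)
mult  <- qnorm(1 - (1 - level)/2)
round(mult, 3)             # 1.645 1.960 2.576

# Hypothetical example (made-up values)
xbar  <- 50
sigma <- 8
n     <- 64
se    <- sigma / sqrt(n)   # 8/8 = 1

lower <- xbar - mult * se
upper <- xbar + mult * se
cbind(level, lower, upper) # higher confidence gives a wider interval
```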
The t distribution
In a typical situation, however,
the population standard deviation sigma will not be known.
In this case, you can replace sigma
with the sample standard deviation s.
The problem that arises is that the random variable
t = (xbar-mu)/(s/sqrt(n))
does not have a standard normal distribution.
Instead, it has a t distribution with n-1 degrees of freedom.
This distribution is symmetric, bell-shaped, and centered at 0,
just as the standard normal curve is, but it is more spread out.
There is a different t distribution for every sample size.
When constructing a 95% confidence interval,
the interval from the formula
xbar ± 1.96 s / sqrt(n)
will be too narrow and will capture less than 95% of the area
under the sampling distribution
because there is uncertainty in estimating the standard error.
To correct for this, a multiplier from the t distribution
should be used instead of one from the standard normal distribution.
This multiplier should correspond to the middle 95%
of the t distribution with the correct number of degrees of freedom.
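To see how much area the multiplier 1.96 actually captures under a t distribution, you can try a sketch like the following. The sample sizes 5, 10, 30, and 100 are arbitrary choices for illustration.

```r
df <- c(4, 9, 29, 99)      # degrees of freedom for n = 5, 10, 30, 100

# Area between -1.96 and 1.96 under each t distribution
area <- pt(1.96, df) - pt(-1.96, df)
area                       # each value is below 0.95

# Multiplier that captures the middle 95% of each t distribution
mult <- qt(0.975, df)
mult                       # each value is above 1.96
```

The shortfall in area is largest for the smallest sample size, which is where using the t multiplier matters most.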
The t distribution in S-PLUS
The two functions in S-PLUS you need to learn for this lab are pt,
which finds areas under t distributions,
and qt,
which finds quantiles of t distributions.
They are used as follows.
> pt(x,df)
is the area to the left of x
under a t distribution with df degrees of freedom.
For confidence intervals for a population mean, df = n-1.
> qt(p,df)
is the number x
for which the area to the left of x
under a t distribution with df degrees of freedom
is p.
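A short sketch may help show how the two functions relate: qt undoes pt, and areas in the two tails match because the t distribution is symmetric about 0. The cutpoint 1.5 and the 10 degrees of freedom are arbitrary choices for illustration.

```r
df <- 10
x  <- 1.5                  # arbitrary cutpoint

p <- pt(x, df)             # area to the left of 1.5
qt(p, df)                  # recovers 1.5, since qt is the inverse of pt

# Symmetry about 0: the area left of -x equals the area right of x
pt(-x, df)
1 - pt(x, df)

# Area between -x and x
pt(x, df) - pt(-x, df)
```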
S-PLUS help is available in this
on-line guide.
Appropriateness of the t distribution
In theory, inference based on the t distribution assumes that the population is normal.
The central limit theorem implies that the t distribution
will be appropriate for nonnormal populations
provided the sample size is sufficiently large.
As the previous lab showed,
the t distribution may not be suitable
if the population is strongly skewed
and the sample size is insufficiently large.
However, in most situations where strong skewness is not a problem,
the t distribution is appropriate even for small samples.
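One way to explore this yourself is by simulation. The sketch below is an illustration under assumed settings, not part of the assigned lab: it draws many small samples from an exponential population (strongly right-skewed, with mean 1), builds a 95% t interval from each, and records how often the interval contains the true mean. The seed, the number of repetitions, and the sample size are arbitrary choices.

```r
set.seed(225)              # arbitrary seed so the run can be repeated
reps <- 2000               # number of simulated samples
n    <- 5                  # deliberately small sample size
mu   <- 1                  # true mean of an exponential population with rate 1
covered <- logical(reps)

for (i in 1:reps) {
    x    <- rexp(n)                       # sample from the skewed population
    se   <- sqrt(var(x)) / sqrt(n)        # estimated standard error
    half <- qt(0.975, n - 1) * se         # half-width of the 95% t interval
    covered[i] <- (mean(x) - half <= mu) & (mu <= mean(x) + half)
}

coverage <- sum(covered) / reps           # proportion of intervals containing mu
coverage
```

With a skewed population and so few observations, the proportion tends to fall short of 0.95. Try increasing n to see it move closer to the nominal level.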
The first goal is to learn to use the computer to make calculations
with a t distribution
and to compare these to the standard normal distribution.
- Open a Commands Window.
- Find the area to the left of -2
under the standard normal curve.
> pnorm(-2)
- Find the area to the left of -2 under t distributions
with 1, 10, 30, and 100 degrees of freedom.
> pt(-2,c(1,10,30,100))
(The S-PLUS function c collects several numbers together.
You could use pt(-2,1)
to find just the first of the probabilities.)
Notice that these numbers are all larger than the corresponding area
for a normal curve, but that they get closer
to the normal curve area as the number of degrees of freedom increases.
What do you think happens as the number of degrees of freedom increases to infinity?
- What is the number z so that the area under the standard normal curve
between -z and z is 95%?
> qnorm(0.975)
- What are the numbers t so that the area
under a t distribution with 1, 10, 30, and 100 degrees of freedom
between -t and t is 95%?
> qt(0.975,c(1,10,30,100))
Notice that these numbers get closer to 1.96 as the number of degrees of freedom (and so the sample size) increases.
You can find these numbers in the table on page 419 in your textbook.
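The value 0.975 appears in the commands above because a central area of 95% leaves 2.5% in each tail. If you want multipliers for other confidence levels, one option is a small helper like the sketch below. The name tmult is hypothetical; it is not a built-in S-PLUS function.

```r
# Hypothetical helper (not built in): multiplier for a given confidence level
tmult <- function(level, df) {
    # a central area of 'level' leaves (1 - level)/2 in each tail
    qt(1 - (1 - level)/2, df)
}

tmult(0.95, c(1, 10, 30, 100))   # same as qt(0.975, c(1,10,30,100))
tmult(0.90, 30)                  # same as qt(0.95, 30)
```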
Is the mean body temperature of human adults really 98.6?
The body temperature data set
contains the body temperature and gender of 130 volunteers,
65 men and 65 women.
For the present, we will ignore differences due to gender.
Load this data into S-PLUS.
To apply the methods in the textbook,
we need to find the mean and standard deviation
of body temperature, coded as the variable temp.
> attach(temperature)
> mean(temp)
> sqrt(var(temp))
Draw a histogram of the variable temp.
Is it centered about where you would expect?
Does the standard deviation represent the size of a typical deviation from the mean?
Is the shape approximately normal,
or do you see substantial skewness or outliers?
Construct a 95% confidence interval for the unknown population mean
using the method described in the textbook
in section 6-2-3.
Use the t table on page 419 to find the correct t value.
Now use S-PLUS to find a confidence interval.
- Use your mouse to select Statistics:Compare Samples:One Sample:t test
- Select the variable temp.
- Use 98.6 as the mean for the null hypothesis.
- Click on OK.
- Read the Report Window.
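The menu steps above can also be approximated from the Commands Window. The sketch below assumes that the menu dialog corresponds to the function t.test, and it uses a small made-up sample so that it runs on its own; the eight values are hypothetical and are not the course data. With the real data attached, you would use temp in place of x.

```r
# Hypothetical sample of eight body temperatures (made-up values)
x <- c(98.2, 97.9, 98.6, 98.4, 97.7, 98.8, 98.0, 98.3)
n <- length(x)

# Confidence interval from the formula xbar +/- t * s/sqrt(n)
se   <- sqrt(var(x)) / sqrt(n)
half <- qt(0.975, n - 1) * se
byhand <- c(mean(x) - half, mean(x) + half)
byhand

# The same interval from t.test, with 98.6 as the null-hypothesis mean
result <- t.test(x, mu = 98.6)
result$conf.int
```

The two intervals should agree, which confirms that the reported interval is the textbook formula with a t multiplier on n-1 degrees of freedom.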
A p-value is the probability of observing a result at least as extreme
as that actually observed
if the experiment were repeated and the null hypothesis were true.
Small p-values are strong evidence against the null hypothesis,
but a p-value should not be interpreted as the probability that the null hypothesis is true.
Later lab assignments will consider p-values in more detail.
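To connect the definition to a calculation, the sketch below computes a two-sided p-value by hand and compares it with the value from t.test. The sample is made up for illustration (it is not the course data), and the names mu0, tstat, and pval are hypothetical.

```r
# Hypothetical sample (made-up values, not the course data)
x   <- c(98.2, 97.9, 98.6, 98.4, 97.7, 98.8, 98.0, 98.3)
mu0 <- 98.6                      # mean under the null hypothesis
n   <- length(x)

# t statistic: how many standard errors xbar lies from mu0
tstat <- (mean(x) - mu0) / (sqrt(var(x)) / sqrt(n))

# Two-sided p-value: area in both tails beyond |tstat|, with n-1 degrees of freedom
pval <- 2 * pt(-abs(tstat), n - 1)
pval

t.test(x, mu = mu0)$p.value      # should agree with pval
```

"At least as extreme" here means a t statistic at least as far from 0, in either direction, as the one observed.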
Which of the statements below are justified?
- 95% of the sample has body temperatures between 98.12 and 98.38 degrees.
- 95% of the population has body temperatures between 98.12 and 98.38 degrees.
- There is fairly strong evidence that the mean population body temperature
is not 98.6 degrees.
- We can be about 95% sure that the population mean body temperature
is between 98.12 and 98.38.
- If we took another sample of 130 people,
it would be 95% likely that the sample mean would be between 98.12 and 98.38.
Homework Assignment
You may solve each problem below using the program S-PLUS.
You should also know how to do each problem with paper, pencil, and your t table.
Solutions with t tables may not be as accurate as solutions with the computer.
Put your answers on this form.
- For a t distribution with 10 degrees of freedom,
what is the number t such that the area between -t and t is
  - 0.90?
  - 0.95?
  - 0.99?
What are the corresponding numbers for the standard normal curve?
- For a t distribution with 24 degrees of freedom,
what is the number t such that the area between -t and t is
  - 0.90?
  - 0.95?
  - 0.99?
What are the corresponding numbers for the standard normal curve?
- The rat data set
is modeled after Exercise 6-1 on page 160.
Each value is the time in hours
until a skin cancer is gone after treatment with a drug.
Find a 95% confidence interval for the mean time until cancers disappear
using the textbook formula.
(You may use S-PLUS to calculate summary statistics.)
- Use S-PLUS to find the 95% confidence interval
for the population mean time until cancers disappear
assuming the 24 rats are a random sample from a larger population.
  - Use your mouse to select Statistics:Compare Samples:One Sample:t test
  - Select the variable hours.
  - Click on OK.
  - Read the Report Window.
- Consider again the body temperature data from class.
Make the assumption that the mean body temperature in the population
is 98.6 degrees and that the population standard deviation is 0.73 degrees.
You may solve these problems with or without S-PLUS.
  - What is the probability that a single randomly sampled body temperature
  would be 98.25 or lower?
  - What is the probability that the mean of 10 randomly sampled body temperatures
  would be 98.25 or lower?
  - What is the probability that the mean of 130 randomly sampled body temperatures
  would be 98.25 or lower?
Last modified: March 5, 2001
Bret Larget,
larget@mathcs.duq.edu