Math 225

Introduction to Biostatistics


Comparisons Between Two Populations

Prerequisites

This lab assumes that you already know how to:
  1. Log in, find the course Web page, and run S-PLUS
  2. Use the Commands Window to execute commands
  3. Load data sets

Technical Objectives

This lab will teach you to:
  1. Use S-PLUS to estimate differences in population means from independent or matched pair experiments.
  2. Use S-PLUS to test hypotheses about differences in two population means.

Conceptual Objectives

In this lab you should begin to understand:
  1. how to properly interpret confidence intervals
  2. how to interpret hypothesis test results
  3. what a p-value is
  4. when inference based on the t distribution is appropriate

Matched Pair versus Independent Sample Designs

Often in statistics an investigator wishes to compare two groups or to assess the difference between two treatments.

One experimental design is either to take two independent samples, one from each population, or to randomly assign subjects to one of two treatment groups. In these situations, the number of individuals in each sample or treatment group may differ. There is no particular reason why any single individual should be matched or paired with a specific individual from the other group. These are examples of independent samples.

In contrast, an alternative design is to take a single sample of pairs of measurements. Perhaps each subject is measured twice, once under each set of treatment conditions. Or each experimental unit may be a pair of individuals, such as siblings or twins. Notice that in these situations, there is always perfect balance between the two groups. These are examples of matched pair designs.

In both cases, we will construct confidence intervals with the basic form

(sample mean 1) - (sample mean 2) +/- (multiplier)(SE)

The multiplier will usually be from the t distribution (unless we know the population standard deviations) but the formula for the SE differs in the two cases. Also, the test statistic for hypothesis tests will be:

t = ((sample mean 1) - (sample mean 2)) / SE

For independent samples of size n_1 and n_2, the exact standard error is:

SE = sqrt( (sigma_1)^2 / n_1 + (sigma_2)^2 / n_2 )

If we assume that the two population standard deviations are equal, we can estimate their common value best as

s_p = sqrt( ( (n_1 - 1)*(s_1)^2 + (n_2 - 1)*(s_2)^2 ) / (n_1 + n_2 - 2) )

Notice that this is the square root of the weighted average of the sample variances (s_1)^2 and (s_2)^2. The SE is approximated by:

SE = s_p * sqrt( 1/(n_1) + 1/(n_2) )

which follows algebraically if you substitute in s_p for (sigma_1) and (sigma_2). The t distribution has n_1 + n_2 - 2 degrees of freedom in this case. (n_1 - 1 + n_2 - 1 = n_1 + n_2 - 2.)

For the matched pair design case, the idea is to take all of the individual differences first, and to consider your data to be a single sample of n differences. There are then n-1 degrees of freedom and you treat the sampled differences as a sample from a single population.
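
As a minimal sketch of the matched pair calculation, the Python code below takes the differences first and then treats them as one sample. It assumes the usual one-sample standard error, (standard deviation of the differences) / sqrt(n). The function name `paired_t` and the measurements are hypothetical and chosen only to keep the arithmetic simple.

```python
import math
import statistics


def paired_t(first, second):
    """Return (mean difference, SE, t statistic, df) for matched pairs."""
    if len(first) != len(second):
        raise ValueError("matched pair data must have equal lengths")
    # Step 1: reduce the paired data to a single sample of differences.
    diffs = [a - b for a, b in zip(first, second)]
    n = len(diffs)
    mean_diff = statistics.mean(diffs)
    sd_diff = statistics.stdev(diffs)      # sample SD (divides by n - 1)
    # Step 2: one-sample standard error and test statistic.
    se = sd_diff / math.sqrt(n)
    return mean_diff, se, mean_diff / se, n - 1


# Made-up paired measurements on four subjects (illustration only).
condition_1 = [82, 90, 78, 85]
condition_2 = [80, 86, 77, 81]

mean_diff, se, t, df = paired_t(condition_1, condition_2)
print("mean difference =", mean_diff, " SE =", se, " t =", round(t, 3), " df =", df)
```

Here the differences are 2, 4, 1, and 4, so the mean difference is 2.75, the standard deviation of the differences is 1.5, and the SE is 1.5 / sqrt(4) = 0.75 with 3 degrees of freedom.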

In-class Activities

  1. Load in the body temperature data set.
  2. Use S-PLUS to find the sample means and the sample standard deviations of temperature by gender.
    1. Use your mouse to select Statistics:Data Summaries:Summary Statistics
    2. Select temp in the Data area.
    3. Select gender in the Summaries by Group area.
    4. Click on OK.
    5. Read the Report Window.
    Note that this is an independent samples problem. There is no reason why specific man/woman pairs should be formed.
  3. Use the textbook formula to find the SE for the difference in sample means assuming that population standard deviations are equal.
  4. Use the textbook formula to find a 95% confidence interval for a difference in population means.
  5. Use the textbook formula to find the value of the test statistic.
  6. Use the t table to find a range for the p-value.
  7. Use S-PLUS to find a 95% confidence interval and to test the hypothesis that men and women have the same mean body temperature.
    1. Use your mouse to select Statistics:Compare Samples:Two Samples:t test
    2. Click the Variable 2 is a Grouping Variable button
    3. Select temp as Variable 1
    4. Select gender as Variable 2
    5. Click on OK.
    6. Read the Report Window.
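
Step 6 above, finding a range for the p-value from a t table, can be sketched as the short Python function below. It is an illustration under stated assumptions: the function name is hypothetical, and the critical values are those of the infinite degrees of freedom (standard normal) row of a typical t table. For a real problem, use the row of your textbook's table that matches your degrees of freedom.

```python
# Upper-tail areas and the matching critical values, taken from the
# infinite-degrees-of-freedom (standard normal) row of a t table.
# ASSUMPTION: replace these with the row for your own degrees of freedom.
TAIL_AREAS = [0.10, 0.05, 0.025, 0.01, 0.005]
CRITICAL_VALUES = [1.282, 1.645, 1.960, 2.326, 2.576]


def two_sided_p_range(t, tail_areas=TAIL_AREAS, critical_values=CRITICAL_VALUES):
    """Bracket the two-sided p-value for test statistic t as (low, high)."""
    abs_t = abs(t)
    low, high = 0.0, 1.0
    for area, crit in zip(tail_areas, critical_values):
        if abs_t >= crit:
            # |t| is beyond this critical value: p is at most twice the area.
            high = 2 * area
        else:
            # |t| falls short of this critical value: p exceeds twice the area.
            low = 2 * area
            break
    return low, high


low, high = two_sided_p_range(2.1)
print("two-sided p-value is between", low, "and", high)
```

With t = 2.1, the statistic falls between the critical values 1.960 and 2.326, so the two-sided p-value lies between 0.02 and 0.05. The exact p-value reported by software should fall inside the range found from the table.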


Homework Assignment

You may solve each problem below using the program S-PLUS. You should also know how to do each problem with paper, pencil, and the table on the back cover of your textbook. Solutions with the table may not be as accurate as solutions with the computer.

This lab will ask questions about the HARVEST trial data set.

Notice: The linked data set is a modified version of the original data set in which subjects with missing data in the SMOKEYES and HRCB variables have been removed. The menu command to compare two samples will not work if there is missing data. If you use the original data set, you will have trouble, so make sure you use the data set harvestTwo.

We will make several comparisons between smokers and nonsmokers. The grouping variable for these comparisons is SMOKEYES. Treat the smokers and nonsmokers in the HARVEST data set as independent random samples from some larger populations. Other questions will make comparisons between variables measured in the clinic (C) and at home (A for ambulatory). Because these measurements are made on the same individuals, these comparisons are an example where matched pair methods are appropriate.

  1. Is there a difference between the diastolic blood pressures as measured at the clinic at baseline between smokers and nonsmokers? This data is in the variable DBPCB with grouping variable SMOKEYES. Find a 95% confidence interval for the difference and report a p-value for the two-sided hypothesis test with null hypothesis that the population means are equal.

  2. Which interpretation is most appropriate?

    A. There is convincing evidence that the mean diastolic blood pressure of smokers is different than that of nonsmokers.
    B. There is fairly strong evidence that the mean diastolic blood pressure of smokers is different than that of nonsmokers.
    C. There is fairly strong evidence that the mean diastolic blood pressure of smokers is exactly equal to that of nonsmokers.
    D. The data is consistent with no difference in the mean diastolic blood pressure of smokers and nonsmokers. The observed difference in sample means can be explained by sample variation.

  3. Is there a difference between the heart rates at baseline between smokers and nonsmokers? Find a 95% confidence interval for the difference and report a p-value for the two-sided hypothesis test with null hypothesis that the population means are equal.

  4. Which interpretation is most appropriate?

    A. There is convincing evidence that the mean heart rate of smokers is different than that of nonsmokers.
    B. There is fairly strong evidence that the mean heart rate of smokers is different than that of nonsmokers.
    C. There is fairly strong evidence that the mean heart rate of smokers is exactly equal to that of nonsmokers.
    D. The data is consistent with no difference in the mean heart rate of smokers and nonsmokers. The observed difference in sample means can be explained by sample variation.

  5. Are measurements at baseline of diastolic blood pressure similar if they are made at the clinic or at home? These data are in the variables DBPCB and DBPAB. Find a 95% confidence interval for the difference and report a p-value for the two-sided hypothesis test with null hypothesis that the population means are equal.

  6. Which interpretation is most appropriate?

    A. There is convincing evidence that the measurements of diastolic blood pressure at home and at the clinic are different.
    B. There is fairly strong evidence that the measurements of diastolic blood pressure at home and at the clinic are different.
    C. There is fairly strong evidence that the measurements of diastolic blood pressure at home and at the clinic are exactly the same.
    D. The data is consistent with no difference in the measurements of diastolic blood pressure at home and at the clinic. The observed differences can be explained by sample variation.

  7. Is the observed difference in diastolic blood pressure medically important?

  8. Are measurements at baseline of heart rate similar if they are made at the clinic or at home? These data are in the variables HRCB and HRAB. Find a 95% confidence interval for the difference and report a p-value for the two-sided hypothesis test with null hypothesis that the population means are equal.

  9. Which interpretation is most appropriate?

    A. There is convincing evidence that the measurements of heart rate at home and at the clinic are different.
    B. There is fairly strong evidence that the measurements of heart rate at home and at the clinic are different.
    C. There is fairly strong evidence that the measurements of heart rate at home and at the clinic are exactly the same.
    D. The data is consistent with no difference in the measurements of heart rate at home and at the clinic. The observed differences can be explained by sample variation.

  10. Is the observed difference in heart rates medically important?

Last modified: October 30, 2000

Bret Larget, larget@mathcs.duq.edu