Math 225

Introduction to Biostatistics

Notes from Lecture #11

Hypothesis Tests for a Single Proportion
A motivating problem. Does my daughter Elizabeth have ESP? Elizabeth is an accomplished card player. Perhaps some of her skill is an ability to know what the cards are before she sees them. We can test this by having her guess the suits of cards. If she correctly predicts the suits more often than would be expected by pure chance, there would be some evidence that she has ESP or some other means of knowing the suits of cards before she sees them.
In forty independent trials, Elizabeth correctly guessed the suit 14 times, more than the ten expected by chance. How much evidence is there that Elizabeth has ESP?
Hypothesis Tests. The previous question may be examined formally with a statistical procedure called a hypothesis test. Begin by assuming that Elizabeth guesses each suit completely at random. By chance, she has a 0.25 probability of getting each guess correct. This is our null hypothesis. The alternative hypothesis is that the probability she guesses correctly is greater than 0.25.
From our study of probability, we know that in a sequence of independent guesses with the same chance of success, the number of correct guesses will be a binomial random variable. The success probability would be p = 0.25 if she is only guessing. If she makes 40 independent guesses, we would expect her to get about 10 correct just by chance. Of course, there is some variability and the actual number of correct guesses may not be exactly 10, even if the null hypothesis is true. The basic idea is that the larger the difference between what we expect to see by chance (10 correct guesses) and what actually occurs (14 correct guesses), the more evidence there is that our hypothesis of no ESP is incorrect. It is common to calculate the difference between what we observe and what we expect to see by computing a test statistic. If the null hypothesis is true, the test statistic will have a known sampling distribution. In the example, we could let the test statistic be the observed number of correct guesses and the sampling distribution would be binomial with n=40 and p=0.25.
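As a rough illustration of this sampling distribution, the following Python sketch tabulates the binomial probabilities near the expected count. The function name binomial_pmf and the variable names are our own choices for illustration and are not part of the course software.

```python
from math import comb, sqrt

def binomial_pmf(k, n, p):
    """P(X = k) for X ~ Binomial(n, p)."""
    return comb(n, k) * p**k * (1 - p)**(n - k)

n, p0 = 40, 0.25                 # number of guesses, success probability under the null
mean = n * p0                    # expected number of correct guesses
sd = sqrt(n * p0 * (1 - p0))     # standard deviation of the number correct

print("mean =", mean, " sd =", round(sd, 3))
for k in range(6, 16):
    # probability of exactly k correct guesses if she is only guessing
    print(k, round(binomial_pmf(k, n, p0), 4))
```

The table shows that counts a few above or below 10 are quite likely even when the null hypothesis is true.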
We can measure the difference between what actually occurs and what we expect to occur by calculating the probability of seeing an outcome at least as extreme as what actually occurs if we were to do the entire experiment again assuming our original hypothesis is correct. This probability is called a p-value. The smaller the p-value, the more evidence there is that the null hypothesis is incorrect.
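The idea of doing the entire experiment again can be mimicked by simulation. The sketch below, in Python, repeats the 40-guess experiment many times under the null hypothesis and reports the fraction of repetitions with at least 14 correct guesses. The function name simulate_p_value, the number of repetitions, and the seed are arbitrary choices made for this illustration.

```python
import random

def simulate_p_value(observed, n, p0, reps, seed=0):
    """Estimate P(X >= observed) when X counts successes in n random guesses."""
    rng = random.Random(seed)    # fixed seed so the run is reproducible
    hits = 0
    for _ in range(reps):
        # one repetition of the experiment: n independent guesses
        correct = sum(rng.random() < p0 for _ in range(n))
        if correct >= observed:  # at least as extreme as what we saw
            hits += 1
    return hits / reps

estimate = simulate_p_value(observed=14, n=40, p0=0.25, reps=20000)
print(estimate)
```

The estimate changes a little with the seed and the number of repetitions, but it should come out roughly 0.10.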
A hypothesis test then consists of these parts.
1. State null and alternative hypotheses.
2. Calculate a test statistic.
3. Calculate a p-value.
4. Summarize your findings in the context of the problem.
We can now apply these ideas to the example problem.
1. State hypotheses:
  H₀: p = 0.25
  H_a: p > 0.25
2. Calculate a test statistic:
  X = 14.
3. Calculate a p-value:
  The alternative hypothesis is p > 0.25. This is a one-sided test. The larger the count, the more evidence there is against the null hypothesis. Any observation 14 or larger would be at least as extreme as what we actually observed. The probability of 14 or more successes when n=40 and p=0.25 is 0.1032 (from S-PLUS). Alternatively, we could have computed this with a normal approximation (as you might need to do for your test). The test statistic is then z = (13.5-10)/sqrt(40*0.25*0.75) = 1.28. (The correction for continuity makes a substantial difference.) The area to the right of 1.28 is 0.1003.
4. Summarize the findings in the context of the problem:
  If Elizabeth does not have ESP, she could be expected to guess correctly at least 14 times about once every ten times we repeated the experiment. A one in ten chance is not that small. I think that chance and blind luck can explain the results. There is only weak evidence that Elizabeth can correctly guess the suits of cards with success probability higher than 0.25.
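The calculations in step 3 can be checked with a short Python sketch that uses only the standard library in place of S-PLUS. The helper names binomial_upper_tail and normal_upper_tail are our own. Note that the code carries z at full precision, so its normal tail area differs slightly from the value read from a table at z = 1.28.

```python
from math import comb, erfc, sqrt

def binomial_upper_tail(x, n, p):
    """P(X >= x) for X ~ Binomial(n, p)."""
    return sum(comb(n, k) * p**k * (1 - p)**(n - k) for k in range(x, n + 1))

def normal_upper_tail(z):
    """Area to the right of z under the standard normal curve."""
    return 0.5 * erfc(z / sqrt(2))

n, p0, x = 40, 0.25, 14
mean = n * p0
sd = sqrt(n * p0 * (1 - p0))

exact = binomial_upper_tail(x, n, p0)     # exact one-sided p-value
z_corrected = (x - 0.5 - mean) / sd       # with the continuity correction
z_uncorrected = (x - mean) / sd           # without the continuity correction

print("exact p-value:", round(exact, 4))
print("corrected:", round(z_corrected, 2), round(normal_upper_tail(z_corrected), 4))
print("uncorrected:", round(z_uncorrected, 2), round(normal_upper_tail(z_uncorrected), 4))
```

Comparing the last two lines of output with the exact value shows why the correction for continuity matters here: without it, the normal approximation noticeably understates the p-value.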
More comments on proportions. In the previous example, we used the binomial distribution explicitly. An alternative (especially when n is large) is to look directly at proportions. The sample proportion is p-hat = X/n. The mean and standard deviation of the sampling distribution of p-hat will be 1/n times the corresponding quantities for the binomial distribution. In particular, the mean is (np)/n = p and the standard deviation is sqrt(np(1-p))/n = sqrt(p(1-p)/n). With p set to the value specified by the null hypothesis, a test statistic for hypothesis tests is then
```
       p-hat - p
z = ---------------
    sqrt( p(1-p)/n )
```
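As a minimal sketch, the statistic above can be computed in Python for the ESP example, with 14 correct guesses out of 40 and a null value of 0.25. The function name proportion_z is hypothetical. This version applies no correction for continuity, so it matches the uncorrected z computed from counts, (X - np)/sqrt(np(1-p)).

```python
from math import erfc, sqrt

def proportion_z(x, n, p0):
    """z statistic for a sample proportion against the null value p0."""
    p_hat = x / n                      # sample proportion
    se = sqrt(p0 * (1 - p0) / n)       # standard deviation of p-hat under the null
    return (p_hat - p0) / se

def normal_upper_tail(z):
    """Area to the right of z under the standard normal curve."""
    return 0.5 * erfc(z / sqrt(2))

z = proportion_z(14, 40, 0.25)
print(round(z, 2), round(normal_upper_tail(z), 4))
```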
The logic of hypothesis testing. Hypothesis testing answers the question, can the difference between a test statistic and its expected value under a null hypothesis be explained by chance alone?
The basic logic of hypothesis testing is similar to the technique of proof by contradiction. In proof by contradiction, if an initial assumption logically leads to a contradiction, the initial assumption is proven to be incorrect.
A hypothesis test is evidence by probable contradiction. If an initial assumption leads to the conclusion that an improbable event actually occurred, there is evidence that the original assumption is incorrect. If the evidence is sufficiently strong, the investigator may conclude that the null hypothesis is false rather than believe that it is true and that an improbable event actually occurred.
More comments on p-values. A p-value is the probability of observing a new test statistic at least as extreme as the old assuming that the null hypothesis is true. Small p-values indicate evidence against the null hypothesis. The idea of a p-value is so important that one might express the idea in a song.

The p-value Polka (to the tune of the Beer Barrel Polka)
What's a p-value?
Test a hypothesis and see.
What's a p-value?
It is the probability
That a new test statistic
Is at least as extreme as the old
Given H₀ is true.
That's what I'm told!
What is alpha? Another approach to hypothesis testing is to select a predetermined significance level (alpha). If the p-value is smaller than alpha, the null hypothesis is rejected. If the p-value is not smaller than alpha, the null hypothesis is not rejected (some say accepted). If a test has a p-value smaller than alpha, we say the test is "statistically significant at the alpha level".
Common choices for alpha are 5% and 1%. If a test is statistically significant at the 5% level, it means that a result at least as extreme as what occurred would happen less than 5% of the time if the null hypothesis were true.
Simply stating whether or not a result is significant at the 5% level is much less informative than reporting a p-value. P-values of 0.049 and 0.051 describe similar amounts of evidence against the null hypothesis, but the first is significant at the 5% level while the second is not. A p-value of 0.00004 is significant at the 5% level, and also the 1% level and the 0.01% level. Simply stating that the result is significant at the 5% level grossly understates the evidence against the null hypothesis in this case. Reporting the p-value indicates just how unlikely the observed event is under the null hypothesis, and the reader can make his or her own decision on what to conclude in the face of the evidence.
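The contrast between reporting a decision and reporting a p-value can be seen in a small Python sketch. The function name significant_levels and the particular list of alpha values are our own choices for illustration.

```python
def significant_levels(p_value, alphas=(0.05, 0.01, 0.001, 0.0001)):
    """Map each alpha to True if the p-value is significant at that level."""
    return {alpha: p_value < alpha for alpha in alphas}

for p in (0.049, 0.051, 0.00004):
    # the decision at each alpha throws away how small the p-value really is
    print(p, significant_levels(p))
```

The first two p-values get opposite decisions at the 5% level despite being nearly equal, while the third is significant at every level listed.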

Last modified: February 27, 2001

Bret Larget, larget@mathcs.duq.edu
