In forty independent trials, Elizabeth correctly guessed the suit 14 times, more than the ten expected by chance. How much evidence is there that Elizabeth has ESP?
From our study of probability, we know that in a sequence of independent guesses with the same chance of success, the number of correct guesses will be a binomial random variable. The success probability would be p = 0.25 if she is only guessing. If she makes 40 independent guesses, we would expect her to get about 10 correct just by chance. Of course, there is some variability and the actual number of correct guesses may not be exactly 10, even if the null hypothesis is true. The basic idea is that the larger the difference between what we expect to see by chance (10 correct guesses) and what actually occurs (14 correct guesses), the more evidence there is that our hypothesis of no ESP is incorrect. It is common to calculate the difference between what we observe and what we expect to see by computing a test statistic. If the null hypothesis is true, the test statistic will have a known sampling distribution. In the example, we could let the test statistic be the observed number of correct guesses and the sampling distribution would be binomial with n=40 and p=0.25.
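As a rough illustration of the sampling distribution described above, the following Python sketch tabulates the binomial distribution with n=40 and p=0.25 using only the standard library. The helper name binomial_pmf is introduced here for illustration and is not part of the course software.

```python
from math import comb, sqrt

def binomial_pmf(k, n, p):
    """Probability of exactly k successes in n independent trials."""
    return comb(n, k) * p**k * (1 - p)**(n - k)

n, p = 40, 0.25                    # ESP example under the null hypothesis
expected = n * p                   # mean of the binomial count
std_dev = sqrt(n * p * (1 - p))    # standard deviation of the count

# Sampling distribution of the number of correct guesses if she is only guessing
distribution = [binomial_pmf(k, n, p) for k in range(n + 1)]

print("expected count:", expected)
print("standard deviation:", round(std_dev, 3))
print("P(X = 10):", round(distribution[10], 4))
print("P(X = 14):", round(distribution[14], 4))
```

The list holds one probability for each possible count from 0 to 40, so any tail probability can be obtained by summing the relevant entries.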
We can measure the difference between what actually occurs and what we expect to occur by calculating the probability of seeing an outcome at least as extreme as what actually occurs if we were to do the entire experiment again assuming our original hypothesis is correct. This probability is called a p-value. The smaller the p-value, the more evidence there is that the null hypothesis is incorrect.
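The phrase "if we were to do the entire experiment again" can be taken literally in a simulation. The sketch below is only an illustration, not the method used in these notes: it repeats the 40-guess experiment many times with success probability 0.25 and reports how often the count is at least 14. It assumes that "at least as extreme" means "at least as large", as in the ESP example. The function name, the number of repetitions, and the seed are arbitrary choices made here.

```python
import random

def simulate_p_value(observed, n, p, repetitions, seed=0):
    """Estimate P(X >= observed) by repeating the experiment under the null hypothesis."""
    rng = random.Random(seed)
    at_least_as_extreme = 0
    for _ in range(repetitions):
        # One repetition: n independent guesses, each correct with probability p
        correct = sum(1 for _ in range(n) if rng.random() < p)
        if correct >= observed:
            at_least_as_extreme += 1
    return at_least_as_extreme / repetitions

estimate = simulate_p_value(observed=14, n=40, p=0.25, repetitions=20000)
print("simulated p-value:", estimate)
```

The estimate varies a little with the seed and the number of repetitions, but it should settle near the exact binomial tail probability as the number of repetitions grows.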
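A hypothesis test then consists of these parts: a statement of the null and alternative hypotheses, a test statistic, a p-value, and an interpretation of the result in the context of the problem.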
We can now apply these ideas to the example problem.
H0: p = 0.25
Ha: p > 0.25
X = 14.
The alternative hypothesis is p > 0.25. This is a one-sided test. The larger the count, the more evidence there is against the null hypothesis. Any observation 14 or larger would be at least as extreme as what we actually observed. The probability of 14 or more successes when n=40 and p=0.25 is 0.1032 (from S-PLUS). Alternatively, we could have computed this with a normal approximation (as you might need to do for your test). The test statistic is then z = (13.5-10)/sqrt(40*0.25*0.75) = 1.28. (The correction for continuity makes a substantial difference.) The area to the right of 1.28 is 0.1003.
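For readers without S-PLUS, the two calculations above can be reproduced with a short Python sketch using only the standard library. The function names are made up for this illustration. The exact tail is a direct sum of binomial probabilities, and the normal tail area is computed from the complementary error function.

```python
from math import comb, erfc, sqrt

def binomial_upper_tail(k, n, p):
    """P(X >= k) for a binomial(n, p) count."""
    return sum(comb(n, j) * p**j * (1 - p)**(n - j) for j in range(k, n + 1))

def normal_upper_tail(z):
    """Area to the right of z under the standard normal curve."""
    return 0.5 * erfc(z / sqrt(2))

n, p, observed = 40, 0.25, 14
mean = n * p
sd = sqrt(n * p * (1 - p))

# Exact p-value: probability of 14 or more correct guesses under H0
p_exact = binomial_upper_tail(observed, n, p)

# Normal approximation with continuity correction: "14 or more" starts at 13.5
z_corrected = (observed - 0.5 - mean) / sd
p_corrected = normal_upper_tail(z_corrected)

# Same approximation without the correction, for comparison
z_uncorrected = (observed - mean) / sd
p_uncorrected = normal_upper_tail(z_uncorrected)

print("exact p-value:", round(p_exact, 4))
print("z with correction:", round(z_corrected, 2), "p:", round(p_corrected, 4))
print("z without correction:", round(z_uncorrected, 2), "p:", round(p_uncorrected, 4))
```

Because the sketch does not round z to 1.28 before looking up the area, the corrected approximation comes out near 0.1006 rather than 0.1003. Without the continuity correction the tail area drops to roughly 0.07, which is noticeably further from the exact value.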
If Elizabeth does not have ESP, she could be expected to guess correctly at least 14 times about once every ten times we repeated the experiment. A one in ten chance is not that small. I think that chance and blind luck can explain the results. There is only weak evidence that Elizabeth can correctly guess the suits of cards with success probability higher than 0.25.
z = (p-hat - p) / sqrt( p(1-p)/n )
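The formula in terms of the sample proportion gives the same value as the formula in terms of the count, since dividing numerator and denominator by n changes nothing. A minimal Python sketch, with a hypothetical function name, checks this for the ESP data. Note that this form carries no continuity correction.

```python
from math import sqrt

def proportion_z(successes, n, p0):
    """z statistic for a sample proportion against a hypothesized value p0."""
    p_hat = successes / n
    return (p_hat - p0) / sqrt(p0 * (1 - p0) / n)

# ESP example: 14 correct out of 40, null value p = 0.25
z_from_proportion = proportion_z(14, 40, 0.25)

# Equivalent calculation on the count scale (no continuity correction)
z_from_count = (14 - 40 * 0.25) / sqrt(40 * 0.25 * 0.75)

print("z from proportion:", round(z_from_proportion, 2))
print("z from count:", round(z_from_count, 2))
```

Both values are about 1.46, larger than the 1.28 obtained when 13.5 is used in place of 14.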
The basic logic of hypothesis testing is similar to the technique of proof by contradiction. In proof by contradiction, if an initial assumption logically leads to a contradiction, the initial assumption is proven to be incorrect.
A hypothesis test is evidence by probable contradiction. If an initial assumption leads to the conclusion that an improbable event actually occurred, there is evidence that the original assumption is incorrect. If the evidence is sufficiently strong, the investigator may conclude that the null hypothesis is false rather than believe that it is true and that an improbable event actually occurred.
Common choices for the significance level alpha are 5% and 1%. If a test is statistically significant at the 5% level, it means that an outcome at least as extreme as what occurred should have happened less than 5% of the time if the null hypothesis were true.
Simply stating whether or not a result is significant at the 5% level is much less informative than reporting a p-value. P-values of 0.049 and 0.051 describe similar amounts of evidence against the null hypothesis, but the first is significant at the 5% level while the second is not. A p-value of 0.00004 is significant at the 5% level, and also at the 1% level and the 0.1% level. Simply stating that the result is significant at the 5% level grossly understates the evidence against the null hypothesis in this case. Reporting the p-value indicates just how unlikely the observed event is under the null hypothesis, and the reader can make his or her own decision on what to conclude in the face of the evidence.
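The point can be seen by reducing each p-value to a yes-or-no verdict. The Python sketch below uses a hypothetical helper, significant_at, and assumes the usual rule that a result is significant when the p-value is less than alpha.

```python
def significant_at(p_value, alpha):
    """True when the p-value falls below the chosen significance level."""
    return p_value < alpha

# The three p-values discussed above, checked at the 5% and 1% levels
for p_value in (0.049, 0.051, 0.00004):
    verdicts = {alpha: significant_at(p_value, alpha) for alpha in (0.05, 0.01)}
    print(p_value, verdicts)
```

At the 5% level, 0.049 and 0.00004 receive the same verdict even though they represent very different amounts of evidence, while 0.049 and 0.051 receive opposite verdicts despite being nearly equal.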
Bret Larget, larget@mathcs.duq.edu