Friday, July 19, 2013

More on statistical significance and small samples

In yesterday's post, I pointed out that for small sample sizes and cases where success is unlikely, standard tests of statistical significance can use even one observation of "success" to reject the null hypothesis. This is jarring, since statistical significance sounds important and official, as if it should be more rigorous than our intuitions about what the data says. So what's going on here, and when do we have to worry about it?

We'll work through this with an example.  Suppose we conduct a poll of five wizards and five muggles, and all the wizards and four of the muggles eat at least one piece of chocolate per day, while one of the muggles is on a diet and eats no chocolate at all.

Some Definitions, Applied 

The null hypothesis is the hypothesis that every observation is being drawn from the same distribution, or that the treatment group has the same distribution as the control group. In this case, the null hypothesis might say "Wizards and muggles eat the same amount of chocolate,"  or perhaps "The same proportion of wizards and muggles eat (or don't eat) chocolate." It's good to be precise about what you are measuring; we'll use the second formulation this time.

In the case of our survey about chocolate, we might be tempted to conclude that the null hypothesis is wrong, because more of the wizards eat chocolate.  But first we should ask whether a difference like this could easily have arisen by chance-- after all, we've only talked to ten people total.  This is where statistical significance comes in.

To determine whether our results are statistically significant, we first must decide how willing we are to reject the null hypothesis when it is actually true. That is, suppose that the Actual Real Truth is that the same percentage of wizards and muggles abstain entirely from chocolate. How willing are we to conclude that they don't? It's common to accept a 5% or 1% chance of rejecting the null hypothesis wrongly, though some disciplines are okay with even a 10% chance. Whatever chance we accept, we'll be looking for statistical significance "at that level", for instance, statistical significance at the 5% level, otherwise known as statistical significance with p < .05.

Checking for Statistical Significance (Through Simulation)

To show that our results are statistically significant at the 5% level, we have to show that if the null hypothesis is true, we would expect to get results as extreme as ours less than 5% of the time, if we repeated our experiment many times.  To do this, we don't actually repeat the experiment many times-- remember, we don't know if the null hypothesis is true in the real world. Instead, we can simulate repeating it many times in a computer world where the null hypothesis is true, or we can use analytical methods to work out what the results of such simulations would be.

We use what we know in order to make the null hypothesis more precise so that we can carry out our simulations. The null hypothesis that we started with in our example just said that wizards and muggles are equally likely to eat chocolate, but now we have data that says that 9 of 10 people surveyed consume chocolate. So the best we can do is to assume that the null hypothesis is true, and 90% of people eat at least one piece of chocolate per day and 10% eat no chocolate at all. Since the null hypothesis says there is no consumption difference between wizards and muggles, in our simulation these figures are the same for both groups.

One way to simulate this situation is simply to erase the wizard/muggle labels from our observations and replace them randomly, so that a random five observations are from "wizards" and the other five are from "muggles". If we simulate this way (also known as sampling without replacement), then in every simulated sample either the wizards or the muggles appear to eat more chocolate, to an extent exactly as extreme as in our original sample, because the one abstainer is always assigned to either a wizard or a muggle.  We would conclude that our findings are not statistically significant.
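Here is a minimal sketch of that label-shuffling simulation in Python. (The post's actual code is on github; this reconstruction, including every name in it, is my own guess at the approach.)

    import random

    # Our ten survey responses: nine chocolate eaters (1) and one abstainer (0).
    observations = [1] * 9 + [0]

    trials = 10_000
    as_extreme = 0
    for _ in range(trials):
        random.shuffle(observations)        # erase the labels and reassign randomly
        wizards, muggles = observations[:5], observations[5:]
        if sum(wizards) != sum(muggles):    # one group appears to eat more chocolate
            as_extreme += 1

    print(as_extreme / trials)  # 1.0: every shuffled sample is as extreme as ours

Results as extreme as ours appear 100% of the time under the null hypothesis, nowhere near the 5% cutoff.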

Another way to simulate it is to sample with replacement. This time, we will make five "wizard" observations and five "muggle" observations, each with a 90% chance of eating chocolate and a 10% chance of abstaining.  We will get some cases where exactly one of our observations is an abstainer, as in the original sample, and others where no one abstains, or where several people do.  The situation is more complicated than the previous case, and much of the time we do get samples showing exactly the same consumption of chocolate for wizards and muggles, but over half the time, we don't. Again, our single-observation difference is not considered statistically significant. This agrees with our intuition that we need a larger sample size if we want to find a difference between the populations that wasn't immediately obvious.
[Chart: Randomly sampling with replacement 10,000 times, the most common situation is for wizards and muggles to have the same number of abstainers in our sample, but this case makes up fewer than half of our samples.]
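A sketch of this version, again in Python of my own devising rather than the post's actual code:

    import random

    trials = 10_000
    same_count = 0
    for _ in range(trials):
        # Under the null hypothesis, every observation, wizard or muggle,
        # has a 90% chance of being a chocolate eater.
        wizard_eaters = sum(random.random() < 0.9 for _ in range(5))
        muggle_eaters = sum(random.random() < 0.9 for _ in range(5))
        if wizard_eaters == muggle_eaters:
            same_count += 1

    print(same_count / trials)  # about 0.46: the most common outcome, but under half

Because samples at least as lopsided as ours occur more than half the time under the null hypothesis, the difference is again far from significant.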

Tweaking the Example

Things get more complicated if we have much more data about one population than the other. For instance, suppose the five wizards we surveyed were all the wizards in the world. Then it is inappropriate to suppose that we could have sampled from a population that includes wizards who do not eat chocolate; all the wizards do eat chocolate. And one of the muggles doesn't. There is no way the null hypothesis could be true now.  This is not just statistical significance, which says that if the null hypothesis is true our results are unlikely. Our study actually disproved the null hypothesis, which is much stronger. (And, practically speaking, almost never the case.)

The example above is a special case of a more general situation in which much more is known about the distribution of one population than about the other. Another example of such a situation: suppose our study was carried out by wizard researchers who knew very few muggles, so that they surveyed not 5 wizards and 5 muggles but 5000 wizards and 5 muggles. Let's say they found 50 wizards who did not eat chocolate and, as before, 1 muggle who did not eat chocolate. They can do either of our tests above; let's see what happens.

If they sample without replacement, they're looking at 5005 total observations, of which 51 are chocolate-abstainers. Randomly assigning five of these observations to be "muggles" 10,000 times, in 9,511 cases I got no muggles who were chocolate abstainers. This is (just) over 95% of my samples, so in fewer than 5% of cases, I got a result as extreme as the wizard researchers' original result.[1] From one muggle chocolate-abstainer in their sample, they could conclude statistical significance at the 5% level.
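The without-replacement version of this test looks like the following sketch (hypothetical Python, as before, not the post's code):

    import random

    # 5005 total observations, 51 of them chocolate abstainers (0s).
    population = [0] * 51 + [1] * (5005 - 51)

    trials = 10_000
    no_muggle_abstainers = 0
    for _ in range(trials):
        muggles = random.sample(population, 5)  # label a random five as "muggles"
        if 0 not in muggles:
            no_muggle_abstainers += 1

    print(no_muggle_abstainers / trials)  # about 0.95, like the 9,511/10,000 above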

Sampling with replacement is technically easier and gives similar results. In 9,526 of my 10,000 repetitions, no muggles were chocolate abstainers. Once again, this would allow the researchers to conclude a statistically significant difference at the 5% level based on the one muggle chocolate abstainer in their original sample.  This doesn't agree so well with our intuitions; although the researchers now have a sample of wizards that seems large enough to determine their chocolate consumption habits with some precision, the sample of muggles still feels awfully small for most purposes.  
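With replacement, each of the five "muggle" observations independently abstains with probability 51/5005, so we can also check the simulation against a closed-form answer (my own quick check, not something from the post):

    p_abstain = 51 / 5005          # pooled abstention rate under the null hypothesis
    print((1 - p_abstain) ** 5)    # about 0.95, consistent with 9,526 out of 10,000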

A Broader View

By having a lot of information about the proportion of wizards who eat chocolate, the researchers in the last example are able to use very little information about the proportion of muggles who do to conclude that the two populations are different. This isn't all bad, and sometimes such techniques are necessary. But their results, even though statistically significant by both methods described, aren't as solid as that makes them sound. The small size of their sample of muggles makes it especially easy for them to accidentally get a sample that is not representative of the total muggle population by bad sampling methodology or for some other reason.  Imagine polling the next five people you see about some issue, and then imagine polling the next 1000 people you see. Both these polls use the same method to get a non-random sample of the population of your area, but the first poll will likely have respondents that are much less diverse, and a much less representative sample, simply because there are so few of them.  

This example of researchers who have an easier time finding subjects in one of the groups they're studying than in another is not purely fictional; there are many reasons it may occur in reality. An intervention might be much more expensive or difficult than the corresponding control procedure and follow-up data collection, so that it is easier to fund and conduct a study with a large control group and a small treatment group than one with two groups of equal size. Researchers might be trying to compare a group that is a small fraction of the population with a group that is a much larger fraction of the population.  In the case of the study I was working on in yesterday's post, we expected under our original design to have trouble finding the targets of our intervention for follow-up, and therefore expected to survey many more people from the control group (in that case, the population from which we had originally drawn intervention targets) than from the treatment group.

In any case, if a study design allows for statistical inferences to be drawn from a small number of surprising observations in the treatment group, caution is warranted.  Use common sense as a back-up check on the meaningfulness of statistical significance, just as you use statistical significance as a back-up check on the meaningfulness of your intuitive reaction to the data.

Code used in this post is available on github.

[1]: In this case, "as extreme as" means any result but the most likely one (zero muggle abstainers). Normally we'd need to check both for cases where the muggle group has too many abstainers and for cases where it has too few, but since zero is the most likely number for it to have under the null hypothesis, in this case we don't have to worry about there being too few.



Thursday, July 18, 2013

Statistical power at small sample sizes

Recently I was working on a team to design a study where we expected to find relatively few examples of the phenomenon we were studying. While it wasn't out of the question that we'd find an example in the control population, we expected to find very few examples there, and only a few more in the treatment group. I made a chart to help us understand how large a sample size we'd need to decide our intervention was making a difference, and it looked something like this:
[Chart: the probability of finding a statistically significant effect, by treatment-group sample size, with one line per significance level.]

Each line on this chart shows the probability that we would find our treatment had an effect significant at a given level, depending on the size of our treatment group. But why are there sudden drops in the probability that we'd find a statistically significant effect? And why, for so many sample sizes, does it not appear to matter whether we're seeking an effect significant at the 5% or 10% level?

The answer is that this is the wrong chart to draw, in this situation. If we expected only 1 in 2000 observations from the control group to be a success, and 1 in 500 observations from the treatment group to be a success, a simple application of the definition of significance levels gives the graph above. But besides looking weird, it is misleading. Notice where it implies that we'd have about a 30% chance of concluding our treatment works, with a sample size of about 225? Well, that's the probability we'd get at least one success in 225 trials.
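To make the chart concrete, here is a rough Python sketch of the calculation behind it. The success rates (1 in 2000 for control, 1 in 500 for treatment) come from the description above; everything else, including the helper names, is my own guess at the approach:

    from math import comb

    def prob_at_least(k, n, p):
        """P(at least k successes in n trials, each succeeding with probability p)."""
        return sum(comb(n, i) * p**i * (1 - p)**(n - i) for i in range(k, n + 1))

    def power(n, p_control, p_treatment, alpha):
        # The smallest number of successes that would be significant at level alpha...
        k = 1
        while prob_at_least(k, n, p_control) >= alpha:
            k += 1
        # ...and the probability that the treatment group actually produces that many.
        return prob_at_least(k, n, p_treatment)

    # At n = 200, a single success is (just) significant at the 10% level,
    # and the chance of the treatment group producing one is about 33%.
    print(power(200, 1/2000, 1/500, 0.10))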
[Chart: the probability of getting at least a given number of successes, by treatment-group sample size, with one line per success count.]

This chart, where each line shows the probability of getting at minimum a given number of successes, depending on the size of the treatment group, is much easier to read, and much more illuminating. Even if a single success in the treatment group would technically be statistically significant, our study would be much more persuasive if we chose a sample size that allowed us to expect to find two or three successes, at a minimum.

Overlaid, the charts are even more useful. They show the probability of getting a certain number of successes at a given sample size, together with the level of statistical significance that would imply. And they answer our questions about the first chart-- the drops come when a particular level of statistical significance requires us to find an additional success, and the stretches where two significance levels look identical are stretches where both levels require the same number of successes.
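Reusing the hypothetical helpers sketched above, we can watch one of those drops happen: between n = 210 and n = 211, significance at the 10% level starts to require a second success, and the probability of achieving it falls off a cliff.

    # Hypothetical continuation of the sketch above.
    for n in (180, 200, 210, 211, 225):
        print(n, round(power(n, 1/2000, 1/500, 0.10), 3))
    # prints roughly: 180 0.303, 200 0.33, 210 0.343, 211 0.067, 225 0.075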


Code used in this post is available on github.
