It is a fairly well-known assumption in statistics that a sample size of 30 is a so-called magic number for estimating distributions or statistical errors. The problem is, firstly, that according to Andrew Messing of the Center for Brain Science at Harvard University, like a lot of rule-of-thumb measures, this assumption does not have a solid theoretical basis to prove its veracity. Secondly, the number 30 is itself arbitrary, and some textbooks give alternative magic numbers of 50 or 20; examples can be found in "Is 30 the magic number? Issues in sample size estimation" or "Shamanism as Statistical Knowledge: Is a Sample Size of 30 All You Need?". The funny thing is that there is no formal proof that any of these numbers are useful, because they all rely on assumptions that can fail to hold in one or more ways, and as a result the adequate sample size cannot be derived using the methods typically taught (and used) in the medical, social, cognitive, and behavioral sciences.

Like so many others before me, I got to thinking. One of my domains is healthcare data analytics, a field perpetually inundated with data. Should I test this rule of thumb and see if there is any truth to it?

Let’s set the background first. The data I was looking at centers on the treatment of Hepatitis C. The goal of Hepatitis C therapy is to clear the patient’s blood of the Hepatitis C virus (HCV). During this treatment, doctors routinely monitor the level of virus in the patient’s blood, a measurement known as viral load, typically reported in International Units per milliliter (IU/mL). When I was slicing and dicing the data by criteria such as age, sex, and genotype to report treatment effectiveness, the resulting cohorts sometimes became quite small. So what should the minimum size of my sample be before I can confidently report a result?

Let’s look at a fairly simple mathematical model now. In this specific case, we assumed a t-distribution for our data. The t-distribution is practically engineered to give a better estimate of our confidence intervals when the sample size is small. It looks very similar to a normal distribution: it is centered on the mean of our sampling distribution, but it has fatter tails. So normally what we do is find an estimate of the true standard deviation, and then we can say that the standard deviation of the sampling distribution is equal to the true standard deviation of the population divided by the square root of n, the sample size.

This is especially useful since we seldom, if ever, know the true standard deviation. If we don’t know it, the best thing we can plug in is our sample standard deviation.
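In symbols, with s, the sample standard deviation, standing in for the unknown population value σ, the standard error of the mean described above is:

```latex
SE_{\bar{x}} = \frac{\sigma}{\sqrt{n}} \approx \frac{s}{\sqrt{n}}
```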

We do not call this estimate a probability interval; rather, it is a “confidence interval”, because we are making some assumptions. The interval itself changes from sample to sample, and in particular we expect it to be a particularly poor estimate when we have a really small sample size.
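To see how unstable the interval itself becomes with a small sample, here is a small synthetic simulation (my own illustration, using made-up normal data rather than the patient records) comparing how much the width of a 95% t-interval swings from sample to sample at n = 5 versus n = 30:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(42)

# Repeatedly sample from the same known population and record the width of the
# 95% t-based interval each time; the widths vary far more at n = 5 than at n = 30.
for n in (5, 30):
    widths = []
    for _ in range(1000):
        sample = rng.normal(loc=100.0, scale=15.0, size=n)
        se = sample.std(ddof=1) / np.sqrt(n)      # standard error from this sample
        crit = stats.t.ppf(0.975, df=n - 1)       # 95% critical value from the t-table
        widths.append(2 * crit * se)
    print(f"n = {n}: interval widths range from {min(widths):.1f} to {max(widths):.1f}")
```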

Accordingly, we calculated confidence intervals for the data using the above procedure. If the size of the sample was above a cut-off, say 30, we used Z-scores; otherwise we used the t-table. (A full treatment of t-tables and Z-scores is outside the scope of this article.) This basically means that we first find the mean, then the standard deviation, and finally the standard error, which is equal to the standard deviation divided by the square root of the sample size. We then find the critical value either from the t-table or from the Z-score, as mentioned above. Finally, the interval at a specific % confidence is equal to the mean +/- the margin of error, which is the critical value multiplied by the standard error.
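To make the procedure concrete, here is a minimal sketch in Python. The `confidence_interval` helper, the 30-sample cut-off parameter, and the simulated viral-load values are my own illustration under the assumptions above, not the original analysis code:

```python
import numpy as np
from scipy import stats

def confidence_interval(sample, confidence=0.95, cutoff=30):
    """Mean +/- margin of error, using a Z-score above the cut-off and the t-table below it."""
    sample = np.asarray(sample, dtype=float)
    n = len(sample)
    mean = sample.mean()
    sd = sample.std(ddof=1)               # sample standard deviation
    se = sd / np.sqrt(n)                  # standard error

    if n > cutoff:
        crit = stats.norm.ppf(0.5 + confidence / 2)           # Z-score
    else:
        crit = stats.t.ppf(0.5 + confidence / 2, df=n - 1)    # t-table lookup

    margin = crit * se
    return mean - margin, mean + margin

# Illustrative viral-load values (IU/mL) for a cohort of 28; not the actual patient data.
cohort = np.random.default_rng(0).lognormal(mean=13.0, sigma=1.5, size=28)
print(confidence_interval(cohort))
```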

The graph below shows the results where the sample size was 1,720 patients.

The next graph shows the results where the sample size was 28.

In both of these cases, it appears that a sample size of around 30 gives us enough statistical confidence in the results we are presenting. In both cases, however, we are bound by the fact that comparing effectiveness across treatments is probably best related to the size of the sample (cohort) itself, with the utilization factor being the closest metric. For the calculated values within each category, though, we should be able to report the numbers with a prescribed confidence interval. In all the calculations presented above, the confidence level was 95%.

So we picked an “arbitrary” set of healthcare data, and somehow a sample size of around 30 turned out to be an adequately large number to generate dependable statistics. But the question remains: why?

The best rationale I have come across for why this is such a popular number was given by Christopher C. Rout of the Department of Anesthetics and Critical Care, University of KwaZulu-Natal, Durban, South Africa. According to him, it is not that 30 is “enough”, but rather that we need “at least” 30 samples before we can reasonably expect an analysis based upon the normal distribution (i.e., a Z test) to be valid. That is, it represents a threshold above which the sample size is no longer considered small. It may have to do with the difference between 1/(n-1) and 1/n, the factors that appear in the two common variance estimators. At about 30 (actually between 32 and 33) this difference becomes less than 0.001, so the intuitive sense is that at or around that sample size, moving to larger samples no longer contributes much to the probability distribution calculation, and the estimated error drops to acceptable levels.
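A quick numerical check of that crossover point (my own verification, not part of Rout’s comment):

```python
# 1/(n-1) - 1/n simplifies to 1/(n*(n-1)), which drops below 0.001 once n*(n-1) > 1000.
for n in range(30, 36):
    print(n, round(1 / (n - 1) - 1 / n, 6))
# n = 32 -> 0.001008, n = 33 -> 0.000947: the crossing sits between 32 and 33.
```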

Well, sounds about right to me.