What is a large enough sample?

In my previous entry, I tried to clear up some of my own confusion about the Central Limit Theorem (CLT) and explained why it is such a valuable theoretical concept in statistics. To recap, the CLT describes how the means of a random sample of an unknown sampling distribution approach a normal distribution as the sample size $n$ approaches $\infty$. The uncertainty about our estimate of the mean of the original sampling distribution is given by $\sigma / \sqrt{n}$, where $\sigma$ is the standard deviation of the sampling distribution. We can see that the larger the sample size, the more certain we are about our estimate of the true mean.

The obvious practical question is what is a large enough sample size? The short answer is, it depends. A sample size of 30 is a pretty save bet for most real life applications.

To investigate the influence of sample size on the convergence of the distribution of the means, I will use simulated sampling from three different sampling distributions. All simulations were done using R. The code can be found on Github.

CLT in (simulated) action

Let’s consider a normal sampling distribution to start with. This is useful to illustrate the idea of how the uncertainty of our estimate of the true mean depends on the sample size $n$. Here is our normal sampling distribution with $\mu$ = 4 and $\sigma$ = 2. Now we generate a large number $m$ of random samples each with sample size $n$ and calculate their means. If this confuses you, you are not alone. For now, understand that the only variable we are changing is the sample size $n$. $m$ will just be a “large number”, such as 10000 in our case, so that we can draw a histogram of 10000 simulated means. We will do this four time, each time with a different sample size of $n$ being either 2, 5, 15, or 30.

The histogram shows the distribution of simulated means and the blue curve illustrates the normal distribution predicted by the CLT with a mean of $\mu$ and a standard deviation of $\sigma / \sqrt{n}$. In the lower panel, I show quantile-quantile plots to investigate the how well the distribution of the means fits a theoretical normal distribution. Unsurprisingly, the means of random samples drawn from a perfect normal distribution are themselves normally distributed. Even with a sample size as small as 2. It is intuitive that small sample sizes have more uncertainty associated with our estimate of the true mean, which is reflected by the relatively broad normal distribution of the means. As we increase the sample size the distribution of the means becomes more pointy and narrow, indicating that our estimate of the true mean $\mu$ becomes more and more accurate. Note also, that the y-axis changes as we increase the sample size. This is a visual confirmation that the standard deviation of the distribution of the means is given by $\sigma / \sqrt{n}$.

Let’s turn to an exponential sampling distribution with $\lambda$ = 1/4 next. Recall that both the mean and standard deviation of an exponential distribution is $1 / \lambda$. This one is clearly not normal. I simulated random samples for different sample sizes as described above for the normal distribtion and calculated the means. At smaller sample sizes, the deviation of the actual distribution of the means from the theoretical distribution of the means is obvious. It clearly retains some characteristics of an exponential distribution. As we increase the sample size, the fit becomes better and better, until it eventually morphes into a normal distribution.

Does the CLT hold for an arbitrary distribution? Well, let’s consider this crazy sampling distribution I made up using a combination of normal, exponential and uniform distributions. Simulation of random samples using different sample sizes as before. As predicted, the CLT holds even for a non-standard sampling distribution. Granted, I did not challenge the assumptions of the CLT too much using for example an extreme tail (skew). I trust this is good enough to convince you that it would just take a few more samples before convergence.

Why is a sample size of 30 large enough?

Back to our original question: what is a large enough sample? We have seen that the major determinant is the shape of the sampling distribution. The more normal it is to begin with, the fewer samples we will need to reach convergence towards a normal distribution of the means.

In practice we do not generate 10000 random samples (10000 experiments!) to get a distribution of the means. We estimate the mean and standard deviation from a single random sample. The larger the random sample, the better will be our estimate of the true mean $\mu$ and the standard deviation $\sigma$. This follows directly from the Law of large numbers. It is often recommended in statistics textbooks that as a rule of thumb a sample size of 30 can be considered “large”. But why exactly 30? I think there is a practical and a pragmatic argument to be made.

In the simulations we saw that the distribution of the means of a random sample drawn from a (not too crazy) non-normal sampling distribution will be very close to normal. This means that our estimates of the mean and standard deviation of that distribution will be sufficient to describe the distribution of the means and we can use them in hypothesis testing with some confidence (no pun intended).

A more pragmatic argument would make use of the relationship between the sample size and our uncertainty about the true mean of the sampling distribution. Irrespective of the standard deviation $\sigma$ of the sampling distribution, the standard error $\sigma / \sqrt{n}$ decreases proportional to $\sqrt{n}$. Common sense dictates that increasing the sample size beyond a certain point will result in ever diminishing gains in precision. Here is a graphical representation of the relationship between the standard error and sample size. As you can see, a sample size of 30 sits right at the point where the curve stops to have an exponential and starts to have a linear decrease. In other words, a sample size of 30 represents the sweet spot in terms of the most “bang for the buck”, no matter the magnitude of the standard deviation of the original sampling distribution.

You might ask, what if the standard deviation is a large value? Well, then our estimate of the true mean will be pretty bad. We will have to increase the sample size and deal with the fact that gains in precision will be ever smaller as $n$ goes beyond 30.

In biomedical research we often face the situation that even a sample size of 30 is unattainable in terms of time or money. Fortunately, there is a solution for that dilemma: Student’s t-distribution. I will investigate how the CLT relates to the t-distribution and hypothesis testing in the next post.

Reproducibility

The full R code is available on Github.