# What is a large enough sample?

In my previous entry, I tried to clear up some of my own confusion about the Central Limit Theorem (CLT) and explained why it is such a valuable theoretical concept in statistics. To recap, the CLT describes how the means of random samples drawn from an unknown sampling distribution approach a normal distribution as the sample size $n$ approaches $\infty$. The uncertainty about our estimate of the mean of the original sampling distribution is given by $\sigma / \sqrt{n}$, where $\sigma$ is the standard deviation of the sampling distribution. We can see that the larger the sample size, the more certain we are about our estimate of the true mean.

The obvious practical question is: what is a large enough sample size? The short answer is that it depends. A sample size of 30 is a pretty safe bet for most real-life applications.

To investigate the influence of sample size on the convergence of the distribution of the means, I will use simulated sampling from three different sampling distributions. All simulations were done using R. The code can be found on GitHub.

### CLT in (simulated) action

Let’s consider a normal sampling distribution to start with. This is useful to illustrate the idea of how the uncertainty of our estimate of the true mean depends on the sample size $n$. Here is our normal sampling distribution with $\mu$ = 4 and $\sigma$ = 2.

Now we generate a large number $m$ of random samples, each with sample size $n$, and calculate their means. If this confuses you, you are not alone. For now, understand that the only variable we are changing is the sample size $n$. $m$ will just be a “large number”, such as 10000 in our case, so that we can draw a histogram of 10000 simulated means. We will do this four times, each time with a different sample size $n$: 2, 5, 15, or 30.

The histogram shows the distribution of simulated means, and the blue curve illustrates the normal distribution predicted by the CLT with a mean of $\mu$ and a standard deviation of $\sigma / \sqrt{n}$. In the lower panel, I show quantile-quantile plots to investigate how well the distribution of the means fits a theoretical normal distribution.

Unsurprisingly, the means of random samples drawn from a perfect normal distribution are themselves normally distributed, even with a sample size as small as 2. It is intuitive that small samples leave more uncertainty in our estimate of the true mean, which is reflected in the relatively broad normal distribution of the means. As we increase the sample size, the distribution of the means becomes narrower and more peaked, indicating that our estimate of the true mean $\mu$ becomes more and more precise. Note also that the scale of the y-axis changes as we increase the sample size: the density gets taller as it gets narrower. This is a visual confirmation that the standard deviation of the distribution of the means is given by $\sigma / \sqrt{n}$.
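The procedure above can be sketched in a few lines. This is a minimal Python sketch using only the standard library, not the R code used for the figures; the seed, the helper name `simulate_means`, and the `results` dictionary are assumptions made for illustration. It draws $m$ = 10000 samples for each $n$ and compares the standard deviation of the simulated means with $\sigma / \sqrt{n}$.

```python
import math
import random
import statistics

MU, SIGMA = 4.0, 2.0          # parameters of the normal sampling distribution
M = 10_000                    # number of simulated random samples per sample size
SAMPLE_SIZES = (2, 5, 15, 30)


def simulate_means(draw, n, m, rng):
    """Return m sample means, each computed from n values produced by draw(rng)."""
    return [statistics.fmean(draw(rng) for _ in range(n)) for _ in range(m)]


rng = random.Random(1)        # arbitrary fixed seed so the run is repeatable
results = {}
for n in SAMPLE_SIZES:
    means = simulate_means(lambda r: r.gauss(MU, SIGMA), n, M, rng)
    observed_sd = statistics.stdev(means)     # spread of the simulated means
    predicted_sd = SIGMA / math.sqrt(n)       # spread predicted by the CLT
    results[n] = (statistics.fmean(means), observed_sd, predicted_sd)
    print(f"n={n:2d}  mean of means={results[n][0]:.3f}  "
          f"sd of means={observed_sd:.3f}  sigma/sqrt(n)={predicted_sd:.3f}")
```

If the simulation behaves as described, the two rightmost columns should agree closely for every $n$, and the mean of the means should stay near $\mu$ = 4.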

Let’s turn to an exponential sampling distribution with $\lambda$ = 1/4 next. Recall that both the mean and the standard deviation of an exponential distribution are $1 / \lambda$. This one is clearly not normal.

I simulated random samples for different sample sizes as described above for the normal distribution and calculated the means.

At smaller sample sizes, the deviation of the actual distribution of the means from the theoretical distribution of the means is obvious. It clearly retains some characteristics of an exponential distribution. As we increase the sample size, the fit becomes better and better, until the distribution eventually morphs into a normal distribution.
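One way to put a number on "retains some characteristics of an exponential distribution" is the skewness of the simulated means, which is 0 for a normal distribution. The following is a Python sketch under my own assumptions (standard library only, arbitrary seed, hypothetical helper `skewness`), not the original R code. For reference, the mean of $n$ independent exponential values follows a gamma distribution with skewness $2 / \sqrt{n}$, so the skew should shrink as $n$ grows without vanishing at $n$ = 30.

```python
import math
import random
import statistics

RATE = 1 / 4                  # lambda of the exponential sampling distribution
M = 10_000                    # number of simulated random samples per sample size


def skewness(values):
    """Sample skewness: mean of the cubed standardized deviations."""
    mean = statistics.fmean(values)
    sd = statistics.pstdev(values, mu=mean)
    return statistics.fmean(((v - mean) / sd) ** 3 for v in values)


rng = random.Random(2)        # arbitrary fixed seed
skew_by_n = {}
for n in (2, 5, 15, 30):
    means = [statistics.fmean(rng.expovariate(RATE) for _ in range(n))
             for _ in range(M)]
    skew_by_n[n] = skewness(means)
    print(f"n={n:2d}  skewness of means={skew_by_n[n]:.3f}  "
          f"2/sqrt(n)={2 / math.sqrt(n):.3f}")
```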

Does the CLT hold for an arbitrary distribution? Well, let’s consider this crazy sampling distribution I made up using a combination of normal, exponential and uniform distributions.

Simulation of random samples using different sample sizes as before.

As predicted, the CLT holds even for a non-standard sampling distribution. Granted, I did not challenge the assumptions of the CLT too much using for example an extreme tail (skew). I trust this is good enough to convince you that it would just take a few more samples before convergence.
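The exact mixture used for the figures is not given here, so the sketch below uses a made-up one: the weights and component parameters in `draw_mixture` are purely hypothetical. It is a Python sketch (standard library only, arbitrary seed) that checks how often a simulated mean falls within $\pm 1.96\,\sigma / \sqrt{n}$ of the true mean. For a normal distribution of the means that fraction would be about 0.95.

```python
import math
import random
import statistics


def draw_mixture(rng):
    """One value from a hypothetical normal/exponential/uniform mixture."""
    u = rng.random()
    if u < 0.4:
        return rng.gauss(2.0, 0.5)        # 40%: normal component
    if u < 0.7:
        return rng.expovariate(0.5)       # 30%: exponential component
    return rng.uniform(6.0, 10.0)         # 30%: uniform component


rng = random.Random(3)                    # arbitrary fixed seed
# Approximate the mixture's true mean and standard deviation from a large draw.
population = [draw_mixture(rng) for _ in range(100_000)]
mu = statistics.fmean(population)
sigma = statistics.pstdev(population)

M = 10_000
within = {}
for n in (2, 5, 15, 30):
    se = sigma / math.sqrt(n)
    means = [statistics.fmean(draw_mixture(rng) for _ in range(n))
             for _ in range(M)]
    # Fraction of simulated means inside the interval a normal curve predicts.
    within[n] = sum(abs(x - mu) <= 1.96 * se for x in means) / M
    print(f"n={n:2d}  fraction within 1.96 standard errors={within[n]:.3f}")
```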

### Why is a sample size of 30 large enough?

Back to our original question: what is a large enough sample? We have seen that the major determinant is the shape of the sampling distribution. The more normal it is to begin with, the fewer samples we will need to reach convergence towards a normal distribution of the means.

In practice we do not generate 10000 random samples (10000 experiments!) to get a distribution of the means. We estimate the mean and standard deviation from a single random sample. The larger the random sample, the better will be our estimate of the true mean $\mu$ and the standard deviation $\sigma$. This follows directly from the Law of large numbers. It is often recommended in statistics textbooks that as a rule of thumb a sample size of 30 can be considered “large”. But why exactly 30? I think there is a practical and a pragmatic argument to be made.

In the simulations we saw that the distribution of the means of a random sample drawn from a (not too crazy) non-normal sampling distribution will be very close to normal. This means that our estimates of the mean and standard deviation of that distribution will be sufficient to describe the distribution of the means and we can use them in hypothesis testing with some confidence (no pun intended).

A more pragmatic argument would make use of the relationship between the sample size and our uncertainty about the true mean of the sampling distribution. Irrespective of the standard deviation $\sigma$ of the sampling distribution, the standard error $\sigma / \sqrt{n}$ decreases in proportion to $1 / \sqrt{n}$. Common sense dictates that increasing the sample size beyond a certain point will result in ever-diminishing gains in precision. Here is a graphical representation of the relationship between the standard error and sample size.

As you can see, a sample size of 30 sits right around the point where the curve stops dropping steeply and starts to decrease slowly, almost linearly. In other words, a sample size of 30 represents the sweet spot in terms of the most “bang for the buck”, no matter the magnitude of the standard deviation of the original sampling distribution.
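The numbers behind such a curve are easy to tabulate. This is a small Python sketch; the function names `standard_error` and `marginal_gain` and the choice $\sigma$ = 1 are my own, and any positive $\sigma$ only rescales the values. It prints the standard error and how much one additional observation improves it.

```python
import math


def standard_error(sigma, n):
    """Standard error of the mean for sample size n."""
    return sigma / math.sqrt(n)


def marginal_gain(sigma, n):
    """Reduction in standard error from adding one observation to n."""
    return standard_error(sigma, n) - standard_error(sigma, n + 1)


sigma = 1.0   # assumed value; the shape of the curve does not depend on it
for n in (2, 5, 10, 20, 30, 60, 120):
    print(f"n={n:3d}  SE={standard_error(sigma, n):.4f}  "
          f"gain from one more observation={marginal_gain(sigma, n):.5f}")
```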

You might ask, what if the standard deviation is a large value? Well, then our estimate of the true mean will be pretty bad. We will have to increase the sample size and deal with the fact that gains in precision will be ever smaller as $n$ goes beyond 30.

In biomedical research we often face the situation that even a sample size of 30 is unattainable in terms of time or money. Fortunately, there is a solution for that dilemma: Student’s t-distribution. I will investigate how the CLT relates to the t-distribution and hypothesis testing in the next post.

### Reproducibility

The full R code is available on GitHub.

# Unlimited confusion: The central limit theorem

Open any statistics textbook, and it won’t be long until you encounter the Central Limit Theorem (CLT). You will learn that it is the basis of key concepts of inferential statistics such as the well-known t-test. In my experience the CLT is also a source of great confusion as it is surprisingly hard to wrap your head around it.

If statisticians had their way, everything would have a normal (Gaussian) distribution. And in some ways, maybe that would be a fairer world to live in. Take salaries, for example. However, as it stands, not everything is normally distributed. We have a lot of small earthquakes and few strong ones, which is typical for an exponential distribution (or a Pareto distribution in this particular case).

The reason why the CLT is so important in statistics is that it allows us to describe the means of random samples of (virtually) any distribution with the parameters of a normal distribution (mean $\mu$ and standard deviation $\sigma$) given that the sample size $n$ of those random samples is large enough.

In more precise language, the CLT states that the mean $\bar{x}$ of a random sample $x_1, x_2, \ldots, x_n$ taken from the sampling distribution $S$ approximately follows a normal distribution centered at $\mu$ with standard deviation $\sigma / \sqrt{n}$, and that the approximation becomes exact as $n \rightarrow \infty$.

There is a lot of information in this sentence. Let’s deconstruct it bit by bit and discuss the implications.

### We need (virtually) no knowledge of the sampling distribution

The CLT assures us that, irrespective of the shape of the original distribution we are sampling from, the means of random samples will approach a normal distribution. In practice, this means that we do not need any information on how the random samples are generated. We just need to take a large enough sample and analyze it using well-developed statistical methods. In other words, every statistician’s dream.

Why do I say virtually no knowledge? It is in fact possible to construct sampling distributions that break the CLT. That happens if the sampling distribution does not have a finite mean or a finite standard deviation; the Cauchy distribution is the classic example. You are unlikely to encounter such sampling distributions in your everyday experiments, so it is more of a technicality.
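To see what "breaking the CLT" looks like, here is a Python sketch (standard library only, arbitrary seed, hypothetical helper names) using the standard Cauchy distribution, which has neither a finite mean nor a finite standard deviation. The mean of $n$ standard Cauchy values is itself standard Cauchy, so the spread of the means, measured here by the interquartile range, should stay near 2 instead of shrinking with $\sqrt{n}$.

```python
import math
import random
import statistics


def draw_cauchy(rng):
    """Standard Cauchy value via inverse transform: tan(pi * (U - 1/2))."""
    return math.tan(math.pi * (rng.random() - 0.5))


def iqr(values):
    """Interquartile range, a spread measure that exists for the Cauchy."""
    q1, _, q3 = statistics.quantiles(values, n=4)
    return q3 - q1


rng = random.Random(5)        # arbitrary fixed seed
M = 5_000                     # number of simulated random samples per sample size
iqr_by_n = {}
for n in (1, 10, 100):
    means = [statistics.fmean(draw_cauchy(rng) for _ in range(n))
             for _ in range(M)]
    iqr_by_n[n] = iqr(means)
    print(f"n={n:3d}  IQR of the means={iqr_by_n[n]:.2f}")
```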

### What are we sampling here?

For me, the major point of confusion about the CLT is that there are apparently two forms of sampling going on. First, each random sample drawn from the sampling distribution consists of a certain number of observations $n$. This random sample $X$ will give us exactly one mean $\bar{X}$. How do we get to a distribution from exactly one number? This is where the second form of sampling comes in: the CLT says that if we take $m$ random samples, each with a sample size of $n$, the $m$ means of the random samples will approach a normal distribution.

So does the statement “given that the sample size is large” refer to the number of instances per random sample $n$ or the number of random samples $m$? Common sense says that it has to refer to $n$ to be practical. If we had to conduct $m$ experiments, each with sample size $n$, it would be either too time-consuming or too expensive, especially if both $n$ and $m$ need to be large.

The key to understanding the CLT is that we can estimate the parameters of the distribution of the means from a single random sample of sample size $n$ because we know that if we took $m$ more such random samples, their means would be distributed normally.

Earlier, we established that the mean of the distribution of the means will approach the mean of the sampling distribution. So, we can estimate this parameter $\hat{\mu}$ from the mean $\bar{X}$ of our random sample. Our estimate will most likely not be completely accurate, but the Law of large numbers tells us that if $n$ is not too small, our estimate will be reasonably good. But how good? What about the uncertainty of our measurement? The CLT tells us that the spread of the distribution of the means will be $\sigma / \sqrt{n}$.

The variance of the mean of a set of random variables is given by

$\mathrm{Var}(\bar{X}) = \mathrm{Var}\left(\frac{1}{n} \sum_{i=1}^{n} X_i\right)$

According to the Bienaymé formula, the variance of a sum of uncorrelated random variables is the sum of their variances. Together with $\mathrm{Var}(aX) = a^2 \mathrm{Var}(X)$ for a constant $a$, this gives

$\mathrm{Var}\left(\frac{1}{n} \sum_{i=1}^{n} X_i\right) = \frac{1}{n^2} \mathrm{Var}\left(\sum_{i=1}^{n} X_i\right) = \frac{1}{n^2} \sum_{i=1}^{n} \mathrm{Var}(X_i)$

The variances of the $X_i$ are identical because all observations come from the same distribution, so $\mathrm{Var}(X_i) = \sigma^2$ for every $i$.

$\frac{1}{n^2} \sum_{i=1}^{n} \mathrm{Var}(X_i) = \frac{1}{n^2} \, n \sigma^2 = \frac{\sigma^2}{n}$

Thus, the variance of the mean is $\sigma^2 / n$ and, accordingly, the standard deviation is $\sigma / \sqrt{n}$.

To add more confusion, $\sigma / \sqrt{n}$ is called the standard error of the mean but it is just the standard deviation of the distribution of the means. It relates the standard deviation of the sampling distribution to the sample size and quantifies our uncertainty about our estimate of the true mean. The larger the sample size $n$, the closer the estimate $\hat{\mu}$ will be to the true mean $\mu$. The formula for the standard error of the mean tells us the precision of our estimate increases with the square root of the sample size. In practical terms, if we want to decrease the standard error by a factor of two, we need to increase the sample size by a factor of 4.
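The square-root relationship can be turned around to ask how many observations a given precision requires: solving $\sigma / \sqrt{n} \le \text{SE}_{target}$ for $n$ gives $n \ge (\sigma / \text{SE}_{target})^2$. A small Python sketch follows; the function name `required_sample_size` and the example values are assumptions for illustration.

```python
import math


def required_sample_size(sigma, target_se):
    """Smallest n for which sigma / sqrt(n) is at most target_se."""
    return math.ceil((sigma / target_se) ** 2)


# Halving the target standard error quadruples the required sample size.
print(required_sample_size(2.0, 1.0))   # 4
print(required_sample_size(2.0, 0.5))   # 16
```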

We typically do not know the true variance $\sigma^2$ of the sampling distribution. Again, we estimate it from our random sample, using the unbiased estimator $s^2$:

$s^2 = \frac{1}{n-1}\sum_{i=1}^{n}(x_i - \bar{x})^2$
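Why divide by $n - 1$ rather than $n$? A quick simulation illustrates the difference. This is a Python sketch under my own assumptions (normal sampling distribution with $\mu$ = 4 and $\sigma$ = 2, a deliberately small $n$ = 5, arbitrary seed). `statistics.variance` divides by $n - 1$ and `statistics.pvariance` divides by $n$; averaged over many samples, the first should land near $\sigma^2$ = 4 and the second near $\frac{n-1}{n}\sigma^2$ = 3.2.

```python
import random
import statistics

rng = random.Random(7)        # arbitrary fixed seed
MU, SIGMA = 4.0, 2.0          # assumed sampling distribution, true variance 4
N = 5                         # small sample size makes the bias easy to see
M = 20_000                    # number of simulated random samples

unbiased, biased = [], []
for _ in range(M):
    sample = [rng.gauss(MU, SIGMA) for _ in range(N)]
    unbiased.append(statistics.variance(sample))    # divides by n - 1
    biased.append(statistics.pvariance(sample))     # divides by n

mean_unbiased = statistics.fmean(unbiased)
mean_biased = statistics.fmean(biased)
print(f"average of s^2 with n-1 denominator: {mean_unbiased:.3f}")
print(f"average with n denominator:          {mean_biased:.3f}")
```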

The CLT states that the means of random samples drawn from an unknown sampling distribution will be approximately normally distributed. That alone would be interesting but not particularly useful. The fact that we can estimate the mean and standard deviation of the distribution of those means from a single sample makes it a cornerstone of many key concepts of inferential statistics, such as hypothesis testing and confidence intervals. The one reservation of the CLT is that the sample size needs to be large enough. Fortunately, we can turn to the t-distribution if we don’t quite meet the criteria of a “large enough” sample size.
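In code, the single-sample workflow might look like the following Python sketch. The helper `estimate_mean_and_se`, the exponential example with $\lambda$ = 1/4, the seed, and the use of 1.96 for an approximate 95% interval are all assumptions for illustration. The interval relies on the normal approximation and therefore on $n$ being large enough.

```python
import math
import random
import statistics


def estimate_mean_and_se(sample):
    """Estimate the true mean and the standard error from one random sample."""
    mu_hat = statistics.fmean(sample)
    s = statistics.stdev(sample)              # uses the n - 1 denominator
    return mu_hat, s / math.sqrt(len(sample))


rng = random.Random(8)                        # arbitrary fixed seed
sample = [rng.expovariate(1 / 4) for _ in range(30)]   # one experiment, n = 30

mu_hat, se_hat = estimate_mean_and_se(sample)
low, high = mu_hat - 1.96 * se_hat, mu_hat + 1.96 * se_hat
print(f"estimated mean={mu_hat:.2f}  standard error={se_hat:.2f}")
print(f"approximate 95% interval: ({low:.2f}, {high:.2f})")
```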

### What is a large enough random sample?

The CLT states that the number of instances $n$ that make up the random sample should approach $\infty$. That is clearly not practical. For most sampling distributions, especially if they already are close to a Gaussian distribution to start with, convergence will happen much sooner. A rough rule of thumb is that a sample size of $n = 30$ can be considered “large enough”. But it very much depends on the shape of the original sampling distribution. We will explore that in a follow-up post using simulation.