Unlimited confusion: The central limit theorem

Open any statistics textbook, and it won’t be long until you encounter the Central Limit Theorem (CLT). You will learn that it is the basis of key concepts of inferential statistics such as the well-known t-test. In my experience, the CLT is also a source of great confusion, as it is surprisingly hard to wrap your head around.

If statisticians had their way, everything would have a normal (Gaussian) distribution. And in some ways, maybe that would be a fairer world to live in. Take salaries, for example. However, as it stands, not everything is normally distributed. We have a lot of small earthquakes and few strong ones, which is typical for an exponential distribution (or a Pareto distribution in this particular case).

The reason why the CLT is so important in statistics is that it allows us to describe the means of random samples of (virtually) any distribution with the parameters of a normal distribution (mean \mu and standard deviation \sigma) given that the sample size n of those random samples is large enough.

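In more precise language, the CLT states that the mean \bar{x} of a random sample of independent observations x_1, x_2, ..., x_n taken from the sampling distribution S with mean \mu and standard deviation \sigma approximately follows a normal distribution centered at \mu with standard deviation \sigma / \sqrt{n}, and that this approximation keeps improving as n \rightarrow \infty.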

There is a lot of information in this sentence. Let’s deconstruct it bit by bit and discuss the implications.

We need (virtually) no knowledge of the sampling distribution

The CLT assures us that irrespective of the shape of the original distribution we are sampling from, the means of random samples will approach a normal distribution. In practice, this means that we do not need any information on how the random samples are generated. We just need to take a large enough sample and analyze it using well-developed statistical methods. In other words, every statistician’s dream.
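To make this concrete, here is a minimal simulation sketch using only the Python standard library. The choice of an exponential distribution with rate 1 (so \mu = 1 and \sigma = 1), the values n = 50 and m = 5000, and the helper names sample_means and fraction_within are my own assumptions for illustration. If the means are roughly normal, about 68% of them should fall within one \sigma / \sqrt{n} of \mu and about 95% within two.

```python
import math
import random
import statistics


def sample_means(draw, n, m, seed=0):
    """Return the means of m random samples, each with n observations."""
    rng = random.Random(seed)
    return [sum(draw(rng) for _ in range(n)) / n for _ in range(m)]


def fraction_within(means, center, half_width):
    """Fraction of the means that lie within center +/- half_width."""
    return sum(abs(x - center) <= half_width for x in means) / len(means)


# Exponential distribution with rate 1: strongly skewed, mu = 1 and sigma = 1.
mu, sigma = 1.0, 1.0
n, m = 50, 5000
means = sample_means(lambda rng: rng.expovariate(1.0), n, m)
se = sigma / math.sqrt(n)

print("mean of the means:", round(statistics.fmean(means), 3))
print("std of the means: ", round(statistics.pstdev(means), 3), "vs", round(se, 3))
# For a normal distribution, about 68% and 95% of values fall
# within 1 and 2 standard deviations of the center.
print("within 1 se:", fraction_within(means, mu, se))
print("within 2 se:", fraction_within(means, mu, 2 * se))
```

Swapping the lambda for another generator, for example rng.uniform(0, 1) with the matching \mu and \sigma, is a quick way to try a different starting distribution.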

Why do I say virtually no knowledge? In fact, it is possible to construct sampling distributions that break the CLT. That happens if the sampling distribution does not have a finite mean or a finite standard deviation, as is the case for very heavy-tailed distributions. You will rarely encounter such sampling distributions in your everyday experiments, so for most purposes it is more of a technicality.
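A classic example of such a distribution is the Cauchy distribution, which has neither a finite mean nor a finite variance. The sketch below is an illustration under my own assumptions (the helper names cauchy and iqr_of_means, and the sample counts, are made up for this example). It measures the spread of the sample means with the interquartile range, because the standard deviation is not meaningful here. For the exponential distribution the spread shrinks as n grows, while for the Cauchy distribution it should stay at roughly 2 no matter how large n gets.

```python
import math
import random
import statistics


def cauchy(rng):
    """Draw from a standard Cauchy distribution via inverse transform sampling."""
    return math.tan(math.pi * (rng.random() - 0.5))


def exponential(rng):
    """Draw from an exponential distribution with rate 1."""
    return rng.expovariate(1.0)


def iqr_of_means(draw, n, m, seed=0):
    """Interquartile range of the means of m random samples of size n."""
    rng = random.Random(seed)
    means = [sum(draw(rng) for _ in range(n)) / n for _ in range(m)]
    q1, _, q3 = statistics.quantiles(means, n=4)
    return q3 - q1


for n in (10, 100, 500):
    print(
        f"n = {n:3d}",
        f"exponential IQR = {iqr_of_means(exponential, n, m=500):.3f}",
        f"Cauchy IQR = {iqr_of_means(cauchy, n, m=500):.3f}",
    )
```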

What are we sampling here?

For me, the major point of confusion about the CLT is that there are apparently two forms of sampling going on. First, each random sample drawn from the sampling distribution consists of a certain number of observations n. This random sample X will give us exactly one mean \bar{X}. How do we get to a distribution from exactly one number? The CLT says that if we take m random samples, each with a sample size of n, the m means of the random samples will approach a normal distribution.

So does the statement “given that the sample size is large” refer to the number of instances per random sample n or the number of random samples m? Common sense says that it has to refer to n to be practical. If we had to conduct m experiments, each with sample size n, it would be either too time-consuming or too expensive, especially if both n and m need to be large.

The key to understanding the CLT is that we can estimate the parameters of the distribution of the means from a single random sample of sample size n because we know that if we took m more such random samples, their means would be distributed normally.

Earlier, we established that the mean of the distribution of the means will approach the mean of the sampling distribution. So, we can estimate this parameter \hat{\mu} from the mean \bar{X} of our random sample. Our estimate will most likely not be completely accurate, but the law of large numbers tells us that if n is not too small, our estimate will be reasonably good. But how good? What about the uncertainty of our measurement? The CLT tells us that the spread of the distribution of the means will be \sigma / \sqrt{n}.

The variance of the mean of a set of random variables is given by

Var(\bar{X}) = Var(\frac{1}{n} \sum_{i=1}^{n} X_i)

A constant factor can be pulled out of the variance if we square it, Var(aX) = a^2 Var(X), and according to the Bienaymé formula the variance of the sum of uncorrelated random variables is the sum of their variances.

Var(\frac{1}{n} \sum_{i=1}^{n} X_i) = \frac{1}{n^2} Var(\sum_{i=1}^{n} X_i) = \frac{1}{n^2} \sum_{i=1}^{n} Var(X_i)

The variances of the individual observations are identical because they all come from the same distribution.

\frac{1}{n^2} \sum_{i=1}^{n} Var(X_i) = \frac{1}{n^2} n Var(X_i) = \frac{1}{n}Var(X_i)

Thus, the variance of the mean is \sigma^2 / n and, accordingly, the standard deviation is \sigma / \sqrt{n}.

To add to the confusion, \sigma / \sqrt{n} is called the standard error of the mean, but it is just the standard deviation of the distribution of the means. It relates the standard deviation of the sampling distribution to the sample size and quantifies our uncertainty about our estimate of the true mean. The larger the sample size n, the closer the estimate \hat{\mu} will tend to be to the true mean \mu. The formula for the standard error of the mean tells us that the precision of our estimate increases with the square root of the sample size. In practical terms, if we want to decrease the standard error by a factor of 2, we need to increase the sample size by a factor of 4.
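The square-root relationship is easy to check numerically. The following is a small sketch under assumed settings: the uniform distribution on [0, 1) with \sigma = \sqrt{1/12}, the helper name std_of_means, and the number of repetitions m = 4000 are all my own choices for illustration. Each time n is quadrupled, the empirical standard deviation of the means should roughly halve and stay close to \sigma / \sqrt{n}.

```python
import math
import random
import statistics


def std_of_means(n, m=4000, seed=1):
    """Empirical standard deviation of m sample means, each from n uniform draws."""
    rng = random.Random(seed)
    means = [sum(rng.random() for _ in range(n)) / n for _ in range(m)]
    return statistics.pstdev(means)


# Standard deviation of the uniform distribution on [0, 1).
sigma = math.sqrt(1 / 12)

# Columns: sample size, empirical std of the means, theoretical sigma / sqrt(n).
for n in (25, 100, 400):
    print(n, round(std_of_means(n), 4), round(sigma / math.sqrt(n), 4))
```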

We typically do not know the true variance \sigma^2 of the sampling distribution. Again, we estimate it from our random sample, using the unbiased estimator s^2:

s^2 = \frac{1}{n-1}\sum_{i=1}^{n}(x_i - \bar{x})^2
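As a worked example, here is how the three quantities could be computed from a single random sample with the Python standard library. The eight measurements are hypothetical numbers that I made up for illustration. Note that statistics.variance divides by n - 1, which matches the formula for s^2 above, and that the standard error is estimated by plugging s in for the unknown \sigma.

```python
import math
import statistics

# Hypothetical measurements from a single random sample (n = 8).
sample = [2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0]

n = len(sample)
mean_hat = statistics.fmean(sample)  # estimate of mu
s2 = statistics.variance(sample)     # unbiased estimator, divides by n - 1
se_hat = math.sqrt(s2 / n)           # estimated standard error of the mean

print(f"mean = {mean_hat:.3f}, s^2 = {s2:.3f}, standard error = {se_hat:.3f}")
# mean = 5.000, s^2 = 4.571, standard error = 0.756
```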

The CLT demonstrates that the means of random samples drawn from an unknown sampling distribution will have an approximately normal distribution. That alone would be interesting but not particularly useful. The fact that we can estimate the mean and standard deviation of the distribution of those means from a single sample makes it a cornerstone of many key concepts of inferential statistics such as hypothesis testing and confidence intervals. The one reservation of the CLT is that the sample size needs to be large enough. Fortunately, we can turn to the t-distribution if we don’t quite meet the criteria of a “large enough” sample size.
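To illustrate how a single sample turns into a confidence interval, here is a hedged sketch based on the normal approximation. The function names confidence_interval and coverage, the exponential test distribution with true mean 1, and the settings n = 50 and 2000 repeated experiments are assumptions of mine. In a real experiment we would only ever compute one interval; the repetition is there purely to check how often such an interval contains the true mean. The result should come out somewhat below the nominal 95%, because n = 50 is finite and the exponential distribution is skewed.

```python
import math
import random
import statistics

# Two-sided 95% quantile of the standard normal distribution, about 1.96.
Z95 = statistics.NormalDist().inv_cdf(0.975)


def confidence_interval(sample, z=Z95):
    """Normal-approximation confidence interval for the mean of one random sample."""
    mean_hat = statistics.fmean(sample)
    se_hat = statistics.stdev(sample) / math.sqrt(len(sample))
    return mean_hat - z * se_hat, mean_hat + z * se_hat


def coverage(n, experiments=2000, seed=2):
    """Fraction of repeated experiments whose interval contains the true mean."""
    rng = random.Random(seed)
    true_mean = 1.0  # mean of the exponential distribution with rate 1
    hits = 0
    for _ in range(experiments):
        sample = [rng.expovariate(1.0) for _ in range(n)]
        low, high = confidence_interval(sample)
        hits += low <= true_mean <= high
    return hits / experiments


print("coverage for n = 50:", coverage(n=50))
```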

What is a large enough random sample?

The CLT states that the number of instances n that make up the random sample should approach \infty. That is clearly not practical. For most sampling distributions, especially if they already are close to a Gaussian distribution to start with, convergence will happen much sooner. A rough rule of thumb is that a sample size of n = 30 can be considered “large enough”. But it very much depends on the shape of the original sampling distribution. We will explore that in a follow-up post using simulation.
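As a small teaser, and not a substitute for a proper simulation study, the sketch below tracks how quickly the asymmetry of a skewed distribution washes out of the sample means. The exponential distribution, the helper names skewness and skewness_of_means, and the chosen sample sizes are my own assumptions. A normal distribution has a skewness of 0; for means of n exponential observations, theory gives 2 / \sqrt{n}, so at n = 30 some skew is still visible.

```python
import random
import statistics


def skewness(values):
    """Sample skewness: the third standardized moment (0 for a symmetric distribution)."""
    mean = statistics.fmean(values)
    sd = statistics.pstdev(values)
    return sum(((x - mean) / sd) ** 3 for x in values) / len(values)


def skewness_of_means(n, m=5000, seed=3):
    """Skewness of the means of m exponential random samples of size n."""
    rng = random.Random(seed)
    means = [sum(rng.expovariate(1.0) for _ in range(n)) / n for _ in range(m)]
    return skewness(means)


# Columns: sample size, simulated skewness, theoretical value 2 / sqrt(n).
for n in (5, 30, 120):
    print(n, round(skewness_of_means(n), 2), round(2 / n ** 0.5, 2))
```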
