If you've ever looked at a data visualization, you've probably seen a bell curve, the shape of the normal distribution. The fact that this shape shows up in so many datasets is no accident: it is a consequence of the law of large numbers and the central limit theorem.
The central limit theorem tells us that, under fairly mild conditions, the distribution of a suitably normalized sample mean converges to a standard normal distribution as the sample size grows, regardless of the shape of the underlying distribution.
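To see this in action before we get to the book example, here is a minimal sketch: it draws many small samples from a deliberately skewed distribution and plots the distribution of their means. The exponential source distribution, the sample sizes, and the output file name are all arbitrary choices for illustration.
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt
# Draw 10,000 samples of size 30 from a deliberately skewed (exponential) distribution.
np.random.seed(0)
sample_means = np.random.exponential(scale=100, size=(10000, 30)).mean(axis=1)
# The individual draws are heavily skewed, but the sample means look roughly normal.
plt.figure(figsize=(7, 5))
sns.histplot(sample_means, bins=30, kde=True)
plt.title('Means of 10,000 Samples (n=30) from an Exponential Distribution')
plt.xlabel('Sample Mean')
plt.ylabel('Frequency')
plt.savefig("clt_demo.jpg", dpi=300, bbox_inches='tight')
plt.close()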
For example, let's say that we wish to chart the forty most popular science-fiction books on Goodreads by the number of pages they contain.
Our initial sample will look something like this:
page_counts = np.array([
324, 216, 384, 194, 480, 368, 374, 268, 244, 258,
476, 472, 391, 390, 144, 288, 118, 592, 224, 342,
382, 336, 450, 500, 304, 297, 192, 320, 487, 260,
250, 525, 182, 275, 400, 576, 518, 318, 208, 256
])
If we want to plot our original sample of books, we could do something like:
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt
page_counts = np.array([
324, 216, 384, 194, 480, 368, 374, 268, 244, 258,
476, 472, 391, 390, 144, 288, 118, 592, 224, 342,
382, 336, 450, 500, 304, 297, 192, 320, 487, 260,
250, 525, 182, 275, 400, 576, 518, 318, 208, 256
])
plt.figure(figsize=(7, 5))
sns.histplot(page_counts, bins=10, kde=False, color='#1f77b4', edgecolor='black')
plt.title('Histogram of Book Pages')
plt.xlabel('Page Count')
plt.ylabel('Frequency')
plt.savefig("histogram.jpg", dpi=300, bbox_inches='tight')
plt.close()
This will produce a histogram of the raw page counts, saved as histogram.jpg.
But if we want to bootstrap our dataset, we will have to resample it. Sampling with replacement, which is what we will use in this example, works like this. Suppose we have a dataset of only three values:
small_sample = np.array([
216, 324, 385
])
The resampling process will randomly sample from this set. For example:
- Resample #1: [216, 324, 324] -> mean = 288.0
- Resample #2: [385, 385, 216] -> mean = 328.67
- Resample #3: [324, 216, 216] -> mean = 252.0
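In code, a single resample like the ones above can be drawn with np.random.choice, which samples with replacement when replace=True. Here is a minimal sketch using the small_sample array from above:
# Draw one resample of the same size as the original, with replacement.
resample = np.random.choice(small_sample, size=len(small_sample), replace=True)
print(resample, resample.mean())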
If we repeat this process many times, the distribution of the resampled means will approximate a normal distribution, as predicted by the central limit theorem. We can append the following Python code to bootstrap our full dataset of page counts and graph the result:
# Fix the random seed so the resampling is reproducible.
np.random.seed(42)
# Draw 10,000 bootstrap resamples (each the same size as the original sample)
# and compute the mean of each one.
num_samples = 10000
bootstrap_means = np.random.choice(page_counts, (num_samples, len(page_counts)),
                                   replace=True).mean(axis=1)
plt.figure(figsize=(7, 5))
sns.histplot(bootstrap_means, bins=30, kde=True, color='#ff7f0e', edgecolor='black')
plt.title('Bootstrapped Distribution of Page Counts')
plt.xlabel('Mean Page Count')
plt.ylabel('Frequency')
plt.savefig("bootstrapped_distribution.jpg", dpi=300, bbox_inches='tight')
plt.close()
This process is extremely useful for both modeling and hypothesis testing. If we want to make a claim about science-fiction books in general, such as how many pages they typically contain, but we only have a small sample of books to work with, we can use bootstrapping to generate many simulated samples and approximate the distribution of the statistic we want to study.
It's important to note that resampling isn't done to estimate the distribution of the data itself; the original sample already serves as our empirical model of the data, in this case the page counts of science-fiction books.
Rather, by resampling we approximate the sampling distribution of a given statistic, such as the mean. This allows us to make inferences about the underlying population, even when the original sample is small.
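As a rough sketch of what such an inference might look like (continuing from the bootstrap code above; the variable names and the choice of the median as a second statistic are just for illustration), the spread of the bootstrapped means estimates the standard error of the mean, and the same recipe works for other statistics:
# The standard deviation of the bootstrapped means estimates
# the standard error of the sample mean.
print("Bootstrap estimate of the mean:", bootstrap_means.mean())
print("Bootstrap standard error of the mean:", bootstrap_means.std(ddof=1))
# The same approach works for other statistics, e.g. the median.
bootstrap_medians = np.median(
    np.random.choice(page_counts, (num_samples, len(page_counts)), replace=True),
    axis=1
)
print("Bootstrap standard error of the median:", bootstrap_medians.std(ddof=1))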
We could also use the bootstrap distribution to construct confidence intervals, which we'll discuss in a future post.