In this video, I want to talk about what is easily
one of the most fundamental and profound concepts in statistics
and maybe in all of mathematics.
And that's the central limit theorem.
And what it tells us is we can start off
with any distribution that has a well-defined mean and
variance-- and if it has a well-defined variance,
it has a well-defined standard deviation.
And it could be a continuous distribution or a discrete one.
I'll draw a discrete one, just because it's easier
to imagine, at least for the purposes of this video.
So let's say I have a discrete probability distribution
function.
And I want to be very careful not
to make it look anything close to a normal distribution.
Because I want to show you the power of the central limit
theorem.
So let's say I have a distribution.
Let's say it could take on values 1 through 6.
1, 2, 3, 4, 5, 6.
It's some kind of crazy dice.
It's very likely to get a one.
Let's say it's impossible-- well,
let me make that a straight line.
You have a very high likelihood of getting a 1.
Let's say it's impossible to get a 2.
Let's say it's an OK likelihood of getting a 3 or a 4.
Let's say it's impossible to get a 5.
And let's say it's very likely to get a 6 like that.
So that's my probability distribution function.
If I were to draw a mean-- this is symmetric,
so maybe the mean would be something like that.
The mean would be halfway.
So that would be my mean right there.
The standard deviation maybe would
look-- it would be that far and that
far above and below the mean.
But that's my discrete probability distribution
function.
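
To put numbers on that picture, here is a minimal sketch of such a distribution. The video only draws bar heights, so the probabilities below are assumptions, chosen to be symmetric with 1 and 6 likely, 3 and 4 moderately likely, and 2 and 5 impossible. The names values and probs are likewise just for illustration.

```python
import math

# Assumed probabilities: the video never states exact values, only rough bar heights.
values = [1, 2, 3, 4, 5, 6]
probs = [0.35, 0.0, 0.15, 0.15, 0.0, 0.35]  # must sum to 1

# Mean and standard deviation of a discrete distribution.
mean = sum(v * p for v, p in zip(values, probs))
variance = sum(p * (v - mean) ** 2 for v, p in zip(values, probs))
std_dev = math.sqrt(variance)

print(f"mean = {mean:.2f}, standard deviation = {std_dev:.2f}")
```

With these assumed probabilities the mean lands halfway between 1 and 6, at 3.5, and the standard deviation is a little over 2.
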
Now what I'm going to do here, instead of just taking
samples of this random variable that's
described by this probability distribution function,
I'm going to take samples of it.
But I'm going to average the samples
and then look at those samples and see
the frequency of the averages that I get.
And when I say average, I mean the mean.
Let me define something.
Let's say my sample size-- and I could put any number here.
But let's say first off we try a sample size of n is equal to 4.
And what that means is I'm going to take four samples from this.
So let's say the first time I take four samples--
so my sample size is four-- let's say I get a 1.
Let's say I get another 1.
And let's say I get a 3.
And I get a 6.
So that right there is my first sample of sample size 4.
I know the terminology can get confusing.
Because this is the sample that's made up of four samples.
But then when we talk about the sample mean and the sampling
distribution of the sample mean, which we're
going to talk more and more about over the next few videos,
normally the sample refers to the set of samples
from your distribution.
And the sample size tells you how many you actually
took from your distribution.
But the terminology can be very confusing,
because you could easily view one of these as a sample.
But we're taking four samples from here.
We have a sample size of four.
And what I'm going to do is I'm going to average them.
So let's say the mean-- I want to be very careful when
I say average.
The mean of this first sample of size 4 is what?
1 plus 1 is 2.
2 plus 3 is 5.
5 plus 6 is 11.
11 divided by 4 is 2.75.
That is my first sample mean for my first sample of size 4.
Let me do another one.
My second sample of size 4, let's say that I get a 3, a 4.
Let's say I get another 3.
And let's say I get a 1.
I just didn't happen to get a 6 that time.
And notice I can't get a 2 or a 5.
It's impossible for this distribution.
The chance of getting a 2 or 5 is 0.
So I can't have any 2s or 5s over here.
So for the second sample of sample size 4,
my second sample mean is going to be 3 plus 4 is 7.
7 plus 3 is 10 plus 1 is 11.
11 divided by 4, once again, is 2.75.
Let me do one more, because I really
want to make it clear what we're doing here.
So I do one more.
Actually, we're going to do a gazillion more.
But let me just do one more in detail.
So let's say my third sample of sample size 4--
so I'm going to literally take 4 samples.
So my sample is made up of 4 samples
from this original crazy distribution.
Let's say I get a 1, a 1, and a 6 and a 6.
And so my third sample mean is going to be 1 plus 1 is 2.
2 plus 6 is 8.
8 plus 6 is 14.
14 divided by 4 is 3 and 1/2.
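
The arithmetic for those three samples can be checked with a short sketch. The list name samples is just for illustration; the numbers are the ones drawn above.

```python
# The three samples of size 4 from the walkthrough.
samples = [
    [1, 1, 3, 6],  # first sample
    [3, 4, 3, 1],  # second sample
    [1, 1, 6, 6],  # third sample
]

# The sample mean is the sum of the draws divided by the sample size.
sample_means = [sum(s) / len(s) for s in samples]

print(sample_means)  # [2.75, 2.75, 3.5]
```
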
And as I find each of these sample
means-- so for each of my samples of sample size 4,
I figure out a mean.
And as I do each of them, I'm going
to plot it on a frequency distribution.
And this is all going to amaze you in a few seconds.
So I plot this all on a frequency distribution.
So I say, OK, on my first sample,
my first sample mean was 2.75.
So I'm plotting the actual frequency of the sample
means I get for each sample.
So 2.75, I got it one time.
So I'll put a little plot there.
So that's from that one right there.
And the next time, I also got a 2.75.
That's a 2.75 there.
So I got it twice.
So I'll plot the frequency right there.
Then I got a 3 and 1/2.
So all the possible values, I could have a three,
I could have a 3.25, I could have a 3 and 1/2.
So then I have the 3 and 1/2, so I'll plot it right there.
And what I'm going to do is I'm going
to keep taking these samples.
Maybe I'll take 10,000 of them.
So I'm going to keep taking these samples.
So I go all the way to S 10,000.
I just do a bunch of these.
And what it's going to look like over time is each of these--
I'm going to make it a dot, because I'm
going to have to zoom out.
So if I look at it like this, over time-- it still
has all the values that it might be able to take on,
2.75 might be here.
So this first dot is going to be-- this one
right here is going to be right there.
And that second one is going to be right there.
Then that one at 3.5 is going to look right there.
But I'm going to do it 10,000 times.
Because I'm going to have 10,000 dots.
And let's say as I do it, I'm going to just keep plotting them.
I'm just going to keep plotting the frequencies.
I'm just going to keep plotting them
over and over and over again.
And what you're going to see is, as I take
many, many samples of size 4, I'm
going to have something that's going
to start kind of approximating a normal distribution.
So each of these dots represents an instance of a sample mean.
So as I keep adding on this column right here,
that means I kept getting the sample mean 2.75.
So over time,
I'm going to have something that's
starting to approximate a normal distribution.
And that is a neat thing about the central limit theorem.
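
The whole experiment can be simulated rather than drawn by hand. This is a rough sketch, not the method used in the video: the weights are assumed (the video gives no exact probabilities), the seed is arbitrary, and the text histogram stands in for the dot plot.

```python
import random
from collections import Counter

# 2 and 5 are impossible, so only the four possible outcomes are listed.
values = [1, 3, 4, 6]
weights = [0.35, 0.15, 0.15, 0.35]  # assumed probabilities, not from the video

rng = random.Random(0)  # fixed seed so the run is repeatable
n = 4                   # sample size
num_samples = 10_000    # how many samples of size n to take

# Draw a sample of size n, average it, and repeat num_samples times.
sample_means = [
    sum(rng.choices(values, weights=weights, k=n)) / n
    for _ in range(num_samples)
]

# Count how often each sample mean occurred and print a crude histogram.
frequency = Counter(sample_means)
for m in sorted(frequency):
    print(f"{m:5.2f} {'*' * (frequency[m] // 50)}")
```

Each row is one possible sample mean, and the row length shows how often it came up. With a sample size of only 4 the shape may still look lumpy, but it should pile up around the middle and thin out toward 1 and 6.
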
So in orange, that's the case for n is equal to 4.
This was a sample size of 4.
Now, if I did the same thing with a sample size of maybe
20-- so in this case, instead of just taking 4 samples
from my original crazy distribution, every sample
I take 20 instances of my random variable,
and I average those 20.
And then I plot the sample mean on here.
So in that case, I'm going to have
a distribution that looks like this.
And we'll discuss this in more videos.
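
To compare the two sample sizes side by side, here is a small sketch that repeats the experiment for n = 4 and n = 20. As before, the weights are assumed rather than taken from the video, and simulate_sample_means is a hypothetical helper name.

```python
import random
import statistics

values = [1, 3, 4, 6]                 # 2 and 5 never occur
weights = [0.35, 0.15, 0.15, 0.35]    # assumed probabilities, not from the video


def simulate_sample_means(n, num_samples, rng):
    """Return num_samples sample means, each from a sample of size n."""
    return [
        statistics.fmean(rng.choices(values, weights=weights, k=n))
        for _ in range(num_samples)
    ]


rng = random.Random(0)  # fixed seed so the run is repeatable
means_by_n = {n: simulate_sample_means(n, 10_000, rng) for n in (4, 20)}

for n, means in means_by_n.items():
    center = statistics.fmean(means)
    spread = statistics.stdev(means)
    print(f"n = {n:2d}: mean of sample means = {center:.3f}, "
          f"standard deviation = {spread:.3f}")
```

Both runs should center near the mean of the original distribution, and the standard deviation of the sample means should come out noticeably smaller for n = 20 than for n = 4.
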
But it turns out if I were to plot 10,000 of the sample
means here, I'm going to have something
that, two things-- it's going to even more closely approximate