
  • [♪ INTRO]

  • A little over a decade ago, a neuroscientist stopped by a grocery store on his way to his lab

  • to buy a large Atlantic salmon.

  • The fish was placed in an MRI machine,

  • and then it completed what was called an "open-ended mentalizing task"

  • where it was asked to determine the emotions that were being experienced by different people in photos.

  • Yes, the salmon was asked to do that. The dead one from the grocery store.

  • But that's not the weird part.

  • The weird part is that researchers found that

  • so-called significant activation occurred in neural tissue in a couple places in the dead fish.

  • Turns out, this was a little bit of a stunt.

  • The researchers weren't studying the mental abilities of dead fish;

  • they wanted to make a point about statistics, and how scientists use them.

  • Which is to say, stats can be done wrong, so wrong that they can make a dead fish seem alive.

  • A lot of the issues surrounding scientific statistics come from a little something called a p-value.

  • The p stands for probability,

  • and it refers to the probability that you would have gotten the results you did just by chance.

  • There are lots of other ways to provide statistical support for your conclusion in science,

  • but p-value is by far the most common,

  • and, I mean, it's literally what scientists mean when they report that their findings are “significant”.

  • But it's also one of the most frequently misused and misunderstood parts of scientific research.

  • And some think it's time to get rid of it altogether.

  • The p-value was first proposed by a statistician named Ronald Fisher in 1925.

  • Fisher spent a lot of time thinking about how to determine

  • if the results of a study were really meaningful.

  • And, at least according to some accounts, his big breakthrough came after a party in the early 1920s.

  • At this party there was a fellow scientist named Muriel Bristol,

  • and reportedly, she refused a cup of tea from Fisher because he had added milk after the tea was poured.

  • She only liked her tea when the milk was added first.

  • Fisher didn't believe she could really taste the difference,

  • so he and a colleague designed an experiment to test her assertion.

  • They made eight cups of tea, half of which were milk first, and half of which were tea first.

  • The order of the cups was random, and, most importantly, unknown to Bristol,

  • though she was told there would be four of each cup.

  • Then, Fisher had her taste each cup one by one and say whether that cup was milk-first or tea-first.

  • And to Fisher's great surprise, she went 8 for 8.

  • She guessed correctly every time which cup was tea-first and which was milk-first!

  • And that got him to thinking, what are the odds that she got them all right just by guessing?

  • In other words, if she really couldn't taste the difference,

  • how likely would it be that she got them all right?

  • He calculated that there are 70 possible orders for the 8 cups if there are four of each mix.

  • Therefore, the probability that she'd guess the right one by luck alone is 1 in 70.

  • Written mathematically, the p-value is about 0.014.

  • That, in a nutshell, is a p-value,

  • the probability that you'd get that result if chance is the only factor.

  • In other words, there's really no relationship between the two things you're testing,

  • in this case, how tea is mixed versus how it tastes,

  • but you could still wind up with data that suggest there is a relationship.

  • Of course, the definition of “chance” varies depending on the experiment,

  • which is why p-values depend a lot on experimental design.

  • Say Fisher had only made 6 cups, 3 of each tea mix.

  • Then, there are only 20 possible orders for the cups, so the odds of getting them all correct are 1 in 20,

  • a p-value of 0.05.
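(As a quick illustration, here is that counting argument sketched in Python. The script is added for this writeup, not part of Fisher's original analysis; it just reproduces the arithmetic described above.)

```python
from math import comb

# Number of distinct orderings of the cups when the taster knows how many
# of each kind there are: "8 choose 4" ways to place the four milk-first
# cups among eight cups, and "6 choose 3" for the six-cup version.
orders_8_cups = comb(8, 4)   # 70
orders_6_cups = comb(6, 3)   # 20

# If she is purely guessing, exactly one of those orderings is the
# all-correct one, so the p-value is 1 over the number of orderings.
p_8_cups = 1 / orders_8_cups  # ~0.014
p_6_cups = 1 / orders_6_cups  # 0.05

print(f"8 cups: {orders_8_cups} orderings, p = {p_8_cups:.3f}")
print(f"6 cups: {orders_6_cups} orderings, p = {p_6_cups:.3f}")
```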

  • Fisher went on to describe an entire field of statistics based on this idea,

  • which we now call Null Hypothesis Significance Testing.

  • The “null hypothesis” refers to the experiment's assumption of what “by chance” looks like.

  • Basically, researchers calculate how likely it would be to get the data they did

  • if the effect they're testing for didn't exist.

  • Then, if the results are extremely unlikely to occur when the null hypothesis is true,

  • they can infer that it isn't.

  • So, in statistical speak, with a low enough p-value, they can reject the null hypothesis,

  • leaving them with whatever alternate hypothesis they had as the explanation for the results.

  • The question becomes: how low does a p-value have to be before you can reject that null hypothesis?

  • Well, the standard answer used in science is less than 1 in 20 odds, or a p-value below 0.05.
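(For illustration only: here is how that null-hypothesis decision could be run for the tea experiment, using SciPy's implementation of Fisher's exact test and the conventional 0.05 cutoff. This is a sketch added for this writeup, not something from the video.)

```python
from scipy.stats import fisher_exact

# Contingency table for the tea-tasting result: rows are how each cup was
# actually made, columns are what Bristol called it. Going 8 for 8 puts all
# four milk-first cups and all four tea-first cups on the diagonal.
table = [[4, 0],   # actually milk-first: 4 called milk-first, 0 called tea-first
         [0, 4]]   # actually tea-first:  0 called milk-first, 4 called tea-first

# One-sided Fisher's exact test: how likely is a result at least this good
# if she is only guessing (the null hypothesis)?
_, p_value = fisher_exact(table, alternative="greater")

alpha = 0.05
print(f"p = {p_value:.3f}")  # ~0.014
print("reject the null hypothesis" if p_value < alpha else "fail to reject the null hypothesis")
```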

  • The problem is, that's an arbitrary choice.

  • It also traces back to Fisher's 1925 book, where he said 1 in 20 was, quote, “convenient”.

  • A year later, he admitted the cutoff was somewhat subjective,

  • but that 0.05 was generally his personal preference.

  • Since then, the 0.05 threshold has become the gold standard in scientific research.

  • A p of less than 0.05, and your results are, quote, “significant”.

  • It's often talked about as determining whether or not an effect is real.

  • But the thing is, a result with a p-value of 0.049 isn't more true than one with a p-value of 0.051.

  • It's just ever so slightly less likely to be explained by chance or sampling error.

  • This is really key to understand.

  • You're not more right if you get a lower p-value,

  • because a p-value says nothing about how correct your alternate hypothesis is.

  • Let's bring it back to tea for a moment.

  • Bristol aced Fisher's 8-cup study by getting them all correct,

  • which as we noted, has a p-value of 0.014, solidly below the 0.05 threshold.

  • But the fact that random guessing was unlikely doesn't prove she could taste the difference.

  • See, it tells us nothing about other possible explanations for her correctness.

  • Like, if the teas had different colors rather than different tastes.

  • Or she secretly saw Fisher pouring each cup!

  • Also, it still could have been a one-in-seventy fluke.

  • And sometimes, one might even argue often,

  • 1 in 20 is not a good enough threshold to really rule out that a result is a fluke.

  • Which brings us back to that seemingly undead fish.

  • The spark of life detected in the salmon was actually an artifact of how MRI data is collected and analyzed.

  • See, when researchers analyze MRI data, they look at small units about a cubic millimeter or two in volume.

  • So for the fish, they took each of these units and compared the data

  • before and after the pictures were shown to the fish.

  • That means even though they were just looking at one dead fish's brain before and after,

  • they were actually making multiple comparisons, potentially thousands of them.

  • The same issue crops up in all sorts of big studies with lots of data,

  • like nutritional studies where people provide detailed diet information about hundreds of foods,

  • or behavioral studies where participants fill out surveys with dozens of questions.

  • In all cases, even though a false positive is unlikely in any individual comparison, with enough comparisons,

  • you're bound to find some false positives.

  • There are statistical solutions for this problem, of course,

  • which are simply known as multiple comparison corrections.

  • Though they can get fancy, they usually amount to lowering the threshold for p-value significance.
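(A minimal simulation, with made-up numbers rather than the actual salmon data, of why thousands of comparisons at the 0.05 threshold guarantee some false positives, and how a Bonferroni-style correction lowers the threshold to compensate.)

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical setup: 5,000 independent "voxel" comparisons in which
# nothing real is happening, so the null hypothesis is true for every one.
n_tests = 5000
alpha = 0.05

# Under the null, p-values are uniformly distributed between 0 and 1.
p_values = rng.uniform(0, 1, n_tests)

false_positives = np.sum(p_values < alpha)
print(f"Uncorrected: {false_positives} 'significant' results "
      f"(~{alpha * n_tests:.0f} expected by chance alone)")

# Bonferroni correction: shrink the significance threshold by the number of tests.
bonferroni_alpha = alpha / n_tests
print(f"Bonferroni-corrected: {np.sum(p_values < bonferroni_alpha)} significant results")
```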

  • And to their credit, the researchers who looked at the dead salmon

  • also ran their data with multiple comparison corrections,

  • and when they did, their data was no longer significant.

  • But not everyone uses these corrections.

  • And though individual studies might give various reasons for skipping them,

  • one thing that's hard to ignore is that researchers are under a lot of pressure to publish their work,

  • and significant results are more likely to get published.

  • This can lead to p-hacking: the practice of analyzing or collecting data

  • until you get significant p-values.

  • This doesn't have to be intentional, because researchers make many small choices that lead to different results,

  • like we saw with 6 versus 8 cups of tea.

  • This has become such a big issue because, unlike when these statistics were invented,

  • people can now run tests lots of different ways fairly quickly and cheaply,

  • and just go with what's most likely to get their work published.

  • Because of all of these issues surrounding p-values,

  • some are arguing that we should get rid of them altogether.

  • And one journal has totally banned them.

  • And many who say we should ditch the p-value are pushing for an alternative statistical system

  • called Bayesian statistics.

  • P-values, by definition, only examine null hypotheses.

  • The result is then used to infer if the alternative is likely.

  • Bayesian statistics actually look at the probability of both the null and alternative hypotheses.

  • What you wind up with is an exact ratio of how likely one explanation is compared to another.

  • This is called a Bayes factor.

  • And this is a much better answer if you want to know how likely you are to be wrong.
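(A minimal sketch of one common way to compute a Bayes factor for the tea data. It assumes the eight cups can be treated as independent guesses with a flat prior on Bristol's accuracy, which is a simplification of the actual paired design; it's an illustration added here, not an analysis from the video.)

```python
from scipy.integrate import quad

# Simplified Bayesian look at the tea data: 8 correct calls out of 8 cups,
# treated as independent guesses.
correct, total = 8, 8

# Likelihood of the data under the null hypothesis: pure guessing, 50% per cup.
likelihood_null = 0.5 ** total

# Likelihood under the alternative: she has some unknown accuracy theta,
# averaged over a flat prior on theta between 0 and 1.
likelihood_alt, _ = quad(
    lambda theta: theta ** correct * (1 - theta) ** (total - correct), 0, 1
)

bayes_factor = likelihood_alt / likelihood_null
print(f"Bayes factor (alternative vs. null): {bayes_factor:.1f}")  # ~28
```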

  • This system was around when Fisher came up with p-values.

  • But, depending on the dataset, calculating Bayes factors can require some serious computing power,

  • power that wasn't available at the time, since, y'know, it was before computers.

  • Nowadays, you can have a huge network of computers thousands of miles away run the calculations

  • while you throw a tea party.

  • But the truth is, replacing p-values with Bayes factors probably won't fix everything.

  • A loftier solution is to completely separate a study's publishability from its results.

  • This is the goal of two-step manuscript submission,

  • where you submit an introduction to your study and a description of your method,

  • and the journal decides whether to publish before seeing your results.

  • That way, in theory at least, studies would get published based on whether they represent good science,

  • not whether they worked out the way researchers hoped,

  • or whether a p-value or Bayes factor was more or less than some arbitrary threshold.

  • This sort of idea isn't widely used yet,

  • but it may become more popular as statistical significance comes under sharper criticism.

  • In the end, hopefully, all this controversy surrounding p-values

  • means that academic culture is shifting toward a clearer portrayal

  • of what research results do and don't really show.

  • And that will make things more accessible for all of us who want to read and understand science,

  • and keep any more zombie fish from showing up.

  • Now, before I go make myself a cup of Earl Grey, milk first, of course,

  • I want to give a special shout out to today's President of Space, SR Foxley.

  • Thank you so much for your continued support!

  • Patrons like you give us the freedom to dive deep into complex topics like p-values,

  • so really, we can't thank you enough.

  • And if you want to join SR in supporting this channel and the educational content we make here at SciShow,

  • you can learn more at Patreon.com/SciShow.

  • Cheerio!

  • [♪ OUTRO]
