
  • [♪ INTRO]

  • A little over a decade ago, a neuroscientist stopped by a grocery store on his way to his lab

  • to buy a large Atlantic salmon.

  • The fish was placed in an MRI machine,

  • and then it completed what was called an "open-ended mentalizing task"

  • where it was asked to determine the emotions that were being experienced by different people in photos.

  • Yes, the salmon was asked to do that. The dead one from the grocery store.

  • But that's not the weird part.

  • The weird part is that researchers found that

  • so-called significant activation occurred in neural tissue in a couple places in the dead fish.

  • Turns out, this was a little bit of a stunt.

  • The researchers weren't studying the mental abilities of dead fish;

  • they wanted to make a point about statistics, and how scientists use them.

  • Which is to say, stats can be done wrong, so wrong that they can make a dead fish seem alive.

  • A lot of the issues surrounding scientific statistics come from a little something called a p-value.

  • The p stands for probability,

  • and it refers to the probability that you would have gotten the results you did just by chance.

  • There are lots of other ways to provide statistical support for your conclusion in science,

  • but p-value is by far the most common,

  • and, I mean, it's literally what scientists mean when they report that their findings are “significant”.

  • But it's also one of the most frequently misused and misunderstood parts of scientific research.

  • And some think it's time to get rid of it altogether.

  • The p-value was first proposed by a statistician named Ronald Fisher in 1925.

  • Fisher spent a lot of time thinking about how to determine

  • if the results of a study were really meaningful.

  • And, at least according to some accounts, his big breakthrough came after a party in the early 1920s.

  • At this party there was a fellow scientist named Muriel Bristol,

  • and reportedly, she refused a cup of tea from Fisher because he had added milk after the tea was poured.

  • She only liked her tea when the milk was added first.

  • Fisher didn't believe she could really taste the difference,

  • so he and a colleague designed an experiment to test her assertion.

  • They made eight cups of tea, half of which were milk first, and half of which were tea first.

  • The order of the cups was random, and, most importantly, unknown to Bristol,

  • though she was told there would be four of each cup.

  • Then, Fisher had her taste each cup one by one and say whether that cup was milk-first or tea-first.

  • And to Fisher's great surprise, she went 8 for 8.

  • She guessed correctly every time which cup was tea-first and which was milk-first!

  • And that got him to thinking, what are the odds that she got them all right just by guessing?

  • In other words, if she really couldn't taste the difference,

  • how likely would it be that she got them all right?

  • He calculated that there are 70 possible orders for the 8 cups if there are four of each mix.

  • Therefore, the probability that she'd guess the right one by luck alone is 1 in 70.

  • Written mathematically, the p-value is about 0.014.

  • That, in a nutshell, is a p-value,

  • the probability that you'd get that result if chance is the only factor.

  • In other words, there's really no relationship between the two things you're testing,

  • in this case, how tea is mixed versus how it tastes,

  • but you could still wind up with data that suggest there is a relationship.

  • Of course, the definition of “chance” varies depending on the experiment,

  • which is why p-values depend a lot on experimental design.

  • Say Fisher had only made 6 cups, 3 of each tea mix.

  • Then, there are only 20 possible orders for the cups, so the odds of getting them all correct are 1 in 20,

  • a p-value of 0.05.
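(As a quick illustration, here is that counting argument sketched in Python. The script is added for this writeup, not part of Fisher's original analysis; it just reproduces the arithmetic described above.)

```python
from math import comb

# Number of distinct orderings of the cups when the taster knows how many
# of each kind there are: "8 choose 4" ways to place the four milk-first
# cups among eight cups, and "6 choose 3" for the six-cup version.
orders_8_cups = comb(8, 4)   # 70
orders_6_cups = comb(6, 3)   # 20

# If she is purely guessing, exactly one of those orderings is the
# all-correct one, so the p-value is 1 over the number of orderings.
p_8_cups = 1 / orders_8_cups  # ~0.014
p_6_cups = 1 / orders_6_cups  # 0.05

print(f"8 cups: {orders_8_cups} orderings, p = {p_8_cups:.3f}")
print(f"6 cups: {orders_6_cups} orderings, p = {p_6_cups:.3f}")
```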

  • Fisher went on to describe an entire field of statistics based on this idea,

  • which we now call Null Hypothesis Significance Testing.

  • The “null hypothesis” refers to the experiment's assumption of what “by chance” looks like.

  • Basically, researchers calculate how likely it would be to get the data they did

  • if the effect they're testing for didn't exist.

  • Then, if the results are extremely unlikely to occur when the null hypothesis is true,

  • they can infer that it isn't.

  • So, in statistical speak, with a low enough p-value, they can reject the null hypothesis,

  • leaving them with whatever alternate hypothesis they had as the explanation for the results.

  • The question becomes: how low does a p-value have to be before you can reject that null hypothesis?

  • Well, the standard answer used in science is less than 1 in 20 odds, or a p-value below 0.05.
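(For illustration only: here is how that null-hypothesis decision could be run for the tea experiment, using SciPy's implementation of Fisher's exact test and the conventional 0.05 cutoff. This is a sketch added for this writeup, not something from the video.)

```python
from scipy.stats import fisher_exact

# Contingency table for the tea-tasting result: rows are how each cup was
# actually made, columns are what Bristol called it. Going 8 for 8 puts all
# four milk-first cups and all four tea-first cups on the diagonal.
table = [[4, 0],   # actually milk-first: 4 called milk-first, 0 called tea-first
         [0, 4]]   # actually tea-first:  0 called milk-first, 4 called tea-first

# One-sided Fisher's exact test: how likely is a result at least this good
# if she is only guessing (the null hypothesis)?
_, p_value = fisher_exact(table, alternative="greater")

alpha = 0.05
print(f"p = {p_value:.3f}")  # ~0.014
print("reject the null hypothesis" if p_value < alpha else "fail to reject the null hypothesis")
```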

  • The problem is, that's an arbitrary choice.

  • It also traces back to Fisher's 1925 book, where he said 1 in 20 was, quote, “convenient”.

  • A year later, he admitted the cutoff was somewhat subjective,

  • but that 0.05 was generally his personal preference.

  • Since then, the 0.05 threshold has become the gold standard in scientific research.

  • A p of less than 0.05, and your results are, quote, “significant”.

  • It's often talked about as determining whether or not an effect is real.

  • But the thing is, a result with a p-value of 0.049 isn't more true than one with a p-value of 0.051.

  • It's just ever so slightly less likely to be explained by chance or sampling error.

  • This is really key to understand.

  • You're not more right if you get a lower p-value,

  • because a p-value says nothing about how correct your alternate hypothesis is.

  • Let's bring it back to tea for a moment.

  • Bristol aced Fisher's 8-cup study by getting them all correct,

  • which as we noted, has a p-value of 0.014, solidly below the 0.05 threshold.

  • But the fact that random guessing was unlikely doesn't prove she could taste the difference.

  • See, it tells us nothing about other possible explanations for her correctness.

  • Like, if the teas had different colors rather than different tastes.

  • Or she secretly saw Fisher pouring each cup!

  • Also, it still could have been a one-in-seventy fluke.

  • And sometimes, one might even argue often,

  • 1 in 20 is not a good enough threshold to really rule out that a result is a fluke.

  • Which brings us back to that seemingly undead fish.

  • The spark of life detected in the salmon was actually an artifact of how MRI data is collected and analyzed.

  • See, when researchers analyze MRI data, they look at small units about a cubic millimeter or two in volume.

  • So for the fish, they took each of these units and compared the data

  • before and after the pictures were shown to the fish.

  • That means even though they were just looking at one dead fish's brain before and after,

  • they were actually making multiple comparisons, potentially thousands of them.

  • The same issue crops up in all sorts of big studies with lots of data,

  • like nutritional studies where people provide detailed diet information about hundreds of foods,

  • or behavioral studies where participants fill out surveys with dozens of questions.

  • In all cases, even though a false positive is unlikely in any individual comparison, with enough comparisons,

  • you're bound to find some false positives.

  • There are statistical solutions for this problem, of course,

  • which are simply known as multiple comparison corrections.

  • Though they can get fancy, they usually amount to lowering the threshold for p-value significance.
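(A minimal simulation, with made-up numbers rather than the actual salmon data, of why thousands of comparisons at the 0.05 threshold guarantee some false positives, and how a Bonferroni-style correction lowers the threshold to compensate.)

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical setup: 5,000 independent "voxel" comparisons in which
# nothing real is happening, so the null hypothesis is true for every one.
n_tests = 5000
alpha = 0.05

# Under the null, p-values are uniformly distributed between 0 and 1.
p_values = rng.uniform(0, 1, n_tests)

false_positives = np.sum(p_values < alpha)
print(f"Uncorrected: {false_positives} 'significant' results "
      f"(~{alpha * n_tests:.0f} expected by chance alone)")

# Bonferroni correction: shrink the significance threshold by the number of tests.
bonferroni_alpha = alpha / n_tests
print(f"Bonferroni-corrected: {np.sum(p_values < bonferroni_alpha)} significant results")
```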

  • And to their credit, the researchers who looked at the dead salmon

  • also ran their data with multiple comparison corrections,

  • and when they did, their data was no longer significant.

  • But not everyone uses these corrections.

  • And though individual studies might give various reasons for skipping them,

  • one thing that's hard to ignore is that researchers are under a lot of pressure to publish their work,

  • and significant results are more likely to get published.

  • This can lead to p-hacking: the practice of analyzing or collecting data

  • until you get significant p-values.

  • This doesn't have to be intentional, because researchers make many small choices that lead to different results,

  • like we saw with 6 versus 8 cups of tea.

  • This has become such a big issue because, unlike when these statistics were invented,

  • people can now run tests lots of different ways fairly quickly and cheaply,

  • and just go with what's most likely to get their work published.

  • Because of all of these issues surrounding p-values,

  • some are arguing that we should get rid of them altogether.

  • And one journal has totally banned them.

  • And many who say we should ditch the p-value are pushing for an alternative statistical system

  • called Bayesian statistics.

  • P-values, by definition, only examine null hypotheses.

  • The result is then used to infer if the alternative is likely.

  • Bayesian statistics actually look at the probability of both the null and alternative hypotheses.

  • What you wind up with is an exact ratio of how likely one explanation is compared to another.

  • This is called a Bayes factor.

  • And this is a much better answer if you want to know how likely you are to be wrong.
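(A minimal sketch of one common way to compute a Bayes factor for the tea data. It assumes the eight cups can be treated as independent guesses with a flat prior on Bristol's accuracy, which is a simplification of the actual paired design; it's an illustration added here, not an analysis from the video.)

```python
from scipy.integrate import quad

# Simplified Bayesian look at the tea data: 8 correct calls out of 8 cups,
# treated as independent guesses.
correct, total = 8, 8

# Likelihood of the data under the null hypothesis: pure guessing, 50% per cup.
likelihood_null = 0.5 ** total

# Likelihood under the alternative: she has some unknown accuracy theta,
# averaged over a flat prior on theta between 0 and 1.
likelihood_alt, _ = quad(
    lambda theta: theta ** correct * (1 - theta) ** (total - correct), 0, 1
)

bayes_factor = likelihood_alt / likelihood_null
print(f"Bayes factor (alternative vs. null): {bayes_factor:.1f}")  # ~28
```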

  • This system was around when Fisher came up with p-values.

  • But, depending on the dataset, calculating Bayes factors can require some serious computing power,

  • power that wasn't available at the time, since, y'know, it was before computers.

  • Nowadays, you can have a huge network of computers thousands of miles away run the calculations

  • while you throw a tea party.

  • But the truth is, replacing p-values with Bayes factors probably won't fix everything.

  • A loftier solution is to completely separate a study's publishability from its results.

  • This is the goal of two-step manuscript submission,

  • where you submit an introduction to your study and a description of your method,

  • and the journal decides whether to publish before seeing your results.

  • That way, in theory at least, studies would get published based on whether they represent good science,

  • not whether they worked out the way researchers hoped,

  • or whether a p-value or Bayes factor was more or less than some arbitrary threshold.

  • This sort of idea isn't widely used yet,

  • but it may become more popular as statistical significance comes under sharper criticism.

  • In the end, hopefully, all this controversy surrounding p-values

  • means that academic culture is shifting toward a clearer portrayal

  • of what research results do and don't really show.

  • And that will make things more accessible for all of us who want to read and understand science,

  • and keep any more zombie fish from showing up.

  • Now, before I go make myself a cup of Earl Grey, milk first, of course,

  • I want to give a special shout out to today's President of Space, SR Foxley.

  • Thank you so much for your continued support!

  • Patrons like you give us the freedom to dive deep into complex topics like p-values,

  • so really, we can't thank you enough.

  • And if you want to join SR in supporting this channel and the educational content we make here at SciShow,

  • you can learn more at Patreon.com/SciShow.

  • Cheerio!

  • [♪ OUTRO]
