This scandalous practice of deceit-for-funding-and-profit is why I persist in slamming psychology as “not science”
It’s not only that these are research scams that waste funding and devalue science; human beings are harmed as a result of this abuse of statistics. Asperger and neurodiverse types are being “defined” as “defective” human beings: there is no scientific basis for this “socially-motivated” construct. The current Autism-ASD-Asperger Industry is a FOR PROFIT INDUSTRY that exploits individuals, their families, schools, communities, tax-payers and funding for research. It also serves to enforce “the social order” dictated by elites.
The Mind-Reading Salmon: The True Meaning of Statistical Significance
If you want to convince the world that a fish can sense your emotions, only one statistical measure will suffice: the p-value.
The p-value is an all-purpose measure that scientists often use to determine whether or not an experimental result is “statistically significant.” Unfortunately, sometimes the test does not work as advertised, and researchers imbue an observation with great significance when in fact it might be a worthless fluke.
Say you’ve performed a scientific experiment testing a new heart attack drug against a placebo. At the end of the trial, you compare the two groups. Lo and behold, the patients who took the drug had fewer heart attacks than those who took the placebo. Success! The drug works!
Well, maybe not. There is a 50 percent chance that even if the drug is completely ineffective, patients taking it will do better than those taking the placebo. (After all, one group has to do better than the other; it’s a toss-up whether the drug group or placebo group will come out on top.)
The p-value puts a number on the effects of randomness. It is the probability of seeing a positive experimental outcome even if your hypothesis is wrong. A long-standing convention in many scientific fields is that any result with a p-value below 0.05 is deemed statistically significant. An arbitrary convention, it is often the wrong one. When you make a comparison of an ineffective drug to a placebo, you will typically get a statistically significant result one time out of 20. And if you make 20 such comparisons in a scientific paper, on average, you will get one significant result with a p-value less than 0.05—even when the drug does not work.
Many scientific papers make 20 or 40 or even hundreds of comparisons. In such cases, researchers who do not adjust the standard p-value threshold of 0.05 are virtually guaranteed to find statistical significance in results that are meaningless statistical flukes. A study that ran in the February issue of the American Journal of Clinical Nutrition tested dozens of compounds and concluded that those found in blueberries lower the risk of high blood pressure, with a p-value of 0.03. But the researchers looked at so many compounds and made so many comparisons (more than 50), that it was almost a sure thing that some of the p-values in the paper would be less than 0.05 just by chance.
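The arithmetic behind “virtually guaranteed” can be sketched in a few lines of Python. This is only an illustration, assuming that every comparison is independent and that no real effect exists anywhere (simplifications that the studies themselves do not state). The function names are made up for this example, and the Bonferroni cut-off is just one common way of adjusting the threshold, not one the article names.

```python
def chance_of_any_fluke(n_comparisons, alpha=0.05):
    """Probability that at least one of n independent null comparisons has p < alpha."""
    return 1 - (1 - alpha) ** n_comparisons


def expected_flukes(n_comparisons, alpha=0.05):
    """Average number of 'significant' results when nothing real is going on."""
    return n_comparisons * alpha


def bonferroni_threshold(n_comparisons, alpha=0.05):
    """Stricter per-comparison cut-off (Bonferroni); an assumed choice of adjustment."""
    return alpha / n_comparisons


for n in (1, 20, 50, 100):
    print(
        f"{n:>3} comparisons: "
        f"P(at least one fluke) = {chance_of_any_fluke(n):.2f}, "
        f"expected flukes = {expected_flukes(n):.1f}, "
        f"adjusted threshold = {bonferroni_threshold(n):.4f}"
    )
```

Under these assumptions, 20 comparisons give roughly a 64 percent chance of at least one fluke, and 50 comparisons push that chance above 90 percent.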
The same applies to a well-publicized study that a team of neuroscientists once conducted on a salmon. When they presented the fish with pictures of people expressing emotions, regions of the salmon’s brain lit up. The result was statistically significant with a p-value of less than 0.001; however, as the researchers argued, there are so many possible patterns that a statistically significant result was virtually guaranteed, so the result was totally worthless. p-value notwithstanding, there was no way that the fish could have reacted to human emotions. The salmon in the fMRI happened to be dead.
Statistical Significance Abuse
A lot of research makes scientific evidence seem more “significant” than it is
I am a science writer and a former Registered Massage Therapist with a decade of experience treating tough pain cases. I was the Assistant Editor of ScienceBasedMedicine.org for several years.
Many study results are called “statistically significant,” giving unwary readers the impression of good news. But it’s misleading: statistical significance means only that the measured effect of a treatment is probably real (not a fluke). It says nothing about how large the effect is. Many small effect sizes are reported only as “statistically significant”: it’s a nearly standard way for biased researchers to make it sound like they found something more important than they did.
This article is about two common problems with “statistical significance” in medical research. Both problems are particularly rampant in the study of massage therapy, chiropractic, and alternative medicine in general, and are wonderful examples of why science is hard, “why most published research findings are false” and genuine robust treatment effects are rare:
- mixing up statistical and clinical significance and the probability of being “right”
- reporting statistical significance of the wrong dang thing
Significance Problem #1: Two flavours of “significant”: statistical versus clinical
Research can be statistically significant, but otherwise unimportant. Statistical significance means that data signifies something… not that it actually matters.
Statistical significance on its own is the sound of one hand clapping. But researchers often focus on the positive: “Hey, we’ve got statistical significance! Maybe!” So they summarize their findings as “significant” without telling us the size of the effect they observed, which is a little devious or sloppy. Almost everyone is fooled by this (except 98% of statisticians) because the word “significant” carries so much weight. It really sounds like a big deal, like good news. But it’s like bragging about winning a lottery without mentioning that you only won $25.
Statistical significance without other information really doesn’t mean all that much. It is not only possible but common to have clinically trivial results that are nonetheless statistically significant. How much is that statistical significance worth? It depends … on details that are routinely omitted, which is convenient if you’re pushing a pet theory, isn’t it?
Imagine a study of a treatment for pain, which has a statistically significant effect, but it’s a tiny effect: that is, it only reduces pain slightly. You can take that result to the bank (supposedly) — it’s real! It’s statistically significant! But … no more so than a series of coin flips that yields enough heads in a row to raise your eyebrows. And the effect was still tiny. So calling these results “significant” is using math to put lipstick on a pig.
There are a lot of decorated pigs in research: “significant” results that are possibly not even that, and clinically boring in any case.
Just because a published paper presents a statistically significant result does not mean it necessarily has a biologically meaningful effect.
Science Left Behind: Feel-Good Fallacies and the Rise of the Anti-Scientific Left, Alex Berezow & Hank Campbell
If you torture data for long enough, it will confess to anything.
P-values, where P stands for “please stop the madness”
Small study proves showers work. Too often people smugly dismiss a study just because of small sample size, ignoring all other considerations, like effect size … a rookie move. For instance, you really do not need to test lots of showers to prove that they are an effective moistening procedure. The power of a study is a product of both sample size and effect size (and more).
Statistical significance is boiled down to one convenient number: the infamous, cryptic, bizarro and highly over-rated P-value. Cue Darth Vader theme. This number is “diabolically difficult” to understand and explain, and so p-value illiteracy and bloopers are epidemic (Goodman identifies “A dirty dozen: twelve p-value misconceptions”[4]). It seems to be hated by almost everyone who actually understands it, because almost no one else does. Many believe it to be a blight on modern science.[5] Including the American Statistical Association, and if they don’t like it, should you?
The mathematical soul of the p-value is, frankly, not really worth knowing. It’s just not that fantastic an idea. The importance of scientific research results cannot be jammed into a single number (nor was that ever the intent). And so really wrapping your head around it is no more important than learning the gritty details of the Rotten Tomatoes algorithm when you’re trying to decide whether to see that new Godzilla (2014) movie.[7]
What you do need to know is the role that p-values play in research today. You need to know that “it depends” is a massive understatement, and that there are “several reasons why the p-value is an unobjective and inadequate measure of evidence.”[8] Because it is so often abused, it’s way more important to know what the p-value is NOT than what it IS. For instance, it’s particularly useless when applied to studies of really outlandish ideas. And yet it’s one of the staples of pseudoscience, because it is such an easy way to make research look better than it is.
Above all, a good p-value is not a low chance that the results were a fluke or false alarm, which is by far the most common misinterpretation (and the first of Goodman’s Dirty Dozen). The real definition is a kind of mirror image of that:[11] it’s not a low chance of a false alarm, but a low chance of an effect that actually is a false alarm. The false alarm is a given! That part of the equation is already filled in, the premise of every p-value. For better or worse, the p-value is the answer to this question: if there really is nothing going on here, what are the odds of getting these results? A low number is encouraging, but it doesn’t say the results aren’t a fluke, because it can’t: it was calculated by assuming they are.
The only way to actually find out if the effect is real or a fluke is to do more experiments. If they all produce results that would be unlikely if there was no real effect, then you can say the results are probably real. The p-value alone can only be a reason to check again — not statistical congratulations on a job well done. And yet that’s exactly how most researchers use it. And most science journalists.
The problem with p-values
Academic psychology and medical testing are both dogged by unreliability. The reason is clear: we got probability wrong
The aim of science is to establish facts, as accurately as possible. It is therefore crucially important to determine whether an observed phenomenon is real, or whether it’s the result of pure chance. If you declare that you’ve discovered something when in fact it’s just random, that’s called a false discovery or a false positive. And false positives are alarmingly common in some areas of medical science.
In 2005, the epidemiologist John Ioannidis at Stanford caused a storm when he wrote the paper ‘Why Most Published Research Findings Are False’, focusing on results in certain areas of biomedicine. He’s been vindicated by subsequent investigations.
For example, a recent article found that repeating 100 different results in experimental psychology confirmed the original conclusions in only 38 per cent of cases. It’s probably at least as bad for brain-imaging studies and cognitive neuroscience. How can this happen?
The problem of how to distinguish a genuine observation from random chance is a very old one. It’s been debated for centuries by philosophers and, more fruitfully, by statisticians. It turns on the distinction between induction and deduction. Science is an exercise in inductive reasoning: we are making observations and trying to infer general rules from them. Induction can never be certain. In contrast, deductive reasoning is easier: you deduce what you would expect to observe if some general rule were true and then compare it with what you actually see. The problem is that, for a scientist, deductive arguments don’t directly answer the question that you want to ask.
What matters to a scientific observer is how often you’ll be wrong if you claim that an effect is real, rather than being merely random. That’s a question of induction, so it’s hard. In the early 20th century, it became the custom to avoid induction, by changing the question into one that used only deductive reasoning. In the 1920s, the statistician Ronald Fisher did this by advocating tests of statistical significance. These are wholly deductive and so sidestep the philosophical problems of induction.
Tests of statistical significance proceed by calculating the probability of making our observations (or the more extreme ones) if there were no real effect. This isn’t an assertion that there is no real effect, but rather a calculation of what would be expected if there were no real effect. The postulate that there is no real effect is called the null hypothesis, and the probability is called the p-value. Clearly the smaller the p-value, the less plausible the null hypothesis, so the more likely it is that there is, in fact, a real effect. All you have to do is to decide how small the p-value must be before you declare that you’ve made a discovery. But that turns out to be very difficult.
The problem is that the p-value gives the right answer to the wrong question. What we really want to know is not the probability of the observations given a hypothesis about the existence of a real effect, but rather the probability that there is a real effect – that the hypothesis is true – given the observations. And that is a problem of induction.
Confusion between these two quite different probabilities lies at the heart of why p-values are so often misinterpreted. It’s called the error of the transposed conditional. Even quite respectable sources will tell you that the p-value is the probability that your observations occurred by chance. And that is plain wrong.
Suppose, for example, that you give a pill to each of 10 people. You measure some response (such as their blood pressure). Each person will give a different response. And you give a different pill to 10 other people, and again get 10 different responses. How do you tell whether the two pills are really different?
The conventional procedure would be to follow Fisher and calculate the probability of making the observations (or the more extreme ones) if there were no true difference between the two pills. That’s the p-value, based on deductive reasoning. P-values of less than 5 per cent have come to be called ‘statistically significant’, a term that’s ubiquitous in the biomedical literature, and is now used to suggest that an effect is real, not just chance.
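To make the two-pill calculation concrete, here is a minimal Python sketch of one way to get such a p-value: an exact permutation test, which asks how often a random relabelling of the 20 responses gives a difference in means at least as large as the one observed. The text does not say which test would be used, so the choice of test is an assumption, and the blood-pressure numbers are invented purely for illustration.

```python
from itertools import combinations


def permutation_p_value(group_a, group_b):
    """Exact two-sided permutation test on the difference in group means.

    Returns the fraction of all ways of relabelling the pooled responses
    whose mean difference is at least as large as the one observed.
    """
    pooled = list(group_a) + list(group_b)
    n_a, n_b = len(group_a), len(group_b)
    total = sum(pooled)
    observed = abs(sum(group_a) / n_a - sum(group_b) / n_b)

    as_extreme = 0
    relabellings = 0
    for indices in combinations(range(len(pooled)), n_a):
        sum_a = sum(pooled[i] for i in indices)
        difference = abs(sum_a / n_a - (total - sum_a) / n_b)
        # Small tolerance so floating-point rounding cannot hide a tie.
        if difference >= observed - 1e-9:
            as_extreme += 1
        relabellings += 1
    return as_extreme / relabellings


# Hypothetical falls in blood pressure (mmHg) for 10 people on each pill.
pill_a = [4, 7, 2, 9, 5, 6, 3, 8, 5, 7]
pill_b = [8, 10, 6, 12, 7, 9, 11, 5, 9, 10]

p_value = permutation_p_value(pill_a, pill_b)
print(f"p-value under the null hypothesis: {p_value:.4f}")
```

The number printed is the probability of the observations (or more extreme ones) if the pill labels made no difference. It is not the probability that the pills are really different.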
But the dichotomy between ‘significant’ and ‘not significant’ is absurd. There’s obviously very little difference between the implication of a p-value of 4.7 per cent and of 5.3 per cent, yet the former has come to be regarded as success and the latter as failure. And ‘success’ will get your work published, even in the most prestigious journals. That’s bad enough, but the real killer is that, if you observe a ‘just significant’ result, say P = 0.047 (4.7 per cent) in a single test, and claim to have made a discovery, the chance that you are wrong is at least 26 per cent, and could easily be more than 80 per cent. How can this be so?
For one, it’s of little use to say that your observations would be rare if there were no real difference between the pills (which is what the p-value tells you), unless you can say whether or not the observations would also be rare when there is a true difference between the pills. Which brings us back to induction.
The problem of induction was solved, in principle, by the Reverend Thomas Bayes in the middle of the 18th century. He showed how to convert the probability of the observations given a hypothesis (the deductive problem) to what we actually want, the probability that the hypothesis is true given some observations (the inductive problem). But how to use his famous theorem in practice has been the subject of heated debate ever since.
Take the proposition that the Earth goes round the Sun. It either does or it doesn’t, so it’s hard to see how we could pick a probability for this statement. Furthermore, the Bayesian conversion involves assigning a value to the probability that your hypothesis is right before any observations have been made (the ‘prior probability’). Bayes’s theorem allows that prior probability to be converted to what we want, the probability that the hypothesis is true given some relevant observations, which is known as the ‘posterior probability’.
These intangible probabilities persuaded Fisher that Bayes’s approach wasn’t feasible. Instead, he proposed the wholly deductive process of null hypothesis significance testing. The realisation that this method, as it is commonly used, gives alarmingly large numbers of false positive results has spurred several recent attempts to bridge the gap.
There is one uncontroversial application of Bayes’s theorem: diagnostic screening, the tests that doctors give healthy people to detect warning signs of disease. They’re a good way to understand the perils of the deductive approach.
In theory, picking up on the early signs of illness is obviously good. But in practice there are usually so many false positive diagnoses that it just doesn’t work very well. Take dementia. Roughly 1 per cent of the population suffer from mild cognitive impairment, which might, but doesn’t always, lead to dementia. Suppose that the test is quite a good one, in the sense that 95 per cent of the time it gives the right (negative) answer for people who are free of the condition. That means that 5 per cent of the people who don’t have cognitive impairment will test, falsely, as positive. That doesn’t sound bad. It’s directly analogous to tests of significance which will give 5 per cent of false positives when there is no real effect, if we use a p-value of less than 5 per cent to mean ‘statistically significant’.
But in fact the screening test is not good – it’s actually appallingly bad, because 86 per cent, not 5 per cent, of all positive tests are false positives. So only 14 per cent of positive tests are correct. This happens because most people don’t have the condition, and so the false positives from these people (5 per cent of 99 per cent of the people), outweigh the number of true positives that arise from the much smaller number of people who have the condition (80 per cent of 1 per cent of the people, if we assume 80 per cent of people with the disease are detected successfully). There’s a YouTube video of my attempt to explain this principle, or you can read my recent paper on the subject.
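The screening arithmetic can be checked with a short Python sketch of Bayes’s theorem. The three input figures come from the text; the function name and the parameter names “sensitivity” and “specificity” are the standard labels for those quantities, applied here for illustration rather than taken from the article.

```python
def positive_predictive_value(prevalence, sensitivity, specificity):
    """Bayes's theorem for a yes/no test: P(condition | positive result)."""
    true_positives = prevalence * sensitivity
    false_positives = (1 - prevalence) * (1 - specificity)
    return true_positives / (true_positives + false_positives)


# Figures from the text: 1% prevalence, 80% detected, 95% correctly cleared.
ppv = positive_predictive_value(prevalence=0.01, sensitivity=0.80, specificity=0.95)
print(f"Positive tests that are correct: {ppv:.0%}")      # about 14%
print(f"Positive tests that are false:   {1 - ppv:.0%}")  # about 86%
```

Raising the prevalence argument shows how quickly the picture changes: the same test looks far more trustworthy when the condition is common in the population being tested.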
Notice, though, that it’s possible to calculate the disastrous false-positive rate for screening tests only because we have estimates for the prevalence of the condition in the whole population being tested. This is the prior probability that we need to use Bayes’s theorem. If we return to the problem of tests of significance, it’s not so easy. The analogue of the prevalence of disease in the population becomes, in the case of significance tests, the probability that there is a real difference between the pills before the experiment is done – the prior probability that there’s a real effect. And it’s usually impossible to make a good guess at the value of this figure.
An example should make the idea more concrete. Imagine testing 1,000 different drugs, one at a time, to sort out which works and which doesn’t. You’d be lucky if 10 per cent of them were effective, so let’s proceed by assuming a prevalence or prior probability of 10 per cent. Say we observe a ‘just significant’ result, for example, a P = 0.047 in a single test, and declare that this is evidence that we have made a discovery. That claim will be wrong, not in 5 per cent of cases, as is commonly believed, but in 76 per cent of cases. That is disastrously high. Just as in screening tests, the reason for this large number of mistakes is that the number of false positives in the tests where there is no real effect outweighs the number of true positives that arise from the cases in which there is a real effect.
In general, though, we don’t know the real prevalence of true effects. So, although we can calculate the p-value, we can’t calculate the number of false positives. But what we can do is give a minimum value for the false positive rate. To do this, we need only assume that it’s not legitimate to say, before the observations are made, that the odds that an effect is real are any higher than 50:50. To do so would be to assume you’re more likely than not to be right before the experiment even begins.
If we repeat the drug calculations using a prevalence of 50 per cent rather than 10 per cent, we get a false positive rate of 26 per cent, still much bigger than 5 per cent. Any lower prevalence will result in an even higher false positive rate.
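The effect of the prior can be seen with a cruder tally in Python, built the same way as the screening example. To be clear about the assumptions: this sketch counts every result below the 0.05 threshold, and it borrows the 80 per cent detection rate from the screening example as the power of each test. It is not the calculation behind the 76 and 26 per cent figures, which concern a result of about P = 0.047 specifically, so it does not reproduce them. The function name is hypothetical.

```python
def false_positive_share(n_tests, prior, power=0.80, alpha=0.05):
    """Share of 'significant' results that are false, counting every p < alpha."""
    real_effects = n_tests * prior
    null_effects = n_tests - real_effects
    true_positives = real_effects * power     # real effects that are detected
    false_positives = null_effects * alpha    # null effects that pass by chance
    return false_positives / (true_positives + false_positives)


for prior in (0.5, 0.1, 0.01):
    share = false_positive_share(n_tests=1000, prior=prior)
    print(f"prior {prior:.0%}: {share:.0%} of 'significant' results are false")
```

This cruder tally gives smaller percentages than the ones quoted in the text, but the direction is the same: the lower the prior probability of a real effect, the larger the share of ‘significant’ results that are false.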
The upshot is that, if a scientist observes a ‘just significant’ result in a single test, say P = 0.047, and declares that she’s made a discovery, that claim will be wrong at least 26 per cent of the time, and probably more.
No wonder then that there are problems with reproducibility in areas of science that rely on tests of significance.
What is to be done? For a start, it’s high time that we abandoned the well-worn term ‘statistically significant’. The cut-off of P < 0.05 that’s almost universal in biomedical sciences is entirely arbitrary – and, as we’ve seen, it’s quite inadequate as evidence for a real effect. Although it’s common to blame Fisher for the magic value of 0.05, in fact Fisher said, in 1926, that P = 0.05 was a ‘low standard of significance’ and that a scientific fact should be regarded as experimentally established only if repeating the experiment ‘rarely fails to give this level of significance’.
The ‘rarely fails’ bit, emphasised by Fisher 90 years ago, has been forgotten. A single experiment that gives P = 0.045 will get a ‘discovery’ published in the most glamorous journals. So it’s not fair to blame Fisher, but nonetheless there’s an uncomfortable amount of truth in what the physicist Robert Matthews at Aston University in Birmingham had to say in 1998:
‘The plain fact is that 70 years ago Ronald Fisher gave scientists a mathematical machine for turning baloney into breakthroughs, and flukes into funding. It is time to pull the plug.’
The underlying problem is that universities around the world press their staff to write whether or not they have anything to say. This amounts to pressure to cut corners, to value quantity rather than quality, to exaggerate the consequences of their work and, occasionally, to cheat. People are under such pressure to produce papers that they have neither the time nor the motivation to learn about statistics, or to replicate experiments. Until something is done about these perverse incentives, biomedical science will be distrusted by the public, and rightly so. Senior scientists, vice-chancellors and politicians have set a very bad example to young researchers. As the zoologist Peter Lawrence at the University of Cambridge put it in 2007:
hype your work, slice the findings up as much as possible (four papers good, two papers bad), compress the results (most top journals have little space, a typical Nature letter now has the density of a black hole), simplify your conclusions but complexify the material (more difficult for reviewers to fault it!)
But there is good news too. Most of the problems occur only in certain areas of medicine and psychology. And despite the statistical mishaps, there have been enormous advances in biomedicine. The reproducibility crisis is being tackled. All we need to do now is to stop vice-chancellors and grant-giving agencies imposing incentives for researchers to behave badly.
This last paragraph is an egregious act of “FRAMING” – that is, diluting and denying what one just said by establishing a “positive” CONTEXT: “But there is good news too,” “advances in biomedicine,” “crisis being tackled,” “it’s vice-chancellors’ and grant-giving agencies’ fault” (not the poor beleaguered researchers who are “forced to” be dishonest!)