Friday, May 16, 2025 | science

An 'explosion' of formulaic research papers written with AI

A new paper gives us a clue how the idea that computers are never wrong will kill us all
Have you ever wondered why eating blueberries cures cancer one day and causes it the next? Why, two hundred years after the first functional gas stove was built by James Sharp, they were suddenly found to cause asthma when the government wanted to ban them? Now we know the answer: computers running AI are making it all up.
A new paper [1] claims that “AI-assisted workflows” are producing an ‘explosion’ of “formulaic research articles, including inappropriate study designs and false discoveries” published by so-called paper mills, which, according to the authors, include many top scientific journals. The authors write:
Employing AI-supported single-factor approaches removes context from research, fails to capture interactions, avoids false discovery correction, and is an approach that can easily be adopted by paper mills.
The articles use data from the National Health and Nutrition Examination Survey (NHANES), an AI-ready dataset for which automated tools have been written. ‘Depression’ was the most frequent condition studied, with 28 papers finding positive or negative statistical correlations between depression and blood pressure, ethylene oxide levels, PFAS, blood cadmium, the ratio of non-HDL to HDL cholesterol, and many other things, all of which were calculated wrong. Some analyzed multifactorial conditions as single-factor problems. Some selectively extracted data to fit their flawed design. Some engaged in HARKing, which means hypothesizing after the results are known, perhaps by “subsetting,” or dividing the population up in various ways until something becomes statistically significant.
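To see how little effort subsetting takes, here is a minimal sketch in Python. Everything in it is simulated and all the variable names are mine, not the paper's: the exposure and the outcome are pure noise, yet slicing the population into enough subgroups reliably produces something publishable-looking, at a rate of roughly one false positive per twenty slices examined.

```python
# Sketch of "subsetting" HARKing on simulated data: nothing here is real.
# With 24 subgroups of pure noise at p < 0.05, expect ~1 chance "finding".
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
n = 2000
exposure = rng.normal(size=n)      # e.g. some blood metabolite (simulated)
depression = rng.normal(size=n)    # no true relationship, by construction
age = rng.integers(20, 80, size=n)
sex = rng.integers(0, 2, size=n)

hits = []
for lo in range(20, 80, 5):        # 12 age bands x 2 sexes = 24 tests
    for s in (0, 1):
        mask = (age >= lo) & (age < lo + 5) & (sex == s)
        r, p = stats.pearsonr(exposure[mask], depression[mask])
        if p < 0.05:
            hits.append((lo, s, r, p))

for lo, s, r, p in hits:
    print(f"ages {lo}-{lo+4}, sex={s}: r={r:.2f}, p={p:.3f}  <- 'finding'")
```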
They also didn't correct for the false discovery rate, or FDR:
Of the 28 statistically significant associations, less than half (13) remained statistically significant after FDR correction.
FDR testing is necessary when you do many hundreds of correlations. Some will always come out significant just by chance, even if you set your significance level to p<0.001. Statisticians have devised many ways of correcting for this problem, FDR being one of the weakest and Bonferroni being more stringent. Of course the real responsibility lies with the authors, whose only goal is to get a paper. In the drive to publish as much as possible, researchers forget that a correlation alone is meaningless unless it's backed up by a solid, well-established mechanism. Even then, the possibility of confounding factors remains.
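Here's a minimal sketch of the problem and both corrections, again on simulated data with illustrative names. Five hundred exposures are tested against an outcome when nothing is real; at p < 0.05 roughly 25 chance "discoveries" appear, and the corrections are what removes them.

```python
# 500 correlations against pure noise: raw testing yields ~25 false hits;
# Benjamini-Hochberg FDR and the stricter Bonferroni should remove them.
import numpy as np
from scipy import stats
from statsmodels.stats.multitest import multipletests

rng = np.random.default_rng(42)
n_tests, n_subjects = 500, 100

outcome = rng.normal(size=n_subjects)
exposures = rng.normal(size=(n_tests, n_subjects))   # all noise by design

pvals = np.array([stats.pearsonr(x, outcome)[1] for x in exposures])

print("raw p < 0.05:       ", (pvals < 0.05).sum())
print("survive FDR (B-H):  ", multipletests(pvals, method='fdr_bh')[0].sum())
print("survive Bonferroni: ", multipletests(pvals, method='bonferroni')[0].sum())
```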
It should be added that the papers could very well be correct. Maybe some or all of these conditions aren't multifactorial at all but genuinely single-factor problems. Unfortunately, there's no way to know, and there's also the problem of confounding factors.
A classic example of that was the ozone hole, where a potential problem was identified, a plausible mechanism was found, and evidence was found that the proposed mechanism occurred. The authors got a Nobel Prize and a treaty banning the bad molecule, and . . . 38 years later . . . the ozone hole is as big as ever, though you'll never hear that in the popular press. Clearly something else, remaining to be discovered, must also be contributing, but few people care: there's no percentage in looking for it. They're not going to give another Nobel Prize for proving the first one was a ‘whoopsie.’
The authors also found that between 2021 and 2024 the great majority of these “formulaic” manuscripts (292 of 316) came from authors in mainland China.
They cite specific examples. In PMID 39377074, a group from Shanghai claimed in Frontiers in Endocrinology to have found a correlation between body shape index and abdominal aortic calcification. In PMID 38840158 the same group linked it to cognitive impairment.
You don't need AI, or a Shanghai address, to produce statistically flawed results, nor is nutrition the only field where it happens. There are thousands of such studies from every country in every field. The authors say FDR correction should be used in all data-dredging studies because the number of potential hypotheses is enormous. The problem is that researchers always say they're doing exploratory research, which isn't intended to be the final answer but only to turn up as many clues as possible. And that defense has some validity.
Take chronic inflammation, which is now “implicated” in almost every disease you can think of, from cancer to heart disease to Alzheimer's disease. Is it a cause, a contributing factor, or merely a side-effect as unhealthy cells die, release their contents, and activate the DNA damage response? No one knows; they're all too busy finding correlations with air pollution, microplastics, PFAS, and BPA.
Polygenic risk score (PRS) is a similar questionable technique. You take a large number of patients and healthy people, analyze their DNA, and pick the top 5% or so of the gene polymorphisms, called SNPs, that show a difference. This is called a GWAS, or genome-wide association study. It was invented as a tool for exploration, and it's been very useful.
As an exploratory tactic, it's reasonable. But it wasn't meant to be used in the clinic to diagnose cardiovascular disease [2] or mental health disorders [3], which according to the experts are “highly polygenic.” What, if anything, does this mean? In fact, it means almost nothing. PRS is basically a way of adding hundreds or thousands of hypotheses together into a single number. You then assume the sum has some objective meaning and correlate it with something else, say political belief or number of blueberries eaten per week, thereby proving that people who eat blueberries are all insane or all likely to die in some horrible way. It doesn't distinguish cause and effect, it fails to consider the patient's environment, and it has no mechanistic basis. Its connection to GWAS makes it appear scientific, but it's not. This is why it was rejected by proteomics researchers. It's also why a PRS based on a European database can't be used for cohorts of non-European ancestry.
PRS is only popular because it gets around the fact that we have no clue what causes the disease or—especially in mental health disorders—whether the disease entity is even real.
GWAS usually corrects for the FDR to address the problem of multiple correlations, but PRS does not.[4] Anderson et al. write:
By aggregating all variant effects into one continuous metric, one can also skirt the strict multiple comparisons corrections necessary for GWAS. In other words, rather than correcting for millions of univariate statistical tests looking for hits across the genome, just one statistical test is required to regress genome-wide PRS onto the phenotype of interest. [3]
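The quote is easy to demonstrate. Here is a minimal sketch, assuming the standard weighted-sum form of the score; the effect sizes and genotypes are simulated, not from any real GWAS. One dot product collapses 100,000 weak associations into a single number per person, and the association with any phenotype then needs exactly one uncorrected test.

```python
# Sketch of a polygenic risk score in its standard weighted-sum form:
# PRS_i = sum_j (risk-allele count)_ij * (GWAS effect size)_j.
# All data are simulated; no real GWAS weights are used.
import numpy as np
from scipy import stats

rng = np.random.default_rng(7)
n_snps, n_people = 100_000, 200

beta = rng.normal(0, 0.01, n_snps)    # per-SNP weights (simulated)
genotypes = rng.integers(0, 3, size=(n_people, n_snps), dtype=np.int8)

prs = genotypes @ beta                # one number per person

# Anderson et al.'s point: instead of millions of corrected per-SNP
# tests, the score is regressed on the phenotype just once.
phenotype = rng.normal(size=n_people)  # unrelated to the SNPs, by design
slope, intercept, r, p, se = stats.linregress(prs, phenotype)
print(f"single test: r={r:.3f}, p={p:.3f}")  # one p-value, no correction
```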
Even if the numbers were corrected for the FDR, it's a statistical shipwreck: how can 100,000 numbers with no individual connection to the disease sum up to something that's predictive? These weaknesses make it worthless for doing science. Even for clinical studies, there are doubts. Khanna et al. say:
[I]t should not be the only tool to identify the risk in patients because it does not consider the environmental factors, lifestyle, and family history, which could be different for every individual.[2]
A good example of how the polygenic risk score is misused is the study by Saarinen et al. [5], who correlated belief in magic with polygenic risk for schizophrenia, a disorder whose cause is totally unknown. The existence of thousands of papers praising PRS doesn't change the fact that adding up a hundred thousand genes with no discernible connection to the disorder doesn't turn them into a diagnosis.
The risk is obvious: too many bad studies can wipe out confidence in all of science and lead to bad public policy. This has been going on for decades in many fields. It's why people still think they're eating too many eggs and need to stop eating meat, driving a car, and having children to save the planet. What's new is that AI makes it all sound scientific. The idea that the computer is never wrong could kill us all.
The authors propose access barriers, rejection of publications that appear formulaic, more retractions based on post-publication criticism on PubPeer, and bureaucratic controls such as licensing. That's a great way to make the problem worse. PubPeer is particularly pernicious, as many of the critiques found there are founded on speculation and personal animosity.
The source of the problem is the demand in academia to publish as much as possible. There's also a lack of statistical training in grad school. When I was a student, they gave us none of it. I had to read many books on the subject, most of which were either dry, basic, or highly mathematical. Or they were from Iowa State in 1937 and talked mainly about corn.
Here's a better solution. Forbid companies from doing drug testing on their own product. Fund mechanism-based research over data dredging and insist that any correlation must have mechanistic support. Eliminate the practice of publication-counting in academia. Make statistics courses mandatory in the STEM curriculum. And find some way to accommodate exploratory research, maybe by labeling it as such.
That's not as simple as pulling up the ladder behind you, as the authors seem to want, but it's what you'd need if you wanted to solve the problem. The hard part is convincing people the problem exists, and for that the authors deserve praise.
[1] Suchak T, Aliu AE, Harrison C, Zwiggelaar R, Geifman N, Spick M. Explosion of formulaic research articles, including inappropriate study designs and false discoveries, based on the NHANES US national health database. PLoS Biol. 2025 May 8;23(5):e3003152. doi: 10.1371/journal.pbio.3003152. PMID: 40338847; PMCID: PMC12061153.
[2] Khanna NN, Singh M, Maindarkar M, Kumar A, Johri AM, Mentella L, Laird JR, Paraskevas KI, Ruzsa Z, Singh N, Kalra MK, Fernandes JFE, Chaturvedi S, Nicolaides A, Rathore V, Singh I, Teji JS, Al-Maini M, Isenovic ER, Viswanathan V, Khanna P, Fouda MM, Saba L, Suri JS. Polygenic Risk Score for Cardiovascular Diseases in Artificial Intelligence Paradigm: A Review. J Korean Med Sci. 2023 Nov 27;38(46):e395. doi: 10.3346/jkms.2023.38.e395. PMID: 38013648; PMCID: PMC10681845.
[3] Anderson JS, Shade J, DiBlasi E, Shabalin AA, Docherty AR. Polygenic risk scoring and prediction of mental health outcomes. Curr Opin Psychol. 2019 Jun;27:77–81. doi: 10.1016/j.copsyc.2018.09.002. PMID: 30339992; PMCID: PMC6426686.
[4] Wray NR, Goddard ME, Visscher PM. Prediction of individual genetic risk to disease from genome-wide association studies. Genome Res. 2007;17:1520–1528. PMID: 17785532.
[5] Saarinen A, Lyytikäinen LP, Hietala J, Dobewall H, Lavonius V, Raitakari O, Kähönen M, Sormunen E, Lehtimäki T, Keltikangas-Järvinen L. Magical thinking in individuals with high polygenic risk for schizophrenia but no non-affective psychoses-a general population study. Mol Psychiatry. 2022 Aug;27(8):3286–3293. doi: 10.1038/s41380-022-01581-z. PMID: 35505089; PMCID: PMC9708578.
may 16 2025, 2:50 pm. updated may 17 2025, 5:14 am