Irreproducibility of Results and the Dead Fish Test


Years ago (as in 1984) a colleague gave me a copy of the Journal of Irreproducible Results. I’m not sure whether the Journal is still publishing new issues, but the problem of people publishing, deliberately or inadvertently, results that others cannot replicate is not going away and in fact (no pun intended) is a growing topic.

Countless TED talks, commercials (such as the tag line “neuroscience tells us that…”) and the claims of celebrities and politicians alike rest on the largely unchallenged headlines that cross our smartphones by the minute about some “new study” and its findings.

Science is built on the idea that a hypothesis is formulated, an experiment is designed to prove (or fail to prove, as the case may be) that hypothesis, and the results are published, data and all. Tests that fail to prove a hypothesis are just as valuable as tests that confirm it: knowing what doesn’t work is as important as knowing what does.

But there are now two issues dogging many areas of research and public policy. The first is money; the second is complexity.

A recent New York Times article by Kate Murphy, titled “Do You Believe in God, or Is That a Software Glitch?”, describes how many claims in the field of neuroscience may have a questionable basis in fact, owing both to complexity (and the hidden biases of software) and to the growing practice of withholding experimental data in order to maximize the profit potential of any discoveries:

We’ve all seen them, those colorful images that show how our brains “light up” when we’re in love, playing a video game, craving chocolate, etc. Created using functional magnetic resonance imaging, or fM.R.I., these pictures are the basis of tens of thousands of scientific papers, the backdrop to TED talks and supporting evidence in best-selling books that tell us how to maintain healthy relationships, make decisions, market products and lose weight.

But a study published last month in the Proceedings of the National Academy of Sciences uncovered flaws in the software researchers rely on to analyze fM.R.I. data. The glitch can cause false positives — suggesting brain activity where there is none — up to 70 percent of the time.

“We have entered an era where the kinds of data and the analyses that people run have gotten incredibly complicated,” said Martin Sereno, the chairman of the cognitive neuroimaging department at the University of California, San Diego. “So you have researchers using sophisticated software programs that they probably don’t understand but are generally accepted and everyone uses.”

“There is an immense amount of flexibility in how anybody is going to analyze data,” said Russell Poldrack, who leads a cognitive neuroscience lab at Stanford University and is a co-author of the “Handbook of functional MRI Data Analysis.” And, he continued, “some choices make a bigger difference than others in the integrity of your results.”

To try to create some consistency and enhance credibility, he and other leaders in the field recently published a lengthy report titled “Best Practices in Data Analysis and Sharing in Neuroimaging Using MRI.” They said their intent was to increase transparency through comprehensive sharing of data, research methods and final results so that other investigators could “reproduce findings with the same data, better interrogate the methodology used and, ultimately, make best use of research funding by allowing reuse of data.”

The shocker is that transparency and reproducibility aren’t already required, given that we’re talking about often publicly funded, peer-reviewed, published research. And it’s much the same in other scientific disciplines.

Indeed, a study published last year in the journal Science found that researchers could replicate only 39 percent of 100 studies appearing in three high-ranking psychology journals. Research similarly hasn’t held up in genetics, nutrition, physics and oncology. The fM.R.I. errors added fuel to what many are calling a reproducibility crisis.

“People feel they are giving up a competitive advantage” if they share data and detail their analyses, said Jean-Baptiste Poline, senior research scientist at the University of California, Berkeley’s Brain Imaging Center. “Even if their work is funded by the government, they see it as their data. This is the wrong attitude because it should be for the benefit of society and the research community.”

There is also resistance because, of course, nobody likes to be proved wrong. Witness the blowback against those who ventured to point out irregularities in psychology research, dismissed by some as the “replication police” and “shameless little bullies.”

Process excellence work rests on the concept of data that is relevant (that is, understanding which are the vital few things to measure vs. the trivial many) and that has a known probability of error (or noise). Things start to go south in an experiment or a business not only when we don’t know what to measure, but also when the data is deliberately hidden (for security or commercial reasons) or is hidden inadvertently because it exists deep within a software application whose very complexity defies human comprehension without a concerted effort.

This is why somewhere in every decision loop, such as when someone in an organization interprets something or makes a decision, we need to design cross-checks that ask the proverbial “dumb questions” about whether the things a computer is spitting out make sense, or that conduct the equivalent of the “dead fish test” described in the Times article:

Developed in the 1990s, fM.R.I. creates images based on the differential effects a strong magnetic field has on brain tissue. The scans occur at a rate of about one per second, and software divides each scan into around 200,000 voxels — cube-shaped pixels — each containing about a million brain cells. The software then infers neural activity within voxels or clusters of voxels, based on detected blood flow (the areas that “light up”). Comparisons are made between voxels of a resting brain and voxels of a brain that is doing something like, say, looking at a picture of Hillary Clinton, to try to deduce what the subject might be thinking or feeling depending on which area of the brain is activated.

But when you divide the brain into bitty bits and make millions of calculations according to a bunch of inferences, there are abundant opportunities for error, particularly when you are relying on software to do much of the work. This was made glaringly apparent back in 2009, when a graduate student conducted an fM.R.I. scan of a dead salmon and found neural activity in its brain when it was shown photographs of humans in social situations. Again, it was a salmon. And it was dead.

This is not to say all fM.R.I. research is hooey. But it does indicate that methods matter even when using whiz-bang technology. In the case of the dead salmon, what was needed was to statistically correct for false positives that arise when you make so many comparisons between voxels.
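
To make that last point concrete, here is a rough sketch in Python of why so many comparisons produce false positives and what a correction does about it. It is not the method used in the salmon study or in any fM.R.I. package. It assumes every voxel is an independent test with no real signal (real voxels are spatially correlated, so this is a simplification), and it swaps in two textbook corrections, Bonferroni and Benjamini-Hochberg, as illustrations. All function names and the 0.05 threshold are my own choices for the example.

```python
import random


def familywise_error_rate(alpha, num_tests):
    """Chance of at least one false positive across independent tests."""
    return 1.0 - (1.0 - alpha) ** num_tests


def count_uncorrected(p_values, alpha):
    """Number of tests called 'significant' with no correction at all."""
    return sum(p <= alpha for p in p_values)


def count_bonferroni(p_values, alpha):
    """Bonferroni: divide the threshold by the number of tests."""
    threshold = alpha / len(p_values)
    return sum(p <= threshold for p in p_values)


def count_benjamini_hochberg(p_values, q):
    """Benjamini-Hochberg: find the largest rank k whose sorted p-value
    is at most (k / m) * q, and call the k smallest p-values significant."""
    m = len(p_values)
    largest_rank = 0
    for rank, p in enumerate(sorted(p_values), start=1):
        if p <= rank / m * q:
            largest_rank = rank
    return largest_rank


def simulate_null_p_values(num_voxels, seed=0):
    """Assumption: with no real effect, p-values are uniform on [0, 1]."""
    rng = random.Random(seed)
    return [rng.random() for _ in range(num_voxels)]


def main():
    num_voxels = 200_000  # the per-scan voxel count quoted in the article
    alpha = 0.05
    p_values = simulate_null_p_values(num_voxels)

    print("Chance of at least one false positive:",
          familywise_error_rate(alpha, num_voxels))
    print("'Active' voxels, uncorrected:",
          count_uncorrected(p_values, alpha))
    print("'Active' voxels, Bonferroni:",
          count_bonferroni(p_values, alpha))
    print("'Active' voxels, Benjamini-Hochberg:",
          count_benjamini_hochberg(p_values, alpha))


if __name__ == "__main__":
    main()
```

With no correction, roughly 5 percent of the simulated voxels "light up" even though there is nothing there, which is the dead fish in miniature. Bonferroni is the blunt fix (it controls the chance of any false positive at all), while Benjamini-Hochberg is more lenient (it controls the expected share of false positives among the voxels flagged). Which one suits a given study is a judgment call that this sketch does not settle.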