What Those Cornell Pizza Studies Teach Us About Bad Science

Sloppy statistics and research misconduct are nothing new, but it's rare to get a clear picture of exactly how questionable data gets turned into clicky headlines. We have that now with the latest reporting on Cornell food scientist Brian Wansink, and it's worth taking a minute to look at exactly what's so wrong with the dodgy research techniques he's been accused of employing.

We mentioned last year that Wansink's research was being questioned, starting with four studies on pizza buffets that Cornell ruled were sloppy but not fraudulent. More of his work has been investigated since then, and it is looking fishier and fishier. This weekend, Stephanie M. Lee of BuzzFeed (no relation to our own Stephanie Lee) published emails from Wansink's research team that provide a great lesson in how not to do science.

Before we get into specifics, an important note: Science still works. Bad studies are out there, and people can misunderstand or misreport good studies, too. (Anyone who reads our Dose of Reality posts has figured that out.) The tricky thing is telling who is doing legitimate, well-designed studies and processing them with appropriate statistical techniques, versus who is just imitating those practices. The real world is a mix of both. So, don't lose faith! But do stay sceptical. Here are some of the things that can go wrong:

P-hacking Can 'Find' Correlations That Aren't Really There

One of the fundamental questions you have to ask after doing a study is: Did I even find anything? Let's say you want to know whether people prefer cake or pie. If nobody has a preference, you can set out both desserts at your next party, and you would expect the same number of servings would disappear from each.

But let's say people eat 31 slices of cake, and 30 slices of pie. That difference probably isn't big enough for you to conclude that your crowd has a preference. In research terms, the difference isn't statistically significant.

So how do you tell if a result is significant? One way is to calculate how likely you would be to see a result at least as extreme as yours just by chance, if there were no real effect at all. If that probability is less than five per cent, then the rule of thumb is that people will believe your result could be valid.

Statisticians call that number a p-value, and if it's under 0.05 (which is just another way of saying five per cent), then you can call your result significant. P-values have their limitations, and they don't mean that you've determined your result is due to a real effect rather than to chance. But they give you a starting point for knowing which data not to bother with. If your p is over 0.05, you haven't shown that your numbers mean anything.
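To make the cake-and-pie example concrete, here is a minimal Python sketch of one way to get a p-value for it: an exact two-sided binomial test, which assumes each guest independently picks cake or pie with 50/50 odds if nobody has a preference. The function name and the lopsided 45-versus-16 party are made up for illustration; only the 31-versus-30 split comes from the example above.

```python
from math import comb

def two_sided_binomial_p(successes: int, trials: int) -> float:
    """Exact two-sided p-value, assuming a fair 50/50 split (no preference)."""
    # Probability of each possible cake count if nobody has a preference.
    probs = [comb(trials, k) * 0.5 ** trials for k in range(trials + 1)]
    observed = probs[successes]
    # Add up every outcome at least as unlikely as the one we saw.
    # The tiny tolerance keeps floating-point ties on the right side.
    total = sum(p for p in probs if p <= observed * (1 + 1e-9))
    return min(1.0, total)

cake, pie = 31, 30  # the party from the example
p_close = two_sided_binomial_p(cake, cake + pie)

# Hypothetical lopsided party: 45 slices of cake, 16 of pie.
p_lopsided = two_sided_binomial_p(45, 45 + 16)

print(f"31 vs 30: p = {p_close:.3f}")
print(f"45 vs 16: p = {p_lopsided:.5f}")
```

A 31-to-30 split is as close to even as 61 slices can get, so its p-value comes out at essentially 1: nothing to see here. The made-up 45-to-16 split lands well under 0.05, which is the kind of gap you would want before claiming your crowd prefers cake.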

So, here's the problem. If you look at a bunch of totally random data, you'll find "significant" p-values five per cent of the time. If you were to look at a ton of variables until a p-value stood out at you, you could cherry-pick those unusual data points and pretend they mean something. XKCD summarised this nicely in cartoon form: Jelly beans may not cause acne, but if you run the same test on 20 different flavours of jelly beans, you might just find that one of the flavours turns up a positive result as a fluke.
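Here is a rough Python sketch of the jelly bean problem. It leans on one simplifying assumption, flagged in the code: when there is no real effect, each test's p-value is equally likely to land anywhere between 0 and 1, and the 20 tests are independent. The constant and function names are invented for this illustration.

```python
import random

ALPHA = 0.05      # the usual significance cutoff
N_FLAVOURS = 20   # number of separate tests, as in the jelly bean cartoon

# Chance that at least one of 20 independent no-effect tests dips under 0.05.
analytic = 1 - (1 - ALPHA) ** N_FLAVOURS

def any_false_positive(rng: random.Random) -> bool:
    # Assumption: with no real effect, each p-value is uniform on [0, 1).
    return any(rng.random() < ALPHA for _ in range(N_FLAVOURS))

rng = random.Random(0)  # fixed seed so the run is repeatable
n_studies = 100_000
hits = sum(any_false_positive(rng) for _ in range(n_studies))
simulated = hits / n_studies

print(f"analytic:  {analytic:.3f}")
print(f"simulated: {simulated:.3f}")
```

Both numbers come out at roughly 64 per cent. In other words, if you test 20 flavours of pure noise, you are more likely than not to walk away with at least one "significant" flavour.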

If you're a careful scientist, you will make sure to analyse your results with this pitfall in mind. (One simple way: Choose a number smaller than 0.05 as your p-value cutoff.)
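One common version of that smaller cutoff is the Bonferroni correction: divide 0.05 by the number of tests you run. The article doesn't say which correction any particular researcher should use, so treat this Python sketch as just one option. The p-values in it are hypothetical.

```python
def bonferroni_cutoff(alpha: float, n_tests: int) -> float:
    """Per-test cutoff that keeps the overall false-positive chance under alpha."""
    return alpha / n_tests

def significant_after_correction(p_values: list[float], alpha: float = 0.05) -> list[bool]:
    # A result only counts if it beats the stricter, corrected cutoff.
    cutoff = bonferroni_cutoff(alpha, len(p_values))
    return [p < cutoff for p in p_values]

# Hypothetical p-values for 20 jelly bean flavours: one looks "significant"
# at the usual 0.05 cutoff, the rest clearly do not.
p_values = [0.03] + [0.40] * 19

cutoff = bonferroni_cutoff(0.05, len(p_values))
flags = significant_after_correction(p_values)

# Chance of at least one fluke across all 20 independent tests at the new cutoff.
family_wise = 1 - (1 - cutoff) ** len(p_values)

print(f"per-test cutoff: {cutoff:.4f}")
print(f"any flavour still significant? {any(flags)}")
print(f"overall fluke chance: {family_wise:.3f}")
```

With 20 tests the cutoff drops to 0.0025, so the flavour with p = 0.03 no longer counts, and the overall chance of a fluke stays just under five per cent. The trade-off is that a stricter cutoff also makes real effects harder to detect.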

But if you aren't a careful scientist, this isn't a pitfall but an opportunity. You can run any study you like, and if it's large enough and measures enough variables, you'll almost always find some results that look significant! Exploiting this statistical phenomenon is called "p-hacking" or "data fishing". If you publish results you got this way, you're likely publishing a lot of false positives.

When I read research papers, as soon as I notice that a study is looking at many different variables, I hit Ctrl-F to search in the page for the phrase "multiple comparisons" to see if there is a section where the researchers explain how they handled the situation. Since statistics is not my area of expertise, I'll ask an outside expert if I need a judgment call on the quality of their analysis. But if the researchers are doing multiple comparisons and don't even mention this issue at all, that's a big red flag.

Good Scientists Ask The Question Before They Collect the Data, Not Vice Versa

Every experiment is a question. Ideally you collect data as a way of answering your question, and you see what the data tell you.

But if you p-hack, you might just be collecting random data, and then making up stories afterward about what the data might mean. It's the scientific equivalent of seeing your Magic 8-Ball give an answer you don't like, and then lying about what question you asked.

The nickname for this is HARKing, or "hypothesising after results are known". (The hypothesis is your guess of what the data will show you; it's basically your research question.)

This is a problem because if you really wanted to test a certain question, you would set up an experiment that asks it well. Grabbing a few data points from a much larger study just isn't the same.

HARKing is easy to hide: All you have to do is never publish the part of your experiment that didn't work out. Just write up the paper as if you knew what you were doing from the beginning. To guard against this, medical journals often require researchers to pre-register their studies, writing a description someplace such as clinicaltrials.gov explaining how the study is designed and what outcomes it will test. Cognitive neuroscientist Chris Chambers advised young researchers on Twitter that you can always pre-register your own studies privately even if the person running your lab doesn't love the idea. Registries where you can do this include aspredicted.org and osf.io.

Scientists' Careers Are Shaped By the Pressure to Publish

P-hacking and HARKing are arguably the result, not the cause, of a bigger problem in science: The pressure to publish. Your career as a scientist, including your funding and your chances of getting tenure or getting promoted, typically relies on having published a lot of research - preferably in big-name journals, and ideally with studies big enough to make it to news headlines. (You'll recall that Wansink's studies made headlines all the dang time.)

These pressures encourage statistical tricks and bad research design because studies usually have to show something, ideally something new or surprising, to have a good chance of being published. Remember last week's diet study and how I mentioned that it was unusual for something with negative results to be published? We need more of that in science, but it's an uphill battle.

This detail from the Wansink Lab's emails shows how the pressure to publish can change how a lab processes its data:

Wansink wrote: "Too much inventory; not enough shipments." Ideally, he mused, a science lab would function like a tech company. Tim Cook, for instance, was renowned for getting products out of Apple's warehouses faster and pumping up profits. "As Steve Jobs said, 'Geniuses ship,'" Wansink wrote.

So, he proposed, the lab should adopt a system of strict deadlines for submitting and resubmitting research until it landed somewhere. "A lot of these papers are laying around on each of our desktops and they're like inventory that isn't working for us," he told the team. "We've got so much huge momentum going. This could make our productivity legendary."

Persistence in submitting papers can pay off by eventually getting your work published, but what does that mean for the rest of us - people who read (and write) headlines that are based on shoddy research that was rejected from plenty of journals but resubmitted until it finally stuck?

A workflow like this could take almost any study, relevant or not, and make it publishable by analysing it badly and submitting to lower and lower tier journals until it gets accepted somewhere, and then writing a press release that trumpets the results. Most ordinary folks only ever see the last step in that process: A plausible-sounding conclusion in the media, with assurances that there was some kind of science behind it.

Investigations are ongoing into how many of the Wansink lab's anomalies are p-hacking and HARKing, versus deliberate fraud, versus a series of honest mistakes. But the whole saga is kind of a cautionary tale about what's behind a lot of under-scrutinised science that shows up in the media. Probably nobody would have found out what the lab was doing if Wansink hadn't written a blog post in late 2016 praising a student who turned a "failed study that had null results" into four published papers.

I've found it sobering, though not surprising, to see how scientists are processing the news. In one informal Twitter poll of researchers, 68 per cent said Wansink's actions were no different from what normally goes on in science, just more blatant. Psychologist Pete Etchells tweeted a more dismaying thought: "Yeah, loads of people have been saying the thing that sets apart Wansink is the scale of what's been going on. I keep thinking that the thing setting him apart is that he's been caught."


