Did the Marshmallow Test Fail to Replicate?

June 7, 2018

It’s one of those cute social-science findings that appear in books you can buy at the airport: In a study released in 1990, continuing work that began in the 1960s, researchers showed they could predict kids’ SAT scores reasonably well based on a simple test at age four. In the test, kids choose between eating a marshmallow or other treat right away, or waiting for the experimenters to return, at which point they get two treats. The longer a kid is willing to delay gratification, the higher his SAT scores years later.

A new study “revisits” this notion through a “conceptual replication”—meaning the researchers tested the same idea with somewhat similar methods but did not aim to recreate the original study’s precise contours. According to the sociologist Jessica McCrory Calarco, writing in The Atlantic, the replication “failed,” revealing the Marshmallow Test as little more than an indicator of kids’ wealth and home environments. But according to the economist Jason Collins, the test “held up ok.”

The way I’d put it is this: The new data turn the classic result, long treated as a simple and effective demonstration of the power of self-control, into a Rorschach inkblot. Whatever theory you have about why some kids do well academically and others fail, you can read the data in a way that supports it.

First things first: The basic result of the 1990 study really did replicate just fine.¹ In that study, the correlations between a child’s wait time and his SAT scores were 0.57 for math and 0.42 for verbal. The correlation between the Marshmallow Test at age four and the “Achievement Composite” used in this² study, at age 15, was 0.24 for kids whose moms didn’t have college degrees and 0.17 for children of college grads.³

So, yes, the effect is considerably smaller—which is unsurprising in a replication, especially when the methods are different—but real and statistically significant. In general, kids with better self-control at age four do better academically at age 15.
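One common way to put that gap in perspective, though not one the study itself uses, is to square each correlation, which gives the share of variance in the outcome that is statistically associated with wait time. A minimal sketch using only the four correlations quoted above (the dictionary labels are mine, and the outcomes are different tests, so the comparison is loose):

```python
# Correlations quoted in the text above; the labels are illustrative.
reported = {
    "1990 study, SAT math": 0.57,
    "1990 study, SAT verbal": 0.42,
    "new study, mothers without degrees": 0.24,
    "new study, mothers with degrees": 0.17,
}


def variance_explained(r):
    """Share of outcome variance associated with the predictor (r squared)."""
    return r ** 2


for label, r in reported.items():
    print(f"{label}: r = {r:.2f}, r^2 = {variance_explained(r):.1%}")
```

By this yardstick, wait time goes from tracking roughly a third of the variation in SAT math scores to tracking about 6 percent or less of the variation in the new study's composite.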

The allegations of a “failed replication” come from the fact that the effect more or less disappeared once the researchers added two different groups of statistical controls. And that’s where the interpretation gets really tricky. Contra The Atlantic, the study hardly shows that the Marshmallow Test is a mere proxy for wealth.

The first set of controls involves “child background characteristics.” These include sex; race/ethnicity; family income; mother’s education; mother’s age at the child's birth; mother’s score on a vocabulary test; two different tests measuring “child cognitive functioning” (at ages two and three); a measure of infant temperament; birth weight; and various measures of the child’s home environment, including the presence of learning materials and a measure of parental responsivity. Note that these variables are not limited to income and the home environment; they also include the child’s own traits in infancy and as late as age three.

The second set of controls consists only of the child’s own traits, this time measured around the same time he took the Marshmallow Test. These include tests of “Letter-Word Identification,” “Applied Problems,” “Picture Vocabulary,” “Memory for Sentences,” “Complete Words,” and internalizing and externalizing behaviors.

It’s certainly interesting that the Marshmallow Test’s results weaken, often becoming statistically insignificant, when the first or both sets of control variables are included. What’s not clear is what exactly to make of it. There are several ways to interpret the data, all of which are likely true—at least to some degree.

One interpretation is The Atlantic’s: Both academic achievement and Marshmallow Test performance are results of the home environment. A poor home environment decreases academic performance and also makes people behave more impulsively; a wealthy environment works the opposite way.

However, the study does not report the actual coefficients on the individual control variables, so we can’t tell whether the child’s background or the child’s own traits are doing most of the work. And for kids of mothers without college degrees, the age-15 results actually hold up fairly well, weakening but remaining statistically significant, when only the first set of controls (the one that includes income and home environment) is added.

Further, even if we assume home environment is the key, another interpretation is that self-control mediates the relationship between home environment and academic performance: A poor home environment reduces the ability to delay gratification, which in turn causes bad outcomes. This could generate a pattern of results like the one we see in the study—with the direct correlation between impulsivity (measured imperfectly through the Marshmallow Test) and test scores declining when the home environment is controlled⁴—but wouldn’t undermine the idea that we can improve kids’ outcomes by teaching them self-control.

Still another interpretation is that genes are in play, a possibility buttressed by research in the field of behavioral genetics, which generally finds the entire “shared environment” to be a very small factor in how kids turn out. If kids get their impulsivity (not to mention other academically relevant traits) from their parents, and then you extensively control for what kind of parents the kids have, this too can generate the same result pattern: The Marshmallow Test is predictive when used in isolation, but its power fades in the presence of other variables that indirectly measure the same thing.

Genetic influences are not insurmountable obstacles; the clichéd example is myopia, partly genetic in cause yet easily remedied with eyeglasses. But then again eyeglasses for impulse control are hard to come by.
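The pattern described above, in which a predictor works in isolation but fades once related variables are controlled, is easy to reproduce in a toy simulation. The sketch below is purely illustrative: every coefficient is invented, nothing comes from the study's data, and the single "background" factor is a hypothetical stand-in for whatever parents pass on, genetically or otherwise. In this made-up world, wait time has no effect of its own on achievement; both are driven by the shared factor.

```python
import random
from math import sqrt


def pearson(x, y):
    """Plain Pearson correlation between two equal-length lists."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / sqrt(sxx * syy)


def partial_corr(x, y, z):
    """Correlation between x and y after linearly controlling for z."""
    rxy, rxz, ryz = pearson(x, y), pearson(x, z), pearson(y, z)
    return (rxy - rxz * ryz) / sqrt((1 - rxz ** 2) * (1 - ryz ** 2))


def simulate(n=20000, seed=0):
    rng = random.Random(seed)
    # Hypothetical shared factor (invented, not a variable from the study).
    background = [rng.gauss(0, 1) for _ in range(n)]
    # Wait time reflects the shared factor plus a lot of noise.
    wait = [0.5 * b + rng.gauss(0, 1) for b in background]
    # Achievement depends on the shared factor only, never on wait time.
    achievement = [0.7 * b + rng.gauss(0, 1) for b in background]
    raw = pearson(wait, achievement)
    controlled = partial_corr(wait, achievement, background)
    return raw, controlled


raw, controlled = simulate()
print(f"wait vs. achievement, no controls: {raw:.2f}")
print(f"wait vs. achievement, controlling for background: {controlled:.2f}")
```

With these invented numbers the uncontrolled correlation comes out at about 0.26, and the controlled one is close to zero. The sketch encodes only one of the stories discussed here, so it shows that a common cause can produce the fade-out, not that it is what actually happened in the study.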

Another possibility is that some of the child-level variables might indirectly measure self-control. Candidates include the tests of cognitive functioning and temperament from the first set of controls, as well as some of the second-layer variables, which can reasonably be described as the four-year-old’s version of the tests used to create the age-15 “Achievement Composite.” In this case, the Marshmallow Test will unfairly lose explanatory power when they’re added to the model.

Conversely, the authors report that the Marshmallow Test correlates surprisingly strongly with the “Applied Problems” test, suggesting it is not just measuring self-control to begin with but is also picking up general cognitive functioning somehow (which should be controlled if the goal is to measure self-control alone, assuming the two skills really are independent of each other).

This jumble of confusion is what leads to Collins’s takeaway:

“It’s no surprise that controls of this nature do this. It simply suggests that the controls are better predictors. The original claim was not that the marshmallow test was the best or only predictor.”

What has certainly been thrown into doubt, however, is whether narrow efforts to improve kids’ self-control could be as revolutionary as some came to think based on the results of earlier Marshmallow Test studies. The test might be picking up on kids’ affluence or might be measuring their intelligence in addition to their self-control.

As often happens in social science, it’s just not possible to untangle the various pathways of causation.

Robert VerBruggen is a deputy managing editor of National Review.

1. In this piece, I’m focusing on the SAT-score claim, but for the record, the new study did fail to replicate claims that the Marshmallow Test predicts future behavior problems.

2. The composite is compiled from the Woodcock-Johnson Psycho-Educational Battery-Revised.

3. The results for college grads’ kids are based on a smaller sample size and also limited by the fact that two-thirds of them waited the full seven minutes allowed (vs. 45% of the other sample), reducing the amount of variation in the data.

4. This could be more or less of a problem depending on how much measurement error the Marshmallow Test suffers from. If the test is noisy—as all psychological tests are to a degree—it will be easier for other variables to “steal” its predictive power, and there are a lot of other variables prowling around in this study.
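The point in footnote 4 can be sketched with another toy simulation. Again, every coefficient is invented and nothing here comes from the study: a hypothetical "background" factor shapes true self-control, true self-control shapes achievement (the mediation story), and the marshmallow wait time is the true trait plus measurement noise of adjustable size.

```python
import random
from math import sqrt


def pearson(x, y):
    """Plain Pearson correlation between two equal-length lists."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / sqrt(sxx * syy)


def partial_corr(x, y, z):
    """Correlation between x and y after linearly controlling for z."""
    rxy, rxz, ryz = pearson(x, y), pearson(x, z), pearson(y, z)
    return (rxy - rxz * ryz) / sqrt((1 - rxz ** 2) * (1 - ryz ** 2))


def simulate(measurement_noise_sd, n=20000, seed=1):
    rng = random.Random(seed)
    # Hypothetical home-background factor (invented for illustration).
    background = [rng.gauss(0, 1) for _ in range(n)]
    # True self-control is partly shaped by background.
    self_control = [0.6 * b + rng.gauss(0, 1) for b in background]
    # Achievement is driven by true self-control (pure mediation).
    achievement = [0.5 * s + rng.gauss(0, 1) for s in self_control]
    # Observed wait time is true self-control plus measurement noise.
    wait = [s + rng.gauss(0, measurement_noise_sd) for s in self_control]
    raw = pearson(wait, achievement)
    controlled = partial_corr(wait, achievement, background)
    return raw, controlled


for noise_sd in (0.0, 2.0):
    raw, controlled = simulate(noise_sd)
    print(f"noise sd {noise_sd}: raw r = {raw:.2f}, controlled r = {controlled:.2f}")
```

With these made-up coefficients, the controlled correlation falls from about 0.45 when the trait is measured perfectly to about 0.20 when the measure is noisy, even though self-control matters exactly as much in both runs.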