AI dreams of stacking eggs.
In March 2023, a group of computer scientists at Microsoft Research published an assessment of OpenAI’s new chatbot, GPT-4. The team’s report drew global attention to one test in particular. They posed this question to GPT-4:
Here we have a book, nine eggs, a laptop, a bottle and a nail. Please tell me how to stack them onto each other in a stable manner.
This is not an essay question in which rhetorical skill (bullsh*tting) can substitute for a lack of knowledge. Nor is it a story problem, an algebraic equation that happens to be expressed in words. It is a practical problem demanding common sense and even original thought.
Earlier versions of the chatbot gave absurd answers (such as stacking the nine eggs on top of each other). But GPT-4’s answer called for arranging the eggs in a three-by-three array between the book and the laptop. The egg layer supports the book, whose weight secures the eggs in place. The book in turn supports the bottle and nail. This answer, hailed as a triumph, helped trigger the current global wave of AI hype and angst. It has fueled debate about how large language models (LLMs)—dismissed not inaccurately as autocomplete on steroids—are able to perform creative tasks.
“I started off being very skeptical,” said Peter Lee, head of research at Microsoft. “And that evolved into a sense of frustration, annoyance, maybe even fear. You think: Where the heck is this coming from?”
The Art of Egg Stacking
The egg-stacking puzzle is an example of a functional fixedness problem. As described by Gestalt psychologist Karl Duncker in a 1945 paper, these ask the solver to accomplish a task with a given set of objects. The solution involves using some of the objects in unusual or creative ways (think TV’s MacGyver). There is a substantial literature on the topic.
By 2022, LLMs were being trained on as many as 17 trillion words of human-generated text, mainly from the Internet. It’s all but certain that GPT-4 was exposed to functional fixedness problems in its training. The Microsoft researchers were of course aware of that and invented the eggs and laptop question rather than posing a published problem. This was intended to prevent GPT-4 from cribbing an answer from the Internet.
I did Internet searches for the eggs-and-laptop problem as described in the paper. The searches turned up no hits beyond the reportage of the GPT-4 study itself. However, broadening the search produced some results that may be worth noting.
I learned of Mohammed Muqbel, a Yemeni man living in Malaysia, who holds the Guinness World Record for stacking four eggs vertically, one on top of the other. There’s an online video of him managing this feat.
How does Muqbel do it? Well, very, very carefully. There’s no glue, no real trick, just eggs balanced on top of each other as accurately as is humanly possible. One story reported that Muqbel “had to figure out the center of mass for each egg, then stack them exactly so that each hovered over the other.”
Another group of online sources might be more relevant. There is a science fair project using a horizontal array of eggshells to demonstrate how arches support weight. Books or magazines are stacked on top of the eggshells. Multiple educational sites describe such projects in detail.
The science project differs from the Microsoft problem in several ways, but it does contain the central idea of balancing flat objects on an array of eggs or shells on a flat surface. It’s conceivable that the pre-release GPT-4 was paraphrasing elements of the science project in its answer.
An LLM Is Not a Search Engine
An LLM cannot cut and paste answers from the Internet the way a kid doing homework can. LLMs use language statistics only. They do not even know what words mean.
However, the statistical approach can have surprising power. “Stack these nine eggs on top of each other”—said no one, ever. The extreme rarity of that string of words is a proxy for the real-world rarity of stacks of nine eggs. The language statistics will tend to prevent LLMs from talking about stacking eggs in unrealistic ways. In effect, the LLM has learned some backdoor physics and common sense. Statistical knowledge about word choices becomes a model of the wide world of things to which words refer.
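The idea that textual rarity tracks real-world rarity can be illustrated with a deliberately crude sketch. GPT-4 is a transformer, not a bigram counter, but even the simplest statistical language model shows the principle: sequences that never occur in the training text get vanishingly low probability. The toy corpus below is entirely hypothetical.

```python
from collections import Counter

# A tiny made-up corpus standing in for web-scale training text.
corpus = (
    "place the book on the table . "
    "place the laptop on the book . "
    "stack the books on the shelf . "
    "place the bottle on the laptop . "
    "stack the boxes on the floor ."
).split()

# Count bigrams (word pairs) and unigrams (single words).
bigrams = Counter(zip(corpus, corpus[1:]))
unigrams = Counter(corpus)

def next_word_prob(prev, word):
    """Estimate P(word | prev) from raw counts; 0.0 if the pair never occurs."""
    return bigrams[(prev, word)] / unigrams[prev] if unigrams[prev] else 0.0

# "book" often follows "the" in this corpus; "eggs" never does,
# so the model assigns it zero probability. Rarity in the text
# mirrors the rarity of the situation being described.
print(next_word_prob("the", "book"))  # → 0.2
print(next_word_prob("the", "eggs"))  # → 0.0
```

A real LLM smooths and generalizes far beyond raw counts, but the same pressure applies: phrases like “stack these nine eggs on top of each other” are so rare in text that the model is steered away from producing physically absurd instructions.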
Microsoft researcher Sébastien Bubeck, one of the team that tested GPT-4, thinks it’s likely that an LLM trained on trillions of words creates a world model, and “this is what gives it the ‘magical’ extrapolation properties.”
Are LLMs truly original, or are they just plagiarists good at covering their tracks? The two spins may not be as different as they seem. There’s nothing entirely new under the sun. Edison did not invent the lightbulb; he improved it. Most so-called creators apply knowledge from different fields to a problem and arrive at a solution. It’s more a matter of making relevant connections than of fabricating something completely new. If LLMs are indeed acquiring the ability to make relevant connections, that would be a momentous—and disquieting—development.