Do chatbots really understand? Or are the large language models that power them, enabling them to answer sophisticated questions, analyze texts, and generate poems and computer programs, just a mass of data and calculations that simulates true understanding?
CHM aimed to find out, staging a debate between University of Washington computational linguist Emily M. Bender—who, with her coauthors, established the term “stochastic parrot” in a major 2021 paper and is coauthor of the forthcoming book The AI Con—and OpenAI's Sébastien Bubeck, former VP for AI and distinguished scientist at Microsoft, and the lead author of an influential 2023 paper about LLMs, “Sparks of Artificial General Intelligence.” Coordinated in partnership with IEEE Spectrum, whose Senior Editor Eliza Strickland served as moderator, this event was made possible by the generous support of the Patrick J. McGovern Foundation.
To provide a baseline understanding for the audience, Strickland offered a brief history and description of AI neural networks and the large language models that help chatbots produce human-like text. She noted that they can generate relevant responses because they’ve basically “read the entire internet” and so can predict what likely comes next in a sentence.
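To make that next-word idea concrete, here is a minimal toy sketch, not how production LLMs actually work (they use neural networks trained on vast corpora of subword tokens), that predicts a likely next word from simple counts over a tiny made-up corpus:

```python
# Toy illustration of next-word prediction via bigram counts.
# This is only a sketch of the idea, not an LLM.
from collections import Counter, defaultdict

# Tiny "training corpus" standing in for the internet-scale text LLMs learn from.
corpus = "the cat sat on the mat . the cat ate the fish . the dog sat on the rug ."

# Count which word follows which (a bigram model).
follows = defaultdict(Counter)
words = corpus.split()
for current_word, next_word in zip(words, words[1:]):
    follows[current_word][next_word] += 1

def predict_next(word):
    """Return the most frequent next word observed after `word` in the corpus."""
    candidates = follows.get(word)
    return candidates.most_common(1)[0][0] if candidates else None

print(predict_next("the"))  # 'cat' -- the most common continuation of "the"
print(predict_next("sat"))  # 'on'
```

The debate, of course, is over whether this kind of statistical continuation, scaled up enormously, amounts to understanding or merely imitates it.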
The debaters each offered opening remarks on the question of “Do LLMs really understand, or are they just mimicking training data?”
Emily Bender represented the position that “No, LLMs do not really understand.” She won’t use the term “artificial intelligence” but rather refers to what the systems do as “automation.” Bender explained that when humans use language, understanding involves much more than the words themselves, including context and cues from the speaker. LLMs are trained to look only at the form of the words, not how they are being used in a particular context. She argues that a chatbot seems to make sense when we talk to it only because we ourselves are doing the sense-making.
Emily Bender argues that LLMs do not understand.
Bender noted that it's an extraordinary claim to say that LLMs understand, and extraordinary evidence is needed to prove that claim. The data that would help to verify it must not be hidden. Getting text out when you put text in may look like reasoning, but it really only shows how closely the system models its training data.
For the “Yes” side of the debate, Sébastien Bubeck noted that “understanding is in the eye of the beholder.” In the world of AI, benchmarks are used to assess the rate of progress, which has been remarkable over the past couple of years as models advanced from solving high-school-level mathematics questions to grappling with problems that no human can solve alone. However, Bubeck believes that benchmarks do not show understanding, which can only be judged by interacting with the system and probing it to see how deep it can go.
Sébastien Bubeck argues that LLMs can push understanding.
At the end of the day, says Bubeck, understanding is a human journey. So, perhaps ask yourself if the chatbot helped you to understand more things rather than asking if the chatbot itself understood them. We may see breakthroughs in math by LLMs, but they will not be accepted until humans can fully grasp what the chatbot has revealed.
The debaters fielded a question from the moderator about whether the hype around artificial general intelligence (AGI) is justified by its current functionality. While Bubeck believes it’s plausible to reach AGI, Bender objects to the assumption that AGI necessarily lies in our future.
A new benchmark called ARC, which stands for Abstraction and Reasoning Corpus, professes to be able to measure AGI. Bender notes that any benchmark covers only a selection of tasks, and there is a “whole wide world” outside it. We don’t benchmark people, she says; we create licensing exams and academic exams to measure understanding, not how well a person has been trained to perform some task. Bubeck agreed … to some extent.
Bubeck and Bender disagree on deploying AGI.
Strickland asked the debaters if there’s a danger in letting people believe there’s a mind on the other side of chatbot technology. She cited an article reporting that chatbot therapists became “stressed” after hearing about humans’ trauma.
Bender said unequivocally that this is a problem that sets people up not to make good decisions. Bubeck remarked that anthropomorphizing chatbots is not great and that we need to do more work to develop the right vocabulary to talk about these things. For example, he doesn’t like the term “AI” because LLMs are intelligent but not in the same way as humans.
After addressing audience questions, the debaters offered final remarks. Bender would like people to know that nothing is inevitable. Refusal is important, especially in systems that are already creaking, like education, healthcare, and the legal system. In all those places where synthetic text looks like a quick solution, we need to say “no,” she says, because it is worse than nothing.
Bubeck advises people to decide for themselves when to interact with these tools and to see if they provide value. These topics are complex and subtle, and no one knows how far AI is going to go. The growth rate is astonishing, and he’s excited to see what the next three years will bring.
Bubeck says that the answer to the question of whether LLMs understand is both parrot and spark. Understanding is a continuum, and the balance is shifting.
The Great Chatbot Debate | CHM Live, March 25, 2025
Free events like these would not be possible without the generous support of people like you who care deeply about decoding technology for everyone. Please consider making a donation.