
ZDNET's key takeaways
- We don't entirely know how AI works, so we ascribe magical powers to it.
- Claims that Gen AI can reason are a "brittle mirage."
- We should always be specific about what AI is doing and avoid hyperbole.
Ever since artificial intelligence programs began impressing the general public, AI scholars have been making claims for the technology's deeper significance, even asserting the prospect of human-like understanding.
Scholars wax philosophical because even the scientists who created AI models such as OpenAI's GPT-5 don't really understand how the programs work -- not entirely.
Also: OpenAI's Altman sees 'superintelligence' just around the corner - but he's short on details
AI's 'black box' and the hype machine
AI programs such as LLMs are infamously "black boxes." They achieve a lot that is impressive, but for the most part, we cannot observe all that they are doing when they take an input, such as a prompt you type, and they produce an output, such as the college term paper you requested or the suggestion for your new novel.
To fill that gap, scientists have applied colloquial terms such as "reasoning" to describe the way the programs perform. In the process, they have either implied or outright asserted that the programs can "think," "reason," and "know" in the way that humans do.
In the past two years, the rhetoric has overtaken the science as AI executives have used hyperbole to twist what were simple engineering achievements.
Also: What is OpenAI's GPT-5? Here's everything you need to know about the company's latest model
OpenAI's press release last September announcing its o1 reasoning model stated that "Similar to how a human may think for a long time before responding to a difficult question, o1 uses a chain of thought when attempting to solve a problem," so that "o1 learns to hone its chain of thought and refine the strategies it uses."
It was a short step from those anthropomorphizing assertions to all sorts of wild claims, such as OpenAI CEO Sam Altman's comment, in June, that "We are past the event horizon; the takeoff has started. Humanity is close to building digital superintelligence."
(Disclosure: Ziff Davis, ZDNET's parent company, filed an April 2025 lawsuit against OpenAI, alleging it infringed Ziff Davis copyrights in training and operating its AI systems.)
The backlash of AI research
There is a backlash building, however, from AI scientists who are debunking the assumptions of human-like intelligence via rigorous technical scrutiny.
In a paper published last month on the arXiv pre-print server and not yet reviewed by peers, the authors -- Chengshuai Zhao and colleagues at Arizona State University -- took apart the reasoning claims through a simple experiment. What they concluded is that "chain-of-thought reasoning is a brittle mirage," and it is "not a mechanism for genuine logical inference but rather a sophisticated form of structured pattern matching."
Also: Sam Altman says the Singularity is imminent - here's why
The term "chain of thought" (CoT) is commonly used to describe the verbose stream of output that you see when a large reasoning model, such as GPT-o1 or DeepSeek V1, shows you how it works through a problem before giving the final answer.
That stream of statements isn't as deep or meaningful as it seems, write Zhao and team. "The empirical successes of CoT reasoning lead to the perception that large language models (LLMs) engage in deliberate inferential processes," they write.
But, "An expanding body of analyses reveals that LLMs tend to rely on surface-level semantics and clues rather than logical procedures," they explain. "LLMs construct superficial chains of logic based on learned token associations, often failing on tasks that deviate from commonsense heuristics or familiar templates."
The term "chains of tokens" is a common way to refer to a series of elements input to an LLM, such as words or characters.
Testing what LLMs actually do
To test the hypothesis that LLMs are merely pattern-matching rather than really reasoning, they trained a version of OpenAI's older, open-source GPT-2 model, released in 2019, from scratch, an approach they call "data alchemy."
The model was trained from the beginning to manipulate only the 26 letters of the English alphabet, "A, B, C," and so on. That simplified corpus lets Zhao and team test the LLM on a set of very simple tasks, all of which involve transforming sequences of letters, such as cyclically shifting a sequence by a certain number of positions, so that "APPLE" becomes "EAPPL."
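To make the setup concrete, here is a minimal sketch, in Python, of the kinds of letter-sequence transformations described above. The function names and exact rules are illustrative assumptions, not code from the paper.

```python
# Illustrative sketch of the kinds of letter-sequence tasks described above;
# the function names and exact rules are assumptions, not Zhao and team's code.

def cyclic_shift(word: str, places: int = 1) -> str:
    """Rotate letter positions, e.g. cyclic_shift('APPLE', 1) -> 'EAPPL'."""
    places %= len(word)
    return word[-places:] + word[:-places] if places else word

def alphabet_shift(word: str, places: int = 13) -> str:
    """Shift each letter forward in the 26-letter alphabet, wrapping around."""
    return "".join(chr((ord(c) - ord("A") + places) % 26 + ord("A")) for c in word)

print(cyclic_shift("APPLE", 1))     # EAPPL
print(alphabet_shift("APPLE", 13))  # NCCYR
```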
Also: OpenAI CEO sees uphill struggle to GPT-5, potential for new kind of consumer hardware
Working with this limited set of tokens and tasks, Zhao and team vary which tasks the language model is exposed to in its training data versus which tasks it sees only when the finished model is tested, such as "Shift each element by 13 places." It's a test of whether the language model can reason its way to a solution when confronted with new, never-before-seen tasks.
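Conceptually, the split looks something like the sketch below; the task names and the particular split are hypothetical placeholders, not the paper's exact design.

```python
# Hypothetical sketch of the train/test split over tasks. The point: the held-out
# task never appears in training, so solving it requires applying the rule itself
# rather than recalling a familiar template.

TRAINING_TASKS = ["shift each element by 1 place", "shift each element by 3 places"]
HELD_OUT_TASKS = ["shift each element by 13 places"]  # seen only at evaluation time

def split_label(task: str) -> str:
    return "in-distribution" if task in TRAINING_TASKS else "out-of-distribution"

for task in TRAINING_TASKS + HELD_OUT_TASKS:
    print(task, "->", split_label(task))
```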
They found that when the tasks were not in the training data, the language model failed to complete them correctly using a chain of thought. Instead, it fell back on the tasks that were in its training data: its "reasoning" sounded good, but the answers it generated were wrong.
As Zhao and team put it, "LLMs try to generalize the reasoning paths based on the most similar ones […] seen during training, which leads to correct reasoning paths, yet incorrect answers."
Specificity to counter the hype
The authors draw some lessons.
First: "Guard against over-reliance and false confidence," they advise, because "the ability of LLMs to produce 'fluent nonsense' -- plausible but logically flawed reasoning chains -- can be more deceptive and damaging than an outright incorrect answer, as it projects a false aura of dependability."
Second, they advise trying out tasks that are unlikely to have been contained in the training data, so that the AI model is properly stress-tested.
Also: Why GPT-5's rocky rollout is the reality check we needed on superintelligence hype
What's important about Zhao and team's approach is that it cuts through the hyperbole and takes us back to the basics of understanding what exactly AI is doing.
When the original chain-of-thought research, "Chain-of-Thought Prompting Elicits Reasoning in Large Language Models," was carried out by Jason Wei and colleagues on Google's Brain team in 2022 -- research that has since been cited more than 10,000 times -- the authors made no claims about actual reasoning.
Wei and team noticed that prompting an LLM to list the steps in a problem, such as an arithmetic word problem ("If there are 10 cookies in the jar, and Sally takes out one, how many are left in the jar?"), tended to lead to more correct solutions, on average.
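For readers unfamiliar with the technique, a chain-of-thought prompt of the kind Wei and team studied looks roughly like the sketch below. The wording of the worked example is illustrative, not taken from their paper.

```python
# Rough illustration of a chain-of-thought prompt in the style studied by Wei et al.;
# the worked example and its wording are illustrative, not quoted from the paper.

COT_PROMPT = """\
Q: If there are 10 cookies in the jar, and Sally takes out one, how many are left in the jar?
A: The jar starts with 10 cookies. Sally takes out 1 cookie. 10 - 1 = 9. The answer is 9.

Q: If there are 24 eggs and a recipe uses 6 of them, how many eggs remain?
A:"""

# The demonstration answer spells out intermediate steps; the model is then expected
# to produce a similar step-by-step answer for the new question before its final answer.
print(COT_PROMPT)
```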
They were careful not to assert human-like abilities. "Although chain of thought emulates the thought processes of human reasoners, this does not answer whether the neural network is actually 'reasoning,' which we leave as an open question," they wrote at the time.
Also: Will AI think like humans? We're not even close - and we're asking the wrong question
Since then, Altman's claims and various press releases from AI promoters have increasingly emphasized the supposedly human-like nature of reasoning, using casual and sloppy rhetoric that disregards Wei and team's purely technical description.
Zhao and team's work is a reminder that we should be specific, not superstitious, about what the machine is really doing, and avoid hyperbolic claims.