OpenAI's newest o3 and o4-mini models excel at coding and math – but hallucinate more often

A hot potato: OpenAI's latest artificial intelligence models, o3 and o4-mini, have set new benchmarks in coding, math, and multimodal reasoning. Yet, despite these advancements, the models are drawing concern for an unexpected and troubling trait: they hallucinate, or fabricate information, at higher rates than their predecessors – a reversal of the trend that has defined AI progress in recent years.

Historically, each new generation of OpenAI's models has delivered incremental improvements in factual accuracy, with hallucination rates dropping as the technology matured. However, internal testing and third-party evaluations now reveal that o3 and o4-mini, both classified as "reasoning models," are more prone to making things up than earlier reasoning models such as o1, o1-mini, and o3-mini, as well as the general-purpose GPT-4o, according to a report by TechCrunch.

On OpenAI's PersonQA benchmark, which measures a model's ability to answer questions about people accurately, o3 hallucinated in 33 percent of cases, more than double the rate of o1 and o3-mini, which scored 16 percent and 14.8 percent, respectively. The smaller o4-mini performed even worse, with a staggering 48 percent hallucination rate – nearly one in every two responses.

The reasons for this regression remain unclear, even to OpenAI's own researchers. In technical documentation, the company admits that "more research is needed" to understand why scaling up reasoning models appears to worsen the hallucination problem.

One hypothesis, offered by Neil Chowdhury, a researcher at the nonprofit AI lab Transluce and a former OpenAI employee, is that the reinforcement learning techniques used for the o-series models may amplify issues that previous post-training processes had managed to mitigate, if not eliminate.

Third-party findings support this theory: Transluce documented instances where o3 invented actions it could not possibly have performed, such as claiming to run code on a 2021 MacBook Pro "outside of ChatGPT" and then copying the results into its answer – an outright fabrication.

Sarah Schwettmann, co-founder of Transluce, warns that the higher hallucination rate could limit o3's usefulness in real-world applications. Kian Katanforoosh, a Stanford adjunct professor and CEO of Workera, told TechCrunch that while o3 excels in coding workflows, it often generates broken website links.

These hallucinations pose a substantial risk for businesses in fields where accuracy is paramount, such as law or finance. A model that fabricates facts could introduce errors into legal contracts or financial reports, undermining trust and utility.

OpenAI acknowledges the challenge, with spokesperson Niko Felix telling TechCrunch that addressing hallucinations "across all our models is an ongoing area of research, and we're continually working to improve their accuracy and reliability."

One promising avenue for reducing hallucinations is integrating web search capabilities. OpenAI's GPT-4o, when equipped with search, achieves 90 percent accuracy on the SimpleQA benchmark, suggesting that real-time retrieval could help ground AI responses in verifiable facts – at least where users are comfortable sharing their queries with third-party search providers.
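For readers who want to try this approach themselves, the sketch below shows roughly how a search-grounded request might look using OpenAI's Python SDK and its Responses API. The "web_search_preview" tool type and the example prompt are assumptions for illustration – tool names and availability can change, so treat this as a rough sketch rather than a definitive recipe.

```python
# Illustrative sketch: grounding a model's answer with web search results,
# assuming the openai Python SDK and access to the "web_search_preview" tool.
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

response = client.responses.create(
    model="gpt-4o",
    tools=[{"type": "web_search_preview"}],  # let the model pull in live web results
    input="Summarize the latest reported hallucination rates for OpenAI's o3 model.",
)

# output_text concatenates the model's text output, now grounded in retrieved sources
print(response.output_text)
```

The trade-off, as noted above, is that every query is sent to a search provider, which may not suit privacy-sensitive use cases.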

Meanwhile, the broader AI industry is shifting its focus toward reasoning models, which promise improved performance on complex tasks without requiring exponentially more data and computing power. Yet, as the experience with o3 and o4-mini shows, this new direction brings its own set of challenges, chief among them the risk of increased hallucinations.
