ChatGPT 5.1 vs Rivals: New Benchmarks Show 0.6%–1.4% Hallucination Gap in 2025
Image Credit: Mariia Shalabaieva | Unsplash
OpenAI has released ChatGPT 5.1, rolling out first to paid users and offering measurable improvements in reasoning, safety, and response consistency. The update builds on August’s GPT-5 launch and introduces two operation modes — Instant and Thinking — designed to balance speed, cost efficiency, and deeper deliberation.
As governments, healthcare providers, and legal professionals increasingly integrate AI into mission-critical workflows, reducing hallucinations remains a core challenge. While GPT-5.1 demonstrates more reliable behaviour in practical use, several rival models currently outperform OpenAI’s flagship systems on strict grounded hallucination tests.
Benchmark Overview: GPT-5.1 vs Official GPT-5 Metrics
OpenAI has released a GPT-5.1 system-card addendum, but most published quantitative benchmarks still refer to GPT-5, not GPT-5.1. Independent laboratories continue to fill this data gap with external evaluations.
Confirmed GPT-5 Performance
SimpleQA hallucination benchmark: In OpenAI’s SimpleQA evaluations, a GPT-5-series model (gpt-5-thinking-mini) produced hallucinations in around 26% of prompts, compared with ~75% for the older o4-mini model. This result reflects a single high-difficulty benchmark designed to measure how often a model “makes something up” instead of refusing to answer.
AIME 2025 (math): GPT-5 records 94.6% accuracy on the 2025 AIME mathematics benchmark without tools.
PersonQA: OpenAI has not published PersonQA hallucination data for GPT-5 or GPT-5.1. Earlier models such as o3 show hallucination rates around 33%, while o4-mini is higher on this specific test.
Because GPT-5.1 lacks a fully published benchmark suite, third-party hallucination results remain the primary basis for cross-model comparison.
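For context, SimpleQA-style rates like the one quoted above are derived from answers that a separate grader has already labelled correct, incorrect, or not attempted; the hallucination rate is essentially the share of confidently wrong answers. The Python sketch below illustrates that aggregation step only and is not OpenAI's evaluation code; the grading itself is assumed to have happened upstream.

```python
from collections import Counter

def simpleqa_style_hallucination_rate(grades):
    """Share of prompts where the model gave a confidently wrong answer.

    `grades` is a list of strings, one per prompt: "correct",
    "incorrect", or "not_attempted". An "incorrect" grade means the
    model asserted a false answer instead of declining, i.e. it
    hallucinated.
    """
    if not grades:
        return 0.0
    counts = Counter(grades)
    return counts["incorrect"] / len(grades)

# Toy example: 100 graded answers, 26 of them confidently wrong.
grades = ["incorrect"] * 26 + ["correct"] * 50 + ["not_attempted"] * 24
print(f"{simpleqa_style_hallucination_rate(grades):.0%}")  # 26%
```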
Independent Hallucination Tests: Current Market Standings
One of the most widely referenced external evaluations is Vectara’s Hallucination Leaderboard, which uses the Hughes Hallucination Evaluation Model (HHEM). This benchmark measures grounded hallucinations — cases in which a model inserts information not present in the source text.
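HHEM assigns each (source document, summary) pair a factual-consistency score between 0 and 1, and the leaderboard reports the share of summaries that fall below a consistency threshold. The sketch below shows only that aggregation logic; `score_consistency` is a stand-in for the actual HHEM classifier (which Vectara publishes as an open model), not a reimplementation of it.

```python
def grounded_hallucination_rate(pairs, score_consistency, threshold=0.5):
    """Fraction of summaries judged inconsistent with their source.

    pairs             -- list of (source_text, model_summary) tuples
    score_consistency -- callable returning a 0..1 factual-consistency
                         score for one (source, summary) pair; in the
                         real leaderboard this role is played by HHEM
    threshold         -- scores below this value count as hallucinated
    """
    if not pairs:
        return 0.0
    flagged = sum(
        1
        for source, summary in pairs
        if score_consistency(source, summary) < threshold
    )
    return flagged / len(pairs)
```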
Lowest Known Hallucination Rates (HHEM Summarisation Test)
AntGroup Finix-S1-32B — ~0.6%
Currently the lowest grounded hallucination rate measured on HHEM.
Google Gemini-2.0-Flash — ~0.7%
A lightweight, efficiency-optimised model designed to avoid adding extraneous information.
OpenAI GPT-5-high — ~1.4%
Higher-capacity OpenAI models perform strongly but do not match the two leaders on this specific benchmark.
OpenAI o-series models (o1/o1-pro) — ~2–3%
Depending on variant, these models fall in the low single digits.
Google Gemini-2.5-Pro — ~2.6%
A larger, more capable Pro model with higher hallucination probability on grounded tests due to more detailed generative output.
Qwen-series models — variable
Hallucination performance ranges widely by size and benchmark.
Lowest Hallucination Model Overall
AntGroup’s Finix-S1-32B currently leads with ~0.6% hallucination, followed closely by Gemini-2.0-Flash at ~0.7%.
These measurements apply specifically to strict grounded summarisation tests. Performance changes significantly for open-ended reasoning, coding, and mathematics.
Why Benchmark Results Differ Across Models
Hallucination rates vary because benchmarks measure different capabilities:
Grounded summarisation penalises even minor additions not present in the reference text.
Open-ended question answering increases hallucination likelihood because models must infer answers.
Reasoning-heavy models generate detailed multi-step explanations, making them more likely to introduce unverifiable statements.
Conservative models like Gemini-2.0-Flash produce shorter, safer outputs with fewer hallucinated additions.
These distinctions explain why a smaller model can top hallucination charts while larger models outperform it in analytical, mathematical, and scientific tasks.
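A toy example makes the grounded-summarisation penalty concrete. The lexical check below is far cruder than HHEM, which uses a trained consistency classifier, but the penalty structure is the same: any material absent from the source counts against the summary, even if it happens to be true, so more detailed outputs have more ways to lose.

```python
import re

def tokens(text):
    # Lowercased word/number tokens; keeps "2.1" as a single token.
    return re.findall(r"\w+(?:\.\w+)?", text.lower())

def unsupported_tokens(source, summary):
    """Summary tokens that never appear in the source text.

    A crude lexical proxy for a grounded check, for illustration only.
    """
    src = set(tokens(source))
    return [t for t in tokens(summary) if t not in src]

source = "The company reported revenue of 2.1 billion dollars in 2024."
terse = "The company reported revenue of 2.1 billion dollars in 2024."
detailed = ("The company reported revenue of 2.1 billion dollars in 2024, "
            "beating analyst forecasts after strong holiday sales.")

print(unsupported_tokens(source, terse))     # []
print(unsupported_tokens(source, detailed))  # ['beating', 'analyst', ...]
```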
Real-World Implications
Low hallucination rates are increasingly essential:
Legal professionals have encountered fabricated citations in court filings.
Customer support systems risk cascading misinformation across thousands of interactions.
Government and public services in countries such as Australia require consistent outputs to maintain trust and safety.
Cybersecurity analysts warn that even a 6% error rate can lead to severe operational risks when scaled across large automated systems.
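That scaling concern is straightforward to quantify. Treating errors as independent across responses (a simplification), expected failures grow linearly with volume and the chance of at least one failure approaches certainty quickly; the figures below are illustrative, not drawn from any cited study.

```python
def errors_at_scale(error_rate, interactions):
    """Expected number of erroneous responses and the probability of at
    least one error, assuming errors are independent across replies."""
    expected = error_rate * interactions
    p_any = 1 - (1 - error_rate) ** interactions
    return expected, p_any

# Rates spanning the figures above, over 10,000 automated replies.
for label, rate in [("0.6%", 0.006), ("1.4%", 0.014), ("6%", 0.06)]:
    expected, p_any = errors_at_scale(rate, 10_000)
    print(f"{label} error rate: ~{expected:.0f} bad outputs, "
          f"P(at least one) = {p_any:.4f}")
```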
