ChatGPT 5.1 vs Rivals: New Benchmarks Show 0.6%–1.4% Hallucination Gap in 2025

Image Credit: Mariia Shalabaieva | Unsplash

OpenAI has released ChatGPT 5.1, rolling out first to paid users and offering measurable improvements in reasoning, safety, and response consistency. The update builds on August’s GPT-5 launch and introduces two operating modes, Instant and Thinking, designed to balance speed, cost efficiency, and deeper deliberation.

As governments, healthcare providers, and legal professionals increasingly integrate AI into mission-critical workflows, reducing hallucinations remains a core challenge. While GPT-5.1 demonstrates more reliable behaviour in practical use, several rival models currently outperform OpenAI’s flagship systems on strict grounded hallucination tests.

Benchmark Overview: GPT-5.1 vs Official GPT-5 Metrics

OpenAI has released a GPT-5.1 system-card addendum, but most published quantitative benchmarks still refer to GPT-5, not GPT-5.1. Independent laboratories continue to fill this data gap with external evaluations.

Confirmed GPT-5 Performance

  • SimpleQA hallucination benchmark: In OpenAI’s SimpleQA evaluations, a GPT-5-series model (gpt-5-thinking-mini) produced hallucinations in around 26% of prompts, compared with ~75% for the older o4-mini model. This result reflects a single high-difficulty benchmark designed to measure how often a model “makes something up” instead of refusing to answer. A minimal scoring sketch follows this list.

  • AIME 2025 (math): GPT-5 records 94.6% accuracy on the 2025 AIME mathematics benchmark without tools.

  • PersonQA: OpenAI has not published PersonQA hallucination data for GPT-5 or GPT-5.1. Earlier models such as o3 show hallucination rates around 33%, while o4-mini is higher on this specific test.
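Benchmarks of this kind typically report a hallucination rate as the share of prompts answered with a fabricated claim rather than declined. The sketch below is a minimal illustration of that arithmetic; the grading labels (“correct”, “incorrect”, “not_attempted”) and the aggregation are assumptions for illustration, not OpenAI’s published harness.

```python
from collections import Counter

def simpleqa_style_rates(grades: list[str]) -> dict[str, float]:
    """Summarise graded answers from a short-form factuality test.

    `grades` holds one label per prompt: "correct", "incorrect", or
    "not_attempted". The labels and this aggregation are an illustrative
    assumption, not OpenAI's published evaluation harness.
    """
    counts = Counter(grades)
    total = len(grades)
    attempted = counts["correct"] + counts["incorrect"]
    return {
        # Share of all prompts where the model asserted a wrong answer.
        "hallucination_rate": counts["incorrect"] / total,
        # Share of prompts the model declined instead of guessing.
        "refusal_rate": counts["not_attempted"] / total,
        # Accuracy among the questions the model actually attempted.
        "accuracy_when_attempted": counts["correct"] / attempted if attempted else 0.0,
    }

# Toy run: 4 prompts, one wrong answer asserted -> 25% hallucination rate.
print(simpleqa_style_rates(["correct", "incorrect", "not_attempted", "correct"]))
```

On this scheme, a model that declines more and guesses less can post a lower hallucination rate even without knowing more, which is why refusal behaviour matters as much as raw knowledge on these tests.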

Because GPT-5.1 lacks a fully published benchmark suite, third-party hallucination results remain the primary basis for cross-model comparison.

Independent Hallucination Tests: Current Market Standings

One of the most widely referenced external evaluations is Vectara’s Hallucination Leaderboard, which uses the Hughes Hallucination Evaluation Model (HHEM). This benchmark measures grounded hallucinations — cases in which a model inserts information not present in the source text.
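As a rough sketch of how a leaderboard of this kind can be aggregated, assume a stand-in consistency_score(source, summary) function in place of the real HHEM classifier, returning a support score between 0 and 1, and an assumed decision threshold of 0.5; summaries scoring below the threshold count as grounded hallucinations. Every name in the snippet is a placeholder, not Vectara’s actual tooling.

```python
def hallucination_rate(pairs, consistency_score, threshold=0.5):
    """Fraction of summaries judged unsupported by their source document.

    `pairs` is a list of (source_text, summary_text) tuples.
    `consistency_score` is a stand-in for a grounded-faithfulness model
    such as HHEM: it should return a value in [0, 1], where higher means
    the summary is better supported by the source. The 0.5 threshold is
    an assumption for illustration.
    """
    flagged = sum(
        1 for source, summary in pairs
        if consistency_score(source, summary) < threshold
    )
    return flagged / len(pairs)

# Crude stand-in scorer: treats a summary as grounded only to the extent
# that its words already appear in the source. Real evaluators are model-based.
def toy_scorer(source: str, summary: str) -> float:
    source_words = set(source.lower().split())
    summary_words = set(summary.lower().split())
    return len(summary_words & source_words) / max(len(summary_words), 1)

pairs = [
    ("The launch was delayed to Friday.", "The launch was delayed to Friday."),
    ("The launch was delayed to Friday.", "The launch was delayed because of severe tropical storms."),
]
print(hallucination_rate(pairs, toy_scorer))  # 0.5: one of the two summaries is flagged
```

The word-overlap scorer is deliberately crude; the published leaderboard relies on a trained evaluation model rather than anything this simple, but the aggregation into a single percentage works the same way.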

Lowest Known Hallucination Rates (HHEM Summarisation Test)

  1. AntGroup Finix-S1-32B — ~0.6%
    Currently the lowest grounded hallucination rate measured on HHEM.

  2. Google Gemini-2.0-Flash — ~0.7%
    A lightweight, efficiency-optimised model designed to avoid adding extraneous information.

  3. OpenAI GPT-5-high — ~1.4%
    Higher-capacity OpenAI models perform strongly but do not match the two leaders on this specific benchmark.

  4. OpenAI o-series models (o1/o1-pro) — ~2–3%
    Depending on variant, these models fall in the low single digits.

  5. Google Gemini-2.5-Pro — ~2.6%
    A larger, more capable Pro model whose more detailed generative output raises its hallucination probability on grounded tests.

  6. Qwen-series models — variable
    Hallucination performance ranges widely by size and benchmark.

Lowest Hallucination Model Overall

AntGroup’s Finix-S1-32B currently leads with ~0.6% hallucination, followed closely by Gemini-2.0-Flash at ~0.7%.

These measurements apply specifically to strict grounded summarisation tests. Performance changes significantly for open-ended reasoning, coding, and mathematics.

Why Benchmark Results Differ Across Models

Hallucination rates vary because benchmarks measure different capabilities:

  • Grounded summarisation penalises even minor additions not present in the reference text.

  • Open-ended question answering increases hallucination likelihood because models must infer answers.

  • Reasoning-heavy models generate detailed multi-step explanations, making them more likely to introduce unverifiable statements.

  • Conservative models like Gemini-2.0-Flash produce shorter, safer outputs with fewer hallucinated additions.

These distinctions explain why a smaller model can top hallucination charts while larger models outperform it in analytical, mathematical, and scientific tasks.
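A small, hedged illustration of that distinction: the same sentence can fail a grounded check because it adds detail absent from the source, yet pass an open-ended factuality check because the added detail happens to be true. Both checkers below are toy stand-ins, not any benchmark’s actual grader.

```python
# Toy contrast between grounded and open-ended scoring. Both "checkers"
# are illustrative stand-ins; real benchmarks use model-based judges.
source = "GPT-5 scored 94.6% on AIME 2025 without tools."
answer = "GPT-5 scored 94.6% on AIME 2025 without tools, released by OpenAI."

def grounded_ok(source: str, answer: str) -> bool:
    # Grounded summarisation: every token must be traceable to the source.
    return set(answer.lower().split()) <= set(source.lower().split())

def open_ended_ok(answer: str, gold_facts: set[str]) -> bool:
    # Open-ended QA: the answer only has to agree with known facts,
    # regardless of whether they appeared in any supplied source text.
    return any(fact in answer.lower() for fact in gold_facts)

print(grounded_ok(source, answer))                 # False: the OpenAI attribution is not in the source
print(open_ended_ok(answer, {"94.6%", "openai"}))  # True: the added detail is accurate
```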

Real-World Implications

Low hallucination rates are increasingly essential:

  • Legal professionals have encountered fabricated citations in court filings.

  • Customer support systems risk cascading misinformation across thousands of interactions.

  • Government and public services in countries such as Australia require consistent outputs to maintain trust and safety.

  • Cybersecurity analysts warn that even a 6% error rate can lead to severe operational risks when scaled across large automated systems: at that rate, roughly 600 of every 10,000 automated responses would contain an error.


