Finix-S1-32B Hits 0.6% Hallucination Rate as Mid-2025 AI Accuracy Rankings Shift
As artificial intelligence systems continue to shape sectors such as research, education and digital media, new benchmarks released through 2025 show wide variation in factual reliability across major language models. While a small group of models has achieved record-low hallucination rates, others, particularly those emphasising advanced reasoning, still stray from verified facts more often, intensifying concerns around high-stakes deployment.
Understanding AI Hallucinations
AI hallucinations arise when a large language model generates responses that appear coherent but lack factual grounding. These errors stem from limitations in training data, probabilistic token prediction, and architectural biases that prioritise fluent text over evidence-based accuracy.
The problem entered mainstream discourse following the 2022 release of OpenAI’s ChatGPT, when wide public access to generative models revealed that even highly capable systems frequently fabricated citations, dates, and accounts of recent events. A 2023 Vectara study measured hallucination rates between 3% and 27% on summarisation tasks, prompting developers to adopt approaches such as retrieval-augmented generation (RAG), external tool use and improved alignment techniques.
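For readers unfamiliar with the pattern, the following is a minimal, self-contained Python sketch of the RAG idea: rather than answering from memory alone, the model is handed passages retrieved from a trusted corpus and instructed to answer only from them. The toy corpus, keyword retrieval and prompt wording are illustrative assumptions, not any vendor's implementation.

```python
# Minimal sketch of retrieval-augmented generation (RAG). The corpus, the
# keyword-overlap retrieval and the prompt wording are illustrative stand-ins
# for the embedding store and model API a real deployment would use.

CORPUS = [
    "A 2023 Vectara study measured hallucination rates between 3% and 27% on summarisation tasks.",
    "Retrieval-augmented generation grounds answers in documents retrieved from a trusted corpus.",
]

def retrieve(query: str, k: int = 1) -> list[str]:
    """Toy keyword-overlap retrieval; a real system would use embedding search."""
    words = set(query.lower().split())
    ranked = sorted(CORPUS, key=lambda p: -len(words & set(p.lower().split())))
    return ranked[:k]

def build_prompt(query: str, passages: list[str]) -> str:
    """Grounding prompt: instruct the model to answer only from the evidence."""
    sources = "\n".join(f"[{i + 1}] {p}" for i, p in enumerate(passages))
    return (
        "Answer using ONLY the numbered sources below. "
        "If they do not contain the answer, say you do not know.\n\n"
        f"{sources}\n\nQuestion: {query}"
    )

query = "What hallucination rates did the 2023 Vectara study measure?"
print(build_prompt(query, retrieve(query)))
# The assembled prompt would then be sent to whichever model API is in use.
```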
Yet in 2024–25, researchers, including those associated with the University of Edinburgh, observed that as models grow more complex and lean on multi-step reasoning, factual grounding does not always improve and may decline on certain benchmarks.
Latest Benchmark Results
The Vectara Hallucination Leaderboard, widely referenced across the sector and updated through August 2025, evaluates models on short-document summarisation. Its latest release positions AntGroup’s Finix-S1-32B at the top with a 0.6% hallucination rate — currently the lowest publicly measured figure for any model in that task.
Google’s Gemini-2.0-Flash-001 follows with a 0.7% rate, while other Gemini models fall slightly behind in the low-single-digit range.
OpenAI’s GPT-5-High, released on 7 August 2025, records approximately 1.4%, while GPT-4.5, launched earlier in the year, achieves around 1.2% — the strongest performance among non-specialised, non-grounded frontier models.
External evaluations reinforce this disparity. A June 2025 AIMultiple benchmark found GPT-4.5 to be the most accurate among 16 models tested on news-related queries, scoring about 85% factual accuracy.
Conversely, reasoning-centric models such as OpenAI’s o3-mini displayed higher factual error rates on datasets like PersonQA, with news coverage in May 2025 noting results around 33%, compared with approximately 16% for its predecessor o1-mini.
Giskard’s May 2025 study of eight chatbots also demonstrated that “brevity prompts” significantly increase hallucination rates, especially for temporal and numerical questions.
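For context on how figures such as 0.6% or 1.4% are produced: in summarisation benchmarks of this kind, each model summarises a fixed set of short documents, a judge labels each summary as consistent with its source or not, and the hallucination rate is the share flagged inconsistent. The sketch below illustrates only that arithmetic, using a deliberately crude keyword judge in place of the trained factual-consistency model a leaderboard such as Vectara's relies on.

```python
# Back-of-the-envelope sketch of a summarisation hallucination rate: the share
# of (source, summary) pairs a judge flags as inconsistent. The judge here is a
# deliberately crude stand-in for a trained factual-consistency model.

def is_consistent(source: str, summary: str) -> bool:
    """Toy judge: every word in the summary must also appear in the source."""
    return set(summary.lower().split()) <= set(source.lower().split())

def hallucination_rate(pairs: list[tuple[str, str]]) -> float:
    """Fraction of (source, summary) pairs the judge flags as inconsistent."""
    flagged = sum(not is_consistent(src, summ) for src, summ in pairs)
    return flagged / len(pairs)

pairs = [
    ("the model was released in march 2025", "the model was released in march 2025"),
    ("the model was released in march 2025", "the model was released in june 2024"),
]
print(f"hallucination rate: {hallucination_rate(pairs):.1%}")  # 50.0% on this toy sample
```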
Model Updates and Leading Performers
OpenAI: OpenAI’s GPT-5, launched on 7 August 2025, emphasises more faithful instruction following and reduced hallucinations. Benchmarks place its high-accuracy mode around 1.4% on Vectara’s summarisation test. Its earlier GPT-4.5 model remains notable for its low summarisation error rate (~1.2%) and high factual accuracy in independent news-query evaluations.
Google: Google released Gemini 2.5 Pro on 25 March 2025, improving long-context reasoning and verification via integrated search. Public summaries place its hallucination performance in the low single digits, though not below the 1% threshold reached by the Finix or Flash models.
Anthropic: Anthropic’s Claude 4 family launched on 22 May 2025, followed by the Claude Opus 4.1 update on 5 August 2025, which strengthened long-horizon reasoning and agentic capability. Claude models generally perform in the low-single-digit hallucination range — higher than top Gemini and OpenAI models but competitive within safety-aligned systems.
xAI: xAI’s Grok 4, unveiled on 10 July 2025, prioritises fast reasoning and tool-use. Independent summaries place its hallucination resistance in the mid-single-digit percent range, making it competitive but not among the very top performers.
Microsoft and Meta: Microsoft’s Copilot and Meta AI continue to benefit from integrated search and citations, yet external evaluations indicate varying stability depending on domain, particularly on politically sensitive or real-time topics.
What Influences Model Reliability?
Multiple factors affect hallucination performance across 2025 benchmarks:
Training data gaps and recency limits: Models trained on corpora without fresh information frequently fabricate post-cutoff events.
Alignment trade-offs: Reinforcement learning can enhance certain skills (e.g., coding or planning) while reducing factual recall.
Prompt structure: AIMultiple finds that explicit accuracy directives reduce hallucinations, while Giskard shows short-form prompts significantly worsen them (see the illustrative prompts after this list).
Data privacy and retraining rules: Free tiers of some chatbots may allow providers to retrain on user data, raising concerns when sensitive or personal content is involved.
Regulatory scrutiny: Institutions such as the Alan Turing Institute continue to call for independent audits, transparency around training data, and safety evaluations for foundation models.
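The prompt-structure effect is easiest to see side by side. The two framings below are illustrative only; neither is a prompt published by AIMultiple or Giskard.

```python
# Illustrative only: two framings of the same question, showing the contrast
# the AIMultiple and Giskard findings describe. The exact wording of an
# "accuracy directive" here is an assumption, not a prompt from either study.

QUESTION = "When did Gemini 2.5 Pro launch, and at what hallucination rate?"

# The kind of brevity prompt Giskard associates with higher error rates,
# especially on temporal and numerical details.
brevity_prompt = f"Answer in one short sentence: {QUESTION}"

# The kind of explicit accuracy directive AIMultiple found reduces hallucinations.
accuracy_prompt = (
    f"{QUESTION}\n"
    "State only facts you are confident in, flag any uncertainty explicitly, "
    "and answer 'unknown' for details you cannot verify."
)

for name, prompt in [("brevity", brevity_prompt), ("accuracy", accuracy_prompt)]:
    print(f"--- {name} ---\n{prompt}\n")
```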
Agentic and Deep-Research Features
Advanced verification features are emerging to address hallucinations:
OpenAI Deep Research, launched 2 February 2025, equips ChatGPT agents with multi-step web-verification workflows, reducing errors in complex tasks.
Claude 4.1 expands sustained reasoning and structured agent workflows.
Gemini 2.5 introduces long-context “thinking modes” for improved analysis and grounding.
Grok 4 supports agentic tasks, though performance varies by domain.
Medium analyses and academic commentary in mid-2025 note that agentic AI — with iterative checking, tool use and web queries — may be key to reducing hallucinations. However, researchers such as La Trobe University’s Daswin de Silva caution that structural biases stemming from training data persist even in agent-enhanced systems.
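In outline, these agentic pipelines share the same loop: draft an answer, break it into claims, check each claim against retrieved evidence, and flag or rewrite whatever cannot be verified. The sketch below is a self-contained toy of that loop; every function stands in for the model call, search tool and verifier a real system would use.

```python
# Toy outline of an agentic verification loop: draft claims, check each against
# retrieved evidence, and flag anything unsupported instead of asserting it.
# All functions are stand-ins for a real model, search tool and verifier.

EVIDENCE = {
    "gpt-5 launched on 7 august 2025",
    "gemini 2.5 pro launched on 25 march 2025",
}

def draft_claims(question: str) -> list[str]:
    """Stand-in for a model draft already split into atomic claims."""
    return [
        "gpt-5 launched on 7 august 2025",
        "gpt-5 eliminated hallucinations entirely",  # deliberately unsupported
    ]

def is_supported(claim: str) -> bool:
    """Stand-in verifier: the claim must match retrieved evidence exactly."""
    return claim in EVIDENCE

def verified_answer(question: str) -> str:
    kept, dropped = [], []
    for claim in draft_claims(question):
        (kept if is_supported(claim) else dropped).append(claim)
    answer = ". ".join(kept).capitalize() + "."
    if dropped:
        # Unsupported claims are surfaced rather than silently stated as fact.
        answer += " (Could not verify: " + "; ".join(dropped) + ".)"
    return answer

print(verified_answer("What did OpenAI release on 7 August 2025?"))
```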
Looking Ahead
Hybrid approaches that combine RAG, agentic verification, and specialised domain training are expected to drive improvements through 2026.
While OpenAI’s GPT-5 records hallucination rates as low as roughly 1.4% on selected tests, only a handful of models — most notably AntGroup’s Finix-S1-32B and Gemini-2.0-Flash-001 — currently fall below the 1% mark on public summarisation benchmarks.
As generative AI becomes increasingly embedded in workflows, factual reliability remains essential for trust. Researchers continue to stress the importance of independent evaluation, transparent benchmarking, and the mitigation of entrenched cultural and geographic biases as AI systems become more capable and more widely deployed.
