Gemini 2.0 Flash Hits 0.7% Hallucination Rate, but Trails OpenAI's o4-mini in Reasoning Benchmarks

Image Credit: Solen Feyissa | Unsplash

Google DeepMind’s Gemini 2.0 Flash, specifically the gemini-2.0-flash-001 API model, has emerged as one of the most reliable large language models (LLMs) currently measured, posting a 0.7% hallucination rate on Vectara’s Hallucination Leaderboard (updated 16 October 2025). This places it second globally, behind only AntGroup’s specialised Finix S1-32B, which leads with 0.6%.

However, while Gemini 2.0 Flash excels in factual faithfulness and speed, a third-party benchmark study by Analytics Vidhya finds that it lags behind OpenAI’s o4-mini in deeper reasoning, mathematical problem-solving and code-generation tasks.

The contrast highlights a central trend in AI development: as governments, healthcare systems and news organisations deploy AI tools at scale, minimising fabricated information is becoming as crucial as maximising raw intelligence.

A Speed-First Model Built for High-Volume, Low-Latency Workloads

Google first introduced Gemini 2.0 Flash on 11 December 2024, positioning it as a fast, efficient “workhorse” for the emerging agentic era, where AI systems perform actions, orchestrate tools and process multimodal inputs in real time.

By February 2025, the model became generally available via:

  • Google AI Studio

  • Vertex AI

  • The Gemini App, where third-party testing reports it as the default model for many new chats

A defining feature is its 1 million-token context window, formally listed as 1,048,576 tokens (2^20), enabling long-document processing across legal, financial or enterprise workflows without losing coherence.

Google’s developer materials state that Gemini 2.0 Flash delivers approximately 2× the speed of Gemini 1.5 Pro, a claim also echoed by senior Google researchers. Independent reviewers note that Flash maintains this speed while supporting strong multimodal performance across 100+ languages, though exact counts vary depending on the evaluation.

A widely cited third-party comparison suggests that, on Google’s TPU v5e hardware, Gemini 2.0 Flash may use around 75% less energy than o4-mini for similar workloads — not an official Google metric, but an estimate published in Analytics Vidhya’s evaluation.

Together, features such as speculative decoding, adaptive attention, and unified multimodal processing help Flash deliver high throughput without significantly increasing computational cost.
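Speculative decoding, one of the throughput features mentioned above, can be illustrated with a toy sketch. This is a conceptual simplification, not Google's implementation: a cheap "draft" model proposes several tokens ahead, and the larger "target" model verifies them in one pass, accepting the agreed prefix and substituting its own token at the first disagreement. The hypothetical token lists below stand in for real model outputs.

```python
def speculative_step(draft_tokens, target_tokens):
    """Accept the longest prefix of the draft that the target agrees with;
    at the first mismatch, take the target's token instead and stop."""
    accepted = []
    for d, t in zip(draft_tokens, target_tokens):
        if d == t:
            accepted.append(d)       # draft token verified "for free"
        else:
            accepted.append(t)       # target overrides; speculation ends
            return accepted
    return accepted

# Draft proposes 4 tokens; the target agrees with the first 3,
# so one verification pass yields 4 tokens instead of 1.
draft = ["the", "quick", "brown", "cat"]
target = ["the", "quick", "brown", "fox"]
print(speculative_step(draft, target))  # ['the', 'quick', 'brown', 'fox']
```

The speed-up comes from the target model validating several draft tokens per forward pass rather than generating one token at a time.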

Vectara Benchmark Confirms Exceptional Hallucination Suppression

Vectara’s HHEM-2.1 hallucination benchmark, based on approximately 831 CNN/Daily Mail documents, currently serves as one of the most cited measurements of factual consistency in AI summarisation.

On this evaluation, Gemini 2.0 Flash (gemini-2.0-flash-001) achieved:

  • 0.7% hallucination rate

  • 99.3% factual consistency

  • 100% answer rate

  • ~65-word average summaries

These figures place Gemini 2.0 Flash ahead of every major general-purpose LLM currently listed, with only AntGroup's specialised Finix S1-32B posting a lower hallucination rate. The model's output style is notably concise and tightly anchored to source material, attributes that make it appealing for enterprise summaries, regulated-industry workflows, and newsroom fact chains.

Reasoning Benchmarks: One Study Shows a Clear o4-mini Advantage

Although Gemini 2.0 Flash excels in reducing hallucinations, one detailed third-party benchmark (Analytics Vidhya, April 2025) finds that OpenAI’s o4-mini delivers stronger performance on tasks requiring multi-step reasoning.

Key figures from that study include:

General Reasoning (MMLU)

  • o4-mini: ~89.7%

  • Gemini 2.0 Flash: ~81.4%

Python Coding (HumanEval)

  • o4-mini: ~85.6%

  • Gemini 2.0 Flash: ~78.9%

Grade-School Math (GSM8K)

  • o4-mini scores several points higher in head-to-head comparisons

Visual-Question Answering (VQA)

Flash retains an advantage in one test cited in the same study:

  • Gemini 2.0 Flash: ~88.3%

  • o4-mini: ~81.5%

Latency & Energy

The same article reports that:

  • Flash responds ~2.6× faster than o4-mini

  • Flash may use ~75% less energy on TPU v5e hardware

These values are not from Google or OpenAI and should be understood as results from a single independent evaluation, not a consensus across all benchmark suites. Nonetheless, the directional difference — Flash optimised for speed, o4-mini for multi-step reasoning — aligns with broader industry positioning.

Practical Deployment: Where Flash Performs Best

Given its combination of very low hallucination rates, large context and high throughput, Gemini 2.0 Flash is particularly well-suited for:

  • newsroom summarisation and fact-anchored reporting

  • enterprise RAG pipelines and document processing

  • customer support and high-volume messaging systems

  • workflow automation across Google Workspace

  • tool-using agent systems requiring real-time inference

For workloads requiring deep chain-of-thought, mathematical reasoning, scientific analysis or complex coding, heavier models — such as OpenAI’s o4-mini, Google’s Gemini 2.5 Pro, or Meta’s Llama 4 Maverick — currently tend to perform better.

Gemini 2.5 Pro: Google’s High-End Reasoning Engine

In March 2025, Google unveiled Gemini 2.5 Pro, described as its most intelligent and most reasoning-capable model to date.

Google reports that Gemini 2.5 Pro:

  • Achieved state-of-the-art scores on reasoning benchmarks such as AIME 2025 and GPQA

  • Reached the #1 position on the LMSYS Chatbot Arena, debuting with a strong Elo lead

  • Launched with a 1 million-token context, with a 2 million-token mode progressively becoming available in select configurations

This positioning reflects Google's two-track strategy: Gemini 2.0 Flash remains the fast, energy-efficient factual engine, while Gemini 2.5 Pro leads the company's high-end reasoning capabilities.

TheDayAfterAI News

We are a leading AI-focused digital news platform, combining AI-generated reporting with human editorial oversight. By aggregating and synthesizing the latest developments in AI — spanning innovation, technology, ethics, policy and business — we deliver timely, accurate and thought-provoking content.
