OpenAI GPT-5 Pro Scores 100% on AIME 2025 Math Exam


OpenAI has launched GPT-5, its latest language model aimed at improving problem-solving in areas such as coding, math, writing, and health.

The model became available on August 7, 2025, and is now the default option in the ChatGPT platform for signed-in users.

Background on OpenAI's Development

OpenAI's GPT series has evolved from GPT-3, which focused on large-scale text generation, to models like GPT-4o that added multimodal processing for images and audio.

GPT-5 continues this progression by incorporating a unified architecture to handle both quick responses and deeper reasoning, addressing gaps in prior models' consistency and accuracy.

The release supports OpenAI's objective of making advanced AI widely available, with collaborations including Microsoft for infrastructure support and safety assessments by organizations like the UK's AI Safety Institute (AISI).

Key Features of GPT-5

GPT-5 functions as a single system with an internal router that selects between a standard mode for routine queries and a "thinking" mode for complex ones, based on factors like query difficulty and user preferences.
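The routing idea can be pictured with a small sketch. OpenAI has not published how its router works, so everything here is an assumption made for illustration: the `Query` fields, the numeric difficulty score, the `threshold` value, and the mode names are all hypothetical.

```python
from dataclasses import dataclass


@dataclass
class Query:
    text: str
    difficulty: float               # hypothetical score in [0, 1]
    prefers_thinking: bool = False  # hypothetical user preference flag


def route(query: Query, threshold: float = 0.6) -> str:
    """Pick a mode for a query: 'thinking' or 'standard'.

    Illustrative only: a real router would weigh many more signals.
    """
    # An explicit user preference wins over the difficulty estimate.
    if query.prefers_thinking:
        return "thinking"
    # Otherwise send harder queries to the slower, deeper mode.
    return "thinking" if query.difficulty >= threshold else "standard"


if __name__ == "__main__":
    print(route(Query("What is the capital of France?", difficulty=0.1)))
    print(route(Query("Prove this inequality for all n.", difficulty=0.9)))
```

The point of the sketch is the split itself: routine queries stay on the fast path, and only queries judged difficult, or explicitly flagged by the user, pay the cost of deeper reasoning.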

This enables tasks such as building complete websites or apps from one prompt, interpreting scientific diagrams, and offering context-aware health information.

The model supports multimodal inputs, including video analysis and spatial reasoning, and includes preset personalities like Cynic or Nerd for customized interactions in text chats.

Improvements and Performance Metrics

GPT-5 scores consistently higher than earlier OpenAI models such as GPT-4o and o3.

It achieves 94.6% on the AIME 2025 math exam without tools, 74.9% on SWE-bench Verified for software engineering, and 84.2% on MMMU for multimodal understanding.

The GPT-5 Pro variant reaches 100% accuracy on AIME 2025 when using Python code execution.

Factual errors in reasoning tasks are reduced by up to 80%, deceptive responses drop from 4.8% to 2.1%, and sycophantic tendencies (excessive agreement with users) decrease from 14.5% to under 6%.
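The deception and sycophancy figures are given as before-and-after rates, while the factual-error figure is given as a relative reduction. To put them on the same footing, the rates can be converted with a short calculation. The function name below is ours, and the 6.0 used for sycophancy is only the upper bound implied by "under 6%".

```python
def relative_reduction(before: float, after: float) -> float:
    """Percentage drop from `before` to `after`, relative to `before`."""
    return (before - after) / before * 100


# Deceptive responses: 4.8% -> 2.1%
deception = relative_reduction(4.8, 2.1)

# Sycophancy: 14.5% -> "under 6%"; 6.0 is the worst case, so the true
# reduction is at least this large.
sycophancy_floor = relative_reduction(14.5, 6.0)

print(f"Deception reduced by about {deception:.0f}%")
print(f"Sycophancy reduced by at least {sycophancy_floor:.1f}%")
```

On these numbers, deceptive responses fall by roughly 56% in relative terms and sycophancy by at least about 58.6%.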

These results stem from updated training approaches that prioritize factual and balanced outputs.

In comparison, xAI's Grok 4 scores 90.6% on AIME 2025 and 58.6% on SWE-bench Verified, with notable multimodal capabilities.

Anthropic's Claude Opus 4.1 reaches 78.0% on AIME 2025 and 74.5% on SWE-bench Verified, showing strength in coding tasks.

DeepSeek's R1 achieves 79.8% on AIME 2024 (the previous year's exam, so not directly comparable with the AIME 2025 figures above) and 49.2% on SWE-bench Verified, with competitive multimodal reasoning in its V3 series.

GPT-5's consistently strong results across these benchmarks highlight its versatility, while competitors show targeted strengths in areas like Grok 4's math performance or Claude Opus 4.1's coding efficiency.

Release Details and Accessibility

GPT-5 is integrated into ChatGPT, replacing prior defaults for logged-in users. Free-tier users can send up to 10 messages every 5 hours, with one GPT-5 Thinking message per day; exceeding this limit switches chats to a mini variant until the limit resets.

ChatGPT Plus users can send up to 160 GPT-5 messages every 3 hours (a temporary increase, set to revert soon) and up to 200 GPT-5 Thinking messages per week via manual selection.

Exceeding the weekly limit disables manual selection of GPT-5 Thinking, though automatic switching remains available.
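The free-tier behaviour, a message cap per time window with a fallback to the mini variant, can be sketched as a simple counter. This is a minimal illustration, not OpenAI's implementation: the rolling-window assumption, the `MessageWindow` class, and the `"full"` and `"mini"` labels are all hypothetical, and the article does not say whether the real window is rolling or fixed.

```python
from collections import deque


class MessageWindow:
    """Count messages in a rolling time window (illustrative only)."""

    def __init__(self, limit: int, window_seconds: float):
        self.limit = limit
        self.window_seconds = window_seconds
        self._sent = deque()  # timestamps of messages served by the full model

    def pick_model(self, now: float) -> str:
        """Return 'full' while under the cap, otherwise 'mini'."""
        # Forget messages that have aged out of the window.
        while self._sent and now - self._sent[0] >= self.window_seconds:
            self._sent.popleft()
        if len(self._sent) < self.limit:
            self._sent.append(now)
            return "full"
        return "mini"


if __name__ == "__main__":
    # Free tier as described above: 10 messages every 5 hours.
    free_tier = MessageWindow(limit=10, window_seconds=5 * 3600)
    for i in range(12):
        print(i + 1, free_tier.pick_model(now=float(i)))
```

Under these assumptions the eleventh message inside the window is served by the mini variant, and full access returns once the oldest message is more than five hours old.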

Team and Pro plan users have unlimited access, subject to abuse guardrails that prohibit actions such as data scraping, account sharing, or reselling access. Suspected violations may lead to temporary restrictions, which are lifted on review if no violation is found.

Rollout to Enterprise and education plans is set to follow shortly, including coding integrations via tools like Codex CLI on GitHub.

Safety and Ethical Considerations

Classified as high-capability in biological and chemical fields, GPT-5 includes safeguards like safe completions training and refusal policies for harmful queries.

It underwent over 5,000 hours of adversarial testing, including evaluations by the UK AISI and US CAISI, as detailed in its system card.

This builds on prior model assessments to minimize risks while maintaining utility.

Potential Impacts on Society and Industries

GPT-5 performs at or near expert levels in about half of tasks across 40 occupations, including law and engineering, which could enhance productivity but raise concerns about workforce changes.

In health, GPT-5 scores 46.2% on HealthBench Hard. This can help users ask better-informed questions, though its answers are not medical advice.

Wider adoption may promote innovation in education and creativity, with safety measures aiming to address ethical challenges.

TheDayAfterAI News
