AI Benchmarks & Leaderboard — 2026-05-01

The final days of April 2026 proved to be one of the most competitive periods in AI history, with GPT-5.5 topping the frontier leaderboard, Claude Opus 4.7 and Gemini 3.1 Pro close behind, and DeepSeek V4 reclaiming open-source leadership with aggressive pricing. Mistral AI also entered the fray with Medium 3.5, the rare Western open-source contender in the top tier. Independent trackers now place GPT-5.5 at the apex of the Intelligence Index, while open-source models are closing the gap faster than ever on reasoning benchmarks.


New Model Releases & Updates


GPT-5.5 by OpenAI

  • Type: Closed-source
  • Key benchmarks: Tops the Artificial Analysis Intelligence Index with a score of 60 (xhigh tier); its two compute tiers occupy the #1 and #2 positions
  • vs. Previous best: Surpasses GPT-5.4 (score 57) and all current Anthropic and Google flagship models
  • What's notable: Improved coding, computer-use, and deep-research capabilities; available in two inference tiers (high and xhigh). DeepSeek V4 Pro undercuts its pricing ($0.145/M input tokens versus GPT-5.5's higher rate).

DeepSeek V4 Pro & V4 Flash by DeepSeek

  • Type: Open-weights (preview)
  • Key benchmarks: "Almost closed the gap" with frontier closed and open models on reasoning benchmarks per DeepSeek's own assessment; more efficient and performant than DeepSeek V3.2
  • vs. Previous best: Outperforms DeepSeek V3.2; competitive with GPT-5.5, Gemini 3.1 Pro, and Claude Opus 4.7 on reasoning tasks
  • What's notable: Aggressive pricing undercuts all major rivals — V4 Pro at $0.145/M input tokens, $3.48/M output tokens. Market reaction was muted compared to the viral DeepSeek V3 launch a year ago. Reclaims open-weights leadership per independent analysis.

Image: DeepSeek V4 preview coverage (techcrunch.com)


Mistral Medium 3.5 by Mistral AI

  • Type: Open-source
  • Key benchmarks: Described as "top tier" for open-source; community reaction was mixed, with its agent capabilities the notable exception
  • vs. Previous best: A rare Western entrant in the open-source top tier; costs several times more than Chinese rivals (DeepSeek, Qwen)
  • What's notable: The standout feature is its agent/tool-use capability, which could make it competitive for enterprise agentic workflows despite the higher pricing (a hedged sketch of such a tool-use call follows below).
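
Many of the providers covered here expose OpenAI-compatible chat endpoints, so an agentic tool-use call can be sketched with the standard openai Python client. This is illustrative only: the base URL, model identifier, and tool definition below are placeholders, not confirmed values for Mistral Medium 3.5.

```python
# Minimal tool-use sketch against an OpenAI-compatible endpoint.
# The base_url and model name are placeholders (assumptions), not
# confirmed identifiers for Mistral Medium 3.5.
from openai import OpenAI

client = OpenAI(
    base_url="https://api.mistral.example/v1",  # placeholder URL
    api_key="YOUR_API_KEY",
)

tools = [{
    "type": "function",
    "function": {
        "name": "get_issue_status",  # hypothetical tool for illustration
        "description": "Look up the status of a tracked issue by ID.",
        "parameters": {
            "type": "object",
            "properties": {"issue_id": {"type": "string"}},
            "required": ["issue_id"],
        },
    },
}]

resp = client.chat.completions.create(
    model="mistral-medium-3.5",  # placeholder model identifier
    messages=[{"role": "user", "content": "Is issue PROJ-142 resolved?"}],
    tools=tools,
)

# If the model decided to call the tool, the structured call appears here.
for call in resp.choices[0].message.tool_calls or []:
    print(call.function.name, call.function.arguments)
```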



Claude Opus 4.7 by Anthropic

  • Type: Closed-source
  • Key benchmarks: Intelligence Index score 57 (Adaptive Reasoning, Max Effort); tied with Gemini 3.1 Pro Preview for 3rd place
  • vs. Previous best: Slightly behind GPT-5.5 (xhigh: 60) but competitive with Gemini 3.1 Pro at the frontier tier
  • What's notable: Strong on adaptive reasoning tasks; pricing higher than DeepSeek V4 Pro but competitive within the closed-source tier.

Leaderboard Snapshot


Frontier Models (Closed-Source)

Model                        | Provider  | Notable Strengths                   | Key Score (Intelligence Index)
GPT-5.5 (xhigh)              | OpenAI    | Coding, computer use, deep research | 60
GPT-5.5 (high)               | OpenAI    | Coding, computer use                | 59
Claude Opus 4.7 (Max Effort) | Anthropic | Adaptive reasoning                  | 57
Gemini 3.1 Pro Preview       | Google    | Broad capability, multimodal        | 57
GPT-5.4 (xhigh)              | OpenAI    | Previous-gen frontier               | 57

Source: Artificial Analysis LLM Leaderboard


Open-Source Leaders

Model              | Parameters           | Notable Strengths                     | Key Score
DeepSeek V4 Pro    | Not disclosed (MoE)  | Reasoning, efficiency, low cost       | Near-frontier on reasoning benchmarks
DeepSeek V4 Flash  | Not disclosed (MoE)  | Speed, low cost                       | Competitive on reasoning
GLM-5              | Not disclosed        | General capability                    | ~85 (open benchmark ranking)
Qwen3.5            | Various (0.8B–large) | Efficiency, low cost ($0.02/M tokens) | Competitive
Kimi K2.5          | Not disclosed        | General capability                    | Top-3 open-source
Mistral Medium 3.5 | Not disclosed        | Agent/tool use                        | Top Western open-source

Benchmark Deep Dive


DeepSeek V4: Closing the Gap on Reasoning

The most consequential benchmark story of the past week is DeepSeek's return to the frontier with its V4 Pro and V4 Flash models. Independent analysis from Artificial Analysis confirms that DeepSeek is "back among the leading open-weights models," with both variants closing what was previously a meaningful gap on reasoning benchmarks against the best closed-source systems from OpenAI, Anthropic, and Google.

DeepSeek V4 is more efficient and performant than its predecessor V3.2 due to architectural improvements. Critically, both models have "almost closed the gap" with current leading models — open and closed — on reasoning benchmarks, according to DeepSeek's own technical preview. This is particularly notable because just one year ago, DeepSeek V3 shocked Silicon Valley; now the community is watching whether V4 can repeat that disruption at the frontier reasoning tier.

What makes this especially significant for practitioners is the pricing dimension. DeepSeek V4 Pro is offered at $0.145 per million input tokens and $3.48 per million output tokens — decisively undercutting Gemini 3.1 Pro, GPT-5.5, Claude Opus 4.7, and GPT-5.4. For organizations running high-volume inference workloads, this cost-performance profile is potentially transformative. Market reaction has been more muted than the original DeepSeek moment (Reuters notes the subdued response), which may reflect that the AI field has become accustomed to rapid capability improvements. Nonetheless, independent evaluators confirm DeepSeek V4 is a genuine frontier-class open-weights model.
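
To put those rates in context, here is a minimal Python sketch of monthly inference spend at DeepSeek V4 Pro's published prices. The workload volumes and the comparison rates are illustrative assumptions; only the $0.145/M input and $3.48/M output figures come from this report.

```python
# Hypothetical monthly workload; the volumes below are assumptions
# for illustration, not figures from this report.
INPUT_TOKENS_PER_MONTH = 2_000_000_000   # 2B input tokens
OUTPUT_TOKENS_PER_MONTH = 500_000_000    # 500M output tokens

def monthly_cost(input_price_per_m: float, output_price_per_m: float) -> float:
    """Cost in USD given per-million-token input and output prices."""
    return ((INPUT_TOKENS_PER_MONTH / 1e6) * input_price_per_m
            + (OUTPUT_TOKENS_PER_MONTH / 1e6) * output_price_per_m)

# DeepSeek V4 Pro rates as reported: $0.145/M input, $3.48/M output.
deepseek = monthly_cost(0.145, 3.48)

# Placeholder rates for a pricier closed model (assumption, not reported).
frontier = monthly_cost(2.50, 10.00)

print(f"DeepSeek V4 Pro:            ${deepseek:,.0f}/month")   # ~$2,030
print(f"Hypothetical frontier rates: ${frontier:,.0f}/month")  # ~$10,000
```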



Analysis & Trends

  • State of the art: GPT-5.5 leads across coding, computer use, and research tasks. Claude Opus 4.7 and Gemini 3.1 Pro lead in adaptive reasoning and multimodal capability respectively. On the open-source side, DeepSeek V4 Pro leads on reasoning, while Qwen3.5 leads on cost-efficiency and speed.

  • Open vs. Closed gap: The gap is narrowing at pace. DeepSeek V4 has nearly matched frontier closed-source models on reasoning benchmarks at a fraction of the cost. Chinese labs (DeepSeek, Qwen, GLM, Kimi) dominate open-source top-5 rankings; Mistral Medium 3.5 is the notable Western exception, but at a significant price premium over Chinese competitors.

  • Cost-performance: The most dramatic development this week is pricing pressure. DeepSeek V4 Pro at $0.145/M input tokens sets a new low for frontier-class open-weights performance, and Qwen3.5 0.8B (Reasoning) is now the cheapest model tracked by Artificial Analysis at $0.02/M tokens blended. Mercury 2 leads inference speed at 805.5 tokens/second. The cost curve continues to collapse (a sketch of how blended prices and decode times are computed follows this list).

  • Emerging patterns: Agent and tool-use capability is emerging as a key differentiator — Mistral Medium 3.5's agent capability was the main reason it received positive coverage despite mixed general benchmarks. Speed-optimized models (Mercury 2, Granite 4.0 H Small) are carving out a distinct niche as frontier intelligence and raw speed increasingly diverge.
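
For readers reproducing the figures above: a blended per-token price is a weighted average of input and output rates, commonly at a 3:1 input-to-output ratio (an assumption here, as the exact weighting varies by tracker), and decode speed converts directly into wall-clock generation time. A minimal sketch:

```python
def blended_price(input_per_m: float, output_per_m: float,
                  input_ratio: float = 3.0) -> float:
    """Weighted average price per million tokens, assuming a 3:1
    input-to-output token ratio (a common benchmarking convention)."""
    return (input_ratio * input_per_m + output_per_m) / (input_ratio + 1.0)

def generation_seconds(output_tokens: int, tokens_per_second: float) -> float:
    """Wall-clock time to stream a completion at a given decode speed."""
    return output_tokens / tokens_per_second

# DeepSeek V4 Pro blended price from its reported rates ($0.145 in, $3.48 out).
print(f"V4 Pro blended: ${blended_price(0.145, 3.48):.3f}/M tokens")  # ~$0.979

# Mercury 2 at the reported 805.5 tokens/s: a 2,000-token answer in ~2.5 s.
print(f"2k tokens @ 805.5 tok/s: {generation_seconds(2000, 805.5):.1f} s")
```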


What to Watch Next

  • DeepSeek V4 full release: The current V4 Pro and V4 Flash are preview versions. A full production launch could shift enterprise adoption rapidly — especially if pricing holds.

  • Qwen and the open-source price war: Forbes notes DeepSeek V4 and Qwen are reshaping the open-source AI race together. Watch for Qwen's next major release and whether it can further undercut on cost while matching reasoning benchmarks.

  • April/May leaderboard consolidation: BuildFastWithAI's updated leaderboard covering both April and May 2026 signals that the rankings are still in flux as late-April releases are fully evaluated. Expect scoring updates across SWE-bench, ARC-AGI-2, and reasoning suites as community evaluations catch up to new models.

This content was collected, curated, and summarized entirely by AI — including how and what to gather. It may contain inaccuracies. Crew does not guarantee the accuracy of any information presented here. Always verify facts on your own before acting on them. Crew assumes no legal liability for any consequences arising from reliance on this content.

