AI Benchmarks & Leaderboard — 2026-05-01

The final days of April 2026 proved to be one of the most competitive periods in AI history, with GPT-5.5 topping the frontier leaderboard, Claude Opus 4.7 and Gemini 3.1 Pro close behind, and DeepSeek V4 reclaiming open-source leadership with aggressive pricing. Mistral AI also entered the fray with Medium 3.5, the rare Western open-source contender in the top tier. Independent trackers now place GPT-5.5 at the apex of the Intelligence Index, while open-source models are closing the gap faster than ever on reasoning benchmarks.


New Model Releases & Updates


GPT-5.5 by OpenAI

  • Type: Closed-source
  • Key benchmarks: Tops the Artificial Analysis Intelligence Index with a score of 60 (xhigh tier); its two compute tiers occupy the #1 and #2 positions
  • vs. Previous best: Surpasses GPT-5.4 (score 57) and all current Anthropic and Google flagship models
  • What's notable: Improved coding, computer-use, and deep-research capabilities; available in two inference tiers (high and xhigh). DeepSeek V4 Pro undercuts its pricing ($0.145/M input tokens versus GPT-5.5's higher rate).

DeepSeek V4 Pro & V4 Flash by DeepSeek

  • Type: Open-weights (preview)
  • Key benchmarks: "Almost closed the gap" with frontier closed and open models on reasoning benchmarks per DeepSeek's own assessment; more efficient and performant than DeepSeek V3.2
  • vs. Previous best: Outperforms DeepSeek V3.2; competitive with GPT-5.5, Gemini 3.1 Pro, and Claude Opus 4.7 on reasoning tasks
  • What's notable: Aggressive pricing undercuts all major rivals — V4 Pro at $0.145/M input tokens, $3.48/M output tokens. Market reaction was muted compared to the viral DeepSeek V3 launch a year ago. Reclaims open-weights leadership per independent analysis.

Image: DeepSeek V4 preview coverage (techcrunch.com)


Mistral Medium 3.5 by Mistral AI

  • Type: Open-source
  • Key benchmarks: Described as "top tier" for open-source; community reaction was mixed, with its agent capabilities the notable exception
  • vs. Previous best: A rare Western entrant in the open-source top tier; costs several times more than Chinese rivals (DeepSeek, Qwen)
  • What's notable: The standout feature is its agent/tool-use capability, which could make it competitive for enterprise agentic workflows despite the higher pricing (a hedged sketch of such a tool-use call follows below).
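
Many of the providers covered here expose OpenAI-compatible chat endpoints, so an agentic tool-use call can be sketched with the standard openai Python client. This is illustrative only: the base URL, model identifier, and tool definition below are placeholders, not confirmed values for Mistral Medium 3.5.

```python
# Minimal tool-use sketch against an OpenAI-compatible endpoint.
# The base_url and model name are placeholders (assumptions), not
# confirmed identifiers for Mistral Medium 3.5.
from openai import OpenAI

client = OpenAI(
    base_url="https://api.mistral.example/v1",  # placeholder URL
    api_key="YOUR_API_KEY",
)

tools = [{
    "type": "function",
    "function": {
        "name": "get_issue_status",  # hypothetical tool for illustration
        "description": "Look up the status of a tracked issue by ID.",
        "parameters": {
            "type": "object",
            "properties": {"issue_id": {"type": "string"}},
            "required": ["issue_id"],
        },
    },
}]

resp = client.chat.completions.create(
    model="mistral-medium-3.5",  # placeholder model identifier
    messages=[{"role": "user", "content": "Is issue PROJ-142 resolved?"}],
    tools=tools,
)

# If the model decided to call the tool, the structured call appears here.
for call in resp.choices[0].message.tool_calls or []:
    print(call.function.name, call.function.arguments)
```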



Claude Opus 4.7 by Anthropic

  • Type: Closed-source
  • Key benchmarks: Intelligence Index score 57 (Adaptive Reasoning, Max Effort); tied with Gemini 3.1 Pro Preview for 3rd place
  • vs. Previous best: Slightly behind GPT-5.5 (xhigh: 60) but competitive with Gemini 3.1 Pro at the frontier tier
  • What's notable: Strong on adaptive reasoning tasks; pricing higher than DeepSeek V4 Pro but competitive within the closed-source tier.

Leaderboard Snapshot


Frontier Models (Closed-Source)

Model                        | Provider  | Notable Strengths                   | Key Score (Intelligence Index)
GPT-5.5 (xhigh)              | OpenAI    | Coding, computer use, deep research | 60
GPT-5.5 (high)               | OpenAI    | Coding, computer use                | 59
Claude Opus 4.7 (Max Effort) | Anthropic | Adaptive reasoning                  | 57
Gemini 3.1 Pro Preview       | Google    | Broad capability, multimodal        | 57
GPT-5.4 (xhigh)              | OpenAI    | Previous-gen frontier               | 57

Source: Artificial Analysis LLM Leaderboard


Open-Source Leaders

Model              | Parameters           | Notable Strengths                     | Key Score
DeepSeek V4 Pro    | Not disclosed (MoE)  | Reasoning, efficiency, low cost       | Near-frontier on reasoning benchmarks
DeepSeek V4 Flash  | Not disclosed (MoE)  | Speed, low cost                       | Competitive on reasoning
GLM-5              | Not disclosed        | General capability                    | ~85 (open benchmark ranking)
Qwen3.5            | Various (0.8B–large) | Efficiency, low cost ($0.02/M tokens) | Competitive
Kimi K2.5          | Not disclosed        | General capability                    | Top-3 open-source
Mistral Medium 3.5 | Not disclosed        | Agent/tool use                        | Top Western open-source

Benchmark Deep Dive


DeepSeek V4: Closing the Gap on Reasoning

The most consequential benchmark story of the past week is DeepSeek's return to the frontier with its V4 Pro and V4 Flash models. Independent analysis from Artificial Analysis confirms that DeepSeek is "back among the leading open-weights models," with both variants closing what was previously a meaningful gap on reasoning benchmarks against the best closed-source systems from OpenAI, Anthropic, and Google.

DeepSeek V4 is more efficient and performant than its predecessor V3.2 due to architectural improvements. Critically, both models have "almost closed the gap" with current leading models — open and closed — on reasoning benchmarks, according to DeepSeek's own technical preview. This is particularly notable because just one year ago, DeepSeek V3 shocked Silicon Valley; now the community is watching whether V4 can repeat that disruption at the frontier reasoning tier.

What makes this especially significant for practitioners is the pricing dimension. DeepSeek V4 Pro is offered at $0.145 per million input tokens and $3.48 per million output tokens — decisively undercutting Gemini 3.1 Pro, GPT-5.5, Claude Opus 4.7, and GPT-5.4. For organizations running high-volume inference workloads, this cost-performance profile is potentially transformative. Market reaction has been more muted than the original DeepSeek moment (Reuters notes the subdued response), which may reflect that the AI field has become accustomed to rapid capability improvements. Nonetheless, independent evaluators confirm DeepSeek V4 is a genuine frontier-class open-weights model.
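
To put those rates in context, here is a minimal Python sketch of monthly inference spend at DeepSeek V4 Pro's published prices. The workload volumes and the comparison rates are illustrative assumptions; only the $0.145/M input and $3.48/M output figures come from this report.

```python
# Hypothetical monthly workload; the volumes below are assumptions
# for illustration, not figures from this report.
INPUT_TOKENS_PER_MONTH = 2_000_000_000   # 2B input tokens
OUTPUT_TOKENS_PER_MONTH = 500_000_000    # 500M output tokens

def monthly_cost(input_price_per_m: float, output_price_per_m: float) -> float:
    """Cost in USD given per-million-token input and output prices."""
    return ((INPUT_TOKENS_PER_MONTH / 1e6) * input_price_per_m
            + (OUTPUT_TOKENS_PER_MONTH / 1e6) * output_price_per_m)

# DeepSeek V4 Pro rates as reported: $0.145/M input, $3.48/M output.
deepseek = monthly_cost(0.145, 3.48)

# Placeholder rates for a pricier closed model (assumption, not reported).
frontier = monthly_cost(2.50, 10.00)

print(f"DeepSeek V4 Pro:            ${deepseek:,.0f}/month")   # ~$2,030
print(f"Hypothetical frontier rates: ${frontier:,.0f}/month")  # ~$10,000
```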



Analysis & Trends

  • State of the art: GPT-5.5 leads across coding, computer use, and research tasks. Claude Opus 4.7 and Gemini 3.1 Pro lead in adaptive reasoning and multimodal capability respectively. On the open-source side, DeepSeek V4 Pro leads on reasoning, while Qwen3.5 leads on cost-efficiency and speed.

  • Open vs. Closed gap: The gap is narrowing at pace. DeepSeek V4 has nearly matched frontier closed-source models on reasoning benchmarks at a fraction of the cost. Chinese labs (DeepSeek, Qwen, GLM, Kimi) dominate open-source top-5 rankings; Mistral Medium 3.5 is the notable Western exception, but at a significant price premium over Chinese competitors.

  • Cost-performance: The most dramatic development this week is pricing pressure. DeepSeek V4 Pro at $0.145/M input tokens sets a new low for frontier-class open-weights performance, and Qwen3.5 0.8B (Reasoning) is now the cheapest model tracked by Artificial Analysis at $0.02/M tokens blended. Mercury 2 leads inference speed at 805.5 tokens/second. The cost curve continues to collapse (a sketch of how blended prices and decode times are computed follows this list).

  • Emerging patterns: Agent and tool-use capability is emerging as a key differentiator — Mistral Medium 3.5's agent capability was the main reason it received positive coverage despite mixed general benchmarks. Speed-optimized models (Mercury 2, Granite 4.0 H Small) are carving out a distinct niche as frontier intelligence and raw speed increasingly diverge.
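
For readers reproducing the figures above: a blended per-token price is a weighted average of input and output rates, commonly at a 3:1 input-to-output ratio (an assumption here, as the exact weighting varies by tracker), and decode speed converts directly into wall-clock generation time. A minimal sketch:

```python
def blended_price(input_per_m: float, output_per_m: float,
                  input_ratio: float = 3.0) -> float:
    """Weighted average price per million tokens, assuming a 3:1
    input-to-output token ratio (a common benchmarking convention)."""
    return (input_ratio * input_per_m + output_per_m) / (input_ratio + 1.0)

def generation_seconds(output_tokens: int, tokens_per_second: float) -> float:
    """Wall-clock time to stream a completion at a given decode speed."""
    return output_tokens / tokens_per_second

# DeepSeek V4 Pro blended price from its reported rates ($0.145 in, $3.48 out).
print(f"V4 Pro blended: ${blended_price(0.145, 3.48):.3f}/M tokens")  # ~$0.979

# Mercury 2 at the reported 805.5 tokens/s: a 2,000-token answer in ~2.5 s.
print(f"2k tokens @ 805.5 tok/s: {generation_seconds(2000, 805.5):.1f} s")
```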


What to Watch Next

  • DeepSeek V4 full release: The current V4 Pro and V4 Flash are preview versions. A full production launch could shift enterprise adoption rapidly — especially if pricing holds.

  • Qwen and the open-source price war: Forbes notes DeepSeek V4 and Qwen are reshaping the open-source AI race together. Watch for Qwen's next major release and whether it can further undercut on cost while matching reasoning benchmarks.

  • April/May leaderboard consolidation: BuildFastWithAI's updated leaderboard covering both April and May 2026 signals that the rankings are still in flux as late-April releases are fully evaluated. Expect scoring updates across SWE-bench, ARC-AGI-2, and reasoning suites as community evaluations catch up to new models.

This content was collected, curated, and summarized entirely by AI — including how and what to gather. It may contain inaccuracies. Crew does not guarantee the accuracy of any information presented here. Always verify facts on your own before acting on them. Crew assumes no legal liability for any consequences arising from reliance on this content.

