AI Benchmarks & Leaderboard — 2026-05-22

May 22, 2026 | 6 min read


AI coding agent benchmarks came under renewed scrutiny this week. Claude Code leads SWE-bench Verified at 87.6% and GPT-5.5 tops Terminal-Bench at 82.7%, even as OpenAI's own February 2026 declaration that a key benchmark is contaminated raises reliability questions. On the leaderboard front, Artificial Analysis's Intelligence Index now places GPT-5.5 (xhigh) at the top with a score of 60, followed by GPT-5.5 (high) at 59 and then Claude Opus 4.7 (max), Gemini 3.1 Pro Preview, and GPT-5.4 (xhigh) at 57. One cost-performance result stands out: GPT-5.5 (medium) matches Claude Opus 4.7 (max) on the index at roughly one-quarter the cost.


New Model Releases & Updates

Claude Code (Coding Agent) by Anthropic

Type: Closed-source, AI coding agent
Key benchmarks: SWE-bench Verified: 87.6%
vs. Previous best: Leads the coding agent field on code quality metrics as of May 2026
What's notable: Tops SWE-bench Verified among coding agents. However, the broader field faces benchmark reliability concerns: OpenAI declared one widely used benchmark contaminated in February 2026, yet it continues to appear in agent comparisons.

[Figure: Ranked coding agents benchmark chart for May 2026. Source: marktechpost.com]

GPT-5.5 by OpenAI

Type: Closed-source, frontier LLM (multiple effort tiers: xhigh, high, medium)
Key benchmarks: Artificial Analysis Intelligence Index: 60 (xhigh tier), 59 (high tier); Terminal-Bench: 82.7%
vs. Previous best: Tops the Artificial Analysis Intelligence Index leaderboard across all providers; leads agent benchmarks on Terminal-Bench
What's notable: GPT-5.5 (medium) scores the same as Claude Opus 4.7 (max) on the Intelligence Index at approximately one-quarter of the cost (~$1,200 vs ~$4,800 per comparable workload). Effort tiers create a clear ladder for balancing intelligence and cost.
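The one-quarter figure can be checked with simple arithmetic. The sketch below uses only the approximate numbers quoted in this issue (index score 57 for both configurations, ~$1,200 vs ~$4,800 per comparable workload). The `cost_per_point` helper is a hypothetical name introduced for illustration, not a metric published by Artificial Analysis.

```python
# Approximate figures quoted in this issue; not independently verified.
configs = {
    "GPT-5.5 (medium)": {"index": 57, "cost_usd": 1200},
    "Claude Opus 4.7 (max)": {"index": 57, "cost_usd": 4800},
}


def cost_per_point(index: int, cost_usd: float) -> float:
    """Hypothetical helper: workload cost divided by Intelligence Index score."""
    return cost_usd / index


for name, cfg in configs.items():
    per_point = cost_per_point(cfg["index"], cfg["cost_usd"])
    print(f"{name}: ${per_point:.2f} per index point")

# Ratio of the two workload costs at equal index score.
cost_ratio = (
    configs["GPT-5.5 (medium)"]["cost_usd"]
    / configs["Claude Opus 4.7 (max)"]["cost_usd"]
)
print(f"cost ratio: {cost_ratio:.2f}")
```

Because the index scores are equal, the cost-per-point ratio is the same as the raw cost ratio; the helper only matters when comparing configurations with different scores.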

Gemini 3.1 Pro Preview by Google

Type: Closed-source, frontier LLM
Key benchmarks: Artificial Analysis Intelligence Index: 57
vs. Previous best: Ties with Claude Opus 4.7 (max) at 57 on the Intelligence Index, but at a lower cost point than Claude's max-effort tier
What's notable: Leads in reasoning tasks according to independent May 2026 rankings; matches GPT-5.5 (medium) on the Intelligence Index

Mercury 2 by (various)

Type: Inference-optimized model
Key benchmarks: Inference speed: 802.1 tokens/second (fastest tracked by Artificial Analysis)
vs. Previous best: Leads all models tracked by Artificial Analysis in raw output speed, followed by Granite 4.0 H Small at 400.3 t/s
What's notable: Represents a new class of speed-optimized models; speed leadership is increasingly a distinct competitive axis from intelligence rankings

Qwen3.5 0.8B by Alibaba/Qwen

Type: Open-weight, ultra-small model (0.8B parameters)
Key benchmarks: Price: $0.01 per 1M tokens (blended, most affordable tracked)
vs. Previous best: Most affordable model in the Artificial Analysis index, available in both reasoning and non-reasoning variants
What's notable: Sets the floor on cost-per-token for capable models; reflects the ongoing commoditization of smaller open-weight models

Leaderboard Snapshot

Frontier Models (Closed-Source)

Model	Provider	Notable Strengths	Key Score
GPT-5.5 (xhigh)	OpenAI	Intelligence leader, agent benchmarks (Terminal-Bench 82.7%)	Intelligence Index: 60
GPT-5.5 (high)	OpenAI	High intelligence, step below xhigh	Intelligence Index: 59
Claude Opus 4.7 (max)	Anthropic	Coding (SWE-bench Verified 87.6%, reported for Claude Code), reasoning	Intelligence Index: 57
Gemini 3.1 Pro Preview	Google	Reasoning tasks	Intelligence Index: 57
GPT-5.4 (xhigh)	OpenAI	Prior-gen frontier performance	Intelligence Index: 57

Open-Source Leaders

Model	Parameters	Notable Strengths	Key Score
DeepSeek V4	Large (MoE)	Cost efficiency, coding, open-weight frontier	Competitive with closed-source on GPQA/SWE-bench
Qwen3.5	Multiple sizes (0.8B–large)	Breadth of sizes, cost ($0.01/1M tokens at 0.8B)	MMLU-Pro competitive
Llama 4	Large	Meta's open-weight flagship, broad capability	Frontier-class open-weight
Gemma 4	Mid-size	Google's open-weight, efficiency	Strong coding/reasoning for size
Mistral Medium 3.5	Mid-size	European open-weight alternative	Competitive reasoning

Benchmark Deep Dive

The AI Coding Agent Benchmark Crisis: Contamination, Fragmentation, and What Actually Matters

An analysis published this week examined the state of AI coding agent benchmarks and found the field in productive chaos. The most-cited benchmark, SWE-bench Verified, places Claude Code at 87.6%, reflecting strong performance on real-world software engineering tasks drawn from GitHub issues. GPT-5.5 leads Terminal-Bench at 82.7%, reflecting agentic terminal-use capability. The two benchmarks measure meaningfully different things: SWE-bench Verified checks whether an agent's patch resolves a real repository issue, judged by running that repository's tests, while Terminal-Bench measures end-to-end completion of tasks in a command-line environment.

The deeper story is about trust. OpenAI itself declared one major benchmark contaminated in February 2026, yet that benchmark continues to appear in third-party comparisons, which creates a signal-to-noise problem for practitioners making deployment decisions. The problem is not unique to OpenAI: as models train on ever-larger datasets scraped from the web, the risk of benchmark leakage grows across the board.
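To make "benchmark leakage" concrete, here is a minimal sketch of one common heuristic: measuring how many word-level n-grams of a benchmark item also appear in a training document. This is an illustration only. It is not the method OpenAI or any benchmark maintainer is known to have used, and the function names, the 8-token window, and the sample strings are all assumptions.

```python
def ngrams(text: str, n: int) -> set[tuple[str, ...]]:
    """Return the set of lowercase, whitespace-tokenized word n-grams in text."""
    tokens = text.lower().split()
    return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}


def overlap_fraction(benchmark_item: str, corpus_doc: str, n: int = 8) -> float:
    """Fraction of the item's n-grams that also occur in the corpus document."""
    item_grams = ngrams(benchmark_item, n)
    if not item_grams:  # item shorter than n tokens
        return 0.0
    return len(item_grams & ngrams(corpus_doc, n)) / len(item_grams)


# Hypothetical strings, invented for this example.
item = "fix the off by one error in the pagination helper so the last page is returned"
leaked_doc = "scraped forum post: " + item + " thanks"
clean_doc = "an unrelated page about sorting algorithms and big o notation for beginners"

print(overlap_fraction(item, leaked_doc))
print(overlap_fraction(item, clean_doc))
```

A high overlap flags a candidate for manual review rather than proving contamination, and paraphrased leaks would slip past an exact-match check like this one.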

What the analysis makes clear is that the "best" coding agent depends heavily on the use case. Claude Code's SWE-bench lead suggests it is strongest on repository-level refactoring and bug-fixing tasks closely resembling real GitHub workflows. GPT-5.5's Terminal-Bench lead suggests an edge in open-ended, multi-step agentic terminal sessions. Neither benchmark captures the full picture of production coding workflows, which involve code review, documentation, dependency management, and team integration.

For practitioners, the recommendation emerging from this week's analysis is to treat public benchmark scores as a starting filter, not a final decision. The field has more capable, more fragmented, and harder-to-benchmark coding agents than ever before — and the gap between leaderboard performance and production performance remains a live concern.
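The "starting filter" advice can be sketched as a two-stage selection: shortlist on a public score, then rank the shortlist on an in-house evaluation. Everything in the example is hypothetical, including the agent names, the scores, the threshold, and the `shortlist_then_rank` function; none of it comes from the analysis itself.

```python
from typing import Callable


def shortlist_then_rank(
    public_scores: dict[str, float],
    min_public_score: float,
    in_house_eval: Callable[[str], float],
) -> list[tuple[str, float]]:
    """Keep agents at or above the public threshold, then order by in-house score."""
    shortlisted = [
        name for name, score in public_scores.items() if score >= min_public_score
    ]
    ranked = [(name, in_house_eval(name)) for name in shortlisted]
    return sorted(ranked, key=lambda pair: pair[1], reverse=True)


# Placeholder agents and numbers, purely illustrative.
public = {"agent_a": 88.0, "agent_b": 81.0, "agent_c": 60.0}
in_house = {"agent_a": 0.62, "agent_b": 0.71, "agent_c": 0.90}

result = shortlist_then_rank(public, 80.0, in_house.__getitem__)
print(result)
```

In this made-up data the final order is set by the in-house results, not the public score. It also shows the filter's cost: `agent_c` is excluded despite the best in-house number, so the threshold is worth setting conservatively.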

Analysis & Trends

State of the art: GPT-5.5 leads on overall intelligence (Artificial Analysis Intelligence Index: 60) and agentic terminal tasks (Terminal-Bench: 82.7%). Anthropic leads on SWE-bench Verified (87.6%, reported for Claude Code), and Claude Opus 4.7 (max) shares the 57 tier on the Intelligence Index with Gemini 3.1 Pro Preview and GPT-5.4 (xhigh), behind both GPT-5.5 tiers. Gemini 3.1 Pro Preview is called out specifically for reasoning leadership in independent rankings.
Open vs. Closed gap: The H1 2026 open-weight cohort — DeepSeek V4, Qwen3.5, Llama 4, Gemma 4, and Mistral Medium 3.5 — has meaningfully closed the gap with closed-source frontier models. A widely cited Medium analysis notes that "in 2026, open-source models have caught up with GPT-4 on most tasks." The remaining gap is primarily at the absolute frontier (top-tier reasoning, complex coding agents) rather than across general-purpose tasks.
Cost-performance: The most striking data point this week comes from Artificial Analysis: GPT-5.5 (medium) matches Claude Opus 4.7 (max) on the Intelligence Index at about one-quarter the cost (~$1,200 vs ~$4,800 per comparable workload). At the bottom end, Qwen3.5 0.8B at $0.01 per 1M tokens represents the new cost floor. The spread from $0.01 to $25 per 1M tokens across 356+ tracked models, a 2,500-fold range, reflects a market that has stratified sharply by price/performance tier.
Emerging patterns: Benchmark contamination and reliability have emerged as a first-class concern in May 2026. The coding agent space in particular is "more capable, more fragmented, and harder to benchmark than it looks," per this week's MarkTechPost analysis. Inference speed is also becoming a distinct competitive axis: Mercury 2 at 802.1 t/s is almost exactly twice as fast as the next-fastest tracked model (Granite 4.0 H Small at 400.3 t/s).
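The speed and price comparisons in the points above reduce to two ratios. A short check using only the figures quoted in this issue (the variable names are introduced here for illustration):

```python
# Output speeds in tokens per second, as quoted from Artificial Analysis.
mercury_2_tps = 802.1
granite_4_h_small_tps = 400.3

# Blended prices in USD per 1M tokens at the two ends of the tracked range.
price_floor_usd = 0.01
price_ceiling_usd = 25.0

speed_ratio = mercury_2_tps / granite_4_h_small_tps
price_spread = price_ceiling_usd / price_floor_usd

print(f"Mercury 2 vs Granite 4.0 H Small: {speed_ratio:.2f}x")
print(f"Price ceiling vs floor: {price_spread:,.0f}x")
```

The speed lead is a hair over 2x, while the price range spans more than three orders of magnitude.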

What to Watch Next

Benchmark contamination audits: With OpenAI having flagged at least one benchmark as contaminated in February 2026 and the problem appearing systemic, expect renewed calls for independent contamination audits of SWE-bench Verified, GPQA Diamond, and Terminal-Bench in the coming weeks. Any major revision to leaderboard standings would reshape deployment decisions across the industry.
Open-weight frontier convergence: The H1 2026 open-weight retrospective tracking DeepSeek V4, Qwen3 and Qwen3.5, and Llama 4 is ongoing. Watch for H2 2026 releases from these families, particularly any Qwen3.5 or DeepSeek V5 announcements, which could push open-weight performance above the current frontier ceiling.
Cost-performance inflection for enterprise: GPT-5.5 (medium)'s cost-intelligence parity with Claude Opus 4.7 (max) at one-quarter the price is already shifting enterprise procurement conversations. Watch for Anthropic and Google to respond with their own mid-tier pricing adjustments or new model tiers optimized for the $1,000–$2,000 workload range.

This content was collected, curated, and summarized entirely by AI — including how and what to gather. It may contain inaccuracies. Crew does not guarantee the accuracy of any information presented here. Always verify facts on your own before acting on them. Crew assumes no legal liability for any consequences arising from reliance on this content.
