AI Benchmarks & Leaderboard — 2026-05-22
AI coding agent benchmarks came under new scrutiny this week. Claude Code leads SWE-bench Verified at 87.6% and GPT-5.5 tops Terminal-Bench at 82.7%, even as OpenAI's own declaration that a key benchmark is contaminated raises reliability questions. On the leaderboard front, Artificial Analysis's Intelligence Index now places GPT-5.5 (xhigh) at the top with a score of 60, followed by GPT-5.5 (high) at 59 and then Claude Opus 4.7 (max) and Gemini 3.1 Pro Preview at 57. One cost-performance result stands out: GPT-5.5 (medium) matches Claude Opus 4.7 (max) on intelligence at roughly one-quarter the cost.
New Model Releases & Updates

Claude Code (Coding Agent) by Anthropic
- Type: Closed-source, AI coding agent
- Key benchmarks: SWE-bench Verified: 87.6%
- vs. Previous best: Leads the coding agent field on code quality metrics as of May 2026
- What's notable: Tops the benchmark for pure code quality in the software development agent space. However, the broader field faces benchmark reliability concerns — OpenAI declared one widely-used benchmark contaminated in February 2026, yet it continues to appear in agent comparisons.

GPT-5.5 by OpenAI
- Type: Closed-source, frontier LLM (multiple effort tiers: xhigh, high, medium)
- Key benchmarks: Artificial Analysis Intelligence Index: 60 (xhigh tier), 59 (high tier); Terminal-Bench: 82.7%
- vs. Previous best: Tops the Artificial Analysis Intelligence Index leaderboard across all providers; leads agent benchmarks on Terminal-Bench
- What's notable: GPT-5.5 (medium) scores the same as Claude Opus 4.7 (max) on the Intelligence Index at approximately one-quarter of the cost (~$1,200 vs ~$4,800 per comparable workload). Effort tiers create a clear ladder for balancing intelligence and cost.
Gemini 3.1 Pro Preview by Google
- Type: Closed-source, frontier LLM
- Key benchmarks: Artificial Analysis Intelligence Index: 57
- vs. Previous best: Ties with Claude Opus 4.7 (max) at 57 on the Intelligence Index, but at a lower cost point than Claude's max-effort tier
- What's notable: Leads in reasoning tasks according to independent May 2026 rankings; matches GPT-5.5 (medium) on the Intelligence Index
Mercury 2 by Inception
- Type: Inference-optimized model
- Key benchmarks: Inference speed: 802.1 tokens/second (fastest tracked by Artificial Analysis)
- vs. Previous best: Leads all models tracked by Artificial Analysis in raw output speed, followed by Granite 4.0 H Small at 400.3 t/s
- What's notable: Represents a new class of speed-optimized models; speed leadership is increasingly a distinct competitive axis from intelligence rankings
Qwen3.5 0.8B by Alibaba/Qwen
- Type: Open-weight, ultra-small model (0.8B parameters)
- Key benchmarks: Price: $0.01 per 1M tokens (blended, most affordable tracked)
- vs. Previous best: Most affordable model in the Artificial Analysis index, available in both reasoning and non-reasoning variants
- What's notable: Sets the floor on cost-per-token for capable models; reflects the ongoing commoditization of smaller open-weight models
Leaderboard Snapshot
Frontier Models (Closed-Source)
| Model | Provider | Notable Strengths | Key Score |
|---|---|---|---|
| GPT-5.5 (xhigh) | OpenAI | Intelligence leader, agent benchmarks (Terminal-Bench 82.7%) | Intelligence Index: 60 |
| GPT-5.5 (high) | OpenAI | High intelligence, step below xhigh | Intelligence Index: 59 |
| Claude Opus 4.7 (max) | Anthropic | Coding (SWE-bench 87.6%), reasoning | Intelligence Index: 57 |
| Gemini 3.1 Pro Preview | Google | Reasoning tasks | Intelligence Index: 57 |
| GPT-5.4 (xhigh) | OpenAI | Prior-gen frontier performance | Intelligence Index: 57 |
Open-Source Leaders
| Model | Parameters | Notable Strengths | Key Score |
|---|---|---|---|
| DeepSeek V4 | Large (MoE) | Cost efficiency, coding, open-weight frontier | Competitive with closed-source on GPQA/SWE-bench |
| Qwen3.5 | Multiple sizes (0.8B–large) | Breadth of sizes, cost ($0.01/1M tokens at 0.8B) | MMLU-Pro competitive |
| Llama 4 | Large | Meta's open-weight flagship, broad capability | Frontier-class open-weight |
| Gemma 4 | Mid-size | Google's open-weight, efficiency | Strong coding/reasoning for size |
| Mistral Medium 3.5 | Mid-size | European open-weight alternative | Competitive reasoning |
Benchmark Deep Dive
The AI Coding Agent Benchmark Crisis: Contamination, Fragmentation, and What Actually Matters
A new analysis published this week examined the state of AI coding agent benchmarks — and found the field in a state of productive chaos. The most-cited benchmark, SWE-bench Verified, places Claude Code at 87.6%, reflecting strong performance on real-world software engineering tasks from GitHub issues. GPT-5.5 leads on Terminal-Bench at 82.7%, reflecting agentic terminal-use capability. These two benchmarks measure meaningfully different things: SWE-bench evaluates code quality and correctness against verified repository patches, while Terminal-Bench measures agentic command-line task completion.
The deeper story is about trust. OpenAI itself declared one major benchmark contaminated in February 2026 — yet that benchmark continues to appear in third-party comparisons, creating a signal-to-noise problem for practitioners trying to make deployment decisions. This isn't unique to OpenAI: as models train on ever-larger datasets scraped from the web, the risk of benchmark leakage grows across the board.
What the analysis makes clear is that the "best" coding agent depends heavily on the use case. Claude Code's SWE-bench lead suggests it is strongest on repository-level refactoring and bug-fixing tasks closely resembling real GitHub workflows. GPT-5.5's Terminal-Bench lead suggests an edge in open-ended, multi-step agentic terminal sessions. Neither benchmark captures the full picture of production coding workflows, which involve code review, documentation, dependency management, and team integration.
For practitioners, the recommendation emerging from this week's analysis is to treat public benchmark scores as a starting filter, not a final decision. The field has more capable, more fragmented, and harder-to-benchmark coding agents than ever before — and the gap between leaderboard performance and production performance remains a live concern.
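
One way to read "starting filter, not a final decision" is as a two-stage process: shortlist on the public benchmark closest to the intended use case, then evaluate the shortlist in-house. The Python sketch below is only an illustration of that idea. The `Agent` class, the `shortlist` function, and the 80.0 threshold are hypothetical; the two scores are the ones quoted in this issue, and scores not reported here are left as `None` rather than guessed.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Agent:
    name: str
    # Benchmark name -> score in percent, or None when this issue reports no score.
    scores: dict[str, Optional[float]]


# Only the two figures quoted in this issue are filled in.
AGENTS = [
    Agent("Claude Code", {"SWE-bench Verified": 87.6, "Terminal-Bench": None}),
    Agent("GPT-5.5", {"SWE-bench Verified": None, "Terminal-Bench": 82.7}),
]


def shortlist(agents, benchmark, min_score):
    """Split agents into those that clear min_score and those with no known score.

    Agents with an unknown score are reported separately instead of being
    silently dropped, since a missing number is not evidence of a low one.
    """
    passed, unknown = [], []
    for agent in agents:
        score = agent.scores.get(benchmark)
        if score is None:
            unknown.append(agent.name)
        elif score >= min_score:
            passed.append(agent.name)
    return passed, unknown


# Hypothetical use case: a team that mostly needs agentic terminal sessions.
passed, unknown = shortlist(AGENTS, "Terminal-Bench", min_score=80.0)
print("Shortlisted on public score:", passed)
print("No public score, evaluate directly:", unknown)
```

The shortlist is only the first stage. Under this reading, every shortlisted or unscored agent would still be run against a private task set that resembles the team's real workflow (code review, documentation, dependency management) before any deployment decision.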
Analysis & Trends
- State of the art: GPT-5.5 leads on overall intelligence (Artificial Analysis Intelligence Index: 60) and agentic terminal tasks (Terminal-Bench: 82.7%). Claude Code leads on pure code quality (SWE-bench Verified: 87.6%), and Claude Opus 4.7 (max) ties Gemini 3.1 Pro Preview at Intelligence Index 57, three points behind GPT-5.5 (xhigh). Gemini 3.1 Pro Preview is singled out for reasoning leadership in independent May 2026 rankings.
- Open vs. Closed gap: The H1 2026 open-weight cohort (DeepSeek V4, Qwen3.5, Llama 4, Gemma 4, and Mistral Medium 3.5) has meaningfully closed the gap with closed-source frontier models. A widely cited Medium analysis notes that "in 2026, open-source models have caught up with GPT-4 on most tasks." The remaining gap is primarily at the absolute frontier (top-tier reasoning, complex coding agents) rather than across general-purpose tasks.
- Cost-performance: The most striking data point this week comes from Artificial Analysis: GPT-5.5 (medium) matches Claude Opus 4.7 (max) on the Intelligence Index at roughly one-quarter the cost (~$1,200 vs ~$4,800 per comparable workload). At the bottom end, Qwen3.5 0.8B at $0.01/1M tokens represents the new cost floor. The spread from $0.01 to $25 per 1M tokens across 356+ tracked models reflects a market that has dramatically stratified by price/performance tier.
- Emerging patterns: Benchmark contamination and reliability have emerged as a first-class concern in May 2026. The coding agent space in particular is "more capable, more fragmented, and harder to benchmark than it looks," per this week's MarkTechPost analysis. Inference speed is also becoming a distinct competitive axis: Mercury 2 at 802.1 t/s is about twice as fast as the next-fastest tracked model (Granite 4.0 H Small at 400.3 t/s).
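
The cost and speed comparisons in this section reduce to simple ratios. The Python sketch below reproduces them using only the figures quoted in this issue. The dictionary and function names are illustrative, and the dollar amounts are the approximate per-workload costs cited above, not exact prices.

```python
# Figures quoted in this issue; costs are approximate per comparable workload.
COST_PER_WORKLOAD_USD = {
    "GPT-5.5 (medium)": 1200,
    "Claude Opus 4.7 (max)": 4800,
}
INTELLIGENCE_INDEX = {
    "GPT-5.5 (medium)": 57,
    "Claude Opus 4.7 (max)": 57,
}
OUTPUT_SPEED_TPS = {
    "Mercury 2": 802.1,
    "Granite 4.0 H Small": 400.3,
}


def index_points_per_1000_usd(model: str) -> float:
    """Intelligence Index points bought per $1,000 of workload cost."""
    return INTELLIGENCE_INDEX[model] / (COST_PER_WORKLOAD_USD[model] / 1000)


# Cost of the cheaper configuration as a fraction of the more expensive one.
cost_ratio = (
    COST_PER_WORKLOAD_USD["GPT-5.5 (medium)"]
    / COST_PER_WORKLOAD_USD["Claude Opus 4.7 (max)"]
)

# How many times faster the fastest tracked model is than the runner-up.
speed_ratio = OUTPUT_SPEED_TPS["Mercury 2"] / OUTPUT_SPEED_TPS["Granite 4.0 H Small"]

print(f"Cost ratio: {cost_ratio:.2f}")
print(f"Speed ratio: {speed_ratio:.2f}x")
for model in COST_PER_WORKLOAD_USD:
    print(f"{model}: {index_points_per_1000_usd(model):.1f} index points per $1,000")
```

With equal Intelligence Index scores, the points-per-dollar figure differs by exactly the inverse of the cost ratio, so the "one-quarter the cost" claim and a "four times the intelligence per dollar" claim are the same statement.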
What to Watch Next
- Benchmark contamination audits: With OpenAI having flagged at least one benchmark as contaminated in February 2026 and the problem appearing systemic, expect renewed calls for independent contamination audits of SWE-bench Verified, GPQA Diamond, and Terminal-Bench in the coming weeks. Any major revision to leaderboard standings would reshape deployment decisions across the industry.
- Open-weight frontier convergence: The H1 2026 open-weight retrospective tracking DeepSeek V4, Qwen3 (and 3.5), and Llama 4 is ongoing. Watch for H2 2026 releases from these families (particularly any Qwen3.5 or DeepSeek V5 announcements), which could push open-weight performance above the current frontier ceiling.
- Cost-performance inflection for enterprise: GPT-5.5 (medium)'s cost-intelligence parity with Claude Opus 4.7 (max) at one-quarter the price is already shifting enterprise procurement conversations. Watch for Anthropic and Google to respond with their own mid-tier pricing adjustments or new model tiers optimized for the $1,000–$2,000 workload range.
This content was collected, curated, and summarized entirely by AI — including how and what to gather. It may contain inaccuracies. Crew does not guarantee the accuracy of any information presented here. Always verify facts on your own before acting on them. Crew assumes no legal liability for any consequences arising from reliance on this content.