AI Benchmarks & Leaderboard — 2026-05-19
The frontier AI model race continues at full intensity in mid-May 2026, with GPT-5.5 holding the top intelligence rankings, Claude Opus 4.7 leading in coding, and DeepSeek V4 dominating cost-performance benchmarks. Independent evaluations from Artificial Analysis confirm GPT-5.5 (xhigh) and GPT-5.5 (high) as the highest-intelligence models, while open-source contenders like Qwen3.5 and Llama 4 continue closing the gap with closed-source leaders.

New Model Releases & Updates
GPT-5.5 by OpenAI
- Type: Closed-source; available in "xhigh" and "high" compute tiers
- Key benchmarks: Intelligence Index score of 60 (xhigh) and 59 (high) on Artificial Analysis composite benchmark
- vs. Previous best: Leads all models on the Artificial Analysis Intelligence Index; GPT-5.4 (xhigh) scores 57, placing it third
- What's notable: Tops both the intelligence leaderboard and agent task evaluations; pricing reaches up to $25/M tokens at the high end; Mercury 2 and Granite 3.3 8B surpass it in output speed
Claude Opus 4.7 by Anthropic
- Type: Closed-source; available in "max" and "Adaptive Reasoning, Max Effort" tiers
- Key benchmarks: Intelligence Index score of 57 (Adaptive Reasoning, Max Effort); cited as leading model for coding tasks
- vs. Previous best: Tied with Gemini 3.1 Pro Preview at 57 on the Intelligence Index; trails GPT-5.5 by 2–3 points overall
- What's notable: Identified as the top performer specifically for coding benchmarks; Anthropic's flagship reasoning model heading into summer 2026
Gemini 3.1 Pro Preview by Google
- Type: Closed-source preview
- Key benchmarks: Intelligence Index score of 57; identified as leading model for reasoning tasks
- vs. Previous best: Tied with Claude Opus 4.7 (Adaptive Reasoning) at 57, trailing GPT-5.5 (xhigh) by 3 points
- What's notable: Gemini 3.1 Flash-Lite is among the fastest models on the leaderboard; the Pro Preview variant focuses on reasoning quality
DeepSeek V4 by DeepSeek (China)
- Type: Open-weight; available via API
- Key benchmarks: Named best-in-class for cost-efficiency; cited as winning the cost category in May 2026 rankings
- vs. Previous best: Competes with GPT-5.5 and Claude Opus 4.7 on intelligence while dramatically undercutting on price
- What's notable: Part of a broader pattern of Chinese open-source AI — alongside Qwen3.5 — reshaping the cost-performance frontier. Continues to reignite AI price wars roughly a year after DeepSeek V3 rattled Silicon Valley
Leaderboard Snapshot
Frontier Models (Closed-Source)
| Model | Provider | Notable Strengths | Key Score |
|---|---|---|---|
| GPT-5.5 (xhigh) | OpenAI | Top intelligence, agent tasks | Intelligence Index: 60 |
| GPT-5.5 (high) | OpenAI | Strong intelligence, broad capability | Intelligence Index: 59 |
| Claude Opus 4.7 (Adaptive Reasoning, Max) | Anthropic | Coding, reasoning | Intelligence Index: 57 |
| Gemini 3.1 Pro Preview | Google | Reasoning, multimodal | Intelligence Index: 57 |
| GPT-5.4 (xhigh) | OpenAI | Strong all-round performance | Intelligence Index: 57 |
Open-Source Leaders
| Model | Parameters | Notable Strengths | Key Score |
|---|---|---|---|
| DeepSeek V4 | Not disclosed | Cost-efficiency, reasoning | Best cost/performance ratio |
| Qwen3.5 (397B) | 397B | Reasoning, multilingual, local deployment | 5.5+ tokens/sec on MacBook |
| Llama 4 | Not disclosed | Broad capability, open license | Frontier-class open-weight |
| Gemma 4 | Not disclosed | Efficiency, Google ecosystem | Competitive with frontier |
| Mistral Medium 3.5 | Not disclosed | Coding, instruction following | Frontier-class open-weight |
| Qwen3.5 0.8B | 0.8B | Speed, cost | $0.02/M tokens (blended) |
Benchmark Deep Dive
Artificial Analysis Intelligence Index: Who Really Leads in May 2026?

The Artificial Analysis Intelligence Index provides one of the most rigorous composite evaluations currently available, aggregating ten challenging evaluations across mathematics, science, coding, and reasoning into a single holistic score. As of mid-May 2026, the leaderboard shows a clear hierarchy: GPT-5.5 (xhigh) scores 60, GPT-5.5 (high) scores 59, and three models — Claude Opus 4.7, Gemini 3.1 Pro Preview, and GPT-5.4 (xhigh) — are tied at 57.
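The hierarchy described above can be reproduced with a small ranking sketch. It uses only the five scores quoted in this article and applies standard competition ranking, where tied models share a rank. The function name `competition_rank` and the `TIERS` variable are illustrative, not part of any Artificial Analysis tooling.

```python
from collections import defaultdict

# Intelligence Index scores as quoted in this article (mid-May 2026 snapshot).
SCORES = {
    "GPT-5.5 (xhigh)": 60,
    "GPT-5.5 (high)": 59,
    "Claude Opus 4.7 (Adaptive Reasoning, Max Effort)": 57,
    "Gemini 3.1 Pro Preview": 57,
    "GPT-5.4 (xhigh)": 57,
}


def competition_rank(scores):
    """Return (rank, score, models) tuples; tied models share one rank."""
    by_score = defaultdict(list)
    for model, score in scores.items():
        by_score[score].append(model)

    ranked = []
    rank = 1
    for score in sorted(by_score, reverse=True):
        models = sorted(by_score[score])
        ranked.append((rank, score, models))
        rank += len(models)  # the next rank skips past the tied group
    return ranked


TIERS = competition_rank(SCORES)
for rank, score, models in TIERS:
    print(f"#{rank} ({score}): {', '.join(models)}")
```

Under this ranking the three models at 57 all share third place, which matches the article's description of GPT-5.4 (xhigh) as third.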
What makes this snapshot particularly revealing is what it shows about the shape of competition at the frontier. The gap between the top model (60) and the cluster just below it (57) is narrow, suggesting that for most production use cases, the choice between top-tier closed-source models may hinge more on cost, latency, and task-specific strengths than on raw benchmark scores.
Speed is its own dimension entirely. Mercury 2 leads at 673.6 tokens per second, followed by Granite 3.3 8B at 394.0 tokens per second; both are dramatically faster than the frontier intelligence leaders, which matters enormously for latency-sensitive applications. Meanwhile, on cost, Qwen3.5 0.8B (both reasoning and non-reasoning variants) comes in at $0.02/M tokens blended, roughly 1,250 times cheaper than GPT-5.5 at the high end ($25/M tokens).
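As a quick check on the figures in this paragraph, the following sketch computes the price and speed ratios directly from the quoted numbers. The constant and function names are illustrative. Note that it compares a blended price against the top of GPT-5.5's price range, so the result is an upper bound on the spread rather than a like-for-like comparison.

```python
# Figures quoted in this article; prices are USD per million tokens.
GPT_55_MAX_PRICE = 25.00   # upper end of GPT-5.5 pricing
QWEN_08B_PRICE = 0.02      # Qwen3.5 0.8B, blended
MERCURY_2_TPS = 673.6      # output tokens per second
GRANITE_33_8B_TPS = 394.0  # output tokens per second


def fold(larger, smaller):
    """How many times larger the first value is than the second."""
    if smaller <= 0:
        raise ValueError("divisor must be positive")
    return larger / smaller


price_fold = fold(GPT_55_MAX_PRICE, QWEN_08B_PRICE)
speed_fold = fold(MERCURY_2_TPS, GRANITE_33_8B_TPS)

print(f"Price spread: {price_fold:,.0f}x")
print(f"Mercury 2 vs Granite 3.3 8B output speed: {speed_fold:.2f}x")
```

The price ratio works out to 1,250, consistent with the spread cited in the Analysis & Trends section.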
For practitioners, the key takeaway is that "best model" is now thoroughly context-dependent. GPT-5.5 wins on raw intelligence benchmarks; Claude Opus 4.7 edges ahead on coding; Gemini 3.1 Pro leads on reasoning; DeepSeek V4 and Qwen3.5 dominate on cost; and Mercury 2 and Granite models lead on throughput. Selecting a model requires matching the task profile, latency budget, and cost constraints rather than simply chasing the top of a single leaderboard.
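One way to make this context dependence concrete is a simple lookup from priority to the category leaders named above. This is a minimal sketch: the category keys and the `shortlist` helper are hypothetical labels chosen for illustration, and a real selection process would also weigh measured latency, price, and task-specific evaluations.

```python
# Category leaders as named in this article; the keys are hypothetical labels.
LEADERS = {
    "intelligence": ["GPT-5.5"],
    "coding": ["Claude Opus 4.7"],
    "reasoning": ["Gemini 3.1 Pro Preview"],
    "cost": ["DeepSeek V4", "Qwen3.5"],
    "throughput": ["Mercury 2", "Granite 3.3 8B"],
}


def shortlist(priorities):
    """Return candidate models for an ordered list of priorities, without duplicates."""
    candidates = []
    for priority in priorities:
        if priority not in LEADERS:
            raise KeyError(f"unknown priority: {priority!r}")
        for model in LEADERS[priority]:
            if model not in candidates:
                candidates.append(model)
    return candidates


# Example: a coding assistant where cost is the secondary concern.
print(shortlist(["coding", "cost"]))
```

The order of the priorities determines the order of the shortlist, so the first entry is always a leader in the category that matters most to the caller.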
Analysis & Trends
- State of the art: GPT-5.5 leads overall intelligence rankings, Claude Opus 4.7 tops coding evaluations, and Gemini 3.1 Pro Preview leads reasoning tasks. For speed, Mercury 2 (673.6 tokens/second) and Granite 3.3 8B (394.0 tokens/second) are in a class of their own.
- Open vs. Closed gap: According to a recent analysis, "open-source models have caught up with GPT-4 on most tasks", but the frontier has moved. DeepSeek V4, Qwen3.5, Llama 4, and Mistral Medium 3.5 are now described as "frontier-class open-weight" models, and the open-source community is widely acknowledged to be closing the gap meaningfully, even if the very top closed-source models (GPT-5.5, Claude Opus 4.7) remain ahead on composite benchmarks.
- Cost-performance: Across 356+ tracked models, prices span $0.02 to $25/M tokens, a 1,250× spread. Qwen3.5 0.8B holds the affordability crown; DeepSeek V4 wins on cost per capability. Gartner projects that 40% of organizations deploying AI will implement dedicated AI observability tools by 2028 to track model performance and cost at scale, reflecting growing enterprise attention to cost management in production deployments.
- Emerging patterns: The "layer above the model" is increasingly cited as the differentiator in production AI: the orchestration, prompting infrastructure, and tooling stack matter as much as, or more than, the choice of underlying model. Additionally, Chinese open-source models (DeepSeek V4, Qwen3.5) are reshaping the competitive landscape not just on cost but on raw capability, putting sustained pressure on Western labs.
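To show what the price range in the cost-performance point means in absolute terms, the sketch below projects monthly spend at the two quoted price extremes. The 500M-token workload is a hypothetical figure chosen for illustration, and the calculation assumes a single flat per-million-token price rather than separate input and output rates.

```python
def monthly_cost_usd(tokens_per_month, price_per_million_usd):
    """Spend in USD for a token volume at a flat per-million-token price."""
    if tokens_per_month < 0 or price_per_million_usd < 0:
        raise ValueError("inputs must be non-negative")
    return tokens_per_month / 1_000_000 * price_per_million_usd


WORKLOAD_TOKENS = 500_000_000  # hypothetical: 500M tokens per month

low = monthly_cost_usd(WORKLOAD_TOKENS, 0.02)    # cheapest price quoted in this article
high = monthly_cost_usd(WORKLOAD_TOKENS, 25.00)  # highest price quoted in this article

print(f"${low:,.2f} vs ${high:,.2f} per month")
```

At that volume the two extremes differ by the same 1,250× factor, which is the kind of gap dedicated cost tracking is meant to surface.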
What to Watch Next
- AI observability adoption: Gartner's May 12 prediction that 40% of AI-deploying organizations will use dedicated observability tooling by 2028 signals a maturing enterprise market. Watch for new tools and vendor consolidation in this space over the coming months.
- Chinese open-source momentum: DeepSeek V4 and Qwen3.5 have already reshaped the cost frontier. Further releases from DeepSeek and Alibaba, potentially including stronger reasoning models, could close the gap with GPT-5.5 and Claude Opus 4.7 on intelligence benchmarks while maintaining decisive cost advantages.
- Production infrastructure vs. model benchmarks: The emerging consensus that the "layer above the model" now matters more than raw benchmark scores suggests that the next competitive battleground may shift from model capability to deployment tooling, RAG frameworks, and agentic orchestration. Watch for benchmark frameworks that measure real-world production performance rather than isolated task scores.
This content was collected, curated, and summarized entirely by AI — including how and what to gather. It may contain inaccuracies. Crew does not guarantee the accuracy of any information presented here. Always verify facts on your own before acting on them. Crew assumes no legal liability for any consequences arising from reliance on this content.