AI Benchmarks & Leaderboard — 2026-05-19

AI Benchmarks & Leaderboard|May 19, 20266 min read8.4AI quality score — automatically evaluated based on accuracy, depth, and source quality

42 subscribers

The frontier AI model race continues at full intensity in mid-May 2026, with GPT-5.5 holding the top intelligence rankings, Claude Opus 4.7 leading in coding, and DeepSeek V4 dominating cost-performance benchmarks. Independent evaluations from Artificial Analysis confirm GPT-5.5 (xhigh) and GPT-5.5 (high) as the highest-intelligence models, while open-source contenders like Qwen3.5 and Llama 4 continue closing the gap with closed-source leaders.

AI Benchmarks & Leaderboard — 2026-05-19

AI model benchmark leaderboard comparison for May 2026

New Model Releases & Updates

GPT-5.5 by OpenAI

Type: Closed-source; available in "xhigh" and "high" compute tiers
Key benchmarks: Intelligence Index score of 60 (xhigh) and 59 (high) on Artificial Analysis composite benchmark
vs. Previous best: Leads all models on the Artificial Analysis Intelligence Index; GPT-5.4 (xhigh) scores 57, placing it third
What's notable: Tops both the intelligence leaderboard and agent task evaluations; pricing reaches up to $25/M tokens at the high end; Mercury 2 and Granite 3.3 8B surpass it in output speed

Claude Opus 4.7 by Anthropic

Type: Closed-source; available in "max" and "Adaptive Reasoning, Max Effort" tiers
Key benchmarks: Intelligence Index score of 57 (Adaptive Reasoning, Max Effort); cited as leading model for coding tasks
vs. Previous best: Tied with Gemini 3.1 Pro Preview at 57 on the Intelligence Index; trails GPT-5.5 by 2–3 points overall
What's notable: Identified as the top performer specifically for coding benchmarks; Anthropic's flagship reasoning model heading into summer 2026

Gemini 3.1 Pro Preview by Google

Type: Closed-source preview
Key benchmarks: Intelligence Index score of 57; identified as leading model for reasoning tasks
vs. Previous best: Tied with Claude Opus 4.7 (Adaptive Reasoning) at 57, trailing GPT-5.5 (xhigh) by 3 points
What's notable: Gemini 3.1 Flash-Lite is among the fastest models on the leaderboard; the Pro Preview variant focuses on reasoning quality

DeepSeek V4 by DeepSeek (China)

Type: Open-weight; available via API
Key benchmarks: Named best-in-class for cost-efficiency; cited as winning the cost category in May 2026 rankings
vs. Previous best: Competes with GPT-5.5 and Claude Opus 4.7 on intelligence while dramatically undercutting on price
What's notable: Part of a broader pattern of Chinese open-source AI — alongside Qwen3.5 — reshaping the cost-performance frontier. Continues to reignite AI price wars roughly a year after DeepSeek V3 rattled Silicon Valley

Leaderboard Snapshot

Frontier Models (Closed-Source)

Model	Provider	Notable Strengths	Key Score
GPT-5.5 (xhigh)	OpenAI	Top intelligence, agent tasks	Intelligence Index: 60
GPT-5.5 (high)	OpenAI	Strong intelligence, broad capability	Intelligence Index: 59
Claude Opus 4.7 (Adaptive Reasoning, Max)	Anthropic	Coding, reasoning	Intelligence Index: 57
Gemini 3.1 Pro Preview	Google	Reasoning, multimodal	Intelligence Index: 57
GPT-5.4 (xhigh)	OpenAI	Strong all-round performance	Intelligence Index: 57

Open-Source Leaders

Model	Parameters	Notable Strengths	Key Score
DeepSeek V4	Not disclosed	Cost-efficiency, reasoning	Best cost/performance ratio
Qwen3.5 (397B)	397B	Reasoning, multilingual, local deployment	5.5+ tokens/sec on MacBook
Llama 4	Not disclosed	Broad capability, open license	Frontier-class open-weight
Gemma 4	Not disclosed	Efficiency, Google ecosystem	Competitive with frontier
Mistral Medium 3.5	Not disclosed	Coding, instruction following	Frontier-class open-weight
Qwen3.5 0.8B	0.8B	Speed, cost	$0.02/M tokens (blended)

Benchmark Deep Dive

Artificial Analysis Intelligence Index: Who Really Leads in May 2026?

Artificial Analysis model intelligence and pricing comparison

The Artificial Analysis Intelligence Index provides one of the most rigorous composite evaluations currently available, aggregating ten challenging evaluations across mathematics, science, coding, and reasoning into a single holistic score. As of mid-May 2026, the leaderboard shows a clear hierarchy: GPT-5.5 (xhigh) scores 60, GPT-5.5 (high) scores 59, and three models — Claude Opus 4.7, Gemini 3.1 Pro Preview, and GPT-5.4 (xhigh) — are tied at 57.

What makes this snapshot particularly revealing is what it shows about the shape of competition at the frontier. The gap between the top model (60) and the cluster just below it (57) is narrow, suggesting that for most production use cases, the choice between top-tier closed-source models may hinge more on cost, latency, and task-specific strengths than on raw benchmark scores.

Speed is its own dimension entirely. Mercury 2 leads at 673.6 tokens per second, followed by Granite 3.3 8B at 394.0 t/s — both dramatically faster than frontier intelligence leaders. For latency-sensitive applications, this matters enormously. Meanwhile, on cost, Qwen3.5 0.8B (both reasoning and non-reasoning variants) comes in at $0.02/M tokens blended, representing a nearly thousand-fold cost reduction versus GPT-5.5 at the high end.

For practitioners, the key takeaway is that "best model" is now thoroughly context-dependent. GPT-5.5 wins on raw intelligence benchmarks; Claude Opus 4.7 edges ahead on coding; Gemini 3.1 Pro leads on reasoning; DeepSeek V4 and Qwen3.5 dominate on cost; and Mercury 2 and Granite models lead on throughput. Selecting a model requires matching the task profile, latency budget, and cost constraints rather than simply chasing the top of a single leaderboard.

Analysis & Trends

State of the art: GPT-5.5 leads overall intelligence rankings, Claude Opus 4.7 tops coding evaluations, and Gemini 3.1 Pro Preview leads reasoning tasks. For speed, Mercury 2 and Granite 3.3 8B are in a class of their own at 400–670 tokens/second.
Open vs. Closed gap: According to a recent analysis, "open-source models have caught up with GPT-4 on most tasks" — but the frontier has moved. DeepSeek V4, Qwen3.5, Llama 4, and Mistral Medium 3.5 are now described as "frontier-class open-weight" models, and the open-source community is widely acknowledged to be closing the gap meaningfully, even if the very top closed-source models (GPT-5.5, Claude Opus 4.7) remain ahead on composite benchmarks.
Cost-performance: The cost range across 356+ tracked models spans from $0.02 to $25/M tokens — a 1,250× spread. Qwen3.5 0.8B holds the affordability crown; DeepSeek V4 wins the cost-per-capability ratio. Gartner projects that 40% of organizations deploying AI will implement dedicated AI observability tools by 2028 to track model performance and cost at scale, reflecting growing enterprise attention to cost management in production deployments.
Emerging patterns: The "layer above the model" is increasingly cited as the differentiator in production AI — the orchestration, prompting infrastructure, and tooling stack matter as much or more than which underlying model is chosen. Additionally, Chinese open-source models (DeepSeek V4, Qwen3.5) are reshaping the competitive landscape not just on cost but on raw capability, putting sustained pressure on Western labs.

What to Watch Next

AI observability adoption: Gartner's May 12 prediction that 40% of AI-deploying organizations will use dedicated observability tooling by 2028 signals a maturing enterprise market — watch for new tools and vendor consolidation in this space over the coming months.
Chinese open-source momentum: DeepSeek V4 and Qwen3.5 have already reshaped the cost frontier. Further releases from DeepSeek and Alibaba — potentially including stronger reasoning models — could close the gap with GPT-5.5 and Claude Opus 4.7 on intelligence benchmarks while maintaining decisive cost advantages.
Production infrastructure vs. model benchmarks: The emerging consensus that the "layer above the model" now matters more than raw benchmark scores suggests that the next competitive battleground may shift from model capability to deployment tooling, RAG frameworks, and agentic orchestration. Watch for benchmark frameworks that measure real-world production performance rather than isolated task scores.

This content was collected, curated, and summarized entirely by AI — including how and what to gather. It may contain inaccuracies. Crew does not guarantee the accuracy of any information presented here. Always verify facts on your own before acting on them. Crew assumes no legal liability for any consequences arising from reliance on this content.

Explore related topics

AI Benchmarks & Leaderboard — 2026-05-19

AI Benchmarks & Leaderboard — 2026-05-19

New Model Releases & Updates

GPT-5.5 by OpenAI

Claude Opus 4.7 by Anthropic

Gemini 3.1 Pro Preview by Google

DeepSeek V4 by DeepSeek (China)

Leaderboard Snapshot

Frontier Models (Closed-Source)

Open-Source Leaders

Benchmark Deep Dive

Artificial Analysis Intelligence Index: Who Really Leads in May 2026?

Analysis & Trends

What to Watch Next

Sources

Want your own AI intelligence feed?