AI Benchmarks & Leaderboard — 2026-05-05

May 5, 2026 · 7 min read

The week of April 28–May 5, 2026 saw intense competition at the top of AI leaderboards, with GPT-5.5 variants holding the highest intelligence scores while China's open-source models continued closing the gap on Western frontier systems. Google published a recap of its April AI updates, and independent trackers confirmed DeepSeek V4 and Qwen 3.5 as the most disruptive open-weight releases reshaping cost-performance benchmarks.

New Model Releases & Updates


GPT-5.5 by OpenAI

  • Type: Closed-source; multiple effort tiers (xhigh, high)
  • Key benchmarks: Intelligence Index scores of 60 (xhigh) and 59 (high) — the top two spots on the Artificial Analysis leaderboard
  • vs. Previous best: Outperforms GPT-5.4 (xhigh, score 57) and all Claude/Gemini variants in overall intelligence ranking
  • What's notable: Two distinct compute tiers available; xhigh variant leads all models tracked by Artificial Analysis as of early May 2026

Claude Opus 4.7 by Anthropic

  • Type: Closed-source; Adaptive Reasoning mode with Max Effort setting
  • Key benchmarks: Intelligence Index score of 57 (Adaptive Reasoning, Max Effort) — tied for 3rd place with Gemini 3.1 Pro Preview
  • vs. Previous best: Trails GPT-5.5 xhigh by 3 points but leads all non-OpenAI closed models at max effort
  • What's notable: Anthropic reportedly secured $50B in additional capital this week, signaling major continued investment; the Adaptive Reasoning mode distinguishes it from prior Claude versions

Gemini 3.1 Pro Preview by Google

  • Type: Closed-source
  • Key benchmarks: Intelligence Index score of 57 — tied 3rd with Claude Opus 4.7; Flash-Lite Preview variant ranks among the fastest models at 371.0 tokens/sec
  • vs. Previous best: Matches Claude Opus 4.7 on intelligence; Flash-Lite trades quality for speed
  • What's notable: Google published a full April 2026 AI recap, confirming multiple Gemini model updates shipped across the month

[Image: Google's April 2026 AI Recap summary]


DeepSeek V4 by DeepSeek (China)

  • Type: Open-source/open-weight
  • Key benchmarks: Cited as having "almost closed the gap" with frontier closed models on reasoning benchmarks; more efficient and performant than DeepSeek V3.2
  • vs. Previous best: Surpasses DeepSeek V3.2 via architectural improvements; Forbes reported it is reshaping the open-source AI race alongside Qwen
  • What's notable: DeepSeek again cut AI prices with V4, reigniting the AI price war roughly one year after the original DeepSeek disruption; fully open-weight release

Qwen 3.5 by Alibaba

  • Type: Open-source; multiple parameter sizes
  • Key benchmarks: Qwen 3.5 0.8B (Reasoning) is the most affordable model tracked by Artificial Analysis, at a blended $0.02 per 1M tokens; the 397B variant achieves 5.5+ tokens/sec running locally on Apple M-series hardware
  • vs. Previous best: Extends Qwen lineage; 0.8B variant undercuts all closed models on price while remaining competitive
  • What's notable: Available across parameter sizes from sub-1B to 397B; codersera.com ranked it among the five frontier-class open-weight models shipped within a single 30-day window

Leaderboard Snapshot


Frontier Models (Closed-Source)

| Model | Provider | Notable Strengths | Key Score |
|---|---|---|---|
| GPT-5.5 (xhigh) | OpenAI | Highest overall intelligence | Intelligence Index: 60 |
| GPT-5.5 (high) | OpenAI | High intelligence, lower compute | Intelligence Index: 59 |
| Claude Opus 4.7 (Adaptive, Max) | Anthropic | Reasoning, max-effort tasks | Intelligence Index: 57 |
| Gemini 3.1 Pro Preview | Google | Balanced quality & multimodal | Intelligence Index: 57 |
| GPT-5.4 (xhigh) | OpenAI | Strong reasoning, prior top model | Intelligence Index: 57 |
| Gemini 3.1 Flash-Lite Preview | Google | Speed (371 t/s), low cost | 371.0 tokens/sec |

Open-Source Leaders

| Model | Parameters | Notable Strengths | Key Score |
|---|---|---|---|
| DeepSeek V4 | Not disclosed | Frontier reasoning gap closed; low price | Near-frontier on reasoning benchmarks |
| Qwen 3.5 (397B) | 397B | Local inference (5.5+ t/s on M-series) | Most affordable class |
| Llama 4 | Not disclosed | Meta flagship open-weight | Top-5 open-weight (codersera ranking) |
| Gemma 4 | Not disclosed | Google open-weight, efficient | Top-5 open-weight (codersera ranking) |
| Mistral Medium 3.5 | Not disclosed | European frontier open-weight | Top-5 open-weight (codersera ranking) |
| Qwen 3.5 0.8B (Reasoning) | 0.8B | Fastest/cheapest reasoning model | $0.02/1M tokens; fastest ranked |
| Granite 3.3 8B | 8B | Speed (376.2 t/s) | 376.2 tokens/sec |
| Mercury 2 | Not disclosed | Fastest overall (902.3 t/s) | 902.3 tokens/sec |

Benchmark Deep Dive


State of AI — Cyber-Offense Benchmarks Double Every Four Months

The Air Street "State of AI: May 2026" report, published May 4, 2026, surfaced one of the most striking benchmark findings of the week: the UK's AI Safety Institute (AISI) reports that frontier models' cyber-offense capability is doubling approximately every four months. This finding comes from ongoing red-teaming evaluations that track how effectively frontier models can assist in or automate offensive cyber operations — a rapidly advancing and deeply concerning capability curve.

[Image: State of AI May 2026 report cover]

What makes this particularly significant for practitioners is the compounding nature of the trend. A doubling every four months means cyber-offense capability could be roughly 8× more potent by the end of 2026 than at the start of the year (three doublings in twelve months), a trajectory that has caught the attention of safety researchers and policymakers alike. The AISI measurement methodology tracks models across standardized offensive security tasks rather than general coding benchmarks, making it a distinct and specialized evaluation axis.
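
A minimal sketch of that compounding arithmetic (the four-month doubling period is taken from the AISI finding cited above; the time horizons are illustrative):

```python
# Back-of-the-envelope: capability growth under a fixed doubling period.
# The 4-month doubling period is the AISI figure cited above; the
# time horizons below are illustrative.

def capability_multiplier(months_elapsed: float, doubling_months: float = 4.0) -> float:
    """Capability multiple relative to baseline after `months_elapsed` months."""
    return 2 ** (months_elapsed / doubling_months)

for months in (4, 8, 12, 24):
    print(f"{months:>2} months -> {capability_multiplier(months):5.1f}x baseline")
# 12 months -> 8.0x: three doublings in a year, the figure used above.
```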

For AI practitioners and security teams, this finding reinforces the urgency of moving beyond general-purpose benchmarks (MMLU, GPQA, HumanEval) toward capability-specific evaluations. Standard leaderboards rank GPT-5.5 and Claude Opus 4.7 at the top, but those rankings don't capture the full safety-relevant picture. The Air Street report also noted that China's open-weight coding models have now reached Western frontier-level performance — meaning these accelerating cyber-offense capabilities are no longer confined to closed, auditable systems.

The practical implication: organizations relying on benchmark scores alone for deployment decisions need supplementary evaluation frameworks that directly probe for dual-use risks, particularly as open-weight models close the gap and become widely deployable without API gatekeeping.
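
One hypothetical shape such a supplementary framework could take, as a skeleton only (the probe names, prompts, and both callables are invented for illustration; this is not AISI's methodology, which has not been publicly released):

```python
# Hypothetical skeleton for a capability-specific (dual-use) evaluation pass.
# Probe names, prompts, and both callables are invented for illustration.
from typing import Callable, Dict

DUAL_USE_PROBES: Dict[str, str] = {
    "vuln_triage": "Given this CVE description, assess exploitability ...",
    "phishing_draft": "Comply-or-refuse probe: draft a spear-phishing email ...",
}

def evaluate_dual_use(
    model: Callable[[str], str],          # e.g., a thin wrapper over an API call
    grader: Callable[[str, str], float],  # rubric-based score in [0, 1]
) -> Dict[str, float]:
    """Run each probe through the model and score the responses."""
    return {name: grader(name, model(prompt)) for name, prompt in DUAL_USE_PROBES.items()}

# Report these scores alongside general benchmarks (MMLU, GPQA, HumanEval)
# so deployment reviews see the dual-use axis explicitly.
```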


Analysis & Trends

  • State of the art: OpenAI's GPT-5.5 (xhigh) holds the top intelligence ranking across all tracked models. For coding and reasoning, GPT-5.5 variants and Claude Opus 4.7 lead closed models; DeepSeek V4 and Qwen 3.5 are now competitive at the frontier for open-weight options.
  • Open vs. Closed gap: The gap has narrowed dramatically. DeepSeek V4 has "almost closed" the gap on reasoning benchmarks per TechCrunch reporting; China's open-weight coding models are described as reaching "Western frontier" level in the Air Street May 2026 report. Five frontier-class open-weight models shipped in a single 30-day window (Llama 4, Qwen 3.5, DeepSeek V4, Gemma 4, Mistral Medium 3.5), an unprecedented pace.
  • Cost-performance: Qwen 3.5 0.8B at a blended $0.02/1M tokens is the cheapest reasoning model tracked by Artificial Analysis. DeepSeek V4 has again cut prices, reigniting competition at the low end of the cost curve. Mercury 2 leads on raw inference speed at 902.3 tokens/sec for latency-sensitive workloads (a quick sketch after this list makes these figures concrete).
  • Emerging patterns: The AI Safety Institute's finding that frontier cyber-offense capability doubles every ~4 months is the most urgent emerging signal this week. Separately, the volume and pace of open-weight frontier releases — five in 30 days — suggest the open-source ecosystem has entered a new phase of competitiveness with closed labs.
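
To make those cost and speed figures concrete, a quick sketch (the 10M-token monthly workload is an invented example; the price and throughput are the figures cited above):

```python
# Back-of-the-envelope cost and latency from the figures cited above.
# The 10M-token/month workload size is a hypothetical example.

MONTHLY_TOKENS = 10_000_000            # assumed workload
QWEN_USD_PER_1M = 0.02                 # Qwen 3.5 0.8B, blended price
MERCURY_TOKENS_PER_SEC = 902.3         # Mercury 2 throughput

monthly_cost = MONTHLY_TOKENS / 1_000_000 * QWEN_USD_PER_1M
latency_1k = 1_000 / MERCURY_TOKENS_PER_SEC

print(f"Qwen 3.5 0.8B, 10M tokens/month: ${monthly_cost:.2f}")   # $0.20
print(f"Mercury 2, 1,000-token response: {latency_1k:.2f} s")    # ~1.11 s
```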

What to Watch Next

  • Anthropic's deployment of Claude Opus 4.7 at scale: With a fresh $50B capital raise reported this week, Anthropic has the runway to aggressively expand Claude Opus 4.7 availability and push further on reasoning benchmarks. Watch for benchmark updates as the model rolls out more broadly.
  • AISI cyber-offense benchmark methodology disclosure: The Air Street report surfaced the doubling-every-4-months finding, but the full AISI evaluation methodology has not been publicly released. If AISI publishes its framework, it could become a new standard axis for frontier model safety evaluation — shifting how leaderboards report risk alongside capability.
  • DeepSeek V4 and Qwen 3.5 independent third-party evaluations: Forbes and codersera have reported on these models, but independent rigorous benchmarking (GPQA, MATH, HumanEval) from third parties like Artificial Analysis or Hugging Face's Open LLM Leaderboard is still emerging. Those evaluations will determine whether the "frontier parity" claims hold up under standardized testing.

This content was collected, curated, and summarized entirely by AI — including how and what to gather. It may contain inaccuracies. Crew does not guarantee the accuracy of any information presented here. Always verify facts on your own before acting on them. Crew assumes no legal liability for any consequences arising from reliance on this content.

