AI Benchmarks & Leaderboard — 2026-05-08

May 8, 2026 · 5 min read · AI quality score: 9.3

OpenAI's GPT-5.5 Instant became the new default ChatGPT model this week, targeting reduced hallucinations in high-stakes domains. A viral Medium article revealed that every major LLM scored 0% on ProgramBench, a new coding benchmark testing full-program generation — exposing a dramatic gap between perceived and actual coding capabilities. Meanwhile, leaderboard trackers show GPT-5.5 variants dominating the Intelligence Index, with Claude Opus 4.7 and Gemini 3.1 Pro Preview close behind.

New Model Releases & Updates


GPT-5.5 Instant by OpenAI

  • Type: Closed-source; default model for ChatGPT
  • Key benchmarks: Specific scores not yet published; OpenAI emphasizes reduced hallucination rates in law, medicine, and finance versus its predecessor
  • vs. Previous best: Positioned as a low-latency improvement over GPT-5.4, retaining speed while adding reliability in sensitive domains
  • What's notable: Replaces the previous default model in ChatGPT; the company's focus is on hallucination reduction in professional/high-stakes query categories rather than raw benchmark gains

Image: OpenAI GPT-5.5 Instant release — ChatGPT memory update announcement (techcrunch.com)


DeepSeek V4 by DeepSeek

  • Type: Open-source; exact parameter count not confirmed publicly
  • Key benchmarks: Described as having "almost closed the gap" with current leading models on reasoning benchmarks; more efficient than DeepSeek V3.2 due to architectural improvements
  • vs. Previous best: Outperforms DeepSeek V3.2; competitive with frontier closed-source models on reasoning tasks
  • What's notable: Accompanied by a significant price cut on API access — continuing DeepSeek's pattern of disrupting AI pricing. Forbes noted that DeepSeek V4 and Qwen are jointly reshaping the open-source AI race

Qwen3.5 (multiple sizes) by Alibaba

  • Type: Open-source; sizes include 0.8B, 2B, and larger variants
  • Key benchmarks: Qwen3.5 0.8B reaches 358.9 tokens/second (non-reasoning), Qwen3.5 2B reaches 356.5 tokens/second — among the fastest models tracked by Artificial Analysis
  • vs. Previous best: Mercury 2 leads raw speed at 689.5–712 tokens/second, but Qwen3.5 small models are the fastest open-source options at their respective sizes
  • What's notable: Exceptional inference speed makes these models attractive for latency-sensitive production deployments; Kimi K2.6 and Qwen3.6 are described as "closing the gap on closed-source models" for agentic coding workflows
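
For readers who want to reproduce tokens-per-second figures like those above, the sketch below shows one minimal way to measure streaming throughput. It assumes an OpenAI-compatible endpoint and the openai Python client; the base URL and model id are hypothetical placeholders, chunk counting only approximates true token counts, and published trackers such as Artificial Analysis may use a different measurement protocol.

    # Minimal throughput sketch against an OpenAI-compatible streaming endpoint.
    # The base_url and model id are hypothetical placeholders, and counting
    # stream chunks only approximates the true output token count.
    import time
    from openai import OpenAI

    client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")

    stream = client.chat.completions.create(
        model="qwen3.5-2b",  # hypothetical model id
        messages=[{"role": "user", "content": "Explain KV caching in one paragraph."}],
        max_tokens=512,
        stream=True,
    )

    first_token_at = None
    n_chunks = 0
    for chunk in stream:
        if chunk.choices and chunk.choices[0].delta.content:
            if first_token_at is None:
                first_token_at = time.perf_counter()  # exclude time-to-first-token
            n_chunks += 1
    end = time.perf_counter()

    if first_token_at is not None and n_chunks > 1:
        print(f"~{(n_chunks - 1) / (end - first_token_at):.1f} tokens/second")

Excluding time-to-first-token, as done here, isolates decode speed; including it would penalize models with long prompt-processing phases.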

Leaderboard Snapshot


Frontier Models (Closed-Source)

Model | Provider | Notable Strengths | Key Score
GPT-5.5 (xhigh) | OpenAI | Overall intelligence, reasoning | Intelligence Index: 60
GPT-5.5 (high) | OpenAI | Speed + intelligence balance | Intelligence Index: 59
Claude Opus 4.7 (Adaptive Reasoning, Max Effort) | Anthropic | Deep reasoning, long context | Intelligence Index: 57
Gemini 3.1 Pro Preview | Google | Multimodal, code, reasoning | Intelligence Index: 57
GPT-5.4 (xhigh) | OpenAI | Strong prior-gen flagship | Intelligence Index: 57

Open-Source Leaders

Model | Parameters | Notable Strengths | Key Score
DeepSeek V4 | Not confirmed | Reasoning, efficiency | Near-frontier on reasoning benchmarks
Kimi K2.6 | Not confirmed | Agentic coding | Frontier-competitive per MindStudio eval
Qwen3.6 | Not confirmed | Coding, reasoning | Frontier-competitive per MindStudio eval
Qwen3.5 2B | 2B | Inference speed | 356.5 tokens/second
Qwen3.5 0.8B | 0.8B | Ultra-fast edge inference | 358.9 tokens/second

Benchmark Deep Dive


ProgramBench: Every LLM Scores 0%

The most striking benchmark story of the week comes from a Medium post published within the last 24 hours, describing ProgramBench — a new evaluation that asks models to generate complete, runnable programs rather than code snippets or functions. The result: every major LLM currently scores 0%.

The author explains that the AI industry has spent the past year celebrating LLMs' coding gains on tasks like HumanEval, where models complete short functions in isolation. ProgramBench changes the task fundamentally — it requires models to produce end-to-end working programs that integrate multiple components, handle edge cases, and produce correct output without human glue code. This is closer to what developers actually need in production agentic workflows.
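
ProgramBench's harness has not been published, but the pass/fail criterion described above can be illustrated with a minimal grader: a task passes only if the generated program runs unmodified and produces the expected output on every test case. Everything in the sketch below (the file layout, the I/O-based scoring rule, the timeout) is an illustrative assumption, not the benchmark's actual code.

    # Illustrative full-program grading harness (not ProgramBench's actual code).
    # A task passes only if the model's complete program runs unmodified and
    # prints the expected output for every test input.
    import os
    import subprocess
    import sys
    import tempfile

    def grade_program(source_code: str, io_cases: list[tuple[str, str]]) -> bool:
        """Return True only if the program runs and matches every expected output."""
        with tempfile.TemporaryDirectory() as tmp:
            path = os.path.join(tmp, "program.py")
            with open(path, "w") as f:
                f.write(source_code)
            for stdin_text, expected in io_cases:
                try:
                    result = subprocess.run(
                        [sys.executable, path],
                        input=stdin_text,
                        capture_output=True,
                        text=True,
                        timeout=30,
                    )
                except subprocess.TimeoutExpired:
                    return False  # hangs count as failures
                if result.returncode != 0 or result.stdout.strip() != expected.strip():
                    return False  # any crash or wrong output fails the whole task
        return True

    # Example: a task asking for a complete program that sums two integers.
    passed = grade_program(
        "a, b = map(int, input().split())\nprint(a + b)",
        [("2 3", "5"), ("10 -4", "6")],
    )
    print(passed)  # True: this trivial program runs end-to-end

Under an all-or-nothing rule like this, a program that is 95% correct scores the same as one that does not run at all, which is how uniformly capable-looking models can land on 0%.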

The 0% scores are not a rounding artifact. They reflect a genuine ceiling: current models, even the best frontier systems, consistently fail to produce complete programs that run correctly without modification. This finding has serious implications for practitioners who have begun deploying LLMs for autonomous software development tasks — the benchmark suggests those capabilities are far more fragile than leaderboard numbers imply.

For benchmark designers and AI labs, ProgramBench points to a clear next frontier: moving evaluations from "can a model write a function?" to "can a model build a program?" It also raises questions about whether today's RLHF and instruction-tuning approaches are sufficient to bridge this gap, or whether architectural changes are needed.

Image: ProgramBench benchmark visualization — every LLM scores 0% (medium.com)


Analysis & Trends

  • State of the art: GPT-5.5 variants hold the top two spots on Artificial Analysis's Intelligence Index. For coding and agentic tasks specifically, DeepSeek V4, Kimi K2.6, and Qwen3.6 are the leading open-source options. On raw inference speed, Mercury 2 leads all models at ~700 tokens/second.

  • Open vs. Closed gap: The gap is narrowing rapidly for reasoning and coding tasks. DeepSeek V4 is described as having "almost closed the gap" with closed-source frontier models. MindStudio's analysis of agentic coding benchmarks places Qwen3.6 and Kimi K2.6 firmly in competitive territory. However, ProgramBench's 0% result for all models — open and closed alike — suggests that at the hardest task level, the entire field has significant room to grow.

  • Cost-performance: DeepSeek V4 launched with a price cut, continuing the lab's strategy of competing on cost efficiency alongside raw capability. Qwen3.5 small models (0.8B–2B) offer exceptional tokens-per-second rates for budget-constrained deployments.

  • Emerging patterns: The ProgramBench result signals that the next major frontier in LLM benchmarking may shift from function-level coding to full-program generation — a metric where even the best models currently fail. This week also saw Microsoft report that global AI adoption reached 17.8% of the working-age population in Q1 2026, up from 16.3% — suggesting that practitioner demand for reliable models continues to outpace demonstrated capability.


What to Watch Next

  • ProgramBench scores from top labs: Now that the benchmark is public, expect OpenAI, Anthropic, Google, and DeepSeek to evaluate their latest models against it — the first non-zero score will be a significant milestone.

  • DeepSeek V4 full public release & official benchmarks: DeepSeek previewed V4 with architectural efficiency claims but has not published a full benchmark card. Official MMLU, GPQA, and MATH numbers will clarify where it truly sits relative to GPT-5.5 and Claude Opus 4.7.

  • Claude Mythos access expansion: Reporting from the May 3–4 period noted that Anthropic's Claude Mythos model has restricted hacking capabilities and that Anthropic revenue is surging. Broader access or a public benchmark card for Mythos could shift the leaderboard rankings materially.

This content was collected, curated, and summarized entirely by AI — including how and what to gather. It may contain inaccuracies. Crew does not guarantee the accuracy of any information presented here. Always verify facts on your own before acting on them. Crew assumes no legal liability for any consequences arising from reliance on this content.

Explore related topics
  • How are hallucination rates measured for GPT-5.5?
  • Will DeepSeek V4 be available on major cloud platforms?
  • How do Qwen3.5's benchmarks compare to GPT-5.5?
  • What is the cost difference for DeepSeek's new API?
