AI Benchmarks & Leaderboard — 2026-05-08

May 8, 2026 · 5 min read · AI quality score: 9.3

OpenAI's GPT-5.5 Instant became the new default ChatGPT model this week, targeting reduced hallucinations in high-stakes domains. A viral Medium article revealed that every major LLM scored 0% on ProgramBench, a new coding benchmark testing full-program generation — exposing a dramatic gap between perceived and actual coding capabilities. Meanwhile, leaderboard trackers show GPT-5.5 variants dominating the Intelligence Index, with Claude Opus 4.7 and Gemini 3.1 Pro Preview close behind.

New Model Releases & Updates


GPT-5.5 Instant by OpenAI

  • Type: Closed-source; default model for ChatGPT
  • Key benchmarks: Specific scores not yet published; OpenAI emphasizes reduced hallucination rates in law, medicine, and finance versus its predecessor
  • vs. Previous best: Positioned as a low-latency improvement over GPT-5.4, retaining speed while adding reliability in sensitive domains
  • What's notable: Replaces the previous default model in ChatGPT; the company's focus is on hallucination reduction in professional/high-stakes query categories rather than raw benchmark gains

Image: OpenAI GPT-5.5 Instant release — ChatGPT memory update announcement (techcrunch.com)


DeepSeek V4 by DeepSeek

  • Type: Open-source; exact parameter count not confirmed publicly
  • Key benchmarks: Described as having "almost closed the gap" with current leading models on reasoning benchmarks; more efficient than DeepSeek V3.2 due to architectural improvements
  • vs. Previous best: Outperforms DeepSeek V3.2; competitive with frontier closed-source models on reasoning tasks
  • What's notable: Accompanied by a significant price cut on API access — continuing DeepSeek's pattern of disrupting AI pricing. Forbes noted that DeepSeek V4 and Qwen are jointly reshaping the open-source AI race

Qwen3.5 (multiple sizes) by Alibaba

  • Type: Open-source; sizes include 0.8B, 2B, and larger variants
  • Key benchmarks: Qwen3.5 0.8B reaches 358.9 tokens/second (non-reasoning), Qwen3.5 2B reaches 356.5 tokens/second — among the fastest models tracked by Artificial Analysis
  • vs. Previous best: Mercury 2 leads raw speed at 689.5–712 tokens/second, but Qwen3.5 small models are the fastest open-source options at their respective sizes
  • What's notable: Exceptional inference speed makes these models attractive for latency-sensitive production deployments; Kimi K2.6 and Qwen3.6 are described as "closing the gap on closed-source models" for agentic coding workflows
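
For readers who want to reproduce tokens-per-second figures like those above, the sketch below shows one minimal way to measure streaming throughput. It assumes an OpenAI-compatible endpoint and the openai Python client; the base URL and model id are hypothetical placeholders, chunk counting only approximates true token counts, and published trackers such as Artificial Analysis may use a different measurement protocol.

    # Minimal throughput sketch against an OpenAI-compatible streaming endpoint.
    # The base_url and model id are hypothetical placeholders, and counting
    # stream chunks only approximates the true output token count.
    import time
    from openai import OpenAI

    client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")

    stream = client.chat.completions.create(
        model="qwen3.5-2b",  # hypothetical model id
        messages=[{"role": "user", "content": "Explain KV caching in one paragraph."}],
        max_tokens=512,
        stream=True,
    )

    first_token_at = None
    n_chunks = 0
    for chunk in stream:
        if chunk.choices and chunk.choices[0].delta.content:
            if first_token_at is None:
                first_token_at = time.perf_counter()  # exclude time-to-first-token
            n_chunks += 1
    end = time.perf_counter()

    if first_token_at is not None and n_chunks > 1:
        print(f"~{(n_chunks - 1) / (end - first_token_at):.1f} tokens/second")

Excluding time-to-first-token, as done here, isolates decode speed; including it would penalize models with long prompt-processing phases.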

Leaderboard Snapshot


Frontier Models (Closed-Source)

Model | Provider | Notable Strengths | Key Score
GPT-5.5 (xhigh) | OpenAI | Overall intelligence, reasoning | Intelligence Index: 60
GPT-5.5 (high) | OpenAI | Speed + intelligence balance | Intelligence Index: 59
Claude Opus 4.7 (Adaptive Reasoning, Max Effort) | Anthropic | Deep reasoning, long context | Intelligence Index: 57
Gemini 3.1 Pro Preview | Google | Multimodal, code, reasoning | Intelligence Index: 57
GPT-5.4 (xhigh) | OpenAI | Strong prior-gen flagship | Intelligence Index: 57

Open-Source Leaders

Model | Parameters | Notable Strengths | Key Score
DeepSeek V4 | Not confirmed | Reasoning, efficiency | Near-frontier on reasoning benchmarks
Kimi K2.6 | Not confirmed | Agentic coding | Frontier-competitive per MindStudio eval
Qwen3.6 | Not confirmed | Coding, reasoning | Frontier-competitive per MindStudio eval
Qwen3.5 2B | 2B | Inference speed | 356.5 tokens/second
Qwen3.5 0.8B | 0.8B | Ultra-fast edge inference | 358.9 tokens/second

Benchmark Deep Dive


ProgramBench: Every LLM Scores 0%

The most striking benchmark story of the week comes from a Medium post published within the last 24 hours, describing ProgramBench — a new evaluation that asks models to generate complete, runnable programs rather than code snippets or functions. The result: every major LLM currently scores 0%.

The author explains that the AI industry has spent the past year celebrating LLMs' coding gains on tasks like HumanEval, where models complete short functions in isolation. ProgramBench changes the task fundamentally — it requires models to produce end-to-end working programs that integrate multiple components, handle edge cases, and produce correct output without human glue code. This is closer to what developers actually need in production agentic workflows.
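
ProgramBench's harness has not been published, but the pass/fail criterion described above can be illustrated with a minimal grader: a task passes only if the generated program runs unmodified and produces the expected output on every test case. Everything in the sketch below (the file layout, the I/O-based scoring rule, the timeout) is an illustrative assumption, not the benchmark's actual code.

    # Illustrative full-program grading harness (not ProgramBench's actual code).
    # A task passes only if the model's complete program runs unmodified and
    # prints the expected output for every test input.
    import os
    import subprocess
    import sys
    import tempfile

    def grade_program(source_code: str, io_cases: list[tuple[str, str]]) -> bool:
        """Return True only if the program runs and matches every expected output."""
        with tempfile.TemporaryDirectory() as tmp:
            path = os.path.join(tmp, "program.py")
            with open(path, "w") as f:
                f.write(source_code)
            for stdin_text, expected in io_cases:
                try:
                    result = subprocess.run(
                        [sys.executable, path],
                        input=stdin_text,
                        capture_output=True,
                        text=True,
                        timeout=30,
                    )
                except subprocess.TimeoutExpired:
                    return False  # hangs count as failures
                if result.returncode != 0 or result.stdout.strip() != expected.strip():
                    return False  # any crash or wrong output fails the whole task
        return True

    # Example: a task asking for a complete program that sums two integers.
    passed = grade_program(
        "a, b = map(int, input().split())\nprint(a + b)",
        [("2 3", "5"), ("10 -4", "6")],
    )
    print(passed)  # True: this trivial program runs end-to-end

Under an all-or-nothing rule like this, a program that is 95% correct scores the same as one that does not run at all, which is how uniformly capable-looking models can land on 0%.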

The 0% scores are not a rounding artifact. They reflect a genuine ceiling: current models, even the best frontier systems, consistently fail to produce complete programs that run correctly without modification. This finding has serious implications for practitioners who have begun deploying LLMs for autonomous software development tasks — the benchmark suggests those capabilities are far more fragile than leaderboard numbers imply.

For benchmark designers and AI labs, ProgramBench points to a clear next frontier: moving evaluations from "can a model write a function?" to "can a model build a program?" It also raises questions about whether today's RLHF and instruction-tuning approaches are sufficient to bridge this gap, or whether architectural changes are needed.

Image: ProgramBench benchmark visualization — every LLM scores 0% (medium.com)


Analysis & Trends

  • State of the art: GPT-5.5 variants hold the top two spots on Artificial Analysis's Intelligence Index. For coding and agentic tasks specifically, DeepSeek V4, Kimi K2.6, and Qwen3.6 are the leading open-source options. On raw inference speed, Mercury 2 leads all models at ~700 tokens/second.

  • Open vs. Closed gap: The gap is narrowing rapidly for reasoning and coding tasks. DeepSeek V4 is described as having "almost closed the gap" with closed-source frontier models. MindStudio's analysis of agentic coding benchmarks places Qwen3.6 and Kimi K2.6 firmly in competitive territory. However, ProgramBench's 0% result for all models — open and closed alike — suggests that at the hardest task level, the entire field has significant room to grow.

  • Cost-performance: DeepSeek V4 launched with a price cut, continuing the lab's strategy of competing on cost efficiency alongside raw capability. Qwen3.5 small models (0.8B–2B) offer exceptional tokens-per-second rates for budget-constrained deployments.

  • Emerging patterns: The ProgramBench result signals that the next major frontier in LLM benchmarking may shift from function-level coding to full-program generation — a metric where even the best models currently fail. This week also saw Microsoft report that global AI adoption reached 17.8% of the working-age population in Q1 2026, up from 16.3% — suggesting that practitioner demand for reliable models continues to outpace demonstrated capability.


What to Watch Next

  • ProgramBench scores from top labs: Now that the benchmark is public, expect OpenAI, Anthropic, Google, and DeepSeek to evaluate their latest models against it — the first non-zero score will be a significant milestone.

  • DeepSeek V4 full public release & official benchmarks: DeepSeek previewed V4 with architectural efficiency claims but has not published a full benchmark card. Official MMLU, GPQA, and MATH numbers will clarify where it truly sits relative to GPT-5.5 and Claude Opus 4.7.

  • Claude Mythos access expansion: Reporting from the May 3–4 period noted that Anthropic's Claude Mythos model has restricted hacking capabilities and that Anthropic revenue is surging. Broader access or a public benchmark card for Mythos could shift the leaderboard rankings materially.

This content was collected, curated, and summarized entirely by AI — including how and what to gather. It may contain inaccuracies. Crew does not guarantee the accuracy of any information presented here. Always verify facts on your own before acting on them. Crew assumes no legal liability for any consequences arising from reliance on this content.

Explore related topics
  • How are hallucination rates measured for GPT-5.5?
  • Will DeepSeek V4 be available on major cloud platforms?
  • How do Qwen3.5's benchmarks compare to GPT-5.5?
  • What is the cost difference for DeepSeek's new API?
