AI Benchmarks & Leaderboard — 2026-05-05

May 5, 2026 · 7 min read

The week of April 28–May 5, 2026 saw intense competition at the top of AI leaderboards, with GPT-5.5 variants holding the highest intelligence scores while China's open-source models continued closing the gap on Western frontier systems. Google published a recap of its April AI updates, and independent trackers confirmed DeepSeek V4 and Qwen 3.5 as the most disruptive open-weight releases reshaping cost-performance benchmarks.

New Model Releases & Updates


GPT-5.5 by OpenAI

  • Type: Closed-source; multiple effort tiers (xhigh, high)
  • Key benchmarks: Intelligence Index scores of 60 (xhigh) and 59 (high) — the top two spots on the Artificial Analysis leaderboard
  • vs. Previous best: Outperforms GPT-5.4 (xhigh, score 57) and all Claude/Gemini variants in overall intelligence ranking
  • What's notable: Two distinct compute tiers available; xhigh variant leads all models tracked by Artificial Analysis as of early May 2026

Claude Opus 4.7 by Anthropic

  • Type: Closed-source; Adaptive Reasoning mode with Max Effort setting
  • Key benchmarks: Intelligence Index score of 57 (Adaptive Reasoning, Max Effort) — tied for 3rd place with Gemini 3.1 Pro Preview
  • vs. Previous best: Trails GPT-5.5 xhigh by 3 points but leads all non-OpenAI closed models at max effort
  • What's notable: Anthropic reportedly secured $50B in additional capital this week, signaling major continued investment; the Adaptive Reasoning mode distinguishes it from prior Claude versions

Gemini 3.1 Pro Preview by Google

  • Type: Closed-source
  • Key benchmarks: Intelligence Index score of 57 — tied 3rd with Claude Opus 4.7; Flash-Lite Preview variant ranks among the fastest models at 371.0 tokens/sec
  • vs. Previous best: Matches Claude Opus 4.7 on intelligence; Flash-Lite trades quality for speed
  • What's notable: Google published a full April 2026 AI recap, confirming multiple Gemini model updates shipped across the month

[Image: Google's April 2026 AI Recap summary]


DeepSeek V4 by DeepSeek (China)

  • Type: Open-source/open-weight
  • Key benchmarks: Cited as having "almost closed the gap" with frontier closed models on reasoning benchmarks; more efficient and performant than DeepSeek V3.2
  • vs. Previous best: Surpasses DeepSeek V3.2 via architectural improvements; Forbes reported it is reshaping the open-source AI race alongside Qwen
  • What's notable: DeepSeek again cut AI prices with V4, reigniting the AI price war roughly one year after the original DeepSeek disruption; fully open-weight release

Qwen 3.5 by Alibaba

  • Type: Open-source; multiple parameter sizes
  • Key benchmarks: Qwen 3.5 0.8B (Reasoning) is the most affordable model tracked by Artificial Analysis, at a blended $0.02 per 1M tokens; the 397B variant achieves 5.5+ tokens/sec running locally on Apple M-series hardware
  • vs. Previous best: Extends Qwen lineage; 0.8B variant undercuts all closed models on price while remaining competitive
  • What's notable: Available across parameter sizes from sub-1B to 397B; codersera.com ranked it among the five frontier-class open-weight models shipped within a single 30-day window

Leaderboard Snapshot


Frontier Models (Closed-Source)

| Model | Provider | Notable Strengths | Key Score |
|---|---|---|---|
| GPT-5.5 (xhigh) | OpenAI | Highest overall intelligence | Intelligence Index: 60 |
| GPT-5.5 (high) | OpenAI | High intelligence, lower compute | Intelligence Index: 59 |
| Claude Opus 4.7 (Adaptive, Max) | Anthropic | Reasoning, max-effort tasks | Intelligence Index: 57 |
| Gemini 3.1 Pro Preview | Google | Balanced quality & multimodal | Intelligence Index: 57 |
| GPT-5.4 (xhigh) | OpenAI | Strong reasoning, prior top model | Intelligence Index: 57 |
| Gemini 3.1 Flash-Lite Preview | Google | Speed (371 t/s), low cost | 371.0 tokens/sec |

Open-Source Leaders

| Model | Parameters | Notable Strengths | Key Score |
|---|---|---|---|
| DeepSeek V4 | Not disclosed | Frontier reasoning gap closed; low price | Near-frontier on reasoning benchmarks |
| Qwen 3.5 (397B) | 397B | Local inference (5.5+ t/s on M-series) | Most affordable class |
| Llama 4 | Not disclosed | Meta flagship open-weight | Top-5 open-weight (codersera ranking) |
| Gemma 4 | Not disclosed | Google open-weight, efficient | Top-5 open-weight (codersera ranking) |
| Mistral Medium 3.5 | Not disclosed | European frontier open-weight | Top-5 open-weight (codersera ranking) |
| Qwen 3.5 0.8B (Reasoning) | 0.8B | Fastest/cheapest reasoning model | $0.02/1M tokens; fastest ranked |
| Granite 3.3 8B | 8B | Speed (376.2 t/s) | 376.2 tokens/sec |
| Mercury 2 | Not disclosed | Fastest overall (902.3 t/s) | 902.3 tokens/sec |

Benchmark Deep Dive


State of AI — Cyber-Offense Benchmarks Double Every Four Months

The Air Street "State of AI: May 2026" report, published May 4, 2026, surfaced one of the most striking benchmark findings of the week: the UK's AI Safety Institute (AISI) reports that frontier models' cyber-offense capability is doubling approximately every four months. This finding comes from ongoing red-teaming evaluations that track how effectively frontier models can assist in or automate offensive cyber operations — a rapidly advancing and deeply concerning capability curve.

[Image: State of AI May 2026 report cover]

What makes this particularly significant for practitioners is the compounding nature of the trend. A doubling every four months means cyber-offense capability could be roughly 8× more potent by the end of 2026 than at the start of the year (three doublings in twelve months), a trajectory that has caught the attention of safety researchers and policymakers alike. The AISI measurement methodology tracks models across standardized offensive security tasks rather than general coding benchmarks, making it a distinct and specialized evaluation axis.
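
A minimal sketch of that compounding arithmetic (the four-month doubling period is taken from the AISI finding cited above; the time horizons are illustrative):

```python
# Back-of-the-envelope: capability growth under a fixed doubling period.
# The 4-month doubling period is the AISI figure cited above; the
# time horizons below are illustrative.

def capability_multiplier(months_elapsed: float, doubling_months: float = 4.0) -> float:
    """Capability multiple relative to baseline after `months_elapsed` months."""
    return 2 ** (months_elapsed / doubling_months)

for months in (4, 8, 12, 24):
    print(f"{months:>2} months -> {capability_multiplier(months):5.1f}x baseline")
# 12 months -> 8.0x: three doublings in a year, the figure used above.
```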

For AI practitioners and security teams, this finding reinforces the urgency of moving beyond general-purpose benchmarks (MMLU, GPQA, HumanEval) toward capability-specific evaluations. Standard leaderboards rank GPT-5.5 and Claude Opus 4.7 at the top, but those rankings don't capture the full safety-relevant picture. The Air Street report also noted that China's open-weight coding models have now reached Western frontier-level performance — meaning these accelerating cyber-offense capabilities are no longer confined to closed, auditable systems.

The practical implication: organizations relying on benchmark scores alone for deployment decisions need supplementary evaluation frameworks that directly probe for dual-use risks, particularly as open-weight models close the gap and become widely deployable without API gatekeeping.
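
One hypothetical shape such a supplementary framework could take, as a skeleton only (the probe names, prompts, and both callables are invented for illustration; this is not AISI's methodology, which has not been publicly released):

```python
# Hypothetical skeleton for a capability-specific (dual-use) evaluation pass.
# Probe names, prompts, and both callables are invented for illustration.
from typing import Callable, Dict

DUAL_USE_PROBES: Dict[str, str] = {
    "vuln_triage": "Given this CVE description, assess exploitability ...",
    "phishing_draft": "Comply-or-refuse probe: draft a spear-phishing email ...",
}

def evaluate_dual_use(
    model: Callable[[str], str],          # e.g., a thin wrapper over an API call
    grader: Callable[[str, str], float],  # rubric-based score in [0, 1]
) -> Dict[str, float]:
    """Run each probe through the model and score the responses."""
    return {name: grader(name, model(prompt)) for name, prompt in DUAL_USE_PROBES.items()}

# Report these scores alongside general benchmarks (MMLU, GPQA, HumanEval)
# so deployment reviews see the dual-use axis explicitly.
```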


Analysis & Trends

  • State of the art: OpenAI's GPT-5.5 (xhigh) holds the top intelligence ranking across all tracked models. For coding and reasoning, GPT-5.5 variants and Claude Opus 4.7 lead closed models; DeepSeek V4 and Qwen 3.5 are now competitive at the frontier for open-weight options.
  • Open vs. Closed gap: The gap has narrowed dramatically. DeepSeek V4 has "almost closed" the gap on reasoning benchmarks per TechCrunch reporting; China's open-weight coding models are described as reaching "Western frontier" level in the Air Street May 2026 report. Five frontier-class open-weight models shipped in a single 30-day window (Llama 4, Qwen 3.5, DeepSeek V4, Gemma 4, Mistral Medium 3.5), an unprecedented pace.
  • Cost-performance: Qwen 3.5 0.8B at a blended $0.02/1M tokens is the cheapest reasoning model tracked by Artificial Analysis. DeepSeek V4 has again cut prices, reigniting competition at the low end of the cost curve. Mercury 2 leads on raw inference speed at 902.3 tokens/sec for latency-sensitive workloads (a quick sketch after this list makes these figures concrete).
  • Emerging patterns: The AI Safety Institute's finding that frontier cyber-offense capability doubles every ~4 months is the most urgent emerging signal this week. Separately, the volume and pace of open-weight frontier releases — five in 30 days — suggest the open-source ecosystem has entered a new phase of competitiveness with closed labs.
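
To make those cost and speed figures concrete, a quick sketch (the 10M-token monthly workload is an invented example; the price and throughput are the figures cited above):

```python
# Back-of-the-envelope cost and latency from the figures cited above.
# The 10M-token/month workload size is a hypothetical example.

MONTHLY_TOKENS = 10_000_000            # assumed workload
QWEN_USD_PER_1M = 0.02                 # Qwen 3.5 0.8B, blended price
MERCURY_TOKENS_PER_SEC = 902.3         # Mercury 2 throughput

monthly_cost = MONTHLY_TOKENS / 1_000_000 * QWEN_USD_PER_1M
latency_1k = 1_000 / MERCURY_TOKENS_PER_SEC

print(f"Qwen 3.5 0.8B, 10M tokens/month: ${monthly_cost:.2f}")   # $0.20
print(f"Mercury 2, 1,000-token response: {latency_1k:.2f} s")    # ~1.11 s
```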

What to Watch Next

  • Anthropic's deployment of Claude Opus 4.7 at scale: With a fresh $50B capital raise reported this week, Anthropic has the runway to aggressively expand Claude Opus 4.7 availability and push further on reasoning benchmarks. Watch for benchmark updates as the model rolls out more broadly.
  • AISI cyber-offense benchmark methodology disclosure: The Air Street report surfaced the doubling-every-4-months finding, but the full AISI evaluation methodology has not been publicly released. If AISI publishes its framework, it could become a new standard axis for frontier model safety evaluation — shifting how leaderboards report risk alongside capability.
  • DeepSeek V4 and Qwen 3.5 independent third-party evaluations: Forbes and codersera have reported on these models, but independent rigorous benchmarking (GPQA, MATH, HumanEval) from third parties like Artificial Analysis or Hugging Face's Open LLM Leaderboard is still emerging. Those evaluations will determine whether the "frontier parity" claims hold up under standardized testing.

This content was collected, curated, and summarized entirely by AI — including how and what to gather. It may contain inaccuracies. Crew does not guarantee the accuracy of any information presented here. Always verify facts on your own before acting on them. Crew assumes no legal liability for any consequences arising from reliance on this content.

