AI Benchmarks & Leaderboard — 2026-03-28
The biggest story of the week is Anthropic's accidental data leak revealing the existence of "Mythos," a new model described as a "step change" in AI capabilities. Meanwhile, benchmark saturation continues to reshape how the field evaluates frontier models, with classic tests like MMLU and HumanEval no longer differentiating top performers. Independent analysts confirm Gemini 3.1 Pro and GPT-5.4 hold the top intelligence rankings across major evaluation platforms.
New Model Releases
Anthropic "Mythos" (Unreleased)
- Type: Closed-source; parameter count unknown
- Key benchmarks: Not publicly disclosed
- vs. Previous best: Described internally as a "step change" in performance vs. current Claude lineup
- What's notable: Revealed via an accidental data leak. Anthropic confirmed it is actively testing the model and described the capability jump as significant. No benchmark numbers have been released publicly, and no release date has been announced.

Leaderboard Changes
Chatbot Arena (LMSYS)
The Chatbot Arena leaderboard page was accessible and remains active as the official AI ranking and LLM leaderboard, but full tabular ELO data could not be extracted this period. Verify current ELO scores directly at arena.ai.
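For background on what these numbers mean: arena-style leaderboards derive ratings from pairwise human preference votes. The sketch below shows a classic online Elo update; real leaderboards typically fit Bradley-Terry-style models over all battles in batch, so treat this as an illustration of the mechanism, not any platform's exact pipeline. The K-factor and starting ratings are assumptions.

```python
# Illustrative Elo update for pairwise model "battles" (not the exact
# pipeline of any specific leaderboard; K and seed ratings are assumptions).

def expected_score(r_a: float, r_b: float) -> float:
    """Probability that model A beats model B under the Elo model."""
    return 1.0 / (1.0 + 10 ** ((r_b - r_a) / 400))

def elo_update(r_a: float, r_b: float, a_won: bool, k: float = 32.0):
    """Return updated (r_a, r_b) after one battle."""
    e_a = expected_score(r_a, r_b)
    s_a = 1.0 if a_won else 0.0
    delta = k * (s_a - e_a)
    return r_a + delta, r_b - delta

# Two hypothetical models starting at 1000; A wins one human vote.
ra, rb = 1000.0, 1000.0
ra, rb = elo_update(ra, rb, a_won=True)
print(round(ra, 1), round(rb, 1))  # 1016.0 984.0
```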
Intelligence Rankings (Artificial Analysis)
According to Artificial Analysis's live model comparison page, as of late March 2026:
- Top intelligence tier: Gemini 3.1 Pro Preview and GPT-5.4 (xhigh) are ranked highest
- Second tier: GPT-5.3 Codex (xhigh) and Claude Opus 4.6 (max)
- Speed leaders: Mercury 2 (764 tokens/sec) and NVIDIA Nemotron 3 Super (399 tokens/sec)
Open Source Rankings
No significant open-source leaderboard movements with verified post-2026-03-21 data were available in this reporting period. The Hugging Face Open LLM Leaderboard page loaded but detailed rankings could not be extracted from the screenshot. Verify current standings directly at the source.
Benchmark Deep Dive
The Saturation Problem: Classic Benchmarks Are No Longer Useful
A detailed analysis published this week by lxt.ai confirms what many researchers have suspected: MMLU, HumanEval, and GSM8K are saturated, with all frontier models now scoring above 90% on them. These tests, once the primary differentiators of model capability, can no longer meaningfully separate performance at the frontier.
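To see why near-ceiling scores stop carrying signal, consider simple sampling noise: on a fixed-size test, the gap between two 90%+ scores is often smaller than the statistical uncertainty of the scores themselves. A back-of-envelope sketch (the question count is an assumption, and this ignores correlated errors between models):

```python
# Back-of-envelope: is a 1-point gap at the ceiling statistically meaningful?
# Treats each question as an independent Bernoulli trial -- a simplification.
import math

def score_stderr(accuracy: float, n_questions: int) -> float:
    """Binomial standard error of a benchmark accuracy estimate."""
    return math.sqrt(accuracy * (1 - accuracy) / n_questions)

n = 1000  # illustrative benchmark size
for acc_a, acc_b in [(0.95, 0.96), (0.60, 0.70)]:
    gap = acc_b - acc_a
    noise = math.hypot(score_stderr(acc_a, n), score_stderr(acc_b, n))
    print(f"{acc_a:.0%} vs {acc_b:.0%}: gap {gap:.1%}, "
          f"~95% noise band ±{1.96 * noise:.1%}")
# At the ceiling, a 1-point gap sits inside the ±1.8% noise band;
# at mid-range scores, a 10-point gap clears the band easily.
```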

What benchmarks do still have signal? According to a popular Reddit thread on r/LocalLLaMA (cross-referenced with the lxt.ai analysis), the following remain informative in 2026:
- ARC-AGI-2: Pure LLMs still score 0%. The best reasoning systems reach 54% at $30/task. Average humans score 60%. All four major labs now report ARC-AGI-2 on their model cards. ARC-AGI-3, with interactive environments, is reportedly coming later in 2026. (A cost-per-solve calculation follows this section's takeaway.)
- FrontierMath, GPQA, Humanity's Last Exam: Still differentiating frontier models, per lm-council.ai's benchmark comparison platform.
- SWE-bench: Remains a meaningful coding evaluation.
The takeaway: the AI benchmark ecosystem is undergoing a structural shift. As models saturate easy tests, the field is being forced toward harder, more expensive, and often more realistic evaluations. This transition is still incomplete, creating a period of measurement uncertainty at the frontier.
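To make the ARC-AGI-2 numbers above concrete: at $30 per task and a 54% solve rate, each solved task effectively costs about $56. A minimal sketch using only the reported figures:

```python
# Cost per *solved* ARC-AGI-2 task, from the reported figures above.
cost_per_task = 30.00   # $/task for the best reasoning systems (reported)
solve_rate = 0.54       # best reasoning systems (reported)
human_rate = 0.60       # average human (reported)

cost_per_solve = cost_per_task / solve_rate
print(f"${cost_per_solve:.2f} per solved task")  # ~$55.56

# Gap to average-human performance, in percentage points:
print(f"{(human_rate - solve_rate) * 100:.0f} pp below average humans")  # 6 pp
```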
Analysis
- Frontier models: Gemini 3.1 Pro Preview and GPT-5.4 lead on intelligence benchmarks, with Claude Opus 4.6 close behind. The unreleased Anthropic "Mythos" could reshuffle rankings once it launches, if its "step change" claim holds up under independent evaluation.
- Open vs. Closed gap: No fresh data from the open-source leaderboard was available this period. However, the broader trend from earlier in March remains: open-source models continue to trail the absolute frontier on complex reasoning tasks, though the gap on many practical tasks has narrowed significantly.
- Emerging trends: The AI agent reliability gap is a growing concern. Princeton researchers' new test battery (covered last week) highlights that most vendor benchmarks don't measure reliability, a critical issue as agents are deployed in production; one common way to quantify it is shown in the first sketch below.
- Cost efficiency: Speed leaders Mercury 2 (764 t/s) and NVIDIA Nemotron 3 Super (399 t/s) are pulling far ahead of intelligence-focused models on throughput, suggesting a clear bifurcation between "smartest" and "fastest/cheapest" tiers; this benefits production deployments where cost per token matters more than raw capability. The second sketch below translates these throughputs into wall-clock terms.
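On the reliability point: one common way to quantify agent reliability is pass^k, the probability that an agent succeeds on all k independent runs of the same task, as opposed to pass@k, success on at least one run. This is a generic metric and not necessarily what the Princeton battery measures; the per-run success rate below is hypothetical.

```python
# pass^k vs pass@k for a hypothetical agent with per-run success rate p.
# pass^k: succeeds on ALL k runs (reliability); pass@k: succeeds on AT LEAST one.

def pass_all_k(p: float, k: int) -> float:
    return p ** k

def pass_at_k(p: float, k: int) -> float:
    return 1 - (1 - p) ** k

p = 0.90  # hypothetical single-run success rate
for k in (1, 4, 8):
    print(f"k={k}: pass^k={pass_all_k(p, k):.2f}, pass@k={pass_at_k(p, k):.2f}")
# A "90% accurate" agent completes 8 consecutive runs only ~43% of the time,
# which is why single-run vendor numbers overstate production reliability.
```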
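On the cost-efficiency point: the reported throughputs translate directly into wall-clock differences for batch workloads. A quick sketch using the tokens/sec figures cited above (the 1M-token workload and the contrast model's throughput are assumptions for illustration):

```python
# Time to generate a fixed workload at the reported throughputs.
# Mercury 2 and Nemotron figures are from Artificial Analysis as cited above;
# the 1M-token workload and the 80 t/s contrast model are assumptions.
throughput_tps = {
    "Mercury 2": 764,
    "NVIDIA Nemotron 3 Super": 399,
    "typical frontier reasoning model (assumed)": 80,
}

workload_tokens = 1_000_000
for model, tps in throughput_tps.items():
    minutes = workload_tokens / tps / 60
    print(f"{model}: {minutes:,.1f} min for 1M output tokens")
# Mercury 2: ~21.8 min; Nemotron 3 Super: ~41.8 min; 80 t/s model: ~208 min.
```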
Note: Leaderboard screenshot data may be incomplete. Verify current ELO scores and rankings directly at the sources (arena.ai and Artificial Analysis) for the most accurate figures.
This content was collected, curated, and summarized entirely by AI — including how and what to gather. It may contain inaccuracies. Crew does not guarantee the accuracy of any information presented here. Always verify facts on your own before acting on them. Crew assumes no legal liability for any consequences arising from reliance on this content.