AI Benchmarks & Leaderboard — 2026-04-14
On April 8, Meta debuted Muse Spark, the first major model from its newly formed Superintelligence Labs under chief AI officer Alexandr Wang; it outperforms previous Meta models but lags leading competitors on coding benchmarks. The Stanford 2026 AI Index, published this week, offers a sweeping structural analysis of AI's accelerating pace, noting that benchmarks are increasingly struggling to keep up with model capabilities. According to Artificial Analysis, Gemini 3.1 Pro Preview and GPT-5.4 now share the top spot on the Intelligence Index.
New Model Releases & Updates
Muse Spark by Meta
- Type: Closed-source; first major LLM from Meta Superintelligence Labs, led by Alexandr Wang
- Key benchmarks: Outperforms Meta's prior models; trails leading competitors specifically on coding ability (exact benchmark numbers not disclosed in available reporting)
- vs. Previous best: Scores an Intelligence Index of 52 on Artificial Analysis — placing it 5th overall, behind Gemini 3.1 Pro Preview (57), GPT-5.4 (57), GPT-5.3 Codex (54), and Claude Opus 4.6 Adaptive Reasoning Max (53)
- What's notable: Muse Spark is the first high-profile output from Meta's Superintelligence Labs, the team assembled after Meta brought in Scale AI founder Alexandr Wang through a reported $14 billion deal. Despite strong general capability gains, independent reviewers note that coding benchmarks remain a weak spot relative to OpenAI's and Google's frontier offerings. Meta has also signaled it will open-source versions of its next models.

Leaderboard Snapshot
Frontier Models (Closed-Source)
Based on Artificial Analysis Intelligence Index (higher = more capable):
| Model | Provider | Notable Strengths | Intelligence Index |
|---|---|---|---|
| Gemini 3.1 Pro Preview | Google | Top-tier reasoning, multimodal | 57 |
| GPT-5.4 (xhigh) | OpenAI | Top-tier general intelligence | 57 |
| GPT-5.3 Codex (xhigh) | OpenAI | Coding, technical tasks | 54 |
| Claude Opus 4.6 (Adaptive Reasoning, Max) | Anthropic | Reasoning, complex analysis | 53 |
| Muse Spark | Meta | General capability, multimodal | 52 |
Open-Source Leaders
| Model | Parameters | Notable Strengths | Standing |
|---|---|---|---|
| GLM-5 | Not disclosed | Competitive with frontier closed models | Top open-source tier |
| Qwen3.5 (397B) | 397B (MoE) | Local deployment, broad multilingual | Top open-source tier |
| Gemma 4 | Not disclosed | Google-backed, efficient | Leading open-source |
| Llama 4 | Not disclosed | Meta open release, multimodal | Leading open-source |
| Mistral Small 4 | 119B (MoE) | Fast inference, enterprise-ready | Competitive open-source |
Note: Specific benchmark scores for individual open-source models were not confirmed in freshly published sources this week; rankings are based on available comparative assessments.
Benchmark Deep Dive
Stanford 2026 AI Index: Benchmarks Can't Keep Pace With Model Progress
The Stanford 2026 AI Index, published this week and covered by both MIT Technology Review and IEEE Spectrum, delivers a data-rich picture of the AI landscape that has direct implications for how practitioners interpret leaderboard standings. The report's central finding: AI is advancing faster than our ability to measure it.
The Index highlights that many established benchmarks — including some long-standing academic tests — are becoming saturated. Frontier models are approaching or exceeding human-expert performance on evaluations that were considered highly challenging just two years ago. This means that raw benchmark scores, while useful, may be masking meaningful differences in real-world capability between top-tier models.
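To make the saturation point concrete, here is a minimal illustration (the accuracy figures are made up for the example, not taken from the Index): when scores sit near a benchmark's ceiling, a small headline gap can correspond to a large difference in how often a model actually fails.

```python
# Illustrative only: compare residual error rates rather than raw scores.
def error_ratio(acc_a: float, acc_b: float) -> float:
    """How many times more mistakes the weaker model makes than the stronger one."""
    return (1 - acc_b) / (1 - acc_a)

# On a saturated benchmark, two models might score 99% vs. 97% -- only 2 points apart:
print(error_ratio(0.99, 0.97))  # 3.0 -> the second model makes 3x as many errors

# The same 3x error gap on a harder, unsaturated benchmark shows up as a 20-point spread:
print(error_ratio(0.90, 0.70))  # 3.0
```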
On the infrastructure side, the Index documents the continued explosion in compute and training costs at the frontier. This has a direct leaderboard implication: the gap between companies with massive capital and those without continues to widen when it comes to pushing the absolute state of the art. Meanwhile, the report notes that public trust in AI systems remains mixed, raising questions about whether capability benchmarks alone are the right north star for the field.
For practitioners, the key takeaway is to treat leaderboard scores — especially on older benchmarks like MMLU — with increasing skepticism. Task-specific and agentic evaluations (such as SWE-bench for coding agents, or ARC-AGI-2 for reasoning) are becoming more diagnostic of real-world performance differentials between models.
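For teams that want to act on this, a lightweight option is to score candidate models on a handful of checks drawn from your own workload rather than relying on headline numbers. The sketch below is purely hypothetical: `run_model`, `my_model`, and the toy tasks are placeholders for whatever API and test cases you actually use.

```python
from typing import Callable

# A task pairs a prompt with a checker that validates the model's output.
Task = tuple[str, Callable[[str], bool]]

def evaluate(run_model: Callable[[str], str], tasks: list[Task]) -> float:
    """Return the fraction of tasks whose output passes its checker."""
    passed = sum(1 for prompt, check in tasks if check(run_model(prompt)))
    return passed / len(tasks)

# Two toy tasks with deterministic checkers (replace with cases from your own workload).
tasks: list[Task] = [
    ("Write a Python expression that sums a list named xs.", lambda out: "sum(" in out),
    ("What is 17 * 23? Reply with the number only.", lambda out: out.strip() == "391"),
]

# `my_model` would wrap whichever provider or SDK you evaluate:
# score = evaluate(my_model, tasks)
```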

Analysis & Trends
- State of the art: Gemini 3.1 Pro Preview and GPT-5.4 are tied at the top of the Artificial Analysis Intelligence Index (score: 57). GPT-5.3 Codex leads coding-specific evaluations among closed models. Claude Opus 4.6 remains the top choice for complex multi-step reasoning tasks at max effort.
- Open vs. closed gap: Qwen3.5 (Alibaba) and GLM-5 (Zhipu AI) are increasingly cited as the most competitive open-weight models relative to closed-source frontier offerings. The 397B Qwen3.5 MoE reportedly runs at 5.5+ tokens/sec on consumer hardware (a MacBook), suggesting that the open-source deployment story is strengthening even as raw capability gaps persist at the very top.
- Cost-performance: Mercury 2 leads on output speed at 865.4 tokens/second on Artificial Analysis, followed by IBM's Granite 4.0 H Small (394.5 t/s). On the affordability axis, Qwen3.5 0.8B leads at $0.02 per 1M tokens (blended); a quick back-of-envelope sketch follows this list. Meta's announcement that it will open-source future models could further shift cost-performance calculations for enterprise buyers.
- Emerging patterns: Chinese open-source model families (Qwen, GLM) continue to close the gap with Western alternatives at a pace that has surprised even close observers. The Stanford AI Index reinforces that benchmark saturation is now a structural issue, not just a temporary measurement gap, driving a shift toward more complex agentic and real-world task evaluations.
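As a rough sanity check on what those throughput and price figures mean in practice, here is a quick back-of-envelope sketch; the response and request sizes are assumptions for illustration, not benchmark values.

```python
# Back-of-envelope helpers: latency from throughput, and cost from a blended $/1M-token rate.
def seconds_per_response(tokens: int, tokens_per_sec: float) -> float:
    return tokens / tokens_per_sec

def cost_usd(tokens: int, usd_per_million_tokens: float) -> float:
    return tokens * usd_per_million_tokens / 1_000_000

# A 500-token response from the 397B Qwen3.5 MoE at ~5.5 tok/s on a MacBook:
print(f"{seconds_per_response(500, 5.5):.0f} s")        # ~91 s

# The same response at Mercury 2's reported 865.4 tok/s:
print(f"{seconds_per_response(500, 865.4):.1f} s")      # ~0.6 s

# One million requests of ~1,000 blended tokens each at Qwen3.5 0.8B's $0.02 per 1M tokens:
print(f"${cost_usd(1_000_000 * 1_000, 0.02):,.0f}")     # ~$20
```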
What to Watch Next
- Meta's open-source model release timeline: Following Muse Spark's closed debut, Meta has signaled it will release open-source versions of upcoming models. The specs and benchmark performance of these open releases could substantially shift the open-source leaderboard.
- Anthropic's Mythos model: Anthropic is reported to be testing a model internally referred to as "Mythos," described as representing a "step change in capabilities." No public release date has been confirmed, but its emergence could shake up the frontier leaderboard if it significantly outperforms Claude Opus 4.6.
- Benchmark reform momentum post-Stanford AI Index: With the 2026 Stanford AI Index now public and widely cited, expect increased attention on next-generation evaluation frameworks — particularly agentic benchmarks (SWE-bench, GAIA) and reasoning-under-uncertainty tests (ARC-AGI-2) — as the community moves away from saturated academic benchmarks.