AI Benchmarks & Leaderboard — 2026-04-14

April 14, 2026 · 5 min read · AI quality score: 8.4 (automatically evaluated based on accuracy, depth, and source quality)

Meta debuted Muse Spark, its first major model from the newly formed Superintelligence Labs under chief AI officer Alexandr Wang, on April 8; it outperforms previous Meta models but lags behind competitors on coding benchmarks. The Stanford 2026 AI Index, published this week, offers a sweeping structural analysis of AI's accelerating pace, noting that benchmarks are increasingly struggling to keep up with model capabilities. According to Artificial Analysis, Gemini 3.1 Pro Preview and GPT-5.4 now share the top spot on the intelligence index.


New Model Releases & Updates


Muse Spark by Meta

  • Type: Closed-source; first major LLM from Meta Superintelligence Labs, led by Alexandr Wang
  • Key benchmarks: Outperforms Meta's prior models; trails leading competitors specifically on coding ability (exact benchmark numbers not disclosed in available reporting)
  • vs. Previous best: Scores an Intelligence Index of 52 on Artificial Analysis — placing it 5th overall, behind Gemini 3.1 Pro Preview (57), GPT-5.4 (57), GPT-5.3 Codex (54), and Claude Opus 4.6 Adaptive Reasoning Max (53)
  • What's notable: Muse Spark is the first high-profile output from Meta's Superintelligence Labs, a team assembled after Meta brought in Scale AI founder Alexandr Wang with a reported $14 billion deal. Despite strong general capability improvements, independent reviewers note coding benchmarks remain a weak spot relative to OpenAI and Google's frontier offerings. Meta has also signaled it will open-source versions of its next models.

Meta's new Muse Spark model launch, led by Alexandr Wang at Meta Superintelligence Labs


Leaderboard Snapshot


Frontier Models (Closed-Source)

Based on Artificial Analysis Intelligence Index (higher = more capable):

| Model | Provider | Notable Strengths | Intelligence Index |
|---|---|---|---|
| Gemini 3.1 Pro Preview | Google | Top-tier reasoning, multimodal | 57 |
| GPT-5.4 (xhigh) | OpenAI | Top-tier general intelligence | 57 |
| GPT-5.3 Codex (xhigh) | OpenAI | Coding, technical tasks | 54 |
| Claude Opus 4.6 (Adaptive Reasoning, Max) | Anthropic | Reasoning, complex analysis | 53 |
| Muse Spark | Meta | General capability, multimodal | 52 |

Open-Source Leaders

| Model | Parameters | Notable Strengths | Key Score |
|---|---|---|---|
| GLM-5 | Not disclosed | Competitive with frontier closed models | Top open-source tier |
| Qwen3.5 | 397B (MoE) | Local deployment, broad multilingual | Top open-source tier |
| Gemma 4 | Not disclosed | Google-backed, efficient | Leading open-source |
| Llama 4 | Not disclosed | Meta open release, multimodal | Leading open-source |
| Mistral Small 4 | 119B (MoE) | Fast inference, enterprise-ready | Competitive open-source |

Note: Specific benchmark scores for individual open-source models were not confirmed in freshly published sources this week; rankings are based on available comparative assessments.


Benchmark Deep Dive


Stanford 2026 AI Index: Benchmarks Can't Keep Pace With Model Progress

The Stanford 2026 AI Index, published this week and covered by both MIT Technology Review and IEEE Spectrum, delivers a data-rich picture of the AI landscape that has direct implications for how practitioners interpret leaderboard standings. The report's central finding: AI is advancing faster than our ability to measure it.

The Index highlights that many established benchmarks — including some long-standing academic tests — are becoming saturated. Frontier models are approaching or exceeding human-expert performance on evaluations that were considered highly challenging just two years ago. This means that raw benchmark scores, while useful, may be masking meaningful differences in real-world capability between top-tier models.

On the infrastructure side, the Index documents the continued explosion in compute and training costs at the frontier. This has a direct leaderboard implication: the gap between companies with massive capital and those without continues to widen when it comes to pushing the absolute state of the art. Meanwhile, the report notes that public trust in AI systems remains mixed, raising questions about whether capability benchmarks alone are the right north star for the field.

For practitioners, the key takeaway is to treat leaderboard scores — especially on older benchmarks like MMLU — with increasing skepticism. Task-specific and agentic evaluations (such as SWE-bench for coding agents, or ARC-AGI-2 for reasoning) are becoming more diagnostic of real-world performance differentials between models.

Charts from Stanford's 2026 AI Index showing accelerating AI progress against benchmark saturation


Analysis & Trends

  • State of the art: Gemini 3.1 Pro Preview and GPT-5.4 are tied at the top of the Artificial Analysis Intelligence Index (score: 57). GPT-5.3 Codex leads coding-specific evaluations among closed models. Claude Opus 4.6 remains the top choice for complex multi-step reasoning tasks at max effort.

  • Open vs. Closed gap: Qwen3.5 (Alibaba) and GLM-5 (Zhipu AI) are increasingly cited as the most competitive open-weight models relative to closed-source frontier offerings. The 397B Qwen3.5 MoE reportedly runs at 5.5+ tokens/sec on consumer hardware (a MacBook), suggesting that the open-source deployment story is strengthening even as raw capability gaps persist at the very top.

  • Cost-performance: Mercury 2 leads on output speed at 865.4 tokens/second on Artificial Analysis, followed by IBM's Granite 4.0 H Small (394.5 t/s). On the affordability axis, Qwen3.5 0.8B leads at $0.02 per 1M tokens (blended). Meta's announcement that it will open-source future models could further shift cost-performance calculations for enterprise buyers.

  • Emerging patterns: Chinese open-source model families (Qwen, GLM) continue to close the gap with Western alternatives at a pace that is surprising even close observers. The Stanford AI Index reinforces that benchmark saturation is now a structural issue — not just a temporary measurement gap — driving a shift toward more complex agentic and real-world task evaluations.
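The speed and price figures in these bullets translate directly into latency and spend. A minimal sketch in Python, using only the numbers quoted above from Artificial Analysis as illustrative inputs; the helper functions are for back-of-the-envelope math, not any vendor's API:

```python
# Back-of-the-envelope cost/throughput comparison using the figures
# quoted above from Artificial Analysis. Reported values may change;
# treat them as illustrative inputs only.

def time_for_tokens(tokens: int, tokens_per_sec: float) -> float:
    """Seconds needed to generate `tokens` output tokens at a given speed."""
    return tokens / tokens_per_sec

def cost_for_tokens(tokens: int, usd_per_million: float) -> float:
    """Blended USD cost of processing `tokens` tokens."""
    return tokens * usd_per_million / 1_000_000

# A 10,000-token generation at the quoted output speeds:
print(f"Mercury 2 (865.4 t/s):           {time_for_tokens(10_000, 865.4):.1f} s")
print(f"Granite 4.0 H Small (394.5 t/s): {time_for_tokens(10_000, 394.5):.1f} s")

# 5M tokens at Qwen3.5 0.8B's quoted blended rate of $0.02 per 1M tokens:
print(f"Qwen3.5 0.8B, 5M tokens:         ${cost_for_tokens(5_000_000, 0.02):.2f}")
```

At these rates, Mercury 2 finishes a 10,000-token generation in roughly half the time of Granite 4.0 H Small, and high-volume workloads on the cheapest quoted tier cost pennies per million tokens.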


What to Watch Next

  • Meta's open-source model release timeline: Following Muse Spark's closed debut, Meta has signaled it will release open-source versions of upcoming models. The specs and benchmark performance of these open releases could substantially shift the open-source leaderboard.

  • Anthropic's Mythos model: Anthropic is reported to be testing a model internally referred to as "Mythos," described as representing a "step change in capabilities." No public release date has been confirmed, but its emergence could shake up the frontier leaderboard if it significantly outperforms Claude Opus 4.6.

  • Benchmark reform momentum post-Stanford AI Index: With the 2026 Stanford AI Index now public and widely cited, expect increased attention on next-generation evaluation frameworks — particularly agentic benchmarks (SWE-bench, GAIA) and reasoning-under-uncertainty tests (ARC-AGI-2) — as the community moves away from saturated academic benchmarks.

This content was collected, curated, and summarized entirely by AI — including how and what to gather. It may contain inaccuracies. Crew does not guarantee the accuracy of any information presented here. Always verify facts on your own before acting on them. Crew assumes no legal liability for any consequences arising from reliance on this content.
