AI Benchmarks & Leaderboard — 2026-04-09
Meta announced the Muse Spark AI model family this week, while MLCommons released MLPerf Inference v6.0, its most significant benchmark update to date. The frontier leaderboard remains tightly contested, with Gemini 3.1 Pro Preview and GPT-5.4 trading blows at the top of intelligence rankings, while open-source models from Alibaba (Qwen) and Meta continue closing the gap on their closed-source rivals.
New Model Releases & Updates
Muse Spark by Meta
- Type: Closed-source (part of the new "Muse" family); parameter count not publicly disclosed
- Key benchmarks: Not yet fully disclosed at time of publication
- vs. Previous best: Positioned as a new entrant in Meta's frontier model lineup, announced by Mark Zuckerberg alongside the broader Muse model family
- What's notable: Meta is launching Muse Spark as the first model in its new Muse family — a distinct product line from Llama. Simultaneously, separate reporting confirms Meta is also planning to open-source versions of its upcoming next-generation models, with some components kept private for safety and competitive reasons.

GPT-5.4 by OpenAI
- Type: Closed-source
- Key benchmarks: Ranked among the highest intelligence models alongside Gemini 3.1 Pro Preview, per Artificial Analysis leaderboard
- vs. Previous best: Shares the top intelligence tier with Gemini 3.1 Pro Preview; GPT-5.3 Codex (xhigh) ranked just below
- What's notable: The xhigh reasoning tier remains competitive at the frontier; also recognized in the April 2026 model rankings for strong SWE-bench and ARC-AGI-2 scores
Gemini 3.1 Pro Preview by Google
- Type: Closed-source
- Key benchmarks: Co-leads intelligence rankings alongside GPT-5.4 (xhigh) per Artificial Analysis
- vs. Previous best: Matches or exceeds GPT-5.4 on several intelligence measures, though Mercury 2 and Granite 4.0 H Small surpass it on raw output speed
- What's notable: Occupies the highest intelligence tier; identified in the April 2026 model rankings as a top performer on SWE-bench and ARC-AGI-2
Claude Opus 4.6 by Anthropic
- Type: Closed-source
- Key benchmarks: Ranked third on intelligence behind Gemini 3.1 Pro Preview and GPT-5.4; max tier listed alongside GPT-5.3 Codex (xhigh)
- vs. Previous best: Closely trails the top two; noted for strong expert-level evaluation performance in a multi-model consensus study
- What's notable: A multi-model LLM consensus system was shown this week to match or outperform Claude Opus 4.6, GPT-5.4, and Gemini 3.1 Pro across 100 expert-level questions in finance, law, medicine, and technology — with no performance degradation
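The study's consensus mechanism is not detailed in the available reporting, but the simplest version of such a system is majority voting over independent model answers. A minimal sketch, assuming multiple-choice answers; the model names and votes below are purely illustrative:

```python
from collections import Counter

def consensus_answer(answers: list[str]) -> str:
    """Return the answer given by the most models.

    Ties break in favour of the answer that appeared first, which
    Counter.most_common preserves via insertion order (Python 3.7+).
    """
    if not answers:
        raise ValueError("need at least one model answer")
    return Counter(answers).most_common(1)[0][0]

# Hypothetical answers from three frontier models to one expert question:
votes = {"GPT-5.4": "B", "Claude Opus 4.6": "B", "Gemini 3.1 Pro": "A"}
print(consensus_answer(list(votes.values())))  # majority answer: "B"
```

Real routing systems typically go further (confidence weighting, per-domain routing), but even plain voting illustrates why an ensemble can avoid any single model's blind spots.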
Open-Source AI Landscape: Gemma 4, Qwen 3.6 Plus, Llama 4, Mistral Small 4 by Various
- Type: Open-source (multiple providers)
- Key benchmarks: Detailed scores not individually disclosed in available sources this week
- vs. Previous best: Collectively mapped as the leading open-source tier in April 2026; Qwen described as "the most-downloaded AI model family on Earth"
- What's notable: The open-source ecosystem saw a wave of new releases including Gemma 4 (Google), Qwen 3.6 Plus (Alibaba), Llama 4 (Meta), Mistral Small 4, and GLM-5 in the March–April 2026 window. Alibaba's Qwen 3.5 (397B) was noted running at 5.5+ tokens/sec on a MacBook.
Leaderboard Snapshot
Frontier Models (Closed-Source)
| Model | Provider | Notable Strengths | Key Score |
|---|---|---|---|
| Gemini 3.1 Pro Preview | Google | Top intelligence tier, strong reasoning & coding | Highest intelligence (Artificial Analysis) |
| GPT-5.4 (xhigh) | OpenAI | Top intelligence tier, ARC-AGI-2, SWE-bench | Highest intelligence (Artificial Analysis) |
| GPT-5.3 Codex (xhigh) | OpenAI | Code generation, reasoning | 2nd tier intelligence |
| Claude Opus 4.6 (max) | Anthropic | Expert-level Q&A, law/medicine/finance | 2nd tier intelligence |
| Mercury 2 | (undisclosed) | Output speed | 906 tokens/sec (fastest) |
Open-Source Leaders
| Model | Parameters | Notable Strengths | Key Score |
|---|---|---|---|
| Qwen 3.6 Plus | Not disclosed | Multilingual, efficiency, most-downloaded family | Top open-source tier (Apr 2026) |
| Llama 4 | Not disclosed | General purpose, Meta open-source | Top open-source tier (Apr 2026) |
| Gemma 4 | Not disclosed | Google open-source, reasoning | Top open-source tier (Apr 2026) |
| Mistral Small 4 | Not disclosed | Efficient, fast inference | Top open-source tier (Apr 2026) |
| GLM-5 | Not disclosed | Chinese frontier open-source | Top open-source tier (Apr 2026) |
| Granite 4.0 H Small | Small | Output speed | 414 tokens/sec (2nd fastest) |
Benchmark Deep Dive
MLPerf Inference v6.0 — The Most Significant Hardware Benchmark Update Yet
This week, MLCommons released MLPerf Inference v6.0, described as "the most significant benchmark update to date." The new suite adds several tests absent from prior rounds: text-to-video generation, the GPT-OSS 120B open-source large language model, DLRMv3 (an updated recommendation model), vision-language models, and the YOLOv11 object detection model.

The hardware angle generated immediate headlines: AMD's latest accelerator "finally beat" Nvidia's B300 in this round — though, as Forbes noted, the victory was narrow and limited to a smaller model that "few still run." The result is significant not because it represents AMD's wholesale defeat of Nvidia in AI inference, but because it marks the first time AMD has crossed that threshold at all — a meaningful milestone for GPU competition in AI.
For practitioners, the expanded v6.0 benchmark suite is directly relevant. The inclusion of text-to-video models and vision-language benchmarks signals that MLCommons is tracking a broader frontier of production workloads, not just language-only LLM inference. The addition of GPT-OSS 120B means that, for the first time, large open-source models are being benchmarked at a scale comparable to frontier closed models in a standardized hardware context. Organizations selecting inference hardware for multimodal or large-scale open-source deployments will want to consult the v6.0 results carefully before procurement decisions.
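Comparing submissions programmatically is straightforward once published results are exported to rows. The sketch below is illustrative only: the accelerator names, field names, and throughput numbers are hypothetical placeholders, not real v6.0 submissions, which follow MLCommons' own result schema.

```python
# Hypothetical result rows; real MLPerf data uses MLCommons' published schema.
results = [
    {"accelerator": "AMD accelerator (hypothetical)", "benchmark": "llm-small", "samples_per_sec": 101.0},
    {"accelerator": "NVIDIA B300 (hypothetical row)", "benchmark": "llm-small", "samples_per_sec": 100.0},
    {"accelerator": "NVIDIA B300 (hypothetical row)", "benchmark": "text-to-video", "samples_per_sec": 12.5},
]

def best_per_accelerator(rows, benchmark):
    """Highest reported throughput per accelerator for a single benchmark."""
    best = {}
    for row in rows:
        if row["benchmark"] != benchmark:
            continue
        name = row["accelerator"]
        best[name] = max(best.get(name, 0.0), row["samples_per_sec"])
    return best

print(best_per_accelerator(results, "llm-small"))
```

Filtering by the benchmarks that match your actual workload (e.g. vision-language vs. text-only) is the step that makes these tables useful for procurement, rather than reading a single headline number.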
Analysis & Trends
- State of the art: Gemini 3.1 Pro Preview and GPT-5.4 (xhigh) share the top intelligence tier for closed-source models. For coding and reasoning specifically, GPT-5.3 Codex remains highly competitive. On output speed, Mercury 2 (906 tokens/sec) and Granite 4.0 H Small (414 tokens/sec) are the fastest available.
- Open vs. Closed gap: The April 2026 open-source cohort — Qwen 3.6 Plus, Llama 4, Gemma 4, Mistral Small 4, GLM-5 — is closing the gap meaningfully. Alibaba's Qwen family is now the most-downloaded on the planet, and its 397B model runs locally on consumer hardware at 5.5+ tokens/sec. A separate analysis notes that Chinese open-source AI is "catching up faster than you think" compared to Western alternatives.
- Cost-performance: Mercury 2 stands out at 906 tokens/sec for speed-sensitive workloads. Granite 4.0 H Small and Qwen 3.5 0.8B offer strong throughput at smaller model sizes. No major pricing changes were announced this week at the frontier tier.
- Emerging patterns: MLPerf v6.0's inclusion of text-to-video and vision-language benchmarks reflects the industry's push beyond text-only evaluation. Meta's dual strategy — a commercial Muse family alongside open-source Llama-based releases — signals a bifurcation in how major labs are positioning their model portfolios. Multi-model consensus systems are also emerging as a new paradigm, with a Reuters-covered study showing that routing across GPT-5.4, Claude Opus 4.6, and Gemini 3.1 Pro can match or beat any single model on expert-level tasks.
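The throughput figures above translate directly into wall-clock generation time. A back-of-envelope sketch (steady-state decode rate only; it ignores time-to-first-token, batching, and network overhead):

```python
def generation_seconds(output_tokens: int, tokens_per_sec: float) -> float:
    """Wall-clock seconds to stream output_tokens at a steady decode rate."""
    return output_tokens / tokens_per_sec

# Speeds taken from the leaderboard snapshot above:
for model, tps in [("Mercury 2", 906.0), ("Granite 4.0 H Small", 414.0)]:
    secs = generation_seconds(2000, tps)
    print(f"{model}: {secs:.1f} s for a 2,000-token response")
# Mercury 2: 2.2 s; Granite 4.0 H Small: 4.8 s
```

For interactive products, that two-second difference per long response is often the deciding factor, not a few points of benchmark intelligence.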
What to Watch Next
- Meta's next-gen open-source model release: Multiple sources confirm Meta is preparing to release open-source versions of its upcoming models (separate from Muse Spark), with Alexandr Wang at the helm. Key details — parameter counts, benchmarks, and which components will remain proprietary — remain to be disclosed.
- AMD vs. Nvidia in MLPerf follow-up: AMD's narrow edge over Nvidia's B300 in MLPerf v6.0 (on a limited workload) will likely intensify scrutiny of next-round results. Watch for both companies to respond with updated hardware and software submissions as the AI inference hardware race heats up.
- The measurement problem — obsolescence of classic benchmarks: A new analysis from Understanding AI argues that the most famous benchmark chart in AI "might be obsolete soon," as saturation on MMLU, GPQA, and similar tests is forcing a reckoning over what meaningful evaluation looks like at frontier scale. This debate will shape how the next generation of leaderboards is constructed.
This content was collected, curated, and summarized entirely by AI — including how and what to gather. It may contain inaccuracies. Crew does not guarantee the accuracy of any information presented here. Always verify facts on your own before acting on them. Crew assumes no legal liability for any consequences arising from reliance on this content.