AI Benchmarks & Leaderboard — 2026-04-09


April 9, 2026 · 7 min read

Meta announced the Muse Spark AI model family this week, while MLCommons released its most significant MLPerf Inference v6.0 benchmark update to date. The frontier leaderboard remains tightly contested, with Gemini 3.1 Pro Preview and GPT-5.4 trading blows at the top of intelligence rankings, while open-source models from Alibaba's Qwen and Meta continue closing the gap on closed-source giants.



New Model Releases & Updates


Muse Spark by Meta

  • Type: Closed-source (part of the new "Muse" family); parameter count not publicly disclosed
  • Key benchmarks: Benchmark specifics not yet fully disclosed at time of publication
  • vs. Previous best: Positioned as a new entrant in Meta's frontier model lineup, announced by Mark Zuckerberg alongside the broader Muse model family
  • What's notable: Meta is launching Muse Spark as the first model in its new Muse family — a distinct product line from Llama. Simultaneously, separate reporting confirms Meta is also planning to open-source versions of its upcoming next-generation models, with some components kept private for safety and competitive reasons.

Mark Zuckerberg announces Meta's new Muse Spark AI model family (mashable.com)


GPT-5.4 by OpenAI

  • Type: Closed-source
  • Key benchmarks: Ranked among the highest intelligence models alongside Gemini 3.1 Pro Preview, per Artificial Analysis leaderboard
  • vs. Previous best: Shares the top intelligence tier with Gemini 3.1 Pro Preview; GPT-5.3 Codex (xhigh) ranked just below
  • What's notable: The xhigh reasoning tier remains competitive at the frontier; also recognized in the April 2026 model rankings for strong SWE-bench and ARC-AGI-2 scores

Gemini 3.1 Pro Preview by Google

  • Type: Closed-source
  • Key benchmarks: Co-leads intelligence rankings alongside GPT-5.4 (xhigh) per Artificial Analysis
  • vs. Previous best: Matches or exceeds GPT-5.4 on several intelligence measures; topped by Mercury 2 and Granite 4.0 H Small on raw output speed
  • What's notable: Occupies the highest intelligence tier; identified in the April 2026 model rankings as a top performer on SWE-bench and ARC-AGI-2

Claude Opus 4.6 by Anthropic

  • Type: Closed-source
  • Key benchmarks: Ranked third on intelligence behind Gemini 3.1 Pro Preview and GPT-5.4; max tier listed alongside GPT-5.3 Codex (xhigh)
  • vs. Previous best: Closely trails the top two; noted for strong expert-level evaluation performance in a multi-model consensus study
  • What's notable: A multi-model LLM consensus system was shown this week to match or outperform Claude Opus 4.6, GPT-5.4, and Gemini 3.1 Pro across 100 expert-level questions in finance, law, medicine, and technology — with no performance degradation
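The consensus study above reports routing a question to several frontier models and aggregating their answers. A minimal sketch of that idea is majority voting over model outputs; everything here — the model names as keys, the `ask_model` stub, and its canned answers — is a hypothetical illustration, not a real provider API.

```python
from collections import Counter

def ask_model(model: str, question: str) -> str:
    # Hypothetical stand-in for a real API call to each provider.
    canned = {"gpt-5.4": "B", "claude-opus-4.6": "B", "gemini-3.1-pro": "A"}
    return canned[model]

def consensus_answer(question: str, models: list[str]) -> str:
    """Return the majority answer across models; ties resolve to the answer seen first."""
    votes = Counter(ask_model(m, question) for m in models)
    return votes.most_common(1)[0][0]

print(consensus_answer("Which option is correct?",
                       ["gpt-5.4", "claude-opus-4.6", "gemini-3.1-pro"]))
# prints "B" — two of the three models agree
```

Production consensus systems typically add answer normalization and confidence weighting, but simple majority voting already shows why an ensemble can avoid any single model's idiosyncratic errors.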

Open-Source AI Landscape: Gemma 4, Qwen 3.6 Plus, Llama 4, Mistral Small 4 by Various

  • Type: Open-source (multiple providers)
  • Key benchmarks: Detailed scores not individually disclosed in available sources this week
  • vs. Previous best: Collectively mapped as the leading open-source tier in April 2026; Qwen described as "the most-downloaded AI model family on Earth"
  • What's notable: The open-source ecosystem saw a wave of new releases including Gemma 4 (Google), Qwen 3.6 Plus (Alibaba), Llama 4 (Meta), Mistral Small 4, and GLM-5 in the March–April 2026 window. Alibaba's Qwen 3.5 (397B) was noted running at 5.5+ tokens/sec on a MacBook.

April 2026 open-source AI model landscape overview


Leaderboard Snapshot


Frontier Models (Closed-Source)

| Model | Provider | Notable Strengths | Key Score |
| --- | --- | --- | --- |
| Gemini 3.1 Pro Preview | Google | Top intelligence tier, strong reasoning & coding | Highest intelligence (Artificial Analysis) |
| GPT-5.4 (xhigh) | OpenAI | Top intelligence tier, ARC-AGI-2, SWE-bench | Highest intelligence (Artificial Analysis) |
| GPT-5.3 Codex (xhigh) | OpenAI | Code generation, reasoning | 2nd tier intelligence |
| Claude Opus 4.6 (max) | Anthropic | Expert-level Q&A, law/medicine/finance | 2nd tier intelligence |
| Mercury 2 | (undisclosed) | Output speed | 906 tokens/sec (fastest) |

Open-Source Leaders

| Model | Parameters | Notable Strengths | Key Score |
| --- | --- | --- | --- |
| Qwen 3.6 Plus | Not disclosed | Multilingual, efficiency, most-downloaded family | Top open-source tier (Apr 2026) |
| Llama 4 | Not disclosed | General purpose, Meta open-source | Top open-source tier (Apr 2026) |
| Gemma 4 | Not disclosed | Google open-source, reasoning | Top open-source tier (Apr 2026) |
| Mistral Small 4 | Not disclosed | Efficient, fast inference | Top open-source tier (Apr 2026) |
| GLM-5 | Not disclosed | Chinese frontier open-source | Top open-source tier (Apr 2026) |
| Granite 4.0 H Small | Small | Output speed | 414 tokens/sec (2nd fastest) |

Benchmark Deep Dive


MLPerf Inference v6.0 — The Most Significant Hardware Benchmark Update Yet

This week, MLCommons released MLPerf Inference v6.0, described as "the most significant benchmark update to date." The new suite adds several tests absent from prior rounds: text-to-video generation, the open-source GPT-OSS 120B large language model, DLRMv3 (an updated recommendation model), vision-language models, and YOLOv11 object detection.

MLPerf Inference v6.0 benchmark results from MLCommons

The hardware angle generated immediate headlines: AMD's latest accelerator "finally beat" Nvidia's B300 in this round — though, as Forbes noted, the victory was narrow and limited to a smaller model that "few still run." The result is significant not because it represents AMD's wholesale defeat of Nvidia in AI inference, but because it marks the first time AMD has crossed that threshold at all — a meaningful milestone for GPU competition in AI.

For practitioners, the expanded v6.0 benchmark suite is directly relevant. The inclusion of text-to-video models and vision-language benchmarks signals that MLCommons is tracking a broader frontier of production workloads, not just language-only LLM inference. The addition of GPT-OSS 120B means that, for the first time, large open-source models are being benchmarked at a scale comparable to frontier closed models in a standardized hardware context. Organizations selecting inference hardware for multimodal or large-scale open-source deployments will want to consult the v6.0 results carefully before procurement decisions.
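A procurement comparison of the kind described above boils down to filtering published results by workload and ranking by throughput. The sketch below uses invented placeholder records — the accelerator names, figures, and tuple format are illustrative, not actual MLPerf v6.0 submissions or the MLCommons data schema.

```python
# Hypothetical benchmark records: (accelerator, workload, throughput in samples/sec).
# All figures are illustrative placeholders, not real MLPerf v6.0 submissions.
results = [
    ("Accelerator A", "gpt-oss-120b", 1850.0),
    ("Accelerator B", "gpt-oss-120b", 1790.0),
    ("Accelerator A", "text-to-video", 12.4),
    ("Accelerator B", "text-to-video", 15.1),
]

def rank_for_workload(records, workload):
    """Sort accelerators by throughput on a single workload, best first."""
    rows = [(acc, tput) for acc, wl, tput in records if wl == workload]
    return sorted(rows, key=lambda r: r[1], reverse=True)

for acc, tput in rank_for_workload(results, "gpt-oss-120b"):
    print(f"{acc}: {tput} samples/sec")
```

Note how the ranking can flip between workloads — exactly why the v6.0 results should be read per-workload rather than as a single winner-takes-all verdict.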



Analysis & Trends

  • State of the art: Gemini 3.1 Pro Preview and GPT-5.4 (xhigh) share the top intelligence tier for closed-source models. For coding and reasoning specifically, GPT-5.3 Codex remains highly competitive. On output speed, Mercury 2 (906 tokens/sec) and Granite 4.0 H Small (414 tokens/sec) are the fastest available.
  • Open vs. Closed gap: The April 2026 open-source cohort — Qwen 3.6 Plus, Llama 4, Gemma 4, Mistral Small 4, GLM-5 — is closing the gap meaningfully. Alibaba's Qwen family is now the most-downloaded on the planet, and its 397B model runs locally on consumer hardware at 5.5+ tokens/sec. A separate analysis notes that Chinese open-source AI is "catching up faster than you think" compared to Western alternatives.
  • Cost-performance: Mercury 2 stands out at 906 tokens/sec for speed-sensitive workloads. Granite 4.0 H Small and Qwen 3.5 0.8B offer strong throughput at smaller model sizes. No major pricing changes were announced this week at the frontier tier.
  • Emerging patterns: MLPerf v6.0's inclusion of text-to-video and vision-language benchmarks reflects the industry's push beyond text-only evaluation. Meta's dual strategy — a commercial Muse family alongside open-source Llama-based releases — signals a bifurcation in how major labs are positioning their model portfolios. Multi-model consensus systems are also emerging as a new paradigm, with a Reuters-covered study showing that routing across GPT-5.4, Claude Opus 4.6, and Gemini 3.1 Pro can match or beat any single model on expert-level tasks.
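The output-speed figures in the snapshot above translate directly into response latency. A quick back-of-envelope sketch (the 1,000-token response length is an assumed example):

```python
def generation_seconds(tokens: int, tokens_per_sec: float) -> float:
    """Time to stream a completion of `tokens` length at a given decode throughput."""
    return tokens / tokens_per_sec

# Reported output speeds from this week's leaderboard snapshot.
speeds = {"Mercury 2": 906, "Granite 4.0 H Small": 414}

for model, tps in speeds.items():
    print(f"{model}: {generation_seconds(1000, tps):.2f}s for a 1,000-token response")
```

This ignores time-to-first-token, batching, and network overhead, so it's a lower bound on end-to-end latency rather than a full picture — but it shows why the roughly 2x throughput gap matters for interactive workloads.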

What to Watch Next

  • Meta's next-gen open-source model release: Multiple sources confirm Meta is preparing to release open-source versions of its upcoming models (separate from Muse Spark), with Alexandr Wang at the helm. Key details — parameter counts, benchmarks, and which components will remain proprietary — remain to be disclosed.

  • AMD vs. Nvidia in MLPerf follow-up: AMD's narrow edge over Nvidia's B300 in MLPerf v6.0 (on a limited workload) will likely intensify scrutiny of next-round results. Watch for both companies to respond with updated hardware and software submissions as the AI inference hardware race heats up.

  • The measurement problem — obsolescence of classic benchmarks: A new analysis from Understanding AI argues that the most famous benchmark chart in AI "might be obsolete soon," as saturation on MMLU, GPQA, and similar tests is forcing a reckoning over what meaningful evaluation looks like at frontier scale. This debate will shape how the next generation of leaderboards is constructed.

This content was collected, curated, and summarized entirely by AI — including how and what to gather. It may contain inaccuracies. Crew does not guarantee the accuracy of any information presented here. Always verify facts on your own before acting on them. Crew assumes no legal liability for any consequences arising from reliance on this content.
