AI Benchmarks & Leaderboard — 2026-03-29

March 29, 2026 | 6 min read | AI quality score: 9.1 (automatically evaluated based on accuracy, depth, and source quality)

This week's most striking benchmark news centers on ARC-AGI-3, a new evaluation that humbled frontier AI models — Gemini scored just 0.37% and GPT-5.4 scored 0.26%, while humans hit 100%. Meanwhile, Mistral released an open-weight text-to-speech model it claims outperforms ElevenLabs, and Artificial Analysis' leaderboard continues to show Gemini 3.1 Pro Preview and GPT-5.4 sharing the top intelligence ranking among closed-source frontier models.



New Model Releases & Updates


Voxtral TTS by Mistral AI

  • Type: Open-weight, speech generation / text-to-speech model
  • Key benchmarks: Mistral claims it outperforms ElevenLabs on voice quality; lightweight enough to run on a smartphone
  • vs. Previous best: Positioned directly against ElevenLabs, Deepgram, and OpenAI in the enterprise voice agent market
  • What's notable: Fully open-weight (weights released for free), targets enterprise voice agents for sales and customer engagement use cases; puts Mistral in direct competition with specialized TTS incumbents

Mistral releases new open-source speech model — TechCrunch


Qwen 3.5 by Alibaba

  • Type: Open-source, multiple parameter sizes including a 397B flagship
  • Key benchmarks: 397B model runs at 5.5+ tokens/sec on a MacBook; benchmark comparisons show competitive performance against Western open-source alternatives including Llama and Mistral
  • vs. Previous best: Closing the gap with Llama and Mistral families across general tasks; Chinese open-source AI ecosystem described as "catching up faster than you think"
  • What's notable: Launched across all parameter sizes in March 2026; flagship 397B model achieves surprisingly strong on-device throughput, raising the bar for local inference

Qwen 3.5 vs Llama vs Mistral comparison


ARC-AGI-3 Benchmark (New Evaluation)

  • Type: New benchmark released by the ARC Prize organization; not a model release but a significant new evaluation
  • Key benchmarks: Gemini scored 0.37%, GPT-5.4 scored 0.26%, humans scored 100%
  • vs. Previous best: Represents a dramatic step up in difficulty over ARC-AGI-2; no current frontier model clears even 1%
  • What's notable: Released the same week Nvidia CEO Jensen Huang declared AGI achieved — the results starkly challenge that narrative; the near-zero AI scores vs. perfect human scores underscore that genuine general reasoning remains an unsolved problem

ARC-AGI-3 benchmark — Decrypt


Leaderboard Snapshot


Frontier Models (Closed-Source)

Model | Provider | Notable Strengths | Key Score
Gemini 3.1 Pro Preview | Google | Highest intelligence ranking; science tasks (94.3% GPQA) | Top intelligence tier (Artificial Analysis)
GPT-5.4 (xhigh) | OpenAI | Co-highest intelligence; 33% fewer hallucinations vs. prior version | Top intelligence tier (Artificial Analysis)
GPT-5.3 Codex (xhigh) | OpenAI | Strong coding performance | Second intelligence tier (Artificial Analysis)
Claude Opus 4.6 (max) | Anthropic | Coding (75.6% SWE-Bench); long-context reasoning | Second intelligence tier (Artificial Analysis)
Gemini 3.1 Pro | Google | Science & multimodal | High intelligence tier

Open-Source Leaders

Model | Parameters | Notable Strengths | Key Score
Qwen 3.5 | Up to 397B | Strong general tasks, on-device inference, multilingual | Competitive with Llama/Mistral across tasks
Mercury 2 | N/A | Speed leader: 764 tokens/sec output | Fastest output speed (Artificial Analysis)
NVIDIA Nemotron 3 Super | N/A | Second fastest: 399 tokens/sec | Second fastest output (Artificial Analysis)
Llama (latest) | Multiple sizes | General-purpose; widely deployed locally | Strong open-source baseline
Mistral Voxtral TTS | N/A | Speech generation; smartphone-capable | Claims to beat ElevenLabs on voice quality
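
To put the output-speed figures in the table above in perspective, here is a minimal back-of-envelope sketch in Python that converts reported tokens-per-second into wall-clock time for a single response. It assumes a 500-token response (an arbitrary figure) and ignores prompt processing and time-to-first-token; the Qwen number comes from the on-device MacBook claim reported above.

    # Back-of-envelope: convert reported output speeds into wall-clock time for
    # one response. Throughput figures are taken from this briefing; the
    # 500-token response length is an assumption, and prompt processing and
    # time-to-first-token are ignored.
    reported_throughput_tok_per_s = {
        "Mercury 2": 764.0,
        "NVIDIA Nemotron 3 Super": 399.0,
        "Qwen 3.5 397B (on-device, MacBook)": 5.5,
    }

    response_tokens = 500  # hypothetical response length

    for model, tps in reported_throughput_tok_per_s.items():
        seconds = response_tokens / tps
        print(f"{model}: ~{seconds:.1f} s for a {response_tokens}-token response")

Roughly: under a second for Mercury 2, about 1.3 seconds for Nemotron 3 Super, and about 90 seconds for the on-device Qwen flagship, which is why 5.5 tokens/sec is notable for local inference rather than for interactive serving.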

Benchmark Deep Dive


ARC-AGI-3: The Benchmark That Put AGI Claims on Trial

ARC-AGI-3 dropped the same week Jensen Huang declared AGI achieved

The release of ARC-AGI-3 this week may be the most consequential benchmark drop of 2026 so far — not because of what AI achieved, but because of what it failed to achieve. The new evaluation is designed to measure abstract reasoning and general intelligence through tasks that humans solve trivially but that current frontier AI systems find nearly impenetrable. Gemini, one of the highest-ranked models on intelligence leaderboards, scored 0.37%. GPT-5.4, co-leader on Artificial Analysis' intelligence rankings, scored 0.26%. Humans, by contrast, scored 100%.

The timing was notable: ARC-AGI-3 arrived the same week that Jensen Huang, CEO of Nvidia, publicly declared that AGI had been achieved. The benchmark results directly contradict that framing. While frontier models like Gemini 3.1 Pro Preview and GPT-5.4 are genuinely impressive across a wide range of tasks — coding, science Q&A, creative writing — the ARC-AGI-3 results expose a fundamental gap between task-specific performance and the kind of flexible, novel problem-solving that humans demonstrate without effort.

For practitioners, this has real implications. The benchmark suggests that models fine-tuned or optimized for known task distributions may still fail catastrophically on genuinely novel reasoning problems. Teams building agentic systems or deploying AI in dynamic, unpredictable environments should treat ARC-AGI-3 scores (when available for a model) as a proxy for robustness under distribution shift. The near-zero scores also serve as a corrective to benchmark inflation — as models "solve" existing evals, the field must continue raising the bar to avoid false confidence in AI capabilities.
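
One hedged way to operationalize that advice: treat a published novel-reasoning score such as ARC-AGI-3 as an additional gate in model selection rather than ranking on task benchmarks alone. The sketch below is purely illustrative; the model names, scores, and thresholds are invented, not taken from any leaderboard.

    # Hypothetical model-selection gate: require both a task-benchmark score and
    # a minimum novel-reasoning score before a model is eligible for an agentic
    # deployment. All names, scores, and thresholds are made up for illustration.
    candidates = {
        "model_a": {"task_score": 0.82, "novel_reasoning_score": 0.004},
        "model_b": {"task_score": 0.79, "novel_reasoning_score": 0.012},
    }

    MIN_TASK_SCORE = 0.75        # hypothetical floor on the task benchmark
    MIN_NOVEL_REASONING = 0.010  # hypothetical robustness floor

    eligible = [
        name
        for name, scores in candidates.items()
        if scores["task_score"] >= MIN_TASK_SCORE
        and scores["novel_reasoning_score"] >= MIN_NOVEL_REASONING
    ]
    print(eligible)  # ['model_b']: a strong task score alone is not enough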



Analysis & Trends

  • State of the art: Gemini 3.1 Pro Preview and GPT-5.4 share the top intelligence ranking among closed-source models per Artificial Analysis. For specific tasks: Claude leads on coding (75.6% SWE-Bench), Gemini leads on science (94.3% GPQA), and GPT-5.4 reduces hallucinations by 33% vs. prior versions. No model approaches human-level performance on the new ARC-AGI-3 benchmark.

  • Open vs. Closed gap: The gap is narrowing on standard benchmarks. Qwen 3.5's 397B model achieves competitive results against Western open-source alternatives, and the Chinese open-source ecosystem is described as accelerating. However, on difficult reasoning benchmarks like ARC-AGI-3, all models — open and closed — remain near zero.

  • Cost-performance: Speed efficiency is emerging as a key differentiator. Mercury 2 leads at 764 tokens/sec and NVIDIA Nemotron 3 Super reaches 399 tokens/sec, far outpacing most intelligence-focused models. Mistral's open-weight Voxtral TTS, running on a smartphone, represents a push to bring capable models to edge hardware without API costs.

  • Emerging patterns: Voice AI is heating up as a competitive frontier, with Mistral entering a space previously dominated by ElevenLabs, Deepgram, and OpenAI. Open-weight TTS with competitive quality could disrupt the voice agent market. Meanwhile, reliability of AI agents remains a concern — Princeton researchers released a battery of reliability tests this week showing most AI vendors still don't benchmark for reliability, even as agentic capabilities improve.


What to Watch Next

  • ARC-AGI-3 scores from additional labs: OpenAI, Anthropic, and others have not yet published official ARC-AGI-3 results. When they do, the scores will set a new baseline for how seriously "AGI" claims can be taken in 2026.

  • Anthropic's Mythos model: A data leak earlier this week revealed the existence of a model internally codenamed "Mythos," described as a "step change in capabilities." Anthropic has confirmed it is testing this model. Its benchmark performance — particularly on ARC-AGI-3-style evaluations — could shift the frontier rankings significantly.

  • AI agent reliability benchmarks from Princeton: Researchers at Princeton have released a new battery of reliability tests for AI agents, designed to surface failure modes that standard benchmarks miss. As more vendors adopt these tests, expect reliability scores to become a new axis on leaderboards — potentially reshuffling rankings for teams deploying agents in production.
