AI Benchmarks & Leaderboard — 2026-03-29
This week's most striking benchmark news centers on ARC-AGI-3, a new evaluation that humbled frontier AI models — Gemini scored just 0.37% and GPT-5.4 scored 0.26%, while humans hit 100%. Meanwhile, Mistral released an open-weight text-to-speech model it claims outperforms ElevenLabs, and Artificial Analysis' leaderboard continues to show Gemini 3.1 Pro Preview and GPT-5.4 sharing the top intelligence ranking among closed-source frontier models.
New Model Releases & Updates
Voxtral TTS by Mistral AI
- Type: Open-weight, speech generation / text-to-speech model
- Key benchmarks: Mistral claims it outperforms ElevenLabs on voice quality; lightweight enough to run on a smartphone
- vs. Previous best: Positioned directly against ElevenLabs, Deepgram, and OpenAI in the enterprise voice agent market
- What's notable: Fully open-weight (weights released for free), targets enterprise voice agents for sales and customer engagement use cases; puts Mistral in direct competition with specialized TTS incumbents

Qwen 3.5 by Alibaba
- Type: Open-source, multiple parameter sizes including a 397B flagship
- Key benchmarks: 397B model runs at 5.5+ tokens/sec on a MacBook; benchmark comparisons show competitive performance against Western open-source alternatives including Llama and Mistral
- vs. Previous best: Closing the gap with Llama and Mistral families across general tasks; Chinese open-source AI ecosystem described as "catching up faster than you think"
- What's notable: Launched across all parameter sizes in March 2026; flagship 397B model achieves surprisingly strong on-device throughput, raising the bar for local inference

ARC-AGI-3 Benchmark (New Evaluation)
- Type: New benchmark released by the ARC Prize organization; not a model release but a significant new evaluation
- Key benchmarks: Gemini scored 0.37%, GPT-5.4 scored 0.26%, humans scored 100%
- vs. Previous best: Represents a dramatic step up in difficulty over ARC-AGI-2; no current frontier model clears even 1%
- What's notable: Released the same week Nvidia CEO Jensen Huang declared AGI achieved — the results starkly challenge that narrative; the near-zero AI scores vs. perfect human scores underscore that genuine general reasoning remains an unsolved problem

Leaderboard Snapshot
Frontier Models (Closed-Source)
| Model | Provider | Notable Strengths | Key Score |
|---|---|---|---|
| Gemini 3.1 Pro Preview | Google | Highest intelligence ranking; science tasks (94.3% GPQA) | Top intelligence tier (Artificial Analysis) |
| GPT-5.4 (xhigh) | OpenAI | Co-highest intelligence; 33% fewer hallucinations vs. prior | Top intelligence tier (Artificial Analysis) |
| GPT-5.3 Codex (xhigh) | OpenAI | Strong coding performance | Second intelligence tier (Artificial Analysis) |
| Claude Opus 4.6 (max) | Anthropic | Coding (75.6% SWE-Bench); long-context reasoning | Second intelligence tier (Artificial Analysis) |
| Gemini 3.1 Pro | Google | Science & multimodal | High intelligence tier |
Open-Source Leaders
| Model | Parameters | Notable Strengths | Key Score |
|---|---|---|---|
| Qwen 3.5 | Up to 397B | Strong general tasks, on-device inference, multilingual | Competitive with Llama/Mistral across tasks |
| Mercury 2 | N/A | Speed leader: 764 tokens/sec output | Fastest output speed (Artificial Analysis) |
| NVIDIA Nemotron 3 Super | N/A | Second fastest: 399 tokens/sec | Second fastest output (Artificial Analysis) |
| Llama (latest) | Multiple sizes | General-purpose; widely deployed locally | Strong open-source baseline |
| Mistral Voxtral TTS | N/A | Speech generation; smartphone-capable | Claims to beat ElevenLabs on voice quality |
Benchmark Deep Dive
ARC-AGI-3: The Benchmark That Put AGI Claims on Trial

The release of ARC-AGI-3 this week may be the most consequential benchmark drop of 2026 so far — not because of what AI achieved, but because of what it failed to achieve. The new evaluation was designed to measure abstract reasoning and general intelligence in a way that is trivially solved by humans but appears nearly impenetrable to current frontier AI systems. Gemini, one of the highest-ranked models on intelligence leaderboards, scored 0.37%. GPT-5.4, co-leader on Artificial Analysis' intelligence rankings, scored 0.26%. Humans, by contrast, scored 100%.
The timing was notable: ARC-AGI-3 arrived the same week that Jensen Huang, CEO of Nvidia, publicly declared that AGI had been achieved. The benchmark results directly contradict that framing. While frontier models like Gemini 3.1 Pro Preview and GPT-5.4 are genuinely impressive across a wide range of tasks — coding, science Q&A, creative writing — the ARC-AGI-3 results expose a fundamental gap between task-specific performance and the kind of flexible, novel problem-solving that humans demonstrate without effort.
For practitioners, this has real implications. The benchmark suggests that models fine-tuned or optimized for known task distributions may still fail catastrophically on genuinely novel reasoning problems. Teams building agentic systems or deploying AI in dynamic, unpredictable environments should treat ARC-AGI-3 scores (when available for a model) as a proxy for robustness under distribution shift. The near-zero scores also serve as a corrective to benchmark inflation — as models "solve" existing evals, the field must continue raising the bar to avoid false confidence in AI capabilities.
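Part of why frontier scores can collapse toward zero on ARC-style evaluations is the grading: credit is typically all-or-nothing per task, so a model that gets most of a solution right still scores zero on that task. A minimal sketch of such a scoring loop (the task format and the toy solver here are illustrative placeholders, not the official ARC-AGI-3 harness):

```python
# Illustrative ARC-style scorer: exact-match, all-or-nothing per task.
# The task dicts and solver are hypothetical, not the official harness.

def score_tasks(tasks, solve):
    """Return the percentage of tasks solved exactly (0.0-100.0)."""
    solved = 0
    for task in tasks:
        prediction = solve(task["input"])
        # No partial credit: the entire output grid must match the target.
        if prediction == task["target"]:
            solved += 1
    return 100.0 * solved / len(tasks)

# Toy demo: an identity "solver" only clears tasks whose target equals the input.
tasks = [
    {"input": [[1, 0], [0, 1]], "target": [[1, 0], [0, 1]]},  # solved
    {"input": [[1, 0], [0, 1]], "target": [[0, 1], [1, 0]]},  # missed
]
identity_solver = lambda grid: grid
print(score_tasks(tasks, identity_solver))  # 50.0
```

Under this grading, a model that is "almost right" on every task and one that is completely wrong are indistinguishable, which makes sub-1% scores a strong signal rather than a rounding artifact.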
Analysis & Trends
- State of the art: Gemini 3.1 Pro Preview and GPT-5.4 share the top intelligence ranking among closed-source models per Artificial Analysis. For specific tasks: Claude leads on coding (75.6% SWE-Bench), Gemini leads on science (94.3% GPQA), and GPT-5.4 reduces hallucinations by 33% vs. prior versions. No model approaches human-level performance on the new ARC-AGI-3 benchmark.
- Open vs. Closed gap: The gap is narrowing on standard benchmarks. Qwen 3.5's 397B model achieves competitive results against Western open-source alternatives, and the Chinese open-source ecosystem is described as accelerating. However, on difficult reasoning benchmarks like ARC-AGI-3, all models — open and closed — remain near zero.
- Cost-performance: Speed efficiency is emerging as a key differentiator. Mercury 2 leads at 764 tokens/sec and NVIDIA Nemotron 3 Super reaches 399 tokens/sec, far outpacing most intelligence-focused models. Mistral's open-weight Voxtral TTS, running on a smartphone, represents a push to bring capable models to edge hardware without API costs.
- Emerging patterns: Voice AI is heating up as a competitive frontier, with Mistral entering a space previously dominated by ElevenLabs, Deepgram, and OpenAI. Open-weight TTS with competitive quality could disrupt the voice agent market. Meanwhile, reliability of AI agents remains a concern — Princeton researchers released a battery of reliability tests this week showing most AI vendors still don't benchmark for reliability, even as agentic capabilities improve.
What to Watch Next
- ARC-AGI-3 scores from additional labs: OpenAI, Anthropic, and others have not yet published official ARC-AGI-3 results. When they do, the scores will set a new baseline for how seriously "AGI" claims can be taken in 2026.
- Anthropic's Mythos model: A data leak earlier this week revealed the existence of a model internally codenamed "Mythos," described as a "step change in capabilities." Anthropic has confirmed it is testing this model. Its benchmark performance — particularly on ARC-AGI-3-style evaluations — could shift the frontier rankings significantly.
- AI agent reliability benchmarks from Princeton: Researchers at Princeton have released a new battery of reliability tests for AI agents, designed to surface failure modes that standard benchmarks miss. As more vendors adopt these tests, expect reliability scores to become a new axis on leaderboards — potentially reshuffling rankings for teams deploying agents in production.
This content was collected, curated, and summarized entirely by AI — including how and what to gather. It may contain inaccuracies. Crew does not guarantee the accuracy of any information presented here. Always verify facts on your own before acting on them. Crew assumes no legal liability for any consequences arising from reliance on this content.