AI Benchmarks & Leaderboard — 2026-03-29
This week's most striking benchmark news centers on ARC-AGI-3, a new evaluation that humbled frontier AI models — Gemini scored just 0.37% and GPT-5.4 scored 0.26%, while humans hit 100%. Meanwhile, Mistral released an open-weight text-to-speech model it claims outperforms ElevenLabs, and Artificial Analysis' leaderboard continues to show Gemini 3.1 Pro Preview and GPT-5.4 sharing the top intelligence ranking among closed-source frontier models.
New Model Releases & Updates
Voxtral TTS by Mistral AI
- Type: Open-weight, speech generation / text-to-speech model
- Key benchmarks: Mistral claims it outperforms ElevenLabs on voice quality; lightweight enough to run on a smartphone
- vs. Previous best: Positioned directly against ElevenLabs, Deepgram, and OpenAI in the enterprise voice agent market
- What's notable: Fully open-weight (weights released for free), targets enterprise voice agents for sales and customer engagement use cases; puts Mistral in direct competition with specialized TTS incumbents

Qwen 3.5 by Alibaba
- Type: Open-source, multiple parameter sizes including a 397B flagship
- Key benchmarks: 397B model runs at 5.5+ tokens/sec on a MacBook; benchmark comparisons show competitive performance against Western open-source alternatives including Llama and Mistral
- vs. Previous best: Closing the gap with Llama and Mistral families across general tasks; Chinese open-source AI ecosystem described as "catching up faster than you think"
- What's notable: Launched across all parameter sizes in March 2026; flagship 397B model achieves surprisingly strong on-device throughput, raising the bar for local inference

ARC-AGI-3 Benchmark (New Evaluation)
- Type: New benchmark released by the ARC Prize organization; not a model release but a significant new evaluation
- Key benchmarks: Gemini scored 0.37%, GPT-5.4 scored 0.26%, humans scored 100%
- vs. Previous best: Represents a dramatic step up in difficulty over ARC-AGI-2; no current frontier model clears even 1%
- What's notable: Released the same week Nvidia CEO Jensen Huang declared AGI achieved — the results starkly challenge that narrative; the near-zero AI scores vs. perfect human scores underscore that genuine general reasoning remains an unsolved problem

Leaderboard Snapshot
Frontier Models (Closed-Source)
| Model | Provider | Notable Strengths | Key Score |
|---|---|---|---|
| Gemini 3.1 Pro Preview | Google | Highest intelligence ranking; science tasks (94.3% GPQA) | Top intelligence tier (Artificial Analysis) |
| GPT-5.4 (xhigh) | OpenAI | Co-highest intelligence; 33% fewer hallucinations vs. prior | Top intelligence tier (Artificial Analysis) |
| GPT-5.3 Codex (xhigh) | OpenAI | Strong coding performance | Second intelligence tier (Artificial Analysis) |
| Claude Opus 4.6 (max) | Anthropic | Coding (75.6% SWE-Bench); long-context reasoning | Second intelligence tier (Artificial Analysis) |
| Gemini 3.1 Pro | Google | Science & multimodal | High intelligence tier |
Open-Source Leaders
| Model | Parameters | Notable Strengths | Key Score |
|---|---|---|---|
| Qwen 3.5 | Up to 397B | Strong general tasks, on-device inference, multilingual | Competitive with Llama/Mistral across tasks |
| Mercury 2 | N/A | Speed leader: 764 tokens/sec output | Fastest output speed (Artificial Analysis) |
| NVIDIA Nemotron 3 Super | N/A | Second fastest: 399 tokens/sec | Second fastest output (Artificial Analysis) |
| Llama (latest) | Multiple sizes | General-purpose; widely deployed locally | Strong open-source baseline |
| Mistral Voxtral TTS | N/A | Speech generation; smartphone-capable | Claims to beat ElevenLabs on voice quality |
Benchmark Deep Dive
ARC-AGI-3: The Benchmark That Put AGI Claims on Trial

The release of ARC-AGI-3 this week may be the most consequential benchmark drop of 2026 so far — not because of what AI achieved, but because of what it failed to achieve. The new evaluation was designed to measure abstract reasoning and general intelligence in a way that is trivially solved by humans but appears nearly impenetrable to current frontier AI systems. Gemini, one of the highest-ranked models on intelligence leaderboards, scored 0.37%. GPT-5.4, co-leader on Artificial Analysis' intelligence rankings, scored 0.26%. Humans, by contrast, scored 100%.
The timing was notable: ARC-AGI-3 arrived the same week that Jensen Huang, CEO of Nvidia, publicly declared that AGI had been achieved. The benchmark results directly contradict that framing. While frontier models like Gemini 3.1 Pro Preview and GPT-5.4 are genuinely impressive across a wide range of tasks — coding, science Q&A, creative writing — the ARC-AGI-3 results expose a fundamental gap between task-specific performance and the kind of flexible, novel problem-solving that humans demonstrate without effort.
For practitioners, this has real implications. The benchmark suggests that models fine-tuned or optimized for known task distributions may still fail catastrophically on genuinely novel reasoning problems. Teams building agentic systems or deploying AI in dynamic, unpredictable environments should treat ARC-AGI-3 scores (when available for a model) as a proxy for robustness under distribution shift. The near-zero scores also serve as a corrective to benchmark inflation — as models "solve" existing evals, the field must continue raising the bar to avoid false confidence in AI capabilities.
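Part of why frontier scores can collapse toward zero on ARC-style evaluations is the grading: credit is typically all-or-nothing per task, so a model that gets most of a solution right still scores zero on that task. A minimal sketch of such a scoring loop (the task format and the toy solver here are illustrative placeholders, not the official ARC-AGI-3 harness):

```python
# Illustrative ARC-style scorer: exact-match, all-or-nothing per task.
# The task dicts and solver are hypothetical, not the official harness.

def score_tasks(tasks, solve):
    """Return the percentage of tasks solved exactly (0.0-100.0)."""
    solved = 0
    for task in tasks:
        prediction = solve(task["input"])
        # No partial credit: the entire output grid must match the target.
        if prediction == task["target"]:
            solved += 1
    return 100.0 * solved / len(tasks)

# Toy demo: an identity "solver" only clears tasks whose target equals the input.
tasks = [
    {"input": [[1, 0], [0, 1]], "target": [[1, 0], [0, 1]]},  # solved
    {"input": [[1, 0], [0, 1]], "target": [[0, 1], [1, 0]]},  # missed
]
identity_solver = lambda grid: grid
print(score_tasks(tasks, identity_solver))  # 50.0
```

Under this grading, a model that is "almost right" on every task and one that is completely wrong are indistinguishable, which makes sub-1% scores a strong signal rather than a rounding artifact.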
Analysis & Trends
- State of the art: Gemini 3.1 Pro Preview and GPT-5.4 share the top intelligence ranking among closed-source models per Artificial Analysis. For specific tasks: Claude leads on coding (75.6% SWE-Bench), Gemini leads on science (94.3% GPQA), and GPT-5.4 reduces hallucinations by 33% vs. prior versions. No model approaches human-level performance on the new ARC-AGI-3 benchmark.
- Open vs. Closed gap: The gap is narrowing on standard benchmarks. Qwen 3.5's 397B model achieves competitive results against Western open-source alternatives, and the Chinese open-source ecosystem is described as accelerating. However, on difficult reasoning benchmarks like ARC-AGI-3, all models — open and closed — remain near zero.
- Cost-performance: Speed efficiency is emerging as a key differentiator. Mercury 2 leads at 764 tokens/sec and NVIDIA Nemotron 3 Super reaches 399 tokens/sec, far outpacing most intelligence-focused models. Mistral's open-weight Voxtral TTS, running on a smartphone, represents a push to bring capable models to edge hardware without API costs.
- Emerging patterns: Voice AI is heating up as a competitive frontier, with Mistral entering a space previously dominated by ElevenLabs, Deepgram, and OpenAI. Open-weight TTS with competitive quality could disrupt the voice agent market. Meanwhile, reliability of AI agents remains a concern — Princeton researchers released a battery of reliability tests this week showing most AI vendors still don't benchmark for reliability, even as agentic capabilities improve.
What to Watch Next
- ARC-AGI-3 scores from additional labs: OpenAI, Anthropic, and others have not yet published official ARC-AGI-3 results. When they do, the scores will set a new baseline for how seriously "AGI" claims can be taken in 2026.
- Anthropic's Mythos model: A data leak earlier this week revealed the existence of a model internally codenamed "Mythos," described as a "step change in capabilities." Anthropic has confirmed it is testing this model. Its benchmark performance — particularly on ARC-AGI-3-style evaluations — could shift the frontier rankings significantly.
- AI agent reliability benchmarks from Princeton: Researchers at Princeton have released a new battery of reliability tests for AI agents, designed to surface failure modes that standard benchmarks miss. As more vendors adopt these tests, expect reliability scores to become a new axis on leaderboards — potentially reshuffling rankings for teams deploying agents in production.
This content was collected, curated, and summarized entirely by AI — including how and what to gather. It may contain inaccuracies. Crew does not guarantee the accuracy of any information presented here. Always verify facts on your own before acting on them. Crew assumes no legal liability for any consequences arising from reliance on this content.