AI Benchmarks & Leaderboard — 2026-03-22
The week of March 14–22, 2026 saw a flurry of model releases and comparisons, with GPT-5.4, Claude Opus/Sonnet 4.6, and Gemini 3.1 Pro Preview trading blows at the frontier. Independent analysis from Artificial Analysis places Gemini 3.1 Pro Preview and GPT-5.4 at the top of the intelligence rankings, while OpenAI and Mistral AI also shipped new hardware-efficient language models. Open-source contenders continue to close the gap with closed models on key benchmarks.
New Model Releases

GPT-5.4 by OpenAI
- Type: Closed-source
- Key benchmarks: Listed as one of the highest-intelligence models on Artificial Analysis; described as tied with Gemini 3.1 Pro Preview at the top of the intelligence leaderboard
- vs. Previous best: Surpasses earlier GPT-5.x variants; professional-exam benchmarks (LSAT, Bar Exam, MedQA) remain strong suits
- What's notable: OpenAI also released a new hardware-efficient language model alongside this flagship release
Claude Opus 4.6 / Sonnet 4.6 by Anthropic
- Type: Closed-source
- Key benchmarks: Claude Opus 4.6 is recorded as achieving "record scores on coding tests, surpassing Google Gemini 3 Pro" in independent testing
- vs. Previous best: Outperforms earlier Claude 4.x series on coding benchmarks; ranked just below GPT-5.4 and Gemini 3.1 Pro Preview on general intelligence
- What's notable: Both the max-capability Opus 4.6 and the Sonnet 4.6 variants were released; Opus 4.6 is particularly strong on code
Gemini 3.1 Pro Preview by Google
- Type: Closed-source
- Key benchmarks: Tied with GPT-5.4 as the highest-intelligence model on the Artificial Analysis leaderboard
- vs. Previous best: Extends the Gemini 3.x line; surpasses Gemini 3 Pro on coding per independent benchmarks
- What's notable: Preview designation suggests additional updates expected; top-ranked alongside GPT-5.4
New Hardware-Efficient Models by OpenAI & Mistral AI
- Type: Closed-source (OpenAI); Open-source (Mistral AI)
- Key benchmarks: Specific benchmark numbers not yet published in sources reviewed
- vs. Previous best: Positioned as more efficient alternatives to frontier flagship models
- What's notable: Both labs shipped efficiency-focused models on the same week (reported March 17, 2026), signaling a parallel push toward cost/performance optimization alongside raw capability races
Leaderboard Changes
Chatbot Arena (LMSYS / Arena.ai)
The Arena AI leaderboard page was accessible this week, but detailed per-model ELO scores could not be extracted reliably from the screenshot capture. Based on corroborating news sources, the following reflects the approximate top tier as of March 2026:
| Rank | Model | Notes |
|---|---|---|
| 1–2 (tied) | Gemini 3.1 Pro Preview | Top intelligence per Artificial Analysis |
| 1–2 (tied) | GPT-5.4 | Top intelligence per Artificial Analysis |
| 3 | Claude Opus 4.6 | Strong on coding; close behind top 2 |
| 4 | GPT-5.3 Codex | Strong coding-focused variant |
For authoritative, up-to-date ELO scores, consult the Chatbot Arena leaderboard directly.
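For readers unfamiliar with how Arena-style rankings arise, here is a minimal sketch of an Elo-style online update over pairwise battles. It is illustrative only: the model names, starting rating, K-factor, and battle outcomes below are placeholders, and the live leaderboard fits ratings with a Bradley-Terry-style model over large vote counts rather than a sequential Elo update.

```python
from collections import defaultdict

def expected_score(r_a: float, r_b: float) -> float:
    """Probability that model A beats model B under the Elo model."""
    return 1.0 / (1.0 + 10 ** ((r_b - r_a) / 400))

def update_elo(ratings, model_a, model_b, outcome, k=32):
    """Update two ratings in place after one head-to-head battle.

    outcome: 1.0 if model_a won, 0.0 if model_b won, 0.5 for a tie.
    """
    e_a = expected_score(ratings[model_a], ratings[model_b])
    ratings[model_a] += k * (outcome - e_a)
    ratings[model_b] += k * ((1.0 - outcome) - (1.0 - e_a))

# Illustrative battles only -- model names and results are placeholders,
# not actual Arena data.
ratings = defaultdict(lambda: 1000.0)
battles = [
    ("gpt-5.4", "claude-opus-4.6", 1.0),
    ("gemini-3.1-pro-preview", "gpt-5.4", 0.5),
    ("claude-opus-4.6", "gpt-5.3-codex", 1.0),
]
for a, b, result in battles:
    update_elo(ratings, a, b, result)

for model, r in sorted(ratings.items(), key=lambda x: -x[1]):
    print(f"{model:>25}  {r:7.1f}")
```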
Open Source Rankings

A leading open-source model tracked this week posted benchmark scores that place it in striking distance of closed-source rivals:
- MMLU: 90.8
- MMLU-Pro: 84.0
- HumanEval: 90.2
- LiveCodeBench: 65.9
- AIME 2025: 87.5
- GPQA Diamond: 71.5
- MATH-500: 97.3
- Chatbot Arena ELO: 1398
This places the best open-source contenders within a competitive range of the closed frontier on math and coding, though a gap remains on general reasoning benchmarks.
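For context on how the coding numbers above are computed: HumanEval-style results are typically reported as pass@k. Below is a minimal sketch of the standard unbiased pass@k estimator introduced with HumanEval, assuming n generated samples per problem of which c pass the unit tests. The sample counts plugged in are placeholders chosen to land near the 90.2 pass@1 figure above, not actual evaluation data.

```python
from math import comb

def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k estimate: probability that at least one of k samples
    drawn without replacement from n generations passes, given that c of
    the n generations pass the unit tests."""
    if n - c < k:
        return 1.0
    return 1.0 - comb(n - c, k) / comb(n, k)

# Placeholder example: 200 samples per problem with 181 passing would give
# a per-problem pass@1 close to the 90.2% reported above.
print(round(pass_at_k(200, 181, 1), 3))   # ~0.905
print(round(pass_at_k(200, 181, 10), 3))  # approaches 1.0 as k grows
```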
Benchmark Deep Dive
ARC-AGI-2: The Benchmark That Still Humbles AI
One Reddit analysis from the LocalLLaMA community (posted within the coverage window) offered a sobering look at which benchmarks still have meaningful signal in 2025–2026. The standout finding: ARC-AGI-2.
- Pure LLMs score 0% on ARC-AGI-2
- The best reasoning system (using extended scaffolding) hits only 54% at $30 per task
- Average humans score ~60%
- All four major labs now report ARC-AGI-2 on model cards
- A v3 with interactive environments is planned for 2026
This result is striking because it underscores that despite frontier models topping out on MMLU (90%+) and MATH-500 (97%+), fundamental abstract reasoning remains genuinely hard. ARC-AGI-2 forces models to generalize from tiny example sets — a capability humans take for granted but LLMs still cannot reliably replicate. As the community notes, most "standard" benchmarks are increasingly saturated; ARC-AGI-2 may be one of the last remaining benchmarks with true signal at the frontier.
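To make the task format concrete: an ARC-style problem supplies a few input/output grid pairs that demonstrate a hidden transformation, and the solver must apply that rule to a held-out test input, with credit given only for an exact grid match. The toy task, rule, and "solver" below are invented for illustration and are not ARC-AGI-2 data.

```python
# A toy ARC-style task: each training pair demonstrates the same hidden rule
# (here: every cell value is doubled), and the solver must apply it to the
# test input. The grids below are invented for illustration.
task = {
    "train": [
        {"input": [[1, 0], [0, 2]], "output": [[2, 0], [0, 4]]},
        {"input": [[3, 1], [1, 0]], "output": [[6, 2], [2, 0]]},
    ],
    "test": [
        {"input": [[2, 2], [0, 1]], "output": [[4, 4], [0, 2]]},
    ],
}

def solve(grid):
    """A hypothetical solver that happens to guess the right rule."""
    return [[cell * 2 for cell in row] for row in grid]

def score(task, solver) -> float:
    """ARC-style scoring: a test item counts only on an exact grid match."""
    hits = sum(
        1 for item in task["test"] if solver(item["input"]) == item["output"]
    )
    return hits / len(task["test"])

print(score(task, solve))  # 1.0 on this toy task; pure LLMs score 0% on ARC-AGI-2
```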
Analysis

- Frontier models: The current state of the art is a three-way race between GPT-5.4, Gemini 3.1 Pro Preview, and Claude Opus 4.6. Artificial Analysis places the first two tied at the top for intelligence; Claude Opus 4.6 leads on coding. GPT-5.3 Codex is a notable coding-focused alternative from OpenAI.
- Open vs. Closed gap: The best open-source models are posting MMLU scores of 90.8 and MATH-500 scores of 97.3, figures that would have been frontier-only 18 months ago. However, on live coding benchmarks (LiveCodeBench: 65.9) and abstract reasoning (ARC-AGI-2), closed models still maintain advantages. The gap is narrowing but has not closed.
- Emerging trends: Two clear trends this week: (1) simultaneous release of capability-maximizing flagship models and hardware-efficient smaller models by OpenAI and Mistral, suggesting labs are addressing the frontier and deployment cost in parallel; (2) growing focus on "live" and adversarial benchmarks (ARC-AGI-2, LiveCodeBench, SWE-bench) as older benchmarks saturate.
- Cost efficiency: The simultaneous push for hardware-efficient models by both OpenAI and Mistral AI is the most concrete cost/performance story of the week. Specific price-per-token figures were not available in this week's sources, but the strategic direction is clear: labs are no longer optimizing only for benchmark leaderboard positions, but increasingly for inference cost and deployment practicality.
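One way to frame cost/performance, using the only concrete per-task cost figure in this week's sources (the ARC-AGI-2 result above), is expected cost per solved task rather than cost per attempt. The sketch below shows the arithmetic; the second system in the comparison is hypothetical, with made-up numbers.

```python
def cost_per_solved_task(cost_per_attempt: float, accuracy: float) -> float:
    """Expected spend per correctly solved task: the cost of one attempt
    divided by the probability that the attempt succeeds."""
    return cost_per_attempt / accuracy

# The ARC-AGI-2 figure reported above: $30 per task at 54% accuracy.
print(round(cost_per_solved_task(30.0, 0.54), 2))  # ~55.56 dollars per solve

# A hypothetical cheaper-but-weaker system for comparison (made-up numbers).
print(round(cost_per_solved_task(5.0, 0.20), 2))   # 25.0 dollars per solve
```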
Note: Leaderboard ELO scores and some benchmark numbers could not be fully verified from live page captures this week. Readers are encouraged to cross-check figures against the Chatbot Arena and Artificial Analysis leaderboards for the most current data.
This content was collected, curated, and summarized entirely by AI — including how and what to gather. It may contain inaccuracies. Crew does not guarantee the accuracy of any information presented here. Always verify facts on your own before acting on them. Crew assumes no legal liability for any consequences arising from reliance on this content.