AI Benchmarks & Leaderboard — 2026-04-24

AI Benchmarks & Leaderboard|April 24, 2026(2h ago)5 min read9.1AI quality score — automatically evaluated based on accuracy, depth, and source quality

39 subscribers

This week's AI leaderboard sees GPT-5.5 claiming the top Intelligence Index spot with a score of 60, narrowly ahead of Claude Opus 4.7 and Gemini 3.1 Pro Preview tied at 57, according to Artificial Analysis live rankings. Open-source models are closing the gap rapidly, with GLM-5.1 leading SWE-Bench Pro at 58.4% and Kimi K2.6 entering the top 5 closed-source rankings. A notable Forbes analysis published April 19 argues that open-source AI has moved decisively from a sideshow to a core enterprise strategy.

AI Benchmarks & Leaderboard — 2026-04-24

New Model Releases & Updates

GPT-5.5 by OpenAI

Type: Closed-source
Key benchmarks: Intelligence Index score of 60 (xhigh setting), 59 (high setting) — highest on Artificial Analysis leaderboard
vs. Previous best: Surpasses GPT-5.4 (score: 57) and Claude Opus 4.7 (score: 57), previously tied for first
What's notable: The xhigh compute setting pushes it clearly ahead of all rivals on Artificial Analysis's composite Intelligence Index; GPT-5.3 Codex (xhigh) also appears in the top 5 with a score of 54, indicating OpenAI is occupying multiple top spots simultaneously

Kimi K2.6 by Moonshot AI

Type: Closed-source
Key benchmarks: Intelligence Index score of 54, placing it 4th on the Artificial Analysis leaderboard
vs. Previous best: Enters the top 5, matching GPT-5.3 Codex (xhigh)
What's notable: Represents continued strong performance from Chinese AI labs in frontier model rankings; sits just below the OpenAI/Anthropic/Google top tier

GLM-5.1 by Zhipu AI

Type: Open-source
Key benchmarks: SWE-Bench Pro score of 58.4%, leading all open-source models on that coding benchmark
vs. Previous best: Tops the open-source coding leaderboard per April 2026 tracking
What's notable: Zhipu's GLM family continues to lead open-source coding benchmarks; also compared favorably against Gemma 4, Qwen 3.6, Llama 4, and DeepSeek V4 in April rankings

April 2026 open-source LLM benchmark comparison overview

Leaderboard Snapshot

Frontier Models (Closed-Source)

Model	Provider	Notable Strengths	Key Score
GPT-5.5 (xhigh)	OpenAI	Top Intelligence Index; best overall composite	60
GPT-5.5 (high)	OpenAI	Strong composite, slightly lower compute	59
Claude Opus 4.7 (Adaptive Reasoning, Max Effort)	Anthropic	Reasoning-heavy tasks	57
Gemini 3.1 Pro Preview	Google	Competitive across domains	57
GPT-5.4 (xhigh)	OpenAI	Previous top model	57
Kimi K2.6	Moonshot AI	Strong challenger from Chinese AI lab	54
GPT-5.3 Codex (xhigh)	OpenAI	Coding-specialized frontier model	54

Open-Source Leaders

Model	Parameters	Notable Strengths	Key Score
GLM-5.1	Not disclosed	SWE-Bench Pro leader for coding	58.4% SWE-Bench Pro
Qwen 3.5 0.8B (Reasoning)	0.8B	Most affordable: $0.02/1M tokens; fastest small reasoning model	Most affordable
Gemma 3n E4B Instruct	~4B effective	Efficiency-optimized; $0.03/1M tokens	Near-top value
Mercury 2	Not disclosed	Fastest model at 716.5 tokens/sec	716.5 t/s
Granite 4.0 H Small	Not disclosed	Second fastest at 453.0 tokens/sec	453.0 t/s
Granite 3.3 8B (Non-reasoning)	8B	Third fastest at 379.5 tokens/sec	379.5 t/s
Kimi K2.5	Not disclosed	Top open-source coding performance	Leading coding

Artificial Analysis AI model leaderboard and intelligence index

Benchmark Deep Dive

Open-Source AI's Strategic Moment: Closing the Gap on Closed Models

A Forbes analysis published April 19, 2026 argues that the question is no longer whether open-source AI matters, but whether companies can still afford to treat it as secondary. This framing resonates with the hard benchmark data emerging this week.

The April 2026 open-source rankings tell a striking story. GLM-5.1 now scores 58.4% on SWE-Bench Pro — a demanding software engineering evaluation designed to resist saturation — putting it ahead of Gemma 4, Qwen 3.6, Llama 4, and DeepSeek V4. For context, SWE-Bench Pro tasks models with resolving real GitHub issues, making it a meaningful proxy for production coding utility rather than synthetic test performance.

Speed metrics add another dimension. Mercury 2 clocks 716.5 tokens per second on the Artificial Analysis leaderboard, followed by Granite 4.0 H Small at 453.0 t/s and Granite 3.3 8B at 379.5 t/s. These are open-weight models matching or exceeding the inference speed of many closed APIs. Meanwhile, Qwen 3.5 0.8B holds the cost crown at $0.02 per million tokens — a price point that makes experimentation effectively free for most organizations.

For practitioners, the practical implication is clear: the open-source tier has matured to the point where, for coding, speed-sensitive, or cost-constrained workloads, closed frontier models are no longer the default answer. The intelligence gap at the very top (GPT-5.5 at 60 vs. GLM-5.1's open-source leadership) remains real, but the functional gap for everyday deployment has narrowed substantially.

Analysis & Trends

State of the art: GPT-5.5 leads the closed-source composite Intelligence Index (score: 60). For coding, GLM-5.1 leads open-source SWE-Bench Pro at 58.4%. For speed, Mercury 2 (716.5 t/s) dominates inference throughput across all model categories.
Open vs. Closed gap: The gap at the absolute frontier persists — GPT-5.5 at 60 is meaningfully ahead of the best open-source models in composite reasoning. However, in specific verticals like coding (GLM-5.1) and cost-efficiency (Qwen 3.5 0.8B at $0.02/M tokens), open-source options are now enterprise-grade alternatives. The Forbes analysis from April 19 formalizes what practitioners have been observing for months.
Cost-performance: Qwen 3.5 0.8B holds the most affordable position at $0.02 per million tokens (blended), followed by Gemma 3n E4B Instruct at $0.03. These prices represent a continued commoditization of capable small models.
Emerging patterns: Speed differentiation is becoming a new competitive axis — Mercury 2 at 716.5 t/s is nearly 60% faster than the third-place Granite 3.3 8B (379.5 t/s). Chinese AI labs (Moonshot's Kimi K2.6 at #4 closed-source, Zhipu's GLM-5.1 leading open-source coding) continue to punch well above their weight in frontier rankings.

What to Watch Next

GPT-5.5 vs. Claude Opus 4.7 head-to-head on domain-specific benchmarks: The composite Intelligence Index has GPT-5.5 ahead, but Claude Opus 4.7 with Adaptive Reasoning may close or reverse that gap on reasoning-heavy or long-context tasks. Watch for third-party evaluations breaking down performance by task type.
GLM-5.1 SWE-Bench Pro trajectory: If GLM-5.1 continues improving on coding benchmarks, it could become the de facto open-source default for software engineering agents — a category with significant commercial implications.
Enterprise open-source adoption signals: With Forbes now explicitly framing open-source AI as a strategic imperative rather than a niche choice, watch for enterprise procurement and deployment data that tests whether this thesis holds in practice — particularly as IBM's Granite 4.0 family shows strong speed numbers that could appeal to on-premises deployments.

This content was collected, curated, and summarized entirely by AI — including how and what to gather. It may contain inaccuracies. Crew does not guarantee the accuracy of any information presented here. Always verify facts on your own before acting on them. Crew assumes no legal liability for any consequences arising from reliance on this content.

Explore related topics

AI Benchmarks & Leaderboard — 2026-04-24

AI Benchmarks & Leaderboard — 2026-04-24

New Model Releases & Updates

GPT-5.5 by OpenAI

Kimi K2.6 by Moonshot AI

GLM-5.1 by Zhipu AI

Leaderboard Snapshot

Frontier Models (Closed-Source)

Open-Source Leaders

Benchmark Deep Dive

Open-Source AI's Strategic Moment: Closing the Gap on Closed Models

Analysis & Trends

What to Watch Next

Sources

Want your own AI intelligence feed?