
AI Benchmarks & Leaderboard — 2026-04-24


This week's AI leaderboard sees GPT-5.5 claiming the top Intelligence Index spot with a score of 60, narrowly ahead of Claude Opus 4.7 and Gemini 3.1 Pro Preview tied at 57, according to Artificial Analysis live rankings. Open-source models are closing the gap rapidly, with GLM-5.1 leading SWE-Bench Pro at 58.4% and Kimi K2.6 entering the top 5 closed-source rankings. A notable Forbes analysis published April 19 argues that open-source AI has moved decisively from a sideshow to a core enterprise strategy.



New Model Releases & Updates


GPT-5.5 by OpenAI

  • Type: Closed-source
  • Key benchmarks: Intelligence Index score of 60 (xhigh setting), 59 (high setting) — highest on Artificial Analysis leaderboard
  • vs. Previous best: Surpasses GPT-5.4 (score: 57) and Claude Opus 4.7 (score: 57), previously tied for first
  • What's notable: The xhigh compute setting pushes it clearly ahead of all rivals on Artificial Analysis's composite Intelligence Index; GPT-5.3 Codex (xhigh) also appears in the top 5 with a score of 54, indicating OpenAI is occupying multiple top spots simultaneously

Kimi K2.6 by Moonshot AI

  • Type: Closed-source
  • Key benchmarks: Intelligence Index score of 54, placing it 4th on the Artificial Analysis leaderboard
  • vs. Previous best: Enters the top 5, matching GPT-5.3 Codex (xhigh)
  • What's notable: Represents continued strong performance from Chinese AI labs in frontier model rankings; sits just below the OpenAI/Anthropic/Google top tier

GLM-5.1 by Zhipu AI

  • Type: Open-source
  • Key benchmarks: SWE-Bench Pro score of 58.4%, leading all open-source models on that coding benchmark
  • vs. Previous best: Tops the open-source coding leaderboard per April 2026 tracking
  • What's notable: Zhipu's GLM family continues to lead open-source coding benchmarks; also compared favorably against Gemma 4, Qwen 3.6, Llama 4, and DeepSeek V4 in April rankings

[Image: April 2026 open-source LLM benchmark comparison overview]


Leaderboard Snapshot


Frontier Models (Closed-Source)

Model | Provider | Notable Strengths | Key Score
GPT-5.5 (xhigh) | OpenAI | Top Intelligence Index; best overall composite | 60
GPT-5.5 (high) | OpenAI | Strong composite, slightly lower compute | 59
Claude Opus 4.7 (Adaptive Reasoning, Max Effort) | Anthropic | Reasoning-heavy tasks | 57
Gemini 3.1 Pro Preview | Google | Competitive across domains | 57
GPT-5.4 (xhigh) | OpenAI | Previous top model | 57
Kimi K2.6 | Moonshot AI | Strong challenger from Chinese AI lab | 54
GPT-5.3 Codex (xhigh) | OpenAI | Coding-specialized frontier model | 54

Open-Source Leaders

Model | Parameters | Notable Strengths | Key Score
GLM-5.1 | Not disclosed | SWE-Bench Pro leader for coding | 58.4% SWE-Bench Pro
Qwen 3.5 0.8B (Reasoning) | 0.8B | Most affordable at $0.02/1M tokens; fastest small reasoning model | Most affordable
Gemma 3n E4B Instruct | ~4B effective | Efficiency-optimized; $0.03/1M tokens | Near-top value
Mercury 2 | Not disclosed | Fastest model at 716.5 tokens/sec | 716.5 t/s
Granite 4.0 H Small | Not disclosed | Second fastest at 453.0 tokens/sec | 453.0 t/s
Granite 3.3 8B (Non-reasoning) | 8B | Third fastest at 379.5 tokens/sec | 379.5 t/s
Kimi K2.5 | Not disclosed | Top open-source coding performance | Leading coding

[Image: Artificial Analysis AI model leaderboard and Intelligence Index]


Benchmark Deep Dive


Open-Source AI's Strategic Moment: Closing the Gap on Closed Models

A Forbes analysis published April 19, 2026, argues that the question is no longer whether open-source AI matters, but whether companies can still afford to treat it as secondary. This framing resonates with the hard benchmark data emerging this week.

The April 2026 open-source rankings tell a striking story. GLM-5.1 now scores 58.4% on SWE-Bench Pro — a demanding software engineering evaluation designed to resist saturation — putting it ahead of Gemma 4, Qwen 3.6, Llama 4, and DeepSeek V4. For context, SWE-Bench Pro tasks models with resolving real GitHub issues, making it a meaningful proxy for production coding utility rather than synthetic test performance.
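SWE-Bench-style scores are resolution rates: the fraction of issue tasks where the model's generated patch makes the repository's tests pass. A minimal sketch of that scoring logic, with hypothetical task names and outcomes (not drawn from the actual benchmark data):

```python
def resolution_rate(task_results: dict[str, bool]) -> float:
    """Score = resolved tasks / total tasks, expressed as a percentage."""
    return 100.0 * sum(task_results.values()) / len(task_results)

# Hypothetical run over 5 issue tasks: 3 patches made the tests pass.
demo = {
    "issue-1": True,
    "issue-2": True,
    "issue-3": False,
    "issue-4": True,
    "issue-5": False,
}
print(f"{resolution_rate(demo):.1f}%")  # 60.0%
```

At scale, a 58.4% score means the model's patches resolved roughly 584 of every 1,000 real GitHub issues in the evaluation set.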

Speed metrics add another dimension. Mercury 2 clocks 716.5 tokens per second on the Artificial Analysis leaderboard, followed by Granite 4.0 H Small at 453.0 t/s and Granite 3.3 8B at 379.5 t/s. These are open-weight models matching or exceeding the inference speed of many closed APIs. Meanwhile, Qwen 3.5 0.8B holds the cost crown at $0.02 per million tokens — a price point that makes experimentation effectively free for most organizations.
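To make the throughput and cost spread concrete, here is a quick sketch using the leaderboard figures above; the 500M tokens/month workload is an illustrative assumption, not a number from the article:

```python
# Throughput (tokens/sec) and blended price ($ per 1M tokens),
# as quoted on the leaderboard above.
throughput_tps = {
    "Mercury 2": 716.5,
    "Granite 4.0 H Small": 453.0,
    "Granite 3.3 8B": 379.5,
}
price_per_1m = {
    "Qwen 3.5 0.8B": 0.02,
    "Gemma 3n E4B Instruct": 0.03,
}

# Relative speed versus the throughput leader.
leader = max(throughput_tps.values())
for model, tps in throughput_tps.items():
    print(f"{model}: {tps} t/s ({tps / leader:.0%} of the leader)")

# Monthly spend at an assumed 500M tokens/month workload.
tokens_per_month = 500_000_000
for model, price in price_per_1m.items():
    monthly = tokens_per_month / 1_000_000 * price
    print(f"{model}: ${monthly:.2f}/month at 500M tokens")
```

At these prices, even a heavy 500M-token monthly workload on Qwen 3.5 0.8B costs about $10, which is why the article calls experimentation "effectively free."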

For practitioners, the practical implication is clear: the open-source tier has matured to the point where, for coding, speed-sensitive, or cost-constrained workloads, closed frontier models are no longer the default answer. The intelligence gap at the very top (GPT-5.5 at 60 vs. GLM-5.1's open-source leadership) remains real, but the functional gap for everyday deployment has narrowed substantially.


Analysis & Trends

  • State of the art: GPT-5.5 leads the closed-source composite Intelligence Index (score: 60). For coding, GLM-5.1 leads open-source SWE-Bench Pro at 58.4%. For speed, Mercury 2 (716.5 t/s) dominates inference throughput across all model categories.

  • Open vs. Closed gap: The gap at the absolute frontier persists — GPT-5.5 at 60 is meaningfully ahead of the best open-source models in composite reasoning. However, in specific verticals like coding (GLM-5.1) and cost-efficiency (Qwen 3.5 0.8B at $0.02/M tokens), open-source options are now enterprise-grade alternatives. The Forbes analysis from April 19 formalizes what practitioners have been observing for months.

  • Cost-performance: Qwen 3.5 0.8B holds the most affordable position at $0.02 per million tokens (blended), followed by Gemma 3n E4B Instruct at $0.03. These prices represent a continued commoditization of capable small models.

  • Emerging patterns: Speed differentiation is becoming a new competitive axis: Mercury 2 at 716.5 t/s is nearly 60% faster than second-place Granite 4.0 H Small (453.0 t/s) and nearly 90% faster than third-place Granite 3.3 8B (379.5 t/s). Chinese AI labs (Moonshot's Kimi K2.6 at #4 closed-source, Zhipu's GLM-5.1 leading open-source coding) continue to punch well above their weight in frontier rankings.


What to Watch Next

  • GPT-5.5 vs. Claude Opus 4.7 head-to-head on domain-specific benchmarks: The composite Intelligence Index has GPT-5.5 ahead, but Claude Opus 4.7 with Adaptive Reasoning may close or reverse that gap on reasoning-heavy or long-context tasks. Watch for third-party evaluations breaking down performance by task type.

  • GLM-5.1 SWE-Bench Pro trajectory: If GLM-5.1 continues improving on coding benchmarks, it could become the de facto open-source default for software engineering agents — a category with significant commercial implications.

  • Enterprise open-source adoption signals: With Forbes now explicitly framing open-source AI as a strategic imperative rather than a niche choice, watch for enterprise procurement and deployment data that tests whether this thesis holds in practice — particularly as IBM's Granite 4.0 family shows strong speed numbers that could appeal to on-premises deployments.

This content was collected, curated, and summarized entirely by AI — including how and what to gather. It may contain inaccuracies. Crew does not guarantee the accuracy of any information presented here. Always verify facts on your own before acting on them. Crew assumes no legal liability for any consequences arising from reliance on this content.

Explore related topics
  • How do these scores translate to real-world tasks?
  • What compute resources does GPT-5.5 require?
  • Why is open-source coding catching up so fast?
  • How is Kimi K2.6 closing the gap with OpenAI?

Powered by Crew

