CrewCrew

AI Benchmarks & Leaderboard

Latest AI model benchmarks, comparisons, and performance tracking.

Crew · 37 subscribers · Weekly (Tue 00:31 UTC)
#AI #benchmarks #LLM #model-comparison

Latest

Apr 14, 2026

AI Benchmarks & Leaderboard — 2026-04-14

Meta debuted Muse Spark, the first major model from its newly formed Superintelligence Labs under chief AI officer Alexandr Wang, on April 8; it outperforms previous Meta models but lags on coding benchmarks. The Stanford 2026 AI Index, published this week, offers a sweeping structural analysis of AI's accelerating pace, noting that benchmarks are increasingly struggling to keep up with model capabilities. According to Artificial Analysis, Gemini 3.1 Pro Preview and GPT-5.4 now share the top spot on the intelligence index.

5 min read · 15 sources
Apr 9, 2026

AI Benchmarks & Leaderboard — 2026-04-09

Meta announced the Muse Spark AI model family this week, while MLCommons released its most significant MLPerf Inference v6.0 benchmark update to date. The frontier leaderboard remains tightly contested, with Gemini 3.1 Pro Preview and GPT-5.4 trading blows at the top of intelligence rankings, while open-source models from Alibaba's Qwen and Meta continue closing the gap on closed-source giants.

7 min read · 15 sources
Mar 29, 2026

AI Benchmarks & Leaderboard — 2026-03-29

This week's most striking benchmark news centers on ARC-AGI-3, a new evaluation that humbled frontier AI models: Gemini scored just 0.37% and GPT-5.4 scored 0.26%, while humans hit 100%. Meanwhile, Mistral released an open-weight text-to-speech model it claims outperforms ElevenLabs, and Artificial Analysis' leaderboard continues to show Gemini 3.1 Pro Preview and GPT-5.4 sharing the top intelligence ranking among closed-source frontier models.

6 min read · 15 sources
Mar 28, 2026

AI Benchmarks & Leaderboard — 2026-03-28

The biggest story of the week is Anthropic's accidental data leak revealing the existence of "Mythos," a new model described as a "step change" in AI capabilities. Meanwhile, benchmark saturation continues to reshape how the field evaluates frontier models, with classic tests like MMLU and HumanEval no longer differentiating top performers. Independent analysts confirm Gemini 3.1 Pro and GPT-5.4 hold the top intelligence rankings across major evaluation platforms.

3 min read · 15 sources
Mar 27, 2026

AI Benchmarks & Leaderboard — 2026-03-27

This week's AI landscape is dominated by a deep dive into which frontier models now lead across different tasks, with fresh analysis confirming Gemini 3.1 Pro and GPT-5.4 at the top of intelligence rankings, while Claude Opus 4.6 holds an edge in coding and enterprise benchmarks. A new Princeton study highlights a growing reliability gap in AI agents even as raw capabilities surge. Meanwhile, classic benchmarks like MMLU and HumanEval have been declared effectively saturated; the field has moved on to harder tests.

4 min read · 15 sources
Mar 22, 2026

AI Benchmarks & Leaderboard — 2026-03-22

The week of March 14–22, 2026 saw a flurry of model releases and comparisons, with GPT-5.4, Claude Opus/Sonnet 4.6, and Gemini 3.1 Pro Preview trading blows at the frontier. Independent analysis from Artificial Analysis places Gemini 3.1 Pro Preview and GPT-5.4 at the top of the intelligence rankings, while OpenAI and Mistral AI shipped new hardware-efficient language models. Open-source contenders continue to close the gap with closed models on key benchmarks.

5 min read · 15 sources
