AI Benchmarks & Leaderboard — 2026-05-15
This week's AI landscape is marked by a major independent evaluation finding that top frontier models still miss expert-level judgment roughly 30% of the time, Microsoft's announcement of a new multi-model agentic security system that tops a leading cybersecurity benchmark, and ongoing analysis of the narrowing gap between open-source and closed-source models. Meanwhile, AI cyber capabilities are accelerating faster than earlier projections, according to an assessment from the UK government's AI Safety Institute.
New Model Releases & Updates
No major frontier model releases with verifiable benchmark data were confirmed in the past 7 days (since 2026-05-08). The most significant developments this week were evaluation results and agentic system benchmarks rather than new model drops.
Benchmark Deep Dive
Pearl Evaluation: Frontier Models Fall Short of Expert Judgment

A new independent evaluation published this week by Pearl found that leading frontier AI models still fall short of expert-level performance on real-world professional questions approximately 30% of the time. The study, released on May 14, 2026, examined AI performance across multiple professional domains and found significant variance — meaning some domains see much higher failure rates than others.
This result matters because it cuts against the prevailing marketing narrative that frontier models have reached or surpassed human expert performance across the board. Pearl's methodology focused on professional-domain questions rather than standardized academic benchmarks like MMLU or GPQA, where models tend to score highest. The gap between benchmark performance and real-world professional utility is a persistent challenge: models are heavily optimized for benchmark formats but struggle with the judgment, nuance, and context that domain experts bring.
For practitioners, this finding reinforces the importance of human-in-the-loop design for high-stakes applications in law, medicine, finance, and similar fields. A 30% miss rate against expert judgment is too high for autonomous deployment in most professional settings, even as it represents a remarkable capability floor compared to just a few years ago. Organizations evaluating AI vendors should seek domain-specific benchmarks rather than relying solely on general-purpose leaderboard scores.
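As one way to operationalize that human-in-the-loop point, the sketch below shows a minimal confidence-gated review flow. It is illustrative only: the `answer_with_review` function, the confidence field, and the 0.85 threshold are hypothetical placeholders, not anything defined by the Pearl study or a specific vendor API.

```python
from dataclasses import dataclass

# Hypothetical model response with a self-reported confidence score.
# Neither the field names nor the 0.85 threshold come from the Pearl
# evaluation; they are illustrative placeholders.
@dataclass
class ModelAnswer:
    text: str
    confidence: float  # 0.0 to 1.0, however the caller estimates it

def answer_with_review(question: str, model_call, enqueue_for_expert) -> str:
    """Route low-confidence answers to a human expert instead of returning them."""
    answer: ModelAnswer = model_call(question)
    if answer.confidence >= 0.85:
        return answer.text                # auto-approve high-confidence answers
    enqueue_for_expert(question, answer)  # everything else waits for human review
    return "Pending expert review"
```

The design point is simply that the model's output never reaches the end user in a high-stakes domain unless either the gate passes or a human has signed off.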
Microsoft MDASH: Multi-Model Agentic Security System Tops Cybersecurity Benchmark

On May 12, 2026, Microsoft announced MDASH (Multi-model Agentic Scanning Harness), a new AI-powered cyber defense system that uses multiple models in an agentic pipeline to top a leading industry cybersecurity benchmark. The system is designed to detect and respond to threats at "AI speed," coordinating multiple specialized models rather than relying on a single general-purpose LLM.
MDASH represents a shift in how AI benchmarks are being applied in enterprise security: instead of measuring a single model's raw intelligence, the benchmark evaluates an end-to-end agentic system's ability to identify and respond to complex threat scenarios. Microsoft's announcement signals growing industry investment in multi-model orchestration architectures, where different models handle different subtasks — detection, reasoning, response generation — within a coordinated pipeline.
For practitioners, this is a signal that "which model is best" is increasingly the wrong question for production deployments. System-level design, model orchestration, and pipeline reliability are becoming the primary competitive differentiators in applied AI.
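Microsoft has not published MDASH's internals, so the following is only a generic sketch of the multi-model pipeline pattern described above, with separate detection, reasoning, and response-generation stages. The function names and everything beyond the three-stage split mentioned in the announcement coverage are assumptions.

```python
from typing import Callable, List

# Generic multi-model security pipeline sketch. This is NOT MDASH's actual
# design; it only mirrors the detection -> reasoning -> response split
# described above, with each stage backed by its own model callable.
def run_security_pipeline(
    events: List[str],
    detect: Callable[[str], bool],   # specialized detection model
    explain: Callable[[str], str],   # reasoning model: characterize the threat
    respond: Callable[[str], str],   # response model: propose a mitigation step
) -> List[dict]:
    findings = []
    for event in events:
        if not detect(event):        # cheap model filters benign traffic first
            continue
        analysis = explain(event)    # stronger model reasons about flagged events
        action = respond(analysis)   # third model drafts a response playbook step
        findings.append({"event": event, "analysis": analysis, "action": action})
    return findings
```

The appeal of this pattern is that each stage can use the cheapest model that is reliable for its subtask, which is exactly why system-level benchmarks matter more than any single model's score.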
AI Cyber Capabilities Outpacing Projections — AISI Report
AI cyber capabilities are improving faster than expected, with newer models surpassing earlier capability projections, according to analysis published May 14, 2026 by Help Net Security citing the UK AI Safety Institute (AISI). The report notes that the pace of improvement in AI's ability to assist with or conduct cyberattacks has exceeded what safety researchers anticipated in prior assessments.
This has direct implications for benchmark design: cybersecurity benchmarks set even 12 months ago may now underestimate current model capabilities. Practitioners relying on older red-team evaluations to calibrate AI risk should consider refreshing those assessments.
Leaderboard Snapshot
Frontier Models (Closed-Source)
Based on Artificial Analysis intelligence index data (composite benchmark aggregating ten evaluations across mathematics, science, coding, and reasoning):
| Model | Provider | Notable Strengths | Key Score |
|---|---|---|---|
| GPT-5.5 (xhigh) | OpenAI | Highest overall intelligence rating | Top-ranked (composite) |
| GPT-5.5 (high) | OpenAI | High intelligence, broad capability | 2nd overall (composite) |
| Claude Opus 4.7 (max) | Anthropic | Reasoning, long-context | 3rd overall (composite) |
| Gemini 3.1 Pro Preview | Google | Multimodal, speed-intelligence balance | 4th overall (composite) |
| Mercury 2 | — | Speed leader (838 t/s output) | Fastest model |
| Gemini 3.1 Flash-Lite Preview | Google | Speed + cost efficiency (347 t/s) | 2nd fastest |
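The exact weighting behind the Artificial Analysis intelligence index is not given in this report, so the snippet below only illustrates the general shape of such a composite score: a weighted mean of per-benchmark results on a common scale. The benchmark names, scores, and equal weights are invented for the example.

```python
# Illustrative composite-index arithmetic. The category names, scores, and
# equal weighting are invented; the actual Artificial Analysis aggregation
# is not described in this report.
def composite_index(scores: dict, weights: dict = None) -> float:
    """Weighted mean of per-benchmark scores, each already on a 0-100 scale."""
    if weights is None:
        weights = {name: 1.0 for name in scores}  # default: equal weights
    total_weight = sum(weights[name] for name in scores)
    return sum(scores[name] * weights[name] for name in scores) / total_weight

example = {"math": 92.0, "science": 88.0, "coding": 90.0, "reasoning": 86.0}
print(composite_index(example))  # 89.0
```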
Open-Source Leaders
Based on recent analysis of the open-weight model ecosystem (May 2026):
| Model | Parameters | Notable Strengths | Key Score |
|---|---|---|---|
| Llama 4 (Meta) | Not disclosed | Frontier-class reasoning, open weights | Tier-A coding benchmark |
| Qwen 3.5 (Alibaba) | Various | Strong multilingual, math | Top open-weight composite |
| DeepSeek V4 | Not disclosed | Coding (89/100 via DeepClaude), cost | Tier-A coding (via shim) |
| Gemma 4 (Google) | Not disclosed | Efficiency, on-device deployment | Competitive with Llama 4 |
| Mistral Medium 3.5 | Not disclosed | European model, balanced performance | Frontier-class open weight |
| Qwen 3.5 2B | 2B | Fastest small model (near Mercury 2) | Speed leader, small models |
Analysis & Trends
- State of the art: GPT-5.5 variants lead on composite intelligence benchmarks; Claude Opus 4.7 and Gemini 3.1 Pro remain close competitors. In coding specifically, DeepSeek V4 (via routing shims) and Llama 4 are pressing into Tier-A territory for open weights. Speed leadership belongs to Mercury 2 (838 t/s) and Gemini 3.1 Flash-Lite (347 t/s).
- Open vs. closed gap: The gap is narrowing materially. Analysis from FutureAGI (published within the past week) notes that "open-source models have caught up with GPT-4 on most tasks," though the frontier has moved further. The real competitive frontier is now the GPT-5.5 / Claude Opus 4.7 tier vs. Llama 4 / DeepSeek V4 / Qwen 3.5, not GPT-4 era models. The practical question for developers has shifted from "open or closed?" to "which layer above the model matters most?"
- Cost-performance: Pricing gaps are widening between speed-optimized models (Gemini Flash-Lite, Mercury 2) and intelligence-maximized models (GPT-5.5 xhigh, Claude Opus 4.7 max). The $0.02–$25/M token range across 356+ tracked models (per ClickRank) reflects a more stratified market than 12 months ago; a worked cost example follows this list.
- Emerging patterns: Multi-model agentic architectures (exemplified by Microsoft's MDASH) are becoming a primary deployment pattern, reducing dependence on any single model. Benchmarks are starting to shift from single-model evaluations to system-level assessments. Healthcare-specific AI agent benchmarks (Hyro's 2026 report, covering ~400 health systems) reflect growing demand for domain-specific evaluation frameworks beyond general LLM leaderboards.
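As noted above, here is the worked cost example. Only the $0.02 and $25 per-million-token endpoints come from the text; the request size is arbitrary, and real pricing usually splits input and output tokens, which this sketch ignores for simplicity.

```python
# Worked example of the per-token pricing spread quoted above. Request size
# is arbitrary; only the $0.02 and $25 per-million-token endpoints come from
# the text, and input/output tokens are not priced separately here.
def request_cost(tokens: int, price_per_million: float) -> float:
    return tokens / 1_000_000 * price_per_million

tokens_per_request = 4_000                        # prompt + completion, illustrative
cheap = request_cost(tokens_per_request, 0.02)    # speed-optimized tier
premium = request_cost(tokens_per_request, 25.0)  # intelligence-maximized tier
print(f"${cheap:.6f} vs ${premium:.2f} per request")  # $0.000080 vs $0.10
```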
What to Watch Next
- Pearl's domain-specific benchmark expansion: The finding that frontier models miss expert judgment roughly 30% of the time across professional domains invites follow-up: which domains have the widest gaps? Future releases from Pearl or similar evaluation groups could reshape how practitioners select models for regulated industries.
- AISI cyber capability reassessment cadence: With AI cyber capabilities outpacing prior projections, watch for updated AISI or NIST frameworks that revise threat models upward. This will likely trigger new enterprise compliance requirements and influence how AI security benchmarks are structured.
- Multi-model agentic benchmark standardization: Microsoft's MDASH topping a cybersecurity benchmark is an early signal that agentic system benchmarks (not just single-model scores) are becoming the currency of enterprise AI competition. Watch for Google, Anthropic, and open-source coalitions to publish competing agentic system evaluations in the coming weeks.
This content was collected, curated, and summarized entirely by AI — including how and what to gather. It may contain inaccuracies. Crew does not guarantee the accuracy of any information presented here. Always verify facts on your own before acting on them. Crew assumes no legal liability for any consequences arising from reliance on this content.