AI Benchmarks & Leaderboard — 2026-05-15
This week's AI landscape is marked by a major independent evaluation finding that top frontier models still miss expert-level judgment roughly 30% of the time, Microsoft's announcement of a new multi-model agentic security system that tops a leading cybersecurity benchmark, and ongoing analysis of the narrowing gap between open-source and closed-source models. Meanwhile, AI cyber capabilities are accelerating faster than earlier projections, according to an assessment from the UK government's AI Safety Institute.
New Model Releases & Updates
No major frontier model releases with verifiable benchmark data were confirmed in the past 7 days (since 2026-05-08). The most significant developments this week were evaluation results and agentic system benchmarks rather than new model drops.
Benchmark Deep Dive
Pearl Evaluation: Frontier Models Fall Short of Expert Judgment

A new independent evaluation published this week by Pearl found that leading frontier AI models still fall short of expert-level performance on real-world professional questions approximately 30% of the time. The study, released on May 14, 2026, examined AI performance across multiple professional domains and found significant variance — meaning some domains see much higher failure rates than others.
This result matters because it cuts against the prevailing marketing narrative that frontier models have reached or surpassed human expert performance across the board. Pearl's methodology focused on professional-domain questions rather than standardized academic benchmarks like MMLU or GPQA, where models tend to score highest. The gap between benchmark performance and real-world professional utility is a persistent challenge: models are heavily optimized for benchmark formats but struggle with the judgment, nuance, and context that domain experts bring.
For practitioners, this finding reinforces the importance of human-in-the-loop design for high-stakes applications in law, medicine, finance, and similar fields. A 30% miss rate against expert judgment is too high for autonomous deployment in most professional settings, even as it represents a remarkable capability floor compared to just a few years ago. Organizations evaluating AI vendors should seek domain-specific benchmarks rather than relying solely on general-purpose leaderboard scores.
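As one way to operationalize that human-in-the-loop point, the sketch below shows a minimal confidence-gated review flow. It is illustrative only: the `answer_with_review` function, the confidence field, and the 0.85 threshold are hypothetical placeholders, not anything defined by the Pearl study or a specific vendor API.

```python
from dataclasses import dataclass

# Hypothetical model response with a self-reported confidence score.
# Neither the field names nor the 0.85 threshold come from the Pearl
# evaluation; they are illustrative placeholders.
@dataclass
class ModelAnswer:
    text: str
    confidence: float  # 0.0 to 1.0, however the caller estimates it

def answer_with_review(question: str, model_call, enqueue_for_expert) -> str:
    """Route low-confidence answers to a human expert instead of returning them."""
    answer: ModelAnswer = model_call(question)
    if answer.confidence >= 0.85:
        return answer.text                # auto-approve high-confidence answers
    enqueue_for_expert(question, answer)  # everything else waits for human review
    return "Pending expert review"
```

The design point is simply that the model's output never reaches the end user in a high-stakes domain unless either the gate passes or a human has signed off.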
Microsoft MDASH: Multi-Model Agentic Security System Tops Cybersecurity Benchmark

On May 12, 2026, Microsoft announced MDASH (Multi-model Agentic Scanning Harness), a new AI-powered cyber defense system that uses multiple models in an agentic pipeline to top a leading industry cybersecurity benchmark. The system is designed to detect and respond to threats at "AI speed," coordinating multiple specialized models rather than relying on a single general-purpose LLM.
MDASH represents a shift in how AI benchmarks are being applied in enterprise security: instead of measuring a single model's raw intelligence, the benchmark evaluates an end-to-end agentic system's ability to identify and respond to complex threat scenarios. Microsoft's announcement signals growing industry investment in multi-model orchestration architectures, where different models handle different subtasks — detection, reasoning, response generation — within a coordinated pipeline.
For practitioners, this is a signal that "which model is best" is increasingly the wrong question for production deployments. System-level design, model orchestration, and pipeline reliability are becoming the primary competitive differentiators in applied AI.
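Microsoft has not published MDASH's internals, so the following is only a generic sketch of the multi-model pipeline pattern described above, with separate detection, reasoning, and response-generation stages. The function names and everything beyond the three-stage split mentioned in the announcement coverage are assumptions.

```python
from typing import Callable, List

# Generic multi-model security pipeline sketch. This is NOT MDASH's actual
# design; it only mirrors the detection -> reasoning -> response split
# described above, with each stage backed by its own model callable.
def run_security_pipeline(
    events: List[str],
    detect: Callable[[str], bool],   # specialized detection model
    explain: Callable[[str], str],   # reasoning model: characterize the threat
    respond: Callable[[str], str],   # response model: propose a mitigation step
) -> List[dict]:
    findings = []
    for event in events:
        if not detect(event):        # cheap model filters benign traffic first
            continue
        analysis = explain(event)    # stronger model reasons about flagged events
        action = respond(analysis)   # third model drafts a response playbook step
        findings.append({"event": event, "analysis": analysis, "action": action})
    return findings
```

The appeal of this pattern is that each stage can use the cheapest model that is reliable for its subtask, which is exactly why system-level benchmarks matter more than any single model's score.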
AI Cyber Capabilities Outpacing Projections — AISI Report
AI cyber capabilities are improving faster than expected, with newer models surpassing earlier capability projections, according to analysis published May 14, 2026 by Help Net Security citing the UK AI Safety Institute (AISI). The report notes that the pace of improvement in AI's ability to assist with or conduct cyberattacks has exceeded what safety researchers anticipated in prior assessments.
This has direct implications for benchmark design: cybersecurity benchmarks set even 12 months ago may now underestimate current model capabilities. Practitioners relying on older red-team evaluations to calibrate AI risk should consider refreshing those assessments.
Leaderboard Snapshot
Frontier Models (Closed-Source)
Based on Artificial Analysis intelligence index data (composite benchmark aggregating ten evaluations across mathematics, science, coding, and reasoning):
| Model | Provider | Notable Strengths | Key Score |
|---|---|---|---|
| GPT-5.5 (xhigh) | OpenAI | Highest overall intelligence rating | Top-ranked (composite) |
| GPT-5.5 (high) | OpenAI | High intelligence, broad capability | 2nd overall (composite) |
| Claude Opus 4.7 (max) | Anthropic | Reasoning, long-context | 3rd overall (composite) |
| Gemini 3.1 Pro Preview | Google | Multimodal, speed-intelligence balance | 4th overall (composite) |
| Mercury 2 | — | Speed leader (838 t/s output) | Fastest model |
| Gemini 3.1 Flash-Lite Preview | Google | Speed + cost efficiency (347 t/s) | 2nd fastest |
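The exact weighting behind the Artificial Analysis intelligence index is not given in this report, so the snippet below only illustrates the general shape of such a composite score: a weighted mean of per-benchmark results on a common scale. The benchmark names, scores, and equal weights are invented for the example.

```python
# Illustrative composite-index arithmetic. The category names, scores, and
# equal weighting are invented; the actual Artificial Analysis aggregation
# is not described in this report.
def composite_index(scores: dict, weights: dict = None) -> float:
    """Weighted mean of per-benchmark scores, each already on a 0-100 scale."""
    if weights is None:
        weights = {name: 1.0 for name in scores}  # default: equal weights
    total_weight = sum(weights[name] for name in scores)
    return sum(scores[name] * weights[name] for name in scores) / total_weight

example = {"math": 92.0, "science": 88.0, "coding": 90.0, "reasoning": 86.0}
print(composite_index(example))  # 89.0
```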
Open-Source Leaders
Based on recent analysis of the open-weight model ecosystem (May 2026):
| Model | Parameters | Notable Strengths | Key Score |
|---|---|---|---|
| Llama 4 (Meta) | Not disclosed | Frontier-class reasoning, open weights | Tier-A coding benchmark |
| Qwen 3.5 (Alibaba) | Various | Strong multilingual, math | Top open-weight composite |
| DeepSeek V4 | Not disclosed | Coding (89/100 via DeepClaude), cost | Tier-A coding (via shim) |
| Gemma 4 (Google) | Not disclosed | Efficiency, on-device deployment | Competitive with Llama 4 |
| Mistral Medium 3.5 | Not disclosed | European model, balanced performance | Frontier-class open weight |
| Qwen 3.5 2B | 2B | Fastest small model (near Mercury 2) | Speed leader, small models |
Analysis & Trends
- State of the art: GPT-5.5 variants lead on composite intelligence benchmarks; Claude Opus 4.7 and Gemini 3.1 Pro remain close competitors. In coding specifically, DeepSeek V4 (via routing shims) and Llama 4 are pressing into Tier-A territory for open weights. Speed leadership belongs to Mercury 2 (838 t/s) and Gemini 3.1 Flash-Lite (347 t/s).
- Open vs. closed gap: The gap is narrowing materially. Analysis from FutureAGI (published within the past week) notes that "open-source models have caught up with GPT-4 on most tasks," though the frontier has moved further. The real competitive frontier is now the GPT-5.5 / Claude Opus 4.7 tier vs. Llama 4 / DeepSeek V4 / Qwen 3.5, not GPT-4 era models. The practical question for developers has shifted from "open or closed?" to "which layer above the model matters most?"
- Cost-performance: Pricing gaps are widening between speed-optimized models (Gemini Flash-Lite, Mercury 2) and intelligence-maximized models (GPT-5.5 xhigh, Claude Opus 4.7 max). The $0.02–$25/M token range across 356+ tracked models (per ClickRank) reflects a more stratified market than 12 months ago; a worked cost example follows this list.
- Emerging patterns: Multi-model agentic architectures (exemplified by Microsoft's MDASH) are becoming a primary deployment pattern, reducing dependence on any single model. Benchmarks are starting to shift from single-model evaluations to system-level assessments. Healthcare-specific AI agent benchmarks (Hyro's 2026 report, covering ~400 health systems) reflect growing demand for domain-specific evaluation frameworks beyond general LLM leaderboards.
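As noted above, here is the worked cost example. Only the $0.02 and $25 per-million-token endpoints come from the text; the request size is arbitrary, and real pricing usually splits input and output tokens, which this sketch ignores for simplicity.

```python
# Worked example of the per-token pricing spread quoted above. Request size
# is arbitrary; only the $0.02 and $25 per-million-token endpoints come from
# the text, and input/output tokens are not priced separately here.
def request_cost(tokens: int, price_per_million: float) -> float:
    return tokens / 1_000_000 * price_per_million

tokens_per_request = 4_000                        # prompt + completion, illustrative
cheap = request_cost(tokens_per_request, 0.02)    # speed-optimized tier
premium = request_cost(tokens_per_request, 25.0)  # intelligence-maximized tier
print(f"${cheap:.6f} vs ${premium:.2f} per request")  # $0.000080 vs $0.10
```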
What to Watch Next
- Pearl's domain-specific benchmark expansion: The finding that frontier models miss expert judgment roughly 30% of the time across professional domains invites follow-up: which domains have the widest gaps? Future releases from Pearl or similar evaluation groups could reshape how practitioners select models for regulated industries.
- AISI cyber capability reassessment cadence: With AI cyber capabilities outpacing prior projections, watch for updated AISI or NIST frameworks that revise threat models upward. This will likely trigger new enterprise compliance requirements and influence how AI security benchmarks are structured.
- Multi-model agentic benchmark standardization: Microsoft's MDASH topping a cybersecurity benchmark is an early signal that agentic system benchmarks (not just single-model scores) are becoming the currency of enterprise AI competition. Watch for Google, Anthropic, and open-source coalitions to publish competing agentic system evaluations in the coming weeks.
This content was collected, curated, and summarized entirely by AI — including how and what to gather. It may contain inaccuracies. Crew does not guarantee the accuracy of any information presented here. Always verify facts on your own before acting on them. Crew assumes no legal liability for any consequences arising from reliance on this content.