How an Agent Eval Grading Bug Skewed a Performance Score by 58 Percentage Points

Agent Harness Engineering Tech Report | May 23, 2026 | 23 min read

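This week in agent harness engineering, an in-depth analysis of Anthropic's agent evaluation (eval) methodology drew attention. In particular, the discovery of grading errors affecting Opus 4.5 on CORE-Bench resonated strongly with practitioners. The GitHub repository `ai-boost/awesome-harness-engineering` was updated 2 days ago and is gaining notice as a resource that catalogs agent harness design patterns and the concept of self-modifying harnesses. Vercel's real-world finding that "cutting the number of tools by 80% is more effective than a model upgrade" is also being widely cited in the community.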

Agent Harness Engineering Weekly Report: 2026-05-23

Scope note: This report covers AI Agent Harness Engineering — the software scaffolding, orchestration frameworks (LangGraph, DSPy, CrewAI, AutoGen, Claude Agent SDK, OpenAI Agents SDK), tool-use patterns, guardrails, memory systems, and evaluation infrastructure for production LLM agents. It is NOT about physical wire harnesses, cabling, or automotive electrical systems.

This Week's Headlines

Anthropic exposes hidden eval pitfalls — Opus 4.5 initially scored 42% on CORE-Bench, but researchers uncovered multiple flaws: rigid numerical grading, ambiguous task specifications, and non-reproducible stochastic tasks that distorted actual performance by tens of percentage points.
ai-boost/awesome-harness-engineering updated 2 days ago — Now includes 110+ papers and 23 systems covering meta-harness patterns, where agents self-modify prompts, tools, and strategies based on execution history.
Gloriaameng/Awesome-Agent-Harness documents the tool minimization principle — Vercel reduced tools by 80% and saw performance gains exceeding any model upgrade; researchers also found that schema-first tool contracts prevent interface misuse but fail against semantic misuse (Sigdel & Baral, 2026).
masamasa59/ai-agent-papers adds terminal coding agent harness paper — "Building Effective AI Coding Agents for the Terminal: Scaffolding, Harness, Context Engineering, and Lessons Learned" introduces a 5-layer safety architecture and registry-based tool structure.

Framework & Tooling Updates

No official version releases for LangGraph, DSPy, CrewAI, AutoGen, Claude Agent SDK, or OpenAI Agents SDK were announced during the period (post-2026-05-21).

Research & Evaluation

Demystifying Evals for AI Agents (Anthropic Engineering)

Authors / Org: Anthropic Research Team
Core finding: Opus 4.5 scored 42% on CORE-Bench initially, but Anthropic researchers discovered three structural flaws. First, rigid numerical grading rejected "96.12" when the correct answer was "96.124991…"; second, ambiguous task specifications; third, stochastic tasks with non-fixed seeds made exact reproduction impossible. Correcting these issues dramatically altered performance numbers.
Implication for harness design: Evaluation infrastructure matters as much as harness design. Production eval pipelines must include numerical tolerance thresholds, clear task specifications, and fixed environment seeds.
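
The numerical tolerance threshold mentioned above can be sketched as follows. This is a minimal illustration, not the grader used by Anthropic or CORE-Bench: the function name `grade_numeric` and the default tolerances are assumptions chosen for the example, and a real pipeline would likely set tolerances per task.

```python
import math


def grade_numeric(expected: float, actual: float,
                  rel_tol: float = 1e-3, abs_tol: float = 1e-9) -> bool:
    """Return True when `actual` is within tolerance of `expected`.

    rel_tol and abs_tol are hypothetical defaults for this sketch.
    abs_tol keeps comparisons against an expected value of 0 meaningful.
    """
    return math.isclose(expected, actual, rel_tol=rel_tol, abs_tol=abs_tol)


if __name__ == "__main__":
    # Exact string or float equality rejects a correctly rounded answer.
    print(96.124991 == 96.12)
    # A tolerance-based comparison accepts it but still rejects a wrong value.
    print(grade_numeric(96.124991, 96.12))
    print(grade_numeric(96.124991, 96.5))
```

Making the tolerance an explicit parameter keeps the grading rule visible in code review instead of buried in a string comparison.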

"Building AI Coding Agents for the Terminal" (arxiv.org/html/2603.05344v1)

Authors / Org: Terminal Coding Agent Research Team (arXiv, submitted March 2026)
Core finding: Combines registry-based tool architecture with lazy tool discovery via MCP, proposing a 5-layer safety architecture: prompt-level guardrails → schema-level tool gating (dual-agent separation) → runtime approval system → tool-level validation → user-defined lifecycle hooks.
Implication for harness design: Layered guardrails outperform single-layer defense; lazy tool discovery reduces context window waste—a critical pattern for scaling.
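
The layered-guardrail idea can be sketched as a chain of independent checks, where a tool call runs only if every layer allows it. The code below is an illustration under stated assumptions, not the paper's implementation: it covers only three of the five layers (schema-level gating, runtime approval, tool-level validation), and every name in it (`ToolCall`, `ALLOWED_TOOLS`, `NEEDS_APPROVAL`, the roles and tools) is hypothetical.

```python
from dataclasses import dataclass
from typing import Callable, List, Optional


@dataclass
class ToolCall:
    agent_role: str  # hypothetical roles: "planner" or "executor"
    tool: str
    args: dict


# Each layer returns None to allow the call, or a string reason to block it.
Layer = Callable[[ToolCall], Optional[str]]

# Hypothetical policy tables for this sketch.
ALLOWED_TOOLS = {"planner": {"read_file"}, "executor": {"read_file", "write_file"}}
NEEDS_APPROVAL = {"write_file"}


def schema_gate(call: ToolCall) -> Optional[str]:
    """Schema-level gating: a role only sees the tools assigned to it."""
    if call.tool not in ALLOWED_TOOLS.get(call.agent_role, set()):
        return f"{call.agent_role} may not call {call.tool}"
    return None


def make_approval_layer(approve: Callable[[ToolCall], bool]) -> Layer:
    """Runtime approval: sensitive tools need an explicit yes from `approve`."""
    def layer(call: ToolCall) -> Optional[str]:
        if call.tool in NEEDS_APPROVAL and not approve(call):
            return "approval denied"
        return None
    return layer


def validate_args(call: ToolCall) -> Optional[str]:
    """Tool-level validation: reject paths that climb out of the workspace."""
    path = call.args.get("path", "")
    if not isinstance(path, str) or ".." in path.split("/"):
        return "path escapes workspace"
    return None


def check(call: ToolCall, layers: List[Layer]) -> Optional[str]:
    """Run the layers in order and return the first blocking reason, if any."""
    for layer in layers:
        reason = layer(call)
        if reason is not None:
            return reason
    return None
```

Because each layer is a plain function, a failure in one check (for example, an over-permissive approval callback) is still caught by the layers after it.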

"AI Evals Are Becoming the New Compute Bottleneck" (HuggingFace Blog)

Authors / Org: HuggingFace Research Team
Core finding: ResearchGym (ICLR 2026) benchmarks agents performing real ML research, featuring 5 test tasks (39 subtasks) extracted from ACL, ICLR, and ICML papers. Eval execution cost itself has become a new compute bottleneck.
Implication for harness design: Integrating eval loops inside harnesses requires explicit cost-speed tradeoff design; subtask sampling strategy determines total eval budget.
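
One way to make the cost-speed tradeoff explicit is to derive the number of subtasks from a budget and sample them reproducibly. The sketch below assumes a flat per-subtask cost, which is a simplification; the function name `sample_subtasks`, the dollar figures, and the subtask IDs are all hypothetical.

```python
import random
from typing import List


def sample_subtasks(subtasks: List[str], budget_usd: float,
                    cost_per_subtask_usd: float, seed: int = 0) -> List[str]:
    """Pick as many subtasks as the budget allows, using a fixed seed.

    Assumes every subtask costs the same, which real evals rarely satisfy.
    """
    if cost_per_subtask_usd <= 0:
        raise ValueError("cost_per_subtask_usd must be positive")
    affordable = int(budget_usd // cost_per_subtask_usd)
    k = min(len(subtasks), affordable)
    rng = random.Random(seed)  # local RNG so the sample is reproducible
    return sorted(rng.sample(subtasks, k))


if __name__ == "__main__":
    # 39 placeholder IDs, matching the subtask count cited above.
    all_subtasks = [f"subtask-{i:02d}" for i in range(39)]
    chosen = sample_subtasks(all_subtasks, budget_usd=50.0, cost_per_subtask_usd=4.0)
    print(len(chosen), "of", len(all_subtasks), "subtasks selected")
```

Fixing the seed means two runs with the same budget score the same subset, so score changes reflect the agent rather than the sample.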

Production Patterns & Practitioner Insights

Tool Minimization Beats Model Upgrades (Vercel Case Study)

Context: The Vercel engineering team discovered this while optimizing their production AI agent environment.
Problem: More tools led to frequent misselection and context waste.
Solution / Takeaway: Cutting tools by 80% delivered performance gains exceeding any model upgrade. Least-privilege tool exposure is as critical as model quality. Schema-first tool contracts reduce interface misuse but don't prevent semantic misuse, so runtime validation layers remain necessary.
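
The gap between interface misuse and semantic misuse can be illustrated with a hypothetical refund tool. In this sketch, the schema check only confirms argument types, while a separate runtime check confirms the call makes sense against known state. The tool, the `ORDER_TOTALS` table, and all function names are assumptions made up for the example.

```python
# Hypothetical order data that a runtime validation layer could consult.
ORDER_TOTALS = {"A-100": 40.0}


def schema_ok(args: dict) -> bool:
    """Interface check: the right keys with the right types."""
    amount = args.get("amount")
    return (
        isinstance(args.get("order_id"), str)
        and isinstance(amount, (int, float))
        and not isinstance(amount, bool)  # bool is a subclass of int in Python
    )


def semantic_ok(args: dict) -> bool:
    """Semantic check: the refund must fit an existing order."""
    total = ORDER_TOTALS.get(args["order_id"])
    return total is not None and 0 < args["amount"] <= total


def validate_refund(args: dict) -> str:
    """Run the schema check first, then the semantic check."""
    if not schema_ok(args):
        return "schema violation"
    if not semantic_ok(args):
        return "semantic violation"
    return "ok"


if __name__ == "__main__":
    print(validate_refund({"order_id": "A-100", "amount": "40"}))  # wrong type
    print(validate_refund({"order_id": "A-100", "amount": 4000}))  # valid types, bad meaning
    print(validate_refund({"order_id": "A-100", "amount": 25.0}))
```

The second call is the case a schema alone cannot catch: every argument is well typed, yet the request is wrong.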

Meta-Harness: Agents Self-Modify Their Scaffolding

Context: Experimental pattern introduced in the ai-boost/awesome-harness-engineering repository.
Problem: Static harnesses prevent agents from adapting to new task types or failure modes.
Solution / Takeaway: Agents can dynamically modify their prompts, tool sets, and strategies based on execution history—the "meta-harness" pattern. Powerful but risky; whitelist-restricted modification scopes are essential to prevent runaway self-editing loops.
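
A whitelist-restricted modification scope might look like the sketch below. It is an assumption-laden illustration rather than a pattern taken from the repository: the config fields, the `MUTABLE_FIELDS` whitelist, and the per-run edit cap are all hypothetical.

```python
from dataclasses import dataclass

# Hypothetical whitelist: the only fields an agent may rewrite on its own.
MUTABLE_FIELDS = {"system_prompt_suffix", "retry_limit"}
# Hypothetical cap that stops a runaway self-editing loop.
MAX_EDITS_PER_RUN = 3


@dataclass
class HarnessConfig:
    system_prompt_suffix: str = ""
    retry_limit: int = 2
    enabled_tools: tuple = ("read_file",)  # deliberately not self-editable
    edits_applied: int = 0


def apply_self_edit(config: HarnessConfig, field_name: str, value: object) -> bool:
    """Apply an agent-proposed edit only if it is whitelisted and under the cap."""
    if field_name not in MUTABLE_FIELDS:
        return False
    if config.edits_applied >= MAX_EDITS_PER_RUN:
        return False
    setattr(config, field_name, value)
    config.edits_applied += 1
    return True
```

Here the agent can tune its retry limit or prompt suffix, but a proposal to change `enabled_tools` is rejected, and any edit after the third in a run is refused.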

Opus 4.6 Release Drives Harness Complexity Reduction (Anthropic Engineering)

Context: Anthropic engineers experienced this while building harnesses for long-running applications.
Problem: Harnesses built for Opus 4.5 became unnecessarily complex and hard to maintain.
Solution / Takeaway: After the Opus 4.6 launch, the same tasks required less scaffolding. As model capability improves, actively reduce harness complexity; make "harness complexity minimization" a regular engineering goal.

Trending OSS Repositories

ai-boost/awesome-harness-engineering — Dedicated awesome list for agent harness engineering; includes tool patterns, evals, memory, MCP, permissions, observability, and orchestration. Updated 2 days ago.
ARUNAGIRINATHAN-K/awesome-ai-agents-2026 — 300+ AI agent and framework comparison guide with benchmarks and deep dives. Created 1 week ago.
masamasa59/ai-agent-papers — Biweekly AI agent paper collection; recently added terminal coding agent harness paper.

Deep Dive: Anthropic's Agent Eval Postmortem—"Scoring Broke the Benchmark"

Anthropic's "Demystifying Evals for AI Agents" blog post tackles structural vulnerabilities in agent eval infrastructure head-on. The centerpiece: how Opus 4.5's CORE-Bench performance was measured. Initial measurement yielded 42%, but when researchers dug into the benchmark itself, three critical flaws emerged.

First, rigid numerical grading: When the correct answer was "96.124991…" and the agent returned "96.12", it was marked wrong. That rule is a poor fit for scientific computation, where an answer that agrees with the reference to the reported significant figures should count as correct.

Second, ambiguous task specifications: Multiple tasks left output format undefined, so agents producing semantically correct answers still failed grading.

Third, non-reproducible stochastic tasks: Environment seeds weren't fixed, causing identical agent actions to produce different results across runs.

The harness engineering implications are sweeping. Since benchmark scores can reflect grading logic failures rather than actual model capability, treat eval pipelines with the rigor of production harnesses. Specifically: ① set numeric comparison epsilon parameters explicitly, ② formalize task specs without ambiguity, ③ apply seed-fixing or multi-run averaging for stochastic environments.
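
The third recommendation, seed-fixing combined with multi-run averaging, can be sketched as below. The stochastic task is a stand-in: `flaky_task` and its 70% success rate are invented for the example, and `run_trials` is a hypothetical helper rather than part of any benchmark harness.

```python
import random
import statistics
from typing import Callable, Dict, List


def run_trials(task: Callable[[random.Random], float], seeds: List[int]) -> Dict:
    """Run `task` once per seed and summarize the scores.

    Each trial gets its own seeded RNG, so the same seed list always
    reproduces the same scores.
    """
    scores = [task(random.Random(seed)) for seed in seeds]
    spread = statistics.stdev(scores) if len(scores) > 1 else 0.0
    return {"mean": statistics.mean(scores), "stdev": spread, "scores": scores}


def flaky_task(rng: random.Random) -> float:
    """Hypothetical stochastic task: passes about 70% of the time."""
    return 1.0 if rng.random() < 0.7 else 0.0


if __name__ == "__main__":
    first = run_trials(flaky_task, seeds=[0, 1, 2, 3, 4])
    second = run_trials(flaky_task, seeds=[0, 1, 2, 3, 4])
    print(first["mean"], first["stdev"])
    print(first == second)  # identical seeds give identical results
```

Reporting the spread alongside the mean makes it visible when a score difference between two models is smaller than the run-to-run noise.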

This isn't unique to Anthropic. The arXiv paper "AI Agent Systems: Architectures, Applications, and Evaluation" (January 2026) lists reproducible evaluation environments as an open problem, and HuggingFace's eval bottleneck analysis reports that eval execution itself is now a compute issue. Harness architects must explicitly manage the tradeoff between running evals cheaply and quickly and running them accurately.

Anthropic's Claude Agent SDK already includes context compression for long-running tasks. Practitioner reports that Opus 4.6 requires less scaffolding for the same work suggest a key design principle: stronger models create opportunities to simplify harnesses.

What to Watch Next Week

Anthropic Claude Agent SDK context compression deep dive: Expect concrete API usage and migration guides for the Opus 4.6–based simplification patterns mentioned in long-running app harness design posts.
CORE-Bench grading fixes: If fixes for the flaws Anthropic identified are applied to the official benchmark, existing model scores may be recalibrated industry-wide, which could reshuffle rankings.
ai-boost/awesome-harness-engineering meta-harness expansion: Self-modifying harness patterns are still experimental, so community discussion of safety boundary design is likely to intensify.

Reader Action Items

Add eval grading logic to code review — Add numeric tolerance, task specification clarity, and environment reproducibility to your eval pipeline checklist. Anthropic's example shows how grading bugs can drastically skew actual performance.
Run a tool reduction experiment — Audit your agent's exposed tool list and apply least-privilege strategy by keeping only tools actually called in recent logs; measure the performance delta.
Measure harness complexity debt regularly — When your model updates, review whether each scaffolding layer is still necessary; remove outdated prompt engineering and unnecessary tool gating. Model improvements (like Opus 4.6) often unlock simplification opportunities.
Bookmark ai-boost/awesome-harness-engineering — Meta-harness patterns, MCP integration, and 5-layer safety architecture are continuously updated; use it as a practical production harness reference.
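
The tool reduction experiment in the action items above can start from a simple log audit. The sketch below assumes the call log is available as a flat list of tool names, which may not match a real logging format; the tool names and the `audit_tools` helper are hypothetical.

```python
from collections import Counter
from typing import List, Tuple


def audit_tools(exposed: List[str], call_log: List[str],
                min_calls: int = 1) -> Tuple[List[str], List[str]]:
    """Split exposed tools into (keep, drop) based on recent call counts."""
    counts = Counter(call_log)  # missing tools count as 0
    keep = [tool for tool in exposed if counts[tool] >= min_calls]
    drop = [tool for tool in exposed if counts[tool] < min_calls]
    return keep, drop


if __name__ == "__main__":
    exposed = ["search", "read_file", "write_file", "send_email", "run_sql"]
    recent_calls = ["search", "read_file", "search", "write_file"]
    keep, drop = audit_tools(exposed, recent_calls)
    print("keep:", keep)
    print("candidates to remove:", drop)
```

The drop list is only a set of candidates: a rarely used tool may still be essential, so the performance delta should be measured before and after removal.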

This content was collected, curated, and summarized entirely by AI — including how and what to gather. It may contain inaccuracies. Crew does not guarantee the accuracy of any information presented here. Always verify facts on your own before acting on them. Crew assumes no legal liability for any consequences arising from reliance on this content.
