Search conversations... /
New chat C
Collapse Tab
ChatCChat freely with AI and ask it to do anything
MessengersMTalk to AI through KakaoTalk, Telegram, and other messengers
My FilesFAll your uploaded files and AI-generated results in one place
InsightsIYour personal assistant that catches what matters in your schedule and inbox
MissionsAGoals you hand off to Crew — sub-agents plan, decide, and act with your approval.
MailEAI analyzes your email and even drafts replies for you
TodoTOrganize tasks and let AI track deadlines for you
SignalsSAI collects global news around the clock and organizes by topic for you
ToolsManage the tools and integrations AI can use
MarketplaceInstall skills and tools created by others
ActivitySee a full log of everything AI did for you
BillingCheck your plan and credit usage
SettingsManage account, connections, notifications and more

No conversations yet

Guide
FeedSignalsMy Subscriptions
Create Signal
Agent Harness Engineering Tech Report

Agent Harness Engineering Weekly Report — 2026-05-16

  1. Signals
  2. /
  3. Agent Harness Engineering Tech Report

Agent Harness Engineering Weekly Report — 2026-05-16

Agent Harness Engineering Tech Report|May 16, 2026(2h ago)31 min read9.3AI quality score — automatically evaluated based on accuracy, depth, and source quality
0 subscribers

This week's standout developments in agent harness engineering include OpenAI's public release of Symphony, an open-source spec for Codex orchestration, and Anthropic's deep dive into agent evaluation methodology. Anthropic exposed scoring bugs in CORE-Bench (floating-point rounding mismatches, ambiguous task specs, non-reproducible stochastic tasks) affecting Opus 4.5's initial 42% score, highlighting how benchmark reliability directly impacts harness design decisions. The TraceSafe paper introduced a new approach: guarding entire multi-step tool-call trajectories rather than isolated tool invocations. GitHub's `awesome-harness-engineering` repository is gaining rapid traction, documenting self-modifying harness patterns where agents refine their own prompts, tool selection, and strategies based on execution history.

Agent Harness Engineering Weekly Report — 2026-05-16

Scope note: This report covers AI Agent Harness Engineering — the software scaffolding, orchestration frameworks (LangGraph, DSPy, CrewAI, AutoGen, Claude Agent SDK, OpenAI Agents SDK), tool-use patterns, guardrails, memory systems, and evaluation infrastructure for production LLM agents. It is NOT about physical wire harnesses, cabling, or automotive electrical systems.


This Week's Headlines

  • OpenAI releases Symphony, an open-source spec for Codex orchestration — Built on Codex CLI and GPT-5-generated initial scaffolds, Symphony provides a public spec and repository allowing developers to build custom orchestration layers tailored to their own environments.
  • Anthropic publishes deep analysis of AI agent Evals methodology — Opus 4.5 initially scored 42% on CORE-Bench, but investigation uncovered multiple root causes: floating-point tolerance bugs ("96.12" vs "96.124991…"), ambiguous task specs, and irreproducible stochastic tasks. The findings underscore how evaluation infrastructure directly shapes harness design.
  • TraceSafe paper proposes trajectory-level guardrails for multi-step tool calls — Rather than guarding single tool invocations, TraceSafe-Bench introduces a benchmark for detecting risky behavior mid-trajectory before the agent reaches final output, surfacing blind spots in prior approaches like MCP-Guard.
  • awesome-harness-engineering repository gains rapid GitHub traction — A curated Awesome list covering self-modifying harness patterns (where agents auto-correct their own prompts, tools, and strategies), MCP, permissions, observability, and orchestration, now drawing significant community attention just four days after launch.

Framework & Tooling Updates


OpenAI Symphony — Open-Source Spec for Codex Orchestration

  • What's new: OpenAI has published Symphony, an open-source spec for orchestrating Codex-based agents. Developers can point Symphony specs and repositories directly at their own coding agents to generate environment-specific orchestration versions. The initial scaffold was generated by Codex CLI + GPT-5 and includes repository structure, CI configuration, and package manager setup templates.
  • Why it matters: Just as "harness engineering" blog posts became reference implementations for many developers' repository scaffolding, Symphony has the potential to become a common foundation for communities to customize orchestration layers to their own workflows. This represents a tangible reference implementation for standardizing agent harness discussions.
  • Migration notes: Teams currently using Codex CLI can reference the Symphony spec to instruct their coding agents to generate environment-specific versions. If you have hardcoded orchestration logic, now is a good time to consider transitioning to a spec-driven architecture.

OpenAI Symphony open-source orchestration spec announcement
OpenAI Symphony open-source orchestration spec announcement


Anthropic Claude Agent SDK — Harness Design Principles Update for Long-Running Agents

  • What's new: Anthropic published an engineering post deeply analyzing agent evaluation (Evals). The post reveals how Opus 4.5's initial CORE-Bench score of 42% was artificially depressed not by actual model capability but by three independent problems: scoring logic bugs (floating-point rounding differences), ambiguous task specs, and non-reproducible stochastic task combinations. The Claude Agent SDK also emphasizes new context management features, including compaction, to prevent context depletion during extended task execution.
  • Why it matters: This is a rare public post showing how dramatically benchmark numbers can diverge from actual agent performance. Additionally, Anthropic's description of reducing harness complexity after Opus 4.6 release validates the principle that "as models grow stronger, scaffolding should simplify."
  • Migration notes: If you're designing your own agent benchmarks, immediately audit your scoring logic for floating-point tolerance issues, task spec ambiguity, and reproducibility guarantees. External benchmark scores deserve healthy skepticism.

Research & Evaluation


TraceSafe: LLM Guardrails for Multi-Step Tool-Call Trajectories

  • Authors / Org: TraceSafe research team (arXiv 2604.07223, April 2026)
  • Core finding: Prior guardrail research (MCP-Guard, etc.) evaluated only single tool-call safety. In reality, risk is embedded across entire execution trajectories. TraceSafe-Bench introduces a standardized method to evaluate whether agents can detect policy violations mid-trajectory—before reaching final output—and halt execution.
  • Implication for harness design: Single tool-result validation is insufficient. Harness architecture must include a "mid-trajectory interception" layer that tracks full execution history, detects policy violations at intermediate states, and can interrupt if needed. This is especially critical for agents performing code execution, filesystem access, or chained external API calls.

The New Bottleneck in AI Evals: From Compute Cost to Evaluation Cost

  • Authors / Org: HuggingFace (HuggingFace Blog)
  • Core finding: ResearchGym, accepted to ICLR 2026, lets agents perform actual ML research (39 subtasks based on ACL, ICLR, ICML papers). The analysis reveals that agent evaluation itself is becoming a new computational bottleneck. The cost to run increasingly complex agent benchmarks now approaches model training costs.
  • Implication for harness design: Treat evaluation infrastructure as a first-class harness component. Bake evaluation cost optimization—parallel execution, result caching, lightweight proxy evaluations—into your architecture from day one.

DKnownAI Guard vs. AWS/Azure/Lakera: Comparative Eval of AI Agent Security Guardrails

  • Authors / Org: arXiv 2604.24826 (April 2026)
  • Core finding: Comparative evaluation of DKnownAI Guard, AWS Bedrock Guardrails, Azure Content Safety, and Lakera Guard in AI agent security scenarios shows detection rates, false-positive rates, and latency profiles vary dramatically across products. No single guardrail covers all risks.
  • Implication for harness design: Production agent harnesses should not rely on a single guardrail vendor. Instead, adopt a "defense-in-depth" approach: layer risk-type-specific guardrails in tandem.

Production Patterns & Practitioner Insights


Self-Modifying Harness: Agents Evolve Their Own Scaffolding

  • Context: Documented in the awesome-harness-engineering repository (ai-boost/awesome-harness-engineering), this pattern enables agents to refine their own prompts, tool selection, and strategies based on execution history—a "meta harness" concept.
  • Problem: Statically designed harnesses quickly become stale after model upgrades or task distribution shifts. Manual harness tuning becomes an operational burden.
  • Solution / Takeaway: Structure agent execution logs to capture "what failed and why" as metadata. On each subsequent run, feed this back into harness configuration (system prompts, tool priorities, retry policies). Critically, build in safety guardrails to prevent the feedback loop from diverging infinitely—cap how far the harness can auto-modify.

awesome-harness-engineering GitHub repository
awesome-harness-engineering GitHub repository


Anthropic's Lesson: As Models Strengthen, Simplify the Harness

  • Context: Anthropic's engineering team shared a case study on harness design for long-running apps, documenting how upgrading from Opus 4.5 to 4.6 led them to intentionally reduce harness complexity.
  • Problem: Complex scaffolding built for Opus 4.5—explicit retry logic, fine-grained context management, multi-stage planning prompts—actually hindered Opus 4.6's autonomous reasoning.
  • Solution / Takeaway: When model capability improves, don't keep the harness unchanged. Instead, remove scaffolding that's no longer needed to let the stronger model exercise more autonomy. Adopt the design principle: "harness complexity should be inversely proportional to model capability." Mandate harness re-evaluation every time you upgrade models.

Seven-Framework Showdown: Framework Choice Swings Performance 30 Points

  • Context: Uvik Software engineering team compared LangGraph, CrewAI, OpenAI Agents SDK, and Google ADK in production using the same model.
  • Problem: The assumption that "any framework will do once you pick the model" became the root cause of real project failures. Using identical models, framework choice alone created performance swings of 30+ points.
  • Solution / Takeaway: Frameworks are not mere "model wrappers"—they are core variables in the agent reasoning loop. Internal implementation differences (context passage method, tool result serialization, multi-agent message passing overhead) directly affect benchmark scores. Before committing to a framework, run small A/B tests on your domain-specific tasks.

Trending OSS Repositories

  • ai-boost/awesome-harness-engineering — Comprehensive Awesome list spanning AI agent harness tools, patterns, Evals, memory, MCP, permissions, observability, and orchestration. The self-modifying harness pattern is attracting significant attention and rapid growth.
  • VoltAgent/awesome-ai-agent-papers — Curated collection of 2026 research papers in agent engineering, memory, evaluation, workflows, and autonomous systems. Useful for practitioners tracking latest academic trends.
  • masamasa59/ai-agent-papers — Biweekly-updated collection including papers like "Building Effective AI Coding Agents for the Terminal: Scaffolding, Harness, Context Engineering, and Lessons Learned."

Deep Dive: Anthropic's Agent Evals Methodology — The Benchmark Reliability Crisis and Its Harness Design Implications

This week's most critical single topic is Anthropic's deep analysis of agent evaluation methodology. On the surface, it's a case study of Opus 4.5's CORE-Bench scoring, but it directly addresses a crisis affecting every production agent team: the reliability of evaluation infrastructure itself.

The key discovery: When Anthropic investigated why Opus 4.5 scored only 42% on CORE-Bench, they found three independent problems coexisting: ① scoring logic treated "96.12" and "96.124991…" as different answers due to floating-point tolerance bugs, ② task specs were ambiguous enough that the "correct" output wasn't defined, and ③ some tasks produced different results on every run (stochasticity). After fixing these issues, actual performance improved significantly.

This isn't just an Anthropic internal detail. It carries two fundamental implications for agent harness engineering.

First, benchmark scores should never blindly guide harness design. Many teams pick models and build harnesses based on SWE-bench, GAIA, CORE-Bench rankings. But if scoring logic contains bugs, specs are ambiguous, or tasks aren't reproducible, benchmark ranks can wildly diverge from production performance. Harness teams must always supplement external benchmarks with a domain-specific internal evaluation set.

Second, the Opus 4.6 lesson on "harness simplification". Anthropic engineers discovered that complex scaffolding designed for 4.5 actually limited 4.6's performance, so they deliberately simplified it. This demonstrates the necessity of model-harness co-evolution. A fixed harness designed for an older model can inadvertently cap a newer model's gains.

Practical recommendations: When building agent evaluation pipelines, ① explicitly specify floating-point tolerance, ② formalize expected output format for each task, ③ replace stochastic tasks with seeded or deterministic alternatives, and ④ make harness complexity re-evaluation mandatory every time you upgrade models.


What to Watch Next Week

  • Symphony spec adoption by the community — Now that Symphony is public, expect community forks and custom implementations next week. Watch how teams currently on LangGraph or CrewAI integrate or compare against Symphony.
  • TraceSafe-Bench public leaderboard launch — If TraceSafe moves from arXiv to an open leaderboard or toolkit release, it could become the standard benchmark for multi-step tool-call guardrails. Monitor related GitHub repositories for updates.
  • LangGraph MCP/A2A native support roadmap — LangGraph currently lacks native Model Context Protocol (MCP) and Agent-to-Agent Protocol (A2A) support, relying on community integrations. LangChain's official stance or PR progress may be announced this month.

Reader Action Items

  • Immediately audit your own agent benchmark scoring logic — Like Anthropic's case, check for floating-point tolerance issues, task spec ambiguity, and reproducibility gaps. If you're using external benchmark scores for internal decisions, re-validate them now.
  • Make harness complexity re-evaluation mandatory on every model upgrade — When switching to GPT-5, Opus 4.6, or any new model, don't reuse your old scaffolding unchanged. Create a checklist asking, "Is this layer still needed with the new model?" for each component.
  • Add trajectory-level guardrails to your harness — Following TraceSafe's recommendation, move beyond single-tool validation. Add a layer that tracks full execution history, detects mid-trajectory policy violations, and can halt if needed. Make this a sprint goal.
  • Run domain-specific framework A/B tests before committing — Since framework choice alone can shift performance by 30 points, run a small experiment comparing LangGraph, CrewAI, and OpenAI Agents SDK on your 5–10 core tasks before migrating or starting a new project.

This content was collected, curated, and summarized entirely by AI — including how and what to gather. It may contain inaccuracies. Crew does not guarantee the accuracy of any information presented here. Always verify facts on your own before acting on them. Crew assumes no legal liability for any consequences arising from reliance on this content.

Explore related topics
  • QOpenAI Symphony가 실제 개발 생산성에 미칠 영향은?
  • QAnthropic의 평가 방법론이 표준 벤치마크에 주는 변화는?
  • QTraceSafe-Bench를 도입한 안전한 하네스 설계법은?
  • QSelf-modifying harness의 보안 위험과 방어책은?

Powered by

CrewCrew

Sources

Want your own AI intelligence feed?

Create custom signals on any topic. AI curates and delivers 24/7.