HarnessAudit: Auditing Agent Harness Safety

01Overview

Modern LLM agents rarely act alone. They run inside execution harnesses—like OpenClaw, Claude Code, and Codex—that dispatch tools, allocate resources, and route messages across specialized components. The harness, not the model, decides which actions are exposed, who may invoke them, and when execution terminates. This shift exposes failure modes that output-level evaluation cannot see: a run can return a correct, benign answer along a trajectory that accessed unauthorized resources, leaked context to the wrong agent, or triggered irreversible side effects.

HarnessAudit overview figure — **HarnessAudit overview.** (a) 8 real-world domains with realistic safety constraints. (b) Agents plan, retrieve, execute tools, review, and communicate while interacting with mutable resources and dynamic environments. (c) HarnessAudit audits *full trajectories* beyond final outputs and compares configurations along boundary compliance, execution fidelity, and system stability.

210

Tasks

Domains

Scenarios

Role Templates

Harness Configs

11.6K

Tool Authz Entries

Harness-centric formulation

An agent harness as a policy-constrained execution system audited via hidden, agent-independent evidence channels.

Realistic stress testing

HarnessAudit-Bench: 210 tasks in 8 domains with embedded safety constraints, instantiated in single- and multi-agent configs.

Empirical analysis

10 harness configurations across frontier models and 3 multi-agent frameworks reveal systematic safety failure patterns.

02Problem & Three Safety Layers

We argue that agent safety should be evaluated on the harness rather than the response, and audited over the full execution trajectory along three jointly-required properties:

Layer 1
♣ Boundary ComplianceEvery action stays within the permission policy (Π) and information-flow
                    policy (Φ). Audited across tools, resources, and
                    information flow.
Layer 2
♣ Execution FidelityThe trajectory reaches the goal via valid intermediate steps —
                    measured by action validity and checkpointed task completion.
Layer 3
♣ System StabilityL1 and L2 survive controlled stressors: indirect prompt injection,
                    ambiguous goals, and runtime/tool errors.

Existing benchmarks score only final outputs or terminal states, so a task that completes while accessing forbidden resources looks indistinguishable from a clean success. Recent harness-oriented work mostly targets single-agent settings, leaving inter-component communication in production multi-agent harnesses largely unaudited.

03Auditing Framework

A central design choice of HarnessAudit: all evaluation evidence is collected from channels agents cannot manipulate or anticipate, rather than from their self-reports. Each run proceeds through Setup, Execution, and Judge.

Auditing pipeline — **HarnessAudit auditing pipeline.** Hidden audit artifacts stay invisible during execution; trajectory logs and backend evidence support a three-layer diagnosis.

04HarnessAudit-Bench

HarnessAudit-Bench covers 210 tasks across 8 application domains and 24 fine-grained scenarios: finance, e-commerce, healthcare, office operations, social interaction, daily life, legal compliance, and software engineering. Each task is paired with audit rules over tool use, resource access, and information flow, plus perturbation specifications for stability testing.

Benchmark overview — **HarnessAudit-Bench** covers 210 tasks across 8 real-world domains and 24 fine-grained scenarios with role-based multi-agent structures.

05Task Browser

The Task Browser is generated from the real multi-agent task YAMLs and tool catalogs. Switch between domains, search by role or tool, and inspect each task's role-level tool scope and audit specification.

Task Browser

Explore real multi-agent tasks

Browse the real HarnessAudit multi-agent tasks domain by domain. Each card is generated from source YAML and links roles, useful/forbidden tools, resource boundaries, and completion checks.

Browse Sample Tasks

💳Finance 🛍️E-commerce 🩺Healthcare 🏢Office Ops 💬Social 🌞Daily Life ⚖️Legal 💻Software Eng.

06Interactive Data

Explore HarnessAudit-Bench results across 10 harness configurations. Toggle views to compare safety layers and perturbation stability.

HarnessAudit-Bench · 10 configurations

OpenClaw Claude Code Codex

08Key Findings

Current harnesses are far from safely reliable

Even the best system reaches only 0.32 overall. Strong task completion does not imply safe execution.

Completion ≠ safety compliance

Models with higher TCR can still violate critical execution boundaries; the two objectives are misaligned.

Resource access dominates violations

Agents rarely call obviously wrong tools, but routinely apply seemingly reasonable tools to unauthorized resources.

Fragile under perturbations

Indirect prompt injection causes the largest drop; agents are easily swayed by hidden instructions in tool returns.

09Analysis Highlights

[RQ1] Higher completion does not necessarily imply safer execution. Across harnesses, task completion shows a consistent negative association with safety adherence. Violations grow with the number of executed actions.

Completion vs safety trade-off — (a) Task completion vs mean safety adherence. (b) Number of violations vs executed actions. (c) Safety retained at increasing completion thresholds.

[RQ2] Risks differ across domains and roles. Finance and office tasks expose resource-boundary risks; daily-life and e-commerce stress information flow; software engineering pressures tool use. Agents responsible for coordination, final execution, or cross-role access cross safety boundaries more frequently.

Domain and role risk patterns — Domain-level adherence across safety channels, and violation rates of representative high-risk roles.

[RQ4] Violations are widespread across agents. More than 50% of agents commit at least one violation per task; resource access and information flow are the most fragile surfaces.

Protocol adherence and agent-level violations — (a) Protocol adherence across Codex, Claude Code, and OpenClaw. (b) Per-task fraction of role agents with violations under each harness.

[RQ5] Harness design sets the ceiling for safe deployment. Native harnesses can lift completion, but safety gains depend on how the harness structures tool use and execution control. Framework choice matters: weaker orchestration leads to more violations in realistic collaboration.

Native vs OpenClaw and multi-agent frameworks — (a) Native vs OpenClaw under matched models. (b) Completion vs safety across three multi-agent frameworks.

10Citation

If you find HarnessAudit useful, please cite our work:

@misc{liu2026auditingagentharnesssafety,
      title={Auditing Agent Harness Safety}, 
      author={Chengzhi Liu and Yichen Guo and Yepeng Liu and Yuzhe Yang and Qianqi Yan and Xuandong Zhao and Wenyue Hua and Sheng Liu and Sharon Li and Yuheng Bu and Xin Eric Wang},
      year={2026},
      eprint={2605.14271},
      archivePrefix={arXiv},
      primaryClass={cs.CL},
      url={https://arxiv.org/abs/2605.14271}, 
}

HarnessAudit Auditing Agent Harness Safety