01-6 Harness Engineering: designing the shell that keeps an agent running reliably

What this unit solves

The same model behaves wildly differently depending on the execution shell around it. The harness is that shell: tool set, permission boundaries, memory, loop control, observability, kill switch. The model sets the capability ceiling; the harness determines how much of it you actually get, how stable it is, whether results can be reproduced, and whether things can be contained when they go wrong. This unit unpacks the components and design criteria of a harness so you have a clear evaluation framework when choosing or building an agent system, rather than stumbling toward a working configuration by trial and error.

Learning objectives

Explain in your own words where the boundary between “model” and “harness” lies, and name concrete problems that arise from confusing the two.
List the six components of a sound harness and use them to audit an existing tool or a self-built system.
Design sensible security boundaries: approval gate, sandbox isolation, heartbeat dead-man switch, kill switch, and know which of these cannot be skipped in which situations.
Explain why “reproducible” matters more than “clever”, and identify the minimum observability configuration that makes agent behavior reproducible.

1. What a harness is: the model sets the ceiling; the harness determines what you get

Start with a comparison: the same Claude model, driven two ways. The first is a bare loop you write yourself: send a prompt, receive a response, manually execute the tool calls the model requests, and paste the results back, with no permission checks, no step limit, and no logs of any kind. The second is Claude Code: the same model, but wrapped in permission rules, a sandbox, hooks, session persistence, and an interruptible loop. The two differ sharply in stability, reproducibility, and safety. The difference is not the model; it is the harness.

Anthropic describes the basic building block of an agentic system as “an LLM augmented with retrieval, tools, memory, and other capabilities” [1]. That framing draws the boundary:

The model owns: reasoning, generation, deciding “which tool to call next, what to remember, what to look up.”
The harness owns: how tools are actually executed, where permission boundaries sit, how state is persisted, when the loop stops, who applies the brakes when something goes wrong, and whether a record of the whole process is kept.

The Claude Agent SDK documentation is even more direct about this shell: the execution loop that powers Claude Code is itself an agent harness, with the cycle “gather context, take action, verify work, repeat,” and the SDK hands that loop directly to the developer to control tools, permissions, cost limits, and output rather than hiding it behind an abstraction layer [2].

The most common confusion is treating the system prompt as the whole harness. The system prompt belongs to L1/L2 (how to ask, what goes in the window; see 01-3 and 01-4). It can influence model tendencies, but it has no reach into the execution layer: it cannot stop a dangerous rm -rf, will not call a halt at step 50 of an infinite retry loop, and will not terminate a process when the agent locks up. Those are the harness’s responsibilities. Conflating the two leads you to believe “a well-written prompt means safety,” while the actual safety and stability gap goes completely unaddressed.

One sentence to remember the boundary

The model is responsible for “what it wants to do”; the harness is responsible for “what it’s allowed to do, how far it goes, and what happens when things break.” The first you handle by choosing a model and writing prompts; the second you handle by designing the shell. This unit covers only the second.
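To make the split concrete, here is a minimal sketch of the harness side of “how tools are actually executed.” The registry, the `execute` function, and the request shape (`{"tool": ..., "args": ...}`) are all hypothetical names invented for illustration; they are not the Claude Agent SDK's API. The point is only that the model proposes a call, and the harness decides whether and how it runs.

```python
from typing import Any, Callable, Dict

# Hypothetical registry: only tools registered here can ever be executed.
TOOLS: Dict[str, Callable[..., Any]] = {}


def tool(name: str):
    """Decorator that registers a function under a tool name."""
    def register(fn):
        TOOLS[name] = fn
        return fn
    return register


@tool("add")
def add(a: int, b: int) -> int:
    return a + b


def execute(request: dict) -> dict:
    """Harness side: run a model-proposed tool call, never trusting it blindly."""
    name = request.get("tool")
    fn = TOOLS.get(name)
    if fn is None:
        # The model asked for something the harness does not expose.
        return {"ok": False, "error": f"unknown tool: {name}"}
    try:
        return {"ok": True, "result": fn(**request.get("args", {}))}
    except Exception as exc:
        # Errors go back to the model as data instead of crashing the loop.
        return {"ok": False, "error": f"{type(exc).__name__}: {exc}"}
```

However the system prompt is worded, a request for a tool that is not in the registry cannot run; that guarantee lives in the shell, not in the prompt.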

2. The six components of a sound harness

When auditing any agent system (commercial tool or self-built), check these six components one by one and the gaps become immediately visible. The first four correspond to Anthropic’s augmented LLM and agent loop; the last two are what you must add when taking it into production.

Component 1: tool set and permission boundaries. The shape of a tool directly affects model decision quality. Anthropic recommends treating the agent-computer interface (ACI) with the same care as a human-computer interface (HCI), investing comparable effort: giving the model enough tokens to think before acting, using formats close to natural web text, and eliminating error-prone formatting overhead [1]. Beyond shape comes boundary: the files, domains, and command scopes each tool can reach determine the blast radius when something goes wrong. The principle of least privilege matters more in an agent context than in a traditional system, because the decision-maker is a model that improvises, not a program whose behavior is fixed and predictable.

Component 2: memory and state. Three layers, each with its applicable scenarios and contamination risks:

Short-term (conversation context): the conversation history within a session. The Claude Agent SDK explicitly states that the context window within a session only grows, never resets — system prompt, tool definitions, conversation, tool inputs and outputs all accumulate [2]. This is precisely the context rot discussed in 01-4.
Working memory (structured state within a session): storing intermediate results as key-value pairs or summaries rather than leaving them floating in the conversation where they dilute.
Long-term (cross-session persistence): content loaded across sessions, such as CLAUDE.md and memory files (see 04-1). This is the highest-risk layer, because contaminated memory can quietly reassemble itself in a future session.
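The working-memory layer can be sketched as a small keyed store that the harness renders into a compact block for the next request. This is an illustrative sketch only: `WorkingMemory`, `remember`, and `render` are hypothetical names, and how the rendered text is injected into the conversation is left out.

```python
import json
from dataclasses import dataclass, field


@dataclass
class WorkingMemory:
    """Structured in-session state, kept outside the conversation history."""
    facts: dict = field(default_factory=dict)

    def remember(self, key: str, value) -> None:
        # Writing to an existing key replaces the old value, so state stays
        # bounded by the number of keys instead of growing with every turn.
        self.facts[key] = value

    def render(self) -> str:
        """Compact, deterministic text block to include in the next request."""
        return json.dumps(self.facts, sort_keys=True, ensure_ascii=False)
```

Unlike the conversation window, which only grows, overwriting a key here replaces stale intermediate results rather than leaving both versions in view.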

Component 3: loop and termination conditions. An agent loop (gather context, take action, verify, repeat) without a clear exit burns tokens at best and loses control at worst. Anthropic states explicitly that at each step the agent must obtain ground truth from the environment (tool call results, program execution results) to evaluate progress, and must have built-in stopping conditions (such as an iteration cap) to “maintain control” [1]. Three exits are non-negotiable: an explicit success criterion, a maximum step count, and an error exit path.

Component 4: observability and logging. Without observability, debugging is blindfolded. Every tool call should at minimum record: timestamp, tool name, input summary, files touched, approval status (approved or blocked), and token estimate. This is the prerequisite for Component 6 (reproducibility), expanded in Section 4.

Component 5: security boundaries. Approval gate, sandbox isolation, kill switch. This is the critical gap between “I’ll try it myself” and “I’m comfortable leaving it unattended,” expanded in Section 3.

Component 6: reproducibility. Can you reconstruct “exactly what it did last time” after the fact? This includes session replayability (at a minimum, preserving the tool call sequence) and resumability (the Claude Agent SDK uses session_id to resume, restoring files read, analyses done, and actions taken [2]). Section 4 expands on this.
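The three exits named under Component 3 can be sketched as a loop skeleton. This is a minimal sketch under assumptions: `step` and `is_done` are hypothetical callables standing in for “take one action and return the observed ground truth” and “check the success criterion,” and the consecutive-error budget is one possible way to define the error exit.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class LoopResult:
    status: str        # "success", "max_steps", or "error"
    steps: int
    detail: str = ""


def run_loop(step: Callable[[int], dict],
             is_done: Callable[[dict], bool],
             max_steps: int = 20,
             max_errors: int = 3) -> LoopResult:
    """Drive an agent loop that can only end through one of three exits."""
    consecutive_errors = 0
    for i in range(1, max_steps + 1):
        try:
            observation = step(i)          # ground truth from the environment
        except Exception as exc:
            consecutive_errors += 1
            if consecutive_errors >= max_errors:
                # Exit 3: error path, taken after repeated failures.
                return LoopResult("error", i, f"{type(exc).__name__}: {exc}")
            continue
        consecutive_errors = 0
        if is_done(observation):
            # Exit 1: explicit success criterion.
            return LoopResult("success", i)
    # Exit 2: step cap reached without success.
    return LoopResult("max_steps", max_steps)
```

Returning a status rather than raising keeps the reason for stopping available to whatever logs or supervises the run.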

Audit your current tool against all six

Take Claude Code as an example and map each component: tool set and permissions (permissions in settings.json), memory (session context + CLAUDE.md + memory files), loop termination (task-completion judgment + you can interrupt any time), observability (transcript + hook logs), security boundaries (permissions + sandbox + hook gate), reproducibility (session resume). All six have counterparts; that is why it is more stable than a bare API loop. Apply the same list to your self-built agent and ask “do I have this?” for each item. The first one you can’t answer is where it will break first.

3. Security boundaries: approval gate, sandbox, heartbeat, kill switch

These four are the parts of harness design most commonly skipped out of laziness, because skipping them leaves the system looking like it still runs. The problem only surfaces at the moment something goes wrong, by which point it is usually too late.

Approval gate. Which actions proceed automatically and which require human confirmation? There is only one criterion: is the harm caused by this action reversible? Reversible actions such as reading files or running tests can be allowed through automatically. Irreversible or externally visible actions such as deleting files, git push, network egress, or writes to paths outside the repo should be held for confirmation. Claude Code’s permissions setting divides rules into three categories (allow, ask, deny), evaluated in the order deny, then ask, then allow; the first matching rule wins, so deny always takes precedence (as of 2026-05) [3].

{
  "permissions": {
    "deny": [
      "Read(./.env)",
      "Read(~/.ssh/**)",
      "Bash(curl:*)"
    ],
    "ask": [
      "Bash(git push:*)",
      "Write(~/**)"
    ]
  }
}

Approval boundaries that are too wide or too narrow are both wrong

Too wide (everything auto-approved, or using --dangerously-skip-permissions) is the same as having no security boundary at all, and is especially dangerous in unattended operation or when processing external content. Too narrow (every step requires your confirmation) strips the agent of any autonomous value: you might as well do it yourself. Divide on the single criterion of reversibility rather than blanket approval or blanket blocking.
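For a self-built agent, the deny, then ask, then allow ordering can be sketched in a few lines. This is an illustration of the evaluation order only, not Claude Code's actual matcher: the `Tool:argument` action strings, the glob patterns, and the choice of "ask" as the fallback when nothing matches are all assumptions made for this example.

```python
from fnmatch import fnmatchcase

# Hypothetical rule set; patterns are shell-style globs over "Tool:argument".
RULES = {
    "deny": ["Bash:curl*", "Read:*.env"],
    "ask": ["Bash:git push*"],
    "allow": ["Bash:*", "Read:*"],
}


def decide(action: str, rules: dict, default: str = "ask") -> str:
    """Return "deny", "ask", or "allow" for one proposed action.

    Categories are checked in a fixed order, so a deny match wins even when
    a broader allow pattern would also match the same action.
    """
    for verdict in ("deny", "ask", "allow"):
        if any(fnmatchcase(action, pattern) for pattern in rules.get(verdict, [])):
            return verdict
    return default  # nothing matched: hold for a human instead of running
```

Defaulting unmatched actions to "ask" is the conservative choice for irreversible operations; a harness that defaults to "allow" is back in the too-wide case described above.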

Sandbox isolation. Untrusted work (external repos, attachment processing, large-scale web scraping) should run inside a container or devcontainer with network egress closed by default. Note that sandbox and permissions are two complementary layers, not an either/or: permissions controls “which tools, files, and domains can be touched,” applying to all tools; the sandbox is OS-level enforced isolation, restricting the filesystem and network access of Bash commands and their child processes (as of 2026-05) [3]. Anthropic’s recommendation for taking an agent to production is “test thoroughly in a sandboxed environment with appropriate guardrails,” because agent costs are higher and errors compound [1]. The minimum one-liner for isolation:

# One-shot review of an untrusted repo: no network, no home directory mount
docker run -it --rm -v "$(pwd)":/workspace -w /workspace --network=none node:20 bash

Heartbeat (dead-man switch) and kill switch. Unattended long-running loops require a heartbeat mechanism: the agent writes a heartbeat at regular intervals, and a supervisor terminates it if no heartbeat arrives within the timeout. The key design principle is do not rely on the monitored process to report that it has stopped (it may be exactly the one that is stuck); an external supervisor must make that call and act on it. The termination signal must go to the entire process group, not just the parent process, otherwise child processes become orphans and keep running. The minimal supervisor below demonstrates both heartbeat monitoring and process-group kill:

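#!/usr/bin/env bash
# supervisor.sh: watch agent heartbeat; terminate the whole process group on timeout
# Assumes Linux (GNU stat, util-linux setsid) and that this runs as a script, not pasted into an interactive shell.
HEARTBEAT_FILE="/tmp/agent.heartbeat"
TIMEOUT=90            # treat as stuck if no heartbeat for 90 seconds
CHECK_INTERVAL=30

date +%s > "$HEARTBEAT_FILE"   # seed the heartbeat so a missing or stale file is not read as a hang

# Inside a script the background job is not a group leader, so setsid does not fork:
# $! is the agent's PID, which is also the ID of its new process group.
setsid ./run_agent.sh &
AGENT_PGID=$!

while kill -0 "$AGENT_PGID" 2>/dev/null; do
    sleep "$CHECK_INTERVAL"
    kill -0 "$AGENT_PGID" 2>/dev/null || break   # agent exited normally during the sleep
    last=$(stat -c %Y "$HEARTBEAT_FILE" 2>/dev/null || echo 0)   # mtime of the heartbeat file
    now=$(date +%s)
    if (( now - last > TIMEOUT )); then
        echo "[supervisor] heartbeat timeout, terminating process group $AGENT_PGID" >&2
        kill -TERM -- -"$AGENT_PGID"               # negative PID: signal the whole group
        sleep 5
        kill -KILL -- -"$AGENT_PGID" 2>/dev/null   # escalate if anything ignored SIGTERM
        exit 1
    fi
done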

The agent side only needs to write one heartbeat per loop iteration (date +%s > /tmp/agent.heartbeat). The value of this mechanism is not in ordinary runs; it is in the one time the agent actually locks up.
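If the agent itself is driven from Python, the same external-supervisor pattern can be sketched with the standard library. This is a POSIX-only sketch under assumptions: `beat` and `supervise` are hypothetical helper names, the heartbeat is judged by file modification time, and the grace period before escalating to SIGKILL is an arbitrary choice.

```python
import os
import signal
import subprocess
import time


def beat(heartbeat_path):
    """Agent side: call once per loop iteration to refresh the heartbeat."""
    with open(heartbeat_path, "w") as f:
        f.write(str(int(time.time())))


def supervise(cmd, heartbeat_path, timeout_s, check_interval_s, grace_s=2.0):
    """Run cmd in its own process group; kill the group if the heartbeat goes stale.

    Returns "exited" if the agent finished on its own, "killed" on timeout.
    """
    beat(heartbeat_path)  # seed, so a slow start is not mistaken for a hang
    # start_new_session=True runs setsid() in the child: its PID is its PGID.
    proc = subprocess.Popen(cmd, start_new_session=True)
    while proc.poll() is None:
        time.sleep(check_interval_s)
        if proc.poll() is not None:
            break  # finished during the sleep
        age = time.time() - os.path.getmtime(heartbeat_path)
        if age > timeout_s:
            try:
                os.killpg(proc.pid, signal.SIGTERM)   # whole group, not just the parent
            except ProcessLookupError:
                return "exited"
            try:
                proc.wait(timeout=grace_s)
            except subprocess.TimeoutExpired:
                os.killpg(proc.pid, signal.SIGKILL)   # escalate if SIGTERM was ignored
                proc.wait()
            return "killed"
    return "exited"
```

The decision to kill is made entirely outside the monitored process: the agent only ever writes heartbeats, and never has to notice that it is stuck.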

This section covers design criteria only; the threat model belongs in 03-3

This section is about which security mechanisms a harness should have and the design trade-offs involved. Concrete attack-surface analysis (prompt injection, supply chain, memory poisoning, relevant CVEs) belongs to the threat model and is covered in 03-3; it is not repeated here.

4. Observability and reproducibility: why “reproducible” is worth more than “clever”

An agent that occasionally produces impressive results but differs every time has lower engineering value than one that reliably produces 80-point output. The reason is practical: behavior that cannot be reproduced cannot be improved, and cannot be demonstrated to stakeholders. “It worked that way last time” is not an engineering answer; it is a luck report. Making behavior reproducible requires two things:

Observability (seeing after the fact what happened). The minimum configuration is the structured log fields from Component 4 in Section 2. The key is using a structured log (one JSON entry per call) rather than a full transcript — the former is cheap, queryable, and sufficient; the latter is expensive and hard to search. Observability has an underrated secondary use: anomaly pattern detection. Hijacked or out-of-control runs typically show anomalous signals in the tool call trace (suddenly reading a pile of .env files, connecting to domains it has no business reaching) before you notice anything manually.
Replayability and resumability (reconstructing state after the fact). The Claude Agent SDK’s session mechanism lets you grab a session_id and resume, restoring files previously read, analyses done, and actions taken [2]. At a minimum, preserve the tool call sequence; a full token stream is not needed.
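Both requirements above can be served by the same append-only file. The sketch below is one possible shape, not a prescribed format: the function names and the `.env` heuristic are illustrative assumptions, while the fields follow the minimum set listed under Component 4.

```python
import json
from datetime import datetime, timezone


def make_entry(session, tool, tool_input, files, approval, tokens, now=None):
    """Build one structured record for a single tool call."""
    now = now or datetime.now(timezone.utc)
    return {
        "ts": now.strftime("%Y-%m-%dT%H:%M:%SZ"),
        "session": session,
        "tool": tool,
        "input": tool_input[:200],     # a summary, not the full payload
        "files": list(files),
        "approval": approval,          # "approved" or "blocked"
        "tokens": tokens,
    }


def append_entry(path, entry):
    """Append one JSON object per line (JSON Lines)."""
    with open(path, "a", encoding="utf-8") as f:
        f.write(json.dumps(entry, separators=(",", ":")) + "\n")


def load_entries(path):
    with open(path, encoding="utf-8") as f:
        return [json.loads(line) for line in f if line.strip()]


def suspicious(entries):
    """Flag blocked calls and anything that touched a .env file."""
    return [e for e in entries
            if e["approval"] == "blocked" or any(".env" in p for p in e["files"])]


def tool_sequence(entries):
    """The minimum needed for replay: which tools ran, in order, with what input."""
    return [(e["tool"], e["input"]) for e in entries]
```

Comparing `tool_sequence` output from two runs of the same task is a cheap first check on whether behavior was actually reproduced.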

What one tool call log entry looks like

When an agent run completes, entries like the one below accumulate one per line and form its complete behavioral trace. When something goes wrong, you are not scanning thousands of lines of conversation for clues: you grep '"approval":"blocked"' or grep '\.env' and land directly on the anomaly. That is what makes a structured log more valuable than a transcript.

{"ts":"2026-06-01T10:32:11Z","session":"a1b2","tool":"Bash","input":"npm test","files":[],"approval":"approved","tokens":420}

5. What a mature harness looks like: from “it runs” to “well-designed”

The gap between “it runs” and “a well-designed harness” is clearest as a comparison table:

Dimension	It runs (barely)	Well-designed harness
Permissions	Wide open, or `--dangerously-skip-permissions`	`deny` high-risk actions, `ask` for irreversible ones, `allow` the rest
Isolation	Runs directly on the host under your personal account	Untrusted work goes into a sandbox / container; network closed by default
Loop	No step limit; someone watches and shouts stop	Explicit success condition + maximum steps + timeout exit
Memory	Everything packed into the conversation window	Short-term / working / long-term layered; cross-session persistence
Observability	Dig through chat logs when something breaks	Structured log per tool call; replayable
Emergency stop	Close the terminal window	Kill process group + audit record + heartbeat

Two concrete reference points in Part V — read them after finishing this unit to map abstract criteria onto real code (both units are currently skeletons still being expanded):

OpenClaw: see how its tool set design and approval gate are implemented, corresponding to Component 1 and Section 3 of this unit.
Hermes-Agent: see how its memory layering and loop termination conditions are designed, corresponding to Components 2 and 3.

6. Positioning on the concept ladder: harness is the outermost layer, L4

Arranged from innermost to outermost (see 01-3, 01-4, 01-5), the harness is the outer shell:

L1 Prompt engineering   -- how to write a single request
L2 Context engineering  -- what belongs in the window
L3 Workflow engineering -- what structure connects multiple steps
L4 Harness engineering  -- the shell that keeps the whole thing running reliably (this unit)

L4 carries the output of the first three layers and determines whether they can land stably. A common imbalance: push the L3 workflow orchestration to the extreme (beautiful DAG, adversarial verification, convergent loops) while putting no kill switch or permission boundaries at L4 — equivalent to fitting a racing engine but no brakes. The reverse also holds: a complete harness cannot rescue a bad prompt or bad context; L4 amplifies the design quality of the first three layers; it does not remediate them.

All four layers together give the full picture

When an agent “behaves erratically,” first locate the layer: output quality fluctuating (usually L1/L2), multi-step flow results inconsistent (L3), agent performing dangerous actions or failing to stop (L4). Getting the layer right keeps you from endlessly tweaking prompts when the root cause is a harness that isn’t catching things at the bottom.

Tool comparison

Mapping harness components to mainstream agent systems reveals consistent concepts under different names (as of 2026-05; exact mechanisms and configuration paths are subject to each vendor’s current documentation):

Component	Anthropic Claude (primary)	OpenAI	Google	Self-built (bare SDK / framework)
Tool set and ACI	Claude Code tools + MCP; Agent SDK custom tools [2]	Agents SDK function tools	Gemini API function calling	Custom function schema
Permissions / approval gate	`settings.json` `permissions` (deny, ask, allow) [3]	Agents SDK guardrails / `tool_use` approval	Official documentation does not provide equivalent detail	Implement your own approval layer
Loop termination	Task-completion judgment; Stop hook convergence sentinel	Agents SDK `max_turns` limit	Official documentation does not provide equivalent detail	Set your own step / timeout limit
Memory and state	Session context + `CLAUDE.md` + memory files	Agents SDK sessions	Official documentation does not provide equivalent detail	Manage your own state store
Observability	Transcript + hook logs	Agents SDK tracing	Official documentation does not provide equivalent detail	Wire your own logging / OpenTelemetry
Sandbox / isolation	OS-level sandbox (Bash commands and their child processes) [3]	Official documentation does not provide equivalent detail	Official documentation does not provide equivalent detail	Container / devcontainer

The comparison table gives coordinates, not details

Exact names, versions, and configuration paths for each cell are fast-moving facts; cells where official documentation does not cover the equivalent concept note this explicitly, so refer to each vendor’s official documentation for current details. The final column uses “self-built” instead of GitHub Copilot and Cursor (the usual comparison targets) because neither currently offers a programmable agent harness layer equivalent to the others; the natural comparison point for harness engineering is always “off-the-shelf tool” vs. “roll your own.” The table’s purpose is to tell you what the same harness component is called in your tool; for deeper configuration detail see Part II and Part IV.

Hands-on exercises

Two exercises: one audit, one hands-on fix for the component most commonly missing.

Audit your existing agent configuration against the six components. Take the system you are currently using (Claude Code’s .claude/settings.json, or your self-built agent) and check each of the six components from Section 2 one by one: tool set and permissions, memory and state, loop termination, observability, security boundaries, reproducibility. For the first one you cannot check off, write down “in what scenario will this be the first thing to break?” Most people find the gap is observability or a kill switch, because neither is needed under normal conditions and both are needed the moment they are.
Add a heartbeat to an unattended loop. Take the supervisor.sh skeleton from Section 3 and adapt it for your own agent: decide a reasonable TIMEOUT (slightly longer than the slowest normal single-step duration), add a one-line heartbeat write to each iteration of the agent loop, and use setsid to put it in its own process group. Verification: deliberately cause the agent to lock in an infinite loop and confirm that the supervisor actually kills the entire group after TIMEOUT, rather than leaving orphan processes behind.

Common pitfalls

Anti-pattern list

Treating the harness as the system prompt: the system prompt is L1/L2 and has no reach into the execution layer’s permissions, termination, or observability. A well-crafted prompt still cannot stop an rm with no deny rule.
Approval boundaries too wide or too narrow: too wide means no security boundary; too narrow blocks every step and removes the agent’s value. The criterion is action reversibility, not intuition.
Loops with no termination condition: without a success criterion, a maximum step count, and a timeout, the most common outcome is infinite retry burning tokens, not graceful completion.
Going live without observability: discovering there are no logs to consult only after something breaks, and being reduced to guessing from conversation history. Structured logging needs to be on from the start, not retrofitted after an incident.
Kill switch that only kills the parent process: not signaling the entire process group leaves child processes as orphans that keep running. An emergency stop needs to stop everything.
Relying on the monitored process to self-report that it is stuck: a process that is locked up cannot reliably report that it is locked up. The dead-man switch must be judged and triggered by an external supervisor.

Self-check

The bar for passing this unit

Can you state in one sentence what “model” and “harness” each own? Have you hit or observed a concrete problem from confusing the two?
Applying the six components from Section 2 to the agent tool you currently use, which items are missing? In what scenario would the first missing item be the first thing to break?
Are your approval boundaries divided by action reversibility, or is everything allowed through / everything blocked? Give one action you would allow and one you would block.
Does your unattended loop have a heartbeat and a process-group-level kill switch? Does the termination signal go to the parent process or the whole group?
If you needed to reconstruct right now what your agent did in its last run, is your observability data sufficient? If not, what piece is missing?

Sources and further reading

Factual claims are grounded in official documentation; fast-changing items are annotated as of 2026-05.

[1] E. Schluntz and B. Zhang, “Building Effective Agents,” Anthropic Engineering, Dec. 19, 2024. (augmented LLM as fundamental building block = LLM + retrieval + tools + memory; agent obtains environment ground truth at each step; stopping conditions maintain control; ACI should be designed with the same care as HCI; production requires sandboxed testing + guardrails) https://www.anthropic.com/engineering/building-effective-agents (as of 2026-05)
[2] Anthropic, “How the agent loop works,” Claude Agent SDK Docs. (gather context, take action, verify work, repeat; the harness powering Claude Code can power other agents; context accumulates within a session without reset; session_id resume restores full context) https://code.claude.com/docs/en/agent-sdk/agent-loop (as of 2026-05)
[3] Anthropic, “Configure permissions,” Claude Code Docs. (permissions allow / ask / deny; evaluation order deny, ask, allow first-match-wins; sandbox is OS-level enforcement, applies only to Bash and child processes; permissions and sandboxing are complementary defense-in-depth layers) https://code.claude.com/docs/en/permissions (as of 2026-05)

Related: 01-4 on memory layering and the causes of context rot; 01-5 on workflow orchestration that the harness grounds; 03-3 on the threat model and attack surface behind security boundaries; 04-1 on long-term memory design; 04-6 on using hooks to implement deterministic approval gates and convergence sentinels; 05-1 and 05-2 for the actual design of two open-source harnesses.
