Harness Engineering Explained: What It Is and How Claude Code’s Harness Makes AI Agents Actually Work

Last updated: May 27, 2026
Published April 8, 2026 · Updated May 27, 2026 · 37 min read

Summary

What this post covers: An in-depth look at “harness engineering”—the orchestration layer wrapped around a language model that turns it into a reliable agent—using Claude Code’s architecture as the worked example, plus a guide to engineering your own harness on top of Claude Code.

Key insights:

  • The model is not the product: as LLMs commoditize, the orchestration around the model (tools, permissions, memory, verification loops, context management) becomes the real competitive moat.
  • Every harness performs four core functions: guides (steering before the agent acts), sensors (feedback after it acts), verification (confirming the goal was met), and correction (recovery from failures). Permissions and tools supply capability and safety around them; an agent missing any of these tends to fail at production tasks.
  • Anthropic’s research on long-running coding agents indicates that harness improvements alone can raise success rates by 2-3x with the same underlying model, which suggests that prompt engineering is one part of a broader systems discipline.
  • Claude Code itself ships 19 permission-gated tools, a streaming agent loop, hierarchical memory (CLAUDE.md, sub-agent contexts), sub-agent spawning, and context compaction—each a configuration point you can lean on to build vertical agents.
  • The practical recipe for a custom harness: write tight CLAUDE.md guides, define sub-agents for narrow tasks, add deterministic verification (tests, linters) the agent must pass, and gate dangerous tools behind allow-lists rather than trying to prompt away risk.

Main topics: What Is Harness Engineering?, The Four Core Functions of a Harness, Inside Claude Code’s Harness Architecture, Multi-Agent Harness Architecture, How to Engineer Your Own Harness for Claude Code, Harness Engineering Best Practices, Harness Engineering vs Prompt Engineering, Real-World Harness Examples, The Future of Harness Engineering.

When independent researchers first decompiled and analysed Claude Code—Anthropic’s AI-powered coding agent—many anticipated little more than a thin wrapper around a large language model. The reality proved considerably more involved. The system was found to comprise a sophisticated orchestration layer built from nineteen permission-gated tools, a streaming agent loop with continuous feedback, hierarchical memory systems, sub-agent spawning, context compaction algorithms, and a multi-layered permission model that governs every action the agent performs. The language model itself constitutes only one component of a substantially larger system.

This observation crystallised a position that the AI engineering community had been approaching for some time: the model is not the product. The harness is. The thousands of lines of orchestration code surrounding a language model—determining what it sees, what it may do, how it recovers from mistakes, and how it preserves knowledge across sessions—is where the engineering effort concentrates, and where quality is ultimately gained or lost.

Consider a simple thought experiment. Two developers are given the same LLM API key. One constructs a basic prompt-and-response loop. The other assembles a system with tool access, automated testing, iterative error correction, and persistent project memory. Both rely on the same underlying model, yet the outcomes diverge sharply. The decisive factor is not the engine but everything constructed around the engine. The decisive factor is the harness.

As large language models become increasingly commoditised—with open-source models narrowing the gap on proprietary systems and multiple providers offering comparable capability—the harness surrounding the model is rapidly emerging as the principal competitive advantage. Organisations that develop expertise in harness design will build agents that reliably ship code, manage infrastructure, and automate complex workflows. Organisations that treat the model as the entire product will find their agents repeatedly failing on real-world tasks. The remainder of this article examines harness engineering in detail: its definition, its implementation within Claude Code, and the steps required to build one independently.

What Is Harness Engineering?

The definition used by Anthropic’s engineering team is a useful starting point: harness engineering is “the art and science of using a coding agent’s configuration points to improve output quality and increase task success rates.” It is the discipline of designing, building, and refining every component of an AI agent except the model itself.

The underlying formula is straightforward:

Key Takeaway: Agent = Model + Harness. The model provides intelligence. The harness provides capability, reliability, and control. Both are required, but the harness is where engineering effort yields the highest return on investment.

An analogy clarifies the relationship. The model is an engine—a powerful, general-purpose engine capable of generating text, reasoning about code, and solving complex problems. An engine on a workbench, however, travels nowhere. It cannot steer, brake, or recognise a destination. The harness is the vehicle constructed around that engine: the steering wheel (guides that direct the model’s behaviour), the brakes (permissions that prevent harmful actions), the transmission (tools that translate model decisions into real-world actions), the navigation system (context management that keeps the model oriented), and the safety systems (verification and correction loops that detect and remediate errors).

An engine alone accomplishes little. A vehicle without an engine cannot move at all. Both components are necessary, but when comparing two vehicles with equivalent engines, the one with superior surrounding engineering will consistently prevail.

[Figure: Agent = Model + Harness. The model (engine) supplies raw intelligence: text generation, code reasoning, problem solving, and pattern recognition, with no tools, memory, or verification. The harness (car) adds guides, sensors, verification, correction, permissions, memory, and tools. Together they form an agent that is reliable, self-correcting, permission-safe, context-aware, persistent, and production-grade.]

Why Harness Engineering Matters

The same model paired with a poorly designed harness produces unreliable results. The same model paired with a well-designed harness produces consistently strong results. The effect is measurable rather than theoretical. Anthropic’s research on long-running coding agents demonstrated that harness improvements—better guides, tighter feedback loops, more refined context management—increased task success rates by a factor of two to three without any change to the underlying model. The model was already capable of solving the problems; the harness was the bottleneck.

This observation marks a fundamental shift in how AI engineering is conceived. For several years, the dominant paradigm has been prompt engineering—the craft of writing better prompts to elicit better outputs from language models. Prompt engineering is valuable, but it constitutes a single-turn optimisation: a prompt is crafted, a response is received, and the prompt is revised. Harness engineering represents the evolution of prompt engineering into a full systems discipline. It encompasses not only the prompt but the tools available to the model, the verification steps executed after the model acts, the correction mechanisms triggered when problems arise, the memory systems that persist knowledge across sessions, and the permission boundaries that keep the agent safe.

Prompt engineering asks how to write a better prompt. Harness engineering asks how to build a better system around the model such that it reliably succeeds at complex, multi-step, real-world tasks.

The Four Core Functions of a Harness

Anthropic’s published research on effective agent harnesses identifies four core functions that every harness must perform. These functions may be regarded as the four pillars supporting a reliable AI agent. Removing any one of them causes the structure to become unstable. Each is examined in turn below.

[Figure: The Four Core Functions of a Harness. (1) Guides: feedforward controls such as CLAUDE.md, commands, and conventions, acting before the agent. (2) Sensors: feedback controls such as linters, type checkers, and builds, acting after. (3) Verification: validation through tests, CI/CD, and LLM-as-Judge. (4) Correction: remediation through feedback loops, retry, and self-repair. Guides prevent errors → sensors detect errors → verification confirms goals → correction fixes failures.]

Guides (Feedforward Controls)

Guides function as feedforward controls: they steer the agent before it acts. Their purpose is to set expectations, provide context, establish rules, and shape the model’s behaviour before it produces a line of code or executes a command. Effective guides substantially reduce errors by preventing them at the outset rather than catching them after the fact.

Within the Claude Code ecosystem, guides take several concrete forms:

  • CLAUDE.md files: Project-level instruction files that inform the agent about the codebase, coding conventions, the frameworks to use, the patterns to follow, and the mistakes to avoid. These are the single most impactful harness component a practitioner can configure.
  • Custom commands (slash commands): Pre-defined workflows like /write-post or /review that structure multi-step tasks into repeatable processes, complete with specific instructions for each step.
  • Coding conventions and style guides: Explicit rules about formatting, naming, architecture patterns, and anti-patterns that the agent should follow or avoid.
  • Structured prompts and bootstrap instructions: System-level prompts that establish the agent’s role, capabilities, and constraints before any user interaction begins.
  • Task decomposition rules: Instructions that tell the agent how to break down large tasks into manageable subtasks, preventing the common failure mode of trying to do too much in a single step.
  • Examples and few-shot demonstrations: Concrete examples of desired output that show the agent exactly what “good” looks like for a given task.

The principal observation about guides is that they are inexpensive to implement and high-impact. A thorough CLAUDE.md file can be written in approximately thirty minutes, and the resulting improvement in agent output quality is often substantial and immediate. For this reason, Anthropic recommends that practitioners begin their harness engineering work with guides.

Sensors (Feedback Controls)

Sensors are feedback controls: they detect problems after the agent acts. Whereas guides seek to prevent errors, sensors accept that errors will occur and focus on detecting them quickly. The earlier an error is detected, the less costly it is to remediate.

Effective sensors for AI coding agents include the following:

  • Linters (ESLint, Ruff, Pylint) tuned for LLM-generated code patterns. LLMs tend to make specific categories of mistakes that linters can catch reliably.
  • Type checkers (such as mypy) that catch type errors, missing imports, and interface mismatches before runtime.
  • Test suites designed specifically for LLM output patterns: not only generic unit tests, but tests that target the kinds of errors AI agents commonly make.
  • Build verification that ensures the code compiles and the project builds successfully after every change.
  • Code diff analysis that reviews what changed and flags potentially problematic patterns (accidental deletions, overly broad changes, unintended side effects).

Tip: The most effective sensor configuration for AI agents runs linters and type checkers automatically after every code change rather than only at commit time. This provides the agent with immediate feedback and the opportunity to self-correct before proceeding to the next task.
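
The idea of running sensors after every change can be sketched in a few lines of Python. This is a minimal illustration rather than Claude Code’s implementation: `SensorResult`, `run_sensor`, and `run_sensors_after_change` are hypothetical names, and the demo uses the Python interpreter as a stand-in sensor so that it runs without Ruff or mypy installed.

```python
import subprocess
import sys
from dataclasses import dataclass


@dataclass
class SensorResult:
    name: str
    passed: bool
    output: str  # combined stdout and stderr, handed back to the agent on failure


def run_sensor(name: str, command: list[str], timeout: float = 60.0) -> SensorResult:
    """Run one check (linter, type checker, build) and capture what it printed."""
    try:
        proc = subprocess.run(command, capture_output=True, text=True, timeout=timeout)
    except (OSError, subprocess.TimeoutExpired) as exc:
        # A missing or hung tool counts as a failed sensor, not a crash of the harness.
        return SensorResult(name, False, f"{type(exc).__name__}: {exc}")
    return SensorResult(name, proc.returncode == 0, (proc.stdout + proc.stderr).strip())


def run_sensors_after_change(path: str, sensors: dict[str, list[str]]) -> list[SensorResult]:
    """Run every configured sensor against the file that was just changed."""
    return [run_sensor(name, command + [path]) for name, command in sensors.items()]


# Stand-in sensor: ask the interpreter to compile a deliberately broken snippet.
demo = run_sensor("syntax", [sys.executable, "-c", "compile('x = (', 'demo.py', 'exec')"])
print(demo.passed)
```

In a real project the `sensors` mapping might look like `{"ruff": ["uv", "run", "ruff", "check"], "mypy": ["uv", "run", "mypy"]}`, with the failing output appended to the agent’s next prompt.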

Verification

Verification extends beyond sensors. Whereas sensors detect that something may be wrong, verification confirms that the agent has accomplished the intended objective. It addresses questions of whether the feature functions as required, whether the output matches the specification, and whether the behaviour is substantively correct rather than merely syntactically valid.

Verification mechanisms include the following:

  • Automated test execution: Running the full test suite (or relevant subset) after changes to confirm that existing functionality still works and new functionality behaves as specified.
  • CI/CD pipeline integration: Feeding agent output through the same continuous integration pipeline that human code goes through, ensuring equal quality standards.
  • Browser automation testing: For web applications, actually loading the page and verifying that UI changes render correctly—not just checking that the code is syntactically valid, but that it produces the right visual and interactive result.
  • LLM-as-a-Judge: Using a stronger model (or the same model in a separate context) to evaluate the quality and correctness of the agent’s output. This is particularly useful for subjective quality assessments such as code readability, documentation quality, or design decisions.

Correction

Correction is the final pillar, and arguably the function that distinguishes prototype agents from production-grade agents. When the agent makes a mistake (and it inevitably will), the system’s response determines its usefulness. A naive system simply fails and reports the error. A well-designed harness returns the error to the model, lets the model reason about the cause and propose a corrective action, and then retries.

Correction mechanisms include the following:

  • Feedback loops: Test failure → model reads the error message → model analyzes the root cause → model generates a fix → system reruns the test. This loop can repeat multiple times until the test passes or a retry limit is reached.
  • Self-repair mechanisms: When the agent detects that its own output is malformed or incomplete, it can trigger a repair pass without human intervention.
  • Retry logic with context: Not blindly retrying the same action, but retrying with additional context about what went wrong: the error message, the stack trace, and the failing test output.
  • Graceful fallback strategies: When the agent cannot solve a problem after multiple attempts, it should degrade gracefully—perhaps simplifying its approach, asking for human input, or documenting what it tried and why it failed.

| Function | Type | When It Acts | Examples |
| --- | --- | --- | --- |
| Guides | Feedforward | Before the agent acts | CLAUDE.md, custom commands, coding conventions |
| Sensors | Feedback | After the agent acts | Linters, type checkers, build verification |
| Verification | Validation | After completion | Test suites, CI/CD, browser testing, LLM-as-Judge |
| Correction | Remediation | When something fails | Feedback loops, self-repair, retry with context |

The interplay among these four functions produces a resilient system. Guides reduce the error rate. Sensors catch errors that slip through. Verification confirms that the overall objective has been achieved. Correction addresses cases in which it has not. Together, these functions transform a probabilistic language model into a system reliable enough for production use.
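
The correction loop described in this section (generate, verify, feed the failure back, retry up to a limit) can be sketched as follows. This is a minimal sketch under stated assumptions: `correct_until_passing` is a hypothetical helper, `fake_model` stands in for a real model call, and `check_add` stands in for a test suite.

```python
from typing import Callable, Optional


def correct_until_passing(
    generate: Callable[[str, list[str]], str],
    verify: Callable[[str], Optional[str]],
    task: str,
    max_attempts: int = 3,
) -> tuple[Optional[str], list[str]]:
    """Generate, verify, and retry with context until verification passes.

    `verify` returns None on success or an error message on failure.
    Returns (accepted_output_or_None, every_error_seen).
    """
    errors: list[str] = []
    for _ in range(max_attempts):
        candidate = generate(task, errors)  # the model sees all earlier failures
        error = verify(candidate)           # sensor / verification step
        if error is None:
            return candidate, errors
        errors.append(error)                # correction: feed the failure back
    return None, errors                     # retry limit reached: escalate to a human


def fake_model(task: str, errors: list[str]) -> str:
    """Stand-in for the model: wrong at first, corrected once it has seen an error."""
    return "def add(a, b): return a + b" if errors else "def add(a, b): return a - b"


def check_add(code: str) -> Optional[str]:
    """Stand-in test suite: run the candidate and compare against the expected value."""
    namespace: dict = {}
    exec(code, namespace)
    result = namespace["add"](2, 3)
    return None if result == 5 else f"add(2, 3) returned {result}, expected 5"


fixed, history = correct_until_passing(fake_model, check_add, "write add(a, b)")
print(fixed)    # def add(a, b): return a + b
print(history)  # ['add(2, 3) returned -1, expected 5']
```

Returning `None` once the retry limit is reached leaves the fallback decision (simplify, ask a human, or document the failure) to the caller instead of hiding it inside the loop.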

Inside Claude Code’s Harness Architecture

With the theoretical framework established, attention can now turn to how one of the most sophisticated AI coding agents currently available implements these principles. Claude Code is not simply a model attached to a terminal; it is a carefully engineered harness that embodies all four core functions. Based on public analysis of its architecture, the following components are observable.

[Figure: Claude Code Harness Architecture. Layers shown: a permission and safety layer, a context management layer (auto-compaction, selective reading, memory), and the streaming agent loop. Components: guides (hierarchical CLAUDE.md, custom slash commands, system prompt, bootstrap instructions, task decomposition); the Claude model (reasoning and generation); 19 permission-gated tools (Read, Write, Edit, Bash, Grep, Glob, Agent, Web); sensors and verification (pre/post-tool hooks, linters and type checkers, test execution, build verification, diff analysis); a correction loop (error → read message → analyse → fix → retry, with self-repair, fallback to human input, and sub-agent delegation); and extensions (MCP servers, sub-agent spawning, persistent memory, custom skills and workflows). Permissions guard everything, context keeps the model focused, and the loop drives action.]

19 Permission-Gated Tools

At the centre of Claude Code’s harness sit nineteen distinct tools that the model can invoke to interact with the outside world. Each tool is permission-gated, meaning that the system controls which tools the agent may use and under what circumstances. The tools include the following categories:

  • File I/O: Read (view file contents), Write (create or overwrite files), Edit (make targeted string replacements in existing files)
  • Shell execution: Bash (execute arbitrary shell commands with timeout controls)
  • Search: Grep (content search with regex support), Glob (file pattern matching)
  • Git operations: Integrated version control operations
  • Web access: WebFetch (retrieve web page content for research)
  • Notebook editing: NotebookEdit (modify Jupyter notebook cells)
  • Sub-agent spawning: Agent (create specialized sub-agents for parallel or delegated tasks)
  • Task management: TaskCreate, TaskGet, TaskList, TaskUpdate (manage background tasks)

The important design decision here is permission gating. Not all tools carry equivalent risk. Reading a file is safe; deleting a file is potentially harmful; running an arbitrary shell command may have any number of consequences. Claude Code’s harness categorises tool invocations by risk level and requires explicit user approval for high-risk operations, such as executing unfamiliar shell commands, writing to sensitive files, or performing destructive git operations. This corresponds to the braking system in the vehicle analogy, and it is essential for establishing trust.
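
A risk-based gate of this kind can be sketched in Python. The policy below is purely illustrative: the tool sets, the allow-list, the deny-list, and the `gate` function are assumptions made for the example, not Claude Code’s actual rules.

```python
import shlex
from enum import Enum


class Decision(Enum):
    ALLOW = "allow"  # run without prompting
    ASK = "ask"      # pause for explicit user approval
    DENY = "deny"    # never run


# Hypothetical policy for illustration; these sets are not Claude Code's actual rules.
READ_ONLY_TOOLS = {"Read", "Grep", "Glob"}
ALLOWED_PROGRAMS = {"pytest", "ruff", "ls"}
DENIED_PROGRAMS = {"rm", "sudo"}
SHELL_METACHARACTERS = set(";|&$`<>")


def gate(tool: str, argument: str = "") -> Decision:
    """Classify a tool call by risk before it is executed."""
    if tool in READ_ONLY_TOOLS:
        return Decision.ALLOW
    if tool != "Bash":
        return Decision.ASK  # Write, Edit, and unknown tools need approval
    try:
        tokens = shlex.split(argument)
    except ValueError:
        return Decision.ASK  # unbalanced quotes: let a human look at it
    program = tokens[0] if tokens else ""
    if program in DENIED_PROGRAMS:
        return Decision.DENY
    if any(ch in SHELL_METACHARACTERS for ch in argument):
        return Decision.ASK  # chaining or substitution could hide another command
    return Decision.ALLOW if program in ALLOWED_PROGRAMS else Decision.ASK


for call in [("Read", "src/main.py"), ("Bash", "pytest tests/ -q"),
             ("Bash", "ls && rm -rf /"), ("Bash", "rm -rf build")]:
    print(call, gate(*call).value)
```

The default is to ask: anything the policy does not recognise is routed to the user rather than silently allowed, which is the allow-list posture the article recommends over trying to prompt risk away.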

The Streaming Agent Loop

In contrast to a simple request-response chatbot, Claude Code operates in a streaming agent loop. The model receives input, reasons about an appropriate action, invokes a tool, observes the result, reasons again, invokes a further tool, and continues this cycle until the task is complete or human input is required. This loop is what qualifies Claude Code as an agent rather than a chatbot.

The streaming nature of this loop is significant for user experience. Rather than disappearing for extended periods of internal processing, the agent presents its work in real time: the user can observe the files being read, the commands being executed, and the decisions being made. This transparency builds trust and allows the user to intervene early when the agent is proceeding in an undesirable direction.
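
The reason-act-observe cycle can be expressed as a generator that yields each event as it happens, which is one simple way to obtain the streaming behaviour described above. This is a sketch under stated assumptions: the message shape, `agent_loop`, and `scripted_model` are hypothetical, and the scripted function stands in for a real model call.

```python
from typing import Callable, Iterator

Message = dict  # e.g. {"type": "tool_call", "tool": "read", "input": "README.md"}


def agent_loop(
    model: Callable[[list[Message]], Message],
    tools: dict[str, Callable[[str], str]],
    task: str,
    max_steps: int = 10,
) -> Iterator[Message]:
    """Reason, act, observe, repeat; yield every event as soon as it happens."""
    history: list[Message] = [{"type": "task", "content": task}]
    for _ in range(max_steps):
        action = model(history)
        history.append(action)
        yield action  # streamed so the user can watch and intervene
        if action["type"] == "finish":
            return
        result = tools[action["tool"]](action["input"])
        observation = {"type": "observation", "tool": action["tool"], "content": result}
        history.append(observation)
        yield observation


def scripted_model(history: list[Message]) -> Message:
    """Stand-in for the model: read one file, then finish."""
    observations = [m for m in history if m["type"] == "observation"]
    if not observations:
        return {"type": "tool_call", "tool": "read", "input": "README.md"}
    return {"type": "finish", "content": f"Read {len(observations[0]['content'])} characters"}


FILES = {"README.md": "hello harness"}
events = list(agent_loop(scripted_model, {"read": FILES.__getitem__}, "summarise the README"))
for event in events:
    print(event["type"])
```

The `max_steps` bound is a deliberate design choice in the sketch: a loop that can only end when the model decides to finish has no protection against an agent that never does.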

Context Management Layer

One of the most underappreciated components of Claude Code’s harness is its context management layer. Language models have finite context windows, even when those windows are large. A coding session that involves reading dozens of files, running tests, making changes, and debugging errors can rapidly exceed the available context. Claude Code addresses this constraint through several mechanisms:

  • Auto-compaction: When the conversation approaches the context limit, the harness automatically summarizes earlier parts of the conversation, preserving the most important information while freeing up context space for new work.
  • Persistent memory: The CLAUDE.md system and memory files allow important information to persist across sessions, so the agent does not need to re-learn the project’s conventions every time it starts.
  • Selective file reading: Rather than loading entire files, the agent can read specific line ranges, search for specific patterns, and load only the relevant portions of large files.

Key Takeaway: Context management is the largely invisible harness component that practitioners most commonly underestimate. Without it, agents degrade rapidly on long tasks as their context fills with irrelevant information and they lose track of their objectives. Effective context management is what enables Claude Code to handle tasks that span hundreds of tool invocations.
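
Auto-compaction can be illustrated with a small Python sketch. It makes several simplifying assumptions that are not taken from Claude Code: tokens are estimated at roughly four characters each, messages are plain strings, and `placeholder_summary` stands in for a model-written summary.

```python
from typing import Callable


def estimate_tokens(text: str) -> int:
    """Crude assumption for the sketch: roughly four characters per token."""
    return max(1, len(text) // 4)


def compact(
    messages: list[str],
    limit: int,
    keep_recent: int,
    summarise: Callable[[list[str]], str],
) -> list[str]:
    """Replace older messages with one summary once the estimated size exceeds `limit`."""
    total = sum(estimate_tokens(m) for m in messages)
    if total <= limit or len(messages) <= keep_recent:
        return messages
    cut = len(messages) - keep_recent
    older, recent = messages[:cut], messages[cut:]
    return [summarise(older)] + recent


def placeholder_summary(older: list[str]) -> str:
    """Stand-in for a model-written summary."""
    return f"[summary of {len(older)} earlier messages]"


log = [f"step {i}: " + "x" * 400 for i in range(5)]
compacted = compact(log, limit=300, keep_recent=2, summarise=placeholder_summary)
print(compacted[0])    # [summary of 3 earlier messages]
print(len(compacted))  # 3
```

Keeping the most recent messages verbatim while summarising the rest reflects the trade-off described above: recent tool output is usually what the next step depends on, whereas older exchanges mostly matter for the decisions they produced.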

The CLAUDE.md System

Claude Code’s CLAUDE.md system is a hierarchical instruction framework that operates at multiple levels:

  • Project-level CLAUDE.md: Lives in the repository root. Contains project-specific instructions, coding conventions, architecture descriptions, and common pitfalls. Every developer on the team benefits from the same instructions.
  • User-level CLAUDE.md: Lives under the user’s home directory (in ~/.claude/). Contains personal preferences and conventions that apply across all projects.
  • Directory-level CLAUDE.md: Lives in specific subdirectories. Contains instructions specific to that part of the codebase, useful for monorepos or projects with distinct subsystems.

This hierarchy means the agent receives increasingly specific guidance as it descends into the codebase. The project-level file might specify the use of TypeScript with strict mode. The directory-level file in /src/database/ might add an instruction to always use parameterised queries and never string concatenation for SQL. The system merges these instructions, with more specific files taking precedence.
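
One way to picture the hierarchy is as an ordered list of files, from least to most specific. The sketch below is an assumption about how such a lookup could work, not a description of Claude Code’s internals: `collect_guides` and `merged_instructions` are hypothetical helpers, and simple concatenation (most specific last) stands in for whatever precedence mechanism the real system uses.

```python
import tempfile
from pathlib import Path


def collect_guides(user_file: Path, repo_root: Path, working_dir: Path,
                   filename: str = "CLAUDE.md") -> list[Path]:
    """List existing guide files from least specific to most specific.

    `working_dir` must be `repo_root` or a directory below it.
    """
    candidates = [user_file, repo_root / filename]
    current = repo_root
    for part in working_dir.relative_to(repo_root).parts:
        current = current / part
        candidates.append(current / filename)
    return [path for path in candidates if path.is_file()]


def merged_instructions(paths: list[Path]) -> str:
    """Concatenate in order, so more specific guidance comes last."""
    return "\n\n".join(path.read_text(encoding="utf-8").strip() for path in paths)


# Demo in a throwaway directory tree.
with tempfile.TemporaryDirectory() as tmp:
    base = Path(tmp)
    repo = base / "repo"
    (repo / "src" / "database").mkdir(parents=True)
    (base / "home").mkdir()
    (base / "home" / "CLAUDE.md").write_text("Prefer small commits.", encoding="utf-8")
    (repo / "CLAUDE.md").write_text("Use TypeScript strict mode.", encoding="utf-8")
    (repo / "src" / "database" / "CLAUDE.md").write_text(
        "Always use parameterised queries.", encoding="utf-8")

    guides = collect_guides(base / "home" / "CLAUDE.md", repo, repo / "src" / "database")
    order = [path.relative_to(base).as_posix() for path in guides]
    merged = merged_instructions(guides)

print(order)
```

Directories without a guide file (here `repo/src`) are simply skipped, so the hierarchy costs nothing where no extra guidance is needed.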

Hooks and MCP Integration

Two additional harness components warrant mention. Hooks are shell commands that execute automatically in response to agent events. For example, a pre-tool hook may inspect a shell command before it runs, and a post-tool hook may run a linter after every file write. Hooks allow automated quality gates to be injected into the agent’s workflow without modifying the agent itself.

MCP (Model Context Protocol) integration allows Claude Code to connect to external tools and data sources through a standardised protocol. MCP servers can provide access to databases, APIs, project management tools, documentation systems, and any other resource that might assist the agent. This component functions as the expansion port of the harness: the mechanism for extending capabilities beyond the built-in tools.

| Harness Component | Core Function | What It Does |
| --- | --- | --- |
| CLAUDE.md files | Guide | Project-specific instructions and conventions |
| Custom commands | Guide | Repeatable multi-step workflows |
| Permission system | Guide + Sensor | Controls tool access and requires approval for risky actions |
| 19 built-in tools | Capability | File I/O, search, shell, git, web access, sub-agents |
| Streaming agent loop | Orchestration | Continuous act-observe-reason cycle |
| Context management | Efficiency | Auto-compaction, selective reading, memory persistence |
| Hooks | Sensor + Verification | Automated quality gates on agent events |
| MCP integration | Capability extension | Connect to external tools and data sources |

Multi-Agent Harness Architecture

One of the most significant findings from Anthropic’s research on long-running agents is that the optimal harness architecture for complex tasks is not a single agent performing every function, but rather multiple specialised agents, each operating with a clean context and a focused role. This is the multi-agent harness pattern, and it addresses one of the most persistent problems in AI agent design: context degradation.

The Context Degradation Problem

The problem can be stated as follows. A single agent working on a large task accumulates context over time—files it has read, commands it has run, errors it has encountered, and decisions it has made. As this context grows, the model’s ability to maintain focus and coherence degrades. Anthropic’s research refers to this phenomenon as “context anxiety”: the model becomes increasingly uncertain about which information remains relevant, begins to second-guess earlier decisions, and may even contradict its own prior work. The longer the session, the more pronounced the effect.

The multi-agent pattern resolves this difficulty by providing each agent with a clean context reset. Instead of a single agent performing all functions, specialised agents each handle one phase of the work, passing structured handoffs between them.

The Planner-Generator-Evaluator Pattern

Anthropic’s research describes an effective three-agent pattern:

  • Planner Agent: Takes a brief user prompt and expands it into a comprehensive specification. The planner reads the codebase, interprets the requirements, and produces a detailed plan that includes which files require modification, what the expected behaviour should be, and which edge cases warrant consideration. The planner does not write code; it writes specifications.
  • Generator Agent: Takes the planner’s specification and implements it. The generator writes code, creates tests, makes file changes, and runs builds. It works iteratively—implementing a component, testing it, addressing issues, and proceeding to the next component. The generator operates with a clean context that has not been encumbered by the planner’s exploration and deliberation.
  • Evaluator Agent: Takes the generator’s output and conducts quality assurance. The evaluator reviews the code for correctness, style, security issues, and specification compliance. It runs tests, checks for regressions, and provides a final assessment, again from a clean context focused exclusively on evaluation.

Each agent receives a fresh context window and operates within a clear, focused role. The handoffs between agents consist of structured data—specifications, code diffs, test results—rather than the growing conversation of a single long-running session.
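
The structured handoffs can be made concrete with a short Python sketch. Everything here is illustrative: `Spec`, `Patch`, `Review`, and `run_pipeline` are hypothetical names, and the three stub functions stand in for separate agent invocations that would each start from a fresh context.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass(frozen=True)
class Spec:
    goal: str
    files: tuple[str, ...]
    edge_cases: tuple[str, ...]


@dataclass(frozen=True)
class Patch:
    diff: str


@dataclass(frozen=True)
class Review:
    approved: bool
    issues: tuple[str, ...] = ()


def run_pipeline(
    prompt: str,
    planner: Callable[[str], Spec],
    generator: Callable[[Spec, tuple[str, ...]], Patch],
    evaluator: Callable[[Spec, Patch], Review],
    max_rounds: int = 3,
) -> tuple[Patch, Review]:
    """Each stage sees only the structured handoff, never another stage's transcript."""
    if max_rounds < 1:
        raise ValueError("max_rounds must be at least 1")
    spec = planner(prompt)
    issues: tuple[str, ...] = ()
    for _ in range(max_rounds):
        patch = generator(spec, issues)
        review = evaluator(spec, patch)
        if review.approved:
            break
        issues = review.issues  # only the findings travel back to the generator
    return patch, review


# Stub agents; real ones would each call a model with a clean context.
def plan(prompt: str) -> Spec:
    return Spec(goal=prompt, files=("src/slug.py",), edge_cases=("empty string",))


def generate(spec: Spec, issues: tuple[str, ...]) -> Patch:
    handled = ", ".join(issues) or "nothing yet"
    return Patch(diff=f"# {spec.files[0]}: handles {handled}")


def evaluate(spec: Spec, patch: Patch) -> Review:
    missing = tuple(case for case in spec.edge_cases if case not in patch.diff)
    return Review(approved=not missing, issues=missing)


patch, review = run_pipeline("add a slugify helper", plan, generate, evaluate)
print(review.approved, patch.diff)
```

Because the handoff types are small and immutable, nothing from one stage’s exploration can leak into the next stage’s context except what the type explicitly carries.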

[Figure: Multi-Agent: Planner → Generator → Evaluator. The user prompt goes to the Planner (reads the codebase, analyses requirements, lists files to change, identifies edge cases; output: specification), then to the Generator (writes code, creates tests, runs builds, fixes failures iteratively; output: code and tests), then to the Evaluator (reviews code quality, checks security, verifies spec compliance, runs the full test suite; output: pass or issues). Each agent starts from a clean context, and issues found are sent back for a fix and retry.]

How Claude Code Implements Multi-Agent Patterns

Claude Code implements this pattern through its Agent tool, a built-in capability for spawning sub-agents. When Claude Code encounters a task that would benefit from delegation, it can create a sub-agent with a specific prompt and a clean context. The sub-agent runs independently, completes its task, and returns its results to the parent agent.

This approach is particularly suitable for tasks such as the following:

  • Searching a large codebase while the main agent continues to reason about the overall task
  • Running a battery of tests while the main agent plans the next change
  • Investigating a complex error in a separate context so that the investigation does not contaminate the main workflow
  • Reviewing code changes against project standards before the main agent marks the task as complete

Caution: Multi-agent architectures add complexity. They should not be adopted until the capabilities of a single well-designed agent have been exhausted. For most tasks, even complex ones, a single agent supported by strong guides, sensors, and correction loops will outperform a poorly coordinated multi-agent system. The recommended approach is to begin simply.

When to Use Single-Agent vs Multi-Agent

A single agent is appropriate when the task can be completed within one context window, the requirements are clear, and the feedback loop is tight—writing code, running tests, fixing issues, and repeating. Most routine coding tasks fall into this category.

A multi-agent configuration is warranted when the task is sufficiently large that context degradation becomes a genuine concern, when different phases of the task require fundamentally different skill sets (planning, implementation, and review), or when parallel execution of independent subtasks is required. Large feature development, codebase migrations, and comprehensive code reviews are appropriate candidates.

How to Engineer Your Own Harness for Claude Code

Theory has its place, but practical guidance is essential. The following discussion outlines the five levels of harness engineering for Claude Code, ranging from the simplest configuration to advanced multi-agent orchestration. Each level builds on the previous one, and practitioners should begin at Level 1 and add complexity only when a specific problem cannot be addressed at the current level.

[Figure: Five Levels of Harness Engineering. Level 1: CLAUDE.md (foundation; about 30 minutes of setup, very high impact). Level 2: Custom Commands (repeatable task workflows). Level 3: Hooks (automated quality gates). Level 4: MCP Servers (external tool integration). Level 5: Multi-Agent Orchestration (highest complexity, situational impact). Start at the bottom and move up only when lower levels cannot solve your problem.]

Level 1: CLAUDE.md (The Foundation)

The single most impactful action a practitioner can take to improve Claude Code’s performance on a project is to write a comprehensive CLAUDE.md file. This file serves as the foundation upon which everything else is built.

An effective CLAUDE.md includes the following elements:

  • Project purpose: The function of the project, its users, and the problem it addresses.
  • Technology stack: Languages, frameworks, databases, and deployment targets.
  • Coding conventions: Formatting rules, naming conventions, and architectural patterns.
  • File structure: The location of project components and the contents of each directory.
  • Key commands: Procedures for building, testing, deploying, and running the project.
  • Items to avoid: Common mistakes, anti-patterns, and prohibited practices. This is often the most valuable section.

An example CLAUDE.md for a Python project is shown below:

# Project: DataPipeline

## Purpose
ETL pipeline that processes financial data from multiple exchanges
and loads it into our PostgreSQL analytics database.

## Tech Stack
- Python 3.12, managed with uv
- SQLAlchemy 2.0 for database access
- Pydantic for data validation
- pytest for testing
- Ruff for linting

## Key Commands
- Run tests: `uv run pytest tests/ -v`
- Lint: `uv run ruff check src/`
- Run pipeline: `uv run python -m src.main run --date 2026-04-03`

## Coding Conventions
- All functions must have type hints
- Use Pydantic models for all data structures (no raw dicts)
- SQL queries use parameterized queries only (never f-strings)
- Test files mirror source structure: src/foo/bar.py → tests/foo/test_bar.py

## What NOT to Do
- Do not use pandas — we use Polars for dataframes
- Do not hardcode database credentials — use environment variables
- Do not write raw SQL strings — use SQLAlchemy ORM
- Do not skip type hints — mypy strict mode is enforced in CI

With this single file present in the repository root, Claude Code begins each session with the project’s conventions, tools, and known pitfalls in context, so its output is far more likely to follow them. No additional configuration is required.

Level 2: Custom Commands (Task Automation)

Custom commands allow repeatable workflows to be defined as slash commands. They reside in .claude/commands/ as Markdown files, and each file becomes a command that can be invoked with /command-name.

An example .claude/commands/write-tests.md is shown below:

Write comprehensive tests for the file or module specified in $ARGUMENTS.

## Steps:
1. Read the source file and understand its public API
2. Identify all functions, classes, and methods that need testing
3. Write pytest tests covering:
   - Happy path for each function
   - Edge cases (empty inputs, None values, boundary conditions)
   - Error cases (invalid inputs, missing dependencies)
4. Save tests to the mirror path: src/foo/bar.py → tests/foo/test_bar.py
5. Run the new test file: `uv run pytest <test path from step 4> -v`
6. Fix any failing tests
7. Run the linter on it: `uv run ruff check <test path from step 4>`
8. Report results

With this command in place, a user can type /write-tests src/pipeline/transformer.py; Claude Code substitutes the path for $ARGUMENTS and works through the same steps each time. Testing conventions do not need to be re-explained in every conversation. The command encodes the team’s standards into a repeatable process.
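Two mechanical rules are embedded in this command: the $ARGUMENTS placeholder is replaced by whatever follows the slash command, and test files mirror the source tree. The sketch below illustrates the effect of both rules. Claude Code performs the substitution itself, so `render_command` and `mirror_test_path` are hypothetical helpers for illustration only, not part of its API.

```python
from pathlib import PurePosixPath


def render_command(template: str, arguments: str) -> str:
    """Substitute the text typed after the slash command for $ARGUMENTS."""
    return template.replace("$ARGUMENTS", arguments)


def mirror_test_path(source: str) -> str:
    """Map src/foo/bar.py to tests/foo/test_bar.py, per the project convention."""
    path = PurePosixPath(source)
    if not path.parts or path.parts[0] != "src" or path.suffix != ".py":
        raise ValueError(f"expected a Python file under src/, got {source!r}")
    # Keep the intermediate directories, swap the root, prefix the file name.
    return str(PurePosixPath("tests", *path.parts[1:-1], f"test_{path.name}"))


template = "Write comprehensive tests for the file or module specified in $ARGUMENTS."
print(render_command(template, "src/pipeline/transformer.py"))
print(mirror_test_path("src/pipeline/transformer.py"))
```

Spelling the mapping out as a function also exposes its edge cases (files directly under src/, non-Python files), which are worth stating in the command text if they occur in the project.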

Other custom commands worth considering include /review for code review, /deploy for deployment workflows, /debug for structured debugging sessions, and /refactor for refactoring with specific quality gates.

Level 3: Hooks (Automated Quality Gates)

Hooks allow automated checks to be injected into Claude Code’s workflow. They consist of shell commands that execute in response to specific events—before a tool runs, after a tool runs, or at other key moments in the agent loop.

An example hook configuration in .claude/settings.json is shown below:

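{
  "hooks": {
    "PostToolUse": [
      {
        "matcher": "Write|Edit",
        "hooks": [
          {
            "type": "command",
            "command": "jq -r '.tool_input.file_path' | xargs uv run ruff check --fix 2>/dev/null || true"
          }
        ]
      }
    ],
    "PreToolUse": [
      {
        "matcher": "Bash",
        "hooks": [
          {
            "type": "command",
            "command": "python3 .claude/hooks/pre_commit_tests.py"
          }
        ]
      }
    ]
  }
}

With this configuration, every time Claude Code writes or edits a file, the first hook reads the edited file’s path from the JSON that Claude Code passes on stdin and runs Ruff on it, applying any automatic lint fixes. The second hook runs before every Bash tool call. The script it names, .claude/hooks/pre_commit_tests.py, is a project-specific script rather than something Claude Code ships; its job is to detect a git commit command, run the test suite, and exit with code 2 so that the commit is blocked and the failure output is returned to the agent. These constitute automated sensors and verification gates: they execute without human intervention and without requiring the agent to remember to invoke them.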
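A commit gate of this kind can be written as a PreToolUse hook script that inspects Bash commands. The following is a minimal sketch, assuming the hook receives a JSON object on stdin with tool_name and tool_input.command fields and that exit code 2 blocks the tool call and returns stderr to the agent, as the Claude Code hooks reference describes. The file name, the git commit pattern, and the function names are illustrative.

```python
import json
import re
import subprocess
import sys
from typing import Callable

# Heuristic: "git commit" at the start of the command or after ; & | or a newline.
GIT_COMMIT = re.compile(r"(^|[;&|\n]\s*)git\s+commit\b")


def is_git_commit(payload: dict) -> bool:
    """True when the pending tool call is a Bash command that runs git commit."""
    if payload.get("tool_name") != "Bash":
        return False
    command = payload.get("tool_input", {}).get("command", "")
    return bool(GIT_COMMIT.search(command))


def run_pytest() -> tuple[int, str]:
    """Run the test suite; return (exit code, last five lines of output)."""
    proc = subprocess.run(["uv", "run", "pytest", "tests/", "-x", "-q"],
                          capture_output=True, text=True)
    tail = "\n".join((proc.stdout + proc.stderr).splitlines()[-5:])
    return proc.returncode, tail


def decide(payload: dict,
           run_tests: Callable[[], tuple[int, str]] = run_pytest) -> tuple[int, str]:
    """Return (hook exit code, message): 0 allows the call, 2 blocks it."""
    if not is_git_commit(payload):
        return 0, ""
    code, tail = run_tests()
    if code == 0:
        return 0, ""
    return 2, f"Tests failed; commit blocked.\n{tail}"


def main() -> int:
    exit_code, message = decide(json.load(sys.stdin))
    if message:
        print(message, file=sys.stderr)  # on exit code 2, stderr goes to the agent
    return exit_code
```

Saved as .claude/hooks/pre_commit_tests.py with a final line of `raise SystemExit(main())`, the script can be referenced from a PreToolUse entry whose matcher is Bash. The pattern is a heuristic and will miss forms such as `git -C dir commit`, so it complements a CI check rather than replacing one.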

Level 4: MCP Servers (External Integration)

MCP (Model Context Protocol) servers extend Claude Code’s capabilities by connecting it to external tools and data sources. Project-scoped servers are declared in a .mcp.json file at the repository root (or registered with the claude mcp add command) and appear as additional tools that the agent can use. Values written as ${VAR} are expanded from the environment when the server starts, which keeps credentials out of the committed file.

{
  "mcpServers": {
    "postgres": {
      "command": "npx",
      "args": ["-y", "@modelcontextprotocol/server-postgres", "${DATABASE_URL}"]
    },
    "github": {
      "command": "npx",
      "args": ["-y", "@modelcontextprotocol/server-github"],
      "env": {
        "GITHUB_PERSONAL_ACCESS_TOKEN": "${GITHUB_TOKEN}"
      }
    }
  }
}

With MCP servers configured, Claude Code can query a database directly (inspecting the schema, running queries, and verifying data), interact with GitHub (creating pull requests, reading issues, and checking CI status), and integrate with any other tool that provides an MCP server implementation. This moves Claude Code from a coding assistant that sees only the repository toward an agent that can also inspect the systems the code runs against.
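Credentials written literally into an MCP configuration end up in version control, which is the same mistake the CLAUDE.md example forbids in application code. The following is a minimal sketch of a check that flags them, assuming the mcpServers layout shown above and ${VAR}-style environment references; the function name and the list of credential-like key names are illustrative.

```python
import re

# A value counts as externalised when it is entirely a ${VAR} or ${VAR:-default} reference.
ENV_REFERENCE = re.compile(r"^\$\{[A-Za-z_][A-Za-z0-9_]*(:-[^}]*)?\}$")
# Key names that usually hold credentials (illustrative, not exhaustive).
SECRET_KEY = re.compile(r"TOKEN|SECRET|PASSWORD|KEY|DATABASE_URL", re.IGNORECASE)
# A URL carrying user:password@ inline.
URL_WITH_PASSWORD = re.compile(r"://[^/\s:@]+:[^/\s@]+@")


def inline_secrets(config: dict) -> list[str]:
    """Return 'server.args' or 'server.ENV_NAME' for each literal credential found."""
    findings = []
    for server, spec in config.get("mcpServers", {}).items():
        if any(URL_WITH_PASSWORD.search(arg) for arg in spec.get("args", [])):
            findings.append(f"{server}.args")
        for name, value in spec.get("env", {}).items():
            if SECRET_KEY.search(name) and not ENV_REFERENCE.match(value):
                findings.append(f"{server}.{name}")
    return findings


risky = {
    "mcpServers": {
        "postgres": {
            "command": "npx",
            "args": ["-y", "@modelcontextprotocol/server-postgres",
                     "postgresql://user:pass@localhost:5432/mydb"],
        },
        "github": {
            "command": "npx",
            "args": ["-y", "@modelcontextprotocol/server-github"],
            "env": {"GITHUB_PERSONAL_ACCESS_TOKEN": "ghp_your_token_here"},
        },
    }
}
print(inline_secrets(risky))
```

Run against the parsed configuration file in CI or a pre-commit check, a non-empty result is a reason to fail the build before the secret reaches the repository.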

Level 5: Multi-Agent Orchestration

At the highest level of harness sophistication, multi-agent workflows can be orchestrated in which different Claude Code instances handle different phases of a task. This can be accomplished through custom commands that explicitly invoke the Agent tool for delegation.

A conceptual example of a /feature command implementing the planner-generator-evaluator pattern is shown below:

Implement the feature described in $ARGUMENTS using a
multi-phase approach:

## Phase 1: Planning
Use the Agent tool to spawn a planning sub-agent with this prompt:
"Read the codebase and create a detailed implementation plan for:
$ARGUMENTS. List all files to modify, new files to create,
tests to write, and edge cases to consider. Output a structured
specification."

## Phase 2: Implementation
Use the Agent tool to spawn an implementation sub-agent with
the specification from Phase 1. The sub-agent should implement
the feature, write tests, and run them.

## Phase 3: Review
Use the Agent tool to spawn a review sub-agent that reads the
diff of all changes, checks for bugs, security issues, style
violations, and specification compliance. Report any issues found.

## Phase 4: Resolution
If the review found issues, fix them. Run the full test suite.
Report the final result.

Each sub-agent operates with a clean context focused on its specific phase, while the parent agent coordinates the workflow and manages the handoffs.
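The control flow of this pattern is simple enough to state as code. The sketch below is a minimal illustration in which each sub-agent is represented by a plain callable; in Claude Code the parent agent performs these handoffs through the Agent tool, so `run_feature`, `FeatureRun`, and the stub lambdas are hypothetical stand-ins rather than a real API.

```python
from dataclasses import dataclass, field
from typing import Callable


@dataclass
class FeatureRun:
    request: str
    plan: str = ""
    diff: str = ""
    issues: list[str] = field(default_factory=list)


def run_feature(request: str,
                planner: Callable[[str], str],
                implementer: Callable[[str], str],
                reviewer: Callable[[str, str], list[str]],
                fixer: Callable[[str, list[str]], str]) -> FeatureRun:
    """Run the four phases in order, handing each phase only the artefact it needs."""
    run = FeatureRun(request)
    run.plan = planner(request)                # Phase 1: produce the specification
    run.diff = implementer(run.plan)           # Phase 2: sees the plan, not the chat history
    run.issues = reviewer(run.plan, run.diff)  # Phase 3: independent review of the diff
    if run.issues:                             # Phase 4: resolve only when issues exist
        run.diff = fixer(run.diff, run.issues)
    return run


result = run_feature(
    "add CSV export",
    planner=lambda req: f"PLAN: {req}",
    implementer=lambda plan: f"DIFF for [{plan}]",
    reviewer=lambda plan, diff: ["missing test"],
    fixer=lambda diff, issues: diff + " + fixes for " + ", ".join(issues),
)
print(result.diff)
```

The point of the structure is that the reviewer receives the plan and the diff but none of the implementer’s intermediate reasoning, which is what gives its review some independence.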

| Level | Component | Complexity | Impact on Quality |
| --- | --- | --- | --- |
| 1 | CLAUDE.md | Low (30 min setup) | Very High |
| 2 | Custom Commands | Low-Medium | High |
| 3 | Hooks | Medium | High |
| 4 | MCP Servers | Medium-High | Medium-High |
| 5 | Multi-Agent Orchestration | High | Medium (situational) |

Harness Engineering Best Practices

Several practices have proved their worth over repeated cycles of building and refining harnesses for Claude Code. They are lessons from practical use rather than theoretical recommendations.

Begin Simply and Add Complexity Only When Required

The most common mistake in harness engineering is over-engineering from the outset. Hooks, MCP servers, and multi-agent orchestration are not required on the first day. The recommended starting point is a CLAUDE.md file. After a week of using Claude Code, recurring errors will become apparent. A custom command or guide should then be added for each recurring failure pattern, and the process repeated. The most effective harnesses are grown organically from observed failure patterns rather than designed top-down from theoretical requirements.

Make the Harness Project-Specific

A one-size-fits-all harness is invariably a mediocre harness. A Python data pipeline has different requirements from a React frontend, which in turn has different requirements from a Rust systems library. The CLAUDE.md, custom commands, and hooks should all be tailored to the specific project, its technology stack, its conventions, and its common failure modes. Generic guidance such as “write clean code” is of little use. Specific instructions such as “use Pydantic models for all API responses; never return raw dicts” are actionable.

Test the Harness Configuration

One practice that distinguishes competent harness engineers from highly effective ones is the A/B testing of harness changes. Before adding a new guide or hook, a representative task should be run and the result recorded. The harness change is then applied and the same task executed again. Because agent output varies from run to run, each configuration should be run several times and the success rates compared rather than single results. The improvement, if any, should be quantified. This empirical approach prevents harness bloat: configurations that appear useful but do not actually improve outcomes.
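The comparison itself needs very little machinery. The following is a minimal sketch, assuming each run of the representative task has already been judged pass or fail (for example by the test suite); the function names and the sample outcomes are illustrative, not measured results.

```python
def success_rate(outcomes: list[bool]) -> float:
    """Fraction of runs that passed; 0.0 when there are no runs."""
    return sum(outcomes) / len(outcomes) if outcomes else 0.0


def compare(baseline: list[bool], candidate: list[bool]) -> dict[str, float]:
    """Success rate before and after a harness change, plus the difference."""
    base, cand = success_rate(baseline), success_rate(candidate)
    return {"baseline": base, "candidate": cand, "delta": cand - base}


# Hypothetical outcomes of five runs of the same task per configuration.
before = [True, False, False, True, False]
after = [True, True, False, True, True]
print(compare(before, after))
```

With only a handful of runs per configuration the difference is noisy, so a small delta is better treated as "no evidence of improvement" than as a result.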

Place the Harness Under Version Control

The CLAUDE.md file, the .claude/commands/ directory, and the hooks configuration should all be checked into version control alongside the code. They form part of the project’s engineering infrastructure and should be reviewed in pull requests, iterated upon over time, and shared across the team. A harness that exists only on one developer’s machine is a harness that will eventually be lost.

Iterate Based on Failure Patterns

Each time Claude Code makes a mistake that it should not have made, the question to ask is whether a harness change could have prevented it. If the agent repeatedly uses the wrong database library, a guide should be added. If it repeatedly forgets to run tests, a hook should be added. If it repeatedly generates code that fails the linter, a sensor should be added. The harness should function as a living document that evolves as new failure patterns are identified.

Balance Autonomy and Control

Excessive constraints render the agent slow and inflexible, as it spends more time verifying rules than completing work. Insufficient constraints render it error-prone, as it makes avoidable mistakes from lack of guidance. The appropriate balance varies by project and team. High-risk production codebases require more constraints, while experimental prototyping projects benefit from greater autonomy. Calibration should be adjusted accordingly.

Monitor and Measure

The agent’s success rate should be tracked over time. Relevant questions include how often the agent completes tasks correctly on the first attempt, how often correction is required, and which categories of errors occur most frequently. This data indicates where to direct harness engineering effort. If eighty per cent of failures are type errors, investment in type-checking sensors is warranted. If eighty per cent of failures stem from misunderstood requirements, investment in better guides is warranted.
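These questions translate directly into two small metrics. The sketch below is a minimal illustration, assuming each completed task is logged with its attempt count and, where applicable, a failure category; the `TaskRecord` fields and category names are hypothetical.

```python
from collections import Counter
from dataclasses import dataclass
from typing import Optional


@dataclass
class TaskRecord:
    attempts: int                           # attempts until the result was accepted
    failure_category: Optional[str] = None  # e.g. "type_error"; None if nothing failed


def first_attempt_rate(records: list[TaskRecord]) -> float:
    """Share of tasks completed correctly on the first attempt."""
    if not records:
        return 0.0
    return sum(1 for r in records if r.attempts == 1) / len(records)


def top_failures(records: list[TaskRecord], n: int = 3) -> list[tuple[str, int]]:
    """Most frequent failure categories, largest first."""
    counts = Counter(r.failure_category for r in records if r.failure_category)
    return counts.most_common(n)


log = [
    TaskRecord(1),
    TaskRecord(2, "type_error"),
    TaskRecord(3, "type_error"),
    TaskRecord(2, "requirements"),
    TaskRecord(1),
]
print(first_attempt_rate(log))
print(top_failures(log))
```

The category at the head of the list indicates which kind of harness component to invest in next: a sensor for mechanical errors, a guide for misunderstood requirements.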

Harness Engineering vs Prompt Engineering

Harness engineering is sometimes conflated with prompt engineering. Although the two are related, they are fundamentally different disciplines, and understanding the distinction is important for allocating engineering effort appropriately.

Prompt engineering is the craft of writing a single prompt for a single interaction. It focuses on wording, structure, few-shot examples, and instruction clarity in order to obtain the best possible response from one model call. It is valuable, and it constitutes one component of harness engineering—specifically, it falls under the heading of guides. But it remains only one piece of a broader framework.

Harness engineering is the discipline of designing a complete system around the model for sustained, reliable operation across many interactions and many tasks. It encompasses not only the prompt but every other component: the tools the model can use, the verification that runs after the model acts, the correction mechanisms invoked when problems arise, the persistence of cross-session knowledge, and the permissions that govern what the model may do.

| Dimension | Prompt Engineering | Harness Engineering |
| --- | --- | --- |
| Scope | Single prompt, single interaction | Complete system across many interactions |
| Persistence | Ephemeral (one conversation) | Persistent (CLAUDE.md, memory, commands) |
| Components | Text instructions only | Text + tools + sensors + verification + correction |
| Reliability | Varies per interaction | Systematically improved over time |
| Scalability | Manual (re-craft for each task) | Automated (configure once, apply to all tasks) |
| Error handling | Hope the prompt prevents errors | Detect, verify, and correct errors automatically |
| Team sharing | Copy-paste prompts | Version-controlled config files in the repo |

The principal observation is that prompt engineering is a subset of harness engineering. A practitioner who attends only to prompt engineering leaves the majority of available improvement potential unrealised. The largest gains derive from the components that prompt engineering does not address: tools, verification, correction, and persistence.
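The difference in error handling can be made concrete with a verify-and-correct loop, which is something no single prompt can provide. The following is a minimal sketch in which `generate` stands in for a model call and `verify` for any deterministic check; both names, and the syntax check used as the verifier, are assumptions chosen for illustration.

```python
from typing import Callable, Optional


def run_with_verification(generate: Callable[[Optional[str]], str],
                          verify: Callable[[str], Optional[str]],
                          max_attempts: int = 3) -> tuple[Optional[str], int]:
    """Call generate until verify reports no error or the attempts run out.

    generate receives the previous error message (None on the first call);
    verify returns an error message, or None when the output is acceptable.
    """
    feedback: Optional[str] = None
    for attempt in range(1, max_attempts + 1):
        output = generate(feedback)
        feedback = verify(output)
        if feedback is None:
            return output, attempt
    return None, max_attempts


def syntax_check(source: str) -> Optional[str]:
    """Deterministic verifier: does the text compile as Python?"""
    try:
        compile(source, "<agent>", "exec")
        return None
    except SyntaxError as exc:
        return f"SyntaxError: {exc.msg}"


# Stub model: the first draft is broken, the second (given feedback) is fixed.
def stub_model(feedback: Optional[str]) -> str:
    return "print('ok')" if feedback else "print('ok'"


print(run_with_verification(stub_model, syntax_check))
```

The loop improves reliability without changing a word of the prompt, which is the sense in which the gains come from the components outside prompt engineering.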

Real-World Harness Examples

Abstract principles are useful, but concrete examples render them actionable. Three harness configurations are presented below: the one used to publish this article, followed by two representative configurations for common team scenarios.

Example 1: Blog Publishing Harness (aicodeinvest.com)

The reader is currently engaging with the output of such a harness. This article was written and published by Claude Code operating within a harness developed specifically for blog publishing. The harness includes the following components:

  • CLAUDE.md: Contains writing guidelines (4,000-6,000 words, measured tone, specific HTML patterns), post structure requirements (Table of Contents, Introduction, body sections, Conclusion, References), and explicit anti-patterns to avoid (no numbered headings, no html/head/body wrappers).
  • /write-post custom command: Orchestrates the full workflow, including topic selection, writing, saving, publishing via WordPress REST API, and recording topic usage for deduplication.
  • WordPress REST API as a tool: A Python CLI (src/main.py) that handles authentication, content upload, category assignment, and status management.
  • Topic deduplication system: Tracks recently used topics in config/recent_topics.json to prevent the agent from writing about the same subject twice.

This harness transforms Claude Code from a general-purpose AI assistant into a specialised blog publishing system. The model’s writing ability provides the engine. The harness—comprising the CLAUDE.md guidelines, the custom command workflow, the publishing tools, and the deduplication system—is what converts that engine into a reliable content production pipeline.
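The deduplication component is the kind of small deterministic tool that is easier to trust than a prompt instruction. The sketch below is a minimal illustration, assuming config/recent_topics.json holds a flat JSON list of topic strings; the actual file format and the function names are not given in this article and are hypothetical here.

```python
import json
import tempfile
from pathlib import Path


def normalise(topic: str) -> str:
    """Case-fold and collapse whitespace so near-identical titles compare equal."""
    return " ".join(topic.lower().split())


def load_recent(path: Path) -> list[str]:
    if not path.exists():
        return []
    return json.loads(path.read_text(encoding="utf-8"))


def record_topic(path: Path, topic: str, keep: int = 50) -> bool:
    """Append the topic unless it was used recently; return True if recorded."""
    recent = load_recent(path)
    if normalise(topic) in {normalise(t) for t in recent}:
        return False
    recent.append(topic)
    path.parent.mkdir(parents=True, exist_ok=True)
    # Keep only the most recent entries so the file stays small.
    path.write_text(json.dumps(recent[-keep:], indent=2), encoding="utf-8")
    return True


with tempfile.TemporaryDirectory() as tmp:
    store = Path(tmp) / "config" / "recent_topics.json"
    first = record_topic(store, "Harness Engineering")
    second = record_topic(store, "harness   engineering")
    print(first, second)
```

Exact-match comparison after normalisation only catches repeated titles; detecting two differently worded posts on the same subject would need a fuzzier comparison.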

Example 2: Enterprise Code Review Harness

Consider a team using Claude Code for automated code review. The corresponding harness might include the following components:

  • CLAUDE.md: Company coding standards, security requirements (no hardcoded secrets, all inputs sanitised, all queries parameterised), performance guidelines (no N+1 queries, pagination required for list endpoints), and architectural rules (clean architecture layers, dependency injection).
  • /review custom command: A structured review process that checks security, performance, style, test coverage, and documentation in that order, producing a formatted review with severity ratings.
  • CI/CD integration hooks: Post-commit hooks that run the test suite, linter, and security scanner, returning results to the agent for review.
  • Jira/Linear MCP server: Connects Claude Code to the team’s project management tool, enabling it to read ticket descriptions, interpret acceptance criteria, and verify that code changes match the specified requirements.

This harness ensures that every code review follows the same rigorous process, applies the same standards, and produces consistent, actionable feedback regardless of which developer triggered the review or which part of the codebase is under modification.

Example 3: Data Pipeline Harness

A data engineering team might construct a harness for managing ETL pipelines:

  • Custom commands: /new-pipeline for scaffolding new ETL jobs with the team’s standard structure, /validate-schema for checking data schemas against the warehouse, and /backfill for running historical data loads with appropriate idempotency checks.
  • Database MCP server: Provides Claude Code with direct access to the data warehouse schema, allowing it to interpret table structures, column types, relationships, and constraints without explicit explanation from the developer.
  • Test data generation tools: Custom commands that generate realistic test data for pipeline testing, including edge cases such as null values, duplicate records, and timezone mismatches.
  • CLAUDE.md with data engineering conventions: Rules concerning idempotency (all pipelines must be safely re-runnable), data validation (all inputs must be schema-validated before processing), and monitoring (all pipelines must emit metrics for latency, throughput, and error rate).
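The idempotency rule in the last item is easiest to convey to an agent with a concrete instance. The following is a minimal sketch using an in-memory SQLite database as a stand-in for the warehouse; the trades table, its columns, and the sample rows are hypothetical.

```python
import sqlite3


def load_batch(conn: sqlite3.Connection, rows: list[tuple[str, str, str]]) -> None:
    """Load rows keyed by trade_id; re-running the same batch changes nothing."""
    conn.executemany(
        # Bound parameters only; the primary key makes the write an upsert.
        "INSERT OR REPLACE INTO trades (trade_id, symbol, price) VALUES (?, ?, ?)",
        rows,
    )
    conn.commit()


conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE trades (trade_id TEXT PRIMARY KEY, symbol TEXT, price TEXT)"
)

batch = [("t1", "AAPL", "189.30"), ("t2", "MSFT", "411.02")]
load_batch(conn, batch)
load_batch(conn, batch)  # simulated re-run after a partial failure

row_count = conn.execute("SELECT COUNT(*) FROM trades").fetchone()[0]
print(row_count)
```

A /backfill command can then state its idempotency check in the same terms: running the load twice over the same date range must leave the row count unchanged.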

Each of these examples illustrates the same principle: the harness is tailored to the specific domain, encoding domain expertise into configuration that the agent can apply automatically.

The Future of Harness Engineering

Harness engineering is a young discipline that is evolving rapidly. Several trends are discernible.

A New Engineering Discipline

Just as DevOps emerged as a distinct discipline at the intersection of development and operations, harness engineering is emerging as a distinct discipline at the intersection of AI and software engineering. Companies are already hiring for roles that are, in effect, harness engineering positions—specialists in configuring, tuning, and optimising AI agent systems. The formal title may be “AI Platform Engineer” or “Agent Systems Engineer,” but the core skill set is harness engineering.

Standardisation Through MCP

The Model Context Protocol (MCP) represents the first serious attempt to standardise the interface between AI agents and external tools. Prior to MCP, each agent maintained its own proprietary tool integration system. MCP provides a common protocol that any tool can implement and any agent can consume. The analogy with HTTP and the web is apt: a standard creates the conditions for an ecosystem. As MCP matures, MCP servers can be expected to proliferate across the full range of tools and data sources, substantially lowering the cost of harness engineering.

Harness Marketplaces

At present, sharing a harness configuration involves distributing CLAUDE.md files and custom commands through GitHub repositories. In the future, dedicated marketplaces for harness configurations may emerge—curated collections of CLAUDE.md files, custom commands, hooks, and MCP server configurations for specific technology stacks and workflows. Examples might include production-ready harnesses for Django with PostgreSQL and Celery, or for iOS development with SwiftUI and Core Data. Such pre-built harnesses would provide teams with a starting point that already encodes best practices for their stack.

Self-Improving Harnesses

The most consequential frontier is the development of self-improving harnesses: harness systems that learn from their own failures and automatically update their configuration. A harness might, for example, observe that the agent repeatedly makes the same type error in a specific module and automatically add a guide to CLAUDE.md stipulating the use of Decimal rather than float for monetary values in the payments module. Alternatively, a harness might detect that test failures cluster around a specific API endpoint and automatically add more thorough validation for the responses from that endpoint.

This is less speculative than it may sound. The constituent building blocks exist today: the agent can read its own CLAUDE.md, analyse its own failure patterns, and edit its own CLAUDE.md. The missing element is the orchestration logic that determines when to perform such updates and what to change, which remains an active area of research.
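A deliberately naive version of that orchestration logic shows how little is needed for a first experiment. The sketch below is a minimal illustration, assuming failures are already labelled with a category and that a human reviews any proposed guide before it is committed; the threshold, the category names, and the rule table are hypothetical.

```python
from collections import Counter


def propose_guides(failures: list[str], existing: str,
                   rules: dict[str, str], threshold: int = 3) -> list[str]:
    """Return guide lines for failure categories seen at least `threshold` times
    whose guide is not already present in the CLAUDE.md text."""
    counts = Counter(failures)
    return [
        rules[category]
        for category, seen in counts.items()
        if seen >= threshold and category in rules and rules[category] not in existing
    ]


rules = {
    "float_money": "- Use Decimal, never float, for monetary values in the payments module",
}
observed = ["float_money", "float_money", "missing_test", "float_money"]
claude_md = "## Coding Conventions\n- All functions must have type hints\n"

print(propose_guides(observed, claude_md, rules))
```

Returning proposals rather than editing the file keeps the update inside the normal pull-request review, which addresses the question of when a change should be applied without requiring the harness to decide alone.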

The “Operating System for AI” Vision

At a sufficient level of abstraction, the harness begins to resemble an operating system. It manages resources (context windows, tool access), enforces permissions (what the agent may and may not do), provides system services (file I/O, networking, process management), and exposes a user interface (the conversation loop). The analogy is imperfect, but it points toward a future in which the harness is not merely a configuration layer but a complete runtime environment for AI agents, exhibiting the level of sophistication that operating systems bring to traditional computing.

Final Thoughts

The AI industry has spent the past several years in a competition over models—larger, faster, more capable. That competition continues, but a parallel race has emerged: the race to build better harnesses. The teams and organisations that develop expertise in harness engineering will extract substantially more value from the same models available to everyone else.

The formula is straightforward: Agent = Model + Harness. The model provides raw intelligence. The harness provides structure, tools, verification, correction, memory, and control. Together, they produce an agent capable of operating reliably in the real world. Separately, neither is complete.

If one observation should be drawn from this article, it is the following: an AI agent should not be treated as a chatbot with additional features but as an engineered system. A CLAUDE.md file should be written. Custom commands should be created for common workflows. Hooks should be added for automated quality gates. MCP servers should be connected for external tool access. The harness should be tested, iterated upon, placed under version control, and shared across the team.

The model is the engine. The harness is the vehicle. At present, many practitioners are attempting to drive a bare engine down the motorway. The vehicle must be built.

Key Takeaway: Harness engineering is among the most valuable skills in AI-assisted development today. A thirty-minute investment in a strong CLAUDE.md file will improve every subsequent interaction with Claude Code. The recommended approach is to begin there, measure the results, and build upon that foundation.
