
Harness Engineering Explained: What It Is and How Claude Code’s Harness Makes AI Agents Actually Work

When curious developers first decompiled and analyzed Claude Code — Anthropic’s AI-powered coding agent — they expected to find a thin wrapper around a large language model. A glorified chatbot with file access, maybe. What they actually found stopped them in their tracks: a sophisticated orchestration layer comprising 19 permission-gated tools, a streaming agent loop with continuous feedback, hierarchical memory systems, sub-agent spawning, context compaction algorithms, and a multi-layered permission model that governs every single action the agent takes. The model itself? Just one piece of a much larger machine.

This discovery crystallized something that the AI engineering community had been circling around for months: the model is not the product. The harness is. The thousands of lines of orchestration code that wrap around a language model — deciding what it sees, what it can do, how it recovers from mistakes, and how it persists knowledge across sessions — that is where the real engineering happens. That is where quality is won or lost.

Think about it this way. You can hand two developers the exact same LLM API key. One builds a simple prompt-and-response loop. The other builds a system with tool access, automated testing, iterative error correction, and persistent project memory. Both are using the same model. The results will be worlds apart. The difference is not the engine — it is everything around the engine. The difference is the harness.

As large language models become increasingly commoditized — with open-source models closing the gap on proprietary ones, and multiple providers offering comparable intelligence — the harness engineering around them is rapidly becoming the real competitive moat. Companies that master harness design will build agents that reliably ship code, manage infrastructure, and automate complex workflows. Companies that treat the model as the whole product will wonder why their agents keep failing at real-world tasks. This post is a deep dive into harness engineering: what it is, how Claude Code implements it, and how you can build your own.

What Is Harness Engineering?

Let’s start with the definition that Anthropic’s own engineering team has been using: harness engineering is “the art and science of leveraging your coding agent’s configuration points to improve output quality and increase task success rates.” It is the discipline of designing, building, and refining everything in an AI agent except the model itself.

The core formula is deceptively simple:

Key Takeaway: Agent = Model + Harness. The model provides intelligence. The harness provides capability, reliability, and control. You need both, but the harness is where engineering effort has the highest return on investment.

Here is an analogy that makes this concrete. The model is an engine — a powerful, general-purpose engine capable of generating text, reasoning about code, and solving complex problems. But an engine sitting on a workbench does not go anywhere. It does not steer, it does not brake, it does not know the destination. The harness is the car built around that engine: the steering wheel (guides that direct the model’s behavior), the brakes (permissions that prevent dangerous actions), the transmission (tools that translate model decisions into real-world actions), the GPS (context management that keeps the model oriented), and the safety systems (verification and correction loops that catch and fix mistakes).

A powerful engine with no car around it is useless; a car without an engine does not move at all. You need both, but when comparing two cars with equivalent engines, the one with the better engineering around the engine wins every time.

Why Harness Engineering Matters

The same model with a bad harness produces poor results. The same model with a great harness produces incredible results. This is not theoretical — it is measurable. Anthropic’s own research on long-running coding agents showed that harness improvements (better guides, tighter feedback loops, smarter context management) increased task success rates by 2-3x without changing the underlying model at all. The model was already capable of solving the problems. The harness was the bottleneck.

This realization marks a fundamental shift in how we think about AI engineering. For the past few years, the dominant paradigm has been prompt engineering — the craft of writing better prompts to coax better outputs from language models. Prompt engineering is valuable, but it is a single-turn optimization. You craft a prompt, you get a response, you iterate on the prompt. Harness engineering is the evolution of prompt engineering into a full systems discipline. It encompasses not just the prompt, but the tools available to the model, the verification steps that run after the model acts, the correction mechanisms that fire when something goes wrong, the memory systems that persist knowledge across sessions, and the permission boundaries that keep the agent safe.

Prompt engineering asks: “How do I write a better prompt?” Harness engineering asks: “How do I build a better system around the model so that it reliably succeeds at complex, multi-step, real-world tasks?”

The Four Core Functions of a Harness

Anthropic’s published research on effective agent harnesses identifies four core functions that every harness must perform. Think of these as the four pillars that hold up a reliable AI agent. Remove any one of them, and the structure becomes unstable. Let’s examine each one in detail.

Guides (Feedforward Controls)

Guides are feedforward controls — they steer the agent before it acts. Their job is to set expectations, provide context, establish rules, and shape the model’s behavior before it ever writes a line of code or executes a command. Good guides dramatically reduce errors by preventing them in the first place, rather than catching them after the fact.

In Claude Code’s ecosystem, guides take several concrete forms:

  • CLAUDE.md files: Project-level instruction files that tell the agent about the codebase, coding conventions, what frameworks to use, what patterns to follow, and what mistakes to avoid. These are the single most impactful harness component you can configure.
  • Custom commands (slash commands): Pre-defined workflows like /write-post or /review that structure multi-step tasks into repeatable processes, complete with specific instructions for each step.
  • Coding conventions and style guides: Explicit rules about formatting, naming, architecture patterns, and anti-patterns that the agent should follow or avoid.
  • Structured prompts and bootstrap instructions: System-level prompts that establish the agent’s role, capabilities, and constraints before any user interaction begins.
  • Task decomposition rules: Instructions that tell the agent how to break down large tasks into manageable subtasks, preventing the common failure mode of trying to do too much in a single step.
  • Examples and few-shot demonstrations: Concrete examples of desired output that show the agent exactly what “good” looks like for a given task.

The key insight about guides is that they are cheap to implement and high-impact. Writing a good CLAUDE.md file takes 30 minutes. The improvement in agent output quality can be dramatic and immediate. This is why Anthropic recommends starting your harness engineering journey with guides.

Sensors (Feedback Controls)

Sensors are feedback controls — they catch problems after the agent acts. While guides try to prevent errors, sensors accept that errors will happen and focus on detecting them quickly. The faster you detect an error, the cheaper it is to fix.

Effective sensors for AI coding agents include:

  • Linters (ESLint, Ruff, mypy, Pylint) tuned for LLM-generated code patterns — LLMs tend to make specific categories of mistakes that linters can catch reliably.
  • Type checkers that catch type errors, missing imports, and interface mismatches before runtime.
  • Test suites designed specifically for LLM output patterns — not just generic unit tests, but tests that target the kinds of errors AI agents commonly make.
  • Build verification that ensures the code compiles and the project builds successfully after every change.
  • Code diff analysis that reviews what changed and flags potentially problematic patterns (accidental deletions, overly broad changes, unintended side effects).

Tip: The most effective sensor setup for AI agents is to run linters and type checkers automatically after every code change, not just at commit time. This gives the agent immediate feedback and the opportunity to self-correct before moving on to the next task.

Verification

Verification goes beyond sensors. While sensors detect that something might be wrong, verification confirms that the agent actually accomplished the intended goal. Did the feature work? Does the output match the specification? Is the behavior correct, not just syntactically valid?

Verification mechanisms include:

  • Automated test execution: Running the full test suite (or relevant subset) after changes to confirm that existing functionality still works and new functionality behaves as specified.
  • CI/CD pipeline integration: Feeding agent output through the same continuous integration pipeline that human code goes through, ensuring equal quality standards.
  • Browser automation testing: For web applications, actually loading the page and verifying that UI changes render correctly — not just checking that the code is syntactically valid, but that it produces the right visual and interactive result.
  • LLM-as-a-Judge: Using a superior model (or the same model in a separate context) to evaluate the quality and correctness of the agent’s output. This is particularly useful for subjective quality assessments like code readability, documentation quality, or design decisions.

Correction

Correction is the final pillar — and arguably the one that separates toy agents from production-grade agents. When the agent makes a mistake (and it will), how does the system respond? A naive system simply fails and reports the error. A well-harnessed system feeds the error back to the model, lets it reason about what went wrong, generates a fix, and tries again.

Correction mechanisms include:

  • Feedback loops: Test failure → model reads the error message → model analyzes the root cause → model generates a fix → system reruns the test. This loop can repeat multiple times until the test passes or a retry limit is reached.
  • Self-repair mechanisms: When the agent detects that its own output is malformed or incomplete, it can trigger a repair pass without human intervention.
  • Retry logic with context: Not just blindly retrying the same action, but retrying with additional context about what went wrong — the error message, the stack trace, the failing test output.
  • Graceful fallback strategies: When the agent cannot solve a problem after multiple attempts, it should degrade gracefully — perhaps simplifying its approach, asking for human input, or documenting what it tried and why it failed.
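The feedback loop described above can be sketched as a small retry-with-context driver. This is an illustrative Python sketch, not Claude Code's actual implementation; `run_tests`, `apply_fix`, and `ask_model_for_fix` are hypothetical callables standing in for your real test runner and model call:

```python
def correction_loop(run_tests, apply_fix, ask_model_for_fix, max_retries=3):
    """Retry with context: every retry sees what went wrong last time."""
    failure = None
    for attempt in range(1, max_retries + 1):
        if failure is not None:
            # Retry WITH context: the fix is informed by the previous error.
            apply_fix(ask_model_for_fix(failure))
        ok, failure = run_tests()
        if ok:
            return {"status": "passed", "attempts": attempt}
    # Graceful fallback: surface what was tried instead of failing silently.
    return {"status": "needs_human", "attempts": max_retries, "last_failure": failure}
```

The key design point is that `failure` is threaded back into the next model call rather than discarded, which is exactly what distinguishes "retry with context" from blind retries.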

| Function | Type | When It Acts | Examples |
| --- | --- | --- | --- |
| Guides | Feedforward | Before the agent acts | CLAUDE.md, custom commands, coding conventions |
| Sensors | Feedback | After the agent acts | Linters, type checkers, build verification |
| Verification | Validation | After completion | Test suites, CI/CD, browser testing, LLM-as-Judge |
| Correction | Remediation | When something fails | Feedback loops, self-repair, retry with context |

The interplay between these four functions creates a resilient system. Guides reduce the error rate. Sensors catch the errors that slip through. Verification confirms that the overall goal was achieved. Correction handles the cases where it was not. Together, they transform a probabilistic language model into a deterministic-enough system for production use.

Inside Claude Code’s Harness Architecture

Now that we understand the theory, let’s look at how one of the most sophisticated AI coding agents in the world actually implements these principles. Claude Code is not just a model with a terminal — it is a carefully engineered harness that embodies all four core functions. Based on public analysis of its architecture, here is what is happening under the hood.

19 Permission-Gated Tools

At the heart of Claude Code’s harness are 19 distinct tools that the model can invoke to interact with the outside world. Each tool is permission-gated, meaning the system controls which tools the agent can use and under what circumstances. These tools include:

  • File I/O: Read (view file contents), Write (create or overwrite files), Edit (make targeted string replacements in existing files)
  • Shell execution: Bash (execute arbitrary shell commands with timeout controls)
  • Search: Grep (content search with regex support), Glob (file pattern matching)
  • Git operations: Integrated version control operations
  • Web access: WebFetch (retrieve web page content for research)
  • Notebook editing: NotebookEdit (modify Jupyter notebook cells)
  • Sub-agent spawning: Agent (create specialized sub-agents for parallel or delegated tasks)
  • Task management: TaskCreate, TaskGet, TaskList, TaskUpdate (manage background tasks)

The critical design decision here is permission gating. Not all tools are created equal in terms of risk. Reading a file is safe. Deleting a file is dangerous. Running a shell command could do anything. Claude Code’s harness categorizes tool invocations by risk level and requires explicit user approval for high-risk operations — like running unfamiliar shell commands, writing to sensitive files, or performing destructive git operations. This is the “brakes” part of our car analogy, and it is essential for trust.
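A minimal sketch of risk-tiered permission gating might look like the following. The tool names match Claude Code's built-ins, but the risk tiers and the `gate_tool_call` function are illustrative assumptions, not Anthropic's actual policy code:

```python
# Illustrative risk tiers -- a real harness would also inspect the
# arguments (e.g. which file, which shell command), not just the tool name.
TOOL_RISK = {
    "Read": "safe", "Grep": "safe", "Glob": "safe",
    "Write": "moderate", "Edit": "moderate",
    "Bash": "dangerous",
}

def gate_tool_call(tool, approve):
    """Allow safe tools automatically; route risky ones through approval.

    `approve(tool, risk)` stands in for prompting the user in the terminal.
    """
    risk = TOOL_RISK.get(tool, "dangerous")  # unknown tools default to risky
    if risk == "safe":
        return True
    return approve(tool, risk)
```

Defaulting unknown tools to the most restrictive tier is the important choice here: the gate fails closed, not open.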

The Streaming Agent Loop

Unlike a simple request-response chatbot, Claude Code operates in a streaming agent loop. The model receives input, reasons about what to do, invokes a tool, observes the result, reasons again, invokes another tool, observes that result, and continues this cycle until the task is complete or it determines it needs human input. This loop is what makes Claude Code an agent rather than just a chatbot.

The streaming nature of this loop is important for user experience. Rather than disappearing for minutes while processing, the agent shows its work in real time — the user can see what files it is reading, what commands it is running, and what decisions it is making. This transparency builds trust and allows the user to intervene early if the agent is heading in the wrong direction.
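Stripped of streaming and error handling, the shape of such a loop can be sketched in a few lines. `model_step`, `execute_tool`, and `on_event` are hypothetical stand-ins for the model call, the tool dispatcher, and the UI stream:

```python
def agent_loop(model_step, execute_tool, on_event, max_steps=50):
    """Minimal act-observe-reason cycle.

    model_step(observation) returns either ("tool", name, args)
    or ("done", result).
    """
    observation = None
    for _ in range(max_steps):
        action = model_step(observation)
        if action[0] == "done":
            return action[1]
        _, name, args = action
        on_event(f"invoking {name}")            # stream progress to the user
        observation = execute_tool(name, args)  # feed the result back in
    raise RuntimeError("step limit reached without completion")
```

The `on_event` callback is what makes the loop "streaming" in spirit: every tool invocation is surfaced as it happens, so the user can intervene instead of waiting for a final answer.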

Context Management Layer

One of the most underappreciated components of Claude Code’s harness is its context management layer. Language models have finite context windows — even large ones. A coding session that spans reading dozens of files, running tests, making changes, and debugging errors can quickly exceed the context limit. Claude Code handles this through several mechanisms:

  • Auto-compaction: When the conversation approaches the context limit, the harness automatically summarizes earlier parts of the conversation, preserving the most important information while freeing up context space for new work.
  • Persistent memory: The CLAUDE.md system and memory files allow important information to persist across sessions, so the agent does not need to re-learn the project’s conventions every time it starts.
  • Selective file reading: Rather than loading entire files, the agent can read specific line ranges, search for specific patterns, and load only the relevant portions of large files.

Key Takeaway: Context management is the “invisible” harness component that most people underestimate. Without it, agents degrade rapidly on long tasks as their context fills with irrelevant information and they lose track of what they were doing. Good context management is what enables Claude Code to handle tasks that span hundreds of tool invocations.
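A rough sketch of threshold-based auto-compaction, assuming a `summarize` function (in practice, a separate model call) and an illustrative 80% threshold; this is the idea, not Claude Code's actual algorithm:

```python
def maybe_compact(messages, token_count, limit, summarize, threshold=0.8):
    """When usage crosses the threshold, summarize older messages and
    keep only the recent tail verbatim."""
    if token_count < threshold * limit:
        return messages  # plenty of room -- leave history untouched
    keep = 5  # how many recent messages stay verbatim (illustrative choice)
    summary = summarize(messages[:-keep])
    return [
        {"role": "system", "content": f"Summary of earlier work: {summary}"}
    ] + messages[-keep:]
```

The trade-off encoded here is deliberate: old detail is lossy-compressed into a summary, while the most recent exchanges, which the model needs verbatim to continue its current step, are preserved exactly.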

The CLAUDE.md System

Claude Code’s CLAUDE.md system is a hierarchical instruction framework that operates at multiple levels:

  • Project-level CLAUDE.md: Lives in the repository root. Contains project-specific instructions, coding conventions, architecture descriptions, and common pitfalls. Every developer on the team benefits from the same instructions.
  • User-level CLAUDE.md: Lives in the user’s home directory. Contains personal preferences and conventions that apply across all projects.
  • Directory-level CLAUDE.md: Lives in specific subdirectories. Contains instructions specific to that part of the codebase — useful for monorepos or projects with distinct subsystems.

This hierarchy means the agent gets increasingly specific guidance as it drills into the codebase. The project-level file might say “use TypeScript with strict mode.” The directory-level file in /src/database/ might add “always use parameterized queries, never string concatenation for SQL.” The system merges these instructions, with more specific files taking precedence.
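The hierarchy-merging behavior can be approximated in a few lines. This is a simplified sketch of the idea, not Claude Code's actual resolution logic: general instructions come first and more specific ones later, so the deeper files effectively take precedence in the merged prompt.

```python
def applicable_claude_files(file_path):
    """Every CLAUDE.md path from the repo root down to the file's
    directory, root first so deeper files can override general ones."""
    parts = file_path.split("/")[:-1]  # drop the filename
    paths = ["CLAUDE.md"]
    for depth in range(1, len(parts) + 1):
        paths.append("/".join(parts[:depth]) + "/CLAUDE.md")
    return paths

def merge_instructions(file_path, contents):
    """contents maps a CLAUDE.md path to its text; missing files are skipped."""
    found = [contents[p] for p in applicable_claude_files(file_path)
             if p in contents]
    return "\n\n".join(found)
```
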

Hooks and MCP Integration

Two additional harness components deserve mention. Hooks are shell commands that execute automatically in response to agent events — for example, a pre-tool hook that runs a linter before every file write, or a post-tool hook that validates the result of every shell command. Hooks let you inject automated quality gates into the agent’s workflow without modifying the agent itself.

MCP (Model Context Protocol) integration allows Claude Code to connect to external tools and data sources through a standardized protocol. MCP servers can provide access to databases, APIs, project management tools, documentation systems, and any other resource that might help the agent do its job. This is the “expansion port” of the harness — the mechanism for extending its capabilities beyond the built-in tools.

| Harness Component | Core Function | What It Does |
| --- | --- | --- |
| CLAUDE.md files | Guide | Project-specific instructions and conventions |
| Custom commands | Guide | Repeatable multi-step workflows |
| Permission system | Guide + Sensor | Controls tool access and requires approval for risky actions |
| 19 built-in tools | Capability | File I/O, search, shell, git, web access, sub-agents |
| Streaming agent loop | Orchestration | Continuous act-observe-reason cycle |
| Context management | Efficiency | Auto-compaction, selective reading, memory persistence |
| Hooks | Sensor + Verification | Automated quality gates on agent events |
| MCP integration | Capability extension | Connect to external tools and data sources |

Multi-Agent Harness Architecture

One of the most significant findings from Anthropic’s research on long-running agents is that the optimal harness architecture for complex tasks is not a single agent doing everything — it is multiple specialized agents, each with a clean context and a focused role. This is the multi-agent harness pattern, and it solves one of the most persistent problems in AI agent design: context degradation.

The Context Degradation Problem

Here is the problem. A single agent working on a large task accumulates context over time — files it has read, commands it has run, errors it has encountered, decisions it has made. As this context grows, the model’s ability to stay focused and coherent degrades. Anthropic’s research calls this “context anxiety” — the model becomes increasingly uncertain about which information is still relevant, starts second-guessing earlier decisions, and may even contradict its own prior work. The longer the session, the worse this gets.

The multi-agent pattern solves this by giving each agent a clean context reset. Instead of one agent doing everything, you have specialized agents that each handle one phase of the work, passing structured handoffs between them.

The Planner-Generator-Evaluator Pattern

Anthropic’s research describes an effective three-agent pattern:

  • Planner Agent: Takes a brief user prompt and expands it into a comprehensive specification. The planner reads the codebase, understands the requirements, and produces a detailed plan that includes what files need to change, what the expected behavior should be, and what edge cases to consider. The planner does not write code — it writes specifications.
  • Generator Agent: Takes the planner’s specification and implements it. The generator writes code, creates tests, makes file changes, and runs builds. It works iteratively — implement a piece, test it, fix issues, move to the next piece. The generator has a clean context that is not polluted by the planner’s exploration and deliberation.
  • Evaluator Agent: Takes the generator’s output and conducts quality assurance. The evaluator reviews the code for correctness, style, security issues, and specification compliance. It runs tests, checks for regressions, and provides a final assessment. Again, with a clean context focused solely on evaluation.

Each agent gets a fresh context window. Each agent has a clear, focused role. The handoffs between agents are structured data (specifications, code diffs, test results), not the messy, growing conversation of a single long-running session.
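One way to make the "structured handoffs" concrete is to give each phase boundary an explicit schema. The field names below are illustrative assumptions, not Anthropic's actual format; the point is that only these typed objects cross the context boundaries:

```python
from dataclasses import dataclass, field

@dataclass
class Specification:          # planner -> generator handoff
    goal: str
    files_to_change: list
    edge_cases: list = field(default_factory=list)

@dataclass
class Implementation:         # generator -> evaluator handoff
    spec: Specification
    diff: str
    tests_passed: bool

@dataclass
class Evaluation:             # evaluator's final verdict
    approved: bool
    issues: list = field(default_factory=list)

def pipeline(prompt, plan, generate, evaluate):
    """Each callable represents an agent running in a fresh context."""
    spec = plan(prompt)       # Planner: reads code, writes a spec
    impl = generate(spec)     # Generator: implements against the spec
    return evaluate(impl)     # Evaluator: reviews the implementation
```

Because the handoffs are data rather than conversation history, each phase starts from a clean context and context degradation never compounds across phases.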

How Claude Code Implements Multi-Agent Patterns

Claude Code implements this pattern through its Agent tool — a built-in capability to spawn sub-agents. When Claude Code encounters a task that would benefit from delegation, it can create a sub-agent with a specific prompt and a clean context. The sub-agent runs independently, completes its task, and returns its results to the parent agent.

This is particularly useful for tasks like:

  • Searching a large codebase while the main agent continues reasoning about the overall task
  • Running a battery of tests while the main agent plans the next change
  • Investigating a complex error in a separate context so the investigation does not pollute the main workflow
  • Reviewing code changes against project standards before the main agent marks the task as complete

Caution: Multi-agent architectures add complexity. Do not reach for them until you have exhausted what a single well-harnessed agent can do. For most tasks — even complex ones — a single agent with good guides, sensors, and correction loops will outperform a poorly coordinated multi-agent system. Start simple.

When to Use Single-Agent vs Multi-Agent

Use a single agent when the task can be completed within one context window, the requirements are clear, and the feedback loop is tight (write code, run test, fix, repeat). Most everyday coding tasks fall into this category.

Use multiple agents when the task is so large that context degradation becomes a real problem, when different phases of the task require fundamentally different skill sets (planning vs implementation vs review), or when you need parallel execution of independent subtasks. Large feature development, codebase migrations, and comprehensive code reviews are good candidates.

How to Engineer Your Own Harness for Claude Code

Theory is interesting, but you are here for practical guidance. Let’s walk through the five levels of harness engineering for Claude Code, from the simplest configuration to advanced multi-agent orchestration. Each level builds on the previous one, so start at Level 1 and add complexity only when you have a specific problem that the current level cannot solve.

Level 1: CLAUDE.md (The Foundation)

The single most impactful thing you can do to improve Claude Code’s performance on your project is to write a comprehensive CLAUDE.md file. This is your foundation. Everything else builds on it.

A good CLAUDE.md includes:

  • Project purpose: What does this project do? Who uses it? What problem does it solve?
  • Tech stack: Languages, frameworks, databases, deployment targets.
  • Coding conventions: Formatting rules, naming conventions, architecture patterns.
  • File structure: Where things live. What each directory contains.
  • Key commands: How to build, test, deploy, and run the project.
  • What NOT to do: Common mistakes, anti-patterns, things to avoid. This is often the most valuable section.

Here is an example CLAUDE.md for a Python project:

# Project: DataPipeline

## Purpose
ETL pipeline that processes financial data from multiple exchanges
and loads it into our PostgreSQL analytics database.

## Tech Stack
- Python 3.12, managed with uv
- SQLAlchemy 2.0 for database access
- Pydantic for data validation
- pytest for testing
- Ruff for linting

## Key Commands
- Run tests: `uv run pytest tests/ -v`
- Lint: `uv run ruff check src/`
- Run pipeline: `uv run python -m src.main run --date 2026-04-03`

## Coding Conventions
- All functions must have type hints
- Use Pydantic models for all data structures (no raw dicts)
- SQL queries use parameterized queries only (never f-strings)
- Test files mirror source structure: src/foo/bar.py → tests/foo/test_bar.py

## What NOT to Do
- Do not use pandas — we use Polars for dataframes
- Do not hardcode database credentials — use environment variables
- Do not write raw SQL strings — use SQLAlchemy ORM
- Do not skip type hints — mypy strict mode is enforced in CI

With just this file in your repository root, Claude Code will write code that follows your conventions, uses your tools, and avoids your known pitfalls. No additional configuration needed.

Level 2: Custom Commands (Task Automation)

Custom commands let you define repeatable workflows as slash commands. They live in .claude/commands/ as Markdown files, and each one becomes a command you can invoke with /command-name.

Here is an example .claude/commands/write-tests.md:

Write comprehensive tests for the file or module specified in $ARGUMENTS.

## Steps:
1. Read the source file and understand its public API
2. Identify all functions, classes, and methods that need testing
3. Write pytest tests covering:
   - Happy path for each function
   - Edge cases (empty inputs, None values, boundary conditions)
   - Error cases (invalid inputs, missing dependencies)
4. Save tests to the mirror path: src/foo/bar.py → tests/foo/test_bar.py
5. Run the tests: `uv run pytest tests/foo/test_bar.py -v`
6. Fix any failing tests
7. Run the linter: `uv run ruff check tests/foo/test_bar.py`
8. Report results

Now you can type /write-tests src/pipeline/transformer.py and Claude Code will follow this exact workflow every time. No need to re-explain your testing conventions in every conversation. The command encodes your team’s standards into a repeatable process.

Other useful custom commands to consider: /review for code review, /deploy for deployment workflows, /debug for structured debugging sessions, and /refactor for refactoring with specific quality gates.

Level 3: Hooks (Automated Quality Gates)

Hooks let you inject automated checks into Claude Code’s workflow. They are shell commands that execute in response to specific events — before a tool runs, after a tool runs, or at other key moments in the agent loop.

Here is an example hook configuration in .claude/settings.json:

{
  "hooks": {
    "PostToolUse": [
      {
        "matcher": "Write|Edit",
        "command": "uv run ruff check --fix $CLAUDE_FILE_PATH 2>/dev/null || true"
      }
    ],
    "PreCommit": [
      {
        "command": "uv run pytest tests/ -x -q 2>&1 | tail -5"
      }
    ]
  }
}

With this configuration, every time Claude Code writes or edits a file, Ruff automatically runs and fixes formatting issues. Before every commit, the test suite runs and the results are fed back to the agent. These are your automated sensors and verification gates — they run without human intervention and without the agent needing to remember to run them.

Level 4: MCP Servers (External Integration)

MCP (Model Context Protocol) servers extend Claude Code’s capabilities by connecting it to external tools and data sources. You configure them in .claude/settings.json, and they appear as additional tools the agent can use.

{
  "mcpServers": {
    "postgres": {
      "command": "npx",
      "args": ["-y", "@modelcontextprotocol/server-postgres"],
      "env": {
        "DATABASE_URL": "postgresql://user:pass@localhost:5432/mydb"
      }
    },
    "github": {
      "command": "npx",
      "args": ["-y", "@modelcontextprotocol/server-github"],
      "env": {
        "GITHUB_TOKEN": "ghp_your_token_here"
      }
    }
  }
}

With MCP servers configured, Claude Code can query your database directly (understanding schema, running queries, checking data), interact with GitHub (creating PRs, reading issues, checking CI status), and integrate with any other tool that has an MCP server implementation. This turns Claude Code from a coding assistant into an integrated development environment that understands your entire infrastructure.

Level 5: Multi-Agent Orchestration

At the highest level of harness sophistication, you can orchestrate multi-agent workflows where different Claude Code instances handle different phases of a task. This can be done through custom commands that explicitly invoke the Agent tool for delegation.

Here is a conceptual example of a /feature command that implements the planner-generator-evaluator pattern:

Implement the feature described in $ARGUMENTS using a
multi-phase approach:

## Phase 1: Planning
Use the Agent tool to spawn a planning sub-agent with this prompt:
"Read the codebase and create a detailed implementation plan for:
$ARGUMENTS. List all files to modify, new files to create,
tests to write, and edge cases to consider. Output a structured
specification."

## Phase 2: Implementation
Use the Agent tool to spawn an implementation sub-agent with
the specification from Phase 1. The sub-agent should implement
the feature, write tests, and run them.

## Phase 3: Review
Use the Agent tool to spawn a review sub-agent that reads the
diff of all changes, checks for bugs, security issues, style
violations, and specification compliance. Report any issues found.

## Phase 4: Resolution
If the review found issues, fix them. Run the full test suite.
Report the final result.

Each sub-agent gets a clean context focused on its specific phase. The parent agent coordinates the workflow and handles the handoffs.

| Level | Component | Complexity | Impact on Quality |
| --- | --- | --- | --- |
| 1 | CLAUDE.md | Low (30 min setup) | Very High |
| 2 | Custom Commands | Low-Medium | High |
| 3 | Hooks | Medium | High |
| 4 | MCP Servers | Medium-High | Medium-High |
| 5 | Multi-Agent Orchestration | High | Medium (situational) |

Harness Engineering Best Practices

After spending considerable time building and refining harnesses for Claude Code, several best practices emerge. These are not theoretical — they are hard-won lessons from real-world usage.

Start Simple, Add Complexity Only When Needed

The biggest mistake in harness engineering is over-engineering from the start. You do not need hooks, MCP servers, and multi-agent orchestration on day one. Start with a CLAUDE.md file. Use Claude Code for a week. Notice what it gets wrong repeatedly. Add a custom command or a guide to address that specific failure pattern. Iterate. The best harnesses are grown organically from real failure patterns, not designed top-down from theoretical requirements.

Make the Harness Project-Specific

A one-size-fits-all harness is a mediocre harness. A Python data pipeline has different needs than a React frontend, which has different needs than a Rust systems library. Your CLAUDE.md, your custom commands, your hooks — all of these should be tailored to the specific project, its tech stack, its conventions, and its common failure modes. Generic advice like “write clean code” is useless. Specific instructions like “use Pydantic models for all API responses, never return raw dicts” are actionable.

Test Your Harness Configuration

Here is a practice that separates good harness engineers from great ones: A/B test your harness changes. Before adding a new guide or hook, run a representative task and note the result. Add the harness change. Run the same task again. Did the output improve? By how much? This empirical approach prevents harness bloat — configurations that feel useful but do not actually improve outcomes.
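A minimal harness A/B comparison can be automated with a few lines of Python. `run_agent` and `score` are hypothetical hooks into your own agent invocation and grading logic, and the variant labels are arbitrary:

```python
def ab_compare(task, run_agent, score, trials=5):
    """Run the same task with and without a harness change.

    run_agent(task, variant) produces the agent's output for that variant;
    score(output) grades it as a float. Averaging over several trials
    smooths out the model's run-to-run variance.
    """
    def avg(variant):
        return sum(score(run_agent(task, variant)) for _ in range(trials)) / trials

    baseline = avg("baseline")
    candidate = avg("with_change")
    return {"baseline": baseline, "with_change": candidate,
            "improved": candidate > baseline}
```

Running several trials per variant matters more than it might seem: agent output is stochastic, so a single before/after pair can easily mislead you about whether a harness change helped.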

Version Control Your Harness

Your CLAUDE.md, your .claude/commands/ directory, your hooks configuration — all of these should be checked into version control alongside your code. They are part of your project’s engineering infrastructure. They should be reviewed in PRs, iterated on over time, and shared across the team. A harness that lives only on one developer’s machine is a harness that will be lost.

Iterate Based on Failure Patterns

Every time Claude Code makes a mistake that it should not have made, ask: “Could a harness change have prevented this?” If the agent keeps using the wrong database library, add a guide. If it keeps forgetting to run tests, add a hook. If it keeps generating code that fails the linter, add a sensor. Your harness should be a living document that evolves as you discover new failure patterns.

Balance Autonomy and Control

Too many constraints make the agent slow and inflexible — it spends more time checking rules than doing work. Too few constraints make it error-prone — it makes avoidable mistakes because it was not told the rules. The sweet spot varies by project and by team. High-risk production codebases need more constraints. Experimental prototyping projects need more autonomy. Calibrate accordingly.

Monitor and Measure

Track your agent’s success rate over time. How often does it complete tasks correctly on the first attempt? How often does it need correction? What categories of errors are most common? This data tells you where to invest your harness engineering effort. If 80% of failures are type errors, invest in type checking sensors. If 80% of failures are misunderstanding requirements, invest in better guides.
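A failure log does not need to be elaborate to be useful. This sketch assumes a hypothetical log of (task, error category) pairs and tallies where the failures cluster:

```python
"""Sketch: tally agent failure categories to decide where to invest
harness effort. The log format and category names are hypothetical."""
from collections import Counter

failures = [
    ("add-endpoint", "type-error"),
    ("refactor-auth", "type-error"),
    ("fix-bug-123", "missed-requirement"),
    ("add-endpoint", "type-error"),
]

counts = Counter(category for _, category in failures)
for category, n in counts.most_common():
    share = n / len(failures)
    print(f"{category}: {n} ({share:.0%})")
```

If "type-error" dominates the tally, that is your signal to invest in type-checking sensors before anything else.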

Harness Engineering vs Prompt Engineering

Harness engineering is sometimes confused with prompt engineering, and while they are related, they are fundamentally different disciplines. Understanding the distinction is important for allocating your engineering effort correctly.

Prompt engineering is the craft of writing a single prompt for a single interaction. It focuses on wording, structure, few-shot examples, and instruction clarity to get the best possible response from one model call. It is valuable, and it is one component of harness engineering — specifically, it falls under “guides.” But it is only one piece of the puzzle.

Harness engineering is the discipline of designing a complete system around the model for sustained, reliable operation across many interactions and many tasks. It encompasses not just the prompt, but every other component: tools the model can use, verification that runs after the model acts, correction mechanisms when things go wrong, persistence for cross-session knowledge, and permissions that control what the model can do.

| Dimension | Prompt Engineering | Harness Engineering |
|---|---|---|
| Scope | Single prompt, single interaction | Complete system across many interactions |
| Persistence | Ephemeral (one conversation) | Persistent (CLAUDE.md, memory, commands) |
| Components | Text instructions only | Text + tools + sensors + verification + correction |
| Reliability | Varies per interaction | Systematically improved over time |
| Scalability | Manual (re-craft for each task) | Automated (configure once, apply to all tasks) |
| Error handling | Hope the prompt prevents errors | Detect, verify, and correct errors automatically |
| Team sharing | Copy-paste prompts | Version-controlled config files in the repo |

The key insight: prompt engineering is a subset of harness engineering. If you are only doing prompt engineering, you are leaving the majority of your improvement potential on the table. The biggest gains come from the components that prompt engineering does not address — tools, verification, correction, and persistence.

Real-World Harness Examples

Abstract principles are useful, but concrete examples make them actionable. Here are three real-world harness configurations that demonstrate the principles in practice.

Example 1: Blog Publishing Harness (aicodeinvest.com)

You are reading the output of this harness right now. This very blog post was written and published by Claude Code, operating within a harness that we built specifically for blog publishing. Here is what the harness includes:

  • CLAUDE.md: Contains writing guidelines (4,000-6,000 words, conversational tone, specific HTML patterns), post structure requirements (Table of Contents, Introduction, body sections, Conclusion, References), and explicit anti-patterns to avoid (no numbered headings, no html/head/body wrappers).
  • /write-post custom command: Orchestrates the full workflow — topic selection, writing, saving, publishing via WordPress REST API, and recording topic usage for deduplication.
  • WordPress REST API as a tool: A Python CLI (src/main.py) that handles authentication, content upload, category assignment, and status management.
  • Topic deduplication system: Tracks recently used topics in config/recent_topics.json to prevent the agent from writing about the same subject twice.
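The deduplication check described above can be sketched in a few lines of Python. The file format of config/recent_topics.json shown here is an assumption (a flat JSON list of lowercased topic strings), not the real harness's internal format:

```python
"""Sketch of a topic-deduplication check: reject a candidate topic if it
appears in the recent-topics file. File format and window size assumed."""
import json
from pathlib import Path

TOPICS_FILE = Path("config/recent_topics.json")
MAX_RECENT = 20  # assumed sliding-window size

def is_fresh(topic: str) -> bool:
    """True if the topic has not been used recently."""
    recent = json.loads(TOPICS_FILE.read_text()) if TOPICS_FILE.exists() else []
    return topic.lower() not in recent

def record_topic(topic: str) -> None:
    """Record a used topic, keeping only the most recent entries."""
    recent = json.loads(TOPICS_FILE.read_text()) if TOPICS_FILE.exists() else []
    recent.append(topic.lower())
    TOPICS_FILE.parent.mkdir(parents=True, exist_ok=True)
    TOPICS_FILE.write_text(json.dumps(recent[-MAX_RECENT:], indent=2))

record_topic("harness engineering explained")
print(is_fresh("Harness Engineering Explained"))  # False: already used
print(is_fresh("mcp servers deep dive"))          # True: still fresh
```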

This harness turns Claude Code from a general-purpose AI assistant into a specialized blog publishing system. The model’s writing ability is the engine. The harness — the CLAUDE.md guidelines, the custom command workflow, the publishing tools, the deduplication system — is what turns that engine into a reliable content production pipeline.

Example 2: Enterprise Code Review Harness

Consider a team that uses Claude Code for automated code review. Their harness might include:

  • CLAUDE.md: Company coding standards, security requirements (no hardcoded secrets, all inputs sanitized, all queries parameterized), performance guidelines (no N+1 queries, pagination required for list endpoints), and architecture rules (clean architecture layers, dependency injection).
  • /review custom command: A structured review process that checks security, performance, style, test coverage, and documentation in that order, producing a formatted review with severity ratings.
  • CI/CD integration hooks: Post-commit hooks that run the test suite, linter, and security scanner, feeding results back to the agent for its review.
  • Jira/Linear MCP server: Connects Claude Code to the team’s project management tool so it can read ticket descriptions, understand acceptance criteria, and verify that the code changes match the specified requirements.
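Since custom commands live as markdown files under .claude/commands/, the /review command above might be sketched like this. The exact wording and severity scale are illustrative, not taken from a real team's configuration:

```markdown
Review the changes on the current branch against company standards.

Work through these checks in order, reporting each finding with a
severity (critical / major / minor):

1. Security: no hardcoded secrets, all inputs sanitized, all queries
   parameterized.
2. Performance: no N+1 queries; list endpoints must paginate.
3. Style: run the linter and report any violations.
4. Test coverage: every changed function has a corresponding test.
5. Documentation: public APIs have docstrings.

Finish with a summary table of findings grouped by severity.
```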

This harness ensures that every code review follows the same rigorous process, checks the same standards, and produces consistent, actionable feedback — regardless of which developer triggered the review or which part of the codebase is being changed.

Example 3: Data Pipeline Harness

A data engineering team might build a harness for managing ETL pipelines:

  • Custom commands: /new-pipeline for scaffolding new ETL jobs with the team’s standard structure, /validate-schema for checking data schemas against the warehouse, /backfill for running historical data loads with proper idempotency checks.
  • Database MCP server: Gives Claude Code direct access to the data warehouse schema, so it understands table structures, column types, relationships, and constraints without the developer needing to explain them.
  • Test data generation tools: Custom commands that generate realistic test data for pipeline testing, including edge cases like null values, duplicate records, and timezone mismatches.
  • CLAUDE.md with data engineering conventions: Rules about idempotency (all pipelines must be safely re-runnable), data validation (all inputs must be schema-validated before processing), and monitoring (all pipelines must emit metrics for latency, throughput, and error rate).
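The idempotency and validation conventions in that CLAUDE.md translate directly into code. Here is a toy sketch (all names hypothetical, with an in-memory dict standing in for the warehouse): validate every input row against a schema, then load by a stable record key so re-runs and backfills are safe:

```python
"""Sketch: schema-validated, idempotent load step for a toy pipeline."""
from dataclasses import dataclass

@dataclass
class Event:
    record_id: str
    amount_cents: int  # integer cents, not floats, for money

def validate(row: dict) -> Event:
    """Reject rows that do not match the expected schema."""
    if not isinstance(row.get("record_id"), str) or not row["record_id"]:
        raise ValueError(f"bad record_id: {row!r}")
    if not isinstance(row.get("amount_cents"), int):
        raise ValueError(f"bad amount_cents: {row!r}")
    return Event(row["record_id"], row["amount_cents"])

def load(events, warehouse: dict) -> None:
    """Idempotent load: keyed upserts, so running twice changes nothing."""
    for e in events:
        warehouse[e.record_id] = e  # write by key, never append

rows = [{"record_id": "a1", "amount_cents": 500},
        {"record_id": "a1", "amount_cents": 500}]  # duplicate input
warehouse: dict = {}
events = [validate(r) for r in rows]
load(events, warehouse)
load(events, warehouse)  # safe re-run (the backfill scenario)
print(len(warehouse))    # duplicates collapse to a single record
```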

Each of these examples demonstrates the same principle: the harness is tailored to the specific domain, encoding domain expertise into configuration that the agent can leverage automatically.

The Future of Harness Engineering

Harness engineering is a young discipline, but it is evolving rapidly. Here is where it is heading.

A New Engineering Discipline

Just as DevOps emerged as a distinct discipline from the intersection of development and operations, harness engineering is emerging as a distinct discipline from the intersection of AI and software engineering. Companies are already hiring for roles that are essentially harness engineers — people who specialize in configuring, tuning, and optimizing AI agent systems. The job title might be “AI Platform Engineer” or “Agent Systems Engineer,” but the core skill set is harness engineering.

Standardization Through MCP

The Model Context Protocol (MCP) is the first serious attempt at standardizing the interface between AI agents and external tools. Before MCP, every agent had its own proprietary tool integration system. MCP provides a common protocol that any tool can implement and any agent can consume. This is analogous to what HTTP did for the web — it created a standard that enabled an ecosystem. As MCP matures, we will see a proliferation of MCP servers for every conceivable tool and data source, dramatically lowering the cost of harness engineering.

Harness Marketplaces

Today, sharing a harness configuration means sharing CLAUDE.md files and custom commands through GitHub repositories. Tomorrow, we may see dedicated marketplaces for harness configurations — curated collections of CLAUDE.md files, custom commands, hooks, and MCP server configurations for specific tech stacks and workflows. “Here is a production-ready harness for Django + PostgreSQL + Celery” or “Here is a harness for iOS development with SwiftUI and Core Data.” These pre-built harnesses would give teams a starting point that already encodes best practices for their stack.

Self-Improving Harnesses

The most exciting frontier is self-improving harnesses — harness systems that learn from their own failures and automatically update their configuration. Imagine a harness that notices the agent keeps making the same type error in a specific module, and automatically adds a guide to CLAUDE.md saying “In the payments module, always use Decimal instead of float for monetary values.” Or a harness that notices test failures cluster around a specific API endpoint and automatically adds more thorough validation for that endpoint’s responses.

This is not science fiction — the building blocks exist today. The agent can read its own CLAUDE.md. The agent can analyze its own failure patterns. The agent can edit its own CLAUDE.md. The missing piece is the orchestration logic that decides when to do this and what to change, and that is an active area of research.
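As a thought experiment, the orchestration logic might look something like this speculative sketch. Nothing here is a real Claude Code feature; the threshold, category names, and guide text are invented for illustration:

```python
"""Speculative sketch of a self-improving harness step: when one failure
category repeats often enough, append a corrective guide to CLAUDE.md."""
from collections import Counter
from pathlib import Path

CLAUDE_MD = Path("CLAUDE.md")
THRESHOLD = 3  # assumed: repeats before a guide is written

# Hypothetical mapping from failure category to corrective guide text.
GUIDES = {
    "float-money": "- In the payments module, always use Decimal, never float, for monetary values.",
}

def maybe_update_harness(recent_failures: list[str]) -> list[str]:
    """Append guides for repeated categories; return the categories handled."""
    existing = CLAUDE_MD.read_text() if CLAUDE_MD.exists() else ""
    added = []
    for category, count in Counter(recent_failures).items():
        guide = GUIDES.get(category)
        if guide and count >= THRESHOLD and guide not in existing:
            existing += ("\n" if existing else "") + guide + "\n"
            added.append(category)
    CLAUDE_MD.write_text(existing)
    return added

added = maybe_update_harness(["float-money", "float-money", "float-money"])
print(added)  # the category that triggered a new guide
```

The hard open problems are the ones this sketch glosses over: classifying failures reliably, and deciding when a pattern is real rather than noise.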

The “Operating System for AI” Vision

Zoom out far enough, and the harness starts to look like an operating system. It manages resources (context windows, tool access), enforces permissions (what the agent can and cannot do), provides system services (file I/O, networking, process management), and offers a user interface (the conversation loop). The analogy is imperfect, but it points toward a future where the harness is not just a configuration layer — it is a full runtime environment for AI agents, with the same level of sophistication that operating systems bring to traditional computing.

Conclusion

The AI industry has spent the last few years in an arms race over models — bigger, faster, smarter. That race is not over, but a new race has begun alongside it: the race to build better harnesses. The teams and companies that master harness engineering will extract dramatically more value from the same models that everyone else has access to.

The formula is simple: Agent = Model + Harness. The model provides raw intelligence. The harness provides structure, tools, verification, correction, memory, and control. Together, they create an agent that can reliably operate in the real world. Separately, they are incomplete.

If you take away one thing from this post, let it be this: stop treating your AI agent as a chatbot with extra features, and start treating it as an engineered system. Write a CLAUDE.md file. Create custom commands for your common workflows. Add hooks for automated quality gates. Connect MCP servers for external tool access. Test your harness, iterate on it, version control it, and share it with your team.

The model is the engine. The harness is the car. And right now, most people are trying to drive a bare engine down the highway. Build the car.

Key Takeaway: Harness engineering is the highest-leverage skill in AI-assisted development today. A 30-minute investment in a good CLAUDE.md file will improve every single interaction you have with Claude Code. Start there, measure the results, and build up from that foundation.
