Strangelove-AI May 10, 2026

A Guide to Harness Engineering: Building Reliable AI Agent Workflows

Notes on WalkingLabs’ Learn Harness Engineering course, which provides a practical framework for building reliable AI agent workflows through harness design, state management, and end-to-end verification. The course emphasizes the critical importance of treating the harness as a formal engineering discipline to bridge the gap between AI model capability and production-grade execution.

<TL/DR>

The core insight: AI agent reliability is an engineered outcome of the infrastructure surrounding model weights, not a property of the model itself. Harness Engineering is the discipline of building that infrastructure. A “harness” comprises every element outside the model weights: the structured environment that converts raw AI capability into production-grade execution. The canonical proof: OpenAI’s Codex experiment succeeded not by improving the model, but by forcing engineers to design better environments when humans were forbidden from writing code directly.

The five subsystems of a reliable harness:

  • Instruction: An AGENTS.md / CLAUDE.md file (50–200 lines max) acting as a routing file with hard constraints and links to topic docs. Too long = “Lost in the Middle” effect where the model ignores buried rules.
  • Tool: Least-privilege shell/filesystem access. Under-restrict and you create security holes; over-restrict and the agent can’t even run pip install.
  • Environment: Fully self-describing runtime via pyproject.toml, .nvmrc, Docker etc. If the agent has to guess dependency versions, it wastes context budget doing so.
  • State: A PROGRESS.md continuity artifact tracking what’s done, blocked, and next. Without it, every new session is an amnesiac starting from zero; with it, rebuild cost drops from ~15 minutes to ~3.
  • Feedback: Machine-verifiable E2E tests with agent-oriented error messages (not “test failed” but “GET /users/1 returned 500, fix at line 42, see docs/api-patterns.md”). This is the highest-ROI subsystem.

1. The Harness Manifesto: Why Strong Models Fail

In the professional landscape of AI automation, a critical strategic shift is occurring: moving from a “model-centric” to a “harness-centric” engineering philosophy. Model capability and execution reliability are fundamentally decoupled. A model may possess the reasoning capacity of a senior engineer, yet fail at production tasks because of structural defects in its operating environment. Reliability is not a feature of the model weights; it is an engineered outcome of the infrastructure surrounding those weights.

This distinction is best illustrated by the “Saddle Analogy.” An elite model like Opus 4.5 is a thoroughbred horse—extraordinarily capable but impossible to direct without equipment. Attempting to run complex agents “bareback” (prompt-only) leads to inevitable failure. The harness is the saddle; it determines the performance ceiling. This was proven by OpenAI’s “Million-Line Experiment,” where Codex was used to build a product from an empty repository. The most vital constraint of that experiment — humans were strictly forbidden from writing code directly — forced a move toward environment design. The project succeeded not by “improving the model,” but by refining the harness. This proved that a model’s “unreliable” label is usually a Harness-Induced Failure.

The Capability Gap

| Feature | Model Performance on Benchmarks | Performance on Real-World Tasks |
| --- | --- | --- |
| Success Rate | 50–60% (e.g., SWE-bench Verified) | Significantly lower due to environmental friction. |
| Requirements | Clear, curated issue descriptions. | Vague, shifting, or undocumented “tribal knowledge.” |
| Rule Sets | Explicit and self-contained. | Implicit rules; “Knowledge Decay” in stale docs. |
| Environment | Clean, pre-configured containers. | “Environment traps” (missing deps, version drift). |
| Verification | Existing, comprehensive test suites. | Non-existent tests or silent failure modes. |

The environment, not the weights, determines whether an agent succeeds. To bridge this gap, we must treat the harness as a formal engineering discipline.

2. Defining the Harness: The Five-Subsystem Model

The “Harness” comprises every element of the engineering infrastructure outside the model weights. If the agent is a chef, the harness is the kitchen. A brilliant chef is useless in a kitchen without heat, calibrated knives, or a mise en place station. The harness provides the necessary constraints to translate raw reasoning into production-grade artifacts.

The core methodology of this discipline is the Diagnostic Loop: Execute → Observe Failure → Attribute to a specific harness layer → Fix that layer → Re-execute. By using “isometric model control” (keeping the model fixed while adjusting the environment) we isolate failures into five functional subsystems:

  1. Instruction Subsystem (The Recipe Shelf): Provides project overviews and hard constraints.
    • Primitive: AGENTS.md or CLAUDE.md.
  2. Tool Subsystem (The Knife Rack): Grants access to the filesystem and execution shells via least-privilege principles.
    • Primitive: Structured shell access with ls, grep, and sed.
  3. Environment Subsystem (The Stove): Ensures the runtime is self-describing and reproducible.
    • Primitive: pyproject.toml, package.json, or .nvmrc.
  4. State Subsystem (The Prep Station): Manages progress tracking and persistent memory for long-running tasks.
    • Primitive: PROGRESS.md or atomic git commits.
  5. Feedback Subsystem (The Quality Check): Provides machine-verifiable results of the agent’s actions.
    • Primitive: pytest, npm test, or custom linting.

The Kitchen Analogy

  • Instruction: The specific recipe and strict dietary restrictions.
  • Tools: Specialized utensils (knives, whisks) required for the dish.
  • Environment: The utilities (gas, water) and workspace stability.
  • State: The mise en place, knowing exactly what is chopped and what is in the pan.
  • Feedback: Tasting the dish and checking the internal temperature before service.

For these subsystems to function, they must be anchored in a single source of truth: the repository.

3. The Repository as the System of Record

The “Repo as Spec” principle dictates that for an AI agent, information existing outside the repository (Slack, Jira, or human heads) effectively does not exist. If a rule is not documented in the codebase, the agent is forced to guess. Wrong guesses become bugs; excessive guessing wastes the finite context window. Furthermore, we must combat the Knowledge Decay Rate: documentation that is out-of-sync with code is more dangerous than no documentation, as it sends the agent in the wrong direction with high confidence.

The Cold-Start Test
A repository is production-ready for agents only if a brand-new session can answer these five questions without human intervention:

  1. What is this system? (Purpose and stack).
  2. How is it organized? (Architecture and module boundaries).
  3. How do I run it? (Setup and initialization scripts).
  4. How do I verify it? (Test and lint commands).
  5. Where are we now? (Current progress and next steps).
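
One way to make the Cold-Start Test mechanical is a small script that checks whether the repository contains an artifact for each question. The sketch below is a minimal illustration, not part of the course material: the file names mapped to each question (README.md, ARCHITECTURE.md, Makefile, and so on) and the function names are assumptions chosen for the example.

```python
from pathlib import Path

# Hypothetical mapping from each cold-start question to files that could
# answer it. These names are assumptions for illustration; adapt to your repo.
COLD_START_ARTIFACTS: dict[str, tuple[str, ...]] = {
    "What is this system?": ("AGENTS.md", "CLAUDE.md", "README.md"),
    "How is it organized?": ("ARCHITECTURE.md", "docs/architecture.md"),
    "How do I run it?": ("Makefile", "pyproject.toml", "package.json"),
    "How do I verify it?": ("Makefile", "pyproject.toml", "package.json"),
    "Where are we now?": ("PROGRESS.md",),
}


def cold_start_report(repo_root: Path) -> dict[str, bool]:
    """Return, per question, whether any candidate file exists and is non-empty."""
    report = {}
    for question, candidates in COLD_START_ARTIFACTS.items():
        report[question] = any(
            (repo_root / name).is_file() and (repo_root / name).stat().st_size > 0
            for name in candidates
        )
    return report


def unanswered(repo_root: Path) -> list[str]:
    """List the questions a brand-new session could not answer from the repo."""
    return [q for q, ok in cold_start_report(repo_root).items() if not ok]
```

File presence is only a proxy: a stale or empty-in-substance document still passes this check, so it complements, rather than replaces, actually starting a fresh session and watching where it stalls.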

Managing State with ACID Principles
Reliable agent state management within a repository must adhere to the ACID properties:

  • Atomicity: Every logical operation is a single, reversible unit (one git commit).
  • Consistency: The repo moves only from one verified “green” state to another.
  • Isolation: Concurrent agent sessions must use separate branches or state files to avoid race conditions.
  • Durability: Cross-session knowledge must be persisted to git-tracked files, not session memory.

Instruction Architecture
A common failure is the “Giant Instruction File” (the 600-line trap), which triggers the “Lost in the Middle” effect. LLMs utilize information at the beginning or end of long texts significantly better than information in the center. Professional architects use a Routing File strategy based on a Signal-to-Noise Ratio (SNR) audit:

  • Entry File: A 50–200 line AGENTS.md containing only high-priority hard constraints and routers.
  • Topic Documents: Specific files (e.g., docs/api-patterns.md) loaded only when the SNR audit justifies the context spend.
  • Progressive Disclosure: Providing the agent with the overview first and detailed implementation rules only on demand.

Structuring the repository this way bridges the gap between static code and the temporal challenges of long-running sessions.

4. Managing the Session Lifecycle: Continuity and Initialization

AI agents are “Amnesiac Craftsmen.” Context windows are finite, and session boundaries are the primary points of information decay. Long-running tasks eventually require a session reset. When this happens, the Rebuild Cost, the time a new session needs to reach an executable state, is the primary metric of success. A good harness reduces Rebuild Cost from 15 minutes to under 3 minutes.

Context Anxiety and Continuity Artifacts
Anthropic’s research highlights a critical distinction: while Opus 4.5 can manage long tasks via context compaction, Sonnet 4.5 requires a full context reset to avoid severe “Premature Convergence.” This phenomenon, known as “Context Anxiety,” occurs when an agent senses its window closing and rushes to finish, skipping verification. We mitigate this using “Continuity Artifacts” (PROGRESS.md, DECISIONS.md) to offload the “why” of decisions before a reset.

The Initialization Phase
Initialization must be a mandatory, distinct phase. Mixing foundation-building (environment setup) with implementation (feature code) results in “unverified accumulation.”

The Bootstrap Contract Checklist:

  • Runnable Environment: Dependencies locked; app starts without errors.
  • Verifiable Tests: At least one example test passes to prove the framework.
  • Task Breakdown: Project split into atomic units with clear acceptance criteria.
  • Clean Checkpoint: A git commit marking the end of the foundation work.

The Clean State Requirement
A session is only complete if it satisfies the five dimensions of a clean handoff:

  1. Build: Code compiles without errors.
  2. Test: All tests (existing and new) pass in a CI-like environment.
  3. Progress: PROGRESS.md reflects current task states.
  4. Artifact: Stale logs, debug files, and temporary code are removed.
  5. Startup: The standard make setup or initialization path remains functional.
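
The handoff dimensions that reduce to a command can be checked by the harness rather than trusted to the agent. The following is a minimal sketch under stated assumptions: the function names are hypothetical, each dimension is represented by one command whose exit code 0 means "satisfied", and the example commands in the comment (apart from make setup and make test, which the text mentions) are placeholders.

```python
import subprocess


def clean_state_report(checks: dict[str, list[str]]) -> dict[str, bool]:
    """Run one command per handoff dimension; True means it exited with 0."""
    report = {}
    for dimension, command in checks.items():
        try:
            result = subprocess.run(command, capture_output=True, check=False)
            report[dimension] = result.returncode == 0
        except FileNotFoundError:
            # The command itself is missing, so the dimension cannot pass.
            report[dimension] = False
    return report


def session_complete(report: dict[str, bool]) -> bool:
    """A session counts as complete only if every checked dimension passed."""
    return all(report.values())


# Example wiring (placeholder commands, adjust to the project):
# checks = {
#     "build": ["make", "build"],
#     "test": ["make", "test"],
#     "startup": ["make", "setup"],
# }
# print(clean_state_report(checks))
```

The Progress and Artifact dimensions usually need project-specific scripts (for example, one that confirms PROGRESS.md changed in the final commit); they plug into the same dictionary once such a script exists.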

5. Scope Control: Task Boundaries and Feature Primitives

The symbiotic relationship between “Overreach” (starting too much) and “Under-finish” (completing too little) is a primary cause of agent failure. In harness engineering, “doing less but finishing” is the superior strategic approach.

The Math of Attention: WIP=1
Attention is a finite resource. If the agent’s context capacity is C and it activates k tasks, each task receives only C/k reasoning resources. When C/k drops below a minimum threshold, the agent fails globally. Therefore, the harness must mandate a Work-in-Progress (WIP) limit of 1. The agent must verify one task before unlocking the next to prevent the dilution of attention.
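
The budget argument can be written out explicitly. In this sketch, \tau is a symbol introduced here for the "minimum threshold" mentioned above; the course neither names it nor gives a value.

```latex
\[
\frac{C}{k} \ge \tau
\quad\Longleftrightarrow\quad
k \le \frac{C}{\tau}
\]
```

Every active task shares the same C/k, so once k exceeds C/\tau all tasks fall below the threshold together, which is why the failure is global rather than local. Because \tau is not known in practice, k = 1 is the one setting that is safe whenever any setting is.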

Feature Lists as Harness Primitives
In a professional harness, feature lists are Primitives, not documents. Primitives are for systems to execute; documents are for humans to ignore. Every feature must follow a Triple Structure:

  • Behavior: A specific description (e.g., “GET /health returns 200”).
  • Verification: The exact command to run (e.g., pytest tests/api.py).
  • State: The current status in the machine-readable state machine.

Feature State Machine

| State | Transition Requirement | Impact (Back-pressure) |
| --- | --- | --- |
| not_started | Default state for new items. | Visible to the scheduler. |
| active | One item moved here by agent. | Consumes the full budget (C/k with k = 1). |
| blocked | Requires external input. | Exerts pressure to resolve dependencies. |
| passing | Verification command returns 0. | Relieves back-pressure; unlocks next task. |
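
The Triple Structure and the state machine can be sketched together as a small scheduler that refuses to activate a second feature. This is an illustrative sketch, not the course's implementation: the class and method names are hypothetical, and allowing a blocked feature to be re-activated is an assumption, since the table does not spell out that transition.

```python
import subprocess
from dataclasses import dataclass
from typing import Callable


@dataclass
class Feature:
    behavior: str            # e.g. "GET /health returns 200"
    verification: list[str]  # exact command, e.g. ["pytest", "tests/api.py"]
    state: str = "not_started"  # not_started | active | blocked | passing


def run_command(command: list[str]) -> int:
    """Run the verification command and return its exit code."""
    return subprocess.run(command, check=False).returncode


class FeatureList:
    """Hypothetical scheduler that enforces a WIP limit of 1."""

    def __init__(self, features: list[Feature],
                 runner: Callable[[list[str]], int] = run_command):
        self.features = features
        self.runner = runner  # injectable so the harness can be tested

    def activate(self, feature: Feature) -> None:
        if any(f.state == "active" for f in self.features):
            raise RuntimeError("WIP limit is 1: finish or block the active feature first")
        if feature.state not in ("not_started", "blocked"):
            raise ValueError(f"cannot activate a feature in state {feature.state!r}")
        feature.state = "active"

    def block(self, feature: Feature) -> None:
        if feature.state != "active":
            raise ValueError("only the active feature can be blocked")
        feature.state = "blocked"

    def verify(self, feature: Feature) -> bool:
        """Move the active feature to passing only if its command exits with 0."""
        if feature.state != "active":
            raise ValueError("only the active feature can be verified")
        if self.runner(feature.verification) == 0:
            feature.state = "passing"
            return True
        return False
```

The key design choice is that the agent never sets passing directly; only the exit code of the verification command can, which is what lets the list act as back-pressure.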

6. The Verification Framework: E2E Testing and Observability

Neural networks suffer from Confidence Calibration Bias (Guo et al.); they are systematically overconfident, often declaring victory because code looks correct. Externalized, execution-based verification is the only remedy.

The Blind Spots of Unit Testing
Unit tests utilize isolation and mocks, which hide systemic issues. High-reliability harnesses require End-to-End (E2E) verification to catch:

  1. Interface Mismatch: Inconsistent data formats between components (e.g., absolute vs. relative paths).
  2. State Propagation: Caching layers holding stale data after database migrations.
  3. Resource Lifecycle: Memory leaks or unclosed file handles spanning component boundaries.

The Three-Layer Termination Check
To prevent premature victory, the harness enforces a tiered check:

  1. Syntax Layer: Linting and type-checking (the bare minimum).
  2. Runtime Layer: Verifying the application starts and the critical path executes.
  3. System Layer (E2E): Simulating full user flows to ensure components “sing together.”

Crucially, the harness must provide Agent-Oriented Error Messages using the “Red Pen Markup” pattern:

  • Bad Error: Test Failed: index out of bounds.
  • Agent-Oriented Error: Test Failed: GET /users/1 returned 500. Root cause: list index out of bounds in 'controllers/user.py' at line 42. Fix: Check if the user ID exists in the DB before indexing. Reference 'docs/api-patterns.md' for error handling.
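
The “Red Pen Markup” pattern amounts to always emitting the same fields: symptom, root cause, location, suggested fix, and a reference. A minimal sketch follows, assuming a hypothetical AgentError record and format_agent_error helper (neither comes from the course or from any test framework).

```python
from dataclasses import dataclass


@dataclass
class AgentError:
    symptom: str     # what was observed, e.g. "GET /users/1 returned 500"
    root_cause: str  # why it happened
    file: str        # where to look
    line: int
    fix_hint: str    # suggested repair
    reference: str   # topic document to consult


def format_agent_error(err: AgentError) -> str:
    """Render a failure as a single message an agent can act on."""
    return (
        f"Test Failed: {err.symptom}. "
        f"Root cause: {err.root_cause} in '{err.file}' at line {err.line}. "
        f"Fix: {err.fix_hint}. "
        f"Reference '{err.reference}'."
    )
```

In a Python test, the rendered string can be passed as the message of a plain assert statement, so the failure output the agent reads already carries the location and the fix hint.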

Layered Observability and Feedback Promotion

We distinguish between Runtime Observability (logs/traces) and Process Observability (Sprint Contracts). A major concept in harness scaling is Review Feedback Promotion: every manual review comment should be converted into an automated harness check to prevent future regressions.

Sample Evaluator Rubric

| Dimension | Evidence Required | Score (1-5) |
| --- | --- | --- |
| Functional | Does the E2E test pass on the first execution? | |
| Boundary | Does the implementation handle null/empty inputs? | |
| Architectural | Does it follow the folder structure in ARCHITECTURE.md? | |

Reliability is an engineered outcome, not a model feature. By building a robust harness of instructions, state management, and end-to-end verification, we close the gap between AI capability and production-grade execution.


Q&A: Harness Engineering Best Practices

How does an AGENTS.md file improve agent reliability?

Creating an AGENTS.md file (or CLAUDE.md) in the root of your repository is considered the first and “highest-ROI” step you can take in harness engineering to dramatically improve an AI agent’s reliability. The file acts as the agent’s “instruction subsystem” or “recipe shelf,” providing the foundational rules, tools, and context it needs to execute tasks successfully.

Here is how an AGENTS.md file specifically improves agent reliability:

  1. Eliminates Harmful Guesswork by Providing a “Single Source of Truth”: Information, architectural conventions, and business rules that only exist in Slack messages or engineers’ heads are completely invisible to an AI agent. When an agent lacks context, it guesses; a wrong guess results in bugs and wasted context windows. The AGENTS.md file serves as the agent’s primary “landing page”, outlining the project’s purpose, tech stack versions, and architecture, which bridges the “Knowledge Visibility Gap” and allows the agent to start working reliably without human intervention.
  2. Enforces Objective Verification (Defining “Done”): A major cause of agent failure is the “Verification Gap,” where agents declare a task finished simply because the code looks correct to them. An AGENTS.md file directly counters this by explicitly listing verification commands (such as make test or yarn lint). Providing these commands gives the agent a machine-verifiable “Definition of Done,” forcing it to objectively test its code and significantly reducing premature task completion.
  3. Preserves Context Budget via “Routing”: If an instruction file becomes too long, agents suffer from the “Lost in the Middle” effect, where they effectively ignore critical rules buried deep in the text, and waste precious cognitive budget processing irrelevant instructions. A highly reliable AGENTS.md avoids this by acting as a concise routing file (ideally 50-200 lines). It holds only the most critical, high-priority information and uses links to point the agent to more detailed topic documents (e.g., docs/api-patterns.md) only when those specific topics are needed.
  4. Establishes Non-Negotiable Hard Constraints: The file is used to explicitly state a small number of global, non-negotiable rules (e.g., “All APIs must use OAuth 2.0” or “never use eval()”). By putting these hard constraints right in the agent’s entry point, you mechanically prevent the agent from straying from your project’s foundational standards.
  5. Enables Reliable “Cold-Starts” for Multi-Session Tasks: Because context windows are finite, agents working on complex tasks will inevitably need to start new sessions. An AGENTS.md ensures that every fresh session can pass a “cold-start test” by immediately answering essential questions: What is this system? How do I run it? How do I verify it? This allows new sessions to reliably pick up where the last one left off without wasting time and tokens re-discovering the project’s structure.

Ultimately, Anthropic and OpenAI have both found that the exact same underlying AI model produces vastly different outcomes depending on its environment, and far more reliable ones when that environment is guided by a structured AGENTS.md file.

How long should a reliable AGENTS.md file be?

A reliable AGENTS.md file should be between 50 and 200 lines long, with around 100 lines generally being sufficient.
Rather than functioning as a comprehensive encyclopedia, the file should act as a concise “routing file” or “directory page”. It should contain only the most critical, frequently used information, such as:

  • A brief, one- or two-sentence project overview.
  • First-run setup and verification commands.
  • A small number of global, non-negotiable hard constraints (ideally no more than 15).
  • Links to more detailed, topic-specific documents (e.g., in a docs/ directory) that the agent can read on demand.

Keeping the file short is crucial because long files (such as those growing to 300–600 lines) suffer from “instruction bloat”. When an instruction file becomes too large, it eats into the agent’s finite context budget, leaving less room for the agent to actually read code and execute tasks. Furthermore, long files trigger the “Lost in the Middle” effect, a phenomenon where the AI model systematically ignores critical constraints that are buried in the middle of a lengthy text.
By forcing the AGENTS.md file to remain short and moving occasional or module-specific rules into separate, dedicated files, you dramatically improve the agent’s signal-to-noise ratio and ensure it actually follows your most important directives.
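
The length and constraint limits above are easy to check automatically, for instance in CI. The sketch below is illustrative only: the function name is hypothetical, a "hard constraint" is detected by the heuristic that the line contains MUST, MUST NOT, or NEVER in capitals, and flagging constraints in the middle third of the file is an assumption about how to approximate the “Lost in the Middle” risk.

```python
import re

# Heuristic (an assumption of this sketch): a hard constraint is any line
# that contains MUST, MUST NOT, or NEVER written in capitals.
CONSTRAINT_RE = re.compile(r"\b(MUST(?: NOT)?|NEVER)\b")


def audit_instruction_file(text: str, min_lines: int = 50, max_lines: int = 200,
                           max_constraints: int = 15) -> list[str]:
    """Return human-readable findings; an empty list means the file passes."""
    lines = text.splitlines()
    findings = []
    if len(lines) < min_lines:
        findings.append(f"{len(lines)} lines: shorter than the {min_lines}-line guideline")
    if len(lines) > max_lines:
        findings.append(f"{len(lines)} lines: longer than the {max_lines}-line guideline")

    constraint_rows = [i for i, line in enumerate(lines) if CONSTRAINT_RE.search(line)]
    if len(constraint_rows) > max_constraints:
        findings.append(f"{len(constraint_rows)} hard constraints: more than {max_constraints}")

    # Flag constraints that sit in the middle third of the file.
    lower, upper = len(lines) / 3, 2 * len(lines) / 3
    buried = [i + 1 for i in constraint_rows if lower <= i < upper]
    if buried:
        findings.append(f"constraints buried mid-file at lines {buried}")
    return findings
```

Running this against AGENTS.md on every change turns the 50–200 line guideline from advice into a check that fails loudly when the file starts to bloat.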

How do I use AGENTS.md to enforce non-negotiable hard constraints?

To effectively use AGENTS.md to enforce non-negotiable hard constraints, you must focus on strategic placement, explicit language, and mechanical verification. Here is exactly how to do it:

  • Keep the list small and prominent: Limit your global, non-negotiable rules to no more than 15. Place them at the very top or bottom of your AGENTS.md file. If you bury critical constraints (like security rules) in the middle of a lengthy document, the AI model will likely ignore them due to the “Lost in the Middle” effect.
  • Use explicit “MUST/MUST NOT” language: Phrase your rules as absolute directives to eliminate any priority ambiguity. Give clear, strict commands like “never use eval()” or “All APIs must use OAuth 2.0”. Your goal is to enforce invariants rather than micromanaging the exact implementation.
  • Never mix hard constraints with soft suggestions: If you jumble strict architectural rules together with general coding preferences (like “prefer functional style”) or historical bug notes, the agent has no reliable way to distinguish an absolute rule from a gentle guideline. Keep your hard constraints isolated and distinct.
  • Back them up with executable verification commands: A rule written in text is only the first step. To truly enforce architectural boundaries and constraints, turn them into automated tests or custom lint rules. By explicitly listing these verification commands in the AGENTS.md file, you force the agent to run them and objectively prove that it followed the constraints before it can declare the task complete.
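
As an illustration of backing a written rule with an executable check, the “never use eval()” constraint can be enforced with Python’s standard ast module. This is a minimal sketch: find_eval_calls is a hypothetical helper, and it only catches direct calls to the bare name eval, not aliased or dynamically constructed calls.

```python
import ast


def find_eval_calls(source: str, filename: str = "<string>") -> list[str]:
    """Return one message per direct call to the built-in name eval."""
    tree = ast.parse(source, filename=filename)
    hits = [
        node for node in ast.walk(tree)
        if isinstance(node, ast.Call)
        and isinstance(node.func, ast.Name)
        and node.func.id == "eval"
    ]
    # ast.walk does not guarantee source order, so sort by line number.
    hits.sort(key=lambda node: node.lineno)
    return [
        f"{filename}:{node.lineno}: eval() is forbidden by a hard constraint in AGENTS.md"
        for node in hits
    ]
```

Wrapped in a script that exits non-zero when the list is non-empty, this becomes one of the verification commands listed in AGENTS.md, so the constraint is proven rather than promised.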

Why should I split instructions across multiple files?

Splitting instructions across multiple files is crucial because packing every rule into a single, massive file creates a “giant instruction file” trap that actively degrades an AI agent’s performance.
Here is exactly why you should move away from a single instruction file:

  • Preserves Precious Context Budget: An agent’s context window is finite. A bloated instruction file can consume up to 10,000-20,000 tokens, eating up 8-15% of the total budget before the agent even begins working. By splitting instructions, you improve the Signal-to-Noise Ratio (SNR), ensuring the agent doesn’t waste cognitive budget reading irrelevant rules (like deployment procedures) when trying to complete a simple task (like a bug fix).
  • Prevents the “Lost in the Middle” Effect: Language models use information located in the middle of long texts significantly less effectively than information at the beginning or the end. If a critical, non-negotiable constraint is buried at line 300 of a 600-line file, there is a very high probability the agent will effectively ignore it.
  • Eliminates Priority Conflicts: When you mix strict security rules, general coding guidelines, and historical notes about old bugs into one file, they all look equally important. The agent has no reliable way to distinguish an absolute hard constraint from a soft suggestion, which creates priority ambiguity.
  • Stops Maintenance Decay and Contradictions: Large instruction files accumulate debt just like code. Because developers frequently add new rules without deleting outdated ones, the file naturally bloats and accumulates contradictory instructions. When faced with contradicting rules, the agent will simply pick one at random.
  • Enables Progressive Disclosure (On-Demand Loading): Splitting files allows you to treat your main AGENTS.md as a routing file rather than an encyclopedia. The main file can be kept at a concise 50-200 lines, containing only essential overviews, hard constraints, and links to topic documents (e.g., docs/api-patterns.md or docs/database-rules.md). This means the agent only loads detailed, domain-specific instructions when the current task actually requires them.

By modularizing your instructions, you ensure the agent focuses its attention on actual code execution and task reasoning, rather than digging through an overstuffed file of irrelevant rules.

How do I design topic documents for progressive disclosure?

Designing topic documents for progressive disclosure means structuring your rules so the AI agent only loads specific, detailed information when a task actually requires it. Here is how to design these documents effectively:

  • Keep them focused and concise: Each topic document should be between 50 and 150 lines long. They should be organized by subject, functioning like “packing cubes in a suitcase” so the agent doesn’t have to read every rule you’ve ever written to solve a simple task.
  • Link them properly in your routing file: In your main entry file (like AGENTS.md), you should provide links to these topic documents. Each link must include a one-line description and an applicability condition so the agent knows exactly when it needs to follow the link and read the file.
  • Structure every instruction: Inside the topic documents, every individual rule should include a source (why the rule was added), an applicability condition (when the rule is needed), and an expiry condition (under what circumstances the rule can be safely removed).
  • Place knowledge near the code: While global topic documents can live in a central docs/ directory (e.g., docs/api-patterns.md), you should also place short, specific architecture documents directly inside corresponding module directories. Furthermore, some information shouldn’t be in instruction files at all: type definitions, interface comments, and config explanations are best placed directly in the source code so the agent naturally sees them while reading the code.
  • Treat instructions like code dependencies: To prevent instruction bloat, you must regularly audit your topic documents and remove outdated, redundant, or contradictory entries. If an instruction is no longer used, delete it; otherwise it will just slow the agent down.

What are the five subsystems of a reliable AI harness again?

A complete and reliable AI harness is composed of five distinct subsystems, the first of which is the instruction subsystem where AGENTS.md lives. If the AI model provides the “ingredients” for your project, the harness acts as the fully equipped kitchen.

Here are the five subsystems you need to build a reliable AI harness:

  1. The Instruction Subsystem (The Recipe Shelf): This is where your routing files, like AGENTS.md or CLAUDE.md, live. As we covered previously, this subsystem acts as the agent’s landing page, providing the project overview, tech stack versions, non-negotiable hard constraints, and links to detailed topic documents so the agent knows exactly what rules it must follow.
  2. The Tool Subsystem (The Knife Rack): An agent needs sufficient tool access to execute its work, including shell commands, file manipulations, and testing capabilities. While you should follow the principle of least privilege, you must avoid over-restricting the agent; for example, if you disable shell access for security reasons, the agent cannot even run pip install or execute basic commands needed to complete its tasks.
  3. The Environment Subsystem (The Stove): The project’s runtime environment must be entirely self-describing and reproducible. You build this subsystem by locking dependencies and runtime versions using configuration files like pyproject.toml, package.json, .nvmrc, .python-version, or by utilizing Docker and devcontainers. This ensures the agent isn’t wasting its context budget trying to resolve environment mismatches.
  4. The State Subsystem (The Prep Station): Because context windows are finite, long-running and complex tasks will inevitably require multiple sessions. The state subsystem maintains the agent’s continuity across these sessions by using persistent artifacts like a PROGRESS.md file, which explicitly tracks what is already done, what is currently in progress, and what is blocked. Without this, a new session will suffer from amnesia and waste time rediscovering the project’s state.
  5. The Feedback Subsystem (The Quality Check Window): This is considered the highest-ROI subsystem of the entire harness. It provides explicit verification commands (such as testing, linting, and building) that give the agent a way to objectively test its work. This forces the agent to rely on machine-verifiable proof rather than just assuming its code looks correct.

Missing any of these subsystems is like missing a functional area in a kitchen; the agent can still work, but it will be awkward, inefficient, and prone to mistakes.
To optimize your own harness, you can perform “isometric model control”. This involves keeping the underlying AI model fixed while removing one harness subsystem at a time to measure which removal causes the largest drop in performance. This will tell you exactly which subsystem is your bottleneck so you can focus your engineering efforts there.
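
Once the ablation runs are done, finding the bottleneck is a matter of ranking the drops. The sketch below assumes you have already measured a success rate for the full harness and one for each run with a single subsystem removed; rank_subsystems is a hypothetical helper, and how the success rate is measured is left to your own evaluation setup.

```python
def rank_subsystems(baseline: float, ablated: dict[str, float]) -> list[tuple[str, float]]:
    """Rank subsystems by the drop in success rate when each one is removed.

    baseline: success rate with the full harness (model held fixed).
    ablated:  success rate per run, keyed by the subsystem that was removed.
    """
    drops = {name: baseline - score for name, score in ablated.items()}
    # Largest drop first: that subsystem contributes most to reliability.
    return sorted(drops.items(), key=lambda item: item[1], reverse=True)
```

The first entry of the returned list names the subsystem whose removal hurt most, which is where further engineering effort is likely to pay off first.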

What should be in a PROGRESS.md file for state management?

A PROGRESS.md file serves as a vital “continuity artifact” or “journal” that allows an AI agent to remember project state across multiple sessions. Because an agent’s context window is finite, it will inevitably run out of memory during long-running tasks and require a new session. Without a state file, the new session acts like an “amnesiac craftsman” who forgets everything they built the day before and must waste time relearning the project structure.
To effectively maintain state management and ensure a clean handoff between sessions, your PROGRESS.md file should explicitly contain the following elements:

  • Current Status: A clear declaration of the specific task or feature that is currently active.
  • Completed Work and Verification Records: A concrete log of what has already been accomplished, including which exact tests or verification steps are currently passing. Tracking this prevents the agent from accidentally tearing down completed work or doing duplicate work.
  • Current Blockers: A list of unresolved issues, failed tests, or missing dependencies that are actively preventing the task from moving forward.
  • Next Steps / Actions: Clear, actionable instructions for the incoming agent session detailing exactly where to pick up the work (e.g., “Run make db-migrate” or “Implement /login endpoint”).
  • Repository and Runtime State Checkpoints: Explicit data, such as a git commit hash and the overall test pass rate, to unambiguously ground the new session in the current reality of the codebase.

How to use it in practice: You must treat your agent like an engineer with amnesia by forcing it to properly “clock out” at the end of its shift. The agent should be instructed to update the PROGRESS.md file right before the session ends.

When the next session begins, the agent reads this file to instantly understand what was done, what failed, and what to do next. When implemented correctly, this file acts as a core part of the state subsystem that can compress a new session’s rebuild cost — the time it takes for an agent to reach an executable state — from 15 minutes down to just 3 minutes.

How can I automate the updating of my PROGRESS.md file?

To automate the updating of your PROGRESS.md file and ensure your AI agent never forgets to record its state, you need to integrate the update process directly into the agent’s runtime harness and completion requirements.
Here are the primary ways to automate and enforce this based on the principles of harness engineering:
1. Create a Mandatory “Clock-Out” Routine in AGENTS.md
You must treat the agent like an “amnesiac craftsman” by giving it strict clock-in and clock-out instructions. You can automate the update by explicitly defining this routine in your AGENTS.md (or CLAUDE.md) file so that the agent mechanically updates the file before ending its session.
You can add a snippet like this to your instruction file:
When you start work:

  1. Read PROGRESS.md to understand current state
  2. Read DECISIONS.md for historical context

When you finish work (IMPORTANT):

  1. Update PROGRESS.md with completed items and blockers
  2. Commit changes with 'git commit -m "chore: state checkpoint"'

2. Make it a “Clean State” Completion Requirement
Agents often suffer from premature completion declarations, declaring “done” simply because the code compiles. You can force the agent to update PROGRESS.md by making it a non-negotiable part of your project’s “Definition of Done” or clean handoff state.
Instruct the harness to reject any completion attempt if the progress dimension isn’t fulfilled.
Add a rule to your CLAUDE.md like:
Before declaring ‘done’, you MUST:

  1. Ensure all tests pass.
  2. Update PROGRESS.md with what was completed and what remains.
  3. Ensure no temporary debug files are left behind.

3. Use a “Handoff Reporter” Tied to Your Feature List
If you want to remove the burden from the agent entirely, you can automate the generation of progress summaries using a structured feature list. By maintaining a machine-readable feature list (like a JSON or Markdown file) that tracks every subtask’s state (e.g., not_started, active, blocked, passing), you can build a handoff reporter into your harness.

Because the harness controls the state transitions of features (shifting them to passing only when verifiable tests pass), the handoff reporter can read this list at the end of the session and generate the new PROGRESS.md summary without any agent involvement, acting like an “automatic shift-change report”.

How does PROGRESS.md reduce a session’s rebuild cost?

A PROGRESS.md file reduces a session’s rebuild cost (defined as the time a new agent session needs to reach an executable state) by acting as a “continuity artifact” that eliminates the need for the agent to blindly rediscover the project’s state.
Here is exactly how it drives down that cost:

  • Eliminates Redundant Diagnosis: Without a progress record, a new session acts like an “amnesiac craftsman” who must waste precious context window re-reading folders, re-running tests, and guessing why previous code was written. This redundant diagnosis can consume 30-50% of the total session time. A structured PROGRESS.md provides an immediate, machine-readable “handoff,” allowing the new session to instantly know what is done, what is blocked, and what to do next.
  • Prevents Duplicate Work and Drift: When an agent doesn’t explicitly know what was completed in a previous session, it often wastes time re-implementing features that are already finished or undoing past decisions. The PROGRESS.md file anchors the agent to the current reality of the codebase, preventing this costly rework.
  • Bypasses the “Verification Gap”: If previous verification results (like which tests are passing or failing) are not recorded, the incoming session is forced to re-run all tests from scratch to understand the current state. A progress file records these verification notes, saving significant time and cognitive budget.

The Quantitative Impact: According to Anthropic’s engineering data, good progress records, which serve as an explicit journal for the agent, reduce session startup diagnostic time by 60–80%. In real-world use, PROGRESS.md can compress a new session’s rebuild cost from 15–20 minutes down to about 3 minutes.

How can I build a handoff reporter for these files?

To build an automated handoff reporter, you must transition away from using unstructured text notes and instead use a machine-readable feature list (like a JSON file) as the foundation. The handoff reporter acts like an “automatic shift-change report” that reads this structured file at the end of a session to automatically generate your PROGRESS.md summary.
Here are the specific steps to build one based on harness engineering principles:

  1. Create a Feature State Machine
    Your reporter needs structured data to read. You must define every task in your feature list as a primitive “triple” containing three mandatory elements:
    • A specific behavior description (e.g., “user can add items to cart”).
    • An executable verification command to objectively check the behavior.
    • The current state, which must be strictly limited to not_started, active, blocked, or passing.
    Missing any of these elements makes the feature item incomplete.
  2. Define a Minimal JSON Format
    Structure your project’s tasks in a file like feature_list.json. Your schema should include fields for id, behavior description, verification command, current state, and an evidence reference (which links to the passing test or criteria).
  3. Enforce “Pass-State Gating”
    For the handoff reporter to be accurate, the AI agent cannot be allowed to manually change a task’s state to passing just because the code looks correct. Your harness must control the state transitions: a task can only move from active to passing if the harness executes the associated verification command and it succeeds. This ensures your reporter is summarizing verified truths rather than the agent’s overconfidence.
  4. Generate the PROGRESS.md Summary
    Build a script that runs at the very end of the agent’s session. The handoff reporter simply reads the feature_list.json file, groups the features by their current state, and automatically overwrites PROGRESS.md with a clean summary of what is passing, what is currently active, and what is blocked or not started.

With this reporter in place, the next agent session can read the generated progress file and understand the exact state of the project in about 3 minutes. Real-world data shows that relying on structured progress records like this can reduce session startup diagnostic time by 60–80% and helps prevent duplicate work.
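
Step 4 can be sketched in a few lines of Python. This is an illustrative sketch under stated assumptions, not a reference implementation: it assumes feature_list.json holds a JSON array of entries with id, behavior, and state fields, and the function names render_progress and write_handoff, as well as the Markdown layout of the summary, are hypothetical.

```python
import json
from collections import defaultdict

# Order in which states appear in the generated summary.
STATE_ORDER = ["passing", "active", "blocked", "not_started"]


def render_progress(features: list[dict]) -> str:
    """Group features by state and render a Markdown summary."""
    grouped = defaultdict(list)
    for feature in features:
        grouped[feature["state"]].append(feature)
    lines = ["# PROGRESS", ""]
    # Entries whose state is not in STATE_ORDER are not rendered,
    # so validate the feature list before calling this function.
    for state in STATE_ORDER:
        lines.append(f"## {state} ({len(grouped[state])})")
        for feature in grouped[state]:
            lines.append(f"- {feature['id']}: {feature['behavior']}")
        lines.append("")
    return "\n".join(lines)


def write_handoff(feature_path: str = "feature_list.json",
                  progress_path: str = "PROGRESS.md") -> None:
    """Read the feature list and overwrite PROGRESS.md at session end."""
    with open(feature_path, encoding="utf-8") as fh:
        features = json.load(fh)
    with open(progress_path, "w", encoding="utf-8") as fh:
        fh.write(render_progress(features))
```

Keeping the rendering in a pure function separate from the file I/O makes the reporter easy to test, and the harness only needs to call write_handoff() once as its final step.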

Here is a sample JSON format for a minimal feature triple:

{
  "id": "REQ01",
  "behavior": "POST /cart/items returns 201",
  "verification_command": "npm run test:e2e -- -t 'cart add'",
  "state": "passing"
}

Every entry in this machine-readable feature list acts as a foundational data structure, or “primitive”, that all other harness components depend on. To form a complete “triple,” each item must strictly contain these three core elements:

  • Behavior description: A specific definition of exactly what the feature should do, such as “POST /cart/items returns 201”.
  • Verification command: An executable test command that objectively proves the behavior is working, like “npm run test:e2e -- -t 'cart add'”.
  • State: The current status of the task, which must be strictly limited to “not_started”, “active”, “blocked”, or “passing”.

Although it is called a “triple,” the JSON format typically also includes an id field for tracking, and you may add an evidence reference field that links to the specific passing test or criteria. Missing any of the three core elements makes the feature item incomplete, much like a three-legged stool missing a leg.
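
The completeness rule lends itself to a small validator that the harness runs whenever the feature list is loaded. The sketch below assumes the field names from the sample JSON (behavior, verification_command, state); the function name validate_feature is hypothetical.

```python
ALLOWED_STATES = {"not_started", "active", "blocked", "passing"}
REQUIRED_FIELDS = ("behavior", "verification_command", "state")


def validate_feature(entry: dict) -> list[str]:
    """Return a list of problems; an empty list means the triple is complete."""
    problems = []
    for field in REQUIRED_FIELDS:
        value = entry.get(field)
        if not isinstance(value, str) or not value.strip():
            problems.append(f"missing or empty '{field}'")
    state = entry.get("state")
    # Only report an invalid state when a non-empty state was actually supplied.
    if isinstance(state, str) and state.strip() and state not in ALLOWED_STATES:
        problems.append(
            f"invalid state '{state}'; expected one of {sorted(ALLOWED_STATES)}"
        )
    return problems
```

The id and evidence fields are deliberately not required here, since only the three core elements define a complete triple.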

Using this structured JSON format is critical because it enables “pass-state gating”.
The AI agent is not allowed to manually change a feature’s state to “passing” just because it thinks the code is done. Instead, the agent submits a verification request, and the harness actually executes the verification command. The harness will only transition the state to “passing” if the verification succeeds, making the completion criteria irreversible and objective.
Ultimately, this JSON acts as your project’s single source of truth, serving as the machine-readable backbone that powers the task scheduler, the verifier, and the automated handoff reporter.
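
Pass-state gating can be sketched as a single harness function that owns the active-to-passing transition. This is a minimal illustration under assumptions: the name request_verification and the injectable runner parameter are hypothetical, and the default runner assumes a POSIX-like environment where the verification command can be split with shlex and executed directly.

```python
import shlex
import subprocess


def request_verification(feature: dict, runner=None) -> dict:
    """Run the feature's verification command; only a zero exit code yields 'passing'."""
    if feature["state"] != "active":
        raise ValueError(f"{feature['id']}: only 'active' features can be verified")
    # Default runner executes the command and returns its exit code.
    # A custom runner can be injected for tests or sandboxed execution.
    run = runner or (lambda cmd: subprocess.run(shlex.split(cmd)).returncode)
    exit_code = run(feature["verification_command"])
    updated = dict(feature)  # copy, so the caller's entry is never mutated
    if exit_code == 0:
        updated["state"] = "passing"
    return updated
```

The agent never writes the state field itself; it can only ask the harness to run this function, and a failed command leaves the feature in the active state.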

How do I design agent-oriented error messages?

To design agent-oriented error messages, you must shift away from standard error outputs that simply state a test failed, and instead design messages that actively guide the AI toward the solution.
A well-designed agent-oriented error message must strictly contain three core elements: WHAT went wrong, WHY it went wrong, and exactly HOW to fix it.

Here is how to design them effectively:

  1. Provide Specific “Fix Instructions”
    Standard error messages merely state that a violation occurred, which often causes the agent to guess blindly at a solution. Error messages written for agents must include explicit fix instructions that tell the AI exactly how to change the code. By doing this, you turn architectural rules and test failures into an “auto-correction loop” where the agent can self-correct without any human intervention.
  2. Use the “Red Pen Markup” Approach
    Think of designing your error messages like a good teacher grading an exam. **Don’t just draw a big red cross to indicate a failure; instead, write specific, actionable feedback in the margins** explaining exactly how the student should correct their work.
    • Examples of Bad vs. Good Error Messages:
      • Bad (Vague): “Direct filesystem access in renderer” or “Test failed”.
      • Good (Agent-Oriented): “Direct filesystem access in renderer. All file operations must go through the preload bridge. Move this call to preload/file-ops.ts and invoke it via window.api.”
      • Good (Agent-Oriented): “Test failed: POST /api/reset-password returned 500. Check that the email service config exists in environment variables. The template file should be at templates/reset-email.html.”
  3. Turn Architectural Rules into Executable Checks
    To properly enforce your system boundaries, you should convert the rules from your architecture documents into custom lint rules or automated tests. When these rules are broken, the resulting error message must be designed to enforce the invariant while guiding the agent’s implementation. Over time, whenever you notice a recurring issue during code review, you can promote that feedback into a new automated check with an agent-oriented error message, continuously making your harness stronger.
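
As an illustration of an architectural rule turned into an executable check, the sketch below scans renderer source code for direct imports of Node’s fs module and reports each violation in WHAT / WHY / HOW form, reusing the wording of the preload-bridge example. The names AgentError and check_renderer_source and the regular expression are assumptions made for this example; a real project would more likely implement this as a custom lint rule, and this pattern does not catch variants such as fs/promises.

```python
import re
from dataclasses import dataclass


@dataclass
class AgentError:
    what: str  # what went wrong, with a precise location
    why: str   # the rule or invariant that was violated
    how: str   # explicit fix instructions for the agent

    def render(self) -> str:
        return f"WHAT: {self.what}\nWHY: {self.why}\nHOW: {self.how}"


# Assumption: fs is pulled in via `import ... from "fs"` or `require("fs")`.
FS_IMPORT = re.compile(r"""(from\s+|require\(\s*)['"](node:)?fs['"]""")


def check_renderer_source(path: str, source: str) -> list[AgentError]:
    """Flag every line of renderer code that imports the fs module directly."""
    errors = []
    for lineno, line in enumerate(source.splitlines(), start=1):
        if FS_IMPORT.search(line):
            errors.append(AgentError(
                what=f"Direct filesystem access in renderer at {path}:{lineno}.",
                why="All file operations must go through the preload bridge.",
                how="Move this call to preload/file-ops.ts and invoke it via window.api.",
            ))
    return errors
```

Printing error.render() for each result gives the agent a message it can act on without guessing, which is what closes the auto-correction loop.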

References

WalkingLabs: Learn Harness Engineering

OpenAI: Harness engineering: leveraging Codex in an agent-first world (2026-02-11)
Anthropic: Effective harnesses for long-running agents (2025-11-26)
Anthropic: Harness design for long-running application development (2026-03-24)

OpenAI: Unrolling the Codex agent loop (2026-01-23)
Anthropic: Demystifying evals for AI agents (2026-01-09)
LangChain: Improving Deep Agents with harness engineering (2026-02-17)
Thoughtworks / Martin Fowler: Harness engineering for coding agent users (2026-04-02)
Cursor: Continually improving our agent harness (2026-04-30)