Lizi Li

Harness Engineering: Why the Same Model Works for Others But Not for You

Over the past two years, AI engineering has gone through three fundamental shifts: from Prompt Engineering to Context Engineering to Harness Engineering. On the surface, it looks like a parade of buzzwords. Underneath, it's a root-level change in engineering thinking — from "how to make the model think better" to "how to keep it on track, running stably, and recoverable when it fails."

A real story

Earlier this year, a friend asked me to help debug their agent. Their team had already put in serious effort — they'd upgraded to the best flagship model, rewritten their prompts over a hundred times, and tuned every parameter they could find. But once it hit real-world scenarios, performance was erratic. Sometimes brilliant, sometimes wildly off. Task success rate: under 70%.

When I dug in, the biggest changes I made had nothing to do with the model or the prompts. I changed how tasks were decomposed, how state was managed, how critical steps were validated, and how the system recovered from failure.

After the new version shipped — same model, same prompts — success rate jumped above 95%.

At the time, I didn't have a precise word for what I'd changed. It wasn't until Harness Engineering started gaining traction that I realized: what I'd rebuilt was the harness.

Three shifts in gravity

Prompt Engineering, Context Engineering, and Harness Engineering each correspond to a different stage of the same problem:

Did the model understand what you're asking? That's a Prompt problem.
Did the model receive the right information? That's a Context problem.
Can the model keep doing the right thing across real execution? That's a Harness problem.

Each question expands outward from the last.

Phase 1: Prompt Engineering

When large language models first exploded, everyone noticed the same thing: same model, different phrasing, wildly different results. So everyone started studying prompts — role definitions, style constraints, few-shot examples, chain-of-thought, output formatting.

Why does this work? Because a large language model is fundamentally a probability-generation system that's hypersensitive to its input context. Give it an identity, and it'll answer in character. Give it examples, and it'll pattern-match. Emphasize a constraint, and it'll weight that constraint heavily. Prompt engineering isn't commanding the model — it's shaping a local probability space.

But prompt engineering hits a ceiling fast. Many tasks aren't about "explaining the problem clearly" — they're about the model actually knowing things. Analyzing an internal document, answering about a product's latest configuration, writing code against a long specification. No matter how elegant the prompt, it can't substitute for facts the model doesn't have.

Prompts are good at: clarifying intent, constraining output, activating the model's existing capabilities.

Prompts are bad at: compensating for missing knowledge, managing dynamic information, handling state across long task chains.

Put simply, prompts solve the expression problem, not the information problem.

Phase 2: Context Engineering

When agents started gaining traction, models weren't just answering questions anymore — they were operating in real environments. Multi-turn conversations, browser automation, code generation, database queries, passing intermediate results between steps, adjusting plans based on external feedback.

The problem shifted. The system was no longer judged on whether a single response was correct, but on whether the entire task chain could run to completion.

The core of Context Engineering is one sentence: the model doesn't know everything; the system must deliver the right information at the right time.

"Context" here isn't just a few paragraphs of background. In engineering terms, it's the sum of everything that influences the model's current decision — user input, conversation history, retrieval results, tool outputs, current task state, intermediate artifacts, system rules, safety constraints, structured results from other agents.

One critical practice is progressive disclosure. Dumping all tool descriptions and parameter definitions into the context upfront sounds like it would help. In practice, it makes things worse. The context window is a scarce resource — too much information dilutes attention. The better approach: give the model minimal metadata up front, then dynamically load detailed SOPs and reference material only when the model actually needs to invoke a specific capability.
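A minimal sketch of what progressive disclosure can look like in practice. The tool names, summaries, and spec store below are illustrative stand-ins, not any particular framework's API: only the one-line index enters the initial context; the full SOP and parameter schema load when a specific tool is about to be used.

```python
# Exposed up front: one line of metadata per tool, nothing more.
TOOL_INDEX = {
    "search_docs":  "Search internal documents by keyword.",
    "query_orders": "Look up an order by ID in the orders database.",
    "send_email":   "Draft and send an email to a known contact.",
}

# Kept out of the initial context; in a real system this would live on
# disk or in a registry and be read on demand.
TOOL_SPECS = {
    "search_docs": "search_docs(query: str, top_k: int = 5) -> list[str]\n"
                   "SOP: prefer exact product names; never search customer PII.",
    # ... full parameter schemas and SOPs for the other tools ...
}

def build_system_context() -> str:
    """Only the one-line index goes into the initial context window."""
    lines = [f"- {name}: {summary}" for name, summary in TOOL_INDEX.items()]
    return "Available tools (request details before calling):\n" + "\n".join(lines)

def expand_context_for(tool_name: str, context: list[str]) -> None:
    """Load the detailed spec only when the model decides to use this tool."""
    context.append(TOOL_SPECS[tool_name])
```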

Context optimization isn't about "more." It's about the right information, layered appropriately, delivered at the right moment.

Phase 3: Harness Engineering

But Context Engineering isn't the end of the road. Teams kept discovering a harder problem: even when the information is right, the model doesn't always execute correctly.

The plan might be solid, but execution drifts. A tool gets called, but the model misinterprets the return value. Somewhere in a long chain, the agent gradually veers off course — and the system doesn't notice.

Prompts and context primarily solve input-side problems — the former optimizes intent expression, the latter optimizes information supply. But in complex tasks, there's a harder question: when the model starts taking continuous action, who supervises it, constrains it, and corrects it?

Harness — as in reins, as in a horse's harness — reminds us of something fundamental: when a model transitions from answering questions to executing tasks, the system doesn't just need to feed it information. It needs to drive the entire process.

Here's a concrete analogy. Imagine you're sending a new hire on an important client visit.

Prompt is briefing them on the playbook — start with small talk, then present the proposal, ask about their needs, confirm next steps. The goal is to communicate the task clearly.

Context is preparing their materials — client background, past communications, product pricing, competitive landscape. The goal is to supply the right information.

Harness is giving them a checklist, requiring real-time check-ins at key milestones, cross-referencing meeting notes against the recording afterward, correcting deviations immediately, and validating the outcome against explicit criteria. The goal is continuous observation, continuous correction, final acceptance.

These three aren't replacements for each other. They're concentric layers. Prompt is the engineering of instructions. Context is the engineering of the input environment. Harness is the engineering of the entire runtime system.

The six layers of a harness

LangChain's engineers offered a clean formula: Agent = Model + Harness. In an agent system, almost everything that determines whether it can reliably deliver — besides the model itself — is harness. A mature harness has six layers:

  1. Context management
    The model needs to know who it is, what the task is, and what success looks like. Information isn't "more is better" — it's "more relevant is better." Fixed rules, current state, and external evidence should be clearly separated. When information gets tangled, the model misses key points, forgets constraints, or contaminates its own reasoning.
  2. Tool system
    Without tools, the model is still a text predictor. The harness must solve three problems: which tools to expose, when to invoke them, and how to feed results back into context. Forty search results shouldn't be dumped back raw — they need to be filtered, distilled, and kept relevant to the current task.
  3. Execution orchestration
    Most agent failures aren't about not knowing a single step — they're about not stringing steps together coherently. Understand the goal → check if information is sufficient → gather more → analyze → generate output → validate output → retry if it doesn't pass. This is very close to how humans work. The difference: humans rely on experience, agents rely on the harness environment. (A minimal sketch of this loop follows the list.)
  4. Memory and state
    An agent without state is an agent with amnesia. Every turn, it forgets what it just did. At minimum, you need to separate three categories: current task state, in-session intermediate results, and long-term memory with user preferences. Mix them together and the system degrades fast.
  5. Evaluation and observability
    The layer most teams neglect. The system needs to not only do the work, but know whether it did the work correctly. This includes output acceptance criteria, environment verification, automated testing, logging and metrics, and error attribution.
  6. Constraints, validation, and recovery
    In real environments, failure is the norm. Search returns garbage. APIs time out. Document formats are broken. The model misreads the task. A mature harness must include: constraints (what's allowed and what's not), validation (how to check before and after output), and recovery (how to retry, reroute, or roll back to a stable state).

In practice: how Anthropic and OpenAI do it

Anthropic: Context Reset + separation of production and verification

Anthropic identified two recurring problems in long-running autonomous tasks.

The first is context anxiety. Over time, the context fills up. The model starts dropping details, missing key points. It even exhibits a curious behavior: it seems to sense that it's running out of room, and starts rushing to wrap up. Many systems attempt Context Compaction — compressing the history and continuing. Anthropic found this insufficient. Compression makes things shorter, but the cognitive burden doesn't actually disappear.

Their solution was more radical: Context Reset. Instead of compressing within the existing context, they spin up a fresh agent and hand the work off. It's analogous to dealing with a memory leak in engineering — instead of clearing the cache and hoping for the best, you restart the process and restore state.
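A sketch of the idea — not Anthropic's actual implementation: distill the durable state into a structured handoff note, then start a fresh agent from the fixed rules plus that note instead of the bloated history. The Agent class, FIXED_RULES, and the handoff fields are all illustrative.

```python
from dataclasses import dataclass, field

FIXED_RULES = "You are resuming a long-running task. Follow the handoff note; do not redo finished work."

@dataclass
class Agent:
    system_rules: str
    initial_context: str
    decisions: list[str] = field(default_factory=list)       # durable decisions made so far
    artifacts: dict[str, str] = field(default_factory=dict)  # name -> path, not contents

def build_handoff(goal: str, decisions: list[str], artifacts: dict[str, str]) -> str:
    """Everything the next agent needs, and nothing it doesn't."""
    return "\n".join([
        f"GOAL: {goal}",
        "DECISIONS SO FAR:",
        *[f"  - {d}" for d in decisions],
        "ARTIFACTS (paths only):",
        *[f"  - {name}: {path}" for name, path in artifacts.items()],
        "NEXT: continue from the artifacts above.",
    ])

def reset_context(old_agent: Agent, goal: str) -> Agent:
    # The replacement agent starts from fixed rules + the handoff,
    # not from the exhausted context window.
    handoff = build_handoff(goal, old_agent.decisions, old_agent.artifacts)
    return Agent(system_rules=FIXED_RULES, initial_context=handoff)
```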

The second problem is self-evaluation bias. When the model does the work and then grades its own output, the scores skew optimistic — especially on subjective dimensions like design quality, user experience, and product completeness. Anthropic's approach: separate the worker from the reviewer. A Planner expands vague requirements into full specifications. A Generator implements step by step. An Evaluator tests like QA — not just reading code, but actually interacting with the page, checking real behavior.

The principle: production and acceptance must be separated.
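A small sketch of that separation, with roles and prompts invented for illustration. The property that matters: the evaluator runs in its own context and receives the spec and the artifact, never the generator's reasoning about why its own work is good.

```python
def llm(role: str, prompt: str) -> str:
    # Stand-in for a model call; each call here is a separate context.
    return f"[{role} output for a prompt of {len(prompt)} chars]"

def build_feature(requirement: str) -> tuple[str, str]:
    spec = llm("planner", f"Expand this vague requirement into a full specification:\n{requirement}")
    artifact = llm("generator", f"Implement this specification step by step:\n{spec}")
    verdict = llm("evaluator",
                  "Test this implementation against the spec like a QA engineer: "
                  "exercise the behavior, don't just read the code.\n"
                  f"SPEC:\n{spec}\nARTIFACT:\n{artifact}")
    return artifact, verdict
```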

OpenAI: environment design + progressive disclosure + agent self-verification

OpenAI's approach to agent engineering redefined the role of the human engineer: humans don't write code. Humans design the environment.

In practice, the engineer's job became three things: decompose product goals into tasks the agent can understand; diagnose failures by asking what the environment is missing rather than by rewriting instructions; and build feedback loops so the agent can actually see the results of its own work.

When an agent fails, the fix is almost never "try harder." It's figuring out what structural capability is missing.

OpenAI made a mistake that many teams make early on: they stuffed every specification, framework choice, and convention into one enormous AGENTS.md file. The result? The agent got more confused, not less. The context window is a scarce resource — filling it to the brim is effectively the same as giving no guidance at all.

They restructured it as a table of contents. The main file holds only a core index. Detailed content lives in sub-documents — architecture docs, design specs, execution plans, quality rubrics, safety rules. The agent reads the index first and drills into specifics only when needed. Same principle as progressive disclosure.
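As a shape, the index might look something like this — the section names and file paths are invented for illustration:

```
AGENTS.md  (index only; details live in sub-documents)

- Architecture and module boundaries   -> docs/architecture.md
- Design specifications                -> docs/design-spec.md
- Current execution plan               -> docs/plan.md
- Quality rubric and acceptance bar    -> docs/quality-rubric.md
- Safety rules and forbidden actions   -> docs/safety-rules.md

Read a sub-document only when the current task touches that area.
```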

Another critical practice: letting the agent see the entire application. Once output speed goes up, the bottleneck shifts from writing to verification — humans simply can't keep up. So they gave the agent access to a browser (screenshots, clicking, simulating real user actions), logging systems and metrics dashboards, and isolated environments per task. The agent doesn't just write code and declare it done. It runs the code, sees the result, finds bugs, fixes them, and verifies again.
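One possible shape of that verification step, sketched here with Playwright as the browser layer. The local URL, the error heuristics, and the idea of returning observations for the agent's next turn are assumptions for illustration, not OpenAI's internal tooling.

```python
from playwright.sync_api import sync_playwright

def verify_in_browser(url: str = "http://localhost:3000") -> dict:
    console_errors = []
    with sync_playwright() as p:
        browser = p.chromium.launch()
        page = browser.new_page()
        # Collect console errors emitted while the page runs.
        page.on("console", lambda msg: console_errors.append(msg.text)
                if msg.type == "error" else None)
        page.goto(url)
        page.screenshot(path="after_change.png")  # evidence the agent can inspect
        html = page.content()
        browser.close()
    # These observations go back into the agent's next context turn,
    # instead of the agent declaring the change done after writing code.
    return {
        "screenshot": "after_change.png",
        "console_errors": console_errors,
        "page_contains_error_text": "Traceback" in html or "Error" in html,
    }
```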

They also encoded senior engineer judgment into system rules — how modules should be layered, which dependencies are forbidden, when to block, and how to fix detected issues. These rules don't just flag errors; they feed the fix back into the agent's next context turn. This isn't a style guide anymore. It's a continuously running, automated governance system.

The takeaway

Prompt Engineering solves how to communicate the task clearly. Context Engineering solves how to supply the right information. Harness Engineering solves how to keep the model doing the right thing across real, continuous execution.

Harness doesn't replace Prompt or Context. It operates at a wider system boundary that contains both. When the task is simple single-turn generation, good prompts are enough. When the task depends on external knowledge and runtime information, context becomes critical. When the model enters long-chain, executable, low-tolerance real-world scenarios — the harness becomes unavoidable.

This is why the same model performs so differently across different products. The model may determine the ceiling. But the harness determines whether the system ships, and whether it stays reliable once it does.
