What Is an AI Harness? The Layer That Makes Agents Work

What Is an AI Harness?

An AI harness is the code wrapped around a language model that turns a one-shot text generator into a working agent. It is the loop that calls the model, runs the tools the model asks for, feeds the results back in, manages what stays in the model's context, and decides when to stop. The model does the thinking. The harness does everything else — and "everything else" is where most of the difficulty lives.

Iceberg diagram showing the AI model as the small visible tip above water and the harness — loop, tools, memory, context management, guardrails, observability — as the much larger mass below the surface

Here is the line that reframed how I think about this whole field: the model is the cheapest part. You can swap a frontier model for a slightly worse one and the agent still mostly works. Strip away the harness and even the best model on the planet does nothing useful on a real task. People spent two years obsessing over model benchmarks. The benchmark gap between top models has narrowed to almost nothing. The gap that actually decides whether your agent ships is the harness.

This post explains, in plain terms, what a harness is, why it matters more than the model it wraps, where harnesses break, and what shipped in 2026 that made "harness engineering" a job title instead of an afterthought.

The Problem: A Model Is Not an Agent

Picture the simplest possible AI feature. A user types a question, you call the model's API, you return the text. That works right up until someone asks the AI to do something instead of say something.

"Look up this customer's last three orders and tell me if their refund is valid." A raw model cannot do this. It can describe how to query the database. It cannot run the query, read the result, notice the result is empty, and try a different query. It has no hands. It is a brain in a jar that can only produce text.

That is the core limitation, and it has three parts:

A model is stateless. Every call starts from zero. It remembers nothing about the last message unless you paste the history back in by hand.
A model cannot act. It outputs words. Running a database query, calling an API, editing a file — none of that happens unless something outside the model executes it.
A model runs once and stops. It produces one response and goes quiet. Real tasks need many steps, each one depending on what the last step returned.

So the question becomes: what software do you have to write to close those three gaps? That software is the harness. The agent is not the model. The agent is the model plus the harness, and the harness is the part you actually build.

The Mental Model: Agent = Model + Harness

The cleanest way to hold this in your head is a simple loop. The model never "runs" on its own — it gets called, over and over, inside a loop that you control. Here is the whole idea in about fifteen lines of pseudocode:

context = [system_prompt, user_request]
 
while True:
    response = model.call(context)          # the model thinks
 
    if response.is_final_answer:
        return response.text                 # we're done
 
    tool_name, args = response.tool_call     # the model asks for a tool
    result = run_tool(tool_name, args)       # the harness acts
 
    context.append(response)                 # remember what it said
    context.append(result)                   # remember what came back
    # loop again — now the model can see the tool result

Read that loop closely, because every hard problem in agent building is hiding inside one of those lines.

The model "thinks" — but you decide what goes into context before each call. The model "asks for a tool" — but you wrote run_tool, you decide which tools exist and what they're allowed to touch. The loop "remembers" — but context has a size limit, and on a long task it overflows, so you decide what to keep and what to throw away. The loop ends "when done" — but a confused model can loop forever, so you decide the stop conditions.

The model is one line in that block. The other fourteen lines are the harness. That ratio is roughly the ratio of effort in a real system. This is the same shift I wrote about in LLM Calls to Autonomous Agents — moving from a single API call to a reasoning loop is an architectural change, not a prompting trick.

What's Actually Inside a Harness

"The loop" is the skeleton. A production harness hangs five organs off that skeleton. Each one exists to fix a specific way the naive loop falls apart.

Tools

Tools are the model's hands. A tool is just a function you expose to the model — search_orders(customer_id), send_email(to, body), run_sql(query) — described in a way the model can request by name. The model emits "call run_sql with this argument," the harness runs the real function, and the harness feeds the return value back in.

The hard part is not wiring up a tool. It is deciding which tools exist and what they are permitted to do. Give an agent a delete_file tool with no guardrails and you have built a machine that can wipe a disk because it misread a path. Tool design is permission design.

Context management

The context window is the model's short-term memory — everything it can "see" on a given call. It is finite. On a long task, the conversation plus tool results plus retrieved documents will blow past the limit. When that happens you cannot just keep appending. You have to decide what to summarize, what to drop, and what to keep verbatim.

This is now its own discipline, sometimes called context engineering. Anthropic's framing is that even the folder and file structure an agent works in is a form of context engineering — when the agent hits a giant log file, a well-built harness lets it grep and tail the file instead of dumping the whole thing into the window. What you choose to inject on each turn determines whether the agent stays coherent or slowly loses the plot.

Memory and state

Context is short-term and disappears when the session ends. Memory is what survives. A scratchpad the agent writes notes to mid-task, a task list it checks off, facts it learned in a previous session — these have to live somewhere outside the model and get loaded back in when relevant. Without memory, every conversation starts blind. With it, the agent compounds what it knows.

Control flow

This is the unglamorous plumbing that keeps the loop from destroying itself: retries when a tool call fails, timeouts when something hangs, a hard cap on the number of steps so a confused agent doesn't loop a thousand times and run up a four-figure bill. The naive while True has no brakes. Control flow is the brakes.

Safety and observability

Guardrails check the model's requested actions before they run — "this agent is about to send 400 emails, block it." Observability is the logging and tracing that lets you reconstruct why the agent did something after the fact. When an agent makes a baffling decision at step 30 of 50, the trace is the only way to find out which tool result sent it down the wrong path. You cannot debug what you did not record.

The point: none of these five live inside the model. They are all things you build around it. That is the harness, and that is why two teams using the identical model can ship agents that are worlds apart in reliability.

The Failure Modes: Why 88% of Agent Projects Never Ship

The most-cited number in this space right now is that up to 88% of enterprise AI agent projects fail to reach production. Sit with that. It is not because the models are bad — everyone has access to the same excellent models. It is because the harness is hard, and these are the specific ways it breaks:

Context overflow. The task runs long, the window fills, and the harness either crashes or silently truncates the part the model needed. The agent "forgets" the original goal halfway through and confidently finishes the wrong task.
The runaway loop. The model gets stuck — calls a tool, misreads the result, calls the same tool again, forever. With no step cap and no loop detection, this burns money and time until something times out.
Tool errors the model can't recover from. A tool returns an error string. A weak harness pastes that error back in raw, the model gets more confused, calls the tool again with the same bad arguments, and the failure compounds instead of resolving.
The silent wrong answer. The worst one. No crash, no error — the agent returns a clean, plausible, wrong result because a tool returned stale data three steps back and nothing flagged it. Loud failures you can fix. Silent ones reach production and erode trust.

Notice that not one of these is a model problem. Every single one is a harness problem. The model is the part that works. The wrapper is the part that fails. That is the whole thesis of harness engineering in one observation.

The Tradeoffs: When You Shouldn't Build One

A harness is real infrastructure, and infrastructure has a cost. The honest version of this post has to say when not to reach for one.

If your task is genuinely single-turn, skip it. Summarize this text, classify this ticket, rewrite this paragraph — one call in, one answer out. Wrapping that in an agentic loop adds latency, cost, and brand-new failure modes to solve a problem you didn't have. Not everything needs to be an agent. Most "AI features" are still better as one clean model call.

Build vs. framework is a real fork. You can hand-roll the loop yourself, or use a framework that ships one. Hand-rolling gives you total control and zero magic — you understand every line because you wrote it — but you also rebuild retries, tracing, and memory from scratch. A framework hands you those for free but buries the control flow under abstractions, so when it misbehaves you're debugging someone else's loop. Roll your own when the logic is simple and you want to see it. Reach for a framework when you need memory, multi-agent coordination, and tracing on day one and don't want to build all three yourself.

The trap to avoid is treating the harness as boilerplate you copy once and forget. It is the part of the system most exposed to the messy real world — flaky tools, weird inputs, edge cases — so it is the part that needs the most care, not the least.

Recent Developments: Harness Engineering Goes Mainstream

For years, everyone building serious agents was quietly writing the same loop, the same retry logic, the same context-trimming code, and nobody had a shared name for it. That changed fast in 2026. A few concrete markers:

The vocabulary got formalized. Early in 2026 the industry converged on "harness" and "scaffold" as the standard terms for this layer, and a wave of writing — surveys, glossaries, an awesome-harness-engineering list — turned scattered folklore into a named discipline. When something gets a survey paper and a curated GitHub list, it has officially stopped being a side concern.

Anthropic published its harness playbook. Two engineering pieces — Effective harnesses for long-running agents and Building agents with the Claude Agent SDK — laid out patterns from the team that builds Claude Code. The standout idea is the Planner → Generator → Evaluator pattern: one agent plans the work, one does it, one checks it, which keeps long tasks from drifting off course. That three-agent split is becoming a default answer to the "the agent loses coherence after 20 steps" problem.

The harness became a shipped product. The Claude Agent SDK — the same harness that powers Claude Code, repackaged so you can wrap your own agents in it — went from a niche tool to a default. Search volume for "claude agent sdk" reportedly jumped from around 50 searches a month in mid-2025 to nearly 15,000 by April 2026. Anthropic also moved memory into public beta for its managed agents: agents that learn from every session, stored as files you can export and manage through the API. Memory stopped being something you hand-build and started being something the harness provides.

Context management turned into a research front. A cluster of 2026 papers — on managing agent context like a Git history, on routing context for long-horizon web agents, on treating context itself as a tool — all attack the same bottleneck: keeping an agent coherent across hundreds of steps without drowning it in its own history. The field's attention has visibly shifted from "make the model smarter" to "manage what the model sees."

The throughline is unmistakable. As the models converged, the competition moved one layer out — to the harness. The control plane around the model is now where the real engineering, the real cost, and the real differentiation live.

Where This Leaves You

If you take one thing from this: when an agent demo dazzles you, the magic you're seeing is mostly harness, not model. The model is a commodity you rent by the token. The harness is the part someone designed — the tools they chose, the guardrails they set, the way they decided what the agent remembers and when it stops.

That is also the opportunity. The frontier models are available to everyone at roughly the same price. The harness is yours to build, and it is the part that decides whether your agent is the one that ships or the one in the 88% that quietly dies in a demo. The next time you're tempted to chase a two-point benchmark bump, ask whether your loop, your context strategy, and your guardrails are actually solid first. That's usually where the real wins are hiding.

If you want to see how this plays out in practice rather than theory, the LangChain agents walkthrough builds a real reasoning loop step by step, and you can find more of what I'm building over on my projects.

References

I use AI tools to help research and draft posts. The ideas, opinions, and takes are mine. Verify anything technical or time-sensitive before acting on it.