Editor’s Note: This essay, co-authored with Hans Schabert, is grounded in empirical testing of LLM agent process execution. The series argues that constraint must precede cognition, that enforcement must be structurally separate from the entity being governed, and that authority which cannot be read cannot be governed. This essay demonstrates where those principles meet the most common deployment pattern for LLM agents and find it structurally ungoverned. It stands alone.
An LLM agent given a procedure does not follow it. It interprets it.
There is a difference between handing someone a manual and giving them an order. The manual is a reference. The reader decides what to do with it: which sections apply, in what sequence, with what emphasis. An order is not a reference. It is a directive. The recipient does not interpret. The recipient executes.
Every agent framework that loads a process, a set of instructions, a workflow, a sequence of steps, into a system prompt is handing the model a manual. Not an order. The model reads the full text, parses the intent, selects what it considers the appropriate next action, executes, re-reads, selects again. The model has context. It has the full conversation history. What it does not have is procedural commitment. It does not bind itself to a path. It re-interprets the instructions on every turn, and re-interpretation produces a different path. Not because the model forgot, but because interpretation is not execution.
Interpretation is not the problem in general. It is the problem at the wrong layer. An autonomous system that interprets goals, evaluates context, and reasons about intent is doing exactly what autonomy requires. That is interpretation at the cognitive layer, the layer where reasoning about what to do belongs. But when the same model re-interprets the process it was given, the prescribed sequence of steps, the order of tool calls, the decision to validate before concluding, it is interpreting at the procedural layer. That is interpretation where execution should be.
The series has established this structurally: the supervisory layer decides what to delegate and why. It holds the authority to assign work and define scope. The execution layer carries out the delegated task within those defined constraints. It acts within the boundary, not upon it. The supervisor determines what the task is and which execution agent should perform it. The execution agent performs the work without reinterpreting the assignment.
Interpretation at the supervisory layer is governance. Interpretation at the execution layer is the absence of it. The reinvention problem is not that the model interprets. It is that the architecture does not distinguish which layer the interpretation occurs in. The model is handed the full process and permitted to interpret at every layer simultaneously.
This is not a limitation that will be resolved by larger context windows or better prompting. It is a structural property of how LLM agents consume instructions. The process is not a structure the model traverses. It is a text the model reads. A structure constrains. A text suggests.
Dozens of paths through a single corridor
A linear process. A handful of steps, a handful of tool calls. The instructions prescribe exactly one execution path. Give them to an LLM agent. Run it hundreds of times.
In our testing, a single prescribed process produces dozens of distinct execution paths. The dominant path, the single most common sequence, accounts for barely a quarter of all executions. Three out of four runs follow a path that is not the most common one. No two consecutive executions can be assumed to follow the same sequence.
This is not an edge case. This is baseline behavior. The model is not failing. It is interpreting.
Setting the temperature to zero does not resolve it. Temperature governs token-level sampling: it can make the model’s language deterministic for a given input. It does not make the model’s process deterministic across varying inputs. In production, no two executions receive identical context: the user’s data differs, the conversation history differs, the state of external systems differs. Each variation in input produces a fresh interpretation of the same instructions.
Interpretation produces variation: variation that is invisible to anyone who measures only the final output, and structurally ungovernable by anyone who needs to predict, verify, or diagnose the process that produced it.
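A minimal sketch of how that variation can be made visible in testing. It assumes a hypothetical run_agent harness that executes the agent once and returns the ordered list of tool calls it actually made; nothing here is specific to any framework.

```python
from collections import Counter
from typing import Callable, Sequence

def measure_path_variation(run_agent: Callable[[str], Sequence[str]],
                           task: str, runs: int = 500) -> None:
    """Count how many distinct execution paths one prescribed process produces.

    run_agent is a hypothetical test harness: it runs the agent once on `task`
    and returns the tool calls it actually made, in order, e.g.
    ["lookup_record", "call_validation_api", "return_verdict"].
    """
    paths = Counter(tuple(run_agent(task)) for _ in range(runs))
    dominant_path, dominant_count = paths.most_common(1)[0]
    print(f"distinct paths observed: {len(paths)}")
    print(f"dominant path share:     {dominant_count / runs:.0%}")
    print(f"dominant path:           {' -> '.join(dominant_path)}")
```

Measured this way, a single prescribed sequence fragments into the dozens of distinct traces described above, with the dominant trace covering roughly a quarter of runs.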
The trust prerequisites
Research on human-AI delegation is unambiguous on one point: humans delegate to automated systems only when three conditions are met. They can anticipate the system’s behavior. They can verify its reasoning. They can attribute failures to specific, actionable causes.
Process variation defeats all three.
Predictability. A supervisor approving an agent for operational deployment must be able to answer: what will the agent do when given this task? If the answer is “one of dozens of possible execution paths, and we cannot tell you which one in advance,” the supervisor cannot predict. Predictability requires not only that the agent produces the correct output, but that it arrives there through a consistent, anticipatable process. A correct answer reached by an unpredictable path is not a governed outcome. It is a coincidence that happened to be right.
Verifiability. Compliance requires evidence that the prescribed process was followed. When a model executes a procedure in its entirety within a single cognitive pass, there are no intermediate checkpoints. No step-level documentation. No evidence trail. The model receives an input and produces an output. What happened between them is opaque. An auditor reviewing the execution cannot verify that step four was completed before step five, because the model may not have executed them as discrete steps at all. It may have collapsed them, reordered them, or skipped them. There is no record either way.
Diagnosability. When the output is wrong, the question is not what failed but where. In a multi-step process, the failure could originate at any step: wrong tool selection, incorrect parameter, misinterpreted decision branch, skipped validation. Without step-level traces, the only diagnostic path is forensic reconstruction of the full conversation. For a single failure, this is expensive. For a production system running thousands of executions per day, it is impossible. The failure is visible. The cause is not.
These are not theoretical concerns. They are the operational prerequisites for deploying any agent in a regulated workflow, a compliance-sensitive domain, or any environment where a wrong answer has consequences beyond a retry.
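All three prerequisites reduce to the same missing artifact: a step-level record produced outside the model. A minimal sketch of what such records could look like; the field names are illustrative assumptions of this sketch, not any framework’s schema.

```python
import json
from dataclasses import dataclass, asdict

@dataclass
class StepRecord:
    """One audit record per prescribed step, written by the orchestrator,
    not self-reported by the model. Field names are illustrative."""
    run_id: str
    step: int              # position in the prescribed sequence
    expected_action: str   # what the process prescribes at this step
    actual_action: str     # what was actually invoked
    completed: bool
    timestamp: float

def record_step(log_path: str, record: StepRecord) -> None:
    # Append-only evidence trail: supports prediction audits, verification,
    # and diagnosis without reconstructing the full conversation.
    with open(log_path, "a") as f:
        f.write(json.dumps(asdict(record)) + "\n")

def step_verified(log_path: str, run_id: str, step: int) -> bool:
    # The auditor's question (was step four completed before step five?)
    # becomes a read of the trail rather than a forensic reconstruction.
    with open(log_path) as f:
        for line in f:
            r = json.loads(line)
            if r["run_id"] == run_id and r["step"] == step:
                return r["completed"] and r["actual_action"] == r["expected_action"]
    return False
```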
Right answer, wrong process
The subtlest failure mode is the one that looks like success. The model produces the expected answer. The evaluation passes. But the model did not do what it was instructed to do.
Consider a simple case. The instructions say: validate the user’s input against the reference system, then return the result. Step one: call the validation API with the input. Step two: read the response. Step three: return the verdict.
The model does not call the validation API. It looks at the input, recognizes the pattern from training data, determines that this input is valid, and returns “validated — ok.” The answer happens to be correct. The validation was never performed.
Under any accuracy metric, this is a success. Task completed. Output matches ground truth.
Under any operational standard where the process encodes a regulatory, compliance, or auditability requirement, this is a failure. The validation step exists for a reason. It exists because the input must be checked against the system of record: not because the check is the only way to reach the right answer, but because the check ensures the decision is grounded in current data rather than in the model’s training distribution. The right answer is not the point. The point is that the work was not done. The model decided it already knew. The validation was never performed. The evidence was never produced. If the reference data had changed since the model was trained, the answer would be wrong, and no one would know, because no one checked.
Standard accuracy metrics do not detect this. They measure what the model said. They do not measure how the model got there. Every deployment decision made on accuracy alone is measuring the answer, not the process. The answer looks right. The work was never done.
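The gap can be stated as two checks. A minimal sketch, using the validation example above; the prescribed sequence and tool names are illustrative assumptions.

```python
from typing import Sequence

# The prescribed path for the example above (illustrative names).
PRESCRIBED = ["call_validation_api", "read_response", "return_verdict"]

def answer_correct(output: str, ground_truth: str) -> bool:
    # What standard accuracy evaluation measures: the answer.
    return output.strip() == ground_truth.strip()

def process_followed(executed_calls: Sequence[str]) -> bool:
    # What it does not measure: whether the prescribed steps ran, in order.
    return list(executed_calls) == PRESCRIBED

# The failure mode described above:
#   answer_correct("validated - ok", "validated - ok")  -> True
#   process_followed([])                                 -> False
# The answer passes; the validation API was never called. Only the second
# check can see that, and it requires an executed-call trace to exist at all.
```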
The observation problem
Under prompt-based delivery, the practitioner’s only window into the agent’s process is the agent’s own output. The model receives the instructions, executes (or does not execute) the steps, and produces a final answer. The practitioner sees the answer. The practitioner does not see the process. If the answer is correct, the practitioner assumes the process was followed. If the answer is wrong, the practitioner assumes the process failed, but cannot determine where. The practitioner operates on outcomes and guesses at causes.
The most common response: reasoning. Modern models can reason. They can produce extended chains of thought, show their work, explain which step they considered and why. This is real capability. It is not a substitute for process observability. Reasoning produces a self-reported trace: what the model believes it did. It does not produce a verified trace: what the model actually did relative to the prescribed sequence. The model has no mechanism to detect its own divergence from the prescribed path.
Reasoning shows how the model arrived at an answer. It does not show whether the model followed the prescribed process at the prescribed step. These are different questions. The model can reason flawlessly about why an input is valid, but that reasoning is not anchored to a specific step in the prescribed sequence. The model does not reason “I am now at step three, and step three requires me to call the validation API.” It reasons about the problem. It reasons about the domain. It does not reason about its position in the process. The reasoning is precise about the what. It is imprecise about the where: which step it was executing, whether that step was the one the instructions prescribed at that point, whether the previous steps were completed.
A model that reasons “this input matches the pattern for valid entries, therefore it is valid” has reasoned correctly about the domain. It has not established that it was at the right step, doing the right thing, in the right order.
Reasoning tells you what the model thought. It does not tell you what the model did.
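The distinction is mechanical. The reasoning trace is what the model says it did; the tool log is what the runtime recorded. A minimal sketch of reconciling the two, assuming both are available as plain lists of action names (an assumption: extracting claimed steps from free-form reasoning is itself non-trivial).

```python
from typing import Sequence

def claimed_but_not_executed(claimed_steps: Sequence[str],
                             executed_calls: Sequence[str]) -> list[str]:
    """Return the steps the model narrates but the runtime never recorded.

    claimed_steps  : actions mentioned in the model's reasoning (self-reported)
    executed_calls : tool invocations the runtime actually logged (verified)
    """
    executed = set(executed_calls)
    return [step for step in claimed_steps if step not in executed]

# For the validation example:
# claimed_but_not_executed(["call_validation_api", "return_verdict"],
#                          ["return_verdict"])
# -> ["call_validation_api"]   # narrated in the reasoning, absent from the log
```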
The structural diagnosis
The reinvention problem is not a model problem. It is a category error in how instructions are given.
Instructions embedded in a system prompt are a reference document. The model reads them the way a human reads a manual: selectively, interpretively, with latitude to skip, reorder, and optimize. The architecture does not distinguish between “here is context about the domain” and “here is the exact sequence of actions you must perform.” Both arrive as text. Both are interpreted. The model cannot tell the difference between a suggestion and a directive, because in a system prompt, there is no difference. Everything is text. Nothing is an order.
An order has different properties. It is scoped: do this, now. It is sequential: one action, then the next. It is non-negotiable: the recipient does not choose which parts to follow. The difference between a manual and an order is not tone. It is structure. A manual permits interpretation. An order constrains it.
Tool schemas appear to address this. Agent frameworks that define available tools, enforce structured outputs, and break processes into discrete function calls constrain the model’s action space. They constrain what the model can call. They do not constrain when, whether, or in what order. The model still interprets the process; the constraint is on the action vocabulary, not on the procedural path. A model with access to a validate_input tool can still decide the validation is unnecessary and skip it. A model with a defined sequence of tool calls can still reorder them if the instructions arrive as text rather than as enforced state transitions.
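A tool definition in the common function-calling shape makes the gap concrete. The sketch below uses the widely used JSON-schema function format; the tool itself is hypothetical. The schema fixes what a call to the tool must look like. Nothing in it says when the call must happen, or that it must happen at all.

```python
# A typical function-calling tool definition (hypothetical tool, common shape).
validate_input_tool = {
    "type": "function",
    "function": {
        "name": "validate_input",
        "description": "Validate the user's input against the reference system.",
        "parameters": {
            "type": "object",
            "properties": {
                "value": {
                    "type": "string",
                    "description": "The input to validate.",
                },
            },
            "required": ["value"],
        },
    },
}

# The schema constrains the action vocabulary: if the model calls the tool,
# the call will be well-formed. It does not constrain the procedural path:
# the model may never call it, call it after concluding, or call it out of
# order. The ordering lives in prompt text, which is interpreted, not enforced.
```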
The default architecture for LLM agents does not give orders. It gives manuals. The practitioner wrote the instructions assuming the model would follow them. The model interprets the instructions assuming it has the latitude to optimize. Neither assumption is explicit. Both are structural. The result is dozens of paths where one was prescribed.
The consequence: every execution is a reinvention. Not because the model is unreliable. Because the architecture hands it a manual and expects it to behave as though it received an order. The model is doing exactly what the architecture allows. The problem is not the model’s behavior. It is the architecture’s category error.
The series established that constraint precedes cognition: governance encoded ahead of action is structurally different from governance applied after. The reinvention problem is where that principle meets the most common deployment pattern for LLM agents. The instructions are present. The constraint is absent. Dozens of paths emerge where one was prescribed. The artifacts of process compliance may exist; the model may mention the steps, may narrate its reasoning, may produce output that looks like a followed process. The governance itself has left the building.
What the architecture requires
The reinvention problem names a failure. It does not yet name the resolution in full. But the series has already established the structural primitive that the resolution depends on.
Authority is not a prompt convention. It is not a string of text that says “you MUST.” It is a designed, enforceable boundary: explicit in scope, evaluated before execution, independent of the entity it constrains. The observation problem showed that process which cannot be read cannot be governed. The structural diagnosis showed that constraint which is not enforced ahead of cognition is not constraint at all. These are the same properties the series has formalized as requirements for governed autonomy.
The reinvention problem exists because the default architecture delivers process instructions without any of these properties. The instructions have no enforceable scope: the model decides which parts to follow. They are not evaluated before execution: the model executes and the practitioner reviews after. They are not independent of the entity they constrain: the model interprets its own constraints as part of the same cognitive process that produces the action. Every property that the series has established as necessary for governed autonomy is absent from the standard prompt-based delivery of procedural instructions.
What would instructions-as-constraint look like? Not text in a prompt. State transitions enforced by an external mechanism. Each step delivered as a scoped directive: one action, evaluated before execution, with the next step gated on externally verified completion of the current one. The model reasons within the step. It does not reason about which step to take. The procedural path is not interpreted. It is traversed. The architecture holds the process. The model holds the cognition. They are not the same component.
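A minimal sketch of that separation, with hypothetical step names and a hypothetical call_model helper. The orchestrator holds the sequence and delivers exactly one scoped directive per turn; it advances only when an external check confirms the current step completed.

```python
from dataclasses import dataclass
from typing import Callable

@dataclass
class Step:
    name: str
    directive: str                   # the single scoped instruction for this step
    verify: Callable[[dict], bool]   # external completion check, not self-reported

def run_process(steps: list[Step],
                call_model: Callable[[str, dict], dict],
                max_retries: int = 2) -> dict:
    """The orchestrator traverses the process; the model reasons only inside a step.

    call_model(directive, state) is a hypothetical helper that sends one
    directive (plus accumulated state) to the model and returns the step result.
    """
    state: dict = {}
    for step in steps:
        for _ in range(max_retries + 1):
            result = call_model(step.directive, state)  # model never sees the full process
            if step.verify(result):                     # gate: externally verified completion
                state[step.name] = result
                break
        else:
            raise RuntimeError(f"step '{step.name}' did not verifiably complete")
    return state

# The model cannot reorder, skip, or collapse steps: it is never handed the
# manual, only the current order. The path is traversed, not interpreted.
```

This is a sketch of the structural property, not a framework: the essential move is that the step sequence and the completion gates live in code the model cannot rewrite or reinterpret.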
The question is not whether the model is capable. It is whether the architecture treats instructions as a reference or as a constraint. One produces interpretation. The other produces execution. The default architecture chose interpretation.
It gave a manual. The model chose dozens of paths.


