Prompt Injection Is a Governance Failure
Prompt injection is the wrong place to spend the whole defence budget, and at the wrong layer. Over the past three years a dominant approach has been to improve the model: better training, input sanitisation, instruction hierarchies, output classifiers, prompt shielding. I do not want to dismiss that work, and the engineering behind each of these defences is serious. The point worth pressing is that none of them can be sufficient on its own, and that the place to actually contain a successful prompt injection is not at the model layer at all.
Take any of the now-familiar shapes the attack takes. Adversarial text in a retrieved document, in a user-supplied input, or in the content of a tool response tells a model to ignore its previous instructions and do something it was not supposed to do — exfiltrate data, escalate privileges, modify state outside its scope. The model, processing the text as text, is influenced by it, and the next tool call the model produces reflects the adversarial instruction. The question that matters next is whether that tool call gets executed.
In a system without action-boundary governance, the answer is yes. The model produces a tool call and the runtime executes it. The injection succeeded at the model layer and found no second check; the attack terminates at consequence. In a system with action-boundary governance (where each tool call is evaluated against an explicit delegation before it runs) the answer is conditional. If the resulting action falls outside the delegation’s scope, the evaluation denies the action, and the attack terminates as a logged denial rather than as a breach.
The difference between the two outcomes is whether a second check exists.
The Model Layer Is The Wrong Place To Solve This
I do not want to be dismissive about model-layer work. The security research community has produced excellent work on prompt-injection mechanics, including how adversarial text in system prompts, user inputs and retrieved documents can override a model’s intended behaviour. Defences at the model layer have grown more sophisticated over time and the work continues to be valuable. What I find consistently overlooked, though, is that none of these defences reduces the failure probability to zero, and none of them can in principle. The model processes natural language, and any system that processes natural language can be influenced by natural language. Better training may reduce the surface area but it cannot eliminate a property of systems that interpret unstructured input.
The model-layer defence is trying to answer one question: can the model be prevented from being influenced by adversarial input? For any general-purpose language model, the answer is no, not with certainty. The governance question is structurally different: when the model is influenced by adversarial input, does the resulting action execute?
Prompt Injection Crosses Two Layers
Content safety and action authority are doing different work at different points along the path from input to consequence. Content safety asks whether the model said something it should not have, operating at the model boundary on the text, the reasoning, the response. Action authority asks whether the system did something it was not authorised to do, operating at the action boundary on the tool invocation, the API call, the state change. Prompt injection crosses the first boundary: it compromises the model’s reasoning, and the model produces an action that its operator never intended. What happens next is the question that decides whether anything bad actually occurs. At the action boundary the question is not whether the model’s reasoning was compromised, but whether this specific action, by this specific actor, under this specific delegation, at this specific time, is authorised.
If the action is outside the delegation’s scope, it is denied. If no enforcement boundary exists, it executes.
Why Filtering Is Insufficient
Input filters inspect what goes into the model and output classifiers inspect what comes out, but both of them operate on text and neither of them operates on actions. A more targeted variant uses a classifier to detect injection and block the resulting action, which reaches closer to the action boundary but is still driven by content analysis. If the detector misses the injection, the action executes unexamined; if it fires on a legitimate input, a valid action is blocked. The defence remains only as reliable as the classifier, and the classifier is engaged in the same adversarial arms race that makes model-boundary certainty unachievable in the first place.
The hardest case for filter-based defence is an injected instruction that produces a well-formed, innocuous-looking tool call. The input may not contain obvious adversarial markers, the output may be a syntactically valid function call with plausible parameters, and a content filter sees a clean request and a clean response. The tool layer sees a valid invocation and executes it. The filter evaluated the content; nobody evaluated the action. The category error is using a model-boundary tool to solve an action-boundary problem.
Governance Contains The Resulting Action
Governance does not prevent prompt injection. What it does is contain the blast radius. At the action boundary, every tool call is evaluated against the agent’s delegation before execution. The evaluation does not inspect the model’s reasoning or determine whether the model was compromised; it does not need to. It asks only whether the resulting action is within the delegation’s scope. If the agent is delegated to read claims and write summaries, then a data-export to an external endpoint is outside scope, regardless of why the model produced it.
An agent acting within its delegation can still produce incorrect results; governance constrains which actions execute, not the correctness of authorised behaviour. But a compromised model cannot escalate beyond its delegation. The structural advantage of action-boundary governance is that it sidesteps the model-boundary fight entirely. Model-boundary defence has to distinguish legitimate from adversarial reasoning in open-ended language, a problem no production system can reduce to a deterministic certainty. Action-boundary evaluation asks a narrower question: whether the resulting action is authorised under an explicit contract.
The injection corrupts the model’s intent. Governance constrains the system’s actions. The attack succeeds at the first layer, and it fails at the second, if the second layer is there to fail at.
The Enforcement Gap Lets The Attack Land
Many autonomous AI systems I have looked at do not have this second layer. The model produces a tool call and the tool layer executes it. The entire defence relies on preventing the model from being compromised in the first place, and when that defence fails — and for any system processing untrusted input it will eventually fail — nothing sits between the compromised model and the consequence. Prompt injection is correctly discussed as an AI safety problem, but the reason it produces real-world consequences is not only that models are vulnerable. It is that actions produced by compromised models are allowed to execute without governance evaluation.
The vulnerability is in the model. The failure is in the architecture.