If You Can't Replay the Decision, It's Not Governance
There is a test that many systems claiming to govern autonomous AI action would struggle to pass today, even though no one is currently running it on them. The test is the simplest possible check on whether governance occurred: take a decision the system claims to have made, hand the same inputs back to the system, and ask it to render the same decision again. It is the act of replaying the decision.
I keep returning to this test because it is one of the few that cannot really be argued with. If the second evaluation produces a different answer than the first, the original decision depended on something that is not in the record: runtime state, an inference, a configuration that has since moved. That dependency makes the decision unreconstructable. A decision that cannot be reconstructed, in any honest accounting, was not a governance decision. It was an assertion that happened to have a timestamp on it.
For autonomous AI systems, this stops being a theoretical concern and becomes an operational one. Governance decisions about consequential actions have to be independently reconstructable. If a decision cannot be replayed, no auditor can verify it, and unverifiable governance is indistinguishable from no governance at all.
Verification Has To Be Independent
The word that does most of the work here is “independently.” Governance decisions have to be verifiable by someone other than the system that made them. A third party, whether auditor, regulator, counterparty or future internal team, has to be able to take the evidence record, the policy and the delegation artefacts, and reconstruct the decision themselves, arriving at the same outcome the system did. If that reconstruction is not possible without trusting the originating system, the verification has not happened. It has been replaced by trust, and trust is exactly what an audit is supposed to question, not what an audit is supposed to accept.
The Evaluation Function Has To Stay Pure
The reconstruction requires that the evaluation function be pure in the mathematical sense: its output depends on its declared inputs alone, and it changes nothing outside itself. The inputs are three. The canonical action is the fully specified description of what is being attempted, including an evaluation timestamp anchored to an authenticated time source. The delegation artefacts are the authority under which the action is claimed. The policy identity is the exact rule set in effect at the moment of evaluation.
Anything else entering the function breaks determinism. Ambient state, runtime configuration that varies between deployments, model calls, probabilistic classification, database lookups, network calls: none of those can be inside the evaluation function, because none of them is guaranteed to produce the same result on a different machine or at a different time. Given identical inputs, the function must produce identical outputs, byte for byte, on any machine, at any point in the future. The requirement is not aesthetic. It is the only way the reconstruction the previous section asked for is even possible.
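To make the purity requirement concrete, here is a minimal sketch in Python. Everything in it is an assumption made for illustration: the EvaluationInputs type, the field names (type, amount, evaluated_at, scopes, not_before, expires_at, max_amount) and the two toy rules are hypothetical, not a schema this essay prescribes. What it is meant to show is the shape: the function reads only its three declared inputs, takes the evaluation time from the canonical action instead of from a clock, and serialises its decision canonically so that two runs can be compared byte for byte.

```python
import json
from dataclasses import dataclass


@dataclass(frozen=True)
class EvaluationInputs:
    """Hypothetical container for the three declared inputs."""
    canonical_action: dict      # includes the evaluation timestamp (epoch seconds)
    delegation_artefacts: list  # authority under which the action is claimed
    policy: dict                # the exact rule set named by policy_id
    policy_id: str


def evaluate(inputs: EvaluationInputs) -> bytes:
    """Pure: no clock, no I/O, no randomness, no globals."""
    action = inputs.canonical_action
    at = action["evaluated_at"]  # time is an input, never read from the machine
    in_force = [
        d for d in inputs.delegation_artefacts
        if d["not_before"] <= at < d["expires_at"]
    ]
    granted = {scope for d in in_force for scope in d["scopes"]}

    reasons = []
    if action["type"] not in granted:
        reasons.append("no_delegation_in_force_for_action_type")
    if action["amount"] > inputs.policy["max_amount"]:
        reasons.append("amount_exceeds_policy_limit")

    decision = {
        "outcome": "deny" if reasons else "allow",
        "reasons": reasons,
        "policy_id": inputs.policy_id,
    }
    # Sorted keys and fixed separators give a stable byte representation.
    return json.dumps(decision, sort_keys=True, separators=(",", ":")).encode("utf-8")


inputs = EvaluationInputs(
    canonical_action={"type": "payment", "amount": 250, "evaluated_at": 1_700_000_000},
    delegation_artefacts=[
        {"scopes": ["payment"], "not_before": 1_690_000_000, "expires_at": 1_710_000_000}
    ],
    policy={"max_amount": 1000},
    policy_id="policy-v1",
)
first = evaluate(inputs)
second = evaluate(inputs)
print(first == second)  # True
```

Passing the timestamp in as data is the design choice that matters most here: a call to the system clock inside evaluate would make the same record evaluate differently a year later.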
Two Phases, One Boundary
I want to acknowledge what makes this hard. Real governance evaluations involve I/O. Delegation chains have to be validated. Revocation status has to be checked. The canonical action has to be assembled from the raw request. These are operations that touch the outside world: they read state, make network calls, verify signatures. So how does that reconcile with the purity requirement just laid out? By splitting the work into two phases.
Resolution is the I/O phase. It fetches revocation status, validates delegation chains, assembles the canonical action representation, and determines which policy, at which immutable version, governs the evaluation. Resolution can involve mutable state, network calls and fallible operations. What it produces is an integrity-protected bundle of inputs for the phase that follows.
Evaluation is the pure phase. It takes the resolved inputs and produces a decision, with no further I/O and no state of its own. Given the same bundle of inputs, it produces the same decision every time.
Both phases produce evidence: the resolution phase records what was fetched and how it was validated; the evaluation phase records what was decided and why. Taken together, they form an end-to-end record that a third party can examine. The line between the two phases is the line between I/O and purity, and it is where governance either stays auditable or quietly fails to be.
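The boundary can be sketched in a few lines of Python. This is an illustration under stated assumptions, not a reference design: resolve, evaluate, fetch_revoked, fetch_policy and every field name are hypothetical, the two fetch callables stand in for whatever network or database calls a real resolver would make, and a bare SHA-256 digest stands in for the signature or MAC that real integrity protection would need.

```python
import hashlib
import json


def canonical_bytes(obj) -> bytes:
    """Stable serialisation so the same inputs always hash the same way."""
    return json.dumps(obj, sort_keys=True, separators=(",", ":")).encode("utf-8")


def resolve(raw_request: dict, fetch_revoked, fetch_policy, now: int) -> dict:
    """I/O phase: may touch mutable state; its output is a sealed bundle."""
    delegation = raw_request["delegation"]
    policy_id, policy = fetch_policy(raw_request["action"])  # I/O (stubbed)
    inputs = {
        "canonical_action": {
            "type": raw_request["action"],
            "amount": raw_request["amount"],
            "evaluated_at": now,  # assumed to come from an authenticated time source
        },
        "delegation": {
            "id": delegation["id"],
            "scopes": sorted(delegation["scopes"]),
            "revoked": fetch_revoked(delegation["id"]),  # I/O (stubbed)
        },
        "policy_id": policy_id,
        "policy": policy,
    }
    return {"inputs": inputs, "digest": hashlib.sha256(canonical_bytes(inputs)).hexdigest()}


def evaluate(bundle: dict) -> dict:
    """Pure phase: reads the bundle and nothing else."""
    inputs = bundle["inputs"]
    if hashlib.sha256(canonical_bytes(inputs)).hexdigest() != bundle["digest"]:
        raise ValueError("bundle does not match its digest")
    action, delegation = inputs["canonical_action"], inputs["delegation"]
    reasons = []
    if delegation["revoked"]:
        reasons.append("delegation_revoked")
    if action["type"] not in delegation["scopes"]:
        reasons.append("action_outside_delegated_scope")
    if action["amount"] > inputs["policy"]["max_amount"]:
        reasons.append("amount_exceeds_policy_limit")
    return {
        "outcome": "deny" if reasons else "allow",
        "reasons": reasons,
        "policy_id": inputs["policy_id"],
    }


bundle = resolve(
    {"action": "payment", "amount": 250, "delegation": {"id": "d-1", "scopes": ["payment"]}},
    fetch_revoked=lambda delegation_id: False,
    fetch_policy=lambda action_type: ("policy-v1", {"max_amount": 1000}),
    now=1_700_000_000,
)
decision = evaluate(bundle)
print(decision["outcome"])  # allow
```

In this sketch the revocation answer is frozen into the bundle at resolution time. The evaluation phase never asks whether the delegation is revoked now, only what resolution recorded, which is what lets the same bundle produce the same decision later.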
Determinism Makes The Decision Portable
Once evaluation is deterministic, independent verification becomes possible: any party with access to the evidence record can reconstruct the decision without needing the originating system to be available. The decision is proven by the evidence, not vouched for by the system that produced it. Third-party audit becomes possible for the same reason. A regulator, an auditor or a counterparty can take the recorded inputs, apply the recorded policy and verify the outcome on their own terms. The governance claim becomes falsifiable in a way that most governance claims, however well-intentioned, currently are not.
Two further consequences fall out of that. Decision reconstruction becomes possible for any action at any point in the future, because each decision is recorded with the policy identity under which it was evaluated, and policy changes do not retroactively alter previous decisions. And temporal integrity becomes possible, because historical decisions stay interpretable under their original policy without silent upgrades or retroactive reinterpretation. The evidence record is, in a real sense, immutable testimony of what was decided, when, and under what rules. A receipt becomes more than a log entry when the decision can be replayed without asking the original system to vouch for itself.
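The essay does not say how a policy identity should be derived, so the following is only one possible reading, sketched in Python: identify each policy version by the SHA-256 hash of its canonical bytes and keep the store append-only. PolicyStore, publish and get are hypothetical names. Under that assumption, a recorded policy identity can only ever refer to the rules that were in effect when the decision was made, and publishing a newer policy leaves it untouched.

```python
import hashlib
import json


class PolicyStore:
    """Append-only store keyed by the hash of each policy's canonical bytes."""

    def __init__(self):
        self._versions = {}

    def publish(self, policy: dict) -> str:
        blob = json.dumps(policy, sort_keys=True, separators=(",", ":")).encode("utf-8")
        policy_id = hashlib.sha256(blob).hexdigest()
        # setdefault never overwrites: an existing version cannot be altered.
        self._versions.setdefault(policy_id, blob)
        return policy_id

    def get(self, policy_id: str) -> dict:
        return json.loads(self._versions[policy_id])


store = PolicyStore()
v1 = store.publish({"max_amount": 1000})
recorded = {"outcome": "allow", "policy_id": v1}  # decision made under v1

v2 = store.publish({"max_amount": 100})           # policy tightened later

# The old decision still resolves to the rules it was made under.
print(store.get(recorded["policy_id"]))  # {'max_amount': 1000}
```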
The Replay Test Is Simple
The test is simple, and it works on any governance system that claims determinism. Take an evidence record. Extract the three inputs from it (the canonical action, the delegation artefacts and the policy identity), feed them back through the evaluation function on a different machine, and compare the output to the recorded decision. If they match, every time, on any machine, at any point in the future, the decision can be verified.
If they do not, the system may have made a decision, but it cannot prove the decision it made. An audit, when one finally happens, will look for exactly this failure.
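As a rough illustration, the replay test fits in a few lines of Python. The record layout, the policies mapping from policy identity to rule set, and the toy evaluate function are all assumptions; in a real check, evaluate would have to be the same published function the originating system used, run by the verifier on their own machine.

```python
import json


def evaluate(action: dict, delegations: list, policy: dict) -> str:
    """Stand-in for the system's pure evaluation function."""
    granted = {scope for d in delegations for scope in d["scopes"]}
    allowed = action["type"] in granted and action["amount"] <= policy["max_amount"]
    decision = {"outcome": "allow" if allowed else "deny"}
    return json.dumps(decision, sort_keys=True, separators=(",", ":"))


def replay(record: dict, policies: dict) -> bool:
    """Re-run the recorded inputs and compare against the recorded decision."""
    inputs = record["inputs"]
    policy = policies[inputs["policy_id"]]  # look up the exact recorded version
    recomputed = evaluate(
        inputs["canonical_action"], inputs["delegation_artefacts"], policy
    )
    return recomputed == record["decision"]


policies = {"policy-v1": {"max_amount": 1000}}
record = {
    "inputs": {
        "canonical_action": {"type": "payment", "amount": 250},
        "delegation_artefacts": [{"scopes": ["payment"]}],
        "policy_id": "policy-v1",
    },
    "decision": '{"outcome":"allow"}',
}
print(replay(record, policies))  # True
```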