AI Agent Execution Governance

Most organizations that deploy AI agents have some form of model governance. They track models in an inventory, run offline evaluations before release, and keep a review board that signs off on production deployments. That program is necessary and it does real work. It also does not answer the only question that matters at 3 a.m. on a Wednesday when an agent commits an action that should not have happened: which policy authorized this, at what version, under what evidence?

Key concepts

Model governance is about whether a model should be deployed. Execution governance is about whether a specific action should be allowed to execute. Those are different questions, evaluated at different times, by different controls, with different evidence. A mature organization runs both. A fragile one runs only the first and discovers the second the hard way.

This post is about the second. It covers the discipline, the org model, the primitives you actually enforce, and the operating pattern that works without turning governance into a committee that stalls delivery. If you want the concept-level category explainer, start with AI execution governance and runtime authorization for AI agents. This is the program view.

Why the model governance program is not enough

Model governance, done well, covers a specific set of lifecycle concerns. Data lineage. Training provenance. Evaluation coverage. Fairness and bias tests. Release gates. Monitoring in production. Deprecation and retirement. Those concerns are real and they have well-understood frameworks - SR 11-7 for financial institutions, the EU AI Act's lifecycle expectations, NIST's AI RMF, and the emerging ISO standards in the 42000 series.

None of those frameworks were designed for the case where a model's output directly triggers an irreversible action in an external system, without human-in-the-loop approval, in milliseconds. That case did not exist at scale when the frameworks were written. It exists now, and it is what AI agents do for a living.

The gap between model governance and the runtime reality shows up in three places: no control authorizes the specific action an agent is about to commit, no evidence is produced for the specific decision, and nothing enforces the decision structurally.

Execution governance fills those gaps. It is the discipline of authorizing specific runtime actions, producing evidence for each decision, and enforcing the decision structurally. It does not replace model governance. It sits downstream of it.

The four primitives of an execution governance program

A program is not a document. It is a set of enforced primitives and an operating rhythm that keeps them enforced. For AI agents, the four primitives that matter are:

1. A registered set of governed surfaces

A surface is a class of action that can commit an external side effect. Payments. Data exports. EHR writes. Infrastructure mutation. Customer communications. Privileged API calls. The registry is the canonical list of surfaces that fall under the governance program. The registry is stored, versioned, and reviewed. Anything not in the registry is either explicitly out of scope or a gap to close.

The registry is the first document to produce when starting a program. It is also the first document to review every quarter, because new surfaces appear constantly as products ship.
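As a sketch of what "stored, versioned, and reviewed" can mean in practice, a registry entry can be a small, typed record checked into version control. The field names and the example surface below are illustrative assumptions, not a standard:

```python
from dataclasses import dataclass

# Illustrative registry entry; field names are assumptions, not a standard.
@dataclass(frozen=True)
class GovernedSurface:
    surface_id: str      # stable identifier, e.g. "payments.refund"
    owner_team: str      # engineering team accountable for its policies
    side_effect: str     # what the action commits externally
    reversible: bool     # drives the review bar and break-glass rules
    policy_paths: tuple  # policy-as-code files covering this surface

REGISTRY = [
    GovernedSurface(
        surface_id="payments.refund",
        owner_team="payments-eng",
        side_effect="moves money to a customer account",
        reversible=False,
        policy_paths=("policies/payments/refund.py",),
    ),
]

def find_surface(surface_id: str) -> GovernedSurface:
    """Anything not in the registry is a gap to close, not an implicit allow."""
    for s in REGISTRY:
        if s.surface_id == surface_id:
            return s
    raise KeyError(f"unregistered surface: {surface_id}")
```

The deliberate choice is that lookup of an unregistered surface raises rather than returning a default: the registry defines scope, so absence is a finding.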

2. A policy-as-code library

Every governed surface has at least one policy. Policies are declarative, version-controlled, and reviewed through the same PR process as any other production code. The policy engine (OPA, Cedar, or a domain-specific evaluator) is an implementation detail; the discipline is that policies are code, not spreadsheets. Writing a policy is a normal engineering task, not a bespoke consultation.

This is where the program either scales or does not. Organizations that treat policy as a deliverable of the governance team bottleneck. Organizations that treat policy as a deliverable of the engineering team that owns the surface, with the governance team as reviewer, scale linearly.
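To make "policies are code" concrete, here is a minimal, engine-agnostic sketch. In practice this would be Rego or Cedar evaluated by the policy engine; the threshold, field names, and version string are illustrative assumptions:

```python
# Engine-agnostic policy sketch. The discipline (declarative, versioned,
# PR-reviewed) is the point; the engine is an implementation detail.
# Threshold and field names below are illustrative assumptions.
POLICY_VERSION = "payments.refund/v7"

def evaluate_refund(action: dict) -> tuple[bool, str]:
    """Return (allowed, reason) for a proposed refund action."""
    if action.get("amount_cents", 0) <= 0:
        return False, "non-positive amount"
    if action["amount_cents"] > 50_000:  # unattended cap; above it, deny
        return False, "amount above unattended limit"
    if action.get("currency") != "USD":
        return False, "unsupported currency"
    return True, "within refund policy"
```

Because the policy is a plain reviewed file in the payments team's repository, a change to the cap is an ordinary pull request with governance as a required reviewer.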

3. A runtime authorization gate

The gate is the control that turns policy into enforcement. Every action submitted by an agent is evaluated against policy; the decision is one of PERMIT, DENY, or SILENCE; and only PERMIT allows dispatch. The gate produces signed receipts. It enforces fail-closed defaults. It treats timeouts as denials. It is the infrastructure that makes the governance program observable.

Without a gate, the program is advisory. With a gate, the program is infrastructure. This is the single largest capability investment the program makes.
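The fail-closed shape of the gate can be sketched in a few lines. This assumes a policy is a callable returning (allowed, reason); a production gate would also sign and store a receipt for every decision, which is omitted here:

```python
import time
from enum import Enum

class Decision(Enum):
    PERMIT = "PERMIT"
    DENY = "DENY"
    SILENCE = "SILENCE"  # no decision produced, e.g. evaluation timed out

def authorize(action: dict, policy, timeout_s: float = 0.05) -> Decision:
    """Fail-closed gate sketch: only an explicit PERMIT allows dispatch."""
    start = time.monotonic()
    try:
        allowed, _reason = policy(action)
    except Exception:
        return Decision.DENY  # engine error: fail closed
    if time.monotonic() - start > timeout_s:
        return Decision.SILENCE  # treated as a denial downstream
    return Decision.PERMIT if allowed else Decision.DENY

def dispatch(action: dict, decision: Decision) -> bool:
    # SILENCE and DENY both block; only PERMIT reaches the external system.
    return decision is Decision.PERMIT
```

The important property is that every non-PERMIT path, including errors and timeouts, resolves to "do not dispatch."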

4. A verification surface

Decisions need to be checkable by parties that do not trust the producing components. The verification surface is a small service, or a small set of libraries, that takes a signed receipt and verifies it against the authority's public key. Second-line risk uses it. Internal audit uses it. External parties use it in regulated industries. The verifier's independence is the cryptographic property that makes the program's evidence defensible.
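A minimal sketch of receipt verification, with one loud caveat: production verification would use asymmetric signatures (for example Ed25519), so the verifier holds only the authority's public key. The symmetric HMAC below is a stand-in to keep the sketch self-contained with the standard library; field names are illustrative:

```python
import hashlib
import hmac
import json

def canonical(receipt: dict) -> bytes:
    # Deterministic serialization so signer and verifier hash the same bytes.
    return json.dumps(receipt, sort_keys=True, separators=(",", ":")).encode()

def sign_receipt(receipt: dict, key: bytes) -> str:
    return hmac.new(key, canonical(receipt), hashlib.sha256).hexdigest()

def verify_receipt(receipt: dict, signature: str, key: bytes) -> bool:
    """Independent check: trusts the key, not the component that decided."""
    expected = sign_receipt(receipt, key)
    return hmac.compare_digest(expected, signature)
```

Any mutation of the receipt after signing, such as flipping a DENY to a PERMIT, fails verification, which is what makes the evidence defensible to a party that does not trust the gate.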

Together, these four primitives are the minimum viable execution governance capability. Anything less is a subset and you should know which subset you are running.

Organizational model

The question teams ask second, after "what do we build," is "who owns this." The answer that works is three-way:

Platform engineering owns the gate and the verifier

The runtime authorization gate, the receipt store, the verification surface, and the policy bundle distribution pipeline are all infrastructure. They belong to the team that owns infrastructure. That team is accountable for availability, latency, and correctness. It is not accountable for the content of policies.

Surface-owning engineering teams own their policies

The team that owns the payments service writes the payment-surface policies. The team that owns the data platform writes the export policies. Policies live close to the engineers who understand the surface. Review is through the same PR process as code changes, with governance as a required reviewer for policy files. This is the arrangement that scales; centralizing policy authorship is the step that kills the program.

Governance owns the registry, the review process, and the evidence standard

The governance function owns the surface registry, owns the review bar for new policies and policy changes, and owns the evidence standard (what a receipt must contain, how long it is retained, how it is verified). Governance does not write policies day to day. Governance reviews, audits, and makes sure the program is operating as designed.

The three-way split works because it aligns ownership with expertise. Infrastructure teams are good at building infrastructure. Surface teams are good at modeling surface-specific risk. Governance teams are good at evidence and consistency. Nobody is being asked to operate outside their strengths.

Operating rhythm

A governance program has a weekly, monthly, quarterly, and annual rhythm.

Weekly

Look at receipt outcome distributions by surface. If a surface's DENY or SILENCE rate has changed materially, that is a signal - either the policy is now wrong or the agent's behavior has drifted. Either is worth investigating within the week. Also review the break-glass log. If break-glass was used, the use should be explained and closed.
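The weekly check is mechanical enough to automate. A sketch, assuming receipts carry a "decision" field and a 5-point threshold is an arbitrary illustrative choice:

```python
from collections import Counter

def outcome_rates(receipts: list) -> dict:
    """Share of each outcome among one surface's receipts for a period."""
    counts = Counter(r["decision"] for r in receipts)
    total = sum(counts.values()) or 1
    return {d: counts.get(d, 0) / total for d in ("PERMIT", "DENY", "SILENCE")}

def drift(this_week: dict, last_week: dict, threshold: float = 0.05) -> bool:
    """Flag a surface whose blocked (DENY + SILENCE) rate moved materially."""
    def blocked(rates: dict) -> float:
        return rates["DENY"] + rates["SILENCE"]
    return abs(blocked(this_week) - blocked(last_week)) > threshold
```

A flagged surface goes on the weekly agenda; the receipts themselves say which policy version produced the denials, which shortens the investigation.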

Monthly

Review policy changes that landed in the last month. Confirm each change had the appropriate reviewers. Replay a sample of receipts against the archived policy version and confirm reproducibility. Walk through any incident receipts with the surface team.
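The replay step can be sketched as recomputing each sampled decision from the archived policy version and comparing. This assumes receipts carry the action and the policy version; real receipts would also carry the signature checked separately:

```python
def replay(receipt: dict, policy_archive: dict) -> bool:
    """Recompute a receipt's decision from the archived policy version.

    Returns True if the archived policy reproduces the recorded decision.
    Field names are illustrative assumptions.
    """
    policy = policy_archive[receipt["policy_version"]]
    allowed, _reason = policy(receipt["action"])
    recomputed = "PERMIT" if allowed else "DENY"
    return recomputed == receipt["decision"]
```

A replay mismatch means either the archive is incomplete or the gate evaluated something other than the recorded policy version; both are findings worth the monthly review's time.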

Quarterly

Review the surface registry for completeness. Ask the product teams what shipped this quarter that produces external side effects. Add anything missing. Review the evidence standard against current regulatory guidance. Update the retention windows if they changed.

Annually

External audit of the evidence pipeline. Replay a year's sample of receipts. Confirm every production decision reconstructs from the archive. This is what the framework documents (SR 11-7, EU AI Act, etc.) eventually ask for; building the rhythm early means the annual review is a formality, not a scramble.

What bad execution governance looks like

It is worth being specific about the failure modes, because they are common and recognizable.

Governance as committee review

Every policy change goes through a committee that meets weekly. Engineering teams learn to avoid the committee. Policies ossify. Shadow paths grow around the committee. Six months later, nobody trusts the committee's output.

Fix. Make governance a review role, not an approval gate. Use the PR process. Require governance as a reviewer, not as a blocker except for a narrow set of high-risk classes.

Governance without enforcement

A policy library exists. A review bar exists. No gate exists. Engineers reference policies when they remember to. The program looks good on paper and provides zero runtime assurance.

Fix. Ship the gate. Policies only carry weight if they are enforced. See decision vs enforcement in AI systems for why this separation matters.

Governance as documentation

A governance site describes what a well-governed AI system looks like. Slides are written. Workshops are run. Runtime behavior is unchanged. This is the most common failure mode in large organizations because it produces artifacts that are easy to present, without any of the hard work of changing runtime behavior.

Fix. Measure the program by what receipts exist in the store, not by what documents exist on the wiki.

Governance that does not survive leadership change

Programs tied to a specific leader's attention are fragile. The moment that leader rotates, the program loses momentum and drifts. Programs tied to infrastructure, receipts, and a review rhythm survive rotations because the evidence is self-justifying.

Fix. Build the program into the infrastructure and the code-review process. Documentation is supplementary.

A minimal pilot plan

If you are starting from zero, keep the first quarter tightly scoped: pick two high-value surfaces, register them, write their policies with the owning teams, stand up the gate and the verifier for those surfaces, and start the weekly review.

By end of quarter, two surfaces are under the program, a verifier exists, and the operating rhythm is in place. By end of the second quarter, five surfaces. By end of the year, most of the registry. The growth is linear and predictable, because the program is infrastructure plus rhythm, not a one-time project.

Frequently asked questions

How does this relate to the EU AI Act or NIST AI RMF?

Execution governance is the runtime layer that both framework families implicitly require but do not specify in detail. Signed receipts and versioned policies produce the kind of traceable, auditable record these frameworks describe. The frameworks tell you what evidence is required; execution governance is the program that produces it.

Is this the same as an AI ethics committee?

No. Ethics committees operate on the model and product level and typically decide release questions. Execution governance operates on the action level and decides runtime questions. Both can coexist. They do not substitute for each other.

Who should own the program in my organization?

If you are a regulated entity, the first-line function that owns AI-driven products should own the operational program, with second-line risk reviewing. If you are not regulated, the infrastructure function that owns platform services should own the gate and the evidence pipeline, and the product teams should own their policies.

Further reading

For the category-level explainer see AI execution governance and runtime authorization for AI agents. For implementation patterns, the architecture and protocol pages walk through the gate, receipt, and verifier shapes. For industry-specific framing see industries.

Next step

Stand up an execution governance program that scales with engineering instead of slowing it.

Request a demo · Review architecture · Read protocol · Documentation