From CI/CD to agent release readiness

For two decades, every meaningful change in how software ships has worked the same way: a class of release risk got serious enough that engineers moved it into a deterministic, automated check at PR time.

Tests caught regressions. CI made tests run on every change. CD made deployments routine. Linters caught style problems. SAST caught vulnerabilities. SBOMs made supply-chain risk auditable. Type checkers caught contract violations.

Each of those was, at the time, “extra friction nobody asked for.” Each is now table stakes. Teams without them ship more bugs.

The same arc is starting to play out for agent releases.

The shape of the shift

The pattern is consistent across every layer that’s gone through it:

A new failure mode shows up in production with high enough frequency that manual review can’t scale.
The community converges on a static, deterministic check that runs at PR time and answers a specific question.
The check produces a structured artifact (lint output, SAST findings, SBOM, type errors) that fits into the existing review surface.
CI fails on the new finding shape. PRs don’t merge until they’re green.
Five years later, nobody can imagine shipping without it.

CI/CD covered the code artifact: did the code change do what it was supposed to, without breaking what was already there?

What it didn’t cover — what nothing covered, until recently — was release-readiness for agents that take action. The artifact in the diff isn’t just code. It’s a tool surface, a scope grant, a prompt, and a set of policies. The failure modes aren’t “this function returns wrong values” but “this agent can issue refunds nobody approved.”

That’s a new category of release risk. It needs a check.

What’s different about agent releases

For a traditional service deployment, the release artifact is the binary plus its config. A reviewer asks: does the binary do what the spec says, do the config flags match staging, are there any open Sev-1s?

For an agent release, the artifact is broader:

The model. A pinned checkpoint, usually swappable.
The prompt. Instructions that shape behavior.
The tool surface. A list of named, schema-defined actions the agent can take, sourced from MCP servers, OpenAPI specs, SDK code, or framework-specific configs.
The policies. Approval gates, scope grants, prohibited-action lists, retry rules, idempotency requirements.
The runtime. The orchestrator, gateway, observability stack.

Each piece is a release artifact in its own right. Each has its own failure mode if you ship it without review.

The model swap is a quality concern (caught by evals). The runtime is an ops concern (caught by tracing and incident response). The tool surface, scopes, and policies are a release-readiness concern — and that’s the slot that didn’t have a name until recently.

The release-readiness slot

A release-readiness check for an agent does what SAST does for code:

Runs in CI on every PR
Inspects the artifact statically (no model invocation, no traffic)
Produces a structured finding list with severities, evidence, and recommended remediation
Fails the build on net-new findings; existing findings can be baselined
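
The four properties above can be sketched in a few lines. This is a minimal illustration, not any particular tool's implementation: the Finding fields, the fingerprint format, and the gate function are all hypothetical names chosen for this example, and the baseline is assumed to be a set of accepted fingerprints stored alongside the repo.

```python
import json
from dataclasses import dataclass, asdict


@dataclass(frozen=True)
class Finding:
    check_id: str      # hypothetical rule identifier, e.g. "missing-approval-policy"
    severity: str      # "low", "medium", or "high"
    subject: str       # the tool or scope the finding is about
    evidence: str      # what the check saw in the artifact
    remediation: str   # recommended fix

    def fingerprint(self):
        # Stable identity used to match a finding against the baseline.
        return f"{self.check_id}:{self.subject}"


def net_new(findings, baseline_fingerprints):
    """Return findings whose fingerprint is not already accepted in the baseline."""
    return [f for f in findings if f.fingerprint() not in baseline_fingerprints]


def gate(findings, baseline_fingerprints):
    """Return (exit_code, report_json); a non-zero exit code fails the build."""
    fresh = net_new(findings, baseline_fingerprints)
    report = json.dumps([asdict(f) for f in fresh], indent=2)
    return (1 if fresh else 0), report
```

A CI step would pass the exit code to the runner and attach the JSON report to the PR. Nothing here calls a model or sends traffic; the inputs are findings derived from files in the diff.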

The shape is familiar. The questions are new:

SAST asks	Release-readiness asks
Does this code path use a tainted input unsafely?	Does this tool action have a declared approval policy?
Is this dependency vulnerable?	Is this MCP source’s tool surface bounded?
Is this credential hardcoded?	Are these auth scopes no broader than the agent’s declared purpose?
Are there missing input validations?	Are these tool schemas strict enough to bound the model’s actions?

A reviewer reading SAST findings recognizes the format. A reviewer reading release-readiness findings should too.
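
As a rough sketch, three of the release-readiness questions in the table can be written as small static functions over declared data. The dictionary keys, check names, and severities below are assumptions made for illustration. The one external fact relied on is that a JSON Schema object accepts undeclared properties unless additionalProperties is set to false.

```python
def check_approval_policy(tools, approval_policies):
    """Flag tools that have no declared approval policy at all."""
    return [
        {"check": "missing-approval-policy", "severity": "high", "tool": t["name"]}
        for t in tools
        if t["name"] not in approval_policies
    ]


def check_schema_strictness(tools):
    """Flag object schemas that would accept arguments nobody declared."""
    findings = []
    for t in tools:
        schema = t.get("input_schema", {})
        is_object = schema.get("type") == "object"
        # JSON Schema allows extra properties unless this is explicitly False.
        is_open = schema.get("additionalProperties", True) is not False
        if is_object and is_open:
            findings.append(
                {"check": "loose-schema", "severity": "medium", "tool": t["name"]}
            )
    return findings


def check_scopes(granted, allowed_for_purpose):
    """Flag granted scopes that fall outside what the declared purpose allows."""
    excess = sorted(set(granted) - set(allowed_for_purpose))
    return [{"check": "excess-scope", "severity": "high", "scope": s} for s in excess]
```

Each function reads declarations only, so the same inputs always produce the same findings. That determinism is what lets the output sit next to SAST results in a review.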

Why this is a category, not a feature

For a while it looked like this might be a sub-feature of an existing platform — an extension to the eval framework, a check inside the LLM gateway, a setting on the orchestrator.

But the shape doesn’t fit any of those:

Eval frameworks measure behavior on inputs. They don’t read the manifest, don’t enumerate the surface, don’t apply per-tool policies.
LLM gateways enforce policies at runtime. They’re necessary but reactive — by the time the gateway sees the call, the release has already happened.
Orchestrators route calls and manage state. They’re load-bearing for execution, not for the static review of the surface they execute.

Release-readiness is upstream of all of those. It belongs in CI, alongside tests and SAST. It produces an artifact that a governance, risk, and compliance (GRC) reviewer signs off on before the runtime layer ever runs.

That’s a category, not a feature. SAST became a category the same way, despite living “inside the build pipeline”: once the question got specific enough, the category formed around it.

What this means for builders

If you’re shipping agents that take action, the practical implication is small and concrete:

Add a manifest. Even a minimal one.
Pick a tool surface to declare — MCP, OpenAPI, SDK code, framework config.
Run a static check on it before you merge. Today that’s Agents Shipgate. Tomorrow it’ll be a checkbox in the platform you already use.
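
To make "a minimal manifest" concrete, here is one possible shape with a loader that validates it. The layout is hypothetical and is not the format of Agents Shipgate or any other tool: the key names, the source types, and the path specs/billing.yaml are placeholders invented for this sketch.

```python
import json

# Hypothetical manifest: every key and value here is illustrative.
MINIMAL_MANIFEST = """
{
  "agent": "support-refunds",
  "purpose": "Answer billing questions and draft refunds for human approval",
  "tool_sources": [
    {"type": "openapi", "path": "specs/billing.yaml"}
  ],
  "policies": {
    "approval_required": ["issue_refund"],
    "scopes": ["billing:read", "refunds:draft"]
  }
}
"""

REQUIRED_KEYS = ("agent", "purpose", "tool_sources", "policies")
KNOWN_SOURCE_TYPES = {"mcp", "openapi", "sdk", "framework"}


def load_manifest(text):
    """Parse a manifest and return (manifest, problems) without invoking a model."""
    manifest = json.loads(text)
    problems = [f"missing key: {k}" for k in REQUIRED_KEYS if k not in manifest]
    for src in manifest.get("tool_sources", []):
        if src.get("type") not in KNOWN_SOURCE_TYPES:
            problems.append(f"unknown tool source type: {src.get('type')}")
    if not manifest.get("tool_sources"):
        problems.append("no tool surface declared")
    return manifest, problems
```

Calling load_manifest(MINIMAL_MANIFEST) returns an empty problem list, while a manifest that names only the agent reports the missing keys and the undeclared tool surface. Even this much gives a reviewer a single file to read in the diff.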

If you’re an infra lead deciding what to invest in: the release-readiness slot is unfilled on most teams. Leaving it unfilled is what produces the “the agent did something we didn’t approve” incidents that show up in retros six months from now.

If you’re an investor: this is the layer that emerges when “agents take action” becomes routine in production. SAST took five years from “nobody runs it” to “nobody ships without it.” Release-readiness for agents is on a faster curve because the failure modes are more visible (refunds issued, emails sent, infrastructure changed) and the demand for an audit trail is louder.

If you’re a design partner: the question we’re working through is what release-readiness actually means for your agent stack today. The seven dimensions in “What is tool-use readiness?” are the working set; the categories will sharpen as more teams bring in their actual workflows.

CI/CD made code releases safe. Agent releases get the same shift — they just need different artifacts in the gate slot.