Your AI agent has a tool surface. It needs a release gate.

Tools are release artifacts. Evals are not release gates. Once an agent can refund, email, or deploy, the tool surface itself needs a deterministic check before promotion.

The first time a teammate ships an agent that can issue refunds, your release process changes — whether you’ve noticed or not.

Before tool use, an agent is a chat experience. It can be wrong. The cost of being wrong is some tokens and a confused user. Code review and eval scores are appropriate guards.

After tool use, an agent is software that takes action. A bad call costs a refund, a cluster, an inbox, a record. The cost of being wrong jumped categories — and the things that protect you have to jump with it.

Tools are release artifacts

Code review catches code. Evals catch behavior. Observability catches runtime. None of them answers the release question for a tool-using agent:

Given the tool surface declared in this PR, do we have explicit approval policies, scope coverage, idempotency evidence, and review readiness for every action — before promotion?

That question is static. It doesn’t need the model to run. It doesn’t need a synthetic dataset. It needs the manifest, the tool schemas, the policy file, and a finite list of what’s missing.

A tool-using agent is software with a public-ish API: the set of named, schemaed actions it can take. Treat it like one. Diff it on every PR. Gate the merge.
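Diffing that surface is cheap. As an illustrative sketch (the manifest shape here is a hypothetical simplification, not agents-shipgate's actual schema), comparing the declared tool names between two manifest revisions is a few lines:

```python
# Sketch: diff the declared tool surface between two manifest revisions.
# The manifest shape here is hypothetical, for illustration only.

def tool_surface(manifest: dict) -> set[str]:
    """Collect the set of declared tool names from a manifest dict."""
    return {t["name"]
            for src in manifest.get("tool_sources", [])
            for t in src.get("tools", [])}

base = {"tool_sources": [{"id": "stripe", "tools": [{"name": "get_refund"}]}]}
head = {"tool_sources": [{"id": "stripe", "tools": [{"name": "get_refund"},
                                                    {"name": "create_refund"}]}]}

added = tool_surface(head) - tool_surface(base)
removed = tool_surface(base) - tool_surface(head)
print(f"added: {sorted(added)}, removed: {sorted(removed)}")
# → added: ['create_refund'], removed: []
```

Any non-empty `added` set is exactly the diff a reviewer should see in the PR: a new action the agent can take, surfaced before merge rather than discovered in production.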

Evals are not release gates

Evals validate behavior on the inputs you wrote. They tell you whether the model did the right thing under tested conditions. They are good at this.

They don’t tell you:

  • What tools the model could call once a release ships.
  • Whether a newly added delete_user operation has an approval policy.
  • Whether the OAuth scopes attached to a service account are wider than the declared agent purpose.
  • Whether a prompt that says “advise only” still has write tools enabled.
  • Whether a financial action is retried without idempotency evidence.

A passing eval suite from yesterday is silent on today’s added action. That’s not a flaw — it’s the wrong tool for the job. Evals belong in the quality loop, not the release gate.

The release gate slot — the thing that turns red when a PR isn’t ready to ship — needs to be deterministic, fast, and keyed on the tool surface itself, not on the model’s behavior under sample inputs.
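One of those checks, prompt/surface alignment, is simple enough to sketch. The "write tool" heuristic and field names below are assumptions for illustration, not how agents-shipgate classifies tools:

```python
# Sketch: flag an advise-only declared purpose combined with write tools.
# The write-tool heuristic and the inputs are hypothetical illustrations.

WRITE_PREFIXES = ("create_", "update_", "delete_", "issue_")  # assumed heuristic

def prompt_surface_mismatch(purpose: str, tools: list[str]) -> list[str]:
    """Return write tools that contradict an advise-only declared purpose."""
    if "advise only" not in purpose.lower():
        return []
    return [t for t in tools if t.startswith(WRITE_PREFIXES)]

mismatched = prompt_surface_mismatch("advise only, no actions",
                                     ["get_balance", "issue_refund"])
print(mismatched)  # → ['issue_refund'] — a non-empty list is a release blocker
```

Note what this check does not need: a model, a dataset, or a single inference call. It reads two declared facts and compares them, which is why it can run on every PR in milliseconds.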

What release-ready looks like for a tool surface

A useful release gate forces the answer to a small set of questions to live in the manifest, where they show up as a diff a reviewer can read:

| Question | What “yes” looks like in the manifest |
| --- | --- |
| Who reviews this when it fires? | `require_approval_for_tools: [issue_refund]` |
| What scopes does it need? | Declared on the tool or in `permissions.scopes`, not “service account” |
| Can it be retried safely? | Idempotency key in the schema, a documented retry policy, or an explicit “do not retry” |
| Who owns it? | `owner` on the tool, especially for high-risk actions |
| Does the prompt match the surface? | Read-only prompt + write tool = release blocker |

Most agent repos answer some of these in code-review comments and the rest in tribal knowledge. The release gate’s job is to demand a written answer in the artifact you’re shipping.
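The first row of that table, approval coverage, reduces to set arithmetic over the manifest. A minimal sketch, assuming a simplified manifest shape and a hypothetical high-risk classification (a real gate would derive risk from the tool schemas, e.g. any mutating operation):

```python
# Sketch: require an approval policy for every high-risk tool in the manifest.
# HIGH_RISK and the flattened "tools" list are assumptions for illustration;
# the policies.require_approval_for_tools key matches the manifest above.

HIGH_RISK = {"create_refund", "delete_user"}  # hypothetical classification

def missing_approval_policies(manifest: dict, high_risk: set[str]) -> list[str]:
    approved = set(manifest.get("policies", {})
                           .get("require_approval_for_tools", []))
    declared = set(manifest.get("tools", []))
    return sorted((declared & high_risk) - approved)

manifest = {
    "tools": ["get_refund", "create_refund"],
    "policies": {"require_approval_for_tools": []},
}
for tool in missing_approval_policies(manifest, HIGH_RISK):
    print(f"BLOCKER: {tool} lacks a declared approval policy")
```

When the returned list is non-empty, the fix is a one-line diff to the policy block, which is the point: the answer lives in the shipped artifact, not in a review comment.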

Concretely

This is the slot we’ve been building agents-shipgate for. The shape of an agent’s release-readiness check fits in a single YAML file:

```yaml
version: "0.1"
project:
  name: refund-assistant
agent:
  name: refund-assistant
  declared_purpose:
    - issue customer refunds
environment:
  target: production_like
tool_sources:
  - id: stripe
    type: openapi
    path: openapi.yaml
policies:
  require_approval_for_tools: [issue_refund]
```

A scan of that manifest produces a deterministic finding list:

```
## Agents Shipgate

Status: Release blockers detected
Critical: 2 · High: 14 · Medium: 2
Human review: recommended

Top findings:
1. stripe.create_refund lacks a declared approval policy
2. stripe.create_refund lacks idempotency evidence
3. Manifest declares broad permission scopes
```

These are the questions a security or platform reviewer would ask in a release review meeting — asked once in the PR, with evidence and a recommended remediation, before the change reaches production.
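Finding 2 above, "idempotency evidence", is also a static property of the declared schema. A sketch against a simplified OpenAPI-style operation dict (the real OpenAPI structure nests request schemas more deeply, and what counts as evidence here is an assumption):

```python
# Sketch: look for idempotency evidence on a mutating operation.
# Operates on a simplified OpenAPI-style operation dict; the accepted
# key names and the flattened requestBody shape are illustrative assumptions.

def has_idempotency_evidence(operation: dict) -> bool:
    param_names = {p.get("name", "").lower()
                   for p in operation.get("parameters", [])}
    body_props = set(operation.get("requestBody", {}).get("properties", {}))
    candidates = {"idempotency-key", "idempotency_key"}
    return bool(candidates & param_names) or bool(candidates & body_props)

op = {"parameters": [{"name": "Idempotency-Key", "in": "header"}]}
print(has_idempotency_evidence(op))  # → True
```

If no such key exists, the manifest needs either a documented retry policy or an explicit "do not retry", per the table above; absence of all three is what turns into a blocker.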

Where the gate fits in the wider stack

Tests, evals, observability, and a release gate cover different slices:

| Guard | When it runs | What it catches |
| --- | --- | --- |
| Tests | CI on every PR | Code paths in the agent’s code |
| Evals | On a schedule or per release | Model behavior on curated inputs |
| Release gate (this) | CI on every PR | Tool surface, scopes, policies, prompt/surface alignment |
| Observability | Runtime | What actually happened in production |

Each catches something real. Removing any of them is a regression. Conflating the eval pipeline with the release gate produces a slow, expensive, and fundamentally nondeterministic merge check that doesn’t answer the question that matters at promotion time.
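Wired into CI, the gate is an ordinary required status check. A minimal GitHub Actions sketch, reusing the commands from the quickstart below and assuming the scan exits non-zero when blockers are found (that exit-code behavior is an assumption here, not documented fact):

```yaml
# .github/workflows/shipgate.yml (hypothetical CI wiring)
name: agents-shipgate
on: [pull_request]
jobs:
  gate:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - run: pipx install agents-shipgate
      # Assumption: a scan with release blockers exits non-zero,
      # failing this step and therefore blocking the merge.
      - run: agents-shipgate scan -c shipgate.yaml
```

Marking the `gate` job as a required check in branch protection is what turns the finding list from a report into an actual release gate.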

A 60-second first run

If you’re shipping a tool-using agent today and want to see what a release gate would surface, the smallest useful step:

```shell
pipx install agents-shipgate
agents-shipgate init --workspace . --write
agents-shipgate scan -c shipgate.yaml
```

The first scan against any non-trivial agent surfaces things worth fixing — even (especially) on agents you thought were buttoned up. Most teams find 3–5 missing approval policies and a scope they meant to tighten.

Run it on your repo. Tell us what it gets wrong. That’s the most useful feedback we get.
