AI agent deployment checklist: 18 checks before production
An 18-item pre-flight for shipping AI agents to staging or production. Covers inventory, schemas, scopes, approvals, side effects, idempotency, and blast radius.
When teams ship their first tool-using AI agent, the same question shows up in retrospectives six months later: “why didn’t we catch this before production?” The answer is almost always that the agent’s tool surface was promoted without a static review.
This is the checklist that catches those failures before promotion. Eighteen items, grouped into the seven dimensions of tool-use readiness plus two CI integration items. Each one is concrete: what to verify, what evidence to require, what the failure mode looks like in production.
The list applies to any tool-using agent — OpenAI Agents SDK, Anthropic Messages API, Google ADK, LangChain/LangGraph, CrewAI, OpenAI Agents API, MCP-connected, or custom — because the seven dimensions are framework-agnostic.
How to use this checklist
Pick an agent you’re about to promote to staging, production-like, or production. For each item below, the release reviewer should be able to point to evidence in the diff — a manifest declaration, a schema field, a policy entry, a comment. If they can’t, you have a release-blocking finding.
Most teams running this checklist for the first time turn up 8–15 findings. That’s normal, and it’s the entire point: until those findings are reviewed, the surface isn’t ready to ship.
You can answer all 18 items automatically with Agents Shipgate, which implements them as deterministic checks against a shipgate.yaml manifest. The point of this post is the questions; Agents Shipgate is one way to answer them in CI.
Inventory: what tools can the agent call? (3 checks)
1. Every tool the agent can call is named in the manifest
What to verify: there is a finite, enumerated list of tools the agent has access to. No “whatever the MCP server returns at runtime.” No implicit tool discovery.
Failure mode: a teammate adds a new tool to the MCP server. The agent picks it up on the next deploy. Nobody reviewed it.
Evidence: tool_sources in the manifest enumerates every source. For
MCP, the export is a snapshot of named tools, not a live connection.
2. No wildcard tool sources
What to verify: no * in tool inventory, no “include all from this
server” patterns.
Failure mode: an MCP server’s minor version adds a destructive tool. Your wildcard pulls it in. Strict-mode CI never noticed because the wildcard was already there.
Check ID: SHIP-INVENTORY-WILDCARD-TOOLS.
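
As a rough illustration of what a wildcard check looks for, here is a minimal Python sketch. It is not how Agents Shipgate implements SHIP-INVENTORY-WILDCARD-TOOLS; the function name and the flat list of tool names are assumptions made for the example.

```python
def find_wildcard_tools(tool_names):
    """Return entries that use glob-style wildcards instead of exact tool names."""
    wildcard_chars = set("*?[")
    return [name for name in tool_names if wildcard_chars & set(name)]


# Hypothetical inventory: two exact names and two wildcard patterns.
declared = ["orders.get", "orders.*", "stripe.create_refund", "*"]
print(find_wildcard_tools(declared))  # ['orders.*', '*']
```

Any non-empty result means the reviewed surface can grow without a diff, which is exactly what this item is meant to rule out.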
3. The manifest matches runtime reality
What to verify: the tools declared in the manifest match what the agent actually has at runtime. The set of names is identical.
Failure mode: someone modifies the runtime config to add an extra tool without updating the manifest. Static reviews show the safe surface; the agent gets the broader one.
Evidence: a CI check that compares the manifest against the runtime config, or a single source of truth (the manifest) that the runtime reads.
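
A CI comparison of this kind can be as small as a set difference. The sketch below assumes you can already load both tool-name lists (from the manifest and from the runtime config); the function and key names are hypothetical.

```python
def inventory_drift(manifest_tools, runtime_tools):
    """Compare declared tool names against what the runtime actually exposes."""
    manifest, runtime = set(manifest_tools), set(runtime_tools)
    return {
        # Tools the agent can call that nobody reviewed.
        "undeclared_at_runtime": sorted(runtime - manifest),
        # Tools that were reviewed but are no longer wired up.
        "declared_but_absent": sorted(manifest - runtime),
    }


drift = inventory_drift(
    ["orders.get", "stripe.create_refund"],
    ["orders.get", "stripe.create_refund", "users.delete"],
)
print(drift["undeclared_at_runtime"])  # ['users.delete']
```

A non-empty undeclared_at_runtime list is the release-blocking case; declared_but_absent is usually just cleanup.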
Schema: what inputs can each tool accept? (2 checks)
4. Every tool has a strict JSON schema
What to verify: every tool’s input schema sets
additionalProperties: false (or the framework equivalent). For
OpenAI Agents SDK, strict: true on the function tool. For Anthropic
Messages API, explicit additionalProperties: false in
input_schema.
Failure mode: the model smuggles extra fields into a tool call. Most of the time that does nothing. The case where it does something is the case you needed to catch.
Check ID: SHIP-API-FUNCTION-SCHEMA-STRICTNESS.
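
To make the strictness requirement concrete, the following Python sketch walks a JSON Schema and reports every object level that still allows extra fields. It is an illustration under simplifying assumptions (only `properties` and array `items` are traversed, and the example schema is invented), not the check's actual implementation.

```python
def non_strict_paths(schema, path="$"):
    """Return locations of object schemas that do not set additionalProperties: false."""
    found = []
    if schema.get("type") == "object":
        if schema.get("additionalProperties") is not False:
            found.append(path)
        for name, sub in schema.get("properties", {}).items():
            found.extend(non_strict_paths(sub, f"{path}.{name}"))
    elif schema.get("type") == "array" and isinstance(schema.get("items"), dict):
        found.extend(non_strict_paths(schema["items"], f"{path}[]"))
    return found


# Hypothetical tool input schema: strict at the root, loose in a nested object.
refund_schema = {
    "type": "object",
    "additionalProperties": False,
    "properties": {
        "order_id": {"type": "string"},
        "metadata": {"type": "object", "properties": {"note": {"type": "string"}}},
    },
}
print(non_strict_paths(refund_schema))  # ['$.metadata']
```

The nested case is the one that tends to slip through review: the top-level schema looks strict while an inner object accepts anything.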
5. Numeric fields are bounded; required fields are enumerated
What to verify: minimum/maximum on numeric fields where the
parameter is bounded in the real world (refund amounts, page sizes,
IDs). required enumerates every parameter that is not safely
optional.
Failure mode: a tool exposes amount: number with no max. The model,
on a confused user input, attempts a refund larger than the order.
The gateway has to catch it because the schema didn’t.
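
A reviewer can spot unbounded numeric fields mechanically. This Python sketch assumes a flat JSON Schema (top-level `properties` only) and uses a made-up refund schema; treat it as an illustration of the idea rather than a complete validator.

```python
def unbounded_numeric_fields(schema):
    """Return numeric properties that lack a lower or an upper bound."""
    missing = []
    for name, prop in schema.get("properties", {}).items():
        if prop.get("type") in ("number", "integer"):
            has_min = "minimum" in prop or "exclusiveMinimum" in prop
            has_max = "maximum" in prop or "exclusiveMaximum" in prop
            if not (has_min and has_max):
                missing.append(name)
    return sorted(missing)


# Hypothetical schema: amount has a floor but no ceiling.
refund_schema = {
    "type": "object",
    "properties": {
        "order_id": {"type": "string"},
        "amount": {"type": "number", "minimum": 0},
        "page_size": {"type": "integer", "minimum": 1, "maximum": 100},
    },
}
print(unbounded_numeric_fields(refund_schema))  # ['amount']
```

Not every numeric field needs both bounds, so a result like this is a prompt for review, not an automatic failure.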
Auth scopes: what permissions does each tool need? (2 checks)
6. Every tool declares its required auth scopes
What to verify: per-tool declarations in permissions.scopes or
equivalent. Not a single shared scope set across the whole agent.
Failure mode: the agent has orders:* because that is what the
service account has. When the prompt drifts to “you can also cancel
subscriptions,” the scope was already permitting it.
Check ID: SHIP-AUTH-MISSING-SCOPE.
permissions:
  scopes:
    - billing:refunds:write
    - customers:read_pii
permissions.scopes is a flat list of strings — the scopes the agent’s
credential carries in aggregate. Per-tool scope narrowing is enforced by
the runtime layer (gateway or token exchange) against the manifest
declaration plus the underlying credential.
7. Declared scopes are narrower than the service account’s actual permissions
What to verify: the scope the manifest declares for each tool is a strict subset of what the underlying credential allows.
Failure mode: the service account has admin:* for dev convenience.
The manifest doesn’t narrow it. The agent has admin permissions
nobody explicitly approved.
Evidence: the manifest pins per-tool scopes; the runtime layer (gateway or token exchange) enforces them. This is one of the places static review and runtime enforcement work together — static catches missing declarations, runtime catches over-grants.
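
One way to see the gap between declared scopes and the credential is to diff them. The sketch below is a simplified illustration: it assumes scopes are colon-separated strings, that a trailing `*` in a credential scope acts as a glob, and that both lists are already available to the check. The function name is hypothetical.

```python
from fnmatch import fnmatchcase


def scope_gaps(declared, credential):
    """Compare manifest-declared scopes with what the credential actually grants."""
    # Declared scopes the credential does not cover: the tool would fail at runtime.
    uncovered = [s for s in declared if not any(fnmatchcase(s, c) for c in credential)]
    # Credential grants that are not exactly a declared scope: candidates for over-grant.
    excess = [c for c in credential if c not in declared]
    return {"uncovered": sorted(uncovered), "excess": sorted(excess)}


gaps = scope_gaps(
    declared=["billing:refunds:write", "customers:read_pii"],
    credential=["billing:*", "customers:read_pii", "admin:*"],
)
print(gaps["excess"])  # ['admin:*', 'billing:*']
```

Here the declared scopes are covered, but the credential carries two broader grants that nobody declared, which is the situation this item asks the reviewer to flag.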
Approval policies: who signs off on destructive actions? (2 checks)
8. Every write or destructive tool requires human approval
What to verify: policies.require_approval_for_tools includes every
tool that writes, deletes, sends, transfers, refunds, or otherwise
changes state externally.
Failure mode: the agent issues a refund nobody approved. The eval suite passed. The auth scopes were correct. Nothing checked the policy because no policy was declared.
Check ID: SHIP-POLICY-APPROVAL-MISSING.
policies:
  require_approval_for_tools:
    - tool: stripe.create_refund
      reason: financial action
    - tool: users.delete
      reason: destructive write to user records
    - tool: infrastructure.deploy
      reason: production environment change
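
The coverage question behind this item (does every state-changing tool appear in the approval list?) can be sketched in a few lines of Python. The tag names come from the risk tags used elsewhere in this post; the data shapes and function name are assumptions for illustration.

```python
# Tags that imply the tool changes state outside the agent.
STATE_CHANGING_TAGS = {
    "write", "destructive", "external_write", "financial_action",
    "customer_communication", "infrastructure_change",
}


def tools_missing_approval(tool_tags, approval_tools):
    """Return state-changing tools that are absent from the approval policy."""
    return sorted(
        name
        for name, tags in tool_tags.items()
        if STATE_CHANGING_TAGS & set(tags) and name not in approval_tools
    )


tool_tags = {
    "orders.get": ["read_only"],
    "stripe.create_refund": ["financial_action"],
    "users.delete": ["destructive"],
}
print(tools_missing_approval(tool_tags, {"stripe.create_refund"}))  # ['users.delete']
```

The check is only as good as the tags it reads, which is why accurate risk tagging is a separate item on this list.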
9. Every customer-touching tool requires confirmation
What to verify: tools that affect a specific named customer (sending an email, modifying their account, refunding their order) require that customer’s confirmation in addition to internal approval.
Failure mode: the action cleared the approval policy with engineering sign-off, but the customer was never asked. The customer didn’t want the refund; they wanted the order rerouted. Now there is a support escalation that shouldn’t have happened.
Check ID: SHIP-POLICY-CONFIRMATION-MISSING.
Side effects: what does each tool change in the world? (2 checks)
10. Every tool has accurate risk tags
What to verify: each tool is tagged with what it does. read_only,
write, destructive, external_write, financial_action,
customer_communication, infrastructure_change, pii_access. Tags
should match the tool’s actual behavior, not the docstring.
Failure mode: a tool is named get_customer_info but actually
updates a “last accessed” timestamp. It is tagged read_only. The
audit log shows writes from the agent that nobody expected.
Evidence: tags in the manifest match what the tool’s source code does. A reviewer should be able to read both and confirm.
11. PII-reading tools are tagged
What to verify: any tool that reads name, email, phone, address,
identifier, payment, or other personal data is tagged with
pii_access or the equivalent.
Failure mode: a customer-support agent reads PII into the context window. The session log includes the PII. The retention policy applies to the wrong logs because the access was never classified.
Idempotency: can writes be retried safely? (2 checks)
12. Every write tool has an idempotency key or is declared safe to retry
What to verify: write tools either accept an idempotency_key
parameter, or the manifest explicitly declares them as safe to retry
because the operation is naturally idempotent (for example, setting a field to a fixed value).
Failure mode: a transient network blip causes the orchestrator to retry. The same refund fires twice. The bank records two transactions. The customer-success team is paging you on Saturday.
Check ID: SHIP-SIDEFX-IDEMPOTENCY-MISSING.
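
To show why the key matters, here is a minimal in-memory Python sketch of a write tool that deduplicates on an idempotency key. The class, field names, and refund ID format are all hypothetical; a real service would persist the keys and expire them.

```python
class RefundService:
    """Toy refund backend that replays the stored result for a repeated key."""

    def __init__(self):
        self._results_by_key = {}
        self.issued = []  # every refund that actually moved money

    def create_refund(self, order_id, amount_cents, idempotency_key):
        # A retry with the same key returns the original result without a second refund.
        if idempotency_key in self._results_by_key:
            return self._results_by_key[idempotency_key]
        self.issued.append((order_id, amount_cents))
        result = {"refund_id": f"re_{len(self.issued)}", "order_id": order_id}
        self._results_by_key[idempotency_key] = result
        return result


service = RefundService()
first = service.create_refund("ord_1", 1200, idempotency_key="req-abc")
retry = service.create_refund("ord_1", 1200, idempotency_key="req-abc")
print(first == retry, len(service.issued))  # True 1
```

The orchestrator has to generate the key once per logical action and reuse it across retries; a fresh key per attempt defeats the mechanism.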
13. Retry policy is documented
What to verify: the manifest or runtime config declares what triggers a retry (transient network errors only? 5xx responses? specific error codes?) and a max retry count.
Failure mode: the default retry policy fires aggressively. The downstream service starts throttling. A simple recovery cascades into an incident.
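
A documented retry policy can be as small as the following Python sketch, which states the triggers, the cap, and the backoff in one place. The specific status codes, limits, and names are illustrative assumptions, not recommended values.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class RetryPolicy:
    """Explicit retry rules: which responses trigger a retry, and how many times."""

    max_retries: int = 3
    retry_on_status: frozenset = frozenset({502, 503, 504})
    base_delay_s: float = 0.5

    def should_retry(self, status, retries_so_far):
        # Retry only on the listed transient statuses, and only below the cap.
        return retries_so_far < self.max_retries and status in self.retry_on_status

    def delay(self, retries_so_far):
        # Exponential backoff: 0.5s, 1s, 2s, ... to avoid hammering a throttled service.
        return self.base_delay_s * (2 ** retries_so_far)


policy = RetryPolicy()
print(policy.should_retry(503, 0))  # True
print(policy.should_retry(400, 0))  # False
print(policy.should_retry(503, 3))  # False
```

Writing the policy down this way gives the reviewer something to point at, which is the evidence this item asks for.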
Blast radius: how bounded is each tool? (3 checks)
14. Every high-risk tool has an owner
What to verify: tools tagged destructive, external_write,
financial_action, or infrastructure_change have an owner field
naming a team or person.
Failure mode: something goes wrong with a refund tool. Nobody knows who to page. The incident drags because the agent’s tool list is “shared infrastructure.”
Check ID: SHIP-MANIFEST-HIGH-RISK-OWNER-MISSING.
risk_overrides:
  tools:
    stripe.create_refund:
      owner: billing-team
      reason: financial action requires a named owner for incident response
    user_data.delete:
      owner: data-platform-team
      reason: destructive write to user records
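
The ownership rule reduces to a simple lookup once tags and owners are loaded. This Python sketch assumes a dictionary of tool metadata with `tags` and `owner` keys, which is an invented shape for illustration and not the manifest's actual schema.

```python
HIGH_RISK_TAGS = {"destructive", "external_write", "financial_action", "infrastructure_change"}


def tools_missing_owner(tools):
    """Return high-risk tools that have no owner recorded."""
    return sorted(
        name
        for name, meta in tools.items()
        if HIGH_RISK_TAGS & set(meta.get("tags", [])) and not meta.get("owner")
    )


tools = {
    "stripe.create_refund": {"tags": ["financial_action"], "owner": "billing-team"},
    "user_data.delete": {"tags": ["destructive"]},
    "orders.get": {"tags": ["read_only"]},
}
print(tools_missing_owner(tools))  # ['user_data.delete']
```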
15. Prohibited actions are enumerated
What to verify: the manifest contains an explicit list of actions the agent must not take (“do not cancel orders older than 30 days without approval,” “do not refund subscriptions, only one-time purchases”).
Failure mode: the prompt says “be helpful” and the tool surface allows anything that’s not explicitly approval-gated. The model is helpful in a way nobody specified.
Evidence: a prohibited_actions block in the manifest, or a
referenced policy doc with the equivalent list.
16. Resource scope is bounded
What to verify: tools that affect specific resources are scoped to the resources they should affect — orders belonging to the calling user, records in the calling tenant, infrastructure tagged with the calling team’s prefix.
Failure mode: the agent is asked about “the order” and looks up an order that belongs to a different customer because the scope wasn’t enforced. It returns information that should never have crossed tenant boundaries.
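
Resource scoping is ultimately enforced inside the tool, not in the prompt. The Python sketch below shows one common pattern, assuming orders carry a `tenant_id` field and the caller's tenant is supplied by the runtime rather than by the model; all names are hypothetical.

```python
def get_order_for_tenant(orders, order_id, caller_tenant):
    """Look up an order, refusing anything outside the caller's tenant."""
    order = orders.get(order_id)
    # Treat "missing" and "belongs to another tenant" identically so the
    # error does not confirm that the other tenant's order exists.
    if order is None or order["tenant_id"] != caller_tenant:
        raise LookupError(f"order {order_id!r} not found for this tenant")
    return order


orders = {
    "ord_1": {"tenant_id": "acme", "total_cents": 1200},
    "ord_2": {"tenant_id": "globex", "total_cents": 900},
}
print(get_order_for_tenant(orders, "ord_1", "acme")["total_cents"])  # 1200
```

The key design choice is that caller_tenant comes from the authenticated session. If the model can set it as a tool argument, the boundary is advisory only.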
CI integration: how does the check land in your pipeline? (2 checks)
17. Advisory mode is enabled before strict mode
What to verify: the agent’s CI pipeline runs the release-readiness check in advisory mode first, surfacing findings as PR evidence without failing the build. Teams move to strict mode only after the backlog of findings has been triaged and baselined.
Failure mode: turning on strict mode without triage means every PR fails until the backlog is empty. Teams disable the check rather than clean it up. The check becomes shelfware.
- uses: ThreeMoonsLab/agents-shipgate@v0.8.0
  with:
    config: shipgate.yaml
    ci_mode: advisory
    pr_comment: "true"
18. A baseline of acceptable findings is committed
What to verify: when strict mode goes live, a committed baseline file lists the findings the team has explicitly accepted as non-blocking. Net-new findings fail the build; baselined findings don’t.
Failure mode: without a baseline, strict mode either blocks everything or accepts everything. The middle ground — “block net-new, accept existing” — is the only one that produces forward progress on a real codebase.
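
The "block net-new, accept existing" rule is a filter over findings. This Python sketch assumes each finding is identified by a check ID and a tool name, which is an invented shape for illustration; the actual baseline file format may differ.

```python
def net_new_findings(current, baseline):
    """Return findings in the current scan that the baseline has not accepted."""
    accepted = {(f["check_id"], f["tool"]) for f in baseline}
    return [f for f in current if (f["check_id"], f["tool"]) not in accepted]


baseline = [{"check_id": "SHIP-AUTH-MISSING-SCOPE", "tool": "orders.get"}]
current = [
    {"check_id": "SHIP-AUTH-MISSING-SCOPE", "tool": "orders.get"},
    {"check_id": "SHIP-POLICY-APPROVAL-MISSING", "tool": "users.delete"},
]
blocking = net_new_findings(current, baseline)
print([f["tool"] for f in blocking])  # ['users.delete']
```

In strict mode the build would fail only when this list is non-empty, so the accepted backlog stops blocking PRs while new gaps still do.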
What this checklist does not cover
The 18 items above are the release-readiness check. They are not:
- An eval suite — that catches model behavior regressions. See why evals are not release gates for the long-form argument.
- A runtime guardrail — that enforces calls at execution time. See Agents Shipgate vs MCP gateways and Agents Shipgate vs LLM gateways.
- Observability — that records what happened in production. See Agents Shipgate vs agent observability.
- A safety certification — no static check certifies an agent as safe.
Each of those tools answers a different question. The 18 checks here are the release question: given the artifact in this PR, do we have evidence the surface is reviewable and bounded?
Automating the 18 checks
You can run all of the above against your repo:
pipx install agents-shipgate
agents-shipgate init --workspace . --write
agents-shipgate scan -c shipgate.yaml
The scan reads your shipgate.yaml manifest plus your declared local tool sources (MCP exports, OpenAPI specs, SDK entrypoints) and produces a Tool-Use Readiness Report with one finding per failed check.
To wire it into GitHub Actions:
- uses: ThreeMoonsLab/agents-shipgate@v0.8.0
  with:
    config: shipgate.yaml
    ci_mode: advisory
    pr_comment: "true"
Start in advisory mode so the team sees findings on every PR without
blocking merges. Once the existing findings are triaged and baselined, switch to
ci_mode: strict to fail builds on net-new findings.
The full check catalog lists every check with severity, evidence shape, and example finding. The 18 items in this post are the conceptual backbone; the catalog has the deterministic implementations. For a worked example of what a real report looks like, see walking a release-readiness report — a real scan of a published Anthropic cookbook agent, walked finding-by-finding.
This checklist is the artifact a release reviewer should ask for before they sign off. Eighteen items, every one of them answerable from the manifest, every one of them a category of incident that has already happened to some team in production.