What is tool-use readiness?
Tool-use readiness is the static, pre-merge check that an agent's declared tool surface is safe to ship, assessed across seven dimensions: inventory, schema, auth, approval, side effects, idempotency, and blast radius.
When platform engineers and security reviewers start auditing AI agents, the first thing that breaks is vocabulary.
“Does the agent work?” is a quality question. “Does the agent answer correctly on this dataset?” is an eval question. “Did the tool call return 200?” is a runtime question.
None of those is the question a release reviewer needs to answer at promotion time.
The question is tool-use readiness: given the tool surface declared in this PR, do we have evidence — explicit, written, diff-able — that every action the agent could call is safe to ship?
This post defines that term concretely.
The seven dimensions
Tool-use readiness isn’t one check. It’s a static composite that evaluates the agent’s tool surface across seven dimensions. A release reviewer should be able to answer “yes” to each one for every tool the agent can invoke.
| Dimension | What it asks | Evidence in the manifest |
|---|---|---|
| Inventory | What tools can the agent call? | A complete, named list — no wildcards, no *, no “whatever this MCP server returns” |
| Schema | What inputs does each tool accept? | Strict JSON schema — additionalProperties: false, complete required, bounded numeric fields |
| Auth | What scopes does each tool need? | Declared per-tool or in permissions.scopes — narrower than the service account’s actual scopes |
| Approval | Who reviews destructive actions before they fire? | policies.require_approval_for_tools: [...] for every write/destructive/financial action |
| Side effects | What does this tool change in the world? | Risk tags on the tool: write, destructive, external_write, financial_action, customer_communication |
| Idempotency | Can it be retried safely? | Idempotency key in the schema, documented retry policy, or explicit “do not retry” |
| Blast radius | If this tool fires unexpectedly, how bad is it? | Owner declared, prohibited actions enumerated, scope of resources bounded |
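Pulled together, the evidence column above sketches a manifest along these lines. The field names tool_sources, permissions.scopes, and policies.require_approval_for_tools come from the table; the surrounding layout is illustrative, not a canonical schema.

```yaml
# Illustrative manifest sketch. Field names follow the dimensions table;
# the exact file layout is an assumption, not the tool's verbatim schema.
agent: order-support-agent
owner: support-platform-team          # blast radius: a named owner

tool_sources:                         # inventory: named tools, no wildcards
  - name: lookup_order
  - name: cancel_order

permissions:
  scopes:                             # auth: narrower than the service account
    - orders:read
    - orders:cancel

policies:
  require_approval_for_tools:         # approval: destructive actions gated
    - cancel_order
  prohibited_actions:                 # blast radius: enumerated, not implied
    - cancel orders past 30 days without approval
```

Each dimension lands as a concrete, diff-able line in the file, which is what makes the evidence reviewable in a PR.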
Each dimension corresponds to a separate failure mode in production. A tool that has a good schema but no approval policy will eventually issue a refund nobody approved. A tool that has an approval policy but a wildcard inventory will eventually call something nobody knew existed.
The release gate’s job is to demand evidence on all seven, not just the ones you remembered.
Walking a single tool through the dimensions
Consider a tool an agent might invoke: cancel_order(order_id: string).
| Dimension | Walked through |
|---|---|
| Inventory | Listed by name in tool_sources — not behind a wildcard MCP server |
| Schema | order_id is required and typed; the schema sets additionalProperties: false so the model can’t smuggle in cascade: true |
| Auth | Declared scope orders:cancel — narrower than the service account’s orders:* |
| Approval | Listed in policies.require_approval_for_tools — runtime requires a human approval token before firing |
| Side effects | Tagged destructive, customer_communication, write — surfaces in any policy or owner check |
| Idempotency | Schema declares idempotency_key; retry policy documented as “single retry on transient network error only” |
| Blast radius | Owner support-platform-team; prohibited action cancel orders past 30 days without approval |
This is what “tool-use readiness” looks like for a single action. A release review for a 12-tool agent should produce 12 of these tables — or 12 reasons why the merge isn’t ready.
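As a schema sketch, the cancel_order entry above might be declared like this. Only the properties named in the walkthrough (order_id, idempotency_key, additionalProperties: false, the risk tags) are taken from it; the field layout and the order_id pattern are illustrative assumptions.

```yaml
# Illustrative tool entry for cancel_order, expressed as JSON Schema in YAML.
# The order_id pattern is hypothetical; the rest mirrors the walkthrough.
cancel_order:
  tags: [destructive, customer_communication, write]  # side effects, declared
  input_schema:
    type: object
    additionalProperties: false       # the model can't smuggle in cascade: true
    required: [order_id, idempotency_key]
    properties:
      order_id:
        type: string
        pattern: "^ord_[A-Za-z0-9]+$" # hypothetical ID format, bounds the input
      idempotency_key:
        type: string                  # enables the documented single safe retry
  retry_policy: single retry on transient network error only
```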
What tool-use readiness is NOT
It’s not “the tool call succeeds.” That’s a runtime concern.
It’s not “the model picks the right tool for the user’s request.” That’s an eval concern.
It’s not “no breaking change to the API contract.” That’s a service-owner concern.
It’s not “the prompt is well-written.” That’s a quality concern.
It’s the static, manifest-level check that the surface the agent gets at runtime has been reviewed and approved. Once that’s true, runtime concerns are runtime concerns — and they’re appropriately scoped because the surface is finite and reviewed.
Why platform engineers and security reviewers care
For a security reviewer: tool-use readiness is the closest analog to the review they already do for service deployments. It produces an artifact (a finding list with severities and evidence) that fits into existing review workflows. The seven dimensions map onto questions they already ask — “what scopes does this need”, “what’s the blast radius if this goes wrong” — but framed for an agent’s tool surface specifically.
For a platform engineer: it’s the gate that prevents a teammate from shipping an agent that calls delete_user without an approval policy because the eval suite happened to pass. It’s also the artifact a release ticket can reference: “the manifest declared 14 tools, 2 had release blockers, both were resolved before merge.”
Where to assess it
Agents Shipgate checks all seven dimensions statically. The whole declaration lives in a single manifest file, and the workflow is three commands:

```shell
pipx install agents-shipgate
agents-shipgate init --workspace . --write
agents-shipgate scan -c shipgate.yaml
```
The scan produces a finding list grouped by dimension. Critical findings block strict-mode CI; advisory mode emits the same evidence as a PR comment without failing the build.
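Wired into CI, a strict-mode gate can be as small as the following GitHub Actions sketch. The two agents-shipgate commands come from this post; everything else is generic Actions boilerplate, shown as one possible wiring rather than the tool's documented setup.

```yaml
# Sketch of a CI gate: critical findings fail the pull request.
# Assumes a shipgate.yaml at the repository root.
name: agent-release-gate
on: [pull_request]
jobs:
  shipgate:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - run: pipx install agents-shipgate
      - run: agents-shipgate scan -c shipgate.yaml
```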
For further reading, see the post on why this is the release question, not the eval question.