
What is tool-use readiness?

Tool-use readiness is the static check that an agent's tool surface can ship: inventory, schema, auth, approval, side effects, idempotency, blast radius.

When platform engineers and security reviewers start auditing AI agents, the first thing that breaks is vocabulary.

“Does the agent work?” is a quality question. “Does the agent answer correctly on this dataset?” is an eval question. “Did the tool call return 200?” is a runtime question.

None of those is the question a release reviewer needs to answer at promotion time.

The question is tool-use readiness: given the tool surface declared in this PR, do we have evidence — explicit, written, diff-able — that every action the agent could call is safe to ship?

This post defines that term concretely.

The seven dimensions

Tool-use readiness isn’t one check. It’s a static composite that evaluates the agent’s tool surface across seven dimensions. A release reviewer should be able to answer “yes” to each one for every tool the agent can invoke.

| Dimension | What it asks | Evidence in the manifest |
| --- | --- | --- |
| Inventory | What tools can the agent call? | A complete, named list — no wildcards, no `*`, no “whatever this MCP server returns” |
| Schema | What inputs does each tool accept? | Strict JSON schema — `additionalProperties: false`, complete `required`, bounded numeric fields |
| Auth | What scopes does each tool need? | Declared per-tool or in `permissions.scopes` — narrower than the service account’s actual scopes |
| Approval | Who reviews destructive actions before they fire? | `policies.require_approval_for_tools: [...]` for every write/destructive/financial action |
| Side effects | What does this tool change in the world? | Risk tags on the tool: `write`, `destructive`, `external_write`, `financial_action`, `customer_communication` |
| Idempotency | Can it be retried safely? | Idempotency key in the schema, documented retry policy, or explicit “do not retry” |
| Blast radius | If this tool fires unexpectedly, how bad is it? | Owner declared, prohibited actions enumerated, scope of resources bounded |
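Concretely, the evidence column can live in a single manifest. The sketch below is hypothetical: the field names (`tool_sources`, `permissions.scopes`, `policies.require_approval_for_tools`, risk tags, a declared owner) come from this post, but the exact layout depends on the gate you use.

```yaml
# Hypothetical manifest sketch. Field names follow the dimensions above;
# the precise schema is gate-specific.
owner: support-platform-team            # blast radius: a named owner
tool_sources:
  - name: cancel_order                  # inventory: named, no wildcards
    tags: [write, destructive, customer_communication]  # side effects
    schema:
      type: object
      additionalProperties: false       # schema: strict inputs
      required: [order_id, idempotency_key]
      properties:
        order_id: { type: string }
        idempotency_key: { type: string }  # idempotency: retry-safe
permissions:
  scopes: [orders:cancel]               # auth: narrower than the service account
policies:
  require_approval_for_tools: [cancel_order]  # approval: human token before firing
prohibited_actions:
  - cancel orders past 30 days without approval  # blast radius bound
```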

Each dimension corresponds to a separate failure mode in production. A tool that has a good schema but no approval policy will eventually issue a refund nobody approved. A tool that has an approval policy but a wildcard inventory will eventually call something nobody knew existed.

The release gate’s job is to demand evidence on all seven, not just the ones you remembered.

Walking a single tool through the dimensions

Consider a tool an agent might invoke: cancel_order(order_id: string).

| Dimension | Walked through |
| --- | --- |
| Inventory | Listed by name in `tool_sources` — not behind a wildcard MCP server |
| Schema | `order_id` is required and typed; the schema sets `additionalProperties: false` so the model can’t smuggle in `cascade: true` |
| Auth | Declared scope `orders:cancel` — narrower than the service account’s `orders:*` |
| Approval | Listed in `policies.require_approval_for_tools` — runtime requires a human approval token before firing |
| Side effects | Tagged `destructive`, `customer_communication`, `write` — surfaces in any policy or owner check |
| Idempotency | Schema declares `idempotency_key`; retry policy documented as “single retry on transient network error only” |
| Blast radius | Owner `support-platform-team`; prohibited action: cancel orders past 30 days without approval |
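The schema dimension, rejecting unexpected keys like `cascade: true`, can be sketched without any library. This is a minimal hand-rolled check for illustration, not the validator any particular gate ships with:

```python
# Minimal sketch of strict input validation for cancel_order.
# Mirrors additionalProperties: false plus a required order_id;
# hypothetical, not a real gate's validator.

SCHEMA = {
    "required": {"order_id"},
    "allowed": {"order_id", "idempotency_key"},
}

def validate_call(args: dict) -> list[str]:
    """Return a list of violations; empty means the call passes."""
    violations = []
    for key in SCHEMA["required"] - args.keys():
        violations.append(f"missing required field: {key}")
    for key in args.keys() - SCHEMA["allowed"]:
        # additionalProperties: false, so the model can't smuggle in extras
        violations.append(f"unexpected field: {key}")
    for key, value in args.items():
        if key in SCHEMA["allowed"] and not isinstance(value, str):
            violations.append(f"{key} must be a string")
    return violations

print(validate_call({"order_id": "ord_123"}))                   # → []
print(validate_call({"order_id": "ord_123", "cascade": True}))  # → ['unexpected field: cascade']
```

The point of the strict check is the second call: an extra key is a violation by default, rather than silently ignored.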

This is what “tool-use readiness” looks like for a single action. A release review for a 12-tool agent should produce 12 of these tables — or 12 reasons why the merge isn’t ready.
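The idempotency row is worth one extra step of mechanics: a caller-supplied key lets the runtime retry safely because the second attempt is recognized as a duplicate and replays the first result. A toy sketch, with hypothetical names rather than any specific runtime's API:

```python
# Toy sketch of idempotency-key deduplication plus the documented
# "single retry on transient network error only" policy. Names are
# illustrative, not a real runtime's API.
import uuid

class OrderService:
    """Toy service that deduplicates cancels by idempotency key."""
    def __init__(self):
        self._seen: dict[str, str] = {}  # idempotency_key -> result

    def cancel_order(self, order_id: str, idempotency_key: str) -> str:
        if idempotency_key in self._seen:
            return self._seen[idempotency_key]  # duplicate: replay, don't re-cancel
        result = f"cancelled:{order_id}"
        self._seen[idempotency_key] = result
        return result

def call_with_retry(service: OrderService, order_id: str) -> str:
    """Single retry on transient network error only."""
    key = str(uuid.uuid4())  # one key shared by both attempts
    for attempt in range(2):
        try:
            return service.cancel_order(order_id, idempotency_key=key)
        except ConnectionError:
            if attempt == 1:
                raise  # retry budget exhausted
    raise RuntimeError("unreachable")

svc = OrderService()
print(call_with_retry(svc, "ord_123"))  # → cancelled:ord_123
```

Because both attempts share one key, a retry after a timeout can never cancel the order twice; that property is what the manifest evidence attests to.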

What tool-use readiness is NOT

It’s not “the tool call succeeds.” That’s a runtime concern.

It’s not “the model picks the right tool for the user’s request.” That’s an eval concern.

It’s not “no breaking change to the API contract.” That’s a service-owner concern.

It’s not “the prompt is well-written.” That’s a quality concern.

It’s the static, manifest-level check that the surface the agent gets at runtime has been reviewed and approved. Once that’s true, runtime concerns are runtime concerns — and they’re appropriately scoped because the surface is finite and reviewed.

Why platform engineers and security reviewers care

For a security reviewer: tool-use readiness is the closest analog to the review they already do for service deployments. It produces an artifact (a finding list with severities and evidence) that fits into existing review workflows. The seven dimensions map onto questions they already ask — “what scopes does this need”, “what’s the blast radius if this goes wrong” — but framed for an agent’s tool surface specifically.

For a platform engineer: it’s the gate that prevents a teammate from shipping an agent that calls delete_user without an approval policy because the eval suite happened to pass. It’s also the artifact a release ticket can reference: “the manifest declared 14 tools, 2 had release blockers, both were resolved before merge.”

Where to assess it

Agents Shipgate checks all seven dimensions statically. The shape of the check fits in a single manifest file:

```shell
pipx install agents-shipgate
agents-shipgate init --workspace . --write
agents-shipgate scan -c shipgate.yaml
```

The scan produces a finding list grouped by dimension. Critical findings block strict-mode CI; advisory mode emits the same evidence as a PR comment without failing the build.
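Assuming the scan's findings can be represented as records with a tool, a dimension, and a severity (a hypothetical shape; the post doesn't specify an output format), the strict-versus-advisory distinction reduces to a small gate:

```python
# Hypothetical CI gate over scan findings. The findings shape
# (tool/dimension/severity/message) is an assumption for illustration,
# not agents-shipgate's documented output format.

findings = [
    {"tool": "cancel_order", "dimension": "approval", "severity": "critical",
     "message": "not listed in policies.require_approval_for_tools"},
    {"tool": "lookup_order", "dimension": "schema", "severity": "advisory",
     "message": "numeric field 'limit' is unbounded"},
]

def gate(findings: list[dict], strict: bool = True) -> int:
    """Print every finding; return nonzero if strict mode should block."""
    blockers = [f for f in findings if f["severity"] == "critical"]
    for f in findings:
        print(f"[{f['severity']}] {f['dimension']}/{f['tool']}: {f['message']}")
    return 1 if (strict and blockers) else 0

rc = gate(findings)        # strict mode: nonzero exit blocks the merge
print("exit code:", rc)
rc = gate(findings, strict=False)  # advisory mode: same evidence, build passes
print("exit code:", rc)
```

Advisory mode emits identical evidence; only the exit code changes, which is what lets the same scan back both a blocking CI step and a PR comment.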

For further reading, see the post on why this is the release question, not the eval question.
