
What is tool-use readiness?

Tool-use readiness is the static check that an agent's tool surface can ship: inventory, schema, auth, approval, side effects, idempotency, blast radius.

When platform engineers and security reviewers start auditing AI agents, the first thing that breaks is vocabulary.

“Does the agent work?” is a quality question. “Does the agent answer correctly on this dataset?” is an eval question. “Did the tool call return 200?” is a runtime question.

None of those is the question a release reviewer needs to answer at promotion time.

The question is tool-use readiness: given the tool surface declared in this PR, do we have evidence — explicit, written, diff-able — that every action the agent could call is safe to ship?

This post defines that term concretely.

The seven dimensions

Tool-use readiness isn’t one check. It’s a static composite that evaluates the agent’s tool surface across seven dimensions. A release reviewer should be able to answer “yes” to each one for every tool the agent can invoke.

| Dimension | What it asks | Evidence in the manifest |
| --- | --- | --- |
| Inventory | What tools can the agent call? | A complete, named list — no wildcards, no `*`, no “whatever this MCP server returns” |
| Schema | What inputs does each tool accept? | Strict JSON schema — `additionalProperties: false`, complete `required`, bounded numeric fields |
| Auth | What scopes does each tool need? | Declared per-tool or in `permissions.scopes` — narrower than the service account’s actual scopes |
| Approval | Who reviews destructive actions before they fire? | `policies.require_approval_for_tools: [...]` for every write/destructive/financial action |
| Side effects | What does this tool change in the world? | Risk tags on the tool: `write`, `destructive`, `external_write`, `financial_action`, `customer_communication` |
| Idempotency | Can it be retried safely? | Idempotency key in the schema, documented retry policy, or explicit “do not retry” |
| Blast radius | If this tool fires unexpectedly, how bad is it? | Owner declared, prohibited actions enumerated, scope of resources bounded |
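Concretely, the evidence column can live in a single manifest. The sketch below is hypothetical: the field names (`tool_sources`, `permissions.scopes`, `policies.require_approval_for_tools`, risk tags, a declared owner) come from this post, but the exact layout depends on the gate you use.

```yaml
# Hypothetical manifest sketch. Field names follow the dimensions above;
# the precise schema is gate-specific.
owner: support-platform-team            # blast radius: a named owner
tool_sources:
  - name: cancel_order                  # inventory: named, no wildcards
    tags: [write, destructive, customer_communication]  # side effects
    schema:
      type: object
      additionalProperties: false       # schema: strict inputs
      required: [order_id, idempotency_key]
      properties:
        order_id: { type: string }
        idempotency_key: { type: string }  # idempotency: retry-safe
permissions:
  scopes: [orders:cancel]               # auth: narrower than the service account
policies:
  require_approval_for_tools: [cancel_order]  # approval: human token before firing
prohibited_actions:
  - cancel orders past 30 days without approval  # blast radius bound
```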

Each dimension corresponds to a separate failure mode in production. A tool that has a good schema but no approval policy will eventually issue a refund nobody approved. A tool that has an approval policy but a wildcard inventory will eventually call something nobody knew existed.

The release gate’s job is to demand evidence on all seven, not just the ones you remembered.

Walking a single tool through the dimensions

Consider a tool an agent might invoke: cancel_order(order_id: string).

| Dimension | Walked through |
| --- | --- |
| Inventory | Listed by name in `tool_sources` — not behind a wildcard MCP server |
| Schema | `order_id` is required and typed; the schema sets `additionalProperties: false` so the model can’t smuggle in `cascade: true` |
| Auth | Declared scope `orders:cancel` — narrower than the service account’s `orders:*` |
| Approval | Listed in `policies.require_approval_for_tools` — runtime requires a human approval token before firing |
| Side effects | Tagged `destructive`, `customer_communication`, `write` — surfaces in any policy or owner check |
| Idempotency | Schema declares `idempotency_key`; retry policy documented as “single retry on transient network error only” |
| Blast radius | Owner `support-platform-team`; prohibited action: cancel orders past 30 days without approval |
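The schema dimension, rejecting unexpected keys like `cascade: true`, can be sketched without any library. This is a minimal hand-rolled check for illustration, not the validator any particular gate ships with:

```python
# Minimal sketch of strict input validation for cancel_order.
# Mirrors additionalProperties: false plus a required order_id;
# hypothetical, not a real gate's validator.

SCHEMA = {
    "required": {"order_id"},
    "allowed": {"order_id", "idempotency_key"},
}

def validate_call(args: dict) -> list[str]:
    """Return a list of violations; empty means the call passes."""
    violations = []
    for key in SCHEMA["required"] - args.keys():
        violations.append(f"missing required field: {key}")
    for key in args.keys() - SCHEMA["allowed"]:
        # additionalProperties: false, so the model can't smuggle in extras
        violations.append(f"unexpected field: {key}")
    for key, value in args.items():
        if key in SCHEMA["allowed"] and not isinstance(value, str):
            violations.append(f"{key} must be a string")
    return violations

print(validate_call({"order_id": "ord_123"}))                   # → []
print(validate_call({"order_id": "ord_123", "cascade": True}))  # → ['unexpected field: cascade']
```

The point of the strict check is the second call: an extra key is a violation by default, rather than silently ignored.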

This is what “tool-use readiness” looks like for a single action. A release review for a 12-tool agent should produce 12 of these tables — or 12 reasons why the merge isn’t ready.
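The idempotency row is worth one extra step of mechanics: a caller-supplied key lets the runtime retry safely because the second attempt is recognized as a duplicate and replays the first result. A toy sketch, with hypothetical names rather than any specific runtime's API:

```python
# Toy sketch of idempotency-key deduplication plus the documented
# "single retry on transient network error only" policy. Names are
# illustrative, not a real runtime's API.
import uuid

class OrderService:
    """Toy service that deduplicates cancels by idempotency key."""
    def __init__(self):
        self._seen: dict[str, str] = {}  # idempotency_key -> result

    def cancel_order(self, order_id: str, idempotency_key: str) -> str:
        if idempotency_key in self._seen:
            return self._seen[idempotency_key]  # duplicate: replay, don't re-cancel
        result = f"cancelled:{order_id}"
        self._seen[idempotency_key] = result
        return result

def call_with_retry(service: OrderService, order_id: str) -> str:
    """Single retry on transient network error only."""
    key = str(uuid.uuid4())  # one key shared by both attempts
    for attempt in range(2):
        try:
            return service.cancel_order(order_id, idempotency_key=key)
        except ConnectionError:
            if attempt == 1:
                raise  # retry budget exhausted
    raise RuntimeError("unreachable")

svc = OrderService()
print(call_with_retry(svc, "ord_123"))  # → cancelled:ord_123
```

Because both attempts share one key, a retry after a timeout can never cancel the order twice; that property is what the manifest evidence attests to.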

What tool-use readiness is NOT

It’s not “the tool call succeeds.” That’s a runtime concern.

It’s not “the model picks the right tool for the user’s request.” That’s an eval concern.

It’s not “no breaking change to the API contract.” That’s a service-owner concern.

It’s not “the prompt is well-written.” That’s a quality concern.

It’s the static, manifest-level check that the surface the agent gets at runtime has been reviewed and approved. Once that’s true, runtime concerns are runtime concerns — and they’re appropriately scoped because the surface is finite and reviewed.

Why platform engineers and security reviewers care

For a security reviewer: tool-use readiness is the closest analog to the review they already do for service deployments. It produces an artifact (a finding list with severities and evidence) that fits into existing review workflows. The seven dimensions map onto questions they already ask — “what scopes does this need”, “what’s the blast radius if this goes wrong” — but framed for an agent’s tool surface specifically.

For a platform engineer: it’s the gate that prevents a teammate from shipping an agent that calls delete_user without an approval policy because the eval suite happened to pass. It’s also the artifact a release ticket can reference: “the manifest declared 14 tools, 2 had release blockers, both were resolved before merge.”

Where to assess it

Agents Shipgate checks all seven dimensions statically. The shape of the check fits in a single manifest file:

```shell
pipx install agents-shipgate
agents-shipgate init --workspace . --write
agents-shipgate scan -c shipgate.yaml
```

The scan produces a finding list grouped by dimension. Critical findings block strict-mode CI; advisory mode emits the same evidence as a PR comment without failing the build.
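Assuming the scan's findings can be represented as records with a tool, a dimension, and a severity (a hypothetical shape; the post doesn't specify an output format), the strict-versus-advisory distinction reduces to a small gate:

```python
# Hypothetical CI gate over scan findings. The findings shape
# (tool/dimension/severity/message) is an assumption for illustration,
# not agents-shipgate's documented output format.

findings = [
    {"tool": "cancel_order", "dimension": "approval", "severity": "critical",
     "message": "not listed in policies.require_approval_for_tools"},
    {"tool": "lookup_order", "dimension": "schema", "severity": "advisory",
     "message": "numeric field 'limit' is unbounded"},
]

def gate(findings: list[dict], strict: bool = True) -> int:
    """Print every finding; return nonzero if strict mode should block."""
    blockers = [f for f in findings if f["severity"] == "critical"]
    for f in findings:
        print(f"[{f['severity']}] {f['dimension']}/{f['tool']}: {f['message']}")
    return 1 if (strict and blockers) else 0

rc = gate(findings)        # strict mode: nonzero exit blocks the merge
print("exit code:", rc)
rc = gate(findings, strict=False)  # advisory mode: same evidence, build passes
print("exit code:", rc)
```

Advisory mode emits identical evidence; only the exit code changes, which is what lets the same scan back both a blocking CI step and a PR comment.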

For further reading, see the post on why this is the release question, not the eval question.
