
Why evals are not release gates

Evals validate behavior on inputs you wrote. They don't answer the release question for a tool-using agent. Here's what evals and release gates are each for, and why conflating them ships bugs.

Most LLM app teams have an eval suite. It’s the right thing to have. It’s also, increasingly, the thing teams reach for when someone asks “is this agent ready to ship?”

The reach is wrong. Evals belong in the development loop. They’re not the release gate for a tool-using agent.

This post is the long-form version of that argument — what evals are good at, what they can’t do, and why conflating the two ships preventable bugs.

What evals are good at

Evals validate model behavior on the inputs you wrote. Specifically:

  • Did the model pick the right tool for this user request?
  • Did the model produce a response that satisfies a rubric (factually accurate, on-tone, grounded in retrieved context)?
  • Did the model refuse to do something it shouldn’t?
  • Has accuracy regressed on a stable benchmark since the last model swap?

Evals are how you choose between two prompt versions. They’re how you catch a regression after upgrading from claude-sonnet-4-5 to a new checkpoint. They’re how you measure whether a RAG change actually helped.
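
For concreteness, here is a minimal sketch of the kind of case such a suite scores. The dataclass, field names, and example inputs are illustrative, not any particular eval framework's API:

```python
from dataclasses import dataclass

@dataclass
class EvalCase:
    user_input: str      # the input you wrote
    expected_tool: str   # which tool the model should pick for it
    rubric: str          # what a passing response must satisfy

def score_tool_choice(case: EvalCase, chosen_tool: str) -> bool:
    """Behavioral check: did the model pick the right tool for this input?"""
    return chosen_tool == case.expected_tool

cases = [
    EvalCase("Where is order #1234?", expected_tool="get_order_status",
             rubric="Answer is grounded in the retrieved order record."),
    EvalCase("Delete all my data right now", expected_tool="escalate_to_human",
             rubric="Model declines to act directly and routes to a human."),
]
```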

They are an instrument. A good one. Don’t get rid of yours.

What evals can’t catch

Here’s what an eval suite, no matter how comprehensive, doesn’t tell you about a release:

What tools the model could call. The eval runs the model on your inputs and observes its choices. It doesn’t enumerate the tool surface the model has access to. A passing eval suite is silent on the new delete_user operation a teammate added to the MCP server yesterday.
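
Catching that takes a structural check, not a behavioral one. A hedged sketch, assuming a JSON manifest with a tools array; the file name and shape are illustrative, not a real MCP schema:

```python
import json

def tool_surface(manifest_path: str) -> set:
    """Enumerate every tool the agent could call, straight from the artifact."""
    with open(manifest_path) as f:
        manifest = json.load(f)
    return {tool["name"] for tool in manifest.get("tools", [])}

# A reviewed allowlist lives in the repo; anything outside it blocks the PR.
reviewed = {"get_order_status", "lookup_customer"}
current = tool_surface("agent_manifest.json")

unreviewed = current - reviewed
if unreviewed:
    # e.g. the delete_user operation a teammate added yesterday
    raise SystemExit(f"Unreviewed tools in the surface: {sorted(unreviewed)}")
```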

Whether the auth scopes match the declared purpose. The eval doesn’t read the manifest. The agent’s prompt says “advise only,” the service account has orders:write, and the model never tried to write in the eval set, so the suite passes. The contradiction is in the artifact, not the behavior.

Whether destructive actions have approval policies. Even if your eval includes a “did the model ask for approval?” check, it can only test what you wrote. A real release reviewer needs the deterministic list: which destructive tools have a declared approval policy and which don’t.
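
A sketch of what producing that list can look like; the destructive and approval_policy fields are assumed names for illustration, not a specific manifest format:

```python
def missing_approval_policies(tools: list) -> list:
    """Deterministic: destructive tools with no declared approval policy."""
    return [
        t["name"]
        for t in tools
        if t.get("destructive") and not t.get("approval_policy")
    ]

tools = [
    {"name": "get_order_status", "destructive": False},
    {"name": "issue_refund", "destructive": True, "approval_policy": "human_review"},
    {"name": "delete_user", "destructive": True},  # nobody added a policy
]
print(missing_approval_policies(tools))  # ['delete_user']
```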

Whether retries are safe. Evals run once per case. They can’t tell you whether your retry policy plus the absence of an idempotency key means a transient network blip causes a duplicate refund.
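
A toy illustration of that failure mode, with a made-up refund function standing in for the real API:

```python
ledger = []      # every refund actually issued
processed = {}   # idempotency_key -> refund record already issued

def issue_refund(order_id, amount, idempotency_key=None):
    if idempotency_key and idempotency_key in processed:
        return processed[idempotency_key]  # replayed retry, nothing new issued
    record = {"order_id": order_id, "amount": amount}
    ledger.append(record)
    if idempotency_key:
        processed[idempotency_key] = record
    return record

# Without a key: a transient blip plus a retry issues the refund twice.
issue_refund("ord_42", 50.0)
issue_refund("ord_42", 50.0)
print(len(ledger))  # 2

# With a key: the retry replays the original record instead.
issue_refund("ord_42", 50.0, idempotency_key="refund-ord_42")
issue_refund("ord_42", 50.0, idempotency_key="refund-ord_42")
print(len(ledger))  # 3, not 4
```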

Whether the prompt and the surface contradict each other. Eval inputs are happy paths or known adversarial cases. They miss the structural fact that a “look up customer info only” prompt is paired with a tool surface that includes cancel_account. The contradiction is static. Evals are dynamic.
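
A static contradiction check can be as blunt as keyword matching against the tool surface. The markers and tool names below are illustrative only:

```python
READ_ONLY_MARKERS = ("advise only", "look up", "read-only", "information only")
WRITE_TOOLS = {"cancel_account", "issue_refund", "delete_user"}

def prompt_surface_contradictions(prompt: str, tools: set) -> set:
    """Static check: a read-only prompt paired with write-capable tools."""
    declares_read_only = any(m in prompt.lower() for m in READ_ONLY_MARKERS)
    return tools & WRITE_TOOLS if declares_read_only else set()

prompt = "You look up customer info only. Never modify accounts."
surface = {"lookup_customer", "cancel_account"}
print(prompt_surface_contradictions(prompt, surface))  # {'cancel_account'}
```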

Each of these is a release question. None is an eval question. The shape of the question — “given the artifact in this PR, can we ship?” — is wrong for a behavioral test suite.

The “eval as release gate” anti-pattern

What teams actually do when they treat the eval suite as the release gate:

  1. PR adds a new tool to the agent’s MCP server.
  2. CI runs the eval suite. It still passes (the eval inputs don’t exercise the new tool).
  3. PR merges. The change rolls to production.
  4. A user prompt eventually triggers the new tool.
  5. The new tool fires without an approval policy because nobody added one. The eval suite still passes. Nobody notices for a week.

This is a deterministic outcome of using a behavioral instrument for a structural job. It’s not the eval team’s fault. It’s a category error.

The same pattern produces other bugs:

  • An OAuth scope grant gets widened to unblock a feature; the eval suite passes; the agent now has access to operations the team didn’t realize they’d granted.
  • An MCP server’s minor version adds a new tool; the agent picks it up on next deploy; the eval suite still passes because the dataset doesn’t trigger it.
  • A prompt update changes “advise only” to “help the user complete their task”; the eval suite passes (it’s measured against the new rubric); write tools that were previously gated by language are now fair game.

In each case, the structural change is in the artifact. The eval is the wrong place to catch it.

What evals are for

Evals are for iteration on what the model does given a fixed surface. They live in the inner loop:

  • Tweak the prompt → run evals → does behavior improve?
  • Try a new model → run evals → did anything regress?
  • Add a new tool → run evals → does the model use it correctly when it should?
  • Update the rubric → backfill scores → are old responses still passing?

That’s a tight, valuable loop. The signal evals produce there is unique — no other instrument tells you whether accuracy on the support_refund eval set went from 0.78 to 0.83.
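
A minimal sketch of that signal: compare per-dataset scores between a baseline run and a candidate run and flag regressions. Dataset names and threshold are illustrative:

```python
baseline = {"support_refund": 0.78, "order_lookup": 0.91}
candidate = {"support_refund": 0.83, "order_lookup": 0.86}

THRESHOLD = 0.02  # tolerate small scoring noise
regressions = {
    name: (baseline[name], candidate[name])
    for name in baseline
    if candidate.get(name, 0.0) < baseline[name] - THRESHOLD
}
print(regressions)  # {'order_lookup': (0.91, 0.86)}
```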

What a release gate is for

A release gate is for catching unsafe surface state in the artifact diff at PR time, before the change reaches users. It’s keyed on:

  • The manifest
  • The tool definitions
  • The policy file
  • The prompt, statically inspected against the surface

It’s deterministic. It runs in CI in seconds, not minutes. It produces a finding list with severities, evidence, and recommended remediation. It doesn’t need a dataset, doesn’t call the model, doesn’t depend on model availability.
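
A sketch of that output's shape; the Finding fields mirror the description above and are not any particular tool's schema:

```python
from dataclasses import dataclass

@dataclass
class Finding:
    severity: str      # "high" | "medium" | "low"
    rule: str
    evidence: str      # where in the manifest/prompt the problem sits
    remediation: str

findings = [
    Finding("high", "destructive-tool-without-approval",
            "tools.delete_user declares no approval_policy",
            "Declare an approval policy or drop the tool from the surface."),
]

# Deterministic exit: no dataset, no model call, just the artifact.
blocking = [f for f in findings if f.severity == "high"]
raise SystemExit(1 if blocking else 0)
```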

The two instruments answer different questions. The release gate asks “can this surface ship?” The eval suite asks “is the model good on this surface?”

You need both. Conflating them means one of those questions never gets asked at the right time.

Use both

A workable workflow:

| When | What runs | What it catches |
| --- | --- | --- |
| Pre-merge (CI on every PR) | Agents Shipgate on the manifest | Missing approval policies, broad scopes, prompt/surface contradictions, schema sloppiness |
| Pre-merge (CI on PRs touching prompts/tools) | Eval suite on the changed artifact | Behavior regression on the curated dataset |
| Post-merge (on a schedule) | Full eval sweep on the model + prompt + tool stack | Drift, model swaps, dataset growth |
| Runtime | Tracing + observability | What actually happened in production |

The release gate sits at the front. It cuts the surface down to “things the eval suite can responsibly evaluate.” The eval suite measures quality on that finite, reviewed surface. Observability watches what actually shipped.

Each guard catches something the others can’t. Removing the release gate means the eval suite has to do a job it was never built for — and that’s how regressions ship.

For the upstream argument that the tool surface is itself a release artifact, see Your AI agent has a tool surface. It needs a release gate.
