Guardrails and evals: a working stack for LLM features that ship

The eval pipeline we wired around our own agents. What we measure pre-deploy, what we measure in production, and the three regressions we caught last month that static tests would have missed.

Priya Menon

AI Security Engineer, Blackbead.ai

"We have tests." Every team building LLM features says this in the first meeting. It's almost always true and almost never sufficient. Pytest is fine for parsers; it's the wrong shape for a system whose output distribution shifts when the model provider rolls a minor version.

This post documents the eval pipeline we run around our own agents. It's not novel — every serious LLM team converges on something similar — but the failure modes it catches are concrete enough to be worth writing down.

The three layers we run

Pre-deploy, in-prod, and red-team. Each layer answers a different question.

Pre-deploy: did we ship a regression?

A golden dataset of 600 representative prompts per agent, each with one or more pass/fail assertions. The assertions aren't string matches; they're LLM-judged rubrics. The pipeline runs the new model + new prompt against the golden set, scores each output, and surfaces the deltas vs. the last green run.
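A minimal sketch of what one golden-set entry and its rubric assertions could look like. The names GoldenCase, Rubric, score_case, and pass_rates are hypothetical, not our actual schema, and the judge is an injected callable standing in for a call to a judging model.

```python
from dataclasses import dataclass, field
from typing import Callable

# Hypothetical schema: a rubric is one yes/no question put to a judge model.
@dataclass(frozen=True)
class Rubric:
    axis: str        # e.g. "task_accuracy", "format_compliance"
    question: str    # what the judge is asked about the output

@dataclass
class GoldenCase:
    prompt: str
    rubrics: list[Rubric] = field(default_factory=list)

# A judge takes (prompt, output, rubric question) and returns pass/fail.
# In a real pipeline this would wrap a judging model; here it is injected.
Judge = Callable[[str, str, str], bool]

def score_case(case: GoldenCase, output: str, judge: Judge) -> dict[str, bool]:
    """One verdict per axis. Assumes at most one rubric per axis per case."""
    return {r.axis: judge(case.prompt, output, r.question) for r in case.rubrics}

def pass_rates(results: list[dict[str, bool]]) -> dict[str, float]:
    """Aggregate per-case verdicts into a pass rate per axis."""
    passed: dict[str, int] = {}
    seen: dict[str, int] = {}
    for verdicts in results:
        for axis, ok in verdicts.items():
            seen[axis] = seen.get(axis, 0) + 1
            passed[axis] = passed.get(axis, 0) + int(ok)
    return {axis: passed[axis] / seen[axis] for axis in seen}
```

Keeping the judge behind a plain callable means the same cases can be scored by a different judging model, or by a toy function in unit tests, without touching the dataset.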

What we measure:

Task accuracy. Did the agent do the thing it was asked to do?
Format compliance. Did it produce the shape downstream code expects?
Refusal behaviour. Did it refuse what it should refuse, and only what it should refuse?
Hallucination rate. Did it make any factual claim it could not have known?

The bar is "no regression worse than 2σ" on any axis. Improvements are great; we don't gate on them. Regressions block the deploy.
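One way to sketch the 2σ gate. The definition of σ here is an assumption for illustration: the standard error of the difference between two pass rates, each measured on n independent prompts (a binomial approximation). The function names are hypothetical.

```python
import math

def regression_sigmas(baseline_rate: float, candidate_rate: float, n: int) -> float:
    """How many standard errors the candidate sits below the baseline.

    Positive means the candidate is worse. Assumes both rates were measured
    on n independent prompts (binomial approximation, an assumption).
    """
    drop = baseline_rate - candidate_rate
    variance = (baseline_rate * (1 - baseline_rate)
                + candidate_rate * (1 - candidate_rate)) / n
    if variance == 0:
        # Both rates are exactly 0 or 1, so there is no noise estimate.
        return 0.0 if drop == 0 else math.copysign(math.inf, drop)
    return drop / math.sqrt(variance)

def gate(baseline: dict[str, float], candidate: dict[str, float],
         n: int, max_sigmas: float = 2.0) -> list[str]:
    """Return the axes that regressed by more than max_sigmas; empty means pass."""
    return [axis for axis, base in baseline.items()
            if regression_sigmas(base, candidate[axis], n) > max_sigmas]

if __name__ == "__main__":
    last_green = {"task_accuracy": 0.90, "format_compliance": 0.99}
    new_run = {"task_accuracy": 0.93, "format_compliance": 0.95}
    # Accuracy improved, so only the format drop can block the deploy.
    print(gate(last_green, new_run, n=600))
```

With these illustrative numbers the format-compliance drop is roughly four standard errors, so it blocks; the accuracy gain is ignored by design. Because both runs use the same prompts, a paired test would be tighter than this unpaired approximation.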

In-prod: did the world shift under us?

Production traffic is heterogeneous in ways the golden set never fully captures. We run two background tasks against live traffic:

Sampled judging. 1% of production responses are sampled and scored by the same rubric judges as pre-deploy. We publish the numbers on a dashboard. When they drift outside the in-prod baseline, we get paged.
Output distribution monitoring. Per-agent, per-prompt-class, we track output length, refusal rate, tool-call rate, and a handful of embedding-space cluster counts. Drift on any of these is a leading indicator of upstream changes — a model provider update, a RAG corpus change, an injection campaign.
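The two background tasks above could be sketched as follows. This is an illustration under assumptions, not our production code: should_sample, ClassStats, and record are hypothetical names, sampling is done by hashing a response ID, and embedding-cluster counts are left out.

```python
import hashlib
from collections import defaultdict
from dataclasses import dataclass

def should_sample(response_id: str, rate: float = 0.01) -> bool:
    """Deterministically select about `rate` of responses by hashing their ID."""
    digest = hashlib.sha256(response_id.encode("utf-8")).digest()
    bucket = int.from_bytes(digest[:8], "big") / 2**64  # roughly uniform in [0, 1]
    return bucket < rate

@dataclass
class ClassStats:
    """Running counters for one (agent, prompt class) pair."""
    count: int = 0
    total_length: int = 0
    refusals: int = 0
    tool_calls: int = 0

    @property
    def refusal_rate(self) -> float:
        return self.refusals / self.count if self.count else 0.0

    @property
    def mean_length(self) -> float:
        return self.total_length / self.count if self.count else 0.0

stats: dict[tuple[str, str], ClassStats] = defaultdict(ClassStats)

def record(agent: str, prompt_class: str, output: str,
           refused: bool, used_tool: bool) -> None:
    """Update the per-agent, per-prompt-class counters for one response."""
    s = stats[(agent, prompt_class)]
    s.count += 1
    s.total_length += len(output)
    s.refusals += int(refused)
    s.tool_calls += int(used_tool)
```

Hashing the ID rather than calling a random number generator keeps the sample reproducible: the same response is either always in the judged sample or never in it, which makes a paged anomaly easier to replay.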

Red-team: did anyone find a way around us?

A nightly job runs a corpus of known attack patterns — direct injection, indirect injection, jailbreak chains, tool-call abuse, exfiltration attempts. The corpus is versioned and grows. Anything that escapes our guardrails goes to a triage queue. Most days the queue is empty. The days it isn't are interesting.
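A rough sketch of the nightly loop, with every name hypothetical. run_agent stands in for sending a payload through the guarded agent, and escaped stands in for whatever check decides that a response got past the guardrails (a planted canary string is one common choice).

```python
from dataclasses import dataclass
from typing import Callable, Iterable

@dataclass(frozen=True)
class Attack:
    attack_id: str
    category: str   # e.g. "direct_injection", "exfiltration"
    payload: str

def nightly_run(corpus: Iterable[Attack],
                run_agent: Callable[[str], str],
                escaped: Callable[[Attack, str], bool]) -> list[dict[str, str]]:
    """Replay every attack and return the ones that need human triage."""
    triage: list[dict[str, str]] = []
    for attack in corpus:
        response = run_agent(attack.payload)
        if escaped(attack, response):
            triage.append({
                "attack_id": attack.attack_id,
                "category": attack.category,
                "response": response,
            })
    return triage
```

An empty return value corresponds to the empty triage queue described above. Because the corpus is just data, versioning it alongside the code is straightforward.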

Why three layers

Each layer catches what the others miss. Pre-deploy catches regressions. In-prod catches drift. Red-team catches adversaries. Drop any one and you'll be surprised in production by exactly the failure mode that layer would have caught.

Three regressions we caught last month

1. Format-compliance silent break

We swapped to a newer model checkpoint. Task accuracy improved 4%. Format compliance dropped 0.6%, and the failure mode was that the model occasionally wrapped JSON output in backticks. Pre-deploy caught it; the deploy was held until we adjusted the prompt. A static parser test would have missed this because the failure was statistically rare and downstream code tolerated it half the time.
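For illustration, a strict format check that treats backtick-wrapped JSON as its own failure class rather than silently tolerating it. This is a sketch under assumptions, not our actual assertion; classify_json_output and compliance_rate are hypothetical names.

```python
import json

def classify_json_output(raw: str) -> str:
    """Return 'ok', 'fenced', or 'invalid' for an output that should be bare JSON."""
    text = raw.strip()
    try:
        json.loads(text)
        return "ok"
    except json.JSONDecodeError:
        pass
    inner = text.strip("`")
    if inner == text:
        return "invalid"  # no backtick wrapper, so it is just not JSON
    # Drop an optional language tag such as "json" on the opening line.
    first_line, _, rest = inner.partition("\n")
    candidate = rest if first_line.strip().isalpha() else inner
    try:
        json.loads(candidate)
        return "fenced"   # valid JSON, but wrapped: still non-compliant
    except json.JSONDecodeError:
        return "invalid"

def compliance_rate(outputs: list[str]) -> float:
    """Share of outputs that are bare, parseable JSON."""
    return sum(classify_json_output(o) == "ok" for o in outputs) / len(outputs)
```

Labelling the wrapped case separately makes the cause visible in the delta report: a rise in "fenced" points at prompt or model behaviour, while a rise in "invalid" points at something worse.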

2. Refusal drift on a low-volume agent

A specialist agent's refusal rate climbed from 0.4% to 2.1% over two weeks. Nothing changed on our end. Investigation: the model provider had updated its underlying safety policy, and the new policy was over-refusing on a class of queries we explicitly want answered. We worked around it with a small system-prompt addition. In-prod monitoring caught this; pre-deploy didn't because the prompts in question weren't in the golden set (we hadn't expected this surface to move).
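To make the drift check concrete, here is a sketch that asks how surprising an observed refusal count is under the baseline rate, using an exact binomial tail because refusals are rare. The sample size and the alert threshold alpha are assumptions for illustration, and the function names are hypothetical.

```python
import math

def binomial_tail(k: int, n: int, p: float) -> float:
    """P(X >= k) for X ~ Binomial(n, p), computed from the exact lower tail."""
    below = sum(math.comb(n, i) * p**i * (1 - p)**(n - i) for i in range(k))
    return max(0.0, 1.0 - below)

def refusal_drift_alert(refusals: int, sampled: int,
                        baseline_rate: float, alpha: float = 0.001) -> bool:
    """True when this many refusals would be very unlikely at the baseline rate."""
    return binomial_tail(refusals, sampled, baseline_rate) < alpha

if __name__ == "__main__":
    # Assumed sample: 1,000 judged responses, 21 refusals (2.1%), baseline 0.4%.
    print(refusal_drift_alert(refusals=21, sampled=1000, baseline_rate=0.004))
```

On a low-volume agent the sample behind each check is small, so an exact tail is a safer choice than a normal approximation, and the window has to be long enough to accumulate a usable count.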

3. New injection pattern

Red-team caught an indirect-injection variant that used Unicode bidirectional control characters to hide the payload from human reviewers but not from the model. Our pre-existing tainting caught it at the egress stage, so production was never at risk — but the alerting was noisier than it should have been. We tuned the detection and added the pattern to the corpus.
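A small sketch of how bidirectional control characters can be flagged in untrusted text. It is separate from the tainting mentioned above and is not our detection logic; the function names are hypothetical, and the character set is the Unicode explicit bidi formatting characters plus the three implicit marks.

```python
# Unicode bidirectional formatting characters, keyed by code point.
BIDI_CONTROLS = {
    "\u202a": "LRE", "\u202b": "RLE", "\u202c": "PDF",
    "\u202d": "LRO", "\u202e": "RLO",
    "\u2066": "LRI", "\u2067": "RLI", "\u2068": "FSI", "\u2069": "PDI",
    "\u200e": "LRM", "\u200f": "RLM", "\u061c": "ALM",
}

def find_bidi_controls(text: str) -> list[tuple[int, str]]:
    """Return (index, name) for every bidi control character in the text."""
    return [(i, BIDI_CONTROLS[ch]) for i, ch in enumerate(text) if ch in BIDI_CONTROLS]

def strip_bidi_controls(text: str) -> str:
    """Remove bidi controls so reviewers see the same order the model reads."""
    return "".join(ch for ch in text if ch not in BIDI_CONTROLS)
```

Legitimate right-to-left text can contain some of these characters, so a hit is better treated as a signal for review than as proof of an attack.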

What we don't do

Two things we explicitly don't do:

We don't gate on absolute scores. The numbers drift as models drift. We gate on deltas vs. the last known-good run. Absolute targets become aspiration; deltas become discipline.
We don't judge with the same model, or the same model family, that we're testing. Same-family judges are correlated with the system being judged in ways that bias the eval. We use a different family for judging, and we re-validate the judge against human labels quarterly.
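One way to re-validate a judge against human labels is chance-corrected agreement. The sketch below uses Cohen's kappa on pass/fail verdicts; choosing kappa is an assumption for illustration, not a statement of the metric we use, and the function name is hypothetical.

```python
def cohens_kappa(judge: list[bool], human: list[bool]) -> float:
    """Chance-corrected agreement between judge and human pass/fail labels."""
    if len(judge) != len(human) or not judge:
        raise ValueError("label lists must be non-empty and the same length")
    n = len(judge)
    observed = sum(j == h for j, h in zip(judge, human)) / n
    p_judge = sum(judge) / n
    p_human = sum(human) / n
    # Agreement expected if both raters labelled independently at their own rates.
    expected = p_judge * p_human + (1 - p_judge) * (1 - p_human)
    if expected == 1.0:
        return 1.0  # both raters gave every item the same single label
    return (observed - expected) / (1 - expected)
```

Raw agreement looks flattering when nearly everything passes; kappa discounts the agreement that would happen by chance, which matters for rubrics with skewed pass rates.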

If you're starting tomorrow

Build a golden set of 100 prompts per agent. It's enough to start; you can grow it.
Write the assertions as LLM-judged rubrics, not string matches. The rubrics are easier to maintain and survive model upgrades.
Wire 1% production sampling and judge it with the same rubrics. The first week of data will tell you whether your golden set is representative.
Start a red-team corpus. Even ten prompts is a start. Grow it every time you read an incident report.

The pipeline above is not free: it costs us roughly $4-6 per deployable model-day in inference and judging. For a system whose failure modes include "compliance breach" and "exfiltration," that's the cheapest part of the stack.