Eval-driven development isn't a luxury.
Most production AI failures aren't model failures — they're observability failures. Here's the discipline we ship every agent with.
Every AI demo you’ve seen works on the keynote slide. Most of them break in production within a week. The difference is almost never the model — it’s the absence of an honest eval suite, an honest trace, and an honest guardrail layer.
We treat these three as non-negotiable. If a client asks us to skip any of them to ship faster, we politely refuse, or we refer them out.
1. Evals are the unit tests of the LLM era
A unit test for a deterministic function asserts that f(x) = y. An eval for a probabilistic model asserts that, across 200+ representative inputs, the distribution of outputs satisfies a quality bar.
We ship every agent with:
- Golden tests: specific input→output pairs we never want to regress on.
- Property tests: behaviour that must hold regardless of phrasing (PII redaction, refusal to hallucinate prices, etc.).
- Distributional tests: “across 50 randomly sampled tickets, the agent should resolve 80%+ correctly.”
These run on every commit, every model swap, every prompt change. CI gates merges on regression budgets, the same way it gates merges on broken builds.
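As a rough illustration, the three kinds of test and the CI gate could be sketched as below. This is a minimal sketch, not our production harness: `agent` is a hypothetical stand-in for the real agent call, and the helper names, the regexes, the 0.80 floor, and the 0.02 regression budget are all assumptions chosen for the example.

```python
import re

def agent(ticket: str) -> str:
    """Hypothetical stand-in for the agent under test."""
    if "refund" in ticket.lower():
        return "I have opened a refund request for you."
    return "Let me connect you with a specialist."

# Golden tests: exact input -> output pairs that must never regress.
GOLDEN = {"I want a refund": "I have opened a refund request for you."}

def golden_failures(run) -> list:
    """Return the golden inputs whose output no longer matches."""
    return [q for q, expected in GOLDEN.items() if run(q) != expected]

# Property tests: invariants that must hold for any phrasing.
# Both patterns are deliberately simple, illustrative assumptions.
EMAIL = re.compile(r"[\w.+-]+@[\w-]+\.[\w.-]+")
PRICE = re.compile(r"[$€£]\s?\d")

def property_failures(run, inputs) -> list:
    """Return the inputs whose output leaks an email or states a price."""
    failures = []
    for text in inputs:
        output = run(text)
        if EMAIL.search(output) or PRICE.search(output):
            failures.append(text)
    return failures

# Distributional tests: a pass rate over a sample of graded cases.
def pass_rate(run, cases) -> float:
    """cases is an iterable of (input, grader), where grader(output) -> bool."""
    cases = list(cases)
    passed = sum(1 for text, grader in cases if grader(run(text)))
    return passed / len(cases)

def gate(candidate: float, baseline: float,
         floor: float = 0.80, budget: float = 0.02) -> bool:
    """Allow a merge only if the candidate pass rate clears the absolute
    floor and has not dropped more than `budget` below the baseline."""
    return candidate >= floor and candidate >= baseline - budget
```

In CI, a job along these lines would exit non-zero whenever `golden_failures` or `property_failures` returns anything, or whenever `gate` returns `False`, so a regression blocks the merge the same way a broken build does.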
2. Traces are non-optional
When a customer asks “why did the agent do that?” — three weeks after the conversation — the only acceptable answer is “let me pull the trace.” Anything else is a process failure.
Every production agent we ship logs:
- Every prompt, in full, with the system prompt’s git SHA.
- Every model response, including any tool calls.
- Every tool’s input/output payload.
- Every guardrail decision and the reason.
Traces are stored in Postgres and Langfuse, queryable by user, session, time, model, and intent. Two weeks of retention is the floor; some regulated clients keep five years.
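To make the shape of a trace concrete, here is a minimal sketch of one record and a lookup by user. Everything in it is an assumption for illustration: the `TraceEvent` fields and table layout are hypothetical, and an in-memory SQLite database stands in for Postgres so the example runs on its own. It does not use or imitate the Langfuse API.

```python
import json
import sqlite3
import uuid
from dataclasses import asdict, dataclass, field
from datetime import datetime, timezone

@dataclass
class TraceEvent:
    """One model call, with everything needed to explain it later."""
    user_id: str
    session_id: str
    model: str
    intent: str
    system_prompt_sha: str  # git SHA of the system prompt in use
    prompt: str             # the full prompt, not a summary
    response: str
    tool_calls: list = field(default_factory=list)           # name, input, output
    guardrail_decisions: list = field(default_factory=list)  # guardrail, action, reason
    trace_id: str = field(default_factory=lambda: str(uuid.uuid4()))
    created_at: str = field(
        default_factory=lambda: datetime.now(timezone.utc).isoformat())

def open_store() -> sqlite3.Connection:
    """In-memory SQLite as a stand-in for the real Postgres database."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE traces (trace_id TEXT PRIMARY KEY, user_id TEXT, "
        "session_id TEXT, created_at TEXT, model TEXT, intent TEXT, payload TEXT)")
    return conn

def record(conn: sqlite3.Connection, event: TraceEvent) -> None:
    """Store the query columns separately and the full event as JSON."""
    conn.execute(
        "INSERT INTO traces VALUES (?, ?, ?, ?, ?, ?, ?)",
        (event.trace_id, event.user_id, event.session_id, event.created_at,
         event.model, event.intent, json.dumps(asdict(event))))
    conn.commit()

def traces_for_user(conn: sqlite3.Connection, user_id: str) -> list:
    """Return every stored trace for a user, oldest first."""
    rows = conn.execute(
        "SELECT payload FROM traces WHERE user_id = ? ORDER BY created_at",
        (user_id,)).fetchall()
    return [json.loads(row[0]) for row in rows]
```

Keeping the query dimensions (user, session, time, model, intent) as their own columns and the full event as a JSON payload is one simple way to keep traces both searchable and complete.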
3. Guardrails are model-agnostic
Trusting any single model, even a frontier one, to enforce its own guardrails is operationally naive. Models drift; vendors roll out updates; jailbreaks evolve. The guardrails live around the model, in code we control:
- PII redaction (regex + NER).
- Topic boundaries (semantic classifier).
- Jailbreak detection (signature + adversarial prompt detection).
- Output validators (schema, hallucination check on prices/dates/people).
When any one fires, the request is logged, escalated, or refused — never silently corrected.
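A guardrail layer of this kind can be sketched as a list of plain functions that each return a decision with a reason, or nothing. This is a simplified illustration under stated assumptions: the names (`Decision`, `run_guardrails`, and so on), the email regex, the two jailbreak signatures, and the price format are all hypothetical, and the NER and semantic-classifier parts are left out.

```python
import re
from dataclasses import dataclass
from enum import Enum
from typing import Callable, Optional

class Action(Enum):
    LOG = "log"
    ESCALATE = "escalate"
    REFUSE = "refuse"

@dataclass(frozen=True)
class Decision:
    guardrail: str
    action: Action
    reason: str  # always recorded, so the trace can explain the decision

Guardrail = Callable[[str], Optional[Decision]]

# Illustrative patterns only; a real deployment would use a broader set.
EMAIL = re.compile(r"[\w.+-]+@[\w-]+\.[\w.-]+")
JAILBREAK_SIGNATURES = (
    "ignore previous instructions",
    "ignore all previous instructions",
)
PRICE = re.compile(r"\$\d+(?:\.\d{2})?")

def pii_guardrail(text: str) -> Optional[Decision]:
    if EMAIL.search(text):
        return Decision("pii", Action.ESCALATE, "email address detected")
    return None

def jailbreak_guardrail(text: str) -> Optional[Decision]:
    lowered = text.lower()
    for signature in JAILBREAK_SIGNATURES:
        if signature in lowered:
            return Decision("jailbreak", Action.REFUSE,
                            f"matched signature: {signature!r}")
    return None

def make_price_validator(known_prices: set) -> Guardrail:
    """Output validator: refuse any price that is not in the known catalogue."""
    def check(text: str) -> Optional[Decision]:
        unknown = [p for p in PRICE.findall(text) if p not in known_prices]
        if unknown:
            return Decision("price_validator", Action.REFUSE,
                            f"unverified price(s): {unknown}")
        return None
    return check

def run_guardrails(text: str, guardrails: list) -> list:
    """Run every guardrail and return all decisions that fired.
    The text itself is never modified here."""
    return [d for g in guardrails if (d := g(text)) is not None]
```

Because the functions only inspect text and return decisions, the same layer can wrap any model, and every decision that fires can be written to the trace with its reason.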
Why this matters
The cost of a wrong answer in production is asymmetric. A retail-customer assistant that hallucinates a promo costs a refund. A medical triage agent that hallucinates a dosage costs a lawsuit.
The point isn’t that the model is good enough — it’s that, when the model fails, the failure is boring, observable, and recoverable. That’s the only honest bar for production AI in 2026.
— Defne