
Eval-driven development isn't a luxury.

Most production AI failures aren't model failures — they're observability failures. Here's the discipline we ship every agent with.

Every AI demo you’ve seen works on the keynote slide. Most of them break in production within a week. The difference is almost never the model — it’s the absence of an honest eval suite, an honest trace, and an honest guardrail layer.

We treat these three as non-negotiable. If a client asks us to skip any of them to ship faster, we politely refuse, or we refer them out.

1. Evals are the unit tests of the LLM era

A unit test for a deterministic function asserts that f(x) = y. An eval for a probabilistic model asserts that, across 200+ representative inputs, the distribution of outputs satisfies a quality bar.

We ship every agent with:

  • Golden tests: specific input→output pairs we never want to regress on.
  • Property tests: behaviour that must hold regardless of phrasing (PII redaction, refusal to hallucinate prices, etc.).
  • Distributional tests: “across 50 randomly sampled tickets, the agent should resolve 80%+ correctly.”
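As a rough illustration of the three test types, here is a minimal Python sketch. Everything in it is an assumption made for the example: `stub_agent` stands in for a real agent call, the tickets and graders are invented, and the regexes are deliberately crude. It shows the shape of each check, not a production suite.

```python
import re
from typing import Callable

Agent = Callable[[str], str]

def stub_agent(ticket: str) -> str:
    """Hypothetical, deterministic stand-in for a real agent call."""
    text = ticket.lower()
    if "refund" in text:
        return "Refunds are processed within 5 business days."
    if "price" in text:
        return "I can't quote prices here; please check the pricing page."
    return "I'm escalating this to a human colleague."

# Crude illustrative patterns, not a complete PII or price detector.
EMAIL = re.compile(r"[\w.+-]+@[\w-]+\.[\w.-]+")
QUOTED_PRICE = re.compile(r"[$€£]\s?\d")

def golden_failures(agent: Agent, cases: list[tuple[str, str]]) -> list[str]:
    """Golden test: return inputs whose output no longer equals the pinned answer."""
    return [inp for inp, expected in cases if agent(inp) != expected]

def property_failures(agent: Agent, inputs: list[str],
                      holds: Callable[[str], bool]) -> list[str]:
    """Property test: return inputs for which the invariant is violated."""
    return [inp for inp in inputs if not holds(agent(inp))]

def pass_rate(agent: Agent,
              graded: list[tuple[str, Callable[[str], bool]]]) -> float:
    """Distributional test: fraction of sampled inputs a grader accepts."""
    return sum(1 for inp, grade in graded if grade(agent(inp))) / len(graded)

def no_pii_no_prices(output: str) -> bool:
    """Invariant: the output leaks no email address and quotes no price."""
    return not EMAIL.search(output) and not QUOTED_PRICE.search(output)

GOLDEN = [("How long does a refund take?",
           "Refunds are processed within 5 business days.")]

PHRASINGS = ["What's the price of Pro?",
             "price?? my email is jo@example.com",
             "Give me a PRICE in dollars"]

# In practice this would be a random sample of real tickets (e.g. 50 of them).
GRADED = [
    ("Where is my refund?", lambda out: "5 business days" in out),
    ("Refund for order 1182", lambda out: "5 business days" in out),
    ("What is the price of Pro?", lambda out: "pricing page" in out),
    ("Price list please", lambda out: "pricing page" in out),
    ("The app crashes on login", lambda out: "restart" in out.lower()),
]

golden_result = golden_failures(stub_agent, GOLDEN)
property_result = property_failures(stub_agent, PHRASINGS, no_pii_no_prices)
resolve_rate = pass_rate(stub_agent, GRADED)
```

With the stub above, the golden and property checks return empty failure lists and the resolve rate lands on the 80% bar, because the crash ticket is escalated rather than resolved.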

These run on every commit, every model swap, every prompt change. CI gates merges on regression budgets, the same way it gates merges on broken builds.
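One way to picture a regression budget is as a per-metric allowance for how far a score may drop against the last accepted baseline. The sketch below assumes hypothetical metric names (`golden_pass_rate`, `resolve_rate`) and made-up numbers; the actual thresholds and CI wiring will differ per project.

```python
def budget_violations(baseline: dict[str, float],
                      current: dict[str, float],
                      budget: dict[str, float]) -> list[str]:
    """Compare a run against the baseline; an empty list means the merge may proceed."""
    violations = []
    for metric, base in baseline.items():
        allowed_drop = budget.get(metric, 0.0)  # no budget entry means no regression allowed
        now = current.get(metric)
        if now is None:
            violations.append(f"{metric}: missing from current run")
        elif base - now > allowed_drop:
            violations.append(
                f"{metric}: dropped {base - now:.3f}, budget is {allowed_drop:.3f}"
            )
    return violations

def exit_code(violations: list[str]) -> int:
    """Non-zero exit code fails the CI job, just like a broken build."""
    return 1 if violations else 0

# Hypothetical numbers for illustration only.
BASELINE = {"golden_pass_rate": 1.0, "resolve_rate": 0.84}
BUDGET = {"golden_pass_rate": 0.0, "resolve_rate": 0.02}

ok_run = budget_violations(BASELINE, {"golden_pass_rate": 1.0, "resolve_rate": 0.83}, BUDGET)
bad_run = budget_violations(BASELINE, {"golden_pass_rate": 1.0, "resolve_rate": 0.81}, BUDGET)
```

Here the first run stays inside the budget and the second exceeds it on `resolve_rate`, so a CI step calling `exit_code(bad_run)` would block the merge.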

2. Traces are non-optional

When a customer asks “why did the agent do that?” — three weeks after the conversation — the only acceptable answer is “let me pull the trace.” Anything else is a process failure.

Every production agent we ship logs:

  • Every prompt, in full, with the system prompt’s git SHA.
  • Every model response, including any tool calls.
  • Every tool’s input/output payload.
  • Every guardrail decision and the reason.

Traces are stored in Postgres and Langfuse, and are queryable by user, session, time, model, and intent. Two weeks of retention is the floor; some regulated clients keep five years.
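To make the logged fields concrete, here is a minimal sketch of what a single trace record could look like. The field names, the `TraceRecord` class, and the in-memory `query` helper are assumptions for illustration; they do not reflect the Langfuse API or an actual Postgres schema.

```python
import json
from dataclasses import asdict, dataclass, field
from datetime import datetime, timezone

@dataclass
class ToolCall:
    name: str
    input: dict
    output: dict

@dataclass
class GuardrailDecision:
    guardrail: str
    action: str  # e.g. "allow", "escalate", "refuse"
    reason: str

@dataclass
class TraceRecord:
    user_id: str
    session_id: str
    model: str
    intent: str
    system_prompt_sha: str  # git SHA of the system prompt in use
    prompt: str             # stored in full, never truncated
    response: str
    tool_calls: list[ToolCall] = field(default_factory=list)
    guardrail_decisions: list[GuardrailDecision] = field(default_factory=list)
    timestamp: str = field(
        default_factory=lambda: datetime.now(timezone.utc).isoformat()
    )

def to_json(record: TraceRecord) -> str:
    """Serialise a record, nested tool calls and decisions included."""
    return json.dumps(asdict(record), sort_keys=True)

def query(records: list[TraceRecord], **filters: str) -> list[TraceRecord]:
    """In-memory stand-in for a database query on exact field matches."""
    return [r for r in records
            if all(getattr(r, key) == value for key, value in filters.items())]

# Hypothetical records; IDs, SHA and model names are placeholders.
records = [
    TraceRecord(
        user_id="u-1", session_id="s-1", model="model-a", intent="order_status",
        system_prompt_sha="abc1234", prompt="Where is my order?",
        response="Your order shipped yesterday.",
        tool_calls=[ToolCall("lookup_order", {"order_id": "1182"}, {"status": "shipped"})],
        guardrail_decisions=[GuardrailDecision("pii", "allow", "no PII pattern matched")],
    ),
    TraceRecord(
        user_id="u-2", session_id="s-2", model="model-a", intent="pricing",
        system_prompt_sha="abc1234", prompt="How much is Pro?",
        response="Please check the pricing page.",
    ),
]

session_traces = query(records, session_id="s-1")
```

Keeping the prompt SHA, tool payloads, and guardrail decisions on the same record is what lets one query answer "why did the agent do that?" without joining across systems.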

3. Guardrails are model-agnostic

Trusting any single model, even a frontier one, to enforce its own guardrails is operationally naive. Models drift; vendors roll out updates; jailbreaks evolve. The guardrails live around the model, in code we control:

  • PII redaction (regex + NER).
  • Topic boundaries (semantic classifier).
  • Jailbreak detection (signature + adversarial prompt detection).
  • Output validators (schema, hallucination check on prices/dates/people).

When any one fires, the request is logged, escalated, or refused — never silently corrected.
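A guardrail layer of this kind can be sketched as plain functions that each return a decision with a reason, so nothing is altered without a record. The sketch below is a simplification under stated assumptions: PII detection is regex-only (no NER) and flags rather than redacts, jailbreak detection is a hypothetical signature list, the topic classifier is omitted, and the price check compares against an invented catalogue.

```python
import re
from dataclasses import dataclass

@dataclass
class Decision:
    guardrail: str
    action: str  # "allow", "escalate" or "refuse"
    reason: str

# Illustrative patterns and signatures only; real ones would be far broader.
EMAIL = re.compile(r"[\w.+-]+@[\w-]+\.[\w.-]+")
PRICE = re.compile(r"[$€£]\s?\d+(?:\.\d{2})?")
JAILBREAK_SIGNATURES = ("ignore previous instructions",
                        "disregard your system prompt")
SEVERITY = {"allow": 0, "escalate": 1, "refuse": 2}

def check_jailbreak(user_input: str) -> Decision:
    lowered = user_input.lower()
    for signature in JAILBREAK_SIGNATURES:
        if signature in lowered:
            return Decision("jailbreak", "refuse", f"matched signature: {signature!r}")
    return Decision("jailbreak", "allow", "no signature matched")

def check_pii(model_output: str) -> Decision:
    if EMAIL.search(model_output):
        return Decision("pii", "escalate", "email address detected in output")
    return Decision("pii", "allow", "no PII pattern matched")

def check_prices(model_output: str, known_prices: set[str]) -> Decision:
    """Refuse any quoted price that is not in the catalogue we control."""
    for quoted in PRICE.findall(model_output):
        normalised = quoted.replace(" ", "")
        if normalised not in known_prices:
            return Decision("price_validator", "refuse",
                            f"price {normalised} not in catalogue")
    return Decision("price_validator", "allow", "no unverified price quoted")

def run_guardrails(user_input: str, model_output: str,
                   known_prices: set[str]) -> tuple[str, list[Decision]]:
    """Run every check; the most severe action wins and all decisions are kept for the trace."""
    decisions = [
        check_jailbreak(user_input),
        check_pii(model_output),
        check_prices(model_output, known_prices),
    ]
    final = max(decisions, key=lambda d: SEVERITY[d.action]).action
    return final, decisions

CATALOGUE = {"$19.99"}  # hypothetical price list
action, decisions = run_guardrails(
    "How much is Pro?", "Pro costs $9.99 per month.", CATALOGUE
)
```

In this example the unverified price causes a refusal, and the full decision list, reasons included, is available to write to the trace.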

Why this matters

The cost of a wrong answer in production is asymmetric. A retail-customer assistant that hallucinates a promo costs a refund. A medical triage agent that hallucinates a dosage costs a lawsuit.

The point isn’t that the model is good enough — it’s that, when the model fails, the failure is boring, observable, and recoverable. That’s the only honest bar for production AI in 2026.

— Defne
