CHAPTER 01
Why this practice exists.
Most "AI products" you've seen are demos in a wig. They work in the keynote and fall over in the wild. We built this practice for the other thing — the production system, sitting in front of paying customers, handling tens of thousands of conversations a day without a human at the wheel.
The cost of a wrong answer in production is asymmetric. A retail-customer assistant that hallucinates a promo costs a refund. A medical triage agent that hallucinates a dosage costs a lawsuit. Eval-driven development isn't a luxury — it's the only honest way to ship.
We work with a small set of clients where the agent is the product, or where it sits on the critical path. We don't take chatbot-bolted-onto-existing-app briefs — there are agencies for that, and we'll happily refer.
If the model can fail it will fail. Our job is to make that failure boring, observable, and recoverable.
— DEFNE ARSLAN, PRACTICE LEAD
CHAPTER 02
The architecture we ship.
Every agent we put in production follows the same skeleton. The pieces vary; the shape does not.
The orchestrator is where the opinions live. It picks the model per task — GPT-4 for nuanced reasoning, Claude for tool-heavy chains, a self-hosted Mistral for cheap classification. It enforces evals before any side-effect commits. It logs every decision to a queryable trace, so when a customer asks "why did the agent do that?" — three weeks later — the answer is in your dashboard, not your memory.
CHAPTER 03
What we deliver.
Every engagement ships with the same seven artifacts. Hand-wave on any of them and the agent is a demo, not a product.
01 Eval suite
200+ test cases per agent. Run on every commit, every model change.
02 Observability
Every prompt, response, tool call, and latency in a queryable trace.
03 Guardrails
PII redaction, topic boundaries, jailbreak detection — model-agnostic.
04 Human escalation
Routed handoff to your support team with full conversation context.
05 Model abstraction
Swap providers in one config line. No vendor lock-in, ever.
06 Cost dashboard
Tokens per user, per query, per quarter. Forecast before you scale.
CHAPTER 04
Engagement shape.
Three ways to start. Most engagements begin with a two-week Sprint to de-risk the model choice and the user research before any production code is written.
/01 · SPRINT
2 weeks
Working prototype on real data. €24k fixed.
/02 · BUILD
3 — 9 months
Full production engagement + all deliverables.
/03 · OPERATE
Ongoing
Continuous pod. Pause or scale every 90 days.
CHAPTER 05
The stack we trust.
Opinionated but not religious. We choose per-task; we'd rather use the right tool than the cool one.
FRONTIER OpenAI
FRONTIER Anthropic
OPEN Mistral
EMBED Cohere
STT/TTS Whisper · 11Labs
VECTOR Qdrant
VECTOR pgvector
OBSERVE Langfuse
EVAL Promptfoo
SERVE Modal · Fly
CHAPTER 06
Selected work.
Three production agents from the practice. Each shipped in under nine months.