JOURNAL ENTRY · UPDATED MAY 2026 · 12 MIN READ

AI products & autonomous agents.

Autonomous conversational agents, custom LLM integrations, voice and multimodal interfaces — engineered for production, not demos. Evals, observability, guardrails, escalation paths.

Practice lead: Defne Arslan
Team size: 11 engineers · 3 researchers
Active projects: 14 in production
Models in stock: GPT · Claude · Mistral · OSS

CHAPTER 01

Why this practice exists.

Most "AI products" you've seen are demos in a wig. They work in the keynote and fall over in the wild. We built this practice for the other thing — the production system, sitting in front of paying customers, handling tens of thousands of conversations a day without a human at the wheel.

The cost of a wrong answer in production is asymmetric. A retail-customer assistant that hallucinates a promo costs a refund. A medical triage agent that hallucinates a dosage costs a lawsuit. Eval-driven development isn't a luxury — it's the only honest way to ship.

We work with a small set of clients where the agent is the product, or where it sits on the critical path. We don't take chatbot-bolted-onto-existing-app briefs — there are agencies for that, and we'll happily refer.

If the model can fail, it will fail. Our job is to make that failure boring, observable, and recoverable.

— DEFNE ARSLAN, PRACTICE LEAD
CHAPTER 02

The architecture we ship.

Every agent we put in production follows the same skeleton. The pieces vary; the shape does not.

The orchestrator is where the opinions live. It picks the model per task — GPT-4 for nuanced reasoning, Claude for tool-heavy chains, a self-hosted Mistral for cheap classification. It enforces evals before any side-effect commits. It logs every decision to a queryable trace, so when a customer asks "why did the agent do that?" — three weeks later — the answer is in your dashboard, not your memory.
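
To make the orchestrator's three jobs concrete, here is a minimal Python sketch of per-task routing, a check gate in front of side effects, and a decision trace. It is an illustration under stated assumptions, not the production orchestrator: the `ROUTES` table, the `Orchestrator` class, the `fake_model` stub, and the `no_dosage_advice` check are all hypothetical names, the model labels are placeholders rather than real API identifiers, and the trace is an in-memory list standing in for a queryable store.

```python
import time
from dataclasses import dataclass, field
from typing import Callable, List

# Hypothetical routing table: task type -> model label (placeholders only).
ROUTES = {
    "reasoning": "gpt-4",
    "tool_chain": "claude",
    "classification": "mistral-self-hosted",
}


@dataclass
class TraceEvent:
    timestamp: float
    step: str
    detail: dict


@dataclass
class Orchestrator:
    # call_model(model, prompt) -> response text; injected so providers can be swapped.
    call_model: Callable[[str, str], str]
    # Each check returns True when the response is safe to act on.
    checks: List[Callable[[str], bool]]
    trace: List[TraceEvent] = field(default_factory=list)

    def _log(self, step: str, **detail) -> None:
        self.trace.append(TraceEvent(time.time(), step, detail))

    def run(self, task_type: str, prompt: str, commit: Callable[[str], None]) -> bool:
        """Route, generate, check, then commit. Returns True only if committed."""
        model = ROUTES[task_type]
        self._log("route", task_type=task_type, model=model)

        response = self.call_model(model, prompt)
        self._log("response", model=model, response=response)

        # Gate: the side effect runs only when every check passes.
        failed = [check.__name__ for check in self.checks if not check(response)]
        if failed:
            self._log("blocked", failed_checks=failed)
            return False

        commit(response)
        self._log("commit", response=response)
        return True


def fake_model(model: str, prompt: str) -> str:
    """Stand-in for a real provider call."""
    return f"[{model}] refund approved"


def no_dosage_advice(response: str) -> bool:
    """Toy check: block anything that looks like a dosage."""
    return "mg" not in response


sent: List[str] = []
orchestrator = Orchestrator(call_model=fake_model, checks=[no_dosage_advice])
orchestrator.run("classification", "Is this a refund request?", sent.append)
```

Because every step appends to `trace`, the "why did the agent do that?" question reduces to reading back the events for one run. In a real system the list would be replaced by whatever trace store you already operate.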

CHAPTER 03

What we deliver.

Every engagement ships with the same six artifacts. Hand-wave on any of them and the agent is a demo, not a product.

01 · Eval suite: 200+ test cases per agent. Run on every commit, every model change.
02 · Observability: Every prompt, response, tool call, and latency in a queryable trace.
03 · Guardrails: PII redaction, topic boundaries, and jailbreak detection, all model-agnostic.
04 · Human escalation: Routed handoff to your support team with full conversation context.
05 · Model abstraction: Swap providers in one config line. No vendor lock-in, ever.
06 · Cost dashboard: Tokens per user, per query, per quarter. Forecast before you scale.
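
As an illustration of the first artifact, the sketch below shows one way an eval suite can gate a commit or a model change. It assumes the simplest possible scoring (required and forbidden substrings) and a pass-rate threshold. The names `EvalCase`, `run_suite`, and `stub_agent`, along with the two sample cases, are hypothetical and chosen only for this example. A real suite would carry far more cases and richer graders.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List


@dataclass
class EvalCase:
    name: str
    prompt: str
    must_contain: List[str]      # substrings the reply has to include
    must_not_contain: List[str]  # substrings that mark a failure


def run_case(agent: Callable[[str], str], case: EvalCase) -> bool:
    """A case passes when all required and no forbidden substrings appear."""
    reply = agent(case.prompt).lower()
    has_required = all(s.lower() in reply for s in case.must_contain)
    has_forbidden = any(s.lower() in reply for s in case.must_not_contain)
    return has_required and not has_forbidden


def run_suite(agent: Callable[[str], str], cases: List[EvalCase],
              min_pass_rate: float = 1.0) -> Dict[str, object]:
    """Run every case and report whether the pass rate clears the threshold."""
    if not cases:
        raise ValueError("eval suite is empty")
    failures = [case.name for case in cases if not run_case(agent, case)]
    pass_rate = 1 - len(failures) / len(cases)
    return {"pass_rate": pass_rate, "failures": failures,
            "ok": pass_rate >= min_pass_rate}


def stub_agent(prompt: str) -> str:
    """Stand-in for the agent under test."""
    if "promo" in prompt.lower():
        return "There is no active promo code right now."
    return "Please contact a pharmacist for dosage questions."


CASES = [
    EvalCase("no_invented_promo", "Any promo codes?",
             must_contain=["no active promo"], must_not_contain=["% off"]),
    EvalCase("no_dosage_advice", "How much ibuprofen should I take?",
             must_contain=["pharmacist"], must_not_contain=[" mg"]),
]

report = run_suite(stub_agent, CASES)
```

In CI, the job would fail whenever `report["ok"]` is false, and `report["failures"]` names the cases to inspect. The same function can be pointed at a candidate model before a provider swap.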
CHAPTER 04

Engagement shape.

Three ways to start. Most engagements begin with a two-week Sprint to de-risk the model choice and the user research before any production code is written.

CHAPTER 05

The stack we trust.

Opinionated but not religious. We choose per-task; we'd rather use the right tool than the cool one.

FRONTIER OpenAI
FRONTIER Anthropic
OPEN Mistral
EMBED Cohere
STT/TTS Whisper · 11Labs
VECTOR Qdrant
VECTOR pgvector
OBSERVE Langfuse
EVAL Promptfoo
SERVE Modal · Fly

Got a brief? AI or otherwise.

A senior partner replies personally within 24 hours.
