
A proof of concept earns applause in a meeting. A production agent earns its keep running unattended, at volume, against inputs nobody anticipated. The distance between those two things is not model quality and it is not scale. It is design. This is what that design actually looks like, and why most pilots never cross the gap.

Why most agent pilots stall before production

A demo is a controlled environment. You choose the input, you run it once, and a human watches the output and nods. Production removes all three of those comforts. The input arrives malformed, in a format the prompt never saw. It runs ten thousand times a month. And no one is watching, because the entire point was to stop watching.

The failure is rarely that the model "isn't good enough." The failure is that the pilot optimised for the wrong property. It optimised for an impressive answer on a clean input. Production rewards a different property entirely: predictable behaviour on a dirty input, including predictable behaviour when the agent should refuse to act.

A demo proves the model can be right. Production proves the system stays safe when the model is wrong.

That reframing changes what you build. You stop tuning prompts for peak cleverness and start engineering for the worst plausible input, because that input will arrive, repeatedly, while you sleep.

The three properties that make an agent production-grade

Across the regulated-operations work we run, the same three properties separate a system we will put our name on from one we will not. None of them is about the model.

1. A defined escalation path

Every agent must have an answer to one question: what happens when I am not sure? In a demo the answer is "a human is looking, so it does not matter." In production the answer has to be encoded. A confidence threshold decides, per case, whether the agent acts or hands the decision to a person. Below the line, the case is routed to a queue with the agent's reasoning attached, so the human starts from context rather than from scratch.

The threshold is not a guess. You set it from the cost of being wrong. A misfiled internal document is cheap to correct, so the bar can sit lower. A regulated decision that affects a customer is expensive and sometimes irreversible, so the bar sits high and most cases escalate. The agent does the volume; people keep the judgement.
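A minimal sketch of that rule, under stated assumptions. It assumes the agent's reported confidence is a reasonably calibrated probability of being right, and it prices the threshold with a simple break-even rule: act only when the expected cost of an error is no higher than the cost of a human review. The names `threshold_from_cost`, `decide`, and `Decision`, and the break-even formula itself, are illustrative choices, not something this article prescribes.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Decision:
    action: str       # "act" or "escalate"
    threshold: float  # the bar this case had to clear
    reasoning: str    # travels with the case into the human queue


def threshold_from_cost(cost_of_error: float, cost_of_review: float) -> float:
    """Break-even confidence p: act only when (1 - p) * cost_of_error <= cost_of_review."""
    if cost_of_error <= 0 or cost_of_review < 0:
        raise ValueError("cost_of_error must be positive and cost_of_review non-negative")
    # Clamp to [0, 1]: if a review costs more than an error, the agent always acts.
    return max(0.0, min(1.0, 1.0 - cost_of_review / cost_of_error))


def decide(confidence: float, reasoning: str,
           cost_of_error: float, cost_of_review: float) -> Decision:
    """Act at or above the threshold; otherwise escalate with the reasoning attached."""
    threshold = threshold_from_cost(cost_of_error, cost_of_review)
    action = "act" if confidence >= threshold else "escalate"
    return Decision(action=action, threshold=threshold, reasoning=reasoning)
```

With a review cost of 5 units, an error that costs 10 gives a threshold of 0.5, while an error that costs 1000 pushes it to roughly 0.995. The same 0.9-confidence case acts in the first setting and escalates in the second, which is the behaviour described above: cheap mistakes get a lower bar, expensive ones send most cases to a person.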

2. An immutable record of every action

If you cannot reconstruct what the agent did and why, you do not have a production system, you have a liability. Every action the agent takes is written to an append-only log: the input it received, the model and version that ran, the confidence it reported, the decision, and the outcome. The log is write-once. Nothing edits history.

This is not bureaucracy. It is what lets you answer a regulator, debug a bad output six weeks later, and prove that a contested decision followed the process you claimed it followed. For regulated clients it is the difference between an incident and a catastrophe.
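The record described above can be sketched in a few lines. This is an in-memory illustration only: the field names are taken from the list above, but the class names and the hash chain are assumptions added here to show how tampering can be made detectable. In a real deployment the write-once guarantee has to come from the storage layer, not from application code.

```python
import hashlib
import json
from dataclasses import asdict, dataclass


@dataclass(frozen=True)  # frozen: a record cannot be mutated after creation
class ActionRecord:
    input_text: str
    model: str
    model_version: str
    confidence: float
    decision: str
    outcome: str
    prev_hash: str         # hash of the previous record, forming a chain
    record_hash: str = ""  # hash over every other field of this record


def _digest(fields: dict) -> str:
    """Deterministic SHA-256 over the record's fields."""
    payload = json.dumps(fields, sort_keys=True).encode("utf-8")
    return hashlib.sha256(payload).hexdigest()


class AppendOnlyLog:
    GENESIS = "0" * 64  # placeholder hash before the first record

    def __init__(self) -> None:
        self._records: list[ActionRecord] = []

    def append(self, *, input_text: str, model: str, model_version: str,
               confidence: float, decision: str, outcome: str) -> ActionRecord:
        """The only write path: add a record chained to the one before it."""
        prev_hash = self._records[-1].record_hash if self._records else self.GENESIS
        fields = dict(input_text=input_text, model=model, model_version=model_version,
                      confidence=confidence, decision=decision, outcome=outcome,
                      prev_hash=prev_hash)
        record = ActionRecord(**fields, record_hash=_digest(fields))
        self._records.append(record)
        return record

    def records(self) -> tuple:
        """Read-only view of the history."""
        return tuple(self._records)

    def verify(self) -> bool:
        """Recompute every hash; any edited or reordered record breaks the chain."""
        prev = self.GENESIS
        for record in self._records:
            fields = asdict(record)
            stored = fields.pop("record_hash")
            if record.prev_hash != prev or _digest(fields) != stored:
                return False
            prev = stored
        return True
```

The class exposes `append`, `records`, and `verify`, and deliberately has no update or delete method. Chaining each record to the previous one means a contested decision can be checked against the history that surrounds it, not just against its own row.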

3. Cost routed by task, not by habit

The instinct after a successful demo is to run the best model on everything. That instinct will bankrupt the use case. Most of what an agent does is cheap, repetitive triage that a small, fast model handles reliably. Only the genuinely hard cases need a frontier model.

  • Classify cheaply. A small model decides what kind of case this is and how confident it is.
  • Escalate selectively. Only low-confidence or high-stakes cases are promoted to a larger model.
  • Escalate to a human last. The frontier model is still a model. The final backstop on a regulated decision is a person.

The cheapest reliable path that satisfies the compliance boundary wins. We have seen this single decision cut running cost by an order of magnitude with no measurable drop in quality, because quality was never coming from the expensive model on the easy 90% of cases.
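The three routing steps above can be sketched as a single function. Everything named here is hypothetical: a "model" is any callable that returns a label and a confidence, the thresholds are placeholder values, and `high_stakes_labels` stands in for whatever your compliance boundary defines as a regulated decision. No real model client is called.

```python
from dataclasses import dataclass
from typing import Callable, FrozenSet, Tuple

# Assumption: a model is any callable mapping a case to (label, confidence).
Model = Callable[[str], Tuple[str, float]]


@dataclass(frozen=True)
class RouteResult:
    label: str
    confidence: float
    handled_by: str  # "small", "large", or "human"


def route(case: str, small: Model, large: Model, *,
          high_stakes_labels: FrozenSet[str] = frozenset(),
          small_threshold: float = 0.9,
          large_threshold: float = 0.95) -> RouteResult:
    # 1. Classify cheaply: the small model labels the case and reports confidence.
    label, confidence = small(case)
    if confidence >= small_threshold and label not in high_stakes_labels:
        return RouteResult(label, confidence, "small")

    # 2. Escalate selectively: only low-confidence or high-stakes cases reach here.
    label, confidence = large(case)
    if confidence >= large_threshold and label not in high_stakes_labels:
        return RouteResult(label, confidence, "large")

    # 3. Escalate to a human last: the large model's answer becomes a suggestion.
    return RouteResult(label, confidence, "human")
```

Note the asymmetry: a high-stakes label ends at a person however confident either model is, so the larger model's output only ever arrives as context for the reviewer. The cost saving comes from step 1, since every case that clears the small model's bar never touches the expensive one.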

How we build it: Azure AI Foundry, routing, and guardrails

Concretely, we build on Azure AI Foundry. Model routing is chosen for cost and task, never for novelty. The agent runs inside the same compliance boundary as the rest of the platform: identity through Entra, secrets in Key Vault, every action logged to the same append-only store as every other regulated event. There is no separate "AI system" sitting outside governance. The agent is a citizen of the platform, subject to the same rules.

No regulated decision runs without a human in the loop. That is not a limitation we apologise for. It is the feature that lets a supervised business deploy AI at all.

The takeaway

Moving an agent to production is not a scaling problem; it is a design problem. Build for the worst plausible input, give the agent a defined escalation path, log every action immutably, and route cost by task. Do that and the agent runs unattended at volume. Skip it and you have a demo that happens to be live.

If you are sitting on a pilot that impresses in the room but that you would not trust unattended, the gap is almost certainly one of these three properties, not the model. That gap is bridgeable, and it is the work we do.

The Fourths · Engineering for regulated industries