In full

There are two separate moments in the life of an AI feature, and they have very different price tags. The first is building it: wiring the model in, shaping the prompts, getting the demo to work. That part is cheap and fast, which is exactly why AI feels so accessible. The second moment is running it, once it is live and real people are using it many times a day. That is where the money is, and it is the part the demo never shows you.

Training versus inference, and why it matters to you

It helps to separate two words. Training is the enormous, one-off cost of creating a model, and unless you are building your own, it is somebody else's bill. Inference is the cost of running that model for a single request, every time someone uses your feature. You pay for inference, not training, and inference is not a one-off. It is a recurring, per-use cost that scales with how much your product gets used: how many requests arrive, and how much each one sends and receives. The more successful the feature, the larger the bill.

Building an AI feature is capital you spend once. Running it is an operating cost you pay forever, and it grows with success rather than shrinking.

This is the inversion that catches teams out. In most software, a feature that takes off gets cheaper per user as fixed costs spread across more of them. An AI feature can do the opposite: each additional user adds real, marginal compute cost, so popularity drives the bill up, not down. A pilot that costs almost nothing can become a serious line item the moment it reaches the whole customer base.
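To make the inversion concrete, here is a minimal sketch in Python. Every figure in it is an invented assumption, not a real price: a fixed monthly cost of 5,000, and a marginal cost of 2.00 per user per month for the AI feature against zero for a conventional one. The function names are hypothetical too.

```python
def monthly_bill(fixed_cost: float, cost_per_user: float, users: int) -> float:
    """Total monthly bill: a fixed part plus a part that grows with every user."""
    return fixed_cost + cost_per_user * users


def bill_per_user(fixed_cost: float, cost_per_user: float, users: int) -> float:
    """Average monthly cost per user at a given number of users."""
    return monthly_bill(fixed_cost, cost_per_user, users) / users


# Illustrative numbers only (assumptions, not real prices).
FIXED = 5_000.0       # hosting, licences, and so on
AI_MARGINAL = 2.0     # assumed inference cost per user per month

for users in (100, 10_000, 1_000_000):
    conventional = bill_per_user(FIXED, 0.0, users)
    ai_feature = bill_per_user(FIXED, AI_MARGINAL, users)
    print(f"{users:>9,} users: conventional {conventional:10.4f}/user, "
          f"AI feature {ai_feature:10.4f}/user")
```

In this sketch the conventional feature's cost per user keeps shrinking towards zero as users grow, while the AI feature's cost per user can never drop below its marginal cost, and its total bill rises with every user added.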

The falling-price trap

The usual reassurance is that model prices keep dropping, and they do, sharply, year on year. The catch is in how the two curves move. The price per request falls, but usage tends to rise faster, because a cheaper, better model gets put in front of more people doing more things more often. The result is the uncomfortable one: each call costs less and the total spend still climbs. Cheaper units do not save you if you buy far more of them, and AI usage has a way of expanding to fill whatever it is allowed to.
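The two curves can be put side by side with a few lines of Python. This is a sketch under made-up assumptions: the starting price, the starting volume, the yearly price drop, and the yearly usage growth are all hypothetical inputs, chosen only to show the shape of the effect.

```python
def project_spend(price_per_request: float,
                  requests_per_year: float,
                  yearly_price_drop: float,
                  yearly_usage_growth: float,
                  years: int) -> list[tuple[int, float, float, float]]:
    """Return one (year, price, requests, total spend) row per year.

    yearly_price_drop=0.5 means the price halves each year.
    yearly_usage_growth=2.0 means usage grows by 200% (triples) each year.
    """
    rows = []
    price, requests = price_per_request, requests_per_year
    for year in range(years):
        rows.append((year, price, requests, price * requests))
        price *= 1 - yearly_price_drop
        requests *= 1 + yearly_usage_growth
    return rows


# Hypothetical scenario: price halves every year, usage triples every year.
for year, price, requests, spend in project_spend(0.01, 1_000_000, 0.5, 2.0, 4):
    print(f"year {year}: price {price:.5f}, requests {requests:,.0f}, spend {spend:,.2f}")
```

Total spend changes each year by the factor (1 - price drop) × (1 + usage growth). In this scenario that is 0.5 × 3 = 1.5, so the bill grows by half every year even though each call costs half as much as the year before. Spend only falls when that factor is below 1.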

And there is a floor under the price

There is also a hard limit to how far inference can fall, and it is physical. Running these models depends on scarce, expensive hardware, and on the high-bandwidth memory that has been in short supply through 2026 as the whole industry competes for it. When the hardware underneath has a rising floor, the cost of the compute sitting on top cannot fall indefinitely. The economics of running AI are tied to the economics of the silicon, and that silicon is currently the most fought-over component in the market.

How to keep it honest

None of this argues against shipping AI. It argues for costing it like the operating expense it is, before it is live rather than after the invoice.

  • Estimate the running cost at realistic volume, not demo volume. Cost per request multiplied by requests at full adoption is the number that matters, and it is rarely the one in the pitch.
  • Match the model to the job. Most tasks do not need the largest, most expensive model. Routing simpler work to smaller models is often the difference between a feature that pays for itself and one that does not.
  • Cut the work, not just the price. Caching repeated requests, trimming what you send, and not calling the model when you do not need to all reduce the bill more reliably than waiting for prices to drop.
  • Put a number on the value. A running cost is only a problem if the feature is not worth it. Decide what each call is worth before you decide what it costs.
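The four checks above can be folded into one small back-of-envelope model. What follows is a sketch, not a pricing tool: the class, the field names, and every number are hypothetical, and it makes two simplifying assumptions, that a cache hit costs nothing and that each request has a flat cost on each model.

```python
from dataclasses import dataclass


@dataclass
class Scenario:
    users: int
    requests_per_user_per_month: float
    cost_large: float             # assumed cost per request on the large model
    cost_small: float             # assumed cost per request on the small model
    share_small: float = 0.0      # fraction of requests routed to the small model
    cache_hit_rate: float = 0.0   # fraction of requests answered without a model call


def monthly_requests(s: Scenario) -> float:
    return s.users * s.requests_per_user_per_month


def monthly_cost(s: Scenario) -> float:
    """Requests that miss the cache, priced at the blended per-request cost."""
    billable = monthly_requests(s) * (1 - s.cache_hit_rate)
    blended = s.share_small * s.cost_small + (1 - s.share_small) * s.cost_large
    return billable * blended


def worth_running(s: Scenario, value_per_request: float) -> bool:
    """True if the value you assign to the requests covers the running cost."""
    return monthly_requests(s) * value_per_request >= monthly_cost(s)


# Illustrative figures only.
demo = Scenario(users=50, requests_per_user_per_month=20,
                cost_large=0.02, cost_small=0.002)
full = Scenario(users=100_000, requests_per_user_per_month=20,
                cost_large=0.02, cost_small=0.002)
tuned = Scenario(users=100_000, requests_per_user_per_month=20,
                 cost_large=0.02, cost_small=0.002,
                 share_small=0.7, cache_hit_rate=0.25)

for name, scenario in (("demo", demo), ("full adoption", full), ("routed + cached", tuned)):
    print(f"{name:>16}: {monthly_cost(scenario):>10,.2f} per month")
```

The gap between the demo row and the full-adoption row is the number that rarely makes it into the pitch. The third row shows why cutting the work matters: routing and caching change the bill today, whatever happens to prices later.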

The point

AI has made the building part of software cheap and the running part of it visible again. For years, compute was a rounding error in most products. With AI features it is a real, recurring cost that scales with how much people use what you made. The teams that do well are not the ones chasing the lowest price per call. They are the ones who modelled the running cost honestly, before they shipped, and built the feature to be worth what it costs to keep alive.

The takeaway

The expensive part of an AI feature is running it, not building it. Inference is a recurring, per-use cost that grows with success, and although the price per request keeps falling, usage tends to rise faster, so the total bill climbs anyway, against a hardware floor that is currently going up. Cost the feature at real volume before you ship, route simple work to smaller models, cut the work rather than waiting for prices to fall, and make sure each call is worth what it costs.

The Fourths · Engineering for regulated industries