Where owners go wrong with the bill
Some owners spend a weekend shaving prompts to save less than the weekend was worth. Others let API and subscription spend climb for a year because AI feels like a cost of doing business, when a structural fix would have cut it without losing anything. And hiding behind both is the panic move: switching everything to the cheapest model. On workflows where a wrong answer costs more than the tokens, that trade goes backward.
What changed recently
AI pricing stopped being a flat meter. The major providers now offer prompt caching, where repeated context is billed far cheaper on reuse, batch processing tiers that trade speed for a meaningful discount on work that can wait, and cheaper fast models that handle simple tasks well. These discounts do not apply themselves: you get them by structuring how you use the tools, which is exactly why the order of operations matters more than prompt wording now. One caution in the other direction: some providers bill very large prompts at higher rates, and more context always means more tokens on every single call. Discount shapes change, so check your provider's current pricing page before building around any of this.
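To make the caching point concrete, here is a rough arithmetic sketch. Every number in it is a placeholder chosen for illustration (the price, the cached-read multiplier, the token counts), and it ignores any one-time surcharge a provider may charge to write the cache. Treat it as a way to reason about the shape of the discount, not as anyone's actual rate card.

```python
# Placeholder figures for illustration only; check your provider's pricing page.
PRICE_PER_MTOK = 3.00          # assumed dollars per million input tokens
CACHED_READ_MULTIPLIER = 0.10  # assumed: cached context billed at 10% of normal

def call_cost(context_tokens, fresh_tokens, cached_multiplier=1.0):
    """Input cost of one call: repeated context plus the new part of the prompt."""
    billable = context_tokens * cached_multiplier + fresh_tokens
    return billable / 1_000_000 * PRICE_PER_MTOK

def monthly_cost(calls, context_tokens, fresh_tokens, cached_multiplier=1.0):
    """Input cost of a workflow that resends the same context on every call."""
    return calls * call_cost(context_tokens, fresh_tokens, cached_multiplier)

# A hypothetical workflow: 1,000 calls, each resending a 20,000-token catalog
# plus a 500-token question.
uncached = monthly_cost(1_000, 20_000, 500)
cached = monthly_cost(1_000, 20_000, 500, CACHED_READ_MULTIPLIER)
print(f"uncached: ${uncached:.2f}, cached: ${cached:.2f}")
```

The same function also shows the caution in the other direction: doubling `context_tokens` roughly doubles the uncached bill, because the context is paid for on every call.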
Measure before you touch anything
Spend one month getting the real picture, because you cannot sequence cuts you cannot see. Record what each workflow cost, on which model, and at what volume. Then find where the spend is concentrated: it usually sits in one or two heavy workflows rather than spreading evenly. And before you look at tokens at all, look at seats and overlap. Unused subscriptions and three tools doing one job often dwarf the token line, and cancelling them takes an afternoon, not an engineering project. We covered that audit in its own piece on overlapping AI subscriptions.
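If your provider lets you export usage, the one-month picture can be a few lines of code. The sketch below assumes a hypothetical export with one row per workflow and model; the workflow names, token counts, and prices are invented for the example and should be replaced with your own.

```python
from collections import defaultdict

# Hypothetical month of usage: (workflow, model, input_tokens, output_tokens).
USAGE = [
    ("support_replies", "strong", 40_000_000, 6_000_000),
    ("catalog_tagging", "strong", 90_000_000, 3_000_000),
    ("weekly_report", "fast", 5_000_000, 1_000_000),
]

# Placeholder prices in dollars per million tokens: (input, output). Not real rates.
PRICE_PER_MTOK = {"strong": (3.00, 15.00), "fast": (0.50, 2.50)}

def cost_by_workflow(usage, prices):
    """Sum the dollar cost of each workflow across all of its rows."""
    totals = defaultdict(float)
    for workflow, model, tokens_in, tokens_out in usage:
        price_in, price_out = prices[model]
        totals[workflow] += tokens_in / 1e6 * price_in + tokens_out / 1e6 * price_out
    return dict(totals)

def spend_shares(totals):
    """Return (workflow, cost, share of total) rows, most expensive first."""
    grand_total = sum(totals.values())
    rows = [(name, cost, cost / grand_total) for name, cost in totals.items()]
    return sorted(rows, key=lambda row: row[1], reverse=True)

totals = cost_by_workflow(USAGE, PRICE_PER_MTOK)
for name, cost, share in spend_shares(totals):
    print(f"{name:16s} ${cost:8.2f}  {share:6.1%}")
```

With these made-up numbers, two workflows account for almost all of the spend, which is the pattern to look for in your own data.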
The order that actually works
First, cut overlap and idle seats: highest savings, zero risk to quality. Second, route by task: send simple, high-volume work to a cheaper fast model and keep the stronger model for the work where judgment matters. This is usually the biggest structural saving. Third, use caching for any workflow that resends the same context over and over, like a product catalog or a policy document. Fourth, move non-urgent work to a batch tier, since reports that run overnight do not need real-time pricing. Fifth, and only then, trim prompts and oversized context. It is the step owners start with, and it belongs last because it saves the least and costs the most attention.
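Steps two through four can be sketched as a simple decision per workflow. This is an illustration under stated assumptions, not a real routing library: the `Task` fields, the `plan` function, and the model labels "strong" and "fast" are all hypothetical names standing in for whatever your tools actually call them.

```python
from dataclasses import dataclass

@dataclass
class Task:
    name: str
    needs_judgment: bool   # a wrong answer would cost more than the tokens
    urgent: bool           # the result is needed in real time
    shared_context: bool   # every call resends the same large document

def plan(task: Task) -> dict:
    """Pick the structural discounts a workflow qualifies for."""
    return {
        # Step two: route by task. Only judgment-heavy work gets the strong model.
        "model": "strong" if task.needs_judgment else "fast",
        # Step three: cache when the same context is resent over and over.
        "use_cache": task.shared_context,
        # Step four: anything that can wait goes to a batch tier.
        "use_batch": not task.urgent,
    }

tasks = [
    Task("contract_review", needs_judgment=True, urgent=True, shared_context=True),
    Task("catalog_tagging", needs_judgment=False, urgent=False, shared_context=True),
    Task("overnight_report", needs_judgment=False, urgent=False, shared_context=False),
]
for task in tasks:
    print(task.name, plan(task))
```

Note that quality-sensitive work keeps the strong model and can still pick up the caching discount; routing and caching are independent levers.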
When not to bother, and when to get help
Do not act yet if the bill is small enough that an hour of your time outweighs a month of savings, if usage is still changing too fast to measure, or if the spend is buying real value at a fair rate. Revisit when the bill rivals a line item you would normally scrutinize, or when it grows for three months straight without the value growing with it. If the bill is significant and you cannot tell which workflow is driving it or whether quality would survive routing, that is a decision worth an outside pair of eyes before you change anything in production.
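The two triggers above reduce to simple checks. The sketch below is one way to write them down; the function names, the assumed saving rate, and the hourly value are illustrative inputs you would supply yourself, not benchmarks.

```python
def worth_acting(monthly_bill, expected_saving_rate, hours_needed, hourly_value,
                 horizon_months=1):
    """True if the expected savings over the horizon exceed the cost of your time."""
    savings = monthly_bill * expected_saving_rate * horizon_months
    return savings > hours_needed * hourly_value

def grew_three_months_straight(monthly_bills):
    """True if each of the last three months was higher than the month before it."""
    last_four = monthly_bills[-4:]
    if len(last_four) < 4:
        return False  # not enough history to judge
    return all(later > earlier for earlier, later in zip(last_four, last_four[1:]))

# A small bill: an hour of your time outweighs a month of savings.
print(worth_acting(monthly_bill=80, expected_saving_rate=0.3,
                   hours_needed=1, hourly_value=100))
# A larger bill with the same assumed saving rate clears the bar.
print(worth_acting(monthly_bill=5_000, expected_saving_rate=0.3,
                   hours_needed=4, hourly_value=100))
print(grew_three_months_straight([100, 120, 150, 200]))
```

Neither check captures whether the value grew along with the bill; that part still needs your judgment.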
The short version
- A small AI bill is usually not worth your hours. A bill that grows without added value is.
- Measure one month first. Spend concentrates in one or two workflows.
- Cut in order: overlap and seats, then model routing, then caching, then batch, then prompts last.
- Caching, batch tiers, and cheap fast models are structural discounts. You get them by how you use the tools, not by asking.
- Never trade quality for tokens on work where a wrong answer costs more than the tokens.