The mental model is wrong before the first call
Teams budget AI the way they budget software: pick a tool, pay a fixed monthly fee, move on. A metered model API breaks that intuition. You are not buying a seat — you are buying tokens, priced per million, charged on both what you send and what comes back. Cost is a function of volume × prompt size × model tier, and all three drift upward on their own as a product grows.
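To make that multiplication concrete, here is a minimal sketch of the cost function in Python. The `ModelPrice` type, the `monthly_cost` helper, and every price in it are illustrative assumptions, not any vendor's actual rates; the point is only that the terms multiply.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ModelPrice:
    """Per-million-token prices in USD (placeholder values, not real rates)."""
    input_per_mtok: float
    output_per_mtok: float


def monthly_cost(calls: int, input_tokens: int, output_tokens: int,
                 price: ModelPrice) -> float:
    """Estimated monthly spend: call volume x tokens per call x tier price."""
    per_call = (input_tokens * price.input_per_mtok
                + output_tokens * price.output_per_mtok) / 1_000_000
    return calls * per_call


# Hypothetical tiers, chosen only to show how the three factors compound.
SMALL = ModelPrice(input_per_mtok=0.10, output_per_mtok=0.40)
FRONTIER = ModelPrice(input_per_mtok=10.00, output_per_mtok=40.00)

# Same product, before and after volume, prompt size, and tier all drift up.
baseline = monthly_cost(calls=100_000, input_tokens=1_000, output_tokens=200, price=SMALL)
grown = monthly_cost(calls=200_000, input_tokens=2_000, output_tokens=200, price=FRONTIER)
```

Doubling volume and prompt size while also moving up a tier does not double the bill; under these assumed prices it moves it by orders of magnitude.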
The distribution proves the point. In Ramp’s June 2026 AI Index, the median company spent about $11.38 per employee per month on AI, while the top 1% of spenders ran roughly $7,500 per employee per month — Ramp put the gap at about 680×. A spread that wide is not explained by “using more AI.” It is explained by architecture.
And much of the real usage is invisible to finance in the first place. Menlo Security’s 2025 report found a 68% jump in employees using personal accounts to reach free AI tools at work: the “shadow AI” that never shows up on a vendor invoice. You cannot optimize spend you cannot see.
Three mistakes that multiply, not add
Most overspend traces to the same small set of structural errors. Individually each is survivable. The damage is that they compound — a single request can carry all three at once.
No prompt caching. A long, static system prompt is re-sent and re-billed on every turn. A 4,000-token system block, ten messages a session, a thousand sessions a day is forty million input tokens a day — paid for repeatedly to send the model text it already processed last turn.
Over-routing. Sentiment checks, yes/no decisions, and templated email drafts get sent to a frontier model out of habit. The price gap is not marginal: a top-tier model can cost on the order of a hundred-plus times a small fast model for the same short call. If a lighter, lower-cost model handles a task well, sending that task to a frontier model is simply overpaying.
Uncapped pipelines. No per-user or per-session cost ceiling, no rate limit. A retry loop, a broken page that keeps reloading, or a forgotten test account can burn a day’s budget in an afternoon. Teams report bills climbing several-fold in a single quarter before anyone could name the cause: the classic “suddenly we have a $10k bill” story.
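A guard against the third mistake can be small. The sketch below is one possible shape for a per-session spend cap combined with a sliding-window rate limit, kept in memory. `SessionGuard`, `BudgetExceeded`, and the limits used are hypothetical names and values for illustration; a production version would need shared storage and real cost figures from the provider's usage data.

```python
import time
from collections import defaultdict, deque
from typing import Optional


class BudgetExceeded(Exception):
    """Raised when a session would exceed its spend cap or rate limit."""


class SessionGuard:
    """In-memory per-session spend cap plus a 60-second sliding rate limit."""

    def __init__(self, max_usd_per_session: float, max_calls_per_minute: int):
        self.max_usd = max_usd_per_session
        self.max_calls = max_calls_per_minute
        self.spent = defaultdict(float)   # session_id -> USD spent so far
        self.calls = defaultdict(deque)   # session_id -> recent call timestamps

    def check(self, session_id: str, estimated_usd: float,
              now: Optional[float] = None) -> None:
        """Record the call if it fits both limits, otherwise raise."""
        now = time.monotonic() if now is None else now
        window = self.calls[session_id]
        # Drop timestamps that have aged out of the 60-second window.
        while window and now - window[0] >= 60.0:
            window.popleft()
        if len(window) >= self.max_calls:
            raise BudgetExceeded(f"rate limit hit for session {session_id}")
        if self.spent[session_id] + estimated_usd > self.max_usd:
            raise BudgetExceeded(f"spend cap hit for session {session_id}")
        window.append(now)
        self.spent[session_id] += estimated_usd
```

Calling `guard.check(session_id, estimated_usd)` before each model request means a runaway retry loop fails fast with `BudgetExceeded` instead of quietly accumulating charges.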
What right-sizing actually looks like
The fix is not “use AI less.” It is routing each task to the least expensive model that clears the quality bar, and caching everything that repeats. Classify the request first — cheaply, or with a rule — then send it to the matching tier:
Intent / yes-no / sentiment → tier-1 small model (or local), often near-free
Short extraction / summaries → tier-1 small model
Templated drafting → tier-1, with the static prompt cached
Multi-step reasoning / code → tier-2 mid model
Genuinely hard / high-stakes → tier-3 frontier, and only here
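The mapping above can be sketched as a rule-based router. This is a minimal illustration, not the routing used in any particular product: the `Tier` enum, the task-kind labels, and the decision to send unknown task kinds to the mid tier are all assumptions made for the example.

```python
from enum import Enum


class Tier(Enum):
    SMALL = "tier-1 small model"
    MID = "tier-2 mid model"
    FRONTIER = "tier-3 frontier model"


# Task kinds mirror the mapping above; the label strings are illustrative.
ROUTES = {
    "intent": Tier.SMALL,
    "yes_no": Tier.SMALL,
    "sentiment": Tier.SMALL,
    "extraction": Tier.SMALL,
    "summary": Tier.SMALL,
    "templated_draft": Tier.SMALL,
    "multi_step_reasoning": Tier.MID,
    "code": Tier.MID,
    "hard": Tier.FRONTIER,
}


def route(task_kind: str, high_stakes: bool = False) -> Tier:
    """Pick the cheapest tier expected to clear the quality bar.

    Assumption: high-stakes requests always go to the frontier tier, and
    unrecognized task kinds fall back to the mid tier rather than the top.
    """
    if high_stakes:
        return Tier.FRONTIER
    return ROUTES.get(task_kind, Tier.MID)
```

The classification step that produces `task_kind` is left out here; as the text notes, it can be a rule or a cheap model call. Whether the fallback should be the mid tier or the frontier tier is a quality-versus-cost choice to measure rather than assume.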
Pair that with prompt caching on static blocks, an output cache for repeated identical queries, and a hard per-session spend cap. A workflow built this way commonly lands 60–80% in savings versus the all-frontier version, at the same output quality — because the expensive model only runs on the requests that actually need it. That is not a hypothetical: see the support-chatbot case study where measured per-action routing did exactly that.
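The output cache mentioned above can be as simple as a dictionary keyed on a hash of the model name and the exact prompt. The sketch below assumes a caller-supplied `call_model(model, prompt)` function, which is a hypothetical stand-in for whatever client the application already uses; it only helps for byte-identical repeats and has no expiry or size limit.

```python
import hashlib
from typing import Callable, Dict


class OutputCache:
    """Exact-match cache for repeated identical (model, prompt) requests."""

    def __init__(self, call_model: Callable[[str, str], str]):
        # call_model is a hypothetical stand-in for the real API client.
        self._call_model = call_model
        self._store: Dict[str, str] = {}
        self.hits = 0
        self.misses = 0

    @staticmethod
    def _key(model: str, prompt: str) -> str:
        # A NUL separator keeps ("ab", "c") distinct from ("a", "bc").
        return hashlib.sha256(f"{model}\x00{prompt}".encode("utf-8")).hexdigest()

    def complete(self, model: str, prompt: str) -> str:
        """Return a cached answer when available, otherwise call the model once."""
        key = self._key(model, prompt)
        if key in self._store:
            self.hits += 1
            return self._store[key]
        self.misses += 1
        result = self._call_model(model, prompt)
        self._store[key] = result
        return result
```

Tracking `hits` and `misses` gives a direct measure of how many paid calls the cache avoided, which is the number worth watching before investing in anything more elaborate.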
The threshold where self-hosting wins
Per-token pricing is cheap at low volume and punishing at high volume. Owned hardware is the inverse: a fixed monthly cost and a marginal cost per call near zero. They cross at a volume, not a vibe. A mid-size open model on a single workstation-class machine amortizes to roughly ninety dollars a month plus power; at a hundred thousand calls a month that is on the order of a tenth of a cent per call — under any API rate for the same task.
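The crossover can be computed rather than guessed. The sketch below uses the ninety-dollar monthly figure from the paragraph above and leaves power out, so it slightly flatters self-hosting; the API price per call is a hypothetical placeholder to be replaced with a measured value for the workload in question.

```python
def self_hosted_cost_per_call(fixed_monthly_usd: float, calls_per_month: int) -> float:
    """Amortized cost per call when marginal cost is treated as zero."""
    return fixed_monthly_usd / calls_per_month


def break_even_calls(fixed_monthly_usd: float, api_usd_per_call: float) -> float:
    """Monthly call volume above which owned hardware is cheaper than the API."""
    return fixed_monthly_usd / api_usd_per_call


# Figure from the text: roughly $90/month amortized, 100,000 calls a month.
per_call = self_hosted_cost_per_call(90.0, 100_000)  # about a tenth of a cent

# Hypothetical API price of half a cent per call, for illustration only.
threshold = break_even_calls(90.0, api_usd_per_call=0.005)
```

Under that assumed API price the lines cross at roughly eighteen thousand calls a month; a cheaper API call pushes the threshold higher, and adding power or maintenance to the fixed cost does the same.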
So the question is never “API or self-host” in the abstract. It is: at your call volume and task mix, which side of the line are you on, and are your high-volume, low-complexity calls the ones still going to a metered frontier endpoint? EyesInAI’s hardware profiles and the cost-per-task numbers on the leaderboard are there to find that line for your workload.
What to do this week (no infra change)
Three moves recover most of the leak before you touch your architecture. First, turn on prompt caching for any static system prompt; it is usually a config flag. Second, set a per-user or per-session spend cap and a rate limit, so a loop can’t become a headline. Third, take your highest-volume task and ask whether it truly needs the model it is on; move it down a tier and measure quality. Most teams find that it does not and that they were overpaying, which is the whole token tax in one sentence: you were billed for structure, not for intelligence.
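To size the first move before flipping the flag, a rough estimate helps. The sketch below reuses the earlier figures (a 4,000-token system block, ten messages a session, a thousand sessions a day) and assumes that the first turn of each session is billed at full price while later turns read the cached block at a discounted rate. The 0.1 multiplier is a placeholder assumption; actual cached-read pricing, cache-write surcharges, and cache lifetimes vary by provider.

```python
def billed_equivalent_tokens(system_tokens: int, messages_per_session: int,
                             sessions_per_day: int,
                             cached_read_multiplier: float) -> float:
    """Daily static-prompt tokens, weighted by what each turn is billed at.

    Assumes one full-price send per session and a cache that stays warm for
    the rest of that session (a simplification, not a provider guarantee).
    """
    first_turns = system_tokens * sessions_per_day
    repeat_turns = system_tokens * (messages_per_session - 1) * sessions_per_day
    return first_turns + repeat_turns * cached_read_multiplier


# No caching: every turn re-sends the block at full price.
uncached = billed_equivalent_tokens(4_000, 10, 1_000, cached_read_multiplier=1.0)

# Hypothetical caching discount: repeat reads billed at 10% of the input price.
cached = billed_equivalent_tokens(4_000, 10, 1_000, cached_read_multiplier=0.1)

savings_fraction = 1 - cached / uncached
```

With a multiplier of 1.0 the function reproduces the forty million tokens a day from the earlier example; with the assumed discount, the static block costs a small fraction of that.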
That’s the why. The Token Tax Playbook is the how — four concrete levers (routing, caching & caps, self-hosting, and a self-correcting routing loop), each grounded in what we actually run on the DLR chatbot. The measured proof is in the case studies.