Route each task to the least expensive model that still passes
This is the single biggest lever, and the first one to pull on the model side, because it needs no new hardware. Most chatbot turns are not hard. A pricing lookup, an invoice pull, a yes/no intent: these do not need a frontier model, but they get one by default. Routing means: classify the request cheaply, then send it to the lowest tier that clears a measured quality bar.
How we measure that bar is the whole point of EyesInAI. For DLR we benchmarked 5 candidate models across 11 chatbot actions — 33 tasks, 10 runs each, 1,650 graded calls, every answer scored against ground truth by a judge model. The output is a routing table: the least expensive model that still answers each action correctly. The measured result was 31.3% in savings per turn at the same accuracy — about $0.0062 → $0.0042 a turn. That is the difference between a guess (“Haiku is probably fine for that”) and a number you can defend to finance.
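A minimal sketch of how graded benchmark calls could be reduced to a routing table. This is an illustration, not the EyesInAI implementation: the `GradedCall` record, the `build_routing_table` function, the 0.95 pass bar, and the model and action names are all assumptions made for the example.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class GradedCall:
    model: str       # candidate model label (hypothetical names below)
    action: str      # chatbot action, e.g. "pricing_lookup"
    passed: bool     # judge verdict against ground truth
    cost_usd: float  # measured cost of this call


def build_routing_table(calls, pass_bar=0.95):
    """Map each action to the cheapest model whose pass rate clears pass_bar.

    pass_bar is an assumed threshold; the source does not state one.
    """
    # (action, model) -> [runs, passes, total cost]
    stats = defaultdict(lambda: [0, 0, 0.0])
    for c in calls:
        s = stats[(c.action, c.model)]
        s[0] += 1
        s[1] += int(c.passed)
        s[2] += c.cost_usd

    best = {}  # action -> (model, mean cost per call)
    for (action, model), (runs, passes, cost) in stats.items():
        if passes / runs < pass_bar:
            continue  # this model fails the quality bar for this action
        mean_cost = cost / runs
        if action not in best or mean_cost < best[action][1]:
            best[action] = (model, mean_cost)
    return {action: model for action, (model, _) in best.items()}


# Illustrative data only: 10 runs per (model, action) pair.
calls = (
    [GradedCall("small", "pricing_lookup", True, 0.001)] * 10
    + [GradedCall("large", "pricing_lookup", True, 0.006)] * 10
    + [GradedCall("small", "refund_dispute", i < 6, 0.001) for i in range(10)]
    + [GradedCall("large", "refund_dispute", True, 0.006)] * 10
)
routing = build_routing_table(calls)
```

In this toy data the small model passes every pricing lookup, so that action moves down a tier, while it passes only 6 of 10 refund disputes, so that action stays on the larger model.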
The first move for any team: take your highest-volume action, benchmark two or three models against your real data, and move it down a tier if the cheaper one passes. The cost-per-task leaderboard is the starting shortlist; the DLR case study is the full method.
Cache what repeats, cap what can run away
Two cheap, structural fixes that stop the most common leaks — and the ones a team can ship this week without changing a model.
Prompt caching. A chatbot re-sends a long static system prompt and tool definitions on every turn. Cached, that text is billed once and read back cheaply. On DLR we load skills selectively — only the handful a session needs — precisely because the cache-creation payload is the expensive part of the first turn; loading all of them would add over a dollar to an Opus turn. We track the cache-read-to-creation ratio per turn so we can see the caching actually working.
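One way to watch that ratio is to accumulate the cache counters from each turn's usage record. The sketch below assumes the usage payload carries `cache_creation_input_tokens` and `cache_read_input_tokens` (the field names the Anthropic API reports for prompt caching); the `CacheTracker` class and the token counts are hypothetical, and how DLR actually computes the ratio is not stated here.

```python
from dataclasses import dataclass


@dataclass
class CacheTracker:
    """Running cache-read-to-creation ratio for one session."""
    created: int = 0  # tokens written to the cache so far
    read: int = 0     # tokens served from the cache so far

    def record(self, usage):
        """Add one turn's usage dict and return the ratio after that turn."""
        self.created += usage.get("cache_creation_input_tokens", 0)
        self.read += usage.get("cache_read_input_tokens", 0)
        return self.ratio()

    def ratio(self):
        # None until something has been cached, to avoid dividing by zero.
        if self.created == 0:
            return None
        return self.read / self.created


# Illustrative session: the first turn writes the cache, later turns read it.
tracker = CacheTracker()
ratios = [
    tracker.record({"cache_creation_input_tokens": 8000, "cache_read_input_tokens": 0}),
    tracker.record({"cache_creation_input_tokens": 0, "cache_read_input_tokens": 8000}),
    tracker.record({"cache_creation_input_tokens": 0, "cache_read_input_tokens": 8000}),
]
```

A ratio that climbs turn over turn means the static prompt is being read back from cache; a ratio stuck near zero suggests the cached prefix is being invalidated and rewritten.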
Hard caps. The runaway-bill story is always a loop with no ceiling. The DLR bridge runs a $1.50 per-turn budget backstop passed straight to the model runner (so a looping tool call can’t exceed it) and a $20/day tenant spend cap with alerts at a per-request threshold, at 75% of the cap, and on any daily spike over 2× the rolling average. Past the cap, new chats are refused until the day rolls over. None of that is exotic; it is the difference between a $20 day and a headline.
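The cap logic is small enough to sketch. The dollar figures, the 75% mark, and the 2× spike rule come from the description above; the `SpendGuard` class, the alert names, and the $0.50 per-request alert threshold are assumptions for illustration, since the actual per-request value is not given.

```python
from dataclasses import dataclass, field

PER_TURN_BUDGET_USD = 1.50   # ceiling handed to the model runner; enforced there
DAILY_CAP_USD = 20.00        # tenant spend cap per day
WARN_FRACTION = 0.75         # alert when spend crosses 75% of the cap
SPIKE_MULTIPLE = 2.0         # alert when today exceeds 2x the rolling average
PER_REQUEST_ALERT_USD = 0.50 # hypothetical value; not stated in the source


@dataclass
class SpendGuard:
    spent_today: float = 0.0
    history: list = field(default_factory=list)  # totals of previous days

    def allow_new_chat(self):
        """New chats are refused once the daily cap is reached."""
        return self.spent_today < DAILY_CAP_USD

    def record_turn(self, cost_usd):
        """Add one turn's cost and return the alerts it triggers."""
        alerts = []
        if cost_usd > PER_REQUEST_ALERT_USD:
            alerts.append("per_request")
        before = self.spent_today
        self.spent_today = before + cost_usd
        # Fire threshold alerts only on the turn that crosses the line.
        if before < WARN_FRACTION * DAILY_CAP_USD <= self.spent_today:
            alerts.append("cap_75_percent")
        if self.history:
            spike_at = SPIKE_MULTIPLE * sum(self.history) / len(self.history)
            if before <= spike_at < self.spent_today:
                alerts.append("daily_spike")
        return alerts

    def roll_over(self):
        """Close the day: archive today's total and reset the counter."""
        self.history.append(self.spent_today)
        self.spent_today = 0.0


guard = SpendGuard(spent_today=14.90, history=[5.0, 5.0])
alerts = guard.record_turn(0.20)  # this turn crosses the 75% mark
```

Checking for the crossing rather than the level keeps each alert to one firing per day instead of one per turn.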
Move high-volume, low-complexity calls onto owned hardware
Per-token pricing is cheap at low volume and punishing at high volume. Owned hardware is the inverse — a fixed monthly cost and a marginal cost per call near zero. They cross at a volume, not a vibe, and the crossing point depends on the price of the API task you’d replace:
- Replacing a ~$0.14/M-token API task → break-even near ~110M tokens/day
- Replacing a ~$0.50/M-token tier → break-even near ~30M tokens/day
- Replacing a ~$1.50/M premium tier → break-even near ~10M tokens/day
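The arithmetic behind those crossing points is one division: the daily fixed cost of the box divided by the API price per million tokens. The sketch below assumes a fixed cost of about $15/day, a figure inferred from the three break-even numbers above and not stated in the source, and it treats the marginal cost per local call as zero.

```python
# Assumption: inferred from the three break-even figures, not a published number.
FIXED_COST_PER_DAY_USD = 15.0


def break_even_m_tokens_per_day(fixed_cost_per_day_usd, api_price_per_m_usd):
    """Millions of tokens per day at which the API bill equals the fixed cost.

    Ignores the marginal cost of a local call (treated as near zero).
    """
    return fixed_cost_per_day_usd / api_price_per_m_usd


break_evens = {
    price: break_even_m_tokens_per_day(FIXED_COST_PER_DAY_USD, price)
    for price in (0.14, 0.50, 1.50)
}

for price, m_tokens in break_evens.items():
    print(f"${price:.2f}/M tokens: break-even near {m_tokens:.0f}M tokens/day")
```

Under that assumption the three tiers land at roughly 107M, 30M, and 10M tokens per day, in line with the rounded figures above. A more expensive box moves every crossing point up in proportion.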
So the question is never “API or self-host” in the abstract — it is whether your high-volume, low-complexity calls (intent classification, short extraction, templated drafting) are still hitting a metered frontier endpoint when a local open-weight model would pass for a fraction of the cost. Match the hardware to the model you actually need:
- 128GB unified-memory box (Mac Studio, or an AMD Ryzen AI Max+ 395 mini PC) — runs a 70B-class model locally; the privacy-first, single-team workhorse. Bandwidth, not capacity, sets the speed.
- Single-GPU / workstation — for higher throughput on mid-size models when one box is the bottleneck.
- Cluster — only past the break-even volumes above, where the fixed cost is genuinely amortized.
Live, dated build costs, TCO, and which models actually run at a usable speed on each tier are on the Cost & TCO tab and the hardware profiles. And see the deeper write-up on the app-layer economics of where the line falls.
Where DLR actually is on this: today the DLR chatbot answers via the Claude API; the Mac Mini hosts the bridge, the benchmark server, and the router, not the model. That setup runs well at our current single-team volume, which is exactly why self-hosting is the next lever for us, not the first. The honest sequence is: route and cap first (done), then move the highest-volume classified actions onto a local model once the volume justifies the box. We’re building toward it, and we’ll publish the real numbers when we cross the line, same as we did with routing.
And a deliberate caution on this lever: self-hosting is the narrowest of the four, and the hype oversells it. At team scale a single box is usually neither faster nor cheaper than a cloud API once concurrency, the accuracy gap, and real subscription value are counted — it wins for batch jobs, data sovereignty, and one-narrow-task-at-high-volume, not as a drop-in replacement for a shared assistant. Read the self-hosting reality check before you buy hardware — the durable savings for most teams are in routing and caching above, not the box.
Make routing a live, self-correcting loop — and ship it as a plugin
A routing table goes stale the moment a new model ships or prices change. The lever that keeps the savings is making it a closed loop instead of a one-time tuning. On DLR that loop already runs:
- the benchmark measures candidate models on the real actions;
- an “Examine” step proposes a re-tiering when a cheaper model now passes;
- a human approves it;
- the chatbot reads the approved routing config and re-tiers live — no redeploy;
- a morning report shows exactly what changed.
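The steps above can be sketched as two small functions: one that proposes re-tiering from fresh benchmark results, and one that applies only what a human approved. The function names, the data shapes, and the 0.95 pass bar are hypothetical; the real “Examine” step and config format are not described here.

```python
def examine(current_routing, benchmark, pass_bar=0.95):
    """Propose a cheaper model wherever one now clears the quality bar.

    benchmark: {action: {model: (pass_rate, cost_per_call_usd)}}
    """
    proposals = []
    for action, current_model in current_routing.items():
        results = benchmark.get(action, {})
        if current_model not in results:
            continue  # no fresh measurement for the model in use
        current_cost = results[current_model][1]
        cheaper_and_passing = [
            (cost, model)
            for model, (rate, cost) in results.items()
            if rate >= pass_bar and cost < current_cost
        ]
        if cheaper_and_passing:
            cost, model = min(cheaper_and_passing)
            proposals.append({
                "action": action,
                "from": current_model,
                "to": model,
                "saving_per_call": current_cost - cost,
            })
    return proposals


def apply_approved(current_routing, proposals, approved_actions):
    """Return a new routing config containing only human-approved changes."""
    updated = dict(current_routing)
    for p in proposals:
        if p["action"] in approved_actions:
            updated[p["action"]] = p["to"]
    return updated


# Illustrative numbers only.
benchmark = {
    "pricing_lookup": {"large": (1.00, 0.006), "small": (0.98, 0.001)},
    "refund_dispute": {"large": (1.00, 0.006), "small": (0.70, 0.001)},
}
current = {"pricing_lookup": "large", "refund_dispute": "large"}
proposals = examine(current, benchmark)
live = apply_approved(current, proposals, approved_actions={"pricing_lookup"})
```

Keeping the approval as an explicit argument means nothing re-tiers without sign-off, and the list of proposals doubles as the raw material for a change report.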
Alongside it runs a shadow-compare mode: when the router would pick a non-Claude model, the bridge quietly runs that suggestion in the background, judges it against the answer the user already got, and logs the pair. It is how we earn confidence in a routing change before it ever touches a real answer — observe first, switch second. A self-repair step disables any model that starts failing, so the loop is safe to leave running.
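A rough sketch of the shadow-compare idea, under stated assumptions: `run_model` and `judge` are caller-supplied callables standing in for the real model runner and judge model, the record fields are invented for the example, and the call is shown synchronously where a real bridge would run it off the request path.

```python
def shadow_compare(prompt, served_answer, candidate_model, run_model, judge, log):
    """Run the router's suggestion off the user path and log the comparison.

    The served answer is returned unchanged no matter what the shadow run does.
    """
    record = {
        "prompt": prompt,
        "candidate": candidate_model,
        "served": served_answer,
    }
    try:
        shadow_answer = run_model(candidate_model, prompt)
        record["shadow"] = shadow_answer
        record["verdict"] = judge(prompt, served_answer, shadow_answer)
    except Exception as exc:
        # A failing shadow run is logged, never surfaced to the user.
        record["error"] = repr(exc)
    log.append(record)
    return served_answer


# Stub callables for illustration; real ones would call a model and a judge.
def fake_run_model(model, prompt):
    return "Plan B costs $12/month."


def fake_judge(prompt, served, shadow):
    return "match" if served == shadow else "differs"


log = []
answer = shadow_compare(
    "How much is plan B?",
    "Plan B costs $12/month.",
    "candidate-model",
    fake_run_model,
    fake_judge,
    log,
)
```

The logged pairs are what would let a reviewer see how often the cheaper suggestion agrees with the answer users actually received before approving a switch.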
The plugin pattern. The reusable shape of all of this is a chatbot whose model choice is data-driven, not hardcoded. We’re building that on openclaw.ai, an open, self-hostable agent gateway whose provider layer already abstracts many model backends (including local Ollama / LM Studio endpoints) and tracks per-model token cost. A routing client that calls EyesInAI’s approved table, drops into that provider slot, and falls back to a hardcoded tier if the service is unreachable, becomes a portable “chatbot + measured routing” building block: the same loop DLR runs, packaged so the next product gets it by configuration, not a rebuild.
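The fallback behaviour is the part worth sketching. The example below does not use any openclaw.ai or EyesInAI API: `fetch_table` is a caller-supplied callable standing in for whatever retrieves the approved table, and `RoutingClient`, `FALLBACK_MODEL`, and the action names are hypothetical.

```python
FALLBACK_MODEL = "default-tier-model"  # hypothetical hardcoded tier


class RoutingClient:
    """Resolve an action to a model from a fetched table, with a safe default."""

    def __init__(self, fetch_table, fallback_model=FALLBACK_MODEL):
        self._fetch_table = fetch_table  # callable returning {action: model}
        self._fallback = fallback_model
        self._table = {}

    def refresh(self):
        """Try to load the approved table; keep the last good one on failure."""
        try:
            table = self._fetch_table()
        except Exception:
            return False  # service unreachable or response unusable
        self._table = dict(table)
        return True

    def model_for(self, action):
        # Unknown actions, or no table at all, resolve to the hardcoded tier.
        return self._table.get(action, self._fallback)


def unreachable():
    raise ConnectionError("routing service unreachable")


offline = RoutingClient(unreachable)
offline_ok = offline.refresh()
offline_model = offline.model_for("pricing_lookup")

online = RoutingClient(lambda: {"pricing_lookup": "small"})
online_ok = online.refresh()
online_model = online.model_for("pricing_lookup")
```

Keeping the last good table after a failed refresh is a design choice made for this sketch; a stricter client could drop to the hardcoded tier for every action as soon as the service stops responding.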
The order matters
These compound, but the cheap ones come first. Cap and cache before you touch a model — they’re config flags and they stop the bleeding. Then route, because measuring your real tasks is where the durable 30%-plus lives. Then self-host the high-volume tail once the math crosses. Then close the loop so none of it goes stale. The whole arc is one sentence: stop paying for structure you can fix, and pay only for the intelligence you actually need.
We run these levers on our own product before we recommend them. The measured proof — models tested, calls graded, dollars saved — is in the case studies.
See it applied