Model routing · an equipment-rental company

A support chatbot, routed to the right model for each task — 31.3% in savings, same accuracy.

An equipment-rental company runs a field-service support chatbot that answers customer questions, looks up accounts, and quotes prices. It was running every request through one strong, expensive model. We asked a simpler question: which model is the least expensive one that still gets each job right?

31.3% lower model cost

11 chatbot actions

5 models compared

1,650 graded responses

The approach

We don’t trust a single run or a leaderboard score. For each of the 11 actions the chatbot performs, we wrote tasks containing the real data the model would see and the correct answer it should produce. Then we ran every candidate model against every task 10 times — 1,650 responses in all — and had a strong independent model grade each one against the ground truth.

Repetition over single runs. One good answer can be luck. Ten runs per task expose the models that are consistently right.
Graded against ground truth. A model “passes” a task only when an independent judge scores it ≥4 of 5 against the known-correct answer, not on vibes.
Cost tracked per task. Every response’s real token cost is recorded, so we can price each routing decision exactly.
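
As a minimal sketch of the grading rule above, the snippet below turns repeated judge scores into an average score and a pass rate. The task names and scores are invented for illustration, and counting the pass rate as the share of graded responses scoring at least 4 is an assumption; the exact aggregation used in the benchmark is not spelled out here.

```python
from statistics import mean

PASS_THRESHOLD = 4  # judge score (out of 5) needed to pass, per the text


def summarize(scores_by_task):
    """Return (average score, pass rate) over every graded response.

    scores_by_task maps a task id to the list of judge scores from its
    repeated runs. Pass rate here is the share of responses scoring at
    or above the threshold (an assumed aggregation).
    """
    all_scores = [s for runs in scores_by_task.values() for s in runs]
    avg = mean(all_scores)
    rate = sum(s >= PASS_THRESHOLD for s in all_scores) / len(all_scores)
    return avg, rate


# Hypothetical data: two tasks, ten runs each, for one model.
example_scores = {
    "lookup_account": [5, 5, 5, 4, 5, 5, 5, 5, 4, 5],
    "quote_price": [5, 3, 5, 5, 4, 5, 2, 5, 5, 5],
}

avg_score, pass_rate = summarize(example_scores)
print(f"avg score {avg_score:.2f}/5, pass rate {pass_rate:.0%}")
```

In this made-up example the second task fails twice in ten runs, which a single lucky run could easily have hidden.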

How the models did

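Accuracy across all 11 actions, ranked by average score. Note that the top-priced model, claude-sonnet-4-6, leads on pass rate by only one point, while a much less expensive open-weights model, gpt-oss-120b, has the highest average score: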

Model	Maker	Avg score	Pass rate
gpt-oss-120b	OpenAI (open weights, via Groq)	4.80/5	94%
gemini-2.5-flash	Google	4.71/5	92%
claude-sonnet-4-6	Anthropic	4.61/5	95%
gpt-4.1	OpenAI	4.42/5	83%
claude-haiku-4-5	Anthropic	4.33/5	80%

Judge: claude-opus-4-7. Pass = score ≥ 4/5 vs ground truth, averaged across all 11 actions.

The result: a routing table

The deliverable isn’t “use this model.” It’s a per-action map of the least expensive model that still passes. Most actions route to a light, fast model; only the genuinely hard reasoning tasks need the premium one.

gpt-oss-120b · 8 of 11 actions

Cheapest model that still clears the bar — the routing choice for 8 of 11 actions at 96–100% accuracy.

gemini-2.5-flash · 2 of 11 actions

Best value on two tasks (pricing economics, intent routing) where it ties on accuracy and undercuts on cost.

claude-sonnet-4-6 · 1 of 11 actions

Reserved for the single hardest task (dynamic pricing lookup), where it is the only model to reach 100%.
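
One way such a per-action map could be built is sketched below: for each action, pick the least expensive model whose pass rate clears a bar, and fall back to the most accurate model when none does. The function name, the 0.95 bar, the model names, and every number in the example are hypothetical; the selection rule actually used in the review may differ.

```python
from typing import Dict, Tuple

# action -> model -> (pass_rate, cost_per_task_usd); all values are made up.
Results = Dict[str, Dict[str, Tuple[float, float]]]


def build_routing_table(results: Results, bar: float = 0.95) -> Dict[str, str]:
    """Map each action to the cheapest model that clears the pass-rate bar."""
    table = {}
    for action, by_model in results.items():
        passing = [m for m, (rate, _) in by_model.items() if rate >= bar]
        if passing:
            # Least expensive model among those that clear the bar.
            table[action] = min(passing, key=lambda m: by_model[m][1])
        else:
            # No model clears the bar: fall back to the most accurate one.
            table[action] = max(by_model, key=lambda m: by_model[m][0])
    return table


example = {
    "account_lookup": {"model_a": (1.00, 0.0010), "model_b": (1.00, 0.0060)},
    "hard_pricing": {"model_a": (0.70, 0.0010), "model_b": (1.00, 0.0060)},
}

routing = build_routing_table(example)
print(routing)
```

Here the routine action goes to the cheaper model because both pass, while the hard action stays on the expensive model because it is the only one that clears the bar.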

Routed this way, the blended cost per chatbot turn drops from $0.0062 to $0.0042 — 31.3% in savings, with no measurable drop in answer quality.
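
The per-turn costs above are shown rounded to four decimals, so recomputing the saving from them will not land exactly on the headline. The sketch below computes the saving implied by the rounded figures and the range that four-decimal rounding allows. It assumes the 31.3% headline was computed from unrounded costs, which this page does not show.

```python
# Per-turn costs as displayed in the text (rounded to four decimals).
baseline = 0.0062
routed = 0.0042


def saving(before: float, after: float) -> float:
    """Fractional cost reduction going from `before` to `after`."""
    return 1 - after / before


# Saving implied by the rounded figures themselves.
point = saving(baseline, routed)

# Extremes still consistent with rounding to four decimals (+/- 0.00005).
half = 0.00005
low = saving(baseline - half, routed + half)
high = saving(baseline + half, routed - half)

print(f"from rounded costs: {point:.1%}")
print(f"range consistent with rounding: {low:.1%} to {high:.1%}")
```

The rounded costs give roughly 32.3%, and the headline 31.3% sits inside the range that rounding permits (about 30.9% to 33.6%), so the two figures are consistent.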

What the review console gave them

This number didn’t come from a spreadsheet of list prices. It came from running the work through our admin review console — the same tool we point at every client — which does three things a leaderboard can’t:

It benchmarked the agent harness itself, not just models. The expensive baseline here wasn’t a raw API call; it was work running through the Claude Agent SDK. The console treats that whole harness as a candidate and measures it against lower-cost API models on the exact same graded tasks, answering the question that actually saves money: which agent work can move to a lower-cost model at equal quality?
It priced the saving two honest ways. The headline figure is computed on the metered-equivalent basis (apples-to-apples against other API models); the console also tracks the true subscription burn — which, with harness cache and loop overhead, runs meaningfully higher. Both are shown, so the saving is one you can take to a finance team.
It’s repeatable. When a new model ships or the chatbot’s tasks change, the review re-runs and the routing table updates, so the savings don’t decay silently as the model landscape moves.

That last point matters more each quarter: agent-SDK usage is moving off flat subscription pricing onto metered credits, so work that feels free today is about to carry a per-call bill. Knowing in advance which of it re-routes cleanly is the whole game.

What this surfaced

The strongest model is not the most cost-effective everywhere — only 1 of 11 actions actually needed the top-priced model.
An open-weights model (gpt-oss-120b) was the best routing choice for 8 of 11 actions, winning on accuracy-per-dollar against the frontier models.
On a deliberately harder task suite the models genuinely separated — pass rates spread from 80% to 95%, so routing decisions rest on measured differences, not ties.
Two tasks exposed ambiguous ground truth that most models struggled with — caught by the suite-quality guard, not blamed on the models.
Per-action routing cut blended model cost 31.3% versus running everything on the strongest model, with no measurable accuracy loss.

Why you can trust the number

The 31.3% figure on this page is computed from a real benchmark run: 5 models × 33 tasks × 10 runs = 1,650 graded responses. We deliberately don’t name the client or show their data; the methodology and the numbers are real, the customer specifics stay private. When tasks are too ambiguous for any model to answer reliably, our process flags them as a ground-truth problem to fix, not as a model to blame.

Running an AI feature on one expensive model?

We’ll benchmark your real tasks and hand you a routing table — the least expensive model that still gets each job right.

Tell us about your use case