An equipment-rental company runs a field-service support chatbot that answers customer questions, looks up accounts, and quotes prices. It was running every request through one strong, expensive model. We asked a simpler question: which model is the cheapest one that still gets each job right?
We don’t trust a single run or a leaderboard score. For each of the 11 actions the chatbot performs, we wrote tasks (25 in total) containing the real data the model would see and the correct answer it should produce. Then we ran every candidate model against every task 10 times (5 models × 25 tasks × 10 runs = 1,250 responses in all) and had a strong independent model grade each one against the ground truth.
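The loop described above can be sketched roughly as follows. This is a minimal illustration, not the harness actually used: `run_benchmark`, `fake_call_model`, `fake_grade`, and the placeholder model and task names are all hypothetical stand-ins for real API calls and the real judge.

```python
from itertools import product

def run_benchmark(models, tasks, runs, call_model, grade):
    """Run every model on every task `runs` times and grade each response."""
    results = []
    for model, task in product(models, tasks):
        for run in range(runs):
            response = call_model(model, task["prompt"])
            score = grade(response, task["ground_truth"])  # 1-5 scale
            results.append(
                {"model": model, "action": task["action"], "run": run, "score": score}
            )
    return results

# Placeholder inputs (assumptions): 5 models, 25 tasks spread over 11 actions.
models = [f"model-{i}" for i in range(5)]
tasks = [
    {"action": f"action-{i % 11}", "prompt": f"prompt {i}", "ground_truth": f"answer {i}"}
    for i in range(25)
]

def fake_call_model(model, prompt):
    # Stub: a real harness would call the model's API here.
    return prompt.replace("prompt", "answer")

def fake_grade(response, ground_truth):
    # Stub: a real harness would ask the judge model for a 1-5 score.
    return 5 if response == ground_truth else 1

results = run_benchmark(models, tasks, 10, fake_call_model, fake_grade)
print(len(results))  # 5 * 25 * 10 = 1250
```

Repeating each task matters because a model that answers correctly once can still fail on a later run; the repeated scores are what make a per-action pass rate meaningful.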
Accuracy across all 11 actions, ranked. Note that the strongest model isn’t far ahead of a much cheaper open-weights one on routine work:
| Model | Maker | Avg score | Pass rate |
|---|---|---|---|
| claude-sonnet-4-6 | Anthropic | 4.98/5 | 100% |
| gemini-2.5-flash | Google | 4.84/5 | 95% |
| gpt-oss-120b | OpenAI (open weights, via Groq) | 4.78/5 | 92% |
| claude-haiku-4-5 | Anthropic | 4.70/5 | 87% |
| gpt-4.1 | OpenAI | 4.52/5 | 86% |
Judge: claude-sonnet-4-6. Pass = score ≥ 4/5 vs ground truth, averaged across all 11 actions.
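As a rough sketch of how the two table columns could be derived from graded responses: the pass threshold of 4 comes from the caption above, while the choice to average per action first and then across actions is an assumption about what "averaged across all 11 actions" means. The function name `summarize` and the sample scores are made up for illustration.

```python
from collections import defaultdict
from statistics import mean

PASS_THRESHOLD = 4  # from the caption: pass = score >= 4 out of 5

def summarize(graded):
    """graded: iterable of (action, score) pairs for a single model.

    Returns (average score, pass rate), each computed per action and then
    averaged across actions (an assumed aggregation order).
    """
    by_action = defaultdict(list)
    for action, score in graded:
        by_action[action].append(score)
    avg_score = mean(mean(scores) for scores in by_action.values())
    pass_rate = mean(
        mean(1 if s >= PASS_THRESHOLD else 0 for s in scores)
        for scores in by_action.values()
    )
    return avg_score, pass_rate

# Made-up scores for two actions, two runs each.
sample = [("lookup", 5), ("lookup", 3), ("pricing", 4), ("pricing", 5)]
avg, rate = summarize(sample)
print(avg, rate)  # 4.25 0.75
```

Averaging per action first keeps an action with many tasks from dominating the headline number.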
The deliverable isn’t “use this model.” It’s a per-action map of the cheapest model that still passes. Most actions go to a cheap, fast model; only the genuinely hard reasoning tasks need the expensive one.
Cheap tier: the cheapest model that still passes. It handles 9 of the 11 actions, the routine ones, at 96–100% accuracy.
Strong tier: reserved for the two hardest reasoning tasks (customer lookup and dynamic pricing), where weaker models slip.
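One way to derive such a per-action map from benchmark results is sketched below. Everything here is illustrative: the model names, the per-turn costs, the pass rates, the `order_status` and `ambiguous_request` actions, and the 0.95 cutoff are assumptions, not figures from the benchmark.

```python
def build_routing_table(pass_rates, cost_per_turn, min_pass_rate=0.95):
    """Pick, for each action, the cheapest model whose pass rate clears the bar.

    pass_rates:    {action: {model: pass rate between 0 and 1}}
    cost_per_turn: {model: dollars per turn}
    An action where no model passes maps to None, so it can be reviewed as a
    possible ground-truth problem instead of being routed.
    """
    table = {}
    for action, rates in pass_rates.items():
        passing = [m for m, r in rates.items() if r >= min_pass_rate]
        table[action] = min(passing, key=lambda m: cost_per_turn[m]) if passing else None
    return table

# Hypothetical inputs for illustration only.
cost_per_turn = {"cheap-model": 0.0010, "strong-model": 0.0030}
pass_rates = {
    "order_status":      {"cheap-model": 0.98, "strong-model": 1.00},
    "customer_lookup":   {"cheap-model": 0.80, "strong-model": 1.00},
    "dynamic_pricing":   {"cheap-model": 0.70, "strong-model": 1.00},
    "ambiguous_request": {"cheap-model": 0.40, "strong-model": 0.50},
}

routing = build_routing_table(pass_rates, cost_per_turn)
for action, model in routing.items():
    print(action, "->", model)
```

Returning `None` when nothing passes keeps the table honest: that action needs its tasks or ground truth fixed before any model is assigned to it.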
Routed this way, the blended cost per chatbot turn drops from about $0.0030 to about $0.0020, a 32.5% saving with no measurable drop in answer quality.
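The arithmetic behind a blended cost can be sketched as a traffic-weighted average. The traffic shares and per-model costs below are invented for illustration. Note also that the rounded dollar figures quoted above give roughly 33%, so the 32.5% presumably comes from unrounded costs.

```python
def blended_cost(traffic_share, cost_per_turn):
    """Traffic-weighted average cost per turn; shares must sum to 1."""
    if abs(sum(traffic_share.values()) - 1.0) > 1e-9:
        raise ValueError("traffic shares must sum to 1")
    return sum(share * cost_per_turn[model] for model, share in traffic_share.items())

def saving(before, after):
    """Fractional cost reduction going from `before` to `after`."""
    return (before - after) / before

# Hypothetical costs and traffic split (assumptions, not benchmark data).
cost_per_turn = {"cheap-model": 0.0010, "strong-model": 0.0030}
traffic_share = {"cheap-model": 0.7, "strong-model": 0.3}

routed = blended_cost(traffic_share, cost_per_turn)
print(f"hypothetical blended cost: ${routed:.4f}")

# Using the rounded figures quoted on the page:
page_saving = saving(0.0030, 0.0020)
print(f"saving from rounded page figures: {page_saving:.1%}")
```

The saving depends on how much traffic the cheap model can absorb, which is why the per-action pass rates have to be measured before the cost estimate means anything.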
The 32.5% figure on this page is computed from a real benchmark run: 5 models × 25 tasks × 10 runs, 1,250 graded responses. We deliberately don’t name the client or show their data; the methodology and the numbers are real, the customer specifics stay private. When tasks are too ambiguous for any model to answer reliably, our process flags them as a ground-truth problem to fix, not as a model to blame.
We’ll benchmark your real tasks and hand you a routing table — the cheapest model that still gets each job right.
Tell us about your use case