The question a leaderboard can’t answer
Most teams pick one strong, expensive model and send every request to it — the safe default. But across a real product, most requests are routine: a lookup, a classification, a short summary. Paying frontier prices for work a cheaper model handles perfectly is pure waste, and a public benchmark can't find it for you, because the answer depends entirely on your tasks and yourdefinition of “correct.”
The review reframes the question from “which model is best?” to “for each thing my product actually does, what is the cheapest model that still gets it right?” That is a measurement problem, and measurement is what the harness is for. (For the grading method underneath it, see how we test.)
What the review actually does
For each action your system performs, we build tasks from the real datathe model would see and the correct answer it should produce. Then every candidate model runs every task many times over — repetition, not a single lucky pass — and an independent strong model grades each response against the known-correct answer. Every response's real token cost is recorded, so each routing decision is priced exactly.
The deliverable is not “switch to model X.” It is a per-action routing table: cheap fast models for the routine majority, the expensive one reserved for the genuinely hard reasoning tasks that need it — each choice backed by a measured pass rate. (How that table is enforced at runtime is the subject of prompt routing.)
Benchmarking the agent harness, not just the model
The newer capability — and the one most teams have no way to measure — is treating an agent harness itself as a candidate. When work runs through the Claude Agent SDK (the same engine behind coding agents and assistants), the cost isn't just the model's tokens; it's the whole loop — repeated context, tool calls, retries. The console runs that full harness against the same tasks as the raw API models and puts them side by side.
That answers a question with real money behind it: which subscription-billed agent work can move to a cheaper API model at equal quality? It matters now because agent-SDK usage is moving off flat subscription pricing onto metered credits — so work that felt “free” under a subscription is about to carry a per-call bill. Knowing in advance which of it can be re-routed without losing quality is the difference between a smooth migration and a surprise invoice.
Two honest cost numbers, never one flattering one
Cost comparisons are easy to fudge, so the review reports two bases and lets you toggle between them:
- Metered-equivalent (clean).What the work would cost at the resolved model's normal API rates — the apples-to-apples basis for routing math against other API models.
- True subscription burn (billed).The actual spend including the harness's cache churn and loop overhead — which in testing ran roughly 2.6× the clean figure. This is what you really pay today.
Featuring both is the point: the clean number keeps the routing comparison fair, and the billed number shows the true cost of leaving the work on the harness. We default to the conservative basis for the headline savings so the figure is one you can defend to a finance team, not just a marketing slide.
What you walk away with
Concretely, a review hands you four things:
- A per-action routing table — the cheapest passing model for each job.
- A defensible savings number — computed from graded runs on real tasks, with both cost bases shown.
- A harness-vs-API verdict — which agent work can move to a cheaper model before metered billing lands.
- A repeatable re-review — re-run it when a new model ships or your tasks change, and the table updates.
On our first client's support chatbot, the review moved a pricing-lookup task off the premium agent harness onto a cheaper model at equal graded quality — a 95.9% cost reduction on that action. See the full write-up in the DLR case study.
Why the numbers are trustworthy
Every figure comes from a real benchmark run — models × tasks × repeated runs, each response graded against ground truth. Case studies publish aggregate metrics and public model names only; client data, raw responses, and internal task text never leave the review. And when a task is too ambiguous for any model to answer reliably, the process flags it as a ground-truth problem to fix — not a model to blame. The method is the product; the honesty is what makes the savings number worth anything.