The least expensive model that still does the job

A leaderboard tells you which model is best in the abstract. It can't tell you which model is right for your task at the lowest cost — because it never saw your task. Our admin review console does exactly that: it runs every candidate model, and the agent harness itself, against your real work, grades the answers, and hands back a routing table with a defensible savings number attached.

The question a leaderboard can’t answer

Most teams pick one strong, expensive model and send every request to it, because that is the safe default. But across a real product, most requests are routine: a lookup, a classification, a short summary. Paying frontier prices for work a lighter, less expensive model handles just as well is pure waste, and a public benchmark can't find it for you, because the answer depends entirely on your tasks and your definition of “correct.”

The review reframes the question from “which model is best?” to “for each thing my product actually does, what is the least expensive model that still gets it right?” That is a measurement problem, and measurement is what the harness is for. (For the grading method underneath it, see how we test.)

What the review actually does

For each action your system performs, we build tasks from the real data the model would see and the correct answer it should produce. Then every candidate model runs every task many times over (repetition, not a single lucky pass), and an independent strong model grades each response against the known-correct answer. Every response's real token cost is recorded, so each routing decision is priced exactly.

Real tasks

Your actual data and ground truth, not generic prompts.

Graded, repeated

Many runs per task, scored ≥4/5 vs truth — consistency, not vibes.

Priced per decision

Every token cost captured, so savings are computed, not guessed.
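As a rough illustration of how graded, repeated runs turn into a pass rate and a per-response cost, here is a minimal Python sketch. It is not the console's actual code. The `Run` record, the `summarize` helper, the model labels, and all cost figures are hypothetical. The only detail taken from the text above is the pass mark of 4 out of 5, and treating it as a per-run pass rule is an assumption.

```python
from collections import defaultdict
from dataclasses import dataclass

PASS_SCORE = 4  # assumption: a run passes when the grader scores it >= 4 out of 5


@dataclass(frozen=True)
class Run:
    model: str       # candidate model label (hypothetical names below)
    action: str      # the product action the task belongs to
    score: int       # grader score from 1 to 5 against ground truth
    cost_usd: float  # token cost recorded for this single response


def summarize(runs):
    """Return {(action, model): (pass_rate, mean_cost_usd)} over all repeated runs."""
    grouped = defaultdict(list)
    for run in runs:
        grouped[(run.action, run.model)].append(run)

    summary = {}
    for key, group in grouped.items():
        passes = sum(1 for r in group if r.score >= PASS_SCORE)
        mean_cost = sum(r.cost_usd for r in group) / len(group)
        summary[key] = (passes / len(group), mean_cost)
    return summary


# Made-up data: two models, one action, four repeated runs each.
example_runs = [
    Run("light-model", "pricing_lookup", 5, 0.002),
    Run("light-model", "pricing_lookup", 4, 0.002),
    Run("light-model", "pricing_lookup", 4, 0.002),
    Run("light-model", "pricing_lookup", 3, 0.002),
    Run("premium-model", "pricing_lookup", 5, 0.040),
    Run("premium-model", "pricing_lookup", 5, 0.040),
    Run("premium-model", "pricing_lookup", 5, 0.040),
    Run("premium-model", "pricing_lookup", 4, 0.040),
]

example_summary = summarize(example_runs)
```

In this made-up data the lighter model passes three of its four runs and the premium model passes all four, which is the kind of gap a single lucky pass would hide.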

The deliverable is not “switch to model X.” It is a per-action routing table: lighter, low-cost models for the routine majority, the premium one reserved for the genuinely hard reasoning tasks that need it — each choice backed by a measured pass rate. (How that table is enforced at runtime is the subject of prompt routing.)
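The selection rule behind such a table can be sketched in a few lines of Python: for each action, keep the least expensive model whose measured pass rate clears a threshold. This is an illustrative sketch, not the console's implementation. The `build_routing_table` function, the 0.95 threshold, the action names, and every number are assumptions made for the example.

```python
def build_routing_table(summary, min_pass_rate=0.95):
    """Pick, per action, the cheapest model whose pass rate meets the threshold.

    summary maps (action, model) -> (pass_rate, mean_cost_usd).
    Actions with no passing model map to None so they can be reviewed by hand.
    The 0.95 default is an assumed threshold, not a figure from the review.
    """
    best = {}
    for (action, model), (pass_rate, cost) in summary.items():
        best.setdefault(action, None)
        if pass_rate < min_pass_rate:
            continue  # this model does not pass for this action
        current = best[action]
        if current is None or cost < current[1]:
            best[action] = (model, cost)
    return {action: (choice[0] if choice else None) for action, choice in best.items()}


# Hypothetical measurements: (pass_rate, mean cost per response in USD).
example_measurements = {
    ("pricing_lookup", "light-model"): (0.98, 0.002),
    ("pricing_lookup", "premium-model"): (1.00, 0.040),
    ("contract_analysis", "light-model"): (0.60, 0.003),
    ("contract_analysis", "premium-model"): (0.97, 0.055),
}

example_table = build_routing_table(example_measurements)
```

With these invented numbers the routine lookup routes to the lighter model, while the harder analysis task stays on the premium one because the lighter model fails the threshold there.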

Benchmarking the agent harness, not just the model

The newer capability — and the one most teams have no way to measure — is treating an agent harness itself as a candidate. When work runs through the Claude Agent SDK (the same engine behind coding agents and assistants), the cost isn't just the model's tokens; it's the whole loop — repeated context, tool calls, retries. The console runs that full harness against the same tasks as the raw API models and puts them side by side.

That answers a question with real money behind it: which subscription-billed agent work can move to a lower-cost API model at equal quality? It matters now because agent-SDK usage is moving off flat subscription pricing onto metered credits — so work that felt “free” under a subscription is about to carry a per-call bill. Knowing in advance which of it can be re-routed without losing quality is the difference between a smooth migration and a surprise invoice. We track this shift as a live beat on Pricing & Billing Watch — including Anthropic’s June 2026 pause of exactly this change (paused, not cancelled).

Two honest cost numbers, never one flattering one

Cost comparisons are easy to fudge, so the review reports two bases and lets you toggle between them:

Metered-equivalent (clean). What the work would cost at the resolved model's normal API rates: the apples-to-apples basis for routing math against other API models.
True subscription burn (billed). The actual spend including the harness's cache churn and loop overhead, which in testing ran roughly 2.6× the clean figure. This is what you really pay today.

Featuring both is the point: the clean number keeps the routing comparison fair, and the billed number shows the true cost of leaving the work on the harness. We default to the conservative basis for the headline savings so the figure is one you can defend to a finance team, not just a marketing slide.
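To see how the two bases change a savings figure, consider this small Python sketch. The per-call dollar amounts and the `savings_pct` helper are hypothetical. Only the rough 2.6× ratio between the billed and clean figures comes from the text above.

```python
def savings_pct(current_cost, replacement_cost):
    """Percentage saved by moving from current_cost to replacement_cost."""
    if current_cost <= 0:
        raise ValueError("current_cost must be positive")
    return 100.0 * (current_cost - replacement_cost) / current_cost


# Hypothetical per-call figures in USD.
clean_harness_cost = 0.050   # metered-equivalent (clean) basis
overhead_ratio = 2.6         # billed / clean, the rough ratio quoted above
billed_harness_cost = clean_harness_cost * overhead_ratio
api_model_cost = 0.010       # candidate lower-cost API model

savings_clean = savings_pct(clean_harness_cost, api_model_cost)    # about 80%
savings_billed = savings_pct(billed_harness_cost, api_model_cost)  # about 92.3%
```

Because the billed figure is larger than the clean one, the same replacement model always shows a smaller percentage saving on the clean basis, which makes it the more cautious of the two numbers to quote.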

What you walk away with

Concretely, a review hands you four things:

A per-action routing table — the least-expensive passing model for each job.
A defensible savings number — computed from graded runs on real tasks, with both cost bases shown.
A harness-vs-API verdict — which agent work can move to a lower-cost model before metered billing lands.
A repeatable re-review — re-run it when a new model ships or your tasks change, and the table updates.

On our first client's support chatbot, the review moved a pricing-lookup task off the premium agent harness onto a lower-cost model at equal graded quality — a 95.9% cost reduction on that action. See the full write-up in the DLR case study.

Why the numbers are trustworthy

Every figure comes from a real benchmark run — models × tasks × repeated runs, each response graded against ground truth. Case studies publish aggregate metrics and public model names only; client data, raw responses, and internal task text never leave the review. And when a task is too ambiguous for any model to answer reliably, the process flags it as a ground-truth problem to fix — not a model to blame. The method is the product; the honesty is what makes the savings number worth anything.

Running an AI feature on one expensive model?

We'll benchmark your real tasks — and your agent harness — and hand you a routing table with a savings number you can defend.

Tell us about your use case

EyesInAI