How prompt routing works — and the part no gateway does for you

“Routing” gets used for three different things. Two are solved infrastructure: one key for many models, and automatic failover across providers. The third (which model should this prompt go to?) is the one that decides your cost, speed, and whether the answer is even right. It is also the one the gateways can't do, because they route on price and availability, not on quality. Here's the whole stack, and where measurement comes in.

“Routing” is three stacked layers

The hard layer: decide WHICH model a given prompt should go to. Send the hard agentic call to the strong model, the cheap classification to the cheap one. This is capability-aware routing — and it is the part a gateway can’t do for you, because it has no idea which model is actually better at your task.

The catch — gateways route on price/latency, not quality

Same prompt — “a hard multi-step reasoning task” — routed four ways. Only the cheap-7b model is wrong on it; mid and frontier pass. Watch which strategy lands on a correct model.

[Interactive demo, captured with one strategy selected. cheap-7b: fails this task. mid-tier: passes this task (routed here). frontier: passes this task.]

Routes on: their classifier (opaque), which picks a "balanced" model.

Convenient, but a black box: the selection logic is theirs, heuristic, and not benchmarkable. Treat it as a baseline to beat, not the router.

The first three strategies are oblivious to whether the answer is correct: they optimize availability, order, or price. Only a strategy fed a measured quality signal reliably lands on a model that actually passes. The gateway moves the request; it doesn't know which model is right.
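The contrast can be sketched as a small simulation. This is an illustration under stated assumptions, not the demo's actual code: the four strategy names, the prices, and the availability flags are hypothetical, and only the pass/fail outcomes come from the demo description. The opaque classifier strategy is not modeled, since its logic is not visible.

```typescript
// Hypothetical candidates mirroring the demo: only cheap-7b fails the task.
// Prices are invented for illustration; they are not measured figures.
interface DemoModel {
  id: string;
  usdPerMTok: number;
  available: boolean;
  passesTask: boolean; // only the measured strategy reads this field
}

const demoModels: DemoModel[] = [
  { id: "cheap-7b", usdPerMTok: 0.2, available: true, passesTask: false },
  { id: "mid-tier", usdPerMTok: 1.5, available: true, passesTask: true },
  { id: "frontier", usdPerMTok: 15, available: true, passesTask: true },
];

type PickFn = (models: DemoModel[]) => DemoModel | undefined;

// Cheapest model in a list, without mutating the input.
function cheapestOf(models: DemoModel[]): DemoModel | undefined {
  return models.slice().sort((a, b) => a.usdPerMTok - b.usdPerMTok)[0];
}

const demoStrategies: Array<{ label: string; pick: PickFn }> = [
  // Quality-blind: first model whose provider is up.
  { label: "availability", pick: (ms) => ms.filter((m) => m.available)[0] },
  // Quality-blind: whatever the caller listed first.
  { label: "order", pick: (ms) => ms[0] },
  // Quality-blind: lowest price among available models.
  { label: "price", pick: (ms) => cheapestOf(ms.filter((m) => m.available)) },
  // Quality-aware: lowest price among models measured to pass the task.
  { label: "measured", pick: (ms) => cheapestOf(ms.filter((m) => m.available && m.passesTask)) },
];

const demoOutcome: Record<string, string> = {};
for (const s of demoStrategies) {
  const chosen = s.pick(demoModels);
  demoOutcome[s.label] = chosen
    ? `${chosen.id} (${chosen.passesTask ? "pass" : "fail"})`
    : "none";
}
```

With these inputs the three quality-blind strategies all land on cheap-7b, and only the measured one lands on mid-tier.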

Layer 1 — the unified API

The base layer is the one most teams already rely on: a single OpenAI-compatible endpoint and key that fronts hundreds of models across providers. Change the model field and the request goes somewhere else — no new SDK, no new key, one bill. OpenRouter is the best-known example (400+ models, 60+ providers). This is pure convenience: it unifies access, but makes no decision for you. You still say which model.
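A minimal sketch of what "change the model field" means in practice. It only builds the request and sends nothing. The endpoint is assumed to be OpenRouter's OpenAI-compatible chat completions URL, the two model IDs are illustrative and should be checked against the current catalog, and `buildChatCall` is a hypothetical helper.

```typescript
// Assumption: OpenRouter's OpenAI-compatible chat completions endpoint.
const CHAT_URL = "https://openrouter.ai/api/v1/chat/completions";

interface ChatCall {
  url: string;
  method: "POST";
  headers: Record<string, string>;
  body: string;
}

// Hypothetical helper: same URL, same key, same body shape for every model.
function buildChatCall(model: string, prompt: string, apiKey: string): ChatCall {
  return {
    url: CHAT_URL,
    method: "POST",
    headers: {
      Authorization: `Bearer ${apiKey}`,
      "Content-Type": "application/json",
    },
    body: JSON.stringify({
      model, // the only field that differs between the two calls below
      messages: [{ role: "user", content: prompt }],
    }),
  };
}

// Illustrative model IDs; the caller still has to choose one.
const callA = buildChatCall("openai/gpt-4o-mini", "Classify this ticket.", "YOUR_API_KEY");
const callB = buildChatCall("anthropic/claude-3.5-sonnet", "Classify this ticket.", "YOUR_API_KEY");
```

Each object can be passed to `fetch(call.url, call)` in a runtime that provides `fetch`. Nothing in the request encodes which of the two models is the better choice.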

Layer 2 — provider routing (load-balance + failover)

Under the unified API, for any one model that several providers host, the gateway load-balances across those hosts on price, latency, and uptime — with automatic failover if one is down. You can pin or exclude providers per request (OpenRouter exposes order, only, ignore, max_price, preferred_max_latency, and more). This is the layer behind the speed and price differences a benchmark measures — and it's genuinely useful reliability engineering. But note what it optimizes: getting the model you already chose, cheaply and reliably. It never questions the choice itself.
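A rough local simulation of this layer, assuming a simplified model of host selection rather than any gateway's real algorithm. The `Host` and `HostPrefs` types, host names, prices, and latencies are all hypothetical; `only`, `ignore`, and the price ceiling loosely mirror the per-request controls named above.

```typescript
// One provider hosting the model the caller already chose.
interface Host {
  provider: string;
  usdPerMTok: number;
  p50LatencyMs: number;
  healthy: boolean;
}

// Hypothetical per-request preferences.
interface HostPrefs {
  only?: string[];
  ignore?: string[];
  maxUsdPerMTok?: number;
}

// Returns healthy, permitted hosts ordered by price, then latency.
// The first entry serves the call; the rest are the failover order.
function rankHosts(hosts: Host[], prefs: HostPrefs = {}): Host[] {
  return hosts
    .filter((h) => h.healthy)
    .filter((h) => !prefs.only || prefs.only.indexOf(h.provider) !== -1)
    .filter((h) => !prefs.ignore || prefs.ignore.indexOf(h.provider) === -1)
    .filter((h) => prefs.maxUsdPerMTok === undefined || h.usdPerMTok <= prefs.maxUsdPerMTok)
    .sort((a, b) => a.usdPerMTok - b.usdPerMTok || a.p50LatencyMs - b.p50LatencyMs);
}

const modelHosts: Host[] = [
  { provider: "host-a", usdPerMTok: 0.9, p50LatencyMs: 400, healthy: false },
  { provider: "host-b", usdPerMTok: 1.0, p50LatencyMs: 300, healthy: true },
  { provider: "host-c", usdPerMTok: 1.0, p50LatencyMs: 250, healthy: true },
  { provider: "host-d", usdPerMTok: 2.5, p50LatencyMs: 150, healthy: true },
];

const defaultOrder = rankHosts(modelHosts).map((h) => h.provider);
const withoutC = rankHosts(modelHosts, { ignore: ["host-c"] }).map((h) => h.provider);
const underTwoUsd = rankHosts(modelHosts, { maxUsdPerMTok: 2 }).map((h) => h.provider);
```

Every input to `rankHosts` describes a host of one fixed model. The function has no parameter through which the choice of model could be questioned.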

Vercel AI Gateway is the same layer, built into the AI SDK and the obvious default if you already deploy on Vercel. You write streamText({ model: 'anthropic/claude-opus-4.7' }) and control transport with providerOptions.gateway: order / only to pin providers, sort: 'cost' | 'ttft' | 'tps' to rank them, caching: 'auto', per-provider timeouts, model-level fallbacks, and BYOK — pass-through pricing, no per-token markup. It's excellent at this job. But read its sort: 'cost' precisely: it finds the cheapest host of the model you named, not the cheapest model that still answers correctly. Like every gateway, its “best experience” means uptime and latency — it has no measure of whether the model is right for the task.
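As a configuration sketch, the arguments described above might be assembled like this. The option names are copied from the paragraph and have not been checked against the AI SDK's published types; `GatewayTransportOptions` is a local, hypothetical type (not an SDK import), and the provider slugs are assumptions.

```typescript
// Local type covering only the options the paragraph names. It is not
// imported from the AI SDK and may not match the real option types.
interface GatewayTransportOptions {
  order?: string[];
  only?: string[];
  sort?: "cost" | "ttft" | "tps";
  caching?: "auto";
}

const gatewayOptions: GatewayTransportOptions = {
  order: ["anthropic", "bedrock"], // assumed provider slugs
  sort: "cost", // ranks hosts of the named model, not models
  caching: "auto",
};

// Shape of the arguments that would be handed to streamText.
const streamArgs = {
  model: "anthropic/claude-opus-4.7", // the caller's choice; nothing below changes it
  prompt: "Summarize this incident report.",
  providerOptions: { gateway: gatewayOptions },
};
```

Whatever `sort` is set to, `streamArgs.model` stays the value the caller wrote.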

Layer 3 — model selection (the hard one)

The third sense of routing is the one that matters most and is hardest: deciding which model a given prompt should go to at all. Send the hard agentic call to the strong model; send the cheap classification to the cheap one. Done well, this is where the cost savings and the quality live at once. Two built-in flavors exist, and it's worth being precise about them:

Auto Router (openrouter/auto). You send the prompt to that model ID and its classifier picks a model for you (with a cost_quality_tradeoff dial, 0–10). Convenient, but a black box — the selection logic is theirs, heuristic, and not benchmarkable. For anything you want to measure, treat it as a baseline to beat, not the router itself.

The models[] array (fallback). Instead of a single model you pass an ordered list, and the gateway tries them in order, falling through to the next on error, rate-limit, or context-length failure. That gives you an explicit primary→challenger→fallback chain you define. It's deterministic — but it routes on order and availability, not quality. If your first-listed model is the wrong tool for the task, the request still goes there.
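The fallback behavior can be sketched locally. This is a simulation with a stand-in backend, not the gateway's implementation; `runFallbackChain`, `FakeBackend`, and the model names are hypothetical.

```typescript
type FailureReason = "error" | "rate_limit" | "context_length";

type Attempt =
  | { kind: "ok"; text: string }
  | { kind: "failed"; reason: FailureReason };

// Stand-in for a real model call, so the sketch runs without a network.
type FakeBackend = (model: string, prompt: string) => Attempt;

interface ChainResult {
  servedBy: string | null;
  text: string | null;
  skipped: Array<{ model: string; reason: FailureReason }>;
}

// Tries models in the declared order and stops at the first one that
// responds. It never checks whether the response is correct.
function runFallbackChain(models: string[], prompt: string, call: FakeBackend): ChainResult {
  const skipped: Array<{ model: string; reason: FailureReason }> = [];
  for (const model of models) {
    const attempt = call(model, prompt);
    if (attempt.kind === "ok") {
      return { servedBy: model, text: attempt.text, skipped };
    }
    skipped.push({ model, reason: attempt.reason });
  }
  return { servedBy: null, text: null, skipped };
}

// The primary is rate-limited; the other models answer.
const rateLimitedPrimary: FakeBackend = (model) =>
  model === "primary"
    ? { kind: "failed", reason: "rate_limit" }
    : { kind: "ok", text: `answer from ${model}` };

// Every model answers, whether or not the answer is right.
const alwaysAnswers: FakeBackend = (model) => ({ kind: "ok", text: `answer from ${model}` });

const chain = ["primary", "challenger", "fallback"];
const afterRateLimit = runFallbackChain(chain, "hello", rateLimitedPrimary);
const healthyPrimary = runFallbackChain(chain, "hello", alwaysAnswers);
```

In the second run the request stays on `primary` simply because it responded. A wrong answer is not a failure the chain can see.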

The honest caveat: most routing is quality-blind

Here is the load-bearing point, and the demo above is built around it. OpenRouter, like every general gateway (LiteLLM, Portkey, Cloudflare, Helicone), routes on availability, price, latency, and your declared order. It has no idea which model is actually better for a given task. The capability-aware routing everyone wants (“hard agentic calls to the strong model, cheap classification to the cheap one”) is a logic layer that sits on top, using the gateway as the execution substrate.
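The split between the logic layer and the execution substrate might look like the following sketch. The task kinds, the table contents, and the `GatewaySend` function type are hypothetical; in practice the table would be filled from measurement, not hand-written.

```typescript
type TaskKind = "agentic" | "classification";

// Hypothetical routing table mapping a task kind to a model ID.
const routingTable: Record<TaskKind, string> = {
  agentic: "frontier",
  classification: "cheap-7b",
};

// Stand-in for the gateway: it executes whatever model it is handed.
type GatewaySend = (model: string, prompt: string) => string;

function routeThenExecute(kind: TaskKind, prompt: string, send: GatewaySend): string {
  const model = routingTable[kind]; // logic layer: decide which model
  return send(model, prompt); // gateway: run the call
}

// Fake gateway that echoes the model it was asked to use.
const echoGateway: GatewaySend = (model, prompt) => `[${model}] ${prompt}`;

const routedAgentic = routeThenExecute("agentic", "Plan and run the migration.", echoGateway);
const routedCheap = routeThenExecute("classification", "Is this spam?", echoGateway);
```

The decision lives entirely in `routingTable`, outside the gateway, which is why the quality of that table is the part worth measuring.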

A few products do route on predicted quality, namely Martian (proprietary “model-mapping”) and Not Diamond (a learned recommender), and they're the real comparison. But their routing model is a black box: you can't see why a prompt went where, or check the decision against your own tasks. The difference EyesInAI draws isn't quality-aware vs. not; it's opaque quality vs. measured, inspectable quality.

The market, in one table

The tooling spans raw load-balancers, self-hosted policy proxies, and a handful of genuine capability routers. Most are OpenAI-compatible, so proxies like LiteLLM (budgets, per-tenant keys, complexity/semantic routing) or Portkey (composable conditional routing + guardrails) sit in front of OpenRouter for richer policy than native fallback. A few (Martian and Not Diamond) do route on predicted quality, but through a proprietary, opaque model you can't inspect. Read the quality-aware column carefully, and click any row for the full review:

Tool | Routes on | Quality-aware?

Read the “Quality-aware?” column: most gateways route on availability, price, latency, or order (No / Heuristic). A few route on quality, but via a proprietary, opaque model, or single-vendor (the provider’s own router, which can only ever pick its own models). Only a measured signal lets you see why a task routes where, across vendors, on your own tasks. Click any row for the full review.

Note the two kinds of quality. OpenRouter Fusion (June 2026) buys quality at inference time: fan a request out to a panel and let a judge fuse the answers, paying more compute per request. EyesInAI buys it at measurement time: benchmark first so you know the cheapest single model that passes, then route there. They stack: measure to pick the panel and the judge, fuse when a task is worth the spend, and route to one cheap model when it isn’t.

Where EyesInAI fits — the quality signal

Routers move the request; they don't know which model is right. That missing input — which model is actually correct, per task, at what cost — is exactly what this site measures. Pass-rate per task, cost per call, latency, and variance across repeated runs are the signal a capability-aware router needs to make a defensible choice. The pattern is: use the benchmark to decide the model, use the gateway to execute the call.
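One way the four quantities could be computed from repeated benchmark runs, as a sketch only. The record shape and the sample rows are hypothetical, not the site's data or schema, and variance is taken here as the variance of the pass/fail outcome.

```typescript
// One benchmark run of one model on one task (hypothetical schema).
interface RunRecord {
  model: string;
  task: string;
  passed: boolean;
  usd: number;
  latencyMs: number;
}

interface QualitySignal {
  runs: number;
  passRate: number;
  meanUsd: number;
  meanLatencyMs: number;
  passVariance: number;
}

function summarizeRuns(records: RunRecord[], model: string, task: string): QualitySignal | null {
  const rows = records.filter((r) => r.model === model && r.task === task);
  const n = rows.length;
  if (n === 0) return null; // no measurement means no signal
  const passRate = rows.filter((r) => r.passed).length / n;
  const meanUsd = rows.reduce((sum, r) => sum + r.usd, 0) / n;
  const meanLatencyMs = rows.reduce((sum, r) => sum + r.latencyMs, 0) / n;
  // A pass/fail outcome with rate p has variance p * (1 - p).
  const passVariance = passRate * (1 - passRate);
  return { runs: n, passRate, meanUsd, meanLatencyMs, passVariance };
}

// Invented sample: four repeated runs of one model on one task.
const sampleRuns: RunRecord[] = [
  { model: "mid-tier", task: "extract", passed: true, usd: 0.002, latencyMs: 400 },
  { model: "mid-tier", task: "extract", passed: true, usd: 0.002, latencyMs: 600 },
  { model: "mid-tier", task: "extract", passed: true, usd: 0.002, latencyMs: 500 },
  { model: "mid-tier", task: "extract", passed: false, usd: 0.002, latencyMs: 500 },
];

const extractSignal = summarizeRuns(sampleRuns, "mid-tier", "extract");
```

A nonzero `passVariance` on repeated identical runs is the cue that a single passing run should not be trusted as the routing input.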

And the missing signal isn't a nicety — its absence is the documented failure mode. Routing on a classifier's guess instead of measured quality is how teams cut the bill and quietly break the product: the classifier can't tell the easy center from the hidden long tail at decision time, the cheap model fails confidently on the cases that matter, and the dashboards stay green while the quality loss — several times the savings — lands in a cost center nobody's watching. The defense is per-tier measurement and a downgrade gated on real traffic. That is exactly what the quality signal below provides.
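A downgrade gate of that kind might be sketched as follows. The tier names, thresholds, and sample counts are hypothetical; the point is only that each traffic tier is judged on its own graded sample, and that a thin sample blocks the downgrade.

```typescript
// Graded sample of real traffic for one tier, answered by the cheaper model.
interface TierSample {
  tier: string;
  graded: number;
  passed: number;
}

// Hypothetical policy: enough evidence, and a high enough pass rate.
interface GatePolicy {
  minSamples: number;
  minPassRate: number;
}

function mayDowngrade(sample: TierSample, policy: GatePolicy): boolean {
  if (sample.graded < policy.minSamples) return false; // too little evidence
  return sample.passed / sample.graded >= policy.minPassRate;
}

function tiersSafeToDowngrade(samples: TierSample[], policy: GatePolicy): string[] {
  return samples.filter((s) => mayDowngrade(s, policy)).map((s) => s.tier);
}

// Invented counts for illustration.
const trafficSamples: TierSample[] = [
  { tier: "easy-center", graded: 500, passed: 490 }, // 0.98 pass rate
  { tier: "long-tail", graded: 200, passed: 150 }, // 0.75 pass rate
  { tier: "new-feature", graded: 20, passed: 20 }, // perfect, but only 20 samples
];

const downgradeTiers = tiersSafeToDowngrade(trafficSamples, { minSamples: 100, minPassRate: 0.97 });
```

Here only `easy-center` clears the gate. The long tail stays on the stronger model, and the new feature waits for more graded traffic.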

That's not theoretical — it's the case study: we benchmarked a client's real chatbot tasks across five models and produced a per-task routing table that cut model cost ~32% with no loss in accuracy. The live routing recommendations apply the same idea to the public leaderboard. The gateway was never the hard part — knowing which model to point it at is.

And the two layer cleanly. Our router decides what (the least expensive model that still passes the task, from a live table you can inspect), then hands that model ID to a transport gateway like Vercel AI Gateway, which decides how: cheapest healthy provider, failover, caching. Concretely, our router returns a model plus a measured fallback_chain, and that maps straight onto the gateway's model + order / fallback options, so even the backup is a benchmark-good model, not a guess. Measure to pick the model; let the gateway run the call. You don't replace Vercel's router; you give it the one input it can't produce on its own.
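The hand-off can be sketched as a small mapping function. Only `model` and `fallback_chain` come from the text; the `GatewayPlan` shape and its `fallbackModels` field are hypothetical stand-ins for whatever fallback option the chosen gateway actually exposes.

```typescript
// What the router returns, per the description above.
interface RouterDecision {
  model: string;
  fallback_chain: string[];
}

// Hypothetical gateway-facing shape; real option names vary by gateway.
interface GatewayPlan {
  model: string;
  fallbackModels: string[];
}

// Keeps the measured order, dropping the primary and any repeats so the
// gateway never retries a model it has already tried.
function toGatewayPlan(decision: RouterDecision): GatewayPlan {
  const fallbackModels: string[] = [];
  for (const id of decision.fallback_chain) {
    if (id !== decision.model && fallbackModels.indexOf(id) === -1) {
      fallbackModels.push(id);
    }
  }
  return { model: decision.model, fallbackModels };
}

const gatewayPlan = toGatewayPlan({
  model: "mid-tier",
  fallback_chain: ["mid-tier", "frontier", "frontier"],
});
```

The mapping adds no decisions of its own. Every model the gateway may fall back to was already chosen by measurement.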

A newer move worth naming: inference-time quality. In June 2026 OpenRouter shipped Fusion: send one prompt, it fans out to a panel of models in parallel, then a judge model fuses their answers into one. Its own benchmark shows a two-model fusion topping every single model in the test. That's a real quality gain, but it buys it at a different layer: you spend more compute per request (N models plus a judge, every call), and the judge's synthesis is a black box you can't audit on your own tasks. Two of its sibling tools, Advisor (a cheap model pauses to consult a stronger one) and Subagent (the primary model delegates a sub-task to a cheaper one), are the same idea pointed at cost.

That sharpens, rather than threatens, the distinction. There are two kinds of quality: inference-time (pay compute now to ensemble and judge) and measurement-time (benchmark first so you already know the cheapest single model that clears the bar). They compose: measure to choose which models belong in the panel and which should judge; fuse when a task is genuinely worth the extra spend; and for the bulk of routine traffic, route to the one cheap model the measurement says will pass. Fusing every request is the expensive default; knowing when you don't need to is the measured one.
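A sketch of that composition as a decision rule, under stated assumptions: the pass rates and per-call costs are invented, `planRoute` is a hypothetical function, and the fused cost is modeled simply as the panel's calls plus one judge call. Whether a fused answer actually clears the bar would itself need measuring; the sketch does not claim it does.

```typescript
interface MeasuredModel {
  id: string;
  usdPerCall: number;
  passRate: number; // measured on the task in question
}

type RoutePlan =
  | { kind: "single"; model: string; usdPerCall: number }
  | { kind: "fuse"; panel: string[]; judge: string; usdPerCall: number };

function planRoute(
  models: MeasuredModel[],
  bar: number,
  panel: MeasuredModel[],
  judge: MeasuredModel,
  worthFusing: boolean
): RoutePlan | null {
  // Measurement-time quality: cheapest single model that clears the bar.
  const passing = models
    .filter((m) => m.passRate >= bar)
    .sort((a, b) => a.usdPerCall - b.usdPerCall);
  const cheapest = passing[0];
  if (cheapest) {
    return { kind: "single", model: cheapest.id, usdPerCall: cheapest.usdPerCall };
  }
  // Inference-time quality: only when the task justifies the extra spend.
  if (!worthFusing) return null;
  const panelCost = panel.reduce((sum, m) => sum + m.usdPerCall, 0);
  return {
    kind: "fuse",
    panel: panel.map((m) => m.id),
    judge: judge.id,
    usdPerCall: panelCost + judge.usdPerCall, // N panel calls plus one judge call
  };
}

// Invented measurements for illustration.
const cheapModel: MeasuredModel = { id: "cheap-7b", usdPerCall: 0.001, passRate: 0.97 };
const midModel: MeasuredModel = { id: "mid-tier", usdPerCall: 0.01, passRate: 0.99 };
const topModel: MeasuredModel = { id: "frontier", usdPerCall: 0.05, passRate: 0.995 };
const measured = [cheapModel, midModel, topModel];

const routinePlan = planRoute(measured, 0.95, [midModel, topModel], topModel, false);
const hardPlan = planRoute(measured, 0.999, [midModel, topModel], topModel, true);
const hardNotWorthIt = planRoute(measured, 0.999, [midModel, topModel], topModel, false);
```

With these numbers the routine request goes to the single cheap model, and fusion is planned only for the task where no single model clears the bar and the spend is justified. A `null` plan means the caller has to decide what to do when neither route is acceptable.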

The takeaway

Three layers, increasing in difficulty: unifying access (solved), choosing the cheapest reliable host for a model (solved), and choosing the right model for the prompt (the open problem). Gateways nail the first two and punt on the third, because the third needs something they don't have: a measured signal of which model is actually correct. Supply that signal and routing stops being a black-box convenience and becomes an engineering decision you can defend with numbers — which is the entire point of measuring models in the first place.
