Layer 1 — the unified API
The base layer is the one most teams already rely on: a single OpenAI-compatible endpoint and key that fronts hundreds of models across providers. Change the model field and the request goes somewhere else — no new SDK, no new key, one bill. OpenRouter is the best-known example (400+ models, 60+ providers). This is pure convenience: it unifies access, but makes no decision for you. You still say which model.
Layer 2 — provider routing (load-balance + failover)
Under the unified API, for any one model that several providers host, the gateway load-balances across those hosts on price, latency, and uptime — with automatic failover if one is down. You can pin or exclude providers per request (OpenRouter exposes order, only, ignore, max_price, preferred_max_latency, and more). This is the layer behind the speed and price differences a benchmark measures — and it's genuinely useful reliability engineering. But note what it optimizes: getting the model you already chose, cheaply and reliably. It never questions the choice itself.
Layer 3 — model selection (the hard one)
The third sense of routing is the one that matters most and is hardest: deciding which modela given prompt should go to at all. Send the hard agentic call to the strong model; send the cheap classification to the cheap one. Done well, this is where the cost savings and the quality live at once. Two built-in flavors exist, and it's worth being precise about them:
Auto Router (openrouter/auto). You send the prompt to that model ID and its classifier picks a model for you (with a cost_quality_tradeoff dial, 0–10). Convenient, but a black box — the selection logic is theirs, heuristic, and not benchmarkable. For anything you want to measure, treat it as a baseline to beat, not the router itself.
The models[] array (fallback). Instead of a single model you pass an ordered list, and the gateway tries them in order, falling through to the next on error, rate-limit, or context-length failure. That gives you an explicit primary→challenger→fallback chain you define. It's deterministic — but it routes on order and availability, not quality. If your first-listed model is the wrong tool for the task, the request still goes there.
The honest caveat: most routing is quality-blind
Here is the load-bearing point, and the demo above is built around it. OpenRouter — and every general gateway (LiteLLM, Portkey, Cloudflare, Helicone) — routes on availability, price, latency, and your declared order. It has no idea which model is actually better for a given task.The capability-aware routing everyone wants — “hard agentic calls to the strong model, cheap classification to the cheap one” — is a logic layer that sits on top, using the gateway as the execution substrate.
A few products do route on predicted quality — Martian(proprietary “model-mapping”) and Not Diamond(a learned recommender) — and they're the real comparison. But their routing model is a black box: you can't see whya prompt went where, or check the decision against your own tasks. The difference EyesInAI draws isn't quality-aware vs. not — it's opaque quality vs. measured, inspectable quality.
The market, in one table
The tooling spans raw load-balancers, self-hosted policy proxies, and a handful of genuine capability routers. Most are OpenAI-compatible, so proxies like LiteLLM (budgets, per-tenant keys, complexity/semantic routing) or Portkey (composable conditional routing + guardrails) sit in front of OpenRouter for richer policy than native fallback. A few — Martian and Not Diamond— do route on predicted quality, but through a proprietary, opaque model you can't inspect. Read the quality-aware column carefully, and click any row for the full review:
Read the “Quality-aware?” column: most gateways route on availability, price, latency, or order (No / Heuristic). A few route on quality — but via a proprietary, opaque model. Only a measured signal lets you see why a task routes where, on your own tasks. Click any row for the full review.
Where EyesInAI fits — the quality signal
Routers move the request; they don't know which model is right. That missing input — which model is actually correct, per task, at what cost — is exactly what this site measures. Pass-rate per task, cost per call, latency, and variance across repeated runs are the signal a capability-aware router needs to make a defensible choice. The pattern is: use the benchmark to decide the model, use the gateway to execute the call.
That's not theoretical — it's the case study: we benchmarked a client's real chatbot tasks across five models and produced a per-task routing table that cut model cost ~32% with no loss in accuracy. The live routing recommendations apply the same idea to the public leaderboard. The gateway was never the hard part — knowing which model to point it at is.
The takeaway
Three layers, increasing in difficulty: unifying access (solved), choosing the cheapest reliable host for a model (solved), and choosing the right model for the prompt (the open problem). Gateways nail the first two and punt on the third, because the third needs something they don't have: a measured signal of which model is actually correct. Supply that signal and routing stops being a black-box convenience and becomes an engineering decision you can defend with numbers — which is the entire point of measuring models in the first place.