The leaderboard is only as trustworthy as the procedure behind it. Here is exactly how every ranked model is scored — the same prompts, the same deterministic checks, the same cost math, every run. Nothing is hand-graded; everything below mirrors what the live bench actually does.
Every model that EyesInAI ranks goes through the same loop. The point is fairness and reproducibility — the same prompt, judged by the same rule, scored the same way each run.
Fourteen graded tests span the things models are actually used for (structured output, instruction-following, reasoning, long-context recall, tool calls), plus a latency ping. Each test has a real prompt and an exact rule for what counts as a pass; the structured-output test is shown here as an example.
Prompt: Return ONLY a valid JSON object with these exact keys: name (string), status (string "active"|"inactive"), score (integer 0-100), tags (array of 2 strings). No explanation, no markdown, no code block.
What it checks: Strict structured-output compliance.
Pass rule: Parses as JSON and has all 4 keys with the right types (status in {active, inactive}, score an int 0–100, tags a 2-element array).
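A deterministic check of this kind can be sketched in a few lines of Python. This is an illustration of the pass rule as stated, not the bench's actual validator: the function name `validate_structured_output` is hypothetical, and the sketch assumes extra keys are tolerated and that both tags must be strings.

```python
import json

REQUIRED_KEYS = {"name", "status", "score", "tags"}


def validate_structured_output(raw: str) -> bool:
    """Return True only if raw parses as JSON and meets every key/type rule."""
    try:
        data = json.loads(raw)
    except json.JSONDecodeError:
        # Markdown fences, prose, or truncated output all end up here.
        return False
    # Assumption: extra keys are allowed as long as the four required ones exist.
    if not isinstance(data, dict) or not REQUIRED_KEYS <= data.keys():
        return False
    score = data["score"]
    tags = data["tags"]
    return (
        isinstance(data["name"], str)
        and data["status"] in ("active", "inactive")
        # bool is a subclass of int in Python, so reject it explicitly.
        and isinstance(score, int)
        and not isinstance(score, bool)
        and 0 <= score <= 100
        and isinstance(tags, list)
        and len(tags) == 2
        and all(isinstance(tag, str) for tag in tags)
    )
```

Because the check is a pure function of the response text, the same answer always gets the same verdict, which is what makes the pass column reproducible.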
A single test produces a row of hard numbers. The leaderboard is built entirely from these — nothing is hand-entered.
Did the deterministic validator accept the answer?
Round-trip time for the call.
Output tokens ÷ latency — the throughput number.
Exact token counts from the provider response.
Tokens × per-token price, calculated at read time — never stored, so old runs re-price when rates change.
First 200 chars of the response, and the error if any — for auditing.
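The row described above could be modeled roughly as follows. This is a sketch under stated assumptions, not the bench's schema: the class name `ResultRow`, its field names, and the price-table shape (USD per million tokens, input and output priced separately) are all hypothetical. The point it illustrates is that throughput and cost are derived when the row is read rather than stored with it.

```python
from dataclasses import dataclass
from typing import Dict, Optional, Tuple

# Hypothetical price table: model -> (input, output) in USD per 1M tokens.
PriceTable = Dict[str, Tuple[float, float]]


@dataclass
class ResultRow:
    model: str
    passed: bool            # verdict from the deterministic validator
    latency_s: float        # round-trip time for the call, in seconds
    input_tokens: int       # exact counts from the provider response
    output_tokens: int
    preview: str            # first 200 characters of the response
    error: Optional[str] = None

    @property
    def tokens_per_second(self) -> float:
        """Output tokens divided by latency; 0.0 if latency is missing."""
        if self.latency_s <= 0:
            return 0.0
        return self.output_tokens / self.latency_s

    def cost_usd(self, prices: PriceTable) -> float:
        """Price the stored token counts against whatever table is passed in."""
        in_price, out_price = prices[self.model]
        total = self.input_tokens * in_price + self.output_tokens * out_price
        return total / 1_000_000


def make_preview(response_text: str) -> str:
    """Keep only the first 200 characters for auditing."""
    return response_text[:200]
```

Since `cost_usd` takes the price table as an argument, calling it on an old row with an updated table re-prices that run without touching the stored data.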
The code test executes model output in an isolated subprocess with an empty environment and a throwaway working directory — so a jailbroken model can’t read API keys or write anywhere it shouldn’t.
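One way to get the two properties named here (an empty environment and a throwaway working directory) is with Python's standard library. This is a minimal sketch, not the bench's runner: the function name `run_untrusted` and the timeout value are assumptions, and it covers only what the text describes. It does not by itself restrict filesystem or network access outside the temporary directory, and an empty environment may need adjusting on platforms such as Windows.

```python
import subprocess
import sys
import tempfile


def run_untrusted(code: str, timeout_s: float = 10.0) -> subprocess.CompletedProcess:
    """Run model-written Python in a child process with no inherited env vars."""
    with tempfile.TemporaryDirectory() as workdir:
        return subprocess.run(
            # -I is Python's isolated mode: ignores PYTHON* variables and user site.
            [sys.executable, "-I", "-c", code],
            env={},            # nothing from the parent, so no API keys leak in
            cwd=workdir,       # deleted as soon as the with-block exits
            capture_output=True,
            text=True,
            timeout=timeout_s,  # raises subprocess.TimeoutExpired on overrun
        )
```

A caller would inspect `returncode`, `stdout`, and `stderr` on the returned object to decide whether the generated code passed.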
One call per test. A truncated or flaky answer is scored as it happened — the experience a real user would get — instead of being quietly re-rolled until it passes.
Every model sees the identical prompt. No prompt is hand-optimized for one provider, so the ranking reflects the model, not the prompt engineering.
Each test caps max_tokens. The few providers that ignore the cap and stream unbounded output are bounded explicitly, so no model's speed numbers are skewed by a different output limit.
Cost comes from published per-token prices applied to the real token counts at read time. Change a price table and the whole history re-scores correctly.
Models that return "not found" / "deprecated" are auto-retired so the bench stops wasting calls on them — and non-generative safety classifiers are excluded entirely.
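The retirement rule can be pictured as a simple match on the provider's error text. The sketch below assumes a case-insensitive substring match and uses hypothetical names (`RETIRE_MARKERS`, `should_retire`); the bench's actual matching logic is not described in the text.

```python
from typing import Optional

# Markers taken from the text; matching by substring is an assumption.
RETIRE_MARKERS = ("not found", "deprecated")


def should_retire(error: Optional[str]) -> bool:
    """Return True if a provider error indicates the model no longer exists."""
    if not error:
        return False
    lowered = error.lower()
    return any(marker in lowered for marker in RETIRE_MARKERS)
```

Transient failures such as rate limits or timeouts do not match, so under this sketch they would leave the model on the active list.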