How we test models

The leaderboard is only as trustworthy as the procedure behind it. Here is exactly how every ranked model is scored — the same prompts, the same deterministic checks, the same cost math, every run. Nothing is hand-graded; everything below mirrors what the live bench actually does.

One prompt, every model, a deterministic verdict

Every model that EyesInAI ranks goes through the same loop. The point is fairness and reproducibility — the same prompt, judged by the same rule, scored the same way each run.

Send

The identical prompt goes to every selected model — no per-model prompt tuning, so the comparison is apples-to-apples.

Call once

Exactly one API call per test, with a 20-second timeout. No retries: a flaky or truncated answer counts as it happened, the way a real user would experience it.

Validate

A deterministic validator checks the response — JSON parsing, exact-string match, sandboxed code execution, rule counting. No model-judges-model grading; pass/fail is reproducible.

Record

Latency, input/output tokens, tokens-per-second, pass/fail, and a 200-char preview are written to the store under a run id.

Price

Cost is computed from per-token pricing at read time, never stored — so if a provider changes prices, every past result re-scores correctly.

The test suite

Eighteen graded tests span the things models are actually used for: structured output, instruction-following, reasoning, executed code (write, fix, refactor, algorithm), long-context recall, and tool calls, plus a latency ping. Each test has a fixed prompt and an explicit pass criterion; the structured-output test below is one example.

json

Structured output (max 150 tokens)

Prompt

Return ONLY a valid JSON object with these exact keys: name (string), status (string "active"|"inactive"), score (integer 0-100), tags (array of 2 strings). No explanation, no markdown, no code block.

Measures

Strict structured-output compliance.

Counts as a pass when

Parses as JSON AND has all 4 keys with the right types (status in {active,inactive}, score an int 0–100, tags a 2-element array).
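The pass criterion above is mechanical enough to sketch in a few lines. This is a minimal illustration, not the bench's actual validator: the function name validate_structured_output is hypothetical, and two details are assumptions (extra keys are tolerated, and a JSON boolean is not accepted as an integer score).

```python
import json


def validate_structured_output(response: str) -> bool:
    """Return True only if the response satisfies the structured-output rule."""
    try:
        data = json.loads(response)
    except (json.JSONDecodeError, TypeError):
        return False  # markdown fences or surrounding prose fail to parse
    if not isinstance(data, dict):
        return False
    if any(key not in data for key in ("name", "status", "score", "tags")):
        return False
    score, tags = data["score"], data["tags"]
    return (
        isinstance(data["name"], str)
        and data["status"] in ("active", "inactive")
        # bool is a subclass of int in Python, so reject it explicitly (assumption)
        and isinstance(score, int)
        and not isinstance(score, bool)
        and 0 <= score <= 100
        and isinstance(tags, list)
        and len(tags) == 2
        and all(isinstance(tag, str) for tag in tags)
    )
```

Because the check is a pure function of the response text, the same answer produces the same verdict on every run.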

What every test records

A single test produces a row of hard numbers. The leaderboard is built entirely from these — nothing is hand-entered.

pass / fail

Did the deterministic validator accept the answer?

latency_ms

Round-trip time for the call.

tokens / sec

Output tokens ÷ latency in seconds (latency_ms ÷ 1000): the throughput number.

input + output tokens

Exact token counts from the provider response.

cost (computed)

Tokens × per-token price, calculated at read time — never stored, so old runs re-price when rates change.

preview + error

First 200 chars of the response, and the error if any — for auditing.
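The fields above could be modeled roughly as follows. This is a sketch under assumptions: the class name ResultRow, the field names other than latency_ms, and the sample values are all hypothetical. It shows throughput derived from the stored numbers and the 200-character preview, and it deliberately has no cost field.

```python
from dataclasses import dataclass
from typing import Optional

PREVIEW_CHARS = 200  # from the text: first 200 chars of the response


@dataclass(frozen=True)
class ResultRow:
    run_id: str
    model: str
    test_id: str
    passed: bool
    latency_ms: float
    input_tokens: int
    output_tokens: int
    preview: str
    error: Optional[str] = None
    # No cost field: cost is derived from the token counts at read time.

    @property
    def tokens_per_sec(self) -> float:
        """Output tokens divided by latency in seconds."""
        if self.latency_ms <= 0:
            return 0.0
        return self.output_tokens / (self.latency_ms / 1000.0)


def make_preview(response_text: str) -> str:
    """Keep only the first PREVIEW_CHARS characters for auditing."""
    return response_text[:PREVIEW_CHARS]


row = ResultRow(
    run_id="run-001", model="example-model", test_id="json", passed=True,
    latency_ms=1500.0, input_tokens=64, output_tokens=120,
    preview=make_preview("x" * 500),
)
print(row.tokens_per_sec)  # 120 tokens / 1.5 s = 80.0
print(len(row.preview))    # 200
```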

The choices that keep it honest

Generated code runs in a sandbox

The code test executes model output in an isolated subprocess with an empty environment and a throwaway working directory — so a jailbroken model can’t read API keys or write anywhere it shouldn’t.
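One way to get this kind of isolation with the Python standard library is sketched below. It is an illustration of the idea, not the bench's actual sandbox: the function name run_untrusted and the 10-second limit are assumptions, and passing an empty environment is assumed to run on a POSIX system. It isolates environment variables and the working directory only; it does not restrict network or filesystem access by itself.

```python
import subprocess
import sys
import tempfile
from pathlib import Path


def run_untrusted(code: str, timeout_s: float = 10.0) -> subprocess.CompletedProcess:
    """Run model-written Python in a child process with no inherited env vars."""
    with tempfile.TemporaryDirectory() as workdir:
        script = Path(workdir) / "candidate.py"
        script.write_text(code, encoding="utf-8")
        return subprocess.run(
            [sys.executable, "-I", str(script)],  # -I: isolated mode
            env={},               # empty environment: parent API keys are not passed on
            cwd=workdir,          # throwaway directory, deleted when the block exits
            capture_output=True,
            text=True,
            timeout=timeout_s,    # raises subprocess.TimeoutExpired on overrun
        )
```

The caller can then compare the captured stdout and return code against the expected result to decide pass or fail.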

No retries

One call per test. A truncated or flaky answer is scored as it happened — the experience a real user would get — instead of being quietly re-rolled until it passes.
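The single-call rule can be sketched as a wrapper that records whatever happens on the first attempt. The names here are hypothetical: send stands in for a provider client call that accepts a prompt and a timeout in seconds, and the 20-second default is the timeout stated for the call step.

```python
import time

TIMEOUT_S = 20.0  # the per-call timeout stated for the call step


def call_once(send, prompt, timeout_s=TIMEOUT_S):
    """Call send(prompt, timeout_s) exactly once and never retry.

    Returns (response_text_or_None, error_or_None, latency_ms).
    """
    start = time.perf_counter()
    try:
        text, error = send(prompt, timeout_s), None
    except TimeoutError:
        text, error = None, "timeout"
    except Exception as exc:  # a failed call is recorded, not re-rolled
        text, error = None, f"{type(exc).__name__}: {exc}"
    latency_ms = (time.perf_counter() - start) * 1000.0
    return text, error, latency_ms
```

There is no loop around the call, so a timeout or provider error becomes part of the recorded result rather than a trigger for another attempt.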

Same prompt, no per-model tuning

Every model sees the identical prompt. No prompt is hand-optimized for one provider, so the ranking reflects the model, not the prompt engineering.

Token caps prevent gaming

Each test caps max_tokens. The few providers that ignore the cap and stream unbounded output are bounded explicitly, so no model gains or loses on speed because its output length was limited differently.
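A client-side bound on a streaming response might look like the sketch below. It is only an illustration: the function name collect_bounded is hypothetical, and it assumes each streamed chunk carries one token, which real provider streams do not guarantee.

```python
from typing import Iterable, Tuple


def collect_bounded(chunks: Iterable[str], max_tokens: int) -> Tuple[str, bool]:
    """Read at most max_tokens chunks; return (text, truncated)."""
    parts = []
    for count, chunk in enumerate(chunks):
        if count >= max_tokens:
            # The stream kept going past the cap: stop reading here.
            return "".join(parts), True
        parts.append(chunk)
    return "".join(parts), False
```

With a cap of 150, a stream that never ends yields exactly 150 chunks and is flagged as truncated, while a shorter stream passes through unchanged.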

Cost is computed, not claimed

Cost comes from published per-token prices applied to the real token counts at read time. Change a price table and the whole history re-scores correctly.
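The read-time calculation can be illustrated with a small sketch. Everything named here is an assumption for illustration: the Price type, the cost_usd function, the example-model entry, the prices themselves, and the choice to quote prices in USD per million tokens.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Price:
    input_per_mtok: float   # USD per 1M input tokens (assumed unit)
    output_per_mtok: float  # USD per 1M output tokens (assumed unit)


def cost_usd(input_tokens: int, output_tokens: int, price: Price) -> float:
    """Derive cost from stored token counts and the current price table."""
    total = input_tokens * price.input_per_mtok + output_tokens * price.output_per_mtok
    return total / 1_000_000


# A stored row holds token counts only, never a cost.
stored_row = {"model": "example-model", "input_tokens": 1200, "output_tokens": 300}

prices = {"example-model": Price(input_per_mtok=2.0, output_per_mtok=8.0)}
before = cost_usd(stored_row["input_tokens"], stored_row["output_tokens"],
                  prices[stored_row["model"]])

# The provider halves its prices: the same stored row re-prices on the next read.
prices["example-model"] = Price(input_per_mtok=1.0, output_per_mtok=4.0)
after = cost_usd(stored_row["input_tokens"], stored_row["output_tokens"],
                 prices[stored_row["model"]])
```

Here the first read gives (1200 × 2.0 + 300 × 8.0) ÷ 1,000,000 = 0.0048 USD and the second gives 0.0024 USD, with no change to the stored row.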

Dead models are pruned

Models that return "not found" / "deprecated" are auto-retired so the bench stops wasting calls on them — and non-generative safety classifiers are excluded entirely.
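The retirement rule could be approximated with a simple check on the last error each model returned. This is a sketch under assumptions: the function names, the substring matching, and the shape of the inputs are hypothetical, and a real implementation would more likely inspect structured provider error codes.

```python
from typing import Dict, List, Optional

RETIRE_MARKERS = ("not found", "deprecated")  # phrases named in the text


def should_retire(error_message: Optional[str]) -> bool:
    """True if the error suggests the model no longer exists."""
    if not error_message:
        return False
    lowered = error_message.lower()
    return any(marker in lowered for marker in RETIRE_MARKERS)


def prune(active_models: List[str], last_errors: Dict[str, Optional[str]]) -> List[str]:
    """Keep only the models whose last error does not mark them as dead."""
    return [m for m in active_models if not should_retire(last_errors.get(m))]
```

A transient failure such as a timeout does not match either marker, so the model stays in the rotation.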

A note on scope: these tests are fast, deterministic probes — they catch real capability and reliability differences across models cheaply and reproducibly, which is what a cost-first leaderboard needs. They are not a substitute for a deep, task-specific evaluation of your own workload. Use the leaderboard to shortlist; validate the finalist on your actual data.

Methodology mirrors the live EyesInAI bench: 18 graded tests + availability ping, deterministic validators, single call per test, cost computed from per-token pricing.
