EyesInAI is a public benchmark of AI models across 6 providers and 201 model variants. Real costs, real latency, real pass rates, refreshed nightly with deterministic validators. No vendor-supplied numbers, no marketing spin.
Every AI provider publishes benchmark scores. Almost all of them show their own model winning. That's because the model and the benchmark are designed together — the benchmark optimizes for what the model is good at, not what developers actually need to build with.
Developers picking a model for a real workload don't care about MMLU. They care about whether the model writes valid JSON when asked, whether it runs the function tool with the right arguments, whether the cost-per-task pencils out at scale, and whether the latency stays under their SLO during a traffic spike.
EyesInAI runs 14 deterministic tests — tests where the answer is either correct or not, with no human or LLM judging the output — across every major model nightly. Then we publish everything: the cost of a passing run, the regression history, what's on the cost-vs-quality Pareto frontier, what's newly broken.
We run 14 tests across 2 categories. Every test has a deterministic validator — we execute code in a sandbox, parse JSON against a strict schema, regex-check exact phrases, count bullet points. We do not use an LLM to judge another LLM. If a result is on this site, a machine verified it.
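As a rough illustration of what "deterministic validator" means here, the sketch below shows three checks in the spirit of the ones described: a strict JSON check, an exact-phrase regex check, and a bullet counter. It is not the site's actual code. The key names in `REQUIRED_KEYS` and all function names are hypothetical placeholders, since the real schema is not published on this page.

```python
import json
import re

# Hypothetical schema: the page says "4 required keys with correct types"
# but does not name them, so these keys are placeholders.
REQUIRED_KEYS = {"name": str, "age": int, "active": bool, "tags": list}


def validate_json(output: str) -> bool:
    """Pass only if output parses as a JSON object with every required key and type."""
    try:
        data = json.loads(output)
    except json.JSONDecodeError:
        return False
    if not isinstance(data, dict):
        return False
    for key, expected in REQUIRED_KEYS.items():
        if key not in data:
            return False
        value = data[key]
        # bool is a subclass of int in Python, so reject True/False for int fields.
        if expected is int and isinstance(value, bool):
            return False
        if not isinstance(value, expected):
            return False
    return True


def validate_contains_phrase(output: str, phrase: str) -> bool:
    """Regex check for a phrase that is not embedded in a longer word or number."""
    return re.search(rf"(?<!\w){re.escape(phrase)}(?!\w)", output) is not None


def count_bullets(output: str) -> int:
    """Count lines that start with a '-', '*', or '•' bullet marker."""
    return sum(1 for line in output.splitlines() if re.match(r"\s*[-*•]\s+\S", line))
```

Each function returns the same result for the same input, so a pass or fail never depends on a human or a second model reading the output.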
Probe specific capabilities in isolation: latency, throughput, structured output compliance, code execution correctness, fact retrieval, tool calling, multi-step logic, long-context recall.
Mirror actual production workloads: summarize a news article, classify a customer review, extract invoice fields, follow multi-rule instructions, comply with output format constraints.
ping: Latency probe, single-word reply
reasoning: Basic math (17 × 23). Must contain "391"
json: Strict JSON output. Parsed + 4 required keys with correct types
code: Python flatten() function. Executed in sandbox against 3 cases
speed: Throughput. 500-token response measured for tokens/sec
context: In-prompt retrieval. Extract Austin/Illinois/Frankfort from 20 capitals
We run benchmark calls against the official APIs of Anthropic, OpenAI, Groq, Google (Gemini), OpenRouter, and NVIDIA, using our own API keys and a Mac Mini test rig. The rig sits behind a Cloudflare tunnel; the public site reads from it via a hardened proxy.
Each night a rotation cron divides ~472 model variants into 7 groups (one per day of the week) and benchmarks 1/7th of the catalog so every model gets fresh data weekly. Costs are computed per-call from the provider's current pricing. Results land in Supabase and become visible on the site within minutes.
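A minimal sketch of how such a rotation and per-call cost could work, under stated assumptions: models are assigned round-robin from a sorted list, and pricing is quoted in USD per million tokens. The function names and the price figures in the usage note are illustrative, not the site's real scheduler or any provider's real rates.

```python
from datetime import date


def models_for_day(model_ids: list[str], today: date, groups: int = 7) -> list[str]:
    """Pick the slice of the catalog to benchmark on a given date.

    Assumption: round-robin over the sorted IDs, so group sizes differ by at
    most one and the assignment is reproducible from the catalog alone.
    """
    ordered = sorted(model_ids)
    slot = today.weekday() % groups  # Monday=0 ... Sunday=6
    return [m for i, m in enumerate(ordered) if i % groups == slot]


def call_cost_usd(
    input_tokens: int,
    output_tokens: int,
    input_price_per_mtok: float,
    output_price_per_mtok: float,
) -> float:
    """Cost of one call, given prices in USD per million tokens (an assumed unit)."""
    return (
        input_tokens * input_price_per_mtok + output_tokens * output_price_per_mtok
    ) / 1_000_000
```

For example, a call with 1,000 input tokens and 500 output tokens at hypothetical prices of $3 and $15 per million tokens costs (1000 × 3 + 500 × 15) / 1,000,000 = $0.0105. One trade-off of the sorted round-robin: adding a model to the catalog can shift other models to a different day.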
The AI Analysis tab is a different beast: each morning Claude Haiku summarizes the previous day's data into structured insights (Pareto frontier, task routings, regressions, surprising findings). The narrative wraps around facts we computed in Python — we never let the LLM make up numbers.
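To make "facts we computed in Python" concrete, here is one way the cost-vs-quality Pareto frontier could be computed before any narrative is written. This is a sketch under assumptions: the `ModelResult` fields and the dominance rule (no more expensive and at least as accurate, strictly better on one axis) are our reading of the term, not the site's published definition.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ModelResult:
    name: str
    cost_usd: float   # cost of a passing run; lower is better
    pass_rate: float  # fraction in [0, 1]; higher is better


def pareto_frontier(results: list[ModelResult]) -> list[ModelResult]:
    """Return the models that no cheaper-or-equal model matches or beats on pass rate."""
    frontier: list[ModelResult] = []
    best_pass_rate = -1.0
    # Sweep from cheapest to most expensive; on a cost tie, see the more accurate model first.
    for r in sorted(results, key=lambda r: (r.cost_usd, -r.pass_rate)):
        # A model earns its place only by beating every cheaper model's pass rate.
        if r.pass_rate > best_pass_rate:
            frontier.append(r)
            best_pass_rate = r.pass_rate
    return frontier
```

With this sweep, exact duplicates on both axes collapse to a single frontier entry. The summarizing model would then receive the frontier as input data rather than being asked to work out which models belong on it.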
1/7 of all known models benchmarked
Claude Haiku generates narrative + insights
Analysis last generated Jun 2, 2026, 9:00 PM
Full sortable table. Default sort: cost ascending. Filter by provider, category, status.
Today's headline + Pareto chart + per-task picks + surprising findings + migration paths.
30-day pass-rate trajectories. Catch regressions before your production code does.
Aggregated AI provider blogs, curated articles, trending HuggingFace models, status incidents.
Nobody pays us to put their model first. We have no commercial relationship with any of the providers we benchmark. If a model regresses, we publish the regression even if it's from a provider we usually like. If a free open-source model dominates the Pareto frontier on a task, we say so.
The site is operated as a public good. The benchmark code is closed-source for now — we'll consider open-sourcing once the infrastructure can be scrubbed of operator-specific details.
tool_use
multi_step_logic: BBH-lite. Boolean expr + web-of-lies + object counting in one call
long_context_needle: Find a 6-digit code buried in ~3.5k tokens of filler text
summarize: Distill a news article into exactly 3 bullet points
classify: Zero-shot: sentiment + category from a customer review
extract: Pull structured fields (invoice_number, total_usd, line_items) from text
instruction_follow: Multi-rule compliance: 5 sentences, all end with !, no "the"
format_compliance: IFEval-style: 4 bullets, keyword inclusion/exclusion, no preamble
Provider blogs + HuggingFace trending
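The instruction_follow test listed above lends itself to a purely mechanical check. The sketch below is one plausible reading of its three rules (exactly 5 sentences, every sentence ends with "!", the word "the" never appears). The function name and the sentence-splitting heuristic are assumptions, not the site's actual validator.

```python
import re


def validate_instruction_follow(output: str) -> bool:
    """Check: exactly 5 sentences, each ending with '!', and no standalone word 'the'."""
    text = output.strip()
    # Heuristic: a sentence ends at '.', '!' or '?' followed by whitespace.
    sentences = [s for s in re.split(r"(?<=[.!?])\s+", text) if s]
    if len(sentences) != 5:
        return False
    if not all(s.endswith("!") for s in sentences):
        return False
    # Whole-word, case-insensitive match, so "then" or "other" do not count.
    if re.search(r"\bthe\b", text, flags=re.IGNORECASE):
        return False
    return True
```

The splitting rule is deliberately simple; abbreviations such as "e.g." followed by a space would be counted as sentence ends, which a production validator would need to decide how to treat.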