We measure what AI models actually do, not what vendors claim.

EyesInAI is a public benchmark of AI models across 18 providers and 222 model variants. Real costs, real latency, real pass rates, refreshed nightly with deterministic validators. No vendor-supplied numbers, no marketing spin.

What we believe

Chatbots fail in two directions at once: they answer wrong, and they cost too much doing it. Both failures have the same root — nobody is measuring. We attack both with one discipline: never spend a token you don't have to, and never save a token at the cost of a correct answer.

Done properly, accuracy and economy stop being a trade-off and become the same discipline — the grading that catches a wrong answer is the same machinery that proves a cheaper model is safe to use. The public benchmark below is the evidence engine that makes that possible.

The seven principles behind how we build

Why this exists

Every AI provider publishes benchmark scores. Almost all of them show their own model winning. That's because the model and the benchmark are designed together — the benchmark optimizes for what the model is good at, not what developers actually need to build with.

Developers picking a model for a real workload don't care about MMLU. They care about whether the model writes valid JSON when asked, whether it calls a tool with the right arguments, whether the cost-per-task pencils out at scale, and whether the latency stays under their SLO during a traffic spike.

EyesInAI runs 18 deterministic tests (tests where the answer is either correct or not, with no human or LLM judging the output) across every major model on a nightly rotation. Then we publish everything: the cost of a passing run, the regression history, what's on the cost-vs-quality Pareto frontier, what's newly broken.

How we measure

We run 18 tests across 2 categories. Every test has a deterministic validator: we execute code in a sandbox, parse JSON against a strict schema, regex-check exact phrases, count bullet points. We do not use an LLM to judge another LLM. If a result is on this site, a machine verified it.
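
To make "deterministic validator" concrete, here is a minimal Python sketch of three of the checks named above: a strict JSON field check, an exact-phrase regex check, and a bullet counter. It is an illustration only, not the benchmark's actual code. The function names, the "exact keys and exact types" notion of strictness, and the set of accepted bullet markers are all assumptions.

```python
import json
import re


def validate_json_fields(output: str, required: dict[str, type]) -> bool:
    """Pass only if output is a JSON object with exactly the required keys and types.

    Assumption: "strict" means no missing keys, no extra keys, exact type match.
    """
    try:
        data = json.loads(output)
    except json.JSONDecodeError:
        return False
    if not isinstance(data, dict) or set(data) != set(required):
        return False
    return all(type(data[key]) is expected for key, expected in required.items())


def validate_contains(output: str, phrase: str) -> bool:
    """Pass only if the phrase appears and is not embedded in a longer word or number."""
    return re.search(rf"(?<!\w){re.escape(phrase)}(?!\w)", output) is not None


def validate_bullet_count(output: str, expected: int) -> bool:
    """Pass only if exactly `expected` lines start with a bullet marker (-, * or •)."""
    bullets = [line for line in output.splitlines() if re.match(r"\s*[-*•]\s+\S", line)]
    return len(bullets) == expected
```

Each function returns a plain pass/fail with no scoring and no judgment call, so the same model output always gets the same verdict.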

13 synthetic tests

Probe specific capabilities in isolation: latency, throughput, structured output compliance, code generation and repair executed in a sandbox, fact retrieval, tool calling, multi-step logic, long-context recall.

5 real-world tests

Mirror actual production workloads: summarize a news article, classify a customer review, extract invoice fields, follow multi-rule instructions, comply with output format constraints.

Full test suite

synthetic | ping | Latency probe, single-word reply
synthetic | reasoning | Basic math (17 × 23). Must contain "391"

Where the data comes from

We run benchmark calls against each provider's official API — Anthropic, OpenAI, Groq, Google (Gemini), OpenRouter, NVIDIA — using our own API keys and a Mac Mini test rig. The rig sits behind a Cloudflare tunnel; the public site reads from it via a hardened proxy.

Each night a rotation cron divides the full model catalog into 7 groups (one per day of the week) and benchmarks 1/7th of it, so every model gets fresh data weekly. Costs are computed per-call from the provider's current pricing. Results land in Supabase and become visible on the site within minutes.
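
A rough Python sketch of the two mechanics described above, the 7-way rotation and the per-call cost calculation, could look like this. It is an illustration under stated assumptions, not the production cron: the grouping rule (sorted catalog, every 7th entry), the function names, and the pricing unit (USD per million tokens, billed separately for input and output) are all hypothetical.

```python
from datetime import date

GROUPS = 7  # one group per day of the week


def rotation_groups(catalog: list[str]) -> list[list[str]]:
    """Split the catalog into 7 near-equal groups (assumed rule: sorted, every 7th entry)."""
    ordered = sorted(catalog)
    return [ordered[i::GROUPS] for i in range(GROUPS)]


def models_due(catalog: list[str], today: date) -> list[str]:
    """Return the group scheduled for today's day of the week (Monday = 0)."""
    return rotation_groups(catalog)[today.weekday()]


def call_cost_usd(input_tokens: int, output_tokens: int,
                  input_price_per_mtok: float, output_price_per_mtok: float) -> float:
    """Cost of one call, assuming prices are quoted in USD per million tokens."""
    return (input_tokens * input_price_per_mtok
            + output_tokens * output_price_per_mtok) / 1_000_000
```

Under this rule, any seven consecutive nights cover the whole catalog exactly once. One trade-off of grouping by sorted position is that adding a model can shift others into a different group; a stable hash of the model ID would avoid that, at the cost of less even group sizes.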

The AI Analysis tab is a different beast: each morning Claude Haiku summarizes the previous day's data into structured insights (Pareto frontier, task routings, regressions, surprising findings). The narrative wraps around facts we computed in Python — we never let the LLM make up numbers.
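
For readers unfamiliar with the term, a cost-vs-quality Pareto frontier is the set of models that no other model beats on both cost and pass rate. The sketch below shows one common way to compute it in Python. It is a minimal illustration, not the site's analysis code; the `Result` fields and the choice of "cost per passing run" as the cost axis are assumptions.

```python
from typing import NamedTuple


class Result(NamedTuple):
    model: str
    cost_usd: float   # assumed metric: cost per passing run (lower is better)
    pass_rate: float  # fraction of tests passed (higher is better)


def pareto_frontier(results: list[Result]) -> list[Result]:
    """Keep each result that no cheaper-or-equal result matches or beats on pass rate.

    Exact ties on both axes keep only the first entry after sorting.
    """
    # Cheapest first; among equal costs, the highest pass rate comes first.
    ordered = sorted(results, key=lambda r: (r.cost_usd, -r.pass_rate))
    frontier: list[Result] = []
    best_pass_rate = float("-inf")
    for r in ordered:
        # r costs at least as much as everything before it, so it earns a
        # place only by passing strictly more than all of them.
        if r.pass_rate > best_pass_rate:
            frontier.append(r)
            best_pass_rate = r.pass_rate
    return frontier
```

With four hypothetical results, `model-a` (0.001, 0.70), `model-b` (0.002, 0.65), `model-c` (0.004, 0.90), and `model-d` (0.010, 0.90), the frontier is `model-a` and `model-c`: `model-b` costs more than `model-a` and passes less, and `model-d` costs more than `model-c` for the same pass rate.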

2am daily: Rotation cron. 1/7 of all known models benchmarked.

4am daily: Analysis cron. Claude Haiku generates narrative + insights.

8am daily: News cron. Provider blogs + HuggingFace trending.

Current state

Providers: 18 API endpoints tested

Models: 222 distinct variants benchmarked

Tasks completed: 1,339 successful test runs

Tests: 18 (13 synthetic, 5 real-world)

Analysis last generated Jul 18, 2026, 9:00 PM

What you can do here

Leaderboard

Full sortable table. Default sort: cost ascending. Filter by provider, category, status.

AI Analysis

Today's headline + Pareto chart + per-task picks + surprising findings + migration paths.

Trends

30-day pass-rate trajectories. Catch regressions before your production code does.

News

Current AI provider blogs, curated articles, trending HuggingFace models, and status incidents.

A note on independence

Nobody pays us to put their model first. We have no commercial relationship with any of the providers we benchmark. If a model regresses, we publish the regression even if it's from a provider we usually like. If a free open-source model dominates the Pareto frontier on a task, we say so. This is Proof, our seventh principle, applied to ourselves.

The site is operated as a public good. The benchmark code is closed-source for now — we'll consider open-sourcing once the infrastructure can be scrubbed of operator-specific details.

Questions, corrections, or a model we should add? Email [email protected].