The Deep Suite: tests built to make good models trip

Most benchmarks measure whether a model can do a task. Past a certain tier, they all can — so the score stops discriminating. The Deep Suite is our answer: a harder set of stress tests, run only on models that already clear the base bar, where every question has a tempting wrong answer baked in. The point isn’t coverage — it’s pressure. See the live results.

Why a second, harder tier

On the main leaderboard, the frontier models cluster near the top because the basics are solved. That’s useful for routing cheap tasks, but it tells you little about which model to trust on the hard 5% where mistakes are expensive. The Deep Suite exists to reintroduce signal. It’s a gated cascade: a model only earns a Deep Suite run by passing the base suite first, so we spend the costlier, trickier tests only on models worth the deeper look.
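The gating logic can be pictured with a minimal sketch. Everything named here is an assumption for illustration: the article does not state the actual base bar, so BASE_PASS_BAR is a placeholder, and run_base / run_deep stand in for whatever runs each suite.

```python
from typing import Callable, Optional

# Hypothetical threshold; the real base bar is not stated in the article.
BASE_PASS_BAR = 0.9


def run_cascade(
    model: str,
    run_base: Callable[[str], float],
    run_deep: Callable[[str], float],
    bar: float = BASE_PASS_BAR,
) -> dict:
    """Run the cheap base suite first; run the Deep Suite only if the bar is cleared."""
    base_score = run_base(model)
    deep_score: Optional[float] = None
    if base_score >= bar:
        # Only models that pass the gate incur the cost of the harder tests.
        deep_score = run_deep(model)
    return {"model": model, "base": base_score, "deep": deep_score}
```

A model that misses the bar comes back with "deep" set to None, so it never consumes a Deep Suite run.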

The four axes — each with a trap

The suite probes four kinds of pressure. Every test ships a deliberate failure mode:

Reasoning under pressure

A billing rule that lures you to the wrong total, a knights-and-knaves puzzle with exactly one consistent answer, an off-by-one date count, conditional probability that punishes ignoring the condition, and a unit-conversion problem where one worker is already metric. The tempting answer is always wrong.
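To make the idea of a trap concrete, here is a small sketch of two of these problem types. The specific problems are illustrative assumptions, not items from the suite: an inclusive day count, and a two-dice conditional probability where ignoring the condition gives a different, tempting answer.

```python
from datetime import date
from fractions import Fraction
from itertools import product


def inclusive_days(start: date, end: date) -> int:
    """Count days from start to end, counting both endpoints.

    Plain subtraction gives the exclusive count, which is the off-by-one trap.
    """
    return (end - start).days + 1


def p_sum8_given_first_even() -> Fraction:
    """P(sum of two dice == 8 | first die is even), by enumeration."""
    rolls = list(product(range(1, 7), repeat=2))
    given = [r for r in rolls if r[0] % 2 == 0]  # restrict to the condition
    hits = [r for r in given if sum(r) == 8]
    return Fraction(len(hits), len(given))


def p_sum8_unconditional() -> Fraction:
    """The tempting answer: the same event with the condition ignored."""
    rolls = list(product(range(1, 7), repeat=2))
    hits = [r for r in rolls if sum(r) == 8]
    return Fraction(len(hits), len(rolls))
```

For March 1 to March 10 the inclusive count is 10, while subtraction alone gives 9. The conditional probability is 1/6, while the unconditional one is 5/36.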

Format fidelity

Nested JSON with literal escaping and a real null, a deliberate instruction conflict, a no-letter-“e” constrained sentence, RFC-4180 CSV with embedded commas and newlines, and strict 2-decimal formatting that exposes rounding sloppiness. Tests whether a model does exactly what was asked.
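Checks of this kind are straightforward to write with the Python standard library. The sketch below is an assumption about what such validators might look like, not the suite's actual code. The function names are hypothetical, and the half-up rounding rule is a choice made for the example, since the article does not say which rounding rule the tests require.

```python
import csv
import io
import json
from decimal import Decimal, ROUND_HALF_UP


def json_has_real_null(text: str, key: str) -> bool:
    """True only if key maps to JSON null, not to the string "null"."""
    obj = json.loads(text)
    return key in obj and obj[key] is None


def parse_csv(text: str) -> list[list[str]]:
    """Parse CSV text; quoted fields may contain commas and newlines."""
    # newline="" keeps embedded line breaks intact for the csv module.
    return list(csv.reader(io.StringIO(text, newline="")))


def two_decimals(value: str) -> str:
    """Format a decimal string to exactly two places (half-up: an assumption)."""
    return str(Decimal(value).quantize(Decimal("0.01"), rounding=ROUND_HALF_UP))
```

The 2-decimal case shows why sloppiness is easy: formatting the binary float 2.675 with "%.2f" yields 2.67, while exact decimal arithmetic with half-up rounding yields 2.68.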

Long-context retrieval

Multiple needles returned in order, decoys placed before and after the real answer, three facts scattered far apart that must be combined, and a later “correction” that must override an earlier value.
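Two of these checks reduce to simple string logic. The sketch below assumes the model's answer is plain text and that needles and values can be matched as exact substrings; the real validators may normalize or parse the answer differently, and both function names are hypothetical.

```python
def needles_in_order(answer: str, needles: list[str]) -> bool:
    """True if every needle appears in the answer, in the given order."""
    pos = 0
    for needle in needles:
        idx = answer.find(needle, pos)
        if idx == -1:
            return False
        pos = idx + len(needle)  # the next needle must come after this one
    return True


def uses_corrected_value(answer: str, corrected: str, stale: str) -> bool:
    """True if the answer reports the corrected value and not the earlier one."""
    return corrected in answer and stale not in answer
```

An answer that lists the right needles in the wrong order fails the first check, and an answer that repeats the earlier value after a correction fails the second.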

Tool use / agentic

Picking the right tool among look-alikes, knowing when not to call a tool, chaining dependent calls, and asking for a missing argument instead of fabricating one. Graded on the actual tool calls a model emits — across Anthropic, OpenAI, and Google.
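Grading on emitted tool calls could look roughly like the sketch below. It assumes each provider's response has already been normalized into a common record; the ToolCall class, the grade_tool_calls function, and the tool names used with them are hypothetical, not any vendor's API. It covers single-call cases only, and it checks that required arguments are present, not whether their values were fabricated.

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class ToolCall:
    """Hypothetical provider-neutral record of one emitted tool call."""
    name: str
    arguments: dict = field(default_factory=dict)


def grade_tool_calls(
    calls: list[ToolCall],
    expected_name: Optional[str],
    required_args: frozenset = frozenset(),
) -> bool:
    # Case 1: the correct behavior is to answer without calling any tool.
    if expected_name is None:
        return not calls
    # Case 2: exactly one call, to the right tool rather than a look-alike.
    if len(calls) != 1 or calls[0].name != expected_name:
        return False
    # Every required argument must be present in the emitted call.
    return required_args <= set(calls[0].arguments)
```

Passing expected_name=None expresses the "knowing when not to call a tool" case: any emitted call is a failure.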

How it's graded — and why you can trust it

Deterministic validators for everything checkable — the exact number, the reduced fraction, the JSON shape, the CSV parse, the needles in order. No vibes.
An LLM judge only for the two genuinely open-ended tests (instruction-conflict resolution, asking for a missing argument), with a strict pass/fail rubric.
Verified ground truth — the math, dates, and logic answers were all checked in code before the tests went live (the knights-and-knaves solution, for instance, is provably unique).
Multiple runs per model, reported as a pass-rate, so one lucky completion can’t flatter a model. These are non-deterministic systems; variance is part of the signal.
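Two of the points above lend themselves to a short sketch: checking in code that a logic puzzle has exactly one consistent answer, and reporting repeated runs as a pass-rate. The puzzle here is an illustrative assumption, not the one used in the suite: A says "B is a knave" and B says "We are both knights".

```python
from itertools import product


def consistent_worlds() -> list[tuple[bool, bool]]:
    """Enumerate every knight/knave assignment consistent with both statements."""
    worlds = []
    for a_knight, b_knight in product([True, False], repeat=2):
        a_says = not b_knight            # A: "B is a knave."
        b_says = a_knight and b_knight   # B: "We are both knights."
        # A knight's statement must be true; a knave's must be false.
        if a_says == a_knight and b_says == b_knight:
            worlds.append((a_knight, b_knight))
    return worlds


def pass_rate(results: list[bool]) -> float:
    """Fraction of runs that passed; each entry is one independent run."""
    if not results:
        raise ValueError("at least one run is required")
    return sum(results) / len(results)
```

The enumeration leaves a single world, with A a knight and B a knave, which is what "provably unique" means for a puzzle this size. For the pass-rate, four runs with three passes give 0.75, so one lucky or unlucky completion shifts the score without deciding it.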

What the first results showed

The Deep Suite immediately did its job: it separated models that look tied on the leaderboard — spreading five top models across a 22-point range where the basic leaderboard clusters them all near the top. On the reasoning traps specifically, a logic puzzle (knights and knaves) tripped three different frontier models every single run, while two others solved it cleanly — exactly the task-specific signal a single aggregate score hides. That’s why measured routing beats picking by reputation: the right model for a hard reasoning task is not always the one with the biggest name. The numbers are on the results page, updated as we run more models.

