Why a second, harder tier
On the main leaderboard, the frontier models cluster near the top: the basics are solved. That’s useful for routing cheap tasks, but it tells you little about which model to trust on the hard 5% where mistakes are expensive. The Deep Suite exists to re-introduce signal. It is a gated cascade: a model only earns a Deep Suite run by passing the base suite first, so we spend the costlier, trickier tests only on models worth the deeper look.
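The gating idea can be sketched in a few lines. This is an illustration only, not the actual harness: the `Suite` type, the `run_cascade` function, and the `base_threshold` value of 0.9 are all assumptions made for the example, since the real pass criterion for the base suite is not stated here.

```python
from typing import Callable, Dict, List, Optional

# Hypothetical: a suite is any function that maps a model name to a pass-rate in [0, 1].
Suite = Callable[[str], float]


def run_cascade(
    models: List[str],
    base_suite: Suite,
    deep_suite: Suite,
    base_threshold: float = 0.9,  # assumed cut-off, chosen only for illustration
) -> Dict[str, Dict[str, Optional[float]]]:
    """Run the cheap suite on every model and the costly suite only on those that pass."""
    results: Dict[str, Dict[str, Optional[float]]] = {}
    for model in models:
        base_score = base_suite(model)
        entry: Dict[str, Optional[float]] = {"base": base_score, "deep": None}
        # The gate: the deep suite is never invoked for models below the threshold.
        if base_score >= base_threshold:
            entry["deep"] = deep_suite(model)
        results[model] = entry
    return results
```

A model that fails the gate keeps `"deep": None`, which keeps "not run" distinct from "ran and scored zero".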
The four axes — each with a trap
The suite probes four kinds of pressure. Every test ships a deliberate failure mode:
The reasoning tests include a billing rule that lures you to the wrong total, a knights-and-knaves puzzle with exactly one consistent answer, an off-by-one date count, a conditional-probability question that punishes ignoring the condition, and a unit-conversion problem where one worker is already metric. The tempting answer is always wrong.
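Traps like these are easy to check mechanically. The sketch below shows how three of them might be verified; the puzzle statements, dates, and dice question are invented for illustration and are not the suite's actual questions.

```python
from datetime import date
from fractions import Fraction
from itertools import product


def consistent_assignments():
    """Brute-force an illustrative knights-and-knaves puzzle (True = knight)."""
    solutions = []
    for a, b, c in product([True, False], repeat=3):
        statements = [
            (a, not b),    # A says: "B is a knave."
            (b, a and c),  # B says: "A and C are both knights."
            (c, not a),    # C says: "A is a knave."
        ]
        # A knight's claim must be true and a knave's claim must be false.
        if all(speaker == claim for speaker, claim in statements):
            solutions.append({"A": a, "B": b, "C": c})
    return solutions


def days_inclusive(start: date, end: date) -> int:
    """Count days including both endpoints (the off-by-one trap is forgetting the +1)."""
    return (end - start).days + 1


def conditional_probability(event, condition, outcomes) -> Fraction:
    """P(event | condition) over equally likely outcomes, as an exact fraction."""
    given = [o for o in outcomes if condition(o)]
    hits = [o for o in given if event(o)]
    return Fraction(len(hits), len(given))


dice = list(product(range(1, 7), repeat=2))
print(consistent_assignments())  # exactly one assignment survives for this puzzle
print(days_inclusive(date(2024, 3, 1), date(2024, 3, 10)))  # 10, not 9
print(conditional_probability(lambda o: sum(o) == 8, lambda o: 6 in o, dice))  # 2/11
```

For the dice example, ignoring the condition gives 5/36 instead of 2/11, which is the kind of tempting wrong answer the reasoning tests are built around.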
The instruction-following tests cover nested JSON with literal escaping and a real null, a deliberate instruction conflict, a sentence that must avoid the letter “e”, RFC 4180 CSV with embedded commas and newlines, and strict two-decimal formatting that exposes rounding sloppiness. Together they test whether a model does exactly what was asked.
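Most of these checks reduce to a few standard-library calls. The following is a rough sketch of what such validators could look like. The function names are hypothetical, and the half-up rounding mode is an assumption, since the suite's actual rounding rule is not specified here.

```python
import csv
import io
import json
import re
from decimal import Decimal, ROUND_HALF_UP


def json_field_is_null(text: str, path: list) -> bool:
    """True only if the field at `path` is a real JSON null, not the string "null"."""
    obj = json.loads(text)
    for key in path:
        obj = obj[key]
    return obj is None


def parse_csv(text: str) -> list:
    """Parse CSV text; newline="" leaves quoted embedded newlines for the csv module to handle."""
    return list(csv.reader(io.StringIO(text, newline="")))


def avoids_letter_e(sentence: str) -> bool:
    """Constrained-writing check: no 'e' or 'E' anywhere in the sentence."""
    return "e" not in sentence.lower()


TWO_DECIMALS = re.compile(r"-?\d+\.\d{2}")


def is_two_decimal(text: str) -> bool:
    """Strict format check: digits, a point, then exactly two digits."""
    return TWO_DECIMALS.fullmatch(text.strip()) is not None


def expected_two_decimal(value: str) -> str:
    """Reference answer using exact decimal arithmetic (half-up rounding is an assumption)."""
    return str(Decimal(value).quantize(Decimal("0.01"), rounding=ROUND_HALF_UP))


rows = parse_csv('name,note\r\n"Doe, Jane","line one\nline two"\r\n')
print(rows[1])  # ['Doe, Jane', 'line one\nline two']
print(expected_two_decimal("2.675"))  # 2.68, whereas float round(2.675, 2) gives 2.67
```

Building the reference value from a decimal string rather than a float is deliberate: binary floats are a common source of exactly the rounding sloppiness this kind of test is meant to catch.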
The long-context tests ask for multiple needles returned in order, place decoys before and after the real answer, scatter three facts far apart that must be combined, and include a later “correction” that must override an earlier value.
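An order-sensitive needle check can be written as a single forward scan. This is a minimal sketch with made-up needle strings; the helper names `needles_in_order` and `uses_corrected_value` are hypothetical, and plain substring matching is an assumption about how answers would be compared.

```python
from typing import List


def needles_in_order(answer: str, needles: List[str]) -> bool:
    """True if every needle appears in the answer, each one after the previous."""
    position = 0
    for needle in needles:
        index = answer.find(needle, position)
        if index == -1:
            return False
        # Continue searching after this match so order is enforced.
        position = index + len(needle)
    return True


def uses_corrected_value(answer: str, corrected: str, superseded: str) -> bool:
    """True if the answer gives the corrected value and does not repeat the old one."""
    return corrected in answer and superseded not in answer


print(needles_in_order("codes: ALPHA-7, then BRAVO-2, then CHARLIE-9",
                       ["ALPHA-7", "BRAVO-2", "CHARLIE-9"]))  # True
print(needles_in_order("codes: BRAVO-2, then ALPHA-7",
                       ["ALPHA-7", "BRAVO-2"]))  # False
```

The correction check is intentionally strict: an answer that quotes the superseded value at all fails, even if it also states the corrected one. A more lenient grader would need to look at which value the answer actually commits to.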
The tool-use tests cover picking the right tool among look-alikes, knowing when not to call a tool, chaining dependent calls, and asking for a missing argument instead of fabricating one. They are graded on the actual tool calls a model emits, across Anthropic, OpenAI, and Google.
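Grading on emitted tool calls could look roughly like the sketch below. It assumes each provider's response has already been normalized into a common shape, `{"name": ..., "arguments": {...}}`. That shape, the `grade_tool_calls` function, and the tool names are all hypothetical and do not reflect any provider's actual response format.

```python
from typing import Any, Dict, List

# Hypothetical normalized shape: {"name": str, "arguments": dict}
ToolCall = Dict[str, Any]


def grade_tool_calls(emitted: List[ToolCall], expected: List[ToolCall]) -> bool:
    """Pass only if the emitted calls match the expected calls, one for one and in order."""
    # An empty `expected` list encodes "the model should not call any tool".
    if len(emitted) != len(expected):
        return False
    for got, want in zip(emitted, expected):
        # The wrong look-alike tool fails even if its arguments are right.
        if got.get("name") != want["name"]:
            return False
        # Every expected argument must be present with the expected value.
        got_args = got.get("arguments", {})
        for key, value in want.get("arguments", {}).items():
            if got_args.get(key) != value:
                return False
    return True


expected_chain = [
    {"name": "find_user", "arguments": {"email": "a@example.com"}},
    {"name": "get_orders", "arguments": {"user_id": 17}},
]
print(grade_tool_calls(expected_chain, expected_chain))  # True
print(grade_tool_calls([], []))  # True: correctly declined to call a tool
```

Comparing position by position means a chained test passes only when the dependent call comes second. A check like this can confirm that no tool was called when an argument is missing, but not whether the model asked a sensible clarifying question.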
How it's graded — and why you can trust it
- Deterministic validators for everything checkable — the exact number, the reduced fraction, the JSON shape, the CSV parse, the needles in order. No vibes.
- An LLM judge only for the two genuinely open-ended tests (instruction-conflict resolution, asking for a missing argument), with a strict pass/fail rubric.
- Verified ground truth — the math, dates, and logic answers were all checked in code before the tests went live (the knights-and-knaves solution, for instance, is provably unique).
- Multiple runs per model, reported as a pass-rate, so one lucky completion can’t flatter a model. These are non-deterministic systems; variance is part of the signal.
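To make the first and last points concrete, here is a small sketch of a deterministic validator and a pass-rate calculation over repeated runs. The function names and the sample outputs are invented for illustration; this is not the suite's actual code.

```python
from fractions import Fraction
from typing import Callable, List


def is_reduced_fraction(answer: str, expected: Fraction) -> bool:
    """Accept only 'n/d' written in lowest terms and equal to the expected value."""
    try:
        numerator_text, denominator_text = answer.strip().split("/")
        numerator, denominator = int(numerator_text), int(denominator_text)
    except ValueError:
        return False  # not of the form integer/integer
    if denominator <= 0:
        return False
    # Fraction stores values in lowest terms, so an unreduced answer such as 4/22 fails here.
    return (numerator, denominator) == (expected.numerator, expected.denominator)


def pass_rate(outputs: List[str], validator: Callable[[str], bool]) -> float:
    """Share of runs that pass the validator."""
    if not outputs:
        raise ValueError("need at least one run")
    return sum(1 for output in outputs if validator(output)) / len(outputs)


# Four hypothetical completions for the same prompt.
runs = ["2/11", "4/22", "2/11", "5/36"]
print(pass_rate(runs, lambda out: is_reduced_fraction(out, Fraction(2, 11))))  # 0.5
```

Here the unreduced `4/22` counts as a failure even though it is numerically correct, because the test asks for the reduced form. The model that got lucky once and the model that gets it right every time end up with visibly different rates.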
What the first results showed
The Deep Suite immediately did its job: it separated models that look tied on the leaderboard — spreading five top models across a 22-point range where the basic leaderboard clusters them all near the top. On the reasoning traps specifically, a logic puzzle (knights and knaves) tripped three different frontier models every single run, while two others solved it cleanly — exactly the task-specific signal a single aggregate score hides. That’s why measured routing beats picking by reputation: the right model for a hard reasoning task is not always the one with the biggest name. The numbers are on the results page, updated as we run more models.