The main leaderboard answers “can a model do the basics, cheaply?” This page asks the next question: when a model clears that bar, how does it hold up under pressure? The deep suite is a harder, trap-laden tier in which every test has a tempting wrong answer baked in. It is run only on top-tier models, to separate the ones that merely pass from the ones that actually reason. Each result below is a pass rate over 3 runs, graded by deterministic validators (and an LLM judge for the open-ended ones).
Read the explainer for how the traps are designed · part of EyesInAI’s measured methodology.
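
As a rough illustration of how a cell such as `2/3` could come about, the sketch below scores three runs of one test with a deterministic validator. The names `exact_number` and `pass_rate` and the sample outputs are hypothetical; the actual harness is not shown on this page.

```python
from typing import Callable, Sequence


def exact_number(expected: float, tolerance: float = 0.0) -> Callable[[str], bool]:
    """Build a validator that passes only if the output parses to the expected number."""
    def validate(output: str) -> bool:
        try:
            return abs(float(output.strip()) - expected) <= tolerance
        except ValueError:
            # Anything that is not a bare number counts as a failure.
            return False
    return validate


def pass_rate(outputs: Sequence[str], validate: Callable[[str], bool]) -> str:
    """Format the number of passing runs as 'passed/total'."""
    passed = sum(1 for out in outputs if validate(out))
    return f"{passed}/{len(outputs)}"


# Three made-up model outputs for the same prompt (assumed expected answer: 42).
runs = ["42", "42.0", "about 42"]
cell = pass_rate(runs, exact_number(42))
print(cell)
```

Because the validator is deterministic, the same three outputs always produce the same cell; only the model's sampling varies between runs.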

| Test | GPT-5 | Gemini | Sonnet | Opus | GPT-4o |
|---|---|---|---|---|---|
| Off-by-one date counting `reasoning_date_arithmetic` · `exact_date` | 2/3 | 3/3 | 0/3 | 0/3 | 3/3 |
| Counterintuitive conditional probability `reasoning_conditional_probability` · `exact_fraction` | 3/3 | 3/3 | 3/3 | 0/3 | 0/3 |
| Unit-conversion trap (one worker already metric) `reasoning_mixed_units` · `exact_number` | 3/3 | 3/3 | 1/3 | 0/3 | 0/3 |
| Anchored arithmetic with a misleading number `reasoning_billing_trap` · `exact_number` | 3/3 | 3/3 | 2/3 | 2/3 | 3/3 |
| Knights-and-knaves constraint satisfaction `reasoning_knights_knaves` · `exact_sequence` | 3/3 | 3/3 | 0/3 | 0/3 | 0/3 |

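
The date test is described only as off-by-one counting; its actual prompt is not reproduced here. A minimal sketch of the general pitfall, using made-up dates: subtracting two dates gives the number of days between them, while counting both endpoints gives one more.

```python
from datetime import date

# Illustrative dates only, not taken from the test itself.
start = date(2024, 3, 1)
end = date(2024, 3, 10)

days_between = (end - start).days   # days elapsed, start day not counted
days_inclusive = days_between + 1   # both the start and end day counted

print(days_between, days_inclusive)
```

Which of the two numbers is correct depends on the wording of the question ("how many days until" versus "how many days including both dates"), which is what makes the wrong answer tempting.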

| Test | GPT-5 | Gemini | Sonnet | Opus | GPT-4o |
|---|---|---|---|---|---|
| Nested JSON with escaping, null, exact values `format_json_escaping` · `json_schema_strict` | 3/3 | 0/3 | 3/3 | 3/3 | 3/3 |
| Instruction-conflict resolution (open-ended → judged) `format_conflicting_constraints` · `rubric` | 2/3 | 1/3 | 1/3 | 2/3 | 0/3 |
| 12-word sentence, 'lift' ×2, no letter e `format_constrained_writing` · `constraint_regex` | 3/3 | 0/3 | 1/3 | 0/3 | 0/3 |
| RFC-4180 CSV with commas/quotes/embedded newline `format_csv_rfc4180` · `csv_parse_strict` | 0/3 | 3/3 | 3/3 | 3/3 | 3/3 |
| Strict 2-decimal string formatting after tax `format_number_rounding` · `json_exact` | 3/3 | 3/3 | 3/3 | 3/3 | 3/3 |

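
The constrained-writing row states its three conditions directly, so a deterministic check is easy to sketch. The function below is a hypothetical stand-in for the `constraint_regex` validator, and it assumes that 'lift' ×2 means exactly two occurrences as a whole word; the real validator may count differently.

```python
import re


def check_constrained_sentence(text: str) -> bool:
    """Check: exactly 12 words, the word 'lift' exactly twice, no letter 'e' anywhere."""
    words = re.findall(r"[A-Za-z']+", text)
    has_twelve_words = len(words) == 12
    lift_count = sum(1 for word in words if word.lower() == "lift")
    no_letter_e = "e" not in text.lower()
    return has_twelve_words and lift_count == 2 and no_letter_e


# Made-up candidate sentences for illustration.
passing = "I lift a big box, and you lift a small black bag."
failing = "I lift the big box, and you lift a small black bag."  # 'the' contains an e

print(check_constrained_sentence(passing), check_constrained_sentence(failing))
```

All three conditions must hold at once, which may explain why a sentence that reads naturally often fails on one of them.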

| Test | GPT-5 | Gemini | Sonnet | Opus | GPT-4o |
|---|---|---|---|---|---|
| Chain calls where output feeds the next `tool_sequential_dependency` · `tool_call_sequence` | 3/3 | 3/3 | 3/3 | 3/3 | 3/3 |
| Know when NOT to call a tool `tool_abstention` · `no_tool_call` | 3/3 | 3/3 | 3/3 | 3/3 | 3/3 |
| Ask for a missing required arg instead of fabricating `tool_missing_param` · `rubric` | 3/3 | 3/3 | 3/3 | 3/3 | 1/3 |
| Pick the right tool among look-alikes `tool_selection_correct` · `tool_call_match` | 3/3 | 3/3 | 3/3 | 3/3 | 3/3 |

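
For the chained-call test, one plausible shape for a sequence check is sketched below. The call-record layout (`name`, `args`, `result`) and the tool names are assumptions made for illustration; the page does not document the harness's real format.

```python
from typing import Any


def chained_correctly(calls: list[dict[str, Any]], expected_names: list[str]) -> bool:
    """Pass if tools ran in the expected order and each call's arguments
    include the previous call's result (so nothing was fabricated)."""
    if [call["name"] for call in calls] != expected_names:
        return False
    for previous, current in zip(calls, calls[1:]):
        if previous["result"] not in current["args"].values():
            return False
    return True


# Hypothetical transcript: the second call reuses the first call's output.
good_calls = [
    {"name": "lookup_user", "args": {"email": "a@example.com"}, "result": "user_17"},
    {"name": "get_orders", "args": {"user_id": "user_17"}, "result": ["order_3"]},
]
# Same order of tools, but the user id was invented rather than taken from the result.
bad_calls = [
    {"name": "lookup_user", "args": {"email": "a@example.com"}, "result": "user_17"},
    {"name": "get_orders", "args": {"user_id": "user_99"}, "result": []},
]

print(chained_correctly(good_calls, ["lookup_user", "get_orders"]))
print(chained_correctly(bad_calls, ["lookup_user", "get_orders"]))
```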

| Test | GPT-5 | Gemini | Sonnet | Opus | GPT-4o |
|---|---|---|---|---|---|
| Three needles, returned in document order `longctx_multi_needle` · `multi_needle_ordered` | 3/3 | 3/3 | 3/3 | 3/3 | 3/3 |
| Ignore decoys; pick the one 'active' ticket `longctx_distractor_needles` · `exact_number` | 3/3 | 3/3 | 3/3 | 3/3 | 3/3 |
| Combine three facts spread far apart `longctx_scattered_facts` · `exact_number` | 3/3 | 3/3 | 3/3 | 3/3 | 3/3 |
| Later correction supersedes the earlier value `longctx_contradiction_override` · `exact_number` | 3/3 | 3/3 | 3/3 | 3/3 | 3/3 |

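
An ordered multi-needle check can be sketched as a left-to-right scan: every needle must appear in the answer, and each must appear after the previous one. The function name and the sample needles are hypothetical; this is one way such a validator might work, not necessarily how `multi_needle_ordered` is implemented.

```python
def needles_in_order(answer: str, needles: list[str]) -> bool:
    """Return True if every needle occurs in the answer, in the given order."""
    position = 0
    for needle in needles:
        found = answer.find(needle, position)
        if found == -1:
            return False
        # Continue searching only after the needle just matched.
        position = found + len(needle)
    return True


# Made-up needles, listed in the order they appear in the source document.
expected = ["A7", "K2", "Z9"]
print(needles_in_order("A7, K2, Z9", expected))
print(needles_in_order("K2, A7, Z9", expected))
```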
A low score on one test is a finding, not a verdict, because these tests are deliberately adversarial. The value is in the pattern of which model holds up where. Each test is run 3× so that a single lucky pass doesn’t flatter a model. Generated 2026-06-20.
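
One way to read the pattern is to total passes per category for each model. The sketch below copies the pass counts from the tables above and groups tests by the prefix of their ID (`reasoning`, `format`, `tool`, `longctx`); grouping by prefix is an assumption made here, not something the page defines.

```python
from collections import defaultdict

MODELS = ["GPT-5", "Gemini", "Sonnet", "Opus", "GPT-4o"]
RUNS = 3

# Passes out of 3 per test, in MODELS order, copied from the tables above.
RESULTS = {
    "reasoning_date_arithmetic": [2, 3, 0, 0, 3],
    "reasoning_conditional_probability": [3, 3, 3, 0, 0],
    "reasoning_mixed_units": [3, 3, 1, 0, 0],
    "reasoning_billing_trap": [3, 3, 2, 2, 3],
    "reasoning_knights_knaves": [3, 3, 0, 0, 0],
    "format_json_escaping": [3, 0, 3, 3, 3],
    "format_conflicting_constraints": [2, 1, 1, 2, 0],
    "format_constrained_writing": [3, 0, 1, 0, 0],
    "format_csv_rfc4180": [0, 3, 3, 3, 3],
    "format_number_rounding": [3, 3, 3, 3, 3],
    "tool_sequential_dependency": [3, 3, 3, 3, 3],
    "tool_abstention": [3, 3, 3, 3, 3],
    "tool_missing_param": [3, 3, 3, 3, 1],
    "tool_selection_correct": [3, 3, 3, 3, 3],
    "longctx_multi_needle": [3, 3, 3, 3, 3],
    "longctx_distractor_needles": [3, 3, 3, 3, 3],
    "longctx_scattered_facts": [3, 3, 3, 3, 3],
    "longctx_contradiction_override": [3, 3, 3, 3, 3],
}


def category_totals(results: dict[str, list[int]]) -> dict[str, dict[str, str]]:
    """Sum passes per model within each category, formatted as 'passed/total'."""
    passed = defaultdict(lambda: [0] * len(MODELS))
    tests = defaultdict(int)
    for test_id, row in results.items():
        category = test_id.split("_", 1)[0]  # e.g. 'reasoning'
        tests[category] += 1
        for index, count in enumerate(row):
            passed[category][index] += count
    return {
        category: {
            model: f"{passed[category][index]}/{tests[category] * RUNS}"
            for index, model in enumerate(MODELS)
        }
        for category in passed
    }


totals = category_totals(RESULTS)
for category, by_model in totals.items():
    print(category, by_model)
```

With only three runs per test these totals are coarse, so they are better read as a rough map of strengths and weaknesses than as a ranking.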