Benchmarks · the deep suite

Where the leaderboard ends, the deep suite begins

The main leaderboard answers “can a model do the basics, cheaply?” The deep suite asks the next question: once a model clears that bar, how does it hold up under pressure? It is a harder, trap-laden tier, in which every test has a tempting wrong answer baked in, run only on top-tier models to separate the ones that merely pass from the ones that actually reason. Each result below is a pass-rate over 3 runs, graded by deterministic validators (with an LLM judge for the open-ended tests).

Read the explainer for how the traps are designed · part of EyesInAI’s measured methodology.

Model	Avg pass-rate
GPT-5	91% (leads)
Gemini 2.5 Pro	85%
Sonnet 4.6	76%
Opus 4.8	69%
GPT-4o	69%

Reasoning under pressure

Test	ID	Validator	GPT-5	Gemini	Sonnet	Opus	GPT-4o
Off-by-one date counting	reasoning_date_arithmetic	`exact_date`	2/3	3/3	0/3	0/3	3/3
Counterintuitive conditional probability	reasoning_conditional_probability	`exact_fraction`	3/3	3/3	3/3	0/3	0/3
Unit-conversion trap (one worker already metric)	reasoning_mixed_units	`exact_number`	3/3	3/3	1/3	0/3	0/3
Anchored arithmetic with a misleading number	reasoning_billing_trap	`exact_number`	3/3	3/3	2/3	2/3	3/3
Knights-and-knaves constraint satisfaction	reasoning_knights_knaves	`exact_sequence`	3/3	3/3	0/3	0/3	0/3
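
To make "deterministic validator" concrete, here is a minimal sketch of what a check like `exact_fraction` could look like. The function name mirrors the validator label in the table, but the implementation is an assumption for illustration, not EyesInAI's actual code; in particular, accepting equivalent forms such as 2/6 for 1/3 is a design choice made here, not something the suite documents.

```python
from fractions import Fraction


def exact_fraction(answer: str, expected: str) -> bool:
    """Hypothetical validator: pass only if the answer is exactly the expected value.

    Assumption: equivalent forms (e.g. "2/6" for "1/3") count as correct.
    Anything that does not parse as a number fails.
    """
    try:
        return Fraction(answer.strip()) == Fraction(expected)
    except (ValueError, ZeroDivisionError):
        # Free text, malformed input, or a zero denominator is a failed run.
        return False


print(exact_fraction("1/3", "1/3"))
print(exact_fraction("1/2", "1/3"))
```

A check like this has no partial credit: the tempting wrong answer fails just as hard as a nonsense answer, which is what makes the trap visible in the pass counts.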

Format fidelity

Test	ID	Validator	GPT-5	Gemini	Sonnet	Opus	GPT-4o
Nested JSON with escaping, null, exact values	format_json_escaping	`json_schema_strict`	3/3	0/3	3/3	3/3	3/3
Instruction-conflict resolution (open-ended → judged)	format_conflicting_constraints	`rubric`	2/3	1/3	1/3	2/3	0/3
12-word sentence, 'lift' x2, no letter e	format_constrained_writing	`constraint_regex`	3/3	0/3	1/3	0/3	0/3
RFC-4180 CSV with commas/quotes/embedded newline	format_csv_rfc4180	`csv_parse_strict`	0/3	3/3	3/3	3/3	3/3
Strict 2-decimal string formatting after tax	format_number_rounding	`json_exact`	3/3	3/3	3/3	3/3	3/3
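
The constrained-writing row packs three rules into one sentence, so a sketch of how such a check might work is useful. The code below is an illustrative guess at a `constraint_regex`-style validator, not the suite's real one. It assumes that "lift" must appear exactly twice as a whole word (case-insensitive) and that words are runs of letters and apostrophes; the function name `check_constrained_sentence` is hypothetical.

```python
import re


def check_constrained_sentence(text: str) -> bool:
    """Hypothetical check: exactly 12 words, 'lift' exactly twice, no letter 'e'."""
    # Assumption: a word is a run of letters or apostrophes; punctuation is ignored.
    words = re.findall(r"[A-Za-z']+", text)
    has_twelve_words = len(words) == 12
    lift_twice = sum(word.lower() == "lift" for word in words) == 2
    no_letter_e = "e" not in text.lower()
    return has_twelve_words and lift_twice and no_letter_e


print(check_constrained_sentence("I lift a box, and you lift a big bag of sand."))
```

All three conditions must hold at once, which may explain why this row is among the lowest-scoring in the table: satisfying two of the three still counts as a failed run.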

Tool use / agentic

Test	ID	Validator	GPT-5	Gemini	Sonnet	Opus	GPT-4o
Chain calls where output feeds the next	tool_sequential_dependency	`tool_call_sequence`	3/3	3/3	3/3	3/3	3/3
Know when NOT to call a tool	tool_abstention	`no_tool_call`	3/3	3/3	3/3	3/3	3/3
Ask for a missing required arg instead of fabricating	tool_missing_param	`rubric`	3/3	3/3	3/3	3/3	1/3
Pick the right tool among look-alikes	tool_selection_correct	`tool_call_match`	3/3	3/3	3/3	3/3	3/3
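
For the tool-use rows, the validators presumably inspect the list of tool calls a model emitted rather than its prose. The sketch below shows one way `no_tool_call` and `tool_call_sequence` could be written. The `ToolCall` record, its fields, and the tool names in the example are all assumptions made for illustration; the suite's real data model is not described on this page.

```python
from dataclasses import dataclass, field


@dataclass
class ToolCall:
    """Hypothetical record of one tool invocation made by the model."""
    name: str
    args: dict = field(default_factory=dict)


def no_tool_call(calls: list[ToolCall]) -> bool:
    """Pass only if the model answered without invoking any tool."""
    return len(calls) == 0


def tool_call_sequence(calls: list[ToolCall], expected_names: list[str]) -> bool:
    """Pass only if the tools were called in exactly the expected order."""
    return [call.name for call in calls] == expected_names


# Example with made-up tool names: look up a user, then fetch that user's orders.
calls = [ToolCall("lookup_user", {"email": "a@example.com"}),
         ToolCall("get_orders", {"user_id": 42})]
print(tool_call_sequence(calls, ["lookup_user", "get_orders"]))
```

This sketch only checks call order. A fuller check for the "output feeds the next" test would also need to confirm that an argument of the second call matches the result of the first, which requires recording tool results as well.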

Long-context

Test	ID	Validator	GPT-5	Gemini	Sonnet	Opus	GPT-4o
Three needles, returned in document order	longctx_multi_needle	`multi_needle_ordered`	3/3	3/3	3/3	3/3	3/3
Ignore decoys; pick the one 'active' ticket	longctx_distractor_needles	`exact_number`	3/3	3/3	3/3	3/3	3/3
Combine three facts spread far apart	longctx_scattered_facts	`exact_number`	3/3	3/3	3/3	3/3	3/3
Later correction supersedes the earlier value	longctx_contradiction_override	`exact_number`	3/3	3/3	3/3	3/3	3/3
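
The multi-needle test asks for more than retrieval: the three values must come back in document order. One simple way to check that, sketched here under the assumption that needles are matched as literal substrings, is to scan the answer left to right and require each needle to appear after the previous one. The needle strings in the example are invented.

```python
def multi_needle_ordered(answer: str, needles: list[str]) -> bool:
    """Hypothetical check: every needle appears in the answer, in the given order."""
    position = 0
    for needle in needles:
        found = answer.find(needle, position)
        if found == -1:
            # Needle is missing, or appears only before an earlier needle.
            return False
        position = found + len(needle)
    return True


needles_in_document_order = ["alpha-7", "bravo-2", "charlie-9"]
print(multi_needle_ordered("alpha-7, bravo-2, charlie-9", needles_in_document_order))
print(multi_needle_ordered("charlie-9, bravo-2, alpha-7", needles_in_document_order))
```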

A low score on one test is a finding, not a verdict: these tests are deliberately adversarial. The value is in the pattern of which model holds up where. Each test is run 3 times so that a single lucky pass doesn’t flatter a model. Generated 2026-06-20.
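
The headline averages can be cross-checked against the per-test tables. The sketch below assumes the average is simply total passes divided by total runs across all 18 tests, with every test weighted equally. The page does not state the aggregation method, so this is an assumption, though it does reproduce the published figures. The pass counts are copied from the four tables in row order.

```python
RUNS_PER_TEST = 3

# Passes per test, in table order: reasoning (5), format (5), tool use (4), long-context (4).
passes = {
    "GPT-5":          [2, 3, 3, 3, 3,  3, 2, 3, 0, 3,  3, 3, 3, 3,  3, 3, 3, 3],
    "Gemini 2.5 Pro": [3, 3, 3, 3, 3,  0, 1, 0, 3, 3,  3, 3, 3, 3,  3, 3, 3, 3],
    "Sonnet 4.6":     [0, 3, 1, 2, 0,  3, 1, 1, 3, 3,  3, 3, 3, 3,  3, 3, 3, 3],
    "Opus 4.8":       [0, 0, 0, 2, 0,  3, 2, 0, 3, 3,  3, 3, 3, 3,  3, 3, 3, 3],
    "GPT-4o":         [3, 0, 0, 3, 0,  3, 0, 0, 3, 3,  3, 3, 1, 3,  3, 3, 3, 3],
}


def avg_pass_rate(per_test_passes: list[int], runs: int = RUNS_PER_TEST) -> float:
    """Total passes over total runs, assuming every test is weighted equally."""
    return sum(per_test_passes) / (runs * len(per_test_passes))


for model, row in passes.items():
    print(f"{model}: {sum(row)}/{RUNS_PER_TEST * len(row)} = {avg_pass_rate(row):.0%}")
```

Under this assumption, Opus 4.8 and GPT-4o tie at 37 of 54 runs, but they get there differently: GPT-4o drops runs on tool use, while Opus drops them almost entirely on reasoning. That is the kind of pattern the averages alone hide.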

How the deep suite is built
