Live benchmark results across all tested AI models. Sortable by cost, latency, throughput.
Gemini 2.0 Flash has been retired; Claude Haiku is the new cost-efficiency leader at $0.17 per 1M tokens, with an 83% overall pass rate and 100% on the json, code, and reasoning tests.
Claude Haiku (anthropic/claude-3-haiku) is the cheapest passing model at $0.17 per 1M tokens, with an 83% pass rate. Gemini 2.0 Flash has been retired upstream and now returns 404 errors on every test. No regressions were detected in active models. OpenRouter's free-tier models (Nemotron 3-nano, GPT-OSS) offer strong cost-efficiency but lower pass rates; Claude Sonnet 4.6 achieves a 100% pass rate at $2.50 per 1M tokens.
The Pareto frontier spans three models: nvidia/nemotron-3-nano-30b-a3b (67% pass, free), anthropic/claude-3-haiku (83% pass, $0.17/1M), and anthropic/claude-sonnet-4.6 (100% pass, $2.50/1M). The free NVIDIA model offers a zero-cost baseline but lower accuracy; Haiku balances cost and reliability; Sonnet is the most reliable option for mission-critical deployments.
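The frontier above can be reproduced with a short dominance check. This is a minimal sketch, not the dashboard's actual implementation: the `Model` type and `pareto_frontier` function are illustrative names, and the fourth entry is a hypothetical dominated model added only to show the filter working.

```python
from typing import NamedTuple


class Model(NamedTuple):
    name: str
    pass_rate: float    # fraction of tests passed (higher is better)
    cost_per_1m: float  # USD per 1M tokens (lower is better)


def dominates(a: Model, b: Model) -> bool:
    """a dominates b if it is no worse on both axes and strictly better on one."""
    no_worse = a.pass_rate >= b.pass_rate and a.cost_per_1m <= b.cost_per_1m
    strictly_better = a.pass_rate > b.pass_rate or a.cost_per_1m < b.cost_per_1m
    return no_worse and strictly_better


def pareto_frontier(models: list[Model]) -> list[Model]:
    """Keep every model that no other model dominates."""
    return [m for m in models if not any(dominates(o, m) for o in models)]


models = [
    Model("nvidia/nemotron-3-nano-30b-a3b", 0.67, 0.00),
    Model("anthropic/claude-3-haiku", 0.83, 0.17),
    Model("anthropic/claude-sonnet-4.6", 1.00, 2.50),
    # Hypothetical entry, not from the benchmark: same pass rate as the
    # free model but more expensive, so it is dominated and filtered out.
    Model("example/hypothetical-model", 0.67, 1.00),
]

frontier = pareto_frontier(models)
frontier_names = [m.name for m in frontier]
```

Each frontier model is a distinct trade-off: moving to a cheaper one lowers the pass rate, and moving to a more accurate one raises the cost.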
For ping/preprocessing, route to Groq's openai/gpt-oss-20b (263ms, free). For speed (throughput on 500-token generations), use the same model (618.8 tps). For json, code, and reasoning (structured output), use anthropic/claude-haiku-4-5-20251001 or openrouter/claude-3-haiku (both ~1s latency, 100% pass). For context retrieval (3-fact recall from 20 options), Anthropic's Claude Haiku passes 100%; for tool_use (get_weather calls), it also passes 100%. High-value recommendation: use anthropic/claude-sonnet-4.6 for any mission-critical task (100% pass, <2s latency, $2.50/1M).
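The routing recommendations above could be encoded as a simple lookup table. The sketch below is illustrative only: `ROUTES` and `pick_model` are hypothetical names, the (provider, model) tuple format is an assumption, and sending unknown task types to Sonnet is a design choice made here rather than something the benchmark prescribes.

```python
# (provider, model) pairs taken from the recommendations above.
GROQ_GPT_OSS = ("groq", "openai/gpt-oss-20b")
HAIKU = ("anthropic", "claude-haiku-4-5-20251001")
SONNET = ("anthropic", "claude-sonnet-4.6")

ROUTES = {
    "ping": GROQ_GPT_OSS,
    "speed": GROQ_GPT_OSS,
    "json": HAIKU,
    "code": HAIKU,
    "reasoning": HAIKU,
    # The source says only "claude-haiku" for these two tests; using the
    # same Haiku variant as above is an assumption.
    "context": HAIKU,
    "tool_use": HAIKU,
}


def pick_model(task: str, mission_critical: bool = False) -> tuple[str, str]:
    """Return the (provider, model) pair for a task type.

    Mission-critical requests always go to Sonnet. Unknown task types
    also fall back to Sonnet (an assumption, chosen as the safest default).
    """
    if mission_critical:
        return SONNET
    return ROUTES.get(task, SONNET)
```

For example, `pick_model("ping")` returns the Groq route, while `pick_model("json", mission_critical=True)` returns Sonnet regardless of the task type.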
allenai/olmo-3-32b-think has the lowest latency across all benchmarks (avg 166ms) but fails every test (0% pass rate on all 6 core tests). Despite sub-200ms response times, the model is non-functional for all measured tasks, most likely because of a downstream API issue or a fundamentally broken model variant.
lowest median latency across all tests at 166ms
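The olmo result shows why a latency ranking should be gated on correctness. The sketch below is one possible way to do that, not the dashboard's method: `fastest_passing` and the `min_pass_rate` threshold are hypothetical, and the Haiku latency is the approximate "~1s" figure reported for it, rounded to 1000 ms.

```python
from typing import Optional

results = [
    {"model": "allenai/olmo-3-32b-think", "latency_ms": 166, "pass_rate": 0.0},
    # Latency is approximate ("~1s" in the results), not an exact measurement.
    {"model": "anthropic/claude-3-haiku", "latency_ms": 1000, "pass_rate": 0.83},
]


def fastest_passing(rows: list[dict], min_pass_rate: float = 0.5) -> Optional[dict]:
    """Return the lowest-latency row whose pass rate meets the threshold.

    The 0.5 default is an arbitrary assumption; returns None if no row qualifies.
    """
    eligible = [r for r in rows if r["pass_rate"] >= min_pass_rate]
    return min(eligible, key=lambda r: r["latency_ms"], default=None)


winner = fastest_passing(results)
ungated = fastest_passing(results, min_pass_rate=0.0)
```

With the gate in place the 166ms model is excluded; with the threshold set to 0.0 it wins on latency despite failing every test.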
Migrate gemini-2.0-flash users to anthropic/claude-sonnet-4.6 (100% pass rate, $2.50/1M) or to anthropic/claude-3-haiku via OpenRouter ($0.17/1M, 83% pass rate). For latency-critical workloads, use Groq's openai/gpt-oss-20b (263ms ping, free) only for the ping and speed tasks; route structured-output tasks (json, code, reasoning) to Claude Haiku.