Optimize for cost and quality across 6 providers and 255 models. Real costs, real latency, real pass rates — measured nightly with deterministic validators, not vendor claims.
Groq's open Llama wins on speed, cost, and reliability; Anthropic times out on code.
Groq's llama-3.1-8b-instant dominates speed and cost efficiency across most benchmarks, achieving 415.3 tokens/sec on the speed test while costing under $0.00001 per request. Anthropic's claude-opus-4-7 suffered a critical timeout on the code test (20s latency), the only regression in the active cache. The surprising finding: o1-pro (OpenAI) shows the lowest median latency, 282ms across historical tests, contradicting expectations for reasoning-focused models.
lowest median latency across all tests at 282ms
per 1,000 successful task runs
Nemotron 3.5 Content Safety: Customizable Multimodal Safety for Global Enterprise AI
Nemotron 3.5 Content Safety provides customizable multimodal safety mechanisms for enterprise AI applications. It enables safety filtering across multiple modalities.
4 models that aren't dominated by any cheaper model on pass rate. Pick from this list and you're on the efficient frontier.
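As a rough illustration of what "not dominated by any cheaper model" can mean, here is a minimal sketch. The `ModelStats` type, the `efficient_frontier` function, the sample model names, and the tie-breaking rule (a model is dropped if one at equal or lower cost matches or beats its pass rate) are all assumptions made for this example, not the site's actual method.

```python
from typing import NamedTuple


class ModelStats(NamedTuple):
    name: str             # hypothetical model identifier
    cost_per_task: float  # USD per task (assumed unit)
    pass_rate: float      # fraction of tests passed, 0.0 to 1.0


def efficient_frontier(models):
    """Return models that no cheaper-or-equal-cost model matches on pass rate."""
    frontier = []
    best_pass_rate = float("-inf")
    # Walk from cheapest to most expensive; on a cost tie, try the better model first.
    for m in sorted(models, key=lambda m: (m.cost_per_task, -m.pass_rate)):
        # A model earns its place only by beating every cheaper model's pass rate.
        if m.pass_rate > best_pass_rate:
            frontier.append(m)
            best_pass_rate = m.pass_rate
    return frontier


# Made-up numbers, for illustration only.
sample = [
    ModelStats("model-a", 0.00001, 0.75),
    ModelStats("model-b", 0.0002, 0.70),  # costs more than model-a, passes less
    ModelStats("model-c", 0.001, 0.90),
    ModelStats("model-d", 0.002, 0.90),   # costs more than model-c, no better
]

frontier_names = [m.name for m in efficient_frontier(sample)]
```

With the made-up data above, `frontier_names` contains only `model-a` and `model-c`: each of the other two is matched or beaten by something cheaper.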
Aggregated from provider blogs & status pages, classified by Claude.
NVIDIA Nemotron 3 Ultra Powers Faster, More Efficient Reasoning for Long-Running Agents
NVIDIA has released Nemotron 3 Ultra, designed to handle the growing demands of autonomous agent systems that operate over extended sequences of interactions. Unlike traditional single-turn conversational models, this offering enables complex workflows where agents must track information, reason through multi-step problems, and integrate external tool calls without losing the thread. The architecture prioritizes computational efficiency while maintaining the contextual awareness necessary for agents to execute sophisticated tasks across many consecutive rounds of execution.
8 deterministic tests per model. Pass means the response was correct: we execute code, parse JSON, check facts. Not just “model returned 200.”
Basic math, multi-step logic
Python function synthesis with sandbox execution
Strict JSON schema with type validation
BBH-lite: boolean + web-of-lies + counting
Find a code buried in 3.5k tokens of filler
Fact extraction from in-prompt data
Every model on every test, with cost per task, latency, throughput, and pass rate. Free, no signup, refreshed nightly.
EVA-Bench Data 2.0: 3 Domains, 121 Tools, 213 Scenarios
EVA-Bench Data 2.0 introduces an evaluation benchmark with 121 tools across 3 domains and 213 scenarios. It supports tool-use evaluation for LLMs.
Function/tool invocation with correct arguments
IFEval-style: bullet format + keyword constraints
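To make "deterministic validator" concrete, here is a minimal sketch of two such checks in Python: a strict JSON check with type validation, and a needle-in-filler check. The function names, the simple `{field: type}` schema shape, and the sample needle string are assumptions for illustration; the real validators behind these tests are not described on this page.

```python
import json


def validate_json(response: str, schema: dict) -> bool:
    """Pass only if the response is a JSON object whose fields have the expected types.

    `schema` maps field name to Python type, e.g. {"name": str, "count": int}.
    This is a simplified stand-in for full JSON Schema validation.
    """
    try:
        data = json.loads(response)
    except ValueError:  # json.JSONDecodeError is a subclass of ValueError
        return False
    if not isinstance(data, dict):
        return False
    for key, expected_type in schema.items():
        if key not in data:
            return False
        value = data[key]
        # bool is a subclass of int in Python, so reject it unless bool was asked for.
        if isinstance(value, bool) and expected_type is not bool:
            return False
        if not isinstance(value, expected_type):
            return False
    return True


def validate_needle(response: str, needle: str) -> bool:
    """Pass only if the hidden code appears verbatim in the response."""
    return needle in response


schema = {"name": str, "count": int}  # hypothetical schema for the example
ok = validate_json('{"name": "widget", "count": 3}', schema)
wrong_type = validate_json('{"name": "widget", "count": "3"}', schema)
not_json = validate_json("Sure! Here is the JSON you asked for.", schema)
found = validate_needle("The code is ZX-4471.", "ZX-4471")
```

In this sketch a response passes or fails on its content alone, so the same response always gets the same verdict and an HTTP 200 with a malformed body still counts as a failure.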