Leaderboard

Live benchmark results across all tested AI models. Sortable by cost, latency, throughput.

Models tested: 222 (across 18 providers)

Tasks completed: 1339 (of 1556 attempts)

Avg cost / 1k runs: $0.643 (across passing tasks)

Best value: <$0.01 (Mistral-Nemo-Instruct-2407)


Today's Headline (Jul 18, 09:00 PM)

Free tier Llama 3.1-70B and ultra-cheap DeepSeek v4-Pro break cost-quality tradeoff; premium Claude loses on tool use.

Deepseek-v4-pro dominates cost-efficiency at $0.154/1M tokens while maintaining 100% pass rates on core tasks. No regressions detected. Most surprising: nvidia/abacusai/dracarys-llama-3.1-70b-instruct is FREE and passes 83% of tests, making it a strong default for cost-constrained deployments. Claude Opus 4.8 remains premium leader but fails tool_use; groq/compound has speed timeouts.

Cost-vs-Quality Frontier

2 optimal models

deepinfra, nvidia

Two models define the cost-optimal frontier: nvidia/dracarys-llama-3.1-70b-instruct at $0/1M tokens with an 83% pass rate (5/6 tests), and deepinfra/zai-org/GLM-4.7-Flash at $0.22/1M with a 100% pass rate. DeepSeek v4-Pro ($0.154/1M) bridges the mid-range with flawless execution. groq/compound, also at $0, fails tool_use and times out on the speed test, which disqualifies it despite its zero marginal cost.
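The frontier idea can be sketched as a plain Pareto filter: keep every model that no other model beats on both cost and pass rate. This is a minimal illustration only; the `Model` type and `pareto_frontier` function are hypothetical names, the sample rows reuse three price and pass-rate figures quoted on this page, and the dashboard's actual selection criteria are not stated.

```python
from typing import NamedTuple


class Model(NamedTuple):
    name: str
    cost_per_1m: float  # USD per 1M tokens
    pass_rate: float    # fraction of tests passed, 0.0 to 1.0


def pareto_frontier(models):
    """Return the models that are not dominated on (lower cost, higher pass rate)."""
    frontier = []
    for m in models:
        dominated = any(
            o.cost_per_1m <= m.cost_per_1m
            and o.pass_rate >= m.pass_rate
            # strictly better on at least one axis, so ties do not dominate
            and (o.cost_per_1m < m.cost_per_1m or o.pass_rate > m.pass_rate)
            for o in models
        )
        if not dominated:
            frontier.append(m)
    return sorted(frontier, key=lambda m: m.cost_per_1m)


# Sample rows taken from figures quoted on this page (illustrative subset).
models = [
    Model("nvidia/dracarys-llama-3.1-70b-instruct", 0.0, 0.83),
    Model("deepinfra/zai-org/GLM-4.7-Flash", 0.22, 1.00),
    Model("together/MiniMax-M2.7", 0.396, 1.00),
]

for m in pareto_frontier(models):
    print(f"{m.name}: ${m.cost_per_1m}/1M, {m.pass_rate:.0%} pass")
```

On this subset, together/MiniMax-M2.7 drops out because GLM-4.7-Flash matches its pass rate at a lower price.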

Per-Task Recommendations

·Ping/reasoning: nvidia/gliner-pii (479ms ping, 2131ms reasoning)
·JSON output: together/llama-3.3-70b (940ms, 100% pass)
·Code generation: groq/compound (3079ms, 100% pass)
·Throughput: together/MiniMax-M2.7 (115.9 tps, 100% pass)
·Tool use (critical): fireworks/kimi-k2p6 (100% pass, 1608ms) or sambanova/MiniMax-M2.7 (100% pass, 911ms)
·Context retrieval: nvidia/gliner-pii (531ms, 100% pass)
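One way to act on these recommendations is a small routing table keyed by task type. The sketch below is an assumption-laden illustration: `ROUTES`, `FALLBACKS`, and `pick_model` are hypothetical names, the task keys are invented labels, and choosing sambanova/MiniMax-M2.7 as the primary tool-use route (because of its lower quoted latency) is a judgment call, not something the leaderboard prescribes.

```python
# Primary model per task type, copied from the per-task recommendations.
ROUTES = {
    "ping": "nvidia/gliner-pii",
    "reasoning": "nvidia/gliner-pii",
    "json": "together/llama-3.3-70b",
    "code": "groq/compound",
    "throughput": "together/MiniMax-M2.7",
    "tool_use": "sambanova/MiniMax-M2.7",  # assumption: lower latency goes first
    "context": "nvidia/gliner-pii",
}

# Alternates to try when the primary is unavailable (only tool use lists one).
FALLBACKS = {
    "tool_use": ["fireworks/kimi-k2p6"],
}


def pick_model(task, unavailable=frozenset()):
    """Return the first recommended model for `task` that is not marked unavailable."""
    if task not in ROUTES:
        raise KeyError(f"no recommendation for task type: {task!r}")
    candidates = [ROUTES[task]] + FALLBACKS.get(task, [])
    for model in candidates:
        if model not in unavailable:
            return model
    raise LookupError(f"all recommended models for {task!r} are unavailable")


print(pick_model("json"))
print(pick_model("tool_use", unavailable={"sambanova/MiniMax-M2.7"}))
```

Keeping the table as plain data makes it easy to regenerate when the leaderboard's recommendations change.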

Code

cheapest: nvidia/dracarys-llama-3.1-70b-instruct (free)

fastest: groq/compound (3079ms)

throughput: groq/compound (161 tok/s)

JSON

cheapest: nvidia/dracarys-llama-3.1-70b-instruct (free)

fastest

Surprising Findings

nvidia/abacusai/dracarys-llama-3.1-70b-instruct is completely free yet achieves an 83% pass rate (5 of 6 tests passing; only the speed test timed out). This breaks the usual assumption that quality tracks cost and makes it a strong default for budget-constrained teams. Conversely, the premium claude-opus-4-8, at $5-25/1M tokens, fails tool_use despite scoring 100% on reasoning, json, code, and speed.

nvidia/dracarys-llama-3.1-70b-instruct: FREE model with 83% pass rate; strong default for cost-sensitive workloads

openrouter/olmo-3-32b-think: lowest median latency across all tests at 192ms

Migration Paths

retired or failing models — consider switching

from openrouter/glm-5.2, try:
·nvidia/dracarys-llama-3.1-70b-instruct (83% · no price listed)
·deepinfra/GLM-4.7-Flash (100% · $0.218)
·together/MiniMax-M2.7 (100% · $0.396)

from openrouter/glm-4.6, try:
·nvidia/dracarys-llama-3.1-70b-instruct

Category Winners

Code: groq/compound (3079ms)

JSON: together/Llama-3.3-70B-Instruct-Turbo (100%)

Ping: nvidia/gliner-pii (479ms)

Speed: together/MiniMax-M2.7 (115.9 tok/s)

Context: nvidia/gliner-pii (100%)

Provider Health

xai: pass rate 87%, avg latency 3705ms
groq: pass rate 71%, avg latency 4972ms
zhipu: pass rate 30%, avg latency 5908ms
gemini: pass rate 100%, avg latency 2141ms
nvidia: pass rate 50%, avg latency 8421ms
baseten: pass rate 100%, avg latency 1119ms
mistral: pass rate 86%, avg latency 2744ms
cerebras: pass rate 86%, avg latency 3651ms
deepseek: pass rate 100%, avg latency 461ms
moonshot: pass rate 87%, avg latency 7047ms
anthropic: pass rate 86%, avg latency 2128ms
dashscope: pass rate 100%, avg latency 2393ms
fireworks: pass rate 100%, avg latency 4240ms
sambanova: pass rate 100%, avg latency 1683ms
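The provider figures above lend themselves to a simple filter-and-rank step, for example to shortlist providers with a perfect pass rate and order them by latency. This is a rough sketch under stated assumptions: `PROVIDER_HEALTH` and `healthy_providers` are hypothetical names, and the thresholds are arbitrary defaults rather than anything the leaderboard defines.

```python
# (pass rate in percent, average latency in ms), copied from the provider health figures.
PROVIDER_HEALTH = {
    "xai": (87, 3705),
    "groq": (71, 4972),
    "zhipu": (30, 5908),
    "gemini": (100, 2141),
    "nvidia": (50, 8421),
    "baseten": (100, 1119),
    "mistral": (86, 2744),
    "cerebras": (86, 3651),
    "deepseek": (100, 461),
    "moonshot": (87, 7047),
    "anthropic": (86, 2128),
    "dashscope": (100, 2393),
    "fireworks": (100, 4240),
    "sambanova": (100, 1683),
}


def healthy_providers(health, min_pass=100, max_latency_ms=None):
    """Return provider names meeting the thresholds, fastest first."""
    selected = [
        (latency, name)
        for name, (pass_rate, latency) in health.items()
        if pass_rate >= min_pass
        and (max_latency_ms is None or latency <= max_latency_ms)
    ]
    return [name for latency, name in sorted(selected)]


print(healthy_providers(PROVIDER_HEALTH))
print(healthy_providers(PROVIDER_HEALTH, min_pass=85, max_latency_ms=3000))
```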

Notable Changes

·zhipu/glm-4.6: rate-limited after 2/16 tests; account quota exceeded (Chinese service degradation)
·groq/compound: speed test timeout at 20.1s latency; tool_use unsupported (model limitation)
·anthropic/claude-opus-4-8: tool_use failed with wrong_tool_or_no_call error despite 100% on other 6/7 tests
·cerebras/zai-glm-4.7: tool_use rate-limited (quota_exceeded); 1/7 tests lost
·nvidia/dracarys-llama: 50% pass_rate driven by 4 test timeouts; context/code pass but speed/json/reasoning failed

Recommendation

Migrate cost-sensitive workloads to nvidia/abacusai/dracarys-llama-3.1-70b-instruct (FREE, 83% pass) or deepseek-v4-pro ($0.154/1M, 100% pass). Route tool_use exclusively to fireworks/kimi-k2p6 or sambanova/MiniMax-M2.7. Avoid groq/compound for throughput tasks (20s timeout) and zhipu (rate-limited).