How to read this (and why three runs)
A single benchmark pass lies — a model has a lucky or unlucky moment and the table tells the wrong story. Everything below is three runs per cell, scored against a known-correct answer the same way our standard test suite grades every model. “Pass” means exact-match or validator pass, not a vibe. Where a result is suggestive rather than definitive, we say so.
C1 — Raw model quality (7-test suite, ×3)
| Model | Pass | Avg latency | Where it lost |
|---|---|---|---|
| Kimi-K2.7-Code | 21/21 | 2.2s | — (perfect + fastest) |
| GPT-5 | 21/21 | 6.6s | — |
| Opus 4.8 | 16/21 | 3.2s | tool_use, code |
| Kimi-K2.6 (deepinfra) | 7/21 | 42.7s | provider failure — see below |
Kimi K2.7-Code tied GPT-5 at a perfect 21/21 — and was the fastest model in the field(2.2s average vs GPT-5's 6.6s). That is the headline, and it held under repetition: on this suite, by measurement rather than marketing, Kimi is frontier-competitive.
The Kimi-K2.6 score of 7/21 is not a model failure — it is a provider failure. The same model scored 21/21 on a different serving provider (see C2). One endpoint was timing out and returning empty answers; we attribute that to the serving layer, not to Kimi. (Opus's 16/21 was an off-run on a single endpoint/day — our longer deep suite scores it higher; don't over-read one comparison.)
C2 — Same model, different source: does first-party win?
| Source | Pass | Avg latency |
|---|---|---|
| Kimi-K2.6 via fireworks | 21/21 | 9.9s |
| kimi-for-coding (Kimi direct) | 18/21 | 4.8s |
| Kimi-K2.6 via deepinfra | 7/21 | 42.7s |
A natural assumption is that going straight to the model's maker gives the best result. It didn't. Going Kimi-direct for inference bought no quality advantage— a third-party serving provider actually beat it on pass-rate this run, and direct's only edge was lower latency. The lesson generalizes: the serving layer matters more than the source. Pick a provider by reliability and latency, not by whose logo is on the endpoint. (This is the same serving-layer thesis as our managed-inference scan.) Kimi-direct's real value isn't its API — it's the Code CLI on a flat subscription, which is exactly what the next test puts to work.
C3 — The harness duel: Kimi Code CLI vs claude -p
This is the one that matters strategically. We pitted Moonshot's Kimi Code CLI — its agent harness — against claude -p (the Claude Agent SDK running Sonnet 4.6), graded, three runs each.
| Task | Kimi Code CLI | claude -p (Sonnet) |
|---|---|---|
| reasoning (17×23) | 3/3 · 4.0s | 3/3 · 2.0s · $0.087 |
| conditional probability | 3/3 · 5.1s | 1/3 · 1.9s · $0.152 |
| code (palindrome) | 3/3 · 5.4s | 3/3 · 2.9s · $0.149 |
Repetition changed the story. Kimi Code CLI went 9/9; claude -p went 7/9— it missed the conditional-probability trap two times out of three (the classic P(both | at least one red) error). On a single pass we'd have called it a tie; three runs showed Kimi's agent was more reliable on that reasoning trap here.
The trade-offs are real and run both ways: claude -p was about twice as fast in wall-clock time (Kimi reasons more visibly before answering). But on cost the gap is structural, not incremental — which is the whole point.
The cost basis is the real story
In that duel, claude -p cost $0.087–$0.152 per task, metered, billed per token. The Kimi Code CLI ran on a flat-rate plan — its marginal cost per task is effectively zero, because the plan is quota-windowed, not per-token. Over volume, that is the token tax in one table.
We don't print a fake metered-equivalent for the flat-rate side, because there isn't an honest one to print — a flat plan has a fixed monthly price and a usage window, and break-even depends entirely on your volume. The honest framing is a trade, not a winner: speed and a familiar API on one side, reliability on the reasoning trap and a flat bill on the other. This is exactly the kind of trade our review console now measures directly — Kimi's harness is a selectable candidate alongside claude -p in the same graded run.
Bottom line — after three runs
- Kimi K2.7-Code is frontier-competitive and fast. 21/21 ties GPT-5, at the lowest latency in the field. The strongest measured version of “Kimi is advancing.”
- Serving layer beats source. The same model ranged from 21/21 to 7/21 depending only on who served it. Going direct bought no inference advantage — pick the provider, not the logo.
- At the agent layer it was the trade we expected. Kimi held up better on the reasoning trap (9/9 vs 7/9);
claude -pwas faster but metered. Speed vs. billing-model and reliability is the real decision.
The honesty bar (caveats)
- Three runs per cell smooths most wobble but isn't infinite; a 1/3 or 2/3 result is suggestive, not the last word.
- Latency is wall-clock including provider queueing, not pure model speed — the 42s figure is inflated by long reasoning streams and timeouts on a failing endpoint.
- The flat-rate side's marginal cost is ~$0, but the plan is a fixed monthly fee — break-even is volume-dependent, so “cheaper” is a function of how much you run.
- The provider failure is a serving issue to re-test next snapshot, not a Kimi quality signal — and we'll mark the failing endpoint down if it persists.
The method is the product. We publish what the benchmark actually returned — including the parts that complicate a clean headline — because a comparison you can't poke holes in is the only kind worth running.