Why free text invites the lie
Ask a RAG system for “an answer” and you hand the model a blank page. On a topic that saturated its training data it will fill that page reliably; on your contract, seen once or never, it produces a continuation just as fluent, just as confident, and far more likely to be wrong. That isn’t a bug to patch with a sterner prompt. It’s what generation is.
The fix is not a smarter prompt; it’s controlled execution. The model answers only from the passages in front of it, in a typed shape, with a citation for every claim. Structured input in, structured object out. Ask for a JSON object whose every field is validated against the input, and the gaps it used to fill from memory simply aren’t there.
Typed values, not strings
The first move is to type the answer. Instead of the string "USD 1,200 per claim", the model fills an Amount(value, currency, unit); instead of "15 March 2024", a DateValue(iso, original); instead of pipe-separated text, a real TableValue(headers, rows). Downstream code never re-parses a string, and the named fields double as an instruction: a model that sees street / postal_code / city / country breaks the address block apart itself — one call, one row, one place to audit, instead of four brittle round-trips.
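A minimal sketch of the three typed shapes named above, written as standard-library dataclasses. Only the type and field names come from the text; the field types, the `as_date` helper, and the ragged-row check are assumptions added for illustration, and a real pipeline might use Pydantic models instead.

```python
from dataclasses import dataclass
from datetime import date
from decimal import Decimal


@dataclass(frozen=True)
class Amount:
    value: Decimal   # numeric magnitude, e.g. Decimal("1200")
    currency: str    # currency code, e.g. "USD"
    unit: str        # what the amount applies to, e.g. "per claim"


@dataclass(frozen=True)
class DateValue:
    iso: str         # normalized form, e.g. "2024-03-15"
    original: str    # the text as it appeared in the source

    def as_date(self) -> date:
        # Hypothetical helper: raises ValueError if `iso` is not an ISO date.
        return date.fromisoformat(self.iso)


@dataclass(frozen=True)
class TableValue:
    headers: list[str]
    rows: list[list[str]]

    def __post_init__(self) -> None:
        # Assumed rule: every row must have one cell per header.
        for index, row in enumerate(self.rows):
            if len(row) != len(self.headers):
                raise ValueError(f"row {index} has {len(row)} cells, "
                                 f"expected {len(self.headers)}")


deductible = Amount(Decimal("1200"), "USD", "per claim")
signed = DateValue(iso="2024-03-15", original="15 March 2024")
```

Downstream code reads `deductible.value` or `signed.as_date()` directly, so nothing re-parses a string.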
Each value also carries its own evidence spans: line ranges into the source, not the evidence text. The model returns line numbers plus an optional quote used only as a checksum (the validator confirms it’s a substring of the cited lines, catching a wrong line number); the human-readable snippet is recovered by the pipeline, never trusted from the model. A list answer is many items, each with its own spans; a single answer whose evidence sits on pages 5 and 23 is one item with two spans. The shape models both directly.
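The span check described above might look like the following sketch. `EvidenceSpan`, `resolve_span`, and the whitespace normalization are hypothetical names and choices, not the article's implementation; the point is that the snippet is rebuilt from the source lines and the model's quote is used only to verify the line numbers.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class EvidenceSpan:
    start_line: int              # 1-indexed, inclusive
    end_line: int                # 1-indexed, inclusive
    quote: Optional[str] = None  # optional checksum supplied by the model


def _normalize(text: str) -> str:
    # Collapse whitespace so line wrapping does not defeat the substring check.
    return " ".join(text.split())


def resolve_span(source_lines: list[str], span: EvidenceSpan) -> str:
    """Return the snippet for a span, recovered from the source itself."""
    if not (1 <= span.start_line <= span.end_line <= len(source_lines)):
        raise ValueError("span lies outside the source")
    snippet = "\n".join(source_lines[span.start_line - 1:span.end_line])
    if span.quote is not None and _normalize(span.quote) not in _normalize(snippet):
        # A quote that is missing from the cited lines signals a wrong line number.
        raise ValueError("quote not found in cited lines")
    return snippet


lines = ["Deductible:", "USD 1,200 per claim.", "Term: 12 months."]
snippet = resolve_span(lines, EvidenceSpan(2, 2, quote="1,200 per claim"))
```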
The sharpest rule: the LLM extracts, Python compares
Never ask the model to compute, compare, or aggregate when the result is derivable from extracted values. “Is the premium above one million dollars?” asks the model to locate the number, parse its currency, convert, and compare, all in one shot, with a yes/no that erases the value that produced it. A 100,000,000 JPY premium becomes “yes” or “no” on whatever exchange rate the model imagines.
- Extract first. Ask for an Amount(value, currency, unit), nothing else.
- Compute in Python. Apply the conversion with a visible rate, compare to the threshold.
- Stay auditable. Every step is visible and replayable; if the rate updates, recompute without re-calling the model.
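The three steps above can be sketched as follows. The exchange rates are made-up illustrative numbers, not real quotes, and `to_usd` and `exceeds` are hypothetical helpers; the extracted `Amount` stands in for what the model would return.

```python
from dataclasses import dataclass
from decimal import Decimal


@dataclass(frozen=True)
class Amount:
    value: Decimal
    currency: str
    unit: str


# Assumption: illustrative rates only. A real pipeline would load dated rates
# from a source it can cite.
USD_PER_UNIT = {"USD": Decimal("1"), "JPY": Decimal("0.0067")}


def to_usd(amount: Amount, rates: dict[str, Decimal]) -> Decimal:
    if amount.currency not in rates:
        raise KeyError(f"no rate for {amount.currency}")
    return amount.value * rates[amount.currency]


def exceeds(amount: Amount, threshold_usd: Decimal, rates: dict[str, Decimal]) -> bool:
    return to_usd(amount, rates) > threshold_usd


# Step 1: the model only extracts (this value stands in for its output).
premium = Amount(Decimal("100000000"), "JPY", "premium")

# Step 2: Python converts and compares with a visible rate.
above_million = exceeds(premium, Decimal("1000000"), USD_PER_UNIT)

# Step 3: a rate update is a recompute, not another model call.
updated_rates = {"JPY": Decimal("0.011")}
above_million_after_update = exceeds(premium, Decimal("1000000"), updated_rates)
```

With these illustrative rates the same extracted value gives different answers, which is exactly the dependency a bare yes/no from the model would hide.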
The signal you must NOT ask the model for
The schema can also carry feedback fields the pipeline routes on: answer_found vs complete_answer_found, conflicting_evidence, a suggested clarification. But the best idea in the piece is the one field it deliberately does not ask the model for. A clean period at the bottom of page 5 proves a sentence ended, not that a list ended. The model can’t see page 6, so a page-5-only context always reads as complete when the cut is clean.
So completeness is computed by the pipeline, via a one-page retrieval overlap the model never sees: a new section heading at the top of page 6 → the list was bounded; continuation content → it was truncated, refetch. It’s the same instinct as chunk overlap, applied to retrieval scope — buy safety by deliberately pulling slightly more than seems necessary. Deterministic, grounded in document structure, catching a failure neither the model nor a human reading page 5 alone could.
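One way to sketch that overlap check, assuming the pipeline already has the text of the extra page. The heading heuristic here (a "Section 6" style prefix or a short all-caps line) is an assumption for illustration and would need tuning to a real document's layout; treating an empty overlap page as "bounded" is likewise an assumed choice.

```python
import re

# Assumption: a heading is either "Section 6" / "Article IV" style or a short
# ALL-CAPS line. Numbered lines such as "6. Flood damage" are deliberately NOT
# treated as headings, because they may be list items continuing from page 5.
HEADING = re.compile(r"^(Section|Article|Chapter|Appendix)\s+[\dIVXLC]+\b")


def first_content_line(page_text: str) -> str:
    """Return the first non-blank line of a page, stripped, or '' if none."""
    for line in page_text.splitlines():
        if line.strip():
            return line.strip()
    return ""


def looks_like_heading(line: str) -> bool:
    short_caps = line.isupper() and len(line.split()) <= 8
    return bool(HEADING.match(line)) or short_caps


def overlap_verdict(overlap_page_text: str) -> str:
    """Classify the page after the answer: 'bounded' or 'truncated'."""
    first = first_content_line(overlap_page_text)
    if not first or looks_like_heading(first):
        return "bounded"      # a new section starts, so the list ended
    return "truncated"        # continuation content, so refetch with more pages
```

The model never sees the overlap page; only the pipeline reads the verdict and decides whether to refetch.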
What makes the contract real: constrained decoding
A schema is only a contract if it’s enforced. Constrained decoding (OpenAI’s responses.parse with Pydantic, or grammar-based approaches like Outlines / vLLM grammars / llama.cpp GBNF) makes the model unable to emit output that fails to parse. Below it sits a reliability ladder: JSON-schema mode, then “return valid JSON” in the prompt (validate after the fact), then a vague “return JSON”, which will occasionally wrap the object in a fence or prepend “Here’s the answer:” and break the consumer. Anything above the bottom rung is meaningfully safer.
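For the "validate after the fact" rung, a minimal standard-library sketch is shown here. The `REQUIRED` schema and the function name are hypothetical; constrained decoding would make this check redundant rather than replace it with something stronger in Python.

```python
import json

# Hypothetical schema for illustration: key name -> accepted Python type(s).
REQUIRED = {
    "answer_found": bool,
    "value": (int, float),
    "currency": str,
    "unit": str,
}


def validate_after_the_fact(raw: str) -> dict:
    """Parse model output and check keys and types; raise ValueError on any miss."""
    # json.JSONDecodeError subclasses ValueError, so fences and preambles
    # surface as the same exception type as schema violations.
    obj = json.loads(raw)
    if not isinstance(obj, dict):
        raise ValueError("expected a JSON object")
    missing = REQUIRED.keys() - obj.keys()
    extra = obj.keys() - REQUIRED.keys()
    if missing or extra:
        raise ValueError(f"missing={sorted(missing)} extra={sorted(extra)}")
    for key, expected in REQUIRED.items():
        value = obj[key]
        # bool is a subclass of int in Python, so reject it for numeric fields.
        if isinstance(value, bool) and expected is not bool:
            raise ValueError(f"{key}: boolean is not a valid value")
        if not isinstance(value, expected):
            raise ValueError(f"{key}: wrong type {type(value).__name__}")
    return obj


good = '{"answer_found": true, "value": 1200, "currency": "USD", "unit": "per claim"}'
parsed = validate_after_the_fact(good)
```

A preamble such as “Here’s the answer:” in front of the same object fails at `json.loads`, which is the consumer breakage the bottom rung risks.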
We tested the claim: “thinking” models are worse at JSON
The article makes one falsifiable, measurable claim in our wheelhouse: reasoning (“thinking”) models are less reliable at structured JSON, because their reasoning trace pollutes the output, and JSON quality doesn’t correlate with model size. So we measured it, 5 runs per model, asking each for one strict JSON object (four typed keys, no prose, no markdown) with enough token budget that a thinking model can finish reasoning and still emit JSON. Two columns, and the gap between them is the whole story:
| Model | Type | Strict JSON | Extractable | <think> trace |
|---|---|---|---|---|
| DeepSeek-R1 · 671B MoE | thinking | 0/5 | 5/5 | 5/5 |
| DeepSeek-V3.1 · 671B MoE | non-thinking | 5/5 | 5/5 | 0/5 |
| QwQ-32B · 32B | thinking | 0/5 | 5/5 | 5/5 |
| Qwen2.5-72B · 72B | non-thinking | 5/5 | 5/5 | 0/5 |
| gpt-oss-20b · 20B | non-thinking | 5/5 | 5/5 | 0/5 |
| gpt-oss-120b · 120B | non-thinking | 5/5 | 5/5 | 0/5 |
| GPT-4o · frontier | non-thinking | 5/5 | 5/5 | 0/5 |
- The claim holds, precisely. Both thinking models (DeepSeek-R1, QwQ-32B) scored 0/5 on strict JSON, yet 5/5 extractable. They know the values; they won’t hand them over cleanly, because the <think> trace precedes the object and a strict parser chokes.
- Size doesn’t correlate. A 20B model and a 671B model both hit 5/5 strict; the two failures are the thinking ones (a 671B and a 32B). The variable is the reasoning mode, not the parameter count.
- The fix is the contract. For structured-output workloads, pick a non-thinking model (or disable reasoning), and let constrained decoding force the schema, so “produces clean JSON” stops being a property you hope for.
Measured on EyesInAI’s bench, 2026-07-05, 5×/model at 800-token budget. “Strict” = clean JSON, nothing around it (what a pipeline needs); “extractable” = a valid object recoverable even from a reasoning trace. See the leaderboard for the live suite.
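The difference between the two columns can be illustrated with a short sketch. This is an assumed reading of the definitions above, not the bench's actual scoring code: `is_strict_json` and `extract_object` are hypothetical names, and surrounding whitespace is treated as acceptable for "strict".

```python
import json
from typing import Optional


def is_strict_json(output: str) -> bool:
    """True only if the whole output parses as a single JSON object."""
    try:
        return isinstance(json.loads(output), dict)
    except json.JSONDecodeError:
        return False


def extract_object(output: str) -> Optional[dict]:
    """Recover the first JSON object found after any reasoning trace."""
    decoder = json.JSONDecoder()
    # rpartition returns the full string as the tail when no trace is present.
    text = output.rpartition("</think>")[2]
    for index, char in enumerate(text):
        if char != "{":
            continue
        try:
            obj, _ = decoder.raw_decode(text, index)
        except json.JSONDecodeError:
            continue  # this brace did not start a valid object; keep scanning
        if isinstance(obj, dict):
            return obj
    return None


thinking_output = '<think>I need {four} keys.</think>\n{"a": 1, "b": "x"}'
strict = is_strict_json(thinking_output)
recovered = extract_object(thinking_output)
```

A pipeline that consumes model output directly sees only the strict result; the extractable one shows the values were there all along.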