Cut the tokens before the model ever sees them

The Token Tax Playbook is about which model answers a request. This is about how much you feed it in the first place. The cheapest token is the one you never send. Two small local tools we put into the DLR stack do exactly that — one turns binary documents into compact text, the other turns a whole codebase into a queryable map — and both run on the Mac Mini for $0, before a single metered call. Here is how each one works, and the real numbers it moved.

The cheapest token is the one you never send

When an agent needs to understand something — a 10-page PDF, a 25,000-symbol codebase — the lazy move is to pour the raw material into the context window and let the model sort it out. That works, and it is expensive: every byte is billed as input tokens on every turn it stays in context, and a model reasoning over noise is slower and more error-prone than one handed the signal.
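To make the re-billing point concrete, here is a minimal sketch of the arithmetic. The token counts and turn count are hypothetical, and the model ignores prompt caching, which would reduce the repeated cost.

```python
def cumulative_input_tokens(payload_tokens: int, turns: int) -> int:
    """Input tokens billed when a payload stays in context for `turns` turns.

    Simplification: each turn re-sends the full payload and nothing is cached.
    """
    return payload_tokens * turns


# Hypothetical sizes, for illustration only.
raw_total = cumulative_input_tokens(payload_tokens=50_000, turns=8)
compact_total = cumulative_input_tokens(payload_tokens=4_000, turns=8)

print(raw_total)      # 400000
print(compact_total)  # 32000
```

Under these assumptions the per-turn saving is multiplied by the number of turns the material stays in context, which is why trimming the input once pays off repeatedly.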

A pre-processing tool changes the shape of the problem. Instead of paying a frontier model to read a binary blob or grep a whole repository, you run a cheap local program that extracts just the part that matters — the text, the relationship — and hand the model a fraction of the bytes. The work moves from a metered API to your own hardware, and the token count drops by an order of magnitude. Two tools illustrate the two halves of this: documents and code.

Live on DLR · Microsoft OSS

MarkItDown — binary documents into compact Markdown

The problem. DLR’s workflow is full of PDFs: vendor quotes, competitor quotes, invoices. A PDF is not text; it is a binary layout format. Feeding one to a model means either uploading the whole file (billed as a large image/document payload, and re-billed every turn it stays in context) or asking the model to OCR and parse it from scratch. For a multi-page quote that is thousands of tokens and several seconds, just to get at words that were already there.

How it works. MarkItDown is a small open-source converter from Microsoft. It reads PDF, Word, PowerPoint, Excel, HTML, CSV, EPUB, even images, and emits clean, structured Markdown — headings, tables, and lists preserved, layout junk stripped. It runs as a local CLI:

markitdown "vendor-quote.pdf" > quote.md
# 18,000 chars of clean Markdown — headings, line items, totals

The output is plain text a model can read at a fraction of the token cost of the original file, and it is human-readable, so you can verify the extraction before spending anything on interpretation. In the DLR stack this became a deliberate two-phase rule: phase one is text extraction, done locally by MarkItDown for free; phase two — turning that text into structured JSON, or making a judgment about it — goes to a model, and only when model judgment is actually required. Scanned or image-only PDFs, where there is no text to extract, fall back to the vision model. The admin bridge exposes this as an extractText flag, and the quote-handling routes and the invoice-extraction edge function all run the local pass first.
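The two-phase rule could be sketched roughly as follows. This is not the DLR bridge code: the function name `extract_or_fallback`, the `min_chars` threshold, and the "local"/"vision" labels are all hypothetical, chosen only to show the shape of the decision.

```python
from typing import Callable, Optional, Tuple


def extract_or_fallback(
    path: str,
    extract_text: Callable[[str], str],
    min_chars: int = 200,  # assumption: below this, treat the file as image-only
) -> Tuple[str, Optional[str]]:
    """Phase one: try free local extraction before any metered call.

    Returns ("local", text) when enough text came out, otherwise
    ("vision", None) to signal that the vision model is needed.
    """
    try:
        text = extract_text(path).strip()
    except Exception:
        # Sketch-level handling: any converter failure routes to the fallback.
        return ("vision", None)
    if len(text) < min_chars:
        return ("vision", None)
    return ("local", text)


# Stub extractors stand in for a real converter so the sketch runs anywhere.
# With the markitdown package installed, an extractor could be:
#   lambda p: MarkItDown().convert(p).text_content
text_pdf = extract_or_fallback("vendor-quote.pdf", lambda p: "# Quote\n" + "line item\n" * 50)
scanned_pdf = extract_or_fallback("scan.pdf", lambda p: "   ")

print(text_pdf[0])     # local
print(scanned_pdf[0])  # vision
```

Passing the extractor in as a callable keeps the routing decision separate from the converter, so the same check works for whichever local tool does the extraction.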

What it moved. A large vendor-quote PDF that used to go straight to a model for reading took ~168 seconds and roughly $0.35 in tokens. The same PDF through MarkItDown: ~520 milliseconds, 18k characters extracted, $0 — the model is only paid when the text genuinely needs interpreting. As a rule of thumb, content that costs a fortune as a raw document payload is typically 2–5K tokens as Markdown: a 10–50× reduction, and faster.
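As a sanity check on those figures, the arithmetic can be reproduced in a few lines. The four-characters-per-token ratio is an assumption (a common rough heuristic for English text, not a measured value for this document); the timings and character count are the ones quoted above.

```python
CHARS_PER_TOKEN = 4  # assumption: rough heuristic, varies by tokenizer and content


def estimate_tokens(chars: int, chars_per_token: float = CHARS_PER_TOKEN) -> int:
    """Approximate token count for plain text of a given length."""
    return round(chars / chars_per_token)


markdown_tokens = estimate_tokens(18_000)  # the 18k extracted characters
speedup = 168 / 0.520                      # ~168 s via the model vs ~520 ms locally

print(markdown_tokens)  # 4500
print(round(speedup))   # 323
```

Under that heuristic the extracted text lands inside the 2–5K token range given as the rule of thumb, and the local pass is roughly 300 times faster than the model read.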

Live on DLR · nightly, $0

Graphify — a whole codebase as a queryable map

The problem. Ask a coding agent “what calls this function?” or “what breaks if I change this module?” and the naive answer is to read files, often dozens of them and hundreds of kilobytes, into context until the picture is clear. On a large repository (DLR’s web app is ~25,000 symbols) that exhausts the context window on plumbing before any real work begins, and most of what got read was noise.

How it works. Graphify parses the repository with tree-sitter — a fast, language-aware parser, no LLM involved — and extracts a knowledge graph of the code: every function, class, and module is a node; every calls, imports, contains, and inherits relationship is an edge. For the DLR web app that is roughly 25,700 nodes and 40,100 edges, written to a graphify-out/graph.json next to the code. Because it is pure parsing, building the graph costs nothing — no tokens, no API.

A read-only lens (we call it graph-eye) answers architecture questions straight from that graph instead of from the files:

god — the hub nodes, ranked by how many things touch them: the high-coupling code that is risky to change.
affected “X” — the reverse blast radius: everything that depends on X, so you see what a change could break without grepping the tree.
find / explain — locate a symbol and show its immediate neighbors (its callers, callees, imports).
query “…” — a token-capped graph traversal that answers a natural-language question with a bounded answer, not a pile of files.
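The first two queries above reduce to standard graph operations: degree ranking and a reverse breadth-first search. The sketch below shows them on a tiny made-up edge list. The triple format, the symbol names, and the `max_nodes` cap are hypothetical; the real graph.json schema and graph-eye's internals are not shown in this article.

```python
from collections import defaultdict, deque

# Hypothetical edges as (source, relationship, target) triples.
EDGES = [
    ("login_route", "calls", "check_session"),
    ("admin_route", "calls", "check_session"),
    ("check_session", "calls", "verify_token"),
    ("billing_job", "calls", "verify_token"),
    ("verify_token", "imports", "crypto_utils"),
]


def build_indexes(edges):
    """Index edges both ways: what a node uses, and what uses it."""
    outgoing, incoming = defaultdict(set), defaultdict(set)
    for src, _kind, dst in edges:
        outgoing[src].add(dst)
        incoming[dst].add(src)
    return outgoing, incoming


def god(edges, top=3):
    """Rank nodes by how many distinct neighbors touch them (in plus out)."""
    outgoing, incoming = build_indexes(edges)
    nodes = set(outgoing) | set(incoming)
    degree = {n: len(outgoing.get(n, ())) + len(incoming.get(n, ())) for n in nodes}
    return sorted(degree.items(), key=lambda kv: (-kv[1], kv[0]))[:top]


def affected(edges, target, max_nodes=None):
    """Reverse blast radius: everything that depends on `target`, nearest first.

    `max_nodes` bounds the answer, loosely mirroring a capped traversal.
    """
    _outgoing, incoming = build_indexes(edges)
    seen, order, queue = set(), [], deque([target])
    while queue:
        node = queue.popleft()
        for dependent in sorted(incoming.get(node, ())):
            if dependent in seen or dependent == target:
                continue
            if max_nodes is not None and len(order) >= max_nodes:
                return order
            seen.add(dependent)
            order.append(dependent)
            queue.append(dependent)
    return order


print(god(EDGES, top=2))
# [('check_session', 3), ('verify_token', 3)]
print(affected(EDGES, "verify_token"))
# ['billing_job', 'check_session', 'admin_route', 'login_route']
```

Both answers come from the indexes alone, with no file reads, which is the property that keeps the response small regardless of how large the underlying source files are.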

What it moved. A question like “what is the blast radius of changing this auth helper?” that would otherwise mean reading a dozen 100KB files now returns a precise list in single-digit kilobytes. The graph locates the code; the agent then reads only the one file it actually needs to edit. The discipline matters: the graph is a map, not the source of truth for a diff, so you still open the real file before you change it.

Kept fresh for free. A nightly job re-extracts the graph for six repositories, but only when the code has actually changed since the last build, and it runs in a stripped environment with no API keys present — so it is provably incapable of hitting any metered path. The graphs are rebuilt before the morning review, every day, at a token cost of exactly zero.
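One way such a guard could look, as a sketch only: the article does not say how the nightly job detects changes or builds its environment, so the revision comparison, the allowlist contents, and the function names here are all assumptions.

```python
import os
from typing import Mapping, Optional

# Assumption: an allowlist is used, so nothing unlisted (API keys included)
# can leak into the child process.
ALLOWED_ENV = ("PATH", "HOME", "LANG")


def stripped_env(env: Mapping[str, str]) -> dict:
    """Keep only allowlisted variables; everything else is dropped."""
    return {name: env[name] for name in ALLOWED_ENV if name in env}


def needs_rebuild(current_rev: str, last_built_rev: Optional[str]) -> bool:
    """Rebuild only if the repo has never been built or its revision moved."""
    return last_built_rev is None or current_rev != last_built_rev


# Illustration with made-up values.
demo_env = {"PATH": "/usr/bin", "HOME": "/Users/dlr", "SOME_API_KEY": "sk-example"}
print(stripped_env(demo_env))            # {'PATH': '/usr/bin', 'HOME': '/Users/dlr'}
print(needs_rebuild("abc123", "abc123"))  # False
print(needs_rebuild("def456", "abc123"))  # True

# The real process environment can be filtered the same way:
child_env = stripped_env(os.environ)
```

In a job like this, the current revision could come from `git rev-parse HEAD`, and the extractor could be launched with `subprocess.run(..., env=child_env)` so the child process never sees a key. An allowlist is the stricter choice here, since a denylist only removes the variable names someone thought to list.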

The pattern: do the cheap work locally first

MarkItDown and Graphify do unrelated jobs — one reads documents, one reads code — but they are the same idea. Both put a cheap, deterministic, local step in front of the expensive model. Convert the binary before you reason over it. Map the repo before you grep it. Extract the signal on your own hardware, then spend metered tokens only on the part that genuinely needs intelligence.

That is the lever this page adds to the playbook. Routing and caching decide which model answers and how often you re-pay for the same prompt; tooling decides how much you hand it. The two compound: a smaller, cleaner input is cheaper on every model, so cutting tokens at the source multiplies whatever routing already saved.
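The compounding claim is simple multiplication: cost is tokens times price, so a reduction in either factor scales the other. A small sketch with invented token counts and prices (not DLR measurements):

```python
def request_cost(input_tokens: int, usd_per_million_tokens: float) -> float:
    """Input cost of one request at a flat per-token price."""
    return input_tokens / 1_000_000 * usd_per_million_tokens


# Hypothetical numbers, for illustration only.
baseline = request_cost(50_000, usd_per_million_tokens=10.0)     # raw input, expensive model
routed_only = request_cost(50_000, usd_per_million_tokens=2.0)   # raw input, cheaper model
routed_and_trimmed = request_cost(5_000, usd_per_million_tokens=2.0)  # compact input, cheaper model

print(f"routing alone:     {baseline / routed_only:.0f}x cheaper")
print(f"routing + tooling: {baseline / routed_and_trimmed:.0f}x cheaper")
```

With these made-up figures, a 5x price saving from routing and a 10x token saving from tooling combine to roughly 50x, because the two factors multiply.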

The two-phase rule

Phase 1 (local, $0): extract / map — MarkItDown turns a PDF into Markdown, Graphify turns a repo into a graph. Phase 2 (metered, only if needed): hand the model the compact result and pay only for the judgment — interpretation, a structured answer, the actual edit.

These tools run in our own stack before we write about them — the same way we benchmark every model before we recommend it. The routing and caching levers that pair with them are in the playbook; the measured dollars are in the case studies.
