Prompt compression — strip the bloat before the model reads it
What it is. A small, locally-run model that compresses a prompt, removing the tokens that don’t carry meaning while keeping the ones that do. LLMLingua-2 is a BERT-level token classifier distilled from GPT-4, built specifically for task-agnostic compression, and it’s 3–6× faster than the original LLMLingua.
How it works. It’s a standalone Python library: model-agnostic preprocessing, not tied to any API. You hand it text and a target size; it hands back the compressed text plus the token counts:
from llmlingua import PromptCompressor
# use_llmlingua2=True selects LLMLingua-2; the bare constructor loads the original LLMLingua model instead
compressor = PromptCompressor(model_name="microsoft/llmlingua-2-xlm-roberta-large-meetingbank", use_llmlingua2=True, device_map="cpu")
compressed = compressor.compress_prompt(long_context, target_token=500)  # long_context: your own text
# → dict with compressed_prompt, origin_tokens, compressed_tokens, ratio
# then send compressed["compressed_prompt"] to any model, via the API or the Claude CLI
Why it cuts tokens. The claims are large and consistently reported: up to 20× compression with a small accuracy drop, 95–98% accuracy retained on LLMLingua-2, and a frequently cited production case of a monthly bill going $42K → $2.1K with zero model change. The pitch we agree with: teams chase cheaper models and miss that their prompts carry bloat that is compressible by 5–20×: long static context, RAG chunks, pasted documents.
It runs on CPU (a GPU just speeds it up), so it’s a local $0 filter. And because the output is plain text, it works with the Claude CLI, not just the API: compress a big blob locally, then pipe it to claude -p. It works best on large pasted context, not the CLI’s own system prompt (which prompt caching already makes cheap). The one rule, which the sources are blunt about: budget A/B testing per task before trusting it, because aggressive compression can break specific tasks. That caveat is exactly why it belongs next to a benchmark: we can measure the accuracy cost instead of guessing it. So we did.
We installed it locally (isolated environment, CPU, model cached on disk), wired it into our benchmark, and ran a real A/B: a long meeting-minutes document with facts buried in filler, three questions with known answers, each asked of claude-haiku-4.5 twice, once with the full context and once with the compressed one. The result: 49% fewer input tokens, ~43% lower cost, and the same 3/3 correct answers. Every buried fact (a delivery rate, a contract value, an SLA window) survived, because the question anchors what must stay.
The honest boundary: that’s for factual QA over prose. Code, exact strings, and number-dense tables are a different task class and get their own A/B before we trust compression on them.
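A per-task A/B of this kind can be sketched in a few lines. This is a minimal sketch under stated assumptions, not the harness we ran: `ask_model` is a hypothetical callable that returns an answer and the input token count for one call, and correctness is a plain substring check against a known answer.

```python
from typing import Callable, NamedTuple


class ABResult(NamedTuple):
    correct_full: int
    correct_compressed: int
    tokens_full: int
    tokens_compressed: int

    @property
    def token_reduction(self) -> float:
        """Fraction of input tokens saved by compression (0.49 means 49% fewer)."""
        return 1 - self.tokens_compressed / self.tokens_full


# Hypothetical signature: (context, question) -> (answer_text, input_tokens)
AskFn = Callable[[str, str], tuple[str, int]]


def run_ab(ask_model: AskFn, full: str, compressed: str,
           qa_pairs: list[tuple[str, str]]) -> ABResult:
    """Ask every question against both contexts and score by known answer."""
    correct_full = correct_comp = tokens_full = tokens_comp = 0
    for question, expected in qa_pairs:
        answer_full, n_full = ask_model(full, question)
        answer_comp, n_comp = ask_model(compressed, question)
        # Substring match is a deliberately crude scorer (an assumption).
        correct_full += expected.lower() in answer_full.lower()
        correct_comp += expected.lower() in answer_comp.lower()
        tokens_full += n_full
        tokens_comp += n_comp
    return ABResult(correct_full, correct_comp, tokens_full, tokens_comp)
```

The compression only passes if `correct_compressed` matches `correct_full` on that task; `token_reduction` is what you get in exchange. A substring scorer suits short factual answers and would need replacing for code or exact-string tasks.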
Agent memory — retrieve the relevant slice, not the whole history
What it is. A memory layer for chat agents. Instead of re-sending the entire conversation (and a growing pile of facts) on every turn, Mem0 extracts, stores, and retrieves only what’s relevant to the current message.
How it works. It stores memories in a vector database (with hybrid semantic + BM25 keyword + entity search), and a single-pass extraction step keeps the write cost low. Python and TypeScript SDKs plus a CLI; pip install mem0ai. It runs as a local library, a self-hosted Docker server, or a managed cloud — and works with any LLM through a standard interface, Claude included.
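To make the store-then-retrieve idea concrete, here is a toy sketch. It is not Mem0’s implementation or API: `ToyMemory` is a hypothetical class that ranks stored facts by plain keyword overlap, standing in for the hybrid semantic, BM25, and entity search described above.

```python
import re
from dataclasses import dataclass, field


def _terms(text: str) -> set[str]:
    """Lowercased alphanumeric terms of a string."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))


@dataclass
class ToyMemory:
    """Hypothetical stand-in for a memory layer: store facts, fetch the relevant few."""
    facts: list[str] = field(default_factory=list)

    def add(self, fact: str) -> None:
        self.facts.append(fact)

    def search(self, query: str, limit: int = 2) -> list[str]:
        """Return up to `limit` facts sharing at least one term with the query."""
        query_terms = _terms(query)
        scored = [(len(query_terms & _terms(fact)), fact) for fact in self.facts]
        # sorted() is stable, so ties keep insertion order.
        ranked = sorted((s for s in scored if s[0] > 0), key=lambda s: -s[0])
        return [fact for _, fact in ranked[:limit]]
```

Only the facts returned by `search` go into the prompt for the current turn, so prompt size tracks `limit` rather than the length of the history.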
Why it cuts tokens. This is the structural fix for the context window that grows every turn. Mem0’s 2026 numbers: 91.6 on the LoCoMo benchmark using only ~7.0K tokens per retrieval, versus the 25K+ tokens a full-context approach burns to hit similar quality. On a chatbot doing thousands of turns, paying for 7K instead of 25K per turn is a recurring, compounding saving.
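The per-turn arithmetic is simple enough to write down. This sketch takes the 7K and 25K figures quoted above as defaults; the turn count and the price per million input tokens are assumptions you would replace with your own traffic and model pricing.

```python
def input_tokens_saved(turns: int,
                       full_context_tokens: int = 25_000,
                       retrieved_tokens: int = 7_000) -> int:
    """Input tokens not sent when each turn carries a retrieved slice instead of full context."""
    return turns * (full_context_tokens - retrieved_tokens)


def dollars_saved(turns: int, usd_per_million_input: float) -> float:
    """Input-side saving in USD; usd_per_million_input is an assumed price, not a quoted one."""
    return input_tokens_saved(turns) * usd_per_million_input / 1_000_000
```

Because 25K is quoted as a floor ("25K+") and a real full-context prompt keeps growing with the conversation, a flat per-turn difference like this is a conservative estimate.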
The DLR chatbot re-pays for conversation context on every turn; this is the lever aimed straight at that workload. Self-hostable (local library or Docker) keeps it on our hardware and our keys. The integration is more involved than a one-shot filter — it’s a memory store you operate — so it’s a build, not a drop-in. Worth piloting on real chatbot traffic and measuring the per-turn token delta the way we measured routing.
Observability — see exactly where the tokens went
What it is. A local terminal dashboard that reads your AI coding-agent session logs — Claude Code, Codex, Gemini CLI, Aider, Cursor and more — and turns them into one fast view of cost, tokens, and time.
How it works. It parses the logs already on disk — no uploading anywhere — and surfaces input/output/cache tokens, estimated cost, the top token-burning sessions, plus diagnostics: tool loops, repeated calls, retries, long gaps, hanging runs, and tool failures. It also emits JSON / Markdown / HTML reports with CI health and tool-failure gates.
Why it helps the bill. It doesn’t cut tokens directly; it shows you which sessions and tool loops are eating them, so you can fix the actual leak. The runaway-cost story is almost always a loop with no ceiling; this is the tool that makes that loop visible locally and per-session, in real time, rather than after the fact in a lagged billing dashboard.
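As an illustration of what a tool-loop diagnostic looks for, here is a minimal sketch. It is not the dashboard’s own logic, and the record shape (`{"tool": ..., "args": ...}`) is a hypothetical one; real agent logs differ per tool and would need a parser in front of this.

```python
from itertools import groupby


def find_tool_loops(calls: list[dict], min_repeats: int = 3) -> list[tuple[str, int]]:
    """Return (tool_name, run_length) for each run of identical consecutive calls.

    Each record is assumed to look like {"tool": str, "args": str}.
    """
    loops = []
    for (tool, _args), run in groupby(calls, key=lambda c: (c["tool"], c["args"])):
        run_length = sum(1 for _ in run)
        if run_length >= min_repeats:
            loops.append((tool, run_length))
    return loops
```

A run of the same tool with the same arguments is the simplest loop signature; a fuller check would also catch cycles that alternate between two or three calls.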
This fits our read-only -eye-lens philosophy exactly: observe locally, never upload, report. It’s the missing piece for proving token-tax savings instead of estimating them from a billing API that lags by hours. It’s the lowest-risk of the four, since it only reads logs, so it’s the first one to stand up.
Web extraction — MarkItDown, but for the live web
What it is. A command-line tool that scrapes, searches, crawls, and maps the web and returns clean Markdown by default — LLM-ready, ~67% fewer tokens than raw HTML. It installs agent skills and an MCP server so an agent can do live web work as a tool.
How it works. Same idea as MarkItDown, pointed at the web instead of a file on disk: fetch the page, strip the layout and script noise, emit the text. It can run against a self-hosted instance — firecrawl --api-url http://localhost:3002 scrape … — and a custom API URL skips auth entirely, so a local instance needs no key.
Why it cuts tokens. The same lever as document conversion: never hand a model raw HTML. A scraped page as Markdown is a fraction of the tokens of the original DOM, and it’s readable, so you verify before you spend.
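The size gap between raw HTML and its visible text is easy to see with the standard library alone. This is a toy sketch, not Firecrawl’s pipeline: it emits plain text rather than Markdown, and character counts are only a rough proxy for tokens.

```python
from html.parser import HTMLParser


class TextOnly(HTMLParser):
    """Collect visible text, skipping the contents of <script> and <style>."""
    SKIP = {"script", "style"}

    def __init__(self) -> None:
        super().__init__()
        self.parts: list[str] = []
        self._skip_depth = 0

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIP:
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in self.SKIP and self._skip_depth:
            self._skip_depth -= 1

    def handle_data(self, data):
        # Keep only non-empty text that is outside script/style blocks.
        if not self._skip_depth and data.strip():
            self.parts.append(data.strip())


def html_to_text(html: str) -> str:
    """Strip tags, scripts, and styles; return one text fragment per line."""
    parser = TextOnly()
    parser.feed(html)
    parser.close()
    return "\n".join(parser.parts)
```

Comparing `len(html_to_text(page))` with `len(page)` on a real page gives a quick feel for how much of the DOM is layout and script noise before any tokens are spent.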
Two caveats for a stack like ours. First, it overlaps what we already run: our hardened, SSRF-guarded article reader does web → clean text for the site. Second, the cloud API is metered, and the open-source self-hosted version is the less-polished path (it behaves differently from the cloud, and self-hosting it is real work). For us the local --api-url route is the only one that meets the $0 bar; the hosted key is a spend surface to weigh, not adopt by default. Strong tool: adopt deliberately, for crawl/search jobs our reader doesn’t cover, not as a reflex.
The same idea, four shapes
Group these by where they sit relative to the model call. LLMLingua and Firecrawl shrink the input before it’s sent: compress a prompt, convert a page. Mem0 changes what you send at all: retrieve the relevant slice instead of the whole history. agenttrace doesn’t touch the call; it shows you where the tokens are going so you can aim the other three. Together with MarkItDown and Graphify, that’s a full local toolkit for the one principle under all of it: do the cheap, deterministic work on your own hardware, and spend metered tokens only on the part that needs intelligence.
None of these replace choosing the right model; that’s what the rest of this site measures. They stack on top of it: a smaller, cleaner input is cheaper on whatever model routing already picked, so cutting tokens at the source multiplies the saving. Our plan is the one we always run: pilot each locally, measure the real before/after against the benchmark, and publish the numbers, the same way we did for MarkItDown.
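"Multiplies" is meant literally: input cost is tokens times price, so a cut to each compounds. A minimal sketch, covering input-side cost only; the price reduction from routing is a hypothetical parameter, not a measured figure.

```python
def remaining_cost_fraction(token_reduction: float, price_reduction: float) -> float:
    """Fraction of the original input cost left after both cuts are applied.

    token_reduction: share of input tokens removed (0.49 means 49% fewer).
    price_reduction: share by which routing lowers the per-token price (assumed).
    """
    return (1 - token_reduction) * (1 - price_reduction)
```

For example, 49% fewer input tokens on a model that costs half as much per token leaves `remaining_cost_fraction(0.49, 0.5)`, about a quarter of the original input bill. Output tokens are untouched by either cut, which is why a measured total saving comes in below the token reduction.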
These are candidates we’re evaluating for our own stack — the same way we benchmark every model before recommending it. The levers that decide which model runs in the first place are in the playbook.