Context engineering: one inference call
Context engineering is the design of a single inference call: what goes into the window, in what form, in what position, and what gets left out. Everything in scope is ephemeral — when the call ends, the window clears. Do it well and the model reasons cleanly on that one request; do it poorly and quality drops long before you hit the token limit. Three decisions carry most of the weight:
- Selective inclusion. A query returning hundreds of rows, a search returning five full articles, a verbose tool log — dumping all of it in bloats the window and lowers reasoning quality. What enters verbatim, what gets compressed to key facts, and what gets dropped is a design choice, not a default.
- Structural placement.Models attend most to the start and end of a long context and least to the middle — the “lost in the middle” effect. Hard constraints belong at the top; the most task-relevant retrieved material and the actual query belong near the end, close to where generation happens.
- Compression on arrival. Summarise a tool output when it returns, not when the window overflows. A 3,000-token API response the agent needs 150 tokens of should be squeezed before the next step — proactive, not a panic truncation later.
Conversation history is the component that grows fastest, so long-running agents need a standing compression strategy (rolling window, hierarchical summary, or structured state extraction) applied on a schedule — not a scramble at overflow. This is the same token-tax discipline, applied inside a single call.
Memory engineering: across calls, sessions, agents
Memory engineering is about what survivesa single interaction — the systems and policies for writing, storing, retrieving, updating, and governing information so later calls can use it. When an agent recalls something from last week, coordinates with another agent, or applies a preference it learned days ago, that's memory, not context. Four concerns:
- Write policy— the most overlooked, most consequential piece. What triggers a write, what's eligible, in what format (raw text vs. extracted facts vs. summaries), with what confidence, who's allowed to write, and how conflicts and expiry are handled. Skip it and the store defaults to keeping everything forever at equal trust — growing endlessly while getting less useful.
- Storage layer — different memory types want different backends: working (task state → fast K/V like Redis), episodic (past runs → vector store), semantic (durable facts/preferences → vector + K/V), procedural (learned workflows → structured store). The backend also constrains how you can retrieve.
- Retrieval strategy — not one operation: check working memory first (cheap, exact), fall back to semantic search, filter by recency and trust, inject only what the step needs.
- Maintenance — a store with no upkeep rots: confidence decay on volatile facts, dedup of near-identical entries, TTL expiry on time-sensitive data, and periodic compression of old episodes into summaries.
A useful honesty note from the source: structured state extraction (writing typed, validated facts) beats dumping raw conversation chunks into a vector store, which is brittle to rephrasing and conflicting updates — and tagging each entry with a trust level(internal system > user input > external) keeps low-trust content from silently poisoning later reasoning.
Where they meet: the retrieval boundary
The two disciplines aren't really separate — they meet at retrieval. Memory produces candidate information; context assembly decides whether it enters the prompt, how much of it, and where. Manage that boundary well and a pile of memory components becomes a coherent agent. Manage it badly and you get two failure modes that both look like something else:
Failure 1 — retrieval with no context budget. Memory finds the right entries and the assembler injects all of them, crowding out instructions and tool outputs. Retrieval metrics look great, memories are found, and performance still degrades — because context assembly never allocated a token budget. The fix is retrieval-aware assembly: set the budget first, then return only the highest-value memories that fit.
Failure 2 — good retrieval, bad placement. The right memory is retrieved and inserted, but appended wherever it lands — often deep in the middle, where attention is weakest — so the model behaves as if it's missing. Retrieval succeeded; placement failed. Retrieved info that must drive the current step belongs near the active reasoning region, not tacked on arbitrarily.
The through-line: retrieval isn't a search problem you solve in isolation — it's the first step of building the context window, subject to a budget and a placement decision. This is the persistence side of the same story the agent hand-off tax tells from the compute side: multi-agent systems live or die on how carefully they move the right information to the right place at the right time.
Why it's on a cost-and-quality site
Because both disciplines are, underneath, token discipline — and token discipline is cost and quality at once. Every un-budgeted retrieval and every un-compressed tool output is tokens you pay for that also degradethe answer. Getting context and memory right isn't just reliability engineering; it's the same “spend tokens where they earn their keep” instinct behind the token-tax playbook — applied to what the agent remembers and sees, not just which model runs it.