How RAG hands a model the right page

RAG means giving a model the information it needs at the moment you ask, instead of relying only on what it absorbed in training. The model is never retrained — you change what it sees, not what it knows. Below: both phases, and a live look at the step that decides whether it works.

Two phases · different times

Query time — find the page, feed it in

Runs on every question. The model is a fixed reasoning engine — RAG hands it the right page a split second before it answers. Click any stage.

→

Search · ANN top-k

Approximate nearest-neighbour search returns the closest k passages (often a wide net, k=20–50). HNSW or IVFFlat indexes make this fast at scale by trading a sliver of accuracy for speed.

Retrieval is where quality lives

When a RAG system underperforms, the fix is almost always in retrieval, not the model. Pick a question — its embedding is scored against five demo chunks by cosine similarity (computed live), then the top 3 are what actually reach the prompt.

TOP-3Anaheim depot scissor-lift pricing table, current.

0.989

TOP-3Q3 2025 Anaheim market lift rates increased 3% over Q2.

0.984

TOP-3Pricing change policy: how rate adjustments are approved.

0.796

—Boom lift safety inspection checklist, revision 4.

0.322

—Company holiday schedule and office closures.

0.158

Only the top 3get injected into the prompt. Change the question and watch the ranking reorder — the model’s answer is only ever as good as what survives this step.

RAG, fine-tuning, prompting — they do different jobs

The rule of thumb: prompting changes instructions, RAG changes knowledge, fine-tuning changes behaviour. Mature systems use all three.

Prompt engineering

Changes: What you ask
Best for: Format, tone, task framing
Update cost: Instant, free
Attribution: No

RAG

Changes: What the model knows at runtime
Best for: Facts, private data, fresh info
Update cost: Cheap — re-index docs
Attribution: Yes — carries sources

Fine-tuning

Changes: How the model behaves
Best for: Consistent style, output shape
Update cost: Expensive — retrain
Attribution: No

The two production gotchas worth memorizing

Caching layout

Retrieved passages change every query, so they must sit after the cached prefix, never inside it.

[ cached: system + instructions + tools ]
+ [ uncached: retrieved chunks + query ]

Treat retrieved text as untrusted

A malicious passage in your corpus can carry instructions the model may follow (prompt injection). And in multi-tenant systems, scope retrieval at the row-security layer so one tenant can never pull another’s passages — not just in the prompt.

Debugging order:before touching the model or prompt, check retrieval in isolation. For a failing query — did the right passage come back at all? If no, it’s an indexing / embedding / chunking problem. If yes but ranked low, it’s a reranking / k problem. Only if the right passage was right there and the answer is still wrong is it a generation problem.

RAG: embed(query) → search → rerank → augment → generate · the model is never retrained — you change what it sees, not what it knows.

EyesInAI·Loading explainers…

Explainers

Retrieval-augmented generation · the working model