RAG means giving a model the information it needs at the moment you ask, instead of relying only on what it absorbed in training. The model is never retrained — you change what it sees, not what it knows. Below: both phases, and a live look at the step that decides whether it works.
Runs on every question. The model is a fixed reasoning engine — RAG hands it the right page a split second before it answers. Click any stage.
Approximate nearest-neighbour search returns the closest k passages (often a wide net, k=20–50). HNSW or IVFFlat indexes make this fast at scale by trading a sliver of accuracy for speed.
When a RAG system underperforms, the fix is almost always in retrieval, not the model. Pick a question — its embedding is scored against five demo chunks by cosine similarity (computed live), then the top 3 are what actually reach the prompt.
Only the top 3get injected into the prompt. Change the question and watch the ranking reorder — the model’s answer is only ever as good as what survives this step.
The rule of thumb: prompting changes instructions, RAG changes knowledge, fine-tuning changes behaviour. Mature systems use all three.
Retrieved passages change every query, so they must sit after the cached prefix, never inside it.
[ cached: system + instructions + tools ]
+ [ uncached: retrieved chunks + query ]A malicious passage in your corpus can carry instructions the model may follow (prompt injection). And in multi-tenant systems, scope retrieval at the row-security layer so one tenant can never pull another’s passages — not just in the prompt.
embed(query) → search → rerank → augment → generate · the model is never retrained — you change what it sees, not what it knows.