Turning a base model into an assistant

A pretrained model knows enormous amounts but only wants to continue text. RLHF (reinforcement learning from human feedback) is the pipeline that teaches it to answer questions, follow instructions, and refuse harmful requests instead. It has three stages, and the trick at its heart is learning a model of human preference and then optimizing against it.

Why a base model isn't enough

A base model out of pretraining is a next-token engine. Prompt it with a question and it may answer, or it may write five more questions, or drift into an unrelated essay — all are plausible continuations of internet text. It has the knowledge but no notion that it is supposed to be helpful, honest, and harmless. RLHF's job is to install that notion without damaging the underlying capability.

Stage 1 — Supervised fine-tuning (SFT)

First, humans write (or curate) a few thousand to a few tens of thousands of high-quality example conversations: a prompt and an ideal assistant response. The base model is fine-tuned on these in the ordinary supervised way — predict the demonstrator's next token. This alone gets you most of the way to an assistant: the model now defaults to answering rather than continuing.
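To make "predict the demonstrator's next token" concrete, here is a minimal sketch of the SFT loss in plain Python. It assumes, as a common convention that the text above does not state, that prompt tokens are masked out so only the assistant's response contributes to the loss. The function name `sft_loss` and the sample numbers are hypothetical.

```python
import math

def sft_loss(token_logprobs, is_response):
    """Mean negative log-likelihood over response tokens only.

    token_logprobs: log-probability the model assigned to each target token.
    is_response:    True for assistant-response tokens, False for prompt tokens
                    (masking the prompt is an assumption of this sketch).
    """
    picked = [-lp for lp, keep in zip(token_logprobs, is_response) if keep]
    if not picked:
        raise ValueError("no response tokens to train on")
    return sum(picked) / len(picked)

# Hypothetical example: two prompt tokens (masked), three response tokens.
logprobs = [math.log(0.9), math.log(0.8),
            math.log(0.5), math.log(0.25), math.log(0.5)]
mask = [False, False, True, True, True]
loss = sft_loss(logprobs, mask)  # average of -log(0.5), -log(0.25), -log(0.5)
```

In a real training run the log-probabilities come from the model's forward pass and the loss is minimized by gradient descent; the sketch only shows what quantity is being minimized.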

SFT is covered in depth in the fine-tuning explainer. The catch is that writing demonstrations is expensive and can only cover so much. SFT teaches the model what good answers look like, but it cannot easily teach the finer judgment of which of two decent answers is better. That requires a cheaper signal humans are far better at producing: comparison.

Stage 2 — The reward model and preference pairs

People are unreliable at writing the perfect answer but reliable at judging which of two answers they prefer. So the pipeline collects exactly that: for a given prompt, sample two (or more) responses from the SFT model, and have a human label which one is better. Each labeled comparison is a preference pair — a small, cheap, consistent judgment.

A separate reward model is then trained on these pairs. It takes a prompt and a response and outputs a single scalar score, and it is trained so that the human-preferred response in each pair scores higher than the rejected one. After enough pairs, the reward model has distilled a population of human judgments into a function that can score any new response automatically: a stand-in for “how much would a human like this?”
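One common way to express "the preferred response should score higher" is a pairwise logistic loss in the Bradley-Terry style. The text above does not name a specific loss, so treat this as an assumed, illustrative choice; `pairwise_loss` is a hypothetical name, and the scores stand in for reward-model outputs.

```python
import math

def log_sigmoid(x):
    # Numerically stable log(sigmoid(x)).
    if x >= 0:
        return -math.log1p(math.exp(-x))
    return x - math.log1p(math.exp(x))

def pairwise_loss(score_chosen, score_rejected):
    """Loss for one preference pair: -log sigmoid(chosen - rejected).

    Small when the human-preferred response already scores higher,
    large when the reward model ranks the pair the wrong way round.
    """
    return -log_sigmoid(score_chosen - score_rejected)

# Hypothetical reward-model scores for one labeled pair.
loss_when_right = pairwise_loss(2.0, 0.0)   # model agrees with the label
loss_when_wrong = pairwise_loss(0.0, 2.0)   # model disagrees with the label
loss_when_tied = pairwise_loss(1.0, 1.0)    # equals log(2)
```

Only the difference between the two scores matters, which is why a reward model trained this way gives scores that are meaningful relative to each other rather than on an absolute scale.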

Stage 3 — Reinforcement learning against the reward

Now the loop closes. The SFT model (the policy) generates responses, the reward model scores them, and a reinforcement learning algorithm — classically PPO, Proximal Policy Optimization — updates the policy to produce higher-scoring responses. Run this over many prompts and the model drifts toward outputs humans prefer: clearer, more helpful, better formatted, more willing to refuse harmful requests.
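The core of the PPO update is a clipped objective that limits how far a single step can move the policy. The sketch below shows that objective for one sampled token or response, assuming an advantage value has already been computed from the reward; the name `ppo_clipped_objective` and the default `clip_eps=0.2` are illustrative assumptions, not details from the text above.

```python
import math

def ppo_clipped_objective(logp_new, logp_old, advantage, clip_eps=0.2):
    """Clipped surrogate objective for one sample (to be maximized).

    logp_new:  log-probability of the sampled action under the updated policy.
    logp_old:  log-probability under the policy that generated the sample.
    advantage: how much better than expected the sample's reward was.
    """
    ratio = math.exp(logp_new - logp_old)
    clipped_ratio = max(1.0 - clip_eps, min(ratio, 1.0 + clip_eps))
    # Taking the minimum removes any incentive to push the ratio
    # outside the [1 - clip_eps, 1 + clip_eps] band.
    return min(ratio * advantage, clipped_ratio * advantage)

unchanged = ppo_clipped_objective(-1.0, -1.0, advantage=1.0)            # ratio 1
capped = ppo_clipped_objective(math.log(2.0), 0.0, advantage=1.0)       # ratio 2, capped at 1.2
penalized = ppo_clipped_objective(math.log(0.5), 0.0, advantage=-1.0)   # ratio 0.5, A < 0
```

A full PPO implementation also needs a value function, advantage estimation, and batching, all omitted here.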

One critical guardrail: a KL-divergence penalty keeps the policy from straying too far from the original SFT model. Without it, RL will happily discover degenerate, high-reward gibberish that exploits quirks in the reward model, the classic reward hacking failure. The penalty says “maximize reward, but stay recognizably the model you started as.”
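One way to apply the penalty, sketched here under assumptions, is to subtract a scaled KL estimate from the reward-model score before the RL update. The sum of per-token log-probability differences on a sampled response is a single-sample estimate of the KL divergence between the policy and the SFT reference. The function name and the coefficient `beta=0.1` are hypothetical.

```python
def kl_shaped_reward(reward, policy_logprobs, ref_logprobs, beta=0.1):
    """Reward-model score minus a KL penalty toward the reference model.

    policy_logprobs: per-token log-probs of the sampled response under the policy.
    ref_logprobs:    per-token log-probs of the same tokens under the SFT model.
    beta:            penalty strength (assumed value; tuned in practice).
    """
    kl_estimate = sum(p - r for p, r in zip(policy_logprobs, ref_logprobs))
    return reward - beta * kl_estimate

# Policy identical to the reference: no penalty.
same = kl_shaped_reward(1.0, [-1.0, -2.0], [-1.0, -2.0])

# Policy far more confident than the reference on these tokens: reward is reduced.
drifted = kl_shaped_reward(1.0, [-0.5, -1.0], [-1.0, -2.0])
```

A larger `beta` keeps the policy closer to the SFT model at the cost of slower reward improvement; a smaller one allows more drift and more room for reward hacking.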

DPO: skipping the reward model

Building and running a separate reward model plus a full RL loop is finicky and compute-heavy. Direct Preference Optimization (DPO, 2023) showed you can often skip it. DPO uses the same preference-pair data, and its authors show that a closed-form loss, optimized directly on the policy, targets the same KL-constrained objective: no explicit reward model, no RL rollout loop, just supervised-style training on the pairs.
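The DPO loss for a single preference pair can be sketched in a few lines. It compares how much the policy has raised the log-probability of the chosen response, relative to a frozen reference model, against the same quantity for the rejected response. The inputs are summed log-probabilities of whole responses; the function names and `beta=0.1` are illustrative assumptions.

```python
import math

def log_sigmoid(x):
    # Numerically stable log(sigmoid(x)).
    if x >= 0:
        return -math.log1p(math.exp(-x))
    return x - math.log1p(math.exp(x))

def dpo_loss(policy_chosen, policy_rejected, ref_chosen, ref_rejected, beta=0.1):
    """DPO loss for one pair, given sequence log-probabilities.

    policy_*: log-prob of each response under the model being trained.
    ref_*:    log-prob of each response under the frozen reference model.
    """
    chosen_gain = policy_chosen - ref_chosen        # implicit reward of chosen
    rejected_gain = policy_rejected - ref_rejected  # implicit reward of rejected
    return -log_sigmoid(beta * (chosen_gain - rejected_gain))

start = dpo_loss(-5.0, -5.0, -5.0, -5.0)   # policy equals reference: log(2)
better = dpo_loss(-4.0, -6.0, -5.0, -5.0)  # policy now favors the chosen response
worse = dpo_loss(-6.0, -4.0, -5.0, -5.0)   # policy favors the rejected response
```

The reference model plays the role of the KL penalty in the RL formulation: the loss rewards only changes relative to it, and `beta` sets how strongly deviations are weighted.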

Because it is simpler, more stable, and cheaper, DPO and its descendants have become a common choice for preference tuning, particularly among open models. RLHF-with-PPO is still used, especially where an explicit reward model is reused across many objectives, but “train on preference pairs” today often means DPO under the hood. The conceptual story is unchanged: humans compare, a preference signal is learned, the model is steered toward it.

Why this is where 'alignment' lives

Helpfulness, tone, refusal behavior, instruction-following, and most of what users experience as a model's “personality” are shaped in this pipeline. Constitutional AI (Anthropic) extends the idea by having the model critique and revise its own responses against a written set of principles, reducing the need for human labels on harmful content. The reasoning-model training behind o1/R1 builds on the same foundation, applying reinforcement learning to reasoning traces. The reward there can come from automatically checking the final answer (the approach DeepSeek reports for R1) or, in other work, from process reward models that grade intermediate steps (see the test-time compute explainer).

The throughline: pretraining decides what the model can do; the RLHF pipeline decides what it chooses to do when you talk to it.
