The first decision: do you even need to train?
Most teams reach for fine-tuning when a cheaper option would do. Before you touch weights, rule these out in order: first prompting (a better system prompt plus few-shot examples), then retrieval (RAG), where you put your knowledge in a vector store and feed the relevant chunks at query time. RAG changes what the model knows; training changes how it behaves. If the model already has the capability and just needs your facts, RAG is faster, cheaper, and updatable without retraining. Train when you need a durable change in behavior, format, or skill that prompting can’t hold: a consistent tone, a strict output schema, a domain task it keeps getting wrong, or a smaller, cheaper model taught to match a bigger one.
See RAG for the retrieval alternative before you commit to training.
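The ordering above can be written down as a tiny decision helper. This is only a sketch of the reasoning: the function name and its two boolean inputs are hypothetical, and in practice you answer them by testing on your own task.

```python
def pick_approach(prompting_is_enough: bool, gap_is_missing_facts: bool) -> str:
    """Return the cheapest option that plausibly covers the need.

    Both flags are illustrative assumptions: you decide them by trying
    a better prompt and by asking whether the model lacks knowledge
    (facts) or lacks a behavior (tone, format, skill).
    """
    if prompting_is_enough:
        return "prompting"
    if gap_is_missing_facts:
        # The model can already do the task; it only needs your facts.
        return "RAG"
    # A durable behavior change that prompting cannot hold.
    return "fine-tune"


print(pick_approach(prompting_is_enough=False, gap_is_missing_facts=True))
```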
The four levels of 'training' — pick the smallest that works
1. Pretraining: almost certainly not you
Building a base model from raw text, trillions of tokens, millions of dollars of compute. This is what the labs do to produce Llama, Qwen, DeepSeek. You start from their output, not here. Covered for understanding in Pretraining.
2. Supervised fine-tuning (SFT): the workhorse
Show the model a few hundred to a few thousand example input→output pairs of the behavior you want; it learns to imitate them. This is what most people mean by “fine-tuning” and it handles the majority of real use cases — tone, format, a domain task. The mechanism is in Fine-tuning (SFT).
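To make "input→output pairs" concrete, here is a minimal sketch that stores examples as JSON Lines, one example per line. The `input` and `output` field names and the sample texts are assumptions for illustration; training libraries each expect their own schema, so check the one you use.

```python
import json

# Hypothetical examples. The "input"/"output" field names are an assumption,
# not a format required by any particular training library.
examples = [
    {"input": "Customer: Where is my order?",
     "output": "Thanks for reaching out. I'm checking on your order now."},
    {"input": "Customer: Can I change my shipping address?",
     "output": "Yes. Please reply with the new address and I'll update it."},
]


def to_jsonl(rows):
    """Serialize one example per line, rejecting empty or missing fields."""
    lines = []
    for i, row in enumerate(rows):
        for key in ("input", "output"):
            value = row.get(key)
            if not isinstance(value, str) or not value.strip():
                raise ValueError(f"example {i} has an empty or missing {key!r}")
        record = {"input": row["input"], "output": row["output"]}
        lines.append(json.dumps(record, ensure_ascii=False))
    return "\n".join(lines)


jsonl_text = to_jsonl(examples)
```

Rejecting blank fields at write time is a cheap way to keep obviously broken examples out of the training file.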
3. Parameter-efficient (LoRA / QLoRA): how you afford it
You almost never update all the weights. Instead you freeze them and train a tiny set of added “adapter” weights. LoRA makes this cheap; QLoRA adds 4-bit quantization of the frozen base, so a 70B model fine-tunes on a single high-memory workstation-class GPU. This isn’t a different goal from SFT; it’s how you run SFT (or preference tuning) affordably. The how-it-works is in LoRA & PEFT.
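A quick bit of arithmetic shows why the adapter is "tiny". In LoRA, a frozen weight matrix of shape d_out × d_in gets a trainable low-rank update B·A, with B of shape d_out × r and A of shape r × d_in. The sketch below only counts parameters; the 4096 × 4096 layer and rank 8 are illustrative assumptions, not values from any specific model.

```python
def lora_param_counts(d_out: int, d_in: int, rank: int):
    """Compare trainable adapter parameters with the frozen matrix size.

    Full matrix:  d_out * d_in weights (frozen under LoRA).
    Adapter:      B is d_out x rank, A is rank x d_in (both trainable).
    """
    full = d_out * d_in
    adapter = rank * (d_out + d_in)
    return adapter, full, adapter / full


# Illustrative sizes only: one square 4096 x 4096 projection, rank 8.
adapter, full, fraction = lora_param_counts(4096, 4096, rank=8)
print(f"adapter={adapter:,} full={full:,} fraction={fraction:.4%}")
```

For these assumed sizes the adapter is 65,536 parameters against 16,777,216 in the frozen matrix, or 1/256 of it.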
4. Preference tuning (RLHF / DPO): polish, not foundation
After SFT, you can align the model to preferences — make it choose the better of two answers — via RLHF or the simpler, now-common DPO. This is how labs turn a base model into a helpful assistant; for a focused business task SFT alone is often enough. The pipeline is in RLHF.
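For intuition about "choose the better of two answers", here is a sketch of the DPO loss for a single preference pair. It assumes you already have summed log-probabilities of each answer under the model being trained and under a frozen reference copy; the function name, argument names, and the `beta=0.1` default are illustrative assumptions.

```python
import math


def dpo_loss(policy_chosen_logp: float, policy_rejected_logp: float,
             ref_chosen_logp: float, ref_rejected_logp: float,
             beta: float = 0.1) -> float:
    """DPO loss for one (chosen, rejected) pair.

    loss = -log(sigmoid(beta * ((pc - rc) - (pr - rr))))
    where pc/pr are policy log-probs and rc/rr are reference log-probs.
    """
    chosen_margin = policy_chosen_logp - ref_chosen_logp
    rejected_margin = policy_rejected_logp - ref_rejected_logp
    z = beta * (chosen_margin - rejected_margin)
    # -log(sigmoid(z)) == log(1 + exp(-z)), computed in a stable form.
    if z >= 0:
        return math.log1p(math.exp(-z))
    return -z + math.log1p(math.exp(z))


# With no difference from the reference, the loss is log(2).
print(dpo_loss(0.0, 0.0, 0.0, 0.0))
```

The loss falls as the trained model raises the chosen answer's probability relative to the reference more than it raises the rejected one's.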
Rule of thumb: SFT with LoRA/QLoRA covers most real needs. Reach for preference tuning only when SFT can’t get the last mile.
The data is the project
Your result is your dataset. The model is commodity; the curated examples are the moat. For SFT you need clean input→output pairs that demonstrate exactly the behavior you want — quality and consistency beat volume. A few hundred excellent examples routinely outperform tens of thousands of noisy ones. Sources: your own logs (support transcripts, past answers), hand-written gold examples, or synthetic generation — using a strong model to produce training data (the Self-Instruct idea). For preference tuning you need pairs ranked better/worse, not just single answers.
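A minimal cleaning pass might look like the sketch below. It makes two assumptions that you should adapt: duplicates are detected on the whitespace-normalized, case-folded input, and the first answer seen for an input wins. The `prompt`/`chosen`/`rejected` keys for preference data are likewise illustrative names, not a required schema.

```python
def clean_pairs(pairs):
    """Normalize whitespace, drop blank pairs, keep the first answer per input."""
    seen_inputs = set()
    kept = []
    for source_text, target_text in pairs:
        source_text = " ".join(source_text.split())
        target_text = " ".join(target_text.split())
        if not source_text or not target_text:
            continue  # an empty side teaches the model nothing useful
        key = source_text.casefold()
        if key in seen_inputs:
            continue  # same input again: avoid duplicate or conflicting targets
        seen_inputs.add(key)
        kept.append((source_text, target_text))
    return kept


def is_valid_preference(record):
    """A preference example needs a prompt and two different answers."""
    prompt = record.get("prompt", "").strip()
    chosen = record.get("chosen", "").strip()
    rejected = record.get("rejected", "").strip()
    return bool(prompt) and bool(chosen) and bool(rejected) and chosen != rejected
```

Automated filters like this only remove the obvious noise; judging whether an answer is actually a good demonstration still needs a human or a strong reviewer model.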
Hold out a test set before you train, split by something real (by customer, by date) so you measure generalization, not memorization. You cannot tell if training worked without it — which is the whole reason measurement exists.
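One way to split "by something real" is to assign whole groups, such as all rows from one customer, to either train or test. The sketch below hashes a group key so the assignment is deterministic across runs; the `customer_id` field and the 20% test fraction are assumptions for illustration.

```python
import hashlib


def split_by_group(rows, group_key, test_fraction=0.2):
    """Send every row of a group to the same side of the split.

    The group value is hashed to a number in [0, 1); groups below
    test_fraction go to the test set. The realized test share is only
    approximately test_fraction, since groups differ in size.
    """
    train, test = [], []
    for row in rows:
        digest = hashlib.sha256(str(row[group_key]).encode("utf-8")).digest()
        bucket = int.from_bytes(digest[:8], "big") / 2**64
        (test if bucket < test_fraction else train).append(row)
    return train, test


# Hypothetical rows: 50 customers with 3 examples each.
rows = [{"customer_id": f"c{i}", "input": f"q{i}-{j}", "output": "..."}
        for i in range(50) for j in range(3)]
train_rows, test_rows = split_by_group(rows, "customer_id")
```

Because no customer appears on both sides, a good test score cannot come from memorizing that customer's phrasing.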
The hardware — and what runs where
QLoRA changed the math: 4-bit quantization means a single workstation-class GPU or a 128GB unified-memory machine fine-tunes models far larger than you could before. A rough mapping (training needs more memory than inference for the same model, because of optimizer state + gradients):
| Model size | Method | Fits on… |
| --- | --- | --- |
| ≤8B | LoRA/QLoRA | a single 24GB GPU / 32GB Mac |
| 13–34B | QLoRA | a 48GB GPU / 64–128GB unified box |
| 70B | QLoRA | 1× 80GB GPU, or a 128GB unified machine (tight) |
| | full SFT (all weights) | multi-GPU; usually not worth it vs QLoRA |
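The mapping follows from simple arithmetic on weight storage. The sketch below is a rough estimate under stated assumptions: it counts 1 GB as 10^9 bytes, ignores activations, adapter weights, and quantization overhead, and uses roughly 16 bytes per parameter for full fine-tuning with a mixed-precision Adam optimizer, which is a commonly quoted rule of thumb rather than a measured figure.

```python
def weight_memory_gb(params_billion: float, bits_per_weight: int) -> float:
    """Memory for the model weights alone, in GB (1 GB = 1e9 bytes)."""
    return params_billion * bits_per_weight / 8


def full_finetune_memory_gb(params_billion: float,
                            bytes_per_param: float = 16) -> float:
    """Rough full fine-tuning footprint: weights + gradients + optimizer state.

    The 16 bytes/param default is an assumed rule of thumb for
    mixed-precision Adam; real usage depends on the training setup.
    """
    return params_billion * bytes_per_param


print(weight_memory_gb(70, 4))      # frozen 70B base stored in 4-bit
print(weight_memory_gb(70, 16))     # the same weights in 16-bit
print(full_finetune_memory_gb(70))  # full fine-tuning estimate
```

Under these assumptions a 70B base takes about 35 GB in 4-bit versus 140 GB in 16-bit, while full fine-tuning lands above 1,000 GB, which is why it needs multiple GPUs.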
Live, dated build costs and which models run at a usable speed per tier are on the Cost & TCO tab and the hardware profiles. If you’re weighing owning the box vs. renting, the self-hosting lever has the break-even math.
The workflow, end to end
- Pick a base model — strong, openly licensed, the smallest that could plausibly do the job (test the candidates on your task first).
- Build the dataset — clean input→output pairs; hold out a real test split.
- Fine-tune with LoRA/QLoRA — a few epochs; watch the loss, stop before it overfits.
- Evaluate on the held-out set against the un-tuned base — did it actually improve, and at what cost?
- Merge or serve the adapter; quantize for inference; deploy (self-host or a US/neutral host).
- Iterate on the data, not the hyperparameters — that’s where the gains are.
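Steps 3 and 4 above can be sketched with two small helpers: one picks the epoch to keep from a list of held-out losses, the other scores predictions so the tuned model can be compared with the un-tuned base. The patience value, the exact-match metric, and the sample numbers are all illustrative assumptions; pick a metric that fits your task.

```python
def best_epoch(val_losses, patience=2):
    """Index of the epoch to keep.

    Stops scanning once `patience` epochs pass without a new best loss,
    which approximates "stop before it overfits".
    """
    best_idx = 0
    for i, loss in enumerate(val_losses):
        if loss < val_losses[best_idx]:
            best_idx = i
        elif i - best_idx >= patience:
            break
    return best_idx


def exact_match_rate(predictions, references):
    """Share of predictions that equal the reference after trimming spaces."""
    if not references or len(predictions) != len(references):
        raise ValueError("need equal-length, non-empty lists")
    hits = sum(p.strip() == r.strip() for p, r in zip(predictions, references))
    return hits / len(references)


# Hypothetical numbers for illustration only.
losses = [1.0, 0.8, 0.7, 0.75, 0.9, 0.6]
keep = best_epoch(losses)  # epoch index 2; the later 0.6 is never reached

references = ["A", "B", "C", "D"]
base_score = exact_match_rate(["A", "x", "y", "z"], references)
tuned_score = exact_match_rate(["A", "B", "C", "z"], references)
improved = tuned_score > base_score
```

Scoring the base model on the same held-out examples is what turns "the tuned model scores 75%" into an actual measurement of improvement.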
And mind the deployment side once it’s trained — if your base model is Chinese-origin, the security read on how you run it applies to the fine-tune too.
- Fine-tuning (SFT) · LoRA & PEFT · RLHF · Pretraining · RAG
- LoRA · QLoRA (single-GPU fine-tuning) · InstructGPT (RLHF) · Direct Preference Optimization (DPO) · Self-Instruct (synthetic data) — all in the Papers reading path under Training & capabilities.