How to train an open-source model — without training one from scratch

“Train a model” sounds like a data-center project. For almost everyone it isn’t. You start from an open-weight model someone already pretrained on trillions of tokens, and you adaptit to your task — often on a single GPU, in an afternoon. This is the practical path: the four levels of “training,” which one you actually need, the data and hardware it takes, and the papers + mechanism explainers that go deep on each step.

The first decision: do you even need to train?

Most teams reach for fine-tuning when a cheaper option would do. Before you touch weights, rule these out in order: prompting (a better system prompt + few-shot examples), then retrieval (RAG) — put your knowledge in a vector store and feed the relevant chunks at query time. RAG changes what the model knows; training changes how it behaves. If the model already has the capability and just needs your facts, RAG is faster, cheaper, and updatable without retraining. Train when you need a durable change in behavior, format, or skillthat prompting can’t hold — a consistent tone, a strict output schema, a domain task it keeps getting wrong, or a smaller/cheaper model taught to match a bigger one.

See RAG for the retrieval alternative before you commit to training.

The four levels of 'training' — pick the smallest that works

1Pretraining — almost certainly not you

Building a base model from raw text, trillions of tokens, millions of dollars of compute. This is what the labs do to produce Llama, Qwen, DeepSeek. You start from their output, not here. Covered for understanding in Pretraining.

2Supervised fine-tuning (SFT) — the workhorse

Show the model a few hundred to a few thousand example input→output pairs of the behavior you want; it learns to imitate them. This is what most people mean by “fine-tuning” and it handles the majority of real use cases — tone, format, a domain task. The mechanism is in Fine-tuning (SFT).

3Parameter-efficient (LoRA / QLoRA) — how you afford it

You almost never update all the weights — you train a tiny set of added “adapter” weights and freeze the rest. LoRA makes this cheap; QLoRAadds 4-bit quantization so a 70B model fine-tunes on a single consumer/workstation GPU. This isn’t a different goal from SFT — it’show you run SFT (or preference tuning) affordably. The how-it-works is in LoRA & PEFT.

4Preference tuning (RLHF / DPO) — polish, not foundation

After SFT, you can align the model to preferences — make it choose the better of two answers — via RLHF or the simpler, now-common DPO. This is how labs turn a base model into a helpful assistant; for a focused business task SFT alone is often enough. The pipeline is in RLHF.

Rule of thumb: SFT with LoRA/QLoRA covers most real needs.Reach for preference tuning only when SFT can’t get the last mile.

The data is the project

Your result is your dataset. The model is commodity; the curated examples are the moat. For SFT you need clean input→output pairs that demonstrate exactly the behavior you want — quality and consistency beat volume. A few hundred excellent examples routinely outperform tens of thousands of noisy ones. Sources: your own logs (support transcripts, past answers), hand-written gold examples, or synthetic generation — using a strong model to produce training data (the Self-Instruct idea). For preference tuning you need pairs ranked better/worse, not just single answers.

Hold out a test set before you train, split by something real (by customer, by date) so you measure generalization, not memorization. You cannot tell if training worked without it — which is the whole reason measurement exists.

The hardware — and what runs where

QLoRA changed the math: 4-bit quantization means a single workstation-class GPU or a 128GB unified-memory machine fine-tunes models far larger than you could before. A rough mapping (training needs more memory than inference for the same model, because of optimizer state + gradients):

Model size   Method        Fits on…
≤8B          LoRA/QLoRA    a single 24GB GPU / 32GB Mac
13–34B       QLoRA         a 48GB GPU / 64–128GB unified box
70B          QLoRA         1× 80GB GPU, or a 128GB unified machine (tight)
full SFT     (all weights) multi-GPU — usually not worth it vs QLoRA

Live, dated build costs and which models run at a usable speed per tier are on the Cost & TCO tab and the hardware profiles. If you’re weighing owning the box vs. renting, the self-hosting lever has the break-even math.

The workflow, end to end

Pick a base model — strong, openly licensed, the smallest that could plausibly do the job (test the candidates on your task first).
Build the dataset — clean input→output pairs; hold out a real test split.
Fine-tune with LoRA/QLoRA — a few epochs; watch the loss, stop before it overfits.
Evaluate on the held-out set against the un-tuned base — did it actually improve, and at what cost?
Merge or serve the adapter; quantize for inference; deploy (self-host or a US/neutral host).
Iterate on the data, not the hyperparameters — that’s where the gains are.

Now serve it — and don’t underestimate step 5. An OpenAI-compatible endpoint gets a request working in minutes, but state, tool calling, structured output, files, and observability all live around it. What it actually takes to serve a self-hosted model in production is its own topic: serving a local model. And if your base model is Chinese-origin, the security read on how you run it applies to the fine-tune too.

Go deeper

Mechanism explainers

Fine-tuning (SFT) · LoRA & PEFT · RLHF · Pretraining · RAG

Key papers

LoRA · QLoRA (single-GPU fine-tuning) · InstructGPT (RLHF) · Direct Preference Optimization (DPO) · Self-Instruct (synthetic data) — all in the Papers reading path under Training & capabilities.

Browse the training papers

EyesInAI·Loading explainers…

Explainers

Model training · the practical track