How 15 trillion tokens get cleaned before a model sees them

Every LLM pretrains on text, but the text doesn't arrive clean. FineWeb is Hugging Face's open answer to that problem — 96 CommonCrawl snapshots, processed through a five-stage pipeline into the largest publicly available clean pretraining dataset. Understanding its pipeline is understanding how the internet becomes a training corpus.

What FineWeb is and why it exists

Most state-of-the-art open models — Llama 3, Mixtral, Falcon — are trained on pretraining datasets that are never publicly released. The model weights ship without the data that produced them. FineWeb is a direct response to that gap: a fully open, reproducible, 15-trillion-token English dataset derived from CommonCrawl dumps spanning mid-2013 through April 2024, released under the permissive ODC-By license.

CommonCrawl has been crawling the public web since 2007 and releases new snapshots roughly every one to two months. Each snapshot holds hundreds of terabytes of raw WARC (Web ARChive) files: compressed HTML, headers, and metadata for billions of pages. FineWeb processed 96 of these snapshots. Even after base filtering, roughly 36 trillion tokens remain. What comes out the other end is 15 trillion, a large reduction driven by deduplication and measured quality decisions, not arbitrary cuts.

The pipeline as a funnel

Each stage throws data away. The funnel below shows where it goes, with the educational subset, FineWeb-Edu, splitting off at the end. Every figure is taken directly from the FineWeb paper.

The funnel: 36T → 15T → 1.3T

Base filtering: 36T tokens remaining (paper figure)

This is the count after WARC text extraction (trafilatura), the URL blocklist, the fastText English filter (score ≥ 0.65), and the Gopher quality/repetition rules. The paper reports ~36 trillion tokens at this point.

Why per-crawl? Deduplicating globally across all 96 crawls would have removed ~90% of the base-filtered data, leaving only ~4T. The team measured that the extra cuts didn't improve downstream models, so they kept per-crawl dedup (~20T) instead. The pipeline's big calls were ablation-tested, not guessed.
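The funnel figures quoted above can be turned into stage-by-stage retention rates with a few lines of arithmetic. This is only a sketch: the token counts are the ones quoted in this article, and the stage labels and the `retention` helper are names chosen for illustration.

```python
# Token counts in trillions, as quoted in the funnel above.
FUNNEL = [
    ("base filtering", 36.0),
    ("per-crawl MinHash dedup", 20.0),
    ("final FineWeb", 15.0),
    ("FineWeb-Edu subset", 1.3),
]


def retention(funnel):
    """Return (stage, tokens, fraction kept relative to the previous stage)."""
    rows = []
    previous = None
    for name, tokens in funnel:
        kept = 1.0 if previous is None else tokens / previous
        rows.append((name, tokens, kept))
        previous = tokens
    return rows


if __name__ == "__main__":
    for name, tokens, kept in retention(FUNNEL):
        print(f"{name:26s} {tokens:5.1f}T  kept {kept:6.1%}")
    # The rejected alternative: global dedup leaves ~4T of the 36T.
    print(f"global dedup would keep about {4.0 / 36.0:.0%}")
```

Read this way, per-crawl deduplication keeps a little over half of the base-filtered tokens, while the global alternative would have kept about a ninth.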

Stage 1 — Text extraction from WARC files

CommonCrawl provides two formats: WET files (pre-stripped plain text) and WARC files (raw HTML with full structure). FineWeb deliberately uses WARC files and runs its own text extraction using trafilatura, rather than accepting CommonCrawl's default WET extraction. The reason is fidelity — CommonCrawl's own extraction strips too aggressively, discarding content that trafilatura successfully recovers from the HTML structure.

Trafilatura identifies the main content region of a webpage — the article body, the post text — and discards navigation menus, footers, cookie banners, and ads. Getting this step right matters disproportionately: junk introduced here propagates through every downstream stage, and no filter further down the pipeline can recover content that was thrown away too early.
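To make the idea of main-content extraction concrete, here is a deliberately simplified stand-in written with Python's standard library. It is not trafilatura's algorithm, which is far more sophisticated. It only illustrates the principle of keeping body text while skipping structural boilerplate. The `SKIP_TAGS` set and the function names are assumptions made for this sketch.

```python
from html.parser import HTMLParser

# Tags whose contents this toy extractor treats as boilerplate (an assumption).
SKIP_TAGS = {"script", "style", "nav", "header", "footer", "aside", "form"}


class MainTextExtractor(HTMLParser):
    """Collect text that is not nested inside any boilerplate tag."""

    def __init__(self):
        super().__init__()
        self.skip_depth = 0  # how many boilerplate tags we are currently inside
        self.chunks = []

    def handle_starttag(self, tag, attrs):
        if tag in SKIP_TAGS:
            self.skip_depth += 1

    def handle_endtag(self, tag):
        if tag in SKIP_TAGS and self.skip_depth > 0:
            self.skip_depth -= 1

    def handle_data(self, data):
        text = data.strip()
        if text and self.skip_depth == 0:
            self.chunks.append(text)


def extract_main_text(html):
    """Return the non-boilerplate text of an HTML page, one chunk per line."""
    parser = MainTextExtractor()
    parser.feed(html)
    parser.close()
    return "\n".join(parser.chunks)
```

A tag list like this breaks down quickly on real pages, where boilerplate often lives in generic div elements, which is why a dedicated extractor is used in practice.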

Stage 2 — Base filtering

Once text is extracted, three filters run in sequence before deduplication. First, a URL blocklist removes adult content domains. Second, a fastText language classifier drops any document scoring below 0.65 confidence for English, which is what keeps FineWeb English-only. Third, quality and repetition filters from MassiveText (the dataset behind DeepMind's Gopher model) remove documents with pathological statistics: abnormal word counts, excessive character repetition, too few stop words, or suspiciously uniform line lengths.

The Gopher filters are heuristic — they're not learned from labeled examples, they're rules derived from intuitions about what makes text coherent. A document must have between 50 and 100,000 words. Mean word length must fall between 3 and 10 characters. No more than 10% of words can be symbols. These thresholds were empirically validated in the original MassiveText paper and carried forward.
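The three base filters can be sketched as a chain of cheap checks. This is a minimal sketch under stated assumptions: the blocklist entry is a placeholder, the English score is assumed to come from an external fastText classifier and is simply passed in, and the symbol rule is simplified to "share of words containing a hash or ellipsis". Only the numeric thresholds come from the text above.

```python
from urllib.parse import urlparse

BLOCKED_DOMAINS = {"blocked.example"}  # hypothetical blocklist entry
MIN_ENGLISH_SCORE = 0.65               # threshold quoted in the text
SYMBOLS = ("#", "...")                 # simplified notion of "symbol"


def passes_gopher_rules(text):
    """Apply the three Gopher-style thresholds quoted above."""
    words = text.split()
    if not 50 <= len(words) <= 100_000:
        return False
    mean_length = sum(len(w) for w in words) / len(words)
    if not 3 <= mean_length <= 10:
        return False
    symbol_words = sum(1 for w in words if any(s in w for s in SYMBOLS))
    return symbol_words / len(words) <= 0.10


def passes_base_filters(url, text, english_score):
    """Run blocklist, language, and quality checks in that order."""
    host = urlparse(url).hostname or ""
    if host in BLOCKED_DOMAINS:
        return False
    if english_score < MIN_ENGLISH_SCORE:
        return False
    return passes_gopher_rules(text)
```

Ordering the checks from cheapest to most expensive means most rejected documents never reach the per-word statistics.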

Stage 3 — Per-crawl MinHash deduplication

The web is massively duplicated. Aggregator sites republish articles verbatim. Mirrors copy entire domains. Templated pages differ only in a few fields. If these near-duplicates survive into a training set, the model sees the same content hundreds of times — which degrades generalization and increases the risk of memorization.

FineWeb runs MinHash deduplication independently on each of the 96 CommonCrawl snapshots rather than globally across all of them. The choice is deliberate and was made empirically: deduplicating across all snapshots removed roughly 90% of the base-filtered data, yet ablation models trained on the result were no better. Per-crawl deduplication catches the most impactful duplicates, pages that appear multiple times within the same crawl, while retaining about 20 trillion tokens and keeping the job parallelizable.

MinHash works by converting each document into a set of word shingles (overlapping n-grams), hashing them into a compact signature, and using Locality Sensitive Hashing to find pairs whose signatures are similar above a Jaccard threshold. Documents flagged as near-duplicates of an already-retained document are dropped. The process is probabilistic — it misses some duplicates and occasionally flags non-duplicates — but at this scale, approximate is the only option.
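The shingle, signature, and banding steps described above can be sketched in pure Python. The parameters here (3-word shingles, 16 hash functions, 4 bands) are illustrative choices for a toy example, not FineWeb's settings, and treating any single band match as a duplicate is a simplification of what a production pipeline does.

```python
import hashlib

NUM_HASHES = 16   # illustrative; real pipelines use many more
BANDS = 4         # signature is split into 4 bands of 4 values each
SHINGLE_SIZE = 3  # words per shingle, illustrative


def shingles(text, n=SHINGLE_SIZE):
    """Return the set of overlapping n-word shingles of a document."""
    words = text.lower().split()
    return {" ".join(words[i:i + n]) for i in range(max(1, len(words) - n + 1))}


def _hash(seed, shingle):
    """Seeded 64-bit hash; each seed acts as one MinHash function."""
    digest = hashlib.blake2b(
        shingle.encode("utf-8"), digest_size=8, salt=seed.to_bytes(8, "little")
    ).digest()
    return int.from_bytes(digest, "little")


def signature(text):
    """MinHash signature: the minimum hash of the shingle set per seed."""
    sh = shingles(text)
    return tuple(min(_hash(seed, s) for s in sh) for seed in range(NUM_HASHES))


def dedup(docs):
    """Keep the first document of each LSH collision group, in input order."""
    rows = NUM_HASHES // BANDS
    seen_bands = set()
    kept = []
    for doc_id, text in docs:
        sig = signature(text)
        keys = [(b, sig[b * rows:(b + 1) * rows]) for b in range(BANDS)]
        if any(k in seen_bands for k in keys):
            continue  # shares a band with a retained document: drop it
        seen_bands.update(keys)
        kept.append(doc_id)
    return kept


def dedup_per_crawl(crawls):
    """Deduplicate each crawl on its own; cross-crawl duplicates survive."""
    return {crawl_id: dedup(docs) for crawl_id, docs in crawls.items()}
```

More bands with fewer rows each make the match more permissive, and fewer bands with more rows make it stricter. That trade-off is how the Jaccard threshold is tuned in practice.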

Stage 4 — C4 quality filters

C4 (the dataset used to train Google's T5) established a set of heuristic text filters that became a baseline for the field. FineWeb applies a subset of them: documents containing the string lorem ipsum are dropped as placeholder content. Documents containing a curly brace are dropped as likely code or template fragments. Lines mentioning JavaScript or carrying cookie/terms-of-service notices are removed as navigation artifacts.

One C4 filter that FineWeb deliberately omits: the terminal punctuation filter, which drops any line not ending in a period, question mark, or exclamation point. Ablation experiments showed this filter removed too much content, including valid lists, headers, and dialogue, to justify the quality gain it brought. The decision to drop a filter based on empirical measurement rather than intuition is characteristic of how the entire FineWeb pipeline was designed.
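A sketch of C4-style filtering, with the terminal punctuation rule exposed as a switch so its effect is visible. It assumes the original C4 behaviour, where the lorem ipsum and curly-brace rules drop whole documents while the JavaScript and policy rules drop individual lines. The `POLICY_PHRASES` list is illustrative and the function name is hypothetical.

```python
# Illustrative phrases; the real list is longer and is not given in this article.
POLICY_PHRASES = ("terms of use", "privacy policy", "cookie policy", "uses cookies")
TERMINAL_PUNCT = (".", "?", "!", '"')


def c4_style_filter(text, require_terminal_punct=False):
    """Return the cleaned text, or None if the whole document is dropped."""
    if "lorem ipsum" in text.lower():
        return None  # placeholder content
    if "{" in text:
        return None  # likely code or a template fragment
    kept = []
    for line in text.splitlines():
        line = line.strip()
        lowered = line.lower()
        if not line:
            continue
        if "javascript" in lowered:
            continue
        if any(phrase in lowered for phrase in POLICY_PHRASES):
            continue
        # The rule FineWeb leaves switched off:
        if require_terminal_punct and not line.endswith(TERMINAL_PUNCT):
            continue
        kept.append(line)
    return "\n".join(kept) if kept else None
```

With `require_terminal_punct=True`, a heading such as "Shopping list" disappears even though it is legitimate content, which is the kind of loss the ablations weighed against the quality gain.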

Stage 5 — Custom FineWeb heuristics

On top of the inherited filters, the FineWeb team developed additional heuristics through their own ablation experiments. These target content patterns not well covered by Gopher or C4: documents where the majority of lines are very short (list-like structure), documents with high fractions of duplicate lines, and documents in which very few lines end with punctuation.
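Line-level heuristics of this kind reduce to a few per-document statistics. The sketch below computes three such signals. The 30-character cutoff for a "short" line and the default thresholds are assumptions chosen for illustration, and the punctuation signal is measured here as the share of lines ending in punctuation, which is one possible reading of the rule.

```python
from collections import Counter

LINE_END_PUNCT = (".", "?", "!", '"')


def line_stats(text):
    """Return per-document line statistics, or None for an empty document."""
    lines = [ln.strip() for ln in text.splitlines() if ln.strip()]
    if not lines:
        return None
    total_chars = sum(len(ln) for ln in lines)
    counts = Counter(lines)
    # Characters belonging to the second and later copies of a repeated line.
    dup_chars = sum(len(ln) * (n - 1) for ln, n in counts.items() if n > 1)
    return {
        "short_line_frac": sum(len(ln) < 30 for ln in lines) / len(lines),
        "dup_char_frac": dup_chars / total_chars,
        "punct_end_frac": sum(ln.endswith(LINE_END_PUNCT) for ln in lines) / len(lines),
    }


def passes_custom_heuristics(text, max_short=0.67, max_dup=0.10, min_punct_end=0.12):
    """Reject list-like, repetitive, or punctuation-starved documents."""
    stats = line_stats(text)
    if stats is None:
        return False
    return (
        stats["short_line_frac"] < max_short
        and stats["dup_char_frac"] < max_dup
        and stats["punct_end_frac"] > min_punct_end
    )
```

Because the thresholds are plain keyword arguments, the same function can be rerun with different cutoffs, which is the shape an ablation sweep would take.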

Each filter was evaluated by training small ablation models on the filtered versus unfiltered data and measuring benchmark performance. Only filters that produced measurable improvements on tasks like HellaSwag, ARC, and CommonsenseQA were included in the final pipeline. This empirical validation loop (filter, train, measure, decide) is what distinguishes FineWeb from earlier datasets that applied heuristics without measuring their downstream effect.

The final step before release applies PII removal, anonymizing email addresses and public IP addresses found in the corpus.
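A regex-based sketch of this kind of anonymization. The patterns are deliberately simple, the placeholder strings are hypothetical rather than the ones FineWeb uses, and only IPv4 is handled. Private and malformed addresses are left untouched, in keeping with "public IP addresses" in the text.

```python
import ipaddress
import re

EMAIL_RE = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}")
IPV4_RE = re.compile(r"\b(?:\d{1,3}\.){3}\d{1,3}\b")

EMAIL_PLACEHOLDER = "email@example.com"  # hypothetical replacement value
IP_PLACEHOLDER = "0.0.0.0"               # hypothetical replacement value


def _mask_ip(match):
    """Replace a matched address only if it parses and is publicly routable."""
    candidate = match.group(0)
    try:
        address = ipaddress.ip_address(candidate)
    except ValueError:
        return candidate  # e.g. 999.1.1.1 is not a valid address
    return IP_PLACEHOLDER if address.is_global else candidate


def anonymize(text):
    """Mask email addresses and public IPv4 addresses in a document."""
    text = EMAIL_RE.sub(EMAIL_PLACEHOLDER, text)
    return IPV4_RE.sub(_mask_ip, text)
```

For example, `anonymize("Mail ops@corp.example from 8.8.8.8")` masks both the address and the IP, while a LAN address such as 192.168.1.1 passes through unchanged.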

FineWeb-Edu: filtering with a learned classifier

FineWeb-Edu is a 1.3-trillion-token subset of FineWeb produced by a single additional filtering pass: an educational quality classifier. To build it, Llama-3-70B-Instruct was prompted to score 460,000 sampled FineWeb documents on a 0-to-5 scale for their educational value. The prompt was tuned to favor clear explanatory prose and penalize marketing copy, listicles, forum arguments, and SEO-padded content.

Of those annotations, 410,000 were used to fine-tune a linear regression head on top of the Snowflake-arctic-embed-m embedding model (the remaining ~50,000 held out for validation). Applying this lightweight classifier to all 15 trillion FineWeb tokens required roughly 6,000 H100 GPU hours. Documents scoring 3 or above were retained, yielding the 1.3-trillion-token FineWeb-Edu corpus.
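The classifier itself is small: a linear head that maps a fixed embedding to a score, followed by a threshold. The sketch below trains such a head with plain gradient descent on made-up two-dimensional vectors. Those vectors stand in for real embeddings and the labels stand in for the LLM's 0-to-5 annotations. All data and function names are illustrative assumptions, not the FineWeb-Edu training code.

```python
def train_linear_head(embeddings, scores, lr=0.5, epochs=2000):
    """Fit score = w . x + b by batch gradient descent on squared error."""
    dim = len(embeddings[0])
    n = len(embeddings)
    w = [0.0] * dim
    b = 0.0
    for _ in range(epochs):
        grad_w = [0.0] * dim
        grad_b = 0.0
        for x, y in zip(embeddings, scores):
            err = sum(wi * xi for wi, xi in zip(w, x)) + b - y
            for i in range(dim):
                grad_w[i] += err * x[i]
            grad_b += err
        w = [wi - lr * g / n for wi, g in zip(w, grad_w)]
        b -= lr * grad_b / n
    return w, b


def predict_score(w, b, x):
    """Predicted educational score, clipped to the 0-to-5 annotation scale."""
    raw = sum(wi * xi for wi, xi in zip(w, x)) + b
    return min(5.0, max(0.0, raw))


def keep_for_edu(w, b, x, threshold=3.0):
    """Retain a document when its predicted score reaches the threshold."""
    return predict_score(w, b, x) >= threshold


# Toy stand-ins: 2-d "embeddings" with teacher scores on the 0-to-5 scale.
TOY_EMBEDDINGS = [[0.0, 0.9], [0.2, 0.1], [0.6, 0.7], [1.0, 0.2], [0.8, 0.5]]
TOY_SCORES = [0.0, 1.0, 3.0, 5.0, 4.0]
```

The expensive part in practice is not this head but computing an embedding for every document, which is where the GPU hours quoted above go.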

The benchmark results were striking. Models pretrained on FineWeb-Edu reached a given quality bar with far less data — the paper reports matching the performance of comparable web datasets with roughly 10× fewer training tokens, and MMLU climbing from about 33% to 37% on models trained at the same token budget. The data-efficiency gain comes entirely from the quality of the retained subset, not from any change to the model architecture or training recipe.

The FineWeb family: what came next

FineWeb spawned a direct successor and several derivatives. FineWeb2 extends the same pipeline to over 1,000 languages, applying per-language deduplication and language-specific filtering thresholds. The core WARC extraction and base filtering logic is identical; the language identification and quality classifier stages are adapted for non-English scripts.

FineMath applies the same classifier-training approach to mathematical content, producing tens of billions of tokens of math-dense text rated for reasoning clarity and step-by-step solution quality. Ultra-FineWeb introduced a verification strategy using a small model trained with a two-stage annealing schedule to evaluate candidate data before inclusion, producing around 1 trillion high-quality English tokens.

The broader implication is the pattern, not the specific dataset: use a strong teacher model to label a sample, train a cheap classifier on those labels, apply the classifier at scale. Llama 3 and Phi-3 used the same technique internally before FineWeb made the approach public and reproducible. It is now a standard tool in the pretraining data toolkit.

Why this matters for benchmarking

When a model scores well on MMLU or ARC, the result reflects both the model architecture and the pretraining data it saw. FineWeb-Edu's benchmark gains demonstrate that data curation — specifically, increasing the density of educational, well-structured prose — can produce larger benchmark improvements than many architectural changes, at substantially lower compute cost.

This creates an interpretive challenge for benchmarking: a model trained on FineWeb-Edu will naturally score higher on knowledge-retrieval benchmarks regardless of its reasoning capability, because the training distribution is tilted toward the kind of expository text those benchmarks reward. FineWeb-Edu's own authors note this — the dataset is biased toward formal writing and likely underperforms on tasks requiring conversational or informal register. Understanding what data a model trained on is therefore a prerequisite for understanding what its benchmark scores actually measure.
