Self-supervised learning
Learn how LLMs train on unlabeled text by predicting masked or next tokens, why this makes labeled data unnecessary at scale, and what it means for how models generalize.
TL;DR
- Self-supervised learning (SSL) generates training labels from the data itself, removing the need for human annotation entirely.
- LLMs use two core pretraining objectives: causal language modeling (predict token N+1, GPT-style) and masked language modeling (predict hidden tokens, BERT-style).
- Every sentence on the internet is implicitly labeled because the correct next token was always in the text. This turns the entire web into a training set of 15+ trillion tokens.
- Chinchilla scaling (2022) showed the optimal ratio is ~20 tokens per parameter. GPT-3 was undertrained at 175B params with only 300B tokens.
- Pre-training builds general knowledge ($100M compute). Fine-tuning changes behavior ($10K compute). Fine-tuning does NOT reliably add new facts.
- Understanding SSL tells you when to recommend RAG over fine-tuning, and why pre-training costs dominate LLM economics.
The problem it solves
Imagine you need to build a model that understands English, Mandarin, Python, organic chemistry, and contract law. With supervised learning, every training example needs a human-created label. Someone must read each document, write the expected output, and verify correctness. At GPT-4 scale (reportedly around 13 trillion tokens), labeling even 0.01% of the data would take millions of person-hours.
This is the labeled data bottleneck. For decades it kept NLP systems narrow: you could build a chatbot that handled airline reservations or medical queries, but not both. Each domain needed its own labeled dataset, its own annotation team, and its own model.
Self-supervised learning sidesteps this bottleneck entirely. Instead of requiring external labels, it generates supervision from the structure of the data. The label was always there in the text; the trick is recognizing it.
I've seen teams spend months building annotation pipelines for tasks that SSL handles out of the box. The moment you understand this paradigm shift, you stop asking "how do we label this?" and start asking "can we frame this as a prediction problem?"
What is it?
Self-supervised learning is a training paradigm where the model creates its own supervision signal from the structure of the input data, with no external labels required.
Think of it like a fill-in-the-blank test that writes itself. You take a sentence, hide one word, and ask the model to guess it. The answer was always in the original sentence. Multiply this by trillions of sentences and you have the entire LLM pretraining paradigm.
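As a rough illustration of the self-writing fill-in-the-blank test, the sketch below turns one raw sentence into (prompt, answer) pairs with no human labeling. The function name `make_blank_examples` and the whitespace "tokenizer" are assumptions made for illustration; real pipelines use subword tokenizers.

```python
def make_blank_examples(sentence: str) -> list:
    """Hide each word in turn; the hidden word becomes the label."""
    words = sentence.split()  # naive whitespace split, for illustration only
    examples = []
    for i, word in enumerate(words):
        blanked = words[:i] + ["____"] + words[i + 1:]
        examples.append((" ".join(blanked), word))
    return examples


examples = make_blank_examples("the ball falls due to gravity")
for prompt, answer in examples:
    print(f"{prompt!r} -> {answer!r}")
```

Every pair comes straight from the original sentence, which is the sense in which the data labels itself.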
This sits between supervised learning (external labels required) and unsupervised learning (no prediction objective at all, just clustering or density estimation). SSL has a clear loss function and a clear target, but both come from the data itself.
The elegance is in the economics. You need zero annotation budget, zero label quality control, and zero domain expertise to create training data. You just need text. In my experience, once teams grasp this, they immediately see why LLM pretraining costs are dominated by compute, not data.
How it works
Causal language modeling (CLM)
GPT-style models use causal language modeling. The objective is simple: given tokens 1 through N, predict token N+1.
During a single forward pass over a 2,048-token sequence, the model makes 2,047 predictions in parallel (each position predicts the next). Every token in every document is a training example. A single Wikipedia article with 3,000 tokens generates 2,999 training examples for free.
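The counting above can be checked with a small sketch. It assumes token IDs are plain integers and uses hypothetical helper names; the key idea is that the targets are the inputs shifted left by one position.

```python
def shift_for_causal_lm(token_ids: list) -> tuple:
    """Inputs are every token but the last; targets are every token but the first."""
    inputs = token_ids[:-1]
    targets = token_ids[1:]
    return inputs, targets


def causal_lm_pairs(token_ids: list) -> list:
    """Expand a sequence into explicit (context, next_token) training examples."""
    return [(token_ids[:i], token_ids[i]) for i in range(1, len(token_ids))]


tokens = [101, 7, 42, 9, 3]  # made-up token IDs
inputs, targets = shift_for_causal_lm(tokens)
pairs = causal_lm_pairs(tokens)

print(inputs)   # what the model reads
print(targets)  # what it must predict at each position
print(len(causal_lm_pairs(list(range(2048)))))  # predictions per 2,048-token sequence
```

In practice the explicit pairs are never materialized; a single forward pass scores all positions at once.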
The "causal" part means the model never sees future tokens when making a prediction. A causal mask in the attention layers enforces left-to-right processing. This is why the same architecture works for generation at inference time: the model just continues the causal chain, predicting one token at a time.
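A minimal sketch of the causal mask, written in plain Python rather than any particular framework: position `query` may attend to position `key` only when `key <= query`. The function name is an assumption, and real implementations usually apply this as a tensor added to the attention scores.

```python
def causal_mask(seq_len: int) -> list:
    """Lower-triangular boolean matrix: True means 'allowed to attend'."""
    return [[key <= query for key in range(seq_len)] for query in range(seq_len)]


mask = causal_mask(4)
for row in mask:
    # Print 1 for visible positions and 0 for hidden (future) positions.
    print(" ".join("1" if allowed else "0" for allowed in row))
```

Each row sees one more token than the row above it, which mirrors how generation extends the sequence one token at a time.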
Masked language modeling (MLM)
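BERT-style models work differently. They randomly select about 15% of input tokens, replace most of them with a special [MASK] token (in the original BERT recipe, 80% become [MASK], 10% become a random token, and 10% stay unchanged), and train the model to predict the originals.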
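The masking step can be sketched as follows. This is a simplification under stated assumptions: it always substitutes `[MASK]` for a selected token, works on whitespace-split words instead of subword tokens, and uses a hypothetical `mask_tokens` helper with a fixed seed for repeatability.

```python
import random

MASK = "[MASK]"


def mask_tokens(tokens: list, mask_prob: float = 0.15, seed: int = 0) -> tuple:
    """Return (masked_tokens, labels), where labels maps position -> original token."""
    rng = random.Random(seed)
    masked = list(tokens)
    labels = {}
    for i, tok in enumerate(tokens):
        if rng.random() < mask_prob:
            masked[i] = MASK   # hide the token from the model
            labels[i] = tok    # the original token is the training target
    return masked, labels


tokens = "the defendant is liable for damages caused by the breach of contract".split()
masked, labels = mask_tokens(tokens)
print(masked)
print(labels)
```

The loss is computed only at the positions stored in `labels`; the unmasked tokens serve as context from both directions.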
Because the objective imposes no left-to-right direction, the model attends to tokens both before and after each masked position. This bidirectional context builds richer representations for understanding tasks. At comparable model sizes, fine-tuned BERT-style encoders have traditionally been strong on classification, entity recognition, and similarity tasks, often beating GPT-style decoders.
The tradeoff is that MLM models are not naturally generative. They are designed for encoding and understanding, not for producing text sequences. You would not use BERT to write a paragraph.
CLM vs MLM: the core tradeoff
The internet as a training set
The web is the largest labeled dataset in history, and nobody had to label it. Every sentence a human writes encodes world knowledge: physical laws, social norms, historical facts, logical reasoning. The next token is never random. It reflects the constraints of language and the author's understanding of reality.
When a physics textbook says "the ball falls due to...", the next token ("gravity") encodes a fact about physics. When legal text says "the defendant is liable for...", the next token encodes legal reasoning. The model learns chemistry, medicine, programming, and philosophy from next-token prediction alone.
I've heard people dismiss this as "just autocomplete." They are underestimating what it takes to predict the next token well. Predicting the next token in a proof, a codebase, or a medical paper requires deep structural understanding.
Chinchilla scaling: how much data is enough?
Before 2022, the prevailing view in the ML community was that extra compute should go mostly into bigger models. The Chinchilla paper (Hoffmann et al., DeepMind, 2022) challenged this by showing that, for a fixed compute budget, model size and training tokens should be scaled up in roughly equal proportion.
The optimal ratio: approximately 20 training tokens per parameter.
This single finding reshaped the industry. By this rule, GPT-3 (175B parameters, ~300B tokens) was heavily undertrained: it should have seen ~3.5T tokens. Chinchilla itself, a 70B model trained on 1.4T tokens, outperformed GPT-3 and the 280B-parameter Gopher with far fewer parameters, which also makes it much cheaper to serve.
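The arithmetic behind these numbers is simple enough to sketch. The constant and function names below are assumptions for illustration, and the 20:1 figure is the rule of thumb quoted above, not an exact law.

```python
TOKENS_PER_PARAM = 20  # approximate Chinchilla rule of thumb


def chinchilla_optimal_tokens(params: float) -> float:
    """Rough compute-optimal training token count for a given parameter count."""
    return TOKENS_PER_PARAM * params


def tokens_per_param(params: float, tokens: float) -> float:
    """Actual tokens-per-parameter ratio a model was trained at."""
    return tokens / params


gpt3_params, gpt3_tokens = 175e9, 300e9
print(chinchilla_optimal_tokens(gpt3_params))       # tokens GPT-3 "should" have seen
print(tokens_per_param(gpt3_params, gpt3_tokens))   # ratio it was actually trained at
print(chinchilla_optimal_tokens(70e9))              # target for a 70B model
```

Comparing the actual ratio against 20 is a quick way to judge whether a model is closer to undertrained or to compute-optimal.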
For your interviews: when someone mentions a model's parameter count, immediately ask about the training token count. The ratio matters more than raw size.
The LLM training pipeline
SSL is just the first (and most expensive) phase of building a modern LLM. The complete pipeline has four stages, each cheaper than the last by orders of magnitude.