Training and Pretraining
Pretraining
Concept
Pretraining is the large-scale phase where a model learns general language understanding from massive unlabeled text corpora. It requires enormous compute and data but is done only once - the resulting base model is then fine-tuned for specific tasks.
The three stages of an LLM's life:
1. Pretraining → Base model (knows language, world knowledge, no instruction following)
2. Fine-tuning → Instruction-tuned model (follows instructions, behaves as assistant)
3. Alignment → RLHF / DPO (safe, helpful, honest - reduced harmful outputs)
Scale of pretraining:
- LLaMA-3 8B: trained on ~15 trillion tokens (15T), ~1.3M GPU-hours on H100 (per Meta's model card)
- GPT-4: unofficial estimates of 10T+ tokens; training data and compute are undisclosed
- Rule of thumb: 1M GPU-hours on modern H100s costs ~$1M–$3M at cloud prices of roughly $1–$3 per GPU-hour
Dataset Curation
Concept
The quality of pretraining data is at least as important as model architecture. "Garbage in, garbage out" applies at trillion-token scale.
Data sources:
- Common Crawl: Periodic crawls covering a large sample of the public web - petabytes of raw data; requires aggressive filtering
- Books: Project Gutenberg, BooksCorpus, Books3 - high-quality, diverse language
- Wikipedia: Clean, factual, structured - high signal-to-noise
- Code: GitHub - improves reasoning capabilities, not just coding
- Academic papers: ArXiv - improves scientific understanding
- Curated datasets: RefinedWeb, RedPajama, SlimPajama, Dolma
Data processing pipeline:
Raw web text
↓
URL/domain filtering (block adult content, spam, known low-quality domains)
↓
Language detection (keep target languages)
↓
Exact deduplication (hash of normalized text) - removes copy-pasted content
↓
Near-deduplication (MinHash, SimHash, n-gram overlap) - removes near-duplicates
↓
Quality filtering:
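- Perplexity filtering (score text with a reference model trained on clean data; low perplexity → likely good quality, though extremely low perplexity can signal repetitive boilerplate)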
- Heuristic rules (min/max token counts, symbol ratio, etc.)
- Classifier-based quality scoring
↓
PII removal (email addresses, phone numbers, SSNs)
↓
Final tokenization + storage
Why deduplication matters: Training on duplicated data pushes the model to memorize specific text rather than learn generalizable patterns, and it wastes compute on repeated examples. Deduplication also reduces privacy risk (memorized PII).
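The two deduplication stages of the pipeline can be sketched in a few lines. This is a minimal illustration, not a production pipeline: the helper names, the 3-word shingles, the 128 hash functions, and the 0.6 similarity threshold are all assumptions chosen for the example, and real systems use locality-sensitive hashing instead of the pairwise comparison shown here.

```python
import hashlib
import re


def normalize(text):
    """Lowercase and collapse whitespace so trivial variants compare equal."""
    return re.sub(r"\s+", " ", text.lower()).strip()


def exact_key(text):
    """Exact-duplicate key: hash of the normalized text."""
    return hashlib.sha1(normalize(text).encode("utf-8")).hexdigest()


def shingles(text, n=3):
    """Set of word n-grams used to compare documents."""
    words = normalize(text).split()
    return {" ".join(words[i:i + n]) for i in range(max(1, len(words) - n + 1))}


def minhash_signature(text, num_hashes=128):
    """MinHash: for each seeded hash function, keep the smallest shingle hash."""
    grams = shingles(text)
    signature = []
    for seed in range(num_hashes):
        signature.append(min(
            int.from_bytes(hashlib.sha1(f"{seed}:{g}".encode("utf-8")).digest()[:8], "big")
            for g in grams
        ))
    return signature


def estimated_jaccard(sig_a, sig_b):
    """Fraction of matching signature slots approximates Jaccard similarity."""
    return sum(a == b for a, b in zip(sig_a, sig_b)) / len(sig_a)


def deduplicate(docs, threshold=0.6):
    """Drop exact duplicates first, then documents too similar to one already kept."""
    seen_exact, kept, kept_sigs = set(), [], []
    for doc in docs:
        key = exact_key(doc)
        if key in seen_exact:
            continue  # exact duplicate after normalization
        seen_exact.add(key)
        sig = minhash_signature(doc)
        if any(estimated_jaccard(sig, other) >= threshold for other in kept_sigs):
            continue  # near-duplicate of a kept document
        kept.append(doc)
        kept_sigs.append(sig)
    return kept
```

Calling `deduplicate(list_of_documents)` returns the surviving documents in their original order. The pairwise loop is quadratic in the number of kept documents, which is why large-scale pipelines bucket signatures with LSH before comparing.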
Data mixture ratios (approximate, from public information):
Llama 3 pretraining mix (as reported in Meta's Llama 3 paper):
~50% general knowledge
~25% mathematical and reasoning data
~17% code
~8% multilingual
Tokenization
Concept
Before training, all text is converted to token sequences using a fixed vocabulary tokenizer built from the training data.
Byte-Pair Encoding (BPE) - step by step:
Initial vocabulary: all individual bytes (256 symbols)
Training procedure:
1. Split all training text into characters/bytes
2. Count frequency of all adjacent symbol pairs
3. Merge the most frequent pair into a new symbol
4. Repeat until vocabulary reaches target size (e.g., 32K)
Example:
Text: "low lower lowest"
Initial: [l,o,w] [l,o,w,e,r] [l,o,w,e,s,t]
Most frequent pair: (l,o), count 3 (tied with (o,w); ties are broken arbitrarily) → merge to (lo)
→ [lo,w] [lo,w,e,r] [lo,w,e,s,t]
Most frequent pair: (lo,w) → merge to (low)
→ [low] [low,e,r] [low,e,s,t]
...continues until target vocab size
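The merge loop above can be reproduced with a short sketch. It works on characters of whitespace-split words rather than raw bytes, and it breaks ties by taking the pair encountered first; both are simplifying assumptions for illustration, and `train_bpe` is a hypothetical helper, not a library function.

```python
from collections import Counter


def train_bpe(text, num_merges):
    """Learn BPE merges from whitespace-split words; returns (merges, segmented words)."""
    # Each word starts as a tuple of single characters, weighted by its frequency.
    words = Counter(tuple(word) for word in text.split())
    merges = []
    for _ in range(num_merges):
        # Count adjacent symbol pairs across the corpus.
        pairs = Counter()
        for symbols, freq in words.items():
            for a, b in zip(symbols, symbols[1:]):
                pairs[(a, b)] += freq
        if not pairs:
            break
        # Ties go to the pair seen first (max returns the first maximal item).
        best = max(pairs, key=pairs.get)
        merges.append(best)
        # Rewrite every word with the chosen pair merged into one symbol.
        merged = Counter()
        for symbols, freq in words.items():
            out, i = [], 0
            while i < len(symbols):
                if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == best:
                    out.append(symbols[i] + symbols[i + 1])
                    i += 2
                else:
                    out.append(symbols[i])
                    i += 1
            merged[tuple(out)] += freq
        words = merged
    return merges, list(words)


merges, segmented = train_bpe("low lower lowest", num_merges=2)
print(merges)     # [('l', 'o'), ('lo', 'w')]
print(segmented)  # [('low',), ('low', 'e', 'r'), ('low', 'e', 's', 't')]
```

Production tokenizers store the ordered merge list and replay it at encoding time, which is what makes BPE deterministic once trained.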
Tokenization algorithms - full comparison:
| Algorithm | Description | Characteristics | Used by | Pros | Cons |
|---|---|---|---|---|---|
| Whitespace | Splits on spaces | Fast, naive | Early NLP tools | Very simple | Fails for languages without whitespace word boundaries (e.g., Chinese, Japanese) |
| Character-level | Each char is a token | Fine-grained | Some toy/code models | Robust to OOV | Long sequences |
| Word-level | Splits on words and punctuation | Fixed word vocabulary | NLTK, spaCy | Easy to read | Large vocabulary; unseen words become OOV |
| BPE (Byte-Pair Encoding) | Merges most frequent adjacent symbol pairs | Greedy, deterministic | GPT-2, RoBERTa, CodeBERT | Efficient, fast | Frequency-driven merges ignore morphology |
| WordPiece | Merges the pair that most increases training-data likelihood | Likelihood-based | BERT, DistilBERT | Better OOV handling | Slightly slower to train |
| Unigram Language Model | Prunes a large candidate vocabulary; picks the most likely subword sequence | Probabilistic | ALBERT, T5, mT5 | Flexible, language-agnostic | More complex |
| SentencePiece (BPE/Unigram) | Library that trains BPE or Unigram on raw Unicode text, treating whitespace as a symbol (▁) | Language-independent | T5, mT5, Gemini | No pre-tokenization needed | May be less readable |
| Byte-Level BPE | Extends BPE to raw bytes | Includes spaces, emojis | GPT-2, GPT-Neo, GPT-J | No UNK tokens | Tokens may be unreadable |
| SentencePiece (byte fallback) | Falls back to UTF-8 bytes for characters outside the vocabulary | Language-agnostic | PaLM, LLaMA 1/2 | Works on all scripts | Needs detokenization mapping |
| tiktoken (OpenAI) | Fast byte-level BPE library with fixed encodings | Special tokens, efficient | GPT-3, GPT-4 | Very fast | Open-source encoder, but vocabulary training details are unpublished |
Key practical notes:
- SentencePiece (LLaMA 1/2, Gemma): Works on raw text without pre-tokenization, treating whitespace as an ordinary symbol - handles all languages and code uniformly; byte fallback covers unseen characters. LLaMA 3 switched to a tiktoken-style BPE with a 128K vocabulary.
- tiktoken (GPT-4): Uses cl100k_base with a ~100K vocabulary - a larger vocab reduces token counts, improving efficiency for English and code.
- Multilingual/emoji-rich datasets: Prefer byte-level tokenization or byte fallback so no input is ever out of vocabulary.
- Training vs inference: Tokenizer must match exactly - inconsistency causes silent failures.
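To see why byte-level schemes never hit an out-of-vocabulary symbol, it helps to look at the base alphabet on its own. The sketch below shows only the byte mapping that sits underneath byte-level BPE; real tokenizers then apply learned merges on top, and the helper names are hypothetical.

```python
def to_byte_tokens(text):
    """Map text to UTF-8 byte values; every string lands in the 0..255 base vocabulary."""
    return list(text.encode("utf-8"))


def from_byte_tokens(tokens):
    """Invert the mapping; an exact round-trip for valid UTF-8 sequences."""
    return bytes(tokens).decode("utf-8")


sample = "naïve 日本語 🙂"
byte_tokens = to_byte_tokens(sample)

# Non-ASCII characters expand to several bytes, which is the cost of full coverage.
print(len(sample), "characters ->", len(byte_tokens), "byte tokens")
print(from_byte_tokens(byte_tokens) == sample)
```

The trade-off is sequence length: scripts that need three or four bytes per character produce longer token sequences unless the learned merges cover them well.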
Tokenizer implementations across provider ecosystems:
| Provider | Tokenizer/Model | Algorithm | Notes |
|---|---|---|---|
| Hugging Face | BertTokenizer | WordPiece | For BERT and variants |
| Hugging Face | GPT2Tokenizer | Byte-Level BPE | GPT-2, GPT-Neo |
| Hugging Face | RobertaTokenizer | Byte-Level BPE | Same scheme as GPT-2; leading spaces are part of the token |
| Hugging Face | T5Tokenizer | SentencePiece (Unigram) | Used in T5, mT5 |
| Hugging Face | LlamaTokenizer | SentencePiece (BPE) | Used in LLaMA 1/2; LLaMA 3 uses a tiktoken-style BPE |
| Hugging Face | BloomTokenizerFast | Byte-Level BPE | For BLOOM |
| GCP / Vertex AI | Gemini 1.5 | SentencePiece (algorithm details not fully public) | Google prefers SentencePiece |
| GCP / Vertex AI | PaLM / T5 / mT5 | SentencePiece (Unigram) | Internal tokenizer tools |
| AWS Bedrock | Claude (Anthropic) | BPE-based (details not published) | Hosted third-party model |
| AWS Bedrock | Mistral / LLaMA | SentencePiece (BPE) for Mistral 7B and LLaMA 2 | Open-weight models hosted via Bedrock |
| OpenAI | GPT-3, GPT-4 | tiktoken (byte-level BPE) | Descends from the GPT-2 tokenizer |
| Meta | LLaMA 1/2/3 | SentencePiece (BPE) for LLaMA 1/2; tiktoken-style BPE (128K vocabulary) for LLaMA 3 | Open, multilingual |
| Anthropic | Claude 1/2/3 | BPE-based (details not publicly documented) | Do not assume tiktoken compatibility |
Popular tokenization libraries:
| Library | Algorithms supported | Notes |
|---|---|---|
| transformers (Hugging Face) | BPE, WordPiece, Unigram (via tokenizers and sentencepiece backends) | Easy integration |
| tokenizers (Hugging Face) | BPE, WordPiece, Unigram, WordLevel | Fast Rust-based tokenizers; train your own |
| sentencepiece | BPE, Unigram | Used by Google models |
| tiktoken | Byte-level BPE | Used with OpenAI models |
Training Objectives
Concept
Causal Language Modeling (CLM) - Decoder-only:
Input: "The cat sat on the mat"
Targets: "cat sat on the mat <EOS>"
Loss: CrossEntropy(logits, targets) averaged over all non-padding positions
The model sees tokens 0..t-1 and predicts token t. This is why causal masking is applied during training - the model must predict each position without seeing future tokens. The loss is the average cross-entropy over all predicted positions in the sequence.
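The shift between inputs and targets is easy to get wrong, so here is a framework-free sketch of the loss. It assumes one list of logits per position and uses -100 as the ignore marker (the default `ignore_index` in PyTorch's cross-entropy); the function names are illustrative only.

```python
import math

IGNORE_INDEX = -100  # marker for positions that must not contribute to the loss


def cross_entropy(logits, target):
    """Negative log-probability of `target` under softmax(logits)."""
    m = max(logits)  # subtract the max for numerical stability
    log_z = m + math.log(sum(math.exp(x - m) for x in logits))
    return log_z - logits[target]


def clm_loss(logits_per_position, token_ids):
    """Average next-token loss: position t's logits are scored against token t+1."""
    losses = []
    for t in range(len(token_ids) - 1):
        target = token_ids[t + 1]
        if target == IGNORE_INDEX:
            continue  # padding or otherwise masked target
        losses.append(cross_entropy(logits_per_position[t], target))
    return sum(losses) / len(losses)


# With uniform logits over a 4-token vocabulary, the loss equals log(4).
uniform_logits = [[0.0] * 4 for _ in range(4)]
print(clm_loss(uniform_logits, [0, 1, 2, 3]))
```

The last position's logits are never scored because there is no following token, which is why a sequence of length T yields at most T-1 loss terms.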
Masked Language Modeling (MLM) - Encoder-only (BERT):
Input: "The [MASK] sat on the [MASK]"
Targets: "cat" and "mat"
Loss: CrossEntropy only on masked positions
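15% of tokens are selected for prediction. Of those, 80% are replaced with [MASK], 10% with a random token, and 10% are kept unchanged. Because [MASK] never appears at fine-tuning time, this mix reduces the pretraining/fine-tuning mismatch and forces the model to keep a useful representation of every token, not just masked ones.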
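A minimal sketch of this corruption step, assuming integer token ids: `MASK_ID`, the vocabulary size, and the function name are illustrative placeholders rather than values from any particular tokenizer, and special tokens are not excluded from random replacement as a real implementation would do.

```python
import random

MASK_ID = 103        # illustrative id for [MASK]
IGNORE_INDEX = -100  # positions excluded from the loss


def mask_tokens(token_ids, vocab_size, rng, select_prob=0.15):
    """Return (corrupted inputs, labels) following the 80/10/10 rule."""
    inputs = list(token_ids)
    labels = [IGNORE_INDEX] * len(token_ids)
    for i, tok in enumerate(token_ids):
        if rng.random() >= select_prob:
            continue  # not selected: input unchanged, no loss at this position
        labels[i] = tok  # selected: the model must recover the original token
        roll = rng.random()
        if roll < 0.8:
            inputs[i] = MASK_ID                    # 80%: replace with [MASK]
        elif roll < 0.9:
            inputs[i] = rng.randrange(vocab_size)  # 10%: replace with a random token
        # remaining 10%: keep the original token in place
    return inputs, labels


rng = random.Random(0)
tokens = list(range(1000, 2000))
inputs, labels = mask_tokens(tokens, vocab_size=30000, rng=rng)
print(sum(label != IGNORE_INDEX for label in labels), "of", len(tokens), "positions selected")
```

Passing an explicit `random.Random` instance keeps the corruption reproducible, which matters when comparing runs.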
Span Corruption - Encoder-Decoder (T5):
Input: "The <X> on the mat. The <Y> is fluffy." (spans replaced by sentinels)
Target: "<X> cat sat <Y> cat <Z>" (each sentinel introduces the span it replaced; a final sentinel closes the target)
Loss: CrossEntropy on the decoder's output
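The input/target construction can be sketched as follows, assuming whitespace tokens and spans given as (start, end) index pairs. The `<X>`, `<Y>`, `<Z>` sentinels follow the example above; like T5, this sketch appends one closing sentinel to the target. The function name is hypothetical.

```python
def corrupt_spans(tokens, spans, sentinels=("<X>", "<Y>", "<Z>")):
    """Replace each (start, end) span with a sentinel; the target lists the removed spans."""
    inputs, target, cursor = [], [], 0
    for sentinel, (start, end) in zip(sentinels, spans):
        inputs.extend(tokens[cursor:start])  # copy text before the span
        inputs.append(sentinel)              # the span collapses to one sentinel
        target.append(sentinel)              # the target announces which span follows
        target.extend(tokens[start:end])
        cursor = end
    inputs.extend(tokens[cursor:])
    target.append(sentinels[len(spans)])     # closing sentinel marks the end of the target
    return " ".join(inputs), " ".join(target)


tokens = "The cat sat on the mat. The cat is fluffy.".split()
corrupted, target = corrupt_spans(tokens, spans=[(1, 3), (7, 8)])
print(corrupted)  # The <X> on the mat. The <Y> is fluffy.
print(target)     # <X> cat sat <Y> cat <Z>
```

Spans must be sorted and non-overlapping, and the sentinel tuple needs one more entry than there are spans.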
Scaling Laws
Concept
Neural scaling laws describe how model performance improves predictably with scale. Kaplan et al. (2020, OpenAI) and Hoffmann et al. (2022, DeepMind/Chinchilla) are the key papers.
Kaplan scaling laws (original):
- Loss scales as a power law in compute, parameters, and data independently
- For a fixed compute budget: larger model + less data tends to be better
- This led to GPT-3 (175B) being undertrained - about 300B tokens, far too few for the model size
Chinchilla scaling law (the correction): Hoffmann et al. showed the Kaplan law overweighted parameters. Their revised finding:
Optimal: tokens = 20 × parameters
- A 7B model should train on 140B tokens for "compute-optimal" training
- LLaMA-1's 65B model was trained on 1.4T tokens - compute-optimal would be about 1.3T (20 × 65B), so it is roughly Chinchilla-optimal
- LLaMA-2 and LLaMA-3 deliberately overtrain beyond Chinchilla optimal because a smaller model that is cheaper to serve is worth more than the extra training compute needed to reach the same quality
The inference-aware scaling insight (LLaMA philosophy): Chinchilla minimizes loss for a fixed training compute budget and ignores inference cost. In practice, a deployed model serves millions of requests, so lifetime cost is dominated by inference. Training a smaller model on more tokens costs more training compute for a given quality, but every request afterwards is cheaper. So "compute-optimal training" ≠ "deployment-optimal training."
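The arithmetic behind these numbers is simple enough to check directly. In the sketch below the 20 tokens-per-parameter ratio comes from the text; the compute estimate C ≈ 6·N·D is a widely used approximation introduced here as an assumption, and the function names are illustrative.

```python
TOKENS_PER_PARAM = 20  # Chinchilla rule of thumb quoted above


def chinchilla_tokens(params):
    """Compute-optimal token count for a model with `params` parameters."""
    return TOKENS_PER_PARAM * params


def training_flops(params, tokens):
    """Rough training compute, assuming C ~ 6 * N * D."""
    return 6 * params * tokens


def overtraining_factor(params, tokens):
    """How many times more tokens were used than the Chinchilla-optimal count."""
    return tokens / chinchilla_tokens(params)


print(f"7B optimal tokens:  {chinchilla_tokens(7e9):.2e}")    # 1.40e+11
print(f"65B optimal tokens: {chinchilla_tokens(65e9):.2e}")   # 1.30e+12
print(f"8B model on 15T tokens: {overtraining_factor(8e9, 15e12):.1f}x Chinchilla")
```

An 8B model trained on 15T tokens sits roughly two orders of magnitude past the compute-optimal point, which is the deliberate overtraining the paragraph describes.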
Key scaling law takeaways for interviews:
- Performance scales predictably as a power law in N (parameters), D (tokens), C (compute)
- Doubling parameters gives diminishing returns unless you also double training data
- Chinchilla: optimal token count ≈ 20× parameter count
- Modern LLMs (LLaMA-3, Gemma) train well beyond Chinchilla optimal for inference efficiency
Distributed Training
Concept
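A 70B model in BF16 requires 140 GB of VRAM just for weights (70B × 2 bytes) - more than an 80 GB A100 or H100 provides. Training is far worse: with mixed-precision Adam, weights, gradients, and optimizer states together take roughly 16 bytes per parameter, about 8× the BF16 weights alone, before counting activations (see GPU and Hardware). Distributed training splits work across many GPUs.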
Data Parallelism (DDP - DistributedDataParallel):
- Replicate the full model on each GPU
- Split the batch across GPUs (different data, same model)
- After each backward pass, average gradients across all GPUs (all-reduce)
- Simplest strategy; works when model fits on one GPU
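Why averaging gradients gives the same update as one large batch can be shown with a toy model. This sketch simulates two workers in a single process with a one-parameter linear model; `all_reduce_mean` is a stand-in for the real collective, and all names and data are made up for illustration.

```python
def mse_grad(w, batch):
    """Gradient of mean((w*x - y)^2) with respect to w."""
    return sum(2 * (w * x - y) * x for x, y in batch) / len(batch)


def all_reduce_mean(values):
    """Stand-in for the all-reduce collective: every worker receives the mean."""
    mean = sum(values) / len(values)
    return [mean] * len(values)


data = [(1.0, 2.0), (2.0, 4.1), (3.0, 5.9), (4.0, 8.2)]
w = 0.5

shards = [data[0:2], data[2:4]]                 # one equal-sized shard per simulated GPU
local_grads = [mse_grad(w, shard) for shard in shards]
synced = all_reduce_mean(local_grads)           # each replica now holds the same gradient

print("per-worker gradients:", local_grads)
print("after all-reduce:   ", synced[0])
print("full-batch gradient:", mse_grad(w, data))
```

The equality holds because the shards are the same size; with uneven shards the average of local means would need weighting by shard size.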
Tensor Parallelism (Megatron-LM style):
- Split individual weight matrices across GPUs
- Attention heads split across GPUs: head 1–8 on GPU 1, head 9–16 on GPU 2, etc.
- FFN layers split across GPUs: first half of d_ffn on GPU 1, second half on GPU 2
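- Requires all-reduce operations inside every transformer layer (after the attention block and after the FFN, in both forward and backward passes) - high communication overhead, so it is usually kept within a node with fast interconnect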
- Essential for models too large to fit on one GPU
Pipeline Parallelism:
- Split layers (not weights) across GPUs: layers 1–16 on GPU 1, layers 17–32 on GPU 2
- Micro-batching: split the batch into micro-batches so GPUs overlap computation ("bubble" reduction)
- Communication: only activations at layer boundaries cross GPU - less bandwidth than tensor parallelism
- Bubble overhead: GPUs still idle when waiting for previous stage - mitigated by micro-batching
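How much micro-batching helps can be estimated with a standard back-of-the-envelope formula. For a GPipe-style schedule with p stages and m micro-batches, and assuming every stage takes the same time per micro-batch, the idle fraction is (p - 1) / (m + p - 1). That formula and the function name are assumptions brought in for this sketch, not something stated in the text above.

```python
def bubble_fraction(num_stages, num_microbatches):
    """Idle share of a GPipe-style schedule: (p - 1) / (m + p - 1)."""
    return (num_stages - 1) / (num_microbatches + num_stages - 1)


for m in (1, 4, 16, 64):
    print(f"4 stages, {m:>2} micro-batches: {bubble_fraction(4, m):.1%} idle")
```

With a single micro-batch, three of four GPUs are idle at any moment; raising the micro-batch count shrinks the bubble but never removes it, and it increases the number of activations held in flight.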
ZeRO (Zero Redundancy Optimizer) - covered in detail in GPU and Hardware:
- Shards optimizer states, gradients, and parameters across GPUs
- Eliminates redundant copies present in pure DDP
- ZeRO-3: full parameter sharding - enables training models much larger than per-GPU memory
Practical training setup for 7B model:
- 8× A100-80GB: enough for BF16 training with ZeRO-2
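- Gradient checkpointing: trade roughly 30% extra compute for a several-fold (often quoted as ~5×) reduction in activation memory; the exact saving depends on how many layers are checkpointed
- Mixed precision (BF16): halves the memory for weights and activations vs FP32; BF16 keeps FP32's exponent range, so training is usually stable without loss scaling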
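A quick estimate shows why ZeRO-2 on eight 80 GB GPUs is comfortable for a 7B model. The sketch assumes the usual mixed-precision Adam accounting of 16 bytes per parameter (2 for BF16 weights, 2 for BF16 gradients, 12 for FP32 master weights plus the two Adam moments) and ignores activations, buffers, and fragmentation; the function name is illustrative.

```python
GB = 1e9


def per_gpu_state_gb(params, num_gpus, zero_stage):
    """Weights + gradients + optimizer state per GPU, excluding activations."""
    weights, grads, optim = 2 * params, 2 * params, 12 * params  # bytes
    if zero_stage >= 1:
        optim /= num_gpus    # ZeRO-1 shards optimizer states
    if zero_stage >= 2:
        grads /= num_gpus    # ZeRO-2 also shards gradients
    if zero_stage >= 3:
        weights /= num_gpus  # ZeRO-3 also shards parameters
    return (weights + grads + optim) / GB


for stage in range(4):
    print(f"ZeRO-{stage}: {per_gpu_state_gb(7e9, 8, stage):.2f} GB per GPU")
```

Under these assumptions, plain DDP (stage 0) needs about 112 GB per GPU and does not fit, while ZeRO-2 needs about 26 GB and leaves the rest of the 80 GB for activations.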
Code
# Minimal CLM training loop (conceptual)
import torch
from transformers import AutoTokenizer, AutoModelForCausalLM
from torch.optim import AdamW
from torch.optim.lr_scheduler import CosineAnnealingLR

model_name = "gpt2"  # small, runnable locally
num_steps = 10
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name)
# GPT-2 has no pad token; reuse EOS and mask padding out of the loss below
tokenizer.pad_token = tokenizer.eos_token

# Sample training data
texts = [
    "The transformer architecture revolutionized NLP.",
    "Scaling laws predict model performance from compute.",
]

def tokenize(texts, max_length=64):
    return tokenizer(
        texts,
        truncation=True,
        padding="max_length",
        max_length=max_length,
        return_tensors="pt",
    )

optimizer = AdamW(model.parameters(), lr=1e-4, weight_decay=0.01)
# Decay over the whole (short) run so the schedule actually reaches its minimum
scheduler = CosineAnnealingLR(optimizer, T_max=num_steps)

model.train()
for step in range(num_steps):
    batch = tokenize(texts)
    input_ids = batch["input_ids"]
    attention_mask = batch["attention_mask"]

    # CLM predicts the next token. Pass labels equal to input_ids:
    # the Hugging Face model shifts them by one position internally.
    labels = input_ids.clone()
    labels[attention_mask == 0] = -100  # ignore padding in loss

    outputs = model(input_ids=input_ids, attention_mask=attention_mask, labels=labels)
    loss = outputs.loss

    optimizer.zero_grad()
    loss.backward()
    torch.nn.utils.clip_grad_norm_(model.parameters(), 1.0)  # gradient clipping
    optimizer.step()
    scheduler.step()
    print(f"Step {step}: loss={loss.item():.4f}")

# BPE tokenizer training demo (Hugging Face tokenizers library, not SentencePiece)
from tokenizers import Tokenizer
from tokenizers.models import BPE
from tokenizers.trainers import BpeTrainer
from tokenizers.pre_tokenizers import Whitespace

tokenizer_bpe = Tokenizer(BPE(unk_token="[UNK]"))
tokenizer_bpe.pre_tokenizer = Whitespace()
trainer = BpeTrainer(vocab_size=1000, special_tokens=["[UNK]", "[CLS]", "[PAD]", "[MASK]"])
# Training is a method on the Tokenizer, which takes the trainer as an argument:
# tokenizer_bpe.train(files=["corpus.txt"], trainer=trainer)
print("BPE tokenizer configured (needs corpus file to actually train)")
Learning Rate and Training Stability
Concept
Warmup + cosine decay: The standard schedule for transformer training.
Warmup (0 → max_lr over first T_warmup steps):
lr = max_lr × (step / T_warmup)
Cosine decay (T_warmup → T_total):
lr = min_lr + 0.5 × (max_lr - min_lr) × (1 + cos(π × t / T_decay))
where t = step - T_warmup and T_decay = T_total - T_warmup
Why warmup? At the start of training the weights are random, gradients are large and noisy, and Adam's moment estimates are built from only a few steps - high learning rates cause instability. Warmup gradually increases lr while these estimates stabilize.
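The schedule above translates directly into a small function. This is a sketch under the assumption that the cosine phase runs from the end of warmup to the final step; the parameter names mirror the formulas and the numeric values are arbitrary examples.

```python
import math


def lr_at(step, max_lr, min_lr, warmup_steps, total_steps):
    """Linear warmup to max_lr, then cosine decay to min_lr."""
    if step < warmup_steps:
        return max_lr * step / warmup_steps
    t = step - warmup_steps                       # steps into the decay phase
    decay_steps = total_steps - warmup_steps
    progress = min(t / decay_steps, 1.0)          # clamp if training runs long
    return min_lr + 0.5 * (max_lr - min_lr) * (1 + math.cos(math.pi * progress))


for step in (0, 500, 1000, 5500, 10000):
    print(step, f"{lr_at(step, 3e-4, 3e-5, 1000, 10000):.2e}")
```

The two branches meet at `step == warmup_steps`, where both give `max_lr`, so the schedule has no discontinuity.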
Gradient clipping: Clips the global gradient norm to a maximum value (typically 1.0):
if ||g|| > max_norm:
g = g × max_norm / ||g||
Prevents gradient explosions from occasional bad batches. Essential for stable pretraining.
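A framework-free sketch of the same rule, assuming gradients are given as a list of flat lists (one per parameter tensor). It mirrors the behavior of global-norm clipping in spirit; the function name and return shape are choices made for this example.

```python
import math


def clip_grad_norm(grads, max_norm=1.0):
    """Scale all gradients by one factor so their global L2 norm is at most max_norm."""
    total_norm = math.sqrt(sum(g * g for tensor in grads for g in tensor))
    if total_norm > max_norm:
        scale = max_norm / total_norm
        grads = [[g * scale for g in tensor] for tensor in grads]
    return grads, total_norm


clipped, norm_before = clip_grad_norm([[3.0], [4.0]], max_norm=1.0)
print(norm_before)  # 5.0
print(clipped)      # direction preserved, norm scaled down to 1.0
```

Because one scale factor is applied to every tensor, the update direction is unchanged; only its length is capped.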
Gradient checkpointing: Instead of storing all activations from the forward pass (needed for backpropagation), keep only a subset (the checkpoints) and recompute the rest from them during the backward pass. Reduces activation memory several-fold (~5× is a commonly quoted figure) at the cost of roughly 30% more compute. Standard for training large models.
Study Notes
Must-know for interviews:
- Pretraining = learn language from massive unlabeled data; fine-tuning = specialize for tasks
- Data quality matters enormously: deduplication, quality filtering, and PII removal are critical
- BPE builds subword vocabulary by merging frequent adjacent pairs iteratively
- CLM loss: cross-entropy on next-token prediction; label shifting handled by the framework
- Chinchilla: compute-optimal training ≈ 20 tokens per parameter
- Modern LLMs deliberately overtrain beyond Chinchilla because smaller models are cheaper to serve
- Distributed training: data parallelism (batch split), tensor parallelism (weight split), pipeline parallelism (layer split)
Quick recall Q&A:
- What is the Chinchilla scaling law? Optimal token count ≈ 20× parameter count for compute-efficient training.
- Why does LLaMA-3 overtrain beyond Chinchilla? Inference cost dominates training cost over millions of requests - a smaller, well-trained model costs less to serve.
- What is gradient checkpointing? Trading compute for memory by discarding and recomputing activations during backward pass.
- Why is data deduplication critical? Duplicate data causes memorization of specific text rather than learning generalizable patterns; also reduces privacy risk.
- What is the CLM training objective? Predict the next token given all preceding tokens; minimize cross-entropy averaged over all positions.