Training and Pretraining
Pre-Training
Concept
Pretraining is the large-scale phase where a model learns general language understanding from massive unlabeled text corpora. It requires enormous compute and data but is done only once — the resulting base model is then fine-tuned for specific tasks.
The three stages of an LLM's life:
1. Pretraining → Base model (knows language, world knowledge, no instruction following)
2. Fine-tuning → Instruction-tuned model (follows instructions, behaves as assistant)
3. Alignment → RLHF / DPO (safe, helpful, honest — reduced harmful outputs)
Scale of pretraining:
- LLaMA-3 8B: trained on ~15 trillion tokens (15T), about 1.3M GPU-hours on H100 according to Meta's model card
- GPT-4: estimated 10T+ tokens, undisclosed compute
- Rule of thumb: 1M GPU-hours on modern H100s costs ~$1M–$3M at cloud prices of roughly $1–$3 per GPU-hour
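The cost rule of thumb is plain multiplication. A minimal sketch, assuming a flat hourly price per GPU; the function name and the two price points are illustrative assumptions, not quoted cloud rates:

```python
def training_cost_usd(gpu_hours: float, usd_per_gpu_hour: float) -> float:
    """Cloud cost of a run: GPU-hours multiplied by the hourly price per GPU."""
    return gpu_hours * usd_per_gpu_hour

# Hypothetical price points; real rates vary by provider and commitment.
low = training_cost_usd(1_000_000, 1.0)
high = training_cost_usd(1_000_000, 3.0)
print(f"1M GPU-hours: ${low:,.0f} to ${high:,.0f}")
```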
Dataset Curation
Concept
The quality of pretraining data is at least as important as model architecture. "Garbage in, garbage out" applies at trillion-token scale.
Data sources:
- Common Crawl: periodic scrapes of the public web, petabytes of raw text; requires aggressive filtering
- Books: Project Gutenberg, BooksCorpus, Books3; high-quality, diverse language
- Wikipedia: clean, factual, structured; high signal-to-noise
- Code: GitHub; improves reasoning capabilities, not just coding
- Academic papers: ArXiv; improves scientific understanding
- Curated datasets: RefinedWeb, RedPajama, SlimPajama, Dolma
Data processing pipeline:
Raw web text
↓
URL/domain filtering (block adult content, spam, known low-quality domains)
↓
Language detection (keep target languages)
↓
Exact deduplication (exact hash of normalized text): removes copy-pasted content
↓
Near-deduplication (MinHash, SimHash, n-gram overlap): removes near-duplicates
↓
Quality filtering:
- Perplexity filtering (low-perplexity text from a reference model → good quality)
- Heuristic rules (min/max token counts, symbol ratio, etc.)
- Classifier-based quality scoring
↓
PII removal (personal email, phone numbers, SSNs)
↓
Final tokenization + storage
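Two of the pipeline stages, heuristic quality rules and PII removal, can be sketched in a few lines. This is a toy illustration under stated assumptions: the thresholds, the regular expressions, and the placeholder strings `<EMAIL>` and `<SSN>` are hypothetical choices, and production pipelines use far more thorough rules.

```python
import re

# Deliberately simple patterns; real PII detection needs broader coverage.
EMAIL_RE = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}")
SSN_RE = re.compile(r"\b\d{3}-\d{2}-\d{4}\b")

def passes_heuristics(text, min_words=5, max_words=10_000, max_symbol_ratio=0.3):
    """Keep documents with a sane word count and a low share of symbol characters."""
    words = text.split()
    if not (min_words <= len(words) <= max_words):
        return False
    symbols = sum(1 for ch in text if not (ch.isalnum() or ch.isspace()))
    return symbols / max(len(text), 1) <= max_symbol_ratio

def scrub_pii(text):
    """Replace e-mail addresses and US SSN-shaped numbers with placeholders."""
    text = EMAIL_RE.sub("<EMAIL>", text)
    return SSN_RE.sub("<SSN>", text)

docs = [
    "Contact me at jane.doe@example.com about the transformer paper draft.",
    "$$$ !!! ### @@@ %%% ^^^ &&& ***",  # fails the symbol-ratio rule
    "too short",                        # fails the minimum word count
]
cleaned = [scrub_pii(d) for d in docs if passes_heuristics(d)]
print(cleaned)
```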
Why deduplication matters: Training on duplicate data memorizes specific text rather than learning generalizable patterns. Deduplication also reduces privacy risk (memorized PII).
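A minimal sketch of both deduplication passes. Exact deduplication hashes normalized text; the near-duplicate pass here computes exact Jaccard similarity over word 3-gram shingles, which is the quantity MinHash approximates at scale. The 0.5 threshold and the helper names are assumptions for illustration.

```python
import hashlib

def normalize(text):
    """Lowercase and collapse whitespace so trivial variants hash identically."""
    return " ".join(text.lower().split())

def exact_dedup(docs):
    """Drop documents whose normalized text has already been seen."""
    seen, kept = set(), []
    for doc in docs:
        digest = hashlib.sha256(normalize(doc).encode("utf-8")).hexdigest()
        if digest not in seen:
            seen.add(digest)
            kept.append(doc)
    return kept

def shingles(text, n=3):
    """Set of word n-grams; short texts yield a single (shorter) shingle."""
    words = normalize(text).split()
    return {tuple(words[i:i + n]) for i in range(max(len(words) - n + 1, 1))}

def jaccard(a, b):
    sa, sb = shingles(a), shingles(b)
    return len(sa & sb) / len(sa | sb)

def near_dedup(docs, threshold=0.5):
    """Keep a document only if it is dissimilar to everything kept so far (O(n^2) toy)."""
    kept = []
    for doc in docs:
        if all(jaccard(doc, k) < threshold for k in kept):
            kept.append(doc)
    return kept

docs = [
    "The quick brown fox jumps over the lazy dog near the river bank",
    "the quick  brown fox jumps over the lazy dog near the river bank",       # exact duplicate after normalization
    "The quick brown fox jumps over the lazy dog near the river bank today",  # near-duplicate
    "Scaling laws relate loss to parameters, data, and compute",
]
unique = near_dedup(exact_dedup(docs))
print(len(unique))
```

The pairwise comparison is quadratic in the number of documents, which is why web-scale pipelines replace it with MinHash signatures and locality-sensitive hashing.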
Data mixture ratios (approximate, from public information):
LLaMA-3 training mix (as reported in Meta's Llama 3 paper):
~50% general knowledge (filtered web and similar)
~25% mathematical and reasoning data
~17% code
~8% multilingual
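Mixture ratios are typically applied by sampling the source of each training document in proportion to its weight. A small sketch with made-up weights (the `mix` dictionary is illustrative, not any model's actual recipe):

```python
import random
from collections import Counter

def sample_sources(weights, n, seed=0):
    """Draw n source names with probability proportional to their mixture weight."""
    rng = random.Random(seed)
    names = list(weights)
    return rng.choices(names, weights=[weights[k] for k in names], k=n)

# Hypothetical mixture for illustration only.
mix = {"web": 0.5, "code": 0.2, "other": 0.3}
counts = Counter(sample_sources(mix, 10_000))
shares = {name: counts[name] / 10_000 for name in mix}
print(shares)
```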
Tokenization
Concept
Before training, all text is converted to token sequences using a fixed vocabulary tokenizer built from the training data.
Byte-Pair Encoding (BPE) — step by step:
Initial vocabulary: all individual bytes (256 symbols)
Training procedure:
1. Split all training text into characters/bytes
2. Count frequency of all adjacent symbol pairs
3. Merge the most frequent pair into a new symbol
4. Repeat until vocabulary reaches target size (e.g., 32K)
Example:
Text: "low lower lowest"
Initial: [l,o,w] [l,o,w,e,r] [l,o,w,e,s,t]
Most frequent pair: (l,o), 3 occurrences, tied with (o,w); taking the first → merge to (lo)
→ [lo,w] [lo,w,e,r] [lo,w,e,s,t]
Most frequent pair: (lo,w) → merge to (low)
→ [low] [low,e,r] [low,e,s,t]
...continues until target vocab size
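The merge loop above can be sketched directly. This is a character-level toy that operates on a list of words and breaks ties by first occurrence; production tokenizers work on bytes, weight pairs by word frequency, and use faster data structures.

```python
from collections import Counter

def train_bpe(words, num_merges):
    """Learn BPE merges from a list of words; returns (merges, segmented corpus)."""
    corpus = [list(w) for w in words]
    merges = []
    for _ in range(num_merges):
        # Count every adjacent symbol pair across the corpus.
        pairs = Counter()
        for symbols in corpus:
            for a, b in zip(symbols, symbols[1:]):
                pairs[(a, b)] += 1
        if not pairs:
            break
        best = max(pairs, key=pairs.get)  # ties resolve to the first pair seen
        merges.append(best)
        merged = best[0] + best[1]
        # Rewrite the corpus, replacing each occurrence of the best pair.
        new_corpus = []
        for symbols in corpus:
            out, i = [], 0
            while i < len(symbols):
                if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == best:
                    out.append(merged)
                    i += 2
                else:
                    out.append(symbols[i])
                    i += 1
            new_corpus.append(out)
        corpus = new_corpus
    return merges, corpus

merges, corpus = train_bpe("low lower lowest".split(), num_merges=2)
print(merges)  # [('l', 'o'), ('lo', 'w')]
print(corpus)  # [['low'], ['low', 'e', 'r'], ['low', 'e', 's', 't']]
```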
Tokenization algorithms — full comparison:
| Algorithm | Description | Characteristics | Used by | Pros | Cons |
|---|---|---|---|---|---|
| Whitespace | Splits on spaces | Fast, naive | Early NLP tools | Very simple | Poor for subword languages |
| Character-level | Each char is a token | Fine-grained | Some toy/code models | Robust to OOV | Long sequences |
| Word-level | Splits on words | Basic NLP | NLTK, SpaCy | Easy to read | Poor generalization |
| BPE (Byte-Pair Encoding) | Merges most frequent subword pairs | Greedy, deterministic | GPT-2, RoBERTa, CodeBERT | Efficient, fast | Merges are purely frequency-based and fixed after training |
| WordPiece | Probabilistic merges | Likelihood-based | BERT, DistilBERT | Better OOV handling | Slightly slower |
| Unigram Language Model | Chooses most likely subword sequence | Probabilistic | ALBERT, T5, Gemini | Flexible, language-agnostic | More complex |
| SentencePiece (BPE/Unigram) | Library that trains BPE or Unigram directly on raw text, encoding spaces as ▁ | Multilingual models | T5, mT5, Gemini | No pre-tokenization needed | May be less readable |
| Byte-Level BPE | Extends BPE to raw bytes | Includes spaces, emojis | GPT-2, GPT-Neo, GPT-J | No UNK tokens | Tokens may be unreadable |
| SentencePiece (byte fallback) | Unigram or BPE with out-of-vocabulary characters split into UTF-8 bytes | Language-agnostic | PaLM, Gemini | Works on all scripts | Needs detokenization mapping |
| tiktoken (OpenAI) | OpenAI's byte-level BPE library | Special tokens, efficient | GPT-3, GPT-4 | Very fast | Encodings are OpenAI-specific |
Key practical notes:
- SentencePiece (LLaMA 1/2, Gemma): Works on raw text without pre-tokenization, treating whitespace as an ordinary symbol (▁), so all languages and code are handled uniformly; byte fallback covers characters outside the vocabulary.
- tiktoken (GPT-4): Uses cl100k_base with a ~100K vocabulary; a larger vocabulary reduces token counts, improving efficiency for English and code. LLaMA 3 also moved to a tiktoken-style BPE with a ~128K vocabulary.
- Multilingual/emoji-rich datasets: Prefer byte-level tokenization (or byte fallback) to avoid OOV.
- Training vs inference: The tokenizer must match exactly; a mismatch degrades output silently instead of raising an error.
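Why byte-level schemes cannot hit an out-of-vocabulary symbol: every string, whatever its script, reduces to UTF-8 bytes, and all 256 byte values are in the base vocabulary. A tiny sketch of that base layer only (no merges applied):

```python
def byte_tokens(text):
    """Base layer of byte-level tokenization: one id (0..255) per UTF-8 byte."""
    return list(text.encode("utf-8"))

text = "h\u00e9llo \U0001F642"  # accented letter plus an emoji
ids = byte_tokens(text)
roundtrip = bytes(ids).decode("utf-8")
print(len(text), "characters ->", len(ids), "byte tokens")
```

The cost is longer sequences for non-ASCII text, which learned merges then shorten.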
Tokenizer implementations across provider ecosystems:
| Provider | Tokenizer/Model | Algorithm | Notes |
|---|---|---|---|
| Hugging Face | BertTokenizer | WordPiece | For BERT and variants |
| Hugging Face | GPT2Tokenizer | Byte-Level BPE | GPT-2, GPT-Neo |
| Hugging Face | RobertaTokenizer | Byte-Level BPE | Same byte-level scheme as GPT-2 |
| Hugging Face | T5Tokenizer | SentencePiece (Unigram) | Used in T5, mT5 |
| Hugging Face | LlamaTokenizer | SentencePiece (BPE) | Used in LLaMA 1/2; LLaMA 3 switched to a tiktoken-style BPE |
| Hugging Face | BloomTokenizerFast | Byte-Level BPE | For BLOOM |
| GCP / Vertex AI | Gemini 1.5 | SentencePiece (Unigram, Byte-Level) | Google prefers SentencePiece |
| GCP / Vertex AI | PaLM / T5 / mT5 | SentencePiece (Unigram) | Internal tokenizer tools |
| AWS Bedrock | Claude (Anthropic) | BPE / Custom variant | Proprietary; details not fully published |
| AWS Bedrock | Mistral / LLaMA | SentencePiece (BPE) | HF model hosted via Bedrock |
| OpenAI | GPT-3, GPT-4 | tiktoken (custom BPE) | Byte-level BPE in the GPT-2 lineage |
| Meta | LLaMA 1/2 | SentencePiece (BPE) | Open, multilingual; LLaMA 3 uses a tiktoken-style BPE (~128K vocabulary) |
| Anthropic | Claude 1/2/3 | BPE-like (proprietary) | Details not fully published |
Popular tokenization libraries:
| Library | Algorithms supported | Notes |
|---|---|---|
| transformers (HuggingFace) | BPE, WordPiece, Unigram, SentencePiece | Easy integration |
| tokenizers (HuggingFace) | Fast Rust-based tokenizers | Train your own |
| sentencepiece | BPE, Unigram | Used by Google models |
| tiktoken | GPT BPE | Used in OpenAI APIs |
Training Objectives
Concept
Causal Language Modeling (CLM) — Decoder-only:
Input: "The cat sat on the mat"
Targets: "cat sat on the mat <EOS>"
Loss: CrossEntropy(logits, targets) averaged over all non-padding positions
The model sees tokens 0..t-1 and predicts token t. This is why causal masking is applied during training — the model must predict each position without seeing future tokens. The loss is the average cross-entropy over all predicted positions in the sequence.
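The shift-by-one and the averaged loss can be written out without a framework. A pure-Python sketch: the toy vocabulary ids and the uniform logits are assumptions chosen so the result is easy to check by hand (a uniform prediction over 7 tokens gives a loss of ln 7).

```python
import math

def clm_inputs_and_targets(token_ids, eos_id):
    """Targets are the inputs shifted left by one, with EOS as the final target."""
    return token_ids, token_ids[1:] + [eos_id]

def cross_entropy(logits, target):
    """-log softmax(logits)[target], computed with the log-sum-exp trick."""
    m = max(logits)
    log_z = m + math.log(sum(math.exp(x - m) for x in logits))
    return log_z - logits[target]

def clm_loss(all_logits, targets, ignore_index=-100):
    """Mean cross-entropy over positions whose target is not ignore_index (padding)."""
    losses = [cross_entropy(l, t) for l, t in zip(all_logits, targets) if t != ignore_index]
    return sum(losses) / len(losses)

# Toy vocabulary: ids 0..6 stand for The, cat, sat, on, the, mat, <EOS>
inputs, targets = clm_inputs_and_targets([0, 1, 2, 3, 4, 5], eos_id=6)
uniform_logits = [[0.0] * 7 for _ in inputs]  # an untrained, uniform predictor
loss = clm_loss(uniform_logits, targets)
print(targets, round(loss, 4))
```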
Masked Language Modeling (MLM) — Encoder-only (BERT):
Input: "The [MASK] sat on the [MASK]"
Targets: "cat" and "mat"
Loss: CrossEntropy only on masked positions
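15% of tokens are selected for prediction. Of those, 80% are replaced with [MASK], 10% with a random token, and 10% are kept unchanged. The mix reduces the mismatch between pretraining and fine-tuning, where [MASK] never appears, and pushes the model to build a useful representation for every token rather than only for [MASK] positions.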
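A sketch of the masking procedure on a list of string tokens. The function name, the `None` label for unselected positions, and the toy vocabulary are illustrative assumptions; real implementations work on token ids and skip special tokens.

```python
import random

def mlm_mask(tokens, vocab, mask_token="[MASK]", select_prob=0.15, seed=0):
    """Select ~15% of positions; of those 80% -> [MASK], 10% -> random, 10% -> unchanged."""
    rng = random.Random(seed)
    inputs, labels = [], []
    for tok in tokens:
        if rng.random() < select_prob:
            labels.append(tok)            # loss is computed at this position
            roll = rng.random()
            if roll < 0.8:
                inputs.append(mask_token)
            elif roll < 0.9:
                inputs.append(rng.choice(vocab))
            else:
                inputs.append(tok)
        else:
            inputs.append(tok)
            labels.append(None)           # position ignored by the loss
    return inputs, labels

vocab = ["the", "cat", "sat", "on", "mat"]
tokens = ["the", "cat", "sat", "on", "the", "mat"] * 2000
inputs, labels = mlm_mask(tokens, vocab)
selected = [i for i, lab in enumerate(labels) if lab is not None]
masked = sum(1 for i in selected if inputs[i] == "[MASK]")
print(len(selected) / len(tokens), masked / len(selected))
```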
Span Corruption — Encoder-Decoder (T5):
Input: "The <X> on the mat. The <Y> is fluffy." (spans replaced by sentinels)
Target: "<X> cat sat <Y> cat"
Loss: CrossEntropy on the decoder's output
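The sentinel construction can be sketched as a small function. It follows the example above, with `<X>`, `<Y>`, `<Z>` as stand-in sentinel names and spans given as hypothetical (start, end) word indices; T5 itself samples spans randomly, names its sentinels `<extra_id_0>`, `<extra_id_1>`, and so on, and closes the target with a final sentinel, which this sketch omits.

```python
def span_corrupt(tokens, spans, sentinels=("<X>", "<Y>", "<Z>")):
    """Replace each (start, end) span with a sentinel; the target lists each sentinel
    followed by the tokens it replaced."""
    inputs, target, cursor = [], [], 0
    for sentinel, (start, end) in zip(sentinels, sorted(spans)):
        inputs.extend(tokens[cursor:start])  # untouched text before the span
        inputs.append(sentinel)
        target.append(sentinel)
        target.extend(tokens[start:end])     # the dropped span goes to the target
        cursor = end
    inputs.extend(tokens[cursor:])
    return inputs, target

tokens = "The cat sat on the mat . The cat is fluffy .".split()
inputs, target = span_corrupt(tokens, spans=[(1, 3), (8, 9)])
print(" ".join(inputs))  # The <X> on the mat . The <Y> is fluffy .
print(" ".join(target))  # <X> cat sat <Y> cat
```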
Scaling Laws
Concept
Neural scaling laws describe how model performance improves predictably with scale. Kaplan et al. (2020, OpenAI) and Hoffmann et al. (2022, DeepMind/Chinchilla) are the key papers.
Kaplan scaling laws (original):
- Loss scales as a power law in compute, parameters, and data independently
- For a fixed compute budget: larger model + less data tends to be better
- This led to GPT-3 (175B) being undertrained: not enough tokens for the model size
Chinchilla scaling law (the correction): Hoffmann et al. showed the Kaplan law overweighted parameters. Their revised finding: for a fixed compute budget, parameters and training tokens should grow in roughly equal proportion, which works out to about 20 tokens per parameter.
- A 7B model should train on ~140B tokens for "compute-optimal" training
- LLaMA-1's 65B model was trained on 1.4T tokens; compute-optimal would be ~1.3T, so roughly right
- LLaMA-2 and LLaMA-3 deliberately overtrain beyond Chinchilla optimal because a smaller model that is cheaper to serve is worth more than the training compute saved by stopping at the optimum
The inference-aware scaling insight (LLaMA philosophy): Chinchilla minimizes training loss for a fixed training compute budget and ignores inference. In practice, total cost also includes serving millions of requests. Training a smaller model on more tokens spends extra training compute to reach a given quality, but every request is then cheaper, and that saving is multiplied across the whole deployment. So "compute-optimal training" ≠ "deployment-optimal training."
Key scaling law takeaways for interviews:
- Performance scales predictably as a power law in N (parameters), D (tokens), C (compute)
- Doubling parameters gives diminishing returns unless you also double training data
- Chinchilla: optimal token count ≈ 20× parameter count
- Modern LLMs (LLaMA-3, Gemma) train well beyond Chinchilla optimal for inference efficiency
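The 20-tokens-per-parameter rule makes the figures in this section easy to check. A small sketch; the function names are illustrative, and the ratio of 20 is the approximation quoted above rather than an exact constant:

```python
def chinchilla_optimal_tokens(n_params, tokens_per_param=20):
    """Approximate compute-optimal token count for a given parameter count."""
    return n_params * tokens_per_param

def overtraining_factor(n_params, n_tokens, tokens_per_param=20):
    """How many times more tokens a model saw than the Chinchilla-optimal count."""
    return n_tokens / chinchilla_optimal_tokens(n_params, tokens_per_param)

print(chinchilla_optimal_tokens(7_000_000_000))    # 140B tokens for a 7B model
print(chinchilla_optimal_tokens(65_000_000_000))   # 1.3T tokens for a 65B model
print(overtraining_factor(8_000_000_000, 15_000_000_000_000))  # 8B model on 15T tokens
```

By this estimate an 8B model trained on 15T tokens sees roughly 94 times the compute-optimal token count, which is the deliberate overtraining described above.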
Distributed Training
Concept
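A 70B model in BF16 requires 140 GB of VRAM just for weights (70B × 2 bytes), more than an 80 GB A100 or H100 can hold. Training is far worse: with mixed-precision Adam, weights, gradients, and optimizer states together take roughly 16 bytes per parameter, about 8× the BF16 weights, before counting activations (see GPU and Hardware). Distributed training splits work across many GPUs.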
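The memory arithmetic can be reproduced in a few lines. This sketch assumes the common mixed-precision Adam accounting of 16 bytes per parameter (BF16 weights and gradients, plus FP32 master weights and two FP32 Adam moments) and ignores activations; the function names are illustrative.

```python
def weight_memory_gb(n_params, bytes_per_param=2):
    """Memory for the weights alone; 2 bytes per parameter in BF16."""
    return n_params * bytes_per_param / 1e9

def adam_training_memory_gb(n_params):
    """Model-state memory for mixed-precision Adam, excluding activations.

    Assumed breakdown per parameter:
      2 bytes BF16 weights + 2 bytes BF16 gradients
      + 4 bytes FP32 master weights + 4 + 4 bytes FP32 Adam moments = 16 bytes
    """
    return n_params * 16 / 1e9

print(weight_memory_gb(70e9))         # 140.0 GB for inference weights
print(adam_training_memory_gb(70e9))  # 1120.0 GB of model state for training
```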
Data Parallelism (DDP, DistributedDataParallel):
- Replicate the full model on each GPU
- Split the batch across GPUs (different data, same model)
- After each backward pass, average gradients across all GPUs (all-reduce)
- Simplest strategy; works when the model fits on one GPU
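Why averaging gradients works: with equal-sized shards, the mean of the per-GPU gradients equals the gradient of the full batch. A single-process simulation on a one-parameter linear model (the model, data, and `all_reduce_mean` helper are stand-ins, not a real collective):

```python
def grad_mse(w, xs, ys):
    """d/dw of mean((w*x - y)^2) over the batch."""
    return sum(2 * (w * x - y) * x for x, y in zip(xs, ys)) / len(xs)

def all_reduce_mean(values):
    """Stand-in for the all-reduce that averages gradients across GPUs."""
    return sum(values) / len(values)

xs, ys, w = [1.0, 2.0, 3.0, 4.0], [2.0, 4.0, 6.0, 8.0], 0.5

full = grad_mse(w, xs, ys)                        # single-GPU, full batch
shards = [(xs[:2], ys[:2]), (xs[2:], ys[2:])]     # two "GPUs", half the batch each
local = [grad_mse(w, sx, sy) for sx, sy in shards]
synced = all_reduce_mean(local)                   # what every replica applies
print(full, local, synced)
```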
Tensor Parallelism (Megatron-LM style):
- Split individual weight matrices across GPUs
- Attention heads split across GPUs: heads 1–8 on GPU 1, heads 9–16 on GPU 2, etc.
- FFN layers split across GPUs: first half of d_ffn on GPU 1, second half on GPU 2
- Requires all-reduce inside every transformer layer (after the attention block and after the FFN block), so communication overhead is high; usually kept within a node on a fast interconnect
- Essential for models whose layers are too large to fit on one GPU
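The column split of a weight matrix can be checked on a toy example: each shard computes its slice of the output independently, and concatenating the slices reproduces the full product. Pure-Python matrices stand in for GPU tensors, and `split_columns` is an illustrative helper.

```python
def matmul(a, b):
    """Plain matrix product of two lists-of-rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]

def split_columns(m, parts):
    """Split a matrix into `parts` equal column blocks (one per GPU)."""
    width = len(m[0]) // parts
    return [[row[i * width:(i + 1) * width] for row in m] for i in range(parts)]

X = [[1, 2], [3, 4]]                 # activations, replicated on every GPU
W = [[1, 0, 2, 1], [0, 1, 1, 3]]     # weight matrix, too "large" for one GPU

full = matmul(X, W)
shards = split_columns(W, parts=2)                  # each GPU holds half the columns
partials = [matmul(X, w) for w in shards]           # computed independently
gathered = [left + right for left, right in zip(*partials)]  # concatenate columns
print(full, gathered)
```

Splitting by rows instead produces partial sums that must be added across GPUs, which is where the all-reduce comes from.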
Pipeline Parallelism:
- Split layers (not weights) across GPUs: layers 1–16 on GPU 1, layers 17–32 on GPU 2
- Micro-batching: split the batch into micro-batches so GPUs overlap computation ("bubble" reduction)
- Communication: only activations at layer boundaries cross GPUs, so it needs less bandwidth than tensor parallelism
- Bubble overhead: GPUs still idle while waiting for the previous stage; mitigated by micro-batching
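How much micro-batching helps can be estimated with the standard idle-time formula for a GPipe-style schedule, assumed here: with p stages and m micro-batches of equal cost, the bubble fraction is (p - 1) / (m + p - 1).

```python
def bubble_fraction(num_stages, num_microbatches):
    """Idle fraction of a GPipe-style pipeline with equal-cost stages and micro-batches."""
    return (num_stages - 1) / (num_microbatches + num_stages - 1)

for m in (1, 4, 16, 64):
    print(f"4 stages, {m:>2} micro-batches: {bubble_fraction(4, m):.1%} idle")
```

With a single micro-batch a 4-stage pipeline idles 75% of the time; more micro-batches shrink the bubble but never remove it entirely.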
ZeRO (Zero Redundancy Optimizer), covered in detail in GPU and Hardware:
- Shards optimizer states, gradients, and parameters across GPUs
- Eliminates redundant copies present in pure DDP
- ZeRO-3: full parameter sharding; enables training models much larger than per-GPU memory
Practical training setup for 7B model:
- 8× A100-80GB: enough for BF16 training with ZeRO-2
- Gradient checkpointing: trade ~30% extra compute for up to ~5× less activation memory
- Mixed precision (BF16): halves weight and activation memory vs FP32; BF16 keeps FP32's exponent range, so training is usually stable without loss scaling
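A rough check of the 7B setup, assuming the usual 2 + 2 + 12 bytes per parameter for BF16 weights, BF16 gradients, and FP32 Adam state, with each ZeRO stage sharding one more of those across GPUs. Activations, buffers, and fragmentation are ignored, so real usage is higher.

```python
def zero_memory_gb(n_params, n_gpus, stage):
    """Per-GPU model-state memory (GB) under ZeRO stage 0 (plain DDP) to 3."""
    weights = 2 * n_params   # BF16 weights, bytes
    grads = 2 * n_params     # BF16 gradients, bytes
    optim = 12 * n_params    # FP32 master weights + two Adam moments, bytes
    if stage >= 1:
        optim /= n_gpus      # ZeRO-1 shards optimizer states
    if stage >= 2:
        grads /= n_gpus      # ZeRO-2 also shards gradients
    if stage >= 3:
        weights /= n_gpus    # ZeRO-3 also shards parameters
    return (weights + grads + optim) / 1e9

for stage in range(4):
    print(f"ZeRO-{stage}: {zero_memory_gb(7e9, 8, stage):.2f} GB per GPU")
```

Under these assumptions ZeRO-2 needs about 26 GB of model state per GPU for a 7B model on 8 GPUs, leaving most of an 80 GB card for activations.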
Code
# Minimal CLM training loop (conceptual)
import torch
from transformers import AutoTokenizer, AutoModelForCausalLM
from torch.optim import AdamW
from torch.optim.lr_scheduler import CosineAnnealingLR

model_name = "gpt2"  # small, runnable locally
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name)
tokenizer.pad_token = tokenizer.eos_token

# Sample training data
texts = [
    "The transformer architecture revolutionized NLP.",
    "Scaling laws predict model performance from compute.",
]

def tokenize(texts, max_length=64):
    return tokenizer(
        texts,
        truncation=True,
        padding="max_length",
        max_length=max_length,
        return_tensors="pt",
    )

optimizer = AdamW(model.parameters(), lr=1e-4, weight_decay=0.01)
scheduler = CosineAnnealingLR(optimizer, T_max=100)

model.train()
for step in range(10):
    batch = tokenize(texts)
    input_ids = batch["input_ids"]
    attention_mask = batch["attention_mask"]
    # Labels = input_ids shifted by 1 (CLM: predict next token)
    # HuggingFace handles the shift internally when labels == input_ids
    labels = input_ids.clone()
    labels[attention_mask == 0] = -100  # ignore padding in loss
    outputs = model(input_ids=input_ids, attention_mask=attention_mask, labels=labels)
    loss = outputs.loss
    optimizer.zero_grad()
    loss.backward()
    torch.nn.utils.clip_grad_norm_(model.parameters(), 1.0)  # gradient clipping
    optimizer.step()
    scheduler.step()
    print(f"Step {step}: loss={loss.item():.4f}")
# BPE tokenizer training demo (Hugging Face tokenizers library)
from tokenizers import Tokenizer
from tokenizers.models import BPE
from tokenizers.trainers import BpeTrainer
from tokenizers.pre_tokenizers import Whitespace

tokenizer_bpe = Tokenizer(BPE(unk_token="[UNK]"))
tokenizer_bpe.pre_tokenizer = Whitespace()
trainer = BpeTrainer(vocab_size=1000, special_tokens=["[UNK]", "[CLS]", "[PAD]", "[MASK]"])
# Training is a method of the Tokenizer, not the trainer:
# tokenizer_bpe.train(files=["corpus.txt"], trainer=trainer)  # train on your corpus
print("BPE tokenizer configured (needs corpus file to actually train)")
Learning Rate and Training Stability
Concept
Warmup + cosine decay: The standard schedule for transformer training.
Warmup (0 → max_lr over first T_warmup steps):
lr = max_lr × (step / T_warmup)
Cosine decay (T_warmup → T_total), with t = step - T_warmup and T_decay = T_total - T_warmup:
lr = min_lr + 0.5 × (max_lr - min_lr) × (1 + cos(π × t / T_decay))
Why warmup? At the start of training, gradients are large and inconsistent — high learning rates cause instability. Warmup gradually increases lr while the model's parameter estimates stabilize.
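The two formulas combine into one schedule function. A minimal sketch, taking t as steps since the end of warmup and the decay length as total steps minus warmup steps; the learning rates and step counts in the usage lines are arbitrary example values.

```python
import math

def lr_at(step, max_lr, min_lr, warmup_steps, total_steps):
    """Linear warmup from 0 to max_lr, then cosine decay from max_lr to min_lr."""
    if step < warmup_steps:
        return max_lr * step / warmup_steps
    t = step - warmup_steps
    decay_steps = total_steps - warmup_steps
    return min_lr + 0.5 * (max_lr - min_lr) * (1 + math.cos(math.pi * t / decay_steps))

# Example values only: peak 3e-4, floor 3e-5, 100 warmup steps, 1100 total steps.
for step in (0, 50, 100, 600, 1100):
    print(step, f"{lr_at(step, 3e-4, 3e-5, 100, 1100):.2e}")
```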
Gradient clipping: Clips the global gradient norm to a maximum value (typically 1.0):
if ‖g‖ > max_norm: g ← g × (max_norm / ‖g‖)
Prevents gradient explosions from occasional bad batches. Essential for stable pretraining.
Gradient checkpointing: Instead of storing all activations from the forward pass (needed for backpropagation), discard most of them and recompute from saved checkpoints during the backward pass. Reduces activation memory by ~5× at the cost of ~30% more compute. Standard for training large models.
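Global-norm gradient clipping, as described above, treats all parameter gradients as one long vector and rescales them together. A pure-Python sketch on nested lists; the small epsilon that guards against division by zero is an implementation assumption.

```python
import math

def clip_grad_norm(grads, max_norm=1.0, eps=1e-6):
    """Scale all gradients by one factor so their joint L2 norm is at most max_norm."""
    total_norm = math.sqrt(sum(g * g for vec in grads for g in vec))
    scale = min(1.0, max_norm / (total_norm + eps))
    return [[g * scale for g in vec] for vec in grads], total_norm

grads = [[3.0, 0.0], [0.0, 4.0]]  # two "parameters"; joint norm is 5
clipped, norm = clip_grad_norm(grads)
new_norm = math.sqrt(sum(g * g for vec in clipped for g in vec))
print(norm, round(new_norm, 4))

small, _ = clip_grad_norm([[0.1, 0.2]])  # below the threshold: left unchanged
```

Because every gradient is multiplied by the same factor, the update direction is preserved and only its length changes.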
Study Notes
Must-know for interviews:
- Pretraining = learn language from massive unlabeled data; fine-tuning = specialize for tasks
- Data quality matters enormously: deduplication, quality filtering, and PII removal are critical
- BPE builds a subword vocabulary by iteratively merging the most frequent adjacent pairs
- CLM loss: cross-entropy on next-token prediction; label shifting handled by the framework
- Chinchilla: compute-optimal training ≈ 20 tokens per parameter
- Modern LLMs deliberately overtrain beyond Chinchilla because smaller models are cheaper to serve
- Distributed training: data parallelism (batch split), tensor parallelism (weight split), pipeline parallelism (layer split)
Quick recall Q&A:
- What is the Chinchilla scaling law? Optimal token count ≈ 20× parameter count for compute-efficient training.
- Why does LLaMA-3 overtrain beyond Chinchilla? Inference cost dominates training cost over millions of requests, so a smaller, well-trained model costs less to serve.
- What is gradient checkpointing? Trading compute for memory by discarding activations and recomputing them during the backward pass.
- Why is data deduplication critical? Duplicate data causes memorization of specific text rather than learning generalizable patterns; deduplication also reduces privacy risk.
- What is the CLM training objective? Predict the next token given all preceding tokens; minimize cross-entropy averaged over all positions.