Training and Pretraining

Pretraining

Concept

Pretraining is the large-scale phase in which a model learns general language patterns and world knowledge from massive unlabeled text corpora using self-supervised objectives. It requires enormous compute and data but is typically done once per base model - the resulting base model is then fine-tuned for specific tasks.

The three stages of an LLM's life:

1. Pretraining   → Base model (knows language, world knowledge, no instruction following)
2. Fine-tuning   → Instruction-tuned model (follows instructions, behaves as assistant)
3. Alignment     → Aligned model (RLHF / DPO: safer, more helpful and honest, fewer harmful outputs)

Scale of pretraining:

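LLaMA-3 8B: trained on ~15 trillion tokens (15T), ~1.3M GPU-hours on H100 (per Meta's model card)
GPT-4: unofficial estimates of 10T+ tokens; training data size and compute are undisclosed
Rule of thumb: 1M GPU-hours on modern H100s costs roughly $1M–$3M at cloud prices of about $1–$3 per GPU-hour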
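
GPU-hours convert to dollars by simple multiplication. The sketch below makes that arithmetic explicit; `training_cost_usd` is a hypothetical helper and the hourly prices are assumed for illustration, not quoted rates.

```python
def training_cost_usd(gpu_hours: float, usd_per_gpu_hour: float) -> float:
    """Cost of a training run that is billed per GPU-hour."""
    return gpu_hours * usd_per_gpu_hour

# Assumed price range (illustrative only): $1 to $3 per H100 GPU-hour
for price in (1.0, 2.0, 3.0):
    cost = training_cost_usd(1_000_000, price)
    print(f"1M GPU-hours at ${price:.2f}/h = ${cost / 1e6:.1f}M")
```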

Dataset Curation

Concept

The quality of pretraining data is at least as important as model architecture. "Garbage in, garbage out" applies at trillion-token scale.

Data sources:

Common Crawl: Periodic crawls covering a large share of the public web - petabytes of raw text; requires aggressive filtering
Books: Project Gutenberg, BooksCorpus, Books3 - high-quality, diverse language
Wikipedia: Clean, factual, structured - high signal-to-noise
Code: GitHub - improves reasoning capabilities, not just coding
Academic papers: ArXiv - improves scientific understanding
Curated datasets: RefinedWeb, RedPajama, SlimPajama, Dolma

Data processing pipeline:

Raw web text
    ↓
URL/domain filtering (block adult content, spam, known low-quality domains)
    ↓
Language detection (keep target languages)
    ↓
Exact deduplication (hash of normalized text, e.g. SHA-256) - removes copy-pasted content
    ↓
Near-deduplication (MinHash/LSH, SimHash, n-gram overlap) - removes near-duplicates
    ↓
Quality filtering:
  - Perplexity filtering (text that a reference language model scores with low perplexity tends to be fluent and well-formed → kept)
  - Heuristic rules (min/max token counts, symbol ratio, etc.)
  - Classifier-based quality scoring
    ↓
PII removal (personal email, phone numbers, SSNs)
    ↓
Final tokenization + storage
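
Two of the pipeline stages above, heuristic quality rules and PII removal, can be sketched with the standard library alone. This is a minimal illustration under assumed thresholds: `passes_heuristics`, `scrub_pii`, the regular expressions, and the cut-off values are all hypothetical, and production pipelines use far more rules and dedicated PII detectors.

```python
import re

# Simplified patterns (assumptions for illustration, not production-grade PII detection)
EMAIL_RE = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}")
SSN_RE = re.compile(r"\b\d{3}-\d{2}-\d{4}\b")

def passes_heuristics(text: str, min_words: int = 5, max_symbol_ratio: float = 0.3) -> bool:
    """Reject very short documents and documents dominated by symbols."""
    words = text.split()
    if len(words) < min_words:
        return False
    symbols = sum(1 for ch in text if not (ch.isalnum() or ch.isspace()))
    return symbols / len(text) <= max_symbol_ratio

def scrub_pii(text: str) -> str:
    """Replace e-mail addresses and US SSN-shaped numbers with placeholders."""
    text = EMAIL_RE.sub("<EMAIL>", text)
    return SSN_RE.sub("<SSN>", text)

docs = [
    "Contact me at jane.doe@example.com about the quarterly report draft.",
    "$$$ !!! ### @@@ %%% ^^^",   # mostly symbols: filtered out
    "too short",                 # fewer than min_words: filtered out
]
cleaned = [scrub_pii(d) for d in docs if passes_heuristics(d)]
print(cleaned)
```

Filtering runs before scrubbing here only to avoid work on documents that will be discarded; the order follows the pipeline above.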

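Why deduplication matters: Training repeatedly on duplicated text pushes the model to memorize that specific text rather than learn generalizable patterns, and it wastes compute on tokens that add no new information. Deduplication also reduces privacy risk, because PII that appears many times is more likely to be memorized.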
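
A small sketch of both deduplication stages, assuming whitespace/lowercase normalization and word 3-gram shingles. The exact Jaccard similarity computed here is the quantity that MinHash approximates at scale; the function names and the 0.5 threshold are illustrative choices, and the pairwise comparison is quadratic, so it only works for toy inputs.

```python
import hashlib

def normalize(text: str) -> str:
    """Lowercase and collapse whitespace so trivial variants hash identically."""
    return " ".join(text.lower().split())

def exact_dedup(docs):
    """Keep the first document for each distinct hash of the normalized text."""
    seen, kept = set(), []
    for doc in docs:
        digest = hashlib.sha256(normalize(doc).encode("utf-8")).hexdigest()
        if digest not in seen:
            seen.add(digest)
            kept.append(doc)
    return kept

def shingles(text: str, n: int = 3) -> set:
    """Set of word n-grams of the normalized text."""
    words = normalize(text).split()
    return {tuple(words[i:i + n]) for i in range(max(1, len(words) - n + 1))}

def jaccard(a: str, b: str, n: int = 3) -> float:
    sa, sb = shingles(a, n), shingles(b, n)
    return len(sa & sb) / len(sa | sb)

def near_dedup(docs, threshold: float = 0.5):
    """Drop a document if it is too similar to one already kept."""
    kept = []
    for doc in docs:
        if all(jaccard(doc, k) < threshold for k in kept):
            kept.append(doc)
    return kept

docs = [
    "The quick brown fox jumps over the lazy dog near the river bank",
    "the quick  brown fox jumps over the lazy dog near the river bank",  # identical after normalization
    "The quick brown fox jumps over the lazy dog near the river today",  # near-duplicate
    "Scaling laws relate loss to parameters, data, and compute",
]
unique = near_dedup(exact_dedup(docs))
print(len(docs), "->", len(unique), "documents")
```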

Data mixture ratios (approximate, from public information):

LLaMA-3 training mix (as reported by Meta for the Llama 3 model family):
~50% general knowledge (mostly filtered web text)
~25% math and reasoning
~17% code
~8% multilingual

Tokenization

Concept

Before training, all text is converted to token sequences using a fixed vocabulary tokenizer built from the training data.

Byte-Pair Encoding (BPE) - step by step:

Initial vocabulary: all individual bytes (256 symbols)

Training procedure:
1. Split all training text into characters/bytes
2. Count frequency of all adjacent symbol pairs
3. Merge the most frequent pair into a new symbol
4. Repeat until vocabulary reaches target size (e.g., 32K)

Example:
  Text: "low lower lowest"
  Initial: [l,o,w] [l,o,w,e,r] [l,o,w,e,s,t]
  Most frequent pair: (l,o), tied with (o,w) at 3 occurrences; pick (l,o) → merge to (lo)
  → [lo,w] [lo,w,e,r] [lo,w,e,s,t]
  Most frequent pair: (lo,w) → merge to (low)
  → [low] [low,e,r] [low,e,s,t]
  ...continues until target vocab size
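
The procedure above can be reproduced in a few lines. This is a toy sketch, not a production tokenizer: it starts from characters rather than bytes, splits words on whitespace, and breaks ties by taking the first pair encountered. `train_bpe` and its helpers are illustrative names.

```python
from collections import Counter

def count_pairs(words):
    """Count adjacent symbol pairs across all words."""
    counts = Counter()
    for symbols in words:
        for pair in zip(symbols, symbols[1:]):
            counts[pair] += 1
    return counts

def merge_pair(symbols, pair):
    """Replace every occurrence of `pair` in one word with the merged symbol."""
    merged, i = [], 0
    while i < len(symbols):
        if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == pair:
            merged.append(symbols[i] + symbols[i + 1])
            i += 2
        else:
            merged.append(symbols[i])
            i += 1
    return merged

def train_bpe(text, num_merges):
    words = [list(w) for w in text.split()]  # start from single characters
    merges = []
    for _ in range(num_merges):
        counts = count_pairs(words)
        if not counts:
            break
        best = max(counts, key=counts.get)   # ties: the first pair seen wins
        merges.append(best)
        words = [merge_pair(w, best) for w in words]
    return merges, words

merges, words = train_bpe("low lower lowest", num_merges=2)
print(merges)  # [('l', 'o'), ('lo', 'w')]
print(words)   # [['low'], ['low', 'e', 'r'], ['low', 'e', 's', 't']]
```

The ordered list of merges is the learned tokenizer: encoding new text means applying those merges in the same order.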

Tokenization algorithms - full comparison:

Algorithm	Description	Characteristics	Used by	Pros	Cons
Whitespace	Splits on spaces	Fast, naive	Early NLP tools	Very simple	Fails for languages without whitespace word boundaries; huge vocabulary
Character-level	Each char is a token	Fine-grained	Some toy/code models	Robust to OOV	Long sequences
Word-level	Splits on words	Basic NLP	NLTK, spaCy	Easy to read	Large vocabulary; unseen words map to UNK
BPE (Byte-Pair Encoding)	Merges most frequent adjacent symbol pairs	Greedy, deterministic	GPT-2, RoBERTa, CodeBERT	Efficient, fast	Merges are purely frequency-driven; rare words fragment into many tokens
WordPiece	Merges the pair that most increases training-data likelihood	Likelihood-based	BERT, DistilBERT	Better OOV handling	Slightly slower to train
Unigram Language Model	Prunes a large candidate vocabulary; picks the most likely segmentation	Probabilistic	ALBERT, T5, mT5	Flexible, language-agnostic	More complex
SentencePiece (BPE/Unigram)	Library that trains on raw Unicode text and treats whitespace as an ordinary symbol	Language-independent	T5, mT5, LLaMA 1/2, Gemini	No pre-tokenization needed	May be less readable
Byte-Level BPE	Extends BPE to raw bytes	Includes spaces, emojis	GPT-2, GPT-Neo, GPT-J	No UNK tokens	Tokens may be unreadable
SentencePiece (byte fallback)	Falls back to UTF-8 bytes for characters outside the vocabulary	Language-agnostic	PaLM, LLaMA 1/2	Works on all scripts	Rare characters cost several tokens
tiktoken (OpenAI)	Fast byte-level BPE implementation	Special tokens, efficient	GPT-3.5, GPT-4	Very fast	Built for encoding/decoding, not for training new vocabularies

Key practical notes:

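SentencePiece (LLaMA 1/2, Gemma): Trains directly on raw text without pre-tokenization and treats whitespace as an ordinary symbol, with byte fallback for unseen characters - handles all languages and code uniformly. LLaMA-3 switched to a tiktoken-style BPE tokenizer with a 128K vocabulary.
tiktoken (GPT-4): Uses cl100k_base with a ~100K vocabulary - a larger vocabulary reduces token counts, improving efficiency for English and code.
Multilingual/emoji-rich datasets: Prefer byte-level tokenization (or byte fallback) so that no input maps to an unknown token.
Training vs inference: The tokenizer (vocabulary, merges, special tokens, normalization) must match exactly - a mismatch yields wrong token IDs without raising an error, so output quality degrades silently.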
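
Why byte-level schemes cannot hit an out-of-vocabulary symbol: every string encodes to UTF-8 bytes, and there are only 256 possible byte values. The sketch below shows just that base layer (no merges); `to_byte_tokens` and `from_byte_tokens` are illustrative helpers.

```python
def to_byte_tokens(text: str) -> list[int]:
    """Map text to its UTF-8 byte values, each in the range 0..255."""
    return list(text.encode("utf-8"))

def from_byte_tokens(tokens: list[int]) -> str:
    """Invert to_byte_tokens."""
    return bytes(tokens).decode("utf-8")

sample = "naïve café 🙂 数据"
tokens = to_byte_tokens(sample)

assert all(0 <= t < 256 for t in tokens)   # a 256-entry base vocabulary covers everything
assert from_byte_tokens(tokens) == sample  # lossless round trip
print(len(sample), "characters ->", len(tokens), "byte tokens")
```

The price is sequence length: non-ASCII characters take several bytes each, which is why byte-level BPE learns merges on top of this base vocabulary.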

Tokenizer implementations across provider ecosystems:

Provider	Tokenizer/Model	Algorithm	Notes
Hugging Face	BertTokenizer	WordPiece	For BERT and variants
	GPT2Tokenizer	Byte-Level BPE	GPT-2, GPT-Neo
	RobertaTokenizer	Byte-Level BPE	Same scheme as GPT-2; a leading space is part of the token
	T5Tokenizer	SentencePiece (Unigram)	Used in T5, mT5
	LlamaTokenizer	SentencePiece (BPE)	Used in LLaMA 1/2; LLaMA 3 uses a tiktoken-style BPE vocabulary
	BloomTokenizerFast	Byte-Level BPE	For BLOOM
GCP / Vertex AI	Gemini 1.5	SentencePiece	Google prefers SentencePiece; further details not public
	PaLM / T5 / mT5	SentencePiece (Unigram)	Internal tokenizer tools
AWS Bedrock	Claude (Anthropic)	Proprietary (reportedly BPE-based)	Not publicly documented in detail
	Mistral / LLaMA	SentencePiece (BPE) or byte-level BPE, depending on model version	HF model hosted via Bedrock
OpenAI	GPT-3, GPT-4	tiktoken (byte-level BPE)	Descended from the GPT-2 tokenizer
Meta	LLaMA 1/2/3	SentencePiece (BPE) for 1/2; tiktoken-style BPE for 3	Open, multilingual
Anthropic	Claude 1/2/3	Proprietary (reportedly BPE-based)	Not publicly documented in detail

Popular tokenization libraries:

Library	Algorithms supported	Notes
`transformers` (HuggingFace)	BPE, WordPiece, Unigram (including SentencePiece-backed tokenizers)	Easy integration
`tokenizers` (HuggingFace)	BPE, WordPiece, Unigram, WordLevel	Fast Rust implementation; train your own
`sentencepiece`	BPE, Unigram	Used by Google models
`tiktoken`	Byte-level BPE (OpenAI encodings)	Used with OpenAI models

Training Objectives

Concept

Causal Language Modeling (CLM) - Decoder-only:

Input:   "The cat sat on the mat"
Targets: "cat sat on the mat <EOS>"
Loss:    CrossEntropy(logits, targets) averaged over all non-padding positions

The model sees tokens 0..t-1 and predicts token t. This is why causal masking is applied during training - the model must predict each position without seeing future tokens. The loss is the average cross-entropy over all predicted positions in the sequence.
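
The input/target shift, the causal mask, and the loss can be written out without any framework. A minimal sketch on the example sentence, assuming a toy 7-word vocabulary and uniform logits as a stand-in for an untrained model; all names here are illustrative.

```python
import math

def shift_for_clm(token_ids, eos_id):
    """Targets are the inputs shifted left by one, with EOS as the last target."""
    inputs = list(token_ids)
    targets = list(token_ids[1:]) + [eos_id]
    return inputs, targets

def causal_mask(n):
    """mask[i][j] is True when position i may attend to position j (j <= i)."""
    return [[j <= i for j in range(n)] for i in range(n)]

def cross_entropy(logits, target):
    """Negative log-probability of `target` under softmax(logits)."""
    m = max(logits)
    log_z = m + math.log(sum(math.exp(x - m) for x in logits))
    return log_z - logits[target]

vocab = {"The": 0, "cat": 1, "sat": 2, "on": 3, "the": 4, "mat": 5, "<EOS>": 6}
ids = [vocab[w] for w in "The cat sat on the mat".split()]
inputs, targets = shift_for_clm(ids, eos_id=vocab["<EOS>"])

# Hypothetical uniform logits: a model that knows nothing about the data
uniform_logits = [0.0] * len(vocab)
loss = sum(cross_entropy(uniform_logits, t) for t in targets) / len(targets)

print(targets)         # [1, 2, 3, 4, 5, 6]
print(round(loss, 4))  # 1.9459, i.e. ln(7)
```

A uniform model scores ln(vocabulary size) per token, which is a useful sanity check for the initial loss of a real training run.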

Masked Language Modeling (MLM) - Encoder-only (BERT):

Input:   "The [MASK] sat on the [MASK]"
Targets: "cat" and "mat"
Loss:    CrossEntropy only on masked positions

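15% of tokens are selected for prediction. Of those, 80% are replaced with [MASK], 10% with a random token, and 10% are left unchanged. Because [MASK] never appears at fine-tuning time, this mix reduces the mismatch between pretraining and downstream inputs and forces the model to keep a useful representation for every token, not just the masked ones.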
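
The selection and replacement rule can be sketched as follows. This is a simplified illustration on word-level tokens: `mask_for_mlm` is a hypothetical helper, the random replacement is drawn from whatever vocabulary list is passed in, and `None` marks positions that contribute no loss.

```python
import random

def mask_for_mlm(tokens, vocab, mask_prob=0.15, seed=0):
    """Return (inputs, labels); labels[i] is None where no loss is computed."""
    rng = random.Random(seed)
    inputs, labels = list(tokens), [None] * len(tokens)
    for i, tok in enumerate(tokens):
        if rng.random() >= mask_prob:
            continue                       # not selected: input unchanged, no loss
        labels[i] = tok                    # selected: the model must predict the original
        r = rng.random()
        if r < 0.8:
            inputs[i] = "[MASK]"           # 80% of selected positions
        elif r < 0.9:
            inputs[i] = rng.choice(vocab)  # 10%: random token
        # remaining 10%: keep the original token in the input
    return inputs, labels

tokens = "the cat sat on the mat".split()
inputs, labels = mask_for_mlm(tokens, vocab=sorted(set(tokens)), mask_prob=0.15, seed=0)
print(inputs)
print(labels)
```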

Span Corruption - Encoder-Decoder (T5):

Input:   "The <X> on the mat. The <Y> is fluffy."  (spans replaced by sentinels)
Target:  "<X> cat sat <Y> cat"
Loss:    CrossEntropy on the decoder's output
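
A sketch that reproduces the example above from explicit span positions. It is simplified in two labeled ways: spans are given by hand instead of being sampled, and the sentinels are written `<X>`, `<Y>`, `<Z>` as in the example. T5 itself names them `<extra_id_0>`, `<extra_id_1>`, and so on, and also appends one final sentinel to the target, which is omitted here to match the example.

```python
def corrupt_spans(tokens, spans, sentinels=("<X>", "<Y>", "<Z>")):
    """spans: sorted, non-overlapping (start, end) index pairs, end exclusive."""
    inputs, targets, cursor = [], [], 0
    for sentinel, (start, end) in zip(sentinels, spans):
        inputs.extend(tokens[cursor:start])  # untouched text before the span
        inputs.append(sentinel)              # the span collapses to one sentinel
        targets.append(sentinel)             # the target names the sentinel...
        targets.extend(tokens[start:end])    # ...followed by the dropped tokens
        cursor = end
    inputs.extend(tokens[cursor:])
    return inputs, targets

tokens = "The cat sat on the mat . The cat is fluffy .".split()
inputs, targets = corrupt_spans(tokens, spans=[(1, 3), (8, 9)])
print(" ".join(inputs))   # The <X> on the mat . The <Y> is fluffy .
print(" ".join(targets))  # <X> cat sat <Y> cat
```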

Scaling Laws

Concept

Neural scaling laws describe how model performance improves predictably with scale. Kaplan et al. (2020, OpenAI) and Hoffmann et al. (2022, DeepMind/Chinchilla) are the key papers.

Kaplan scaling laws (original):

Loss follows a power law in each of compute, parameters, and data when the other two are not the bottleneck
For a fixed compute budget: spend most of it on a larger model and train on comparatively few tokens
Following this recipe left GPT-3 (175B parameters, ~300B tokens) undertrained - too few tokens for its size

Chinchilla scaling law (the correction): Hoffmann et al. showed the Kaplan law overweighted parameters. Their revised finding:

Optimal: tokens ≈ 20 × parameters

A 7B model should train on ~140B tokens for "compute-optimal" training
LLaMA-1's 65B model was trained on 1.4T tokens - compute-optimal would be ~1.3T (20 × 65B), so it sits roughly at the Chinchilla optimum
LLaMA-2 and LLaMA-3 deliberately overtrain beyond Chinchilla optimal (LLaMA-3 8B sees ~15T tokens, nearly 1,900 per parameter) because a smaller model that is cheaper to serve is worth more than spending the training budget compute-optimally
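
The numbers in this list follow from two small formulas. The 20-tokens-per-parameter rule is from the text; the training-compute estimate of about 6 FLOPs per parameter per token is a widely used approximation that the text does not state, so treat it as an assumption. The function names are illustrative.

```python
def chinchilla_tokens(params: float, tokens_per_param: float = 20.0) -> float:
    """Compute-optimal token count under the ~20 tokens/parameter rule."""
    return params * tokens_per_param

def training_flops(params: float, tokens: float) -> float:
    """Assumed approximation: ~6 FLOPs per parameter per training token."""
    return 6.0 * params * tokens

runs = [
    ("7B at Chinchilla optimum", 7e9, chinchilla_tokens(7e9)),
    ("LLaMA-1 65B", 65e9, 1.4e12),
    ("LLaMA-3 8B", 8e9, 15e12),
]
for name, n, d in runs:
    print(f"{name}: {d / n:.0f} tokens/param, ~{training_flops(n, d):.2e} training FLOPs")
```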

The inference-aware scaling insight (LLaMA philosophy): Chinchilla minimizes training loss for a fixed training compute budget and ignores what happens afterwards. In practice, you want the best model per unit of total cost, including the cost of serving millions of requests. Training a smaller model on more tokens costs extra training compute but makes every later request cheaper. So "compute-optimal training" ≠ "deployment-optimal training."
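
The trade-off can be put into rough numbers. The sketch below assumes the common approximations of ~6 FLOPs per parameter per training token and ~2 FLOPs per parameter per generated token, neither of which is stated in the text. It also assumes, purely hypothetically, that the two models reach comparable quality; that equivalence is an illustration device, not a measured result.

```python
def lifetime_flops(params, train_tokens, served_tokens):
    """Training compute plus inference compute over the model's deployment."""
    return 6.0 * params * train_tokens + 2.0 * params * served_tokens

def break_even_served_tokens(big, small):
    """Served tokens after which the small, overtrained model is cheaper overall."""
    (n_big, d_big), (n_small, d_small) = big, small
    extra_training = 6.0 * (n_small * d_small - n_big * d_big)
    saving_per_token = 2.0 * (n_big - n_small)
    return max(0.0, extra_training / saving_per_token)

# Hypothetical pair: a 70B model at ~20 tokens/param vs an 8B model trained on 15T tokens
big, small = (70e9, 1.4e12), (8e9, 15e12)
t = break_even_served_tokens(big, small)
print(f"Break-even after ~{t:.2e} served tokens")
```

Past the break-even point every additional served token widens the gap in favor of the smaller model, which is the sense in which deployment-optimal differs from compute-optimal.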

Key scaling law takeaways for interviews:

Performance scales predictably as a power law in N (parameters), D (tokens), C (compute)
Doubling parameters gives diminishing returns unless you also double training data
Chinchilla: optimal token count ≈ 20× parameter count
Modern LLMs (LLaMA-3, Gemma) train well beyond Chinchilla optimal for inference efficiency

Distributed Training

Concept

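A 70B model in BF16 requires 140 GB of VRAM just for weights - more than a single 80 GB A100 or H100 can hold. Training is far worse: with Adam in mixed precision, weights, gradients, FP32 master weights, and the two optimizer moments add up to roughly 16 bytes per parameter (about 1.1 TB for 70B) before activations - see GPU and Hardware. Distributed training splits work across many GPUs.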
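
The memory arithmetic, written out. The 2 bytes per BF16 weight is from the text; the 16 bytes per parameter for training is the usual accounting for Adam with mixed precision and is an assumption about the setup, itemized in the comment. Activations are excluded.

```python
BYTES_BF16 = 2

def weights_gb(params, bytes_per_param=BYTES_BF16):
    """Memory for the weights alone, in GB (1 GB = 1e9 bytes)."""
    return params * bytes_per_param / 1e9

def mixed_precision_adam_gb(params):
    """Model-state memory for training, in GB.

    2 (BF16 weights) + 2 (BF16 gradients) + 4 (FP32 master weights)
    + 4 (Adam first moment) + 4 (Adam second moment) = 16 bytes per parameter.
    """
    return params * 16 / 1e9

print(weights_gb(70e9))               # 140.0
print(mixed_precision_adam_gb(70e9))  # 1120.0
```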

Data Parallelism (DDP - DistributedDataParallel):

Replicate the full model on each GPU
Split the batch across GPUs (different data, same model)
After each backward pass, average gradients across all GPUs (all-reduce)
Simplest strategy; works when the model, its gradients, and its optimizer states all fit on one GPU
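
Why averaging gradients is enough: with equal-sized shards, the mean of the per-worker gradients equals the gradient of the full batch. The sketch simulates two workers in plain Python on a one-parameter regression; `all_reduce_mean` is a stand-in for the real collective operation, not a library call.

```python
def grad_mse(w, xs, ys):
    """d/dw of mean((w*x - y)^2) over the given samples."""
    return sum(2 * (w * x - y) * x for x, y in zip(xs, ys)) / len(xs)

def all_reduce_mean(values):
    """Stand-in for an all-reduce that averages one value per worker."""
    return sum(values) / len(values)

w = 0.5
xs = [1.0, 2.0, 3.0, 4.0]
ys = [2.0, 4.0, 6.0, 8.0]

# Two simulated workers, each holding an equal-sized shard of the batch
shards = [(xs[:2], ys[:2]), (xs[2:], ys[2:])]
local_grads = [grad_mse(w, sx, sy) for sx, sy in shards]
synced = all_reduce_mean(local_grads)

# Every replica now applies the same update as single-GPU training on the full batch
assert abs(synced - grad_mse(w, xs, ys)) < 1e-12
print(local_grads, "->", synced)
```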

Tensor Parallelism (Megatron-LM style):

Split individual weight matrices across GPUs
Attention heads split across GPUs: head 1–8 on GPU 1, head 9–16 on GPU 2, etc.
FFN layers split across GPUs: first half of d_ffn on GPU 1, second half on GPU 2
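Requires all-reduce inside every transformer layer (after the attention block and after the FFN block) - high communication overhead, so it is usually confined to GPUs within one node
One of the main options for models too large to fit on one GPU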
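
A toy version of the column split, using nested lists in place of GPU tensors. Each simulated device holds half of the weight columns and computes its slice of the output independently; concatenating the slices reproduces the full matrix product. Only the column-parallel half is shown; the helper names are illustrative.

```python
def matmul(a, b):
    """Plain matrix product of two nested-list matrices."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]

def split_columns(w, parts):
    """Split a matrix into `parts` equal column blocks."""
    width = len(w[0]) // parts
    return [[row[p * width:(p + 1) * width] for row in w] for p in range(parts)]

x = [[1.0, 2.0], [3.0, 4.0]]                      # activations, replicated on every device
w = [[1.0, 0.0, 2.0, 1.0], [0.0, 1.0, 1.0, 3.0]]  # full weight matrix (2 x 4)

shards = split_columns(w, parts=2)                 # each simulated device holds 2 columns
partials = [matmul(x, shard) for shard in shards]  # computed independently per device
gathered = [sum((p[i] for p in partials), []) for i in range(len(x))]  # concatenate columns

assert gathered == matmul(x, w)
print(gathered)  # [[1.0, 2.0, 4.0, 7.0], [3.0, 4.0, 10.0, 15.0]]
```

Megatron-LM pairs a column-split layer with a row-split layer so that only one all-reduce is needed per block; the row-split half sums partial outputs instead of concatenating them.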

Pipeline Parallelism:

Split layers (not weights) across GPUs: layers 1–16 on GPU 1, layers 17–32 on GPU 2
Micro-batching: split the batch into micro-batches so GPUs overlap computation ("bubble" reduction)
Communication: only activations at layer boundaries cross GPU boundaries - less bandwidth than tensor parallelism
Bubble overhead: GPUs still sit idle while waiting for the previous stage - mitigated by micro-batching
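
How much micro-batching helps can be estimated with the standard idle-time formula for a simple fill-and-drain (GPipe-style) schedule. This assumes every stage takes the same time per micro-batch, which real pipelines only approximate; `bubble_fraction` is an illustrative helper.

```python
def bubble_fraction(stages: int, micro_batches: int) -> float:
    """Idle fraction of a fill-and-drain schedule: (p - 1) / (m + p - 1)."""
    return (stages - 1) / (micro_batches + stages - 1)

for m in (1, 4, 16, 64):
    print(f"4 stages, {m:>2} micro-batches: {bubble_fraction(4, m):.0%} idle")
```

With a single micro-batch only one of the four stages is busy at any time; the idle share shrinks as the number of micro-batches grows relative to the number of stages.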

ZeRO (Zero Redundancy Optimizer) - covered in detail in GPU and Hardware:

Shards optimizer states, gradients, and parameters across GPUs
Eliminates redundant copies present in pure DDP
ZeRO-3: full parameter sharding - enables training models much larger than per-GPU memory
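
The effect of each ZeRO stage on per-GPU memory can be estimated with the accounting below. It assumes 2 bytes per parameter for weights, 2 for gradients, and 12 for optimizer state (FP32 master weights plus two Adam moments), and it ignores activations and buffers; `zero_memory_gb` is an illustrative helper.

```python
def zero_memory_gb(params, gpus, stage, k=12):
    """Per-GPU model-state memory in GB for ZeRO stage 0 (plain DDP) to 3."""
    weights, grads, optim = 2 * params, 2 * params, k * params
    if stage >= 1:
        optim /= gpus    # stage 1: shard optimizer states
    if stage >= 2:
        grads /= gpus    # stage 2: also shard gradients
    if stage >= 3:
        weights /= gpus  # stage 3: also shard parameters
    return (weights + grads + optim) / 1e9

for stage in range(4):
    print(f"ZeRO-{stage}: {zero_memory_gb(7e9, 8, stage):.1f} GB per GPU")
```

For a 7B model on 8 GPUs this gives 112 GB without sharding and about 26 GB with ZeRO-2, which leaves headroom for activations on an 80 GB card.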

Practical training setup for 7B model:

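8× A100-80GB: enough for BF16 mixed-precision training with ZeRO-2
Gradient checkpointing: trades roughly 30% extra compute for a large (up to ~5×) reduction in activation memory
Mixed precision (BF16): halves weight and activation memory vs FP32 with similar training stability (BF16 keeps FP32's exponent range, so loss scaling is usually unnecessary); FP32 master weights and optimizer states are still kept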

Code

# Minimal CLM training loop (conceptual)
import torch
from transformers import AutoTokenizer, AutoModelForCausalLM
from torch.optim import AdamW
from torch.optim.lr_scheduler import CosineAnnealingLR

model_name = "gpt2"  # small, runnable locally
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name)

tokenizer.pad_token = tokenizer.eos_token  # GPT-2 defines no pad token, so reuse EOS

# Sample training data
texts = [
    "The transformer architecture revolutionized NLP.",
    "Scaling laws predict model performance from compute.",
]

def tokenize(texts, max_length=64):
    return tokenizer(
        texts, 
        truncation=True, 
        padding="max_length", 
        max_length=max_length,
        return_tensors="pt"
    )

num_steps = 10
optimizer = AdamW(model.parameters(), lr=1e-4, weight_decay=0.01)
scheduler = CosineAnnealingLR(optimizer, T_max=num_steps)  # decay over the whole run

# Tokenize once: this toy batch is identical at every step
batch = tokenize(texts)
input_ids = batch["input_ids"]
attention_mask = batch["attention_mask"]

# Labels are a copy of input_ids. The model shifts them by one position
# internally (CLM: predict next token), so do not shift them yourself.
labels = input_ids.clone()
labels[attention_mask == 0] = -100  # ignore padding in loss

model.train()
for step in range(num_steps):

    outputs = model(input_ids=input_ids, attention_mask=attention_mask, labels=labels)
    loss = outputs.loss

    optimizer.zero_grad()
    loss.backward()
    torch.nn.utils.clip_grad_norm_(model.parameters(), 1.0)  # gradient clipping
    optimizer.step()
    scheduler.step()

    print(f"Step {step}: loss={loss.item():.4f}")

# BPE tokenizer training demo (Hugging Face `tokenizers` library)
from tokenizers import Tokenizer
from tokenizers.models import BPE
from tokenizers.trainers import BpeTrainer
from tokenizers.pre_tokenizers import Whitespace

tokenizer_bpe = Tokenizer(BPE(unk_token="[UNK]"))
tokenizer_bpe.pre_tokenizer = Whitespace()
trainer = BpeTrainer(vocab_size=1000, special_tokens=["[UNK]", "[CLS]", "[PAD]", "[MASK]"])

# Training is a method of the Tokenizer; the trainer only carries the settings
# tokenizer_bpe.train(files=["corpus.txt"], trainer=trainer)  # train on your corpus file
tiny_corpus = ["low lower lowest", "new newer newest"]  # in-memory demo corpus
tokenizer_bpe.train_from_iterator(tiny_corpus, trainer=trainer)
print(tokenizer_bpe.encode("lowest newer").tokens)

Learning Rate and Training Stability

Concept

Warmup + cosine decay: The standard schedule for transformer training.

Warmup (0 → max_lr over first T_warmup steps):
  lr = max_lr × (step / T_warmup)

Cosine decay (T_warmup → T_total), with t = step - T_warmup and T_decay = T_total - T_warmup:
  lr = min_lr + 0.5 × (max_lr - min_lr) × (1 + cos(π × t / T_decay))
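
Both formulas combined into one function, as a minimal sketch. It takes t as the number of steps since the end of warmup and T_decay as the number of steps remaining after warmup; `lr_at` and the example hyperparameters are illustrative choices.

```python
import math

def lr_at(step, max_lr, min_lr, warmup_steps, total_steps):
    """Linear warmup from 0 to max_lr, then cosine decay down to min_lr."""
    if step < warmup_steps:
        return max_lr * step / warmup_steps
    t = step - warmup_steps                        # steps since warmup ended
    t_decay = max(1, total_steps - warmup_steps)   # length of the decay phase
    progress = min(1.0, t / t_decay)               # clamp if training runs past total_steps
    return min_lr + 0.5 * (max_lr - min_lr) * (1 + math.cos(math.pi * progress))

for step in (0, 50, 100, 550, 1000):
    print(step, f"{lr_at(step, max_lr=3e-4, min_lr=3e-5, warmup_steps=100, total_steps=1000):.2e}")
```

The schedule is continuous at the hand-over: the last warmup value and the first cosine value are both max_lr.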

Why warmup? At the start of training, gradients are large and noisy, and Adam's moment estimates are based on very few steps - a high learning rate at this point causes instability. Warmup gradually increases lr while these optimizer statistics stabilize.

Gradient clipping: Clips the global gradient norm to a maximum value (typically 1.0):

if ||g|| > max_norm:
    g = g × max_norm / ||g||

Prevents gradient explosions from occasional bad batches. Essential for stable pretraining.
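
The clipping rule in plain Python, with gradients represented as a list of flat lists (one per parameter tensor). This is a simplified sketch: framework implementations such as PyTorch's `clip_grad_norm_` follow the same idea but add a small epsilon to the denominator.

```python
import math

def clip_by_global_norm(grads, max_norm=1.0):
    """Rescale all gradients together so their combined L2 norm is at most max_norm."""
    total = math.sqrt(sum(g * g for tensor in grads for g in tensor))
    scale = min(1.0, max_norm / total) if total > 0 else 1.0
    return [[g * scale for g in tensor] for tensor in grads], total

clipped, norm = clip_by_global_norm([[3.0], [0.0, 4.0]], max_norm=1.0)
print(norm)     # 5.0
print(clipped)  # values close to [[0.6], [0.0, 0.8]]
```

Because one scale factor is applied to every tensor, the direction of the update is preserved; only its length changes.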

Gradient checkpointing: Instead of storing all activations from the forward pass (needed for backpropagation), keep only a subset (the checkpoints) and recompute the rest from them during the backward pass. This can cut activation memory substantially (figures around 5× are commonly quoted) at the cost of roughly 30% more compute. Standard for training large models.

Study Notes

Must-know for interviews:

Pretraining = learn language from massive unlabeled data; fine-tuning = specialize for tasks
Data quality matters enormously: deduplication, quality filtering, and PII removal are critical
BPE builds subword vocabulary by merging frequent adjacent pairs iteratively
CLM loss: cross-entropy on next-token prediction; label shifting handled by the framework
Chinchilla: compute-optimal training ≈ 20 tokens per parameter
Modern LLMs deliberately overtrain beyond Chinchilla because smaller models are cheaper to serve
Distributed training: data parallelism (batch split), tensor parallelism (weight split), pipeline parallelism (layer split)

Quick recall Q&A:

What is the Chinchilla scaling law? Optimal token count ≈ 20× parameter count for compute-efficient training.
Why does LLaMA-3 overtrain beyond Chinchilla? Inference cost dominates training cost over millions of requests - a smaller, well-trained model costs less to serve.
What is gradient checkpointing? Trading compute for memory by discarding and recomputing activations during backward pass.
Why is data deduplication critical? Duplicate data causes memorization of specific text rather than learning generalizable patterns; also reduces privacy risk.
What is the CLM training objective? Predict the next token given all preceding tokens; minimize cross-entropy averaged over all positions.
