01 — LLM Models
What You Will Learn
- What large language models are, how they predict tokens, and how sampling works
- Transformer architecture internals: embeddings, positional encoding (RoPE/ALiBi), residuals, FFN
- Attention mechanisms: scaled dot-product, multi-head, Flash Attention, GQA
- Model architecture types: encoder-only (BERT), decoder-only (GPT/LLaMA), encoder-decoder (T5)
- KV caching, paged attention, speculative decoding, and inference optimization
- How LLMs are pretrained and how scaling laws shape model design decisions
- Fine-tuning: SFT, RLHF, DPO, LoRA, QLoRA, and multi-head fine-tuning
- GPU/hardware considerations: VRAM estimation, quantization, parallelism, ZeRO
- Failure modes: catastrophic forgetting, lost in the middle, hallucination, sycophancy
- Production deployment: serving frameworks, latency optimization, context window workarounds
- Interview-ready answers on all LLM topics with 68+ Q&A pairs
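To preview the sampling material above, here is a minimal sketch of temperature plus nucleus (top-p) sampling over a toy logit table. The function name and the example logits are illustrative, not taken from any real model or library:

```python
import math
import random

def sample_next_token(logits, temperature=0.8, top_p=0.9):
    """Temperature + nucleus (top-p) sampling over a dict of token -> logit."""
    # Scale logits by temperature: lower T sharpens the distribution,
    # higher T flattens it.
    scaled = {tok: logit / temperature for tok, logit in logits.items()}
    # Softmax to probabilities (subtract the max for numerical stability).
    m = max(scaled.values())
    exps = {tok: math.exp(v - m) for tok, v in scaled.items()}
    total = sum(exps.values())
    probs = {tok: e / total for tok, e in exps.items()}
    # Nucleus filtering: keep the smallest set of top tokens whose
    # cumulative probability reaches top_p, then renormalize and sample.
    ranked = sorted(probs.items(), key=lambda kv: kv[1], reverse=True)
    nucleus, cum = [], 0.0
    for tok, p in ranked:
        nucleus.append((tok, p))
        cum += p
        if cum >= top_p:
            break
    z = sum(p for _, p in nucleus)
    r, acc = random.random() * z, 0.0
    for tok, p in nucleus:
        acc += p
        if acc >= r:
            return tok
    return nucleus[-1][0]

print(sample_next_token({"the": 5.0, "a": 3.0, "cat": 1.0, "xyz": -2.0}))
```

With a very low temperature and a tight top-p, the nucleus collapses to the single highest-probability token, which is why greedy-like settings produce deterministic output.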
Chapter Map
| # | File | Topic | Difficulty |
|---|---|---|---|
| 1 | LLM Fundamentals | Tokens, sampling parameters, context window, model types | Beginner |
| 2 | Transformer Architecture | Embeddings, positional encoding, Pre-LN, residuals, SwiGLU FFN | Intermediate |
| 3 | Attention Mechanisms | Q/K/V math, multi-head, Flash Attention, GQA, causal masking | Intermediate |
| 4 | Model Architecture Types | Encoder-only, Decoder-only, Encoder-Decoder, MoE, model comparison table | Intermediate |
| 5 | KV Cache & Inference Optimization | KV cache math, MQA/GQA, paged attention, speculative decoding, continuous batching | Advanced |
| 6 | Training & Pretraining | Data curation, BPE, CLM/MLM objectives, scaling laws, distributed training | Intermediate |
| 7 | Fine-Tuning | SFT, RLHF, DPO, LoRA math, QLoRA, multi-head fine-tuning | Advanced |
| 8 | GPU & Hardware | VRAM estimation, quantization (INT8/INT4/AWQ/NF4), tensor/pipeline/ZeRO parallelism | Advanced |
| 9 | Failure Modes & Tricky Issues | Catastrophic forgetting, lost in the middle, hallucination, sycophancy, repetition | Advanced |
| 10 | Production Deployment | vLLM/TGI, latency budgets, prefix caching, token window workarounds, cost optimization | Advanced |
| 11 | Prompting Strategies | Chat templates, CoT mechanics, system prompts, structured output, prompt injection | Intermediate |
| 12 | Q&A Review Bank | 68+ Q&A pairs tagged Easy/Medium/Hard across all topics | All levels |
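As a taste of the VRAM-estimation material in the GPU & Hardware chapter, a common back-of-the-envelope formula is weights times bytes-per-parameter, plus KV cache, plus a rough overhead factor for activations and fragmentation. The 1.2 overhead factor and the function name below are illustrative assumptions, not exact figures:

```python
def estimate_vram_gb(params_b, bytes_per_param=2, kv_gb=0.0, overhead=1.2):
    """Rule-of-thumb inference VRAM in GB.

    params_b        -- parameter count in billions
    bytes_per_param -- 2 for FP16/BF16, 1 for INT8, 0.5 for INT4
    kv_gb           -- KV cache budget in GB, if known
    overhead        -- ~20% slack for activations and fragmentation (heuristic)
    """
    weights_gb = params_b * bytes_per_param  # e.g. 1B params * 2 bytes = 2 GB
    return (weights_gb + kv_gb) * overhead

# A 7B model in FP16: roughly 7 * 2 * 1.2 = 16.8 GB before KV cache;
# the same model in INT4 fits in roughly 4.2 GB.
print(round(estimate_vram_gb(7), 1))
print(round(estimate_vram_gb(7, bytes_per_param=0.5), 1))
```

This kind of quick estimate ("will a 7B FP16 model fit on a 24 GB card?") is a frequent interview warm-up question.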
Recommended Learning Paths
Path A: Beginner → Conceptual Understanding
- LLM Fundamentals — understand what LLMs are and how they generate text
- Transformer Architecture — understand the building blocks
- Attention Mechanisms — understand the core operation
- Model Architecture Types — understand the landscape
- Prompting Strategies — understand how to interact with models
Path B: Interview Preparation (Accelerated)
- LLM Fundamentals + Transformer Architecture in parallel
- Attention Mechanisms — very common in technical interviews
- KV Cache & Inference — increasingly asked in production roles
- Fine-Tuning — LoRA math, RLHF vs DPO
- GPU & Hardware — VRAM estimation questions are common
- Q&A Review Bank — drill the full 68+ question bank
Path C: Production Engineering (Advanced)
- KV Cache & Inference Optimization
- GPU & Hardware
- Production Deployment
- Failure Modes & Tricky Issues
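For the production path, the KV cache sizing math from the KV Cache & Inference Optimization chapter is worth previewing. Per token, the cache stores one K and one V vector per layer per KV head; the LLaMA-2-7B-like shape below (32 layers, 32 KV heads, head dim 128) is an assumed example configuration:

```python
def kv_cache_bytes(n_layers, n_kv_heads, head_dim, seq_len,
                   batch=1, bytes_per_elem=2):
    """KV cache size: 2 tensors (K and V), each
    n_layers * n_kv_heads * head_dim elements per token."""
    return 2 * n_layers * n_kv_heads * head_dim * seq_len * batch * bytes_per_elem

# LLaMA-2-7B-like config in FP16 at a 4096-token context:
gb = kv_cache_bytes(32, 32, 128, 4096) / 1024**3
print(round(gb, 1))  # 2.0 GiB per sequence
```

Shrinking `n_kv_heads` is exactly what GQA does: dropping from 32 to 8 KV heads cuts this cache by 4x, which is why the attention-variant and serving chapters cross-reference each other.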
Resources
- Gemma Handbook — Google's Gemma open model reference
- Q&A Review Bank — 68+ Q&A pairs in this module
- Cross-topic Interview Questions
Key Cross-References
- KV cache and attention variants → Attention Mechanisms + KV Cache
- Catastrophic forgetting → Failure Modes + Fine-Tuning (LoRA)
- Token window limits → Failure Modes + Production Deployment
- VRAM and parallelism → GPU & Hardware
- RAG as a complement to LLM capabilities → 03-RAGs module