01 — LLM Models

What You Will Learn

  • What large language models are, how they predict tokens, and how sampling works
  • Transformer architecture internals: embeddings, positional encoding (RoPE/ALiBi), residuals, FFN
  • Attention mechanisms: scaled dot-product, multi-head, Flash Attention, GQA
  • Model architecture types: encoder-only (BERT), decoder-only (GPT/LLaMA), encoder-decoder (T5)
  • KV caching, paged attention, speculative decoding, and inference optimization
  • How LLMs are pretrained and how scaling laws shape model design decisions
  • Fine-tuning: SFT, RLHF, DPO, LoRA, QLoRA, and multi-head fine-tuning
  • GPU/hardware considerations: VRAM estimation, quantization, parallelism, ZeRO
  • Failure modes: catastrophic forgetting, lost in the middle, hallucination, sycophancy
  • Production deployment: serving frameworks, latency optimization, context window workarounds
  • Interview-ready answers on all LLM topics with 68+ Q&A pairs
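Several of the bullets above (attention mechanisms, sampling) reduce to one core operation: scaled dot-product attention. As a preview of what the attention chapter covers, here is a minimal NumPy sketch of single-head attention; the function name and tensor shapes are illustrative, not taken from any framework:

```python
import numpy as np

def scaled_dot_product_attention(Q, K, V):
    """Attention(Q, K, V) = softmax(Q K^T / sqrt(d_k)) V."""
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)                 # (seq_q, seq_k) similarities
    # Numerically stable row-wise softmax over the key dimension
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)
    return weights @ V                              # weighted sum of value vectors

rng = np.random.default_rng(0)
Q = rng.standard_normal((4, 8))   # 4 query tokens, head dim 8
K = rng.standard_normal((4, 8))
V = rng.standard_normal((4, 8))
out = scaled_dot_product_attention(Q, K, V)
print(out.shape)  # (4, 8): one output vector per query token
```

Multi-head attention, causal masking, Flash Attention, and GQA (all covered in chapter 3) are refinements of exactly this computation.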

Chapter Map

| # | File | Topic | Difficulty |
|---|------|-------|------------|
| 1 | LLM Fundamentals | Tokens, sampling parameters, context window, model types | Beginner |
| 2 | Transformer Architecture | Embeddings, positional encoding, Pre-LN, residuals, SwiGLU FFN | Intermediate |
| 3 | Attention Mechanisms | Q/K/V math, multi-head, Flash Attention, GQA, causal masking | Intermediate |
| 4 | Model Architecture Types | Encoder-only, Decoder-only, Encoder-Decoder, MoE, model comparison table | Intermediate |
| 5 | KV Cache & Inference Optimization | KV cache math, MQA/GQA, paged attention, speculative decoding, continuous batching | Advanced |
| 6 | Training & Pretraining | Data curation, BPE, CLM/MLM objectives, scaling laws, distributed training | Intermediate |
| 7 | Fine-Tuning | SFT, RLHF, DPO, LoRA math, QLoRA, multi-head fine-tuning | Advanced |
| 8 | GPU & Hardware | VRAM estimation, quantization (INT8/INT4/AWQ/NF4), tensor/pipeline/ZeRO parallelism | Advanced |
| 9 | Failure Modes & Tricky Issues | Catastrophic forgetting, lost in the middle, hallucination, sycophancy, repetition | Advanced |
| 10 | Production Deployment | vLLM/TGI, latency budgets, prefix caching, token window workarounds, cost optimization | Advanced |
| 11 | Prompting Strategies | Chat templates, CoT mechanics, system prompts, structured output, prompt injection | Intermediate |
| 12 | Q&A Review Bank | 68+ Q&A pairs tagged Easy/Medium/Hard across all topics | All levels |
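Chapter 8's VRAM-estimation questions usually start from one rule of thumb: weight memory is parameter count times bytes per parameter. A minimal sketch of that arithmetic (the helper name is made up here, and the estimate deliberately ignores KV cache, activations, and framework overhead):

```python
def vram_weights_gb(n_params_billions: float, bytes_per_param: float) -> float:
    """Rough VRAM for model weights only, in GiB.

    Ignores KV cache, activations, and runtime overhead, so real
    requirements are higher than this figure.
    """
    return n_params_billions * 1e9 * bytes_per_param / 1024**3

# A 7B-parameter model at common precisions (illustrative, not vendor specs):
for label, bpp in [("FP16", 2), ("INT8", 1), ("INT4", 0.5)]:
    print(f"7B @ {label}: {vram_weights_gb(7, bpp):.1f} GiB")
# FP16 lands near 13 GiB for weights alone, which is why 7B models
# are commonly quantized to fit consumer GPUs.
```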

Path A: Beginner → Conceptual Understanding

  1. LLM Fundamentals — understand what LLMs are and how they generate text
  2. Transformer Architecture — understand the building blocks
  3. Attention Mechanisms — understand the core operation
  4. Model Architecture Types — understand the landscape
  5. Prompting Strategies — understand how to interact with models

Path B: Interview Preparation (Accelerated)

  1. LLM Fundamentals + Transformer Architecture in parallel
  2. Attention Mechanisms — very common in technical interviews
  3. KV Cache & Inference — increasingly asked in production roles
  4. Fine-Tuning — LoRA math, RLHF vs DPO
  5. GPU & Hardware — VRAM estimation questions are common
  6. Q&A Review Bank — drill all 68+ questions

Path C: Production Engineering (Advanced)

  1. KV Cache & Inference Optimization
  2. GPU & Hardware
  3. Production Deployment
  4. Failure Modes & Tricky Issues
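Path C leans on the KV cache sizing math from chapter 5. As a preview, a minimal sketch of the standard formula (2 tensors, K and V, per layer); the helper name is made up here, and the shapes are LLaMA-2-7B-like (32 layers, 32 KV heads, head dim 128) at FP16:

```python
def kv_cache_gb(layers: int, kv_heads: int, head_dim: int,
                seq_len: int, batch: int, bytes_per_val: int = 2) -> float:
    """KV cache bytes = 2 (K and V) * layers * kv_heads * head_dim
    * seq_len * batch * bytes_per_value, converted to GiB."""
    return 2 * layers * kv_heads * head_dim * seq_len * batch * bytes_per_val / 1024**3

# One 4k-token sequence on a LLaMA-2-7B-like model (MHA, so 32 KV heads):
print(f"{kv_cache_gb(32, 32, 128, 4096, 1):.2f} GiB per sequence")
```

This per-sequence cost is what motivates GQA (fewer KV heads), paged attention (no over-allocation), and continuous batching, all covered in chapters 5 and 10.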

Resources

Key Cross-References

Next Topic

02 — Prompt Engineering