01 - LLM Models

What You Will Learn

  • What large language models are, how they predict tokens, and how sampling works
  • Transformer architecture internals: embeddings, positional encoding (RoPE/ALiBi), residuals, FFN
  • Attention mechanisms: scaled dot-product, multi-head, Flash Attention, GQA
  • Model architecture types: encoder-only (BERT), decoder-only (GPT/LLaMA), encoder-decoder (T5)
  • KV caching, paged attention, speculative decoding, and inference optimization
  • How LLMs are pretrained and how scaling laws shape model design decisions
  • Fine-tuning: SFT, RLHF, DPO, LoRA, QLoRA, and multi-head fine-tuning
  • GPU/hardware considerations: VRAM estimation, quantization, parallelism, ZeRO
  • Failure modes: catastrophic forgetting, lost in the middle, hallucination, sycophancy
  • Production deployment: serving frameworks, latency optimization, context window workarounds
  • Interview-ready answers on all LLM topics with 68+ Q&A pairs

Chapter Map

| #  | File | Topic | Difficulty |
|----|------|-------|------------|
| 1  | LLM Fundamentals | Tokens, sampling parameters, context window, model types | Beginner |
| 2  | Transformer Architecture | Embeddings, positional encoding, Pre-LN, residuals, SwiGLU FFN | Intermediate |
| 3  | Attention Mechanisms | Q/K/V math, multi-head, Flash Attention, GQA, causal masking | Intermediate |
| 4  | Model Architecture Types | Encoder-only, Decoder-only, Encoder-Decoder, MoE, model comparison table | Intermediate |
| 5  | KV Cache & Inference Optimization | KV cache math, MQA/GQA, paged attention, speculative decoding, continuous batching | Advanced |
| 6  | Training & Pretraining | Data curation, BPE, CLM/MLM objectives, scaling laws, distributed training | Intermediate |
| 7  | Fine-Tuning | SFT, RLHF, DPO, LoRA math, QLoRA, multi-head fine-tuning | Advanced |
| 8  | GPU & Hardware | VRAM estimation, quantization (INT8/INT4/AWQ/NF4), tensor/pipeline/ZeRO parallelism | Advanced |
| 9  | Failure Modes & Tricky Issues | Catastrophic forgetting, lost in the middle, hallucination, sycophancy, repetition | Advanced |
| 10 | Production Deployment | vLLM/TGI, latency budgets, prefix caching, token window workarounds, cost optimization | Advanced |
| 11 | Prompting Strategies | Chat templates, CoT mechanics, system prompts, structured output, prompt injection | Intermediate |
| 12 | Q&A Review Bank | 68+ Q&A pairs tagged Easy/Medium/Hard across all topics | All levels |

Path A: Beginner → Conceptual Understanding

  1. LLM Fundamentals - understand what LLMs are and how they generate text
  2. Transformer Architecture - understand the building blocks
  3. Attention Mechanisms - understand the core operation
  4. Model Architecture Types - understand the landscape
  5. Prompting Strategies - understand how to interact with models

Path B: Interview Preparation (Accelerated)

  1. LLM Fundamentals + Transformer Architecture in parallel
  2. Attention Mechanisms - very common in technical interviews
  3. KV Cache & Inference Optimization - increasingly asked in production roles
  4. Fine-Tuning - LoRA math, RLHF vs DPO
  5. GPU & Hardware - VRAM estimation questions are common
  6. Q&A Review Bank - drill all 68+ Q&A pairs

Path C: Production Engineering (Advanced)

  1. KV Cache & Inference Optimization
  2. GPU & Hardware
  3. Production Deployment
  4. Failure Modes & Tricky Issues

Next Topic

02 - Prompt Engineering