LLM Fundamentals
What Is a Large Language Model
Concept
A Large Language Model is a neural network trained to predict the next token given a sequence of preceding tokens. That is the entire job description. Every capability — summarization, coding, reasoning, translation — emerges from doing this one thing at massive scale on diverse data.
This framing matters for interviews. An LLM is not:
- A knowledge base (it encodes statistical patterns, not facts with citations)
- A reasoning engine (it simulates reasoning by predicting tokens that look like reasoning)
- A retrieval system (it cannot look things up unless given tools or RAG)
The correct mental model: an LLM is a compressed probabilistic model of human text. It has processed a significant fraction of human-written text and learned what tokens tend to follow other tokens across contexts. When you prompt it, you are sampling from that distribution conditioned on your input.
Why "large"? The term is relative. In 2018, BERT-large (340M parameters) was "large." By 2024, the baseline is 7B–70B+ parameters. What scale provides:
- More parameters → more capacity to memorize patterns and relationships
- More training tokens → better generalization and knowledge coverage
- The relationship is governed by scaling laws (see Training and Pretraining)
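Autoregressive generation: At inference time, the model generates one token at a time, appending each new token to the context and running another forward pass. In practice a KV cache reuses the attention keys and values of earlier tokens, so each step only computes the new token. Generation is still slow because every token depends on all the tokens before it, so output tokens cannot be produced in parallel.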
Input: "The capital of France is"
Step 1: model predicts " Paris" (highest probability) → appended
Step 2: model predicts "." → appended
Step 3: model predicts end-of-sequence token → stop
Output: "The capital of France is Paris."
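The loop above can be sketched in a few lines. This is a minimal illustration, assuming a hypothetical toy "model" (`TOY_LOGITS`, a lookup table keyed on the last token only); a real LLM conditions on the entire context and has a vocabulary of tens of thousands of tokens.

```python
import math

# HYPOTHETICAL toy model: maps the last token to logits for the next token.
# A real LLM conditions on the whole context, not just the last token.
TOY_LOGITS = {
    " is": {" Paris": 5.0, " Lyon": 1.0, ".": 0.5},
    " Paris": {".": 4.0, " and": 1.0},
    ".": {"<eos>": 6.0},
}


def next_token_probs(context):
    """Softmax over the toy logits for the last token in the context."""
    logits = TOY_LOGITS[context[-1]]
    z = sum(math.exp(v) for v in logits.values())
    return {tok: math.exp(v) / z for tok, v in logits.items()}


def generate(context, max_new_tokens=10):
    """Greedy autoregressive decoding: one token per step, appended to the context."""
    context = list(context)
    for _ in range(max_new_tokens):
        probs = next_token_probs(context)
        token = max(probs, key=probs.get)  # greedy: pick the most likely token
        if token == "<eos>":
            break
        context.append(token)  # the new token becomes input to the next step
    return context


tokens = generate(["The", " capital", " of", " France", " is"])
print("".join(tokens))
```

Each iteration needs the token chosen in the previous one, which is the sequential dependency that prevents parallel generation.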
Key Concepts: Tokens, Context, Temperature
Tokens
Concept
LLMs operate on tokens — subword units produced by a tokenizer — not characters or words. Understanding tokens is critical for cost estimation, context window management, and debugging unexpected model behavior.
How BPE tokenization works:
1. Start with all individual characters as the vocabulary
2. Repeatedly merge the most frequent adjacent pair into a single new token
3. Continue until the vocabulary reaches the target size (typically 32K–200K tokens)
4. The result: common words become single tokens; rare words split into subword pieces
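The merge procedure can be sketched on a tiny corpus. This is a simplified illustration, not a production tokenizer: it splits on whitespace, works on characters rather than bytes, and stops after a fixed number of merges instead of a target vocabulary size. The function names are made up for this example.

```python
from collections import Counter


def most_frequent_pair(words):
    """Count adjacent symbol pairs across all words, weighted by word frequency."""
    pairs = Counter()
    for symbols, freq in words.items():
        for a, b in zip(symbols, symbols[1:]):
            pairs[(a, b)] += freq
    return max(pairs, key=pairs.get) if pairs else None


def merge_pair(words, pair):
    """Replace every occurrence of `pair` with one merged symbol."""
    merged = {}
    for symbols, freq in words.items():
        out, i = [], 0
        while i < len(symbols):
            if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == pair:
                out.append(symbols[i] + symbols[i + 1])
                i += 2
            else:
                out.append(symbols[i])
                i += 1
        key = tuple(out)
        merged[key] = merged.get(key, 0) + freq
    return merged


def train_bpe(corpus, num_merges):
    """Learn `num_merges` merge rules starting from single characters."""
    words = dict(Counter(tuple(w) for w in corpus.split()))
    merges = []
    for _ in range(num_merges):
        pair = most_frequent_pair(words)
        if pair is None:
            break
        words = merge_pair(words, pair)
        merges.append(pair)
    return merges, words


merges, words = train_bpe("low low low lower lowest", num_merges=2)
print(merges)
print(words)
```

After two merges the frequent word "low" is a single symbol, while the rarer "lower" and "lowest" remain split into "low" plus leftover characters.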
Practical rules of thumb:
- English text: ~1 token ≈ 4 characters ≈ 0.75 words
- Code: more tokens per line (brackets, indentation, and symbols are expensive)
- Non-Latin scripts (Chinese, Arabic): often only 1–3 characters per token (less efficient than English)
- Numbers: digits are often split into single-digit or short multi-digit tokens, depending on the tokenizer, which is a source of arithmetic failures
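For quick budgeting, the 4-characters-per-token rule can be turned into a rough estimator. This is a sketch only: the function names are invented here, the price passed in is a HYPOTHETICAL placeholder rather than any provider's real rate, and the estimate should be replaced by a real tokenizer count whenever accuracy matters.

```python
def estimate_tokens(text, chars_per_token=4.0):
    """Rough token estimate for English prose using the ~4 characters/token rule."""
    return max(1, round(len(text) / chars_per_token))


def estimate_cost_usd(n_tokens, usd_per_million_tokens):
    """Cost of n_tokens at a given price per million tokens."""
    return n_tokens / 1_000_000 * usd_per_million_tokens


prompt = "Summarize the following report in three bullet points. " * 100
n = estimate_tokens(prompt)
# HYPOTHETICAL price, for illustration only
print(n, estimate_cost_usd(n, usd_per_million_tokens=1.0))
```

For code or non-Latin text, a smaller `chars_per_token` gives a more realistic estimate.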
Why tokenization causes reasoning failures:
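- "9.11 > 9.9": a tokenizer that splits digits individually produces ["9", ".", "1", "1"] and ["9", ".", "9"]. The model sees token sequences, not numbers, so it cannot easily compare by place value (tokenizers that group digits, for example "11" as one token, cause similar problems)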
- "50,000" vs "50000" may tokenize differently, causing inconsistency
- Words split across tokens can confuse rhyming and spelling tasks
Token boundary quiz (common interview trap):
- Q: How many tokens is "ChatGPT is great"?
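A: Typically 5 tokens with GPT-style BPE tokenizers: ["Chat", "G", "PT", " is", " great"]. "ChatGPT" splits because the string was rare in the data the tokenizer was trained on. The exact count depends on the tokenizer, so name the tokenizer when you answer.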
Code
import tiktoken
from transformers import AutoTokenizer
# cl100k_base: the tiktoken encoding used by GPT-4 and GPT-3.5-turbo
# (Claude models use a different tokenizer)
enc = tiktoken.get_encoding("cl100k_base")
text = "The tokenization of language models is surprisingly tricky."
tokens = enc.encode(text)
print(f"Token count: {len(tokens)}")
print(f"Tokens: {[enc.decode([t]) for t in tokens]}")
# Show the arithmetic problem
number_text = "Is 9.11 greater than 9.9?"
print(f"\n'{number_text}'")
print(f"Tokens: {[enc.decode([t]) for t in enc.encode(number_text)]}")
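# Inspect the printed split: digit grouping depends on the tokenizer
# (cl100k_base can merge up to three digits into a single token)
# Compare across model families
# Note: meta-llama checkpoints are gated; this call needs approved Hugging Face access
llama_tok = AutoTokenizer.from_pretrained("meta-llama/Llama-3.2-1B")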
sample = "Tokenization affects everything downstream."
print(f"\nLLaMA-3 tokens: {llama_tok.tokenize(sample)}")
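# encode() also adds special tokens (such as begin-of-text) by default,
# so this count can exceed len(llama_tok.tokenize(sample))
print(f"LLaMA-3 count: {len(llama_tok.encode(sample))}")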
Context Window
Concept
The context window is the maximum number of tokens an LLM can process in a single forward pass — input + output combined. It is a hard architectural limit.
Why context windows are bounded:
- Self-attention computes pairwise relationships between every pair of tokens: O(n²) compute, and O(n²) memory if the score matrix is materialized
- A 128K-token context requires 128K × 128K ≈ 16.4 billion attention scores per head per layer
- This is why extending context is expensive; FlashAttention avoids storing the full score matrix (memory becomes linear in n) but compute stays quadratic, and other architectural tricks mitigate but don't eliminate the cost
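The quadratic growth is easy to check with arithmetic. The sketch below assumes the full score matrix is materialized in 16-bit precision (2 bytes per value) for a single head in a single layer; both assumptions are for illustration, and real kernels such as FlashAttention avoid storing this matrix.

```python
def attention_scores(n_tokens):
    """Number of pairwise attention scores for one head in one layer."""
    return n_tokens * n_tokens


def attention_matrix_bytes(n_tokens, bytes_per_value=2):
    """Memory needed if the full score matrix were stored (assumes fp16/bf16)."""
    return attention_scores(n_tokens) * bytes_per_value


for n in (4_000, 32_000, 128_000):
    gib = attention_matrix_bytes(n) / 2**30
    print(f"{n:>7} tokens: {attention_scores(n):>15,} scores, {gib:8.2f} GiB per head per layer")
```

Growing the context from 4K to 128K tokens (32×) multiplies the number of scores by 1,024.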
Context window sizes (representative, 2024):
| Model | Context Window |
|---|---|
| GPT-3.5-turbo | 16K |
| GPT-4 Turbo / GPT-4o | 128K |
| Claude 3.5 Sonnet | 200K |
| LLaMA-3.1 8B/70B | 128K |
| Gemini 1.5 Pro | 1M |
| Gemma 2 | 8K |
| Mistral 7B | 32K |
The dirty secret: A large context window ≠ uniform retrieval quality. The "lost in the middle" problem (see Failure Modes) means models attend poorly to tokens buried in the middle of very long contexts. Putting the most important information at the beginning or end of the context is a practical mitigation.
Context is shared between input and output. A 128K window with a 100K-token prompt leaves only 28K tokens for the response.
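A small guard like the following makes the shared budget explicit before a request is sent. It is a sketch with an invented helper name; many hosted APIs also enforce a separate, smaller cap on output tokens, which this does not model.

```python
def remaining_output_budget(context_window, prompt_tokens, reserved_tokens=0):
    """Tokens left for the response after the prompt and any reserved overhead."""
    remaining = context_window - prompt_tokens - reserved_tokens
    if remaining < 0:
        raise ValueError(
            f"Prompt exceeds the context window by {-remaining} tokens"
        )
    return remaining


print(remaining_output_budget(context_window=128_000, prompt_tokens=100_000))
```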
Context window scaling approaches — how modern models extend context:
| Approach | Description | Used by |
|---|---|---|
| Fixed Context | Choose max tokens at model design time (e.g., 2K, 8K) | Early GPTs, BERT |
| Sliding Window / Chunking | Split input into overlapping windows (e.g., 512 tokens with 128 overlap) | RAG / long-doc QA |
| Adaptive Context | Dynamically select relevant chunks based on task | Attention routing systems |
| Sparse Attention | Each token attends to a subset of positions (a local window plus a few global tokens), reducing the O(n²) cost | Longformer, BigBird |
| Memory-efficient Transformers | Linear attention or recurrent state instead of full attention | Performer, RWKV, Mamba |
| Recurrence / Caching | Reuse past hidden states or cached keys/values instead of recomputing them | Transformer-XL, RWKV; KV caching in most decoder-only inference stacks |
| Retriever-Augmented (RAG) | Retrieve external chunks to keep active context short and focused | All RAG systems |
| Compressed Context | Summarize long text, feed a compact version to the model | Hybrid long-doc systems |
| RoPE Interpolation | Rescale rotary position embeddings so the model handles positions beyond its original training length | LLaMA 3.1, long-context fine-tunes of LLaMA 2 (Position Interpolation, YaRN) |
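The sliding-window row in the table above (512 tokens with 128 overlap) can be sketched as a simple chunker. This is a minimal version operating on a list of token IDs; the function name is invented for this example, and real pipelines often chunk on sentence or section boundaries instead.

```python
def chunk_tokens(tokens, window=512, overlap=128):
    """Split a token list into overlapping windows; the stride is window - overlap."""
    if not 0 <= overlap < window:
        raise ValueError("overlap must be non-negative and smaller than window")
    stride = window - overlap
    chunks = []
    for start in range(0, len(tokens), stride):
        chunks.append(tokens[start:start + window])
        if start + window >= len(tokens):
            break  # the last window already reaches the end of the input
    return chunks


token_ids = list(range(1000))  # stand-in for real token IDs
chunks = chunk_tokens(token_ids)
print([(c[0], c[-1]) for c in chunks])
```

The overlap means a sentence that straddles a boundary appears whole in at least one chunk, at the cost of processing some tokens twice.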
Context window sizes in current models (2025/2026):
| Model | Context Length | How it's achieved |
|---|---|---|
| GPT-3.5 | 4K / 16K | Not publicly disclosed |
| GPT-4 Turbo / GPT-4.1 | 128K / 1M | Not publicly disclosed |
| Claude 3 / Sonnet 4 | 200K | Not publicly disclosed |
| Gemini 2.5 Pro | 1M | Details not public; long-context training |
| LLaMA 4 Scout | 10M | MoE with interleaved attention layers (some without positional embeddings) plus long-context training |
| Mistral 7B | 32K (tested 64K) | Sliding-window attention in v0.1; larger RoPE base in later versions |
| Longformer | 4K–16K | Sparse local + global attention |
| RWKV | Unbounded in principle (streaming) | RNN-like recurrence with a fixed-size state |
| Mamba | Unbounded in principle | Attention-free state-space model with linear-time scaling |
Practical guidelines — choosing context window strategy:
| Use Case | Recommended Strategy |
|---|---|
| Small chatbot (FAQ, support) | 2K–4K context is sufficient |
| Long documents (legal, medical, policy) | 8K–32K + retrieval or summarization |
| Summarization of books or legal docs | RAG + chunking + re-ranking |
| Code generation | 8K–16K (code needs larger context for multi-file awareness) |
| Training from scratch | Trade off context size vs GPU memory (O(n²) attention cost) |
| Local inference on laptop | Prefer 2K–4K with quantization |
Sampling Parameters
Concept
When the model predicts the next token, it produces a probability distribution over its entire vocabulary (e.g., 32K–200K tokens). Sampling parameters control how you draw from that distribution.
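Temperature divides the logits $z_i$ by $T$ before the softmax:

$$p_i = \frac{\exp(z_i / T)}{\sum_j \exp(z_j / T)}$$

Lower $T$ sharpens the distribution toward the most likely token; higher $T$ flattens it.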
| Temperature | Effect | Use case |
|---|---|---|
| 0.0 | Greedy: always pick argmax | Deterministic tasks, factual QA |
| 0.1–0.5 | Sharp, conservative | Code generation, structured output |
| 0.7–1.0 | Balanced creativity | General chat, creative writing |
| 1.5–2.0 | Very random, often incoherent | Brainstorming, diversity sampling |
Tricky Q: Is temperature=0 identical to greedy decoding?
In the limit, yes: as T→0, softmax concentrates all mass on the argmax. Computing softmax(logits/0) literally would divide by zero, so most APIs special-case temperature=0 as argmax. The result is deterministic in principle, although hosted APIs can still show small run-to-run differences from floating-point and batching effects.
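A minimal sketch of temperature scaling, including the temperature=0 special case discussed above. It uses plain Python lists for clarity; the function name is chosen for this example and is not any library's API.

```python
import math


def softmax_with_temperature(logits, temperature):
    """Turn logits into probabilities after dividing by the temperature."""
    if temperature < 0:
        raise ValueError("temperature must be non-negative")
    if temperature == 0:
        # Special case: avoid division by zero and return a one-hot argmax.
        best = max(range(len(logits)), key=lambda i: logits[i])
        return [1.0 if i == best else 0.0 for i in range(len(logits))]
    scaled = [z / temperature for z in logits]
    m = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(s - m) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]


logits = [2.0, 1.0, 0.1]
for t in (0, 0.5, 1.0, 2.0):
    print(t, [round(p, 3) for p in softmax_with_temperature(logits, t)])
```

As the temperature rises, probability mass moves from the top token toward the rest of the distribution.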
Top-p (Nucleus Sampling): Consider only the smallest set of tokens whose cumulative probability ≥ p, then renormalize and sample from that nucleus.
Sort tokens by P descending: [0.4, 0.2, 0.15, 0.1, 0.05, ...]
top_p=0.9: include tokens until cumulative sum ≥ 0.9 → [0.4, 0.2, 0.15, 0.1, 0.05] (sum=0.9)
Sample from these 5 tokens (renormalized)
Top-p adapts: when the model is confident (peaked distribution), the nucleus is tiny; when uncertain, it's larger.
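Nucleus sampling can be sketched as a filter followed by a weighted draw. The token names and probabilities below are made up for illustration, and the helper names are not from any library.

```python
import random


def top_p_filter(probs, top_p):
    """Keep the smallest high-probability set whose cumulative mass reaches top_p."""
    ranked = sorted(probs.items(), key=lambda kv: kv[1], reverse=True)
    nucleus, cumulative = [], 0.0
    for token, p in ranked:
        nucleus.append((token, p))
        cumulative += p
        if cumulative >= top_p:
            break
    total = sum(p for _, p in nucleus)
    return {token: p / total for token, p in nucleus}  # renormalized


def sample_top_p(probs, top_p, rng=random):
    """Draw one token from the renormalized nucleus."""
    nucleus = top_p_filter(probs, top_p)
    return rng.choices(list(nucleus), weights=list(nucleus.values()), k=1)[0]


uncertain = {"a": 0.4, "b": 0.2, "c": 0.15, "d": 0.1, "e": 0.05, "f": 0.05, "g": 0.05}
confident = {"a": 0.97, "b": 0.02, "c": 0.01}
print(list(top_p_filter(uncertain, 0.8)))  # several tokens survive
print(list(top_p_filter(confident, 0.8)))  # only the dominant token survives
print(sample_top_p(uncertain, 0.8))
```

The same `top_p` yields a nucleus of a different size for each distribution, which is the adaptivity described above.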
Top-k: Restrict to the k highest-probability tokens before sampling. Less adaptive than top-p.
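For contrast, a top-k filter keeps a fixed number of tokens regardless of how the probability mass is spread. This is a sketch with invented names and made-up probabilities.

```python
def top_k_filter(probs, k):
    """Keep the k most likely tokens and renormalize their probabilities."""
    ranked = sorted(probs.items(), key=lambda kv: kv[1], reverse=True)[:k]
    total = sum(p for _, p in ranked)
    return {token: p / total for token, p in ranked}


confident = {"a": 0.97, "b": 0.02, "c": 0.01}
flat = {"a": 0.3, "b": 0.25, "c": 0.25, "d": 0.2}
print(top_k_filter(confident, 2))  # still keeps an unlikely second token
print(top_k_filter(flat, 2))       # drops two plausible candidates
```

With a peaked distribution, k=2 admits a token the model considers very unlikely; with a flat one, it discards reasonable options. That is the sense in which top-k is less adaptive.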
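Repetition penalty: Rescales the logits of tokens that have already appeared. In the common CTRL-style formulation (also used by Hugging Face Transformers), with penalty $\theta$ applied to each token $i$ already in the context:

$$z_i' = \begin{cases} z_i / \theta & \text{if } z_i > 0 \\ z_i \cdot \theta & \text{otherwise} \end{cases}$$

Values > 1.0 penalize repetition. Without this, greedy decoding often collapses into loops.

| Parameter | Effect | Typical Range |
|---|---|---|
| `temperature` | Distribution sharpness | 0.0–2.0 |
| `top_p` | Nucleus size | 0.7–1.0 |
| `top_k` | Hard token cutoff | 10–100 |
| `repetition_penalty` | Loop prevention | 1.0–1.3 |
| `max_new_tokens` | Output length cap | task-specific |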
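The repetition penalty can be sketched as follows, assuming the CTRL-style rule: positive logits are divided by the penalty and negative logits are multiplied by it, so a repeated token always becomes less likely. The function name is chosen for this example.

```python
def apply_repetition_penalty(logits, generated_ids, penalty=1.2):
    """Lower the logits of tokens that already appear in the generated sequence."""
    adjusted = list(logits)
    for token_id in set(generated_ids):
        if adjusted[token_id] > 0:
            adjusted[token_id] /= penalty  # shrink positive logits
        else:
            adjusted[token_id] *= penalty  # push negative logits further down
    return adjusted


logits = [2.0, -1.0, 0.5]          # one logit per vocabulary entry
already_generated = [0, 1, 0]      # token IDs 0 and 1 have appeared
print(apply_repetition_penalty(logits, already_generated, penalty=2.0))
```

Dividing a negative logit would raise it toward zero and make the token more likely, which is why the two cases are handled differently.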
Types of LLMs
Concept
LLMs are categorized by their training architecture and objective. Three core types exist, with a clear market winner in 2024.
1. Decoder-Only (Causal LM)
   - Architecture: transformer with causal (left-to-right) attention masking
   - Training objective: predict next token (Causal Language Modeling, CLM)
   - Examples: GPT family, LLaMA, Gemma, Mistral, Phi, Falcon
   - Best for: text generation, chat, reasoning, code, general-purpose tasks
   - Dominant in production: has displaced other architectures for most tasks
2. Encoder-Only (Masked LM)
   - Architecture: transformer with bidirectional attention (sees full context)
   - Training objective: predict masked tokens (Masked Language Modeling, MLM)
   - Examples: BERT, RoBERTa, DeBERTa, DistilBERT
   - Best for: text classification, NER, semantic similarity, embeddings
   - Not generative: cannot produce free-form text, only classify or embed
3. Encoder-Decoder (Seq2Seq)
   - Architecture: encoder produces context representations; decoder attends to encoder output via cross-attention
   - Training objective: span corruption (T5), denoising (BART)
   - Examples: T5, BART, mT5, FLAN-T5, mBART
   - Best for: translation, summarization, structured prediction (document → output)
   - Still used for seq2seq tasks, but decoder-only models have largely caught up via prompting
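The difference between causal and bidirectional attention comes down to the attention mask. A minimal sketch, where 1 means "position i may attend to position j" (the helper names are invented for this example):

```python
def causal_mask(n):
    """Decoder-style mask: each position sees itself and earlier positions only."""
    return [[1 if j <= i else 0 for j in range(n)] for i in range(n)]


def bidirectional_mask(n):
    """Encoder-style mask: every position sees every other position."""
    return [[1] * n for _ in range(n)]


for row in causal_mask(4):
    print(row)
print()
for row in bidirectional_mask(4):
    print(row)
```

The lower-triangular causal mask is what lets a decoder-only model train on next-token prediction at every position in parallel without seeing the answer.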
Why decoder-only won:
1. Unified training objective (CLM): no need for masked prediction or denoising design choices
2. Generation is natural: the architecture is designed for it
3. Instruction fine-tuning (SFT + RLHF) works exceptionally well on top of CLM pretraining
4. Emergent in-context learning scales with model size
Mixture of Experts (MoE), sometimes listed as a fourth category: It is not a separate architecture but a modification that applies to the types above. Instead of one dense FFN per layer, the layer holds N "expert" FFNs and a router that selects a small number of experts (typically 1–2) per token. Result: a large total parameter count but only a fraction active per token (Mixtral-8x7B: 47B total, ~13B active).
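Top-k routing can be sketched with scalar "experts". This is a toy illustration of one common variant (softmax over the selected experts' router logits); the experts here are HYPOTHETICAL scalar functions standing in for full FFN blocks, and real routers also add load-balancing terms during training.

```python
import math


def route_top_k(router_logits, k=2):
    """Pick the k highest-scoring experts and softmax-normalize their weights."""
    top = sorted(range(len(router_logits)), key=lambda i: router_logits[i], reverse=True)[:k]
    m = max(router_logits[i] for i in top)
    exps = {i: math.exp(router_logits[i] - m) for i in top}
    total = sum(exps.values())
    return {i: e / total for i, e in exps.items()}


def moe_layer(x, experts, router_logits, k=2):
    """Weighted sum of the selected experts' outputs; other experts are never run."""
    weights = route_top_k(router_logits, k)
    return sum(w * experts[i](x) for i, w in weights.items())


# HYPOTHETICAL experts: each just scales its input by a different factor.
experts = [lambda x, s=s: s * x for s in (1.0, 2.0, 3.0, 4.0)]
router_logits = [0.1, 2.0, -1.0, 2.0]  # per-token scores produced by the router
print(route_top_k(router_logits))
print(moe_layer(3.0, experts, router_logits))
```

Only 2 of the 4 experts run for this token, which is how total parameter count and per-token compute are decoupled.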
Full architecture deep dive: see Transformer Architecture and Model Architecture Types.
Open Source vs Proprietary
Concept
| Dimension | Open Source (LLaMA-3, Gemma, Mistral) | Proprietary (GPT-4o, Claude 3.5, Gemini 1.5) |
|---|---|---|
| Per-token cost at scale | Infrastructure cost only | Provider billing (can be high at volume) |
| Data privacy | Runs on your infra; no data leaves | Data sent to provider API |
| Customization | Full fine-tuning, quantization, architecture changes | Limited fine-tuning API at extra cost |
| Frontier capability | Lags by ~6–12 months | Cutting-edge models |
| Ops burden | You manage GPUs, scaling, updates, failures | Zero ops — just call an API |
| Latency control | Full control; can optimize aggressively | Variable, dependent on provider load |
| Compliance/audit | Can inspect weights and pipeline | Black box; trust provider's policies |
Choose open source when:
- Strict data residency (GDPR, HIPAA, SOC 2, financial PII)
- High-volume workloads where per-token cost dominates
- Fine-tuning on proprietary data you cannot share with an external provider
- Research requiring reproducibility or custom modifications

Choose proprietary when:
- Rapid prototyping where GPU ops is not your core competency
- Tasks requiring frontier-quality reasoning (frontier open models are ~6–12 months behind)
- Low-volume usage where managed reliability > cost optimization
- Multimodal tasks (vision, audio) where open models still lag
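The "per-token cost dominates" criterion can be made concrete with a break-even estimate. This is a rough sketch: every number below is a HYPOTHETICAL placeholder, not a real price, and it ignores engineering time, utilization, and quality differences.

```python
def monthly_api_cost(tokens_per_month, usd_per_million_tokens):
    """Provider bill for a given monthly token volume."""
    return tokens_per_month / 1_000_000 * usd_per_million_tokens


def break_even_tokens_per_month(fixed_infra_usd_per_month, usd_per_million_tokens):
    """Monthly volume at which self-hosting costs the same as the API."""
    return fixed_infra_usd_per_month / usd_per_million_tokens * 1_000_000


# HYPOTHETICAL inputs: $10,000/month for GPUs and ops, $5 per million API tokens.
threshold = break_even_tokens_per_month(10_000.0, 5.0)
print(f"Break-even at about {threshold:,.0f} tokens per month")
```

Below the threshold the API is cheaper; above it, the fixed infrastructure cost is amortized over enough tokens for self-hosting to win.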
Study Notes
Must-know for interviews:
- LLM = next-token predictor; all capabilities emerge from this at scale
- ~1 token ≈ 4 characters / 0.75 words in English; code and non-Latin scripts use more tokens
- Context window = hard limit on input + output tokens; attention cost grows as O(n²); large windows exist but quality degrades in the middle
- Temperature=0 ≈ greedy; top-p nucleus sampling is preferred for creative tasks
- Three architecture types: decoder-only (dominant, generative), encoder-only (BERT, classification), encoder-decoder (seq2seq)
- Open source = data control + cost at scale; proprietary = ops simplicity + frontier capability
Quick recall Q&A:
- What does temperature=0 produce? Deterministic greedy output: always the highest-probability next token.
- Why can't LLMs reliably do arithmetic? Digits tokenize individually; no native integer arithmetic, only learned statistical patterns over digit sequences.
- What limits context window size? Attention is O(n²): quadratic memory and compute in sequence length.
- What is a token? A subword unit produced by BPE/SentencePiece; typically 3–5 characters in English.
- Why did decoder-only win? Unified CLM objective, natural generation, instruction fine-tuning works well, scales with emergent in-context learning.