
Model Architecture Types

Encoder-Only Architecture

Concept

Encoder-only models use bidirectional self-attention — each token can attend to all other tokens simultaneously. There is no causal mask; the model sees the full context in both directions.

Training objective — Masked Language Modeling (MLM):

  • Randomly mask 15% of the input tokens
  • Train the model to predict the masked tokens from the surrounding context
  • This forces the model to learn bidirectional context
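The masking step can be sketched in a few lines of plain Python. This is a simplified toy — real BERT additionally replaces 10% of selected tokens with random tokens and leaves 10% unchanged, and works on subword IDs rather than words; `mlm_mask` is an illustrative helper, not any library's API:

```python
import random

MASK, MASK_PROB = "[MASK]", 0.15  # BERT masks ~15% of tokens

def mlm_mask(tokens, seed=1):
    """Return (masked_tokens, labels): labels hold the original token at
    masked positions and None elsewhere, mirroring the MLM loss targets."""
    rng = random.Random(seed)
    masked, labels = [], []
    for tok in tokens:
        if rng.random() < MASK_PROB:
            masked.append(MASK)
            labels.append(tok)   # the model must recover this from both sides
        else:
            masked.append(tok)
            labels.append(None)  # no loss at unmasked positions
    return masked, labels

tokens = "the cat sat on the mat".split()
masked, labels = mlm_mask(tokens)
print(masked)
```

Loss is computed only at the `None`-free positions, which is why MLM needs far more pretraining steps per token of signal than next-token prediction.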

Architecture specifics:

  • Input sequence → bidirectional attention → one contextual representation per token
  • No autoregressive generation — the model produces representations, not new tokens
  • Classification and span-extraction heads are added on top of the final hidden states

When to use encoder-only:

  • Text classification (sentiment analysis, intent detection): take the [CLS] token embedding → linear head
  • Named Entity Recognition (NER): classify each token's hidden state → per-token labels
  • Semantic similarity / embeddings: mean pool or [CLS] embedding → similarity search
  • Extractive QA (SQuAD): predict start/end token positions within the context
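The embeddings use case above reduces to pooling per-token hidden states into one fixed-size vector. A minimal NumPy sketch of attention-mask-aware mean pooling — `hidden` and `mask` are random stand-ins for a real encoder's last hidden states and tokenizer attention mask:

```python
import numpy as np

def mean_pool(hidden_states, attention_mask):
    """Average token vectors, ignoring padding positions.
    hidden_states: (seq_len, d_model); attention_mask: (seq_len,) of 0/1."""
    mask = attention_mask[:, None].astype(float)   # (seq_len, 1)
    summed = (hidden_states * mask).sum(axis=0)    # sum over real tokens only
    count = mask.sum()                             # number of real tokens
    return summed / count

rng = np.random.default_rng(0)
hidden = rng.normal(size=(6, 8))      # 6 tokens, d_model = 8
mask = np.array([1, 1, 1, 1, 0, 0])   # last two positions are padding
embedding = mean_pool(hidden, mask)
print(embedding.shape)                # (8,)
```

Skipping the mask (plain `.mean(axis=0)`) silently averages in padding vectors, a common bug that degrades retrieval quality.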

Key models:

| Model | Parameters | Context | Key innovation |
| --- | --- | --- | --- |
| BERT-base | 110M | 512 | Bidirectional MLM + NSP |
| BERT-large | 340M | 512 | Larger BERT |
| RoBERTa | 125M/355M | 512 | Removed NSP, more data, better MLM |
| DeBERTa-v3 | 183M | 512 | Disentangled attention, ELECTRA pretraining |
| DistilBERT | 66M | 512 | 40% smaller, 60% faster, 97% of BERT quality |
| BGE-M3 | ~560M | 8K | Multi-function embedding model |

Tricky Q: Can a BERT-family model generate text?
No — BERT was never trained to autoregressively predict the next token. Its output is a contextual representation, not a probability distribution over the next token. You can build a seq2seq encoder-decoder on top of a BERT encoder, but the encoder alone cannot generate.


Decoder-Only Architecture

Concept

Decoder-only models use causal (left-to-right) self-attention — each token can only attend to itself and all preceding tokens. The model is trained to predict the next token.

Training objective — Causal Language Modeling (CLM):

  • Given tokens t₁, t₂, ..., tₙ₋₁, predict tₙ
  • Loss is cross-entropy between the predicted next-token distribution and the token that actually follows
  • The simplicity of this objective is a key reason for the architecture's dominance

Why decoder-only won:

  1. Unified objective: CLM pretraining + instruction fine-tuning (SFT) + RLHF all use the same token prediction framework — no architectural changes between stages
  2. Natural generation: The architecture is inherently designed to generate; no adapter or cross-attention bridge needed
  3. Emergent reasoning: At scale, decoder-only models develop chain-of-thought reasoning by learning to produce intermediate "thinking" tokens
  4. In-context learning: Few-shot prompting works naturally by prepending examples before the query
  5. Scalability: Simple objective → easy to scale to trillions of training tokens

Architecture flow:

Input: "The cat sat on the"
→ Tokenize → Embed → [Transformer Block × N with causal mask] → Final LN → Linear → Softmax
→ P(next token) → sample → append → repeat
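The sample → append → repeat loop can be sketched with a toy stand-in for the model — here a hypothetical bigram lookup table replaces the transformer stack and softmax; the loop structure is what carries over to real decoding:

```python
# Toy stand-in for the transformer: maps the last token to its "most
# probable" next token. In a real model this is argmax/sampling over the
# softmax of the final linear layer.
NEXT = {"The": "cat", "cat": "sat", "sat": "on", "on": "the", "the": "mat"}

def generate(prompt_tokens, max_new_tokens=5, eos="mat"):
    tokens = list(prompt_tokens)
    for _ in range(max_new_tokens):   # repeat
        nxt = NEXT.get(tokens[-1])    # P(next token) → greedy pick
        if nxt is None:
            break                     # model has no continuation
        tokens.append(nxt)            # append
        if nxt == eos:                # stop condition
            break
    return tokens

print(" ".join(generate(["The", "cat", "sat", "on", "the"])))
# The cat sat on the mat
```

Note the O(n) sequential nature: each new token requires a fresh forward pass conditioned on everything generated so far, which is what KV caching exists to amortize.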

Key models:

| Model | Params | Context | Key innovation |
| --- | --- | --- | --- |
| GPT-2 | 1.5B | 1024 | First large-scale CLM demo |
| GPT-3 | 175B | 2048 | In-context learning at scale |
| GPT-4 / GPT-4o | ~1.8T (MoE, est.) | 128K | Multimodal, top-tier reasoning |
| LLaMA-2 | 7B–70B | 4K | Open weights, commercial license |
| LLaMA-3.1 | 8B–405B | 128K | GQA, strong open weights |
| Gemma 2 | 2B–27B | 8K | Google open model, interleaved local/global attention |
| Mistral 7B | 7B | 32K | GQA + sliding window attention |
| Phi-3-mini | 3.8B | 128K | High quality from small size |
| Falcon | 7B–180B | 2K–8K | Multi-Query Attention |

Encoder-Decoder Architecture

Concept

Encoder-decoder models have two components:

  1. Encoder: reads the input with bidirectional attention → produces context representations
  2. Decoder: generates output tokens autoregressively, attending both to its own previous output (causal self-attention) and to the encoder output (cross-attention)

Cross-attention:

Q = decoder hidden states (what the decoder wants to know)
K = V = encoder output    (what the input provides)
cross_attention_output = softmax(Q·Kᵀ / √d_k) · V

This allows the decoder to directly "look up" relevant parts of the input at each generation step.
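The formula above as a minimal NumPy sketch — single head, with the learned Q/K/V projection matrices omitted for clarity, so the decoder states and encoder outputs are used directly:

```python
import numpy as np

def cross_attention(dec_hidden, enc_output):
    """softmax(Q·Kᵀ/√d_k)·V with Q from the decoder, K = V from the encoder.
    dec_hidden: (tgt_len, d_k); enc_output: (src_len, d_k)."""
    Q, K, V = dec_hidden, enc_output, enc_output
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)                 # (tgt_len, src_len)
    scores -= scores.max(axis=-1, keepdims=True)    # numerical stability
    weights = np.exp(scores)
    weights /= weights.sum(axis=-1, keepdims=True)  # each row sums to 1
    return weights @ V                              # (tgt_len, d_k)

rng = np.random.default_rng(0)
dec = rng.normal(size=(3, 8))   # 3 decoder positions
enc = rng.normal(size=(5, 8))   # 5 encoder positions
out = cross_attention(dec, enc)
print(out.shape)                # (3, 8)
```

Each decoder position gets a convex combination of encoder vectors — exactly the "look up relevant parts of the input" behavior described above.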

Training objectives:

  • T5 (Span Corruption): Mask contiguous spans of input tokens (replace with single mask token), train decoder to reconstruct the spans
  • BART (Denoising): Corrupt input via masking, shuffling, deletion; train decoder to reconstruct the clean sequence
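T5's span corruption can be illustrated with the paper's classic example sentence. This is a simplified sketch: spans are hand-picked rather than randomly sampled, and `span_corrupt` is an illustrative helper, not T5's actual preprocessing code:

```python
def span_corrupt(tokens, spans):
    """T5-style span corruption (simplified).
    spans: non-overlapping, sorted (start, end) index pairs to mask.
    Each span becomes one sentinel token in the input; the target lists
    each sentinel followed by the tokens it hid."""
    corrupted, target, cursor = [], [], 0
    for i, (start, end) in enumerate(spans):
        sentinel = f"<extra_id_{i}>"   # T5's sentinel token naming
        corrupted += tokens[cursor:start] + [sentinel]
        target += [sentinel] + tokens[start:end]
        cursor = end
    corrupted += tokens[cursor:]
    return corrupted, target

tokens = "thank you for inviting me to your party last week".split()
inp, tgt = span_corrupt(tokens, [(2, 4), (7, 9)])
print(inp)  # ['thank', 'you', '<extra_id_0>', 'me', 'to', 'your', '<extra_id_1>', 'week']
print(tgt)  # ['<extra_id_0>', 'for', 'inviting', '<extra_id_1>', 'party', 'last']
```

The decoder only has to emit the short target sequence, not the full input — one reason span corruption trains efficiently compared to full reconstruction.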

When to use encoder-decoder:

  • Machine translation: long input → long output with reordering
  • Abstractive summarization: input document → shorter summary (not extractive)
  • Document-to-structured output: parse a document into a structured format
  • Question generation: answer + context → question

Key models:

| Model | Parameters | Key innovation |
| --- | --- | --- |
| T5-base/large | 220M/770M | Text-to-text unified framework |
| FLAN-T5 | 80M–11B | Instruction fine-tuned T5 |
| BART | 140M/400M | Denoising pretraining |
| mT5 | 300M–13B | Multilingual T5 |
| mBART | 610M | Multilingual denoising |

Tricky Q: Why did decoder-only models overtake encoder-decoder for summarization and translation?

Two reasons: (1) At sufficient scale, decoder-only models learn to produce appropriate-length summaries and translations via instruction fine-tuning, without the architectural inductive bias of cross-attention. (2) A single decoder-only model can handle many tasks via prompting, while encoder-decoder models require task-specific fine-tuning to perform well. Operational simplicity wins.


Architecture Comparison

Concept

| Dimension | Encoder-Only | Decoder-Only | Encoder-Decoder |
| --- | --- | --- | --- |
| Attention direction | Bidirectional | Causal (left-to-right) | Encoder: bidirectional; decoder: causal + cross |
| Training objective | MLM / ELECTRA | CLM (next token) | Denoising / span corruption |
| Can generate text? | No | Yes | Yes |
| Best for | Classification, NER, embeddings | Generation, reasoning, chat | Seq2seq tasks |
| In-context learning | Poor | Excellent | Moderate |
| Instruction fine-tuning | Awkward | Natural | Possible |
| Production dominance | Embedding models | General LLMs | Niche seq2seq |

Mixture of Experts (MoE)

Concept

MoE is an architectural modification to the FFN layer, not a new architecture class. Instead of one dense FFN per layer, the model has N "expert" FFNs and a learned router that selects top-K experts per token.

Standard FFN:
  token → single FFN → output

MoE FFN:
  token → router (softmax over N experts)
         → select top-K experts by router score
         → weighted sum of selected expert outputs

Key MoE concept — sparse activation:

  • Total parameters: N × (d_model × d_ffn) per layer — much larger than a dense model
  • Active parameters per token: only K experts run — much smaller computation
  • Example: Mixtral-8x7B has 8 experts per layer (≈ 47B total parameters — less than 8 × 7B because attention weights are shared across experts), but activates only 2 experts per token ≈ 13B active parameters
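A minimal NumPy sketch of the routing logic for a single token. This is illustrative only: real MoE layers batch tokens, use two-layer expert FFNs, and add load-balancing losses; `moe_ffn` and its shapes are assumptions for the demo:

```python
import numpy as np

def moe_ffn(x, W_router, experts, k=2):
    """Sparse MoE layer for one token (experts simplified to single matrices).
    x: (d_model,); W_router: (d_model, n_experts); experts: list of (d_model, d_model)."""
    logits = x @ W_router                         # router score per expert
    probs = np.exp(logits - logits.max())
    probs /= probs.sum()                          # softmax over N experts
    top_k = np.argsort(probs)[-k:]                # select top-K experts
    weights = probs[top_k] / probs[top_k].sum()   # renormalize over the chosen K
    # weighted sum of the K selected experts' outputs — the others never run
    return sum(w * (x @ experts[i]) for w, i in zip(weights, top_k))

rng = np.random.default_rng(0)
d, n = 8, 4   # d_model = 8, 4 experts, top-2 routing
x = rng.normal(size=d)
W_router = rng.normal(size=(d, n))
experts = [rng.normal(size=(d, d)) for _ in range(n)]
y = moe_ffn(x, W_router, experts, k=2)
print(y.shape)   # (8,)
```

Only `k` of the `n` expert matmuls execute per token — that gap between total and active parameters is the whole point of the architecture.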

MoE advantages:

  • Scale model capacity without a proportional compute increase
  • Different experts can specialize in different domains/languages/task types
  • Efficient inference for large models — only K of N expert FFNs are computed per token

MoE challenges:

  • Load balancing: without auxiliary constraints, the router collapses all tokens onto the same 1–2 experts
  • Communication overhead in distributed training (all-to-all for expert routing)
  • Memory savings apply only to FFN compute — the KV cache still scales with total layers

Models:

  • Mixtral-8x7B: 8 experts, top-2 routing, 47B total / 13B active
  • GPT-4: estimated to be a large MoE (not officially confirmed)
  • Switch Transformer (Google): pioneered MoE at scale with top-1 routing


Model Family Comparison Table

Concept

A reference table for major model families you'll encounter in interviews and production:

| Model family | Org | Type | Context | Architecture innovations |
| --- | --- | --- | --- | --- |
| GPT-2 | OpenAI | Decoder | 1K | First demo of large-scale CLM |
| GPT-3 | OpenAI | Decoder | 2K | 175B, in-context learning |
| GPT-4 / 4o | OpenAI | Decoder (MoE, est.) | 128K | Multimodal, top-tier reasoning |
| GPT-4.1 | OpenAI | Multimodal transformer | 1M | Extended context, code + general |
| LLaMA-2 | Meta | Decoder | 4K | Open weights, GQA (70B) |
| LLaMA-3 / 3.1 | Meta | Decoder | 8K / 128K | GQA at all sizes, up to 405B |
| LLaMA-4 Scout | Meta | MoE transformer | 10M | Ultra-long context, open weights |
| LLaMA-4 Maverick | Meta | MoE transformer | 1M | Balanced capability/cost |
| Gemma | Google | Decoder | 8K | Multi-query attention |
| Gemma 2 | Google | Decoder | 8K | GQA + local/global attention |
| Gemini 2.5 Pro | Google DeepMind | Multimodal transformer | 1M | Complex reasoning, multimodal |
| Mistral 7B | Mistral | Decoder | 32K | GQA + sliding window |
| Mixtral 8x7B | Mistral | Decoder (MoE) | 32K | Top-2 MoE, 47B total / 13B active |
| Mistral Magistral | Mistral | Dense/sparse variants | 128K–256K | Cost-efficient reasoning |
| Phi-3-mini | Microsoft | Decoder | 128K | 3.8B, textbook-quality data |
| Qwen 3 | Alibaba Cloud | Hybrid sparse MoE | 262K–1M | Multilingual, strong Asian-language coverage |
| Command A | Cohere | Retrieval-optimized decoder | 256K | RAG systems, enterprise search |
| BERT-base | Google | Encoder | 512 | MLM + NSP, bidirectional |
| RoBERTa | Meta | Encoder | 512 | Improved BERT training |
| DeBERTa-v3 | Microsoft | Encoder | 512 | Disentangled attention |
| T5 / FLAN-T5 | Google | Enc-dec | 512–2K | Text-to-text, instruction tuning |
| BART | Meta | Enc-dec | 1K | Denoising pretraining |

LLM Types by Modality

Concept

Beyond architecture type, LLMs are also categorized by the modalities they handle. This affects model selection, embedding strategy, and system design.

| Type | Description | Examples | Use cases |
| --- | --- | --- | --- |
| Text-only | Trained on text corpora only | GPT-3, LLaMA, BERT, Falcon | General NLP, summarization, generation |
| Multilingual | Trained on multilingual corpora | XLM-R, mT5, BLOOM, GPT-4 | Translation, cross-lingual search |
| Multimodal | Input: image/video/audio + text | GPT-4o, Gemini 1.5, Kosmos-2, LLaVA | Image captioning, audio Q&A, OCR, document understanding |
| Code | Specialized in programming languages | Codex, CodeLLaMA, CodeGemma, StarCoder | Code generation, completion, refactoring |
| Speech-text | Integrate speech recognition and synthesis | Whisper, SeamlessM4T | Transcription, speech translation |
| Image/vision | Image understanding and classification | ViT, ConvNeXT, DETR, Mask2Former | Object detection, segmentation, depth estimation |

Modality in system design: When building a multi-modal pipeline (e.g., processing scanned PDFs + structured data + free text), you must choose whether to use a single multi-modal foundation model (simplicity, one context) or multiple specialized models (higher per-task accuracy, more complex orchestration). For RAG over images, multimodal embeddings (CLIP, Gemini Embeddings) are required — text-only embeddings cannot capture visual content.

Models by specific task (quick reference):

| Model | Task |
| --- | --- |
| Wav2Vec2 | Audio classification, automatic speech recognition (ASR) |
| Vision Transformer (ViT), ConvNeXT | Image classification |
| DETR | Object detection |
| Mask2Former | Image segmentation |
| GLPN | Depth estimation |
| BERT | Text classification, token classification, question answering |
| GPT-2, LLaMA | Text generation |
| BART, T5 | Summarization and translation |
| Codex, CodeLLaMA | Code generation |
| Whisper | Speech-to-text transcription |

When to Choose Which Architecture

Concept

Use encoder-only (BERT/DeBERTa/BGE) when:

  • The task is purely classification, NER, or span extraction
  • You need high-quality fixed-size embeddings for search/RAG
  • Inference must be very fast (smaller models, no generation overhead)
  • Labels are sentence-level or token-level — not generative

Use decoder-only (LLaMA/Gemma/GPT) when:

  • The task requires free-form text generation
  • You want a single model for multiple tasks via prompting
  • You need in-context learning (few-shot examples in the prompt)
  • You plan to instruction fine-tune for custom tasks

Use encoder-decoder (T5/FLAN-T5/BART) when:

  • The task is explicitly seq2seq with fixed input and output schemas
  • You have limited compute — a smaller fine-tuned encoder-decoder can beat a large decoder-only model on specific structured tasks
  • Translation or summarization at scale, where encoder-decoder efficiency matters

Practical rule of thumb (2024): Default to decoder-only for new projects. Switch to encoder-only if you need embeddings or fast token classification. Encoder-decoder is a niche choice for legacy systems or very specific seq2seq workloads.

Code

from transformers import (
    AutoTokenizer, AutoModelForSequenceClassification,  # Encoder-only
    AutoModelForCausalLM,                               # Decoder-only
    AutoModelForSeq2SeqLM,                             # Encoder-Decoder
    pipeline
)

# --- Encoder-only: text classification ---
classifier = pipeline(
    "text-classification",
    model="distilbert-base-uncased-finetuned-sst-2-english"
)
result = classifier("This movie was absolutely fantastic!")
print(f"Encoder-only classification: {result}")

# --- Decoder-only: text generation ---
generator = pipeline(
    "text-generation",
    model="gpt2",  # small, runs locally
    max_new_tokens=50,
    temperature=0.7
)
result = generator("The transformer architecture revolutionized AI because")
print(f"\nDecoder-only generation: {result[0]['generated_text']}")

# --- Encoder-Decoder: summarization ---
summarizer = pipeline(
    "summarization",
    model="facebook/bart-large-cnn",
    max_length=60,
    min_length=20
)
article = """
The transformer architecture, introduced in the paper "Attention Is All You Need" by Vaswani 
et al. in 2017, revolutionized natural language processing. It replaced recurrent neural 
networks with a self-attention mechanism that could process all tokens simultaneously, 
enabling much faster training and better handling of long-range dependencies.
"""
result = summarizer(article)
print(f"\nEncoder-Decoder summary: {result[0]['summary_text']}")

# --- Checking architecture type programmatically ---
from transformers import AutoConfig

for model_name in ["bert-base-uncased", "gpt2", "facebook/bart-large"]:
    config = AutoConfig.from_pretrained(model_name)
    print(f"{model_name}: model_type={config.model_type}, architectures={config.architectures}")

8 Types of LLMs Used in AI Agents

Concept

Beyond the classic encoder/decoder taxonomy, modern AI agent systems draw from a richer vocabulary of specialized model types. These represent functional roles more than architecture classes.

| Type | Full name | What it is | Key examples | Primary use in agents |
| --- | --- | --- | --- | --- |
| GPT | Generative Pretrained Transformer | Standard autoregressive decoder-only LLM | GPT-4, LLaMA, Gemma | General reasoning, planning, text generation |
| MoE | Mixture of Experts | Sparse routing — only the top-K expert FFNs fire per token | Mixtral-8x7B, GPT-4 (est.), LLaMA 4 | Scale capacity without proportional compute cost |
| LRM | Large Reasoning Model | LLM fine-tuned (e.g. via GRPO/RL) to produce long chain-of-thought traces before answering | DeepSeek-R1, OpenAI o1/o3, QwQ | Complex math, coding, multi-step logical deduction |
| VLM | Vision-Language Model | Vision encoder (ViT) combined with a language model, trained on image-text pairs | GPT-4o, Gemini, LLaVA, Qwen-VL | Document parsing, image Q&A, multimodal agent perception |
| SLM | Small Language Model | Compact decoder-only model (1B–7B params) with GQA and efficient attention, deployable on edge/local hardware | Phi-3-mini (3.8B), Gemma 2B, TinyLLaMA | On-device inference, latency-critical agent sub-tasks |
| LAM | Large Action Model | LLM trained to produce structured actions (tool calls, API sequences, UI interactions), not just text | Browser-use, Claude Computer Use, OpenAI Operator | Agentic task execution, tool orchestration, UI automation |
| HLM | Hierarchical Language Model | Two-LLM system: an item/task-level LLM models specifics; a user-level LLM models behavior across sessions | Research architectures | Personalization, recommendation, sequential decision-making |
| LCM | Large Concept Model | Operates in a learned concept space rather than tokens: encode input into concepts, reason there, decode back | Meta's SONAR-based LCM | Language-agnostic reasoning, cross-lingual transfer |

Key distinctions for interviews:

  • LRM vs GPT: both are decoder-only, but an LRM is trained with RL to output explicit reasoning steps (think-then-answer); a standard GPT-style model answers directly.
  • VLM training stages: (1) pretrain the vision encoder and LM separately; (2) align vision→language on image-text pairs; (3) SFT on multi-task visual instruction data. A projection layer maps the ViT's features into the LM's embedding space.
  • SLM vs quantized LLM: SLMs are architecturally designed to be small (GQA, RMSNorm, SwiGLU) — not just a compressed large model.
  • LAM vs standard LLM with tools: a LAM is specifically fine-tuned on action trajectories; a standard LLM uses tools via prompt engineering. LAMs produce more reliable action sequences.
  • LCM: addresses the token-level limitation — instead of next-token prediction, LCMs predict the next concept embedding, making them language-agnostic by design.


Study Notes

Must-know for interviews:

  • Three architecture families: encoder-only (bidirectional, MLM, classification), decoder-only (causal, CLM, generation), encoder-decoder (cross-attention, seq2seq)
  • Decoder-only dominates production for general LLMs — a simple CLM objective plus instruction fine-tuning yields a powerful general-purpose model
  • BERT uses bidirectional attention — it cannot generate; best for embeddings and classification
  • MoE: sparse routing selects the top-K experts per token → large total params, small active params
  • LLaMA-3, Gemma 2, and Mistral all use GQA for efficient KV caching
  • When in doubt in 2024: reach for decoder-only (LLaMA-3, Gemma)

Quick recall Q&A:

  • Why can BERT not generate text? It was trained with MLM (predict masked tokens), not CLM (predict the next token) — it has no autoregressive generation mechanism.
  • What is MoE and what is the key tradeoff? Multiple expert FFNs with sparse routing — it scales capacity without proportional compute, but requires load balancing and incurs communication overhead in distributed settings.
  • Why did decoder-only beat encoder-decoder for summarization? At scale, with instruction fine-tuning, decoder-only generalizes across tasks without the architectural inductive bias; operational simplicity wins.
  • Name three decoder-only models and their context lengths: LLaMA-3.1 (128K), Gemma 2 (8K), Mistral 7B (32K).