Fine-Tuning LLMs

Fine-Tuning Taxonomy

Concept

Fine-tuning adapts a pretrained base model for specific tasks or behaviors. The spectrum ranges from minimal (few-shot prompting) to maximal (full parameter updates).

Increasing parameter modification:
←─────────────────────────────────────────────────────────────→
Prompting   Prompt Tuning   Prefix Tuning   LoRA   Full Fine-Tune
(0 params)  (few params)    (more params)  (small %)  (all params)

Key decision: fine-tune vs RAG vs prompt engineering?

Approach	Best for	Cost	Data needed
Prompt engineering	Behavior change, style, instruction framing	Free	None
RAG	Adding new factual knowledge	Medium	Documents
Fine-tuning	Consistent format/style, task specialization	High	Labeled examples
Full fine-tuning	Complete behavior overhaul, new domain	Very high	Large labeled dataset

Rule of thumb: Try prompting first, RAG second, fine-tuning only when both are insufficient.

Supervised Fine-Tuning (SFT)

Concept

SFT trains the model on labeled (input, output) pairs using the standard CLM loss. The model learns to produce the desired output format and style.

Instruction dataset format:

{
  "instruction": "Summarize this article in 2 sentences.",
  "input": "The transformer architecture was introduced...",
  "output": "Transformers use self-attention mechanisms...\nThis architecture now dominates NLP."
}

Critical technique — loss masking:

During SFT, you only compute the loss on the completion tokens (the output), not the instruction tokens. This is crucial:

Full sequence: [INST]Summarize this.[/INST] The model should focus on...
Loss mask:     [  0  ]  0  0   0  [  0  ]  1   1     1     1   1  ...

Why? You want the model to learn to generate good outputs, not to memorize the instruction phrasing. Computing loss on the instruction would also wastefully push the model toward strange completions of instruction fragments.

Tricky Q: Why do you mask the instruction tokens during SFT loss computation?
If you include instruction tokens in the loss, you're computing gradients to "predict" arbitrary instruction text — but instructions vary across examples and have no consistent pattern to learn. More importantly, you want to optimize output quality, not instruction completion. Masking ensures gradients only flow from the target output.

SFT data quality > quantity: - 1,000 high-quality diverse instruction-response pairs often outperform 100,000 noisy ones (LIMA paper, 2023) - Diversity matters: different tasks, domains, lengths, and formats - Consistency matters: the output should reflect the persona and format you want the model to learn

RLHF — Reinforcement Learning from Human Feedback

Concept

SFT teaches the model to produce outputs matching a dataset, but datasets can't capture all preferences. RLHF fine-tunes for human preference directly.

The RLHF pipeline:

Stage 1: Supervised Fine-Tuning (SFT)
  Base model → SFT on instruction dataset → SFT model

Stage 2: Reward Model Training
  Human annotators compare pairs of outputs: (response A vs response B) → prefer A
  Train a reward model: input = (prompt, response) → output = scalar reward score

Stage 3: Reinforcement Learning (PPO)
  Use PPO to fine-tune the SFT model to maximize reward
  Constraint: KL divergence from SFT model (prevents reward hacking)
  result_policy = argmax_θ [ E[reward(response)] - β * KL(π_θ || π_SFT) ]

Why RLHF is hard: - Reward models are imperfect proxies for human preference — they can be gamed - PPO is unstable: too many updates → reward hacking (model exploits reward model flaws) - Expensive: requires thousands of human preference annotations - Distribution shift: fine-tuned model wanders from the SFT distribution → capability regression

DPO — Direct Preference Optimization

Concept

DPO (Rafailov et al., 2023) achieves the same goal as RLHF but without training a separate reward model or running PPO. It reformulates the preference optimization problem into a direct supervised loss.

Key insight: The optimal policy under the RLHF objective has a closed-form relationship with the reward function. DPO rearranges this to optimize directly on preference pairs without explicitly computing rewards.

DPO loss:

L_DPO = -E[(chosen, rejected)] [ log σ( β * log(π_θ(y_w|x)/π_ref(y_w|x)) 
                                       - β * log(π_θ(y_l|x)/π_ref(y_l|x)) ) ]

Where:
  y_w = preferred (winning) response
  y_l = rejected (losing) response  
  π_ref = reference SFT model (frozen)
  β = temperature controlling how far policy deviates from reference

In plain English: increase the log-probability of preferred responses relative to the reference model; decrease the log-probability of rejected responses — all in one supervised loss without RL.

Why DPO is preferred in 2024: - No separate reward model training - No PPO instability - Single training stage (like SFT) - Competitive or superior results on most benchmarks - Used by: LLaMA-3, Gemma 2 instruction tuning, Mistral models

Method	Reward model?	RL training?	Stability	Used by
RLHF/PPO	Yes (separate)	Yes (PPO)	Difficult	GPT-3.5, early ChatGPT
DPO	No	No (supervised)	Easy	LLaMA-3, Gemma 2
ORPO	No	No	Easiest	Some recent models
RFT	Reward signal	No PPO loop	Moderate	Gemini, open LLMs
GRPO	Generative RL	Yes (GRPO)	Good at scale	DeepSeek-R1, reasoning models

RFT and GRPO — Modern Alignment Alternatives

Concept

RFT (Reward Function Fine-Tuning): Trains the model directly with reward signals without running the full PPO loop. Simpler than RLHF and often more stable, though less expressive than full RL. Best used when you want alignment improvements with less complexity than RLHF.

GRPO (Group Relative Policy Optimization): A generative RL algorithm that optimizes model policies with better sample efficiency than PPO. Instead of a value function, GRPO estimates baselines from groups of sampled outputs — making it more stable and scalable to long-sequence generation.

GRPO vs PPO key difference:
PPO: needs a value function (critic) estimated for each token
GRPO: samples a group of outputs, uses relative reward within the group as baseline
      → eliminates the need for a separate critic model
      → more stable gradient estimates for long responses

Why GRPO matters for reasoning models: - DeepSeek-R1 (2025) used GRPO to train long chain-of-thought reasoning — the model learns to produce extended reasoning traces that score higher on math/code benchmarks - GRPO's group-relative baseline naturally handles sparse rewards (e.g., "is the final answer correct?") which PPO struggles with at long horizons - Enables "thinking out loud" behavior where intermediate steps are rewarded

Method	Reward model needed?	Stability	Best for
RLHF/PPO	Yes (separate)	Difficult	Helpfulness + safety alignment
DPO	No	Easy	Preference alignment from pairs
RFT	Reward signal only	Moderate	Structured response format alignment
GRPO	Reward function	Good	Reasoning, long-form generation, math

Parameter-Efficient Fine-Tuning (PEFT)

Concept

Full fine-tuning updates all model parameters. For a 7B model, that's 7 billion gradients, a full copy of optimizer states, etc. PEFT methods update a tiny fraction of parameters.

Why not full fine-tuning? 1. Cost: Adam optimizer states = 3× model size in memory (momentum + variance + params) 2. Catastrophic forgetting: Full updates overwrite general capabilities 3. Storage: Each fine-tuned variant requires saving the full model 4. Composability: Hard to combine multiple task-specific fine-tunes

LoRA — Low-Rank Adaptation

Concept

LoRA (Hu et al., 2021) is the dominant PEFT method. The key insight: the change in weights during fine-tuning has low intrinsic rank — it can be approximated by two small matrices.

The math:

During full fine-tuning:
  W_new = W_0 + ΔW    (W_0 frozen, ΔW = same shape as W_0)

LoRA approximation:
  ΔW ≈ B × A    where B ∈ ℝ^{d×r}, A ∈ ℝ^{r×k}, rank r << min(d,k)

Forward pass:
  output = x·W_0 + x·(B·A) = x·W_0 + x·ΔW

W_0 (base model weights): frozen — never updated
A (right matrix): initialized with random Gaussian
B (left matrix): initialized with zeros → ΔW = B×A = 0 at start → training starts from the base model behavior
r (rank): hyperparameter, typically 4–64. Lower rank = fewer parameters, potentially lower quality

Parameter savings:

Full weight: d × k (e.g., 4096 × 4096 = 16.8M)
LoRA:        r × k + d × r = r × (d + k) (e.g., r=16: 16 × 8192 = 131K)
Reduction:   16.8M → 131K = 128× fewer parameters to train

LoRA is typically applied to: Q and V projection matrices (most impactful), sometimes also K, O, and FFN layers.

Scaling parameter alpha (α):

output = x·W_0 + (α/r) × x·(B·A)

α controls the effective learning rate of the LoRA update. Often set to r (so α/r = 1) or 2r.

QLoRA — Quantized LoRA

Concept

QLoRA (Dettmers et al., 2023) enables fine-tuning very large models on consumer hardware by: 1. Quantizing the base model to NF4 (4-bit NormalFloat) — reduces base model memory by 4× 2. Training LoRA adapters in BF16 — the adapters are small, full-precision 3. Double quantization — quantize the quantization constants themselves for extra savings

Why NF4? LLM weights have a distribution close to normal. NF4 places quantization bins at equal probability intervals of the normal distribution — better coverage of likely weight values than uniform INT4.

VRAM comparison for fine-tuning LLaMA-3 8B:

Full fine-tuning (BF16):    8B × 2 bytes = 16 GB weights
                            + 3× for optimizer = 48 GB total ≈ 4× A100-40GB

LoRA (BF16 base):           16 GB weights + ~0.3 GB adapters + optimizer ≈ 24 GB
                            Fits on 1× A100-40GB

QLoRA (NF4 base + BF16 LoRA): 8B × 0.5 bytes = 4 GB weights 
                                + BF16 adapters ≈ 6 GB total
                                Fits on 1× RTX 4090 (24 GB) ✓

Code

# QLoRA fine-tuning with PEFT + BitsAndBytes
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig
from peft import LoraConfig, get_peft_model, TaskType
import torch

# Step 1: Load base model in NF4 (4-bit quantization)
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_use_double_quant=True,   # quantize quantization constants
    bnb_4bit_quant_type="nf4",        # NormalFloat4
    bnb_4bit_compute_dtype=torch.bfloat16  # compute in BF16
)

model = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Llama-3.2-1B",
    quantization_config=bnb_config,
    device_map="auto"
)

tokenizer = AutoTokenizer.from_pretrained("meta-llama/Llama-3.2-1B")
tokenizer.pad_token = tokenizer.eos_token

# Step 2: Add LoRA adapters
lora_config = LoraConfig(
    r=16,                       # rank
    lora_alpha=32,              # scaling = alpha/r = 2
    target_modules=["q_proj", "v_proj", "k_proj", "o_proj"],  # which layers
    lora_dropout=0.1,
    bias="none",
    task_type=TaskType.CAUSAL_LM
)

model = get_peft_model(model, lora_config)
model.print_trainable_parameters()
# Output: trainable params: X | all params: 1B | trainable%: ~0.5%

# Step 3: Training loop (simplified)
from transformers import TrainingArguments, Trainer

training_args = TrainingArguments(
    output_dir="./qlora_output",
    num_train_epochs=3,
    per_device_train_batch_size=4,
    gradient_accumulation_steps=4,  # effective batch = 4×4 = 16
    learning_rate=2e-4,
    fp16=False,         # can't use fp16 with NF4 loaded model
    bf16=True,
    logging_steps=10,
    save_strategy="epoch",
    optim="paged_adamw_32bit",  # memory-efficient optimizer for QLoRA
)

Multi-Head Fine-Tuning

Concept

Multi-head fine-tuning uses a shared encoder backbone with multiple task-specific output heads. This allows a single model to perform several tasks with minimal parameter overhead.

Architecture:

Input text
    ↓
Shared BERT/RoBERTa/LLaMA encoder
    ↓ hidden states
    ├── [CLS] → Classification head → Intent label
    ├── Token states → NER head → Entity labels per token  
    └── [CLS] → QA head → Start/End logit positions

Implementation pattern:

import torch
import torch.nn as nn
from transformers import AutoModel

class MultiTaskModel(nn.Module):
    def __init__(self, model_name, num_classes_intent, num_entity_types, hidden_size=768):
        super().__init__()
        self.encoder = AutoModel.from_pretrained(model_name)

        # Task head 1: sequence classification (intent)
        self.classification_head = nn.Sequential(
            nn.Dropout(0.1),
            nn.Linear(hidden_size, num_classes_intent)
        )

        # Task head 2: token classification (NER)
        self.ner_head = nn.Sequential(
            nn.Dropout(0.1),
            nn.Linear(hidden_size, num_entity_types)
        )

        # Task head 3: extractive QA (start/end positions)
        self.qa_head = nn.Linear(hidden_size, 2)  # 2 = start + end logits

    def forward(self, input_ids, attention_mask, task="classification"):
        outputs = self.encoder(input_ids=input_ids, attention_mask=attention_mask)
        hidden_states = outputs.last_hidden_state  # [batch, seq, hidden]
        cls_embedding = hidden_states[:, 0, :]     # [batch, hidden] — CLS token

        if task == "classification":
            return self.classification_head(cls_embedding)
        elif task == "ner":
            return self.ner_head(hidden_states)    # per-token logits
        elif task == "qa":
            logits = self.qa_head(hidden_states)   # [batch, seq, 2]
            return logits[:, :, 0], logits[:, :, 1]  # start, end
        else:
            raise ValueError(f"Unknown task: {task}")

# Multi-task training strategy
model = MultiTaskModel("bert-base-uncased", num_classes_intent=10, num_entity_types=9)

# Option 1: Joint training — interleave batches from all tasks
# Option 2: Sequential training — train task 1 then task 2 (catastrophic forgetting risk!)
# Option 3: Task-specific learning rates
optimizer = torch.optim.AdamW([
    {"params": model.encoder.parameters(), "lr": 2e-5},      # small lr for shared encoder
    {"params": model.classification_head.parameters(), "lr": 1e-4},  # larger for heads
    {"params": model.ner_head.parameters(), "lr": 1e-4},
    {"params": model.qa_head.parameters(), "lr": 1e-4},
])

Multi-task interference mitigation: - Gradient surgery (PCGrad): Projects gradients from one task onto the perpendicular of conflicting gradients from another task — reduces destructive interference - Task-specific adapters: Freeze the shared backbone and train separate LoRA adapters per task — no interference possible - Careful task weighting: Weight task losses to prevent one task dominating gradient updates - Sample difficulty: Easy tasks can hurt hard tasks if they dominate the batch — use balanced sampling

PEFT Methods Comparison

Method	Trainable params	Quality	Memory	When to use
Full fine-tune	100%	Best	Very high	Plenty of data + compute
LoRA (r=16)	~0.5–1%	Near full	Medium	Most fine-tuning tasks
QLoRA (NF4)	~0.5–1% LoRA	Near LoRA	Low	Limited GPU (single 24GB)
Prefix tuning	~0.1%	Good for generation	Low	Domain-specific generation
Prompt tuning	~0.01%	Acceptable	Minimal	Very limited compute
Adapters	~1–5%	Good	Medium	Multi-task with swappable modules

Choosing the Right Fine-Tuning Method

Use Case	Recommended Method
Lightweight tuning for multiple tasks	LoRA / QLoRA / Adapter Tuning
Personalization or named entity learning	DreamBooth / Textual Inversion (multimodal)
Teaching specific structured response formats	SFT + RFT
Aligning with human preferences	SFT → Reward Modeling → RLHF / GRPO
Tuning on large domain corpus without labels	Continual Pretraining
High compute, max performance task adaptation	Full Fine-Tuning
Few-shot, low-resource setups	Prefix Tuning / LoRA
Fast prototyping with prebuilt tools	PEFT (Hugging Face library)
Long-chain reasoning (math, code)	GRPO-based RL fine-tuning

Continual Pretraining (distinct from instruction fine-tuning): Resume language model training on unlabeled domain-specific corpus. Captures domain jargon and writing style without requiring labeled examples. Risk: forgetting general language knowledge. Used when you want the model to "speak the language" of your domain before task-specific instruction tuning.

Fine-Tuning Evaluation

Concept

After fine-tuning, you need to verify that quality improved on the target task without degrading general capability. Evaluation spans multiple dimensions.

Category	Method	What It Measures	Tools
Quantitative Metrics	Accuracy, F1, BLEU, ROUGE, Perplexity	Basic task correctness or fluency	`evaluate`, scikit-learn, sacrebleu, rouge-score, bert_score
LLM-as-a-Judge	GPT-4/Gemini comparison & scoring	Human-like eval for quality, factuality, tone	TruLens, promptfoo, LangSmith
Embedding Similarity	BERTScore, Cosine similarity	Semantic similarity of output to ground truth	bert_score, sentence-transformers
Prompt-Based Unit Tests	Pass/Fail output checks	Regression testing on curated prompts	promptfoo, LangSmith, Evals-as-Code
RAG-Specific Metrics	Faithfulness, Context Recall, Precision	Groundedness of RAG output in source	RAGAS, TruLens, LangChain evals
Human Feedback	Thumbs-up/down, rating	Real-world helpfulness and satisfaction	LangSmith, TruLens, custom dashboards
Live Monitoring	Latency, fail rates, usage stats	Operational metrics in production	OpenTelemetry, MLflow, Weights & Biases

Code snippet — ROUGE + BERTScore evaluation:

import evaluate

rouge = evaluate.load("rouge")
bertscore = evaluate.load("bertscore")

predictions = ["The model predicts the next token using attention."]
references = ["LLMs use attention mechanisms to predict the next token."]

print(rouge.compute(predictions=predictions, references=references))
print(bertscore.compute(predictions=predictions, references=references, lang="en"))

Regression testing: Always keep a fixed "golden set" of prompts and expected behaviors. After each fine-tuning run, check that scores on this set don't degrade — catches catastrophic forgetting before deployment.

Study Notes

Must-know for interviews: - SFT loss is masked on instruction tokens — gradients only from output tokens - RLHF: reward model + PPO; DPO: direct preference on pairs without RL — DPO is now dominant - LoRA: ΔW ≈ BA where rank r << min(d,k); base model is frozen; ~0.5% trainable params at r=16 - QLoRA = NF4 quantized base + BF16 LoRA adapters → enables 7B fine-tuning on a 24GB GPU - Multi-head fine-tuning: shared encoder + task-specific heads; task interference → use gradient surgery or per-task LoRA adapters - When to fine-tune: task-specific format/style that prompting can't achieve consistently

Quick recall Q&A: - Why mask instruction tokens in SFT? To only optimize on output quality — instructions vary per example with no consistent target distribution. - What is the LoRA initialization strategy and why? B=zeros, A=random → ΔW=BA=0 at init → training starts exactly from base model behavior. - What is NF4? 4-bit NormalFloat — bins placed at equal probability intervals of a normal distribution, optimal for LLM weights. - What is the KL penalty in RLHF PPO? KL(π_θ || π_SFT) penalizes deviating too far from the SFT model — prevents reward hacking. - Why is DPO easier than RLHF? No separate reward model training; no PPO instability; single supervised training stage.