GPU and Hardware Considerations

VRAM Estimation

Concept

Before deploying or training any LLM, you must know whether it fits in your GPU(s). VRAM estimation is a fundamental interview skill.

Model weights memory:

weights_GB = num_parameters × bytes_per_parameter / (1024³)

Bytes per precision:
  FP32:  4 bytes
  BF16:  2 bytes  ← standard for training and most inference
  FP16:  2 bytes
  INT8:  1 byte
  INT4:  0.5 bytes
  NF4:   0.5 bytes

Quick mental math (BF16): - 7B model: 7 × 2 = 14 GB - 13B model: 13 × 2 = 26 GB - 70B model: 70 × 2 = 140 GB - 405B model: 405 × 2 = 810 GB → needs 10+ A100-80GB

Training overhead (Adam optimizer): Adam optimizer stores 3 copies per parameter: current weights (2 bytes BF16) + momentum (4 bytes FP32) + variance (4 bytes FP32) = 10 bytes per parameter.

Training VRAM ≈ 10 bytes × num_parameters
  7B model: 70 GB → needs 2× A100-40GB minimum with ZeRO
  70B model: 700 GB → needs 9+ A100-80GB

Additional VRAM components:

Component	Size	Notes
Model weights	params × bytes	Dominant
Optimizer states	2× params (Adam FP32)	Training only
Gradients	1× params	Training only
Activations	Seq_len × d_model × n_layers × batch	Cleared by grad checkpointing
KV cache	See formula in file 05	Inference only

Worked example — 70B inference on A100-80GB:

70B × 2 bytes (BF16) = 140 GB → needs 2× A100-80GB minimum
With 4-bit quantization: 70B × 0.5 bytes = 35 GB → fits on 1× A100-80GB

Quantization

Concept

Quantization reduces the numerical precision of model weights (and sometimes activations) to use fewer bits. The trade-off is reduced memory and compute cost vs potential quality degradation.

Post-Training Quantization (PTQ): Quantize a pretrained model without additional training. Fast and cheap; some quality loss.

Quantization-Aware Training (QAT): Simulate quantization during training so the model learns to compensate. Better quality but requires access to training pipeline.

INT8 — LLM.int8()

How it works: - Most weights are quantized to INT8 (1 byte) - "Emergent outliers" — a small fraction of activations with very large magnitude that distort quantization — are kept in FP16 and computed separately - This mixed-precision approach prevents the significant quality loss of naive INT8

Results: - ~2× memory reduction vs FP16 - Minimal quality degradation (< 1% on most benchmarks) - Some compute overhead from mixed-precision handling - Library: bitsandbytes (load_in_8bit=True)

GPTQ — GPU Post-Training Quantization

How it works: - Weight-only quantization (weights to INT4 or INT8, activations remain FP16) - Layer-by-layer quantization that minimizes the L2 reconstruction error of each layer's output - Uses the inverse Hessian of the weight matrix (second-order information) to find optimal rounding

Results: - INT4: ~4× memory reduction vs FP16 - Quality close to FP16 for most tasks - Slightly slower than NF4 for inference - Library: auto-gptq

AWQ — Activation-Aware Weight Quantization

How it works: - Observes that 1% of weight channels (those corresponding to large activation values) contribute disproportionately to reconstruction quality - Protects those important channels by scaling them before quantization - Result: better quality than GPTQ at the same bit-width

Results: - Better quality than GPTQ at INT4, especially for instruction following - Better inference speed than GPTQ (hardware-friendly kernel) - Library: autoawq

NF4 — NormalFloat4 (QLoRA)

How it works: - 4-bit format with bins placed at equal probability intervals of the standard normal distribution N(0,1) - LLM weights have a near-normal distribution → NF4 bins are optimally placed - Used exclusively for QLoRA base model storage; combined with FP16/BF16 LoRA adapters

Not intended for standalone inference — primarily a training-efficiency format.

Quantization Comparison Table

Method	Bits	Memory reduction	Quality	Speed	Library	Best for
BF16	16	1× (baseline)	Best	Good	PyTorch	Training, production inference
LLM.int8	8	~2×	Near-lossless	Slightly slower	bitsandbytes	When 4-bit is too lossy
GPTQ	4	~4×	Good	Fast (custom kernel)	auto-gptq	Production 4-bit inference
AWQ	4	~4×	Better than GPTQ	Fastest	autoawq	Best 4-bit quality+speed
NF4	4	~4×	Good	Medium	bitsandbytes	QLoRA fine-tuning
GGUF/Q4_K_M	4-5	~4×	Good	CPU-optimized	llama.cpp	Local/CPU inference

Interview Q: A 70B model in FP16 — minimum A100-80GB GPUs to run inference?
70B × 2 bytes = 140 GB. One A100-80GB has 80 GB → need at least 2 GPUs. With INT4 quantization (70B × 0.5 = 35 GB) → fits on 1 GPU.

Parallelism Strategies

Concept

For models that don't fit on a single GPU (or for faster training), you must distribute computation across multiple GPUs.

Data Parallelism (DDP)

How it works: - Each GPU holds a complete copy of the model - Different batches of data are processed on different GPUs - After the backward pass, gradients are averaged across all GPUs (all-reduce communication) - Each GPU updates its local copy with the averaged gradient

GPU 0: Model copy A | Batch 1 → grad_1
GPU 1: Model copy A | Batch 2 → grad_2
GPU 2: Model copy A | Batch 3 → grad_3
                           ↓ All-reduce
All GPUs: grad = (grad_1 + grad_2 + grad_3) / 3
Update: W = W - lr × grad

When to use: Model fits on a single GPU; want to increase batch size or training speed.

Communication cost: One all-reduce per training step, proportional to model size.

Tensor Parallelism (Megatron-LM style)

How it works: - Split individual weight matrices across GPUs along one dimension - Attention: Q, K, V, O projection matrices split along the head dimension - GPU 0: heads 1–16; GPU 1: heads 17–32 - FFN: split the intermediate dimension across GPUs - GPU 0: first half of d_ffn columns; GPU 1: second half

Linear(d_model, d_ffn):
  GPU 0: first d_ffn/2 columns → output_0
  GPU 1: last d_ffn/2 columns  → output_1
  All-reduce: output = concat(output_0, output_1)

Communication cost: One all-reduce per layer forward AND backward. High communication — requires NVLink for efficiency (PCIe is too slow).

When to use: Single layer too large for one GPU; NVLink interconnect available.

Pipeline Parallelism

How it works: - Split model layers across GPUs: GPU 0 runs layers 1–16, GPU 1 runs layers 17–32 - Each GPU processes its layers, passes activations to the next GPU - Micro-batching: split the batch into M micro-batches to keep all GPUs busy

Layers 1–16: GPU 0
Layers 17–32: GPU 1

Step 1: GPU 0 processes micro-batch 1, sends activations to GPU 1
Step 2: GPU 0 processes micro-batch 2 (while GPU 1 processes micro-batch 1)
...

"Bubble" problem: At the start and end of a batch, some GPUs are idle waiting for data. The bubble size = (num_GPUs - 1) / (num_micro_batches + num_GPUs - 1). More micro-batches = smaller bubble.

When to use: Very deep models; layer-granularity parallelism needed; lower communication bandwidth requirement than tensor parallelism (only activations cross GPU boundaries, not gradients).

ZeRO — Zero Redundancy Optimizer

Concept

ZeRO (Rajbhandari et al., 2019) eliminates redundant copies of optimizer states, gradients, and parameters across GPUs in data-parallel training. Three stages:

Stage 1 — Optimizer State Partitioning:
  Each GPU stores only 1/N of optimizer states (momentum, variance)
  Parameters and gradients: still replicated on all GPUs
  Memory reduction: 4× (optimizer states are typically 2× weights in Adam)

Stage 2 — Gradient Partitioning:
  Each GPU stores only 1/N of gradients during backward pass
  Parameters: still replicated
  Memory reduction: 8× vs DDP

Stage 3 — Parameter Partitioning:
  Each GPU stores only 1/N of parameters
  Parameters are gathered via all-gather as needed during forward/backward
  Memory reduction: 16× or more vs DDP — enables training models far larger than single-GPU memory

ZeRO-Infinity: Extends ZeRO-3 to offload to CPU RAM and NVMe SSDs — can train trillion-parameter models on limited GPU clusters.

ZeRO Stage	What's partitioned	Memory reduction	Communication overhead
0 (DDP)	Nothing	1×	1 all-reduce
1	Optimizer states	~4×	1 all-reduce + scatter
2	+ Gradients	~8×	Same as 1
3	+ Parameters	~16×	Higher (all-gather)

DeepSpeed is the primary library implementing ZeRO.

Flash Attention as Hardware Optimization

Concept

Flash Attention is primarily a hardware (GPU memory hierarchy) optimization — see Attention Mechanisms for the algorithm. Here is the hardware context:

GPU memory hierarchy:

Registers: ~256 KB, fastest (no latency)
L1 cache:  ~128 KB per SM
SRAM:      ~192 KB per SM on A100, ~20 TB/s bandwidth
HBM:       80 GB on A100, ~2 TB/s bandwidth — 10× slower than SRAM

Standard attention writes the n×n attention matrix to HBM (slow), reads it back for softmax (slow), writes softmax output (slow), reads for ×V (slow). Flash Attention keeps all intermediate results in SRAM by tiling — eliminates the HBM round-trips.

Why this matters at scale: - For 128K tokens: attention matrix = 128K × 128K × 2 bytes = 32 GB — impossible to hold in SRAM, Flash Attention never materializes it - Flash Attention makes long-context models practical, not just mathematically possible

PCIe vs NVLink

Concept

GPU-to-GPU communication speed determines how well parallelism strategies scale.

Interconnect	Bandwidth	Latency	Use
PCIe 4.0 ×16	~32 GB/s (bidirectional)	Medium	Consumer GPUs, budget clusters
NVLink 4.0	~900 GB/s (bidirectional)	Low	A100, H100 — production AI clusters
NVSwitch	~3.6 TB/s (all-to-all)	Very low	H100 SXM clusters

Practical impact: - Tensor parallelism requires all-reduce after every layer: high bandwidth demand. PCIe is the bottleneck — tensor parallelism scales poorly without NVLink. - Pipeline parallelism only passes activations at layer boundaries: lower bandwidth requirement — viable with PCIe. - A100 SXM4 (NVLink) vs A100 PCIe: tensor-parallel scaling efficiency drops from ~95% to ~50% at 8 GPUs.

Code

# VRAM estimation utility
def estimate_vram(
    num_params_billions,
    precision="bf16",
    mode="inference",
    kv_cache_tokens=4096,
    n_layers=32,
    n_kv_heads=8,
    d_head=128,
    batch_size=1
):
    """Estimate VRAM requirements in GB."""
    bytes_per_param = {"fp32": 4, "bf16": 2, "fp16": 2, "int8": 1, "int4": 0.5, "nf4": 0.5}
    bpp = bytes_per_param[precision]

    weight_gb = num_params_billions * 1e9 * bpp / 1e9

    if mode == "training":
        # Adam: weights (bpp) + momentum (4B) + variance (4B) = bpp + 8
        optimizer_factor = (4 + 4) / bpp  # extra FP32 copies
        total = weight_gb * (1 + optimizer_factor)
        return {"weights": weight_gb, "total_approx": total}
    else:
        kv_gb = (n_layers * kv_cache_tokens * 2 * n_kv_heads * d_head * 2 * batch_size) / 1e9
        return {"weights": weight_gb, "kv_cache": kv_gb, "total_approx": weight_gb + kv_gb}

# Examples
print("LLaMA-3 8B inference (BF16, 4K context):")
print(estimate_vram(8, "bf16", "inference"))

print("\nLLaMA-3 70B inference (INT4, 4K context):")
print(estimate_vram(70, "int4", "inference"))

print("\nLLaMA-3 7B training (BF16):")
print(estimate_vram(7, "bf16", "training"))

# BitsAndBytes quantization
from transformers import AutoModelForCausalLM, BitsAndBytesConfig
import torch

config_8bit = BitsAndBytesConfig(load_in_8bit=True)
config_4bit = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16,
    bnb_4bit_use_double_quant=True
)

# Load in 8-bit (comment out when running, requires GPU + large model)
# model_8bit = AutoModelForCausalLM.from_pretrained(
#     "meta-llama/Llama-3.2-1B",
#     quantization_config=config_8bit
# )
# print(f"8-bit model memory: {model_8bit.get_memory_footprint() / 1e9:.2f} GB")

# GPU memory profiling
if torch.cuda.is_available():
    print(f"GPU: {torch.cuda.get_device_name()}")
    print(f"Total VRAM: {torch.cuda.get_device_properties(0).total_memory / 1e9:.1f} GB")
    print(f"Available: {torch.cuda.memory_reserved(0) / 1e9:.2f} GB reserved")

Study Notes

Must-know for interviews: - VRAM for weights = params × bytes_per_precision (BF16=2, INT8=1, INT4=0.5) - Training VRAM ≈ 10× params in bytes (Adam optimizer stores FP32 momentum + variance + weights) - 70B in FP16 = 140 GB → needs 2× A100-80GB for inference; INT4 = 35 GB → 1 GPU - Quantization hierarchy: BF16 (best quality) → INT8 (near-lossless) → AWQ INT4 (recommended 4-bit) → NF4 (QLoRA only) - Data parallelism: replicate model, split batch; Tensor parallelism: split weight matrices; Pipeline: split layers - ZeRO-3: partition optimizer + gradients + parameters → 16× memory reduction, enables single-node training of very large models - NVLink is required for efficient tensor parallelism (PCIe is 28× slower than NVLink)

Quick recall Q&A: - A 13B model in BF16 — how much VRAM for inference? 13 × 2 = 26 GB → needs 1× A100-40GB or more. - What 3 things does Adam store per parameter? Current weights (BF16), first moment/momentum (FP32), second moment/variance (FP32). - What is the key difference between GPTQ and AWQ? AWQ uses activation statistics to identify and protect important weight channels before quantization — better quality at the same bit-width. - What does ZeRO Stage 3 partition? Optimizer states + gradients + parameters (model weights themselves). - Why does tensor parallelism require NVLink? It requires all-reduce communication after every layer (forward and backward) — PCIe bandwidth is a severe bottleneck at this frequency.