Production Deployment

Inference Serving Frameworks

Concept

Running an LLM in production is fundamentally different from running inference in a notebook. Production requires: high throughput, low latency, batching, model management, and hardware efficiency. Specialized serving frameworks exist for this.

Framework overview:

Framework	Primary strengths	Best for
vLLM	Paged Attention, highest throughput, OpenAI-compatible API	High-throughput self-hosted serving
TGI (Text Generation Inference)	HuggingFace integration, Flash Attention, streaming	HuggingFace models, easy deployment
Triton Inference Server	NVIDIA, supports any model format, enterprise features	NVIDIA-heavy production, non-LLM workloads too
Ollama	Dead-simple local deployment, GGUF quantized models	Developer machines, prototyping
LMDeploy (turbomind)	High speed, GQA support, quantization	High throughput on Ascend/NVIDIA
SGLang	Structured generation, RadixAttention for caching	Complex generation pipelines, structured output

vLLM is the most commonly used in production open-source deployments as of 2024. It implements: - Paged Attention (see KV Cache) - Continuous batching - OpenAI-compatible API (drop-in replacement for OpenAI SDK calls) - Tensor parallelism for multi-GPU

Latency Budget

Concept

Understanding where latency comes from lets you optimize the right bottleneck.

Latency components:

Total latency = Queue wait + Prefill time + (N_output_tokens × per-token decode time)

Time-to-First-Token (TTFT) = Queue wait + Prefill time
Tokens-per-second (TPS)    = 1 / per-token-decode-time

Prefill time (processing input prompt): - Scales with prompt length - All input tokens are processed in parallel (like training) - Bottleneck: compute (matrix multiplications for all tokens simultaneously) - Typical: 1–5 seconds for 10K tokens on A100

Decode time (generating each output token): - Each token requires one full forward pass through the model - Must read the entire KV cache for all past tokens - Bottleneck: memory bandwidth (reading KV cache from HBM) - Typical: 20–50ms per token for 7B model on A100 → 20–50 TPS

Queue time: - If all GPU capacity is occupied with other requests, your request waits - Manageable with horizontal scaling and load balancing

SLO targets (typical): - User-facing chat: TTFT < 500ms, TPS > 15 tokens/second (below this feels slow) - Batch processing: throughput matters, latency is secondary - Streaming: TTFT is paramount — users see first token immediately

Batching Strategies

Concept

Static batching: Wait for N requests, process all together, return all results. - Pro: simple, maximum GPU utilization per batch - Con: short sequences wait for long ones; any finished sequence wastes its slot

Continuous batching (in-flight batching): Fill the batch dynamically: - After each decode step, evict any finished sequences - Immediately insert the next waiting request - GPU always runs at full batch capacity

Static (4 requests, max 10 tokens):
  Step 1–3:  [A B C D]   A, B finish at step 3
  Step 4–10: [_ _ C D]   2 empty slots wasted

Continuous:
  Step 1–3:  [A B C D]   A, B finish at step 3
  Step 4:    [E F C D]   E, F fill immediately
  Step 5–8:  [E F C D]   full batch throughout

Dynamic batching: Group incoming requests by similar sequence length to minimize padding waste.

Chunked prefill: For very long prompts, process the prefill in chunks to avoid monopolizing GPU during prefill (which would delay other decode steps in the continuous batch).

Caching Hierarchy

Concept

Multiple layers of caching are possible in LLM serving, each with different trade-offs.

Level 1 — KV Cache (per-request, in-GPU): - Stores computed K/V for all tokens in the active sequence - Eliminated by sequence completion; re-created per request - Critical for decode speed (see KV Cache)

Level 2 — Prefix Cache (across requests, in-GPU): - Reuses KV cache blocks for shared prompt prefixes across requests - High ROI for system prompts (often 500–2000 tokens, shared across all users) - vLLM automatic prefix caching: compute prefix KV once, share blocks across requests - Savings: a 1000-token system prompt with 1000 requests/minute = 1M prefill tokens/min saved

Level 3 — Semantic Cache (across requests, external): - Cache full responses for semantically similar (or identical) queries - Tools: GPTCache, Redis with embedding similarity search - Works for FAQ-style applications where many users ask near-identical questions - Does NOT help for unique/dynamic queries

Level 4 — Batch API (offline): - For non-real-time workloads, queue requests to be processed at off-peak hours - 50% cost reduction in OpenAI Batch API - Latency: hours, but acceptable for document processing, bulk analysis

Token Window Workarounds

Concept

When documents or conversations exceed the context window, several strategies exist beyond simply refusing to process them.

Strategy 1: Sliding Window with Overlap

Document: [chunk_1][chunk_2][chunk_3]...

Process:
  Pass 1: context = [chunk_1][chunk_2] → partial answer
  Pass 2: context = [overlap][chunk_2][chunk_3] → partial answer
  Pass 3: context = [overlap][chunk_3][chunk_4] → partial answer
Merge partial answers (map-reduce style)

Strategy 2: Hierarchical Summarization

Long document (500 pages):
  1. Split into 100 sections
  2. Summarize each section (100 LLM calls)
  3. Concatenate summaries → still may be long
  4. Summarize summaries (recursive)
  5. Final response from concise summary

Suitable for: document QA, long-form summarization, book analysis.

Strategy 3: RAG (Retrieval-Augmented Generation) Instead of loading the entire document, retrieve only relevant passages (see RAG section). This sidesteps the context window entirely for most questions.

Strategy 4: Context Compression - Use a smaller, faster model to compress/summarize less-relevant context - LLMLingua, AutoCompressor: compress prompt by 3–20× with minimal information loss - Selective context: use a classifier to identify irrelevant sentences and drop them

Strategy 5: Extended Context Models / RoPE Scaling

For models trained with RoPE positional encoding, you can extend the context window beyond training length by scaling the RoPE frequencies:

YaRN (Yet another RoPE extensioN):
  Multiply RoPE base frequency by a scaling factor
  LLaMA-3.1: trained at 8K, YaRN extends to 128K with quality

LongLoRA: 
  Fine-tune with sparse attention (shift-short-attention) on longer sequences
  Enables cheap context extension via fine-tuning

Trade-offs:

Approach	Latency	Quality	Cost	When to use
Sliding window	Medium	Good (if overlap right)	Medium	Sequential document processing
Hierarchical summarization	High	Lossy (compression artifacts)	High	Very long documents
RAG	Low	High (relevant context)	Low	Knowledge-base queries
Context compression	Low-medium	Good	Low	Short on context budget
Extended context model	Medium-high	Best	Hardware	When exact retrieval matters

Latency Optimization Tricks

Concept

A comprehensive list of techniques to reduce perceived and actual latency:

1. Quantization (2–4× throughput improvement)

BF16 → INT8: ~1.5× faster (memory bandwidth halved)
BF16 → INT4: ~2–4× faster (AWQ/GPTQ with optimized kernels)
Quality tradeoff: acceptable for most tasks at INT4 AWQ

2. Speculative Decoding (2–3× decode speedup) - Use a small draft model (1B) to propose K tokens - Verify with large model in one batch pass - Most effective for predictable/repetitive output (structured, code) - See KV Cache file for details

3. Prompt Caching / Prefix Sharing - Cache system prompt KV: saves reprocessing 1000-token system prompts per request - Anthropic, OpenAI, and vLLM all support this - Savings: ~90% TTFT reduction for requests where prefix is 90% of the prompt

4. Streaming Responses - Return the first token as soon as it's generated — don't wait for the full response - Reduces perceived latency even though total time is the same - Implementation: Server-Sent Events (SSE) or WebSocket

5. Smaller Models for Routing/Triage - Classify request complexity with a cheap small model (1B) - Route simple requests to a small model (7B), complex requests to large model (70B) - Cascade: try small model first → if low confidence → retry with large model

6. Flash Attention 2/3 - Replace standard attention with Flash Attention kernel - 2–4× faster for long sequences with no quality change

7. Continuous Batching - Already discussed — essential for production throughput

8. Tensor Parallelism Tuning - Split across 2 GPUs: ~1.8× faster (some communication overhead) - Split across 4 GPUs: ~3.2× faster - Beyond 8 GPUs: communication overhead often outweighs benefit for 7B models

9. CUDA Graphs (static shapes) - Capture the computation graph for fixed-size batches - Replay the same graph without Python overhead - Significant benefit for small batch sizes where CPU overhead is a bottleneck

Cost Optimization

Concept

At scale, LLM inference cost is significant. Key levers:

1. Prompt caching ROI:

Without caching:
  1M requests × 1000 token system prompt × $0.01/1K tokens = $10,000/day

With 90% cache hit rate:
  100K uncached × $0.01 + 900K cached × $0.001 = $1,900/day  (81% savings)

2. Smaller models for simple tasks: - Classify request complexity → route to appropriate model tier - 80% of requests may be satisfiable by a 7B model; 20% need 70B - Cost of 7B vs 70B inference: roughly 8–10× difference in throughput/GPU

3. Batch API for offline workloads: - Background document processing, embedding generation, bulk classification - OpenAI batch API: 50% discount for 24-hour turnaround - Self-hosted: run batch jobs during off-peak hours to maximize GPU utilization

4. Quantization: - INT4 inference: ~2–4× higher throughput per GPU → fewer GPUs needed - Break-even vs quality: most production tasks are acceptable at AWQ INT4

5. Output length control: - max_new_tokens: set tight bounds to prevent runaway generation - Structured output: constrain output to JSON/specific format → predictable shorter outputs

Monitoring and Observability

Concept

Key metrics to track in production LLM serving:

Latency metrics: - TTFT p50/p95/p99: time-to-first-token distribution - TPS p50/p95: tokens per second for decode phase - E2E latency p99: total time from request to response

Throughput metrics: - tokens_per_second_total: aggregate throughput across all requests - requests_per_second: request rate - queue_depth: number of waiting requests (early warning for capacity issues)

Quality metrics: - context_length_distribution: are requests using more context over time? - generation_length_distribution: output getting longer? (cost signal) - cache_hit_rate: prefix cache effectiveness

GPU metrics: - gpu_utilization: should be > 80% in healthy serving - gpu_memory_used: approaching limit → reduce batch size or add GPUs - kv_cache_utilization: paged attention's block utilization

Tools: - vLLM metrics endpoint: Prometheus-compatible /metrics - OpenTelemetry traces for distributed LLM pipelines - LangSmith / Langfuse for LLM-specific observability (prompt versions, output quality)

Code

# vLLM server setup and basic usage
# pip install vllm

# Start server (command line):
# python -m vllm.entrypoints.openai.api_server \
#   --model meta-llama/Llama-3.2-1B-Instruct \
#   --tensor-parallel-size 1 \
#   --max-model-len 8192 \
#   --enable-prefix-caching \
#   --quantization awq  # if using AWQ quantized model
#   --port 8000

# Client usage (OpenAI-compatible):
from openai import OpenAI
import time

client = OpenAI(base_url="http://localhost:8000/v1", api_key="placeholder")

# Standard chat completion
def chat_with_timing(messages, model="meta-llama/Llama-3.2-1B-Instruct"):
    start = time.time()
    first_token_time = None
    full_response = ""

    # Streaming to measure TTFT
    stream = client.chat.completions.create(
        model=model,
        messages=messages,
        max_tokens=200,
        stream=True,
        temperature=0.7,
    )

    for chunk in stream:
        if chunk.choices[0].delta.content:
            if first_token_time is None:
                first_token_time = time.time()
            full_response += chunk.choices[0].delta.content

    end = time.time()
    total_tokens = len(full_response.split())  # approximate

    print(f"TTFT: {(first_token_time - start)*1000:.0f}ms")
    print(f"Total time: {(end - start)*1000:.0f}ms")
    print(f"TPS (approx): {total_tokens / (end - first_token_time):.1f} tok/s")
    return full_response

result = chat_with_timing([
    {"role": "system", "content": "You are a helpful assistant."},
    {"role": "user", "content": "Explain transformers in 3 sentences."}
])

# Prefix caching benefit measurement
SYSTEM_PROMPT = "You are a helpful AI assistant specialized in machine learning. " * 50  # ~200 tokens

# First request: cache miss (slower)
t0 = time.time()
r1 = client.chat.completions.create(
    model="meta-llama/Llama-3.2-1B-Instruct",
    messages=[
        {"role": "system", "content": SYSTEM_PROMPT},
        {"role": "user", "content": "What is attention?"}
    ],
    max_tokens=100
)
print(f"First request (cache miss): {(time.time()-t0)*1000:.0f}ms")

# Second request: cache hit (faster prefill)
t0 = time.time()
r2 = client.chat.completions.create(
    model="meta-llama/Llama-3.2-1B-Instruct",
    messages=[
        {"role": "system", "content": SYSTEM_PROMPT},  # same prefix
        {"role": "user", "content": "What is a transformer?"}
    ],
    max_tokens=100
)
print(f"Second request (cache hit): {(time.time()-t0)*1000:.0f}ms")
# Should be significantly faster due to prefix caching

Study Notes

Must-know for interviews: - vLLM is the dominant open-source serving framework: Paged Attention + continuous batching + OpenAI-compatible API - Latency has two components: TTFT (prefill, compute-bound) and TPS (decode, memory-bandwidth-bound) - Prefix caching eliminates reprocessing shared system prompts — very high ROI for chatbots - Speculative decoding: small draft model proposes tokens → large model verifies in one pass → 2–3× speedup - Token window workarounds: sliding window, hierarchical summarization, RAG, LLMLingua compression, RoPE scaling - Cost optimization: prompt caching > model routing > quantization > batch API

Quick recall Q&A: - What is TTFT and what determines it? Time-to-first-token = queue wait + prefill compute time. Determined by input prompt length and server load. - Why is continuous batching better than static batching? Completed sequences are evicted immediately and new requests fill their slots — no wasted GPU capacity waiting for stragglers. - What is chunked prefill? Processing long prompts in chunks to avoid monopolizing the GPU during prefill, which would delay decode steps for other in-flight requests. - When should you use RAG vs sliding window? RAG when you can identify relevant content via search. Sliding window when you must process a document sequentially without a query. - Name 3 ways to improve TTFT. Prefix caching (eliminate redundant prefill), quantization (faster compute), reduce prompt length (LLMLingua compression).