System Design: Simple RAG Pipeline
Domain: Enterprise Knowledge Base · Pattern: Naive → Production-Grade RAG
Interview Problem Statement
"Design a production-grade RAG system that enables employees to query a company's internal knowledge base of 5 million documents (PDFs, Word, HTML) and receive accurate, cited answers in under 3 seconds at 1,000 concurrent users."
Clarifying Questions
| Question | Why It Matters |
|---|---|
| What document types? (PDF, Word, HTML, scanned images?) | Determines parsers needed — scanned PDFs require OCR (Document AI) |
| Expected query types? (factual lookup, comparison, summarization?) | Factual → top-3 chunks sufficient; summarization → may need 20+ chunks |
| What is the acceptable hallucination rate? (0%? <5%?) | Drives need for guardrails, grounding checks, citation enforcement |
| Does freshness matter? (new docs within minutes, hours, or daily batch?) | Real-time → streaming ingestion; batch is fine → scheduled pipelines |
| Is PII present in documents? | Drives redaction pipeline before chunking and logging controls |
| Multi-language support needed? | Affects embedding model choice (multilingual models vs. per-language indexes) |
| Is the query population internal employees or external users? | Security perimeter, auth model, and rate limiting design |
System Architecture Overview
┌──────────────────────────────────────────────────────────────────────────────┐
│ INGESTION PIPELINE │
│ │
│ Cloud Storage (raw docs) │
│ │ │
│ ▼ │
│ Cloud Run: Document Parser │
│ ┌─────────────────────────────────────────────────────────┐ │
│ │ PDF → Document AI (OCR + layout) │ │
│ │ Word/HTML → python-docx / BeautifulSoup │ │
│ │ Output: cleaned text + metadata JSON │ │
│ └─────────────────────────────────────────────────────────┘ │
│ │ │
│ ▼ │
│ Cloud Tasks Queue (back-pressure management) │
│ │ │
│ ▼ │
│ Cloud Run: Chunker + Embedder │
│ ┌─────────────────────────────────────────────────────────┐ │
│ │ Semantic chunking (512 tokens, 10% overlap) │ │
│ │ Vertex AI text-embedding-005 → 768-dim vectors │ │
│ │ Metadata: doc_id, chunk_id, source_url, date, section │ │
│ └─────────────────────────────────────────────────────────┘ │
│ │ │
│ ├─────────────────────┬──────────────────────────────┤ │
│ ▼ ▼ ▼ │
│ Vertex AI Vector AlloyDB (BM25 BigQuery (metadata │
│ Search (ANN) sparse index) + eval logging) │
└──────────────────────────────────────────────────────────────────────────────┘
┌──────────────────────────────────────────────────────────────────────────────┐
│ QUERY PIPELINE │
│ │
│ User Query │
│ │ │
│ ▼ │
│ API Gateway → Cloud Run: Query Service │
│ ┌─────────────────────────────────────────────────────────┐ │
│ │ 1. Auth check (IAP or OAuth2) │ │
│ │ 2. Query embedding (text-embedding-005) │ │
│ │ 3. Check Memorystore (Redis) semantic cache │ │
│ └─────────────────────────────────────────────────────────┘ │
│ │ (cache miss) │
│ ▼ │
│ ┌─────────────────────────────────────────────────────────┐ │
│ │ HYBRID RETRIEVAL │ │
│ │ │ │
│ │ Dense: Vertex AI Vector Search → top-20 chunks │ │
│ │ Sparse: AlloyDB full-text (BM25) → top-20 chunks │ │
│ │ Fusion: Reciprocal Rank Fusion → top-5 │ │
│ │ Rerank: Vertex AI Reranker (cross-encoder) → top-3 │ │
│ └─────────────────────────────────────────────────────────┘ │
│ │ │
│ ▼ │
│ ┌─────────────────────────────────────────────────────────┐ │
│ │ GENERATION │ │
│ │ │ │
│ │ Prompt assembly: system + context + query │ │
│ │ Vertex AI Gemini 1.5 Pro → answer + citations │ │
│ │ Grounding check: Vertex AI Grounding API │ │
│ │ Citation extraction: source_url per claim │ │
│ └─────────────────────────────────────────────────────────┘ │
│ │ │
│ ▼ │
│ Response → User + log to BigQuery (for eval/feedback) │
└──────────────────────────────────────────────────────────────────────────────┘
Component Design
1. Document Ingestion
Trigger: New file lands in Cloud Storage bucket → Eventarc → Cloud Run parser job.
Parser decisions:
- Native PDF (text layer present) → pdfplumber / pypdf — fast, no API cost
- Scanned PDF / image PDFs → Document AI (Form Parser or Document OCR) — preserves layout, tables
- Word/PPTX → python-docx, python-pptx — extract paragraphs + headings
- HTML → BeautifulSoup with boilerpy3 to strip navigation/ads
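The parser decisions above reduce to a dispatch on file extension. A minimal sketch (the values are labels, not real imports — the production job would branch to pdfplumber / Document AI, python-docx, python-pptx, or BeautifulSoup as listed):

```python
from pathlib import Path

# Dispatch table keyed on file extension. Values are labels, not imports:
# the real job branches to the parser libraries named above.
PARSERS = {
    ".pdf": "pdf",    # pdfplumber if a text layer exists, else Document AI OCR
    ".docx": "docx",  # python-docx
    ".pptx": "pptx",  # python-pptx
    ".html": "html",  # BeautifulSoup + boilerpy3
}

def pick_parser(path: str) -> str:
    """Return the parser family for a document, or raise for unsupported types."""
    ext = Path(path).suffix.lower()
    if ext not in PARSERS:
        raise ValueError(f"unsupported document type: {ext!r}")
    return PARSERS[ext]
```

An unsupported extension fails loudly rather than silently producing empty text, so bad uploads surface in the Cloud Tasks dead-letter queue instead of the index.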
Metadata enriched at parse time:
{
"doc_id": "uuid",
"source_path": "gs://bucket/hr/policies/pto-policy-v3.pdf",
"title": "PTO Policy v3",
"section": "Human Resources",
"author": "HR Team",
"created_date": "2024-01-15",
"last_modified": "2024-11-01",
"language": "en",
"page_count": 12
}
2. Chunking Strategy
Why this matters: Chunk size is one of the highest-leverage quality decisions in a RAG system.
| Strategy | Chunk Size | Use Case |
|---|---|---|
| Fixed-token | 256–512 tokens | Simple factual docs, uniform structure |
| Semantic (sentence-boundary) | Variable 200–600 tokens | Prose documents, policies, articles |
| Parent-child hierarchical | Parent: 1024 / Child: 256 | When you need summary retrieval + precise citation |
| Document-level | Full doc | Short docs (<2k tokens), global summarization |
Recommended for this system: Semantic chunking (512-token target, 50-token overlap), splitting at sentence boundaries and using sentence-transformers embeddings to detect topic shifts between adjacent sentences. Store both the child chunk (for retrieval) and a parent context window (±1 paragraph, for generation).
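A minimal sketch of the overlapping-window chunker (fixed-token variant for brevity; the semantic version would additionally snap each boundary to the nearest sentence break):

```python
def chunk_tokens(tokens: list[str], size: int = 512, overlap: int = 50) -> list[list[str]]:
    """Split a token sequence into overlapping windows.

    Fixed-token sketch; the semantic variant snaps each window boundary
    to the nearest sentence break instead of a raw token offset.
    """
    if overlap >= size:
        raise ValueError("overlap must be smaller than chunk size")
    step = size - overlap  # each window starts `overlap` tokens before the previous one ends
    chunks = []
    for start in range(0, len(tokens), step):
        chunks.append(tokens[start:start + size])
        if start + size >= len(tokens):  # final window reached the end
            break
    return chunks
```

The overlap is what protects against the "chunk boundary splits key fact" failure mode listed later: a fact straddling a boundary appears whole in at least one window.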
3. Embedding + Vector Index
Embedding model: text-embedding-005 (Vertex AI)
- 768 dimensions
- Optimized for retrieval tasks
- Batched at 250 docs/batch via Cloud Tasks
Index: Vertex AI Vector Search
- Algorithm: ScaNN (Scalable Nearest Neighbors) — tree-AH hybrid
- Recall target: 98% at top-10
- Shards: 5 (5M docs ÷ 1M per shard)
- Online updates enabled for near-real-time ingestion
Fallback sparse index: AlloyDB tsvector column + GIN index for BM25 keyword search. Critical for queries with rare proper nouns, product codes, and acronyms that embedding models underrepresent.
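The sparse leg might look like the following AlloyDB query (illustrative table and column names; note that PostgreSQL's ts_rank is a lexical proxy rather than exact BM25 — true BM25 scoring would need an extension or application-side scoring over term statistics):

```python
# Illustrative AlloyDB query for the sparse leg of hybrid retrieval.
# `chunks` / `chunk_tsv` are hypothetical names; chunk_tsv is the
# GIN-indexed tsvector column described above.
SPARSE_QUERY = """
SELECT chunk_id,
       ts_rank(chunk_tsv, plainto_tsquery('english', %(q)s)) AS score
FROM chunks
WHERE chunk_tsv @@ plainto_tsquery('english', %(q)s)
ORDER BY score DESC
LIMIT 20;
"""
```

The `@@` match predicate uses the GIN index to prune candidates before ranking, which is what keeps the sparse leg in the ~30ms budget.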
4. Hybrid Retrieval + Reranking
Dense: score_dense = cosine_similarity(q_emb, chunk_emb)
Sparse: score_sparse = BM25(q_tokens, chunk_tokens)
RRF fusion:
rrf_score(d) = Σ 1 / (k + rank_i(d)) where k=60
Final ranking: top-5 by rrf_score
Reranker:
Cross-encoder: Vertex AI Reranker API
Input: (query, chunk_text) → relevance score 0–1
Keep: top-3 chunks above threshold 0.4
Why rerank after RRF? RRF is an unsupervised fusion — it can elevate lexically matching but semantically shallow chunks. A cross-encoder cross-attends between the query and the full chunk text and produces calibrated relevance scores.
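The RRF formula above is only a few lines of code (a sketch; doc IDs and rankings are illustrative):

```python
def rrf_fuse(rankings: list[list[str]], k: int = 60, top_n: int = 5) -> list[str]:
    """Reciprocal Rank Fusion: rrf_score(d) = sum_i 1 / (k + rank_i(d)).

    `rankings` holds one best-first result list per retriever (rank 1 = best);
    a document missing from a list contributes nothing for that retriever.
    """
    scores: dict[str, float] = {}
    for ranked in rankings:
        for rank, doc_id in enumerate(ranked, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)[:top_n]
```

Documents appearing in both the dense and sparse lists accumulate two reciprocal-rank terms, which is why agreement between retrievers dominates a single high rank in one list.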
5. Generation
Prompt template:
You are a knowledgeable assistant for [Company] employees.
Answer the question below using ONLY the provided context.
If the answer is not in the context, say "I don't have that information."
Cite each claim with [Source N] notation.
Context:
[1] {chunk_1_text} (Source: {source_url_1})
[2] {chunk_2_text} (Source: {source_url_2})
[3] {chunk_3_text} (Source: {source_url_3})
Question: {user_query}
Answer:
Model: Gemini 1.5 Pro (128k context window — fits 3–10 chunks comfortably; lost-in-the-middle risk is minimal at this scale)
Grounding check: Vertex AI Grounding API validates that each sentence in the answer is attributable to the provided context. Claims without grounding are flagged.
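Filling the template is then a straightforward string assembly (a sketch assuming each reranked chunk is a dict with `text` and `source_url` fields — the field names are illustrative):

```python
SYSTEM_PREAMBLE = (
    "You are a knowledgeable assistant for [Company] employees.\n"
    "Answer the question below using ONLY the provided context.\n"
    "If the answer is not in the context, say \"I don't have that information.\"\n"
    "Cite each claim with [Source N] notation.\n"
)

def assemble_prompt(chunks: list[dict], query: str) -> str:
    """Fill the prompt template from the reranked top-3 chunks."""
    context = "\n".join(
        f"[{i}] {c['text']} (Source: {c['source_url']})"
        for i, c in enumerate(chunks, start=1)
    )
    return f"{SYSTEM_PREAMBLE}\nContext:\n{context}\n\nQuestion: {query}\nAnswer:"
```

Numbering the sources in the prompt is what lets the generation step emit [Source N] markers that map back to `source_url` for citation extraction.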
GCP Services Map
| Component | GCP Service | Why |
|---|---|---|
| Raw document storage | Cloud Storage | Durable, cheap, Eventarc integration |
| Document parsing (scanned) | Document AI (OCR / Form Parser) | Layout-aware OCR, table extraction |
| Parse/chunk/embed jobs | Cloud Run (CPU jobs) | Serverless, scales to 0, 1000 concurrent |
| Job queue / back-pressure | Cloud Tasks | Rate limits API calls, retries failures |
| Embedding model | Vertex AI text-embedding-005 | Low-latency, 768-dim, GCP-native |
| Vector index | Vertex AI Vector Search | Managed ANN, online updates, SLA |
| Keyword index | AlloyDB for PostgreSQL (pgvector + tsvector) | Hybrid: vector + BM25 in one store |
| Reranker | Vertex AI Reranker API | Cross-encoder, no self-hosting overhead |
| LLM generation | Vertex AI Gemini 1.5 Pro | 128k context, grounding support |
| Grounding validation | Vertex AI Grounding API | Attribution and hallucination guard |
| Semantic cache | Memorystore for Redis (7.x) | Vector similarity cache on query embedding |
| Auth | Identity-Aware Proxy (IAP) | Internal employee auth, no code changes |
| API frontend | API Gateway + Cloud Run | Rate limiting, auth, versioning |
| Logging | Cloud Logging + BigQuery | Structured logs, RAGAS eval dataset |
| Monitoring | Cloud Monitoring + Trace | Latency percentiles, error rates |
| CI/CD | Cloud Build + Artifact Registry | Automated embed pipeline deployments |
Scalability Considerations
Throughput: 1,000 Concurrent Users
Bottleneck analysis:
| Stage | Latency (P99) | Parallelizable? | Mitigation |
|---|---|---|---|
| Query embedding | 20ms | Yes (stateless) | Cloud Run autoscale |
| Cache lookup | 5ms | Yes | Redis cluster mode |
| Vector search | 80ms | Yes | Vector Search handles concurrency |
| Sparse search | 30ms | Yes | AlloyDB read replicas |
| Reranking | 50ms | Yes (batched) | Reranker API auto-scales |
| Generation (Gemini) | 800–1500ms | Yes | Vertex AI handles quota per project |
| Total P99 | ~2.5s | — | Within 3s SLA |
Scaling levers:
- Cloud Run: --min-instances=10 to prevent cold starts under load
- Memorystore: 10–20% of queries hit cache after 24h warm-up (semantic cache)
- Vector Search: add shards as corpus grows beyond 5M docs
- Gemini quota: request higher TPM limits via GCP support for production
Corpus Scale: Beyond 5M Documents
| Corpus Size | Vector Search Config | Embedding Cost | Strategy |
|---|---|---|---|
| <1M | Single shard, brute-force or tree-AH | Low | Simple |
| 1M–10M | 5–10 shards, ScaNN | Moderate | Shard by department/domain |
| 10M–100M | 50+ shards + index versioning | High | Domain-specific sub-indexes + routing |
| 100M+ | Multi-index with pre-filter | Very High | Metadata filtering before ANN search |
Domain partitioning: Split indexes by department (HR, Legal, Engineering). Query router classifies department from query → hits targeted sub-index. Reduces search space 10x, improves recall.
Freshness: Near-Real-Time Updates
Vertex AI Vector Search supports online index updates (upsert/delete) without reindexing. For batch ingestion of 100k docs/day:
- Cloud Tasks queue with 50 concurrent workers
- Each worker: parse → chunk → embed (batched 250/call) → upsert to Vector Search
- Throughput: ~20k chunks/minute, 100k docs ingested within 2–3 hours
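A quick arithmetic check on that throughput claim, assuming ~25 chunks per average document (an assumption — the text gives docs/day and chunks/minute but not chunks/doc):

```python
docs_per_day = 100_000
chunks_per_doc = 25          # assumption: not stated in the text
upserts_per_minute = 20_000  # throughput figure from the pipeline above

total_chunks = docs_per_day * chunks_per_doc
hours = total_chunks / upserts_per_minute / 60
```

At 25 chunks/doc the daily backlog clears in roughly 2.1 hours, consistent with the stated 2–3 hour window; denser documents push toward the upper bound.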
Cost Optimization
| Lever | Saving |
|---|---|
| Semantic cache (Redis) | 15–20% LLM cost reduction |
| Batch embedding during off-hours | 50% cheaper with committed use discounts |
| Gemini Flash for simple factual queries | 5× cheaper than Pro, detect via query classifier |
| Index compression (PQ in Vector Search) | 4× storage reduction, ~3% recall tradeoff |
Failure Modes and Mitigations
| Failure | Symptom | Mitigation |
|---|---|---|
| Hallucination | Answer not grounded in context | Grounding API check + fallback "I don't know" |
| Retrieval miss | Right doc exists but not returned | Hybrid retrieval (sparse catches what dense misses) |
| Chunk boundary splits key fact | Incomplete context | 50-token overlap + parent-child retrieval |
| Stale index | New doc not yet indexed | Show doc freshness timestamp; warn if >24h old |
| LLM context overflow | Long docs exceed context window | Chunk top-3 only; limit to 6k tokens context |
| Toxic/injection query | Prompt injection in user query | Input sanitization + Vertex AI safety filters |
Q&A Review Bank
Q1: A user complains that a critical answer "isn't in the system" even though you can manually find it in a document. What are the top 3 causes and how do you diagnose each? [Hard]
A: (1) Embedding model mismatch — the document uses domain-specific terminology (e.g., "SOX compliance audit cycle") that the embedding model encodes differently than the user's query ("financial audit timeline"). Diagnose by running the query against BM25 alone — if BM25 finds it but vector search doesn't, it's a semantic gap. Fix: domain-adapted embedding or expand query with HyDE. (2) Chunking boundary problem — the key fact spans two chunks and neither chunk alone contains enough context for a high similarity score. Diagnose by inspecting which chunk contains the sentence. Fix: increase overlap, or use parent-child retrieval where the parent chunk contains the full context. (3) Metadata filtering too aggressive — a pre-filter by department/date excluded the relevant document. Diagnose by running without filters. Fix: broaden filter logic or remove it for fallback queries.
Q2: Why is hybrid retrieval (dense + sparse) better than dense-only retrieval, and when does sparse search have a structural advantage? [Medium]
A: Dense (embedding) search excels at semantic similarity — it finds chunks that mean the same thing even with different words. Sparse (BM25) search excels at exact-match retrieval — it finds documents containing the exact query tokens, which is critical for product codes, acronyms, proper nouns, and rare technical terms that embedding models compress into similar vectors as related-but-wrong concepts. Example: query "CVE-2024-1234 vulnerability" — a dense model may retrieve general "vulnerability management" content; BM25 will pinpoint the exact CVE document because it exact-matches the token. Hybrid search with RRF combines both signals without requiring a learned fusion model, making it robust to distributional shifts in the query population.
Q3: How does the semantic cache work and what are the failure modes introduced by caching RAG responses? [Hard]
A: The semantic cache stores (query_embedding → cached_response) pairs in Redis. On each new query, its embedding is computed and a nearest-neighbor lookup in the cache checks whether a "similar enough" prior query has been answered (typically cosine similarity > 0.95). If yes, the cached response is returned without hitting Vector Search or the LLM. Failure modes: (1) Stale cache — if the underlying document is updated after the cache entry was created, the cached answer is outdated. Fix: short TTL (1–4 hours) or invalidate on document update. (2) Similarity threshold too low — different questions get the same answer (e.g., "Who is the CEO?" vs "Who is the CFO?" may have embeddings close enough to hit the same cache entry). Fix: tighten threshold to 0.98+. (3) Cache pollution — rare or malicious queries poison cache slots. Fix: only cache queries above a minimum frequency threshold.
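The cache-hit logic can be sketched as follows (in production, Redis with a vector-similarity module performs the nearest-neighbor step; the linear scan here is purely illustrative):

```python
import math

def cosine(a: list[float], b: list[float]) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(x * x for x in b)))

def cache_lookup(query_emb: list[float], cache, threshold: float = 0.95):
    """Return the cached response of the most similar prior query if it
    clears the similarity threshold, else None (cache miss).
    `cache` is a list of (embedding, response) pairs."""
    best_sim, best_resp = -1.0, None
    for emb, resp in cache:
        sim = cosine(query_emb, emb)
        if sim > best_sim:
            best_sim, best_resp = sim, resp
    return best_resp if best_sim >= threshold else None
```

The threshold is the tuning knob discussed above: raising it toward 0.98 trades hit rate for safety against returning the wrong cached answer.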
Q4: Design the latency budget for a 3-second P99 SLA. Which stage is the most dangerous and why? [Hard]
A: Budget breakdown: auth (20ms) → query embedding (20ms) → cache lookup (5ms) → vector search (80ms) → sparse search (30ms) → RRF fusion (5ms) → reranking (50ms) → prompt assembly (5ms) → Gemini generation (800–1500ms P99) → response serialization (10ms) = ~1.0–1.7s under normal load. The generation stage is most dangerous because it's the only stage that doesn't scale horizontally in a way you control — it depends on Gemini API P99 latency, which can spike under shared infrastructure load. Mitigations: (1) semantic cache to skip generation for repeated queries; (2) Gemini Flash for short factual queries (3× faster); (3) streaming responses so users perceive a faster response even if total latency is 3s; (4) a 2s timeout budget for generation with fallback to "Please try again."
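Summing the budget is a useful sanity check — everything outside generation totals roughly 225ms:

```python
# Per-stage budgets (ms) from the breakdown above; generation is the
# 800–1500ms P99 range and dominates everything else combined.
stages_ms = {
    "auth": 20, "query_embedding": 20, "cache_lookup": 5,
    "vector_search": 80, "sparse_search": 30, "rrf_fusion": 5,
    "reranking": 50, "prompt_assembly": 5, "serialization": 10,
}
overhead = sum(stages_ms.values())
total_low, total_high = overhead + 800, overhead + 1500
```

Even at the 1500ms generation P99 this leaves close to 1.3s of headroom against the 3s SLA — headroom that absorbs queueing delay and retries under load.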
Q5: A regulatory audit requires you to prove that every answer the system gave was grounded in a specific document version. How do you design for this? [Hard]
A: Three layers of auditability needed: (1) Document versioning — every document in Cloud Storage is versioned (Object Versioning enabled). Each chunk stores doc_id + version_hash as metadata. When a document is updated, old chunks are soft-deleted (marked archived: true) rather than deleted, so historical retrievals can be reconstructed. (2) Query logging — every query logs to BigQuery: query_id, user_id, timestamp, query_text, retrieved_chunk_ids (with version_hash), prompt_sent_to_llm, llm_response, grounding_score. This creates a complete audit trail. (3) Grounding verification — the Vertex AI Grounding API response includes which sentences are attributed to which source segments. This response is logged alongside the answer. An auditor can replay the query, retrieve the same versioned chunks, and verify the answer was derivable from those specific document versions.
Q6: Compare Vertex AI Vector Search vs. AlloyDB pgvector for this use case. When would you choose each? [Medium]
A: Vertex AI Vector Search is a dedicated ANN service — it handles billion-scale corpora, online index updates, and ~80ms P99 at scale with no operational overhead. It's the right choice when vector search is your primary query pattern and you need managed scaling. AlloyDB pgvector runs vector search as a PostgreSQL extension — you get SQL joins between vector results and structured metadata in the same query (e.g., WHERE department = 'HR' AND cosine_similarity > 0.8). It's the right choice when your retrieval requires complex metadata joins, when corpus size is <10M docs, or when you want hybrid BM25 + vector in one query. For this system: use both in tandem — Vector Search for high-throughput dense retrieval; AlloyDB for sparse BM25 and metadata-filtered lookups. The hybrid retrieval layer fuses results from both.
Q7: What is "lost-in-the-middle" and how do you mitigate it in this system? [Medium]
A: Lost-in-the-middle is the empirically-observed degradation in LLM attention on context chunks placed in the middle of a long context window — information in positions 2 through N-1 is less well-utilized than the first and last chunks. For a 3-chunk RAG context, this is minimal, but for 10+ chunks it significantly degrades answer quality. Mitigations: (1) Limit to 3–5 chunks — enforce a hard cap on context size, relying on the reranker to surface only the most relevant. (2) Positional priority — place the highest-relevance chunk first and second, not in the middle. (3) Summary-then-detail — prepend a 2-sentence summary of each chunk before the full text, giving the model anchor points. (4) Gemini 1.5 Pro's 128k window — while the problem exists, it's less severe in Gemini 1.5 than earlier models, which exhibited sharper degradation at the 4–8k mark.