System Design: Simple RAG Pipeline

Domain: Enterprise Knowledge Base · Pattern: Naive → Production-Grade RAG



Interview Problem Statement

"Design a production-grade RAG system that enables employees to query a company's internal knowledge base of 5 million documents (PDFs, Word, HTML) and receive accurate, cited answers in under 3 seconds at 1,000 concurrent users."


Clarifying Questions

| Question | Why It Matters |
| --- | --- |
| What document types? (PDF, Word, HTML, scanned images?) | Determines parsers needed - scanned PDFs require OCR (Document AI) |
| Expected query types? (factual lookup, comparison, summarization?) | Factual → top-3 chunks sufficient; summarization → may need 20+ chunks |
| What is the acceptable hallucination rate? (0%? <5%?) | Drives need for guardrails, grounding checks, citation enforcement |
| Does freshness matter? (new docs within minutes, hours, or daily batch?) | Real-time → streaming ingestion; batch is fine → scheduled pipelines |
| Is PII present in documents? | Drives redaction pipeline before chunking and logging controls |
| Multi-language support needed? | Affects embedding model choice (multilingual models vs. per-language indexes) |
| Is the query population internal employees or external users? | Security perimeter, auth model, and rate limiting design |

System Architecture Overview

┌──────────────────────────────────────────────────────────────────────────────┐
│                           INGESTION PIPELINE                                 │
│                                                                              │
│  Cloud Storage (raw docs)                                                    │
│       │                                                                      │
│       ▼                                                                      │
│  Cloud Run: Document Parser                                                  │
│  ┌─────────────────────────────────────────────────────────┐                │
│  │  PDF → Document AI (OCR + layout)                       │                │
│  │  Word/HTML → python-docx / BeautifulSoup                │                │
│  │  Output: cleaned text + metadata JSON                   │                │
│  └─────────────────────────────────────────────────────────┘                │
│       │                                                                      │
│       ▼                                                                      │
│  Cloud Tasks Queue (back-pressure management)                                │
│       │                                                                      │
│       ▼                                                                      │
│  Cloud Run: Chunker + Embedder                                               │
│  ┌─────────────────────────────────────────────────────────┐                │
│  │  Semantic chunking (512 tokens, 10% overlap)            │                │
│  │  Vertex AI text-embedding-005 → 768-dim vectors         │                │
│  │  Metadata: doc_id, chunk_id, source_url, date, section  │                │
│  └─────────────────────────────────────────────────────────┘                │
│       │                                                                      │
│       ├─────────────────────┬──────────────────────────────┤                │
│       ▼                     ▼                              ▼                │
│  Vertex AI Vector      AlloyDB (BM25         BigQuery (metadata            │
│  Search (ANN)          sparse index)         + eval logging)               │
└──────────────────────────────────────────────────────────────────────────────┘

┌──────────────────────────────────────────────────────────────────────────────┐
│                            QUERY PIPELINE                                    │
│                                                                              │
│  User Query                                                                  │
│       │                                                                      │
│       ▼                                                                      │
│  API Gateway → Cloud Run: Query Service                                      │
│  ┌─────────────────────────────────────────────────────────┐                │
│  │  1. Auth check (IAP or OAuth2)                          │                │
│  │  2. Query embedding (text-embedding-005)                │                │
│  │  3. Check Memorystore (Redis) semantic cache            │                │
│  └─────────────────────────────────────────────────────────┘                │
│                   │  (cache miss)                                            │
│                   ▼                                                          │
│  ┌─────────────────────────────────────────────────────────┐                │
│  │  HYBRID RETRIEVAL                                        │                │
│  │                                                          │                │
│  │  Dense:  Vertex AI Vector Search → top-20 chunks        │                │
│  │  Sparse: AlloyDB full-text (BM25) → top-20 chunks       │                │
│  │  Fusion: Reciprocal Rank Fusion → top-5                 │                │
│  │  Rerank: Vertex AI Reranker (cross-encoder) → top-3     │                │
│  └─────────────────────────────────────────────────────────┘                │
│                   │                                                          │
│                   ▼                                                          │
│  ┌─────────────────────────────────────────────────────────┐                │
│  │  GENERATION                                              │                │
│  │                                                          │                │
│  │  Prompt assembly: system + context + query              │                │
│  │  Vertex AI Gemini 1.5 Pro → answer + citations          │                │
│  │  Grounding check: Vertex AI Grounding API               │                │
│  │  Citation extraction: source_url per claim              │                │
│  └─────────────────────────────────────────────────────────┘                │
│                   │                                                          │
│                   ▼                                                          │
│  Response → User + log to BigQuery (for eval/feedback)                      │
└──────────────────────────────────────────────────────────────────────────────┘

Component Design

1. Document Ingestion

Trigger: New file lands in Cloud Storage bucket → Eventarc → Cloud Run parser job.

Parser decisions:

  • Native PDF (text layer present) → pdfplumber / pypdf - fast, no API cost
  • Scanned PDF / image PDFs → Document AI (Form Parser or Document OCR) - preserves layout, tables
  • Word/PPTX → python-docx, python-pptx - extract paragraphs + headings
  • HTML → BeautifulSoup with boilerpy3 to strip navigation/ads
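
The routing rules above can be sketched as a small dispatch function. This is a minimal illustration, not the service's actual code: the parser labels and the `has_text_layer` flag are assumptions, and detecting a text layer would itself require opening the PDF.

```python
from pathlib import Path

# Hypothetical parser labels; each would map to a real parsing function.
PARSERS_BY_SUFFIX = {
    ".docx": "python-docx",
    ".pptx": "python-pptx",
    ".html": "beautifulsoup+boilerpy3",
    ".htm": "beautifulsoup+boilerpy3",
}


def choose_parser(path: str, has_text_layer: bool = True) -> str:
    """Return the parser label for a file, following the routing rules above."""
    suffix = Path(path).suffix.lower()
    if suffix == ".pdf":
        # Native PDFs have an extractable text layer; scanned ones need OCR.
        return "pdfplumber" if has_text_layer else "document-ai-ocr"
    try:
        return PARSERS_BY_SUFFIX[suffix]
    except KeyError:
        raise ValueError(f"unsupported document type: {suffix or path!r}") from None
```

Routing on the cheap signal first (file extension) keeps the paid OCR path reserved for files that actually need it.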

Metadata enriched at parse time:

{
  "doc_id": "uuid",
  "source_path": "gs://bucket/hr/policies/pto-policy-v3.pdf",
  "title": "PTO Policy v3",
  "section": "Human Resources",
  "author": "HR Team",
  "created_date": "2024-01-15",
  "last_modified": "2024-11-01",
  "language": "en",
  "page_count": 12
}

2. Chunking Strategy

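Why this matters: Chunk size is one of the biggest quality levers in RAG. Chunks that are too small lose the surrounding context needed to answer; chunks that are too large dilute the embedding and crowd the prompt with irrelevant text.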

| Strategy | Chunk Size | Use Case |
| --- | --- | --- |
| Fixed-token | 256–512 tokens | Simple factual docs, uniform structure |
| Semantic (sentence-boundary) | Variable, 200–600 tokens | Prose documents, policies, articles |
| Parent-child hierarchical | Parent: 1024 / Child: 256 tokens | When you need summary retrieval + precise citation |
| Document-level | Full doc | Short docs (<2k tokens), global summarization |

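Recommended for this system: Semantic chunking (512-token target, 50-token overlap, roughly 10%). Split on sentence boundaries with a sentence tokenizer (e.g., NLTK or spaCy), then group adjacent sentences, optionally using sentence-transformers embeddings to detect topic shifts. Store both the child chunk (for retrieval) and a parent context window (±1 paragraph, for generation).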
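
A minimal sketch of sentence-boundary chunking with overlap follows. It makes two simplifying assumptions that a real pipeline would replace: sentences are split with a naive regex, and tokens are approximated by whitespace-separated words rather than the embedding model's tokenizer.

```python
import re


def split_sentences(text: str) -> list[str]:
    """Naive splitter: break after ., ! or ? when followed by whitespace."""
    return [s for s in re.split(r"(?<=[.!?])\s+", text.strip()) if s]


def chunk_sentences(
    text: str, target_tokens: int = 512, overlap_tokens: int = 50
) -> list[str]:
    """Pack whole sentences into chunks of roughly target_tokens, with overlap.

    Token counts are approximated by word counts (an assumption).
    """
    chunks: list[str] = []
    current: list[str] = []
    current_len = 0
    for sentence in split_sentences(text):
        n = len(sentence.split())
        if current and current_len + n > target_tokens:
            chunks.append(" ".join(current))
            # Carry trailing sentences forward until the overlap budget is met.
            carried: list[str] = []
            carried_len = 0
            for prev in reversed(current):
                if carried_len >= overlap_tokens:
                    break
                carried.insert(0, prev)
                carried_len += len(prev.split())
            current, current_len = carried, carried_len
        current.append(sentence)
        current_len += n
    if current:
        chunks.append(" ".join(current))
    return chunks
```

Because the overlap is built from whole sentences, a fact that straddles a chunk boundary appears intact in at least one of the two neighboring chunks.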


3. Embedding + Vector Index

Embedding model: text-embedding-005 (Vertex AI)

  • 768 dimensions
  • Optimized for retrieval tasks
  • Batched at 250 docs/batch via Cloud Tasks
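
The 250-per-call batching can be sketched as below. `embed_batch` is a hypothetical caller-supplied function standing in for the embedding API call; no real client library is assumed here.

```python
from collections.abc import Callable, Iterable, Iterator


def batched(items: Iterable[str], batch_size: int = 250) -> Iterator[list[str]]:
    """Yield successive lists of at most batch_size items."""
    if batch_size < 1:
        raise ValueError("batch_size must be positive")
    batch: list[str] = []
    for item in items:
        batch.append(item)
        if len(batch) == batch_size:
            yield batch
            batch = []
    if batch:
        yield batch


def embed_all(
    chunks: Iterable[str],
    embed_batch: Callable[[list[str]], list[list[float]]],
    batch_size: int = 250,
) -> list[list[float]]:
    """Embed chunks batch by batch, preserving input order.

    embed_batch is a hypothetical wrapper around the embedding API.
    """
    vectors: list[list[float]] = []
    for batch in batched(chunks, batch_size):
        vectors.extend(embed_batch(batch))
    return vectors
```

In the pipeline described here, each batch would be one Cloud Tasks task, so a failed call retries 250 chunks rather than the whole document set.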

Index: Vertex AI Vector Search

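  • Algorithm: ScaNN (Scalable Nearest Neighbors) - tree-AH hybrid
  • Recall target: 98% at top-10
  • Shards: 5 as a starting point (5M docs ÷ 1M per shard); size the final count on chunk vectors, since each document yields many chunks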
  • Online updates enabled for near-real-time ingestion

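Complementary sparse index: AlloyDB tsvector column + GIN index for keyword search. PostgreSQL's built-in ts_rank is not true BM25, so treat its score as a BM25-style approximation unless a BM25 extension or application-side scoring is added. Keyword search is critical for queries with rare proper nouns, product codes, and acronyms that embedding models underrepresent.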


4. Hybrid Retrieval + Reranking

Dense:  score_dense  = cosine_similarity(q_emb, chunk_emb)
Sparse: score_sparse = BM25(q_tokens, chunk_tokens)

RRF fusion:
  rrf_score(d) = Σ_i 1 / (k + rank_i(d))   where k=60 and i ranges over the retrievers (dense, sparse)
  
  Final ranking: top-5 by rrf_score

Reranker:
  Cross-encoder: Vertex AI Reranker API
  Input: (query, chunk_text) → relevance score 0–1
  Keep: top-3 chunks above threshold 0.4

Why rerank after RRF? RRF is unsupervised rank fusion: it looks only at rank positions, so it can elevate chunks that match lexically but are semantically shallow. A cross-encoder attends jointly over the query and the full chunk text and scores each pair directly, which makes a fixed relevance threshold meaningful.
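
The fusion and filtering steps above can be sketched in a few lines. This is an illustrative implementation of Reciprocal Rank Fusion with k=60, plus the threshold filter applied to reranker scores; the reranker itself is not called here, and its scores are assumed to arrive as a dict of chunk ID to score.

```python
from collections import defaultdict


def rrf_fuse(
    rankings: list[list[str]], k: int = 60, top_n: int = 5
) -> list[tuple[str, float]]:
    """Fuse ranked lists of chunk IDs with Reciprocal Rank Fusion.

    Each inner list is ordered best-first; ranks start at 1.
    """
    scores: dict[str, float] = defaultdict(float)
    for ranking in rankings:
        for rank, chunk_id in enumerate(ranking, start=1):
            scores[chunk_id] += 1.0 / (k + rank)
    # Sort by score descending; break ties by chunk ID for determinism.
    fused = sorted(scores.items(), key=lambda kv: (-kv[1], kv[0]))
    return fused[:top_n]


def keep_relevant(
    reranked: dict[str, float], threshold: float = 0.4, top_n: int = 3
) -> list[str]:
    """Keep up to top_n chunk IDs whose reranker score exceeds threshold."""
    passing = [(cid, s) for cid, s in reranked.items() if s > threshold]
    passing.sort(key=lambda kv: (-kv[1], kv[0]))
    return [cid for cid, _ in passing[:top_n]]
```

For example, fusing a dense ranking `["a", "b", "c"]` with a sparse ranking `["b", "d", "a"]` puts `b` first, because it sits near the top of both lists, followed by `a`, `d`, and `c`.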


5. Generation

Prompt template:

You are a knowledgeable assistant for [Company] employees.
Answer the question below using ONLY the provided context.
If the answer is not in the context, say "I don't have that information."
Cite each claim with [Source N] notation.

Context:
[1] {chunk_1_text} (Source: {source_url_1})
[2] {chunk_2_text} (Source: {source_url_2})
[3] {chunk_3_text} (Source: {source_url_3})

Question: {user_query}

Answer:

Model: Gemini 1.5 Pro (context window of at least 128k tokens - fits 3–10 chunks comfortably, with minimal lost-in-the-middle risk at this scale)

Grounding check: Vertex AI Grounding API validates that each sentence in the answer is attributable to the provided context. Claims without grounding are flagged.
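
As a minimal sketch, the prompt template in this section can be filled programmatically. The function name and the `(chunk_text, source_url)` pair format are assumptions for illustration.

```python
PROMPT_TEMPLATE = """You are a knowledgeable assistant for {company} employees.
Answer the question below using ONLY the provided context.
If the answer is not in the context, say "I don't have that information."
Cite each claim with [Source N] notation.

Context:
{context}

Question: {user_query}

Answer:"""


def build_prompt(company: str, chunks: list[tuple[str, str]], user_query: str) -> str:
    """Assemble the prompt from (chunk_text, source_url) pairs, best-first."""
    context = "\n".join(
        f"[{i}] {text} (Source: {url})"
        for i, (text, url) in enumerate(chunks, start=1)
    )
    # Chunk text is passed as a format argument, so braces inside it are safe.
    return PROMPT_TEMPLATE.format(
        company=company, context=context, user_query=user_query
    )
```

Numbering the chunks at assembly time keeps the [Source N] labels in the answer aligned with the exact chunk order that was sent, which simplifies citation extraction afterwards.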


GCP Services Map

| Component | GCP Service | Why |
| --- | --- | --- |
| Raw document storage | Cloud Storage | Durable, cheap, Eventarc integration |
| Document parsing (scanned) | Document AI (OCR / Form Parser) | Layout-aware OCR, table extraction |
| Parse/chunk/embed jobs | Cloud Run (CPU jobs) | Serverless, scales to 0, 1000 concurrent |
| Job queue / back-pressure | Cloud Tasks | Rate limits API calls, retries failures |
| Embedding model | Vertex AI text-embedding-005 | Low-latency, 768-dim, GCP-native |
| Vector index | Vertex AI Vector Search | Managed ANN, online updates, SLA |
| Keyword index | AlloyDB for PostgreSQL (pgvector + tsvector) | Hybrid: vector + BM25 in one store |
| Reranker | Vertex AI Reranker API | Cross-encoder, no self-hosting overhead |
| LLM generation | Vertex AI Gemini 1.5 Pro | 128k context, grounding support |
| Grounding validation | Vertex AI Grounding API | Attribution and hallucination guard |
| Semantic cache | Memorystore for Redis (7.x) | Vector similarity cache on query embedding |
| Auth | Identity-Aware Proxy (IAP) | Internal employee auth, no code changes |
| API frontend | API Gateway + Cloud Run | Rate limiting, auth, versioning |
| Logging | Cloud Logging + BigQuery | Structured logs, RAGAS eval dataset |
| Monitoring | Cloud Monitoring + Trace | Latency percentiles, error rates |
| CI/CD | Cloud Build + Artifact Registry | Automated embed pipeline deployments |

Scalability Considerations

Throughput: 1,000 Concurrent Users

Bottleneck analysis:

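| Stage | Latency (P99) | Parallelizable? | Mitigation |
| --- | --- | --- | --- |
| Query embedding | 20ms | Yes (stateless) | Cloud Run autoscale |
| Cache lookup | 5ms | Yes | Redis cluster mode |
| Vector search | 80ms | Yes | Vector Search handles concurrency |
| Sparse search | 30ms | Yes | AlloyDB read replicas |
| Reranking | 50ms | Yes (batched) | Reranker API auto-scales |
| Generation (Gemini) | 800–1500ms | Yes | Vertex AI handles quota per project |
| Total P99 | ~1.0–1.7s summed across stages (~2.5s budgeted with network and queuing headroom) | - | Within 3s SLA |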

Scaling levers:

  • Cloud Run: --min-instances=10 to prevent cold starts under load
  • Memorystore: 10–20% of queries hit cache after 24h warm-up (semantic cache)
  • Vector Search: add shards as corpus grows beyond 5M docs
  • Gemini quota: request higher TPM limits via GCP support for production

Corpus Scale: Beyond 5M Documents

| Corpus Size | Vector Search Config | Embedding Cost | Strategy |
| --- | --- | --- | --- |
| <1M | Single shard, ScaNN (tree-AH) | Low | Simple |
| 1M–10M | 5–10 shards, ScaNN | Moderate | Shard by department/domain |
| 10M–100M | 50+ shards + index versioning | High | Domain-specific sub-indexes + routing |
| 100M+ | Multi-index with pre-filter | Very High | Metadata filtering before ANN search |

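Domain partitioning: Split indexes by department (HR, Legal, Engineering). A query router classifies the department from the query and hits the targeted sub-index. This reduces the search space by up to 10x and can improve result quality, provided the router picks the right index; a misrouted query misses the document entirely, so keep an all-index fallback for low-confidence routes.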
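
A deliberately simple sketch of such a router is shown below. The keyword lists and the `"all"` fallback label are hypothetical; a production router would more likely use a trained classifier or an LLM call.

```python
# Hypothetical keyword lists, for illustration only.
DEPARTMENT_KEYWORDS = {
    "hr": {"pto", "vacation", "benefits", "payroll", "leave"},
    "legal": {"contract", "nda", "compliance", "liability"},
    "engineering": {"deploy", "api", "incident", "runbook"},
}


def route_query(query: str, fallback: str = "all") -> str:
    """Pick the sub-index whose keywords overlap the query most.

    Returns fallback (search every index) when nothing matches or on a tie,
    so a routing miss costs latency rather than a missed document.
    """
    tokens = {t.strip(".,?!").lower() for t in query.split()}
    hits = {
        dept: len(tokens & words) for dept, words in DEPARTMENT_KEYWORDS.items()
    }
    best = max(hits.values())
    winners = [dept for dept, n in hits.items() if n == best]
    if best == 0 or len(winners) > 1:
        return fallback
    return winners[0]
```

The important design choice is the fallback: routing to a single sub-index is only safe when the router is confident.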

Freshness: Near-Real-Time Updates

Vertex AI Vector Search supports online index updates (upsert/delete) without reindexing. For batch ingestion of 100k docs/day:

  • Cloud Tasks queue with 50 concurrent workers
  • Each worker: parse → chunk → embed (batched 250/call) → upsert to Vector Search
  • Throughput: ~20k chunks/minute, 100k docs ingested within 2–3 hours
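
As a sanity check on these figures, the arithmetic below derives the average chunks per document implied by the stated throughput. Only the two stated numbers are used; the chunk density itself is not given in this design, so it is computed rather than assumed.

```python
CHUNKS_PER_MINUTE = 20_000  # stated pipeline throughput
DOCS_PER_DAY = 100_000      # stated daily batch size


def implied_chunks_per_doc(hours: float) -> float:
    """Average chunks per document if the daily batch finishes in `hours`."""
    total_chunks = CHUNKS_PER_MINUTE * 60 * hours
    return total_chunks / DOCS_PER_DAY


def ingest_hours(chunks_per_doc: float) -> float:
    """Hours needed to ingest the daily batch at a given chunk density."""
    return DOCS_PER_DAY * chunks_per_doc / (CHUNKS_PER_MINUTE * 60)


print(implied_chunks_per_doc(2), implied_chunks_per_doc(3))  # 24.0 36.0
```

A 2–3 hour window therefore corresponds to roughly 24–36 chunks per document; a corpus with much longer documents would need more workers or a longer window.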

Cost Optimization

| Lever | Saving |
| --- | --- |
| Semantic cache (Redis) | 15–20% LLM cost reduction |
| Batch embedding during off-hours | 50% cheaper with committed use discounts |
| Gemini Flash for simple factual queries | 5× cheaper than Pro, detect via query classifier |
| Index compression (PQ in Vector Search) | 4× storage reduction, ~3% recall tradeoff |

Failure Modes and Mitigations

| Failure | Symptom | Mitigation |
| --- | --- | --- |
| Hallucination | Answer not grounded in context | Grounding API check + fallback "I don't know" |
| Retrieval miss | Right doc exists but not returned | Hybrid retrieval (sparse catches what dense misses) |
| Chunk boundary splits key fact | Incomplete context | 50-token overlap + parent-child retrieval |
| Stale index | New doc not yet indexed | Show doc freshness timestamp; warn if >24h old |
| LLM context overflow | Long docs exceed context window | Pass only the top-3 chunks; cap context at 6k tokens |
| Toxic/injection query | Prompt injection in user query | Input sanitization + Vertex AI safety filters |

Q&A Review Bank

Q1: A user complains that a critical answer "isn't in the system" even though you can manually find it in a document. What are the top 3 causes and how do you diagnose each? [Hard]

A: (1) Embedding model mismatch - the document uses domain-specific terminology (e.g., "SOX compliance audit cycle") that the embedding model encodes differently than the user's query ("financial audit timeline"). Diagnose by running the query against BM25 alone - if BM25 finds it but vector search doesn't, it's a semantic gap. Fix: domain-adapted embedding or expand query with HyDE. (2) Chunking boundary problem - the key fact spans two chunks and neither chunk alone contains enough context for a high similarity score. Diagnose by inspecting which chunk contains the sentence. Fix: increase overlap, or use parent-child retrieval where the parent chunk contains the full context. (3) Metadata filtering too aggressive - a pre-filter by department/date excluded the relevant document. Diagnose by running without filters. Fix: broaden filter logic or remove it for fallback queries.


Q2: Why is hybrid retrieval (dense + sparse) better than dense-only retrieval, and when does sparse search have a structural advantage? [Medium]

A: Dense (embedding) search excels at semantic similarity - it finds chunks that mean the same thing even with different words. Sparse (BM25) search excels at exact-match retrieval - it finds documents containing the exact query tokens, which is critical for product codes, acronyms, proper nouns, and rare technical terms, which embedding models tend to place close to related-but-wrong concepts. Example: query "CVE-2024-1234 vulnerability" - a dense model may retrieve general "vulnerability management" content; BM25 will pinpoint the exact CVE document because it exact-matches the token. Hybrid search with RRF combines both signals without requiring a learned fusion model, making it robust to distributional shifts in the query population.


Q3: How does the semantic cache work and what are the failure modes introduced by caching RAG responses? [Hard]

A: The semantic cache stores (query_embedding → cached_response) pairs in Redis. On each new query, its embedding is computed and a nearest-neighbor lookup in the cache checks whether a "similar enough" prior query has been answered (typically cosine similarity > 0.95). If yes, the cached response is returned without hitting Vector Search or the LLM. Failure modes: (1) Stale cache - if the underlying document is updated after the cache entry was created, the cached answer is outdated. Fix: short TTL (1–4 hours) or invalidate on document update. (2) Similarity threshold too low - different questions get the same answer (e.g., "Who is the CEO?" vs "Who is the CFO?" may have embeddings close enough to hit the same cache entry). Fix: tighten threshold to 0.98+. (3) Cache pollution - rare or malicious queries poison cache slots. Fix: only cache queries above a minimum frequency threshold.


Q4: Design the latency budget for a 3-second P99 SLA. Which stage is the most dangerous and why? [Hard]

A: Budget breakdown: auth (20ms) → query embedding (20ms) → cache lookup (5ms) → vector search (80ms) → sparse search (30ms) → RRF fusion (5ms) → reranking (50ms) → prompt assembly (5ms) → Gemini generation (800–1500ms P99) → response serialization (10ms) = ~1.0–1.7s under normal load, which leaves more than a second of headroom for network hops, queuing, and latency spikes. The generation stage is most dangerous because it dominates the budget and is the only stage that doesn't scale horizontally in a way you control: it depends on Gemini API P99 latency, which can spike under shared infrastructure load. Mitigations: (1) semantic cache to skip generation for repeated queries; (2) Gemini Flash for short factual queries (3× faster); (3) streaming responses so users perceive a faster response even if total latency is 3s; (4) a 2s timeout budget for generation with fallback to "Please try again."
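
The budget can be checked with a few lines of arithmetic. The sketch below uses the stage figures listed in the answer; treating vector and sparse search as running in parallel is an assumption about the implementation, and summing per-stage P99 values is a rough planning heuristic rather than a true end-to-end P99.

```python
# (stage, low_ms, high_ms) as listed in the budget above.
STAGES = [
    ("auth", 20, 20),
    ("query embedding", 20, 20),
    ("cache lookup", 5, 5),
    ("vector search", 80, 80),
    ("sparse search", 30, 30),
    ("RRF fusion", 5, 5),
    ("reranking", 50, 50),
    ("prompt assembly", 5, 5),
    ("generation", 800, 1500),
    ("serialization", 10, 10),
]
SLA_MS = 3000


def total(stages, parallel=("vector search", "sparse search")):
    """Sum the budget; stages named in `parallel` overlap, so only the slowest counts."""
    low = sum(lo for name, lo, _ in stages if name not in parallel)
    high = sum(hi for name, _, hi in stages if name not in parallel)
    overlapped = [(lo, hi) for name, lo, hi in stages if name in parallel]
    if overlapped:
        low += max(lo for lo, _ in overlapped)
        high += max(hi for _, hi in overlapped)
    return low, high
```

With these figures, `total(STAGES, parallel=())` gives 1,025–1,725 ms for a fully sequential pipeline and `total(STAGES)` gives 995–1,695 ms with parallel retrieval. Generation accounts for most of the upper bound in either case, which is why it is the stage to watch.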


Q5: A regulatory audit requires you to prove that every answer the system gave was grounded in a specific document version. How do you design for this? [Hard]

A: Three layers of auditability needed: (1) Document versioning - every document in Cloud Storage is versioned (Object Versioning enabled). Each chunk stores doc_id + version_hash as metadata. When a document is updated, old chunks are soft-deleted (marked archived: true) rather than deleted, so historical retrievals can be reconstructed. (2) Query logging - every query logs to BigQuery: query_id, user_id, timestamp, query_text, retrieved_chunk_ids (with version_hash), prompt_sent_to_llm, llm_response, grounding_score. This creates a complete audit trail. (3) Grounding verification - the Vertex AI Grounding API response includes which sentences are attributed to which source segments. This response is logged alongside the answer. An auditor can replay the query, retrieve the same versioned chunks, and verify the answer was derivable from those specific document versions.
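
One possible shape for the audit record is sketched below, using the fields named in the answer. The class names and the JSON-lines serialization are assumptions; the actual BigQuery schema and write path are not specified here.

```python
import json
from dataclasses import asdict, dataclass, field
from datetime import datetime, timezone


@dataclass(frozen=True)
class RetrievedChunk:
    chunk_id: str
    doc_id: str
    version_hash: str  # pins the exact document version that was retrieved


@dataclass(frozen=True)
class AuditRecord:
    query_id: str
    user_id: str
    query_text: str
    retrieved_chunks: list[RetrievedChunk]
    prompt_sent_to_llm: str
    llm_response: str
    grounding_score: float
    timestamp: str = field(
        default_factory=lambda: datetime.now(timezone.utc).isoformat()
    )

    def to_json_row(self) -> str:
        """Serialize to one JSON line, suitable for a log sink or batch load."""
        return json.dumps(asdict(self), sort_keys=True)
```

Logging the full prompt alongside the versioned chunk IDs is what makes a replay possible: an auditor can compare the stored prompt against the archived chunk text for those exact versions.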


Q6: Compare Vertex AI Vector Search vs. AlloyDB pgvector for this use case. When would you choose each? [Medium]

A: Vertex AI Vector Search is a dedicated ANN service - it handles billion-scale corpora, online index updates, and ~80ms P99 at scale with no operational overhead. It's the right choice when vector search is your primary query pattern and you need managed scaling. AlloyDB pgvector runs vector search as a PostgreSQL extension - you get SQL joins between vector results and structured metadata in the same query (e.g., WHERE department = 'HR' AND cosine_similarity > 0.8). It's the right choice when your retrieval requires complex metadata joins, when corpus size is <10M docs, or when you want hybrid BM25 + vector in one query. For this system: use both in tandem - Vector Search for high-throughput dense retrieval; AlloyDB for sparse BM25 and metadata-filtered lookups. The hybrid retrieval layer fuses results from both.


Q7: What is "lost-in-the-middle" and how do you mitigate it in this system? [Medium]

A: Lost-in-the-middle is the empirically observed tendency of LLMs to under-use information placed in the middle of a long context window: content near the start and end of the context is used more reliably than content in positions 2 through N-1. For a 3-chunk RAG context the effect is minimal, but for 10+ chunks it can significantly degrade answer quality. Mitigations: (1) Limit to 3–5 chunks - enforce a hard cap on context size, relying on the reranker to surface only the most relevant. (2) Positional priority - place the highest-relevance chunks at the start and end of the context, not in the middle. (3) Summary-then-detail - prepend a 2-sentence summary of each chunk before the full text, giving the model anchor points. (4) Model choice - the problem still exists in Gemini 1.5 Pro, but it is less severe than in earlier models, which exhibited sharper degradation at the 4–8k mark.
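
One way to apply positional priority is to interleave the ranked chunks so the strongest sit at the two ends of the context. The function below is a sketch of that idea under the assumption that the input list is ordered best-first; it is one possible ordering, not the only one.

```python
def order_for_context(chunks_best_first: list[str]) -> list[str]:
    """Place the strongest chunks at the two ends of the context.

    Input is ranked best-first. Items at even indexes (ranks 1, 3, 5, ...)
    fill the front in order; items at odd indexes (ranks 2, 4, ...) fill the
    back in reverse, so the weakest chunks end up in the middle.
    """
    front = chunks_best_first[0::2]
    back = chunks_best_first[1::2][::-1]
    return front + back
```

For five chunks ranked `c1` to `c5`, this yields `c1, c3, c5, c4, c2`: the best chunk opens the context, the second best closes it, and the weakest lands in the middle.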