System Design: Agentic RAG with Hybrid Data (Vector + Graph)

Domain: Life Sciences / Enterprise Knowledge Graph · Pattern: Agentic RAG + Graph Traversal + Multi-hop Reasoning


Interview Problem Statement

"Design an AI system for a pharmaceutical company that enables researchers to answer complex multi-hop questions like: 'Which of our drug candidates interact with proteins implicated in both Alzheimer's and Type 2 Diabetes, and what clinical trials have investigated those targets?' The knowledge base includes 2M research papers, internal trial documents, and a curated molecular interaction database."


Why Simple RAG Fails Here

| Limitation | Why It Breaks This Use Case |
|---|---|
| Single-hop retrieval | Question requires: Drug → Protein → Disease (×2) → Trial - 4 hops across entity types |
| No entity relationships | Vector search returns similar text, not connected facts |
| No structured reasoning | "Both Alzheimer's AND Type 2 Diabetes" requires set intersection, not similarity |
| Context window bottleneck | 2M papers → top-5 chunks rarely co-contain all hops |
| No iterative refinement | Answer to hop 1 should inform retrieval for hop 2 |

The solution: Hybrid Data - pair a Vector Index (semantic document retrieval) with a Knowledge Graph (structured entity relationships). An agent orchestrates multi-hop traversal across both stores.


Clarifying Questions

| Question | Why It Matters |
|---|---|
| What entity types exist in the domain? (drugs, proteins, genes, diseases, trials?) | Defines graph schema - nodes and edge types |
| Is the molecular database structured (SQL/graph) or unstructured (papers only)? | Structured → import directly to graph; unstructured → NER extraction pipeline |
| How often is data updated? (daily trial updates? weekly paper ingestion?) | Incremental graph update strategy vs. full rebuild |
| What is the maximum acceptable latency? (research tool: 10s OK; clinical decision: 3s max) | Determines whether multi-hop can be synchronous or needs streaming |
| Does the agent need to take actions (flag a trial, create a report) or only answer? | Read-only vs. read-write agent design |
| Are there compliance requirements? (FDA, HIPAA - PII in trial documents?) | Drives data masking and audit logging depth |

System Architecture Overview

┌────────────────────────────────────────────────────────────────────────────┐
│                         INGESTION PIPELINE                                  │
│                                                                              │
│  Sources:                                                                    │
│  ├── Cloud Storage: research papers (PDF), internal trial docs (PDF/Word)   │
│  └── Cloud SQL: molecular interaction DB (structured rows)                  │
│                                                                              │
│                    ┌──────────────────────────────────────┐                 │
│  PDF/Word ─────►  │  Cloud Run: Document Parser           │                 │
│                    │  ├── Document AI (OCR, layout)        │                 │
│                    │  └── Named Entity Recognizer          │                 │
│                    │      (Cloud NL API + custom NER)      │                 │
│                    │  Output: text chunks + entities JSON  │                 │
│                    └──────────────┬───────────────────────┘                 │
│                                   │                                          │
│                    ┌──────────────▼────────────────────────┐                │
│  Structured DB ──► │  Cloud Run: Graph Builder             │                │
│                    │  ├── Entity deduplication (fuzzy match)│                │
│                    │  ├── Relation extraction (LLM-assisted)│                │
│                    │  └── Upsert to Spanner Graph           │                │
│                    └──────────────┬────────────────────────┘                │
│                                   │                                          │
│             ┌─────────────────────▼───────────────────────┐                 │
│             │  Spanner Graph (knowledge graph)              │                │
│             │  Nodes: Drug, Protein, Gene, Disease, Trial   │                │
│             │  Edges: TARGETS, ASSOCIATED_WITH, TESTED_IN   │                │
│             └───────────────────────────────────────────────┘                │
│                                                                              │
│             ┌───────────────────────────────────────────────┐                │
│             │  Cloud Run: Chunker + Embedder                 │                │
│             │  → Vertex AI Vector Search (dense index)       │                │
│             └───────────────────────────────────────────────┘                │
└────────────────────────────────────────────────────────────────────────────┘

┌────────────────────────────────────────────────────────────────────────────┐
│                         AGENTIC QUERY PIPELINE                              │
│                                                                              │
│  Researcher Query                                                            │
│       │                                                                      │
│       ▼                                                                      │
│  ┌───────────────────────────────────────────────────┐                      │
│  │  Cloud Run: Orchestrator Agent (ADK / LangGraph)  │                      │
│  │                                                    │                      │
│  │  Step 1: Query Decomposition                       │                      │
│  │  "Which drug candidates interact with proteins     │                      │
│  │   implicated in both AD and T2D with trials?"      │                      │
│  │   → Sub-queries:                                   │                      │
│  │     a) proteins linked to Alzheimer's              │                      │
│  │     b) proteins linked to Type 2 Diabetes          │                      │
│  │     c) intersection of (a) ∩ (b)                   │                      │
│  │     d) drug candidates targeting those proteins    │                      │
│  │     e) clinical trials for those drugs             │                      │
│  └───────────────────┬───────────────────────────────┘                      │
│                       │  calls tools                                         │
│         ┌─────────────┼─────────────────────┐                               │
│         ▼             ▼                     ▼                               │
│  ┌─────────────┐ ┌──────────────┐ ┌─────────────────┐                      │
│  │ graph_query │ │ vector_search│ │ synthesis_tool  │                      │
│  │             │ │              │ │                 │                      │
│  │ Spanner     │ │ Vertex AI    │ │ Gemini 1.5 Pro  │                      │
│  │ Graph GQL   │ │ Vector Search│ │ (answer builder)│                      │
│  │ traversal   │ │ + AlloyDB    │ │                 │                      │
│  └─────────────┘ └──────────────┘ └─────────────────┘                      │
│         │                │                 │                                │
│         └────────────────┴─────────────────┘                               │
│                          │                                                   │
│                          ▼                                                   │
│  ┌───────────────────────────────────────────────────┐                      │
│  │  ReAct Loop (Reasoning + Acting)                  │                      │
│  │  Thought → Action → Observation → Thought → ...  │                      │
│  │  Max 8 iterations, timeout 15s                    │                      │
│  └───────────────────────────────────────────────────┘                      │
│                          │                                                   │
│                          ▼                                                   │
│  Final Answer: entities + citations + graph path + source papers             │
└────────────────────────────────────────────────────────────────────────────┘

Knowledge Graph Schema

Node Types

| Node Label | Key Properties | Example |
|---|---|---|
| Drug | name, candidate_id, mechanism, phase | Lecanemab, Phase 3 |
| Protein | name, uniprot_id, function, subcellular_location | APOE, tau, amyloid-beta |
| Gene | symbol, entrez_id, chromosome | APOE, TREM2, IDE |
| Disease | name, icd10, mesh_id, category | Alzheimer's (G30), T2D (E11) |
| ClinicalTrial | nct_id, phase, status, sponsor, start_date | NCT02714153 |
| Paper | doi, title, authors, year, journal | Nature 2024 APOE study |

Edge Types

| Edge | From → To | Properties |
|---|---|---|
| TARGETS | Drug → Protein | mechanism, affinity_nM, confidence |
| ENCODES | Gene → Protein | - |
| ASSOCIATED_WITH | Protein → Disease | evidence_type, score, pmid |
| IMPLICATED_IN | Gene → Disease | gwas_p_value, effect_size |
| TESTED_IN | Drug → ClinicalTrial | primary_endpoint, result |
| MENTIONS | Paper → Drug/Protein/Disease | context, section |
| INTERACTS_WITH | Protein → Protein | interaction_type, experimental_evidence |

Example Multi-hop Query (SQL over Node and Edge Tables)

-- Proteins implicated in BOTH Alzheimer's AND Type 2 Diabetes
SELECT p.name, p.uniprot_id,
       ad_edge.score AS ad_score,
       t2d_edge.score AS t2d_score
FROM Protein p
JOIN ProteinDiseaseAssociation ad_edge ON p.id = ad_edge.protein_id
JOIN Disease ad ON ad_edge.disease_id = ad.id AND ad.mesh_id = 'D000544'   -- Alzheimer's
JOIN ProteinDiseaseAssociation t2d_edge ON p.id = t2d_edge.protein_id
JOIN Disease t2d ON t2d_edge.disease_id = t2d.id AND t2d.icd10 = 'E11'    -- T2D
WHERE ad_edge.score > 0.7 AND t2d_edge.score > 0.7;

-- Then: Drug candidates targeting those proteins
SELECT d.name, d.phase, t.mechanism
FROM Drug d
JOIN DrugProteinTarget t ON d.id = t.drug_id
WHERE t.protein_id IN (/* above result */);
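
To see the two SQL statements above run end to end, here is a minimal sketch using Python's built-in sqlite3 module as a stand-in for the real store. The table layouts, the rows (protein_a, drug_x, and so on) and the scores are made-up assumptions for illustration; only the join shape and the 0.7 threshold come from the queries above.

```python
import sqlite3

# In-memory toy database; schema and rows are hypothetical illustration data.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE Protein (id INTEGER PRIMARY KEY, name TEXT);
CREATE TABLE Disease (id INTEGER PRIMARY KEY, mesh_id TEXT, icd10 TEXT);
CREATE TABLE ProteinDiseaseAssociation (protein_id INTEGER, disease_id INTEGER, score REAL);
CREATE TABLE Drug (id INTEGER PRIMARY KEY, name TEXT, phase INTEGER);
CREATE TABLE DrugProteinTarget (drug_id INTEGER, protein_id INTEGER, mechanism TEXT);

INSERT INTO Protein VALUES (1, 'protein_a'), (2, 'protein_b'), (3, 'protein_c');
INSERT INTO Disease VALUES (1, 'D000544', 'G30'), (2, 'D003924', 'E11');
INSERT INTO ProteinDiseaseAssociation VALUES
  (1, 1, 0.9), (1, 2, 0.8),    -- protein_a: strong link to both diseases
  (2, 1, 0.95),                -- protein_b: Alzheimer's only
  (3, 1, 0.85), (3, 2, 0.5);   -- protein_c: T2D score below threshold
INSERT INTO Drug VALUES (1, 'drug_x', 2), (2, 'drug_y', 3);
INSERT INTO DrugProteinTarget VALUES (1, 1, 'inhibitor'), (2, 2, 'antibody');
""")

# Step 1: proteins associated with BOTH diseases above the score threshold.
shared = conn.execute("""
SELECT p.id, p.name, ad_edge.score, t2d_edge.score
FROM Protein p
JOIN ProteinDiseaseAssociation ad_edge ON p.id = ad_edge.protein_id
JOIN Disease ad ON ad_edge.disease_id = ad.id AND ad.mesh_id = 'D000544'
JOIN ProteinDiseaseAssociation t2d_edge ON p.id = t2d_edge.protein_id
JOIN Disease t2d ON t2d_edge.disease_id = t2d.id AND t2d.icd10 = 'E11'
WHERE ad_edge.score > 0.7 AND t2d_edge.score > 0.7
""").fetchall()

# Step 2: drugs targeting those proteins (IDs passed as bound parameters).
protein_ids = [row[0] for row in shared]
placeholders = ",".join("?" for _ in protein_ids)
drugs = conn.execute(
    "SELECT d.name, d.phase, t.mechanism FROM Drug d "
    "JOIN DrugProteinTarget t ON d.id = t.drug_id "
    f"WHERE t.protein_id IN ({placeholders})",
    protein_ids,
).fetchall()

print(shared)
print(drugs)
```

The self-join on the association table is what implements the "both diseases" intersection: a protein survives only if it has one qualifying edge to each disease.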

Agent Tool Definitions

Tool 1: graph_query

@agent.tool
async def graph_query(
    query_type: str,       # "entity_lookup" | "traversal" | "intersection"
    entity_name: str,      # starting entity
    entity_type: str,      # "Drug" | "Protein" | "Disease" | "ClinicalTrial"
    hops: int = 2,         # max traversal depth
    filters: dict | None = None   # {"edge_type": "TARGETS", "min_confidence": 0.7}
) -> list[dict]:
    """
    Traverses the Spanner Graph knowledge graph.
    Returns a list of connected entities with relationship metadata.
    """

Why bounded hops? Unbounded graph traversal can return millions of nodes. Cap at 2–3 hops for response time; if more depth is needed, decompose into sub-queries.
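
The hop cap can be sketched as a breadth-first traversal that stops at a maximum depth and keeps only the highest-confidence edges per node. This is a minimal in-memory sketch, not the Spanner Graph implementation: TOY_GRAPH, its node names and scores, and the per_hop_limit parameter are assumptions made for illustration, and edges are stored as a plain adjacency list that is walked regardless of schema direction.

```python
from collections import deque

# Hypothetical adjacency list: node -> [(edge_type, neighbor, confidence)].
TOY_GRAPH = {
    "Disease:AD": [
        ("ASSOCIATED_WITH", "Protein:P1", 0.9),
        ("ASSOCIATED_WITH", "Protein:P2", 0.8),
        ("ASSOCIATED_WITH", "Protein:P3", 0.4),
    ],
    "Protein:P1": [("TARGETS", "Drug:D1", 0.95)],
    "Protein:P2": [("TARGETS", "Drug:D2", 0.6)],
    "Drug:D1": [("TESTED_IN", "Trial:T1", 1.0)],
}


def bounded_traversal(graph, start, max_hops=2, per_hop_limit=100, min_confidence=0.0):
    """Breadth-first traversal capped by depth and by edges kept per node."""
    visited = {start}
    results = []
    frontier = deque([(start, 0)])
    while frontier:
        node, depth = frontier.popleft()
        if depth >= max_hops:
            continue  # depth cap reached: do not expand this node further
        edges = [e for e in graph.get(node, []) if e[2] >= min_confidence]
        # Keep the strongest edges first so the limit drops the weakest ones.
        edges.sort(key=lambda e: e[2], reverse=True)
        for edge_type, neighbor, confidence in edges[:per_hop_limit]:
            results.append({
                "from": node, "edge": edge_type, "to": neighbor,
                "confidence": confidence, "hop": depth + 1,
            })
            if neighbor not in visited:
                visited.add(neighbor)
                frontier.append((neighbor, depth + 1))
    return results


rows = bounded_traversal(TOY_GRAPH, "Disease:AD", max_hops=2, per_hop_limit=2)
for row in rows:
    print(row)
```

With max_hops=2 the trial node is never reached, and with per_hop_limit=2 the weakest protein edge is dropped. Reaching the trial would take a second, chained call starting from the drug.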

Tool 2: vector_search

@agent.tool
async def vector_search(
    query: str,
    top_k: int = 5,
    filters: dict | None = None,    # {"year": {"gte": 2020}, "source": "internal_trials"}
    include_snippets: bool = True
) -> list[dict]:
    """
    Semantic search over research papers and trial documents.
    Returns top_k chunks with source metadata and relevance scores.
    """

Tool 3: entity_intersection

@agent.tool
async def entity_intersection(
    entity_lists: list[list[str]],   # list of entity ID lists to intersect
    entity_type: str
) -> list[str]:
    """
    Computes set intersection of entity IDs returned by prior graph queries.
    Used for "implicated in BOTH disease A AND disease B" queries.
    """

Tool 4: synthesis

@agent.tool
async def synthesis(
    question: str,
    graph_facts: list[dict],
    document_chunks: list[dict]
) -> str:
    """
    Calls Gemini 1.5 Pro to synthesize a grounded answer
    from structured graph facts and unstructured document evidence.
    """

ReAct Loop Trace (Example)

Query: "Which drug candidates target proteins implicated in both Alzheimer's and Type 2 Diabetes?"

Thought 1: I need to find proteins linked to Alzheimer's Disease.
Action 1: graph_query(entity_name="Alzheimer's Disease", entity_type="Disease", query_type="traversal", hops=1, filters={edge_type: "ASSOCIATED_WITH", min_score: 0.7})
Observation 1: [APOE, tau, amyloid-beta, TREM2, CLU] with association scores

Thought 2: Now I need proteins linked to Type 2 Diabetes with same criteria.
Action 2: graph_query(entity_name="Type 2 Diabetes", entity_type="Disease", query_type="traversal", hops=1, filters={edge_type: "ASSOCIATED_WITH", min_score: 0.7})
Observation 2: [IDE, APOE, INSR, GLP1R, IRS1] with association scores

Thought 3: Compute the intersection - proteins in both lists.
Action 3: entity_intersection([[APOE, tau, amyloid-beta, TREM2, CLU], [IDE, APOE, INSR, GLP1R, IRS1]], entity_type="Protein")
Observation 3: [APOE]

Thought 4: Find drug candidates targeting APOE.
Action 4: graph_query(entity_name="APOE", entity_type="Protein", query_type="traversal", hops=1, filters={edge_type: "TARGETS", phase: {gte: 2}})
Observation 4: [Lecanemab (Phase 3), ABBV-916 (Phase 2), PTC-027 (Phase 2)]

Thought 5: Find clinical trials for these drug candidates.
Action 5: graph_query(entity_name="Lecanemab", entity_type="Drug", query_type="traversal", hops=1, filters={edge_type: "TESTED_IN"})
Observation 5: [NCT03887455 (Phase 3, completed), NCT04468659 (Phase 3, active)]

Thought 6: Get supporting evidence from research papers.
Action 6: vector_search(query="APOE Alzheimer's Type 2 Diabetes drug target clinical evidence", top_k=3, filters={year: {gte: 2020}})
Observation 6: [Paper 1: "APOE4 metabolic dysfunction links AD and T2D...", Paper 2: "...]

Thought 7: I have enough to synthesize a grounded answer.
Action 7: synthesis(question=..., graph_facts=[...], document_chunks=[...])
Observation 7: Final answer with citations
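
A loop like the trace above needs a hard stop. The sketch below shows one way to enforce the 8-iteration and 15-second limits with asyncio. The plan_next_step callback (standing in for the LLM's "Thought" step) and the tools dictionary are hypothetical interfaces, not the ADK or LangGraph API.

```python
import asyncio
import time

MAX_ITERATIONS = 8
TIMEOUT_SECONDS = 15.0


async def run_react_loop(plan_next_step, tools, question,
                         max_iterations=MAX_ITERATIONS, timeout_s=TIMEOUT_SECONDS):
    """Run Thought -> Action -> Observation until done, capped by count and wall time.

    plan_next_step(question, observations) returns (tool_name, kwargs),
    or None when the agent decides it has enough to answer.
    """
    observations = []
    deadline = time.monotonic() + timeout_s
    for _ in range(max_iterations):
        remaining = deadline - time.monotonic()
        if remaining <= 0:
            return {"status": "partial", "reason": "timeout", "observations": observations}
        step = plan_next_step(question, observations)
        if step is None:
            return {"status": "complete", "observations": observations}
        tool_name, kwargs = step
        try:
            # Each tool call may only use the time left in the overall budget.
            result = await asyncio.wait_for(tools[tool_name](**kwargs), timeout=remaining)
        except asyncio.TimeoutError:
            return {"status": "partial", "reason": "timeout", "observations": observations}
        observations.append((tool_name, result))
    return {"status": "partial", "reason": "iteration_cap", "observations": observations}


async def echo_tool(x):
    return x


def never_done(question, observations):
    # A planner that never finishes, to exercise the iteration cap.
    return ("echo", {"x": len(observations)})


outcome = asyncio.run(run_react_loop(never_done, {"echo": echo_tool}, "demo question"))
print(outcome["status"], outcome["reason"], len(outcome["observations"]))
```

Returning the observations gathered so far is what lets the caller build a partial answer instead of failing outright.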

GCP Services Map

| Component | GCP Service | Why |
|---|---|---|
| Raw document storage | Cloud Storage | Source of truth; Eventarc triggers ingestion |
| Structured DB import | Cloud SQL → Dataflow | Bulk import molecular DB into graph format |
| Document parsing | Document AI (Form Parser + OCR) | Layout-aware parsing of research PDFs |
| Named Entity Recognition | Cloud Natural Language API + custom AutoML NER | Extract Drug/Protein/Disease entities |
| Relation extraction | Vertex AI Gemini 1.5 Flash (batch) | "Paper says Drug X targets Protein Y" → edge |
| Knowledge graph store | Spanner Graph | Managed, ACID, GQL support, multi-region |
| Entity deduplication | Dataflow + fuzzy matching (Jaro-Winkler) | "Aβ" and "amyloid-beta" are the same node |
| Graph update queue | Pub/Sub + Dataflow | Streaming graph updates from new papers |
| Vector index | Vertex AI Vector Search | ANN over 2M paper embeddings |
| Sparse index | AlloyDB pgvector + tsvector | Full-text (BM25-style) ranking for exact term matching |
| Embedding model | Vertex AI text-embedding-005 | Unified embedding space for papers |
| Agent orchestration | Vertex AI Agent Engine (ADK) | Managed agent runtime, session state |
| LLM (synthesis) | Vertex AI Gemini 1.5 Pro | Long context (up to 2M tokens), tool use, grounding |
| LLM (relation extraction) | Vertex AI Gemini 1.5 Flash (batch) | Cost-efficient batch NER/relation jobs |
| Semantic cache | Memorystore for Redis | Cache frequent research queries |
| Session state | Firestore | Per-researcher agent session context |
| Auth | Identity-Aware Proxy + VPC Service Controls | Internal research tool, strict perimeter |
| Audit logging | Cloud Logging + BigQuery | Compliance audit trail per query |
| Monitoring | Cloud Monitoring + Cloud Trace | Latency per agent step, error rates |
| Offline eval | Vertex AI Experiments + BigQuery | RAGAS scores, graph recall metrics |
| CI/CD | Cloud Build + Artifact Registry | Graph pipeline + agent deployment |

Scalability Considerations

Graph Scalability

| Scale | Spanner Config | Query Latency | Strategy |
|---|---|---|---|
| <10M nodes | Single region, 3 nodes | <100ms | Default config |
| 10M–100M nodes | Multi-region, 10 nodes | 100–500ms | Partition by entity type |
| 100M+ nodes | Multi-region, 30+ nodes | 500ms–2s | Sub-graph caching + materialized views |

Key insight: Graph traversal latency grows with degree (number of edges per node), not graph size. High-degree nodes (e.g., APOE protein connected to thousands of papers) need degree-bounded queries and result truncation. Always specify LIMIT in graph queries.

Agent Scalability

Problem: ReAct loops are inherently sequential - each action depends on the prior observation.

Mitigations:

  1. Parallel tool calls within a step - when sub-queries are independent (e.g., "proteins in AD" and "proteins in T2D"), execute both graph queries concurrently. ADK supports parallel tool dispatch.
  2. Materialized sub-graph caching - pre-compute and cache common sub-graphs (e.g., "all proteins implicated in Alzheimer's") in Redis. Graph traversal for common starting nodes takes 1ms (cache hit) vs. 300ms (live query).
  3. Iteration cap - hard limit of 8 ReAct iterations with timeout of 15s. If exceeded, return partial answer with "further research needed" flag.
  4. Query complexity classifier - simple single-hop queries (entity lookup) bypass the agent loop entirely and go directly to vector search + graph_query.
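
Mitigation 1 can be sketched with asyncio.gather: the two disease lookups do not depend on each other, so they run concurrently and the intersection is computed once both return. fake_graph_query is a hypothetical stand-in for the real graph_query tool. Its protein lists are copied from the example trace and its delay is arbitrary.

```python
import asyncio

# Protein lists copied from the example ReAct trace; a real tool would query the graph.
TRACE_RESULTS = {
    "Alzheimer's Disease": ["APOE", "tau", "amyloid-beta", "TREM2", "CLU"],
    "Type 2 Diabetes": ["IDE", "APOE", "INSR", "GLP1R", "IRS1"],
}


async def fake_graph_query(disease: str, delay_s: float = 0.05) -> list[str]:
    """Hypothetical stand-in for graph_query; sleeps to simulate network latency."""
    await asyncio.sleep(delay_s)
    return TRACE_RESULTS[disease]


async def proteins_in_both(disease_a: str, disease_b: str) -> list[str]:
    # Both lookups are independent, so dispatch them together and wait for both.
    list_a, list_b = await asyncio.gather(
        fake_graph_query(disease_a),
        fake_graph_query(disease_b),
    )
    return sorted(set(list_a) & set(list_b))


shared_proteins = asyncio.run(proteins_in_both("Alzheimer's Disease", "Type 2 Diabetes"))
print(shared_proteins)
```

Run this way, the two hops cost roughly one round trip instead of two. The later steps (drugs, then trials) stay sequential because each needs the previous result.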

Ingestion Scalability: 2M Papers

| Stage | Throughput | GCP Config |
|---|---|---|
| PDF parsing | 500 docs/min | Cloud Run 50 instances × 10 RPS |
| Entity extraction (NLP API) | 300 docs/min | Natural Language API: 300 RPS quota |
| Relation extraction (Gemini Flash batch) | 1000 docs/min | Batch Prediction API, async |
| Graph upsert (Spanner) | 10k mutations/sec | Spanner 10-node cluster |
| Vector embed + index | 20k chunks/min | Cloud Tasks + Vector Search online updates |

Full initial load of 2M papers: ~3–4 days. Incremental updates (100 papers/day): <30 minutes.
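
As a rough sanity check on the load estimate, the per-stage rates from the table can be turned into days. This sketch assumes the stages run as a pipeline, so the slowest per-document stage sets the pace. The Spanner and embedding rows are left out because their rates are per mutation and per chunk, and the table does not say how many of each a paper produces.

```python
TOTAL_DOCS = 2_000_000

# Per-document throughput figures copied from the table above (docs per minute).
DOCS_PER_MIN = {
    "pdf_parsing": 500,
    "entity_extraction": 300,
    "relation_extraction": 1000,
}


def days_for(docs: int, docs_per_min: float) -> float:
    """Convert a document count and a docs/min rate into days of continuous processing."""
    return docs / docs_per_min / 60 / 24


stage_days = {stage: days_for(TOTAL_DOCS, rate) for stage, rate in DOCS_PER_MIN.items()}
bottleneck = max(stage_days, key=stage_days.get)

for stage, days in stage_days.items():
    print(f"{stage}: {days:.1f} days")
print("bottleneck:", bottleneck)
```

At these rates PDF parsing needs about 2.8 days and entity extraction about 4.6 days, so the 3–4 day figure presumably assumes entity extraction sustains somewhat more than 300 docs/min.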

Cost Optimization at Scale

| Lever | Saving |
|---|---|
| Gemini Flash for NER/relation extraction (batch) | 5–10× cheaper than Pro |
| Materialized sub-graph cache for top-100 disease entities | 60% reduction in Spanner queries |
| Semantic cache for agent responses | 20–30% LLM cost reduction |
| Committed use discounts on Spanner + Cloud Run | 25–57% savings |
| Offline batch embedding vs. real-time | 40% cheaper with batch API |

Failure Modes and Mitigations

| Failure | Symptom | Mitigation |
|---|---|---|
| Graph entity not found | Agent gets empty traversal result | Fall back to vector search; flag "not in graph yet" |
| Entity disambiguation failure | "insulin" maps to wrong protein | Canonical entity disambiguation using UniProt/MeSH IDs |
| Agent infinite loop | ReAct exceeds iteration cap | Hard 8-iteration + 15s timeout; return partial answer |
| Stale graph data | Drug candidate phase changed | Graph edges have last_updated timestamp; warn if >30 days |
| High-degree node explosion | Query returns 50k edges | Degree-bounded traversal (LIMIT 100 per hop) |
| Hallucinated graph paths | Agent invents relationships | Synthesis tool only uses graph_facts actually returned - no interpolation |
| NER extraction errors | Wrong entity extracted | Human-in-the-loop review queue for low-confidence extractions; threshold 0.85 |

Simple RAG vs. Agentic RAG Hybrid - Decision Matrix

| Criterion | Simple RAG | Agentic RAG Hybrid |
|---|---|---|
| Query type | Single-hop factual | Multi-hop relational |
| Data structure | Unstructured text only | Text + structured relationships |
| Latency requirement | <3s | 5–15s acceptable |
| Entity relationships critical? | No | Yes |
| Set operations needed? | No | Yes (intersection, union) |
| Agent complexity | Low | High (ReAct loop) |
| Operational cost | Low | High (graph infra + agent) |
| Use Simple RAG when... | Q&A, summarization, policy lookup | - |
| Use Hybrid when... | - | Drug discovery, compliance tracing, financial fraud chains |

Q&A Review Bank

Q1: Why can't a standard RAG pipeline answer multi-hop questions like "proteins implicated in both Disease A and Disease B"? What specifically breaks? [Medium]

A: Three structural failures: (1) No set operations - vector similarity returns the most similar chunks, not the intersection of entity sets. A query for "both Alzheimer's AND T2D" may retrieve chunks about each disease separately but has no mechanism to compute the protein overlap. (2) No entity relationship traversal - the relationship Drug TARGETS Protein is implicit in text but not navigable. A chunk that says "Drug X binds Protein Y" and a chunk that says "Protein Y is associated with T2D" sit in different embedding neighborhoods - the system has no way to connect them without explicitly traversing an edge. (3) Context window bottleneck - even if you retrieved all relevant chunks (hundreds), the LLM would struggle to compute a precise intersection from unstructured prose across a 100k-token context window. Knowledge graphs make these relationships explicit and queryable.


Q2: What is the risk of an unbounded graph traversal in the agent's graph_query tool, and how do you design against it? [Hard]

A: Unbounded traversal from a high-degree node (e.g., "APOE", which may connect to 50,000 papers and 10,000 proteins in a large biomedical graph) can return millions of nodes, consuming all available memory and timing out the agent within seconds. Three defenses: (1) Hard LIMIT per hop - cap results at 100–500 edges per traversal level; the agent sees the most relevant (by confidence score) rather than all. (2) Hop depth limit - cap at 2–3 hops; queries requiring 4+ hops should be decomposed by the agent into chained 2-hop queries. (3) Degree pruning - edges are sorted by confidence or evidence_count before applying the limit, ensuring high-quality edges are retained. The agent must be designed to recognize "too many results" as a signal to add more filters, not to use all results.


Q3: Describe the role of Spanner Graph specifically - why not use Neo4j, or just use AlloyDB with a graph extension? [Hard]

A: Spanner Graph is the right choice for this use case for three reasons: (1) ACID + global scale - Spanner provides externally consistent transactions across regions. For a pharmaceutical knowledge graph where an incorrect drug-protein edge could affect drug safety decisions, transactional consistency matters - you cannot have partial edge inserts. Neo4j Community provides ACID locally but not across multi-region deployments without enterprise licensing. (2) GQL (Graph Query Language) ISO standard - Spanner Graph supports the ISO/IEC GQL standard, making queries portable and the team's skills transferable. (3) GCP-native integration - Spanner integrates natively with Dataflow (ingestion), IAM (access control), VPC Service Controls (compliance perimeter), and Cloud Monitoring - all required for enterprise pharma. AlloyDB with Apache AGE provides graph extensions but is better suited to small graphs (<10M nodes) where SQL-first modeling is more natural; at 100M+ node scale, graph-optimized query execution (as in Spanner Graph) significantly outperforms that approach.


Q4: The agent is hallucinating graph paths - claiming Drug X targets Protein Y when that edge doesn't exist. How do you prevent this? [Hard]

A: This happens when the synthesis step extrapolates from implicit context rather than explicit graph facts. Three-layer fix: (1) Structural grounding constraint - the synthesis tool prompt explicitly prohibits generating relationship claims not present in the graph_facts parameter: "Only state relationships that appear verbatim in the provided graph facts. Do not infer or extrapolate." (2) Citation enforcement - every relationship claim in the answer must be tagged with its source: either a graph edge ID (e.g., [Graph: Drug-Protein edge #4521, confidence=0.92]) or a document citation. Any uncited relational claim is flagged as unverified. (3) Post-generation fact check - after generation, extract all relational claims from the answer and verify each against the knowledge graph with a graph_query(query_type="entity_lookup"). Any claim not found in the graph is either removed or flagged with "not confirmed in knowledge base." This adds ~200ms but eliminates hallucinated relationships.
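
The post-generation fact check in step (3) can be sketched as a lookup of each claimed relationship against the facts the graph actually returned. This assumes claims have already been extracted from the answer as (subject, edge_type, object) triples. That extraction step, the verify_claims name and the Drug X / Protein Y placeholders are all hypothetical.

```python
Triple = tuple[str, str, str]  # (subject, edge_type, object)


def verify_claims(claims: list[Triple], graph_facts: list[Triple]):
    """Split claims into those backed by a returned graph fact and those that are not."""
    # Entity names are compared case-insensitively; edge types must match exactly.
    known = {(s.lower(), edge, o.lower()) for s, edge, o in graph_facts}
    verified, unverified = [], []
    for subject, edge, obj in claims:
        bucket = verified if (subject.lower(), edge, obj.lower()) in known else unverified
        bucket.append((subject, edge, obj))
    return verified, unverified


graph_facts = [
    ("Drug X", "TARGETS", "Protein Y"),
    ("Protein Y", "ASSOCIATED_WITH", "Disease A"),
]
claims = [
    ("drug x", "TARGETS", "Protein Y"),   # backed by a graph fact
    ("Drug X", "TARGETS", "Protein Z"),   # not in the graph: should be flagged
]

verified, unverified = verify_claims(claims, graph_facts)
print("verified:", verified)
print("unverified:", unverified)
```

Anything that lands in the unverified list would be removed from the answer or marked "not confirmed in knowledge base", as described above.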


Q5: How do you handle the cold-start problem when the knowledge graph is first populated from 2M unstructured research papers? [Hard]

A: Cold-start is a 3–4 day batch pipeline, not an online process. (1) Entity extraction at scale - use Vertex AI Gemini Flash batch prediction to run NER across all 2M papers in parallel. Each paper produces a JSON of extracted entities: {drugs: [], proteins: [], diseases: [], relations: []}. At Flash pricing, 2M papers cost ~$200–400. (2) Deduplication - entities must be canonicalized before graph insertion. "amyloid-beta", "Aβ", "A-beta" must map to the same protein node (the amyloid-beta precursor protein entry, UniProt P05067). Use fuzzy string matching (Jaro-Winkler > 0.92) + lookup against MeSH, UniProt, and ChEMBL canonical name dictionaries. Run in Dataflow. (3) Confidence scoring - each extracted edge gets a confidence score based on: extraction model confidence + number of papers supporting the claim + whether the paper is a review article (higher weight) vs. a single study. Only edges with confidence > 0.7 are inserted at cold-start; lower-confidence edges are stored in a review queue. (4) Incremental updates - after cold-start, new papers trigger streaming Pub/Sub messages → Dataflow → NER → graph upsert, keeping the graph current within hours of new publications.
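
The deduplication step can be sketched as a synonym lookup followed by a fuzzy fallback. Note the swap: Python's standard library has no Jaro-Winkler, so this sketch uses difflib.SequenceMatcher instead. Its ratio is a different metric, so the 0.92 threshold is reused only as a placeholder. The CANONICAL and SYNONYMS dictionaries are tiny hypothetical stand-ins for the MeSH/UniProt/ChEMBL lookups.

```python
from difflib import SequenceMatcher

# Hypothetical canonical names -> node IDs; a real table would come from curated dictionaries.
CANONICAL = {
    "amyloid-beta": "protein:amyloid-beta",
    "apolipoprotein e": "protein:apoe",
}
# Hypothetical known aliases, normalized to lowercase.
SYNONYMS = {
    "aβ": "amyloid-beta",
    "a-beta": "amyloid-beta",
    "apoe": "apolipoprotein e",
}


def canonicalize(mention: str, threshold: float = 0.92):
    """Map a raw mention to a canonical node ID, or None if nothing is close enough."""
    key = mention.strip().lower()
    key = SYNONYMS.get(key, key)          # 1) exact alias lookup
    if key in CANONICAL:
        return CANONICAL[key]
    best_name, best_score = None, 0.0     # 2) fuzzy fallback over canonical names
    for name in CANONICAL:
        score = SequenceMatcher(None, key, name).ratio()
        if score > best_score:
            best_name, best_score = name, score
    return CANONICAL[best_name] if best_score >= threshold else None


for mention in ["Aβ", "A-beta", "amyloidbeta", "insulin"]:
    print(mention, "->", canonicalize(mention))
```

Mentions that return None are the ones that would go to the review queue rather than creating a new, possibly duplicate node.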


Q6: A researcher reports that the agent gives inconsistent answers to the same question on different days. What are the likely causes and how do you build reproducibility? [Hard]

A: Three sources of non-determinism: (1) Graph data changes - new papers added edges between the two runs of the query. The answer was correct both times given the graph state at that moment. Fix: log the graph state snapshot (edge version IDs) used for each query in BigQuery. Reproducibility means replaying a query against the same graph version. (2) LLM temperature > 0 - the synthesis step uses sampling, producing different phrasings or emphasis on different runs. Fix: set temperature=0 for synthesis in research/compliance contexts; accept that some variation in phrasing is normal, but ensure core entity claims are deterministic from graph facts. (3) Semantic cache invalidation race - a cached response from Day 1 is returned on Day 2 even though the graph was updated in between; then the cache entry expires and a live query returns a different (more current) answer. Fix: cache keys include graph_version_hash so cache entries are invalidated when the underlying graph changes. Include the query timestamp and graph version in the response so researchers know which data generation answered their question.
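
The cache-key fix in (3) can be sketched as follows. It is a simplification: it builds an exact-match key from normalized query text, whereas a semantic cache would match on embeddings. The point it illustrates is only that graph_version_hash is part of the key. The function name and the sample version strings are hypothetical.

```python
import hashlib


def cache_key(query: str, graph_version_hash: str) -> str:
    """Build a cache key that changes whenever the graph version changes."""
    # Collapse whitespace and case so trivially different phrasings share a key.
    normalized = " ".join(query.lower().split())
    payload = f"{graph_version_hash}:{normalized}".encode("utf-8")
    return hashlib.sha256(payload).hexdigest()


day1 = cache_key("Which drugs target APOE?", graph_version_hash="v-2024-06-01")
day1_again = cache_key("  which drugs   target APOE? ", graph_version_hash="v-2024-06-01")
day2 = cache_key("Which drugs target APOE?", graph_version_hash="v-2024-06-02")

print(day1 == day1_again)  # same graph version, same normalized query
print(day1 == day2)        # graph changed, so the old entry is never hit
```

Because the old entries simply stop matching, no explicit invalidation pass is needed when the graph is updated. They age out through the cache's normal expiry.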


Q7: Design the observability stack for this system. What metrics do you track at each layer? [Medium]

A: Four layers: (1) Infrastructure (Cloud Monitoring): Cloud Run CPU/memory per service, Spanner read/write latency P50/P99, Vector Search query latency, Redis cache hit rate. Alert on: Spanner P99 > 500ms, cache hit rate < 10%, Cloud Run error rate > 1%. (2) Agent (Cloud Trace + custom spans): total ReAct loop duration, iterations per query, tool call latency per tool type (graph_query vs. vector_search), timeout rate. Track mean_iterations_per_query - if it rises, queries are getting more complex or the agent is getting confused. (3) Retrieval quality (BigQuery + offline eval): graph traversal recall (did the graph path exist?), vector search MRR@5, entity extraction precision/recall on a labeled test set. Run RAGAS weekly on a sample of queries. (4) Answer quality (human feedback + LLM-as-judge): thumbs up/down from researchers, LLM-as-judge faithfulness score (are answer claims grounded in graph facts?), hallucination rate on labeled golden set. Dashboard in Looker Studio.